Author: Sayed, Donia GamalEldin Nazim./ Title: Developing Intelligent Techniques for Cyber-Hate Speech Detection /

Title

Developing Intelligent Techniques for Cyber-Hate Speech Detection /

Author

Sayed, Donia GamalEldin Nazim.

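Committee

Researcher / Donia GamalEldin Nazim Sayed (دنيا جمال الدين نظيم سيد)

Supervisor / Mostafa Aref (مصطفى عارف)

Supervisor / Marco Alfonse Tawfik (ماركو الفونس توفيق)

Supervisor / Salud María Jiménez-Zafra (سالود ماريا خيمينيز-زافرا)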

Publication date

2023.

Number of pages

139 p. :

Language

English

Degree

Doctorate

Specialization

Computer Science (miscellaneous)

Date of approval

1/1/2023

Place of approval

Ain Shams University - Faculty of Computer and Information Sciences - Computer Science

Index

Only 14 pages are available for public view

Abstract

Sentiment Analysis, also known as opinion mining, is the area of Natural Language Processing (NLP) that aims to extract human perceptions, thoughts, and beliefs from unstructured textual content. It has become a useful, attractive, and challenging research area with the rise of social media and the massive volume of individuals’ reviews, comments, and feedback. One of the major problems evident in social media is the toxicity of online textual content. Cyber hate speech refers to the use of violent, offensive, or insulting expressions targeting individuals or minority groups on social networks, whose users come from diverse cultural backgrounds and beliefs and can conceal their identity under a cloud of anonymity. The growing prevalence of cyber hate speech is a significant concern, as it endangers the cohesion of civil societies. Because of users’ freedom and anonymity, as well as the lack of regulation on social media, cyber toxicity and bullying speech are major issues that call for automated detection and prevention. In particular, the Arab region has witnessed a rapid increase in the number of social media users, accompanied by a high rate of cyber hate speech. It is crucial to strive towards creating safe online spaces that promote positivity and inclusivity, free from any form of discrimination or hatred.
Sentiment classification and hate speech classification are two distinct but related fields of NLP. Sentiment classification involves identifying and extracting subjective information from text data, such as opinions, emotions, and attitudes. In contrast, hate speech classification involves identifying and categorizing text data that contains abusive or derogatory language towards a particular group of people based on their race, gender, religion, or other characteristics. By analyzing the sentiment of a text, one can identify whether it expresses positive or negative emotions. This information can be used to filter out data that is unlikely to contain hate speech. For instance, when classifying tweets that contain hate speech towards a particular group of people, sentiment analysis can be applied first to filter out tweets that do not contain any negative emotions and classify them as non-offensive.
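The two-stage idea described above can be illustrated with a minimal sketch. The word lists and the functions `toy_sentiment`, `toy_hate_classifier`, and `two_stage_label` are hypothetical stand-ins introduced here for illustration only; they are not the classifiers used in the thesis, which relies on trained models.

```python
from typing import Callable, Set

# Hypothetical toy word lists; a real system would use trained classifiers.
NEGATIVE_WORDS: Set[str] = {"hate", "stupid", "disgusting", "awful"}
TARGET_MARKERS: Set[str] = {"outsiders"}  # placeholder for group-targeting terms


def tokenize(text: str) -> Set[str]:
    """Lowercase the text and strip basic punctuation from each token."""
    return {tok.strip(".,!?").lower() for tok in text.split()}


def toy_sentiment(text: str) -> str:
    """Stage 1 stand-in: 'negative' if any negative word occurs."""
    return "negative" if tokenize(text) & NEGATIVE_WORDS else "non-negative"


def toy_hate_classifier(text: str) -> str:
    """Stage 2 stand-in: 'hate' if the text mentions a targeted group."""
    return "hate" if tokenize(text) & TARGET_MARKERS else "non-offensive"


def two_stage_label(
    text: str,
    sentiment_fn: Callable[[str], str] = toy_sentiment,
    hate_fn: Callable[[str], str] = toy_hate_classifier,
) -> str:
    """Label text without negative sentiment as non-offensive and pass
    only the remaining text to the hate speech classifier."""
    if sentiment_fn(text) != "negative":
        return "non-offensive"
    return hate_fn(text)
```

Passing the two stages as function arguments lets any sentiment model and any hate speech model be plugged in without changing the filtering logic.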
This thesis concentrates on reviews and feedback from customers, which are relevant content types representing opinions. This work aims to identify each sentence’s semantic orientation (e.g., positive or negative). Traditional methods for classifying sentiments often involve significant human effort, e.g., the construction of lexicons and feature engineering. In recent years, Deep Learning has emerged as a powerful way of solving the problem of classifying sentiments. This thesis proposes a novel deep learning classification framework evaluated on different sizes of four binary, balanced Arabic datasets: the Arabic Twitter Dataset (a collected dataset), the Arabic Sentiment Twitter Dataset for LEVantine (ArSenTD-Lev), the Arabic Sentiment Analysis Dataset, and the Arabic 100k Reviews dataset. The framework uses different types of deep neural networks: a densely connected neural network (Basic Neural Network), a Convolutional Neural Network (CNN), a Long Short-Term Memory network (LSTM), which is a variant of Recurrent Neural Networks (RNN), a Bidirectional LSTM (Bi-LSTM), and a combined CNN + LSTM. Experiments on the different datasets show that the proposed system outperforms traditional methods on large datasets, while traditional methods are more suitable for smaller datasets.
Research in this area spans many languages and approaches, but a comprehensive study that investigates it from all aspects is lacking. This thesis therefore presents a comprehensive, multilingual, systematic review of cyber-hate sentiment analysis. The review states the definition, properties, and taxonomy of cyberbullying and how often each type occurs. It also presents the most recent popular cyberbullying benchmark datasets in different languages, showing their number of classes (binary or multiple) and discussing the applied algorithms and how they were evaluated. Finally, it outlines the challenges, solutions, and future directions.
For a long time, translation was a labor-intensive process that relied solely on human effort. While human translation remains the most reliable method of translating textual content, it is slower and more expensive when applied to each individual piece of information. Several Machine Translation (MT) approaches have recently emerged to facilitate the migration of content across languages, especially for low-resource languages such as Arabic. Given that Arabic is one of the world’s most widely spoken languages, Arabic machine translation has recently attracted considerable interest from the scientific community. Research on such low-resource languages has produced some significant accomplishments; however, Arabic machine translation systems still fall short of the quality obtained for other languages. This thesis therefore examines the origins and main development timeline of machine translation approaches, investigates the significant branches, and categorizes different research directions. Furthermore, it gives a comprehensive overview of the key research works in the field of Arabic Neural MT (ANMT) and discusses possible future research prospects in this discipline.
Arabic is a language with rich morphology and few resources, and it is therefore recognized as one of the most challenging languages for machine translation. Translation into Arabic has received significantly less attention than translation into European languages. As a consequence, the quality of Arabic machine translation requires further investigation. A translation model between Arabic and English is proposed, based on Neural Machine Translation (NMT). The proposed model employs a transformer that combines a multi-head attention mechanism with a feed-forward network. The proposed NMT model achieves an accuracy of 97.68%, a loss of 0.0778, and a near-perfect Bilingual Evaluation Understudy (BLEU) score of 99.95.
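As a rough illustration of the multi-head attention mechanism mentioned above, the following sketch computes scaled dot-product self-attention over several heads in plain Python. It is a simplification under stated assumptions: the learned query, key, value, and output projections and the feed-forward network are omitted, heads are formed by slicing the feature dimension, and the function names are hypothetical. It does not reproduce the thesis model.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(row: List[float]) -> List[float]:
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q: Matrix, k: Matrix, v: Matrix) -> Matrix:
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(k[0])
    out: Matrix = []
    for q_row in q:
        scores = [
            sum(a * b for a, b in zip(q_row, k_row)) / math.sqrt(d_k)
            for k_row in k
        ]
        weights = softmax(scores)
        out.append([
            sum(w * v_row[j] for w, v_row in zip(weights, v))
            for j in range(len(v[0]))
        ])
    return out


def multi_head_attention(x: Matrix, num_heads: int) -> Matrix:
    """Self-attention over `num_heads` slices of the feature dimension.

    Assumption: learned projections are omitted (treated as identity).
    """
    d_model = len(x[0])
    if d_model % num_heads != 0:
        raise ValueError("d_model must be divisible by num_heads")
    d_head = d_model // num_heads
    heads = []
    for h in range(num_heads):
        part = [row[h * d_head:(h + 1) * d_head] for row in x]
        heads.append(attention(part, part, part))
    # Concatenate the per-head outputs position by position.
    return [sum((head[i] for head in heads), []) for i in range(len(x))]
```

Each head attends over its own slice of the features, so different heads can weight the same positions differently before their outputs are concatenated.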
Additionally, this thesis aims to create an intelligent model for detecting offensive and hate speech on Arab social media. A model is proposed that paraphrases and translates English cyber hate data into Arabic to address the lack of Arabic resources in the cyber hate speech domain. Different deep learning models and a pre-trained model are then applied to classify these data. To this end, a Multi-Task Learning (MTL) model is developed using a pre-trained Arabic language model. The MTL model is trained on cross-corpora representing variations in offensive and hate contexts to learn global and dataset-specific contextual representations. For Arabic cyber hate speech detection, the MTL model achieves an accuracy of 99.24% using the transformer model, while it achieves 78.71% using CNN+LSTM.