Hate speech is a pervasive phenomenon in our societies, and in recent decades it has expanded rapidly because of social media platforms. The protection guaranteed by Internet anonymity and the lack of potential repercussions appear to be responsible for the abundance of aggressive and harmful behaviours online (Dynel 2012). In addition, the volume of content generated online and the psychological burden of manual moderation support the need for automatic detection of such offensive and hateful content (Kovács et al. 2021). Numerous computational approaches have been proposed within the NLP community to tackle the problem of hate speech, mainly focusing on misogyny and racism (e.g., Waseem 2016, Zeinert et al. 2021). Religious hate online remains an understudied area in the field, despite being an important and impactful societal issue. Very few studies concentrate explicitly on religion (e.g., Albadi et al. 2018, Vidgen and Yasseri 2020, Zannettou et al. 2020); in others, religion is only one of several targets and is not examined in further detail (e.g., Ishmam and Sharmin 2019, Mossie and Wang 2020). Moreover, the Italian language appears to be underrepresented. The present thesis focuses on religious groups as targets of online hate speech and concentrates on the three main monotheistic religions: Christianity, Islam, and Judaism. We collected Italian data from the Twitter platform and proposed an interoperable fine-grained labeling scheme with a main focus on religious hate speech detection. We then experimented on the annotated dataset to automatically detect abusive language, on the one hand, and religious hate, on the other, by implementing several supervised machine learning models: Logistic Regression, Multinomial Naïve Bayes, Decision Tree, Random Forest, K-Nearest Neighbors, Linear Support Vector Classifier, and Supervised Neural Network.
The findings show that logistic regression, the supervised neural network, and the linear support vector classifier are the algorithms providing the best performance. We complete the work with a linguistic analysis of the various forms of religious hate. The results show that Islam is the most verbally attacked religion, followed by Judaism, and finally Christianity.
Hate speech is a pervasive phenomenon in our societies and is expanding rapidly because of social media. The protection afforded by Internet anonymity and the absence of potential repercussions appear to be the main drivers of the abundance of aggressive and harmful behaviours in online communities (Dynel 2012). Moreover, the volume of content generated online and the psychological burden of manual moderation support the need for systems that automatically identify such offensive and hateful content (Kovács et al. 2021). Within the NLP community, numerous computational approaches have been proposed to tackle hate speech, though they focus mainly on misogyny and racism (e.g., Waseem 2016, Zeinert et al. 2021). Online religious hate remains an understudied area, despite being an important social problem. Very few studies have concentrated exclusively on religious hate (e.g., Albadi et al. 2018, Vidgen & Yasseri 2020, Zannettou et al. 2020); in others, religion is one of the targets of the proposed taxonomies or tasks but is not examined in much detail (e.g., Ishmam & Sharmin 2019, Mossie & Wang 2020). In addition, Italian is underrepresented compared with English, which is usually the main reference language for many tasks. This thesis focuses on religious groups as targets of online hate speech; in particular, data were collected targeting the three main monotheistic religions: Christianity, Islam, and Judaism. The data were collected from Twitter and annotated according to the fine-grained taxonomy proposed in this thesis. The annotated data were then used to implement our supervised machine learning models.
The algorithms used to implement the models are the following: logistic regression, multinomial naive Bayes, decision tree, random forest, k-nearest neighbors, linear support vector classifier, and supervised neural network. Logistic regression, the linear support vector classifier, and the supervised neural network are the algorithms that yielded the best performance. The thesis concludes with a linguistic analysis of the various forms of religious hate. The results show that Islam is the most heavily verbally attacked religion, followed by Judaism and finally Christianity.
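As a rough illustration of one of the supervised models named above, the following is a minimal sketch of a multinomial Naive Bayes text classifier written with the Python standard library only. It is not the thesis implementation: the whitespace tokenizer, the `MultinomialNB` class, the label names, and the toy training texts (placeholder tokens, not tweets from the annotated corpus) are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


class MultinomialNB:
    """Multinomial Naive Bayes over bag-of-words counts with additive smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # smoothing strength (1.0 = Laplace smoothing)

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)       # documents per class
        self.token_counts = defaultdict(Counter)  # token frequencies per class
        self.vocab = set()
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.token_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def _log_score(self, tokens, label):
        # log P(label) + sum over tokens of log P(token | label)
        total = sum(self.token_counts[label].values())
        denom = total + self.alpha * len(self.vocab)
        n_docs = sum(self.class_counts.values())
        score = math.log(self.class_counts[label] / n_docs)
        for tok in tokens:
            if tok in self.vocab:  # tokens unseen in training are ignored
                count = self.token_counts[label][tok]
                score += math.log((count + self.alpha) / denom)
        return score

    def predict(self, text):
        tokens = tokenize(text)
        return max(self.class_counts, key=lambda c: self._log_score(tokens, c))


# Hypothetical toy data with placeholder tokens, not drawn from the thesis corpus.
train_texts = [
    "insult_a group_x insult_b",
    "group_x insult_b insult_a",
    "group_x holiday greeting",
    "greeting group_x community event",
]
train_labels = ["abusive", "abusive", "not_abusive", "not_abusive"]

model = MultinomialNB().fit(train_texts, train_labels)
prediction = model.predict("insult_a group_x")
```

In this sketch the token `group_x` appears in both classes, so the decision rests on the surrounding tokens rather than on the mention of the group itself. A realistic experiment would add held-out evaluation and a proper feature extraction step, which are omitted here.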
Detecting online religious hate in Italian-language tweets: a proposed fine-grained annotation scheme and implementation of supervised machine learning systems
TESTA, BENEDETTA
2021/2022
Users may download and share the full-text documents available in UNITESI UNIPV in accordance with the Creative Commons CC BY NC ND license.
For more information and to check whether the file is available, write to: unitesi@unipv.it.
https://hdl.handle.net/20.500.14239/2221