Hate Speech Detection in Code-Mixed Datasets Using Pretrained Embeddings and Transformers
Sohail, Tooba; Aiman, Atiqa; Hashmi, Ehtesham; Imran, Ali Shariq; Muhammad Daudpota, Sher; Yildirim-Yayilgan, Sule
Original version
2024 International Conference on Frontiers of Information Technology (FIT) 10.1109/FIT63703.2024.10838452
Abstract
In this digital era, social media platforms are more accessible than ever. People are free to express their thoughts, emotions, and opinions, and this freedom is being exploited negatively. When exercised without responsibility, it damages people's dignity and harms their well-being. Left unsupervised, it can harm or destroy individuals or communities. Most research on this problem targets high-resource languages such as English. Roman Urdu, however, remains neglected because it has few or nearly no resources. With over 100 million Urdu speakers worldwide, a huge amount of Roman Urdu content is generated on various social media platforms. This study employs different ML models with language-specific embeddings on three datasets comprising 5,000, 170,991, and 30,000 instances, respectively. We also implemented transformer-based models, including BERT, XLNet, RoBERTa, and the multilingual transformer mBERT. XLNet outperformed the other models with an accuracy of 96%. In the concluding phase, we applied interpretability modeling with LIME. To the best of our knowledge, no other study has used interpretability for detecting hate speech in Roman Urdu.
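The interpretability step mentioned in the abstract can be illustrated with a small sketch. This is not the authors' pipeline and it is not LIME itself: LIME fits a local surrogate model on many perturbed copies of the input, whereas the sketch below uses plain leave-one-out occlusion, a simpler perturbation-based attribution. The classifier, its weights (`TOY_WEIGHTS`, `BIAS`), and the example tokens are hypothetical placeholders, not data or models from the paper.

```python
import math

# Hypothetical toy lexicon: placeholder tokens, not real Roman Urdu data
# and not weights learned by any model in the paper.
TOY_WEIGHTS = {"badword": 2.5, "insult": 1.5, "friend": -1.0}
BIAS = -1.0


def predict_proba(tokens):
    """Toy stand-in for a trained classifier: logistic score over token weights."""
    score = BIAS + sum(TOY_WEIGHTS.get(t, 0.0) for t in tokens)
    return 1.0 / (1.0 + math.exp(-score))


def occlusion_attributions(text):
    """Score each token by how much removing it changes the predicted probability.

    A positive value means the token pushed the prediction towards the
    'hate' class; a negative value means it pushed the prediction away.
    """
    tokens = text.lower().split()
    base = predict_proba(tokens)
    attributions = []
    for i, tok in enumerate(tokens):
        reduced = tokens[:i] + tokens[i + 1:]  # drop the i-th token only
        attributions.append((tok, base - predict_proba(reduced)))
    return base, attributions


if __name__ == "__main__":
    base, attrs = occlusion_attributions("you are a badword my friend")
    print(f"predicted probability: {base:.3f}")
    # List tokens from most to least influential.
    for tok, delta in sorted(attrs, key=lambda p: -abs(p[1])):
        print(f"{tok:10s} {delta:+.3f}")
```

In a real setting, `predict_proba` would wrap a trained model such as the transformer classifiers named above, and a LIME text explainer would replace the occlusion loop.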