Online Grooming Detection on Social Media Platforms
Abstract
Online grooming detection has become a critical research topic in the era of extensive data analysis. It is essential to protect vulnerable users, particularly adolescents, against sexual predation on online platforms and media. However, many factors make online grooming detection difficult, which leaves young people at high risk. The primary goal of this research is to provide techniques that increase children’s security on online chat platforms. To this end, we have conducted many experiments to create models that fulfil this goal. This thesis also contains a comprehensive survey of child exploitation in chat logs that gives readers a deep understanding of the problem, possible research gaps, and proposed solutions. In this research, we split the online grooming detection problem into several subproblems: author profiling, predatory conversation detection, predatory identification, and data limitation issues.
The rationale for author profiling in this problem is that online predators present fake identities to trap their young victims. At the same time, the characteristics of children differ from those of people who imitate a minor, which motivates us to detect the gender of users in this research. In this thesis, we propose a gender detection model that recognizes the gender of authors from their keystroke dynamics features. This research also provides a high-performance fake identity detection technique that detects users who are dishonest about their identity.
An automatic predatory conversation detection system helps law enforcement authorities act in time, before any tragedy occurs. Therefore, we have examined and proposed several predatory conversation detection and predatory identification techniques, focusing on finding the feature vectors and embeddings that lead to the best performance in online grooming detection.
This thesis also aims to gain deep knowledge of predatory behaviour through semantic analysis. Conventional embeddings such as Word2vec or GloVe feature vectors can lose semantic information, since they assign a single embedding to a term regardless of its context. At the same time, humans express their motivations in phrases or sentences rather than single terms. We therefore provide an online grooming detection model based on embeddings extracted from sentences rather than single words. We apply contextual models, such as BERT-based and RoBERTa-based systems, to each sentence.
Online grooming datasets are subject to several constraints, such as privacy and security issues, limited availability, and class imbalance. The number of predatory chat logs is considerably lower than that of other online conversations, leading to a highly imbalanced data problem. It is challenging to build a machine learning model on imbalanced datasets, which motivates us to provide a model that handles this issue. This research proposes a model that uses hybrid sampling and class re-distribution to obtain augmented data for coping with highly imbalanced datasets. We also improve the diversity of classifiers and feature vectors by iteratively perturbing the data along with the augmentation.
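As a rough illustration of what hybrid sampling can look like, the sketch below undersamples the majority class and oversamples the minority class with replacement until every class reaches a common target size. It is a minimal, generic example and not the model proposed in this thesis: the function name hybrid_resample, the target_per_class parameter, and the toy chat labels are all assumptions made for illustration, and the perturbation and ensemble steps mentioned above are not shown.

```python
import random
from collections import Counter


def hybrid_resample(samples, labels, target_per_class, seed=0):
    """Return a class-balanced copy of (samples, labels).

    Hypothetical helper for illustration only: classes larger than
    target_per_class are randomly undersampled, smaller classes are
    randomly oversampled with replacement.
    """
    rng = random.Random(seed)

    # Group the samples by class label.
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)

    out_x, out_y = [], []
    for y, items in by_class.items():
        if len(items) >= target_per_class:
            # Undersample: draw without replacement.
            chosen = rng.sample(items, target_per_class)
        else:
            # Oversample: keep every original item, then add random duplicates.
            missing = target_per_class - len(items)
            chosen = items + [rng.choice(items) for _ in range(missing)]
        out_x.extend(chosen)
        out_y.extend([y] * target_per_class)
    return out_x, out_y


# Toy data (assumed): 95 ordinary chats (label 0) and 5 predatory chats (label 1).
chats = [f"normal_{i}" for i in range(95)] + [f"predatory_{i}" for i in range(5)]
labels = [0] * 95 + [1] * 5

balanced_chats, balanced_labels = hybrid_resample(chats, labels, target_per_class=20)
print(Counter(balanced_labels))
```

Plain duplication adds no new information about the minority class, which is one reason to combine resampling with data perturbation, as described above.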
Finally, we conclude by discussing potential research gaps and open problems and by proposing possible solutions, to give readers insight into future work that can build on this thesis.
Has parts
Paper 1: Rezaee Borj, Parisa; Raja, Kiran; Bours, Patrick Adrianus. Online grooming detection: A comprehensive survey of child exploitation in chat logs. Knowledge-Based Systems 2022; Volume 259. This is an open access article under the CC BY license.
Paper 2: Borj, Parisa Rezaee; Bours, Patrick. Detecting Liars in Chats using Keystroke Dynamics. In: Proceedings of the 2019 International Conference on Biometric Engineering and Applications (ICBEA 2019). Association for Computing Machinery (ACM) 2019 ISBN 978-1-4503-6305-1. Copyright © 2019 ACM
Paper 3: Li, Guoqiang; Borj, Parisa Rezaee; Bergeron, Loic; Bours, Patrick. Exploring Keystroke Dynamics and Stylometry Features for Gender Prediction on Chatting Data. In: Proceedings of the International Convention MIPRO. IEEE conference proceedings 2019 ISBN 978-1-5386-9296-7. © 2019 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
Paper 4: Borj, Parisa Rezaee; Bours, Patrick. Predatory Conversation Detection. In: International Conference on Cyber Security for Emerging Technologies. IEEE conference proceedings 2019 ISBN 978-1-7281-4539-6. © 2019 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
Paper 5: Borj, Parisa Rezaee; Bylappa Raja, Kiran; Bours, Patrick. On Preprocessing the Data for Improving Sexual Predator Detection. In: 15th International Workshop on Semantic and Social Media Adaptation and Personalization. IEEE 2020 ISBN 978-1-7281-5920-1. © 2020 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
Paper 6: Borj, Parisa Rezaee; Raja, Kiran; Bours, Patrick. Detecting Sexual Predatory Chats by Perturbed Data and Balanced Ensembles. In: Proceedings of the 20th International Conference of the Biometrics Special Interest Group (BIOSIG2021). Gesellschaft für Informatik 2021 ISBN 978-1-6654-2693-0. © 2021 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
Paper 7: Borj, Parisa Rezaee; Raja, Kiran; Bours, Patrick. (2023). Detecting Online Grooming By Simple Contrastive Chat Embeddings, 9th ACM International Workshop on Security and Privacy Analytics (IWSPA 2023) [Accepted]. This paper is not yet published and is therefore not included.