The impact of preprocessing in natural language for open source intelligence and criminal investigation
MetadataShow full item record
Original versionTEMP 2017 IEEE International Conference on Big Data (Big Data). 2019, 4248-4254. 10.1109/BigData47090.2019.9006006
Underground forums serves as gathering place for like-minded cyber criminals and are an continued threat to law and order. Law enforcement agencies can use Open-Source Intelligence (OSINT) to gather valuable information to proactively counter existing and new threats. For example, by shifting criminal investigation's focus onto certain cyber criminals with large impact in underground forums and related criminal business models. This paper presents our study on text preprocessing requirements and document construction for the topic model algorithm Latent Dirichlet Allocation (LDA). We identify a set of preprocessing requirements based on literature review and demonstrate them on a real-world forum, similar to those used by cyber criminals. Our result show that topic modelling processes needs to follow a very strict procedure to provide significant result that can be useful in OSINT. Additionally, more reliable results are produced by tuning the hyper-parameters and the number of topics for LDA. We demonstrate improved results by iterative preprocessing to continuously improve the model, which provide more coherent and focused topics.