Towards Automated Fake News Classification - On Building Collections for Claim Analysis Research
Abstract
Fake news is not a new phenomenon, but the term became widely known during the 2016 U.S. presidential election. It has grown more relevant with the prevalence of social media and the increased use of the internet as a news source, since people now share the news they read on a larger scale than before. In this forest of available sources and user-generated content, it can be difficult to distinguish truth from lies, and a solution that helps with this would be beneficial. Currently, fact-checkers solve the problem manually: they first find claims to check, then consult various sources and talk with experts in the field, and finally summarize their findings in a fact-check that they publish.
As the task of finding claims to check is extensive, a solution that can do this automatically would be advantageous. Some solutions already exist for English, but we have found none for Norwegian, so this thesis focuses on Norwegian. As a step towards such a solution, we develop a specialized dataset of Norwegian political claims for use in claim analysis research. For this purpose, we propose a three-part system that compiles an initial data source, collects annotations from users, and combines the annotation contributions into class labels. After completing the labeling, we analyze how the demographic factors age, education, and gender affect the users' answers. We also look at which political parties have the most check-worthy claims, and for which political parties the claims were easier to label.
Overall, the results show that the proposed method can be employed to develop a dataset of Norwegian political claims. In this thesis, the method is used to develop a dataset that can serve as a starting point for further research in claim analysis. The analysis of the resulting dataset indicates that other sources of data should also be considered in further work.
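As an illustration of the third part of the proposed system, combining annotation contributions into class labels, the following is a minimal sketch that assumes a simple majority vote. The abstract does not state which aggregation rule is used, so the function name `aggregate_labels`, the label strings, and the thresholds `min_votes` and `min_agreement` are hypothetical choices made only for this example.

```python
from collections import Counter


def aggregate_labels(annotations, min_votes=3, min_agreement=0.5):
    """Combine per-user annotations into one class label per claim.

    Assumption: majority voting. `annotations` is an iterable of
    (claim_id, label) pairs, one pair per user contribution.
    A claim gets the label None when it has too few votes or when
    no label is chosen by more than `min_agreement` of the annotators.
    """
    # Group the contributed labels by claim.
    votes = {}
    for claim_id, label in annotations:
        votes.setdefault(claim_id, []).append(label)

    result = {}
    for claim_id, labels in votes.items():
        if len(labels) < min_votes:
            result[claim_id] = None  # not enough contributions yet
            continue
        top_label, count = Counter(labels).most_common(1)[0]
        # Require a strict majority so that ties stay unlabeled.
        agreed = count / len(labels) > min_agreement
        result[claim_id] = top_label if agreed else None
    return result


# Hypothetical example data: three claims with differing agreement.
annotations = [
    ("c1", "check-worthy"), ("c1", "check-worthy"), ("c1", "not check-worthy"),
    ("c2", "check-worthy"),
    ("c3", "check-worthy"), ("c3", "not check-worthy"),
    ("c3", "check-worthy"), ("c3", "not check-worthy"),
]
labels = aggregate_labels(annotations)
```

Under these assumptions, a claim with a clear majority receives that label, while claims with too few contributions or an even split remain unlabeled and could be sent back for more annotations.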