Abstract
The extraction of essential data from standard text has seen exponential growth and adoption over the last few years, and according to Yarchi et al. [2021] this growth is not expected to stall. Text mining is now a large field with many application areas. This thesis provides an overview of the methods used in text mining, some of which are applied in an attempt to identify viewpoints in Norwegian text. Some of the more common text mining techniques are often not applicable to political text, because the analysis must be able to detect irony, sarcasm, words that are used differently by different political parties, and so on. In other words, the machine has to take a larger portion of the text into consideration to properly determine whether a viewpoint is present.
Since there has been little research on political text mining, especially in Norway, another challenge is that the available data sets might not be annotated for political viewpoints. This makes initial progress slow, as data annotation takes a lot of time and effort. Because this project focuses on Norwegian political viewpoints, we did not have access to any previously annotated data, and a lot of time was therefore spent on the annotation process.
In the experiment, supervised learning is used to investigate how well Bag-of-Words, Term Frequency-Inverse Document Frequency, and three Sentence Embedding models represent political texts, by applying Naive Bayes, Logistic Regression, and Random Forest as classifiers. The classifiers are also evaluated. The accuracy scores have a long way to go before they can compete with the scores of other classification solutions, but they are better than expected. The best accuracy obtained was about 78%; keep in mind that this is a point estimate.