Multilingual News Article Classification
News is an ever-growing, global resource that relies on robust distribution networks to spread information. This thesis investigates how exploiting semantic, contextual and ontological information may form a basis for a language-independent news article classification system. To that end, it presents a scalable multi-label news article classification system based exclusively on extracted DBpedia entities and a predetermined, standardized set of fixed-size IPTC Media Topic categories. The proposed system includes an ensemble of n-binary multinomial classifiers, comprising both traditional Naïve Bayes and several sophisticated artificial neural networks, all trained on 1.8 million news articles spanning twenty years of content from The New York Times. Through a series of experiments, this thesis provides evidence that a reliable language-independent news article classifier is plausible, achieving a macro-averaged F-score of 91% in categories like sport and an overall F-score of 49% for the whole system. Furthermore, the results show that using pre-trained word embeddings like Word2Vec instead of the traditional Bag-of-Words approach for feature representation provides both reduced training time and comparable classification quality. Also included in the experiments are several studies exploring how article length, the incorporation of ontologically related supertypes, and moving through time affect the classification quality of news articles. Among the most central findings is that article length is positively correlated with F-score up to a length of 600 words, at which point the F-score stabilizes.
Finally, the thesis presents a thorough evaluation comparing traditional machine learning to the state of the art in deep learning for the news article domain, from both a theoretical and a practical standpoint, ultimately concluding that replacing traditional, well-performing machine learning methods with deep learning is not necessarily the right solution in simple problem domains.
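The abstract reports both a per-category F-score (sport) and an overall macro-averaged figure. As a reading aid, the following minimal sketch shows how such scores are commonly computed for a multi-label setup with one binary decision per category. It is not the thesis's evaluation code: the function names, category labels and toy data are all hypothetical, and the standard F1 definition is assumed.

```python
from typing import Dict, List, Set


def f1_per_category(gold: List[Set[str]], pred: List[Set[str]],
                    categories: List[str]) -> Dict[str, float]:
    """F1 for each category, treating every category as its own binary task."""
    scores = {}
    for c in categories:
        pairs = list(zip(gold, pred))
        tp = sum(1 for g, p in pairs if c in g and c in p)      # true positives
        fp = sum(1 for g, p in pairs if c not in g and c in p)  # false positives
        fn = sum(1 for g, p in pairs if c in g and c not in p)  # false negatives
        denom = 2 * tp + fp + fn
        scores[c] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(scores: Dict[str, float]) -> float:
    """Unweighted mean of the per-category F1 scores."""
    return sum(scores.values()) / len(scores)


# Toy data (hypothetical): gold and predicted label sets for four articles.
categories = ["sport", "politics", "economy"]
gold = [{"sport"}, {"sport", "politics"}, {"politics"}, {"economy"}]
pred = [{"sport"}, {"sport"}, {"economy"}, {"economy", "politics"}]

scores = f1_per_category(gold, pred, categories)
overall = macro_f1(scores)
```

Because the macro average weights every category equally, a single strong category can coexist with a much lower overall score, which is the pattern the reported figures show.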