Multilingual News Article Classification

Skjennum, Patrick L

dc.contributor.advisor	Gulla, Jon Atle
dc.contributor.advisor	Ingvaldsen, Jon Espen
dc.contributor.author	Skjennum, Patrick L
dc.date.accessioned	2016-09-28T14:00:55Z
dc.date.available	2016-09-28T14:00:55Z
dc.date.created	2016-06-29
dc.date.issued	2016
dc.identifier	ntnudaim:15045
dc.identifier.uri	http://hdl.handle.net/11250/2411543
dc.description.abstract	News is an ever-growing and global resource, reliant on robust distribution networks to spread information. This thesis investigates how exploiting semantic, contextual and ontological information may form a basis for a language-independent news article classification system. To that end, it presents a scalable multi-label news article classification system based exclusively on extracted DBpedia entities and a predetermined, standardized, fixed-size set of IPTC Media Topic categories. The proposed system includes an ensemble of n binary multinomial classifiers, comprising both traditional Naïve Bayes and several sophisticated artificial neural networks, all trained on 1.8 million news articles spanning twenty years of content from The New York Times. Through a series of experiments, this thesis provides evidence that a reliable language-independent news article classifier is plausible, achieving a macro-averaged F-score of 91% in categories like sport and an overall F-score of 49% for the whole system. Furthermore, the results show that using pre-trained word embeddings like Word2Vec instead of the traditional Bag-of-Words approach for feature representation provides both reduced training time and comparable classification quality. Also included in the experiments are several studies exploring how article length, incorporation of ontologically related supertypes, and moving through time affect the classification quality of news articles. Among the most central findings is that article length is positively correlated with F-score up to a length of 600 words, at which point the F-score stabilizes.
Finally, the thesis presents a thorough evaluation comparing traditional machine learning to the state of the art in deep learning for the news article domain, from both a theoretical and a practical standpoint, ultimately concluding that replacing traditional and well-performing machine learning methods with deep learning is not necessarily the right solution in simple problem domains.
dc.language	eng
dc.publisher	NTNU
dc.subject	Computer Science, Intelligent Systems
dc.title	Multilingual News Article Classification
dc.type	Master thesis
dc.source.pagenumber	141

Associated file(s)

File name: 15045_FULLTEXT.pdf
Size: 3.702Mb
Format: PDF

File name: 15045_COVER.pdf
Size: 1.556Mb
Format: PDF

This item appears in the following collection(s)

Institutt for datateknologi og informatikk [6544]
