
Multilingual News Article Classification

Skjennum, Patrick L
Master thesis
15045_FULLTEXT.pdf (3.702Mb)
15045_COVER.pdf (1.556Mb)
Permanent link
http://hdl.handle.net/11250/2411543
Publication date
2016
Collections
  • Institutt for datateknologi og informatikk [4881]
Abstract
News is an ever-growing and global resource, reliant on robust distribution networks to spread information. This thesis investigates how exploiting semantic, contextual and ontological information may form a basis for a language-independent news article classification system.

In light of the above, this thesis presents a scalable multi-label news article classification system based exclusively on extracted DBpedia entities and a predetermined, standardized, fixed-size set of IPTC Media Topic categories. The proposed system includes an ensemble of n binary multinomial classifiers, comprising both traditional Naïve Bayes and several sophisticated artificial neural networks, all trained on 1.8 million news articles spanning twenty years of content from The New York Times.

Through a series of experiments, this thesis provides evidence that a reliable language-independent news article classifier is plausible, achieving a macro-averaged F-score of 91% in categories like sport and an overall F-score of 49% for the whole system. Furthermore, the results show that using pre-trained word embeddings like Word2Vec instead of the traditional Bag-of-Words approach for feature representation provides both reduced training time and comparable classification quality. The experiments also include several studies exploring how article length, the incorporation of ontologically related supertypes, and moving through time affect the classification quality of news articles. Among the most central findings is that article length is positively correlated with F-score up to a length of 600 words, at which point the F-score stabilizes.

Finally, the thesis presents a thorough evaluation comparing traditional machine learning to the state of the art in deep learning for the news article domain, from both a theoretical and a practical standpoint, ultimately concluding that replacing traditional and well-performing machine learning methods with deep learning is not necessarily the right solution in simple problem domains.
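
The abstract reports a per-category F-score alongside a much lower score for the whole system. As a minimal sketch of how such a gap can arise, the Python below treats each category as its own binary decision per article and then macro-averages the per-category F-scores. The category names, the toy labels, and the function names `f1_for_category` and `macro_f1` are illustrative assumptions, not taken from the thesis, and the thesis may compute its overall score differently.

```python
from typing import List, Set


def f1_for_category(gold: List[Set[str]], pred: List[Set[str]], category: str) -> float:
    """F-score for one category, viewed as a yes/no decision on every article."""
    pairs = list(zip(gold, pred))
    tp = sum(1 for g, p in pairs if category in g and category in p)
    fp = sum(1 for g, p in pairs if category not in g and category in p)
    fn = sum(1 for g, p in pairs if category in g and category not in p)
    if tp == 0:
        # No correct positive predictions: precision or recall is zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(gold: List[Set[str]], pred: List[Set[str]], categories: List[str]) -> float:
    """Unweighted mean of the per-category F-scores."""
    scores = [f1_for_category(gold, pred, c) for c in categories]
    return sum(scores) / len(scores)


# Hypothetical multi-label data: each article may carry several categories.
categories = ["sport", "politics", "economy"]
gold = [{"sport"}, {"sport", "politics"}, {"politics"}, {"economy"}]
pred = [{"sport"}, {"sport"}, {"politics", "economy"}, set()]

sport_f1 = f1_for_category(gold, pred, "sport")
overall_f1 = macro_f1(gold, pred, categories)
print(f"sport: {sport_f1:.2f}, macro average: {overall_f1:.2f}")
```

On this toy data the sport category is classified perfectly (F-score 1.0), while the macro average over all three categories is about 0.56, because macro-averaging gives a poorly handled category the same weight as a well-handled one.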
Publisher
NTNU