
dc.contributor.advisor: Eidsvik, Jo (nb_NO)
dc.contributor.advisor: Ramampiaro, Heri (nb_NO)
dc.contributor.author: Næss, Arild Brandrud (nb_NO)
dc.date.accessioned: 2014-12-19T13:57:53Z
dc.date.available: 2014-12-19T13:57:53Z
dc.date.created: 2010-09-04 (nb_NO)
dc.date.issued: 2007 (nb_NO)
dc.identifier: 348586 (nb_NO)
dc.identifier: ntnudaim:3773 (nb_NO)
dc.identifier.uri: http://hdl.handle.net/11250/258403
dc.description.abstract: Natural language processing is an interdisciplinary field of research which studies the problems and possibilities of automated generation and understanding of natural human languages. Text categorization is a central subfield of natural language processing. Automatically assigning categories to digital texts has a wide range of applications in today's information society, from filtering spam to creating web hierarchies and digital newspaper archives. It is a discipline that lends itself more naturally to machine learning than to knowledge engineering; statistical approaches to text categorization are therefore a promising field of inquiry. We provide a survey of the state of the art in text categorization, presenting the most widespread methods in use, and placing particular emphasis on support vector machines, an optimization algorithm that has emerged as the benchmark method in text categorization in the past ten years. We then turn our attention to Bayesian logistic regression, a fairly new and largely unstudied method in text categorization. We see how this method has certain similarities to the support vector machine method, but also differs from it in crucial respects. Notably, Bayesian logistic regression provides us with a statistical framework. It can be claimed to be more modular, in the sense that it is more open to modifications and supplementations by other statistical methods, whereas the support vector machine method remains more of a black box. We present results of thorough testing of the BBR toolkit for Bayesian logistic regression on three separate data sets. We demonstrate which of BBR's parameters are of importance, and we show that its results compare favorably to those of the SVMlight toolkit for support vector machines. We also present two extensions to the BBR toolkit. One attempts to incorporate domain knowledge by way of the prior probability distributions of single words; the other tries to make use of uncategorized documents to boost learning accuracy. (nb_NO)
dc.language: eng (nb_NO)
dc.publisher: Institutt for matematiske fag (nb_NO)
dc.subject: ntnudaim (no_NO)
dc.subject: SIF3 fysikk og matematikk (no_NO)
dc.subject: Industriell matematikk (no_NO)
dc.title: Bayesian Text Categorization (nb_NO)
dc.type: Master thesis (nb_NO)
dc.source.pagenumber: 77 (nb_NO)
dc.contributor.department: Norges teknisk-naturvitenskapelige universitet, Fakultet for informasjonsteknologi, matematikk og elektroteknikk, Institutt for matematiske fag (nb_NO)

