Bayesian Text Categorization

Næss, Arild Brandrud

dc.contributor.advisor	Eidsvik, Jo	nb_NO
dc.contributor.advisor	Ramampiaro, Heri	nb_NO
dc.contributor.author	Næss, Arild Brandrud	nb_NO
dc.date.accessioned	2014-12-19T13:57:53Z
dc.date.available	2014-12-19T13:57:53Z
dc.date.created	2010-09-04	nb_NO
dc.date.issued	2007	nb_NO
dc.identifier	348586	nb_NO
dc.identifier	ntnudaim:3773	nb_NO
dc.identifier.uri	http://hdl.handle.net/11250/258403
dc.description.abstract	Natural language processing is an interdisciplinary field of research which studies the problems and possibilities of automated generation and understanding of natural human languages. Text categorization is a central subfield of natural language processing. Automatically assigning categories to digital texts has a wide range of applications in today's information society, from filtering spam to creating web hierarchies and digital newspaper archives. It is a discipline that lends itself more naturally to machine learning than to knowledge engineering; statistical approaches to text categorization are therefore a promising field of inquiry. We provide a survey of the state of the art in text categorization, presenting the most widespread methods in use, and placing particular emphasis on support vector machines, an optimization algorithm that has emerged as the benchmark method in text categorization in the past ten years. We then turn our attention to Bayesian logistic regression, a fairly new and largely unstudied method in text categorization. We see how this method has certain similarities to the support vector machine method, but also differs from it in crucial respects. Notably, Bayesian logistic regression provides us with a statistical framework. It can be claimed to be more modular, in the sense that it is more open to modifications and supplementations by other statistical methods, whereas the support vector machine method remains more of a black box. We present results of thorough testing of the BBR toolkit for Bayesian logistic regression on three separate data sets. We demonstrate which of BBR's parameters are of importance, and we show that its results compare favorably to those of the SVMlight toolkit for support vector machines. We also present two extensions to the BBR toolkit. One attempts to incorporate domain knowledge by way of the prior probability distributions of single words; the other tries to make use of uncategorized documents to boost learning accuracy.	nb_NO
dc.language	eng	nb_NO
dc.publisher	Institutt for matematiske fag	nb_NO
dc.subject	ntnudaim	no_NO
dc.subject	SIF3 fysikk og matematikk	no_NO
dc.subject	Industriell matematikk	no_NO
dc.title	Bayesian Text Categorization	nb_NO
dc.type	Master thesis	nb_NO
dc.source.pagenumber	77	nb_NO
dc.contributor.department	Norges teknisk-naturvitenskapelige universitet, Fakultet for informasjonsteknologi, matematikk og elektroteknikk, Institutt for matematiske fag	nb_NO

Associated file(s)

Filename: 348586_FULLTEXT01.pdf
Size: 967.4Kb
Format: PDF

Filename: 348586_COVER01.pdf
Size: 46.50Kb
Format: PDF

This item appears in the following collection(s)

Institutt for matematiske fag [2353]
