Bayesian Text Categorization

Næss, Arild Brandrud

Master's thesis

Permanent link

http://hdl.handle.net/11250/258403

Publication date

2007

Collections

Institutt for matematiske fag

Abstract

Natural language processing is an interdisciplinary field of research which studies the problems and possibilities of automated generation and understanding of natural human languages. Text categorization is a central subfield of natural language processing. Automatically assigning categories to digital texts has a wide range of applications in today's information society, from filtering spam to creating web hierarchies and digital newspaper archives. It is a discipline that lends itself more naturally to machine learning than to knowledge engineering; statistical approaches to text categorization are therefore a promising field of inquiry. We provide a survey of the state of the art in text categorization, presenting the most widespread methods in use, and placing particular emphasis on support vector machines, an optimization algorithm that has emerged as the benchmark method in text categorization in the past ten years. We then turn our attention to Bayesian logistic regression, a fairly new and largely unstudied method in text categorization. We see how this method has certain similarities to the support vector machine method, but also differs from it in crucial respects. Notably, Bayesian logistic regression provides us with a statistical framework. It can be claimed to be more modular, in the sense that it is more open to modifications and supplementations by other statistical methods, whereas the support vector machine method remains more of a black box. We present results of thorough testing of the BBR toolkit for Bayesian logistic regression on three separate data sets. We demonstrate which of BBR's parameters are of importance, and we show that its results compare favorably to those of the SVMlight toolkit for support vector machines. We also present two extensions to the BBR toolkit. One attempts to incorporate domain knowledge by way of the prior probability distributions of single words; the other tries to make use of uncategorized documents to boost learning accuracy.

Publisher

Institutt for matematiske fag