Feature Selection for Text Categorisation

Text categorization is the task of discovering the category or class text documents belongs to, or in other words spotting the correct topic for text documents. While there today exists many machine learning schemes for building automatic classifiers, these are typically resource demanding and do not always achieve the best results when given the whole contents of the documents. A popular solution to these problems is called feature selection. The features (e.g. terms) in a document collection are given weights based on a simple scheme, and then ranked by these weights. Next, each document is represented using only the top ranked features, typically only a few percent of the features. The classifier is then built in considerably less time, and might even improve accuracy. In situations where the documents can belong to one of a series of categories, one can either build a multi-class classifier and use one feature set for all categories, or one can split the problem into a series of binary categorization tasks (deciding if documents belong to a category or not) and create one ranked feature subset for each category/classifier. Many feature selection metrics have been suggested over the last decades, including supervised methods that make use of a manually pre-categorized set of training documents, and unsupervised methods that need only training documents of the same type or collection that is to be categorized. While many of these look promising, there has been a lack of large-scale comparison experiments. Also, several methods have been proposed the last two years. Moreover, most evaluations are conducted on a set of binary tasks instead of a multi-class task as this often gives better results, although multi-class categorization with a joint feature set often is used in operational environments. In this report, we present results from the comparison of 16 feature selection methods (in addition to random selection) using various feature set sizes. Of these, 5 were unsupervised , and 11 were supervised. All methods are tested on both a Naive Bayes (NB) classifier and a Support Vector Machine (SVM) classifier. We conducted multi-class experiments using a collection with 20 non-overlapping categories, and each feature selection method produced feature sets common for all the categories. We also combined feature selection methods and evaluated their joint efforts. We found that the classical supervised methods had the best performance, including Chi Square, Information Gain and Mutual Information. The Chi Square variant GSS coefficient was also among the top performers. Odds Ratio showed excellent performance for NB, but not for SVM. The three unsupervised methods Collection Frequency, Collection Frequency Inverse Document Frequency and Term Frequency Document Frequency all showed performances close to the best group. The Bi-Normal Separation metric produced excellent results for the smallest feature subsets. The weirdness factor performed several times better than random selection, but was not among the top performing group. Some combination experiments achieved better results than each method alone, but the majority did not. The top performers Chi square and GSS coefficient classified more documents when used together than alone.Four of the five combinations that showed increase in performance included the BNS metric.

Utgiver

Institutt for datateknikk og informasjonsvitenskap