The Use of Holographic Reduced Representations for Text Classification
Determining the similarity between two documents for information retrieval purposes requires more than knowing which words the documents use. How the words are used also matters. This thesis studies an algorithm called Holographic Reduced Representations (HRR), which takes into account both the co-occurrence of words and the way the words are used in sentences, that is, information about syntactic structure. HRR is a relatively novel algorithm that performs text classification automatically from statistical information, and it is based on representing concepts, i.e. text, as randomly initialized vectors. It captures term context information and term order information from sentences by using vector addition and binding. The thesis introduces concepts related to HRR and text classification, tests the HRR algorithm's ability to capture term context information and term order information, and discusses ways to use this information at the document level. HRR's suitability for text classification is tested and discussed in comparison with the traditional Vector Space Model (VSM). For a query term, the retrieved terms with the most similar order information seem to be terms that have the same part of speech as the query. Using the combined context and order information when retrieving terms or documents for a query gave results with increased depth, i.e. results that also reflect grammatical information, compared with using context information alone.
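The two operations named in the abstract, vector addition and binding, can be illustrated with a minimal sketch. It assumes circular convolution as the binding operation, as in Plate's original formulation of HRR; the abstract does not state the exact encoding used in the thesis, so the dimensionality, the position vectors `before` and `after`, and the example terms are all hypothetical choices made for illustration.

```python
import math
import random

DIM = 512  # illustrative dimensionality; not taken from the thesis


def random_vector(rng, dim=DIM):
    """Random term vector with elements drawn from N(0, 1/dim)."""
    sd = 1.0 / math.sqrt(dim)
    return [rng.gauss(0.0, sd) for _ in range(dim)]


def add(a, b):
    """Superposition: element-wise vector addition."""
    return [x + y for x, y in zip(a, b)]


def bind(a, b):
    """Circular convolution: c[k] = sum_j a[j] * b[(k - j) mod n]."""
    n = len(a)
    return [sum(a[j] * b[(k - j) % n] for j in range(n)) for k in range(n)]


def involution(a):
    """Approximate inverse used for unbinding: a_inv[i] = a[-i mod n]."""
    n = len(a)
    return [a[-i % n] for i in range(n)]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


rng = random.Random(0)
# Hypothetical position vectors and term vectors.
before, after, dog, cat = (random_vector(rng) for _ in range(4))

# Context information: superimpose the vectors of co-occurring terms.
context = add(dog, cat)

# Order information: bind each neighbour to its position, then superimpose.
order = add(bind(before, dog), bind(after, cat))

# Unbinding with the involution of a position vector gives a noisy
# version of the term that was bound to that position.
decoded = bind(involution(before), order)

print("context vs dog:", round(cosine(context, dog), 3))
print("decoded vs dog:", round(cosine(decoded, dog), 3))
print("decoded vs cat:", round(cosine(decoded, cat), 3))
```

In this sketch the summed vector stays similar to each of its constituents, which is what makes addition suitable for context information, while the bound vector resembles neither input yet can be decoded approximately, which is what lets it carry order information. The convolution here is the direct quadratic-time form, chosen for readability; an FFT-based implementation would be the usual choice at realistic dimensionalities.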