Case-based Reasoning in Text Document Classification

This work investigates document classification in Case-Based Reasoning (CBR). The investigation is exemplified by the design and implementation of a system that uses the knowledge-intensive CBR framework Creek to categorize textual cases. The Information Extraction tool CORPORUM analyzes natural language text by extracting "light weight ontologies" consisting of key concepts and the links between them. The output delivered by CORPORUM forms the basis of text categorization in Creek. To find the category of an unknown text case, Creek compares it to a number of already categorized texts and outputs those that are most similar. The similarity between textual cases is calculated using Creek's existing method. The implemented program is based on a study of Textual CBR and Information Extraction, as well as an analysis of Creek's representation and reasoning functionality. When testing the implemented system, we observed that Creek and CORPORUM can cooperate in categorizing documents, even though their formats for representing text cases are initially different. Because of differences in relation types, the general domain knowledge of Creek was not fully utilized during case matching. However, our results suggest that Creek will benefit greatly from using a text analysis tool such as CORPORUM for ontology building.
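The retrieval step described above, comparing an unknown text case to already categorized cases and returning the most similar, can be illustrated with a minimal sketch. It is not Creek's similarity method and does not use CORPORUM's output format. It assumes each case has been reduced to a plain set of key concepts, and it substitutes Jaccard overlap as a stand-in similarity measure. The names `TextCase`, `jaccard`, and `classify`, and the example data, are hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional, Tuple


@dataclass(frozen=True)
class TextCase:
    """A textual case reduced to its key concepts (hypothetical structure)."""
    name: str
    concepts: FrozenSet[str]
    category: Optional[str] = None  # None for a case that is not yet categorized


def jaccard(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    """Stand-in similarity: shared concepts divided by all concepts (0.0 to 1.0)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def classify(query: TextCase, case_base: List[TextCase]) -> Tuple[Optional[str], float]:
    """Return the category of the most similar stored case, with its score.

    Assumes case_base is non-empty; max() raises ValueError otherwise.
    """
    best = max(case_base, key=lambda case: jaccard(query.concepts, case.concepts))
    return best.category, jaccard(query.concepts, best.concepts)


# Illustrative data only, not taken from the thesis.
case_base = [
    TextCase("doc1", frozenset({"engine", "fuel", "pump"}), "mechanical"),
    TextCase("doc2", frozenset({"network", "router", "packet"}), "networking"),
]
query = TextCase("new", frozenset({"fuel", "pump", "valve"}))
category, score = classify(query, case_base)
```

Set overlap ignores the links between concepts and any general domain knowledge, both of which a knowledge-intensive framework such as Creek is designed to exploit, so this sketch shows only the retrieve-and-reuse pattern.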

Publisher

Institutt for datateknikk og informasjonsvitenskap