Enhanced Similarity Matching by Grouping of Features
In this report we introduce a classification system named Grouping of Features (GoF), together with a theoretical exploration of some of the important concepts in the field of Instance-Based Learning (IBL) that are related to this system.

The GoF-system groups a dataset's original features together into abstract features. Each of these groups may capture inherent structures in one of the classes in the data. A genetic algorithm is used to extract a tree of such groups that can be used for measuring similarity between samples. As each class may have different inherent structures, different trees of groups are found for the different classes. To adjust the importance of a group with regard to the classifier, the concept of power average is used. A group's power average may let either the smallest or the largest value of its group dominate, or take any value in between. Tests show that the GoF-system outperforms kNN at many classification tasks.

The system started as a research project by Verdande Technology, and a set of algorithms had been fully or partially implemented before the start of this thesis project. No documentation existed, however, so we have built an understanding of the fields on which the system relies, analyzed their properties, documented this understanding in explicit method descriptions, and tested, modified and extended the original system.

During this project we found that scaling or weighting features, either as a data pre-processing step or during classification, is often crucial for the performance of the classification algorithm. Our hypothesis was then that by letting the weights vary between features and between groups of features, more complex structures could be captured. This would also make the classifier less dependent on how the features are originally scaled. We therefore implemented the Weighted Grouping of Features (WGoF), an extension of the GoF-system.

Notable results in this thesis include classification accuracies of 95.48 percent and 100.00 percent on the non-scaled UCI Wine dataset using the GoF- and WGoF-systems, respectively.
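
The behaviour described for a group's power average, ranging from the smallest value to the largest, can be illustrated with a minimal sketch. It assumes the standard generalized (power) mean with exponent p, which may differ in detail from the definition used in the thesis. The function name `power_mean`, the parameter `p` and the sample values are hypothetical and are not taken from the GoF implementation.

```python
import math


def power_mean(values, p):
    """Generalized (power) mean of strictly positive values.

    Assumption: this is the textbook power mean, used here only to
    illustrate the abstract; it is not the GoF-system's own code.
    """
    if not values:
        raise ValueError("values must be non-empty")
    if any(v <= 0 for v in values):
        raise ValueError("this sketch assumes strictly positive values")
    if p == math.inf:
        return max(values)   # limit p -> +infinity: largest value dominates
    if p == -math.inf:
        return min(values)   # limit p -> -infinity: smallest value dominates
    if p == 0:
        # limit p -> 0 is the geometric mean
        return math.exp(sum(math.log(v) for v in values) / len(values))
    return (sum(v ** p for v in values) / len(values)) ** (1.0 / p)


# Hypothetical per-feature values inside one group.
group = [1.0, 2.0, 3.0]
for p in (-math.inf, -50, -1, 0, 1, 50, math.inf):
    print(p, power_mean(group, p))
```

With p = 1 the sketch gives the ordinary arithmetic mean. As p grows, the result moves towards the largest value in the group, and as p decreases it moves towards the smallest, so a single exponent per group controls which values dominate.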