Empirical evaluation of metric indexing methods
Abstract
Metric indexing is a branch of search technology that is designed for search non-textual data. Examples of this includes image search (where the search query is an image), document search (finding documents that are roughly equal) to search in high-dimensional Euclidean spaces. Metric indexing is based on the theory of metric spaces, where the only thing known about a set of objects is the distance between them (defined by a metric distance function). A large number of methods have been proposed to solve the metric indexing problem. In this thesis, we have concentrated on new approaches to solving these problems, as well as combining existing methods to create better ones. The methods studied in this thesis include D-Index, GNAT, EMVP-Forest, HC, SA-Tree, SSS-Tree, M-Tree, PM-Tree, M*-Tree and PM*-Tree. These have all been implemented and tested against each other to find strengths and weaknesses. This thesis also studies a group of indexing methods called hybrid methods which combines tree-based methods (like SA-Tree, SSS-tree and M-Tree), with pivoting methods (like AESA and LAESA). The thesis also proposes a method to create hybrid trees from existing trees by using features in the programming language. Hybrid methods have been shown in this thesis to be very promising. While they may have a considerable overhead in construction time,CPU usage and/or memory usage, they show large benefits in reduced number of distance computations. We also propose a new way of calculating the Minimal Spanning Tree of a graph operating on metric objects, and show that it reduces the number of distance computations needed.