Phrase searching in text indexes
MetadataVis full innførsel
Phrase searching in text indexes Compare different approaches to perform phrase searching, and consider a new approach whereas bigrams is considered as index term. This master thesis focus at the challenges within phrase searching in large text indexes, and to assess alternative approaches to cope with such indexes. This goal was achieved by performing an experiment, based on the theory of using bigrams consisting of stopwords as additional index terms. Realizing the characteristics within inverted index structures, we utilized stopwords as indicators for severe long posting lists. The characteristics of stopwords proved valuable, and they were collected based on a already established index for a subset of the TREC GOV2 collection. In alternative approaches we outlined two state of the art index structures, speciﬁcally designed to cope with phrase searching challenges. The ﬁrst structure - nextword index - followed a modiﬁcation of the inverted index structure. The second structure - phrase index - utilized the inverted structure in using complete phrases as index terms. Our bigram index focused on the same manipulation of the inverted index structure as the phrase index, using bigrams of words to rastically cut posting lists lengths. This was one of our main goals, as we identiﬁed stopwords posting list lengths to be one of the primary challenges with phrase searching in inverted index structures. Using stopwords to create and select bigrams proved successful to enhance phrase searching, as response times substantially improved. We conclude that our bigram index provides a signiﬁcant performance in crease in terms of query evaluation time, and outperforms the standard inverted index within phrase searching.