A generic and flexible Framework for focusing Search at Yahoo! Shopping

Information retrieval is concerned with extracting documents from a collection according to a user's information need. The ranking returned by a search engine is determined by the relevance function in use. The amount of information stored digitally and searched for on the Web grows every day, and as document collections grow, relevance becomes ever more important. In Web search there is a trend towards domain-specific solutions, known as vertical search services. A vertical search service utilises semi-structured documents, i.e. documents that contain metadata describing their content. Semi-structured information retrieval is a hybrid between traditional information retrieval, based on unstructured documents, and database retrieval, based on structured content. Semi-structured documents imply the use of multiple criteria for ranking the returned documents. This in turn raises questions such as which criterion is more important and how to combine the results produced by the different criteria.

This thesis addresses these challenges. We have studied relevance techniques in order to identify an approach to improving the perceived relevance on Vespa, the Yahoo! vertical search platform. In particular, Yahoo! Shopping has been the focus during problem elaboration, implementation, and evaluation. A plug-in has been implemented in Vespa, providing a generic and flexible framework for hybrid search. Our solution allows for context queries, i.e. queries that include terms describing the desired context, and requires no specific knowledge of the query language or document structure. In addition, keyword and context terms in a query are treated differently: the context terms are used only for focusing the search.

Five experiments have been performed to test our proposed solution. The results indicate that:
- A considerable improvement in retrieval performance is achieved for context queries. Much of the improvement is obtained by removing noisy hits from the result.
- The solution performs almost as well as the standard approach for non-context queries. However, these queries suffer from a higher latency, which depends on the complexity of the domain.

Most search engines today either return thousands of answers to a user query or, in about 20% of the cases, none. Our solution may address both problems and thus improve the perceived relevance. It should be noted that the solution requires a reasonable labelling of the documents, as well as training of the users so that they include context words in their queries.

The preliminary experimental results are positive, but they are influenced by a reference collection that is somewhat adapted to our solution. They should therefore be complemented with experiments based on a full system implementation and a well-defined reference collection. The first step is to choose an appropriate labelling scheme for capturing the semantics of the documents and queries. Next, it would be interesting to experiment with the ranking of the results. Finally, the user interface should be extended to guide the user when submitting context queries.
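The context-query idea described above can be illustrated with a minimal sketch. It is not the Vespa plug-in: the `Document` class, the `labels` field, the `context_vocabulary` set, and the term-count scoring are all hypothetical and chosen only for illustration. The sketch assumes that each document carries a set of context labels and that context terms are recognised by lookup in a known vocabulary. Context terms filter the candidates (focusing), while only keyword terms contribute to the score.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Document:
    doc_id: str
    text: str
    # Hypothetical context labels, e.g. a product category.
    labels: set[str] = field(default_factory=set)


def split_query(query: str, context_vocabulary: set[str]) -> tuple[list[str], list[str]]:
    """Split query terms into keyword terms and context terms."""
    keywords: list[str] = []
    context: list[str] = []
    for term in query.lower().split():
        if term in context_vocabulary:
            context.append(term)
        else:
            keywords.append(term)
    return keywords, context


def search(query: str, documents: list[Document], context_vocabulary: set[str]) -> list[str]:
    """Return document ids ranked by keyword matches within the requested context."""
    keywords, context = split_query(query, context_vocabulary)
    hits: list[tuple[int, str]] = []
    for doc in documents:
        # Context terms only focus the search: skip documents lacking any requested label.
        if not all(term in doc.labels for term in context):
            continue
        tokens = doc.text.lower().split()
        # Only keyword terms contribute to the (deliberately naive) score.
        score = sum(tokens.count(term) for term in keywords)
        # A query with context terms only returns every document in that context.
        if score > 0 or not keywords:
            hits.append((score, doc.doc_id))
    # Highest score first; ties broken by document id for a stable order.
    hits.sort(key=lambda hit: (-hit[0], hit[1]))
    return [doc_id for _, doc_id in hits]
```

In this sketch a query without context terms falls back to plain keyword matching over all documents, whereas adding a context term removes hits from other contexts before ranking.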

Publisher

Department of Computer and Information Science (Institutt for datateknikk og informasjonsvitenskap)