Utilizing linguistic analysis in multiple source search engines

Økland, Vegard

Master thesis

Open

454079_COVER01.pdf (46.85Kb)

454079_FULLTEXT01.pdf (1.088Mb)

Permanent link

http://hdl.handle.net/11250/252703

Publication date

2011

Collections

Institutt for datateknologi og informatikk [6771]

Abstract

Modern search engines make several data sources available to users, e.g. news search, image search and video search. When a user enters a query in a search engine, it is up to the user to choose a source other than the normal web search. On average, a user considers only the first few results of a search, and does so within a few seconds. It would therefore benefit the user experience if the user did not have to limit the sources manually to refine a search.

This project will evaluate different machine learning methods for classifying which sources are relevant to a query. The goal is an automated learning system that takes labeled input and uses it to inform the user of, or direct the user to, the relevant source.

The project will take advantage of a Yahoo! product: Yahoo! Query Linguist Analysis Service (abbreviated QLAS throughout the document). The goal is to incorporate semantic data from QLAS into the learning system. This should increase the amount of information available to the learning system and improve its performance. It is not clear how this semantic data can be combined with the training data and incorporated into the learning system. A substantial part of the project will be to explore this.

This project was done in cooperation with Yahoo! Technologies Norway AS (YTN). YTN develops Vespa, a search engine platform that can search across multiple sources. YTN is interested in research on learning source relevance to improve the search experience in Yahoo services. YTN is also interested in ways that data from QLAS could be used by Vespa to enable source relevance classification when Vespa is used in a multiple-index setup.

Publisher

Institutt for datateknikk og informasjonsvitenskap