Search engine for biological information focusing on literature describing genes

Silden, Thor Henden

Master thesis

Permanent link

http://hdl.handle.net/11250/250164

Date issued

2006

Collections

Institutt for datateknologi og informatikk [6808]

Abstract

The amount of information available in biological literature is extremely large and still growing. Because of this volume, it is difficult to extract specific knowledge and to discover new links between genes. Information retrieval addresses this problem by indexing millions of biological articles and allowing arbitrary searches over them. The main problems include how to index the articles, which text operations to use, and how to measure the results. One collection of biological literature is the TREC collection. TREC has many domain-specific collections extracted from MEDLINE, and one of them focuses on genomics. Containing millions of structured abstracts, the genomic collection is useful for conducting information retrieval experiments. Official participants have already posted several runs against it, and the resulting evaluation measures can be used for comparison. Textpresso, an information retrieval and extraction system for biological literature, is a good example of how to search and extract knowledge from biological literature efficiently with the use of an ontology. One such ontology is the GeneOntology, a structured thesaurus in the field of genomics, organized into molecular function, biological process, and cellular component. My main goal has been to create a solution for a biological search engine that could improve on existing search engines in the same field. I chose to use the GeneOntology thesaurus to expand the query, and Lucene to index and search the TREC genomic collection. I also suggested using word-based Huffman coding to improve search speed and reduce storage requirements. The implementation only used the thesaurus to improve search, because Lucene was already fast enough and my focus was on recall and precision values.
Other goals included determining the roles of text operations, deciding which measurements to use, and making a prototype of the suggested biological search engine. Through the implementation description and the discussion of the problems, I have come to the conclusion that relevancy in biological literature is subjective, depending on both the human user and the IR system, and that measurements should be chosen according to which systems you want to compare against and which qualities of the system you want to examine. Text operations can have a major impact on a search engine's recall and precision values, and I can only justify eliminating stopwords and using an advanced lexical analysis that does nothing with digits. The suggested improvements, the thesaurus and word-based Huffman coding, should improve precision and search speed.
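As a rough illustration of thesaurus-based query expansion, the following minimal Python sketch adds synonyms to each query term and joins the result with OR, quoting multi-word phrases in a style similar to what Lucene's classic query parser accepts. The thesaurus entries and the function names `expand_query` and `to_boolean_query` are hypothetical and chosen for illustration only; they are not taken from the thesis or from actual GeneOntology data.

```python
# Toy thesaurus: hypothetical entries for illustration only,
# not real GeneOntology content.
TOY_THESAURUS = {
    "apoptosis": ["programmed cell death"],
    "nucleus": ["cell nucleus"],
}


def expand_query(query, thesaurus):
    """Return the query terms plus their synonyms, in order, without duplicates."""
    expanded = []
    for term in query.lower().split():
        # Keep the original term first, then any synonyms the thesaurus lists.
        for candidate in [term] + thesaurus.get(term, []):
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded


def to_boolean_query(terms):
    """Join terms with OR, wrapping multi-word phrases in double quotes."""
    return " OR ".join(f'"{t}"' if " " in t else t for t in terms)


if __name__ == "__main__":
    terms = expand_query("apoptosis gene", TOY_THESAURUS)
    print(to_boolean_query(terms))
```

Expanding with OR tends to raise recall at some cost to precision, which is one reason the effect of a thesaurus has to be measured rather than assumed.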
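The text operations and measures mentioned above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the thesis implementation: the stopword list is a tiny hypothetical one, "does nothing with digits" is read here as leaving digits in tokens untouched (so that names such as p53 survive), and the document identifiers are made up.

```python
import re

# Tiny hypothetical stopword list; a real system would use a larger one.
STOPWORDS = {"the", "of", "and", "in", "a", "is"}


def tokenize(text):
    """Lowercase, split on non-alphanumeric characters, and drop stopwords.

    Digits are left untouched, so identifiers such as 'p53' stay whole.
    """
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]


def precision_recall(retrieved, relevant):
    """Compute set-based precision and recall for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


if __name__ == "__main__":
    print(tokenize("The role of BRCA1 in the p53 pathway"))
    # Hypothetical document ids: 2 of 4 retrieved are relevant, 2 of 3 relevant are found.
    print(precision_recall(["d1", "d2", "d3", "d4"], ["d2", "d4", "d7"]))
```

Set-based precision and recall ignore ranking; comparing against published TREC runs would normally also call for rank-aware measures, which this sketch does not cover.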

Publisher

Institutt for datateknikk og informasjonsvitenskap