dc.contributor.author: Sætre, Rune [nb_NO]
dc.date.accessioned: 2014-12-19T13:30:25Z
dc.date.available: 2014-12-19T13:30:25Z
dc.date.created: 2008-01-29 [nb_NO]
dc.date.issued: 2006 [nb_NO]
dc.identifier: 125531 [nb_NO]
dc.identifier.isbn: 82-471-7865-6 [nb_NO]
dc.identifier.uri: http://hdl.handle.net/11250/249975
dc.description.abstract: Natural Language Understanding (NLU) is a 50-year-old research field, but its application to molecular biology literature (BioNLU) is less than 10 years old. After the complete human genome sequence was published by the Human Genome Project and Celera in 2001, there has been an explosion of research, shifting the NLU focus from domains like news articles to the domain of molecular biology and medical literature. BioNLU is needed because almost 2000 new articles are published and indexed every day, and biologists need to keep up with existing knowledge relevant to their own research. So far, BioNLU results are not as good as in other NLU domains, so more research is needed to solve the challenges of creating useful NLU applications for biologists. The work in this PhD thesis is a “proof of concept”. It is the first to show that an existing Question Answering (QA) system can be successfully applied in the hard BioNLU domain, once the essential challenge of unknown entities is solved. The core contribution is a system that automatically discovers and classifies unknown entities and the relations between them. The World Wide Web (through Google) is used as the main resource, and the performance is almost as good as that of other named entity extraction systems. The advantage of this approach is that it is much simpler and requires less manual labor than any of the comparable systems. The first paper in this collection gives an overview of the field of NLU and shows how the Information Extraction (IE) problem can be formulated with Local Grammars. The second paper uses Machine Learning to automatically recognize protein names based on features from the GSearch Engine. In the third paper, GSearch is replaced with Google, and the task is to extract all unknown names belonging to one of 273 biomedical entity classes, such as genes, proteins and processes.
After getting promising results with Google, the fourth paper shows that this approach can also be used to retrieve interactions or relationships between the named entities. The fifth paper describes an online implementation of the system and shows that the method scales well to a larger set of entities. The final paper concludes the “proof of concept” research and shows that the performance of the original GeneTUC NLU system has increased from handling 10% of the sentences in a large collection of abstracts in 2001 to 50% in 2006. This is still not good enough to create a commercial system, but it is believed that another 40% performance gain can be achieved by importing more verb templates into GeneTUC, just as nouns were imported during this work. Work on this has already begun, in the form of a local Master's thesis. [nb_NO]
dc.language: eng [nb_NO]
dc.publisher: Fakultet for informasjonsteknologi, matematikk og elektroteknikk [nb_NO]
dc.relation.ispartofseries: Doktoravhandlinger ved NTNU, 1503-8181; 2006:59 [nb_NO]
dc.relation.haspart: Sætre, Rune. GeneTUC. Proc. Computer Science Graduate Students Conference 2004 (CSGSC-2004), 2004. [nb_NO]
dc.relation.haspart: Tveit, Amund; Sætre, Rune; Lægreid, Astrid; Steigedal, Tonje Strømmen. ProtChew: Automatic Extraction of Protein Names from Biomedical Literature. 21st International Conference on Data Engineering Workshops (ICDEW'05): 1161, 2005. [nb_NO]
dc.relation.haspart: Sætre, Rune; Tveit, Amund; Steigedal, Tonje Strømmen; Lægreid, Astrid. Semantic Annotation of Biomedical Literature Using Google. Proc. Data Mining and Bioinformatics, Lecture Notes in Computer Science (LNCS). 3482: 327-337, 2005. [nb_NO]
dc.relation.haspart: Tveit, Amund; Ranang, Martin Thorsen; Steigedal, Tonje Strømmen; Thommesen, Liv; Stunes, Kamilla; Lægreid, Astrid. gProt. Proc. Knowledge-Based Intelligent Information and Engineering Systems (KES) 2005, Lecture Notes in Artificial Intelligence. 3683: 1195-1203, 2005. [nb_NO]
dc.relation.haspart: Sætre, Rune; Søvik, Harald; Amble, Tore; Tsuruoka, Yoshimasa. GeneTUC, GENIA and Google. Special Issue on "Data Mining and Bioinformatics" of Transactions on Computational Systems Biology. 4070: 68-82, 2006. [nb_NO]
dc.subject: Information Extraction (IE) [en_GB]
dc.subject: Natural Language Processing (NLP) [en_GB]
dc.subject: Bio-informatics [en_GB]
dc.title: GeneTUC: Natural Language Understanding in Medical Text [nb_NO]
dc.type: Doctoral thesis [nb_NO]
dc.contributor.department: Norges teknisk-naturvitenskapelige universitet, Fakultet for informasjonsteknologi, matematikk og elektroteknikk, Institutt for datateknikk og informasjonsvitenskap [nb_NO]
dc.description.degree: dr.ing. [nb_NO]
dc.description.degree: dr.ing. [en_GB]