GeneTUC: Natural Language Understanding in Medical Text

Sætre, Rune

dc.contributor.author	Sætre, Rune	nb_NO
dc.date.accessioned	2014-12-19T13:30:25Z
dc.date.available	2014-12-19T13:30:25Z
dc.date.created	2008-01-29	nb_NO
dc.date.issued	2006	nb_NO
dc.identifier	125531	nb_NO
dc.identifier.isbn	82-471-7865-6	nb_NO
dc.identifier.uri	http://hdl.handle.net/11250/249975
dc.description.abstract	Natural Language Understanding (NLU) is a 50-year-old research field, but its application to molecular biology literature (BioNLU) is less than 10 years old. After the complete human genome sequence was published by the Human Genome Project and Celera in 2001, there has been an explosion of research, shifting the NLU focus from domains like news articles to the domain of molecular biology and medical literature. BioNLU is needed because almost 2000 new articles are published and indexed every day, and biologists need to know about existing knowledge relevant to their own research. So far, BioNLU results are not as good as those in other NLU domains, so more research is needed to solve the challenges of creating useful NLU applications for biologists. The work in this PhD thesis is a “proof of concept”. It is the first to show that an existing Question Answering (QA) system can be successfully applied in the hard BioNLU domain, once the essential challenge of unknown entities is solved. The core contribution is a system that automatically discovers and classifies unknown entities and the relations between them. The World Wide Web (through Google) is used as the main resource. The performance is almost as good as that of other named entity extraction systems, and the advantage of this approach is that it is much simpler and requires less manual labor than any of the comparable systems. The first paper in this collection gives an overview of the field of NLU and shows how the Information Extraction (IE) problem can be formulated with Local Grammars. The second paper uses Machine Learning to automatically recognize protein names based on features from the GSearch Engine. In the third paper, GSearch is replaced with Google, and the task is to extract all unknown names belonging to one of 273 biomedical entity classes, such as genes, proteins and processes.
After getting promising results with Google, the fourth paper shows that this approach can also be used to retrieve interactions or relationships between the named entities. The fifth paper describes an online implementation of the system and shows that the method scales well to a larger set of entities. The final paper concludes the “proof of concept” research and shows that the performance of the original GeneTUC NLU system has increased from handling 10% of the sentences in a large collection of abstracts in 2001 to 50% in 2006. This is still not good enough to create a commercial system, but it is believed that another 40% performance gain can be achieved by importing more verb templates into GeneTUC, just as nouns were imported during this work. Work on this has already begun in the form of a local master's thesis.	nb_NO
dc.language	eng	nb_NO
dc.publisher	Fakultet for informasjonsteknologi, matematikk og elektroteknikk	nb_NO
dc.relation.ispartofseries	Doktoravhandlinger ved NTNU, 1503-8181; 2006:59	nb_NO
dc.relation.haspart	Sætre, Rune. GeneTUC. Proc. Computer Science Graduate Students Conference 2004 (CSGSC-2004), 2004.	nb_NO
dc.relation.haspart	Tveit, Amund; Sætre, Rune; Lægreid, Astrid; Steigedal, Tonje Strømmen. ProtChew: Automatic Extraction of Protein Names from Biomedical Literature. 21st International Conference on Data Engineering Workshops (ICDEW'05): 1161, 2005.	nb_NO
dc.relation.haspart	Sætre, Rune; Tveit, Amund; Steigedal, Tonje Strømmen; Lægreid, Astrid. Semantic Annotation of Biomedical Literature Using Google. Proc. Data Mining and Bioinformatics, Lecture Notes in Computer Science (LNCS). 3482: 327-337, 2005.	nb_NO
dc.relation.haspart	Tveit, Amund; Ranang, Martin Thorsen; Steigedal, Tonje Strømmen; Thommesen, Liv; Stunes, Kamilla; Lægreid, Astrid. gProt. Proc. Knowledge-Based Intelligent Information and Engineering Systems (KES) 2005, Lecture Notes in Artificial Intelligence. 3683: 1195-1203, 2005.	nb_NO
dc.relation.haspart	Sætre, Rune; Søvik, Harald; Amble, Tore; Tsuruoka, Yoshimasa. GeneTUC, GENIA and Google. Special Issue on "Data Mining and Bioinformatics" of Transactions on Computational Systems Biology. 4070: 68-82, 2006.	nb_NO
dc.subject	Information Extraction (IE)	en_GB
dc.subject	Natural Language Processing (NLP)	en_GB
dc.subject	Bio-informatics	en_GB
dc.title	GeneTUC: Natural Language Understanding in Medical Text	nb_NO
dc.type	Doctoral thesis	nb_NO
dc.contributor.department	Norges teknisk-naturvitenskapelige universitet, Fakultet for informasjonsteknologi, matematikk og elektroteknikk, Institutt for datateknikk og informasjonsvitenskap	nb_NO
dc.description.degree	dr.ing.	nb_NO
dc.description.degree	dr.ing.	en_GB

Associated file(s)

File name: 125531_FULLTEXT01.pdf
Size: 1.893 MB
Format: PDF

This item appears in the following collection(s)

Institutt for datateknologi og informatikk [6620]