Open-Domain Word-Level Interpretation of Norwegian: Towards a General Encyclopedic Question-Answering System for Norwegian
Abstract
No large-scale, open-domain semantic resource for Norwegian with a rich set of semantic relations currently exists. The existing semantic resources for Norwegian are limited in size, incompatible with the de facto standard resources used for Natural Language Processing for English, or both. The current and future cultural, technological, economic, and educational consequences of the scarcity of advanced Norwegian language-technological solutions and resources have been widely acknowledged (Simonsen 2005; Norwegian Language Council 2005; Norwegian Ministry of Culture and Church Affairs 2008). This dissertation presents (1) a novel method, consisting of a model and several algorithms, for automatically mapping content words from a non-English source language to (a power set of) WordNet (Miller 1995; Fellbaum 1998) senses, with average precision of up to 92.1 % and recall of up to 36.5 %. Because an important feature of the method is its ability to handle compounds correctly, this dissertation also presents (2) a practical implementation, including algorithms and a grammar, of a program for automatically analyzing Norwegian compounds. This work also shows (3) how Verto, an implementation of the model and algorithms, is used to create Ordnett, a large-scale, open-domain lexical-semantic resource for Norwegian with a rich set of semantic relations. Finally, this work argues that the new method and the automatically generated resource make it possible to build large-scale, open-domain Natural Language Understanding systems for Norwegian texts that offer both wide coverage and deep analyses. This is done by showing (4) how Ordnett can be used in an open-domain question answering system that automatically extracts and acquires knowledge from Norwegian encyclopedic articles and uses that knowledge to answer questions that its users formulate in natural language.
The open-domain question answering system, named TUClopedia, is based on The Understanding Computer (Amble 2003), which has previously been applied successfully to narrow domains.