Automatic template generation

Rogstad, Erik; Ulseth, Øystein

Master thesis

URI

http://hdl.handle.net/11250/250037

Date

2006

Collections

Institutt for datateknologi og informatikk [6830]

Abstract

In natural language processing (NLP), templates define events and actions in text documents. In particular, templates are useful for information extraction (IE). Traditionally, template generation is a manual process, which is time-consuming and tedious. Additionally, such templates are restricted to a limited number of knowledge domains. With these considerations in mind, automatic generation of templates from unstructured text is useful for a wide range of applications.

This thesis proposes a method for automatic generation of templates from unstructured text. The method learns templates from training sets of text documents and returns templates that capture stereotyped behavior in the document collections. In addition, the thesis proposes a method that uses the template sets to classify text documents and to extract information from them.

To arrive at a set of templates that captures stereotyped behavior, predicate argument structures (PA-structures) are first extracted from the documents. Next, all the PA-structures are transformed into a template representation. Finally, templates are merged and the resulting template set is returned. Every template is given a shared information value (SI-value). SI-values indicate the level of shared information captured in the templates, in other words, to what extent the templates describe stereotyped behavior in the domain. As an integral part of the system, a parser that extracts predicate argument structures has been implemented. The precision and recall of the extractor are 89.7% and 79.1%, respectively. The generated template sets have proven useful both for classifying text documents and for extracting information from them.
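The abstract names the pipeline stages (PA-structure extraction, transformation into templates, merging, SI-value assignment) but not their data structures or formulas. The following is only a minimal illustrative sketch of how such stages could fit together, not the thesis's implementation. Everything in it is an assumption: the `PAStructure` and `Template` classes, the role labels, the merge criterion (same predicate and same set of roles), and in particular the SI-value, which is approximated here as the fraction of training documents in which a merged template occurs. PA-structures are assumed to be already extracted, since the parser itself is not described.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PAStructure:
    """Hypothetical PA-structure: a predicate with role-labelled arguments."""
    predicate: str
    arguments: tuple  # ((role, filler), ...)


@dataclass
class Template:
    """Hypothetical template: a predicate with slots that collect fillers."""
    predicate: str
    slots: dict      # role -> set of fillers seen for that role
    doc_ids: set     # documents in which the template was observed
    si_value: float = 0.0


def to_template(pa, doc_id):
    """Turn one PA-structure into a single-instance template."""
    slots = {role: {filler} for role, filler in pa.arguments}
    return Template(pa.predicate, slots, {doc_id})


def merge_templates(templates):
    """Merge templates sharing a predicate and role set (assumed criterion)."""
    merged = {}
    for t in templates:
        key = (t.predicate, frozenset(t.slots))
        if key not in merged:
            merged[key] = Template(
                t.predicate,
                {role: set(fillers) for role, fillers in t.slots.items()},
                set(t.doc_ids),
            )
        else:
            target = merged[key]
            for role, fillers in t.slots.items():
                target.slots[role] |= fillers
            target.doc_ids |= t.doc_ids
    return list(merged.values())


def generate_templates(corpus):
    """corpus maps a document id to its list of extracted PA-structures."""
    templates = [
        to_template(pa, doc_id)
        for doc_id, pa_list in corpus.items()
        for pa in pa_list
    ]
    merged = merge_templates(templates)
    for t in merged:
        # Assumed stand-in for the SI-value: share of documents covered.
        t.si_value = len(t.doc_ids) / len(corpus)
    return sorted(merged, key=lambda t: t.si_value, reverse=True)


# Toy corpus with invented PA-structures, for illustration only.
corpus = {
    "d1": [PAStructure("acquire", (("agent", "Acme"), ("patient", "Widgets Inc")))],
    "d2": [
        PAStructure("acquire", (("agent", "Globex"), ("patient", "Initech"))),
        PAStructure("resign", (("agent", "CEO"),)),
    ],
}
for template in generate_templates(corpus):
    print(template.predicate, template.si_value, template.slots)
```

In this toy run the `acquire` template occurs in both documents and therefore ranks above `resign`, which occurs in only one. That ordering mirrors the idea that higher SI-values mark behavior that is more stereotyped across the collection.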

Publisher

Institutt for datateknikk og informasjonsvitenskap