Extraction-Based Automatic Summarization: Theoretical and Empirical Investigation of Summarization Techniques

Sizov, Gleb

dc.contributor.advisor	Öztürk, Pinar	nb_NO
dc.contributor.author	Sizov, Gleb	nb_NO
dc.date.accessioned	2014-12-19T13:35:59Z
dc.date.available	2014-12-19T13:35:59Z
dc.date.created	2010-09-21	nb_NO
dc.date.issued	2010	nb_NO
dc.identifier	352481	nb_NO
dc.identifier	ntnudaim:5567	nb_NO
dc.identifier.uri	http://hdl.handle.net/11250/252155
dc.description.abstract	A summary is a shortened version of a text that contains the main points of the original content. Automatic summarization is the task of generating a summary by computer. For example, given a collection of news articles from the last week, an automatic summarizer can create a concise overview of the important events. This summary can be used as a replacement for the original content or help to identify the events that a person is particularly interested in. Potentially, automatic summarization can save a lot of time for people who deal with large amounts of textual information. The straightforward way to generate a summary is to select several sentences from the original text and organize them in a way that creates a coherent text. This approach is called extraction-based summarization and is the topic of this thesis. Extraction-based summarization is a complex task that consists of several challenging subtasks. The essential part of the extraction-based approach is the identification of sentences that contain important information. This can be done using graph-based representations and centrality measures that exploit similarities between sentences to identify the most central sentences. This thesis provides a comprehensive overview of methods used in extraction-based automatic summarization. In addition, several general natural language processing issues, such as feature selection and text representation models, are discussed with regard to automatic summarization. Part of the thesis is dedicated to graph-based representations and centrality measures used in extraction-based summarization. The theoretical analysis is reinforced by experiments using the summarization framework implemented for this thesis. The task for the experiments is query-focused multi-document extraction-based summarization, that is, summarization of several documents according to a user query.
The experiments investigate several approaches to this task as well as the use of different representation models, similarity measures and centrality measures. The results indicate that the use of graph centrality measures significantly improves the quality of the generated summaries. Among the centrality measures, the degree-based ones perform better than the path-based ones. The best performance is achieved when centralities are combined with redundancy removal techniques that prevent the inclusion of similar sentences in a summary. Experiments with representation models reveal that a simple local term count representation performs better than the distributed representation based on latent semantic analysis, which indicates that further investigation of distributed representations with regard to automatic summarization is necessary. The implemented system performs quite well compared with the systems that participated in the DUC 2007 summarization competition. Nevertheless, manual inspection of the generated summaries reveals some flaws in the implemented summarization mechanism that can be addressed by introducing advanced algorithms for sentence simplification and sentence ordering.	nb_NO
dc.language	eng	nb_NO
dc.publisher	Institutt for datateknikk og informasjonsvitenskap	nb_NO
dc.subject	ntnudaim	no_NO
dc.subject	SIF2 datateknikk	no_NO
dc.subject	Intelligente systemer	no_NO
dc.title	Extraction-Based Automatic Summarization: Theoretical and Empirical Investigation of Summarization Techniques	nb_NO
dc.type	Master thesis	nb_NO
dc.source.pagenumber	81	nb_NO
dc.contributor.department	Norges teknisk-naturvitenskapelige universitet, Fakultet for informasjonsteknologi, matematikk og elektroteknikk, Institutt for datateknikk og informasjonsvitenskap	nb_NO
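
The pipeline named in the abstract (term-count sentence representations, a sentence similarity graph, degree-based centrality, and redundancy removal) can be sketched roughly as follows. This is a minimal illustration and not the framework implemented for the thesis: the function names, the regex tokenizer, the cosine similarity measure and both threshold values are assumptions chosen for the example, and the query-focused part of the task is left out.

```python
import math
import re
from collections import Counter


def term_counts(sentence):
    """Local term-count representation: lowercase token -> frequency."""
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def summarize(sentences, size=2, edge_threshold=0.2, redundancy_threshold=0.6):
    """Pick up to `size` central, mutually dissimilar sentences.

    Both thresholds are illustrative values, not taken from the thesis.
    """
    vectors = [term_counts(s) for s in sentences]
    n = len(sentences)
    sim = [[cosine(vectors[i], vectors[j]) for j in range(n)] for i in range(n)]

    # Degree centrality: count the other sentences that are similar enough
    # to be connected to sentence i in the similarity graph.
    degree = [
        sum(1 for j in range(n) if j != i and sim[i][j] >= edge_threshold)
        for i in range(n)
    ]

    # Visit sentences from most to least central (ties: original order).
    ranked = sorted(range(n), key=lambda i: (-degree[i], i))

    chosen = []
    for i in ranked:
        if len(chosen) == size:
            break
        # Redundancy removal: skip a sentence too similar to one already chosen.
        if all(sim[i][j] < redundancy_threshold for j in chosen):
            chosen.append(i)

    # Return the selected sentences in their original order.
    return [sentences[i] for i in sorted(chosen)]


if __name__ == "__main__":
    example = [
        "Storm hits coast on Monday",
        "Storm hits coast on Monday night",
        "Rescue teams reach coast after storm",
        "Markets stay calm",
    ]
    for sentence in summarize(example, size=2):
        print(sentence)
```

In this toy input the first two sentences are near duplicates, so the redundancy check keeps only one of them and the second slot goes to the next most central sentence.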

Associated file(s)

Filename: 352481_ATTACHMENT01.zip
Size: 2.463Mb
Format: Unknown

Filename: 352481_FULLTEXT01.pdf
Size: 804.8Kb
Format: PDF

Filename: 352481_COVER01.pdf
Size: 246.9Kb
Format: PDF

This item appears in the following collection(s)

Institutt for datateknologi og informatikk [6552]
