Full-Text Search in XML Databases

Skoglund, Robin

dc.contributor.advisor	Aalberg, Trond	nb_NO
dc.contributor.author	Skoglund, Robin	nb_NO
dc.date.accessioned	2014-12-19T13:33:55Z
dc.date.available	2014-12-19T13:33:55Z
dc.date.created	2010-09-04	nb_NO
dc.date.issued	2009	nb_NO
dc.identifier	348756	nb_NO
dc.identifier	ntnudaim:4179	nb_NO
dc.identifier.uri	http://hdl.handle.net/11250/251331
dc.description.abstract	The Extensible Markup Language (XML) has become an increasingly popular format for representing and exchanging data. Its flexible and extensible syntax makes it suitable for representing structured data, textual information, or a mixture of both. The popularization of XML has led to the development of a new database type. XML databases serve as repositories of large collections of XML documents, and seek to provide the same benefits for XML data as relational databases do for relational data: indexing, transactional processing, failsafe physical storage, querying of collections, etc. There are two standardized query languages for XML, XQuery and XPath, which are both powerful for querying and navigating the structure of XML. However, they offer limited support for full-text search, and cannot be used alone for typical Information Retrieval (IR) applications. To address IR-related issues in XML, a new standard is emerging as an extension to XPath and XQuery: XQuery and XPath Full Text 1.0 (XQFT). XQFT is carefully investigated to determine how well-known IR techniques apply to XML, and the characteristics of full-text search and indexing in existing XML databases are described in a state-of-the-art study. Based on findings from literature and source code review, the design and implementation of XQFT is discussed, first in general terms, then in the context of Oracle Berkeley DB XML (BDB XML). Experimental support for XQFT is enabled in BDB XML, and a few experiments are conducted in order to evaluate functional aspects of the XQFT implementation. A scheme for full-text indexing in BDB XML is proposed. The full-text index acts as an augmented version of an inverted list, and is implemented on top of an Oracle Berkeley DB database. Tokens are used as keys, with data tuples for each distinct (document, path) combination in which the token occurs.
Lookups in the index are based on keywords, and should allow answering various queries without materializing data. Investigation shows that XML-based IR with XQFT is not fundamentally different from traditional text-based IR. Full-text queries rely on linguistic tokens, which, in XQFT, are derived from nodes without considering the XML structure. Further, it is discovered that full-text indexing is crucial for query efficiency in large document collections. In summary, common issues with full-text search are present in XML-based IR, and are addressed in the same manner as in text-based IR.	nb_NO
dc.language	eng	nb_NO
dc.publisher	Institutt for datateknikk og informasjonsvitenskap	nb_NO
dc.subject	ntnudaim	no_NO
dc.subject	MIT informatikk	no_NO
dc.subject	Informasjonsforvaltning	no_NO
dc.title	Full-Text Search in XML Databases	nb_NO
dc.type	Master thesis	nb_NO
dc.source.pagenumber	105	nb_NO
dc.contributor.department	Norges teknisk-naturvitenskapelige universitet, Fakultet for informasjonsteknologi, matematikk og elektroteknikk, Institutt for datateknikk og informasjonsvitenskap	nb_NO
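
The indexing scheme outlined in the abstract (tokens as keys, one data tuple per distinct (document, path) combination) can be illustrated with a minimal in-memory sketch. This is not the thesis implementation and does not use the Berkeley DB or BDB XML APIs: a plain Python dict stands in for the underlying database, the regex tokenizer is a simplification of XQFT tokenization, and the names tokenize, index_document and lookup are hypothetical.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict


def tokenize(text):
    """Split text into lowercase word tokens (assumed, simplified tokenizer)."""
    return re.findall(r"\w+", text.lower())


def new_index():
    """Token -> set of (document, path) tuples; a dict stands in for the database."""
    return defaultdict(set)


def index_document(index, doc_id, xml_text):
    """Add every token of an XML document to the index under its element path."""
    root = ET.fromstring(xml_text)

    def walk(elem, parent_path):
        path = f"{parent_path}/{elem.tag}"
        # Text directly inside this element: its leading text plus the
        # tail text that follows each child element.
        parts = [elem.text or ""]
        for child in elem:
            walk(child, path)
            parts.append(child.tail or "")
        for token in tokenize(" ".join(parts)):
            index[token].add((doc_id, path))

    walk(root, "")


def lookup(index, keyword):
    """Return the sorted (document, path) tuples in which the keyword occurs."""
    return sorted(index.get(keyword.lower(), ()))


if __name__ == "__main__":
    idx = new_index()
    index_document(idx, "a.xml", "<book><title>XML search</title></book>")
    print(lookup(idx, "xml"))
```

Because each posting carries the element path, a keyword lookup can say where in a document a token occurs without opening the document itself, which is the sense in which the abstract speaks of answering queries without materializing data.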

Associated file(s)

Filename: 348756_FULLTEXT01.pdf
Size: 1.201 MB
Format: PDF

Filename: 348756_COVER01.pdf
Size: 91.75 KB
Format: PDF

This item appears in the following collection(s)

Institutt for datateknologi og informatikk [6551]