Development of a Demand Driven Dom Parser
Abstract
XML is a tremendously popular markup language in internet applications as well as a storage format. XML documents are often accessed through an API, and perhaps the most important of these is the W3C DOM. The W3C recommendation defines a number of interfaces through which a developer can access and manipulate XML documents. The recommendation does not define the implementation-specific approaches used behind the interfaces.
A problem with the W3C DOM approach, however, is that documents are often loaded into memory as a node tree of objects representing the structure of the XML document. This tree is memory consuming and can take up 4-10 times the document size. Lazy processing has been proposed, building the node tree as new parts of the document are accessed. But when the whole document has been accessed, the overhead compared to traditional parsers, both in terms of memory usage and performance, is high.
In this thesis a new and alternative approach is introduced. It draws on well-known indexing schemes for XML, basic techniques for reducing memory consumption, and principles for memory handling in operating systems. By using a memory cache repository for DOM nodes and simultaneously utilizing principles for lazy processing, the proposed implementation has full control over memory consumption. The proposed prototype is called Demand Driven Dom Parser, D3P.
The proposed approach removes the least recently used nodes from memory when the cache has exceeded its memory limit. This makes D3P able to process the document with low memory requirements. An advantage of this approach is that the parser is able to process documents that exceed the size of main memory, which is impossible with traditional approaches.
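The eviction policy described above can be illustrated with a minimal sketch. It is not the D3P implementation: the class name NodeCache, the load_node callback, and the use of a node count as the limit (rather than a byte budget) are assumptions made only for illustration.

```python
from collections import OrderedDict


class NodeCache:
    """Hypothetical LRU cache for DOM nodes, keyed by a node id.

    Simplification: the limit is a number of nodes, not a memory size.
    """

    def __init__(self, capacity, load_node):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity      # maximum number of nodes kept in memory
        self.load_node = load_node    # assumed callback that (re)parses a node on demand
        self.loads = 0                # how many times a node had to be parsed
        self._nodes = OrderedDict()   # ordered from least to most recently used

    def get(self, node_id):
        if node_id in self._nodes:
            # Cache hit: mark the node as most recently used.
            self._nodes.move_to_end(node_id)
            return self._nodes[node_id]
        # Cache miss: build the node lazily, only when it is requested.
        node = self.load_node(node_id)
        self.loads += 1
        self._nodes[node_id] = node
        if len(self._nodes) > self.capacity:
            # Limit exceeded: drop the least recently used node.
            self._nodes.popitem(last=False)
        return node

    def cached_ids(self):
        """Return cached node ids, least recently used first."""
        return list(self._nodes)


if __name__ == "__main__":
    cache = NodeCache(capacity=2, load_node=lambda i: "<node %d>" % i)
    for node_id in (1, 2, 1, 3):
        cache.get(node_id)
    print(cache.cached_ids())  # node 2 was least recently used and is gone
```

An evicted node is not lost in this sketch. It is rebuilt through load_node the next time it is requested, which is the trade of processing time for memory that the evaluation discusses.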
The implementation is evaluated and compared with other implementations, both lazy parsers and traditional parsers that build everything in memory on load. The proposed implementation performs well when the bottleneck is memory usage, because the user can set the desired amount of memory to be used by the XML node tree. On the other hand, as the coverage of the document increases, the time spent processing the node tree grows beyond what is used by traditional approaches.