The CMV+P Document Model, Linear Version
Digital documents are peculiar in that they are different things at the same time. For example, an HTML document is a series of Unicode codepoints, but also a tree-like structure, as well as a rendered image in a browser window and a series of bits stored on a physical medium. These multiple identities not only make it difficult to discuss the evolution of documents (especially digital-born documents) in rigorous scholarly terms, they also create practical problems for computer-based comparison tools and algorithms. The CMV+P model addresses this problem by providing a sound formalization of what a document is and how its many identities can coexist at the same time. In its linear version, described in this paper, the CMV+P model sees each document as a stack of abstraction levels, each composed of a) an addressable Content, b) a Model according to which the content has been recorded, and c) a set of Variants used for equivalence matching. The bottom of this stack is the Physical level, symbolizing the concrete medium that embodies the digital document. Content is moved across levels using transformation functions: encoding functions serialize (save) the document, and decoding functions deserialize (read) it. A practical application of the CMV+P model is its use in comparison tools, algorithms, and methods. With a clear understanding of the internal stratification of formats and models found in digital documents, comparison tools can focus on the most meaningful abstraction levels and show the user which comparisons are possible between two arbitrary documents.
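The stack of levels described above can be illustrated with a minimal Python sketch. It is not the formalization given in the paper: the class name `Level`, the functions `decode_stack`, `encode_words` and `matching_levels`, the three chosen levels (bytes, Unicode text, word sequence), and the reading of Variants as a normalization function applied before equality testing are all assumptions made for illustration.

```python
import unicodedata
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class Level:
    """One abstraction level (hypothetical rendering of Content, Model, Variants)."""
    model: str                          # Model: how the content is recorded
    content: Any                        # Content: the data at this level
    normalize: Callable[[Any], Any]     # Variants, assumed here to be a normalizer

    def matches(self, other: "Level") -> bool:
        # Two levels are equivalent if they share a model and their
        # normalized contents are equal.
        return (self.model == other.model
                and self.normalize(self.content) == other.normalize(other.content))


def decode_stack(raw: bytes) -> List[Level]:
    """Decoding functions: read the physical bytes upward through the stack."""
    text = raw.decode("utf-8")          # bytes -> Unicode text
    words = text.split()                # text -> word sequence
    return [
        Level("physical bytes", raw, lambda c: c),
        Level("unicode text", text, lambda c: unicodedata.normalize("NFC", c)),
        Level("word sequence", words,
              lambda c: [unicodedata.normalize("NFC", w).casefold() for w in c]),
    ]


def encode_words(words: List[str]) -> bytes:
    """Encoding functions: serialize from the top level down to bytes."""
    return " ".join(words).encode("utf-8")


def matching_levels(a: bytes, b: bytes) -> List[str]:
    """Names of the levels at which two documents are equivalent."""
    return [x.model for x, y in zip(decode_stack(a), decode_stack(b)) if x.matches(y)]


doc_a = "Caf\u00e9 menu".encode("utf-8")        # precomposed e-acute
doc_b = "Cafe\u0301  MENU\n".encode("utf-8")    # combining accent, other case and spacing
doc_c = "Cafe\u0301 menu".encode("utf-8")       # combining accent only

print(matching_levels(doc_a, doc_b))
print(matching_levels(doc_a, doc_c))
```

Under these assumptions, `doc_a` and `doc_b` differ as bytes and as text but are equivalent as word sequences, while `doc_a` and `doc_c` are already equivalent at the text level. A comparison tool built on this idea could report the lowest level at which two documents match and compute differences at that level only.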