Forensic Analysis of OOXML Documents
Abstract
Microsoft Office 2007 and subsequent versions use an XML-based file format called Office Open XML (OOXML) for storing documents, spreadsheets and presentations. OOXML documents are often collected in forensic investigations, and are considered one of the main sources of evidence by the National Authority for Investigation and Prosecution of Economic and Environmental Crime in Norway (Norwegian: Økokrim). OOXML documents are zipped file containers which, upon extraction, reveal a file structure with files containing forensically interesting information. Metadata specified in the XML of these documents can often be used, for example, to attribute a document to a person or to correlate time information in order to build a timeline of events. Revision identifiers are unique numbers appended to content in OOXML documents produced in Microsoft Word, and can be used in forensics to, for example, uncover previously unknown social networks, determine the source of a document and detect plagiarism of intellectual property. We have used experimental methods to determine the forensically relevant differences between the word processors Microsoft Word 2007, 2010, 2013, 365 and Online, in addition to LibreOffice Writer and Google Docs, with respect to original path preservation of inserted images, thumbnail creation and implementation of revision identifiers. We have also tested experimentally how unique the revision identifiers are, and found that 2 of 100 documents shared revision identifiers without sharing any content, i.e. a 2% false positive rate. This suggests that revision identifiers can likely be used successfully in forensic investigations. We present a forensic prototype, with the purpose of exploring the possibilities OOXML documents offer in a forensic context. The prototype extracts metadata from documents, extracts and compares revision identifiers across a set of documents, and displays the related documents in a tree graph layout.
This functionality has not previously been described in the literature or implemented in forensic tools. Interviews with two digital forensic experts working in law enforcement indicate that this implementation could have value in cases where a large number of documents are collected.
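As a rough illustration of the kind of extraction described above, the following Python sketch reads revision identifiers and core metadata from an OOXML word-processing container and computes the identifiers two documents have in common. It is not the prototype presented in this work. The function names (extract_rsids, extract_core_metadata, shared_rsids, make_sample_docx) are hypothetical, and the sample container is a synthetic, deliberately incomplete package built only for demonstration. The sketch assumes that revision identifiers appear as attributes whose names begin with "rsid" in the WordprocessingML namespace of word/document.xml, and that author information is stored in docProps/core.xml.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespaces used by WordprocessingML and the package core properties part.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
CP_NS = "http://schemas.openxmlformats.org/package/2006/metadata/core-properties"
DC_NS = "http://purl.org/dc/elements/1.1/"


def extract_rsids(docx):
    """Collect every w:rsid* attribute value found in word/document.xml."""
    with zipfile.ZipFile(docx) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    prefix = "{" + W_NS + "}rsid"
    rsids = set()
    for elem in root.iter():
        for name, value in elem.attrib.items():
            if name.startswith(prefix):
                rsids.add(value.upper())
    return rsids


def extract_core_metadata(docx):
    """Return selected fields from docProps/core.xml, if that part exists."""
    with zipfile.ZipFile(docx) as zf:
        if "docProps/core.xml" not in zf.namelist():
            return {}
        root = ET.fromstring(zf.read("docProps/core.xml"))
    fields = {
        "creator": "{" + DC_NS + "}creator",
        "lastModifiedBy": "{" + CP_NS + "}lastModifiedBy",
    }
    metadata = {}
    for key, tag in fields.items():
        elem = root.find(tag)
        if elem is not None:
            metadata[key] = elem.text
    return metadata


def shared_rsids(docx_a, docx_b):
    """Revision identifiers present in both documents."""
    return extract_rsids(docx_a) & extract_rsids(docx_b)


def make_sample_docx(rsids, author):
    """Hypothetical helper: build a minimal synthetic container in memory.

    The result is NOT a complete, valid .docx file; it only contains the
    two parts that the functions above read.
    """
    paragraphs = "".join(
        '<w:p w:rsidR="%s"><w:r><w:t>text</w:t></w:r></w:p>' % r for r in rsids
    )
    document = '<w:document xmlns:w="%s"><w:body>%s</w:body></w:document>' % (
        W_NS, paragraphs)
    core = ('<cp:coreProperties xmlns:cp="%s" xmlns:dc="%s">'
            '<dc:creator>%s</dc:creator></cp:coreProperties>') % (
        CP_NS, DC_NS, author)
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("word/document.xml", document)
        zf.writestr("docProps/core.xml", core)
    return io.BytesIO(buf.getvalue())


if __name__ == "__main__":
    doc_a = make_sample_docx(["00A1B2C3", "00D4E5F6"], "Alice")
    doc_b = make_sample_docx(["00D4E5F6", "00112233"], "Bob")
    print(extract_core_metadata(doc_a))
    print(sorted(shared_rsids(doc_a, doc_b)))
```

A non-empty intersection is only an indication of a possible relationship between two documents. As the false positive rate reported above shows, unrelated documents can occasionally share identifiers, so a match would still need to be corroborated by other evidence.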