Approximate String Matching and Filesystem Metadata Carving: A Study of Improving Precision and Recall for Assisting the Digital Forensics Backlog
Abstract
The technical aspects of digital forensics are often dependent upon the progress made in other scientific fields. One such area of study whose results are often applied to digital forensics is string matching theory. String matching is found in most forensic tools, and even plays a significant role in file carving. Despite being ubiquitous within digital forensics, the study of string matching and its theory within the context of digital forensics is relatively rare. The goal of this dissertation was to develop and implement string matching algorithms that attempt to produce more optimal precision and recall results for approximate string matching and filesystem metadata record carving. Benefits of improving recall include providing the investigator with more of the relevant contents of an examination, and benefits of improving precision will permit the investigator to view less false positive hits. By developing algorithms which improve or balance precision and recall, we can increase the efficiency of digital forensics examinations, and thereby assist in reducing the digital forensics backlog.
To improve approximate string matching, we developed a novel method of implementing the constrained edit distance (CED) metric into a nondeterministic finite automaton (NFA), thus allowing for a novel approximate string matching algorithm that has significant control over its approximation parameters. The traditional NFA for approximate string matching will only match keywords up to some k number of errors, but our novel NFA allows for approximate matching for every possible combination of edit operation errors (character insertions, deletions, and substitutions) up to some edit distance threshold k = 2, Elasticsearch’s maximum allowed distance. This provides more approximation parameters available to the user, where there is the possibility that some previously unreachable parameters can be applied to improve the precision of the search results without significantly and negatively affecting the recall of the results, or vice versa. For example, in applying the algorithmic implementation of our novel NFA on a list of terms derived from an inverted index of the Enron email dataset, we obtained new precision-recall results that maximized recall without decimating the precision.
Carving is also prone to frustrating precision-recall trade-offs. Usually, carving is associated with file carving, but we specifically carve for filesystem metadata records. This is useful in scenarios in which a filesystem is somehow made unusable, by either physical or logical damage to a volume. Furthermore, finding a record may allow for recovery of the file content whether or not a file is fragmented, and can may also allow for the recovery of files without any signature at all. Research into filesystem metadata record carving is niche, and only a few scientific studies have evaluated such methods. Of which, only the Ext4 and NTFS filesystems have been deeply studied, and while the recall of recovering inodes may potentially be high for these specific filesystems, it would appear that the precision is low.
Our novel filesystem metadata record carving algorithm searches for dynamic signatures within records, where the signatures are byte patterns that are characteristic of filesystem timestamps in metadata records for most filesystems. Timestamps within records are typically stored in a closely co-located manner, and oftentimes two or more of the timestamps exactly match. This characteristic can be used as a starting point for verifying if the surrounding data is indeed a metadata record. After carving for potential timestamps we run another script for the purpose of metadata record verification. We refer to the process of timestamp carving followed by filesystem specific parser verification as Generic Metadata Time Carving (GMTC), and the intention is that the method should be applicable to most filesystems even if the filesystem metadata records contain no static signature.
Results from our experiments have shown that this method of filesystem metadata carving has high precision and recall when searching for records with two or more equal timestamps for at least carving Master File Table (MFT) records from NTFS and inodes from Ext4. In fact, our tools successfully carved records from Ext4 and NTFS that commercial tools could not. We later developed a novel approximate matching potential timestamp carver that matches timestamps by their byte prefixes. This significantly increased the recall of our filesystem metadata carving methodology as stringologically similar timestamps should also be temporally similar. Surprisingly, results from these experiments show that we had 100% precision for all of our tests on realistic datasets due to the strictness of our verification parsers. By setting lower values of the timestamp prefix length, recall was significantly improved while also producing many more false positive potential timestamp locations and thus longer runtimes. When applying prefix- based potential timestamp carving on a 476 GB disk image, the total time taken to conduct the GMTC method was between two and half, and three hours.
Has parts
Paper 1: Porter, Kyle; Petrovic, Slobodan. On Application of Constrained Edit Distance Algorithms to Cryptanalysis and Digital Forensics. Norsk Informasjonssikkerhetskonferanse (NISK) 2017 ;Volum 11. s. 112-123Paper 2: Porter, Kyle; Petrovic, Slobodan. Obtaining precision-recall trade-offs in fuzzy searches of large email corpora. IFIP Advances in Information and Communication Technology 2018 ;Volum 532. s. 67-85 https://doi.org/10.1007/978-3-319-99277-8_5
Paper 3: Nordvik, Rune; Porter, Kyle; Toolan, Fergus; Axelsson, Stefan; Franke, Katrin. Generic Metadata Time Carving. Forensic Science International: Digital Investigation 2020 ;Volum 33. Suppl. July https://doi.org/10.1016/j.fsidi.2020.301005 This is an open access article under the CC BY-NC-ND license (http:// creativecommons.org/licenses/by-nc-nd/4.0/).
Paper 4: Porter, Kyle; Nordvik, Rune; Toolan, Fergus; Axelsson, Stefan. Timestamp prefix carving for filesystem metadata extraction. Forensic Science International: Digital Investigation 2021 ;Volum 38. s. 1-13 https://doi.org/10.1016/j.fsidi.2021.301266 This is an open access article under the CC BY license