mvHash: a new approach for fuzzy hashing
MetadataShow full item record
Effective investigation of crime is important to reduce crime in any society. During criminal investigation, the investigators often face digital evidences. As the amount of digital data seized in a case may be enormous, it’s important that the investigators have effective and efficient methods to reduce the number of files subject to manual inspection. The number of files is reduced by identifying known files. It is important to identify files which are identical or similar to known files. One way to automatic identify similar files is by using fuzzy hashing. The leading fuzzy hashing algorithm by now, sdhash, has several weaknesses as it is slow and generates long hashes. In this thesis, a new and really efficient fuzzy hashing algorithm named fuzzy hashing using majority vote (mvHash) is analysed, developed, implemented and tested. MvHash is based on a simple principle, for each byte in the input and its corresponding neighbourhood, it is calculated whether there is a plural of 0s and 1s. Based on this calculation, a new byte-string is created. The new byte string is then shortened by using run-length encoding (RLE). It is developed two versions of mvHash where the most successful one is named fuzzy hashing using majority vote and Bloom filters (mvHash-B). We used several tests, to ensure that mvHash is able to detect similar files and to evaluate the performance. Using code optimization techniques, it has been possible to make the prototype really fast, it generates its hashes almost as fast as SHA-1. This is way much faster than sdhash. Also noteworthy, the hash size of mvHash-B is significantly shorter than sdhash. MvHash-B, as sdhash, uses Bloom filter to store its hashes, making the comparison really fast.