mvHash: a new approach for fuzzy hashing

2012

Effective investigation of crime is important to reduce crime in any society. During criminal

investigation, the investigators often face digital evidences. As the amount of digital

data seized in a case may be enormous, it’s important that the investigators have effective

and efficient methods to reduce the number of files subject to manual inspection.

The number of files is reduced by identifying known files. It is important to identify files

which are identical or similar to known files. One way to automatic identify similar files is

by using fuzzy hashing. The leading fuzzy hashing algorithm by now, sdhash, has several

weaknesses as it is slow and generates long hashes.

In this thesis, a new and really efficient fuzzy hashing algorithm named fuzzy hashing

using majority vote (mvHash) is analysed, developed, implemented and tested. MvHash

is based on a simple principle, for each byte in the input and its corresponding neighbourhood,

it is calculated whether there is a plural of 0s and 1s. Based on this calculation,

a new byte-string is created. The new byte string is then shortened by using run-length

encoding (RLE). It is developed two versions of mvHash where the most successful one is

named fuzzy hashing using majority vote and Bloom filters (mvHash-B). We used several

tests, to ensure that mvHash is able to detect similar files and to evaluate the performance.

Using code optimization techniques, it has been possible to make the prototype

really fast, it generates its hashes almost as fast as SHA-1. This is way much faster than

sdhash. Also noteworthy, the hash size of mvHash-B is significantly shorter than sdhash.

MvHash-B, as sdhash, uses Bloom filter to store its hashes, making the comparison really

fast.