mvHash: a new approach for fuzzy hashing
Master thesis
Permanent lenke
http://hdl.handle.net/11250/144008Utgivelsesdato
2012Metadata
Vis full innførselSamlinger
Sammendrag
Effective investigation of crime is important to reduce crime in any society. During criminal
investigation, the investigators often face digital evidences. As the amount of digital
data seized in a case may be enormous, it’s important that the investigators have effective
and efficient methods to reduce the number of files subject to manual inspection.
The number of files is reduced by identifying known files. It is important to identify files
which are identical or similar to known files. One way to automatic identify similar files is
by using fuzzy hashing. The leading fuzzy hashing algorithm by now, sdhash, has several
weaknesses as it is slow and generates long hashes.
In this thesis, a new and really efficient fuzzy hashing algorithm named fuzzy hashing
using majority vote (mvHash) is analysed, developed, implemented and tested. MvHash
is based on a simple principle, for each byte in the input and its corresponding neighbourhood,
it is calculated whether there is a plural of 0s and 1s. Based on this calculation,
a new byte-string is created. The new byte string is then shortened by using run-length
encoding (RLE). It is developed two versions of mvHash where the most successful one is
named fuzzy hashing using majority vote and Bloom filters (mvHash-B). We used several
tests, to ensure that mvHash is able to detect similar files and to evaluate the performance.
Using code optimization techniques, it has been possible to make the prototype
really fast, it generates its hashes almost as fast as SHA-1. This is way much faster than
sdhash. Also noteworthy, the hash size of mvHash-B is significantly shorter than sdhash.
MvHash-B, as sdhash, uses Bloom filter to store its hashes, making the comparison really
fast.