dc.contributor.author: Rekdal, John Erik
dc.date.accessioned: 2014-07-28T09:14:54Z
dc.date.available: 2014-07-28T09:14:54Z
dc.date.issued: 2014
dc.identifier.uri: http://hdl.handle.net/11250/198567
dc.description.abstract: In forensics, investigations comprise diverse types of evidence, for example digital evidence in the form of electronic documents and physical evidence such as printed paper documents. One major challenge is to link digital and physical evidence efficiently and accurately. In particular, a computational method is needed to deal with the huge amount of data available in a forensic investigation and to reduce the time spent on linking and analyzing the different types of evidence. The thesis aims to improve the efficiency and effectiveness of this process by using computational methods such as plain text search (string search), approximate string matching and OCR (optical character recognition), and to incorporate these in a proof-of-concept tool. The tool is used in an experimental setup for testing linking accuracy between similar and dissimilar documents. A dataset was created and used for testing, based on feedback from Økokrim. The thesis seeks to answer how OCR affects evidence linking, what characterizes a forensics dataset, which characteristics enable linking, and how efficiency in evidence linking can be increased. The proof-of-concept tool contains five comparison methods: four text comparison methods (Levenshtein distance, word frequency, cosine similarity and W-shingles) and one image-to-image comparison, a pixel-to-pixel similarity. It uses optical character recognition to generate text from scanned documents. Text extraction from digital documents is done through Java libraries. The results show that W-shingles is the best-performing algorithm for matching documents in this setting, and that text sanitation has no practical influence on W-shingles, whereas it does increase the matching accuracy of the remaining methods. The characteristics that enable evidence linking were found to be shingles and the frequency of unique words used in cosine similarity.
Characteristics of a dataset consisting of DOCX documents are bold font style and the combination of font size 11 and font type Calibri, which is the default combination for Microsoft Word 2007, 2010 and 2013. The efficiency and accuracy of OCR can be increased by using ensemble voting and decreasing runtime. The OCR error rate is a non-issue in a forensics environment, since OCR is not used to recreate evidence but to match and locate evidence. [nb_NO]
dc.language.iso: eng [nb_NO]
dc.subject: VDP::Mathematics and natural science: 400::Information and communication science: 420::Security and vulnerability: 424 [nb_NO]
dc.subject: cross-comparison, forensic investigation, digital evidence, digitized physical evidence [nb_NO]
dc.title: Cross-comparison of Digital and Digitized Physical Evidence [nb_NO]
dc.type: Master thesis [nb_NO]
dc.source.pagenumber: 122 [nb_NO]
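
The abstract names several text comparison methods without defining them. The following is a minimal sketch of three of them (W-shingle resemblance, cosine similarity over word frequencies, and Levenshtein distance), written in Python for brevity; it is not the thesis tool, which according to the abstract relies on Java libraries. The function names, the shingle width `w=3`, the use of Jaccard resemblance to compare shingle sets, and the lowercase word tokenizer are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokens(text):
    """Lowercase the text and split it into word tokens (an assumed, simple sanitation step)."""
    return re.findall(r"\w+", text.lower())


def shingles(text, w=3):
    """Return the set of contiguous w-word sequences (w-shingles) in the text."""
    words = tokens(text)
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}


def shingle_resemblance(a, b, w=3):
    """Jaccard resemblance of the two shingle sets: shared shingles / all distinct shingles."""
    sa, sb = shingles(a, w), shingles(b, w)
    if not sa and not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def cosine_similarity(a, b):
    """Cosine of the angle between the word-frequency vectors of the two texts."""
    fa, fb = Counter(tokens(a)), Counter(tokens(b))
    dot = sum(fa[t] * fb[t] for t in fa.keys() & fb.keys())
    norm_a = math.sqrt(sum(v * v for v in fa.values()))
    norm_b = math.sqrt(sum(v * v for v in fb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute (free if equal)
        prev = cur
    return prev[-1]


if __name__ == "__main__":
    # Hypothetical pair: a digital text and an OCR rendering with one misread character.
    digital = "the quick brown fox jumps over the lazy dog"
    scanned = "the quick brown f0x jumps over the lazy dog"
    print(shingle_resemblance(digital, scanned))
    print(cosine_similarity(digital, scanned))
    print(levenshtein(digital, scanned))
```

In this sketch a single misread word changes every shingle that contains it, so the shingle width trades sensitivity to word order against tolerance of OCR errors; the width used in the thesis is not stated in this record.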

