The study of keyword search in open source search engines and digital forensics tools with respect to the needs of cyber crime investigations
Abstract
This master's thesis consists of a theoretical and a practical part. The theoretical part covers three main areas of study: 1) an exploration of experimental search methodologies used in a digital forensics setting; 2) an analysis of the differences in documented search capabilities between a set of open source search engines and open source forensics tools capable of keyword search; and 3) an identification and summary of publicly available datasets related to digital forensics.

For the first area of exploration, no surveys published in the period 2014-2017 could be found. This exploration therefore addresses a gap in current knowledge.

The second exploration provides an in-depth and up-to-date analysis of differences in search capabilities, not found anywhere else. This analysis is useful for forensic examiners and researchers who want to know which application is most suitable for their problem domain.

The third exploration extends previous lists of its kind and adds many previously unlisted forensics-related datasets. This list is, to the best of my knowledge, the largest collection of publicly available forensics-related datasets published in any paper. It will be useful for researchers in many subfields of information security who are looking for a dataset to use in their research. Using publicly available datasets will also make their experiments more reproducible.

Some of the datasets are also used in the practical part of the thesis. The practical part is a benchmark experiment in which the open source search engines are tested on indexing performance, search performance, and memory usage during searching. Elasticsearch was generally better than Solr in terms of index creation time, index size, and response time for the first run of search terms. Solr outperformed Elasticsearch on the second run of search terms.
The difference between the search engines with regard to memory performance during searching was negligible.

The experiment has two main limitations. First, the experiments were performed on only one virtual host machine. This environment does not allow testing how well the search engines perform at distributed search. Second, only the default configurations (the out-of-the-box setup) of Solr and Elasticsearch were tested. If more configurations had been tested, variables such as sharding and segment count could have been controlled. Up-to-date experiments using the same testing methodology could not be found. The experiments provide information that is useful for forensic examiners when deciding which search engine is best suited to their forensics tasks.