dc.description.abstract | To identify malware, most antivirus scanners use a combination of signature matching and heuristic-based detection. Signature matching compares a file only against previously known files, while heuristic-based detection looks for known system artifacts and previously known bad code patterns in a file. The obvious problem is that only known and partially known samples will be recognized. In this thesis we use reverse engineering to extract the assembly instructions from a given executable file. We chose to use only the opcodes, which are the part of the instruction that specifies the operation to be performed, for example mov. Statistical analysis of the datasets revealed a significant difference between the opcodes in malware and those in benign files. Based on this finding, supervised and unsupervised machine learning approaches, namely artificial neural networks, support vector machines, Bayesian networks, random forests, k-nearest neighbours, and self-organizing maps, were used to examine the sequences of these instructions. Unknown files were classified as either malware or benign depending on the presence and number of occurrences of different sequences. We show that by using only opcodes, without operands (the rest of the instruction), malware can be distinguished from benign files. With sequences of up to four opcodes, a classification accuracy of 95.58% was achieved. Our work contributes to the research field by showing that malware obfuscated through the use of packers is also detected by this method. By using different classifiers and longer sequences than previous work, we also provide empirical evidence that the n-gram length has little influence on performance. We used sequences of up to four opcodes, whereas previous work focused on sequences of only one and two opcodes. | nb_NO |
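The opcode-sequence features described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the function name `opcode_ngrams`, the `max_n` parameter, and the sample opcode stream are all hypothetical, and it assumes the opcodes have already been extracted from a disassembled executable with operands removed. It counts every contiguous opcode sequence of length one up to four, which is the kind of presence-and-occurrence count a classifier could take as input.

```python
from collections import Counter
from typing import List


def opcode_ngrams(opcodes: List[str], max_n: int = 4) -> Counter:
    """Count every contiguous opcode sequence of length 1..max_n.

    Hypothetical helper: `opcodes` is assumed to be the ordered list of
    opcodes from one disassembled file, with operands already dropped.
    """
    counts: Counter = Counter()
    for n in range(1, max_n + 1):
        # Slide a window of width n over the opcode stream.
        for i in range(len(opcodes) - n + 1):
            counts[tuple(opcodes[i:i + n])] += 1
    return counts


# Made-up opcode stream, for illustration only.
sample = ["push", "mov", "sub", "mov", "call", "mov", "pop", "ret"]
features = opcode_ngrams(sample, max_n=4)
```

Each key of `features` is a tuple of opcodes and each value is its number of occurrences in the file. Turning these counts into a fixed-length vector over a shared vocabulary of sequences would be a separate step before training any of the classifiers named above.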