Malware detection through opcode sequence analysis using machine learning

Bragen, Simen Rune

dc.contributor.advisor	Franke, Katrin
dc.contributor.author	Bragen, Simen Rune
dc.date.accessioned	2018-08-10T08:09:28Z
dc.date.available	2018-08-10T08:09:28Z
dc.date.issued	2015
dc.identifier.uri	http://hdl.handle.net/11250/2515371
dc.description.abstract	To identify malware, most antivirus scanners combine signature matching with heuristic-based detection. Signature matching compares a file only against previously known files, while heuristic-based detection looks for known system artifacts and previously known bad code patterns in a file. The obvious problem is that only known and partially known samples are recognized. In this thesis we use reverse engineering to extract the assembly instructions from a given executable file. We chose to use only the opcodes, the part of the instruction that specifies the operation to be performed, for example mov. Statistical analysis of the datasets revealed a significant difference between the opcodes in malware and in benign files. We therefore used supervised and unsupervised machine learning approaches (artificial neural network, support vector machine, Bayes net, random forest, k-nearest neighbours, and self-organizing map) to examine the sequences of these instructions. Unknown files were classified as either malware or benign depending on the presence and number of occurrences of different sequences. We show that by using only opcodes without operands (the rest of the instruction), malware can be distinguished from benign files. With a sequence length of up to four opcodes, a classification accuracy of 95.58% was achieved. Our work contributes to the research field by showing that malware obfuscated by packers is also detected by this method. By using different classifiers and longer sequences than previous work, we also provide empirical evidence that the n-gram length has little influence on performance. We used a sequence length of four, whereas previous work focused on sequences of only one and two opcodes.	nb_NO
dc.description.abstract	Antivirus programs usually recognize malware in two ways: by comparing a file's signature with other known signatures, or by recognizing malicious code in the file. The problem with these methods is that only known and partially known files are detected. In this thesis we use reverse engineering to extract the assembly instructions of a given file. We look at the opcodes, the part of the instruction that says what kind of operation is to be performed; one example is mov. Through analysis of the data we found a significant difference between the opcodes used by malware and those used by ordinary programs. We then used several different machine learning algorithms to learn from the sequences of these opcodes. New files were classified as either malware or benign programs. By using sequences of up to 4 in length, we show that an accuracy of 95.58% can be achieved. We extend previous work in the same field by showing that obfuscated malware can also be recognized in this way, and we use longer sequences than has been done previously.	nb_NO
dc.language.iso	eng	nb_NO
dc.subject	information security	nb_NO
dc.subject	malware detection	nb_NO
dc.title	Malware detection through opcode sequence analysis using machine learning	nb_NO
dc.type	Master thesis	nb_NO
dc.subject.nsi	VDP::Mathematics and natural science: 400::Information and communication science: 420::Security and vulnerability: 424	nb_NO
dc.source.pagenumber	67	nb_NO
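
The abstract describes classifying files by the presence and number of occurrences of opcode sequences up to four opcodes long. A minimal sketch of that counting step is given below, assuming the opcodes have already been extracted by a disassembler and the operands dropped. The function name opcode_ngrams and the sample opcode listing are hypothetical illustrations and are not taken from the thesis.

```python
from collections import Counter


def opcode_ngrams(opcodes, max_n=4):
    """Count every contiguous opcode sequence of length 1..max_n.

    Hypothetical helper: the thesis does not publish this function.
    """
    counts = Counter()
    for n in range(1, max_n + 1):
        # Slide a window of width n over the opcode list.
        for i in range(len(opcodes) - n + 1):
            counts[tuple(opcodes[i:i + n])] += 1
    return counts


# Made-up opcode listing for illustration only (operands already removed).
sample = ["push", "mov", "sub", "mov", "call", "mov", "pop", "ret"]
features = opcode_ngrams(sample)
```

The resulting counts could serve as a feature vector for any of the classifiers named in the abstract; how the thesis itself selected or weighted these features is not stated in this record.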

Files in this item

Name:: SimenBragen.pdf
Size:: 1.017Mb
Format:: PDF
Description:: Main article

This item appears in the following Collection(s)

Institutt for informasjonssikkerhet og kommunikasjonsteknologi [2590]
