Feature Extraction and Static Analysis for Large-Scale Detection of Malware Types and Families
There are several methods for identifying malware. The most widespread, found in almost every antivirus solution on the market today, is the signature-based approach. This approach uses a one-way cryptographic function to generate a unique hash of each file, and each hash is then checked against a database of hashes of known malware. The method produces close to no false positives, but it can only detect previously known malware and will therefore in many cases produce false negatives. Malware authors exploit this weakness by changing a small part of the malicious code, which changes the entire hash of the file and leaves the malicious code undetected until the sample is discovered, analyzed, and added to the vendor's database(s). Given how easily malware authors can evade signature matching, it is clear that other ways of identifying malware are needed. The two other main approaches are static analysis and behavior-based (dynamic) analysis. Previous research using such analysis has primarily focused on detecting whether a file is malicious or benign (binary classification), and there has been comprehensive work in these fields over the last few years. In the work we are proposing, we will leverage results from static analysis, using machine learning methods, to classify malicious Windows executables not just as benign or malicious, as in many studies, but by malware family affiliation. To do this we will use a database of about 330,000 malicious executables. A challenge in this work will be the naming of samples and families, since different antivirus vendors label samples with different names and follow no standard naming scheme. This is exemplified by the VirusTotal online scanner, which checks a hash against 57 malware databases.
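The weakness of the signature-based approach described above can be illustrated with a minimal sketch. It assumes SHA-256 as the one-way function and an in-memory set as the signature database. Both are illustrative choices, not details taken from the thesis, and the sample bytes and the names `sha256_hex`, `signature_db`, and `is_known_malware` are hypothetical.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of the given bytes as a hex string."""
    return hashlib.sha256(data).hexdigest()


# Hypothetical stand-in for the bytes of a known malicious file.
known_sample = b"MZ\x90\x00" + b"hypothetical malicious payload"

# Hypothetical signature database: a set of hashes of known malware.
signature_db = {sha256_hex(known_sample)}


def is_known_malware(data: bytes) -> bool:
    """Signature check: flag the file only if its exact hash is in the database."""
    return sha256_hex(data) in signature_db


# A variant that differs from the known sample by a single appended byte.
variant = known_sample + b"\x00"

print(is_known_malware(known_sample))  # the original sample is detected
print(is_known_malware(variant))       # the one-byte variant is missed
```

Because the lookup requires an exact hash match, any change to the file, however small, yields a digest that is absent from the database. This is the false negative that motivates static and dynamic analysis.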
For the static analysis we will use the VirusTotal scanner as well as PEframe, an open source tool for analyzing portable executables. The work performed in the thesis presents a novel approach to extracting and constructing features that can be used to estimate which type and family a malicious file belongs to, which can be useful for analysis and antivirus scanners. The contribution is novel because multinomial classification is applied to distinguish between different types and families.
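To make the difference between binary and multinomial (multi-class) classification concrete, the following is a minimal sketch using a nearest-centroid classifier. This is a stand-in technique chosen for brevity, not the method used in the thesis. The two-dimensional feature vectors and the family names are invented toy values and do not represent features extracted by PEframe or VirusTotal.

```python
from collections import defaultdict
from math import dist


def fit_centroids(samples):
    """Compute one mean feature vector (centroid) per family label.

    samples: list of (feature_vector, family_label) pairs.
    """
    grouped = defaultdict(list)
    for features, label in samples:
        grouped[label].append(features)
    return {
        label: tuple(sum(column) / len(rows) for column in zip(*rows))
        for label, rows in grouped.items()
    }


def predict(centroids, features):
    """Return the family whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda label: dist(centroids[label], features))


# Toy training data: made-up feature vectors with hypothetical family labels.
training = [
    ((1.0, 0.0), "family_a"),
    ((0.9, 0.1), "family_a"),
    ((0.0, 1.0), "family_b"),
    ((0.1, 0.9), "family_b"),
    ((5.0, 5.0), "family_c"),
    ((5.2, 4.8), "family_c"),
]

centroids = fit_centroids(training)
print(predict(centroids, (4.9, 5.1)))
```

Unlike a binary detector, the classifier returns one of several family labels, so the same feature vector yields information about which family a sample most resembles rather than only whether it is malicious.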