dsp-uga/team-hyperbola-P1
Malware Classification

Team-Hyperbola

Team Hyperbola built models to classify malware for the Microsoft Malware Classification Challenge. The project was completed over the course of three weeks for the CSCI 8360 Data Science Practicum at the University of Georgia during Spring 2019.

Introduction to Problem:

Each malware instance belongs to one of 9 malware classes:

  • Ramnit
  • Lollipop
  • Kelihos_ver3
  • Vundo
  • Simda
  • Tracur
  • Kelihos_ver1
  • Obfuscator.ACY
  • Gatak

The training set consists of 8421 malware instances. Features are extracted from the byte files (hexadecimal data), the asm files (assembly-language listings), or both. The crux of the problem is to build models that predict the malware class for about 2700 instances of test data.

Approach to problem:

The project uses Random Forest, Naive Bayes, and Logistic Regression models. Attempts were also made to build a Convolutional Neural Network (CNN) and a custom Naive Bayes model, both of which were successful on small parts of the data. The steps followed in this project are:

  1. Data Preprocessing: Features are extracted from the byte files. The line id is removed from each line of byte code, and the remaining data is converted to lowercase after the label is attached to the corresponding byte file. The words '??', '00', and 'CC' are dropped from the dataset because they are the most frequently repeated words across the documents (malware instances).
  2. Models Used: Logistic Regression, Naive Bayes, Random Forest, custom Naive Bayes, and CNN models are implemented. The latter two are successful on the small dataset and are still in development for the large dataset. Logistic Regression gave the best accuracy, 94.96%, on the big dataset.
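The preprocessing step above can be sketched in plain Python. This is only an illustration, not the project's actual code: the function names `clean_byte_line` and `clean_byte_file` are hypothetical, and the sketch assumes each line of a byte file starts with a line id followed by whitespace-separated hexadecimal words.

```python
from typing import List, Tuple

# Words the README says are dropped. They are stored in lowercase here
# because the sketch lowercases every word before comparing (assumption).
STOP_WORDS = {"??", "00", "cc"}


def clean_byte_line(line: str) -> List[str]:
    """Drop the leading line id, lowercase the words, remove stop words."""
    tokens = line.split()
    body = tokens[1:]  # assumes the first token is the line id
    lowered = (token.lower() for token in body)
    return [token for token in lowered if token not in STOP_WORDS]


def clean_byte_file(text: str, label: int) -> Tuple[int, List[str]]:
    """Attach the class label to the cleaned words of one byte file."""
    words: List[str] = []
    for line in text.splitlines():
        words.extend(clean_byte_line(line))
    return label, words
```

For example, `clean_byte_line("00401000 56 8D 44 24 00 CC ??")` keeps only `56`, `8d`, `44`, and `24`. In a PySpark job, the same per-line logic would typically be applied inside a map over the input files.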

Platform:

The Logistic Regression, Naive Bayes, and Random Forest models are built using PySpark on a GCP cluster with the following specifications:

  • 1 master node with 4 CPUs and 15 GB memory
  • 4 worker nodes with 16 CPUs and 104 GB memory

Each model can be run on the GCP cluster with `spark-submit [model].py -arguments`. The available arguments are `-d=big/small`, which selects the dataset on which the model is built, and `-h` for help.
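As an illustration of how a model script might accept these arguments, here is a minimal sketch using Python's standard `argparse` module. It is not taken from the project: the long option name `--dataset`, the default value `small`, and the helper names `build_parser` and `parse_args` are assumptions made for the example.

```python
import argparse
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the -d flag; argparse supplies -h automatically."""
    parser = argparse.ArgumentParser(
        description="Train a malware classifier (illustrative sketch)."
    )
    parser.add_argument(
        "-d",
        "--dataset",  # hypothetical long form, not named in the README
        choices=["small", "big"],
        default="small",  # assumed default
        help="dataset on which the model is built",
    )
    return parser


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse argv, or sys.argv[1:] when argv is None."""
    return build_parser().parse_args(argv)
```

With this parser, both `-d=big` and `-d big` set `dataset` to `big`, and any value other than `small` or `big` is rejected with a usage message.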

Future Scope:

  • Extract features from the `.asm` files and integrate them with the features from the `.byte` files.
  • Extend the custom Naive Bayes and CNN models to the large dataset and compare their performance with that of the classifiers already attempted.

Credits:

See the Contributors for more information.

License:

This project is licensed under the MIT License. See the LICENSE file for details.
