Machine learning techniques for Android malware detection and classification

  1. Martín García, Alejandro
Supervised by:
  1. David Camacho Fernández Director
  2. Raúl Lara Cabrera Director

Defence university: Universidad Autónoma de Madrid

Defence date: 15 March 2019

Committee:
  1. Sancho Salcedo Sanz Chair
  2. Antonio González Pardo Secretary
  3. Amparo Alonso Betanzos Committee member
  4. Julio César Hernández Castro Committee member
  5. Constantinos Patsakis Committee member

Type: Thesis

Abstract

Android has been deliberately chosen as the main target by many malware creators designing new malicious applications. Every day, thousands of new malware samples try to circumvent the security measures implemented by Android application stores, aiming to infect new devices. To tackle this problem, it is necessary to research and develop mechanisms that can automatically classify large numbers of suspicious samples and detect those that contain a malicious payload. This thesis studies and addresses the application of machine learning techniques to the construction of Android malware detection mechanisms, taking different perspectives into account. The classification of Android malware into families is also addressed. A preliminary in-depth study of the Jisut family of Android malware has revealed some of the most important practices employed, which must be considered when facing these two tasks.

First, machine learning techniques are applied as the core element of Android malware detection methods that aim to decide accurately whether an application is malware or benignware. For that purpose, the behaviour of each application is described through groups of static and dynamic features, which are modelled using a Markov chain-based representation. Ensemble classifiers are then applied, showing that static features provide better results than dynamically extracted features. An approach that fuses both categories of features is also proposed, showing improved performance in comparison to models relying on a single set of features.

Second, this dissertation tackles the classification of malicious Android applications into malware families, an essential task that seeks to minimise the damage caused and to properly identify groups of malware. Deep learning architectures, classic machine learning algorithms, and different techniques for dealing with imbalanced data are tested in this case. The results show that these techniques make it possible to develop accurate family classification methods. The resilience of these methods against adversarial attacks is also analysed. A targeted attack against a state-of-the-art classifier is proposed, showing that it is possible to force the classifier to assign samples to a fictitious, random, and new malware family, or even to a previously selected destination family.

Finally, an open source framework called AndroPyTool is presented. It integrates different state-of-the-art malware analysis tools with the main goal of providing the research community with an integrated tool for extracting a wide set of static and dynamic features. Using this tool, the OmniDroid dataset is built and publicly released, containing both static and dynamic features extracted from benign and malicious Android applications.
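
As an illustration of the Markov chain-based representation mentioned in the abstract, the following minimal sketch turns a sequence of observed events into a row-normalised transition matrix and flattens it into a feature vector. It is not the thesis implementation: the function names, the state names, and the example trace are all hypothetical, and a first-order chain is assumed.

```python
def transition_matrix(sequence, states):
    """Estimate first-order transition probabilities between states.

    `sequence` is an ordered list of observed events (hypothetical here,
    e.g. calls seen while analysing an application) and `states` fixes
    the row/column order of the matrix.
    """
    index = {state: i for i, state in enumerate(states)}
    n = len(states)
    counts = [[0] * n for _ in range(n)]
    # Count every consecutive pair (source event -> next event).
    for src, dst in zip(sequence, sequence[1:]):
        counts[index[src]][index[dst]] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        # Rows with no outgoing transitions are left as all zeros.
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix


def to_feature_vector(matrix):
    """Flatten the matrix row by row so a classifier can consume it."""
    return [p for row in matrix for p in row]


# Hypothetical state set and trace, for illustration only.
states = ["open", "read", "send"]
trace = ["open", "read", "read", "send", "open", "send"]
matrix = transition_matrix(trace, states)
features = to_feature_vector(matrix)
```

A vector built this way has a fixed length (the number of states squared) regardless of how long the trace is, which is one reason such a representation is convenient as input to standard classifiers.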