SLIFER: Investigating Performance and Robustness of Malware Detection Pipelines (2024)

Andrea Ponte (andrea.ponte@edu.unige.it), University of Genoa, Genoa, Italy; Dmitrijs Trizna (trizna.dmitrijs@microsoft.com), Microsoft and University of Genoa, Prague, Czech Republic; Luca Demetrio (luca.demetrio@unige.it), University of Genoa, Genoa, Italy; Fabio Roli (fabio.roli@unige.it), University of Genoa, Genoa, Italy; and Ivan Tesfai Ogbu (ivan.tesfai@rina.org), Rina Consulting S.p.A., Via Antonio Cecchi 6, Genoa, Italy


Abstract.

As a result of decades of research, Windows malware detection is approached through a plethora of techniques. However, there is an ongoing mismatch between academia – which pursues optimal performance in terms of detection rate and low false alarms – and the requirements of real-world scenarios. In particular, academia focuses on combining static and dynamic analysis within a single or an ensemble of models, falling into several pitfalls, such as (i) firing dynamic analysis without considering the computational burden it requires; (ii) discarding impossible-to-analyse samples; and (iii) analysing robustness against adversarial attacks without considering that malware detectors are complemented with further non-machine-learning components. Thus, in this paper we propose SLIFER, a novel Windows malware detection pipeline that sequentially leverages both static and dynamic analysis, interrupting computation as soon as one module triggers an alarm and requiring dynamic analysis only when needed. Contrary to the state of the art, we investigate how to deal with samples that resist analysis, showing how much they impact performance and concluding that it is better to flag them as legitimate so as not to drastically increase false alarms. Lastly, we perform a robustness evaluation of SLIFER leveraging content-injection attacks, and we show that, counter-intuitively, attacks are blocked more by YARA rules than by dynamic analysis due to byte artifacts created while optimizing the adversarial strategy.

Malware analysis, machine learning, pipeline, robustness

copyright: acmlicensed; journal year: 2024; doi: XXXXXXX.XXXXXXX; conference: 2024 ACM SIGSAC Conference on Computer and Communications Security (CCS '24), October 14–18, 2024, Salt Lake City, U.S.A.; isbn: 978-1-4503-XXXX-X/18/06

1. Introduction

Due to the rapid evolution of threats and the skills of malware developers, the detection of Windows malware is an ongoing challenge that has kept reinventing itself for more than two decades. To better understand the scale of this never-ending arms race, every week ∼7M Windows malicious programs are uploaded to cloud-based antivirus engines to be analysed (https://virustotal.com/statistics). To exacerbate the issue, malware developers create several variants of the same malicious program to avoid detection by signature-based antivirus engines, which spot threats through unique indicators observed in the past and matched in analysed samples. Thus, modern detection engines fully embrace the machine learning (ML) paradigm, directly learning what makes a program malicious from data, while also being able to generalize across variants.

While we can only glimpse the architectures developed by industrial companies through minimally-detailed white papers (Kaspersky, 2021; Saxe and Berlin, 2015; TECHNOLOGY, [n. d.]; Avira, 2017; Microsoft, [n. d.]), academic research mostly focuses on creating models with the best trade-off between detections and false alarms (Raff et al., 2018; Trizna et al., 2023; Anderson and Roth, 2018; Jindal et al., 2019; Raff et al., 2020; Gibert et al., 2020), with a strong focus on the latter since false alarms are extremely costly to handle (Hershberger, 2023; Kubovič, 2017). To do so, state-of-the-art techniques rely on single or ensemble models that separately or jointly leverage static and dynamic analysis. While the former infers maliciousness from both the structure and the content of input samples, the latter requires programs to be executed or emulated inside an isolated environment.

However, dynamic analysis is costly, requiring a clean environment for each analysis where samples can be "detonated" to manifest their behavior. Every ML architecture that uses this kind of analysis must face this mandatory cost, and plenty of proposed approaches consider features extracted with dynamic analysis fused or stacked together (Dambra et al., 2023; Trizna, 2022) with the static ones. This not only increases the complexity of the feature extraction phase, making it more costly and time-consuming, but the overall predictive capability might not even benefit from this addition: Dambra et al. (Dambra et al., 2023) highlight a worrisome trend where dynamic analysis is neither better than nor complementary to the static one, diminishing the belief of execution tracing as the Swiss-army knife against malware. On the other hand, most of the proposed techniques hide the crashes encountered by static analysis while extracting features, resolving the issue by discarding those impossible-to-analyse samples. While this is not a problem at the research stage, such behavior is not admissible in production environments, where an answer must always be given to the users who requested an analysis.

To further exacerbate the problems of static analysis that can be encountered in production, ML Windows malware detectors have been shown to be vulnerable to adversarial EXEmples (Demetrio et al., 2021a, b; Anderson et al., 2017; Lucas et al., 2021), carefully-crafted programs tailored to evade detection. These are constructed by manipulating the structure of samples, by either adding new content or replacing existing content, thus interfering with the patterns learned at training time. With almost no implementations available for attacks that target dynamic classifiers (Rosenberg et al., 2018), robustness against EXEmples is only computed against static detectors, thanks to reproducible open-source software (Demetrio and Biggio, 2021). However, while these attacks have been tested against specific targets, they have not been evaluated against a production-ready Windows malware pipeline that comprises many different components, such as dynamic analysis or signature matching. In theory, adversarial EXEmples against static detection should not have any effect on the execution traces of tampered programs (Demetrio et al., 2021b, a), making them more detectable by dynamic analysis, but no investigations have been conducted in this direction.

Hence, in this work we propose SLIFER, a novel Windows malware detector that matches the needs of production environments. SLIFER is built on three components: (i) pattern-matching with YARA rules to rapidly filter out known samples; (ii) static malware detection with state-of-the-art models to capture most of the threats; and (iii) dynamic analysis with emulation to fine-tune results on more difficult samples. Input programs traverse the pipeline sequentially, halting the process at the first detection. This reduces the need for firing dynamic analysis, restricting its resource-demanding feature extraction process to when it is really needed. Static analysis is achieved with a mixture of models, combining end-to-end detection (less predictive, but unable to crash at inference time due to the absence of feature extraction) and feature-extraction-based detection (more predictive, but prone to crashes) to reduce the number of errors. Through SLIFER, which mimics an industrial malware detector, we want to answer four research questions:

RQ1: how to properly deploy malware detection software in the presence of pre-processing errors at different stages?

RQ2: what are the performances of sequential analysis, compared to the single and hybrid approaches proposed in the literature?

RQ3: what is the robustness of a pipeline of components?

RQ4: what is the overhead brought by a sequential pipeline in analysing input samples?

To answer these questions, we provide an extensive experimental analysis conducted on two datasets, where we compare SLIFER to state-of-the-art models in terms of detection, false alarms, and robustness. Our findings highlight interesting trends that can be summarised in the following take-home messages:

Take-home message 1. Counter-intuitively, samples causing errors during pre-processing should advance in the pipeline, being labelled as benign in case no module is able to analyse them. This drastically reduces the number of false alarms, at the cost of a slight drop in detection, which remains acceptable in production environments.

Take-home message 2. SLIFER outperforms all the competitors while keeping an extremely low false positive rate, without discarding any sample in the process. Also, SLIFER is much faster than all models leveraging dynamic analysis, launching it only when needed.

Take-home message 3. SLIFER is more robust against static adversarial EXEmples, but not thanks to dynamic analysis. In fact, we show that artifacts generated by attacks are detected by YARA rules that did not match the unperturbed program. Also, we show that content injection attacks might have a marginal effect on functionality, since content is displaced and retrieved at different addresses at runtime, causing the dynamic-based ML model to lower its score.

Take-home message 4. SLIFER does not negatively impact the time needed to analyse samples, since most malicious programs are rapidly recognized by static analysis, requiring dynamic analysis only for benignware.

The rest of the paper is organised as follows: we first introduce the background concepts needed to understand our work (Sect. 2), followed by the implementation details of SLIFER (Sect. 3). We then describe our experiments (Sect. 4) and the results we derive from them, along with the answers to our research questions (Sect. 5). We conclude the paper by detailing the limitations of our approach (Sect. 7), possible future research directions, and final remarks (Sect. 8).

2. Background

In this section, we introduce the main concepts and technologies that constitute the fundamentals for our research.

Windows Portable Executable (PE) Format. This is the standard format for Windows programs, used for both executables and shared libraries (DLLs) (Fig. 1). It describes how files are stored on disk and how they are properly loaded into memory.

Figure 1. Layout of the Windows PE file format (DOS header and stub, NT headers, sections).

DOS Header + Stub. These chunks of bytes form a valid DOS program, kept for retrocompatibility; they are unused in modern OSes.

NT Headers. These bytes represent the real header of the program, containing the PE signature and all the information needed by the OS to load the content into memory.

Sections. These contain the code and assets of the program to load. Usually, the first section contains the machine instructions of the software, while the others are used as storage for initialized variables, resources, and other relevant information.

Windows Malware Detection. To stop the spread of malware, various techniques have been developed leveraging static, dynamic, or both types of analysis.

Static analysis. This type of analysis is based on the extraction of relevant metrics from the structure and content of analysed samples (Raff et al., 2018, 2020; Anderson and Roth, 2018), without the need to execute them. The most naïve static analysis methodology is pattern-matching with well-known signatures (Yara-Rules, 2019; The FLARE Team, 2020), but such a technique is not robust against all the malware variants that are released on a daily basis. Thus, ML models are trained on static features to generalize to unseen samples.

Dynamic analysis. This type of analysis concentrates on characterizing the behavior of programs (Trizna et al., 2023; Jindal et al., 2019) by recording their execution trace (spawned processes, API calls, reached websites, accessed registry keys) while they are detonated inside an isolated environment, such as a sandbox, or through emulation. After all the events have been collected, they are pre-processed to be fed to machine learning models as training data.

Hybrid analysis. This type of analysis merges both static and dynamic information, often obtaining better results due to the greater amount of collected malware characteristics (Trizna, 2022). This can be achieved either by stacking information together or by fusing the representations in deep neural networks (Gibert et al., 2020).

Adversarial EXEmples. The rise of ML in malware detection brought, in parallel, the rise of adversarial machine learning (Yuan et al., 2019; Wiyatno et al., 2019; Biggio et al., 2013; Biggio and Roli, 2018), which exploits vulnerabilities of models by creating adversarial EXEmples (Demetrio et al., 2021b, a). These are carefully-crafted programs tailored to fool ML malware detectors, causing impairments on end users' devices. While limited research focuses on attacking dynamic detectors (Rosenberg et al., 2018), most of the effort has been devoted to evading static classifiers (Demetrio et al., 2021b, a; Anderson et al., 2017; Lucas et al., 2021). These attacks work by either replacing existing content or injecting new content that disrupts the patterns learned at training time.

3. SLIFER: sequential pipeline for Windows Malware Detection

Figure 2. Architecture of SLIFER: input samples sequentially traverse pattern-matching with YARA rules, machine learning static analysis, and machine learning dynamic analysis.

We now describe SLIFER, a Windows malware detector built on different components, leveraging both static and dynamic analysis as depicted in Fig. 2. Differently from state-of-the-art techniques that use single or ensemble models, SLIFER performs predictions by sequentially testing different modules, halting computation for a specific sample as soon as one component detects it as malicious. Instead of merging static and dynamic analysis, which would require detonating samples in sandboxes, SLIFER leverages emulation to retrieve execution traces, and this operation is only required for programs that are flagged as non-malicious by all the previous modules. In this way, we minimize the detection time for malicious files, also thanks to the order of the modules, sorted from the fastest to the slowest. When one of the modules fails to analyse a sample due to pre-processing errors or crashes, SLIFER continues the analysis by passing the input to the next one. If no module is able to successfully compute a prediction, we flag the sample as benign. We will later show that this counter-intuitive choice keeps the number of false alarms extremely low, while not reducing much the predictions on malicious samples (Sect. 5.1). We now detail all the components of SLIFER, discussing their implementation and design choices.
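A minimal sketch of this sequential control flow, written under our assumptions rather than as the actual SLIFER implementation, is the following; each module is a callable that returns True (alarm) or False (no alarm), or raises an exception when its pre-processing fails:

import logging

MALICIOUS, BENIGN = 1, 0

def slifer_predict(sample_bytes, modules):
    # `modules` is the ordered list of detectors, from fastest to slowest
    # (in SLIFER: YARA matching, MalConv, GBDT-EMBER, Nebula).
    for module in modules:
        try:
            if module(sample_bytes):
                return MALICIOUS              # halt at the first alarm
        except Exception as err:
            logging.debug("module %r failed: %s", module, err)
            continue                          # pre-processing crash: skip to the next module
    return BENIGN                             # no module fired, or none could analyse the sample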

3.1. Pattern-matching Detection with Signatures

The first module of SLIFER uses YARA (https://github.com/VirusTotal/yara), a pattern-matching tool for detecting already-known malware through signatures. These are textual descriptions containing binary patterns that identify a group of malicious programs, usually structured as shown in Listing 1. Each rule contains metadata describing its function, followed by strings or patterns (hexadecimal strings, regular expressions, etc.) that will be looked for inside programs. Lastly, rules must define the firing condition, which describes how malicious activity is spotted: it is a conditional expression over the defined patterns that encodes the detection logic.

Listing 1. General structure of a YARA rule.

rule rule_name
{
    meta:
        description = "Rule Description"
        author = "name"
        date = ""
    strings:
        $a = {Hex Pattern}
        $b = {Regex}
    condition:
        $a or $b
}
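One possible way to compile and apply such rules from Python is through the yara-python bindings; the sketch below is only illustrative (the rule path is a placeholder, and we do not claim this is the exact integration used in SLIFER):

import yara

# Compile the collected rule set once (placeholder path), then reuse it.
rules = yara.compile(filepath="rules/index.yar")

def yara_module(sample_path: str) -> bool:
    # An input is flagged as malicious if at least one signature matches.
    return len(rules.match(filepath=sample_path)) > 0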

3.2. Machine Learning Static Analysis

The second module of SLIFER leverages two static machine learning malware detectors: (i) MalConv (Raff et al., 2018), an end-to-end deep neural network that requires no feature extraction; and (ii) GBDT-EMBER (Anderson and Roth, 2018), a gradient boosted decision tree trained on data pre-processed with hand-crafted features.

MalConv. Proposed by Raff et al. (Raff et al., 2018), this model has been developed to learn maliciousness directly from bytes, taking whole executable files as input and returning probability scores. It is implemented as a convolutional neural network (CNN), starting from an embedding layer that encodes bytes inside a space where distances are defined. The input is then processed through a gated convolution layer, a global max-pooling layer, and a fully connected layer that computes the final prediction. MalConv is trained on the state-of-the-art EMBER dataset (Anderson and Roth, 2018), and it takes as input the first 1 MB of each sample; input programs shorter than this amount are padded with a special value.
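As an illustration of this input format, the following sketch prepares a sample for a MalConv-like model: it keeps the first 1 MB of raw bytes and pads shorter files with a dedicated token (here 256, outside the byte range, as in common open-source implementations; the exact padding value is implementation-specific):

import numpy as np

MAX_LEN = 2**20      # 1 MB of raw bytes
PAD_TOKEN = 256      # special padding value outside the 0-255 byte range

def malconv_input(sample_bytes: bytes) -> np.ndarray:
    x = np.frombuffer(sample_bytes[:MAX_LEN], dtype=np.uint8).astype(np.int64)
    if x.size < MAX_LEN:
        pad = np.full(MAX_LEN - x.size, PAD_TOKEN, dtype=np.int64)
        x = np.concatenate([x, pad])
    return x         # token sequence fed to the embedding layer of the CNN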

GBDT-EMBER. Proposed by Anderson et al. (Anderson and Roth, 2018), this model leverages a Gradient Boosted Decision Tree (GBDT) trained on the state-of-the-art EMBER dataset (Anderson and Roth, 2018), containing features extracted from Windows Portable Executables (PEs). Features are organized into eight groups, including features extracted after parsing the PE and "format-agnostic" features obtained without parsing. The parsed feature groups are: (i) general file information, including information obtained from the PE header; (ii) header information, such as the target machine and the target subsystem; (iii) imported functions, reported by library; (iv) exported functions, as a list; and (v) section information, comprising the properties of each section. The format-agnostic features are: (i) a byte histogram, representing the normalized counts of each byte value within the file; (ii) a byte-entropy histogram, which accounts for the entropy of the byte distribution of the file, computed by applying a sliding window over the binary; and (iii) string information extracted from the printable strings inside the PE.
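The idea behind the format-agnostic features can be conveyed with a simplified sketch of the byte histogram and of the sliding-window entropy computation; the actual EMBER extractor bins (byte, entropy) pairs into a two-dimensional histogram, and the window and step sizes below are illustrative assumptions:

import numpy as np

def byte_histogram(data: bytes) -> np.ndarray:
    # Normalized counts of each byte value within the file.
    counts = np.bincount(np.frombuffer(data, dtype=np.uint8), minlength=256)
    return counts / max(len(data), 1)

def window_entropies(data: bytes, window: int = 2048, step: int = 1024) -> np.ndarray:
    # Shannon entropy of a sliding window over the binary: the building block
    # of the byte-entropy histogram.
    arr = np.frombuffer(data, dtype=np.uint8)
    if arr.size == 0:
        return np.array([])
    entropies = []
    for start in range(0, max(arr.size - window, 0) + 1, step):
        block = arr[start:start + window]
        p = np.bincount(block, minlength=256) / block.size
        p = p[p > 0]
        entropies.append(float(-(p * np.log2(p)).sum()))
    return np.asarray(entropies)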

Each input sample is first analysed by MalConv and, if the model does not raise any alert, it is passed to GBDT-EMBER. In this way, we first analyse samples with a faster model that requires no pre-processing, thus reducing the overall number of possible errors.

3.3. Machine Learning Dynamic Analysis

The last module of SLIFER leverages dynamic analysis to trace the execution of samples, and it should help SLIFER recognize obfuscated and packed samples that evaded the previous modules. Among the recently released models for dynamic analysis, we select Nebula (Trizna et al., 2023), a pipeline influenced by advances in Large Language Models (LLMs) (Radford et al., 2018) that employs the self-attention mechanism to analyze dynamic analysis reports and classify samples. Nebula employs Windows kernel emulation as a compromise between computational complexity and coverage: while emulation is cheaper than system virtualization, it is more prone to dynamic analysis errors. Nebula introduces domain-knowledge-influenced filters to distill behavioral reports from redundant or irrelevant information like memory addresses or PE file segment hashes. The filtered behavioral report is then processed by a Transformer encoder neural network, producing a probability of maliciousness. We use the pre-trained objects released by Nebula, trained on a public dataset (Trizna, 2022) comprising ≈75k malware samples collected in January 2022, spanning seven malware types, and ≈25k benignware samples.
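The spirit of these filters can be illustrated with a small sketch that strips volatile tokens from an emulation report before tokenization; the regular expressions below are our own simplification, not Nebula's actual implementation:

import json
import re

HEX_ADDR = re.compile(r"\b0x[0-9a-fA-F]+\b")          # memory addresses
LONG_HEX = re.compile(r"\b[0-9a-fA-F]{32,64}\b")      # MD5/SHA-like hashes

def normalize_report(report: dict) -> str:
    # Serialize the behavioral report and replace volatile tokens with
    # placeholders, so that the tokenizer does not learn spurious artifacts.
    text = json.dumps(report)
    text = HEX_ADDR.sub("<addr>", text)
    text = LONG_HEX.sub("<hash>", text)
    return text    # input to the Transformer encoder's tokenizer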

3.4. SLIFER Implementation

We now present how we serialize the components we have described. SLIFER processes input programs by sequentially passing them from module to module, halting computation if one of those modules raises an alert. If one of the modules crashes due to pre-processing errors, SLIFER skips that module and passes the input to the next one; this can happen for GBDT-EMBER and Nebula, which require heavy pre-processing. We discuss classification error management in Sect. 5, showing that it is better to label samples as benign in case of pre-processing crashes.

For the first YARA module, we collect 2.7k rules available in open GitHub repositories, accessed until November 2023 (https://github.com/bartblaze/Yara-rules/tree/master/rules, https://github.com/elastic/protections-artifacts/tree/main/yara/rules, https://github.com/malpedia/signator-rules/tree/main/rules, https://github.com/Neo23x0/signature-base/tree/master/yara, https://github.com/Yara-Rules/rules/tree/master/malware). For the machine learning static analysis module, we leverage the secml_malware library (Demetrio and Biggio, 2021) (https://github.com/pralab/secml_malware), which wraps both the MalConv and GBDT-EMBER models. Instead of training both from scratch, we leverage pre-trained open-source implementations of both (https://github.com/endgameinc/malware_evasion_competition). The last module of SLIFER leverages the Nebula model provided by its original repository (https://github.com/dtrizna/nebula). Emulation, which is the core of Nebula, is achieved through Speakeasy (Mandiant, 2022), a Windows kernel emulation library, which generates a behavioral report as a JSON file.
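A sketch of how a behavioral report can be obtained with Speakeasy is shown below; it follows the library's documented usage, although the exact API may differ across versions:

import json
import speakeasy

def emulate(sample_path: str, report_path: str) -> dict:
    se = speakeasy.Speakeasy()                 # fresh emulator per sample
    module = se.load_module(sample_path)       # load the PE to emulate
    se.run_module(module)                      # run until completion or error
    report = se.get_report()                   # behavioral report as a dict
    with open(report_path, "w") as fp:
        json.dump(report, fp)                  # persisted as a JSON file
    return report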

4. Experimental Setting

We now introduce all the experiments we perform on SLIFER; the results are shared in Sect. 5.

Datasets. In our experiments, we employ two different datasets, which differ in cardinality, malware families, benignware, and period of collection. The first one ($\mathcal{D}_1$) is composed of 5k malware samples, collected from VirusTotal before 2018, and 3.5k benignware samples, harvested from GitHub and clean Windows installations. We show its composition in Table 1, drafted again using VirusTotal. As regards malware families, this dataset is unbalanced, with a predominance of two specific families over all the others. While this composition might seem unsuitable for experiments, we believe it can be treated as a realistic snapshot of real production environments, where there is no control over the incoming threats to analyse.

Table 1. Per-class composition of $\mathcal{D}_1$: Benign, Backdoor, Downloader, Grayware, Miner, Ransomware, Rogueware, Spyware, Virus, Worm, and unlabelled samples.

Table 2. Per-class composition of $\mathcal{D}_2$.
Class | Benign | Backdoor | Coinminer | Dropper | Keylogger | Ransomw. | RAT | Trojan
Count | 10000 | 2500 | 2500 | 2500 | 2500 | 2500 | 2500 | 2500

The second dataset ($\mathcal{D}_2$) reflects the test set of the dynamic analysis records published by Trizna (Trizna, 2022), comprising 10k benignware and 17.5k malware samples spanning seven types (such as ransomware, trojans, keyloggers, etc.), as shown in Table 2. The dataset was collected in April 2022 by partnering with an undisclosed security vendor. We re-collected the PE files from public data sources based on the released hashes, to perform the static and YARA-level analyses that are applicable only to raw PE bytes. Contrary to $\mathcal{D}_1$, this dataset is perfectly balanced across malware families, making it a good baseline for fairly assessing performances across different techniques.

Evaluation metrics. We compute several metrics to characterize the performance of SLIFER and baseline models:

True Positive Rate (TPR) to evaluate the capability of each module/model to correctly label actual malware samples, i.e., to detect malicious PEs.

Mean Detection Time (MDT) to evaluate the mean time that each module/model needs to analyse an input sample and to make a decision. We compute this metric on a subset of our datasets.

Error Rate (ER) of each module/model to estimate the percentage of impossible-to-analyse samples. We also provide an experimental explanation of the best way to label those samples, since a decision must always be made in a production scenario (a computation sketch follows this list).

Adversarial Detection Rate (ADR) to assess robustness, simply stating how many adversarial EXEmples are correctly recognized as malware after the attack.
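A minimal sketch of how these metrics can be computed under the two error-handling policies (errors mapped to the benign or to the malicious label) is the following; predictions are assumed to be 1 (malicious), 0 (benign), or None (impossible-to-analyse):

def metrics(preds, labels, error_as=0):
    # Map pre-processing errors (None) to the chosen default label.
    resolved = [error_as if p is None else p for p in preds]
    positives = sum(labels)
    negatives = len(labels) - positives
    tp = sum(p == 1 and y == 1 for p, y in zip(resolved, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(resolved, labels))
    tpr = tp / max(positives, 1)                       # True Positive Rate
    fpr = fp / max(negatives, 1)                       # False Positive Rate
    er = sum(p is None for p in preds) / len(preds)    # Error Rate
    return tpr, fpr, er

The ADR corresponds to the same detection computation restricted to the set of adversarial EXEmples.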

Comparison between SLIFER and single models. In the first experiment, we test the predictive capabilities of SLIFER on our two datasets, comparing it against each separate module. In doing this, we motivate the choice of our proposed architecture, especially for the dynamic analysis module: we show that Nebula is better than a competitor model, Neurlux (Jindal et al., 2019), an LSTM model trained on dynamic reports extracted through emulation. We also extend the comparison by including Quo.Vadis (Trizna, 2022), a hybrid-analysis model that merges both static and dynamic features. Contrary to SLIFER, Quo.Vadis introduces a monolithic structure with a single forward- and back-propagation path, processing static and dynamic components simultaneously and employing a "meta-model" that accumulates the representations from both analysis types to classify Windows executable samples.

Effect of dynamic analysis in SLIFER classification. Similarly to the analysis performed by Dambra et al. (Dambra et al., 2023), we investigate the efficacy of dynamic analysis. Differently from their setting, here we use emulation as the last step of the analysis, without concatenating static and dynamic features together. We then test the performances of SLIFER in two cases: in the former, we do not include the dynamic module; in the latter, we test the full pipeline, including the model chosen after the comparison of the two candidate dynamic models. Lastly, we also report SLIFER's detection performance divided per malware family for $\mathcal{D}_1$ and $\mathcal{D}_2$.

Detection time analysis. We compare the mean speed of each separate module or model, aiming to analyze the advantages and disadvantages in terms of time. We calculate the MDT metric on two subsets of $\mathcal{D}_1$ and $\mathcal{D}_2$: the first comprises 500 malware samples from $\mathcal{D}_1$ and 500 malware samples from $\mathcal{D}_2$; similarly, we take 500 benignware samples from each of the two datasets. We also compute the standard deviation to show how much the detection time can vary. For this evaluation, we separate malware and benignware samples because they differ in file size: benignware files are usually bigger than malware, which increases emulation, processing, and classification times. Moreover, benign samples always pass through all of SLIFER's modules by design, so we prefer to evaluate their times separately. To distinguish the two measurements, we denote with MDTm and MDTg the MDT computed on the malware and benignware subsets, respectively.
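The MDT measurement itself reduces to wall-clock timing per sample, averaged over the chosen subset; a sketch (with detector standing for any module or for the whole pipeline) is the following:

import time
import numpy as np

def mean_detection_time(sample_paths, detector):
    times = []
    for path in sample_paths:
        start = time.perf_counter()
        detector(path)                              # module or full pipeline under test
        times.append(time.perf_counter() - start)
    return float(np.mean(times)), float(np.std(times))   # MDT and its standard deviation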

Pipeline robustness evaluation. We test SLIFER in terms of robustness, evaluating the capability of adversarial EXEmples to deceive the pipeline. We adopt a transfer attack approach, assuming an attacker who knows that the GBDT-EMBER model (which proves to be the best performing in most scenarios in terms of TPR and FPR, as we discuss in Sect. 5) is used inside the pipeline. To do so, we rely on GAMMA, a state-of-the-art black-box attack (Demetrio et al., 2021a) based on section injection. This attack iteratively requests the classification of perturbed samples to optimize the amount of injected content, added as new sections, until either evasion is achieved or the query budget is exhausted. We took 1k GBDT-classified malicious samples from $\mathcal{D}_2$ and injected 75 benign sections picked from $\mathcal{D}_2$ benignware samples. We set $\lambda = 1 \times 10^{-6}$ and the maximum number of queries to 100. These sections contain read-only data used by programs at runtime, but no further code is included inside the adversarial EXEmples.
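The overall transfer setting can be summarized with the simplified query loop below; inject and score are hypothetical callables standing in for the secml_malware implementation of GAMMA and for GBDT-EMBER, and the real attack optimizes the injected payload with a genetic strategy under the size penalty λ rather than with this naive loop:

def transfer_attack(malware, inject, score, max_queries=100, threshold=0.5):
    # Black-box loop: perturb the sample, query only the surrogate (GBDT-EMBER),
    # stop as soon as its score drops below the decision threshold.
    adv = malware
    for _ in range(max_queries):
        adv = inject(adv)              # add benign-looking sections (structure only)
        if score(adv) < threshold:
            return adv                 # evasive EXEmple, later replayed against SLIFER
    return None                        # query budget exhausted without evasion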

5. Experimental Results

We now describe all the findings we gathered from the experiments described in Sect.4.

5.1. Single Modules Classification

We test the performance of all the single modules and models on both the $\mathcal{D}_1$ and $\mathcal{D}_2$ datasets in our possession, and we present the results of each module in Table 3. We report the values of the described metrics when modules treat input as benignware in the presence of pre-processing errors, also including in brackets the metrics computed when classification errors are treated as malware.

Table 3. Performance of single modules and of SLIFER on $\mathcal{D}_1$ (top) and $\mathcal{D}_2$ (bottom); values in brackets are computed by treating pre-processing errors as malware.

$\mathcal{D}_1$
Module | TPR | FPR | ER
Signatures | 0.15 | 5.2 × 10^-3 | 0
MalConv | 0.78 | 2.0 × 10^-2 | 0
GBDT | 0.91 (0.92) | 4.1 × 10^-3 (0.12) | 5.4%
Nebula | 0.23 (0.61) | 2.3 × 10^-2 (0.37) | 39%
Quo.Vadis | 0.64 (0.73) | 1.5 × 10^-2 (0.25) | 15%
SLIFER No Dyn. | 0.94 (0.95) | 2.5 × 10^-2 (0.14) | 5.3%
SLIFER | 0.94 (0.97) | 4.1 × 10^-2 (0.38) | 19%

$\mathcal{D}_2$
Module | TPR | FPR | ER
Signatures | 0.34 | 8.9 × 10^-3 | 0
MalConv | 0.74 | 4.8 × 10^-2 | 0
GBDT | 0.75 (0.75) | 2.8 (3.6) × 10^-3 | 0.036%
Nebula | 0.41 (0.86) | 7.8 × 10^-3 (0.21) | 45%
Quo.Vadis | 0.80 (0.84) | 9.2 × 10^-3 (8.0 × 10^-2) | 4.8%
SLIFER No Dyn. | 0.88 (0.88) | 5.8 (5.9) × 10^-2 | 0.029%
SLIFER | 0.89 (0.96) | 6.4 × 10^-2 (0.26) | 20%

To better justify the choice of the dynamic malware detector, we compare two state-of-the-art models: (i) Nebula, the Transformer architecture described in Sect. 3.3; and (ii) Neurlux (Jindal et al., 2019). We plot the ROC curves of the two models, considering only successfully classified samples and dropping samples that resulted in classification errors. As shown in Fig. 3, Nebula outperforms Neurlux at every FPR, which is the reason for our choice in the development of the dynamic module. Also, we calibrate our models by finding a threshold at 1% FPR when possible, but Neurlux achieves a very high FPR at that threshold, which is unacceptable in a production detector. Thus, coherently with previous work (Trizna et al., 2023), we discarded Neurlux as a possible module for SLIFER.
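The calibration step can be sketched as follows, using scikit-learn's ROC utilities to select, on a validation set, the highest decision threshold whose false positive rate does not exceed 1% (a sketch under our assumptions, not the exact calibration code used in the experiments):

import numpy as np
from sklearn.metrics import roc_curve

def threshold_at_fpr(y_true, scores, target_fpr=0.01):
    # roc_curve returns the FPR in increasing order with the matching thresholds.
    fpr, _, thresholds = roc_curve(y_true, scores)
    idx = np.searchsorted(fpr, target_fpr, side="right") - 1
    return thresholds[max(idx, 0)]     # last operating point with FPR <= target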

Figure 3. ROC curves of Nebula and Neurlux on the emulation reports.

As anticipated, we also include in the comparison the hybrid model Quo.Vadis (Trizna, 2022), considering that its approach of static and dynamic feature fusion is widespread in malware detection research. We compute the metrics for Quo.Vadis considering a 1% FPR threshold computed on $\mathcal{D}_2$.

Before drawing conclusions on the predictive capabilities of the tested models, we note that the best results in terms of TPR and FPR are achieved by treating pre-processing errors as benignware. Thus, we can already answer RQ1 for single models by counter-intuitively asserting that, since the majority of impossible-to-analyse samples are benignware, it is better to slightly decrease the TPR in favour of a low-FPR regime. Looking at the best trade-off between FPR and TPR, we cannot isolate a clear winner among the models evaluated singularly. In particular, both GBDT-EMBER and Quo.Vadis achieve the best performance on both datasets, with extremely low false positive rates. It is worth noticing that Quo.Vadis also incorporates the same feature set proposed by Anderson et al. (Anderson and Roth, 2018), which is used by GBDT-EMBER. However, if we also include the error rate (ER) in our analysis, we can observe that GBDT-EMBER is clearly the winner, with a very low percentage of crashes on both datasets. On the contrary, Quo.Vadis reaches a peak of 15% pre-processing errors due to the combination of both static feature extraction and emulation. MalConv also shows performance comparable with these models, while being characterized by zero pre-processing errors thanks to its end-to-end structure.

The best technique in terms of FPR is pattern-matching samples with YARA rules but, as expected, the TPRs are very low on both datasets. In numbers, our rule set of 2.7k signatures is only able to detect 750 malware samples in $\mathcal{D}_1$ and 6k in $\mathcal{D}_2$, making it the least predictive detector. While this analysis highlights that, on average, each rule detects roughly 2.2 samples of $\mathcal{D}_2$, it is clear that YARA rules might match other models' predictive capabilities only with a much larger set of signatures. Lastly, dynamic analysis with emulation has a non-negligible number of errors, which hinders its already low metrics and its real impact in a multi-stage detector. This finding might be correlated with the fact that modeling the behavior of malware in an end-to-end perspective is a daunting task, thus confirming the superiority of static analysis (Dambra et al., 2023).

5.2. SLIFER Classification

We now analyse the performances of SLIFER with and without (SLIFER No Dyn.) the dynamic module as the last step of the detection pipeline. This comparison grants us insights into how much dynamic analysis improves the quality of predictions when used as a sequential detection mechanism, and we show the results in Table 3. For both settings, compared to single modules, we clearly notice a drastic improvement in the trade-off between TPR and FPR on both datasets. While it is true that GBDT-EMBER achieves a lower FPR, it is worth noticing that SLIFER increases the TPR by 3% and 13% on $\mathcal{D}_1$ and $\mathcal{D}_2$, respectively. This increment is given by the sequential order of techniques of our methodology, which patches the blind spots of single strategies thanks to the multitude of increasingly-accurate models. Also, the ER is lower compared to the dynamic module alone in both scenarios for two main reasons: (i) fewer malware samples that cause emulation to crash reach the last step of the pipeline, since they are stopped by earlier modules, and (ii) most of the pre-processing crashes are caused by benignware. Thus, we complete the answer to RQ1 by stating that also for SLIFER it is better to treat impossible-to-analyse samples as benign, to avoid a higher FPR (which would reach 26% for the full SLIFER).

Table 4. Per-family TPR on $\mathcal{D}_1$ (values in brackets treat errors as malware).
Architecture | Backdoor | Downloader | Grayw. | Miner | Ransomw. | Roguew. | Spyware | Virus | Worm
SLIFER No Dyn. | 0.97 (0.97) | 0.97 (0.98) | 0.93 (0.93) | 1.0 (1.0) | 0.96 (0.96) | 0.90 (0.90) | 0.94 (0.94) | 0.99 (0.99) | 0.99 (0.99)
SLIFER | 0.97 (0.98) | 0.97 (0.99) | 0.93 (0.96) | 1.0 (1.0) | 0.97 (0.98) | 0.90 (1.0) | 0.94 (1.0) | 0.99 (0.99) | 0.99 (0.99)

Table 5. Per-family TPR on $\mathcal{D}_2$ (values in brackets treat errors as malware).
Architecture | Backdoor | Coinminer | Dropper | Keylogger | Ransomw. | RAT | Trojan
SLIFER No Dyn. | 0.99 (0.99) | 0.97 (0.97) | 0.83 (0.83) | 0.43 (0.43) | 0.95 (0.95) | 0.98 (0.98) | 0.99 (0.99)
SLIFER | 0.99 (0.99) | 0.97 (0.99) | 0.83 (0.99) | 0.52 (0.81) | 0.97 (0.97) | 0.98 (0.99) | 0.99 (0.99)

As for the differences between the pipeline with and without the dynamic module, we can see a negligible improvement in TPR, and only on $\mathcal{D}_2$, at the expense of the FPR on both datasets. To better understand this improvement, we report the distinct predictive capabilities of both strategies on the different families that compose our datasets, showing the results in Table 4 for $\mathcal{D}_1$ and Table 5 for $\mathcal{D}_2$. For the first dataset, which is severely imbalanced in terms of malware families, we notice an identical detection rate when treating errors as benignware. Also, since the number of errors is very low, with a peak of 3% (40 samples) on Grayware programs, there is no clear advantage brought by dynamic analysis on this dataset.

Similarly, for the second dataset, both techniques score identical results, with low TPR on the Dropper and Keylogger families. On this dataset, we confirm what has already been stated by Dambra et al. (Dambra et al., 2023): dynamic analysis only marginally helps in detecting specific families that are harder to detect with static analysis. However, looking at the numbers, all the emulation errors are also concentrated on those families. Thus, it is possible that these results would further improve with a better emulation or virtualization tool as the back-end of the dynamic analysis module. We derive that adding a dynamic module at the end of a sequential pipeline for malware detection might not improve performances as much as expected. In particular, the inclusion of this module might increase the errors encountered at analysis time, which must be dealt with accordingly. Secondly, differently from Dambra et al., who consider models with static and dynamic features concatenated together, dynamic analysis as a separate part of a detection pipeline does not increase the capabilities of the overall system. Hence, we can answer RQ2 by stating that the sequential architecture of SLIFER improves the TPR within a low-FPR regime compared to single models alone, but not really thanks to the dynamic module.

5.3. Robustness Against Transfer Attacks

We now answer RQ3 by testing the robustness of our architecture against adversarial EXEmples computed on GBDT-EMBER and transferred against SLIFER, and we show the results in Table 6. From the randomly-sampled 1k malicious samples, GAMMA is able to produce 388 EXEmples that bypass GBDT-EMBER detection. We use this set to compute the robustness against transfer attacks of SLIFER with and without the dynamic module. Since we are also interested in the effect that signatures play in countering malware, we want to quantify how useful they can be when dealing with adversarial EXEmples; thus, we also add to the comparison a version of SLIFER where the YARA pattern-matching is removed (SLIFER No Sign.). Lastly, to understand whether an attack that manipulates only the structure of a program might also affect a behavioral model, we also test the robustness of Nebula alone.

Table 6. Adversarial Detection Rate (ADR) on the original malware samples and on their adversarial EXEmples (values in brackets treat errors as malware).
Architecture | ADR (original malware) | ADR (adversarial EXEmples)
SLIFER | 1.0 (1.0) | 0.86 (0.93)
SLIFER No Dyn. | 1.0 (1.0) | 0.76 (0.76)
SLIFER No Sign. | 1.0 (1.0) | 0.71 (0.89)
Nebula | 0.47 (0.82) | 0.45 (0.79)

While all the untainted malware samples are detected by SLIFER, with the exception of Nebula alone due to its high number of emulation errors, the results of the transfer evaluation highlight an evasive trend against all models. They suggest that the GAMMA attack can evade the whole SLIFER, decreasing the detection rate by 14%. This is achieved also thanks to the policy we established of flagging samples as benignware in case of errors in the last dynamic module: from the ADR computed by counting errors as malware, we derive that 7% of the adversarial EXEmples crash the analysis, while the remaining 7% bypass SLIFER without pre-processing errors. Looking at Nebula's performance on these data, it is likely that roughly half of them are not detected at all, but also that some are, surprisingly, evading detection. This can be noticed by looking again at the results scored by Nebula: when considering errors as malware, there is a small drop in performance (3%), implying that those EXEmples do not crash the pipeline. By inspecting the results, we discovered that (i) some malware samples obtain a different score when emulated multiple times inside a single Speakeasy session, and (ii) GAMMA changes the output of Nebula. We analysed those reports, and we discovered that samples invoking the following Windows APIs are affected:

  • GetCurrentThreadId and GetSystemTimeAsFileTime return different values when evaluated multiple times, thus changing the score. These are likely used to detect the presence of dynamic analysis, impacting the emulation reports as well;

  • GetModuleFileNameW retrieves a different file name. It is likely that Speakeasy renames the file under analysis to its hash; thus, the output reports of the untainted malware and of its adversarial EXEmple counterpart differ.

This analysis is not comprehensive, as it would require deeper reverse engineering of those samples but, even if only for a few samples, we can conclude that dynamic analysis is also evaded by non-behavioral attacks, not strictly due to the effect of the perturbations. Lastly, 5 samples that originally caused errors during emulation are analysed by Speakeasy after the manipulation. This result needs further investigation, but it brings evidence that dynamic analysis based on emulation can be weak and inconsistent, not only against adversarial transfer attacks but also across multiple evaluations of the same samples.

If we remove dynamic analysis from SLIFER (SLIFER No Dyn. in Table 6), we observe a drop in ADR of 10%, implying that this module was indeed able to stop most of the adversarial EXEmples. However, counter-intuitively, if we instead remove signatures (SLIFER No Sign.) and keep dynamic analysis, we observe a drop in ADR of 15% w.r.t. SLIFER. This means that, surprisingly, pattern-matching with YARA rules has a bigger impact on robustness than dynamic analysis itself. This relevant contribution of YARA rules is due to how the section injection manipulation works: while mixing together sections extracted from input data, GAMMA likely creates some patterns that trigger YARA rules, contrary to the original PE. We notice that 20 samples evade the YARA module before the GAMMA manipulation, but they are detected once perturbed. To better understand this result, we analyse, through expert domain knowledge, a small subset of the rules that are triggered by adversarial EXEmples. Through this study, we isolate the culprits of such triggers:

  • CryptoLocker_rule2: it triggers on the meta-data components of the PE file. For instance, manifest.xml appears many times after section injection, instead of appearing only once (as in any regular PE), triggering the rule as a consequence;

  • AutoIT_Compiled: AutoIt is a rarely used scripting language for Windows (https://www.autoitscript.com/site/) that was found to be widely employed in malware crafting (https://www.autoitscript.com/wiki/AutoIt_and_Malware). This rule triggers whenever it finds sections compiled by AutoIt. In our case, two sections were injected from a benign PE containing some AutoIt-compiled artifacts, and the rule was triggered;

  • SUSP_NET_NAME_ConfuserEx: Confuser is a well-known packer for .NET apps (https://github.com/yck1509/ConfuserEx), used to prevent reverse engineering of proprietary code. However, this tool is also used by malware developers to obfuscate malicious code, and this rule flags sections manipulated with Confuser. In this case, some benign sections altered by Confuser were injected to craft the adversarial EXEmple;

  • Windows_Trojan_Njrat_30f3c220: it appears that some benign programs contained in $\mathcal{D}_2$, which is the dataset from which we extracted the benignware used by GAMMA, might not actually be legitimate. In this case, GAMMA injects a malicious payload that triggers the rule and prevents the adversarial EXEmple from being effective.

Thus, we can conclude that SLIFER is indeed robust against transfer adversarial EXEmples computed against its strongest component, but such robustness is achieved more by static signatures than by dynamic analysis alone.

5.4. Computation Times Comparison

We now analyse the classification times of all the modules and models tested, and we report our findings in Table 7 to answer RQ4. We measure the MDT for malware and benignware separately on both $\mathcal{D}_1$ and $\mathcal{D}_2$, as explained in Sect. 4, including the standard deviation of each measurement. In general, we can see a difference between the times computed for malware (first column) and benignware (second column). This discrepancy is mainly attributed to the larger size of benignware in terms of bytes and content included in the programs, which slows down all the compared techniques. As expected, signatures and static models are the fastest techniques, with MalConv being the best one, followed by YARA pattern-matching and GBDT-EMBER. Intuitively, MalConv does not pre-process input samples, and it just uses their first 1 MB to compute predictions without relying on feature extractors. On the contrary, while pattern-matching is indeed fast, it is slowed down by the length of the file to analyse, requiring different amounts of computation for smaller and larger programs. Lastly, GBDT-EMBER is two orders of magnitude slower than MalConv due to the heavy feature extraction phase encoded through EMBER. Unsurprisingly, hybrid and dynamic analysis with Quo.Vadis and Nebula require a non-negligible number of seconds due to emulation. On the contrary, even if built on top of many components, SLIFER rivals GBDT-EMBER when analysing malware: since computations are halted at the first detection, SLIFER cuts the heavy feature-extraction phase when possible. Conversely, since benign samples must traverse the whole pipeline, we report a slightly higher time required to analyse benignware, mostly due to the emulation step. Thus, we can conclude that the overhead induced by SLIFER is negligible w.r.t. both static and dynamic analysis alone.

Table 7. Mean Detection Time (in seconds) with standard deviation, computed on the malware (MDTm) and benignware (MDTg) subsets.
Architecture / Module | MDTm (s) | MDTg (s)
Signatures | (2.9 ± 6.8) × 10^-2 | (6.5 ± 18) × 10^-2
MalConv | (8.9 ± 20) × 10^-3 | (6.5 ± 3.1) × 10^-3
GBDT | (1.5 ± 4.0) × 10^-1 | (2.2 ± 6.4) × 10^-1
Nebula | 2.4 ± 16 | 3.0 ± 14
Quo.Vadis | 2.2 ± 8.0 | 6.0 ± 18
SLIFER No Dyn. | (1.1 ± 3.6) × 10^-1 | (3.4 ± 8.4) × 10^-1
SLIFER | (2.0 ± 11) × 10^-1 | 6.3 ± 19

6. Related Work

Hybrid analysis. In recent years, academia has focused on many architectures for ML malware detectors that mix both static and dynamic analysis (Gibert et al., 2020). After the establishment of machine learning models in the malware analysis domain, research focused on how to train models that extract features from these two analysis approaches to increase performance in binary classification and/or family classification. As already mentioned in this work, with Quo.Vadis (Trizna, 2022) Trizna developed an architecture of multiple deep learning models (early fusion models) that analyse both static and dynamic features and then merge the results with a single meta-model; this approach is well known in the literature as a late fusion strategy. Han et al. (Han et al., 2019a, b) tested different ML models in malware detection and classification tasks, merging static and dynamic features and training models directly on the whole set of mixed features. The same approach is described by Kumar et al. (Kumar et al., 2019). All these works apply the so-called early fusion strategy, and the experiments report an improvement in accuracy. Ngo et al. (Ngo et al., 2023) tackled the problem of feature fusion with Transfer Learning and Knowledge Distillation (KD): a large teacher model is trained on aggregated static and dynamic features and a small student model only on static features; then, the knowledge of the rich "behavior-aware" model is transferred to the faster small model to perform classification. Similarly to our work, this approach aims at reducing the detection delays caused by dynamic analysis: after the training effort of the teacher model, which includes behavioral features, this knowledge is transferred to the student model, which computes only static features to perform classification.

There are differences between these related works and ours: differently from the techniques mentioned above, we also focus on evaluating the robustness of SLIFER to adversarial EXEmples, analysing the contribution of the dynamic model and of the signatures to the ADR; moreover, we present two different policies for impossible-to-analyse samples, while in the literature they are discarded or are not the object of any further analysis.

7. Limitations

We now evaluate the limitations of our work, discussing how they can be handled or why they are not relevant.

SLIFER Model Training. Our pipeline's architecture is composed of objects pre-trained on different datasets. The scope of our work is not to build a new model that optimizes a common loss function by training the whole pipeline on the same dataset. Rather, we want to highlight the performance of a sequence of ML models in detecting malware, adding signatures to static and dynamic analysis similarly to industrial architectures, and analysing the contribution of each module. Building a single model that can analyse different features without relying on feature fusion (as Quo.Vadis (Trizna, 2022) does) is a non-trivial problem out of the scope of this work.

Temporal Analysis. We do not perform a systematic temporal analysis, looking at performance decay when submitting newer, documented data. In the literature, we find strategies to overcome temporal bias and the so-called concept drift problem (Pendlebury et al., 2019), i.e., the obsolescence at test time of ML models trained on past data, caused by the decay of the i.i.d. (independent and identically distributed) assumption on data. However, MalConv and GBDT are trained on samples up to 2018, while Nebula's training data were collected in January 2022 and $\mathcal{D}_2$ was collected in April 2022: this means that our evaluation can still be treated as realistic. Moreover, $\mathcal{D}_1$ is strongly imbalanced in terms of malware families, as frequently happens in real-world scenarios; we do not have precise timestamps for this collection, and we do not use it to draw conclusions on temporal analysis.

Dynamic Analysis on Virtualization. The Nebula model is trained on reports generated through Speakeasy (Mandiant, 2022), which is an emulation tool. We see limitations in PE emulation, especially in reliability, as assessed by our experimental work. Virtualizing a huge number of PEs, as done for the Speakeasy dataset (Trizna, 2022), requires time and a well-settled environment, and it is out of our scope, which is testing available ML models for dynamic analysis inside a sequential pipeline.

End-to-End Adaptive Attacks. We perform transfer attacks targeting only GBDT, namely the best-performing model. Attacks on ML models trained on static features are well implemented and documented (Demetrio et al., 2021a). Evading dynamic analysis is a broad problem (Afianian et al., 2019) and, currently, we lack implementations of attacks that evade dynamic models: as far as we know, only one such attack has been proposed (Rosenberg et al., 2018), but no code has been released. Moreover, we do not customise GAMMA to avoid including artifacts detected by YARA rules. However, our goal in this paper is the empirical analysis of malware pipelines that resemble industry settings, without delving into adversarial robustness, as this would require a deeper investigation on its own, as well as developing novel attacks or re-implementing existing ones.

8. Conclusions

In this work, we propose SLIFER, a novel Windows malware detection pipeline that sequentially combines static and dynamic analysis, leveraging pattern-matching with YARA and state-of-the-art machine learning models for malware detection. We build an architecture that mimics industrial solutions, testing its performance against hybrid approaches, proposing policies to handle analysis errors, and analysing its robustness against adversarial EXEmples. We highlight interesting findings remarked by our experimental work. SLIFER outperforms state-of-the-art models designed for static, dynamic, or hybrid analysis, keeping the highest TPR with a low FPR. Moreover, it provides faster detection of malicious PEs compared to models leveraging dynamic analysis; in achieving this, our proposed architecture does not discard any input sample and propagates analysis errors until the end of the pipeline, and labeling impossible-to-analyse samples as benign proves to be the best policy to raise fewer false alarms. SLIFER shows its robustness against adversarial transfer attacks and, surprisingly, demonstrates that section injection attacks are detected more by YARA rules than by dynamic analysis, whose integrity is undermined by small changes in the sample structure that are reflected in the emulation report. Overall, we report the small contribution of dynamic analysis based on emulation to accuracy and robustness, besides the computational burden it requires, proving the effectiveness of our choice to use that kind of analysis only when required and as a last resort.

Future Work. As mentioned in Sect. 7, we did not learn all the parameters of the components of SLIFER, but rather leveraged pre-trained models. However, we believe that it is possible to formalize an algorithm that wraps all modules, whose parameters can be learned within a single minimization, while keeping static and dynamic analyses separated. Another future investigation will take into account the difference between emulation and virtualization: in particular, we will train the dynamic analysis module on virtualization reports collected from a training PE dataset, and we will investigate the difference in both performance and overhead. As regards end-to-end attacks, we plan on developing novel adversarial attacks suited to evade end-to-end malware pipelines like SLIFER, thus trying to deceive both malware signatures and dynamic modules.

Acknowledgements.

Andrea Ponte acknowledges the support of Rina Consulting S.p.A. to his doctoral scholarship and research work. This work was partially supported by projects SERICS (PE00000014) and FAIR (PE00000013) under the NRRP MUR program funded by the EU - NGEU.

References

  • Afianian et al. (2019) Amir Afianian, Salman Niksefat, Babak Sadeghiyan, and David Baptiste. 2019. Malware dynamic analysis evasion techniques: A survey. ACM Computing Surveys (CSUR) 52, 6 (2019), 1–28.
  • Anderson et al. (2017) Hyrum S. Anderson, Anant Kharkar, Bobby Filar, and Phil Roth. 2017. Evading machine learning malware detection. Black Hat 2017 (2017), 1–6.
  • Anderson and Roth (2018) Hyrum S. Anderson and Phil Roth. 2018. EMBER: An open dataset for training static PE malware machine learning models. arXiv preprint arXiv:1804.04637 (2018).
  • Avira (2017) Avira. 2017. NightVision – Using Machine Learning to Defeat Malware. https://www.webassetscdn.com/avira/prod/cache-buster-1598423379/assets/oem.avira.com/resources/to%20delete/whitepaper_NightVision_EN_20170704.pdf Accessed: April 2024.
  • Biggio et al. (2013) Battista Biggio, Igino Corona, Davide Maiorca, Blaine Nelson, Nedim Šrndić, Pavel Laskov, Giorgio Giacinto, and Fabio Roli. 2013. Evasion attacks against machine learning at test time. In Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2013, Prague, Czech Republic, September 23–27, 2013, Proceedings, Part III 13. Springer, 387–402.
  • Biggio and Roli (2018) Battista Biggio and Fabio Roli. 2018. Wild patterns: Ten years after the rise of adversarial machine learning. In Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security. 2154–2156.
  • Dambra et al. (2023) Savino Dambra, Yufei Han, Simone Aonzo, Platon Kotzias, Antonino Vitale, Juan Caballero, Davide Balzarotti, and Leyla Bilge. 2023. Decoding the secrets of machine learning in malware classification: A deep dive into datasets, feature extraction, and model performance. In Proceedings of the 2023 ACM SIGSAC Conference on Computer and Communications Security. 60–74.
  • Demetrio and Biggio (2021) Luca Demetrio and Battista Biggio. 2021. secml-malware: A Python Library for Adversarial Robustness Evaluation of Windows Malware Classifiers. arXiv:2104.12848 [cs.CR]
  • Demetrio et al. (2021a) Luca Demetrio, Battista Biggio, Giovanni Lagorio, Fabio Roli, and Alessandro Armando. 2021a. Functionality-preserving black-box optimization of adversarial Windows malware. IEEE Transactions on Information Forensics and Security 16 (2021), 3469–3478.
  • Demetrio et al. (2021b) Luca Demetrio, Scott E. Coull, Battista Biggio, Giovanni Lagorio, Alessandro Armando, and Fabio Roli. 2021b. Adversarial EXEmples: A survey and experimental evaluation of practical attacks on machine learning for Windows malware detection. ACM Transactions on Privacy and Security (TOPS) 24, 4 (2021), 1–31.
  • Gibert et al. (2020) Daniel Gibert, Carles Mateu, and Jordi Planes. 2020. The rise of machine learning for detection and classification of malware: Research developments, trends and challenges. Journal of Network and Computer Applications 153 (2020), 102526.
  • Han et al. (2019a) Weijie Han, Jingfeng Xue, Yong Wang, Lu Huang, Zixiao Kong, and Limin Mao. 2019a. MalDAE: Detecting and explaining malware based on correlation and fusion of static and dynamic characteristics. Computers & Security 83 (2019), 208–233.
  • Han et al. (2019b) Weijie Han, Jingfeng Xue, Yong Wang, Zhenyan Liu, and Zixiao Kong. 2019b. MalInsight: A systematic profiling based malware detection framework. Journal of Network and Computer Applications 125 (2019), 236–250.
  • Hershberger (2023) Jeff Hershberger. 2023. The hidden costs of false positive security alerts. https://www.intrusion.com/blog/the-hidden-costs-of-false-positive-security-alerts/ Accessed: April 2024.
  • Jindal et al. (2019) Chani Jindal, Christopher Salls, Hojjat Aghakhani, Keith Long, Christopher Kruegel, and Giovanni Vigna. 2019. Neurlux: Dynamic malware analysis without feature engineering. In Proceedings of the 35th Annual Computer Security Applications Conference. 444–455.
  • Kaspersky (2021) Kaspersky. 2021. Machine Learning for Malware Detection. https://media.kaspersky.com/en/enterprise-security/Kaspersky-Lab-Whitepaper-Machine-Learning.pdf Accessed: April 2024.
  • Kubovič (2017) Ondrej Kubovič. 2017. False positives can be more costly than malware infection. https://www.welivesecurity.com/2017/05/09/false-positives-can-costly-malware-infection/ Accessed: April 2024.
  • Kumar et al. (2019) Nitesh Kumar, Subhasis Mukhopadhyay, Mugdha Gupta, Anand Handa, and Sandeep K. Shukla. 2019. Malware Classification using Early Stage Behavioral Analysis. In 2019 14th Asia Joint Conference on Information Security (AsiaJCIS). 16–23. https://doi.org/10.1109/AsiaJCIS.2019.00-10
  • Lucas et al. (2021) Keane Lucas, Mahmood Sharif, Lujo Bauer, Michael K. Reiter, and Saurabh Shintre. 2021. Malware makeover: Breaking ML-based static analysis by modifying executable bytes. In Proceedings of the 2021 ACM Asia Conference on Computer and Communications Security. 744–758.
  • Mandiant (2022) Mandiant. 2022. Speakeasy: A portable, modular, binary emulator designed to emulate Windows kernel and user mode malware. https://github.com/mandiant/speakeasy.
  • Microsoft ([n. d.]) Microsoft. [n. d.]. Evolution of Malware Prevention. https://info.microsoft.com/rs/157-GQE-382/images/Windows%2010%20Security%20Whitepaper.pdf Accessed: April 2024.
  • Ngo et al. (2023) Mao V. Ngo, Tram Truong-Huu, Dima Rabadi, Jia Yi Loo, and Sin G. Teo. 2023. Fast and efficient malware detection with joint static and dynamic features through transfer learning. In International Conference on Applied Cryptography and Network Security. Springer, 503–531.
  • Pendlebury et al. (2019) Feargus Pendlebury, Fabio Pierazzi, Roberto Jordaney, Johannes Kinder, and Lorenzo Cavallaro. 2019. TESSERACT: Eliminating experimental bias in malware classification across space and time. In 28th USENIX Security Symposium (USENIX Security 19). 729–746.
  • Radford et al. (2018) Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al. 2018. Improving language understanding by generative pre-training. (2018).
  • Raff et al. (2018) Edward Raff, Jon Barker, Jared Sylvester, Robert Brandon, Bryan Catanzaro, and Charles K. Nicholas. 2018. Malware detection by eating a whole EXE. In Workshops at the Thirty-Second AAAI Conference on Artificial Intelligence.
  • Raff et al. (2020) Edward Raff, Bobby Filar, and James Holt. 2020. Getting passive aggressive about false positives: Patching deployed malware detectors. In 2020 International Conference on Data Mining Workshops (ICDMW). IEEE, 506–515.
  • Rosenberg et al. (2018) Ishai Rosenberg, Asaf Shabtai, Lior Rokach, and Yuval Elovici. 2018. Generic black-box end-to-end attack against state of the art API call based malware classifiers. In Research in Attacks, Intrusions, and Defenses: 21st International Symposium, RAID 2018, Heraklion, Crete, Greece, September 10–12, 2018, Proceedings 21. Springer, 490–510.
  • Saxe and Berlin (2015) Joshua Saxe and Konstantin Berlin. 2015. Deep neural network based malware detection using two dimensional binary program features. In 2015 10th International Conference on Malicious and Unwanted Software (MALWARE). IEEE, 11–20.
  • ESET Technology ([n. d.]) ESET Technology. [n. d.]. The multilayered approach and its effectiveness. https://www.eset.com/fileadmin/ESET/US/docs/about/ESET-Technology-Whitepaper.pdf Accessed: April 2024.
  • The FLARE Team (2020) The FLARE Team. 2020. capa, a tool to identify capabilities in programs and sandbox traces. https://github.com/mandiant/capa Accessed: April 2024.
  • Trizna (2022) Dmitrijs Trizna. 2022. Quo Vadis: Hybrid Machine Learning Meta-Model Based on Contextual and Behavioral Malware Representations. In Proceedings of the 15th ACM Workshop on Artificial Intelligence and Security. 127–136.
  • Trizna et al. (2023) Dmitrijs Trizna, Luca Demetrio, Battista Biggio, and Fabio Roli. 2023. Nebula: Self-Attention for Dynamic Malware Analysis. arXiv preprint arXiv:2310.10664 (2023).
  • Wiyatno et al. (2019) Rey Reza Wiyatno, Anqi Xu, Ousmane Dia, and Archy De Berker. 2019. Adversarial examples in modern machine learning: A review. arXiv preprint arXiv:1911.05268 (2019).
  • Yara-Rules (2019) Yara-Rules. 2019. rules. https://github.com/Yara-Rules/rules/tree/master/malware Accessed: November 2023.
  • Yuan et al. (2019) Xiaoyong Yuan, Pan He, Qile Zhu, and Xiaolin Li. 2019. Adversarial Examples: Attacks and Defenses for Deep Learning. IEEE Transactions on Neural Networks and Learning Systems 30, 9 (2019), 2805–2824. https://doi.org/10.1109/TNNLS.2018.2886017