Alert correlation framework for malware detection by anomaly-based packet payload analysis


Author’s Accepted Manuscript

Alert Correlation Framework for Malware Detection by Anomaly-based Packet Payload Analysis

Jorge Maestre Vidal, Ana Lucila Sandoval Orozco, Luis Javier García Villalba

www.elsevier.com/locate/jnca

PII: S1084-8045(17)30274-6
DOI: http://dx.doi.org/10.1016/j.jnca.2017.08.010
Reference: YJNCA1956

To appear in: Journal of Network and Computer Applications
Received date: 14 January 2016
Revised date: 21 February 2017
Accepted date: 22 August 2017

Cite this article as: Jorge Maestre Vidal, Ana Lucila Sandoval Orozco and Luis Javier García Villalba, Alert Correlation Framework for Malware Detection by Anomaly-based Packet Payload Analysis, Journal of Network and Computer Applications, http://dx.doi.org/10.1016/j.jnca.2017.08.010

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting galley proof before it is published in its final citable form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

Alert Correlation Framework for Malware Detection by Anomaly-based Packet Payload Analysis

Jorge Maestre Vidal, Ana Lucila Sandoval Orozco, Luis Javier García Villalba

Group of Analysis, Security and Systems (GASS), Department of Software Engineering and Artificial Intelligence (DISIA), School of Computer Science, Office 431, Universidad Complutense de Madrid (UCM), Calle Profesor José García Santesmases s/n, Ciudad Universitaria, 28040 Madrid, Spain

Abstract

Intrusion detection based on identifying anomalies typically emits a large number of reports about the monitored malicious activities; hence the information gathered is difficult to manage. In this paper, an alert correlation system capable of dealing with this problem is introduced. The work carried out has focused on the study of a particular family of sensors, namely those which analyze the payload of network traffic looking for malware. Unlike conventional approaches, the information provided by the network packet headers is not taken into account. Instead, the proposed strategy considers the payload of the monitored traffic and the characteristics of the models built during the training of such detectors, in this way supporting general-purpose incident management tools. It aims to analyze, classify and prioritize the alerts issued, based on two criteria: the risk of threats being genuine and their nature. Incidents are studied both one-to-one and in a group context. This implies the consideration of two different processing layers: the first allows fast reactions and resilience against certain adversarial attacks, while the deeper layer facilitates the reconstruction of attack scenarios and provides an overview of potential threats. Experiments conducted by analyzing real traffic demonstrated the effectiveness of the proposal.

Keywords: Alert correlation, anomalies, Intrusion Detection System,

Tel. +34 91 394 76 38, Fax: +34 91 394 75 47
Email addresses: [email protected] (Jorge Maestre Vidal), [email protected] (Ana Lucila Sandoval Orozco), [email protected] (Luis Javier García Villalba)

Preprint submitted to Journal of Network and Computer Applications

August 22, 2017

malware, network, payload

1. Introduction

Intrusion Detection Systems (IDS) are defensive tools that monitor and analyze events occurring on the protected system, looking for malicious activities. In their early days, detection strategies were based on identifying patterns corresponding to known threats, an approach well known as signature detection (Hung-Jen et al. (2013)). However, given the rapid proliferation of attack techniques and the emergence of thousands of new, unknown malware specimens, detection strategies evolved towards identifying anomalous behavior. These kinds of sensors are well known as Anomaly-Based Systems (ABS) (Bhuyan and Bhattacharyya (2014)). Unlike signature-based recognition systems, ABS require neither complex databases nor updated attack collections, and their accuracy depends on the quality of the models of the protected system's usage. But despite progress in this area, malware has grown dramatically in recent years. According to the European Network and Information Security Agency (ENISA), malicious software has become the main threat to current information technologies (Marinos, L. and Sfakianakis, A. (2015)). There are two main reasons for this: firstly, more and more users rely on these services for particularly sensitive activities, such as e-commerce or access to confidential information; secondly, malware is much more sophisticated, having acquired a greater capacity for spreading, infection and evasion of defensive methods. Consequently, enhancing the effectiveness of conventional detection systems often requires complementing them with other tools, such as access lists, honeypots or alert correlation systems. The latter are the primary concern of this paper. Alert correlation systems facilitate the management of alerts issued by intrusion detection systems and improve the decision on countermeasures to be deployed.
This involves implementing different stages of information processing, such as reduction, normalization, aggregation, validation or filtering (Salah et al. (2013)). In the past decade, a large number of proposals have been published in this regard. But despite being a widely studied problem, most of the efforts have resulted in solutions that address generalist goals, and whose real effectiveness differs considerably from that shown in their experimentation and development steps. Some major causes of this problem are presented in (Mirheidari et al. (2013)), highlighting the growth of the

volume of information to be processed, the need to manage a greater number of alerts, their heterogeneity, limitations of the different monitoring environments, and problems of IDS such as high false positive rates, validation of incidents, or privacy. This paper focuses on the correlation of alerts issued by network intrusion detection systems based on statistical analysis of the payload for recognition of anomalous content, thereby enhancing them against most of the problems previously described. The approach is motivated by the need to complement the Advanced Payload Analyser Preprocessor (APAP), an ABS that has been developed by our research group (Villalba et al. (2012, 2016)). In the experimentation, the general-purpose alert correlation schemes provided a first line of incident management. But they mainly rely on deductions drawn from the information contained in the packet headers, such as IP addresses, protocols, duration of communications or their context. Because of this, they overlook important information directly related to the nature of the sensor, which can only be inferred by studying features specific to payload-based detection: the payload of suspicious packets and the modeling strategies of the intrusion detection systems. On this basis we propose a framework for the correlation of incidents, suited to similar detectors, and able to supplement conventional tools for incident management. Our framework correlates alerts in a multi-layered scheme that allows their individual and group treatment. A first stage facilitates fast decision making; it is also required in order to provide resilience against certain adversarial attacks. On a second layer, sequences of incidents are analyzed. The effectiveness of the proposal has been demonstrated in a real use case, where it is deployed to complement APAP.
It has been evaluated by analyzing alerts reported when monitoring traffic on the subnet of the Faculty of Computer Science at the Complutense University of Madrid. This paper is organized into eight sections, the first of them being this introduction. Related work is described in section 2. In section 3, the main advances in anomaly-based malware detection are discussed, focusing on those deployed in network environments. In section 4 the proposed framework is introduced. In section 5 the implementation is explained. In section 6 all aspects concerning the experiments conducted are described. In section 7 the results of the experiments are presented. Finally, in section 8 conclusions and future work are discussed.


2. Alert Correlation

Different studies have recently been published concerning the management of alerts issued by intrusion detection systems. Some of them provide general approaches, as is the case of (Salah et al. (2013); Mirheidari et al. (2013)). Others delve into more specific areas, such as collaboration between detectors (Elshoush and Osman (2011)) or reduction of false positive rates (Hubballi and Suryanarayanan (2014)). In (Davis and Clark (2014)) the principal methods applied for preprocessing network incidents are reviewed. Throughout the literature, alert correlation systems have usually been grouped in the way proposed by (Salah et al. (2013)). This taxonomy takes as the axis for classification the object of study of the different correlation methods. Bearing this in mind, three major clusters are distinguished: similarity-based, sequential-based, and case-based methods. Similarity-based correlation often aims at aggregation and reduction of the reported alerts. To do this, various attributes or characteristics are considered, ranging from the IP addresses and ports associated with the incident to the frequency of occurrence of certain patterns in packet content. In (Tjhai et al. (2010)) there is a clear example, where in addition to these attributes, other traits, such as protocol or priority of treatment, are considered. In (Cam et al. (2014)) similar features are applied to the processing of events in mobile networks, which implies taking into account other specific features, such as flags or the duration of communications. Similarity-based correlation also considers time-space relationships of the monitored incidents. One example is (Amaral et al. (2012)), where from these features it is possible to infer the scene of previously identified attacks. Other proposals take specific attributes of detection systems as a reference. This is the case with (Chen et al.
(2014)), where correlation is carried out from a compressed version of the dataset involved in the training steps of the detector. Sequential-based correlation considers causality relationships between incidents, usually by studying their preconditions and postconditions. Preconditions are defined as the set of requirements needed to trigger a threat; postconditions are the consequences of their execution (when they succeed). A good example of this methodology is proposed in (Zhaowen et al. (2010)), where the different preconditions and postconditions of the detected intrusions are linked in order to allow rebuilding the complete scenario of the attacks. In (Ramaki et al. (2015)) it is carried out through the construction of Causal Correlation Matrices (CCM). Another common representation is graphs. In

(Albanese et al. (2011)) this trend is explored in depth, proposing a generalized dependency graph and an extension of the classical definition of attack graph with the notion of timespan distribution. The latter combines dependency and attack graphs, bridging the gap between known vulnerabilities and the services that could ultimately be affected by the corresponding exploits. Building attack scenarios often requires the implementation of different and complex strategies. This is the case of Attribute Context-Free Grammars (ACF-Grammars) (Al-Mamory and Zhang (2008)), or data mining algorithms (Farhadi et al. (2011); Alsubhi et al. (2012)); the first of them is based on Hidden Markov Models (HMM), and the second on fuzzy logic. Case-based correlation is sustained by knowledge bases and Case-Based Reasoning (CBR), which is defined as the process of solving new problems based on the solutions of similar past problems. It thus depends on algorithms in charge of recognizing specific behavior patterns stored in data structures. An example of this can be seen in (Elshoush and Osman (2013)), where the attack scenarios that integrate the case base are constructed by clustering incidents. This proposal infers future threats, allowing the case base to be updated with new intrusions. In (Ahmadinejad et al. (2011)) a similar approach is proposed, which focuses on the automatic updating of the dependency graphs. In (Shittu et al. (2015)) clustering algorithms are applied to classify incidents depending on the case base, in order to prioritize their treatment. Although most of these proposals pose solutions of general scope, several studies have introduced more specific schemes, easily adaptable to recent use cases. An example of this is the research related to the verification of alerts. In (Njogu et al. (2013)) the need for validation of incidents in large networks is discussed.
It concludes that in this way it is possible to separate isolated reports from those that really are part of attack scenarios, an important advantage over conventional alert correlation methods. Another example is (Pérez et al. (2014)), where a reputation scheme for verification and prioritization of alerts in collaborative systems is proposed. The trend of deploying conventional alert correlation systems on specific monitoring environments has encouraged the publication of different papers discussing the complexity of their adaptation. This is the case of (Raftopoulos et al. (2013)), where the effectiveness of different correlation methods is studied. In (Mustapha et al. (2012)) the behavior of several honeypot databases that collect information about malware propagation and security information about

web-based server profiles is explored. Their study focuses on the use of these databases to correlate local alerts with global knowledge, concluding that the information in the knowledge base often suffers from several limitations, such as the lack of precise information and standard representation. Finally, another interesting approach is provided by (Syed et al. (2013)), where alert correlation is implemented on Grid computing in order to meet local security policy rules, with particular emphasis on the identification of cross-domain attacks.

3. Malware Detection on Networks by Analysis of Payload

Network-based Intrusion Detection Systems (NIDS) monitor the traffic flowing through them looking for malicious activity. At their inception, analysis strategies for the recognition of previously known threats were applied (Hung-Jen et al. (2013)). But the rapid proliferation of malware led to the need to develop strategies capable of identifying new specimens, with anomaly-based detection considered the most common scheme. Since this article focuses on the correlation of alerts issued by ABS, signature detection schemes are out of scope; instead, the anomaly detection methods are studied thoroughly. The most important approaches in this field are collected in (Bhuyan and Bhattacharyya (2014)). These are generally based on the construction of representations of the habitual and legitimate uses of the protected environment, where sensors report alerts when monitored activities differ significantly from the legitimate usage model. But despite their popularity, designing ABS involves tackling various inconsistencies, which are often not considered by developers nowadays. These issues are discussed in detail in (Viswanathan et al. (2013)), where the main risk factors at the different stages of designing ABS are described.
For example, they include improper characterization of training data, incorrect sampling of input data, wrong feature selection, lack of ground-truth data, poorly defined threat scope, or insufficient definition of anomalies. Some of them are particularly difficult to solve; this is the case when considering an unrepresentative dataset, which is becoming more frequent because of the upward trend in heterogeneity in current networks. It is also a fact that even the wrongly labeled “legitimate use samples” could contain attacks, as addressed in (Cretu et al. (2008)). Anomaly detection applied to the recognition of malware in networks is typically based on statistical analysis of the payload. This is a static detection scheme, very susceptible to changes in the monitored environment, such as

encryption or variations in protocols. It is also important to note that this method is part of the technologies for Deep Packet Inspection (DPI). Because of this, its deployment should consider data protection policies and confront the recent controversy around its application (Stalla-Bourdillon et al. (2014)). Most of the current approaches in this area are based on PAYL (K. and Stolfo (2004)). In PAYL, network models were originally constructed by taking into account 256 different features of the payload. It was updated in (Wang et al. (2005)) because of high false positive rates, and the new version introduced the extraction of features by the n-gram method; probably this was one of its most influential characteristics over subsequent proposals. PAYL fell into disuse when its vulnerability to certain imitation-based adversarial attacks was demonstrated in (Fogla et al. (2006)); in particular, it was weak against mimicry attacks and some polymorphic specimens. Since then, the research community has focused on reducing false positive rates and on strengthening detectors to face similar threats. This has given rise to a large family of detectors, among which it is important to highlight ANAGRAM (Wang et al. (2006)), McPAD (Perdisci et al. (2009)), HMMPayl (Ariu et al. (2011)), RePIDS (Jamdagni et al. (2013)) and APAP (Villalba et al. (2012, 2016)). But despite their proven accuracy in experimental environments, nowadays there is discrepancy about their effectiveness. This is largely due to studies such as (Hadziosmanovik et al. (2012)), where their difficulty of adaptation to current networks is demonstrated. The major complications derive from some of the inconsistencies previously described, with high false positive rates being the biggest problem. Hence there is a real need to develop alert correlation systems able to prioritize the treatment of alerts and reduce labeling errors when analyzing legitimate traffic.

4. Alert Correlation Framework

The proposed alert correlation framework considers the particularities of the ABS for malware detection based on analyzing payloads. Taking these peculiarities into account leads us to define the following design principles and limitations:

• The detectors to be complemented have a centralized architecture. This is because there are few proposals that raise collaborative schemes on the analysis of payload content. In addition, centralized analysis predominates in the most relevant publications in this field.

• Detectors analyze traffic packet-by-packet. Network modeling is based on analyzing their payload, because it is the field that contains the malicious software.

• The peculiarities of the detectors determine the features required for establishing usage models.

• By taking certain features of the payload into account, it is possible to infer properties of malware. Note that selecting the proper payload features helps to infer these properties. But deciding the features to be taken into account, as well as defining their properties, depends directly on the knowledge base about the considered intrusions, the nature of the threats and how the alert correlation framework is instantiated. Hence, this is a problem to be managed according to the requirements of each use case, which is out of the scope of our contribution.

• When the payload of a packet satisfies certain properties defined in the training stage, an alert is issued. Alerts may also be reported when the payload differs considerably from the legitimate use model.

• Malware can reach the victim distributed across various packets. In this case, it is common that each of them contains code able to perform a different intrusion step, but necessary to serve the next malicious action.

• IDS robust against evasion methods usually operate in a non-deterministic way, in order to avoid being enumerated. They are also capable of identifying certain obfuscation methods based on polymorphism, by identifying the invariant part of their infection vector.

Based on these assumptions, a framework for alert correlation suited to centralized NIDS based on identifying anomalies in the payload has been designed. To this end, the information provided by alerts is analyzed considering the various features from which the sensor makes decisions. Thus both systems must share metrics and modeling methods. Therefore, it is possible to expect that both the alert correlation system and the complemented intrusion detection system exhibit similar features.
For example, if the sensors on which our approach acts are capable of dealing heterogeneously with different networking protocols (i.e. IPv4, IPv6, etc.), it is possible to expect


the same behavior when managing their outputs, as long as the right pattern recognition methods and rule definitions are implemented. This hinders scalability, but takes into account that ABS for payload analysis usually provide very specific information, such as internal rules or distances between usage models, which cannot always be used by other security devices. Additionally, it must be understood that our proposal sits one processing level above the ABS, so it manages the information emitted by the sensor once the gathered data is analyzed. Because of this, the proposal is indifferent to the difficulties associated with monitoring the traffic payload. For example, if the traffic flowing through the network is encrypted, this is a problem that the ABS will have to deal with. In light of this, it is important to note that this proposal endeavors to adapt to the peculiarities of the detectors it will complement, unlike a very important part of the approaches in the bibliography.

4.1. Architecture

The proposed framework architecture combines different classifiers in order to offset the drawbacks associated with the deployment of anomaly detectors. The idea behind the ensemble of classifiers is to make decisions bearing in mind the conclusions reached by different “experts” (in this case, its components). This allows, inter alia, considering different datasets/subsets of training data, calibrating sensors with different training parameters and implementing different classification methods. Additionally, it allows studying different features and their relationships. The advantages of this methodology are clearly described in (Zimek and Vreeken (2015)), where the authors draw on the old Indian tale “The blind men and the elephant” to describe how different “experts” in different circumstances may reach different conclusions about the same object of analysis.
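To make the ensemble idea concrete, the following minimal sketch shows several simple "experts" combined by majority vote. The feature names and thresholds are hypothetical illustrations, not the paper's implementation; the framework's actual experts are the AD and ND components.

```python
# Illustrative sketch of an ensemble of classifiers: each "expert" judges an
# alert from a different feature, and the verdicts are combined by majority
# vote. Feature names and thresholds here are hypothetical.

def make_expert(feature, threshold):
    """Build an expert that flags an alert when one feature exceeds a threshold."""
    return lambda alert: alert.get(feature, 0.0) > threshold

EXPERTS = [
    make_expert("distance_to_model", 0.6),  # divergence from the usage model
    make_expert("rare_ngram_ratio", 0.5),   # share of rarely seen byte n-grams
    make_expert("triggered_rules", 2),      # number of detection rules fired
]

def ensemble_verdict(alert):
    """Majority vote over the individual experts' conclusions."""
    votes = sum(expert(alert) for expert in EXPERTS)
    return votes > len(EXPERTS) / 2

suspicious = {"distance_to_model": 0.8, "rare_ngram_ratio": 0.7, "triggered_rules": 1}
print(ensemble_verdict(suspicious))  # two of three experts agree: True
```

Each expert can be trained on a different data subset or with different parameters, so the vote is more robust than any single view, which is exactly the advantage the ensemble methodology pursues.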
In order to enhance the anomaly-based detection methods, the framework considers conclusions provided by “experts” that analyze different features of the alerts reported by the ABS. In particular, alerts are managed based on two criteria: the differences between models built during detection and models built from training data, and the real nature of the attacks. Bearing this in mind, two analysis components are distinguished: Anomaly Diagnosis (AD) components and Nature Diagnosis (ND) components. AD components perform correlation by analyzing the degree of anomaly of the samples which triggered the emission of alerts. The main idea of this analysis is that the larger the difference of the payload of detected packets with respect to the legitimate payload model, the

greater the possibility of containing real threats. On the other hand, ND components correlate incidents by taking their nature as the reference point. Both criteria are important, and before deploying countermeasures they must be studied together, because in certain situations considering only one of them can lead to the implementation of insufficient or disproportionate actions. For example, the AD components could be sure that an incident is a real threat; but if its nature presents low risk, it is not advisable to give it high-priority treatment. In the opposite case, an ND component could report highly dangerous content within a packet; but if it has a low degree of abnormality, there is a high chance that it leads to the issuance of a false positive, a situation that must be taken into account by operators.
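A minimal sketch of this two-criteria reasoning follows. The nature/risk table and the multiplicative combination are illustrative assumptions, not the paper's scoring scheme:

```python
# Hedged sketch: an alert's treatment priority combines its degree of anomaly
# (AD component) with the risk of its diagnosed nature (ND component).
# The nature/risk table and the weighting below are hypothetical.

NATURE_RISK = {"shellcode": 1.0, "worm": 0.9, "scan": 0.3, "unknown anomaly": 0.5}

def prioritize(anomaly_degree, nature):
    """High priority only when both criteria are high; one alone can mislead."""
    risk = NATURE_RISK.get(nature, 0.5)  # unlisted natures get a neutral risk
    return anomaly_degree * risk

# Very anomalous packet, but a low-risk nature: moderate priority.
print(round(prioritize(0.95, "scan"), 3))       # 0.285
# Dangerous nature, but barely anomalous: likely a false positive.
print(round(prioritize(0.20, "shellcode"), 3))  # 0.2
```

Either score alone would have pushed one of these alerts to the top of the queue; combining them reflects the insufficient/disproportionate-action argument above.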

NIDS

Alerts

Alert Correlation Framework Reports

Packets

Feature Extraction Distance Measures

Triggered Rules

AD

ND {Anomaly, Nature}

Diagnostics Packet

ND

Sequences

Sequence {Sequential Threat}

Figure 1: Architecture of the framework for alert correlation

The topology of the framework is multilayered, as shown in Fig. 1. It is structured into two processing stages. The first one is responsible for event management. It considers the likelihood and nature of every alert separately, at packet level. Both AD and ND components are involved therein, and their main goals are prioritization and differentiating false positives from real threats. Given that packets are the basic unit of measure managed by ABS that implement DPI technologies for payload analysis, it is possible to state that, under the right circumstances, this processing stage can perform online.

It facilitates fast decision making, and provides the information required by the sequence-level tasks, which may only draw conclusions after processing a certain amount of packets. The second stage processes alerts sequentially. Every sequence is composed of warnings captured in a certain time period or once a number of events are gathered, so this stage aims at identifying relationships between individual events previously processed by the first layer. To associate incidents, an ND component applying a base of knowledge with multiple-step threats is required. It is important to emphasize that despite the relationship between both layers, the result of the analysis carried out in each of them should be interpreted independently. For instance, a succession of low-risk alerts processed individually might not seem harmful, but when acting together they could form part of a multi-step attack of high risk. Furthermore, their implementation by non-deterministic methods of classification is also recommended, which hampers their enumeration and makes them difficult to evade (Pastrana et al. (2014)).

4.2. Anomaly Diagnosis Component

Anomaly Diagnosis is performed only by the AD component at packet analysis, and represents quantitative evaluation criteria. For this purpose the alert features provided by the ABS are analyzed. They allow the distance to be determined between the monitored events and the legitimate system usage. Before establishing labels, the various risk groups are built through a clustering process prior to the detection step, which is integrated into the ABS training stage (see Fig. 2). Let D = {D1, ..., Dn}, with 0 < n, be the dataset used in the ABS training stage. The subset A = {A1, ..., Am1}, A ⊂ D, contains its malicious samples, and the subset L = {L1, ..., Lm2}, L ⊂ D, contains the legitimate payloads, where 0 < m1, m2 and A ∪ L = D.
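As a hedged illustration of the clustering-based diagnosis built on these definitions, the following one-class sketch groups alerts by the distance of their payload model from the legitimate-usage threshold. The threshold, centroids and labels are hypothetical values; in the real framework the metrics are supplied by the complemented ABS.

```python
# One-class sketch: alerts are grouped by how far their payload model lies
# from the legitimate-usage threshold of the ABS. All values are hypothetical.

LEGIT_THRESHOLD = 0.30                                   # ABS decision threshold
CENTROIDS = {"low": 0.40, "medium": 0.60, "high": 0.85}  # anomaly clusters

def diagnose(distance):
    """Assign the distance reported with an alert to its nearest cluster."""
    if distance <= LEGIT_THRESHOLD:
        return "legitimate"  # such a packet would not have raised an alert
    return min(CENTROIDS, key=lambda label: abs(CENTROIDS[label] - distance))

print(diagnose(0.45))  # near the threshold, likely a false positive: 'low'
print(diagnose(0.90))  # far from the threshold, likely a real attack: 'high'
```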
The anomaly diagnosis can be seen as a simple clustering problem, which is performed in two steps:

1. Definition of clusters. The various anomaly classifications are established in the cluster-definition phase. This step occurs during the training stage of the ABS, when L is taken into account for building a model that represents the habitual and legitimate network usage. Its most relevant features are ML = {ML1, ..., MLp}, 0 < p. The definition of clusters considers ML, and in some cases also A; the latter depends on the number of classes in the ABS training.

Figure 2: Anomaly Diagnosis Component. (Diagram: during ABS training, legitimate and contrast samples yield legitimate and contrast features; distance calculation and clustering define the clusters, and at detection time alerts are aggregated into them, producing the {Anomaly} diagnosis.)

In one-class training, only the legitimate samples are involved. This also occurs when A is unrepresentative, so clusters are built based on the distance of the observations to thresholds of the legitimate usage model. However, when A is representative or the ABS applies two-class training, samples of attacks can also take part in the definition of clusters. Unlike the previous case, in this situation clustering algorithms consider only the distances between the malicious samples and the ABS decision thresholds. This is an intuitively more precise method, since the central values of the defined clusters always fall outside the legitimate model, as is the case with those provided by packets that caused the issuance of alerts.

2. Aggregation. Alerts are aggregated at the detection stage of the ABS, in real time. This process is performed every time the ABS issues an alert. The alert is then included in the cluster of anomalies with the greatest

similarity, according to the implementation of the clustering algorithm. In the end, packets in clusters closer to the threshold of the legitimate model have a lower degree of divergence, and therefore they are more likely to be false positives; the farther away from the threshold, the greater the likelihood of being real attacks. In (Weller-Fahy et al. (2014)) different ways of estimating, building and implementing these kinds of distances/thresholds are gathered and discussed. Our proposal adapts to any of them, but the selection and implementation of those that will perform best for every use case depends on the characteristics of the sensor. The metrics (thresholds and distances) are provided by the complemented ABS, so the framework takes advantage of the characteristics of the analytics carried out by the NIDS.

4.3. Nature Diagnosis Component

The nature of an anomaly is the kind of threat that it triggers. The Nature Diagnosis component determines the nature of the alerts issued by the sensor. The scope of this component is established by a base of knowledge containing the set of threats that it is capable of detecting. The base of knowledge can be imported or built in the training stage of the ABS. The natures are inferred from packets or sequences, in two variants of the ND component: ND at packet analysis and ND at sequence analysis, each adapted to its processing layer. The following describes each of them in detail.

4.3.1. Packet Analysis

The ND for alerts aims to establish the classification of each of the suspicious packets traversing the NIDS, according to the rules that are activated in the detection phase. It is a quick-response module that under no circumstances should obstruct the analysis of the monitored traffic in real time. As explained later, this condition has led us to choose an implementation via Artificial Neural Networks (ANN).
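As a hedged sketch of this packet-level classification, the following maps observed payload characteristics to threat categories through an explicit rule base. The characteristics, rules and categories are hypothetical examples rather than the paper's knowledge base, and the authors' own implementation uses an ANN rather than literal rule matching:

```python
# Illustrative rule base: each rule links a set of payload characteristics to
# a threat category; a packet may match several rules, and content matching
# none receives the default "unknown anomaly" label. Rules are hypothetical.

RULES = [
    ({"never-seen-before", "executable-bytes"}, "code injection"),
    ({"infrequent", "long-repetition"}, "buffer overflow"),
]

def classify(characteristics):
    """Return every category whose required characteristics are all present."""
    matched = [cat for needed, cat in RULES if needed <= characteristics]
    return matched or ["unknown anomaly"]

print(classify({"never-seen-before", "executable-bytes", "infrequent"}))
# ['code injection']
print(classify({"infrequent"}))
# ['unknown anomaly']
```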
However, regardless of the classification method chosen, this level requires a knowledge base to link natures of threats with the features observed in the payload. Fortunately there are many public databases with information on attacks and their characteristics (CERT-EU (2016); US-CERT (2016)) that can be used by our framework. But this is not enough, because the nature diagnosis must include specific features of the ABS to be complemented. This leads us to build a rule base which takes into account the properties of the NIDS and the available knowledge about


attacks. In order to provide a good design, the following assumptions should be considered:

• Packets with anomalous contents could match more than one rule.

• In the rule base, a characteristic is a set of features of the payload with certain values. For example, the characteristic “never-seen-before” occurs whenever a certain segment of the binary content of the payload has never been recorded before by the ABS. Another characteristic could be “infrequent”, which refers to segments in the payload uncommonly seen, and thus suspected of not being part of the usual and legitimate model of the network. Note that the previous examples, alone, might not pose a threat. However, when combined with similar observations, they may be the clue to identifying incidents (Wang et al. (2006); Villalba et al. (2012, 2016)).

• Each rule is determined by a set of characteristics and infers a category.

• A category could be inferred by different rules. Thus, the same category may be characterized by alternative features of the payload, or even by different relationships among them.

• The nature diagnosis requires some fault tolerance capability. Consequently, it is considered that part of the rules activated when analyzing payload content could be noise.

In view of these premises, the characteristics that shape the rule base involve relationships between the features of the payload included in the metrics of the ABS. The set of characteristics is defined as C = {C1, · · · , Cn} and the set of rules is R = {R1, · · · , Rm}. Whenever a rule is triggered, a category of the set E = {E1, · · · , El} is inferred. Thus, for example, if Ri, 0 < i ≤ m, is triggered by a subset of C constrained by p, q where 0 < p < q ≤ n, then Cp ∧ Cp+1 ∧ · · · ∧ Cq−1 ∧ Cq ⇒ Ek, where Ek is the category inferred by Ri. The categories are the different classifications that the rule base allows the framework to perform. This is why there must be at least one category in each rule base.
It is defined as “unknown anomaly”, and it is the default classification of contents that have not been tagged in any other way. How to construct the rule base depends on the ABS modeling strategy, which often must consider its learning methodology. In the experimentation,


the implemented framework considers as categories the various attack families in the training dataset (virus, spyware, botnet, · · · ). Each characteristic is a feature of the detection metric that has exceeded the threshold of legitimacy. Therefore, if the metric is formed by twelve features of the payload, the rule base has twelve characteristics, which only occur when their thresholds are exceeded. By making the NIDS analyze the samples of each family separately, it is possible to know the characteristics they have in common. From them, the detection rules are generated. In our implementation, characteristics have been ordered based on their frequency in the samples of each family, so rules are inferred when the characteristics found in the payload follow a certain distribution. Note that there are many solutions to similar problems related to data aggregation that may suit other ABS better. Some of them are compiled and discussed in (Jesus et al. (2015)). Nevertheless, this implementation is simple and didactic, in order to facilitate a greater understanding of the proposed framework. Given this, each time the detector issues an alert the following actions are performed:

1. The features of the payload are extracted.

2. The number of occurrences of each characteristic in the payload is counted.

3. The degree of representation of every characteristic in the payload is calculated. These values are referred to as scores of characteristics. In our implementation, given Ci, 0 < i ≤ n, the calculated score is its frequency in the payload frec(i).

4. The characteristics of the payload are sorted from highest to lowest score, so that the most representative ones are taken into greater consideration. The sorted list of scores is defined as ORD, where ORD(i), 0 < i ≤ n, is the characteristic in position i. In order to avoid inconsistencies, we assume ∀p, q ∈ C, p ≠ q : ORD(p) ≠ ORD(q).
Because of our implementation, this has forced us to create a disambiguation table where, if frec(p) = frec(q), their positions in ORD are indicated. Therein, the most frequent characteristics in categories corresponding to the most harmful families occupy the first positions.

5. The categories inferred by ORD and the rule base are determined. Ideally, the rule base should be constructed so that the response is

unique. In this case that category is the nature of the anomaly. When the anomaly is labeled with different categories, it is recommended that operators know each one of them. This is important because there are evasion methods that disguise high-risk threats in tangles of low/medium-risk attacks, thereby reducing their priority of treatment, as described in (Tsung-Huan et al. (2012)). For fast responses, a good option is to consider the category with the highest risk. Table 1 summarizes a rule base example that is able to establish the categories of the anomalous payload based on their two most representative characteristics. It considers E = {E1, E2, E3, E4, E5, E6, E7, E8}, C = {C1, C2, C3, C4} and R = {R1, R2, R3, R4, R5, R6, R7, R8, R9, R10, R11, R12}. To define the nature of an alert, it is enough to determine what rules are able to infer categories and what categories are inferred. Some rules require two characteristics and ignore the presence of others, such as {R1, R2, R7, R8, R10, R11}. There are also rules that require that several characteristics are not detected, such as {R3, R4, R5, R6, R9}. The rule R12 infers the default category (E8), unknown anomaly. The rule base depends on the knowledge base and the models built by the ABS. This can be better understood by assigning a value to each variable. For example, if C1 means “malware obfuscation” and C2 is “buffer overflow”, then R1 could be interpreted as “If most of the features found in the payload indicate the presence of malware obfuscation and buffer overflow methods, then it is added to the group of alerts with nature E1”, where E1 could be “polymorphic malware with privilege escalation”. In addition to this, if C4 is “remote control engine” then R7 and R8 mean “If most of the features found in the payload indicate the presence of malware obfuscation and remote control engines then the nature of the payload is E5”, which could be “trojan”.

4.3.2. Sequence Analysis
In a first approach, our alert correlation framework only consisted of the two previously described components. Thus, a hybrid scheme capable of correlating alerts in real time was presented. But in a thorough revision of the proposal, other particularly relevant issues were considered. Some of them are described below.

• The NIDS analyzes payloads packet-to-packet, which hinders its ability to detect malicious content distributed at various points in them.


R     Activation                                          E
R1    Ord(1) = C1 ∧ Ord(2) = C2                           E1
R2    Ord(1) = C2 ∧ Ord(2) = C1                           E1
R3    Ord(1) = C1 ∧ Ord(2) = C3 ∧ Ord(3, · · · , n) = ∅   E2
R4    Ord(1) = C2 ∧ Ord(2) = C1 ∧ Ord(3, · · · , n) = ∅   E2
R5    Ord(1) = C3 ∧ Ord(2, · · · , n) = ∅                 E3
R6    Ord(1) = C2 ∧ Ord(2, · · · , n) = ∅                 E4
R7    Ord(1) = C1 ∧ Ord(2) = C4                           E5
R8    Ord(1) = C4 ∧ Ord(2) = C1                           E5
R9    Ord(1) = C4 ∧ Ord(2, · · · , n) = ∅                 E6
R10   Ord(1) = C2 ∧ Ord(2) = C4                           E7
R11   Ord(1) = C4 ∧ Ord(2) = C2                           E7
R12   Others                                              E8

Table 1: Example of rule base in ND
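The scoring and rule lookup steps described above can be sketched in code. This is a hypothetical simplification, not the authors' exact implementation: the function names are ours, the disambiguation order is an assumption, and only a reduced subset of rules in the spirit of Table 1 is encoded.

```python
# Hypothetical sketch of the packet-level Nature Diagnosis: score the
# characteristics by frequency, sort them into ORD (ties broken by a
# disambiguation order, most harmful families first), then look up a
# category in a reduced rule base inspired by Table 1.

def build_ord(frec, priority):
    """Characteristics present in the payload, sorted by descending score;
    ties are broken by `priority`, so positions in ORD are unique."""
    present = [c for c in frec if frec[c] > 0]
    return sorted(present, key=lambda c: (-frec[c], priority.index(c)))

# Rules keyed by (Ord(1), Ord(2)); None stands for "no further characteristics".
RULES = {
    ("C1", "C2"): "E1", ("C2", "C1"): "E1",   # cf. R1, R2 in Table 1
    ("C1", "C4"): "E5", ("C4", "C1"): "E5",   # cf. R7, R8
    ("C3", None): "E3",                        # cf. R5
}

def diagnose(frec, priority=("C1", "C2", "C3", "C4")):
    """Infer the category (nature) of an anomalous payload."""
    ord_ = build_ord(frec, list(priority))
    if not ord_:
        return "E8"                            # default: unknown anomaly
    key = (ord_[0], ord_[1] if len(ord_) > 1 else None)
    return RULES.get(key, "E8")
```

For instance, a payload dominated by C1 (“malware obfuscation”) followed by C2 (“buffer overflow”) would be tagged E1.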

• The context in which an alert is issued often provides additional information that may be useful to operators. When a non-deterministic labeling proposes different classifications, this makes it easier to decide which of them should be countered.

• Forensic analysis of traffic traces must consider information at both packet and sequence analysis. Thus, operators have an overview of the incident and can verify its packet-to-packet composition.

• The addition of a higher level of abstraction improves the understanding of the incidents and the system efficiency. This is because in certain situations the proposal may not be able to operate in real time at packet analysis, mainly due to the continuous deployment of countermeasures in the most restrictive scenarios. Taking action at the sequence analysis may be an interesting solution in similar situations.

The Nature Diagnosis at sequence analysis is similar to that performed at packet analysis. It is drawn from the initial information provided by the labeling of each alert in the sequence, established at the previous layer. In this component a rule base similar to Table 1 is considered. But unlike the previous alert processing level, in this case the basic facts C = {C1, C2, C3, C4} are the diagnostics reported by the previous layer. For example, C1 could

indicate “polymorphic malware with privilege escalation”, matching E1 at packet analysis, or C5 could indicate “malware obfuscation and remote control engines”, matching E5. So facts at sequence analysis are diagnostics provided by the previous layer related to the content of the payload of each analyzed packet. Examples of diagnostics of sequences are “spyware”, “adware” or “trojan”, which are framed at a much higher level of abstraction. Note that, given the complexity of formulating this kind of production rules, a procedural knowledge base on multi-step attacks is also required. In the absence of enough information on this point, it could be inferred in a training stage, in the same way as in the ND component at packet analysis. Another aspect to consider is the delimitation of sequences. Making the cut in the right place could mean the difference between success and failure in the decision of countermeasures. The most discussed strategies are limitation by time intervals or by number of alerts, but more advanced techniques could also be implemented. In the experiments, and for simplicity, the criterion of the number of alerts has been adopted. We consider that, despite the interest raised by delimitation, its in-depth discussion is out of the scope of this paper.

5. Implementation
As an example of instantiation of the proposed framework, the ABS known as “Advanced Payload Analyser Preprocessor” (APAP) (Villalba et al. (2012, 2016)) is complemented. APAP is a NIDS that analyzes the payload of the monitored traffic in real time looking for malware. This is achieved by the application of rules generated in a semi-supervised learning approach, where a model of legitimate usage and a refinement step involving samples with malicious content are taken into account. Similar to other detectors of the PAYL (Wang et al. (2005)) family, APAP adopts the n-gram methodology when extracting features of the payload.
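As a minimal sketch of this methodology, the snippet below extracts byte-level n-grams and records them in a small Bloom-filter-style set, the kind of space-efficient structure commonly used to track which grams have already been observed. The filter parameters and hashing scheme are illustrative assumptions, not the sensors' actual configuration.

```python
import hashlib

def ngrams(payload: bytes, n: int):
    """All contiguous n-byte segments (n-grams) of a packet payload."""
    return [payload[i:i + n] for i in range(len(payload) - n + 1)]

class BloomFilter:
    """Minimal Bloom filter for recording n-grams observed during training;
    membership tests may yield false positives but never false negatives,
    which suits 'never-seen-before' checks."""
    def __init__(self, size_bits=8192, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item: bytes):
        # Derive num_hashes bit positions from salted SHA-256 digests.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(bytes([i]) + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item: bytes):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item: bytes):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))
```

After filling the filter with the n-grams of legitimate payloads, any n-gram of an incoming packet that is absent from the filter is guaranteed to be never-seen-before.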
In particular, the strategy that bears the closest resemblance is ANAGRAM (Wang et al. (2006)), mainly because both of them store the gathered data in Bloom filters and take advantage of the detection rules of Snort (Snort (2015)). In APAP, the monitored payloads are classified and ordered decreasingly, using the frequency of occurrence of each n-gram of binary content as criterion. These occurrence frequencies are studied by principal component analysis techniques. In order to determine the nature of the payload content, the n most significant values constructed during training are compared with those of the packets to be analyzed. In our experimentation, the following values of n were considered: {[1 · · · 10], 12, 14, 16, 32, 64, 96, 128}. Note that, a priori, our framework does not impose restrictions on the number of metrics or their characteristics. If constraints occur, they are derived from limitations of the algorithms that operators have decided to implement in each instantiation. The instantiation of the proposed framework to suit the particularities of APAP is detailed below.

5.1. Instantiation of AD at packet analysis
The implemented AD component operates with the metrics of APAP. As shown in Fig. 3, in the step of definition of clusters, a list is generated with all the Manhattan distances between the metrics of the legitimate model built at the APAP training stage and the features extracted from malicious samples at the APAP refinement stage. The values obtained allow defining similarity groups by applying clustering algorithms. Of the many options available, the k-means algorithm has been chosen, and the K value has been adjusted by the “elbow” method. This adjustment consists of executing k-means with different K values until unrepresentative variations occur in the sum of the squared error between each member of the clusters and its central value. K-means was selected because it is well known and especially resilient to the presence of outliers and errors in distance measurements. The “elbow” method compensates for its main drawback: the requirement to specify the number of clusters. In (Rui and Wunsch (2005)) these algorithms and other methods are described; the decision of which of them fits this problem best is discussed as future work. In the aggregation step, the metrics extracted from the payload of packets that have led the ABS to issue alerts are associated with the cluster of greatest similarity. The clusters with central values closer to the legitimate usage model contain fewer anomalous samples, and their members are thus more likely to be false positives.
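The cluster definition step can be sketched as follows. This is a didactic approximation under stated assumptions: a plain one-dimensional k-means over scalar Manhattan distances and an SSE helper for the "elbow" check, not the exact APAP pipeline.

```python
import random

def manhattan(u, v):
    """Manhattan (L1) distance between two payload metric vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))

def kmeans_1d(values, k, iters=100):
    """Plain k-means over scalar distance values; returns the centroids
    and the clusters assigned to them."""
    centroids = random.sample(values, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        updated = [sum(c) / len(c) if c else centroids[i]
                   for i, c in enumerate(clusters)]
        if updated == centroids:   # assignments stabilized: converged
            break
        centroids = updated
    return centroids, clusters

def sse(centroids, clusters):
    """Sum of squared errors, the quantity inspected by the 'elbow' method."""
    return sum((v - c) ** 2
               for c, cl in zip(centroids, clusters) for v in cl)
```

The elbow adjustment would simply run `kmeans_1d` for increasing `k` and stop once the drop in `sse` between consecutive values of `k` becomes unrepresentative.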
5.2. Instantiation of ND at packet analysis
When instantiating the ND component for packet analysis correlation, the first thing to consider is the definition of the rule base. Since the only knowledge base from which we start is the set of labels of the malicious samples involved in the refinement of the training stage of APAP, we rely on them to define the classification rules. For this purpose, the features and rules triggered by a fully trained version of APAP, when analyzing samples of


Figure 3: Definition of clusters in the implementation

attacks that were not used in its training stage, are observed (see Fig. 4). The characteristics C = {C1, · · · , Cn} of the rule base and all the possible activation rules R = {R1, · · · , Rm} are implemented in an Artificial Neural Network (ANN). This is a common scheme that improves efficiency over conventional rule bases and provides fault tolerance (Andrews et al. (1995)). The ANN has n inputs, as many as characteristics, because every input is aligned with one of them. Each input accepts a range of values 0 ≤ v < 1, according to the possible frequencies. The ANN has a single output, which corresponds to one of the possible categories in which the anomalies could be tagged. Each sample involved in the training of the ANN has as input vector [C1, · · · , Cn], containing all the characteristics extracted from a malicious sample. The expected output is the value that identifies its family of attacks. Thus, the dataset for training the ANN contains the information related to each of the samples of attacks reserved for this purpose. At the conclusion of this process, it only remains to introduce the characteristics of the new


Figure 4: Rule base generation in ND at packet analysis

samples analyzed by the ABS to infer the category linked to their nature. Other interesting aspects of the implemented ANN are that it is arranged in three layers (the middle layer is hidden), the training error is 0.001 and the activation function is ELLIOT (Yonaba et al. (2010)). The latter is a popular high-speed approximation to the hyperbolic tangent activation function with span 0 < y < 1, such that:

y = (x × s) / (2 × (1 + |x × s|)) + 0.5    (1)

d = s / (2 × (1 + |x × s|) × (1 + |x × s|))    (2)

where x is the input of the activation function, y is the output, s is the steepness and d is the derivative. Table 2 summarizes the ANN configuration. It is simple and proved to be well suited to the system requirements.
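As a quick illustration, Eqs. (1) and (2) translate directly to code (a minimal sketch; the function and parameter names are ours):

```python
def elliot(x, s=1.0):
    """ELLIOT activation, Eq. (1): a fast approximation of the hyperbolic
    tangent, rescaled to the span 0 < y < 1."""
    return (x * s) / (2.0 * (1.0 + abs(x * s))) + 0.5

def elliot_derivative(x, s=1.0):
    """Derivative of the ELLIOT activation, Eq. (2)."""
    a = 1.0 + abs(x * s)
    return s / (2.0 * a * a)
```

The output stays strictly within (0, 1); for example, elliot(0) = 0.5, and the derivative shrinks as |x| grows, reflecting saturation.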


Parameter             Values
Input                 n
Output                1
Layers                3
Training error        0.001
Max. training steps   50000000
Hidden layers         1 (middle)
Activation function   ELLIOT

Table 2: ANN configuration in ND at packet analysis

5.3. Instantiation of ND at sequence analysis
The Nature Diagnosis at sequence analysis could be implemented with a scheme similar to the ANN of the previous layer. Thus efficiency would be gained, but the non-determinism would be lost. Although it is possible to apply a refinement stage on the ANN in order to preserve non-determinism, we have decided to propose a completely different implementation, which could be helpful in future instantiations of the alert correlation framework: a Genetic Algorithm (GA). The GA evolves a population of individuals generated from the input data, subjecting it to random actions that imitate biological evolution (mutations and genetic recombination). A selection of individuals according to their survivability is performed, and the worst adapted are discarded. At the end, the solution to the problem is constructed from the final population. In (Kamrani et al. (2001)) the principles for designing basic GA are described in detail. The GA that infers the nature of the analyzed sequences takes as input the output of the previous processing layer. Every individual is defined by a vector of dimension d representing its genotype G = [G1, . . . , Gd], where Gi, 0 < i ≤ d, corresponds to the gene in position i. The choice of d has a direct impact on the execution of the algorithm. The longer the genotype, the greater the genetic wealth, a situation that has a positive effect on solving the problem. This is because the GA is a particular variation of probabilistic algorithms, so the smaller d is, the higher the random component in the resolution. However, large genotypes decrease the performance of the component. To fill the genotype of a new individual, the set of all classifications of the packets within the sequence is taken into account. The set of labels may repeat classifications, so those that have appeared more often in the sequence could have

greater representation in the population. A random classification within the set is assigned to each gene, as shown in Fig. 5.


Figure 5: GA for Non-determinism in ND at sequence analysis

In the training of APAP, a table that associates each sequence of single alerts with its kind of intrusion is created. This table allows the calculation of the fitness of every individual, which is the best Levenshtein distance (Levenshtein (1966)) between its genotype and the sequences in the table. This distance works well in this context because it measures the difference between two sequences as the minimum number of insertions, deletions or substitutions required to change one of them into the other. Therefore, the best situation (best fitness) is when the distance returns 0, i.e., the genotype matches at least one sequence of the knowledge base. The GA performs crossover and mutation operations on individuals until the stop condition is reached. Mutations are performed by random modification of genes. Crossovers swap parental genotypes considering a random gene as a pivot. The stop condition occurs when the maximum number of allowed iterations is reached, or when a representative subset of individuals reaches the best fitness value. Upon completion of the execution, the analyzed sequences are labeled as members of the attack groups associated with each of the individuals with optimal fitness. The sequences may belong to different groups at once. In addition, several individuals belonging to the same sequence may repeat a classification. The more often this happens, the greater the probability that the label is correct, so operators can decide which nature matches best in every context. The GA configuration that best suited the APAP requirements is summarized in Table 3.

Parameter                     Values
Size of genotype              6
Size of initial population    200
Size of selected population   50
Prob. crossover               30%
Prob. mutation                10%
Fitness function              Levenshtein distance
Crossover method              Pivot genotypes
Mutation operation            Random modification of genes
Stop condition                Every individual is able to infer some nature or 500 iterations are exceeded

Table 3: Configuration of the GA
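A minimal sketch of the individual generation and fitness computation may clarify the scheme. The function names and the toy knowledge base are illustrative assumptions, and the full selection/crossover/mutation loop is omitted for brevity.

```python
import random

def levenshtein(a, b):
    """Minimum number of insertions, deletions or substitutions needed
    to turn sequence a into sequence b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def random_individual(sequence_categories, d=6):
    """Genotype of d genes, each a random category observed in the
    analyzed sequence (frequent categories are drawn more often)."""
    return [random.choice(sequence_categories) for _ in range(d)]

def fitness(genotype, knowledge_base):
    """Best (lowest) Levenshtein distance between the genotype and the
    known alert sequences; 0 means an exact match."""
    return min(levenshtein(genotype, seq) for seq in knowledge_base.values())
```

With a hypothetical knowledge base such as `{"trojan": ["E1", "E5", "E5"], "worm": ["E3", "E3"]}`, an individual whose genotype equals one of the stored sequences reaches the optimal fitness of 0.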

6. Experiments
For evaluating the proposal, a new collection of traffic captures has been applied. As a first approximation, the use of public domain datasets, which were accepted as functional standards by the research community about ten years ago, was taken into consideration. But the application of these collections currently unleashes controversy. As discussed in (Shiravi et al. (2012)), they do not provide a realistic overview of current network traffic because, among other reasons, they are antiquated, lack consistency and diversity, and suffered errors during data gathering. On the other hand, it is important to keep

in mind that, mainly due to legal constraints, most of the current datasets are anonymized, which implies removing the payload and performing major changes in the packet headers. Among these issues, the lack of payloads directly affects our work, which is focused on payload analysis. Another point to consider is the context in which the samples were gathered: the evaluation of our approach demands a recent dataset with payload contents, but also with samples containing traces of properly labeled malware at both packet and sequence levels. In the extensive bibliography we have consulted, there are no references to collections that meet all these characteristics. However, it is constantly stated that one of the most common mistakes when developing ABS is the low relationship between the results of their experimentation and real use cases, due to improper selection of datasets. Given the difficulty of obtaining an appropriate sample collection, the adoption of a brand new collection of traffic traces, updated and diverse, provides a better measurement of the effectiveness of the proposed framework. It should also be noted that this hinders its comparison with previous proposals. The considered dataset was provided by the Data Center of the Universidad Complutense de Madrid, and contains 1.9 GB of traffic traces monitored in different periods and gathered from various subnets during the year 2011. The collection includes samples of legitimate and malicious traffic, and it was previously used for the evaluation of other sensors designed by our research group, including APAP. The malicious traffic is classified at packet and sequence analysis, according to the requirements of the alert correlation framework. The labels are summarized in Table 4 and specified below:

• Alerts have been classified into 5 groups based on their risk of being harmful: (A1) very low, (A2) low, (A3) medium, (A4) high and (A5) very high.
• Anomalous contents have been divided into 16 different natures at packet analysis. Namely: (C1) trojan, (C2) privilege escalation, (C3) executable code, (C4) denial of service, (C5) proprietary software, (C6) low risk enumeration, (C7) enumeration, (C8) virus, (C9) exploit, (C10) information leak, (C11) remote control, (C12) proprietary software with unauthorized access, (C13) software with unauthorized access, (C14) low risk spyware, (C15) high risk spyware, and (C16) others. Their classification is established based on the analysis of the payload. Consequently,

threats less detectable by this method, such as (C4) denial of service or (C11) remote control, include particular content able to be recognized by payload analysis.

• Alerts have been divided into 9 different natures for sequence analysis: (E1) botnets, (E2) obfuscated malware, (E3) worms, (E4) drive-by, (E5) adware, (E6) spyware, (E7) virus, (E8) trojan and (E9) others. All these categories contain anomalies constructed from the previous level threats. Some of them are even repeated at packet analysis, which is consistent with reality. For example, a sequence tagged as (E8) trojan may contain individual packets labeled as (C1) trojan.

Both APAP and the alert correlation framework applied about 80% of these samples in training steps. In the experimentation, legitimate traffic samples were used for evaluating the ability of the approach to identify false positives. The rest provide the hit rate of the proposal. Accuracy and performance tests were carried out at every stage of processing in the following way:

• The AD component has been evaluated according to its ability to infer the possibility of real risk within each anomaly. To this end, the groups tagged by the alert correlation system and the degree of risk provided by the Data Center are compared. Optimal real-time performance is expected, and also that the clusters of anomalies show a distribution similar to that of the risks labeled by the UCM. According to our instantiation of the framework, this is the only component able to identify false positives. Therefore, it is the only experiment that will process traces with legitimate contents.

• The ND component at packet analysis has been tested in order to demonstrate accuracy and real-time performance. Since the rule base is built considering the same labels provided by the Data Center, the number of matches and mistakes obtained when processing malicious samples that have not been used in training steps is studied.
• The assessment of the ND component at sequence analysis follows a scheme similar to its predecessor in the previous layer. This is because both the sequences of the dataset and the rule base of the ND component share labels. Given that the implementation by GA provides non-deterministic

solutions, the distribution of matches on the most likely options is discussed.

Table 4: Content on UCM dataset in experiments

Group             ID     Description
Group Risk        A1     Very Low
                  A2     Low
                  A3     Medium
                  A4     High
                  A5     Very High
Alert Nature      C1     Trojan
                  C2     Privilege Escalation
                  C3     Executable Code
                  C4     Denial of Service
                  C5     Proprietary Software
                  C6     Low Risk Enumeration
                  C7     Enumeration
                  C8     Virus
                  C9     Exploit
                  C10    Information Leak
                  C11    Remote Control
                  C12    Priv. Unauthorized Access
                  C13    Unauthorized Access
                  C14    Low Risk Spyware
                  C15    Spyware
                  C16    Unknown
Sequence Nature   E1     Botnet
                  E2     Obfuscated malware
                  E3     Worm
                  E4     Drive-by
                  E5     Adware
                  E6     Spyware
                  E7     Virus
                  E8     Trojan
                  E9     Unknown

7. Results
The results obtained in the experiments are shown below. They are described and discussed component by component.

7.1. Anomaly Diagnosis
For evaluating the AD component, two tests were performed. The first one determines the system’s ability to identify false positives. It consists of correlating packets that have forced the ABS to issue false positives. When analyzing legitimate traffic, 95.7% of the alerts were correlated in the lower risk groups, where 46% indicated very low risk and 50% low risk, as shown in Fig. 6. The remaining 4.3% was mistakenly grouped in the medium risk category. According to this criterion, if the detection system rejects the alerts of greatest similarity with the legitimate model (specifically, those tagged as low and very low risk), it could successfully filter about 95.7% of the false positives. This would be very desirable behavior, as long as it is capable of successfully managing alerts derived from real attacks. Therefore, a second experiment must be carried out, this time with samples with malicious payload.


Figure 6: Distribution when correlating false positives

In the second experiment, 84.82% of the alerts were correctly tagged at the right risk level. Fig. 7 and Table 5 summarize the resultant distribution. All errors have implied assigning the malicious content to the closest

cluster to the correct one, so even if the error rate seems high, the accuracy in terms of decision making is good, because in most cases disproportionate measures will not be taken. The imprecision that places high risk payloads into medium risk is the most dangerous that may occur. It represents 1.32% of the system failures. The largest error is concentrated in the medium risk category, with a tendency to oscillate between low risk (7.3%) and high risk (32.7%). Based on this, it is possible to conclude that filtering the alerts of the very low and low groups removes 95.7% of the false positives, but worsens the false negative rate by 7.3%. In general terms, this could be an important enhancement of current ABS in most use cases.


Figure 7: Distribution of alerts when correlating attacks

Anomaly   UCM 2011 (%)   Correlation (%)
A1        15.1           12.3
A2        15.6           14.8
A3        37.2           44.8
A4        19.2           16.4
A5        12.9           11.4

Table 5: Distribution per cluster in AD

Another important aspect to keep in mind is that the worst ABS processing time per sequence has been 171.104ms (note that each of these sequences

contains about 200KB of network traffic). By deploying the alert correlation system with only the AD component activated, the maximum processing time has risen to 173.127ms (171.104ms + 2.023ms) in the worst case. So, the maximum system overhead caused is barely 1%, confirming the ability of the implementation to operate in real time.

7.2. Nature Diagnosis at packet analysis
The ND component at packet analysis was evaluated considering the previously described 16 categories. Thereby, all the samples with malicious content not involved in training steps were analyzed. The resultant distribution is summarized in Fig. 8 and Table 6. The precision obtained when comparing the alerts correlated by the proposal with the labels of the Data Center demonstrated a hit rate of 99.512%. This is mostly due to the rigorous error rate established for the ANN training (in particular, 0.001). Additionally, it should be noted that the observed error varies considerably among the different natures. There are clusters with 100% accuracy, such as trojan, privilege escalation or exploit, but also groups with high errors, as is the case of denial of service or high risk enumeration. The latter coincide with intrusions that are not usually addressed by payload analysis approaches, so it can be deduced that the content of their samples is less representative.


Figure 8: Distribution of natures at packet analysis


Regarding performance, it can be concluded that, as in the previous experiments, the system is quite capable of operating in real time. The maximum system overhead is 2.3%, and the worst processing time per sequence is 175.135ms (171.104ms + 4.031ms). The processing time at the packet layer of correlation is that of the slowest of its two components, since both run concurrently; in this case it corresponds to this step, so the total overhead on the ABS when analyzing single alerts is 2.3%.

Category   UCM 2011 (%)   Correlation (%)
C1         0.25           0.25
C2         1.13           1.13
C3         0.78           0.8
C4         19.27          18.93
C5         8.44           8.5
C6         15.7           15.7
C7         28.8           29
C8         0.01           0.01
C9         0.03           0.03
C10        23.61          23.62
C11        0.42           0.42
C12        0.67           0.64
C13        0.04           0.03
C14        0.27           0.31
C15        0.58           0.63
C16        0              0

Table 6: Distribution of results in ND at packet analysis
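The overhead figures quoted in these experiments follow a simple relation: components at the same correlation layer run concurrently, so only the slowest one adds delay to APAP's base processing time, and the overhead is taken as that extra delay relative to the total. A minimal sketch of this arithmetic (the function name is ours, not from the paper):

```python
def layer_overhead(base_ms, component_ms):
    """Concurrent components at one layer: only the slowest adds delay.
    Returns (total processing time in ms, overhead as % of the total)."""
    extra = max(component_ms)
    total = base_ms + extra
    return total, 100.0 * extra / total

# AD component alone (Section 7.1): 171.104 ms + 2.023 ms.
total_ad, pct_ad = layer_overhead(171.104, [2.023])
# Packet layer with AD and ND running concurrently (Section 7.2).
total_nd, pct_nd = layer_overhead(171.104, [2.023, 4.031])
# total_ad is about 173.127 ms (overhead just over 1%);
# total_nd is about 175.135 ms (overhead about 2.3%).
```

This recovers the worst-case times and the roughly 1% and 2.3% overheads reported above.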

7.3. Sequence Level

To evaluate the ND component at sequence analysis, it is important to bear in mind that it provides non-deterministic results. As with the previous component, all alerts that did not participate in the training steps were analyzed. All of them were correlated successfully into one of the possible natures proposed by the GA. Fig. 9 shows the hit rate per nature and per choice. This information is complemented in Table 7 with the distribution of each category over the correlated sequences. The successes were distributed between the first two most probable options proposed: 75.124% in the most likely and 24.876% in the second most probable. This is the desired behavior, because it sanitizes grouping errors and offers alternatives. For example, if an operator identifies a suspicious sequence of alerts likely to be 45% adware and 55% botnet, both options could be considered. With deterministic labeling it would be labeled as botnet, since that is the most probable nature; but if the countermeasures applied proved ineffective, there would be little scope to change them. Herein lies the strength of the non-deterministic approach. In this experiment, the 24.876% difference between the first and second options represents the homogeneity of the proposed solution. In certain contexts it may be advisable to reduce or extend that difference, taking into account factors such as the capacity for decision-making in the protected environment, performance, or the environment's degree of restriction.

Moreover, and as expected, performance at sequence analysis is worse than at the upper layer. In this experiment, a maximum processing time per sequence of 221.362 ms was registered. Considering that 171.104 ms is the delay caused by APAP, the maximum system overhead is 22.88% (171.104 ms + 50.258 ms), which makes this instantiation suitable for forensics but not recommended for real-time processing. Note that when the proposed instantiation of this component was decided, real-time operability was not a prerequisite: the focus was on non-determinism and accuracy, and the implementation allows concurrent analysis, preventing an accumulated backlog. There are other strategies for information management capable of adapting to different use cases, which could yield better performance than this example.

8. Conclusions

This paper introduced an alert correlation framework that aims to overcome various deficiencies of the NIDS and alert correlation systems in the literature, in particular those related to malware detection based on the recognition of anomalies within the payload content. Unlike similar and conventional approaches, the information provided by the network packet headers is not taken into account; instead, the payload of the suspicious packets and the modeling strategies of the intrusion detection systems are considered. The approach presents a multi-layered architecture, which implies different criteria for grouping single alerts and sequences; specifically,

Figure 9: Hit rate in the most probable natures proposed by the GA (first and second options, per category E1-E9)

Category   UCM 2011 (%)   First (%)   Second (%)
E1         0.91           89.1        10.9
E2         0.14           73.4        26.6
E3         0.16           94.1        2.9
E4         23.18          84.1        15.9
E5         30             91.7        7.3
E6         43.2           70.5        28.5
E8         72.8           95.2        4.8
E9         0              100         0

Table 7: Distribution and hit rate per choice of the GA
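The non-deterministic labeling evaluated in this section can be illustrated with a small sketch (our own simplification, not the GA implementation): given the membership scores proposed for a sequence, the two most probable natures are retained, together with the divergence between them on which an operator or a tuning policy could act:

```python
def top_two_natures(scores):
    """Given nature -> probability scores for one alert sequence, return
    the two most probable natures and the divergence between them."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    (n1, p1), (n2, p2) = ranked[0], ranked[1]
    return (n1, p1), (n2, p2), p1 - p2

# The example from the text: 55% botnet vs. 45% adware.
first, second, divergence = top_two_natures({"adware": 0.45, "botnet": 0.55})
# first is ("botnet", 0.55); the divergence is about 0.10, so both
# natures remain plausible and the operator can keep both options open.
```

A small divergence signals that the second option should be kept available as an alternative countermeasure, which is precisely the behavior described for the 75.124% / 24.876% split observed in the experiment.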

their nature and distance from the legitimate usage model built during the ABS training steps are taken into account. An example of instantiation of the framework was also proposed, with a fully described implementation for each component. This was done considering their deployment as a complement to APAP, an intrusion detection system designed by our research group. The implementation was evaluated by analyzing real traffic provided by the Data Center of the Universidad Complutense de Madrid. The experiments performed demonstrated high accuracy: every component efficiently classified the alerts generated by the intrusion detection system, showing that the nature and distance-from-legitimate-model criteria were applied successfully. Therefore, the degree of abnormality and the nature of the different identified threats were precisely determined. Additionally, it was possible to properly identify a significant amount of the false positives issued by the ABS, which confirms the system's ability to enhance the accuracy of the complemented systems. On the other hand, the instantiation of the sequence analysis component with non-determinism demonstrated the ability to propose multiple natures. All the alerts processed in this step were correlated successfully within the first two solutions, with a divergence of 24.876%; in this way, an operator may decide between different countermeasures. However, due to its implementation via a GA, its performance is worse than that observed at packet analysis. These results are evidence of the dependence of system behavior on its implementation, which should be taken into consideration in upcoming implementations.

Throughout the article we have delegated several discussions to future work. Most of them concern which data mining or machine learning algorithms are best suited for each use case and how they should be configured. Other aspects that should be addressed are resistance to attacks and fault tolerance; despite being considered in the design phase, the experiments did not evaluate them, so they may constitute new lines of progression. Finally, it is worth noting the interest in adapting this framework to different ABS, which will undoubtedly shed new perspectives and raise new issues to be considered. Consequently, this will probably be our next step in this research.

References

Ahmadinejad, S.H., Jalili, S., Abadi, M. A hybrid model for correlating alerts of known and unknown attack scenarios and updating attack graphs. Computer Networks 2011;55(9):2221-2240.

Al-Mamory, S.O., Zhang, H. IDS alerts correlation using grammar-based approach. Journal in Computer Virology 2008;5(4):271-282.

Albanese, M., Jajodia, S., Pugliese, A., Subrahmanian, V.S. Scalable Analysis of Attack Scenarios. In: Proceedings of the 16th European Symposium on Research in Computer Security (ESORICS). Leuven, Belgium; 2011. p. 416-433.


Alsubhi, K., Aib, I., Boutaba, R. FuzMet: a fuzzy logic based alert prioritization engine for intrusion detection systems. International Journal of Network Management 2012;22(4):263-284.

Amaral, A.A., Zarpelao, B.B., Mendez, L.S., Rodrigues, J., Junior, M. Inference of network anomaly propagation using spatio-temporal correlation. Journal of Network and Computer Applications 2012;35(6):1781-1792.

Andrews, R., Diederich, J., Tickle, A.B. Survey and critique of techniques for extracting rules from trained artificial neural networks. Knowledge-Based Systems 1995;8(6):373-389.

Ariu, D., Tronci, R., Giacinto, G. HMMPayl: An intrusion detection system based on Hidden Markov Models. Computers & Security 2011;30(4):221-241.

Bhuyan, M.H., Bhattacharyya, J.K. Network Anomaly Detection: Methods, Systems and Tools. IEEE Communications Surveys & Tutorials 2014;16(1):303-336.

Cam, H., Mouallem, P.A., Pino, R.E. Alert Data Aggregation and Transmission Prioritization over Mobile Networks. Springer US; volume 55 of Advances in Information Security. p. 205-220.

CERT-EU. Malware. https://cert.europa.eu/cert/alertedition/en/Malware.html/; 2016.

Chen, T., Zhang, X., Jin, S., Kim, O. Efficient classification using parallel and scalable compressed model and its application on intrusion detection. Expert Systems with Applications 2014;41(13):5972-5983.

Cretu, G.F., Stavrou, A., Locasto, M., Stolfo, S., Keromytis, A. Casting out Demons: Sanitizing Training Data for Anomaly Sensors. In: Proceedings of the IEEE Symposium on Security and Privacy (SP). Oakland, CA, USA; 2008. p. 81-95.

Davis, J.J., Clark, A.J. Data preprocessing for anomaly based network intrusion detection: A review. Computers & Security 2014;30(6-7):353-375.

Elshoush, H.T., Osman, I.M. Alert correlation in collaborative intelligent intrusion detection systems: A survey. Applied Soft Computing 2011;11(7):4349-4365.

Elshoush, H.T., Osman, I.M. Intrusion Alert Correlation Framework: An Innovative Approach. Springer Netherlands; volume 229 of Lecture Notes in Electrical Engineering. p. 405-420.

Farhadi, H., AmirHaeri, M., Khansari, M. Alert correlation and prediction using data mining and HMM. International Journal of Information Security 2011;3(2):77-101.

Fogla, P., Sharif, M., Perdisci, R., Kolesnikov, O., Lee, W. Polymorphic blending attacks. In: Proceedings of the 15th USENIX Security Symposium. Vancouver, BC, Canada; 2006. p. 241-256.

Hadziosmanovic, D., Simionato, L., Bolzoni, D., Zambon, E., Etalle, S. N-gram against the machine: On the feasibility of the n-gram network analysis for binary protocols. In: Proceedings of the 15th International Symposium on Recent Advances in Intrusion Detection (RAID). Amsterdam, The Netherlands; 2012. p. 59-81.

Hubballi, N., Suryanarayanan, V. False alarm minimization techniques in signature-based intrusion detection systems: A survey. Computer Communications 2014;49:1-17.

Hung-Jen, L., Chun-Hung, R.L., Ying-Chih, L., Kuang-Yuan, T. Intrusion detection system: A comprehensive review. Journal of Network and Computer Applications 2013;36(1):16-24.

Jamdagni, A., Tan, Z., He, X., Nanda, P., Liu, R.P. RePIDS: A multi-tier Real-time, Payload-based Intrusion Detection System. Computer Networks 2013;57(3):811-824.

Jesus, P., Baquero, C., Almeida, P.S. A Survey of Distributed Data Aggregation Algorithms. IEEE Communications Surveys & Tutorials 2015;17(1):1381-1404.

Wang, K., Stolfo, S.J. Anomalous Payload-based Network Intrusion Detection. In: Proceedings of the 7th International Symposium on Recent Advances in Intrusion Detection (RAID). Sophia Antipolis, France; 2004. p. 203-222.
Kamrani, A., Rong, W., Gonzalez, R. A genetic algorithm methodology for data mining and intelligent knowledge acquisition. Computers & Industrial Engineering 2001;40(4):361-377.

Levenshtein, V.I. Binary Codes Capable of Correcting Deletions, Insertions, and Reversals. Soviet Physics-Doklady 1966;10(8):707-710.

Marinos, L., Sfakianakis, A. Threat Landscape 2014. https://www.enisa.europa.eu/; 2015.

Mirheidari, S.A., Arshad, S., Jalili, R. Alert Correlation Algorithms: A Survey and Taxonomy. In: Proceedings of the 5th International Symposium on Cyberspace Safety and Security (CSS). Zhangjiajie, China; 2013. p. 183-197.

Mustapha, Y.B., Débar, H., Jacob, H.G. Limitation of Honeypot/Honeynet Databases to Enhance Alert Correlation. In: Proceedings of the 6th International Conference on Mathematical Methods, Models and Architectures for Computer Network Security. St. Petersburg, Russia; 2012. p. 203-217.

Njogu, H.W., Jiawei, L., Kiere, J.N., Hanyurwimfura, D. A comprehensive vulnerability based alert management approach for large networks. Future Generation Computer Systems 2013;29(1):27-45.

Pastrana, S., Orfila, A., Tapiador, J.E., Peris-Lopez, P. Randomized Anagram revisited. Journal of Network and Computer Applications 2014;41:182-196.

Perdisci, R., Ariu, D., Fogla, P., Giacinto, G., Lee, W. McPAD: a multiple classifier system for accurate payload-based anomaly detection. Computer Networks 2009;53(6):864-881.

Pérez, M.G., Tapiador, J.E., Clark, J., Perez, G., Gómez, A. Trustworthy placements: Improving quality and resilience in collaborative attack detection. Computer Networks 2014;58:70-86.

Raftopoulos, E., Egli, M., Dimitropoulos, X. Shedding Light on Log Correlation in Network Forensics Analysis. In: Proceedings of the 9th International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment (DIMVA). Crete, Greece; 2013. p. 232-241.
Ramaki, A.A., Amini, M., Atani, R.E. RTECA: Real time episode correlation algorithm for multi-step attack scenarios detection. Computers & Security 2015;49:206-219.

Rui, X., Wunsch, D. Survey of clustering algorithms. IEEE Transactions on Neural Networks 2005;16(3):645-678.

Salah, S., Macía-Fernández, G., Díaz-Verdejo, J.E. A model-based survey of alert correlation techniques. Computer Networks 2013;57(5):1289-1317.

Shiravi, A., Shiravi, H., Tavallaee, M., Ghorbani, A.A. Toward developing a systematic approach to generate benchmark datasets for intrusion detection. Computers & Security 2012;31(3):357-374.

Shittu, R., Healing, A., Ghanea-Hercock, R., Bloomfield, R., Rajarajan, M. Intrusion alert prioritisation and attack detection using post-correlation analysis. Computers & Security 2015;50:1-15.

Snort. https://www.snort.org/; 2015.

Stalla-Bourdillon, S., Papadaki, E., Chown, T. From porn to cybersecurity passing by copyright: How mass surveillance technologies are gaining legitimacy. The case of deep packet inspection technologies. Computer Law & Security Review 2014;30(6):670-686.

Syed, R.H., Syrame, M., Bourgeois, J. Protecting grids from cross-domain attacks using security alert sharing mechanisms. Future Generation Computer Systems 2013;29(2):536-547.

Tjhai, J.C., Furnell, S.M., Papadaki, M., Clarke, N.L. A preliminary two-stage alarm correlation and filtering system using SOM neural network and K-means algorithm. Computers & Security 2010;29(6):712-723.

Tsung-Huan, C., Ying-Dar, L., Yuan-Cheng, L., Po-Ching, L. Evasion Techniques: Sneaking through Your Intrusion Detection/Prevention Systems. IEEE Communications Surveys & Tutorials 2012;14(4):1011-1020.

US-CERT. Home Network Security. https://www.us-cert.gov/Home-Network-Security; 2016.

Villalba, L., Castro, J., Orozco, A., Puentes, J.M. Malware detection system by payload analysis of network traffic. In: Proceedings of the 15th International Conference on Research in Attacks, Intrusions, and Defenses (RAID). Amsterdam, The Netherlands; 2012. p. 397-398.

Villalba, L., Orozco, A., Vidal, J.M. Advanced Payload Analyzer Preprocessor. Future Generation Computer Systems 2016; doi: 10.1016/j.future.2016.10.032.

Viswanathan, A., Tan, K., Neuman, C. Deconstructing the Assessment of Anomaly-based Intrusion Detectors. In: Proceedings of the 16th International Symposium on Recent Advances in Intrusion Detection (RAID). Rodney Bay, St. Lucia; 2013. p. 286-306.

Wang, K., Cretu, G., Stolfo, S.J. Anomalous Payload-based Worm Detection and Signature Generation. In: Proceedings of the 8th International Symposium on Recent Advances in Intrusion Detection (RAID). Seattle, WA, USA; 2005. p. 227-246.

Wang, K., Parekh, J.J., Stolfo, S.J. Anagram: A Content Anomaly Detector Resistant to Mimicry Attack. In: Proceedings of the 9th International Symposium on Recent Advances in Intrusion Detection (RAID). Hamburg, Germany; 2006. p. 226-248.

Weller-Fahy, D., Borghetti, B., Sodemann, A. A Survey of Distance and Similarity Measures Used Within Network Intrusion Anomaly Detection. IEEE Communications Surveys & Tutorials 2014;17(1):70-91.

Yonaba, H., Anctil, F., Fortin, V. Comparing Sigmoid Transfer Functions for Neural Network Multistep Ahead Streamflow Forecasting. Journal of Hydrologic Engineering 2010;15(4):275-283.

Zhaowen, L., Shan, L., Yan, M. Real-Time Intrusion Alert Correlation System Based on Prerequisites and Consequence. In: Proceedings of the 6th International Conference on Wireless Communications Networking and Mobile Computing (WiCOM). Chengdu, China; 2010. p. 1-5.

Zimek, A., Vreeken, J. The blind men and the elephant: on meeting the problem of multiple truths in data from clustering and pattern mining perspectives. Machine Learning 2015;98(1):121-155.