Journal of Information Security and Applications 49 (2019) 102388
Active learning approach to label network traffic datasets

Jorge L. Guerra Torres a,∗, Carlos A. Catania b, Eduardo Veas c

a Institute for Information Technology and Communications, National University of Cuyo, Mendoza, Argentina
b LABSIN, School of Engineering, National University of Cuyo, Mendoza, Argentina
c Institute of Interactive Systems and Data Science, Graz University of Technology, Graz, Austria
Keywords: Active learning; Labeling network; Random Forest; Learning rate; Noise robustness
Abstract

In the field of network security, the process of labeling a network traffic dataset is especially expensive, since expert knowledge is required to perform the annotations. With the aid of visual analytic applications such as RiskID, the effort of labeling network traffic is considerably reduced. However, since the label assignment still requires an expert pondering several factors, the annotation process remains a difficult task. The present article introduces a novel active learning strategy for building a random forest model based on the connections previously labeled by the user. The resulting model provides the user with an estimate of the class probability of the remaining unlabeled connections, helping them in the traffic annotation task. The article describes the active learning strategy, the interfaces with the RiskID system, the algorithms used to predict botnet behavior, and a proposed evaluation framework. The evaluation framework includes studies to assess not only the prediction performance of the active learning strategy, but also the learning rate and resilience against noise, as well as the improvements over other well-known labeling strategies. The framework represents a complete methodology for evaluating the performance of any active learning solution. The evaluation results showed that the proposed approach is a significant improvement over previous labeling strategies.
1. Introduction

This paper describes an intelligent tool to aid the network security expert in the task of labeling network data. Computer networks have become indispensable for exchanging information among people and organizations; therefore, security is a major challenge nowadays. Beyond user authentication, data encryption and firewalls, network intrusion detection systems (NIDS) are widely used as an active defense for the network environment. An NIDS is an active process that monitors network traffic to identify security breaches (e.g., Botnet behavior) and initiate countermeasures. NIDSs require a way to adapt to a fast-changing environment or they risk becoming obsolete. Intelligence-based detection systems deal with the fast evolution of network scenarios using machine learning techniques [1]. Before deploying it in any real-world environment, an intelligence-based NIDS must be trained and evaluated using real labeled network traffic traces with an extensive set of intrusions or attacks [2]. Hereby, one of the most significant issues during the development of intelligence-based detection systems is the lack of appropriate
public datasets [3]. This issue originates from two major challenges: (i) network data contains sensitive information that organizations and individuals are not willing to disclose, and (ii) labeling all published data requires a major human effort, which can only be carried out by highly trained experts: security specialists. Regarding the sensitivity of network data (i), there are clearly high risks of disclosing private or classified information, whereby researchers frequently encounter insurmountable organizational and legal barriers when they attempt to provide datasets to the community [3]. The Stratosphere Intrusion Prevention System (IPS) [4] project comes as a response to the challenge of releasing network data without revealing sensitive information. The project aims to generate high-quality datasets, using a particular encoding of network behavior, for testing and developing new malware detection techniques. To address the human effort in the labeling task (ii), several techniques are employed, ranging from the automatic generation of labels [2,5–8] and semi-supervised labeling approaches [9,10] to the use of visual tools for the analysis of network traffic [11,12]. Despite all these attempts, the datasets most commonly used for evaluation are almost 12 years old, which makes them practically obsolete if we consider the fast evolution of the network security field [1]. Our contribution addresses the second challenge (ii) by building an intelligent visual analytics application (blending both strategies)
– RiskID – to assist the labeling of network traffic datasets. RiskID builds on the Stratosphere IPS encoding to ensure the anonymity of labeled network data. RiskID trains a classifier on the subset of already labeled connections and uses the classifier output to help the user in the label decision process. In particular, RiskID combines visualization strategies and active learning, working together to facilitate the recognition of malicious traffic. As a result, RiskID aims to promote the creation of properly labeled public network traffic datasets, which are so useful for the scientific community. Besides issues specific to network security, the use of active learning raises other challenges: (a) which algorithms are suitable for online incremental learning in the domain, and (b) how the performance should be evaluated. This paper makes two major contributions in the area of intelligence-assisted network security:

1. An active learning method using Random Forests to interactively assist the user in the labeling process.
2. The evaluation procedure needed to validate any active learning solution in real environments. We present a thorough evaluation framework to validate: learning rate, prediction performance, resilience against noise and impact on overall performance.

2. Related work

The lack of labeled public datasets in the network security field is a well-known problem and has been tackled from different angles. Synthetic datasets were created to represent certain problem domains, specific needs or conditions. Examples of known synthetic datasets are: KDDcup99 [5], built upon the data captured in the DARPA98 IDS evaluation program; DEFCON [6], which contains network traffic captured during a hacker competition called "Capture The Flag"; and the CAIDA dataset [7], which contains particular kinds of attacks, among others. Synthetic datasets are often very useful but suffer from excessive preprocessing that separates them from real network environments.

Real-life datasets. In an attempt to obtain real-life datasets, Bhuyan et al. [2] propose a systematic approach for automatically generating real-life network intrusion datasets at both packet and flow level. Another example of automatic label generation is proposed by Pius et al. [13], who applied clustering techniques to annotate unlabeled multivariate sensor data in smartphone networks. Mukkavilli et al. [8] used a similar approach. Their systematic approach is built upon an experimental platform used to represent the practical interaction between cloud users and cloud services. Hereby, they collect traces of network traffic resulting from the interaction between users and their cloud services, obtaining a real labeled dataset. These network traces from the cloud are readily shareable and can be interchanged among collaborators and researchers without major privacy issues. Clearly, if the researchers have control over the network conditions, or if the network traffic is artificially generated using network simulation software, the process of labeling is simplified. However, obtaining such control over the network environment is not always possible. Moreover, even in controlled networks, ensuring that the training datasets are correctly labeled or completely free of noisy information is extremely hard, since the labels are only as trustworthy as the assumed control over the network.
In addition, when the injected attack traffic and the background traffic come from different packet captures, it may be easier to identify the attack traffic if appropriate care was not taken when merging the captures [14].

Helping the user in the labeling process. Human experts are essential for annotating network traffic, but they are an expensive resource. Therefore, the labeling process should use expert time efficiently. Consequently, to reduce human effort in the labeling process it is
common to find two main approaches: (i) semi-automatic learning strategies and (ii) visual applications. (i) In the work of Aparicio et al. [10], the authors proposed an approach to automatically generate labeled network traffic datasets using an unsupervised anomaly-based IDS. The resulting labeled dataset was then processed using a Genetic Algorithm (GA) for selecting the main features. Other works focus on active learning to build a labeled dataset for intrusion detection. Active learning is an interactive process where a user interface is required for the expert to annotate. For instance, the Aladin project [15] applies rare category detection [16] on top of active learning to foster the discovery of the different families, and Gornitz et al. [9] use a k-nearest neighbor approach to detect yet unknown malicious connections. Gornitz et al. have only run simulations on fully labeled datasets with an oracle answering the annotation queries, and they do not mention any user interface to interact with users. On the other hand, the Aladin project has a corresponding graphical user interface, but the authors provide no detail about it. (ii) In another attempt to improve the process of manually labeling network connections, Soule et al. [12] propose a web-based software system that allows users to share, label, and inspect traffic time-series. This tool analyzes raw network traffic and, despite the visual tools that accompany collaborative tagging, the absence of supervised or semi-supervised assistance means that labeling a large dataset remains an arduous task. Beaugnon et al. [11] propose ILAB, a labeling strategy based on continuous interaction with the expert that mixes the two approaches: user interface and active learning. With a user interface, the expert is asked to annotate some instances from a large unlabeled pool to improve the current detection model and the relevance of the future annotation queries. The work of Beaugnon et al. is the closest to the approach discussed in the present article; for this reason we establish a comparison with it in Section 5.6. Our contribution differs from these previous works in that we combine visualization and learning in an intelligent visual analytics application (RiskID). In ILAB, the user interface is rather rudimentary, showing only basic features: start time, duration, source and destination IP and port, number of bytes and packets [11]. Our application describes flows as a feature vector based on the Stratosphere IPS encoding, which offers information about periodicity, duration and packet size. The visualization in RiskID offers distributions of such features and a similarity computation grouping them according to the feature vector. An important aspect to remark is that these strategies rely on correct labeling by the expert, so the quality of these labels influences the subsequent automatic labeling. Beaugnon et al. propose a new sampling strategy and compare it with other active learning approaches in terms of sampling bias and execution time. Instead, we present a comprehensive evaluation framework to validate: learning rate, prediction performance, resilience against noise and the impact on overall performance.

3. The RiskID application

RiskID is a visual analytics tool that combines visualization with statistical techniques to assist the user in the process of labeling network connections [17].
The application organizes overview and detail views of the network behavior to facilitate the exploration and discovery of possible threats. Fig. 1 illustrates the architecture of RiskID and its three main modules. The back-end is made of a Preprocessing Module and an Analytics Module. The process starts with a raw network traffic dataset, usually in pcap (packet capture) format.
Fig. 1. RiskID: Module interaction diagram.

Table 1
Symbol assignment strategy to encode network behavior.

                     Size:     Small                Medium               Large
                     Duration: Short  Med   Long    Short  Med   Long    Short  Med   Long
Strong Per.                    a      b     c       d      e     f       g      h     i
Weak Per.                      A      B     C       D      E     F       G      H     I
Weak Non-Per.                  r      s     t       u      v     w       x      y     z
Strong Non-Per.                R      S     T       U      V     W       X      Y     Z
No Data                        1      2     3       4      5     6       7      8     9

Time between flows: 0–5 s = "."   5–60 s = ","   60 s–5 min = "+"   5 min–1 h = "*"   Timeout (> 1 h) = "0"
The Preprocessing Module transforms a raw network traffic dataset into an internal format – a 10-dimensional feature vector – and passes it to the Analytics Module. To help users during labeling, the Analytics Module applies several statistical methods with the goal of grouping items in the vector space. In the front end, the Visual Analytics Module receives the feature vectors, statistics and grouping information and organizes them in overview and detail views following the Visual Information-Seeking Mantra [18].

3.1. Preprocessing module

The Preprocessing Module performs two conversion processes, each inside a specific submodule: the Network Pattern Extractor and the Feature Extractor submodules. The former takes care of anonymization and the latter of feature generation.

3.1.1. Network pattern extractor submodule

The Network Pattern Extractor Submodule implements the encoding proposed by the Stratosphere IPS project [4]. Such encoding is performed with two purposes: to reduce the usually considerable size of the network traffic data, and to guarantee data anonymity during the labeling process. The Stratosphere IPS encoding aggregates network flows according to a 4-tuple composed of the source IP address, the destination IP address, the destination port and the protocol. For each flow, the encoding considers information about the size, duration and periodicity of the packet exchange. It uses a character encoding as follows: each letter in the connection defines a 3-tuple with the characteristics <periodicity, duration, size>; each number indicates that there is not enough data to create a 3-tuple yet (it is normal to have numbers at the beginning of each SC). Finally, between the letters and numbers, a symbol indicates the time elapsed between consecutive flows [4]. Table 1 shows the symbol assignment strategy for encoding the network behavior according to Stratosphere IPS.
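As an illustration only, the following Python sketch decodes a single symbol of the behavioral encoding into its <periodicity, duration, size> 3-tuple, taking the mapping directly from Table 1; the function and variable names are ours and are not part of the Stratosphere IPS code.

```python
# A minimal sketch (ours, not the Stratosphere IPS source) that decodes one symbol
# of the behavioral encoding into its <periodicity, duration, size> 3-tuple per Table 1.

PERIODICITY_ROWS = {
    "abcdefghi": "strong periodic",
    "ABCDEFGHI": "weak periodic",
    "rstuvwxyz": "weak non-periodic",
    "RSTUVWXYZ": "strong non-periodic",
    "123456789": "no data",
}
# Table 1 columns, left to right: size (small, medium, large) x duration (short, med, long).
SIZE_DURATION_COLS = [(size, dur) for size in ("small", "medium", "large")
                      for dur in ("short", "medium", "long")]

def decode_symbol(symbol: str):
    """Return (periodicity, duration, size) for one symbol, or None for timing symbols."""
    for row, periodicity in PERIODICITY_ROWS.items():
        if symbol in row:
            size, duration = SIZE_DURATION_COLS[row.index(symbol)]
            return periodicity, duration, size
    return None  # '.', ',', '+', '*' and '0' encode the elapsed time between flows

print(decode_symbol("z"))  # ('weak non-periodic', 'long', 'large')
print(decode_symbol("B"))  # ('weak periodic', 'medium', 'small')
```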
Fig. 2. An example behavioral encoding of a connection from IP address 147.32.84.165 to destination port 53 at IP address 147.32.80.9 using UDP.
All network flows aggregated under a single tuple are referred to as a single Stratosphere connection (SC). In other words, a single SC represents the temporal behavior from one IP address to a specific service running on a specific IP address. Several SCs can be created from a raw network traffic dataset in pcap format. This methodology constitutes the anonymization technique that the Network Pattern Extractor Submodule uses to protect the information in the network traffic. A sample of the Stratosphere IPS behavioral encoding is shown in Fig. 2. The figure shows the symbols representing all the flows of an SC based on the UDP protocol from IP address 147.32.84.165 to port 53 of IP address 147.32.80.9. In this case, the SC is represented by 24 flows (counting the characters that are letters or numbers). Note that most of the traffic was not periodic, with long intervals between 5 min and 1 h between flows (predominance of the letters z/Z and s/S), and only some flows were periodic (two occurrences of the letter B and one I). In this example, all the flows were between medium and long duration and mostly of large size. We refer to this connection example in the following sections as c-655, because this is the index it has in our connection list.

3.1.2. Feature vector extractor submodule

The Feature Vector Extractor Submodule is responsible for generating an even more condensed representation of the network traffic dataset. It summarizes an SC into a 10-dimensional numerical vector denoted as the feature vector: <x_sp, x_wp, x_wnp, x_snp, x_ds, x_dm, x_dl, x_ss, x_sm, x_sl>.
Fig. 3. Visual representations in the RiskID application.
The first four dimensions of the numerical vector represent the periodicity feature (strong periodicity (sp), weak periodicity (wp), weak non-periodicity (wnp) and strong non-periodicity (snp), respectively), the next three refer to the duration feature (duration short (ds), duration medium (dm) and duration large (dl)), and the last three represent the size feature (size short (ss), size medium (sm) and size large (sl)). The feature vector for a given connection is generated by considering, over the complete symbol sequence, the cumulative frequency of the corresponding values associated with the behavioral encoding. At the end of the sequence, the proportion of each feature is calculated and normalized to the interval [0,1]. Formally, each x_j, where j ∈ {sp, wp, wnp, snp, ds, dm, dl, ss, sm, sl}, is defined as:
$x_j = \frac{1}{N}\sum_{i=1}^{N} I(t_i \in S_j)$
where N is the number of symbols that make up the SC, t_i is the i-th symbol in the SC, and S_j is the set of characters representing feature j in the connection behavioral encoding. Finally, I(·) is the indicator function. As an example, the resulting feature vector for connection c-655 is: <sp: 0, wp: 0.13, wnp: 0.21, snp: 0.58, ds: 0, dm: 0.25, dl: 0.66, ss: 0.25, sm: 0, sl: 0.66>. Notice that, after performing the transformation, the resulting feature vector provides a similar level of information about a given SC, except for the temporal behavior (i.e., the historical information about the network flows).
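A hedged sketch of this computation is shown below; the symbol sets S_j are derived from Table 1, and whether digits and timing symbols count towards N is our assumption rather than something stated in the RiskID implementation.

```python
from typing import Dict

# A sketch (not the RiskID implementation) of the Feature Vector Extractor:
# x_j = (1/N) * sum_{i=1..N} I(t_i in S_j), with the symbol sets S_j derived from Table 1.
# Counting digits and timing symbols towards N is our assumption.
SYMBOL_SETS: Dict[str, set] = {
    "sp":  set("abcdefghi"),      # strong periodicity
    "wp":  set("ABCDEFGHI"),      # weak periodicity
    "wnp": set("rstuvwxyz"),      # weak non-periodicity
    "snp": set("RSTUVWXYZ"),      # strong non-periodicity
    "ds":  set("adgADGruxRUX"),   # short duration
    "dm":  set("behBEHsvySVY"),   # medium duration
    "dl":  set("cfiCFItwzTWZ"),   # long duration
    "ss":  set("abcABCrstRST"),   # small size
    "sm":  set("defDEFuvwUVW"),   # medium size
    "sl":  set("ghiGHIxyzXYZ"),   # large size
}

def feature_vector(sc: str) -> Dict[str, float]:
    """Cumulative frequency of each feature over the N symbols of an SC encoding."""
    n = len(sc)
    return {j: round(sum(t in s_j for t in sc) / n, 2) for j, s_j in SYMBOL_SETS.items()}

# Toy encoding, not connection c-655:
print(feature_vector("22zZ.sS+z*Z,z"))
```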
3.2. Analytics module

The Analytics Module analyzes the 10-dimensional feature vectors and groups them according to standard similarity measures. A first grouping strategy is based on clustering. Clustering is implemented using a k-means algorithm based on the L2 distance to form the groups. The optimal number of groups is selected by the Elbow method [19], which consists of increasing the number of clusters until the marginal gain in the variance explained by the model is negligible. The advantage of this technique lies in the interaction with the visual components. The clustering approach is meant to offer a first visual approximation of the similarity between SCs according to their feature vectors. In Fig. 3(b), the block on the left shows the benefits of the grouping strategy and the visual components (connections close to c-655 belong to the same cluster). A second grouping strategy is implemented considering the similarities between all the SCs in the dataset. The Analytics Module implements a similarity matrix by iterating over each SC in the dataset and ranking the remaining SCs according to the cosine distance function, much like an item-based recommender system. In this way, once a connection is selected from the list, the remaining connections are arranged by their similarity to the selected connection. This functionality improves the detection of sets of connections with similar features.
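The sketch below illustrates how the two grouping strategies could be implemented with scikit-learn; the 5% elbow threshold, the maximum number of clusters and the variable names are our assumptions, not values taken from the RiskID code.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity

# Illustrative sketch of the two grouping strategies of the Analytics Module.
def elbow_kmeans(X: np.ndarray, k_max: int = 10, min_gain: float = 0.05) -> KMeans:
    """Grow k until the marginal reduction of within-cluster variance becomes negligible."""
    prev_inertia, best = None, None
    for k in range(1, k_max + 1):
        model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        if prev_inertia is not None and (prev_inertia - model.inertia_) / prev_inertia < min_gain:
            break
        prev_inertia, best = model.inertia_, model
    return best

X = np.random.rand(200, 10)            # stand-in for the SC feature vectors
clusters = elbow_kmeans(X).labels_     # first grouping strategy: k-means clusters
similarity = cosine_similarity(X)      # second grouping strategy: similarity matrix
ranked = np.argsort(-similarity[0])    # remaining SCs ranked against a selected SC (index 0)
```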
3.3. Visual analytics module

The Visual Analytics (VA) Module presents the information obtained from the Preprocessing Module in a set of graphical widgets, using the information from the Analytics Module to enrich and organize them. The features extracted by the Preprocessing Module, together with the grouping strategies from the Analytics Module, form the basis for the visual elements.

Following the example of connection c-655, the VA Module creates a visual representation from its feature vector (Fig. 3a). Fig. 3 illustrates the different visual elements in the VA Module. An overview display shows a vertical heatmap of all the feature vectors, with different colors for the periodicity, size and duration features. The vectors are organized according to the cluster grouping. The intention is to show, from a purely statistical view, which connections belong together, so the expert can see how they are labeled. Upon picking an unlabeled connection (e.g., connection c-655 of Fig. 2), it is moved to the top of the list and a detailed view is opened (see Fig. 3b). Then, the list is reorganized by similarity, and two connections are moved up: the most similar connection labeled as Botnet and the most similar connection labeled as Normal. This feature is intended so that the expert can compare connections by their features. The most similar Botnet and Normal connections are automatically brought to the detailed view for comparison. The detailed view also shows origin and destination addresses, port and protocol. By clicking on the widgets, the user can filter the overview list by them, in order to, for example, find connections originating from the same IP.

3.4. User labeling strategy

A formative evaluation was carried out to observe the workflow and decisions taken while looking for undesired behavior in network logs. The evaluation used a dataset derived from three previously labeled datasets publicly available as part of the Malware Capture Facility Project (MCFP) [20]. Two experts participated in the study. Their task was to label connections using an earlier version of RiskID, which did not include label prediction or similarity re-ordering. The study helped us identify a labeling strategy based on filtering and multiple comparisons. The labeling workflow consisted of selecting an unlabeled connection and comparing it with labeled connections that share similar characteristics [17]. After analyzing the labels of similar connections (e.g., using the RiskID cosine similarity visualization), the unlabeled connection is assigned to the majority class. The process can be repeated until the complete dataset is labeled. It is important to mention that the aforementioned workflow requires a portion of the dataset to be previously labeled. Initial labels could be assigned following a traditional approach based on blacklisted IP addresses or services, and analyzing the SC periodicity to determine whether flows occurring at periodic intervals are observed in the connection. The above user workflow falls within the scope of semi-supervised learning, where it is clear that the more correctly labeled connections there are, the higher the probability that the remaining unlabeled connections will be correctly labeled. So, the quality of the labels in the dataset depends on: (i) the number of labeled connections, and (ii) the level of correctness of the labels. Therefore, based on this semi-supervised user workflow, we propose to include active learning intelligence that suggests labels for unlabeled connections to the user and in this way helps in the labeling process.

4. The active learning strategy

This section details the active learning strategy used to predict labels in close to real time, based on the previous decisions of the expert user. The goal behind the active learning strategy is to use the behavior information of previously labeled connections to estimate the label probabilities of connections not yet labeled.
Hereby, the intention is to help the user by providing a tool for finding early, in the haystack of SCs, those unlabeled connections that can potentially be labeled based on behavior information learnt from the available labels.
Fig. 4. Learning and prediction process iterations by percent of labeled connections.
As usual with intelligent systems carrying out tasks autonomously for users, it becomes necessary to give the user some support or evidence about why the algorithm suggests a given action. This active learning support strategy faces several challenging requirements:

R1: It must fit seamlessly into the work-flow of the application, meaning that the new autonomous learning process should coexist with the user's labeling strategies.
R2: It should cope with the shortage of initial data and predict the Botnet probability with acceptable accuracy.
R3: It must be capable of dealing with some level of noise (i.e., wrongly labeled connections) in the learning process without changing the course of the expected results.
R4: It should be an improvement over the previous user labeling strategies. More specifically, it should reduce the labeling time while improving label accuracy.
R5: It must provide the user with some evidence to raise confidence in the decision proposed by the algorithm.

Of these requirements, compliance with R1–R4 can be validated experimentally with the evaluation framework. R5 is as much an algorithm characteristic as a design problem and has to be addressed at design/implementation time.

4.1. Prediction module

The proposed active learning strategy is included in a new Prediction Submodule introduced into the Analytics Module shown in Fig. 1. It performs model learning and prediction tasks, and requires a minimal set of labeled connections to that end. This first labeling pass can be done following the strategies mentioned in Section 3.4. The Prediction Module (PM) monitors the number of labeled connections. If the number of labels rises above two percent, the PM initiates an autonomous process for learning the behavior associated to connections using the available labels (see Fig. 4). The process is carried out in the background and does not affect the user's interaction with the application (R1). After a learning cycle, the PM uses the resulting model to predict the Botnet class probability of each unlabeled connection. All unlabeled connections with a probability higher than 0.5 are indicated as Botnet, while those at or below 0.5 are indicated as Normal. Then the user is instructed to label those unlabeled connections whose probability is very close to the decision boundary. This strategy, called Uncertainty Sampling [21], guarantees that the most dubious connections are the first to be tagged by the user and thus help the prediction model. This procedure is repeated each time the number of labeled connections increases by 2%. A sketch of one such cycle is shown below.
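The following Python sketch illustrates one learning and prediction cycle of the PM under the assumptions stated in the comments; the function and variable names are ours and the number of trees is not specified by the paper.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hedged sketch of one PM learning/prediction cycle (names and hyper-parameters are ours).
# y_labeled uses 1 for Botnet and 0 for Normal, and both classes are assumed to be present.
def prediction_cycle(X_labeled, y_labeled, X_unlabeled, n_queries=10):
    model = RandomForestClassifier(n_estimators=100, random_state=0)
    model.fit(X_labeled, y_labeled)                    # background learning cycle
    p_botnet = model.predict_proba(X_unlabeled)[:, 1]  # P(Botnet) for every unlabeled SC
    suggestions = np.where(p_botnet > 0.5, "Botnet", "Normal")
    # Uncertainty Sampling: the user is asked about the connections closest to 0.5.
    query_idx = np.argsort(np.abs(p_botnet - 0.5))[:n_queries]
    return suggestions, p_botnet, query_idx

# The cycle would be re-run each time the pool of labels grows by another 2%.
```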
Fig. 5. Prediction bar and confidence level added to the connection list view after each learning and prediction process.
As a basic means of evidence for the prediction (R5), the PM outputs a Support Level (SL) for each prediction. The SL of a predicted label refers to the fraction of the connections with the same destination port that lie within the set of labeled connections used for building the prediction model:
$SL(sc_p) = \frac{|sc_{pt}|}{|sc_{pd}|}$

where $sc_p$ denotes an SC with destination port p, $sc_{pt}$ the set of connections with port p inside the set of labeled connections, and $sc_{pd}$ the set of connections with port p in the whole dataset.
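A minimal sketch of this computation follows; the function and variable names are ours and not part of RiskID.

```python
from collections import Counter

# Minimal sketch of the Support Level computation (names are ours, not from RiskID).
def support_level(port, labeled_ports, all_ports):
    """SL(sc_p) = |sc_pt| / |sc_pd| for a given destination port p."""
    labeled = Counter(labeled_ports)   # ports of the connections used to build the model
    total = Counter(all_ports)         # ports of every connection in the dataset
    return labeled[port] / total[port] if total[port] else 0.0

print(support_level(25, labeled_ports=[25, 25, 80], all_ports=[25, 25, 25, 80, 80, 443]))  # ~0.67
```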
4.2. Visual description

After the learning and prediction cycles have ended, an alert notifies the user about new label recommendations. Fig. 5 illustrates the label recommendation bar that appears next to an unlabeled connection (c-655, in the example). Each unlabeled connection receives a prediction bar, with red indicating the probability of Botnet and green the probability of Normal. Next to the bar, a numerical value indicates the SL of that prediction. This minimalist visual cue aims to make it easy to compare predictions over several SCs and decide which to pick next.

5. Evaluation framework

For an evaluation to be useful, one must consider its purpose and scope, select the appropriate metrics and correctly apply assessment techniques. According to the classification given by Staheli et al. [22] of the commonly used techniques for evaluating visualization, the most common evaluations are Usability Testing, Simulation and Performance Testing. We present an evaluation framework to analyze application performance using one kind of Simulation and the Application Performance Testing technique, leaving the evaluation with users for later work. The PM is evaluated through a set of experiments aimed at validating performance according to the aforementioned requirements. First, a preliminary study is carried out using traditional k-fold cross-validation to analyze the viability of a conventional machine learning algorithm in predicting Botnet connections. Second, we evaluate the behavior of the PM considering its integration with the RiskID application work-flow. For this particular case, evaluation of the learning rate and noise tolerance is considered. Thereafter, the PM is compared with the current state-of-the-art system, ILAB [11]. In the following subsections, we describe the dataset preparation along with the metrics selected for the proposed experiments and discuss the results.

5.1. Dataset description

The evaluation uses a total of 22 datasets divided into two groups of data called CTU-13 and CTU-19. All data were captured at CTU University, Czech Republic, between 2011 and 2017, and are publicly available as part of the Malware Capture Facility Project (MCFP) [23]. The CTU-13 [24] group of data consists of thirteen datasets (called scenarios) of different botnet samples, normal and background traffic captured in 2011. Specific malware was executed in each scenario. The malware used several protocols and performed
different actions such as SPAM, DDoS and Click Fraud, among others. In total, this group of datasets has 9241 connections, with 6394 connections labeled as "Botnet" and 2847 labeled as "Normal". The CTU-19 group of data consists of nineteen datasets (called scenarios) of different botnet samples: specifically, three botnet captures, 2013-08-20 capture-win15 [25], 2013-10-01 capture-win12 [26] and 2013-10-01 capture-win8 [26], and normal traffic including DNS, HTTPS and P2P [27]. In total, these captures represent 24,227 connections, with 15,737 connections labeled as "Botnet" and 8490 labeled as "Normal". All these captures were performed between 2013 and 2017. Fig. 6 shows a summary of the class distribution in CTU-13 (Fig. 6a) and CTU-19 (Fig. 6b) by type of connection. The X-axis shows the type of connection, represented by the most representative ports in the dataset, and the Y-axis the distribution between the Botnet and Normal classes for each. It is worth noting that CTU-13 presents more variety in types of connections than CTU-19. In CTU-13, a large number of connections come from port 25 (SMTP connections) and port 80 (HTTP connections), whereas CTU-19 has a similar distribution between ports 25, 53 (DNS connections), 80 and 443 (HTTPS connections). In both groups of datasets, all the connections coming from port 25 have been labeled as Botnet, and traffic coming from HTTP/HTTPS (ports 80/443) is mostly normal.

5.2. Metrics

Several standard metrics for network detection evaluation were used to assess the performance of the PM. These metrics are the True Positive Rate (TPR) and the False Positive Rate (FPR). TPR is computed as the ratio between the number of correctly detected malicious connections (True Positives) and the total number of malicious connections, whereas FPR is computed as the ratio between the number of normal connections incorrectly classified as malicious (False Positives) and the total number of normal connections. Other metrics are used for dealing with class imbalance: the F1-Score and the Receiver Operating Characteristic (ROC) curve. The F1-Score is computed as the harmonic mean of precision and TPR (recall). The ROC curve is a simple plot of TPR against FPR over different models; it can be reduced to a single scalar by calculating the Area Under the Curve (AUC). Finally, the Equalized Loss of Accuracy (ELA) metric is used to evaluate the model robustness in terms of noise tolerance. The ELA metric computes the loss of accuracy with respect to the case without noise [28]. ELA for an x% noise level is calculated as

$ELA_{x\%} = \frac{100 - A_{x\%}}{A_{0\%}}$

where $A_{0\%}$ is the accuracy of the classifier with a noise level of 0%, and $A_{x\%}$ is the accuracy of the classifier with a noise level of x%. A sketch of how these metrics can be computed is given below.
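The following sketch illustrates how the metrics above could be computed; it is not the authors' evaluation code, and the ELA formula follows the reconstruction given above.

```python
from sklearn.metrics import confusion_matrix, f1_score, roc_auc_score

# Illustrative computation of the metrics used in the evaluation (a sketch, not the
# authors' code). y_true / y_pred use 1 for Botnet and 0 for Normal; y_score is P(Botnet).
def detection_metrics(y_true, y_pred, y_score):
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return {"TPR": tp / (tp + fn),                 # detected malicious / all malicious
            "FPR": fp / (fp + tn),                 # false alarms / all normal
            "F1": f1_score(y_true, y_pred),
            "AUC": roc_auc_score(y_true, y_score)}

def ela(acc_clean_pct, acc_noisy_pct):
    """Equalized Loss of Accuracy, ELA_x% = (100 - A_x%) / A_0%, accuracies in percent."""
    return (100.0 - acc_noisy_pct) / acc_clean_pct
```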
5.3. Prediction algorithm

Under the current implementation of RiskID, a Random Forest (RF) prediction algorithm is included in the PM. The inclusion of RF inside the PM responds to its parallelization capability: the bagging process implemented by RF makes it suitable for execution in distributed environments. This parallelization capability is a key feature for improving the usability of RiskID when a large dataset needs to be labeled, a common situation in the network security research field. In addition, RF is a solution commonly used in unbalanced data situations [29,30], a condition frequently found when labeling network traffic datasets. The RF algorithm consists of a collection of tree-structured classifiers. Each tree grows with respect to a random vector $\Theta_k$, where the $\Theta_k$, $k = 1, \ldots, L$, are independent and identically distributed.
Fig. 6. Distribution of classes in CTU-13 and CTU-19 from the perspective of the connection type.
Fig. 7. Random Forest architecture with multiple decision trees; the output class is given by the majority vote.
Each tree casts a unit vote for the most popular class at input x [31]. Fig. 7 shows an example of the discussed RF implementation. Random Input Selection (RIS) is used to generate the different trees [32]: the algorithm randomly chooses a subset S of M features from the original set of n features and seeks within S the best feature to split the node. A feature subset is selected for each node, with M = log2(n) + 1, where n is the total number of features [31]. We carried out a preliminary RF test which, although not entirely representative of the real operation of RiskID, gives a general idea of how RF works on these datasets. We evaluated the performance of RF in terms of Accuracy, FPR, TPR, AUC and F1-Score. For CTU-13 and CTU-19, 70% of the original dataset was used for training the models and the remaining 30% for testing. To deal with class imbalance, an up-sampling technique was applied over the training set. The training process was performed using k-fold cross-validation (10-fold) over the training datasets; the held-out testing dataset was then used to evaluate each model. The whole process guarantees the independence of the results. A sketch of this protocol is shown below.
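The sketch below is our reconstruction of this preliminary protocol; the number of trees, the up-sampling details and the function names are assumptions, and X and y are assumed to be NumPy arrays with y equal to 1 for Botnet and 0 for Normal.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.utils import resample

# Our reconstruction of the preliminary evaluation protocol: 70/30 split, up-sampling of
# the minority class in the training set, 10-fold CV, and max_features ~ log2(n) + 1.
def preliminary_rf_test(X, y):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)

    # Up-sample the minority class of the training set only.
    minority = int(np.bincount(y_tr).argmin())
    maj_mask = y_tr != minority
    X_up, y_up = resample(X_tr[~maj_mask], y_tr[~maj_mask], replace=True,
                          n_samples=int(maj_mask.sum()), random_state=0)
    X_bal = np.vstack([X_tr[maj_mask], X_up])
    y_bal = np.concatenate([y_tr[maj_mask], y_up])

    rf = RandomForestClassifier(n_estimators=100,
                                max_features=int(np.log2(X.shape[1])) + 1,
                                random_state=0)
    cv_f1 = cross_val_score(rf, X_bal, y_bal, cv=10, scoring="f1")  # 10-fold CV on training data
    rf.fit(X_bal, y_bal)
    return cv_f1.mean(), rf.score(X_te, y_te)                       # mean CV F1, held-out accuracy
```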
Fig. 8a presents a summary of the metric values of the best RF model, and Fig. 8b presents the prediction performance of the RF model discriminated by type of connection. Both graphs result from testing RF on the two groups of data (CTU-13 and CTU-19) and averaging their values. The second figure shows the distribution of connections correctly predicted (blue bars) and incorrectly predicted (red bars). The resulting RF model was able to correctly predict all SMTP connections (port 25). Note that there is a high percentage of SMTP connections in CTU-13 and CTU-19, and the complete SMTP traffic was labeled as Botnet in both groups. On the other hand, for HTTP connections (port 80), RF showed some issues predicting all considered cases. Such results could be explained by the imbalance observed in the distribution of HTTP labels.
Fig. 8. Prediction performance of Random Forest model.
In CTU-13, the majority of HTTP traffic was labeled as "Normal" (only about 25% Botnet). There were some differences in the predictions of other types of connections, but these represent a minority portion of the dataset. These results show that RF is a viable candidate for inclusion in the RiskID Prediction Module. Hereafter, we only consider RF and perform an extensive evaluation against R1–R4.

5.4. Learning rate analysis

The PM has to deal with the problem of assigning probabilities to unlabeled connections when only a small portion of the dataset is labeled (R2). Presumably, in scenarios where not enough information is available, the estimated probabilities will not be reliable. A similar situation is observed in recommender systems, when it is necessary to make recommendations to a recent user who has no previous history with the system (cold start). Therefore, it is important to determine the amount of labeled connections required to provide reliable information to the user. The learning rate can be defined as the speed at which the PM learns new information and consequently updates the label probabilities. A system with a high learning rate will be able to adapt to new labels and provide correct predictions within a short period of time [33]. The learning rate is calculated by training the RF model with different sized portions of the training dataset and then evaluating the performance of each model on the testing portion. The experimental procedure started with a random sample of 200 connections for training. This random selection is intended to simulate the first connections labeled by the user inside RiskID. As previously explained in Section 4.1, the PM in RiskID starts when 2% of the dataset is labeled, so these 200 initial connections simulate this 2% of labeled connections in the CTU-13 dataset. In the case of the CTU-19 dataset, 200 connections represent less than 2%; despite that, we perform the analysis with a similar data distribution. After the first training of the model, each iteration increases the amount of data in the training set following an Uncertainty Sampling query selection [21]. In this way, the size of the training set was increased with the connection instances about which the model is least certain (connections in the training pool closest to a 0.5 probability of the Botnet class). This procedure was repeated until all connections in the training set were used.
The F1-Score was used to evaluate the performance of the RF model at each iteration. Each experimental scenario was simulated 30 times (i.e., 44 different training sets × 30 = 1320 simulations in total) to ensure the statistical robustness of the results. The resulting curve is plotted in Fig. 9. The X-axis refers to the size of the training set used for building the RF model, while the Y-axis refers to the mean F1 score over the 30 repetitions. In the first scenario (200 connections), the PM showed a mean test F1 score close to 0.93 using CTU-19 and close to 0.89 using CTU-13. The F1 score increased over the remaining 25 increments of the training set, reaching a value close to 0.96 using CTU-19 and close to 0.92 using CTU-13. After that point the F1-score does not show a significant improvement for either group of data. Note that for both datasets the first test already shows good results despite the little data used to train the model. Arguably, the learning rate of the RF classifier can vary for different types of connections. Such differences are caused not only by the initial disproportion in numbers between connection types, but also by the network traffic variability associated with each connection type. Fig. 10 shows the prediction performance by connection type. In particular, the figures show results when the model was generated with 2, 50 and 90 percent of labeled connections. These percentage values were selected to represent the initial, middle and final phases, respectively, of the labeling process. Even with only 2 percent of labeled data (Fig. 10a), the RF correctly detected 100 percent of SMTP connections (port 25). Such behavior can possibly be explained by the small variability of the SMTP traffic present in the dataset. In other words, since all SMTP traffic is similar, the model needed just a few samples to classify it correctly. For ports such as 80 (HTTP), 53 (DNS) and 123 (NTP), the RF model failed to detect between 21% and 31% of the connections. In such cases, given the high variability observed on those ports, it was necessary to label a larger number of connections. Figs. 10b and 10c show how the number of errors is considerably reduced as the number of labeled connections increases.

5.5. Robustness analysis

It is widely known that a classifier's performance is influenced by the quality of the labeled data used. Since the PM builds the RF model with connections labeled according to user opinions, the quality of the labels directly impacts the final prediction.
Fig. 9. Random Forest performance with incremental training data.
Fig. 10. Prediction performance by port considering different amounts of labeled connections in the dataset.
Fig. 11. ELA measure with incremental noisy data in training set.
Here we analyze the influence of wrongly labeled connections on the performance of the generated RF model; in other words, we want to analyze the RF model's tolerance to noise (R3). Clearly, a model with low tolerance to noise is not suitable, for it will also suggest noisy labels. The robustness analysis was carried out by inserting an incremental noise level in the complete training datasets. The noise level was raised from 2% to 90% in 2 percent steps. A noise level of x% was obtained by randomly switching the class label of exactly x% of the samples to the opposite class. At each 2 percent step, the noise level was increased and the performance of the PM was calculated considering a training set with 70 percent of the dataset and tested on the remaining 30 percent. The procedure was repeated 30 times for each step to ensure the statistical significance of the results (a sketch of the noise injection is shown below). Results are shown in Fig. 11 in terms of the ELA measure [28]. The X-axis represents the noise level in the training set; the Y-axis is the ELA measure averaged over the 30 repetitions for each step.
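The following sketch is our reconstruction of the label-noise injection; the function name, the random seed handling and the 0/1 label convention are assumptions.

```python
import numpy as np

# Sketch of the label-noise injection used in the robustness analysis (our reconstruction):
# flip exactly x% of the training labels to the opposite class, uniformly at random.
def inject_label_noise(y, noise_level, seed=0):
    """y: array of 0/1 labels; noise_level: fraction in [0, 1], e.g. 0.02, 0.04, ..., 0.90."""
    rng = np.random.default_rng(seed)
    y_noisy = np.asarray(y).copy()
    n_flip = int(round(noise_level * len(y_noisy)))
    flip_idx = rng.choice(len(y_noisy), size=n_flip, replace=False)
    y_noisy[flip_idx] = 1 - y_noisy[flip_idx]   # Normal <-> Botnet
    return y_noisy
```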
As expected, Fig. 11 shows that ELA increases with the noise level present in the training set, with similar performance for both the CTU-13 and CTU-19 datasets. At noise levels between 35% and 60%, the steepness of the curve becomes significant. However, ELA values remain under 0.15 for datasets with noise levels close to 30%. Such a moderate increment indicates that RF is robust to noise and reinforces its inclusion in RiskID.

5.6. Comparison with ILAB strategy

In this section we compare the labeling results of two strategies that could be included inside the PM: the RF model using Uncertainty Sampling and the ILAB strategy proposed by Beaugnon et al. [11]. ILAB implements an active learning technique through a Logistic Regression [34] model, following at each step a query selection known as Rare Category detection [16]. Unlike Uncertainty Sampling, Rare Category detection is applied separately on the instances that are more likely to be malicious (Botnet) and benign (Normal) according to the detection model. Not all kinds of connections are present in the initial pool of labeled data, and rare category detection fosters the discovery of yet unknown groups of connections to avoid sampling bias.
Fig. 12. ILAB and Random Forest performance with incremental training data.
We replicated the ILAB implementation to compare the performance of both strategies (RF using Uncertainty Sampling and ILAB), following the same methodology as the previous studies (learning rate and robustness analysis) over the same groups of data (CTU-13 and CTU-19). First, we compare how each strategy deals with the problem of assigning probabilities to unlabeled connections when only a small portion of the dataset is labeled, i.e., a learning rate analysis (Fig. 12). Then, we compare the influence of wrongly labeled connections on the performance of each strategy (Fig. 13).

Learning rate. Fig. 12a shows the results of the learning rate study for RF using Uncertainty Sampling and for the ILAB strategy on CTU-13. Note that the performance of the ILAB strategy presents little variation as the number of elements in the training set increases. Initially, training with only 2% of the elements in the training pool, ILAB shows an F1 score close to 0.87, and it ends with a value around 0.88 when the model is trained with the whole training set. A similarly small variability is obtained when testing the ILAB strategy on CTU-19 (Fig. 12b): in this case the ILAB model starts predicting connections with an F1 score just under 0.93 and ends with a value close to 0.94. On the other hand, the RF strategy presents greater variation as the training set increases. As illustrated in Section 5.4, the learning rate performance of the RF strategy increases with the amount of data in the training set. Our strategy achieves F1 scores of 0.92 and 0.95 when the complete training sets (CTU-13 and CTU-19, respectively) are used to build the models. As can be seen in both figures, our strategy obtains better results than the ILAB strategy as the training set grows.

Robustness analysis. Fig. 13 represents the influence of wrongly labeled connections on the performance of the RF model and ILAB strategies for both groups of data. As expected, the ELA value increases with the noise level in all cases. However, the results obtained using CTU-13 (Fig. 13a) show a difference between
the two strategies. In this case RF has better noise tolerance than ILAB up to a noise level close to 60% (note that the RF curve is below the curve obtained with the ILAB strategy). Beyond 60% of noisy data in the training set, the RF strategy loses performance faster (its ELA value tends to one more quickly). On the other hand, the results obtained using CTU-19 are very similar for RF and ILAB. Note that both strategies have a similar ELA value for the first 30% of the noise level. After 30%, the RF-based strategy starts to increase faster, but for noise levels above 50%, ILAB tends to one more quickly.

5.7. Impact on overall performance

The present experiment aims at evaluating the benefits provided by the PM compared with the common labeling strategy described in Section 3.4 (R4), hereafter referred to as the Simple Comparative Strategy (SCS). To this end, we simulate a user following the SCS strategy. From an algorithmic perspective, the SCS can be implemented as follows (a sketch of this procedure is given below):

1. Select the first unlabeled connection from RiskID.
2. Move the selected connection to the top of the list.
3. Reorder the remaining connections by their similarity (cosine similarity) with the selected connection.
4. Pick an odd number of labeled connections from the top of the list (excluding the selected connection) and select the majority label.

The performance of SCS in terms of F1 score was evaluated following the methodology described in Section 5.4. Fig. 14 shows the learning rate for SCS. Each point on the X-axis indicates the amount of labeled connections in the dataset, while the Y-axis refers to the average F1 score for each iteration. Despite its simplicity, SCS achieved an F1 score of about 0.89 with only 200 labeled connections (approximately 2 percent of the whole CTU-13 dataset). The F1 score increased up to 0.92 with 7500 labeled connections (i.e., 75% of the CTU-13 dataset). Fig. 15 compares the results of SCS (blue lines) with the results of the RF model (green lines). An additional random selection strategy (red lines) is also plotted in the figure as a reference.
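The sketch below illustrates the simulated SCS step; the neighbourhood size k and the function name are our assumptions, not taken from the paper.

```python
import numpy as np
from collections import Counter
from sklearn.metrics.pairwise import cosine_similarity

# Sketch of the simulated Simple Comparative Strategy (k and the names are assumptions;
# k should be odd to avoid ties in the majority vote).
def scs_label(x_unlabeled, X_labeled, y_labeled, k=5):
    """Assign the majority label among the k labeled connections most similar to x_unlabeled."""
    sims = cosine_similarity(x_unlabeled.reshape(1, -1), X_labeled).ravel()
    top_k = np.argsort(-sims)[:k]
    return Counter(np.asarray(y_labeled)[top_k]).most_common(1)[0][0]
```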
Fig. 13. ILAB and Random Forest noise tolerance in terms of the ELA measure.
Fig. 14. SCS performance with incremental training data.
Fig. 15. Random Selection, RF and SCS performances with incremental training data.
The random selection strategy consists of randomly selecting a label for each unlabeled connection. Testing against a random strategy is a practice widely used in the evaluation of recommender systems. The learning rate curves for the three strategies are similar. The mean F1 scores in the first scenario for random selection, SCS and the RF model were 0.89, 0.85 and 0.92, respectively.
Clearly, the implementation of an active learning strategy (using the RF model in this case) improves on both strategies: SCS and random selection.

6. Conclusions

In this article, we propose an active learning strategy to help the labeling process of network traffic datasets containing
Normal and Botnet connections. In particular, a new Prediction Module was developed and inserted into the RiskID application workflow. Given a partially labeled dataset, the new Prediction Module constructs a random forest from the previously labeled connections. The resulting model is capable of estimating the class probability of the remaining unlabeled connections. Once the probability model is built, the application of the Uncertainty Sampling technique instructs the user to label those unlabeled connections with a probability very close to the decision boundary (the most dubious connections), thereby helping to improve the performance of the prediction model in a future learning cycle. The prediction model was tested on a total of 22 datasets divided into two groups of data called CTU-13 and CTU-19. The viability of applying Random Forest as a connection predictor was evaluated considering the standard machine learning 70/30 split. The resulting model showed an accuracy of 0.93, providing a very accurate prediction of SMTP and HTTP traffic. However, based on the requirements elicited for the process of labeling a network traffic dataset, a more adequate evaluation process became necessary to verify the viability of including a model predictor inside RiskID. Therefore, we proposed an evaluation framework to validate the prediction module by analyzing learning rate, detection rate and robustness against noise, as well as the improvements over the existing labeling strategy. The prediction model showed a good learning rate, improving the detection accuracy as we increased the number of instances in the training set (see Fig. 9). Likewise, the prediction model was able to correctly predict all the SMTP traffic with only two percent of the data in training, and the probability estimation progressively improved as more connections were labeled. The robustness study also showed satisfactory results: the proposed model was capable of accepting 30 percent of noise in the training set and still producing correct labels (see Fig. 11). Finally, different labeling strategies were compared: the Prediction Module based on RF using the Uncertainty Sampling strategy, the ILAB strategy, SCS and the random selection strategy. It is clearly observed that the prediction module based on RF exceeds the ILAB strategy and the labeling method normally used in plain RiskID. Since the studies recreate the use of the application over time and the interaction between users and the application, we consider that this comprehensive set of studies represents a methodology for evaluating the performance of any active learning solution. We contend that this evaluation framework represents a contribution in itself, and hope that researchers will consider the methodology when validating the suitability of other assistive algorithms for similar annotation tasks. The new Prediction Module does not pretend to be determinant when deciding whether a connection is "Botnet" or "Normal"; we simply intend to steer the process. Many factors play a role in the complex labeling process. To measure the real impact of the proposed prediction module, we need to consider not only a statistical evaluation but also the user's interaction with, and confidence in, the proposed extension. Such an evaluation is beyond the scope of this paper and is the subject of future work.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Supplementary material Supplementary material associated with this article can be found, in the online version, at doi:10.1016/j.jisa.2019.102388.
References

[1] Catania C, Garcia Garino C. Automatic network intrusion detection: current techniques and open issues. Comput Electr Eng 2012;7(11):1063–73.
[2] Bhuyan MH, Bhattacharyya DK, Kalita JK. Towards generating real-life datasets for network intrusion detection. Int J Netw Secur 2015;17(6):683–701.
[3] Sommer R, Paxson V. Outside the closed world: on using machine learning for network intrusion detection. In: Proceedings of the IEEE symposium on security and privacy; 2010. p. 305–16. doi:10.1109/SP.2010.25.
[4] Sebastian G. Stratosphere research laboratory. 2015. https://stratosphereips.org/ [Online; accessed Jun-2018].
[5] University of California I. Knowledge discovery in databases DARPA archive. 1999. http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html [Online; accessed September-2016].
[6] DEFCON Hacking Conference - capture the flag archive. 2011. https://www.defcon.org/html/links/dc-ctf.html [Online; accessed April-2018].
[7] Center for applied internet data analysis. 1997. University of California, San Diego. http://www.caida.org/ [Online; accessed April-2019].
[8] Mukkavilli SK, Shetty S, Hong L. Generation of labelled datasets to quantify the impact of security threats to cloud data centers. 2016:172–84. http://www.scirp.org/journal/PaperInformation.aspx?paperID=65482. doi:10.4236/jis.2016.73013.
[9] Görnitz N, Kloft M, Rieck K, Brefeld U. Active learning for network intrusion detection. 2009. doi:10.1145/1654988.1655002.
[10] Aparicio-Navarro FJ, Kyriakopoulos KG, Parish DJ. Automatic dataset labelling and feature selection for intrusion detection systems. In: Proceedings of the IEEE military communications conference (MILCOM); 2014. p. 46–51. doi:10.1109/MILCOM.2014.17.
[11] Beaugnon A, Chifflier P, Bach F. ILAB: an interactive labelling strategy for intrusion detection. In: Dacier M, Bailey M, Polychronakis M, Antonakakis M, editors. Research in attacks, intrusions, and defenses. Cham: Springer International Publishing; 2017. p. 120–40. ISBN 978-3-319-66332-6.
[12] Soule A, Rexford J. Webclass: adding rigor to manual labeling of traffic anomalies. Comput Commun Rev 2008;38(1):35–8. doi:10.1145/1341431.1341437.
[13] Pius Owoh N, Mahinderjit Singh M, Zaaba ZF. Automatic annotation of unlabeled data from smartphone-based motion and location sensors. Sensors (Switzerland) 2018;18(7). doi:10.3390/s18072134.
[14] Lemay A, Fernandez JM. Providing SCADA network data sets for intrusion detection research. In: Proceedings of the USENIX CSET; 2016.
[15] Sperotto A, Sadre R, Van Vliet F, Pras A. A labeled data set for flow-based intrusion detection. In: Lecture notes in computer science, vol. 5843 LNCS; 2009. p. 39–50. doi:10.1007/978-3-642-04968-2_4.
[16] Pelleg D, Moore A. Active learning for anomaly and rare-category detection. Adv Neural Inf Process Syst 2004;18(2):1073–80.
[17] Guerra J, Catania CA, Veas E. Visual exploration of network hostile behavior. In: Proceedings of the ACM workshop on exploratory search and interactive data analytics - ESIDA '17; 2017. p. 51–4. doi:10.1145/3038462.3038466.
[18] Shneiderman B. The eyes have it: a task by data type taxonomy for information visualizations. Craft Inf Vis 2003:364–71. doi:10.1016/B978-155860915-0/50046-9.
[19] Kodinariya T, Makwana P. Review on determining number of cluster in K-Means clustering. Int J Adv Res Comput Sci Manag Stud 2013;1(6):90–5. www.ijarcsms.com.
[20] Malware capture facility project. 2013. Czech Technical University. https://mcfp.weebly.com/ [Online; accessed May-2019].
[21] Lewis DD, Gale WA. A sequential algorithm for training text classifiers. In: Proceedings of the 17th annual international ACM SIGIR conference on research and development in information retrieval. New York, NY, USA: Springer-Verlag New York, Inc.; 1994. p. 3–12. ISBN 0-387-19889-X. http://dl.acm.org/citation.cfm?id=188490.188495.
[22] Staheli D, Yu T, Crouser RJ, Damodaran S, Nam K, O'Gwynn D, et al. Visualization evaluation for cyber security. In: Proceedings of the eleventh workshop on visualization for cyber security - VizSec '14; 2014. p. 49–56. doi:10.1145/2671491.2671492.
[23] Garcia S. Identifying, modeling and detecting botnet behaviors in the network. Ph.D. thesis. UNICEN University; 2014. doi:10.13140/2.1.3488.8006.
[24] The CTU-13 dataset. 2011. Stratosphere Project. https://www.stratosphereips.org/datasets-ctu13/ [Online; accessed Jun-2018].
[25] The CTU-19 dataset, botnet kelihos tdptu02.exe. 2013. https://mcfp.felk.cvut.cz/publicDatasets/CTU-Malware-Capture-Botnet-3/ [Online; accessed Jun-2018].
[26] The CTU-19 dataset, botnet 39UvZmv.exe. 2013. Stratosphere Project. https://mcfp.felk.cvut.cz/publicDatasets/CTU-Malware-Capture-Botnet-1/ [Online; accessed Jun-2018].
[27] The CTU-19 dataset, normal datasets. 2013. Stratosphere Project. https://www.stratosphereips.org/datasets-normal/ [Online; accessed Jun-2018].
[28] Sáez JA, Luengo J, Herrera F. Evaluating the classifier behavior with noisy data considering performance and robustness: the equalized loss of accuracy measure. Neurocomputing 2016;176:26–35. doi:10.1016/j.neucom.2014.11.086.
[29] Ruiz-Gazeb A, Villa N. Storms prediction: logistic regression vs random forest for unbalanced data. Case Stud Bus Ind Gov Stat 2007;1(2):91–101. http://arxiv.org/ftp/arxiv/papers/0804/0804.0650.pdf.
[30] Liu M, Wang M, Wang J, Li D. Comparison of random forest, support vector machine and back propagation neural network for electronic tongue data classification: application to the recognition of orange beverage and chinese vinegar. Sens Actuators B Chem 2013;177:970–80. doi:10.1016/j.snb.2012.11.071.
[31] Breiman L. Random forests. Mach Learn 2001;45(1):5–32. doi:10.1023/A:1010933404324.
[32] Kuncheva LI. Combining pattern classifiers: methods and algorithms. 2nd ed. Hoboken, New Jersey: John Wiley & Sons, Inc.; 2014. ISBN 9781118914564. doi:10.1002/9781118914564.
[33] Avazpour I, Pitakrat T, Grunske L, Grundy J. Recommendation systems in software engineering. 2014. doi:10.1007/978-3-642-45135-5.
[34] Collins M, Schapire RE, Singer Y. Logistic regression, AdaBoost and Bregman distances. Mach Learn 2002;48(1–3):253–85. doi:10.1023/A:1013912006537.