On botnet detection with genetic programming under streaming data label budgets and class imbalance


Sara Khanchi, Ali Vahdat, Malcolm I. Heywood⁎, A. Nur Zincir-Heywood
Faculty of Computer Science, Dalhousie University, 6050 University Av., Halifax, NS, Canada

Keywords: Non-stationary data; Streaming data; Botnet detection; Class imbalance; Genetic programming

ABSTRACT

Algorithms for constructing models of classification under streaming data scenarios are becoming increasingly important. In order for such algorithms to be applicable under 'real-world' contexts we adopt the following objectives: 1) operate under label budgets, 2) make label requests without recourse to true label information, and 3) remain robust to class imbalance. Specifically, we assume that model building is only performed using the content of a Data Subset (as in active learning). Thus, the principal design decisions are with regard to the definitions employed for sampling and archiving policies. Moreover, these policies should operate without prior information regarding the distribution of classes, as this varies over the course of the stream. A team formulation for genetic programming (GP) is assumed as the generic model for classification in order to support incremental changes to classifier content. Benchmarking is conducted with thirteen real-world Botnet datasets with label budgets of the order of 0.5–5% and significant amounts of class imbalance. Specific recommendations are made for detecting the costly minor classes under these conditions. Comparison with current approaches to streaming data under label budgets supports the significance of these findings.

1. Introduction

Streaming data applications represent an environment in which data arrives on a continuous basis and exhibits non-stationary properties such as concept drift [1–4]. Thus, records ($\vec{x}$) appear sequentially at discrete points in time, t, and are described by a joint probability distribution $p_t(\vec{x}, d)$, where in this work d represents the record's unknown true label. If for two points in time, t and t + 1, there exists an $\vec{x}$ such that $p_t(\vec{x}, d) \neq p_{t+1}(\vec{x}, d)$, then concept drift has occurred. Such drift might be slow or abrupt, subject to repetition, and/or affect different subsets of classes at different points in time. The goal of a classification model operating on such streams is therefore multifaceted. Not only is it necessary to suggest labels for multiple classes of data in the stream on a real-time/anytime basis, but it is also necessary for the model to identify what data to learn from.1 The process of identifying what to learn from constitutes a 'label request', as a human expert is ultimately responsible for providing ground truth labels. Moreover, it is only feasible for the model to request labels for a small fraction of the data (the cost of acquiring labels is high). Such constraints potentially appear in several applications, e.g. constructing trading agents for financial services or labelling satellite data.




In this work we are motivated by the particular issue of identifying Botnet behaviours in network traffic data. Botnets represent a networked collection of devices whose security was at some point compromised (the bots), allowing a bot herder/master to remotely control the bots. The owners of the compromised devices are unaware of the ability of the bot master to control their devices. The bot master is then free to use the bots to launch a wide range of malicious behaviours while hiding their own identity. Detection of Botnets is nontrivial because: 1) malicious behaviours are mixed in with legitimate (normal) behaviours; 2) users have a wide range of 'normal' behaviours; 3) network load and application mix are time varying parameters; 4) many applications dynamically switch between different modes of operation in unpredictable ways (e.g., services such as Skype and Tor explicitly attempt to hide their communication protocols); 5) new applications/updates to current applications (whether malicious or not) coexist with old versions of the same application, resulting in multiple simultaneous 'fingerprints' for the same application; and, 6) the ratio of data pertaining to malicious versus non-malicious behaviour is very low.

The Botnet detection scenario is framed as follows. We cannot predict a priori when Botnet behaviours will appear in the stream, as network data represents a mixture of normal and malicious data.

⁎ Corresponding author. E-mail address: [email protected] (M.I. Heywood).
1 The non-stationary properties of the stream imply that training data has to be identified interactively during the course of deployment.

http://dx.doi.org/10.1016/j.swevo.2017.09.008 Received 16 November 2016; Received in revised form 16 August 2017; Accepted 9 September 2017 2210-6502/ © 2017 Elsevier B.V. All rights reserved.



Normal network data is also non-stationary, implying that it is also not feasible to pre-train models off-line and then deploy them (such models will always 'go stale' as both normal and malicious data change at unpredictable points in time). Human expert(s) are available for providing true labels, d, for a small subset of the stream data (i.e. a label budget) on a continuous basis. This is necessary because attacks against the machine learning algorithm itself lead to an attacker 'reprogramming' the classification of attack behaviours as normal by manipulating stream data content [5,6]. In order to decouple human experts from the raw throughput of the network data, only the GP framework will identify data for labelling, not the human, i.e. this step cannot assume access to the true labels. We assume that the human experts are trustworthy (otherwise GP models could again be misled). A champion GP individual must always be available for label prediction, before any label querying can take place (real-time anytime operation). The GP framework therefore operates interactively with the stream, providing predictions about the content (normal or Botnet) and directing the human labelling of the stream under a finite label budget.2 In framing the task this way, the proposed system has the ability to operate under incoming and/or outgoing network traffic on a wide range of network devices including servers and client devices. Such a framework would be deployed to protect institutions/infrastructure such as medical, financial or other institutions, with human security experts acting as the ultimate source of trusted label information. Other scenarios might include IT security companies who provide the anytime classifier to service subscribers and retain the other components of the architecture.

In the following, we develop the topic by reviewing previous works that address both the ability to operate under label budgets and the issue of class imbalance under streaming data (Section 2). The framework we propose assumes a teaming formulation for genetic programming (GP), where team GP formulations provide an evolutionary approach for adapting an 'ensemble' of GP programs to data content. Section 3 establishes how GP teams are evolved from a fixed size Data Subset, as per active learning; thus the following two critical decisions are addressed: 1) how to sample records from the stream to appear within the Data Subset without requiring label information; and 2) how to identify records for replacement from the Data Subset when the subset is full. Section 4 develops the methodology adopted for streaming classification algorithms operating under label budgets with class imbalance, and introduces the real-world Botnet datasets employed for benchmarking. The ensuing empirical study both quantifies the significance of the GP teaming approach and compares to recent work capable of operating under label budgets (Section 5). We make specific recommendations regarding sampling versus replacement policies for GP and quantify the impact of operating under low label budgets while addressing class imbalance. Indeed, for streaming data applications to be appropriate for real-world applications, it is necessary for them to operate under both of these constraints simultaneously. Section 6 concludes the paper and suggests future research.

2 Predictions from the anytime classifier might also be used to prioritize the records identified for labelling, i.e. a record predicted as an attack class would be prioritized over a record carrying a normal class prediction.

2. Related work

Several recent survey articles have appeared that provide overviews of the scope of model building for streaming data classification under non-stationary streams [2–4]. In the following we will concentrate on highlighting issues specific to the problem setting of streaming classification under label budgets: Section 2.1 reviews developments regarding imbalanced data, change detection, and (online) active learning from the perspective of non-evolutionary methods; and Section 2.2 provides an equivalent survey from the perspective of explicitly evolutionary methods.

2.1. Non-evolutionary methods

Change detection is a mechanism used to initiate retraining of a model. Thus, only when sufficient change is detected will a model be updated. This potentially means that model building is decoupled from the need to provide labels. For example, Lindstrom et al. describe a process by which a reference distribution is constructed and used to calibrate the model [7]. As the model passes over the stream, a divergence measure (expressing model confidence independent from label information) is used to trigger model reconstruction. Any model reconstruction is only performed from the most recent window content. Such an approach only requests labels once a change is detected. However, it also assumes that variation is solely captured by the unconditional distribution of data $p(\vec{x})$. Any change to the posterior distribution of data $p(y|\vec{x})$ remains undetected [8].

Active learning implies that labels are explicitly sought for some fraction of the data, and employs some form of change detection/uncertainty threshold to initiate label requests. Several authors have proposed bias/variance minimization schemes for this purpose [9,10,1,11,12]. That said, empirical benchmarking has demonstrated that just sampling with uniform probability (up to the label budget) is sufficient to build surprisingly effective models, but only when class instances are well mixed [9,8]. Žliobaitė et al. introduced an active learning algorithm that balances stochastic sampling with model based uncertainty sampling in order to simultaneously address changes to both $p(\vec{x})$ and $p(y|\vec{x})$; moreover, this is achieved within fixed label budgets. Such an algorithm combines (model driven) uncertainty sampling with random sampling. Additionally, the active learning approach was sufficiently generic to be deployed with both the streaming formulations for Naive Bayes and Hoeffding decision tree models of classification.

Several recent works investigate the issue of class imbalance under streaming data contexts. One approach is to adopt a formulation of bagging with under- or oversampling in order to construct a 'Data Subset' from which model building is performed [13,14]. Specifically, Ditzler et al. emphasize operation under an incremental (i.e., batch) updating constraint while also supporting anytime labelling, whereas work by Wang et al. emphasizes operating under an online (i.e., record-wise) updating constraint. Also of note is that even though Ditzler et al. assumed the SMOTE algorithm developed for operating under stationary data with class imbalance [15], this was not the most effective method investigated for operation under non-stationary data [13]. A second general approach adopted for addressing class imbalance under streaming data is the use of dynamically reweighting class costs [16,17], where this has also been reported under an active learning context [18]. Most recently, attention has been paid to scenarios in which classes repeatedly drop in and out of the stream against a general backdrop of classes that appear on a continuous basis [19], albeit with true labels known for the entire stream content. In the application context associated with this work, the generic number of classes is known (e.g. attack or normal), but when they might appear is not.

Various semi-supervised frameworks have also been proposed for operation under streaming data contexts, and provide a natural framework for addressing labelled versus unlabelled data [20,21], i.e. after an initial period of training from labelled data, the classifiers go 'online' using unsupervised learning alone. Specific points of interest include operation under class imbalance and non-stationary data. However, from the perspective of this work, operation online using unsupervised learning would make such an approach particularly susceptible to adversarial attackers [5,6].


Fig. 1. Overall framework within which the GP streaming framework operates. The Sampling policy (S) determines whether a record should have its label requested (while operating under the label budget constraint β). The Archiving policy (A) maintains a finite size Data Subset, identifying Gap records for replacement. On updating the DS with the set of Gap(i) labelled records, τ generations of GP are performed. A single champion individual is always available for predicting labels, y(t), and may also influence the Sampling policy.
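To make the interaction summarized in Fig. 1 concrete, the following minimal Python sketch walks through one pass of the loop. The component names (sampling_policy, archiving_policy, request_label, evolve, select_champion) are illustrative placeholders for the mechanisms detailed in Sections 3.2–3.5; this is not the authors' implementation.

```python
# Minimal sketch of the Fig. 1 loop: predict-then-train over non-overlapping
# windows W(i) under a label budget (beta = gap / |W(i)|).
# All component names are illustrative placeholders, not the authors' code.

def stream_gp(stream_windows, gap, tau, sampling_policy, archiving_policy,
              request_label, evolve, select_champion, population, data_subset):
    champion = None
    for window in stream_windows:                        # W(i): non-overlapping window
        # Anytime operation: the current champion labels every record first.
        predictions = [champion.predict(x) if champion else None for x in window]

        # Sampling policy S: choose Gap records for labelling; the true
        # labels d(t) are only revealed for the chosen records.
        chosen = sampling_policy(window, predictions, data_subset, gap)
        gap_records = [(x, request_label(x)) for x in chosen]

        # Archiving policy A: merge Gap(i) into the finite Data Subset DS(i).
        data_subset = archiving_policy(data_subset, gap_records, gap)

        # tau generations of GP against DS(i), then (re)identify the champion.
        population = evolve(population, data_subset, tau)
        champion = select_champion(population, data_subset)
    return champion
```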

2.2. Evolutionary methods

Model building for streaming data using evolutionary methods has not been as extensively investigated as under non-evolutionary methods. GP frameworks have been designed to address the specialized task of forecasting for financial data [22]. Such an application implies that the model predicts the direction of movement in the next step of a temporal sequence, and as such is synonymous with a specific example of streaming classification tasks in general (see also predicting electricity utilization [23]). Such frameworks are limited to tasks without label budgets. Folino and Papuzzo propose a GP ensemble for operating under streams exhibiting different non-stationary properties [24]. Change detection was defined using a statistic applied to pairs of windows, hence model updating was only triggered by $p(\vec{x})$ but not $p(y|\vec{x})$. Moreover, a parallel distributed GP framework was necessary to support the rebuilding of GP ensembles. Dempsey et al. investigated the role of genotype-to-phenotype mappings under dynamic environments (stock market data in particular) [25]. Specifically, they emphasize the significance of evolvability/plasticity in facilitating adaptation under non-stationary data.

Earlier versions of the framework adopted in this work have assumed that GP teams are evolved from: 1) a Data Subset under some policy for sampling records from the stream; and 2) an archiving policy for determining which records to keep/replace. Initial attempts to describe the archiving policy using Pareto archiving indicated that label error (a form of noise disrupting the ability to accurately model $p(y|\vec{x})$) had a significant negative effect on the ability to build robust models [26]. Adopting a simple uniform sampling policy under label budgets provided a more robust starting point [27], albeit with full label information necessary to guarantee a balanced Data Subset. It was also demonstrated that adopting a team GP formulation is much more effective at reacting to change than assuming a single (monolithic) GP individual as the solution [28]. Finally, the issue of class imbalance of the Data Subset was shown to have implications for the quality of the resulting GP model [29,30]. In this work, we concentrate on the Botnet detection task in particular (the earlier works were limited to artificial datasets), where this represents a particularly challenging task for streaming data analysis, i.e. very imbalanced data, classes that continuously appear and disappear, a significant cost to misclassification, and low label budgets.

Several earlier approaches assume prototype style representations, such as learning classifier systems (LCS). Dam et al. concentrate on measuring the reaction time of LCS under a 'multiplexer' task reformulated to exhibit variation in $p(\vec{x})$ and suggest that population reinitialization is the most appropriate mechanism for reacting to change once detected [31]. Behdad and French investigate an approach to LCS in which the order of explore and exploit cycles is reversed compared to that traditionally assumed for off-line batch learning [32]. Finally, we note that k-NN approaches can be managed through evolutionary methods such as particle swarm optimization and potentially applied to streaming data classification tasks [33]. All of these methods are limited to different subsets of the functionality of the framework proposed in this work.

3. Framework for streaming GP teams

3.1. Streaming data environment under a label budget

Streaming algorithms are defined as online [34] or incremental [35]. Online algorithms operate instance-wise, possibly from the content of a finite length sliding window of sequentially encountered instances.3 Conversely, incremental algorithms process data in 'chunks' or 'blocks' defining a non-overlapping window in which the most recent set of instances from the stream is available. Any querying performed by the streaming algorithm is limited to the data in the window (whether sliding or non-overlapping). This reinforces the limited memory constraint implicit in streaming data. In this work we assume an incremental non-overlapping approach (Fig. 1).

The data stream is defined by a continuous sequence of D-dimensional records, …, x(t), x(t + 1), …, where t represents the temporal index. The continuous nature of the stream implies that t → ∞. Each record has a (true) label, d(t), that is not available unless explicitly requested. Label requests can only be made using: 1) records within the current window, and 2) a label budget, β, such that β = 0.5 implies that fifty percent of records may have their corresponding label requested. Operation within the context of a label budget implies that it is necessary for a sampling policy, S, to be defined to explicitly identify which records will have their labels requested. Such a sampling policy only operates on the records in window W(i), and cannot revisit records once a decision has been made. However, anytime operation implies that for each record, x(t), a label prediction, y(t), is made (in real-time) by the streaming classifier. Given that a population based paradigm will be pursued (GP in this case), this implies that a champion classifier (i.e. individual from the GP population) has to be available to provide a label prediction before any of the stream appears in the window. Fig. 1 summarizes how these concepts are related in this work. The above formulation of the streaming data classification task implies that:

• All of the data in the stream has labels first proposed by the champion classifier. Any updates to the GP population are only performed after the champion individual makes its label prediction, y(t);
• Training is an interactive process, with the stream GP framework making decisions about what records to request true labels for, subject to the label budget β;
• Records arrive in an order dictated by the underlying properties of the task, hence class balance is not likely to be present within any local region of the stream.

3 As in a ‘first-in-first-out’ (FIFO) data structure in which the most recent instance (from the stream) pushes out the oldest instance from the FIFO.




Changes to the champion classifier are also made by the stream GP framework, but once a change takes place the new champion cannot revisit any previous labelling decision(s).

3.2. Overall framework

We will assume the overall generic framework of Fig. 1. The sampling policy, S, operates under the label budget constraint β to identify a total of Gap records that have their corresponding (true) label, d(t), requested. Once the Gap records have had a label requested, a corresponding Gap records are identified for replacement from the finite sized Data Subset, DS(i). The Data Subset therefore decouples fitness evaluation from stream cardinality and potentially provides the ability to introduce biases into the representation of each class. Moreover, we implicitly assume an 'incremental' approach to model building under streaming data. This implies that fitness evaluation is only performed once the Gap new label requests have been made. Thus, DS(i) denotes the specific point at which a batch of Gap new records enters the finite size Data Subset. An archiving policy, A, prioritizes records for replacement/retention within the Data Subset. After identifying DS(i), a fixed number of τ generations are performed and a champion GP individual is identified. Thus, streaming operation commences following an initial cold start necessary to identify the first champion individual. Note that the rate of champion identification does not exceed the rate of DS updating, but it need not be the same.

We will assume a symbiotic bid-based (SBB) formulation for expressing solutions as teams of GP programs [36]. Such a framework cooperatively coevolves GP individuals through a bidding mechanism that identifies context for an action, in this case a class label. Each program is assigned a single action at initialization; teams and programs are represented in independent populations. The only constraint on team membership is that there must be at least two programs per team, and there must be at least two different actions present across all the programs participating within the same team. Moreover, the SBB framework addresses multi-class classification without any additional modification. Previous work has demonstrated that such a framework is more effective than monolithic (canonical) GP under off-line classification tasks [37] and streaming classification tasks [28]. The focus of this work lies in how to define effective policies for sampling and archiving such that we construct classifiers capable of operating under the Botnet application context. The resulting sampling and archiving policy definitions are independent of the specific GP framework assumed. For completeness, Appendix A summarizes the SBB framework.

Sections 3.3 and 3.4 characterize the approaches taken to defining the Sampling and Archiving policies. Naturally, we assume that the ability to construct a classifier robust to class imbalance (as measured with respect to any local region of the stream) is biased by the content of the Data Subset. However, this needs to be traded off against the desire to react to changes in the stream. Thus, maintaining underrepresented classes in the Data Subset at the expense of records representing the more frequently occurring classes potentially results in less sensitivity to the most recent data. Identifying the specific balance between these two factors will be a theme we will return to during benchmarking. Finally, Section 3.5 addresses the specific mechanism employed for champion identification, thus anytime prediction of y(t) given x(t).

3.3. Sampling policy

A uniform sampling policy will be assumed as our baseline/control approach. This samples records from the (non-overlapping) window location W(i) prior to the availability of the label information (choose record x(t) for labelling with probability P(β)). Previous works have demonstrated that such a starting point is not necessarily bettered when assuming more sophisticated algorithms [9,8]. A second sampling policy is considered that uses the GP champion classifier, gp*, to promote records for labelling. That is to say, when gp* predicts a class label, y(t), representing a minority class, it is prioritized for sampling (subject to label budget β). In effect, we are using gp* to actively promote records that could contribute to rebalancing the content of the Data Subset, DS(i). Given that performance of the GP classifier is correlated with the distribution of records in the DS, we are attempting to provide a Sampling policy that actively addresses this without requiring label information. Hereafter we refer to this as the biased sampling policy.

Algorithm 1 provides a summary of such a process. We first need to characterize the content of the previous Data Subset, DS(i − 1). Let C denote the number of classes currently present in DS(i − 1) and c the set of classes appearing with frequency ≥ |DS|/C in DS(i − 1), where |DS| is the size of the Data Subset (the over-represented class(es)). We 'mark' the current non-overlapping window content, W(i), in terms of whether the predicted label, y(t) ∉ c (Step 1). If there are fewer marked instances than capacity in Gap, labels are requested for all such cases and the resulting records copied to Gap (Step 2). If there is any remaining capacity in Gap, we fill it by sampling uniformly from the non-marked instances in W(i) (Step 2b). Finally, in the case of more instances marked than capacity in Gap, we sample from the marked instances using a roulette wheel (Step 3). The roulette wheel samples from W(i) with frequency inversely proportional to (marked) DS(i − 1) class content. There is one special case, and that is the case of a cold start. Under this condition there is no champion GP individual available to provide label predictions. We therefore assume the uniform sampling policy for W(i = 0).

Algorithm 1. Biased Sampling Policy. Let rnd(A) return a randomly sampled instance from set A without replacement. roulette(A, b) returns a randomly sampled instance (without replacement) from the subset A ∧ b with frequency inversely proportional to DS(i − 1) class content. Gap(i) is the set of records transferred to DS(i) (Fig. 1) at non-overlapping window location i.

Input: The current content of the (non-overlapping) window W(i) and predicted labels, y(t). The set c of over-represented classes from Data Subset DS(i − 1).
Initial state: Gap(i) = ∅; cnt1 = cnt2 = 0

1. For all t ∈ W(i)
   (a) IF y(t) ∉ c THEN Mt = 1 AND cnt1 = cnt1 + 1 ELSE Mt = 0
2. IF cnt1 ≤ Gap THEN
   (a) For all t in which Mt == 1
       (i) Request d(t)
       (ii) Gap(i) = Gap(i) ∪ (x(t), d(t))
   (b) WHILE cnt1 < Gap
       (i) t = rnd(W(i)) subject to Mt == 0
       (ii) Request d(t)
       (iii) Gap(i) = Gap(i) ∪ (x(t), d(t))
       (iv) cnt1 = cnt1 + 1
3. ELSE
   (a) For any t in which Mt == 1
       (i) (x(t), d(t)) ← roulette(W(i), Mt)
       (ii) Request d(t)
       (iii) Gap(i) = Gap(i) ∪ (x(t), d(t))
       (iv) Mt = 0, cnt2 = cnt2 + 1
   (b) Repeat Step 3a WHILE cnt2 < Gap
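The logic of Algorithm 1 can be summarized in a short Python sketch. The record/label representations and helper names below are assumptions made for illustration only; in particular, the roulette weighting is one plausible reading of "inversely proportional to DS(i − 1) class content", not the authors' code.

```python
import random
from collections import Counter

def biased_sampling(window, predictions, ds_prev_labels, gap):
    """Sketch of Algorithm 1: pick up to `gap` records from W(i) for labelling.

    window         -- records x(t) in the current non-overlapping window
    predictions    -- champion predictions y(t), one per record
    ds_prev_labels -- class labels of the records currently in DS(i-1)
    Returns indices of records whose true label should be requested.
    """
    counts = Counter(ds_prev_labels)
    ideal = len(ds_prev_labels) / max(len(counts), 1)
    over = {k for k, n in counts.items() if n >= ideal}        # over-represented set c

    marked = [t for t, y in enumerate(predictions) if y not in over]   # Step 1
    marked_set = set(marked)
    unmarked = [t for t in range(len(window)) if t not in marked_set]

    if len(marked) <= gap:                                     # Step 2: take all marked,
        chosen = list(marked)                                  # then top up uniformly
        chosen += random.sample(unmarked, min(gap - len(chosen), len(unmarked)))
    else:                                                      # Step 3: roulette wheel over
        chosen, pool = [], list(marked)                        # the marked records
        weights = [1.0 / (1 + counts[predictions[t]]) for t in pool]
        while len(chosen) < gap and pool:
            j = random.choices(range(len(pool)), weights=weights, k=1)[0]
            chosen.append(pool.pop(j))
            weights.pop(j)
    return chosen
```

Under a cold start (no champion predictions yet) the uniform baseline would be used instead, as noted above.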


3.4. Archiving policy

The overall framework of Fig. 1 utilizes a Data Subset of a finite size (|DS|). Thus, once the Data Subset is full, Gap records are identified for replacement at each non-overlapping window location, W(i). We will again assume two possible algorithms for this purpose. The base case is referred to as the uniform archiving policy, in which Gap records are identified for removal from DS(i) with uniform probability. A biased archiving policy is defined with the objective of incrementally (re)balancing the representation of records per class in the Data Subset as the stream progresses. Algorithm 2 details this process. Records already in the Data Subset are first grouped by class and ranked by age (Step 1). A count, ck, is made of the number of instances per class present in the Data Subset (Step 2a). The number of records for deletion per class is identified (Step 3a), relative to an ideal distribution of Data Subset capacity, i.e. (|DS| − Gap)/C. Under-represented classes are not targeted for record deletion (Step 3b), hence can accumulate additional records. Step 4a targets the over-represented classes (relative to the ideal distribution) for having records deleted, oldest records first. Gap instances have now been deleted, so the final step adds the content of Gap(i) to the remaining Data Subset content, producing DS(i).

Algorithm 2. Biased Archiving Policy. Let aj be the 'age' of record j in DS(i − 1), where this is a scalar count for how long record j has appeared in the Data Subset. ck is a count of the number of class k instances in DS(i − 1). C is the number of different classes currently represented in DS(i − 1). T is the total number of instances removed from the over-represented classes.

Input: Set of labelled instances Gap(i) and the last available Data Subset, DS(i − 1)
Initial state: T = 0

1. For all j ∈ DS(i − 1) identify class and rank w.r.t. record 'age' aj;
2. For each class k present in DS(i − 1)
   (a) Count the number of records with class k in DS(i − 1). Let ck be the count for class k.
3. For k = 1 to C
   (a) removek = ck − (|DS| − Gap)/C
   (b) IF removek > 0 THEN T = T + removek ELSE removek = 0
4. For k = 1 to C
   (a) Delete the oldest Gap × (removek / T) records of class k from DS(i − 1)
5. DS(i) ← Add(DS(i − 1), Gap(i))
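A corresponding minimal Python sketch of Algorithm 2 is given below. The (x, label, age) record representation and the rounding of per-class deletions are illustrative assumptions rather than the authors' implementation.

```python
from collections import defaultdict

def biased_archiving(ds_prev, gap_records, gap):
    """Sketch of Algorithm 2: replace `gap` records of DS(i-1) with Gap(i).

    ds_prev     -- list of (x, label, age) tuples currently in the Data Subset
    gap_records -- newly labelled (x, label) pairs from the sampling policy
    """
    by_class = defaultdict(list)                       # Step 1: group by class
    for rec in ds_prev:
        by_class[rec[1]].append(rec)

    C = len(by_class)
    ideal = (len(ds_prev) - gap) / C                   # ideal per-class capacity
    remove = {k: max(len(v) - ideal, 0.0)              # Steps 2-3: only over-represented
              for k, v in by_class.items()}            # classes are targeted
    T = sum(remove.values())

    survivors = []
    for k, recs in by_class.items():                   # Step 4: delete oldest first
        n_del = min(int(round(gap * remove[k] / T)), len(recs)) if T else 0
        recs = sorted(recs, key=lambda r: r[2])        # smallest age (youngest) first
        survivors.extend(recs[:len(recs) - n_del])

    # Step 5: DS(i) = remaining records plus Gap(i), entering with age zero.
    # (Rounding may leave DS(i) a record or two off the nominal size.)
    return survivors + [(x, label, 0) for x, label in gap_records]
```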

3.5. Identifying the champion classifier

A champion classifier, gp*, is identified by applying a robust performance metric to evaluate the operation of the population under the current content of the Data Subset, DS(i). This is the only source of data with true label information. The performance metric assumed takes the form of the multi-class Detection rate (DR), or

$DR = \frac{1}{C}\sum_{j=1}^{C} DR_j \quad \text{and} \quad DR_j = \frac{tp_j}{tp_j + fn_j}$    (1)

where C is the count of classes present in DS(i); tpj and fnj are the counts of true positives and false negatives for class j, again with respect to the class distribution present in DS(i). Note that the champion classifier could potentially change as a function of each change to the Data Subset content (i.e., as a function of DS index i), but never more often than this (also reflected in the distinction between incremental and online operation). However, once the first champion classifier is identified, anytime operation is uninterrupted as a champion classifier is thereafter always available.

4. Evaluation methodology

4.1. Datasets

The CTU dataset [38] includes data labelled as one of four general categories: background, normal, Botnet, command and control (C & C). The majority of the data present in the dataset takes the form of background traffic (Table 1), where this represents network traffic collected from a real-world network. In the past it has been the characterization of the normal behaviour that has been the most difficult to accurately express, leading to the use of benchmark datasets that are much easier to solve than in practice; see for example the discussion in [39]. However, it is then actually difficult to define labels for normal and attack behaviour because the so-called background traffic may actually consist of attack traffic. The approach currently adopted by the network security community is therefore to label all traffic as 'background' and apply filters characterizing definitively known examples of normal behaviour [40,41]. Any data from the background traffic matched by the normal filters is labelled as normal; the remainder is labelled as background. Finally, attack data is explicitly created using (Botnet) attack tools from specific IP addresses on a virtual network. This means that any data explicitly labelled as attack is definitely attack data, although some amount of the background traffic data could also be so. Moreover, in the specific case of the CTU dataset, data associated with the operation of the Botnet master is explicitly distinguished from that of data associated with Botnet slaves (labelled as C & C and Botnet respectively). The Botnet master(s) control the operation of the slaves. The slaves actually execute attacks, with the objective of hiding the identity of the master, whereas from the perspective of detection, the identification of the master(s) is the most important.

Table 1
Generic properties of the streaming datasets. N is the cardinality, and k is the number of classes present over the entire duration of the stream. Each dataset has D = 8 flow attributes (Direction, DToS, Duration, Protocol, Source Bytes, SToS, Total packets, Total Bytes). A combined Botnet/C & C label was assumed in the case of datasets in which the C & C class represents less than 0.01% of the original dataset (Capture 3, 4, 10, 11, 12).

Dataset | N | k | ≈ Class Distribution (%)
Capture 1 | 2,824,637 | 4 | [97.47, 1.08, 1.44, 0.01]
Capture 2 | 1,808,123 | 4 | [98.33, 0.5, 1.12, 0.04]
Capture 3 | 4,710,638 | 3 | [96.95, 2.48, 0.57]
Capture 4 | 1,121,077 | 3 | [97.52, 2.25, 0.23]
Capture 5 | 129,833 | 4 | [95.7, 3.6, 0.68, 0.02]
Capture 6 | 558,920 | 4 | [97.83, 1.34, 0.79, 0.04]
Capture 7 | 114,078 | 4 | [98.47, 1.47, 0.03, 0.02]
Capture 8 | 2,954,230 | 4 | [97.33, 2.47, 0.17, 0.04]
Capture 9 | 2,087,508 | 4 | [89.7, 1.44, 8.72, 0.14]
Capture 10 | 1,309,792 | 3 | [90.67, 1.21, 8.12]
Capture 11 | 107,251 | 3 | [89.85, 2.54, 7.61]
Capture 12 | 325,472 | 3 | [96.99, 2.34, 0.657]
Capture 13 | 1,925,150 | 4 | [96.26, 1.66, 2.05, 0.03]

All thirteen datasets will be employed from the CTU-13 network security dataset collection [38]; hereafter referred to as Capture 1 through 13.4 The data is described by 12 'flow' statistics obtained by the Argus network flow generator.5 However, out of these 12 features, we did not employ IP addresses and port numbers as many recent network applications (Voice over IP, social media and network based games) can dynamically change their port addresses based on the blocked/unblocked port combinations. Moreover, IP addresses can be spoofed by attackers for malicious intentions or can be hidden by proxies for legitimate reasons to protect the privacy of users. Thus, any classifier relying on these attributes may not generalize well in real-world applications.

4 https://mcfp.felk.cvut.cz/publicDatasets/CTU-13-Dataset/CTU-13-Dataset.tar.bz2.
5 http://qosient.com/argus/.


Specific properties that make these capture files of particular interest from an application perspective include:

Captures 1, 2, 9: consist of instances of the Neris Botnet, hence traffic content pertaining to Internet Relay Chat (IRC), spam, click fraud and scanning activities is explicitly present.
Captures 5, 13: consist of instances of the Virut Botnet, hence traffic content pertaining to Distributed Denial of Service (DDoS), spam, fraud and data theft attacks is explicitly present.
Capture 6: consists of instances of the Menti Botnet, hence traffic content pertaining to identity theft and login credentials is explicitly present.
Capture 7: consists of instances of the Sogou Botnet, hence traffic content pertaining to spam and popup adware to collect personal information is present.
Capture 8: consists of instances of the Murlo Botnet, hence traffic content pertaining to the use of scanning activities and proprietary mechanisms for establishing C & C.
Captures 3, 4, 10, 11: consist of instances of the Rbot Botnet, hence traffic content pertaining to IRC and Internet Control Message Protocol (ICMP) based DDoS attacks is explicitly present.
Capture 12: consists of instances of the NSIS.ay Botnet, hence traffic content pertaining to identity theft and login credentials by using extra payloads is explicitly present.

Table 1 emphasizes that the data is exceptionally imbalanced; moreover, all but the major class also appear and disappear repeatedly throughout the stream (Appendix B).

4.2. Comparator algorithms

Section 2.1 identified that the work of Žliobaitė et al. addresses operation under label budgets in which changes to the data distribution are expected [8]. Moreover, their benchmarking study included one dataset that involved more than two classes. Conversely, none of the previous works explicitly associated with (multi-class) imbalanced data classification under streaming data were designed to address operation under label budgets (Section 2). The specific cases of the Split and Variable uncertainty policy under Naive Bayes and Hoeffding tree classifiers will be assumed from Žliobaitė et al. (Table 2), where these represented the strongest algorithms from the original study [8]. The Variable uncertainty policy employs a threshold to determine which records from the stream to request labels for. Specifically, under the Variable uncertainty policy, the confidence of the classifier is used to provide the basis for identifying which records from the stream have their true label requested. Under the Split policy two models are concurrently maintained. One model operates under the Variable uncertainty policy, the second requests labels under a uniform probability sampling policy. This means that records are selected for true label requests using both model uncertainty and uniform sampling. The latter may aid detecting changes due to concept drift. The MoA software suite6 provides the implementation for these algorithms, where modifications were made to provide reporting using the performance metric from Section 4.3.

Table 2
Best case configurations of comparator algorithms for operation under drifting streaming data with label budgets [8].

Classifier | Policy
Naive Bayes | Split
Naive Bayes | Variable uncertainty
Hoeffding tree | Split
Hoeffding tree | Variable uncertainty

4.3. Performance metrics

Performance metrics for streaming data classification tasks generally take the form of one of three classes of metric [23]. Prequential error metrics characterize the goal as error minimization in which the error of older instances is subject to discounting/forgetting.7 The principal drawback in assuming such a metric is that when the data is imbalanced, a model that labelled all the data as the most frequently occurring class (the major class) would appear to be the most 'accurate'. Under this application the major class always corresponds to the 'background' class. Thus, labelling all the data as the major class would result in accuracies of between 98% and 89% (Table 1), whereas this represents a completely degenerate solution. Instead we want to quantify the ability to operate under a multi-class setting. Measures of (label) autocorrelation characterize the performance of the classifier using the ability to out-perform a one-bit predictor that operates on the label space alone [42]. That is to say, if the distribution of labels across the stream is not well mixed, then periods will exist over which consecutive records carry the same label. Such sequences can be 'predicted' with low rates of misprediction by a one/two-bit finite state machine.8 Although better than error minimization, such a metric does not explicitly quantify the ability of a classifier to operate under a multi-class setting (i.e., the distribution of labels is still most likely to be dominated by the behaviour of the most frequent classes). Rate based metrics incrementally construct the confusion matrix as a function of progress through the stream [3,28]. This then leads to characterizing performance using any number of scalar (rate based) performance metrics, e.g. F-measure, Detection rate, Precision. Moreover, such metrics explicitly quantify performance under multi-class settings [44].

Given the ease with which rate based metrics may explicitly quantify multi-class performance, we will adopt such a metric for this work. One approach might be to assume an AUC metric (the area under the curve characterizing the interaction between Detection rate and false positive rate). However, such a metric is also limited to two class scenarios and requires complete re-estimation over a sliding window for each time step [45].9 Instead we will utilize Detection rate (DR) as independently computed for each class. Such a metric can be estimated incrementally, and visualized with time on the independent (x) axis and the class-wise performance metric on the dependent (y) axis. Thus, assuming an online estimation of the Detection rate for the y-axis (the champion classifier always has to predict the label before any updates to the model), we can estimate the overall DR as the average across all classes [23]. Specifically, let the streaming estimation of Detection rate take the following form:

$DR_c(t) = \frac{tp_c(t)}{tp_c(t) + fn_c(t)}$    (2)

where t is the record index, and tpc(t), fnc(t) are the respective online counts of true positives and false negatives for class c, i.e. up to this point in the stream.

6 http://moa.cms.waikato.ac.nz.
7 Formulations using recall rate on the single smallest class have also been proposed [14].
8 See for example, the widespread use of one/two-bit finite state machines in branch taken/not taken sequences for conditional statements associated with loop constructs [43].
9 The alternative would be to re-estimate the entire AUC for each time step, limiting its application to short streams [13].
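To make the rate-based metric concrete, the sketch below maintains the per-class counts behind Eq. (2) incrementally and averages them into the multi-class Detection rate and its stream-wide summary introduced next (Eqs. (3) and (4)). It is an illustrative reading of the definitions (assuming classes are indexed 0 … C*−1), not the evaluation code used in the experiments.

```python
from collections import defaultdict

class StreamingDetectionRate:
    """Incremental class-wise Detection rate (Eq. (2)), its multi-class
    average (Eq. (3)), and the stream-wide AUC summary (Eq. (4))."""

    def __init__(self, num_classes):
        self.C_star = num_classes            # true class count C* (classes 0..C*-1 assumed)
        self.tp = defaultdict(int)
        self.fn = defaultdict(int)
        self.auc_sum = 0.0
        self.t = 0

    def update(self, predicted, true_label):
        # Counts are updated after the champion has made its prediction.
        if predicted == true_label:
            self.tp[true_label] += 1
        else:
            self.fn[true_label] += 1
        self.auc_sum += self.multi_class_dr()
        self.t += 1

    def class_dr(self, c):                   # Eq. (2)
        seen = self.tp[c] + self.fn[c]
        return self.tp[c] / seen if seen else 0.0

    def multi_class_dr(self):                # Eq. (3)
        return sum(self.class_dr(c) for c in range(self.C_star)) / self.C_star

    def auc(self):                           # Eq. (4): mean of DR(t) over the stream
        return self.auc_sum / self.t if self.t else 0.0
```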



The multi-class Detection rate now has the form:

$DR(t) = \frac{1}{C^*} \sum_{c=[1,\ldots,C^*]} DR_c(t)$    (3)

where (for continuity) we assume that C* reflects the true count of the total number of classes encountered over the course of the stream.10 Hence, the multi-class Detection rate is a function of the ability to detect each class. Finally, we note that although streaming data sources are continuous, for benchmarking purposes, finite length sequences are assumed. Thus, the sum of the multi-class Detection rate metric as estimated across the duration of the stream can be quantified by a single scalar 'area under the curve' metric:

$AUC = \frac{1}{s_{max}} \sum_{t=0}^{s_{max}-1} DR(t)$    (4)

where smax is the cardinality of the stream.

10 If the true number of classes is unknown, a stepwise effect appears in the metric each time a previously unseen class is encountered.

4.4. Experimental design and parameterization

Sections 3.3 and 3.4 introduced biased approaches for defining sampling and archiving policies respectively. We will compare these with our previous Pareto policy for archiving [28], that is to say, five configurations of the Streaming GP algorithm in total (Table 3). All five will be benchmarked in order to identify their relative contributions to the overall performance. Parameterization of GP is unchanged with respect to previous work (e.g. [29,30]) and summarized in Table 4.

The label budget β defines what proportion of the non-overlapping window, W(i), is queried for true class label information (Section 3.2). Naturally, this represents a significant 'cost' in that a user/expert is then required to provide labels for each query. On the other hand, the lower the label budget, the higher the likelihood that changes in the process generating the stream data will be missed and/or the minor classes will be missed entirely. For example, if a label budget of 5% is assumed and a minor class appears with a frequency of 1%, then the 'raw' chance of a uniform sampling scheme actually requesting a label for such an instance is 0.05%. Conversely, if the algorithm for sampling or archiving records is more 'intelligent', the performance of the overall framework for classifying streaming data will not be dominated by the label budget alone. Thus, for this study we benchmark performance for each dataset using three different label budgets, or β = {0.005, 0.01, 0.05}, in order to gain some insight as to how quickly performance decreases as label rates decrease. Given that the raw class distributions for minor classes are so low (Table 1), this represents a challenging streaming classification task. Table 5 summarizes the impact of each label budget on the size of the non-overlapping window, W(i). Note that the champion from window W(i − 1) provides predictions for the entire content of window W(i) before it can be updated (i.e., incremental operation). Thus, lower label budgets are also synonymous with delays to true label information.

Table 3
Streaming GP configurations. Uniform implies identification of either sampling or archiving data using uniform sampling (Sections 3.3 and 3.4 respectively). Likewise biased denotes either the sampling or archiving data under the corresponding biased algorithms (Algorithms 1 and 2 respectively). Pareto was our earlier preferred configuration for Archiving in which Pareto archiving prioritized records for removal from the Data Subset [28].

Model | Sampling Policy | Archiving Policy
Uniform (Rnd) | Uniform | Uniform
Pareto | Uniform | Pareto
Archive | Uniform | Biased
Sample | Biased | Uniform
Both | Biased | Biased

Table 4
GP Parameters. Mutation rates control the rate of adding/deleting programs or changing an action. Gap (Tgap) denotes the number of records from the Data Subset (teams) deleted at each location of the non-overlapping window. For each Data Subset update, τ generations are performed.

Parameter | Value
Data Subset size (DS) | 120
DS gap size (Gap) | 20
GP gap size (Tgap) | 20
Team pop. size (Psize) | 120
Max. programs per team (ω) | 20
Prob. Program deletion (pd) | 0.3
Prob. Program addition (pa) | 0.3
Prob. Action mutation (μ) | 0.1
Generations per DS update (τ) | 5

Table 5
Stream dataset parameters. Label Budget (β) is defined as a function of the window size W(i), where for each non-overlapping window location there can only be Gap size (20) samples.

Label Budget (β) | W(i) cardinality
0.5% | 4000
1.0% | 2000
5.0% | 400

5. Results

In Section 5.1, the overall performance of each Streaming GP configuration and the MoA comparator algorithms is quantified over the thirteen Botnet datasets (Table 1). A subset of datasets identified by the initial analysis will then be used to provide more insight into what the distinguishing factors are in the operation of the various algorithms. Specifically, Section 5.2 uses a visualization of Detection rate as estimated through the stream to provide insight into the dynamic behaviour of the algorithms. Section 5.3 characterizes how the different archive–sampling policies of Stream GP affect the distribution of records retained in the Data Subset. Section 5.4 repeats the review from Section 5.2, but this time solely using the ability of the stream classifier to detect the least frequent class. In doing so, we draw attention to the potential to detect Botnet command and control signals. Success in this task amounts to providing an early warning of Botnet activity. Finally, Section 5.5 quantifies the computational time to perform fitness evaluation and execute the champion individual.

5.1. Overall performance evaluation

There are a total of four Streaming GP configurations (Table 3) and four MoA comparator algorithms (Section 4.2), where in the case of MoA, these represent the strongest algorithm/stream sampling policies identified by Žliobaitė et al. for operation under label budgets [8]. All eight algorithms are run 20 times per dataset and the multi-class streaming AUC metric (Eq. (4) of Section 4.3) used to summarize the overall performance. Performance is then ranked using the median AUC. The Friedman non-parametric repeated measures statistic (and Nemenyi post hoc test) are then used to identify the general trends of the best algorithms/most challenging datasets. Such an approach does not make assumptions regarding the underlying distributions of the performance data, and represents the preferred approach for conducting comparisons between multiple algorithms/datasets [46,44]. The evaluation is repeated using each of the three label budgets (Table 5).



Table 6
Algorithm ranks w.r.t. streaming AUC metric under a 5% label budget. Bracketed entries represent median AUC values to 1 decimal place. Naive Bayes (NB) and Hoeffding tree classifiers (from MoA) appear with either 'split' or 'variable' sampling policies. Table 3 declares the 4 sampling/replacement policies for stream SBB. Rj denotes the average rank across all datasets.

Data Set | SBB Rnd | SBB Pareto | SBB Sample | SBB Archive | SBB Both | Hoeffding split | Hoeffding variable | NB split | NB variable
Capture 1 | 6 (32.4) | 5 (33.3) | 4 (36.3) | 1 (56.8) | 2 (51.5) | 8.5 (25) | 7 (26.7) | 8.5 (25) | 3 (43.5)
Capture 2 | 6 (36.8) | 4 (43.9) | 5 (41.5) | 1 (68.1) | 2 (67.7) | 8.5 (25) | 7 (36.5) | 8.5 (25) | 3 (54.8)
Capture 3 | 5 (65.9) | 4 (72.9) | 3 (76.0) | 2 (81.5) | 1 (83.5) | 8.5 (33.3) | 7 (55.5) | 8.5 (33.3) | 6 (59.5)
Capture 4 | 5 (45.5) | 7 (41.4) | 4 (51.4) | 1 (62.7) | 2 (60.8) | 8.5 (33.3) | 6 (42.2) | 8.5 (33.3) | 3 (55.2)
Capture 5 | 4 (29.3) | 6 (27.9) | 5 (28.8) | 1 (36.1) | 2 (34.5) | 8.5 (25) | 7 (26.3) | 8.5 (25) | 3 (30.7)
Capture 6 | 6 (35.5) | 5 (41.4) | 4 (42.5) | 1 (65.8) | 2 (64.6) | 8.5 (25) | 7 (25.5) | 8.5 (25) | 3 (48.6)
Capture 7 | 4 (29.8) | 6 (28.0) | 5 (28.4) | 2.5 (29.8) | 2.5 (29.8) | 8.5 (25) | 7 (25.9) | 8.5 (25) | 1 (32.5)
Capture 8 | 5 (39.9) | 6 (32.8) | 4 (46.2) | 1 (78.0) | 2 (76.2) | 8.5 (25) | 7 (28.1) | 8.5 (25) | 3 (57.9)
Capture 9 | 5 (34.8) | 6 (32.8) | 4 (35.8) | 1 (54.1) | 2 (48.5) | 8.5 (25) | 7 (26.5) | 8.5 (25) | 3 (45.4)
Capture 10 | 6 (57.6) | 3 (62.4) | 4 (61.6) | 1 (70.3) | 2 (68.7) | 8.5 (25) | 7 (54.6) | 8.5 (25) | 5 (58.7)
Capture 11 | 4 (48.7) | 3 (52.2) | 5 (47.9) | 1 (54.8) | 2 (52.4) | 8.5 (25) | 7 (42.2) | 8.5 (25) | 6 (47.6)
Capture 12 | 4 (41.6) | 6 (37.9) | 5 (40.2) | 1 (52.5) | 3 (47.2) | 8.5 (25) | 7 (36.0) | 8.5 (25) | 2 (48.3)
Capture 13 | 5 (41.8) | 6 (38.8) | 3 (53.9) | 1 (70.2) | 2 (67.4) | 8.5 (25) | 7 (27.3) | 8.5 (25) | 4 (42.9)
Rj | 5.0 | 5.15 | 4.23 | 1.19 | 2.04 | 8.5 | 6.92 | 8.5 | 3.46

Table 7
Algorithm ranks w.r.t. streaming AUC metric under a 1% label budget. Bracketed entries represent median AUC values to 1 decimal place. Naive Bayes (NB) and Hoeffding tree classifiers (from MoA) appear with either 'split' or 'variable' sampling policies. Table 3 declares the 4 sampling/replacement policies for stream SBB. Rj denotes the average rank across all datasets.

Data Set | SBB Rnd | SBB Pareto | SBB Sample | SBB Archive | SBB Both | Hoeffding split | Hoeffding variable | NB split | NB variable
Capture 1 | 5 (32.9) | 6 (31.0) | 4 (33.3) | 1 (48.3) | 2 (45.3) | 8.5 (25) | 7 (25.3) | 8.5 (25) | 3 (37.5)
Capture 2 | 6 (37.2) | 4 (44.1) | 5 (39.2) | 1 (56.6) | 2 (54.6) | 8.5 (25) | 7 (35.3) | 8.5 (25) | 3 (48.0)
Capture 3 | 5 (64.2) | 3 (71.8) | 4 (70.8) | 1 (78.4) | 2 (75.8) | 8.5 (33.3) | 7 (47.4) | 8.5 (33.3) | 6 (61.8)
Capture 4 | 5 (43.5) | 6 (38.6) | 4 (44.2) | 1 (55.8) | 2 (52.3) | 8.5 (33.3) | 7 (33.4) | 8.5 (33.3) | 3 (51.0)
Capture 5 | 4 (27.1) | 6 (26.5) | 5 (26.7) | 1 (29.2) | 3 (27.5) | 8.5 (25) | 7 (24.9) | 8.5 (25) | 2 (29.1)
Capture 6 | 6 (33.3) | 3 (42.9) | 4 (37.9) | 1 (50.4) | 2 (43.8) | 8.5 (25) | 7 (25.0) | 8.5 (25) | 5 (36.5)
Capture 7 | 3 (25.5) | 9 (24.3) | 8 (24.7) | 4 (25.4) | 5 (25.3) | 6.5 (25) | 1 (25.9) | 6.5 (25) | 2 (25.56)
Capture 8 | 4 (34.4) | 5 (33.4) | 6 (33.1) | 2 (60.9) | 1 (64.3) | 8.5 (25) | 7 (26.0) | 8.5 (25) | 3 (52.3)
Capture 9 | 4 (33.9) | 6 (30.7) | 5 (32.9) | 1 (46.0) | 2 (42.9) | 8.5 (25) | 7 (25.1) | 8.5 (25) | 3 (42.7)
Capture 10 | 5 (57.1) | 3 (61.0) | 4 (57.9) | 1 (64.5) | 2 (63.1) | 8.5 (33.3) | 7 (54.0) | 8.5 (33.3) | 6 (56.3)
Capture 11 | 2 (45.7) | 3 (45.9) | 6 (41.9) | 1 (46.6) | 5 (43.1) | 8.5 (33.3) | 7 (41.3) | 8.5 (33.3) | 4 (43.8)
Capture 12 | 5 (39.6) | 4 (39.9) | 6 (36.4) | 1 (43.6) | 3 (40.4) | 8.5 (33.3) | 7 (34.1) | 8.5 (33.3) | 2 (43.1)
Capture 13 | 4 (39.5) | 5 (37.2) | 3 (43.8) | 1 (57.1) | 2 (55.1) | 8.5 (25) | 7 (25.1) | 8.5 (25) | 6 (33.2)
Rj | 4.46 | 4.85 | 4.92 | 1.23 | 2.23 | 8.35 | 6.54 | 8.35 | 3.69

Table 8
Algorithm ranks w.r.t. streaming AUC metric under a 0.5% label budget. Naive Bayes (NB) and Hoeffding tree classifiers (from MoA) appear with either 'split' or 'variable' sampling policies. Table 3 declares the 4 sampling/replacement policies for stream SBB. Rj denotes the average rank across all datasets.

Data Set | SBB Rnd | SBB Pareto | SBB Sample | SBB Archive | SBB Both | Hoeffding split | Hoeffding variable | NB split | NB variable
Capture 1 | 5.5 (32.4) | 5.5 (32.4) | 4 (33.0) | 1 (45.2) | 2 (41.0) | 8.5 (25) | 7 (25.5) | 8.5 (25) | 3 (36.9)
Capture 2 | 5 (38.0) | 4 (38.1) | 6 (37.4) | 1 (48.8) | 2 (46.0) | 8.5 (25) | 7 (34.5) | 8.5 (25) | 3 (40.5)
Capture 3 | 5 (62.3) | 4 (62.5) | 3 (67.9) | 1 (76.2) | 2 (73.3) | 8.5 (33.3) | 7 (48.1) | 8.5 (33.3) | 6 (59.2)
Capture 4 | 6 (45.9) | 5 (46.2) | 4 (46.7) | 1 (59.6) | 2 (58.7) | 8.5 (33.3) | 7 (33.5) | 8.5 (33.3) | 3 (49.5)
Capture 5 | 2 (27.2) | 6 (26.5) | 9 (25.0) | 3 (27.1) | 5 (26.2) | 7.5 (25) | 4 (25.3) | 7.5 (25) | 1 (28.4)
Capture 6 | 5 (33.4) | 6 (32.1) | 4 (34.7) | 1 (44.5) | 2 (41.6) | 8.5 (25) | 7 (26.0) | 8.5 (25) | 3 (34.7)
Capture 7 | 3 (26.0) | 9 (24.7) | 8 (24.8) | 4 (25.2) | 7 (25.0) | 5.5 (25) | 2 (26.5) | 5.5 (25) | 1 (27.0)
Capture 8 | 4 (32.4) | 5 (32.2) | 6 (30.3) | 1 (55.7) | 2 (54.1) | 8.5 (25) | 7 (26.0) | 8.5 (25) | 3 (43.9)
Capture 9 | 5 (33.4) | 4 (33.5) | 6 (32.2) | 1 (42.5) | 2 (40.6) | 8.5 (25) | 7 (25.0) | 8.5 (25) | 3 (39.6)
Capture 10 | 3 (56.0) | 5 (55.9) | 4 (56.0) | 1 (61.8) | 2 (59.7) | 8.5 (33.3) | 7 (53.7) | 8.5 (33.3) | 6 (55.9)
Capture 11 | 5 (42.7) | 4 (43.1) | 7 (37.7) | 3 (44.9) | 6 (37.9) | 8.5 (33.3) | 2 (45.4) | 8.5 (33.3) | 1 (45.5)
Capture 12 | 3.5 (37.2) | 3.5 (37.2) | 6 (37.9) | 2 (38.5) | 5 (37.1) | 8.5 (33.3) | 7 (34.2) | 8.5 (33.3) | 1 (40.2)
Capture 13 | 5 (38.3) | 4 (38.4) | 3 (41.5) | 1 (52.7) | 2 (49.5) | 8.5 (25) | 7 (25.0) | 8.5 (25) | 6 (34.8)
Rj | 4.38 | 5.0 | 5.38 | 1.62 | 3.15 | 8.19 | 6.0 | 8.19 | 3.08


Tables 6–8 summarize results using the ranking of each of the 8 algorithms under the AUC metric. The last row reports the average rank (Rj), where this forms the basis for the Friedman test, as follows:

$\chi_F^2 = \frac{12N}{k(k+1)} \left[ \sum_j R_j^2 - \frac{k(k+1)^2}{4} \right]$    (5)

where N = 13 is the number of datasets and k = 9 is the number of algorithms. The null hypothesis is tested by mapping χF² into the F-distribution with k − 1 and (k − 1)(N − 1) degrees of freedom using [46]:

$F_F = \frac{(N-1)\chi_F^2}{N(k-1) - \chi_F^2}$    (6)

In each case, the null-hypothesis (i.e. that the ranks are random) is comfortably rejected (Table 9).

Table 9
Result of Friedman test χF² and corresponding value for F-distribution FF. The critical value of F(7, 84) for α = 0.01 is 3.953, so the null-hypothesis is rejected in each case.

Label budget | 5% | 1% | 0.5%
χF² | 94.4 | 80.6 | 70.1
FF | 117.7 | 41.3 | 24.9

The Nemenyi post hoc test may now be applied for establishing what groups of algorithms have equivalent performance [46,44]. Specifically, if the average algorithm ranks are within the critical difference of $CD = q_\alpha \sqrt{\frac{k(k+1)}{6N}}$, then they are deemed equivalent.
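As a worked illustration of the procedure above (Eqs. (5) and (6) and the Nemenyi critical difference), the following Python sketch reproduces the computation from a matrix of per-dataset ranks such as those reported in Tables 6–8. The value of qα must be taken from a studentized range table and is therefore left as an input; the sketch is illustrative, not the analysis script used in the paper.

```python
import math

def friedman_statistics(ranks):
    """ranks: list of N per-dataset lists, each holding the rank assigned to
    the k algorithms on that dataset (as in the rows of Tables 6-8)."""
    N, k = len(ranks), len(ranks[0])
    R = [sum(row[j] for row in ranks) / N for j in range(k)]      # average ranks R_j
    chi2_f = (12 * N) / (k * (k + 1)) * (sum(r * r for r in R)
                                         - k * (k + 1) ** 2 / 4)  # Eq. (5)
    f_f = (N - 1) * chi2_f / (N * (k - 1) - chi2_f)               # Eq. (6)
    return R, chi2_f, f_f

def nemenyi_cd(q_alpha, k, N):
    # Algorithms whose average ranks differ by less than CD are equivalent.
    return q_alpha * math.sqrt(k * (k + 1) / (6 * N))
```

Plugging the average ranks of Table 6 into Eq. (5) with N = 13 and k = 9 gives χF² ≈ 94, in line with the 5% column of Table 9.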

qα = 0.1 for which the critical difference is 2.855. This identifies a set of top performing algorithms common to all three label budgets as: SBB–Archive, SBB–Both and NB–variable policies. Moreover, SBB–Archive was by far the most consistent performing model irrespective of dataset or label budget with SBB–Both generally appearing as the runner up. From the perspective of Botnet detection in general, we note that Captures 1, 2, 8, 9 are dominated by port scanning activities. Typically, access to port number and IP address information is assumed for detecting port scanning (so limiting the generality of the detector). Conversely, we are able to detect these activities without using source / destination port numbers and IP addresses at an overall AUC of between 54% and 78% (dropping to no less than 42% under the 0.5% label budget). Captures 3, 10 and 11 represent protocol based attacks so it would be expected that these behavioural patterns can be identified more easily with flow features. Indeed, with Captures 3 and 10, this appears to be the case (high overall AUC maintained irrespective of the label budget), whereas Capture 11 always returned overall AUC in the region of 54% to 44%. Finally, the content of Captures 5, 6 and 7 represent payload attacks (such as windows vulnerabilities), thus detecting these types of attack with flow features alone is particularly challenging. In two cases (Captures 5 and 7)

Tables 6–8 summarize results using the ranking of each of the 8 algorithms under the AUC metric. The last row reports the average rank (Rj ), where this forms the basis for the Friedman test, as follows:

χF2 =

⎡ ⎤ 12N ⎢ k (k + 1)2 ⎥ (Rj2 ) − ∑ ⎥⎦ k (k + 1) ⎢⎣ j 4

(5)

where N = 13 is the number of datasets and k = 9 is the number of algorithms. The null hypothesis is tested by mapping χF2 into the Fdistribution with k − 1 and (k − 1)(N − 1) degrees of freedom using [46]:

FF =

(N − 1) χF2 N (k − 1) − χF2

(6)

In each case, the null-hypothesis (i.e. that the ranks are random) is comfortably rejected (Table 9). The Nemenyi post hoc test may now be applied for establishing what groups of algorithms

Fig. 2. Capture 5 class-wise Detection rate through the stream. 5% versus 0.5% label budget. Class 3 represents Botnet and Class 4 Botnet C & C.

9

Swarm and Evolutionary Computation xxx (xxxx) xxx–xxx

S. Khanchi et al.

Fig. 3. Capture 6 class-wise Detection rate through the stream. 5% versus 0.5% label budget. Class 3 represents Botnet and Class 4 Botnet C & C.

5.2. Detection Rate Dynamics: comparing the best streaming classifiers

Performance evaluation will now consider the dynamic properties of operation by reviewing how the Detection rate of each class varies during the course of the stream. This is important because the properties of the stream themselves vary over time; thus, unlike an offline formulation of learning, the models develop and interact with the stream content as it unfolds. In doing so, we also provide some insight into whether results with similar overall AUC also exhibit similar preferences in class detection. Space precludes plotting the Detection rate for every algorithm under every dataset. With this in mind, we concentrate on: (i) Captures 5 and 6 (payload attacks) and (ii) Captures 8 and 9 (port scanning), with the top two performing configurations of SBB and NB: SBB–Archive and NB–Variable.

Fig. 2 summarizes the Detection rate for each class over the course of the stream (Eq. (2)), as averaged over the 20 runs per algorithm, in the specific case of the Capture 5 dataset. Subplots 2(a) and 2(b) illustrate how class-wise Detection rate changes under the 5% label budget. It is apparent that the NB–Variable model does not detect class 4 (Botnet C & C) at all. Moreover, gains in detecting the minor classes are generally at the expense of reduced detection in the major class.12 Conversely, SBB–Archive was ultimately able to detect the minor classes better (Botnet and Botnet C & C) without losses in the detection of the major class. Performance under the 0.5% label budget is summarized by Subplots 2(c) and 2(d). The NB–Variable framework continues to place most emphasis on detecting the major class at the expense of the minor classes; conversely, SBB–Archive detects the minor classes earlier and even manages to continue to detect class 4, albeit at a much lower rate than under the 5% label budget. Thus, although the NB–Variable model has a higher ranking for the 0.5% label budget under Capture 5 (Table 8), this appears to be solely due to detection of the major class. Similar observations hold for the Capture 7 and 11 datasets.

Fig. 3 illustrates the Detection rate for each class over the course of the stream in the specific case of the Capture 6 dataset.

11 Equivalent to labelling all the data as a single class, or AUC = DR(t)/C* = 0.25, where DRc(t) = 1 occurs for one class alone and DRi≠c(t) = 0.

12 Given the degree of class imbalance in evidence (Table 1), the major class is always class 1. The minor classes are always the remaining classes.



Fig. 4. Capture 8 class-wise Detection rate through the stream. 5% versus 0.5% label budget. Class 3 represents Botnet and Class 4 Botnet C & C.

As with Capture 5, SBB–Archive was able to detect each of the 4 classes to varying degrees throughout the stream irrespective of label budget, whereas the NB–Variable framework only did so under the 5% budget. Moreover, it also generally appears to be the case that the NB–Variable framework begins by labelling every record as the major class. Over the duration of the stream, the detection of the major class decays as detection of the remaining classes improves. This appears to be a general property of the NB–Variable framework irrespective of the dataset. This is not the case for SBB–Archive: detecting a new class does not directly imply an inability to maintain the detection of classes already covered. We attribute this property in part to the combination of: 1) a population-based algorithm; 2) the use of a robust measure to identify a champion classifier (Section 3.5); and 3) the ability to explicitly 'balance' the distribution of records retained in the DS (see Section 5.3).

Similar observations carry over to the case of Captures 8 and 9 (Figs. 4 and 5 respectively). Comparison with the corresponding summary plots for the actual distribution of the minor classes during the stream (Fig. B.10, Appendix B) indicates that SBB in particular is also able to pick up the detection of minor classes near to their respective first occurrence. For example, in the case of Capture 8, Class 4 appears as a high-frequency burst early on in the stream and then appears on an intermittent basis thereafter. Fig. 4 demonstrates that SBB–Archive is able to react to this immediately, even under the 0.5% label budget. There is also a particularly interesting behaviour associated with the distribution of Class 3 in the Capture 8 stream. Initially Class 3 appears in very short bursts at a high frequency, and then decreases in frequency, but with a slightly longer duration. This results in a stepwise improvement in the detection of Class 3 throughout the stream, as both SBB and NB are able to incrementally get better at sampling/detection (Fig. 4). In the context of the Botnet detection task, we note that although the overall AUC might be in the order of 54–78% in the case of Captures 6, 8 and 9 (Section 5.1), this translates to detecting the master-to-slave communication (Class 4) at a Detection rate of 70–95% by the 'end' of the stream (SBB–Archive). Moreover, not only are the malicious behaviours in these streams very infrequent (Table 1) and transitory (Appendix B), but they also represent application and operating system related attacks. Such behaviours are usually represented in the payload of the traffic, and are therefore not directly present in the flow features used in this work (which do not include the payload). Instead, detection is taking place because behavioural 'fingerprints' have been discovered in the flow data that are sufficient for identifying the Botnet traffic.
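The class-wise Detection rate curves of Figs. 2–5, and the 'reacts near first occurrence' observation above, amount to the per-window computation sketched below. This is an illustration only, not the authors' implementation; the window size and the list-of-labels interface are assumptions.

    def classwise_detection_rate(true_labels, predictions, window_size):
        """Per-class Detection rate DR_c in each non-overlapping window of the stream.
        Returns one dict per window: class id -> fraction of that class correctly labelled."""
        per_window = []
        for start in range(0, len(true_labels), window_size):
            w_true = true_labels[start:start + window_size]
            w_pred = predictions[start:start + window_size]
            rates = {}
            for c in set(w_true):
                idx = [i for i, t in enumerate(w_true) if t == c]
                rates[c] = sum(w_pred[i] == c for i in idx) / len(idx)
            per_window.append(rates)
        return per_window

    def detection_lag(per_window_rates, class_id):
        """Windows elapsed between a class first appearing and first being detected (DR_c > 0)."""
        first_seen = next((i for i, r in enumerate(per_window_rates) if class_id in r), None)
        if first_seen is None:
            return None
        first_hit = next((i for i in range(first_seen, len(per_window_rates))
                          if per_window_rates[i].get(class_id, 0.0) > 0.0), None)
        return None if first_hit is None else first_hit - first_seen

Under this view, the remark about Capture 8 corresponds to detection_lag returning a value close to zero for Class 4 under SBB–Archive.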

5.3. Detection Rate Dynamics: SBB sampling and archiving policies

The goal of this section is to provide some insight into the relative contribution of the four variants of the SBB sampling/archiving policies (Table 3). Section 5.1 has already identified the overall ranking of the four variants as: Archive > Both > Rnd > Sample, where 'Rnd' represents the control/baseline parameterization. We are now interested in characterizing the properties that lead to this outcome.


Fig. 5. Capture 9 class-wise Detection rate through the stream. 5% versus 0.5% label budget. Class 3 represents Botnet and Class 4 Botnet C & C.

For this purpose the Capture 1 dataset will be assumed (observations are common to the other datasets, but space precludes their duplication). In particular, Capture 1 introduces different classes at different points during the stream (Fig. B.10), thus requiring the Archiving policy to create 'new' categories within the finite Data Subset archive (Fig. 1), while the Sampling policy needs to detect the change in the first place. From the perspective of the Botnet detection task, Capture 1 contains a lot of port scanning activities, which are not straightforward to detect without port number information (as is the case here).

Fig. 6 summarizes class-wise Detection rate over the Capture 1 dataset with a 5.0% label budget. The control parameterization of 'Rnd' samples the stream window, W(i), with uniform probability and replaces records within the Data Subset (DS) with uniform probability (Table 3). When comparing Subplot 6(a) to the underlying distribution of the minor classes (Fig. B.10), we make the following observations:

1. The underlying distribution of each class: Class 1 – the major class – is detected most strongly, and Class 4 (Botnet C & C) – the least frequent – very rarely.
2. When classes appear for the first time in the stream: Class 2 (normal) appears later in the stream than Class 3 (Botnet).

The impact of introducing a biased sampling policy (but retaining random replacement of records within the DS) is summarized in Subplot 6(b). Class 1 is even more strongly detected, and Class 3 (Botnet), despite its late introduction, is now detected much more effectively. Unfortunately, this also came at the expense of Class 2 (Normal) detection, which has seen a reduction with respect to the 'Rnd' policy. Switching to the biased archiving policy (with a random sampling policy, Table 3) improves the detection of all minor classes, Subplot 6(c). Some decay in the rates of detection appears with respect to Class 1, which may be due to having to share the available DS space with the minor classes when they appear later in the stream. However, all the minor classes are now detected much more effectively than before. Thus, from the perspective of Botnet detection, by the 'end' of the stream the Botnet and C & C classes are identified with Detection rates of ≈ 70% and ≈ 55% respectively. Finally, Subplot 6(d) represents the profile for when both the biased archiving and sampling are introduced. Detection of Classes 2 and 3 actually declined with respect to the best-case policy combination of 'Archive'. We will investigate the sources of this further by considering how the distribution of records retained in the Data Subset changes as the stream passes.

Fig. 7 summarizes the proportion with which records from each class are expressed in the Data Subset during the stream. As it is the content of the DS against which fitness evaluation is performed and a champion classifier identified, the distribution of records within the DS has a significant impact on the overall performance of GP. All policies begin with DS content dominated by Class 1 representation (the major class).



Fig. 6. Class-wise Detection rate for SBB sampling and archiving policies on the Capture 1 dataset at a 5.0% label budget.

However, both the entirely random policy and the biased Sample policy fail to ensure that the minor classes appear in sufficient quantity (Subplots 7(a) and 7(b)). Conversely, both the SBB–Archive and SBB–Both policies achieve this (Subplots 7(c) and 7(d)). It is notable that the SBB–Both parameterization is by far the most consistent. Conversely, SBB–Archive is more gradual in instigating changes to the distribution of class representation. Given that the SBB–Archive approach provided the basis for a better stream classifier, it appears that this more gradual updating of record class distribution is the key to the performance differences.

5.4. Capacity for detecting Botnet C & C signals

In this section we take a closer look at performance under the minority class, where this corresponds to the least frequently occurring class in the stream. Given that the application data pertains to Botnet detection, it is the minor class that represents the first indication of a Botnet, i.e. a command and control (C & C) signal. Hence, we are interested in learning whether the overall ranking of streaming classification algorithms (Section 5.1) changes when we estimate stream AUC specific to the minor class alone. Table 10 summarizes the results of the rank-based analysis of each algorithm under the minor-class streaming AUC for each label budget. The SBB–Archive and SBB–Both formulations still represent the highest ranked models. Given the observations from Section 5.3, this is not surprising, as both these parameter combinations were much more effective at balancing the content of the Data Subset. The last two columns of Table 10 provide the corresponding Friedman test statistic (χF²) and value for the F-distribution (FF), indicating that the null hypothesis is rejected. Given that the degrees of freedom are unchanged from the earlier analysis, the critical difference is also unchanged (2.855). Specifically, the Nemenyi post hoc test applied with respect to the highest ranked model implies that SBB–Archive, SBB–Both and NB–Variable are statistically distinct from the remaining 6 models at a confidence level of α = 0.1 for a label budget of 5%. As the label budget decreases to 1% and 0.5%, these three models continue to be consistently identified. Moreover, the low ranking of Hoeffding and NB–Split is a general reflection of an inability to detect the minor class in a significant number of the Capture datasets.
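Sections 5.3 and 5.4 attribute the advantage of SBB–Archive and SBB–Both to keeping the Data Subset class-balanced. As a concrete illustration of the idea only (this is not the published SBB–Archive definition; the replacement rule, tie-breaking and parameter names are assumptions), a balancing replacement step can be written as follows.

    import random

    def archive_insert(data_subset, new_record, ds_capacity):
        """Insert a labelled record into the Data Subset (DS), biasing removal
        towards the class that is currently most over-represented.

        data_subset : list of (features, label) pairs.
        new_record  : (features, label) pair sampled from the stream for labelling.
        """
        if len(data_subset) < ds_capacity:
            data_subset.append(new_record)
            return
        # Count class representation in the current DS content.
        counts = {}
        for _, label in data_subset:
            counts[label] = counts.get(label, 0) + 1
        major = max(counts, key=counts.get)          # most represented class
        victims = [i for i, (_, lbl) in enumerate(data_subset) if lbl == major]
        data_subset[random.choice(victims)] = new_record

Under a uniform ('Rnd') policy the victim index would instead be drawn from the whole DS, so minor-class records are displaced in proportion to their (already low) representation.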

5.5. Real-time operation

GP is often considered to present a considerable computational overhead that would preclude its operation under tasks requiring real-time operation. Under this application domain, packet data are first subject to pre-processing into traffic flows using the Argus application [38]. Each flow characterizes the statistics of a collection of packets that share the same source/destination IP address, protocol and source/destination port. The number of packets per flow is a function of the service/application (tens to hundreds of packets per flow). Cisco defines an upper bound of 600 ms as the time interval for the completion of any flow; however, this represents a worst-case figure, as the inter-arrival time of packets is a function of network topology and load.
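For context, this flow construction step amounts to grouping packets on their 5-tuple and bounding the duration of a flow. A rough sketch follows; it is not the Argus implementation, the 600 ms bound is the only figure taken from the text, and the packet-record field names are assumptions.

    def packets_to_flows(packets, max_duration=0.6):
        """Group packets into flows keyed on the 5-tuple, closing a flow once it
        exceeds `max_duration` seconds (0.6 s, the upper bound quoted above).

        packets : iterable of dicts with keys
                  'ts', 'src_ip', 'dst_ip', 'proto', 'src_port', 'dst_port', 'bytes'.
        Yields per-flow statistic dicts (packet and byte counts, duration).
        """
        active = {}
        for p in sorted(packets, key=lambda p: p['ts']):
            key = (p['src_ip'], p['dst_ip'], p['proto'], p['src_port'], p['dst_port'])
            flow = active.get(key)
            if flow is None or p['ts'] - flow['start'] > max_duration:
                if flow is not None:
                    yield flow                      # close the expired flow
                flow = {'key': key, 'start': p['ts'], 'packets': 0, 'bytes': 0}
                active[key] = flow
            flow['packets'] += 1
            flow['bytes'] += p['bytes']
            flow['duration'] = p['ts'] - flow['start']
        yield from active.values()                  # flush flows still open at the end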


Fig. 7. Typical Distribution of classes present in the Data Subset for SBB sampling and archiving policies on the Capture 1 dataset at a 5.0% label budget.

Table 10
Ranks for minor-class streaming AUC alone. In each case the observed FF comfortably exceeds the critical value of F(8, 96) at α = 0.01, so the null hypothesis (of a random ranking) is rejected.

               SBB                                          Hoeffding            NB                   Statistic
Label budget   Rnd    Pareto  Sample  Archive  Both         Split   Variable     Split   Variable     χF²    FF
5%             5.15   5.27    4.62    1.46     2.23         7.84    7.19         7.84    3.38         76.4   33.1
1%             4.31   5.53    4.81    2.08     2.46         7.46    7.0          7.46    3.88         57.5   14.8
0.5%           4.26   5.04    4.5     2.04     2.35         7.38    6.96         7.38    3.92         43.8   8.7

The capacity of GP to support real-time operation will be characterized from two perspectives: 1) the time for the champion GP individual to make a class label prediction for each flow record (anytime operation); and 2) the time to complete fitness evaluation after updating the content of the Data Subset. Under the parameterization assumed in this work (Table 4), fitness evaluation takes the form of evaluating Psize = 120 teams on DS = 120 flow records for τ = 5 generations and identifying the champion individual. However, given that only twenty training records are introduced at each window location and only twenty teams are replaced per generation (Gap = Tgap = 20), the computational cost per window location is actually in the order of 20 × 20 × 5 evaluations.

Fig. 8 summarizes the execution time under each of these conditions on a common Intel i5 CPU (2.67 GHz, 48 GB RAM). The plots illustrate the mean and variance for 20 runs over Capture 3, where this represents the dataset with the largest cardinality, i.e. it should be clear whether computational costs stabilize or not. In all cases, this reflects a code base that executes as a single thread. The average execution time for the champion is ≈ 45 μs (Fig. 8(a)), implying an average (single-threaded) throughput of 22,222 flows per second. Conversely, it takes between 2.7 and 3.8 s to update the champion predictor.13 We note that the biggest impact on the time to identify new champion classifiers is the time for human experts to provide labels for the Gap records associated with each (non-overlapping) window location W(i). However, this does not interrupt the ability of the current champion classifier to provide labels, and would be synonymous with current practice for deploying updates to 'signature' based detectors.

13 Adopting multi-threaded operation, say eight threads, could potentially reduce this to between 0.34 and 0.5 s.
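These figures follow directly from the measured times; the short calculation below makes the single-threaded throughput and the hypothetical eight-thread estimate of footnote 13 explicit (the thread count is purely illustrative).

    champion_time_s = 45e-6                      # ~45 microseconds per flow (Fig. 8(a))
    print(f"anytime labelling: ~{1.0 / champion_time_s:,.0f} flows per second")   # ~22,222

    update_time_s = (2.7, 3.8)                   # champion refresh per window location
    threads = 8                                  # hypothetical multi-threaded deployment
    print([round(t / threads, 2) for t in update_time_s])                         # ~[0.34, 0.48]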



Fig. 8. Wall clock time for (a) the champion individual to make predictions and (b) fitness evaluation to update the content of the population on a new non-overlapping window location, for the Capture 3 dataset on a 2.67 GHz CPU.

Fig. 9. Symbiotic bid-based GP. Each team indexes a different combination of programs, but the same program may appear in multiple teams. The action (class label) of a program is expressed through colour.

6. Conclusion

Active learning under a streaming data context decouples the machine learning algorithm from the raw throughput of the stream and provides the opportunity to manipulate the distribution of data used for model building. We believe that this makes active learning a particularly useful approach to adopt with GP, particularly under low label budgets. The key observation of our approach is to identify the appropriate interaction between sampling and archiving policies:

• SBB–Rnd: represented our control policy combination. In general the distribution of the classes in the data subset reflects that of the stream, and the accuracy of the anytime classifier also reflects this distribution.
• SBB–Sample: was unable to retain 'useful' queries from the minor class in the data subset. One possible reason is that the GP individual selected to be the champion (anytime) classifier is always chosen on the basis of performance on the two or three most frequently occurring classes in the Data Subset; hence, the pressure/penalty for mislabelling the smallest class(es) is typically low. Future research will attempt to identify specific sources for the weak performance of this policy combination.
• SBB–Archive: emphasizes balancing data subset content over biased querying of the stream. This appeared to provide the best balance between keeping the major class exemplars up to date and identifying instances of the minor class(es).
• SBB–Both: combined targeted queries on the stream for labelling with biased record replacement in the data subset. This resulted in a system that was too aggressive in promoting the minor class in the Data Subset, reducing performance on the major class(es).

When comparing with four alternative formulations explicitly designed to operate under label budgets, the GP policies of SBB–Archive and SBB–Both are generally ranked 1 and 2 respectively. Other results for entirely artificial stream data resulted in the same ranking [30]. In pursuing a Botnet detection task, we illustrate the potential for addressing network analysis under particularly challenging conditions, e.g. class imbalance, a high cost of labelling and anytime operation. Indeed, given that the minor class often represents the most costly class to misclassify under Botnet detection, constructing streaming classifiers under the combined condition of low label budgets and class imbalance represents a significant challenge. This study represents the first time that streaming algorithms have been deployed under these conditions. GP streaming under the archiving policy is shown to be particularly effective in this respect. Moreover, it is clear that the detection of Botnet behaviours improves over the course of the stream, resulting in the ability to detect the Botnet and C & C classes even though they might represent less than 1% of the total stream content. In order to minimize the effect of adversarial attacks against the learning algorithm itself, human experts are still required to provide the true labels. However, the streaming algorithm identifies the data for labelling. Moreover, the anytime classifier makes predictions regarding class labels, where such information can be used to prioritize the records that the experts label first.

Several avenues exist for future work, including but not limited to: 1) the use of multi-armed bandit formulations to direct the process of constructing GP teams and/or direct the sampling of records from the stream; and 2) extensions to further applications where labels can be provided, but at a cost.

Acknowledgments

This research is supported by the Canadian Safety and Security Program (53059) (CSSP) E-Security grant. The CSSP is led by Defence Research and Development Canada, Centre for Security Science (CSS) on behalf of the Government of Canada and its partners across all levels of government, response and emergency management organizations, non-governmental agencies, industry and academia.



Appendix A. Symbiotic bid-based GP

As noted in Section 3.2, this work assumes the Symbiotic Bid-Based formulation for GP (or SBB for short). The SBB framework takes the form of a two-population model representing teams and programs respectively (Fig. 9), i.e. the relationship between the two populations is symbiotic [47]. Without loss of generality, we assume a linear GP representation for the programs on account of the ease with which intron code can be identified and 'skipped' during fitness evaluation [48]. Each program when executed produces a scalar output, or the 'bid'. Moreover, each program is assigned an action, a, at initialization, where in this case a ∈ {1, …, C} and C is the number of classes. The team population takes the form of a variable-length GA in which each individual, tmi, indexes some subset of the programs from the program population. Thus, given record →x(t) from the stream and a team, all the programs from this team are evaluated in order to identify the program with the maximum output (max. bid). Let this be the 'winning' program for team tmi on record →x(t). Such a program has won the right to suggest its action, in this case representing the suggested class label. As the team population assumes a variable-length chromosome, team size evolves, and so does team complement. This is very important as it implies that no prior decisions are necessary regarding the decomposition of the task. Indeed, even assuming that a three-class problem must have at least three programs (with unique class labels) represents a poor learning bias. Instead, teams evolve incrementally over time, identifying the 'easiest' classes to classify first and then adding the more difficult classes. As a consequence, the same program may appear in multiple teams. The only constraints are for each team to possess at least two different actions across its team complement, and for a team to have a minimum of two programs. Fitness is only expressed at the level of the team population. After evaluating all teams, the bottom Tgap teams are dropped and replaced by offspring developed from the surviving individuals (i.e., a breeder model of selection/replacement). However, before team reproduction, the individuals from the program population are tested to identify any program(s) that no longer receive at least one index from a team. Such programs are deleted. Variation operators act hierarchically and probabilistically delete or clone (add) programs (Table 4). Only the cloned programs see further modification, and only the resulting new programs can be incorporated into new teams.14 This way, programs that survive between generations are not disrupted. This also implies that only the team population size is explicitly defined (Psize), whereas the program population size is free to vary. The instruction set for programs takes the form of a register-level transfer language consisting of instructions defined by one or two arguments:

• R[x] = R[x] ⟨op⟩ R[y], where R[x] denotes a register with an integer reference x over the range [0, …, B − 1] and B is the maximum number of registers; y ∈ [0, …, B + d − 1], where d is the dimension of the input space. The implication of the latter is that the last d register references are 'read only' and are initialized with the values of the record awaiting classification. ⟨op⟩ ∈ {+, −, ×, ÷, IF–THEN}. The conditional operator (IF–THEN) is evaluated as IF R[x] < R[y] THEN R[x] = −R[x].
• R[x] = ⟨op⟩ R[y], where x, y and R[·] follow the above definitions, and ⟨op⟩ ∈ {cos, exp, ln}, where the ln operator assumes the absolute value of the operand.
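To ground the above, the following sketch shows how a linear GP program of this form might be interpreted, and how a team's label is resolved through the maximum bid. The register count, instruction encoding and operator protection are illustrative assumptions only; the published SBB implementation [36] should be consulted for the definitive semantics.

    import math

    def run_program(instructions, record, num_registers=8):
        """Interpret a linear GP program over registers R[0..B-1]; the record's
        features are exposed as additional read-only registers R[B..B+d-1]."""
        R = [0.0] * num_registers + list(record)
        for op, x, y in instructions:               # each instruction: (opcode, x, y)
            a, b = R[x], R[y]
            if   op == '+':   R[x] = a + b
            elif op == '-':   R[x] = a - b
            elif op == '*':   R[x] = a * b
            elif op == '/':   R[x] = a / b if b != 0 else a       # protected divide (assumption)
            elif op == 'if<': R[x] = -a if a < b else a           # IF R[x] < R[y] THEN R[x] = -R[x]
            elif op == 'cos': R[x] = math.cos(b)
            elif op == 'exp': R[x] = math.exp(min(b, 50.0))       # overflow guard (assumption)
            elif op == 'ln':  R[x] = math.log(abs(b)) if b != 0 else 0.0
        return R[0]                                  # scalar output = the program's bid

    def team_predict(team, record):
        """A team is a list of (program, action) pairs; the winning (maximum bid)
        program suggests its action as the class label."""
        return max(team, key=lambda pa: run_program(pa[0], record))[1]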

Naturally, the choice of opcodes (⟨op⟩), their order, and the value of the register references are all evolved qualities. Further details of the SBB framework can be found in the original publications [36,49,28].

Appendix B. Distribution of minor classes

Fig. B.10 summarizes the distribution of the minor classes (everything except the background class) throughout the stream for the CTU datasets explicitly employed for illustrating behavioural properties of the streaming Botnet detection task, e.g. Captures 1, 5, 6, 8 and 9. Note that at any point in time the total traffic content is expressed as the sum of each class present in each unique non-overlapping window location, for a window size of 1%. It is apparent that there is no common 'behaviour' beyond class 4 (corresponding to Botnet master/slave communication (C & C)) having the lowest frequency and being of a very burst-like nature. Class 2 (representing data corresponding to the CTU 'normal filters') appears throughout in 3 of the 5 illustrated datasets, but in the case of Captures 1 and 9 may only appear after a delay and might not appear continuously. Class 3 (representing Botnet attacks) is particularly interesting in the case of Capture 8, in which the attack switches on and off at specific (non-periodic) intervals. Moreover, also note that although class labels are used to indicate definitive classes (background/normal/Botnet/C & C), the behaviours associated with each label are a composite of multiple behaviours. For example, Captures 1, 2 and 9 are representative of the Neris Botnet, thus Class 3 can consist of any combination of IRC, spam, click fraud or scanning activities. See Section 4.1 for a summary of the types of malicious behaviours present in each stream.
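The per-window class proportions plotted in Fig. B.10 correspond to a computation of the following form. This is an illustrative sketch only; the 1% window size is the only parameter taken from the text, and the list-of-labels interface is an assumption.

    def class_distribution_per_window(labels, num_windows=100):
        """Proportion of each class in each of `num_windows` equal, non-overlapping
        windows covering the stream (window size = 1% of the stream by default)."""
        size = max(1, len(labels) // num_windows)
        profile = []
        for start in range(0, len(labels), size):
            window = labels[start:start + size]
            profile.append({c: window.count(c) / len(window) for c in set(window)})
        return profile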

14 For example, having the program action changed, or instructions randomly modified.



Fig. B.10. Distribution of minor classes over the course of the stream for the 5 capture datasets appearing in Sections 5.2 and 5.3. Note the use of a log scale and the colour coding corresponds to that adopted for the original Stream DR figures. Class 1 is omitted for clarity (always 90–99%). Class 2 represents ‘normal’ traffic corresponding to the CTU filters, Class 3 represents Botnet and Class 4 Botnet C & C. The log scale also implies that 10−2 is synonymous with zero content (e.g. the earliest that Class 4 appears is at the 40% point in Captures 5 and 9).



[26] A. Atwater, M.I. Heywood, A.N. Zincir-Heywood, GP under streaming data constraints: A case for Pareto archiving?, in: ACM Genetic and Evolutionary Computation Conference, 2012, pp. 703–710. [27] A. Vahdat, A. Atwater, A.R. McIntyre, M.I. Heywood, On the application of GP to streaming data classification tasks with label budgets, in: ACM Genetic and Evolutionary Computation Conference: Big Data Workshop, 2014, pp. 1287–1294. [28] A. Vahdat, J. Morgan, A.R. McIntyre, M.I. Heywood, A.N. Zincir-Heywood, Evolving GP classifiers for streaming data tasks with concept change and label budgets: A benchmarking study, in: Handbook of Genetic Programming Applications, Springer, 2015, Ch. 18, pp. 451–480. [29] S. Khanchi, M. Heywood, N. Zincir-Heywood, On the impact of class imbalance in GP streaming classification with label budgets, in: European Conference on Genetic Programming, vol. 9594 of LNCS, 2016, pp. 35–50. [30] S. Khanchi, M. Heywood, N. Zincir-Heywood, Properties of a GP active learning framework for streaming data with class imbalance, in: ACM Genetic and Evolutionary Computation Conference, 2017, pp. 945–952. [31] H.H. Dam, C. Lokan, H.A. Abbass, Evolutionary online data mining: An investigation in a dynamic environment, in: Studies in Computational Intelligence, vol. 51, Springer, 2007, Ch. 7, pp. 153–178. [32] M. Behdad, T. French, Online learning classifiers in dynamic environments with incomplete feedback, in: IEEE Congress on Evolutionary Computation, 2013, pp. 1786–1793. [33] A. Cervantes, P. Isasi, C. Gagné, M. Parizeau, Learning from non-stationary data using a growing network of prototypes, in: IEEE Congress on Evolutionary Computation, 2013, pp. 2634–2641. [34] L.L. Minku, A.P. White, X. Yao, The impact of diversity on online ensemble learning in the presence of concept drift, IEEE Trans. Knowl. Data Eng. 22 (5) (2010) 730–742. [35] R. Polikar, L. Udpa, S. Udpa, V. Honavar, Learn++: an incremental learning algorithm for supervised neural networks, IEEE Trans. Syst. Man Cybern.-Part C 31 (4) (2001) 497–508. [36] P. Lichodzijewski, M.I. Heywood, Managing team- based problem solving with Symbiotic Bid-based Genetic Programming, in: ACM Genetic and Evolutionary Computation Conference, 2008, pp. 363– 370. [37] P. Lichodzijewski, M. I. Heywood, Symbiosis, complexification and simplicity under GP, in: ACM Genetic and Evolutionary Computation Conference, 2010, pp. 853– 860. [38] S. García, M. Grill, J. Stiborek, A. Zunino, An empirical comparison of botnet detection methods, Comput. Secur. 45 (2014) 100–123. [39] M.V. Mahoney, P.K. Chan, An analysis of the 1999 DARPA/Lincoln Laboratory evaluation data for network anomaly detection, in: Recent Advances in Intrusion Detection, Vol. 2820 of LNCS, 2003, pp. 220 –237. [40] C. Rossow, C.J. Dietrich, C. Grier, C. Kreibich, V. Paxson, N. Pohlmann, H. Bos, M. van Steen, Prudent practices for designing malware experiments: Status quo and outlook, in: IEEE Symposium on Security and Privacy, 2012, pp. 65–79. [41] A. Shiravi, H. Shiravi, M. Tavallaee, A.A. Ghorbani, Toward developing a systematic approach to generate benchmark datasets for intrusion detection, Comput. Secur. 31 (2012) 357–374. [42] A. Bifet, I. Z˘liobaitė, B. Pfahringer, G. Holmes, Pitfalls in benchmarking data stream classification and how to avoid them, in: Machine Learning and Knowledge Discovery in Databases, Vol. 8188 of LNCS, 2013, pp. 465–479. [43] J.L. Hennessy, D.A. Patterson, Computer Architecture a Quantitive Approach, 2nd edition, Morgan Kaufmann, 1996. [44] N. 
Japkowicz, M. Shah, Evaluating Learning Algorithms: A Classification Perspective, Cambridge University Press, 2011. [45] D. Brzezinski, J. Stefanowski, Prequential AUC for classifier evaluation and drift detection in evolving data streams, in: ECML-PKDD Workshop on New Frontiers in Mining Complex Patterns, vol. 8983 of LNCS, 2014, pp. 87–101. [46] J. Demšar, Statistical comparisons of classifiers over multiple data sets, J. Mach. Learn. Res. 7 (1) (2006) 1–30. [47] M.I. Heywood, P. Lichodzijewski, Symbiogenesis as a mechanism for building complex adaptive systems: a review, in: European Conference on Genetic Programming, vol. 6024 of LNCS, 2010, pp. 51–60. [48] M. Brameier, W. Banzhaf, Linear Genetic Programming, Springer, 2007. [49] J.A. Doucette, A.R. McIntyre, P. Lichodzijewski, M.I. Heywood, Symbiotic coevolutionary genetic programming: a benchmarking study under large attribute spaces, Genet. Program. Evol. Mach. 13 (2012) 71–101.

References [1] M. Sugiyama, M. Kawanabe, Machine Learning in Non-stationary Environments, MIT Press, 2012. [2] G. Ditzler, M. Roveri, C. Alippi, R. Polikar, Learning in non- stationary environments: a survey, IEEE Comput. Intell. 10 (4) (2015) 12–25. [3] M.I. Heywood, Evolutionary model building under streaming data for classification tasks: opportunities and challenges, Genet. Program. Evol. Mach. 16 (3) (2015) 283–326. [4] B. Krawczyk, L.L. Minku, J. Gama, J. Stefanowski, M. Woźniak, Ensemble learning for data stream analysis: a survey, Inf. Fusion 37 (2017) 132–156. [5] M. Barreno, B. Nelson, R. Sears, A.D. Joseph, J.D. Tygar, Can machine learning be secure?, in: ACM Symposium on Information, Computer and Communications Security, 2006, pp. 16–25. [6] M. Barreno, B. Nelson, A.D. Joseph, J.D. Tygar, The security of machine learning, Mach. Learn. 81 (2) (2010) 121–148. [7] P. Lindstrom, B. MacNamee, S.J. Delany, Drift detection using uncertainty distribution divergence, Evol. Syst. 4 (1) (2013) 13–25. [8] I. Z˘liobaitė, A. Bifet, B. Pfahringer, G. Holmes, Active learning with drifting streaming data, IEEE Trans. Neural Netw. Learn. Syst. 25 (1) (2014) 27–54. [9] X. Zhu, P. Zhang, X. Lin, Y. Shi, Active learning from stream data using optimal weight classifier ensemble, IEEE Trans. Syst. Man Cybern. - Part B 40 (6) (2010) 1607–1621. [10] M.M. Masud, J. Gao, L. Khan, J. Han, B. Thuraisingham, Classification and novel class detection in data streams with active mining, in: Pacific Asia Knowledge Discovery and Data Mining, Vol. 6119 of LNCS, 2010, pp. 311–324. [11] H. Kim, S. Madhvanath, T. Sun, Hybrid active learning for non- stationary streaming data with asynchronous labeling, in: IEEE International Conference on Big Data, 2015, pp. 287–272. [12] M. Woźniak, P. Kzieniewicz, B. Cyganek, A. Kasprzak, K. Walkowiak, Active learning classification of drifted streaming data, in: International Conference on Computation Science, 2016, pp. 1724–1733. [13] G. Ditzler, R. Polikar, Incremental learning of concept drift from streaming balanced data, IEEE Trans. Knowl. Data Eng. 25 (10) (2013) 2283–2301. [14] S. Wang, L.L. Minku, X. Yao, Resampling based ensemble methods for online class imbalance learning, IEEE Trans. Knowl. Data Eng. 27 (5) (2015) 1356–1368. [15] N.V. Chawla, K.W. Bowyer, L.O. Hall, W.P. Kegelmeyer, SMOTE: Synthetic minority over-sampling technique, J. Artif. Intell. Res. 16 (2002) 321–357. [16] B. Mirza, Z. Lin, K.-A. Toh, Weighted online sequential extreme learning machine for class imbalance learning, Neural Process Lett. 38 (3) (2013) 465–486. [17] A. Ghazikhani, R. Monsefi, H.S. Yazdi, Recursive least square perceptron model for non-stationary and imbalanced data stream classification, Evol. Syst. 4 (2) (2013) 119–131. [18] M.-R. Bouguelia, Y. Belaïd, A. Belaïd, An adaptive streaming active learning strategy based on instance weighting, Pattern Recognit. Lett. 70 (2016) 38–44. [19] Y. Sun, K. Tang, L.L. Minku, S. Wang, X. Yao, Online ensemble learning of data streams with gradually evolved classes, IEEE Trans. Knowl. Data Eng. 28 (6) (2017) 1532. [20] K.B. Dyer, R. Capo, R. Polikar, Compose: a semisupervised learning framework for initially labeled nonstationary streaming data, IEEE Trans. Neural Netw. Learn. Syst. 25 (1) (2014) 12–26. [21] M.J. Hosseini, A. Gholipour, H. Beigy, An ensemble of cluster-based classifiers for semi-supervised classification of non-stationary data streams, Knowl. Inf. Syst. 46 (3) (2016) 567–597. [22] M. Kampouridis, E. 
Tsang, EDDIE for investment opportunities forecasting: Extending the search space of the GP, in: IEEE Congress on Evolutionary Computation, 2010, pp. 2019–2026. [23] A. Loginov, M.I. Heywood, G. Wilson, Benchmarking a coevolutionary streaming classifier under the individual household electric power consumption dataset, in: IEEE-INNS Joint Conference on Neural Networks, 2016, pp. 1–8. [24] G. Folino, G. Papuzzo, Handling different categories of concept drifts in data streams using distributed GP, in: European Conference on Genetic Programming, vol. 6021 of LNCS, 2010, pp. 74–85. [25] I. Dempsey, M. O′Neill, A. Brabazon, Foundations in Grammatical Evolution for Dynamic Environments, Springer, 2009 (Vol. SCI 194).
