On botnet detection with genetic programming under streaming data label budgets and class imbalance


Sara Khanchi, Ali Vahdat, Malcolm I. Heywood⁎, A. Nur Zincir-Heywood
Faculty of Computer Science, Dalhousie University, 6050 University Av., Halifax, NS, Canada

Keywords: Non-stationary data; Streaming data; Botnet detection; Class imbalance; Genetic programming

ABSTRACT

Algorithms for constructing models of classification under streaming data scenarios are becoming increasingly important. In order for such algorithms to be applicable under 'real-world' contexts we adopt the following objectives: 1) operate under label budgets, 2) make label requests without recourse to true label information, and 3) remain robust to class imbalance. Specifically, we assume that model building is only performed using the content of a Data Subset (as in active learning). Thus, the principal design decisions are with regard to the definitions employed for sampling and archiving policies. Moreover, these policies should operate without prior information regarding the distribution of classes, as this varies over the course of the stream. A team formulation for genetic programming (GP) is assumed as the generic model for classification in order to support incremental changes to classifier content. Benchmarking is conducted with thirteen real-world Botnet datasets with label budgets of the order of 0.5–5% and significant amounts of class imbalance. Specific recommendations are made for detecting the costly minor classes under these conditions. Comparison with current approaches to streaming data under label budgets supports the significance of these findings.

1. Introduction

Streaming data applications represent an environment in which data arrives on a continuous basis and exhibits non-stationary properties such as concept drift [1–4]. Thus, records ($\vec{x}$) appear sequentially at discrete points in time, t, and are described by a joint probability distribution $p_t(\vec{x}, d)$, where in this work d represents the record's unknown true label. If for two points in time, t and t + 1, there exists an $\vec{x}$ such that $p_t(\vec{x}, d) \neq p_{t+1}(\vec{x}, d)$, then concept drift has occurred. Such drift might be slow or abrupt, subject to repetition, and/or affect different subsets of classes at different points in time. The goal of a classification model operating on such streams is therefore multifaceted. Not only is it necessary to suggest labels for multiple classes of data in the stream on a real-time/anytime basis, but it is also necessary for the model to identify what data to learn from.1 The process of identifying what to learn from constitutes a 'label request', as a human expert is ultimately responsible for providing ground truth labels. Moreover, it is only feasible for the model to request labels for a small fraction of the data (the cost of acquiring labels is high). Such constraints potentially appear in several applications, e.g. constructing trading agents for financial services or labelling satellite data.




In this work we are motivated by the particular issue of identifying Botnet behaviours in network traffic data. Botnets represent a networked collection of devices whose security was at some point compromised (the bots), allowing a bot herder/master to remotely control the bots. The owners of the compromised devices are unaware of the ability of the bot master to control their devices. The bot master is then free to use the bots to launch a wide range of malicious behaviours while hiding their own identity. Detection of Botnets is nontrivial because: 1) malicious behaviours are mixed in with legitimate (normal) behaviours; 2) users have a wide range of 'normal' behaviours; 3) network load and application mix are time varying parameters; 4) many applications dynamically switch between different modes of operation in unpredictable ways (e.g., services such as Skype and Tor explicitly attempt to hide their communication protocols); 5) new applications/updates to current applications (whether malicious or not) coexist with old versions of the same application, resulting in multiple simultaneous 'fingerprints' for the same application; and, 6) the ratio of data pertaining to malicious versus non-malicious behaviour is very low.

The Botnet detection scenario is framed as follows. We cannot predict a priori when Botnet behaviours will appear in the stream, as network data represents a mixture of normal and malicious data.

⁎ Corresponding author. E-mail address: [email protected] (M.I. Heywood).
1 The non-stationary properties of the stream imply that training data has to be identified interactively during the course of deployment.

http://dx.doi.org/10.1016/j.swevo.2017.09.008 Received 16 November 2016; Received in revised form 16 August 2017; Accepted 9 September 2017 2210-6502/ © 2017 Elsevier B.V. All rights reserved.



Normal network data is also non-stationary, implying that it is also not feasible to pre-train models off-line and then deploy them (such models will always 'go stale' as both normal and malicious data change at unpredictable points in time). Human expert(s) are available for providing true labels, d, for a small subset of the stream data (i.e. a label budget) on a continuous basis. This is necessary because attacks against the machine learning algorithm itself lead to an attacker 'reprogramming' the classification of attack behaviours as normal by manipulating stream data content [5,6]. In order to decouple human experts from the raw throughput of the network data, only the GP framework will identify data for labelling, not the human, i.e. this step cannot assume access to the true labels. We assume that the human experts are trustworthy (otherwise GP models could again be misled). A champion GP individual must always be available for label prediction, before any label querying can take place (real-time anytime operation). The GP framework therefore operates interactively with the stream, providing predictions about the content (normal or Botnet) and directing the human labelling of the stream under a finite label budget.2 In framing the task this way, the proposed system has the ability to operate under incoming and/or outgoing network traffic on a wide range of network devices including servers and client devices. Such a framework would be deployed to protect institutions/infrastructure such as medical, financial or other institutions, with human security experts acting as the ultimate source of trusted label information. Other scenarios might include IT security companies who provide the anytime classifier to service subscribers and retain the other components of the architecture.

In the following, we develop the topic by reviewing previous works that address both the ability to operate under label budgets and the issue of class imbalance under streaming data (Section 2). The framework we propose assumes a teaming formulation for genetic programming (GP), where team GP formulations provide an evolutionary approach for adapting an 'ensemble' of GP programs to data content. Section 3 establishes how GP teams are evolved from a fixed size Data Subset, as per active learning; thus the following two critical decisions are addressed: 1) how to sample records from the stream to appear within the Data Subset without requiring label information; and 2) how to identify records for replacement from the Data Subset when the subset is full. Section 4 develops the methodology adopted for streaming classification algorithms operating under label budgets with class imbalance, and introduces the real-world Botnet datasets employed for benchmarking. The ensuing empirical study both quantifies the significance of the GP teaming approach and compares to recent work capable of operating under label budgets (Section 5). We make specific recommendations regarding sampling versus replacement policies for GP and quantify the impact of operating under low label budgets while addressing class imbalance. Indeed, for streaming data applications to be appropriate for real-world applications, it is necessary for them to operate under both of these constraints simultaneously. Section 6 concludes the paper and suggests future research.

2 Predictions from the anytime classifier might also be used to prioritize the records identified for labelling, i.e. a record predicted as an attack class would be prioritized over a record carrying a normal class prediction.

2. Related work

Several recent survey articles have appeared that provide overviews of the scope of model building for streaming data classification under non-stationary streams [2–4]. In the following we will concentrate on highlighting issues specific to the problem setting of streaming classification under label budgets: Section 2.1 reviews developments regarding imbalanced data, change detection, and (online) active learning from the perspective of non-evolutionary methods; and Section 2.2 provides an equivalent survey from the perspective of explicitly evolutionary methods.

2.1. Non-evolutionary methods

Change detection is a mechanism used to initiate retraining of a model. Thus, only when sufficient change is detected will a model be updated. This potentially means that model building is decoupled from the need to provide labels. For example, Lindstrom et al. describe a process by which a reference distribution is constructed and used to calibrate the model [7]. As the model passes over the stream, a divergence measure (expressing model confidence independent from label information) is used to trigger model reconstruction. Any model reconstruction is only performed from the most recent window content. Such an approach only requests labels once a change is detected. However, it also assumes that variation is solely captured by the unconditional distribution of data $p(\vec{x})$. Any change to the posterior distribution of data $p(y|\vec{x})$ remains undetected [8].

Active learning implies that labels are explicitly sought for some fraction of the data, and employs some form of change detection/uncertainty threshold to initiate label requests. Several authors have proposed bias/variance minimization schemes for this purpose [9,10,1,11,12]. That said, empirical benchmarking has demonstrated that just sampling with uniform probability (up to the label budget) is sufficient to build surprisingly effective models, but only when class instances are well mixed [9,8]. Žliobaitė et al. introduced an active learning algorithm that balances stochastic sampling with model based uncertainty sampling in order to simultaneously address changes to both $p(\vec{x})$ and $p(y|\vec{x})$; moreover, this is achieved within fixed label budgets. Such an algorithm combines (model driven) uncertainty sampling with random sampling. Additionally, the active learning approach was sufficiently generic to be deployed with both the streaming formulations for Naive Bayes and Hoeffding decision tree models of classification.

Several recent works investigate the issue of class imbalance under streaming data contexts. One approach is to adopt a formulation of bagging with under- or oversampling in order to construct a 'Data Subset' from which model building is performed [13,14]. Specifically, Ditzler et al. emphasize operation under an incremental (i.e., batch) updating constraint while also supporting anytime labelling, whereas work by Wang et al. emphasizes operating under an online (i.e., record-wise) updating constraint. Also of note is that even though Ditzler et al. assumed the SMOTE algorithm developed for operating under stationary data with class imbalance [15], this was not the most effective method investigated for operation under non-stationary data [13]. A second general approach adopted for addressing class imbalance under streaming data is the use of dynamically reweighting class costs [16,17], where this has also been reported under an active learning context [18]. Most recently, attention has been paid to scenarios in which classes repeatedly drop in and out of the stream against a general backdrop of classes that appear on a continuous basis [19], albeit with true labels known for the entire stream content. In the application context associated with this work, the generic number of classes is known (e.g. attack or normal), but when they might appear is not.

Various semi-supervised frameworks have also been proposed for operation under streaming data contexts, and provide a natural framework for addressing labelled versus unlabelled data [20,21], i.e. after an initial period of training from labelled data, the classifiers go 'online' using unsupervised learning alone. Specific points of interest include operation under class imbalance and non-stationary data. However, from the perspective of this work, operation online using unsupervised learning would make such an approach particularly susceptible to adversarial attackers [5,6].


Fig. 1. Overall framework within which the GP streaming framework operates. The Sampling policy (S) determines whether a record should have its label requested (while operating under the label budget constraint β). The Archiving policy (A) maintains a finite size Data Subset, identifying Gap records for replacement. On updating the DS with the set of Gap(i) labelled records, τ generations of GP are performed. A single champion individual is always available for predicting labels, y(t), and may also influence the Sampling policy.
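To make the interaction summarized in Fig. 1 concrete, the following minimal Python sketch walks through one pass of the loop. The component names (sampling_policy, archiving_policy, request_label, evolve, select_champion) are illustrative placeholders for the mechanisms detailed in Sections 3.2–3.5; this is not the authors' implementation.

```python
# Minimal sketch of the Fig. 1 loop: predict-then-train over non-overlapping
# windows W(i) under a label budget (beta = gap / |W(i)|).
# All component names are illustrative placeholders, not the authors' code.

def stream_gp(stream_windows, gap, tau, sampling_policy, archiving_policy,
              request_label, evolve, select_champion, population, data_subset):
    champion = None
    for window in stream_windows:                        # W(i): non-overlapping window
        # Anytime operation: the current champion labels every record first.
        predictions = [champion.predict(x) if champion else None for x in window]

        # Sampling policy S: choose Gap records for labelling; the true
        # labels d(t) are only revealed for the chosen records.
        chosen = sampling_policy(window, predictions, data_subset, gap)
        gap_records = [(x, request_label(x)) for x in chosen]

        # Archiving policy A: merge Gap(i) into the finite Data Subset DS(i).
        data_subset = archiving_policy(data_subset, gap_records, gap)

        # tau generations of GP against DS(i), then (re)identify the champion.
        population = evolve(population, data_subset, tau)
        champion = select_champion(population, data_subset)
    return champion
```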

2.2. Evolutionary methods

Model building for streaming data using evolutionary methods has not been as extensively investigated as under non-evolutionary methods. GP frameworks have been designed to address the specialized task of forecasting for financial data [22]. Such an application implies that the model predicts the direction of movement in the next step of a temporal sequence, and as such is synonymous with a specific example of streaming classification tasks in general (see also predicting electricity utilization [23]). Such frameworks are limited to tasks without label budgets. Folino and Papuzzo propose a GP ensemble for operating under streams exhibiting different non-stationary properties [24]. Change detection was defined using a statistic applied to pairs of windows, hence model updating was only triggered by $p(\vec{x})$ but not $p(y|\vec{x})$. Moreover, a parallel distributed GP framework was necessary to support the rebuilding of GP ensembles. Dempsey et al. investigated the role of genotype-to-phenotype mappings under dynamic environments (stock market data in particular) [25]. Specifically, they emphasize the significance of evolvability/plasticity in facilitating adaptation under non-stationary data.

Earlier versions of the framework adopted in this work have assumed that GP teams are evolved from: 1) a Data Subset under some policy for sampling records from the stream; and 2) an archiving policy for determining which records to keep/replace. Initial attempts to describe the archiving policy using Pareto archiving indicated that label error (a form of noise disrupting the ability to accurately model $p(y|\vec{x})$) had a significant negative effect on the ability to build robust models [26]. Adopting a simple uniform sampling policy under label budgets provided a more robust starting point [27], albeit with full label information necessary to guarantee a balanced Data Subset. It was also demonstrated that adopting a team GP formulation is much more effective at reacting to change than assuming a single (monolithic) GP individual as the solution [28]. Finally, the issue of class imbalance of the Data Subset was shown to have implications for the quality of the resulting GP model [29,30]. In this work, we concentrate on the Botnet detection task in particular (the earlier works were limited to artificial datasets), where this represents a particularly challenging task for streaming data analysis, i.e. very imbalanced data, classes that continuously appear and disappear, a significant cost to misclassification, and low label budgets.

Several earlier approaches assume prototype style representations, such as learning classifier systems (LCS). Dam et al. concentrate on measuring the reaction time of LCS under a 'multiplexer' task reformulated to exhibit variation in $p(\vec{x})$ and suggest that population reinitialization is the most appropriate mechanism for reacting to change once detected [31]. Behdad and French investigate an approach to LCS in which the order of explore and exploit cycles is reversed compared to that traditionally assumed for off-line batch learning [32]. Finally, we note that k-NN approaches can be managed through evolutionary methods such as particle swarm optimization and potentially applied to streaming data classification tasks [33]. All of these methods are limited to different subsets of the functionality of the framework proposed in this work.

3. Framework for streaming GP teams

3.1. Streaming data environment under a label budget

Streaming algorithms are defined as online [34] or incremental [35]. Online algorithms operate instance-wise, possibly from the content of a finite length sliding window of sequentially encountered instances.3 Conversely, incremental algorithms process data in 'chunks' or 'blocks' defining a non-overlapping window in which the most recent set of instances from the stream is available. Any querying performed by the streaming algorithm is limited to the data in the window (whether sliding or non-overlapping). This reinforces the limited memory constraint implicit in streaming data. In this work we assume an incremental non-overlapping approach (Fig. 1).

The data stream is defined by a continuous sequence of D-dimensional records, …, x(t), x(t + 1), …, where t represents the temporal index. The continuous nature of the stream implies that t → ∞. Each record has a (true) label, d(t), that is not available unless explicitly requested. Label requests can only be made using: 1) records within the current window, and 2) a label budget, β, such that β = 0.5 implies that fifty percent of records may have their corresponding label requested. Operation within the context of a label budget implies that it is necessary for a sampling policy, S, to be defined to explicitly identify which records will have their labels requested. Such a sampling policy only operates on the records in window W(i), and cannot revisit records once a decision has been made. However, anytime operation implies that for each record, x(t), a label prediction, y(t), is made (in real-time) by the streaming classifier. Given that a population based paradigm will be pursued (GP in this case), this implies that a champion classifier (i.e. individual from the GP population) has to be available to provide a label prediction before any of the stream appears in the window. Fig. 1 summarizes how these concepts are related in this work. The above formulation of the streaming data classification task implies that:

• All of the data in the stream has labels first proposed by the champion classifier. Any updates to the GP population are only performed after the champion individual makes its label prediction, y(t);
• Training is an interactive process, with the stream GP framework making decisions about what records to request true labels for, subject to the label budget β;
• Records arrive in an order dictated by the underlying properties of the task, hence class balance is not likely to be present within any local region of the stream.

3 As in a ‘first-in-first-out’ (FIFO) data structure in which the most recent instance (from the stream) pushes out the oldest instance from the FIFO.




Changes to the champion classifier are also made by the stream GP framework, but once a change takes place the new champion cannot revisit any previous labelling decision(s).

3.2. Overall framework

We will assume the overall generic framework of Fig. 1. The sampling policy, S, operates under the label budget constraint β to identify a total of Gap records that have their corresponding (true) label, d(t), requested. Once the Gap records have had a label requested, a corresponding Gap records are identified for replacement from the finite sized Data Subset, DS(i). The Data Subset therefore decouples fitness evaluation from stream cardinality and potentially provides the ability to introduce biases into the representation of each class. Moreover, we implicitly assume an 'incremental' approach to model building under streaming data. This implies that fitness evaluation is only performed once the Gap new label requests have been made. Thus, DS(i) denotes the specific point at which a batch of Gap new records enters the finite size Data Subset. An archiving policy, A, prioritizes records for replacement/retention within the Data Subset. After identifying DS(i), a fixed number of τ generations are performed and a champion GP individual is identified. Thus, streaming operation commences following an initial cold start necessary to identify the first champion individual. Note that the rate of champion identification does not exceed the rate of DS updating, but it need not be the same.

We will assume a symbiotic bid-based (SBB) formulation for expressing solutions as teams of GP programs [36]. Such a framework cooperatively coevolves GP individuals through a bidding mechanism that identifies context for an action, in this case a class label. Each program is assigned a single action at initialization; teams and programs are represented in independent populations. The only constraint on team membership is that there must be at least two programs per team, and there must be at least two different actions present across all the programs participating within the same team. Moreover, the SBB framework addresses multi-class classification without any additional modification. Previous work has demonstrated that such a framework is more effective than monolithic (canonical) GP under off-line classification tasks [37] and streaming classification tasks [28]. The focus of this work lies in how to define effective policies for sampling and archiving such that we construct classifiers capable of operating under the Botnet application context. The resulting sampling and archiving policy definitions are independent of the specific GP framework assumed. For completeness, Appendix A summarizes the SBB framework.

Sections 3.3 and 3.4 characterize the approaches taken to defining the Sampling and Archiving policies. Naturally, we assume that the ability to construct a classifier robust to class imbalance (as measured with respect to any local region of the stream) is biased by the content of the Data Subset. However, this needs to be traded off against the desire to react to changes in the stream. Thus, maintaining underrepresented classes in the Data Subset at the expense of records representing the more frequently occurring classes potentially results in less sensitivity to the most recent data. Identifying the specific balance between these two factors will be a theme we will return to during benchmarking. Finally, Section 3.5 addresses the specific mechanism employed for champion identification, thus anytime prediction of y(t) given x(t).

3.3. Sampling policy

A uniform sampling policy will be assumed as our baseline/control approach. This samples records from the (non-overlapping) window location W(i) prior to the availability of the label information (choose record x(t) for labelling with probability P(β)). Previous works have demonstrated that such a starting point is not necessarily bettered when assuming more sophisticated algorithms [9,8]. A second sampling policy is considered that uses the GP champion classifier, gp*, to promote records for labelling. That is to say, when gp* predicts a class label, y(t), representing a minority class, it is prioritized for sampling (subject to label budget β). In effect, we are using gp* to actively promote records that could contribute to rebalancing the content of the Data Subset, DS(i). Given that performance of the GP classifier is correlated with the distribution of records in the DS, we are attempting to provide a Sampling policy that actively addresses this without requiring label information. Hereafter we refer to this as the biased sampling policy.

Algorithm 1 provides a summary of such a process. We first need to characterize the content of the previous Data Subset, DS(i − 1). Let C denote the number of classes currently present in DS(i − 1) and c the set of classes appearing with frequency ≥ |DS|/C in DS(i − 1), where |DS| is the size of the Data Subset (the over-represented class(es)). We 'mark' the current non-overlapping window content, W(i), in terms of whether the predicted label, y(t) ∉ c (Step 1). If there are fewer marked instances than capacity in Gap, labels are requested for all such cases and the resulting records copied to Gap (Step 2). If there is any remaining capacity in Gap, we fill it by sampling uniformly from the non-marked instances in W(i) (Step 2b). Finally, in the case of more instances marked than capacity in Gap, we sample from the marked instances using a roulette wheel (Step 3). The roulette wheel samples from W(i) with frequency inversely proportional to (marked) DS(i − 1) class content. There is one special case, and that is the case of a cold start. Under this condition there is no champion GP individual available to provide label predictions. We therefore assume the uniform sampling policy for W(i = 0).

Algorithm 1. Biased Sampling Policy. Let rnd(A) return a randomly sampled instance from set A without replacement. roulette(A, b) returns a randomly sampled instance (without replacement) from the subset A ∧ b with frequency inversely proportional to DS(i − 1) class content. Gap(i) is the set of records transferred to DS(i) (Fig. 1) at non-overlapping window location i.

Input: The current content of the (non-overlapping) window W(i) and predicted labels, y(t). The set c of over-represented classes from Data Subset DS(i − 1).
Initial state: Gap(i) = ∅; cnt1 = cnt2 = 0

1. For all t ∈ W(i)
   (a) IF y(t) ∉ c THEN Mt = 1 AND cnt1 = cnt1 + 1 ELSE Mt = 0
2. IF cnt1 ≤ Gap THEN
   (a) For all t in which Mt == 1
       (i) Request d(t)
       (ii) Gap(i) = Gap(i) ∪ (x(t), d(t))
   (b) WHILE cnt1 < Gap
       (i) t = rnd(W(i)) subject to Mt == 0
       (ii) Request d(t)
       (iii) Gap(i) = Gap(i) ∪ (x(t), d(t))
       (iv) cnt1 = cnt1 + 1
3. ELSE
   (a) For any t in which Mt == 1
       (i) (x(t), d(t)) ← roulette(W(i), Mt)
       (ii) Request d(t)
       (iii) Gap(i) = Gap(i) ∪ (x(t), d(t))
       (iv) Mt = 0, cnt2 = cnt2 + 1
   (b) Repeat Step 3a WHILE cnt2 < Gap
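The logic of Algorithm 1 can be summarized in a short Python sketch. The record/label representations and helper names below are assumptions made for illustration only; in particular, the roulette weighting is one plausible reading of "inversely proportional to DS(i − 1) class content", not the authors' code.

```python
import random
from collections import Counter

def biased_sampling(window, predictions, ds_prev_labels, gap):
    """Sketch of Algorithm 1: pick up to `gap` records from W(i) for labelling.

    window         -- records x(t) in the current non-overlapping window
    predictions    -- champion predictions y(t), one per record
    ds_prev_labels -- class labels of the records currently in DS(i-1)
    Returns indices of records whose true label should be requested.
    """
    counts = Counter(ds_prev_labels)
    ideal = len(ds_prev_labels) / max(len(counts), 1)
    over = {k for k, n in counts.items() if n >= ideal}        # over-represented set c

    marked = [t for t, y in enumerate(predictions) if y not in over]   # Step 1
    marked_set = set(marked)
    unmarked = [t for t in range(len(window)) if t not in marked_set]

    if len(marked) <= gap:                                     # Step 2: take all marked,
        chosen = list(marked)                                  # then top up uniformly
        chosen += random.sample(unmarked, min(gap - len(chosen), len(unmarked)))
    else:                                                      # Step 3: roulette wheel over
        chosen, pool = [], list(marked)                        # the marked records
        weights = [1.0 / (1 + counts[predictions[t]]) for t in pool]
        while len(chosen) < gap and pool:
            j = random.choices(range(len(pool)), weights=weights, k=1)[0]
            chosen.append(pool.pop(j))
            weights.pop(j)
    return chosen
```

Under a cold start (no champion predictions yet) the uniform baseline would be used instead, as noted above.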


3.4. Archiving policy

The overall framework of Fig. 1 utilizes a Data Subset of a finite size (|DS|). Thus, once the Data Subset is full, Gap records are identified for replacement at each non-overlapping window location, W(i). We will again assume two possible algorithms for this purpose. The base case is referred to as the uniform archiving policy, in which Gap records are identified for removal from DS(i) with uniform probability. A biased archiving policy is defined with the objective of incrementally (re)balancing the representation of records per class in the Data Subset as the stream progresses. Algorithm 2 details this process. Records already in the Data Subset are first grouped by class and ranked by age (Step 1). A count, ck, is made of the number of instances per class present in the Data Subset (Step 2a). The number of records for deletion per class is identified (Step 3a), relative to an ideal distribution of Data Subset capacity, i.e. (|DS| − Gap)/C. Under-represented classes are not targeted for record deletion (Step 3b), hence can accumulate additional records. Step 4a targets the over-represented classes (relative to the ideal distribution) for having records deleted, oldest records first. Gap instances have now been deleted, so the final step adds the content of Gap(i) to the remaining Data Subset content, producing DS(i).

Algorithm 2. Biased Archiving Policy. Let aj be the 'age' of record j in DS(i − 1), where this is a scalar count for how long record j has appeared in the Data Subset. ck is a count of the number of class k instances in DS(i − 1). C is the number of different classes currently represented in DS(i − 1). T is the total number of instances removed from the over-represented classes.

Input: Set of labelled instances Gap(i) and the last available Data Subset, DS(i − 1)
Initial state: T = 0

1. For all j ∈ DS(i − 1) identify class and rank w.r.t. record 'age' aj;
2. For each class k present in DS(i − 1)
   (a) Count the number of records with class k in DS(i − 1). Let ck be the count for class k.
3. For k = 1 to C
   (a) removek = ck − (|DS| − Gap)/C
   (b) IF removek > 0 THEN T = T + removek ELSE removek = 0
4. For k = 1 to C
   (a) Delete the oldest Gap × (removek / T) records of class k from DS(i − 1)
5. DS(i) ← Add(DS(i − 1), Gap(i))
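A corresponding minimal Python sketch of Algorithm 2 is given below. The (x, label, age) record representation and the rounding of per-class deletions are illustrative assumptions rather than the authors' implementation.

```python
from collections import defaultdict

def biased_archiving(ds_prev, gap_records, gap):
    """Sketch of Algorithm 2: replace `gap` records of DS(i-1) with Gap(i).

    ds_prev     -- list of (x, label, age) tuples currently in the Data Subset
    gap_records -- newly labelled (x, label) pairs from the sampling policy
    """
    by_class = defaultdict(list)                       # Step 1: group by class
    for rec in ds_prev:
        by_class[rec[1]].append(rec)

    C = len(by_class)
    ideal = (len(ds_prev) - gap) / C                   # ideal per-class capacity
    remove = {k: max(len(v) - ideal, 0.0)              # Steps 2-3: only over-represented
              for k, v in by_class.items()}            # classes are targeted
    T = sum(remove.values())

    survivors = []
    for k, recs in by_class.items():                   # Step 4: delete oldest first
        n_del = min(int(round(gap * remove[k] / T)), len(recs)) if T else 0
        recs = sorted(recs, key=lambda r: r[2])        # smallest age (youngest) first
        survivors.extend(recs[:len(recs) - n_del])

    # Step 5: DS(i) = remaining records plus Gap(i), entering with age zero.
    # (Rounding may leave DS(i) a record or two off the nominal size.)
    return survivors + [(x, label, 0) for x, label in gap_records]
```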

3.5. Identifying the champion classifier

A champion classifier, gp*, is identified by applying a robust performance metric to evaluate the operation of the population under the current content of the Data Subset, DS(i). This is the only source of data with true label information. The performance metric assumed takes the form of the multi-class Detection rate (DR), or

$DR = \frac{1}{C}\sum_{j=1}^{C} DR_j \quad \text{and} \quad DR_j = \frac{tp_j}{tp_j + fn_j}$    (1)

where C is the count of classes present in DS(i); tpj and fnj are the counts of true positives and false negatives for class j, again with respect to the class distribution present in DS(i). Note that the champion classifier could potentially change as a function of each change to the Data Subset content (i.e., as a function of DS index i), but never more often than this (also reflected in the distinction between incremental and online operation). However, once the first champion classifier is identified, anytime operation is uninterrupted as a champion classifier is thereafter always available.

4. Evaluation methodology

4.1. Datasets

The CTU dataset [38] includes data labelled as one of four general categories: background, normal, Botnet, command and control (C & C). The majority of the data present in the dataset takes the form of background traffic (Table 1), where this represents network traffic collected from a real-world network. In the past it has been the characterization of the normal behaviour that has been the most difficult to accurately express, leading to the use of benchmark datasets that are much easier to solve than in practice; see for example the discussion in [39]. However, it is then actually difficult to define labels for normal and attack behaviour because the so-called background traffic may actually consist of attack traffic. The approach currently adopted by the network security community is therefore to label all traffic as 'background' and apply filters characterizing definitively known examples of normal behaviour [40,41]. Any data from the background traffic matched by the normal filters is labelled as normal; the remainder is labelled as background. Finally, attack data is explicitly created using (Botnet) attack tools from specific IP addresses on a virtual network. This means that any data explicitly labelled as attack is definitely attack data, although some amount of the background traffic data could also be so. Moreover, in the specific case of the CTU dataset, data associated with the operation of the Botnet master is explicitly distinguished from that of data associated with Botnet slaves (labelled as C & C and Botnet respectively). The Botnet master(s) control the operation of the slaves. The slaves actually execute attacks, with the objective of hiding the identity of the master, whereas from the perspective of detection, the identification of the master(s) is the most important.

Table 1
Generic properties of the streaming datasets. N is the cardinality, and k is the number of classes present over the entire duration of the stream. Each dataset has D = 8 flow attributes (Direction, DToS, Duration, Protocol, Source Bytes, SToS, Total packets, Total Bytes). A combined Botnet/C & C label was assumed in the case of datasets in which the C & C class represents less than 0.01% of the original dataset (Capture 3, 4, 10, 11, 12).

Dataset | N | k | ≈ Class Distribution (%)
Capture 1 | 2,824,637 | 4 | [97.47, 1.08, 1.44, 0.01]
Capture 2 | 1,808,123 | 4 | [98.33, 0.5, 1.12, 0.04]
Capture 3 | 4,710,638 | 3 | [96.95, 2.48, 0.57]
Capture 4 | 1,121,077 | 3 | [97.52, 2.25, 0.23]
Capture 5 | 129,833 | 4 | [95.7, 3.6, 0.68, 0.02]
Capture 6 | 558,920 | 4 | [97.83, 1.34, 0.79, 0.04]
Capture 7 | 114,078 | 4 | [98.47, 1.47, 0.03, 0.02]
Capture 8 | 2,954,230 | 4 | [97.33, 2.47, 0.17, 0.04]
Capture 9 | 2,087,508 | 4 | [89.7, 1.44, 8.72, 0.14]
Capture 10 | 1,309,792 | 3 | [90.67, 1.21, 8.12]
Capture 11 | 107,251 | 3 | [89.85, 2.54, 7.61]
Capture 12 | 325,472 | 3 | [96.99, 2.34, 0.657]
Capture 13 | 1,925,150 | 4 | [96.26, 1.66, 2.05, 0.03]

All thirteen datasets will be employed from the CTU-13 network security dataset collection [38]; hereafter referred to as Capture 1 through 13.4 The data is described by 12 'flow' statistics obtained by the Argus network flow generator.5 However, out of these 12 features, we did not employ IP addresses and port numbers as many recent network applications (Voice over IP, social media and network based games) can dynamically change their port addresses based on the blocked/unblocked port combinations. Moreover, IP addresses can be spoofed by attackers for malicious intentions or can be hidden by proxies for legitimate reasons to protect the privacy of users. Thus, any classifier relying on these attributes may not generalize well in real-world applications.

4 https://mcfp.felk.cvut.cz/publicDatasets/CTU-13-Dataset/CTU-13-Dataset.tar.bz2.
5 http://qosient.com/argus/.


Specific properties that make these capture files of particular interest from an application perspective include:

Captures 1, 2, 9: consist of instances of the Neris Botnet, hence traffic content pertaining to Internet Relay Chat (IRC), spam, click fraud and scanning activities is explicitly present.
Captures 5, 13: consist of instances of the Virut Botnet, hence traffic content pertaining to Distributed Denial of Service (DDoS), spam, fraud and data theft attacks is explicitly present.
Capture 6: consists of instances of the Menti Botnet, hence traffic content pertaining to identity theft and login credentials is explicitly present.
Capture 7: consists of instances of the Sogou Botnet, hence traffic content pertaining to spam and popup adware to collect personal information is present.
Capture 8: consists of instances of the Murlo Botnet, hence traffic content pertaining to the use of scanning activities and proprietary mechanisms for establishing C & C.
Captures 3, 4, 10, 11: consist of instances of the Rbot Botnet, hence traffic content pertaining to IRC and Internet Control Message Protocol (ICMP) based DDoS attacks is explicitly present.
Capture 12: consists of instances of the NSIS.ay Botnet, hence traffic content pertaining to identity theft and login credentials by using extra payloads is explicitly present.

Table 1 emphasizes that the data is exceptionally imbalanced; moreover, all but the major class also appear and disappear repeatedly throughout the stream (Appendix B).

4.2. Comparator algorithms

Section 2.1 identified that the work of Žliobaitė et al. addresses operation under label budgets in which changes to the data distribution are expected [8]. Moreover, their benchmarking study included one dataset that involved more than two classes. Conversely, none of the previous works explicitly associated with (multi-class) imbalanced data classification under streaming data were designed to address operation under label budgets (Section 2). The specific cases of the Split and Variable uncertainty policy under Naive Bayes and Hoeffding tree classifiers will be assumed from Žliobaitė et al. (Table 2), where these represented the strongest algorithms from the original study [8]. The Variable uncertainty policy employs a threshold to determine which records from the stream to request labels for. Specifically, under the Variable uncertainty policy, the confidence of the classifier is used to provide the basis for identifying which records from the stream have their true label requested. Under the Split policy two models are concurrently maintained. One model operates under the Variable uncertainty policy, the second requests labels under a uniform probability sampling policy. This means that records are selected for true label requests using both model uncertainty and uniform sampling. The latter may aid detecting changes due to concept drift. The MoA software suite6 provides the implementation for these algorithms, where modifications were made to provide reporting using the performance metric from Section 4.3.

Table 2
Best case configurations of comparator algorithms for operation under drifting streaming data with label budgets [8].

Classifier | Policy
Naive Bayes | Split
Naive Bayes | Variable uncertainty
Hoeffding tree | Split
Hoeffding tree | Variable uncertainty

4.3. Performance metrics

Performance metrics for streaming data classification tasks generally take the form of one of three classes of metric [23]. Prequential error metrics characterize the goal as error minimization in which the error of older instances is subject to discounting/forgetting.7 The principal drawback in assuming such a metric is that when the data is imbalanced, a model that labelled all the data as the most frequently occurring class (the major class) would appear to be the most 'accurate'. Under this application the major class always corresponds to the 'background' class. Thus, labelling all the data as the major class would result in accuracies of between 98% and 89% (Table 1), whereas this represents a completely degenerate solution. Instead we want to quantify the ability to operate under a multi-class setting. Measures of (label) autocorrelation characterize the performance of the classifier using the ability to out-perform a one-bit predictor that operates on the label space alone [42]. That is to say, if the distribution of labels across the stream is not well mixed, then periods will exist over which consecutive records carry the same label. Such sequences can be 'predicted' with low rates of misprediction by a one/two-bit finite state machine.8 Although better than error minimization, such a metric does not explicitly quantify the ability of a classifier to operate under a multi-class setting (i.e., the distribution of labels is still most likely to be dominated by the behaviour of the most frequent classes). Rate based metrics incrementally construct the confusion matrix as a function of progress through the stream [3,28]. This then leads to characterizing performance using any number of scalar (rate based) performance metrics, e.g. F-measure, Detection rate, Precision. Moreover, such metrics explicitly quantify performance under multi-class settings [44].

Given the ease with which rate based metrics may explicitly quantify multi-class performance, we will adopt such a metric for this work. One approach might be to assume an AUC metric (the area under the curve characterizing the interaction between Detection rate and false positive rate). However, such a metric is also limited to two class scenarios and requires complete re-estimation over a sliding window for each time step [45].9 Instead we will utilize Detection rate (DR) as independently computed for each class. Such a metric can be estimated incrementally, and visualized with time on the independent (x) axis and the class-wise performance metric on the dependent (y) axis. Thus, assuming an online estimation of the Detection rate for the y-axis (the champion classifier always has to predict the label before any updates to the model), we can estimate the overall DR as the average across all classes [23]. Specifically, let the streaming estimation of Detection rate take the following form:

$DR_c(t) = \frac{tp_c(t)}{tp_c(t) + fn_c(t)}$    (2)

where t is the record index, and tpc(t), fnc(t) are the respective online counts of true positives and false negatives for class c, i.e. up to this point in the stream.

6 http://moa.cms.waikato.ac.nz.
7 Formulations using recall rate on the single smallest class have also been proposed [14].
8 See for example, the widespread use of one/two-bit finite state machines in branch taken/not taken sequences for conditional statements associated with loop constructs [43].
9 The alternative would be to re-estimate the entire AUC for each time step, limiting its application to short streams [13].
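To make the rate-based metric concrete, the sketch below maintains the per-class counts behind Eq. (2) incrementally and averages them into the multi-class Detection rate and its stream-wide summary introduced next (Eqs. (3) and (4)). It is an illustrative reading of the definitions (assuming classes are indexed 0 … C*−1), not the evaluation code used in the experiments.

```python
from collections import defaultdict

class StreamingDetectionRate:
    """Incremental class-wise Detection rate (Eq. (2)), its multi-class
    average (Eq. (3)), and the stream-wide AUC summary (Eq. (4))."""

    def __init__(self, num_classes):
        self.C_star = num_classes            # true class count C* (classes 0..C*-1 assumed)
        self.tp = defaultdict(int)
        self.fn = defaultdict(int)
        self.auc_sum = 0.0
        self.t = 0

    def update(self, predicted, true_label):
        # Counts are updated after the champion has made its prediction.
        if predicted == true_label:
            self.tp[true_label] += 1
        else:
            self.fn[true_label] += 1
        self.auc_sum += self.multi_class_dr()
        self.t += 1

    def class_dr(self, c):                   # Eq. (2)
        seen = self.tp[c] + self.fn[c]
        return self.tp[c] / seen if seen else 0.0

    def multi_class_dr(self):                # Eq. (3)
        return sum(self.class_dr(c) for c in range(self.C_star)) / self.C_star

    def auc(self):                           # Eq. (4): mean of DR(t) over the stream
        return self.auc_sum / self.t if self.t else 0.0
```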



The multi-class Detection rate now has the form:

$DR(t) = \frac{1}{C^*} \sum_{c=[1,\ldots,C^*]} DR_c(t)$    (3)

where (for continuity) we assume that C* reflects the true count of the total number of classes encountered over the course of the stream.10 Hence, the multi-class Detection rate is a function of the ability to detect each class. Finally, we note that although streaming data sources are continuous, for benchmarking purposes, finite length sequences are assumed. Thus, the sum of the multi-class Detection rate metric as estimated across the duration of the stream can be quantified by a single scalar 'area under the curve' metric:

$AUC = \frac{1}{s_{max}} \sum_{t=0}^{s_{max}-1} DR(t)$    (4)

where smax is the cardinality of the stream.

10 If the true number of classes is unknown, a stepwise effect appears in the metric each time a previously unseen class is encountered.

4.4. Experimental design and parameterization

Sections 3.3 and 3.4 introduced biased approaches for defining sampling and archiving policies respectively. We will compare these with our previous Pareto policy for archiving [28], that is to say, five configurations of the Streaming GP algorithm in total (Table 3). All five will be benchmarked in order to identify their relative contributions to the overall performance. Parameterization of GP is unchanged with respect to previous work (e.g. [29,30]) and summarized in Table 4.

The label budget β defines what proportion of the non-overlapping window, W(i), is queried for true class label information (Section 3.2). Naturally, this represents a significant 'cost' in that a user/expert is then required to provide labels for each query. On the other hand, the lower the label budget, the higher the likelihood that changes in the process generating the stream data will be missed and/or the minor classes will be missed entirely. For example, if a label budget of 5% is assumed and a minor class appears with a frequency of 1%, then the 'raw' chance of a uniform sampling scheme actually requesting a label for such an instance is 0.05%. Conversely, if the algorithm for sampling or archiving records is more 'intelligent', the performance of the overall framework for classifying streaming data will not be dominated by the label budget alone. Thus, for this study we benchmark performance for each dataset using three different label budgets, or β = {0.005, 0.01, 0.05}, in order to gain some insight as to how quickly performance decreases as label rates decrease. Given that the raw class distributions for minor classes are so low (Table 1), this represents a challenging streaming classification task. Table 5 summarizes the impact of each label budget on the size of the non-overlapping window, W(i). Note that the champion from window W(i − 1) provides predictions for the entire content of window W(i) before it can be updated (i.e., incremental operation). Thus, lower label budgets are also synonymous with delays to true label information.

Table 3
Streaming GP configurations. Uniform implies identification of either sampling or archiving data using uniform sampling (Sections 3.3 and 3.4 respectively). Likewise biased denotes either the sampling or archiving data under the corresponding biased algorithms (Algorithms 1 and 2 respectively). Pareto was our earlier preferred configuration for Archiving in which Pareto archiving prioritized records for removal from the Data Subset [28].

Model | Sampling Policy | Archiving Policy
Uniform (Rnd) | Uniform | Uniform
Pareto | Uniform | Pareto
Archive | Uniform | Biased
Sample | Biased | Uniform
Both | Biased | Biased

Table 4
GP Parameters. Mutation rates control the rate of adding/deleting programs or changing an action. Gap (Tgap) denotes the number of records from the Data Subset (teams) deleted at each location of the non-overlapping window. For each Data Subset update, τ generations are performed.

Parameter | Value
Data Subset size (DS) | 120
DS gap size (Gap) | 20
GP gap size (Tgap) | 20
Team pop. size (Psize) | 120
Max. programs per team (ω) | 20
Prob. Program deletion (pd) | 0.3
Prob. Program addition (pa) | 0.3
Prob. Action mutation (μ) | 0.1
Generations per DS update (τ) | 5

Table 5
Stream dataset parameters. Label Budget (β) is defined as a function of the window size W(i), where for each non-overlapping window location there can only be Gap size (20) samples.

Label Budget (β) | W(i) cardinality
0.5% | 4000
1.0% | 2000
5.0% | 400

5. Results

In Section 5.1, the overall performance of each Streaming GP configuration and the MoA comparator algorithms is quantified over the thirteen Botnet datasets (Table 1). A subset of datasets identified by the initial analysis will then be used to provide more insight into what the distinguishing factors are in the operation of the various algorithms. Specifically, Section 5.2 uses a visualization of Detection rate as estimated through the stream to provide insight into the dynamic behaviour of the algorithms. Section 5.3 characterizes how the different archive–sampling policies of Stream GP affect the distribution of records retained in the Data Subset. Section 5.4 repeats the review from Section 5.2, but this time solely using the ability of the stream classifier to detect the least frequent class. In doing so, we draw attention to the potential to detect Botnet command and control signals. Success in this task amounts to providing an early warning of Botnet activity. Finally, Section 5.5 quantifies the computational time to perform fitness evaluation and execute the champion individual.

5.1. Overall performance evaluation

There are a total of four Streaming GP configurations (Table 3) and four MoA comparator algorithms (Section 4.2), where in the case of MoA, these represent the strongest algorithm/stream sampling policies identified by Žliobaitė et al. for operation under label budgets [8]. All eight algorithms are run 20 times per dataset and the multi-class streaming AUC metric (Eq. (4) of Section 4.3) used to summarize the overall performance. Performance is then ranked using the median AUC. The Friedman non-parametric repeated measures statistic (and Nemenyi post hoc test) are then used to identify the general trends of the best algorithms/most challenging datasets. Such an approach does not make assumptions regarding the underlying distributions of the performance data, and represents the preferred approach for conducting comparisons between multiple algorithms/datasets [46,44]. The evaluation is repeated using each of the three label budgets (Table 5).



Table 6
Algorithm ranks w.r.t. streaming AUC metric under a 5% label budget. Bracketed entries represent median AUC values to 1 decimal place. Naive Bayes (NB) and Hoeffding tree classifiers (from MoA) appear with either 'split' or 'variable' sampling policies. Table 3 declares the 4 sampling/replacement policies for stream SBB. Rj denotes the average rank across all datasets.

Data Set | SBB Rnd | SBB Pareto | SBB Sample | SBB Archive | SBB Both | Hoeffding split | Hoeffding variable | NB split | NB variable
Capture 1 | 6 (32.4) | 5 (33.3) | 4 (36.3) | 1 (56.8) | 2 (51.5) | 8.5 (25) | 7 (26.7) | 8.5 (25) | 3 (43.5)
Capture 2 | 6 (36.8) | 4 (43.9) | 5 (41.5) | 1 (68.1) | 2 (67.7) | 8.5 (25) | 7 (36.5) | 8.5 (25) | 3 (54.8)
Capture 3 | 5 (65.9) | 4 (72.9) | 3 (76.0) | 2 (81.5) | 1 (83.5) | 8.5 (33.3) | 7 (55.5) | 8.5 (33.3) | 6 (59.5)
Capture 4 | 5 (45.5) | 7 (41.4) | 4 (51.4) | 1 (62.7) | 2 (60.8) | 8.5 (33.3) | 6 (42.2) | 8.5 (33.3) | 3 (55.2)
Capture 5 | 4 (29.3) | 6 (27.9) | 5 (28.8) | 1 (36.1) | 2 (34.5) | 8.5 (25) | 7 (26.3) | 8.5 (25) | 3 (30.7)
Capture 6 | 6 (35.5) | 5 (41.4) | 4 (42.5) | 1 (65.8) | 2 (64.6) | 8.5 (25) | 7 (25.5) | 8.5 (25) | 3 (48.6)
Capture 7 | 4 (29.8) | 6 (28.0) | 5 (28.4) | 2.5 (29.8) | 2.5 (29.8) | 8.5 (25) | 7 (25.9) | 8.5 (25) | 1 (32.5)
Capture 8 | 5 (39.9) | 6 (32.8) | 4 (46.2) | 1 (78.0) | 2 (76.2) | 8.5 (25) | 7 (28.1) | 8.5 (25) | 3 (57.9)
Capture 9 | 5 (34.8) | 6 (32.8) | 4 (35.8) | 1 (54.1) | 2 (48.5) | 8.5 (25) | 7 (26.5) | 8.5 (25) | 3 (45.4)
Capture 10 | 6 (57.6) | 3 (62.4) | 4 (61.6) | 1 (70.3) | 2 (68.7) | 8.5 (25) | 7 (54.6) | 8.5 (25) | 5 (58.7)
Capture 11 | 4 (48.7) | 3 (52.2) | 5 (47.9) | 1 (54.8) | 2 (52.4) | 8.5 (25) | 7 (42.2) | 8.5 (25) | 6 (47.6)
Capture 12 | 4 (41.6) | 6 (37.9) | 5 (40.2) | 1 (52.5) | 3 (47.2) | 8.5 (25) | 7 (36.0) | 8.5 (25) | 2 (48.3)
Capture 13 | 5 (41.8) | 6 (38.8) | 3 (53.9) | 1 (70.2) | 2 (67.4) | 8.5 (25) | 7 (27.3) | 8.5 (25) | 4 (42.9)
Rj | 5.0 | 5.15 | 4.23 | 1.19 | 2.04 | 8.5 | 6.92 | 8.5 | 3.46

Table 7
Algorithm ranks w.r.t. streaming AUC metric under a 1% label budget. Bracketed entries represent median AUC values to 1 decimal place. Naive Bayes (NB) and Hoeffding tree classifiers (from MoA) appear with either 'split' or 'variable' sampling policies. Table 3 declares the 4 sampling/replacement policies for stream SBB. Rj denotes the average rank across all datasets.

Data Set | SBB Rnd | SBB Pareto | SBB Sample | SBB Archive | SBB Both | Hoeffding split | Hoeffding variable | NB split | NB variable
Capture 1 | 5 (32.9) | 6 (31.0) | 4 (33.3) | 1 (48.3) | 2 (45.3) | 8.5 (25) | 7 (25.3) | 8.5 (25) | 3 (37.5)
Capture 2 | 6 (37.2) | 4 (44.1) | 5 (39.2) | 1 (56.6) | 2 (54.6) | 8.5 (25) | 7 (35.3) | 8.5 (25) | 3 (48.0)
Capture 3 | 5 (64.2) | 3 (71.8) | 4 (70.8) | 1 (78.4) | 2 (75.8) | 8.5 (33.3) | 7 (47.4) | 8.5 (33.3) | 6 (61.8)
Capture 4 | 5 (43.5) | 6 (38.6) | 4 (44.2) | 1 (55.8) | 2 (52.3) | 8.5 (33.3) | 7 (33.4) | 8.5 (33.3) | 3 (51.0)
Capture 5 | 4 (27.1) | 6 (26.5) | 5 (26.7) | 1 (29.2) | 3 (27.5) | 8.5 (25) | 7 (24.9) | 8.5 (25) | 2 (29.1)
Capture 6 | 6 (33.3) | 3 (42.9) | 4 (37.9) | 1 (50.4) | 2 (43.8) | 8.5 (25) | 7 (25.0) | 8.5 (25) | 5 (36.5)
Capture 7 | 3 (25.5) | 9 (24.3) | 8 (24.7) | 4 (25.4) | 5 (25.3) | 6.5 (25) | 1 (25.9) | 6.5 (25) | 2 (25.56)
Capture 8 | 4 (34.4) | 5 (33.4) | 6 (33.1) | 2 (60.9) | 1 (64.3) | 8.5 (25) | 7 (26.0) | 8.5 (25) | 3 (52.3)
Capture 9 | 4 (33.9) | 6 (30.7) | 5 (32.9) | 1 (46.0) | 2 (42.9) | 8.5 (25) | 7 (25.1) | 8.5 (25) | 3 (42.7)
Capture 10 | 5 (57.1) | 3 (61.0) | 4 (57.9) | 1 (64.5) | 2 (63.1) | 8.5 (33.3) | 7 (54.0) | 8.5 (33.3) | 6 (56.3)
Capture 11 | 2 (45.7) | 3 (45.9) | 6 (41.9) | 1 (46.6) | 5 (43.1) | 8.5 (33.3) | 7 (41.3) | 8.5 (33.3) | 4 (43.8)
Capture 12 | 5 (39.6) | 4 (39.9) | 6 (36.4) | 1 (43.6) | 3 (40.4) | 8.5 (33.3) | 7 (34.1) | 8.5 (33.3) | 2 (43.1)
Capture 13 | 4 (39.5) | 5 (37.2) | 3 (43.8) | 1 (57.1) | 2 (55.1) | 8.5 (25) | 7 (25.1) | 8.5 (25) | 6 (33.2)
Rj | 4.46 | 4.85 | 4.92 | 1.23 | 2.23 | 8.35 | 6.54 | 8.35 | 3.69

Table 8
Algorithm ranks w.r.t. streaming AUC metric under a 0.5% label budget. Naive Bayes (NB) and Hoeffding tree classifiers (from MoA) appear with either 'split' or 'variable' sampling policies. Table 3 declares the 4 sampling/replacement policies for stream SBB. Rj denotes the average rank across all datasets.

Data Set | SBB Rnd | SBB Pareto | SBB Sample | SBB Archive | SBB Both | Hoeffding split | Hoeffding variable | NB split | NB variable
Capture 1 | 5.5 (32.4) | 5.5 (32.4) | 4 (33.0) | 1 (45.2) | 2 (41.0) | 8.5 (25) | 7 (25.5) | 8.5 (25) | 3 (36.9)
Capture 2 | 5 (38.0) | 4 (38.1) | 6 (37.4) | 1 (48.8) | 2 (46.0) | 8.5 (25) | 7 (34.5) | 8.5 (25) | 3 (40.5)
Capture 3 | 5 (62.3) | 4 (62.5) | 3 (67.9) | 1 (76.2) | 2 (73.3) | 8.5 (33.3) | 7 (48.1) | 8.5 (33.3) | 6 (59.2)
Capture 4 | 6 (45.9) | 5 (46.2) | 4 (46.7) | 1 (59.6) | 2 (58.7) | 8.5 (33.3) | 7 (33.5) | 8.5 (33.3) | 3 (49.5)
Capture 5 | 2 (27.2) | 6 (26.5) | 9 (25.0) | 3 (27.1) | 5 (26.2) | 7.5 (25) | 4 (25.3) | 7.5 (25) | 1 (28.4)
Capture 6 | 5 (33.4) | 6 (32.1) | 4 (34.7) | 1 (44.5) | 2 (41.6) | 8.5 (25) | 7 (26.0) | 8.5 (25) | 3 (34.7)
Capture 7 | 3 (26.0) | 9 (24.7) | 8 (24.8) | 4 (25.2) | 7 (25.0) | 5.5 (25) | 2 (26.5) | 5.5 (25) | 1 (27.0)
Capture 8 | 4 (32.4) | 5 (32.2) | 6 (30.3) | 1 (55.7) | 2 (54.1) | 8.5 (25) | 7 (26.0) | 8.5 (25) | 3 (43.9)
Capture 9 | 5 (33.4) | 4 (33.5) | 6 (32.2) | 1 (42.5) | 2 (40.6) | 8.5 (25) | 7 (25.0) | 8.5 (25) | 3 (39.6)
Capture 10 | 3 (56.0) | 5 (55.9) | 4 (56.0) | 1 (61.8) | 2 (59.7) | 8.5 (33.3) | 7 (53.7) | 8.5 (33.3) | 6 (55.9)
Capture 11 | 5 (42.7) | 4 (43.1) | 7 (37.7) | 3 (44.9) | 6 (37.9) | 8.5 (33.3) | 2 (45.4) | 8.5 (33.3) | 1 (45.5)
Capture 12 | 3.5 (37.2) | 3.5 (37.2) | 6 (37.9) | 2 (38.5) | 5 (37.1) | 8.5 (33.3) | 7 (34.2) | 8.5 (33.3) | 1 (40.2)
Capture 13 | 5 (38.3) | 4 (38.4) | 3 (41.5) | 1 (52.7) | 2 (49.5) | 8.5 (25) | 7 (25.0) | 8.5 (25) | 6 (34.8)
Rj | 4.38 | 5.0 | 5.38 | 1.62 | 3.15 | 8.19 | 6.0 | 8.19 | 3.08


Tables 6–8 summarize results using the ranking of each of the 8 algorithms under the AUC metric. The last row reports the average rank (Rj), where this forms the basis for the Friedman test, as follows:

$\chi_F^2 = \frac{12N}{k(k+1)} \left[ \sum_j R_j^2 - \frac{k(k+1)^2}{4} \right]$    (5)

where N = 13 is the number of datasets and k = 9 is the number of algorithms. The null hypothesis is tested by mapping χF² into the F-distribution with k − 1 and (k − 1)(N − 1) degrees of freedom using [46]:

$F_F = \frac{(N-1)\chi_F^2}{N(k-1) - \chi_F^2}$    (6)

In each case, the null-hypothesis (i.e. that the ranks are random) is comfortably rejected (Table 9).

Table 9
Result of Friedman test χF² and corresponding value for F-distribution FF. The critical value of F(7, 84) for α = 0.01 is 3.953, so the null-hypothesis is rejected in each case.

Label budget | 5% | 1% | 0.5%
χF² | 94.4 | 80.6 | 70.1
FF | 117.7 | 41.3 | 24.9

The Nemenyi post hoc test may now be applied for establishing what groups of algorithms have equivalent performance [46,44]. Specifically, if the average algorithm ranks are within the critical difference of $CD = q_\alpha \sqrt{\frac{k(k+1)}{6N}}$, then they are deemed equivalent.
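As a worked illustration of the procedure above (Eqs. (5) and (6) and the Nemenyi critical difference), the following Python sketch reproduces the computation from a matrix of per-dataset ranks such as those reported in Tables 6–8. The value of qα must be taken from a studentized range table and is therefore left as an input; the sketch is illustrative, not the analysis script used in the paper.

```python
import math

def friedman_statistics(ranks):
    """ranks: list of N per-dataset lists, each holding the rank assigned to
    the k algorithms on that dataset (as in the rows of Tables 6-8)."""
    N, k = len(ranks), len(ranks[0])
    R = [sum(row[j] for row in ranks) / N for j in range(k)]      # average ranks R_j
    chi2_f = (12 * N) / (k * (k + 1)) * (sum(r * r for r in R)
                                         - k * (k + 1) ** 2 / 4)  # Eq. (5)
    f_f = (N - 1) * chi2_f / (N * (k - 1) - chi2_f)               # Eq. (6)
    return R, chi2_f, f_f

def nemenyi_cd(q_alpha, k, N):
    # Algorithms whose average ranks differ by less than CD are equivalent.
    return q_alpha * math.sqrt(k * (k + 1) / (6 * N))
```

Plugging the average ranks of Table 6 into Eq. (5) with N = 13 and k = 9 gives χF² ≈ 94, in line with the 5% column of Table 9.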

qα = 0.1 for which the critical difference is 2.855. This identifies a set of top performing algorithms common to all three label budgets as: SBB–Archive, SBB–Both and NB–variable policies. Moreover, SBB–Archive was by far the most consistent performing model irrespective of dataset or label budget with SBB–Both generally appearing as the runner up. From the perspective of Botnet detection in general, we note that Captures 1, 2, 8, 9 are dominated by port scanning activities. Typically, access to port number and IP address information is assumed for detecting port scanning (so limiting the generality of the detector). Conversely, we are able to detect these activities without using source / destination port numbers and IP addresses at an overall AUC of between 54% and 78% (dropping to no less than 42% under the 0.5% label budget). Captures 3, 10 and 11 represent protocol based attacks so it would be expected that these behavioural patterns can be identified more easily with flow features. Indeed, with Captures 3 and 10, this appears to be the case (high overall AUC maintained irrespective of the label budget), whereas Capture 11 always returned overall AUC in the region of 54% to 44%. Finally, the content of Captures 5, 6 and 7 represent payload attacks (such as windows vulnerabilities), thus detecting these types of attack with flow features alone is particularly challenging. In two cases (Captures 5 and 7)

Tables 6–8 summarize results using the ranking of each of the 8 algorithms under the AUC metric. The last row reports the average rank (Rj ), where this forms the basis for the Friedman test, as follows:

χF2 =

⎡ ⎤ 12N ⎢ k (k + 1)2 ⎥ (Rj2 ) − ∑ ⎥⎦ k (k + 1) ⎢⎣ j 4

(5)

where N = 13 is the number of datasets and k = 9 is the number of algorithms. The null hypothesis is tested by mapping χF2 into the Fdistribution with k − 1 and (k − 1)(N − 1) degrees of freedom using [46]:

FF =

(N − 1) χF2 N (k − 1) − χF2

(6)

In each case, the null-hypothesis (i.e. that the ranks are random) is comfortably rejected (Table 9). The Nemenyi post hoc test may now be applied for establishing what groups of algorithms

Fig. 2. Capture 5 class-wise Detection rate through the stream. 5% versus 0.5% label budget. Class 3 represents Botnet and Class 4 Botnet C & C.

9

Swarm and Evolutionary Computation xxx (xxxx) xxx–xxx

S. Khanchi et al.

Fig. 3. Capture 6 class-wise Detection rate through the stream. 5% versus 0.5% label budget. Class 3 represents Botnet and Class 4 Botnet C & C.

5.2. Detection Rate Dynamics: comparing the best streaming classifiers

Performance evaluation will now consider the dynamic properties of operation by reviewing how the Detection rate of each class varies during the course of the stream. This is important because the properties of the stream themselves vary over time; thus, unlike an offline formulation of learning, the models develop and interact with the stream content as it unfolds. In doing so, we also provide some insight into whether results with similar overall AUC also exhibit similar preferences in class detection. Space precludes plotting the Detection rate for every algorithm under every dataset. With this in mind, we concentrate on: (i) Captures 5 and 6 (payload attacks) and (ii) Captures 8 and 9 (port scanning), with the top two performing configurations of SBB and NB: SBB–Archive and NB–Variable.

Fig. 2 summarizes the Detection rate for each class over the course of the stream (Eq. (2)), as averaged over the 20 runs per algorithm, in the specific case of the Capture 5 dataset. Subplots 2(a) and 2(b) illustrate how class-wise Detection rate changes under the 5% label budget. It is apparent that the NB–Variable model does not detect class 4 (Botnet C & C) at all. Moreover, gains in detecting the minor classes are generally at the expense of reduced detection in the major class.12 Conversely, SBB–Archive was ultimately able to detect the minor classes better (Botnet and Botnet C & C) without losses in the detection of the major class. Performance under the 0.5% label budget is summarized by Subplots 2(c) and 2(d). The NB–Variable framework continues to place most emphasis on detecting the major class at the expense of the minor classes; conversely, SBB–Archive detects the minor classes earlier and even manages to continue to detect class 4, albeit at a much lower rate than under the 5% label budget. Thus, although the NB–Variable model has a higher ranking for the 0.5% label budget under Capture 5 (Table 8), this appears to be solely due to detection of the major class. Similar observations hold for the Capture 7 and 11 datasets.

Fig. 3 illustrates the Detection rate for each class over the course of the stream in the specific case of the Capture 6 dataset.

11 Equivalent to labelling all the data as a single class, or AUC = DR(t)/C* = 0.25, where DRc(t) = 1 occurs for one class alone and DRi≠c(t) = 0.

12 Given the degree of class imbalance in evidence (Table 1), the major class is always class 1. The minor classes are always the remaining classes.



Fig. 4. Capture 8 class-wise Detection rate through the stream. 5% versus 0.5% label budget. Class 3 represents Botnet and Class 4 Botnet C & C.

As with Capture 5, SBB–Archive was able to detect each of the 4 classes to varying degrees throughout the stream irrespective of label budget, whereas the NB–Variable framework only did so under the 5% budget. Moreover, it also generally appears to be the case that the NB–Variable framework begins by labelling every record as the major class. Over the duration of the stream, the detection of the major class decays as detection of the remaining classes improves. This appears to be a general property of the NB–Variable framework irrespective of the dataset. This is not the case for SBB–Archive: detecting a new class does not directly imply an inability to maintain the detection of classes already covered. We attribute this property in part to the combination of: 1) a population-based algorithm; 2) the use of a robust measure to identify a champion classifier (Section 3.5); and 3) the ability to explicitly 'balance' the distribution of records retained in the DS (see Section 5.3).

Similar observations carry over to the case of Captures 8 and 9 (Figs. 4 and 5 respectively). Comparison with the corresponding summary plots for the actual distribution of the minor classes during the stream (Fig. B.10, Appendix B) indicates that SBB in particular is also able to pick up the detection of minor classes near to their respective first occurrence. For example, in the case of Capture 8, Class 4 appears as a high-frequency burst early on in the stream and then appears on an intermittent basis thereafter. Fig. 4 demonstrates that SBB–Archive is able to react to this immediately, even under the 0.5% label budget. There is also a particularly interesting behaviour associated with the distribution of Class 3 in the Capture 8 stream. Initially Class 3 appears in very short bursts at a high frequency, and then decreases in frequency, but with a slightly longer duration. This results in a stepwise improvement in the detection of Class 3 throughout the stream, as both SBB and NB are able to incrementally get better at sampling/detection (Fig. 4). In the context of the Botnet detection task, we note that although the overall AUC might be in the order of 54–78% in the case of Captures 6, 8 and 9 (Section 5.1), this translates to detecting the master-to-slave communication (Class 4) at a Detection rate of 70–95% by the 'end' of the stream (SBB–Archive). Moreover, not only are the malicious behaviours in these streams very infrequent (Table 1) and transitory (Appendix B), but they also represent application and operating system related attacks. Such behaviours are usually represented in the payload of the traffic, and are therefore not directly present in the flow features used in this work (which do not include the payload). Instead, detection is taking place because behavioural 'fingerprints' have been discovered in the flow data that are sufficient for identifying the Botnet traffic.
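The class-wise Detection rate curves of Figs. 2–5, and the 'reacts near first occurrence' observation above, amount to the per-window computation sketched below. This is an illustration only, not the authors' implementation; the window size and the list-of-labels interface are assumptions.

    def classwise_detection_rate(true_labels, predictions, window_size):
        """Per-class Detection rate DR_c in each non-overlapping window of the stream.
        Returns one dict per window: class id -> fraction of that class correctly labelled."""
        per_window = []
        for start in range(0, len(true_labels), window_size):
            w_true = true_labels[start:start + window_size]
            w_pred = predictions[start:start + window_size]
            rates = {}
            for c in set(w_true):
                idx = [i for i, t in enumerate(w_true) if t == c]
                rates[c] = sum(w_pred[i] == c for i in idx) / len(idx)
            per_window.append(rates)
        return per_window

    def detection_lag(per_window_rates, class_id):
        """Windows elapsed between a class first appearing and first being detected (DR_c > 0)."""
        first_seen = next((i for i, r in enumerate(per_window_rates) if class_id in r), None)
        if first_seen is None:
            return None
        first_hit = next((i for i in range(first_seen, len(per_window_rates))
                          if per_window_rates[i].get(class_id, 0.0) > 0.0), None)
        return None if first_hit is None else first_hit - first_seen

Under this view, the remark about Capture 8 corresponds to detection_lag returning a value close to zero for Class 4 under SBB–Archive.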

5.3. Detection Rate Dynamics: SBB sampling and archiving policies

The goal of this section is to provide some insight into the relative contribution of the four variants of the SBB sampling/archiving policies (Table 3). Section 5.1 has already identified the overall ranking of the four variants as: Archive > Both > Rnd > Sample, where 'Rnd' represents the control/baseline parameterization. We are now interested in characterizing the properties that lead to this outcome.


Fig. 5. Capture 9 class-wise Detection rate through the stream. 5% versus 0.5% label budget. Class 3 represents Botnet and Class 4 Botnet C & C.

For this purpose the Capture 1 dataset will be assumed (observations are common to the other datasets, but space precludes their duplication). In particular, Capture 1 introduces different classes at different points during the stream (Fig. B.10), thus requiring the Archiving policy to create 'new' categories within the finite Data Subset archive (Fig. 1), while the Sampling policy needs to detect the change in the first place. From the perspective of the Botnet detection task, Capture 1 contains a lot of port scanning activities, which are not straightforward to detect without port number information (as is the case here).

Fig. 6 summarizes class-wise Detection rate over the Capture 1 dataset with a 5.0% label budget. The control parameterization of 'Rnd' samples the stream window, W(i), with uniform probability and replaces records within the Data Subset (DS) with uniform probability (Table 3). When comparing Subplot 6(a) to the underlying distribution of the minor classes (Fig. B.10), we make the following observations:

1. The underlying distribution of each class: Class 1 – the major class – is detected most strongly, and Class 4 (Botnet C & C) – the least frequent – very rarely.
2. When classes appear for the first time in the stream: Class 2 (normal) appears later in the stream than Class 3 (Botnet).

The impact of introducing a biased sampling policy (but retaining random replacement of records within the DS) is summarized in Subplot 6(b). Class 1 is even more strongly detected, and Class 3 (Botnet), despite its late introduction, is now detected much more effectively. Unfortunately, this also came at the expense of Class 2 (Normal) detection, which has seen a reduction with respect to the 'Rnd' policy. Switching to the biased archiving policy (with a random sampling policy, Table 3) improves the detection of all minor classes, Subplot 6(c). Some decay in the rates of detection appears with respect to Class 1, which may be due to having to share the available DS space with the minor classes when they appear later in the stream. However, all the minor classes are now detected much more effectively than before. Thus, from the perspective of Botnet detection, by the 'end' of the stream the Botnet and C & C classes are identified with Detection rates of ≈ 70% and ≈ 55% respectively. Finally, Subplot 6(d) represents the profile for when both the biased archiving and sampling are introduced. Detection of Classes 2 and 3 actually declined with respect to the best-case policy combination of 'Archive'. We will investigate the sources of this further by considering how the distribution of records retained in the Data Subset changes as the stream passes.

Fig. 7 summarizes the proportion with which records from each class are expressed in the Data Subset during the stream. As it is the content of the DS against which fitness evaluation is performed and a champion classifier identified, the distribution of records within the DS has a significant impact on the overall performance of GP. All policies begin with DS content dominated by Class 1 representation (the major class).



Fig. 6. Class-wise Detection rate for SBB sampling and archiving policies on the Capture 1 dataset at a 5.0% label budget.

However, both the entirely random policy and the biased Sample policy fail to ensure that the minor classes appear in sufficient quantity (Subplots 7(a) and 7(b)). Conversely, both the SBB–Archive and SBB–Both policies achieve this (Subplots 7(c) and 7(d)). It is notable that the SBB–Both parameterization is by far the most consistent. Conversely, SBB–Archive is more gradual in instigating changes to the distribution of class representation. Given that the SBB–Archive approach provided the basis for a better stream classifier, it appears that this more gradual updating of record class distribution is the key to the performance differences.

5.4. Capacity for detecting Botnet C & C signals

In this section we take a closer look at performance under the minority class, where this corresponds to the least frequently occurring class in the stream. Given that the application data pertains to Botnet detection, it is the minor class that represents the first indication of a Botnet, i.e. a command and control (C & C) signal. Hence, we are interested in learning whether the overall ranking of streaming classification algorithms (Section 5.1) changes when we estimate stream AUC specific to the minor class alone. Table 10 summarizes the results of the rank-based analysis of each algorithm under the minor-class streaming AUC for each label budget. The SBB–Archive and SBB–Both formulations still represent the highest ranked models. Given the observations from Section 5.3, this is not surprising, as both these parameter combinations were much more effective at balancing the content of the Data Subset. The last two columns of Table 10 provide the corresponding Friedman test statistic (χF²) and value for the F-distribution (FF), indicating that the null hypothesis is rejected. Given that the degrees of freedom are unchanged from the earlier analysis, the critical difference is also unchanged (2.855). Specifically, the Nemenyi post hoc test applied with respect to the highest ranked model implies that SBB–Archive, SBB–Both and NB–Variable are statistically distinct from the remaining 6 models at a confidence level of α = 0.1 for a label budget of 5%. As the label budget decreases to 1% and 0.5%, these three models continue to be consistently identified. Moreover, the low ranking of Hoeffding and NB–Split is a general reflection of an inability to detect the minor class in a significant number of the Capture datasets.
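Sections 5.3 and 5.4 attribute the advantage of SBB–Archive and SBB–Both to keeping the Data Subset class-balanced. As a concrete illustration of the idea only (this is not the published SBB–Archive definition; the replacement rule, tie-breaking and parameter names are assumptions), a balancing replacement step can be written as follows.

    import random

    def archive_insert(data_subset, new_record, ds_capacity):
        """Insert a labelled record into the Data Subset (DS), biasing removal
        towards the class that is currently most over-represented.

        data_subset : list of (features, label) pairs.
        new_record  : (features, label) pair sampled from the stream for labelling.
        """
        if len(data_subset) < ds_capacity:
            data_subset.append(new_record)
            return
        # Count class representation in the current DS content.
        counts = {}
        for _, label in data_subset:
            counts[label] = counts.get(label, 0) + 1
        major = max(counts, key=counts.get)          # most represented class
        victims = [i for i, (_, lbl) in enumerate(data_subset) if lbl == major]
        data_subset[random.choice(victims)] = new_record

Under a uniform ('Rnd') policy the victim index would instead be drawn from the whole DS, so minor-class records are displaced in proportion to their (already low) representation.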

5.5. Real-time operation

GP is often considered to present a considerable computational overhead that would preclude its operation under tasks requiring real-time operation. Under this application domain, packet data are first subject to pre-processing into traffic flows using the Argus application [38]. Each flow characterizes the statistics of a collection of packets that share the same source/destination IP address, protocol and source/destination port. The number of packets per flow is a function of the service/application (tens to hundreds of packets per flow). Cisco defines an upper bound of 600 ms as the time interval for the completion of any flow; however, this represents a worst-case figure, as the inter-arrival time of packets is a function of network topology and load.
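For context, this flow construction step amounts to grouping packets on their 5-tuple and bounding the duration of a flow. A rough sketch follows; it is not the Argus implementation, the 600 ms bound is the only figure taken from the text, and the packet-record field names are assumptions.

    def packets_to_flows(packets, max_duration=0.6):
        """Group packets into flows keyed on the 5-tuple, closing a flow once it
        exceeds `max_duration` seconds (0.6 s, the upper bound quoted above).

        packets : iterable of dicts with keys
                  'ts', 'src_ip', 'dst_ip', 'proto', 'src_port', 'dst_port', 'bytes'.
        Yields per-flow statistic dicts (packet and byte counts, duration).
        """
        active = {}
        for p in sorted(packets, key=lambda p: p['ts']):
            key = (p['src_ip'], p['dst_ip'], p['proto'], p['src_port'], p['dst_port'])
            flow = active.get(key)
            if flow is None or p['ts'] - flow['start'] > max_duration:
                if flow is not None:
                    yield flow                      # close the expired flow
                flow = {'key': key, 'start': p['ts'], 'packets': 0, 'bytes': 0}
                active[key] = flow
            flow['packets'] += 1
            flow['bytes'] += p['bytes']
            flow['duration'] = p['ts'] - flow['start']
        yield from active.values()                  # flush flows still open at the end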


Fig. 7. Typical Distribution of classes present in the Data Subset for SBB sampling and archiving policies on the Capture 1 dataset at a 5.0% label budget.

Table 10
Ranks for minor-class streaming AUC alone. In each case the observed FF comfortably exceeds the critical value of F(8, 96) at α = 0.01, so the null hypothesis (of a random ranking) is rejected.

               SBB                                          Hoeffding            NB                   Statistic
Label budget   Rnd    Pareto  Sample  Archive  Both         Split   Variable     Split   Variable     χF²    FF
5%             5.15   5.27    4.62    1.46     2.23         7.84    7.19         7.84    3.38         76.4   33.1
1%             4.31   5.53    4.81    2.08     2.46         7.46    7.0          7.46    3.88         57.5   14.8
0.5%           4.26   5.04    4.5     2.04     2.35         7.38    6.96         7.38    3.92         43.8   8.7

The capacity of GP to support real-time operation will be characterized from two perspectives: 1) the time for the champion GP individual to make a class label prediction for each flow record (anytime operation); and 2) the time to complete fitness evaluation after updating the content of the Data Subset. Under the parameterization assumed in this work (Table 4), fitness evaluation takes the form of evaluating Psize = 120 teams on DS = 120 flow records for τ = 5 generations and identifying the champion individual. However, given that only twenty training records are introduced at each window location and only twenty teams are replaced per generation (Gap = Tgap = 20), the computational cost per window location is actually in the order of 20 × 20 × 5 evaluations.

Fig. 8 summarizes the execution time under each of these conditions on a common Intel i5 CPU (2.67 GHz, 48 GB RAM). The plots illustrate the mean and variance for 20 runs over Capture 3, where this represents the dataset with the largest cardinality, i.e. it should be clear whether computational costs stabilize or not. In all cases, this reflects a code base that executes as a single thread. The average execution time for the champion is ≈ 45 μs (Fig. 8(a)), implying an average (single-threaded) throughput of 22,222 flows per second. Conversely, it takes between 2.7 and 3.8 s to update the champion predictor.13 We note that the biggest impact on the time to identify new champion classifiers is the time for human experts to provide labels for the Gap records associated with each (non-overlapping) window location W(i). However, this does not interrupt the ability of the current champion classifier to provide labels, and would be synonymous with current practice for deploying updates to 'signature' based detectors.

13 Adopting multi-threaded operation, say eight threads, could potentially reduce this to between 0.34 and 0.5 s.
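These figures follow directly from the measured times; the short calculation below makes the single-threaded throughput and the hypothetical eight-thread estimate of footnote 13 explicit (the thread count is purely illustrative).

    champion_time_s = 45e-6                      # ~45 microseconds per flow (Fig. 8(a))
    print(f"anytime labelling: ~{1.0 / champion_time_s:,.0f} flows per second")   # ~22,222

    update_time_s = (2.7, 3.8)                   # champion refresh per window location
    threads = 8                                  # hypothetical multi-threaded deployment
    print([round(t / threads, 2) for t in update_time_s])                         # ~[0.34, 0.48]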



Fig. 8. Wall clock time for (a) the champion individual to make predictions and (b) fitness evaluation to update the content of the population on a new non-overlapping window location, for the Capture 3 dataset on a 2.67 GHz CPU.

Fig. 9. Symbiotic bid-based GP. Each team indexes a different combination of programs, but the same program may appear in multiple teams. The action (class label) of a program is expressed through colour.

6. Conclusion

Active learning under a streaming data context decouples the machine learning algorithm from the raw throughput of the stream and provides the opportunity to manipulate the distribution of data used for model building. We believe that this makes active learning a particularly useful approach to adopt with GP, particularly under low label budgets. The key observation of our approach is to identify the appropriate interaction between sampling and archiving policies:

• SBB–Rnd: represented our control policy combination. In general the distribution of the classes in the data subset reflects that of the stream, and the accuracy of the anytime classifier also reflects this distribution.
• SBB–Sample: was unable to retain 'useful' queries from the minor class in the data subset. One possible reason is that the GP individual selected to be the champion (anytime) classifier is always chosen on the basis of performance on the two or three most frequently occurring classes in the Data Subset; hence, the pressure/penalty for mislabelling the smallest class(es) is typically low. Future research will attempt to identify specific sources for the weak performance of this policy combination.
• SBB–Archive: emphasizes balancing data subset content over biased querying of the stream. This appeared to provide the best balance between keeping the major class exemplars up to date and identifying instances of the minor class(es).
• SBB–Both: combined targeted queries on the stream for labelling with biased record replacement in the data subset. This resulted in a system that was too aggressive in promoting the minor class in the Data Subset, reducing performance on the major class(es).

When comparing with four alternative formulations explicitly designed to operate under label budgets, the GP policies of SBB–Archive and SBB–Both are generally ranked 1 and 2 respectively. Other results for entirely artificial stream data resulted in the same ranking [30]. In pursuing a Botnet detection task, we illustrate the potential for addressing network analysis under particularly challenging conditions, e.g. class imbalance, a high cost of labelling and anytime operation. Indeed, given that the minor class often represents the most costly class to misclassify under Botnet detection, constructing streaming classifiers under the combined condition of low label budgets and class imbalance represents a significant challenge. This study represents the first time that streaming algorithms have been deployed under these conditions. GP streaming under the archiving policy is shown to be particularly effective in this respect. Moreover, it is clear that the detection of Botnet behaviours improves over the course of the stream, resulting in the ability to detect the Botnet and C & C classes even though they might represent less than 1% of the total stream content. In order to minimize the effect of adversarial attacks against the learning algorithm itself, human experts are still required to provide the true labels. However, the streaming algorithm identifies the data for labelling. Moreover, the anytime classifier makes predictions regarding class labels, where such information can be used to prioritize the records that the experts label first.

Several avenues exist for future work, including but not limited to: 1) the use of multi-armed bandit formulations to direct the process of constructing GP teams and/or direct the sampling of records from the stream; and 2) extensions to further applications where labels can be provided, but at a cost.

Acknowledgments

This research is supported by the Canadian Safety and Security Program (53059) (CSSP) E-Security grant. The CSSP is led by Defence Research and Development Canada, Centre for Security Science (CSS) on behalf of the Government of Canada and its partners across all levels of government, response and emergency management organizations, non-governmental agencies, industry and academia.



Appendix A. Symbiotic bid-based GP

As noted in Section 3.2, this work assumes the Symbiotic Bid-Based formulation for GP (or SBB for short). The SBB framework takes the form of a two-population model representing teams and programs respectively (Fig. 9), i.e. the relationship between the two populations is symbiotic [47]. Without loss of generality, we assume a linear GP representation for the programs on account of the ease with which intron code can be identified and 'skipped' during fitness evaluation [48]. Each program when executed produces a scalar output, or the 'bid'. Moreover, each program is assigned an action, a, at initialization, where in this case a ∈ {1, …, C} and C is the number of classes. The team population takes the form of a variable-length GA in which each individual, tmi, indexes some subset of the programs from the program population. Thus, given record →x(t) from the stream and a team, all the programs from this team are evaluated in order to identify the program with the maximum output (max. bid). Let this be the 'winning' program for team tmi on record →x(t). Such a program has won the right to suggest its action, in this case representing the suggested class label. As the team population assumes a variable-length chromosome, team size evolves, and so does team complement. This is very important as it implies that no prior decisions are necessary regarding the decomposition of the task. Indeed, even assuming that a three-class problem must have at least three programs (with unique class labels) represents a poor learning bias. Instead, teams evolve incrementally over time, identifying the 'easiest' classes to classify first and then adding the more difficult classes. As a consequence, the same program may appear in multiple teams. The only constraints are for each team to possess at least two different actions across its team complement, and for a team to have a minimum of two programs. Fitness is only expressed at the level of the team population. After evaluating all teams, the bottom Tgap teams are dropped and replaced by offspring developed from the surviving individuals (i.e., a breeder model of selection/replacement). However, before team reproduction, the individuals from the program population are tested to identify any program(s) that no longer receive at least one index from a team. Such programs are deleted. Variation operators act hierarchically and probabilistically delete or clone (add) programs (Table 4). Only the cloned programs see further modification, and only the resulting new programs can be incorporated into new teams.14 This way, programs that survive between generations are not disrupted. This also implies that only the team population size is explicitly defined (Psize), whereas the program population size is free to vary. The instruction set for programs takes the form of a register-level transfer language consisting of instructions defined by one or two arguments:

• R[x] = R[x] ⟨op⟩ R[y], where R[x] denotes a register with an integer reference x over the range [0, …, B − 1] and B is the maximum number of registers; y ∈ [0, …, B + d − 1], where d is the dimension of the input space. The implication of the latter is that the last d register references are 'read only' and are initialized with the values of the record awaiting classification. ⟨op⟩ ∈ {+, −, ×, ÷, IF–THEN}. The conditional operator (IF–THEN) is evaluated as IF R[x] < R[y] THEN R[x] = −R[x].
• R[x] = ⟨op⟩ R[y], where x, y and R[·] follow the above definitions, and ⟨op⟩ ∈ {cos, exp, ln}, where the ln operator assumes the absolute value of the operand.
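To ground the above, the following sketch shows how a linear GP program of this form might be interpreted, and how a team's label is resolved through the maximum bid. The register count, instruction encoding and operator protection are illustrative assumptions only; the published SBB implementation [36] should be consulted for the definitive semantics.

    import math

    def run_program(instructions, record, num_registers=8):
        """Interpret a linear GP program over registers R[0..B-1]; the record's
        features are exposed as additional read-only registers R[B..B+d-1]."""
        R = [0.0] * num_registers + list(record)
        for op, x, y in instructions:               # each instruction: (opcode, x, y)
            a, b = R[x], R[y]
            if   op == '+':   R[x] = a + b
            elif op == '-':   R[x] = a - b
            elif op == '*':   R[x] = a * b
            elif op == '/':   R[x] = a / b if b != 0 else a       # protected divide (assumption)
            elif op == 'if<': R[x] = -a if a < b else a           # IF R[x] < R[y] THEN R[x] = -R[x]
            elif op == 'cos': R[x] = math.cos(b)
            elif op == 'exp': R[x] = math.exp(min(b, 50.0))       # overflow guard (assumption)
            elif op == 'ln':  R[x] = math.log(abs(b)) if b != 0 else 0.0
        return R[0]                                  # scalar output = the program's bid

    def team_predict(team, record):
        """A team is a list of (program, action) pairs; the winning (maximum bid)
        program suggests its action as the class label."""
        return max(team, key=lambda pa: run_program(pa[0], record))[1]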

Naturally, the choice of opcodes (⟨op⟩), their order, and the value of the register references are all evolved qualities. Further details of the SBB framework can be found in the original publications [36,49,28].

Appendix B. Distribution of minor classes

Fig. B.10 summarizes the distribution of the minor classes (everything except the background class) throughout the stream for the CTU datasets explicitly employed for illustrating behavioural properties of the streaming Botnet detection task, e.g. Captures 1, 5, 6, 8 and 9. Note that at any point in time the total traffic content is expressed as the sum of each class present in each unique non-overlapping window location, for a window size of 1%. It is apparent that there is no common 'behaviour' beyond class 4 (corresponding to Botnet master/slave communication (C & C)) having the lowest frequency and being of a very burst-like nature. Class 2 (representing data corresponding to the CTU 'normal filters') appears throughout in 3 of the 5 illustrated datasets, but in the case of Captures 1 and 9 may only appear after a delay and might not appear continuously. Class 3 (representing Botnet attacks) is particularly interesting in the case of Capture 8, in which the attack switches on and off at specific (non-periodic) intervals. Moreover, also note that although class labels are used to indicate definitive classes (background/normal/Botnet/C & C), the behaviours associated with each label are a composite of multiple behaviours. For example, Captures 1, 2 and 9 are representative of the Neris Botnet, thus Class 3 can consist of any combination of IRC, spam, click fraud or scanning activities. See Section 4.1 for a summary of the types of malicious behaviours present in each stream.
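The per-window class proportions plotted in Fig. B.10 correspond to a computation of the following form. This is an illustrative sketch only; the 1% window size is the only parameter taken from the text, and the list-of-labels interface is an assumption.

    def class_distribution_per_window(labels, num_windows=100):
        """Proportion of each class in each of `num_windows` equal, non-overlapping
        windows covering the stream (window size = 1% of the stream by default)."""
        size = max(1, len(labels) // num_windows)
        profile = []
        for start in range(0, len(labels), size):
            window = labels[start:start + size]
            profile.append({c: window.count(c) / len(window) for c in set(window)})
        return profile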

14 For example, having the program action changed, or instructions randomly modified.



Fig. B.10. Distribution of minor classes over the course of the stream for the 5 capture datasets appearing in Sections 5.2 and 5.3. Note the use of a log scale and the colour coding corresponds to that adopted for the original Stream DR figures. Class 1 is omitted for clarity (always 90–99%). Class 2 represents ‘normal’ traffic corresponding to the CTU filters, Class 3 represents Botnet and Class 4 Botnet C & C. The log scale also implies that 10−2 is synonymous with zero content (e.g. the earliest that Class 4 appears is at the 40% point in Captures 5 and 9).



[26] A. Atwater, M.I. Heywood, A.N. Zincir-Heywood, GP under streaming data constraints: A case for Pareto archiving?, in: ACM Genetic and Evolutionary Computation Conference, 2012, pp. 703–710. [27] A. Vahdat, A. Atwater, A.R. McIntyre, M.I. Heywood, On the application of GP to streaming data classification tasks with label budgets, in: ACM Genetic and Evolutionary Computation Conference: Big Data Workshop, 2014, pp. 1287–1294. [28] A. Vahdat, J. Morgan, A.R. McIntyre, M.I. Heywood, A.N. Zincir-Heywood, Evolving GP classifiers for streaming data tasks with concept change and label budgets: A benchmarking study, in: Handbook of Genetic Programming Applications, Springer, 2015, Ch. 18, pp. 451–480. [29] S. Khanchi, M. Heywood, N. Zincir-Heywood, On the impact of class imbalance in GP streaming classification with label budgets, in: European Conference on Genetic Programming, vol. 9594 of LNCS, 2016, pp. 35–50. [30] S. Khanchi, M. Heywood, N. Zincir-Heywood, Properties of a GP active learning framework for streaming data with class imbalance, in: ACM Genetic and Evolutionary Computation Conference, 2017, pp. 945–952. [31] H.H. Dam, C. Lokan, H.A. Abbass, Evolutionary online data mining: An investigation in a dynamic environment, in: Studies in Computational Intelligence, vol. 51, Springer, 2007, Ch. 7, pp. 153–178. [32] M. Behdad, T. French, Online learning classifiers in dynamic environments with incomplete feedback, in: IEEE Congress on Evolutionary Computation, 2013, pp. 1786–1793. [33] A. Cervantes, P. Isasi, C. Gagné, M. Parizeau, Learning from non-stationary data using a growing network of prototypes, in: IEEE Congress on Evolutionary Computation, 2013, pp. 2634–2641. [34] L.L. Minku, A.P. White, X. Yao, The impact of diversity on online ensemble learning in the presence of concept drift, IEEE Trans. Knowl. Data Eng. 22 (5) (2010) 730–742. [35] R. Polikar, L. Udpa, S. Udpa, V. Honavar, Learn++: an incremental learning algorithm for supervised neural networks, IEEE Trans. Syst. Man Cybern.-Part C 31 (4) (2001) 497–508. [36] P. Lichodzijewski, M.I. Heywood, Managing team- based problem solving with Symbiotic Bid-based Genetic Programming, in: ACM Genetic and Evolutionary Computation Conference, 2008, pp. 363– 370. [37] P. Lichodzijewski, M. I. Heywood, Symbiosis, complexification and simplicity under GP, in: ACM Genetic and Evolutionary Computation Conference, 2010, pp. 853– 860. [38] S. García, M. Grill, J. Stiborek, A. Zunino, An empirical comparison of botnet detection methods, Comput. Secur. 45 (2014) 100–123. [39] M.V. Mahoney, P.K. Chan, An analysis of the 1999 DARPA/Lincoln Laboratory evaluation data for network anomaly detection, in: Recent Advances in Intrusion Detection, Vol. 2820 of LNCS, 2003, pp. 220 –237. [40] C. Rossow, C.J. Dietrich, C. Grier, C. Kreibich, V. Paxson, N. Pohlmann, H. Bos, M. van Steen, Prudent practices for designing malware experiments: Status quo and outlook, in: IEEE Symposium on Security and Privacy, 2012, pp. 65–79. [41] A. Shiravi, H. Shiravi, M. Tavallaee, A.A. Ghorbani, Toward developing a systematic approach to generate benchmark datasets for intrusion detection, Comput. Secur. 31 (2012) 357–374. [42] A. Bifet, I. Z˘liobaitė, B. Pfahringer, G. Holmes, Pitfalls in benchmarking data stream classification and how to avoid them, in: Machine Learning and Knowledge Discovery in Databases, Vol. 8188 of LNCS, 2013, pp. 465–479. [43] J.L. Hennessy, D.A. Patterson, Computer Architecture a Quantitive Approach, 2nd edition, Morgan Kaufmann, 1996. [44] N. 
Japkowicz, M. Shah, Evaluating Learning Algorithms: A Classification Perspective, Cambridge University Press, 2011. [45] D. Brzezinski, J. Stefanowski, Prequential AUC for classifier evaluation and drift detection in evolving data streams, in: ECML-PKDD Workshop on New Frontiers in Mining Complex Patterns, vol. 8983 of LNCS, 2014, pp. 87–101. [46] J. Demšar, Statistical comparisons of classifiers over multiple data sets, J. Mach. Learn. Res. 7 (1) (2006) 1–30. [47] M.I. Heywood, P. Lichodzijewski, Symbiogenesis as a mechanism for building complex adaptive systems: a review, in: European Conference on Genetic Programming, vol. 6024 of LNCS, 2010, pp. 51–60. [48] M. Brameier, W. Banzhaf, Linear Genetic Programming, Springer, 2007. [49] J.A. Doucette, A.R. McIntyre, P. Lichodzijewski, M.I. Heywood, Symbiotic coevolutionary genetic programming: a benchmarking study under large attribute spaces, Genet. Program. Evol. Mach. 13 (2012) 71–101.

References [1] M. Sugiyama, M. Kawanabe, Machine Learning in Non-stationary Environments, MIT Press, 2012. [2] G. Ditzler, M. Roveri, C. Alippi, R. Polikar, Learning in non- stationary environments: a survey, IEEE Comput. Intell. 10 (4) (2015) 12–25. [3] M.I. Heywood, Evolutionary model building under streaming data for classification tasks: opportunities and challenges, Genet. Program. Evol. Mach. 16 (3) (2015) 283–326. [4] B. Krawczyk, L.L. Minku, J. Gama, J. Stefanowski, M. Woźniak, Ensemble learning for data stream analysis: a survey, Inf. Fusion 37 (2017) 132–156. [5] M. Barreno, B. Nelson, R. Sears, A.D. Joseph, J.D. Tygar, Can machine learning be secure?, in: ACM Symposium on Information, Computer and Communications Security, 2006, pp. 16–25. [6] M. Barreno, B. Nelson, A.D. Joseph, J.D. Tygar, The security of machine learning, Mach. Learn. 81 (2) (2010) 121–148. [7] P. Lindstrom, B. MacNamee, S.J. Delany, Drift detection using uncertainty distribution divergence, Evol. Syst. 4 (1) (2013) 13–25. [8] I. Z˘liobaitė, A. Bifet, B. Pfahringer, G. Holmes, Active learning with drifting streaming data, IEEE Trans. Neural Netw. Learn. Syst. 25 (1) (2014) 27–54. [9] X. Zhu, P. Zhang, X. Lin, Y. Shi, Active learning from stream data using optimal weight classifier ensemble, IEEE Trans. Syst. Man Cybern. - Part B 40 (6) (2010) 1607–1621. [10] M.M. Masud, J. Gao, L. Khan, J. Han, B. Thuraisingham, Classification and novel class detection in data streams with active mining, in: Pacific Asia Knowledge Discovery and Data Mining, Vol. 6119 of LNCS, 2010, pp. 311–324. [11] H. Kim, S. Madhvanath, T. Sun, Hybrid active learning for non- stationary streaming data with asynchronous labeling, in: IEEE International Conference on Big Data, 2015, pp. 287–272. [12] M. Woźniak, P. Kzieniewicz, B. Cyganek, A. Kasprzak, K. Walkowiak, Active learning classification of drifted streaming data, in: International Conference on Computation Science, 2016, pp. 1724–1733. [13] G. Ditzler, R. Polikar, Incremental learning of concept drift from streaming balanced data, IEEE Trans. Knowl. Data Eng. 25 (10) (2013) 2283–2301. [14] S. Wang, L.L. Minku, X. Yao, Resampling based ensemble methods for online class imbalance learning, IEEE Trans. Knowl. Data Eng. 27 (5) (2015) 1356–1368. [15] N.V. Chawla, K.W. Bowyer, L.O. Hall, W.P. Kegelmeyer, SMOTE: Synthetic minority over-sampling technique, J. Artif. Intell. Res. 16 (2002) 321–357. [16] B. Mirza, Z. Lin, K.-A. Toh, Weighted online sequential extreme learning machine for class imbalance learning, Neural Process Lett. 38 (3) (2013) 465–486. [17] A. Ghazikhani, R. Monsefi, H.S. Yazdi, Recursive least square perceptron model for non-stationary and imbalanced data stream classification, Evol. Syst. 4 (2) (2013) 119–131. [18] M.-R. Bouguelia, Y. Belaïd, A. Belaïd, An adaptive streaming active learning strategy based on instance weighting, Pattern Recognit. Lett. 70 (2016) 38–44. [19] Y. Sun, K. Tang, L.L. Minku, S. Wang, X. Yao, Online ensemble learning of data streams with gradually evolved classes, IEEE Trans. Knowl. Data Eng. 28 (6) (2017) 1532. [20] K.B. Dyer, R. Capo, R. Polikar, Compose: a semisupervised learning framework for initially labeled nonstationary streaming data, IEEE Trans. Neural Netw. Learn. Syst. 25 (1) (2014) 12–26. [21] M.J. Hosseini, A. Gholipour, H. Beigy, An ensemble of cluster-based classifiers for semi-supervised classification of non-stationary data streams, Knowl. Inf. Syst. 46 (3) (2016) 567–597. [22] M. Kampouridis, E. 
Tsang, EDDIE for investment opportunities forecasting: Extending the search space of the GP, in: IEEE Congress on Evolutionary Computation, 2010, pp. 2019–2026. [23] A. Loginov, M.I. Heywood, G. Wilson, Benchmarking a coevolutionary streaming classifier under the individual household electric power consumption dataset, in: IEEE-INNS Joint Conference on Neural Networks, 2016, pp. 1–8. [24] G. Folino, G. Papuzzo, Handling different categories of concept drifts in data streams using distributed GP, in: European Conference on Genetic Programming, vol. 6021 of LNCS, 2010, pp. 74–85. [25] I. Dempsey, M. O′Neill, A. Brabazon, Foundations in Grammatical Evolution for Dynamic Environments, Springer, 2009 (Vol. SCI 194).
