Knowledge-Based Systems 181 (2019) 104783

DELR: A double-level ensemble learning method for unsupervised anomaly detection

Jia Zhang (a), Zhiyong Li (a, corresponding author), Ke Nai (a), Yu Gu (c, d), Ahmed Sallam (b)

a College of Computer Science and Electronic Engineering of Hunan University, and Key Laboratory for Embedded and Network Computing of Hunan Province, Changsha 410082, China
b Faculty of Computers and Informatics, Suez Canal University, Ismailia 41522, Egypt
c Beijing Advanced Innovation Center for Soft Matter Science and Engineering, Beijing University of Chemical Technology, Beijing 100029, China
d Institute for Inorganic and Analytical Chemistry, University of Frankfurt, Max-von-Laue-Str. 7, 60438 Frankfurt, Germany

Article history: Received 4 August 2018; Received in revised form 13 May 2019; Accepted 16 May 2019; Available online 20 May 2019.

Keywords: Anomaly detection; Double-level ensemble; Generalization ability

Abstract: Although the anomaly detection problem has been widely studied in data mining and machine learning, most algorithms in this domain have limited generalization ability. Ensemble learning has been proven to effectively improve the generalization ability of anomaly detection algorithms. However, there is room for further improvement in existing anomaly ensemble methods. For example, these methods are based on a single-level ensemble strategy that only considers the combination of the final results and usually neglects the loss of information during the generation of multiple subspaces. In this paper, we propose a double-level ensemble learning method using linear regression as the base detector, called DELR, which has better robustness and can reduce the risk of information loss. The first level is used to reduce the loss of information, and the second level is used to improve the generalization ability. To better satisfy the diversity requirement of the anomaly ensemble, we present a diversity loss function to retrain the base models. Furthermore, we devise a novel weighted average strategy to ensure effectiveness in the second level. Our experimental results and analysis demonstrate that the DELR algorithm obtains better generalization ability on real-world datasets compared to several state-of-the-art anomaly algorithms.

1. Introduction

1.1. Motivation

Anomaly detection is a hot topic in machine learning and data mining, and it has been widely applied in many domains, including fault detection, credit card fraud detection, traffic anomaly detection, and network intrusion detection [1–4]. Accordingly, numerous algorithms have been proposed in recent years [5–8]. In particular, Chandola et al. [1] have provided a detailed overview of a series of anomaly detection algorithms. However, most of these algorithms perform with limited generalization ability and obtain good performance only in specific domains.

Ensemble learning is fast becoming a key instrument for improving the generalization ability of anomaly detection algorithms. The main idea of ensemble learning is to obtain better performance by combining the merits of multiple classifiers instead of using a single classifier. Currently, ensemble learning is widely used in classification problems, including text classification, image classification, data stream classification, and sentiment classification [9–13]. It is worth mentioning that bagging and boosting are two popular algorithms in classification ensembles. Traditionally, ensemble learning for anomaly detection has been inspired by the development of the classification ensemble. Furthermore, the main components of ensemble learning are the generation, selection, and combination of multiple base classifiers [14]. In effect, many studies based on ensemble learning have been proposed to solve anomaly detection [15–17]. However, there are still various challenges for further exploration in anomaly ensembles, in terms of the generation, selection, or combination of multiple base detectors.

On the one hand, most anomaly ensembles depend on meta-anomaly algorithms, such as the local outlier factor (LOF) and k-nearest neighbor (kNN) algorithms, which are often used as base detectors in ensemble methods.


Generally, meta-anomaly algorithms are efficient in detecting anomalous samples and can be improved through combination with other simple techniques. Therefore, most studies merely propose a combination method for multiple results rather than an anomaly detection method in particular; hence, the training phase of the base detectors is usually overlooked. On the other hand, several anomaly ensemble combination strategies combine the final results of multiple detectors. The main problem with these strategies is that they adopt a single-level combination in the final integration phase (see Fig. 1) and ignore the information loss produced by the selection of the feature subspace [18–20]. For example, Lazarevic and Kumar [18] proposed to train multiple base detectors using randomly selected feature subspaces and focused on proposing a variety of combination methods. Vries et al. [20] proposed using randomly selected feature subspaces to further optimize the LOF algorithm. Mitra et al. [21] concluded that numerous studies focus on reducing this type of information loss, for example through newly proposed clustering algorithms and feature similarity measures; it is therefore also valuable to consider information loss in the anomaly ensemble.

1.2. Contributions

In this paper, we aim to improve the generalization ability of anomaly detection algorithms by introducing a novel double-level ensemble learning method. The method focuses on two aspects, the training pattern of the base detectors and the combination strategy of multiple detectors, which are key factors in improving the generalization ability of an anomaly ensemble algorithm. Our contributions are listed as follows:

• Considering that linear regression is a supervised learning algorithm, we propose pseudo-labels to replace the data labels, so that the method does not rely on expensive manual labeling.
• To ensure the diversity requirements, which are effective in improving the generalization ability, we propose a diversity loss function to re-update the trained models.
• To reduce the information loss and improve the generalization ability, we propose a double-level ensemble strategy (see Fig. 2). In the first level, we combine the base detectors Ci1 and Ci2 based on the ith subspace division and propose using the two-dimensional anomaly score vector Vi as the intermediate outlier score of each sample. In the second level, we compute the weight wi of each sample based on the intermediate outlier scores and the rank information, and we recompute the final outlier score Si from these intermediate outlier scores to implement the final combination.

The remainder of this paper is organized as follows: Section 2 reviews the previous literature. Section 3 presents our new algorithm in detail and explains the subspace generation, base detector selection and combination strategy. Section 4 presents the experimental results and the analysis of our model on real-world datasets. Section 5 concludes this paper.

2. Related work

2.1. Classification ensembles

Classification ensembles combine multiple classifiers to obtain better results. This technique has been widely studied for several decades and has been verified to efficiently improve the generalization ability of classification algorithms [14,22,23]. Generally, studies on this topic have focused on three aspects: the generation methods, selection mechanisms and combination strategies of multiple base classifiers.

First, in view of the generation methods, bagging and boosting are representative algorithms that belong to the seminal studies of the base classifier generation phase. These two algorithms triggered two trends in designing generation methods for base classifiers: the parallel ensemble and the sequential ensemble. A parallel ensemble, such as bagging, generates multiple classifiers in parallel, while a sequential ensemble, such as boosting, generates classifiers one by one, such that each generated base classifier depends on the results of the previous one. AdaBoost [24], for example, is an instantiation of a boosting algorithm. In this algorithm, the dependence between the generated classifiers is reflected by biasing the training samples of each base classifier: if a sample is classified incorrectly, it receives a larger weight than a sample that is classified correctly.

Second, in view of the selection mechanisms, most classification ensemble methods combine the results of all base classifiers in use. However, this approach can have a series of negative effects. On the one hand, the more base classifiers there are, the greater the storage burden and computing time. On the other hand, poor classifiers may decrease the performance of the whole ensemble, especially in terms of the generalization ability. Therefore, ensemble pruning has become an important research topic in improving the generalization ability of ensemble algorithms. The main purpose of this technique is to provide a mechanism for selecting a subset of the multiple classifiers. In 1997, Margineantu and Dietterich [25] presented the concept of ensemble pruning, which mainly focused on pruning in the sequential ensemble. This concept was also rapidly extended to the parallel ensemble [26].

Finally, in view of the combination strategies, averaging and voting are both very popular and fundamental combination methods due to their simplicity [14]. Zhou et al. [27] used full voting and plurality voting in their sequential ensemble, where the first made a preliminary categorization, while the second was more precise. Moreover, the weighted average is another common combination method in which each base classifier is assigned a particular weight based on its final result. Notably, our proposed anomaly detection algorithm, DELR, has a similar two-level structure but with different purposes. In particular, our first level is designed to reduce the risk of information loss, and the second level is designed to improve the generalization ability. In addition, unlike the above-mentioned algorithms, the DELR algorithm focuses on designing an unsupervised method that does not require ground truth labels, which prevents the method from depending on expensive labels in anomaly detection.

2.2. Anomaly ensembles

Although ensemble learning for anomaly detection has proven its efficiency in improving the generalization ability [15,28,29], it has rarely been studied due to the absence of a well-formed objective model, in addition to the absence of ground truth labels. In fact, multiple algorithms in this domain have been inspired by the development of classification and clustering ensembles. Therefore, several studies have focused on the three main components of ensemble learning mentioned previously, that is, the generation, selection, and combination.
In view of the generation methods, feature bagging [18] and subsampling [12] are two common representative approaches. To satisfy the diversity requirements of base detectors, the former randomly selects multiple feature subsets to produce new datasets for training base detectors, while the latter uses a subsampling technique to produce new datasets of different sizes for training base detectors.


Fig. 1. A single-level ensemble strategy.

Fig. 2. A double-level ensemble strategy.

Although both of them rely on the dimensionality of the original feature space, they are widely used in anomaly ensembles. Feature bagging is only suitable for detecting anomalies in high-dimensional data: by reducing the dimensionality of the feature space, it has shown good performance in identifying anomalous samples hidden in low-dimensional subspaces. Subsampling is superior to the bootstrap sampling technique [14] because it draws no duplicate samples during the sampling process; duplicates negatively affect many anomaly detectors, such as kNN and LOF [11]. Furthermore, various selection strategies for feature subsets have been proposed as extensions of random feature selection [16,30–33].

In view of the selection mechanisms for multiple detectors, it is difficult to select or abandon a detector without ground truth labels during the evaluation phase. Hence, Rayana and Akoglu [34] proposed the full ensemble greedy strategy to select the results of all base detectors. Later, Rayana and Akoglu [35] proposed two other selection strategies based on the outlier scores and outlier rankings of multiple base detectors.

Both strategies have been verified to filter out poor detectors and to improve the generalization ability of anomaly detection algorithms. Also, in view of the combination strategies, the combination methods used in classification ensembles, such as the average and the weighted average, are also widely used in anomaly ensembles.

In general, all these efforts have focused on improving the generalization ability of anomaly detection algorithms. Consequently, various improved algorithms have been proposed [15,28,29] and extended to multiple fields, including medical image detection, computer networks, wireless sensor networks, and civil engineering [36–39]. For example, Zimek et al. [29] proposed an unsupervised method for an anomaly ensemble in which a data perturbation technique was used to induce diversity among the base detectors, and a rank accumulation method was proposed to combine the multiple results. Rayana et al. [15] combined the core ideas of parallel and sequential ensembles and proposed a sequential ensemble method in which the result of one detector is filtered and fed back to train the next detector; the technique also exploits the weighted average method to combine the final results.

4

J. Zhang, Z. Li, K. Nai et al. / Knowledge-Based Systems 181 (2019) 104783

Paulheim and Meusel [28] proposed another technique in which the anomaly detection problem is seen as a set of supervised learning problems, and the lack of ground truth labels is compensated by pseudo-labels.

Similar to current anomaly ensemble methods, our method also focuses on both the generation and combination of multiple base detectors. In terms of the generation, the main differences of our method compared to the methods mentioned above are that we not only use linear regression as the base detector but also propose a diversity loss function for retraining the detectors. In terms of the combination, we propose a double-level ensemble strategy to reduce the risk of information loss and improve the generalization ability, instead of a single-level combination of the final results.

3. Proposed approach

The most common definition of an anomaly is provided by Hawkins as "an observation which deviates so much from other observations as to arouse suspicion that it was generated by a different mechanism" [1]. However, how large must this deviation be? There are different answers for different applications. Therefore, it is critical to improve the generalization ability of the anomaly detection algorithm. Our DELR method utilizes a double-level ensemble strategy to ensure better generalization ability.

3.1. Overview

The inputs of DELR are the m-dimensional data, the value of α (the subspace division ratio), the number of cycles l and the number of folds f. The output is the outlier score list S of all samples. In the experiments, we use α = 0.45 and l = 20, which have been verified to be a good combination in Section 4.3.6. For the parameter f, we use a value of 5, which means that we use the 5-fold cross-validation-like scenario introduced in Section 3.3.1. The main steps of DELR are shown in Algorithm 1. There are four main components, namely, the subspace generation (steps 1–7), the training of the base detectors (steps 8–16), the retraining of the base detectors (steps 17–19), and the double-level ensemble strategy (steps 20–38), which are described in detail in the following sections. Furthermore, to obtain a better understanding, the DELR process is shown in Fig. 3.

3.2. Subspace generation

Numerous anomaly ensemble algorithms obtain low-dimensional feature subsets through random selection or random projection over the total feature space, which introduces randomness into the algorithm. Our work is similar in this regard. As shown in Fig. 2, we need multiple divisions of the data space before applying the double-level ensemble strategy, and we obtain these random divisions through random subspace generation [9]. Moreover, unlike other algorithms, we retain both subspaces generated by each division. In general, we obtain $2l$ subspaces by performing $l$ random subspace generations and can train $2l$ different base detectors on them (a minimal code sketch of this split is given below). Let $D = \{x_i \mid i = 1, \dots, n\}$ be the dataset, where $x_i = (x_{i1}, \dots, x_{im})^T$ represents a sample with $m$-dimensional features and $\alpha$ represents the division ratio of the feature space. In the $l$th division phase, a new dataset $D_{l1} = \{x^*_i \mid i = 1, \dots, n\}$ with $r$-dimensional samples $x^*_i = (x^*_{i1}, \dots, x^*_{ir})^T$ is generated by randomly selecting $r$ features from the total feature space, where $r$ is equal to $m$ times $\alpha$. The remaining features form the other dataset $D_{l2} = \{x^{**}_i \mid i = 1, \dots, n\}$ with $(m-r)$-dimensional samples $x^{**}_i = (x^{**}_{i,r+1}, \dots, x^{**}_{i,m})^T$.
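To make the division concrete, the following is a minimal Python sketch of the complementary subspace split (our own illustration, not code from the paper; the name split_feature_space and the NumPy representation are assumptions):

import numpy as np

def split_feature_space(X, alpha=0.45, rng=None):
    # Randomly split the m features into two complementary subspaces:
    # D1 keeps r = round(alpha * m) randomly chosen features, D2 the rest.
    rng = np.random.default_rng(rng)
    m = X.shape[1]
    r = int(round(alpha * m))
    idx1 = rng.choice(m, size=r, replace=False)
    idx2 = np.setdiff1d(np.arange(m), idx1)
    return X[:, idx1], X[:, idx2], idx1, idx2

# Performing l such divisions yields 2 * l subspaces and hence 2 * l base detectors.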

Algorithm 1 DELR
Input: the dataset D = {x_i | i = 1, ..., n} (x_i ∈ R^m), the number of cycles l ← 20, the subspace division ratio α ← 0.45, the number of folds f ← 5.
Output: the outlier score list S.
 1: for i = 1 to l do
 2:   D_i1 ← ∅, D_i2 ← ∅, L_i1 ← ∅, L_i2 ← ∅
 3:   D_i1 ← SUBSPACE GENERATION(D, α)                 /* Section 3.2 */
 4:   D_i2 ← D \ D_i1
 5:   L_i1 ← the pseudo-label set produced by a randomly selected feature from D_i2
 6:   L_i2 ← the pseudo-label set produced by a randomly selected feature from D_i1
 7: end for
 8: for j = 1 to f do
 9:   for i = 1 to l do
10:     for k = 1 to 2 do
11:       S_ik^(j) ← the jth fold (1/f of the instances) in D_ik
12:       T_ik^(j) ← D_ik \ S_ik^(j)
13:       m_ik^(j) ← LINEAR REGRESSION(T_ik^(j), L_ik)  /* Section 3.3.2 */
14:     end for
15:   end for
16: end for
17: for j = 1 to f do
18:   m*^(j) ← LOSS FUNCTION(D, m^(j))                  /* Section 3.3.3 */
19: end for
20: for j = 1 to f do
21:   for i = 1 to l do
22:     for k = 1 to 2 do
23:       for x ∈ S_ik^(j) do
24:         s ← |m*_ik^(j)(x) − L_ik(x)|
25:         S̄_ik ← S̄_ik ∪ s
26:       end for
27:     end for
28:   end for
29: end for
30: for i = 1 to l do
31:   V_i ← ∅
32:   V_i ← INTERNAL INTEGRATION(S̄_i1, S̄_i2)           /* Section 3.4.1 */
33: end for
34: S ← ∅
35: for x ∈ D do
36:   s ← EXTERNAL INTEGRATION(V_1(x), ..., V_l(x))     /* Section 3.4.2 */
37:   S ← S ∪ s
38: end for
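To make the control flow of Algorithm 1 concrete, the following is a runnable, simplified Python sketch (our own illustration, not the paper's code): it reuses split_feature_space from the sketch in Section 3.2, omits the diversity retraining of steps 17–19, and stands in a plain average for the double-level combination of steps 20–38.

import numpy as np

def pseudo_label(X_other, rng):
    # A pseudo-label is one randomly selected feature of the complementary subspace.
    return X_other[:, rng.integers(X_other.shape[1])]

def fold_residual_scores(D, y, f, rng):
    # Cross-validation-like training (Section 3.3.1): each sample is scored by
    # |prediction - pseudo-label| from the model trained on the other folds.
    n = D.shape[0]
    idx = rng.permutation(n)
    X1 = np.hstack([np.ones((n, 1)), D])        # absorb the bias theta_0
    scores = np.empty(n)
    for fold in np.array_split(idx, f):
        train = np.setdiff1d(idx, fold)
        theta, *_ = np.linalg.lstsq(X1[train], y[train], rcond=None)
        scores[fold] = np.abs(X1[fold] @ theta - y[fold])
    return scores

def delr_simplified(X, l=20, alpha=0.45, f=5, seed=0):
    rng = np.random.default_rng(seed)
    all_scores = []
    for _ in range(l):
        D1, D2, _, _ = split_feature_space(X, alpha, rng)
        for Da, Db in ((D1, D2), (D2, D1)):
            all_scores.append(fold_residual_scores(Da, pseudo_label(Db, rng), f, rng))
    return np.mean(all_scores, axis=0)          # stand-in for Sections 3.4.1-3.4.2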

3.3. Base detectors

It is well known that numerous anomaly ensemble algorithms use meta-anomaly algorithms as base detectors, such as kNN and LOF, which often perform well in anomaly detection. However, these studies usually focus on proposing an effective ensemble strategy, and their effectiveness depends in part on the quality of the meta-anomaly algorithms. DELR is independent of these meta-anomaly algorithms and uses linear regression as the base detector instead. Furthermore, some researchers have proposed using supervised machine learning algorithms in anomaly detection [40,41]. For example, Mok et al. [40] proposed a random effects logistic regression model for anomaly detection that considers both the system characteristics and unexplained uncertainty. Khammassi and Krichen [41] used multiple machine learning algorithms for anomaly detection; in their work, logistic regression was used for the selection of the best feature subset, and three different decision tree algorithms were used for classification. However, these studies focus on network intrusion detection and thus have limited generalization ability, whereas DELR has been verified to have better generalization ability. In addition, their methods work only when labeled data are available, while DELR is independent of data labels; in anomaly detection, only limited labeled data can be obtained, especially for abnormal samples. Accordingly, DELR uses linear regression as a base detector in an unsupervised way and thus relies on neither meta-anomaly algorithms nor expensive data labels.


Fig. 3. Block diagram showing the steps of DELR.

3.3.1. Extending linear regression to unsupervised anomaly detection

Linear regression is one of the fundamental algorithms in machine learning and is widely used in classification, with the advantages of rapid speed and low computational complexity. Although this method is effective in classification, it is difficult to use directly for anomaly detection. Moreover, obtaining manually labeled data is particularly expensive in anomaly detection. Chandola et al. [1] have provided the definition of the label as "the labels associated with a data instance denote whether that instance is normal or anomalous". Worst of all, we may have new anomalies that cannot be labeled in advance. Hence, in order not to rely on data labels, we use linear regression in an unsupervised way and utilize pseudo-labels instead of expensive data labels. The pseudo-labels are produced by randomly selected features; in other words, we use linear regression on a subset of the features to predict these randomly selected features. This is not the first time an attribute has been predicted from the other attributes. In 1999, Teng [42] proposed a polishing procedure, which amounts to predicting a feature from the remainder of the features, to identify and correct noise in data. Based on a similar method, Paulheim and Meusel [28] computed the weight of each feature and proposed a novel anomaly algorithm, ALSO, which is compared with our algorithm in the experimental part.

In addition, to reasonably divide the sample space and avoid overfitting in the model training phase, we use a cross-validation-like scenario [28] in which 5-fold cross-validation is used to train the models in different folds instead of evaluating the performance of the algorithm. In brief, the aim of using 5-fold cross-validation is to implement the division of the sample space for both $D_{l1}$ and $D_{l2}$ mentioned above. For example, the sample space $D_{l1}$ can be divided into $D_{l1} = D_{l1}^{(1)} \cup \cdots \cup D_{l1}^{(5)}$, where $D_{l1}^{(i)} \cap D_{l1}^{(j)} = \emptyset$ for $i \neq j$, and we can select the testing subspace $S_1 = D_{l1}^{(i)}$ and the corresponding training subspace $T_1 = D_{l1} \setminus D_{l1}^{(i)}$. The samples in each testing subspace obtain predicted values from the model trained on the data in the corresponding training subspace.

3.3.2. Linear regression training

As mentioned above, we can obtain the training set $T_1$ and testing set $S_1$ of the subspace $D_{l1}$. Meanwhile, the pseudo-labels $L_{l1}$ of subspace $D_{l1}$ are produced by randomly selected features of subspace $D_{l2}$. Random selection has been verified to effectively satisfy the diversity requirements in the anomaly ensemble. In 2005, Lazarevic and Kumar [18] proposed the anomaly ensemble algorithm named feature bagging, in which multiple feature subspaces are randomly selected from the total feature space. Moreover, there are many variations based on this method. For example, Rayana et al. [15] implemented a combination of multiple methods based on repeatedly using feature bagging as a base detector.

In the following, to simplify the description, we use $T = \{x_i \mid i = 1, \dots, n_1\}$ as the training set, $S = \{x_i \mid i = n_1 + 1, \dots, n\}$ as the testing set and $L = \{y_i \mid i = 1, \dots, n\}$ as the pseudo-label set, where $x_i = (x_{i1}, \dots, x_{ir})^T$ represents a sample with $r$-dimensional features and $y_i$ represents the pseudo-label of sample $x_i$. In this way, the general form of each base detector $f$ is as follows:

$$f_\theta(x^*) = \theta^T x^* + \theta_0, \qquad (1)$$

where $\theta = (\theta_1, \dots, \theta_r)^T$ represents the $r$-dimensional weight vector, $\theta_0$ represents the bias value and $x^* = (x^*_1, \dots, x^*_r)^T \in T$ represents a training sample with $r$-dimensional features. In addition, the parameter $\theta_0$ is merged into $\theta$ by adding an additional feature with a value of one for each sample. Therefore, Eq. (1) can be simplified to $f_\theta(x^*) = \theta^T x^*$, where $\theta = (\theta_0, \theta_1, \dots, \theta_r)^T$ and $x^* = (1, x^*_1, \dots, x^*_r)^T$. In the model training, the minimum residual sum of squares is used to fit the training samples. The specific loss function is as follows:

$$J(\theta) = \frac{1}{2n_1} \sum_{i=1}^{n_1} (f_\theta(x_i) - y_i)^2. \qquad (2)$$

There are two general solving methods, the least squares method and the gradient descent method. We can obtain the optimal solution for the parameters $\theta$ by the least squares method:

$$\theta = (X^T X)^{-1} X^T Y,$$

where $X = [x_1, \dots, x_{n_1}]^T$, in which $x_i$ represents the original sample $x_i$ with an additional feature fixed at value 1, and $Y = [y_1, \dots, y_{n_1}]^T$. However, when $X^T X$ is not a full-rank matrix, we use the gradient descent method. The update produced by gradient descent for estimating the parameters in Eq. (2) is as follows:

$$\theta_j := \theta_j - \alpha \frac{1}{n_1} \sum_{i=1}^{n_1} (f_\theta(x_i) - y_i) x_{ij} \quad (j = 0, 1, \dots, r),$$

where the parameter $\alpha$ represents the learning rate.

3.3.3. Linear regression retraining

As described, two subspaces, $D_{l1}$ and $D_{l2}$, are generated in each division phase and used simultaneously for training the linear regression. Therefore, two base models ($f_1$ and $f_2$) can be obtained from each division, and $2l$ base models can be obtained after $l$ divisions. Although the training method mentioned above is effective in finding the optimal parameters of the base models, it does not consider the diversity of the models. In our work, due to the importance of diversity for the anomaly ensemble, we propose a diversity loss function to ensure the diversity requirement among these base models. The basic idea of our diversity loss function is to maximize the diversity among the base models over the original data space $D$. Furthermore, to implement the retraining of these models on the original data space, we add a parameter with a value of zero for each feature that did not appear in a model's previous training. The general form of the diversity loss function is as follows:

$$L(f, D) = \frac{2}{N(N-1)} \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} l(f_i, f_j) = \frac{2}{N(N-1)} \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} \frac{1}{|D|} \sum_{x \in D} \exp(-(f_i(x) - f_j(x))^2), \qquad (3)$$

where the parameter $N$ represents the number of base models, $f = \{f_1, \dots, f_N\}$ represents the set of base models, $D$ represents the dataset used to train this loss function and $l(f_i, f_j)$ represents the diversity loss between model $f_i$ and model $f_j$. In this work, we set the parameter $N$ in Eq. (3) to $2l$, where $l$ represents the number of cycles. Meanwhile, in the model retraining, we update the model parameters by gradient descent, as shown in the following equations:

$$\theta_i := \theta_i - \alpha \frac{\partial L(f, D)}{\partial \theta_i},$$

$$\frac{\partial L(f, D)}{\partial \theta_i} = \frac{2}{N(N-1)} \sum_{j=1, j \neq i}^{N} \frac{\partial l(f_i, f_j)}{\partial \theta_i}, \qquad \frac{\partial l(f_i, f_j)}{\partial \theta_i} = \frac{2}{|D|} \sum_{x \in D} \exp(-(f_i(x) - f_j(x))^2)(f_j(x) - f_i(x)) x,$$

where the parameter $\alpha$ represents the learning rate.
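As an illustration of Sections 3.3.2 and 3.3.3, the sketch below fits a base model by least squares against its pseudo-label and then applies gradient descent on the diversity loss of Eq. (3). Embedding the subspace weights into the full feature space with zeros follows the retraining setup described above, while the function names, learning rate and step count are our own assumptions.

import numpy as np

def fit_base_model(D_sub, y, idx, m):
    # Least-squares fit on one subspace (Eq. (2)); the bias theta_0 is absorbed
    # by a constant feature. The learned weights are embedded into the full
    # (m+1)-dimensional space, with zeros for features unseen in training.
    X1 = np.hstack([np.ones((D_sub.shape[0], 1)), D_sub])
    theta_sub, *_ = np.linalg.lstsq(X1, y, rcond=None)
    theta = np.zeros(m + 1)
    theta[0] = theta_sub[0]
    theta[1 + idx] = theta_sub[1:]
    return theta

def diversity_retrain(thetas, X, lr=0.01, steps=10):
    # Gradient descent on L(f, D) of Eq. (3): minimizing the mean of
    # exp(-(f_i(x) - f_j(x))^2) pushes the 2*l base models apart.
    X1 = np.hstack([np.ones((X.shape[0], 1)), X])
    thetas = [t.copy() for t in thetas]
    N = len(thetas)
    for _ in range(steps):
        preds = [X1 @ t for t in thetas]
        grads = []
        for i in range(N):
            g = np.zeros_like(thetas[i])
            for j in range(N):
                if j == i:
                    continue
                diff = preds[i] - preds[j]
                # d/df_i of exp(-(f_i - f_j)^2) = -2 (f_i - f_j) exp(...)
                g += (2.0 / X1.shape[0]) * (X1.T @ (np.exp(-diff ** 2) * -diff))
            grads.append(2.0 / (N * (N - 1)) * g)
        for i in range(N):
            thetas[i] -= lr * grads[i]
    return thetas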

3.4. Double-level ensemble strategy

The combination of multiple detectors is the core component of anomaly ensemble algorithms, where the maximum, minimum and average are the most commonly used combination strategies. Generally, these strategies can be combined to construct a single-level strategy by integrating the multiple results of the base detectors. However, there is a risk of information loss produced by the generation of multiple detectors. Hence, in this work, we propose a double-level ensemble strategy to improve the generalization ability of the algorithm and reduce the risk of information loss. In short, our double-level ensemble strategy consists of two phases: internal integration and external integration. In the following sections, we describe these two parts in detail.

3.4.1. Internal integration

As shown in Sections 3.2 and 3.3, the original feature space is divided into two subspaces, $D_{l1}$ and $D_{l2}$, which are simultaneously used for training the linear regression model. Consequently, we obtain two base models, $f_1$ and $f_2$, in each cycle. Next, we apply our combination strategy in each cycle to combine the results of these two base models. Because anomalous samples deviate from the rest of the data, we use the difference between the pseudo-label and the predicted value of the base model as a temporary outlier score:

$$s_i(x) = |y - f_i(x)|, \qquad (4)$$

where $x$ represents a sample from dataset $D$, $y$ represents the pseudo-label of sample $x$, $f_i$ represents the $i$th base model and $s_i$ represents the temporary outlier score based on the $i$th base model. It is obvious that the larger the difference value is, the greater the probability of an anomaly is. This idea is illustrated in Fig. 4, which shows the result of linear regression training based on an unsupervised method. The data samples are generated from a two-dimensional synthetic dataset, which can be written as $x_2 = a + b x_1 + w$, where $w \sim \mathcal{N}(0, \sigma^2)$. In this illustration, we use $\sigma^2 = 0.02$, $a = 0.1$, $b = 2$ and generate an outlier sample $o$ in this synthetic dataset. The green line represents the prediction of attribute $x_2$ from attribute $x_1$ by the linear regression. The difference between the outlier sample $o$ and the predicted value is larger than that between the normal sample $n$ and the predicted value.

Fig. 4. Example illustrating the relationship between the difference value and the probability of anomaly.

Therefore, we rank all samples based on the temporary outlier scores, where the sample with the maximum score is assigned the rank of 1, the second value is assigned the rank of 2, and so forth. In this way, we obtain two rank lists, $r_1$ and $r_2$, based on the corresponding outlier scores $s_1$ and $s_2$ in Eq. (4), respectively. Correspondingly, to unify the outlier scores of these two base models, we generate new outlier scores $\bar{s}_1$ and $\bar{s}_2$ by inverting the corresponding rank lists $r_1$ and $r_2$, respectively, which is consistent with the core idea of inverse rank aggregation [35]:

$$\bar{s}_i(x) = \frac{1}{r_i(x)}, \qquad (5)$$

where $r_i(x)$ represents the position of the sample $x$ in the rank list generated from the $i$th base model. Finally, we combine these two anomaly scores $\bar{s}_1$ and $\bar{s}_2$ from Eq. (5) into a two-dimensional anomaly score vector $V = (\bar{s}_1, \bar{s}_2)^T$. On the one hand, we combine the results of two base models into a two-dimensional anomaly score vector; on the other hand, we construct a ranked list of all samples to support the external integration.

3.4.2. External integration

After completing the $l$ cycles, we obtain an anomaly score vector group $\tilde{V}$ of size $l$. Assume that this vector group is written as $\tilde{V} = (V_1, \dots, V_l)$, where $V_i = (\bar{s}_{i1}, \bar{s}_{i2})^T$ represents the two-dimensional anomaly score vector obtained in the $i$th cycle. In this section, we focus on proposing an effective ensemble strategy to integrate these multiple two-dimensional vectors. Based on the ranked lists of samples obtained in the previous section, we can calculate the weight of each sample:

$$w_j^{(i)}(x_j) = 1 - \frac{R_1(x_j) + R_2(x_j)}{2}, \qquad (6)$$

where the function $R_k(x_j)$ is defined as:

$$R_k(x_j) = \frac{(r_k^{(i)}(x_j) - 1)(1 - p)}{n}.$$

Here, $x_j$ represents the $j$th sample, $w_j^{(i)}$ represents the weight of the $j$th sample in the $i$th cycle, $r_k^{(i)}(x_j)$ represents the rank position of sample $x_j$ provided by rank list $r_k$ in the $i$th cycle and $p$ is a parameter used to fit the minimum value of the weight. For example, if the sample ranks first in both ranked lists (i.e., $r_k^{(i)}(x_j) = 1$), then the weight $w_j^{(i)}$ is equal to the maximum (i.e., $w_j^{(i)}(x_j) = 1$). However, if the sample ranks last in both ranked lists (i.e., $r_k^{(i)}(x_j) = n$), then the weight value $w_j^{(i)}$ will be close to the minimum (i.e., $w_j^{(i)}(x_j) = p$).

Based on Eq. (6), we use the weighted arithmetic mean as the best estimate for the score vector of an anomaly sample. The specific equation is as follows:

$$\bar{V}_i = \Big( \sum_{j=1}^{n} P(w_j^{(i)}(x_j)) \Big)^{-1} \sum_{j=1}^{n} P(w_j^{(i)}(x_j)) V_i(x_j), \qquad (7)$$

where the function $P(w_j^{(i)}(x_j))$ is defined as:

$$P(w_j^{(i)}(x_j)) = \begin{cases} w_j^{(i)}(x_j), & w_j^{(i)}(x_j) > 0, \\ 0, & \text{otherwise}. \end{cases}$$

Proposition 1. Assume $\tilde{V} = (s_1(x), s_2(x))$ is the anomaly score vector for an unknown anomaly sample and $V$ is the set of anomaly score vectors for a known sample set, where $V = (V_1(x_1), \dots, V_n(x_n))^T$ and $V_i(x_i) = (\bar{s}_{i1}(x_i), \bar{s}_{i2}(x_i))$. Then, the weighted arithmetic mean of this known set can be motivated as the least squares estimate of the unknown sample (i.e., $\bar{V} = \hat{V}$).

Proof. Let $Z = (Z_1, \dots, Z_{2n})^T$ be the correction vector for the outlier score $V$ and the corresponding weight matrix be $W = (W_1, W_2, \dots, W_{2n})$. Then, the error equation is as follows:

$$Z = \begin{bmatrix} I \\ I \\ \vdots \\ I \end{bmatrix} \hat{V} - V.$$

Because the purpose of least squares estimation is to minimize $\sum_i W_i Z_i^2$, we can obtain the following equation:

$$\begin{bmatrix} I \\ I \\ \vdots \\ I \end{bmatrix}^T \operatorname{diag}(W_1, \dots, W_{2n}) \begin{bmatrix} I \\ I \\ \vdots \\ I \end{bmatrix} \hat{V} - \begin{bmatrix} I \\ I \\ \vdots \\ I \end{bmatrix}^T \operatorname{diag}(W_1, \dots, W_{2n}) V = 0.$$

Then,

$$\hat{V} = \Big( \sum_{i=1}^{2n} W_i \Big)^{-1} \sum_{i=1}^{2n} W_i V_i = \bar{V}.$$

The proof is completed.

Note that we use the cosine metric to calculate the final outlier score. The score of a sample represents the similarity between that sample and the value of $\bar{V}_i$ in Eq. (7); the larger the value of the score, the greater the probability of being an anomalous sample:

$$S_i(x_j) = \frac{\bar{V}_i \cdot V_i(x_j)}{|\bar{V}_i| \, |V_i(x_j)|},$$

where $S_i(x_j)$ represents the final outlier score of the sample $x_j$ based on the result obtained in the $i$th cycle. Finally, after obtaining these outlier scores, we use the maximum value over the multiple cycles as the final outlier score of each sample.
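A minimal sketch of the internal and external integration follows, under the same assumptions as the previous snippets; the choice p = 0.1 and the use of P(w) = w in Eq. (7) are our simplifications, not values given by the paper.

import numpy as np

def inverse_rank_scores(s):
    # Eq. (5): rank by raw score (largest first) and return 1 / rank.
    order = np.argsort(-s)
    ranks = np.empty(len(s), dtype=int)
    ranks[order] = np.arange(1, len(s) + 1)
    return 1.0 / ranks, ranks

def internal_integration(s1, s2):
    # Combine the two base models of one cycle into 2-D score vectors V.
    v1, r1 = inverse_rank_scores(s1)
    v2, r2 = inverse_rank_scores(s2)
    return np.stack([v1, v2], axis=1), r1, r2

def external_integration(V, r1, r2, p=0.1):
    # Eq. (6): rank-based weights, near 1 for top-ranked (most anomalous)
    # samples and near p for bottom-ranked ones.
    n = V.shape[0]
    w = 1.0 - ((r1 - 1) + (r2 - 1)) * (1.0 - p) / (2.0 * n)
    # Eq. (7) with P(w) = w: weighted-mean estimate of the anomaly vector.
    V_bar = (w[:, None] * V).sum(axis=0) / w.sum()
    # Cosine similarity to V_bar gives the per-cycle outlier score S_i.
    return (V @ V_bar) / (np.linalg.norm(V, axis=1) * np.linalg.norm(V_bar))

# The final DELR score of a sample is the maximum of its per-cycle scores.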

4. Experiment results and discussion

To verify the performance of the DELR algorithm, we select 10 benchmark datasets that are widely used for performance evaluation in anomaly detection. Based on these datasets, we compare the DELR algorithm with meta-anomaly algorithms, state-of-the-art anomaly ensemble algorithms, and traditional classification algorithms.


4.1. Method

The training of the DELR algorithm is independent of ground truth labels; in other words, DELR is an unsupervised anomaly algorithm. However, there are two differences compared to other unsupervised anomaly algorithms. The first is that we use linear regression in an unsupervised way as the base detector for the anomaly ensemble. The second is that we propose a double-level ensemble strategy to combine the results of multiple base detectors. In this section, we evaluate both aspects.

First, distance-based anomaly detection algorithms are the most popular unsupervised methods due to their simplicity and effectiveness. These algorithms can be divided into two main categories: global algorithms and local algorithms [43]. Here, kNN is a representative global algorithm, while LOF is a representative local algorithm; both are often used as base detectors in anomaly ensembles. Furthermore, there are many variations that improve these two methods. For example, outlier detection using in-degree number (ODIN) calculates the anomaly score as the inverse of the in-degree in the kNN graph, and COF is a variant of LOF that proposes a new way to select neighborhoods based on the minimum spanning tree. These four algorithms are typical unsupervised meta-anomaly algorithms and have been widely used in various applications. Therefore, we compare these algorithms with DELR in the first part of our experiments. Here, we set the number of nearest neighbors k between 1 and 10 with a step of 2 in these algorithms.

Second, as mentioned before, there are two main categories of anomaly ensemble algorithms: sequential ensembles and independent ensembles [11]. The former is difficult to explore due to the lack of ground truth labels during the intermediate phases of anomaly detection. In contrast, the latter is easier to explore due to the independent training of multiple base detectors and the absence of an intermediate phase. The DELR algorithm is also an independent ensemble algorithm. To evaluate our algorithm, we select five state-of-the-art anomaly ensemble algorithms, namely, feature bagging (FG) [18], subsampling (SB) [12], rotated bagging (RB) [19], the cumulative agreement rates ensemble (CARE) [15] and attribute-wise learning for scoring outliers (ALSO) [28]. Feature bagging and subsampling are often used to induce diversity into anomaly ensembles. RB is an improved algorithm based on the idea of random sampling. CARE is a combination of sequential and independent ensembles, which improves the accuracy of the algorithm by eliminating anomaly samples and improves the generalization ability by combining the results of multiple base detectors. ALSO is designed to decompose anomaly detection problems into multiple supervised learning problems and has been verified to be effective in detecting anomalies hidden in a low-dimensional data space. Here, we set the sampling ratio α between 0.1 and 0.5 with a step of 0.1 in SB and the number of nearest neighbors k between 1 and 10 with a step of 2 in the CARE and RB algorithms.

Finally, the anomaly detection problem can be seen as a special classification problem that contains two categories: an abnormal class and a normal class. To further evaluate our algorithm, we select SVC [44] and random forest [45] for comparison; both methods perform well in multiple classification problems.
In addition, we use the receiver operating characteristic (ROC) curve [46], which is widely used in anomaly detection, as our performance measure. The horizontal and vertical axes of this curve represent the false positive rate and true positive rate, respectively. The area under the curve (AUC) [46] is used to evaluate each algorithm; the AUC value of a perfect algorithm is equal to one. To further evaluate the effectiveness of our algorithm, we use accuracy as another measurement; a minimal sketch of this evaluation protocol is given below.
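As a brief illustration (our own sketch; the paper does not provide code), AUC can be computed from the continuous outlier scores directly, while accuracy requires thresholding the scores with a user-defined anomaly ratio, as done for DELR in Section 4.3.3:

import numpy as np
from sklearn.metrics import roc_auc_score, accuracy_score

def evaluate(y_true, scores, anomaly_ratio=0.05):
    # AUC uses the raw scores; accuracy flags the top `anomaly_ratio`
    # fraction of samples as anomalies (label 1) before comparing.
    auc = roc_auc_score(y_true, scores)
    threshold = np.quantile(scores, 1.0 - anomaly_ratio)
    y_pred = (scores > threshold).astype(int)
    return auc, accuracy_score(y_true, y_pred)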

Table 1. Datasets for the evaluation.

Dataset        Instances   Outliers   Attributes
WBC                  223         10            9
PageBlocks          5013        100           10
Lymphography         148          6           18
Hepatitis             70          3           19
Cardio              1831        176           21
WDBC                 367         10           30
Satimage-2          5803         71           36
SpamBase            2844         56           57
Musk                3062         97          166
Arrhythmia           248          4          259

Meanwhile, to examine the generalization ability of the algorithms, we use the Nemenyi test [47], a post-hoc companion of the Friedman test, which can be used to test the significance of differences between multiple algorithms. To analyze the information loss, we use the Shannon–Spearman measure [48].

4.2. Datasets

There is no specific public benchmark database for anomaly detection. We select 10 datasets from the UCI repository involving two-class and multiclass datasets [49]. However, these datasets are often used for balanced classification, while anomaly detection is an imbalanced classification problem. Therefore, further preprocessing of these datasets is necessary. There are two main preprocessing methods: downsampling and normalization [43]. In the following, we present detailed descriptions of the datasets.

The Arrhythmia dataset contains 14 categories, in which class 1 represents healthy people and the rest are patients. Herein, the samples belonging to class 1 are marked as the normal class, while an anomaly class is randomly downsampled from the rest of the samples. The PageBlocks dataset is used to identify whether each block of a document is text, an image, or something else. There are four main categories, in which the text class is marked as the normal class, and the rest form the anomaly class. For the SpamBase dataset, the spam class is the anomaly class, while the no-spam class is the normal class. The Hepatitis dataset contains two categories, DIE and LIVE, in which LIVE is marked as the normal class, while DIE constitutes the anomaly class by downsampling. The WBC dataset records the attributes of both benign and malignant cancer; malignant cancer is marked as the anomaly class and benign cancer as the normal class. The WDBC dataset is similar to WBC, except that it has different feature dimensionality. The Lymphography dataset is a multiclass dataset that is often used in anomaly detection; among its four categories, classes 1 and 4 constitute the anomaly class, and the remainder forms the normal class. The Cardio dataset contains three categories, normal, suspect and pathologic; we abandon the suspect class and mark the pathologic class as the anomaly class to obtain an imbalanced dataset. The Musk dataset is a two-class dataset, in which the Musk class is marked as the anomaly class and the non-Musk class is the normal class. For the Satimage-2 dataset, we merge the training set and test set; some of the samples belonging to class 4 are randomly sampled as the anomaly class, and the rest form the normal class.

Notably, we focus on two aspects of data preprocessing: sample size and attribute size. In terms of the sample size, downsampling is used to select the normal and anomaly samples and to avoid selecting duplicate samples. In terms of the attribute size, if the missing ratio of an attribute is greater than 10%, then we abandon this attribute; otherwise, we abandon the samples missing this attribute. A minimal sketch of this rule is given below. All of the details of the datasets are shown in Table 1.
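The attribute-size rule above can be sketched as follows (an assumed NumPy representation with NaNs marking missing values; not code from the paper):

import numpy as np

def preprocess(X, max_missing_ratio=0.10):
    # Drop attributes whose missing ratio exceeds 10%, then drop the samples
    # that still miss values in the retained attributes.
    missing = np.isnan(X)
    keep_attrs = missing.mean(axis=0) <= max_missing_ratio
    X = X[:, keep_attrs]
    return X[~np.isnan(X).any(axis=1)]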


Fig. 5. ∆-Average AUC values from DELR to the comparison meta-algorithms. Note: The DELR algorithm exhibits better performance than the comparison algorithm as long as the average AUC difference is greater than 0 on the y-axis. The comparison algorithms without randomness do not require a duplicate and use the AUC value of a single run instead of the average AUC value.

Fig. 6. ∆-Average AUC values from DELR to the anomaly ensemble algorithms. Note: The DELR algorithm demonstrates better performance than the comparison algorithm as long as the average AUC difference is greater than 0 on the y-axis. These comparison algorithms also perform 25 independent runs for each dataset and use the average AUC for comparison.

4.3. Results and observations

average AUC difference between DELR and kNN demonstrate a similar problem. This dataset has enough samples; however, it is a high-dimensional dataset, so it is easy for the kNN to fall into the curse of dimensionality. Although, in some cases, the average AUC differences in the lower-dimensionality datasets are larger than those in the high-dimensional datasets, it does not mean that DELR performs poorly with the high-dimensionality dataset. In Fig. 5, DELR exhibits better performance than the comparison algorithm as long as the average AUC difference is greater than 0 on the y-axis. In Table 2, we provide the average and the standard error for AUC results of the DELR algorithm. √ The specific estimation formula of the standard error is s/ n, where s is the standard deviation of AUC results, and n is the number of AUC results and is equal to 25. It is obvious that DELR is among the top 2 performers on 8/10 datasets. In general, DELR has demonstrated more stable performance over all datasets. In contrast, the comparison algorithms have performed worse in this regard. The kNN, LOF, COF and ODIN algorithms, all have AUC values less than 0.5 on some datasets, which indicates that they show poorer performance than a common weak classifier. In short, we can summarize from Fig. 5 and Table 2 that DELR has exhibited a significant performance improvement over the meta-algorithms.

DELR is compared with the meta-anomaly algorithms, stateof-the-art anomaly ensemble algorithms, and traditional classification algorithms. We then examine the generalization ability, information loss, and parameter sensitivity of DELR. 4.3.1. DELR vs. meta-anomaly detection algorithms To evaluate the performance of DELR relative to the comparison meta-algorithms, we use the ∆ Average AUC values from DELR to these four meta-algorithms. The bar indicates the average AUCDELR − average AUCbaseline . These meta-algorithms without randomness do not require a duplicate and use the AUC value of a single run instead of the average AUC value, while DELR uses the average AUC value of 25 independent runs for comparison. The DELR is better than the others when this difference is greater than 0. In Fig. 5, the x-axis shows the datasets used in the experiment, which are ranked in ascending order based on the size of the attribute, while the y-axis shows the average AUC difference between DELR and the comparison algorithm. It is obvious that the fluctuation of the feature dimensionality has little effect on the performance of our algorithm. In Fig. 5, the DELR algorithm performs better than all comparison algorithms on 8/10 datasets and performs slightly worse on the Lymphography and WDBC datasets. However, the average AUC values of DELR on these two datasets are 0.9209 and 0.9252. These values indicate that the performance is close to others, although the DELR performs worse. There are fewer negative average AUC differences than positive average AUC differences. In the upper right of Fig. 5, we provide the average AUC difference averaged over all the datasets. This result indicates that DELR performs well in most cases but in other cases performs poorly with approximate results. In particular, based on the Hepatitis dataset, the AUC differences between DELR and the comparison algorithms, kNN, LOF, and COF, are up to 38.65%, 36.16%, and 48.10%, respectively. These values indicate that the performance of kNN, LOF, and COF on this dataset is almost close to a common weak classifier. In our view, the main reason is that this dataset has few and sparse samples, and these comparison algorithms demonstrate poor performance on datasets without enough samples. Meanwhile, based on the Musk dataset, the

4.3.2. DELR vs. state-of-the-art anomaly ensembles

Similar to the above section, the average AUC differences of DELR compared with five state-of-the-art anomaly ensemble algorithms are shown in Fig. 6. The average AUC values used to calculate each difference are the best among the parameter settings of the anomaly ensemble algorithms. In the upper right of Fig. 6, it is obvious that the average AUC difference averaged over all the datasets is greater than 0, which indicates that DELR has exhibited better average performance than all comparison algorithms. The DELR algorithm performs slightly worse mainly on the Lymphography and WDBC datasets. However, the negative average AUC differences remain small (less than 0.1), which indicates that the performance of DELR is close to the comparison algorithms when it performs worse.

In Fig. 6, the DELR algorithm is significantly superior to the FG, RB and ALSO algorithms and exhibits better performance than these algorithms on half of the datasets.


Table 2. The average AUC results of DELR vs. meta-anomaly detection algorithms. Note: The top two performers for each dataset are boldfaced. The number in parentheses is the standard error.

Method       WBC     PageBlocks  Lymphography  Hepatitis  Cardio  WDBC    Satimage-2  SpamBase  Musk    Arrhythmia
kNN(k=1)     0.9542  0.6229      0.9900        0.3806     0.5285  0.9342  0.8053      0.7953    0.0650  0.7408
kNN(k=3)     0.9824  0.6371      0.9930        0.4876     0.5774  0.9821  0.8956      0.7829    0.0980  0.7561
kNN(k=5)     0.9873  0.6427      0.9930        0.4876     0.6429  0.9902  0.9178      0.7801    0.1387  0.7613
kNN(k=7)     0.9852  0.6481      0.9947        0.5174     0.6775  0.9923  0.9282      0.7791    0.1787  0.7587
kNN(k=9)     0.9862  0.6512      0.9965        0.5224     0.6975  0.9923  0.9363      0.7777    0.2193  0.7664
LOF(k=1)     0.4695  0.6058      0.3175        0.3507     0.5053  0.4549  0.5122      0.5861    0.4669  0.7259
LOF(k=3)     0.5136  0.7119      0.8093        0.3731     0.5184  0.7646  0.6248      0.5835    0.5042  0.6875
LOF(k=5)     0.6223  0.8078      0.8703        0.4378     0.5062  0.9588  0.6285      0.5527    0.5467  0.7234
LOF(k=7)     0.6340  0.8274      0.8985        0.5473     0.5959  0.9807  0.6207      0.5801    0.4400  0.7234
LOF(k=9)     0.6540  0.8575      0.9419        0.5174     0.6059  0.9873  0.6006      0.5786    0.4311  0.7080
COF(k=1)     0.4704  0.6058      0.3175        0.3507     0.5069  0.4549  0.5256      0.5861    0.4669  0.7259
COF(k=3)     0.5634  0.6919      0.7365        0.2910     0.4990  0.5203  0.5762      0.6361    0.4843  0.6752
COF(k=5)     0.5293  0.7291      0.8744        0.2687     0.4962  0.8500  0.5662      0.5811    0.5309  0.6265
COF(k=7)     0.6141  0.7668      0.8967        0.4279     0.5343  0.9097  0.5566      0.5962    0.5175  0.6701
COF(k=9)     0.6423  0.7915      0.9495        0.3632     0.5654  0.9559  0.5645      0.6100    0.4922  0.6650
ODIN(k=1)    0.5080  0.5332      0.5123        0.4303     0.4744  0.4429  0.4908      0.5888    0.4548  0.6014
ODIN(k=3)    0.5643  0.6440      0.7324        0.3930     0.5209  0.6041  0.5150      0.6138    0.5087  0.6363
ODIN(k=5)    0.6127  0.7079      0.7911        0.5846     0.5368  0.6961  0.5207      0.6131    0.5077  0.7346
ODIN(k=7)    0.6540  0.7546      0.8738        0.5423     0.5503  0.7531  0.5309      0.6430    0.4692  0.7454
ODIN(k=9)    0.6763  0.7722      0.9149        0.6468     0.5619  0.8304  0.5367      0.6414    0.4574  0.7556
DELR         0.9940  0.9096      0.9209        0.9089     0.9078  0.9252  0.9806      0.9455    0.9436  0.9424
(std. err.) (±0.0004) (±0.0003) (±0.0041) (±0.0051) (±0.0035) (±0.0015) (±0.0008) (±0.0041) (±0.0078) (±0.0044)

Fig. 7. ROC curve for the Cardio dataset.

Fig. 8. ROC curve for the Satimage-2 dataset.

Although the CARE and SB algorithms perform better on some datasets, the average AUC differences averaged over all the datasets maintain a positive value. These positive values indicate that DELR shows more stable performance than the comparison algorithms. In particular, CARE and SB perform poorly on the Hepatitis dataset, with average AUC values of 0.5452 and 0.4975, respectively, while DELR provides an improvement with an average AUC value of 0.9089. In addition, we present two ROC curves for a single run in Figs. 7 and 8 based on the Cardio and Satimage-2 datasets, respectively. As we can see in Fig. 7, DELR performs the best on the Cardio dataset and reaches the optimal true positive rate earlier than the other comparison algorithms. Although the CARE algorithm reaches the optimal value at approximately the same time, it has a lower AUC value. As we can see in Fig. 8, the performance of DELR is close to that of the CARE and SB algorithms. In short, we conclude that the DELR algorithm exhibits similar performance to the CARE and SB algorithms on some datasets but demonstrates significant performance improvements on most datasets compared with FG, RB and ALSO. We provide the average and the standard error of the AUC results in Table 3.

The specific estimation formula of the standard error is s/√n, where s is the standard deviation of the AUC results, and n is the number of AUC results, equal to 25. It is obvious that DELR is among the top 2 performers on 6/10 datasets. In general, the DELR algorithm has better generalization ability: the difference between its maximum and minimum average AUC values is less than 0.1. In contrast, for the CARE algorithm, this difference is greater than 0.5; it performs poorly with an average AUC value of 0.3296 on the Hepatitis dataset but performs well with an average AUC value of 0.9962 on the Musk dataset. The SB algorithm demonstrates a similar problem. For the ALSO algorithm on the Arrhythmia dataset, the weights of all detectors are equal to zero, so we do not show this unreasonable result. In short, we can summarize that the DELR algorithm has demonstrated significant performance improvement over the anomaly ensemble algorithms when it performs well, and has exhibited similar performance when it performs worse.

4.3.3. DELR vs. traditional classification algorithms

Unlike the above experiments, we use accuracy rather than AUC as the measurement method in this section.


Table 3. The average AUC results of DELR vs. anomaly ensemble algorithms. Note: The top two performers for each dataset are boldfaced. The number in parentheses is the standard error.

Method      WBC               PageBlocks        Lymphography      Hepatitis         Cardio
CARE(k=1)   0.8876 (±0.0204)  0.7072 (±0.0035)  0.9106 (±0.0165)  0.3296 (±0.0184)  0.6270 (±0.0165)
CARE(k=3)   0.9610 (±0.0122)  0.7905 (±0.0026)  0.9742 (±0.0031)  0.5271 (±0.0067)  0.7786 (±0.0070)
CARE(k=5)   0.9815 (±0.0046)  0.8201 (±0.0024)  0.9871 (±0.0020)  0.5452 (±0.0061)  0.7866 (±0.0098)
CARE(k=7)   0.9820 (±0.0047)  0.8374 (±0.0021)  0.9868 (±0.0018)  0.5434 (±0.0045)  0.8265 (±0.0092)
CARE(k=9)   0.9831 (±0.0039)  0.8573 (±0.0025)  0.9908 (±0.0010)  0.5208 (±0.0052)  0.8648 (±0.0073)
SB(a=0.1)   0.9909 (±0.0003)  0.6277 (±0.0008)  0.9863 (±0.0004)  0.4171 (±0.0079)  0.8798 (±0.0015)
SB(a=0.2)   0.9918 (±0.0003)  0.6405 (±0.0007)  0.9943 (±0.0003)  0.4929 (±0.0078)  0.8221 (±0.0019)
SB(a=0.3)   0.9913 (±0.0003)  0.6450 (±0.0008)  0.9970 (±0.0003)  0.4965 (±0.0052)  0.7821 (±0.0013)
SB(a=0.4)   0.9901 (±0.0002)  0.6478 (±0.0006)  0.9977 (±0.0002)  0.4888 (±0.0030)  0.7635 (±0.0016)
SB(a=0.5)   0.9892 (±0.0004)  0.6490 (±0.0006)  0.9978 (±0.0002)  0.4975 (±0.0023)  0.7526 (±0.0012)
RB(k=1)     0.5317 (±0.0076)  0.6612 (±0.0016)  0.7219 (±0.0126)  0.3355 (±0.0104)  0.4972 (±0.0016)
RB(k=3)     0.6970 (±0.0053)  0.7673 (±0.0018)  0.9204 (±0.0031)  0.4639 (±0.0074)  0.5064 (±0.0041)
RB(k=5)     0.7934 (±0.0033)  0.8370 (±0.0012)  0.9366 (±0.0017)  0.5132 (±0.0063)  0.5620 (±0.0023)
RB(k=7)     0.7530 (±0.0040)  0.8659 (±0.0010)  0.9149 (±0.0024)  0.5232 (±0.0064)  0.6085 (±0.0021)
RB(k=9)     0.8616 (±0.0044)  0.8882 (±0.0016)  0.9644 (±0.0007)  0.5087 (±0.0038)  0.6676 (±0.0027)
ALSO        0.9186 (±0.0021)  0.9587 (±0.0003)  0.9448 (±0.0047)  0.6589 (±0.0117)  0.8606 (±0.0002)
FG          0.5797 (±0.0201)  0.8276 (±0.0039)  0.8492 (±0.0087)  0.5924 (±0.0078)  0.5132 (±0.0042)
DELR        0.9940 (±0.0004)  0.9096 (±0.0003)  0.9209 (±0.0041)  0.9089 (±0.0051)  0.9078 (±0.0035)

Method      WDBC              Satimage-2        SpamBase          Musk              Arrhythmia
CARE(k=1)   0.9258 (±0.0020)  0.9946 (±0.0009)  0.5413 (±0.0070)  0.8122 (±0.0286)  0.7293 (±0.0052)
CARE(k=3)   0.9331 (±0.0023)  0.9955 (±0.0000)  0.5873 (±0.0095)  0.9271 (±0.0184)  0.7486 (±0.0051)
CARE(k=5)   0.9690 (±0.0009)  0.9947 (±0.0000)  0.5964 (±0.0077)  0.9886 (±0.0077)  0.7703 (±0.0055)
CARE(k=7)   0.9812 (±0.0007)  0.9942 (±0.0000)  0.5856 (±0.0052)  0.9962 (±0.0035)  0.7754 (±0.0055)
CARE(k=9)   0.9852 (±0.0005)  0.9940 (±0.0000)  0.5653 (±0.0095)  0.9926 (±0.0071)  0.7901 (±0.0040)
SB(a=0.1)   0.9806 (±0.0006)  0.9985 (±0.0000)  0.7748 (±0.0005)  0.9896 (±0.0033)  0.8144 (±0.0024)
SB(a=0.2)   0.9870 (±0.0004)  0.9928 (±0.0003)  0.7699 (±0.0007)  0.8909 (±0.0042)  0.7927 (±0.0024)
SB(a=0.3)   0.9901 (±0.0002)  0.9830 (±0.0005)  0.7709 (±0.0007)  0.8134 (±0.0026)  0.7832 (±0.0033)
SB(a=0.4)   0.9913 (±0.0001)  0.9754 (±0.0004)  0.7703 (±0.0005)  0.7540 (±0.0032)  0.7889 (±0.0015)
SB(a=0.5)   0.9922 (±0.0001)  0.9669 (±0.0005)  0.7716 (±0.0003)  0.6577 (±0.0036)  0.7948 (±0.0020)
RB(k=1)     0.3354 (±0.0052)  0.5721 (±0.0059)  0.6103 (±0.0028)  0.2957 (±0.0008)  0.7146 (±0.0112)
RB(k=3)     0.8144 (±0.0052)  0.6023 (±0.0034)  0.5869 (±0.0020)  0.4797 (±0.0057)  0.7244 (±0.0040)
RB(k=5)     0.9634 (±0.0006)  0.6170 (±0.0016)  0.5689 (±0.0018)  0.4663 (±0.0027)  0.7330 (±0.0030)
RB(k=7)     0.9797 (±0.0004)  0.6347 (±0.0024)  0.6135 (±0.0017)  0.3956 (±0.0039)  0.6921 (±0.0043)
RB(k=9)     0.9877 (±0.0002)  0.5989 (±0.0018)  0.6219 (±0.0010)  0.3803 (±0.0032)  0.7320 (±0.0032)
ALSO        0.8533 (±0.0021)  0.7887 (±0.0005)  0.4924 (±0.0214)  0.8510 (±0.0011)  –
FG          0.9441 (±0.0007)  0.6127 (±0.0001)  0.5921 (±0.0038)  0.5751 (±0.0014)  0.6817 (±0.0019)
DELR        0.9252 (±0.0015)  0.9806 (±0.0008)  0.9455 (±0.0041)  0.9436 (±0.0078)  0.9424 (±0.0044)

Both SVC and random forest algorithms can directly output classification results, and accuracy is an intuitive measure of their effectiveness. Following common experimental settings, for the SVC and random forest algorithms we randomly select half of the abnormal and normal samples as training data and use the remaining samples as test data. The DELR algorithm, in contrast, outputs anomaly scores, which can be converted to classification results by setting a user-defined anomaly ratio. We therefore select the anomaly ratio α ranging from 0.01 to 0.1 with a step of 0.01. Both DELR and random forest perform 25 independent runs for each dataset, while SVC, which involves no randomness, requires no repetition and uses the accuracy of a single run instead of an average.

In Table 4, we provide the average accuracy and the standard error. The standard error is again estimated as s/√n, where s is the standard deviation of the accuracy results and n = 25 is the number of runs. DELR is among the top 2 performers on 8/10 datasets. Although the SVC and random forest algorithms show similar performance, DELR has the advantage of being independent of data labels, whereas SVC and random forest can only work when labeled data are available. Overall, DELR shows measurable improvement over the classification algorithms.
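The score-to-label conversion described above is simple to state in code. The following is a minimal sketch under our own naming, assuming higher scores indicate more anomalous samples (not the authors' implementation):

```python
import numpy as np

def scores_to_labels(scores, alpha):
    """Flag the top-alpha fraction of samples (highest anomaly
    scores) as anomalies (label 1) and the rest as normal (0)."""
    scores = np.asarray(scores)
    n_anomalies = max(1, int(round(alpha * len(scores))))
    labels = np.zeros(len(scores), dtype=int)
    labels[np.argsort(scores)[-n_anomalies:]] = 1
    return labels

def accuracy(y_true, scores, alpha):
    """Accuracy after thresholding scores at a user-defined ratio."""
    return np.mean(scores_to_labels(scores, alpha) == np.asarray(y_true))

# e.g., sweep the anomaly ratio from 0.01 to 0.1 as in Table 4:
# accs = [accuracy(y, s, a) for a in np.arange(0.01, 0.11, 0.01)]
```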

4.3.4. Generalization ability analysis

To show that DELR demonstrates significant improvements, we conducted comparisons of multiple algorithms on 10 datasets. Here, to further evaluate the generalization ability of DELR, we use the Friedman test and the post-hoc test [50] to compare 8 algorithms, namely, CARE, SB, RB, kNN, LOF, COF, ODIN, and DELR, on the 10 datasets. We then compare the average AUC values of these algorithms on the 10 datasets.

The Friedman test and the post-hoc test are effective nonparametric statistical tests for comparing multiple algorithms over multiple datasets. The Friedman test checks whether the null hypothesis, which states that all the algorithms are equivalent, can be rejected. In our experiment, the test first ranks all algorithms for each dataset: the algorithm with the maximum AUC value is assigned the rank of 1, the second-best the rank of 2, and so forth. The test then examines whether the measured average ranks deviate significantly from the mean rank, which is 4.5 in our case. With 8 algorithms and 10 datasets, the F_F statistic is distributed according to the F distribution with 8 − 1 = 7 and (8 − 1) × (10 − 1) = 63 degrees of freedom. The critical value of F(7, 63) at significance level 0.1 is 1.814, and since the measured statistic exceeds it, we reject the null hypothesis.

The post-hoc test is then used to measure the difference between pairs of algorithms. The performance of two algorithms is significantly different when the difference between their average ranks is greater than or equal to the critical difference; in our experiment, the critical difference is 3.0452 at significance level 0.1. Fig. 9 graphically shows the results of six Nemenyi tests as so-called critical difference diagrams, which sequentially depict the comparison results for different parameter combinations of the compared algorithms. In these six Nemenyi tests, DELR ranks as the top algorithm with average ranks of 1.8, 1.8, 2.2, 2.0, 2.4, and 2.6. All tests have one thing in common: DELR, CARE, SB, and kNN rank among the top 4 and show better performance than RB, COF, LOF, and ODIN.
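To make the test mechanics concrete, the sketch below computes the average ranks, the F_F statistic, and the Nemenyi critical difference from an AUC matrix. The function name is ours, and the q value of 2.780 is taken from the standard Studentized-range table for k = 8 classifiers at significance level 0.1 [50]; this is an illustrative sketch, not the authors' code:

```python
import numpy as np
from scipy.stats import rankdata

def friedman_nemenyi(auc, q_alpha=2.780):
    """Friedman average ranks, Iman-Davenport F_F statistic, and
    Nemenyi critical difference for an (N datasets x k algorithms)
    matrix of AUC values."""
    n, k = auc.shape
    # Rank 1 goes to the largest AUC on each dataset (ties averaged).
    ranks = np.vstack([rankdata(-row) for row in auc])
    avg_ranks = ranks.mean(axis=0)
    # Friedman chi-square statistic over the average ranks.
    chi2 = 12.0 * n / (k * (k + 1)) * (
        np.sum(avg_ranks ** 2) - k * (k + 1) ** 2 / 4.0)
    # Iman-Davenport correction, distributed as F(k-1, (k-1)(N-1)).
    f_f = (n - 1) * chi2 / (n * (k - 1) - chi2)
    # Nemenyi critical difference: CD = q_alpha * sqrt(k(k+1)/(6N)).
    cd = q_alpha * np.sqrt(k * (k + 1) / (6.0 * n))
    return avg_ranks, f_f, cd

# With k = 8 algorithms and N = 10 datasets, cd evaluates to about
# 3.045, matching the critical difference quoted in the text.
```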


Table 4
The average accuracy results of DELR vs. the traditional classification algorithms. Note: The top two performers for each dataset are boldfaced. The number in parentheses is the standard error.

| Method | WBC | PageBlocks | Lymphography | Hepatitis | Cardio | WDBC | Satimage-2 | SpamBase | Musk | Arrhythmia |
|---|---|---|---|---|---|---|---|---|---|---|
| DELR(a=0.01) | 0.9312 (±0.0009) | **0.9973** (±0.0001) | 0.9568 (±0.0021) | **0.9847** (±0.0030) | **0.9987** (±0.0001) | 0.9604 (±0.0007) | **0.9993** (±0.0001) | 0.9648 (±0.0005) | 0.9453 (±0.0011) | 0.9745 (±0.0010) |
| DELR(a=0.02) | 0.9377 (±0.0009) | **0.9928** (±0.0000) | 0.9584 (±0.0026) | **0.9847** (±0.0030) | **0.9961** (±0.0003) | 0.9659 (±0.0013) | 0.9982 (±0.0002) | 0.9665 (±0.0010) | 0.9484 (±0.0016) | 0.9697 (±0.0013) |
| DELR(a=0.03) | 0.9445 (±0.0016) | 0.9887 (±0.0000) | 0.9605 (±0.0035) | 0.9671 (±0.0020) | 0.9936 (±0.0006) | 0.9738 (±0.0014) | 0.9947 (±0.0002) | 0.9665 (±0.0012) | 0.9491 (±0.0023) | 0.9706 (±0.0017) |
| DELR(a=0.04) | 0.9514 (±0.0016) | 0.9847 (±0.0001) | 0.9605 (±0.0035) | 0.9671 (±0.0020) | 0.9917 (±0.0007) | **0.9749** (±0.0014) | 0.9909 (±0.0002) | 0.9689 (±0.0015) | 0.9521 (±0.0031) | 0.9697 (±0.0016) |
| DELR(a=0.05) | 0.9539 (±0.0022) | 0.9798 (±0.0001) | **0.9622** (±0.0032) | 0.9671 (±0.0020) | 0.9897 (±0.0006) | 0.9744 (±0.0015) | 0.9865 (±0.0003) | 0.9704 (±0.0020) | 0.9532 (±0.0036) | 0.9632 (±0.0019) |
| DELR(a=0.06) | 0.9532 (±0.0024) | 0.9737 (±0.0001) | 0.9616 (±0.0036) | 0.9588 (±0.0029) | 0.9879 (±0.0007) | 0.9742 (±0.0018) | 0.9824 (±0.0003) | 0.9702 (±0.0022) | 0.9532 (±0.0039) | 0.9606 (±0.0020) |
| DELR(a=0.07) | 0.9560 (±0.0029) | 0.9682 (±0.0001) | **0.9622** (±0.0041) | 0.9588 (±0.0029) | 0.9854 (±0.0009) | 0.9716 (±0.0017) | 0.9777 (±0.0003) | 0.9669 (±0.0019) | 0.9539 (±0.0042) | 0.9577 (±0.0022) |
| DELR(a=0.08) | 0.9553 (±0.0035) | 0.9649 (±0.0002) | **0.9622** (±0.0041) | 0.9588 (±0.0029) | 0.9826 (±0.0009) | 0.9666 (±0.0016) | 0.9734 (±0.0003) | 0.9634 (±0.0017) | 0.9524 (±0.0044) | 0.9558 (±0.0020) |
| DELR(a=0.09) | 0.9542 (±0.0029) | 0.9600 (±0.0002) | **0.9622** (±0.0046) | 0.9588 (±0.0029) | 0.9794 (±0.0011) | 0.9624 (±0.0014) | 0.9691 (±0.0003) | 0.9594 (±0.0017) | 0.9499 (±0.0043) | 0.9474 (±0.0026) |
| DELR(a=0.1) | **0.9575** (±0.0018) | 0.9559 (±0.0002) | 0.9568 (±0.0051) | 0.9588 (±0.0029) | 0.9768 (±0.0011) | 0.9583 (±0.0013) | 0.9645 (±0.0004) | 0.9563 (±0.0016) | 0.9470 (±0.0044) | 0.9452 (±0.0025) |
| Random forest | **0.9766** (±0.0010) | 0.9860 (±0.0003) | **0.9757** (±0.0029) | 0.9718 (±0.0021) | 0.9826 (±0.0006) | **0.9897** (±0.0004) | 0.9984 (±0.0001) | **0.9834** (±0.0002) | **1.0000** (±0.0000) | **0.9839** (±0.0000) |
| SVC | 0.9550 | 0.9839 | 0.9595 | 0.9706 | 0.9836 | 0.9727 | **0.9986** | **0.9798** | **1.0000** | **0.9839** |

Fig. 9. Comparison of the average AUC values by the Nemenyi test. Groups of algorithms that are not significantly different (at significance level 0.1) are connected. Here, k denotes the number of nearest neighbors for CARE, RB, kNN, LOF, COF, and ODIN, and α denotes the sampling ratio of SB. Panels: (a) k = 1, α = 0.1; (b) k = 3, α = 0.2; (c) k = 5, α = 0.3; (d) k = 7, α = 0.4; (e) k = 9, α = 0.5; (f) the optimal k and α for the compared algorithms. Note: The DELR, CARE, SB, and RB algorithms use the average AUC value for the test, while the kNN, LOF, COF, and ODIN algorithms use the AUC value of a single run.

This result is consistent with our previous evaluations in Sections 4.3.1 and 4.3.2. Although the CARE, SB, and kNN algorithms outperform DELR on some datasets, DELR remains the best overall in these Nemenyi tests, indicating a significant improvement in generalization ability. In Fig. 10, we compare the maximum average AUC value of these algorithms with that of DELR. Across the 10 datasets, DELR shows more stable performance than the comparison algorithms, which further demonstrates the effectiveness of our algorithm.

4.3.5. Information loss analysis

In the process of DELR, we use the subspace generation technique to obtain multiple random divisions. However, during the transformation from the total feature space to a subspace, some information is bound to be lost. To verify that DELR can indeed reduce the risk of information loss, we use the Shannon–Spearman measure [48] to compare 9 algorithms, namely, CARE, SB, RB, kNN, LOF, COF, ODIN, FG, and DELR, on the 10 datasets.

Based on the "loss of information" concept, the Shannon–Spearman measure was first proposed by Zhou et al. [48] to measure the information loss in constructing a composite environmental index. Since the key of the Shannon–Spearman measure is to treat the discrepancy between the amount of information in the original data and that in the experimental results as information loss, it is also applicable as an information loss measure in our experiment; a smaller discrepancy indicates better algorithm performance. Here, to simplify the description, we use D = (x_ij)_{n×m} as the original data, F as the feature variables, and S as the outlier score list obtained by the algorithm, where x_ij is the value of the ith sample corresponding to feature F_j. In addition, assume that both D and S have been normalized. The Shannon–Spearman measure is

d = \left| 1 - \frac{(1 - p)\, r_s}{\sum_{j=1}^{n} \left( (1 - p_{ij})\, r_{s_j} \right)} \right|, \qquad (8)

where r_{s_j} and r_s are obtained by the Spearman rank correlation coefficient on D and S, respectively, and p_{ij} and p are the divergence of each sample with respect to D and S by Shannon's entropy, respectively.
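For illustration only, the sketch below computes an information-loss score in the spirit of Eq. (8). The treatment of r_s (correlation of S against an aggregate ranking of D) and the use of a per-feature divergence p_j in place of p_ij are our own assumptions, not the authors' implementation:

```python
import numpy as np
from scipy.stats import spearmanr

def divergence(v, eps=1e-12):
    """Shannon-entropy divergence of a normalized vector in [0, 1]:
    0 for a uniform vector, approaching 1 as it concentrates."""
    p = np.asarray(v, dtype=float) + eps
    p = p / p.sum()
    return 1.0 + np.sum(p * np.log(p)) / np.log(len(p))

def shannon_spearman(D, S):
    """Illustrative reading of Eq. (8).  Assumptions (ours): r_sj is
    the Spearman correlation between score list S and feature j; r_s
    is the Spearman correlation between S and the row sums of D."""
    D, S = np.asarray(D, dtype=float), np.asarray(S, dtype=float)
    r_s = spearmanr(S, D.sum(axis=1))[0]
    num = (1.0 - divergence(S)) * r_s
    den = sum((1.0 - divergence(D[:, j])) * spearmanr(S, D[:, j])[0]
              for j in range(D.shape[1]))
    return abs(1.0 - num / den)
```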


Table 5
The Shannon–Spearman measure results of DELR vs. the comparison algorithms. Note: The top two performers for each dataset are boldfaced.

| Method | WBC | PageBlocks | Lymphography | Hepatitis | Cardio | WDBC | Satimage-2 | SpamBase | Musk | Arrhythmia |
|---|---|---|---|---|---|---|---|---|---|---|
| CARE | 0.8747 | 0.8819 | 1.0073 | 1.3002 | **0.7041** | **0.8138** | 1.4856 | 1.0107 | 1.0544 | 1.0070 |
| SB | 0.9641 | **0.5359** | 0.9301 | 1.4322 | **0.5969** | **0.9666** | 1.6146 | **0.9362** | 1.0322 | 1.0119 |
| RB | 1.0291 | 0.8542 | **0.8568** | 1.3087 | 1.1101 | 1.0923 | 1.1127 | 0.9821 | 1.0094 | **0.9957** |
| kNN | 0.9824 | **0.4910** | **0.8920** | 1.5175 | 1.1966 | 1.0739 | 1.4277 | **0.9146** | 1.0276 | 0.9970 |
| LOF | 1.0717 | 0.8922 | 0.8943 | 1.3785 | 1.1092 | 1.1174 | 1.1004 | 0.9896 | 0.9946 | 0.9999 |
| COF | 1.0173 | 0.9609 | 0.9682 | 1.0637 | 1.0554 | 1.0593 | 1.0757 | 0.9698 | **0.9886** | 0.9999 |
| ODIN | **0.7684** | 1.0154 | 1.0337 | **0.9621** | 0.9450 | 0.9780 | **0.9406** | 1.0044 | 1.0009 | 1.0061 |
| FG | 0.9927 | 0.8517 | 0.9624 | 1.1433 | 1.0706 | 0.9995 | 1.0347 | 0.9934 | **0.9900** | **0.9938** |
| DELR | **0.8338** | 0.9299 | 1.3094 | **1.0262** | 0.9365 | 0.9746 | **0.9221** | 0.9850 | 1.0187 | 1.2087 |

Fig. 10. The average AUC values of the 8 algorithms on the 10 datasets.

In Table 5, we provide the Shannon–Spearman measure results of DELR and the comparison algorithms. Note that, except for FG, the comparison algorithms do not suffer this type of information loss, because they train on the total feature space. Therefore, although DELR is only among the top 2 performers on 3/10 datasets, the results still indicate that our method is effective in reducing information loss, for two main reasons. First, compared with FG, which incurs the same kind of information loss through feature subspace selection, the DELR algorithm performs better on 6/10 datasets. Second, the subspace-based DELR achieves performance similar to the total-space-based comparison algorithms even in its weaker cases and performs better than RB, kNN, LOF, COF, and ODIN.

As we can see in Table 5, on the one hand, the average value of the Shannon–Spearman measure for DELR is 1.014. Compared with SB, which has the best average performance over the 10 datasets, the difference is only 0.012. Moreover, based on the average value of the Shannon–Spearman measure, DELR and CARE rank in the top 3 and perform better than RB, LOF, kNN, COF, and ODIN. On the other hand, compared with CARE, which shows performance similar to DELR on some datasets, DELR performs notably better on Hepatitis and Satimage-2, with differences of 0.274 and 0.564, respectively. Overall, DELR is especially effective in reducing information loss when it performs well and exhibits similar performance when it performs poorly.

Fig. 11. Analysis of AUC when changing the subspace division ratio.

Fig. 12. Analysis of AUC when changing the number of base detectors.

4.3.6. Sensitivity of the parameters

As we can see, there are three main components of the DELR algorithm: subspace generation, base detectors, and the double-level ensemble. For the first part, there is a subspace division ratio that can be selected from 0 to 0.5. For the second part, during the training of the diversity loss function, the maximum number of iterations of the gradient descent method is set to 50 as a default value. Finally, there is a sole parameter, namely, the number of cycles, in the double-level ensemble strategy. In the following, we explain the selection of these parameters in detail.


Fig. 13. Effects of the parameter selection of α and l on the AUC value.

One parameter is the subspace division ratio α, which controls the sizes of the two complementary subspaces obtained by each space division. Because α is only used for space division rather than subspace selection, its value has no effect on the complexity of the algorithm, so we omit a complexity analysis with changing α. In Fig. 11, we present the trend of the AUC value as α varies from 0 to 0.5. We ignore the range from 0.5 to 1, since the two subspaces are complementary and those values would only replicate the divisions already considered. As Fig. 11 shows, the AUC value fluctuates considerably with the subspace division ratio.

The other parameter is the number of cycles l in our double-level ensemble strategy. On the one hand, in the internal integration phase, DELR obtains two base models from each space division. On the other hand, l controls the number of runs of the external integration. Therefore, the total number of base detectors in DELR is twice the value of l. As Fig. 12 shows, there is no obvious positive or negative correlation between the AUC value and l.

Therefore, to find good parameters, we use a grid search [44] on α and l, as sketched below. The effects of the parameter selection are shown in Fig. 13. We observe that good performance is achieved at α = 0.45 and l = 20.
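The grid search itself follows the usual exhaustive pattern. Below is a minimal sketch in which run_delr is a hypothetical helper (our placeholder, not the paper's API) that trains DELR with a given α and l and returns the average AUC over repeated runs:

```python
import numpy as np

def grid_search(X, y, run_delr,
                alphas=np.arange(0.05, 0.50, 0.05),
                ls=(5, 10, 15, 20, 25, 30)):
    """Exhaustive search over the subspace division ratio (alpha)
    and the number of cycles (l), keeping the best mean AUC.
    run_delr(X, y, alpha, l) is a placeholder expected to train
    DELR and return the average AUC over repeated runs."""
    best = (None, None, -np.inf)
    for a in alphas:
        for l in ls:
            auc = run_delr(X, y, a, l)
            if auc > best[2]:
                best = (a, l, auc)
    return best  # e.g., alpha = 0.45 and l = 20 on the paper's datasets
```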

5. Conclusion

Even though ensemble learning is effective in improving the generalization ability of unsupervised anomaly detection algorithms, it is difficult to study due to the absence of ground-truth labels. Most existing algorithms in this domain depend on common meta-algorithms and merely focus on designing a combination method for the final results. In this paper, we focus on two main components of the anomaly ensemble: the generation and the combination of multiple base detectors. During the generation phase, the DELR algorithm is designed to be independent of common meta-algorithms and uses a novel diversity loss function for retraining the base detectors. During the combination phase, we propose a double-level ensemble strategy that improves the generalization ability. In addition, we evaluate our approach on 10 real-world datasets. Extensive experiments show that DELR achieves varying degrees of performance improvement compared to the meta-algorithms, state-of-the-art anomaly ensemble algorithms, and traditional classification algorithms.

In future work, we will extend our algorithm in two aspects: its design and its application. In terms of the design, we plan to use deep learning to detect anomalous samples in high-dimensional data, which has rarely been studied in anomaly detection. Moreover, we will incorporate ideas from semi-supervised learning to further improve the performance of our algorithm by adding some labeled samples. In terms of the application, we intend to extend our algorithm to specific applications, such as fault detection and video surveillance. Beyond these aspects, we aim to improve the real-time behavior of our algorithm to gain more speed and efficiency.

Acknowledgments

This work was partially supported by the National Natural Science Foundation of China (No. 61672215), the National Key R&D Program of China (No. 2018YFB1308604) and the Hunan Science and Technology Innovation Project (No. 2017XK2102).

References

[1] V. Chandola, A. Banerjee, V. Kumar, Anomaly detection: A survey, ACM Comput. Surv. 41 (2009) 75–79.
[2] A. Theissler, Detecting known and unknown faults in automotive systems using ensemble-based anomaly detection, Knowl. Based Syst. 123 (2017) 163–173.
[3] G. Xie, K. Xie, J. Huang, X. Wang, Y. Chen, J. Wen, Fast low-rank matrix approximation with locality sensitive hashing for quick anomaly detection, in: Proc. IEEE Conf. Computer Communications, 2017, pp. 1–9.
[4] H. Fujita, D. Cimr, Computer aided detection for fibrillations and flutters using deep convolutional neural network, Inform. Sci. 486 (2019) 231–239.
[5] K. Xie, X. Li, X. Wang, J. Cao, G. Xie, J. Wen, D. Zhang, Z. Qin, On-line anomaly detection with high accuracy, IEEE/ACM Trans. Netw. 26 (3) (2018) 1–14.
[6] K. Xie, X. Li, X. Wang, G. Xie, J. Wen, J. Cao, D. Zhang, Fast tensor factorization for accurate internet anomaly detection, IEEE/ACM Trans. Netw. 25 (6) (2017) 3794–3807.
[7] H. Ren, M. Liu, Z. Li, W. Pedrycz, A piecewise aggregate pattern representation approach for anomaly detection in time series, Knowl. Based Syst. 135 (2017) 29–39.
[8] J. Huang, Q. Zhu, L. Yang, D.D. Cheng, Q. Wu, A novel outlier cluster detection algorithm without top-n parameter, Knowl. Based Syst. 121 (2017) 32–40.

[9] T.K. Ho, The random subspace method for constructing decision forests, IEEE Trans. Pattern Anal. Mach. Intell. 20 (8) (1998) 832–844.
[10] R. Kohavi, A study of cross-validation and bootstrap for accuracy estimation and model selection, in: Proc. 14th Int. Joint Conf. Artificial Intelligence, 1995, pp. 1137–1143.
[11] C.C. Aggarwal, Outlier ensembles: position paper, SIGKDD Explor. Newsl. 14 (2) (2013) 49–58.
[12] A. Zimek, M. Gaudet, R.J.G.B. Campello, Subsampling for efficient and effective unsupervised outlier detection ensembles, in: Proc. 19th ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2013, pp. 428–436.
[13] C. Zhang, J. Bi, S. Xu, E. Ramentol, G. Fan, B. Qiao, H. Fujita, Multi-imbalance: An open-source software for multi-class imbalance learning, Knowl. Based Syst. (2019).
[14] Z.H. Zhou, Ensemble learning, in: S. Li (Ed.), Encyclopedia of Biometrics, 2009, pp. 270–273.
[15] S. Rayana, W. Zhong, L. Akoglu, Sequential ensemble learning for outlier detection: A bias-variance perspective, in: Proc. 16th IEEE Int. Conf. Data Mining (ICDM), 2016, pp. 1167–1172.
[16] F. Keller, E. Müller, K. Böhm, HiCS: High contrast subspaces for density-based outlier ranking, in: Proc. IEEE Int. Conf. Data Engineering, 2012, pp. 1037–1048.
[17] E. Müller, M. Schiffer, T. Seidl, Statistical selection of relevant subspace projections for outlier ranking, in: Proc. IEEE Int. Conf. Data Engineering, 2011, pp. 434–445.
[18] A. Lazarevic, V. Kumar, Feature bagging for outlier detection, in: Proc. 17th ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining, 2005, pp. 157–166.
[19] C.C. Aggarwal, S. Sathe, Theoretical foundations and algorithms for outlier ensembles, SIGKDD Explor. Newsl. 17 (1) (2015) 24–47.
[20] T.D. Vries, S. Chawla, M.E. Houle, Density-preserving projections for large-scale local anomaly detection, Knowl. Inf. Syst. 32 (1) (2012) 25–52.
[21] P. Mitra, C.A. Murthy, S.K. Pal, Unsupervised feature selection using feature similarity, IEEE Trans. Pattern Anal. Mach. Intell. 24 (3) (2002) 301–312.
[22] P. Robi, Ensemble based systems in decision making, IEEE Circuits Syst. Mag. 6 (3) (2006) 21–45.
[23] C. Zhang, C. Liu, X. Zhang, G. Almpanidis, An up-to-date comparison of state-of-the-art classification algorithms, Expert Syst. Appl. 82 (2017) 128–150.
[24] J. Friedman, T. Hastie, R. Tibshirani, Additive logistic regression: a statistical view of boosting (with discussion and a rejoinder by the authors), Ann. Statist. 28 (2) (2000) 337–374.
[25] D.D. Margineantu, T.G. Dietterich, Pruning adaptive boosting, in: Proc. 14th Int. Conf. Machine Learning, 1997, pp. 211–218.
[26] G. Martínez-Muñoz, D. Hernández-Lobato, A. Suárez, An analysis of ensemble pruning techniques based on ordered aggregation, IEEE Trans. Pattern Anal. Mach. Intell. 31 (2) (2009) 245–259.
[27] Z.H. Zhou, Y. Jiang, Y.B. Yang, S.F. Chen, Lung cancer cell identification based on artificial neural network ensembles, Artif. Intell. Med. 24 (1) (2002) 25–36.
[28] H. Paulheim, R. Meusel, A decomposition of the outlier detection problem into a set of supervised learning problems, Mach. Learn. 10 (2–3) (2015) 509–531.


[29] A. Zimek, R.J.G.B. Campello, J. Sander, Data perturbation for outlier detection ensembles, in: Proc. Int. Conf. Scientific and Statistical Database Management, 2014, pp. 1–12.
[30] C. Baumgartner, C. Plant, K. Kailing, H.P. Kriegel, Subspace selection for clustering high-dimensional data, in: Proc. IEEE Int. Conf. Data Mining, 2004, pp. 11–18.
[31] C.H. Cheng, A.W. Fu, Y. Zhang, Entropy-based subspace clustering for mining numerical data, in: Proc. ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining, 1999, pp. 84–93.
[32] K. Kailing, H.P. Kriegel, P. Kröger, S. Wanka, Ranking interesting subspaces for clustering high dimensional data, in: Proc. 7th European Conf. Principles and Practice of Knowledge Discovery in Databases, 2003, pp. 241–252.
[33] D. Niu, J.G. Dy, M.I. Jordan, Multiple non-redundant spectral clustering views, in: Proc. Int. Conf. Machine Learning, 2010, pp. 831–838.
[34] S. Rayana, L. Akoglu, An ensemble approach for event detection and characterization in dynamic graphs, in: Proc. ACM SIGKDD Workshop on Outlier Detection and Description (ODD), 2014.
[35] S. Rayana, L. Akoglu, Less is more: Building selective anomaly ensembles, ACM Trans. Knowl. Discov. Data 10 (4) (2015) 42.
[36] A. Kumar, J. Kim, D. Lyndon, M. Fulham, D. Feng, An ensemble of fine-tuned convolutional neural networks for medical image classification, IEEE J. Biomed. Health Informat. 21 (1) (2016) 31–40.
[37] D.I. Curiac, C. Volosencu, Ensemble based sensing anomaly detection in wireless sensor networks, Expert Syst. Appl. 39 (10) (2012) 9087–9096.
[38] G. Giacinto, R. Perdisci, M.D. Rio, F. Roli, Intrusion detection in computer networks by a modular ensemble of one-class classifiers, Inf. Fusion 9 (2008) 69–82.
[39] D.B. Araya, K. Grolinger, H.F. Elyamany, M.A.M. Capretz, G. Bitsuamlak, An ensemble learning framework for anomaly detection in building energy consumption, Energy Build. 144 (2017) 191–206.
[40] S.M. Min, S.Y. Sohn, Y.H. Ju, Random effects logistic regression model for anomaly detection, Expert Syst. Appl. 37 (10) (2010) 7162–7166.
[41] C. Khammassi, S. Krichen, A GA-LR wrapper approach for feature selection in network intrusion detection, Comput. Secur. 70 (2017) 255–277.
[42] C.M. Teng, Correcting noisy data, in: Proc. 16th Int. Conf. Machine Learning, 1999, pp. 239–248.
[43] G.O. Campos, A. Zimek, R.J. Campello, E. Schubert, I. Assent, M.E. Houle, On the evaluation of unsupervised outlier detection: measures, datasets, and an empirical study, Data Min. Knowl. Discov. 30 (4) (2016) 891–927.
[44] C.W. Hsu, C.C. Chang, C.J. Lin, A Practical Guide to Support Vector Classification, Tech. Rep., National Taiwan University, 2003, pp. 1–16.
[45] L. Breiman, Random forests, Mach. Learn. 45 (1) (2001) 5–32.
[46] J.A. Hanley, B.J. Mcneil, The meaning and use of the area under a receiver operating characteristic (ROC) curve, Radiology 143 (1982) 29–36.
[47] J. Demšar, Statistical comparisons of classifiers over multiple data sets, J. Mach. Learn. Res. 7 (2006) 1–30.
[48] P. Zhou, B.W. Ang, K.L. Poh, Comparing aggregating methods for constructing the composite environmental index: An objective measure, Ecol. Econ. 59 (3) (2006) 305–311.
[49] A. Asuncion, UCI Machine Learning Repository, 2007. Available: http://www.ics.uci.edu/mlearn/MLRepository.html.
[50] J. Demšar, Statistical comparisons of classifiers over multiple data sets, J. Mach. Learn. Res. 7 (1) (2006) 1–30.