Cube-based Incremental Outlier Detection for Streaming Computing

Jianhua Gao, Weixing Ji∗, Lulu Zhang, Anmin Li, Yizhuo Wang, Zongyu Zhang
School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China
Abstract

Outlier detection is one of the most critical and challenging tasks of data mining. It aims to find patterns in data that do not conform to expected behavior. Data streams in streaming computing are huge in nature and arrive continuously with changing distribution, which imposes new challenges for outlier detection algorithms in time and space efficiency. Incremental local outlier factor (ILOF) detection dynamically updates the profiles of data points, but the arrival of consecutive and massive volumes of data points in a streaming manner causes high local data density and leads to expensive time and space overheads. Our work is motivated by these deficiencies, and in this paper we propose a cube-based outlier detection algorithm (CB-ILOF). The data space of streaming data is divided into multiple cubes, and the outlier detection of data points is transferred to the outlier detection of cubes, which significantly reduces time and memory overheads. We also present a performance evaluation on 5 datasets. Experimental results show the superiority of CB-ILOF over ILOF in accuracy, memory usage, and execution time.

Keywords: Streaming Computing, Density-based Outlier Detection, Online Outlier Detection
∗ Corresponding author.
Email addresses:
[email protected] (Jianhua Gao),
[email protected] (Weixing Ji),
[email protected] (Lulu Zhang),
[email protected] (Anmin Li),
[email protected] (Yizhuo Wang),
[email protected] (Zongyu Zhang)
1. Introduction

Outlier detection is one of the most important and challenging tasks of data mining, and it has been defined as the identification of items, events or data points (hereafter unified as data points) which do not conform to the normal or expected behavioral pattern defined in datasets [20]. Continuous outlier detection for data streams has important applications in network intrusion detection, web clickstream analysis, fraud detection, machine fault detection and sensor data analysis. A number of streaming computing systems have been developed in recent years, such as Storm at Twitter [38], Kafka at LinkedIn [8, 7], the Simple Scalable Streaming System (S4) at Yahoo [12, 24, 43], as well as StreamBase [26] and Borealis [1]. Data streams are huge in nature and arrive continuously with changing distribution, and thereby need to be analyzed in a one-pass manner as the data arrive, which imposes new challenges for outlier detection algorithms, especially in time and space efficiency.

Outlier detection has received considerable research attention for a long time and a number of algorithms have been proposed, such as distribution-based [9, 25, 27, 32, 33], clustering-based [19, 30, 40], distance-based [41, 27, 17, 29, 39] and density-based methods [11, 23, 21, 36, 14, 44, 13]. A more detailed discussion of these methods is given in section 2. However, most of the existing outlier detection techniques are designed for applications where the entire dataset is available for random access. They are not suitable for online data streams, which cannot be stored and require continuous model updates to adapt to changes.

Incremental local outlier factor (hereafter referred to as ILOF) detection [28] is an algorithm that assigns an outlier score to each data point according to its outlierness. It dynamically updates the parameters of data points, and the insertion of a new data point or the deletion of an old one influences only a limited number of their closest neighbors. Nevertheless, the arrival of continuous and massive volumes of data points in a streaming manner often causes high local density, which leads to expensive time overhead for updating the parameters of neighborhoods and space overhead for storing data points. Our work is motivated by the deficiencies of ILOF detection, and a cube-based ILOF technique (hereafter referred to as CB-ILOF) is proposed to address the aforementioned problem.
In our algorithm, the space of streaming data is divided into multiple cubes (or subspaces), and each cube is assigned an outlier factor. By this means, the update of intermediate parameters and outlier factors of affected data points is transferred to the update of affected cubes, which significantly reduces the overhead of execution time and memory. Finally, we perform a performance evaluation on 5 different datasets, and experimental results show the superiority of CB-ILOF over ILOF in accuracy, memory usage and execution time.

Data streams have become available in increasing amounts, which imposes new challenges for outlier detection algorithms, especially in time and space efficiency. Instead of calculating the local outlier factor for all data points in batch processing, CB-ILOF maps streaming data points into cubes and calculates the local outlier factor only for the cubes affected by the newly arrived data point, which significantly reduces the memory and time overhead. CB-ILOF can run in resource-constrained environments and provides a useful and lightweight outlier detection solution for real-time streaming data processing.

The rest of this paper is organized as follows. In section 2, we discuss related work on outlier detection. In section 3, we introduce two important existing outlier detection algorithms in detail and discuss their deficiencies. In section 4, a cube-based incremental outlier detection algorithm for streaming computing is proposed. In section 5, we perform an extensive experimental evaluation, followed by our conclusion in section 6.

2. Related Work

As one of the most important tasks in data mining, outlier detection has attracted much attention from researchers.

2.1. Offline outlier detection

Outlier detection for static datasets is generally classified into the following four categories: distribution-based, clustering-based, distance-based and density-based methods.

Distribution-based methods [9, 25, 27, 32, 33] assume that a dataset follows a certain standard distribution (e.g., Gaussian distribution, Poisson distribution or extreme value distribution), and then build a model based on that distribution to detect outliers.
For example, Rousseeuw et al. proposed a 3σ method [9, 27]. Under the assumption of a normal distribution, the region (µ − 3σ, µ + 3σ) contains 99.7% of the data, in which µ is the mean value of the distribution and σ is the standard deviation. If the distance between a point and the mean value is larger than 3σ, then the point can simply be marked as an outlier. This method delivers good results when data is sufficient and the distribution of a dataset is known. However, most datasets do not exactly fit an ideal mathematical distribution in many applications. Moreover, it is difficult to estimate the distribution of high-dimensional data. Therefore, distribution-based outlier detection is only suitable for situations where the distribution is known in advance or the dataset is low-dimensional [42].

Clustering-based outlier detection is an unsupervised classification method [19, 30, 40]. It classifies the dataset into multiple clusters in terms of the similarity among data points, and outliers are the data points that are not located in any cluster or are far from the centroid of the nearest cluster. The clustering-based method does not require users to understand the characteristics of the data, and it can be applied to high-dimensional datasets. However, the performance of this method mainly depends on the selected clustering algorithm, and accurate clustering algorithms usually have higher time complexity. In this situation, the time overhead of the algorithm is likely to become the performance bottleneck of outlier detection. Moreover, outlier detection is only a by-product of the clustering: if outliers are aggregated into a cluster or sorted into a large cluster, the method cannot effectively detect abnormal data.

Knorr and Ng proposed a distance-based (DB) method [41, 27] to solve the problem that traditional outlier detection methods can only deal with two-dimensional datasets. They defined a data point p in a dataset D as a DB(pct, dmin)-outlier if at least a percentage pct of the data points in D lies at a distance greater than dmin from p. Compared with distribution-based methods, distance-based methods [17, 29, 39] can be applied to datasets that do not satisfy a specific behavioral pattern or random distribution. Moreover, they are also suitable for high-dimensional datasets. Unfortunately, there are currently no good rules or algorithms to determine the parameters pct and dmin.

All the above algorithms for outlier detection consider abnormal data from a global perspective, while density-based methods [11, 23, 21, 36, 14, 44, 13, 5] can distinguish outliers from data points with various densities. They calculate the density of the neighbors of each data point.
If the neighbors of a data point are sparse, then the data point is regarded as abnormal; otherwise, it is considered to be normal. The most popular density-based algorithm is the local outlier factor (LOF) algorithm [11], in which the LOF indicates the degree of abnormality. A more detailed introduction to the algorithm is given in the next section.

2.2. Online outlier detection

At present, the ideas of outlier detection for streaming data can be classified into two categories. The first is an improvement of static methods: a model is first trained on datasets collected in advance, each new incoming data point is detected using that model, and the model is then updated with the new point. Considering that streaming data has considerable dynamic variation, the distribution and behavioral pattern of later-arriving data points can change dramatically, so a model built on past datasets needs to be adjusted and updated frequently. Consequently, this kind of method has poor adaptability and poor real-time processing ability, and is relatively suitable only for streaming data with a constant distribution or behavioral pattern.

The second is incremental learning [18, 6, 5] and integrated learning [10, 4]. Incremental learning means that the model continuously learns new knowledge from incoming data while retaining the knowledge learned from past data. Integrated learning constructs multiple models instead of one model or one detector, detects outliers with these models, and finally integrates their results. Integrated learning has a higher generalization capability and handles streaming data with a changing distribution better than incremental learning; however, it is less reliable for detecting local outliers. This is the reason why this paper chooses the idea of incremental learning.

3. LOF and Incremental LOF

Considering that the algorithm proposed in this paper is mainly motivated by the LOF [11] and ILOF [28] algorithms, we first introduce these two algorithms so that readers can have a comprehensive understanding of our algorithm.
3.1. LOF Algorithm

We use o, p, q to denote points in a dataset P, and d(p, q) to denote the distance between p and q. Some essential definitions in the LOF algorithm are as follows.

Definition 1: k-distance of a data point p: For any positive integer k, the k-distance of a data point p, denoted as k-distance(p), is defined as the distance d(p, o) between p and a data point o ∈ P such that:
- for at least k points o′ ∈ P \ {p} it holds that d(p, o′) ≤ d(p, o), and
- for at most k − 1 points o′ ∈ P \ {p} it holds that d(p, o′) < d(p, o).

Definition 2: k-distance neighborhood of a data point p: Given the k-distance of p, the k-distance neighborhood of p, denoted as Nk(p), contains every data point q whose distance from p is not greater than the k-distance of p, i.e.

Nk(p) = {q ∈ P \ {p} | d(p, q) ≤ k-distance(p)}    (1)
Definition 3: reachability distance of a data point p with respect to point o: For any positive integer k, the reachability distance of a data point p with respect to point o, denoted as reach-dist_k(p, o), is defined as

reach-dist_k(p, o) = max{k-distance(o), d(p, o)}    (2)
Definition 4: local reachability density of a data point p: The local reachability density of a data point p, denoted as lrd_k(p), is defined as

lrd_k(p) = 1 \Big/ \frac{\sum_{o \in N_k(p)} \text{reach-dist}_k(p, o)}{|N_k(p)|}    (3)

Definition 5: local outlier factor of a data point p: The local outlier factor of a data point p, denoted as LOF_k(p), is defined as
LOF_k(p) = \frac{\sum_{o \in N_k(p)} \frac{lrd_k(o)}{lrd_k(p)}}{|N_k(p)|}    (4)
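To make Definitions 1-5 concrete, the sketch below computes LOF for a small batch of points with a brute-force O(n²) distance matrix. It is our own illustration (function and variable names are ours), not the indexed implementation an efficient LOF system would use.

    import numpy as np

    def lof(points: np.ndarray, k: int) -> np.ndarray:
        """Brute-force LOF of every row of `points` (illustration of Definitions 1-5)."""
        n = len(points)
        dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
        np.fill_diagonal(dist, np.inf)                      # exclude the point itself
        kth = np.sort(dist, axis=1)[:, k - 1]               # k-distance(p), Definition 1
        nbrs = [np.where(dist[i] <= kth[i])[0] for i in range(n)]   # Nk(p), Definition 2
        reach = np.maximum(kth[None, :], dist)              # reach-dist_k(p, o), Definition 3
        lrd = np.array([1.0 / reach[i, nbrs[i]].mean() for i in range(n)])   # Equation 3
        return np.array([(lrd[nbrs[i]] / lrd[i]).mean() for i in range(n)])  # Equation 4

    # The isolated point (10, 10) receives a LOF far above 1; the dense corners stay near 1.
    pts = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [10, 10]], dtype=float)
    print(lof(pts, k=3))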
The LOF represents the degree of outlier-ness of a data point. Equation 4 shows that the LOF of a data point p is the average ratio of the local reachability density of p's k-nearest neighbors to lrd_k(p). Therefore, if the LOF of a data point is smaller than 1, the point can be marked as normal data considering the denseness of normal data; if the LOF of a data point is far greater than 1, it can be regarded as an outlier considering the sparsity of abnormal data. In short, the smaller the LOF is, the less likely the point is to be an outlier.

The LOF algorithm can detect outliers with different granularity by changing the parameter k. Moreover, it does not require the data to conform to a uniform distribution, and it can be applied in stream computing in three ways. The first is to calculate the LOF for every point whenever a new point is inserted into the dataset. According to [11], the best time complexity of the LOF algorithm is O(n² log n) for a dataset containing n points. Obviously, this method detects outliers accurately and in real time, but with too high a time complexity. The second approach partitions the input data stream into multiple segments and runs the LOF algorithm on each segment after all data points in the segment have arrived. The size of each segment is critical to the performance of the algorithm: the larger the segments, the longer it takes to run the LOF algorithm, causing a longer delay before the next segment is started. Besides, this approach ignores the influence of historical data and only considers the local data. Given a fixed segment length, the time complexity of this approach is constant, while delayed outlier detection and low accuracy are ineluctable. The third approach introduces the idea of a sliding window and only carries out outlier detection on the data in the window at a time. It is equivalent to caching a data stream of a certain size and then detecting outliers using the LOF algorithm. The difference between the second and third approaches lies in that adjacent windows in the latter overlap each other, while those of the former are disjoint; thereby this method takes into account the influence of local historical data. Similarly, the computational performance of this approach is affected by the length of the sliding window, and the problem of delayed detection still exists, while it outperforms the second approach in the accuracy of outlier detection.
To improve the LOF algorithm, Pokrajac et al. proposed an incremental outlier detection method for streaming data [28].

3.2. Incremental LOF

ILOF [28] is an improvement of the LOF algorithm. When a new data point n is inserted into a dataset, the ILOF algorithm mainly includes the following steps:

a) Calculate the reachability distances of the new point n with respect to its k nearest neighbors.
b) Update the k-distance of points which have the new point n among their k nearest neighbors.
c) Update lrd on all points where the new point n is now one of their k-neighbors and on all points q where reach-dist(q, p) is updated and p is among the k nearest neighbors of q.
d) Update LOF on all points q where lrd(q) is updated or lrd(p) of one of its k nearest neighbors p changes.
e) Calculate lrd(n) and LOF(n).

Besides, the algorithm also includes a general framework for deleting certain data points due to their obsoleteness; readers can refer to [28] for a detailed description of ILOF. In short, ILOF not only considers the effect of historical data, but also does not mistake normal data for outliers because of data segmentation. It has the same accuracy as the static LOF algorithm and higher computational efficiency. Nonetheless, the method is still inefficient for streaming data. When a new data point arrives, the performance of this algorithm crucially depends on efficient indexing structures to compute kNN(n) (the k nearest neighbors of n) and kRNN(n) (the k reverse nearest neighbors of n, which include all points p where n is among their k nearest neighbors).
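As a rough illustration of steps b)-d), the sketch below identifies which existing points need their lrd and LOF recomputed after an insertion. It is our own sketch: it assumes the kNN/kRNN index has already been refreshed for the new point n, and the helper names are hypothetical.

    from typing import Dict, Set, Tuple

    def ilof_update_sets(n: int, krnn: Dict[int, Set[int]]) -> Tuple[Set[int], Set[int]]:
        """Return (points whose lrd must be updated, points whose LOF must be updated)
        after inserting point n; krnn maps a point to its k reverse nearest neighbors
        and is assumed to be up to date for n."""
        # Step b: k-distance can only change for points that now have n among their
        # k nearest neighbors, i.e. the reverse nearest neighbors of n.
        kdist_changed = set(krnn[n])
        # Step c: lrd changes for those points, and for every point q whose
        # reachability distance to such a point p changed while p is among q's kNN.
        lrd_changed = set(kdist_changed)
        for p in kdist_changed:
            lrd_changed |= {q for q in krnn[p] if q != n}
        # Step d: LOF changes wherever lrd changed, or where a k nearest neighbor's
        # lrd changed.
        lof_changed = set(lrd_changed)
        for p in lrd_changed:
            lof_changed |= krnn[p]
        return lrd_changed, lof_changed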
When efficient algorithms for kNN and kRNN are applied, the time complexity of ILOF is O(k · F · log n + F² · k), in which F is the maximal number of k reverse nearest neighbors of a data point. To further improve the LOF algorithm, this paper proposes a cube-based incremental LOF (CB-ILOF) method, which not only detects outliers with high accuracy but also significantly reduces the memory and time cost.

4. Cube-based Incremental LOF Algorithm

The basic idea of CB-ILOF (Cube-Based Incremental LOF) is to divide the data space into multiple cubes. All data points in the streaming data are mapped into different cubes according to the attribute value of each dimension, and the cubes instead of the data points are then used in the calculation of outlier detection.

4.1. Data space partition

Suppose we have a data stream S ∈ R^d composed of many d-dimensional data points, and the data space of each dimension in S is divided into multiple segments. These segments are created on an equidistant basis and form the units that are used to define cubes in the high-dimensional space R^d. Let lk denote the cube size (segment length) of the kth dimension and sk the resulting number of segments; the number of cubes then does not exceed s1 × s2 × · · · × sd.

4.2. Cube-based incremental outlier detection

Similar to ILOF, each cube in CB-ILOF has an outlier factor, which indicates the outlierness of that cube, as well as of all data points in that cube. Given a new data point p, the cube that it will be mapped to is first calculated. Taking a 2-dimensional space as an example, as shown in Figure 1, we assume that (x′, y′) is the representative point of cube c, and p(x, y) is a point mapped into cube c.
Figure 1: Selection of cube coordinate in 2-dimensional space.
Let r denote the error caused by using cube c to replace point p in outlier detection; it is easy to derive that

r² = (x − x′)² + (y − y′)²    (5)
Then, the total error R introduced by using c to replace all points mapped into cube c is

R^2 = \int_{l_2}^{2l_2} \int_{l_1}^{2l_1} \left( (x - x')^2 + (y - y')^2 \right) \, dx \, dy    (6)
By solving Equation 6, we find that R is minimized when x′ = (3/2)l1 and y′ = (3/2)l2, that is, when the representative point is the center of the cube. This conclusion can also be extended to higher-dimensional spaces. Assuming that a new data point y(y1, y2, ..., yd) arrives, the transformation shown in Equation 7 is performed for each dimension of y, and the coordinates of the cube c to which y is mapped are obtained:

ik = ⌊yk / lk⌋ · lk + 0.5 · lk,    k = 1, 2, ..., d.    (7)

In short, when a new data point arrives, it is mapped into a cube, and the center point of that cube is used in later calculations.
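A minimal sketch of this mapping (Equation 7); the function name and the example cube sizes are our own illustration.

    from typing import Sequence, Tuple

    def cube_coordinate(y: Sequence[float], l: Sequence[float]) -> Tuple[float, ...]:
        """Map a d-dimensional point y to the center of the cube that contains it,
        following Equation 7: i_k = floor(y_k / l_k) * l_k + 0.5 * l_k."""
        return tuple((yk // lk) * lk + 0.5 * lk for yk, lk in zip(y, l))

    # With cube size 5 in both dimensions, (12.3, 7.9) is represented by the cube
    # center (12.5, 7.5); every point mapped into that cube shares this coordinate.
    print(cube_coordinate((12.3, 7.9), (5.0, 5.0)))   # (12.5, 7.5)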
Let C denote the set of cubes, and we use c, ci, cj to denote cubes in C, where i = 1, 2, 3, ... and j = 1, 2, 3, ....

Definition 6: weight of cube c: The weight of cube c, denoted as W(c), refers to the total number of data points in cube c.

Definition 7: distance between cube ci and cube cj: Assuming that the representative point of cube ci is xi(xi1, xi2, ..., xid), and that of cube cj is xj(xj1, xj2, ..., xjd), the distance between ci and cj, denoted as d(ci, cj), is defined as

d(c_i, c_j) = \left( \sum_{k=1}^{d} (x_{ik} - x_{jk})^2 \right)^{\frac{1}{2}}    (8)
Definition 8: k-distance of cube c: The k-distance of cube c, denoted as k-distance(c), is defined as the distance d(c, cj) between cube c and a cube cj ∈ C \ {c} such that:

a) Σ_i W(ci) + W(c) < k, and
b) Σ_i W(ci) + W(c) + W(cj) ≥ k, and
c) d(c, ci) ≤ d(c, cj),
in which ci ∈ C \ {c, cj} (i.e., the sums in a) and b) range over the cubes that are at least as close to c as cj).

Definition 9: k nearest neighbors of cube c: The k nearest neighbors of cube c, denoted as kNN(c), contain every cube whose distance from c is not greater than k-distance(c), i.e.

kNN(c) = {ci ∈ C \ {c} | d(c, ci) ≤ k-distance(c)}    (9)
Definition 10: k reverse nearest neighbors of cube c: The k reverse nearest neighbors of c, denoted as kRNN(c), contain every cube whose kNN includes c, i.e.

kRNN(c) = {ci ∈ C \ {c} | c ∈ kNN(ci)}    (10)
Definition 11: reachability distance of cube ci with respect to cube cj: The reachability distance of cube ci with respect to cube cj, denoted as reach-dist_k(ci, cj), is defined as

reach-dist_k(ci, cj) = max{k-distance(cj), d(ci, cj)}    (11)
Taking a two-dimensional data space as an example (as shown in Figure 2), cubes c1, c2, ..., c6 are numbered according to their distances from c, from nearest to farthest. In the case of k = 8, W(c) + W(c1) + W(c2) = k.
Figure 2: A 2-dimensional example.
We derive that kNN(c) = {c1, c2} and k-distance(c) = d(c, c2). In the case of k = 12, we derive the following inequalities:

W(c) + W(c1) + W(c2) + W(c3) < k
W(c) + W(c1) + W(c2) + W(c3) + W(c4) > k    (12)

According to Equation 9, we derive kNN(c) = {c1, c2, c3, c4}. The total number of points included in these cubes is 22, which is greater than k = 12. To address this problem, we introduce the contribution factor CF. Assuming that the data points in each cube are uniformly distributed, CF is defined as follows.

Definition 12: contribution factor of cube ci with respect to cube c: Assuming cubes c1, c2, ..., ci, ..., cl are the k nearest neighbors of cube c, numbered according to their distances from c, the contribution factor of cube ci with respect to cube c, denoted as CF(ci, c), is defined as

CF(c_i, c) =
\begin{cases}
1, & \text{if } i < l; \\
\dfrac{k - \sum_{j=1}^{l-1} W(c_j) - W(c)}{W(c_l)}, & \text{if } i = l; \\
0, & \text{otherwise.}
\end{cases}    (13)
In the case of k = 12, CF(c4, c) = (k − W(c) − W(c1) − W(c2) − W(c3))/W(c4) = 1/11.
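The weighted neighbor selection of Definitions 8, 9 and 12 can be sketched as follows. This is our own illustration (names are ours), and it breaks distance ties arbitrarily rather than including all tied cubes.

    from typing import Dict, List, Tuple

    Coord = Tuple[float, ...]

    def distance(a: Coord, b: Coord) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    def knn_and_cf(c: Coord, weights: Dict[Coord, int], k: int):
        """Weighted k nearest neighbors of cube c, its k-distance, and the contribution
        factor CF of every neighbor (Definitions 8, 9 and 12).
        `weights` maps each non-empty cube center to its weight W."""
        covered = weights[c]                       # W(c): points already inside cube c
        knn: List[Coord] = []
        cf: Dict[Coord, float] = {}
        for d_ci, ci in sorted((distance(c, ci), ci) for ci in weights if ci != c):
            if covered >= k:                       # closer cubes already cover k points
                break
            knn.append(ci)
            remaining = k - covered
            # intermediate neighbors contribute fully, the last one only partially
            cf[ci] = 1.0 if weights[ci] <= remaining else remaining / weights[ci]
            covered += weights[ci]
        k_distance = distance(c, knn[-1]) if knn else 0.0
        return knn, cf, k_distance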
Figure 3: A 2-dimensional example including two different cases. (a) Cube c does not contain any data except the point dc; (b) cube c contains other data besides the point dc.
Definition 13: local reachability density of cube c: The local reachability density of cube c, denoted as lrd_k(c), is defined as

lrd_k(c) = 1 \Big/ \frac{\sum_{c_i \in kNN(c)} \text{reach-dist}_k(c, c_i) \times W(c_i) \times CF(c_i, c)}{k - W(c)}    (14)

Definition 14: local outlier factor of cube c: The local outlier factor of cube c, denoted as LOF_k(c), is defined as

LOF_k(c) = \frac{\sum_{c_i \in kNN(c)} \frac{lrd_k(c_i)}{lrd_k(c)}}{|kNN(c)|}    (15)
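Continuing the sketch above, Equations 14 and 15 translate directly into the following helpers; the argument names are ours, and `reach[ci]` is assumed to hold reach-dist_k(c, ci) computed as in Definition 11.

    from typing import Dict, List, Tuple

    Coord = Tuple[float, ...]

    def cube_lrd(c: Coord, knn: List[Coord], cf: Dict[Coord, float],
                 weights: Dict[Coord, int], reach: Dict[Coord, float], k: int) -> float:
        """Local reachability density of cube c (Equation 14)."""
        weighted_sum = sum(reach[ci] * weights[ci] * cf[ci] for ci in knn)
        return (k - weights[c]) / weighted_sum

    def cube_lof(c: Coord, knn: List[Coord], lrd: Dict[Coord, float]) -> float:
        """Local outlier factor of cube c (Equation 15); every data point mapped
        into cube c inherits this score."""
        return sum(lrd[ci] / lrd[c] for ci in knn) / len(knn)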
The cube-based outlier detection algorithm is shown in Algorithm 1. In the example shown in Figure 3, cubes ci (i = 1, 2, 3, 4, 5, 6) are numbered according to their distances from c, and a new point dc is mapped to cube c. In the case that cube c does not exist before data point dc arrives (Figure 3a), W(c), kNN(c) and CF(ci, c) for each cube ci ∈ kNN(c) are first calculated (lines 4-5 in Algorithm 1). Let k = 8; then W(c) = 1, kNN(c) = {c1, c2, c3}, CF(c1, c) = CF(c2, c) = 1, and CF(c3, c) = 1/3. According to kNN(c), k-distance(c) can be calculated and is equal to d(c, c3). Next, kRNN(c) can be calculated and consists of cubes c3, c5, and c6. In addition, for the affected cubes, their kNN and k-distance also need to be updated.
Algorithm 1: CB-ILOF Algorithm
Input: Data stream S : {d1, ..., dc, ...}
Output: the LOF of each data point in S
 1  for each data point dc ∈ S do
 2      Calculate the coordinate of the cube c to which dc is mapped;
 3      if cube c does not exist then
 4          W(c) = 1; Compute kNN(c);
 5          Compute CF(ci, c) for each cube ci ∈ kNN(c);
 6          Compute k-distance(c); Compute kRNN(c);
 7          Update kNN and k-distance of the affected cubes;
 8      else
 9          W(c)++; Update kNN(c) and k-distance(c);
10          Update CF(ci, c) for each cube ci ∈ kNN(c);
11          for each cube ci ∈ kRNN(c) do
12              Update kNN(ci) and CF(cj, ci) for each cube cj ∈ kNN(ci) \ {c};
13              Update k-distance(ci);
14          end
15      end
16      Compute reach-dist_k(c, ci) for each cube ci ∈ kNN(c);
17      S_update_lrd = kRNN(c);
18      for each cube ci ∈ kRNN(c) do
19          for each cube cj ∈ kNN(ci) \ {c} do
20              reach-dist_k(cj, ci) = k-distance(ci);
21              if ci ∈ kNN(cj) then
22                  S_update_lrd = S_update_lrd ∪ {cj};
23              end
24          end
25      end
26      S_update_LOF = S_update_lrd;
27      for each cube ci ∈ S_update_lrd do
28          Update lrd(ci);
29          S_update_LOF = S_update_LOF ∪ kRNN(ci);
30      end
31      Update LOF(ci) for each cube ci ∈ S_update_LOF;
32      Compute lrd(c) and LOF(c);
33      LOF(dc) = LOF(c)
34  end
In the case that cube c already exists (Figure 3b), the weight of cube c is first updated. The arrival of the new point dc may change the k nearest neighbors of cube c; for example, with k = 8 the k nearest neighbors of cube c change from {c1, c2, c3} to {c1, c2}. Hence, we also need to update kNN(c), k-distance(c), and CF(ci, c) for each cube ci ∈ kNN(c) (lines 9-10 in Algorithm 1). Since the insertion of the new point may decrease the k-distance of certain neighboring cubes, and this can happen only to those cubes that have the new point in their k-neighborhood (Theorem 1 in [28]), the kNN of the cubes included in kRNN(c) may change. Consequently, for each cube ci ∈ kRNN(c), we need to update kNN(ci), CF(cj, ci) for each cj ∈ kNN(ci), and k-distance(ci) (lines 11-14 in Algorithm 1).

For both cases, according to Equation 11, the reachability distances also need to be updated because of the change of k-distance (lines 18-25 in Algorithm 1). In our example shown in Figure 3b, k-distance(c3) changes from d(c3, c2) to d(c3, c6), so reach-dist(c3, c) and reach-dist(c3, c6) need to be updated.

According to Equation 14, the lrd of a cube ci is affected only if any of the following items changes:

a) the k-neighborhood of cube c;
b) the reachability distance from cube c to one of its k-neighbors;
c) the weight of one of the k-neighbors of cube c;
d) the CF of one of the k-neighbors with respect to cube c.

Because a), c) and d) change only if the cube to which the new point is mapped becomes one of the k neighbors of cube c, we need to update lrd for all cubes ci for which cube c is now one of their k nearest neighbors (cubes c3, c5, c6 in Figure 3a and cubes c3, c5 in Figure 3b), and for all cubes cj where reach-dist(cj, ci) is updated and ci is among the k nearest neighbors of cj (cube c6 in Figure 3b); see lines 17-25 and 27-30 in Algorithm 1.

According to Equation 15, the LOF of a cube ci needs to be updated only if any of the following items changes:

a) lrd(ci);
b) lrd(cj) of one of its k nearest neighbors;
c) the weight of one of its k-neighbors;
d) the CF of one of its k-neighbors with respect to cube ci.

Therefore, we need to update LOF for all cubes ci where lrd(ci) is updated or where lrd(cj) of one of its k nearest neighbors changes (lines 26-31 in Algorithm 1). Finally, we compute lrd(c) and LOF(c) according to Equation 14 and Equation 15, and LOF(dc) is obtained at the same time (lines 32-33 in Algorithm 1).

The main advantages of our proposed algorithm are as follows:

(1) Reduced runtime memory overhead. Streaming data is an unbounded sequence of data and dynamic in nature, so outlier detection for streaming data usually leads to high runtime memory overhead. In CB-ILOF, the number of cubes stays stable after the data space is divided into multiple cubes. The memory required for cube management is the main runtime memory overhead of our algorithm, and it does not increase dramatically with the continuous arrival of data points.
(2) Reduced time overhead. CB-ILOF not only reduces the demand for runtime memory, but also the computation time of outlier detection. In ILOF, both kNN and kRNN are updated for each new data point, and the time complexity of efficient kNN [34] and kRNN [3] queries is O(log n). However, in our new algorithm only kNN is updated if the cube to which the new data point is mapped is not empty, and the time complexity of this update is O(1). Considering that abnormal data accounts for only a small fraction of the dataset, many data points are, on average, inserted into the same cube, which further reduces the calculation time.

5. Evaluation

To verify the efficiency of the proposed algorithm, we carry out a series of experiments on 5 different datasets. All of the following experiments were performed on a computer with a 3.30 GHz CPU and 8 GB of memory.
Figure 4: The RandomSet.
5.1. Experiments on the synthetic dataset

The synthetic dataset (called RandomSet in this paper) has a total of 11000 randomly generated data points. As shown in Figure 4, the RandomSet has 4 classes: ClassA, ClassB, ClassC and ClassD, where ClassA and ClassD conform to a uniform distribution and the other classes conform to a normal distribution. We assume that the data points distributed sparsely around the edges of ClassB and ClassC are outliers or abnormal data.

On the RandomSet, we vary the parameter k of CB-ILOF and ILOF. Then, we calculate the value of LOF for each cube (in CB-ILOF) and each point (in ILOF).
Figure 5: The experimental results of CB-ILOF and ILOF on the RandomSet with k = 160, 170, 180. The red cylinders in LOF direction show the detected abnormal data points.
Figures 5a, 5b and 5c are the detection results of CB-ILOF with k = 160, 170, and 180, and Figures 5d, 5e and 5f are the detection results of ILOF with k = 160, 170, and 180. The red cylinders in the LOF direction show the detected abnormal data points. As k increases, more boundary points are detected as outliers. The abnormal cubes detected using CB-ILOF are consistent with the detection results of ILOF.

Figure 6 provides insight into the effect of the parameter l (the cube size of each dimension) on the outlier detection accuracy. We change the cube size with k set to 180. It can be seen from Figure 6 that as l decreases, the number of data points accumulated in each cube decreases, and the value of LOF approaches the result of the traditional LOF algorithm.
Figure 6: The experimental results of CB-ILOF over RandomSet with l = 2, 4, 6, 8.
Figures 7a, 7b and 7c show the runtime memory of the two algorithms with k = 160, 170 and 180, respectively. It can be seen that as k increases, the memory usage of both algorithms shows an upward trend, but CB-ILOF has a smaller memory footprint than ILOF. When k = 180, the runtime memory of CB-ILOF is 0.61 GB, while that of ILOF reaches as high as 0.8 GB. Therefore, the runtime memory overhead is reduced by our new algorithm.

To evaluate the performance of CB-ILOF, we record the detection time of 2000 consecutive data points; the results are shown in Figure 8.
Figure 7: The runtime memory of CB-ILOF and ILOF over RandomSet with (a) k = 160, (b) k = 170, and (c) k = 180.
Figure 8: Run time of outlier detection using CB-ILOF and ILOF on RandomSet. Blue triangular points represent the run time of CB-ILOF, and orange circular points represent the run time of ILOF.
It can be seen that the detection time of the two algorithms is basically the same when the number of data points is small. However, as the number of data points increases, the detection time of ILOF keeps growing, while that of CB-ILOF remains basically constant and substantially lower than ILOF. The speedup of detection time is up to 10x on average. Therefore, compared with ILOF, CB-ILOF is more efficient on large datasets, which indicates the better scalability of our algorithm in computational efficiency.

5.2. Experiments on the KDD Cup 99 dataset

The KDD Cup 99 dataset has 41 attributes in total, of which 34 are numeric and 7 are non-numeric. In the following experiments, the 34 numeric attributes are used, and a total of 50000 data points are tested. Normalization is performed in advance on each attribute of the dataset: assuming that xi is the ith record and xij is the jth attribute of xi, each xij is scaled to a normalized value x′ij that satisfies 0 ≤ x′ij ≤ 1. Hence the cube size of each dimension is set to 0.05 in this experiment, and the value of k is 180.
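The normalization formula itself is not reproduced above; assuming it is the usual per-attribute min-max scaling (which yields the stated range 0 ≤ x′ij ≤ 1), a minimal sketch is:

    import numpy as np

    def min_max_normalize(X: np.ndarray) -> np.ndarray:
        """Scale every attribute (column) of X into [0, 1].
        Assumption: the normalization used is a standard min-max scaling."""
        X = np.asarray(X, dtype=float)
        col_min = X.min(axis=0)
        col_range = X.max(axis=0) - col_min
        col_range[col_range == 0] = 1.0          # constant attributes map to 0
        return (X - col_min) / col_range

    # With attributes in [0, 1], a cube size l = 0.05 gives at most 20 segments per
    # dimension, so the number of non-empty cubes stays manageable.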
Figure 9 compares the runtime memory usage of the two algorithms with l = 0.05 and k = 180. It can be seen that the runtime memory of CB-ILOF is less than 2 GB, while the memory of ILOF reaches as high as 2.5-3 GB.

Figure 9: Runtime memory of the two algorithms on the KDD Cup 99 dataset. (a) CB-ILOF; (b) ILOF.

The detection time of 7000 continuous data points is presented in Figure 10. It can be seen that both algorithms run fast when the number of
data points is small. However, CB-ILOF exhibits better computation efficiency as the number of data points increases. Moreover, the time of outlier detection with CB-ILOF is more stable over the entire dataset than ILOF, and the speedup of average detection time is 19x. It further indicates better scalability of our algorithm in computational efficiency.
Figure 10: Run time of outlier detection using CB-ILOF and ILOF on KDD Cup 99 dataset. Blue triangular points represent the run time of CB-ILOF, and orange circular points represent the run time of ILOF.
Figure 11: Predictive error evolution of CB-ILOF on the KDD Cup 99 dataset.
To evaluate the accuracy of our method, the prequential strategy proposed by Gama et al. [15, 16] is adopted in this experiment to evaluate our algorithm. Figure 11 presents the predictive error evolution of CB-ILOF on the KDD Cup 99 dataset, where k and l are set to 180 and 0.1, respectively. From this figure we can see that the final predictive error is around 30%.
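A minimal sketch of the prequential (test-then-train) evaluation used here: each arriving point is scored by the current model before it is inserted, and the running mean of the 0/1 loss gives the error curve. The `detector` interface and the fixed LOF threshold are our own assumptions, not part of [15, 16].

    def prequential_error(stream, detector, threshold: float = 1.5):
        """stream yields (point, is_outlier) pairs; detector exposes score() and insert().
        `threshold` is an assumed cutoff for labeling a LOF score as an outlier."""
        errors, mistakes = [], 0
        for i, (point, is_outlier) in enumerate(stream, start=1):
            predicted = detector.score(point) > threshold    # test first ...
            mistakes += int(predicted != is_outlier)
            detector.insert(point)                           # ... then update the model
            errors.append(mistakes / i)                      # running predictive error
        return errors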
5.3. Experiments on other datasets

To evaluate the efficiency of CB-ILOF, we select three other datasets, Http, Smtp, and Shuttle [22, 2, 35, 37, 45], from the ODDS Library [31]. Their details are as follows.

- Shuttle: The original Statlog (Shuttle) dataset from the UCI machine learning repository is a multi-class classification dataset with a dimension of 9. Here, the training and test data are combined. The dataset removes most of the exception data and contains a total of 40660 data points. In our experiments, 20660 data points are used to initialize both models, and the other 25000 data points are tested in a streaming manner.

- Http: The Http dataset comes from the previously mentioned KDD Cup 99 dataset. Because of the high redundancy and outlier ratio of the original KDD dataset, there are many improved versions of it, and the Http dataset is one of them. In our experiments, 25000 data points are used for initialization, and the other 25000 data points are tested in a streaming manner.

- Smtp: The Smtp dataset also comes from the KDD Cup 99 dataset, and it contains 3 attributes. We randomly select 50000 data points from the dataset. Similarly, 25000 points are used to initialize both models, and the other 25000 points are tested in a streaming manner.

In the following experiments, the value of k is set to 220, and l is 0.005. We evaluate the accuracy of CB-ILOF and ILOF on the Shuttle, Http, and Smtp datasets.
Figure 12: Outlier detection accuracy with different l and k on the Shuttle dataset. The left figure shows the detection results of CB-ILOF with different l; the right figure shows the detection results of CB-ILOF and ILOF with different k.
Figure 12 presents the outlier detection accuracy of the two algorithms for different parameters on the Shuttle dataset. Figure 12a shows the detection accuracy of CB-ILOF with different l and k set to 180. From the figure we can see that the normal accuracy (the ratio of normal data points detected to the total number of normal data points) and the outlier accuracy (the ratio of outliers detected to the total number of outliers) decrease first and then increase as the cube size grows. Because of the low ratio of outliers in the Shuttle dataset, the final accuracy is dominated by the normal accuracy. When the cube size of each dimension is set to 0.5, the final accuracy reaches its maximum. By analyzing the experimental data, we notice that all abnormal data points of this dataset are sparsely distributed in some cubes, while the normal data points are densely distributed in other cubes; therefore, when the cube size is set to 0.5, our algorithm can achieve high accuracy. The outlier detection accuracy of CB-ILOF and ILOF is presented in Figure 12b. It can be seen that the change of k has little effect on the outlier detection accuracy of the two algorithms.

Average memory usage of the two algorithms on the Shuttle dataset for different parameters is summarised in Figure 13.
Figure 13: Average memory usage with different l and k on the Shuttle dataset. The left figure shows the runtime memory of CB-ILOF with different l; the right figure shows the runtime memory of CB-ILOF and ILOF with different k.
Figure 13a compares the average runtime memory of CB-ILOF for different cube sizes. Generally speaking, the average memory usage decreases as the cube size increases, because the number of cubes is reduced with larger cubes. In addition, from the figure we can see that the average memory usage increases slightly when the cube size is larger than 0.01. This is because the per-point execution time becomes close to the memory sampling interval when the cube size is larger than 0.01, so there is some error in the final memory statistics. Figure 13b compares the average runtime memory of CB-ILOF and ILOF with different k and l set to 0.02. It can be seen from the data in Figure 13b that the change of k has little effect on the average memory usage of both algorithms.

Table 1: Accuracy, average runtime memory and average run time of CB-ILOF with different cube sizes l on the Smtp dataset, with k = 180.
l       Normal Acc.   Outlier Acc.   Acc.    Mem. (GB)   Run time (ms)
0.005   0.778         1.000          0.778   0.549       27.254
0.010   0.878         1.000          0.878   0.657       4.269
0.020   0.927         1.000          0.926   0.291       0.480
0.050   0.974         1.000          0.974   0.220       0.165
0.100   0.993         1.000          0.993   0.197       0.017
0.200   0.994         1.000          0.994   0.201       0.011
0.500   0.999         1.000          0.999   0.191       0.009
Figure 14: Average execution time with different k and l on the Shuttle dataset. The left figure shows the results with different l, and the right figure shows the results with different k.
Figure 14 provides the average execution time of the two algorithms on the Shuttle dataset for different parameters. The left figure in Figure 14 compares the average execution time of CB-ILOF for different cube sizes; it is apparent that as the cube size increases, the average execution time decreases. The right figure in Figure 14 compares the average execution time of the two algorithms with different k and l set to 0.02. As this figure shows, there is a significant difference between the execution times of the two algorithms. Moreover, as k increases, the execution time of ILOF increases noticeably, while that of CB-ILOF remains nearly constant.

Table 2: Comparison of accuracy, average runtime memory and average run time of CB-ILOF and ILOF with different k on the Smtp dataset, with l = 0.02.
k     Accuracy             Mem. (GB)            Run time (ms)
      CB-ILOF    ILOF      CB-ILOF    ILOF      CB-ILOF    ILOF
150   0.932      0.823     0.313      1.789     0.307      130.473
160   0.933      0.823     0.288      1.941     0.353      170.458
170   0.922      0.823     0.356      1.998     0.397      210.167
180   0.927      0.823     0.291      2.024     0.480      261.734
190   0.926      0.823     0.417      1.922     0.530      317.269
200   0.922      0.822     0.435      1.796     0.656      385.365
The accuracy, average runtime memory and average run time of CB-ILOF with different cube sizes on the Smtp dataset are summarised in Table 1.
It can be seen from the data in this table that CB-ILOF performs better than ILOF in accuracy, memory, and run time. The accuracy, average runtime memory and average run time of CB-ILOF and ILOF with different k on the Smtp dataset are set out in Table 2. Similarly, increasing k has little effect on the accuracy and average memory usage of either algorithm, while the run time of both algorithms increases with k.

Table 3: Comparison of accuracy, average runtime memory and average run time with different cube sizes l on the Http dataset, with k = 180.
l       Normal Acc.   Outlier Acc.   Acc.    Ave. Mem. (GB)   Ave. time (ms)
0.001   0.459         0.974          0.459   0.632            136.951
0.002   0.542         0.974          0.543   0.631            20.904
0.005   0.805         0.974          0.806   0.779            8.865
0.010   0.951         1.000          0.951   0.704            6.128
0.020   0.993         1.000          0.993   0.524            3.232
0.050   0.996         1.000          0.996   0.288            0.137
0.100   0.998         1.000          0.998   0.092            0.025
0.200   0.998         0.359          0.996   0.053            0.014
0.500   0.997         0.231          0.996   0.056            0.015
Table 4: Comparison of accuracy, average runtime memory and average run time of CB-ILOF and ILOF with different k on the Http dataset, with l = 0.005.

k     Accuracy             Mem. (GB)            Run time (ms)
      CB-ILOF    ILOF      CB-ILOF    ILOF      CB-ILOF    ILOF
150   0.848      0.872     0.774      1.788     5.309      145.627
160   0.809      0.872     0.738      1.855     6.330      180.737
170   0.812      0.871     0.714      1.979     7.123      218.084
180   0.805      0.873     0.779      1.915     8.865      276.220
190   0.767      0.871     0.792      1.736     9.627      334.471
200   0.774      0.871     0.791      1.579     10.754     414.629
In addition, similar experimental results, summarised in Table 3 and Table 4, are obtained on the Http dataset. The data in these two tables indicate the superiority of CB-ILOF over ILOF in accuracy, memory, and execution time.
6. Conclusion

An efficient outlier detection algorithm for streaming data is presented in this paper. The proposed algorithm divides the data space into multiple cubes, and an outlier factor is assigned to each cube. A new data point is first mapped into the corresponding cube based on a least-error criterion, and all data points are replaced by the coordinates of their corresponding cubes in subsequent calculations. Then, several important quantities of the traditional ILOF are redefined for the new cube-based algorithm. Next, we propose our cube-based incremental LOF algorithm, which is an improvement of the traditional incremental algorithm, and present a detailed analysis of its runtime memory and time overheads. Finally, a series of experiments is conducted on 5 datasets with different cube sizes and values of k. Experimental results indicate lower average execution time and memory usage, and comparable or higher accuracy than ILOF.

7. Author's Contribution

Jianhua Gao proposed the method, conducted the main experiments and drafted the manuscript. Weixing Ji participated in the discussion of the algorithm and experiments, gave advice and directed the experimental procedure. Lulu Zhang carried out part of the experiments and wrote part of the manuscript. Anmin Li wrote part of the manuscript. Yizhuo Wang gave very insightful advice on the writing of this manuscript. All authors read and approved the final manuscript.

References

[1] Abadi, D. J., Ahmad, Y., Balazinska, M., Çetintemel, U., Cherniack, M., Hwang, J. H., Lindner, W., Maskey, A., Rasin, A., Ryvkina, E., 2005. The design of the Borealis stream processing engine. CIDR, 277–289. [2] Abe, N., Zadrozny, B., Langford, J., 2006. Outlier detection by active learning. Vol. 2006. pp. 504–509.
[3] Achtert, E., Kunath, P., Pryakhin, A., Renz, M., 2006. Efficient reverse k-nearest neighbor search in arbitrary metric spaces. In: ACM SIGMOD International Conference on Management of Data. pp. 515–526. [4] Ando, S., Thanomphongphan, T., Seki, Y., Suzuki, E., 2015. Ensemble anomaly detection from multi-resolution trajectory features. Data Mining and Knowledge Discovery 29 (1), 39–83. [5] Andonovski, G., Blažič, S., Škrjanc, I., 2019. Evolving Fuzzy Model for Fault Detection and Fault Identification of Dynamic Processes. Springer International Publishing, Cham, pp. 269–285. [6] Angelov, P., Filev, D. P., Kasabov, N., 2010. Evolving Intelligent Systems: Methodology and Applications. Wiley-IEEE Press. [7] Apache, S., http://kafka.apache.org/design.html. Apache Kafka, a high-throughput distributed messaging system. [8] Auradkar, A., et al., Apr. 2012. Data infrastructure at LinkedIn. In: Proc. IEEE 28th Int. Conf. Data Engineering. pp. 1370–1381. [9] Barnett, V., Lewis, T., Abeles, F., 1978. Outliers in Statistical Data. Wiley. [10] Bifet, A., Holmes, G., Pfahringer, B., 2009. Improving adaptive bagging methods for evolving data streams. In: Asian Conference on Machine Learning Advances in Machine Learning. pp. 23–27. [11] Breunig, M. M., 2000. LOF: identifying density-based local outliers. ACM SIGMOD Record 29 (2), 93–104. [12] Chauhan, J., Chowdhury, S. A., Makaroff, D., Nov. 2012. Performance evaluation of Yahoo! S4: A first look. In: Proc. Cloud and Internet Computing 2012 Seventh Int. Conf. P2P, Parallel, Grid. pp. 58–65. [13] Chiu, A. L. M., Ada Wai-chee Fu, July 2003. Enhancements on local outlier detection. In: Seventh International Database Engineering and Applications Symposium, 2003. Proceedings. pp. 298–307. [14] Costa, B. S. J., Angelov, P. P., Guedes, L. A., Feb 2015. Fully unsupervised fault detection and identification based on recursive density estimation and self-evolving cloud-based classifier. Neurocomput. 150 (PA),
289–303. URL http://dx.doi.org/10.1016/j.neucom.2014.05.086 [15] Gama, J., Sebastião, R., Rodrigues, P. P., Mar 2013. On evaluating stream learning algorithms. Machine Learning 90 (3), 317–346. URL https://doi.org/10.1007/s10994-012-5320-9 [16] Gama, J., Sebastião, R., Rodrigues, P. P., 2009. Issues in evaluation of stream learning algorithms. In: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. KDD '09. ACM, New York, NY, USA, pp. 329–338. URL http://doi.acm.org/10.1145/1557019.1557060 [17] Ghoting, A., Parthasarathy, S., Otey, M. E., 2008. Fast mining of distance-based outliers in high-dimensional datasets. Data Min. Knowl. Discov. 16 (3), 349–364. URL https://doi.org/10.1007/s10618-008-0093-2 [18] He, H., Chen, S., Li, K., Xu, X., Dec. 2011. Incremental learning from stream data. IEEE Transactions on Neural Networks 22 (12), 1901–1914. [19] Jain, A. K., Murty, M. N., Flynn, P. J., 1999. Data clustering: a review. ACM Computing Surveys 31 (3), 264–323. [20] Karanjit, S., Dr. Shuchita, U., 2012. Outlier detection: Applications and techniques. International Journal of Computer Science Issues 9 (3), 307–323. [21] Lee, J., Cho, N.-W., 11 2016. Fast outlier detection using a grid-based algorithm. PLOS ONE 11 (11), 1–11. URL https://doi.org/10.1371/journal.pone.0165972 [22] Liu, F., Ting, K., Zhou, Z.-H., 2008. Isolation forest. In: Giannotti, F., Gunopulos, D., Turini, F., Zaniolo, C., Ramakrishnan, N., Wu, X. (Eds.), Proceedings of the Eighth IEEE International Conference on Data Mining. IEEE, Institute of Electrical and Electronics Engineers, United States of America, pp. 413–422. [23] Ma, M. X., Ngan, H. Y. T., Liu, W., 2016. Density-based outlier detection by local outlier factor on large-scale traffic data. Electronic Imaging 2016 (14), 1–4.
[24] Neumeyer, L., Robbins, B., Nair, A., Kesari, A., Dec. 2010. S4: Distributed stream computing platform. In: Proc. IEEE Int. Conf. Data Mining Workshops. pp. 170–177. [25] Ozkan, H., Ozkan, F., Kozat, S. S., Mar. 2016. Online anomaly detection under Markov statistics with controllable type-i error. IEEE Transactions on Signal Processing 64 (6), 1435–1445. [26] Patrizio, A., 2006. Streambase’s real-time database gets more real. Software Development Times. [27] Peter J Rousseeuw, A. M. L., 1987. Robust regression and outlier detection. Technometrics 31 (2), 260–261. [28] Pokrajac, D., Lazarevic, A., Latecki, L. J., Mar. 2007. Incremental local outlier detection for data streams. In: Proc. IEEE Symp. Computational Intelligence and Data Mining. pp. 504–515. [29] Radovanovic, M., Nanopoulos, A., Ivanovic, M., 2015. Reverse nearest neighbors in unsupervised distance-based outlier detection. IEEE Trans. Knowl. Data Eng. 27 (5), 1369–1382. URL https://doi.org/10.1109/TKDE.2014.2365790 [30] Rajasegarar, S., Leckie, C., Palaniswami, M., 2014. Hyperspherical cluster based distributed anomaly detection in wireless sensor networks. Journal of Parallel and Distributed Computing 74 (1), 1833–1847. [31] Rayana, S., 2016. ODDS library. URL http://odds.cs.stonybrook.edu [32] Ronald, B., 1995. Outliers in statistical data. Technometrics 37 (1), 117– 118. [33] Rousseeuw, P. J., Hubert, M., 2017. Anomaly detection by robust statistics. Wiley Interdisciplinary Reviews Data Mining and Knowledge Discovery. [34] Roussopoulos, N., Kelley, S., Vincent, F., 1995. Nearest neighbor queries. In: ACM SIGMOD International Conference on Management of Data. pp. 71–79.
[35] Tan, S. C., Ting, K. M., Liu, T. F., 2011. Fast anomaly detection for streaming data. In: Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence - Volume Volume Two. IJCAI’11. AAAI Press, pp. 1511–1516. URL http://dx.doi.org/10.5591/978-1-57735-516-8/IJCAI11-254 [36] Tang, B., He, H., 2017. A local density-based approach for outlier detection. Neurocomputing 241, 171 – 180. [37] Ting, K. M., Zhou, G.-T., Liu, F. T., Tan, J. S. C., 2010. Mass estimation and its applications. In: Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. KDD ’10. ACM, New York, NY, USA, pp. 989–998. URL http://doi.acm.org/10.1145/1835804.1835929 [38] Toshniwal, A., et. al., 2014. Storm@twitter. In: Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data. SIGMOD ’14. ACM, New York, NY, USA, pp. 147–156. URL http://doi.acm.org/10.1145/2588555.2595641 [39] Tran, L., Fan, L., Shahabi, C., Aug. 2016. Distance-based outlier detection in data streams. Proc. VLDB Endow. 9 (12), 1089–1100. URL http://dx.doi.org/10.14778/2994509.2994526 [40] Valles, Antonio C. (Gilbert, A. U. Z. V. J. F. W. W. U., April 2018. Cluster anomaly detection using function interposition (20180101680). URL http://www.freepatentsonline.com/y2018/0101680.html [41] Vic Barnett, T. L., 1979. Outliers in statistical data. Physics Today 32 (9), 73–74. [42] Wei, J., 2016. Research of outlier detection and data recovery based on statistical method. Ph.D. thesis, Nanjing University of Posts and Telecommunications. [43] Xhafa, F., Naranjo, V., Caball, S., Mar. 2015. Processing and analytics of big data streams with yahoo!s4. In: Proc. IEEE 29th Int. Conf. Advanced Information Networking and Applications. pp. 263–270.
[44] Xu, Z., Kakde, D., Chaudhuri, A., 2019. Automatic hyperparameter tuning method for local outlier factor, with applications to anomaly detection. arXiv preprint arXiv:1902.00567.
[45] Yamanishi, K., Takeuchi, J., Williams, G., Milne, P., May 2004. Online unsupervised outlier detection using finite mixtures with discounting learning algorithms. Data Mining and Knowledge Discovery 8 (3), 275–300. URL https://doi.org/10.1023/B:DAMI.0000023676.72185.7c