Outlier detection using neighborhood rank difference




Accepted Manuscript

Outlier Detection Using Neighborhood Rank Difference
Gautam Bhattacharya, Koushik Ghosh, Ananda S. Chowdhury

PII: S0167-8655(15)00113-0
DOI: 10.1016/j.patrec.2015.04.004
Reference: PATREC 6203

To appear in: Pattern Recognition Letters

Received date: 28 August 2014
Accepted date: 7 April 2015

Please cite this article as: Gautam Bhattacharya, Koushik Ghosh, Ananda S. Chowdhury, Outlier Detection Using Neighborhood Rank Difference, Pattern Recognition Letters (2015), doi: 10.1016/j.patrec.2015.04.004

Outlier Detection Using Neighborhood Rank Difference

Gautam Bhattacharya^a, Koushik Ghosh^b, Ananda S. Chowdhury^c,∗∗

a Department of Physics, University Institute of Technology, University of Burdwan, Golapbag (North), Burdwan 713104, India
b Department of Mathematics, University Institute of Technology, University of Burdwan, Golapbag (North), Burdwan 713104, India
c Department of Electronics and Telecommunication Engineering, Jadavpur University, Kolkata 700032, India

ABSTRACT


Presence of outliers critically affects many pattern classification tasks. In this paper, we propose a novel dynamic outlier detection method based on neighborhood rank difference. In particular, the difference between the reverse and forward nearest-neighbor ranks is employed to capture the variations in density of a test point with respect to various training points. In the first step of our method, we determine the influence space for a given dataset. In the second step, a score for outlierness is proposed using the rank difference as well as the absolute density within this influence space. Experiments on synthetic datasets and on several UCI machine learning repository datasets clearly indicate the superiority of our method over some recently published approaches.

© 2015 Elsevier Ltd. All rights reserved.

1. Introduction


The problem of outlier detection is of great interest to the pattern recognition community. The major objective of an outlier detection method is to find the rare or exceptional objects with respect to the remaining (large amount of) data [Chen et al., 2010]. Outlier detection has several practical applications in diverse fields, e.g., fault detection of machines [Zhao et al., 2014, Purarjomandlangrudia et al., 2014], anomaly detection in hyperspectral images [Matteoli et al., 2010], novelty detection in image sequence analysis [Markou and Singh, 2006], biomedical testing [Suzuki et al., 2003, Lin et al., 2005], weather prediction [Zhao et al., 2003], geoscience and remote sensing [Du and Zhang, 2014], medicine [Laurikkala et al., 2000], financial fraud detection [Bolton and Hand, 2001, Singh and Upadhyaya, 2012], and information security [Phoha, 2002, Denning, 1987]. Different outlier detection methods have been proposed over the years based on the nature of the application. Outlier detection algorithms first create a normal pattern in the data, and then assign an outlier score to a given data point on the basis of its deviation with respect to that normal pattern [Aggarwal, 2013]. Extreme value analysis models, probabilistic models, linear models, proximity-based models [Orair et al., 2010], information-theoretic models and high-dimensional outlier detection models represent some prominent categories of outlier detection techniques. Proximity-based methods treat outliers as points which are isolated from the remaining data and can be further classified into three sub-categories, namely, cluster-based [Arribas-Gil and Romo, 2014], density-based and nearest neighbor-based [Aggarwal, 2013]. The main difference between the clustering and the density-based methods is that clustering methods segment the points, whereas density-based methods segment the space [Aggrwal and Kaur, 2013]. Local Outlier Factor (LOF) [Breunig et al., 2000], Connectivity-based Outlier Factor (COF) [Tang et al., 2002] and INFLuenced Outlierness (INFLO) [Jin et al., 2006] are examples of well-known density-based approaches for outlier detection. In contrast, the Rank Based Detection Algorithm (RBDA) [Huang et al., 2013] and Outlier Detection using Modified-Ranks with Distance (ODMRD) [Huang et al., 2011] are two recently published approaches which use ranks of nearest neighbors for the detection of outliers. In most of the density-based approaches, it is assumed that the density around a normal data object is similar to the density around its neighbors, whereas in the case of an outlier the density is considerably lower than that of its neighbors. In LOF [Breunig et al., 2000], the densities of the points are calculated within some local reachable distance, and the degree of outlierness of a point is assigned in terms of the relative density of the test point with respect to its neighbors [Breunig et al., 2000].

∗∗ Corresponding author: Tel.: +91-33-2457-2405; fax: +91-33-2414-6217; e-mail: [email protected] (Ananda S. Chowdhury)

2. Theoretical Foundations


Let D denote the data set of all observations, let k denote the number of points in the set Nk(p) of k nearest neighbors around some point p ∈ D, and let d(p, q) denote the Euclidean distance between any two points p, q ∈ D. We use the Euclidean distance for its simplicity. Further, let R denote the reverse rank of the point p with respect to the point q ∈ Nk(p). In the proposed method, we employ the difference of the Reverse Nearest Neighbor (RNN) and Forward Nearest Neighbor (kNN) ranks of a point. If q is the k-th neighbor of the point p at distance dk(p, q), then the forward density up to the k-th neighbor is given by:

Ωk(p) = k / dk(p, q)    (1)

Similarly, if p is the R-th neighbor of q for the same distance dk(p, q), then the reverse density around q at that distance is given by:

ΩR(q) = R / dk(p, q)    (2)

A positive value of the rank difference (R − k) indicates that q has a denser surrounding than p. By a denser surrounding, we mean the presence of more points within the hypersphere with radius dk(p, q) and center q. Similarly, a negative value of the rank difference indicates that p has a denser surrounding than q. For the same values of R and k, p and q have equally dense surroundings. An illustration is shown in Fig. 1.
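To make the rank-difference idea concrete, here is a small NumPy sketch. This is our own illustrative code, not the authors' implementation; the function names and the toy data are ours. It builds a dense cluster plus one isolated point p, takes q as the k-th forward neighbor of p, and checks that the reverse rank R of p with respect to q exceeds k, so (R − k) is positive and the reverse density of Eq. (2) exceeds the forward density of Eq. (1).

```python
import numpy as np

def reverse_rank(X, p_idx, q_idx):
    """Rank R of p among the neighbors of q: p is the R-th nearest neighbor of q."""
    d = np.linalg.norm(X - X[q_idx], axis=1)
    d[q_idx] = np.inf                      # exclude q itself
    order = np.argsort(d)
    return int(np.where(order == p_idx)[0][0]) + 1  # 1-based rank

# Toy data: a dense cluster around the origin plus an isolated test point p.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.05, size=(20, 2)),   # cluster (indices 0..19)
               np.array([[1.0, 1.0]])])               # isolated point p (index 20)

k, p = 4, 20
d_p = np.linalg.norm(X - X[p], axis=1)
d_p[p] = np.inf
q = int(np.argsort(d_p)[k - 1])          # q = k-th forward neighbor of p
dk = d_p[q]                              # dk(p, q)

R = reverse_rank(X, p, q)
omega_k_p = k / dk                       # forward density around p, Eq. (1)
omega_R_q = R / dk                       # reverse density around q, Eq. (2)
# For an outlier p next to a dense cluster, R > k: the rank difference is positive.
print(R - k > 0)
```

Because p is far from the cluster, p sits near the end of q's sorted neighbor list, so R is large while k is small, matching the discussion above.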


Tang et al. (2002) argued that lower density is not a necessary condition for being an outlier. Accordingly, in COF [Tang et al., 2002], a set-based nearest path was used to select a set of nearest neighbors. This nearest path was further employed to find the relative density of a test point within the average chaining distance. COF is shown to be more effective when a cluster and a neighboring outlier have similar neighborhood densities. Both LOF [Breunig et al., 2000] and COF [Tang et al., 2002], which use properties of kNN, are found to yield poor results when an outlier lies between a sparse and a dense cluster. To handle such situations, Jin et al. (2006) proposed a new algorithm, INFLO, based on a symmetric neighborhood relationship. In this method, both forward and reverse neighbors of a data point are considered while estimating its density distribution [Jin et al., 2006]. In density-based approaches, all the neighbors of a test point are assumed to have a similar density. So, if neighbors are chosen from different clusters with uneven densities, this assumption may introduce errors in outlier detection. In addition, the notion of density may not work properly for some special types of distributions. For example, if all data points lie on a single straight line, a normal density-based algorithm [Huang et al., 2013] assumes equal density around the test point and its neighbors. This occurs because the closest-neighbor distance is the same for the test point and its neighbor points. In such situations, rank-based outlier detection schemes like RBDA [Huang et al., 2013] and ODMRD [Huang et al., 2011] yield better results than the density-based algorithms. RBDA uses the mutual closeness between a test point and its k neighbors for rank assignment. In ODMRD [Huang et al., 2011], the ranks were given weights and the distances between the test point and its neighbors were incorporated.
Still, both RBDA and ODMRD are found to be adversely affected by local irregularities of a dataset, such as the cluster-deficiency effect and the border effect. To address the shortcomings of density-based and rank-based methods, we propose a novel hybrid outlier detection approach using the concepts of density as well as neighborhood rank difference. The first contribution of our method is that, instead of a local reachable distance [Breunig et al., 2000], we employ a dataset-specific global limit in terms of k (the number of forward neighbors) to estimate the spatial density. The second contribution is that, by using the reverse as well as the forward rank difference, we capture variations in density better than the rank-based methods [Huang et al., 2013, 2011], minimizing both the cluster-deficiency effect and the border effect. Our third contribution is that we minimize the information loss due to averaging [Breunig et al., 2000] through an effective density sampling procedure. Experimental results clearly indicate that we can capture most of the m outliers within the top-m instances. The rest of the paper is organized as follows: In Section 2, we provide the theoretical foundations. In Section 3, we describe the proposed method and also analyze its time complexity. In Section 4, we present the experimental results with detailed comparisons. Finally, the paper is concluded in Section 5 with an outline of directions for future research.

Fig. 1. Schematic diagram of the distribution of neighbors around a test point p and different regions of interest with respect to p and q

In Fig. 1, k = 4 and R = 6, so the rank difference according to our definition is 2. In this case, as (R − k) is positive, the point q has a denser surrounding than the point p.

3. Proposed Method

Our proposed method consists of two steps. In the first step, we construct an influence space around a test point p. In the second step, a rank-difference based outlier score is assigned on the basis of this influence space.

3.1. Influence Space Construction

The influence space depicts a region with significantly high reverse density in the locality of a point under consideration. If the localities of the neighbors within the influence space [Jin et al., 2006, Zhang et al., 2009] are denser than the locality of the point in question, a high outlierness score will be assigned to it. For an entire dataset, the number of neighbors in the influence space is kept fixed.

In Section 2, we defined a reverse density ΩR which captures the density surrounding the locality of the neighboring points of a particular point. As the distance from the target point increases, more neighbors are included in its surroundings, resulting in different values of ΩR. With the successive addition of neighboring points, a set of reverse densities is obtained for each point at varying depths (numbers of neighboring points). The average reverse density Ω̄R for each depth is determined next. Note that we consider the depth, and not the distance, around the neighbors in order to handle situations where there is empty space (no neighboring point present) surrounding a given point. To avoid random fluctuations, the variation of the average reverse density with respect to depth is smoothed using a Gaussian kernel in the following manner:

Ω_smoothed = (1 / (n·h_optimal)) · Σ_{i=1..n} exp( −(1/2) · ((ΩR^(i) − Ω̄R) / h_optimal)² )    (3)

where

h_optimal = 0.9 · σ · n^(−1/5)    (4)

and

σ = median(|x − median(x)|) / 0.6745    (5)

Here σ stands for an unbiased and consistent estimate of the population standard deviation for large N [Bowman and Azzalini, 1997, Rouss and Croux, 1993]. In this smoothing process, an optimal kernel width h_optimal is determined using (4) [Silverman, 1986] and (5) for a better estimate of the significant density fluctuation around the neighboring points. We deem the first most significant peak [Yasui et al., 2003, Morris et al., 2005] of this kernel-smoothed probability density function [Bowman and Azzalini, 1997] to be the limit of the influence space. The peak is determined using the undecimated wavelet transform with Daubechies coefficients [Coombes et al., 2005]. Such wavelet transforms can locate peaks with maximum confidence by eliminating surrounding spurious noisy peaks.

3.2. Outlier Score

In the second part of our proposed algorithm, we use a rank-difference based score for ranking the outliers. A positive value of the rank difference (R − k) signifies a higher concentration of neighbors around the training point q than around the test point p; negative and zero values signify a lower or equal concentration, respectively. Thus the outlierness of the test point depends directly on the excess population of the neighborhood of q with respect to the test point p, i.e., on the rank difference (R − k). Secondly, it depends inversely on the point's own forward density Ωk(p). In summary,

• Outlierness ∝ (R − k)    for fixed distance dk(p, q)
• Outlierness ∝ 1/Ωk(p)    for fixed distance dk(p, q)

Hence, the Rank-difference and Density-based Outlier Score (RDOS) can be written as

RDOS = median( (R − k) / Ωk(p) )    (6)

The median is used to obtain a more robust estimate of RDOS.

Algorithm: Rank-Difference based Outlier Detection
Input: D
Output: top-n outliers according to RDOS
Initialization: N(p), Nk(p), d(p,q), dk(p,q), R, Ωk(p), ΩR(q), Ω̄R, h_optimal, Ω_smoothed, k_optimal, Score, RDOS, RDOS_list(p), top_n_list

/* Find sorted distances from p to all neighbors q ∈ D */
For all p ∈ D Do
    For all q ∈ D Do
        If q ≠ p Then
            d(p,q) ← dist(p,q)
            N(p) ← q
        End
    End
    dk(p,q) ← sort(d(p,q), 'ascend')
    Nk(p) ← sort(N(p) by d(p,q))
End
/* Calculate the reverse neighborhood rank of p */
For all p ∈ D Do
    For all q ∈ Nk(p) Do
        R ← rank of p by q
    End
End
/* Average reverse density Ω̄R for varying depth k */
For each depth k Do
    For all p ∈ D Do
        ΩR(q) ← R/dk(p,q)
    End
    Ω̄R ← mean(ΩR(q))
End
/* Kernel smoothing of Ω̄R and detection of the first most significant peak */
h_optimal ← determined by (4) and (5)
Ω_smoothed ← Gaussian kernel smoothing of Ω̄R using h_optimal in (3)
k_optimal ← first most significant peak of Ω_smoothed using Daubechies wavelets
/* RDOS score assignment to p */
For all p ∈ D Do
    For each depth k ≤ k_optimal Do
        Ωk(p) ← k/dk(p,q)
        Score ← (R − k)/Ωk(p)
    End
    RDOS ← median(Score)
End
RDOS_list(p) ← sort(RDOS, 'descending')
top_n_list ← sort Nk(p) by RDOS_list(p)
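The pipeline of Sections 3.1 and 3.2 can be condensed into the following sketch. This is our own illustrative reimplementation under stated simplifications, not the authors' code: we evaluate Eq. (3) as a kernel density estimate over the average reverse densities, and we replace the Daubechies-wavelet peak detection with a plain first-local-maximum rule; all function and variable names are ours.

```python
import numpy as np

def rdos_scores(X, max_depth=None):
    """Sketch of RDOS: pick an influence-space depth k_opt from the smoothed
    average reverse density, then score each point by the median of
    (R - k) / Omega_k(p) over k = 1..k_opt."""
    n = len(X)
    max_depth = max_depth or (n - 1)

    # Pairwise distances, forward neighbor order, and rank table:
    # rank[p, q] = 1-based rank of q in p's sorted neighbor list.
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(D, np.inf)
    order = np.argsort(D, axis=1)
    dist = np.take_along_axis(D, order, axis=1)
    rank = np.empty((n, n), dtype=int)
    rank[np.arange(n)[:, None], order] = np.arange(1, n + 1)

    # Average reverse density at each depth k (Sec. 3.1, Eq. (2) averaged over p).
    depths = np.arange(1, max_depth + 1)
    avg_rev = np.empty(len(depths))
    for i, k in enumerate(depths):
        q = order[:, k - 1]                # k-th forward neighbor of each p
        dk = dist[:, k - 1]
        R = rank[q, np.arange(n)]          # reverse rank of p w.r.t. q
        avg_rev[i] = np.mean(R / dk)

    # Gaussian-kernel smoothing; bandwidth from Eqs. (4)-(5), MAD-based sigma.
    sigma = np.median(np.abs(avg_rev - np.median(avg_rev))) / 0.6745
    h = max(0.9 * sigma * len(avg_rev) ** (-0.2), 1e-12)
    smoothed = np.array([np.exp(-0.5 * ((avg_rev - v) / h) ** 2).sum()
                         for v in avg_rev]) / (len(avg_rev) * h)

    # First local maximum stands in for the wavelet-detected significant peak.
    peaks = [i for i in range(1, len(smoothed) - 1)
             if smoothed[i - 1] < smoothed[i] >= smoothed[i + 1]]
    k_opt = int(depths[peaks[0]]) if peaks else int(depths[np.argmax(smoothed)])

    # RDOS: median over the influence space of (R - k) / Omega_k(p), Eq. (6).
    scores = np.empty(n)
    for p in range(n):
        vals = []
        for k in range(1, k_opt + 1):
            q = order[p, k - 1]
            dk = dist[p, k - 1]
            R = rank[q, p]
            vals.append((R - k) / (k / dk))   # Omega_k(p) = k / dk
        scores[p] = np.median(vals)
    return scores, k_opt

# Toy check: one far-away point among a tight cluster should get the top score.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.05, size=(30, 2)), [[1.0, 1.0]]])
scores, k_opt = rdos_scores(X)
print(int(np.argmax(scores)))   # index of the isolated point
```

The isolated point has a small forward density and a large reverse rank with respect to each of its cluster neighbors, so its median score dominates, as the text above argues.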

3.3. Time-Complexity

For the preprocessing of the data points, the time complexity is O(N).
Step 1: The time complexity for finding the distances from a point p to all points in D is O(N), and sorting the distances requires O(N ln N). Repeating this for all p in D gives O(N(N + N ln N)) = O(N² + N² ln N).
Step 2: The time complexity for calculating the reverse rank for all points in D is again O(N²).
Step 3: The time complexity for finding the average reverse density Ω̄R for varying depth k (k varies from 1 to N) for all points in D is O(N²).
Step 4: The total time complexity for kernel smoothing of Ω̄R and for finding the first most significant peak is O(N) + 4·O(1) ≈ O(N).
Step 5: The time complexity for assigning the RDOS score to all points in D is O(N(k_optimal + 1)) ≈ O(N·k_optimal) ≈ O(N²).
So, the total time complexity is 2·O(N) + O(N² + N² ln N) + 3·O(N²) ≈ O(N² ln N).

4. Experiments

We have used two synthetic and five real datasets from the UCI machine learning repository for the performance evaluation of our method. Fig. 2 shows the distribution of the different datasets along with the rare class as outliers. For real datasets with more than three dimensions, pairwise plots of the first four dimensions are shown. Fig. 3 shows the determination of k_optimal from the neighborhood density plot.

4.1. Metrics for Measurement

For performance evaluation of the algorithms, we use two metrics, namely Recall and RankPower [Lijun, 2010]. Following [Huang et al., 2013], let the m most suspicious instances in a dataset D contain dt true outliers, and let mt be the number of true outliers detected by an algorithm. The Recall (Re) measure, which denotes the accuracy of an algorithm, is given by:

Recall = |mt| / |dt|    (7)

If, using a given detection method, the true outliers occupy the top positions with respect to the non-outliers among the m suspicious instances, then the RankPower (RP) of the method is said to be high [Tang et al., 2002]. If n denotes the number of outliers found within the top m instances and Li denotes the position of the i-th detected outlier, then

RankPower = n(n + 1) / (2 · Σ_{i=1..n} Li)    (8)

RankPower attains its maximum value of 1 when all n true outliers are in the top n positions. For a fixed value of m, larger values of these metrics imply better performance.

4.2. Experimental Results

In the synthetic datasets [Huang et al., 2013] we have six outliers. The real datasets considered are Iris, Ionosphere, Wisconsin Breast Cancer, Yeast and Cardiotocography, all collected from the UCI archive. In the five real datasets, we randomly selected a few points of the same class as our rare objects, i.e., as target outliers. Table 1 provides basic information about these datasets. The synthetic datasets [Huang et al., 2013] are plotted in Fig. 2 along with the outlier class marked individually. The IRIS dataset is shown in the same Fig. 2; for the real datasets we show the pairwise dimensional plot up to the 4th dimension. Since our method falls at the juncture of density-based and rank-based methods, we compare our work with representative methods from each category: within the density-based group, with LOF [Breunig et al., 2000], COF [Tang et al., 2002] and INFLO [Jin et al., 2006]; within the rank-based group, with RBDA [Huang et al., 2013] and ODMRD [Huang et al., 2013, 2011].

Table 1. Information regarding the datasets [Huang et al., 2013]

Datasets                  | No. of Attributes | Total No. of Instances including Outliers | No. of true Outliers
Synthetic 1               | 2                 | 74                                        | 6
Synthetic 2               | 2                 | 134                                       | 6
Iris                      | 4                 | 105                                       | 5
Ionosphere                | 33                | 235                                       | 10
Wisconsin Breast Cancer   | 9                 | 223                                       | 10
Yeast                     | 8                 | 1484                                      | 5
Cardiotocography          | 36                | 2069                                      | 60
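The two metrics of Section 4.1 are straightforward to compute from the ranked score list. The helper functions below are our own sketch; in particular, the RankPower denominator uses the positions Li at which the true outliers appear in the suspect list, which is our reading of Eq. (8) following the convention of [Huang et al., 2013].

```python
def recall(outlier_positions, n_true):
    """Eq. (7): fraction of the true outliers found among the top-m instances."""
    return len(outlier_positions) / n_true

def rank_power(outlier_positions):
    """Eq. (8): n(n+1) / (2 * sum of positions of the detected true outliers);
    equals 1 exactly when the n outliers occupy the top n positions."""
    n = len(outlier_positions)
    if n == 0:
        return 0.0
    return n * (n + 1) / (2 * sum(outlier_positions))

# Perfect ranking: 5 outliers at positions 1..5 of the suspect list.
print(recall([1, 2, 3, 4, 5], 5))    # 1.0
print(rank_power([1, 2, 3, 4, 5]))   # 5*6 / (2*15) = 1.0
# Weaker ranking: the same 5 outliers scattered to positions 1, 2, 4, 7, 9.
print(round(rank_power([1, 2, 4, 7, 9]), 3))
```

Note that RankPower penalizes late positions: pushing a single outlier further down the list increases the denominator and lowers the score, even when recall is unchanged.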

4.2.1. Synthetic Datasets
We applied our proposed method to the Synthetic 1 and Synthetic 2 datasets of [Huang et al., 2013]. In our method, the range of the influence space k is chosen dynamically, which is advantageous over the other methods because k is found in a single pass. In the case of LOF [Breunig et al., 2000], COF [Tang et al., 2002], INFLO [Jin et al., 2006] and RBDA [Huang et al., 2013], the value of k has to be varied, which requires more time. Table 2 reports the performance on Synthetic 1 and Table 3 on Synthetic 2. For the Synthetic 1 dataset the chosen value of k is 3, whereas for Synthetic 2 it is 7. The rank power of our method is maximal in both cases: the six outliers of each dataset are identified within the top six values of the RDOS score.

4.2.2. Real Datasets
Tables 4-10 present performance comparisons on the Iris, Ionosphere, Wisconsin, Yeast and Cardiotocography datasets. For all methods, we first compare Nrc, the number of outliers detected within m instances, for a fixed value of k. We further compare the number of outliers detected for a fixed number of instances but with varying k. In Table 4 we have

compared our RDOS results with those obtained using LOF [Breunig et al., 2000], COF [Tang et al., 2002], INFLO [Jin et al., 2006], RBDA [Huang et al., 2013] and ODMRD [Huang et al., 2013, 2011] on the Iris dataset. We used the optimal k for our method, whereas for the other methods the value of k is 5. The value of our k_optimal is 29. From Table 4 it is evident that RBDA [Huang et al., 2013] and ODMRD [Huang et al., 2013, 2011] detect only 2 and 5 outliers, respectively, within the first 10 instances, and these are not the top 5 instances of their scores, as is evident from their rank-powers. The rank-power of our method is 1, which signifies that the 5 outliers of the Iris dataset are detected within the top 5 instances of the RDOS score. We also compare our results with those of the other methods using k = 5, 7 and 10 in Table 5. The value of our k is fixed at 29, whereas for the other methods k has been varied to obtain the optimal performance. All results are compared within the top 5 instances. Table 5 establishes the effectiveness of k_optimal over varying k.

The Ionosphere dataset contains 351 instances with 34 attributes; all attributes are normalized to the range 0 to 1. There are two classes, labeled good and bad, with 225 and 126 instances respectively, and no duplicate instances. To form the rare class, 116 instances from the bad class are randomly removed. The final dataset thus has only 235 instances, with 225 good and 10 bad instances [Huang et al., 2013, 2011]. We compare our RDOS results with those of LOF [Breunig et al., 2000], COF [Tang et al., 2002], INFLO [Jin et al., 2006], RBDA [Huang et al., 2013] and ODMRD [Huang et al., 2013, 2011] in Table 6. For the other methods k = 15, whereas in our proposed method k is chosen dynamically. With our k_optimal = 22, 10 outliers are detected within the top 15 instances, making the rank-power RP value (0.965) much higher than those of LOF, COF, INFLO, RBDA and ODMRD. In Table 7 the optimal value of k is fixed at 22 for the Ionosphere data.

In the case of the Wisconsin data, 226 malignant instances among the 449 instances have been removed randomly. In our experiments, the final dataset consists of 213 benign and 10 malignant instances. With k = 11 for LOF [Breunig et al., 2000], COF [Tang et al., 2002], INFLO [Jin et al., 2006], RBDA [Huang et al., 2013] and ODMRD [Huang et al., 2013, 2011], our method is compared with dynamic k = 58 for 40 instances. The comparison results are shown in Table 8. The rank-power RP of our method (0.982) is once again found to be very high; this signifies that all 10 outliers have been detected within the top 15 instances of the RDOS score. Table 9 shows the effectiveness of the optimal value of k for this dataset. In Fig. 3 the Rank Power on the Iris, Ionosphere and Wisconsin data is plotted against the top m instances for the different competing methods. The figure clearly portrays the superior performance of the proposed method.

We have also shown the effectiveness of our proposed method on the high-dimensional datasets Yeast and Cardiotocography. In the case of the Yeast dataset, out of 1484 instances the 5 outliers are detected within the top 5 instances, and in the case of the Cardiotocography dataset, out of 2069 instances the 60 outliers are detected within the top 70 instances. The rank-power RP of our method is 1 for the Yeast dataset and 0.996 for the Cardiotocography dataset, significantly higher than the RP of LOF on these high-dimensional datasets as well. In Fig. 4 the Rank Power on the Yeast and Cardiotocography data is plotted against the top m instances for LOF and RDOS. The figure clearly shows the effectiveness of the proposed method for high-dimensional data.

Fig. 2. 2-D plot of the Synthetic 1, Synthetic 2 and IRIS datasets along with their outliers (red points). (For dimension ≥ 4, pairwise plots of the first 4 dimensions are shown.)

Table 2. Performance of the Synthetic 1 dataset [Huang et al., 2013]

k  m  | LOF            | COF            | INFLO          | RBDA           | RDOS (koptimal = 3)
      | Nrc Re   RP    | Nrc Re   RP    | Nrc Re   RP    | Nrc Re   RP    | Nrc Re   RP
4  5  | 5   0.83 1     | 5   0.83 1     | 5   0.83 1     | 5   0.83 1     | 5   0.83 1
4  10 | 6   1    0.955 | 6   1    1     | 6   1    0.955 | 6   1    1     | 6   1    1
5  5  | 5   0.83 1     | 5   0.83 1     | 5   0.83 1     | 5   0.83 1     | 5   0.83 1
5  10 | 6   1    0.955 | 6   1    0.955 | 6   1    0.955 | 6   1    1     | 6   1    1
6  5  | 5   0.83 1     | 5   0.83 1     | 5   0.83 1     | 5   0.83 1     | 5   0.83 1
6  10 | 6   1    0.913 | 6   1    0.955 | 6   1    0.955 | 6   1    1     | 6   1    1
7  5  | 5   0.83 1     | 5   0.83 1     | 5   0.83 1     | 5   0.83 1     | 5   0.83 1
7  10 | 6   1    0.913 | 6   1    0.913 | 6   1    0.955 | 6   1    1     | 6   1    1

Table 3. Performance of the Synthetic 2 dataset [Huang et al., 2013]

k   m  | LOF             | COF             | INFLO           | RBDA           | RDOS (koptimal = 7)
       | Nrc Re   RP     | Nrc Re   RP     | Nrc Re   RP     | Nrc Re   RP    | Nrc Re   RP
25  5  | 4   0.67 1      | 3   0.5  0.857  | 3   0.5  1      | 5   0.83 1     | 5   0.83 1
25  10 | 5   0.83 0.882  | 3   0.5  0.857  | 4   0.67 0.714  | 6   1    1     | 6   1    1
35  5  | 1   0.17 1      | 0   0    0      | 1   0.17 1      | 5   0.83 1     | 5   0.83 1
35  10 | 3   0.5  0.375  | 2   0.33 0.158  | 3   0.5  0.375  | 5   0.83 1     | 6   1    1
50  5  | 1   0.17 0.5    | 0   0    0      | 1   0.17 0.5    | 5   0.83 1     | 5   0.83 1
50  10 | 2   0.33 0.25   | 1   0.17 0.1    | 2   0.33 0.25   | 5   0.83 1     | 6   1    1

Table 4. Performance comparison on the Iris dataset with top 20 instances with fixed k

m  | LOF (k=5)       | COF (k=5)   | INFLO (k=5)     | RBDA (k=5)      | ODMRD (k=5)    | RDOS (koptimal = 29)
   | Nrc Re  RP      | Nrc Re RP   | Nrc Re  RP      | Nrc Re  RP      | Nrc Re RP      | Nrc Re RP
5  | 1   0.2 0.2     | 0   0  0    | 1   0.2 0.2     | 1   0.2 0.2     | 3   0.6 0.75   | 5   1  1
10 | 4   0.8 0.313   | 0   0  0    | 2   0.4 0.214   | 4   0.8 0.37    | 5   1   0.714  | 5   1  1
15 | 5   1   0.341   | 0   0  0    | 2   0.4 0.214   | 5   1   0.385   | 5   1   0.714  | 5   1  1
20 | 5   1   0.341   | 0   0  0    | 2   0.4 0.214   | 5   1   0.385   | 5   1   0.714  | 5   1  1

Table 5. Performance comparison on the Iris dataset with fixed m=5

k  | LOF           | COF        | INFLO           | RBDA           | ODMRD
   | Nrc Re  RP    | Nrc Re RP  | Nrc Re  RP      | Nrc Re  RP     | Nrc Re  RP
5  | 1   0.2 0.2   | 0   0  0   | 1   0.2 0.2     | 1   0.2 0.2    | 3   0.6 0.75
7  | 5   1   1     | 0   0  0   | 3   0.6 0.667   | 3   0.6 0.75   | 4   1   0.833
10 | 5   1   1     | 0   0  0   | 5   1   1       | 5   1   1      | 4   0.8 1

RDOS (koptimal = 29): Nrc 5, Re 1, RP 1

Table 6. Performance comparison on the Ionosphere dataset with top 85 instances with fixed k

m  | LOF (k=15)      | COF (k=15)      | INFLO (k=15)    | RBDA (k=15)     | ODMRD (k=15)    | RDOS (koptimal = 22)
   | Nrc Re  RP      | Nrc Re  RP      | Nrc Re  RP      | Nrc Re  RP      | Nrc Re  RP      | Nrc Re RP
5  | 5   0.5 1       | 5   0.5 1       | 5   0.5 1       | 5   0.5 1       | 5   0.5 1       | 5   0.5 1
15 | 6   0.6 1       | 6   0.6 0.913   | 6   0.6 1       | 8   0.8 0.818   | 8   0.8 0.837   | 10  1   0.965
30 | 7   0.7 0.757   | 8   0.8 0.522   | 7   0.7 0.7     | 9   0.9 0.714   | 9   0.9 0.738   | 10  1   0.965
60 | 9   0.9 0.354   | 9   0.9 0.372   | 9   0.9 0.352   | 9   0.9 0.714   | 9   0.9 0.738   | 10  1   0.965
85 | 9   0.9 0.354   | 9   0.9 0.372   | 9   0.9 0.352   | 10  1   0.377   | 10  1   0.407   | 10  1   0.965

Table 7. Performance comparison on the Ionosphere dataset with fixed m=85

k  | LOF             | COF             | INFLO           | RBDA            | ODMRD
   | Nrc Re  RP      | Nrc Re  RP      | Nrc Re  RP      | Nrc Re  RP      | Nrc Re RP
11 | 9   0.9 0.294   | 9   0.9 0.29    | 9   0.9 0.3     | 10  1   0.372   | 10  1  0.393
15 | 9   0.9 0.354   | 9   0.9 0.372   | 9   0.9 0.352   | 10  1   0.377   | 10  1  0.407
20 | 9   0.9 0.417   | 9   0.9 0.413   | 9   0.9 0.405   | 10  1   0.387   | 10  1  0.414
23 | 9   0.9 0.441   | 9   0.9 0.484   | 9   0.9 0.417   | 10  1   0.393   | 10  1  0.426

RDOS (koptimal = 22): Nrc 10, Re 1, RP 0.965

Table 8. Performance comparison on the Wisconsin dataset with top 40 instances with fixed k

m  | LOF (k=11)      | COF (k=11)     | INFLO (k=11)    | RBDA (k=11)     | ODMRD (k=11)   | RDOS (koptimal = 58)
   | Nrc Re  RP      | Nrc Re RP      | Nrc Re  RP      | Nrc Re  RP      | Nrc Re  RP     | Nrc Re RP
15 | 5   0.5 0.469   | 6   0.6 0.656  | 5   0.5 0.417   | 7   0.7 0.519   | 8   0.8 0.735  | 10  1  0.982
40 | 9   0.9 0.372   | 10  1   0.444  | 9   0.9 0.349   | 9   0.9 0.517   | 10  1   0.573  | 10  1  0.982

Table 9. Performance comparison on the Wisconsin dataset with fixed m=40

k  | LOF             | COF            | INFLO           | RBDA            | ODMRD
   | Nrc Re  RP      | Nrc Re RP      | Nrc Re  RP      | Nrc Re  RP      | Nrc Re  RP
7  |                 |                |                 | 10  1   0.64    | 10  1   0.618
11 | 9   0.9 0.372   | 10  1  0.444   | 9   0.9 0.349   | 9   0.9 0.517   | 8   0.8 0.8
15 | 9   0.9 0.395   | 10  1  0.474   | 9   0.9 0.381   | 9   0.9 0.495   |
22 | 9   0.9 0.446   | 10  1  0.451   | 9   0.9 0.417   | 9   0.9 0.517   | 10  1   0.573

RDOS (koptimal = 58): Nrc 10, Re 1, RP 0.982

Table 10. Performance comparison of the proposed method with LOF for high-dimensional datasets with fixed k

Dataset           | m   | LOF               | RDOS
                  |     | Nrc Re    RP      | Nrc Re    RP
Yeast             | 20  | 3   0.6   0.111   | 5   1     1
Yeast             | 25  | 3   0.6   0.111   | 5   1     1
Yeast             | 30  | 4   0.8   0.125   | 5   1     1
Yeast             | 35  | 5   1     0.13    | 5   1     1
Cardiotocography  | 60  | 37  0.617 0.695   | 59  0.983 1
Cardiotocography  | 120 | 48  0.8   0.577   | 60  1     0.996
Cardiotocography  | 180 | 53  0.883 0.507   | 60  1     0.996
Cardiotocography  | 240 | 56  0.933 0.460   | 60  1     0.996
Cardiotocography  | 300 | 58  0.967 0.427   | 60  1     0.996
Cardiotocography  | 360 | 60  1     0.391   | 60  1     0.996

Fig. 3. The variation of the average reverse density of p with respect to the number of nearest neighbors for the Synthetic 1, Synthetic 2 and IRIS datasets. The red lines indicate the first most significant peak for each dataset.

Fig. 4. Rank-Power plot against no. of instances: (a) Iris data, (b) Ionosphere data, (c) Wisconsin data.

5. Conclusion


We have proposed a novel method for outlier detection using neighborhood rank difference. Experimental results clearly indicate the effectiveness of the proposed method on both synthetic and real datasets with varying dimensions and numbers of instances. Unlike some existing approaches, we determine a fixed influence space for a given dataset and compute the outlier score in an unsupervised manner without varying the number of neighboring points. In future work, we plan to make the determination of the influence space more accurate to further improve the performance of our outlier detection algorithm.

References

Denning, D.E., 1987. An intrusion detection model. IEEE Transactions on Software Engineering 13, 222–232.
Du, B., Zhang, L., 2014. A discriminative metric learning based anomaly detection method. IEEE Transactions on Geoscience and Remote Sensing 52, 6844–6857.
Huang, H., Mehrotra, K., Mohan, C.K., 2011. Outlier Detection Using Modified-ranks and Other Variants. Technical Report. Electrical Engineering and Computer Science Technical Reports.
Huang, H., Mehrotra, K., Mohan, C.K., 2013. Rank-based outlier detection. Journal of Statistical Computation and Simulation 83, 518–531.
Jin, W., Tung, A.K.H., Han, J., Wang, W., 2006. Ranking outliers using symmetric neighborhood relationship, in: Proceedings of the 10th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining (PAKDD), LNCS 3918.
Laurikkala, J., Juhola, M., Kentala, E., 2000. Informal identification of outliers in medical data, in: Fifth International Workshop on Intelligent Data Analysis in Medicine and Pharmacology. pp. 20–24.
Lijun, C., 2010. A data stream outlier detection algorithm based on reverse k nearest neighbors, in: International Symposium on Computational Intelligence and Design (ISCID), IEEE, vol. 2.
Lin, J., Keogh, E., Fu, A., Herle, H.V., 2005. Approximations to magic: Finding unusual medical time series, in: Proceedings of the 18th IEEE Symposium on Computer-Based Medical Systems, IEEE Computer Society. pp. 329–334.
Markou, M., Singh, S., 2006. A neural network-based novelty detector for image sequence analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence 28, 1664–1677.
Matteoli, S., Diani, M., Corsini, G., 2010. A tutorial overview of anomaly detection in hyperspectral images. IEEE Aerospace and Electronic Systems Magazine, 5–28.
Morris, J.S., Coombes, K.R., Koomen, J., Baggerly, K.A., Kobayashi, R., 2005. Feature extraction and quantification for mass spectrometry in biomedical applications using the mean spectrum. Bioinformatics 21, 153–168.
Orair, G.H., Teixeira, C.H.C., Meira Jr., W., Wang, Y., Parthasarathy, S., 2010. Distance-based outlier detection: Consolidation and renewed bearing. Proc. VLDB Endow. 3, 1469–1480.
Phoha, V.V., 2002. The Springer Internet Security Dictionary. Springer-Verlag.
Purarjomandlangrudia, A., Ghapanchib, H., Esmalifalakc, M., 2014. A data mining approach for fault diagnosis: An application of anomaly detection algorithm. Measurement 55, 343–352.
Rouss, P.J., Croux, C., 1993. Alternatives to the median absolute deviation. Journal of the American Statistical Association 88, 1273–1283.
Silverman, B.W., 1986. Density Estimation for Statistics and Data Analysis. Chapman and Hall, London.
Singh, K., Upadhyaya, S., 2012. Outlier detection: Applications and techniques. IJCSI International Journal of Computer Science Issues 9, 307–323.
Suzuki, E., Watanabe, T., Yokoi, H., Takabayashi, K., 2003. Detecting interesting exceptions from medical test data with visual summarization, in: Proceedings of the 3rd IEEE International Conference on Data Mining. pp. 315–322.
Tang, J., Chen, Z., Fu, A.W., Cheung, D.W., 2002. Enhancing effectiveness of outlier detections for low density patterns, in: Proceedings of the 6th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining (PAKDD), Taipei, Taiwan. pp. 45–84.
Yasui, Y., Pepe, M., Thompson, M.L., Adam, B.L., Wright, G.L., Qu, Y., Potter, J.D., Winget, M., Thornquist, M., Feng, Z., 2003. A data-analytic strategy for protein biomarker discovery: profiling of high-dimensional proteomic data for cancer detection. Biostatistics 4, 153–168.
Zhang, K., Hutter, M., Jin, H., 2009. A new local distance-based outlier detection approach for scattered real-world data, in: Proceedings of the 13th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining, Springer-Verlag, Berlin, Heidelberg. pp. 813–822.
Zhao, J., Lu, C.T., Kou, Y., 2003. Detecting region outliers in meteorological data, in: Proceedings of the 11th ACM International Symposium on Advances in Geographic Information Systems, ACM, New York, NY, USA. pp. 49–55.
Zhao, R., Du, B., Zhang, L., 2014. A robust nonlinear hyperspectral anomaly detection approach. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 7, 1227–1234.

PT

ED

M

(a) Yeast data

(b) Cardiotocography data

CE

Fig. 5. Rank-Power plot against no. of instances

References

AC

Aggarwal, C.C., 2013. Outlier Analysis. Springer. Aggrwal, S., Kaur, P., 2013. Survey of partition based clustering algorithm used for outlier detection. International Journal for Advance Research in Engineering and Technology 1. Arribas-Gil, A., Romo, J., 2014. Shape outlier detection and visualization for functional data: the outliergram. Biostatistics 11, 1–17. Bolton, R.J., Hand, D.J., 2001. Unsupervised profiling methods for fraud detection, in: Proc. Credit Scoring and Credit Control VII, pp. 5–7. Bowman, A.W., Azzalini, A., 1997. Applied Smoothing Techniques for Data Analysis. Oxford University Press, New York. Breunig, M.M., Kriegel, H.P., Ng, R.T., Sander, J., 2000. Lof: identifying density-based local outliers, in: Proc. ACM SIGMOD, Int. Conf. on Management of Data (SIGMOD), Dallas, TX. pp. 93–104. Chen, Y., Miao, D., Zhang, H., 2010. Neighborhood outlier detection. Expert Systems with Applications 37, 8745–8749. Coombes, K.R., Morris, J.S., Tsavachidis, S., 2005. Improved peak detection and quantification of mass spectrometry data acquired from surfaceenhanced laser desorption and ionisation by denoising spectra with the undecimated discrete wavelet transform. Proteomics 5, 4107–4117.