Expert Systems with Applications 36 (2009) 4680–4687
Some issues about outlier detection in rough set theory

Feng Jiang a,*, Yuefei Sui b, Cungen Cao b

a College of Information Science and Technology, Qingdao University of Science and Technology, Qingdao 266061, PR China
b Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100080, PR China

doi:10.1016/j.eswa.2008.06.019

Keywords: Outlier detection; Rough sets; Distance metric; KDD
Abstract

"One person's noise is another person's signal" (Knorr, E., Ng, R. (1998). Algorithms for mining distance-based outliers in large datasets. In Proceedings of the 24th VLDB conference, New York (pp. 392-403)). In recent years, much attention has been given to the problem of outlier detection, whose aim is to detect outliers: objects which behave in an unexpected way or have abnormal properties. Detecting such outliers is important for many applications, such as criminal activity in electronic commerce, computer intrusion attacks, terrorist threats and agricultural pest infestations, and outlier detection is critically important in the information-based society. In this paper, we discuss some issues concerning outlier detection in rough set theory, which emerged about 20 years ago and is nowadays a rapidly developing branch of artificial intelligence and soft computing. First, we propose a novel definition of outliers in information systems of rough set theory: sequence-based outliers. An algorithm to find such outliers in rough set theory is also given, and the effectiveness of the sequence-based method for outlier detection is demonstrated on two publicly available databases. Second, we introduce traditional distance-based outlier detection to rough set theory and discuss the definitions of distance metrics for distance-based outlier detection in rough set theory.

© 2008 Elsevier Ltd. All rights reserved.
1. Introduction

Knowledge discovery in databases (KDD), or data mining, is an important issue in the development of data- and knowledge-based systems. Usually, knowledge discovery tasks can be classified into four general categories: (a) dependency detection, (b) class identification, (c) class description, and (d) outlier/exception detection (Knorr & Ng, 1998). In contrast to most KDD tasks, such as clustering and classification, outlier detection aims to find small groups of data objects that are exceptional when compared with the remaining large amount of data, in terms of certain sets of properties. For many applications, such as fraud detection in E-commerce, it is more interesting to find the rare events than the common ones, from a knowledge discovery standpoint. Studying the extraordinary behavior of outliers can help us uncover the valuable information hidden behind them. Recently, researchers have begun focusing on outlier detection and have attempted to apply outlier-finding algorithms to tasks such as fraud detection (Bolton & Hand, 2002), identification of computer network intrusions (Eskin, Arnold, Prerau, Portnoy, & Stolfo, 2002; Lane & Brodley, 1999), data cleaning (Rulequest Research), detection of employers with
poor injury histories (Knorr, Ng, & Tucakov, 2000), and peculiarity-oriented mining (Zhong, Yao, Ohshima, & Ohsuga, 2001). Outliers exist extensively in the real world and are generated from different sources, e.g. a heavy-tailed distribution or errors in data entry. While there is no single, generally accepted, formal definition of an outlier, Hawkins' definition captures the spirit: "an outlier is an observation that deviates so much from other observations as to arouse suspicions that it was generated by a different mechanism" (Hawkins, 1980; Knorr & Ng, 1998). With increasing awareness of outlier detection in the literature, more concrete meanings of outliers have been defined for solving problems in specific domains. Nonetheless, most of these definitions follow the spirit of Hawkins' definition (Chiu & Fu, 2003). Roughly speaking, current approaches to outlier detection can be classified into the following five categories (Kovács, Vass, & Vidács, 2004):

(1) The distribution-based approach is the classical method in statistics. It is based on some standard distribution model (Normal, Poisson, etc.), and those objects which deviate from the model are recognized as outliers (Rousseeuw & Leroy, 1987). Its greatest disadvantage is that the distribution of the measurement data is unknown in practice; often a large number of tests are required to decide which distribution model, if any, the measurement data follow.
(2) The depth-based approach is based on computational geometry: it computes different layers of k-d convex hulls and flags objects in the outer layer as outliers (Johnson, Kwok, & Ng, 1998). However, it is well known that the algorithms employed suffer from the curse of dimensionality and cannot cope with large k.

(3) The clustering approach classifies the input data and detects outliers as by-products (Jain, Murty, & Flynn, 1999). However, since its main objective is clustering, it is not optimized for outlier detection.

(4) The distance-based approach was originally proposed by Knorr and Ng (Knorr & Ng, 1998; Knorr et al., 2000). An object o in a data set T is a distance-based outlier if at least a fraction p of the objects in T are further than distance D from o. This outlier definition is based on a single, global criterion determined by the parameters p and D. Problems may occur if the characteristics of the data are very different in different regions of the data set.

(5) The density-based approach was originally proposed by Breunig, Kriegel, Ng, and Sander (2000). A local outlier factor (LOF) is assigned to each sample based on its local neighborhood density, and samples with a high LOF value are identified as outliers. The disadvantage of this solution is that it is very sensitive to the parameters defining the neighborhood.

Rough set theory, introduced by Zdzislaw Pawlak in the early 1980s (Pawlak, 1982, 1991; Pawlak, Grzymala-Busse, Slowinski, & Ziarko, 1995), is a theory for the study of intelligent systems characterized by insufficient and incomplete information. It is motivated by practical needs in classification and concept formation. The rough set philosophy is based on the assumption that with every object of the universe there is associated a certain amount of information (data, knowledge), expressed by means of some attributes; objects having the same description are indiscernible. In recent years, there has been fast-growing interest in rough set theory.
Successful applications of the rough set model to a variety of problems have demonstrated its importance and versatility. To the best of our knowledge, there is no existing work on outlier detection in the rough set community. The aim of this work is to combine rough set theory and outlier detection, showing how outlier detection can be done in rough set theory. We suggest two different ways to achieve this aim. First, we propose sequence-based outlier detection in information systems of rough set theory. Second, we introduce traditional distance-based outlier detection to rough set theory.

This paper is organized as follows. In the next section, we introduce some preliminaries of rough set theory and outlier detection. In Section 3, we give some definitions concerning sequence-based outliers in information systems of rough set theory. The basic idea is as follows. Given an information system IS = (U, A, V, f), where U is a non-empty finite set of objects, A is a set of attributes, V is the union of attribute domains, and f is a function such that for any x ∈ U and a ∈ A, f(x, a) ∈ V_a. Since each attribute subset B ⊆ A determines an indiscernibility (equivalence) relation IND(B) on U, we can obtain the corresponding equivalence class of relation IND(B) for every object x ∈ U. If we decrease the attribute subset B gradually, then the granularity of the partition U/IND(B) becomes coarser, and for every object x ∈ U the corresponding equivalence class of x becomes bigger. So when there is an object in U whose equivalence class never varies, or only increases a little, in comparison with those of the other objects in U, we may consider this object a sequence-based outlier in U with respect to IS. An algorithm to find sequence-based outliers is also given. In Section 4, we apply traditional distance-based outlier detection to rough set theory. Since classical rough set theory is better suited to dealing with nominal attributes, we propose revised definitions of two traditional distance metrics for distance-based outlier detection in rough set theory, the overlap metric and the value difference metric in rough set theory, both of which are especially designed to deal with nominal attributes. Experimental results are given in Section 5, and Section 6 discusses the advantages of our sequence-based approach by comparing it with other approaches to outlier detection. Section 7 concludes the paper.
2. Preliminaries

In the rough set data model, information is stored in a table, where each row (tuple) represents facts about an object. All we know about an object from the table is the corresponding tuple. In rough set terminology, a data table is also called an information system. When the attributes are classified into decision attributes and condition attributes, a data table is also called a decision system. More formally, an information system is a quadruple IS = (U, A, V, f), where:

1. U is a non-empty finite set of objects.
2. A is a non-empty finite set of attributes.
3. V is the union of attribute domains, i.e., V = ∪_{a∈A} V_a, where V_a denotes the domain of attribute a.
4. f : U × A → V is an information function such that for any a ∈ A and x ∈ U, f(x, a) ∈ V_a.

We can split the set A of attributes into two subsets: C ⊂ A and D = A − C, the condition attributes and the decision (or class) attributes, respectively. The condition attributes represent measured features of the objects, while the decision attributes are the a posteriori outcome of classification. Each subset B ⊆ A of attributes determines a binary relation IND(B), called an indiscernibility relation, which is defined as follows:
IND(B) = {(x, y) ∈ U × U : ∀a ∈ B, f(x, a) = f(y, a)}    (1)
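Given this definition, the partition U/IND(B) can be computed by simply grouping objects on their B-value vectors. The following Python sketch illustrates this (the toy table and attribute names are our own, not from the paper):

```python
from collections import defaultdict

def partition(table, B):
    """Compute U/IND(B): group objects whose attribute values agree on
    every attribute in B into equivalence classes."""
    classes = defaultdict(list)
    for obj, row in table.items():
        classes[tuple(row[a] for a in sorted(B))].append(obj)
    return list(classes.values())

# Toy information system with two attributes.
table = {
    "x1": {"color": "red",  "size": "small"},
    "x2": {"color": "red",  "size": "small"},
    "x3": {"color": "red",  "size": "large"},
    "x4": {"color": "blue", "size": "large"},
}
print(partition(table, {"color"}))          # [['x1', 'x2', 'x3'], ['x4']]
print(partition(table, {"color", "size"}))  # three classes: a finer partition
```

Note that shrinking B can only merge classes, never split them, which is exactly the coarsening behavior the sequence-based method exploits.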
It is obvious that IND(B) is an equivalence relation on U and that IND(B) = ∩_{a∈B} IND({a}). Given any B ⊆ A, the relation IND(B) induces a partition of U, denoted by U/IND(B), where an element of U/IND(B) is called an equivalence class. For every element x of U, let [x]_B denote the equivalence class of relation IND(B) that contains element x, called the equivalence class of x under relation IND(B).

The distance-based approach is now widely used for outlier detection. An object o in a data set S is a distance-based (DB) outlier with parameters p and d, denoted by DB(p, d), if at least a fraction p of the objects in S lie at a distance greater than d from o. The advantage of the distance-based approach is that no explicit distribution is needed to determine unusualness, and it can be applied to any feature or attribute space (the vector space spanned by some features is called a feature space) for which we can define a distance metric. It should be noted that metrics are measures possessing metric properties, which express the degree or strength of a quality factor; many measures that we often use are in fact not metrics. A distance metric is a distance function on a set of points, mapping pairs of points into the non-negative real numbers. In general, any distance function D which obeys the following conditions can be used in similarity measures (Li, Chen, Li, Ma, & Vitányi, 2003):

(1) D(x, y) ≥ 0: distances cannot be negative.
(2) D(x, y) = 0 if and only if x = y.
(3) D(x, y) = D(y, x): distance is symmetric.
(4) D(x, y) + D(y, z) ≥ D(x, z): triangle inequality.
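On a finite point set, the four conditions can be checked exhaustively. A small illustrative sketch (the point set and the two distance functions are invented for illustration):

```python
from itertools import product

def is_metric(points, d):
    """Exhaustively check conditions (1)-(4) on a finite point set."""
    for x, y in product(points, repeat=2):
        if d(x, y) < 0:                      # (1) non-negativity
            return False
        if (d(x, y) == 0) != (x == y):       # (2) identity of indiscernibles
            return False
        if d(x, y) != d(y, x):               # (3) symmetry
            return False
    return all(d(x, y) + d(y, z) >= d(x, z)  # (4) triangle inequality
               for x, y, z in product(points, repeat=3))

points = ["a", "b", "c"]
overlap01 = lambda x, y: 0 if x == y else 1    # the discrete 0/1 distance
squared = lambda x, y: (ord(x) - ord(y)) ** 2  # squared code difference

print(is_metric(points, overlap01))  # True: all four axioms hold
print(is_metric(points, squared))    # False: triangle inequality fails
```

The second function illustrates the remark that many measures in common use are not metrics: squared differences satisfy (1)-(3) but violate the triangle inequality.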
Usually, an attribute can be linear or nominal, and a linear attribute can be continuous or discrete. A continuous (or continuously valued) attribute uses real values, such as the mass of a planet or the velocity of an object. A linear discrete (or integer) attribute can take only a discrete set of linear values, such as the number of children. A nominal (or symbolic) attribute is a discrete attribute whose values are not necessarily linearly ordered. For example, a variable representing color might have values such as red, green, blue, brown, black and white, which could be represented by the integers 1-6, respectively (Wilson & Martinez, 1997). Many distance metrics have been proposed, and most of them are defined only for linear attributes, including the Euclidean, Minkowski and Mahalanobis distance metrics, the context-similarity measure, hyperrectangle distance functions and others. When the attributes are nominal, the definition of a similarity measure becomes less trivial. For nominal attributes, a simple but commonly used measure is the overlap metric (Stanfill & Waltz, 1986). Under this metric, for two possible values v_i and v_j, the distance is defined as zero when v_i and v_j are identical, and one otherwise. For binary attributes, this overlap metric reduces to the so-called Hamming distance. Moreover, a very popular real-valued metric for nominal attributes is the value difference metric (Cheng, Li, Kwok, & Li, 2004; Stanfill & Waltz, 1986).
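As a concrete illustration, the overlap metric and its reduction to the Hamming distance for binary attributes can be sketched in a few lines of Python (the example vectors are our own):

```python
def overlap(x, y):
    """Overlap metric for nominal vectors: per-attribute distance is 0 for
    identical values and 1 otherwise; the total is the number of mismatches."""
    assert len(x) == len(y)
    return sum(xv != yv for xv, yv in zip(x, y))

# Nominal attributes (color, taste): one mismatch.
print(overlap(["red", "sour"], ["red", "sweet"]))  # 1
# Binary attributes: reduces to the Hamming distance.
print(overlap([0, 1, 1, 0], [1, 1, 0, 0]))         # 2
```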
3. Sequence-based outlier detection in rough set theory

By now, rough set theory has found many interesting applications. The rough set approach seems to be of fundamental importance to AI and the cognitive sciences, especially in the areas of machine learning and KDD. At present, research concerning KDD in the rough set community mainly focuses on the first three categories of knowledge discovery tasks; little attention has been paid to the fourth category, outlier/exception detection, in rough set theory. Therefore, in this section we discuss outlier definition and detection in information systems of rough set theory. In the following subsection, we first give some definitions concerning sequence-based outliers; an algorithm to find sequence-based outliers is then presented.

3.1. Definitions

Our definition of sequence-based outliers in an information system follows the spirit of Hawkins' definition of outliers. That is, given an information system IS = (U, A, V, f), for any x ∈ U, if x has some characteristics that differ greatly from those of the other objects in U, in terms of the attributes in A, we may call x an outlier in U with respect to IS.

Our basic idea is as follows. Given an information system IS = (U, A, V, f), since each attribute subset B ⊆ A determines an indiscernibility (equivalence) relation IND(B) on U, we can obtain the corresponding equivalence class for every object x ∈ U under relation IND(B), that is, the equivalence class that contains x. If we decrease B gradually, then the granularity of the partition U/IND(B) becomes coarser, and for every object x ∈ U the corresponding equivalence class of x becomes bigger. So when there is an object in U whose equivalence class never varies, or only increases a little, in comparison with those of the other objects in U, we consider this object an outlier in U with respect to IS.
In a word, an outlier in U is an element whose corresponding equivalence class (the set of all elements indiscernible from it) does not vary, or varies only a little, when we regularly coarsen the granularity of the partition U/IND(B) of U. By reasoning about the changes of attribute subsets and the corresponding equivalence classes in an information system,
we can find all the outliers. In fact, much of the above idea derives from the excellent work by A. Skowron et al., who discussed approximate reasoning methods based on information changes in a series of papers (Bazan et al., 2002; Skowron & Synak, 2004). They introduced basic concepts for approximate reasoning about information changes across information maps, and measured the degree of change using information granules. Any rule for reasoning about information changes specifies how changes of information granules from the rule premise influence changes of information granules from the rule conclusion. Changes in information granules can be measured, e.g., using expressions analogous to derivatives. To implement the above idea, we construct three kinds of sequences (Arning, Agrawal, & Raghavan, 1996). The first is the sequence of attributes.

Definition 3.1 (Sequence of attributes). Let IS = (U, A, V, f) be an information system, where A = {a_1, a_2, ..., a_m}. For every attribute a_i ∈ A, let U/IND({a_i}) = {E_i1, ..., E_ik} denote the partition induced by the equivalence relation IND({a_i}), and let Var(a_i) denote the variance of the set {|E_i1|, ..., |E_ik|}, that is,
Var(a_i) = [(|E_i1| − Ave)² + ⋯ + (|E_ik| − Ave)²] / k    (2)
where Ave = (|E_i1| + ⋯ + |E_ik|) / k, |X| denotes the cardinality of set X, and 1 ≤ i ≤ m. We construct a sequence S = ⟨a'_1, a'_2, ..., a'_m⟩ of the attributes in A (in this paper, a sequence is denoted by a tuple), where for every 1 ≤ i < m, Var(a'_i) ≥ Var(a'_{i+1}).

Next, by decreasing the attribute set A gradually, we can determine a descending sequence of attribute subsets.

Definition 3.2 (Descending sequence of attribute subsets). Let IS = (U, A, V, f) be an information system, where A = {a_1, a_2, ..., a_m}, and let S = ⟨a'_1, a'_2, ..., a'_m⟩ be the sequence of attributes defined above. Given a sequence AS = ⟨A_1, A_2, ..., A_m⟩ of attribute subsets, where A_1, A_2, ..., A_m ⊆ A, if A_1 = A, A_m = {a'_m} and A_{i+1} = A_i − {a'_i} for every 1 ≤ i < m, then we call AS a descending sequence of attribute subsets in IS.

By the above definition, in AS = ⟨A_1, A_2, ..., A_m⟩, for every 1 ≤ i < m, A_{i+1} is the attribute subset obtained from A_i by removing the element a'_i, where a'_i is the ith element in sequence S. Given any object x ∈ U, for the descending sequence AS of attribute subsets, we can construct a corresponding ascending sequence of equivalence classes of x.

Definition 3.3 (Ascending sequence of equivalence classes). Let IS = (U, A, V, f) be an information system, where A = {a_1, a_2, ..., a_m}, and let AS = ⟨A_1, A_2, ..., A_m⟩ be a descending sequence of attribute subsets in IS. For any object x ∈ U, let [x]_{A_i} be the equivalence class of x under relation IND(A_i) for every 1 ≤ i ≤ m. Then we have a sequence ES(x) = ⟨[x]_{A_1}, [x]_{A_2}, ..., [x]_{A_m}⟩ of equivalence classes, where [x]_{A_1}, [x]_{A_2}, ..., [x]_{A_m} ⊆ U. ES(x) is called the ascending sequence of equivalence classes of object x in IS.

Most current methods for outlier detection give a binary classification of objects (is or is not an outlier), e.g. distance-based outlier detection.
However, for many scenarios, it is more meaningful to assign to each object a degree of being an outlier. Given a degree of outlierness for every object, the objects can be ranked according to this degree, giving the data mining analyst a sequence in which to analyze the outliers. Therefore, Breunig et al. introduced a novel notion of local outlier in which the degree to which an object is outlying is dependent on the density of its local neighborhood, and each object can be assigned a Local Outlier Factor (LOF), which represents the likelihood of that object being an outlier (Breunig et al., 2000).
Similar to Breunig's method, we define a sequence outlier factor (SOF), which indicates the degree of outlierness of every object in an information system.

Definition 3.4 (Sequence outlier factor). Let IS = (U, A, V, f) be an information system, where A = {a_1, a_2, ..., a_m}. For any x ∈ U, let ES(x) = ⟨[x]_{A_1}, [x]_{A_2}, ..., [x]_{A_m}⟩ be the ascending sequence of equivalence classes of x in IS. The sequence outlier factor of x in IS is defined as follows:

SOF(x) = Σ_{j=2}^{m} (1 − (|[x]_{A_j}| − |[x]_{A_{j−1}}|) / |U|) · W(x)    (3)
where W : U → [0, 1) is a weight function such that for any x ∈ U,

W(x) = (Σ_{a∈A} (1 − |[x]_{{a}}| / |U|)) / |A|,

where [x]_{{a}} = {u ∈ U : f(u, a) = f(x, a)} denotes the equivalence class of relation IND({a}) that contains element x, and |M| denotes the cardinality of set M.

The weight function W in the above definition expresses the idea that outlier detection always concerns the minority of objects in the data set, and that the minority of objects are more likely to be outliers than the majority. Since, by the above definition, the larger the weight, the larger the sequence outlier factor, the minority of objects should receive larger weights than the majority. Therefore, for every a ∈ A, if the objects in U that are indiscernible from x under relation IND({a}) are always few, that is, if the percentage of objects in U that are indiscernible from x is small, then we consider x to belong to the minority of objects in U and assign a high weight to x.

Furthermore, the above definition accords with our basic idea about sequence-based outliers: when we regularly coarsen the granularity of the partitions on the domain U of an information system IS, if the corresponding equivalence class of an object x ∈ U does not vary, or varies only a little, then the likelihood of x being an outlier, i.e. the sequence outlier factor of x, is high.

Definition 3.5 (Sequence-based outliers). Let IS = (U, A, V, f) be an information system, where A = {a_1, a_2, ..., a_m}, and let μ be a given threshold value. For any x ∈ U, if SOF(x) > μ, then x is called a sequence-based outlier in U with respect to IS, where SOF(x) is the sequence outlier factor of x in IS.

3.2. Algorithm

Algorithm 3.1.
Input: an information system IS = (U, A, V, f), where |U| = n and |A| = m; a threshold value μ.
Output: a set O of sequence-based outliers.

(1) For every a ∈ A {
(2)   Sort all objects from U according to a given order (e.g. the lexicographical order) on the domain V_a of attribute a (Nguyen, 1996);
(3)   Determine the partition U/IND({a}) = {E_1, ..., E_k};
(4)   Calculate Var(a), the variance of the set {|E_1|, ..., |E_k|}
(5) }
(6) Determine the sequence S = ⟨a'_1, a'_2, ..., a'_m⟩ of attributes in A, where for every 1 ≤ i < m, Var(a'_i) ≥ Var(a'_{i+1});
(7) Construct a descending sequence AS = ⟨A_1, ..., A_m⟩ of attribute subsets, in terms of sequence S;
(8) For 1 ≤ i ≤ m {
(9)   Sort all objects from U according to a given order (e.g. the lexicographical order) on the domain V_{A_i} of attribute subset A_i;
(10)  Determine the partition U/IND(A_i)
(11) }
(12) For every x ∈ U {
(13)  Construct the ascending sequence ES(x) = ⟨[x]_{A_1}, ..., [x]_{A_m}⟩ of equivalence classes, where [x]_{A_i} ∈ U/IND(A_i), 1 ≤ i ≤ m;
(14)  Assign the weight W(x) to x;
(15)  Calculate SOF(x), the sequence outlier factor of x in IS;
(16)  If SOF(x) > μ then O = O ∪ {x}
(17) }
(18) Return O.
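The computation behind Algorithm 3.1 can be sketched directly from Definitions 3.1-3.4. The following Python sketch (the toy information system is invented for illustration, and the code favors clarity over the sorting-based implementation of the pseudocode) computes SOF(x) for every object:

```python
from statistics import pvariance

def eq_class_size(table, B, x):
    """|[x]_B|: number of objects that agree with x on every attribute in B."""
    key = tuple(table[x][a] for a in B)
    return sum(1 for u in table
               if tuple(table[u][a] for a in B) == key)

def sof(table):
    """Sequence outlier factor SOF(x) for every object (Definitions 3.1-3.4)."""
    U = list(table)
    A = sorted(next(iter(table.values())))
    n, m = len(U), len(A)

    # Definition 3.1: order attributes by non-increasing variance of the
    # sizes of the equivalence classes they induce.
    def var(a):
        sizes = {}
        for u in U:
            sizes[table[u][a]] = sizes.get(table[u][a], 0) + 1
        return pvariance(sizes.values())
    order = sorted(A, key=var, reverse=True)

    # Definition 3.2: descending subsets A_1 = A, A_{i+1} = A_i - {a'_i}.
    subsets = [order[i:] for i in range(m)]

    scores = {}
    for x in U:
        # Weight W(x): average of 1 - |[x]_{a}|/|U| over single attributes.
        w = sum(1 - eq_class_size(table, [a], x) / n for a in A) / m
        # Eq. (3): sum over the ascending sequence of equivalence class sizes.
        s = 0.0
        for j in range(1, m):
            growth = (eq_class_size(table, subsets[j], x)
                      - eq_class_size(table, subsets[j - 1], x)) / n
            s += (1 - growth) * w
        scores[x] = s
    return scores

# Toy information system: o5 disagrees with the majority on both attributes.
table = {
    "o1": {"p": 0, "q": 0}, "o2": {"p": 0, "q": 0}, "o3": {"p": 0, "q": 0},
    "o4": {"p": 0, "q": 1}, "o5": {"p": 1, "q": 1},
}
print(sof(table))  # o5 gets the largest SOF (0.56)
```

With a threshold of, say, μ = 0.5, only o5 would be reported as a sequence-based outlier in this toy system.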
In Algorithm 3.1, we use a method proposed by Nguyen (1996) to calculate the partition induced by an indiscernibility relation in an information system. In the worst case, the time complexity of Algorithm 3.1 is O(m² · n · log n), and its space complexity is O(m · n), where m and n are the cardinalities of A and U, respectively.

4. Distance-based outlier detection in rough set theory

Knorr and Ng (1998) and Knorr et al. (2000) proposed a non-parametric approach to outlier detection based on the distance of an object in a given data set to its nearest neighbors. In this approach, one looks at the local neighborhood of points for an object, typically defined by the k nearest objects (also known as neighbors). If the neighboring points are relatively close, then the object is considered normal; if the neighboring points are far away, then the object is considered unusual (Bay & Schwabacher, 2003). Following the ideas of distance-based outlier detection, objects in an information system are considered outliers if they do not have enough neighbors. Although rough sets can also be used for non-nominal attributes, they are better suited to nominal data sets. In order to calculate the distance between any two objects in an information system, we need a suitable distance metric for nominal attributes in rough set theory. Next, we give revised definitions of the traditional overlap metric and value difference metric: the overlap metric and the value difference metric in rough set theory, respectively.

4.1. Overlap metric in rough set theory

Definition 4.1. Given an information system IS = (U, A, V, f), let x, y ∈ U be any two objects between which we shall calculate the distance. The overlap metric in rough set theory is defined as
D(x, y) = |{a ∈ A : a(x) ≠ a(y)}|    (4)
where D : U × U → [0, +∞) is a function from U × U to the non-negative real numbers (Stanfill & Waltz, 1986).

It should be noticed that the above definition is closely related to the definition of discernibility matrices (Skowron & Rauszer, 1992). In fact, if we have calculated the discernibility matrix of a given information system, we can easily read off the distance between any two objects from the discernibility matrix. From the above definition, we can see that the overlap metric in rough set theory simply counts the number of attributes in A on which objects x and y have different attribute values. This is a reasonable choice if no information about the weights of the different attributes is available. But if we do have information about attribute weights, then we can add linguistic bias to weight the different attributes. Therefore, a number of variants of the overlap metric, each using a different weighting factor, have also been proposed (Gower & Legendre, 1986). However, the overlap metric and its variants assume that all attribute values are of equal distance from each other, and thus cannot represent value pairs with differing degrees of similarity. For example, for the attribute taste, it may be more desirable to have the value sour closer to sweet than to bitter. Hence, a real-valued distance metric is often preferred over a Boolean one (Cheng et al., 2004).

4.2. Value difference metric in rough set theory

The value difference metric (VDM) was introduced by Stanfill and Waltz (1986) to provide an appropriate distance function for nominal attributes. A simplified version (without the weighting schemes) of the VDM is defined as follows:
VDM(x, y) = Σ_{f∈F} d_f(x_f, y_f)    (5)
where F is the set of all features in the problem domain, and x and y are any two objects between which we shall calculate the distance. d_f(x_f, y_f) denotes the distance between the two values x_f and y_f of feature f, where x_f is the value of object x on feature f. For any feature f ∈ F, d_f(x_f, y_f) is defined as follows:
d_f(x_f, y_f) = Σ_{c∈OC} (P(c | x_f) − P(c | y_f))²    (6)
where OC is the set of all output classes in the problem domain, and P(c | v) is the conditional probability that the output class is c given that feature f has the value v:

P(c | v) = (number of objects with value v and belonging to class c) / (number of objects with value v).

Next we give the revised definition of the VDM in rough set theory. In the rough set domain, given a decision table DT = (U, C ∪ D, V, f), we use the set C of condition attributes in place of the set F of features in the above definition, and the set U/IND(D) of decision categories in place of the set OC of output classes.

Definition 4.2. Given a decision table DT = (U, C ∪ D, V, f), where C is the set of condition attributes and D is the set of decision attributes, let x, y ∈ U be any two objects between which we shall calculate the distance. The value difference metric in rough set theory, VDM_R : U × U → [0, +∞), is defined as
VDM_R(x, y) = Σ_{a∈C} d_a(x_a, y_a)    (7)

where d_a(x_a, y_a) denotes the distance between the two values x_a and y_a of condition attribute a, and x_a is the value of object x on attribute a. For any a ∈ C, define

d_a(x_a, y_a) = Σ_{E∈U/IND(D)} (|[x]_{{a}} ∩ E| / |[x]_{{a}}| − |[y]_{{a}} ∩ E| / |[y]_{{a}}|)²    (8)

where [x]_{{a}} is the equivalence class of IND({a}) that contains object x. For any E ∈ U/IND(D), all elements in E have the same decision attribute values, that is, E denotes a decision category. |[x]_{{a}}| is the number of objects in U that have the value x_a on attribute a, and [x]_{{a}} ∩ E is the set of all objects in U that have the value x_a on attribute a and belong to decision category E. Therefore |[x]_{{a}} ∩ E| / |[x]_{{a}}| is the conditional probability that the decision category is E given that attribute a has the value x_a; this corresponds to P(c | v) in the definition above.

Table 1
Decision table DT

U\A   a   b   c   d   e
u1    1   0   2   1   0
u2    0   0   1   2   1
u3    2   0   2   2   0
u4    0   0   2   1   2
u5    1   1   2   1   0

Example 4.1. Given a decision table DT = (U, C ∪ D, V, f), where U = {u1, u2, u3, u4, u5}, C = {a, b, c, d} and D = {e}, as shown in Table 1. Using the distance metric defined by Definition 4.2, we can calculate the distance for every pair of objects in U. For reasons of space, we present only the procedure for calculating the distance between u1 and u2.

Initialization: U/IND(D) = U/IND({e}) = {{u1, u3, u5}, {u2}, {u4}}. Let E1 = {u1, u3, u5}, E2 = {u2}, E3 = {u4}. Then VDM_R(u1, u2) = d_a(1, 0) + d_b(0, 0) + d_c(2, 1) + d_d(1, 2).

Step 1: Evaluate d_a(1, 0). Here [u1]_{{a}} = {u1, u5} and [u2]_{{a}} = {u2, u4}, so

d_a(1, 0) = (|{u1, u5}| / |{u1, u5}| − |∅| / |{u2, u4}|)² + (|∅| / |{u1, u5}| − |{u2}| / |{u2, u4}|)² + (|∅| / |{u1, u5}| − |{u4}| / |{u2, u4}|)² = 1 + 1/4 + 1/4 = 3/2.

Step 2: Evaluate d_b(0, 0). Here [u1]_{{b}} = [u2]_{{b}} = {u1, u2, u3, u4}. Since [u1]_{{b}} = [u2]_{{b}}, objects u1 and u2 belong to the same equivalence class of the equivalence relation IND({b}), and the distributions of decision categories with respect to the two objects are identical. Therefore d_b(0, 0) = 0.

Step 3: Evaluate d_c(2, 1). Here [u1]_{{c}} = {u1, u3, u4, u5} and [u2]_{{c}} = {u2}, so

d_c(2, 1) = (|{u1, u3, u5}| / |{u1, u3, u4, u5}| − |∅| / |{u2}|)² + (|∅| / |{u1, u3, u4, u5}| − |{u2}| / |{u2}|)² + (|{u4}| / |{u1, u3, u4, u5}| − |∅| / |{u2}|)² = 9/16 + 1 + 1/16 = 13/8.

Step 4: Evaluate d_d(1, 2). Here [u1]_{{d}} = {u1, u4, u5} and [u2]_{{d}} = {u2, u3}, so

d_d(1, 2) = (|{u1, u5}| / |{u1, u4, u5}| − |{u3}| / |{u2, u3}|)² + (|∅| / |{u1, u4, u5}| − |{u2}| / |{u2, u3}|)² + (|{u4}| / |{u1, u4, u5}| − |∅| / |{u2, u3}|)² = 1/36 + 1/4 + 1/9 = 7/18.
Step 5: VDM_R(u1, u2) = 3/2 + 0 + 13/8 + 7/18 ≈ 3.51.

Repeating the above calculation, we can obtain the distances for all the other pairs of objects in U. We can then find all outliers in U by using the nested-loops algorithm or other algorithms for distance-based outlier detection (Knorr & Ng, 1998; Knorr et al., 2000; Ramaswamy, Rastogi, & Shim, 2000). Many fast algorithms for finding distance-based outliers have been proposed. For example, Bay and Schwabacher (2003) proposed a simple algorithm based on nested loops, which gives near-linear-time performance when the data is in random order and a simple pruning rule is used; they demonstrated that their algorithm scales to real data sets with millions of examples and many features. Therefore, by combining the distance metrics defined above with these fast algorithms, we can efficiently find distance-based outliers in rough set theory.
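The worked example can be reproduced mechanically. Below is a sketch of VDM_R (Definition 4.2) applied to the decision table of Table 1, using exact rational arithmetic to match the fractions above:

```python
from fractions import Fraction

# Decision table DT of Table 1: condition attributes a-d, decision attribute e.
DT = {
    "u1": {"a": 1, "b": 0, "c": 2, "d": 1, "e": 0},
    "u2": {"a": 0, "b": 0, "c": 1, "d": 2, "e": 1},
    "u3": {"a": 2, "b": 0, "c": 2, "d": 2, "e": 0},
    "u4": {"a": 0, "b": 0, "c": 2, "d": 1, "e": 2},
    "u5": {"a": 1, "b": 1, "c": 2, "d": 1, "e": 0},
}
C, D = ["a", "b", "c", "d"], "e"

def vdm_r(x, y):
    """Value difference metric in rough set theory (Definition 4.2)."""
    categories = {row[D] for row in DT.values()}  # decision categories
    total = Fraction(0)
    for a in C:
        cx = [u for u in DT if DT[u][a] == DT[x][a]]  # [x]_{a}
        cy = [u for u in DT if DT[u][a] == DT[y][a]]  # [y]_{a}
        for cat in categories:  # d_a(x_a, y_a), Eq. (8)
            px = Fraction(sum(DT[u][D] == cat for u in cx), len(cx))
            py = Fraction(sum(DT[u][D] == cat for u in cy), len(cy))
            total += (px - py) ** 2
    return total

dist = vdm_r("u1", "u2")
print(dist, float(dist))  # 253/72, i.e. 3/2 + 0 + 13/8 + 7/18 ≈ 3.51
```

The result agrees exactly with Steps 1-5 of Example 4.1.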
5. Experimental results

In Section 3, we proposed a sequence-based outlier detection algorithm, and in Section 4, we introduced traditional distance-based outlier detection to rough sets. In this section, following the experimental setup in He, Deng, and Xu (2005), we use two real-life data sets (lymphography and cancer) to compare the performance of the sequence-based outlier detection algorithm against the traditional distance-based method and the KNN algorithm (Ramaswamy et al., 2000). In addition, on the cancer data set, we add the results of the RNN-based outlier detection method for comparison; these results can be found in Harkins, He, Willams, and Baxter (2002) and Willams, Baxter, He, Harkins, and Gu (2002). In our experiments, the results for the KNN algorithm were obtained by using the 5th nearest neighbor. Since in traditional distance-based outlier detection being an outlier is regarded as a binary property, we revise the definition of distance-based outlier detection by introducing a distance outlier factor (DOF) to indicate the degree of outlierness of every object in an information system.

Definition 4.1 (Distance outlier factor). Given an information system IS = (U, A, V, f). For any object x ∈ U, the percentage of the objects in U which lie at a distance greater than d from x is called the distance outlier factor of x in IS, denoted by
\[
DOF(x) = \frac{|\{y \in U : dist(x, y) > d\}|}{|U|} \tag{9}
\]
where dist(x, y) denotes the distance between objects x and y under a given distance metric in rough set theory (in our experiments, the overlap metric in rough set theory is adopted), d is a given parameter (in our experiments, we set d = |A|/2), and |U| denotes the cardinality of the set U.

5.1. Lymphography data

The first data set is the lymphography data set, which can be found in the UCI machine learning repository (Bay, 1999). It contains 148 instances (or objects) with 19 attributes (including the class attribute). The 148 instances are partitioned into 4 classes: "normal find" (1.35%), "metastases" (54.73%), "malign lymph" (41.22%) and "fibrosis" (2.7%). Classes 1 and 4 ("normal find" and "fibrosis") are regarded as rare classes. Aggarwal et al. proposed a practicable way to test the effectiveness of an outlier detection method (Aggarwal & Yu, 2001; He et al., 2005): run the outlier detection method on a given data set and test the percentage of detected points that belong to one of the rare classes. Aggarwal and Yu (2001) considered class labels that occur in less than 5% of the data set to be rare labels. Points belonging to a rare class are considered outliers. If the method works well, we expect such abnormal classes to be over-represented in the set of points found. In our experiment, the data in the lymphography data set are input into an information system IS_L = (U, A, V, f), where U contains all 148 instances of the lymphography data set and A contains its 18 attributes (not including the class attribute). We consider detecting outliers (rare classes) in IS_L. The experimental results are summarized in Table 2, where "SEQ", "DIS" and "KNN" denote the sequence-based, traditional distance-based and KNN outlier detection methods, respectively. For every object in U, the degree of outlierness is calculated by each of the three outlier detection methods. For each method, the "Top ratio (number of objects)" denotes the percentage (number) of objects selected from U whose degrees of outlierness, as calculated by the method, are higher than those of the other objects in U. If we use X ⊆ U to contain all the objects thus selected, then the "Number of rare classes included" is the number of objects in X that belong to one of the rare classes. The "Coverage" is the ratio of the "Number of rare classes included" to the number of objects in U that belong to one of the rare classes (He et al., 2005). From Table 2, we can see that for the lymphography data set, the sequence-based and distance-based methods perform markedly better than the KNN method. The performances of the sequence-based and distance-based methods are very close, though the former performs slightly worse than the latter.

5.2. Wisconsin breast cancer data

The Wisconsin breast cancer data set is also found in the UCI machine learning repository (Bay, 1999). The data set contains 699 instances with 9 continuous attributes.
Here, we follow the experimental technique of Harkins et al. by removing some of the malignant instances to form a very unbalanced distribution (Harkins et al., 2002; He et al., 2005; Willams et al., 2002). The resultant data set has 39 (8%) malignant instances and 444 (92%) benign instances. Moreover, the 9 continuous attributes in the data set are transformed into categorical attributes1 (He et al., 2005). The data in the Wisconsin breast cancer data set are likewise input into an information system IS_W = (U', A', V', f'), where U' contains all 483 instances of the data set and A' contains its 9 categorical attributes (not including the class attribute). We consider detecting outliers (malignant instances) in IS_W. The experimental results are summarized in Table 3, which is organized like Table 2. From Table 3, we can see that for the Wisconsin breast cancer data set, the performances of the sequence-based, traditional distance-based and RNN-based methods are very close, and all are markedly better than the KNN algorithm. Among the former three methods, the RNN-based method performs best, followed by the sequence-based method, with the distance-based method the worst.

6. Discussion

There are two broad approaches to outlier detection: parametric (e.g., distribution-based) and non-parametric (e.g., distance-based). The parametric or statistical approach to outlier detection assumes a distribution or probability model for the data set and then identifies outliers with respect to the model using a discordancy
1 The resultant data set is publicly available at: http://research.cmis.csiro.au/rohanb/outliers/breast-cancer/.
Table 2
Experimental results in IS_L

Top ratio (number of objects)   Number of rare classes included (coverage)
                                SEQ        DIS        KNN
5% (7)                          5 (83%)    5 (83%)    4 (67%)
6% (9)                          5 (83%)    6 (100%)   4 (67%)
8% (12)                         6 (100%)   6 (100%)   5 (83%)
10% (15)                        6 (100%)   6 (100%)   6 (100%)
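The "Top ratio" and "Coverage" bookkeeping behind Tables 2 and 3 can be sketched in a few lines; the function name `coverage` and the toy scores below are ours, for illustration only.

```python
def coverage(scores, rare, top_n):
    """Rank objects by decreasing outlierness and report how many of
    the top_n selected objects belong to a rare class, together with
    the coverage ratio (rare objects found / total rare objects)."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    hit = sum(1 for obj in ranked[:top_n] if obj in rare)
    return hit, hit / len(rare)

# Toy example: 6 objects, 2 of which belong to rare classes
scores = {"a": 0.9, "b": 0.1, "c": 0.8, "d": 0.2, "e": 0.3, "f": 0.4}
print(coverage(scores, rare={"a", "c"}, top_n=2))  # → (2, 1.0)
```

Here the two highest-scoring objects are exactly the two rare ones, so the top-2 selection already achieves full coverage.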
Table 3
Experimental results in IS_W

Top ratio (number of objects)   Number of rare classes included (coverage)
                                SEQ        DIS        KNN        RNN
1% (4)                          3 (8%)     4 (10%)    3 (8%)     4 (10%)
2% (8)                          7 (18%)    5 (13%)    6 (15%)    8 (21%)
4% (16)                         14 (36%)   11 (28%)   11 (28%)   16 (41%)
6% (24)                         21 (54%)   18 (46%)   18 (46%)   20 (51%)
8% (32)                         28 (72%)   24 (62%)   25 (64%)   27 (69%)
10% (40)                        32 (82%)   29 (74%)   30 (77%)   32 (82%)
12% (48)                        35 (90%)   36 (92%)   35 (90%)   37 (95%)
14% (56)                        39 (100%)  39 (100%)  36 (92%)   39 (100%)
16% (64)                        39 (100%)  39 (100%)  36 (92%)   39 (100%)
18% (72)                        39 (100%)  39 (100%)  38 (97%)   39 (100%)
20% (80)                        39 (100%)  39 (100%)  38 (97%)   39 (100%)
28% (112)                       39 (100%)  39 (100%)  39 (100%)  39 (100%)
test. A major drawback of the parametric approach is that most tests are for single attributes, yet many data mining problems require finding outliers in multidimensional spaces. Moreover, the parametric approach requires knowledge about parameters of the data set, such as the data distribution; in many cases, however, the data distribution is not known. The representative non-parametric approach is distance-based outlier detection, which was introduced to counter the main limitations imposed by the parametric approach. In comparison with the statistical approach, distance-based outlier detection avoids the excessive computation associated with fitting an observed distribution to some standard distribution and with selecting discordancy tests (Han & Kamber, 2000). Distance-based outlier detection relies on a distance metric (function) to measure the similarity between any two objects. As we have noted before, most of the distance metrics that have been proposed are only defined for linear attributes. However, almost all real-world data sets contain a mixture of nominal and linear attributes, and the nominal attributes are typically ignored or incorrectly modeled by existing approaches. In addition, it is well known that, in practice, no particular metric has clear advantages over another. Therefore, it is often very difficult to determine an appropriate distance metric for many practical distance-based outlier detection tasks, and in some situations no appropriate distance metric can be defined for a particular data set at all; distance-based outlier detection obviously cannot be applied in those situations. Unlike distance-based outlier detection, our sequence-based outlier detection does not require a metric distance function. On the other hand, like distance-based outlier detection, sequence-based outlier detection is a non-parametric approach, and can thus cope with the main limitations imposed by the parametric approach. So our approach can efficiently avoid the disadvantages of most current approaches to outlier detection.

7. Conclusion and future work

Outlier detection is becoming critically important in many areas. In this paper, we extended outlier detection to Pawlak's rough set theory, which has become a popular method for KDD, largely due to its ability to handle uncertain and/or incomplete data. First, we proposed a new sequence-based outlier detection method in information systems of rough set theory. Experimental results on real data sets demonstrate the effectiveness of our method for outlier detection. Next, we introduced Knorr and Ng's distance-based outlier detection to rough set theory. In order to calculate the distance between any two objects in an information system, we gave two revised definitions of the traditional distance metrics for nominal attributes in rough set theory. In future work, for the sequence-based outlier detection algorithm, we shall consider using reducts to reduce the number of attributes while preserving its performance. For distance-based outlier detection, since the proposed measures are all global ones, we shall consider using more local versions to obtain better performance.

Acknowledgements

This work is supported by the Natural Science Foundation (Grant Nos. 60641010, 60496326, 60573063 and 60573064), the National 863 Programme (Grant No. 2007AA01Z325), and the National 973 Programme (Grant Nos. 2003CB317008 and G1999032701).

References
Aggarwal, C. C., & Yu, P. S. (2001). Outlier detection for high dimensional data. In Proceedings of the 2001 ACM SIGMOD international conference on management of data, California (pp. 37–46).
Arning, A., Agrawal, R., & Raghavan, P. (1996). A linear method for deviation detection in large databases. In Proceedings of the 2nd international conference on KDD, Oregon (pp. 164–169).
Bay, S. D. (1999). The UCI KDD repository.
Bay, S. D., & Schwabacher, M. (2003). Mining distance-based outliers in near linear time with randomization and a simple pruning rule. In Proceedings of the 9th ACM SIGKDD international conference on KDD, Washington (pp. 29–38).
Bazan, J., Osmolski, A., Skowron, A., Slezak, D., Szczuka, M., & Wroblewski, J. (2002). Rough set approach to survival analysis. In Proceedings of the 3rd international conference on rough sets and current trends in computing (RSCTC2002), Malvern, PA, USA (pp. 522–529).
Bolton, R. J., & Hand, D. J. (2002). Statistical fraud detection: A review (with discussion). Statistical Science, 17(3), 235–255.
Breunig, M. M., Kriegel, H.-P., Ng, R. T., & Sander, J. (2000). LOF: Identifying density-based local outliers. In Proceedings of the 2000 ACM SIGMOD international conference on management of data, Dallas (pp. 93–104).
Cheng, V., Li, C. H., Kwok, J., & Li, C. K. (2004). Dissimilarity learning for nominal data. Pattern Recognition, 37(7), 1471–1477.
Chiu, A. L., & Fu, A. W. (2003). Enhancements on local outlier detection. In Proceedings of the 7th international database engineering and applications symposium (IDEAS'03), Hong Kong (pp. 298–307).
Eskin, E., Arnold, A., Prerau, M., Portnoy, L., & Stolfo, S. (2002). A geometric framework for unsupervised anomaly detection: Detecting intrusions in unlabeled data. In D. Barbará et al. (Eds.), Data mining for security applications. Boston: Kluwer Academic Publishers.
Gower, J. C., & Legendre, P. (1986). Metric and Euclidean properties of dissimilarity coefficients. Journal of Classification, 3, 5–48.
Han, J. W., & Kamber, M. (2000). Data mining: Concepts and techniques. California: Morgan Kaufmann Publishers.
Harkins, S., He, H. X., Willams, G. J., & Baxter, R. A. (2002). Outlier detection using replicator neural networks. In Proceedings of the 4th international conference on data warehousing and knowledge discovery, France (pp. 170–180).
Hawkins, D. (1980). Identifications of outliers. London: Chapman and Hall.
He, Z. Y., Deng, S. C., & Xu, X. F. (2005). An optimization model for outlier detection in categorical data. In Advances in intelligent computing, international conference on intelligent computing, ICIC(1) 2005, Hefei, China (pp. 400–409).
Jain, A. K., Murty, M. N., & Flynn, P. J. (1999). Data clustering: A review. ACM Computing Surveys, 31(3), 264–323.
Johnson, T., Kwok, I., & Ng, R. T. (1998). Fast computation of 2-dimensional depth contours. In Proceedings of the 4th international conference on knowledge discovery and data mining, New York (pp. 224–228).
Knorr, E., & Ng, R. (1998). Algorithms for mining distance-based outliers in large datasets. In Proceedings of the 24th VLDB conference, New York (pp. 392–403).
Knorr, E., Ng, R., & Tucakov, V. (2000). Distance-based outliers: Algorithms and applications. VLDB Journal: Very Large Databases, 8(3–4), 237–253.
Kovács, L., Vass, D., & Vidács, A. (2004). Improving quality of service parameter prediction with preliminary outlier detection and elimination. In Proceedings of the 2nd international workshop on inter-domain performance and simulation (IPS 2004), Budapest (pp. 194–199).
Lane, T., & Brodley, C. E. (1999). Temporal sequence learning and data reduction for anomaly detection. ACM Transactions on Information and System Security, 2(3), 295–331.
Li, M., Chen, X., Li, X., Ma, B., & Vitányi, P. M. B. (2003). The similarity metric. In Proceedings of the 14th annual ACM-SIAM symposium on discrete algorithms, Maryland (pp. 863–872).
Nguyen, S. H., & Nguyen, H. S. (1996). Some efficient algorithms for rough set methods. In Proceedings of the 6th international conference on information processing and management of uncertainty (IPMU'96), Granada, Spain (pp. 1451–1456).
Pawlak, Z. (1982). Rough sets. International Journal of Computer and Information Sciences, 11, 341–356.
Pawlak, Z. (1991). Rough sets: Theoretical aspects of reasoning about data. Dordrecht: Kluwer Academic Publishers.
Pawlak, Z., Grzymala-Busse, J. W., Slowinski, R., & Ziarko, W. (1995). Rough sets. Communications of the ACM, 38(11), 89–95.
Ramaswamy, S., Rastogi, R., & Shim, K. (2000). Efficient algorithms for mining outliers from large datasets. In Proceedings of the 2000 ACM SIGMOD international conference on management of data, Dallas (pp. 427–438).
Rousseeuw, P. J., & Leroy, A. M. (1987). Robust regression and outlier detection. New York: John Wiley and Sons.
Rulequest Research. GritBot.
Skowron, A., & Rauszer, C. (1992). The discernibility matrices and functions in information systems. In R. Slowinski (Ed.), Intelligent decision support – Handbook of applications and advances of the rough sets theory (pp. 331–362). Dordrecht: Kluwer.
Skowron, A., & Synak, P. (2004). Reasoning in information maps. Fundamenta Informaticae, 59, 241–259.
Stanfill, C., & Waltz, D. (1986). Towards memory-based reasoning. Communications of the ACM, 29(12), 1213–1228.
Willams, G. J., Baxter, R. A., He, H. X., Harkins, S., & Gu, L. F. (2002). A comparative study of RNN for outlier detection in data mining. In ICDM, Japan (pp. 709–712).
Wilson, D. R., & Martinez, T. R. (1997). Improved heterogeneous distance functions. Journal of Artificial Intelligence Research, 6, 1–34.
Zhong, N., Yao, Y. Y., Ohshima, M., & Ohsuga, S. (2001). Interestingness, peculiarity, and multi-database mining. In Proceedings of the 2001 IEEE international conference on data mining (IEEE ICDM'01) (pp. 566–573). IEEE Computer Society Press.