Expert Systems with Applications 36 (2009) 4680–4687
Some issues about outlier detection in rough set theory

Feng Jiang a,*, Yuefei Sui b, Cungen Cao b

a College of Information Science and Technology, Qingdao University of Science and Technology, Qingdao 266061, PR China
b Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100080, PR China

doi:10.1016/j.eswa.2008.06.019

Keywords: Outlier detection; Rough sets; Distance metric; KDD
Abstract

"One person's noise is another person's signal" (Knorr, E., Ng, R. (1998). Algorithms for mining distance-based outliers in large datasets. In Proceedings of the 24th VLDB conference, New York (pp. 392-403)). In recent years, much attention has been given to the problem of outlier detection, whose aim is to detect outliers: objects which behave in an unexpected way or have abnormal properties. Detecting such outliers is important for many applications, such as criminal activity in electronic commerce, computer intrusion attacks, terrorist threats and agricultural pest infestations, and outlier detection is critically important in the information-based society. In this paper, we discuss some issues concerning outlier detection in rough set theory, which emerged about 20 years ago and is nowadays a rapidly developing branch of artificial intelligence and soft computing. First, we propose a novel definition of outliers in information systems of rough set theory: sequence-based outliers. An algorithm to find such outliers in rough set theory is also given, and the effectiveness of the sequence-based method for outlier detection is demonstrated on two publicly available databases. Second, we introduce traditional distance-based outlier detection to rough set theory and discuss the definitions of distance metrics for distance-based outlier detection in rough set theory.

© 2008 Elsevier Ltd. All rights reserved.
1. Introduction

Knowledge discovery in databases (KDD), or data mining, is an important issue in the development of data- and knowledge-based systems. Usually, knowledge discovery tasks can be classified into four general categories: (a) dependency detection, (b) class identification, (c) class description, and (d) outlier/exception detection (Knorr & Ng, 1998). In contrast to most KDD tasks, such as clustering and classification, outlier detection aims to find small groups of data objects that are exceptional when compared with the remaining large amount of data, in terms of certain sets of properties. For many applications, such as fraud detection in E-commerce, it is more interesting to find the rare events than the common ones, from a knowledge discovery standpoint. Studying the extraordinary behavior of outliers can help us uncover the valuable information hidden behind them. Recently, researchers have begun focusing on outlier detection and have attempted to apply outlier-finding algorithms to tasks such as fraud detection (Bolton & Hand, 2002), identification of computer network intrusions (Eskin, Arnold, Prerau, Portnoy, & Stolfo, 2002; Lane & Brodley, 1999), data cleaning (Rulequest Research), detection of employers with
poor injury histories (Knorr, Ng, & Tucakov, 2000), and peculiarity-oriented mining (Zhong, Yao, Ohshima, & Ohsuga, 2001). Outliers exist extensively in the real world and are generated from different sources, e.g. a heavy-tailed distribution or errors in data entry. While there is no single, generally accepted, formal definition of an outlier, Hawkins' definition captures the spirit: "an outlier is an observation that deviates so much from other observations as to arouse suspicions that it was generated by a different mechanism" (Hawkins, 1980; Knorr & Ng, 1998). With increasing awareness of outlier detection in the literature, more concrete meanings of outliers have been defined for solving problems in specific domains. Nonetheless, most of these definitions follow the spirit of Hawkins' definition (Chiu & Fu, 2003). Roughly speaking, current approaches to outlier detection can be classified into the following five categories (Kovács, Vass, & Vidács, 2004):

(1) The distribution-based approach is the classical method in statistics. It is based on some standard distribution model (Normal, Poisson, etc.), and those objects which deviate from the model are recognized as outliers (Rousseeuw & Leroy, 1987). Its greatest disadvantage is that the distribution of the measurement data is unknown in practice; often a large number of tests are required to decide which distribution model, if any, the measurement data follow.
(2) The depth-based approach is based on computational geometry: it computes different layers of k-d convex hulls and flags objects in the outer layer as outliers (Johnson, Kwok, & Ng, 1998). However, it is well known that the algorithms employed suffer from the curse of dimensionality and cannot cope with large k.

(3) The clustering approach classifies the input data and detects outliers as by-products (Jain, Murty, & Flynn, 1999). However, since its main objective is clustering, it is not optimized for outlier detection.

(4) The distance-based approach was originally proposed by Knorr and Ng (Knorr & Ng, 1998; Knorr et al., 2000). An object o in a data set T is a distance-based outlier if at least a fraction p of the objects in T are further than distance D from o. This outlier definition is based on a single, global criterion determined by the parameters p and D. Problems may occur if the characteristics of the data are very different in different regions of the data set.

(5) The density-based approach was originally proposed by Breunig, Kriegel, Ng, and Sander (2000). A local outlier factor (LOF) is assigned to each sample based on its local neighborhood density, and samples with a high LOF value are identified as outliers. The disadvantage of this solution is that it is very sensitive to the parameters defining the neighborhood.

Rough set theory, introduced by Zdzislaw Pawlak in the early 1980s (Pawlak, 1982, 1991; Pawlak, Grzymala-Busse, Slowinski, & Ziarko, 1995), is a theory for the study of intelligent systems characterized by insufficient and incomplete information. It is motivated by practical needs in classification and concept formation. The rough set philosophy is based on the assumption that with every object of the universe there is associated a certain amount of information (data, knowledge), expressed by means of some attributes; objects having the same description are indiscernible. In recent years, there has been fast-growing interest in rough set theory.
Successful applications of the rough set model to a variety of problems have demonstrated its importance and versatility. To the best of our knowledge, there is no existing work on outlier detection in the rough set community. The aim of this work is to combine rough set theory and outlier detection, showing how outlier detection can be done in rough set theory. We suggest two different ways to achieve this aim. First, we propose sequence-based outlier detection in information systems of rough set theory. Second, we introduce traditional distance-based outlier detection to rough set theory.

This paper is organized as follows. In the next section, we introduce some preliminaries of rough set theory and outlier detection. In Section 3, we give some definitions concerning sequence-based outliers in information systems of rough set theory. The basic idea is as follows. Given an information system IS = (U, A, V, f), where U is a non-empty finite set of objects, A is a set of attributes, V is the union of attribute domains, and f is a function such that for any x ∈ U and a ∈ A, f(x, a) ∈ V_a. Since each attribute subset B ⊆ A determines an indiscernibility (equivalence) relation IND(B) on U, we can obtain the corresponding equivalence class of relation IND(B) for every object x ∈ U. If we decrease the attribute subset B gradually, then the granularity of the partition U/IND(B) becomes coarser, and for every object x ∈ U the corresponding equivalence class of x becomes bigger. So when there is an object in U whose equivalence class never varies, or only increases a little, in comparison with those of the other objects in U, we may consider this object a sequence-based outlier in U with respect to IS. An algorithm to find sequence-based outliers is also given. In Section 4, we apply traditional distance-based outlier detection to rough set theory. Since classical rough set theory is better suited to dealing with nominal attributes, we propose revised definitions of two traditional distance metrics for distance-based outlier detection in rough set theory, the overlap metric and the value difference metric in rough set theory, both of which are especially designed to deal with nominal attributes. Experimental results are given in Section 5, and Section 6 discusses the advantages of our sequence-based approach by comparing it with other approaches to outlier detection. Section 7 concludes the paper.
2. Preliminaries

In the rough set data model, information is stored in a table, where each row (tuple) represents facts about an object. All we know about an object from the table is the corresponding tuple. In rough set terminology, a data table is also called an information system. When the attributes are classified into decision attributes and condition attributes, a data table is also called a decision system. More formally, an information system is a quadruple IS = (U, A, V, f), where:

1. U is a non-empty finite set of objects.
2. A is a non-empty finite set of attributes.
3. V is the union of attribute domains, i.e., V = ∪_{a∈A} V_a, where V_a denotes the domain of attribute a.
4. f : U × A → V is an information function such that for any a ∈ A and x ∈ U, f(x, a) ∈ V_a.

We can split the set A of attributes into two subsets: C ⊂ A and D = A − C, the condition attributes and the decision (or class) attributes, respectively. The condition attributes represent measured features of the objects, while the decision attributes are the a posteriori outcome of classification. Each subset B ⊆ A of attributes determines a binary relation IND(B), called an indiscernibility relation, which is defined as follows:
IND(B) = {(x, y) ∈ U × U : ∀a ∈ B, f(x, a) = f(y, a)}    (1)
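Given this definition, the partition U/IND(B) can be computed by simply grouping objects on their B-value vectors. The following Python sketch illustrates this (the toy table and attribute names are our own, not from the paper):

```python
from collections import defaultdict

def partition(table, B):
    """Compute U/IND(B): group objects whose attribute values agree on
    every attribute in B into equivalence classes."""
    classes = defaultdict(list)
    for obj, row in table.items():
        classes[tuple(row[a] for a in sorted(B))].append(obj)
    return list(classes.values())

# Toy information system with two attributes.
table = {
    "x1": {"color": "red",  "size": "small"},
    "x2": {"color": "red",  "size": "small"},
    "x3": {"color": "red",  "size": "large"},
    "x4": {"color": "blue", "size": "large"},
}
print(partition(table, {"color"}))          # [['x1', 'x2', 'x3'], ['x4']]
print(partition(table, {"color", "size"}))  # three classes: a finer partition
```

Note that shrinking B can only merge classes, never split them, which is exactly the coarsening behavior the sequence-based method exploits.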
It is obvious that IND(B) is an equivalence relation on U and that IND(B) = ∩_{a∈B} IND({a}). Given any B ⊆ A, the relation IND(B) induces a partition of U, denoted by U/IND(B), where an element of U/IND(B) is called an equivalence class. For every element x of U, let [x]_B denote the equivalence class of relation IND(B) that contains element x, called the equivalence class of x under relation IND(B).

The distance-based approach is now widely used for outlier detection. An object o in a data set S is a distance-based (DB) outlier with parameters p and d, denoted by DB(p, d), if at least a fraction p of the objects in S lie at a distance greater than d from o. The advantage of the distance-based approach is that no explicit distribution is needed to determine unusualness, and it can be applied to any feature or attribute space (the vector space spanned by some features is called a feature space) for which we can define a distance metric. It should be noted that metrics are measures possessing metric properties, which express the degree or strength of a quality factor; many measures that we often use are in fact not metrics. A distance metric is a distance function on a set of points, mapping pairs of points into the non-negative real numbers. In general, any distance function D which obeys the following conditions can be used in similarity measures (Li, Chen, Li, Ma, & Vitányi, 2003):

(1) D(x, y) ≥ 0: distances cannot be negative.
(2) D(x, y) = 0 if and only if x = y.
(3) D(x, y) = D(y, x): distance is symmetric.
(4) D(x, y) + D(y, z) ≥ D(x, z): triangle inequality.
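On a finite point set, the four conditions can be checked exhaustively. A small illustrative sketch (the point set and the two distance functions are invented for illustration):

```python
from itertools import product

def is_metric(points, d):
    """Exhaustively check conditions (1)-(4) on a finite point set."""
    for x, y in product(points, repeat=2):
        if d(x, y) < 0:                      # (1) non-negativity
            return False
        if (d(x, y) == 0) != (x == y):       # (2) identity of indiscernibles
            return False
        if d(x, y) != d(y, x):               # (3) symmetry
            return False
    return all(d(x, y) + d(y, z) >= d(x, z)  # (4) triangle inequality
               for x, y, z in product(points, repeat=3))

points = ["a", "b", "c"]
overlap01 = lambda x, y: 0 if x == y else 1    # the discrete 0/1 distance
squared = lambda x, y: (ord(x) - ord(y)) ** 2  # squared code difference

print(is_metric(points, overlap01))  # True: all four axioms hold
print(is_metric(points, squared))    # False: triangle inequality fails
```

The second function illustrates the remark that many measures in common use are not metrics: squared differences satisfy (1)-(3) but violate the triangle inequality.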
Usually, an attribute can be linear or nominal, and a linear attribute can be continuous or discrete. A continuous (or continuously valued) attribute uses real values, such as the mass of a planet or the velocity of an object. A linear discrete (or integer) attribute can take only a discrete set of linear values, such as the number of children. A nominal (or symbolic) attribute is a discrete attribute whose values are not necessarily linearly ordered. For example, a variable representing color might have values such as red, green, blue, brown, black and white, which could be represented by the integers 1-6, respectively (Wilson & Martinez, 1997). Many distance metrics have been proposed, and most of them are defined only for linear attributes, including the Euclidean, Minkowski and Mahalanobis distance metrics, the context-similarity measure, hyperrectangle distance functions and others. When the attributes are nominal, the definition of a similarity measure becomes less trivial. For nominal attributes, a simple but commonly used measure is the overlap metric (Stanfill & Waltz, 1986). Under this metric, for two possible values v_i and v_j, the distance is defined as zero when v_i and v_j are identical, and one otherwise. For binary attributes, this overlap metric reduces to the so-called Hamming distance. Moreover, a very popular real-valued metric for nominal attributes is the value difference metric (Cheng, Li, Kwok, & Li, 2004; Stanfill & Waltz, 1986).
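As a concrete illustration, the overlap metric and its reduction to the Hamming distance for binary attributes can be sketched in a few lines of Python (the example vectors are our own):

```python
def overlap(x, y):
    """Overlap metric for nominal vectors: per-attribute distance is 0 for
    identical values and 1 otherwise; the total is the number of mismatches."""
    assert len(x) == len(y)
    return sum(xv != yv for xv, yv in zip(x, y))

# Nominal attributes (color, taste): one mismatch.
print(overlap(["red", "sour"], ["red", "sweet"]))  # 1
# Binary attributes: reduces to the Hamming distance.
print(overlap([0, 1, 1, 0], [1, 1, 0, 0]))         # 2
```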
3. Sequence-based outlier detection in rough set theory

By now, rough set theory has found many interesting applications. The rough set approach seems to be of fundamental importance to AI and the cognitive sciences, especially in the areas of machine learning and KDD. At present, research concerning KDD in the rough set community mainly focuses on the first three categories of knowledge discovery tasks; little attention has been paid to the fourth category, outlier/exception detection, in rough set theory. Therefore, in this section we discuss outlier definition and detection in information systems of rough set theory. In the following subsection, we first give some definitions concerning sequence-based outliers; an algorithm to find sequence-based outliers is then presented.

3.1. Definitions

Our definition of sequence-based outliers in an information system follows the spirit of Hawkins' definition of outliers. That is, given an information system IS = (U, A, V, f), for any x ∈ U, if x has some characteristics that differ greatly from those of the other objects in U, in terms of the attributes in A, we may call x an outlier in U with respect to IS.

Our basic idea is as follows. Given an information system IS = (U, A, V, f), since each attribute subset B ⊆ A determines an indiscernibility (equivalence) relation IND(B) on U, we can obtain the corresponding equivalence class for every object x ∈ U under relation IND(B), that is, the equivalence class that contains x. If we decrease B gradually, then the granularity of the partition U/IND(B) becomes coarser, and for every object x ∈ U the corresponding equivalence class of x becomes bigger. So when there is an object in U whose equivalence class never varies, or only increases a little, in comparison with those of the other objects in U, we consider this object an outlier in U with respect to IS.
In a word, an outlier in U is an element whose corresponding equivalence class (the set of all elements indiscernible from it) does not vary, or varies only a little, when we regularly coarsen the granularity of the partition U/IND(B) of U. By reasoning about the changes of attribute subsets and the corresponding equivalence classes in an information system,
we can find all the outliers. In fact, much of the above idea derives from the excellent work by A. Skowron et al., who discussed approximate reasoning methods based on information changes in a series of papers (Bazan et al., 2002; Skowron & Synak, 2004). They introduced basic concepts for approximate reasoning about information changes across information maps, and measured the degree of change using information granules. Any rule for reasoning about information changes specifies how changes of information granules from the rule premise influence changes of information granules from the rule conclusion. Changes in information granules can be measured, e.g., using expressions analogous to derivatives. To implement the above idea, we construct three kinds of sequences (Arning, Agrawal, & Raghavan, 1996). The first is the sequence of attributes.

Definition 3.1 (Sequence of attributes). Let IS = (U, A, V, f) be an information system, where A = {a_1, a_2, ..., a_m}. For every attribute a_i ∈ A, let U/IND({a_i}) = {E_i1, ..., E_ik} denote the partition induced by the equivalence relation IND({a_i}), and let Var(a_i) denote the variance of the set {|E_i1|, ..., |E_ik|}, that is,
Var(a_i) = [(|E_i1| − Ave)² + ⋯ + (|E_ik| − Ave)²] / k    (2)
where Ave = (|E_i1| + ⋯ + |E_ik|) / k, |X| denotes the cardinality of set X, and 1 ≤ i ≤ m. We construct a sequence S = ⟨a'_1, a'_2, ..., a'_m⟩ of the attributes in A (in this paper, a sequence is denoted by a tuple), where for every 1 ≤ i < m, Var(a'_i) ≥ Var(a'_{i+1}).

Next, by decreasing the attribute set A gradually, we can determine a descending sequence of attribute subsets.

Definition 3.2 (Descending sequence of attribute subsets). Let IS = (U, A, V, f) be an information system, where A = {a_1, a_2, ..., a_m}, and let S = ⟨a'_1, a'_2, ..., a'_m⟩ be the sequence of attributes defined above. Given a sequence AS = ⟨A_1, A_2, ..., A_m⟩ of attribute subsets, where A_1, A_2, ..., A_m ⊆ A, if A_1 = A, A_m = {a'_m} and A_{i+1} = A_i − {a'_i} for every 1 ≤ i < m, then we call AS a descending sequence of attribute subsets in IS.

By the above definition, in AS = ⟨A_1, A_2, ..., A_m⟩, for every 1 ≤ i < m, A_{i+1} is the attribute subset obtained from A_i by removing the element a'_i, where a'_i is the ith element in sequence S. Given any object x ∈ U, for the descending sequence AS of attribute subsets, we can construct a corresponding ascending sequence of equivalence classes of x.

Definition 3.3 (Ascending sequence of equivalence classes). Let IS = (U, A, V, f) be an information system, where A = {a_1, a_2, ..., a_m}, and let AS = ⟨A_1, A_2, ..., A_m⟩ be a descending sequence of attribute subsets in IS. For any object x ∈ U, let [x]_{A_i} be the equivalence class of x under relation IND(A_i) for every 1 ≤ i ≤ m. Then we have a sequence ES(x) = ⟨[x]_{A_1}, [x]_{A_2}, ..., [x]_{A_m}⟩ of equivalence classes, where [x]_{A_1}, [x]_{A_2}, ..., [x]_{A_m} ⊆ U. ES(x) is called the ascending sequence of equivalence classes of object x in IS.

Most current methods for outlier detection give a binary classification of objects (is or is not an outlier), e.g. distance-based outlier detection.
However, for many scenarios, it is more meaningful to assign to each object a degree of being an outlier. Given a degree of outlierness for every object, the objects can be ranked according to this degree, giving the data mining analyst a sequence in which to analyze the outliers. Therefore, Breunig et al. introduced a novel notion of local outlier in which the degree to which an object is outlying is dependent on the density of its local neighborhood, and each object can be assigned a Local Outlier Factor (LOF), which represents the likelihood of that object being an outlier (Breunig et al., 2000).
Similar to Breunig's method, we define a sequence outlier factor (SOF), which indicates the degree of outlierness of every object in an information system.

Definition 3.4 (Sequence outlier factor). Let IS = (U, A, V, f) be an information system, where A = {a_1, a_2, ..., a_m}. For any x ∈ U, let ES(x) = ⟨[x]_{A_1}, [x]_{A_2}, ..., [x]_{A_m}⟩ be the ascending sequence of equivalence classes of x in IS. The sequence outlier factor of x in IS is defined as follows:

SOF(x) = Σ_{j=2}^{m} (1 − (|[x]_{A_j}| − |[x]_{A_{j−1}}|) / |U|) · W(x)    (3)
where W : U → [0, 1) is a weight function such that for any x ∈ U,

W(x) = (Σ_{a∈A} (1 − |[x]_{{a}}| / |U|)) / |A|,

where [x]_{{a}} = {u ∈ U : f(u, a) = f(x, a)} denotes the equivalence class of relation IND({a}) that contains element x, and |M| denotes the cardinality of set M.

The weight function W in the above definition expresses the idea that outlier detection always concerns the minority of objects in the data set, and that the minority of objects are more likely to be outliers than the majority. Since, by the above definition, the larger the weight, the larger the sequence outlier factor, the minority of objects should receive larger weights than the majority. Therefore, for every a ∈ A, if the objects in U that are indiscernible from x under relation IND({a}) are always few, that is, if the percentage of objects in U that are indiscernible from x is small, then we consider x to belong to the minority of objects in U and assign a high weight to x.

Furthermore, the above definition accords with our basic idea about sequence-based outliers: when we regularly coarsen the granularity of the partitions on the domain U of an information system IS, if the corresponding equivalence class of an object x ∈ U does not vary, or varies only a little, then the likelihood of x being an outlier, i.e. the sequence outlier factor of x, is high.

Definition 3.5 (Sequence-based outliers). Let IS = (U, A, V, f) be an information system, where A = {a_1, a_2, ..., a_m}, and let μ be a given threshold value. For any x ∈ U, if SOF(x) > μ, then x is called a sequence-based outlier in U with respect to IS, where SOF(x) is the sequence outlier factor of x in IS.

3.2. Algorithm

Algorithm 3.1.
Input: an information system IS = (U, A, V, f), where |U| = n and |A| = m; a threshold value μ.
Output: a set O of sequence-based outliers.

(1) For every a ∈ A {
(2)   Sort all objects from U according to a given order (e.g. the lexicographical order) on the domain V_a of attribute a (Nguyen, 1996);
(3)   Determine the partition U/IND({a}) = {E_1, ..., E_k};
(4)   Calculate Var(a), the variance of the set {|E_1|, ..., |E_k|}
(5) }
(6) Determine the sequence S = ⟨a'_1, a'_2, ..., a'_m⟩ of attributes in A, where for every 1 ≤ i < m, Var(a'_i) ≥ Var(a'_{i+1});
(7) Construct a descending sequence AS = ⟨A_1, ..., A_m⟩ of attribute subsets, in terms of sequence S;
(8) For 1 ≤ i ≤ m {
(9)   Sort all objects from U according to a given order (e.g. the lexicographical order) on the domain V_{A_i} of attribute subset A_i;
(10)  Determine the partition U/IND(A_i)
(11) }
(12) For every x ∈ U {
(13)  Construct the ascending sequence ES(x) = ⟨[x]_{A_1}, ..., [x]_{A_m}⟩ of equivalence classes, where [x]_{A_i} ∈ U/IND(A_i), 1 ≤ i ≤ m;
(14)  Assign the weight W(x) to x;
(15)  Calculate SOF(x), the sequence outlier factor of x in IS;
(16)  If SOF(x) > μ then O = O ∪ {x}
(17) }
(18) Return O.
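The computation behind Algorithm 3.1 can be sketched directly from Definitions 3.1-3.4. The following Python sketch (the toy information system is invented for illustration, and the code favors clarity over the sorting-based implementation of the pseudocode) computes SOF(x) for every object:

```python
from statistics import pvariance

def eq_class_size(table, B, x):
    """|[x]_B|: number of objects that agree with x on every attribute in B."""
    key = tuple(table[x][a] for a in B)
    return sum(1 for u in table
               if tuple(table[u][a] for a in B) == key)

def sof(table):
    """Sequence outlier factor SOF(x) for every object (Definitions 3.1-3.4)."""
    U = list(table)
    A = sorted(next(iter(table.values())))
    n, m = len(U), len(A)

    # Definition 3.1: order attributes by non-increasing variance of the
    # sizes of the equivalence classes they induce.
    def var(a):
        sizes = {}
        for u in U:
            sizes[table[u][a]] = sizes.get(table[u][a], 0) + 1
        return pvariance(sizes.values())
    order = sorted(A, key=var, reverse=True)

    # Definition 3.2: descending subsets A_1 = A, A_{i+1} = A_i - {a'_i}.
    subsets = [order[i:] for i in range(m)]

    scores = {}
    for x in U:
        # Weight W(x): average of 1 - |[x]_{a}|/|U| over single attributes.
        w = sum(1 - eq_class_size(table, [a], x) / n for a in A) / m
        # Eq. (3): sum over the ascending sequence of equivalence class sizes.
        s = 0.0
        for j in range(1, m):
            growth = (eq_class_size(table, subsets[j], x)
                      - eq_class_size(table, subsets[j - 1], x)) / n
            s += (1 - growth) * w
        scores[x] = s
    return scores

# Toy information system: o5 disagrees with the majority on both attributes.
table = {
    "o1": {"p": 0, "q": 0}, "o2": {"p": 0, "q": 0}, "o3": {"p": 0, "q": 0},
    "o4": {"p": 0, "q": 1}, "o5": {"p": 1, "q": 1},
}
print(sof(table))  # o5 gets the largest SOF (0.56)
```

With a threshold of, say, μ = 0.5, only o5 would be reported as a sequence-based outlier in this toy system.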
In Algorithm 3.1, we use a method proposed by Nguyen (1996) to calculate the partition induced by an indiscernibility relation in an information system. In the worst case, the time complexity of Algorithm 3.1 is O(m² · n · log n), and its space complexity is O(m · n), where m and n are the cardinalities of A and U, respectively.

4. Distance-based outlier detection in rough set theory

Knorr and Ng (1998) and Knorr et al. (2000) proposed a non-parametric approach to outlier detection based on the distance of an object in a given data set to its nearest neighbors. In this approach, one looks at the local neighborhood of points for an object, typically defined by the k nearest objects (also known as neighbors). If the neighboring points are relatively close, then the object is considered normal; if the neighboring points are far away, then the object is considered unusual (Bay & Schwabacher, 2003). Following the ideas of distance-based outlier detection, objects in an information system are considered outliers if they do not have enough neighbors. Although rough sets can also be used for non-nominal attributes, they are better suited to nominal data sets. In order to calculate the distance between any two objects in an information system, we need a suitable distance metric for nominal attributes in rough set theory. Next, we give revised definitions of the traditional overlap metric and value difference metric: the overlap metric and the value difference metric in rough set theory, respectively.

4.1. Overlap metric in rough set theory

Definition 4.1. Given an information system IS = (U, A, V, f), let x, y ∈ U be any two objects between which we shall calculate the distance. The overlap metric in rough set theory is defined as
D(x, y) = |{a ∈ A : a(x) ≠ a(y)}|    (4)
where D : U × U → [0, +∞) is a function from U × U to the non-negative real numbers (Stanfill & Waltz, 1986).

It should be noticed that the above definition is closely related to the definition of discernibility matrices (Skowron & Rauszer, 1992). In fact, if we have calculated the discernibility matrix of a given information system, we can easily read off the distance between any two objects from the discernibility matrix. From the above definition, we can see that the overlap metric in rough set theory simply counts the number of attributes in A on which objects x and y have different attribute values. This is a reasonable choice if no information about the weights of the different attributes is available. But if we do have information about attribute weights, then we can add linguistic bias to weight the different attributes. Therefore, a number of variants of the overlap metric, each using a different weighting factor, have also been proposed (Gower & Legendre, 1986). However, the overlap metric and its variants assume that all attribute values are of equal distance from each other, and thus cannot represent value pairs with differing degrees of similarity. For example, for the attribute taste, it may be more desirable to have the value sour closer to sweet than to bitter. Hence, a real-valued distance metric is often preferred over a Boolean one (Cheng et al., 2004).

4.2. Value difference metric in rough set theory

The value difference metric (VDM) was introduced by Stanfill and Waltz (1986) to provide an appropriate distance function for nominal attributes. A simplified version (without the weighting schemes) of the VDM is defined as follows:
VDM(x, y) = Σ_{f∈F} d_f(x_f, y_f)    (5)
where F is the set of all features in the problem domain, and x and y are any two objects between which we shall calculate the distance. d_f(x_f, y_f) denotes the distance between the two values x_f and y_f of feature f, where x_f is the value of object x on feature f. For any feature f ∈ F, d_f(x_f, y_f) is defined as follows:
d_f(x_f, y_f) = Σ_{c∈OC} (P(c | x_f) − P(c | y_f))²    (6)
where OC is the set of all output classes in the problem domain, and P(c | v) is the conditional probability that the output class is c given that feature f has the value v:

P(c | v) = (number of objects with value v and belonging to class c) / (number of objects with value v).

Next we give the revised definition of the VDM in rough set theory. In the rough set domain, given a decision table DT = (U, C ∪ D, V, f), we use the set C of condition attributes in place of the set F of features in the above definition, and the set U/IND(D) of decision categories in place of the set OC of output classes.

Definition 4.2. Given a decision table DT = (U, C ∪ D, V, f), where C is the set of condition attributes and D is the set of decision attributes, let x, y ∈ U be any two objects between which we shall calculate the distance. The value difference metric in rough set theory, VDM_R : U × U → [0, +∞), is defined as
VDM_R(x, y) = Σ_{a∈C} d_a(x_a, y_a)    (7)

where d_a(x_a, y_a) denotes the distance between the two values x_a and y_a of condition attribute a, and x_a is the value of object x on attribute a. For any a ∈ C, define

d_a(x_a, y_a) = Σ_{E∈U/IND(D)} (|[x]_{{a}} ∩ E| / |[x]_{{a}}| − |[y]_{{a}} ∩ E| / |[y]_{{a}}|)²    (8)

where [x]_{{a}} is the equivalence class of IND({a}) that contains object x. For any E ∈ U/IND(D), all elements in E have the same decision attribute values, that is, E denotes a decision category. |[x]_{{a}}| is the number of objects in U that have the value x_a on attribute a, and [x]_{{a}} ∩ E is the set of all objects in U that have the value x_a on attribute a and belong to decision category E. Therefore |[x]_{{a}} ∩ E| / |[x]_{{a}}| is the conditional probability that the decision category is E given that attribute a has the value x_a; this corresponds to P(c | v) in the definition above.

Table 1
Decision table DT

U\A   a   b   c   d   e
u1    1   0   2   1   0
u2    0   0   1   2   1
u3    2   0   2   2   0
u4    0   0   2   1   2
u5    1   1   2   1   0

Example 4.1. Given a decision table DT = (U, C ∪ D, V, f), where U = {u1, u2, u3, u4, u5}, C = {a, b, c, d} and D = {e}, as shown in Table 1. Using the distance metric defined by Definition 4.2, we can calculate the distance for every pair of objects in U. For reasons of space, we present only the procedure for calculating the distance between u1 and u2.

Initialization: U/IND(D) = U/IND({e}) = {{u1, u3, u5}, {u2}, {u4}}. Let E1 = {u1, u3, u5}, E2 = {u2}, E3 = {u4}. Then VDM_R(u1, u2) = d_a(1, 0) + d_b(0, 0) + d_c(2, 1) + d_d(1, 2).

Step 1: Evaluate d_a(1, 0). Here [u1]_{{a}} = {u1, u5} and [u2]_{{a}} = {u2, u4}, so

d_a(1, 0) = (|{u1, u5}| / |{u1, u5}| − |∅| / |{u2, u4}|)² + (|∅| / |{u1, u5}| − |{u2}| / |{u2, u4}|)² + (|∅| / |{u1, u5}| − |{u4}| / |{u2, u4}|)² = 1 + 1/4 + 1/4 = 3/2.

Step 2: Evaluate d_b(0, 0). Here [u1]_{{b}} = [u2]_{{b}} = {u1, u2, u3, u4}. Since [u1]_{{b}} = [u2]_{{b}}, objects u1 and u2 belong to the same equivalence class of the equivalence relation IND({b}), and the distributions of decision categories with respect to the two objects are identical. Therefore d_b(0, 0) = 0.

Step 3: Evaluate d_c(2, 1). Here [u1]_{{c}} = {u1, u3, u4, u5} and [u2]_{{c}} = {u2}, so

d_c(2, 1) = (|{u1, u3, u5}| / |{u1, u3, u4, u5}| − |∅| / |{u2}|)² + (|∅| / |{u1, u3, u4, u5}| − |{u2}| / |{u2}|)² + (|{u4}| / |{u1, u3, u4, u5}| − |∅| / |{u2}|)² = 9/16 + 1 + 1/16 = 13/8.

Step 4: Evaluate d_d(1, 2). Here [u1]_{{d}} = {u1, u4, u5} and [u2]_{{d}} = {u2, u3}, so

d_d(1, 2) = (|{u1, u5}| / |{u1, u4, u5}| − |{u3}| / |{u2, u3}|)² + (|∅| / |{u1, u4, u5}| − |{u2}| / |{u2, u3}|)² + (|{u4}| / |{u1, u4, u5}| − |∅| / |{u2, u3}|)² = 1/36 + 1/4 + 1/9 = 7/18.
Step 5: VDM_R(u1, u2) = 3/2 + 0 + 13/8 + 7/18 ≈ 3.51.

Repeating the above calculation, we can obtain the distances for all the other pairs of objects in U. We can then find all outliers in U by using the nested-loops algorithm or other algorithms for distance-based outlier detection (Knorr & Ng, 1998; Knorr et al., 2000; Ramaswamy, Rastogi, & Shim, 2000). Many fast algorithms for finding distance-based outliers have been proposed. For example, Bay and Schwabacher (2003) proposed a simple algorithm based on nested loops, which gives near-linear-time performance when the data is in random order and a simple pruning rule is used; they demonstrated that their algorithm scales to real data sets with millions of examples and many features. Therefore, by combining the distance metrics defined above with these fast algorithms, we can efficiently find distance-based outliers in rough set theory.
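The worked example can be reproduced mechanically. Below is a sketch of VDM_R (Definition 4.2) applied to the decision table of Table 1, using exact rational arithmetic to match the fractions above:

```python
from fractions import Fraction

# Decision table DT of Table 1: condition attributes a-d, decision attribute e.
DT = {
    "u1": {"a": 1, "b": 0, "c": 2, "d": 1, "e": 0},
    "u2": {"a": 0, "b": 0, "c": 1, "d": 2, "e": 1},
    "u3": {"a": 2, "b": 0, "c": 2, "d": 2, "e": 0},
    "u4": {"a": 0, "b": 0, "c": 2, "d": 1, "e": 2},
    "u5": {"a": 1, "b": 1, "c": 2, "d": 1, "e": 0},
}
C, D = ["a", "b", "c", "d"], "e"

def vdm_r(x, y):
    """Value difference metric in rough set theory (Definition 4.2)."""
    categories = {row[D] for row in DT.values()}  # decision categories
    total = Fraction(0)
    for a in C:
        cx = [u for u in DT if DT[u][a] == DT[x][a]]  # [x]_{a}
        cy = [u for u in DT if DT[u][a] == DT[y][a]]  # [y]_{a}
        for cat in categories:  # d_a(x_a, y_a), Eq. (8)
            px = Fraction(sum(DT[u][D] == cat for u in cx), len(cx))
            py = Fraction(sum(DT[u][D] == cat for u in cy), len(cy))
            total += (px - py) ** 2
    return total

dist = vdm_r("u1", "u2")
print(dist, float(dist))  # 253/72, i.e. 3/2 + 0 + 13/8 + 7/18 ≈ 3.51
```

The result agrees exactly with Steps 1-5 of Example 4.1.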
5. Experimental results

In Section 3, we proposed a sequence-based outlier detection algorithm, and in Section 4, we introduced traditional distance-based outlier detection to rough sets. In this section, following the experimental setup in He, Deng, and Xu (2005), we use two real-life data sets (lymphography and cancer) to compare the performance of the sequence-based outlier detection algorithm against the traditional distance-based method and the KNN algorithm (Ramaswamy et al., 2000). In addition, on the cancer data set, we add the results of the RNN-based outlier detection method for comparison; these results can be found in Harkins, He, Willams, and Baxter (2002) and Willams, Baxter, He, Harkins, and Gu (2002). In our experiments, the results for the KNN algorithm were obtained by using the 5th nearest neighbor. Since in traditional distance-based outlier detection being an outlier is regarded as a binary property, we revise the definition of distance-based outlier detection by introducing a distance outlier factor (DOF) to indicate the degree of outlierness of every object in an information system.

Definition 4.1 (Distance outlier factor). Given an information system IS = (U, A, V, f). For any object x ∈ U, the percentage of the objects in U which lie at a distance greater than d from x is called the distance outlier factor of x in IS, denoted by
\[
DOF(x) = \frac{|\{y \in U : dist(x, y) > d\}|}{|U|} \tag{9}
\]
where dist(x, y) denotes the distance between objects x and y under a given distance metric in rough set theory (in our experiments, the overlap metric in rough set theory is adopted), d is a given parameter (in our experiments, we set d = |A|/2), and |U| denotes the cardinality of the set U.

5.1. Lymphography data

The first data set is the lymphography data set, which can be found in the UCI machine learning repository (Bay, 1999). It contains 148 instances (or objects) with 19 attributes (including the class attribute). The 148 instances are partitioned into 4 classes: "normal find" (1.35%), "metastases" (54.73%), "malign lymph" (41.22%) and "fibrosis" (2.7%). Classes 1 and 4 ("normal find" and "fibrosis") are regarded as rare classes. Aggarwal et al. proposed a practicable way to test the effectiveness of an outlier detection method (Aggarwal & Yu, 2001; He et al., 2005): run the outlier detection method on a given data set and test the percentage of detected points that belong to one of the rare classes. Aggarwal and Yu (2001) considered class labels that occur in less than 5% of the data set to be rare labels. Points belonging to a rare class are considered outliers. If the method works well, we expect such abnormal classes to be over-represented in the set of points found. In our experiment, the data in the lymphography data set are input into an information system IS_L = (U, A, V, f), where U contains all 148 instances of the lymphography data set and A contains its 18 attributes (not including the class attribute). We consider detecting outliers (rare classes) in IS_L. The experimental results are summarized in Table 2, where "SEQ", "DIS" and "KNN" denote the sequence-based, traditional distance-based and KNN outlier detection methods, respectively. For every object in U, the degree of outlierness is calculated by each of the three outlier detection methods. For each method, the "Top ratio (number of objects)" denotes the percentage (number) of objects selected from U whose degrees of outlierness, as calculated by the method, are higher than those of the other objects in U. If we use X ⊆ U to contain all the objects thus selected, then the "Number of rare classes included" is the number of objects in X that belong to one of the rare classes. The "Coverage" is the ratio of the "Number of rare classes included" to the number of objects in U that belong to one of the rare classes (He et al., 2005). From Table 2, we can see that for the lymphography data set, the sequence-based and distance-based methods perform markedly better than the KNN method. The performances of the sequence-based and distance-based methods are very close, though the former performs slightly worse than the latter.

5.2. Wisconsin breast cancer data

The Wisconsin breast cancer data set is also found in the UCI machine learning repository (Bay, 1999). The data set contains 699 instances with 9 continuous attributes.
Here, we follow the experimental technique of Harkins et al. by removing some of the malignant instances to form a very unbalanced distribution (Harkins et al., 2002; He et al., 2005; Willams et al., 2002). The resultant data set has 39 (8%) malignant instances and 444 (92%) benign instances. Moreover, the 9 continuous attributes in the data set are transformed into categorical attributes1 (He et al., 2005). The data in the Wisconsin breast cancer data set are likewise input into an information system IS_W = (U', A', V', f'), where U' contains all 483 instances of the data set and A' contains its 9 categorical attributes (not including the class attribute). We consider detecting outliers (malignant instances) in IS_W. The experimental results are summarized in Table 3, which is organized like Table 2. From Table 3, we can see that for the Wisconsin breast cancer data set, the performances of the sequence-based, traditional distance-based and RNN-based methods are very close, and all are markedly better than the KNN algorithm. Among the former three methods, the RNN-based method performs best, followed by the sequence-based method, with the distance-based method the worst.

6. Discussion

There are two broad approaches to outlier detection: parametric (e.g., distribution-based) and non-parametric (e.g., distance-based). The parametric or statistical approach to outlier detection assumes a distribution or probability model for the data set and then identifies outliers with respect to the model using a discordancy
1 The resultant data set is publicly available at: http://research.cmis.csiro.au/rohanb/outliers/breast-cancer/.
Table 2
Experimental results in IS_L

Top ratio (number of objects)   Number of rare classes included (coverage)
                                SEQ        DIS        KNN
5% (7)                          5 (83%)    5 (83%)    4 (67%)
6% (9)                          5 (83%)    6 (100%)   4 (67%)
8% (12)                         6 (100%)   6 (100%)   5 (83%)
10% (15)                        6 (100%)   6 (100%)   6 (100%)
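The "Top ratio" and "Coverage" bookkeeping behind Tables 2 and 3 can be sketched in a few lines; the function name `coverage` and the toy scores below are ours, for illustration only.

```python
def coverage(scores, rare, top_n):
    """Rank objects by decreasing outlierness and report how many of
    the top_n selected objects belong to a rare class, together with
    the coverage ratio (rare objects found / total rare objects)."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    hit = sum(1 for obj in ranked[:top_n] if obj in rare)
    return hit, hit / len(rare)

# Toy example: 6 objects, 2 of which belong to rare classes
scores = {"a": 0.9, "b": 0.1, "c": 0.8, "d": 0.2, "e": 0.3, "f": 0.4}
print(coverage(scores, rare={"a", "c"}, top_n=2))  # → (2, 1.0)
```

Here the two highest-scoring objects are exactly the two rare ones, so the top-2 selection already achieves full coverage.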
Table 3
Experimental results in IS_W

Top ratio (number of objects)   Number of rare classes included (coverage)
                                SEQ        DIS        KNN        RNN
1% (4)                          3 (8%)     4 (10%)    3 (8%)     4 (10%)
2% (8)                          7 (18%)    5 (13%)    6 (15%)    8 (21%)
4% (16)                         14 (36%)   11 (28%)   11 (28%)   16 (41%)
6% (24)                         21 (54%)   18 (46%)   18 (46%)   20 (51%)
8% (32)                         28 (72%)   24 (62%)   25 (64%)   27 (69%)
10% (40)                        32 (82%)   29 (74%)   30 (77%)   32 (82%)
12% (48)                        35 (90%)   36 (92%)   35 (90%)   37 (95%)
14% (56)                        39 (100%)  39 (100%)  36 (92%)   39 (100%)
16% (64)                        39 (100%)  39 (100%)  36 (92%)   39 (100%)
18% (72)                        39 (100%)  39 (100%)  38 (97%)   39 (100%)
20% (80)                        39 (100%)  39 (100%)  38 (97%)   39 (100%)
28% (112)                       39 (100%)  39 (100%)  39 (100%)  39 (100%)
test. A major drawback of the parametric approach is that most tests are for single attributes, yet many data mining problems require finding outliers in multidimensional spaces. Moreover, the parametric approach requires knowledge about parameters of the data set, such as the data distribution; in many cases, however, the data distribution is not known. The representative non-parametric approach is distance-based outlier detection, which was introduced to counter the main limitations imposed by the parametric approach. In comparison with the statistical approach, distance-based outlier detection avoids the excessive computation associated with fitting an observed distribution to some standard distribution and with selecting discordancy tests (Han & Kamber, 2000). Distance-based outlier detection relies on a distance metric (function) to measure the similarity between any two objects. As we have noted before, most of the distance metrics that have been proposed are only defined for linear attributes. However, almost all real-world data sets contain a mixture of nominal and linear attributes, and the nominal attributes are typically ignored or incorrectly modeled by existing approaches. In addition, it is well known that, in practice, no particular metric has clear advantages over another. Therefore, it is often very difficult to determine an appropriate distance metric for many practical distance-based outlier detection tasks, and in some situations no appropriate distance metric can be defined for a particular data set at all; distance-based outlier detection obviously cannot be applied in those situations. Unlike distance-based outlier detection, our sequence-based outlier detection does not require a metric distance function. On the other hand, like distance-based outlier detection, sequence-based outlier detection is a non-parametric approach, and can thus cope with the main limitations imposed by the parametric approach. So our approach can efficiently avoid the disadvantages of most current approaches to outlier detection.

7. Conclusion and future work

Outlier detection is becoming critically important in many areas. In this paper, we extended outlier detection to Pawlak's rough set theory, which has become a popular method for KDD, largely due to its ability to handle uncertain and/or incomplete data. First, we proposed a new sequence-based outlier detection method in information systems of rough set theory. Experimental results on real data sets demonstrate the effectiveness of our method for outlier detection. Next, we introduced Knorr and Ng's distance-based outlier detection to rough set theory. In order to calculate the distance between any two objects in an information system, we gave two revised definitions of the traditional distance metrics for nominal attributes in rough set theory. In future work, for the sequence-based outlier detection algorithm, we shall consider using reducts to reduce the number of attributes while preserving its performance. For distance-based outlier detection, since the proposed measures are all global ones, we shall consider using more local versions to obtain better performance.

Acknowledgements

This work is supported by the Natural Science Foundation (Grant Nos. 60641010, 60496326, 60573063 and 60573064), the National 863 Programme (Grant No. 2007AA01Z325), and the National 973 Programme (Grant Nos. 2003CB317008 and G1999032701).

References
Aggarwal, C. C., & Yu, P. S. (2001). Outlier detection for high dimensional data. In Proceedings of the 2001 ACM SIGMOD international conference on management of data, California (pp. 37–46).
Arning, A., Agrawal, R., & Raghavan, P. (1996). A linear method for deviation detection in large databases. In Proceedings of the 2nd international conference on KDD, Oregon (pp. 164–169).
Bay, S. D. (1999). The UCI KDD repository.
Bay, S. D., & Schwabacher, M. (2003). Mining distance-based outliers in near linear time with randomization and a simple pruning rule. In Proceedings of the 9th ACM SIGKDD international conference on KDD, Washington (pp. 29–38).
Bazan, J., Osmolski, A., Skowron, A., Slezak, D., Szczuka, M., & Wroblewski, J. (2002). Rough set approach to survival analysis. In Proceedings of the 3rd international conference on rough sets and current trends in computing (RSCTC2002), Malvern, PA, USA (pp. 522–529).
Bolton, R. J., & Hand, D. J. (2002). Statistical fraud detection: A review (with discussion). Statistical Science, 17(3), 235–255.
Breunig, M. M., Kriegel, H.-P., Ng, R. T., & Sander, J. (2000). LOF: Identifying density-based local outliers. In Proceedings of the 2000 ACM SIGMOD international conference on management of data, Dallas (pp. 93–104).
Cheng, V., Li, C. H., Kwok, J., & Li, C. K. (2004). Dissimilarity learning for nominal data. Pattern Recognition, 37(7), 1471–1477.
Chiu, A. L., & Fu, A. W. (2003). Enhancements on local outlier detection. In Proceedings of the 7th international database engineering and applications symposium (IDEAS'03), Hong Kong (pp. 298–307).
Eskin, E., Arnold, A., Prerau, M., Portnoy, L., & Stolfo, S. (2002). A geometric framework for unsupervised anomaly detection: Detecting intrusions in unlabeled data. In D. Barbará et al. (Eds.), Data mining for security applications. Boston: Kluwer Academic Publishers.
Gower, J. C., & Legendre, P. (1986). Metric and Euclidean properties of dissimilarity coefficients. Journal of Classification, 3, 5–48.
Han, J. W., & Kamber, M. (2000). Data mining: Concepts and techniques. California: Morgan Kaufmann Publishers.
Harkins, S., He, H. X., Willams, G. J., & Baxter, R. A. (2002). Outlier detection using replicator neural networks. In Proceedings of the 4th international conference on data warehousing and knowledge discovery, France (pp. 170–180).
Hawkins, D. (1980). Identifications of outliers. London: Chapman and Hall.
He, Z. Y., Deng, S. C., & Xu, X. F. (2005). An optimization model for outlier detection in categorical data. In Advances in intelligent computing, international conference on intelligent computing, ICIC(1) 2005, Hefei, China (pp. 400–409).
Jain, A. K., Murty, M. N., & Flynn, P. J. (1999). Data clustering: A review. ACM Computing Surveys, 31(3), 264–323.
Johnson, T., Kwok, I., & Ng, R. T. (1998). Fast computation of 2-dimensional depth contours. In Proceedings of the 4th international conference on knowledge discovery and data mining, New York (pp. 224–228).
Knorr, E., & Ng, R. (1998). Algorithms for mining distance-based outliers in large datasets. In Proceedings of the 24th VLDB conference, New York (pp. 392–403).
Knorr, E., Ng, R., & Tucakov, V. (2000). Distance-based outliers: Algorithms and applications. VLDB Journal: Very Large Databases, 8(3–4), 237–253.
Kovács, L., Vass, D., & Vidács, A. (2004). Improving quality of service parameter prediction with preliminary outlier detection and elimination. In Proceedings of the 2nd international workshop on inter-domain performance and simulation (IPS 2004), Budapest (pp. 194–199).
Lane, T., & Brodley, C. E. (1999). Temporal sequence learning and data reduction for anomaly detection. ACM Transactions on Information and System Security, 2(3), 295–331.
Li, M., Chen, X., Li, X., Ma, B., & Vitányi, P. M. B. (2003). The similarity metric. In Proceedings of the 14th annual ACM-SIAM symposium on discrete algorithms, Maryland (pp. 863–872).
Nguyen, S. H., & Nguyen, H. S. (1996). Some efficient algorithms for rough set methods. In Proceedings of the 6th international conference on information processing and management of uncertainty (IPMU'96), Granada, Spain (pp. 1451–1456).
Pawlak, Z. (1982). Rough sets. International Journal of Computer and Information Sciences, 11, 341–356.
Pawlak, Z. (1991). Rough sets: Theoretical aspects of reasoning about data. Dordrecht: Kluwer Academic Publishers.
Pawlak, Z., Grzymala-Busse, J. W., Slowinski, R., & Ziarko, W. (1995). Rough sets. Communications of the ACM, 38(11), 89–95.
Ramaswamy, S., Rastogi, R., & Shim, K. (2000). Efficient algorithms for mining outliers from large datasets. In Proceedings of the 2000 ACM SIGMOD international conference on management of data, Dallas (pp. 427–438).
Rousseeuw, P. J., & Leroy, A. M. (1987). Robust regression and outlier detection. New York: John Wiley and Sons.
Rulequest Research. GritBot.
Skowron, A., & Rauszer, C. (1992). The discernibility matrices and functions in information systems. In R. Slowinski (Ed.), Intelligent decision support – Handbook of applications and advances of the rough sets theory (pp. 331–362). Dordrecht: Kluwer.
Skowron, A., & Synak, P. (2004). Reasoning in information maps. Fundamenta Informaticae, 59, 241–259.
Stanfill, C., & Waltz, D. (1986). Towards memory-based reasoning. Communications of the ACM, 29(12), 1213–1228.
Willams, G. J., Baxter, R. A., He, H. X., Harkins, S., & Gu, L. F. (2002). A comparative study of RNN for outlier detection in data mining. In ICDM, Japan (pp. 709–712).
Wilson, D. R., & Martinez, T. R. (1997). Improved heterogeneous distance functions. Journal of Artificial Intelligence Research, 6, 1–34.
Zhong, N., Yao, Y. Y., Ohshima, M., & Ohsuga, S. (2001). Interestingness, peculiarity, and multi-database mining. In Proceedings of the 2001 IEEE international conference on data mining (IEEE ICDM'01) (pp. 566–573). IEEE Computer Society Press.