International Journal of Approximate Reasoning 55 (2014) 180–196
Generalized probabilistic approximations of incomplete data

Jerzy W. Grzymala-Busse a,b, Patrick G. Clark a, Martin Kuehnhausen a

a Department of Electrical Engineering and Computer Science, University of Kansas, Lawrence, KS 66045-7621, USA
b Institute of Computer Science, Polish Academy of Sciences, 01-237 Warsaw, Poland
Article history: Available online 16 April 2013

Keywords: Probabilistic approximations; Parameterized approximations; Generalization of probabilistic approximations; Singleton, subset and concept probabilistic approximations; Local probabilistic approximations

Abstract

In this paper we discuss a generalization of the idea of probabilistic approximations. Probabilistic (or parameterized) approximations, studied mostly in variable precision rough set theory, were originally defined using equivalence relations. Recently, probabilistic approximations were defined for arbitrary binary relations. Such approximations have an immediate application to data mining from incomplete data, because incomplete data sets are characterized by a characteristic relation, which is reflexive but not necessarily symmetric or transitive. In contrast, complete data sets are described by an indiscernibility relation, which is an equivalence relation. The main objective of this paper is to compare experimentally, for the first time, two generalizations of probabilistic approximations: global and local. Additionally, we explore the problem of how many distinct probabilistic approximations may be defined for a given data set.

© 2013 Elsevier Inc. All rights reserved.
1. Introduction

One of the central ideas of rough set theory [1,2] is the notion of the approximation. Complete data sets are described by an indiscernibility relation R, an equivalence relation. The equivalence classes of R are called elementary sets. A set is called definable if it is a union of elementary sets of R. For a set X that is not definable, two definable sets are constructed as follows: a lower approximation of X is the union of all elementary sets of R contained in X, and an upper approximation is the union of all elementary sets Y of R that are relevant to X, i.e., for which X ∩ Y ≠ ∅.

The idea of the probabilistic (or parameterized) approximation, associated with an additional parameter α, is a generalization of the ordinary approximations. Originally, a probabilistic approximation of the set X was still a union of equivalence classes of R, namely of those equivalence classes for which the conditional probability of X given the equivalence class was larger than or equal to α. Theoretical properties of such probabilistic approximations were reported in many papers, in areas such as variable precision rough sets, Bayesian rough sets, decision-theoretic rough sets, game-theoretic rough sets, etc. [3–20]. If α = 1, the corresponding probabilistic approximation of X becomes the lower approximation of X; if α is small enough, e.g., 0.001, the probabilistic approximation of X is identical with the upper approximation of X. So far, probabilistic approximations have usually been defined as lower and upper approximations. However, the only difference between the so-called lower and upper probabilistic approximations is in the choice of the value of the parameter α [21]: for lower probabilistic approximations α was closer to 1, for upper probabilistic approximations α was closer to 0.

Probabilistic approximations were generalized to arbitrary binary relations in [22]. It is well known that for incomplete data sets the indiscernibility relation, used for complete data sets, is replaced by a characteristic relation [23–25], which is reflexive but not necessarily symmetric or transitive. Such probabilistic approximations are called global. Some preliminary
experimental results, evaluating probabilistic approximations in terms of the error rate for benchmark data sets, were reported in [26]. Results of some experiments testing the usefulness of such approximations were reported in [27–30].

For non-parameterized approximations, global and local approximations were introduced in [31]. The characteristic sets used for constructing global approximations are intersections of more elementary sets called attribute-value blocks, defined directly from the data sets as the sets of all cases with the same attribute value. Local approximations are composed as unions of all possible intersections of attribute-value blocks for distinct attributes. Such intersections are called complexes of X. Obviously, local approximations approximate a set X better than global approximations, in the sense that the local lower approximation of X is a superset of the global lower approximation of X and the local upper approximation of X is a subset of the global upper approximation of X. However, the high quality of local approximations comes with a price: an exponential complexity of computing local approximations. We may use two approaches to generalize the local approximation to a probabilistic one. The first approach is to restrict the computation of complexes so that the resulting complex intersections are limited to a maximum number of attribute-value blocks. The second approach is to compute a local probabilistic approximation heuristically, as a side effect of rule induction with the heuristic rule induction algorithm MLEM2.

The main objective of this paper is to compare the quality of both approaches to local probabilistic approximations with the global probabilistic approximations by running experiments on eight benchmark data sets. All of these data sets were incomplete. In our experiments we used two interpretations of missing attribute values: lost values and "do not care" conditions [23–25]. Thus, in our experiments we used 16 data sets: eight data sets with lost values and eight data sets with "do not care" conditions. A missing attribute value is lost (denoted by ?) when it is currently unavailable. In rule induction from data sets with lost values, rules are induced only from existing attribute values. On the other hand, "do not care" conditions (denoted by ∗) correspond to a refusal to answer a question. For example, suppose data are collected on patients sick with flu, and one of the attributes is color of eyes, with three values: blue, brown or other. Some patients may feel that this attribute is irrelevant and refuse to answer. With the "do not care" condition interpretation we replace the missing attribute value by all three possibilities: blue, brown, and other. For some recent papers on missing attribute values see [32–34].
2. Characteristic relation

We assume that the input data sets are presented in the form of a decision table. An example of a decision table is shown in Table 1 (a similar example was presented in [31]). Rows of the decision table represent cases, while columns are labeled by variables. The set of all cases will be denoted by U. In Table 1, U = {1, 2, 3, 4, 5, 6, 7, 8}. Some variables are called attributes, while one selected variable is called a decision and is denoted by d. The set of all attributes will be denoted by A. In Table 1, A = {Temperature, Headache, Cough} and d = Flu. The value for a case x and an attribute a will be denoted by a(x).

In this paper we distinguish between two interpretations of missing attribute values: lost values (the original attribute values are no longer accessible, for details see [35,36]), denoted by "?", and "do not care" conditions (the original values were irrelevant, see [37,38]), denoted by "∗". Table 1 presents an incomplete data set affected by both lost values and "do not care" conditions.

One of the most important ideas of rough set theory [1] is an indiscernibility relation, defined for complete data sets. Let B be a nonempty subset of A. The indiscernibility relation R(B) is a relation on U defined for x, y ∈ U as follows:
(x, y) ∈ R(B) if and only if ∀ a ∈ B (a(x) = a(y)).

The indiscernibility relation R(B) is an equivalence relation. Equivalence classes of R(B) are called B-elementary sets and are denoted by [x]_B. A subset of U is called B-definable if it is a union of B-elementary sets. The set X of all cases defined by the same value of the decision d is called a concept. For example, the concept associated with the value yes of the decision Flu is the set {1, 2, 3, 4, 5, 6}. The largest B-definable set contained in X is called the B-lower approximation of X, denoted by appr_B(X), and defined as follows:
∪ {[x]_B | [x]_B ⊆ X},

Table 1. An incomplete decision table.

Case | Temperature | Headache | Cough | Flu
1 | high | ? | no | yes
2 | normal | no | yes | yes
3 | ? | yes | no | yes
4 | high | no | yes | yes
5 | high | ? | yes | yes
6 | very-high | no | no | yes
7 | ∗ | no | ∗ | no
8 | normal | yes | no | no
while the smallest B-definable set containing X, denoted by appr̄_B(X), is called the B-upper approximation of X and is defined as follows:

∪ {[x]_B | [x]_B ∩ X ≠ ∅}.

Let (U, R) be an approximation space, where R is an equivalence relation on U. A probabilistic approximation of the set X with the threshold α, 0 < α ≤ 1, is denoted by appr_α(X) and defined as follows:

∪ {[x] | x ∈ U, Pr(X | [x]) ≥ α},

where [x] is an elementary set of R and Pr(X | [x]) = |X ∩ [x]| / |[x]| is the conditional probability of X given [x].
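As an illustration (ours, not code from the paper; the function and variable names below are our own), this definition can be computed directly from the partition of U into elementary sets:

def probabilistic_approximation(partition, X, alpha):
    # Union of the elementary sets [x] with Pr(X | [x]) >= alpha.
    # partition: disjoint sets (the equivalence classes of R) covering U.
    result = set()
    for block in partition:
        if len(X & block) / len(block) >= alpha:   # Pr(X | [x])
            result |= block
    return result

# With alpha = 1 this yields the lower approximation; with a very small
# alpha, e.g., 0.001, it coincides with the upper approximation.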
The number of distinct probabilistic approximations of the concept X is smaller than or equal to the number n of distinct parameters α from the definition of a probabilistic approximation. The number n is equal to the number of distinct positive conditional probabilities Pr(X | [x]), where x ∈ U. Additionally, the number n is smaller than or equal to the number m of elementary sets [x] of R. Finally, m ≤ |U|. Thus the number of distinct probabilistic approximations of a given concept is smaller than or equal to the cardinality of U.

An important tool for analyzing data sets is the block of an attribute-value pair. Let (a, v) be an attribute-value pair. For complete decision tables, i.e., decision tables in which every attribute value is specified, the block of (a, v), denoted by [(a, v)], is the set of all cases x for which a(x) = v. For incomplete decision tables the definition of a block of an attribute-value pair is modified:
• If for an attribute a there exists a case x such that a(x) = ?, i.e., the corresponding value is lost, then the case x is not included in any block [(a, v)], for all values v of attribute a.
• If for an attribute a there exists a case x such that the corresponding value is a "do not care" condition, i.e., a(x) = ∗, then the case x is included in the blocks [(a, v)] for all specified values v of attribute a.

A special block of a decision-value pair is called a concept. In Table 1, [(Flu, yes)] = {1, 2, 3, 4, 5, 6}. Additionally, for Table 1:

[(Temperature, normal)] = {2, 7, 8},
[(Temperature, high)] = {1, 4, 5, 7},
[(Temperature, very-high)] = {6, 7},
[(Headache, no)] = {2, 4, 6, 7},
[(Headache, yes)] = {3, 8},
[(Cough, no)] = {1, 3, 6, 7, 8},
[(Cough, yes)] = {2, 4, 5, 7}.

For a case x ∈ U, the characteristic set K_B(x) is defined as the intersection of the sets K(x, a) for all a ∈ B, where the set K(x, a) is defined in the following way:

• If a(x) is specified, then K(x, a) is the block [(a, a(x))] of attribute a and its value a(x).
• If a(x) = ? or a(x) = ∗, then K(x, a) = U.

The characteristic set K_B(x) may be interpreted as the set of cases that are indistinguishable from x using all attributes from B and using a given interpretation of missing attribute values. Thus, K_A(x) is the set of all cases that cannot be distinguished from x using all attributes. For Table 1 and B = A,

K_A(1) = {1, 4, 5, 7} ∩ U ∩ {1, 3, 6, 7, 8} = {1, 7},
K_A(2) = {2, 7, 8} ∩ {2, 4, 6, 7} ∩ {2, 4, 5, 7} = {2, 7},
K_A(3) = U ∩ {3, 8} ∩ {1, 3, 6, 7, 8} = {3, 8},
K_A(4) = {1, 4, 5, 7} ∩ {2, 4, 6, 7} ∩ {2, 4, 5, 7} = {4, 7},
K_A(5) = {1, 4, 5, 7} ∩ U ∩ {2, 4, 5, 7} = {4, 5, 7},
K_A(6) = {6, 7} ∩ {2, 4, 6, 7} ∩ {1, 3, 6, 7, 8} = {6, 7},
K_A(7) = U ∩ {2, 4, 6, 7} ∩ U = {2, 4, 6, 7}, and
K_A(8) = {2, 7, 8} ∩ {3, 8} ∩ {1, 3, 6, 7, 8} = {8}.
The characteristic relation R(B) is a relation on U defined for x, y ∈ U as follows:

(x, y) ∈ R(B) if and only if y ∈ K_B(x).

The characteristic relation R(B) is reflexive but, in general, does not need to be symmetric or transitive. Also, the characteristic relation R(B) is known if we know the characteristic sets K_B(x) for all x ∈ U. In our example,

R(A) = {(1, 1), (1, 7), (2, 2), (2, 7), (3, 3), (3, 8), (4, 4), (4, 7), (5, 4), (5, 5), (5, 7), (6, 6), (6, 7), (7, 2), (7, 4), (7, 6), (7, 7), (8, 8)}.

The most convenient way to define the characteristic relation is through the characteristic sets.
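The computation of blocks and characteristic sets is easy to mechanize. The following Python sketch (our illustration; the encoding of Table 1 and all names are ours) reproduces the characteristic sets K_A(x) listed above:

# Table 1 encoded as a dictionary; "?" = lost value, "*" = "do not care".
TABLE = {
    1: {"Temperature": "high", "Headache": "?", "Cough": "no"},
    2: {"Temperature": "normal", "Headache": "no", "Cough": "yes"},
    3: {"Temperature": "?", "Headache": "yes", "Cough": "no"},
    4: {"Temperature": "high", "Headache": "no", "Cough": "yes"},
    5: {"Temperature": "high", "Headache": "?", "Cough": "yes"},
    6: {"Temperature": "very-high", "Headache": "no", "Cough": "no"},
    7: {"Temperature": "*", "Headache": "no", "Cough": "*"},
    8: {"Temperature": "normal", "Headache": "yes", "Cough": "no"},
}

def block(attribute, value):
    # Attribute-value block [(a, v)]: "?" is excluded from every block,
    # "*" is included in the blocks of all specified values.
    return {x for x, row in TABLE.items()
            if row[attribute] == value or row[attribute] == "*"}

def characteristic_set(x, attributes):
    # K_B(x): intersection of K(x, a) over a in B.
    K = set(TABLE)                       # start from U
    for a in attributes:
        v = TABLE[x][a]
        if v not in ("?", "*"):          # lost and "do not care" give K(x, a) = U
            K &= block(a, v)
    return K

for x in TABLE:
    print(x, sorted(characteristic_set(x, ["Temperature", "Headache", "Cough"])))
# e.g., K_A(1) = {1, 7}, K_A(5) = {4, 5, 7}, K_A(8) = {8}, matching the text.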
3. Global probabilistic approximations

In this section we first study approximations defined on the approximation space A = (U, R), where U is a finite nonempty set and R is an arbitrary binary relation. Then we extend the corresponding definitions to generalized parameterized approximations.
3.1. Nonparameterized approximations

First we will quote some definitions from [39]. Let x be a member of U. The R-successor set of x, denoted by R_s(x), is defined as follows:

R_s(x) = {y | x R y}.

For a characteristic relation R(A) and for x ∈ U,

R_s(x) = K_A(x).
Let X be a subset of U. The R-singleton lower approximation of X, denoted by appr^singleton(X), is defined as follows:

{x | x ∈ U, R_s(x) ⊆ X}.

The singleton lower approximations were studied in many papers, see, e.g., [23,24,38,40–44,36,45]. The R-singleton upper approximation of X, denoted by appr̄^singleton(X), is defined as follows:

{x | x ∈ U, R_s(x) ∩ X ≠ ∅}.

The singleton upper approximations, like the singleton lower approximations, were also studied in many papers, e.g., [23,24,38,40,43,44,36,45]. For Table 1, the singleton approximations of both concepts are:

appr^singleton({1, 2, 3, 4, 5, 6}) = ∅,
appr^singleton({7, 8}) = {8},
appr̄^singleton({1, 2, 3, 4, 5, 6}) = {1, 2, 3, 4, 5, 6, 7},
appr̄^singleton({7, 8}) = U.
The R-subset lower approximation of X, denoted by appr^subset(X), is defined as follows:

∪ {R_s(x) | x ∈ U, R_s(x) ⊆ X}.

The subset lower approximations were introduced in [23,24]. The R-subset upper approximation of X, denoted by appr̄^subset(X), is defined as follows:

∪ {R_s(x) | x ∈ U, R_s(x) ∩ X ≠ ∅}.

The subset upper approximations were introduced in [23,24]. For Table 1, the subset approximations of both concepts are:

appr^subset({1, 2, 3, 4, 5, 6}) = ∅,
appr^subset({7, 8}) = {8},
appr̄^subset({1, 2, 3, 4, 5, 6}) = U,
appr̄^subset({7, 8}) = U.
The R-concept lower approximation of X, denoted by appr^concept(X), is defined as follows:

∪ {R_s(x) | x ∈ X, R_s(x) ⊆ X}.

The concept lower approximations were introduced in [23,24]. The R-concept upper approximation of X, denoted by appr̄^concept(X), is defined as follows:

∪ {R_s(x) | x ∈ X, R_s(x) ∩ X ≠ ∅}.

The concept upper approximations were studied in [23,24,42]. For Table 1, the concept approximations of both concepts are:

appr^concept({1, 2, 3, 4, 5, 6}) = ∅,
appr^concept({7, 8}) = {8},
appr̄^concept({1, 2, 3, 4, 5, 6}) = U,
appr̄^concept({7, 8}) = {2, 4, 6, 7, 8}.
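A short Python sketch of these six definitions (ours; it assumes the R-successor sets are given as a dictionary, here the characteristic sets K_A(x) of Table 1 from Section 2):

def singleton_lower(Rs, X):
    return {x for x in Rs if Rs[x] <= X}

def singleton_upper(Rs, X):
    return {x for x in Rs if Rs[x] & X}

def subset_lower(Rs, X):
    return set().union(*(Rs[x] for x in Rs if Rs[x] <= X))

def subset_upper(Rs, X):
    return set().union(*(Rs[x] for x in Rs if Rs[x] & X))

def concept_lower(Rs, X):
    return set().union(*(Rs[x] for x in X if Rs[x] <= X))

def concept_upper(Rs, X):
    return set().union(*(Rs[x] for x in X if Rs[x] & X))

# Characteristic sets K_A(x) of Table 1, from Section 2:
Rs = {1: {1, 7}, 2: {2, 7}, 3: {3, 8}, 4: {4, 7}, 5: {4, 5, 7},
      6: {6, 7}, 7: {2, 4, 6, 7}, 8: {8}}
print(sorted(concept_upper(Rs, {7, 8})))   # [2, 4, 6, 7, 8], as above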
3.2. Probabilistic approximations

Three kinds of probabilistic approximations for arbitrary binary relations, singleton, subset and concept, were introduced in [26]. The singleton probabilistic approximation of X with the threshold α, 0 < α ≤ 1, denoted by appr_α^singleton(X), is defined as follows:

{x | x ∈ U, Pr(X | R_s(x)) ≥ α},

where Pr(X | R_s(x)) = |X ∩ R_s(x)| / |R_s(x)| is the conditional probability of X given R_s(x).
A subset probabilistic approximation of the set X with the threshold α, 0 < α ≤ 1, denoted by appr_α^subset(X), is defined as follows:

∪ {R_s(x) | x ∈ U, Pr(X | R_s(x)) ≥ α},

where Pr(X | R_s(x)) = |X ∩ R_s(x)| / |R_s(x)| is the conditional probability of X given R_s(x).
A concept probabilistic approximation of the set X with the threshold α, 0 < α ≤ 1, denoted by appr_α^concept(X), is defined as follows:

∪ {R_s(x) | x ∈ X, Pr(X | R_s(x)) ≥ α}.

The number of distinct R-successor sets R_s(x), where x ∈ U, is obviously not greater than n, where n is the cardinality of U. Therefore, for a given concept X, there are at most n distinct conditional probabilities Pr(X | R_s(x)). Thus, the number of different probabilistic approximations of a given type (singleton, subset or concept) is also not greater than n. Obviously, for the concept X, the probabilistic approximation of a given type of X computed for the threshold equal to the smallest positive conditional probability Pr(X | R_s(x)) is equal to the standard upper approximation of X of the same type. Additionally, the probabilistic approximation of a given type of X computed for the threshold equal to 1 is equal to the standard lower approximation of X of the same type.

For Table 1, all distinct probabilistic approximations (singleton, subset and concept) of the concept [(Flu, yes)] are:

appr_0.5^singleton({1, 2, 3, 4, 5, 6}) = {1, 2, 3, 4, 5, 6, 7},
appr_0.667^singleton({1, 2, 3, 4, 5, 6}) = {5, 7},
appr_0.75^singleton({1, 2, 3, 4, 5, 6}) = {7},
appr_1^singleton({1, 2, 3, 4, 5, 6}) = ∅,

appr_0.5^subset({1, 2, 3, 4, 5, 6}) = U,
appr_0.667^subset({1, 2, 3, 4, 5, 6}) = {2, 4, 5, 6, 7},
appr_0.75^subset({1, 2, 3, 4, 5, 6}) = {2, 4, 6, 7},
appr_1^subset({1, 2, 3, 4, 5, 6}) = ∅,

appr_0.5^concept({1, 2, 3, 4, 5, 6}) = U,
appr_0.667^concept({1, 2, 3, 4, 5, 6}) = {4, 5, 7},
appr_0.75^concept({1, 2, 3, 4, 5, 6}) = ∅.
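The probabilistic versions differ from the nonparameterized sketch above only in replacing the containment and intersection tests with a threshold on Pr(X | R_s(x)). A minimal Python sketch (ours; Rs is the same dictionary as before):

def pr(X, Y):
    # Conditional probability Pr(X | Y) = |X ∩ Y| / |Y|.
    return len(X & Y) / len(Y)

def singleton_prob(Rs, X, alpha):
    return {x for x in Rs if pr(X, Rs[x]) >= alpha}

def subset_prob(Rs, X, alpha):
    return set().union(*(Rs[x] for x in Rs if pr(X, Rs[x]) >= alpha))

def concept_prob(Rs, X, alpha):
    return set().union(*(Rs[x] for x in X if pr(X, Rs[x]) >= alpha))

Rs = {1: {1, 7}, 2: {2, 7}, 3: {3, 8}, 4: {4, 7}, 5: {4, 5, 7},
      6: {6, 7}, 7: {2, 4, 6, 7}, 8: {8}}
X = {1, 2, 3, 4, 5, 6}                       # the concept [(Flu, yes)]
print(sorted(singleton_prob(Rs, X, 0.667)))  # [5, 7], as above
print(sorted(concept_prob(Rs, X, 0.667)))    # [4, 5, 7], as above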
4. Definability

For incomplete data sets, a set X will be called B-globally definable if it is a union of some characteristic sets K_B(x), x ∈ U. A set that is A-globally definable will be called globally definable.

A set T of attribute-value pairs, where all attributes are distinct and belong to the set B, will be called a B-complex. Any A-complex will be called, for simplicity, a complex. Obviously, any set containing a single attribute-value pair is a complex. For the rest of the paper we will discuss only nontrivial complexes, i.e., complexes such that the intersection of all attribute-value blocks from the complex is not the empty set. The block of a B-complex T, denoted by [T], is defined as the set

∩ {[t] | t ∈ T}.
For an incomplete decision table and a subset B of A, a union of intersections of attribute-value blocks from some B-complexes will be called a B-locally definable set. A-locally definable sets will be called locally definable. For example, for the data set from Table 1, the set {7} is locally definable ({7} = [(Temperature, very-high)] ∩ [(Cough, yes)]) but it is not globally definable. On the other hand, the set {5, 7} = appr_0.667^singleton({1, 2, 3, 4, 5, 6}) is not even locally definable, since all blocks of attribute-value pairs containing the case 5 contain the case 4 as well. Any set X that is B-globally definable is B-locally definable; the converse is not true.

The importance of the idea of local definability is a consequence of the following fact: a set is locally definable if and only if it can be expressed by rule sets. This is why it is so important to distinguish between locally definable sets and those that are not locally definable. In general, subset and concept probabilistic approximations are globally definable, while singleton probabilistic approximations are not even locally definable. For decision tables in which all missing attribute values are lost, local definability is reduced to global definability, see [31].
5. Local probabilistic approximations

Let X be any subset of the set U of all cases and let B ⊆ A. In general, X is not a B-definable set, locally or globally. The first, most general definition assumes only the existence of a family 𝒯 of B-complexes T with the conditional probability Pr(X | [T]) ≥ α, where Pr(X | [T]) = |X ∩ [T]| / |[T]|.

A B-local probabilistic approximation of the set X with the parameter α, 0 < α ≤ 1, denoted by appr_α^local(X), is defined as follows:

∪ {[T] | ∃ a family 𝒯 of B-complexes T of X such that ∀ T ∈ 𝒯, Pr(X | [T]) ≥ α}.

For example, for the decision table from Table 1, the set A of all three attributes, and the concept X = [(Flu, no)] = {7, 8}, let 𝒯 = {{(Temperature, normal)}}. Then

Pr({7, 8} | [(Temperature, normal)]) = |{7, 8} ∩ {2, 7, 8}| / |{2, 7, 8}| = 0.667,

so [(Temperature, normal)] = {2, 7, 8} is an A-local probabilistic approximation of the set [(Flu, no)] = {7, 8} with the parameter α = 0.667.

In general, for a given set X and α, there exists more than one A-local probabilistic approximation. The B-local probabilistic approximation given by the next definition is, for a given set X and parameter α, unique. A complete B-local probabilistic approximation of the set X with the parameter α, 0 < α ≤ 1, denoted by appr_α^complete(X), is defined as follows:

∪ {[T] | T is a B-complex of X, Pr(X | [T]) ≥ α}.

Complete A-local probabilistic approximations will be called complete local probabilistic approximations. For Table 1, the set of all possible blocks of B-complexes, where B ⊆ A, is the union of the set of all attribute-value blocks and the following set:

{{1, 7}, {2, 4, 7}, {2, 7}, {3, 8}, {4, 5, 7}, {4, 7}, {6, 7}, {7}, {7, 8}, {8}}.

For Table 1, all distinct complete local probabilistic approximations of the concept [(Flu, yes)] are:
appr_0.5^complete({1, 2, 3, 4, 5, 6}) = U,
appr_0.667^complete({1, 2, 3, 4, 5, 6}) = {1, 2, 4, 5, 6, 7},
appr_1^complete({1, 2, 3, 4, 5, 6}) = ∅.

For Table 1, all distinct complete local probabilistic approximations of the concept [(Flu, no)] are:

appr_0.333^complete({7, 8}) = U,
appr_0.5^complete({7, 8}) = {1, 2, 4, 5, 6, 7, 8},
appr_0.667^complete({7, 8}) = {2, 7, 8},
appr_0.75^complete({7, 8}) = {7, 8}.
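These approximations can be computed by brute-force enumeration of all nontrivial complexes up to a given maximum number of attributes (the MAXSIZE parameter introduced below). A Python sketch of ours, reusing TABLE and block() from the Section 2 sketch; the cost is exponential only in the maximum complex size:

from itertools import combinations, product

def complete_local_prob(table, attributes, X, alpha, maxsize):
    # Union of blocks [T] of nontrivial complexes T with Pr(X | [T]) >= alpha,
    # where T uses at most `maxsize` distinct attributes.
    result = set()
    for size in range(1, maxsize + 1):
        for attrs in combinations(attributes, size):
            value_sets = [{row[a] for row in table.values()} - {"?", "*"}
                          for a in attrs]
            for values in product(*value_sets):
                T = set(table)
                for a, v in zip(attrs, values):
                    T &= block(a, v)               # block of the complex
                if T and len(X & T) / len(T) >= alpha:
                    result |= T
    return result

print(sorted(complete_local_prob(TABLE, ["Temperature", "Headache", "Cough"],
                                 {7, 8}, 0.75, 3)))   # [7, 8], as above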
For any concept X and α, computing appr_α^complete(X) is a problem of exponential complexity, since B-complexes T may contain attribute-value pairs using all possible subsets B of the set A of all attributes. Therefore we use an additional parameter, an integer called MAXSIZE, and consider only B-complexes defined by at most MAXSIZE attributes. Our search space of all potential B-complexes contains first all B-complexes with single attributes, then all B-complexes with pairs of attributes, and so on, and finally B-complexes with exactly MAXSIZE attributes.

Due to this computational complexity, in our experiments we also used a heuristic approach to computing yet another local probabilistic approximation, called the MLEM2 local probabilistic approximation and denoted by appr_α^mlem2(X), since it is inspired by the MLEM2 rule induction algorithm [46]. Using this approach, appr_α^mlem2(X) is constructed from A-complexes Y that are the most relevant to X, i.e., with |X ∩ Y| as large as possible. If there is more than one A-complex that satisfies this criterion, the largest conditional probability of X given Y is the next criterion for selecting an A-complex. Note that if two A-complexes are equally relevant, the second criterion selects an A-complex with the smaller block cardinality.

The local probabilistic approximation appr_α^mlem2(X) is defined by the following algorithm, which computes not only the approximation but also the corresponding rule set. The rule set is represented by the family 𝒯 of A-complexes T, where every T corresponds to a rule. The local probabilistic approximation appr_α^mlem2(X) is defined as ∪ {[T] | T ∈ 𝒯}.

Algorithm for determining a single local probabilistic approximation

input: a set X (a subset of U) and a parameter α,
output: a family 𝒯 of A-complexes of the set X,
begin
  G := X; D := X; 𝒯 := ∅; J := ∅;
  while G ≠ ∅
  begin
    T := ∅; Ts := ∅; Tn := ∅;
    T(G) := {t | [t] ∩ G ≠ ∅};
    while (T = ∅ or [T] ⊄ D) and T(G) ≠ ∅
    begin
      select a pair t = (a_t, v_t) ∈ T(G) such that |[t] ∩ G| is maximum;
      if a tie occurs, select a pair t ∈ T(G) with the smallest cardinality of [t];
      if another tie occurs, select the first pair;
      T := T ∪ {t};
      G := [t] ∩ G;
      T(G) := {t | [t] ∩ G ≠ ∅};
      if a_t is symbolic {let V_at be the domain of a_t}
        then Ts := Ts ∪ {(a_t, v) | v ∈ V_at}
        else {a_t is numerical, let t = (a_t, u..v)}
          Tn := Tn ∪ {(a_t, x..y) | x..y and u..v are disjoint} ∪ {(a_t, x..y) | x..y ⊇ u..v};
      T(G) := T(G) − (Ts ∪ Tn);
    end {while};
    if Pr(X | [T]) ≥ α
      then begin D := D ∪ [T]; 𝒯 := 𝒯 ∪ {T}; end {then}
      else J := J ∪ {T};
    G := D − ∪ {[S] | S ∈ 𝒯 ∪ J};
  end {while};
  for each T ∈ 𝒯 do
    for each numerical attribute a_t with (a_t, u..v) ∈ T do
      while T contains at least two different pairs (a_t, u..v) and (a_t, x..y) with the same numerical attribute a_t do
        replace these two pairs with a new pair (a_t, common part of u..v and x..y);
  for each T ∈ 𝒯 do
    for each t ∈ T do
      if [T − {t}] ⊆ D then T := T − {t};
  for each T ∈ 𝒯 do
    if ∪ {[S] | S ∈ 𝒯 − {T}} = ∪ {[S] | S ∈ 𝒯} then 𝒯 := 𝒯 − {T};
end {procedure}.
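For symbolic attributes the algorithm admits a compact rendering. The following Python sketch is our simplified version, not the published implementation: it keeps the two selection criteria (relevance first, smaller block breaks ties) but omits the interval handling for numerical attributes and the final simplification passes, and it adds a safety guard that the full algorithm does not need. It reuses TABLE and block() from the Section 2 sketch.

def mlem2_local(table, attributes, X, alpha):
    # Simplified heuristic sketch; symbolic attributes only.
    pairs = [(a, v) for a in attributes
             for v in {row[a] for row in table.values()} - {"?", "*"}]
    G, D = set(X), set(X)
    accepted, seen = [], []              # blocks [T] of kept / all complexes
    while G:
        used_attrs, covered = set(), None
        while covered is None or not covered <= D:
            cands = [(a, v) for (a, v) in pairs
                     if a not in used_attrs and block(a, v) & G]
            if not cands:
                break
            # most relevant pair first; the smaller block breaks ties
            a, v = max(cands, key=lambda t: (len(block(*t) & G),
                                             -len(block(*t))))
            used_attrs.add(a)
            b = block(a, v)
            covered = b if covered is None else covered & b
            G = b & G
        if covered and len(X & covered) / len(covered) >= alpha:
            D |= covered
            accepted.append(covered)
        seen.append(covered or set())
        new_G = D - set().union(*seen)
        if new_G == G:                   # safety guard for this sketch
            break
        G = new_G
    return set().union(*accepted) if accepted else set()

print(sorted(mlem2_local(TABLE, ["Temperature", "Headache", "Cough"],
                         {7, 8}, 0.667)))   # [7, 8], as in the text below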
The time complexity of the algorithm for determining a single local probabilistic approximation is polynomial. Let m be the number of cases, |U| = m, let n be the number of attributes, |A| = n, and let r be the number of all attribute-value blocks. The time complexity of our algorithm, using a "brute force" approach and the worst case scenario, i.e., not using special data structures such as a hashing function, is O(mnr). Indeed, the most demanding part of the algorithm is the nested WHILE loop. The time complexity of every statement in the body of the inner WHILE loop is limited by O(r). The number of iterations of the inner WHILE loop is limited by O(n), while the number of iterations of the outer WHILE loop is limited by O(m). Additionally, the time complexity of the nested FOR loop is O(mn) and the time complexity of the last FOR loop is O(m). For attributes with a fixed number of values (e.g., symbolic attributes) the number r is limited by O(n), and for attributes with the number of values depending on m (e.g., numerical attributes) the number r is limited by O(mn). Therefore, the time complexity of the algorithm for data sets with attributes with a fixed number of values is O(mn²), and for attributes with the number of values depending on m it is O(m²n²).

For Table 1, all distinct MLEM2 local probabilistic approximations of the concept [(Flu, yes)] are:

appr_0.5^mlem2({1, 2, 3, 4, 5, 6}) = U,
appr_0.667^mlem2({1, 2, 3, 4, 5, 6}) = {1, 2, 4, 5, 7},
appr_1^mlem2({1, 2, 3, 4, 5, 6}) = ∅.

For Table 1, all distinct MLEM2 local probabilistic approximations of the concept [(Flu, no)] are:

appr_α^mlem2({7, 8}) = {7, 8} for any α > 0.
It is clear that for Table 1 the MLEM2 local probabilistic approximations are better than the complete local probabilistic approximations (the most convincing case is appr_α^mlem2({7, 8})). For the concept [(Flu, yes)] and α = 0.667, 𝒯 = {{(Cough, yes)}, {(Temperature, high)}}, while for the concept [(Flu, no)] and α = 0.667, 𝒯 = {{(Temperature, normal), (Cough, no)}}, so the corresponding rule set is

1, 3, 4
(Cough, yes) → (Flu, yes)
1, 3, 4
(Temperature, high) → (Flu, yes)
2, 2, 2
(Temperature, normal) & (Cough, no) → (Flu, no)

where rules are presented in the LERS format: every rule is associated with three numbers, the total number of attribute-value pairs on the left-hand side of the rule, the total number of cases correctly classified by the rule during training, and the total number of training cases matching the left-hand side of the rule, i.e., the rule domain size.
6. Experiments

In our experiments we used eight real-life data sets taken from the University of California at Irvine Machine Learning Repository. These data sets were enhanced by replacing, randomly, 35% of existing attribute values with missing attribute values, separately by lost values and by "do not care" conditions, see Table 2. Thus, for any data set from Table 2, two data sets were used for experiments: one with missing attribute values interpreted as lost values and the other with "do not care" conditions. Three data sets were numerical: Bankruptcy, Echocardiogram, and Iris.

We conducted three series of experiments. In the first series, three different approaches to rule induction using probabilistic approximations were used:
• concept probabilistic approximations, labeled Global in Figs. 1–8. For any of our sixteen data sets (eight with lost values and eight with "do not care" conditions), concept probabilistic approximations were computed for all concepts and then rule sets were induced by the MLEM2 algorithm, global version, as described, e.g., in [24],
• complete local probabilistic approximations, labeled Complete Local in Figs. 1–8. The MAXSIZE parameter was set to the value of two in all experiments. For any of the sixteen data sets, complete local probabilistic approximations were computed for all concepts and then rule sets were induced by the MLEM2 algorithm, local version, as described in this paper,
• MLEM2 local probabilistic approximations, labeled MLEM2 Local in Figs. 1–8. For any of the sixteen data sets, rule sets were induced for all concepts directly by the MLEM2 algorithm, local version, as described in this paper.
In all three types of approximations, the parameter α was gradually incremented, first from 0.001 to 0.1 and then from 0.1 to 1, with an increment of 0.1. The error rate was computed using ten-fold cross validation. Results of these experiments are presented in Figs. 1–8.
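The evaluation loop can be sketched as follows (a hedged illustration of ours: induce_rules and classify stand for the MLEM2-based rule induction and LERS classification, which are not reproduced here, and cases are assumed to be dictionaries that include the decision value, here under the key "Flu" of the running example):

import random

def ten_fold_error_rate(cases, induce_rules, classify):
    # Standard ten-fold cross validation over a list of labeled cases.
    cases = list(cases)
    random.shuffle(cases)
    folds = [cases[i::10] for i in range(10)]
    errors = 0
    for i in range(10):
        train = [c for j in range(10) if j != i for c in folds[j]]
        rules = induce_rules(train)
        errors += sum(1 for c in folds[i] if classify(rules, c) != c["Flu"])
    return errors / len(cases)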
Table 2. Data sets used for experiments.

Data set | Cases | Number of attributes | Concepts
Bankruptcy | 66 | 5 | 2
Breast cancer | 277 | 9 | 2
Echocardiogram | 74 | 7 | 2
Image segmentation | 210 | 19 | 7
Hepatitis | 155 | 19 | 2
Iris | 150 | 4 | 3
Lymphography | 148 | 18 | 4
Wine recognition | 178 | 13 | 3
Fig. 1. Error rate for the data set Bankruptcy with lost values (denoted by ?) and "do not care" conditions (denoted by ∗).
Fig. 2. Error rate for the data set Breast cancer with lost values (denoted by ?) and "do not care" conditions (denoted by ∗).
Using labels from Figs. 1–8, we may observe that:

• the Global, ∗ approach was the winner for the following data sets: Echocardiogram and Hepatitis, with error rates of 24.32% (for α between 0.001 and 0.5) and 18.71% (for α = 0.2), respectively,
• the Global, ? approach was the winner for the following data sets: Bankruptcy and Image segmentation, with error rates of 13.64% (for any α between 0.001 and 1) and 43.81% (for α = 0.4 and 0.5), respectively,
• the Complete Local, ∗ approach was worse than the other approaches for all data sets,
Fig. 3. Error rate for the data set Echocardiogram with lost values (denoted by ?) and "do not care" conditions (denoted by ∗).
Fig. 4. Error rate for the data set Hepatitis with lost values (denoted by ?) and "do not care" conditions (denoted by ∗).
Fig. 5. Error rate for the data set Image segmentation with lost values (denoted by ?) and "do not care" conditions (denoted by ∗).
Fig. 6. Error rate for the data set Iris with lost values (denoted by ?) and "do not care" conditions (denoted by ∗).
Fig. 7. Error rate for the data set Lymphography with lost values (denoted by ?) and "do not care" conditions (denoted by ∗).
Fig. 8. Error rate for the data set Wine recognition with lost values (denoted by ?) and "do not care" conditions (denoted by ∗).
• the Complete Local, ? approach was the winner for the following data sets: Bankruptcy, Iris and Lymphography, with error rates of 13.64% (for any α between 0.001 and 1), 12% (for α = 0.5), and 23.65% (for α between 0.001 and 0.5 and for α = 0.7), respectively,
• the MLEM2 Local, ∗ approach was the winner for the following data set: Wine recognition, with an error rate of 10.67% (for α = 0.6),
• the MLEM2 Local, ? approach was the winner for the following data sets: Bankruptcy, Breast cancer and Lymphography, with error rates of 13.64% (for any α between 0.001 and 1), 26.35% (for α = 0.8) and 23.65% (for any α between 0.6 and 1), respectively.

Note that some data sets are listed more than once due to ties. For data sets with lost values, all three approaches, Global, Complete Local, and MLEM2 Local, provide good results. For data sets with "do not care" conditions, the best approaches were Global and MLEM2 Local. However, for such data sets, the Global approach provides larger error rates for all values of the parameter α for the Bankruptcy, Iris and Wine recognition data sets. Moreover, for data sets with "do not care" conditions and the Complete Local approach, the error rate was close to 100% for α close to 1, for all data sets.

There is an important problem: how to choose the best approach to missing attribute values for a given data set. Some parameters characterizing the data set should be used, such as the existence of numerical attributes, the number of concepts, etc. This problem is still open, in general, in data mining. As follows from our experiments, the best approaches are scattered
Fig. 9. Statistics for the data set Bankruptcy with lost values (denoted by ?) and "do not care" conditions (denoted by ∗).
Fig. 10. Statistics for the data set Breast cancer with lost values (denoted by ?) and "do not care" conditions (denoted by ∗).
Fig. 11. Statistics for the data set Echocardiogram with lost values (denoted by ?) and "do not care" conditions (denoted by ∗).
Fig. 12. Statistics for the data set Hepatitis with lost values (denoted by ?) and "do not care" conditions (denoted by ∗).
Fig. 13. Statistics for the data set Image segmentation with lost values (denoted by ?) and "do not care" conditions (denoted by ∗).
Fig. 14. Statistics for the data set Iris with lost values (denoted by ?) and "do not care" conditions (denoted by ∗).
Fig. 15. Statistics for the data set Lymphography with lost values (denoted by ?) and "do not care" conditions (denoted by ∗).
among data sets. Thus the only advice is to use all approaches, with the exception of (Complete Local, ∗), and select the best approach using ten-fold cross validation. On the other hand, if time is an issue, one of our two best approaches, (Complete Local, ?) or (MLEM2 Local, ?), may be used.

Similarly, it is very difficult to tell why for some data sets the lost value interpretation of missing attribute values works better, and for others the "do not care" condition interpretation. The choice between the two interpretations should be made on the basis of additional information about the data set, based in turn on information about the cause of the missing attribute values. When the cause of missing attribute values is unknown, both possibilities should be tried.

The second series of experiments was associated with creating, for any data set, a family of new data sets with randomly placed missing values. Starting from 0%, the number of missing attribute values was gradually increased, with increments of 5%, until in the process of adding missing attribute values some case had all attribute values missing; if so, another two attempts were made, and finally the process of adding missing attribute values was halted. Our objective was to test, for any data set from such a family, how many distinct concept probabilistic approximations may be defined when the α parameter changes from 0.001 to 1. In this series of experiments, for any data set, we report the number of distinct characteristic sets, the average number of distinct concept probabilistic approximations for all concepts, and the average number of distinct conditional probabilities of a concept given a characteristic set, for all concepts and all values of α, see Figs. 9–16. It is clear that for data sets with "do not care" conditions all three numbers were larger than for the corresponding data sets with lost values. Additionally, the number of characteristic sets is close to the number of cases for data sets with a large enough number of missing attribute values.
Fig. 16. Statistics for the data set Wine recognition with lost values (denoted by ?) and "do not care" conditions (denoted by ∗).
Fig. 17. Experiments for the data set Lymphography with lost values, with variable MAXSIZE.
Fig. 18. Experiments for the data set Lymphography with "do not care" conditions, with variable MAXSIZE.
Surprisingly, the number of distinct concept probabilistic approximations is quite limited, and again, this number is larger for data sets with "do not care" conditions than for data sets with lost values.
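One plausible reading of the corruption procedure used in this series, as a hedged Python sketch of ours (the exact sampling details are not spelled out in the text; the table encoding is the one from Section 2):

import random

def add_missing_values(table, attributes, fraction, marker="?"):
    # Replace `fraction` of the specified attribute values with a missing-
    # value marker ("?" for lost values, "*" for "do not care" conditions).
    # Up to three attempts are made; if every attempt leaves some case with
    # all attribute values missing, None is returned and the process halts.
    cells = [(x, a) for x in table for a in attributes
             if table[x][a] not in ("?", "*")]
    k = int(fraction * len(cells))
    for _ in range(3):
        corrupted = {x: dict(row) for x, row in table.items()}
        for x, a in random.sample(cells, k):
            corrupted[x][a] = marker
        if all(any(v not in ("?", "*") for v in row.values())
               for row in corrupted.values()):
            return corrupted
    return None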
Yet another series of experiments was conducted to show how the MAXSIZE parameter affects the error rate for complete local probabilistic approximations. Results are presented in Figs. 17 and 18. Here the surprise is that for small values of the α parameter the value of MAXSIZE does not matter, for both kinds of data sets, with lost values and with "do not care" conditions. A too small value of the parameter MAXSIZE affects the data sets with "do not care" conditions more.

7. Conclusions

As our experiments indicate, the best approach to rule induction from a specific data set using probabilistic approximations should be selected individually. The (Complete Local, ?) and (MLEM2 Local, ?) approaches were each the best for three out of eight data sets. Two other approaches were the best for two out of eight data sets. For data sets with "do not care" conditions, the approach based on complete local probabilistic approximations should not be used. For three out of eight data sets (Bankruptcy, Echocardiogram and Lymphography) the best results were accomplished using non-parameterized approximations (or, equivalently, probabilistic approximations with α equal to 0.001 or 1). Additionally, the number of distinct concept probabilistic approximations is much larger for data sets with "do not care" conditions, but even then it is quite limited.

Acknowledgments

The authors would like to thank the Editor-in-Chief, the Guest Editors and the anonymous reviewers for their valuable suggestions.

References

[1] Z. Pawlak, Rough sets, International Journal of Computer and Information Sciences 11 (1982) 341–356.
[2] Z. Pawlak, Rough Sets. Theoretical Aspects of Reasoning about Data, Kluwer Academic Publishers, Dordrecht, Boston, London, 1991.
[3] N. Azam, J.T. Yao, Analyzing uncertainties of probabilistic rough set regions with game-theoretic rough sets, International Journal of Approximate Reasoning, in this issue.
[4] J.P. Herbert, J.T. Yao, Game-theoretic rough sets, Fundamenta Informaticae 108 (2011) 267–286.
[5] T.J. Li, X.P. Yang, An axiomatic characterization of probabilistic rough sets, International Journal of Approximate Reasoning, in this issue.
[6] H.X. Li, X.Z. Zhou, Risk decision making based on decision-theoretic rough set: a multi-view decision model, International Journal of Computational Intelligence Systems 4 (2011) 1–11.
[7] D. Liu, D. Ruan, Probabilistic model criteria with decision-theoretic rough sets, Information Sciences 181 (2011) 3709–3722.
[8] W. Ma, B. Sun, Probabilistic rough set over two universes and rough entropy, International Journal of Approximate Reasoning 53 (2012) 608–619.
[9] F. Min, Q.H. Hu, W. Zhu, Feature selection with test cost constraint, International Journal of Approximate Reasoning, in this issue.
[10] Z. Pawlak, S.K.M. Wong, W. Ziarko, Rough sets: probabilistic versus deterministic approach, International Journal of Man-Machine Studies 29 (1988) 81–95.
[11] Y. Qian, J. Liang, W. Pedrycz, C. Dang, Positive approximation: an accelerator for attribute reduction in rough set theory, Artificial Intelligence 174 (2010) 597–618.
[12] S. Tsumoto, H. Tanaka, PRIMEROSE: probabilistic rule induction method based on rough sets and resampling methods, Computational Intelligence 11 (1995) 389–405.
[13] G. Wang, Extension of rough set under incomplete information systems, in: Proceedings of the IEEE International Conference on Fuzzy Systems, 2002, pp. 1098–1103.
[14] Y.Y. Yao, Decision-theoretic rough set models, in: Proceedings of the Second International Conference on Rough Sets and Knowledge Technology, 2007, pp. 1–12.
[15] Y.Y. Yao, Probabilistic rough set approximations, International Journal of Approximate Reasoning 49 (2008) 255–271.
[16] Y.Y. Yao, Two semantic issues in a probabilistic rough set model, Fundamenta Informaticae 108 (2011) 249–265.
[17] Y.Y. Yao, S.K.M. Wong, A decision theoretic framework for approximating concepts, International Journal of Man-Machine Studies 37 (1992) 793–809.
[18] Y.Y. Yao, S.K.M. Wong, P. Lingras, A decision-theoretic rough set model, in: Proceedings of the 5th International Symposium on Methodologies for Intelligent Systems, 1990, pp. 388–395.
[19] W. Ziarko, Variable precision rough set model, Journal of Computer and System Sciences 46 (1) (1993) 39–59.
[20] W. Ziarko, Probabilistic approach to rough sets, International Journal of Approximate Reasoning 49 (2008) 272–284.
[21] J.W. Grzymala-Busse, S.R. Marepally, Y. Yao, An empirical comparison of rule sets induced by LERS and probabilistic rough classification, in: Proceedings of the 7th International Conference on Rough Sets and Current Trends in Computing, 2010, pp. 590–599.
[22] J.W. Grzymala-Busse, Generalized parameterized approximations, in: Proceedings of RSKT 2011, the 6th International Conference on Rough Sets and Knowledge Technology, 2011, pp. 136–145.
[23] J.W. Grzymala-Busse, Rough set strategies to data with missing attribute values, in: Workshop Notes, Foundations and New Directions of Data Mining, in conjunction with the 3rd International Conference on Data Mining, 2003, pp. 56–63.
[24] J.W. Grzymala-Busse, Data with missing attribute values: generalization of indiscernibility relation and rule induction, Transactions on Rough Sets 1 (2004) 78–95.
[25] J.W. Grzymala-Busse, Characteristic relations for incomplete data: a generalization of the indiscernibility relation, in: Proceedings of the Fourth International Conference on Rough Sets and Current Trends in Computing, 2004, pp. 244–253.
[26] P.G. Clark, J.W. Grzymala-Busse, Experiments on probabilistic approximations, in: Proceedings of the 2011 IEEE International Conference on Granular Computing, 2011, pp. 144–149.
[27] P.G. Clark, J.W. Grzymala-Busse, Rule induction using probabilistic approximations and data with missing attribute values, in: Proceedings of the 15th IASTED International Conference on Artificial Intelligence and Soft Computing ASC 2012, 2012, pp. 235–242.
[28] P.G. Clark, J.W. Grzymala-Busse, Experiments on rule induction from incomplete data using three probabilistic approximations, in: Proceedings of the 2012 IEEE International Conference on Granular Computing, 2012, pp. 90–95.
[29] P.G. Clark, J.W. Grzymala-Busse, Experiments using three probabilistic approximations for rule induction from incomplete data sets, in: Proceedings of MCCSIS 2012, IADIS European Conference on Data Mining ECDM 2012, 2012, pp. 72–78.
[30] P.G. Clark, J.W. Grzymala-Busse, M. Kuehnhausen, Local probabilistic approximations for incomplete data, in: Proceedings of ISMIS 2012, the 20th International Symposium on Methodologies for Intelligent Systems, 2012, pp. 93–98.
[31] J.W. Grzymala-Busse, W. Rzasa, Local and global approximations for incomplete data, Transactions on Rough Sets 8 (2008) 21–34.
[32] H.X. Li, M.H. Wang, X.Z. Zhou, J.B. Zhao, An interval model for learning rules from incomplete information table, International Journal of Approximate Reasoning 53 (2012) 24–37.
[33] J. Li, C. Mei, Y. Lv, Incomplete decision contexts: approximate concept construction, rule acquisition and knowledge reduction, International Journal of Approximate Reasoning 54 (2012) 149–165.
[34] Y. Leung, W.-Z. Wu, W.-X. Zhang, Knowledge acquisition in incomplete information systems: a rough set approach, European Journal of Operational Research 168 (2006) 164–180.
[35] J.W. Grzymala-Busse, A.Y. Wang, Modified algorithms LEM1 and LEM2 for rule induction from data with missing attribute values, in: Proceedings of the Fifth International Workshop on Rough Sets and Soft Computing (RSSC'97) at the Third Joint Conference on Information Sciences (JCIS'97), 1997, pp. 69–72.
[36] J. Stefanowski, A. Tsoukias, Incomplete information tables and rough classification, Computational Intelligence 17 (3) (2001) 545–566.
[37] J.W. Grzymala-Busse, On the unknown attribute values in learning from examples, in: Proceedings of ISMIS-91, the 6th International Symposium on Methodologies for Intelligent Systems, 1991, pp. 368–377.
[38] M. Kryszkiewicz, Rough set approach to incomplete information systems, in: Proceedings of the Second Annual Joint Conference on Information Sciences, 1995, pp. 194–197.
[39] J.W. Grzymala-Busse, W. Rzasa, Definability and other properties of approximations for generalized indiscernibility relations, Transactions on Rough Sets 11 (2010) 14–39.
[40] M. Kryszkiewicz, Rules in incomplete information systems, Information Sciences 113 (3–4) (1999) 271–292.
[41] T.Y. Lin, Neighborhood systems and approximation in database and knowledge base systems, in: Proceedings of ISMIS-89, the Fourth International Symposium on Methodologies of Intelligent Systems, 1989, pp. 75–86.
[42] T.Y. Lin, Topological and fuzzy rough sets, in: R. Slowinski (Ed.), Intelligent Decision Support. Handbook of Applications and Advances of the Rough Sets Theory, Kluwer Academic Publishers, Dordrecht, Boston, London, 1992, pp. 287–304.
[43] R. Slowinski, D. Vanderpooten, A generalized definition of rough approximations based on similarity, IEEE Transactions on Knowledge and Data Engineering 12 (2000) 331–336.
[44] J. Stefanowski, A. Tsoukias, On the extension of rough sets under incomplete information, in: Proceedings of RSFDGrC'1999, the 7th International Workshop on New Directions in Rough Sets, Data Mining, and Granular-Soft Computing, 1999, pp. 73–81.
[45] Y.Y. Yao, Relational interpretations of neighborhood operators and rough set approximation operators, Information Sciences 111 (1998) 239–259.
[46] J.W. Grzymala-Busse, W. Rzasa, A local version of the MLEM2 algorithm for rule induction, Fundamenta Informaticae 100 (2010) 99–116.