Extraction of fuzzy rules from fuzzy decision trees: An axiomatic fuzzy sets (AFS) approach

Data & Knowledge Engineering 84 (2013) 1–25
Xiaodong Liu a,b, Xinghua Feng a, Witold Pedrycz c,*

a Research Center of Information and Control, Dalian University of Technology, Dalian 116024, PR China
b Department of Mathematics, Dalian Maritime University, Dalian 116026, PR China
c Department of Electrical and Computer Engineering, University of Alberta, Edmonton, Canada T6G 2G7; Department of Electrical and Computer Engineering, Faculty of Engineering, King Abdulaziz University, Jeddah 21589, Saudi Arabia; and Systems Research Institute, Polish Academy of Sciences, Warsaw, Poland

Article info

Article history: Received 17 March 2009; Received in revised form 4 December 2012; Accepted 4 December 2012; Available online 14 December 2012.

Keywords: Fuzzy decision trees; Fuzzy rules; AFS fuzzy logic; Knowledge representation; Comparative analysis

Abstract

In this study, we introduce a new type of coherence membership function to describe fuzzy concepts, which builds upon the theoretical findings of the Axiomatic Fuzzy Set (AFS) theory. This type of membership function embraces both the factor of fuzziness (by capturing subjective imprecision) and randomness (by referring to the objective uncertainty) and treats both of them in a consistent manner. Furthermore, we propose a method to construct a fuzzy rule-based classifier using coherence membership functions. Given the theoretical developments presented here, the resulting classification systems are referred to as AFS classifiers. The proposed algorithm consists of three major steps: (a) generating fuzzy decision trees by assuming some level of specificity (detailed view) quantified in terms of a threshold; (b) pruning the obtained rule base; and (c) determining the optimal threshold resulting in a final tree. Compared with other fuzzy classifiers, the AFS classifier exhibits several essential advantages of practical relevance. In particular, the relevance of classification results is quantified by associated confidence levels. Furthermore, the proposed algorithm can be applied to data sets with mixed data type attributes. We have experimented with various data sets commonly used in the literature and compared the results with those of SVM, KNN, C4.5, Fuzzy Decision Trees (FDTs), the Fuzzy SLIQ Decision Tree (FS-DT), FARC-HD and FURIA. The accuracy obtained is higher than that produced by the other methods. The results of statistical tests supporting the comparative analysis show that the proposed algorithm performs significantly better than FDTs, FS-DT, KNN and C4.5. © 2012 Elsevier B.V. All rights reserved.

1. Introduction

There have been numerous approaches to the extraction of classification rules from numeric data [1–11]. One frequently used alternative is to construct a decision tree and afterwards extract rules from it [12,1,2]. Due to the nature of continuous attributes, as well as the various facets of uncertainty one has to take into consideration, there has been a visible trend to cope with the factor of fuzziness when learning from examples in tree induction. In a nutshell, this trend gave rise to the generalizations known as fuzzy decision trees, cf. [13–26,1,12]. Fuzzy decision trees grow in a top-down way, recursively partitioning the training data into segments with similar or identical outputs. Various approaches to the generation of fuzzy decision trees have been suggested by many authors (e.g., [13–26]).

Fuzzy decision trees encountered in the literature can be categorized into several types, depending on the nature of the splitting mechanism used in their design:

• Fuzzy ID3 [27–29,7,14,15,23]. Fuzzy ID3 generalizes ID3, which was initially proposed by Quinlan for the Boolean case [30]. The algorithms in this category apply fuzzy sets to describe (quantify) attributes and then use the ID3 approach to construct the decision tree. Fuzzy entropy, information gain or the gain ratio is used as the attribute selection measure.
• Yuan and Shaw's fuzzy decision tree (FDT) [31,32,18]. This approach was introduced by Yuan and Shaw [18]. In contrast to fuzzy ID3, the tree uses the minimal ambiguity (nonspecificity) of a possibility distribution to select attributes for splitting. Each attribute is represented through fuzzy sets before constructing the decision tree, and a measure of minimal ambiguity then guides attribute selection during tree building.
• Gini index based FDT [33,34,2,17,19]. Chandra and Varghese proposed several ways to improve the performance of SLIQ (Supervised Learning In Quest) decision trees [35,36,2,17]. In [2,17], algorithms generating fuzzy decision trees using a fuzzy Gini index were presented. In these algorithms, the attributes are not encoded in terms of fuzzy sets before constructing the decision tree; to select the best attribute for splitting, attribute values are fuzzified based on the split-point value.
• Wang's FDT [12,16]. The approach proposed by Wang et al. uses a measure, namely the maximum classification importance of an attribute contributing to its consequent, to select the expanded attributes encoded in terms of fuzzy sets.
• Normalized fuzzy Kolmogorov–Smirnov based FDT [21,22,24]. Boyen et al. proposed an induction of fuzzy decision trees in which fuzzy sets are constructed automatically during the growth of the tree and a normalized fuzzy Kolmogorov–Smirnov discrimination quality measure selects the attribute used in node splitting.

There are also other selection methods that emerged in the design of decision trees. For example, Pedrycz and Sosnowski introduced algorithms [37,13] using fuzzy granulation for partitioning the input space and constructing the fuzzy decision tree. Unlike the "standard" decision tree, which considers one attribute at a time to partition the training samples at each node, fuzzy granulation considers all features when partitioning the training data. The Fuzzy CHAID algorithm, proposed by Fowdar et al. [38], generates fuzzy trees for both classification and regression problems from pre-generated CHAID decision trees using the Pearson chi-squared test. Nevertheless, most of these algorithms are variations of the generic algorithmic framework presented above. Applications of fuzzy decision trees to interval-valued data [20] and to multi-valued and multi-labeled data [39] have been proposed as well. Combining other data mining techniques with fuzzy decision trees, several hybrid approaches were proposed, such as neuro-fuzzy algorithms [40,1] and multiple fuzzy decision trees formed with the aid of rough sets [41,42]. An interesting commonality occurring across most of the existing methods is worth emphasizing: the algorithms require some knowledge about the membership functions of the linguistic values of the attributes, as well as specific aggregation operations (such as t-norms), before any optimization technique can be utilized.
It becomes apparent that, to a significant extent, the obtained fuzzy decision trees are pre-determined by the membership functions of the fuzzy terms and the fuzzy logic operators. Besides, as in the "standard" decision trees, the class label of a terminal node is determined by the label of the majority of the training samples positioned at that node; differences in membership degrees and the disproportion between the classes are ignored. In this paper, we use the AFS fuzzy set theory, which offers a systematic way of converting the information residing in a database into membership functions and their fuzzy logic operations by taking both fuzziness (subjective imprecision) and randomness (objective uncertainty) into account, and we use fuzzy entropy as the attribute selection measure to generate the AFS classifier (decision tree). A crucial threshold, being a part of the method, affects the tree structure and helps control the level of specificity captured by the tree; a method for labeling the leaves is also given. To offer a thorough comparative analysis, we experimented with the algorithm using a number of well-known data sets coming from the UCI Repository of Machine Learning data [43]. The ensuing comparative analysis involves some other types of classifiers, such as SVM [44], KNN [45], C4.5 [46], Fuzzy Decision Trees (FDTs) [18], the Fuzzy SLIQ Decision Tree (FS-DT) [17], FARC-HD [47] and FURIA [48]. The main features of the proposed AFS classifier (decision tree) which distinguish it from other fuzzy decision trees can be highlighted as follows:

• The AFS fuzzy sets with their underlying logic operations can eliminate the potential subjective bias encountered in "conventional" fuzzy decision trees and resulting from the use of subjectively formed membership functions.
• The tree structure comes with a great deal of flexibility: it is controlled by the values of the crucial threshold δ, which can effectively adjust the level of detail captured by the tree.
• The relevance of classification results can be explicitly quantified by associated confidence levels.

The paper is organized as follows. In Section 2, we recall the basic notions and properties of the AFS theory that are essential in the framework of our investigations on fuzzy rule extraction. In Section 3, we discuss the coherence membership functions of fuzzy concepts. In Section 4, we introduce an algorithm for generating fuzzy rules from AFS decision trees. Section 5 presents a suite of numeric experiments and offers some comparative analysis. Conclusions are presented in Section 6.


2. Selected preliminaries of the AFS theory

In this section, we briefly recall the notation and present several of the most pertinent results of the AFS theory that are essential to this study. For details, the reader may refer to a comprehensive treatment of this subject presented in [49] and [50]. The following example serves as an introductory illustration of the most generic ideas. Let M be a non-empty set. The set EM* is defined as

EM* = { ∑_{i∈I} (∏_{m∈A_i} m) | A_i ⊆ M, i ∈ I, I is a non-empty index set },

where ∑ and ∏ denote a disjunction and a conjunction, respectively.

Example 1. Let X = {x1, x2, …, x14} be a set of 14 people characterized by the attributes shown in Table 1 (described by real numbers, Boolean values and order relations). Let M = {m1, m2, …, m12} be the set of fuzzy (or numeric) linguistic terms on X, where each m ∈ M is associated with a certain attribute shown in Table 1. Here we consider the following terms: m1: "old person", m2: "tall person", m3: "high self-appraisement person", m4: "high monthly income", m5: "high annual payment", m6: "male", m7: "female" (i.e., not male), m8: "good performance on test A" (A is viewed as a certain psychological test; the number i (e.g., i = 1, 2, 3, 4, 5, 6) positioned in this column for some x ∈ X indicates that the performance of individual x on test A has been ordered and comes as the i-th out of the 14 persons), m9: "good performance on test B", m10: "good performance on test C", m11: "young person", m12: "person about 40 years old". The elements of M are viewed as "simple" (or "elementary") terms of the corresponding attributes. In the expression ∑_{i∈I}(∏_{m∈A_i} m) ∈ EM*, ∏_{m∈A_i} m represents a conjunction of the concepts in A_i, and ∑_{i∈I}(∏_{m∈A_i} m) is the disjunction of the conjunctions of the fuzzy terms in the A_i represented by the ∏_{m∈A_i} m's (i.e., the disjunctive normal form of a formula representing a concept). For example, we may have γ = m1m6 + m1m3 + m2, which translates as "old male" or "high self-appraisement old person" or "tall person" (the "+" denotes here a disjunction of terms). While M may be a set of fuzzy or two-valued (Boolean) terms, every ∑_{i∈I}(∏_{m∈A_i} m) ∈ EM* has a well-defined meaning such as the one discussed above.

Definition 1. ([49]). Let M be a non-empty set. We define a binary relation R on EM*: for ∑_{i∈I}(∏_{m∈A_i} m), ∑_{j∈J}(∏_{m∈B_j} m) ∈ EM*,

[∑_{i∈I}(∏_{m∈A_i} m)] R [∑_{j∈J}(∏_{m∈B_j} m)] ⇔ (i) ∀A_i (i ∈ I), ∃B_h (h ∈ J) such that A_i ⊇ B_h; (ii) ∀B_j (j ∈ J), ∃A_k (k ∈ I) such that B_j ⊇ A_k.

It is evident that R is an equivalence relation. The quotient set EM*/R is denoted by EM. The notation ∑_{i∈I}(∏_{m∈A_i} m) = ∑_{j∈J}(∏_{m∈B_j} m) states that ∑_{i∈I}(∏_{m∈A_i} m) and ∑_{j∈J}(∏_{m∈B_j} m) are equivalent under the equivalence relation R. For ξ = m3m8 + m1m4 + m1m6m7 + m1m4m8 ∈ EM and ζ = m3m8 + m1m4 + m1m6m7 ∈ EM in Example 1, by Definition 1 we have ξ = ζ.

Theorem 1. ([49]). Let M be a non-empty set. Then (EM, ∨, ∧) forms a completely distributive lattice under the binary compositions ∨ and ∧ defined as follows: for any ∑_{i∈I}(∏_{m∈A_i} m), ∑_{j∈J}(∏_{m∈B_j} m) ∈ EM,

[∑_{i∈I}(∏_{m∈A_i} m)] ∨ [∑_{j∈J}(∏_{m∈B_j} m)] = ∑_{k∈I⊔J}(∏_{m∈C_k} m)   (1)

Table 1
Description of objects.

      Age  Height  Self-appraisement  Monthly income  Annual payment  Male  Test A  Test B  Test C  Credit
x1    20   1.90    90                 1               0               1     6       1       4       0
x2    13   1.20    32                 0               0               0     4       3       1       0
x3    50   1.70    67                 140             34              0     6       1       4       1
x4    80   1.80    73                 20              80              1     3       4       2       1
x5    34   1.40    54                 15              2               1     5       2       2       0
x6    37   1.60    80                 80              28              0     6       1       4       1
x7    45   1.70    78                 268             90              1     1       6       4       1
x8    70   1.65    70                 30              45              1     3       4       2       1
x9    60   1.82    83                 25              98              0     4       3       1       1
x10   3    1.10    21                 0               0               0     2       5       3       0
x11   8    1.40    45                 0               0               0     3       4       3       0
x12   19   1.73    56                 1               0               1     4       3       4       0
x13   40   1.60    50                 30              20              1     3       4       2       0
x14   23   2.00    80                 19              5               0     4       3       2       0


[∑_{i∈I}(∏_{m∈A_i} m)] ∧ [∑_{j∈J}(∏_{m∈B_j} m)] = ∑_{i∈I, j∈J}(∏_{m∈A_i∪B_j} m)   (2)

where, for any k ∈ I ⊔ J (the disjoint union of I and J, i.e., every element of I and every element of J are always regarded as different elements in I ⊔ J), C_k = A_k if k ∈ I, and C_k = B_k if k ∈ J.

In Example 1, let α = m1m4 + m2m5m6 ∈ EM and ν = m5m6 + m5m8 ∈ EM. Then the algebraic operations defined by Eqs. (1) and (2) come as follows:

α ∨ ν = m1m4 + m2m5m6 + m5m6 + m5m8 = m1m4 + m5m6 + m5m8,
α ∧ ν = m1m4m5m6 + m2m5m6 + m1m4m5m8 + m2m5m6m8.

(EM, ∨, ∧) is called the EI (expanding one set M) algebra over M, or the AFS logic system, one type of AFS algebra. For α = ∑_{i∈I}(∏_{m∈A_i} m), β = ∑_{j∈J}(∏_{m∈B_j} m) ∈ EM, α ≤ β ⇔ α ∨ β = β ⇔ ∀A_i (i ∈ I), ∃B_h (h ∈ J) such that A_i ⊇ B_h. A small computational illustration of these operations is given after Definition 3 below.

Definition 2. ([51]). Let ζ be any concept defined on the universe of discourse X. R_ζ ⊆ X × X is called the binary relation of ζ if R_ζ satisfies the following condition: for x, y ∈ X, (x, y) ∈ R_ζ ⇔ x belongs to concept ζ to some extent and the degree of x belonging to ζ is larger than or equal to that of y, or x belongs to concept ζ to some degree and y does not.

Definition 3. ([51]). Let X be a set and R be a binary relation on X. R is called a sub-preference relation on X if for x, y, z ∈ X, x ≠ y, R satisfies the following conditions:

1. If (x, y) ∈ R, then (x, x) ∈ R;
2. If (x, x) ∈ R and (y, y) ∉ R, then (x, y) ∈ R;
3. If (x, y), (y, z) ∈ R, then (x, z) ∈ R;
4. If (x, x) ∈ R and (y, y) ∈ R, then either (x, y) ∈ R or (y, x) ∈ R.
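As a computational illustration of the lattice operations in Eqs. (1) and (2) and of the equivalence of Definition 1, the following minimal Python sketch (ours, not part of the original study) represents an element of EM as a collection of conjunctions, each conjunction being a set of simple concepts. Note that the sketch returns reduced representatives: for α ∧ ν it drops the term m2m5m6m8, which is absorbed by m2m5m6 and is therefore equivalent to the expression above by Definition 1.

```python
# Minimal sketch (ours): an element of EM is stored as a frozenset of frozensets,
# each inner frozenset being one conjunction A_i of simple concepts.  simplify()
# realises the equivalence of Definition 1 by absorption; em_or and em_and realise
# Eqs. (1) and (2), respectively.

def simplify(terms):
    """Drop every conjunction that strictly contains another one (absorption)."""
    terms = set(terms)
    return frozenset(a for a in terms if not any(b < a for b in terms))

def em_or(alpha, beta):
    """Disjunction, Eq. (1): collect the conjunctions of both operands and simplify."""
    return simplify(set(alpha) | set(beta))

def em_and(alpha, beta):
    """Conjunction, Eq. (2): pairwise unions A_i ∪ B_j, then simplification."""
    return simplify(frozenset(a | b) for a in alpha for b in beta)

if __name__ == "__main__":
    # alpha = m1 m4 + m2 m5 m6 and nu = m5 m6 + m5 m8 from the example above
    alpha = frozenset({frozenset({"m1", "m4"}), frozenset({"m2", "m5", "m6"})})
    nu = frozenset({frozenset({"m5", "m6"}), frozenset({"m5", "m8"})})
    print(sorted(map(sorted, em_or(alpha, nu))))   # m1m4 + m5m6 + m5m8
    print(sorted(map(sorted, em_and(alpha, nu))))  # m1m4m5m6 + m1m4m5m8 + m2m5m6
```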

A concept ζ is called a simple concept on X if R_ζ (given in Definition 2) is a sub-preference relation on X. Otherwise ζ is called a complex concept on X.

3. Coherence membership functions of fuzzy concepts

In order to properly define the membership function of a fuzzy concept, we need to define the AFS structure of the data.

Definition 4. ([52]). Let X, M be sets and 2^M be the power set of M. Let τ: X × X → 2^M. (M, τ, X) is called an AFS structure if τ satisfies the following axioms:

AX1: ∀(x1, x2) ∈ X × X, τ(x1, x2) ⊆ τ(x1, x1);
AX2: ∀(x1, x2), (x2, x3) ∈ X × X, τ(x1, x2) ∩ τ(x2, x3) ⊆ τ(x1, x3).

X is called the universe of discourse, M is called the concept set and τ is called the structure. Let X be a set of objects and M be a set of simple concepts on X. τ: X × X → 2^M is defined as follows: for any (x, y) ∈ X × X,

τ(x, y) = { m | m ∈ M, (x, y) ∈ R_m } ∈ 2^M,   (3)

where R_m is the binary relation of the simple concept m ∈ M (refer to Definition 2). It can easily be verified that τ defined by Eq. (3) satisfies AX1 and AX2 and that (M, τ, X) is an AFS structure. Considering Table 1, one can verify that for any x, y ∈ X, τ(x, y) is well defined by Eq. (3). For instance, by comparing the values of the attributes of x4 and x7 shown in Table 1, we have

τ(x4, x4) = {m1, m2, m3, m4, m5, m6, m8, m9, m10, m11, m12},   τ(x4, x7) = {m1, m2, m6, m8}.

Definition 5. Let M be a set of simple concepts on the set X and (M, τ, X) be an AFS structure of the data set X. For x ∈ X and A ⊆ M, the set A^τ(x) ⊆ X is defined as follows:

A^τ(x) = { y ∈ X | τ(x, y) ⊇ A }.   (4)


For ξ ∈ EM, let μ_ξ: X → [0, 1]. {μ_ξ(x) | ξ ∈ EM} is called a set of coherence membership functions of the AFS fuzzy logic system (EM, ∨, ∧) and the AFS structure (M, τ, X) if the following conditions are satisfied:

1. For α, β ∈ EM, if α ≤ β in the lattice (EM, ∨, ∧), then μ_α(x) ≤ μ_β(x) for any x ∈ X;
2. For x ∈ X and η = ∑_{i∈I}(∏_{m∈A_i} m) ∈ EM, if A_i^τ(x) = ∅ for all i ∈ I, then μ_η(x) = 0;
3. For x, y ∈ X, A ⊆ M and η = ∏_{m∈A} m ∈ EM, if A^τ(x) ⊆ A^τ(y), then μ_η(x) ≤ μ_η(y); if A^τ(x) = X, then μ_η(x) = 1.

The coherence membership function actually adheres to the Kolmogorov axioms that deal with the probability space. The following theorem supports a method one can use to construct coherence membership functions.

Theorem 2. Let M be a set of simple concepts on X and (M, τ, X) be an AFS structure defined by Eq. (3). Let S be a σ-algebra over X such that for any m ∈ M and x ∈ X, {m}^τ(x) ∈ S. For each simple concept γ ∈ M, let M_γ be a measure over S with 0 ≤ M_γ(U) ≤ 1 for all U ∈ S and M_γ(X) = 1. If for each concept ξ = ∑_{i∈I}(∏_{m∈A_i} m) ∈ EM, μ_ξ: X → [0, 1] is defined for any x ∈ X as

μ_ξ(x) = sup_{i∈I} ∏_{γ∈A_i} M_γ(A_i^τ(x)),   (5)

or

μ_ξ(x) = sup_{i∈I} inf_{γ∈A_i} M_γ(A_i^τ(x)),   (6)

then {μ_ξ(x) | ξ ∈ EM} is a set of coherence membership functions for (EM, ∨, ∧) and (M, τ, X).

The proof is given in the Appendix. It can be seen that the coherence membership functions are associated with a measure over X. We propose two types of measures for simple concepts, which can be constructed by taking into account the semantics of the simple concepts and the probability distribution of the feature values of the data. This indicates that two coherence functions μ_η(x) and μ_ξ(x) of fuzzy concepts η and ξ are not sufficient to determine μ_{η∧ξ}(x); see the example at the end of this section. This is a significant difference in comparison with the existing fuzzy logic systems equipped with some t-norms, in which the membership value μ_{η∧ξ}(x) = T(μ_η(x), μ_ξ(x)) is completely determined by the membership degrees μ_η(x) and μ_ξ(x). Hence, the constructed coherence membership functions associated with the logic operations presented in Theorem 2 include more information about the distribution of the original data, i.e., they more objectively reflect the logical relationships present among the fuzzy concepts described by the given data distribution. Now, we discuss how to define the coherence membership functions in a probability measure space. In order to complete this discussion, we first introduce the following definition.

Definition 6. ([49]). Let υ be a simple concept on X and ρ_υ: X → R+ = [0, ∞). ρ_υ is called a weight function of the simple concept υ if ρ_υ satisfies the following conditions:

1. ρ_υ(x) = 0 ⇔ (x, x) ∉ R_υ, x ∈ X;
2. ρ_υ(x) ≥ ρ_υ(y) ⇔ (x, y) ∈ R_υ, x, y ∈ X,

where R_υ is the binary relation of the concept υ.

Theorem 3. Let (Ω, F, P) be a probability measure space and M be a set of simple concepts on Ω. Let ρ_γ be the weight function for a simple concept γ ∈ M. Let X ⊆ Ω be a finite set of observed samples from the probability space (Ω, F, P), and let (M, τ, Ω) and (M, τ|X, X) be two AFS structures defined by Eq. (3). Assume that for any m ∈ M and any x ∈ Ω, {m}^τ(x) ∈ F. Then the following assertions hold:

1. {μ_ξ(x) | ξ ∈ EM} is a set of coherence membership functions of (EM, ∧, ∨) and (M, τ, Ω), (M, τ|X, X), provided that the membership function for each fuzzy concept ξ = ∑_{i∈I}(∏_{m∈A_i} m) ∈ EM is defined as follows:

μ_ξ(x) = sup_{i∈I} ∏_{γ∈A_i} [ ∑_{u∈A_i^τ(x)} ρ_γ(u) N_u / ∑_{u∈X} ρ_γ(u) N_u ],   ∀x ∈ X,   (7)

μ_ξ(x) = sup_{i∈I} ∏_{γ∈A_i} [ ∫_{A_i^τ(x)} ρ_γ(t) dP(t) / ∫_Ω ρ_γ(t) dP(t) ],   ∀x ∈ Ω,   (8)

or

μ_ξ(x) = sup_{i∈I} inf_{γ∈A_i} [ ∑_{u∈A_i^τ(x)} ρ_γ(u) N_u / ∑_{u∈X} ρ_γ(u) N_u ],   ∀x ∈ X,   (9)

μ_ξ(x) = sup_{i∈I} inf_{γ∈A_i} [ ∫_{A_i^τ(x)} ρ_γ(t) dP(t) / ∫_Ω ρ_γ(t) dP(t) ],   ∀x ∈ Ω,   (10)

where N_u is the number of times u ∈ X is observed.

2. If for every γ ∈ M, ρ_γ(x) is continuous on Ω and X is a set of samples randomly drawn from the probability space (Ω, F, P), then the membership function defined by Eq. (7) or (9) converges to the membership function defined by Eq. (8) or (10), respectively, for all x ∈ Ω as |X| approaches infinity.

The proof is presented in the Appendix. Theorem 3 defines the membership functions based on the fuzzy logic operations expressed on the observed data and on the overall space, taking both fuzziness and randomness into account via ρ_γ(x), A_i^τ(x) and N_x, which are based on observations of random samples. The following practical properties of the coherence membership functions are ensured by Theorem 3:

• The membership functions and the fuzzy logic operations determined by the observed data drawn from a probability space are consistent with those determined by the probability distribution expressed in the probability space.
• The results obtained via the AFS fuzzy logic based on the membership functions and their logic operations determined by different data sets drawn from the same probability space will be consistent.
• The laws discovered based on the membership functions and their logic operations determined by the observed data drawn from a probability space can be applied to the whole space by considering the membership functions of the concepts determined by the probability distribution.

In Example 1, let η1 = m3, η2 = m8, η3 = m3 + m8, η4 = m3m8 ∈ EM. Let ρ_m(x) = 1 for any x belonging to the simple concept m to some degree and ρ_m(x) = 0 for x not belonging to m. Assume that every x ∈ X is observed once, i.e., we take N_x = 1 in Eq. (9). We can obtain the corresponding membership functions. The membership degrees of x4 are listed as follows:

{m3}^τ(x4) = {x2, x3, x4, x5, x8, x10, x11, x12, x13},   μ_η1(x4) = |{m3}^τ(x4)| / |X| = 9/14 ≈ 0.64;
{m8}^τ(x4) = {x4, x7, x8, x10, x11, x13},   μ_η2(x4) = |{m8}^τ(x4)| / |X| = 6/14 ≈ 0.43;
μ_η3(x4) = max{ |{m3}^τ(x4)| / |X|, |{m8}^τ(x4)| / |X| } = 0.64;
{m3m8}^τ(x4) = {x4, x8, x10, x11, x13},   μ_η4(x4) = |{m3m8}^τ(x4)| / |X| = 5/14 ≈ 0.36.
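The following short Python sketch (ours; it uses only the self-appraisement and test A columns of Table 1 and adopts the simplifying assumptions made above: ρ_m ≡ 1, N_x = 1, and a "larger value = larger degree" ordering for both columns) reproduces the membership degrees of x4 computed above via the counting form of Eq. (9).

```python
# Illustrative sketch (ours) of the counting form of Eq. (9) under the assumptions
# stated in the text; it reproduces the membership degrees of x4 listed above.

X = {                                  # (self-appraisement, test A) columns of Table 1
    "x1": (90, 6),  "x2": (32, 4),  "x3": (67, 6),  "x4": (73, 3),
    "x5": (54, 5),  "x6": (80, 6),  "x7": (78, 1),  "x8": (70, 3),
    "x9": (83, 4),  "x10": (21, 2), "x11": (45, 3), "x12": (56, 4),
    "x13": (50, 3), "x14": (80, 4),
}
M = {"m3": 0, "m8": 1}                 # simple concept -> attribute index

def A_tau(A, x):
    """A^tau(x) = { y in X : for every m in A, x is at least as much m as y }."""
    return {y for y, v in X.items()
            if all(X[x][M[m]] >= v[M[m]] for m in A)}

def mu(concept, x):
    """Coherence membership degree of x in a concept given as a list of conjunctions,
    each conjunction being a set of simple concepts (counting form of Eq. (9))."""
    return max(len(A_tau(A, x)) / len(X) for A in concept)

print(round(mu([{"m3"}], "x4"), 2))          # 0.64
print(round(mu([{"m8"}], "x4"), 2))          # 0.43
print(round(mu([{"m3"}, {"m8"}], "x4"), 2))  # 0.64  (m3 + m8)
print(round(mu([{"m3", "m8"}], "x4"), 2))    # 0.36  (m3 m8)
```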

Since the EI algebra (EM, ∨, ∧) is closed under the logic operations ∨ and ∧ defined by Eqs. (1) and (2), for any fuzzy concepts in EM their membership functions and logic operations ∨, ∧ (or, and) can be determined by Eq. (9), and (EM, ∨, ∧) is a logic system, called the AFS fuzzy logic.

4. Generation of fuzzy rules from AFS decision trees

We use the fuzzy logic operations defined by Eqs. (1) and (2), as well as the membership function expressed by Eq. (9), in the construction of fuzzy decision trees for data sets such as the one shown in Table 1. The resulting classifiers are referred to as AFS decision trees.

4.1. Basic notions

Before we construct the fuzzy decision trees and discuss their rule generation procedure, let us recall some notions and definitions pertaining to the study presented in [14] and [15]. Fig. 1 is an example of a decision tree for the data shown in Table 2.

Fig. 1. An example of a fuzzy decision tree.


Table 2
A collection of training data.

Training sample   V1 = inc (u_j^1 ∈ U1)   V2 = emp (u_j^2 ∈ U2)   credit (y_j ∈ Y)
x1                0.20                    0.15                    0.00
x2                0.35                    0.25                    0.00
x3                0.90                    0.20                    0.00
x4                0.60                    0.50                    0.00
x5                0.90                    0.50                    1.00
x6                0.10                    0.85                    1.00
x7                0.40                    0.90                    1.00
x8                0.85                    0.85                    1.00

As noted, the first four examples belong to the "No credit" category; the other examples belong to the "Yes credit" category.

1. The set of fuzzy variables (or attributes) is denoted by V = {V1, V2, …, Vn}, where Vi is a fuzzy variable over the universe of discourse Ui, i = 1, 2, …, n.
2. For each variable Vi ∈ V, we use the following notation:
   • the value of training sample j is u_j^i ∈ Ui;
   • Di denotes the set of fuzzy terms (i.e., simple concepts) associated with Vi;
   • v_p^i denotes a fuzzy term of the variable Vi (e.g., v_low^inc, as necessary to stress the variable; otherwise v_p alone may be used).
3. The set of fuzzy terms (simple concepts) for the decision variable is denoted by Dc. Each fuzzy term v_k^c ∈ Dc is a fuzzy concept defined over the universe of discourse Y.
4. The set of training examples is X = { x_j | x_j = (u_j^1, u_j^2, …, u_j^n, y_j) }.
5. M is the set of all simple concepts, M = Dc ∪ (∪_{i=1}^{n} Di). (M, τ, X) is the AFS structure and EM is the EI algebra over M. In general, for each pair of fuzzy terms v_p^c, v_q^c ∈ Dc (on the decision attribute Y), v_p^c ≠ v_q^c, μ_{v_p^c ∧ v_q^c}(x) < ε (ε is a small positive number) for any x ∈ X. This implies that the fuzzy terms in Dc implement a fuzzy classification of X into l classes, i.e., X = ∪_{k=1}^{l} X_{v_k^c}, v_k^c ∈ Dc.
6. For each node N of the fuzzy decision tree:
   • F^N denotes the set of fuzzy restrictions on the path from the root to the node N; for example, in Fig. 1, F^5 = {[emp is high], [inc is high]}.
   • V^N is the set of attributes appearing on the path leading to the node N: V^N = { Vi | ∃p ([Vi is v_p^i] ∈ F^N) }.
   • β^N is a fuzzy concept in EM, and μ_{β^N}(x_j) is the membership degree of sample x_j at the node N, where β^N = ∏_{m ∈ {v_p^i | ∃p ([Vi is v_p^i] ∈ F^N)}} m.
   • β_δ^N is the δ-cut (δ ∈ (0, 1)) of the fuzzy set β^N, i.e.,

     β_δ^N = { x ∈ X | μ_{β^N}(x) > δ }.   (11)

   • N|v_p^i denotes the particular child node of node N created by the use of the fuzzy attribute Vi to split N, v_p^i ∈ Di.

   • S_{Vi}^N denotes the set of N's children when Vi ∈ (V − V^N) is used for the split. Note that

     S_{Vi}^N = { N|v_p^i | v_p^i ∈ D_i^N },   D_i^N = { v_p^i ∈ Di | ∃x ∈ X, μ_{β^N ∧ v_p^i}(x) > δ };   (12)

     in other words, some fuzzy terms v_p^i ∈ Di which satisfy μ_{β^N ∧ v_p^i}(x) ≤ δ for any x ∈ X may not be used to create sub-trees.
   • P_{v^c}^N denotes the example count for decision v^c ∈ Dc in node N, where

     P_{v^c}^N = ∑_{j=1}^{|X|} μ_{β^N ∧ v^c}(x_j),   P^N = ∑_{v^c ∈ Dc} P_{v^c}^N,
     P_{v^c}^{N|v_p^i} = ∑_{j=1}^{|X|} μ_{β^N ∧ v_p^i ∧ v^c}(x_j),   P^{N|v_p^i} = ∑_{v^c ∈ Dc} P_{v^c}^{N|v_p^i}.   (13)

     It is important to note that, unless the fuzzy sets are such that the sum of all memberships for any sample is equal to 1, P_{v^c}^N ≠ ∑_{v_p^i ∈ D_i} P_{v^c}^{N|v_p^i}; that is, the membership sum over all children of N can differ from that of N. This is due to the nature of the fuzzy concepts; the total membership value can either increase or decrease while constructing the tree.
   • P^N and I^N denote the total example count and the information measure for node N, where I^N is the standard information content described as

     I^N = −∑_{v^c ∈ Dc} (P_{v^c}^N / P^N) · log(P_{v^c}^N / P^N).   (14)

   • G_{Vi}^N = I^N − I^{S_{Vi}^N} denotes the information gain when using the fuzzy attribute Vi to split N, where

     I^{S_{Vi}^N} = ∑_{v_p^i ∈ D_i^N} ( P^{N|v_p^i} / ∑_{v_p^i ∈ D_i^N} P^{N|v_p^i} ) · I^{N|v_p^i}   (15)

is the weighted information content.

4.2. Construction of AFS decision trees

The construction of the AFS tree consists of the following main design steps.

Discretization of attributes with the use of fuzzy terms. In the classification problem, training data can be either categorical or numerical. When the data are numerical, the attributes need to be transformed (granulated) into some fuzzy terms. In fact, this fuzzification is a process of conceptualization (abstraction), which is often used by people to reduce information overload in the decision-making process. For instance, a numerical salary may be perceived and quantified in linguistic terms such as large, middle and low, whose membership functions can be approximately determined based on experts' opinion or common perception. In our case, we simply assign three fuzzy terms (i.e., simple concepts) to each attribute Vi. The set of fuzzy terms for Vi is Di = {v_1^i, v_2^i, v_3^i}, where the simple concept v_k^i carries the semantics "the value of Vi is closer to the k-th cut point". The first cut point is the minimal value of the attribute, the second one is the mean, and the third one is the maximal value of the attribute. The membership functions of the fuzzy terms are defined by Eq. (9).

Node splitting criterion. The growth process of the tree is guided by the maximum information gain. The information gain of the use of fuzzy attribute Vi to split the current node N is G_{Vi}^N = I^N − I^{S_{Vi}^N}, where I^N and I^{S_{Vi}^N} are defined by Eqs. (14) and (15). The fuzzy attribute Vi which exhibits the maximum information gain at the current node N is applied to split N. The children of the node N form the set S_{Vi}^N (defined by Eq. (12)).

Stopping condition. First, a given node N can be expanded only if the samples in the set β_δ^N given by Eq. (11) are not all in the same class; otherwise, the node is not expanded further. The second stopping condition is self-evident: the current node N can be expanded only if V^N ≠ V, where V^N is the set of attributes applied from the root to the node N and V is the set of all attributes. The final termination criterion monitors the information gain along the nodes of the tree as it is being built: when the maximum information gain at the current node N is negative, or the set of N's children is empty, i.e., S_{Vi}^N = ∅, we stop expanding the node. The first two stopping criteria are a sort of precondition: if they are not satisfied, we stop expanding the node. The third one comes in the form of a postcondition: to check whether it is satisfied, we have to expand the node first and then determine its value; if it is not satisfied, we backtrack and refuse to expand this particular node. Table 3 summarizes the overall flow of the algorithm.
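The discretization step can be illustrated by the short Python sketch below (ours). The three cut points follow the description above; the comparator encoding "closer to the k-th cut point" as an ordering of samples is our reading of the simple-concept semantics, from which memberships would then be obtained via Eq. (9).

```python
def cut_points(values):
    """The three cut points of an attribute used to induce its fuzzy terms
    v_1 (small), v_2 (mid), v_3 (large): minimum, mean and maximum."""
    return (min(values), sum(values) / len(values), max(values))

def closer_to_cut(k, cuts):
    """Comparator for the simple concept v_k ("the value of Vi is closer to the
    k-th cut point"): x is at least as much v_k as y when x's value is at least
    as close to cuts[k] as y's value.  This ordering is our assumption."""
    def at_least(x_val, y_val):
        return abs(x_val - cuts[k]) <= abs(y_val - cuts[k])
    return at_least
```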


Table 3
Construction of AFS decision trees.

//X denotes the training data set with l classes, X = ∪_{k=1}^{l} X_{v_k^c}; V is the set of fuzzy variables (attributes);
//N denotes the current node; δ is the given threshold; an AFS decision tree starts with N = ∅;
1. AFSDT = BuildTree(X, V, N, δ)
2.   calculate the information content I^N of node N;
3.   calculate the information gain G_{Vi}^N for each Vi from V;
4.   Vmax = arg max_{Vi ∈ V} {G_{Vi}^N};
5.   V = V \ Vmax;   //delete Vmax from V
6.   //check each child and update the current node
7.   for k = 1:3   //three fuzzy terms on each attribute
8.     if (∃x ∈ X, μ_{β^N ∧ v_max^k}(x) > δ)
9.       N = N ∧ v_max^k
10.      AFSDT = BuildTree(X, V, N, δ)
11.    end
12.  end
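A compact Python sketch of the tree-growing procedure of Table 3 is given below. It is our own simplified rendition, not the authors' Matlab code: the function `mu(conj, j)` supplying the membership degree of training sample j in a conjunction of fuzzy terms is passed in by the caller (in the AFS classifier these degrees come from Eq. (9)), so no particular t-norm is hard-wired into the sketch.

```python
import math

# Simplified sketch (ours) of the BuildTree recursion of Table 3.  mu(conj, j) returns
# the membership degree of sample j in the conjunction `conj` (a tuple of fuzzy terms);
# mu((), j) is taken to be 1 (no restriction at the root).

def info_content(P):
    """Eq. (14): information content of a node from its per-class example counts."""
    total = sum(P.values())
    if total == 0:
        return 0.0
    return -sum((p / total) * math.log(p / total) for p in P.values() if p > 0)

def class_counts(mu, conj, samples, labels, classes):
    """Eq. (13): fuzzy example count of each class at the node described by `conj`."""
    return {c: sum(mu(conj, j) for j in samples if labels[j] == c) for c in classes}

def build_tree(mu, samples, labels, classes, terms_per_attr, attrs, conj, delta):
    covered = [j for j in samples if not conj or mu(conj, j) > delta]  # delta-cut, Eq. (11)
    if not covered or len({labels[j] for j in covered}) == 1 or not attrs:
        return {"leaf": covered}                      # first two stopping conditions
    node_info = info_content(class_counts(mu, conj, samples, labels, classes))
    best, best_gain = None, 0.0
    for attr in attrs:
        children = [conj + (v,) for v in terms_per_attr[attr]
                    if any(mu(conj + (v,), j) > delta for j in samples)]   # D_i^N, Eq. (12)
        if not children:
            continue
        child_counts = [class_counts(mu, c, samples, labels, classes) for c in children]
        total = sum(sum(P.values()) for P in child_counts)
        if total == 0:
            continue
        weighted = sum(sum(P.values()) / total * info_content(P) for P in child_counts)
        gain = node_info - weighted                   # G_Vi^N, with Eq. (15)
        if gain > best_gain:                          # require a positive gain
            best, best_gain = (attr, children), gain
    if best is None:                                  # third stopping condition
        return {"leaf": covered}
    attr, children = best
    rest = [a for a in attrs if a != attr]
    return {"split_on": attr,
            "children": {c[-1]: build_tree(mu, samples, labels, classes,
                                           terms_per_attr, rest, c, delta)
                         for c in children}}
```

The δ-cut of Eq. (11) appears as the `covered` list, and the children of a node are restricted to the set D_i^N of Eq. (12).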

4.3. Rule extraction and pruning

Once the AFS decision tree has been built for the given threshold δ, we can proceed with an extraction of the rule base.

4.3.1. Rule extraction

Each path starting from the root and traversing down to a classification (terminal) node is converted into a rule. Suppose that the rules r1, r2, …, rt are extracted from the fuzzy decision tree; the antecedent part of rule ri is a fuzzy concept — the conditions leading to the terminal node — i.e., β^{Ni} ∈ EM, where Ni is the terminal node of the corresponding path. The class labels of the rules are essential to the classifier. The class labeling methods used in the existing papers for fuzzy decision trees, which consider the example counts of the different classes in the leaf node, are not suitable for the decision tree discussed here, especially for imbalanced classes. Instead, each sample is assigned to exactly one leaf node, namely the one at which the sample attains the maximal membership degree; then we label the leaf node with the class of the majority of the samples covered by it. Consider the following scheme for the fuzzy concept β^{Ni} representing the antecedent part of rule ri. Let us introduce the notation

A_{ri} = { x | x ∈ X, μ_{β^{Ni}}(x) ≥ μ_{β^{Nj}}(x), j = 1, 2, …, t, j ≠ i },

where |A_{ri}| stands for the number of training samples covered by rule ri. However, A_{ri} = ∅ does not imply that rule ri is incorrect; it only states that the contribution of rule ri to the classification process is not significant in the current rule set. Its importance to the classification may increase in the process of pruning. Thus, the class label of rule ri is computed as arg max_{1≤k≤l} |X_{ri} ∩ X_{v_k^c}|, where X_{v_k^c} is the set of the samples of the k-th class and

X_{ri} = A_{ri} if A_{ri} ≠ ∅, and X_{ri} = β_δ^{Ni} otherwise,   (16)

where β_δ^{Ni} is defined by Eq. (11) for the given threshold δ. Table 4 presents the algorithm of rule extraction.

Table 4
Algorithm of rule extraction.

//X denotes the training data set with l classes, X = ∪_{k=1}^{l} X_{v_k^c};
//AFSDT is an AFS decision tree with t terminal nodes;
1. rules = RuleExtraction(X, AFSDT, δ)
2. for i = 1:t
3.   A_{ri} = { x | x ∈ X, μ_{β^{Ni}}(x) ≥ μ_{β^{Nj}}(x), j = 1, 2, …, t, j ≠ i }
4. end
5. for i = 1:t
6.   if (A_{ri} = ∅)
7.     class label of ri = arg max_{k=1,…,l} |β_δ^{Ni} ∩ X_{v_k^c}|;   //β_δ^{Ni} is defined by (11)
8.   else
9.     class label of ri = arg max_{k=1,…,l} |A_{ri} ∩ X_{v_k^c}|;
10.  end
11. end

4.3.2. Pruning the rule-base

The rules directly extracted from the AFS decision tree may include redundant structures as well as poorly performing rules, which should be removed from the rule-base to enhance the overall performance of the classifier and improve its efficiency. In what follows, we prune the rule-base by making use of the available training data (a code sketch follows Table 5 below):

1. Remove each rule in turn from the rule-base, and classify the training data using the remaining rules.
2. Delete the rule whose removal yields the maximal increase of accuracy on the training data.
3. Repeat steps (1)–(2); terminate the pruning if the resulting pruned rule-base becomes worse than the original one when applied to the training data.
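The labeling step of Table 4 can be sketched as follows (again a simplified Python rendition of ours; `mu_leaf[i][j]` stands for μ_{β^{Ni}}(x_j), the membership of training sample x_j in the antecedent of rule r_i, and `beta_delta[i]` is the δ-cut of Eq. (11) for terminal node Ni).

```python
from collections import Counter

# Sketch (ours) of the labeling step in Table 4 and of Eq. (16).

def label_rules(mu_leaf, beta_delta, labels):
    t, n = len(mu_leaf), len(labels)
    A = [set() for _ in range(t)]          # A_ri: samples whose maximal degree is at leaf i
    for j in range(n):
        degrees = [mu_leaf[i][j] for i in range(t)]
        A[degrees.index(max(degrees))].add(j)        # ties broken towards the first leaf
    rule_class = []
    for i in range(t):
        covered = A[i] if A[i] else beta_delta[i]    # Eq. (16)
        votes = Counter(labels[j] for j in covered)
        rule_class.append(votes.most_common(1)[0][0] if votes else None)
    return rule_class
```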


Fig. 2. An AFS decision tree with δ = 0.63 obtained for the Wine data set.

4. Using the logic operation "∨" defined by Eq. (1), sum all fuzzy concepts representing the antecedents of the rules with the same consequent. Thus, for a training data set with l classes, we can represent the rule-base with l fuzzy concepts ξ1, ξ2, …, ξl ∈ EM. For each class v_k^c ∈ Dc, k = 1, 2, …, l, a rule can be obtained and read as:

Rule k: If x is ξk, then x belongs to the class v_k^c, k = 1, 2, …, l.

Example 2. As an illustration, we complete the pruning process for the tree built for the Wine data set in Experiment 1. The original tree is shown in Fig. 2. The rules extracted directly from the AFS decision tree by the rule extraction algorithm read as follows:

r1: If x is m19m28, then x belongs to class 2;
r2: If x is m19m29, then x belongs to class 3;
r3: If x is m19m30, then x belongs to class 3;
r4: If x is m20m28, then x belongs to class 2;
r5: If x is m20m29m39, then x belongs to class 1;
r6: If x is m21m37, then x belongs to class 2;
r7: If x is m21m38m1, then x belongs to class 2;
r8: If x is m21m38m3, then x belongs to class 1;
r9: If x is m21m39m2m29, then x belongs to class 1;
r10: If x is m21m39m2m30, then x belongs to class 1;
r11: If x is m21m39m3, then x belongs to class 1.

The classification rate obtained for the training set with the rule-base shown above is 92.3%. We prune the rule-base making use of the available training data. The variability of the classification rate on the training set during the pruning process is illustrated in Table 5. After pruning, the classification rate for the training set with the pruned rule-base increased to 97.9%. Then all fuzzy concepts representing the antecedents of the rules with the same consequent are aggregated into a single fuzzy concept in EM by the logic operation "∨" defined by Eq. (1). The pruned AFS decision tree is shown in Fig. 3 and the rule-base after pruning comes in the form:

Rule 1: If x is ξ1, then x belongs to class 1;
Rule 2: If x is ξ2, then x belongs to class 2;
Rule 3: If x is ξ3, then x belongs to class 3;

where ξ1 = m21m39m3, ξ2 = m19m28 + m21m37, ξ3 = m19m29 + m19m30.

Table 5
Variability of the classification rate reported for the training set.

Pruning process   1     2     3     4     5     6
Accuracy (%)      96.5  97.2  97.9  97.9  97.9  97.9
Deleted rule      r4    r5    r9    r7    r8    r10
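The greedy pruning of Section 4.3.2 can be sketched as follows. This is our own simplified rendition: `accuracy(rules)` is assumed to classify the training data with the given list of rules and return the classification rate, and the loop stops as soon as no removal at least maintains the current training accuracy (a slight simplification of the stated stopping rule), which is consistent with the repeated 97.9% entries in Table 5.

```python
# Sketch (ours) of the greedy rule pruning of Section 4.3.2.

def prune_rule_base(rules, accuracy):
    current = list(rules)
    current_acc = accuracy(current)
    while len(current) > 1:
        candidates = [(accuracy(current[:i] + current[i + 1:]), i)
                      for i in range(len(current))]
        best_acc, best_i = max(candidates)
        if best_acc < current_acc:       # pruned rule-base would become worse: stop
            break
        del current[best_i]              # e.g. r4, r5, r9, r7, r8, r10 in Table 5
        current_acc = best_acc
    return current
```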


Fig. 3. AFS decision tree with δ = 0.63 obtained for the Wine data set (after pruning).

4.4. Determining the optimal threshold δ

The amount of 'trivial detailed information' included in the decision trees can be controlled by the value of the threshold δ ∈ (0, 1). The larger the value of δ, the less overlap occurs between the membership functions; in other words, the larger the value of δ, the more 'trivial detailed information' becomes filtered out (ignored). In the procedure of building AFS decision trees, different values of δ produce different trees. In fact, following Eq. (12), we know that using a smaller δ in the growth of a tree yields a larger tree (consisting of more nodes). The performance of the tree depends on the threshold δ. Here, the optimal threshold δ is the one for which the pruned rule-base induced by the corresponding AFS decision tree has the highest accuracy on the testing data. However, it is very difficult to determine the optimal threshold δ on the basis of the existing training data. Thus, in this study, we compute a sub-optimal value of the threshold δ using the Fitness Index F(δ) with a genetic algorithm (GA). More specifically, the sub-optimal threshold δ maximizes the following Fitness Index computed for the training data:

F(δ) = |X| · Classification rate − δ · Number of nodes,   (17)

where |X| is the number of training samples, "Classification rate" is the classification accuracy reported on the training samples for the rule-base obtained for threshold δ via the tree-building algorithm of Table 3 and the rule extraction algorithm of Table 4, and "Number of nodes" is the total number of nodes of the pruned tree. In general, high accuracy achieved on the training data does not imply that high accuracy on the testing data will be retained, since the tree could very likely be overfitted. We estimate the optimal threshold δ by Eq. (17) with the use of the genetic algorithm to avoid a possible overfitting effect.

Example 3. We illustrate the proposed scheme using the data shown in Table 1. We take all 14 samples as the training set, X = {x1, x2, …, x14}, and on each attribute Vi two fuzzy terms are defined. The set of fuzzy terms for attribute Vi is Di = {v_small^i, v_large^i}, and the set of fuzzy terms for the decision variable (decision attribute) is Dc = {v_credit^c, v_no-credit^c}. Let M = {m1, m2, …, m20} be the set of simple concepts on U, where m_{2i−1} = v_small^i with the semantics "the value on Vi is small" and m_{2i} = v_large^i with the semantics "the value on Vi is large" (i = 1, 2, …, 9), and m19 = v_no-credit^c, m20 = v_credit^c. Now, we can establish the AFS structure (M, τ, X), where τ is defined by Eq. (3), while the membership functions of the fuzzy concepts in EM are defined by Eq. (9) in Theorem 3, in which, for any m ∈ M, ρ_m(x) = 1 for any x belonging to m to some degree, ρ_m(x) = 0 for x not belonging to m, and N_x = 1, i.e., each x is observed once. To start with, we choose a threshold level δ = 0.8 (as will be shown later on, the value of this threshold will be optimized). The root node embraces all the training samples without any restrictions, that is, β^{root} = ∅. By using the node splitting criterion, the 5th attribute "annual payment" is selected to split the root node, and the children of the root node form the set {(root|v_small^payment), (root|v_large^payment)}. By making use of the stopping condition, we obtain the decision tree shown in Fig. 4. Node 1 contains 5 samples and all of them belong to the "no-credit" category. There are 3 samples at node 2, which belong to the "credit" category. Six samples were left out: they were assigned neither to node 1 nor to node 2. This implies that these samples are not typical and significant enough for the predefined value of the threshold (δ = 0.8). They may be included in the AFS fuzzy decision tree when smaller values of the threshold δ are considered.

Fig. 4. The AFS fuzzy decision tree with δ = 0.8.


Fig. 5. The first layer of the AFS fuzzy decision tree with δ = 0.43.

In what follows, we show how to extract rules from the tree, determining the class labels of the rules whose antecedents are represented by the fuzzy concepts corresponding to node 1 and node 2. Using Eq. (16), we calculate X_{r1} = {x1, x2, x5, x10, x11, x12, x14} and X_{r2} = {x3, x4, x6, x7, x8, x9, x13}. The class label of r1 is computed as arg max_{k ∈ {no-credit, credit}} |X_{r1} ∩ X_{v_k^c}| = no-credit and the class label of r2 is determined as arg max_{k ∈ {no-credit, credit}} |X_{r2} ∩ X_{v_k^c}| = credit, where X_{v_credit^c} and X_{v_no-credit^c} are the sets of credit samples and no-credit samples (refer to Table 1). After pruning, we arrive at Rule 1 and Rule 2:

Rule 1: If x is ξ1, then x belongs to the class no-credit;
Rule 2: If x is ξ2, then x belongs to the class credit;

where ξ1 = v_small^payment and ξ2 = v_large^payment.

Next, let the threshold assume a lower value, say δ = 0.43. The root node is again split by the 5th attribute "annual payment", and the children of the node form the set {(root|v_small^payment), (root|v_large^payment)}. In Fig. 6, node 1 contains 8 samples having membership degrees larger than δ = 0.43, all of which belong to the no-credit category. According to the stopping condition, the decision tree stops growing at this node. Unlike node 1, the tree at node 2 continues to grow: there are 8 samples of different classes falling into node 2, hence this node is split further. The 3rd attribute "self-appraisement" is selected by the node splitting criterion, and the child of node N2 is (N2|v_large^self-appraisement) (Fig. 5). There are 6 samples falling into node 3, and these samples belong to the "credit" category. Given the stopping condition, the growth of the decision tree is terminated. After rule extraction and pruning, we obtain two rules, Rule 1 and Rule 2:

Rule 1: If x is ξ1, then x belongs to the class no-credit;
Rule 2: If x is ξ2, then x belongs to the class credit;

where ξ1 = v_small^payment and ξ2 = v_large^payment v_large^self-appraisement. The accuracy of the tree reported on the data shown in Table 1 is 100% (Table 6 lists the membership degrees of the samples for ξ1 and ξ2). Comparing the situation where the threshold δ = 0.8 with the case where δ = 0.43, we note that the level of detailed (more specific) information included in the AFS decision trees can be effectively controlled by the value of this threshold. Now we use the Fitness Index (17), optimized with the genetic algorithm, to estimate the optimal value of the threshold δ. The plot of this index is displayed in Fig. 7; it clearly shows that δ = 0.43 leads to the maximization of this index.
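The threshold selection driven by the Fitness Index (17) can be sketched as below. The paper optimizes F(δ) with a genetic algorithm over [0.4, 0.8]; for brevity this sketch (ours) simply scans a grid of candidate thresholds. `train_and_prune(delta)` is a placeholder assumed to build the AFS tree for the given δ (Table 3), extract and prune the rules (Section 4.3), and return the training classification rate together with the number of nodes of the pruned tree.

```python
# Sketch (ours) of threshold selection by the Fitness Index of Eq. (17),
# using a grid scan in place of the genetic algorithm used in the paper.

def select_threshold(train_and_prune, n_samples, grid=None):
    if grid is None:
        grid = [0.40 + 0.01 * k for k in range(41)]   # candidate thresholds in [0.4, 0.8]
    best_delta, best_fitness = None, float("-inf")
    for delta in grid:
        rate, n_nodes = train_and_prune(delta)
        fitness = n_samples * rate - delta * n_nodes  # F(delta), Eq. (17)
        if fitness > best_fitness:
            best_delta, best_fitness = delta, fitness
    return best_delta, best_fitness
```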

Fig. 6. The AFS fuzzy decision tree with δ = 0.43.


For comparison, the C4.5 decision tree produced the following rules:

Rule 1: If annual payment ≤ 20, then x is no-credit;
Rule 2: If annual payment > 20, then x is credit.

The classification accuracy for the data shown in Table 1 is also 100%. One can observe that the rules extracted by C4.5 are similar to the rules extracted from the AFS decision tree with δ = 0.8 on the data shown in Table 1.

4.5. Inference of decision assignment and associated confidence degrees

We should note that for each concept ξi ∈ EM representing the antecedent of Rule i, the universe of discourse of its membership function defined by Eq. (9) is the training set X. In order to predict the class label of new samples which are not included in X, we need to express the fuzzy concept ξ over the entire input space U1 × U2 × … × Un ⊆ R^n (X ⊆ U1 × U2 × … × Un). In what follows, for each fuzzy concept ξ ∈ EM, we expand its universe of discourse from X to U1 × U2 × … × Un. For each x = (u^1, u^2, …, u^n) ∈ U1 × U2 × … × Un and ξ = ∑_{i∈I}(∏_{m∈A_i} m) ∈ EM, the upper and lower bounds of the membership function of the fuzzy concept ξ are defined over U1 × U2 × … × Un as follows:

μ_ξ^U(x) = sup_{i∈I} ( inf_{g ∈ U_{A_i}^x} μ_{A_i}(g) ),   μ_ξ^L(x) = sup_{i∈I} ( sup_{g ∈ L_{A_i}^x} μ_{A_i}(g) ),   (18)

where U_{A_i}^x ⊆ X and L_{A_i}^x ⊆ X are defined as

U_{A_i}^x = { x_j ∈ X | (x_j, x) ∈ R_m, ∀m ∈ A_i },   L_{A_i}^x = { x_j ∈ X | (x, x_j) ∈ R_m, ∀m ∈ A_i };

here R_m is the binary relation of the simple concept m defined by Definition 2. We call μ_ξ^L(x) the lower bound of the membership function of ξ, and μ_ξ^U(x) serves as the upper bound of the membership function of ξ. We can easily prove the following relationships:

• μ_ξ^U(x) ≥ μ_ξ^L(x) for each x ∈ U1 × U2 × … × Un;
• μ_ξ^U(x) = μ_ξ(x) = μ_ξ^L(x) for each x ∈ X.

By virtue of Eq. (18), we can expand the fuzzy concept ξ = ∑_{i∈I}(∏_{m∈A_i} m) ∈ EM from the universe of discourse X to the universe of discourse U1 × U2 × … × Un. Therefore, with the fuzzy rule-base we can establish fuzzy-inference systems whose input space is U1 × U2 × … × Un. The membership functions μ_ξ^U(x) and μ_ξ^L(x) depend on the distribution of the training examples and on the rules of the AFS fuzzy logic. When we are provided with a new pattern x ∈ U1 × U2 × … × Un with an unknown class label, we calculate the membership degree μ_ξ^L(x) by Eq. (18), and x is assigned to the class arg max_{v_k^c ∈ Dc} { μ_{ξk}^L(x) }, k = 1, 2, …, l. Furthermore, based on the training samples, the confidence degree of the membership degree μ_ξ(x) estimated by μ_ξ^L(x) is defined as follows:

C_ξ(x) = 1 − ( μ_ξ^U(x) − μ_ξ^L(x) ).   (19)

The confidence degree C_ξ(x) quantifies the confidence we associate with μ_ξ^L(x), the estimate of the membership degree of x in ξ, x ∈ U1 × U2 × … × Un. For a sample x, the closer the upper bound of the membership function of ξ is to the lower bound, the larger the value of C_ξ(x) we obtain. In fact, the value of C_ξ(x) depends on how many training samples are similar to x with respect to the fuzzy concept ξ. A larger value of C_ξ(x) advises us to trust μ_ξ^L(x) as the membership degree of x in ξ. In particular, if C_ξ(x) = 1, then there exists a training sample x0 such that the values of both x0 and x on the attributes associated with ξ are equal. Each testing sample x is assigned to the class arg max_{v_k^c ∈ Dc} { μ_{ξk}^L(x) }, k = 1, 2, …, l, which is determined by the estimates μ_{ξk}^L(x) of the membership degrees of x in the ξk. In order to achieve a high-confidence prediction of the class label of x, every confidence degree C_{ξk}(x) of μ_{ξk}^L(x), k = 1, 2, …, l, has to be high. Thus, the reliability of the classification result for each testing sample can be quantified by the confidence degrees. In practice, we can refuse to classify testing samples whose confidence degree for some μ_{ξk}^L(x) is low; instead, we can acquire more information to produce high confidence levels.
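For a single conjunction A (one term of ξ), the bounds of Eq. (18) and the confidence degree of Eq. (19) can be computed as in the following sketch (ours). `at_least(m, a, b)` is assumed to encode the binary relation R_m of Definition 2 (True when (a, b) ∈ R_m), `mu_train[j]` holds the membership degree μ_A(x_j) of training sample x_j in the conjunction, and for a concept with several conjunctions the bounds are the suprema of the values computed below over those conjunctions. The conventions adopted for empty U_A^x or L_A^x are our assumption.

```python
# Sketch (ours) of Eqs. (18) and (19) for a single conjunction A of simple concepts.

def bounds_and_confidence(x, train, mu_train, terms, at_least):
    U = [j for j, xj in enumerate(train) if all(at_least(m, xj, x) for m in terms)]
    L = [j for j, xj in enumerate(train) if all(at_least(m, x, xj) for m in terms)]
    mu_upper = min((mu_train[j] for j in U), default=1.0)   # inf over U_A^x (empty -> 1)
    mu_lower = max((mu_train[j] for j in L), default=0.0)   # sup over L_A^x (empty -> 0)
    confidence = 1.0 - (mu_upper - mu_lower)                # Eq. (19)
    return mu_lower, mu_upper, confidence
```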

Table 6
Membership degree of samples belonging to ξ1 and ξ2.

        x1    x2    x5    x10   x11   x12   x13   x14   x3    x4    x6    x7    x8    x9
μξ1(x)  1.00  1.00  0.64  1.00  1.00  1.00  0.50  0.57  0.36  0.21  0.43  0.14  0.29  0.00
μξ2(x)  0.00  0.00  0.29  0.00  0.00  0.00  0.29  0.43  0.50  0.64  0.57  0.71  0.57  0.93

Fig. 7. Fitness F(δ) of the AFS fuzzy decision trees versus the threshold δ; the values of the threshold under consideration are in the [0.4, 0.8] interval.

5. Experimental studies

In this section, 28 data sets are used in a series of experiments. They come from the UCI Machine Learning Repository (http://www.ics.uci.edu/~mlearn/MLRepository.html). The description of the pertinent data sets is provided in Table 7. In the experiments, for each data set a five-fold cross validation is carried out. All membership functions of the fuzzy concepts in EM are determined by Eq. (9) in Theorem 3, in which, for any m ∈ M, ρ_m(x) = 1 for any x belonging to m to some degree, ρ_m(x) = 0 for x not belonging to m, and N_x = 1, i.e., each x is observed once. On each attribute Vi, three fuzzy terms are specified in the experiments. The code of the proposed approach, written in Matlab, is available at http://afsdtcode.sourceforge.net.

5.1. Experiment 1

In this experiment, we present a detailed comparative analysis with the C4.5 decision tree (C4.5) [46], the nearest neighbor classifier (NN) [45], J. Platt's sequential minimal optimization based support vector classifier (SVM) [44], FDTs [18], FS-DT [17], FARC-HD [47], FURIA [48] and the AFS decision tree (AFSDT) over these data sets. The C4.5, KNN and SVM classifiers are implemented using the Weka machine learning toolkit [53]. FARC-HD is available in the KEEL software tool [54].

Table 7
Description of data sets coming from the UCI repository and used in the experiments.

No.  Data set       Attribute type  Size  Missing values  Number of attributes
1    Appendicitis   Continuous      106   No              7
2    Australian     Continuous      690   No              14
3    Acute1         Mixed           120   No              6
4    Acute2         Mixed           120   No              6
5    Auto           Continuous      398   Yes             7
6    Balance        Continuous      625   No              4
7    Breast-Cancer  Mixed           286   Yes             9
8    Breast-W       Continuous      699   Yes             9
9    Car            Discrete        1728  No              6
10   Dermatology    Mixed           366   Yes             33
11   Glass          Continuous      214   No              9
12   Haberman       Continuous      306   No              32
13   Heart          Continuous      270   No              13
14   Hepatitis      Mixed           155   Yes             19
15   Ionosphere     Continuous      351   No              34
16   Iris           Continuous      150   No              4
17   Kr-vs-Kp       Discrete        3196  No              36
18   Liver          Continuous      345   No              6
19   M_mass         Discrete        961   Yes             5
20   Parkinsons     Continuous      197   No              22
21   Pima           Continuous      768   No              8
22   Statlog_GC     Continuous      1000  No              24
23   Tae            Continuous      151   No              5
24   Transfusion    Continuous      748   No              4
25   Waveform       Continuous      5000  No              40
26   Wdbc           Continuous      569   Yes             30
27   Wine           Continuous      178   No              13
28   Wpbc           Continuous      198   Yes             33

Fig. 8. The membership functions of ξ1 obtained in the experiments carried out for the Breast-W data set (Class 1); the horizontal axis indexes the testing data of Breast-W (samples 1–458: Class 1, 459–699: Class 2), the vertical axis shows the membership degrees on ξ1.

A Java implementation of FURIA can be downloaded from http://www.uni-marburg.de/fb12/kebi/research/. All the classifiers are run with the default settings specified in the respective toolkits; for example, the minimal number of instances per leaf is set to 2 and the pruning level is set to 25% for C4.5, and the number of neighbors is set to 1 for KNN. The results of FDTs and FS-DT come from our own implementations of these algorithms, with the parameters set to α = 0.35, β = 0.85 for FDTs and α = 0.1, β = 0.85, purity = 0.85 for FS-DT, as suggested by the experiments. The Matlab codes of the FDTs, FS-DT and AFSDT approaches are available at http://afsdtcode.sourceforge.net.

We illustrate the performance of the AFSDT classifier on the Wisconsin breast cancer (Breast-W) data. The data set consists of 699 samples and each sample is described by nine attributes. X is the training set. The set of fuzzy terms for attribute Vi is Di = {v_small^i, v_mid^i, v_large^i}, and the set of fuzzy terms for the decision variable (class attribute) is Dc = {v_benign^c, v_malignant^c}. Let M = {m1, m2, …, m29} be the set of simple concepts, where m_{3i−2} = v_small^i with the semantics "the value on Vi is small", m_{3i} = v_large^i with the semantics "the value on Vi is large", and m_{3i−1} = v_mid^i with the semantics "the value is closer to the mean of Vi" (i = 1, 2, …, 9), and m28 = v_benign^c, m29 = v_malignant^c. The trees induced in the five-fold cross validation by the sub-optimal thresholds δ obtained by the Fitness Index (17) have 1006.4 terminal nodes (average over the five experiments), so we extract a rule-base with 1006.4 rules (on average). The average classification rate reported for the training set is 97.32% and the average accuracy on the testing set is 95.28%. We then prune the rule base by invoking the algorithm of Section 4.3.2. The number of rules is reduced to 13.00 (on average); in this case, the percentage of correct classification on the training set by the rule-base increases to 97.39% (on average) and the average accuracy on the testing set increases to 95.85%. The membership degrees for the fuzzy concepts ξ1 and ξ2, which are the antecedents of the rules for classes 1 and 2, respectively, are illustrated in Figs. 8 and 9. Fig. 10-a shows the distribution of the testing samples of the five-fold cross validation over different regions of the confidence degrees of the estimated membership degrees of the samples in ξ1 and ξ2; Fig. 10-b shows the corresponding distribution of the misclassification rate of the testing samples over these regions. From Fig. 10-a, one can observe that the confidence degrees (defined by Eq. (19)) of the membership degrees (defined by Eq. (18)) of most of the testing samples (499 of 699) in ξ1, ξ2 are larger than 0.9, and just 5 misclassified testing samples fall into that region; the misclassification rate in that region is 1.00%. This implies that the classification result for a testing sample x with C_ξ1(x) ≥ 0.9 and C_ξ2(x) ≥ 0.9 comes with a high level of confidence.

Fig. 9. The membership functions of ξ2 obtained in the experiments carried out for the Breast-W data set (Class 2); the horizontal axis indexes the testing data of Breast-W (samples 1–458: Class 1, 459–699: Class 2), the vertical axis shows the membership degrees on ξ2.

Fig. 10. a) The distribution of the number of testing samples falling into different square regions of the confidence degrees of the estimated membership degrees of the samples belonging to ξ1, ξ2; b) the distribution of the misclassification rate of the testing samples falling into different square regions of the confidence degrees of the estimated membership degrees of the samples belonging to ξ1, ξ2.

The testing error for each data set is reported in Table 8, where AFS1 refers to the results of the AFS fuzzy decision tree under the sub-optimal threshold obtained by the Fitness Index (17) optimized with the aid of the genetic algorithm, and AFS2 denotes the results obtained for the optimal value of the threshold. Table 8 shows that the AFS decision tree can produce high accuracy, and that the AFS decision tree, SVM, FARC-HD and FURIA are better than C4.5, FDTs, FS-DT and KNN on most of the data sets. Table 9 shows the values of the threshold levels (both optimal and sub-optimal). The genetic algorithm has been implemented in the Matlab Genetic Algorithm and Direct Search Toolbox. The default settings have been used; for example, the population size is set to 20, the crossover rate is set to 0.8 and the mutation rate is set to 0.2. The search interval is [0.4, 0.8] and the precision is set to 0.01. The results presented in Table 8 give some insight into the performance of the algorithms. However, those results do not provide enough support for drawing a strong conclusion in favor of or against any of the studied methods. Demsar discusses in [55] a statistical analysis of such results. Following Demsar's recommendation, we first tested whether there is any significant difference among the classifiers studied. Demsar recommends the use of the Friedman test to compare several classifiers

Table 8
Percentage of misclassified cases for the testing data.

Data set      C4.5    FDTs    SVM     KNN     FS-DT   FARC-HD  FURIA   AFS1    (AFS2)
App.          14.15   16.90   13.21   20.75   16.98   14.16    13.21   14.13   (11.31)
Aus.          15.07   30.29   14.64   18.84   14.50   14.35    14.64   13.78   (12.76)
Aute1          0.00    6.67    0.00    0.00   10.00    0.00     0.00    0.00    (0.00)
Aute2          0.00   35.00    0.00    0.00   11.67    0.00     0.00    0.00    (0.00)
Auto          25.90   30.92   33.17   32.16   28.40   26.38    20.35   25.15   (21.38)
Balance       22.40   28.32   11.68   22.72   48.95   13.44    18.88    6.41    (6.09)
Breast-C      25.87   27.64   30.07   31.82   25.55   29.90    28.32   31.14   (26.59)
Breast-W       6.01    7.87    3.15    4.29    7.44    4.29     4.86    4.15    (2.86)
Car            8.45   30.03    6.77   22.63   29.98   14.87     7.64   14.18   (13.90)
Der.           5.74   36.61    3.83    6.01   41.00    8.19     7.38    4.91    (4.37)
Glass         34.58   44.86   43.46   30.37   49.53   35.06    30.84   33.65   (29.91)
Haberman      29.74   26.76   27.12   33.99   27.75   26.14    26.80   26.48   (24.15)
Heart         24.07   18.89   16.29   27.40   24.81   17.14    22.59   19.63   (17.04)
Hepatitis     43.86   20.00   15.48   20.00   20.00   18.71    22.58   34.19   (32.3)
Ionosphere    13.54   15.10   11.37   13.96   12.55   11.11    10.26   13.12    (7.71)
Iris           4.00    6.67    3.33    6.00    7.33    5.33     5.33    2.67    (2.00)
Kr-vs-Kp       0.72   32.76    4.54    9.58    5.91    3.94     0.69    1.66    (1.66)
Liver         31.30   35.07   41.74   38.84   34.78   30.14    33.04   30.72   (25.22)
M_Mass        18.83   20.29   20.92   26.85   18.00   19.98    17.38   18.52   (16.86)
Parkinsons    15.90   14.36   13.84    3.59   24.62    9.74    10.26    9.74    (7.69)
Pima          27.95   26.43   22.79   29.69   28.77   24.49    25.65   24.09   (22.65)
Statlog_GC    26.70   28.30   24.00   29.50   30.30   27.50    26.90   26.85   (26.65)
Tae           43.71   54.37   47.68   38.41   58.24   45.08    52.98   43.68   (35.1)
Transfusion   23.26   23.66   23.80   31.68   23.80   21.92    21.52   22.06   (21.25)
Waveform      24.40   31.82   13.72   26.84   29.90   16.48    17.76   23.02   (21.96)
Wdbc           6.15    8.79    2.11    4.92   37.26    4.57     6.50    4.04    (2.63)
Wine           9.55   13.46    1.69    3.93    9.51    6.19     7.30    2.79    (2.22)
Wpbc          27.78   24.24   23.74   31.31   31.29   25.76    27.78   22.18   (17.67)
Average       18.92   24.86   16.93   20.22   25.32   16.96    17.19   16.89   (14.91)


Table 9
Values of sub-optimal and optimal threshold levels.

                 AFS1 (sub-optimal threshold δ)       AFS2 (optimal threshold δ)
Data set         1     2     3     4     5            1     2     3     4     5
Appendicitis     0.79  0.79  0.78  0.77  0.62         0.40  0.40  0.74  0.75  0.80
Australian       0.79  0.80  0.80  0.70  0.79         0.74  0.80  0.59  0.70  0.43
Aute1            0.40  0.40  0.40  0.40  0.40         0.40  0.40  0.40  0.40  0.40
Aute2            0.40  0.40  0.40  0.40  0.40         0.40  0.40  0.40  0.40  0.40
Auto             0.62  0.67  0.57  0.61  0.61         0.74  0.73  0.73  0.60  0.68
Balance          0.61  0.80  0.60  0.61  0.40         0.61  0.80  0.60  0.60  0.61
Breast-C         0.40  0.40  0.40  0.41  0.40         0.44  0.53  0.57  0.60  0.52
Breast-W         0.78  0.40  0.48  0.40  0.40         0.77  0.78  0.61  0.58  0.68
Car              0.51  0.40  0.75  0.75  0.40         0.51  0.40  0.75  0.67  0.40
Dermatology a    0.90  0.93  0.90  0.93  0.90         0.90  0.90  0.90  0.90  0.90
Glass            0.41  0.40  0.42  0.40  0.47         0.40  0.40  0.66  0.40  0.73
Haberman         0.40  0.40  0.40  0.40  0.40         0.40  0.77  0.40  0.40  0.72
Heart            0.80  0.78  0.79  0.80  0.80         0.66  0.63  0.79  0.80  0.66
Hepatitis        0.40  0.42  0.40  0.40  0.40         0.40  0.74  0.67  0.65  0.40
Ionosphere       0.42  0.52  0.51  0.75  0.43         0.50  0.46  0.73  0.61  0.58
Iris             0.70  0.68  0.71  0.68  0.68         0.46  0.47  0.68  0.68  0.59
Kr-vs-Kp         0.40  0.40  0.40  0.40  0.40         0.40  0.40  0.40  0.40  0.40
Liver            0.62  0.41  0.79  0.53  0.65         0.62  0.52  0.70  0.54  0.80
M_mass           0.40  0.45  0.76  0.73  0.40         0.73  0.52  0.73  0.45  0.80
Parkinsons       0.45  0.41  0.54  0.65  0.40         0.43  0.41  0.41  0.41  0.40
Pima             0.57  0.60  0.79  0.76  0.71         0.72  0.60  0.60  0.78  0.71
Statlog_GC a     0.94  0.95  0.94  0.95  0.95         0.94  0.95  0.95  0.90  0.95
Tae              0.40  0.40  0.40  0.40  0.40         0.65  0.59  0.52  0.65  0.46
Transfusion      0.65  0.59  0.63  0.43  0.59         0.60  0.64  0.52  0.40  0.59
Waveform         0.42  0.43  0.46  0.42  0.47         0.49  0.43  0.42  0.62  0.56
Wdbc             0.69  0.65  0.57  0.62  0.63         0.53  0.42  0.67  0.67  0.40
Wine             0.40  0.40  0.40  0.40  0.42         0.43  0.40  0.40  0.40  0.41
Wpbc             0.59  0.55  0.61  0.42  0.66         0.66  0.55  0.49  0.43  0.46

a  The search interval of the sub-optimal threshold δ is [0.9, 0.95] for a small tree.

on multiple data sets. The Friedman test is based on the relative performance of classifiers in terms of their ranks: for each data set, the methods to be compared are sorted according to their performance, i.e., each method is assigned a rank (in case of ties, average ranks are assigned) [56]. Suppose that we have k algorithms to compare on N data sets. The Friedman test can be applied as follows:

• find r_i^j, the rank of algorithm j on the i-th data set;
• compute the average rank R_j of algorithm j: R_j = (1/N) ∑_i r_i^j;
• the null hypothesis states that all algorithms have the same performance;
• compute the Friedman statistic χ²_F = [12N / (k(k+1))] [∑_j R_j² − k(k+1)²/4]; it is asymptotically χ²-distributed with k − 1 degrees of freedom. If N and k are not large enough, it is recommended to use the following correction, which is F-distributed with (k − 1) and (k − 1)(N − 1) degrees of freedom [55]: F_F = (N − 1)χ²_F / [N(k − 1) − χ²_F];
• if the statistic exceeds the critical value, then reject the null hypothesis, otherwise accept it;
• when the null hypothesis is rejected, a post-hoc test is used to determine the nature of the difference.

In our case, the corrected statistic F_F = 12.87, while the critical value of the F distribution at the significance level α = 0.05 is only 2.06. Thus, the null hypothesis is rejected, which implies that there is a statistically significant difference among the methods. Given the result of the Friedman test, we conducted the Holm test [55] to compare the classifier AFS1 with the other classifiers in a pairwise manner. The test statistic for comparing two classifiers is

z = (R_1 − R_2) / sqrt(k(k+1) / (6N)).

The z value is used to find the corresponding probability (p) from the table of the normal distribution, which is then compared with an appropriate α. We denote the ordered p values by p_1, p_2, …, so that p_1 ≤ p_2 ≤ … ≤ p_{k−1}. Holm's step-down procedure compares each p_i with α/(k − i), starting with the most significant (smallest) p value. If p_1 is below α/(k − 1), the corresponding hypothesis (that the two classifiers have the same performance) is rejected and we are allowed to compare p_2 with α/(k − 2). If the second hypothesis is rejected, the test proceeds with the third one, and so on. As soon as a certain null hypothesis cannot be rejected, all the remaining hypotheses are retained as well.
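A minimal sketch of the Friedman test as described above (assuming an N × k matrix of test errors, one row per data set and one column per classifier, lower values being better) could read:

```python
import numpy as np
from scipy.stats import rankdata, f as f_dist

def friedman_test(errors, alpha=0.05):
    """errors: N x k array of test errors (rows: data sets, columns: classifiers).
    Returns the average ranks, the Friedman statistic, its F correction and the
    F critical value at the given alpha."""
    errors = np.asarray(errors, dtype=float)
    N, k = errors.shape
    ranks = np.vstack([rankdata(row) for row in errors])    # rank 1 = smallest error; ties get average ranks
    R = ranks.mean(axis=0)                                  # average rank of each classifier
    chi2_F = 12.0 * N / (k * (k + 1)) * (np.sum(R ** 2) - k * (k + 1) ** 2 / 4.0)
    F_F = (N - 1) * chi2_F / (N * (k - 1) - chi2_F)         # correction used when N, k are not large
    crit = f_dist.ppf(1 - alpha, k - 1, (k - 1) * (N - 1))  # critical value of the F distribution
    return R, chi2_F, F_F, crit
```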


In our case, R_AFS1 = 2.8750, R_FS-DT = 6.3036, R_FDTs = 6.2500, R_KNN = 5.6607, R_C4.5 = 4.5179, R_FURIA = 3.6071, R_FARC-HD = 3.4464, R_SVM = 3.3393, and with α = 0.05, k = 8 and N = 28, the standard error is SE = sqrt(8·(8+1) / (6·28)) = 0.6547.

i   Classifier   z = (R_i − R_AFS1)/SE                 p        α/(k − i)
1   FS-DT        (6.3036 − 2.8750)/0.6547 = 5.2372     0.0000   0.0071
2   FDTs         (6.2500 − 2.8750)/0.6547 = 5.1554     0.0000   0.0083
3   KNN          (5.6607 − 2.8750)/0.6547 = 4.2552     0.0000   0.0100
4   C4.5         (4.5179 − 2.8750)/0.6547 = 2.5095     0.0121   0.0125
5   FURIA        (3.6071 − 2.8750)/0.6547 = 1.1184     0.2634   0.0167
6   FARC-HD      (3.4464 − 2.8750)/0.6547 = 0.8729     0.3827   0.0250
7   SVM          (3.3393 − 2.8750)/0.6547 = 0.7092     0.4782   0.0500
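Holm's step-down comparison against AFS1 summarized in the table above can be sketched as follows; the average ranks are taken from the text, and the two-sided normal p values reproduce the p column.

```python
from math import sqrt
from scipy.stats import norm

def holm_vs_control(avg_ranks, control="AFS1", k=8, N=28, alpha=0.05):
    """Pairwise z tests of every classifier against the control, followed by Holm's
    step-down decision on the ordered p values."""
    se = sqrt(k * (k + 1) / (6.0 * N))
    z = {c: (R - avg_ranks[control]) / se for c, R in avg_ranks.items() if c != control}
    p = {c: 2.0 * (1.0 - norm.cdf(abs(v))) for c, v in z.items()}   # two-sided p value
    decisions, stopped = {}, False
    for i, (c, pv) in enumerate(sorted(p.items(), key=lambda kv: kv[1]), start=1):
        if not stopped and pv < alpha / (k - i):
            decisions[c] = "rejected"       # the classifier differs significantly from the control
        else:
            decisions[c] = "retained"
            stopped = True                  # once one hypothesis is retained, retain all the rest
    return z, p, decisions

ranks = {"AFS1": 2.8750, "FS-DT": 6.3036, "FDTs": 6.2500, "KNN": 5.6607,
         "C4.5": 4.5179, "FURIA": 3.6071, "FARC-HD": 3.4464, "SVM": 3.3393}
print(holm_vs_control(ranks))   # rejects FS-DT, FDTs, KNN and C4.5; retains FURIA, FARC-HD and SVM
```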

The Holm procedure rejects the first, the second, the third and the fourth hypothesis, since the corresponding p values are smaller than the adjusted α's; the last three hypotheses cannot be rejected. This shows that AFS1 performs significantly better than FDTs, FS-DT, KNN and C4.5 at the significance level α = 0.05. AFS1 is not significantly better than FURIA, FARC-HD and SVM; nevertheless, compared with SVM, the knowledge (fuzzy rules) represented by the AFS decision tree is more transparent and in line with the way humans perceive classification problems, and AFS1 obtains the best rank in the Friedman test.

5.2. Experiment 2

In this experiment, we compare the number of rules and the number of nodes returned by the three fuzzy decision trees: FDTs, FS-DT and the AFS decision tree. For a detailed comparison, the errors of the three fuzzy decision tree approaches on the 28 data sets (see Table 7) are also reported in Table 10. From the analysis of the results presented in Table 10, we can draw the following conclusions:

1. The FS-DT approach produces a very small average number of rules and nodes on most of the data sets. This method, however, proved to be the worst in the Friedman test when the accuracies obtained on the test data are compared (see Table 8).
2. The AFS decision tree and FDTs approaches produce numbers of rules and nodes at a similar level on all data sets (except the Aus. and Heart data sets). Compared with FS-DT, these approaches return more leaves (rules) and nodes of the fuzzy decision trees, for the reasons discussed below.

Table 10
Results of comparative analysis of AFS decision tree, FDTs and FS-DT.

              AFS                          FDTs                         FS-DT
Data set      Error   Rules   Nodes        Error   Rules   Nodes        Error   Rules   Nodes
App.          14.13   8.0     14.0         16.90   11.6    19.4         16.98   28.2    55.4
Aus.          13.78   451.6   800.2        30.29   17.4    28.6         14.50   4.6     8.2
Aute1         0.00    3.0     6.0          6.67    5.0     8.0          10.00   2.0     3.0
Aute2         0.00    3.0     7.4          35.00   4.0     6.0          11.67   4.6     8.2
Auto          25.15   7.8     18.2         30.92   20.8    36.0         28.40   5.0     9.0
Balance       6.41    8.8     15.4         28.32   12.2    17.8         48.95   7.2     13.4
Breast-C      31.14   35.4    124.8        27.64   18.4    31.4         25.55   6.4     11.8
Breast-W      4.15    13.0    43.0         7.87    39.0    64.8         7.44    4.0     7.0
Car           14.18   43.8    87.2         30.03   18.2    27.0         29.98   5.0     9.0
Der.          4.91    10.8    26.4         36.61   9.2     13.4         41.00   3.0     5.0
Glass         33.65   34.8    106.2        44.86   11.0    16.4         49.53   11.2    21.4
Haberman      26.48   4.8     9.6          26.76   9.8     14.8         27.75   5.6     11.2
Heart         19.63   193.6   336.8        18.89   21.0    32.8         24.81   6.2     11.4
Hepatitis     34.19   21.2    64.8         20.00   20.4    34.4         20.00   4.6     8.2
Ionosphere    13.12   17.6    49.6         15.10   20.6    32.4         12.55   5.0     9.0
Iris          2.67    3.0     6.0          6.67    3.0     4.0          7.33    3.8     6.6
Kr-vs-Kp      1.66    28.2    70.0         32.76   90.8    179.6        5.91    5.0     9.0
Liver         30.72   17.6    36.2         35.07   17.0    27.4         34.78   9.8     18.6
M_Mass        18.52   22.6    49.8         20.29   12.0    17.8         18.00   4.2     7.4
Parkinsons    9.74    9.8     25.4         14.36   13.8    21.6         24.62   23.0    45.0
Pima          24.09   16.2    40.8         26.43   13.6    22.6         28.77   5.8     10.6
Statlog_GC    26.85   367.2   672.0        28.30   117.8   198.2        30.30   4.8     8.6
Tae           43.68   14.4    34.4         54.37   10.4    16.0         58.24   4.8     8.6
Transfusion   22.06   4.0     9.0          23.66   7.8     12.2         23.80   5.0     9.0
Waveform      23.02   20.2    55.0         31.82   14.4    21.2         29.90   8.2     15.4
Wdbc          4.04    6.4     14.6         8.79    28.6    52.2         37.26   64.0    127.0
Wine          2.79    6.6     17.0         13.46   10.8    16.0         9.51    4.2     7.4
Wpbc          22.18   9.0     24.0         24.24   68.6    118.4        31.29   9.0     17.0


Fig. 11. Examples of two AFS decision trees for the Iris and Haberman data.

To achieve a good fit to the training data, more detailed information has to be considered; of course, in this case we arrive at a larger tree with a significant number of leaves (rules) and nodes. However, compared with the FS-DT approach, the AFS decision tree and FDTs approaches achieve better accuracies on the data sets.
3. The AFS decision tree approach produces a comparable number of rules and nodes, obtaining a good tree-structure complexity and the best performance in terms of accuracy.

5.3. Analytical comparison

5.3.1. Structure complexity
AFS decision trees exhibit compact structures. The AFS decision tree shown in Fig. 3 has five leaf nodes. Fig. 11 shows two AFS decision trees obtained in the five-fold cross validation for the Iris data set and the Haberman data set (with classification accuracies of 96.67% and 83.37%, respectively). AFS decision trees may in principle have quite an extensive structure, as each AFS decision tree may have up to 3^|V| leaf nodes; e.g., for the Breast-W data set, this number may be as high as 3^9 = 19,683. After pruning, the actual numbers are usually far lower, as the trees are not fully spanned. Fig. 12 shows an AFS decision tree obtained for the Wine data set (one of the trees resulting from the five-fold cross validation experiments). Such a tree has only 5 leaf nodes, which is only a tiny fraction of the upper bound on the number of leaf nodes available in this case, that is, 3^13 = 1,594,323. This reveals that AFS decision trees, especially after pruning, can become far more compact.

5.3.2. Comparison of AFS fuzzy logic and "conventional" fuzzy logic
In what follows, according to the concept of classification boundaries mentioned in [57], the natural character of the AFS membership functions is illustrated by showing the classification boundaries of the decision areas over the entire input space. Fig. 13 presents the decision areas resulting from the use of the triangular membership functions, while Fig. 14 shows the classification boundaries resulting from the use of the AFS membership functions. In Figs. 13 and 14, "V1" and "V13" represent the 1st and 13th features of the wine data, the subplots located on the left and at the bottom show the membership degrees of each feature, "o" denotes the data belonging to Class 1, and "x" denotes the data of Class 2. We can observe that the partition produced by the AFS membership functions leads to better results than the one generated by the triangular membership functions: in Fig. 13 there are 15 misclassified patterns, while in Fig. 14 only 6 patterns are wrongly classified.

Fig. 12. An AFS decision tree for the Wine data; δ = 0.45.
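For reference, a minimal sketch of the conventional baseline used in Fig. 13 and in Table 12 (three triangular membership functions per normalized attribute, combined with min/max operators) is given below; the exact parameterization of [58] is not reproduced, so the breakpoints are an illustrative assumption.

```python
import numpy as np

def triangular(x, a, b, c):
    """Triangular membership function with feet a, c and peak b."""
    x = np.asarray(x, dtype=float)
    left = np.clip((x - a) / (b - a), 0.0, 1.0) if b > a else (x >= b).astype(float)
    right = np.clip((c - x) / (c - b), 0.0, 1.0) if c > b else (x <= b).astype(float)
    return np.minimum(left, right)

def three_terms(x):
    """Degrees of a normalized attribute value in {small, mid, large} (assumed breakpoints)."""
    return {"small": triangular(x, 0.0, 0.0, 0.5),
            "mid":   triangular(x, 0.0, 0.5, 1.0),
            "large": triangular(x, 0.5, 1.0, 1.0)}

# A rule antecedent such as "V1 is large AND V13 is small" is evaluated with min,
# and alternative rules for the same class are aggregated with max.
v1, v13 = 0.8, 0.2
degree = min(three_terms(v1)["large"], three_terms(v13)["small"])
```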


Fig. 13. Classification boundaries generated by the triangular membership functions. [Wine data, features V1 and V13; 'o': Class 1, 'x': Class 2; the side panels show the membership degrees of each feature.]

This observation is quantified through the results reported in Table 11, which counts all misclassified patterns in the two-dimensional input spaces formed by two arbitrarily selected features. Each fuzzy IF–THEN rule is confined to a certain region of the input space. These regions, and thus the classification boundaries, are adjusted by modifying the membership functions. While some optimization could be completed in this regard, one has to be aware that such adjustments could negatively impact the interpretability of the rules.

In order to compare the performance of the AFS fuzzy logic with fuzzy logic endowed with t-norms, triangular membership functions [58], trapezoidal membership functions [59] and Yuan's membership functions [18] are applied within the proposed algorithm. Furthermore, we consider the operators min and max, as commonly encountered in the literature. For each attribute V_i, three fuzzy terms are defined. The results summarized in Table 12 indicate that AFS offers better performance.

5.3.3. Consistency of the coherence membership functions
In this section, we treat the Iris data set as observed data coming from a probability space (Ω, F, P). The training data of the five experiments on the Iris data can be regarded as different samples drawn from the same probability space. We assess the consistency of the membership functions in Eq. (9) via the membership functions of the fuzzy rules.

Fig. 14. Classification boundaries generated by the AFS membership functions. [Wine data, features V1 and V13; 'o': Class 1, 'x': Class 2; the side panels show the membership degrees of each feature.]


Table 11
Number of misclassified patterns when using triangular and AFS membership functions (wine data); results shown for selected pairs of classes (C1&C2, C2&C3, etc.).

Class                              C1&C2   C2&C3   C1&C3   Overall
Misclassified number (triangle)    2699    1653    1133    5485
Misclassified number (AFS)         1853    1438    769     4060

In the five experiments on the Iris data, the membership functions of the fuzzy concepts ξ1, ξ2, ξ3, which are the antecedents of Rule 1, Rule 2 and Rule 3 for classes 1, 2 and 3, respectively, are shown in Figs. 15–17. One can observe that the five membership functions of each ξi, obtained from the different training sets, are quite similar, even though a different fuzzy concept ξi ∈ EM may be obtained in the classification for the different training data of the five experiments. By Theorem 3, we know that when the number of training samples approaches infinity, the five membership functions will converge to a single one. This ensures the consistency of the coherence membership functions.
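The visual closeness of the five curves in Figs. 15–17 can be summarized numerically; the sketch below assumes the five membership-degree vectors of a concept ξi on the n samples are available as a 5 × n array (such arrays are not reproduced in the paper).

```python
import numpy as np

def fold_consistency(mu_folds):
    """mu_folds: 5 x n array of membership degrees of one concept, one row per
    cross-validation experiment. Returns the mean curve and the mean / max
    absolute deviation of the individual folds from that mean curve."""
    mu_folds = np.asarray(mu_folds, dtype=float)
    mean_curve = mu_folds.mean(axis=0)
    dev = np.abs(mu_folds - mean_curve)
    return mean_curve, float(dev.mean()), float(dev.max())
```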

6. Conclusions

In this paper, we have introduced the coherence membership functions of fuzzy concepts and studied the AFS fuzzy rule-based classifier. We presented a way of building AFS decision trees and elaborated on the way in which rules can be extracted from the tree and pruned afterwards. We introduced the Fitness Index to estimate the optimal threshold δ, which is used to control the design of the AFS decision tree and the level of detail captured by the tree. We considered fuzzy sets (coherence membership functions) and the underlying logic operators generated by the AFS framework to eliminate potential subjective bias in the construction of the tree. The experiments demonstrated that the obtained results outperform those produced by C4.5, KNN, FDTs and FS-DT, as well as those obtained when conventional fuzzy logic is applied within the proposed algorithm: the AFS decision trees achieve high accuracy, and the results are better than those obtained with triangular and trapezoidal membership functions in the proposed method. We also showed the effectiveness of the rule extraction scheme. Interestingly, even if the tree does not result in an initial rule-base of good quality, the pruned rule-base can lead to much higher performance, which is consistently better on both the training and testing data.

Table 12
Performance analysis of the use of different membership functions.

Data set        Triangular        Trapezoidal       Yuan's            AFS
                Error (%)         Error (%)         Error (%)         Error (%)
Appendicitis    17.89 ± 4.42      17.89 ± 4.42      18.76 ± 3.63      14.13 ± 2.78
Australian      28.40 ± 12.67     22.31 ± 4.34      15.66 ± 3.81      13.78 ± 3.38
Aute1            0.00 ± 0.00       0.00 ± 0.00       0.00 ± 0.00       0.00 ± 0.00
Aute2            0.00 ± 0.00       0.00 ± 0.00       0.00 ± 0.00       0.00 ± 0.00
Auto            29.90 ± 1.43      31.93 ± 4.29      28.41 ± 4.00      25.15 ± 4.19
Balance         30.40 ± 0.79      12.16 ± 6.00      28.80 ± 2.88       6.41 ± 3.93
Breast-C        28.66 ± 1.62      30.41 ± 3.40      31.13 ± 2.70      31.14 ± 5.00
Breast-W         4.15 ± 1.14       5.01 ± 2.18       5.29 ± 2.24       4.15 ± 1.99
Car             23.52 ± 3.27      18.78 ± 3.42      17.21 ± 8.02      14.18 ± 8.12
Dermatology      7.36 ± 3.24      12.83 ± 3.46       7.63 ± 3.57       4.91 ± 2.03
Glass           43.48 ± 4.96      40.14 ± 10.10     45.38 ± 6.95      33.65 ± 2.49
Haberman        28.03 ± 4.81      27.43 ± 2.74      25.45 ± 2.56      26.48 ± 4.30
Heart           37.78 ± 8.16      26.30 ± 1.81      21.11 ± 5.32      19.63 ± 4.63
Hepatitis       52.90 ± 25.04     22.58 ± 4.56      55.48 ± 24.60     34.19 ± 13.62
Ionosphere      16.23 ± 4.32      21.65 ± 3.92      21.60 ± 2.98      13.12 ± 4.92
Iris             8.00 ± 4.00       6.67 ± 2.98       6.00 ± 4.90       2.67 ± 2.49
Kr-vs-Kp         1.66 ± 0.56      12.85 ± 3.65       9.28 ± 2.12       1.66 ± 0.56
Liver           45.80 ± 4.90      45.71 ± 4.04      41.14 ± 6.15      30.72 ± 5.05
M_Mass          20.92 ± 3.74      21.85 ± 2.89      21.02 ± 2.91      18.52 ± 2.64
Parkinsons      15.90 ± 7.50      11.79 ± 2.61      13.33 ± 8.01       9.74 ± 4.41
Pima            26.04 ± 2.39      27.99 ± 2.21      25.91 ± 2.90      24.09 ± 2.11
Statlog_GC      28.12 ± 2.06      23.30 ± 1.32      27.10 ± 4.16      26.85 ± 2.11
Tae             50.99 ± 6.46      44.97 ± 10.93     46.97 ± 7.88      43.68 ± 10.25
Transfusion     23.53 ± 1.14      23.80 ± 0.38      20.59 ± 1.96      22.06 ± 0.93
Waveform        27.98 ± 1.35      26.36 ± 1.56      30.15 ± 2.78      23.02 ± 0.83
Wdbc             4.92 ± 1.26       5.27 ± 2.36       5.45 ± 1.88       4.04 ± 1.62
Wine            11.18 ± 9.26      16.93 ± 10.10     15.69 ± 6.39       2.79 ± 1.76
Wpbc            19.71 ± 1.08      30.28 ± 4.04      31.27 ± 5.39      22.18 ± 5.70


Fig. 15. Membership functions of ξ1, the antecedent of the rule for class 1, in the five experiments for the Iris data, and the mean of the five membership functions of ξ1. [x-axis: samples 1–50: Class 1, 51–100: Class 2, 101–150: Class 3; curves: Exp 1 to Exp 5 and their mean.]

Acknowledgment

The authors would like to thank the anonymous referees for their comments and suggestions. This work was supported by the National Natural Science Foundation of China under grants 61175041 and 61034003.

Fig. 16. Membership functions of ξ2, the antecedent of the rule for class 2, in the five experiments for the Iris data, and the mean of the five membership functions of ξ2. [x-axis: samples 1–50: Class 1, 51–100: Class 2, 101–150: Class 3; curves: Exp 1 to Exp 5 and their mean.]

Fig. 17. Membership functions of ξ3, the antecedent of the rule for class 3, in the five experiments for the Iris data, and the mean of the five membership functions of ξ3. [x-axis: samples 1–50: Class 1, 51–100: Class 2, 101–150: Class 3; curves: Exp 1 to Exp 5 and their mean.]


Appendix A

Proof of Theorem 2. Let α = ∑_{i∈I} (∏_{m∈A_i} m), β = ∑_{j∈J} (∏_{m∈B_j} m) ∈ EM and α ≤ β in the lattice (EM, ∨, ∧). From Theorem 1, we know that for any A_i (i ∈ I) there exists B_h (h ∈ J) such that A_i ⊇ B_h. Then we have A_i^τ(x) ⊆ B_h^τ(x) for all x ∈ X. Thus for any i ∈ I,

∏_{γ∈A_i} M_γ(A_i^τ(x)) ≤ ∏_{γ∈A_i} M_γ(B_h^τ(x)) ≤ ∏_{γ∈B_h} M_γ(B_h^τ(x)) ≤ μ_β(x),

which implies that

μ_α(x) = sup_{i∈I} ∏_{γ∈A_i} M_γ(A_i^τ(x)) ≤ μ_β(x).

Thus condition 1 of Definition 5 holds. Since for each simple concept γ ∈ M, M_γ is a measure over S, condition 2 of Definition 5 holds. For x, y ∈ X, A ⊆ M, η = ∏_{m∈A} m ∈ EM, if A^τ(x) ⊆ A^τ(y), then for any γ ∈ A, M_γ(A^τ(x)) ≤ M_γ(A^τ(y)). This implies that μ_η(x) ≤ μ_η(y). Since in addition M_γ(X) = 1, condition 3 of Definition 5 holds. Therefore {μ_ξ(x) | ξ ∈ EM} is the set of coherence membership functions of (EM, ∨, ∧) and (M, τ, X). □

Proof of Theorem 3.1. According to [60], it is clear that for any U ∈ X ∩ F or U ∈ F with γ ∈ M,

M_γ(U) = ∑_{u∈U} ρ_γ(u) N_u / ∑_{u∈X} ρ_γ(u) N_u   and   M_γ(U) = ∫_U ρ_γ(t) dP(t) / ∫_Ω ρ_γ(t) dP(t)

are measures on X ∩ F and F, respectively. Thus the conclusion is deduced directly from Theorem 2. □

Proof of Theorem 3.2. Let p(x) be the density function of the probability space (Ω, F, P). Since X is a set of samples randomly drawn from (Ω, F, P), referring to [61], for any x ∈ X, when the space is divided into small areas Δ_x we have

p(x) = lim_{|X|→∞, S(Δ_x)→0} |X_{Δ_x}| / (|X| S(Δ_x)).        (A.1)

Here x ∈ Δ_x ⊆ Ω, S(Δ_x) is the size of the area Δ_x, and X_{Δ_x} is the set of drawn samples in X falling into Δ_x; a sample is counted as n different samples if it is observed n times (e.g., a sample may be so prevalent in the space that it is selected more than once in an observation). Since X is a set of samples randomly drawn from the probability space (Ω, F, P), the number of samples falling into Δ_x can be arbitrarily large if |X| is large enough, that is, |X_{Δ_x}| → ∞.

For any i ∈ I, γ ∈ A_i in Eq. (7) or (9) with x ∈ X, assume that Ω can be divided into q small spaces Δ_j ∈ F, j = 1, …, q, such that for any j either Δ_j ⊆ A_i^τ(x) or Δ_j ∩ A_i^τ(x) = ∅. Let

J_{A_i^τ(x)} = {Δ_j | Δ_j ⊆ A_i^τ(x), j = 1, …, q},

let Δ_max be the maximum of the sizes S(Δ_j), j = 1, …, q, and let Δ_u be the small space Δ_j such that u ∈ Δ_j. From Eq. (A.1), we have

lim_{|X|→∞} ∑_{u∈A_i^τ(x)} ρ_γ(u) N_u / ∑_{u∈X} ρ_γ(u) N_u
  = lim_{|X|→∞, Δ_max→0} ∑_{u∈X_Δ, Δ∈J_{A_i^τ(x)}} ρ_γ(u) / ∑_{u∈X_{Δ_j}, 1≤j≤q} ρ_γ(u)
  = lim_{|X|→∞, Δ_max→0} ∑_{Δ_u∈J_{A_i^τ(x)}} ρ_γ(u) |X_{Δ_u}| / ∑_{Δ_u∈{Δ_j | 1≤j≤q}} ρ_γ(u) |X_{Δ_u}|
  = lim_{|X|→∞, Δ_max→0} ∑_{Δ_u∈J_{A_i^τ(x)}} ρ_γ(u) S(Δ_u) [|X_{Δ_u}| / (|X| S(Δ_u))] / ∑_{Δ_u∈{Δ_j | 1≤j≤q}} ρ_γ(u) S(Δ_u) [|X_{Δ_u}| / (|X| S(Δ_u))]
  = ∫_{A_i^τ(x)} ρ_γ(t) dP(t) / ∫_Ω ρ_γ(t) dP(t)      (according to formula (A.1)).


Therefore the membership function defined in Eq. (7) or (9) converges to that defined in Eq. (8) or (10), respectively, for all x ∈ Ω as |X| approaches infinity. □
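The limit established above can also be illustrated numerically; the sketch below (not part of the paper) uses a standard normal distribution, a hypothetical weight function ρ and a hypothetical set A in place of ρ_γ and A_i^τ(x), none of which are tied to the paper's experiments.

```python
import numpy as np

# Numerical illustration: with samples drawn from a known distribution P, the empirical
# ratio sum_{u in A} rho(u) / sum_{u in X} rho(u) approaches the ratio of the integrals
# of rho over A and over the whole space. Here P is the standard normal,
# rho(t) = 1/(1+exp(-t)) is a hypothetical weight, and A = (-inf, 0] stands in for A_i^tau(x).
rng = np.random.default_rng(0)
rho = lambda t: 1.0 / (1.0 + np.exp(-t))

for n in (10**3, 10**4, 10**5, 10**6):
    x = rng.standard_normal(n)
    in_A = x <= 0.0
    print(n, round(float(rho(x[in_A]).sum() / rho(x).sum()), 4))
# The printed ratios stabilize as n grows, approaching the corresponding ratio of integrals.
```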

References [1] H. Ichihashi, T. Shirai, K. Nagasaka, T. Miyoshi, Neuro fuzzy ID3: a method of inducing fuzzy decision trees with linear programming for maximizing entropy and algebraic methods, Fuzzy Sets and Systems 81 (1) (1996) 157–167. [2] B. Chandra, P.P. Varghese, Fuzzifying gini index based decision trees, Expert Systems with Applications 36 (2009) 8549–8559. [3] M. Serrurier, D. Dubois, H. Prade, T. Sudkamp, Learning fuzzy rules with their implication operators, Data and Knowledge Engineering 60 (2007) 71–89. [4] N. Ghazisaidi, C.M. Assi, M. Maier, Intelligent wireless mesh path selection algorithm using fuzzy decision making, Wireless Networks 18 (2) (2012) 129–146. [5] J.R. Cano, F. Herrera, M. Lozano, Evolutionary stratified training set selection for extracting classification rules with trade off precision-interpretability, Data and Knowledge Engineering 60 (2007) 90–108. [6] X.R. Jiang, L. Gruenwald, Microarray gene expression data association rules mining based on BSC-tree and FIS-tree, Data and Knowledge Engineering 53 (2005) 3–29. [7] R. Weber, Fuzzy-ID3: a class of methods for automatic knowledge acquisition, in: 2nd International Conf. on Fuzzy Logic and Neural Networks, Lizuka, Japan, July 1992, pp. 265–268. [8] W. Xin, X.D. Liu, W. Pedryczc, X.L. Zhu, G.F. Hu, Mining axiomatic fuzzy set association rules for classification problems, European Journal of Operational Research 218 (1) (2012) 202–210. [9] A.S. Andreou, E. Papatheocharous, Software cost estimation using fuzzy decision trees, in: ASE '08, 2008, pp. 371–374. [10] C.L. Chen, F.S.C. Tseng, T. Liang, An integration of wordnet and fuzzy association rule mining for multi-label document clustering, Data and Knowledge Engineering 69 (11) (2010) 1208–1226. [11] Y. Xu, Y.F. Li, G. Shaw, Reliable representations for association rules, Data and Knowledge Engineering 70 (6) (2011) 555–575. [12] D.S. Yeung, X.Z. Wang, E.C.C. Tsang, Learning weighted fuzzy rules from examples with mixed attributes by fuzzy decision trees, in: IEEE Int. Conf. on Systems, Man, and Cybernetics, Tokyo, Japan, Oct. 1999, pp. 349–354. [13] W. Pedrycz, Z.A. Sosnowski, C-fuzzy decision trees, IEEE Transactions on Systems, Man, and Cybernetics — Part B: Applications and Reviews 35 (4) (2005) 498–511. [14] X.D. Liu, W. Pedrycz, The development of fuzzy decision trees in the framework of axiomatic fuzzy set logic, Applied Soft Computing 7 (2007) 325–342. [15] C.Z. Janikow, Fuzzy decision trees: issues and methods, IEEE Transactions on Systems, Man, and Cybernetics — Part B: Cybernetics 28 (1) (1998) 1–14. [16] X.Z. Wang, D.S. Yeung, E.C.C. Tsang, A comparative study on heuristic algorithms for generating fuzzy decision trees, IEEE Transactions on Systems, Man, and Cybernetics — Part B: Cybernetics 31 (2) (2001) 215–226. [17] B. Chandra, P.P. Varghese, Fuzzy SLIQ decision tree algorithm, IEEE Transaction on Systems, Man, Cybernetics — Part B: Cybernetics 38 (5) (2008) 1294–1301. [18] Y. Yuan, M.J. Shaw, Induction of fuzzy decision trees, Fuzzy Sets and Systems 69 (1995) 125–139. [19] C. Olaru, L. Wehenkel, A complete fuzzy decision tree technique, Fuzzy Sets and Systems 138 (2) (2003) 221–254. [20] Y. Lertworaprachaya, Y.J. Yang, R. John, Interval-valued fuzzy decision trees, in: IEEE Int. Conf. on Fuzzy Systems, Barcelona, Spain, July 2010, pp. 1–7. [21] X. Boyen, L. Wehenkel, Automatic induction of fuzzy decision trees and its application to power system security assessment, Fuzzy Sets and Systems 102 (1999) 3–19. [22] X. Boyen, L. 
Wehenkel, Fuzzy decision tree induction for power system security assessment, in: Proc. SIPOWER '95, 2nd IFAC Symp. on Control of Power Plants and Power Systems, Mexico, Dec. 1995, pp. 151–156. [23] I. Hayashi, T. Maeda, A. Bastian, L.C. Jain, Generation of fuzzy decision trees by fuzzy ID3 with adjusting mechanism of and/or operators, in: Int. Conf. Fuzzy Syst., 1998, pp. 681–685. [24] X. Boyen, L. Wehenkel, Automatic induction of continuous decision trees, in: Proc. IPMU '96, Information Processing and Management of Uncertainty in Knowledge-Based Systems, Granada, July 1996, pp. 419–424. [25] J.F. Smith, Evolving fuzzy decision tree structure that adapts in real-time, in: GECCO '05, 2005, pp. 1737–1744. [26] S. Moustakidis, G. Mallinis, N. Koutsias, J.B. Theocharis, V. Petridis, SVM-based fuzzy decision trees for classification of high spatial resolution remote sensing images, IEEE Transactions on Geoscience and Remote Sensing 50 (1) (2012) 149–169. [27] M. Umanol, H. Okamoto, I. Hatono, H. Tamura, F. Kawachi, S. Umedzu, J. Kinoshita, Fuzzy decision trees by fuzzy ID3 algorithm and its application to diagnosis systems, in: IEEE Conference on Fuzzy Systems, Orlando, USA, Jun. 1994, pp. 2113–2118. [28] C.M. Qi, A new partition criterion for fuzzy decision tree algorithm, in: Intelligent Information Technology Application (Work shop), Dec. 2007, pp. 43–46. [29] R.Y. Chen, D.D. Sheu, C.M. Liu, Vague knowledge search in the design for outsourcing using fuzzy decision tree, Computers and Operations Research 34 (2007) 3628–3637. [30] J.R. Quinlan, Induction of decision trees, Machine Learning 1 (1986) 81–106. [31] N.M. Abu-halaweh, R.W. Harrison, Practical fuzzy decision trees, in: IEEE Symposium on CIDM '09, Nashville, USA, Mar./Apr. 2009, pp. 211–216. [32] Q.W. Meng, Q. He, N. Li, X.R. Du, L.N. Su, Crisp decision tree induction based on fuzzy decision tree algorithm, in: ICISE, Dec. 2009, pp. 4811–4814. [33] L. Breiman, J. Friedman, R. Olshen, C. Stone, Classification and Regression Trees, Wadsworth, Belmont, CA, 1984. [34] J.S.R. Jang, Structure determination in fuzzy modeling a fuzzy CART approach, in: IEEE Conference on Fuzzy Systems, Orlando, USA, Jun. 1994, pp. 480–485. [35] B. Chandra, S. Mazumdar, V. Arena, N. Parimi, Elegant decision tree algorithm for classification in data mining, in: Proceedings of the 3rd International Conference on Information Systems Engineering (work shops), 2002, pp. 160–169. [36] B. Chandra, P.P. Varghese, A robust algorithm for classification using decision trees, in: IEEE Conference on Cybernetics and Intelligent Systems, Jun. 2006, pp. 1–5. [37] W. Pedrycz, Z.A. Sosnowski, Designing decision trees with the use of fuzzy granulation, IEEE Transactions on Systems, Man, and Cybernetics — Part A: Systems and Humans 30 (2) (2000) 151–159. [38] J. Fowdar, K. Crockett, Z. Bandar, J. O'Shea, On the use of fuzzy trees for solving classification problems with numeric outcomes, in: IEEE Conference on Fuzzy Systems, May 2005, pp. 436–441. [39] Y.L. Chen, C.L. Hsu, S.C. Chou, Constructing a multi-valued and multi-labeled decision tree, Expert Systems with Applications 25 (2003) 199–209. [40] B. Apolloni, G. Zamponi, A.M. Zanaboni, Learning fuzzy decision trees, Neural Networks 11 (1998) 885–895. [41] X.Z. Wang, J.H. Zhai, S.X. Lu, Induction of multiple fuzzy decision trees based on rough set technique, Information Sciences 178 (2008) 3188–3202. [42] Y. 
Cheng, The incremental method for fast computing the rough fuzzy approximations, Data and Knowledge Engineering 70 (1) (2011) 84–100. [43] C.J. Merz, P.M. Murphy, UCI Repository for Machine Learning Data-Bases, Dept. of Information and Computer Science, University of California, Irvine, CA, 1996, ([Online]. Available: http://www.ics.uci.edu/~mlearn/MLRepository.html). [44] J. Platt, Fast Training of Support Vector Machines Using Sequential Minimal Optimization, Advances in Kernel Methods: Support Vector Learning, MIT Press, 1998. [45] D.W. Aha, D. Kibler, M.K. Albert, Instance-based learning algorithms, Machine Learning 6 (1) (1991) 37–66. [46] J.R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann, San Mateo, 1993. [47] J. Alcala-Fdez, R. Alcala, F. Herrera, A fuzzy association rule-based classification model for high-dimensional problems with genetic rule selection and lateral tuning, IEEE Transactions on Fuzzy Systems 19 (5) (2011) 857–872.


[48] J.C. Huhn, E. Hullermeier, FURIA: an algorithm for unordered fuzzy rule induction, Data Mining and Knowledge Discovery 19 (2009) 293–319. [49] X.D. Liu, T.Y. Chai, W. Wang, W.Q. Liu, Approaches to the representations and logic operations for fuzzy concepts in the framework of axiomatic fuzzy set theory Ι, Information Sciences 177 (2007) 1007–1026. [50] X.D. Liu, W. Pedrycz, T.Y. Chai, M.L. Song, The development of fuzzy rough sets with the use of structures and algebras of axiomatic fuzzy sets, IEEE Transactions on Knowledge and Data Engineering 21 (3) (2009) 443–462. [51] X.D. Liu, W. Wang, T.Y. Chai, The fuzzy clustering analysis based on AFS theory, IEEE Transactions on Systems, Man, and Cybernetics — Part B: Cybernetics 35 (5) (2005) 1013–1027. [52] X.D. Liu, The fuzzy theory based on AFS algebras and AFS structure, Journal of Mathematical Analysis and Applications 217 (1998) 459–478. [53] I.H. Witten, E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, 2nd ed. Morgan Kaufmann, San Mateo, CA, 2005. [54] J. Alcala-Fdez, L. Sanchez, S. Garcia, M. Jesus, S. Ventura, J. Garrell, J. Otero, C. Romero, J. Bacardit, V. Rivas, J. Fernandez, F. Herrera, KEEL: a software tool to assess evolutionary algorithms to data mining problems, Soft Computing 13 (3) (2009) 307–318. [55] J. Demsar, Statistical comparisons of classifiers over multiple data sets, Journal of Machine Learning Research 7 (2006) 1–30. [56] J.C. Huhn, E. Hullermeier, FR3: a fuzzy rule learner for Inducing reliable classifiers, IEEE Transactions on Fuzzy Systems 17 (1) (2009) 138–139. [57] H. Ishibuchi, T. Nakashima, Effect of rule weights in fuzzy rule-based classification systems, IEEE Transactions on Fuzzy Systems 9 (4) (2001) 506–515. [58] T.J. Ross, Fuzzy Logic with Engineering Applications, McGraw-Hill, New York, 1995. [59] W.E. Kelly, J.H. Painter, Hypertrapezoidal fuzzy membership functions, in: IEEE International Conference on Fuzzy Systems, 1996, pp. 1279–1284. [60] P.R. Halmos, Measure Theory, Springer-Verlag, New York, 1974. [61] J.S. Simono, Smoothing Methods in Statistics, Springer-Verlag, New York, 1996. Xiaodong Liu received the B.S. and the M.S. degrees in mathematics from Northeastern Normal University in 1986 and Jilin University, Jilin, in 1989, P. R. China respectively, and the Ph.D. degree in control theory and control engineering from Northeastern University, Shenyang, P. R. China in 2003. He is currently a Professor in Research Center of Information and Control, Dalian University of Technology and Department of Applied Mathematics, Dalian Maritime University, a Guest Professor of the ARC Research Center of Excellence in PIMCE, Curtin University of Technology, Australia. He was a Senior Visiting Scientist in Department of Electrical and Computer Engineering, University of Alberta, Edmonton Canada in 2003 and Visiting Research Fellow in Department of Computing, Curtin University of Technology, Perth Australia in 2004. He has been a Reviewer of American Mathematical Reviewer since 1993. He has proposed the AFS theory and is a coauthor of three books. His research interests include algebra rings, combinatorics, topology molecular lattices, AFS (axiomatic fuzzy sets) theory and its applications, knowledge discovery and representations, data mining, pattern recognition and hitch diagnoses, analysis and design of intelligent control systems. Dr. Liu is a recipient of the 2002 Wufu-Zhenhua Best Teacher Award of the Ministry of Communications of Peolpe's Republic of China.

Xinghua Feng received the B.S. and the M.S. degrees in mathematics from Dalian Maritime University, Dalian, P. R. China, in 2005 and 2008, respectively. Currently, he is a PhD candidate in the Research Center of Information and Control, Dalian University of Technology. His research interests include machine learning and data mining.

Witold Pedrycz is a Professor and Canada Research Chair (CRC — Computational Intelligence) in the Department of Electrical and Computer Engineering, University of Alberta, Edmonton, Canada. He is also with the Systems Research Institute of the Polish Academy of Sciences, Warsaw, Poland. He also holds an appointment of special professorship in the School of Computer Science, University of Nottingham, UK. In 2009 Dr. Pedrycz was elected as a foreign member of the Polish Academy of Sciences. In 2012 he was elected a Fellow of the Royal Society of Canada. Witold Pedrycz has been a member of numerous program committees of IEEE conferences in the area of fuzzy sets and neurocomputing. In 2007 he received a prestigious Norbert Wiener award from the IEEE Systems, Man, and Cybernetics Council. He is a recipient of the IEEE Canada Computer Engineering Medal 2008. In 2009 he has received a Cajastur Prize for Soft Computing from the European Centre for Soft Computing for “pioneering and multifaceted contributions to Granular Computing”. His main research directions involve Computational Intelligence, fuzzy modeling and Granular Computing, knowledge discovery and data mining, fuzzy control, pattern recognition, knowledge-based neural networks, relational computing, and Software Engineering. He has published numerous papers in this area. He is also an author of 14 research monographs covering various aspects of Computational Intelligence and Software Engineering. Dr. Pedrycz is intensively involved in editorial activities. He is the Editor-in-Chief of Information Sciences and Editor-in-Chief of IEEE Transactions on Systems, Man, and Cybernetics — part A. He currently serves as an Associate Editor of IEEE Transactions on Fuzzy Systems and is a member of a number of editorial boards of other international journals.