Applied Soft Computing 24 (2014) 534–542
A rapid fuzzy rule clustering method based on granular computing

Xianchang Wang a,b,∗, Xiaodong Liu a, Lishi Zhang b

a Research Center of Information and Control, Dalian University of Technology, Dalian 116024, PR China
b School of Sciences, Dalian Ocean University, Dalian 116023, PR China
Article history: Received 14 February 2013; received in revised form 29 June 2014; accepted 3 August 2014; available online 13 August 2014.

Keywords: Fuzzy clustering; Granular computing; Fuzzy rule; Fuzzy number; Sample's description
Abstract

Traditionally, clustering is the task of dividing samples into homogeneous clusters based on their degrees of similarity. Once samples have been assigned to clusters, users need to give descriptions for all clusters manually. In this paper, a rapid fuzzy rule clustering method based on granular computing is proposed to provide descriptions for all clusters. A new and simple unsupervised feature selection method is employed to endow every sample with a suitable description. Exemplar descriptions are selected from the samples' descriptions by relative frequency, and data granulation is guided by the selected exemplar fuzzy descriptions. Every cluster is depicted by a single fuzzy rule, which makes the clusters understandable for humans. The experimental results show that our proposed model is able to discover fuzzy IF–THEN rules that uncover the potential clusters.

© 2014 Elsevier B.V. All rights reserved.
1. Introduction

1.1. Clustering

Clustering is one of the most significant research fields in data mining. It aims at partitioning the data into groups of similar objects (samples, patterns). From a machine learning perspective, what clustering does is to find the hidden patterns of the dataset in an unsupervised way, and the resulting system is usually referred to as a data concept. From a practical perspective, clustering plays an outstanding role in data mining applications such as image segmentation [1], computational biology [2,3], web analysis [4], text mining [5], graph clustering [6], and many others.
1.2. State of the art

Clustering methods can be classified into two types: partitional and hierarchical [7]. The partitional approach produces a single partition of the data points, as in the well-known K-means [8] clustering method, while the hierarchical approach gives a nested clustering result in the form of a dendrogram (cluster tree),
from which different levels of partitions can be obtained, such as single-link [9], complete-link [10] and average-link [11]. Recently, Frey and Dueck devised a method called Affinity Propagation (AP) [12], an unsupervised clustering algorithm based on message-passing techniques. Zhang et al. presented a clustering method called KAP [43] to generate a specified number K of clusters based on AP. Fowlkes et al. proposed the spectral grouping method Normalized Cuts [13], which uses the Nyström approximation to extend normalized cut. Zelnik-Manor and Perona proposed a method named Self-Tuning Spectral Clustering (STSC) [14], in which a "local" scale is used to compute the affinity between each pair of points. Agarwal and Mustafa presented an extension of the K-means clustering algorithm for projective clustering in arbitrary subspaces (KMPC) [15]. Steinley and Hubert proposed an order-constrained K-means cluster analysis through an auxiliary quadratic assignment optimization heuristic (OCKC) [16]. Patra et al. proposed distance-based clustering methods, including al-SL [17], for arbitrarily shaped clusters. DBCAMM [18] is an approach that merges sub-clusters by using local sub-cluster density information. Relaxing the rigidity (crispness) of the partition has constituted a domain of research in the framework of cluster analysis [19], and many authors have proposed a fuzzy setting as the appropriate approach to cope with this problem. In fuzzy clustering, the fuzzy c-means (FCM) algorithm, first proposed by Dunn [20] and then extended by Bezdek [21], is the best known and has been extensively used in data clustering and related applications. Several new fuzzy clustering methods have also been proposed: Honda et al. [22] used fuzzy principal component
analysis to obtain the cluster indicator, as well as the responsibility weights of samples for the K-means process, in order to develop a robust K-means clustering scheme. Tan et al. proposed an improved FCMBP fuzzy clustering method [23] based on evolutionary programming. A proximity fuzzy framework for clustering relational data [24] was presented by Graves et al. A knowledge-guided scheme of fuzzy clustering, in which the domain knowledge is represented in the form of viewpoints, was introduced by Pedrycz et al. [25]: the user's point of view on the data, represented in a plain numeric format or through some information granules, is included in the clustering process.
1.3. Contribution and paper organization

Nevertheless, after samples have been assigned to clusters by these traditional clustering techniques, users need to give descriptions for all clusters manually. In addition, users sometimes have no specific idea of how to explain the clustering results and may therefore give inappropriate descriptions. A clustering technique is proposed in this study to discover fuzzy IF–THEN rules and provide a description of each cluster: every cluster is depicted by a single fuzzy IF–THEN rule. A fuzzy rule-based clustering system is a special case of fuzzy modeling, and the knowledge acquired with such a system may be more understandable for humans [26]. The proposed clustering method, named FRCGC, is based on granular computing. The obtained clusters are specified by interpretable fuzzy rules, which makes the clusters understandable for humans. The experimental results show that our proposed model is able to discover fuzzy IF–THEN rules that uncover the potential clusters.

The remainder of this paper is organized as follows. In the following section, we provide an overview of data representation and preliminary notions. The procedure of our clustering algorithm is presented in the section "Our proposed rapid fuzzy rule clustering method". The section "Illustrative experiments" provides a detailed analysis of an experiment on a synthetic dataset. In the section "Experimental evaluation", we thoroughly evaluate the efficacy of the proposed model through a number of experiments using publicly available data. Finally, in the concluding section, we summarize and discuss our results.

2. Preliminary notions

This section briefly reviews some background concepts regarding data representations as well as notation. Let X = {x1, x2, . . ., xn} be a set of n samples, where xi ∈ R^d (i = 1, 2, . . ., n) denotes the ith sample and fj (j = 1, 2, . . ., d) denotes the jth column (feature) of X. Thus, X = (xij) is an n × d matrix representing the data; each column of X corresponds to a feature, whereas each row corresponds to a sample.

2.1. Fuzzy number

The non-symmetric trapezoidal and triangular forms of fuzzy numbers [27] are used in this paper to construct fuzzy rules (Fig. 1). In Fig. 1, the range of a feature is divided into four fuzzy subspaces, "big", "medium big", "medium small", and "small". Trapezoidal fuzzy numbers are used for the linguistic terms "big" and "small", and triangular fuzzy numbers for the linguistic terms "medium big" and "medium small". It is often possible, on the basis of expert experience, to define the parameters of a fuzzy number by means of linguistic variables. In this paper, the parameters of the fuzzy numbers are defined by the method of Equal Interval Width/Uniform Binning (EIB) [28]. This method relies on sorting the jth feature values and dividing the fj values into equally spaced bin ranges. A seed K (the number of clusters) supplied by the user determines how many bins are required.

[Fig. 1. Triangular and trapezoidal forms of fuzzy numbers, with bin means ms1, ms2, ms3, ms4 and cut points cp1, cp2, cp3 on the horizontal axis.]
[Fig. 2. Fuzzy numbers generated on the feature f1, with ms1 = 1.5, cp1 = 3.15 and ms2 = 4.7 on the horizontal axis.]
With this seed K, it is just a matter of finding the maximum and minimum values of fj to derive the range and then partitioning the data into K bins. The bin width is computed by ε = (max(fj) − min(fj))/K, and the bin thresholds (cut points) are constructed at cp_i = min(fj) + iε, where i = 1, 2, . . ., K − 1. Let ms_k denote the mean of all of the samples that fall into the kth bin.

Example 1. To illustrate the process of generating fuzzy numbers, let us consider the dataset shown in Table 1. Assume that K = 2. For feature f1, min(f1) = 1.4, max(f1) = 4.9, ε = (max(f1) − min(f1))/K = (4.9 − 1.4)/2 = 1.75, cp1 = min(f1) + ε = 1.4 + 1.75 = 3.15, ms1 = mean(1.4, 1.6, 1.5, 1.4, 1.5, 1.6) = 1.5, and ms2 = mean(4.6, 4.7, 4.9, 4.6) = 4.7. The resulting fuzzy numbers, corresponding to "small" and "big", are shown in Fig. 2. The membership value of each sample belonging to each fuzzy number is shown in Table 2, where A_1^1 stands for "the value of feature f1 is small", A_1^2 stands for "the value of feature f1 is big", A_2^1 stands for "the value of feature f2 is small", and A_2^2 stands for "the value of feature f2 is big".

2.2. The description of a sample

IF–THEN clustering rules are intuitively comprehensible for most humans, since they represent knowledge at a high level of abstraction involving logical conditions rather than point-based cluster representations. In this paper, a clustering rule R defined in the continuous space R^d is a knowledge representation of the form:

R : IF x_i is A_1^{k1} and, . . ., and x_i is A_d^{kd} THEN cluster label
Table 1
A small dataset with 10 samples.

Feature   x1    x2    x3    x4    x5    x6    x7    x8    x9    x10
f1        1.4   4.6   1.6   1.5   1.4   4.7   1.5   4.9   1.6   4.6
f2        0.3   0.2   1.2   0.4   0.5   1.4   0.3   1.5   1.3   1.0
Table 2
The membership values of the samples belonging to the fuzzy numbers.

               x1     x2     x3     x4     x5     x6     x7     x8     x9     x10
μ_{A_1^1}(·)   1      0.06   0.97   1      1      0.03   1      0      0.97   0.06
μ_{A_1^2}(·)   0      0.97   0      0      0      1      0      1      0      0.97
μ_{A_2^1}(·)   1      1      0.16   0.94   0.84   0      1      0      0.06   0.35
μ_{A_2^2}(·)   0      0      0.91   0      0.09   1      0      1      1      0.67
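As a concrete illustration of Example 1, the sketch below (in Python) computes the EIB cut points and bin means for feature f1 of Table 1 and evaluates the resulting memberships. The assumed membership shape (flat shoulders outside the extreme bin means, linear slopes between adjacent bin means) is only one plausible reading of Fig. 1, so the values reproduce Table 2 approximately rather than exactly.

```python
import numpy as np

def eib_parameters(feature, K):
    """Equal Interval Binning: cut points cp_i and bin means ms_k for one feature."""
    lo, hi = feature.min(), feature.max()
    eps = (hi - lo) / K                                   # bin width
    cps = np.array([lo + i * eps for i in range(1, K)])   # cut points cp_1..cp_{K-1}
    bins = np.digitize(feature, cps)                      # bin index 0..K-1 per sample
    ms = np.array([feature[bins == k].mean() for k in range(K)])  # bin means ms_1..ms_K
    return cps, ms

def membership(x, k, ms):
    """Membership of value x in the kth fuzzy number (0-based) built on the bin means.

    Assumed shape: 1 at ms[k], linear slopes towards the neighbouring bin means,
    flat shoulders below ms[0] (the 'small' term) and above ms[-1] (the 'big' term)."""
    K = len(ms)
    if k > 0 and ms[k - 1] <= x <= ms[k]:
        return (x - ms[k - 1]) / (ms[k] - ms[k - 1])
    if k < K - 1 and ms[k] <= x <= ms[k + 1]:
        return (ms[k + 1] - x) / (ms[k + 1] - ms[k])
    if (k == 0 and x <= ms[0]) or (k == K - 1 and x >= ms[-1]) or x == ms[k]:
        return 1.0
    return 0.0

# Feature f1 of Table 1 with K = 2: cp1 = 3.15 and ms = (1.5, 4.7) as in Example 1.
f1 = np.array([1.4, 4.6, 1.6, 1.5, 1.4, 4.7, 1.5, 4.9, 1.6, 4.6])
cps, ms = eib_parameters(f1, K=2)
print("cut points:", cps, "bin means:", ms)
print("mu_small(4.6) =", round(membership(4.6, 0, ms), 2),
      "mu_big(4.6) =", round(membership(4.6, 1, ms), 2))
```

With K = 4, the same construction yields the "small", "medium small", "medium big" and "big" fuzzy numbers sketched in Fig. 1.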
Table 3
Samples' fuzzy descriptions and the membership value of every sample belonging to the four fuzzy rules.

           x1     x2     x3     x4     x5     x6     x7     x8     x9     x10
Des(·)     A1     A3     A2     A1     A1     A4     A1     A4     A2     A4
μ_A1(·)    1.00   0.53   0.56   0.97   0.92   0.02   1.00   0      0.51   0.21
μ_A2(·)    0.50   0.03   0.94   0.50   0.55   0.52   0.50   0.50   0.98   0.37
μ_A3(·)    0.50   0.98   0.08   0.47   0.42   0.50   0.50   0.50   0.03   0.66
μ_A4(·)    0      0.48   0.45   0      0.05   1.00   0      1.00   0.50   0.82
The antecedent (IF) part of R consists of a logical conjunction of d conditions, one for each feature, whereas the conclusion (THEN) part contains the cluster label. The semantics of this kind of clustering rule is: if all the conditions specified in the antecedent part are satisfied by the corresponding feature values of a given data point, then this point is assigned to (or covered by) the cluster identified by the consequent.

Let Λ_j = {A_j^k | k = 1, 2, . . ., K} be the set of all fuzzy numbers defined on the jth feature, and Λ = {A_j^k | k = 1, 2, . . ., K, j = 1, 2, . . ., d} be the set of all fuzzy numbers defined on the whole set of features f1, f2, . . ., fd.

Definition 1 (fuzzy subspace). If Fs ∈ Ω, then Fs is a fuzzy subspace, where Ω is given by:

Ω = { ∧_{j=1}^{d} A_j | A_j ∈ Λ_j }    (1)

In fact, the antecedent part of a fuzzy rule R corresponds to a fuzzy subspace Fs, and the fuzzy numbers divide the universe of discourse R^d into K^d fuzzy subspaces (fuzzy rules).

Example 2. Let us consider Example 1. Since K = 2, the universe of discourse R^2 is divided into 4 fuzzy subspaces, as shown in Fig. 3. The 4 fuzzy subspaces can be represented by the antecedent parts of the following 4 fuzzy rules:

R1 : IF x_i is A_1^1 and x_i is A_2^1 THEN cluster label,
R2 : IF x_i is A_1^1 and x_i is A_2^2 THEN cluster label,
R3 : IF x_i is A_1^2 and x_i is A_2^1 THEN cluster label,
R4 : IF x_i is A_1^2 and x_i is A_2^2 THEN cluster label.

[Fig. 3. The fuzzy subspaces.]

Definition 2 (sample's description [29–33]). For a sample x_i ∈ R^d, the fuzzy description Des(x_i) ∈ Ω of x_i is defined as:

Des(x_i) = ∧_{j=1}^{d} arg max_{A ∈ Λ_j} μ_A(x_i)    (2)

Each sample's fuzzy description is a fuzzy subspace and is therefore an interpretable fuzzy set. Let μ_{Des(x_i)}(x_j) denote the membership value of a sample x_j belonging to the fuzzy description Des(x_i); it can be calculated using the following formula:

μ_{Des(x_i)}(x_j) = (1/d) Σ_{A ∈ Des(x_i)} μ_A(x_j)    (3)

Example 3. Let us consider Example 2, and let A1, A2, A3, A4 respectively denote the antecedent parts of the fuzzy rules R1, R2, R3, R4. Every sample's fuzzy description is shown in Table 3, together with the membership value of every sample belonging to the four fuzzy rules.

Definition 3 (similarity between two fuzzy descriptions). For two fuzzy descriptions Des(x_i) and Des(x_j), the similarity Sim(Des(x_i), Des(x_j)) between them is defined as:

Sim(Des(x_i), Des(x_j)) = (1/d) |{A | A ∈ Des(x_i) ∧ A ∈ Des(x_j)}|    (4)

where the symbol | · | represents the cardinality of a set. The dissimilarity DSim(Des(x_i), Des(x_j)) between two fuzzy descriptions Des(x_i) and Des(x_j) is measured by:

DSim(Des(x_i), Des(x_j)) = 1 − Sim(Des(x_i), Des(x_j))    (5)

The dissimilarity is a metric distance function; for all DesA, DesB, DesC ∈ Ω, one can easily verify that:

(1) DSim(DesA, DesB) = DSim(DesB, DesA),
(2) DSim(DesA, DesA) = 0,
(3) 0 ≤ DSim(DesA, DesB) ≤ 1,
(4) DSim(DesA, DesB) ≤ DSim(DesA, DesC) + DSim(DesB, DesC).

Example 4. Let us consider the fuzzy descriptions in Example 3: Sim(A1, A2) = 0.5, Sim(A1, A3) = 0.5, Sim(A1, A4) = 0, Sim(A2, A3) = 0, Sim(A2, A4) = 0.5, Sim(A3, A4) = 0.5.

2.3. Probability space

Definition 4 (probability space). The 3-tuple (Ω, 2^Ω, P) is a finite probability space if P is a probability measure on Ω satisfying:

(1) P(Ω) = 1;
(2) ∀B ∈ 2^Ω, 0 ≤ P(B) ≤ 1;
(3) for every countable sequence of mutually disjoint events {B_l ∈ 2^Ω}, l = 1, 2, . . ., n, P(∪_{l=1}^{n} B_l) = Σ_{l=1}^{n} P(B_l);

where 2^Ω is the power set of Ω, i.e., the class of all subsets of Ω; 2^Ω is a σ-algebra on Ω, the set of all possible potentially interesting events. Here, for B ∈ 2^Ω, P(B) denotes the probability of a sample falling into the fuzzy subspace B, and the probability distribution P(B) can be estimated by relative frequency with the following formula:

P(B) = (1/n) |{x_i | μ_B(x_i) ≥ μ_{B^c}(x_i)}|    (6)

where B^c denotes the complement of the set B, and μ_B(x_i), the membership value of x_i belonging to B, can be calculated using:

μ_B(x_i) = max_{Fs ∈ B} μ_{Fs}(x_i),    (7)
μ_{B^c}(x_i) = max_{Fs ∈ B^c} μ_{Fs}(x_i).    (8)

Example 5. Let us again consider the fuzzy descriptions in Example 3: P(A1) = (1/10)|{x1, x4, x5, x7}| = 0.4, P(A2) = (1/10)|{x3, x9}| = 0.2, P(A3) = (1/10)|{x2}| = 0.1, P(A4) = (1/10)|{x6, x8, x10}| = 0.3.
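The following sketch illustrates Definitions 2–4 on the samples of Table 2: a description is represented as the set of per-feature fuzzy-number labels chosen by arg max, Sim/DSim count the shared labels, and P is estimated by the relative frequency of each distinct description, which reproduces the values of Example 5. The membership values are taken from Table 2 rather than re-derived from the fuzzy numbers, and the dictionary layout is ours, not the paper's.

```python
from collections import Counter

# Membership values of Table 2 (features f1, f2; fuzzy numbers "small"/"big").
small_f1 = [1, 0.06, 0.97, 1, 1, 0.03, 1, 0, 0.97, 0.06]
big_f1   = [0, 0.97, 0, 0, 0, 1, 0, 1, 0, 0.97]
small_f2 = [1, 1, 0.16, 0.94, 0.84, 0, 1, 0, 0.06, 0.35]
big_f2   = [0, 0, 0.91, 0, 0.09, 1, 0, 1, 1, 0.67]
mu = {f"x{i + 1}": {("f1", "small"): small_f1[i], ("f1", "big"): big_f1[i],
                    ("f2", "small"): small_f2[i], ("f2", "big"): big_f2[i]}
      for i in range(10)}

def description(sample_mu):
    """Definition 2: for each feature keep the fuzzy number with the largest membership."""
    best = {}
    for (feature, label), value in sample_mu.items():
        if feature not in best or value > best[feature][1]:
            best[feature] = (label, value)
    return frozenset((f, lab) for f, (lab, _) in best.items())

def mu_description(des, sample_mu):
    """Formula (3): mean membership of a sample over the fuzzy numbers in a description."""
    return sum(sample_mu[a] for a in des) / len(des)

def sim(des_a, des_b):
    """Formula (4): fraction of fuzzy numbers shared by two descriptions."""
    return len(des_a & des_b) / len(des_a)

def dsim(des_a, des_b):
    """Formula (5): dissimilarity of two descriptions."""
    return 1.0 - sim(des_a, des_b)

des = {name: description(m) for name, m in mu.items()}
freq = Counter(des.values())
for d, count in freq.items():
    print(sorted(d), "P =", count / len(mu))       # 0.4, 0.1, 0.2, 0.3 (cf. Example 5)
print("mu_Des(x1)(x3) =", mu_description(des["x1"], mu["x3"]))   # about 0.56 (Table 3)
print("Sim(Des(x1), Des(x2)) =", sim(des["x1"], des["x2"]))      # 0.5 (Example 4)
print("DSim(Des(x1), Des(x2)) =", dsim(des["x1"], des["x2"]))    # 0.5
```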
2.4. Granular computing

Granular computing [34,35] offers a simple and effective way of extracting information out of datasets, inspired by the human perception of grouping similarly featured items together [36,37]. By using granular computing, it is possible to group data together not only on the basis of similar mathematical properties such as proximity; the raw data are also treated as conceptual entities that are captured in a compact and transparent manner. Data granulation [38–41] is an algorithmic process achieved by a simple iterative procedure involving the following two steps:

• Find the two most compatible information granules and merge them together as a new information granule containing both original granules.
• Repeat the process of finding the two most compatible granules until a satisfactory data abstraction level is achieved.

In this study, a fuzzy rule consisting of only a single fuzzy number B ∈ Λ is viewed as an atomic granule, denoted by Gar [40]. Data granulation is accomplished by merging these atomic granules.

3. Our proposed rapid fuzzy rule clustering method

Our proposed rapid fuzzy rule clustering method is based on granular computing and is named FRCGC. The FRCGC work flow is summarised in the top-level overview of the process shown in Fig. 4. The idea of FRCGC is as follows: firstly, some features are selected in order to control the complexity. After that, every sample's description (fuzzy rule) is calculated by Definition 2 on the remaining features. Lastly, exemplar descriptions are selected from the samples' descriptions, and data granulation (the clustering procedure) is guided by the selected exemplar fuzzy descriptions.

[Fig. 4. The procedure of our clustering algorithm: raw data → feature selection (mRFS) → sample's description (fuzzy numbers) → data granulation (granular computing) → end.]

3.1. Step 1: unsupervised feature selection

A. Ferreira and M. Figueiredo proposed an unsupervised feature selection method, RFS [42], whose key idea is that features with higher variance are more informative than features with lower variance. In order to choose an adequate number of features, they proposed to use a cumulative measure as follows. Let {r_i, i = 1, . . ., d} be the relevance values given by {r_i = var(f_i)/b_i}, where b_i is the number of bits allocated to feature f_i by the U-LBG1 feature discretization method [42], and let {r_(i), i = 1, . . ., d} be the same values after sorting in descending order. Then m is chosen as the lowest value that satisfies:

Σ_{j=1}^{m} r_(j) / Σ_{j=1}^{d} r_(j) ≥ L,

where L is some threshold (such as 0.5). In order to decrease the computational cost, in this paper {r_i, i = 1, . . ., d} is computed simply as {r_i = var(f_i)}, without considering the number of bits b_i; that is, the feature discretization procedure is not employed. The modified method, called mRFS, is shown in Algorithm 1.

Algorithm 1. mRFS: modified Relevance Feature Selection.

Input: X: n × d matrix, d-dimensional dataset with n samples; L ∈ [0, 1]: threshold used to choose an adequate number of features to keep.
Output: FeatKeep: an m-dimensional array (with m < d) containing the indexes of the selected features; X′: n × m matrix, reduced-dimensional dataset, with features sorted by decreasing relevance.
1: Compute the relevance r_j = var(f_j) of each feature, j = 1, 2, . . ., d;
2: Sort the features by their relevance r_j in decreasing order, obtaining r_(j);
3: Compute the smallest m that satisfies Σ_{j=1}^{m} r_(j) / Σ_{j=1}^{d} r_(j) ≥ L;
4: Fill the FeatKeep array with the indexes of the m top-ranked features;
5: Build X′ from X using FeatKeep, by keeping only the m features with largest relevance r_(j);
6: return X′, FeatKeep, m;
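A minimal Python sketch of the mRFS step follows: the features are ranked by the variance of the min-max normalized columns, and the smallest m whose cumulative relevance share reaches L is kept. Applying the normalization of Eq. (9) before computing the variances follows the usage in Section 4.1 and is an assumption here, since Algorithm 1 itself only states r_j = var(f_j).

```python
import numpy as np

def mrfs(X, L=0.5):
    """Modified Relevance Feature Selection (Algorithm 1), sketched.

    X : (n, d) array. Returns the reduced matrix (features sorted by decreasing
    relevance), the indexes of the kept features, and m."""
    rng = X.max(axis=0) - X.min(axis=0)
    Xn = (X - X.min(axis=0)) / np.where(rng == 0, 1, rng)  # min-max normalization (Eq. (9))
    r = Xn.var(axis=0)                          # relevance r_j = var(f_j)
    order = np.argsort(-r)                      # sort by decreasing relevance
    cumshare = np.cumsum(r[order]) / r.sum()    # cumulative relevance share
    m = int(np.searchsorted(cumshare, L) + 1)   # smallest m with share >= L
    feat_keep = order[:m]
    return X[:, feat_keep], feat_keep, m

# Weather data of Table 4 (Temperature, Humidity, Wind).
X = np.array([[75, 0.9, 3], [73, 0.2, 4], [70, 0.3, 9], [94, 0.1, 4], [100, 0.2, 2],
              [91, 0.4, 3], [25, 0.0, 2], [43, 0.0, 1], [47, 0.1, 5], [45, 0.8, 2]])
X_red, kept, m = mrfs(X, L=0.5)
print("kept feature indexes:", kept, "m =", m)
```

For the weather data this reproduces the selection of Section 4.1: m = 2, with the Humidity and Temperature columns (f2 and f1) retained.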
3.2. Step 2: generating every sample's fuzzy description

Now suppose that the number of clusters is K. The triangular and trapezoidal membership functions are adopted to generate the fuzzy numbers, and the number of fuzzy numbers on every feature is set to K. Let Λ = {A_j^k | k = 1, 2, . . ., K, j = 1, 2, . . ., d} be the set of all fuzzy numbers. The fuzzy numbers divide the universe of discourse R^d into K^d fuzzy rules. Each sample's fuzzy description is calculated by Definition 2.

3.3. Step 3: data granulation

Lastly, exemplar fuzzy descriptions are selected from the samples' descriptions by relative frequency, and data granulation is guided by the selected exemplar fuzzy descriptions. The clustering pseudocode is shown in Algorithm 2.

Algorithm 2. FRCGC: proposed rapid fuzzy clustering method based on granular computing.

Input: X′: n × m matrix, reduced-dimensional dataset, with features sorted by decreasing relevance; K: the number of clusters.
Output: Gra_k (k = 1, 2, . . ., K): clustering granule sets; Exe_k (k = 1, 2, . . ., K): the kth exemplar fuzzy description.
1: Gra = ∅, Φ = ∅, ρ = ∅, p = 0;
2: for each sample x_i (i = 1, 2, . . ., n)
3:    compute Des(x_i) = ∧_{j=1}^{d} arg max_{A_j^k ∈ Λ_j} μ_{A_j^k}(x_i);
4:    if Des(x_i) ∉ Φ then
5:       p = p + 1;
6:       Φ_p = Des(x_i);
7:       compute ρ_p = P(Des(x_i));
8:    end if
9: end for
10: for k = 1, 2, . . ., K
11:    compute id = arg max_{q = 1, 2, . . ., p} ρ_q;
12:    Exe_k = Φ_id;
13:    for q = 1, 2, . . ., p
14:       ρ_q = ρ_q × DSim(Φ_q, Exe_k);
15:    end for
16: end for
17: for each sample x_i (i = 1, 2, . . ., n)
18:    compute id = arg max_{k = 1, 2, . . ., K} μ_{Exe_k}(x_i);
19:    Gra_id = Gra_id ∪ {x_i};
20: end for
21: return Gra_k, Exe_k, Des(x_i);
4. Illustrative experiments

The weather data in Table 4 are used as an illustrative example. They consist of 10 daily observations and 3 features: Temperature (f1), Humidity (f2), and Wind (f3). Let X = {x1, . . ., x10} be the set of 10 daily observations, where xi ∈ R^3 (i = 1, 2, . . ., 10) denotes the ith sample and fj (j = 1, 2, 3) denotes the jth feature of X. Thus, X = (xij) is a 10 × 3 matrix representing the data, and xij is the jth feature value of xi.
Table 4
The weather data.

Feature       x1    x2    x3    x4    x5    x6    x7    x8    x9    x10
Temperature   75    73    70    94    100   91    25    43    47    45
Humidity      0.9   0.2   0.3   0.1   0.2   0.4   0     0     0.1   0.8
Wind          3     4     9     4     2     3     2     1     5     2
Table 5
The normalized weather data.

Feature    x1     x2     x3     x4     x5     x6     x7     x8     x9     x10
f̃(:, 1)    0.67   0.64   0.6    0.92   1      0.88   0      0.24   0.29   0.27
f̃(:, 2)    1      0.22   0.33   0.11   0.22   0.44   0      0      0.11   0.89
f̃(:, 3)    0.25   0.38   1      0.38   0.13   0.25   0.13   0      0.5    0.13
4.1. Step 1: feature selection

Firstly, the data should be normalized by the following formula:

f̃_j = (f_j − min{f_j}) / (max{f_j} − min{f_j})    (9)

where f_j represents the jth column of X, i.e., the values of the jth feature. The normalized weather data are shown in Table 5. The variances are r1 = var(f̃(:, 1)) = 0.11, r2 = var(f̃(:, 2)) = 0.12, and r3 = var(f̃(:, 3)) = 0.08. According to Algorithm 1 (with L = 0.5), r_(1) = 0.12, r_(2) = 0.11, r_(3) = 0.08, r_(1)/Σ_{j=1}^{3} r_(j) = 0.39 < L, and (r_(1) + r_(2))/Σ_{j=1}^{3} r_(j) = 0.75 ≥ L; thus, features f1 and f2 are selected.
4.2. Step 2: generating every sample's fuzzy description

Given the number of clusters K = 2, the parameters of the fuzzy numbers are calculated as follows. For feature f1, the bin width is ε = (max(f1) − min(f1))/K = (100 − 25)/2 = 37.5, the cut point is constructed at cp1 = min(f1) + ε = 62.5, and the bin means are ms1 = mean{25, 43, 47, 45} = 40 and ms2 = mean{75, 73, 70, 94, 100, 91} = 84. For feature f2, the bin width is ε = (max(f2) − min(f2))/K = (0.9 − 0)/2 = 0.45, the cut point is cp1 = min(f2) + ε = 0.45, and the bin means are ms1 = 0.16 and ms2 = 0.85. The obtained fuzzy numbers, corresponding to "small" and "big", are shown in Fig. 5. The membership value of each sample belonging to each fuzzy number is shown in Table 6, where A_1^1 stands for "the value of feature f1 is small", A_1^2 stands for "the value of feature f1 is big", A_2^1 stands for "the value of feature f2 is small", and A_2^2 stands for "the value of feature f2 is big". According to Table 6 and Definition 2, we can calculate every sample's description as follows: Des(x1) = A_1^2 A_2^2, Des(x2) = A_1^2 A_2^1, Des(x3) = A_1^2 A_2^1, Des(x4) = A_1^2 A_2^1, Des(x5) = A_1^2 A_2^1, Des(x6) = A_1^2 A_2^1, Des(x7) = A_1^1 A_2^1, Des(x8) = A_1^1 A_2^1, Des(x9) = A_1^1 A_2^1, Des(x10) = A_1^1 A_2^2.
[Fig. 5. Membership functions of the fuzzy numbers formed for feature 1 (ms1 = 40, cp1 = 62.5, ms2 = 84) and feature 2 (ms1 = 0.16, cp1 = 0.45, ms2 = 0.85).]

Table 6
The membership values of the samples belonging to the fuzzy numbers.

               x1     x2     x3     x4     x5     x6     x7     x8     x9     x10
μ_{A_1^1}(·)   0.2    0.25   0.32   0      0      0      1      0.93   0.84   0.89
μ_{A_1^2}(·)   0.77   0.73   0.66   1      1      1      0      0.05   0.14   0.09
μ_{A_2^1}(·)   0      0.93   0.76   1      0.93   0.59   1      1      1      0
μ_{A_2^2}(·)   1      0.19   0.31   0.06   0.19   0.44   0      0      0.06   0.94
4.3. Step 3: data granulation

Firstly, we need to choose a single sample description as the exemplar describing the first cluster, following Algorithm 2. According to Eq. (6), the importance of each sample description can be obtained from its frequency of occurrence: ρ1 = P(A_1^2 A_2^2) = 0.1, ρ2 = P(A_1^2 A_2^1) = 0.5, ρ3 = P(A_1^1 A_2^1) = 0.3, ρ4 = P(A_1^1 A_2^2) = 0.1. Thus, A_1^2 A_2^1 is chosen as the first cluster's description. Next, we should choose another sample description as the second cluster's description. According to Algorithm 2, the importance of each sample description is updated by taking the first cluster's description into account: ρ1 = ρ1 × DSim(A_1^2 A_2^2, A_1^2 A_2^1) = 0.1 × 0.5 = 0.05, ρ2 = ρ2 × DSim(A_1^2 A_2^1, A_1^2 A_2^1) = 0.5 × 0 = 0, ρ3 = ρ3 × DSim(A_1^1 A_2^1, A_1^2 A_2^1) = 0.3 × 0.5 = 0.15, ρ4 = ρ4 × DSim(A_1^1 A_2^2, A_1^2 A_2^1) = 0.1 × 1 = 0.1. The largest is ρ3, so A_1^1 A_2^1 is selected as the second cluster's description. The first cluster's description is therefore "the value of feature 1 is large and the value of feature 2 is small", and the second cluster's description is "the value of feature 1 is small and the value of feature 2 is small". Data granulation is guided by the clusters' descriptions: according to the membership degrees of the samples belonging to the clusters' descriptions, shown in Table 7, the first cluster is {x1, x2, x3, x4, x5, x6} and the second cluster is {x7, x8, x9, x10}.

Table 7
The membership value of every sample belonging to the two clusters' descriptions.

                     x1     x2     x3     x4     x5     x6     x7     x8     x9     x10
μ_{A_1^2 A_2^1}(·)   0.4    0.84   0.72   1      0.97   0.79   0.5    0.52   0.57   0.04
μ_{A_1^1 A_2^1}(·)   0.11   0.6    0.55   0.5    0.47   0.29   1      0.97   0.92   0.44

5. Experimental evaluation

In this section, we evaluate the performance of the FRCGC algorithm in comparison with OCKC [16], KAP [43], KMPC [15], K-means++ [44], FCM [21], K-means [8], and STSC [14], and conduct a number of experiments to verify the properties of the proposed FRCGC, such
as how to choose the threshold value L, and the computational complexity of FRCGC. All experiments are carried out on a personal computer with an Intel(R) Core(TM) i5-2520M CPU (2.50 GHz) processor, 4.00 GB (2.94 GB usable) of memory, and the Windows 7 64-bit operating system. All algorithms are implemented in the Matlab 7.13.0.564 (R2011b) environment. The classification accuracy is a common measure used to determine how well clustering algorithms perform on data with a known structure (i.e., classes). It is determined by first transforming the fuzzy partition matrix into a Boolean partition matrix, selecting for each sample the cluster with the maximum membership value. A class label is then assigned to each cluster according to the class that dominates that cluster, and the classification accuracy is the percentage of samples that belong to a correctly labeled cluster; it can be computed by building a contingency matrix [45]. Higher classification accuracy indicates better clustering results, and this measure is quite often used to assess the performance of fuzzy clustering algorithms.
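The measure just described can be sketched in a few lines of Python: each cluster is labelled with the class that dominates it, and the score is the fraction of samples whose true class matches their cluster's label. This is a simplified sketch of that computation rather than the exact contingency-matrix routine of [45].

```python
from collections import Counter

def clustering_accuracy(cluster_ids, class_labels):
    """Fraction of samples whose true class equals the majority class of their cluster."""
    majority = {}
    for c in set(cluster_ids):
        members = [lab for cid, lab in zip(cluster_ids, class_labels) if cid == c]
        majority[c] = Counter(members).most_common(1)[0][0]   # dominating class
    correct = sum(1 for cid, lab in zip(cluster_ids, class_labels) if majority[cid] == lab)
    return correct / len(class_labels)

# Toy example: two clusters, one sample in the "wrong" cluster -> accuracy 0.8.
print(clustering_accuracy([0, 0, 0, 1, 1], ["a", "a", "b", "b", "b"]))
```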
5.1. Examining the classification accuracy of FRCGC

To assess the ability of the FRCGC on real-world data, 21 classification datasets (i.e., data with labeled samples) with numerical attributes are chosen from the University of California at Irvine Machine Learning Repository [46]. The ALL-AML data [47], a collection of 72 leukemia patient samples, can be downloaded from http://www-genome.wi.mit.edu/cgi-bin/cancer/publications/pub_paper.cgi?mode=view&paper_id=43. Although clustering data could also be prepared for the evaluation of the FRCGC, classification data are more appropriate here, since the accuracy of the algorithm is more critical; the class labels are, of course, removed before applying the FRCGC. Table 8 briefly describes the datasets' names and their numbers of features, classes, and samples used in the experiments. These datasets cover several types of data and represent many different learning problems.

The classification accuracy of the FRCGC on the 21 classification datasets is shown in Table 9; the threshold L for the number of selected features is set to 0.5 for all these datasets. We have executed all algorithms 100 times independently with random initialization (different orderings of the samples) on each dataset. In Table 9, the best performance on each dataset corresponds to the highest value in its row,
Table 8
Statistics of the datasets used in the experiments: number of features (d), number of classes (c), and number of samples (n).

Dataset       d      c   n        Dataset        d    c   n
iris          4      3   150      wine           13   3   178
wobc          9      2   699      australian     14   2   690
autompg       7      3   398      balance        4    3   625
bupa          6      2   345      car            6    4   1728
ionosphere    34     2   351      haberman       3    2   306
heart         13     2   270      hepatitis      19   2   155
pima          8      2   768      sonar          60   2   208
tae           5      3   151      transfusion    4    2   748
wdbc          30     2   569      winequality    12   2   6497
wpbc          32     2   198      shuttle        9    7   58,000
ALL-AML       7129   2   72
and "–" denotes that the clustering method cannot obtain the correct number of classes, "*" denotes an out-of-memory error, "&" denotes that the clustering method cannot work on this dataset, and "#" denotes a running time of more than 3000 s. According to the results in this table, FRCGC is the best on 9 out of the 21 datasets, with five ties (Table 9); the performance of the FRCGC is comparable with that of the other methods, although clustering accuracy is not its essential issue. Standard deviations for the experiments on the different datasets are summarized in Table 10.

5.2. Friedman test

The results presented in Table 9 offer some insight into the performance of the algorithms. However, those results do not provide enough support for drawing a strong conclusion in favor of or against any of the studied methods. To arrive at stronger evidence, we resort to statistical testing of the results. The Holm test [48] is based on the relative performance of the clustering methods in terms of their ranks: for each dataset, the methods to be compared are sorted according to their performance, i.e., each method is assigned a rank (in case of ties, average ranks are assigned [48]). The test statistic for comparing two clustering methods is expressed as:

z = (Rank_j − Rank_k) / SE    (10)

where k is the number of clustering methods, N is the number of datasets, Rank_j = (Σ_{i=1}^{N} r_i^j)/N, SE = sqrt(k(k + 1)/(6N)), and r_i^j is the rank of clustering method j on the ith dataset. The z value is used to find the corresponding probability (p) from the table of the normal
Table 9
Classification accuracy for FRCGC and the other clustering methods (threshold value L = 0.5).

Dataset        KMPC    K-means   FCM     KAP     OCKC    STSC    K-means++   FRCGC
iris           0.493   0.848     0.893   0.813   0.787   0.893   0.888       0.967
wine           0.584   0.697     0.685   0.685   0.624   0.708   0.696       0.837
wobc           0.809   0.957     0.953   0.943   0.993   0.969   0.958       0.877
australian     0.59    0.559     0.561   –       &       0.648   0.562       0.794
autompg        0.626   0.651     0.656   0.626   0.613   0.683   0.65        0.628
balance        0.483   0.663     0.648   –       0.619   0.63    0.669       0.685
bupa           0.58    0.58      0.58    0.58    0.58    0.58    0.58        0.58
car            0.7     0.701     0.7     –       #       0.701   0.701       0.7
ionosphere     0.713   0.712     0.709   0.709   0.641   0.718   0.711       0.749
haberman       0.735   0.735     0.735   –       0.735   0.735   0.735       0.735
heart          0.556   0.593     0.593   0.607   0.557   0.611   0.59        0.778
hepatitis      0.557   0.561     0.594   0.568   0.554   0.606   0.561       0.632
pima           0.651   0.66      0.659   0.659   0.651   0.651   0.66        0.664
sonar          0.57    0.553     0.553   0.534   0.546   0.548   0.546       0.534
tae            0.401   0.381     0.384   –       0.353   0.375   0.383       0.47
transfusion    0.762   0.762     0.762   –       0.762   0.762   0.762       0.762
wdbc           0.636   0.854     0.854   0.775   0.627   0.912   0.854       0.805
winequality    0.754   0.787     0.786   –       #       0.768   0.787       0.754
wpbc           0.76    0.76      0.76    0.76    0.76    0.76    0.76        0.76
shuttle        *       0.839     0.865   #       *       *       0.829       0.863
ALL-AML        0.653   0.653     0.653   0.653   &       0.653   0.653       0.653
Table 10
Standard deviations for FRCGC and the other clustering methods.

Dataset        KMPC    K-means   FCM     KAP   OCKC    STSC    K-means++   FRCGC
iris           0.02    0.096     0       0     0       0       0.003       0
wine           0.016   0.012     0       0     0       0       0.009       0
wobc           0.008   0         0       0     0       0       0.001       0
australian     0.023   0         0       –     –       0       0.001       0
autompg        0       0.003     0       0     0       0       0.003       0
balance        0.013   0.043     0.088   –     0       0.001   0.021       0
bupa           0.012   0         0       0     0       0       0           0
car            0       0.002     0       –     #       0.002   0.001       0
ionosphere     0.093   0         0       0     0       0       0.001       0
haberman       0       0         0       –     0       0       0           0
heart          0       0         0       0     0.004   0       0.002       0
hepatitis      0.019   0         0       0     0.012   0       0.002       0
pima           0       0         0       0     0       0       0           0
sonar          0.036   0         0       0     0.015   0       0.007       0
tae            0.032   0.009     0       –     0.044   0.003   0.019       0
transfusion    0       0         0       –     0       0       0           0
wdbc           0.027   0         0       0     0       0.001   0           0
winequality    0       0.001     0       –     #       0       0.002       0
wpbc           0       0         0       0     0       0       0           0
shuttle        *       0.011     0.001   #     *       *       0.009       0
ALL-AML        0       0         0       0     &       0       0.1         0
distribution, which is then compared with an appropriate significance level α. We denote the ordered p values by p1, p2, . . ., so that p1 ≤ p2 ≤ . . . ≤ p_{k−1}. Holm's step-down procedure compares each p_i with α/(k − i), starting with the most significant p value. If p1 is below α/(k − 1), the corresponding hypothesis (that the two methods have the same performance) is rejected and we are allowed to compare p2 with α/(k − 2). If the second hypothesis is rejected, the test proceeds with the third one, and so on. As soon as a certain null hypothesis cannot be rejected, all the remaining hypotheses are retained as well. In this study, Rank_OCKC = 5.52, Rank_KAP = 5.43, Rank_KMPC = 4.43, Rank_K-means++ = 2.90, Rank_FCM = 2.86, Rank_K-means = 2.76, Rank_STSC = 2.48, and Rank_FRCGC = 2.33; with α = 0.05, k = 8 and N = 21, the standard error is SE = sqrt((8 × 9)/(6 × 21)) = 0.76. The Holm procedure rejects the first, second, and third hypotheses, since the corresponding p values are smaller than the adjusted α values (see Table 11). This shows that FRCGC performs significantly better than OCKC, KAP, and KMPC at the significance level α = 0.05. FRCGC is not significantly better than K-means++, FCM, K-means and STSC; however, FRCGC obtains the best rank in the Friedman test, and the knowledge represented by fuzzy rules is more human readable, as will be shown in the following subsection.
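The comparison can be reproduced with a short Python script: the z statistic of Eq. (10) is computed from the average ranks reported above, a two-sided p value is obtained from the normal distribution (our reading of how the p values in Table 11 were derived), and the ordered p values are compared with α/(k − i) in Holm's step-down fashion.

```python
import math

def holm_test(ranks, reference, N, alpha=0.05):
    """Holm step-down comparison of each method against a reference method (Eq. (10))."""
    k = len(ranks)
    se = math.sqrt(k * (k + 1) / (6.0 * N))
    others = [(m, r) for m, r in ranks.items() if m != reference]
    # z statistic and two-sided p value for every comparison with the reference.
    stats = [(m, (r - ranks[reference]) / se) for m, r in others]
    stats = [(m, z, math.erfc(abs(z) / math.sqrt(2))) for m, z in stats]
    stats.sort(key=lambda t: t[2])                      # order p(1) <= p(2) <= ...
    for i, (method, z, p) in enumerate(stats, start=1):
        threshold = alpha / (k - i)
        verdict = "rejected" if p < threshold else "retained (stop)"
        print(f"{method:10s} z = {z:5.2f}  p = {p:6.4f}  alpha/(k-i) = {threshold:.4f}  {verdict}")
        if p >= threshold:
            break

ranks = {"OCKC": 5.52, "KAP": 5.43, "KMPC": 4.43, "K-means++": 2.90,
         "FCM": 2.86, "K-means": 2.76, "STSC": 2.48, "FRCGC": 2.33}
holm_test(ranks, reference="FRCGC", N=21)
```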
5.3. The obtained descriptions

Compared with traditional clustering methods such as OCKC, KAP, KMPC, K-means++, FCM, K-means, and STSC, our proposed FRCGC can automatically mine fuzzy IF–THEN rules to describe each class. We illustrate the obtained descriptions on the Iris data. The data form a 150 × 4 matrix X = (xij)150×4, evenly distributed across three classes: iris-setosa, iris-versicolor, and iris-virginica. There are four features: sepal length and width, and petal length and width. Let M = {f1, f2, f3, f4} be the set of features and xi = (xi1, xi2, xi3, xi4) be the ith sample.

First of all, the data are normalized by Eq. (9), and the variances of the normalized features f̃1, f̃2, f̃3, and f̃4 are computed: r1 = var(f̃(:, 1)) = 0.05, r2 = var(f̃(:, 2)) = 0.03, r3 = var(f̃(:, 3)) = 0.09, and r4 = var(f̃(:, 4)) = 0.1. According to Algorithm 1 with L = 0.5, r_(1) = 0.1, r_(2) = 0.09, r_(3) = 0.05, r_(4) = 0.03, r_(1)/Σ_{j=1}^{4} r_(j) = 0.37 < L, and (r_(1) + r_(2))/Σ_{j=1}^{4} r_(j) = 0.69 ≥ L; thus, features f4 and f3 are selected. Given that the number of clusters is 3, the fuzzy numbers on features f4 and f3 are shown in Fig. 6. The descriptions obtained by FRCGC are as follows. The first cluster's description is "the value of petal length is small and the value of petal width is small". The second cluster's description is "the value of petal length is medium and the value of petal width is medium". The third cluster's description is "the value of petal length is large and the value of petal width is large".

5.4. The threshold value

From our proposed FRCGC, we can see that the threshold value L decides the number of selected features and controls the length of the fuzzy rules. To further verify the clustering performance of FRCGC, we analyze the relationship between the threshold value L and the classification accuracy. FRCGC is executed on all 21 datasets with the threshold value L set to 0.01, 0.02, . . ., 0.99, and the mean classification accuracy over the 21 datasets is shown in Fig. 7. Fig. 7 also shows the mean classification accuracy of OCKC, KAP, KMPC, K-means++, FCM, K-means, and STSC obtained from Table 9. One can easily see that the mean classification accuracy of FRCGC is always higher than the others'. From Fig. 7, the best threshold value L should be set to 0.5 according to these experimental studies.
Table 11
The Holm test.

i   Clustering method   z                              p       α/(k − i)
1   OCKC                (5.52 − 2.33)/0.76 = 4.22      0.0     0.007
2   KAP                 (5.43 − 2.33)/0.76 = 4.09      0.0     0.008
3   KMPC                (4.43 − 2.33)/0.76 = 2.77      0.006   0.01
4   K-means++           (2.90 − 2.33)/0.76 = 0.76      0.45    0.013
5   FCM                 (2.86 − 2.33)/0.76 = 0.69      0.49    0.017
6   K-means             (2.76 − 2.33)/0.76 = 0.57      0.57    0.025
7   STSC                (2.48 − 2.33)/0.76 = 0.19      0.85    0.05
5.5. The computational complexity

FRCGC is a non-iterative, linear-time algorithm for finding the clustering solution; it only needs to scan the dataset once. The computational complexity is O(n) + O(d), where O(n) is the computational complexity of mRFS and O(d) is the computational complexity of the FRCGC clustering step. The computational complexity of the original K-means algorithm is O(ndK) at each iteration [49], and the computational complexity of the original FCM algorithm is also O(ndK) at each iteration [50]. Table 12 shows the average running times (in seconds) over 100 runs of the proposed FRCGC and the other clustering methods.
[Fig. 6. Membership functions ("small", "medium", "big") of the fuzzy numbers formed for the Iris data: (a) petal length f3; (b) petal width f4.]
[Fig. 7. The mean classification accuracy of FRCGC, KMPC, K-means, FCM, KAP, OCKC, STSC, and K-means++ on the 21 datasets under different threshold values L.]
Table 12
Average running times (s) of FRCGC and the other clustering methods.

Dataset        KMPC    K-means   FCM      KAP     OCKC     STSC     K-means++   FRCGC
iris           0.575   0.003     0.003    1.036   0.211    0.204    0.001       0.001
wine           0.511   0.004     0.011    1.457   0.35     0.212    0.001       0.003
wobc           0.963   0.006     0.007    11.59   106.6    0.666    0.002       0.002
australian     1.198   0.007     0.023    –       –        0.613    0.004       0.002
autompg        0.664   0.007     0.018    5.24    5.22     0.29     0.003       0.002
balance        0.797   0.009     0.043    –       55.31    0.529    0.004       0.002
bupa           0.239   0.005     0.008    3.676   2.94     0.216    0.002       0.002
car            9.083   0.01      0.055    –       #        8.39     0.01        0.005
ionosphere     0.507   0.012     0.008    2.834   3.47     0.242    0.002       0.002
haberman       0.269   0.005     0.005    –       2.12     0.206    0.001       0.001
heart          0.354   0.006     0.008    2.659   1.382    0.199    0.002       0.002
hepatitis      0.43    0.008     0.007    1.3     0.249    0.162    0.002       0.002
pima           0.823   0.007     0.02     12.08   157.84   0.89     0.005       0.003
sonar          0.882   0.015     0.026    1.592   0.845    0.251    0.006       0.007
tae            0.804   0.007     0.011    –       0.295    0.312    0.003       0.002
transfusion    0.702   0.005     0.0217   –       135.5    0.743    0.002       0.002
wdbc           2.181   0.006     0.015    8.06    36.28    0.493    0.004       0.004
winequality    83.89   0.087     0.187    –       #        433.12   0.04        0.02
wpbc           0.559   0.005     0.01     1.535   0.511    0.176    0.003       0.003
shuttle        *       1.691     13.88    #       *        *        0.747       0.097
ALL-AML        590.5   0.51      0.747    0.636   &        0.172    0.054       0.184
From this table, one can see that KMPC, K-means, FCM, KAP, OCKC, and STSC always take more time than FRCGC on the 21 datasets, while the speeds of K-means++ and FRCGC are similar.
6. Conclusion

In this paper, we have proposed FRCGC, a rapid fuzzy rule clustering method based on granular computing that automatically explores the potential clusters in a dataset. The generated fuzzy rules, which represent the clusters, are human understandable, and the accuracy is acceptable: the performance of the FRCGC is comparable with that of the other methods. The Friedman test shows that FRCGC performs significantly better than OCKC, KAP, and KMPC at the significance level α = 0.05. Although FRCGC is not significantly better than K-means++, FCM, K-means and STSC, FRCGC obtains the best rank in the Friedman test, and the knowledge represented by fuzzy rules is more human readable. The threshold value L decides the number of selected features and controls the length of the fuzzy rules; the best threshold value L should be set to 0.5 according to the experimental studies. Therefore, the obtained fuzzy rules can operate as a suitable scheme for comprehensible representation of data and knowledge discovery in
data mining applications, while the other methods only determine the center and members of each cluster. References [1] K.S. Tan, W.H. Lim, N.A.M. Isa, Novel initialization scheme for fuzzy c-means algorithm on color image segmentation, Appl. Soft Comput. 13 (4) (2013) 1832–1852. [2] B. Hanczar, M. Nadif, Ensemble methods for biclustering tasks, Pattern Recogn. 45 (11) (2012) 3938–3949. [3] Y.K. Lam, P.W. Tsang, Exploratory k-means: a new simple and efficient algorithm for gene clustering, Appl. Soft Comput. 12 (3) (2012) 1149–1157. [4] C. Lu, X. Hu, J. Park, Exploiting the social tagging network for web clustering, IEEE Trans. Syst. Man Cybernet. A: Syst. Hum. 41 (5) (2011) 840–852. [5] H. Wu, J. Bu, C. Chen, J. Zhu, L. Zhang, H. Liu, C. Wang, D. Cai, Locally discriminative topic modeling, Pattern Recogn. 45 (1) (2012) 617–625. [6] S. Tabatabaei, M. Coates, M. Rabbat, GANC: Greedy agglomerative normalized cut for graph clustering, Pattern Recogn. 45 (2) (2012) 831–843. [7] J. Han, M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2006. [8] J. MacQueen, et al., Some methods for classification and analysis of multivariate observations, in: Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, California, USA, 1967, p. 14. [9] P. Sneath, R. Sokal, Numerical Taxonomy: The Principles and Practices of Numerical Classification, WH Freeman, San Francisco, 1973. [10] B. King, Step-wise clustering procedures, J. Am. Stat. Assoc. 62 (317) (1967) 86–101. [11] M. Dunham, Data Mining: Introductory and Advanced Topics, Prentice Hall, Upper Saddle River, NJ, 2003. [12] B. Frey, D. Dueck, Clustering by passing messages between data points, Science 315 (5814) (2007) 972–976. [13] C. Fowlkes, S. Belongie, F. Chung, J. Malik, Spectral grouping using the Nystrom method, IEEE Trans. Pattern Anal. Mach. Intell. 26 (2) (2004) 214–225. [14] L. Zelnik-Manor, P. Perona, Self-tuning spectral clustering, in: Advances in Neural Information Processing Systems, 2004, pp. 1601–1608. [15] P.K. Agarwal, N.H. Mustafa, k-Means projective clustering, in: Proceedings of the Twenty-Third ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, 2004, pp. 155–165. [16] D. Steinley, L. Hubert, Order-constrained solutions in k-means clustering: even better than being globally optimal, Psychometrika 73 (4) (2008) 647–664. [17] B. Patra, S. Nandi, P. Viswanath, A distance based clustering method for arbitrary shaped clusters in large datasets, Pattern Recogn. 44 (12) (2011) 2862–2870. [18] Y. Ren, X. Liu, W. Liu, DBCAMM: a novel density based clustering algorithm via using the Mahalanobis metric, Appl. Soft Comput. 12 (5) (2012) 1542–1554. [19] S. Chatzis, T. Varvarigou, Factor analysis latent subspace modeling and robust fuzzy clustering using t-distributions, IEEE Trans. Fuzzy Syst. 17 (3) (2009) 505–517. [20] J. Dunn, A Fuzzy Relative of the ISODATA Process and Its Use in Detecting Compact Well-separated Clusters, Taylor & Francis, 1973. [21] J.C. Bezdek, Fuzzy mathematics in pattern classification (Ph.D. thesis, Ph.D. Dissertation), Appl. Math. Cornell Univ., Ithaca, NY, 1973. [22] K. Honda, A. Notsu, H. Ichihashi, Fuzzy PCA-guided robust k-means clustering, IEEE Trans. Fuzzy Syst. 18 (1) (2010) 67–79. [23] Q. Tan, Q. He, W. Zhao, Z. Shi, E. Lee, An improved FCMBP fuzzy clustering method based on evolutionary programming, Comput. Math. Appl. 61 (4) (2011) 1129–1144. [24] D. Graves, J. Noppen, W. 
Pedrycz, Clustering with proximity knowledge and relational knowledge, Pattern Recogn. 45 (7) (2012) 2633–2644. [25] W. Pedrycz, V. Loia, S. Senatore, Fuzzy clustering with viewpoints, IEEE Trans. Fuzzy Syst. 18 (2) (2010) 274–284. [26] E. Mansoori, M. Zolghadri, S. Katebi, SGERD: a steady-state genetic algorithm for extracting fuzzy classification rules from data, IEEE Trans. Fuzzy Syst. 16 (4) (2008) 1061–1071.
[27] J. Nazarko, W. Zalewski, The fuzzy regression approach to peak load estimation in power distribution systems, IEEE Trans. Power Syst. 14 (3) (1999) 809–814. [28] I. Witten, E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, Morgan Kaufmann, 2005. [29] X. Liu, W. Wang, T. Chai, The fuzzy clustering analysis based on AFS theory, IEEE Trans. Syst. Man Cybernet. B: Cybernet. 35 (5) (2005) 1013–1027. [30] X. Xu, X. Liu, Y. Chen, Applications of axiomatic fuzzy set clustering method on management strategic analysis, Eur. J. Oper. Res. 198 (1) (2009) 297–304. [31] Y. Li, X. Liu, Y. Chen, Selection of logistics center location using axiomatic fuzzy set and TOPSIS methodology in logistics management, Expert Syst. Appl. 38 (6) (2011) 7901–7908. [32] X. Liu, Y. Ren, Novel artificial intelligent techniques via AFS theory: feature selection, concept categorization and characteristic description, Appl. Soft Comput. 10 (3) (2010) 793–805. [33] X. Liu, W. Pedrycz, T. Chai, M. Song, The development of fuzzy rough sets with the use of structures and algebras of axiomatic fuzzy sets, IEEE Trans. Knowl. Data Eng. 21 (3) (2009) 443–462. [34] T. Lin, Granular computing: from rough sets and neighborhood systems to information granulation and computing in words, in: European Congress on Intelligent Techniques and Soft Computing, 1997, pp. 1602–1606. [35] W. Pedrycz, Granular Computing: Analysis and Design of Intelligent Systems, CRC Press, 2013. [36] L. Zadeh, Toward a theory of fuzzy information granulation and its centrality in human reasoning and fuzzy logic, Fuzzy Set. Syst. 90 (2) (1997) 111–127. [37] Y. Yao, Granular computing: past, present and future, in: 2008 IEEE International Conference on Granular Computing, 2008, pp. 80–85. [38] A.R. Solis, G. Panoutsos, Granular computing neural-fuzzy modelling: a neutrosophic approach, Appl. Soft Comput. 13 (9) (2012) 4010–4021. [39] G. Panoutsos, M. Mahfouf, A neural-fuzzy modelling framework based on granular computing: concepts and applications, Fuzzy Set. Syst. 161 (21) (2010) 2808–2830. [40] H. Liu, S. Xiong, Z. Fang, FL-GrCCA: a granular computing classification algorithm based on fuzzy lattices, Comput. Math. Appl. 61 (1) (2011) 138–147. [41] A. Bargiela, W. Pedrycz, Toward a theory of granular computing for humancentered information processing, IEEE Trans. Fuzzy Syst. 16 (2) (2008) 320–330. [42] A. Ferreira, M. Figueiredo, An unsupervised approach to feature discretization and selection, Pattern Recogn. 45 (9) (2012) 3048–3060. [43] X. Zhang, W. Wang, K. Nørvag, M. Sebag, KAP: generating specified k clusters by efficient affinity propagation, in: 2010 IEEE 10th International Conference on Data Mining (ICDM), 2010, pp. 1187–1192. [44] D. Arthur, S. Vassilvitskii, k-means++: the advantages of careful seeding, in: Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, Society for Industrial and Applied Mathematics, 2007, pp. 1027–1035. [45] T. Li, S. Ma, M. Ogihara, Entropy-based criterion in categorical clustering, in: Proc. of the 21st Int. Conf. on Machine Learning, 2004, pp. 68–75. [46] K. Bache, M. Lichman, UCI Machine Learning Repository, University of California, School of Information and Computer Science, Irvine, CA, 2013 http://archive.ics.uci.edu/ml [47] T.R. Golub, D.K. Slonim, P. Tamayo, et al., Molecular classification of cancer: class discovery and class prediction by gene expression monitoring, Science 286 (5439) (1999) 531–537. [48] J.C. Huhn, E. 
Hullermeier, FR3: a fuzzy rule learner for inducing reliable classifiers, IEEE Trans. Fuzzy Syst. 17 (1) (2009) 138–149. [49] Y.-F. Zhang, J.-L. Mao, Z.-Y. Xiong, An efficient clustering algorithm, in: 2003 IEEE International Conference on Machine Learning and Cybernetics, 2003, pp. 261–265. [50] D. Graves, W. Pedrycz, Kernel-based fuzzy clustering and fuzzy clustering: a comparative experimental study, Fuzzy Set. Syst. 161 (4) (2010) 522–543.