Applied Soft Computing 11 (2011) 1117–1125
Autonomous and deterministic supervised fuzzy clustering with data imputation capabilities

Lim Kian Ming a, Loo Chu Kiong a, Lim Way Soong b

a Faculty of Information Science and Technology, Multimedia University, Jalan Ayer Keroh Lama, 75450 Bukit Beruang, Melaka, Malaysia
b Faculty of Engineering Technology, Multimedia University, Jalan Ayer Keroh Lama, 75450 Melaka, Malaysia
Article info

Article history: Received 14 March 2007; Received in revised form 15 July 2009; Accepted 27 February 2010; Available online 7 March 2010.

Keywords: Supervised fuzzy clustering; Global k-means; Optimal completion strategy; Fault diagnosis; Data imputation
Abstract

A fuzzy model based on an enhanced supervised fuzzy clustering algorithm is presented in this paper. The supervised fuzzy clustering algorithm of Janos Abonyi and Ferenc Szeifert (2003) allows each rule to represent more than one output, with different probabilities for each output. That algorithm uses k-means to initialize the fuzzy model. The main drawbacks of this approach are that the number of clusters is unknown and that the initial positions of the clusters are generated randomly. In this work, initialization is performed by the global k-means algorithm [1], which can autonomously determine the actual number of clusters needed and gives a deterministic clustering result. In addition, fast global k-means [1] is presented to improve the computation time. Besides that, when input data are collected as feature vectors, some feature values may be lost for a particular vector because of a faulty sensor reading. An efficient way to deal with missing values in enhanced supervised fuzzy clustering is imputation during data preprocessing, and a modified optimal completion strategy is presented for this purpose. This method allows imputation of missing data with high reliability and accuracy. The autonomous and deterministic enhanced supervised fuzzy clustering, using the supervised Gath–Geva clustering method and the modified optimal completion strategy, can be derived from the unsupervised Gath–Geva algorithm. The proposed algorithm is validated on benchmark data sets and on real vibration data, called Westland, collected from the aft gearbox of a U.S. Navy CH-46E helicopter.
1. Introduction

In this paper, an enhanced supervised fuzzy clustering algorithm [6] is proposed for pattern classification, which can autonomously and deterministically classify faults for diagnosis. For several years, fault diagnosis has been an active area of research and many approaches have been developed. It is particularly important to detect and classify fault modes, which allows us to estimate the time of failure of a component and to predict its remaining useful lifetime; in other words, fault diagnosis enables performance control. The fault diagnosis process starts with collecting data from sensors and extracting useful information from them. Next, a diagnosis method is needed to identify the abnormal condition (fault) from the data. In this paper, an enhanced supervised fuzzy model for fault diagnosis is discussed. Many methods for classifying faults have been researched. One of the
most important methods for fault classification is clustering, defined as the problem of finding homogeneous groups of data points, called clusters, in a given data set [1]. A cluster is a region whose density of data points is locally higher than in other clusters; in other words, clusters can be defined as subsets of the data. There are many alternatives for clustering, such as statistical models [2], neural networks [3], and fuzzy logic systems [4]. In this paper, an enhanced supervised fuzzy logic model is chosen because it can deal with small data sets, noisy data, and atypical data. Many approaches to fuzzy classification have been researched; basically, they can be divided into two major categories, unsupervised fuzzy clustering and supervised fuzzy clustering [5]. For classification, a model needs to be constructed that correctly predicts the class label of new unlabeled data. Model construction requires a set of examples, the training data set, to be fed into it for learning. If the training data are labeled, this is supervised training or supervised learning. In unsupervised training, by contrast, there are no output labels, and the model has to detect the inner structure of the data and group the training set according to some notion of similarity. Many fuzzy clustering methods have been widely researched. Some well-known fuzzy clustering algorithms are
Fuzzy C-means [19], Gustafson–Kessel [20], and Gath–Geva [10]. The Fuzzy C-means algorithm by Bezdek (1981) [19] is a fuzzy clustering algorithm, so each data point has a degree of belonging to the clusters. It is based on the minimization of an objective function, and the result depends on the initial random assignment of the clusters. Fuzzy C-means [19] computes with the standard Euclidean distance norm, which induces hyperspherical clusters. Gustafson and Kessel [20] extended the Fuzzy C-means algorithm [19] by utilizing an adaptive distance norm, the purpose of which is to detect clusters of different geometrical shapes in a single data set. In the Gustafson–Kessel algorithm [20], every cluster has its own norm-inducing matrix Ai; clusters are therefore allowed to adapt the distance norm to the local inner structure of the data points. The Gath–Geva algorithm [10] employs a distance norm based on the fuzzy maximum likelihood estimates. In contrast to the Gustafson–Kessel norm [20], this distance norm contains an exponential term and hence decreases faster than the inner-product norm. Gath and Geva [10] stated that the fuzzy maximum likelihood estimates clustering algorithm is able to detect clusters of varying shapes, sizes, and densities. The cluster covariance matrix is used in conjunction with an exponential distance, and the clusters are not constrained in volume. However, this algorithm is less robust because it needs a good starting initialization of the clusters and, owing to the exponential distance norm, it converges to a nearby local optimum.

Supervised fuzzy clustering [6] allows each rule to represent more than one output label, with a different degree of fuzziness for each. A classical fuzzy classifier has the limitation that one rule can describe only one of the classes. Supervised fuzzy clustering [6] removes this bottleneck by allowing the consequent part of a rule to describe more than one class, with different degrees of fuzziness assigned to each class. In addition, supervised fuzzy clustering allows direct supervised identification of fuzzy classifiers [6]. In this paper, the supervised fuzzy clustering algorithm [6] is used as a tool because it utilizes the class labels that describe the input data. However, the algorithm initializes the partition matrix randomly, and this stochastic nature affects the final classification performance: experiments have shown that the quality of the classification is sensitive to the initial starting positions of the clusters. Besides that, the algorithm requires the user to predetermine the number of clusters by trial and error, and different settings produce different results. To overcome these two serious drawbacks, the global k-means clustering algorithm [1] is chosen for the initialization step. Global k-means [1] can autonomously obtain the best location for each cluster by minimizing the clustering criterion and, because it runs in an incremental manner, it can also deterministically obtain the actual number of clusters needed.

Unfortunately, the proposed algorithms cannot deal with missing data, which occur in real-world applications: faulty sensor readings create input vectors with missing values. Many analyses have been done on missing data [14–16].
Instead of discarding the incomplete vectors, a four-step procedure based mainly on a modified optimal completion strategy [9] is proposed. The proposed method is based on data imputation, which predicts the missing values.

The paper is organized as follows. Section 2 briefly describes supervised clustering, fuzzy clustering, and the supervised fuzzy clustering algorithm [6]. In Section 3, the limitation of the k-means algorithm is discussed and the general steps of global k-means clustering [1] are presented; fast global k-means [1] is presented as well, to reduce the computation time of global k-means. To solve the missing data problem, a modified optimal completion strategy is presented in Section 4. Section 5 reports the experiments, conducted with the proposed enhanced supervised fuzzy clustering on benchmark data sets: comparisons between some well-known fuzzy clustering algorithms and the proposed enhanced supervised fuzzy clustering algorithm, between global k-means and random initialization, and between global k-means and fast global k-means; tests of data imputation using the modified optimal completion strategy; and tests on the Westland vibration data. Finally, the conclusion is presented in Section 6.

Fig. 1. Comparison between unsupervised clustering (a) and supervised clustering (b).

2. Learning algorithm

In the learning algorithm, the partition matrix, or membership function, is adjusted to gain better classification quality. The partition matrix is updated according to a distance measure, and the iteration terminates when the difference between the current and previous partition matrices is less than the termination tolerance.

2.1. Supervised clustering

Normally, clustering refers to an unsupervised learning framework that uses a specific error criterion, such as minimizing the distance between clusters. In contrast, supervised clustering makes use of the class labels and identifies the clusters that have a higher density with respect to the respective classes. Unsupervised clustering will often group objects of different classes into one cluster, whereas supervised clustering utilizes the class labels to directly supervise the identification of the cluster for each object. Fig. 1(a) shows a possible result of unsupervised clustering and Fig. 1(b) of supervised clustering; the rectangles represent the clusters. From Fig. 1(a), it can be seen that unsupervised clustering may group objects of different types into the same cluster and objects of the same type into different clusters. In contrast, Fig. 1(b) illustrates supervised clustering, which groups objects of the same type into the same cluster by using the labels of the objects.

2.2. Fuzzy clustering

Basically, clustering can follow a crisp or a fuzzy pattern. In crisp (hard), or non-fuzzy, clustering, the data are divided into crisp clusters. It is based on classical set theory, where each data point either belongs or does not belong to a cluster; each data point belongs to exactly one cluster. Let X = {x1, x2, . . ., xN} be the data set, let C = {c1, c2, . . ., cC} be the set of clusters, and let U = [Uik], i = 1, . . ., N, k = 1, . . ., C, be the partition matrix, where N is the number of data points. Hard clustering partitions X into C mutually exclusive clusters such that c1 ∪ c2 ∪ · · · ∪ cC = X and ci ∩ cj = ∅ for all i ≠ j. In terms of the membership function, Uik(xi) = 1 if xi ∈ ck and Uik(xi) = 0 if xi ∉ ck, for all i.
In real applications, data points need to be bound to several clusters with different levels of intensity, as there is rarely a sharp border between clusters; in other words, hard clustering is often not adequate. Fuzzy clustering is more constructive: the data points are allowed to belong to more than one cluster simultaneously, and associated with each point is a set of membership levels that indicate the degree to which the point belongs to the different clusters. Fuzzy clustering is the process of assigning these membership levels and then using them to assign data elements to one or more clusters. The membership values are real values in [0,1]. If Uik(xi) is close to 1, then xi has a high degree of membership in the kth cluster; if it is close to 0, then xi has a low intensity in the kth cluster. Let Uik denote the fuzzy partition matrix, with Uik(xi) ∈ [0,1] for i = 1, . . ., N, k = 1, . . ., C, where N is the number of data points. The sum of Uik over k = 1, 2, . . ., C equals 1; this limits the sum of each column to 1, so the total membership of each data point xi in the data set is equal to 1.
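To make the two membership schemes concrete, the short NumPy sketch below (not from the paper; the data values and cluster centers are made up for illustration) builds a crisp and a fuzzy partition matrix for the same points and verifies the column-sum constraint just described.

```python
import numpy as np

# Five 1-D data points and C = 2 clusters (illustrative values).
X = np.array([0.1, 0.2, 0.25, 0.9, 1.0])
centers = np.array([0.2, 0.95])

# Crisp partition: each point belongs to exactly one cluster (0/1 entries).
dist = np.abs(X[None, :] - centers[:, None])          # shape (C, N)
U_crisp = (dist == dist.min(axis=0)).astype(float)

# Fuzzy partition (FCM-style membership with m = 2): values in [0, 1].
m = 2
ratio = (dist[:, None, :] / dist[None, :, :]) ** (2 / (m - 1))
U_fuzzy = 1.0 / ratio.sum(axis=1)                     # shape (C, N)

# In both cases every column sums to 1: the total membership of each
# data point across all clusters equals 1.
assert np.allclose(U_crisp.sum(axis=0), 1.0)
assert np.allclose(U_fuzzy.sum(axis=0), 1.0)
print(U_fuzzy.round(3))
```

Points near a center receive fuzzy memberships close to 1 for that cluster, while the crisp matrix forces a hard 0/1 decision.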
2.3. Supervised fuzzy clustering

Supervised fuzzy clustering, introduced by Janos Abonyi and Ferenc Szeifert in 2003 [6], is a supervised fuzzy learning classifier that utilizes the class label of each vector for identification. In this section, supervised fuzzy clustering is briefly presented; the details can be found in [6]. Given a set of data Z, the objective of the supervised fuzzy clustering algorithm is to group similar data into R clusters, so clustering becomes the minimization of J:

$$\min\; J(Z, U, \eta) = \sum_{i=1}^{R} \sum_{k=1}^{N} (U_{i,k})^m\, D_{i,k}^2(z_k, r_i) \qquad (1)$$

where $D_{i,k}^2$ is the squared distance between the data and the ith cluster prototype, and m determines the fuzziness of the clusters (most of the time it is set to 2). The fuzzy partition matrix $U_{i,k}$ represents the degree of membership of each data point in the ith cluster. The supervised fuzzy clustering algorithm is as follows [6].

Initialization. Given a set of data Z, specify R and choose a termination tolerance ε > 0. Initialize the partition matrix U = [U_{i,k}]_{R×N} randomly, where U_{i,k} denotes the membership that the data point z_k is generated by the ith cluster.

Repeat for l = 1, 2, . . .

Step 1. Calculate the parameters of the clusters.

• Calculate the centers and standard deviations of the Gaussian membership functions ($v_i$ represents the center of the ith multivariate Gaussian, $\sigma_{i,j}^2$ stands for the variances):

$$v_i^{(l)} = \frac{\sum_{k=1}^{N} \bigl(U_{i,k}^{(l-1)}\bigr)^m x_k}{\sum_{k=1}^{N} \bigl(U_{i,k}^{(l-1)}\bigr)^m} \qquad (2)$$

$$\sigma_{i,j}^{2\,(l)} = \frac{\sum_{k=1}^{N} \bigl(U_{i,k}^{(l-1)}\bigr)^m (x_{j,k} - v_{i,j})^2}{\sum_{k=1}^{N} \bigl(U_{i,k}^{(l-1)}\bigr)^m} \qquad (3)$$

• Calculate the a priori probability of the cluster:

$$P(r_i) = \frac{1}{N} \sum_{k=1}^{N} U_{i,k}^{(l-1)}, \quad 1 \le i \le R \qquad (4)$$

• Calculate the consequent probability parameters:

$$P(c_i \mid r_i) = \frac{\sum_{k \mid y_k = c_i} \bigl(U_{i,k}^{(l-1)}\bigr)^m}{\sum_{k=1}^{N} \bigl(U_{i,k}^{(l-1)}\bigr)^m} \qquad (5)$$

• Calculate the weight of the rules:

$$\omega_i = P(r_i) \prod_{j=1}^{n} \frac{1}{\sqrt{2\pi\sigma_{i,j}^2}}, \quad 1 \le i \le C \qquad (6)$$

Step 2. Calculate the distance measure $D_{i,k}^2(z_k, r_i)$:

$$\frac{1}{D_{i,k}^2(z_k, r_i)} = P(r_i) \prod_{j=1}^{n} \exp\!\left( -\frac{1}{2} \frac{(x_{j,k} - v_{i,j})^2}{\sigma_{i,j}^2} \right) P(c_j = y_k \mid r_i) \qquad (7)$$

Step 3. Update the partition matrix:

$$U_{i,k}^{(l)} = \frac{1}{\sum_{j=1}^{R} \bigl( D_{i,k}(z_k, r_i) / D_{j,k}(z_k, r_j) \bigr)^{2/(m-1)}}, \quad 1 \le i \le R,\; 1 \le k \le N \qquad (8)$$

Until $\| U^{(l)} - U^{(l-1)} \| < \varepsilon$.

Although supervised fuzzy clustering [6] is a good learning algorithm, it is stochastic and cannot perform the learning autonomously. This is owing to the fact that it is sensitive to the initial starting points of the clusters: the final performance of the learning algorithm depends on the quality of the initialization. Besides this stochastic nature, the algorithm is not able to determine the number of clusters needed for different data sets, so the user has to find it by trial and error. To improve the algorithm, the global k-means algorithm [1] is chosen as a forerunner that discovers the number of clusters needed before the data set is fed into the learning algorithm.

3. Cluster initialization

In this section, the global k-means algorithm [1] is presented to overcome the stochastic nature of the supervised fuzzy model and the difficulty of determining the actual number of clusters it needs. Then, fast global k-means [1] is proposed to reduce the computation time of global k-means [1] without degrading the quality.

3.1. Global k-means clustering

The global k-means algorithm was presented by A. Likas, N. Vlassis, and J.J. Verbeek in 2003 [1]. Global k-means clustering [1] uses a clustering criterion that minimizes the sum of squared Euclidean distances between each data point and its cluster center, and it employs the k-means algorithm as a local search procedure. The k-means algorithm has been widely used in clustering problems; it clusters the data into k disjoint subsets and, depending on the clustering criterion, finds locally optimal solutions. Usually, it aims to minimize the objective function known as the sum of squared Euclidean distances, given by

$$E(C_1, \ldots, C_M) = \sum_{i=1}^{M} \sum_{j=1}^{M_i} \| C_i^j - C_i \|^2 \qquad (9)$$
where $\|C_i^j - C_i\|$ is the Euclidean distance between the data point $C_i^j$ and the cluster center $C_i$. The k-means algorithm starts by initializing the cluster centers at arbitrary random positions. Next, the membership of each data point is computed by associating it with
the nearest cluster center. The cluster centers are then shifted at each step so as to minimize the clustering criterion, and the algorithm keeps looping until the cluster centers no longer change; at this point the algorithm has converged, although convergence to the global optimum is not guaranteed. This algorithm suffers from several serious drawbacks that significantly affect its performance. One is that the outcome depends heavily on the randomly selected initial starting conditions [8], so the algorithm is very sensitive to the initial cluster center positions. Another main drawback of k-means is that the actual number of clusters must be defined beforehand: if the number of clusters is not well chosen, the performance of the algorithm suffers, and the user has to resort to trial and error to obtain the number of clusters actually needed. Many methods have been researched to remove this stochastic nature and to obtain the exact number of clusters required; one of them is the global k-means clustering algorithm [1].

Instead of randomly selecting initial values for all cluster centers, global k-means clustering [1] proceeds incrementally, attempting to optimally add one new cluster at a time. To solve a clustering problem with M clusters, global k-means [1] starts with one cluster (k = 1) and finds its best position, which for a given data set is the centroid of the data set X. Next, to solve the problem for two clusters (k = 2), global k-means [1] attempts to place a new cluster at its best location given the best position of the first cluster. To achieve this, it performs N executions of the k-means algorithm, where N is the number of data points. At execution n, the first cluster center, already obtained from the k = 1 problem, is always placed at its best position, while the second cluster is positioned at the data point xn, 1 ≤ n ≤ N. The second cluster is then placed at the point that gives the best solution over the N executions of k-means. By proceeding incrementally in this way, the solutions for all intermediate numbers of clusters, and finally for M clusters, are obtained. Let (m1(k), . . ., mk(k)) denote the optimal solution of the k-clustering problem. We incrementally solve the 1-clustering problem, the 2-clustering problem, and so on up to the (k − 1)-clustering problem; the solution of the k-clustering problem is then obtained by performing N executions of k-means with the starting positions (m1(k − 1), . . ., mk−1(k − 1), xn). The algorithm runs as follows, and is sketched in code after this listing:

Start: Find the optimal position of the first cluster (k = 1), which corresponds to the centroid of the data set X. Let (m1(k), . . ., mk(k)) be the optimal solution of the k-clustering problem.
For k = 2 to M (the (k − 1)-cluster optimal solution having been found):
• Run N executions of the k-means algorithm with k clusters, where each run n starts from the initial state (m1(k − 1), . . ., mk−1(k − 1), xn).
• The optimal solution obtained from the N runs is taken as the solution (m1(k), . . ., mk(k)) of the k-clustering problem.
End.
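The listing above translates almost directly into code. The following NumPy sketch is a minimal illustration of the incremental scheme, not the authors' implementation; the `kmeans` helper and all parameter choices (iteration cap, convergence test) are assumptions.

```python
import numpy as np

def kmeans(X, centers, iters=100):
    """Plain k-means local search started from the given centers."""
    centers = np.asarray(centers, dtype=float)
    for _ in range(iters):
        # Assign each point to its nearest center (criterion of Eq. (9)).
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        new = np.array([X[labels == i].mean(axis=0) if np.any(labels == i)
                        else centers[i] for i in range(len(centers))])
        if np.allclose(new, centers):   # centers stopped moving: converged
            break
        centers = new
    error = ((X - centers[labels]) ** 2).sum()
    return centers, error

def global_kmeans(X, M):
    """Add one cluster at a time; try every data point as the new center."""
    centers = X.mean(axis=0, keepdims=True)        # best 1-clustering: centroid
    solutions = {1: (centers, ((X - centers) ** 2).sum())}
    for k in range(2, M + 1):
        best_centers, best_err = None, np.inf
        for x_n in X:                              # N k-means runs for each k
            cand, err = kmeans(X, np.vstack([centers, x_n]))
            if err < best_err:
                best_centers, best_err = cand, err
        centers = best_centers
        solutions[k] = (centers, best_err)         # every intermediate k kept
    return solutions
```

Because `solutions` retains every intermediate k-clustering solution together with its error, the actual number of clusters can be read off afterwards, which is the property the enhanced supervised fuzzy clustering exploits.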
When the algorithm finishes, the optimal solution is obtained for the M-clustering problem and also for all k-clustering problems with k < M. After all the intermediate clusterings have been computed, the actual number of clusters needed to solve the problem can be obtained. Furthermore, the global k-means clustering algorithm [1] solves the randomization problem in the initialization phase of supervised fuzzy clustering [6]: instead of initializing the partition matrix U = [U_{i,k}]_{R×N} randomly, we initialize it with the global k-means algorithm [1]. Unfortunately, the weakness of global k-means [1] is that its computation time can be quite large.

3.2. Fast global k-means algorithm

Although global k-means [1] manages to solve the drawbacks of supervised fuzzy clustering [6], it suffers from a heavy computational load. Thus, the fast global k-means algorithm [1] is introduced to speed up the computation without affecting the performance of the proposed solution. From the global k-means algorithm [1], the set of solutions (m1(k − 1), . . ., mk−1(k − 1)) is obtained before the algorithm solves the k-clustering problem. Given the solution of the (k − 1)-clustering problem, fast global k-means proceeds as follows [1].

1. In global k-means, the k-means algorithm is run from each candidate initial position in X = {x1, x2, . . ., xN} until convergence, to obtain the best solution for the new cluster center. In fast global k-means, the k-means algorithm is not executed until convergence for every candidate; instead, an upper bound on the clustering error is computed to choose the new cluster center before executing k-means.

2. The upper bound $E_i$ on the error is computed by

$$E_i \le E - b_i \qquad (10)$$

$$b_i = \sum_{j=1}^{N} \max\!\left( d_{k-1}^{j} - \|x_i - x_j\|^2,\; 0 \right) \qquad (11)$$

$$n = \arg\max_i b_i \qquad (12)$$
3. The upper bound $E_i$ on the clustering error is computed for every candidate $x_i$. Here E represents the clustering error of the (k − 1)-clustering problem (Eq. (10)) and $b_i$ is defined in Eq. (11). The data point $x_n$ with the minimum upper bound $E_i$, i.e., the maximum value of $b_i$, is chosen as the new cluster center for the k-clustering problem. The k-means algorithm is then initialized with this chosen cluster center and executed until convergence.

4. $d_{k-1}^{j}$ denotes the squared distance between $x_j$ and the nearest of the k − 1 cluster centers already obtained, and $b_i$ is the guaranteed reduction in the error measure obtained by inserting a new cluster at position $x_i$. After the new cluster center is placed at $x_i$, it is allocated all points $x_j$ whose squared distance from $x_i$ is smaller than their distance $d_{k-1}^{j}$ to their previous nearest center; hence, by [1], each such data point is guaranteed to yield an error reduction of $d_{k-1}^{j} - \|x_i - x_j\|^2$.

4. Data imputation method

Unfortunately, the proposed enhanced fuzzy clustering algorithms cannot deal with missing values, yet data samples collected from real-world applications contain missing data due to sensor failures. For example, a particular data sample Xk might be incomplete, having the form Xk = (2.2 ? 2.3 ? 2.5), where the second and fourth feature values are missing. Therefore, there is a need for algorithms that can deal with missing values. As mentioned before, simply discarding such input vectors can affect the fault diagnosis. For that reason, an imputation method is proposed: the missing values are imputed first, and learning and classification then proceed on the imputed vectors. In this paper, a modified optimal completion strategy [9] is proposed to solve this problem.

4.1. Modified optimal completion strategy

To handle missing values, the proposed method is based on data imputation: it estimates the missing values with an estimation method, fills them in, and then trains and optimizes on the resulting complete data set. In this paper, a four-step procedure based mainly on a modified optimal completion strategy [9] is proposed.
Table 1. Result comparison of three well-known fuzzy clustering algorithms and the proposed supervised fuzzy clustering (classification accuracy, %; ten-fold validation).

Data set | Fuzzy C-means | Gustafson–Kessel | Gath–Geva | Proposed enhanced SFC
Iris     | 89.33         | 90.00            | 80.00     | 95.33
Cancer   | 96.63         | 90.19            | 88.58     | 92.97
The proposed algorithm first prepares the complete training samples, obtained by extracting all the non-missing data samples from the original data set. In this paper, the proposed enhanced supervised fuzzy clustering is used, so the output labels of the data sets are considered; in the preprocessing stage, however, the labels are not used. First, the indexes and labels of the missing and non-missing values in the data sets are recorded. Next, the complete data samples are trained with the unsupervised Gath–Geva algorithm [10], applying global k-means, to obtain the initial cluster prototypes V(0) and the initial data membership (partition) matrix U(0). Then, for the incomplete data set, the distance of each incomplete data sample Xm(0) to each cluster centroid in V(0) is computed, and the missing values of Xm(0) are initialized with the values of the nearest cluster centroid in V(0). After that, the algorithm proceeds with the modified optimal completion strategy.

The modified optimal completion strategy views the missing data Xm(0) as additional variables over which one optimizes in order to obtain the smallest possible value of the clustering criterion J; the strategy completes the missing values in the way that leads to the smallest possible value of J given the available data. The available data are all the data sample values that exist: for example, for Xk = (2.2 ? 2.3 ? 2.5), the available data are the first, third, and fifth positions, namely 2.2, 2.3, and 2.5. For this particular incomplete sample, the modified optimal completion strategy completes the missing part by assigning the values that minimize the clustering criterion J. The algorithm goes as follows. Having V(0), U(0), and Xm(0), repeat for l = 1, 2, . . .

Step 1: Train Xm(0) with the unsupervised Gath–Geva algorithm, so that unsupervised learning is used for the estimation of the missing values.

Step 2: Compute the distance measure $D_{i,k}^2(z_k, r_i)$:
$$\frac{1}{D_{i,k}^2(z_k, r_i)} = P(r_i) \prod_{j=1}^{n} \exp\!\left( -\frac{1}{2} \frac{(x_{j,k} - v_{i,j})^2}{\sigma_{i,j}^2} \right) \qquad (13)$$
Step 3: Update the partition matrix:

$$U_{i,k}^{(l)} = \frac{1}{\sum_{j=1}^{R} \bigl( D_{i,k}(z_k, r_i) / D_{j,k}(z_k, r_j) \bigr)^{2/(m-1)}}, \quad 1 \le i \le R,\; 1 \le k \le N \qquad (14)$$
Step 4: Check termination: IF ||U(l) − U(l−1)|| < ε, THEN exit with the estimated Xm; ELSE update the missing values by the OCSFCM-5 rule [9]:
$$x_{mk,j}^{(l)} = \frac{\sum_{i=1}^{c} \bigl(U_{i,k}^{(l)}\bigr)^m v_{i,j}^{(l)}}{\sum_{i=1}^{c} \bigl(U_{i,k}^{(l)}\bigr)^m} \qquad (15)$$
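To make the imputation loop concrete, here is a minimal NumPy sketch. It assumes the Gaussian cluster parameters (centers `v`, variances `sigma2`, priors `p_r`) have already been obtained from the complete samples as described above; the distance is the class-free Gath–Geva measure of Eq. (13), the membership update is Eq. (14), and missing entries are re-estimated with Eq. (15). All names and numeric guards are illustrative assumptions, not the authors' code.

```python
import numpy as np

def impute_ocs(X, missing, v, sigma2, p_r, m=2, eps=1e-4, max_iter=100):
    """Modified optimal completion strategy (sketch).

    X        : (N, d) data; missing entries pre-filled with the values of
               the nearest cluster centroid (the initialization above)
    missing  : (N, d) boolean mask of the originally missing entries
    v, sigma2: (R, d) cluster centers and variances
    p_r      : (R,) a priori cluster probabilities
    """
    U_prev = None
    for _ in range(max_iter):
        # Eq. (13): 1/D^2 = P(r_i) * prod_j exp(-(x_jk - v_ij)^2 / (2 s_ij^2))
        inv_D2 = p_r[:, None] * np.exp(
            -0.5 * ((X[None, :, :] - v[:, None, :]) ** 2 / sigma2[:, None, :])
        ).prod(axis=2)                                    # shape (R, N)
        D2 = 1.0 / np.maximum(inv_D2, 1e-300)
        # Eq. (14): U_ik = 1 / sum_j (D_ik / D_jk)^(2/(m-1))
        U = 1.0 / ((D2[:, None, :] / D2[None, :, :]) ** (1 / (m - 1))).sum(axis=1)
        # Eq. (15): replace missing entries by membership-weighted centers.
        w = U ** m                                        # (R, N)
        X_hat = (w.T @ v) / w.sum(axis=0)[:, None]        # (N, d)
        X[missing] = X_hat[missing]
        # Step 4: stop when the partition matrix no longer changes.
        if U_prev is not None and np.linalg.norm(U - U_prev) < eps:
            break
        U_prev = U
    return X, U
```

In the full four-step procedure this loop is interleaved with retraining of the unsupervised Gath–Geva model (Step 1), which the sketch omits for brevity.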
There are basically two main differences between the optimal completion strategy [9] and the proposed solution. First, the optimal completion strategy in [9] uses Fuzzy C-means, whereas the proposed solution uses supervised Gath–Geva clustering. Second, the optimal completion strategy in [9] uses V as the terminating condition, whereas the proposed solution uses the partition matrix U. After imputation of the missing data, complete data samples are obtained.

5. Experiments and results

5.1. General description

First, a test is conducted to compare the proposed enhanced supervised fuzzy clustering with three well-known fuzzy clustering algorithms. Next, an experiment is conducted using the enhanced supervised fuzzy clustering model with two types of initialization, random and global k-means; for this experiment, the medical data sets and the Iris data set are used to test the quality of the performance of both types of initialization. Then, using the medical benchmark data sets, an experiment is carried out to compare the computation time and the classification performance of enhanced supervised fuzzy clustering initialized with global k-means and with fast global k-means. After that, experiments are performed on imputing data sets with missing values using the modified optimal completion strategy; in these tests, the Iris and Pima data sets are used, and the missing values are created in a completely random way. The following experiment tests the Westland vibration data set, classified with enhanced supervised fuzzy clustering initialized by global k-means. Lastly, the Westland vibration data set is tested for imputation of missing values ranging from 0% to 50% using the modified optimal completion strategy.

5.2. Comparative tests between some fuzzy clustering algorithms and the proposed enhanced supervised fuzzy clustering

In this section, a comparative study was done on the performance of the proposed modification of the supervised fuzzy clustering (SFC) algorithm and three well-known fuzzy clustering algorithms: Fuzzy C-means (FCM) [19], Gustafson–Kessel (GK) [20], and Gath–Geva (GG) [10]. The Iris data set [11] and the Wisconsin Breast Cancer (WBC) data set were used; these two data sets are described in detail in Section 5.3. From Table 1, it is clear that the proposed supervised fuzzy clustering outperforms the three unsupervised fuzzy clustering algorithms on the Iris data set and is comparable with Fuzzy C-means on the Cancer data set. The main reasons are that supervised fuzzy clustering utilizes the class labels for identification, while the performance of the three well-known fuzzy clustering algorithms depends heavily on the initial starting positions of the clusters.

5.3. Comparative tests between random initialization and global k-means
In this section, the performance of the proposed modification of the supervised fuzzy clustering algorithm is compared with the original algorithm. The performance is evaluated by testing the proposed algorithm on several well-known data sets: the Iris data set [11] and the Wisconsin Breast Cancer (WBC), Dermatology (D), Heart Disease (HD), Hepatobiliary Disorder (HEPATO), Kala-azar Disease (KZAR), and Pima Indian Diabetes (PIMA) data sets. Experiments have been conducted to examine the applicability of the proposed enhanced supervised fuzzy clustering to these diagnosis problems.

5.3.1. The Wisconsin Breast Cancer (WBC) data set

The Wisconsin Breast Cancer (WBC) data set consists of 569 samples, each with 10 input features. There are 2 classes, one labeled 0 to
Table 2. Result comparison of supervised fuzzy clustering with random initialization and with the proposed solution (classification accuracy, %; ten-fold validation).

                 | SFC (random initialization) | Enhanced SFC (global k-means, single trial)
Minimum accuracy | 94.67                       | –
Average accuracy | 94.80                       | 95.33
Maximum accuracy | 95.33                       | –
represent benign and the other labeled 1 to represent malignant. Of the WBC samples, 357 are benign and 212 are malignant.

5.3.2. The Dermatology (D) data set

The Dermatology (D) data set consists of 358 samples, each with 34 input features. There are 6 classes: psoriasis, seboreic dermatitis, lichen planus, pityriasis rosea, cronic dermatitis, and pityriasis rubra pilaris.

5.3.3. The Heart Disease (HD) data set

The Heart Disease data set consists of 270 samples, each with 13 input features. There are 2 classes, labeled 0 for absence and 1 for presence of the disease. Of the HD samples, 150 show absence and 120 presence of the disease.

5.3.4. The Hepatobiliary Disorder (HEPATO) data set

The Hepatobiliary Disorder (HEPATO) data set consists of 536 samples, each with 9 input features. There are 4 classes: Alcoholic Liver Damage, Primary Hepatoma, Liver Cirrhosis, and Cholelithiasis.

5.3.5. The Kala-azar Disease (KZAR) data set

The Kala-azar Disease (KZAR) data set consists of 68 samples, each with 7 input features. There are 2 classes: 0 for normal and 1 for presence of the disease.

5.3.6. The Pima Indian Diabetes (PIMA) data set

The Pima Indian Diabetes (PIMA) data set is available from the UCI machine learning repository [11]. The Pima data set contains 768 samples with 8 features. The data exhibit class overlap, which makes clustering harder. The data set is divided into two classes, indicating whether an individual is diabetes positive or negative; there are 500 examples of class 1 (positive) and 268 of class 2 (negative).

5.3.7. The Iris data set

The Iris data set has been widely used to test various clustering algorithms. It contains 4 features in 150 samples. There are 3 classes, Setosa, Versicolour, and Virginica, with 50 samples per class.

From Table 2, the worst case of the supervised fuzzy clustering algorithm [6] with random initialization is 94.67%, the average accuracy is 94.80%, and the best case is 95.33%.
Table 4. Values of parameter M and the actual number of clusters obtained.

Data set    | M clusters | Actual number of clusters
Cancer      | 18         | 8
Dermatology | 15         | 5
Heart       | 10         | 10
Hepato      | 30         | 14
Kzar        | 5          | 3
Pima        | 40         | 33
As Table 2 shows, the proposed solution gives a better result than random initialization. The ten-fold validation experiment with the proposed solution showed a mean classification accuracy of 95.33%. The parameter M mentioned above was set to 10, and after the ten-fold validation run, the global k-means clustering algorithm gave the actual number of clusters needed, which is 9. The result for supervised fuzzy clustering [6] with random initialization was obtained by setting its number-of-clusters parameter to the actual number of clusters found by the proposed algorithm; if the actual number of clusters is not used, the result is lower than the values shown in Table 2. From Table 2, the best case of supervised fuzzy clustering [6] with random initialization is merely comparable to the average case of the proposed algorithm. Clearly, even when the parameter is set to the optimal number of clusters, the mean accuracy is still lower than that of enhanced supervised fuzzy clustering (global k-means with a single trial). Hence, the proposed algorithm not only gives better results but also provides the actual number of clusters, which is very useful in many applications. The latter is obtained essentially for free, because when solving the M-clustering problem all intermediate k-clustering problems, k = 1, . . ., M, are also solved.

Table 3 shows the accuracy of both the randomly initialized algorithm and the proposed algorithm for 6 medical data sets. From the table, it is clear that the mean accuracy of the proposed algorithm is higher than even the maximum accuracy of the randomly initialized algorithm for most of the data sets. Table 4 shows the values assigned to parameter M and the actual number of clusters obtained after execution of the proposed algorithm. The accuracy of the supervised fuzzy clustering (random initialization) algorithm [6] in Table 3 was obtained using the actual numbers of clusters in Table 4. Even with its parameter set to the actual number of clusters, supervised fuzzy clustering with random initialization [6] is still not comparable in accuracy to the proposed enhanced supervised fuzzy clustering (global k-means with a single trial).

5.4. Comparative tests between global k-means and fast global k-means

Although the global k-means algorithm [1] manages to overcome the drawbacks of the k-means algorithm, it suffers from a heavy computation time. To reduce the computational load, the fast global k-means algorithm [1] is introduced to accelerate the algorithm.
Table 3. Classification rates for the Cancer, Dermatology, Heart, Hepato, Kzar, and Pima data sets (%; ten-fold validation).

            | SFC (random initialization)  | Enhanced SFC (global k-means, single trial)
Data set    | Minimum | Mean   | Maximum   | Mean
Cancer      | 89.45   | 90.17  | 90.85     | 92.97
Dermatology | 60.67   | 64.19  | 67.05     | 81.02
Heart       | 73.70   | 75.81  | 77.04     | 77.04
Hepato      | 56.51   | 58.57  | 60.27     | 58.78
Kzar        | 83.10   | 84.45  | 84.76     | 84.76
Pima        | 66.42   | 68.03  | 69.01     | 70.19
Table 5. Accuracy comparison between global k-means and fast global k-means for the Iris, Cancer, Dermatology, Heart, Hepato, Kzar, and Pima data sets (%; ten-fold validation).

Data set    | Global k-means | Fast global k-means
Iris        | 95.3333        | 95.3333
Cancer      | 92.9668        | 87.8697
Dermatology | 81.0238        | 71.2381
Heart       | 77.0370        | 68.1481
Hepato      | 58.7771        | 56.1495
Kzar        | 84.7619        | 84.7619
Pima        | 70.1948        | 71.2389
Table 6. Running time (CPU time, in seconds) of global k-means and fast global k-means for the Iris, Cancer, Dermatology, Heart, Hepato, Kzar, and Pima data sets (ten-fold validation).

Data set    | Global k-means | Fast global k-means
Iris        | 64.63          | 14.31
Cancer      | 1095.30        | 35.88
Dermatology | 231.44         | 4.20
Heart       | 92.27          | 5.75
Hepato      | 2250.40        | 48.30
Kzar        | 8.25           | 1.42
Pima        | 5613.90        | 116.16
Tables 5 and 6 show that the fast global k-means algorithm [1] does not significantly decrease the accuracy compared with the global k-means algorithm [1], while a comparison of the computation times shows that fast global k-means greatly improves the running time without degrading the accuracy of the model. Fast global k-means applies the upper-bound error value instead of running the full local search procedure; hence the time needed to initialize the cluster centers and run the k-means algorithm is reduced.

5.5. Tests on data imputation
Table 7. Accuracy for the Iris data set (%; ten-fold validation).

Missing percentage | Accuracy
0                  | 95.33
10                 | 86.67
20                 | 81.33
30                 | 78.00
40                 | 53.33
Table 8. Accuracy for the Pima data set (%; ten-fold validation).

Missing percentage | Accuracy
0                  | 70.19
10                 | 69.79
20                 | 67.20
30                 | 65.10
In this section, the performance of the proposed missing-data imputation algorithm, the modified optimal completion strategy, is evaluated on the Iris and Pima data sets. A ten-fold validation was employed (Fig. 2). From Tables 7 and 8, it is clear that the percentage of correct classification is quite stable across the missing-value percentages; the accuracy does not drop much (Fig. 3). The missing values for the Iris and Pima data sets were created at random from the complete Iris and Pima data sets. The tests showed that the proposed method does not noticeably degrade performance when it is used to preprocess the data before feeding it into the enhanced supervised fuzzy clustering.

Fig. 2. Classification results for the Iris data set.

Fig. 3. Classification results for the Pima data set.

5.6. Westland vibration data set

A real-world case study was done to test the proposed enhanced supervised fuzzy clustering, with initialization by global k-means, on the popular benchmark data set Westland [12]. This data set consists of vibration time-series data gathered from the aft main power transmission of a U.S. Navy CH-46E helicopter by placing eight accelerometers at the known fault-sensitive locations of the helicopter gearbox. The data were recorded for various faults, including a no-defect case (Table 9). The data set covers 9 torque levels; in this paper, only the 100% torque level on sensors 1 to 4 is used. As the number of features in this data set is quite substantial, feature extraction was needed: wavelet packet feature extraction [17] is used to shrink the dimension of the input vectors without affecting the quality of the classifier too much.

Table 9. Westland helicopter gearbox data description.

Fault type number | Description
2                 | Planetary bearing corrosion
3                 | Input pinion bearing corrosion
4                 | Spiral bevel input pinion spalling
5                 | Helical input pinion chipping
6                 | Helical idler gear crack propagation
7                 | Collector gear crack propagation
8                 | Quill shaft crack propagation
9                 | No defect
Table 10. Correct classification rate for Westland (%).

Accelerometer | Accuracy
1             | 89.30
2             | 89.31
3             | 90.59
4             | 86.34
Table 11. Correct classification rate for Sensor 1 (%).

Missing features (%) | Accuracy
0                    | 89.30
10                   | 80.68
20                   | 75.51
30                   | 74.49
40                   | 73.06
50                   | 72.94

Fig. 4. Correct classification rate for Sensor 1 against the percentage of missing features.
Table 12. Correct classification rate for Sensor 2 (%).

Missing features (%) | Accuracy
0                    | 89.31
10                   | 76.41
20                   | 64.56
30                   | 63.78
40                   | 62.61
50                   | 60.04

Fig. 5. Correct classification rate for Sensor 2 against the percentage of missing features.
Wavelet packets, a generalization of wavelet bases, are alternative bases formed by taking linear combinations of the usual wavelet functions [18,7]. These bases inherit properties such as orthonormality and time–frequency localization from their corresponding wavelet functions. Wavelet packet functions can be defined as

$$W_{j,k}^{n}(t) = 2^{j/2}\, W^{n}(2^{j} t - k) \qquad (16)$$

where n is the modulation or oscillation parameter, j is the scale index, and k is the translation. For a function f, the wavelet packet coefficients can be computed by

$$w_{j,n,k} = \langle f, W_{j,k}^{n} \rangle = \int f(t)\, W_{j,k}^{n}(t)\, dt \qquad (17)$$
Concisely, the steps are as follows:

1. Decompose the vibration signal using the Wavelet Packet Transform (WPT) to extract the time–frequency-dependent information.

Fig. 6. Correct classification rate for Sensor 3 against the percentage of missing features.
Table 13. Correct classification rate for Sensor 3 (%).

Missing features (%) | Accuracy
0                    | 90.59
10                   | 78.10
20                   | 76.80
30                   | 76.68
40                   | 76.03
50                   | 74.74
Table 14. Correct classification rate for Sensor 4 (%).

Missing features (%) | Accuracy
0                    | 86.34
10                   | 77.07
20                   | 70.24
30                   | 69.82
40                   | 68.30
50                   | 65.07

Fig. 7. Correct classification rate for Sensor 4 against the percentage of missing features.
For each vibration signal segment, a full decomposition is performed up to the seventh level.

2. A seven-level wavelet packet decomposition produces a group of 254 sets of coefficients, each set corresponding to one wavelet packet node. For the coefficients of every wavelet packet node, the wavelet packet node energy $e_{j,n}$ is computed and serves as the extracted feature:

$$e_{j,n} = \sum_{k} w_{j,n,k}^{2} \qquad (18)$$
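Assuming a standard wavelet packet library is acceptable, Eqs. (16)–(18) and the Fisher-criterion ranking used next for feature reduction can be sketched with PyWavelets and NumPy as follows; the wavelet choice `'db4'`, the segment handling, and the exact form of the Fisher score are illustrative assumptions, not taken from the paper.

```python
import numpy as np
import pywt

def wavelet_packet_energies(segment, wavelet="db4", max_level=7):
    """Full WPT to level 7 and node energies e_{j,n} (Eq. (18))."""
    wp = pywt.WaveletPacket(data=segment, wavelet=wavelet, maxlevel=max_level)
    energies = []
    for level in range(1, max_level + 1):     # 2 + 4 + ... + 128 = 254 nodes
        for node in wp.get_level(level, order="natural"):
            energies.append(np.sum(np.asarray(node.data) ** 2))
    return np.array(energies)                 # 254 features per segment

def fisher_score(F, y):
    """One common form of the Fisher criterion, computed per feature:
    between-class scatter over within-class scatter."""
    classes = np.unique(y)
    mu = F.mean(axis=0)
    num = np.zeros(F.shape[1])
    den = np.zeros(F.shape[1])
    for c in classes:
        Fc = F[y == c]
        num += len(Fc) * (Fc.mean(axis=0) - mu) ** 2
        den += ((Fc - Fc.mean(axis=0)) ** 2).sum(axis=0)
    return num / np.maximum(den, 1e-12)

# Usage sketch: keep the 8 highest-scoring of the 254 energy features.
# F = np.vstack([wavelet_packet_energies(s) for s in segments])
# top8 = np.argsort(fisher_score(F, labels))[::-1][:8]
```

Ranking the 254 node energies and keeping the top 8 reproduces the dimensionality used for the classifier in the next paragraph.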
Since the data set is of high dimensionality, the Fisher criterion is used for feature reduction, to feed lower-dimensional input vectors to the classifier. Using the Fisher criterion [13], the number of features was reduced to 8. A ten-fold validation was used to test the system on data collected from four of the accelerometers (Table 10). The performance obtained by the proposed system on the 8-feature, 776-sample Westland data set further confirms the results obtained on the medical benchmark data sets. In addition, tests were conducted for data imputation with missing values ranging from 0% to 50%; a ten-fold validation was used, and the missing values were produced purely at random (Tables 11–14; Figs. 4–7).

6. Conclusion

In this paper, an enhanced supervised fuzzy clustering algorithm has been proposed that constitutes an autonomous and deterministic clustering method, with positive results, and that takes missing data into account. A comparative study of three well-known fuzzy clustering algorithms and the proposed enhanced supervised fuzzy clustering was carried out, and the results are positive. The comparison between supervised fuzzy clustering with random initialization of the clusters and enhanced supervised fuzzy clustering initialized by global k-means indicates that the proposed solution performs better. The performance of supervised fuzzy clustering with random initialization is sensitive to the initialization of the clusters, but this problem is solved by the proposed use of the global k-means algorithm, which is independent of the starting initialization because of its deterministic nature. In addition, another advantage of enhanced supervised fuzzy clustering initialized by global k-means is that it can automatically obtain the actual number of clusters needed, which is useful in many applications. In other words, with the application of global k-means, the enhanced supervised fuzzy clustering becomes autonomous and deterministic, which enhances its performance; this is confirmed by the results on the benchmark data and the tests on the Westland data set. In addition, comparative tests of fast global k-means and global k-means were conducted to observe the correct classification rate and
the computational time. The results obtained were favorable to fast global k-means, as it achieves close accuracy while greatly improving the computational time. Furthermore, the handling of incomplete data samples was discussed: instead of discarding incomplete data samples entirely, the modified optimal completion strategy is used as a preprocessing step that imputes the missing values by minimizing the clustering criterion J, and its accuracy was tested on two benchmark problems. The results obtained are encouraging; both the Iris and Pima data sets showed tolerable degradation. Finally, a case study was run on the Westland vibration data set reduced to 8 features, using sensors 1 to 4, and the results obtained are encouraging. In conclusion, this paper presents a model suitable for fault diagnosis that can be applied in real industry.

References

[1] A. Likas, N. Vlassis, J.J. Verbeek, The global k-means clustering algorithm, Pattern Recogn. 36 (2003) 451–461.
[2] K. Fukunaga, Introduction to Statistical Pattern Recognition, Academic, New York, 1972.
[3] B.D. Ripley, Pattern Recognition and Neural Networks, Cambridge University Press, Cambridge, UK, 1996.
[4] J.C. Bezdek, S.K. Pal, Fuzzy Models for Pattern Recognition: Methods That Search for Structures in Data, IEEE Press, Piscataway, NJ, 1992.
[5] D. Dumitrescu, B. Lazzerini, L.C. Jain, Fuzzy Sets and Their Application to Clustering and Training, CRC Press, 2000, pp. 205–207.
[6] J. Abonyi, F. Szeifert, Supervised fuzzy clustering for the identification of fuzzy classifiers, Pattern Recogn. Lett. 24 (14) (2003) 2195–2207.
[7] M.V. Wickerhauser, Adapted Wavelet Analysis from Theory to Software, A K Peters, Wellesley, MA, 1994.
[8] J.A. Lozano, J.M. Pena, P. Larranaga, An empirical comparison of four initialization methods for the k-means algorithm, Pattern Recogn. Lett. 20 (1999) 1027–1040.
[9] R.J. Hathaway, J.C. Bezdek, Fuzzy c-means clustering with incomplete data, IEEE Trans. Syst. Man Cybern. Part B 31 (2001) 19–28.
[10] I. Gath, A.B. Geva, Unsupervised optimal fuzzy clustering, IEEE Trans. Pattern Anal. Mach. Intell. 11 (7) (1989) 773–781.
[11] C.L. Blake, C.J. Merz, UCI Repository of Machine Learning Databases, University of California, Irvine, Department of Information and Computer Sciences, 1998.
[12] B.G. Cameron, Final Report on CH-46 Aft Transmission Seeded Fault Testing, Westland Helicopters, Ltd., UK, Res. Paper RP907, 1993.
[13] K. Cios, W. Pedrycz, R. Swiniarski, Data Mining Methods for Knowledge Discovery, Kluwer Academic Publishers, 1998.
[14] A.A. Afifi, R.M. Elashoff, Missing observations in multivariate statistics: review of the literature, J. Am. Stat. Assoc. 61 (1966) 595–604.
[15] H.O. Hartley, R.R. Hocking, The analysis of incomplete data, Biometrics 14 (1971) 174–194.
[16] R. Little, D. Rubin, Statistical Analysis with Missing Data, second ed., Wiley-Interscience, 2002.
[17] G.G. Yen, K.-C. Lin, Wavelet packet feature extraction for vibration monitoring, IEEE Trans. Ind. Electron. 47 (3) (2000).
[18] R.R. Coifman, M.V. Wickerhauser, Entropy-based algorithms for best basis selection, IEEE Trans. Inf. Theory 38 (1992) 713–718.
[19] J.C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms, Plenum Press, 1981.
[20] D.E. Gustafson, W.C. Kessel, Fuzzy clustering with a fuzzy covariance matrix, in: Proceedings of the IEEE CDC, San Diego, 1979, pp. 761–766.