Applied Soft Computing 13 (2013) 881–890
Fuzzy c-means improvement using relaxed constraints support vector machines
Mostafa Sabzekar ∗, Mahmoud Naghibzadeh
Department of Computer Engineering, Ferdowsi University of Mashhad, Iran
∗ Corresponding author. E-mail address: [email protected] (M. Sabzekar).
Article info

Article history: Received 30 December 2011; Received in revised form 18 July 2012; Accepted 17 September 2012; Available online 3 October 2012

Keywords: Fuzzy c-means; Support vector machines; Relaxed constraints support vector machines; Fuzzy clustering
Abstract

Fuzzy clustering is a widely applied method for extracting the underlying models within data. It has been applied successfully in many real-world applications. Fuzzy c-means is one of the most popular fuzzy clustering methods because it produces reasonable results and its implementation is straightforward. One problem with all fuzzy clustering algorithms such as fuzzy c-means is that some data points which are assigned to some clusters have low membership values. It is possible that many samples are assigned to a cluster with low confidence. In this paper, an efficient and noise-aware implementation of support vector machines, namely relaxed constraints support vector machines, is used to solve the mentioned problem and improve the performance of the fuzzy c-means algorithm. First, fuzzy c-means partitions the data into appropriate clusters. Then, the samples with high membership values in each cluster are selected for training a multi-class relaxed constraints support vector machine classifier. Finally, the class labels of the remaining data points are predicted by the latter classifier. The performance of the proposed clustering method is evaluated by quantitative measures such as cluster entropy and the Minkowski score. Experimental results on real-life data sets show the superiority of the proposed method. © 2012 Elsevier B.V. All rights reserved.
1. Introduction

Data mining is an attractive field for researchers due to its usefulness in many applications such as marketing analysis, business management, classification of documents, pattern discovery, and so on. The main focus of data mining research is on the development of new algorithms that outperform previous techniques in terms of speed or accuracy. Clustering (cluster analysis) is a classical problem in data mining and plays an important role in business and science. The goal of clustering is to reduce the amount of data by categorizing or grouping similar data items together. A clustering process can be divided into four steps: feature selection, application of a clustering algorithm, validation of the results, and interpretation of the results [1]. The clustering algorithm is the most critical of these steps. Therefore, many data clustering algorithms have been proposed in the literature. These algorithms can be categorized into partitioning [2,3], hierarchical [4,5], density-based [6,7], and grid-based [8,9] methods. Partitioning algorithms group data into k clusters, where k is a predefined parameter. Such an algorithm starts with an initial partition and iteratively optimizes the quality of the clustering results by moving data between different groups. A hierarchical clustering algorithm creates a hierarchy of clusters which may be represented in a tree structure called a dendrogram. The root of the
tree consists of a single cluster containing all observations, and the leaves correspond to individual observations. A density-based clustering algorithm is devised to discover arbitrarily shaped clusters. In this approach, a cluster is regarded as a region in which the density of data objects exceeds a threshold. Finally, a grid-based algorithm quantizes the space into a finite number of grid cells and performs further processing on this quantized space. One of the most popular and widely studied iterative clustering algorithms that minimizes the clustering error for points in Euclidean space is k-means [10] clustering. K-means iteratively performs partition and new cluster center generation steps until the process converges. An iterative process with extensive computations is usually required to generate a set of cluster representatives. The k-means algorithm has been shown to be effective in producing good clustering results for many practical applications. The main advantages of this algorithm are its simplicity and speed, which allow it to run on large data sets. Fuzzy c-means (FCM), which was proposed by Dunn [11] and improved by Bezdek [12], is one of the most well-known methodologies in clustering analysis. It is one of the best extensions of the k-means algorithm and allows one piece of data to belong to two or more clusters simultaneously, with different membership values. In the FCM algorithm, the input space is partitioned into K clusters {C1, C2, . . ., CK}. The main objective of fuzzy clustering algorithms such as FCM is to produce a K × n partition matrix U(X) of the given data set X = {x1, x2, . . ., xn}. The partition matrix may be represented as U = [ukj], k = 1, . . ., K and j = 1, . . ., n, where ukj is the membership
value of pattern xj to cluster Ck. Also, the partition matrix U satisfies the following conditions:

$$0 < \sum_{j=1}^{n} u_{kj} < n, \qquad \sum_{k=1}^{K} u_{kj} = 1, \qquad \sum_{k=1}^{K}\sum_{j=1}^{n} u_{kj} = n.$$
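As a quick illustration of these conditions, the following NumPy snippet checks them for a candidate K × n membership matrix U; it is an illustrative check only, not part of the original algorithm.

```python
import numpy as np

def is_valid_partition(U, tol=1e-9):
    """Check the fuzzy partition conditions on a K x n membership matrix U."""
    K, n = U.shape
    col_sums = U.sum(axis=0)   # each column (data point) must sum to 1 over the clusters
    row_sums = U.sum(axis=1)   # each cluster must be non-empty but must not own all the mass
    return (np.all(np.abs(col_sums - 1.0) < tol)
            and np.all((row_sums > 0) & (row_sums < n))
            and abs(U.sum() - n) < tol)

# Example: a valid partition of n = 2 points into K = 2 clusters.
print(is_valid_partition(np.array([[0.8, 0.3], [0.2, 0.7]])))  # True
```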
FCM tries to minimize the objective function:

$$J_m = \sum_{i=1}^{n}\sum_{j=1}^{K} u_{ji}^{m}\, D^{2}(x_i, z_j), \qquad (1)$$

where n represents the number of input data points, K is the number of clusters, u denotes the fuzzy membership matrix, and m (m > 1) is the fuzzy exponent that controls the degree of fuzziness. Also, xi is the ith data point, zj denotes the center of the jth cluster, and D is a distance function such as the Euclidean distance. The algorithm starts with random initial cluster centers and iteratively computes the fuzzy membership uki of each data point and the new cluster centers zk using the following equations:

$$u_{ki} = \left( \sum_{j=1}^{K} \left( \frac{D(z_k, x_i)}{D(z_j, x_i)} \right)^{2/(m-1)} \right)^{-1}, \quad 1 \le k \le K,\; 1 \le i \le n, \qquad (2)$$

$$z_k = \frac{\sum_{i=1}^{n} (u_{ki})^{m}\, x_i}{\sum_{i=1}^{n} (u_{ki})^{m}}, \quad 1 \le k \le K. \qquad (3)$$
The algorithm terminates when there is no movement of the cluster centers. Finally, each sample is assigned to the cluster to which it has the maximum membership value. Thus, the FCM algorithm is as follows:

FCM algorithm
Begin
  Initialize K (number of clusters)
  Initialize m (fuzziness parameter)
  Initialize zk (cluster centers)
  Repeat
    Compute uki with zk by Eq. (2)
    Update zk with uki by Eq. (3)
  Until (cluster centers stabilized)
End
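For concreteness, a minimal NumPy sketch of the iteration above is given below. It is an illustrative implementation of Eqs. (2) and (3) with squared Euclidean distances, not the authors' code; the convergence tolerance and random initialization are assumptions of the sketch.

```python
import numpy as np

def fcm(X, K, m=2.0, max_iter=100, tol=1e-5, seed=0):
    """Minimal fuzzy c-means: alternates Eqs. (2) and (3) until the centers stabilize."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    centers = X[rng.choice(n, K, replace=False)]            # random initial cluster centers
    for _ in range(max_iter):
        # Squared Euclidean distances D^2(x_i, z_k), shape (K, n); epsilon avoids division by zero
        d2 = ((X[None, :, :] - centers[:, None, :]) ** 2).sum(axis=2) + 1e-12
        # Membership update, Eq. (2): ratios of squared distances raised to 1/(m-1)
        ratio = (d2[:, None, :] / d2[None, :, :]) ** (1.0 / (m - 1.0))
        U = 1.0 / ratio.sum(axis=1)                          # shape (K, n); each column sums to 1
        # Center update, Eq. (3): weighted mean of the points with weights u_ki^m
        W = U ** m
        new_centers = (W @ X) / W.sum(axis=1, keepdims=True)
        if np.linalg.norm(new_centers - centers) < tol:      # cluster centers stabilized
            centers = new_centers
            break
        centers = new_centers
    return centers, U

# Usage: centers, U = fcm(X, K=3); U[k, i] is the membership of sample x_i in cluster k.
```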
In fuzzy clustering algorithms such as FCM, a sample that belongs to a specific cluster has a higher membership degree to that cluster than to any other cluster. If the highest membership value of a sample is close to one, it can be assigned to the corresponding cluster with high confidence, but for those samples that have a low membership value to their clusters, we cannot be sure about their real cluster. Therefore, many samples may be assigned to clusters with low confidence.

In this paper, an attempt has been made to solve this problem and improve the performance of the fuzzy c-means clustering algorithm by combining it with the recently proposed relaxed constraints support vector machine (RSVM) [13] classifier; the new method is called FCM–RSVM. As discussed in [13], RSVM assigns a fuzzy membership to each training sample. In the RSVM algorithm the constraints of the SVM are converted to fuzzy inequalities, and the result shows better efficiency in the classification of data in different domains. Another advantage of RSVM is its robustness against noisy data and outliers. To improve the performance of the fuzzy c-means algorithm, we first run the FCM algorithm and then treat the samples that have a high membership degree to their clusters as the most reliable ones; these samples are used for training an RSVM classifier. The class labels of the remaining samples are predicted using the multi-class RSVM classifier.

The rest of this paper is organized as follows. Section 2 surveys related works and describes the contribution, limitations, and different extensions of the fuzzy c-means algorithm. The SVM architecture and its extensions are discussed in Section 3. The structure of the developed algorithm is presented in Section 4. Section 5 reports the experimental results. Finally, concluding remarks are given in Section 6.

2. Literature review
In many real-world clustering problems, there are no sharp boundaries between different clusters. To handle this, fuzzy clustering is a good alternative. Another reason for using a fuzzy model for clustering data is that the computational time usually decreases. This is due to the fact that a non-fuzzy model often results in an exhaustive search in a huge space, whereas in a fuzzy model all the variables are continuous, so that derivatives can be computed to find the right direction for the search. In crisp clustering algorithms the input data are divided into clusters such that each data point is assigned to exactly one cluster. In fuzzy clustering, however, a data point may belong to several clusters with different degrees of membership. Therefore, the membership values of a data point represent the degree to which that point belongs to each particular cluster. The fuzzy c-means algorithm is one of the most efficient methods for solving fuzzy clustering problems. It is also straightforward and easy to implement. In spite of all its advantages, FCM has some drawbacks too, so there have been many efforts to overcome the shortcomings of this popular algorithm.

Although FCM solves the required optimization problem, it may become trapped in a local optimum. This may happen especially when the data set is very high dimensional. In such a situation, stochastic optimization methods such as simulated annealing, evolutionary algorithms, and swarm-based methods can be integrated into FCM to jump out of a local optimum. For example, in [14] a hybrid fuzzy clustering method based on FCM and fuzzy PSO (FPSO) is proposed which makes use of the merits of both algorithms. This method applies FCM to the particles in the swarm in every iteration/generation such that the fitness value of each particle is improved. Scheunders [15] proposed a hybrid approach combining a genetic algorithm (GA) with the classical fuzzy c-means clustering algorithm. The proposed technique is superior to FCM in the sense that it converges to a nearby global optimum rather than to a local one. In this technique a set of random cluster centers is generated as a random population for the GA. The inverse of the mean squared error (MSE) is used as the fitness function. The crossover and mutation operations are performed with probability 0.8 and 0.05, respectively. For problems with high dimensionality, a large population may have to be defined, and a large number of generations may be necessary before the system converges. There are also many works that combine FCM with other stochastic algorithms, such as ant colony optimization in [16], simulated annealing in [17], and so on.

Another problem with FCM is that it is not capable of specifying the exact number of clusters, and this may cause the algorithm to fall into a local optimum. There have been many efforts to find the optimum number of clusters for the FCM algorithm; stochastic global search methods can help the algorithm solve this problem. For example, Alata et al. [18] proposed a method that finds the optimum number of clusters and also the fuzziness parameter m in Eq. (1) using a genetic algorithm. In [19] an initialization method for the fuzzy c-means algorithm is proposed that uses an approximate number of clustering centers to initialize the number of classes, and approximate clustering centers to initialize the initial clustering centers. Another work is an iterated version of FCM (IFCM) [12]. In this
algorithm, the FCM algorithm is executed for different values of k, starting from 2 up to K*, where K* is a soft estimate of the upper bound on the number of clusters. Thus, IFCM is able to evolve the number of clusters automatically. For each value of k, FCM is run N times with different random initial configurations and the run giving the best Jm (Eq. (1)) value is taken. Then the value of the Xie–Beni (XB) validity index [20] for this best Jm is calculated. The XB index is defined as a function of the ratio of the total variation σ to the minimum separation sep of the clusters. The values of σ and sep can be written as:

$$\sigma = \sum_{k=1}^{K}\sum_{i=1}^{n} u_{ki}^{2}\, d^{2}(z_k, x_i), \qquad (4)$$

$$sep = \min_{k \neq j}\{ d^{2}(z_k, z_j) \}, \qquad (5)$$

where d(xi, xj) is a function that measures the distance between two samples xi and xj. Finally, the XB index is computed as follows:

$$XB = \frac{\sigma}{n \times sep} = \frac{\sum_{k=1}^{K}\sum_{i=1}^{n} u_{ki}^{2}\, d^{2}(z_k, x_i)}{n \times \min_{k \neq j}\{ d^{2}(z_k, z_j) \}}. \qquad (6)$$

A low value of σ together with a high value of sep indicates that the partitioning is compact and well separated; in this situation the XB index will have a low value.
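The index is easy to compute from the membership matrix and the cluster centers; the following NumPy sketch of Eqs. (4)–(6) is illustrative only (U is assumed to be the K × n membership matrix and centers a K × d array of cluster centers).

```python
import numpy as np

def xie_beni(X, centers, U):
    """Xie-Beni validity index, Eq. (6): sigma / (n * sep)."""
    n = X.shape[0]
    # sigma, Eq. (4): sum over clusters and points of u_ki^2 * d^2(z_k, x_i)
    d2 = ((X[None, :, :] - centers[:, None, :]) ** 2).sum(axis=2)
    sigma = np.sum((U ** 2) * d2)
    # sep, Eq. (5): minimum squared distance between distinct cluster centers
    c2 = ((centers[None, :, :] - centers[:, None, :]) ** 2).sum(axis=2)
    sep = np.min(c2[~np.eye(len(centers), dtype=bool)])
    return sigma / (n * sep)
```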
The objective is therefore to minimize the XB index to achieve a proper clustering. The process is repeated for different values of k, and the solution producing the minimum value of the XB index is chosen as the best clustering result. The corresponding k and partitioning matrix are taken as the solution.

The selection of an appropriate distance metric is also very critical in FCM clustering. When all clusters are well spread, the Euclidean distance can lead to better results [21]. The authors in [22] proposed the G–K algorithm, which uses the well-known Mahalanobis distance as the metric in FCM. They stated that the G–K algorithm is superior to Euclidean-distance-based algorithms when the shape of the data is considered. In [23] the authors proposed a new robust metric, distinguished from the Euclidean distance, to improve the robustness of FCM.

There are also some limitations in the fuzzy membership function uki of FCM. Since in the FCM algorithm each data point has partial membership in all the clusters, the cluster centers tend to move towards the center of all the data points [24]. Also, the membership of a data point in a cluster depends directly on its membership values for the other cluster centers, and this sometimes produces unrealistic results. Consequently, some researchers have proposed new membership functions for calculating the membership of data points in clusters. For example, in [25] an adaptive fuzzy c-means with a new membership function is presented. Here the membership function is given as follows:
$$\mu_j(x_i) = \frac{n \times (1/d_{ji})^{1/(m-1)}}{\sum_{k=1}^{K}\sum_{z=1}^{n} (1/d_{kz})^{1/(m-1)}}, \qquad (7)$$
where dji is the distance of data point xi from the center of cluster j. The adaptive fuzzy clustering algorithm is efficient in handling data with outlier points. Also, the authors in [26] and [27] have tried to improve the performance of FCM by applying some changes to the membership function. Integrating clustering algorithms and supervised learning methods is another way to improve the performance of clustering algorithms. For example, the authors in [36] proposed a modified differential evolution (DE)-based fuzzy c-medoids clustering of categorical data. An SVM classifier is trained with a fraction of data points selected from each cluster based on their proximity to the respective cluster medoids, and the remaining points are reassigned using the trained SVM classifier. In another paper, Josien and Liao
[37] have presented a new approach for group technology (GT) part family and machine cell formation using the integration of fuzzy c-means and fuzzy k-nearest neighbors. In [38] maximum weighted likelihood via Rival Penalized Expectation Maximization (RPEM) is used for density mixture clustering. RPEM makes the components in a density mixture compete with each other at each time step and is able to fade out the redundant densities from a density mixture during the learning process. The many works on improving the fuzzy c-means method attest that it is a popular, efficient, and powerful clustering algorithm. In the next section a brief review of support vector machines (SVMs) and their extension, relaxed constraints support vector machines (RSVMs), is given.

3. Support vector machines

Support vector machines (SVMs), as originally introduced by Vapnik within the area of statistical learning theory and structural risk minimization [28], have proven to work successfully on many applications of nonlinear classification and function estimation. The problems are formulated as convex optimization problems, usually quadratic programs (QP), for which the dual problem is solved. Within the models and the formulation one makes use of the kernel trick, which is based on Mercer's theorem. With this strategy, input points are easily mapped into a high-dimensional feature space, and the SVM then finds a separating hyperplane that maximizes the margin between the two classes in this space.

3.1. SVM structure

Suppose that we have a training sample set {(x1, y1), (x2, y2), . . ., (xn, yn)} and each sample belongs to either of two classes with a given label yi ∈ {−1, 1} for i = 1, . . ., n. When the training samples are linearly separable, the SVM separates the two classes with the maximum margin between them and without any misclassification error. The optimal separating hyperplane (OSH) can be obtained by solving the following QP problem:
$$\text{Minimize } Q(w, b, \xi) = \frac{1}{2}\|w\|^{2} + C\sum_{i=1}^{n} \xi_i \qquad (8)$$
$$\text{subject to } y_i(w^{T} x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \ldots, n,$$
where w is the weight vector of the hyperplane and b is the bias term. The parameter C is a regularization parameter that balances maximization of the margin against the misclassification error. In many practical situations, a separating hyperplane does not exist; to allow for the possibility of violating the constraints, the slack variables ξi ≥ 0 are introduced. For many problems, finding a linear classifier is impossible. In order to classify nonlinearly, a solution is to map the input space into a higher-dimensional feature space and search for the OSH in this feature space. Therefore, the mapping function ϕ(x) is introduced such that it satisfies Mercer's condition. To solve the QP problem, one needs to compute scalar products of the form ϕ(xi)·ϕ(xj). The problem is that we do not know the shape of ϕ(x). It is therefore convenient to introduce the kernel function K(xi, xj) = ϕ(xi)·ϕ(xj). By using the Lagrange multiplier method and the kernel trick, the QP problem for finding the SVM is defined as:

$$\text{Minimize } Q(\alpha) = \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j K(x_i, x_j) - \sum_{i=1}^{n} \alpha_i \qquad (9)$$
$$\text{subject to } \sum_{i=1}^{n} \alpha_i y_i = 0, \quad 0 \le \alpha_i \le C,$$
where α = (α1, α2, . . ., αn) is the vector of non-negative Lagrange multipliers that solves the QP problem (9). A point xi with αi > 0 is called a support vector (SV). The final classifier takes the form

$$f(x) = w^{T}\varphi(x) + b = \sum_{i=1}^{n} \alpha_i y_i K(x_i, x) + b, \qquad (10)$$

where S is the set of support vector indices and b is given by

$$b = y_j - \sum_{i \in S} \alpha_i y_i K(x_i, x_j), \qquad (11)$$

where xj is an unbounded support vector (0 < αj < C). The decision function is

$$D(x) = \mathrm{sign}(f(x)) = \mathrm{sign}\left( \sum_{i \in S} \alpha_i y_i K(x_i, x) + b \right). \qquad (12)$$

3.2. Fuzzy SVM

In the standard SVM, the influence of each input pattern on the trained classifier is the same. In other words, each data point is fully assigned to one of the two classes. However, in many applications, some input points, such as outliers, may not be exactly assignable to one of the two classes. To solve this problem Lin and Wang [29,30] proposed the fuzzy SVM (FSVM). FSVM applies a fuzzy membership to each input point so that different points have different contributions to the learning of the decision surface. Here, a fuzzy membership si is assigned to each training sample. The QP problem for finding the optimal hyperplane can be described as:

$$\text{Minimize } Q(w, b, \xi) = \frac{1}{2}\|w\|^{2} + C\sum_{i=1}^{n} s_i \xi_i \qquad (13)$$
$$\text{subject to } y_i(w^{T} x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, 2, \ldots, n.$$

The term si ξi is a measure of error with different weights. A smaller si for input sample xi reduces the effect of the corresponding ξi. It is very important for the FSVM classifier that a suitable model of the fuzzy membership function be determined, because choosing a proper fuzzy membership function can reduce the effect of noises and outliers. By setting different types of fuzzy membership, FSVM can be applied to solve different kinds of problems. This extends the application horizon of the SVM.

3.3. Relaxed constraints SVM (RSVM)

Taking another point of view, there are some problems with the SVM. Since the classifier obtained by SVM depends on only a small part of the samples (the support vectors), it can easily become sensitive to noises or outliers in the training set. Another problem is that the contributions of all training samples to training the classifier are identical. FSVM tries to somewhat solve these problems, but the performance of FSVM depends strongly on the choice of the fuzzy membership values si, due to the sensitivity of the cost function of the SVM formulation to any changes. RSVM [13] instead considers the fuzzy membership values in the constraints of the SVM formulation. As discussed in [13], RSVM is an efficient extension of the SVM algorithm that deals with these problems. It considers an importance degree for each training sample, and it has been shown to be robust against noisy data and outliers in data sets. As demonstrated in [13], RSVM is more powerful and efficient than competing methods such as fuzzy SVM (FSVM). The constraints of RSVM have more relaxation and flexibility because of their fuzzy inequalities. So, the problem of SVM (8) is reformulated as:

$$\text{Minimize } \frac{1}{2}\|w\|^{2} + C\sum_{i=1}^{n} \xi_i \qquad (14)$$
$$\text{subject to } y_i(w^{T} x_i + b) \gtrsim 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1, \ldots, n,$$

where ≳ denotes a fuzzy inequality. After converting the fuzzy inequality to a crisp one by defining a membership function for it, the RSVM classifier is obtained by solving the following optimization problem:

$$\text{Minimize } Q(w, b, \xi) = \frac{1}{2}\|w\|^{2} + C\sum_{i=1}^{n} \xi_i \qquad (15)$$
$$\text{subject to } y_i(w^{T}\varphi(x_i) + b) \ge 1 - \xi_i - d_i(1 - \alpha), \quad \xi_i \ge 0, \quad i = 1, \ldots, n,$$

where di is the importance degree of sample xi. A greater value of the parameter di for sample xi means that violation of the constraint for this sample is allowed to be higher and the effect of this sample on training the classifier is lower. In other words, xi is regarded as noisy data. If the same di is assigned to all constraints, the system can equally tolerate crossing over any sample. The parameter α is the level at which the membership degree of the fuzzy inequality of the constraints is cut. A larger value of α means our certainty in the whole set of data is higher, and vice versa. Note that, if we have high certainty in the training samples, we should not permit constraint violations. With these two parameters, we can handle the tolerance and uncertainty for classification in a data set (for more details see [13]). The dual problem of Eq. (15) is:

$$\text{Maximize } Q(\beta) = \sum_{i=1}^{n}\beta_i(1 - d_i + d_i\alpha) - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\beta_i\beta_j y_i y_j\, \varphi(x_i)\cdot\varphi(x_j)$$
$$= \sum_{i=1}^{n}\beta_i(1 - d_i + d_i\alpha) - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\beta_i\beta_j y_i y_j K(x_i, x_j) \qquad (16)$$

subject to the constraints

$$\sum_{i=1}^{n}\beta_i y_i = 0, \quad 0 \le \beta_i \le C, \quad i = 1, 2, \ldots, n,$$

where αi and βi are the nonnegative Lagrange multipliers. The decision function is given by

$$D(x) = \mathrm{sign}\left(\sum_{i\in S}\beta_i y_i K(x_i, x) + b\right), \qquad (17)$$

where b is given by

$$b = y_j - \sum_{i\in S}\beta_i y_i K(x_i, x_j), \qquad (18)$$

where S is the set of support vector indices and xj is an unbounded support vector (0 < βj < C).
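To make the dual of Eq. (16) concrete, the sketch below solves it as a quadratic program. It is only an illustrative solver, assuming the cvxopt package and a precomputed kernel matrix; setting di = 0 for all i recovers the standard SVM dual of Eq. (9).

```python
import numpy as np
from cvxopt import matrix, solvers  # assumes cvxopt is available

def rsvm_dual(K_mat, y, d, alpha_cut, C=1.0):
    """Solve Eq. (16): maximize sum_i beta_i*(1 - d_i + d_i*alpha) - 1/2 beta' Q beta
    subject to y' beta = 0 and 0 <= beta_i <= C, where Q_ij = y_i y_j K(x_i, x_j)."""
    n = len(y)
    Q = np.outer(y, y).astype(float) * K_mat
    q = -(1.0 - d + d * alpha_cut)              # cvxopt minimizes, so negate the linear term
    G = np.vstack([-np.eye(n), np.eye(n)])      # encodes 0 <= beta_i <= C
    h = np.hstack([np.zeros(n), C * np.ones(n)])
    A = y.reshape(1, -1).astype(float)
    sol = solvers.qp(matrix(Q), matrix(q), matrix(G), matrix(h), matrix(A), matrix(0.0))
    return np.ravel(sol['x'])

# In FCM-RSVM (Section 4), d_i = 1 - u_i; with d_i = 0 for all i this reduces to the SVM dual (9).
```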
3.4. Multi-class RSVM

The basic SVM is designed to separate only two classes from each other. However, in many real applications, a method that deals with several classes is required. A solution is to decompose a multi-class problem into several two-class classification problems. The solution to the multi-class classification problem can then be reconstructed from
the outputs of the two-class classifiers. The following two strategies are mainly adopted: “one-against-all” and “one-against-one”. The RSVM method can also be used for both binary and multi-class classification problems. In this paper, one-against-all RSVM is used for the classification stage of the proposed method. In one-against-all RSVM, m RSVMs are trained, where m is the number of classes; RSVMi separates class i from the remaining classes. A testing sample xt is then assigned to the class with the maximum decision function value. Fig. 1 shows the details of this method. Note that Di is the value of the ith decision function.

Fig. 1. Classification by one-against-all RSVM: each of the m two-class classifiers RSVMi produces a decision value Di(xt), and the test sample xt is assigned to the class arg max_i Di(xt).

In one-against-one RSVM, m(m − 1)/2 RSVMs are trained, where RSVMij is the optimal separating hyperplane (OSH) between class i and class j. Here, a testing sample xt is assigned to the class with the maximum decision value Di, given by the following equation:
$$D_i(x) = \sum_{j=1,\, j \neq i}^{m} \mathrm{sign}(D_{ij}(x)), \qquad (19)$$
where
$$\mathrm{sign}(x) = \begin{cases} 1 & \text{for } x \ge 0, \\ -1 & \text{for } x < 0, \end{cases}$$
and Dij denotes the decision function, with maximum margin, for class i against class j.
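A small sketch of the two decision schemes just described is given below (one-against-all via the maximum decision value, one-against-one via the vote of Eq. (19)). The decision values are assumed to come from already-trained classifiers, so only the combination step is shown; the argument names are illustrative.

```python
import numpy as np

def one_against_all_predict(decision_values):
    """decision_values: array of shape (m, n_test) holding D_i(x_t) for each of m classifiers.
    Each test sample is assigned to the class with the maximum decision value."""
    return np.argmax(decision_values, axis=0)

def one_against_one_predict(pair_decisions, m):
    """pair_decisions[(i, j)]: array of D_ij(x_t) values for the class-i-vs-class-j classifier.
    Implements the voting of Eq. (19): D_i(x) = sum over j != i of sign(D_ij(x))."""
    n_test = next(iter(pair_decisions.values())).shape[0]
    votes = np.zeros((m, n_test))
    for (i, j), dij in pair_decisions.items():
        s = np.where(dij >= 0, 1, -1)   # sign(x) as defined in the text
        votes[i] += s
        votes[j] -= s                   # D_ji(x) = -D_ij(x)
    return np.argmax(votes, axis=0)
```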
Since FCM is one of the best methods available for unsupervised clustering and SVM is a popular, efficient, and powerful supervised classifier, the combination of these two methods can produce the desired results. In the remainder of this paper, we introduce a new approach which improves the performance of FCM using the capabilities of the RSVM. This new method is referred to as FCM–RSVM. The details of the proposed method are discussed in the next sections.
4. The proposed method

One problem with all fuzzy clustering algorithms such as FCM is that some samples which are assigned to a specific cluster may have low membership values. Therefore, it is possible that many samples are assigned to a cluster with low confidence. In the proposed method, RSVM [13], an efficient and noise-aware extension of support vector machines, is used to solve this problem and improve the performance of FCM. First, the FCM clustering algorithm partitions the data into clusters. Then, the samples with high membership values in each cluster are selected for training a multi-class RSVM classifier; the cluster number of each training sample is considered as its class label. Finally, the class labels of the remaining data points are predicted by the RSVM classifier. The RSVM method is able to assign an importance degree to each training sample and, at the same time, reduce the effect of noisy data and outliers with higher precision compared to similar methods such as fuzzy SVM. In the training phase, we assign a lower importance degree di to samples which have a lower membership uij. In this paper, one-against-all RSVM with an RBF kernel function is used to improve the performance of FCM. The new method is named FCM–RSVM. The steps of the proposed method are summarized in Fig. 2 and are described in the following subsections.

Fig. 2. General overview of the proposed method: data clustering, preparation of the training samples, assignment of an importance degree to each sample, classification using RSVM, and assessment of the result.

4.1. Data clustering

At first, the FCM clustering algorithm must be run to partition the data into different categories. Note that different fuzzy clustering algorithms could be used in this stage, but in this paper we focus on FCM and, at the same time, try to improve it. The input to this step of the algorithm is a collection of unlabeled samples which is to be partitioned, and the output is the clustered data, the membership values, and a class label for each sample.

4.2. Clustering refinement

This stage of the algorithm consists of four steps that are performed iteratively: preparing the training set of data points, assignment of an importance degree to each data point, classification of the testing data points, and evaluation of the achieved partitioning result. At first, the data points of each cluster are sorted in descending order of their membership values uij and the top t% of data points in each cluster are taken as the training set. These samples have high membership values, which means they are assigned to their corresponding clusters with high confidence. Then, the training samples of the clusters are combined and the training set for the learning stage is formed. The main difficulty of this step is finding the threshold parameter t. The efficiency of the proposed algorithm strongly depends on choosing a proper value for this parameter. On the one hand, when t is small, the classifier is obtained from a small number of samples. In this situation, just those samples which have very high membership values participate in the learning process, but the multi-class RSVM classifier may not be well defined. On the other hand, when a larger value is selected for the parameter t, samples with low membership values will also be placed in the training set, which can cause inaccurate class boundaries and undesired results. To solve this problem, the algorithm is repeated for different values of t. In each iteration, a multi-class RSVM is trained and the class labels of the testing data points are determined. Then, the clustering result is evaluated for each size of the training set (t) and the best clustering result is taken as the outcome. The following procedure shows the operations of the “clustering refinement” stage.

Clustering refinement algorithm
1: Begin
2: For t = 20 to 80 (with step size 10)
3:   For each cluster, choose the top t% of samples which are sorted in descending order of their membership values uij as training data for this cluster.
4:   Training set ← the training samples of all clusters.
5:   Testing set ← the rest of the samples.
6:   di ← 1 − ui (assigning importance degrees to each training sample xi).
7:   Train one-against-all RSVM.
8:   Define the class labels of the testing samples using the trained RSVM.
9:   Clustering solutiont ← Merge training and testing set (samples with their class labels).
10:  Et ← Evaluate Clustering solutiont for threshold t.
11: End for
12: Final solution ← Clustering solutiont corresponding to the best Et.
13: End
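To summarize the loop in code, the sketch below follows the steps above. Scikit-learn's SVC is used purely as a hypothetical stand-in for the multi-class RSVM (weighting samples by 1 − di), and the evaluate callback plays the role of Et (for example, cluster entropy; see Section 5.2). This is an illustrative sketch, not the authors' implementation.

```python
import numpy as np
from sklearn.svm import SVC  # stand-in for the multi-class RSVM (see text)

def refine_clustering(X, labels, U, evaluate, t_values=range(20, 81, 10)):
    """Clustering refinement: labels/U come from FCM; 'evaluate' scores a candidate
    clustering solution (lower is better), as E_t does in the listing above."""
    u_max = U.max(axis=0)                     # membership of each point in its own cluster
    best_score, best_labels = np.inf, labels.copy()
    for t in t_values:
        train_idx = []
        for k in np.unique(labels):
            members = np.where(labels == k)[0]
            order = members[np.argsort(-u_max[members])]          # sort by membership, descending
            train_idx.extend(order[: max(1, int(len(members) * t / 100.0))])
        train_idx = np.array(train_idx)
        test_idx = np.setdiff1d(np.arange(len(labels)), train_idx)
        d = 1.0 - u_max[train_idx]            # importance degree d_i = 1 - u_i (step 6)
        # Stand-in classifier: weight samples by (1 - d_i) so low-membership points count less.
        clf = SVC(kernel='rbf', decision_function_shape='ovr')
        clf.fit(X[train_idx], labels[train_idx], sample_weight=1.0 - d)
        candidate = labels.copy()
        if len(test_idx):
            candidate[test_idx] = clf.predict(X[test_idx])        # step 8
        score = evaluate(candidate)                               # step 10
        if score < best_score:
            best_score, best_labels = score, candidate
    return best_labels, best_score
```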
Before training the multi-class RSVM classifier, we assign a fuzzy membership di to each training sample xi (line 6 of the clustering refinement algorithm).
It is regarded as the importance degree of each training data point. Note that samples with higher membership values uij must have a greater effect on the training of the classifier. As described in [13], assigning a higher value of di allows training sample xi to violate its constraint more, and thus the effect of this sample on training the classifier is decreased; in other words, xi is regarded as noisy data. Therefore, samples with higher membership values uij should have lower importance degrees di. If ui (0 ≤ ui ≤ 1) denotes the membership value of data point xi to its assigned cluster, then 1 − ui is a good estimate of the importance degree di.

After training the one-against-all RSVM classifier, the class labels of the remaining data points are predicted. We therefore have two sets of data points: the training set, whose class labels are given by FCM, and the testing set, whose class labels are defined by the trained classifier. Combining these two sets, we form a clustering solution for the given threshold t. The same scenario is repeated for different values of t, producing different clustering results. As mentioned in line 10 of the clustering refinement algorithm, the next step is the selection of the clustering result given by a particular t. There are many methods to measure the quality of a clustering solution; in this paper, cluster entropy [31] (described later) is used for this purpose. Cluster entropy is a measure of the quality of a clustering solution, given the true clustering. The value of t for which the best cluster entropy is obtained is considered the optimum threshold, and the corresponding clustering result is returned as the final solution. We name this new method “FCM–RSVM”. The traditional SVM or fuzzy SVM [29,30] can also be used as the classifier in the clustering refinement stage of the proposed method; in this situation, we call the proposed method “FCM–SVM” or “FCM–FSVM”, respectively. Furthermore, the proposed scheme can be applied to different fuzzy clustering algorithms to improve their performance.

For further illustration, consider the synthetic data set with three clusters in Fig. 3. In fuzzy clustering algorithms, as mentioned before, a sample that belongs to a specific cluster has a higher membership degree to that cluster than to any other cluster. The samples which are assigned to a cluster with a high membership degree uij (near the cluster center) are more reliable and can describe their clusters. As shown in Fig. 4(a), we use these samples for training the RSVM multi-class classifier. Then, we classify the remaining data points to identify their appropriate clusters. With this scheme the performance of fuzzy clustering approaches is improved and they become more reliable against noisy samples and outliers. The experimental results in the next section substantiate this claim.

In the following section a comparative study on the effectiveness of the proposed clustering algorithm is presented. Experiments on selected UCI [32] data sets demonstrate the improvement in performance of fuzzy c-means clustering using the proposed method.
5. Experimental results

In this section, we investigate the performance of our proposed algorithm for clustering real data sets. The proposed method is applied to the FCM and IFCM [12] algorithms. As mentioned in previous sections, IFCM is an improved version of the fuzzy c-means algorithm that is able to evolve the number of clusters automatically. The newly proposed FCM–RSVM and IFCM–RSVM are then evaluated using two performance metrics (described later). Note that FCM–RSVM can be used when the number of clusters is predefined, and IFCM–RSVM is used when the number of clusters for the data sets is not predefined and we need to determine it automatically. All of our experiments have been implemented using MATLAB R2008a [33] running on a computer with an Intel processor (Core 2 Duo, 2.50 GHz) and 4 GB of memory.
5.1. Distance measure

Clustering is the process of partitioning a set of objects into different subsets such that the data in each subset are similar to each other. The similarity between various samples is defined by a distance measure. Therefore, choosing the right distance measure for a given data set is a non-trivial problem and plays an important role in obtaining correct clusters. Many distance measures have been proposed in the literature, such as the Euclidean distance, Manhattan distance, Mahalanobis distance, Hamming distance, etc., for different purposes [34]. In our experiments, the Euclidean distance is used to measure the distance between two data points.
Fig. 3. A synthetic data set with three clusters.
Given two data points Pi and Pj with d dimensions, the Euclidean distance E(Pi, Pj) between them is defined as follows:

$$E(P_i, P_j) = \sqrt{\sum_{x=1}^{d} (P_{ix} - P_{jx})^{2}}, \qquad (20)$$

where Pix and Pjx represent the xth dimension values of Pi and Pj, respectively.

5.2. Performance metrics

For evaluation of the performance of the clustering algorithms, two quantitative measures, cluster entropy [31] and the Minkowski score (MS) [31], are used. Cluster entropy and MS measure the quality of a clustering solution, given the true clustering (where the true clustering is known). Suppose a data set containing S classes is clustered into K clusters. Let nk be the number of data points in the kth cluster and nsk be the number of data points from the sth class in the kth cluster. The entropy of the kth cluster is given by:

$$Ent_k = -\sum_{s=1}^{S} \frac{n_{sk}}{n_k} \log\left(\frac{n_{sk}}{n_k}\right). \qquad (21)$$
Fig. 4. Data points used for training the one-against-all RSVM classifier (a), and data points assigned to a cluster with low membership degrees (b), with t = 60%.

The total entropy for the set of K clusters is defined as:

$$Entropy = \sum_{k=1}^{K} \frac{n_k}{n}\, Ent_k, \qquad (22)$$

where n is the number of data points in the data set. Cluster entropy is a measure of the class purity of the clusters. The optimum value is zero, and a lower value indicates a better clustering. MS measures the consistency between a clustering solution and a given true clustering. To define MS, let T be the true solution and C be the solution that must be measured. Let n11 denote the number of pairs of data points that are in the same cluster in both C and T. Also, let n01 represent the number of pairs that are in the same cluster only in C, and n10 the number of pairs that are in the same cluster only in T. The measure MS is then defined as follows:

$$MS(T, C) = \frac{n_{01} + n_{10}}{n_{11} + n_{10}}. \qquad (23)$$

The optimum value for MS is zero, and a lower value indicates a better clustering. In Section 5.3, some real-life data sets are discussed qualitatively. In all experiments, the class labels of data points are deleted and are only used for evaluation purposes.
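A compact NumPy sketch of the two measures, following Eqs. (21)–(23), is given below; it is illustrative only (the natural logarithm is assumed for the entropy, and the pair counts are obtained directly from all unordered pairs).

```python
import numpy as np

def cluster_entropy(true_labels, cluster_labels):
    """Total cluster entropy, Eqs. (21)-(22); lower is better (0 is optimal)."""
    n = len(true_labels)
    total = 0.0
    for k in np.unique(cluster_labels):
        members = true_labels[cluster_labels == k]
        nk = len(members)
        p = np.array([np.mean(members == s) for s in np.unique(members)])  # n_sk / n_k
        total += (nk / n) * (-np.sum(p * np.log(p)))
    return total

def minkowski_score(true_labels, cluster_labels):
    """MS of Eq. (23) computed from the pair counts n11, n01, n10; lower is better."""
    same_T = true_labels[:, None] == true_labels[None, :]
    same_C = cluster_labels[:, None] == cluster_labels[None, :]
    iu = np.triu_indices(len(true_labels), k=1)   # unordered pairs i < j
    t, c = same_T[iu], same_C[iu]
    n11 = np.sum(t & c)
    n01 = np.sum(~t & c)   # together only in C
    n10 = np.sum(t & ~c)   # together only in T
    return (n01 + n10) / (n11 + n10)
```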
5.3. Data sets

To investigate the performance of our proposed algorithm, we conducted experiments on six data sets available from the University of California at Irvine (UCI) repository [32]. All these data sets represent classification problems. Therefore, the class labels of the samples in all data sets are deleted manually to make them appropriate for clustering; the class labels are used only to obtain the "true clustering" and to evaluate a clustering solution. The data sets are described below:
1. The Glass Types Data. This data set contains 214 instances in 6 classes (70 instances of building windows that are float processed, 76 instances of building windows that are non-float processed, 17 instances of vehicle windows that are float processed, 13 instances of containers, 9 instances of tableware, and 29 instances of headlamps). Each instance is described using nine numeric attributes: RI, NA, MG, AL, SI, K, CA, BA, and FE.
2. The Iris Plants Data. This is a multivariate data set introduced by Fisher [35]. The data set consists of 50 samples from each of three species of Iris flowers (Iris setosa, Iris virginica and Iris versicolor). Four features were measured from each sample: the length and the width of the sepal and the petal, in centimeters. Based on the combination of the four features, Fisher developed a linear discriminant model to distinguish the species from each other.
3. The Haberman Data. It contains cases from a study that was conducted between 1958 and 1970 at the University of Chicago's Billings Hospital on the survival of patients who had undergone surgery for breast cancer. In the Haberman data set there are two classes: patients who survived for at least 5 years and patients who died before that. It includes 306 samples with three attributes: age of the patient at the time of operation, the patient's year of operation, and the number of positive axillary nodes detected.
4. The Wine Data. These data are the results of a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. The analysis determined the quantities of 13 constituents found in each of the three types of wines.
5. The Zoo Data. It has 101 data points (animals) in 7 different clusters (mammals, birds, reptiles, fish, amphibians, insects and invertebrates). These samples are described with 16 attributes,
15 of which are Boolean (hair, feathers, eggs, milk, airborne, etc.), and the other one is numerical (number of legs).
6. The Splice Data. This data set contains 3190 sequences of DNA, each with 60 nucleotides. These samples are organized in 3 classes: exon/intron boundaries (EI sites), intron/exon boundaries (IE sites), and neither IE nor EI.

All of the above-mentioned real-life data sets are originally used for classification tasks. In our experiments, the class labels of all samples are deleted so that the data can be treated as clustering problems; the class labels are used only for evaluation of the methods.

5.4. Results for data sets

In this subsection, we demonstrate the effectiveness of the proposed FCM–RSVM and IFCM–RSVM algorithms, which improve the standard FCM and IFCM methods, respectively, using the data sets introduced in the previous subsection. Note that a radial basis function (RBF) kernel with σ = 1 is used for training the classifiers in the experiments.

Table 1. Comparison of different algorithms in terms of Entropy and MS when the number of clusters for the data sets is known.

(a)
Algorithm   | Glass: K, Entropy, MS | Iris: K, Entropy, MS | Haberman: K, Entropy, MS
FCM         | 6, 0.4207, 0.8626     | 3, 0.1216, 0.3722    | 2, 0.2338, 0.5112
G–K         | 6, 0.4034, 0.8018     | 3, 0.1216, 0.3722    | 2, 0.2338, 0.5112
FCM–FPSO    | 6, 0.4012, 0.7633     | 3, 0.1212, 0.3722    | 2, 0.3128, 0.5376
FCM–KNN     | 6, 0.3627, 0.6978     | 3, 0.1154, 0.3141    | 2, 0.2215, 0.5081
FCM–SVM     | 6, 0.4012, 0.7633     | 3, 0.1154, 0.3141    | 2, 0.2338, 0.5112
FCM–FSVM    | 6, 0.3876, 0.7126     | 3, 0.1017, 0.2906    | 2, 0.2012, 0.4934
FCM–RSVM    | 6, 0.2235, 0.6656     | 3, 0.1017, 0.2906    | 2, 0.1880, 0.3765

(b)
Algorithm   | Zoo: K, Entropy, MS   | Splice: K, Entropy, MS | Wine: K, Entropy, MS
FCM         | 7, 1.3214, 1.8344     | 3, 1.0851, 1.2356      | 3, 0.2753, 0.4328
G–K         | 7, 1.2591, 1.7020     | 3, 0.9536, 1.1781      | 3, 0.2753, 0.4328
FCM–FPSO    | 7, 1.2830, 1.7722     | 3, 0.9536, 1.1781      | 3, 0.3156, 0.4876
FCM–KNN     | 7, 1.1885, 1.5615     | 3, 0.9630, 1.1903      | 3, 0.2867, 0.4545
FCM–SVM     | 7, 1.1045, 1.4678     | 3, 0.9267, 1.1364      | 3, 0.3156, 0.4876
FCM–FSVM    | 7, 1.1045, 1.4678     | 3, 0.9267, 1.1364      | 3, 0.2753, 0.4328
FCM–RSVM    | 7, 0.9568, 1.1733     | 3, 0.9267, 1.1364      | 3, 0.2753, 0.4328

Table 1 reports the average Entropy and MS values (described in Section 5.2) for the different algorithms over 10 independent runs on each data set. Column 1 of each sub-table lists the various clustering algorithms used in the experiment (i.e., fuzzy c-means, G–K [22], FCM–FPSO [14], FCM–KNN (fuzzy c-means integrated with k-nearest neighbors), and the proposed FCM–SVM, FCM–FSVM, and FCM–RSVM). The remaining columns contain the cluster number, K, and both the Entropy and MS results for each data set when pattern clustering with a prescribed cluster number, K, is run. The prescribed cluster number is the class number of each data set (i.e., 6 for Glass, 3 for Iris, 2 for Haberman, 7 for Zoo, 3 for Splice and 3 for Wine). As shown in Table 1(a) and (b), the proposed FCM–RSVM outperforms the other approaches on the different data sets. Also, the clustering algorithms which are combined with a supervised learning method (such as KNN or SVM) provide better results.
Table 2. Comparison of different algorithms in terms of Entropy and MS when the number of clusters for the data sets is unknown.

(a)
Algorithm     | Glass: K, Entropy, MS | Iris: K, Entropy, MS | Haberman: K, Entropy, MS
IFCM          | 3, 0.4021, 0.8485     | 2, 0.1216, 0.3722    | 2, 0.2143, 0.4985
GA-based FCM  | 5, 0.4015, 0.8210     | 2, 0.1356, 0.4122    | 2, 0.2375, 0.5320
ACC FCM       | 4, 0.4207, 0.8874     | 2, 0.1266, 0.3870    | 3, 0.1915, 0.4136
IFCM–SVM      | 3, 0.3520, 0.6850     | 2, 0.1044, 0.3084    | 2, 0.2082, 0.5014
IFCM–FSVM     | 3, 0.3520, 0.6850     | 2, 0.0975, 0.2745    | 2, 0.2045, 0.4978
IFCM–RSVM     | 3, 0.2015, 0.6042     | 2, 0.0960, 0.2715    | 2, 0.1915, 0.4136

(b)
Algorithm     | Zoo: K, Entropy, MS   | Splice: K, Entropy, MS | Wine: K, Entropy, MS
IFCM          | 5, 1.2580, 1.6973     | 3, 0.9860, 1.2780      | 2, 0.2433, 0.4145
GA-based FCM  | 8, 0.9347, 1.1541     | 2, 0.9433, 1.1556      | 4, 0.3356, 0.5215
ACC FCM       | 7, 1.2781, 1.7635     | 3, 1.0262, 1.1893      | 3, 0.2753, 0.4328
IFCM–SVM      | 5, 0.9260, 1.1360     | 3, 0.9267, 1.1364      | 2, 0.2433, 0.4145
IFCM–FSVM     | 6, 0.9584, 1.1804     | 3, 0.9267, 1.1364      | 2, 0.2135, 0.3789
IFCM–RSVM     | 6, 0.89632, 1.0784    | 3, 0.9267, 1.1364      | 2, 0.1875, 0.3356
When the number of clusters is not predefined, we can use algorithms which determine it automatically and cluster the data simultaneously. Table 2 summarizes the Entropy and MS values for the various algorithms; all of these algorithms are improved versions of the FCM method. We assume that the number of clusters, K, for the data sets is unknown and is determined by the clustering method in each experiment. Column 1 of each sub-table lists the various clustering algorithms used in the experiment (i.e., IFCM [12], GA-based FCM [18], the approximate clustering centers approach (ACC FCM) [19], and the proposed IFCM–SVM, IFCM–FSVM, and IFCM–RSVM). The remaining columns contain the cluster number, K, and both the Entropy and MS results for each data set (Glass, Iris, Haberman, Zoo, Splice and Wine) when pattern clustering with an arbitrary cluster number, K, is run. As shown in Table 2, the proposed IFCM–RSVM provides the best results for all data sets.

SVM is a powerful, state-of-the-art algorithm with strong theoretical foundations based on the Vapnik–Chervonenkis theory. SVM has strong regularization properties; regularization refers to the generalization of the model to new data. SVM performs well on data sets that have many attributes, even if there are very few cases on which to train the model. RSVM is an extension of SVM that considers an importance degree for each training sample and shows promising results in different applications, especially when there are noisy data in the training samples. On the other hand, FCM is a very successful clustering method whose subtle variations are involved in various clustering-related applications. Combining FCM and RSVM therefore yields an interesting clustering method.

6. Summary and conclusion

Clustering (cluster analysis) is a classical problem in data mining and plays an important role in business and science. The goal of clustering is to reduce the amount of data by categorizing or grouping similar data items together. In many real-world clustering problems, there are no sharp boundaries between different clusters; to handle this, fuzzy clustering is a good choice. In fuzzy clustering algorithms such as FCM, a sample which is assigned to a cluster has a higher membership degree to that cluster than to any other cluster. If the highest membership of a sample is close to one, it can be assigned to the corresponding cluster with high confidence, but for those samples that have low membership to their clusters, we cannot be sure about their real cluster. Therefore, many samples may be assigned to a cluster with low confidence. In this paper, we solved this problem by combining FCM with the recently proposed relaxed constraints support vector machine, which is an efficient and noise-aware implementation of support vector machines. First, the FCM clustering algorithm partitions the data into clusters. Then, the samples with high membership values in each cluster are selected for training a multi-class RSVM classifier. Finally, the class labels of the remaining data points are predicted by the RSVM classifier. Experimental results with real-life data sets confirmed the superiority of the proposed methods. The method can easily be extended to other fuzzy clustering methods.

Acknowledgement

This research is supported in part by the research chancellor, Ferdowsi University of Mashhad, Mashhad, Iran, under contract no. 1882 and project code 17334/2. This support is gratefully acknowledged.

References

[1] U.M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, R. Uthurusamy, Advances in Knowledge Discovery and Data Mining, AAAI Press, 1996.
[2] L. Kaufman, P.J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis, John Wiley & Sons, 1990. [3] R.C. Dubes, How many clusters are best? An experiment, Pattern Recognition 20 (6) (1987) 645–663. [4] B. King, Step-wise clustering procedures, Journal of American Statistical Association 69 (1967) 86–101. [5] P.H.A. Sneath, R.R. Sokal, Numerical Taxonomy, Freeman, London, 1973. [6] M. Ester, H.-P. Kriegel, J. Sander, X. Xu, A density-based algorithm for discovering clusters in large spatial databases with noise, in: Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, 1996, pp. 226–231. [7] A. Hinneburg, D.A. Keim, An efficient approach to clustering in large multimedia databases with noise, in: Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining, 1998, pp. 58–65. [8] R. Agrawal, J. Gehrke, D. Gunopulos, P. Raghavan, Automatic subspace clustering of high dimensional data for data mining applications, in: Proceedings of the ACM SIGMOD Conference on Management of Data, 1998, pp. 94–105. [9] W. Wang, J. Yang, R. Muntz, STING: a statistical information grid approach to spatial data mining, in: Proceedings of the 23rd Very Large Data Bases Conference, 1997, pp. 186–195. [10] J. MacQueen, Some methods for classification and analysis of multivariate observations, in: 5th Berkeley Symposium on Mathematics and Statistical Probability, vol. 1, 1967, pp. 281–297. [11] J.C. Dunn, Some recent investigations of a new fuzzy partition algorithm and its application to pattern classification problems, Journal of Cybernetics 4 (1974) 1–15. [12] J.C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms, Plenum Press, New York, 1981. [13] M. Sabzekar, H.S. Yazdi, M. Naghibzadeh, Relaxed constraints support vector machines for noisy data, Neural Computing and Application 20 (2011) 671–685. [14] H. Izakian, A. Abraham, Fuzzy c-means and fuzzy swarm for fuzzy clustering problem, Expert Systems with Applications (2010), http://dx.doi.org/10.1016/j.eswa.2010.07.112. [15] P. Scheunders, A genetic c-means clustering algorithm applied to color image quantization, Pattern Recognition 30 (6) (1997) 859–866. [16] G. Wang, et al., Studies on fuzzy c-means based on ant colony algorithm, in: Proceedings of the International Conference on Measuring Technology and Mechatronics Automation, 2010. [17] N. Sharma, et al., Segmentation of medical images using simulated annealing based fuzzy c means algorithm, International Journal of Biomedical Engineering and Technology 2 (3) (2009) 260–278. [18] M. Alata, M. Molhim, A. Ramini, Optimizing of fuzzy c-means clustering algorithm using GA, World Academy of Science, Engineering and Technology 39 (2008) 224–229. [19] K. Zou, Z. Wang, M. Hu, A new initialization method for fuzzy cmeans algorithm, Fuzzy Optimization and Decision Making 7 (2008) 409–416. [20] X.L. Xie, G. Beni, A validity measure for fuzzy clustering, IEEE Transactions on Pattern Analysis and Machine Intelligence 13 (1991) 841–847. [21] S.Y. Zhao, Calculus and Clustering, China Renming University Press, 1987. [22] R. Krishnapuram, J. Kim, A note on the Gustafson-Kessel and adaptive fuzzy clustering algorithms, IEEE Transaction on Fuzzy System 7 (1999) 453–461. [23] K.L. Wu, M.S. Yang, Alternative c-means clustering algorithms, Pattern Recognition 35 (2002) 2267–2278. [24] E. Cox, Fuzzy Modeling and Genetic Algorithms For Data Mining and Exploration, Elsevier Inc., 2005. [25] L. Jiang, W. 
Yang, A modified fuzzy c-means algorithm for segmentation of magnetic resonance images, in: Proceedings of the VII Digital Image Computing: Techniques and Applications, 2003, pp. 225–231, 10–12. [26] L. Jiang, W. Yang, A modified fuzzy c-means algorithm for segmentation of magnetic resonance images, in: Proceedings of the VII Digital Image Computing: Techniques and Applications, 2003, pp. 225–231. [27] F. Klawonn, A. Keller, Fuzzy clustering based on modified distance measures, Lecture Notes in Computer Science; 1642 (1999) 291–302. [28] V. Vapnik, The Nature of Statistical Learning Theory, Springer-Verlag, New York, 1995. [29] C.F. Lin, S.D. Wang, Fuzzy support vector machines, IEEE Transaction on Neural Networks 13 (2002). [30] C.F. Lin, S.D. Wang, Training algorithms for fuzzy support vector machines with noisy data, Pattern Recognition Letters 25 (2004) 1647–1656. [31] J. Han, M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2001. [32] P.M. Murph, D.W. Aha, UCI Repository of Machine Learning Databases, Department of Information and Computer Science, University of California, Irvine, 1987. [33] http://www.mathworks.com [34] E. Deza, M.M. Deza, Encyclopedia of Distances, Springer-Verlag, Berlin/Heidelberg, 2009. [35] R.A. Fisher, The use of multiple measurements in taxonomic problems, Annals of Eugenics 7 (1936) 179–188. [36] U. Maulik, S. Bandyopadhyay, I. Saha, Integrating clustering and supervised learning for categorical data analysis, IEEE Transactions on Systems,
Man, and Cybernetics, Part A: Systems and Humans 40 (4) (2010) 664–675. [37] K. Josien, T.W. Liao, Integrated use of fuzzy c-means and fuzzy KNN for GT part family and machine cell formation, International Journal of Production Research 38 (15) (2000) 3513–3536.
[38] Y. Cheung, Maximum weighted likelihood via rival penalized EM for density mixture clustering with automatic model selection, IEEE Transactions on Knowledge and Data Engineering 17 (6) (2005) 750–761.