Fuzzy Sets and Systems 160 (2009) 1886–1901
PFHC: A clustering algorithm based on data partitioning for unevenly distributed datasets
Yihong Dong a,*, Shaoka Cao a, Ken Chen b, Maoshun He a, Xiaoying Tai a
a Institute of Computer Science and Technology, Ningbo University, Ningbo 315211, PR China
b Institute of Circuit and Systems, Ningbo University, Ningbo 315211, PR China
Received 13 July 2007; received in revised form 25 August 2008; accepted 17 November 2008 Available online 27 November 2008
Abstract
Many researchers have recently devoted effort to clustering as a primary data mining method for knowledge discovery, but few have focused on unevenly distributed datasets. In our previous research we proposed an efficient hierarchical algorithm based on fuzzy graph connectedness, FHC, to discover clusters of arbitrary shape. In this paper we present a novel clustering algorithm for uneven datasets, PFHC, which is an extended version of FHC. In PFHC, the dataset is first divided into several local regions according to the distribution of data density, so that the density within each region is nearly uniform. To this end, local parameters ε and λ are used in each region to obtain a local clustering result with FHC. The boundaries between regions are then taken into consideration for combination, and the local clusters are finally merged to obtain the global clusters. As an extension of FHC, PFHC handles uneven datasets more effectively and efficiently and, as the experiments show, generates better-quality clusters than other methods. Furthermore, PFHC is also able to process incremental data.
© 2008 Elsevier B.V. All rights reserved.
Keywords: Data mining; Fuzzy clustering; Unevenly distributed dataset; Data partitioning
1. Introduction
Clustering is an important task in data mining and knowledge discovery, which groups objects into meaningful subclasses. Many clustering algorithms have been developed [1–6,8–11,13,18–20] to handle the classification of large databases. However, these methods assume a nearly uniform density, or require the dataset to be held in memory. When the density of a dataset is not uniform, or the dataset is too large to fit in memory, the quality of clustering is compromised. Fig. 1 shows the three words "how are you" with various densities: two of them, "how" and "you", have the same density, while the word "are" is sparser. The well-known DBSCAN algorithm [6] is applied to the dataset in Fig. 1. Figs. 2 and 3, in which points of the same color belong to the same class, show the results of DBSCAN with different parameters. In Fig. 2, the word "are" is not identified correctly when the parameter EPS is set to 5 and MinPts to 4, where EPS is the radius of the neighborhood of a point and MinPts is the minimum number
Fig. 1. Dataset of three words “how are you”.
Fig. 2. Clusters identified with smaller parameter values by DBSCAN or FHC.
Fig. 3. Clusters identified with larger parameter values by DBSCAN or FHC.
of points in the EPS-neighborhood of that point. Because EPS is small, the lower-density word "are" cannot be identified as one class. If EPS is adjusted to 16 while MinPts remains 4, the word "are" is recognized as one class, as shown in Fig. 3. However, with the increase of EPS, the other two words "how" and "you" are identified as a single class because they are close to each other and of high density. To obtain a better clustering result, different parameters therefore have to be set for different densities. FHC [5], proposed in our earlier research, runs more efficiently than DBSCAN. However, as a clustering algorithm for evenly distributed datasets, FHC cannot identify the classes in uneven datasets exactly, just as DBSCAN cannot. For the dataset shown in Fig. 1, the word "are" is not identified as one class when the radius ε is set to 10 and λ to 0.01, as Fig. 2 shows. In this paper, we present an effective clustering method based on data partitioning to deal with unevenly scattered datasets. Our algorithm is based on FHC [5], which is an efficient clustering algorithm for mining in a data warehousing environment.
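To make the parameter sensitivity concrete, the following is a minimal sketch (not from the paper; it uses scikit-learn's DBSCAN with made-up cluster positions, sizes, and eps values) that reproduces the effect described above: a single global radius cannot serve both the dense and the sparse clusters.

    # Illustrative only: one dense/sparse/dense arrangement, two global eps values.
    import numpy as np
    from sklearn.cluster import DBSCAN

    rng = np.random.default_rng(0)
    dense_a = rng.normal(loc=(0, 0), scale=1.0, size=(300, 2))    # tight cluster ("how")
    sparse  = rng.normal(loc=(15, 0), scale=5.0, size=(100, 2))   # thin cluster ("are")
    dense_b = rng.normal(loc=(30, 0), scale=1.0, size=(300, 2))   # tight cluster ("you")
    data = np.vstack([dense_a, sparse, dense_b])

    for eps in (1.0, 6.0):                        # small vs. large neighborhood radius
        labels = DBSCAN(eps=eps, min_samples=4).fit_predict(data)
        n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
        n_noise = int(np.sum(labels == -1))
        print(f"eps={eps}: {n_clusters} clusters, {n_noise} noise points")
    # Small eps: the sparse cluster dissolves into noise.
    # Large eps: the sparse cluster is found, but nearby dense clusters may merge.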
The remainder of this paper is organized as follows. Related work is discussed in Section 2. In Section 3, we briefly introduce the clustering algorithm FHC. The algorithm based on data partitioning—PFHC—is presented in Section 4, and an extensive performance evaluation is reported in Section 5. Section 6 concludes with a summary.
2. Related work
Many clustering algorithms have been proposed to deal with various data types and high-dimensional data. Clustering analysis methods can be divided into hard and fuzzy clustering. Hard clustering produces a hard partition in which each object is assigned to one and only one cluster. For example, in the K-medoids algorithm [11], each cluster is represented by one of the objects located near the center of the cluster; BIRCH [20] uses a CF-tree to hold the data; CURE [9] chooses a well-formed group of points to determine the distance between clusters; and DBSCAN [6] is a density-based method that grows clusters according to a density threshold. In hard clustering methods, each point belongs to exactly one cluster. Fuzzy clustering generates a fuzzy partition of the data by examining the attributes of the objects. Since the well-known fuzzy clustering approach FCM was proposed by Bezdek [3], a great number of fuzzy clustering algorithms have been put forward. A generalized weighted fuzzy C-means (GWFCM) clustering algorithm [12] associates a linguistic constraint with each object and a weight based on the average distance between cluster centroids and objects. The concept of penalizing solutions [15] is introduced as a means of weighting the instance points based on local variations of the expected complete log-likelihoods. The paper by de Carvalho et al. [7] presents partitional fuzzy clustering methods using adaptive quadratic distances, which change at each iteration and can either be the same for all clusters or differ from one cluster to another. In [14], a new level-based approach to the fuzzy clustering problem for spatial data is proposed, in which each point of the initial set is handled as a fuzzy point of the multidimensional space. Pedrycz et al. [16] studied a proximity-based fuzzy clustering method concerned with the gradient-driven minimization of the differences between the provided proximity values and those computed from the partition matrix produced by the standard FCM algorithm. Bezdek et al. [17] abandon the objective function model in favor of a generalized model called alternating cluster estimation (ACE); they treat every clustering model as an instance of ACE and propose an algorithm with a dynamically changing prototype function as well as a computationally efficient algorithm with hyperconic membership functions. In our earlier work, we proposed an efficient and effective hierarchical clustering algorithm based on fuzzy graph connectedness (FHC) [5] and its incremental version IFHC, which apply fuzzy set theory to hierarchical clustering so as to discover clusters of arbitrary shape. FHC first partitions the dataset into several sub-clusters using a partitioning method, and then constructs a fuzzy graph of sub-clusters by analyzing the fuzzy-connectedness degree among them. By computing the cut graph, the connected components of the fuzzy graph can be obtained, hence yielding the desired clustering. The algorithm can be applied to high-dimensional datasets and finds clusters of arbitrary shape, such as spherical, linear, elongated or concave ones. FHC and IFHC can handle not only data with numerical attributes but categorical attributes as well. Nevertheless, although many clustering algorithms have been proposed, few of them focus on ill-distributed datasets.
In this paper we present the PFHC algorithm, an improved version of FHC based on data partitioning and aimed at uneven datasets. After the dataset is divided into several local domains according to the data density distribution, local ε and λ are selected in each local space. In each space, FHC is used to obtain local clusters. Finally, the local clusters are combined to obtain the global clusters.
3. The algorithm FHC
The key idea of the FHC algorithm is to use the fuzzy-connectedness degree among sub-clusters to obtain the desired clustering. We first give a short introduction to FHC, including the definitions required for evenly distributed datasets. See [5] for a detailed presentation of FHC.

Definition 1 (λ cut graph). Let G̃ = ⟨N, R⟩ be a fuzzy graph, where N is the set of nodes, R is a fuzzy relation that is reflexive and symmetric, and μ_R is the membership function of R. If E_λ = {e = v_i v_j | μ_R(v_i, v_j) ≥ λ, v_i, v_j ∈ N}, then G_λ = ⟨N, E_λ⟩ is called the λ cut graph of G̃.
Fig. 4. Directly λ-fuzzy-connective and λ-fuzzy-connective.
Definition 2 (Neighborhood of a point). The neighborhood of a point p is denoted by Neig(p), called the neighborhood of p. For data with numeric attributes it is defined by Neig(p) = {q ∈ D | dist(p, q) ≤ ε}; for data with categorical attributes it is defined by Neig(p) = {q ∈ D | sim(p, q) ≥ ε}. For datasets with numeric attributes, a distance function dist(p, q) such as the Euclidean, Manhattan, Chebyshev, or Minkowski distance is adopted in this work; for datasets with categorical attributes, the similarity function sim(p, q) defined by the Jaccard coefficient is adopted, where sim(p, q) = |p ∩ q| / |p ∪ q|.

Definition 3 (Connectedness). A point x is a connectedness between the neighborhoods of p and q if and only if x lies both in the neighborhood of p and in that of q. The notation cont(p, q) is used to denote this relationship: cont(p, q) = {x | x ∈ Neig(p), x ∈ Neig(q)}.

Definition 4 (Fuzzy-connective-degree). The fuzzy-connective-degree is the connective intensity between the neighborhoods of p and q:

\mu(p, q) = \frac{|cont(p, q)|}{|Neig(p)| + |Neig(q)| - |cont(p, q)|}, \qquad 0 \le \mu(p, q) \le 1.
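As an illustration of Definitions 2–4, the following is a minimal sketch (not the authors' code; the name eps stands for the neighborhood radius ε) that computes the neighborhoods of two points and their fuzzy-connective-degree.

    import numpy as np

    def neighborhood(p, data, eps):
        """Indices of points within distance eps of p (Definition 2, numeric case)."""
        dists = np.linalg.norm(data - p, axis=1)
        return set(np.nonzero(dists <= eps)[0])

    def fuzzy_connective_degree(p, q, data, eps):
        """Definition 4: shared points of the two neighborhoods over their union."""
        neig_p = neighborhood(p, data, eps)
        neig_q = neighborhood(q, data, eps)
        cont = neig_p & neig_q                      # Definition 3: points in both neighborhoods
        return len(cont) / (len(neig_p) + len(neig_q) - len(cont))

    # toy usage
    data = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [10.0, 0.0]])
    mu = fuzzy_connective_degree(data[0], data[1], data, eps=1.5)
    print(round(mu, 3))   # heavily overlapping neighborhoods give values near 1, disjoint ones give 0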
Definition 5 (Directly λ-fuzzy-connective). Suppose that D is a set of objects, p ∈ D, q ∈ D. Neighborhood p is directly λ-fuzzy-connective from neighborhood q if and only if μ(p, q) ≥ λ, denoted as p ↔_D q.

Definition 6 (λ-fuzzy-connective). Neighborhood p is λ-fuzzy-connective from neighborhood q, denoted as p ⋯_D q, if there exists a chain p_1, p_2, …, p_n with p_1 = q, p_n = p, where p_i ∈ D (1 ≤ i ≤ n) and p_{i+1} ↔_D p_i, D being a set of objects; the relation is transitive.

From Fig. 4 we can see that neighborhoods p_1 and p_2 are directly λ-fuzzy-connective, and so are neighborhoods p_2 and p_3, while neighborhoods p_1 and p_5 are λ-fuzzy-connective but not directly λ-fuzzy-connective. A cluster is defined as a maximal set of λ-fuzzy-connective objects, and the noise is the set of objects not contained in any cluster.

Definition 7 (Cluster). Let D be a set of objects. A cluster C with ε and λ in D is a non-empty subset of D containing at least MinPts objects and satisfying the following conditions:

Maximality: ∀ p, q ∈ D: if p ∈ C and p ⋯_D q (with respect to λ), then also q ∈ C.
Fuzzy connectivity: ∀ p, q ∈ C: p and q are λ-fuzzy-connective in D.
Definition 8 (Noise). Let C_1, C_2, …, C_k be the clusters with ε and λ. Then noise is the set of objects in D not belonging to any cluster C_i: Noise = {p ∈ D | ∀i: p ∉ C_i}.

The clustering process of FHC is as follows: the objects in the dataset are first grouped into sub-clusters of similar size by a partitioning method; a fuzzy graph is then constructed after the fuzzy-connective-degree among the sub-clusters has been analyzed. The connected components obtained from the λ cut graph of the fuzzy graph are the final clustering result. The algorithm FHC is sketched as follows:

Algorithm 1. Hierarchical clustering algorithm based on fuzzy graph connectedness (FHC).
Input: Dataset U = {u_1, u_2, …, u_n}, radius of sub-cluster ε, threshold λ.
Output: Connected components of the λ cut graph.
Algorithm:
Initial cluster is u_1
for each point u_i ∈ U {
    find the sub-cluster center O_j nearest to u_i by computing
        d(u_i, O_j) = min_{1 ≤ k ≤ cluster_num} d(u_i, O_k)
    if (dist(u_i, O_j) < ε) {
        // u_i belongs to sub-cluster C_j
        C_j = C_j ∪ {u_i}
        // adjust the center of C_j
        O_j = (n_j O_j + u_i) / (n_j + 1), n_j = n_j + 1
    } else {
        // create a new sub-cluster to hold u_i
        cluster_num = cluster_num + 1, C_cluster_num = {u_i}, O_cluster_num = u_i
    }
}
for each point u_i ∈ U {
    for each sub-cluster C_j ∈ C
        if (u_i ∈ neighborhood of O_j) mark sub-cluster C_j
    increment the connectedness of the marked sub-clusters
}
// construct the fuzzy graph of sub-clusters and compute the fuzzy-connective-degree among nodes
for (p = 1; p <= cluster_num - 1; p++)
    for (q = p + 1; q <= cluster_num; q++)
        compute the fuzzy-connective-degree
            μ(p, q) = |cont(O_p, O_q)| / (|Neig(O_p)| + |Neig(O_q)| − |cont(O_p, O_q)|)
Get_λ_Graph(G)
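The following is a minimal sketch (not the authors' implementation) of the final phase of Algorithm 1: given the fuzzy-connective-degrees between sub-cluster centers, it keeps the edges whose degree is at least λ (the λ cut graph of Definition 1) and returns the connected components, which are the clusters.

    from collections import defaultdict

    def lambda_cut_components(centers, degree, lam):
        """centers: list of sub-cluster ids; degree(i, j): fuzzy-connective-degree."""
        # adjacency of the lambda-cut graph
        adj = defaultdict(set)
        for i in range(len(centers)):
            for j in range(i + 1, len(centers)):
                if degree(centers[i], centers[j]) >= lam:
                    adj[centers[i]].add(centers[j])
                    adj[centers[j]].add(centers[i])
        # depth-first search for connected components
        seen, components = set(), []
        for c in centers:
            if c in seen:
                continue
            stack, comp = [c], []
            while stack:
                v = stack.pop()
                if v in seen:
                    continue
                seen.add(v)
                comp.append(v)
                stack.extend(adj[v] - seen)
            components.append(comp)
        return components

    # toy usage with three sub-cluster ids and a hard-coded degree table
    table = {(0, 1): 0.3, (0, 2): 0.0, (1, 2): 0.02}
    deg = lambda a, b: table.get((a, b), table.get((b, a), 0.0))
    print(lambda_cut_components([0, 1, 2], deg, lam=0.01))   # -> [[0, 1, 2]]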
4. PFHC: clustering algorithm based on data partitioning and FHC
FHC is highly efficient at discovering clusters of arbitrary shape, and it handles data with numeric as well as categorical attributes. Two parameters, ε and λ, are introduced in FHC: ε determines the radius of the sub-clusters, and λ is the threshold on the fuzzy-connectedness degree used to obtain the cut graph. Both are global parameters and cannot be changed during the clustering process. If the data distribution is not well proportioned, as Fig. 1 shows, this leads to an undesirable clustering result. For example, when ε is small, the dense classes are easy to identify, while the clusters of lower density may be treated as noise, as shown in Fig. 2. On the contrary, when ε is set to a large value, the classes in the lower-density area are identified correctly, but more than one cluster in the dense area may be recognized as a single cluster, or noise may be treated as valid objects, as seen in Fig. 3.
For the reasons mentioned above, a new method based on FHC and data partitioning, PFHC, is proposed for uneven datasets. PFHC works as follows: the dataset is divided into several local regions based on the distribution of data density, so that the data within each region are approximately uniformly distributed. To this end, local ε and λ are used in each local area. In every region, FHC is used to obtain local clusters. Finally, the local clusters are combined to obtain the global clusters. In PFHC we face two problems: how to partition the dataset, and how to handle the boundaries when combining local clusters into global clusters.

4.1. Data partitioning

On the basis of the distribution characteristics of the data in one or several dimensions, the whole dataset can be divided into regions in which the data distribution is as uniform as possible. In each local region, local parameters ε and λ are chosen to fit the data distribution. After clustering each local region with the FHC algorithm and identifying the boundaries, the final clustering result is obtained by combining all local classes.

Let A = {A_1, A_2, …, A_n} be a set of ordered domains and R^n = A_1 × A_2 × ⋯ × A_n an n-dimensional data space, where A_1, A_2, …, A_n are regarded as the dimensions/attributes of R^n. Suppose the sample set consists of n-dimensional objects D = {X_1, X_2, …, X_N}, where X_i = (x_{i1}, x_{i2}, …, x_{in}). The task of clustering is to divide the dataset D into k non-overlapping subsets D_1, D_2, …, D_k such that objects in the same class are more similar than objects in different classes. The error square function, denoted J_e, is used as the measure of clustering quality:

J_e = \sum_{i=1}^{k} \sum_{X \in D_i} \|X - m_i\|^2,

where m_i is the mean point representing all points in D_i, m_i = (1/num_i) \sum_{X \in D_i} X, and num_i is the number of elements of subset D_i.

Let the data space R^n be divided into k sub-spaces after data partitioning, namely P_1, P_2, …, P_k. The dataset D is correspondingly segmented into k parts D_1, D_2, …, D_k. Suppose the data in subset D_i of sub-space P_i consist of N_i clusters C_{i,1}, C_{i,2}, …, C_{i,N_i}. The measure function of sub-space P_i is J_{e_i} = \sum_{j=1}^{N_i} \sum_{X \in C_{i,j}} \|X - m_{i,j}\|^2, where m_{i,j} is the mean point of cluster C_{i,j}, so the total measure function of R^n is

J_e = \sum_{i=1}^{k} \sum_{j=1}^{N_i} \sum_{X \in C_{i,j}} \|X - m_{i,j}\|^2.    (1)
In sub-space P_i, the clusters are of two kinds: full classes and cut-off classes. A cut-off class usually lies at the boundary of the sub-spaces. Let the number of full classes be N_i^{full} and the number of cut-off classes be N_i^{cut-off}; then N_i = N_i^{full} + N_i^{cut-off}. So in each sub-space P_i

J_{e_i} = \sum_{j=1}^{N_i} \sum_{X \in C_{i,j}} \|X - m_{i,j}\|^2
        = \sum_{j=1}^{N_i^{full}} \sum_{X \in C_{i,j}} \|X - m_{i,j}\|^2
        + \sum_{j=1}^{N_i^{cut\text{-}off}} \sum_{X \in C_{i,j}} \|X - m_{i,j}\|^2.    (2)
The error square function of the full classes in P_i is \sum_{j=1}^{N_i^{full}} \sum_{X \in C_{i,j}} \|X - m_{i,j}\|^2, while that of the cut-off ones is \sum_{j=1}^{N_i^{cut\text{-}off}} \sum_{X \in C_{i,j}} \|X - m_{i,j}\|^2. The total error square function is

J_e = \sum_{i=1}^{k} \sum_{j=1}^{N_i} \sum_{X \in C_{i,j}} \|X - m_{i,j}\|^2
    = \sum_{i=1}^{k} \sum_{j=1}^{N_i^{full}} \sum_{X \in C_{i,j}} \|X - m_{i,j}\|^2
    + \sum_{i=1}^{k} \sum_{j=1}^{N_i^{cut\text{-}off}} \sum_{X \in C_{i,j}} \|X - m_{i,j}\|^2.    (3)
What remains is to deal with the cut-off classes at the boundaries, i.e. the last term in Eq. (3). Is it reasonable to form the global clusters from local clusters? The local clusters are the optimized result in each local region. Because objects of the same cluster may be assigned to different sub-spaces, the combination is the sum of local optimizations, as Eq. (3) shows. When clustering an ill-distributed dataset, the task is therefore to identify the data at the boundaries well in order to improve the clustering result.
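As a small illustration of the measure used above, the following sketch (a hypothetical helper, not from the paper) computes the error square function J_e of Eq. (1) for a given family of clusters.

    import numpy as np

    def error_square(clusters):
        """clusters: iterable of (n_points x n_dims) arrays, one per cluster C_{i,j}."""
        je = 0.0
        for points in clusters:
            mean = points.mean(axis=0)                      # m_{i,j}
            je += float(np.sum((points - mean) ** 2))       # sum of ||X - m_{i,j}||^2
        return je

    # toy usage: two clusters in one sub-space
    c1 = np.array([[0.0, 0.0], [0.0, 2.0]])
    c2 = np.array([[5.0, 5.0], [7.0, 5.0], [6.0, 6.0]])
    print(error_square([c1, c2]))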
Fig. 5. Data distribution in x axis of dataset.
Fig. 6. Data distribution in y axis of dataset.
For the FHC algorithm, ε and λ are the basic thresholds: ε represents the radius of the sub-clusters and λ the cut level between sub-clusters. Different data sub-spaces have different lower-limit thresholds, and the lowest one corresponds to the global lower limit. Because different lower-limit thresholds are adopted, the clustering result is more precise than that obtained with fixed global thresholds only. The steps of PFHC are as follows: (1) data partitioning in one or more dimensions; (2) clustering with the FHC algorithm using local ε and λ in each sub-space; (3) combining the local clusters along the relevant boundaries. The method of data partitioning we use is based on the statistical characteristics of the data distribution. After analysing the data distribution in each dimension, we decide which dimensions to choose for partitioning. In our experiments, the data distribution curve of each dimension is used as the statistical tool. For the dataset shown in Fig. 1, Figs. 5 and 6 show the data distribution in the x and y dimensions. The partitioning positions can be selected in two ways: automatic selection of the minima of the data distribution curve (sketched below), or interactive user-computer selection. We use the latter method to select point A on the x axis and point B on the y axis as cutting points. Fig. 7 shows the data partitioning with position A from Fig. 5 and position B from Fig. 6.
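The following is a minimal sketch of the automatic variant of partition-point selection; it is only illustrative and assumes a simple histogram as the data distribution curve, with a made-up bin count.

    import numpy as np

    def candidate_cut_points(values, bins=50):
        """Return candidate cut coordinates at valleys of the histogram of one dimension."""
        counts, edges = np.histogram(values, bins=bins)
        centers = 0.5 * (edges[:-1] + edges[1:])
        cuts = []
        for i in range(1, len(counts) - 1):
            # a valley: not higher than the left neighbour and strictly lower than the right
            if counts[i] <= counts[i - 1] and counts[i] < counts[i + 1]:
                cuts.append(float(centers[i]))
        return cuts

    # toy usage: a dense blob and a sparse blob along x leave a valley between them
    rng = np.random.default_rng(1)
    x = np.concatenate([rng.normal(0, 1, 500), rng.normal(10, 3, 150)])
    print(candidate_cut_points(x))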
Fig. 7. Data partition.
Fig. 8. Local clustering in each region.
Fig. 8 shows the result of local clustering in each region with ε = 20, λ = 0.01 in regions P1, P3, and P4, and ε = 100, λ = 0.01 in region P2. Regions P2, P3, and P4 each contain one class, and there are two classes in P1.

4.2. Boundary processing

After data partitioning, the parameters ε and λ are used independently in every sub-space by the FHC algorithm to obtain local clusters. Some local clusters should be merged because, during partitioning, they may have been split across two adjacent sub-spaces, as illustrated in Fig. 8.

Definition 9 (Region boundary). Suppose R^n is an n-dimensional space and the kth dimension is sliced into m_k parts. The coordinates of the cutting points in the kth dimension are x^k_1, …, x^k_i, …, x^k_{m_k}. The boundary coordinates of a sub-space P_i are adjacent to those of its neighbour in a certain dimension. The region boundary of the hypercube P_i is delimited by the vertices (x^1_{i_1}, x^2_{i_2}, …, x^t_{i_t}, …, x^n_{i_n}), (x^1_{i_1+1}, x^2_{i_2}, …, x^t_{i_t}, …, x^n_{i_n}), (x^1_{i_1}, x^2_{i_2+1}, …, x^t_{i_t}, …, x^n_{i_n}), …, (x^1_{i_1}, x^2_{i_2}, …, x^t_{i_t}, …, x^n_{i_n+1}), …, (x^1_{i_1+1}, x^2_{i_2+1}, …, x^t_{i_t+1}, …, x^n_{i_n+1}), where x^t_{i_t} is the cutting point in the tth dimension, i_t is the i_t-th division in the tth dimension, and 1 ≤ t ≤ n.

For an n-dimensional hypercube P_i, the number of vertices is 2^n; for example, a sub-space has 4 vertices in two dimensions and 8 in three dimensions. For a two-dimensional space, the boundary coordinates of the ith sub-space P_i can be written as (x_i^Left, y_i^Lower), (x_i^Left, y_i^Upper), (x_i^Right, y_i^Lower), (x_i^Right, y_i^Upper), where x_i^Left, x_i^Right, y_i^Lower, y_i^Upper are the boundary coordinates of the ith sub-space. For a three-dimensional space, the sub-space P_i is enclosed by (x^1_{i_1}, x^2_{i_2}, x^3_{i_3}), (x^1_{i_1+1}, x^2_{i_2}, x^3_{i_3}), (x^1_{i_1}, x^2_{i_2+1}, x^3_{i_3}), (x^1_{i_1}, x^2_{i_2}, x^3_{i_3+1}), (x^1_{i_1+1}, x^2_{i_2+1}, x^3_{i_3}), (x^1_{i_1+1}, x^2_{i_2}, x^3_{i_3+1}), (x^1_{i_1}, x^2_{i_2+1}, x^3_{i_3+1}), (x^1_{i_1+1}, x^2_{i_2+1}, x^3_{i_3+1}).
Fig. 9. Global clusters obtained by combining local clusters.
Definition 10 (Sub-cluster of boundary). Suppose m_{i,j} is the mean point of an arbitrary cluster C_{i,j}, m_{i,j} = (x^1_{i,j}, x^2_{i,j}, …, x^t_{i,j}, …, x^n_{i,j}), where x^t_{i,j} is its coordinate in the tth dimension. Suppose (x^1_{i_1}, x^2_{i_2}, …, x^t_{i_t}, …, x^n_{i_n}) is an arbitrary boundary coordinate of sub-space P_i, where 1 ≤ t ≤ n. The cluster is called a sub-cluster of boundary, denoted C_b(i, j), if it satisfies |x^k_{i_k} − x^k_{i,j}| ≤ ε_i in at least one dimension k, where ε_i is the radius parameter of the ith sub-space. For a two-dimensional space, the condition is one of the following four formulas:

(1) |x_i^Left − x_{i,j}| ≤ ε_i,
(2) |x_i^Right − x_{i,j}| ≤ ε_i,
(3) |y_i^Lower − y_{i,j}| ≤ ε_i,
(4) |y_i^Upper − y_{i,j}| ≤ ε_i,

where (x_{i,j}, y_{i,j}) is the center of the jth sub-cluster in sub-space P_i.

Let P_i and P_j be two sub-spaces of R^n. P_i and P_j are adjacent regions iff they share a common hyperplane of n − 1 dimensions. Because the local ε and λ differ between regions, the parameters must be unified before adjacent sub-clusters of the adjacent regions P_i and P_j can be combined. Given parameters ε_i and λ_i in P_i, and ε_j and λ_j in P_j, we set ε_{ij} = min(ε_i, ε_j) and λ_{ij} = min(λ_i, λ_j). The following cases have to be distinguished in the merging process:

(1) Two or more sub-clusters merging into one cluster. Suppose P_i and P_j are adjacent regions and C_b(i, k) ⊂ P_i and C_b(j, l) ⊂ P_j are sub-clusters of boundary. The two sub-clusters are merged into one iff |m_{i,k} − m_{j,l}| ≤ 2ε_{ij} and μ(C_b(i, k), C_b(j, l)) ≥ λ_{ij}, where m_{i,k} and m_{j,l} are the mean points of the two sub-clusters. The new centroid of the merged cluster is O_new = (n_{i,k} m_{i,k} + n_{j,l} m_{j,l}) / (n_{i,k} + n_{j,l}), where n_{i,k} and n_{j,l} are the numbers of points in C_b(i, k) and C_b(j, l).

(2) Noise absorbed into a sub-cluster. A cluster near a region boundary may be split into two parts during partitioning, one consisting of noise and the other forming a cluster. Such noise should be returned to the sub-cluster to which it originally belongs. Suppose p_n ∈ P_i and C_b(j, k) ⊂ P_j, where P_i and P_j are adjacent regions and p_n is noise. The noise point p_n is absorbed into the neighbouring boundary sub-cluster C_b(j, k) of the adjacent region iff p_n ∈ ∪_{x ∈ C_b(j,k)} Neig(x). The new centroid of the cluster is O_new = (n_{j,k} m_{j,k} + p_n) / (n_{j,k} + 1), where m_{j,k} and n_{j,k} are the mean point and the number of points of C_b(j, k), respectively.

(3) A new cluster formed from noise in adjacent partitions. As in case (2), a small cluster near a boundary may be split into two parts that are both treated as noise because each part contains too few objects. Such noise points need to be merged into one or more clusters. Suppose p_{n_1}, p_{n_2}, …, p_{n_k} ∈ P_i and p_{m_1}, p_{m_2}, …, p_{m_l} ∈ P_j; the FHC algorithm [5] is re-run on these points with the new parameters ε_{ij} = min(ε_i, ε_j) and λ_{ij} = min(λ_i, λ_j).

As an example, Fig. 9 shows the combination strategy adopted in the merging process: the word "are", which consists of two lower-density sub-clusters in regions P1 and P2, is combined, and so is the word "how", consisting of two sub-clusters in regions P1 and P3.
4.3. PFHC algorithm

The pseudo-code of the PFHC algorithm is listed as follows:

PFHC(database db) {
    sample the database;                       // sampling
    analyse the characteristics of the data;   // analysing the data characteristics of each dimension
    choose partitioning points;                // choosing partitioning points of each dimension
    partition the database;                    // data partitioning
    for each partition Pi
        FHC(ε_i, λ_i);                         // clustering each local area by FHC
    find the boundary sub-clusters;
    for each partition Pi
        for each adjacent partition Pj {
            if |m_{i,k} − m_{j,l}| ≤ 2ε_{ij} and μ(C_b(i,k), C_b(j,l)) ≥ λ_{ij} {
                merge_cluster(k, l);
                new center: O_new = (n_{i,k} m_{i,k} + n_{j,l} m_{j,l}) / (n_{i,k} + n_{j,l})
            }
        }                                      // merging the sub-clusters
    for each noise p_n ∈ Pi
        for each x ∈ C_b(j,k), where C_b(j,k) is a boundary sub-cluster of an adjacent partition Pj {
            if dist(p_n, x) ≤ ε_{ij} {
                put_noise_to_cluster(C_b(j,k), p_n);
                O_new = (n_{j,k} m_{j,k} + p_n) / (n_{j,k} + 1);
            }
        }                                      // absorbing the noise
    for each set of remaining noise points p_{n_1}, …, p_{n_k} ∈ Pi and p_{m_1}, …, p_{m_l} ∈ Pj in adjacent partitions Pi and Pj
        FHC(ε_{ij}, λ_{ij});                   // clustering by FHC: generating new clusters from noise
}

4.4. Estimating the parameter ε

As the cut value of the fuzzy-connective-degree between sub-clusters, λ will be similar in every sub-space in spite of the uneven distribution of the dataset; in our experiments λ is set to 0.01. The parameter ε, however, differs between sub-spaces on account of the diverse point densities in the different local partitions. How can ε be estimated in each sub-space? We use the effective heuristic proposed in [6]. Similar to [6], for a given k a function k-dist maps each point of the database D to the distance to its kth nearest neighbor. The points of the database are then sorted in descending order of their k-dist values to produce the sorted k-dist graph, and the initial ε is estimated from this graph. Because the densities differ between sub-spaces, their sorted k-dist graphs differ, so the initial ε of each sub-space will be dissimilar. Starting from these initial values, ε and λ are then adjusted gradually, by experience, to obtain better results.

5. Performance evaluation

In this section we discuss how the PFHC algorithm performs on some illustrative synthetic data with non-uniform distribution in order to study its performance and scalability. Firstly, we compare the performance of PFHC with FHC using three datasets. Secondly, we study how PFHC deals with incremental data under data partitioning. Finally,
Fig. 10. Two unevenly distributed datasets.
Fig. 11. Data partitioning.
PDBSCAN, a version of DBSCAN modified to use data partitioning, is compared with PFHC on the same datasets to evaluate their performance. All experiments were conducted on a notebook PC with a dual-core 1.73 GHz CPU and 512 MB of RAM, in the environment of VC 6.0.

5.1. Experiments on PFHC

In order to observe the results intuitively, two two-dimensional datasets are used in the experiments.

Example 5.1. The first dataset is made up of three circular and two elliptic shapes of different sizes, where one ellipse has a lower density than the other four classes, as shown in Fig. 10(a). In this example the cluster shapes are convex. After data partitioning with the user-computer interactive method, the lower-density ellipse is divided into two regions, as illustrated in Fig. 11(a). Fig. 12(a) shows the local clustering result with ε = 14.14, λ = 0.01 in regions P1, P3, and P4, and ε = 40, λ = 0.01 in region P2. As demonstrated, there are two high-density clusters in region P1, one lower-density cluster in P2, one high-density cluster in P3, and two clusters of different densities in P4. After the boundaries are identified and the boundary sub-clusters combined, Fig. 13(a) shows the final result of PFHC; the uneven clusters are identified well. For comparison purposes, the FHC algorithm is applied to the same dataset with global ε = 14.14 and λ = 0.01. As an
Fig. 12. Clustering in local regions.
Fig. 13. Global clustering result.
approach for uniformly distributed datasets, FHC detects seven clusters, as shown in Fig. 14(a), so it does not deal well with non-uniform datasets.

Example 5.2. Fig. 10(b) shows the second dataset, consisting of five clusters of irregular shape, including concave clusters of different densities. With ε = 14.14, λ = 0.01 in regions P1, P2, and P4, and ε = 40, λ = 0.01 in region P3, Figs. 11(b), 12(b), and 13(b) show the results of data partitioning, local clustering, and global clustering, respectively. In this example, the data cut by the partition boundaries belong to high-density clusters. Fig. 14(b) gives the result produced by FHC without any data partitioning.
5.2. Experiment on PFHC with incremental data

IFHC, an incremental version of FHC proposed in [5], updates the clustering result through Affected_neig, the neighborhoods whose cluster membership may potentially change. Here we study how the clustering knowledge is updated by incremental data under data partitioning.
Fig. 14. Clustering result by FHC without any data partitioning (the lower density clusters are misidentified).
Fig. 15. Incremental points.
Example 5.3. We use the dataset of Fig. 10(a) for this experiment. In Fig. 15, three blocks of 150 incremental points are added to the dataset of Fig. 10(a); they are marked with circles denoted A, B, and C. The incremental points in area A are adjacent to the blue cluster, while those in area B are adjacent to the green cluster, so they should be absorbed into these two clusters. The incremental points in area C connect two clusters, so both should be combined into one cluster. Fig. 16(a) shows the result for the incremental data in each local region. As expected, the incremental data in areas A and B are assigned to their adjacent clusters. The incremental data in area C cross two regions; clusters D and E are combined into one class in the combination phase, as Fig. 16(b) shows.
Fig. 16. Global clustering result by PFHC.
Fig. 17. (a) Incremental points. (b) Global clustering result by PFHC.
Example 5.4. Unlike Example 5.3, the incremental data in this example lie within a single region. In Fig. 17(a), the incremental data in area A join two lower-density clusters, while those in area B connect two high-density classes. The clustering result is given in Fig. 17(b).
5.3. Comparison of PFHC with PDBSCAN

In this section, for comparison purposes, we implement another algorithm, PDBSCAN, which combines DBSCAN [6] with the data partitioning method, and run it on the same datasets, since DBSCAN is an effective and efficient algorithm for discovering clusters of arbitrary shape. In the PDBSCAN method, the datasets are first partitioned and DBSCAN is then applied in each local area. Six datasets ranging from approximately 5000 to 30,000 points are used in our experiment. Fig. 18 shows that PFHC runs faster than PDBSCAN. This is because DBSCAN must execute a region query for all objects in the neighborhood of each core object, and the neighborhoods of the objects around a core object may overlap each other, which affects the run time of DBSCAN. In contrast, FHC searches for
Fig. 18. Run time of PFHC versus PDBSCAN (runtime in seconds against the number of data points, ×10^4).
sub-clusters using a partitioning method, constructs a fuzzy graph, and obtains the cut graph to discover the clusters. This is highly time-saving because the overlapping searches are omitted, so PFHC uses only a fraction of the total time required by PDBSCAN.

6. Conclusions

Conventional clustering algorithms are commonly applied to datasets with uniform density, whereas the algorithm presented in this research is directed at unevenly distributed datasets. Based on FHC, presented in [5], a novel clustering algorithm for non-uniform datasets, PFHC, is proposed. After the dataset is divided into several local regions of similar distribution density, FHC is used to obtain local clusters in each region, and the local clusters are then combined to obtain the global clusters. As an extension of FHC, PFHC handles unevenly scattered data with high effectiveness and efficiency, as the experiments suggest. Furthermore, PFHC generates better-quality clusters than conventional algorithms, and scales to large databases as well as FHC does.

Acknowledgments

The authors would like to thank the anonymous reviewers for their constructive comments. This work is partially supported by the Natural Science Foundation of China (NSFC 60472099), the Zhejiang Provincial Natural Science Foundation of China (Y1080490), the Ningbo Natural Science Foundation (2006A610017), and the K.C. Wong Magna Fund of Ningbo University.

References

[1] R. Agrawal, J. Gehrke, D. Gunopulos, P. Raghavan, Automatic subspace clustering of high dimensional data for data mining applications, in: Proc. ACM-SIGMOD Internat. Conf. on Management of Data (SIGMOD'98), Washington, DC, June 1998, pp. 94–105.
[2] M. Ankerst, M. Breunig, H.-P. Kriegel, J. Sander, OPTICS: ordering points to identify the clustering structure, in: Proc. 1999 ACM-SIGMOD Internat. Conf. on Management of Data (SIGMOD'99), June 1999, pp. 49–60.
[3] J.C. Bezdek, R.J. Hathaway, et al., Convergence theory for fuzzy c-means: counterexamples and repairs, IEEE Transactions on Systems, Man and Cybernetics 17 (5) (1987) 873–877.
[4] L. Cinque, G. Foresti, L. Lombardi, A clustering fuzzy approach for image segmentation, Pattern Recognition 37 (9) (2004) 1797–1807.
[5] Y. Dong, Y. Zhuang, K. Chen, X. Tai, A hierarchical clustering algorithm based on fuzzy graph connectedness, Fuzzy Sets and Systems 157 (2006) 1760–1774.
[6] M. Ester, H.-P. Kriegel, J. Sander, X. Xu, A density-based algorithm for discovering clusters in large spatial databases, in: Proc. Internat. Conf. on Knowledge Discovery and Data Mining (KDD'96), August 1996.
[7] F. de A.T. de Carvalho, C.P. Tenorio, N.L. Cavalcanti Jr., Partitional fuzzy clustering methods based on adaptive quadratic distances, Fuzzy Sets and Systems 157 (2006) 2833–2857.
[8] A.P. Gasch, M.B. Eisen, Exploring the conditional co-regulation of yeast gene expression through fuzzy k-means clustering, Genome Biology 3 (11) (2002) 1–22.
[9] S. Guha, R. Rastogi, K. Shim, CURE: an efficient clustering algorithm for large databases, in: Proc. ACM-SIGMOD Conf. on Management of Data (SIGMOD'98), May 1998.
[10] J. Han, M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, Los Altos, 2000.
[11] L. Kaufman, P.J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis, Wiley, New York, 1990.
[12] J.M. Leski, Generalized weighted conditional fuzzy clustering, IEEE Transactions on Fuzzy Systems 11 (6) (2003) 709–715.
[13] J. MacQueen, Some methods for classification and analysis of multivariate observations, in: Proc. Fifth Berkeley Symp. on Mathematical Statistics and Probability, Vol. 1, 1967, pp. 281–297.
[14] E.N. Nasibov, G. Ulutagay, A new unsupervised approach for fuzzy clustering, Fuzzy Sets and Systems 158 (2007) 2118–2133.
[15] R. Nock, F. Nielsen, On weighting clustering, IEEE Transactions on Pattern Analysis and Machine Intelligence 28 (8) (2006) 1223–1235.
[16] W. Pedrycz, V. Loia, S. Senatore, P-FCM: a proximity-based fuzzy clustering, Fuzzy Sets and Systems 148 (2004) 21–41.
[17] T.A. Runkler, J.C. Bezdek, Alternating cluster estimation: a new tool for clustering and function approximation, IEEE Transactions on Fuzzy Systems 7 (4) (1999) 377–393.
[18] G. Sheikholeslami, S. Chatterjee, A. Zhang, WaveCluster: a multi-resolution clustering approach for very large spatial databases, in: Proc. 24th Very Large Databases Conf. (VLDB'98), New York, NY, 1998.
[19] W. Wang, J. Yang, R. Muntz, STING: a statistical information grid approach to spatial data mining, in: Proc. 1997 Internat. Conf. on Very Large Data Bases (VLDB'97), August 1997.
[20] T. Zhang, R. Ramakrishnan, M. Livny, BIRCH: an efficient data clustering method for very large databases, in: Proc. ACM-SIGMOD Conf. on Management of Data (SIGMOD'96), June 1996.