PHA: A fast potential-based hierarchical agglomerative clustering method


Yonggang Lu, Yi Wan
School of Information Science and Engineering, Lanzhou University, Lanzhou 730000, China


Abstract

Article history: Received 6 April 2012; received in revised form 27 October 2012; accepted 15 November 2012; available online 23 November 2012.

A novel potential-based hierarchical agglomerative (PHA) clustering method is proposed. In this method, we first construct a hypothetical potential field of all the data points, and show that this potential field is closely related to nonparametric estimation of the global probability density function of the data points. Then we propose a new similarity metric incorporating both the potential field, which represents global data distribution information, and the distance matrix, which represents local data distribution information. Finally, we develop another equivalent similarity metric based on an edge-weighted tree of all the data points, which leads to a fast agglomerative clustering algorithm with time complexity O(N²). The proposed PHA method is evaluated by comparing it with six other typical agglomerative clustering methods on four synthetic data sets and two real data sets. Experiments show that it runs much faster than the other methods and produces the most satisfying results in most cases.

Keywords: Clustering; Algorithm; Pattern recognition; Potential field

1. Introduction

Clustering is the process of dividing data points into groups (clusters) based on a certain similarity measure. It is one of the most important steps in pattern recognition. Good reviews of clustering methods can be found in [1,2]. Clustering methods can be classified into two types: partitional and hierarchical. The partitional approach produces a single partition of the data points, while the hierarchical approach gives a nested clustering result in the form of a dendrogram (cluster tree), from which different levels of partitions can be obtained [1,2]. Due to the rich information produced by the hierarchical clustering approach, it has been widely used in many scientific applications [3–5].

So far, agglomerative clustering is the most widely used hierarchical method. It starts with clusters consisting of a single data point, and then successively merges the two most similar clusters based on a certain similarity metric. Three commonly used traditional similarity metrics are Single Linkage, Complete Linkage and Average Linkage, which use the maximum similarity, the minimum similarity and the average similarity between two clusters respectively [1]. However, it is generally recognized that the traditional methods have some limitations. For example, Single Linkage can produce the "chaining" effect, while the other two metrics usually work well only for spherical-shaped clusters. In addition, they are usually too slow to be applied to large-scale data sets [2]. Moreover, due to the lack of global data distribution information during the clustering process, they may fail to work with overlapping clusters [2].


Bayesian Hierarchical Clustering (BHC) has been introduced to address some of the limitations of the traditional methods [6,7]. Bayes' rule is used to compute the posterior probability of a merged hypothesis, and the two clusters with the highest such probability are selected to merge. BHC can deal with clusters of non-spherical shapes and missing data, and also provides a guide for choosing the number of clusters. However, to use the BHC method the explicit type of the probability distribution of the clusters has to be specified a priori.

Recently, ensemble-based clustering methods have been shown to be able to improve the quality of clustering by combining many different clustering results [8–10]. Mirzaei et al. have shown that it is possible to improve hierarchical clustering methods using ensemble methods [10]. They have reported a program called MATCH which derives the ensemble results by computing the min-transitive closure of similarity matrices produced from multiple hierarchical clustering results. They have shown the superiority of the MATCH method over the Single Linkage method, and that the cophenetic difference is the best dendrogram descriptor for producing the similarity metrics (the value of the cophenetic difference for a pair of data points is the lowest height in the dendrogram at which they are joined together) [10]. However, computing the min-transitive closure of a large similarity matrix is very time-consuming.

The idea of clustering by simulating natural processes, such as the movements of objects under a potential field, has also been explored by researchers [11–13]. Observing the similarity between the isopotential contours of a potential field and the results of hierarchical clustering, Shi et al. use potentials to compute similarity metrics for hierarchical clustering [11]. They define two potential-based similarity metrics, APES and AMAPES, which use the average potential energy similarity and the average maximal potential energy similarity between two clusters respectively, and they have shown that the two potential-based similarity metrics are superior to the traditional ones in their experiments [11]. Although a potential field is naturally produced by all the data points, only the potential energy between a pair of clusters is used in Shi's method, so the similarity between the natural potential field and the hierarchical clustering is not fully exploited. Moreover, computing the potential energy between every pair of clusters before each merging step takes much computational time. More recently, Yamachi et al. proposed a partitional clustering method based on a potential field similar to gravity [12]. They derive clusters by moving data points along the gradient of the potential field; the clusters are then formed by the data points that move close to each other. Although the method has been shown to perform better than the ant colony clustering method [12], it needs a simulation of the multibody movement, which is too complicated to be carried out faithfully. Furthermore, Yamachi's method needs the explicit coordinates of all the data points in order to compute new data point locations after the movements. However, explicit data coordinates may not be available in some clustering problems, such as some examples in bioinformatics [3,4,14].

In general, most traditional hierarchical clustering methods use only local data distribution information, while BHC uses only global data distribution information, and they all suffer from high computational complexity in finding the two most similar clusters before each merging step. To address these issues, a novel potential-based hierarchical agglomerative (PHA) clustering method is proposed in this paper. In the PHA method, both the potential field produced by all the data points (global data distribution information) and the distance matrix (local data distribution information) are used to define a new similarity metric. The idea of using the potential field of all the data points was already proposed in our previous paper for partitional clustering [15]. In this paper, the potential field idea is extended to hierarchical clustering. The potential field is shown to be negatively proportional to an estimated probability density function and thus represents global data distribution information. Moreover, in this paper the potential field is used to construct two novel similarity metrics which are shown to be equivalent. One of the similarity metrics is based on a carefully constructed edge-weighted tree and has advantages over the other: it naturally leads to an efficient dendrogram construction algorithm where the two most similar clusters can be identified directly. As a result, the proposed PHA method can usually produce more satisfying results in much less time compared to other hierarchical clustering methods.

The rest of the paper is organized as follows: the potential model of our method is introduced in Section 2; the proposed PHA clustering method is described in Section 3; experimental results are presented in Section 4; finally, the discussion and conclusion are given in Section 5.

2. The potential model

Clustering under a hypothetical potential field is a new and interesting idea. The potential field model is already described in a previous paper [15]. For completeness, the key contents of the model are described briefly here. For two points i and j, if r_{i,j} is the distance between them, we define the potential at point i from point j as

$$\Phi_{i,j}(r_{i,j}) = \begin{cases} -\dfrac{1}{r_{i,j}}, & r_{i,j} \ge \delta \\[4pt] -\dfrac{1}{\delta}, & r_{i,j} < \delta \end{cases} \qquad (1)$$

where the parameter δ is used to avoid the problem of singularity when r_{i,j} becomes zero. The total potential at point i is the sum of the potentials from all the data points,

$$\Phi_i = \sum_{j=1..N} \Phi_{i,j}(r_{i,j}) \qquad (2)$$

where N is the number of data points. In this model, different distance measures can be used for computing r_{i,j}; in our experiments, both the Euclidean distance measure and the Euclidean squared distance measure are used. To satisfy the condition of Scale-Invariance [16], it is good practice to choose the parameter δ according to the distribution of the data points. One good solution is to use the distance matrix of the data set:

$$\mathrm{MinD}_i = \min_{r_{i,j} \ne 0,\; j=1..N} r_{i,j} \qquad (3)$$

$$\delta = \mathrm{mean}(\mathrm{MinD}_i)/S \qquad (4)$$

where MinD_i is the minimum distance from point i to all the other points, and S is the scale factor. Although it is necessary to adjust S for different data sets, we find empirically that a good trade-off between sensitivity and robustness is S = 10.

In the following, we show that the potential model is closely related to non-parametric probability density estimation: the negative potential value of a data point i computed using (1) and (2) can be viewed as the likelihood of the point i under the probability density function estimated using a nonparametric approach similar to the Parzen window method [17]. First, we modify the Parzen window method to make the window a hypersphere with a fixed radius

$$r_N = \max_{i,j} r_{i,j} \qquad (5)$$

The window with the size defined above is always large enough to include all N data points. Then we define a new window function as follows:

$$\varphi(r) = \begin{cases} 0, & r > r_N \\[2pt] \dfrac{a}{r}, & r_N \ge r \ge \delta \\[4pt] \dfrac{a}{\delta}, & r < \delta \end{cases} \qquad (6)$$

where a is the normalization factor which makes sure that the integral of the window function over the whole feature space equals 1. The probability density at data point i estimated using the above new settings is then

$$\hat{p}_N(i) = \frac{1}{N} \sum_{j=1}^{N} \varphi(r_{ij}) \qquad (7)$$

It follows that Φ_i = −(N/a)·p̂_N(i), which shows that the total potential value is negatively proportional to the probability density estimated by the non-parametric method. The connection between the potential field and the estimated probability density function indicates that the potential field can provide valuable global data distribution information for the clustering process, which is one of the key ideas of the PHA method.

Notice that although there are connections between the proposed PHA method and the traditional Parzen window method, they differ in the following key aspects. Firstly, the PHA method is designed to perform clustering using potentials derived by mimicking natural interactions among mass points governed by Newton's law of gravity [15], while the Parzen window method is designed to produce asymptotically consistent probability density estimation [17]. Secondly, the window size defined in (5) is always large enough for the window to include all the data points, while in the Parzen window method the window size should satisfy r_N → 0 as N → ∞ in order to produce asymptotically consistent estimation [17]. Thirdly, although the window function of the Parzen window method may take different forms, we have not seen any reported window functions similar to what is defined by (6). So, (5), (6) and (7) together define a new non-parametric probability density estimation method which is different from the traditional Parzen window method.
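To make the model concrete, the following is a minimal Python/NumPy sketch of Eqs. (1)–(4); it is an illustration only, with a function name and array layout of our own choosing rather than the original Matlab code. It takes a precomputed distance matrix and returns the total potential of every point together with the threshold δ obtained with the scale factor S = 10.

```python
import numpy as np

def potential_field(R, S=10.0):
    """Total potentials Phi_i of Eq. (2) from an N x N distance matrix R.

    Sketch of Eqs. (1)-(4): delta is the mean nearest-neighbour distance
    divided by S, and each pairwise potential is -1 / max(r, delta).
    Assumes the data points are distinct (no zero off-diagonal distances).
    """
    N = R.shape[0]
    # Eq. (3): minimum distance from each point to any other point.
    off_diag = R + np.where(np.eye(N, dtype=bool), np.inf, 0.0)
    min_d = off_diag.min(axis=1)
    # Eq. (4): threshold distance delta controlled by the scale factor S.
    delta = min_d.mean() / S
    # Eq. (1): pairwise potentials, with r clipped at delta to avoid singularity.
    pairwise = -1.0 / np.maximum(R, delta)
    # Eq. (2): total potential at each point (the j = i term contributes -1/delta).
    phi = pairwise.sum(axis=1)
    return phi, delta
```

With the Euclidean squared distance measure used in Section 4, R would simply hold squared distances instead.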


3. The potential-based hierarchical agglomerative (PHA) clustering method

In this section, we first introduce a new similarity metric incorporating both local and global data distribution information. Then, based on this metric, we further develop an efficient agglomerative clustering method by introducing another equivalent similarity metric which is derived from an edge-weighted tree of the data points.

3.1. The new similarity metric

Using the potential model defined in Section 2, the potential value at every data point is computed to represent the total potential field at that data point in the feature space. As shown in Fig. 1, the contour lines in the total potential field can be used to determine different levels of hierarchies for the hierarchical clustering, and it can be seen that the final clustering result depends on both the potential values and the local distribution of the data points: the data points which have more similar potential values and shorter distances between them are grouped into the same cluster earlier in the hierarchical clustering.


Fig. 2. Illustration of the similarity metric defined in Definition 2. There are 3 clusters and data points from different clusters are represented by different symbols. Point A has the lowest potential value in Cluster 2. The dashed line is used to indicate the potential value of point A. It can be seen that both Cluster 1 and Cluster 3 have points which have lower potential values than that of A (i.e., below the dashed line). Because point B is the nearest point from point A within the points which are in Cluster 1 and below the dashed line, points A and B are the two characteristic points, and the distance between Cluster 2 and Cluster 1 is defined as the distance between the two characteristic points A and B, which is shown as d21. Similarly, the distance between Cluster 2 and Cluster 3 is defined as the distance between points A and C, which is shown as d23.

So we define a new similarity metric for the hierarchical agglomerative clustering as follows.

Definition 1. Cluster C1 is said to be below cluster C2 if the lowest potential value in C1 is lower than or equal to the lowest potential value in C2, and the relationship is represented as "C1 ≤ C2"; otherwise, C1 is said to be above C2. So, C1 ≤ C2 ⇔ ∃i ∈ C1 (∀j ∈ C2 (Φ_i ≤ Φ_j)).

Definition 2. The distance between cluster C1 and cluster C2 is Dist1(C1, C2) = r_{s1,s2}, where r_{s1,s2} is the distance between two characteristic points s1 and s2 from the two clusters respectively. The two characteristic points s1 and s2 are determined by:

$$\text{If } C_2 \le C_1:\quad \begin{cases} s_1 = \arg\min_k \{\, \Phi_k \mid k \in C_1 \,\} \\[2pt] s_2 = \arg\min_k \{\, r_{k,s_1} \mid (k \in C_2) \text{ AND } (\Phi_k \le \Phi_{s_1}) \,\} \end{cases}$$

$$\text{If } C_1 \le C_2:\quad \begin{cases} s_2 = \arg\min_k \{\, \Phi_k \mid k \in C_2 \,\} \\[2pt] s_1 = \arg\min_k \{\, r_{k,s_2} \mid (k \in C_1) \text{ AND } (\Phi_k \le \Phi_{s_2}) \,\} \end{cases}$$
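As a concrete reading of Definitions 1 and 2, the following small Python sketch computes Dist1 for two clusters, assuming the potentials and the full distance matrix have already been computed; the function name and data layout are illustrative assumptions, not part of the paper.

```python
import numpy as np

def dist1(c1, c2, phi, R):
    """Dist1(C1, C2) of Definition 2.

    c1, c2 : lists of point indices forming the two clusters
    phi    : array of total potentials Phi_i
    R      : full distance matrix with entries r_{i,j}
    """
    # The cluster whose lowest potential is higher supplies the first
    # characteristic point; the other cluster is "below" it (Definition 1).
    upper, lower = (c1, c2) if min(phi[c2]) <= min(phi[c1]) else (c2, c1)
    s_up = upper[int(np.argmin(phi[upper]))]        # lowest potential in the upper cluster
    # Candidates in the lower cluster with potential <= Phi_{s_up};
    # non-empty because the lower cluster is below the upper one.
    cand = [k for k in lower if phi[k] <= phi[s_up]]
    s_low = min(cand, key=lambda k: R[k, s_up])     # nearest such point to s_up
    return R[s_up, s_low]
```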


Fig. 1. Illustration of the similarity between isopotential contours and the results of hierarchical clustering. (a) The computed contour lines in the total potential field produced by points A, B, C, D and E. The contour lines at different potential values divide the data points into different clusters, which correspond to different hierarchies of the clusters in the hierarchical clustering results. The contour lines at potential value −4 place each data point in a separate cluster; the contour lines at potential value −2.5 produce the clustering result {A, B}, {C}, {D}, {E}; the contour lines at potential value −2 produce the clustering result {A, B, C}, {D, E}; and the contour lines at potential value −1.4 give a single cluster {A, B, C, D, E}. (b) The clustering results produced at different potential values are shown by a dendrogram.

Fig. 2 shows a simple illustration of the new similarity metric defined above. In Fig. 2, point A has the lowest potential value in Cluster 2, and both Cluster 1 and Cluster 3 are below Cluster 2. Point B is the nearest point to A among the points which are in Cluster 1 and have lower potential values than that of A (i.e., below the dashed line), so the distance between Cluster 2 and Cluster 1 is defined as the distance between the two characteristic points A and B, which is shown as d21. Similarly, the distance between Cluster 2 and Cluster 3 is defined as the distance between the two characteristic points A and C, which is shown as d23.

The new similarity metric Dist1 defined in Definition 2 uses the distance between two data points from the two clusters respectively, which is similar to Single Linkage and Complete Linkage [1]. In addition, it takes into account the potential values, which represent global data distribution information. In the definition of Dist1, the total potential produced by all the data points is computed only once and used by all the merging steps. In contrast, Shi's method computes potentials for every pair of clusters before each merging step [11], which is very time-consuming.


3.2. Efficient clustering using an edge-weighted tree

Although the dendrogram can now be constructed routinely as in the traditional methods with the new similarity metric Dist1, we show in the following that a more efficient method can be developed by defining another equivalent similarity metric based on an edge-weighted tree. First, two definitions are given.

Definition 3. The shortest distance between a cluster Cm and all the other clusters below Cm is called the characteristic distance of Cm, and is represented as CDist(Cm). This can be described as

$$\mathrm{CDist}(C_m) = \begin{cases} \min_{n \ne m,\; C_n \le C_m} \big(\mathrm{Dist}(C_m, C_n)\big), & \text{if } \exists n\,(n \ne m \text{ AND } C_n \le C_m) \\[4pt] \infty, & \text{otherwise} \end{cases}$$

where Dist can be any cluster distance, such as Dist1 defined in Definition 2 and Dist2 to be defined in Definition 5.

Definition 4. For each data point i, a data point which is nearest to i among all the other data points whose potential values are lower than or equal to the potential value of i is called the parent point of i, and is represented as parent[i]. So

$$\mathrm{parent}[i] = \arg\min_k \{\, r_{k,i} \mid \Phi_k \le \Phi_i \text{ AND } k \ne i \,\}.$$

From Definition 3, it is obvious that the distance between two clusters is always greater than or equal to both of their characteristic distances. So the shortest distance between every pair of clusters is the same as the shortest characteristic distance: min_{m,n} Dist(Cm, Cn) = min_m CDist(Cm). This implies that to find the two most similar clusters during the merging process, instead of comparing every pair of clusters, it can be done simply by comparing the characteristic distances of all the clusters. The following lemma shows that the characteristic distance of a cluster can be found easily using the parent point defined in Definition 4:

Lemma 1. If Definition 2 is used in the hierarchical agglomerative clustering, and point pm has the lowest potential value in cluster Cm, then the characteristic distance of cluster Cm defined in Definition 3 is the distance between the point pm and its parent point parent[pm] defined in Definition 4.

Proof. Using Definition 2 and Definition 3, the characteristic distance of Cm is the minimum distance between cluster Cm and all the other clusters below Cm. So, for the two characteristic points which define the characteristic distance of Cm, one of them is pm, which has the lowest potential value in Cm; the other is q, which is the nearest point to pm among the points which are in the clusters below Cm and have potential values lower than that of pm. Because the clusters above Cm contain only data points having higher potential values than that of pm, the characteristic point q is also the nearest point to pm among all the points having lower potential values than that of pm. Using Definition 4, it can be seen that q is also the parent point of pm. So the characteristic distance of cluster Cm is the distance between the point pm and its parent point q = parent[pm]. □

Lemma 1 shows that for all the clusters below Cm except the cluster which contains the data point parent[pm], the distances between them and Cm are larger than the characteristic distance of Cm, and thus it is not necessary to compute the distances between them and Cm during the merging process. This also means that to find the two most similar clusters, it is only necessary to remember the distance between every point and its parent point. This makes it possible to construct an edge-weighted tree for speeding up the clustering process, which we describe next.

In the edge-weighted tree, different data points correspond to different tree nodes, and the parent point defined in Definition 4 becomes the parent node. In the PHA method, the edge-weighted tree is built efficiently as follows: first, all the data points are ranked by their potential values; then, the ranks of the data points and the distance matrix are used to construct the edge-weighted tree. The point having the lowest potential value is set as the root node. For every point other than the root node, the parent point defined in Definition 4 is set as the parent node of the point. The distance between a point and its parent point is set as the weight of the edge connecting them. The algorithm for constructing the edge-weighted tree from the distance matrix is shown in Fig. 3.

In the edge-weighted tree, if a point pm has the lowest potential value in a cluster Cm, then for all the other clusters below Cm except the cluster containing the point parent[pm], there are no edges connecting them and cluster Cm. The distances between these clusters and Cm are actually useless for the clustering process, so they can be defined as infinity and ignored during the computation. This way, the computation time can be reduced while still producing the same clustering results. Following this idea, the similarity metric defined in Definition 2 can be simplified using the edge-weighted tree:

Definition 5. The distance between cluster C1 and cluster C2 is

$$\mathrm{Dist}_2(C_1, C_2) = \begin{cases} r_{i,j}, & \text{if } \exists i\, \exists j\, (i \in C_1 \text{ AND } j \in C_2 \text{ AND } (\mathrm{parent}[i] = j \text{ OR } \mathrm{parent}[j] = i)) \\[2pt] \infty, & \text{otherwise} \end{cases}$$

where parent[i] is the parent node of point i in the edge-weighted tree, and r_{i,j} is the distance between point i and point j.

The following three lemmas and two theorems prove that the similarity metric Dist2 defined in Definition 5 is unique and is thus well-defined, and that the two similarity metrics Dist1 and Dist2 are equivalent in the sense that they produce the same dendrogram.

Fig. 3. Algorithm for building the edge-weighted tree, where N is the number of data points, Dist is the distance matrix of size N × N such that Dist[i,j] = Dist[j,i] = r_{i,j}, and δ is the threshold distance defined in Section 2.
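Since the pseudocode of Fig. 3 is not reproduced here, the following minimal Python sketch follows the construction described above (rank the points by potential, then attach every point to the nearest point among those already ranked); the function name and the simple O(N²) double pass are illustrative choices and not necessarily the exact algorithm of Fig. 3.

```python
import numpy as np

def build_edge_weighted_tree(R, phi):
    """Edge-weighted tree of Definition 4 from distances R and potentials phi.

    Returns parent[i] (index of the parent point, -1 for the root) and
    weight[i] = r_{i, parent[i]} (the weight of the edge to the parent).
    """
    N = len(phi)
    order = np.argsort(phi, kind="stable")   # rank points by potential, lowest first
    parent = np.full(N, -1, dtype=int)
    weight = np.full(N, np.inf)
    for pos in range(1, N):                  # order[0] is the root (lowest potential)
        i = order[pos]
        # Points ranked earlier all have potential <= phi[i]; the nearest of
        # them is taken as the parent point of Definition 4.
        candidates = order[:pos]
        j = candidates[np.argmin(R[i, candidates])]
        parent[i] = j
        weight[i] = R[i, j]
    return parent, weight
```

On the six-point example of Fig. 6, this assigns P6 as the root and the parents described in the text (P5→P6, P2→P5, P4→P6, P1→P2, P3→P4).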


Lemma 2. If Definition 5 is used in the hierarchical agglomerative clustering, every cluster produced at each step of the merging process consists of all the nodes of a subtree rooted at a point p in the edge-weighted tree. In other words, in a cluster there is only one "root point" p whose parent node in the edge-weighted tree is not inside the cluster, and for all the other points in the cluster, p is their ancestor node in the edge-weighted tree.

Proof. We prove by mathematical induction. (1) When there is only one point in the cluster, because the parent of a tree node cannot be itself, the single point in the cluster is the "root point", so the lemma holds. (2) Suppose that the lemma holds for clusters having fewer than k points. For a cluster having k points, if it is merged from two clusters Cm and Cn having Nm and Nn points respectively, it follows that k = Nm + Nn. Because Nm and Nn are both positive integers, Nm < k and Nn < k. Using the induction hypothesis, we know that the clusters Cm and Cn have the "root points" pm and pn respectively, where pm is the ancestor of all the other points in Cm, and pn is the ancestor of all the other points in Cn in the edge-weighted tree. If Definition 5 is used in the hierarchical agglomerative clustering, the two clusters are merged only if a point in Cm is the parent node of a point in Cn, or vice versa. If ∃i ∈ Cm and ∃j ∈ Cn such that parent[i] = j, then the parent point of i is not in the same cluster as point i, so i is the "root point" and pm = i. After the clusters Cm and Cn are merged, the only point whose parent is not in the merged cluster is point pn: because parent[i] = j, point i = pm is the "root point" of cluster Cm, and point pn is the ancestor of point j, so point pn is also the ancestor of all the points in Cm. Therefore pn becomes the "root point" of the merged cluster, and the lemma holds in this case. If ∃i ∈ Cm and ∃j ∈ Cn such that parent[j] = i, we can prove similarly that pm is the "root point" of the merged cluster. So the lemma holds for clusters having any number of points. □

Lemma 3. If Definition 5 is used in the hierarchical agglomerative clustering, for any two different clusters C1 and C2 produced at any step, if ∃i ∈ C1 and ∃j ∈ C2 such that parent[i] = j, and also ∃k ∈ C1 and ∃l ∈ C2 such that parent[k] = l, it follows that i = k and j = l.

Proof. Using Lemma 2, we know that i and k are both "root points" of cluster C1. Because there is only one "root point" in a cluster, as proved in Lemma 2, it follows that i = k. A tree node can only have one parent node, so j = l. □

Lemma 4. If Definition 5 is used in the hierarchical agglomerative clustering, for any two different clusters C1 and C2 produced at any step, if ∃i ∈ C1 and ∃j ∈ C2 such that parent[i] = j, then ∀k ∈ C1 and ∀l ∈ C2, parent[l] ≠ k.

Proof. Using Lemma 2, because i ∈ C1, j ∈ C2 and parent[i] = j, i is the "root point" of cluster C1. Suppose that ∃k ∈ C1 and ∃l ∈ C2 such that parent[l] = k; then l would be the "root point" of cluster C2. Since k ∈ C1, it follows from Lemma 2 that either i = k or point i is an ancestor of point k. Similarly, either l = j or point l is an ancestor of point j. In a tree data structure, there always exists a path from a child node to its ancestor. If i ≠ k and j ≠ l, there would exist a cycle going through point i, point j, point l, point k, and back to point i; if i = k and j ≠ l, there would be a cycle going through point i, point j, point l, and back to point i; if i ≠ k and j = l, there would be a cycle going through point i, point j, point k, and back to point i; and if i = k and j = l, there would be a cycle going through point i, point j, and back to point i. So, in all cases there would be a cycle in the edge-weighted tree. This contradicts the definition of the tree data structure. So the supposition is false and the lemma is proved. □

Theorem 1. The similarity metric Dist2 defined in Definition 5 is well-defined for the hierarchical agglomerative clustering.

Proof. Using Lemma 3, if there exists a pair of points ⟨i, j⟩ which satisfies the condition i ∈ C1, j ∈ C2 and parent[i] = j, the pair ⟨i, j⟩ is unique. Similarly, if k ∈ C1, l ∈ C2 and parent[l] = k, the pair ⟨k, l⟩ is also unique. Using Lemma 4, it is known that it is not possible for two pairs of points ⟨i, j⟩ and ⟨k, l⟩ to satisfy the following conditions at the same time: i ∈ C1, j ∈ C2, k ∈ C1, l ∈ C2, parent[i] = j and parent[l] = k. Combining both statements, it can be seen that if ∃i ∈ C1 and ∃j ∈ C2 such that parent[i] = j or parent[j] = i, the pair ⟨i, j⟩ is unique. So Dist2 defined in Definition 5 is unique, and is thus well-defined. □

Theorem 2. Dist1 defined in Definition 2 and Dist2 defined in Definition 5 are equivalent in the sense that they generate the same dendrogram.

Proof. For a set of clusters {Cm}, suppose the two most similar clusters found using Definition 2 are C1 and C2, and C2 is below C1. Then C2 is the cluster having the shortest distance from C1 among the clusters below C1, so the distance between C1 and C2, represented as d12, is also the characteristic distance of C1 defined in Definition 3. For the two characteristic points which determine d12 in Definition 2, if one of them is s1 from C1 and the other is s2 from C2, then d12 is the distance between s1 and s2, and s1 is the point having the lowest potential value in C1. Because d12 is also the characteristic distance of C1, it follows from Lemma 1 that s2 is the parent point of s1, i.e., s2 = parent[s1]. Because s1 and s2 are in cluster C1 and cluster C2 respectively, d12 is also the distance between C1 and C2 defined by Definition 5.

For the same set of clusters {Cm}, suppose the two most similar clusters found using Definition 5 are clusters C3 and C4, the distance between them is d34, and C4 is below C3. Then there is a point s3 in C3 whose parent point is s4 in C4, and by Definition 5, d34 is the distance between s3 and s4. Using Lemma 2, it follows that s3 is the ancestor of all the other points in C3, so s3 has the lowest potential value in C3. As s4 = parent[s3], using Lemma 1, d34 is the characteristic distance of C3. Because C4 is below C3, it follows that d34 is also the distance between C3 and C4 defined by Definition 2.

Suppose that d34 < d12. This would mean that the distance between C3 and C4 defined by Definition 2 is shorter than the distance between the two most similar clusters found using Definition 2, which is a contradiction. Similarly, if d12 < d34, then using Definition 5 the distance between C1 and C2 would be shorter than the distance between C3 and C4, while C3 and C4 are the two most similar clusters found using Definition 5; this is also a contradiction. So it is only possible that d12 = d34. It follows that the distances between C1 and C2 defined by Definition 2 and Definition 5 are the same, namely d12, and the distances between C3 and C4 defined by Definition 2 and Definition 5 are also the same, namely d34 = d12. So, for the same set of clusters, the two most similar clusters found by Definition 2 are also the two most similar clusters found by Definition 5, and vice versa. It follows that the dendrograms generated by Dist1 and Dist2 are the same. □


After the edge-weighted tree is built, Definition 5 is used to define the similarity metric for the PHA method. Because the weight of an edge in the edge-weighted tree is defined as the distance between the two end points of the edge, the two most similar clusters can be found efficiently by comparing the weights of the edges in the edge-weighted tree. This is done by sorting the edges of the edge-weighted tree according to their weights; the two most similar clusters at each merging step are then found directly using the sorted queue of the edges. The detailed algorithm for generating the final clustering result (a dendrogram) from the edge-weighted tree is shown in Fig. 4. The complete PHA clustering algorithm is given in Fig. 5. It first computes the threshold distance δ using (3) and (4), and then the two algorithms shown in Figs. 3 and 4 are used to produce the final clustering result.

Fig. 6 shows an illustrative example for the PHA algorithm. The data set contains 6 points: P1, P2, P3, P4, P5 and P6, located at (1.4, 1.3), (1.6, 1.5), (2.4, 1.0), (3.0, 1.5), (3.3, 1.0) and (3.4, 1.2) respectively, as shown by the circles in Fig. 6(a). After sorting by the computed potential values, the order of the points is ⟨P6, P5, P2, P4, P1, P3⟩. The first point P6, having the lowest potential value, becomes the root of the edge-weighted tree. Then the second point P5 is selected from the sorted queue, and it finds P6 as its parent because P6 is the only point visited so far. Later, P2 is selected and it finds P5 as its parent because P5 has a shorter distance to P2 than P6. Similarly, P4 finds P6 as its parent, P1 finds P2 as its parent, and P3 finds P4 as its parent. The weight of each edge is set to be the distance between a point and its parent. The edge-weighted tree built by this process is shown in Fig. 6(b). After sorting the data points by the weight of the edge connecting the point and its parent point, the order is ⟨P5, P1, P4, P3, P2, P6⟩. So, P5 and its parent P6 are merged into a new tree node, then P1 and its parent P2 are merged, which is followed by merging P4 with (P5, P6), then P3 with (P4, P5, P6), and finally (P1, P2) with (P3, P4, P5, P6). The final dendrogram built this way is shown in Fig. 6(c).
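The merging step itself can be sketched in a few lines of Python (an illustrative stand-in for the pseudocode of Fig. 4, not the authors' Matlab code): the tree edges are sorted by weight, and a union-find structure records which cluster each point currently belongs to.

```python
import numpy as np

def dendrogram_from_tree(parent, weight):
    """Merge order from an edge-weighted tree given as parent[i], weight[i].

    Sorting the N-1 tree edges by weight gives the two most similar clusters
    at every step directly; a union-find structure tracks current clusters.
    Returns a list of (point_i, point_j, merge_height) triples.
    """
    N = len(parent)
    root = np.full(N, -1, dtype=int)          # union-find parent array (-1 = cluster root)

    def find(x):
        while root[x] >= 0:
            if root[root[x]] >= 0:            # path halving for efficiency
                root[x] = root[root[x]]
            x = root[x]
        return x

    edges = sorted((weight[i], i, parent[i]) for i in range(N) if parent[i] >= 0)
    merges = []
    for w, i, j in edges:                     # shortest edge first = most similar clusters
        ri, rj = find(i), find(j)
        root[ri] = rj                         # merge the two clusters containing i and j
        merges.append((i, j, w))
    return merges
```

For the tree of Fig. 6(b), this yields the merge order described above: P5 with P6, P1 with P2, P4 with (P5, P6), P3 with (P4, P5, P6), and finally the two remaining clusters.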

Fig. 6. Illustration of the clustering process using the PHA clustering method. (a) The distribution of the data points. The computed potential value is shown in parentheses next to each point label. The line segments connect each point with its identified parent, and the length of each line segment is indicated by the number above it. (b) The edge-weighted tree built using the potentials. The arrows indicate the child points, and the edge weight is shown by the number above each edge. (c) The dendrogram derived from the edge-weighted tree.

It can be seen from Figs. 3 through 5 that the time complexity of the proposed PHA method is O(N²). Most agglomerative clustering methods natively have a time complexity of O(N³). Although the Single Linkage method can be improved to have a time complexity of O(N²), it suffers considerably from the "chaining" effect. Our PHA method natively has a time complexity of O(N²), which makes it a better choice.

Fig. 4. Algorithm for building the dendrogram.

4. Experimental results

The PHA method and all the other methods used in the experiments are implemented in Matlab code without using any MEX-files. The experiments are run on a desktop computer with an Intel 3.06 GHz dual-core CPU and 3 GB of RAM.

Fig. 5. PHA clustering algorithm.

4.1. The datasets

Three 2-dimensional synthetic data sets are generated and shown in Fig. 7: Dataset A is used to represent a simple case of well-separated clusters, Dataset B is used to represent overlapping clusters with different sizes, and Dataset C is used to represent clusters with non-spherical shapes. Two popular real data sets, the Iris data set (with 150 points in 4 dimensions) and the Yeast data set (with 1484 points in 8 dimensions) from the UCI Machine Learning Repository [18], are also used in the evaluation.

Fig. 7. Three synthetic data sets, with different symbols representing points from different normal distributions. Dataset A: 900 data points from three 2D normal distributions of the same size (n = 300, σ = 1) centered at (0, 0), (3, 5) and (6, 0) respectively. Dataset B: 1600 data points from four 2D normal distributions of different sizes: n = 200, σ = 2, centered at (0, 0); n = 600, σ = 3, centered at (6, 13); n = 800, σ = 4, centered at (12, 0); and n = 200, σ = 2, centered at (16, 11). Dataset C: 1000 data points from two 2D bivariate normal distributions generated with the same shape parameters (n = 500, σ_x = 1, σ_y = 5, covariance_xy = 0) centered at (0, 0) and (5, 0) respectively.
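For readers who wish to generate comparable synthetic data, a small NumPy sketch using the parameters listed in the caption of Fig. 7 is given below; the helper name is our own, and the original random samples cannot of course be reproduced exactly.

```python
import numpy as np

def gaussian_blobs(spec, seed=0):
    """Stack 2D Gaussian clusters given (n, sigma_x, sigma_y, center) tuples."""
    rng = np.random.default_rng(seed)
    points, labels = [], []
    for label, (n, sx, sy, center) in enumerate(spec):
        pts = rng.normal(loc=center, scale=(sx, sy), size=(n, 2))
        points.append(pts)
        labels.append(np.full(n, label))
    return np.vstack(points), np.concatenate(labels)

# Dataset A of Fig. 7: three clusters, n = 300 and sigma = 1 each.
dataset_a, labels_a = gaussian_blobs([(300, 1, 1, (0, 0)),
                                      (300, 1, 1, (3, 5)),
                                      (300, 1, 1, (6, 0))])
```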

For the three synthetic data sets, each normal distribution is considered as a benchmark cluster. Iris and Yeast are both labeled data sets, so benchmarks are available for all these data sets. In addition, a data set from the data set family 1 in Shi's paper [11] is also studied; it is called Dataset D in our paper. Dataset D contains noise data and clusters of different shapes, and no benchmark is available for it.

4.2. Performance measure

When the benchmark is available, the dendrogram is cut horizontally to produce the desired number of clusters as in the benchmark. This is done by finding the smallest height at which a horizontal cut through the dendrogram tree leaves the same or a smaller number of clusters than the benchmark. Then, the Fowlkes–Mallows index [19] is used to evaluate the consistency of the clustering results with the benchmark. Given two different clustering results R1 and R2, the Fowlkes–Mallows index can be computed by

$$FM = \sqrt{\frac{TP}{TP+FP} \cdot \frac{TP}{TP+FN}}$$

where TP is the number of true positives, which are the pairs of points that are in the same cluster in both R1 and R2; FP is the number of false positives, which are the pairs of points that are in the same cluster in R1 but not in R2; FN is the number of false negatives, which are the pairs of points that are in the same cluster in R2 but not in R1; and TN is the number of true negatives, which are the pairs of points that are in different clusters in both R1 and R2. If the two clustering results match exactly, both FP and FN will be zero, and the Fowlkes–Mallows index will take its maximum value 1; if the two clustering results are completely different, TP will be zero and the Fowlkes–Mallows index will take its minimum value 0. So a larger Fowlkes–Mallows index indicates a better match between the clustering result and the benchmark.
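A direct pair-counting implementation of this index is sketched below (the function name is illustrative; for large N one would use contingency-table counts rather than the O(N²) loop over pairs):

```python
from itertools import combinations

def fowlkes_mallows(r1, r2):
    """Fowlkes-Mallows index between two clusterings given as label sequences."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(r1)), 2):
        same1 = r1[i] == r1[j]
        same2 = r2[i] == r2[j]
        if same1 and same2:
            tp += 1              # pair together in both clusterings
        elif same1:
            fp += 1              # together in R1 only
        elif same2:
            fn += 1              # together in R2 only
    if tp == 0:
        return 0.0
    return (tp / (tp + fp) * tp / (tp + fn)) ** 0.5
```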

Table 1. Experimental results for Dataset A (#points = 900, #Clusters = 3).

Method name        Distance measure                  Time (s)   Fowlkes–Mallows index
BHC                Gaussian model                    95.54      0.9933
Single linkage     Euclidean and Euclidean squared   2.11       0.7693
Complete linkage   Euclidean and Euclidean squared   2.09       0.9911
Average linkage    Euclidean                         2.09       0.9911
Average linkage    Euclidean squared                 2.09       0.9845
APES               Euclidean                         2.15       0.9933
APES               Euclidean squared                 2.14       0.9933
MATCH              Euclidean                         489.73     0.9911
MATCH              Euclidean squared                 406.13     0.7658
PHA                Euclidean                         0.53       0.9933
PHA                Euclidean squared                 0.54       0.9933

4.3. Performance comparison

To evaluate the proposed PHA clustering algorithm, it is compared with the APES potential-based similarity metric [11], Bayesian Hierarchical Clustering (BHC) [6], the hierarchical ensemble method MATCH [10], and three traditional methods: Single Linkage, Complete Linkage and Average Linkage. The BHC method with the Gaussian model is only applied to Dataset A, Dataset B and Dataset C, which are composed of normal distributions. For the other methods, both the Euclidean distance and the Euclidean squared distance are used as distance measures, and they are applied to all the data sets. Because the two distance measures do not make a difference for the cluster pairs determined by the maximum and the minimum distances, they give the same results for Single Linkage and Complete Linkage, which is confirmed by our experimental results. So, separate results are shown for the two distance measures only when the methods used are Average Linkage, APES, MATCH and PHA. For the MATCH method, the cophenetic difference (CD) is used as the dendrogram descriptor, and the results of Single Linkage, Complete Linkage, Average Linkage and APES are combined to produce the ensemble result. Except for Dataset D, the parameter δ is selected using S = 10 for the PHA method. The method and the distance measure used, the running time in seconds ("Time"), the Fowlkes–Mallows index, the number of data points ("#points"), and the number of clusters used to cut the dendrogram ("#Clusters") are recorded in Tables 1 through 5 for all the data sets except Dataset D. The clustering results of the four synthetic data sets produced by cutting the dendrograms are shown in Figs. 8 through 11, where different symbols are used to display the data points assigned to different clusters. The method name and the name of the adopted distance measure (in parentheses) are shown above each subplot, where "Single", "Complete" and "Average" are used to represent Single Linkage, Complete Linkage and Average Linkage respectively.

For Dataset A, containing three well separated spherical-shaped clusters, the results are shown in Table 1 and Fig. 8. It is clear that PHA is the fastest among all the methods for this data set.


Fig. 8. Clustering results of Dataset A.

Except MATCH with the Euclidean squared distance measure and Single Linkage, all the other methods have produced near-optimal results. It can be seen from Table 1 and Fig. 8 that BHC, APES and PHA have produced the same best result, which has the highest Fowlkes–Mallows index.

For Dataset B, containing four overlapping clusters of different sizes, the results are shown in Table 2 and Fig. 9. It can be seen that the proposed PHA method has produced the most accurate clustering results, indicated by the highest Fowlkes–Mallows indices for both distance measures. Fig. 9 also indicates that the PHA method has produced a near-optimal result with the Euclidean squared distance measure, while all the other methods have failed to identify the four normal distributions correctly. This shows the exceptionally good performance of PHA in dealing with overlapping clusters.

For Dataset C, with two bivariate normal distributions, the results are shown in Table 3 and Fig. 10.

It can be seen that only PHA with the Euclidean distance measure and BHC have produced near-optimal results, while the rest of the methods have failed. BHC is well known for its good performance in distinguishing points from different probabilistic models, so it can separate clusters of non-spherical shapes, which is confirmed by the results. Table 3 indicates that PHA with the Euclidean distance measure has produced the highest Fowlkes–Mallows index and runs much faster than the other methods, which shows that PHA can also be used to deal with clusters of non-spherical shapes, and that it can produce results similar to those of BHC in much less time.

For the Iris data set, the results are shown in Table 4. PHA with the Euclidean squared distance measure has produced the highest Fowlkes–Mallows index, 0.8670.


Table 2. Experimental results for Dataset B (#points = 1600, #Clusters = 4).

Method name        Distance measure                  Time (s)   Fowlkes–Mallows index
BHC                Gaussian model                    366.76     0.5852
Single linkage     Euclidean and Euclidean squared   12.54      0.5844
Complete linkage   Euclidean and Euclidean squared   12.48      0.5740
Average linkage    Euclidean                         12.47      0.6187
Average linkage    Euclidean squared                 12.52      0.6508
APES               Euclidean                         12.83      0.7423
APES               Euclidean squared                 12.65      0.7498
MATCH              Euclidean                         4678.78    0.5684
MATCH              Euclidean squared                 3505.46    0.5816
PHA                Euclidean                         1.67       0.7948
PHA                Euclidean squared                 1.68       0.9173


Average Linkage with the two distance measures has produced the second and the third highest Fowlkes–Mallows indices, which are both higher than 0.8, and the rest of the methods have all produced Fowlkes–Mallows indices lower than 0.8.

For the Yeast data set, the results are shown in Table 5. All the methods, including PHA, have failed to produce satisfying results: while PHA has produced the second highest Fowlkes–Mallows index, all the methods have produced Fowlkes–Mallows indices lower than 0.5. This indicates the difficulties faced by agglomerative clustering methods in general when applied to high-dimensional data sets.

For Dataset D, the results are shown in Fig. 11. It can be seen that APES and all three traditional methods have failed to identify the five clusters correctly. When δ is selected using S = 10, PHA also fails for this data set; but when δ is selected using S = 0.5, PHA can identify all five clusters with the Euclidean squared distance measure. This shows the importance of selecting a proper parameter δ for PHA. The experiment also shows the robustness of PHA in dealing with data containing noise.

Fig. 9. Clustering results of Dataset B.


Table 3. Experimental results for Dataset C (#points = 1000, #Clusters = 2).

Method name        Distance measure                  Time (s)   Fowlkes–Mallows index
BHC                Gaussian model                    122.10     0.9763
Single linkage     Euclidean and Euclidean squared   2.90       0.7061
Complete linkage   Euclidean and Euclidean squared   2.89       0.5007
Average linkage    Euclidean                         2.89       0.5506
Average linkage    Euclidean squared                 2.95       0.6069
APES               Euclidean                         2.96       0.6893
APES               Euclidean squared                 2.96       0.6893
MATCH              Euclidean                         789.52     0.6963
MATCH              Euclidean squared                 728.29     0.6963
PHA                Euclidean                         0.66       0.9841
PHA                Euclidean squared                 0.66       0.7040

Fig. 10. Clustering results of Dataset C.


Table 4. Experimental results for the Iris data set (#points = 150, #Clusters = 3).

Method name        Distance measure                  Time (s)   Fowlkes–Mallows index
Single linkage     Euclidean and Euclidean squared   0.018      0.7635
Complete linkage   Euclidean and Euclidean squared   0.017      0.7686
Average linkage    Euclidean                         0.017      0.8320
Average linkage    Euclidean squared                 0.017      0.8498
APES               Euclidean                         0.019      0.7567
APES               Euclidean squared                 0.019      0.7567
MATCH              Euclidean                         3.214      0.7567
MATCH              Euclidean squared                 3.201      0.7567
PHA                Euclidean                         0.017      0.7635
PHA                Euclidean squared                 0.017      0.8670

Fig. 11. Clustering results of Dataset D.


Table 5. Experimental results for the Yeast data set (#points = 1484, #Clusters = 10).

Method name        Distance measure                  Time (s)   Fowlkes–Mallows index
Single linkage     Euclidean and Euclidean squared   9.85       0.2450
Complete linkage   Euclidean and Euclidean squared   9.80       0.4700
Average linkage    Euclidean                         9.84       0.3160
Average linkage    Euclidean squared                 9.83       0.2421
APES               Euclidean                         9.98       0.4634
APES               Euclidean squared                 10.00      0.4668
MATCH              Euclidean                         2459.62    0.4680
MATCH              Euclidean squared                 2463.34    0.4694
PHA                Euclidean                         1.46       0.4694
PHA                Euclidean squared                 1.47       0.4694



As noted above, PHA is robust to data containing noise: in PHA, the potential value is mainly determined by the global data distribution of the data set, so the presence of a small amount of noise usually does not affect the final results much. Although in Shi's paper it is found that APES can identify all five clusters of this data set by using a Gaussian potential function, the parameter σ of the Gaussian potential function needs to be adjusted to its optimal value. Moreover, using the Gaussian potential function also means that the algorithm will not satisfy the Scale-Invariance condition [16], so only the potential function defined in Section 2 is used in the PHA method.

Our results also show the limitations of the hierarchical ensemble method. MATCH is used to combine the results of Single Linkage, Complete Linkage, Average Linkage and APES, but it never produces a result with a better Fowlkes–Mallows index than the best of the four methods, and most of the time it only produces results slightly better than the worst of the four. This shows that care should be taken when applying the ensemble method to hierarchical clustering problems.

4.4. Running time comparison

As explained in Section 3, the PHA method avoids comparing every pair of clusters before each merging step, so it usually runs much faster than the other methods. This is confirmed by our experimental results. Except for the Iris data set, which is very small, the experimental results in Tables 1 through 5 show that PHA runs much faster than all the other methods. Apart from BHC and MATCH, which are usually slower than the other methods, APES and the three traditional methods have similar running time for the same data set. It can also be seen that using different distance measures produces similar running time for the same method. So, to further study the performance of PHA, we have compared the running time of Average Linkage and PHA using only the Euclidean distance measure on 6 data sets of different sizes. All 6 data sets are generated using the same normal distributions as Dataset B but with different numbers of points. The running time (in seconds) of both methods is plotted in Fig. 12. For the data set with 8000 points, Average Linkage takes more than 1500 s while PHA finishes in 46 s. It can be seen that as the size of the data set gets larger, a more substantial speedup is obtained by using PHA over Average Linkage.

Fig. 12. Running time comparison between PHA and Average Linkage.

5. Conclusion and discussion

In this paper a novel potential-based hierarchical agglomerative (PHA) clustering method is presented. Using PHA, the results of hierarchical clustering can be produced efficiently using an edge-weighted tree of all the data points. Experiments on real and synthetic data sets show that the proposed PHA method can usually produce more satisfying results and run much faster than the other agglomerative clustering methods.

Besides the distance matrix, the potential field produced by all the data points is also used in defining the similarity metric in our method. The potential field can be viewed as the probability density function estimated using all the data points. So, differently from the traditional methods, PHA uses both local and global data distribution information during the clustering process. It can deal with overlapping clusters, clusters of non-spherical shapes and clusters containing noise data. By making good use of the similarity between the isopotential contours of a potential field and the hierarchical clustering, PHA effectively improves the clustering results. By defining the similarity metric using an edge-weighted tree of all the data points, PHA avoids computing cluster similarities before each merging step in the agglomerative clustering, so it can run much faster than the other methods while still producing high quality results.

Although the PHA method is very effective for most of the data sets studied in our experiments, it fails to produce satisfying results for the Yeast data set. We think this may indicate that the distance measures used in our experiments are not suitable in high-dimensional space. In future work we will further investigate how to improve the performance of the PHA method, and especially how to improve the method for dealing with high-dimensional data sets. Experiments show that with the Euclidean squared distance measure PHA usually performs better than with the Euclidean distance measure, and it would be helpful to investigate further how to define a better distance measure for the PHA method.

Acknowledgments

This work is supported by the National Science Foundation of China (Grant no. 61272213) and the Chinese Fundamental Research Funds for the Central Universities (lzujbky-2011-65). The authors wish to thank Katherine A. Heller from MIT for her great help with the BHC algorithm. The authors would also like to thank the anonymous reviewers for their valuable comments that helped us to improve the manuscript.

References

[1] M.G. Omran, A.P. Engelbrecht, A. Salman, An overview of clustering methods, Intelligent Data Analysis 11 (6) (2007) 583–605.


[2] R. Xu, D.I.I. Wunsch, Survey of clustering algorithms, IEEE Transactions on Neural Networks 16 (3) (2005) 645–678.
[3] H. Yu, M. Gerstein, Genomic analysis of the hierarchical structure of regulatory networks, Proceedings of the National Academy of Sciences of the USA 103 (40) (2006) 14724–14731.
[4] Y. Loewenstein, P. Elon, M. Fromer, M. Linial, Efficient algorithms for accurate hierarchical clustering of huge datasets: tackling the entire protein space, Bioinformatics 24 (2008) i41–i49.
[5] M. Balcan, P. Gupta, Robust hierarchical clustering, in: Proceedings of the 23rd Conference on Learning Theory (COLT), Haifa, Israel, June 27–29, 2010, pp. 282–294.
[6] K.A. Heller, Z. Ghahramani, Bayesian hierarchical clustering, in: Proceedings of the 22nd International Conference on Machine Learning, Bonn, Germany, August 7–11, 2005, vol. 22, pp. 297–304.
[7] Y.W. Teh, H. Daume III, D. Roy, Bayesian agglomerative clustering with coalescents, in: J.C. Platt, D. Koller, Y. Singer, S. Roweis (Eds.), Advances in Neural Information Processing Systems 20, MIT Press, Cambridge, MA, 2007, pp. 1473–1480.
[8] A.K. Jain, Data clustering: 50 years beyond K-means, Pattern Recognition Letters 31 (2010) 651–666.
[9] P. Hore, L.O. Hall, D.B. Goldgof, A scalable framework for cluster ensembles, Pattern Recognition 42 (5) (2009) 676–688.
[10] A. Mirzaei, M. Rahmati, A novel hierarchical clustering combination scheme based on fuzzy-similarity relations, IEEE Transactions on Fuzzy Systems 18 (1) (2010) 27–39.


[11] S. Shi, G. Yang, D. Wang, W. Zheng, Potential-based hierarchical clustering, in: Proceedings of the 16th International Conference on Pattern Recognition, Quebec, Canada, August 11–15, 2002, vol. 4, pp. 272–275.
[12] H. Yamachi, Y. Kambayashi, Y. Tsujimura, A clustering method based on potential field, in: Proceedings of the 10th Asia Pacific Industrial Engineering and Management System Conference (APIEMS), Kitakyushu, Japan, December 14–16, 2009, pp. 846–855.
[13] J. Li, H. Fu, Molecular dynamics-like data clustering approach, Pattern Recognition 44 (2011) 1721–1737.
[14] Y. Zhang, J. Skolnick, SPICKER: a clustering approach to identify near-native protein folds, Journal of Computational Chemistry 25 (6) (2004) 865–871.
[15] Y. Lu, Y. Wan, Clustering by sorting potential values (CSPV): a novel potential-based clustering method, Pattern Recognition 45 (9) (2012) 3512–3522.
[16] J. Kleinberg, An impossibility theorem for clustering, in: S. Becker, S. Thrun, K. Obermayer (Eds.), Advances in Neural Information Processing Systems (NIPS) 15, Vancouver, British Columbia, Canada, December 9–14, 2002, pp. 463–470.
[17] E. Parzen, On estimation of a probability density function and mode, Annals of Mathematical Statistics 33 (1962) 1065–1076.
[18] A. Frank, A. Asuncion, UCI Machine Learning Repository, http://archive.ics.uci.edu/ml, 2010.
[19] E.B. Fowlkes, C.L. Mallows, A method for comparing two hierarchical clusterings, Journal of the American Statistical Association 78 (1983) 553–569.

Yonggang Lu received both the B.S. and M.S. Degrees in Physics from Lanzhou University, Lanzhou, China in 1996 and 1999 respectively. Later he received the M.S. and Ph.D. Degrees in Computer Science from New Mexico State University, Las Cruces, NM, USA in 2004 and 2007 respectively. He finished some of the Ph.D. work at Los Alamos National Lab, NM, USA. He is now an associate professor in the School of Information Science and Engineering, Lanzhou University, Lanzhou, China. His main research interests include pattern recognition, image processing, neural networks, and bioinformatics.

Yi Wan received his B.S. Degree from Xi’an Jiaotong University, Xi’an, China in 1992. Then he received both the M.S. Degree in Mathematics and the M.E. Degree in Electrical Engineering from Michigan State University, Michigan, USA in 1997 and 1998 respectively. Later he received the Ph.D. Degree in Electrical Engineering from Rice University, Houston, TX, USA in 2002. He is now a professor in the School of Information Science and Engineering, Lanzhou University, Lanzhou, China. His main research interests include signal processing, pattern recognition, digital transmission, and embedded systems.