Pattern Recognition 39 (2006) 776 – 788 www.elsevier.com/locate/patcog
A partitional clustering algorithm validated by a clustering tendency index based on graph theory

Helena Brás Silva a,∗, Paula Brito b, Joaquim Pinto da Costa c

a Department of Mathematics, Polytechnic School of Engineering of Porto (ISEP), Portugal
b School of Economics/LIACC, University of Porto, Portugal
c Department of Applied Mathematics/FC & LIACC, University of Porto, Portugal
Received 5 November 2004; received in revised form 14 October 2005; accepted 14 October 2005
Abstract

Applying graph theory to clustering, we propose a partitional clustering method and a clustering tendency index. No initial assumptions about the data set are requested by the method. The number of clusters and the partition that best fits the data set are selected according to the optimal clustering tendency index value.
© 2005 Pattern Recognition Society. Published by Elsevier Ltd. All rights reserved.

Keywords: Unsupervised learning; Clustering algorithms; Clustering validity
∗ Corresponding author. Tel.: +351 228 340 500; fax: +351 228 32 1159.
E-mail addresses: [email protected] (H. Brás Silva), [email protected] (P. Brito), [email protected] (J. Pinto da Costa).

1. Introduction

Clustering is a method of data analysis which is used in many fields, such as pattern recognition (unsupervised learning), biological and ecological sciences (numerical taxonomy), social sciences (typology), graph theory (graph partitioning), psychology, etc. [1]. The main concern in the clustering process is partitioning a given data set into subsets, groups or structures, identifying clusters which reflect the organization of the data set. The clusters must be compact and well separated, presenting a higher degree of similarity between data points belonging to the same cluster than between data points belonging to different clusters. So, the topic of clustering addresses the problem of summarizing the relationships within a set of objects by representing them as a smaller number of clusters of objects [2].

The heart of clustering analysis is the selection of the clustering method. A method must be selected that is suitable for the kind of structure that is expected to be present in the data. This decision is important because different clustering
methods tend to find different types of cluster structures. In the literature, a wide variety of clustering algorithms have been proposed, which can be broadly classified into the following types: partitional [1,3], hierarchical [1,3] and density-based [4,5] clustering algorithms. Other clustering procedures, like fuzzy and conceptual clustering, are mentioned in Ref. [3].

The aim of partitional clustering algorithms is to directly decompose the data set into a set of disjoint clusters, obtaining a partition which should optimize a certain criterion. One of the most popular partitional algorithms is the k-means algorithm [6], which attempts to minimize the dissimilarity between each element and the center of its cluster. More recent partitional algorithms include CLARANS [7] and the k-prototype algorithm [8], which is an extension of the k-means algorithm for clustering categorical data. These algorithms depend on the ordering of the elements in the data set and require some initial assumptions, usually the number of clusters the user believes to exist in the data. Moreover, partitional algorithms are generally unable to handle isolated points and to discover clusters with non-convex shapes.

Another important issue in clustering, related to clustering validity, is the problem of choosing the right number of clusters and, given this number, selecting the partition that
best fits a data set. Addressing this problem may not be an easy task if no a priori information exists as to the expected number of clusters in the data. Even when we know the right number of clusters, due to an inappropriate choice of algorithm parameters or a wrong choice of the clustering algorithm itself, the generated partitions may not reflect the desired clustering of the data. Some authors have tried to overcome this problem; mention should be made of EjCluster [9] and AUTOCLASS [10,11]; other approaches may be found in Refs. [12–16].

There are many methods for clustering, but these methods are not universal. Due to the wide applicability of cluster analysis, some algorithms are more suitable for some types of data than others. No method is good for all types of data, nor are all methods equally applicable to all problems. Clustering is mostly an unsupervised procedure, where there is no a priori knowledge about the structure of the data set. Almost all clustering algorithms are strongly dependent on the features of the data set and on the values of the input parameters. Thus, the clustering scheme provided by any algorithm is based on certain assumptions and is probably not the "best" one for the data set. This is a particularly serious issue, since virtually any clustering algorithm will produce partitions for any data set, even random noise data which contain no cluster structure [17]. Further, classifications of the same data set obtained using different clustering criteria can differ markedly from one another [2]. So, clustering algorithms can provide misleading summaries of data, and attention has been devoted to investigating ways of guarding against reaching incorrect conclusions by validating the results of a cluster analysis [2]. Therefore, in most applications, the resulting clustering scheme requires some sort of evaluation as regards its validity. Evaluating and assessing the results of a clustering algorithm is the main subject of cluster validity.

Clustering validation may be accomplished at three levels. First, we must check whether the data set possesses a clustering structure. If this is the case, then one may proceed by applying a clustering algorithm; otherwise, cluster analysis is likely to lead to misleading results. The problem of determining the presence or the absence of a clustering structure is called clustering tendency [1]. The assessment of the clustering process follows by selecting a "good" clustering algorithm. For example, Fisher and Van Ness [18–20] presented a list of properties, called admissible conditions, which one might expect clustering procedures to possess, and stated whether or not these properties were possessed by each of several standard clustering criteria. From background information about the data, the method indicates which clustering criteria could be relevant for the analysis of a particular data set [2].

Due to the lack of precise mathematical formulations for the different concepts in clustering analysis, a formal study of methodologies in this field has not been accomplished. Graph theory can be a valuable tool to develop models of abstraction for clustering, providing the required mathematical
formalism. The basic concepts of graph theory can also be used to develop clustering algorithms and validity indices. In Ref. [21], graphs provide structural models for cluster analysis. In Ref. [22] a clustering algorithm based on an optimal coloring assignment to the vertices of the graph defined on the data set has been proposed; the authors proved that the partition provided by the coloring algorithm which obtains the minimum number of colors is the one of minimum diameter. More recent work is mentioned in Ref. [1], where some clustering algorithms are proposed based on minimum spanning trees or on directed trees.

Our work tries to address some important issues of clustering processes: the determination of the number of clusters in the data set, the robustness as concerns isolated points, the detection of clusters of non-convex shapes and of data sets without a cluster structure, and the assessment of the quality of the clustering results. Applying graph theory to clustering, we propose a partitional clustering method and a clustering tendency index. The number of clusters and the partition that best fits the data set are selected according to the optimal clustering tendency index value.

The remainder of the paper is organized as follows. Section 2 starts with a brief description of some graph theory concepts required for a good understanding of the rest of the paper, followed by the description of the proposed method, which consists of a partitional clustering algorithm based on graph coloring. Section 3 introduces a clustering tendency index based on k-partite graphs. In Section 4 the performance of our approach is studied and compared with some known clustering algorithms. Section 5 concludes the paper.
2. A partitional clustering algorithm based on graph theory

Applying graph theory to clustering, we propose a partitional clustering method and a clustering tendency index. No initial assumptions about the data set are requested by the method. The partitional algorithm is based on graph coloring and uses an extended greedy algorithm. The number of clusters and the partition that best fits the data set are selected according to the optimal clustering tendency index value. The key idea of this index is that there are k well-separated and compact clusters if a complete k-partite graph can be defined on the data set after clustering. The clustering tendency index can also identify data sets with no clustering structure.

2.1. Some concepts of graph theory

Let V(G) = {v_1, v_2, ..., v_n} be the vertex set and

E(G) = {e_1, ..., e_m} = {e_l = {v_i, v_j} | v_i, v_j ∈ V, v_i ≠ v_j, l ∈ {1, ..., m}}
be the edge set of a graph G(V, E). Let v_i, v_j ∈ V be two different vertices of the graph G; if ∃ l ∈ {1, ..., m} : e_l = {v_i, v_j}, then v_i and v_j are adjacent vertices, while v_i and e_l are incident, as are v_j and e_l. For any graph G, we have m ≤ \binom{n}{2} and E ⊆ {{v_i, v_j}, ∀ i, j ∈ {1, 2, ..., n}, i ≠ j}. In case m = \binom{n}{2} and E ≡ {{v_i, v_j}, ∀ i, j ∈ {1, 2, ..., n}, i ≠ j}, every pair of vertices of the graph is adjacent and the graph is called a complete graph. The number of edges of G incident with a vertex v is called the degree of the vertex v and is represented by d_G(v), so

\[
d_G(v) = \mathrm{Card}\{u \in V : \{u, v\} \in E(G)\}, \quad v \in V(G). \tag{1}
\]

The degrees of the vertices of a graph are related to the number of edges in the graph by

\[
\sum_{v \in V(G)} d_G(v) = 2m. \tag{2}
\]
One of the most important and perhaps most studied problems in graph theory is the problem of coloring a graph (see, for instance, Ref. [23]). The coloring problem involves assigning colors to the vertices of a graph so that two adjacent vertices are assigned distinct colors, with the objective of minimizing the number of colors used. In a given coloring of a graph G, the set consisting of all those vertices assigned the same color is referred to as a color class. So, it is always possible to find a partition of the vertex set V where any pair of vertices belonging to the same part, or color class, are not adjacent. The color classes may also be designated as independent sets. The extreme case concerns a complete graph on n vertices; in this case n colors are required to color the graph, because any pair of vertices is adjacent. The minimum number k (k ≤ n) for which a graph G is k-colorable is called the chromatic number of G, and is denoted by χ(G). The graph G is k-partite, k ≥ 1, if it is possible to partition V(G) into k subsets of non-adjacent vertices; the graph is complete k-partite if every pair of vertices belonging to different sets is adjacent. In this way, a k-partite graph has k color classes or independent sets.

2.2. Clustering algorithm based on color classes

In this section we present a partitional clustering algorithm based on color classes. We use the greedy coloring algorithm, followed by an optimization step which reduces the number of colors and which is oriented towards finding homogeneous clusters.

2.2.1. Definition of the graph on the data set

Let X = {x_1, x_2, ..., x_n} be the data set of n elements to be clustered. We define the graph G(V, E) on the data set, assigning to each vertex of V an element of the data set. It is supposed that a dissimilarity measure d is defined on X:

\[
d : X \times X \to \mathbb{R}_0^{+}, \quad (x_i, x_j) \mapsto d(x_i, x_j). \tag{3}
\]
We shall denote by dist the dissimilarity between graph vertices corresponding to the data points: if u is the vertex associated with x_i and v is the vertex associated with x_j, then dist(u, v) := d(x_i, x_j). The edge set E is then defined by

\[
E = \{\{u, v\} : u, v \in V, \ u \neq v, \ \mathrm{dist}(u, v) > \sigma\}, \tag{4}
\]

where σ is an input parameter. So, there is an edge joining two different vertices of the graph if their dissimilarity is greater than the control parameter: two distinct vertices in G are adjacent if the dissimilarity between the corresponding elements of the data set is greater than the parameter σ.

2.2.2. Greedy coloring algorithm

In the literature, many heuristic algorithms have been proposed for graph coloring, but none can guarantee an optimal coloring assignment. The problem of finding a coloring using the minimum number of colors is known to be NP-hard, so no polynomial-time optimal coloring algorithm is known. The exponential execution time of any optimal coloring algorithm is not suitable for data sets with more than a few hundred elements. We must then select a coloring algorithm which produces a coloring assignment whose number of colors may exceed the chromatic number. One of the simplest and most widely used non-optimal coloring algorithms is the greedy algorithm. This algorithm starts by assigning the first color to the first vertex; each following vertex is colored with the first color allowed, that is, with the first color not yet assigned to any of its adjacent vertices.

Due to the non-optimal behaviour of the selected greedy algorithm, almost all coloring assignments require a number of colors higher than the minimum. In our work, the goal is to establish a correspondence between the color classes and the homogeneous clusters we are looking for in the data set, so an excessive number of colors may break up homogeneous clusters. In this context, mention should also be made of the work by Hansen and Delattre [22], where the authors proved that the color classes identified by an optimal coloring assignment correspond to a partition with minimal diameter. Nevertheless, if the criterion to be minimized is the diameter of the partition, the method is suitable to identify clusters with spherical shape but cannot detect elongated clusters. Since we wish to be able to detect clusters of non-convex shapes, the optimal coloring algorithm is not suitable for our purposes. The procedure to reduce the number of colors required in a coloring assignment should therefore be based on the density of the clusters and not on their diameter. For this reason, another algorithm was developed whose goal is to optimize the partition provided by the greedy algorithm. The aim of this optimization is to reduce the number of colors provided by the greedy algorithm, trying to identify the homogeneous clusters present in the data set.
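To make the construction concrete, the following Python sketch (not from the paper) builds the graph of Eq. (4) from a dissimilarity matrix D and applies first-fit greedy coloring in the natural vertex order; the function names and the vertex ordering are our own assumptions.

```python
import numpy as np

def build_graph(D, sigma):
    """Adjacency matrix of the graph of Eq. (4): an edge joins two
    distinct vertices whose dissimilarity is greater than sigma."""
    A = D > sigma
    np.fill_diagonal(A, False)
    return A

def greedy_coloring(A):
    """First-fit greedy coloring: each vertex receives the first color
    not yet used by its already colored neighbors."""
    n = A.shape[0]
    colors = np.full(n, -1, dtype=int)
    for v in range(n):
        used = {int(colors[u]) for u in range(n) if A[v, u] and colors[u] >= 0}
        c = 0
        while c in used:
            c += 1
        colors[v] = c
    return colors  # colors[v] = color class (tentative cluster) of vertex v
```

With a data matrix X, the dissimilarity matrix D can be obtained, for instance, as squareform(pdist(X)) from SciPy.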
2.2.3. An optimized greedy coloring algorithm

The first part of our clustering method concerns the application of the greedy algorithm to the graph defined on the data set, obtaining k_1 color classes. The second part of the method consists of an optimization of the partition into k_1 color classes, resulting in k (k ≤ k_1) homogeneous classes, or clusters in the clustering context. This optimization approach is density-based, attempting to gather in the same cluster the neighboring elements of each cluster.

The optimization process proceeds as follows. For each element v belonging to the color class C_i with n_i elements, let ñ_j be the number of elements of class C_j that belong to the ball with center v and radius σ, for j = 1, ..., k_1, j ≠ i. The value of σ is the same as defined in Eq. (4). Let ñ_max be the maximum of the ñ_j. If ñ_max > n_i, then the element v, as well as all the remaining n_i − 1 elements of the class C_i, are transferred to the class to which the ñ_max elements belong. In the optimization process some color classes are "swallowed" by others, following a density criterion, so that some adjacent vertices are gathered in the same cluster. This optimization processes each vertex sequentially, without a specific order. Let us see in more detail how this process works (a code sketch of the merging step is given after Eq. (5)):

(1) Let C_1, C_2, ..., C_{k_1} be the k_1 color classes.
(2) ∀ i ∈ {1, ..., k_1} and ∀ v ∈ C_i, let B(v, σ) ⊂ V be the ball with center v and radius σ.
(3) Let Γ_j(v) = C_j ∩ B(v, σ) and ñ_j(v) = Card{Γ_j(v)}, for j = 1, ..., k_1 (j ≠ i); thus, Γ_j(v) contains the elements of class C_j whose dissimilarity to v is less than σ.
(4) Let C_m be the class with the greatest number of elements in B(v, σ), that is,

\[
\tilde{n}_m(v) = \max_{\substack{j = 1, \ldots, k_1 \\ j \neq i}} \tilde{n}_j(v).
\]

(5) If ñ_m(v) > n_i, that is, if the element v is close to a number of elements of a different cluster greater than the cardinality of its own class, then the element v, as well as all the elements of the color class C_i, are transferred to class C_m: C_m ← C_m ∪ C_i. If there is more than one cluster with the same ñ_m(v), the elements are transferred to the class first found.

Using this approach, some adjacent vertices belonging to different color classes before optimization may become members of the same cluster after the optimization process. Nevertheless, in order to maintain the claimed property that only non-adjacent vertices belong to the same cluster, the edges between these vertices are removed from the edge set of the graph. Let C_i, i = 1, ..., k, be the resulting k clusters and let n_i be the number of elements of the cluster C_i, for i = 1, ..., k. In this case, we have

\[
v_j, v_{\ell} \in C_i \Rightarrow \{v_j, v_{\ell}\} \notin E(G). \tag{5}
\]
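The sketch below gives one possible reading of the merging step (1)–(5); the iteration order over the vertices and the tie-breaking rule (keeping the class with the smallest label among equal counts) are assumptions, since the text leaves them open.

```python
import numpy as np

def optimize_classes(D, sigma, colors):
    """Density-based reduction of the greedy color classes (steps (1)-(5)):
    if a vertex v of class C_i sees more than |C_i| elements of some other
    class inside the ball B(v, sigma), the whole class C_i is absorbed."""
    labels = colors.copy()
    for v in range(len(labels)):
        i = labels[v]
        n_i = int(np.sum(labels == i))
        in_ball = D[v] < sigma          # elements whose dissimilarity to v is < sigma
        in_ball[v] = False
        best_j, best_count = i, 0
        for j in np.unique(labels):
            if j == i:
                continue
            count = int(np.sum(in_ball & (labels == j)))
            if count > best_count:      # ties keep the first class found
                best_j, best_count = j, count
        if best_count > n_i:
            labels[labels == i] = best_j   # C_m <- C_m U C_i
    return labels
```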
However, and contrary to what happens with the color classes,

\[
v_j, v_{\ell} \in C_i \nRightarrow \mathrm{dist}(v_j, v_{\ell}) < \sigma, \tag{6}
\]

where σ is the input parameter used in Eq. (4). So our approach suitably handles clusters with non-convex shapes.

The optimization approach applied to the color classes does not try to minimize the number of colors, as is usually done in graph theory. The aim here is to get a better performance in identifying homogeneous clusters, so an optimal solution may correspond to a number of colors (clusters) greater than the chromatic number. To conclude, we can say that the greedy algorithm is carried out in the context of graph theory, while the optimization is carried out in the context of clustering analysis.

Applying the optimized greedy coloring algorithm to the graph representing the data set, k clusters are obtained which have the property that adjacent vertices do not belong to the same cluster. So, we can define a k-partite graph on the data set depending on the control parameter σ. This graph has k sets of non-adjacent vertices, and it is complete if all vertices belonging to different subsets are adjacent.

If Δ is the maximum degree of the vertices of the graph, then the assignment of a color to a vertex is done after at most Δ comparisons, so the complexity of our non-optimized method is O(Δn). The optimized version has a complexity of O(n²).

3. Clustering tendency index, IC

For validating the partition provided by the algorithm, a clustering tendency index on a data set is defined next. Associated to each value of the control parameter σ, a graph can be defined on the data set. As a result of the optimized greedy coloring algorithm applied to the graph, we obtain a k-partite graph, where vertices belonging to different sets may or may not be adjacent. The index we propose in this section identifies the partition that best fits the cluster structure of the data set as the one corresponding to a complete k-partite graph.

According to the definition of the graph on the data set, two different vertices are adjacent in the graph G if the dissimilarity between them is greater than the value of σ. Thus, the identification of a complete k-partite graph on the data set, after the application of the algorithm, implies the identification of clusters of non-adjacent vertices. The existence of a complete k-partite graph depends on the coloring algorithm, on the ordering of the elements and on the structure of the data set. Therefore, we do not always obtain, as the result of the coloring of the graph, a complete k-partite graph.

The proposed index aims at measuring the clustering tendency of a data set, using complete k-partite graphs. Since it is always possible to get a graph coloring, there always exists a k-partite graph defined on the data set. The clustering tendency index counts the number of edges missing for the
k-partite graph to be complete. The value of the clustering index is obtained by summing, for each vertex, the difference between its maximum degree and its effective degree.

Let C_1, ..., C_k be the k clusters, with n_1, ..., n_k elements, respectively, obtained by the optimized greedy coloring algorithm and characterizing the k-partite graph on the data set.

Definition 3.1. The value of the clustering tendency index (IC) is defined as

\[
IC := \frac{1}{2} \sum_{i=1}^{k} \sum_{v \in C_i} \left[ d_G^{\max}(v) - d_G(v) \right], \tag{7}
\]

where d_G^{max}(v) is the maximum degree of the vertex v and d_G(v) is the degree of the vertex v.
Varying the values of the control parameter, the clustering tendency index identifies the best partition as the one corresponding to a graph with the minimum number of edges missing to have a complete k-partite graph. For each fixed value of the control parameter, the method determines a partition and the corresponding value of IC. We select the partition that best fits the data set as the one corresponding to an optimal value of the index.

Using the property which establishes that in a k-partite graph the maximum degree of a vertex belonging to a set with n_i elements is n − n_i, and Eq. (2), we obtain

\[
\begin{aligned}
IC &= \frac{1}{2} \sum_{i=1}^{k} \sum_{v \in C_i} \left[ d_G^{\max}(v) - d_G(v) \right]
   = \frac{1}{2} \left( \sum_{i=1}^{k} \sum_{v \in C_i} (n - n_i) - \sum_{i=1}^{k} \sum_{v \in C_i} d_G(v) \right) \\
  &= \frac{1}{2} \left( \sum_{i=1}^{k} n_i (n - n_i) - 2m \right)
   = \frac{1}{2} \left( n \sum_{i=1}^{k} n_i - \sum_{i=1}^{k} n_i^2 - 2m \right) \\
  &= \frac{1}{2} \left( n^2 - \sum_{i=1}^{k} n_i^2 - 2m \right)
   = \frac{1}{2} \left( n^2 - \sum_{i=1}^{k} n_i^2 \right) - m,
\end{aligned} \tag{8}
\]

where m is the number of edges in the graph. The index is normalized by the maximum number of edges in the complete k-partite graph, that is,

\[
E_{\max} = \frac{1}{2} \sum_{i=1}^{k} n_i (n - n_i) = \frac{1}{2} \left( n^2 - \sum_{i=1}^{k} n_i^2 \right), \tag{9}
\]
which corresponds to half the sum of the maximum degrees of all the vertices of the k-partite graph. So, the value of the
normalized clustering tendency index is

\[
\widetilde{IC} := \frac{IC}{E_{\max}} = 1 - \frac{m}{E_{\max}}. \tag{10}
\]
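A short sketch of the index computation, assuming the adjacency matrix A of the σ-graph and the cluster labels produced by the optimized coloring; the edges counted as m are those joining different clusters, since the intra-cluster edges are removed by the optimization step, and the value −1 returned for the degenerate one-cluster case follows the convention explained below.

```python
import numpy as np

def clustering_tendency_index(A, labels):
    """Normalized clustering tendency index of Eq. (10): 1 - m / E_max."""
    n = A.shape[0]
    labels = np.asarray(labels)
    different = labels[:, None] != labels[None, :]
    m = int(np.triu(A & different, 1).sum())        # edges of the k-partite graph
    sizes = np.array([np.sum(labels == c) for c in np.unique(labels)])
    e_max = 0.5 * (n * n - np.sum(sizes ** 2))      # Eq. (9)
    if e_max == 0:                                  # single cluster: index undefined
        return -1.0
    return 1.0 - m / e_max                          # Eq. (10)
```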
In the extreme case where all vertices are at a dissimilarity less than σ from each other, there are no adjacent vertices and the graph does not have any edge. In this case, all vertices may have the same color and the data set has only one cluster. Thus, E_max = 0, because the maximum degree of every vertex is zero; the value of the clustering index is then undetermined and the value −1 is associated to the index. On the other hand, if all vertices are at a dissimilarity greater than σ from each other, all the vertices are adjacent and n colors are required to obtain a coloring of the graph. So, there are n clusters with one element each, and in this case ĨC = 0. If the graph is not complete and ĨC = 0, then we have the ideal case where the data set has k compact and isolated clusters. Nevertheless, as we will see in Section 4, an optimal partition may correspond to a value of the clustering tendency index different from zero, but corresponding to a decrease of the value of the index compared to some previous values.

3.1. Control parameter σ

The efficiency of the method in identifying the cluster structure in the data set depends on the control parameter σ. The value of this parameter determines the number of edges in the graph and so defines the vertex adjacencies, which are crucial in the identification of the k-partite graph defined on the data set. Consequently, the number of clusters and the number of elements in each cluster also depend on the value of σ, and so does the value of the clustering tendency index. For this reason, for different values of the control parameter, the optimized greedy coloring algorithm identifies different structures in the data set. In the proposed method, this parameter is not to be provided by the user; rather, the method automatically assigns some significant values.

Let d_min denote the minimum dissimilarity and d_max the maximum dissimilarity between the elements of the data set. Values of the parameter close to d_min correspond to a high number of edges in the graph and, therefore, to a high number of clusters, with a maximum of k = n, where n is the number of elements in the data set. On the other hand, values close to d_max correspond to a low number of edges and, therefore, to a low number of clusters, with a minimum of k = 1 for σ = d_max. The value of the parameter is made to vary between these limits and, for each value, the clustering tendency index, whose goal is to identify the homogeneous clusters best reflecting the data set structure, is calculated.

Let l_min ≤ l_max; the interval [l_min, l_max] is divided into I parts, and the method is executed for each σ = l_min + ih, for i = 1, ..., I and h = (l_max − l_min)/I. The values of l_min, l_max and I are selected by the user. They must verify l_min ≥ d_min and l_max ≤ d_max. The default execution automatically assigns l_min = d_min, l_max = d_max and I = 50. Among the I partitions, the one that optimizes the value of the index ĨC is selected. Obviously, the larger the value of I, the smaller will be the step h, and a better solution may be found. However, since the number of different dissimilarity values is finite, the process will eventually stabilize. Beyond this point, the graphic of ĨC/k will present landings, since the same solution will be obtained for close values of σ.
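Combining the previous sketches, the sweep over σ could look as follows; the selection rule shown here (smallest ĨC among partitions with 1 < k < n) is a simplification of the authors' procedure, which inspects the whole ĨC/k graphic, and all helper names come from the earlier sketches.

```python
import numpy as np

def best_partition(D, I=50, l_min=None, l_max=None):
    """Run the method for I values of sigma in [l_min, l_max] and keep the
    partition with the smallest normalized index among non-trivial ones."""
    n = D.shape[0]
    off_diag = D[~np.eye(n, dtype=bool)]
    l_min = off_diag.min() if l_min is None else l_min
    l_max = off_diag.max() if l_max is None else l_max
    h = (l_max - l_min) / I
    best_labels, best_ic, best_sigma = None, np.inf, None
    for i in range(1, I + 1):
        sigma = l_min + i * h
        A = build_graph(D, sigma)
        labels = optimize_classes(D, sigma, greedy_coloring(A))
        ic = clustering_tendency_index(A, labels)
        k = len(np.unique(labels))
        if 1 < k < n and 0 <= ic < best_ic:
            best_labels, best_ic, best_sigma = labels, ic, sigma
    return best_labels, best_ic, best_sigma
```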
4. Application of the proposed method and comparison with other methods

In this section the optimized greedy coloring algorithm and the clustering tendency index are applied to several two-dimensional data sets and to the Iris data set, and the results are compared to those obtained by some hierarchical and partitional clustering algorithms.

4.1. Data set generation

The artificial data sets used in this section were generated as follows. The data set generator provides spherical or elliptical clusters with different point densities. Each cluster is defined by three circles or ellipses with the same center but with different radii or elliptical parameters. In each arc of circle or ellipse the points are assigned with different probabilities, some of which may be 0.

For the generation of a spherical cluster we must provide the parameters p, a, b, r_1, r_2, r_3, p_1, p_2, p_3, where p is the percentage of the total elements that must belong to this cluster and (a, b) is the center of the three circumferences, whose radii are r_1, r_2 and r_3. The value p_1 is the percentage of the p elements that belong to the inner circumference, p_2 is the percentage of the p elements that belong to the middle arc of circumference and p_3 is the percentage of the p elements that belong to the outer arc of circumference.

For the generation of an elliptical cluster we must provide the parameters p, a, b, a_1, b_1, a_2, b_2, a_3, b_3, p_1, p_2, p_3, where p is the percentage of the total elements that must belong to this cluster and (a, b) is the center of the three ellipses, whose parameters are (a_1, b_1), (a_2, b_2) and (a_3, b_3). The value p_1 is the percentage of the p elements that belong to the inner ellipse, p_2 is the percentage of the p elements that belong to the middle arc of ellipse and p_3 is the percentage of the p elements that belong to the outer arc of ellipse.
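As an illustration only, a rough sketch of the spherical-cluster generator described above, under the assumption that p1, p2 and p3 are the fractions of the cluster's points drawn uniformly from the inner disc and the two surrounding annuli; the paper does not give the exact sampling scheme, so this is an approximation.

```python
import numpy as np

def spherical_cluster(n_total, p, a, b, r1, r2, r3, p1, p2, p3, rng=None):
    """Generate about p% of n_total points around centre (a, b): p1% within
    radius r1, p2% in the annulus [r1, r2] and p3% in the annulus [r2, r3]."""
    rng = np.random.default_rng() if rng is None else rng
    n = round(n_total * p / 100)
    counts = [round(n * q / 100) for q in (p1, p2, p3)]
    rings = [(0.0, r1), (r1, r2), (r2, r3)]
    parts = []
    for c, (lo, hi) in zip(counts, rings):
        theta = rng.uniform(0.0, 2 * np.pi, c)
        radius = np.sqrt(rng.uniform(lo ** 2, hi ** 2, c))   # uniform over the ring area
        parts.append(np.column_stack([a + radius * np.cos(theta),
                                      b + radius * np.sin(theta)]))
    return np.vstack(parts)
```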
4.2. The corrected Rand index

For comparing the partitions provided by any of the clustering algorithms with the reference partition we will use the Rand index [24], which was later corrected for chance by
Hubert and Arabie [25]. The corrected Rand index (CRI) is equal to the number of similar assignments of point-pairs in the two partitions to be compared, normalized by the total number of point-pairs and corrected for chance. If U = {u_1, ..., u_R} is a partition obtained by a clustering algorithm and V = {v_1, ..., v_C} is the reference partition, the CRI between U and V is defined as

\[
\mathrm{CRI} = \frac{\displaystyle\sum_{i=1}^{R}\sum_{j=1}^{C}\binom{n_{ij}}{2} \;-\; \binom{n}{2}^{-1}\sum_{i=1}^{R}\binom{n_{i\cdot}}{2}\sum_{j=1}^{C}\binom{n_{\cdot j}}{2}}
{\dfrac{1}{2}\left[\displaystyle\sum_{i=1}^{R}\binom{n_{i\cdot}}{2} + \sum_{j=1}^{C}\binom{n_{\cdot j}}{2}\right] \;-\; \binom{n}{2}^{-1}\sum_{i=1}^{R}\binom{n_{i\cdot}}{2}\sum_{j=1}^{C}\binom{n_{\cdot j}}{2}},
\]

where n is the total number of objects, n_{ij} denotes the number of objects that are common to clusters u_i and v_j, and n_{i·} and n_{·j} refer, respectively, to the number of objects in clusters u_i and v_j. This index takes values in [−1, 1], where the value 1 indicates a perfect agreement between the partitions, whereas values close to 0 correspond to cluster agreement found by chance.
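For completeness, a small sketch computing the CRI from two label vectors via their contingency table; the function name is ours, and scikit-learn's adjusted_rand_score computes the same quantity.

```python
import numpy as np

def corrected_rand_index(u_labels, v_labels):
    """Hubert-Arabie corrected Rand index between two partitions."""
    u_labels, v_labels = np.asarray(u_labels), np.asarray(v_labels)
    _, u_idx = np.unique(u_labels, return_inverse=True)
    _, v_idx = np.unique(v_labels, return_inverse=True)
    table = np.zeros((u_idx.max() + 1, v_idx.max() + 1), dtype=np.int64)
    np.add.at(table, (u_idx, v_idx), 1)          # contingency table n_ij

    def comb2(x):                                # binomial coefficient C(x, 2)
        return x * (x - 1) / 2.0

    sum_ij = comb2(table).sum()
    sum_i = comb2(table.sum(axis=1)).sum()
    sum_j = comb2(table.sum(axis=0)).sum()
    expected = sum_i * sum_j / comb2(len(u_labels))
    max_index = 0.5 * (sum_i + sum_j)
    return (sum_ij - expected) / (max_index - expected)
```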
4.3. Results

The method was applied to seven artificial data sets with two numeric attributes and to the Iris data set. We used the Euclidean distance applied to the non-standardized data.

4.3.1. Data set with elongated clusters

The first data set has 2000 elements distributed over four clusters, represented in Fig. 2. Tables 1 and 2 indicate the data generator parameters for the three ellipses and for the circle of the data set.

Table 1
Data generator parameters for the ellipses of the data set represented in Fig. 2

p    a    b    a1   b1   a2   b2   a3   b3   p1   p2   p3
25   100  50   30   10   30   20   40   30   100  0    0
25   100  90   30   10   30   20   40   30   100  0    0
25   100  130  30   10   30   20   40   30   100  0    0

Table 2
Data generator parameters for the circle of the data set represented in Fig. 2

p    a    b    r1   r2   r3   p1   p2   p3
25   50   100  10   30   40   100  0    0

Fig. 1 represents the values of the index ĨC and the number of clusters, resulting from the application of the method to the data set for different values of the control parameter σ, with d_min ≤ σ ≤ d_max and I = 50. Analyzing the graphic, we conclude that the optimum value is ĨC = 0.005275, corresponding to a four-cluster partition, obtained for σ = 21.77236. Fig. 2 shows the distribution in four clusters obtained by the method. Due to the geometry of the data set we can conclude that, for this example, the method successfully identified clusters of elongated shapes that are not well separated.

Fig. 1. Graphic ĨC/k obtained for the data set with elongated clusters, with d_min ≤ σ ≤ d_max.
Fig. 2. Partition in four clusters obtained for the data set with elongated clusters.

We generated 100 data sets and their reference clusterings with the parameters indicated in Tables 1 and 2. The method was applied to each data set and the best clustering obtained was compared with the reference clustering using the CRI. The k-means and the average linkage hierarchical clustering methods were also applied to these data sets with k = 4. Table 3 presents the means and standard deviations of the corresponding CRI values.

Table 3
Means and standard deviations of the CRI values, for 100 runs of the first data set

                          Mean       Std. deviation
Proposed method           0.9889644  0.0378449
k-means/k = 4             0.9591565  0.0936204
Average linkage/k = 4     0.994814   0.025468

As can be seen in Table 3, the average linkage hierarchical clustering method has a better performance than the proposed method; nevertheless, we use the knowledge of the structure of the data set as the stopping rule for that hierarchical method, while the proposed method automatically indicates the best partition fitting the data set. The k-means method has the worst performance for this data set.

In order to check whether a better partition can be obtained, we ran the method again, changing the limits of the interval where σ takes values. So, we set σ_min ≤ σ ≤ σ_max, where σ_min = 11.38618 corresponds to a partition in eight clusters having ĨC = 0.008161 and σ_max = 32.15854 corresponds to a partition in three clusters having ĨC = 0.030176 (see Fig. 1). Fig. 3 represents the values of the index ĨC and the number of clusters, resulting from the application of the method to the data set for different values of the control parameter σ, with σ_min ≤ σ ≤ σ_max and I = 50. Analyzing the graphic, we conclude that the best partition was the same as the one obtained earlier.

Fig. 3. Graphic ĨC/k obtained for the data set with elongated clusters, with 11.38618 ≤ σ ≤ 32.15854 and I = 50.

If we set σ_min = 19.69512, corresponding to a partition in seven clusters having ĨC = 0.024584, and σ_max = 23.8496, corresponding to a partition in seven clusters having ĨC = 0.044741, and take I = 100, we get the graphic represented in Fig. 4. The optimum value is ĨC = 0.005094, corresponding to a four-cluster partition, obtained for σ = 21.48155. In this graphic, the landings mentioned in Section 3.1 are evident, since we get the same values of ĨC for close values of σ.

Fig. 4. Graphic ĨC/k obtained for the data set with elongated clusters, with 19.69512 ≤ σ ≤ 23.8496 and I = 100.

4.3.2. Data set with isolated points

The second data set has 1000 elements, 900 of which are distributed over five compact and isolated clusters, whereas the other 100 elements are uniformly distributed in the window where the data set fits. Table 4 gives the data generator parameters for the five clusters of this data set, represented in Fig. 6.

Table 4
Data generator parameters for the five clusters of the data set represented in Fig. 6

p    a    b    r1   r2   r3   p1   p2   p3
20   50   50   20   20   20   100  0    0
20   200  50   20   20   20   100  0    0
20   50   200  20   20   20   100  0    0
20   125  125  20   20   20   100  0    0
20   200  200  20   20   20   100  0    0

Fig. 5 represents the values of the index ĨC and the number of clusters, resulting from the application of the method to the data set for different values of the control parameter σ, with d_min ≤ σ ≤ d_max and I = 50. Analyzing the graphic, we conclude that the optimum value is ĨC = 0.000044, corresponding to a partition in 41 clusters, obtained for σ = 23.529103. As can be seen in Fig. 6, which shows the partition in 41 clusters obtained by the method, the isolated points did not mask the distribution of the elements over the five clusters. In this partition, 929 elements are distributed over the five clusters, whereas the isolated points are distributed over one cluster with five elements, four clusters with four elements, four clusters with three elements, 10 clusters with two elements each and 18 clusters with one element. Due to the geometry of the data set, we can conclude that for this example the method was robust as concerns isolated points.

Fig. 5. Graphic ĨC/k obtained for the data set with isolated points.
Fig. 6. Clustering in 41 clusters obtained for the data set with isolated points.

We generated 100 data sets and their reference clusterings with the parameters indicated in Table 4. The reference clusterings have six clusters each, where the sixth cluster contains all the isolated points, which have a uniform distribution. The method was applied to each data set and the best clustering was compared with the reference clustering using the CRI. The k-means and the average linkage hierarchical clustering methods were also applied to this data set with k = 5. Table 5 presents the means and standard deviations of the corresponding CRI values.

Table 5
Means and standard deviations of the CRI values, for 100 runs of the second data set

                          Mean       Std. deviation
Proposed method           0.9218176  0.077219
k-means/k = 5             0.774692   0.127382
Average linkage/k = 5     0.875591   0.06994

As can be seen in Table 5, the proposed method is the most robust as concerns isolated points, compared to the average linkage hierarchical clustering and the k-means methods.
4.3.3. Data set with non-convex shape clusters

The third example consists of a data set with 500 elements, distributed over two compact and isolated clusters with a non-convex shape, as represented in Fig. 8. Table 6 indicates the data generator parameters for the two clusters of this data set.

Table 6
Data generator parameters for the clusters of the data set represented in Fig. 8

p    a    b    r1   r2   r3   p1   p2   p3
50   100  110  80   100  100  0    100  0
50   190  90   80   100  100  0    100  0

Fig. 7 represents the values of the index ĨC and the number of clusters, resulting from the application of the method to the data set for different values of the control parameter σ, with d_min ≤ σ ≤ d_max and I = 50. We got the partition in two clusters that best fits the data set for σ = 58.937767 and ĨC = 0.000048, as can be seen in Fig. 8. Due to the geometry of the data set we can conclude that, for this example, the method successfully identified clusters of non-convex shapes.

Fig. 7. Graphic ĨC/k obtained for the data set with non-convex shape clusters.
Fig. 8. Partition in two clusters obtained for the data set with non-convex shape clusters.

We generated 100 data sets and their reference clusterings with the parameters indicated in Table 6. The method was applied to each data set and the best clustering was compared with the reference clustering using the CRI. The k-means, the average linkage hierarchical clustering and the single linkage hierarchical clustering methods were also applied to this data set with k = 2. Table 7 presents the means and standard deviations of the corresponding CRI values.

Table 7
Means and standard deviations of the CRI values, for 100 runs of the third data set

                          Mean       Std. deviation
Proposed method           0.9424231  0.0892129
k-means/k = 2             0.5247069  0.0012622
Average linkage/k = 2     0.854804   0.116936
Single linkage/k = 2      1          0

As can be seen in Table 7, and as should be expected, the single linkage hierarchical clustering method has an optimal performance, with 100% success in identifying the two clusters. Nevertheless, the average linkage hierarchical clustering method has a worse performance than the proposed method. If the geometry of the data set were not known, we could not choose between the two hierarchical methods.

4.3.4. Data set without a cluster structure

Fig. 9 shows the graphic ĨC/k resulting from the application of the method, for different values of the control parameter σ with d_min ≤ σ ≤ d_max and I = 50, to a data set of 1000 elements with a uniform distribution on a unit window. As can be seen in this graphic, the values of ĨC for the different values of σ do not allow us to identify an optimum value. This fact indicates the absence of a clustering structure in the data set. This example shows that the proposed method successfully addressed the problem of detecting data sets with no clustering structure. We generated 100 data sets with a uniform distribution, and each of the corresponding ĨC graphics is represented in Fig. 10. The mean value of the maximum differences between ĨC values in any two curves is 0.060758.

Fig. 9. Graphic ĨC/k obtained for the data set without a cluster structure.
Fig. 10. Graphic ĨC obtained for 100 runs of the data set without a cluster structure.
4.3.5. Data set with 100 clusters

The fifth example consists of a data set with 3000 elements, distributed over 100 compact and isolated clusters, as represented in Fig. 11. Table 8 indicates the data generator parameters for the clusters of this data set, where i = 20, 40, 60, 80, 100, 120, 140, 160, 180, 200 and, for each i, j = 20, 40, 60, 80, 100, 120, 140, 160, 180, 200.

Table 8
Data generator parameters for the clusters of the data set represented in Fig. 11

p    a    b    r1   r2   r3   p1   p2   p3
1    i    j    2    2    2    100  0    0

Fig. 11. Data set with 100 clusters.

Fig. 12 represents the values of the index ĨC and the number of clusters, resulting from the application of the method to the data set for different values of the control parameter σ, with d_min ≤ σ ≤ d_max and I = 50. We got the partition in 100 clusters that best fits the data set for σ = 11.283798 and ĨC = 0.

Fig. 12. Graphic ĨC/k obtained for the data set with 100 clusters.

We generated 100 data sets and their reference clusterings with the parameters indicated in Table 8. The method was applied to each data set and the best clustering was compared with the reference clustering using the CRI. The k-means and the average linkage hierarchical clustering methods were also applied to this data set with k = 100. Table 9 presents the means and standard deviations of the corresponding CRI values.

Table 9
Means and standard deviations of the CRI values, for 100 runs of the fifth data set

                           Mean       Std. deviation
Proposed method            1          0
k-means/k = 100            0.1244322  0.2711304
Average linkage/k = 100    1          0

As can be seen from Table 9, the proposed method correctly identified the 100 clusters in all runs, as did the average linkage hierarchical clustering method. The k-means method obtained a bad result because in many clusterings all the elements were gathered in only one cluster.

4.3.6. Data set with clusters of different sizes and densities

The sixth example consists of a data set with 1000 elements, distributed over three clusters with different sizes and densities, as represented in Fig. 13. Table 10 indicates the data generator parameters for the three clusters of this data set. Fig. 14 represents the values of the index ĨC and the number of clusters, resulting from the application of the method to the data set for different values of the control parameter σ, with d_min ≤ σ ≤ d_max and I = 50. We got the partition in three clusters that best fits the data set for σ = 37.46606 and ĨC = 0.000294.
Fig. 13. Data set with three clusters with different sizes and densities.

Table 10
Data generator parameters for the clusters of the data set represented in Fig. 13

p    a    b    r1   r2   r3   p1   p2   p3
40   100  100  60   60   60   100  0    0
15   100  210  15   15   15   100  0    0
45   175  170  8    8    8    100  0    0

Fig. 14. Graphic ĨC/k obtained for the data set with three clusters with different sizes and densities.

We generated 100 data sets and their reference clusterings with the parameters indicated in Table 10. The method was applied to each data set and the best clustering was compared with the reference clustering using the CRI. The k-means and the average linkage hierarchical clustering methods were also applied to this data set with k = 3. Table 11 presents the means and standard deviations of the corresponding CRI values.

Table 11
Means and standard deviations of the CRI values, for 100 runs of the sixth data set

                          Mean       Std. deviation
Proposed method           0.944124   0.116789
k-means/k = 3             0.865926   0.104377
Average linkage/k = 3     0.997095   0.020332

As can be seen in Table 11, the average linkage hierarchical clustering method has a better performance than the proposed method; nevertheless, we use the knowledge of the structure of the data set as the stopping rule for that hierarchical method, while the proposed method automatically indicates the best partition fitting the data set. The k-means method has the worst performance for this data set.

4.3.7. Data set with non-convex and compact clusters

The seventh example consists of a data set with 1000 elements, distributed over one non-convex cluster and one compact cluster, as represented in Fig. 15. Table 12 indicates the data generator parameters for the two clusters of this data set.

Fig. 15. Data set with two clusters of the seventh data set.

Table 12
Data generator parameters for the clusters of the data set represented in Fig. 15

p    a    b    r1   r2   r3   p1   p2   p3
40   160  80   10   10   10   100  0    0
60   100  100  25   35   45   0    100  0

Fig. 16 represents the values of the index ĨC and the number of clusters, resulting from the application of the method to the data set for different values of the control parameter σ, with d_min ≤ σ ≤ d_max and I = 50. We got the partition in two clusters that best fits the data set for σ = 24.31177 and ĨC = 0.
Fig. 16. Graphic ĨC/k obtained for the data set with two clusters of the seventh data set.

Table 13
Means and standard deviations of the CRI values, for 100 runs of the seventh data set

                          Mean       Std. deviation
Proposed method           0.966338   0.067508
k-means/k = 2             0.819787   0.049893
Average linkage/k = 2     0.968013   0.099377
We generated 100 data sets and their reference clusterings with the parameters indicated in Table 12. The method was applied to each data set and the best clustering was compared with the reference clustering using the CRI. The k-means and the average linkage hierarchical clustering methods were also applied to this data set with k = 2. Table 13 presents the means and standard deviations of the corresponding CRI values.

As can be seen in Table 13, the average linkage hierarchical clustering method and the proposed method have similar performances; nevertheless, we use the knowledge of the structure of the data set as the stopping rule for that hierarchical method, while the proposed method automatically indicates the best partition fitting the data set. The k-means method has the worst performance for this data set.

4.3.8. The Iris data set

The Iris data set has 150 flowers, 50 of each of the three varieties of Iris: Setosa, Versicolor and Virginica, characterized by four numerical attributes which describe the width and the length of the petals and sepals of each type of flower. The flowers from x_1 to x_50 are of the type Iris Setosa, from x_51 to x_100 of the type Iris Versicolor and from x_101 to x_150 of the type Iris Virginica. The attributes are presented in the following order: (1) length of the sepal, (2) width of the sepal, (3) length of the petal and (4) width of the petal, where all the quantities are given in
centimeters. The reference clustering of this data set has one well-separated cluster (Iris Setosa), whereas the other two are not totally separated.

Fig. 17 represents the values of the index ĨC and the number of clusters, resulting from the application of the method to the data set for different values of the control parameter σ, with d_min ≤ σ ≤ d_max and I = 50. We got the partition with two clusters, for σ = 1.217631 and ĨC = 0. This partition has the Iris Setosa flowers in one cluster, isolated from the other two types of flowers, which are gathered in one cluster. This partition has a CRI value of 0.568 with respect to the reference clustering in three clusters.

Fig. 17. Graphic ĨC/k obtained for the Iris data set.
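Purely as an illustration of how the pieces above fit together (none of this is the authors' code), the method could be run on the Iris data roughly as follows, assuming SciPy and scikit-learn are available and reusing the hypothetical helpers sketched earlier.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.datasets import load_iris

X, _ = load_iris(return_X_y=True)
D = squareform(pdist(X, metric="euclidean"))   # non-standardized data, as in the paper

labels, ic_value, sigma = best_partition(D, I=50)
print("selected sigma:", sigma)
print("normalized index:", ic_value, "- number of clusters:", len(np.unique(labels)))
```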
5. Concluding remarks

In this paper, we proposed a partitional clustering method and a clustering tendency index based on graph theory. No initial assumptions about the data set are requested by the method. The number of clusters and the partition that best fits the data set are selected according to the optimal clustering tendency index value.

The proposed methodology has been applied to simulated data sets in order to evaluate its performance. This study has shown that the method is efficient, in particular in detecting non-spherical clusters, is robust as concerns isolated points, and could successfully detect data sets with no clustering structure. The methodology has been compared with some classical clustering methods, namely k-means and hierarchical clustering. The proposed method presented a better performance than the methods it was compared with.
Acknowledgements

The authors thank the referee for the remarks and suggestions, which helped to improve the paper.
References

[1] S. Theodoridis, K. Koutroumbas, Pattern Recognition, Academic Press, New York, 1999.
[2] A.D. Gordon, Clustering validation, in: C. Hayashi et al. (Eds.), Data Science, Classification, and Related Methods, World Scientific Publishers, River Edge, 1998, pp. 22–39.
[3] A.D. Gordon, Classification, second ed., Chapman & Hall/CRC, London, 1999.
[4] M. Ester, H.-P. Kriegel, J. Sander, X. Xu, A density-based algorithm for discovering clusters in large spatial databases with noise, in: Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, Portland, 1996, pp. 226–231.
[5] M. Ester, H.-P. Kriegel, J. Sander, M. Wimmer, X. Xu, Incremental clustering for mining in a data warehousing environment, in: Proceedings of the 24th VLDB Conference, New York, USA, 1998, pp. 323–333.
[6] J. MacQueen, Some methods for classification and analysis of multivariate observations, in: L.M. Le Cam, J. Neyman (Eds.), Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, University of California Press, Berkeley, CA, 1967, pp. 281–297.
[7] R.T. Ng, J. Han, Efficient and effective clustering methods for spatial data mining, in: J. Bocca, M. Jarke, C. Zaniolo (Eds.), Proceedings of the 20th International Conference on Very Large Data Bases, Morgan Kaufmann Publishers, Los Altos, CA, 1994, pp. 144–155.
[8] Z. Huang, Extensions to the k-means algorithm for clustering large data sets with categorical values, Data Mining and Knowledge Discovery 2 (3) (1998) 283–304.
[9] J.A. García, J. Fdez-Valdivia, F.J. Cortijo, R. Molina, A dynamic approach for clustering data, Signal Process. 44 (2) (1994) 181–196.
[10] P. Cheeseman, J. Kelly, M. Self, J. Stutz, AutoClass: a Bayesian classification system, in: Proceedings of the Fifth International Conference on Machine Learning, Morgan Kaufmann, Los Altos, CA, 1988, pp. 54–64.
[11] P. Cheeseman, J. Stutz, Bayesian classification (AutoClass): theory and results, in: U.M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, R. Uthurusamy (Eds.), Advances in Knowledge Discovery and Data Mining, The AAAI Press, New York, 1995, pp. 153–180.
[12] G. Milligan, M. Cooper, An examination of procedures for determining the number of clusters in a data set, Psychometrika 50 (2) (1985) 159–179.
[13] I. Lerman, J. Costa, H.B. Silva, Validation of very large data sets clustering by means of a nonparametric linear criterion, in: K. Jajuga, A. Sokolowski, H.H. Bock (Eds.), Classification, Clustering and Data Analysis, Recent Advances and Applications, Springer, Berlin, Heidelberg, 2002, pp. 147–157.
[14] G. Soromenho, Avaliação do Número de Componentes de uma Mistura, Aplicações em Classificação, Ph.D. Dissertation, Universidade Nova de Lisboa, 1993.
[15] C. Fraley, A.E. Raftery, Model-based clustering, discriminant analysis, and density estimation, Technical Report number 380, Department of Statistics, University of Washington, 2000.
[16] C. Fraley, Algorithms for model-based Gaussian hierarchical clustering, Technical Report number 311, Department of Statistics, University of Washington, 1996.
[17] G. Milligan, An algorithm for generating artificial test clusters, Psychometrika 50 (1) (1985) 123–127.
[18] L. Fisher, J. Van Ness, Admissible discriminant analysis, Programs in Mathematical Sciences, Technical Report number 5, University of Texas at Dallas, 1971.
[19] L. Fisher, J. Van Ness, Admissible clustering procedures, Biometrika 58 (1971) 91–104.
[20] J. Van Ness, Recent results in clustering admissibility, in: H. Bacelar-Nicolau, F. Costa Nicolau, J. Janssen (Eds.), Applied Stochastic Models and Data Analysis, Instituto Nacional de Estatística, Lisboa, Portugal, 1999, pp. 19–29.
[21] E. Godehardt, Graphs as Structural Models, Friedr. Vieweg & Sohn, New York, 1988.
[22] P. Hansen, M. Delattre, Complete-link cluster analysis by graph coloring, J. Am. Stat. Assoc. 73 (362) (1978) 397–403.
[23] G. Chartrand, L. Lesniak, Graphs & Digraphs, third ed., Chapman & Hall/CRC, London, 1996.
[24] W. Rand, Objective criteria for the evaluation of clustering methods, J. Am. Stat. Assoc. 66 (336) (1971) 846–850.
[25] L. Hubert, P. Arabie, Comparing partitions, J. Classification 2 (1985) 193–218.
About the Author—HELENA BRÁS SILVA received her first degree in Applied Mathematics/Computer Science from Porto University, Portugal, her M.Sc. degree in Electronic and Computer Engineering from Porto University, and her Ph.D. degree in Applied Mathematics from Porto University, Portugal. From 1992 to 1997 she was a researcher at the Institute of Systems Engineering and Computers in Porto (INESC). Since 1997 she has been an assistant at the Department of Mathematics, Polytechnic School of Engineering of Porto (ISEP), Portugal. Her research interests include Clustering and Graph Theory.

About the Author—PAULA BRITO is an Associate Professor at the School of Economics, and a member of the Artificial Intelligence and Data Analysis Group of the University of Porto. She holds a doctorate degree in Applied Mathematics from the University of Paris-IX Dauphine. Her current research interests include data analysis methods, with particular incidence in clustering methods, and the analysis of multidimensional complex data, known as symbolic data.

About the Author—JOAQUIM PINTO DA COSTA received his first degree in Applied Mathematics from Porto University, Porto, Portugal, his M.Sc. degree in Applied Statistics from Oxford University, Oxford, UK, and his Ph.D. degree in Applied Mathematics from the University of Rennes II, Rennes, France, in 1986, 1988 and 1996, respectively. Since October 1996 he has been Assistant Professor in the Applied Mathematics Department of Porto University, Porto, Portugal. His research interests include Statistical Learning Theory, Pattern Recognition, Discriminant Analysis and Clustering, Data Analysis, Neural Networks, SVMs and Machine Learning.