Towards understanding hierarchical clustering: A data distribution perspective


Junjie Wu (a), Hui Xiong (b), Jian Chen (c)

(a) Information Systems Department, School of Economics and Management, Beihang University, Beijing 100083, China
(b) Management Science and Information Systems Department, Rutgers Business School, Rutgers University, Newark, NJ 07102, USA
(c) Research Center for Contemporary Management, Key Research Institute of Humanities and Social Sciences at Universities, School of Economics and Management, Tsinghua University, Beijing 100084, China

Corresponding author: Junjie Wu. Tel.: +86 10 86884215; fax: +86 10 62784555. E-mail address: [email protected] (J. Wu).
doi:10.1016/j.neucom.2008.12.011

Article history: Received 10 September 2008; received in revised form 21 December 2008; accepted 21 December 2008; available online 7 January 2009. Communicated by D. Tao.

Keywords: Hierarchical clustering; F-measure; Measure normalization; Unweighted pair group method with arithmetic mean (UPGMA); Coefficient of variation (CV)

Abstract

A very important category of clustering methods is hierarchical clustering. Considerable research effort has been focused on algorithm-level improvements of the hierarchical clustering process. In this paper, our goal is to provide a systematic understanding of hierarchical clustering from a data distribution perspective. Specifically, we investigate how the ‘‘true’’ cluster distribution can affect the clustering performance, and what the relationship is between hierarchical clustering schemes and validation measures with respect to different data distributions. To this end, we provide an organized study to illustrate these issues. Indeed, one of our key findings reveals that hierarchical clustering tends to produce clusters with high variation in cluster sizes regardless of the ‘‘true’’ cluster distribution. Also, our results show that the F-measure, an external clustering validation measure, is biased towards hierarchical clustering algorithms which tend to increase the variation in cluster sizes. In light of this, we propose F_norm, the normalized version of the F-measure, to solve the cluster validation problem for hierarchical clustering. Experimental results show that F_norm is indeed more suitable than the unnormalized F-measure for evaluating hierarchical clustering results across data sets with different data distributions. © 2009 Elsevier B.V. All rights reserved.

1. Introduction

Hierarchical clustering [14] provides insight into the data by assembling all the objects into a dendrogram, such that each sub-cluster is a node of the dendrogram, and the combinations of sub-clusters create a hierarchy: a structure that is more informative than the unstructured set of clusters produced by partitional clustering. The process of hierarchical clustering is either top-down or bottom-up. In a bottom-up fashion, the algorithms merge or agglomerate objects and sub-clusters into larger and larger clusters; this is known as agglomerative hierarchical clustering (AHC). In contrast, top-down schemes view the whole data set as one cluster and proceed by splitting clusters recursively until individual objects are reached. AHC is more widely used and is the focus of this paper.
Considerable research effort has been focused on algorithm-level improvements of the hierarchical clustering process [2]. Also, people have identified some characteristics of data that may strongly affect the performance of AHC, such as the size of the data, the level of noise in the data, high dimensionality,


types of attributes and data sets, and scales of attributes [31]. However, further investigation is needed to reveal whether and how the data distribution affects the performance of AHC.
In this paper, we provide a systematic understanding of hierarchical clustering from a data distribution perspective. Specifically, we study how the ‘‘true’’ cluster distribution^1 can affect the clustering performance, what the relationship is between hierarchical clustering algorithms and validation measures with respect to different cluster distributions, and how to normalize the F-measure so as to improve its cluster validation power. The analysis provided in this paper can guide us towards a better use of AHC. This is noteworthy since many underlying applications, e.g., the creation of a document taxonomy, require a hierarchy in the clustering result [11]. In addition, unlike partitional clustering, hierarchical clustering does not require a pre-specified number of clusters, which is desirable in many applications.

1 Since clustering is an unsupervised method, we typically do not have class labels for data objects. However, for the purpose of validation, people quite often assume that the class labels of all objects are available. In this case, we use a ‘‘true’’ cluster distribution to denote the real class distribution.


It is well known that variants of the AHC algorithm differ in how similarity is defined. Three main similarity measures are used in AHC: single-link, complete-link, and group-average. The focus of our study is on group-average, since group-average evaluates cluster quality by all the pairwise similarities between objects, which avoids the pitfalls of the single-link and complete-link criteria. In the clustering literature, the unweighted pair group method with arithmetic mean (UPGMA) [29] is a well-known version of the group-average schemes, because UPGMA is more robust than many other agglomerative clustering approaches. Therefore, we emphasize the analysis of UPGMA in this paper.
To this end, we first illustrate that UPGMA tends to generate clusters with high variation in the cluster sizes, no matter what the ‘‘true’’ cluster distribution is. In addition, we show that the F-measure, a widely used external cluster validation measure [19], is biased towards algorithms, such as single-link and UPGMA, which tend to increase the variation in the resultant cluster sizes. Finally, we propose F_norm, the normalized version of the F-measure, to correct the bias of the unnormalized F-measure.
In addition to the analysis, we have conducted extensive experiments on a number of real-world data sets from various application domains, such as document data sets, gene expression data sets, and UCI data sets. Indeed, our experimental results show that UPGMA tends to produce clusters in which the variation of the cluster sizes is consistently higher than the variation of the ‘‘true’’ cluster sizes. We call this the ‘‘dispersion effect’’ of UPGMA. This data variation is measured by the coefficient of variation (CV) [3]. The CV, described in more detail later, is a measure of dispersion of a data distribution. It is a dimensionless number that allows the comparison of the variations of populations that have significantly different mean values. In general, the larger the CV value, the greater the variability in the data. Therefore, by using CV, the variation of the resultant cluster sizes can be quantified into a specific range, say [1.0, 2.5]. As a result, for data sets with relatively low variation in the ‘‘true’’ cluster sizes, e.g., CV < 1.0, the clustering results by UPGMA are far away from the ‘‘true’’ cluster distributions. Also, we illustrate empirically that the F-measure values have an abnormally strong correlation with the distributions of the class sizes, and high F-measure scores are often incorrectly assigned to the poor clustering results of many highly imbalanced data sets. Nevertheless, all these problems can be greatly lessened by using the normalized version of the F-measure, which is also demonstrated by our experiments.
Finally, our experimental results also show that noise can intensify the dispersion effect of hierarchical clustering. To illustrate this, we have adapted HCleaner [35], a noise removal technique, to remove noise in the data. The experimental results show that this data cleaning process can indeed alleviate the dispersion effect of hierarchical clustering and improve the clustering performance of UPGMA.
The remainder of this paper is organized as follows. Section 2 presents an overview of the related work. In Section 3, we illustrate the effect of AHC on the distribution of the cluster sizes. Section 4 introduces an external clustering validation measure, the F-measure, and the problem of applying this measure for validating hierarchical clustering; F_norm, the normalized F-measure, is also proposed to handle this problem. Experimental results are given in Section 5. Finally, we draw conclusions in Section 6.

2. Related work

AHC has been investigated from various perspectives. Many data factors, which may strongly affect the performance of AHC, have been identified and addressed. In the following, we highlight some findings which are most closely related to the main theme of this paper.
First, one major concern of AHC is the scalability issue. Indeed, among various clustering methods, AHC is very expensive in terms of its computational and storage requirements. To address this issue, a variety of techniques have been proposed. CURE [6] used random sampling and a partitioning scheme to reduce the complexity of AHC. BIRCH [39] used a CF tree to summarize the data information required by the computation of Euclidean distances for AHC. A general discussion of the scalability issues of clustering methods was provided by Ghosh [5]. A broader discussion of specific techniques for clustering massive data sets can be found in the paper by Murtagh [23].
Second, noise in the data can greatly degrade the performance of AHC. To deal with this problem, one research direction is to incorporate data cleaning techniques before conducting AHC. For instance, Xiong et al. [35] developed HCleaner, a data cleaning method based on hyperclique patterns [36,37], to remove irrelevant or weakly relevant objects before data analysis. Another research direction is to handle noise during the clustering process. For example, Chameleon [17], BIRCH [39], and CURE [6] explicitly deal with noise during the clustering process.
Third, it has been well recognized that high dimensionality can have a negative impact on various clustering algorithms, such as AHC, which use Euclidean distance [31,12]. To meet this challenge, one research direction is to make use of dimensionality-reduction techniques, such as multi-dimensional scaling (MDS) [1], principal component analysis (PCA) [16,21,25], and singular value decomposition (SVD) [4]. A detailed discussion of various dimensionality-reduction techniques for document data sets has been provided by Tang et al. [32]. Another direction is to redefine the notion of proximity, e.g., by the cosine similarity [42] or the shared nearest neighbors (SNN) similarity [15].
Finally, there are still some other data factors which can affect AHC, such as the type of attributes or data sets, the location of data in a real database [28], etc. However, our focus in this paper is on understanding the impact of the ‘‘true’’ cluster sizes on the performance of AHC. In our previous study [38], we investigated the relationship between the data distribution and the performance of K-means, a prototype-based partitional clustering method [31]. Our findings revealed that, if there is a large variation in the ‘‘true’’ cluster sizes, K-means will produce a clustering result which is far away from the ‘‘true’’ cluster distribution. This understanding can guide us towards a better use of K-means. Similarly, the study in this paper will be valuable for the better use of AHC.

3. The effect of hierarchical clustering schemes on the distribution of the resultant cluster sizes

In general, there are two categories of hierarchical clustering schemes. The first category includes the agglomerative algorithms, in which objects are initially regarded as individual clusters and pairs of sub-clusters are repeatedly merged until the whole hierarchy is formed. The other category includes partitional algorithms, which can also obtain a hierarchy via a sequence of repeated partitions of the data. The key component in agglomerative algorithms is the similarity metric used to determine which pair of sub-clusters should be merged. In our study, we used the single-link, complete-link and UPGMA schemes, which have long been established.
The single-link scheme measures the distance of two clusters by the minimum pairwise distance between the two clusters. In contrast, the complete-link scheme uses the maximum pairwise distance


as the distance between two clusters. Finally, UPGMA can be viewed as the tradeoff between the above two schemes. Indeed, UPGMA is a simple bottom-up hierarchical clustering method that defines cluster similarity in terms of the average pairwise similarity between the objects in two different clusters. UPGMA is widely used because it is more robust than many other agglomerative clustering approaches. The mathematical descriptions of these three metrics are as follows:

\mathrm{dist}_{\text{single-link}}(C_i, C_j) = \min_{d_i \in C_i,\, d_j \in C_j} \{\mathrm{dist}(d_i, d_j)\},

\mathrm{dist}_{\text{complete-link}}(C_i, C_j) = \max_{d_i \in C_i,\, d_j \in C_j} \{\mathrm{dist}(d_i, d_j)\},

\mathrm{dist}_{\text{UPGMA}}(C_i, C_j) = \frac{1}{n_i n_j} \sum_{d_i \in C_i,\, d_j \in C_j} \{\mathrm{dist}(d_i, d_j)\},

where C_i denotes sub-cluster i, d_i is an object in C_i, n_i is the number of objects in C_i, and dist(d_i, d_j) is the distance function, such as the squared Euclidean distance, between two objects d_i and d_j.
Moreover, to facilitate our discussions, we provide a formal definition of the ‘‘dispersion effect’’ as follows.

Definition 1. If the distribution of the cluster sizes produced by a clustering method has more variation than the distribution of the ‘‘true’’ cluster sizes, we say that this clustering method shows the dispersion effect.
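To make the three metrics concrete, the following Python sketch (our own illustration; the two point sets are invented) computes them for a pair of sub-clusters using the squared Euclidean distance:

```python
import numpy as np
from scipy.spatial.distance import cdist

# Two small, invented sub-clusters C_i and C_j (rows are 2-D objects).
rng = np.random.default_rng(0)
C_i = rng.normal(loc=0.0, scale=1.0, size=(5, 2))
C_j = rng.normal(loc=4.0, scale=1.0, size=(8, 2))

# All n_i * n_j pairwise squared Euclidean distances between the two sub-clusters.
D = cdist(C_i, C_j, 'sqeuclidean')

dist_single   = D.min()    # single-link: minimum pairwise distance
dist_complete = D.max()    # complete-link: maximum pairwise distance
dist_upgma    = D.mean()   # UPGMA (group-average): mean of all pairwise distances

print(dist_single, dist_complete, dist_upgma)
```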

3.1. The opposite effects of single-link and complete-link on the resultant cluster sizes

In this subsection, we illustrate the opposite effects of single-link and complete-link on the distribution of the resultant cluster sizes. To facilitate the discussion, we first assume that there are no outliers and no noise in the data. We begin our discussion by analyzing the behavior of the large sub-clusters produced during the hierarchical clustering process.
Assume that we have three sub-clusters after l - 1 iterations of the merging operation in the hierarchical clustering process, as shown in Fig. 1. To simplify the discussion, we assume that the objects in clusters I and III are independently generated from two two-dimensional Gaussian distributions, and cluster II consists of only one object located at the position (0, 0). The covariance matrices and means of the two distributions are as follows:

\mathrm{COV}_I = \mathrm{COV}_{III} = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}, \qquad m_I = \begin{pmatrix} -4 \\ 0 \end{pmatrix}, \qquad m_{III} = \begin{pmatrix} 4 \\ 0 \end{pmatrix}.

It is easy to verify that the Euclidean distances between the centroids of clusters I and II and of clusters III and II are the same, i.e., \|m_I - m_{II}\| = \|m_{III} - m_{II}\| = 4. Now the question is, given that the sample size of cluster I is 500, much larger than


that of cluster III, i.e., 50, which cluster is more likely to merge with cluster II in the next iteration, cluster I or III?
To answer this question, we draw a circle A whose centroid is the object of cluster II and whose radius r can be arbitrarily small. Let the probability that ‘‘a point from cluster I falls into circle A’’ be p. Then the probability that ‘‘at least one object from cluster I falls into A’’ is P_1 = 1 - (1 - p)^{n_1}, where n_1 denotes the number of objects in cluster I. Similarly, we can compute the probability that ‘‘at least one object from cluster III falls into A’’: P_3 = 1 - (1 - p)^{n_3}. Since n_1 >> n_3, we have P_1 >> P_3. In other words, since r can be arbitrarily small, the object closest to cluster II is more likely to come from cluster I than from cluster III. Indeed, as Fig. 1 shows, the only object in A is indeed from cluster I, randomly generated by Matlab. Hence, for single-link, cluster I, with its larger size, has a higher probability than cluster III of absorbing cluster II.
While the real-world scenario can be much more complex, the simple case illustrated above still tells us something interesting; that is, for single-link, given that other conditions are the same, sub-clusters with relatively large sizes find it easier to absorb other sub-clusters during the hierarchical clustering process.
This analytical methodology can be adapted to the complete-link scheme. That is, we can draw an arbitrarily large circle B around cluster II, and then show that the object farthest from cluster II has a high probability of being from cluster I rather than cluster III. Therefore, there is an opposite effect between single-link and complete-link. Specifically, for the complete-link scheme, sub-clusters with larger sizes find it more difficult to merge with other sub-clusters during the hierarchical clustering process.
Discussion: It is natural to extend the above analysis to real-world data sets. Actually, no matter what the hierarchical clustering scheme is, say single-link or complete-link, nearby objects will merge with each other to form larger sub-clusters at the early stage of a hierarchical clustering process. At the later stage, if single-link is used, the relatively large sub-clusters have a higher probability to merge first. This process will continue until there exist only some small clusters or objects far away from the large sub-clusters. This is what we call the dispersion effect of single-link; that is, single-link tends to increase the dispersion degree of the cluster sizes by producing several large clusters and some small, even tiny, clusters in the result. However, if the complete-link scheme is used instead, the relatively small and compact sub-clusters will have the priority to merge first, but sub-clusters will stop growing at some point as their sizes become too large. As a result, complete-link tends to decrease the dispersion degree of the cluster sizes by constraining the growth of the large sub-clusters.
Finally, outliers and noise can also have an impact on single-link and complete-link. In fact, we can view noise as a special case of tiny and remote sub-clusters. Thus, for single-link, the dispersion effect can be intensified by outliers and noise. Also, for complete-link, due to the presence of outliers and noise, the large distances between large sub-clusters can be much smaller than the ones between any sub-clusters and outliers. Therefore, the large sub-clusters can have more chances to merge and continue to grow larger. For complete-link, this can also lead to a similar dispersion effect as for single-link.
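The probability argument above is easy to check numerically. The sketch below (a rough Monte Carlo illustration with our own seed and trial count, mirroring the setting of Fig. 1: 500 points around (-4, 0), 50 points around (4, 0), and cluster II as a single object at the origin) estimates how often the object nearest to cluster II comes from the larger cluster I:

```python
import numpy as np

rng = np.random.default_rng(42)
trials, hits = 2000, 0
for _ in range(trials):
    cluster_I   = rng.normal(loc=[-4.0, 0.0], scale=1.0, size=(500, 2))
    cluster_III = rng.normal(loc=[ 4.0, 0.0], scale=1.0, size=(50, 2))
    # Distance from the single object of cluster II (the origin) to its nearest
    # neighbour in each of the two large sub-clusters.
    d_I   = np.linalg.norm(cluster_I, axis=1).min()
    d_III = np.linalg.norm(cluster_III, axis=1).min()
    hits += d_I < d_III
# Fraction of trials in which single-link would first attach cluster II to cluster I.
print(hits / trials)
```

By the symmetry of the two Gaussians about the origin, the expected fraction is n_1 / (n_1 + n_3) = 500/550, roughly 0.91, well above one half.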


3.2. The effect of UPGMA on the resultant cluster sizes

Fig. 1. The behavior of the large sub-clusters during the hierarchical clustering.

Now, let us consider UPGMA, which is regarded as a tradeoff between single-link and complete-link and has superior performance in many applications. Let C_i denote sub-cluster i, m_i be the mean of the objects in C_i, s_i be the standard deviation of the objects in C_i, and n_i be the number of objects in C_i. We have

m_i = \frac{1}{n_i} \sum_{x \in C_i} x, \qquad m_i^2 = m_i^t m_i = \|m_i\|^2, \qquad s_i^2 = \frac{1}{n_i} \sum_{x \in C_i} (x - m_i)^t (x - m_i) = \frac{1}{n_i} \sum_{x \in C_i} \|x - m_i\|^2.

Then we have the following proposition.

Proposition 1. If the squared Euclidean distance is used as the pairwise distance function for any two objects, the distance between two clusters in UPGMA is

\mathrm{dist}_{\text{UPGMA}}(C_i, C_j) = s_i^2 + s_j^2 + \|m_i - m_j\|^2.   (1)

Proof. As we know,

\mathrm{dist}_{\text{UPGMA}}(C_i, C_j) = \frac{\sum_{x \in C_i} \sum_{y \in C_j} \|x - y\|^2}{n_i n_j}.   (2)

If we separate x and y in Eq. (2), we have

\mathrm{dist}_{\text{UPGMA}}(C_i, C_j) = \frac{\sum_{x \in C_i} \|x\|^2}{n_i} + \frac{\sum_{y \in C_j} \|y\|^2}{n_j} - 2 m_i^t m_j.   (3)

Furthermore, we can show

\frac{\sum_{x \in C_i} \|x\|^2}{n_i} = s_i^2 + m_i^2.   (4)

If we substitute Eq. (4) into the right-hand side of Eq. (3), we have dist_UPGMA(C_i, C_j) = s_i^2 + s_j^2 + \|m_i - m_j\|^2. Thus we complete the proof. □

Discussion: Eq. (1) indicates that UPGMA tends to merge relatively compact (i.e., low s^2) and nearby (i.e., low \|m_i - m_j\|) sub-clusters. As a result, UPGMA tends not to merge sub-clusters with relatively large sizes, since the objects in large sub-clusters are probably well scattered in the feature space, which results in a high s^2. This can lead to a similar clustering effect as complete-link. However, when sub-clusters with relatively small sizes are far away from each other, i.e., the \|m_i - m_j\| values are large, UPGMA will turn to merging sub-clusters with relatively large sizes, for their centroids can be much closer to each other. This situation will be even more prominent when there are outliers and noise in the data set. Indeed, Eq. (1) shows the subtle tradeoff of UPGMA between single-link and complete-link. However, for real-world data sets, the behavior of UPGMA is more complex. We will further illustrate it by extensive experiments in Section 5.

4. The effect of F-measure on validating hierarchical clustering

Generally speaking, there are two types of clustering validation techniques [7,8,14], which are based on an external criterion and an internal criterion, respectively. This paper is focused on the external clustering validation measure, the F-measure, which has been widely used for evaluating hierarchical clustering algorithms in many application domains, such as document clustering [30,41].
As an external criterion, the F-measure uses external information, class labels in this case. The F-measure combines the precision and recall concepts from the information retrieval community [27]. We treat each cluster as if it were the result of a query, and each class as if it were the desired set of instances for a query. We then calculate the recall and precision of that cluster for each given class as

\mathrm{Recall}(i, j) = \frac{n_{ij}}{n_j} \quad \text{and} \quad \mathrm{Precision}(i, j) = \frac{n_{ij}}{n_i},

where n_{ij} is the number of objects of class j that are in cluster i, n_i is the number of objects in cluster i, and n_j is the number of objects in class j. The F-measure of cluster i and class j is then given by

F(i, j) = \frac{2 \cdot \mathrm{Recall}(i, j) \cdot \mathrm{Precision}(i, j)}{\mathrm{Recall}(i, j) + \mathrm{Precision}(i, j)}.

Then the F-measure of each class is the maximum value attained at any node (cluster) in the tree. Finally, the F-measure for the entire hierarchy is computed by taking the weighted average of all per-class F-measures, as given by the equation

F = \sum_j \frac{n_j}{n} \max_i \{ F(i, j) \},

where the max is taken over all clusters at all levels, and n is the number of all data objects. The F-measure values are in the interval [0, 1], and a larger F-measure value indicates a better clustering quality.

4.1. The dispersion of a data distribution

Before we describe the problem of the F-measure, we first introduce CV [3], which is a measure of dispersion of a data distribution. CV is defined as the ratio of the standard deviation to the mean. Given a set of data objects X = \{x_1, x_2, \ldots, x_n\}, we have

\mathrm{CV} = s / \bar{x}, \quad \text{where} \quad \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \quad \text{and} \quad s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}.

Note that there are some other statistics, such as the standard deviation and skewness, which can also be used to characterize the dispersion degree of data distributions. However, the standard deviation has no scalability; that is, the dispersion degrees of the original data and of a stratified sample of the data are not equal as indicated by the standard deviation, which does not agree with our intuition. Meanwhile, skewness cannot catch the dispersion in the situation where the data are symmetric but do have high variance. Indeed, CV is a dimensionless number that allows the comparison of the variations of populations that have significantly different mean values. In general, the larger the CV value, the greater the variability in the data.

4.2. The problem of F-measure for validating hierarchical clustering

In our practice, we observed that some hierarchical clustering schemes, such as single-link and UPGMA, can produce clusters with highly variant sizes. Meanwhile, validation measures such as the F-measure can be misleading when the dispersion effect holds. To illustrate this, we create a sample data set as shown in Table 1.

Table 1
A sample document data set (CV0 = 1.558).
Class 1: Sports, 100 objects
Class 2: Entertainment, 10 objects
Class 3: Foreign, 5 objects
Class 4: Metro, 5 objects

In this data set, there are four ‘‘true’’ clusters: Sports, Entertainment, Foreign, and Metro. The sizes of these four clusters are 100, 10, 5, and 5, respectively. Thus the ‘‘true’’ cluster


sizes of this data set are highly imbalanced, as indicated by CV0 = 1.558.
We assume that a clustering result on this sample data set is shown in Table 2.

Table 2
The clustering result (CV1 = 1.933, F-measure = 0.793).
Cluster 1: 99 Sports + 9 Entertainment + 4 Foreign + 5 Metro
Cluster 2: 1 Sports
Cluster 3: 1 Entertainment
Cluster 4: 1 Foreign

In the table, the clustering result consists of one ‘‘huge’’ cluster and three ‘‘tiny’’ clusters. Moreover, the largest cluster consists of objects with varied class labels. As discussed in Section 3, the single-link and UPGMA schemes may produce such a clustering result by showing the dispersion effect. According to the F-measure, the clustering result in Table 2 is rather good, since the F-measure value is 0.793, very close to the upper bound: 1. However, if we look at the four clusters carefully, we can find that the three tiny clusters are not very meaningful. Indeed, the clustering result is far away from the ‘‘true’’ cluster distribution. This is indicated by the CV value of the cluster sizes: the CV value for the resultant clusters is 1.933, which is much larger than 1.558, the CV value of the ‘‘true’’ clusters.
Indeed, the F-measure tends to put more weight on the large cluster. Meanwhile, for a data set with highly variant ‘‘true’’ cluster sizes, the largest cluster produced by single-link or UPGMA contains the majority of objects from the largest ‘‘true’’ cluster. This means that the F-measure value for that largest ‘‘true’’ cluster will be excellent, for its recall and precision values are both high. So the F-measure value for the entire data set will inevitably be high, since it is a class-size-weighted average, and the poor clustering quality for the remaining small ‘‘true’’ clusters is therefore ‘‘disguised’’. In summary, this example illustrates that the F-measure has difficulties in evaluating hierarchical clustering results on data sets with highly imbalanced cluster sizes.

4.3. The normalization of the F-measure

One way to solve the validation problem of the F-measure is to normalize the F-measure before using it. Generally speaking, normalizing techniques can be divided into two categories. One is based on a statistical view, which formulates a baseline distribution to correct the measure for randomness. A clustering can then be termed valid if it has an unusually high or low value, as measured with respect to the baseline distribution (for more details of this scheme, refer to [14]). The other technique uses the minimum and maximum values to normalize the measure into the [0, 1] range. That is,

S_{norm} = \frac{S - \min(S)}{\max(S) - \min(S)}.   (5)

We can also take a statistical view on this technique with the assumption that each measure takes a uniform distribution over the value interval. We take the latter scheme for the F-measure.
According to Eq. (5), the normalization of the F-measure reduces to the finding of max(F) and min(F). As we know, max(F) = 1. So it remains to find a tight lower bound for F. Apparently, this is an NP-complete problem without any assumptions. In the following, we assume that the cluster sizes n_i (i = 1, ..., K) and the class sizes n_j (j = 1, ..., K') are fixed but the cluster members are selected at random,^2 and propose a procedure to find a lower bound for F as follows:

Procedure 1. The computation of F*.
1: Let n* = max_i n_i.
2: Sort the class sizes so that n_{[1]} <= n_{[2]} <= ... <= n_{[K']}.
3: Let a_j = 0, for j = 1, 2, ..., K'.
4: for j = 1 : K'
5:   if n* <= n_{[j]}, a_j = n*, n* = 0, break.
6:   else a_j = n_{[j]}, n* = n* - n_{[j]}.
7: F* = (2/n) \sum_{j=1}^{K'} a_j / (1 + \max_i n_i / n_{[j]}).

2 This assumption is also referred to as the multivariate hypergeometric distribution assumption in the statistical normalization scheme [13].

Now, we prove that the above procedure can find a lower bound for F. We begin by giving a lemma as follows.

Lemma 1. Given K, c \in Z_{++} and A = \{a_i \mid 1 \le i \le K\} \subset Z_{++} with c \le \sum_{i=1}^{K} a_i, the optimal solution for

\min_{X = \{x_i \mid 1 \le i \le K\}} f(X) = \sum_{i=1}^{K} \frac{x_i}{c/a_i + 1}
\text{s.t.} \quad \sum_{i=1}^{K} x_i = c, \quad \text{and} \quad x_i \le a_i, \ x_i \in Z_{+}, \ \forall i = 1, 2, \ldots, K   (6)

is

x_{[i]} = a_{[i]}, \ 1 \le i \le l; \qquad x_{[i]} = c - \sum_{k=1}^{l} a_{[k]}, \ i = l + 1; \qquad x_{[i]} = 0, \ l + 1 < i \le K,   (7)

where \{a_{[i]} \mid 0 \le i \le K\} = \{a_i \mid 0 \le i \le K\} with a_{[0]} = 0 < a_{[1]} < \cdots < a_{[K]}, and l < K, l \in Z_{+}, with \sum_{i=1}^{l} a_{[i]} < c \le \sum_{i=1}^{l+1} a_{[i]}.

Proof. We prove it by contradiction. First, we assume that there exists an optimal solution X = \{x_{[i]} \mid 1 \le i \le K\} such that for some i^* \in (l + 1, K], x_{[i^*]} > 0. Since \sum_{i=1}^{l+1} x_{[i]} = c - \sum_{i=l+2}^{K} x_{[i]}, we have

\sum_{i=1}^{l+1} x_{[i]} < c.   (8)

Therefore, there must exist some i_0 \in [1, l + 1] such that x_{[i_0]} < a_{[i_0]}, otherwise \sum_{i=1}^{l+1} x_{[i]} = \sum_{i=1}^{l+1} a_{[i]} \ge c, which contradicts Eq. (8). As a result, we can have a new solution as follows:

X' = \{x_{[1]}, \ldots, x_{[i_0 - 1]}, x_{[i_0]} + 1, x_{[i_0 + 1]}, \ldots, x_{[l+1]}, \ldots, x_{[i^* - 1]}, x_{[i^*]} - 1, x_{[i^* + 1]}, \ldots, x_{[K]}\}.

Accordingly,

f(X') - f(X) = \frac{1}{c/a_{[i_0]} + 1} - \frac{1}{c/a_{[i^*]} + 1} < 0,

given a_{[i_0]} < a_{[i^*]}. Therefore, X is not the optimal solution, which contradicts our assumption. So we have

x_{[i]} = 0, \quad l + 1 < i \le K.   (9)

Next, we assume that there exists an optimal solution X = \{x_{[i]} \mid 1 \le i \le K\} such that for some i^* \in [1, l], x_{[i^*]} < a_{[i^*]}. Accordingly,

x_{[l+1]} = c - \sum_{i=1}^{l} x_{[i]} > c - \sum_{i=1}^{l} a_{[i]} > 0.   (10)

Then let us consider a new solution as follows:

X' = \{x_{[1]}, \ldots, x_{[i^* - 1]}, x_{[i^*]} + 1, x_{[i^* + 1]}, \ldots, x_{[l+1]} - 1, \ldots, x_{[K]}\}.

We have

f(X') - f(X) = \frac{1}{c/a_{[i^*]} + 1} - \frac{1}{c/a_{[l+1]} + 1} < 0,

given a_{[i^*]} < a_{[l+1]}. Therefore, X is not the optimal solution, which contradicts our assumption. So we have

x_{[i]} = a_{[i]} \quad \text{for } i = 1, 2, \ldots, l.   (11)

Thus we complete the proof by combining the results in Eqs. (9) and (11). □

Remark. It is trivial to show that if there exist some i, j, 1 \le i \ne j \le K, such that a_i = a_j, the solution in Eq. (7) may not be the unique optimal solution. Nevertheless, we can still use it as one of the optimal solutions for the following theorem.

Theorem 1. Given F* computed by Procedure 1, F \ge F^*.

Proof. It is easy to show:

F = \sum_j \frac{n_j}{n} \max_i \left\{ \frac{2 n_{ij}}{n_i + n_j} \right\} \ge \frac{2}{n} \max_i \sum_j \frac{n_{ij}}{n_i / n_j + 1}.   (12)

Let us consider an optimization problem as follows:

\min_{\{x_{ij} \mid 1 \le j \le K'\}} \sum_{j=1}^{K'} \frac{x_{ij}}{n_i / n_j + 1}
\text{s.t.} \quad \sum_{j=1}^{K'} x_{ij} = n_i, \quad x_{ij} \le n_j, \ x_{ij} \in Z_{+}, \ \forall j = 1, 2, \ldots, K'.   (13)

Let n_{[0]} = 0 and assume that

\sum_{j=0}^{l} n_{[j]} < n_i \le \sum_{j=0}^{l+1} n_{[j]}, \quad l \in \{0, 1, \ldots, K' - 1\};

then according to Lemma 1, we have an optimal solution as the following:

x_{i[j]} = n_{[j]}, \ 1 \le j \le l; \qquad x_{i[j]} = n_i - \sum_{k=1}^{l} n_{[k]}, \ j = l + 1; \qquad x_{i[j]} = 0, \ l + 1 < j \le K'.

Therefore, according to Formula (12),

F \ge \frac{2}{n} \max_i \sum_{j=1}^{K'} \frac{x_{i[j]}}{n_i / n_{[j]} + 1}.   (14)

Let

F_i^* = \frac{2}{n} \sum_{j=1}^{K'} \frac{x_{i[j]}}{n_i / n_{[j]} + 1} = \frac{2}{n} \sum_{j=1}^{K'} \frac{x_{i[j]} / n_i}{1/n_{[j]} + 1/n_i}.

Denote x_{i[j]} / n_i by y_{i[j]}, and 1 / (1/n_{[j]} + 1/n_i) by p_{i[j]}; we have

F_i^* = \frac{2}{n} \sum_{j=1}^{K'} p_{i[j]} y_{i[j]}.

It remains to show that \arg\max_i F_i^* = \arg\max_i n_i. Assume n_i \le n_{i'} and

\sum_{j=0}^{t} n_{[j]} < n_i \le \sum_{j=0}^{t+1} n_{[j]}, \quad t \in \{0, 1, \ldots, K' - 1\}.

This implies that

y_{i[j]} \ge y_{i'[j]}, \ 1 \le j \le t; \qquad y_{i[j]} \le y_{i'[j]}, \ t + 1 < j \le K'.

Since

\sum_{j=1}^{K'} y_{i[j]} = \sum_{j=1}^{K'} y_{i'[j]} = 1,

and j \uparrow \Rightarrow p_{i[j]} \uparrow, we have

\sum_{j=1}^{K'} p_{i[j]} y_{i[j]} \le \sum_{j=1}^{K'} p_{i[j]} y_{i'[j]}.

Furthermore, according to the definition of p_{i[j]}, we have

p_{i[j]} \le p_{i'[j]}, \quad \forall j \in \{1, \ldots, K'\}.

Therefore,

F_i^* = \frac{2}{n} \sum_{j=1}^{K'} p_{i[j]} y_{i[j]} \le \frac{2}{n} \sum_{j=1}^{K'} p_{i[j]} y_{i'[j]} \le \frac{2}{n} \sum_{j=1}^{K'} p_{i'[j]} y_{i'[j]} = F_{i'}^*,

which implies that ‘‘n_i \le n_{i'}’’ is a sufficient condition for ‘‘F_i^* \le F_{i'}^*’’. Therefore, by Procedure 1, F^* = \max_i F_i^*. Finally, according to Eq. (14), we have F \ge F^*. □

By Theorem 1, we get the normalized F-measure as follows:

F_{norm} = \frac{F - F^*}{1 - F^*}.   (15)
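For concreteness, the following Python sketch (our own code, not the authors' implementation; variable names are ours) computes the F-measure of Section 4 for a flat partition, the lower bound F* of Procedure 1, and F_norm from Eq. (15):

```python
import numpy as np

def f_measure(labels_true, labels_pred):
    """Class-size-weighted F-measure; the max is taken over the given clusters."""
    n = len(labels_true)
    total = 0.0
    for j in np.unique(labels_true):
        n_j = np.sum(labels_true == j)
        best = 0.0
        for i in np.unique(labels_pred):
            n_i = np.sum(labels_pred == i)
            n_ij = np.sum((labels_true == j) & (labels_pred == i))
            if n_ij:
                r, p = n_ij / n_j, n_ij / n_i
                best = max(best, 2 * r * p / (r + p))
        total += (n_j / n) * best
    return total

def f_star(cluster_sizes, class_sizes):
    """Procedure 1: the lower bound F* for fixed cluster and class sizes."""
    n = sum(class_sizes)
    n_big = remaining = max(cluster_sizes)
    sizes = sorted(class_sizes)          # n_[1] <= ... <= n_[K']
    a = [0] * len(sizes)
    for j, n_j in enumerate(sizes):
        if remaining <= n_j:
            a[j] = remaining
            break
        a[j], remaining = n_j, remaining - n_j
    return (2.0 / n) * sum(a_j / (1.0 + n_big / n_j) for a_j, n_j in zip(a, sizes))

def f_norm(F, F_star):
    """Eq. (15): the normalized F-measure."""
    return (F - F_star) / (1.0 - F_star)
```

Applied to the sample data of Tables 1 and 2 (class sizes 100, 10, 5, 5; cluster sizes 117, 1, 1, 1), this sketch reproduces F of about 0.793 and yields F_norm of roughly 0.12, making the poor quality of the partition far more visible than the raw F-measure does. Note that the sketch evaluates a flat partition; for a full dendrogram, the inner maximum would run over all tree nodes, as in the definition of F.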

Remark. Note that F_norm is not precisely the normalized F-measure, since we make the multivariate hypergeometric distribution assumption for the computation of ‘‘min F’’. Nevertheless, we will show in the experimental section that F_norm can still provide more accurate evaluations than the unnormalized F-measure when comparing clustering results of different data sets.

5. Experimental results

In this section, we present experimental results to evaluate the performance of UPGMA from a data distribution perspective. Specifically, we show: (1) the dispersion effect of UPGMA on the distribution of the resultant cluster sizes, (2) the reasons for the dispersion effect of UPGMA, (3) the problem with the F-measure and UPGMA, (4) the merits of the normalized F-measure, and (5) the effectiveness of HCleaner [35] in reducing the dispersion effect of UPGMA.

5.1. The experimental setup

Experimental tools: In our experiments, we used the CLUTO implementation of UPGMA [18]. Also, since the Euclidean notion of proximity is not very meaningful for hierarchical clustering on real-world high-dimensional data sets, such as gene expression data sets and document data sets, the cosine similarity is used instead. In addition, we used HCleaner, a novel and effective technique based on the hyperclique patterns [36,37], to remove a large amount of irrelevant or weakly relevant data objects in the data [35]. Finally, note that some notations used in our experiments are shown in Table 3.
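For readers who do not use CLUTO, a rough SciPy-based stand-in for this setup is sketched below (our own approximation: it uses plain cosine distance with average linkage and omits CLUTO-specific options such as the idf column model; the toy matrix is invented):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Invented dense document-term matrix; real corpora are large and sparse.
X = np.random.default_rng(1).random((100, 50))

# UPGMA (average-link) agglomeration using cosine distance (1 - cosine similarity).
Z = linkage(X, method='average', metric='cosine')

# Cut the dendrogram into K clusters, with K set to the number of "true" classes,
# as done in Section 5.2.2 to obtain a partition for evaluation.
K = 6
labels = fcluster(Z, t=K, criterion='maxclust')

sizes = np.bincount(labels)[1:]                 # resultant cluster sizes
print(sizes, sizes.std(ddof=1) / sizes.mean())  # sizes and their CV (CV1)
```

Here method='average' with metric='cosine' corresponds to UPGMA on the cosine notion of proximity; method='single' and method='complete' give the other two schemes of Section 3.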


Experimental data sets: For our experiments, we used a number of real-world data sets that are obtained from different application domains. Some characteristics of these data sets are shown in Table 4. In the table, CV0 shows the CV values of the ‘‘true’’ cluster sizes, and ‘‘#classes’’ indicates the number of the ‘‘true’’ clusters. Document data sets: The fbis data set was from the Foreign Broadcast Information Service data of the TREC-5 collection [34]. The hitech data set was derived from the San Jose Mercury newspaper articles that were distributed as part of the TREC collection (TIPSTER Vol. 3), and contains documents about computers, electronics, health, medical, research, and technology. The data sets k1a and k1b contain exactly the same set of documents but they differ in how the documents were assigned to different classes; in particular, k1a contains a finer-grain categorization than that contained in k1b. The la2 and la12 data sets are part of the TREC-5 collection [34] and contain news articles from the Los Angeles Times. The ohscal data set was obtained from the OHSUMED collection [10], which contains documents from the categories of antibodies, carcinoma, DNA, invitro, molecular sequence data, pregnancy, prognosis, receptors, risk factors, and tomography. The data sets re0 and re1 were from Reuters-21578 text categorization test collection Distribution 1.0 [20]. The tr31 and tr41 data sets were derived from the TREC-5, TREC-6, and TREC-7 [34] collections. The data set wap was from the WebACE project (WAP) [9]; each document corresponds to a web page listed in the subject hierarchy of Yahoo!. For all document clustering data sets, we used a stop-list to remove common words, and the words were stemmed using Porter’s suffix-stripping algorithm [26]. Biological data sets: Leukemia and LungCancer data sets were from Kent ridge biomedical data set repository (KRBDSR) which is


an online repository of high-dimensional features [22]. The Leukemia data set contains six subtypes of pediatric acute lymphoblastic leukemia samples and one group of samples that do not fit in any of the above six subtypes, and each sample is described by 12,558 genes. The LungCancer data set consists of samples of lung adenocarcinomas, squamous cell lung carcinomas, pulmonary carcinoid, small-cell lung carcinomas and normal lung, described by 12,600 genes.
UCI data sets: Besides the above high-dimensional data sets, we also used some UCI data sets with normal dimensionality [24]. The ecoli data set contains information on the cellular localization sites of proteins. The page-blocks data set contains information on five types of blocks of the page layout of documents that have been detected by a segmentation process.

5.2. The dispersion effect of UPGMA

5.2.1. The simulated case

We first illustrate the dispersion effect of UPGMA by some simulated experiments. Assume we have samples generated by four different two-dimensional normal distributions, with m_1 = [0, 0], m_2 = [d, 0], m_3 = [0, d], and m_4 = [d, d] being the means, respectively, and the common covariance matrix

\Sigma = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix},

where d is the parameter used to adjust the centroid distances. The sample size for each of the four distributions is exactly the same, 200, which implies that these samples form an extremely balanced data set. We let d = 7, 6, 5, respectively, and employed UPGMA on the three simulated data sets. Results are shown in Fig. 2. Note that samples from the same class are marked by the same marker and color, and members of the same cluster are enclosed by the dotted lines.
As can be seen, when the centroid distances between the classes are large enough, say d = 7, UPGMA acts like complete-link to prevent further merges of the large sub-clusters, and results in four clusters exactly the same as the four classes. As the centroids of the four classes get closer, however, things are different. As indicated by sub-figure (b), the only member of cluster C4 is far away from the population, which leads to the merge of the two large sub-clusters in cluster C3, and finally results in a significant dispersion effect. The case for sub-figure (c)

Table 3
Some notations.
CV0: the CV value of the ‘‘true’’ cluster sizes
CV1: the CV value of the resultant cluster sizes
DCV: CV1 - CV0
ADLC: the aggregation degree of the largest cluster in the clustering result
Note: refer to Section 5.3 for the detailed definition of ADLC.
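As a concrete reading of the notation in Table 3, a minimal helper (ours) that computes CV0, CV1, and DCV from the ‘‘true’’ class labels and a clustering result might look as follows:

```python
import numpy as np

def cv(sizes):
    """Coefficient of variation of a set of cluster sizes (s uses n - 1)."""
    sizes = np.asarray(sizes, dtype=float)
    return sizes.std(ddof=1) / sizes.mean()

def dispersion_stats(labels_true, labels_pred):
    """Return (CV0, CV1, DCV) as defined in Table 3."""
    cv0 = cv(np.unique(labels_true, return_counts=True)[1])
    cv1 = cv(np.unique(labels_pred, return_counts=True)[1])
    return cv0, cv1, cv1 - cv0
```

For the sample data set of Tables 1 and 2, this gives CV0 = 1.558 and CV1 = 1.933.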

Table 4
Some characteristics of experimental data sets.

Data set | Source | #objects | #features | #classes | MinSize | MaxSize | CV0

Document data set:
fbis | TREC | 2463 | 2000 | 17 | 38 | 506 | 0.961
hitech | TREC | 2301 | 126373 | 6 | 116 | 603 | 0.495
k1a | WebACE | 2340 | 21839 | 20 | 9 | 494 | 1.004
k1b | WebACE | 2340 | 21839 | 6 | 60 | 1389 | 1.316
la2 | TREC | 3075 | 31472 | 6 | 248 | 905 | 0.516
la12 | TREC | 6279 | 31472 | 6 | 521 | 1848 | 0.503
ohscal | OHSUMED-233445 | 11162 | 11465 | 10 | 709 | 1621 | 0.266
re1 | Reuters-21578 | 1657 | 3758 | 25 | 10 | 371 | 1.385
re0 | Reuters-21578 | 1504 | 2886 | 13 | 11 | 608 | 1.502
tr31 | TREC | 927 | 10128 | 7 | 2 | 352 | 0.936
tr41 | TREC | 878 | 7454 | 10 | 9 | 243 | 0.913
wap | WebACE | 1560 | 8460 | 20 | 5 | 341 | 1.040

Biomedical data set:
Leukemia | KRBDSR | 325 | 12558 | 7 | 15 | 79 | 0.584
LungCancer | KRBDSR | 203 | 12600 | 5 | 6 | 139 | 1.363

UCI data set:
ecoli | UCI | 336 | 7 | 8 | 2 | 143 | 1.160
page-blocks | UCI | 5473 | 10 | 5 | 28 | 4913 | 1.953


Fig. 2. Illustration of the dispersion effect of UPGMA. (a) d = 7, (b) d = 6, and (c) d = 5.

is even more prominent as d is further reduced to 5. Therefore, from the simulated experiments, we can see that UPGMA tends to show the dispersion effect on data sets containing classes without very clear boundaries, which is the common case for real-world applications.
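The simulated setting of Section 5.2.1 can be reproduced along the following lines (a sketch with our own seed; the exact cluster shapes and sizes will differ from those in Fig. 2):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def simulate(d, n_per_class=200, seed=0):
    """Four 2-D Gaussians with means (0,0), (d,0), (0,d), (d,d) and identity covariance."""
    rng = np.random.default_rng(seed)
    means = np.array([[0, 0], [d, 0], [0, d], [d, d]], dtype=float)
    return np.vstack([rng.normal(loc=m, scale=1.0, size=(n_per_class, 2)) for m in means])

for d in (7, 6, 5):
    X = simulate(d)
    Z = linkage(X, method='average')                 # UPGMA on Euclidean distance
    labels = fcluster(Z, t=4, criterion='maxclust')  # cut into four clusters
    sizes = np.sort(np.bincount(labels)[1:])[::-1]
    cv1 = sizes.std(ddof=1) / sizes.mean()
    # The paper reports balanced clusters for d = 7 and a dispersion effect for d = 6, 5.
    print(d, sizes, round(cv1, 3))
```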

5.2.2. The real-world case

We further test the dispersion effect of UPGMA on real-world data sets. In this experiment, we first applied UPGMA to the data sets, and then computed the CV values for the ‘‘true’’ cluster distributions and the resultant cluster distributions. Note that, while hierarchical clustering does not require a pre-specified number of clusters as input, we set the number of clusters K as the ‘‘true’’ cluster number, so that we can get a cluster partition for the purpose of evaluation.
Table 5 shows the experimental results on various real-world data sets. Note that the CV0 values in the table are the same as the ones in Table 4 for each data set. As can be seen from the column ‘‘DCV’’, for all data sets in the table, UPGMA tends to increase the variation of the cluster sizes. Indeed, no matter what the CV0 values are, the corresponding CV1 values for the same data sets are not less than 1.0, with only one exception: the tr41 data set. Therefore, we can empirically estimate the interval of CV1 values as [1.0, 2.5]. In other words, for data sets with CV0 < 1.0, the distributions of the cluster sizes by UPGMA tend to be away from the true ones. This indicates a poor clustering quality. In fact, as Fig. 3 shows, the DCV values increase rapidly as the CV0 values decrease.
The above observation is opposite to the observation on K-means clustering [31] in our previous work [38]. In [38], the empirical value interval of CV1 values by K-means clustering is [0.3, 1]. Indeed, this interval is much smaller than the original one

Table 5
Experimental results on various real-world data sets.

ID | Data set | STD0 | STD1 | CV0 | CV1 | DCV | ADLC(a) | F | F_norm
1 | fbis | 139 | 215 | 0.961 | 1.491 | 0.529 | 0.24 | 0.607 | 0.583
2 | hitech | 189 | 924 | 0.495 | 2.410 | 1.915 | 1.00 | 0.333 | 0.007
3 | k1a | 117 | 200 | 1.004 | 1.714 | 0.710 | 0.35 | 0.529 | 0.498
4 | k1b | 513 | 642 | 1.316 | 1.647 | 0.331 | 0.67 | 0.802 | 0.672
5 | la2 | 264 | 926 | 0.516 | 1.808 | 1.293 | 0.83 | 0.508 | 0.322
6 | la12 | 526 | 2550 | 0.503 | 2.437 | 1.934 | 1.00 | 0.329 | 0.001
7 | ohscal | 297 | 1823 | 0.266 | 1.634 | 1.368 | 0.50 | 0.310 | 0.198
8 | re1 | 91 | 121 | 1.385 | 1.830 | 0.445 | 0.28 | 0.538 | 0.519
9 | re0 | 173 | 261 | 1.502 | 2.264 | 0.762 | 0.92 | 0.408 | 0.266
10 | tr31 | 123 | 152 | 0.936 | 1.152 | 0.216 | 0.57 | 0.722 | 0.659
11 | tr41 | 80 | 86 | 0.913 | 0.984 | 0.071 | 0.30 | 0.649 | 0.610
12 | wap | 81 | 125 | 1.040 | 1.612 | 0.571 | 0.30 | 0.528 | 0.500
13 | Leukemia | 27 | 82 | 0.584 | 1.770 | 1.188 | 1.00 | 0.281 | 0.055
14 | LungCancer | 55 | 76 | 1.363 | 1.888 | 0.525 | 0.80 | 0.729 | 0.400
15 | ecoli | 48 | 75 | 1.160 | 1.804 | 0.643 | 0.63 | 0.610 | 0.467
16 | page-blocks | 2137 | 2214 | 1.953 | 2.023 | 0.070 | 0.60 | 0.906 | 0.484
Min | - | 27 | 75 | 0.266 | 0.984 | 0.070 | 0.24 | 0.281 | 0.001
Max | - | 2137 | 2550 | 1.953 | 2.437 | 1.934 | 1.00 | 0.906 | 0.672

Note: (1) Parameters used in CLUTO: clmethod=agglo, crfun=upgma, sim=cos, colmodel=idf.
(a) For the definition and use of ADLC, refer to Section 5.3.

for the ‘‘true’’ cluster sizes, i.e., the interval of the CV0 values in Table 5. We call this the ‘‘uniform effect’’ of K-means clustering. Fig. 4 illustrates the opposite effects of K-means and UPGMA on the cluster distributions. This radar figure shows the CV0 values of all data sets, and their corresponding CV1 values by K-means and UPGMA, respectively. Note that the parameters set for K-means in CLUTO are all the default ones. The opposite effects


of K-means and UPGMA are obvious; that is, K-means tends to reduce the variation of the resultant cluster sizes, whereas UPGMA acts in an opposite way.
Finally, Fig. 5 shows the CV1 values for UPGMA, single-link, and complete-link on all the experimental data sets. As can be seen, single-link shows a strong dispersion effect on the resultant cluster sizes, and complete-link shows a weak uniform effect: 9 out of 16 data sets have smaller CV values after clustering. As for UPGMA, it acts more like single-link and shows the dispersion effect on the resultant cluster sizes.

Fig. 3. DCV versus CV0 (fitted line: Y = -0.9795X + 1.7590).
Fig. 4. A comparison of CV1 values by K-means and UPGMA on real-world data sets.
Fig. 5. A comparison of CV1 values by UPGMA, single-link, and complete-link on real-world data sets.

5.3. Reasons for the dispersion effect of UPGMA

In the previous subsection, we showed that UPGMA tends to increase the variation of the cluster sizes no matter what the CV0 value is. In this experiment, we want to gain an understanding of the reasons why UPGMA has a dispersion effect.
First, we demonstrate that for many data sets in Table 5, if we get a cluster partition by setting the number of clusters as the number of ‘‘true’’ clusters, we can find one huge cluster in the result. This largest cluster contains most data objects with varied class labels. To better illustrate this, we propose a new measure, ADLC: the aggregation degree of the largest cluster in the clustering result. The computation of ADLC for a clustering result is as follows. Let R_{ij} denote the recall of class i in cluster j, and let c and c' denote the numbers of classes and clusters, respectively; then

\mathrm{ADLC} = \max_k \left\{ \frac{1}{c} \sum_{i=1}^{c} I_k\!\left(\arg\max_j R_{ij} = k\right) \right\},

where j, k = 1, ..., c', and I_k is a binary function taking 1 when \arg\max_j R_{ij} = k and taking 0 otherwise. It is easy to know that 1/c \le ADLC \le 1. Actually, ADLC measures the dispersion degree of hierarchical clustering by the ratio of classes, most instances of which have been assigned to the largest cluster. In general, a higher ADLC value indicates a more significant dispersion effect.
Table 5 shows the ADLC value for each data set. In the table, 11 out of 16 data sets have ADLC values no less than 0.5. Three data sets, including hitech, la12, and Leukemia, even have ADLC values equal to 1.0. This observation suggests that many data sets have a largest cluster in the results which contains most of the instances. In other words, UPGMA acts more like single-link and shows the dispersion effect on the real-world data sets.
Table 6 shows the breakdowns of the clustering partitions produced by UPGMA on the HITECH and LA12 data sets. In the table, ‘‘clu1’’ represents ‘‘cluster 1’’ and ‘‘cla1’’ means ‘‘class 1’’.

Table 6
Breakdowns of clustering partitions by UPGMA on HITECH and LA12.

HITECH | cla1 | cla2 | cla3 | cla4 | cla5 | cla6
clu1 | 0 | 0 | 0 | 0 | 1 | 0
clu2 | 0 | 0 | 3 | 0 | 0 | 17
clu3 | 0 | 0 | 0 | 0 | 1 | 0
clu4 | 0 | 1 | 0 | 0 | 1 | 0
clu5 | 0 | 0 | 0 | 3 | 4 | 0
clu6 | 485 | 115 | 426 | 600 | 474 | 170

LA12 | cla1 | cla2 | cla3 | cla4 | cla5 | cla6
clu1 | 1 | 0 | 0 | 0 | 0 | 0
clu2 | 1 | 2 | 2 | 2 | 0 | 0
clu3 | 0 | 0 | 1 | 3 | 0 | 2
clu4 | 0 | 1 | 1 | 1 | 0 | 1
clu5 | 5 | 1 | 0 | 1 | 0 | 1
clu6 | 1035 | 638 | 517 | 1841 | 1497 | 725

As can be seen, each clustering partition consists of one largest cluster and several tiny clusters. Due to the smallness of these tiny clusters, they should be treated as sets of noise. In other words, the existence of noise can intensify the dispersion effect of UPGMA.
In summary, the reason why UPGMA has a dispersion effect is that large sub-clusters have a higher priority to


be merged than small and distant sub-clusters, such as noise, which usually cannot be avoided in real-world data sets.

5.4. The problem of F-measure for validating UPGMA

In this subsection, we present an analysis of the F-measure for evaluating the clustering results of UPGMA. Fig. 6 shows the relationship between the F-measure values and the CV0 values for all the experimental data sets in Table 4. In the figure, a general trend can be observed; that is, the F-measure values tend to increase as the CV0 values increase. In other words, the F-measure prefers the results by UPGMA on data sets with high variation in the ‘‘true’’ cluster sizes (a higher F-measure value indicates a better clustering quality). However, theoretically, CV0 values have no relationship with the clustering quality. In other words, the above results can be misleading.

Fig. 6. F-measure versus CV0.

To illustrate this, we selected four data sets with high CV0 values: k1b, tr31, ecoli and page-blocks. We did hierarchical clustering by UPGMA on these four data sets using the number of ‘‘true’’ clusters as K. In the clustering results, we labeled each cluster by the label of the majority of objects in the cluster. We found that many ‘‘true’’ clusters disappeared in the clustering results. Fig. 7 shows the percentage of the disappeared ‘‘true’’ clusters for these four data sets. As can be seen, every data set has a significant number of ‘‘true’’ clusters disappeared. Meanwhile, we can observe that high F-measure values were achieved for these data sets with high CV0 values. In other words, if the F-measure is used as the clustering validation measure, the poor results by UPGMA on skewed data sets will be reported as ‘‘excellent’’. In summary, the F-measure is not a good validation measure for hierarchical clustering when data sets contain skewed ‘‘true’’ cluster sizes.

Fig. 7. An illustration of the problem of F-measure for validating the clustering results by UPGMA (percentage of disappeared classes and F-measure for k1b, tr31, ecoli, and page-blocks).
Fig. 8. F_norm versus CV0.

5.5. The performance of the normalized F-measure

In this subsection, we demonstrate the validity of F_norm for evaluating the clustering results of UPGMA. Table 5 shows the F_norm values for the clustering results of all the experimental data sets. As can be seen, one notable observation is that all the F-measure values decrease after the normalization. For instance, the page-blocks and ecoli data sets, which have high F-measure values in Fig. 7, show much smaller F_norm values, no greater than 0.5. Some data sets, such as hitech, la12 and Leukemia, even have F_norm values near 0, which implies that their clustering results are essentially random. So the normalized F-measure can indeed tell us the absolute scores of the results, which facilitates comparisons across different data sets.
We also explored the relationship between the F_norm and CV0 values, as shown in Fig. 8. As can be seen, the clear upward trend in Fig. 6 disappeared; that is, F_norm values do not show a strong correlation with the CV0 values, which is more reasonable than the behavior of the unnormalized F-measure in Fig. 6. This result further validates that F_norm is more suitable than the F-measure for comparing hierarchical clustering results across different data sets.

5.6. Data cleaning for reducing the dispersion effect of UPGMA

From the above analysis, we know that noise can intensify the dispersion effect of UPGMA. In this subsection, we applied HCleaner [35], a noise removal technique, to remove noise in the data sets before clustering. In this experiment, we selected eight document data sets as shown in Table 7. These data sets have roughly equal numbers of objects, ranging from 1500 to 3000. The ‘‘Supp’’ and ‘‘H-conf’’ columns in Table 7 denote the input parameters (support and h-confidence) for the HCleaner. Since real-world data sets can have a large amount of weakly relevant data objects, we set the noise ratio


Table 7
Parameters for HCleaner and experimental results on noise-removed data sets.

Data set | Supp | H-conf | NR | CV0 | CV1 | DCV | DCV(old) | F_norm | F_norm(old)
k1a | 0.0015 | 0.20 | 0.20 | 1.163 | 1.636 | 0.473 | 0.710 | 0.562 | 0.498
k1b | 0.0015 | 0.20 | 0.20 | 1.109 | 1.001 | -0.108 | 0.331 | 0.722 | 0.672
re1 | 0.0020 | 0.15 | 0.27 | 1.491 | 1.635 | 0.144 | 0.445 | 0.525 | 0.519
re0 | 0.0020 | 0.20 | 0.21 | 1.560 | 1.162 | -0.398 | 0.762 | 0.300 | 0.266
wap | 0.0040 | 0.20 | 0.25 | 1.271 | 1.553 | 0.282 | 0.571 | 0.615 | 0.500
fbis | 0.0100 | 0.23 | 0.21 | 0.983 | 1.521 | 0.538 | 0.529 | 0.521 | 0.583
hitech | 0.0005 | 0.10 | 0.20 | 0.485 | 2.411 | 1.926 | 1.915 | 0.005 | 0.007
la2 | 0.0006 | 0.13 | 0.19 | 0.529 | 2.424 | 1.895 | 1.293 | 0.217 | 0.322

between 20% and 30%, as indicated by ‘‘NR’’ in the table. We then clustered the cleaned data sets by UPGMA and computed the dispersion degree before and after clustering. Table 7 shows the results. Note that ‘‘DCV(old)’’ in the table represents the DCV values of the original data sets in Table 5. We list it here for the purpose of comparison with DCV values of the cleaned data sets. For a similar reason, we also list ‘‘F norm (old)’’ in the table. An observation is that for data sets with original CV0 values over 1.0, such as k1a, k1b, re0, re1 and wap, the DCV values tend to decrease after noise removal, meanwhile the normalized F-measure values tend to increase. In other words, HCleaner is effective on removing the noise in the data sets so as to reduce the negative impact of the dispersion effect by UPGMA. Another observation is that two data sets k1b and re0 have negative DCV values. This means that the dispersion effect on the two data sets disappeared after noise removal. This also justifies our analysis in Section 3: UPGMA can act as complete-link after the removal of small and remote sub-clusters. However, for data sets with original CV0 values below 1.0, such as fbis, hitech, and la2, the situation is more complex; that is, the DCV values are larger but F norm values are smaller after noise removal. This indicates that HCleaner may not be very effective on reducing the dispersion effect for data sets with relatively balanced ‘‘true’’ cluster sizes.

6. Conclusions

This paper has presented a study of hierarchical clustering from a data distribution perspective. Our objective is to characterize the relationship between the distribution of the ‘‘true’’ cluster sizes and the performance of hierarchical clustering. Along this line, UPGMA was shown to have a dispersion effect on the clustering results. In other words, UPGMA increases the variation of the resultant cluster sizes no matter what the ‘‘true’’ cluster distribution is. Also, extensive experiments have been conducted on a number of real-world data sets. The results revealed that UPGMA tends to produce clustering results in which the CV values of the cluster sizes are greater than 1.0. This indicates that, if data sets have relatively uniform ‘‘true’’ cluster sizes, i.e., the CV values of the ‘‘true’’ cluster sizes are smaller than 1.0, UPGMA tends to produce clustering results which can be far away from the ‘‘true’’ cluster distribution.
In addition, the results also showed that the F-measure usually indicates a good performance if UPGMA has been applied to data sets with high variation in the ‘‘true’’ cluster sizes. However, after carefully looking at the clustering results, we found that many small ‘‘true’’ clusters disappeared (absorbed into larger clusters). In other words, the F-measure has difficulties in validating hierarchical clustering results on highly imbalanced data sets. To solve this problem, we proposed the normalized F-measure, F_norm, to evaluate the clustering results. Experimental results demonstrated that F_norm shows better performance and can be used to


evaluate clustering results across data sets with substantially different distributions of sizes. Finally, we applied HCleaner to remove noise before hierarchical clustering. Experimental results showed that HCleaner can alleviate the dispersion effect of UPGMA and improve the clustering performance.
For future work, we plan to address the dispersion effect of UPGMA. Also, in addition to noise and outliers, we believe that high dimensionality is another major factor that can affect the dispersion effect of UPGMA, since there may exist many irrelevant or weakly relevant features in the data that will mislead the clustering algorithms. Along this line, one potential solution is to use dimensionality-reduction algorithms [40,33] while preserving most of the discriminative information in the data.

Acknowledgments

This research was partially supported by the National Natural Science Foundation of China (NSFC) under Grants 70621061 and 70890082, the Rutgers Seed Funding for Collaborative Computing Research, and the Lan Tian Xin Xiu seed funding of Beihang University. Also, this research was supported in part by a Faculty Research Grant from Rutgers Business School–Newark and New Brunswick. Finally, we are grateful to the Neurocomputing anonymous referees for their constructive comments on the paper.

References

[1] I. Borg, P. Groenen, Modern Multidimensional Scaling: Theory and Applications, Springer, Berlin, 1997.
[2] A. Bouguettaya, Q. Le Viet, Data clustering analysis in a multidimensional space, Information Sciences 112 (1–4) (1998) 267–295.
[3] M. DeGroot, M. Schervish, Probability and Statistics, third ed., Addison-Wesley, Reading, MA, 2001.
[4] J.W. Demmel, Applied Numerical Linear Algebra, Society for Industrial & Applied Mathematics, Philadelphia, PA, 1997.
[5] J. Ghosh, Scalable Clustering Methods for Data Mining, Handbook of Data Mining, Lawrence Erlbaum Associates, 2003.
[6] S. Guha, R. Rastogi, K. Shim, Cure: an efficient clustering algorithm for large databases, in: Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data, June 1998, pp. 73–84.
[7] M. Halkidi, Y. Batistakis, M. Vazirgiannis, Cluster validity methods: Part I, SIGMOD Record 31 (2) (2002) 40–45.
[8] M. Halkidi, Y. Batistakis, M. Vazirgiannis, Clustering validity checking methods: Part II, SIGMOD Record 31 (3) (2002) 19–27.
[9] E.-H. Han, D. Boley, M. Gini, R. Gross, K. Hastings, G. Karypis, V. Kumar, B. Mobasher, J. Moore, WebACE: a web agent for document categorization and exploration, in: Proceedings of the 2nd International Conference on Autonomous Agents, 1998.
[10] W. Hersh, C. Buckley, T.J. Leone, D. Hickam, OHSUMED: an interactive retrieval evaluation and new large test collection for research, in: Proceedings of the 17th Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval, July 1994, pp. 192–201.
[11] S. Hirano, X. Sun, S. Tsumoto, Comparison of clustering methods for clinical databases, Information Sciences 159 (3–4) (2004) 155–165.
[12] E.R. Hruschka, R.J.G.B. Campello, L.N. de Castro, Evolving clusters in gene-expression data, Information Sciences 176 (13) (2006) 1898–1927.
[13] L. Hubert, P. Arabie, Comparing partitions, Journal of Classification 2 (1985) 193–218.
[14] A.K. Jain, R.C. Dubes, Algorithms for Clustering Data, Prentice-Hall, Englewood Cliffs, NJ, 1988.
[15] R.A. Jarvis, E.A. Patrick, Clustering using a similarity measure based on shared nearest neighbors, IEEE Transactions on Computers C-22 (11) (1973) 1025–1034.
[16] I.T. Jolliffe, Principal Component Analysis, second ed., Springer, Berlin, 2002.
[17] G. Karypis, E.-H. Han, V. Kumar, Chameleon: a hierarchical clustering algorithm using dynamic modeling, IEEE Computer 32 (8) (1999) 68–75.
[18] G. Karypis, Cluto: software for clustering high-dimensional datasets, version 2.1.1, 2008, http://glaros.dtc.umn.edu/gkhome/views/cluto.
[19] B. Larsen, C. Aone, Fast and effective text mining using linear-time document clustering, in: Proceedings of the 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, August 1999, pp. 16–22.
[20] D. Lewis, Reuters-21578 text categorization test collection 1.0, 2008, http://www.research.att.com/lewis.


[21] J. Li, D. Tao, W. Hu, X. Li, Kernel principle component analysis in pixels clustering, in: Proceedings of the 2005 IEEE/WIC/ACM International Conference on Web Intelligence, 2005, pp. 786–789.
[22] J. Li, H. Liu, Kent Ridge biomedical data set repository, 2008, http://sdmc.i2r.a-star.edu.sg/rp/.
[23] F. Murtagh, Clustering Massive Data Sets, Handbook of Massive Data Sets, Kluwer Academic Publishers, Dordrecht, 2000.
[24] D.J. Newman, S. Hettich, C.L. Blake, C.J. Merz, UCI repository of machine learning databases, 1998.
[25] Y. Pang, D. Tao, Y. Yuan, X. Li, Binary two-dimensional PCA, IEEE Transactions on Systems, Man, and Cybernetics, Part B 38 (4) (2008) 1176–1180.
[26] M.F. Porter, An algorithm for suffix stripping, Program 14 (3) (1980) 130–137.
[27] C.J. Van Rijsbergen, Information Retrieval, second ed., Butterworths, London, 1979.
[28] T.-W. Ryu, C.F. Eick, A database clustering methodology and tool, Information Sciences 171 (1–3) (2005) 29–59.
[29] P.H. Sneath, R.R. Sokal, Numerical Taxonomy, Freeman, San Francisco, CA, 1973.
[30] M. Steinbach, G. Karypis, V. Kumar, A comparison of document clustering techniques, in: Workshop on Text Mining, the 6th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, August 2000.
[31] P.-N. Tan, M. Steinbach, V. Kumar, Introduction to Data Mining, Addison-Wesley, Reading, MA, 2005.
[32] B. Tang, M. Shepherd, M.I. Heywood, X. Luo, Comparing dimension reduction techniques for document clustering, in: Canadian Conference on Artificial Intelligence, 2005, pp. 292–296.
[33] D. Tao, X. Li, X. Wu, S. Maybank, Geometric mean for subspace selection in multiclass classification, IEEE Transactions on Pattern Analysis and Machine Intelligence 30 (2008).
[34] TREC, Text retrieval conference, 2008, http://trec.nist.gov.
[35] H. Xiong, G. Pandey, M. Steinbach, V. Kumar, Enhancing data analysis with noise removal, IEEE Transactions on Knowledge and Data Engineering 18 (3) (2006) 304–319.
[36] H. Xiong, P.-N. Tan, V. Kumar, Mining strong affinity association patterns in data sets with skewed support distribution, in: Proceedings of the 3rd IEEE International Conference on Data Mining, 2003, pp. 387–394.
[37] H. Xiong, P.-N. Tan, V. Kumar, Hyperclique pattern discovery, Data Mining and Knowledge Discovery Journal 13 (2) (2006) 219–242.
[38] H. Xiong, J. Wu, J. Chen, K-means clustering versus validation measures: a data distribution perspective, in: Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, August 2006, pp. 779–784.
[39] T. Zhang, R. Ramakrishnan, M. Livny, Birch: an efficient data clustering method for very large databases, in: Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data, June 1996, pp. 103–114.
[40] T. Zhang, D. Tao, J. Yang, Discriminative locality alignment, in: Proceedings of the 10th European Conference on Computer Vision (ECCV), 2008, pp. 725–738.
[41] Y. Zhao, G. Karypis, Hierarchical clustering algorithms for document datasets, Technical Report #03-027, University of Minnesota, Minneapolis, MN, 2003.
[42] Y. Zhao, G. Karypis, Criterion functions for document clustering: experiments and analysis, Machine Learning 55 (3) (2004) 311–331.

Junjie Wu received his Ph.D. in Management Science and Engineering from Tsinghua University, China, in 2008. He also holds a B.E. degree in Civil Engineering from Tsinghua University, China. He is currently an Assistant Professor in the Information Systems Department, School of Economics and Management, Beihang University, China. His general area of research is data mining and statistical modeling, with a special interest in solving problems arising from emerging data-intensive applications. As a young scientist, he has published in refereed conference proceedings and journals, such as KDD, ICDM, TKDE, and TSMCB. He has also been a reviewer for leading academic journals and many international conferences in his area. He is the recipient of the ‘‘Lan Tian Xin Xiu’’ Award of Beihang University, the Excellent Dissertation Award of Tsinghua University, the Outstanding Young Research Award at the School of Economics and Management, Tsinghua University, and Student Travel Awards from SIGKDD 2008 and ICDM 2008. He is a member of AIS.

Hui Xiong is currently an Assistant Professor in the Management Science and Information Systems Department at Rutgers, the State University of New Jersey. He received the B.E. degree in Automation from the University of Science and Technology of China, China, the M.S. degree in Computer Science from the National University of Singapore, Singapore, and the Ph.D. degree in Computer Science from the University of Minnesota, USA. His general area of research is data and knowledge engineering, with a focus on developing effective and efficient data analysis techniques for emerging data intensive applications. He has published over 50 technical papers in peer-reviewed journals and conference proceedings. He is a co-editor of Clustering and Information Retrieval (Kluwer Academic Publishers, 2003) and a co-Editor-in-Chief of Encyclopedia of GIS (Springer, 2008). He is an associate editor of the Knowledge and Information Systems journal and has served regularly in the organization committees and the program committees of a number of international conferences and workshops. He was the recipient of the 2008 IBM ESA Innovation Award, the 2007 Junior Faculty Teaching Excellence Award and the 2008 Junior Faculty Research Award at the Rutgers Business School. He is a senior member of the IEEE, and a member of the ACM, the ACM SIGKDD, and Sigma Xi.

Jian Chen received the B.Sc. degree in Electrical Engineering from Tsinghua University, Beijing, China, in 1983, and the M.Sc. and Ph.D. degrees, both in Systems Engineering, from the same university in 1986 and 1989, respectively. He is EMC Professor and Chairman of the Management Science Department, and Director of the Research Center for Contemporary Management, Tsinghua University. His main research interests include supply chain management, E-commerce, decision support systems, and modeling and control of complex systems. Dr. Chen has published over 100 papers in refereed journals and has been a principal investigator for over 30 grants or research contracts with the National Science Foundation of China, governmental organizations, and companies. He has been invited to present several plenary lectures. He is the recipient of the Ministry of Education Changjiang Scholars award, the Fudan Management Excellence Award (3rd), the IBM Faculty Award, Science and Technology Progress Awards of the Beijing Municipal Government, the Outstanding Contribution Award of the IEEE Systems, Man and Cybernetics Society, the Science and Technology Progress Award of the State Educational Commission, and the Science & Technology Award for Chinese Youth. He has also been elected an IEEE Fellow. He serves as Chairman of the Service Systems and Organizations Technical Committee of the IEEE Systems, Man and Cybernetics Society, Vice President of the Systems Engineering Society of China, Vice President of the China Society for Optimization and Overall Planning, and a member of the Standing Committee of the China Information Industry Association. He is the editor of ‘‘Journal of Systems Science and Systems Engineering,’’ an area editor of ‘‘Electronic Commerce Research and Applications,’’ an associate editor of ‘‘IEEE Transactions on Systems, Man and Cybernetics: Part A,’’ ‘‘IEEE Transactions on Systems, Man and Cybernetics: Part C,’’ and ‘‘Asia Pacific Journal of Operational Research,’’ and serves on the Editorial Boards of ‘‘International Journal of Electronic Business,’’ ‘‘International Journal of Information Technology and Decision Making,’’ and ‘‘Systems Research and Behavioral Science.’’