Pattern Recognition, Vol. 10, pp. 287-295. © Pergamon Press Ltd. 1978. Printed in Great Britain.
THE LIMITED VALUE OF COPHENETIC CORRELATION AS A CLUSTERING CRITERION

MARGARETA HOLGERSSON

University of Uppsala, Department of Statistics, and National Central Bureau of Statistics, Stockholm, Sweden

(Received 12 October 1976; in revised form 26 October 1977)
Abstract - A common procedure for evaluating hierarchical cluster techniques is to compare the input data, in terms of, for example, a matrix of similarities or dissimilarities, with the output hierarchy expressed in matrix form. If an ordinary product-moment correlation is used for this comparison, the technique is known as that of cophenetic correlation, frequently used by numerical taxonomists. A high correlation between the input similarities and the output dendrogram has been regarded as a criterion of a successful classification. This paper contains a Monte Carlo study of the characteristics of the cophenetic correlation and a related measure of agreement, both of which are interpreted in terms of generalized variance for some different hierarchical cluster algorithms. The generalized variance criterion chosen for this study is Wilks' lambda, whose sampling distribution under the null hypothesis of identical group centroids is used in this context to define the degree of separation between clusters. Thus, a probabilistic approach is introduced into the evaluation procedure. With the above definition of the presence of clusters, use of the cophenetic correlation and related measures of agreement as criteria of goodness-of-fit is shown to be quite misleading in most cases. This is due to their large variability at low separation of clusters.

Cophenetic correlation   Cluster analysis   Wilks' lambda   Clustering criteria   Kruskal's stress

INTRODUCTION

Of the large collection of techniques concerned with detecting and perceiving structure in multivariate data, generally referred to as methods of cluster analysis or unsupervised pattern recognition methods, hierarchical cluster algorithms play an important role. Requiring considerably less computer time than most other cluster techniques, these methods are frequently used in numerous applications (see, for example, Anderberg(1)). On the other hand, an interpretation of the output may not be straightforward. Since any hierarchical cluster algorithm is optimal merely in each single step of the procedure, "hierarchical clustering in fact defines a cluster as whatever results from applying a specific algorithm" (Duda and Hart(3)). For the Complete Linkage and Single Linkage algorithms, however, at least the groups at a specific proximity level of the hierarchy are well-defined in terms of the maximum distance within or the minimum distance between clusters.

There are several ways of evaluating the output from cluster techniques; see for example Sneath.(20) The most promising strategy in this context appears to be a comparison of the results from a few different cluster techniques with each other, and also with the results from other multivariate methods applied to the same data. For instance, multidimensional scaling, principal components and similar ordination techniques may be used for this purpose. Further, the output may be evaluated by means of additional knowledge supplied by the user, for instance some background variable not used in the grouping procedure.

Still, some kind of numerical measure of the agreement between the input data and the output - whether it is a single grouping of data or a set of hierarchically nested groupings - may be desirable. For hierarchical cluster algorithms, a number of such measures of agreement have been proposed; see Gower(8) for a review. The most commonly used method for measuring the agreement between the input data and the output hierarchy is that of cophenetic correlations, first suggested by Sokal and Rohlf.(17) Several authors have investigated its properties, for example Sneath,(19) Farris(4) and Sokal and Rohlf.(18) Other similar measures of agreement are euclidean metrics (see e.g. Hartigan(10)), like for instance the stress criterion by Kruskal,(13) originally proposed for the technique of multidimensional scaling. Further, a least squares criterion has been proposed by Gower(7) as a criterion of agreement.

Deriving the sampling distributions of such criteria turns out to be very complicated, due to the fact that the sample similarities are stochastically dependent. Gower and Banfield(9) have examined the sampling distributions of seven criteria of agreement by means of a simulation study, in which Single Linkage clusters were fitted to samples drawn from spherical multinormal distributions in various numbers of dimensions, thus representing a null hypothesis situation. They found, for example,
that the sampling distributions of all investigated criteria approach normality as the numbers of units and variables increase.

In the present paper, an interpretation of two measures of agreement is performed in terms of generalized variance for some different cluster algorithms, with the accent on the Complete Linkage algorithm. This is further described in the next Section. A Monte Carlo simulation study, yielding the distributions of the two criteria of agreement for different values of Wilks' lambda, is described in the third Section. Data are simulated in one and two dimensions, for different sample sizes and numbers of groups. In the final Section, some concluding remarks are added.

AN INTERPRETATION OF TWO CRITERIA OF AGREEMENT IN TERMS OF GENERALIZED VARIANCE
A common way of representing multivariate data is to consider each object as a point in a p-dimensional space, where p is the number of variables measured on each object. Clusters can then be considered as volumes of points of high density, separated by areas of low density in this space. Accordingly, there are some cluster methods using this density concept in the sense that they trace point swarms of high density in the p-space, forming the cluster nuclei; see e.g. Wishart.(22) For simplicity's sake, merely clusters that meet this simple definition, founded on the concept of density, will be considered in this paper. Of course, this definition will not cover clusters of more irregular shape.

It is certainly desirable to develop some scalar criterion indicating whether or not there exist "natural" groups in data. Most often, the word natural is used synonymously with the word homogeneous in this context (see e.g. Carmichael et al.(2)), and it may seem reasonable to look for some expression for the scatter of data. The most widely used criteria for clustering are all based on the well-known identity

T = W + B,    (2.1)

where T is the total scatter matrix of the data, W the within-cluster scatter matrix and B the between-cluster scatter matrix. Here W = Σ_{k=1}^{g} W_k, where W_k is the within scatter matrix of cluster k, k = 1,...,g. For p = 1 the above equation involves only scalars, and either W/T or W/B could be used as a criterion of homogeneity, both invariant under nonsingular linear transformations. Fisher(5) proposed a technique in the univariate case for finding an optimal partition of N objects into g groups, for which W is minimized. In the multivariate case, things become more complicated, and a number of different criteria have been suggested. Since the eigenvalues λ_1, λ_2, ... of W^{-1}B are invariant under nonsingular linear transformations of the data, a variety of invariant clustering criteria can be created from functions of these eigenvalues, for example the ratio

|W| / |T| = 1 / Π_{j=1}^{p} (1 + λ_j).    (2.2)
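For concreteness, this ratio is easy to compute directly from a data matrix and a given partition via the scatter matrices of (2.1). A minimal sketch in Python with numpy (the function name is ours, not part of the original study):

```python
import numpy as np

def wilks_lambda(X, labels):
    """Wilks' lambda |W|/|T| for the rows of X partitioned by labels."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    dev = X - X.mean(axis=0)
    T = dev.T @ dev                              # total scatter matrix
    W = np.zeros_like(T)
    for k in np.unique(labels):
        dk = X[labels == k] - X[labels == k].mean(axis=0)
        W += dk.T @ dk                           # within-cluster scatter
    return np.linalg.det(W) / np.linalg.det(T)
```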
This ratio, known as Wilks' lambda criterion, was originally proposed by Wilks(21) as a MANOVA statistic, developed through the generalized likelihood principle. Some cluster techniques use the lambda criterion, or some similar function of the eigenvalues, in iteratively seeking the "optimal" partition of a data set into g groups; see e.g. the work of Friedman and Rubin,(6) Marriott(14) and McRae.(15) In this paper, Wilks' lambda will be used as a criterion of the presence of clusters in evaluating the following two measures of agreement:

1. The cophenetic correlation coefficient ρ.
2. A euclidean metric δ.

The first measure, ρ, is an ordinary product-moment correlation coefficient. In this context it is usually defined according to

ρ = Σ_{i<j} (d_ij - d̄)(d*_ij - d̄*) / [Σ_{i<j} (d_ij - d̄)² · Σ_{i<j} (d*_ij - d̄*)²]^{1/2}.    (2.3)

Here, d_ij is the euclidean distance between units i and j, and d*_ij is the corresponding minimal distance between units i and j in the output dendrogram resulting from some particular hierarchical algorithm. The average distance, d̄, from a sample of size n is defined as

d̄ = [Σ_{i<j} d_ij] / [(n² - n)/2].    (2.4)

Analogously, d̄* is defined. The second measure, δ, examined in this study, is defined according to

δ = [Σ_{i<j} (d_ij - d*_ij)² / Σ_{i<j} d_ij²]^{1/2}.    (2.5)
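Both measures are straightforward to obtain from a fitted hierarchy. A small sketch (Python, assuming numpy and scipy; scipy's cophenet supplies the dendrogram distances d*_ij, and the function name is ours):

```python
import numpy as np
from scipy.cluster.hierarchy import cophenet, linkage
from scipy.spatial.distance import pdist

def agreement_measures(X, method="complete"):
    """Cophenetic correlation rho, eq. (2.3), and euclidean measure delta, eq. (2.5)."""
    d = pdist(X)                        # input distances d_ij, i < j
    Z = linkage(d, method=method)       # output hierarchy
    d_star = cophenet(Z)                # dendrogram distances d*_ij
    rho = np.corrcoef(d, d_star)[0, 1]  # ordinary product-moment correlation
    delta = np.sqrt(((d - d_star) ** 2).sum() / (d ** 2).sum())
    return rho, delta
```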
In other contexts (see e.g. Kruskal(13)) a similar criterion occurs, known as Kruskal's stress. In order to investigate the properties of these two coefficients of agreement, frequently used in the evaluation of results from applications of hierarchical algorithms, their distributions were generated for different values of Wilks' lambda, below denoted by Λ. In the multinormal case, under the null hypothesis of equal group means, the sampling distribution of Λ is known (Schatzoff(16)). For certain values of p, there exist functions of Λ that are exactly F-distributed. For p = 1 and sample size n we have

(1 - Λ)(n - g) / [Λ(g - 1)] ~ F_{(g-1),(n-g)},    (2.6)

and for p = 2,

(1 - Λ^{1/2})(n - g - 1) / [Λ^{1/2}(g - 1)] ~ F_{2(g-1),2(n-g-1)}.    (2.7)
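Since a small Λ corresponds to a large F value, (2.6) can be inverted to find the Λ0 at which the null distribution function of Λ takes a prescribed value; this is how the interval mid-points quoted in the next Section can be reproduced. A sketch (Python, assuming scipy; the function name is ours):

```python
from scipy.stats import f

def lambda_cutpoint(alpha, n, g):
    """Lambda0 with P(Lambda <= Lambda0) = alpha under (2.6), for p = 1:
    small Lambda corresponds to a large F value, so invert the upper tail."""
    f0 = f.isf(alpha, g - 1, n - g)            # upper-alpha point of F(g-1, n-g)
    return (n - g) / ((n - g) + (g - 1) * f0)  # solve (1-L)(n-g)/[L(g-1)] = f0

# For n = 10, g = 2 and alpha = 0.5, 0.1, 0.01, 0.001 this gives approximately
# 0.941, 0.698, 0.414 and 0.240 -- the interval mid-points used below.
```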
Thus, we are able to interpret the coefficients ρ and δ in terms of the probability distribution of Λ. A sampling experiment, yielding the distributions of ρ and δ for short intervals Λ0 ± c around different values of Λ, is described in the third Section of this paper. The choice of values for the constant c was of course influenced by the amount of computer time available for this study. For data simulated in one dimension, seven hierarchical algorithms were investigated for sample size n = 10, namely
A1. Centroid method
A2. Median method by Gower
A3. Average Linkage within the new group
A4. Average Linkage between the merged groups
A5. Complete Linkage
A6. Single Linkage
A7. Ward's method.

A description of A1-A7, as enumerated above, is given e.g. by Anderberg,(1) who has also listed computer subroutines for these algorithms; those subroutines have been used in this study. Most of the algorithms have direct counterparts in modern software; a rough mapping is sketched below.
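As a hedged aside, most of A1-A7 have close counterparts among the linkage options of scipy.cluster.hierarchy; the mapping below is our assumption, not something used in the original study, and A3 has no direct scipy equivalent:

```python
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

# Rough modern (scipy) equivalents of algorithms A1-A7.
# A3, "Average Linkage within the new group", has no direct scipy method.
SCIPY_METHOD = {
    "A1": "centroid",  # Centroid method
    "A2": "median",    # Median method by Gower
    "A4": "average",   # Average Linkage between the merged groups (UPGMA)
    "A5": "complete",  # Complete Linkage
    "A6": "single",    # Single Linkage
    "A7": "ward",      # Ward's method
}

def fit_dendrogram(X, algorithm="A5"):
    """Fit one of the listed hierarchies to the rows of X (euclidean distances)."""
    return linkage(pdist(X), method=SCIPY_METHOD[algorithm])
```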
Table 1. Distribution of ρ for seven different algorithms and for different intervals Λ0 ± 0.0005. Data in one dimension. Sample size: n = 10. (Columns correspond to the values 1.000, 0.500, 0.100, 0.010 and 0.001 of the distribution function of Λ.)

(a) Algorithm: Centroid method (A1)

                 1.000    0.500    0.100    0.010    0.001
Arithm. mean     0.810    0.811    0.818    0.800    0.803
S.D.             0.080    0.081    0.072    0.073    0.055
Minimum          0.542    0.593    0.608    0.521    0.669
Maximum          0.988    0.990    0.956    0.958    0.905
Range            0.446    0.397    0.348    0.437    0.236
Skewness        -0.126   -0.064   -0.030   -0.165   -0.308

(b) Algorithm: Median method by Gower (A2)

                 1.000    0.500    0.100    0.010    0.001
Arithm. mean     0.813    0.811    0.818    0.797    0.804
S.D.             0.076    0.081    0.072    0.074    0.055
Minimum          0.528    0.593    0.608    0.579    0.714
Maximum          0.964    0.990    0.956    0.968    0.874
Range            0.436    0.397    0.348    0.389    0.160
Skewness        -0.120   -0.064   -0.030   -0.104   -0.298

(c) Algorithm: Average Linkage within the new group (A3)

                 1.000    0.500    0.100    0.010    0.001
Arithm. mean     0.813    0.808    0.811    0.800    0.812
S.D.             0.071    0.076    0.078    0.067    0.054
Minimum          0.579    0.614    0.593    0.601    0.737
Maximum          0.962    0.965    0.950    0.959    0.896
Range            0.384    0.350    0.357    0.357    0.159
Skewness        -0.167   -0.067   -0.083   -0.074    0.598

(d) Algorithm: Average Linkage between the merged groups (A4)

                 1.000    0.500    0.100    0.010    0.001
Arithm. mean     0.834    0.833    0.836    0.818    0.801
S.D.             0.062    0.062    0.059    0.056    0.056
Minimum          0.667    0.652    0.675    0.649    0.697
Maximum          0.966    0.990    0.962    0.968    0.968
Range            0.299    0.338    0.287    0.319    0.271
Skewness        -0.088    0.017    0.000    0.004    0.069

(e) Algorithm: Complete Linkage (A5)

                 1.000    0.500    0.100    0.010    0.001
Arithm. mean     0.792    0.787    0.798    0.787    0.781
S.D.             0.093    0.089    0.093    0.084    0.062
Minimum          0.543    0.575    0.500    0.523    0.692
Maximum          0.964    0.987    0.948    0.959    0.903
Range            0.422    0.411    0.448    0.437    0.211
Skewness        -0.080   -0.098   -0.171   -0.055    0.374

(f) Algorithm: Single Linkage (A6)

                 1.000    0.500    0.100    0.010    0.001
Arithm. mean     0.812    0.815    0.818    0.795    0.764
S.D.             0.074    0.073    0.072    0.070    0.079
Minimum          0.641    0.594    0.610    0.598    0.584
Maximum          0.964    0.988    0.958    0.959    0.968
Range            0.323    0.394    0.348    0.361    0.383
Skewness        -0.089    0.012   -0.041   -0.058   -0.152

(g) Algorithm: Ward's method (A7)

                 1.000    0.500    0.100    0.010    0.001
Arithm. mean     0.760    0.755    0.766    0.754    0.774
S.D.             0.106    0.101    0.105    0.092    0.068
Minimum          0.500    0.550    0.500    0.517    0.689
Maximum          0.962    0.968    0.956    0.968    0.871
Range            0.462    0.418    0.456    0.450    0.182
Skewness        -0.058    0.037   -0.080    0.029    0.476
Table 2. Distribution of δ for seven different algorithms and for different intervals Λ0 ± 0.0005. Data in one dimension. Sample size: n = 10. (Columns correspond to the values 1.000, 0.500, 0.100, 0.010 and 0.001 of the distribution function of Λ.)

(a) Algorithm: Centroid method (A1)

                 1.000    0.500    0.100    0.010    0.001
Arithm. mean     0.467    0.480    0.465    0.470    0.503
S.D.             0.120    0.119    0.129    0.111    0.145
Minimum          0.135    0.147    0.207    0.174    0.287
Maximum          0.866    0.835    0.936    0.967    0.883
Range            0.730    0.688    0.730    0.793    0.596
Skewness         0.062    0.081    0.168    0.137   -0.041

(b) Algorithm: Median method by Gower (A2)

                 1.000    0.500    0.100    0.010    0.001
Arithm. mean     0.459    0.480    0.465    0.478    0.499
S.D.             0.124    0.119    0.129    0.111    0.143
Minimum          0.179    0.147    0.207    0.222    0.287
Maximum          0.986    0.835    0.936    0.806    0.673
Range            0.807    0.688    0.730    0.584    0.386
Skewness         0.095    0.081    0.168    0.131   -0.069

(c) Algorithm: Average Linkage within the new group (A3)

                 1.000    0.500    0.100    0.010    0.001
Arithm. mean     0.843    0.830    0.838    0.760    0.694
S.D.             0.183    0.201    0.189    0.100    0.034
Minimum          0.625    0.620    0.626    0.621    0.633
Maximum          1.626    2.172    1.527    1.183    0.759
Range            1.002    1.552    0.901    0.562    0.126
Skewness         0.317    0.409    0.301    0.305    0.281

(d) Algorithm: Average Linkage between the merged groups (A4)

                 1.000    0.500    0.100    0.010    0.001
Arithm. mean     0.322    0.323    0.321    0.335    0.338
S.D.             0.052    0.053    0.048    0.047    0.046
Minimum          0.174    0.109    0.180    0.152    0.150
Maximum          0.442    0.436    0.446    0.453    0.408
Range            0.268    0.327    0.266    0.301    0.258
Skewness        -0.088   -0.123   -0.154   -0.071   -0.137

(e) Algorithm: Complete Linkage (A5)

                 1.000    0.500    0.100    0.010    0.001
Arithm. mean     0.455    0.458    0.453    0.454    0.471
S.D.             0.073    0.076    0.070    0.063    0.039
Minimum          0.265    0.194    0.294    0.280    0.351
Maximum          0.634    0.616    0.632    0.627    0.523
Range            0.369    0.423    0.338    0.346    0.172
Skewness         0.053   -0.081    0.126    0.046   -0.447

(f) Algorithm: Single Linkage (A6)

                 1.000    0.500    0.100    0.010    0.001
Arithm. mean     1.235    1.364    1.224    1.211    1.309
S.D.             0.387    0.455    0.354    0.371    0.344
Minimum          0.362    0.812    0.244    0.472    0.412
Maximum          2.268    2.225    2.222    2.464    2.105
Range            1.906    1.413    1.977    1.992    1.963
Skewness         0.004   -0.083    0.028    0.032   -0.060

(g) Algorithm: Ward's method (A7)

                 1.000    0.500    0.100    0.010    0.001
Arithm. mean     0.663    0.668    0.661    0.675    0.684
S.D.             0.051    0.047    0.048    0.026    0.013
Minimum          0.436    0.349    0.481    0.602    0.666
Maximum          0.726    0.716    0.722    0.720    0.701
Range            0.291    0.367    0.241    0.118    0.035
Skewness        -0.271   -0.257   -0.327   -0.140   -0.521
For the Complete Linkage algorithm (A5), the one-dimensional study was extended to sample sizes n = 40 and n = 80. This particular algorithm was also investigated for data generated in two dimensions, for different numbers of groups, g = 2 and g = 4.

THE MONTE CARLO STUDY. DESIGN AND RESULTS
For the one-dimensional study, samples were created in such a way that n1 units were generated from a N(μ1, 1) distribution* and n2 units from a N(μ2, 1) distribution, where n1 = n2. The total sample size was n = 10. This procedure was repeated until 200 samples were obtained with their Λ-values lying within a particular interval Λ0 ± 0.0005. Samples with their Λ-values outside this interval were excluded from the study. The mid-points, Λ0, of these intervals were chosen to be the following: 1.000, 0.941, 0.698, 0.414 and 0.240, corresponding to the following values of the distribution function of Λ: 1.000, 0.500, 0.100, 0.010 and 0.001. By the choice of an appropriate value for the distance μ1 - μ2, μ1 > μ2, the desired Λ-values could be obtained. Thus, the distribution of ρ conditional on Λ, i.e. f(ρ|Λ), was approximated by f(ρ| |μ1 - μ2|), and f(δ|Λ) by f(δ| |μ1 - μ2|). The distance μ1 - μ2 was assigned values between 0 and 5.8.

In Tables 1 and 2, for each criterion ρ and δ, respectively, the arithmetic mean, S.D., minimum and maximum value, range and skewness† for different intervals Λ0 ± 0.0005 are listed for algorithms A1-A7. For Complete Linkage (A5) exclusively, the same distribution characteristics are visualized in Figs 1-3 for intervals Λ0 ± 0.0005 around Λ0 = 0.100, Λ0 = 0.500 and Λ0 = 0.900, for sample sizes n = 10, n = 40 and n = 80. A comprehensive picture of the relationship between ρ and Λ is given in Fig. 4 for sample sizes n = 10 and n = 80, 0 < Λ < 1. Figure 5 shows the corresponding relationship between δ and Λ. Again, the algorithm is Complete Linkage (A5).

* Throughout this study, the normal pseudo-random number generator GGNOR of the subroutine library IMSL(12) was used.
† Skewness is here defined as (arithmetic mean - median)/standard deviation.
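A compact sketch of this rejection-sampling design (Python, reusing the wilks_lambda and agreement_measures sketches above; note that in the study the separation μ1 - μ2 was tuned to hit each Λ0, whereas here it is simply drawn at random, which is far less efficient):

```python
import numpy as np

rng = np.random.default_rng(0)  # stand-in for the IMSL GGNOR generator

def sample_rho_given_lambda(lam0, tol=0.0005, n=10, n_samples=200, max_sep=5.8):
    """Collect n_samples values of rho (Complete Linkage) from two-group,
    one-dimensional N(mu, 1) data whose Wilks' lambda lies in lam0 +/- tol."""
    rhos = []
    while len(rhos) < n_samples:
        sep = rng.uniform(0.0, max_sep)                # distance mu1 - mu2
        x = np.concatenate([rng.normal(0.0, 1.0, n // 2),
                            rng.normal(sep, 1.0, n // 2)])[:, None]
        labels = np.repeat([0, 1], n // 2)
        if abs(wilks_lambda(x, labels) - lam0) > tol:  # outside the interval:
            continue                                   # discard the sample
        rho, _ = agreement_measures(x, method="complete")
        rhos.append(rho)
    return np.array(rhos)
```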
[Fig. 1. Arithmetic mean of (a) ρ, (b) δ, for different intervals Λ0 ± 0.0005 at sample sizes n = 10, 40 and 80. Data in one dimension. Algorithm: Complete Linkage.]

[Fig. 2. Standard deviation of (a) ρ, (b) δ, for different intervals Λ0 ± 0.0005 at sample sizes n = 10, 40 and 80. Data in one dimension. Algorithm: Complete Linkage.]
In two dimensions, data were generated in the following manner: a N(0, 1) random generator yielded two-dimensional vectors μi, i = 1,...,g, forming a basis for the g group centres. Around each group centre, two-dimensional N(bμi, 1) random vectors xij, i = 1,...,g, j = 1,...,ni, ni = n/g, were generated. The scalar b was assigned values ranging from 0 to 3.1, and the total sample size was n = 80. This procedure was repeated until 100 samples were generated within the interval Λ0 ± 0.01. In Figs 6-8, sampling results are visualized for g = 2 and g = 4 and different intervals Λ0 ± 0.01; Λ0 = 0.1, Λ0 = 0.5 and Λ0 = 0.9, for ρ and δ respectively. (Figs 1-8 are based on the numerical results listed in Holgersson.(11))

Table 1 indicates that for each algorithm, A1-A7, the mean value of ρ is almost constant within the investigated range of the distribution function of Λ. From Table 1 we can also notice that the highest mean values of ρ are obtained from A4, while A7 gives on average the lowest values. It is well known that the highest cophenetic correlations are usually obtained with Average Linkage methods, like A3 and, in particular, A4 (see e.g. Sneath(20)). The latter algorithm is often referred to as the "unweighted pair-group method using arithmetic averages" (UPGMA); see e.g. Farris.(4) Centroid methods like A1 and A2 are also likely to yield large cophenetic correlations, due to the fact that they permit "reversals" in the resulting dendrograms.
[Fig. 3. Range of (a) ρ, (b) δ, for different intervals Λ0 ± 0.0005 at sample sizes n = 10, 40 and 80. Data in one dimension. Algorithm: Complete Linkage.]
As reversals occur, the distance values associated with each successive merger of the two "most similar" groups will not be monotonically increasing with the ordinal numbers of the mergers. However, the presence of reversals in a dendrogram is of course not desirable in a practical application, since it clearly obstructs the possibilities of interpreting the results.

From Table 1, we can further note a slight decrease in the ranges and standard deviations of ρ with decreasing values of Λ for all algorithms, the standard deviations ranging from ca. 0.05 to 0.10. On the whole, no particular asymmetry of the distribution of ρ for any algorithm is indicated by the measure of skewness.

From Table 2, it appears that a similar pattern holds for the arithmetic means of δ as for the means of ρ in Table 1, that is, almost constant values over the investigated range of Λ. As before, the best fit is expressed by the values from A4. However, we can notice a difference between the cophenetic correlation and the euclidean measure as regards the other algorithms.
[Fig. 4. Relationship between ρ and Λ for sample sizes (a) n = 10, (b) n = 80. Data in one dimension. Algorithm: Complete Linkage.]
For example, with δ, A7 yields a better fit than A6. As regards the magnitude of the standard deviations of δ, the seven algorithms could here be divided into three groups: for A4, A5 and A7 the standard deviation is ca. 0.05, for A1, A2 and A3 ca. 0.10, and for A6 ca. 0.40. All distributions are relatively symmetric, except for A3, which produces a positively skew distribution, and A7, which gives rise to a negatively skew one.

From Figs 1-3 we may study the changes in mean, standard deviation and range for ρ and δ for three different values of Λ (0.1, 0.5, 0.9) and for three different sample sizes (10, 40, 80). We now confine ourselves to a single algorithm, Complete Linkage (A5). The relationships between ρ and Λ and between δ and Λ are also illustrated in Figs 4 and 5, respectively.
[Fig. 5. Relationship between δ and Λ for sample sizes (a) n = 10, (b) n = 80. Data in one dimension. Algorithm: Complete Linkage.]

[Fig. 6. Arithmetic mean of (a) ρ, (b) δ, for different intervals Λ0 ± 0.01 and different numbers of groups, g = 2 and g = 4. Data in two dimensions. Algorithm: Complete Linkage.]

It can be noted that the arithmetic mean of ρ increases - and the mean of δ decreases - as Λ decreases below the lower limits in Tables 1 and 2 (Λ0 = 0.240, the 0.001 point). As in Tables 1 and 2, we can note that all standard deviations decrease with decreasing Λ. As the sample size becomes larger, the arithmetic means of ρ and δ decrease and increase, respectively. The standard deviations decrease with larger sample sizes. The effects of larger sample sizes are more obvious for larger values of Λ. Comprehensive pictures of the behaviour of ρ and δ are given in Fig. 3 (showing the ranges) and in Figs 4 and 5. We may notice that for ρ, the minimum values are constant with increasing sample size. This is not the case for the maximum values of δ. For both measures, however, an increased sample size seems to improve the results. For example, with n = 10, a Λ-value between 0.1 and 0.9 can always yield a ρ above 0.9,
whereas with n = 80, a Λ-value between 0.5 and 0.9 does not seem to yield ρ-values above 0.8. Further, from Figs 6-8 we can study the distributions of ρ and δ obtained from two-dimensional data in two and four groups for sample size n = 80. It is interesting to note that for an increased number of groups, the arithmetic mean of ρ and δ decreases and increases, respectively, for small Λ's. It is more surprising, however, that both the standard deviation and range increase as Λ decreases for g = 4. Repeated simulations, however, have given identical results. Finally, we can study the effects of increasing the dimensionality from one to two dimensions on the distributions of ρ and δ by comparing Figs 1-3 (n = 80) with Figs 6-8 (g = 2). Comparing Figs 3 and 8, we can observe that the ranges of ρ and δ have moved downwards and upwards, respectively, for the two-dimensional data.

CONCLUSIONS
Certain restrictions have been imposed on the scope
of this study by the large computer costs associated with the Monte Carlo simulations. Still, the results described in the previous Section show that the use of cophenetic correlations and related measures of agreement may be rather misleading if these measures are considered as indicators of the presence of clusters. This conclusion is, of course, based on the use of Wilks' lambda in defining the presence of a certain number of pre-specified clusters.

From the present study, it is verified that a high separation of the groups, that is, a low value of Wilks' lambda, implies a high cophenetic correlation (and, respectively, a low value of the euclidean measure). However, the opposite statement cannot be verified: a high value of the cophenetic correlation does not always correspond to a low value of Wilks' lambda. The magnitudes of the ranges and standard deviations of the two measures of agreement investigated in this study seem to be quite large - except for very small values of lambda - at least for small sets of units.

From this study it is also obvious that the cophenetic correlation and the euclidean measure behave quite similarly, except in a few situations pointed out in the previous Section. Nevertheless, the merits of cophenetic correlations and related measures of agreement seem to be rather limited. Still, further studies may be needed, especially those involving a number of groups larger than two.
[Fig. 7. Standard deviation of (a) ρ, (b) δ, for different intervals Λ0 ± 0.01 and different numbers of groups, g = 2 and g = 4. Data in two dimensions. Algorithm: Complete Linkage.]

[Fig. 8. Range of (a) ρ, (b) δ, for different intervals Λ0 ± 0.01 and different numbers of groups, g = 2 and g = 4. Data in two dimensions. Algorithm: Complete Linkage.]
REFERENCES
1. M. R. Anderberg, Cluster Analysis for Applications. Academic Press, New York (1973).
2. J. W. Carmichael, J. A. George and R. S. Julius, Finding natural clusters, Syst. Zool. 17, 144-150 (1968).
3. R. O. Duda and P. E. Hart, Pattern Classification and Scene Analysis. Wiley, New York (1973).
4. J. S. Farris, On the cophenetic correlation coefficient, Syst. Zool. 18, 279-285 (1969).
5. W. D. Fisher, On grouping for maximum homogeneity, J. Am. Statist. Assoc. 53, 789-798 (1958).
6. H. P. Friedman and J. Rubin, On some invariant criteria for grouping data, J. Am. Statist. Assoc. 62, 1159-1178 (1967).
7. J. C. Gower, Statistical methods for comparing different multivariate analyses of the same data. In Mathematics in the Archaeological and Historical Sciences (F. R. Hodson, D. G. Kendall and P. Tautu, eds.). Edinburgh University Press (1971).
8. J. C. Gower, Classification problems, Bull. Int. Statist. Inst. 45, 402-408 (1973).
9. J. C. Gower and C. F. Banfield, Goodness-of-fit criteria in cluster analysis and their empirical distributions. Invited/contributed paper at the 8th International Biometrics Conference, Constanta, Romania, 25-30 August (1974).
10. J. A. Hartigan, Representation of similarity matrices by trees, J. Am. Statist. Assoc. 62, 1140-1158 (1967).
11. M. Holgersson, An interpretation of two measures of agreement between the dissimilarity matrix and the resulting hierarchy in terms of generalized variance for some different cluster algorithms. Research Report 76-12, University of Uppsala, Department of Statistics (1976).
12. IMSL Library 1 Reference Manual. International Mathematical and Statistical Libraries, Houston, Texas (1975).
13. J. B. Kruskal, Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis, Psychometrika 29, 1-27 (1964).
14. F. H. C. Marriott, Practical problems in a method of cluster analysis, Biometrics 27, 501-514 (1971).
15. D. J. McRae, Clustering multivariate observations. Ph.D. thesis, University of North Carolina, Chapel Hill (1973).
16. M. Schatzoff, Exact distributions of Wilks' likelihood ratio criterion, Biometrika 53, 347-358 (1966).
17. R. R. Sokal and F. J. Rohlf, The comparison of dendrograms by objective methods, Taxon 11, 33-40 (1962).
18. R. R. Sokal and F. J. Rohlf, The intelligent ignoramus, an experiment in numerical taxonomy, Taxon 19, 305-319 (1970).
19. P. H. A. Sneath, A comparison of different clustering methods as applied to randomly spaced points, Classification Soc. Bull. 1, 2-18 (1966).
20. P. H. A. Sneath, Evaluation of clustering methods. In Numerical Taxonomy (A. J. Cole, ed.). Proceedings of the Colloquium in Numerical Taxonomy held in the University of St Andrews, September 1968. Academic Press, New York (1969).
21. S. S. Wilks, Certain generalizations in the analysis of variance, Biometrika 24, 471-494 (1932).
22. D. Wishart, Mode analysis. In Numerical Taxonomy (A. J. Cole, ed.). Proceedings of the Colloquium in Numerical Taxonomy held in the University of St Andrews, September 1968. Academic Press, New York (1969).
About the Author - MARGARETA HOLGERSSON was born in Uppsala, Sweden on 16 January 1947. She received a fil.kand. comprising the subjects Mathematics, Chemistry and Statistics in 1968, and a Ph.D. in Statistics in 1976, from the University of Uppsala. From 1968 to 1976 she was engaged in the teaching of Statistics at the Department of Statistics, University of Uppsala, in combination with postgraduate studies and research in the area of pattern recognition. In 1976 she became a staff member of the Unit of Statistical Methods, National Central Bureau of Statistics, Stockholm, Sweden.