Cluster validity profiles


0031-3203/82/020061-23 $02.00/0 Pergamon Press Ltd. © 1982 Pattern Recognition Society

Pattern Recognition Vol. 15, No. 2, pp. 61-83, 1982. Printed in Great Britain.

CLUSTER VALIDITY PROFILES*

THOMAS A. BAILEY, JR. and RICHARD DUBES

Computer Science Department, Michigan State University, East Lansing, MI 48824, U.S.A.

(Received 25 July 1979; in revised form 7 July 1980; received for publication 27 January 1981)

Abstract - The quantitative evaluation of clusters has lagged far behind the development of clustering algorithms. This paper introduces a new procedure, based on probability profiles, for judging the validity of clusters established from rank-order proximity data. Probability profiles furnish a comprehensive picture of the compactness and isolation of a cluster, scaled in probability units. Given a rank-order proximity matrix and a cluster to be examined, profiles compare the cluster's compactness and isolation indices with upper bounds on the best indices one would expect in a randomly chosen graph. After reviewing the pertinent literature, this paper explains the background from graph theory and cluster analysis needed to treat cluster validity. The probabilities and bounds needed to form cluster profiles are derived and strategies for using profiles are suggested. Special attention is given to the underlying probability models. Profiles are demonstrated on four artificially generated data sets, two of which have good hierarchical structure, and on data from a speaker recognition project. They reject spurious clusters and accept apparently valid clusters. Since profiles quantify the interaction between a cluster and its environment, they provide a much richer source of information on cluster structure than single-number indices proposed in the literature.

Key words: Clustering method, Cluster validity, Verification, Random graph, Single-link method, Complete-link.

*Research supported by NSF Grant ENG76-11936A01.

1. INTRODUCTION

Clustering methods have proved to be valuable tools of exploratory data analysis in several applications.(1-21) The goal of a clustering method is to find any natural groupings in the data, or subsets of data items that are compact and mutually isolated. Clustering techniques can classify large masses of data (22) or suggest causative relations among variables.(23) Unfortunately, the clustering procedure itself can impose inappropriate clustering structures. This paper treats the cluster validation process, which tries to identify 'true' clusters and reject spurious ones. Reviews of clustering methodology (24-29) and a treatment of the cluster validity problem (30-32) are available elsewhere.

Data are presented to a clustering algorithm either as a pattern matrix, in which each column represents a measurement (or feature) and each row represents an object (or pattern), or as a proximity matrix, in which each row and column represents a data item. The rows of a pattern matrix are usually visualized as points in a multidimensional pattern space, while proximity values indicate the degree of 'alikeness' between data items. Various types of proximity matrices have been treated in the literature.(24,33) This paper defines a clustering as a partition of the data items and evaluates the validities of clusters derived from proximity matrices whose entries are on an ordinal scale.

Three types of cluster validity questions are important in applications.(32) The first question is 'Do the data tend to cluster?' If, for instance, patterns are generated from a uniform or unimodal distribution, and Euclidean distance measures proximity, any 'clusters' uncovered must be invalid. Data can also be 'anticlustered'; interaction among the patterns could cause a more even distribution than expected from randomly selected patterns.(34,35) Unless data exhibit a tendency to cluster, further analysis by clustering is unwarranted. The second question refers to the global fit of the clustering structure. The membership of each cluster and the relationships among clusters define the global structure of the data. The clustering structure imposed by a clustering method should fit this underlying structure. For example, a measure of global fit should detect an inappropriate hierarchical structure imposed by applying the 'wrong' algorithm. We are concerned with the third question, which concentrates on individual clusters. An individual cluster exhibits characteristics such as compactness, lifetime, stability, and separation from other clusters which lead one to conclude that the cluster does or does not form a single entity, or valid cluster. We concentrate on the question 'Is it unusual to find a cluster as compact and isolated as the observed cluster?' Our approach is based on the theory of random graphs.

Section II provides the setting for our results by reviewing the pertinent literature and listing certain


definitions from graph theory. Section III describes indices of isolation and compactness and analyzes a new tool for validating clusters called a cluster profile. Section IV provides several examples and discusses computational matters.

II. BACKGROUND

This section briefly reviews the primary approaches taken in the literature to validating the results of clustering algorithms on ordinal proximity matrices. Dubes and Jain (32) review approaches to the general problem of cluster validity.

One approach to the interpretation of clusters, which is technically outside the realm of cluster validity, is to judge the results of a clustering algorithm according to whether or not the clustering structure 'makes sense' in a given application. Anderberg (24) identifies three responses to the question 'How do you know when you have a good set of clusters?' (i) When clustering only provides summary statistics, the accuracy of the calculation is in question, not the validity of the clusters. (ii) When the clustering method is defined so that any clusters found must have certain properties which certify them as clusters, all subsets achieved are, by definition, valid clusters. (iii) When clustering is used as an exploratory tool, Anderberg claims that results lacking explanation cannot be salvaged by validity tests. This position obviates the need for cluster validity tests. By contrast, Rapoport and Fillenbaum (36) state that 'safeguards of various sorts ... are obviously necessary to guard against elaborate interpretation of randomly generated data.' Clustering is becoming an accepted methodological procedure in a wide variety of scientific studies. We believe the viewpoint expressed by Rapoport and Fillenbaum is becoming widely accepted and that the difficulty of the problem of establishing the validity of clusters is no excuse for ignoring it.

Response (ii) above attempts to ensure that any subset of data items identified by a clustering method will be valid, independent of distributional questions. Hubert (37) called subsets of data items that satisfy specific restrictions 'perfect clusters'. Definitions of perfect clusters usually involve an index of compactness, which measures cohesiveness among data items, and/or an index of isolation, which measures separation among clusters. A cluster with strong compactness usually needs less isolation to be considered perfect than does a cluster with weak compactness. McQuitty (38) defined a 'comprehensive type' cluster as a cluster in which each data item is more like every other data item in the cluster than it is like any data item not in the cluster. In a 'restricted type' cluster, each data item is more like some other data item in the cluster than it is like any data item outside the cluster. Van Rijsbergen's perfect cluster (39) requires that the least similar pair of data items in a cluster be more alike than the most similar pair comprised of one data item from the cluster and one from outside the cluster.

This idea of a perfect cluster is more restrictive than McQuitty's perfect cluster. Hubert (37) provided several generalizations of these definitions. Day (40) extended this approach to the overlapping case in which data items can belong to more than one cluster. He defines consistency and authenticity properties and determines whether classes of clustering methods have these properties. Day also defines general indices of cohesion and attenuation which must be specialized to each clustering method; he provides no information on distributions. We do not argue with the desirability of finding perfect clusters but are searching for an objective means of validating clusters. The fact that a cluster fits one definition of perfect cluster but not another seems a dangerous criterion for labeling a cluster valid or not. We prefer to use an objective, probability-based criterion. The final decision on what is valid rests on the level of significance one is willing to accept.

Cluster validity questions fit naturally into a framework of hypothesis testing. A null hypothesis describing 'no clustering' is stated and the distribution of an index of cluster validity is derived for testing this hypothesis in the usual way. Several factors pose severe obstacles to this approach. Stating a meaningful null hypothesis is difficult. The distribution of any reasonable index of cluster validity depends in unknown ways upon several factors, such as sample size, the clustering method employed, and cluster size, and is extremely difficult to establish. An erroneous procedure that sometimes appears in the literature is to apply the statistical procedures used in multivariate analysis of variance to validate clusters. Clusters must be identified before applying the clustering algorithm if such procedures are to be correctly applied. Observing the results of an algorithm designed to identify clusters and comparing it with random subsets of data items will certainly label many 'clusters' as valid even in random data. Mountford,(41) Sneath (42) and Hartigan (43) have demonstrated the difficulty of establishing the null distributions of clustering statistics.

When the proximities between data items are on an ordinal scale, as is widely assumed for data in Psychology and the Social Sciences,(8-10) a clear relationship exists between graph theory and clustering methods.(37,44-47) Two prominent hierarchical clustering methods, the single-link (or nearest neighbor) method and the complete-link (or furthest neighbor) method, generate clusterings that depend on the order of the proximities but not on their actual values.(48) Single-link clusters are subgraphs of minimum spanning trees (49) while complete-link clusters are related to the node colorability of graphs.(50,51) We will be primarily concerned with these two clustering methods. Our approach to cluster validity exploits relationships between clustering methods and graph theory. The pertinent graph-theoretic ideas are now explained.

An ordinal proximity matrix D = (d_ij) on n data

items is a symmetrical, n × n matrix whose n(n − 1)/2 upper-diagonal entries are non-negative and distinct (i.e., no ties); the diagonal entries are ignored. A proximity can measure similarity, where the larger the measure, the more alike the data items (e.g., correlation), or dissimilarity, where a large value means unalike data items (e.g., distance). We assume dissimilarities throughout this paper for convenience. A threshold graph T(D, t) is a graph on n labeled nodes containing edges between all pairs of nodes (i, j) for which d_ij < t. Each (upper-diagonal) entry of D generates a distinct threshold graph. A rank graph R(D, t) is a threshold graph with an order imposed on edges by the order of the entries in D. Since we assume no ties, the edges of R(D, t) are labeled sequentially according to order. An example is given in Fig. 1.

Clusters found by the single-link method are components (connected subgraphs) of some threshold graph and each component of a threshold graph is a single-link cluster. Clusters identified by the complete-link method are cliques (maximal complete subgraphs)


of some threshold graph, but not all cliques are complete-link clusters. Graph-theoretic concepts of connectedness, such as k-edge and k-node connectedness, have also led to cluster definitions based on random graphs.(37,46)

A rank matrix is a symmetric, n × n matrix with arbitrary diagonal entries and all integers from 1 to n(n − 1)/2 above the diagonal. The entries are the rank orders of the entries in a proximity matrix. Figure 2(a) shows the rank matrix derived from the proximity matrix in Fig. 1(a). Single-link and complete-link dendrograms based on this rank matrix are given in Figs 2(b) and (c). Note that all single-link clusters, except the conjoint cluster, are formed by level 8, whereas level 26 is required to establish all complete-link clusters. Note also that subgraph (1, 2, 3) of the threshold graph in Fig. 1(b) is a clique but not a complete-link cluster. Cutting a dendrogram horizontally creates a clustering, or partition of the data items. The two methods identify different clusters and suggest different clustering structures. Hubert's definition of clustering method (37) establishes a new level in the dendrogram only when a new cluster is formed. However, the clusters themselves are precisely the same as those in Figs 2(b) and (c). Applying the single-link or complete-link methods to the proximity matrix in Fig. 1(a) would change the levels at which clusters form in Figs 2(b) and (c) but not the clusters themselves.

The null hypothesis most commonly used with ordinal data is called the random graph hypothesis. This 'no-clustering' hypothesis states that a rank matrix, D_r, is chosen at random from among all [n(n − 1)/2]! possible rank matrices.(52,53) Choosing D_r at random is equivalent to imposing a random ordering on the edges of a complete graph on n nodes. In this paper, an (n, N) random graph is a subgraph obtained by choosing a rank matrix, D_r, at random, forming the rank graph R(D_r, N), and ignoring the edge weights. In other words, an (n, N) random graph is a threshold graph on n nodes whose N edges are inserted at random. There are

\binom{n(n-1)/2}{N}

distinct (n, N) random graphs. The notation n:m denotes the number of ways m items can be chosen from a population of size n without replacement.


Fig. 1. (a) Dissimilarity matrix, D. (b) Threshold graph T(D, 25), nodes numbered. (c) Rank graph R(D, 25), edges weighted.

The notation \binom{n}{m} is also used for this purpose. All possible (4, 3) random graphs are depicted in Fig. 3(a). A total of N! rank graphs can be formed for each (n, N) random graph, one for each assignment of ordinal edge weights. There are

N! \binom{n(n-1)/2}{N}

distinct (n, N) rank graphs. The six rank graphs from one of the (4, 3) random graphs are shown in Fig. 3(b). A random rank graph will refer to a random graph whose edges are ordered. A random (n, N) rank graph can also be evolved by starting with n nodes and inserting N edges, one at a time at random, and labeling them as to order of insertion. The random graph hypothesis can be expressed by the statement that all (n, N) rank graphs achieved this way are equally likely.

Erdős and Rényi (54-56) studied the asymptotic behavior of various graph properties as the number of nodes in a random graph increased. In many cases, they also gave exact expressions for the probability of special subgraphs, which provide the basis for several tests of cluster validity. Abraham (57) used graph-theoretic notions to define several different types of clusters and to investigate the significance of clustering tendency. Errors in his asymptotic expressions limit the usefulness of the results.(58) Rapoport and Fillenbaum (36) applied several results from Erdős and Rényi to test for non-randomness using degree sequence, occurrence of cycles of order 3 and 4, and the number of edges required to connect the graphs. They suggested the difference between the mean rank of edges within clusters and the mean rank of edges between clusters as a statistic for cluster validity. Unfortunately, the distribution which they adopted is appropriate only if the cluster being tested is selected independently of the graph which determines the statistic. In their case, the cluster was chosen by selecting a good cluster on the basis of the graph, so the distribution on which the test is based is not the appropriate one. Ling (52) defined an isolation index, called lifetime, for single-link clusters.
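As an illustrative aside (ours, not part of the original text), the counts just given are easy to check numerically. The sketch below uses only the Python standard library and function names of our own choosing, assuming the formulas as reconstructed above.

from math import comb, factorial

def num_random_graphs(n, N):
    # number of (n, N) random graphs: choose N of the n(n-1)/2 possible edges
    return comb(n * (n - 1) // 2, N)

def num_rank_graphs(n, N):
    # each random graph yields N! rank graphs, one per ordering of its edges
    return num_random_graphs(n, N) * factorial(N)

print(num_random_graphs(4, 3))   # 20 labeled (4, 3) random graphs
print(num_rank_graphs(4, 3))     # 120 = 20 * 3! labeled (4, 3) rank graphs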


Fig. 2. (a) Rank matrix corresponding to Fig. 1(a). (b) Single-link dendrogram. (c) Complete-link dendrogram.


Fig. 3. (a) All (4, 3) threshold graphs. (b) All rank graphs for one threshold graph.

The lifetime is the rank at which the cluster is absorbed by the creation of another cluster, less the birth rank. For example, the lifetime of cluster (2, 3, 4) in Fig. 2(b) is two. Ling determined the distribution of this index under the random graph null hypothesis for a given cluster size and a given birth rank. Although the lifetime index is defined for clusters obtained by any hierarchical technique, Ling's distribution is specific to single-link clusters.

Several authors have obtained results for the expected value of the number of edges needed to connect a random graph. Rapoport and Fillenbaum (36) used Erdős and Rényi's asymptotic form for small graphs. Schultz and Hubert (59) used a Monte Carlo simulation to show that the asymptotic form was not accurate for small graphs. Ling (58) and Ling and Killough (60) adopted exact results due to Riddell and Uhlenbeck (61) to produce expressions of greater accuracy for small graphs and tables of accurate results.

The number of edges needed to connect a graph is an index of clustering tendency, not an indication of compactness or isolation for a particular cluster. It measures the clustering tendency at only one rank in the evolution of the graph. A test which is applicable at all ranks uses the number of components in a graph. Ling (62) gives expected values for this index but not the complete distributions, so it is difficult to base a test of significance on this index.

Baker and Hubert (63) used the random graph null hypothesis to study the power of a test of clustering based on the Goodman-Kruskal gamma statistic for partitions from single-link and complete-link hierarchies. Hubert (64) applied the gamma statistic to measure rank correlation between the proximity ranks, obtained from the given proximity matrix, and partition ranks, derived from the cluster hierarchy.


The gamma statistic tests the global fit of a proximity matrix to the hierarchy of clusters given by the single-link or complete-link clustering techniques. Baker and Hubert (50) also used a Monte Carlo simulation under the random graph hypothesis to find the distribution of an isolation index defined for a partition into complete-link clusters. The index is the number of extraneous edges, or edges which are not internal to some complete-link cluster. They proposed a test of goodness-of-fit for a complete-link clustering in which the observed number of extraneous edges after each new complete-link cluster forms is evaluated by reference to tables produced by their simulation for clusters on 8, 12 and 16 nodes.

Matula (46) found the distribution of the size of the largest clique (maximal complete subgraph) for a random edge graph. A random edge graph is a graph in which each node pair has probability p of being chosen as an edge. The number of edges is random, with expected value pn(n − 1)/2. The distribution of largest clique size, or clique number, is quite peaked. Chance occurrence of a clique which is more than a few nodes larger than the expected size is quite unlikely. Since every complete-link cluster is a clique, this distribution can be applied to validate complete-link clusters, as explained in Section IV.

III. CLUSTER PROFILES

The work to date in cluster validity based on random graphs is limited in two ways. First, with the exception of Matula's (46) criterion, indices of cluster validity have been appropriate only for clusters found by particular clustering methods. Second, with the exception of Ling's (52) lifetime criterion and Matula's criterion, the distributions of validity indices were obtained by simulation. Such distributions can test clusters only when both the size of the cluster and the size of the data set being studied match those of the simulation. Thus, each situation would need to be simulated. The cluster validity tests proposed in this section can be applied to any subset of data items and do not require simulations.

Ling's isolation index for single-link clusters (52) involves only two rank graphs, the one in which the cluster first appears and the one in which it is absorbed. The Baker and Hubert isolation index for complete-link clusters (50) uses only the rank graph in which the cluster is formed. By contrast, cluster profiles look at all the rank graphs and can be applied to clusters formed by any clustering method. Indices of compactness and isolation are defined in the first subsection and the raw cluster profile is introduced. A more readily interpreted picture of cluster validity is established with the probability profile in the second subsection. Strategies for using cluster profiles are suggested in the last subsection.

(a) Raw profiles

A raw cluster profile is a sequence of values for a

validity index, generated as the threshold graphs evolve, that documents the interaction between a cluster and its environment. Observing the variations in isolation and compactness indices for a particular cluster as edges are added in the order dictated by the proximity matrix provides a comprehensive picture of cluster validity. We now define isolation and compactness indices that are easy to compute and have meaning in all clustering problems.

Let D = (d_ij) be an n × n proximity matrix and let A be a set of data items proposed as a cluster. Partition the n(n − 1)/2 distinct proximities in D, or the edges in the complete threshold graph, as follows:

D_tot = {d_ij : i < j}   (all edges)
D_in(A) = {d_ij : i < j, i ∈ A, j ∈ A}   (internal edges)
D_out(A) = {d_ij : i < j, i ∉ A, j ∉ A}   (external edges)
D_bet(A) = D_tot − D_in(A) − D_out(A)   (linking edges).

Let |H| denote the number of items in the finite set H.

Definition 1. The compactness index for cluster A at threshold t based on proximity matrix D is

e_A(t) = | {d_ij : i < j, d_ij < t} ∩ D_in(A) |.

Definition 2. The isolation index for cluster A at threshold t based on proximity matrix D is

b_A(t) = | {d_ij : i < j, d_ij < t} ∩ D_bet(A) |.

Letting k = |A|, the number of nodes (or data items) in A, it follows that

0 ≤ e_A(t) ≤ |D_in(A)| = k(k − 1)/2;   0 ≤ b_A(t) ≤ |D_bet(A)| = k(n − k).

Thus, e_A(t) is the number of internal edges, or the number of edges in threshold graph T(D, t) between node pairs in A, while b_A(t) is the number of edges in T(D, t) between a node in A and one outside of A, or the number of linking edges. A 'compact' cluster A has a relatively large e_A(t) and an 'isolated' cluster has a small b_A(t) for several values of t. A 'valid' cluster is both compact and isolated, the ideal being Van Rijsbergen's perfect cluster, which for some t satisfies

b_A(t) = 0;   e_A(t) = k(k − 1)/2.

Definition 3. Raw cluster profiles for a k-node cluster A are graphs of e_A(t) and b_A(t) vs. t. It is usually more convenient to plot a raw cluster profile on a rank scale than on a proximity scale.
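As a hedged illustration (ours, not the authors'), the raw-profile indices of Definitions 1 and 2 can be computed directly from a dissimilarity matrix. The sketch below assumes a symmetric matrix with distinct off-diagonal entries (no ties); the function names are hypothetical.

import numpy as np

def raw_profile_at(D, A, t):
    # return (e_A(t), b_A(t)) for cluster A at threshold t
    n = D.shape[0]
    A = set(A)
    e = b = 0
    for i in range(n):
        for j in range(i + 1, n):
            if D[i, j] < t:
                in_A = (i in A) + (j in A)
                if in_A == 2:
                    e += 1      # internal edge: both endpoints in A
                elif in_A == 1:
                    b += 1      # linking edge: exactly one endpoint in A
    return e, b

def raw_profile(D, A):
    # raw cluster profile on a rank scale: indices after each edge insertion
    n = D.shape[0]
    cuts = np.sort(D[np.triu_indices(n, k=1)])
    # a tiny epsilon includes the edge whose dissimilarity equals the cut,
    # since the definitions use a strict inequality d_ij < t
    return [raw_profile_at(D, A, t + 1e-12) for t in cuts]

# profile = raw_profile(D, [0, 1, 2, 3, 4, 5])   # hypothetical six-point cluster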


Figure 4 shows a threshold graph for an artificial data set of 25 points in two dimensions using Euclidean distance as the proximity measure and a threshold scaled to one inch. The six circled points define the cluster A to be studied. Examination of the sequence of threshold graphs shows that A is both a single-link and a complete-link cluster. However, Fig. 4 shows that A is not a Van Rijsbergen perfect cluster. Intuition suggests that A is certainly a reasonably valid cluster. The raw profiles for cluster A are sketched in Fig. 5 for every fifth rank as functions of rank-ordered distance. Also plotted are the number of edges which are neither internal to A nor link A to its environment. The compactness index e_A(t) rises rapidly to its maximum of 15, suggesting a relatively compact cluster. The isolation index b_A(t) is initially small and then increases at a fairly constant rate.

The most distinctive feature of a raw cluster profile is its ability to picture the interaction of a cluster with its environment as the threshold graph evolves. However, this example demonstrates that a raw cluster profile, such as Fig. 5, provides only qualitative or relative evidence of compactness and isolation. Is the rapid increase of compactness index e_A unusual? Suppose several 25-point data sets were generated and compact-looking clusters of six points were identified. Might we expect clusters as compact or isolated as the cluster in Fig. 4? A probability scale is required to answer these questions quantitatively. We need to determine the probability that an index of validity is as large or as small as an observed index. Choosing a meaningful probability measure is not a trivial task. For example, the cluster in Fig. 4 will certainly appear to be compact if compared to a randomly chosen subset of six points. However, we did not choose the cluster in Fig. 4 at random.

Fig. 4. Threshold graph.



Fig. 5. Raw cluster profile.

We chose a 'good' cluster, which may not be unusually compact when compared to the most compact six-point subset in other threshold graphs of the same size.

(b) Probability profiles

We have demonstrated the utility of a raw cluster profile in qualitatively evaluating cluster validity and the need for a probability scale for quantitatively assessing the isolation and compactness of a cluster. A probability scale requires selection of a probability measure over the population of clusters. The interpretation of cluster validity criteria is directly related to the choice of probability measure. For example, the cluster in Fig. 4 will certainly appear compact when all six-point clusters are assumed to be equally likely. The question is whether it will still appear to be compact if the only six-point subsets in a random graph assigned non-zero probability are those with large indices of compactness. This choice of probability measure is the core of our approach and the most salient aspect of cluster profiles. A probability profile for a cluster shows, for each rank, the probability that an index of clustering is as good or better than the observed index under a model of 'no clustering'. The number, n, of nodes and the number, N, of edges in a threshold graph are fixed in our discussion as is the number, k, of nodes in the cluster, A, under investigation. All possible k-node subsets in all possible (n, N) random graphs form the underlying population, or sample space. Several probability measures are suggested below over this sample space. For example, Fig. 6 shows the six possible two-node subset specifications in one of the (4, 3) rank graphs of Fig. 3(b). The random experiment is to choose a k-node subset and compute its validity indices. Two random variables are now defined. Let B be the


number of linking edges, or the isolation index, for a k-node subset and let E be the number of internal edges, or compactness index. The distributions of random variables B and E quantify the raw profiles. The underlying probability measure P makes all k-node subsets in (n, N) random graphs equally likely.

The isolation index B counts the number of linking edges but ignores the ranks of the edges, so several k-node subsets will have the same isolation index. Rather than counting the number of subsets, we focus on the edges and employ the hypergeometric model (see the appendix), which shows that

P(B = b) = \binom{k(n-k)}{b} \binom{n:2 - k(n-k)}{N - b} / \binom{n:2}{N}.

The smaller the value of B, the fewer edges link the subset to its environment and the more isolated the subset. Thus, we can establish an isolation criterion as

P(B ≤ b_A) = Σ_{b=0}^{b_A} P(B = b).   (1)

The probability mass function for the compactness random variable E is determined in the same way (see the appendix):

P(E = e) = \binom{k:2}{e} \binom{n:2 - k:2}{N - e} / \binom{n:2}{N}.

The more internal edges there are, the larger the value of E and the more compact the subset. The compactness criterion for cluster A having compactness index e_A can thus be written as

P(E ≥ e_A) = Σ_{e=e_A}^{k:2} P(E = e).   (2)

The criteria in equations (1) and (2) indicate the trend of our development but are not themselves very helpful in validating clusters. Unless a cluster is specified before any data are observed or before a clustering algorithm is employed, equations (1) and (2) will provide over-optimistic assessments of cluster validity. Other criteria that overcome this difficulty are proposed below.

Suppose we assign zero probability mass to all k-node subsets in an (n, N) random rank graph which cannot be 'reached', or achieved, by a specific clustering method and make the remaining k-node subsets equally likely. Such a distribution is called a CM-reachable distribution. For example, consider the (8, 12) rank graph obtained by selecting the first twelve edges from the rank matrix in Fig. 2(a). There are

\binom{28}{12} = 30,421,755

such rank graphs. This particular rank graph leads to \binom{8}{3} = 56 3-point subsets. One of these is the subset (2, 3, 4) and another is (1, 8, 7). Figure 2(b) shows that these are the only three-node single-link clusters, so only two of the 56 3-point subsets are single-link-reachable at rank 12 based on the proximity matrix in Fig. 2(a). Only one of the 56 sample points, the one which specified subset (5, 6, 7), is complete-link-reachable at level 12.

The random experiment is to select an (n, N) rank graph at random and apply the clustering method in question. Identify all k-node clusters so obtained, select one of them at random, and compute its validity index. If no k-node clusters are generated by the clustering method, select another (n, N) rank graph at random and repeat. We assume proper ranges for the values of n, N, and k.
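A brief computational aside (ours): the unconditional criteria of equations (1) and (2) are cumulative hypergeometric probabilities and can be evaluated exactly with the Python standard library. The helper names below are our own; k:2 and n:2 denote k(k-1)/2 and n(n-1)/2 as in the text.

from math import comb

def p_B_le(n, N, k, b_A):
    # P(B <= b_A): at most b_A of the N edges fall among the k(n-k) linking slots
    total, linking = n * (n - 1) // 2, k * (n - k)
    return sum(comb(linking, b) * comb(total - linking, N - b)
               for b in range(0, min(b_A, N, linking) + 1)) / comb(total, N)

def p_E_ge(n, N, k, e_A):
    # P(E >= e_A): at least e_A of the N edges fall among the k:2 internal slots
    total, internal = n * (n - 1) // 2, k * (k - 1) // 2
    return sum(comb(internal, e) * comb(total - internal, N - e)
               for e in range(e_A, min(internal, N) + 1)) / comb(total, N)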

Fig. 6. All two-node subsets of a (4, 3) rank graph.

Criteria of isolation and compactness for cluster A can now be defined as the conditional probabilities P(B ≤ b_A | G_CM) and P(E ≥ e_A | G_CM), respectively. Here, G_CM is the set of all k-node clusters in (n, N) random rank graphs that can be achieved by the clustering method in question. Analytical expressions for these probabilities are not available for any of the common clustering methods, so they must be determined by Monte Carlo simulations. The fact that three parameters (n, N, and k) are involved implies that several simulations would be required for each application. Baker and Hubert (50) used the complete-link-reachable distribution to develop a test of cluster validity. Even if it could be derived, a CM-reachable distribution would be specific to a particular clustering method.

We now propose a distribution that has meaning for any clustering method. The random experiment is to select an (n, N) rank graph and reserve all k-node subsets for which the isolation index is minimum, or 'best'. All k-node subsets on (n, N) rank graphs which could be reserved in this way form the population, B*, to which non-zero probability is assigned. Similarly, define the population, E*, of most compact k-node subsets on (n, N) rank graphs. Probability distributions conditioned on B* and E* are called best-case probability distributions. A measure of isolation for cluster A with isolation index b_A is

P(B ≤ b_A | B*).

Similarly, the compactness of cluster A with compactness index e_A can be assessed by

P(E ≥ e_A | E*).

Exact expressions for these probabilities are not available, but the following theorem provides bounds on which tests of cluster validity can be based. The tightness of these bounds is studied elsewhere.(30,31)

Theorem. The best-case validity measures satisfy the following bounds:

P(B ≤ b | B*) ≤ \binom{n}{k} P(B ≤ b), any b;
P(E ≥ e | E*) ≤ \binom{n}{k} P(E ≥ e), any e.

Proof. From elementary probability theory,

P(B ≤ b | B*) = P({B ≤ b} ∩ B*) / P(B*) ≤ P(B ≤ b) / P(B*).

Since every threshold graph has at least one most isolated subset and there are \binom{n}{k} k-node subsets, P(B*) ≥ 1/\binom{n}{k}, which gives the first bound. Similarly,

P(E ≥ e | E*) ≤ P(E ≥ e) / P(E*) ≤ \binom{n}{k} P(E ≥ e).

Replacing b and e with the indices b_A and e_A for a particular cluster and applying (1) and (2) produces the best-case validity criteria, which are now formally defined.

Definition 4. The best-case isolation criterion for the k-node cluster A in an (n, N) rank graph having isolation index b_A is

I1(A) = \binom{n}{k} Σ_{b=0}^{b_A} \binom{k(n-k)}{b} \binom{n:2 - k(n-k)}{N - b} / \binom{n:2}{N}.   (3)

Definition 5. The best-case compactness criterion for the k-node cluster A in an (n, N) rank graph having compactness index e_A is

C1(A) = \binom{n}{k} Σ_{e=e_A}^{k:2} \binom{k:2}{e} \binom{n:2 - k:2}{N - e} / \binom{n:2}{N}.   (4)

Conditions other than the CM-reachable and the best-case conditions can be imposed to define validity measures. Two that are useful fix one index and depend on the distribution of the other.

Definition 6. Let A be a k-node cluster in an (n, N) rank graph having isolation index b_A and compactness index e_A. The fixed-compactness isolation criterion for A is

I2(A) = P(B ≤ b_A | E = e_A)

and the fixed-isolation compactness criterion for A is

C2(A) = P(E ≥ e_A | B = b_A).

These criteria can be most easily derived by recourse to the hypergeometric model (see the appendix):

I2(A) = Σ_{b=0}^{b_A} \binom{k(n-k)}{b} \binom{(n-k):2}{N - e_A - b} / \binom{n:2 - k:2}{N - e_A}   (5)

C2(A) = Σ_{e=e_A}^{k:2} \binom{k:2}{e} \binom{(n-k):2}{N - b_A - e} / \binom{k:2 + (n-k):2}{N - b_A}.   (6)

Probability profiles plot the conditional probabilities of the events (B ≤ b_A) and (E ≥ e_A) as functions of N. Of particular interest are the conditions in equations (3)-(6).

Definition 7. Probability profiles for a k-node cluster A are graphs of I1(A), I2(A), C1(A), and C2(A) as functions of rank N.

The probability profiles, hereafter called profiles, for the six-node cluster in Fig. 4 are given in Fig. 7, plotted for every fifth rank on a log scale. Since C1 is low, the cluster is unusually compact; a large I1 shows that A is not unusually isolated when compared to the most isolated subsets, but I2 demonstrates that A is quite isolated when compared to clusters as compact as A. These observations support the conclusion that A is a valid cluster and demonstrate that probability profiles permit objective assessment of cluster validity.
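As a hedged sketch (ours, following the reconstructed equations (3)-(6) above), the four profile criteria reduce to elementary binomial sums. The names are hypothetical and the routine assumes valid ranges of n, N, k, b_A, and e_A.

from math import comb

def _hyper_sum(good, bad, draws, lo, hi):
    # sum of hypergeometric probabilities for lo..hi 'good' items among the draws
    hi = min(hi, good, draws)
    denom = comb(good + bad, draws)
    return sum(comb(good, x) * comb(bad, draws - x) for x in range(lo, hi + 1)) / denom

def probability_profile(n, N, k, b_A, e_A):
    # return (I1, C1, I2, C2) at rank N for a k-node cluster with indices b_A, e_A
    total = n * (n - 1) // 2
    internal = k * (k - 1) // 2           # k:2 internal slots
    linking = k * (n - k)                 # linking slots
    outside = (n - k) * (n - k - 1) // 2  # (n-k):2 outside slots
    I1 = comb(n, k) * _hyper_sum(linking, total - linking, N, 0, b_A)           # eq. (3)
    C1 = comb(n, k) * _hyper_sum(internal, total - internal, N, e_A, internal)  # eq. (4)
    I2 = _hyper_sum(linking, outside, N - e_A, 0, b_A)                          # eq. (5)
    C2 = _hyper_sum(internal, outside, N - b_A, e_A, internal)                  # eq. (6)
    return I1, C1, I2, C2

A profile is then the sequence of these four values over the ranks N of interest. Note that I1 and C1 are upper bounds and may exceed one.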


(c) Strategies for applying probability profiles

Which validity measure should be applied in a particular situation? The answer depends on the prior information available, on the cluster being investigated, and on computational matters.

Suppose a cluster is defined strictly from pattern class labels or from intuition, independent of any proximity measures or clustering methods. (We might argue that the term 'cluster' should not be used for a set of data items chosen in this way.) We refer to such a cluster as an 'a priori' cluster. Validity criteria (1) and (2), which are based on unconditional probabilities, are appropriate for a priori clusters. That is, the null hypothesis that all proximity matrices are equally likely seems reasonable when a proximity matrix was not consulted to form the cluster.

The unconditional probabilities certainly are not appropriate for a cluster chosen by observing a proximity matrix, especially via a clustering method. The fact that a subset of nodes is identified as a cluster means that it satisfies some criterion and is almost certain to be more isolated or compact than a randomly chosen subset. One alternative is to use a CM-reachable validity criterion. Obtaining the distribution of a CM-reachable validity criterion is extremely tedious and expensive, even for moderate n, since the distribution must be developed by Monte Carlo simulations. The existence of upper bounds on the best-case probabilities makes equations (3) and (4) much more attractive than the CM-reachable validity criteria. The best-case criteria provide fair, although conservative, tests for clusters formed in any way, even by clustering methods, since they compare isolation and compactness indices to those for the best clusters in a randomly selected graph. The best-case validity criteria and profiles provide a more comprehensive and computationally attractive evaluation of cluster properties than anything available in the literature.

Validity criteria I2 and C2 in equations (5) and (6) compare one validity index for cluster A to indices for all clusters having the same value of the other index as A. Profiles I2 and C2 can be reasonably applied to clusters defined by any clustering method or to a priori clusters.

A general strategy for applying profiles to an observed cluster A is to first look for small values of I1(A) and C1(A), say 10^-3 or less. The value of I1(A) at rank N is an upper bound on the size (false alarm probability) of the test that isolation index B is governed by the best-case probability measure. Thus, a small I1(A) provides evidence that A is unusually isolated at a particular rank. If I1(A) is small over a span of ranks, we call A an isolated cluster. Compactness is measured by C1 in an analogous fashion. The most favorable results are the simultaneous occurrence of small I1 and C1 over wide ranges of ranks, in which case we label the cluster 'valid'.

A cluster might be compact (small C1) but not isolated. Profile I2 provides a weaker test for isolation than I1, since a small I2 means that the cluster being tested is isolated among all clusters as compact as that cluster. Similarly, an isolated but not compact cluster can be investigated for compactness among clusters as isolated as the one in question. Thus, profiles I2 and C2 can establish a different level of validity than I1 and C1. If both I1 and C1 are large, except for spurious small values at a few ranks, the cluster cannot be called isolated or compact, according to our criteria.

A word of caution is in order concerning the meaning of 'valid'. A cluster judged valid by our method has an unusual structure in that it is much more compact and/or isolated than the best cluster in a random graph. This does not always imply that all nodes in the cluster have the same levels of mutual adhesion or the same measure of detachment from the rest of the graph.


Fig. 7. Probability profiles of a cluster.

For example, a cluster consisting of the union of two very compact clusters might exhibit unusually compact structure because of the behavior of the individual clusters, even though the two clusters are not as closely related as nodes in the individual clusters.

Cluster validity indices based on random graphs must be interpreted with care. First, the intrinsic dimensionality of the set of data items can make the basic assumption of equally likely proximity matrices unreasonable. Constraining the data items to a space having few dimensions limits the number of possible proximity matrices. Second, Ling (53) points out that validity indices based on random graphs are better at demonstrating the absence of clustering than at establishing valid clusters. A cluster judged not valid under the random graph hypothesis is probably not valid under any reasonable hypothesis. However, rejecting the random graph hypothesis could mean that this hypothesis is not an appropriate statement of no-clustering. The statistical powers of the tests proposed here have not been studied because meaningful alternative hypotheses have not been formulated.

We emphasize that our validity measures investigate clusters in isolation. We have not attacked validity questions for several clusters at once. Ling (53) noted that when his isolation index for single-link clusters was applied to all single-link clusters generated from a seemingly random proximity matrix, a few 'valid' clusters usually surfaced. We demonstrate later that profiles avoid this pitfall.

IV. EXAMPLES AND APPLICATIONS

Probability profiles permit one to examine the isolation and compactness of a cluster as the threshold graph grows around the cluster. The ability of profiles to quantify the dynamic behavior of a cluster guards against accepting spurious clusters as valid and provides a more thorough assessment of cluster validity than is currently possible. The first subsection uses artificially generated data to exhibit characteristics of profiles. The second subsection demonstrates the application of profiles to a speaker recognition study.

(a) Artificial data

Two experiments were performed with artificially generated data sets. Our objectives were to see how well probability profiles emphasized the isolation and compactness of 'good' clusters and to determine if spurious clusters could be detected. The first experiment requires randomly selected proximity matrices in which hierarchical structures 'fit' the proximity matrices poorly and in which no real clusters should be found. The second experiment imposes Baker's (65) binary structures to obtain good clusters and proximity matrices well suited to single-link and complete-link hierarchies. Specifically, we began with the 'partition ranks' from the binary hierarchy of Fig. 8 and added random numbers to break all ties but not otherwise alter rank order, as explained next.


Fig. 8. Dendrogram for ideal binary structure, n = 16.

A partition rank for a pair (i, j) of data items is the level at which the two data items first come together in a dendrogram. For example, the ideal binary structure in Fig. 8 has 15 levels; the partition rank of data items (1, 2) is one, that of data items (3, 4) is two, and that of data items (1, 3) is nine. A dendrogram assigns n(n − 1)/2 partition ranks, but no more than (n − 1) distinct partition ranks exist. A rank matrix formed from partition ranks would, thus, contain many ties. Adding random numbers much smaller than one to each partition rank and re-ranking the entire matrix eliminates the ties but maintains the relative rank orders of the original partition ranks.

Characteristics of these artificially generated rank-order proximity matrices are summarized in Table 1. The tendency of the data to cluster is assessed by observing the rank V at which the rank graph based on each proximity matrix first becomes connected. The probability, assuming a random graph hypothesis, that V is as large as the observed value, V_obs, is determined from the distribution given by Ling and Killough.(60) Since edges interior to clusters tend to occur at lower ranks than edges between clusters when real clusters are present, a large value of V_obs suggests a tendency to cluster. The numbers in Table 1 indicate that the randomly selected data sets show little clustering tendency while the binary data sets show a strong tendency to cluster, as expected.

Hubert's gamma statistic (64) measures the global fit of a hierarchy to a proximity matrix. A value of 1 indicates 'perfect' fit, or complete agreement between proximity ranks and partition ranks. A 'proximity rank' is an entry in the rank matrix derived from the given proximity matrix. A partition rank, as explained above, is derived from a dendrogram. Here, the dendrogram is created by applying a clustering method to the proximity matrix. Since no more than (n − 1) distinct partition ranks exist, many ties will be found, which are, in effect, ignored by the gamma statistic. Hubert provides simulated percentage points for the gamma statistic under the random graph hypothesis for several values of n under single-link and complete-link hierarchies. Table 1 shows the observed and median (fiftieth percentile) values of gamma. The medians for n = 16 were obtained from Hubert's Table 3, while those for n = 48 relied on Hubert's conjecture that the quantity nγ − a[ln(n)] has an asymptotic N(0, 1) distribution, with a = 1.1 for the single-link method and a = 1.8 for the complete-link method; the median of γ is thus approximately a[ln(n)]/n. We conclude from Table 1 that the hierarchies do not fit the randomly selected data well but provide perfect fits to the binary data, as expected.
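As a hedged illustration of the tie-breaking procedure just described (our reading of it; the names and the noise scale are hypothetical), BIN16-style rank matrices can be generated as follows.

import numpy as np

def binary_partition_ranks(m):
    # partition ranks of an ideal binary dendrogram on n = 2**m items,
    # with merges numbered left to right within each tier (cf. Fig. 8)
    n = 2 ** m
    R = np.zeros((n, n))
    level, size = 0, 1
    while size < n:
        for start in range(0, n, 2 * size):
            level += 1
            for i in range(start, start + size):
                for j in range(start + size, start + 2 * size):
                    R[i, j] = R[j, i] = level
        size *= 2
    return R

def noisy_rank_matrix(R, rng):
    # add small random numbers to break ties without altering rank order,
    # then re-rank all n(n-1)/2 upper-diagonal entries from 1 upward
    n = R.shape[0]
    iu = np.triu_indices(n, k=1)
    vals = R[iu] + rng.uniform(0.0, 1e-3, size=len(iu[0]))
    ranks = vals.argsort().argsort() + 1
    M = np.zeros_like(R)
    M[iu] = ranks
    return M + M.T

# BIN16 = noisy_rank_matrix(binary_partition_ranks(4), np.random.default_rng(0))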


Table 1. Characteristics of artificial data

Proximity                                 Clustering tendency          Global fit
matrix   Size  Generation           V_obs  P(V ≥ V_obs)    Clustering method  γ_obs   Median
RAN16    16    Random selection     29     0.18            Single-link        0.199   0.21
                                                           Complete-link      0.306   0.31
RAN48    48    Random selection     95     0.6             Single-link        0.0779  0.0887
                                                           Complete-link      0.126   0.145
BIN16    16    Binary plus noise    57     < 10^-3         Single-link        1.00    --
                                                           Complete-link      1.00    --
BIN48    48    Binary plus noise    617    < 10^-3         Single-link        1.00    --
                                                           Complete-link      1.00    --
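A quick numerical check (ours) of the median approximation used for n = 48, assuming the reconstructed form of Hubert's conjecture given above:

from math import log
print(1.1 * log(48) / 48)   # approx. 0.0887, the single-link median in Table 1
print(1.8 * log(48) / 48)   # approx. 0.145, the complete-link median in Table 1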

We now investigate the validity of individual clusters. The results reported in Table 1 suggest that the randomly selected proximity matrices should contain no valid clusters, while the binary matrices should be investigated further. Dendrograms for proximity matrix RAN16 are shown in Fig. 9. Note the difference in scale for the two dendrograms.


Fig. 9. (a) Complete-link dendrogram for RAN16. (b) Single-link dendrogram for RAN16.

The single-link dendrogram exhibits the chaining that is characteristic of the single-link method.

Profiles are applicable to clusters formed by any means whatever. We chose to investigate the 'best' single-link and complete-link clusters for examination. Ling's (52) index, the probability that a cluster's lifetime is greater than the observed lifetime under the random graph hypothesis, was the criterion for choosing single-link clusters. Since valid clusters should form earlier in the dendrogram and live longer than 'random' clusters, a small value for Ling's index suggests a good cluster. The birth ranks and lifetimes for all single-link clusters in RAN16 are given in Table 2 along with Ling's index. Cluster 9, also identified in Fig. 9(b), was chosen for study. Clusters 6, 9, and 12 are on the verge of being valid according to Ling's index.

The criterion for 'best' complete-link cluster was Matula's lower bound on the asymptotic distribution (46) for the size of the largest complete subgraph in a random graph. A small value again indicates a good cluster. Table 2 identifies all complete-link clusters; none of them are significant by Matula's index. We chose cluster 6, identified in Fig. 9(a), for further investigation.

Profiles for single-link cluster 9 and complete-link

cluster 6 are plotted in Fig. 10 at every fifth rank. Compactness criterion C1 and isolation criterion I1 are identically one, implying that neither cluster is valid. The variations in I2 and C2 demonstrate how randomly chosen clusters can exhibit some indication of compactness and isolation, which could lead to erroneous interpretation if C1 and I1 were ignored.

Table 2. Clusters from RAN16

(a) Single-link
Cluster  Birth  Lifetime  Size  Ling index
1        1      6         2     0.196
2        2      3         2     0.441
3        3      1         2     0.761
4        4      1         3     0.664
5        5      1         5     0.522
6        6      4         6     0.054
7        7      2         3     0.428
8        8      1         2     0.750
9        9      6         5     0.023
10       10     1         7     0.427
11       11     1         8     0.413
12       12     3         9     0.076
13       15     7         14    0.111
14       22     7         15    0.303
15       29     --        16    --

(b) Complete-link
Cluster  Birth  Lifetime  Size  Matula index
1        1      16        2     0.502
2        2      50        2     0.670
3        3      16        2     0.755
4        8      55        2     0.896
5        13     69        2     0.936
6        17     95        3     0.488
7        19     74        3     0.550
8        39     54        2     0.983
9        52     59        3     0.900
10       63     19        3     0.931
11       82     29        5     0.639
12       93     19        5     0.768
13       111    9         8     0.575
14       112    8         8     0.615
15       120    --        16    --

Fig. 10. (a) Profiles for SL cluster 9, RAN16. (b) Profiles for CL cluster 6, RAN16.


The same procedures were employed with proximity matrix RAN48. The two 'best' single-link clusters and the 'best' complete-link cluster are described in Table 3; the two single-link clusters could be considered valid, but not the complete-link cluster. Criteria C1 and I1 were identically one for clusters 16 and 36, except for one small deviation from one in C1 for cluster 16. Parts of the profiles for cluster 33 are shown in Fig. 11. Compactness criterion C1 is identically one. The part of isolation criterion I1 not equal to one is shown. The deviations from one are not significant. The variations in C2 show that cluster 33 is not very compact for a cluster as isolated as cluster 33. In summary, cluster 33 is neither compact nor isolated and, thus, is not valid.

We now investigate clusters from the binary data. We chose to study data with optimal global fit to separate the issue of global fit from that of validity of individual clusters. The interaction between these issues is beyond the scope of this paper. The procedure explained above was applied to BIN16. Table 4 summarizes cluster information. The clusters themselves are the same under single-link and complete-link clustering; only the birth ranks and lifetimes differ. The value of Matula's index responds more to compactness than to isolation. Ling's index is small for almost all the clusters. Profiles provide much more detailed information on cluster validity than these two indices, as we now demonstrate.

Cluster 7 exhibits the smallest Ling index among two-node clusters. The profiles for cluster 7 are sketched in Fig. 12 and show that cluster 7 is neither isolated nor compact. The profiles C1 and I1 for clusters 11, 12, and 14 are plotted in Fig. 13, with ordinates below 10^-3 shown as equal to 10^-3. Cluster 14 is unusually compact and even more unusually isolated, so we would label cluster 14 as valid. Ling's index and Matula's index are both small for cluster 14, which agrees with this conclusion. Note that Ling's index for cluster 7 (0.022) is comparable to that for cluster 14 (0.013) even though the two clusters exhibit dramatically different profiles. Clusters 11 and 12 demonstrate that Ling's index is primarily an index of isolation. Both clusters are unusually isolated [Fig. 13(b)] but neither is unusually compact [Fig. 13(a)], notwithstanding the small values of Matula's index in Table 4.

Cluster 14 is the union of clusters 11 and 12. Could the unusual structure of cluster 14 be caused by the structure in clusters 11 and 12 alone? Figure 13 suggests that a great deal of interaction occurs between points in clusters 11 and 12 when cluster 14 is formed.


Fig. 11. Profiles for cluster 33, RAN48.

Cluster 14 exhibits compactness and isolation for ranges in which clusters 11 and 12 are neither compact nor isolated. The effect of substructure on profiles is demonstrated further in Fig. 14, which plots the profiles of the union of the two most unrelated four-node clusters, 9 and 12, along with profiles of the individual clusters. Recall that the interpretation of profiles is unrelated to the origin of the cluster. We expect little interaction between the clusters. Figure 14(a) shows that the union exhibits none of the compactness of the individual clusters, while Fig. 14(b) shows that the isolation of the component clusters makes the union appear slightly isolated. A comparison of Figs 13(b) and 14(b) shows the reduction in the range of significantly small isolation indices when there is minimal interaction between subclusters.

Proximity matrix BIN48 exhibited many of the characteristics of BIN16. Table 5 summarizes the clusters. As with BIN16, single-link and complete-link clusters differ only in birth ranks and lifetimes.

Table 3. Clusters from RAN48

Cluster  Type  Birth  Lifetime  Size  Index
33       SL    33     62        2     0.0046 (Ling)
36       SL    40     8         33    0.0085 (Ling)
16       CL    36     449       3     0.3316 (Matula)


Fig. 12. Profiles for cluster 7, BIN16.

Matula's indices grouped according to cluster size and increased monotonically with cluster number in each group, the clusters being numbered in order of formation. All clusters except those of size 2 should be compact. Ling's index suggests that all clusters, with the possible exception of size-2 clusters, should be isolated. The profiles for cluster 29, which had the best Ling index among all clusters, show (31) that cluster 29 is somewhat compact and isolated, but C1 and I1 do not remain small for large ranges; cluster 29 is also relatively isolated for such a compact cluster, but is not nearly as strong a cluster as some we have discussed.


(b) Speaker recognition data

The data for a problem in speaker recognition consisted of 40 samples of speech, 10 samples each from four different male subjects. Each subject read five different pieces of material while being recorded in two ways. The sound was recorded directly and simultaneously transmitted over telephone lines and recorded. For each subject, we obtained five direct and five telephone samples. The recordings were then translated into choral speech (66) and Fourier analyzed to determine the energy in each of 2048 frequency bands. Each sample was normalized so that the maximum feature value for that sample was 1.00. All 2048 features were used to calculate a proximity matrix with Manhattan distance, viz.

d(i, j) = Σ_{m=1}^{2048} |F(i, m) − F(j, m)|,

where F(i, m) is the mth feature for the ith sample.
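As an aside (ours), the proximity computation just described amounts to a per-sample normalization followed by city-block distances; a minimal sketch with a hypothetical 40 × 2048 feature array F:

import numpy as np

def manhattan_proximity(F):
    F = F / F.max(axis=1, keepdims=True)                   # max feature per sample = 1.00
    return np.abs(F[:, None, :] - F[None, :, :]).sum(-1)   # 40 x 40 dissimilarity matrix

# D = manhattan_proximity(F)   # F[i, m]: energy of sample i in frequency band m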


Fig. 13. (a) Compactness criterion C1, BIN16 clusters. (b) Isolation criterion I1, BIN16 clusters.


Table 4. Clusters from BIN16

                       Single-link                  Complete-link
Cluster  Size    Birth  Life  Ling index      Birth  Life  Matula index
1        2       1      8     0.113           1      11    0.502
2        2       2      7     0.146           2      10    0.670
3        2       3      10    0.063           3      13    0.755
4        2       4      9     0.081           4      12    0.805
5        2       5      12    0.036           5      15    0.839
6        2       6      11    0.045           6      14    0.863
7        2       7      14    0.022           7      17    0.881
8        2       8      13    0.026           8      16    0.896
9        4       9      16    0.009           12     28    0.002
10       4       13     12    0.010           16     24    0.009
11       4       17     24    0.010           20     36    0.030
12       4       21     20    0.010           24     32    0.072
13       8       25     32    0.011           40     80    0.000
14       8       41     16    0.013           56     64    0.000
15       16      57     --    --              120    --    --

The data set for this study consists of the resulting 40 by 40 dissimilarity matrix. To help visualize the data, Kruskal's (67) MDSCAL program was used to create a configuration of 40 points in a low-dimensional space which tries to preserve the order of the proximities. The two-dimensional MDSCAL configuration (stress = 14%) is shown in Fig. 15. The marked cluster will be discussed below. The ten samples for each subject are numbered consecutively with the direct samples first. For example, point 22 is the second direct sample for subject 3. The low stress for this configuration implies that Fig. 15 is a good representation of the proximity matrix.

Three different clustering techniques were used to study the data set. The single-link and complete-link cluster hierarchies are shown in Figs 16 and 17. Several single-speaker clusters are evident in the dendrograms. Ling's k-clusters (53) were also investigated for k = 4 and meaningful groupings of five points were obtained. In addition to the striking appearance of five-point clusters in several of the clustering results, we note the split into two clusters of 20 points each which occurs in the single-link hierarchy. This is unusual in light of the single-link method's tendency to chain points into one large cluster.

Tests of cluster validity based on the random graph hypothesis were applied to many of these clusters. The V statistic for checking clustering tendency is V_obs = 223. From Ling and Killough,(60) Prob(V ≥ 223) = 0.00005. We conclude that the data set is not random under the random graph hypothesis. Hubert's gamma statistics, which have approximate N(0, 1) distributions under the random graph hypothesis, are 25.3 (single-link) and 23.6 (complete-link). Thus, the dendrograms in Figs 16 and 17 both fit the data well. Table 6 lists the numbers of nodes, ranks at formation, and cluster lifetimes, along with Ling indices

Birth

Life

Matula index

1 2 3 4 5 6 7 8 12 16 20 24 40 56 120

i1 10 13 12 15 14 17 16 28 24 36 32 80 64 --

0.502 0.670 0.755 0.805 0.839 0.863 0.881 0.896 0.002 0.009 0.030 0.072 0.000 0.000 --

for all single-link clusters with a Ling index of 0.05 or less. These clusters are identified in Figs 15 and 16. As expected, subclusters of five nodes from single subjects are important components of the data set. Four of the twelve clusters with significantly long lifetimes have five points and four others have multiples of five points. Furthermore, these eight clusters all consist of combinations of complete five-point groupings, where each five-point grouping contains all the samples from a single speaker using one mode of recording. The two twenty-point clusters consist of the direct-recording and the telephone-recording samples. Profiles were obtained for all single-link and complete-link clusters with four or more nodes and for all clusters which appeared in more than one k-clustering. In addition, profiles were calculated for all five-point subsets corresponding to a single subject and single mode of recording, and all ten-point subsets corresponding to a single subject. The indices and measures were calculated for every tenth rank from 10 to 780. The profiles for four of the five-point subsets for a single subject and single recording mode, circled in Fig. 15, are shown in Fig. 18. Cluster 25 shows good evidence of compactness, with C1 below 0.001 for ranks 40 to 100. The number of linking edges is not unusually low, but I2 is below 0.001 for ranks 40 to 70, so this subset is compact and, for a subset this compact, is somewhat isolated. Cluster 26 exhibits strong isolation but weak compactness by measure C1. Since C2 is below 0.001 for ranks 100 to 160, this subset is isolated, and compact for a subset this isolated. Cluster 27 is neither compact nor isolated. Since I1 and C1 are never small, we do not consider I2 or C2. Cluster 28 is more isolated than cluster 26. In addition, cluster 28 has several low values for C1, but the evidence for compactness is not as strong as it was for cluster 25. Cluster 28 is compact among clusters as isolated as cluster 26. Of these four clusters, number 25 is the most compact and number 28 is the most


isolated. Cluster 26 is also isolated, while cluster 27 is neither compact nor isolated.

Figures 15, 16 and 17 provide support for the conclusions reached by analyzing the probability profiles. Clusters 25, 26 and 28 are both single-link and complete-link clusters. Cluster 27 is neither a single-link nor a complete-link cluster, so it is neither compact nor isolated. If we use rank at birth as a measure of compactness, Table 6 shows that cluster 25 is the most compact and cluster 26 is the least compact, with cluster 28 falling in between. If we use the lifetime as a measure of isolation, then cluster 28 is the most isolated, cluster 25 is the least isolated and cluster 26 falls in between.

The data collection process suggests that the data set could be organized as two clusters of twenty points each (telephone and direct) or as four clusters of ten points each (by speaker). The profiles for the four ten-point subsets representing the four speakers show that none of the subsets are valid clusters. Isolation criterion I2 for the second speaker exhibits a few values below 0.01 at small ranks, but all other isolation criteria are close to one. Compactness criterion C1 never dips below 0.1 for any of the profiles. By contrast, the profiles for clusters 2 and 3 in Fig. 19 demonstrate a great deal of unusual structure. We conclude that the data are organized into two clusters according to mode of recording, but not into four clusters according to speaker.

Table 7 summarizes the profiles of clusters with more than four nodes which appear as single-link clusters (identified in Fig. 16) and as complete-link clusters (identified in Fig. 17). The clusters whose compactness criterion C1 (or C2) is less than 10^-5, and those which have an isolation measure I1 (or I2) less than 10^-5, are marked. With three exceptions, all the clusters with 10 or more points are both compact and isolated at the 10^-5 level for some rank. Only three clusters fail to be either compact or isolated. Of these, the most interesting is cluster 30, which appears as both a single-link and a complete-link cluster. This cluster is somewhat compact, with C1 below 10^-4, but I1 never falls below 10^-3. The Ling index for cluster 30, 0.000078, suggests that the cluster is isolated; in addition, I2 is below 10^-5. If we are willing to accept a cluster with C1 below 10^-4 as compact, then we also conclude that cluster 30 is isolated among clusters with the same compactness.

Clusters 22 and 25 are compact, but are not isolated at the 10^-5 level, even among clusters of their compactness. Cluster 25 is a single-link cluster with a lifetime of 22 and a Ling index of 0.0038. From the profiles we conclude that this low value is due to the compactness of the cluster, rather than its isolation. It is interesting to compare cluster 25 with the single-link cluster 21. The lifetime of cluster 21, seven, is much shorter than the lifetime of cluster 25, yet C1 and I1 are both lower for cluster 21 than for cluster 25. The profiles provide information on the interaction of the cluster with points in the data set at many different

Fig. 14. (a) Compactness criterion C1, BIN16 clusters. (b) Isolation criterion I1, BIN16 clusters.


Table 5. Summary of clusters from BIN48

Size  No. of clusters  Ling indices    Matula indices   SL lifetimes  CL lifetimes
  2         24         0.019-0.127     0.500-0.961         24-46         27-48
  4         12         0.0009-0.0011   0.00005-0.0132      44-88         60-100
  8          6         < 10^-3         < 10^-7             80-160       128-208
 16          3         < 10^-3         < 10^-7            128-320       320-768
 32          1         < 10^-3         < 10^-7               256           512


Fig. 15. MDSCAL configuration, speaker recognition data.

ranks and thus may provide a very different view of the cluster from a test which observes only a limited range of ranks. The profiles show that almost all the potential clusters identified by the clustering methods used here are significantly compact and isolated. This data set is well described by clusters. It is far from being a random data set under the random graph null hypothesis.

Aside from statements regarding the births and lifetimes of the clusters, which are useful only for comparisons among clusters, the only tools available in the literature for testing the validity of clusters derived from proximity matrices are the single-link lifetime test of Ling(53) and the complete-link extraneous edges tests of Baker and Hubert.(50) The extraneous edges test cannot be applied to a 40-node data set since the required Monte Carlo runs have not been published. The single-link cluster lifetime tests are inferior to cluster profiles in two ways. First, cluster profiles yield unique information about the compactness and isolation of a cluster, while the lifetime test combines the two requirements into a single test. Second, cluster profiles may be applied to any cluster, including complete-link clusters, while the lifetime test


Fig. 16. Single-link dendrogram, speaker recognition (SR) data.

is only applicable to single-link clusters. Another conclusion must be considered and cannot be easily dismissed. If the intrinsic dimensionality of the data set is two, as suggested by the results of the MDSCAL program, then the probability bounds developed under the random graph null hypothesis are not appropriate and will lead to gross underestimates of the probability that a validity index will have a value as good as the observed value.(31) We must join with Ling and Matula in repeating our earlier warning about the danger inherent in applying tests based on the random graph hypothesis to situations where the assumptions of the hypothesis may be violated.



Fig. 19. Profiles for clusters 2 and 3, SR data.




Fig. 17. Complete-link dendrogram, speaker recognition (SR) data.


V. SUMMARY

Probability profiles furnish a more comprehensive tool for quantitatively examining the validity of certain types of clusters than anything available in the literature, as indicated in our literature review. Profiles are based on the random graph hypothesis, so they apply to clusters formed from ordinal proximity matrices. In our formulation, a 'valid' cluster is a subset of data items which exhibits unusual compactness (or internal cohesion) and isolation (or inter-cluster separation) when compared to clusters generated randomly. Our index of compactness is the number of links between pairs of nodes in the proposed cluster; our index of isolation is the number of links having one incident node inside the cluster and one outside. Profiles track these indices as edges are inserted in the order dictated by the given proximity matrix, and compare the results to what is expected under the random graph hypothesis. That is, profiles scale the observed indices in probability units so that the observed raw profiles can be evaluated in terms of what is expected when no clusters are present. The manner in which clusters are formed determines the type of probability measure imposed.
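To make the bookkeeping behind a profile concrete, the sketch below tracks the two raw indices (internal edges for compactness, linking edges for isolation) of a candidate cluster as edges are inserted in rank order. It is our own illustrative code, not the authors' implementation, and the function and variable names are invented for the example.

```python
import numpy as np

def raw_profile(rank_matrix, cluster, ranks_to_report):
    """Count internal and linking edges of `cluster` after the first v edges
    (v = rank) have been inserted, for each requested rank.

    rank_matrix     : symmetric matrix of edge ranks (1 = most similar pair)
    cluster         : iterable of node indices forming the candidate cluster
    ranks_to_report : iterable of ranks v at which to record the counts
    """
    n = rank_matrix.shape[0]
    in_cluster = np.zeros(n, dtype=bool)
    in_cluster[list(cluster)] = True
    iu = np.triu_indices(n, k=1)
    edge_ranks = rank_matrix[iu]
    # An edge is internal if both endpoints lie in the cluster,
    # linking if exactly one endpoint lies in the cluster.
    internal = in_cluster[iu[0]] & in_cluster[iu[1]]
    linking = in_cluster[iu[0]] ^ in_cluster[iu[1]]
    profile = []
    for v in ranks_to_report:
        present = edge_ranks <= v                     # edges inserted so far
        profile.append((v,
                        int(np.sum(present & internal)),   # raw compactness index
                        int(np.sum(present & linking))))   # raw isolation index
    return profile

# Example: report the raw counts every tenth rank, as done for the SR data.
# raw_profile(R, cluster_nodes, range(10, 781, 10))
```

The probability-scaled profiles are then obtained by evaluating these raw counts against the hypergeometric bounds described in the Appendix.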


Table 6. Valid (Ling index) single-link clusters from SR data

Cluster  Size  Birth  Lifetime
   2      20    144      79
   3      20    150      73
   4      19     82      62
   6      16     41      41
   8      15    110      40
  19      10     60      50
  25       5     19      22
  26       5     55      55
  28       5     31     119
  30       5     24      36
  39       4      5      24
  44       3     38      44

Fig. 18. Profiles for clusters 25-28, SR data.


Table 7. Validity of hierarchical clusters

(a) Single-link clusters (cluster number, size): 2 (20), 3 (20), 4 (19), 6 (16), 8 (15), 13 (11), 19 (10), 21 (9), 24 (5), 25 (5), 26 (5), 28 (5), 30 (5), 31 (5).

(b) Complete-link clusters (cluster number, size): 1 (24), 6 (16), 9 (14), 12 (11), 18 (10), 20 (10), 22 (7), 24 (5), 25 (5), 26 (5), 28 (5), 30 (5), 31 (5).

For each cluster, the table marks with an x each of the criteria I1, C1, I2 and C2 whose minimum value is less than 10^-5.

A priori clusters are formed without reference to a proximity matrix, so they can be compared to subsets of nodes chosen completely at random. The (unconditional) probabilities (1) and (2) are the basis of comparison. When the cluster is generated by a clustering method, such as the single-link or complete-link method, the cluster is almost certain to be better than one chosen at random, so a CM-reachable distribution is the basis for comparison. Unfortunately, CM-reachable distributions can only be found by Monte Carlo simulation. We propose the best-case distribution for evaluating clusters. This distribution compares the indices for a cluster with those for the best (most compact or most isolated) cluster in a randomly selected graph. The existence of upper bounds on the best-case probability measure leads to isolation criterion I1 in (3) and compactness criterion C1 in (4), which can be used with clusters formed by any means whatever. The two other criteria we propose are I2 and C2 in equations (5) and (6). Criterion C2 determines whether a proposed cluster is unusually compact for clusters as isolated as the cluster under study. Similarly, criterion I2 compares the isolation of a given cluster with that of randomly chosen subsets of nodes having the same compactness as the given cluster. Profiles are plots of these criteria as functions of proximity rank. Strategies are suggested for applying profiles to data sets. The term 'valid cluster' is used for a subset of points whose profiles C1 and I1 remain small for large ranges of ranks. Descriptive phrases such as 'compact but not isolated' and 'isolated among clusters that are this

compact' can also be assigned after inspecting the profiles. The ability to study the evolution of a cluster in probability terms makes profiles a more broadly applicable measure of cluster validity than the single-number indices that have been proposed in the literature.

We examine artificially generated data sets to calibrate our intuition concerning the behavior of profiles for spurious and real clusters. Profiles performed well in all cases. We recommend applying quantitative tests for clustering tendency (first rank at which the graph becomes connected) and for the global fit of a hierarchy (gamma measure of rank correlation) before attempting to interpret single-link and complete-link clusters. However, we emphasize that profiles are independent of clustering method and permit evaluation of any subset of data items. Our studies of artificially generated data demonstrate that profiles provide a much more detailed picture of cluster validity than does Ling's index for single-link clusters. Profiles are particularly valuable in catching spurious clusters. The flexibility of profiles was demonstrated with data from a speaker recognition study. Clusters which were expected to be valid were shown to display only weak evidence for validity.

In brief, we have proposed a new strategy for validating clusters when a rank-order proximity matrix describes the data. Profiles make moderate computational demands, do not require Monte Carlo simulation, apply to a cluster created by any means whatever, and furnish a detailed picture of cluster validity.
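As a way of making the best-case comparison described above tangible, the following sketch estimates by brute-force Monte Carlo how often the most compact k-node subset of a random (n, N) graph has at least a given number of internal edges. The paper itself replaces such simulation with analytic upper bounds, so this is only an illustrative approximation with invented parameter values, feasible only for very small n.

```python
import itertools
import random

def best_case_compactness_tail(n, N, k, e_observed, trials=2000, seed=1):
    """Estimate Prob{ the most compact k-subset of a random (n, N) graph has
    at least e_observed internal edges } by exhaustive search over subsets."""
    rng = random.Random(seed)
    all_edges = list(itertools.combinations(range(n), 2))
    subsets = list(itertools.combinations(range(n), k))
    hits = 0
    for _ in range(trials):
        edges = set(rng.sample(all_edges, N))          # a random (n, N) graph
        best = max(sum((a, b) in edges
                       for a, b in itertools.combinations(s, 2))
                   for s in subsets)                    # most internal edges found
        hits += best >= e_observed
    return hits / trials

# Example (hypothetical numbers): a 4-node subset with all 6 internal edges
# present after N = 20 of the 66 possible edges on 12 nodes have been inserted.
# print(best_case_compactness_tail(n=12, N=20, k=4, e_observed=6))
```

A small estimated tail probability would indicate that even the best cluster in a random graph rarely matches the observed compactness, which is the intuition the analytic criteria C1 and I1 capture without simulation.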


REFERENCES

1. A. K. Agrawala, J. M. Mohr and R. M. Bryant, An approach to the workload characterization problem, Computer 9, 18-32 (1976).
2. G. H. Ball, Data analysis in the social sciences: what about the details?, Proc. Fall Joint Computer Conf., pp. 533-560 (1965).
3. R. L. Breiger, S. A. Boorman and P. Arabie, An algorithm for clustering relational data with applications to social network analysis and comparison with multidimensional scaling, J. math. Psychol. 12, 328-383 (1975).
4. J. E. Doran and F. R. Hodson, Mathematics and Computers in Archaeology. Harvard University Press, Cambridge (1973).
5. G. F. Estabrook, A mathematical model in graph theory for biological classification, J. theor. Biol. 12, 297-310 (1966).
6. E. Filsinger and W. Sauer, Empirical typology of adjustment to aging, J. Geront. 33, 437-445 (1978).
7. H. P. Friedman and J. Rubin, The logic of the statistical methods, The Borderline Syndrome, R. Grinker, B. Werble and R. Drye, Chapter 5. Basic Books, New York (1968).
8. B. C. Griffith, H. G. Small, J. A. Stonehill and S. Dey, Structure of scientific literature, Sci. Stud. 4, 339-365 (1974).
9. L. Hubert and J. Schultz, Hierarchical clustering and concept of space distortion, Br. J. math. statist. Psychol. 28, 121-133 (1975).
10. L. Hubert and J. R. Levin, A general statistical framework for assessing categorical clustering in free recall, Psychol. Bull. 83, 1072-1080 (1976).
11. D. M. Levine, Nonmetric multidimensional scaling and hierarchical clustering - procedures for investigation of perception of sports, Res. Q. 48, 341-348 (1977).
12. J. O. McClain and V. R. Rao, Tradeoffs and conflicts in evaluation of health services alternatives: methodology for analysis, Hlth Servs Res. 9, 35-52 (1974).
13. J. E. Mezzich, Evaluating clustering methods for psychiatric diagnosis, Biol. Psychiat. 13, 265-281 (1978).
14. E. S. Paykel, Classification of depressed patients - cluster analysis derived grouping, Br. J. Psychiat. 118, 275-278 (1971).
15. P. H. Raven, B. Berlin and D. E. Breedlove, The origins of taxonomy, Science 174, 1210-1213 (1971).
16. F. J. Rohlf, Adaptive hierarchical clustering schemes, Syst. Zool. 19, 58-82 (1970).
17. R. J. Shanley and M. A. Mahtab, Delineation and analysis of clusters in orientation data, Mathl Geol. 8, 9-23 (1976).
18. P. H. A. Sneath and R. R. Sokal, Numerical Taxonomy. W. H. Freeman, San Francisco (1973).
19. R. R. Sokal, Classification: purposes, principles, progress, prospects, Science 185, 1115-1123 (1974).
20. D. J. Strauss, Classification by cluster analysis, The International Pilot Study of Schizophrenia, Vol. 1, Chapter 12, pp. 336-359. World Health Organization (1973).
21. J. S. Strauss, J. J. Bartko and W. J. Carpenter, Jr., The use of clustering techniques for the classification of psychiatric patients, Br. J. Psychiat. 122, 531-540 (1973).
22. A. N. Mucciardi and E. E. Gose, An automatic clustering algorithm and its properties in high-dimensional spaces, IEEE Trans. Syst. Man Cybernet. SMC-2, 247-254 (1972).
23. W. J. Borucki, D. H. Card and G. C. Lyle, A method of using cluster analysis to study statistical dependence in multivariate data, IEEE Trans. Comput. C-24, 1183-1191 (1975).
24. M. R. Anderberg, Cluster Analysis for Applications. Academic Press, New York (1973).
25. A. J. Cole, ed., Numerical Taxonomy. Academic Press, New York (1969).
26. R. M. Cormack, A review of classification, J. R. statist. Soc. A134, 321-367 (1971).
27. R. Dubes and A. K. Jain, Clustering techniques: the user's dilemma, Pattern Recognition 8, 247-260 (1976).
28. B. Everitt, Cluster Analysis. John Wiley and Sons, New York (1974).
29. J. A. Hartigan, Clustering Algorithms. John Wiley and Sons, New York (1975).
30. T. A. Bailey, Jr. and R. C. Dubes, Cluster validity profiles. Technical Report TR-79-01, Department of Computer Science, Michigan State University, East Lansing, MI (1979).
31. T. A. Bailey, Jr., Cluster validity and intrinsic dimensionality. Ph.D. thesis, Department of Computer Science, Michigan State University, East Lansing, MI (1978).
32. R. Dubes and A. K. Jain, Models and methods in cluster validity, Proc. IEEE Conf. Pattern Recognition and Image Processing, pp. 148-155 (1978).
33. R. K. Blashfield and M. S. Aldenderfer, A consumer report on cluster analysis software. Report for NSF Grant DCR 74-20007 (1977).
34. D. J. Strauss, A model for clustering, Biometrika 62, 467-475 (1975).
35. F. P. Kelly and B. D. Ripley, A note on Strauss's model for clustering, Biometrika 63, 357-360 (1976).
36. A. Rapoport and S. Fillenbaum, An experimental study of semantic structures, Multidimensional Scaling, Vol. II, A. K. Romney, R. N. Shepard and S. B. Nerlove (eds). Seminar Press, New York (1972).
37. L. Hubert, Some applications of graph theory to clustering, Psychometrika 39, 283-309 (1974).
38. L. L. McQuitty, A mutual development of some typological theories and pattern analytical methods, Educ. psychol. Measur. 27, 21-48 (1967).
39. C. J. Van Rijsbergen, A clustering algorithm, Comput. J. 13, 113-115 (1970).
40. W. H. E. Day, Validity of clusters formed by graph-theoretical cluster methods, Mathl Biosci. 36, 299-317 (1977).
41. M. D. Mountford, A test of the difference between clusters, Statistical Ecology, Vol. 3, G. P. Patil, E. C. Pielou and W. Waters (eds), pp. 237-257. Pennsylvania State Univ. Press, University Park, PA (1970).
42. P. H. A. Sneath, Method for testing distinctness of clusters - test of disjunction of two clusters in Euclidean space as measured by their overlap, Mathl Geol. 9, 123-143 (1977).
43. J. A. Hartigan, Asymptotic distributions for clustering criteria, Ann. Statist. 6, 117-131 (1978).
44. J. A. Hartigan, Representation of similarity matrices by trees, J. Am. statist. Ass. 62, 1140-1158 (1967).
45. L. Hubert, Spanning trees and aspects of clustering, Br. J. math. statist. Psychol. 27, 14-28 (1974).
46. D. W. Matula, Graph theoretic techniques for cluster analysis algorithms, Classification and Clustering, J. Van Ryzin, ed. Academic Press, New York (1977).
47. C. T. Zahn, Graph-theoretical methods for detecting and describing gestalt clusters, IEEE Trans. Comput. C-20, 68-86 (1971).
48. S. C. Johnson, Hierarchical clustering schemes, Psychometrika 32, 241-254 (1967).
49. J. C. Gower and G. J. S. Ross, Minimum spanning trees and single-linkage cluster analysis, Appl. Statist. 18, 54-64 (1969).
50. F. B. Baker and L. J. Hubert, A graph-theoretic approach to goodness-of-fit in complete-link hierarchical clustering, J. Am. statist. Ass. 71, 870-878 (1976).
51. P. Hansen and M. Delattre, Complete-link cluster analysis by graph coloring, J. Am. statist. Ass. 73, 397-403 (1978).
52. R. F. Ling, Probability theory of cluster analysis, J. Am. statist. Ass. 68, 159-164 (1973).
53. R. F. Ling, On the theory and construction of k-clusters, Comput. J. 15, 326-332 (1972).


54. P. Erdos and A. Renyi, On random graphs I, Publ. Math. Debrecen 6, 290-297 (1959).
55. P. Erdos and A. Renyi, On the evolution of random graphs, Mathl Inst. Hung. Acad. Sci. 5, 17-61 (1960).
56. P. Erdos and A. Renyi, On the strength of connectedness of a random graph, Acta math. Hung. 12, 261-267 (1961).
57. C. T. Abraham, Evaluation of clusters on the basis of random graph theory. IBM Research Report RC-1177 (1964).
58. R. F. Ling, An exact probability distribution on the connectivity of random graphs, J. math. Psychol. 12, 90-98 (1975).
59. J. Schultz and L. J. Hubert, Data analysis and connectivity of random graphs, J. math. Psychol. 10, 421-428 (1973).
60. R. F. Ling and G. S. Killough, Probability tables for cluster analysis based on a theory of random graphs, J. Am. statist. Ass. 71, 293-300 (1976).
61. R. J. Riddell, Jr. and G. E. Uhlenbeck, On the theory of the virial development of the equation of state of monoatomic gases, J. chem. Phys. 21, 2056-2064 (1953).
62. R. F. Ling, The expected number of components in random linear graphs, Ann. Prob. 1, 876-881 (1973).
63. F. B. Baker and L. J. Hubert, Measuring the power of hierarchical cluster analysis, J. Am. statist. Ass. 70, 31-38 (1975).
64. L. Hubert, Approximate evaluation techniques for the single-link and complete-link hierarchical clustering procedures, J. Am. statist. Ass. 69, 698-704 (1974).
65. F. B. Baker, Stability of two hierarchical grouping techniques; Case 1: sensitivity to data errors, J. Am. statist. Ass. 69, 440-445 (1974).
66. O. I. Tosi, Voice Identification and Its Legal Application. Academic Press, New York (1978).
67. J. B. Kruskal, Nonmetric multidimensional scaling: a numerical method, Psychometrika 29, 115-129 (1964).
68. E. Parzen, Modern Probability Theory and Its Application. John Wiley and Sons, New York (1960).

APPENDIX

The hypergeometric model(68) describes a situation where the population to be sampled consists of m objects, d of which are 'defective', or designated in some way. The random experiment is to draw an unordered sample of r objects without replacement. The probability measure assigns equal probability to all samples. Define the random variable X as the number of defectives in a sample of size r. The (hypergeometric) probability mass function for X is

Prob(X = x) = \binom{d}{x} \binom{m-d}{r-x} / \binom{m}{r}

if max(0, r - m + d) <= x <= min(r, d).

This model is adapted to find the distribution of isolation index B in (1) as follows. Suppose we draw r = N edges at random from a population of m = \binom{n}{2} edges to form an (n, N) random graph. Designate k nodes as a cluster and treat the linking edges, which link nodes in the cluster to nodes outside the cluster, as 'defectives'. There are d = k(n - k) possible defectives. To express the probability distribution of compactness index E in equation (2), we designate the set of internal edges as defectives in place of linking edges.

If the number of internal edges is fixed at e_A, the probability required for each term on the right of the fixed-compactness isolation criterion in equation (3) can also be derived from the hypergeometric model. The population being sampled consists of

m = \binom{n}{2} - \binom{k}{2}

edges, of which d = k(n - k) are 'defective', or linking, edges; the size of the sample is

r = N - e_A.     (7)

Finally, the fixed-isolation compactness criterion in equation (6) is derived by drawing a sample of size r = N - b_A from a population of

m = \binom{n}{2} - k(n - k) = \binom{k}{2} + \binom{n-k}{2}

edges containing d = \binom{k}{2} internal edges designated as 'defectives'.
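The hypergeometric quantities above are easy to evaluate numerically. As an illustration (not the authors' code), the sketch below uses scipy.stats.hypergeom to obtain the distribution of the isolation index B for a k-node subset of an (n, N) random graph, following the identification of linking edges with 'defectives'; the function name and example numbers are invented for this note.

```python
from scipy.stats import hypergeom

def isolation_index_distribution(n, N, k):
    """Hypergeometric distribution of the isolation index B: the number of
    linking edges drawn when N of the C(n,2) possible edges are inserted.

    Population m = C(n,2) edges, d = k(n-k) linking ('defective') edges,
    sample size r = N.
    """
    m = n * (n - 1) // 2        # all possible edges on n nodes
    d = k * (n - k)             # possible linking ('defective') edges
    # scipy parameter order: (M = population size, n = defectives, N = draws)
    return hypergeom(m, d, N)

if __name__ == "__main__":
    # Example: probability that a 5-node subset of a 40-node graph has at most
    # 3 linking edges among the first 100 edges (ranks 1-100) inserted.
    dist = isolation_index_distribution(n=40, N=100, k=5)
    print(dist.cdf(3))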

About the Author - RICHARD C. DUBES was born in Chicago, Illinois in 1934. He received the B.S. degree from the University of Illinois in 1956, the M.S. degree from Michigan State University in 1959 and the Ph.D. degree from Michigan State University in 1962, all in electrical engineering. In 1956 and 1957 he was a member of the technical staff of the Hughes Aircraft Company, Culver City, California. From 1957 to 1968, he served as graduate assistant, research assistant, assistant professor, and associate professor in the Electrical Engineering Department at Michigan State University. In 1969, he joined the Computer Science Department at Michigan State University, where he became a professor in 1970. He is the author of The Theory of Applied Probability (Prentice-Hall, 1968) and several technical papers and reports. His areas of technical interest include pattern recognition, exploratory data analysis, and signal processing. He is a member of the Institute of Electrical and Electronics Engineers, the Pattern Recognition Society, the Classification Society, and Sigma Xi.

About the Author - THOMAS A. BAILEY was born in Chicago, Illinois in 1942. He received the B.S. degree from Alma College, Alma, Michigan, in 1964, the M.S. degree in Physics from the University of Colorado, Boulder, in 1969, and the Ph.D. in Computer Science from Michigan State University, East Lansing, in 1978. From 1969 to 1976 he was a member of the teaching faculty at Alma College and was a research assistant in the Computer Science Department at Michigan State University from 1976 to 1978. He served on the Computer Science faculty at the University of Wollongong, Wollongong, NSW, Australia from 1978 to 1980 and is currently on the Computer Science faculty at the University of Wyoming, Laramie. His current research interests include applications of random graph theory to exploratory data analysis and related areas of pattern recognition. Dr Bailey is a member of the Institute of Electrical and Electronics Engineers and the Association for Computing Machinery.
