Fuzzy Sets and Systems 5 (1981) 177-185 North-Holland Publishing Company
CLUSTER VALIDITY FOR FUZZY CLUSTERING ALGORITHMS

Michael P. WINDHAM
Department of Mathematics, Utah State University, Logan, UT 84322, U.S.A.

Received October 1979
Revised December 1979

The proportion exponent is introduced as a measure of the validity of the clustering obtained for a data set using a fuzzy clustering algorithm. It is assumed that the output of an algorithm includes a fuzzy membership function for each data point. We show how to compute the proportion of possible membership functions whose maximum entry exceeds the maximum entry of a given membership function, and use these proportions to define the proportion exponent. Its use as a validity functional is illustrated with four numerical examples and its effectiveness compared to other validity functionals, namely, classification entropy and partition coefficient.
Keywords: Fuzzy clustering algorithms, Cluster validity, Proportion exponent, Fuzzy isodata algorithm, Entropy, Partition coefficient.
Fuzzy clustering algorithms have been used extensively to classify data. The user wishes to classify the data into c groups or clusters. The algorithm produces for each data point a fuzzy membership function, which is a c-dimensional vector with non-negative entries summing to one. Each entry in a membership function corresponds to a particular cluster, and the value of the entry measures the extent to which the data point has been associated with that cluster. It is the membership functions which are used to determine the classification of the data into clusters. For example, a data point may be assigned to the ith cluster if the largest entry in its membership function occurs in the ith component, and the larger this entry is, the more clearly the data point has been identified with the cluster.

The most important question associated with the use of a clustering algorithm is simply how well it has identified structure that is present in the data. This is the so-called 'cluster validity problem'. We define here a function which assigns to a collection of membership functions a real number called the proportion exponent. We will show how and why the values of the proportion exponent can be used to measure the effectiveness with which cluster structure has been properly identified by a fuzzy clustering algorithm. We assume that the output of a fuzzy clustering algorithm includes a $c \times n$ matrix $U = [u_{ij}]$, where c is the number of clusters identified, n is the number of data points, $2 \le c < n$, and the jth column of U is the membership function produced for the jth data point.
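As a small illustration (ours, not from the paper), consider a hypothetical membership matrix for c = 2 clusters and n = 3 data points, together with the hard assignment obtained by taking the largest entry in each column:

```python
import numpy as np

# Hypothetical membership matrix U (c = 2 clusters, n = 3 data points).
# Each column is a membership function: non-negative entries summing to one.
U = np.array([[0.9, 0.2, 0.55],
              [0.1, 0.8, 0.45]])

# Assign each data point to the cluster holding its largest membership entry.
assignments = U.argmax(axis=0)
print(assignments)  # [0 1 0]: points 1 and 3 go to cluster 1, point 2 to cluster 2
```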
The proportion exponent of U, P(U), is defined by
$$P(U) = -\log_2 \prod_{j=1}^{n} \left[ \sum_{k=1}^{I_j} (-1)^{k+1} \binom{c}{k} (1 - k\mu_j)^{c-1} \right],$$
where $\mu_j = \max_i(u_{ij})$ and $I_j$ is the greatest integer in $1/\mu_j$.
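As a computational sketch (ours, not part of the paper), the definition can be evaluated directly from U; summing logarithms of the column proportions avoids underflow in the product when n is large:

```python
import math
import numpy as np

def proportion_exponent(U):
    # U: c x n array whose columns are membership functions.
    # Assumes every column maximum is strictly less than 1 (proportion > 0).
    c, n = U.shape
    mu = U.max(axis=0)            # maximum entry of each column
    log_product = 0.0
    for m in mu:
        I = int(1.0 / m)          # greatest integer in 1/mu_j
        p = sum((-1) ** (k + 1) * math.comb(c, k) * (1 - k * m) ** (c - 1)
                for k in range(1, I + 1))
        log_product += math.log2(p)
    return -log_product
```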
In order to understand the definition, it is best to examine its computation in stages. The only values from the matrix U that are used are the n maximum entries, one from each of its columns. It is precisely these values which indicate the clusters with which the data points have been most closely identified. Moreover, the larger these values are, the more clearly the identification has been defined. So, it is natural to focus on them as a measure of cluster validity. Unfortunately, the values themselves are too closely tied to the value of c. For example, if c = 2, then 20% of the possible membership functions will have a maximum of at least 0.9, whereas for c = 4, only 0.4% have a maximum that large.

It is to overcome this difficulty that the next step in the computation of the proportion exponent is taken. We will show below that for $1/c \le \mu \le 1$, the proportion of possible c-dimensional membership functions whose maximum entries are at least $\mu$ is given by $\sum_{k=1}^{I} (-1)^{k+1} \binom{c}{k} (1 - k\mu)^{c-1}$, where $I$ is the greatest integer in $1/\mu$. In other words, for a given value of $\mu$, if the sum is 0.1, then 10% of all possible c-dimensional membership functions will have a maximum entry of at least $\mu$, and the interpretation of this result is independent of c.

In order to obtain a single measure for the matrix, we multiply together the proportions for each of the columns. This gives a number which can be interpreted as the proportion of possible membership matrices which would have a larger maximum in every column than those found in U, or more intuitively, the proportion of matrices which would more clearly identify the clusters. At this point we have given the proportion exponent a strong dependence on the number of data points, n. For this reason, it is not reasonable to compare proportion exponents for different sample sizes; fortunately, this is not likely to be necessary.

If a fuzzy clustering algorithm has done at all well, the proportions for each column will be small. If the sample size is very large, the proportion for the matrix is likely to be very small indeed. It is for this reason that the final computation is performed, namely, taking the negative logarithm (base 2). This spreads the values of the functional over a much wider range, particularly for proportions near zero. It also means that large values of the proportion exponent indicate that the algorithm has worked well. There is, in fact, a heuristic interpretation of this step derived from information theory. If one wishes to quantify the 'surprise' felt on observing an event whose probability of occurrence is p, then a natural measure is given by $-\log_2(p)$ (see [13] for details). So, we may loosely interpret the proportion exponent as a measure of the 'surprise' one would feel if the algorithm had done better than it did.

The proportion exponent was designed primarily to aid in identifying the number of clusters present in the data; however, there is no reason that it cannot be used to evaluate choices of other parameters the investigator sets in using clustering algorithms. For example, many of these algorithms are iterative in nature, and the user starts the iteration process by choosing a membership matrix.
The final output of many algorithms can be greatly affected by the starting point. The proportion exponent can be used to evaluate the relative merits of different starting points. It cannot, however, be used to evaluate parameters which artificially affect the 'fuzziness' of the membership functions. Weight exponents are sometimes used to de-emphasize the effects of noisy or extraneous data. Increasing a weight exponent will cause the membership functions to become fuzzier, i.e., the maximums will become smaller and the other entries larger. This means that the proportion exponent may decrease even when the user feels that the data classification has improved.

The key building block of the proportion exponent is the proportion of membership functions whose maximum exceeds a given value. We now explain what is meant by this proportion and how to compute it. For a given c, the set of possible membership functions is
$$MF_c = \{u \in \mathbb{R}^c : u_i \ge 0 \text{ for } i = 1, \dots, c \text{ and } \textstyle\sum u_i = 1\}.$$
This set is the intersection of a (c-1)-dimensional plane and the first orthant of $\mathbb{R}^c$. The 'size' of a subset of $MF_c$ can be measured by its area; in particular, the area of $MF_c$ itself is $\sqrt{c}/(c-1)!$. So, for a given $A \subseteq MF_c$, we define the proportion of membership functions in A, $p(u \in A)$, by
$$p(u \in A) = (\text{Area of } A)/(\text{Area of } MF_c).$$
Since $MF_c$ is a (c-1)-dimensional surface in $\mathbb{R}^c$, these areas can be computed by surface integrals, and the surface integrals can be computed using an appropriate parameterization of $MF_c$. For example, let $V_c = \{x \in \mathbb{R}^{c-1} : x_i \ge 0 \text{ for } i = 1, \dots, c-1 \text{ and } \sum x_j \le 1\}$ and define $\sigma : V_c \to MF_c$ by $\sigma(x_1, \dots, x_{c-1}) = (x_1, \dots, x_{c-1}, 1 - \sum x_j)$. Then for $A \subseteq MF_c$ the area of A is $\sqrt{c} \int_{\sigma^{-1}(A)} dx_1 \cdots dx_{c-1}$, and
$$p(u \in A) = (c-1)! \int_{\sigma^{-1}(A)} dx_1 \cdots dx_{c-1}.$$
Using these definitions we prove the following theorem.
Theorem. For $c = 2, 3, \dots$ and $1/c \le \mu \le 1$, let $A_\mu = \{u \in MF_c : \max_i(u_i) \ge \mu\}$. Then
$$p(u \in A_\mu) = \sum_{k=1}^{I} (-1)^{k+1} \binom{c}{k} (1 - k\mu)^{c-1},$$
where $I$ is the greatest integer in $1/\mu$.
Proof. If for $i = 1, \dots, c$ we define $E_i = \{u \in MF_c : u_i \ge \mu\}$, then $A_\mu = E_1 \cup E_2 \cup \cdots \cup E_c$, so
$$p(u \in A_\mu) = p(u \in E_1 \cup \cdots \cup E_c) = \sum_{k=1}^{c} (-1)^{k+1} \sum_{i_1 < \cdots < i_k} p(u \in E_{i_1} \cap \cdots \cap E_{i_k}). \tag{1}$$

The last equality in (1) is just the interpretation in this context of an elementary theorem from probability theory [13, p. 26]. If a particular membership function is thought of as the value of a random vector uniformly distributed in $MF_c$, then subsets of $MF_c$ become events in a probability space and the proportions become probabilities, justifying the second equality in (1). Because of the geometrical symmetry of $MF_c$, $p(u \in E_{i_1} \cap \cdots \cap E_{i_k}) = p(u \in E_1 \cap \cdots \cap E_k)$ for all $\binom{c}{k}$ choices of $i_1 < \cdots < i_k$, so in order to complete the proof we need only compute, for $k = 1, \dots, c$, the integral $\int_{B_k} dx_1 \cdots dx_{c-1}$, where $B_k = \sigma^{-1}(E_1 \cap \cdots \cap E_k)$. But $B_k$ is the set of vectors $x \in \mathbb{R}^{c-1}$ satisfying
$$\mu \le x_j \le 1 - (k - j)\mu - x_1 - \cdots - x_{j-1} \quad \text{for } j = 1, \dots, k$$
and
$$0 \le x_j \le 1 - x_1 - \cdots - x_{j-1} \quad \text{for } j = k + 1, \dots, c - 1.$$
In particular, $\mu \le x_1 \le 1 - (k - 1)\mu$, from which it follows that $B_k$ will be empty unless $\mu \le 1 - (k - 1)\mu$, i.e., $k \le I$. For $k \le I$, evaluating the iterated integral gives $\int_{B_k} dx_1 \cdots dx_{c-1} = (1 - k\mu)^{c-1}/(c-1)!$, so that $p(u \in E_1 \cap \cdots \cap E_k) = (1 - k\mu)^{c-1}$ and the theorem follows.
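A quick numerical check of the theorem (our sketch, not from the paper): membership functions drawn from the Dirichlet(1, ..., 1) distribution are uniformly distributed on $MF_c$, so the empirical frequency of $\max(u_i) \ge \mu$ should approach the closed-form sum. For c = 4 and $\mu$ = 0.9 both values are near 0.004, the 0.4% quoted earlier.

```python
import math
import numpy as np

def proportion_closed_form(c, mu):
    # Theorem: p(u in A_mu) = sum_{k=1}^{I} (-1)^(k+1) C(c, k) (1 - k*mu)^(c-1),
    # where I is the greatest integer in 1/mu.
    I = int(1.0 / mu)
    return sum((-1) ** (k + 1) * math.comb(c, k) * (1 - k * mu) ** (c - 1)
               for k in range(1, I + 1))

rng = np.random.default_rng(0)
c, mu, n_samples = 4, 0.9, 200_000
u = rng.dirichlet(np.ones(c), size=n_samples)   # uniform samples from MF_c
empirical = (u.max(axis=1) >= mu).mean()
print(empirical, proportion_closed_form(c, mu))  # both approximately 0.004
```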
The numerical examples described below were obtained using the fuzzy isodata algorithm, which seeks to minimize the objective function
$$J_m(U, V) = \sum_{i=1}^{c} \sum_{j=1}^{n} (u_{ij})^m (d_{ij})^2,$$
where $U = [u_{ij}]$ is the $c \times n$ matrix of membership functions, $V = [v_{ij}]$ is the $d \times c$ matrix (d being the dimension of the data vectors) whose columns are the cluster centers, $d_{ij}$ is the Euclidean distance from the jth data vector to the ith cluster center, and m is a weighting exponent ($1 < m < \infty$). So, the objective function is a weighted sum of the squares of the distances between the data vectors and the cluster centers. Each weight is a
power of the membership of the data point in the cluster represented by the cluster center. The purpose of the weighting exponent, m, is to allow the user to de-emphasize the effects of 'noisy' data, namely, data points which are difficult to identify with any particular cluster. Such data points will tend to have membership functions all of whose values are small; consequently, the larger the value of m used, the smaller the contribution these 'noisy' data points will make to the objective function. The fuzzy isodata algorithm itself is an iterative procedure which, if it converges, does so to a local minimum of the objective function. The particular procedure that was used is based on the following computations.

(1) Given U, compute V by
$$v_{ij} = \frac{\sum_k (u_{jk})^m x_{ki}}{\sum_k (u_{jk})^m},$$
where $x_{ki}$ is the ith component of the kth data vector.

(2) Given V, compute U by
$$u_{ij} = \left[ \sum_{k=1}^{c} \left( \frac{d_{ij}}{d_{kj}} \right)^{2/(m-1)} \right]^{-1}.$$
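A minimal sketch of this alternating procedure (ours, not the paper's code; the random initialization is our choice and the structured initial U described next could be substituted):

```python
import numpy as np

def fuzzy_isodata(X, c, m=4.0, tol=0.05, max_iter=100, seed=0):
    # X: n x d data array. Returns (U, V): U is c x n memberships, V is d x c centers.
    n, d = X.shape
    rng = np.random.default_rng(seed)
    U = rng.dirichlet(np.ones(c), size=n).T    # random start (an assumption here)
    V = np.zeros((d, c))
    for _ in range(max_iter):
        W = V.copy()                           # centers from the previous pass
        Um = U ** m
        # (1) v_ij = sum_k (u_jk)^m x_ki / sum_k (u_jk)^m  (weighted means)
        V = (X.T @ Um.T) / Um.sum(axis=1)
        # (2) u_ij = [ sum_k (d_ij / d_kj)^(2/(m-1)) ]^(-1)
        D = np.linalg.norm(X[:, None, :] - V.T[None, :, :], axis=2).T  # c x n distances
        P = np.maximum(D, 1e-12) ** (-2.0 / (m - 1.0))  # guard against zero distances
        U = P / P.sum(axis=0)
        # stop when every center component moves less than tol (the |v - w| < 0.05 rule)
        if np.all(np.abs(V - W) < tol):
            break
    return U, V
```

With X an n x d array of data vectors, fuzzy_isodata(X, c=4) returns memberships and centers; the runs reported below used m = 4 and a tolerance of 0.05.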
Values for c and m and an initial U matrix are chosen, and then (1) and (2) are successively repeated until the distances between cluster centers at two successive stages are less than a preassigned value. Justification for this procedure may be found in [3] and [8], and convergence properties of this type of algorithm may be found in [2] and [12].

For each of the data sets described below the algorithm was run for c = 2, 3, 4, 5 and 6. The weighting exponent, m, was 4 for all runs. The initial choice of U was made by dividing the data into c groups: the first group consisted of the first n/c data vectors, the second group the next n/c vectors, and so forth. For a data vector in the jth group, its membership $u_{ij}$ was chosen to be $0.707 + 0.293/c$ for $i = j$ and $0.293/c$ for $i \ne j$. The stopping criterion was $|v_{ij} - w_{ij}| < 0.05$ for all i, j, where V and W are the cluster center matrices for successive iterations.

The first of the four data sets, called Plane 1, was artificially generated. Four base points in $\mathbb{R}^2$ were chosen and 50 points were randomly generated about each base point, to obtain a sample of 200 vectors in $\mathbb{R}^2$. The next two data sets, Plane 2 and Plane 3, were obtained from Plane 1 by, respectively, doubling and tripling the radial distances of the data points to the base points. These data sets are pictured in Fig. 1. Because of the manner in which these data sets were generated, it would be reasonable to assume that an effective validity functional should identify the presence of four clusters. This is borne out by a visual inspection of the data.

The fourth data set is the Iris data of Anderson [1], which also appears in [9]. This data set has often been used as a standard for testing clustering algorithms and validity functionals (see [3, 10, 11, 14, 15]). The data consists of 150 four-dimensional vectors. The components of a vector are the measurements of the petal length, petal width, sepal length and sepal width of a particular iris plant. There are 50 plants in each of three subspecies of Iris represented in the data, so it was assumed that an effective validity functional should indicate the presence of three clusters.
•m"
:-,...
,,,....
,.,,..
"~'•~,° •
.t•
~
•
•
•
.
._,to
,,By•
•
11'
•-'*--.'t..
":
•
:: •
--
°•41' I'
"
•
.-... 9 lla,~d,lt~
• *
IK-,",,
•
4'
Plane I
Plane 2
4'
,lee t
ee e
e&
94, ee •
I:
,d~ •
••
1~ '11" • 4 ,
o'~..
~
•
•
";-a,. ~'?"
~,~,,e'~
•
~.•
.¢
-
• 4, t
t;. %,, -~ F--. " ' "
•
• 9
~9
P1ane 3
Fig. 1.
The results of the analysis are shown in Table 1, where c is the number of clusters the algorithm was to identify, and P is the proportion exponent obtained from the output of the algorithm. In each case the value of c indicated as optimum by the maximum of P agreed with the value that was assumed to be correct. Also included in Table 1 are values of the classification entropy (H) and partition coefficient (F). These are validity functionals introduced by Bezdek ([6, 5] resp.), and are defined as follows. For a membership matrix U,
$$H(U) = -\sum_{i,j} u_{ij} \log(u_{ij})/n, \qquad F(U) = \sum_{i,j} (u_{ij})^2/n.$$
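For concreteness, a short sketch (ours, not from the paper) of both functionals; the convention that zero entries contribute nothing, noted below, is built in, and the natural logarithm is used, which appears consistent with the tabulated values:

```python
import numpy as np

def classification_entropy(U):
    # H(U) = -sum_ij u_ij log(u_ij) / n, with any term where u_ij = 0 taken as zero.
    safe = np.where(U > 0, U, 1.0)   # log(1) = 0 stands in for 0 * log(0)
    return -(U * np.log(safe)).sum() / U.shape[1]

def partition_coefficient(U):
    # F(U) = sum_ij (u_ij)^2 / n; ranges from 1/c (no clustering) to 1 (hard clusters).
    return (U ** 2).sum() / U.shape[1]
```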
In the computation of the entropy, if $u_{ij} = 0$ the corresponding term is taken to be zero. Consequently, the values of the entropy range between zero ($u_{ij} = 0$ or 1, a 'hard' cluster) and log(c) ($u_{ij} = 1/c$, no clustering). A good cluster identification should be indicated by an entropy value close to zero.
Table 1

        Plane 1                        Plane 2
c     P      H      F          c     P      H      F
2    163   0.60   0.59         2    164   0.59   0.59
3    323   0.85   0.51         3    120   1.00   0.40
4    613   0.97   0.50         4    368   1.12   0.41
5    596   1.19   0.41         5    366   1.36   0.32
6    578   1.38   0.34         6    359   1.54   0.27

        Plane 3                        Iris
c     P      H      F          c     P      H      F
2    155   0.60   0.59         2    157   0.55   0.63
3    105   1.01   0.39         3    163   0.92   0.45
4    258   1.19   0.36         4    160   1.21   0.35
5    250   1.41   0.29         5     89   1.43   0.27
6    252   1.59   0.25         6     98   1.60   0.23
It would seem from the table that entropy indicates the presence of two clusters in each of the four data sets. It has been shown that this is not the proper interpretation of the entropy values. By integrating the entropy function over the set $MF_c$ and dividing by the area of $MF_c$, one obtains the average value of H, namely $\sum_{k=2}^{c} 1/k$ (for details see [7]). This indicates that entropy tends to increase with c independently of structure in the data. So, the value of c indicated by entropy as optimum is the one for which entropy falls below an increasing trend in its values. This technique is illustrated in Fig. 2 for the data set Plane 1, where the value c = 4 is indicated as optimum.
Fig. 2. Entropy for Plane 1. [H plotted against the number of clusters, 2 through 7; the point at c = 4, marked 'Optimum', falls below the increasing trend.]
Similar analysis for Plane 2 and Plane 3 also indicates c = 4; however, for the Iris data no optimum is clearly indicated.

The partition coefficient must be analyzed in an analogous manner. The range of values of F is 1/c (no clustering) to one (hard clusters), and the average value of F is 2/(c+1) (see [7]). So, the partition coefficient tends to decrease with c, and the optimum c should occur where the value of F lies above the trend. Using this analysis, one finds that F indicates c = 4 as the optimum for Plane 1, Plane 2 and Plane 3, but provides no answer for the Iris data.

These examples illustrate how the proportion exponent is used and show that it can provide correct answers. Its effectiveness in general will be determined only by subsequent applications. In any case, we have given here a general technique for improving validity functionals. The maximum entries of the membership functions can be used to evaluate cluster validity; however, the values of the maximums that one sees are strongly affected by the number of clusters sought. This dependence was removed by using proportions. The same could be done with entropy or the partition coefficient, provided a closed-form expression for the appropriate proportions can be obtained. The ideal validity functional should be independent of as many parameters as possible; it is at least partially to achieve this goal that the proportion exponent was introduced.
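As a closing illustration (ours, not from the paper), the trend analysis above can be checked numerically for the Plane 1 column of Table 1 against the averages cited from [7]; H drops furthest below its average, and F rises furthest above its average, at c = 4:

```python
# Observed values for Plane 1, transcribed from Table 1 above.
cs    = [2, 3, 4, 5, 6]
H_obs = [0.60, 0.85, 0.97, 1.19, 1.38]
F_obs = [0.59, 0.51, 0.50, 0.41, 0.34]

for c, h, f in zip(cs, H_obs, F_obs):
    h_avg = sum(1.0 / k for k in range(2, c + 1))  # average H over MF_c [7]
    f_avg = 2.0 / (c + 1)                          # average F over MF_c [7]
    print(c, round(h - h_avg, 2), round(f - f_avg, 2))
# Largest negative H deviation and largest positive F deviation both occur at c = 4.
```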
Acknowledgment

I wish to acknowledge Professor James C. Bezdek for introducing me to the cluster validity problem and providing continual advice and encouragement throughout this investigation.
References

[1] E. Anderson, The irises of the Gaspé Peninsula, Bull. Am. Iris Soc. 59 (1935) 2-5.
[2] J.C. Bezdek, A convergence theorem for fuzzy isodata algorithms, IEEE Trans. PAMI, to appear.
[3] J.C. Bezdek, Fuzzy mathematics in pattern classification, PhD Thesis, Cornell University, Ithaca, NY (1973).
[4] J.C. Bezdek, Numerical taxonomy with fuzzy sets, J. Math. Biol. 1 (1974) 57-71.
[5] J.C. Bezdek, Cluster validity with fuzzy sets, Cybernetics 3 (1974) 58-73.
[6] J.C. Bezdek, Mathematical models for systematics and taxonomy, in: G. Estabrook, ed., Proc. Eighth Annual International Conference on Numerical Taxonomy (Freeman, San Francisco, CA, 1975).
[7] J.C. Bezdek, M.P. Windham and R. Ehrlich, Statistical parameters of cluster validity functionals, in review.
[8] J.C. Dunn, A fuzzy relative of the isodata process and its use in detecting compact, well-separated clusters, Cybernetics 3 (1973) 32-57.
[9] R.A. Fisher, The use of multiple measurements in taxonomic problems, Ann. Eugenics 7 (1936) 179-188.
[10] H.P. Friedman and J. Rubin, On some invariant criteria for grouping data, J. Am. Statist. Assoc. 62 (1967) 1159-1178.
[11] M.G. Kendall, Discrimination and classification, in: P.R. Krishnaiah, ed., Multivariate Analysis (Academic Press, New York, 1966) 165-185.
[12] D.G. Luenberger, Introduction to Linear and Nonlinear Programming (Addison-Wesley, Reading, MA, 1973).
[13] S. Ross, A First Course in Probability (Macmillan, New York, 1976).
[14] A. Scott and M. Symons, Clustering methods based on likelihood ratio criteria, Biometrics 27 (1971) 387-397.
[15] J. Wolfe, Pattern clustering by multivariate mixture analysis, Multivar. Behav. Res. 5 (1970) 329-350.