Trends in Analytical Chemistry, vol. 8, no. 7, 1989
Principal components of a fuzzy clustering
Peter J. Rousseeuw, Marie-Paule Derde and Leonard Kaufman
Brussels, Belgium
Cluster analysis is the art of finding groups in data. It can be applied to multivariate data (where each object is characterized by its coordinates) as well as to proximity data (where a collection of similarities or dissimilarities between the objects is given). Among the many accepted clustering techniques, the fuzzy clustering approach has enjoyed growing popularity over the last decade. In contrast to 'hard' clustering methods, in which each object is assigned to a single cluster, fuzzy clustering can describe ambiguity in the data, such as the existence of points that lie between two clusters. In fuzzy clustering, each object is more or less 'spread out' over the various clusters via membership coefficients that range from zero to one.

Table I shows an example with three clusters. The data are taken from Buydens¹, who characterized monofunctional molecules by two physicochemical parameters (Hansch's constant π and Hammett's constant σ), five molecular connectivity parameters (denoted by ⁰χ, ¹χ, ²χ, ³χp, and ³χt), and four quantum chemical parameters (denoted by q?, qa, qa2, and dm). The dissimilarities between these eleven parameters were calculated as (1 − r)/2, r being the Pearson correlation coefficient. The fuzzy clustering was obtained with the program FANNY².

TABLE I. Fuzzy membership coefficients

Parameter i    m_i1     m_i2     m_i3
π              0.362    0.284    0.354
σ              0.023    0.957    0.020
⁰χ             0.636    0.168    0.196
¹χ             0.502    0.197    0.301
²χ             0.835    0.049    0.116
³χp            0.936    0.026    0.038
³χt            0.939    0.018    0.043
q?             0.003    0.995    0.002
qa             0.005    0.992    0.003
qa2            0.001    0.998    0.001
dm             0.026    0.011    0.963

From Table I, we see that ⁰χ belongs 63% to cluster 1, 17% to cluster 2, and 20% to cluster 3. For each parameter (i.e., in each row) the sum of all memberships equals one (or 100%). We note that σ belongs mainly to cluster 2, that ³χp belongs mainly to cluster 1, and so on. The parameter π is an intermediate case because it has substantial memberships in all clusters. Therefore π does not really belong to any one cluster, which makes it difficult to classify.

Fuzzy clustering yields more detailed information than hard clustering, but this can also be a disadvantage because the output grows rapidly with the number of objects and clusters. For example, Table I contains 33 numbers. To summarize this information we have computed the principal components of the membership coefficients. We applied a standard principal components program (as available in SPSS, BMDP, SAS, etc.) to the memberships, in the same way that it is usually applied to coordinates. We have not found any prior use of this approach in the literature.

The number of non-degenerate principal components is the number of fuzzy clusters minus 1 (because each membership can always be written as a function of the remaining memberships, namely one minus their sum). For the example in Table I this yields two principal components, the scores of which are plotted in Fig. 1. In this plot the three clusters can clearly be seen. Three of the quantum chemical parameters (qa2, q?, and qa) form a homogeneous group together with σ. The parameter dm becomes a singleton cluster, providing information quite different from that of all other parameters. The molecular connectivity parameters form a rather inhomogeneous cluster, and the intermediate position of π is clearly visible. Because this method is based on memberships only, it can be used for proximity data as well as measurements.

Fig. 1. Principal components of memberships of 11 parameters in three fuzzy clusters.

0165-9936/89/$03.00. © Elsevier Science Publishers B.V.

Fig. 2. First two principal components of memberships of Ruspini data (consisting of 75 objects) in four fuzzy clusters.

Let us now look at some special cases. When there are only two fuzzy clusters, only one non-degenerate component will be left. In that case it is sufficient to plot the values of m_i1, because the m_i2 can then be read from right to left as m_i2 = 1 − m_i1.
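The principal components computation applied to the memberships of Table I can be sketched as follows. This is a minimal illustration, assuming a covariance-based analysis done with NumPy; the article only states that a standard program was used, so the centering, scaling and sign conventions here are assumptions and the scores need not match the axes of Fig. 1. The function name `membership_pca` is hypothetical.

```python
import numpy as np

# Membership coefficients copied from Table I.
# Rows: the 11 parameters in table order; columns: clusters 1, 2, 3.
M = np.array([
    [0.362, 0.284, 0.354],
    [0.023, 0.957, 0.020],
    [0.636, 0.168, 0.196],
    [0.502, 0.197, 0.301],
    [0.835, 0.049, 0.116],
    [0.936, 0.026, 0.038],
    [0.939, 0.018, 0.043],
    [0.003, 0.995, 0.002],
    [0.005, 0.992, 0.003],
    [0.001, 0.998, 0.001],
    [0.026, 0.011, 0.963],
])


def membership_pca(memberships):
    """Covariance-based PCA of a membership matrix (assumed variant).

    Returns eigenvalues in decreasing order, the matching eigenvectors
    (as columns), and the component scores of every object.
    """
    centered = memberships - memberships.mean(axis=0)
    cov = centered.T @ centered / (len(memberships) - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]        # reorder to descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    scores = centered @ eigvecs
    return eigvals, eigvecs, scores


eigvals, eigvecs, scores = membership_pca(M)
print("eigenvalues:", eigvals)
print("scores on the first two components:")
print(scores[:, :2])
```

Because every row of memberships sums to one, the last eigenvalue is zero up to rounding, which leaves two non-degenerate components for three clusters.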
When there are three fuzzy clusters, each object has memberships (m_i1, m_i2, m_i3). The possible combinations fill an equilateral triangle in three-dimensional space. Principal components recover this triangle (as in Fig. 1), but one can also plot the memberships directly by means of barycentric coordinates (also called trilinear coordinates). For instance, one could plot the point (m_i2 + m_i3/2, m_i3 √3/2) for each object i. In this way, the plotted points lie in an equilateral triangle with vertices (0, 0), (1, 0), and (1/2, √3/2).

When there are more than three fuzzy clusters, there will be more than two principal components. We then follow the customary practice of displaying the two components with the largest eigenvalues, thereby 'explaining' the largest portion of the variability. An example is given in Fig. 2, where the two main components of the memberships (m_i1, m_i2, m_i3, m_i4) are displayed. One could also draw a three-dimensional plot, or draw several plots in which each component is plotted against every other component.

The plot can also be refined by adding 'ideal' objects that correspond to the clusters themselves. Indeed, each cluster can be represented by an object with membership 1 in that cluster and zero membership in all other clusters. By transforming these 'membership coordinates' in the same way as the actual objects, the plot is enriched by as many additional points as there are clusters. In this way the final plot contains both objects and clusters (in the same way that correspondence analysis yields plots containing both objects and variables).

References
1 L. Buydens, Structure-Activity Relation: Contribution to the Gaschromatography and Study of Neuroleptic Drugs, Ph.D. Thesis, Vrije Universiteit Brussel, Brussels, 1986, p. 135.
2 L. Kaufman and P. J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis, Wiley-Interscience, in press.

Peter J. Rousseeuw is with the Centre for Statistics and Operations Research, Vrije Universiteit Brussel, Pleinlaan 2, B-1050 Brussels, Belgium. Marie-Paule Derde and Leonard Kaufman are with the Pharmaceutical Institute, Vrije Universiteit Brussel, Laarbeeklaan 103, B-1090 Brussels, Belgium.
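For the three-cluster case, the barycentric plot and the 'ideal' objects described in the text can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the authors' program: the function name `barycentric_xy` is hypothetical, and the example point is the first row of Table I.

```python
import numpy as np


def barycentric_xy(memberships):
    """Map memberships (m_i1, m_i2, m_i3) to the plane.

    Uses the mapping given in the text:
    x = m_i2 + m_i3 / 2,  y = m_i3 * sqrt(3) / 2.
    """
    m = np.asarray(memberships, dtype=float)
    x = m[:, 1] + m[:, 2] / 2.0
    y = m[:, 2] * np.sqrt(3.0) / 2.0
    return np.column_stack([x, y])


# 'Ideal' objects: membership 1 in one cluster and 0 in the others.
# Their images are the three vertices of the triangle.
vertices = barycentric_xy(np.eye(3))

# First row of Table I, an object with substantial membership everywhere,
# which therefore falls well inside the triangle.
point = barycentric_xy([[0.362, 0.284, 0.354]])

print("vertices:")
print(vertices)
print("example point:", point[0])
```

Mapping the ideal objects with the same function as the actual objects places clusters and objects in one plot, as suggested in the text.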