Tetrahedron Computer Methodology, Vol. 2, No. 3, pp. 133 to 140, 1989
0898-5529/89 $3.00+.00 Pergamon Press plc
Printed in Great Britain
Cluster Analysis for Chemists

Peter Senn
Physical Chemistry Laboratory, ETH-Zentrum, CHN H32, CH-8092 Zürich, Switzerland
Received 21 February 1989; Revised 15 January 1990; Accepted 15 January 1990
Key words: Structure-Activity; Clustering; Data Analysis; Pattern Recognition

Abstract: The simplest versions of cluster analysis currently in use are very easy to grasp. Such a straightforward clustering technique is explained in detail and a sample analysis covering five isomers of an organic compound is presented. Cluster analysis is normally used in conjunction with other techniques of multivariate analysis. Some examples where multivariate analysis has been applied to problem solving in chemistry are briefly described, and for an in-depth discussion of some pitfalls encountered in practical applications the reader is referred to the literature.
INTRODUCTION

Many problems in chemistry can be solved at least partially by identifying objects or solved problems which share some common features with the object or the problem of interest. Consider, for example, the situation where we have an unknown compound whose infrared spectrum has been recorded. We can compare this spectrum with a database of spectra, in which we might locate a spectrum that looks so similar to the given infrared spectrum that we can be quite sure the two compounds are identical. However, only in rare cases will we be lucky enough to stumble onto the spectrum of the unknown compound. More likely, the outcome of the search will be that several spectra are found which somewhat resemble the spectrum of the unknown compound. A detailed comparison among the similar spectra may reveal in what structural features they differ and may enable us to identify the unknown compound.

When comparing a given infrared spectrum with spectra in an atlas of infrared spectra, we have our own set of intuitive criteria which permit us to judge spectra as similar or dissimilar. Few databases contain visual information, and in order to search, for example, a database of digitized spectra, we would need a well-defined measure of similarity among spectra. If a pair of compounds A and B are characterized by a single quantitative attribute x, we could choose as a basis of comparison the quantity |xA − xB|, where xA and xB belong to compounds A and B respectively. We might find that among the examined compounds there are subsets of compounds which, according to the chosen criteria, are very similar, while other compounds appear to be more or less unique. The realization that something has properties that are truly unique may inspire a researcher to speculate on potential applications that exploit this unique property or rare combination of properties.
On the other hand, finding that some compounds tend to cluster in terms of the properties of interest will induce us to
contemplate possible underlying reasons.

As we all know, objects in the real world can rarely be characterized by a single attribute. Even if we ignore nonessential attributes, we are in general still left with a fair number of attributes needed to characterize the objects. Of the attributes deemed essential, those shared by all objects in the set will be ignored beyond a brief mention at the outset regarding the nature of the objects to be examined. For example, if we are concerned with polycyclic aromatic compounds, we expect all the objects in the set to possess the attributes required to unambiguously identify them as polycyclic aromatic compounds.

When dealing with two essential attributes, visualization of the relatedness among the objects is possible. Consider the case where we have an unknown compound whose molecular formula is known to be C6H12 and whose boiling point and density have been determined. We can then find in the literature all the known compounds with the given molecular formula and note their boiling points and densities. If none match the physical properties of the unknown compound within experimental accuracy, then, as a next step, we can try to find out how similar the unknown compound is in its physical properties to all the known isomers of C6H12. This can be done by preparing a scatter plot such as the one shown in Fig. 1, with an additional point, labelled with a question mark, placed at the location indicated by the pair of physical properties determined for the unknown compound. At some point we probably would have prepared an inventory of alternatives not covered by the set of compounds shown in Fig. 1. When pondering the plausibility of the remaining alternatives, we would try to determine how compatible each structure is with the position of the unknown compound on the scatter plot; some would be rejected and some would be retained for further consideration.
The applied criteria would be the degree of similarity of the proposed structure to the structures located in close proximity, the variability within different types of compounds, and so on.

A cluster analysis of two attributes based on a visual examination reveals a fundamental problem encountered in cluster analyses: the scales for individual attributes can be chosen at will, but the outcome will be affected by the choice of scaling factors. Of course there are a myriad of additional options, such as choosing a logarithmic scale for one or both attributes. We shall be concerned only with the effects of scaling. It turns out that scaling acts as a kind of weighting of the attributes. A relatively large scaling factor applied to an attribute will cause this attribute to carry greater weight in the classification of a set of compounds. Consider the case where a chemist plans to move into a new laboratory in which he wants to store his equipment according to suggestions from some kind of cluster analysis. For this chemist, it may be important that equipment used for distillations or for thin-layer chromatography be kept in one place, whereas the placement of equipment with many uses may not be so obvious. In order to induce the desired type of result from a cluster analysis, one of the chosen attributes of the equipment would have to be its intended use, and this particular attribute would have to carry the greatest weight.

The isomers shown in Fig. 1 are characterized by two attributes. This pair of attributes spans an attribute space which in the present case happens to be a plane. This allows us to perform a cluster analysis based on visual examination. Clusters can be identified simply by encircling groups of points which are considered to be related. If objects are characterized by three attributes, a visual examination of their relative positions in the three-dimensional attribute space is in principle still possible.
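The weighting effect of scaling described above can be made concrete with a short numerical sketch (the data are hypothetical and not from this article): multiplying one attribute by a large factor makes it dominate the Euclidean distance between a pair of objects.

```python
from math import sqrt

# Illustrative sketch with hypothetical data: scaling one attribute up
# makes it dominate the distance between two objects.
def dist(p, q):
    return sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))

a, b = (1.0, 10.0), (2.0, 30.0)
print(round(dist(a, b), 2))        # 20.02: the second attribute (diff 20) dominates
a2, b2 = (a[0] * 100, a[1]), (b[0] * 100, b[1])  # scale the first attribute by 100
print(round(dist(a2, b2), 2))      # 101.98: now the first attribute dominates
```

The classification that a cluster analysis produces from such distances therefore depends directly on the chosen scaling factors.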
There are useful and innovative techniques for displaying objects in higher dimensional attribute spaces. 1 Recently, Larsen 2 has shown how sets of attributes can be visualized by representing them as human faces. In some cases it is possible to map data from an attribute hyperspace onto a plane on which they can be displayed for visual examination without unacceptable distortions. 3
AN EXAMPLE

Broadly speaking, there are two types of cluster analysis. The first type has a neurophysiological basis; the human brain does perform a kind of cluster analysis, as can be inferred from such things as notions, prejudices, etc. These can be thought of as end results of cluster analyses (with subsequent classifications). In addition to this broadly-defined type of cluster analysis, there are techniques for performing cluster analyses on firm mathematical bases. Among the many different methods for clustering, I shall concentrate on a type that is referred to as hierarchical clustering or the "unweighted pair-group method using arithmetic averages".

In cluster analyses, one always starts out with a set of attributes that are not necessarily quantitative.
Binary attributes (attributes with two states such as presence/absence, yes/no) can be incorporated into a set of attributes. For the compounds shown in Fig. 1, a binary attribute would arise by asking whether they react with bromine water. In the case of binary attributes, the pair of states can be coded as, for example, 0 for "no" and 1 for "yes," or vice versa. The situation is somewhat more complicated for multistate attributes such as colourless/red/blue, where an arbitrary assignment of numbers such as -1, 0, and 1 may be unsatisfactory, simply because colourless may be viewed as being equally dissimilar to having a red colour as to being blue. One obvious solution is to convert the multistate attributes into sets of binary attributes by asking, for example, "is this object red?".

In hierarchical cluster analyses, one measures the distance among pairs of points representing objects in the chosen attribute space, identifies the nearest pair, combines them into a "cluster", recalculates an average distance from this cluster to the remaining objects or clusters, and starts the process anew. This short description of hierarchical clustering would be adequate for devising an algorithm for performing such an analysis. Only one point has been left unclear: it is not yet obvious what is meant by the term "average distance". In some hierarchical clustering techniques, the merged points are simply replaced by a new point in between. In the present case, the average distance means the arithmetic mean of all the distances among pairs of objects that can be formed by taking one object from each of the two clusters. A hierarchical cluster analysis consists of steps, and at each step, a pair of objects or clusters of objects is merged into a new cluster. The original clusters are discarded with the creation of the new unifying cluster. As a result, there is generally one less cluster after each step.
This does not necessarily apply in cases where clusters happen to be equidistant.

There are a number of ways to measure the separation between clusters in an attribute space. The measure of separation that we will be using is called the Euclidean distance coefficient (EDC). Let the pair of vectors oA and oB point from the origin of an appropriate Cartesian coordinate system to the pair of points representing the objects A and B in the chosen attribute space. The EDC for this pair of points is then equal to the magnitude of the vector pointing from the point for A to the point for B, or vice versa. This vector can be constructed as a vector sum from oA and oB, where the direction of one of the two vectors has to be reversed.

Previously it was pointed out that the results of a cluster analysis can be greatly affected by the scale chosen for the different attributes. For this reason, cluster analyses are usually preceded by a preliminary step called standardization of the data. Standardization makes attributes contribute more equally to the computed measure of similarity among pairs of objects or clusters of objects. Let the j-th component of the vector aA denote the j-th attribute of the object A. The average of the j-th attribute is
aj(bar) = (1/n) SUM_A (aj)A    (1)

where the summation is over the n objects. Standardization converts the original vectors aA by linear
transformations into the vectors oA such that the arithmetic mean of each component becomes zero while the variance and the standard deviation become unity. This can be accomplished as follows:
(oj)A = [ (aj)A − aj(bar) ] / { [1/(n − 1)] SUM_A [ (aj)A − aj(bar) ]² }^(1/2)    (2)

Standardization casts the attributes into dimensionless units. The EDC for a given pair of objects A and B can then be computed as follows:

eAB = { SUM_j [ (oj)A − (oj)B ]² }^(1/2)    (3)
where the summation is over the attributes. Let us now proceed with an example of cluster analysis. From the 20 compounds shown in Fig. 1, five have been chosen at random; those chosen are indicated by circles surrounding the points. The pairs of attributes for the five samples are shown in Table 1 together with the standardized data. The EDCs can be arranged in a resemblance matrix (see Table 2). Since resemblance matrices are symmetric, only the lower-left positions are shown.
Table 1. The Original and the Standardized Data Used for the Cluster Analysis of Five Isomers of C6H12. The quantities o1 and o2 are standardized data which have been derived from the boiling points (B.P.) and the densities (D), respectively, according to the standardization procedure shown in Equations (1) and (2).

No.   Name                     B.P.(a) (°C)   o1       D(b) (g/ml)   o2
1(c)  1-hexene                 63.5           -0.392   0.673         -0.834
2(c)  trans-2-hexene           67.9            0.247   0.678         -0.626
3(c)  2,3-dimethyl-1-butene    55.7           -1.525   0.678         -0.626
4(c)  2,3-dimethyl-2-butene    73.2            1.017   0.708          0.626
5(d)  ethylcyclobutane         70.7            0.653   0.728          1.460

(a) The boiling points are for a pressure of 760 torr. The standardized data derived from the boiling points are labelled "o1".
(b) The densities are for 20 °C. The standardized data derived from the densities are labelled "o2".
(c) The data for these isomers were taken from Ref. 14.
(d) The data for this isomer are from Ref. 15.

According to Eq. 3, the EDC between isomers no. 1 and 2 can be obtained as follows:

e12 = { (−0.39 − 0.25)² + (−0.83 + 0.63)² }^(1/2) = 0.67    (4)
This result can be found at the corresponding location of the resemblance matrix in Table 2 where it happens to be the smallest entry. This means that in the first step, isomers no. 1 and 2 are merged into a cluster.
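The standardization of Eqs. (1) and (2) and the EDC of Eqs. (3) and (4) can be checked with a few lines of Python; a sketch with function names of our own choosing, using the divisor n − 1, which reproduces the standardized values of Table 1:

```python
from math import sqrt

def standardize(values):
    """Eqs. (1)-(2): shift to zero mean, scale to unit standard deviation (divisor n - 1)."""
    n = len(values)
    mean = sum(values) / n                                    # Eq. (1)
    s = sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))  # denominator of Eq. (2)
    return [(v - mean) / s for v in values]

def edc(p, q):
    """Eq. (3): Euclidean distance coefficient between two standardized objects."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))

# boiling points and densities of the five isomers (Table 1)
o1 = standardize([63.5, 67.9, 55.7, 73.2, 70.7])
o2 = standardize([0.673, 0.678, 0.678, 0.708, 0.728])
points = list(zip(o1, o2))

print(round(o1[0], 3), round(o2[0], 3))     # -0.392 -0.834, as in Table 1
print(round(edc(points[0], points[1]), 2))  # 0.67, as in Eq. (4)
```

The printed values agree with the standardized data of Table 1 and with the EDC computed in Eq. (4).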
Table 2. The resemblance matrix which has been assembled from the standardized data o1 and o2 shown in Table 1. The entries in this matrix are the Euclidean distance coefficients for pairs of isomers, which for two attributes are { [ (o1)i − (o1)j ]² + [ (o2)i − (o2)j ]² }^(1/2), where the subscripts i and j refer to the numbers of the two isomers. The numbering of the isomers is shown in Table 1.

No.    1         2        3        4        5
1      .....
2      0.672(a)  .....
3      1.152     1.772    .....
4      2.028     1.469    2.833    .....
5      2.521     2.124    3.015    0.910    .....

(a) In the resemblance matrices shown in this work, the smallest entry is always in italics.
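The whole lower-left triangle of Table 2 can be regenerated from the raw data of Table 1; a sketch, again assuming the divisor n − 1 in the standardization:

```python
from math import sqrt

def standardize(values):
    n = len(values)
    mean = sum(values) / n
    s = sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / s for v in values]

def edc(p, q):
    return sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))

# raw boiling points and densities of the five isomers (Table 1)
o1 = standardize([63.5, 67.9, 55.7, 73.2, 70.7])
o2 = standardize([0.673, 0.678, 0.678, 0.708, 0.728])
pts = list(zip(o1, o2))

for i in range(1, 5):  # lower-left triangle, row by row, as in Table 2
    print("  ".join(f"{edc(pts[i], pts[j]):.3f}" for j in range(i)))
```

The printed rows match the entries of Table 2: 0.672; 1.152 1.772; 2.028 1.469 2.833; 2.521 2.124 3.015 0.910.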
The revised resemblance matrix will contain average distances between this cluster and the remaining three objects. The average distance between the cluster and isomer no. 3 is computed as follows:

e(12)3 = (1/2)(e13 + e23) = (1/2)(1.15 + 1.77) = 1.46    (5)
An analogous procedure gives 1.75 and 2.32 for e(12)4 and e(12)5, respectively. The remaining elements of the revised resemblance matrix can be transferred from the original resemblance matrix in Table 2, giving the results shown in Table 3.

Table 3. Revised Resemblance Matrix with Cluster (12) and Remaining Points

No.    (12)    3       4       5
(12)   .....
3      1.46    .....
4      1.75    2.83    .....
5      2.32    3.02    0.91    .....
Since the EDC for isomers no. 4 and 5 is the smallest entry in the above resemblance matrix, they will be merged into a cluster. The next resemblance matrix will have an entry with e(12)3, which has been computed in Eq. 5, and two new elements, e(12)(45) and e3(45), which can be computed as follows:
e(12)(45) = (1/4)(e14 + e15 + e24 + e25) = 2.04    (6)
e3(45) = (1/2)(e34 + e35) = 2.92    (7)
Among these remaining entries in the resemblance matrix, e(12)3 is the smallest, so that, according to our rules, isomers no. 1, 2, and 3 have to be merged into a cluster. This leaves us with only a pair of clusters, which in the next step will be merged into a single cluster. There might seem to be no use in computing the average distance between these two clusters, but since it will be needed later, we compute it anyway.
e(123)(45) = (1/6)(e14 + e15 + e24 + e25 + e34 + e35) = 2.33    (8)
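The whole sequence of merges worked out in Eqs. (4)-(8) can be reproduced by a short program; a sketch with our own function names, where the cluster-to-cluster distance is the arithmetic mean of all object-to-object EDCs between the two clusters, as described above:

```python
from math import sqrt
from itertools import product

def standardize(values):
    n = len(values)
    mean = sum(values) / n
    s = sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return [(v - mean) / s for v in values]

def edc(p, q):
    return sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))

# standardized attributes of the five isomers (Table 1)
o1 = standardize([63.5, 67.9, 55.7, 73.2, 70.7])
o2 = standardize([0.673, 0.678, 0.678, 0.708, 0.728])
pts = list(zip(o1, o2))

def avg_dist(a, b):
    """Arithmetic mean of all object-to-object EDCs between clusters a and b."""
    return sum(edc(pts[i], pts[j]) for i, j in product(a, b)) / (len(a) * len(b))

clusters = [(i,) for i in range(5)]   # isomers no. 1-5 as indices 0-4
heights = []
while len(clusters) > 1:
    # merge the pair of clusters with the smallest average distance
    a, b = min(((x, y) for k, x in enumerate(clusters) for y in clusters[k + 1:]),
               key=lambda p: avg_dist(*p))
    heights.append(round(avg_dist(a, b), 2))
    clusters = [c for c in clusters if c not in (a, b)] + [a + b]

print(heights)   # [0.67, 0.91, 1.46, 2.33] -- the EDCs of Eqs. (4)-(8)
```

The four merge heights agree with the values obtained by hand in Eqs. (4), (5), and (8) and in the discussion of Table 3.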
DISPLAY OF RESULTS FROM HIERARCHICAL CLUSTER ANALYSES

The outcome of a hierarchical cluster analysis can be depicted in a dendrogram, also called a tree. In a dendrogram, the points are placed at equidistant locations on the horizontal axis, and the EDCs for pairs of objects or clusters of objects are measured on the vertical axis. From each point a vertical line is drawn, and these lines merge at the values of the EDCs. The resulting figure looks like a stylized inverted tree. The objects may have to be rearranged in a new sequence so that the branches of the tree do not intersect each other. Let us summarize the outcome of the cluster analysis of the five isomers of C6H12 (see Fig. 2). Isomers no. 1 and 2 have been merged into a cluster at a value of the EDC of 0.67. Later this cluster has been merged, at a value of the EDC of 1.46, with isomer no. 3. Isomers no. 4 and 5 have been merged into a cluster at a value of the EDC of 0.91, and the resulting cluster has been merged with the cluster containing the remaining three isomers at a value of the EDC of 2.33. This lengthy description can be presented in a much more transparent fashion with the help of a tree.

HOW TO "READ" A TREE
The tree in Fig. 2 does not necessarily provide more information than the scatter plot in Fig. 1. What can be learned from the degrees of similarity among the five isomers would scarcely allow generalizations. In order to demonstrate the usefulness of such trees, a cluster analysis has been performed for all of the twenty isomers of C6H12 shown in Fig. 1, and the resulting tree is shown in Fig. 3. The same clustering technique has been used for both trees in Figs. 2 and 3, but for the cluster analysis with the tree shown in Fig. 3, the melting points and the indices of refraction have been used as additional attributes, so that the cluster analysis involved four attributes.

In cluster analyses, dummy compounds with suitable properties can be introduced to help one visualize the relationship of each compound to the "ideal" or the "worst possible" case, for example. An examination of the compounds that cluster with the dummy objects can provide clues in a search for improvements or in identifying potential hazards. A cluster analysis of chemical compounds can also involve computed properties such as ab initio results. Recently, topological indices of molecules have been used to predict properties of molecules. 5,6 For small molecules, some topological indices can readily be computed by hand.

Trees from cluster analyses can be used for making classifications. Usually such classifications are made simply by cutting the tree at a suitably chosen height. The original set of objects will then be separated into subsets defined by similar properties. These subsets are represented by smaller trees which can be cut further to allow more detailed classifications. When preparing classification schemes to be used by other researchers, it is sometimes sufficient to indicate typical ranges for the different attributes.
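Cutting the tree can be expressed directly over the list of merge steps; a sketch (function and variable names are ours), using the merge heights reported in the text for the five isomers (0.67, 0.91, 1.46, 2.33):

```python
# Merge steps for the five isomers: (cluster, cluster, EDC at which they merge)
merges = [((1,), (2,), 0.67), ((4,), (5,), 0.91),
          ((1, 2), (3,), 1.46), ((1, 2, 3), (4, 5), 2.33)]

def cut(merges, height, objects=(1, 2, 3, 4, 5)):
    """Apply only the merges below the cutting height; return the flat clusters."""
    clusters = [(o,) for o in objects]
    for a, b, h in merges:
        if h <= height:
            clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return clusters

print(cut(merges, 1.0))   # [(3,), (1, 2), (4, 5)]: three classes
print(cut(merges, 2.0))   # [(4, 5), (1, 2, 3)]: two classes
```

Lowering the cutting height yields the more detailed classifications mentioned above; at a height below 0.67 every isomer is its own class.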
However, frequently the regions in attribute space occupied by different classes have to be fenced off by sets of hyperplanes or even curved surfaces, necessitating the use of computers for later classifications of unknown objects. There are algorithms, called "learning machines," which, with the help of a set of classified objects called the "training set," will carve out regions in attribute space that are occupied preferentially by members of the same class. 7 The effectiveness of such algorithms can be judged by their ability to correctly classify objects not included in the training set. Learning machines can partially identify unknown objects. For example, based on typical fragmentation patterns in mass spectra, they can suggest that some types of substituents are likely present or absent in a molecule with a given mass spectrum. 8 Frequently a comparatively simple classification algorithm can replace time-consuming manual examinations of the objects. For example, the distinction between European and Africanized honeybees is not an easy task. Nevertheless, Lavine and Carlson 9 have devised a procedure that uses gas-liquid chromatography and is capable of faithfully reproducing the classifications made by experts based on visual examinations of the bees.

In addition to a classification based on properties (Fig. 1), we can superimpose a classification based on the topologies of the molecules. It is rather interesting to see whether a suitably chosen classification correlates with the topological classifications. We might ask, for example, whether the alkenes form a cluster and whether the cyclic alkanes form a separate cluster. From Fig. 3, it is readily apparent that the tree can be cut in such a manner that the alkenes, with the exception of two compounds, indeed form a cluster. This cluster of 15 isomers could be separated further into a pair of clusters with an equal number of isomers.
However, there is no obvious association among the members of the resulting pair of clusters in terms of their structural formulae. Elongated clusters often appear as sets of two or more clusters in hierarchical cluster analyses. It is not clear whether the alkenes form two separate clusters or one elongated cluster. A closer examination of the structural formulae of the two alkenes that do not cluster with the 15 remaining alkenes reveals that they are unique: one is the only compound with a tertiary butyl group; the other is unique because its carbon skeleton has D2h symmetry and no hydrogen atom attached to the carbon atoms that form the double bond.

There are only three cyclic compounds among the twenty compounds included in this cluster analysis. It is obvious that the cyclic compounds differ considerably in their physical properties from the acyclic alkenes. Ethylcyclobutane and methylcyclopentane can be viewed as forming a cluster. However, cyclohexane is more distinct in its physical properties from the cyclic compounds than from some of the alkenes. It is rather interesting to find that the unbranched alkenes differ among themselves in their physical properties just as much as they differ from the branched alkenes.

DISCUSSION

Cluster analysis is a relatively simple pattern-seeking method. The search for patterns was essential for the survival of ill-equipped early humans; thus the analytical abilities of the human brain are highly evolved. Nevertheless, patterns in attribute spaces of higher dimensions are difficult to visualize and almost impossible to detect. The hope of finding patterns in attribute spaces of low dimensions is frequently unrealistic. For example, in the diagnosis of an illness with high-performance chromatographic methods, it would be surprising to find a single feature that would provide conclusive evidence for the presence or absence of a disease.
Nevertheless, cluster analysis of a set of chromatograms in an attribute space of higher dimensions seems promising for finding sets of criteria for the diagnosis of a disease. Jurs 10 describes a method for identifying cystic fibrosis heterozygotes with the help of pyrochromatograms of tissue culture samples derived from human subjects.

Cluster analysis is generally used in conjunction with other, more sophisticated techniques of multivariate analysis. The more sophisticated these techniques are, the more difficult it is for a nonexpert to interpret the results with respect to their meaning and especially in terms of their plausibility. Chemists so far have encountered multivariate analysis mostly in collaborative ventures with other scientific fields. Jurs has expressed concern over the use of such techniques by chemists, 11 stating that in
the past investigations based on multivariate analysis have been plagued by gross errors and misconceptions, which might be repeated unless chemists are made aware of the numerous pitfalls. Hierarchical cluster analysis might seem a relatively harmless tool that provides only a tree from which the researcher can draw any conclusions he likes. However, in many situations the best possible analysis of the data requires additional tools of multivariate analysis. For example, Froidvaux et al. 12 reduced the number of attributes in a geochemical classification of sediments from Lake Erie to a small number of new "orthogonal" attributes by using a technique called principal-component analysis, concluding that this preliminary step should always be applied for the type of cluster analysis they used.

Multivariate analysis is so deeply ingrained in some branches of science that it is very difficult to imagine that those branches could exist in the absence of these tools. The intended application of techniques of multivariate analysis usually affects how data are acquired, so it becomes rather unlikely that improper sampling procedures will be the cause of spurious results. This kind of farsighted strategy will almost invariably provide reasonably convincing results. However, in many situations in chemistry, the data are rather sketchy and, by the exacting standards normally applied to published research, they often cannot be considered suitable for multivariate analysis. There have been reports of the application of pattern recognition techniques to the design of pharmaceutical drugs. 13 At an early stage of the search, when only a few compounds with the desired biological activity are known, it is possible to use multivariate analysis to propose relationships between chemical structure and pharmacological properties (including side effects).
However, when such analyses are based on a small sample size, they give results that by themselves remain highly questionable, especially if they rule in favour of a hypothesis. On the other hand, we should keep in mind that many techniques of multivariate analysis can be viewed as the rational counterpart of the comparatively fuzzy mental processes which, in the face of vagueness and uncertainty, are activated to produce working hypotheses. In conclusion, cluster analysis and related techniques could have a favourable impact on traditional fields of chemistry if the results are checked closely and the findings are carefully analyzed with respect to their ramifications.

REFERENCES
1. Wang, P. H., Ed. Graphical Representation of Multivariate Data; Acad. Press: New York, 1978.
2. Larsen, R. D. J. Chem. Educ. 1986, 63, 505 and J. Chem. Educ. 1986, 63, 1067.
3. Tou, J. T.; Gonzalez, R. C. Pattern Recognition Principles; Addison-Wesley: Reading, 1974.
4. Romesburg, H. C. Cluster Analysis for Researchers; Wadsworth, Inc.: London, 1984.
5. Rouvray, D. H. Sci. Amer. 1986, 255, 36.
6. Razinger, M. Theoret. Chim. Acta 1986, 70, 365.
7. Jurs, P. C.; Isenhour, T. L. Chemical Applications of Pattern Recognition; Wiley: New York, 1975.
8. Kowalski, B. R. In Computers in Chemical and Biochemical Research; Klopfenstein, C. E.; Wilkins, C. L., Eds.; Acad. Press: New York, 1977; Vol. 2, pp. 1-76.
9. Lavine, B.; Carlson, D. Anal. Chem. 1987, 59, 468A.
10. Jurs, P. C. Science 1986, 232, 1219.
11. Ahlgren, A. Science 1986, 234, 530 and refs. therein.
12. Froidvaux, R.; Jaquet, J. M.; Thomas, R. L. Computers and Geosciences 1977, 3, 31.
13. Kirschner, G. L.; Kowalski, B. R. In Drug Design; Ariëns, E. J., Ed.; Acad. Press: 1979; Medicinal Chemistry, a Series of Monographs; Vol. 3, pp. 73-131.
14. Dreisbach, R. R. Physical Properties of Chemical Compounds; Amer. Chem. Soc.: Washington, D.C., 1959.
15. CRC Handbook of Chemistry and Physics, 54th ed.; CRC Press: Cleveland, Ohio, 1973.