Inform. Stor. Retr.
Vol. 9, pp. 561-568.
Pergamon Press 1973.
Printed in Great Britain
A METHOD FOR THE AUTOMATIC CLASSIFICATION OF CHEMICAL STRUCTURES

GEORGE W. ADAMSON and JUDITH A. BUSH
Postgraduate School of Librarianship and Information Science, Sheffield University, Western Bank, Sheffield, S10 2TN
Summary--A method has been developed for the automatic classification of chemical structures and it has been tested by applying it to the common naturally occurring amino acids. The resulting classification is reasonable from a qualitative viewpoint. Different techniques have been compared by simulating the use of the similarity and dissimilarity coefficients and classifications for the prediction of an observable physical property and determining the agreement between the observed and "predicted" values for the species classified.

INTRODUCTION

AUTOMATIC classification in information science is a field in which there is a great deal of research activity. However, attention has been mainly concentrated on the classification of documents [1]. Also, although much attention has been given to the storage and retrieval of chemical structures and associated information [2], very little published work is available on the application of automatic classification to chemical structures.

This paper describes work which has been carried out on an approach to the automatic classification of chemical structures. The main objective of the work is to develop methods for handling the structural attributes of chemical species although a better classification of chemical species should eventually result from using as many of their attributes as possible. When suitable methods of automatically handling structural attributes have been developed then these could be incorporated with other attributes in more broadly based classifications of chemical species.

A suitable automatic method of classifying chemical structures could be expected to lead to gains in the areas of chemical structure information systems discussed below. File structures based on automatically derived classifications could lead to improved performance in systems for the storage and retrieval of chemical structures.
For example, JARDINE and VAN RIJSBERGEN [3] have recently claimed that the use of a file structure based on an automatically derived hierarchic classification is very effective for the retrieval of documents. Automatic classification could also be applied to the creation of sub-files on specialized topics from general chemical structure data bases.

One stage in automatic classification is the calculation of similarity coefficients (SCs) or dissimilarity coefficients (DCs) between the objects to be classified. Suitable coefficients could also be of use in measuring the degree of association between structure search queries and structures in the file being searched. So far structure search systems have operated mainly in the registration and substructure search modes [2]. In registration the search is for structures which are identical to the query and in substructure search, for structures
which contain the query. In these modes of operation a structure on file is either relevant or irrelevant to a particular query and no ranking of answers in order of relevance is possible (i.e. relevance is measured on a nominal scale). If suitable SCs or DCs were developed then searches could be made for structures with at least a given degree of similarity between the query and an answer, and answers could then be ranked in order of SC or DC resulting in relevance measurement on ordinal or even more precise scales.

The chemical structure of a species is often represented for written communication as a structure diagram and the automatic processing of chemical structures is usually carried out on structure diagrams in the form of machine-readable connection tables or linear notations. These may be fragmented to ease some handling problems. The structure diagram is related to the structure it describes and could be considered as a very approximate two-dimensional projection of the real structure. It is also an approximate graphical representation of the wave function of the molecule, usually corresponding more closely to valence bond or localized molecular orbital descriptions but taking some account of electron delocalization by equalizing alternating single and double bonds. Because of these relationships and the relationship between the wave function, structure and other properties of a species the characteristics of structure diagrams should be correlated approximately with the physical, biological and chemical properties of the species. The extent of the correlation is, however, extremely difficult to estimate a priori because of the large approximations involved.

The expected imperfect correlation between structure diagram and observable properties of chemical species has consequences which may be important.
The correlation between SCs or DCs based on structure and other properties of the chemical species provides a criterion for the comparison of the performance of different SCs and DCs. The expected correlation also leads to the possibility that the SCs, DCs, or classifications based wholly or in part upon structural attributes, will be of some value in predicting unknown properties of chemical species. If such predictive use is feasible then the usefulness of chemical information storage and retrieval systems may be considerably increased. Also, by simulating the predictive use of a classification it may be evaluated by comparing "predicted" with observed properties.

In this publication we will illustrate the methods by using the 20 most common natural amino acids as an example. This enables comparisons to be made with another classification of the amino acids [4] which was made on the basis of manually assigned structural attributes and physical, chemical and biological properties. In the method to be described here the classification is based solely on structural attributes which are recognized and assigned automatically by computer and thus the algorithms developed could be applied without modification to any group of structures. It should be possible to deal with of the order of 10^4 structures using similar methods [5]. To illustrate the predictive use of the classification the pI values of the amino acids have been used [6].

METHOD

The structures of the 20 amino acids were coded as redundant connection tables for input to the computer. Bonds were divided into five types for coding, namely: single chain, single ring, double chain, double ring and alternating ring bonds. Tautomeric bonds were not represented as such but were reduced to one of the five possibilities listed above. Similarity and dissimilarity coefficients were based on the presence or absence of augmented atoms [7] in the structures. An augmented atom was generated centred on
each atom of each structure and consists of the central atom, the bonds it forms and the atoms to which it is bonded (excluding hydrogen atoms and bonds to hydrogen). Three different descriptions in terms of augmented atoms were used. These are:

(i) The presence or absence of an augmented atom type in a structure was noted. The second and subsequent occurrences of the same augmented atom type in the same structure were ignored.
(ii) Each occurrence of an augmented atom type in a structure was noted. Multiple occurrences of the same augmented atom type in a structure were allowed for in the calculation of SCs and DCs by an additive coding method [4].
(iii) Multiple occurrences of augmented atom types were treated as in (ii) but only three bond types, namely alternating ring bonds, single bonds and double bonds, were discriminated. Thus in contrast to (i) and (ii) ring and chain bonds were not differentiated in the case of double and single bonds.

The first stage in the calculation was to analyse the 20 structures and note all of the augmented atom types which occurred. This gave a list of all of the attributes upon which the calculation of SCs and DCs would be based. Next a description of each structure in terms of the set of attributes was formed and stored in a bit vector. Each pair of bit vectors was then compared to calculate the SC or DC between the corresponding pair of amino acids. For each pair of structures the attributes were divided into four groups containing a, b, c and d attributes where "a", "b", "c" and "d" are the entries in a 2 x 2 table, i.e. "a" is the number of attributes which are common to both structures, "b" and "c" are the numbers which occur in the first structure but not the second and vice versa, and "d" is the number which occurs in the set of structures but in neither of the pair of structures to which this SC or DC refers. Three coefficients were used [4, 8]:

(i) Dice's SC:
$$ \mathrm{SC} = \frac{2a}{2a + b + c} $$

(ii) φ:
$$ \phi = \frac{ad - bc}{[(a+b)(a+c)(d+b)(d+c)]^{1/2}} $$

(iii) Sneath's DC:
$$ \mathrm{DC} = \frac{b + c}{a + b + c + d} $$
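The steps described above (fragmenting each structure into augmented atoms, comparing the attribute sets of a pair of structures, and evaluating the three coefficients) can be sketched in Python. This is only an illustrative sketch and not the PLAN and FORTRAN programs used in this work: the connection-table layout, the simplified bond labels "s" and "d", the function names and the three hydrogen-suppressed toy structures are all assumptions made for the example, and Python sets stand in for the bit vectors. It corresponds to description (i), in which repeated occurrences of an augmented atom type are ignored.

```python
from math import sqrt


def augmented_atoms(atoms, bonds):
    """Return the set of augmented-atom types found in one structure.

    atoms: element symbols of the non-hydrogen atoms (assumed layout).
    bonds: (i, j, bond_type) triples indexing into `atoms`.
    """
    neighbours = [[] for _ in atoms]
    for i, j, kind in bonds:
        neighbours[i].append((kind, atoms[j]))
        neighbours[j].append((kind, atoms[i]))
    # Sorting makes each description independent of the input atom order.
    return {(atoms[i], tuple(sorted(nbrs))) for i, nbrs in enumerate(neighbours)}


def two_by_two(x, y, all_attributes):
    """Entries a, b, c, d of the 2 x 2 table for attribute sets x and y."""
    a = len(x & y)                     # common to both structures
    b = len(x - y)                     # in the first structure only
    c = len(y - x)                     # in the second structure only
    d = len(all_attributes - (x | y))  # in the whole set but in neither
    return a, b, c, d


def dice_sc(a, b, c, d):
    return 2 * a / (2 * a + b + c)


def phi(a, b, c, d):
    # Undefined (division by zero) if any marginal total is zero.
    return (a * d - b * c) / sqrt((a + b) * (a + c) * (d + b) * (d + c))


def sneath_dc(a, b, c, d):
    return (b + c) / (a + b + c + d)


# Hypothetical hydrogen-suppressed connection tables ("s" single, "d" double).
BACKBONE = [(0, 1, "s"), (1, 2, "s"), (2, 3, "d"), (2, 4, "s")]
TABLES = {
    "Gly": (["N", "C", "C", "O", "O"], BACKBONE),
    "Ala": (["N", "C", "C", "O", "O", "C"], BACKBONE + [(1, 5, "s")]),
    "Ser": (["N", "C", "C", "O", "O", "C", "O"],
            BACKBONE + [(1, 5, "s"), (5, 6, "s")]),
}

sets = {name: augmented_atoms(*table) for name, table in TABLES.items()}
all_attributes = set().union(*sets.values())
a, b, c, d = two_by_two(sets["Gly"], sets["Ala"], all_attributes)
```

For glycine and alanine within this three-structure set the 2 x 2 table is a = 4, b = 1, c = 2, d = 1, giving Dice's SC = 8/11, φ = 2/√180 and Sneath's DC = 3/8. These figures are not comparable with the values in Table 1, which were obtained with description (ii) and all 20 structures.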
The matrix of SCs obtained using φ and structural description (ii) is shown in Table 1. The amino acids were classified by the single linkage method [8] using an algorithm originated by van Rijsbergen [9].

Computation was carried out on an ICL 1907 computer with a core store of 24-bit words and a cycle time of approximately 2 μsec. The program which analysed the connection tables and described each structure in terms of the set of attributes (augmented atoms) found was written in PLAN (the ICL assembly language) and incorporated PLAN and FORTRAN subroutines for the calculation of SCs and DCs (CPU times < 10 sec; core storage used 2795 words + working storage). Clustering was carried out by a PLAN program which incorporated van Rijsbergen's clustering algorithm [9] in the form of a FORTRAN subroutine. The clustering program operates on a ranked listing of SCs or DCs and prints these together with the clusters found at each level of association, if they differ from the clusters found at the previous level (CPU times < 20 sec; core storage used 3670 words + working storage).
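The behaviour of the clustering program described above (working down a ranked listing of SCs and reporting the clusters only at levels where they change) can be illustrated with a short Python sketch. It is not van Rijsbergen's algorithm itself: it uses a simple union-find structure, and the function name and data layout are assumptions. The ten φ values are those given in Table 1 for five of the amino acids.

```python
from collections import defaultdict


def single_linkage(pairs):
    """Single linkage clustering from a dict {(x, y): similarity}.

    Returns a list of (level, clusters) for every level at which the
    clusters differ from those found at the previous level.
    """
    items = sorted({x for pair in pairs for x in pair})
    parent = {x: x for x in items}

    def find(x):
        # Follow parent links to the cluster representative.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    by_level = defaultdict(list)
    for (x, y), sc in pairs.items():
        by_level[sc].append((x, y))

    history = []
    for level in sorted(by_level, reverse=True):  # ranked listing of SCs
        changed = False
        for x, y in by_level[level]:
            root_x, root_y = find(x), find(y)
            if root_x != root_y:
                parent[root_x] = root_y
                changed = True
        if changed:
            groups = defaultdict(list)
            for x in items:
                groups[find(x)].append(x)
            clusters = sorted(g for g in groups.values() if len(g) > 1)
            history.append((level, clusters))
    return history


# phi values for five amino acids, taken from Table 1.
PAIRS = {
    ("Ileu", "Leu"): 1.000, ("Ileu", "Val"): 0.930, ("Leu", "Val"): 0.930,
    ("Ala", "Val"): 0.844, ("Ala", "Thr"): 0.844, ("Ala", "Ileu"): 0.784,
    ("Ala", "Leu"): 0.784, ("Thr", "Val"): 0.696, ("Ileu", "Thr"): 0.639,
    ("Leu", "Thr"): 0.639,
}

history = single_linkage(PAIRS)
for level, clusters in history:
    print(f"{level:.3f}", clusters)
```

For this five-acid subset the clusters change at three levels only: leucine and isoleucine join at φ = 1.000, valine joins them at 0.930, and alanine and threonine join at 0.844. A dissimilarity coefficient could be handled in the same way by ranking in ascending order.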
TABLE 1. pI AND φ VALUES FOR THE 20 COMMON NATURALLY OCCURRING AMINO ACIDS [6]. THE φ VALUES WERE CALCULATED USING STRUCTURAL DESCRIPTION (ii)

Amino acid         pI    Ala    Arg    Asp    Asp    Cys    Glu    Glu    Gly    His   Ileu    Leu    Lys    Met   Phen    Pro    Ser    Thr    Try    Tyr
                                            (NH2)                (NH2)
Alanine          6.00
Arginine        10.76  0.503
Aspartic acid    2.77  0.621  0.452
Asparagine       5.41  0.621  0.578  0.722
Cysteine         5.07  0.734  0.434  0.552  0.552
Glutamic acid    3.22  0.577  0.524  0.935  0.668  0.508
Glutamine        5.65  0.577  0.645  0.668  0.935  0.508  0.743
Glycine          5.97  0.693  0.586  0.530  0.530  0.629  0.491  0.491
Histidine        7.59  0.537  0.359  0.491  0.491  0.469  0.442  0.442  0.457
Isoleucine       6.02  0.784  0.452  0.583  0.583  0.552  0.535  0.535  0.530  0.491
Leucine          5.98  0.784  0.452  0.583  0.583  0.552  0.535  0.535  0.530  0.491  1.000
Lysine           9.74  0.577  0.766  0.535  0.668  0.508  0.614  0.743  0.661  0.442  0.535  0.535
Methionine       5.74  0.621  0.452  0.583  0.583  0.705  0.535  0.535  0.530  0.491  0.583  0.583  0.535
Phenylalanine    5.48  0.503  0.318  0.452  0.452  0.434  0.403  0.403  0.426  0.359  0.452  0.452  0.403  0.452
Proline          6.30  0.331  0.114  0.203  0.203  0.282  0.171  0.171  0.390  0.277  0.203  0.203  0.171  0.203  0.114
Serine           5.68  0.734  0.434  0.705  0.552  0.662  0.655  0.508  0.629  0.469  0.552  0.552  0.508  0.552  0.434  0.282
Threonine        6.16  0.844  0.377  0.639  0.494  0.602  0.590  0.450  0.575  0.412  0.639  0.639  0.450  0.494  0.377  0.240  0.763
Tryptophan       5.89  0.416  0.213  0.354  0.354  0.347  0.302  0.302  0.350  0.475  0.354  0.354  0.302  0.354  0.640  0.164  0.347  0.288
Tyrosine         5.66  0.471  0.281  0.539  0.417  0.403  0.485  0.367  0.399  0.322  0.417  0.417  0.367  0.417  0.835  0.088  0.538  0.473  0.589
Valine           5.96  0.844  0.377  0.494  0.494  0.602  0.450  0.450  0.575  0.412  0.930  0.930  0.450  0.494  0.377  0.240  0.602  0.696  0.288  0.345

The columns follow the same order as the rows; Asp(NH2) is asparagine and Glu(NH2) is glutamine.
RESULTS AND DISCUSSION

(a) The similarity and dissimilarity coefficients

The relative usefulness of the SCs and DC used and of the structural representations were assessed by simulating their predictive use. It was assumed that the pI value of each amino acid in turn was not known and its pI was set equal to that of the amino acid with which it was most similar according to the values of the SC or DC under consideration. The average value of the difference between observed and "predicted" pI values was then calculated. Where an amino acid's highest similarity was with two or more others the average of the pI values was taken. The results are given in the lower entries in the cells of Table 2. The best result was found for Dice's SC and for φ with structural description (ii). These gave average deviations of 0.43 (median = 0.29) pI units between the observed and "predicted" values for the 20 amino acids.

It is instructive to compare the value of 0.43 with two other values which could be obtained under other circumstances. A high value would arise if it were not possible to form classes of amino acids from the 20 studied. In this case the "predicted" value for any acid would be the average pI value taken over the other 19 structures. The average deviation in this case is 1.08 pI units (median = 0.40). The smallest possible value which could be expected would occur if each acid had its highest SC or lowest DC with the acid which also had the nearest pI value. In this case the average deviation between observed and "predicted" values would be 0.26 pI units (median = 0.07). This value could be reduced by including more amino acids in the data base upon which the "prediction" is based or by changing the method of "prediction".
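The two reference values quoted above depend only on the pI values of Table 1, so they can be checked with a few lines of Python. The following is a minimal sketch; the function names and the dictionary keys are assumptions made for the example.

```python
from statistics import mean, median

# pI values of the 20 amino acids as listed in Table 1.
PI = {
    "Ala": 6.00, "Arg": 10.76, "Asp": 2.77, "Asp(NH2)": 5.41, "Cys": 5.07,
    "Glu": 3.22, "Glu(NH2)": 5.65, "Gly": 5.97, "His": 7.59, "Ileu": 6.02,
    "Leu": 5.98, "Lys": 9.74, "Met": 5.74, "Phen": 5.48, "Pro": 6.30,
    "Ser": 5.68, "Thr": 6.16, "Try": 5.89, "Tyr": 5.66, "Val": 5.96,
}


def no_classification_deviations(pi):
    """Predict each pI as the average pI of the other 19 acids."""
    total, n = sum(pi.values()), len(pi)
    return [abs(v - (total - v) / (n - 1)) for v in pi.values()]


def nearest_pi_deviations(pi):
    """Predict each pI from the acid whose pI is closest (best possible case)."""
    values = list(pi.values())
    return [
        min(abs(v - w) for j, w in enumerate(values) if j != i)
        for i, v in enumerate(values)
    ]


high = no_classification_deviations(PI)
low = nearest_pi_deviations(PI)
print(round(mean(high), 2), round(median(high), 2))
print(round(mean(low), 2), round(median(low), 2))
```

With the Table 1 values this gives a mean deviation of about 1.08 (median about 0.40) when no classes are formed, and about 0.26 (median about 0.07) in the best possible case.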
The actual mean deviation of 0.43 found using φ or Dice's coefficient is thus close to the smallest possible value of 0.26 for this set of structures and is very much less than the value of 1.08 which would have resulted if no resolution of structures had been obtained. The SCs and DC do not distinguish between leucine and isoleucine (φ = 1). These structures could be resolved by using fragments larger than augmented atoms.

TABLE 2. MEAN DIFFERENCES BETWEEN OBSERVED AND "PREDICTED" pI VALUES FOR THE 20 NATURALLY OCCURRING AMINO ACIDS USING THREE TYPES OF SC AND DC AND THREE STRUCTURAL REPRESENTATIONS IN TERMS OF AUGMENTED ATOMS. MEDIAN VALUES ARE GIVEN IN BRACKETS. THE UPPER VALUES IN THE CELLS WERE CALCULATED FROM THE AVERAGE pI VALUE OF THE CLUSTER WHICH AN ACID JOINED AND THE LOWER ENTRY FROM THE ACID(S) WITH WHICH THE ACID WITH "UNKNOWN" pI VALUE HAS THE HIGHEST SC OR LOWEST DC

SC or DC type    Structural representation
                 (i)            (ii)           (iii)
Dice SC          0.81 (0.24)    0.39 (0.25)    0.42 (0.25)
                 -              0.43 (0.29)    0.48 (0.33)
Sneath DC        0.81 (0.24)    0.74 (0.22)    0.76 (0.28)
                 -              0.50 (0.32)    0.46 (0.35)
φ                0.81 (0.24)    0.39 (0.25)    0.42 (0.28)
                 -              0.43 (0.29)    0.48 (0.33)
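The simulated prediction from the most similar acid, described in section (a), can be reproduced from the pI and φ values of Table 1. The Python sketch below is an illustration of that procedure under stated assumptions, not the original program: the names `LOWER`, `similarity` and `predict` are hypothetical, and the φ matrix is entered as the lower triangle of Table 1.

```python
from statistics import mean, median

NAMES = ["Ala", "Arg", "Asp", "Asp(NH2)", "Cys", "Glu", "Glu(NH2)", "Gly",
         "His", "Ileu", "Leu", "Lys", "Met", "Phen", "Pro", "Ser", "Thr",
         "Try", "Tyr", "Val"]
PI = [6.00, 10.76, 2.77, 5.41, 5.07, 3.22, 5.65, 5.97, 7.59, 6.02,
      5.98, 9.74, 5.74, 5.48, 6.30, 5.68, 6.16, 5.89, 5.66, 5.96]
# Row k holds phi between NAMES[k] and NAMES[0], ..., NAMES[k - 1] (Table 1).
LOWER = [
    [],
    [0.503],
    [0.621, 0.452],
    [0.621, 0.578, 0.722],
    [0.734, 0.434, 0.552, 0.552],
    [0.577, 0.524, 0.935, 0.668, 0.508],
    [0.577, 0.645, 0.668, 0.935, 0.508, 0.743],
    [0.693, 0.586, 0.530, 0.530, 0.629, 0.491, 0.491],
    [0.537, 0.359, 0.491, 0.491, 0.469, 0.442, 0.442, 0.457],
    [0.784, 0.452, 0.583, 0.583, 0.552, 0.535, 0.535, 0.530, 0.491],
    [0.784, 0.452, 0.583, 0.583, 0.552, 0.535, 0.535, 0.530, 0.491, 1.000],
    [0.577, 0.766, 0.535, 0.668, 0.508, 0.614, 0.743, 0.661, 0.442, 0.535,
     0.535],
    [0.621, 0.452, 0.583, 0.583, 0.705, 0.535, 0.535, 0.530, 0.491, 0.583,
     0.583, 0.535],
    [0.503, 0.318, 0.452, 0.452, 0.434, 0.403, 0.403, 0.426, 0.359, 0.452,
     0.452, 0.403, 0.452],
    [0.331, 0.114, 0.203, 0.203, 0.282, 0.171, 0.171, 0.390, 0.277, 0.203,
     0.203, 0.171, 0.203, 0.114],
    [0.734, 0.434, 0.705, 0.552, 0.662, 0.655, 0.508, 0.629, 0.469, 0.552,
     0.552, 0.508, 0.552, 0.434, 0.282],
    [0.844, 0.377, 0.639, 0.494, 0.602, 0.590, 0.450, 0.575, 0.412, 0.639,
     0.639, 0.450, 0.494, 0.377, 0.240, 0.763],
    [0.416, 0.213, 0.354, 0.354, 0.347, 0.302, 0.302, 0.350, 0.475, 0.354,
     0.354, 0.302, 0.354, 0.640, 0.164, 0.347, 0.288],
    [0.471, 0.281, 0.539, 0.417, 0.403, 0.485, 0.367, 0.399, 0.322, 0.417,
     0.417, 0.367, 0.417, 0.835, 0.088, 0.538, 0.473, 0.589],
    [0.844, 0.377, 0.494, 0.494, 0.602, 0.450, 0.450, 0.575, 0.412, 0.930,
     0.930, 0.450, 0.494, 0.377, 0.240, 0.602, 0.696, 0.288, 0.345],
]


def similarity(i, j):
    """phi between acids i and j, read from the symmetric lower triangle."""
    return LOWER[max(i, j)][min(i, j)]


def predict(i):
    """Treat the pI of acid i as unknown and predict it from its most similar
    acid(s); tied nearest neighbours are averaged."""
    others = [j for j in range(len(NAMES)) if j != i]
    best = max(similarity(i, j) for j in others)
    nearest = [j for j in others if similarity(i, j) == best]
    return mean(PI[j] for j in nearest), [NAMES[j] for j in nearest]


deviations = [abs(PI[i] - predict(i)[0]) for i in range(len(NAMES))]
print(round(mean(deviations), 2), median(deviations))
```

Run on the Table 1 values, this gives a mean deviation of about 0.43 and a median of about 0.285, in agreement with the lower entry for φ and structural description (ii) in Table 2. Ties occur for alanine (threonine and valine) and for valine (leucine and isoleucine).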
(b) The classifications

The classifications were tested in a similar manner to the SCs and DC but the "predicted" value of an acid was taken to be the average pI value of the cluster which it joined. On this basis Dice's SC and φ performed best (see Table 2) with average deviations of 0.39 pI units (median = 0.25) for structural representation (ii). This value could presumably be reduced by improvements in the description of the amino acids, including non-structural descriptions and by improvements to the SC or DC and the clustering technique.

The classification is illustrated in Fig. 1 as a dendrogram. The clusters formed are sensible from a general qualitative chemical point of view and correspond closely to those described by SNEATH [4] and MEISTER [6]. There is a broad breakdown into the cyclic and acyclic amino acids. The two dicarboxylic acids, glutamic and aspartic acids, form a cluster as do glutamine and asparagine which are both amides, and lysine and arginine which contain two -NH2 groups. The acyclic hydroxy amino acids, serine and threonine, do not form a separate cluster but both join the same cluster at a different level.

FIG. 1. Dendrogram of the structural relations between the 20 common naturally occurring amino acids based on φ, the single linkage clustering method and structural representation (ii). [Dendrogram: φ scale from 0.35 to 1; branch labels include acyclic, cyclic, acids, amides, basic and aromatic; the leaves are the 20 amino acids with their structural formulae.]

To illustrate the strength of the relationship between the structural classification and the pI the product moment correlation coefficient between the observed pI values and the values "predicted" on the basis of the classification was calculated and found to be 0.94 (a value of 1.00 would be obtained for perfect correlation). A graph of observed against "predicted" pI values for the 20 acids is shown in Fig. 2 together with the least squares best straight line through the points. The very high correlation coefficient obtained here is due to some extent to the ability of the method to distinguish three main groups of amino acids, namely the two strongly acidic amino acids, the two strongly basic and the remaining amino acids which have nearly neutral pI values.
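For reference, the product moment correlation coefficient and the least squares best straight line mentioned above can be computed as in the following Python sketch. The function names are assumptions, and the four data points are hypothetical values chosen only to exercise the code; they are not the observed and "predicted" pI values of this work.

```python
from math import sqrt


def _sums(xs, ys):
    """Means and centred sums of squares and products for paired data."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return mx, my, sxy, sxx, syy


def pearson_r(xs, ys):
    """Product moment correlation coefficient between xs and ys."""
    _, _, sxy, sxx, syy = _sums(xs, ys)
    return sxy / sqrt(sxx * syy)


def least_squares_line(xs, ys):
    """Slope and intercept of the least squares line of ys on xs."""
    mx, my, sxy, sxx, _ = _sums(xs, ys)
    slope = sxy / sxx
    return slope, my - slope * mx


# Hypothetical data lying exactly on y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
r = pearson_r(xs, ys)
slope, intercept = least_squares_line(xs, ys)
```

A value of r close to 1 indicates that the points lie close to a rising straight line; for data lying exactly on a line, as in this toy case, r is 1.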
FIG. 2. Graph of observed against "predicted" pI values using the classification shown in Fig. 1.

CONCLUSION

The results obtained show that the combination of structure handling techniques originally developed for information storage and retrieval and numerical taxonomic techniques developed for biological classification leads to a classification which is reasonable from a general qualitative chemical point of view. There is substantial agreement between the classification and SCs and DCs based on the structure diagram and a physical property of the amino acids. The use of the SCs and DC and the classification for predicting a physical property has been simulated and high correlation between observed and "predicted" values found. It is interesting to note that we have constructed no physical model for the relationship between the structure diagrams
and the pI values. The relationship must be implicit in the structure diagrams and has been brought out by the application of information processing techniques, some of which are of very general applicability. This indicates that in the automatic analysis of the properties of chemical species for the purpose of predicting unknown biological, physical or chemical properties the structural properties as represented by the structure diagram are likely to be correlated with the unknown properties and they or more accurate structural descriptions ought to be included in the calculations.

The connection table used in this work would not differentiate between geometrical or optical isomers but this limitation could be removed by using a more detailed structural representation or by including, in the attribute set used for the SC or DC calculations, properties which differ for the different isomers.

Acknowledgements--We would like to thank Dr. M. F. LYNCH for his encouragement and advice and the Office for Scientific and Technical Information (London) for the award of a Postgraduate Research Studentship (to JUDITH A. BUSH).
REFERENCES

[1] G. SALTON: Automatic Information Organisation and Retrieval. McGraw-Hill, New York (1968); The SMART Retrieval System--Experiments in Automatic Document Processing. Prentice-Hall, Englewood Cliffs (1971); K. SPARCK-JONES: Automatic Keyword Classification for Information Retrieval. Butterworth, London (1971).
[2] M. F. LYNCH, J. M. HARRISON, W. G. TOWN and J. E. ASH: Computer Handling of Chemical Structure Information. Macdonald, London (1971).
[3] N. JARDINE and C. J. VAN RIJSBERGEN: The use of hierarchic clustering in information retrieval. Inform. Stor. Retr. 1971, 7, 217-240.
[4] P. H. A. SNEATH: Relations between chemical structure and biological activity in peptides. J. Theoret. Biol. 1966, 12, 157-195.
[5] R. SIBSON: SLINK: An optimally efficient algorithm for the single-link cluster method. Computer J. 1973, 16, 30-34.
[6] A. MEISTER: The Biochemistry of the Amino Acids. Academic Press, New York (1957).
[7] G. W. ADAMSON, M. F. LYNCH and W. G. TOWN: Analysis of structural characteristics of chemical compounds in a large computer-based file. Part II. Atom-centred fragments. J. Chem. Soc., C 1971, 3702-3706.
[8] R. R. SOKAL and P. H. A. SNEATH: Principles of Numerical Taxonomy. Freeman, San Francisco (1963).
[9] N. JARDINE and R. SIBSON: Mathematical Taxonomy. Wiley, London (1971).