Analytica Chimica Acta 388 (1999) 243±250
Optimisation of structure representation for QSAR studies Jure Zupan*, Marjana NovicÏ National Institute of Chemistry, Ljubljana, Hajdrihova 19, Slovenia Received 8 October 1998; received in revised form 15 December 1998; accepted 15 December 1998
Abstract Optimisation of a spectrum-like structure representation via genetic algorithm (GA) is described. The ®nal optimised structure representation of 28 molecules (¯avonoid derivatives, inhibitors of the enzyme p56lck protein tyrosine kinase) contains only 15 variables compared with the 120 ones of the initial spectrum-like representation. The ®tness function in the variable reduction of the GA procedure were counterpropagation arti®cial neural network (ANN) models. Using one chromosome after another as a code for new representation, a new ANN model was trained and tested for each of them. The correlation coef®cient r between the experimental biological activity and the value predicted by the ANN model for the test set of 14 compounds (not used in the training) was estimated. The obtained correlation coef®cient r is used as the ®nal ®tness criterion in the selection and reproduction ability of the genetic procedure for generation of the new population. Due to the fact that the spectrum-like structure representation is reversible, each representation's variable can be backtraced to the structural feature. The consequence is that 15 variables selected by the GA optimisation can pinpoint the most relevant spatial directions (with the respect to the skeleton) most responsible for the biological activities of the entire series of the compounds. # 1999 Elsevier Science B.V. All rights reserved. Keywords: Optimisation; QSAR studies; Genetic algorithm; Spectrum-like representation
Dedication This paper is written as a tribute to Professor Jean Thomas Clerc, a scientist, a mentor, a teacher, and a friend to all of us. Professor Clerc visited Slovenia for the ®rst time a quarter of a century ago at the occasion of the Second International Conference on Computers in Chemical Research and Education (ICCCRE) in 1973. Ever since we have been fortunate enough to be involved in most of his frequent visits of our country at various occasions or meeting him at many conferences throughout the world. By participation in discussions, *Corresponding author. Tel. +386-61-1760-279; fax: +386-611259-244; e-mail:
[email protected]
through exchange of ideas, by learning from his many lectures, as well as for working together on various mutual projects we are proud to claim that we have acquired a part of his enormous knowledge. Even for the origin of the idea explained in this paper can, to some extent, be claimed that it evolves over a period of many years and comes from several discussions with Professor Clerc. 1. Introduction From Professor Clerc's earlier papers [1±3] on and through all of his subsequent scienti®c work, a special
0003-2670/99/$ ± see front matter # 1999 Elsevier Science B.V. All rights reserved. PII: S0003-2670(99)00079-3
244
J. Zupan, M. NovicÏ / Analytica Chimica Acta 388 (1999) 243±250
attention to the representation of chemical structures and consequently to the structure-spectra/properties relationship can be traced. Very early he addressed the problem of the structure similarity [3] asserting that the measure for the similarity between structure representations depends on the nature of the property for which the particular structure description is used. With other words, he pointed out that the similarity between two structures is not an absolute quantity, but a problem oriented one. In this work we would like to describe an attempt for optimisation of structural representation intended for the use in modelling in quantitative structure activity relationship (QSAR). A good structure representation does not only ful®l four main conditions: (a) uniqueness, (b) uniformity, (c) reversibility, and (d) translational and rotational invariance, but at the same time should be as concise and informative as possible. We are trying to show that optimisation by the genetic algorithm (GA) [4,5] is an effective method for the selection of a small number of important variables out of a set of 120 variables as used in the spectrum-like representation [6] of chemical structures. The criterion for the evaluation of ``chromosomes'' (binary represented selection of variables) is the prediction ability of the arti®cial neural network (ANN) model generated with the structures represented by the selected variables. The second goal of the work is to show that 15 variables (out of the 120 original ones) selected by the GA algorithm are informative enough to yield valuable information about the structural features responsible for the effect under study, i.e., to show that the originally selected sparse spectrum-like 120-dimensional representation [6] contains most of the information needed for such studies. 2. Data-set and 120-dimensional structure representation The data-set [7±11] consists of 28 structurally diverse ¯avonoid derivatives that inhibit the enzyme p56lck protein tyrosine kinase. Enzyme inhibitory activity for a series of ¯avonoids is expressed as log (1/C) [7±9]. The substituents are placed at positions 5, 6, 7, and 8 on the double ring moiety (see Fig. 1) and at 30 and 40 on the phenyl ring.
Fig. 1. Chemical structure of flavonoid derivatives. Substituents are placed on the skeleton (AC) at sites labelled from 5 to 8, and on the phenyl ring (B) at 30 and 40 [7±9].
The data-set has been already studied by researches in different laboratories [7±11]. Linear regression models were tested in the initial studies [7±10], while later on a successful application of arti®cial neural network models was reported [11]. In all mentioned studies, the compounds were represented on more or less the same way, by physicochemical parameters: Hammett's , Hansch's hydrofobicity parameter, molar refractivity MR, and by quantum-chemical parameters calculated for speci®c structural features. In our study the chemical structures are represented by a general spectrum-like representation taking into account three-dimensional co-ordinates and partial charges of all the atoms in the molecule, regardless of presumably important substituents' sites [6]. The descriptors for spectrum-like representation of a molecular structure de®ned by three-dimensional co-ordinates of its atoms, are obtained by the projection of all atomic centres (each de®ned by x, y, z coordinates) onto the sphere of an arbitrary radius. The oriented structure is placed into an arbitrary large sphere. The projection beam from the central point of sphere causes a pattern of points on the sphere, where each point represents particular atom. Each point on the sphere is taken as the centre of a bellshaped function with intensity related to the distance between the co-ordinate origin and a particular atom. As the bell-shaped function of atom i we have taken Lorentzian curve with the form i i si
'j ; #l ;
'j ÿ 'i 2 2i
#l ÿ #i 2 2i 2 2j ;; ; ; 2; j 1; k; 'j k k l ;; ; ; ; l 1; k=2; (1) #l k=2 k=2 here si('j,#l) is spectrum intensity related to atom i, while the other parameters are: i ± distance between
J. Zupan, M. NovicÏ / Analytica Chimica Acta 388 (1999) 243±250
the centre of the sphere and atom i, 'i,#i ± polar and azimuthal angle of atom i, i ± atomic charge (extended by 1) on atom i, k ± resolution of the representation (steps for indices j and l). To obtain the total intensity related to the entire molecule, the contributions of individual atoms are summed up S
'j ; #l
natom X
si
'j ; #l :
245
three-dimensional structural co-ordinates and net atomic charges (for minimal energy state) were calculated by MOPAC software package. The descriptors employed in the study contained 120 intensity values in spectrum-like representation calculated (Eqs. (1) and (2)) in the x±y projection plane. On Fig. 2 an example of spectrum-like structure representation of 6,40 -NH2 substituted ¯avonoid structure is shown.
(2)
i1
In practice the projections on three perpendicular equatorial trajectories rather than the projection on the entire sphere are considered. In the case that largest part of the skeletons of all molecules in the study are planar, only the projection on one trajectory (x±y plane) is taken into account. Additional trajectory in the x±z plane is used in order to distinguish between some atoms having almost equal pairs of x and y co-ordinates. Incorporation of Mulliken charges (1i for ith atom) in Eq. (1) makes the recovering of atom positions from the code (reversibility) more computer intensive. For the calculation of the spectrum-like representation of 28 structures from our data-set, optimised
3. Genetic algorithm (GA) with the artificial neural network (ANN) as a fitness function Due to several speci®c requirements which will be discussed next, the genetic algorithm program in this study was custom-build in the authors' laboratory. The goal of the optimisation was to select the most in¯uential structural variables out of the 120 possible ones. As described in the previous paragraph, the uniform spectrum-like representation of structures is easily reversible, which means that each of the 120 variables can be easily associated with the structural feature of the represented compound. Hence, the idea of the optimisation was to pick-out a small number of such
Fig. 2. Initial 120-dimensional spectrum-like representation of 6,40 -NH2 substituted flavonoid from which the reduced representation is sought. Each of the 120 variables corresponds to an interval of 38.
246
J. Zupan, M. NovicÏ / Analytica Chimica Acta 388 (1999) 243±250
important variables, and consequently to associate them with the structural features responsible for the property in question. The above consideration of the problem leads us to the following design of the ``chromosomes''. Each chromosome is a blue-print for one speci®c structural representation derived from the original 120-dimensional one. Hence, all chromosomes have 120 ``genes'', i.e., they are sets of ones and zeros representing presence or absence for each of the 120 structural variables of the original representation. If the chromosome contains six ``ones'' (for example: at the positions 41, 56, 70, 93, 115, and 119) and 114 ``zeros'' at all other positions, a model for the prediction is made on the basis of 14 training compounds structures which are represented by the selected six variables. Using the proposed 6-variable structural representation an ANN model is build and tested with other 14 testing compounds. The correlation coef®cient r between the actual and the predicted activity values is taken as the ®tness measure of the particular chromosome. This means that for the evaluation of the ®tness of each chromosome a three step procedure has to be carried out: According to the gene (zero/one) code of the particular chromosome the ``short'' structure representation from the 120-dimensional original one is made for all 28 compounds in the study. Complete training procedure for generation of an ANN model based on the ``short'' representation using 14 compounds as the training set is carried out. The correlation coefficient r is determined by testing the prediction ability of the generated ANN model with the remaining 14 compounds. Correlation coefficient r is used as the fitness criterion of the particular chromosome. Each pool contains 50 chromosomes which means that in one generation 50 ANN models, each based on a different structural representation, have to be generated and tested. As mentioned before, the ®tness criterion for each chromosome specifying the ``short'' structure representation is the correlation coef®cient r between the actual and predicted biological activity. Once the entire pool of the chromosomes is evaluated, 20 best chromosomes are retained for the reproduction and all 40 remaining ones are dropped
from the study. The 20 best chromosomes are then subject to random cross-over procedure. Randomness is applied for the determination of the mating partner to each of the 20 best chromosomes as well as for the selection of the cross-over point (gene) of the selected pair. If the random procedure for the selection of the mating partner selects the particular chromosome itself, then its mate is by default set to be the very best chromosome. In this manner the highest mating rate of the best chromosome is assured. Due to comparably small number of genes in chromosomes equal to ``one'' the probability of obtaining clones (i.e., identical offspring) after crossing is relatively high. Therefore, whenever a clone is identi®ed, both parents in the cross-over procedure are subject to the random mutation of one gene. At the end of the cross-over and mutation procedures 40 different chromosomes of the new generation are obtained. As a standard measure the best chromosome from the previous generation is added to the pool of the new generation. In order to obtain complete new pool of 50 chromosomes, the still missing nine are made from empty (120 zeros) chromosomes by random setting genes to ``one''. For the generation of nine new chromosomes, there is only one restriction, namely, the number of genes turned to ``one'' is required to be equal to the number of ``ones'' in the very best chromosome. If chromosome with the best ®tness value has six ``ones'' all nine new generated chromosomes are required to have six ``ones'' as well. In order to clarify the evaluation of the ®tness function slightly more in detail the ANN procedure called counter-propagation learning [12±14] is brie¯y discussed. Counter-propagation (CP) learning is a supervised method suitable for generation of arti®cial neural network (ANN) models. In order to train the model, each supervised method requires the data to be organised in pairs {Xs,Ts}: with each object Xs the corresponding target Ts must be associated. In our case the input objects are ``short'' structure representations of objects as de®ned by the chromosome, while the targets are corresponding biological activities. Inputs Xs are many-dimensional objects (our search space for the ``short'' representation is in the range from 2 to 50 variables out of the 120 possible ones) while the targets Ts are only one-dimensional (one value). The CP-ANN is a two-layer network consisting of a Kohonen [15,16] and an output layer, both layers
J. Zupan, M. NovicÏ / Analytica Chimica Acta 388 (1999) 243±250
247
The position of the excited neuron is copied to the output layer below, where the procedure of correcting weights is repeated with the target value ts: 1 ÿ d=
p 1
ts ÿ woutput ; woutput j j d 0; 1; 2; . . . ; p:
Fig. 3. Counter-propagation ANN is a two-layer network. In the present work the input layer of neurons is different for each model, i.e., for each tested chromosome. It contains as many number of weight levels as there are bits turned to one in the corresponding chromosome. The output layer has always only one level of weights for the storage of the biological activity values normalised between zero and one.
consist of exactly the same number of neurons placed in one-to-one correspondence one above the other. As a rule the layout of neurons in both layers is quadratic (Fig. 3). On the contrast to other modelling techniques like error-backpropagation ANNs [17] or multiple linear regression (MLR) [18], the CP-ANNs generates a look-up table instead of the modelling function. This means that instead of calculating the answer the CPANN ``model'' calculates the address (or position in the table) where the answer is stored. The cells of the look-up table are output layer neurons. The speci®c address or position of the neuron, de®ned during the training or during the prediction is that of the ``excited'' or the ``central'' neuron (abbreviated by ``c''), which in turn is determined according to the minimal distance between the weight vector Wj(wj1,wj2,. . .,wjm) representing the jth neuron and the sth object represented by Xs(xs1,xs2,. . .,xsm): ( ) m X 2
xsi ÿ wji ; neuron c min i1
j 1; 2; . . . ; c; . . . ;
N N:
(3)
Once the object Xs is input to the Kohonen layer and the weights in the excited and neighbouring neurons are corrected 1 ÿ d=
p 1
xsi ÿ wKohonen ; wKohonen ji ji d 0; 1; 2; . . . ; p:
(4)
(5)
and woutput in The change of weights wKohonen j ji the Kohonen and in the output layer, respectively, are corrected virtually in the same manner. Because each output neuron has only one weight (for accommodating one target value) there is no need for index i in Eq. (5) explaining the corrections of weights in the output layer. In Eqs. (4) and (5) is the learning rate constant, d is the topological distance between the neuron j on which correction is applied and the central one, and p is the most distant neuron to which correction of weights is applied. Parameter p decreases during the training which lasts for epochtot according to the following equation: p
epochtot ÿ epochN=
epochtot ÿ 1:
(6)
The advantage of the CP-ANN generated look-up table is its effectiveness and the fact that all cells in the look-up table (neurons in the output layer) contain answers (responses) even if they are never ``excited'' by the training data. The effectiveness of the CP-ANN modelling means that the method requires one or two orders of magnitude lower amount of training epochs (70 epochs in all our calculations) compared to the number of training epochs of the error-backpropagation ANN learning. One epoch means a training cycle in which all data from the training set are send through the network once. It can be seen from Eq. (5) that answers (®nal in the output layer) are not values of weights woutput j just put into the look-up table (neural network) according the values of the training data-set, but are simultaneously adapted in many cells (neurons) with different amounts of correction. The topological clohave larger in¯uence on the actual sest answers woutput j than the distant ones. Eqs. (4)±(6) response woutput c show that the size of corrections depends on the stage of the learning procedure, epoch, and on the actual topological distance d between the excited and the corrected neuron. The further the neuron j on which corrections of weights have to be made is from the excited neuron c, the smaller the corrections.
248
J. Zupan, M. NovicÏ / Analytica Chimica Acta 388 (1999) 243±250
A drawback of the CP-ANN model lies in the limitation of the number of neurons, i.e., the number of look-up cells. Because by the de®nition all answers or responses provided by the ®nal look-up table are within the range of the training data, there is no room for extrapolation. The consequence of this fact is that the training set should be selected very carefully and should cover the entire measurement space as evenly as possible. Anyway, even for other models the outputs inappropriately normalised between 0 and 1 often suffer from the same drawback. For all CP-ANN generations and evaluations the same ANN parameters have been employed strictly: the same (77) CP-ANN layout of neurons, the same random initialisation of CP-ANN weights, the same triangular correction function for different neighbourhood rings, the learning constant was monotonously decreasing from 1.0 to 0.1 during the entire training process, always keeping the training exactly 70 epochs long, always using the same set of 14 compounds for training, and always using the same set of 14 other compounds for testing. The reason for this uniformity of conditions is to maintain the comparability between the correlation values r. After the preliminary evaluation of several generations of chromosomes, the ®nal run extending for 730 generations was made. During this procedure 36 500 ANNs were generated and evaluated. Each evaluation was based on the test of the same set of 14 compounds which never participated in any ANN generation. At the beginning, after only ®ve generations, the correlation coef®cient r jumped fast to 0.910. At the 39th generation it reaches 0.971. The selected variables were only the 41st, 56th, 70th, 93rd, 115th, and 119th; which recalculated to the angular degrees mean that the selected important directions are 1238, 1688, 2108, 2798, 3458, and 3578 with the respect to the skeleton. Further on, at the 55th generation new best representation comes up with 11 variables (intensities of spectrum-like representation at 338, 1058, 1238, 1508, 1688, 1988, 2228, 2708, 2798, 3458, and 3578). Finally, the 141st generation yields the representation containing 15 variables (intensities
of spectrum-like representation at 698, 1058, 1238, 1328, 1538, 1628, 1688, 1778, 1988, 2258, 2708, 2798, 3248, 3458, and 3578). There was no improvement of the correlation coef®cient r after 141th generation although we continued with the GA optimisation up to 730 generations. 4. Results and discussion The optimised best structural representation, i.e., the chromosome with the best ®tness value of r0.982 has 15 genes or bits turned to 1, what means that with 15 selected structural variables a large part of information for prediction of biological activity of the compounds under investigation is encompassed. Fig. 4 shows the selected 15 directions on the background of all 120 variables of the complete spectrumlike (x±y)-representation of the 6-OH,5,7,40 -NO2 substituted ¯avonoid. It is interesting to note that the ®nal 15-dimensional representation contains ®ve of the six reported by the representation found already at the ®fth generation. Hence, the ®nal representation can be regarded only as an extension (or improvement) of the ®rst one. The directions of the representation obtained at the ®fth generation are marked with an asterisk on Fig. 4. At the end the predictions of the biological activity of 14 test compounds using the selected 15-variable representation and CP-ANN model as obtained from the GA is shown in Fig. 5. It must be said that CP-ANN is not the best modelling technique what the ®nal prediction ability is concerned. Error backpropagation ANNs or multivariate linear regression with non-linear factors can be employed with probably better success. However, counter-propagation ANN is one of the fastest mulivariate non-linear modelling techniques and for that reason only it was used in the GA optimisation. It has enabled to generate more than 30 000 models in 12 h on a Pentium1 PC. It has to be kept in mind that the main goal of this study was to select the optimised structural representation, from which various models and various modelling techniques can be employed. The selected structural representation has only 15 variables each of which exactly pinpoints directions in the (x±y) plane of the skeleton relevant for biological activity of the compounds in question. The
J. Zupan, M. NovicÏ / Analytica Chimica Acta 388 (1999) 243±250
249
Fig. 4. Structure representation of one of the compounds and indication of the 15 most important variables. It is interesting to note that five of them (indicated by *) were selected already in the fifth generation of chromosomes.
speculation can be extended even further: if the determined directions are relevant for the biological activity of the 28 compounds in the study, the synthesis of derivatives having substituents in this directions can
yield compounds with very different biological activities. For example, knowing that compounds which have substituents in the direction 1238 (towards the position 40 ) with a negative charge of the substituted atom (e.g., nitrogen in ±NH2) have higher activities than the compounds having on the same position substituents with a positive charge (e.g., nitrogen in ±NO2), the implication for the synthesis of new active compounds is straightforward. Of course, such conclusions about the biological activities are suitable only for qualitative or at most for the semi-quantitative predictions. Nevertheless, any study on the reversible structure representation that can yield hints about the variation of substituents, fragments, substitution positions on skeletons, or spatial conformations, upon which the variations in the biological activities can be deduced are useful and worthwhile to pursue.
Fig. 5. Regression (b0ÿ0.2, s(b0)0.2, b11.1, s(b1)0.1) line for the predictions of biological activity of 14 test compounds not used for the generation of CP-ANN models. There are six compounds in the test set with the lowest experimental activity of 2.7. For them the model predicts three different values: 2.7, 2.8, and 3.0.
Acknowledgements The work was supported by the Ministry of Science and Technology through the Grants J1-291 and J18900.
250
J. Zupan, M. NovicÏ / Analytica Chimica Acta 388 (1999) 243±250
References [1] J.T. Clerc, E. Pretsch, Kernresonanz-spektroskopie, Akademische Vlgsl., Frankfurt, 1970. [2] F. Erni, J.T. Clerc, Helv. Chim. Acta 55 (1972) 489. [3] J.T. Clerc, Computer-generated structure proposals from spectroscopic data, in: D. HadzÏi, J. Zupan (Eds.), Computers in Chemical Research and Education, vol 2, Chapter 3, Elsevier, Amsterdam, 1973, pp. 109±127. [4] J.H. Holland, Adaptation in Natural and Artificial Systems, University of Michigan Press, Ann Arbor, MI, 1975. [5] D.B. Hibbert, Genetic algorithms in chemistry, Tutorial, Chemom. Intell. Lab. Syst. 19 (1993) 277±293. [6] J. Zupan, M. NovicÏ, Anal. Chim. Acta 348 (1997) 409±418. [7] M. Cushman, D. Nagarathnam, L.R. Geahlen, J. Nat. Products 54 (1991) 1345±1352. [8] M. Cushman, D. Nagarathnam, L.D. Burg, L.R. Geahlen, J. Med. Chem. 34 (1991) 798±806. [9] M. Cushman, H. Zhu, L.R. Geahlen, J.A. Kraker, J. Med. Chem. 37 (1994) 3353±3362.
[10] M. NovicÏ, Z. Nikolovska-Coleska, T. Solmajer, J. Chem. Inf. Comput. Sci. 37 (1997) 990±998. [11] Z. Nikolovska-Coleska, L. Suturkova, K. Dorevski, A. Krbavcic, T. Solmajer, Quant. Struct.-Act. Relat. 17 (1998) 7±13. [12] R. Hecht-Nielsen, Appl. Opt. 26 (1987) 4979±4984. [13] J. Dayhof, Neural Network Architectures, An Introduction, Van Nostrand Reinhold, New York, 1990, p. 192. [14] J. Zupan, M. NovicÏ, I. Ruisanchez, Chem. Intell. Lab. Syst. 38 (1997) 1±23. [15] T. Kohonen, Self-Organization and Associative Memory, Springer, Berlin, 1989. [16] T. Kohonen, Neural Networks 1 (1988) 3±16. [17] D.E. Rumelhart, G.E. Hinton, R.J. Williams, Learning internal representations by error propagation, in: D.E. Rumalhart, J.L. McClelland (Eds.), Microstructures of Cognition, vol. 1, MIT Press, Cambridge, 1986, pp. 318±362. [18] D.L. Massart, B.G.M. Vandengiste, S.N. Deming, Y. Michotte, L. Kaufman, Chemometrics: A Textbook, Elsevier, Amsterdam, 1988.