Pareto optimal flexible alignment of molecules using a non-dominated sorting genetic algorithm

Pareto optimal flexible alignment of molecules using a non-dominated sorting genetic algorithm

Chemometrics and Intelligent Laboratory Systems 77 (2005) 232 – 237 www.elsevier.com/locate/chemolab Pareto optimal flexible alignment of molecules u...

287KB Sizes 0 Downloads 10 Views

Chemometrics and Intelligent Laboratory Systems 77 (2005) 232 – 237 www.elsevier.com/locate/chemolab

Pareto optimal flexible alignment of molecules using a non-dominated sorting genetic algorithm F. Daeyaerta,T, M. de Jongea, J. Heeresb, L. Koymansa, P. Lewib, W. van den Broeckc, M. Vinkersa b

a MolMo Services BVBA, Campus Blairon 424, 2300 Turnhout, Belgium Center for Molecular Design, Janssen Pharmaceutica N.V. Antwerpsesteenweg 37, 2350 Vosselaar, Belgium c Tibotec BVBA, Generaal De Wittelaan L 11B 3, 2800 Mechelen, Belgium

Received 9 June 2004; received in revised form 11 August 2004; accepted 29 September 2004 Available online 5 March 2005

Abstract The use of multi-objective function optimization to superimpose a flexible source molecule onto a rigid target molecule is explored. The objective functions are the SEAL similarity score between the source and target molecules, which has to be maximized, and the conformational strain of the source molecule, which has to be minimized. The optimization algorithm used is an elitist non-dominated sorting genetic algorithm. The algorithm is tested with the superposition of two non-nucleoside HIV-reverse transcriptase inhibitors and the superposition of methotrexate onto dihydrofolate. D 2005 Elsevier B.V. All rights reserved. Keywords: Optimization algorithm; Methotrexate; Dihydrofolate

1. Introduction Proper alignment of molecules is a prerequisite for structure based drug design methodologies such as Comparative Molecular Field Analysis (CoMFA) and Comparative Molecular Similarity Indices Analysis (CoMSIA) [1,2]. For rigid molecules, this involves the search for maximal similarity in the six-dimensional coordinate space consisting of three rotational and three translational degrees of freedom. For the alignment of flexible molecules, each rotatable bond adds a degree of freedom. An issue that arises is the conformational strain that is induced by rotating the torsion angles. While the similarity score has to be maximized, the induced conformational strain has to be kept minimal. This multicriteria optimization problem can be addressed by finding the set of Pareto optimal solutions, also called the set of non-dominated solutions. This is the set of those T Corresponding author. 0169-7439/$ - see front matter D 2005 Elsevier B.V. All rights reserved. doi:10.1016/j.chemolab.2004.09.016

solutions for which an improvement in one objective function cannot be obtained without deterioration of the second objective function. As the search space of this problem is a high-dimensional space with many local optima, we have investigated the use of a multi-objective genetic algorithm. A review of applications of multiobjective optimization in chemical engineering is given in [3]. Handschuh et al. [4] have described the application of a genetic algorithm to find the Pareto optimal superposition of molecules using a maximal common substructure search. The objective functions in this method are the size of the common substructure and the root-mean-square distance between the atoms in this substructure of the molecules to be superimposed. In the present paper, we explore the use of a multi-objective genetic algorithm (GA) in which one objective function is a similarity score that does not require atom-to-atom mapping (see Method section). A second objective function is the conformational strain of the flexible source molecule.

F. Daeyaert et al. / Chemometrics and Intelligent Laboratory Systems 77 (2005) 232–237

2. Method The problem we want to solve is to determine the set of Pareto optimal superpositions of a flexible dsourceT molecule onto a dtargetT molecule of which the conformation is known. The number of dimensions of the search space is therefore 3+3+N tor, where N tor is the number of rotatable bonds of the source molecule. The objective functions are the similarity score, which has to be maximized, and the conformational strain of the source molecule, which has to be minimized. As similarity score, we use the Steric and Electrostatic Alignment (SEAL) score introduced by Kearsley and Smith [5]. The SEAL score S for the overlap of two molecules is given by: S¼

Ns Nt X X   2 wsteric vi vj þ wcharge qi qj  earij i

j

where N t and N s are the number of atoms of the target and source molecules, v i and v j are the third power of the atomic radii of atoms i and j, q i and q j are the atomic charges, r ij is the interatomic distance between atoms i and j, a is an attenuation factor, and w steric and w charge are the weights of the steric and charge components of the SEAL score. The atomic charges are the partial charges of the MMFF94 force field [6]. The score is normalized by dividing S by the average of the self alignments of the target and source molecules [5].

233

The conformational strain of the molecular mechanics is given by molecular mechanics torsional and non-bond energy terms: E ¼ Etors þ E nonQbond

E tors ¼

N tors X

3 X

Vn  ð1  cosnUk Þ

k¼1 n¼1

E nonQbond ¼

Ns Nt X X Aij Bij qi qj  4 þ : 8 erij rij j rij i

The torsional energy term uses the parameters of the MMFF94 force field. The van der Waals part of the nonbond energy term is an 8–4 Lennard–Jones type potential of which the parameters were obtained by fitting to the corresponding MMFF94 term. The use of this softer potential compensates for the fact that bond lengths and bond angles of the source molecule are kept fixed and therefore cannot adapt to lower the non-bond interaction. Moreover, in combination with the use of a distancedependent dielectric (e=r ij ) for calculating the Coulomb contribution to the non-bond energy, the calculation of the square root to obtain the interatomic distances is avoided, enhancing computational performance.

Fig. 1. Schematic overview of the non-dominated sorting genetic algorithm.

234

F. Daeyaert et al. / Chemometrics and Intelligent Laboratory Systems 77 (2005) 232–237

To obtain the Pareto optimal front of solutions, we have implemented the non-dominated sorting genetic algorithm for multi-objective optimization of Deb et al. [7]. Its main features are a fast non-dominated sorting algorithm used to rank the solutions, a dcrowded comparisonT operator that ensures the diversity of the population in Y-space (i.e. in the space of the objective functions) and the use of elitism to ensure survival of the fittest solutions. The chromosomes of our GA consists of real numbers (dallelesT in GA terminology) and are organized as follows: the first three alleles represent the translation in the x, y and z directions of the source molecule, alleles 4 through 6 encode the Euler angles determining the orientation of the source molecule, and the rest of the chromosome consists of the values of the torsion angle of each of the rotatable bonds in the source molecule. Prior to starting the GA, the coordinates of the target molecule are centered. To evaluate the two objective function values of a chromosome, the torsions of the source molecules are set according to the values of the corresponding alleles of the chromosome and the molecule is centered and rotated and translated according

to the first six alleles of the chromosome. The molecular mechanics strain of the source molecule is calculated, and the SEAL score between the source and target molecule is maximized with respect to the position and orientation of the source using a local minimizer (see below). The outline of the GA is shown in Fig. 1. Initially, a random population with N p individuals is created and sorted using non-domination, with the first Pareto optimal front given the highest fitness. The fast non-dominated sorting algorithm is detailed in [7]. Next, a child population of N p individuals is created using binary tournament selection of parents from the initial population, uniform crossover and random mutation. In the tournament selection, the first criterion is dominance: the candidate that dominates the other candidate wins the tournament. If neither of the candidates dominates the other, i.e. if they belong to the same Pareto front, selection occurs on the basis of a dnon-crowding distanceT. This is a distance measure, which is larger for solutions that are more isolated in Y-space. Selection based upon this measure ensures that the Pareto optimal front is evenly covered. The non-crowding distance is detailed in [7]. The parent

Fig. 2. Structures of DATA (upper left), ITU (upper right), dihydrofolate (bottom left) and methotrexate (bottom right).

F. Daeyaert et al. / Chemometrics and Intelligent Laboratory Systems 77 (2005) 232–237

and child populations, each consisting of N p chromosomes and their fitnesses, are then combined and sorted. The sorting is carried out in two phases. First, solutions are sorted according to the non-dominance criterion. This is carried out by determining the Pareto fronts using a fast non-dominated sorting algorithm. Within each Pareto front, the solutions are sorted according to the noncrowding distance criterion also used in the tournament selection procedure. The new generation is selected by taking the first N p individuals from the sorted combined parent and child population. The population size N p was set to 100 and the mutation rate for each allele was set to 1/m where m is the number of alleles in the chromosomes. The GA was carried out until the total number of function evaluations was 25,000. The SEAL parameters w steric and w charge were set to 1 and 1500, respectively. The attenuation factor alpha was 0.3. To improve the performance of the original nondominated sorting GA, we have introduced two modifications. The first is motivated by the observation that the translational and rotational degrees of freedom, coded by the first six alleles of the chromosomes, only affect the similarity score and not the conformational strain of the source molecule. Also, it has been our experience that a genetic algorithm does not perform well on optimizing translational and rotational degrees of freedom in molecular docking algorithms (unpublished results). We have therefore implemented a local optimizer to optimize these degrees of freedom. This optimizer presently consists of the Simplex method [8]. It is applied to each

235

chromosome and the optimized values of the translational and rotational degrees of freedom are inherited by the child chromosomes, according to the so-called Lamarckian model of evolution [9]. The incorporation of the translational and rotational degrees of freedom in the chromosomes of the GA remains essential, as application of only the Simplex optimizer will lead to a local optimum, depending on the starting position and orientation of the source molecule. Another modification is the introduction of a mechanism to enhance the diversity of the population in the dXspaceT, this is the space of the chromosomes, in addition to the crowded-distance sorting described above, which works in the dY-spaceT, i.e. the space of the objective functions. After sorting the combined child and parent population, we calculate for each chromosome the number of neighbors in the X-space. For the three types of degrees of freedom, a distance measure is calculated. For the translation, this is the Euclidian distance between two poses. For the rotational degrees of freedom, this is the root-mean-square deviation between the elements of the two rotation matrices. For the torsional degrees of freedom, this is the difference between the torsion angles. Two chromosomes are considered neighbors if the distance between the centers is less than 0.1 2, the root-meansquare deviation between the rotation matrix is less than 30, and no torsion angle differs by more than 308. Neighbor chromosomes are not copied into the next generation. We have applied our algorithm to the superposition of two non-nucleoside HIV reverse transcriptase inhibitors

Fig. 3. Scatterplot of the results of five runs of the non-dominated sorting GA applied to the superposition of ITU on DATA. The solid line is a quadratic fit through the Pareto front.

236

F. Daeyaert et al. / Chemometrics and Intelligent Laboratory Systems 77 (2005) 232–237

Fig. 4. Superposition of ITU (black) onto DATA (grey). The configuration shown corresponds to the solution with the highest SEAL score.

(NNRTIs), ITU and DATA, and the superposition of methotrexate onto dihydrofolate (Fig. 2).

3. Results and discussion We have first tested our algorithm with the superposition of two relatively rigid molecules, the NNRTIs ITU and DATA (Fig. 2). DATA was used as the target molecule and ITU as the source molecule. ITU has six torsional degrees of freedom; thus, the total dimensionality of the alignment search space is 12. The initial structures of both molecules were extracted from the proprietary X-ray crystal structures of their complexes with HIV reverse trancriptase. The DATA structure was minimized using the MMFF94 force field to remove any strain. ITU was also minimized but with

the non-bond interaction energy turned off in order not to bias the results towards the initial conformation. Fig. 3 shows the scatterplot of the values of both objective functions of the final Pareto fronts obtained in five separate runs of the GA. The solid line is a fit through the Pareto front of the combined results. While there is some variation between the results of different runs and the individual Pareto fronts sometimes cross, the conformations of the source molecules with the highest SEAL score become identical after molecular mechanics minimization. The solution with the highest SEAL score is shown in Fig. 4. The orientation and conformation of the source molecule is in agreement with its conformation bound to the reverse trancriptase enzyme. ITU is a rather rigid molecule with only six rotatable bonds of which two are involved in a conjugated system. An in-house developed genetic algorithm for conformational search generated 16 different conformations within 5 kcal/ mol from the lowest energy conformer of ITU. An approach as described in [10] where these conformations are one-byone superimposed onto the target molecule can be more efficient for this type of molecule. For a more flexible molecule such as methotrexate with 10 rotatable bonds, this is no longer true. Application of the same conformational search algorithm generated 2000 different conformations within 3 kcal/mol from the lowest energy conformation (2000 was the population size of the GA). This conformational search took 20 min on a 64-cpu SGI Origin 2000. Fig. 5 shows the scatter plot of the two objective functions obtained with a single run of the non-dominated GA on the methotrexate-dihydrofolate system, with methotrexate being the target. Previous studies have found two

Fig. 5. Pareto optimal front for the superposition of methotrexate onto dihydrofolate. The indicated regions correspond to the two different superposition modes.

F. Daeyaert et al. / Chemometrics and Intelligent Laboratory Systems 77 (2005) 232–237

237

The largest part of the computing time goes to the local minimization in translation–rotation space, and it can be expected that the use of a more efficient, gradient-based minimizer will appreciably improve this.

4. Conclusion

Fig. 6. Superposition of methotrexate (black) onto dihydrofolate (grey). Left the dX-ray-modeT alignment, right the dhetero-modeT alignment. The alignments with the highest SEAL score are shown.

modes of superposition of these two molecules [5]. One is known as the dhetero-modeT, in which the atoms of the bicycles of the molecules are superimposed. This configuration is however in contrast with the experimental data consisting of the structure of both molecules complexed with dihydrofolate reductase, which is referred to as the dXray modeT. Analysis of the results of several runs of our non-dominated sorting GA indicate that both alignment modes are found by the algorithm. Upon closer inspection, it is found that points in the upper part of the front correspond to alignments in the dhetero-modeT and points in the lower part correspond to alignments in the dX-ray modeT (Fig. 5). The alignments with the highest SEAL score for the two modes are shown in Fig. 6. While the hetero-mode alignment is satisfactory, the alignment of the glutamate moiety in the X-ray mode is suboptimal. We assume that this is because, at higher strain energies, the hetero-mode alignments dominate the X-ray-mode alignments. Due to the non-dominated sorting selection, X-ray-mode alignments with higher SEAL scores but higher strain do not appear in the final population. As possible approaches to solve this problem, we suggest the use of a local optimization algorithm in the full translational–rotational–torsional space to the GA solutions, the use of a third objective function reflecting the diversity of the solutions generated, or the use of nicheing techniques. The computing time of the algorithm on a single R14000 cpu of a SGI Origin 2000 was ~15 min for the ITU-DATA system and ~30 min for the methotrexate-dihydrofolate system when 25,000 function evaluations were carried out.

We have explored the use of a non-dominated sorting genetic algorithm for the superposition of flexible molecules, using as multiple objective functions the similarity score between a flexible source and a fixed target molecule, and the conformational strain of the source molecule. The algorithm reproducibly finds good alignments in a reasonable time. Generation of the Pareto optimal front allows one to pick a solution with a conformational strain that is still acceptable, knowing that the corresponding similarity score is optimal for that particular conformation. Another possible approach would be to apply full molecular mechanics minimization to the solutions in the Pareto front, and exhaustively superimpose this limited set to the target molecule. For the methotrexate-dihydrofolate system, both alignment modes described by other authors are found. Because of the non-dominated selection operator, however, one of the alignment modes is not optimal. We suggest that approaches to alleviate this problem be examined in further research.

References [1] R.D. Cramer, D.E. Patterson, J.D. Bunce, J. Am. Chem. Soc. 110 (1988) 5959 – 5967. [2] G. Klebe, U. Abraham, T. Mietzner, J. Med. Chem. 37 (1994) 4130 – 4146. [3] V. Bhaskar, S.K. Gupta, A.K. Ray, Rev. Chem. Eng. 16 (2000) 1 – 54. [4] S. Handschuh, M. Wagener, J. Gasteiger, J. Chem. Inf. Comput. Sci. 38 (1998) 220 – 232. [5] S.K. Kearsley, G.M. Smith, Tetrahedron Comput. Methodol. 3 (1990) 615 – 633. [6] T.A. Halgren, J. Comput. Chem. 17 (1996) 490 – 519. [7] K. Deb, A. Pratap, S. Agrawal, T. Meyarivan, IEEE Trans. Evol. Comput. 6 (2002) 182 – 197. [8] J.A. Nelder, R. Mead, Comput. J. 7 (1965) 308 – 313. [9] G.M. Morris, D.S. Goodsell, R.S. Halliday, R. Huey, W.E. Hart, R.K. Belew, A.J. Olson, J. Comput. Chem. 19 (1998) 1639 – 1662. [10] Y. Kato, A. Inoue, M. Yamada, N. Tomioka, A. Itai, J. Comput.-Aided Mol. Des. 6 (1992) 475 – 486.