Machine learning optimization of cross docking accuracy

Machine learning optimization of cross docking accuracy

Computational Biology and Chemistry 62 (2016) 133–144 Contents lists available at ScienceDirect Computational Biology and Chemistry journal homepage...

3MB Sizes 116 Downloads 251 Views

Computational Biology and Chemistry 62 (2016) 133–144

Contents lists available at ScienceDirect

Computational Biology and Chemistry journal homepage: www.elsevier.com/locate/compbiolchem

Research article

Machine learning optimization of cross docking accuracy Esben J. Bjerrum Wildcard Pharmaceutical Consulting, Frødings Alle 41, 2860 Søborg, Denmark

A R T I C L E I N F O

Article history: Received 11 January 2016 Received in revised form 8 April 2016 Accepted 9 April 2016 Available online 4 May 2016 Keywords: Molecular docking Docking power Scoring function Machine learning optimization Smina Autodock Vina Drug discovery Cross docking

A B S T R A C T

Performance of small molecule automated docking programs has conceptually been divided into docking -, scoring -, ranking - and screening power, which focuses on the crystal pose prediction, affinity prediction, ligand ranking and database screening capabilities of the docking program, respectively. Benchmarks show that different docking programs can excel in individual benchmarks which suggests that the scoring function employed by the programs can be optimized for a particular task. Here the scoring function of Smina is re-optimized towards enhancing the docking power using a supervised machine learning approach and a manually curated database of ligands and cross docking receptor pairs. The optimization method does not need associated binding data for the receptor-ligand examples used in the data set and works with small train sets. The re-optimization of the weights for the scoring function results in a similar docking performance with regard to docking power towards a cross docking test set. A ligand decoy based benchmark indicates a better discrimination between poses with high and low RMSD. The reported parameters for Smina are compatible with Autodock Vina and represent ready-to-use alternative parameters for researchers who aim at pose prediction rather than affinity prediction. ã 2016 Elsevier Ltd. All rights reserved.

1. Introduction Automated ligand docking to receptor models is an important method in structure based molecular design and structural bioinformatics. Docking algorithms aim to predict the preferred conformations and binding affinities of small molecule ligands given a receptor model. Docking of small molecule ligands for receptor targets has been used in a variety of studies, ranging from in silico screening of large compound databases to predicting affinity of novel compound designs in lead optimization. Usually the docking algorithms use an optimization search strategy that generates proposed conformations of the ligand in the receptor in combination with a scoring function that evaluates how the ligand fits in the receptor binding pocket (Meng et al., 2011). Performance of docking programs has conceptually been divided into docking power, scoring power, ranking power and screening power (Li et al., 2014a) to probe the usefulness of the docking program for different applications such as crystal pose prediction, affinity prediction, ligand ranking and virtual screening, respectively (Wang et al., 2004a). Docking power is focused on the programs ability to predict the correct conformation of the ligand and the placement in the receptor molecules (the pose). It can be evaluated by docking

E-mail address: [email protected] (E.J. Bjerrum). http://dx.doi.org/10.1016/j.compbiolchem.2016.04.005 1476-9271/ã 2016 Elsevier Ltd. All rights reserved.

ligands to receptors with known binding mode of the ligands and comparing the result with the already known experimentally determined binding mode. The docked poses are usually evaluated by computing the root mean square deviation (RMSD) between the heavy atoms of the docked ligand and native crystal pose. The average RMSD can be used, but often a threshold of 2 Å RMSD is used to calculate the fraction of correct predictions (Cheng et al., 2009; Xu et al., 2015). The scoring, ranking and screening power focuses on the affinity prediction, ligand series ranking and database screening performance, respectively. Scoring power is usually benchmarked by calculating the correlation between predicted and measured affinities for large databases such as the PDBbind (Wang et al., 2004b; Wang et al., 2005) database (Li et al., 2014a). Ranking power represents a variant of the scoring power where the focus is on the ranking of the ligand series (Plewczynski et al., 2011) and screening power is measured by the ability to identify known binders seeded in large databases of decoys by calculating the area under the receiver operator curves (ROC) (Li et al., 2014a; Christofferson and Huang, 2011; Bauer et al., 2013). The differences in benchmarks reflect the versatility of the in silico docking approach, but also they emphasize that evaluation of the docking programs is linked to the intended use. For example, the medicinal chemist who wants to use bio-structural inspiration for lead optimization, may not be interested in the affinity prediction as the experimental data are already available, but

134

E.J. Bjerrum / Computational Biology and Chemistry 62 (2016) 133–144

instead care very much about the accuracy of the predicted pose. In an in silico compound library screening program, the ability of the docking algorithm to distinguish binders from non-binders is important, whereas the affinity prediction will be the important aspect for a medicinal chemist who wants to examine the possible effect of a small structural change to a lead compound with known binding mode. With regard to evaluation of docking power there is an additional aspect to be taken into account. Either the ligand can be re-docked into the receptor X-ray structure where the ligand binding mode was determined or docked into a receptor model from an X-ray structure determined with a different co-crystallized ligand (Cross docking). The last procedure reflects the real world use and has been used as a benchmark for docking accuracy (Morris et al., 2009; Liu et al., 2012; Forli and Olson, 2012; Stigliani et al., 2012). A widely used docking program is Autodock Vina (Trott and Olson, 2010) programmed as an alternative to Autodock (Morris et al., 2009). Autodock Vina employs a docking algorithm and scoring function different than Autodock, but is compatible with the file formats used in Autodock and uses the same tools to prepare receptor and ligand files. In comparison with the original Autodock, Autodock Vina was shown to be much faster and with a large improvement of docking accuracy for compounds with large conformational freedom in the form of torsions (Trott and Olson, 2010). It is unclear precisely how the weights were determined for the original scoring function in Autodock Vina, except that it was tuned against PDBind (Wang et al., 2004b; Wang et al., 2005). Autodock Vina uses an empirical scoring function to dock and to predict the affinity of the ligand. Five terms, gauss1, gauss2, repulsion, hydrophobic and hydrogen bond, analyzing pairwise atom interactions are scaled and linear combined, whereas a last term, Nrot, is combined in a non-linear fashion (Trott and Olson, 2010). The source code for Autodock Vina was released in 2010, and has been forked and modified by others. QuickVina2 (Alhossary et al., 2015) focuses on improving computational efficiency, whereas Smina (Koes et al., 2013) has exposed already existing experimental score terms and added convenience functions such as automatic calculation of the docking box from a reference ligand. VinaLC (Zhang et al., 2013) added multi-threading and message passing interface (MPI) support for use at high performance clusters. When ranking docking programs and the scoring functions evaluating the different “powers”, no single program is a clear winner amongst all benchmark types. For example GOLD (Verdonk et al., 2003) has been found to be one of the best in benchmarks for pose prediction (Li et al., 2014a), whereas Schrödingers Glide (Friesner et al., 2004; Halgren et al., 2004) has performed better in database enrichment studies and affinity prediction (Li et al., 2014a). It has been noted that no single scoring function outperforms others in the benchmarking of the different aspects of docking performance (Cheng et al., 2009). Recent efforts in optimizing binding affinity prediction with the use of non-linear machine learning models such as random forests (Zilian and Sotriffer, 2013; Li et al., 2015; Fourches et al., 2015) or support vector machines (Kinnings et al., 2011a), have led to improvements in affinity prediction. As an example, the RF-Score-v3 (Li et al., 2015) nearly doubles the Pearsson’s correlation coefficient in comparison with GlideScore-XP (Friesner et al., 2006). However, the improvements in affinity prediction seem unrelated to improvements in the docking performance, an effect which has also been noted in previous benchmarks of docking and scoring performance (Warren et al., 2006). This seemingly counter intuitive divergence between the various “powers” is the reason why the machine learning approach has been criticized (Gabel et al., 2014). However, machine learning approaches have on

several occasions outperformed more classical scoring functions in a variety of benchmark experiments (Kinnings et al., 2011b; Li et al., 2011; Ballester et al., 2012; Ding et al., 2013; Durrant et al., 2013). Recently these differences in scoring performance have been reviewed in depth (Ain et al., 2015). Taken together, these observations indicate that docking performance can be tuned for different tasks. The different approaches and the scoring functions optimal for the different tasks may have some overlap as illustrated in Fig. 1, but the extent of this overlap is not known in detail. Generally speaking, machine learning is the process of getting computers to perform actions without being explicitly programmed. In supervised machine learning, the computer algorithm is presented with a task (here docking) and the results compared with the provided correct results. The error is formulated mathematically in a loss or cost function which calculates how wrong the outcome is. The terms and parameters of the algorithm are then stepwise optimized to minimize the loss function (Mohri et al., 2012). In this study a machine learning approach is used to re-optimize the weights for the scoring function of Smina (Koes et al., 2013) using a loss function calculated from the pose prediction accuracy of a small curated cross docking data set. 2. Computational methods 2.1. Train and test sets A database of receptor-ligand complexes for cross docking was manually curated using the protein data bank (PDB) id’s from the Astex diverse data set (Hartshorn et al., 2007) with known binding affinities determined as Ki/Kd in the Binding MOAD (Hu et al., 2005). All the PDB files matching the UniProtID of the receptor from the Astex set were downloaded from the protein data bank and structurally aligned. Hydrogen atoms were added and the Xray model split into receptor, water molecules, co-factors and

Fig. 1. Aims of docking and scoring. A: Binding mode prediction as it will be used in docking and lead optimization studies. B: Affinity Prediction as it will be used in the assessment of proposed novel compounds. C: Binding classification for the use in in silico based virtual library screening and development of focused libraries for high throughput screening. D: The Holy Grail of scoring functions: A scoring function that can do all three things optimal.

E.J. Bjerrum / Computational Biology and Chemistry 62 (2016) 133–144

135

Table 1 Details of the train and test data sets. Each cross docking pair is referenced in the text using the PDB id from the first column. PDBid

Ligand PDB name

log Affinity log(M)

Type

UniProtID

Docking target PDBid, chain

Docking Speed Seconds

Train set 1GPK 1HNN 1HP0 1L2S 1N1M 1N2V 1SQN 1TT1 1U1C 1UOU 1W2G

HUP SKF AD3 STC A3M BDI NDR KAI BAU CMU THM

5.37 6.24 6.70 4.59 5.70 4.08 9.40 7.19 5.37 7.70 4.57

Ki Ki Ki Ki Ki Ki Kd Ki Ki Ki Ki

ACES_TORCA PNMT_HUMAN Q9GPQ4_TRYVI AMPC_ECOLI DPP4_HUMAN TGT_ZYMMO PRGR_HUMAN GRIK2_RAT UDP_ECOLI TYPH_HUMAN KTHY_MYCTU

1W6R, A 1YZ3, A 3EPX, A 1IEM, A 3Q8W, A 3BLO, A 3KBA, A 3G3K, A 3KVV, A:B 2W6K, A 1MRN,A

3 4 9 8 3 4 4 4 8 5 6

Test Set 1KZK 1LPZ 1N46 1OYT 1R1H 1UML 1YGC 1Z95 2BM2 2BR1

JE2 CMB PFA FSN BIR FR4 905 198 PM2 PFP

10.39 7.60 10.52 7.24 8.92 7.52 9.46 9.12 7.82 5.14

Ki Ki Ki Ki Ki Ki Ki Kd Ki Ki

O38716_9HIV1 FA10_HUMAN THB_HUMAN THRB_HUMAN NEP_HUMAN ADA_BOVIN FA7_HUMAN ANDR_HUMAN TRYB2_HUMAN CHK1_HUMAN

2F3K, AB 4A7I, B 1YOX, X 1G32, B 2QPJ,A 2EIW, A 4ISI, H 3B5R,A 2ZA5, AC 2C3K, A

50 37 13 28 46 42 71 19 40 26

ligands which were saved as individual PDB files. PDB file manipulation was done with PyMOL (Schrödinger, 2010). For each receptor selected from the Astex diverse set, the other PDB files with the same UniprotID were visualized in PyMOL and a target for the cross docking experiment chosen using the following criteria: The binding site of the cross docking site should be different, but large scale movements and changes in the receptor site such as a complete blockade by a rotation of an amino acid side chain were not allowed. Unless problems were noted with the first alternative PDB file sorted alphanumerically, this was chosen. The receptors molecules stripped of water molecules and the ligands were converted from PDB file format to PDBQT format with the use of AutodockTools (Morris et al., 2009). The Crossdocking receptor pairs were divided into a training set and a test set on the basis of the speed of a test docking with Smina (Koes et al., 2013). The details of the resulting data sets are summarized in Table 1.

The weight terms were allowed to vary between 0 and +/1 in PPSwarm, with the physically most relevant sign used: Repulsion was set to vary between 0 and 1, with the other terms set for the range 0 to 1. The weights set by the PPswarm algorithm were normalized before being used by Smina for docking. The normalization factors were determined before optimization in the following way. The native pose ligands were copied from the X-

2.2. Machine learning algorithm A schematic outline of the machine learning script for optimization programmed in python is shown in Fig. 2. The optimization library PPSwarm (Vaz and Vicente, 2007), accessed through its python bindings in the OpenOpt framework (Kroshko, 2013), feeds scoring function weight parameters to Smina (Koes et al., 2013). All ligands in the training set are then docked against their respective cross-docking receptors. The obtained PDBQT formatted files with the docked poses are converted to PDB format files with OpenBabel (O’Boyle et al., 2011) and the RMSDs between the experimental ligand position in the X-ray structures and the docked ligands are calculated with RDKit (Landrum, 2015). All possible combinations of symmetric rotatable groups are enumerated by the RDKit subroutine and the combination with the lowest RMSD chosen. The loss function is calculated by converting the determined RMSD values by means of a sigmoid function with the form:1=ð1 þ 10ðRMSD3Þ Þ. The correspondence between RMSD and classification score is shown in Fig. 3. The average loss function is returned to the PPSwarm optimization library as the loss function to be minimized.

Fig. 2. Flow diagram of the machine learning optimization loop script written in Python. Initial parameters are fed to the OpenOpt framework which uses PPSwarm as the optimization algorithm. The optimization algorithm generates test parameters, that are used by SMINA for evaluation. SMINA combines the parameters with a fixed list of terms, receptor files and test ligands, and docks the ligands to their assigned receptors. The docked poses are converted with Open Babel and the root mean square deviations (RMSDs) with respect to the experimentally determined poses are calculated using the RDkit chemoinformatic toolkit. The RMSD values are used to calculate a sigmoid loss function which are fed back to the optimization algorithm for another round of optimization.

136

E.J. Bjerrum / Computational Biology and Chemistry 62 (2016) 133–144

Fig. 3. The sigmoid loss function for classification of correctness based on root mean square deviation (RMSD) between the docked and experimental poses. RMSD values below 2 are close to 0 (correct), whereas scores above 4 are near 1 (wrong).

ray structure into the target receptor and minimized with the Vina default parameters and terms. The minimization was followed by a re-scoring with all term weights set to 1 to obtain the raw contributions from each term. The 90% percentiles of the resulting raw contributions from the terms for all the ligands were used as the normalization factor for the exchange of parameters between Ppswarm and Smina. The number of particles in the PPSwarm algorithm was reduced from 42 to 10, but otherwise the default values for PPSwarm were used (social = 0.5; cognitial = 0.5; fweight = 0.4; iweight = 0.9; tol = 1e-5; ddelta = 0.5; idelta = 2.0). 2.3. Benchmarking The obtained new weights were benchmarked by using them for docking of the test set of ligand and target receptor pairs which had no overlap with the training set. The average RMSD and loss function were computed in the same way as in the machine learning algorithm (vide supra). Additionally the correlation coefficient (R2) between the scores and -log (affinity) of the ligands as well as the percentage of poses with an RMSD value below 2 were calculated. 2.4. Validation set An extended validation set was made from the examples in the Astex diverse data set (Verdonk et al., 2008) which were not included in the training set. The receptor and ligand PDBQT format files were prepared in a similar way to the training and test set, except that MOL2 and MOL format files from the Astex non-native set (Verdonk et al., 2008) were used after conversion with open babel (O’Boyle et al., 2011). The structures were not manually inspected in PyMOL. Examples from the PDBid 1kkc, 1hq2, 1ia1, 1of1, 1r55, 1sg0 failed to convert to PDBQT format and were excluded from the validation set. Receptor files in PDBQT format which could not be loaded by Smina were also removed from the benchmark set. Otherwise the first receptor in file system order was used as a docking target for the ligand from the diverse set. The details of the resulting validation set are summarized in the spreadsheet in supplementary material. Ten different random seeds were generated and used for docking and subsequent benchmarking of both the default and re-optimized weights.

scoring functions and added to the decoy set, rising the number of possible decoys to 20 for each of the 48 ligands in the validation set. Minimization was done with Smina, using 1000 iterations, linear approximation and a force cap of 100. All decoys were scored with both scoring functions and the RMSDs were calculated with respect to the original native crystal pose. The scores were normalized by subtracting the median of the decoy scores obtained for the ligand in question. The scores were divided into three groups based on the RMSD value of the decoys using limits of <2, 2–4 and >4 Å. Scores in all groups were normalized with median centering and unit variance by using the median value and standard deviation of the RMSD > 4 group. The median, 25% and 75% quartiles were calculated. The area of the difference between the cumulative sums of the RMSD < 2 and RMSD > 4 groups was found by integrating the difference between the cumulative sum curves using the trapezoidal rule. Calculations, scatter plots and normalized cumulative sums were made with Python scripts using matplotlib and numpy modules. 3. Results 3.1. Data sets and selected receptors Of the 66 PDBids in the Astex diverse set (Hartshorn et al., 2007), 24 PDBids had associated binding data in the form Ki/Kd in the Binding MOAD (Hu et al., 2005). The receptor families originating from PDB ids 1lrh, 1q1g and 1r1h were discarded manual inspection of the PDB files related to that UnitProtID, giving a final data set of 21 X-ray receptor pairs and ligands. Using docking speed as a criterion to divide the X-ray structure cross-docking pairs into a train and a test set, resulted in an asymmetrical division. The training set consisted of the small and rigid ligands, whereas the test set consisted of larger, flexible ligands. Details of the test compounds and their receptor pairs are listed in Table 1 and structures of the ligands are shown in the Supplementary material Figs. 1 and 2. The division into data sets based on the speed of ligand docking significantly shortened the evaluation time of the training set for each step of the optimization algorithm and made the optimization tractable on a single workstation computer. 3.2. Re-optimization of term weight parameters The default scoring function of Autodock Vina (Trott and Olson, 2010), is calculated by summing contributions from different functions or terms applied to distances between atom pairs between the ligand and the receptor. The different terms use information about the atoms in the atom pairs such as the sum of vdWalls radii and their type, such as hydrophilic, hydrogen bond donor/acceptor. In a similar fashion the intra-molecular contributions are calculated for the atoms in the ligand that can move relatively to each other, but excluding 1–4 interactions (atoms separated by 3 consecutive bonds). The individual terms are weighted and summarized for five of the terms (Eq. (1.1)), and the Nrot terms is added in a non-linear fashion (Eq. (1.2)). Einter ¼ w1  Egauss1 þ w2  Egauss2 þ w3  Erepulsion þ w4  Ehydrophobic þ w5  Ehydrogenbond

ð1:1Þ

2.5. Decoy based validation A decoy set was made by docking to all structures in the validation set using both scoring functions. The energy range was extended to 8 and 9 structures were kept per ligand. Splitting the resulting PDBQT format files into single files gave up to 18 decoys for each ligand. The native ligand poses were minimized with both

Vinascore ¼

Einter 1  w6  Nrot

ð1:2Þ

The sum of the vdWalls radii is subtracted from the inter-atomic distance to give a distance d. Gauss1 is a Gaussian with optimum at d = 0. Gauss2 is a broad Gaussian with optimum at d + 3 Å.

E.J. Bjerrum / Computational Biology and Chemistry 62 (2016) 133–144

Fig. 4. Machine learning optimization run. The graphs show various parameters followed during the steps of the machine learning optimization. Parm: the different parameters as they are varied through the optimization. From starting values of 0.5 or 0.5 they are adjusted to their final values after 314 steps. RMSD: the root mean square deviation (RMSD) values of individual ligands compared with the experimentally determined pose. The RMSD values show highest variation in the first half of the optimization but as the parameter solution stabilizes, the variation goes down. Periodic worsenings are observed when the particle swarm optimization (PSO) algorithm test the nearby solutions. Loss: the sigmoid loss function calculated from the RMSD values with a running average (40 steps). This is the parameter the PSO algorithm tries to minimize. The value levels off after approximately 150 steps. %Correct: The percentage of correctly docked ligands for RMSD thresholds set at 1, 2, and 3 Ångström respectively.

Repulsion is the square of the vdWalls distances overlap (d^2), and 0 if there is no overlap d > 0. The Hydrophobic equals 1 for d < 0.5 and interpolated linearly to 0 for d > 1 Å. For potential hydrogen bond pairs, the hydrogen bond term is 1 for d < 0.7 Å and 0 for d > 0. For the other poses, here shown as the k'th score, the intra molecular contributions from the ligand from the top scoring pose are subtracted, but as only the top scoring pose is used, this is not relevant in the current context. The default terms weights (w1-5, Eq. (1.1)) were re-optimized using the train set with particle swarm optimization in the machine learning loop programmed in Python. The Nrot term was not used at all, as this is fixed during docking, and only has influence on the affinity scoring. The tested parameters, the RMSD of each compound with respect to the X-ray pose, the total cost and the percentage of compounds docked correctly with a 1–3 Å criterion were followed over the course of the optimization and are plotted in Fig. 4. In the first approximately 50 steps, the results fluctuate wildly as can be seen by the high cost-function and the low percentages of correct solutions, but over a couple of updates of the swarm, the loss function drops. After the next 100 steps the

137

solution stabilizes, with subsequent testing of possible nearby solutions with periodic drops in the performance corresponding to the swarm size. The whole run of 314 optimization steps represents 31 updates of the particle swarm and 3140 docking experiments as each evaluation of the cost-function needed to dock 10 ligands. PPswarm (Vaz and Vicente, 2007) is an implementation of particle swarm optimization, which is global optimization algorithm originally inspired by flocking birds and fish schools. It iteratively searches for an optimal solution by updating the position in search space of the candidate solutions (particles) by combining their current velocity and attraction towards the particle with the best solution. This leads to flocking behavior and movement of the flock towards better solutions. At the same time the vicinity of the proposed best solution is investigated by the flock. Particle swarm optimization has been shown in benchmarks of global optimizers to perform efficiently as it only requires a low number of function evaluation for optimization (Hassan et al., 2005). The re-optimized weights are listed in Table 2, where they are compared with the default weights after normalization to match the magnitude of the repulsive term of the default parameters. Overall speaking, the gauss1 term and the non directional hydrogen bonds are given more weight after the re optimization whereas there has been reduction in the gauss2, repulsion and hydrophobic terms. The re-optimization has overall given a larger weight on the attractive terms in comparison with the repulsive term. 3.3. Benchmarking the new parameters The weights giving the lowest loss function value were used to benchmark the performance against the test set as well as the train set, with respect to docking power by calculating the fraction of ligands docking with an RMSD value <2%. The default parameters of Autodock Vina were also benchmarked and the results are summarized in Table 3 and shown graphically in Fig. 5. The optimized parameters results in a marked improvement for the train set compared with the Vina default parameters as the percentage docking with an RMSD below 2 Å rose from 42% to 75%. This is to be expected as the parameters have been tuned to this particular set of protein-ligand pairs. The docking accuracy results obtained using the optimized parameters are at least on par with the Vina default parameters. Even though the percentage docked with an RMSD below 2 Å remains the same, the average RMSD is improved from 2.77 to 2.00, with the improvement visible as a left shift of the curve in Graph B of Fig. 5. The train set curves in Fig. 5A also show this left shift, but the gain in the top is slightly offset by a loss in accuracy for the best docked molecules, where the curve for the performance of the new parameters lays to the right of the performance for the default parameters. This loss of performance is, however, not penalized significantly by the sigmoid loss

Table 2 Comparison of the default and re-optimized weights for the scoring function terms. The re-optimized parameters are shown both in their raw form for usage with the docking programs, but also normalized to the repulsion term and as% change with respect to the default terms. Term

Default

Re-optimized

Re-optimized (normalized)

% Change

Gauss1 Gauss2 Repulsion Hydrophobic H-bond Num_tors_div

0.035579 0.005156 0.840245 0.035069 0.587439 0.058460

0.046016 0.000384 0.366584 0.008122 0.431416 not used

0.11 0.00088 0.84 0.019 0.99 not used

196 83 0 47 68

Sum Repulsive Sum Attractive

0.84 0.66

0.37 0.49

0.84 1.11

138

E.J. Bjerrum / Computational Biology and Chemistry 62 (2016) 133–144

Table 3 Root mean square deviation (RMSD) comparison of the X-ray structure and docked pose before and after re-optimization of scoring function. Receptor pairs are identified with the reference receptor protein data bank id (PDB id) as in Table 1 column 1. Results are shown for both train and test sets. The train set has improvement in all three ways of summarizing the overall docking accuracy, whereas the test set has improvements in average RMSD and Loss function. Train set

Test set RMSD

RMSD

PDB id

Default

Re-optimized

PDB id

Default

Re-optimized

1N1M 1GPK 1SQN 1N2V 1TT1 1HNN 1UOU 1W2G 1L2S 1U1C 1HP0

2.14 4.84 0.56 3.13 3.66 1.05 4.90 0.77 3.02 1.67 0.51

1.93 3.67 1.31 3.01 3.98 1.17 0.77 0.85 1.66 1.66 0.32

1N46 1Z95 2BR1 1OYT 1LPZ 2BM2 1UML 1R1H 1KZK 1YGC

1.50 1.03 3.32 7.68 3.14 2.45 1.56 1.23 1.63 4.12

1.57 1.13 3.55 0.81 0.93 2.57 2.22 1.29 2.11 3.82

Average % below 2 Loss Function

2.39 0.42 0.37

1.85 0.75 0.23

2.77 50.00 0.35

2.00 50.00 0.21

function used, as the rise in loss function from 1 Å to 2 Å error is only 0.11 which is compared to the 0.36 rise in loss function when going from 2 Å error to 3 Å error (c.f. Fig. 3). The curves in Fig. 5, do not show the improvement of the single receptors as the points on the curves do not necessarily correspond to each other horizontally. The improvement for each compoundreceptor pairs is shown in Fig. 6, where the the bar chart has been sorted by the change in RMSD obtained using the re-optimized parameters instead of the default parameters. For the train set the most marked improvement is seen for the 1UOU example, and slightly worse accuracy is seen for 1W2G, 1HNN,1TT1 and 1SQN. 1OYT and 1LPZ from the test set becomes improved from above to below the 2 Å RMSD threshold, whereas two other compounds, 1KZK and 1UML go from below to above the threshold of 2 Å. The percentage docked correctly according to the 2 Å RMSD criterion thus remains the same. However, the 1OYT example from the test set is markedly improved from an RMSD of above 7 to below 1, whereas the 1KZK and 1UML cases are only slightly increased to an RMSD just above 2. The sensitivity of the parameters towards use of different random seeds was investigated by benchmarking the docking performance with 10 different random seeds. Dockings were performed against an extended validation set using both the default and thr optimized parameter weights. Results of all dockings can be found in the supplementary info spreadsheet. Both set of parameter weights show variations in the docking benchmark due to the use of different random seeds. The percentage below 2 is on average 37 with a standard deviation of 1.8 for the re-optimized weights, which is slightly better than the results obtained using the default weights (35, 2.0). However, docking with the default parameter weights results in a lower average RMSD than using the re-optimized weights (4.23 and 4.59 respectively). The two scoring functions docking performance is statistically insignificant when comparing the percent correct poses below 2 Å RMSD (T-test P-value two tail, 0.08). 3.4. Comparative scoring of the decoys The normalized scores obtained with both scoring functions and RMSD values of the decoy set are plotted in Fig. 7, with statistics summarized in Table 4. The scatter plots show that both

Fig. 5. Improvement in the pose prediction before and after tuning the scoring function in SMINA using the training set. Compounds are ordered according to root mean square deviation (RMSD). The graph shows the number of compounds passing the threshold for different values of RMSD and the improvements are visible for both the train set and the test set as a left shift of the curve.

scoring functions on average give lower scores for decoys with RMSD values below 2. Above an RMSD value of 3, there is no correlation between score and RMSD value of the decoy poses. The scores obtained with the re-optimized parameter weights have a lower median value of 0.99 compared with the median value of 0.69 obtained with the default values. The variance is large for all groups, but slightly larger for the groups scored with the reoptimized parameter weights. Integration of the difference between the cumulative sum curves for groups with RMSD < 2 and RMSD > 4 gives an area between the curves of 87 for the re-optimized scoring function which is larger than the 66 obtained with the default scoring function. This indicates that the re-optimized scoring function in spite of larger variance in the groups is better at discriminating between correct and incorrect decoys. The median value of the groups with RMSD > 4 is 0.0 due to the normalization procedure. 4. Discussion 4.1. Optimization algorithm The presented optimization approach complements the recent efforts in optimizing the scoring function by either making a new selection of docking terms (Koes et al., 2013) and/or using nonlinear functions such as random forests (Zilian and Sotriffer, 2013; Li et al., 2015; Fourches et al., 2015) or support vector machines (Kinnings et al., 2011a). Multivariate statistics has previously been used to optimize settings of docking programs for optimizing the docking power (Andersson et al., 2007). Machine learning approaches have also been used to search for new scoring functions and terms (Koes et al., 2013) but it has seemingly not been used to optimize the weights of a scoring function with a

E.J. Bjerrum / Computational Biology and Chemistry 62 (2016) 133–144

Fig. 6. Root mean square deviation (RMSD) accuracy of docked poses for each ligand tested. The cross-docking receptor pairs are identified using the protein data bank id from the originating receptor X-ray structure. The bar chart is sorted according to the improvement in docking accuracy after using the re-optimized scoring function.

direct measure of the docking power as the loss-function. Other approaches for using machine learning approaches to crystal pose prediction have employed SVM’s to train a consensus scoring based on a database of decoys (Wang et al., 2013) or tested the performance of several different machine learning algorithms in their ability to predict the RMSD of a decoy set (Ashtawy and Mahapatra, 2015). Both studies thus used post-docking decoy based benchmarks in contrast to the here proposed algorithm which optimizes the parameters directly as used internally in the docking algorithm. Smina (Koes et al., 2013) was used in the optimization, but the results should be directly transferable to the other variants of Autodock Vina as only the default terms were used. 4.2. Docking accuracy The optimization methodology described here for a docking function does seem to produce parameters that give a similar amount of successful docked ligands as judged by the 2 Å RMSD criterion. Autodock Vina and derivatives has been benchmarked against PDBbind with pose prediction fractions (RMSD < 2 Å) of 78% (Handoko et al., 2012), DUD 64% (Zhang et al., 2013), a subset of the CSAR 2012 data set yielding 55–100% for redocking, but only around 25% for crossdocking (Koes et al., 2013). A similar but smaller drop in cross docking experiments was also found for the docking program GOLD which dropped from around 80% for redocking to around 60% for cross-docking (Verdonk et al., 2008). The fractions are high (75% train, 50%test) compared to previously published cross-docking benchmarks. Koes et al. (2013) finds only 14 successful docking out of 54 cross dockings (26%). Where the by Koes et al. affinity optimized docking score only docks 10/54 (19%) successfully. It is, however, an unfair comparison as the results presented here are from a curated data set, where too large differences between the cross docking pairs

139

were excluded from the data set, making it plausible that the data set chosen for the re-optimization contains easier docking targets. This is consistent with the drop in perceived docking performance seen when using the un-curated and larger validation set (37% docked correctly). Another reason for the large fraction of successful cross-docking could be the use of ligand conformations taken directly from the native receptor. It has been found that many docking programs, including Autodock Vina, are sensitive to the starting conformations of the ligands (Oda et al., 2014). The performance after re-optimization is not as high as the docking power of other commercial available docking programs, such as GOLD (Verdonk et al., 2003) or Glide (Friesner et al., 2004). It can be as high as 70–80% (Cheng et al., 2009; Li et al., 2014b) in benchmarks which uses re-docking and be 60% for cross-docking experiments with GOLD (Verdonk et al., 2008). The here applied optimization method improves the docking function directly as used by the sampling algorithm. To separate differences in scoring separated from the sampling, the efficiency of the docking functions were also compared with a decoy based benchmark. Both scoring functions were used to generate decoys and minimizing native ligands, which gives a balanced decoy set for comparing the two scoring functions head to head, without favoring one or the other. The plot in Fig. 7 has a j-shaped correlation between the ligand normalized scores and the RMSD values of the corresponding poses. Both scoring functions show a weak correlation between normalized score and RMSD value, but only up to a certain level. Above an RMSD value of 3 Å there is no correlation at all between RMSD and score for both scoring functions. This indicates that decoy poses above this threshold are indistinguishable by the scoring functions, which thus doesn't seem able to judge if a wrong pose is in the ligand pocket or outside the ligand pocked. The choice of a sigmoid function for classification of structures into right or wrong thus seems justified, as this conversion of the RMSD values do not significantly distinguish between poses with and RMSD of 4 or 10 (c.f. Fig. 3). The lower median score obtained with the re-optimized docking function for decoys with RMSD values below 2 Å, may provide better discrimination between decoys with high RMSD values and decoys with poses close to the native ligand. This larger correlation is, however, offset a bit by the larger variance seen for the groups scored with the re-optimized docking function, but nevertheless it leads to a lower overlap between the group with RMSD below 2 Å and the two other groups. Even though it is not straightforward to interpret the results of the decoy based benchmark, the re-optimized docking function performs similar or better than the default function. 4.3. Structural examination of the docked poses Benchmarks of a docking algorithm for docking accuracy and binding mode prediction should simulate as close as possible to the situation and task to be performed. The goal is to be able to predict or suggest the binding mode of a ligand, given a receptor structure determined after co-crystallization with another ligand. It generates little new knowledge to re-dock the ligand into the same Xray structure where the pose has already been experimentally determined. Ligands often make an induced fit when binding to a protein. The overall shape and characteristics of the binding pocket may largely remain the same, but small adaptations of the amino acid side-chains and adjustment of the backbone structure makes the protein adapt conformationally to the ligand and obtain an energetically favorable tight fit. This can explain the differences previously found (Koes et al., 2013) between benchmarking docking power in a re-docking contra a cross docking. The docking poses from the benchmark which shows the largest improvement with the new parameter set are shown in

140

E.J. Bjerrum / Computational Biology and Chemistry 62 (2016) 133–144

Fig. 7. Scatter plots and normalized cumulative sums of normalized scores for ligand pose decoys. Scatter plots and cumulative sums are color coded according to RMSD with respect to the original native ligand pose. blue: RMSD < 2, Green RMSD is between 2 and 4, yellow RMSD > 4. Both plots show a tendency for a J-shaped correlation between scores and RMSD values. Below 2 there is an offset towards lower scores. The scores obtained with the re-optimized scoring function have a lower median than obtained with the default scoring function (0.99 and 0.69 respectively). This is reflected in a right shift of the curve in the cumulative sum plot. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

Figs. 8 and 9, where the docked poses in the cross-docking are compared with the native receptor for the 1UOU and 1OYT labeled cross docking receptor pairs. Comparing the native pocket shape of 1UOU from the train set in A with the pocket shape in B and C, shows that the 2WK6 receptor used for cross-docking is more narrow in this pocket than 1UOU. There is an inward shift of amino acid side chains and an adjustment of the backbone. With the default parameters the steric clash between the ligand and the non native receptor is penalized through the repulsion term, and the inverted pose is preferred although the potential hydrogen bonds are not optimal. The re-optimized parameters give a lower contribution weight to the repulsion term and a higher from the non-directional hydrogen bond term and prefer the correct mode, although the ligand is slightly shifted due to steric differences of the binding pockets. The reason for the large improvement seen for the 1OYT example from the test set seems to be similar. Here the default parameters penalize a steric clash from a slight shift in a tryptophan side chain in the 1G32 cross-docking receptor when compared to the native 1OYT X-ray structure. The gauss1 term favors placing atoms with a distance to other atoms with a sum of their van der Waals radii and is thus crucial in placing the atoms of the ligand in favored positions. The nondirectional hydrogen bonding terms reward placing ligand in

positions where there is a potential for establishing a hydrogen bond and are thus a driver in placing the ligand into a pose where a favorable hydrogen bond network could form. The gauss2 term favors the placement of ligands atoms in an environment with a majority of receptor atoms in a long distance at a rather broad distribution. This term could act as a driver to position the ligand deep into the protein. The repulsion term, which in the reoptimized parameters has been down tuned, helps placing the atoms at a distance with not too large van der Waals overlap and thus tries to prevent steric clashes. The hydrophobic terms reward placing hydrophobic atoms from the ligand near hydrophobic atoms in the receptor. Generally speaking, the scoring function tries to drive the ligand into the deepest pocket (Gauss 2), that fits the ligand (repulsion + gauss1), where there's a potential match for the hydrophobic/hydrogen bond properties of the ligand (hydrogen bond term and hydrophobic term). This balance is shifted in the re-optimized terms, the repulsion term is given less weight and thus allow for larger steric clashes as can be found in the receptors for cross docking, where the pocket amino acid side chains are not conformational optimized for the ligand. The correct placement of the ligand atoms also depends on the gauss1 term, which are given much more weight in the re-optimized terms. The hydrogen bonding term is given larger weight after re-optimization, which could be a hint that correct hydrogen bond networks are more

Table 4 Statistical values obtained by scoring the scoring the ligand pose decoy set with both scoring functions. Default Score

Re optimized Score

Group

Median

25% Quartile

75% Quartile

Median

25% Quartile

75% Quartile

RMSD < 2 RMSD 2 < 4 RMSD < 4

0.69 0.10 0.00

1.43 0.53 0.36

0.20 0.47 0.64

0.99 0.11 0.00

1.63 0.62 0.48

0.11 0.47 0.73

E.J. Bjerrum / Computational Biology and Chemistry 62 (2016) 133–144

141

Fig. 8. Docking poses for the cross docking pair from the train set labeled 1UOU. A: Native receptor (protein data bank id 1UOU) with experimentally determined ligand pose. The binding pocket around the 5-chloro-uracil moiety of the ligand are a flat pocket complementing the flat shape of the ligand. B: Cross docking receptor (protein data bank id 2W6 K) with ligand docked using default parameters. The ligand is inverted in the binding pocket, with the smaller 2-iminopyrrolidinyl part occupying the pocket. C: cross docking receptor with ligand docked using re-optimized parameters. The ligand is correctly aligned although a bit shifted in the distal 2-iminopyrrolidinyl part. Visualizations are made with PyMOL.

important for correct pose prediction than the hydrophobic term, whereas placement of hydrophobic atoms in a hydrophobic environments could be of more importance for the ligand affinity (Friesner et al., 2006; Young et al., 2007) which was totally disregarded in the re-optimization. The machine learning algorithm thus seems to have learned that minor steric clashes can be observed under cross-docking conditions and have lowered the weight for repulsion term, while at the same time enlarged the weight for the gauss1 term to get an optimal steric placement of the ligand atoms. The new weights allow some of the docked ligands to obtain the correct pose even with what originally seemed like a steric clash. Using cross docking in the optimization of docking functions is thus likely to be important. Docking another ligand into the often rigidly treated receptor will lead to non-optimal overlap of van der Waals

distances. These small differences and steric clashes could likely in reality be absorbed by the protein through small adaptations of the conformations of side chains and backbone. However, this may be severely penalized if the docking function has only been trained on re-docked examples, where there is an induced fit between ligand and receptor molecules. On the other hand, insisting on the cross docking benchmark limits the size of the potential database of cross docking pairs, as they have to be manually curated and collected. 4.4. Train set size It is noteworthy how small a database of cross-docking receptor-pairs was necessary for a successful re-optimization. This stands in stark contrast to the more than thousands of

Fig. 9. Docking poses for the cross docking pair from the test set labeled 1OYT. A: Native receptor-ligand complex (protein data bank id 1OYT). B: Cross docking of the ligand using default parameters. (receptor protein data bank id 1G32). The ligand pose is inverted. The receptor pocket is smaller when compared to A due to a shift in a tryptophan side chain seen partially in the top of the picture. C: Cross docking of ligand using re-optimized parameters. The ligand docks more similar to the experimentally determined pose in A. Visualizations are made with PyMOL.

142

E.J. Bjerrum / Computational Biology and Chemistry 62 (2016) 133–144

Fig. 10. Correlation between docking score and ligand affinity. The scoring function values obtained using the default parameters (green curve) shows a better correlation with the experimentally determined affinity than the scoring values obtained with the re-optimized parameters (blue curve). The train set (green diamonds and blue +) generally obtains lower scores than the test set (green * and blue x). The trend lines for the scores obtained with the default and re-optimized parameters are shown for the train and test set combined. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

receptor-ligand pairs found in PDBbind database previously used for tuning affinity scoring functions. This could reflect the simplicity of the default Vina scoring, but also that the terms for this scoring function represent a successful subset selected from a multitude of experimental terms. The default Autodock Vina terms (Trott and Olson, 2010) are already selected from a larger pool of experimentally programmed terms re-exposed in Smina (Koes et al., 2013). They as such represent a good mixture of parameters for the task at hand. As such the selection of parameters could induce an information transfer from the larger data set which the successful re-optimization depended on. Attempts to extend the parameter set with the terms available in Smina were unsuccessful (results not show). This possibly reflects the need for a larger training set for term selection and development. The successful re-optimization of the scoring function with only a small training set suggests that the method could be used for developing target specific weighting of terms. Five different receptor models will give twenty possible cross docking experiments. The heterogeneity and flexibility of the training ligands and receptor models must be taken into consideration as well as the molecular diversity of the ligands whose pose should be predicted. 4.5. Drop in docking scores-affinity correlation The re-optimization protocol didn't use the docking affinity information, but nevertheless a slight correlation between affinity and score is seen as it is illustrated in Fig. 10. The trend lines are shown for a combination of train and test data, and have slopes of 0.73 and 0.43 for the default and re-optimized terms, respectively. However, neither the default nor re-optimized terms have a convincing slope if only the train set is analyzed, here they are 0.06 and 0.09, respectively. The scoring-affinity correlation obtained with the re-optimized parameters is worse than for the default Vina scoring function. It has been found that using the number of heavy atoms or molecular weight of the ligands alone leads to a correlation coefficient of 0.431 and 0.418, respectively (Li et al., 2015) The score-affinity correlation left after optimization of

the docking power is of a similar magnitude 0.43. The scoring algorithm itself possibly reflects this as it is a summation of atom pair interactions (c.f. Eq. (1.1)). The test set contains the molecules with the longest docking times and is predisposed to contain the largest ligands. The residual correlation may thus be reflecting the fact that larger ligands generally have higher affinity. This shows that the re-optimized parameters should not be used for affinity prediction. The optimized parameters do not use the N-rot term related to the number of rotable bonds in the ligand, which could be another possible reason for the drop in affinity-scoring correlation. This term is a parameter determined by the ligand and it is not updated during the docking run and therefore not tunable by the here presented optimization algorithm. Koes et al. (2013) find that optimizing the scoring function towards affinity led to a worsening of the docking power. This is in line with the results presented here, where an optimization of the docking performance led to worse scoring affinity correlations. It has also been noted that VS specific optimizations of scoring functions may be needed for optimizing the screening power, rather than relying on the measured affinity of known binders as a pseudo end-point (Li et al., 2015). The results taken together indicates that it is indeed a better strategy to separate the docking function from the affinity scoring function and adapt a multistage procedure for scoring. First dock the ligands with the parameters shown here, then minimize and re-score with the default Vina parameters or the parameters developed for Smina as also suggested previously (Koes et al., 2013). 4.6. Future perspectives The machine learning methodology presented here seems promising as a way to develop better combinations of terms and weights for the first step of the docking procedure: Predicting the correct pose from a non-native receptor. Efforts should possibly be focused on enlarging the train and test-cross docking database with more receptor-ligand pairs examples and then subsequently search for a new combination of terms directly focused at predicting poses rather than using the affinity as a pseudo target for optimization of the docking function. Development of new terms should also be considered as many of the current terms available in Smina are very similar as revealed by principal components analysis of the individual scores obtained across a large amount of receptor-ligand pairs (results not shown). In this study, only a small training set was used, which makes it possible to hand select and curate the cross docking examples, as well as making it computationally feasible even on readily accessible desktop workstations. The small training and test set however limits the trust in the general performance of the parameters and the re-optimized parameters should be used with caution, but the parameters represent a readily available optimized alternative to the Autodock Vina/Smina default parameters for molecular modelers to test for their specific targets. Thorough benchmarking with target specific cross docking experiments are encouraged if there exist more than one X-ray structure. 5. Conclusion The results presented show that optimization employing a machine learning approach to docking function parameterization using pose prediction as a measure of docking performance is feasible even with a small training data sets. The optimization algorithm outlined was thus able to produce parameters which perform on par with than the default parameters of Smina and Autodock Vina with respect to cross docking performance in two different benchmarks. The method complements current efforts in developing better scoring functions for affinity prediction as a two

E.J. Bjerrum / Computational Biology and Chemistry 62 (2016) 133–144

stage approach to docking and scoring. It will likely be superior to a combined docking and scoring function. The optimization method enables use of new receptor-ligand for training as the method does not need associated ligand affinity data which have previously been widely used as a surrogate endpoint for tuning docking functions. The optimization method is generally applicable and could support development of new parameters for obtaining docking accuracy for other docking programs as well. Comparison of the default term weights with the cross docking optimized weights showed that contribution from repulsion was diminished whereas the attractive terms were enlarged. This was consistent with the structural investigations of selected crossdocking examples where minor steric clashes had be accepted with the non-native receptor to obtain the correct docking pose. This underlines the importance of using cross-docking and non-native receptors when tuning and benchmarking the docking function. The reported scoring functions weights are compatible with Smina and Autodock Vina and represent a ready-to-use alternative to the default scoring function for researchers who aim at prediction of pose rather than affinity. Acknowledgements Nuevolution A/S, Rønnegade 8, 2100 Copenhagen Ø, Denmark is acknowledged for donation of computational resources. The author wants to thank MD, PhD Peter N. Bjerring and Prof. DMSc. Ole J. Bjerrum for helpful comments with the manuscript. Appendix A. Supplementary data Supplementary data associated with this article can be found, in the online version, at http://dx.doi.org/10.1016/j. compbiolchem.2016.04.005. References Ain, Q.U., Aleksandrova, A., Roessler, F.D., Ballester, P.J., 2015. Machine-learning scoring functions to improve structure-based binding affinity prediction and virtual screening WIREs. Comput. Mol. Sci. 5, 405–424. doi:http://dx.doi.org/ 10.1002/wcms.1225. Alhossary, A., Handoko, S.D., Mu, Y., Kwoh, C.-K., 2015. Fast, accurate, and reliable molecular docking with QuickVina 2. Bioinformatics 31, 2214–2216. doi:http:// dx.doi.org/10.1093/bioinformatics/btv082. Andersson, C.D., Thysell, E., Lindström, A., Bylesjö, M., Raubacher, F., Linusson, A., 2007. A multivariate approach to investigate docking parameters’ effects on docking performance. J. Chem. Inf. Model. 47, 1673–1687. doi:http://dx.doi.org/ 10.1021/ci6005596. Ashtawy, H.M., Mahapatra, N.R., 2015. Machine-learning scoring functions for identifying native poses of ligands docked to known and novel proteins. BMC Bioinf. 16 (Suppl. 6), S3. doi:http://dx.doi.org/10.1186/1471-2105-16-S6-S3. Ballester, P.J., Mangold, M., Howard, N.I., Robinson, R.L.M., Abell, C., Blumberger, J., Mitchell, J.B.O., 2012. Hierarchical virtual screening for the discovery of new molecular scaffolds in antibacterial hit identification. J. R. Soc. Interface 9, 3196– 3207. doi:http://dx.doi.org/10.1098/rsif.2012.0569. Bauer, M.R., Ibrahim, T.M., Vogel, S.M., Boeckler, F.M., 2013. Evaluation and optimization of virtual screening workflows with DEKOIS 2.0—a public library of challenging docking benchmark sets. J. Chem. Inf. Model. 53, 1447–1462. doi: http://dx.doi.org/10.1021/ci400115b. Cheng, T., Li, X., Li, Y., Liu, Z., Wang, R., 2009. Comparative assessment of scoring functions on a diverse test set. J. Chem. Inf. Model. 49, 1079–1093. doi:http://dx. doi.org/10.1021/ci9000053. Christofferson, A.J., Huang, N., 2011. How to benchmark methods for structurebased virtual screening of large compound libraries. In: Baron, R. (Ed.), Computational Drug Discovery and Design. Springer Science + Business Media, pp. 187–195. doi:http://dx.doi.org/10.1007/978-1-61779-465-0_13. Ding, B., Wang, J., Li, N., Wang, W., 2013. Characterization of small molecule binding. I. Accurate identification of strong inhibitors in virtual screening. J. Chem. Inf. Model. 53, 114–122. doi:http://dx.doi.org/10.1021/ci300508m. Durrant, J.D., Friedman, A.J., Rogers, K.E., McCammon, J.A., 2013. Comparing neuralnetwork scoring functions and the state of the art: applications to common library screening. J. Chem. Inf. Model. 53, 1726–1735. doi:http://dx.doi.org/ 10.1021/ci400042y. Forli, S., Olson, A.J., 2012. A force field with discrete displaceable waters and desolvation entropy for hydrated ligand docking. J. Med. Chem. 55, 623–638. doi:http://dx.doi.org/10.1021/jm2005145.

143

Fourches, D., Politi, R., Tropsha, A., 2015. Target-specific native/decoy pose classifier improves the accuracy of ligand ranking in the CSAR benchmark. J. Chem. Inf. Model. 55, 63–71. doi:http://dx.doi.org/10.1021/ci500519w. Friesner, R.A., Banks, J.L., Murphy, R.B., Halgren, T.A., Klicic, J.J., Mainz, D.T., Repasky, M.P., Knoll, E.H., Shelley, M., Perry, J.K., Shaw, D.E., Francis, P., Shenkin, P.S., 2004. Glide: a new approach for rapid, accurate docking and scoring. 1. Method and assessment of docking accuracy. J. Med. Chem. 47, 1739–1749. doi:http://dx.doi. org/10.1021/jm0306430. Friesner, R., Murphy, R., Repasky, M., Frye, L., Greenwood, J., Halgren, T., Sanschagrin, P., Mainz, D., 2006. Extra precision glide: docking and scoring incorporating a model of hydrophobic enclosure for protein-ligand complexes. J. Med. Chem. 49, 6177–6196. doi:http://dx.doi.org/10.1021/jm051256o. Gabel, J., Desaphy, Jé, Rognan, D., 2014. Beware of machine learning-based scoring functions—on the danger of developing black boxes. J. Chem. Inf. Model. 54, 2807–2815. doi:http://dx.doi.org/10.1021/ci500406k. Halgren, T.A., Murphy, R.B., Friesner, R.A., Beard, H.S., Frye, L.L., Pollard, W.T., Banks, J. L., 2004. Glide: a new approach for rapid, accurate docking and scoring. 2. Enrichment factors in database screening. J. Med. Chem. 47, 1750–1759. doi: http://dx.doi.org/10.1021/jm030644s. Handoko, S.D., Ouyang, X., Su, C.T.T., Kwoh, C.K., Ong, Y.S., 2012. QuickVina: accelerating AutoDock Vina using gradient-based heuristics for global optimization. IEEE/ACM Trans. Comput. Biol. Bioinf. 9, 1266–1272. doi:http://dx. doi.org/10.1109/TCBB.2012.82. Hartshorn, M.J., Verdonk, M.L., Chessari, G., Brewerton, S.C., Mooij, W.T.M., Mortenson, P.N., Murray, C.W., 2007. Diverse, high-quality test set for the validation of protein-ligand docking performance. J. Med. Chem. 50, 726–741. doi:http://dx.doi.org/10.1021/jm061277y. Hassan, R., Cohanim, B., de Weck, O., Venter, G., 2005. A comparison of particle swarm optimization and the genetic algorithm. Proceedings of the 1st AIAA Multidisciplinary Design Optimization Specialist Conference, American Institute of Aeronautics and Astronautics 1–13. doi:http://dx.doi.org/10.2514/ 6.2005-1897. Hu, L., Benson, M.L., Smith, R.D., Lerner, M.G., Carlson, H.A., 2005. Binding MOAD (Mother of all databases). Proteins 60, 333–340. doi:http://dx.doi.org/10.1002/ prot.20512. Kinnings, S.L., Liu, N., Tonge, P.J., Jackson, R.M., Xie, L., Bourne, P.E., 2011a. A machine learning-based method to improve docking scoring functions and its application to drug repurposing. J. Chem. Inf. Model. 51, 408–419. doi:http://dx. doi.org/10.1021/ci100369f. Kinnings, S.L., Liu, N., Tonge, P.J., Jackson, R.M., Xie, L., Bourne, P.E., 2011b. A machine learning-based method to improve docking scoring functions and its application to drug repurposing. J. Chem. Inf. Model. 51, 408–419. doi:http://dx. doi.org/10.1021/ci100369f. Koes, D.R., Baumgartner, M.P., Camacho, C.J., 2013. Lessons learned in empirical scoring with smina from the CSAR benchmarking exercise. J. Chem. Inf. Model. 53, 1893–1904. doi:http://dx.doi.org/10.1021/ci300604z. Kroshko, D (2013) OpenOpt: Free scientific-engineering software for mathematical modeling and optimization. http://openopt.org (accessed 08.08.15). Landrum G (2015) RDKit: Open-source cheminformatics. http://www.rdkit.org (accessed 27.12.15). Li, L., Wang, B., Meroueh, S.O., 2011. Support vector regression scoring of receptorligand complexes for rank-ordering and virtual screening of chemical libraries. J. Chem. Inf. Model. 51, 2132–2138. doi:http://dx.doi.org/10.1021/ci200078f. Li, Y., Liu, Z., Li, J., Han, L., Liu, J., Zhao, Z., Wang, R., 2014a. Comparative assessment of scoring functions on an updated benchmark: 1. Compilation of the test set. J. Chem. Inf. Model. 54, 1700–1716. doi:http://dx.doi.org/10.1021/ci500080q. Li, Y., Han, L., Liu, Z., Wang, R., 2014b. Comparative assessment of scoring functions on an updated benchmark: 2. Evaluation methods and general results. J. Chem. Inf. Model. 54, 1717–1736. doi:http://dx.doi.org/10.1021/ci500081m. Li, H., Leung, K.-S., Wong, M.-H., Ballester, P.J., 2015. Improving AutoDock Vina using random forest: the growing accuracy of binding affinity prediction by the effective exploitation of larger data sets. Mol. Info. 34, 115–126. doi:http://dx. doi.org/10.1002/minf.201400132. Liu, Y., Zhao, L., Li, W., Zhao, D., Song, M., Yang, Y., 2012. FIPSDock: a new molecular docking technique driven by fully informed swarm optimization algorithm. J. Comput. Chem. 34, 67–75. doi:http://dx.doi.org/10.1002/jcc.23108. Meng, X.-Y., Zhang, H.-X., Mezei, M., Cui, M., 2011. Molecular docking: a powerful approach for structure-based drug discovery. Curr. Comput. Aided Drug Des. 7, 146–157. doi:http://dx.doi.org/10.2174/157340911795677602. Mohri, M., Rostamizadeh, A., Talwalkar, A., 2012. Foundations of Machine Learning. MIT Press, London, England. Morris, G.M., Huey, R., Lindstrom, W., Sanner, M.F., Belew, R.K., Goodsell, D.S., Olson, A.J., 2009. AutoDock4 and AutoDockTools4: automated docking with selective receptor flexibility. J. Comput. Chem. 30, 2785–2791. doi:http://dx.doi.org/ 10.1002/jcc.21256. O’Boyle, N.M., Banck, M., James, C.A., Morley, C., Vandermeersch, T., Hutchison, G.R., 2011. Open babel: an open chemical toolbox. J. Cheminform. 3, 33. doi:http://dx. doi.org/10.1186/1758-2946-3-33. Oda, A., Yamaotsu, N., Hirono, S., Watanabe, Y., Fukuyoshi, S., Takahashi, O., 2014. Effects of initial settings on computational protein–ligand docking accuracies for several docking programs. Mol. Simul. 41, 1027–1034. doi:http://dx.doi.org/ 10.1080/08927022.2014.917300. Plewczynski, D., Ła zniewski, M., Augustyniak, R., Ginalski, K., 2011. Can we trust docking results? Evaluation of seven commonly used programs on PDBbind database. J. Comput. Chem. 32, 742–755. doi:http://dx.doi.org/10.1002/ jcc.21643.

144

E.J. Bjerrum / Computational Biology and Chemistry 62 (2016) 133–144

Schrödinger, L.L.C., 2010. The PyMOL Molecular Graphics System. Version 1.3r1. Stigliani, J.-L., Bernardes-Génisson, V., Bernadou, J., Pratviel, G., 2012. Cross-docking study on InhA inhibitors: a combination of Autodock Vina and PM6DH2 simulations to retrieve bio-active conformations. Org. Biomol. Chem. 10, 6341–6349. doi:http://dx.doi.org/10.1039/c2ob25602a. Trott, O., Olson, A.J., 2010. AutoDock Vina: improving the speed and accuracy of docking with a new scoring function, efficient optimization, and multithreading. J. Comput. Chem. 31, 455–461. doi:http://dx.doi.org/10.1002/ jcc.21334. Vaz, A.I.F., Vicente, L.N., 2007. A particle swarm pattern search method for bound constrained global optimization. J. Glob. Optim. 39, 197–219. doi:http://dx.doi. org/10.1007/s10898-007-9133-5. Verdonk, M.L., Cole, J.C., Hartshorn, M.J., Murray, C.W., Taylor, R.D., 2003. Improved protein-ligand docking using GOLD. Proteins 52, 609–623. doi:http://dx.doi. org/10.1002/prot.10465. Verdonk, M.L., Mortenson, P.N., Hall, R.J., Hartshorn, M.J., Murray, C.W., 2008. Protein-ligand docking against non-native protein conformers. J. Chem. Inf. Model. 48, 2214–2225. doi:http://dx.doi.org/10.1021/ci8002254. Wang, R., Lu, Y., Fang, X., Wang, S., 2004a. An Extensive test of 14scoring functions using the PDBbind refined set of 800 protein-ligand complexes. J. Chem. Inf. Model. 44, 2114–2125. doi:http://dx.doi.org/10.1021/ci049733j. Wang, R., Fang, X., Lu, Y., Wang, S., 2004b. The PDBbind database: collection of binding affinities for protein-ligand complexes with known three-dimensional structures. J. Med. Chem. 47, 2977–2980. doi:http://dx.doi.org/10.1021/ jm030580l.

Wang, R., Fang, X., Lu, Y., Yang, C.-Y., Wang, S., 2005. The PDBbind database: methodologies and updates. J. Med. Chem. 48, 4111–4119. doi:http://dx.doi.org/ 10.1021/jm048957q. Wang, W., He, W., Zhou, X., Chen, X., 2013. Optimization of molecular docking scores with support vector rank regression. Proteins 81, 1386–1398. doi:http://dx.doi. org/10.1002/prot.24282. Warren, G.L., Andrews, C.W., Capelli, A.-M., Clarke, B., LaLonde, J., Lambert, M.H., Lindvall, M., Nevins, N., Semus, S.F., Senger, S., Tedesco, G., Wall, I.D., Woolven, J. M., Peishoff, C.E., Head, M.S., 2006. A critical assessment of docking programs and scoring functions. J. Med. Chem. 49, 5912–5931. doi:http://dx.doi.org/ 10.1021/jm050362n. Xu, W., Lucke, A.J., Fairlie, D.P., 2015. Comparing sixteen scoring functions for predicting biological activities of ligands for protein targets. J. Mol. Graphics Model. 57, 76–88. doi:http://dx.doi.org/10.1016/j.jmgm.2015.01.009. Young, T., Abel, R., Kim, B., Berne, B.J., Friesner, R.A., 2007. Motifs for molecular recognition exploiting hydrophobic enclosure in protein-ligand binding. Proc. Natl. Acad. Sci. U. S. A. 104, 808–813. doi:http://dx.doi.org/10.1073/ pnas.0610202104. Zhang, X., Wong, S.E., Lightstone, F.C., 2013. Message passing interface and multithreading hybrid for parallel molecular docking of large databases on petascale high performance computing machines. J. Comput. Chem. 34, 915– 927. doi:http://dx.doi.org/10.1002/jcc.23214. Zilian, D., Sotriffer, C.A., 2013. SFCscore RF: a random forest-based scoring function for improved affinity prediction of protein-ligand complexes. J. Chem. Inf. Model. 53, 1923–1933. doi:http://dx.doi.org/10.1021/ci400120b.