doi:10.1016/j.jmb.2010.04.047
J. Mol. Biol. (2010) 399, 645–661
Available online at www.sciencedirect.com
Homology-Modelling Protein–Ligand Interactions: Allowing for Ligand-Induced Conformational Change James A. R. Dalton and Richard M. Jackson⁎ Institute of Molecular and Cellular Biology and Astbury Centre for Structural Molecular Biology, University of Leeds, Leeds LS2 9JT, UK Received 10 November 2009; received in revised form 9 March 2010; accepted 23 April 2010 Available online 29 April 2010
Current homology-modelling methods do not consider small molecules in their automated processes. Therefore, the development of a reliable tool for protein–ligand homology modelling is an important next step in generating plausible models for molecular interactions. Two automated protein–ligand homology-modelling strategies, requiring no expert knowledge from the user, are investigated here. Both employ the “induced fit” concept with flexibility in side chains and ligand. The most successful strategy superimposes the new ligand over the original ligand before homology modelling, allowing the new ligand to be taken into consideration during protein modelling (rather than after), facilitating conformational change in the local backbone if necessary. We show that this approach results in successful modelling of the ligand and key binding-site residues of angiotensinconverting enzyme 2 (ACE2) from its homologue ACE, which is not possible via conventional homology modelling or by homology modelling followed by docking. Several other difficult target complexes are also successfully modelled, reproducing native protein–ligand contacts with significantly different biological substrates and different binding-site conformations. These include the modelling of Cdk5 (cyclin-dependent kinase 5) from Cdk2, thymidine phosphorylase from a bacterial homologue, and dihydrofolate reductase from a recombinant variant with a markedly different inhibitor. In terms of average modelling quality across 82 targets, the ligand RMSD with respect to the experimental structure is 1.4 Å (and 2.0 Å for the protein binding site) for “easy” cases and 2.9 Å for the ligand (and 2.7 Å for the protein binding site) in “hard” cases. This demonstrates the importance of selecting an optimal template. Ligand-modelling accuracy is strongly dependent on target–template ligand structural similarity, rather than target–template sequence identity. However, protein-modelling accuracy is dependent on both. Our automated protein–ligand homology-modelling strategy generates a higher degree of accuracy than homology modelling followed by docking, generating an average ligand RMSD that is 1–2 Å better than docking with homology models. © 2010 Elsevier Ltd. All rights reserved.
Edited by B. Honig
Keywords: homology modelling; induced fit; protein–ligand; structural alignment; conformational change
Introduction Homology modelling is the most successful technique for predicting three-dimensional protein structure.1–3 However, conventional homology*Corresponding author. E-mail address:
[email protected]. Abbreviations used: ACE, angiotensin-converting enzyme; PI, Poisson Index; HMG, 3-hydroxy-3-methylglutaric acid; PDB, Protein Data Bank.
modelling methods do not consider drug-like molecules in their automated procedures, with the consequence that residue side chains in contact with a ligand are often modelled inappropriately. Time-consuming manual re-modelling of the binding site or docking is often required to generate sensible side-chain orientations and a correctly positioned ligand, for example,, in the homology modelling of angiotensin-converting enzyme 2 (ACE2) from ACE.4 Previous studies in the area of ligand-supported homology modelling have concentrated on methods that use a combination
0022-2836/$ - see front matter © 2010 Elsevier Ltd. All rights reserved.
646 of homology modelling, docking, and re-modelling with manually imposed restraints5 or knowledgebased potentials.6 However, these methods require expert-user knowledge regarding ligand orientation or mutagenesis data and generally rely on manual intervention to generate accurate results. Other “ligand-centric” modelling methods, collectively referred to as ligand comparative docking, use experimentally determined information from a crystallised protein(s) with a bound ligand(s) to guide the docking of a new ligand into a preexisting binding-site structure. 7–9 However, these methods neglect the protein side of homology modelling and make no attempt to model a new binding site with re-orientated side chains. Therefore, to our knowledge, a fully automatic protein–ligand homology-modelling program, which takes into account conformational change, has yet to be developed. The basis of homology modelling is that two evolutionary related proteins share a common structure. This concept has been further extended, noting that related proteins bind similar ligands.10 Frequently, the ligands bound by related proteins are different, but often share a common core substructure.11 This common core can potentially be utilised to align one ligand with another, in effect allowing an original ligand to be replaced by a new ligand in a new binding-site model. This hypothesis includes the notion that highly conserved residues bind common ligand parts and different residues bind different ligand parts.9,12 As such, different binding-site residues are responsible for different ligand-binding specificities in protein homologues. Using this overall premise, we have constructed an algorithmic framework, capable of protein–ligand homology modelling that brings together new and existing computational routines. This framework is based on Koshland's “induced fit” theory13 where there is conformational change in binding-site side chains and ligand. A rotamer library of multiple side-chain conformations simulates protein sidechain flexibility but also keeps sampling at a manageable level,14 and ligand flexibility is incorporated with the generation of a conformer ensemble. The ligand and side chains in a model binding site are simultaneously refined with a molecular mechanics approach, previously implemented in the optimisation of protein–protein interfaces,15 to find the most energetically favourable conformation. Two similar protein–ligand homology-modelling strategies, each capable of generating a complete binding-site model, have been developed side by side and evaluated with regard to their accuracy at reproducing native ligand/binding-site conformations. These strategies share many of the same computational routines, including 3D-Coffee for target–template sequence alignment,16 Modeller for protein homology modelling,17 and Omega for ligand multi-conformer generation. 18 3D-Coffee uses structural information to guide sequence alignments and has been shown to generate a high level of alignment accuracy between targets and tem-
Homology-Modelling Protein–Ligand Interactions
plates.19 Modeller builds loops accurately19 and can transfer ligands from templates to models as rigid bodies. Omega generates new ligand conformers quickly20 and reproduces bioactive-like conformations in a consistent fashion.21,22 Despite these similarities, the two homology-modelling strategies are significantly different in terms of the order of events whilst building a model. The first strategy, referred to as “hybrid-template modelling”, exchanges new and original ligands prior to homology modelling, with the new ligand fitted into the template binding site. This results in a “hybridised” template, with the new protein binding site subsequently modelled around the new ligand. The second strategy, referred to as “original-template modelling”, uses the unmodified template structure with its original ligand for modelling the new protein. This results in the new binding site containing the original template ligand, which is then replaced by the new ligand during post-modelling refinement. Two critical steps within “hybrid-template modelling” were inspected in greater detail: (i) ligand structural superposition and (ii) side-chain optimisation (i.e., post-homology-modelling refinement). Alternative strategies were implemented at these junctures to determine how modelling accuracy as a whole is affected. With respect to ligand superposition, two different algorithms were utilised: the geometric hashing method LigMatch23,24 and the shape-based matching method Rocs.25 The use of an atomic weighting approach within LigMatch was also investigated. In terms of side-chain refinement, two different rotamer libraries were evaluated to find the more successful approach.14,26 The accuracy of protein–ligand homology modelling was assessed by using two data sets derived from the Astex Diverse Set: 27 an “easy” set containing optimal templates and a “hard” set containing sub-optimal templates. The models created from these sets were compared against their native structures with respect to ligand and protein binding-site conformations. The level of modelling accuracy was also compared against a conventional modelling approach combining homology modelling with docking, using Modeller followed by the docking method Fred.28 Conventional docking does not use any information regarding the orientation of the original ligand and predicts the orientation of a new ligand based on binding-site conformation and a scoring function. As such, the orientations of side chains within a binding site are critical for predicting accurate ligand geometries.29 As a consequence, docking generates accurate results when using highresolution native structures but is much more unreliable when using homology models containing inaccurately modelled side chains.29–31 Docking is performed here using two sets of ligand-free homology models: the first set built with template ligand information and the second set built without it. Homology models generated with ligand information were expected to facilitate better docking
647
Homology-Modelling Protein–Ligand Interactions
results than homology models generated without, as binding-site side chains should be modelled with more favourable orientations. These two sets of models are referred to as “optimal” and “ordinary”, respectively, and represent the best- and worst-case scenarios that can be reasonably expected when docking into the binding sites of homology models.
Results A comparison of two protein–ligand homology-modelling strategies Two strategies for automatic protein–ligand homology modelling were developed and evaluated independently. “Hybrid-template modelling” replaces the original ligand with the new ligand in the template structure prior to homology modelling. After homology modelling, the binding site is refined with a molecular mechanics approach that finds the optimal orientation of side chains and ligand. “Original-template modelling” keeps the original ligand in the process for longer, only replacing it with the new ligand during molecular mechanics optimisation of the binding site, after homology modelling (for details, see Methods). All 82 targets in the easy and hard sets were modelled from their respective templates using the two homologymodelling strategies, with the average results regarding ligand accuracy shown in Table 1. The input for each modelling case consisted of a target sequence, a template structure, and the minimised coordinates of the new ligand. With regard to ligand modelling accuracy, “hybrid-template modelling” generates an average ligand RMSD that is 0.1 and 0.4 Å better than that generated by “original-template modelling” for the easy and hard sets, respectively. As the “hybridtemplate” strategy fits the new ligand into the template binding site, ligand accuracy can be monitored at all stages, that is, from template to homology model to optimised model. Using this approach, it appears that ligand orientation is most accurate in the “hybridised” template, becoming marginally less accurate in the homology model and optimised model, as binding site and ligand are readjusted to accommodate each other (Table 1). As the “original-template” strategy only positions the new ligand into the binding site during post-modelling optimisation, only one reading of ligand accuracy is possible. The quality of ligand modelling generated from the easy set (i.e., with
the “best” templates) using the “hybrid-template” strategy is relatively high, with an average RMSD of 1.4 Å and 79% of models with an RMSD b 2 Å. However, when using the hard set and suboptimal templates, quality is much lower, with an average ligand RMSD of 2.9 Å and 48% of models with a ligand RMSD b 2 Å. Therefore, in terms of ligand modelling accuracy, the difference between using optimal and sub-optimal templates equates to about 1.5 Å. Although the two homology-modelling strategies use the same method for ligand structural superimposition,23,24 the principal difference is that the “hybrid-template” approach makes an assumption about the conformation of the new ligand early in the procedure (guided by structural alignment), modelling the target protein thereafter. However, the “original-template” method models the target protein around the original ligand, with the conformation of the new ligand determined by post-modelling structural superimposition and molecular mechanics. This approach is less accurate, particularly where there is a significant structural difference between target and template as present in the hard set. Since “hybrid-template modelling” proved to be the most successful strategy in terms of ligand accuracy, it was used as the method of choice, with further development targeted in two areas: ligand structural alignment and post-modelling optimisation. Ligand structural alignment The approach used in ligand 3-D superimposition implements the geometric hashing algorithm LigMatch 23,24 for selecting the best match and Omega18 for generating alternative conformations of the new ligand. In this process, the original ligand is static and the new ligand is flexible, with the best conformation of the new ligand determined by highest score of like-atom matches (see Methods). This strategy was compared against two alternative approaches: (i) the shape-based matching algorithm Rocs25 and (ii) a modified version of LigMatch intended to optimise ligand matching. The latter approach involves weighting ligand alignment towards oxygen, nitrogen, and sulphur atoms (called “element-weighted”) and using protein structure to filter out ligand conformations that clash with conserved side chains or co-factors (see Methods). The “hybrid-template” method was used as the basis for this comparison, together with the hard set, which was selected for its lower levels of
Table 1. Average ligand accuracy generated with two protein–ligand homology-modelling strategies Hybrid-template modelling (RMSD, Å)b a
Data set Easy Hard a b
Original-template modelling (RMSD, Å)b
Template
Homology model (unoptimised)
Optimised model
Optimised model
1.35 (±0.15) 2.83 (±0.26)
1.35 (±0.16) 2.88 (±0.26)
1.41 (±0.15) 2.85 (±0.26)
1.49 (± 0.17) 3.23 (± 0.29)
Homology modelling is performed with two data sets: easy and hard, which both contain 82 targets and 82 templates. Ligand accuracy is measured by RMSD (in angstroms) from the native crystal structure.
648
Homology-Modelling Protein–Ligand Interactions
Table 2. Average ligand accuracy generated with different ligand structural superposition methods LigMatch methods (ligand RMSD, Å)b Data seta Hard
Rocs (ligand RMSD, Å)b
Default
Element-weightedc
Element-weighted and protein clashesd
Combo score (default)e
3.17 (±0.33)
2.85 (±0.32)
2.78 (±0.32)
3.60 (±0.38)
a
Protein–ligand homology modelling is performed with the hard set using the “hybrid-template” strategy. Ligand accuracy is measured by RMSD (in angstroms) from the native crystal structure. c A modified version of the LigMatch algorithm is compared against the default. d The “element-weighted” version of LigMatch is used with and without protein structural information (in the form of interactions with conserved side chains and co-factors if present). e A method using the Rocs algorithm is included for comparison. b
ligand similarity and greater ability to discriminate between structural alignment methods. The new ligand conformation was directly compared against the bioactive conformation in the respective crystal structure at the “hybrid-template” stage, that is, before homology modelling. This ensured that ligand structural superimposition was the only determinant of ligand accuracy. The average effect of different ligand alignment approaches on ligand accuracy is summarised in Table 2. The Rocs combo score (the sum of shape Tanimoto and scaled colour score) does not perform as well as LigMatch, whilst the element-weighted version of LigMatch is more accurate than both the default version of LigMatch and Rocs, by an average of 0.3 and 0.8 Å, respectively. Being the most accurate, the element-weighted version of LigMatch was chosen as the primary means of superimposing ligands in three dimensions. This method recognizes the increased importance of oxygen and nitrogen atoms in making intermolecular hydrogen bonds, above the more frequently occurring carbon atoms in drug-like molecules. Ligand 3-D alignment can be further improved by incorporating information from the template protein (Table 2). This is particularly important if more than one conformation of the new ligand possesses the same match score, in which case the “best” is only identifiable by considering the interactions that are made with
conserved side chains. The impact of this additional protein steric information is not dramatic, equating to an increase in accuracy of only 0.1 Å, but it is certainly significant for the few models where ligand superimposition by itself is not enough. An example is shown in Fig. 1, where the optimal alignment of fluvastatin with 3-hydroxy-3-methyl-glutaric acid (HMG) is found in the template binding site of HMG-CoA reductase by considering interactions between the ligand and conserved side chains. Three examples of the improved structural superimposition between ligands of low similarity using element-weighted (as compared with default) LigMatch are shown in Fig. 2. The first shows a match between the inhibitors of ACE and ACE2, which have a 2-D Tanimoto similarity coefficient of 0.37. The default version of LigMatch fails to match the carboxyl group of one ligand to the carbonyl of the other (Fig. 2a). This is corrected by the elementweighted approach that prioritises the matching of oxygen atoms (Fig. 2b) and generates a solution in which the two core carboxyl groups are in the correct orientation (Fig. 2c), albeit in the top ranked solution, one of the side-chain R-groups is incorrectly placed (with an overall ligand RMSD of 3.3 Å). The second example shows a match between two Factor-Xa inhibitors. Again, the ligand 2-D Tanimoto similarity is low at 0.26. The element-weighted superimposition is superior because of weighting
Fig. 1. The template binding site of HMG-CoA reductase with the new ligand (fluvastatin) superimposed over the original ligand (HMG). (a) Five conformations of fluvastatin (in green) have the same match score against the original HMG ligand (in pink with a dotted surface). The conserved side chains and backbone of the protein are coloured blue. (b) The “best” conformation of fluvastatin (in green) matched against HMG (in pink with a dotted surface). This conformation has the fewest number of unfavourable clashes with the surrounding protein.
Homology-Modelling Protein–Ligand Interactions
649
Fig. 2. A comparison of the 3-D match between three pairs of ligands. First row: enalaprilat and MLN-4760 (the respective ligands of ACE and ACE2). Here, the MLN-4760 ligand (in green) is flexible (Omega-generated conformation) and is superimposed over the crystal enalaprilat ligand (in pink). (a) The match generated with default LigMatch. The mismatch in the centre is highlighted in yellow. (b) The match generated with element-weighted LigMatch. (c) The conformation of MLN-4760 from (b) superimposed over its respective crystal structure (in blue). Second row: two FactorXa inhibitors. Here, the “CMB” inhibitor (Omega conformation, in green) is superimposed over the crystal “237” inhibitor (in pink). (d) The match generated with default LigMatch. (e) The match generated with element-weighted LigMatch. (f) The conformation of “CMB” from (e) superimposed over its respective crystal structure (in blue). Third row: two metalloβ-lactamase inhibitors. Here, the “BYS” inhibitor (Omega conformation, in green) is superimposed over the crystal “113” inhibitor (in pink). (g) The match generated with default LigMatch. (H) The match generated with element-weighted LigMatch. (I) The conformation of “BYS” from (H) superimposed over its respective crystal structure (in blue). Oxygen atoms are coloured red. Nitrogen atoms are coloured blue. Phosphorus atoms are coloured beige.
towards nitrogen and oxygen atoms in the ligand core (Fig. 2e). The element-weighted conformation has an RMSD of 1.9 Å (as opposed to 7.4 Å) and captures several of the features of the bioactive conformation (Fig. 2f). The third example shows the superimposition between two metallo-β-lactamase inhibitors, which have a 2-D Tanimoto similarity of 0.23. The default match looks good from a visual point of view but actually orientates the new ligand in the wrong direction (Fig. 2g). The elementweighted match gives a much more accurate solution due to an increased number of oxygen atom matches (Fig. 2h) and results in an RMSD of 2.2 Å (Fig. 2i). Binding-site optimisation The optimisation process used in “hybrid-template modelling” was assessed with respect to the level of accuracy that the method models both the ligand and the surrounding protein atoms of the binding site.
Comparisons were made between the unoptimised and optimised stages in the homology-modelling process. The RMSDs of the binding site (backbone and side chains) and ligand were calculated by superimposing the model over the respective crystal structure. The optimisation process was intended to increase the accuracy of ligand and binding-site side chains simultaneously. The backbone of the protein is not altered during this process because the protein homology-modelling step has already taken place; thus, any improvement or reduction in accuracy of the binding site is due to shifts in the orientation of side chains only. A rotamer library is used to introduce flexibility into side chains and two were compared here: a small library14 and a more recently released larger library.26 Limited flexibility is permitted in the ligand, with its core held rigid and peripheral groups allowed to rotate (see Methods). The “best” conformations of binding-site side chains and ligand are selected using a mean field optimisation with the molecularmechanics-based Multicopy-Gaff approach (see
650
Homology-Modelling Protein–Ligand Interactions
Table 3. Average ligand and protein binding-site accuracy generated with protein–ligand homology modelling Ligand RMSD (Å)b Data seta Easy Hard
Protein binding-site RMSD (Å)b
Unoptimised modelc
Optimised model (small libraryc)
Optimised model (large libraryc)
Unoptimised modelc
Optimised model (small libraryc)
Optimised model (large libraryc)
1.35 (±0.16) 2.88 (±0.26)
1.41 (±0.15) 2.85 (±0.26)
1.39 (±0.15) 2.87 (±0.26)
2.03 (±0.36) 2.63 (±0.36)
2.02 (± 0.34) 2.65 (± 0.36)
2.07 (±0.34) 2.68 (±0.36)
a
Protein–ligand homology modelling is performed with the easy and hard sets using the “hybrid-template” method. Ligand and protein binding-site (side chains and backbone) accuracy is measured by RMSD (in angstroms) from the native crystal structure. c The accuracy of the ligand and protein binding site is shown before and after optimisation with two rotamer libraries (small14 and large26). b
Methods). The results generated by binding-site optimisation of homology models built from the easy and hard sets are summarised in Table 3. Using the easy set, where average ligand and sequence similarities are very high (a 2-D Tanimoto coefficient of 0.70 and a sequence identity of 86%), the accuracy of ligand modelling is higher than protein modelling (Table 3). Establishing the correct ligand orientation (based on ligand superimposition) is clearly easier at this level than establishing
correct side-chain conformations, with the latter involving a higher number of degrees of freedom. Despite this, the accuracy of protein modelling with the easy set is still relatively high, with an average RMSD of 2.0 Å and 82% of binding sites with an RMSD b 2 Å. The situation is reversed with the hard set, where protein-modelling accuracy is better than ligand-modelling accuracy (Table 3). This is probably a reflection of the hard set makeup, where average ligand similarity is much lower than
Fig. 3. A comparison of the model binding site of dihydrofolate reductase (1S3V), bound to a tetrahydroquinazolineantifolate, both before and after binding-site optimisation (modelled from 1HFQ). (a and c) The unoptimised model (in green) superimposed onto the target crystal structure (in pink). (a) The highlighted area shows the modelled ligand and the native ligand. (c) The highlighted areas show the discrepancy of modelled residues Gln35 and Ser59 as compared to the native structure. (b and d) The optimised model (in blue) superimposed onto the target crystal structure (in pink). (b) The highlighted area shows the improved conformation of the modelled ligand as compared to the native. (d) The highlighted area shows the optimisation of modelled residues Gln35 and Ser59 into native-like orientations.
Homology-Modelling Protein–Ligand Interactions
651
Fig. 4. A comparison of the model binding site of thymidine phosphorylase (1UOU) bound to an inhibitor, both before and after binding-site optimisation (modelled from 1BRW). (a) The unoptimised model (in green) superimposed onto the target crystal structure (in pink). The highlighted area shows the modelled ligand and the native ligand: 6-[(2aminopyrrolidin-1-yl)methyl]-5-chloro-1H-pyrimidine-2,4-dione. (b) The optimised model (in blue) superimposed onto the target crystal structure (in pink). The highlighted area shows the improved conformation of the modelled ligand as compared to the native. The orientation of model side chains Ser126 and Lys221 is also improved (shown as “sticks”).
average sequence identity: ligand 2-D Tanimoto score of 0.32 and 69% sequence identity. At lower levels of ligand similarity, it is clear that ligand structural alignment and subsequent modelling are a real challenge. Protein-modelling accuracy is noticeably lower in the hard set than in the easy set, with an average protein binding-site RMSD of 2.7 Å and 60% of binding sites with an RMSD b 2 Å. The optimisation of binding sites does not appear to significantly increase the average accuracy with which protein atoms are modelled (Table 3). Indeed with the hard set, the trend is actually a slight decline in average protein accuracy, with 50% of models undergoing an improvement (or remaining unchanged) in terms of RMSD. The average result only shows a small gain with the easy set, where 49% of modelled protein binding sites are improved (or remain constant) according to RMSD. However, a
minor improvement is seen with ligand orientation in the hard set, where after optimisation, the average result becomes marginally more accurate (Table 3), with 55% of modelled ligands successfully refined according to RMSD. Examples demonstrating successful ligand refinement include the RMSD of an inhibitor in a model of dihydrofolate reductase improving from 3.0 to 1.1 Å (Fig. 3) and that of a modelled inhibitor of thymidine phosphorylase improving from 1.6 to 0.8 Å (Fig. 4). The marginal improvement seen with the hard set is not reproduced with the easy set, where there is a slight loss in average ligand accuracy, with only 45% of ligands successfully refined according to RMSD. In the easy set, it appears that the initial orientation of the new ligand is optimal, and ligand flexibility during optimisation permits small shifts away from the native state due to sidechain interactions. However, when both protein and
Fig. 5. A comparison of the modelled binding site of ACE2 (1R4L), as modelled from ACE (1UZE), both before and after binding-site optimisation. (a) The initial homology model (in green) superimposed onto the crystal structure of ACE2 (in pink). The highlighted area shows the discrepancy of modelled residues Arg273 and Phe274 as compared to the native structure. The modelled and native ligands (MLN-4760) are positioned centrally. (b) The optimised model (in blue) superimposed onto the crystal structure of ACE2 (in pink). The highlighted area shows the optimisation of modelled side chains Arg273 and Phe274 into native-like conformations.
652 ligand refinement are taken together, 71% of models in the hard set (or 68% in the easy set) are improved according to one criteria or the other (i.e., protein or ligand accuracy) after optimisation. Although the lack of improvement in the average accuracy of protein side chains is somewhat disappointing, side-chain optimisation is a necessary step in accounting for the significant changes that sometimes occur between related binding sites. An example is the non-conserved substitution of Gln281 in ACE to Arg273 in ACE2, which have very
Homology-Modelling Protein–Ligand Interactions
different orientations (Fig. 5). The correct side-chain conformation is found during binding-site optimisation (which would otherwise not be the case) with the overall RMSD of the binding site refined from 1.7 to 1.5 Å. Specifically, the movement of Arg273 is concomitant with the movement of adjacent Phe274, which together shift from an initial RMSD of 3.8 to 1.5 Å (3.7 to 1.8 Å for Arg and 4.9 to 1.4 Å for Phe), whilst the rest of the binding site shows no net change. Another example shown here is the adjustment of side chains in a model of dihydrofolate
Fig. 6. Modelling accuracy compared against target–template 2-D ligand similarity and sequence identity. Ligand accuracy generated by “hybrid-template” protein–ligand homology modelling from the hard and easy sets: (a) compared against target–template ligand similarity, (b) compared against target–template sequence identity. Protein binding-site accuracy generated by “hybrid-template” protein–ligand homology modelling from the hard and easy sets: (c) compared against target–template ligand similarity, (d) compared against target–template sequence identity. Ligand accuracy generated by docking (using Fred) as a function of target–template sequence identity with (e) “optimal” ligand-bound homology models from the easy set and (f) ordinary homology models from the easy set. All individual points represented individual models.
653
Homology-Modelling Protein–Ligand Interactions
reductase, which also facilitates ligand conformational change, resulting in a binding-site RMSD of 1.4 Å from an initial RMSD of 1.5 Å (Fig. 3c and d). In this instance, Gln35 shifts from an initial RMSD of 1.1 to 0.5 Å and Ser59 shifts from 2.6 to 1.6 Å, whilst the rest of the binding site undergoes a net change of b 0.1 Å. A third example involves the adjustment of multiple side chains in a model of Cdk5 (cyclindependent kinase 5), resulting in a binding-site RMSD of 2.1 Å (Figure S1, Supplementary Information). In this model, four side chains in different areas of the pocket move from an initial RMSD of 2.0 to 1.7 Å, with the largest change involving Asn135, which shifts from 2.8 to 1.6 Å, whilst the rest of the binding site shows a net change of b0.1 Å. A visual analysis of other case studies suggests that the relatively low rate of beneficial cases in binding-site optimisation (according to global RMSD) is a result of small shifts across all side chains away from their original orientations, swamping the favourable shifts made to non-conserved or functionally important side chains. The larger rotamer library samples more conformational space within the binding sites than the smaller library. However, the larger rotamer library is consistently outperformed by the smaller library in terms of average side-chain refinement (Table 3), suggesting that either the increase in sampling or more likely the energy refinement process has a slightly detrimental effect. The influence of template selection on protein–ligand homology modelling To discover the relationship between modelling accuracy and the two main variables of target– template selection, we plotted (i) 2D ligand similarity and (ii) protein sequence identity against ligand accuracy and protein binding-site accuracy, respectively (Fig. 6a–d). Target–template ligand similarity is the primary influence behind ligand modelling accuracy rather than target–template sequence identity (Fig. 6a and b). Indeed, when ligand 2-D Tanimoto similarity is ≥ 0.5, ligand modelling is, on average, three times more accurate than in situations where ligand similarity is b 0.5 (i.e., an average ligand RMSD of 1.0 Å as opposed to 3.0 Å). However, even when ligand similarity is very high, accurate
modelling is not necessarily guaranteed as some ligands bind differently due to association with cofactors or because of natural pseudo-symmetry. There is only a fairly weak relationship between sequence identity and ligand-modelling accuracy (Fig. 6b). This is surprising since sequence identity is the classic determining factor in selecting a suitable template for protein homology modelling.32 However, it would appear that it is not a sufficient criterion for identifying a suitable template for the homology modelling of protein–ligand interactions. Rather, maximising the similarity between target and template ligands is necessary to generate the best possible model, and achieving this whilst maintaining a high sequence similarity is the ideal scenario. The accuracy of protein atoms in the binding site depends on both target–template ligand similarity and sequence identity (Fig. 6c and d). Ligand similarity has a strong bearing on protein modelling accuracy (Fig. 6c), showing a 2-fold increase in binding-site accuracy when ligand similarity is ≥0.5 as compared to b 0.5 (i.e., an average binding-site RMSD of 1.5 Å as opposed to 2.9 Å). This reflects the likelihood that the more similar the ligands, the more similar the protein–ligand interactions and, hence, the more similar the binding sites. This results in a greater chance of modelling success. Target–template sequence similarity strongly dictates protein bindingsite modelling (Fig. 6d). Protein modelling is three times more accurate when sequence identity is ≥50%, as compared to that when sequence identity is b 50% (i.e., an average binding-site RMSD of 1.5 Å as opposed to 4.6 Å). This is expected and follows the well-established link between sequence identity and protein modelling accuracy.32 However, a high sequence similarity does not always guarantee an accurately modelled binding site, as the conformation of flexible loops can be different depending on which ligand is bound, and in some Protein Data Bank (PDB) structures, these regions can be unresolved, making template selection critically important. Protein–ligand homology modelling versus docking The accuracy of ligand modelling generated in protein–ligand homology modelling was compared
Table 4. Protein–ligand homology modelling compared against docking with homology models Average ligand RMSD (Å)b,c
Data seta Easy a
Average binding-site RMSD (Å)b,d
Hybridtemplate homology models
Originaltemplate homology models
Docking: with optimal modelse
Docking: with ordinary modelse
Hybridtemplate homology models
Originaltemplate homology models
Docking: optimal homology models
Docking: ordinary homology models
1.41 (±0.15)
1.49 (±0.17)
2.44 (±0.24)
3.57 (±0.31)
2.02 (±0.34)
2.11 (±0.38)
2.22 (±0.41)
2.25 (±0.40)
All homology modelling is performed with the easy set. b Ligand and protein binding-site accuracy is measured by RMSD (in angstroms) from the native crystal structure. c The average ligand accuracy generated by “hybrid-template” and “original-template” protein–ligand homology modelling is compared against that generated by docking using (i) ligand-bound optimal homology models and (ii) ordinary homology models. d The average protein binding-site accuracy using “hybrid-template” and “original-template” protein–ligand homology modelling is compared against the homology models generated for docking. e The Fred algorithm is used for docking.
654 against that generated by homology modelling followed by docking. The Fred algorithm,28 combined with Omega for ligand flexibility, was used to dock ligands into homology models generated from the easy set, that is, from the “best” available template. The hard set was not included in this comparison because it was deemed too difficult for conventional docking approaches. Two categories of homology models were used for docking: (i) “ordinary” models built without ligand information, that is, conventional homology models that take no account of a ligand, and (ii) “optimal” models built with a ligand inherited from the template. Models built with ligand information had their bound ligands stripped prior to docking and were expected to contain better side-chain geometries (more suited to docking) than models generated without a ligand present. With “ordinary” homology models, Fred generates an average ligand RMSD of 3.6 Å and 41% of targets with an RMSD b 2 Å, and with “optimal” models, Fred generates an average RMSD of 2.4 Å and 59% of targets with an RMSD b 2 Å (Table 4). This demonstrates the difference between models built with and without ligand information: a N1 Å gap in average ligand accuracy. Interestingly, the average binding-site accuracy of models built with ligand information is only marginally better than models built without ligand information (Table 4), suggesting that docking accuracy is greatly influenced by subtle changes in the orientation of a few select residues, which is not necessarily well expressed by global RMSD (an observation consistent with the “hybrid-template” homology-modelling method). Unlike protein–ligand homology modelling, docking accuracy is highly dependent on target–template sequence similarity, whether using optimal or ordinary homology models (Fig. 6d–f). Therefore, as protein homology-modelling accuracy (including accuracy of side chains) increases with target–template sequence similarity, so does the accuracy of docking. This result reflects the nature of docking methods in general, where success is based primarily on the accuracy of the protein structure.30 In summary, both the “hybrid” and “original” protein–ligand homology-modelling methods generate a higher degree of accuracy than homology modelling followed by docking (Table 4). However, perhaps unsurprisingly, as conventional homology modelling combined with docking is closer in nature to the “original-template” method, the difference in accuracy is less marked than that between the docking and the “hybrid-template” approach. Indeed “hybrid-template” protein–ligand homology modelling results in an average ligand RMSD that is 1 Å better than docking with “optimal” homology models (including 20% more models with an RMSD b 2 Å) and N 2 Å better than docking with “ordinary” homology models (including 38% more models with an RMSD b 2 Å). “Hybrid-template” protein–ligand homology modelling also generates more accurate binding sites, with an average RMSD
Homology-Modelling Protein–Ligand Interactions
that is 0.2 Å less than the average RMSD observed in conventional homology modelling (Table 4). As protein–ligand homology modelling is less reliant on target–template sequence similarity (Fig. 6b), it is possible to reliably model binding sites even when sequence similarity is low, as long as a good structural relationship exists between target and template ligands.
Discussion The development and evaluation of two different automatic protein–ligand homology-modelling strategies has been described. The most successful strategy generates a “hybrid” template by positioning the new ligand into the binding site of the template prior to homology modelling. This allows the backbone of the binding site, in addition to just side chains, to be adjusted for a better fit around the new ligand. In this case, the new ligand influences the homology modelling of the new binding site, potentially facilitating “conformational selection” or an “induced fit”. A likely technical explanation for the higher level of accuracy offered by the “hybridtemplate” approach involves the idea that the optimal ligand conformation can be more reliably found within an experimentally determined binding site, that is, within the template rather than in the modelled binding site. This mirrors the findings of a recent study where docking accuracy was often found to be greater with template structures rather than homology models.29 However, the context here is somewhat different, with protein conformation considered secondary to ligand structural alignment and only conserved residues used to guide ligand orientation, rather than all side chains as in docking. An additional practical advantage of the “hybridtemplate” approach involves giving the ligand only limited flexibility during post-modelling optimisation. This is achieved by keeping the ligand core fixed, whilst only permitting rotational freedom in peripheral groups (see Methods). This focuses the refinement process on more native-like protein– ligand interactions and prevents the ligand from exploring unrealistic poses. A critical step in protein–ligand homology modelling involves the structural superimposition of target and template ligands in three dimensions. Determining the optimal structural alignment between two different ligands is a challenging task, particularly at lower levels of ligand similarity. Therefore, ensuring ligand 2-D similarity is as high as possible between target and template is the best way of ensuring high modelling quality. A minimum ligand similarity of 0.5 is ideal, although this is sometimes not possible even when the best ligand template is selected. A combination of ≥ 0.5 ligand similarity and ≥ 50% sequence identity usually results in a successfully modelled binding site, including correct protein–ligand interactions. However, in some cases, binding sites can be accurately modelled with much lower levels of
655
Homology-Modelling Protein–Ligand Interactions
ligand or protein similarity. Weighting ligand structural alignment towards functionally important oxygen, nitrogen, or sulphur atoms typically improves the quality of the ligand superimposition and modelling and is particularly important where ligand 2-D similarity is b 0.5. Binding-site optimisation is a necessary step in the protein–ligand homology-modelling method described here because Modeller17 is only able to consider the ligand as an excluded volume, without proper modelling. This limitation is particularly relevant in terms of electrostatic protein–ligand interactions, which require atomic partial charges on the ligand in order to be accurately modelled. Currently, only nucleotides and common co-factors (e.g., ATP and metal ions) can be properly modelled by Modeller, and the process of manually developing new parameters for novel ligands is extremely time-consuming and, in our experience of ACE/ ACE2, does not lead to increased ligand modelling accuracy. Regarding binding-site optimisation via molecular mechanics, currently 70% of the models generated here show successful refinement according to ligand or protein binding-site RMSD. However, the overall protein and ligand RMSD averages do not show significant improvement with respect to the unoptimised models. The difficulty arises because many models need very little optimisation or indeed sometimes none (despite the limitations described above) as target–template binding site/ ligand similarity is often high and homology modelling into the unoptimised binding site is successful. The optimisation step can sometimes give the binding site as a whole the opportunity to move away from the native state. However, there is no doubt that for many models, particularly those that require significant main- and side-chain rearrangement relative to the template, optimisation is an absolutely necessary step for the conformation of binding-site side chains and ligand. After close inspection, it is apparent that the majority of closequarter interactions between protein and ligand are indeed improved, but peripheral side chains tend to experience a slight reduction in accuracy. Remedies for this situation may include using alternative rotamer libraries, scaled van der Waals parameters,33 or more selective conformational sampling of binding-site regions. In protein–ligand homology modelling, the accuracy of ligand modelling is fundamentally linked to target–template ligand similarity, rather than sequence similarity. This is because ligand structural alignment becomes more accurate as ligand similarity increases. However, a high sequence similarity can also assist ligand superposition by supplying a greater number of conserved side chains in the binding site and filtering out inappropriate ligand conformations. The accuracy of protein binding-site modelling is strongly dependent on sequence similarity as in conventional homology modelling and also displays a significant dependence on ligand similarity. This can be explained by the ability of the “hybrid-template” method to take into account
“induced fit” or “conformational selection”, with structurally different ligands inducing different changes in binding-site conformation. Overall, this method appears to be very promising, particularly where a suitable template can be identified with high sequence and ligand similarities to the target. In terms of ligand modelling with optimal templates, accuracy is better than docking into homology models by an average margin of 1–2 Å. When using the “best” available template, protein–ligand homology modelling generates a ligand RMSD b 2 Å in 79% of cases, a protein binding-site RMSD b 2 Å in 82% cases, and a combination of both in 67% of cases. We suggest therefore that the next challenge in protein–ligand homology modelling is in the identification of the best possible combination of ligand and protein template.
Methods Overview of two automatic protein–ligand homology-modelling strategies “Hybrid-template” modelling The first method superimposes the new ligand onto the original ligand, which is subsequently replaced in the template structure (steps 4 and 5, Fig. 7). The coordinates of the new ligand are generated by minimisation of the experimental ligand structure (in the absence of the protein) to regularise bond lengths and angles and relieve strain energy. The average RMSD of the minimised ligand coordinates from the experiment is 1.0 Å (ranging between 0.1 and 3.6 Å). The new ligand (in a minimised conformation) is subjected to conformer generation with the program Omega,18 with each newly generated conformation structurally aligned against the experimentally determined original (template) ligand. The conformation with the highest match score (i.e., with the greatest extent of overlap with the original ligand) is selected as the “lead” and superimposed in the binding site of the template structure, replacing the original ligand. Two different algorithms are used to superimpose ligands: LigMatch (geometric hashing)23,24 and Rocs (shapebased).25 Where more than one ligand conformation shares the same highest match score, the interaction energy of each (within the template) is calculated with a molecular mechanics approach called Multicopy-Gaff. This is a new implementation of an algorithm previously used in protein interface refinement,15 which incorporates the Amber-99 force field34 and Gaff force field35 for modelling ligands (see below). The “lead” conformation is then selected based on the lowest non-bonded (electrostatic and van der Waals) interaction energy with conserved side chains and co-factors. Homology modelling is performed with Modeller,17 which builds the new binding site around the “lead” conformation of the new ligand, altering the conformation of the binding site if required. As Modeller is unable to recognise ligand atomic partial charges, a post-homology-modelling step is required to refine the binding site, taking into account electrostatic and van der Waals interactions between protein and ligand. This is performed with MulticopyGaff. Before optimisation proceeds, alternative conformations of binding-site side chains are generated with a rotamer library14,26 and alternative ligand conformations
656
Homology-Modelling Protein–Ligand Interactions
Fig. 7. A flowchart of the “hybrid-template” modelling strategy. are generated from the “lead” conformation of the new ligand. During this process, the core of the ligand (consisting of atoms matched against the original ligand) is kept static whilst the bonds of peripheral atoms are rotated with the distance geometry algorithm Dgeom v1.0.36 Alternative conformations of the new ligand are superimposed onto the “lead” conformation with the LigMatch algorithm, and if there is any clash with the binding-site backbone, they are immediately rejected. Binding-site optimisation proceeds with side-chain rotamers and alternative ligand conformers (a maximum of 500) using a mean field optimisation with Multicopy-Gaff, with the energetically best orientations of each selected for inclusion in the refined model. A flowchart of binding-site optimisation is shown in Fig. 8.
presence of the original ligand), alternative conformations of the new ligand are generated with Omega and superimposed over the original ligand with the LigMatch algorithm (as described above). Unlike in the “hybridtemplate” approach, where alternative conformers of the new ligand are generated twice at separate stages, a full conformational search of the new ligand is performed only once without any restraints. The original ligand is removed and the “best” conformations of the new ligand (with highest match scores) are selected. These candidate ligand conformations, together with alternative bindingsite side-chain rotamers, undergo mean-field refinement with Multicopy-Gaff. This generates a final model with the lowest overall energetic state, containing the single “best” conformation of the new ligand and optimal side-chain orientations.
“Original-template” modelling In the second method, the new binding site is modelled around the original ligand, as present in the template (step 3, Fig. 9). This method does not allow for any ligandinduced changes of the binding-site backbone and only permits changes in the orientation of side chains. As such, it is fairly conservative in nature with no risk of an inappropriately orientated ligand during homology modelling. The “original-template” method is designed to preserve the orientation of conserved binding-site residues, which are transferred from template to homology model. After homology modelling with Modeller (in the
Ligand structural superimposition The LigMatch (or Rocs) algorithm is used to superimpose multiple conformations of the new ligand, generated with Omega, onto the original ligand. This process is the same in both the “hybrid-template” and “original-template” strategies. LigMatch uses a geometric hashing method that initially forms a series of triplets, each consisting of three atoms forming a triangle, for both new and original ligands. All possible triplets between the two ligands are compared,
Homology-Modelling Protein–Ligand Interactions
657
Fig. 8. A representation of binding-site optimisation (with Multicopy-Gaff) as part of the “hybrid-template” modelling method.
Fig. 9. A flowchart of the “original-template” modelling strategy.
658 and those with the same atom types at triangle vertices (i.e., the same element) and similar inter-atomic distances are considered a match. A least-squares fitting routine is used to perform a transformation for each match, resulting in a translational and rotational matrix that maps one triplet onto another.23,24 Each match is given a score corresponding to the number of coincident atoms and the one with the highest “atom–atom score” is deemed the optimal superimposition. The default version of LigMatch regards all heavy atoms as equally important and matches as many as possible regardless of their identity, assigning a score of 1 to each heavy-atom match.23,24 As the structural alignment of two ligands with potentially different sizes and different chemical makeups is somewhat subjective, the possibility of weighting the matching of certain atoms over others is explored with an “element-weighted” version of LigMatch, operating within the “hybrid-template” modelling method. The intention is to improve the alignment of functionally important atoms and in doing so improve ligand structural alignment. The concept of weighting some atoms over others is approached in a heuristic fashion, with an increased reward for atom matches involving oxygen, nitrogen, and sulphur. Oxygen atoms are regularly involved in hydrogen bonding and are often assigned strongly negative AM1-BCC charges.37 Their structural alignment is therefore considered most important, with matches involving oxygen atoms awarded a score of 3. Nitrogens are also involved in hydrogenbonding interactions and sulphur atoms often occupy distinct locations in related ligand molecules (e.g., for metal binding) and can be used to guide structural alignment. Therefore, matches involving nitrogen and sulphur atoms are assigned a score of 2. All other matches between other atom types, such as carbon atoms, receive a score of 1. Rocs v2.3.1 is implemented as an alternative to LigMatch and is a shape-based method that assesses the volume overlap of two ligands using Gaussians parameterised according to the hard-sphere volume of heavy atoms.25 The default Rocs combo score is used here to determine the optimal superimposition between new and original ligands. The combo score is the sum of the shape Tanimoto and the scaled colour score, where the former is a measure of shape similarity and the latter is a measure of chemical similarity. Multicopy-Gaff: A binding-site interaction refinement procedure Multicopy-Gaff is a new implementation of an algorithm previously used in protein–protein interface refinement.15 The method uses the Gaff force field35 so that ligands and co-factors can be simulated within the Amber-99 force field.34 The protein–ligand interaction interface, that is, the binding site, is defined as an area consisting of residues with a Cβ atom within 10 Ǻ of any ligand atom. Side chains within this interface are assigned multicopy rotamers defined by one of two backbone-independent libraries: (i) a relatively small library of 110 rotamers14 or (ii) a larger library of 341 rotamers.26 Side-chain refinement proceeds by finding the “best” rotamers with lowest interaction energies (with all other possible rotamers, protein backbone, and ligand) and the greatest probability of occurrence using a mean-field optimization.15 All non-bonded atomic interactions (i.e., electrostatic and van der Waals interaction energies) are calculated for atoms within 10 Ǻ of each other, with a distance-dependent dielectric of ɛ = 4r.
Homology-Modelling Protein–Ligand Interactions
Ligand refinement is considered an extension of sidechain refinement, with the rotamer concept applied to the ligand. Alternative ligand conformations, generated by Omega (or Dgeom), are treated as extra rotamers, with all co-factor molecules treated as stationary rigid bodies. The “best” ligand conformer is selected based on the mean field optimisation process, which simultaneously considers the interaction energy with all side-chain rotamers, protein backbone, and co-factors. All ligand conformations are assumed to have the same chance of occurrence (i.e., assumed to have same internal energy) and are therefore solely determined by their interaction energies. Details of algorithms employed in protein–ligand homology modelling Modeller17 v9.4 is used for homology modelling and is configured with the following commands: hydrogens on side chains and backbone, recognition of ligand hydrogens, recognition of non-protein molecules (i.e., cofactors), output of one model, randomisation of coordinates turned off, and model optimisation set to high (i.e., very slow). Modeller uses a “satisfaction of spatial restraints” approach for modelling and treats non-protein molecules as uncharged rigid bodies, which are directly transferred from template to model.17 The steric influences of ligands/co-factors are considered during homology modelling with clashes between protein and ligand avoided. If necessary, Modeller reconfigures the orientation of side chains or adjusts the path of the backbone. 3D-Coffee v5.72 is used for target–template sequence alignments and is executed with the integral methods slow_pair and Fugue_pair.16 3D-Coffee implements an additional threading step for sequence-structure alignment, which is combined with the standard T-Coffee library to increase overall alignment accuracy.16 The calculation of atomic partial charges for the new ligand is performed with Antechamber v1.27 using the AM1-BCC method.37,38 Ligand van der Waals parameters are defined by the Gaff force field.35 This involves the generation of maximum and minimum atomic radii for each ligand atom. Molcharge, from the QuacPac v1.1.1 package,39 is used to calculate AM1-BCC charges if Antechamber fails. Ligand protonation states are generated with OpenBabel v2.1.0 (pH-dependent mode).40 In instances where this is not compatible with Antechamber, the protonation method of Molcharge/QuacPac is used as an alternative. Conformer ensembles are generated with Omega18 v2.2.1 in both the “original-template” and “hybrid-template” modelling methods. Omega uses a depth first, rule-based method for creating alternative conformations of small molecules18 and is implemented with the following parameters: a maximum conformer limit of 10,000, “from connectivity table” option set to false, a conformer energy limit of 50 kcal/mol, and an RMSD threshold of 0.2–0.5 Å with a five-rotatable bond increment. Conformer ensembles are additionally generated with Dgeom36 v1.0 in the “hybrid-template” method (step 8, Fig. 7). Dgeom is a distance geometry method and is used with the following parameters: a maximum limit of 5000 conformers, “constraints” turned on, and an RMSD threshold of 0.2 Å. Development of easy and hard data sets for protein–ligand homology modelling The Astex Diverse Dataset of 85 high-resolution protein– ligand complexes, each containing a bound drug-like
659
Homology-Modelling Protein–Ligand Interactions
molecule, is commonly used to test conventional docking methods.27 The binding sites within this set were chosen as targets for protein–ligand homology modelling because they provide a wide range of different structures and different ligands. It also allows for a fair comparison between protein–ligand homology modelling and docking. Two modelling data sets were created from the Astex Diverse Set,27 each containing the same set of targets but a different set of templates. There are potentially two differences between target and template: a difference in protein sequence and/or a difference in bound ligand. In all cases, a single target is modelled from a single template. The first data set, referred to as the easy set, contains templates that have a high level of sequence similarity, binding-site structural similarity, and ligand structural similarity, with the target. This is the least demanding scenario and shows the level of success that can be achieved using protein–ligand homology modelling. The second data set, referred to as the hard set, contains templates of lower sequence and structural similarity and, critically, lower ligand structural similarity. This data set is intended to make modelling as challenging as possible, whilst still retaining some structural similarity in binding sites. Templates were selected for the easy set or hard set based on three criteria in relation to the target: sequence similarity, ligand similarity, and binding-site structural similarity. The latter was the starting point for template selection. Binding sites with a similar structure to the target crystal structure were identified with SitesBase,41 which assesses binding-site structural similarity with the Poisson Index (PI).42 This is a probabilistic score that refers to the size of two sites and number of matching atoms; it measures physical similarity and also indicates whether the match could occur by chance. Experience has shown that a score of ≤ 1 × 10− 6 is indicative of reliable and meaningful similarity.42 Potential templates for the hard set were initially identified using PI of ≤ 1 × 10− 6 as a threshold, with selected binding sites as close as possible to this score (but if possible, no worse). Potential templates for the easy set were selected based on a significantly higher PI score, which were often but not always the top hit generated by SitesBase. Furthermore, all templates in both data sets had to belong to the same structural classification of proteins superfamily as the respective target. Once potential templates for each target in each data set had been identified, other criteria were introduced to select the most appropriate single template. Where possible, templates with a sequence identity ≥ 70% and ligand 2-D similarity (Tanimoto coefficient) ≥ 0.5 were selected for the easy set, and templates with a sequence identity b 70% and a ligand similarity b 0.5 were selected for the hard set. When more than one template fulfilled the criteria for the hard set, the template with the greatest perceived level of difficulty was chosen. This situation was reversed for the easy set; that is, the template judged to be easiest was selected. At the time of development, no appropriate template could be found for three structures in the Astex Diverse Set: 1HWW, 1V4S, and 1YV3. This applied to both the easy set and the hard set, and as a result, these targets were removed. This left 82 targets to be modelled twice—once with an “easy” template and once with a “hard” template, giving a total of 164 modelling instances. All template structures and native target structures were obtained from the PDB. Three-dimensional ligand coordinates for modelling the respective targets were extracted directly from native structures in the PDB.
Native ligands were minimised in the MMFF force field43 with Szybki v1.2.2 (minus protein structures).44 Assessing the accuracy of protein–ligand homology modelling Model accuracy is assessed by RMSD from the native structure.27 The model binding site is superimposed onto the corresponding native binding site with LigMatch, which can also superimpose protein structures using all atoms.23,24 Binding sites are defined as substructures consisting of residues whose Cβ atom is within 10 Å of any ligand atom. After structurally aligning the native structure and model, we calculated the RMSD for both ligand and binding site (backbone and side chains) with Vrms v1.01, which takes molecular symmetry into account.45 The same process is used to measure the ligand RMSD between “hybrid templates” and native target structures. This approach only applies to the “hybrid-template” method and allows for the comparison of different ligand 3-D superimposition methods, prior to homology modelling. A docking methodology for comparison with protein–ligand homology modelling The “hybrid-template” homology-modelling method was compared against homology modelling followed by docking. Fred28 v2.2.5 was used to dock ligands into the ligand-free binding sites of homology models generated from the easy set (i.e., the same targets, templates, and ligands used in protein–ligand homology modelling). Prior to docking, two sets of homology models were built: one set from templates containing their original ligands and a second set from the same templates without ligands. Homology models were built with Modeller using the default “automodel.very_fast” function and the “hetero-atoms” option set to true. Sequence alignments were generated with 3D-Coffee. Any ligands that were incorporated into models were removed prior to docking. In order to simulate ligand flexibility, Fred requires the pre-generation of a ligand conformer library, which was generated by Omega using the same parameters as in protein–ligand homology modelling. By default, Fred defines the binding site as a box of 4 Å extension in all directions from a bound ligand. Receptor boxes were therefore calculated from models possessing a bound ligand. For models built without a ligand, the receptor box was transferred from the appropriate ligand-bound model. Fred was used with default settings and implements the Chemgauss3 scoring system.28 The accuracy of docked ligands is measured in terms of RMSD, using the same approach as for models generated by protein–ligand homology modelling. Software availability All the software used in this study is freely available under an academic license except for Multicopy-GAFF, LigMatch, and Dgeom. The latter is available from the Quantum Chemistry Program Exchange for a nominal fee. A version of the complete protein-ligand homologymodelling method described here using Multicopy-Gaff and LigMatch has been made available in a freely accessible web server called SitesModel†. † http://www.modelling.leeds.ac.uk/sitesmodel/index. html
660
Acknowledgements We thank all the authors of the programs used in this study for their help with implementation. We also thank the Biotechnology and Biological Sciences Research Council for funding in the form of a studentship for J.A.R.D.
Supplementary Data Supplementary data associated with this article can be found, in the online version, at doi:10.1016/ j.jmb.2010.04.047
References 1. Martin, A. C., MacArthur, M. W. & Thornton, J. M. (1997). Assessment of comparative modeling in CASP2. Proteins, 29, 14–28. 2. Moult, J. (2005). A decade of CASP: progress, bottlenecks and prognosis in protein structure prediction. Curr. Opin. Struct. Biol. 15, 285–289. 3. Moult, J., Fidelis, K., Kryshtafovych, A., Rost, B., Hubbard, T. & Tramontano, A. (2007). Critical assessment of methods of protein structure prediction—round VII. Proteins, 69, 3–9. 4. Guy, J. L., Jackson, R. M., Acharya, K. R., Sturrock, E. D., Hooper, N. M. & Turner, A. J. (2003). Angiotensinconverting enzyme-2 (ACE2): comparative modelling of the active site, specificity requirements, and chloride dependence. Biochemistry, 42, 13185–13192. 5. Kneissl, B., Leonhardt, B., Hildebrandt, A. & Tautermann, C. S. (2009). Revisiting automated Gprotein coupled receptor modeling: the benefit of additional template structures for a neurokinin-1 receptor model. J. Med. Chem. 52, 3166–3173. 6. Evers, A., Gohlke, H. & Klebe, G. (2003). Ligandsupported homology modelling of protein bindingsites using knowledge-based potentials. J. Mol. Biol. 334, 327–345. 7. Hare, B. J., Walters, W. P., Caron, P. R. & Bemis, G. W. (2004). CORES: an automated method for generating three-dimensional models of protein/ligand complexes. J. Med. Chem. 47, 4731–4740. 8. Martin, L., Catherinot, V. & Labesse, G. (2006). KinDOCK: a tool for comparative docking of protein kinase ligands. Nucleic Acids Res. 34, W325–329. 9. Brylinski, M. & Skolnick, J. (2009). FINDSITE: a threading-based approach to ligand homology modeling. PLoS Comput. Biol. 5, e1000405. 10. Mitchell, J. B. (2001). The relationship between the sequence identities of alpha helical proteins in the PDB and the molecular similarities of their ligands. J. Chem. Inf. Comput. Sci. 41, 1617–1622. 11. Nobeli, I., Spriggs, R. V., George, R. A. & Thornton, J. M. (2005). A ligand-centric analysis of the diversity and evolution of protein–ligand relationships in E. coli. J. Mol. Biol. 347, 415–436. 12. Kalinina, O. V., Gelfand, M. S. & Russell, R. B. (2009). Combining specificity determining and conserved residues improves functional site prediction. BMC Bioinformatics, 10, 174. 13. Koshland, D. E. (1958). Application of a theory of enzyme specificity to protein synthesis. Proc. Natl Acad. Sci. USA, 44, 98–104.
Homology-Modelling Protein–Ligand Interactions
14. Tuffery, P., Etchebest, C., Hazout, S. & Lavery, R. (1991). A new approach to the rapid determination of protein side chain conformations. J. Biomol. Struct. Dyn. 8, 1267–1289. 15. Jackson, R. M., Gabb, H. A. & Sternberg, M. J. (1998). Rapid refinement of protein interfaces incorporating solvation: application to the docking problem. J. Mol. Biol. 276, 265–285. 16. O'Sullivan, O., Suhre, K., Abergel, C., Higgins, D. G. & Notredame, C. (2004). 3DCoffee: combining protein sequences and structures within multiple sequence alignments. J. Mol. Biol. 340, 385–395. 17. Sali, A. & Blundell, T. L. (1993). Comparative modelling by satisfaction of spatial restraints. J. Mol. Biol. 234, 779–815. 18. Omega, OpenEye Science Software, 9 Bisbee Court, Suite D, Santa Fe, NM 87508. 19. Dalton, J. A. & Jackson, R. M. (2007). An evaluation of automated homology modelling methods at low target template sequence similarity. Bioinformatics, 23, 1901–1908. 20. Loferer, M. J., Kolossváry, I. & Aszódi, A. (2007). Analyzing the performance of conformational search programs on compound databases. J. Mol. Graphics Modell. 25, 700–710. 21. Boström, J. (2001). Reproducing the conformations of protein-bound ligands: a critical evaluation of several popular conformational searching tools. J. Comput.Aided Mol. Des. 15, 1137–1152. 22. Boström, J., Greenwood, J. R. & Gottfries, J. (2003). Assessing the performance of Omega with respect to retrieving bioactive conformations. J. Mol. Graphics Modell. 21, 449–462. 23. Brakoulias, A. & Jackson, R. M. (2004). Towards a structural classification of phosphate binding sites in protein-nucleotide complexes: an automated allagainst-all structural comparison using geometric matching. Proteins, 56, 250–260. 24. Kinnings, S. L. & Jackson, R. M. (2009). LigMatch: a multiple structure-based ligand matching method for 3D virtual screening. J. Chem. Inf. Model. 49, 2056–2066. 25. Nicholls, A., MacCuish, N. E. & MacCuish, J. D. (2004). Variable selection and model validation of 2D and 3D molecular descriptors. J. Comput.-Aided Mol. Des. 18, 451–474. 26. Dunbrack, R. L. (2002). Rotamer libraries in the 21st century. Curr. Opin. Struct. Biol. 12, 431–440. 27. Hartshorn, M. J., Verdonk, M. L., Chessari, G., Brewerton, S. C., Mooij, W. T. M., Mortenson, P. N. & Murray, C. W. (2007). Diverse, high-quality test set for the validation of protein–ligand docking performance. J. Med. Chem. 50, 726–741. 28. McGann, M. R., Almond, H. R., Nicholls, A., Grant, J. A. & Brown, F. K. (2003). Gaussian docking functions. Biopolymers, 68, 76–90. 29. Kairys, V., Fernandes, M. X. & Gilson, M. K. (2006). Screening drug-like compounds by docking to homology models: a systematic study. J. Chem. Inf. Model. 46, 365–379. 30. McGovern, S. L. & Shoichet, B. K. (2003). Information decay in molecular docking screens against holo, apo, and modeled conformations of enzymes. J. Med. Chem. 46, 2895–2907. 31. DeWeese-Scott, C. & Moult, J. (2004). Molecular modeling of protein function regions. Proteins, 55, 942–961. 32. Kopp, J. & Schwede, T. (2004). Automated protein structure homology modeling: a progress report. Pharmacogenomics, 5, 405–416.
661
Homology-Modelling Protein–Ligand Interactions
33. Grigoryan, G., Ochoa, A. & Keating, A. E. (2007). Computing van der Waals energies in the context of the rotamer approximation. Proteins, 68, 863–878. 34. Cornell, W. D., Cieplak, P., Bayly, C. I., Gould, I. R., Merz, K. M., Ferguson, D. M. et al. (1995). A second generation force field for the simulation of proteins, nucleic acids, and organic molecules. J. Am. Chem. Soc. 117, 5179–5197. 35. Wang, J., Wolf, R. M., Caldwell, J. W., Kollman, P. A. & Case, D. A. (2004). Development and testing of a general amber force field. J. Comput. Chem. 25, 1157–1174. 36. Spellmeyer, D. C., Wong, A. K., Bower, M. J. & Blaney, J. M. (1997). Conformational analysis using distance geometry methods. J. Mol. Graphics Modell. 15, 18–36. 37. Jakalian, A., Bush, B. L., Jack, D. B. & Bayly, C. I. (2000). Fast, efficient generation of high-quality atomic charges. AM1-BCC model: I. Method. J. Comput. Chem. 21, 132–146. 38. Wang, J., Wang, W., Kollman, P. A. & Case, D. A. (2006). Automatic atom type and bond type percep-
39. 40. 41. 42.
43.
44. 45.
tion in molecular mechanical calculations. J. Mol. Graphics Modell. 25, 247–260. QuacPac, OpenEye Science Software, 9 Bisbee Court, Suite D, Santa Fe, NM 87508. OpenBabel, http://openbabel.org/wiki/Main_Page. Gold, N. D. & Jackson, R. M. (2006). SitesBase: a database for structure-based protein–ligand binding site comparisons. Nucleic Acids Res. 34, D231–234. Davies, J. R., Jackson, R. M., Mardia, K. V. & Taylor, C. C. (2007). The Poisson Index: a new probabilistic model for protein–ligand binding site similarity. Bioinformatics, 23, 3001–3008. Halgren, T. A. (1992). The representation of van der Waals (vdW) interactions in molecular mechanics force fields: potential form, combination rules and vdW parameters. J. Am. Chem. Soc. 114, 7827–7843. Szybki, OpenEye Science Software, 9 Bisbee Court, Suite D, Santa Fe, NM 87508. Chen, W., Huang, J. & Gilson, M. K. (2004). Identification of symmetries in molecules and complexes. J. Chem. Inf. Comput. Sci. 44, 1301–1313.