Structure
Article Hydrophobic Potential of Mean Force as a Solvation Function for Protein Structure Prediction Matthew S. Lin,1 Nicolas Lux Fawzi,1 and Teresa Head-Gordon1,2,* 1
UCSF/UCB Joint Graduate Group in Bioengineering, Berkeley, CA 94720, USA Department of Bioengineering, University of California, Berkeley, CA 94720, USA *Correspondence:
[email protected] DOI 10.1016/j.str.2007.05.004
2
SUMMARY
We have developed a solvation function that combines a Generalized Born model for polarization of protein charge by the high dielectric solvent, with a hydrophobic potential of mean force (HPMF) as a model for hydrophobic interaction, to aid in the discrimination of native structures from other misfolded states in protein structure prediction. We find that our energy function outperforms other reported scoring functions in terms of correct native ranking for 91% of proteins and low Z scores for a variety of decoy sets, including the challenging Rosetta decoys. This work shows that the stabilizing effect of hydrophobic exposure to aqueous solvent that defines the HPMF hydration physics is an apparent improvement over solventaccessible surface area models that penalize hydrophobic exposure. Decoys generated by thermal sampling around the native-state basin reveal a potentially important role for side-chain entropy in the future development of even more accurate free energy surfaces. INTRODUCTION The protein structure prediction problem is to predict the three-dimensional structure of the native state of a protein given its sequence of amino acids. Our group has used global optimization as a promising approach to its solution because it is believed that in most cases the native state corresponds to the global, or very low-lying, free energy minimum (Anfinsen, 1973). Global optimization is required because the energy landscape of a realistic-sized protein has thousands of parameters and an enormous number of local minima that are potential false traps for native structure (Sugita and Okamoto, 1999; Vasquez et al., 1994). However, without a quantitative description of the free energy function describing both the proteins’ intramolecular forces and the intermolecular interactions with aqueous solvent, global optimization as a solution to the structure prediction problem would be useless. We must search
on a free energy surface where the global optimum, or very low-lying free energy minimum, has functional relevance. As is usual for global optimization approaches, the question is whether the sampling is sufficient or whether the energy or scoring function is adequate for discriminating native structure from misfolded states. Atomic nonpolarizable empirical force fields such as AMBER (Cornell et al., 1995; Hornak et al., 2006; Wang et al., 2000; Weiner et al., 1984, 1986), CHARMM (Brooks et al., 1983; MacKerell et al., 1998), GROMOS (Scott et al., 1999), ECEPP (Momany et al., 1975; Nemethy et al., 1983; Sippl et al., 1984), and OPLS (Jorgensen et al., 1996; Kaminski et al., 2001) are fairly simple mathematical functional forms that are reasonable and physically motivated approximations of the potential energy surface for proteins. Empirical protein force fields such as these have been shown to be effective in both protein folding and protein structure prediction (Ponder and Case, 2003). An important demonstration of their effectiveness is the study of native folds and misfolds of the protein sequences of hemerythrin, a predominantly alpha protein, and immunoglobulin VL domain, a predominantly b sheet structure. When the sequences of the two were threaded through the other tertiary structure, the gas phase energy values (i.e., no protein-solvent interactions) for the native folds were comparable to their misfolds. However, the addition of a simple solvation description raised the energy of the misfolded structures, demonstrating that these protein force fields combined with an adequate solvent model can perform effectively in discriminating between correct folds and misfolds (Novotny et al., 1984). Modeling the interaction of a protein with aqueous solvent can be accomplished with any of the independently developed liquid water force fields, such as the TIP (Jorgensen et al., 1983; Jorgensen and Madura, 1985; Mahoney and Jorgensen, 2000) or SPC (Berendsen et al., 1981) series of models, and with more recent versions that are parameterized for long-range electrostatic interactions, such as TIP4P-Ew (Horn et al., 2004) and TIP5P-Ew (Rick, 2004), as well as with polarizable water models, such as TIP4P-pol2 (Chen et al., 2000), TIP4PFQ (Rick, 2001; Rick et al., 1994), and SPC-FQ (Rick et al., 1994). However, the use of an all-atom water potential is computationally expensive in the context of global optimization and is possibly not needed in structure
Structure 15, 727–740, June 2007 ª2007 Elsevier Ltd All rights reserved 727
Structure Hydrophobic PMF for Protein Structure Prediction
prediction if the physics of hydration can be adequately described by coarse-grained descriptions. For example the Generalized Born (GB) (Constanciel and Contreras, 1984; Hawkins et al., 1995, 1996; Onufriev et al., 2000, 2004; Still et al., 1990) model represents the electrostatic polarization free energy as the interaction between the protein’s charge distribution enclosed in a low dielectric region and the reaction potential it induces in the surrounding high dielectric solvent. The GB description integrates out the degrees of freedom of explicit water and replaces them with an effective high dielectric constant. The GB treatment has been reasonably successful in both protein folding and protein structure prediction (Dominy and Brooks, 2002; Feig and Brooks, 2002; Felts et al., 2002; Garcia and Sanbonmatsu, 2002; Mitsutake et al., 2001; Pande and Rokhsar, 1999; Simmerling et al., 2002; van der Vaart et al., 2000; Vorobjev et al., 1998). A popular coarse-grained model of aqueous solvation involves combining the GB electrostatics model with a term proportional to the solvent-accessible surface area (SASA) (Chothia, 1974; Eisenberg and McLachlan, 1986; Lee and Richards, 1971; Sharp et al., 1991; Shrake and Rupley, 1973; Spolar et al., 1989; Swanson et al., 2004) to approximate the nonpolar interactions induced by the aqueous solvent. It is evident from theoretical considerations that the SASA model is incapable of accurately describing nonpolar solvation forces at molecular-length scales on the order of a water molecule diameter, instead being more useful at ‘‘large’’ length-scales when more extended hydrophobic surfaces interface with the water solvent (Chandler, 2005; ten Wolde, 2002). Thus, alternative models to SASA are needed to reproduce hydration phenomena on the molecular scale. For example, the hydrophobic effect (the thermodynamics of transfer of hydrophobic solutes from organic solvents or vacuum into water) has been modeled in one case by supplementing the SASA model with additional volume and dispersion interactions (Wagoner and Baker, 2006), while Galliccho and Levy discard the SASA model altogether in favor of a nonpolar estimator to the work of cavity formation and dispersion interactions between the solute atoms and the solvent (Gallicchio and Levy, 2004). Using both wide-angle and small-angle solution scattering experiments and computer simulation (Head-Gordon et al., 1997; Hura et al., 1999; Pertsemlidis et al., 1996, 1999; Sorenson et al., 1999), we have studied hydrophobic interactions between model hydrophobic peptides such as N-acetyl-leucine-methylamide from low to very high concentrations (0.1–2.0 M). That these solutions never phase separate at the highest concentrations studied suggested to us that small length-scale hydration physics (Pratt and Chandler, 1977) is operative for the folding and stabilization of globular proteins (Sorenson et al., 1999). As such, the influence of water on the free energy of association of two small hydrophobic groups in water exhibits a potential of mean force with two minima separated by a barrier: one for the hydrophobic molecules in contact and one for the hydrophobic groups separated
Figure 1. Hydrophobic Potential of Mean Force The sum of HPMF and Lennard-Jones energy is plotted against Rij, the distance between a pair of aliphatic and aromatic carbons.
by a water layer (Figure 1). This model of the hydrophobic interaction was introduced by our group in application to protein structure prediction (Crivelli et al., 2002) to replace SASA models. Its novelty in the context of structure prediction and protein energetics is the many-body effect of water’s influence on protein hydrophobic group interactions, and a desolvation barrier to demixing that is not described by SASA models. In this work we show that the replacement of the SASA description with a hydrophobic potential of mean force (HPMF) model allows better discrimination in favor of native states with a wide variety of small globular proteins. In total, our potential energy function to describe the protein and aqueous solvent free energy surface combines the AMBER99 protein force field (Wang et al., 2000) (VProtein), the Generalized Born (GB) description of the electrostatic component of solvent free energy (VGB) developed by Case and coworkers (Onufriev et al., 2000), and the hydrophobic potential of mean force to describe the solute-solute interaction induced by water (VHPMF): V = VProtein + VGB + VHPMF :
(1)
The performance of AMBER99 combined with the GB/ HPMF solvation model is then evaluated on a large selection of databases of so-called ‘‘decoy’’ structures that allow us to assess its ability to discriminate against misfolded structures in favor of the native state. We report tests of our energy function on 106 proteins using 7 publicly available decoy sets of greatly varying size and difficulty (Holm and Sander, 1992; Keasar and Levitt, 2003; Park and Levitt, 1996; Samudrala and Levitt, 2000; Simons et al., 1997; Tsai et al., 2003). We find that our energy function outperforms other published energy functions in terms of native ranking and native Z score on 91% of proteins, and does exceptionally well against the hardest Rosetta decoy sets, an outcome we attribute to the success of the HPMF model in particular. We also consider challenging our free energy description with nativelike decoys generated from finite temperature molecular
728 Structure 15, 727–740, June 2007 ª2007 Elsevier Ltd All rights reserved
Structure Hydrophobic PMF for Protein Structure Prediction
Table 1. EMBL Decoy Set AMBER99+GB
AMBER99+GB+HPMF
Protein
Decoy
Number of Residues
Decoy Rmsd (A˚)
1BP2
2PAZ
123
15.38
42.257
0.344
DE per Residue
DE
DE per Residue
DE 178.378
1.450
1CBH
1PPT
36
9.87
43.240
1.201
48.947
1.360
1FDX
5RXN
54
8.67
11.939
0.221
43.470
0.805
1HIP
2B5C
85
13.84
116.394
1.369
212.696
2.502
1LH1
2I1B
153
17.49
394.606
2.579
285.355
1.865
1P2P
1RN3
124
18.31
113.640
0.916
173.945
1.403
1PPT
1CBH
36
9.76
6.835
0.190
58.543
1.626
1REI
5PAD
212
18.08
278.791
1.315
60.054
0.283
1RHD
2CYP
293
21.56
534.787
1.825
417.677
1.426
1RN3
1P2P
124
18.35
126.357
1.019
147.419
1.189
1SN3
2CI2
65
12.48
112.805
1.735
143.908
2.214
1SN3
2CRO
65
10.77
59.923
0.922
118.553
1.824
2B5C
1HIP
85
14.60
33.580
0.395
2.167
0.025
2CDV
2SSI
107
14.14
140.142
1.310
110.474
1.032
2CI2
1SN3
65
24.93
109.698
1.688
143.420
2.206
2CI2
2CRO
65
23.45
40.205
0.619
108.143
1.664
2CRO
1SN3
65
10.71
192.941
2.968
235.277
3.620
2CRO
2CI2
65
10.66
173.368
2.667
187.076
2.878
2CYP
1RHD
293
21.42
2167.252
7.397
2456.140
8.383
2I1B
1LH1
153
17.52
101.107
0.661
262.069
1.713
2PAZ
1BP2
123
15.35
212.318
1.726
225.312
1.832
2SSI
2CDV
107
15.38
131.374
1.228
183.579
1.716
2TMN
2TS1
316
22.44
689.522
2.182
633.548
2.005
2TS1
2TMN
317
22.75
53.578
0.169
250.015
0.789
5PAD
1REI
212
18.07
514.832
2.428
593.937
2.802
There are 25 proteins whose native states have been determined by both X-ray and NMR. There is only one decoy per protein with no sequence length variations. DE is the difference in energy between the native and the decoy.
dynamics to sample states around the native state for 4PTI at four different temperatures. The thermal decoys reveal a potentially important role for side-chain entropy as a direction for future development of even more accurate free energy surfaces. RESULTS The results for all proteins used to evaluate our energy function are given in Tables 1–7 for the EMBL (Holm and Sander, 1992), 4-state-reduced (Park and Levitt, 1996; Samudrala and Levitt, 2000), LMDS (Keasar and Levitt, 2003; Samudrala and Levitt, 2000), fisa (Samudrala and Levitt, 2000; Simons et al., 1997), fisa-casp3 (Samudrala and Levitt, 2000; Simons et al., 1997), lattice (Samudrala and Levitt, 2000), and the Rosetta decoy sets (Simons et al., 1997; Tsai et al., 2003). We do not include 1BBA in the LMDS set since it does not have a well-defined na-
tive state. We also do not include protein decoy sets with heme prosthetic groups (hg_structural [Samudrala and Levitt, 2000]) or sequence length variations of greater than 10% with respect to the native structure (MMPBSA [Lee et al., 2001; Simons et al., 1997] and some parts of the Rosetta decoy sets) as part of our averages, although we do report our results on most of these decoys in Table 8 and Table 9. We have divided the Rosetta decoys developed in Simons et al. (1999) and Tsai et al. (2003) into two categories: (1) restricting the decoy set to X-ray-determined native structures and for %10% variation in sequence length (Rosetta-1) and (2) restricting the decoy set to native structures determined by NMR in which multiple models are reported and for %10% variation in sequence length (Rosetta-2). Restrictions on sequence length are necessary because it is ambiguous whether performance on decoys with >10% variation in sequence length is
Structure 15, 727–740, June 2007 ª2007 Elsevier Ltd All rights reserved 729
Structure Hydrophobic PMF for Protein Structure Prediction
Table 2. 4-State-Reduced Decoy Set AMBER99+GB
AMBER99+GB+HPMF
Protein
Fold Class
Native-State Resolution (A˚)
Number of Residues
Native Rank
Native Z Score
Native Rank
Native Z Score
1CTF
a/b
1.70
68
1
3.287
1
3.798
1R69
a
2.00
63
1
3.616
1
3.854
1SN3
a/b
1.20
65
1
3.151
1
5.029
2CRO
a
2.35
65
1
3.370
1
3.684
3ICB
a
2.30
75
1
2.073
2
2.129
4PTI
a/b
1.50
58
1
3.472
1
4.244
4RXN
a/b
1.20
54
1
4.042
1
4.005
There are seven proteins whose native states have been determined by X-ray crystallography. There is an average of 700 decoys per protein with no sequence length variations, and an rmsd range of 0.88–9.39 A˚.
a problem of the energy function or missing interactions necessary to stabilize the native fold. We find that discriminating for native protein structures solved by X-ray crystallography is ‘‘easier,’’ while the ensemble of native structures that are derived from the NMR refinement of solution measurements make native ranking more ambiguous due to nonideal packing (Zhang and Liu, 2006). For these 106 proteins, our energy function is able to correctly identify the native structure of 96 proteins. When we compare our energy function (AMBER+GB+ HPMF) with the baseline energy function (AMBER+GB) to evaluate the contribution of the HPMF term, the baseline function can only identify the native of 79 proteins. The average native Z score of the results produced by our energy function is 4.13 (and 4.50 when we correctly predict the native state) and 3.29 for the baseline function (and 3.92 when we correctly predict the native state). When we compare the structural change of the native structure after local minimization using our energy function, we find that the average post-rmsd of the native structure of each protein is 0.6 A˚, while the average post-rmsd
of the decoys is 0.2 A˚. We speculate that this difference stems from the model building used in X-ray crystallography or NMR refinement being a different energy function than that used to generate the decoys. We now focus on the ten proteins whose native structure is not ranked number one by our energy function. The native of 3ICB (4-state-reduced) is ranked number two, although the top-ranked structure is within 1.5 A˚ rmsd with respect to the native. We question whether this is in actuality a bad result since thermal motion at room temperature would likely populate ‘‘misfolds’’ in the 1.5 A˚ and maybe up to 3 A˚ range. In fact the reported NMR data for other proteins shows an ensemble with a spread of rmsd between 0.18 A˚ and 0.47 A˚ in the best case (1GB1) and between 1.30 A˚ and 5.41 A˚ in the worst case (1AJ3). That the energy versus rmsd plot for 3ICB when fit to a linear function gives a correlation coefficient (R value) of 0.69 (Figure 2) strongly suggests that the free energy surface for 3ICB is well-described by our model. Some of the proteins in which our energy function does not rank the native structure near the top have
Table 3. LMDS Decoy Set AMBER99+GB
AMBER99+GB+HPMF
Protein
Fold Class
Native-State Resolution (A˚)
Number of Residues
Native Rank
Native Z Score
Native Rank
Native Z Score
1B0N-B
a
1.90
31
29
1.684
40
1.433
1CTF
a/b
1.70
68
1
3.375
1
4.540
1DTK
a/b
NMR
57
1
3.527
1
4.261
1FC2
a/b
2.80
43
1
3.389
1
4.232
1IGD
a/b
1.10
61
1
4.286
1
4.477
1SHF-A
b
1.90
59
1
2.972
1
4.585
2CRO
a
2.35
65
1
5.649
1
7.914
2OVO
a/b
1.50
56
1
3.411
1
4.580
4PTI
a/b
1.50
58
1
4.888
1
6.575
There are nine proteins whose native states have been determined by both NMR and X-ray crystallography. There is an average of 500 decoys per protein with no sequence length variations, and an rmsd range of 2.45–13.47 A˚. 730 Structure 15, 727–740, June 2007 ª2007 Elsevier Ltd All rights reserved
Structure Hydrophobic PMF for Protein Structure Prediction
Table 4. Fisa and Fisa-CASP3 Decoy Set Number of Residues
AMBER+GB
AMBER+GB+HPMF
Native Rank
Native Z Score
Native Rank
Native Z Score
Protein
Fold Class
Decoy Set
Native-State Resolution (A˚)
1FC2
a
FISA
2.80
43
152
0.555
70
0.940
1HDD-C
a
FISA
2.80
57
64
1.180
1
2.454
2CRO
a
FISA
2.35
65
1
4.922
1
5.077
4ICB
a
FISA
1.60
76
1
3.588
1
4.487
1BG8-A
a
FISA-CASP3
2.00
76
1
3.105
1
3.725
1BL0
a
FISA-CASP3
2.30
99
15
2.130
1
3.493
1EH2
a
FISA-CASP3
NMR
79
18
2.162
1
3.078
1JWE
a
FISA-CASP3
NMR
114
1
4.745
1
5.285
The fisa decoy set has four proteins whose native states have been determined by X-ray crystallography. There is an average of 500 decoys per protein with no sequence length variations, and an rmsd range of 2.77–14.13 A˚. The fisa-CASP3 decoy set has four proteins whose native states have been determined by both NMR and X-ray crystallography. There is an average of 1500 decoys per protein with no sequence length variations, and an rmsd range of 3.63–20.87 A˚.
complications of either being a subunit of a larger protein or having prosthetic heme or metal groups. 1CC5 and 5ICB in Rosetta-1 and 2PAC in Rosetta-2 are heme-binding or metal-binding proteins. The prosthetic groups that bind to these proteins reside in the interior, and without them there can be destabilization in the core. The nonspecificity of interactions in the collapse of the cavity contributes to many possible misfolds that are energetically competitive with the native structure. Thus it is uncertain if the failure in identifying the native of these proteins is because of the energy function or the differences in destabilization of the native structures and misfolds caused by the missing prosthetic group. In fact we perform well on a number of heme-binding proteins in the Rosetta decoy sets. We also succeeded in identifying the native structure for many proteins that are a domain in a larger quaternary structure. However, we fail on the small protein 1BON-B (LMDS decoy set), as well as 1FC2 (fisa decoy set), which exposes a large amount of hydrophobic surface area. As
described by others (Felts et al., 2002; McConkey et al., 2003), the removal of other subunits can possibly destabilize the native structure of the target domain more so than the decoys. It has been suggested that 1BON-B is so small that other subunits are needed to stabilize it, while 1FC2 packs exposed hydrophobic surface area against other subunits. Although our energy function ranked the native state as number two for 2FDN in Rosetta-1 set and number 13 for 1APF in Rosetta-2 set, the top-ranked structure for these proteins have very large rmsds with respect to native and have very poor R values based on linear fits of energy versus rmsd across the whole decoy set. Other proteins such as 1ORC (native rank 100) and 1HYP (native rank 1038) seem to be resounding failures. These proteins offer more interesting analysis questions about the success of our energy function. In the cases of 2FDN, 1APF, and 1HYP proteins, there are no native-like decoys (with rmsd < 3.0 A˚). It is possible that having more native-like
Table 5. Lattice Decoy Set
Protein
Fold Class
Native-State Resolution (A˚)
Number of Residues
AMBER99+GB
AMBER99+GB+HPMF
Native Rank
Native Rank
Native Z Score
Native Z Score
1BEO
a
2.20
98
1
5.731
1
5.550
1CTF
a/b
1.70
68
1
4.609
1
5.311
1DKT-A
a/b
2.90
72
1
4.841
1
3.924
1FCA
a/b
1.80
55
3
3.071
1
3.068
1NKL
a
NMR
78
52
1.811
1
4.173
1PGB
a/b
1.92
56
1
3.710
1
5.910
1TRL-A
a
NMR
62
1
4.383
1
4.261
4ICB
a
1.60
76
1
3.790
1
3.524
There are eight proteins whose native states have been determined by both X-ray crystallography and NMR. There are 2000 decoys per protein with no sequence length variations and an rmsd range of 4.74–17.44 A˚. Structure 15, 727–740, June 2007 ª2007 Elsevier Ltd All rights reserved 731
Structure Hydrophobic PMF for Protein Structure Prediction
Table 6. Rosetta 1 Decoy Set AMBER99+GB
AMBER99+GB+HPMF
Native Rank
Native Rank
Protein
Fold Class
Native-State Resolution (A˚)
DB
Number of Residues
Native Z Score
Native Z Score
1AA2
a
2.00
Simons
105
1
4.453
1
5.623
1ACF
a/b
2.00
Simons
123
1
6.243
1
7.379
1AIL
a
1.90
Tsai
67
17
2.360
1
3.711
1BDO
b
1.80
Simons
75
1
3.742
1
5.357
1BQ9
a/b
1.20
Tsai
53
1
4.852
1
4.279
1CC5
a
2.50
Simons
76
123
1.008
77
1.391
1CEI
a
1.80
Tsai
85
1
3.401
1
4.895
1CSP
b
2.45
Tsai
64
3
3.123
1
3.812
1CTF
a/b
1.70
Tsai
67
1
3.263
1
3.835
1ECA
a
1.40
Simons
132
1
3.861
1
5.741
1ERV
a/b
1.65
Simons
105
1
5.003
1
6.936
1GVP
b
1.60
Simons
82
4
2.440
1
3.782
1HYP
a
1.80
Tsai
75
1597
1.014
1038
0.051
1KTE
a/b
2.20
Simons
100
1
4.083
1
5.176
1MBD
a
1.40
Simons
147
1
5.050
1
6.741
1MSI
a/b
1.25
Tsai
60
1
3.953
1
4.161
1ORC
a/b
1.54
Tsai
56
461
0.708
100
1.488
1PAL
a
1.65
Simons
100
15
1.906
1
3.928
1PDO
a/b
1.70
Simons
121
1
4.977
1
5.762
1R69
a
2.00
Tsai
61
1
3.475
1
3.802
1RIS
a/b
2.00
Simons
92
1
3.360
1
4.745
1TUC
b
2.02
Tsai
61
1
3.664
1
3.756
1TUL
b
2.20
Simons
97
1
3.514
1
4.536
1UTG
a
1.34
Simons
61
1
2.857
1
3.245
1VCC
a/b
1.60
Tsai
77
1
3.739
1
4.860
1VLS
a
1.85
Simons
143
1
2.461
1
3.321
1WHO
b
1.90
Simons
88
1
4.476
1
5.664
2ACY
a/b
1.80
Simons
92
1
4.396
1
5.756
2FDN
a/b
0.94
Simons
55
30
1.770
2
2.541
2FHA
a
1.90
Simons
160
1
3.411
1
6.187
2FXB
a
2.30
Tsai
81
1
3.584
1
4.101
2GDM
a
1.70
Simons
149
1
4.800
1
6.250
4FGF
b
1.60
Simons
121
1
5.122
1
6.855
5ICB
a
1.50
Tsai
72
187
1.281
41
1.791
5PTI
a/b
1.00
Tsai
55
1
3.546
1
3.243
There are 35 proteins whose native states have been determined by X-ray crystallography. There is an average of 1000 (Simons) or 2000 (Tsai) decoys per protein with sequence length variation of %10%, and an rmsd range of 2.18–33.53 A˚.
decoys might have pushed the clearly nonnative structures out of the top-ranked set, thereby substantially improving the Z score and R value for the linear fit. Another possibility is that proteins like 1ORC, which have a larger
amount of exposed hydrophobic surface area relative to the average measured over these 106 proteins, hint at the possibility of a transition in which the hydration free energy scaling with volume now scales with surface
732 Structure 15, 727–740, June 2007 ª2007 Elsevier Ltd All rights reserved
Structure Hydrophobic PMF for Protein Structure Prediction
Table 7. Rosetta 2 Decoy Set AMBER99+GB
AMBER99+GB+HPMF
Protein
Fold Class
Number of Residues
Native Rank
Native Z Score
1AJ3
a
95
17
1.738
Native Rank 1
Native Z Score 3.605
1APF
b
47
13
1.801
13
1.884
1ARK
b
55
1
2.843
1
2.993
1AYJ
a/b
46
1
3.172
1
3.923
1BTB
a/b
89
1
3.775
1
4.435
1C5A
a
62
1
3.266
1
4.811
1GB1
a/b
54
1
3.724
1
3.601
1GPT
a/b
47
8
2.355
1
3.307
1KSR
b
92
30
1.758
1
2.469
1POU
a
70
3
2.249
1
3.440
1SVQ
a/b
90
3
2.706
1
3.336
1WIU
b
90
12
2.136
1
2.783
2NCM
b
96
1
3.830
1
5.562
2PAC
a
77
18
1.698
10
1.944
The Rosetta subset of 14 proteins whose native states have been determined by NMR. There is an average of 1000 decoys per protein with sequence length variation of %10%, and an rmsd range of 1.94–28.79 A˚.
area. However, when we evaluate these proteins under the AMBER99/GBSA model, we do not find substantial improvement, although there may be a better implementation of GBSA than the one that we use here. Overall, our results appear to perform better than other scoring functions published in the literature that were tested on the same or similar decoy sets. The published scoring functions can be divided into physically derived energy functions (Dominy and Brooks, 2002; Felts et al., 2002; Gatchell et al., 2000; Hsieh and Luo, 2004; Lazaridis and Karplus, 1999; Lee and Kollman, 2001; Narang et al., 2006; Petrey and Honig, 2000; Zhu et al., 2003), similar in spirit to our energy function presented here, and those that are generated as a statistical scoring function based on the frequency of observations of atom or residue contacts in the PDB database, and sometimes combined with physical forces (Berglund et al., 2004; Dehouck et al., 2006; Fain et al., 2001; McConkey et al., 2003; Mukherjee et al., 2005; Shen and Sali, 2006; Wang et al., 2004). While it might seem an advantage to use knowledge-based scoring functions since they are believed to be more reliable in protein structure prediction, we found that our energy function does significantly better in both the native ranking and the native Z score, especially for the most challenging Rosetta sets, which generate 1000–2000 decoys with native-like features. Finally we consider challenging our free energy description with native-like decoys generated based on thermal fluctuations around the native state for 4PTI, using molecular dynamics and a GBSA model implemented in the AMBER program (Case et al., 2005) at four different temperatures, 200K, 273K, 293K, and 313K. At each temperature, 250 structures with a range of rmsds of 3–5 A˚
were generated around the native state; the resulting sampled structures were then locally minimized with our energy function. For the 250 structures generated at 200K, we found that the protein’s native structure is ranked first. However, the native state rank drops to 2nd for native-like decoys generated at 273K, 7th for nativelike decoys generated at 293K, and 18th for native-like decoys generated at 313K, out of a similar size set of 250 decoys at each temperature. Thus it is clear that decoys generated from high-temperature sampling near the native state provide a more difficult decoy set than the LMDS set (Table 10). We have analyzed the differences between the LMDS and the thermal decoys and have determined the following features (Table 10). Relative to the native structure, the decoys in the LMDS set have, on average, a larger radius of gyration (11.9 A˚ versus 11.0 A˚ for the native structure) and a greater amount of exposed nonpolar atoms (63% versus 44% for the native structure). Likewise, the temperature-dependent decoys that are well discriminated against by our energy function have a larger radius of gyration and a greater amount of nonpolar exposure, and therefore are similar to the LMDS set. However the handful of temperature-dependent decoys that have a lower minimized energy than the native structure have either (1) a significantly smaller radius of gyration and similar percentage of exposed hydrophobic groups, or (2) a greater amount of hydrophobic exposure and similar radius of gyration, relative to the native structure. Perhaps the more favorable thermal decoys are simply emphasizing the artificialness of the dogma of finding a ‘‘single’’ global minimum structure and that fluctuations of 3–5 A˚ are thermally accessible in the native basin. The
Structure 15, 727–740, June 2007 ª2007 Elsevier Ltd All rights reserved 733
Structure Hydrophobic PMF for Protein Structure Prediction
Table 8. Hg_Structural Decoy Set AMBER99+GB
AMBER99+GB+HPMF
Native Rank
Native Rank
Protein
Fold Class
Native-State Resolution (A˚)
Number of Residues
Native Z Score
Native Z Score
1ASH
a
2.15
147
1
3.174
1
3.195
1BAB-B
a
1.50
146
1
2.041
7
0.873
1COL-A
a
2.40
197
1
7.794
1
9.643
1CPC-A
a
1.66
162
1
6.772
1
5.837
1ECD
a
1.40
136
1
3.380
1
4.084
1EMY
a
1.78
153
1
2.422
1
1.981
1FLP
a
1.50
142
1
2.521
1
3.698
1GDM
a
1.70
153
1
4.445
1
5.893
1HBG
a
1.50
147
1
2.854
1
3.552
1HBH-A
a
2.20
142
1
2.523
1
2.223
1HBH-B
a
2.20
146
1
1.740
1
1.932
1HDA-A
a
2.20
141
1
1.455
1
1.650
1HDA-B
a
2.20
145
1
1.628
1
2.000
1HLB
a
2.50
157
12
0.260
17
0.022
1HLM
a
2.90
158
26
1.483
25
0.591
1HSY
a
1.90
153
4
1.803
3
1.857
1ITH-A
a
2.50
141
1
2.665
1
2.121
1LHT
a
2.00
153
2
1.797
4
1.547 2.062
1MBA
a
1.60
146
1
2.201
1
1MBS
a
2.50
153
22
0.510
22
1MYG-A
a
1.75
153
1
2.121
2
0.5248 1.844
1MYJ-A
a
1.90
153
4
1.801
5
1.624
1MYT
a
1.74
146
1
2.290
1
1.797
2DHB-A
a
2.80
141
1
1.730
4
1.213
2DHB-B
a
2.80
146
13
0.371
7
0.805
2LHB
a
2.00
149
1
3.571
1
3.920
2PGH-A
a
2.80
141
11
0.647
3
1.444
2PGH-B
a
2.80
146
2
1.612
3
1.496
4SDH-A
a
1.60
145
1
2.809
1
5.108
There are 29 proteins whose native states have been determined by X-ray crystallography. There are 30 decoys per protein with no sequence length variations and an rmsd range of 0.69–30.28 A˚.
other possibility is the known temperature dependence of the hydrophobic interaction in which the solvent-mediated attraction between small hydrophobic groups becomes stronger at higher temperatures (Baldwin, 1986). Alternatively, one might conclude that the side chains pack very differently in the various low-energy decoys in spite of their native-like backbone structure. The differences in sidechain packing have certainly been noted when comparing X-ray crystallographic and NMR structures (Gronenborn and Clore, 1995). In fact, recent work by Zhang and Liu (2006) has shown that X-ray structures exhibit a far greater side-chain entropy contribution than the corresponding
NMR structures. For example, the ensembles of superimposable rotomer states for each residue that do not involve steric clashes are far greater for X-ray structures than NMR structures. As noted in Zhang and Liu (2006), such entropic effects are not explicitly represented in NMR refinement procedures or in energy functions for structure prediction. DISCUSSION Hydrophobic interaction in aqueous solution is one of the important driving forces for a protein to fold into its
734 Structure 15, 727–740, June 2007 ª2007 Elsevier Ltd All rights reserved
Structure Hydrophobic PMF for Protein Structure Prediction
Table 9. MMPBSA Decoy Set AMBER99+GB
AMBER99+GB+HPMF Native Rank
Protein
Fold Class
Native-State Resolution (A˚)
Number of residues
Native Rank
Native Z Score
Native Z Score
1GAB
a
NMR
47
24
0.885
27
0.157
1LEB
a
NMR
63
1
5.782
1
6.382
1POU
a
NMR
70
1
2.315
1
3.618
1QYP
b
NMR
42
2
2.232
1
2.839
1SRO
b
NMR
66
1
1.293
1
1.889
1UTG
a
1.34
62
1
2.189
1
2.838
1UXD
a
NMR
43
1
1.496
2
2.059
1VIF
b
1.80
48
1
3.787
1
5.216
2CDX
b
NMR
54
1
3.367
6
1.148
2PTL
a/b
NMR
60
5
1.323
3
1.763
5ICB
a/b
1.50
72
4
1.156
4
1.529
5ZNF
a/b
NMR
25
1
2.815
1
1.729
There are 12 proteins whose native states have been determined by both NMR and X-ray crystallography. There are 30 decoys per protein with sequence length variations, and an rmsd range of 1.34–11.63 A˚.
functional state (Dill, 1990; Kauzmann, 1959). Including a well-designed solvation model can enhance the ability of a physically motivated free energy function to discriminate the native and native-like structures from other, misfolded states. The large degree of hydrophobic and hydrophilic heterogeneity of typical protein sequences, with residues attached to a polar backbone of the polymer chain, means that complete demixing of hydrophobic side chains into a sequestered and pure hydrophobic core is an impossible outcome. In that case, most singledomain proteins would best be classified as biological structures dominated by a large number of weak nonpolar
Figure 2. Scatter Plot of 3ICB Decoys and Native Structure in the LMDS Set The energy, evaluated with our energy function, of each decoy and the native is plotted against its rmsd to the native. The native with an rmsd of zero sits on the y axis. The plot is fitted with a linear function (solid line), which gives an R value of 0.69. It demonstrates that our energy function can well model the energy surface of the protein.
interactions. In the language of hydration physics, the hydrophobic interaction relevant for protein folding is within the ‘‘wetting regime’’ dominated by entropic effects (Chandler, 2005; ten Wolde, 2002) and in which hydrophobic solvation free energy scales with volume and not surface area (Pratt and Chandler, 1977). SASA descriptions of aqueous solvation typically penalize for any surface exposure of hydrophobic residues to water, when in fact the many-body effect of water-water interactions can actually stabilize the exposure of hydrophobic groups to aqueous solvent. SASA descriptions may be more useful for multidomain proteins in which individual proteins pack exposed hydrophobic surface against other protein domains, and therefore surface area dependence in hydrophobic solvation may be a better model of the thermodynamic driver for their association (Liu et al., 2005). Although we did not extensively evaluate other protein force fields, we did find in our testing that the AMBER99 model developed by Wang et al, (2000) was a clear improvement over AMBER94 (Cornell et al., 1995). Another interesting comparison among physical energy functions is the relative performance of nonpolarizable and emerging polarizable force fields. The polarizable AMBER model developed by Duan and colleagues (Wang et al., 2006) has been tested on some of the same decoy set proteins we have examined here, but much less exhaustively, in part because it is more expensive to evaluate (Lee and Duan, 2004). While our energy function determines the correct native-state ranking more consistently than the polarizable force field, our average native Z score is higher than that found by the polarizable energy function, when evaluated over the same set of proteins. In some individual cases the polarizable force field shows a dramatic drop in Z score relative to our energy function, sometimes as much as four units in cases where both energy functions
Structure 15, 727–740, June 2007 ª2007 Elsevier Ltd All rights reserved 735
Structure Hydrophobic PMF for Protein Structure Prediction
Table 10. Thermally Sampled Decoy Sets Decoy Set
Native Rank
Native Z Score
low
<%NP>low
high
<%NP>high
LMDS
1
4.90
—
—
11.9
62.4%
200K
1
1.85
—
—
11.7
50.5%
273K
2
1.61
10.9
44.4%
11.2
47.9%
293K
7
1.51
11.0
50.1%
11.4
52.6%
313K
18
1.00
10.4
43.1%
10.9
47.6%
The energy ranking of the minimized native structure and locally minimized decoys generated from thermal fluctuations around the native state for 4PTI at 200K, 273K, 293K, and 313K, as well as comparison to the LMDS set. There are 250 thermal decoys generated at each temperature, and the rmsds with respect to native structure range from 3.0 to 5.0 A˚. Characteristics of the decoy sets include: average radius of gyration and percentage of nonpolar atoms that are exposed for decoys lower in energy than the native structure (low and <%NP>low) and the same quantities for decoys that have higher energy than the native structure (high and <%NP>high). For comparison, the radius of gyration for locally minimized native structure of 4PTI is 11.0 A˚, and the percentage of nonpolar atoms that are exposed is 43.5%.
rank the native as the lowest in energy. Thus polarizability may be a key level of physics for strongly discriminating native states from misfolds and even for helping to promote confidence in the best Model 1 submission in the context of the CASP exercise. We also generated more difficult decoy sets by thermal sampling near the native-state basin, which provides a better challenge for assessing the correctness of aqueous solvated protein free energy surfaces. Our energy function determines that there are a handful of energetically more favorable ‘‘misfolded’’ decoys that are characterized by having a smaller radius of gyration or greater percentage of exposed nonpolar atoms relative to the X-ray crystal structure. We have shown that the thermal decoys we generated reveal a clear temperature dependence on the effectiveness of our energy function that may be due to the missing description of side-chain entropy in our model. For example, the greater compaction of the lowenergy decoys could diminish the side-chain entropy of these structures and thereby restore the true free energy preference for the native structure. In summary, we have shown that a hydrophobic potential of mean force (HPMF) combined with a Generalized Born model for polarization of protein charge by the high dielectric solvent is an excellent solvation model for use in structure prediction on single-domain proteins. When compared to other physics-based energy functions, we find that we outperform these other tested energy functions, with both substantial improvements in native ranking and Z score drops of 0.5 to 2.5 units. In addition, we have found that our energy function also outperforms other tested energy functions on loop decoys that we will report on in the near future. EXPERIMENTAL PROCEDURES Summary of Decoy Sets Several early decoy sets were made publicly available through Decoys ‘R’ Us (Samudrala and Levitt, 2000), enabling the examination of the ability of different types of energy functions, ranging from knowledge-based energy functions to purely physics-based potentials (Berglund et al., 2004; Dehouck et al., 2006; Dominy and
Brooks, 2002; Fain et al., 2001; Felts et al., 2002; Gatchell et al., 2000; Hsieh and Luo, 2004; Lazaridis and Karplus, 1999; Lee and Kollman, 2001; McConkey et al., 2003; Mukherjee et al., 2005; Narang et al., 2006; Park and Levitt, 1996; Petrey and Honig, 2000; Shen and Sali, 2006; Simons et al., 1997; Wang et al., 2004; Zhu et al., 2003), to detect the native protein structure from a set of misfolded or ‘‘decoy’’ structures. These are either grossly misfolded structures (similar to the test on hemerythrin and the immunoglobulin VL domain described above) with relatively few decoys (1–30 misfolds) to large decoy sets (200–2000 misfolds) with misfolded structures that range over rmsd values between 1.0 A˚ and 20.0 A˚. More recent decoy sets have been designed to generate subtle misfolds that have more native-like structural elements, concentrated in an rmsd range of <3.0 A˚ (Simons et al., 1997; Tsai et al., 2003). The decoy sets are composed of target native structures that cover the all-a, all-b, and a/b fold classes derived from both X-ray and NMR data. We have used the decoy data sets listed in Tables 1–9 that differ substantially in how misfolded protein structures have been generated. The ordering of decoy sets in Tables 1–9 is meant to reflect an increasing level of difficulty for discriminating the native fold from misfolded structures. Greater levels of difficulty arise from several factors, including the creation of misfolds with significant native-like or reasonable generic protein secondary and tertiary features such as the Rosetta sets, the use of NMR structures to define the native state, when thousands instead of tens of nonnative conformations have been computer generated, proteins with prosthetic groups, and/or proteins that are part of larger domains. Energy Function Model The first term in Equation (1) has the following functional form: VProtein =
#bonds X
#angles X
i
i
kb ðbi bo Þ2 +
+
#dihedrals X
kx ½1 + cosðnc + dÞ
i
+
kq ðqi qo Þ2
N X
N X
i
i
qi qj + 43ij rij
" #! 12 6 sij sij : rij rij
(2)
The first three terms in Equation (2) represent the chain connectivity interactions, and the next two terms represent the nonbonded interactions, which comprise electrostatic and van der Waals interactions, respectively. We refer the reader to the literature for the definitions of all variables and parameters of the AMBER99 model (Wang et al., 2000). The second term in Equation (1), the Generalized Born energy (VGB) is given as
736 Structure 15, 727–740, June 2007 ª2007 Elsevier Ltd All rights reserved
Structure Hydrophobic PMF for Protein Structure Prediction
Table 11. Parameters of the Hydrophobic Potential of Mean Force Term in the Energy Function Parameters
1st Gaussian
2nd Gaussian
3rd Gaussian
c
3.81679
5.46692
7.11677
w
1.68589
1.39064
1.57417
h
0.73080
0.20016
0.09055
VGB =
1 1 1 2 3p 3w
XX i
i
qi qj ; fijGB ðrij Þ
Rj Ri ; bij = pðRi + Rs ÞðRj + Ri + 2Rs rij Þ 1 + rij
(3a)
where qi and qj are the partial charge of atoms i and j, respectively, and fijGB is a continuous function in the form of " fijGB ðrij Þ = rij2 + Ri Rj exp
rij2 4Ri Rj
!#12 (3b)
;
where Ri and Rj are the Born radius of atoms i and j, respectively, and rij is the distance between them. We use the parameterization of Born radii as developed by Onufriev et al. (2000). The dielectric constant of the protein, 3p, is 4, and of the continuum aqueous solvent, 3w, is 80. The final term in Equation (1) is the hydrophobic potential of mean force, VHPMF, in which we use the following functional form:
VHPMF =
Nc X
tanhðSAi Þ
Nc X
type of atom i, and pij is a constant dependent on the connectivity between atoms i and j. bij is the solvent-accessible surface area of atom i removed by atom j:
where Ri, Rj, and Rs are the radii of atoms i, j, and the solvent probe respectively. We use the parameterization developed in Hasel et al. (1988). Local Minimization Procedure The native and decoy conformations are given as Cartesian coordinates of heavy atoms, but our energy function requires hydrogen positions to be specified as well. We use the CHARMM program (Brooks et al., 1983) to build the positions of the hydrogen atoms on the native and the given decoy structures. We evaluate the structures using our energy function both before and after local optimization. The structures are then optimized to their nearest local minimum using the BFGS (Broyden-Fletcher-Goldfarb-Shanno) limited memory quasiNewton method (Liu and Nocedal, 1989; Press et al., 1992). QuasiNewton methods are effective for large-scale minimization where the Hessian of the energy function is dense and gradient evaluations are expensive, as is the case for the energy function explored here. Limited memory quasi-Newton methods generate a sequence of approximate solutions by the iteration xk + 1 = xk lk B1 k Vfðxk Þ;
tanhðSAj Þ
(5b)
(6)
i˛SAi >Ac
j˛SAj >Ac jsi 2 3 X rij ck ; hk exp wk k=1
(4)
where the sum over i and j is over all aliphatic and aromatic carbon centers, and the evaluation of HPMF is restricted to carbon centers with an exposed surface area greater than an area cutoff, Ac. The hyperbolic tangent functions are used to create a continuous VHPMF function to smooth discontinuities introduced by the surface area cutoff (SA > Ac). Each of the three Gaussians is parameterized by position ck, depth hk, and width wk so as to describe the two minima and the barrier (Figure 1). Starting from an estimate of these parameters for a methanemethane potential of mean force profile in water, the HPMF parameters in Equation (4), including the value of Ac, were optimized in the context of Equation (1) over seven proteins in the 4-state-reduced decoy set given in Table 2. The HPMF parameters are listed in Table 11, and Ac was set to 6.0 A˚2 (by comparison, the total surface area of a carbon atom is 120.0 A˚2). We believe that this small training set is adequate since it generalizes so well to the much larger and more extensive test set, and furthermore the parameters are plausible physical values based on the methane-methane potential of mean force. We believe that the optimized cutoff, Ac, corresponds to the inherent error of the analytical but approximate estimate of the surface area term, and essentially atoms with exposed area less than Ac are in fact fully buried. While better numerical schemes exist that may be able to calculate the surface area more precisely, they are computationally expensive, while the cheaper analytical form has a correctible systematic error. The calculation of the solvent-accessible surface area of an atom, Ai, has the following functional form:
Ai = S i
YN j=1
1
pi pij bij ; Si
(5a)
where Si is the total solvent-accessible surface area of atom i when it is fully exposed to solvent, and pi is a constant depended on the atom
where Bk is an N 3 N positive definite matrix approximating the Hessian of f, and lk is a scalar step-length parameter. In the limited memory version, Bk is stored using only the 2m update vectors for the m most recent iterations, where m % 20. At each iteration, Bk is implicitly constructed by applying the saved vectors to some standard initial matrix such as the identity. Analysis of Native and Decoy Structures After local optimization, the native and the decoy structures of each protein are sorted in ascending order based on their energy. According to the thermodynamic hypothesis (Anfinsen, 1973), a protein’s native structure should be ranked first and therefore the lowest energy in the sorted list of native and decoys. Furthermore, the degree of discrimination of the native state from the decoys can be evaluated by calculating the Z score: Z score =
V V ; s
(7)
where V is the energy of the native, and V and s are the average energy and the standard deviation of the energy distribution of the misfolds, respectively. Z score measures the number of standard deviations of the native energy separated from the average energy of all of the misfolded structures. Molecular Dynamics for Sampling Near-Native States We also generated many near-native structures generated by thermal fluctuations around the protein native state. We take the native PDB file as the starting structure and use the molecular dynamics in the SANDER module of Amber 9 package (Case et al., 2005). The energy function used in the simulations is composed of the AMBER99 force field (Wang et al., 2000) and GBOBC model (Onufriev et al., 2004), which is one of the parameterized GB models. The tleap program is used to set atomic GB radii to MBondi2 values (Onufriev et al., 2004), and the molecular dynamics time step is 1 fs, and an Andersen thermostat (Andersen, 1980; Andrea et al., 1983) is used with velocity resampling every 1000 fs. The simulations are run at constant
Structure 15, 727–740, June 2007 ª2007 Elsevier Ltd All rights reserved 737
Structure Hydrophobic PMF for Protein Structure Prediction
temperatures of 200K, 273K, 293K, and 313K, and each simulation at each temperature is run for 3 ns and structures are collected every 15 ps. The generated structures and the corresponding native are then local minimized using our energy function, and the energy of the native is compared to the energy of the thermal decoy structures. ACKNOWLEDGMENTS
Crivelli, S., Eskow, E., Bader, B., Lamberti, V., Byrd, R., Schnabel, R., and Head-Gordon, T. (2002). A physical approach to protein structure prediction. Biophys. J. 82, 36–49. Dehouck, Y., Gilis, D., and Rooman, M. (2006). A new generation of statistical potentials for proteins. Biophys. J. 90, 4010–4017. Dill, K.A. (1990). Dominant forces in protein folding. Biochemistry 29, 7133–7155.
T.H.-G. gratefully acknowledges Schlumberger Fellowship for support while on sabbatical at the University of Cambridge and financial support under the NSF-Cyber program. M.S.L. would like to acknowledge support from the DOE/MICS program. N.L.F. thanks the Whitaker Foundation for a graduate research fellowship. We thank Harry Ming Tak Choi for his contribution on the early testing of the model. This research used resources of the National Energy Research Scientific Computing Center, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC0205CH11231.
Dominy, B.N., and Brooks, C.L. (2002). Identifying native-like protein structures using physics-based potentials. J. Comput. Chem. 23, 147–160.
Received: December 16, 2006 Revised: May 4, 2007 Accepted: May 7, 2007 Published: June 12, 2007
Felts, A.K., Gallicchio, E., Wallqvist, A., and Levy, R.M. (2002). Distinguishing native conformations of proteins from decoys with an effective free energy estimator based on the OPLS all-atom force field and the surface generalized born solvent model. Proteins 48, 404–422.
REFERENCES
Gallicchio, E., and Levy, R.M. (2004). AGBNP: an analytic implicit solvent model suitable for molecular dynamics simulations and high-resolution modeling. J. Comput. Chem. 25, 479–499.
Andersen, H.C. (1980). Molecular-dynamics simulations at constant pressure and-or temperature. J. Chem. Phys. 72, 2384–2393. Andrea, T.A., Swope, W.C., and Andersen, H.C. (1983). The role of long ranged forces in determining the structure and properties of liquid water. J. Chem. Phys. 79, 4576–4584. Anfinsen, C.B. (1973). Principles that govern folding of protein chains. Science 181, 223–230.
Eisenberg, D., and McLachlan, A.D. (1986). Solvation energy in protein folding and binding. Nature 319, 199–203. Fain, B., Xia, Y., and Levitt, M. (2001). Determination of optimal Chebyshev-expanded hydrophobic discrimination function for globular proteins. IBM Journal Research and Development 45, 525–532. Feig, M., and Brooks, C.L. (2002). Evaluating CASP4 predictions with physical energy functions. Proteins 49, 232–245.
Garcia, A.E., and Sanbonmatsu, K.Y. (2002). a-helical stabilization by side chain shielding of backbone hydrogen bonds. Proc. Natl. Acad. Sci. USA 99, 2782–2787. Gatchell, D.W., Dennis, S., and Vajda, S. (2000). Discrimination of nearnative protein structures from misfolded models by empirical free energy functions. Proteins 41, 518–534.
Baldwin, R.L. (1986). Temperature-dependence of the hydrophobic interaction in protein folding. Proc. Natl. Acad. Sci. USA 83, 8069–8072.
Gronenborn, A.M., and Clore, G.M. (1995). Structures of protein complexes by multidimensional heteronuclear magnetic-resonance spectroscopy. Crit. Rev. Biochem. Mol. Biol. 30, 351.
Berendsen, H.J.C., Postma, J.P.M., van Gunsteren, W.F., and Hermans, J. (1981). Interaction models for water in relation to protein hydration. In Intermolecular Forces, B. Pullman, ed. (Dordrecht, The Netherlands: D. Reidel Publishing Company), pp. 331–342.
Hasel, W., Hendrickson, T.F., and Still, W.C. (1988). A rapid approximation to the solvent accessible surface areas of atoms. Tetrahdron Computer Methodology 1, 103–116.
Berglund, A., Head, R.D., Welsh, E.A., and Marshall, G.R. (2004). ProVal: a protein-scoring function for the selection of native and near-native folds. Proteins 54, 289–302. Brooks, B.R., Bruccoleri, R.E., Olafson, B.D., States, D.J., Swaminathan, S., and Karplus, M. (1983). Charmm—a program for macromolecular energy, minimization, and dynamics calculations. J. Comput. Chem. 4, 187–217. Case, D.A., Cheatham, T.E., Darden, T., Gohlke, H., Luo, R., Merz, K.M., Onufriev, A., Simmerling, C., Wang, B., and Woods, R.J. (2005). The Amber biomolecular simulation programs. J. Comput. Chem. 26, 1668–1688. Chandler, D. (2005). Interfaces and the driving force of hydrophobic assembly. Nature 437, 640–647. Chen, B., Xing, J.H., and Siepmann, J.I. (2000). Development of polarizable water force fields for phase equilibrium calculations. J. Phys. Chem. B 104, 2391–2401. Chothia, C. (1974). Hydrophobic bonding and accessible surface area in proteins. Nature 248, 338–339. Constanciel, R., and Contreras, R. (1984). Self-consistent field-theory of solvent effects representation by continuum models—introduction of desolvation contribution. Theor. Chim. Acta 65, 1–11. Cornell, W.D., Cieplak, P., Bayly, C.I., Gould, I.R., Merz, K.M., Ferguson, D.M., Spellmeyer, D.C., Fox, T., Caldwell, J.W., and Kollman, P.A. (1995). A 2nd generation force-field for the simulation of proteins, nucleic-acids, and organic-molecules. J. Am. Chem. Soc. 117, 5179–5197.
Hawkins, G.D., Cramer, C.J., and Truhlar, D.G. (1995). Pairwise solute descreening of solute charges from a dielectric medium. Chem. Phys. Lett. 246, 122–129. Hawkins, G.D., Cramer, C.J., and Truhlar, D.G. (1996). Parametrized models of aqueous free energies of solvation based on pairwise descreening of solute atomic charges from a dielectric medium. J. Phys. Chem. 100, 19824–19839. Head-Gordon, T., Sorenson, J.M., Pertsemlidis, A., and Glaeser, R.M. (1997). Differences in hydration structure near hydrophobic and hydrophilic amino acids. Biophys. J. 73, 2106–2115. Holm, L., and Sander, C. (1992). Evaluation of protein models by atomic solvation preference. J. Mol. Biol. 225, 93–105. Horn, H.W., Swope, W.C., Pitera, J.W., Madura, J.D., Dick, T.J., Hura, G.L., and Head-Gordon, T. (2004). Development of an improved foursite water model for biomolecular simulations: TIP4P-Ew. J. Chem. Phys. 120, 9665–9678. Hornak, V., Abel, R., Okur, A., Strockbine, B., Roitberg, A., and Simmerling, C. (2006). Comparison of multiple amber force fields and development of improved protein backbone parameters. Proteins 65, 712–725. Hsieh, M.J., and Luo, R. (2004). Physical scoring function based on AMBER force field and Poisson-Boltzmann implicit solvent for protein structure prediction. Proteins 56, 475–486. Hura, G., Sorenson, J.M., Glaeser, R.M., and Head-Gordon, T. (1999). Solution X-ray scattering as a probe of hydration-dependent structuring of aqueous solutions. Perspect. Drug Discov. Des. 17, 97–118.
738 Structure 15, 727–740, June 2007 ª2007 Elsevier Ltd All rights reserved
Structure Hydrophobic PMF for Protein Structure Prediction
Jorgensen, W.L., and Madura, J.D. (1985). Temperature and size dependence for Monte-Carlo simulations of Tip4p water. Mol. Phys. 56, 1381–1392. Jorgensen, W.L., Chandrasekhar, J., Madura, J.D., Impey, R.W., and Klein, M.L. (1983). Comparison of simple potential functions for simulating liquid water. J. Chem. Phys. 79, 926–935. Jorgensen, W.L., Maxwell, D.S., and Tirado-Rives, J. (1996). Development and testing of the OPLS all-atom force field on conformational energetics and properties of organic liquids. J. Am. Chem. Soc. 118, 11225–11236. Kaminski, G.A., Friesner, R.A., Tirado-Rives, J., and Jorgensen, W.L. (2001). Evaluation and reparametrization of the OPLS-AA force field for proteins via comparison with accurate quantum chemical calculations on peptides. J. Phys. Chem. B 105, 6474–6487.
Nemethy, G., Pottle, M.S., and Scheraga, H.A. (1983). Energy parameters in polypeptides. 9. Updating of geometrical parameters, nonbonded interactions, and hydrogen-bond interactions for the naturally-occurring amino-acids. J. Phys. Chem. 87, 1883–1887. Novotny, J., Bruccoleri, R., and Karplus, M. (1984). An analysis of incorrectly folded protein models—implications for structure predictions. J. Mol. Biol. 177, 787–818. Onufriev, A., Bashford, D., and Case, D.A. (2000). Modification of the generalized Born model suitable for macromolecules. J. Phys. Chem. B 104, 3712–3720. Onufriev, A., Bashford, D., and Case, D.A. (2004). Exploring protein native states and large-scale conformational changes with a modified generalized born model. Proteins 55, 383–394.
Kauzmann, W. (1959). Some factors in the interpretation of protein denaturation. Adv. Protein Chem. 14, 1–63.
Pande, V.S., and Rokhsar, D.S. (1999). Molecular dynamics simulations of unfolding and refolding of a beta-hairpin fragment of protein G. Proc. Natl. Acad. Sci. USA 96, 9062–9067.
Keasar, C., and Levitt, M. (2003). A novel approach to decoy set generation: designing a physical energy function having local minima with native structure characteristics. J. Mol. Biol. 329, 159–174.
Park, B., and Levitt, M. (1996). Energy functions that discriminate X-ray and near-native folds from well-constructed decoys. J. Mol. Biol. 258, 367–392.
Lazaridis, T., and Karplus, M. (1999). Discrimination of the native from misfolded protein models with an energy function including implicit solvation. J. Mol. Biol. 288, 477–487.
Pertsemlidis, A., Saxena, A.M., Soper, A.K., Head-Gordon, T., and Glaeser, R.M. (1996). Direct evidence for modified solvent structure within the hydration shell of a hydrophobic amino acid. Proc. Natl. Acad. Sci. USA 93, 10769–10774.
Lee, B., and Richards, F.M. (1971). Interpretation of protein structures—estimation of static accessibility. J. Mol. Biol. 55, 379. Lee, M.R., and Kollman, P.A. (2001). Free-energy calculations highlight differences in accuracy between X-ray and NMR structures and add value to protein structure prediction. Structure 9, 905–916. Lee, M.C., and Duan, Y. (2004). Distinguish protein decoys by using a scoring function based on a new AMBER force field, short molecular dynamics simulations, and the generalized born solvent model. Proteins 55, 620–634. Lee, M.R., Tsai, J., Baker, D., and Kollman, P.A. (2001). Molecular dynamics in the endgame of protein structure prediction. J. Mol. Biol. 313, 417–430.
Pertsemlidis, A., Soper, A.K., Sorenson, J.M., and Head-Gordon, T. (1999). Evidence for microscopic, long-range hydration forces for a hydrophobic amino acid. Proc. Natl. Acad. Sci. USA 96, 481–486. Petrey, D., and Honig, B. (2000). Free energy determinants of tertiary structure and the evaluation of protein models. Protein Sci. 9, 2181– 2191. Ponder, J.W., and Case, D.A. (2003). Force fields for protein simulations. Adv. Protein Chem. 66, 27–85. Pratt, L.R., and Chandler, D. (1977). Theory of hydrophobic effect. J. Chem. Phys. 67, 3683–3704.
Liu, D.C., and Nocedal, J. (1989). On the limited memory Bfgs method for large-scale optimization. Math. Program. 45, 503–528.
Press, W.H., Flannery, B.P., Teukolsky, S.A., and Vetterling, W.T. (1992). Numerical Recipes in C: The Art of Scientific Computing, Second Edition (Cambridge, UK: Cambridge University Press).
Liu, P., Huang, X.H., Zhou, R.H., and Berne, B.J. (2005). Observation of a dewetting transition in the collapse of the melittin tetramer. Nature 437, 159–162.
Rick, S.W. (2001). Simulations of ice and liquid water over a range of temperatures using the fluctuating charge model. J. Chem. Phys. 114, 2276–2283.
MacKerell, A.D., Bashford, D., Bellott, M., Dunbrack, R.L., Evanseck, J.D., Field, M.J., Fischer, S., Gao, J., Guo, H., Ha, S., et al. (1998). All-atom empirical potential for molecular modeling and dynamics studies of proteins. J. Phys. Chem. B 102, 3586–3616.
Rick, S.W. (2004). A reoptimization of the five-site water potential (TIP5P) for use with Ewald sums. J. Chem. Phys. 120, 6085–6093.
Mahoney, M.W., and Jorgensen, W.L. (2000). A five-site model for liquid water and the reproduction of the density anomaly by rigid, nonpolarizable potential functions. J. Chem. Phys. 112, 8910–8922. McConkey, B.J., Sobolev, V., and Edelman, M. (2003). Discrimination of native protein structures using atom-atom contact scoring. Proc. Natl. Acad. Sci. USA 100, 3215–3220. Mitsutake, A., Sugita, Y., and Okamoto, Y. (2001). Generalized-ensemble algorithms for molecular simulations of biopolymers. Biopolymers 60, 96–123. Momany, F.A., Mcguire, R.F., Burgess, A.W., and Scheraga, H.A. (1975). Energy parameters in polypeptides. 7. Geometric parameters, partial atomic charges, nonbonded interactions, hydrogen-bond interactions, and intrinsic torsional potentials for naturally occurring aminoacids. J. Phys. Chem. 79, 2361–2381. Mukherjee, A., Bhimalapuram, P., and Bagchi, B. (2005). Orientationdependent potential of mean force for protein folding. J. Chem. Phys. 123, 14901. Narang, P., Bhushan, K., Bose, S., and Jayaram, B. (2006). Protein structure evaluation using an all-atom energy based empirical scoring function. J. Biomol. Struct. Dyn. 23, 385–406.
Rick, S.W., Stuart, S.J., and Berne, B.J. (1994). Dynamical fluctuating charge force-fields—application to liquid water. J. Chem. Phys. 101, 6141–6156. Samudrala, R., and Levitt, M. (2000). Decoys ‘R’ Us: a database of incorrect conformations to improve protein structure prediction. Protein Sci. 9, 1399–1401. Scott, W.R.P., Hunenberger, P.H., Tironi, I.G., Mark, A.E., Billeter, S.R., Fennen, J., Torda, A.E., Huber, T., Kruger, P., and van Gunsteren, W.F. (1999). The GROMOS biomolecular simulation program package. J. Phys. Chem. A 103, 3596–3607. Sharp, K.A., Nicholls, A., Fine, R.F., and Honig, B. (1991). Reconciling the magnitude of the microscopic and macroscopic hydrophobic effects. Science 252, 106–109. Shen, M.Y., and Sali, A. (2006). Statistical potential for assessment and prediction of protein structures. Protein Sci. 15, 2507–2524. Shrake, A., and Rupley, J.A. (1973). Environment and exposure to solvent of protein atoms—lysozyme and insulin. J. Mol. Biol. 79, 351–371. Simmerling, C., Strockbine, B., and Roitberg, A.E. (2002). All-atom structure prediction and folding simulations of a stable protein. J. Am. Chem. Soc. 124, 11258–11259. Simons, K.T., Kooperberg, C., Huang, E., and Baker, D. (1997). Assembly of protein tertiary structures from fragments with similar local
Structure 15, 727–740, June 2007 ª2007 Elsevier Ltd All rights reserved 739
Structure Hydrophobic PMF for Protein Structure Prediction
sequences using simulated annealing and Bayesian scoring functions. J. Mol. Biol. 268, 209–225. Simons, K.T., Bonneau, R., Ruczinski, I., and Baker, D. (1999). Ab initio protein structure prediction of CASP III targets using ROSETTA. Proteins (Suppl. 3), 171–176. Sippl, M.J., Nemethy, G., and Scheraga, H.A. (1984). Intermolecular potentials from crystal data. 6. Determination of empirical potentials for O-H=O=C hydrogen-bonds from packing configurations. J. Phys. Chem. 88, 6231–6233. Sorenson, J.M., Hura, G., Soper, A.K., Pertsemlidis, A., and HeadGordon, T. (1999). Determining the role of hydration forces in protein folding. J. Phys. Chem. B 103, 5413–5426. Spolar, R.S., Ha, J.H., and Record, M.T., Jr. (1989). Hydrophobic effect in protein folding and other noncovalent processes involving proteins. Proc. Natl. Acad. Sci. USA 86, 8382–8385. Still, W.C., Tempczyk, A., Hawley, R.C., and Hendrickson, T. (1990). Semianalytical treatment of solvation for molecular mechanics and dynamics. J. Am. Chem. Soc. 112, 6127–6129. Sugita, Y., and Okamoto, Y. (1999). Replica-exchange molecular dynamics method for protein folding. Chem. Phys. Lett. 314, 141–151. Swanson, J.M., Henchman, R.H., and McCammon, J.A. (2004). Revisiting free energy calculations: a theoretical connection to MM/PBSA and direct calculation of the association free energy. Biophys. J. 86, 67–74. ten Wolde, P.R. (2002). Hydrophobic interactions: an overview. J. Phys. Condens. Matter 14, 9445–9460. Tsai, J., Bonneau, R., Morozov, A.V., Kuhlman, B., Rohl, C.A., and Baker, D. (2003). An improved protein decoy set for testing energy functions for protein structure prediction. Proteins 53, 76–87. van der Vaart, A., Bursulaya, B.D., Brooks, C.L., and Merz, K.M. (2000). Are many-body effects important in protein folding? J. Phys. Chem. B 104, 9554–9563.
Vasquez, M., Nemethy, G., and Scheraga, H.A. (1994). Conformational energy calculations on polypeptides and proteins. Chem. Rev. 94, 2183–2239. Vorobjev, Y.N., Almagro, J.C., and Hermans, J. (1998). Discrimination between native and intentionally misfolded conformations of proteins: ES/IS, a new method for calculating conformational free energy that uses both dynamics simulations with an explicit solvent and an implicit solvent continuum model. Proteins 32, 399–413. Wagoner, J.A., and Baker, N.A. (2006). Assessing implicit models for nonpolar mean solvation forces: the importance of dispersion and volume terms. Proc. Natl. Acad. Sci. USA 103, 8331–8336. Wang, J.M., Cieplak, P., and Kollman, P.A. (2000). How well does a restrained electrostatic potential (RESP) model perform in calculating conformational energies of organic and biological molecules? J. Comput. Chem. 21, 1049–1074. Wang, K., Fain, B., Levitt, M., and Samudrala, R. (2004). Improved protein structure selection using decoy-dependent discriminatory functions. BMC Struct. Biol. 4, 8. Wang, Z.X., Zhang, W., Wu, C., Lei, H.X., Cieplak, P., and Duan, Y. (2006). Strike a balance: optimization of backbone torsion parameters of AMBER polarizable force field for simulations of proteins and peptides. J. Comput. Chem. 27, 781–790. Weiner, S.J., Kollman, P.A., Case, D.A., Singh, U.C., Ghio, C., Alagona, G., Profeta, S., and Weiner, P. (1984). A new force-field for molecular mechanical simulation of nucleic-acids and proteins. J. Am. Chem. Soc. 106, 765–784. Weiner, S.J., Kollman, P.A., Nguyen, D.T., and Case, D.A. (1986). An all atom force-field for simulations of proteins and nucleic-acids. J. Comput. Chem. 7, 230–252. Zhang, J.F., and Liu, J.S. (2006). On side-chain conformational entropy of proteins. PLoS Comput. Biol. 2, 1586–1591. Zhu, J., Zhu, Q.Q., Shi, Y.Y., and Liu, H.Y. (2003). How well can we predict native contacts in proteins based on decoy structures and their energies? Proteins 52, 598–608.
740 Structure 15, 727–740, June 2007 ª2007 Elsevier Ltd All rights reserved