Hydrophobic Potential of Mean Force as a Solvation Function for Protein Structure Prediction

Hydrophobic Potential of Mean Force as a Solvation Function for Protein Structure Prediction

Structure Article Hydrophobic Potential of Mean Force as a Solvation Function for Protein Structure Prediction Matthew S. Lin,1 Nicolas Lux Fawzi,1 a...

374KB Sizes 0 Downloads 50 Views

Structure

Article Hydrophobic Potential of Mean Force as a Solvation Function for Protein Structure Prediction Matthew S. Lin,1 Nicolas Lux Fawzi,1 and Teresa Head-Gordon1,2,* 1

UCSF/UCB Joint Graduate Group in Bioengineering, Berkeley, CA 94720, USA Department of Bioengineering, University of California, Berkeley, CA 94720, USA *Correspondence: [email protected] DOI 10.1016/j.str.2007.05.004

2

SUMMARY

We have developed a solvation function that combines a Generalized Born model for polarization of protein charge by the high dielectric solvent, with a hydrophobic potential of mean force (HPMF) as a model for hydrophobic interaction, to aid in the discrimination of native structures from other misfolded states in protein structure prediction. We find that our energy function outperforms other reported scoring functions in terms of correct native ranking for 91% of proteins and low Z scores for a variety of decoy sets, including the challenging Rosetta decoys. This work shows that the stabilizing effect of hydrophobic exposure to aqueous solvent that defines the HPMF hydration physics is an apparent improvement over solventaccessible surface area models that penalize hydrophobic exposure. Decoys generated by thermal sampling around the native-state basin reveal a potentially important role for side-chain entropy in the future development of even more accurate free energy surfaces. INTRODUCTION The protein structure prediction problem is to predict the three-dimensional structure of the native state of a protein given its sequence of amino acids. Our group has used global optimization as a promising approach to its solution because it is believed that in most cases the native state corresponds to the global, or very low-lying, free energy minimum (Anfinsen, 1973). Global optimization is required because the energy landscape of a realistic-sized protein has thousands of parameters and an enormous number of local minima that are potential false traps for native structure (Sugita and Okamoto, 1999; Vasquez et al., 1994). However, without a quantitative description of the free energy function describing both the proteins’ intramolecular forces and the intermolecular interactions with aqueous solvent, global optimization as a solution to the structure prediction problem would be useless. We must search

on a free energy surface where the global optimum, or very low-lying free energy minimum, has functional relevance. As is usual for global optimization approaches, the question is whether the sampling is sufficient or whether the energy or scoring function is adequate for discriminating native structure from misfolded states. Atomic nonpolarizable empirical force fields such as AMBER (Cornell et al., 1995; Hornak et al., 2006; Wang et al., 2000; Weiner et al., 1984, 1986), CHARMM (Brooks et al., 1983; MacKerell et al., 1998), GROMOS (Scott et al., 1999), ECEPP (Momany et al., 1975; Nemethy et al., 1983; Sippl et al., 1984), and OPLS (Jorgensen et al., 1996; Kaminski et al., 2001) are fairly simple mathematical functional forms that are reasonable and physically motivated approximations of the potential energy surface for proteins. Empirical protein force fields such as these have been shown to be effective in both protein folding and protein structure prediction (Ponder and Case, 2003). An important demonstration of their effectiveness is the study of native folds and misfolds of the protein sequences of hemerythrin, a predominantly alpha protein, and immunoglobulin VL domain, a predominantly b sheet structure. When the sequences of the two were threaded through the other tertiary structure, the gas phase energy values (i.e., no protein-solvent interactions) for the native folds were comparable to their misfolds. However, the addition of a simple solvation description raised the energy of the misfolded structures, demonstrating that these protein force fields combined with an adequate solvent model can perform effectively in discriminating between correct folds and misfolds (Novotny et al., 1984). Modeling the interaction of a protein with aqueous solvent can be accomplished with any of the independently developed liquid water force fields, such as the TIP (Jorgensen et al., 1983; Jorgensen and Madura, 1985; Mahoney and Jorgensen, 2000) or SPC (Berendsen et al., 1981) series of models, and with more recent versions that are parameterized for long-range electrostatic interactions, such as TIP4P-Ew (Horn et al., 2004) and TIP5P-Ew (Rick, 2004), as well as with polarizable water models, such as TIP4P-pol2 (Chen et al., 2000), TIP4PFQ (Rick, 2001; Rick et al., 1994), and SPC-FQ (Rick et al., 1994). However, the use of an all-atom water potential is computationally expensive in the context of global optimization and is possibly not needed in structure

Structure 15, 727–740, June 2007 ª2007 Elsevier Ltd All rights reserved 727

Structure Hydrophobic PMF for Protein Structure Prediction

prediction if the physics of hydration can be adequately described by coarse-grained descriptions. For example the Generalized Born (GB) (Constanciel and Contreras, 1984; Hawkins et al., 1995, 1996; Onufriev et al., 2000, 2004; Still et al., 1990) model represents the electrostatic polarization free energy as the interaction between the protein’s charge distribution enclosed in a low dielectric region and the reaction potential it induces in the surrounding high dielectric solvent. The GB description integrates out the degrees of freedom of explicit water and replaces them with an effective high dielectric constant. The GB treatment has been reasonably successful in both protein folding and protein structure prediction (Dominy and Brooks, 2002; Feig and Brooks, 2002; Felts et al., 2002; Garcia and Sanbonmatsu, 2002; Mitsutake et al., 2001; Pande and Rokhsar, 1999; Simmerling et al., 2002; van der Vaart et al., 2000; Vorobjev et al., 1998). A popular coarse-grained model of aqueous solvation involves combining the GB electrostatics model with a term proportional to the solvent-accessible surface area (SASA) (Chothia, 1974; Eisenberg and McLachlan, 1986; Lee and Richards, 1971; Sharp et al., 1991; Shrake and Rupley, 1973; Spolar et al., 1989; Swanson et al., 2004) to approximate the nonpolar interactions induced by the aqueous solvent. It is evident from theoretical considerations that the SASA model is incapable of accurately describing nonpolar solvation forces at molecular-length scales on the order of a water molecule diameter, instead being more useful at ‘‘large’’ length-scales when more extended hydrophobic surfaces interface with the water solvent (Chandler, 2005; ten Wolde, 2002). Thus, alternative models to SASA are needed to reproduce hydration phenomena on the molecular scale. For example, the hydrophobic effect (the thermodynamics of transfer of hydrophobic solutes from organic solvents or vacuum into water) has been modeled in one case by supplementing the SASA model with additional volume and dispersion interactions (Wagoner and Baker, 2006), while Galliccho and Levy discard the SASA model altogether in favor of a nonpolar estimator to the work of cavity formation and dispersion interactions between the solute atoms and the solvent (Gallicchio and Levy, 2004). Using both wide-angle and small-angle solution scattering experiments and computer simulation (Head-Gordon et al., 1997; Hura et al., 1999; Pertsemlidis et al., 1996, 1999; Sorenson et al., 1999), we have studied hydrophobic interactions between model hydrophobic peptides such as N-acetyl-leucine-methylamide from low to very high concentrations (0.1–2.0 M). That these solutions never phase separate at the highest concentrations studied suggested to us that small length-scale hydration physics (Pratt and Chandler, 1977) is operative for the folding and stabilization of globular proteins (Sorenson et al., 1999). As such, the influence of water on the free energy of association of two small hydrophobic groups in water exhibits a potential of mean force with two minima separated by a barrier: one for the hydrophobic molecules in contact and one for the hydrophobic groups separated

Figure 1. Hydrophobic Potential of Mean Force The sum of HPMF and Lennard-Jones energy is plotted against Rij, the distance between a pair of aliphatic and aromatic carbons.

by a water layer (Figure 1). This model of the hydrophobic interaction was introduced by our group in application to protein structure prediction (Crivelli et al., 2002) to replace SASA models. Its novelty in the context of structure prediction and protein energetics is the many-body effect of water’s influence on protein hydrophobic group interactions, and a desolvation barrier to demixing that is not described by SASA models. In this work we show that the replacement of the SASA description with a hydrophobic potential of mean force (HPMF) model allows better discrimination in favor of native states with a wide variety of small globular proteins. In total, our potential energy function to describe the protein and aqueous solvent free energy surface combines the AMBER99 protein force field (Wang et al., 2000) (VProtein), the Generalized Born (GB) description of the electrostatic component of solvent free energy (VGB) developed by Case and coworkers (Onufriev et al., 2000), and the hydrophobic potential of mean force to describe the solute-solute interaction induced by water (VHPMF): V = VProtein + VGB + VHPMF :

(1)

The performance of AMBER99 combined with the GB/ HPMF solvation model is then evaluated on a large selection of databases of so-called ‘‘decoy’’ structures that allow us to assess its ability to discriminate against misfolded structures in favor of the native state. We report tests of our energy function on 106 proteins using 7 publicly available decoy sets of greatly varying size and difficulty (Holm and Sander, 1992; Keasar and Levitt, 2003; Park and Levitt, 1996; Samudrala and Levitt, 2000; Simons et al., 1997; Tsai et al., 2003). We find that our energy function outperforms other published energy functions in terms of native ranking and native Z score on 91% of proteins, and does exceptionally well against the hardest Rosetta decoy sets, an outcome we attribute to the success of the HPMF model in particular. We also consider challenging our free energy description with nativelike decoys generated from finite temperature molecular

728 Structure 15, 727–740, June 2007 ª2007 Elsevier Ltd All rights reserved

Structure Hydrophobic PMF for Protein Structure Prediction

Table 1. EMBL Decoy Set AMBER99+GB

AMBER99+GB+HPMF

Protein

Decoy

Number of Residues

Decoy Rmsd (A˚)

1BP2

2PAZ

123

15.38

42.257

0.344

DE per Residue

DE

DE per Residue

DE 178.378

1.450

1CBH

1PPT

36

9.87

43.240

1.201

48.947

1.360

1FDX

5RXN

54

8.67

11.939

0.221

43.470

0.805

1HIP

2B5C

85

13.84

116.394

1.369

212.696

2.502

1LH1

2I1B

153

17.49

394.606

2.579

285.355

1.865

1P2P

1RN3

124

18.31

113.640

0.916

173.945

1.403

1PPT

1CBH

36

9.76

6.835

0.190

58.543

1.626

1REI

5PAD

212

18.08

278.791

1.315

60.054

0.283

1RHD

2CYP

293

21.56

534.787

1.825

417.677

1.426

1RN3

1P2P

124

18.35

126.357

1.019

147.419

1.189

1SN3

2CI2

65

12.48

112.805

1.735

143.908

2.214

1SN3

2CRO

65

10.77

59.923

0.922

118.553

1.824

2B5C

1HIP

85

14.60

33.580

0.395

2.167

0.025

2CDV

2SSI

107

14.14

140.142

1.310

110.474

1.032

2CI2

1SN3

65

24.93

109.698

1.688

143.420

2.206

2CI2

2CRO

65

23.45

40.205

0.619

108.143

1.664

2CRO

1SN3

65

10.71

192.941

2.968

235.277

3.620

2CRO

2CI2

65

10.66

173.368

2.667

187.076

2.878

2CYP

1RHD

293

21.42

2167.252

7.397

2456.140

8.383

2I1B

1LH1

153

17.52

101.107

0.661

262.069

1.713

2PAZ

1BP2

123

15.35

212.318

1.726

225.312

1.832

2SSI

2CDV

107

15.38

131.374

1.228

183.579

1.716

2TMN

2TS1

316

22.44

689.522

2.182

633.548

2.005

2TS1

2TMN

317

22.75

53.578

0.169

250.015

0.789

5PAD

1REI

212

18.07

514.832

2.428

593.937

2.802

There are 25 proteins whose native states have been determined by both X-ray and NMR. There is only one decoy per protein with no sequence length variations. DE is the difference in energy between the native and the decoy.

dynamics to sample states around the native state for 4PTI at four different temperatures. The thermal decoys reveal a potentially important role for side-chain entropy as a direction for future development of even more accurate free energy surfaces. RESULTS The results for all proteins used to evaluate our energy function are given in Tables 1–7 for the EMBL (Holm and Sander, 1992), 4-state-reduced (Park and Levitt, 1996; Samudrala and Levitt, 2000), LMDS (Keasar and Levitt, 2003; Samudrala and Levitt, 2000), fisa (Samudrala and Levitt, 2000; Simons et al., 1997), fisa-casp3 (Samudrala and Levitt, 2000; Simons et al., 1997), lattice (Samudrala and Levitt, 2000), and the Rosetta decoy sets (Simons et al., 1997; Tsai et al., 2003). We do not include 1BBA in the LMDS set since it does not have a well-defined na-

tive state. We also do not include protein decoy sets with heme prosthetic groups (hg_structural [Samudrala and Levitt, 2000]) or sequence length variations of greater than 10% with respect to the native structure (MMPBSA [Lee et al., 2001; Simons et al., 1997] and some parts of the Rosetta decoy sets) as part of our averages, although we do report our results on most of these decoys in Table 8 and Table 9. We have divided the Rosetta decoys developed in Simons et al. (1999) and Tsai et al. (2003) into two categories: (1) restricting the decoy set to X-ray-determined native structures and for %10% variation in sequence length (Rosetta-1) and (2) restricting the decoy set to native structures determined by NMR in which multiple models are reported and for %10% variation in sequence length (Rosetta-2). Restrictions on sequence length are necessary because it is ambiguous whether performance on decoys with >10% variation in sequence length is

Structure 15, 727–740, June 2007 ª2007 Elsevier Ltd All rights reserved 729

Structure Hydrophobic PMF for Protein Structure Prediction

Table 2. 4-State-Reduced Decoy Set AMBER99+GB

AMBER99+GB+HPMF

Protein

Fold Class

Native-State Resolution (A˚)

Number of Residues

Native Rank

Native Z Score

Native Rank

Native Z Score

1CTF

a/b

1.70

68

1

3.287

1

3.798

1R69

a

2.00

63

1

3.616

1

3.854

1SN3

a/b

1.20

65

1

3.151

1

5.029

2CRO

a

2.35

65

1

3.370

1

3.684

3ICB

a

2.30

75

1

2.073

2

2.129

4PTI

a/b

1.50

58

1

3.472

1

4.244

4RXN

a/b

1.20

54

1

4.042

1

4.005

There are seven proteins whose native states have been determined by X-ray crystallography. There is an average of 700 decoys per protein with no sequence length variations, and an rmsd range of 0.88–9.39 A˚.

a problem of the energy function or missing interactions necessary to stabilize the native fold. We find that discriminating for native protein structures solved by X-ray crystallography is ‘‘easier,’’ while the ensemble of native structures that are derived from the NMR refinement of solution measurements make native ranking more ambiguous due to nonideal packing (Zhang and Liu, 2006). For these 106 proteins, our energy function is able to correctly identify the native structure of 96 proteins. When we compare our energy function (AMBER+GB+ HPMF) with the baseline energy function (AMBER+GB) to evaluate the contribution of the HPMF term, the baseline function can only identify the native of 79 proteins. The average native Z score of the results produced by our energy function is 4.13 (and 4.50 when we correctly predict the native state) and 3.29 for the baseline function (and 3.92 when we correctly predict the native state). When we compare the structural change of the native structure after local minimization using our energy function, we find that the average post-rmsd of the native structure of each protein is 0.6 A˚, while the average post-rmsd

of the decoys is 0.2 A˚. We speculate that this difference stems from the model building used in X-ray crystallography or NMR refinement being a different energy function than that used to generate the decoys. We now focus on the ten proteins whose native structure is not ranked number one by our energy function. The native of 3ICB (4-state-reduced) is ranked number two, although the top-ranked structure is within 1.5 A˚ rmsd with respect to the native. We question whether this is in actuality a bad result since thermal motion at room temperature would likely populate ‘‘misfolds’’ in the 1.5 A˚ and maybe up to 3 A˚ range. In fact the reported NMR data for other proteins shows an ensemble with a spread of rmsd between 0.18 A˚ and 0.47 A˚ in the best case (1GB1) and between 1.30 A˚ and 5.41 A˚ in the worst case (1AJ3). That the energy versus rmsd plot for 3ICB when fit to a linear function gives a correlation coefficient (R value) of 0.69 (Figure 2) strongly suggests that the free energy surface for 3ICB is well-described by our model. Some of the proteins in which our energy function does not rank the native structure near the top have

Table 3. LMDS Decoy Set AMBER99+GB

AMBER99+GB+HPMF

Protein

Fold Class

Native-State Resolution (A˚)

Number of Residues

Native Rank

Native Z Score

Native Rank

Native Z Score

1B0N-B

a

1.90

31

29

1.684

40

1.433

1CTF

a/b

1.70

68

1

3.375

1

4.540

1DTK

a/b

NMR

57

1

3.527

1

4.261

1FC2

a/b

2.80

43

1

3.389

1

4.232

1IGD

a/b

1.10

61

1

4.286

1

4.477

1SHF-A

b

1.90

59

1

2.972

1

4.585

2CRO

a

2.35

65

1

5.649

1

7.914

2OVO

a/b

1.50

56

1

3.411

1

4.580

4PTI

a/b

1.50

58

1

4.888

1

6.575

There are nine proteins whose native states have been determined by both NMR and X-ray crystallography. There is an average of 500 decoys per protein with no sequence length variations, and an rmsd range of 2.45–13.47 A˚. 730 Structure 15, 727–740, June 2007 ª2007 Elsevier Ltd All rights reserved

Structure Hydrophobic PMF for Protein Structure Prediction

Table 4. Fisa and Fisa-CASP3 Decoy Set Number of Residues

AMBER+GB

AMBER+GB+HPMF

Native Rank

Native Z Score

Native Rank

Native Z Score

Protein

Fold Class

Decoy Set

Native-State Resolution (A˚)

1FC2

a

FISA

2.80

43

152

0.555

70

0.940

1HDD-C

a

FISA

2.80

57

64

1.180

1

2.454

2CRO

a

FISA

2.35

65

1

4.922

1

5.077

4ICB

a

FISA

1.60

76

1

3.588

1

4.487

1BG8-A

a

FISA-CASP3

2.00

76

1

3.105

1

3.725

1BL0

a

FISA-CASP3

2.30

99

15

2.130

1

3.493

1EH2

a

FISA-CASP3

NMR

79

18

2.162

1

3.078

1JWE

a

FISA-CASP3

NMR

114

1

4.745

1

5.285

The fisa decoy set has four proteins whose native states have been determined by X-ray crystallography. There is an average of 500 decoys per protein with no sequence length variations, and an rmsd range of 2.77–14.13 A˚. The fisa-CASP3 decoy set has four proteins whose native states have been determined by both NMR and X-ray crystallography. There is an average of 1500 decoys per protein with no sequence length variations, and an rmsd range of 3.63–20.87 A˚.

complications of either being a subunit of a larger protein or having prosthetic heme or metal groups. 1CC5 and 5ICB in Rosetta-1 and 2PAC in Rosetta-2 are heme-binding or metal-binding proteins. The prosthetic groups that bind to these proteins reside in the interior, and without them there can be destabilization in the core. The nonspecificity of interactions in the collapse of the cavity contributes to many possible misfolds that are energetically competitive with the native structure. Thus it is uncertain if the failure in identifying the native of these proteins is because of the energy function or the differences in destabilization of the native structures and misfolds caused by the missing prosthetic group. In fact we perform well on a number of heme-binding proteins in the Rosetta decoy sets. We also succeeded in identifying the native structure for many proteins that are a domain in a larger quaternary structure. However, we fail on the small protein 1BON-B (LMDS decoy set), as well as 1FC2 (fisa decoy set), which exposes a large amount of hydrophobic surface area. As

described by others (Felts et al., 2002; McConkey et al., 2003), the removal of other subunits can possibly destabilize the native structure of the target domain more so than the decoys. It has been suggested that 1BON-B is so small that other subunits are needed to stabilize it, while 1FC2 packs exposed hydrophobic surface area against other subunits. Although our energy function ranked the native state as number two for 2FDN in Rosetta-1 set and number 13 for 1APF in Rosetta-2 set, the top-ranked structure for these proteins have very large rmsds with respect to native and have very poor R values based on linear fits of energy versus rmsd across the whole decoy set. Other proteins such as 1ORC (native rank 100) and 1HYP (native rank 1038) seem to be resounding failures. These proteins offer more interesting analysis questions about the success of our energy function. In the cases of 2FDN, 1APF, and 1HYP proteins, there are no native-like decoys (with rmsd < 3.0 A˚). It is possible that having more native-like

Table 5. Lattice Decoy Set

Protein

Fold Class

Native-State Resolution (A˚)

Number of Residues

AMBER99+GB

AMBER99+GB+HPMF

Native Rank

Native Rank

Native Z Score

Native Z Score

1BEO

a

2.20

98

1

5.731

1

5.550

1CTF

a/b

1.70

68

1

4.609

1

5.311

1DKT-A

a/b

2.90

72

1

4.841

1

3.924

1FCA

a/b

1.80

55

3

3.071

1

3.068

1NKL

a

NMR

78

52

1.811

1

4.173

1PGB

a/b

1.92

56

1

3.710

1

5.910

1TRL-A

a

NMR

62

1

4.383

1

4.261

4ICB

a

1.60

76

1

3.790

1

3.524

There are eight proteins whose native states have been determined by both X-ray crystallography and NMR. There are 2000 decoys per protein with no sequence length variations and an rmsd range of 4.74–17.44 A˚. Structure 15, 727–740, June 2007 ª2007 Elsevier Ltd All rights reserved 731

Structure Hydrophobic PMF for Protein Structure Prediction

Table 6. Rosetta 1 Decoy Set AMBER99+GB

AMBER99+GB+HPMF

Native Rank

Native Rank

Protein

Fold Class

Native-State Resolution (A˚)

DB

Number of Residues

Native Z Score

Native Z Score

1AA2

a

2.00

Simons

105

1

4.453

1

5.623

1ACF

a/b

2.00

Simons

123

1

6.243

1

7.379

1AIL

a

1.90

Tsai

67

17

2.360

1

3.711

1BDO

b

1.80

Simons

75

1

3.742

1

5.357

1BQ9

a/b

1.20

Tsai

53

1

4.852

1

4.279

1CC5

a

2.50

Simons

76

123

1.008

77

1.391

1CEI

a

1.80

Tsai

85

1

3.401

1

4.895

1CSP

b

2.45

Tsai

64

3

3.123

1

3.812

1CTF

a/b

1.70

Tsai

67

1

3.263

1

3.835

1ECA

a

1.40

Simons

132

1

3.861

1

5.741

1ERV

a/b

1.65

Simons

105

1

5.003

1

6.936

1GVP

b

1.60

Simons

82

4

2.440

1

3.782

1HYP

a

1.80

Tsai

75

1597

1.014

1038

0.051

1KTE

a/b

2.20

Simons

100

1

4.083

1

5.176

1MBD

a

1.40

Simons

147

1

5.050

1

6.741

1MSI

a/b

1.25

Tsai

60

1

3.953

1

4.161

1ORC

a/b

1.54

Tsai

56

461

0.708

100

1.488

1PAL

a

1.65

Simons

100

15

1.906

1

3.928

1PDO

a/b

1.70

Simons

121

1

4.977

1

5.762

1R69

a

2.00

Tsai

61

1

3.475

1

3.802

1RIS

a/b

2.00

Simons

92

1

3.360

1

4.745

1TUC

b

2.02

Tsai

61

1

3.664

1

3.756

1TUL

b

2.20

Simons

97

1

3.514

1

4.536

1UTG

a

1.34

Simons

61

1

2.857

1

3.245

1VCC

a/b

1.60

Tsai

77

1

3.739

1

4.860

1VLS

a

1.85

Simons

143

1

2.461

1

3.321

1WHO

b

1.90

Simons

88

1

4.476

1

5.664

2ACY

a/b

1.80

Simons

92

1

4.396

1

5.756

2FDN

a/b

0.94

Simons

55

30

1.770

2

2.541

2FHA

a

1.90

Simons

160

1

3.411

1

6.187

2FXB

a

2.30

Tsai

81

1

3.584

1

4.101

2GDM

a

1.70

Simons

149

1

4.800

1

6.250

4FGF

b

1.60

Simons

121

1

5.122

1

6.855

5ICB

a

1.50

Tsai

72

187

1.281

41

1.791

5PTI

a/b

1.00

Tsai

55

1

3.546

1

3.243

There are 35 proteins whose native states have been determined by X-ray crystallography. There is an average of 1000 (Simons) or 2000 (Tsai) decoys per protein with sequence length variation of %10%, and an rmsd range of 2.18–33.53 A˚.

decoys might have pushed the clearly nonnative structures out of the top-ranked set, thereby substantially improving the Z score and R value for the linear fit. Another possibility is that proteins like 1ORC, which have a larger

amount of exposed hydrophobic surface area relative to the average measured over these 106 proteins, hint at the possibility of a transition in which the hydration free energy scaling with volume now scales with surface

732 Structure 15, 727–740, June 2007 ª2007 Elsevier Ltd All rights reserved

Structure Hydrophobic PMF for Protein Structure Prediction

Table 7. Rosetta 2 Decoy Set AMBER99+GB

AMBER99+GB+HPMF

Protein

Fold Class

Number of Residues

Native Rank

Native Z Score

1AJ3

a

95

17

1.738

Native Rank 1

Native Z Score 3.605

1APF

b

47

13

1.801

13

1.884

1ARK

b

55

1

2.843

1

2.993

1AYJ

a/b

46

1

3.172

1

3.923

1BTB

a/b

89

1

3.775

1

4.435

1C5A

a

62

1

3.266

1

4.811

1GB1

a/b

54

1

3.724

1

3.601

1GPT

a/b

47

8

2.355

1

3.307

1KSR

b

92

30

1.758

1

2.469

1POU

a

70

3

2.249

1

3.440

1SVQ

a/b

90

3

2.706

1

3.336

1WIU

b

90

12

2.136

1

2.783

2NCM

b

96

1

3.830

1

5.562

2PAC

a

77

18

1.698

10

1.944

The Rosetta subset of 14 proteins whose native states have been determined by NMR. There is an average of 1000 decoys per protein with sequence length variation of %10%, and an rmsd range of 1.94–28.79 A˚.

area. However, when we evaluate these proteins under the AMBER99/GBSA model, we do not find substantial improvement, although there may be a better implementation of GBSA than the one that we use here. Overall, our results appear to perform better than other scoring functions published in the literature that were tested on the same or similar decoy sets. The published scoring functions can be divided into physically derived energy functions (Dominy and Brooks, 2002; Felts et al., 2002; Gatchell et al., 2000; Hsieh and Luo, 2004; Lazaridis and Karplus, 1999; Lee and Kollman, 2001; Narang et al., 2006; Petrey and Honig, 2000; Zhu et al., 2003), similar in spirit to our energy function presented here, and those that are generated as a statistical scoring function based on the frequency of observations of atom or residue contacts in the PDB database, and sometimes combined with physical forces (Berglund et al., 2004; Dehouck et al., 2006; Fain et al., 2001; McConkey et al., 2003; Mukherjee et al., 2005; Shen and Sali, 2006; Wang et al., 2004). While it might seem an advantage to use knowledge-based scoring functions since they are believed to be more reliable in protein structure prediction, we found that our energy function does significantly better in both the native ranking and the native Z score, especially for the most challenging Rosetta sets, which generate 1000–2000 decoys with native-like features. Finally we consider challenging our free energy description with native-like decoys generated based on thermal fluctuations around the native state for 4PTI, using molecular dynamics and a GBSA model implemented in the AMBER program (Case et al., 2005) at four different temperatures, 200K, 273K, 293K, and 313K. At each temperature, 250 structures with a range of rmsds of 3–5 A˚

were generated around the native state; the resulting sampled structures were then locally minimized with our energy function. For the 250 structures generated at 200K, we found that the protein’s native structure is ranked first. However, the native state rank drops to 2nd for native-like decoys generated at 273K, 7th for nativelike decoys generated at 293K, and 18th for native-like decoys generated at 313K, out of a similar size set of 250 decoys at each temperature. Thus it is clear that decoys generated from high-temperature sampling near the native state provide a more difficult decoy set than the LMDS set (Table 10). We have analyzed the differences between the LMDS and the thermal decoys and have determined the following features (Table 10). Relative to the native structure, the decoys in the LMDS set have, on average, a larger radius of gyration (11.9 A˚ versus 11.0 A˚ for the native structure) and a greater amount of exposed nonpolar atoms (63% versus 44% for the native structure). Likewise, the temperature-dependent decoys that are well discriminated against by our energy function have a larger radius of gyration and a greater amount of nonpolar exposure, and therefore are similar to the LMDS set. However the handful of temperature-dependent decoys that have a lower minimized energy than the native structure have either (1) a significantly smaller radius of gyration and similar percentage of exposed hydrophobic groups, or (2) a greater amount of hydrophobic exposure and similar radius of gyration, relative to the native structure. Perhaps the more favorable thermal decoys are simply emphasizing the artificialness of the dogma of finding a ‘‘single’’ global minimum structure and that fluctuations of 3–5 A˚ are thermally accessible in the native basin. The

Structure 15, 727–740, June 2007 ª2007 Elsevier Ltd All rights reserved 733

Structure Hydrophobic PMF for Protein Structure Prediction

Table 8. Hg_Structural Decoy Set AMBER99+GB

AMBER99+GB+HPMF

Native Rank

Native Rank

Protein

Fold Class

Native-State Resolution (A˚)

Number of Residues

Native Z Score

Native Z Score

1ASH

a

2.15

147

1

3.174

1

3.195

1BAB-B

a

1.50

146

1

2.041

7

0.873

1COL-A

a

2.40

197

1

7.794

1

9.643

1CPC-A

a

1.66

162

1

6.772

1

5.837

1ECD

a

1.40

136

1

3.380

1

4.084

1EMY

a

1.78

153

1

2.422

1

1.981

1FLP

a

1.50

142

1

2.521

1

3.698

1GDM

a

1.70

153

1

4.445

1

5.893

1HBG

a

1.50

147

1

2.854

1

3.552

1HBH-A

a

2.20

142

1

2.523

1

2.223

1HBH-B

a

2.20

146

1

1.740

1

1.932

1HDA-A

a

2.20

141

1

1.455

1

1.650

1HDA-B

a

2.20

145

1

1.628

1

2.000

1HLB

a

2.50

157

12

0.260

17

0.022

1HLM

a

2.90

158

26

1.483

25

0.591

1HSY

a

1.90

153

4

1.803

3

1.857

1ITH-A

a

2.50

141

1

2.665

1

2.121

1LHT

a

2.00

153

2

1.797

4

1.547 2.062

1MBA

a

1.60

146

1

2.201

1

1MBS

a

2.50

153

22

0.510

22

1MYG-A

a

1.75

153

1

2.121

2

0.5248 1.844

1MYJ-A

a

1.90

153

4

1.801

5

1.624

1MYT

a

1.74

146

1

2.290

1

1.797

2DHB-A

a

2.80

141

1

1.730

4

1.213

2DHB-B

a

2.80

146

13

0.371

7

0.805

2LHB

a

2.00

149

1

3.571

1

3.920

2PGH-A

a

2.80

141

11

0.647

3

1.444

2PGH-B

a

2.80

146

2

1.612

3

1.496

4SDH-A

a

1.60

145

1

2.809

1

5.108

There are 29 proteins whose native states have been determined by X-ray crystallography. There are 30 decoys per protein with no sequence length variations and an rmsd range of 0.69–30.28 A˚.

other possibility is the known temperature dependence of the hydrophobic interaction in which the solvent-mediated attraction between small hydrophobic groups becomes stronger at higher temperatures (Baldwin, 1986). Alternatively, one might conclude that the side chains pack very differently in the various low-energy decoys in spite of their native-like backbone structure. The differences in sidechain packing have certainly been noted when comparing X-ray crystallographic and NMR structures (Gronenborn and Clore, 1995). In fact, recent work by Zhang and Liu (2006) has shown that X-ray structures exhibit a far greater side-chain entropy contribution than the corresponding

NMR structures. For example, the ensembles of superimposable rotomer states for each residue that do not involve steric clashes are far greater for X-ray structures than NMR structures. As noted in Zhang and Liu (2006), such entropic effects are not explicitly represented in NMR refinement procedures or in energy functions for structure prediction. DISCUSSION Hydrophobic interaction in aqueous solution is one of the important driving forces for a protein to fold into its

734 Structure 15, 727–740, June 2007 ª2007 Elsevier Ltd All rights reserved

Structure Hydrophobic PMF for Protein Structure Prediction

Table 9. MMPBSA Decoy Set AMBER99+GB

AMBER99+GB+HPMF Native Rank

Protein

Fold Class

Native-State Resolution (A˚)

Number of residues

Native Rank

Native Z Score

Native Z Score

1GAB

a

NMR

47

24

0.885

27

0.157

1LEB

a

NMR

63

1

5.782

1

6.382

1POU

a

NMR

70

1

2.315

1

3.618

1QYP

b

NMR

42

2

2.232

1

2.839

1SRO

b

NMR

66

1

1.293

1

1.889

1UTG

a

1.34

62

1

2.189

1

2.838

1UXD

a

NMR

43

1

1.496

2

2.059

1VIF

b

1.80

48

1

3.787

1

5.216

2CDX

b

NMR

54

1

3.367

6

1.148

2PTL

a/b

NMR

60

5

1.323

3

1.763

5ICB

a/b

1.50

72

4

1.156

4

1.529

5ZNF

a/b

NMR

25

1

2.815

1

1.729

There are 12 proteins whose native states have been determined by both NMR and X-ray crystallography. There are 30 decoys per protein with sequence length variations, and an rmsd range of 1.34–11.63 A˚.

functional state (Dill, 1990; Kauzmann, 1959). Including a well-designed solvation model can enhance the ability of a physically motivated free energy function to discriminate the native and native-like structures from other, misfolded states. The large degree of hydrophobic and hydrophilic heterogeneity of typical protein sequences, with residues attached to a polar backbone of the polymer chain, means that complete demixing of hydrophobic side chains into a sequestered and pure hydrophobic core is an impossible outcome. In that case, most singledomain proteins would best be classified as biological structures dominated by a large number of weak nonpolar

Figure 2. Scatter Plot of 3ICB Decoys and Native Structure in the LMDS Set The energy, evaluated with our energy function, of each decoy and the native is plotted against its rmsd to the native. The native with an rmsd of zero sits on the y axis. The plot is fitted with a linear function (solid line), which gives an R value of 0.69. It demonstrates that our energy function can well model the energy surface of the protein.

interactions. In the language of hydration physics, the hydrophobic interaction relevant for protein folding is within the ‘‘wetting regime’’ dominated by entropic effects (Chandler, 2005; ten Wolde, 2002) and in which hydrophobic solvation free energy scales with volume and not surface area (Pratt and Chandler, 1977). SASA descriptions of aqueous solvation typically penalize for any surface exposure of hydrophobic residues to water, when in fact the many-body effect of water-water interactions can actually stabilize the exposure of hydrophobic groups to aqueous solvent. SASA descriptions may be more useful for multidomain proteins in which individual proteins pack exposed hydrophobic surface against other protein domains, and therefore surface area dependence in hydrophobic solvation may be a better model of the thermodynamic driver for their association (Liu et al., 2005). Although we did not extensively evaluate other protein force fields, we did find in our testing that the AMBER99 model developed by Wang et al, (2000) was a clear improvement over AMBER94 (Cornell et al., 1995). Another interesting comparison among physical energy functions is the relative performance of nonpolarizable and emerging polarizable force fields. The polarizable AMBER model developed by Duan and colleagues (Wang et al., 2006) has been tested on some of the same decoy set proteins we have examined here, but much less exhaustively, in part because it is more expensive to evaluate (Lee and Duan, 2004). While our energy function determines the correct native-state ranking more consistently than the polarizable force field, our average native Z score is higher than that found by the polarizable energy function, when evaluated over the same set of proteins. In some individual cases the polarizable force field shows a dramatic drop in Z score relative to our energy function, sometimes as much as four units in cases where both energy functions

Structure 15, 727–740, June 2007 ª2007 Elsevier Ltd All rights reserved 735

Structure Hydrophobic PMF for Protein Structure Prediction

Table 10. Thermally Sampled Decoy Sets Decoy Set

Native Rank

Native Z Score

low

<%NP>low

high

<%NP>high

LMDS

1

4.90





11.9

62.4%

200K

1

1.85





11.7

50.5%

273K

2

1.61

10.9

44.4%

11.2

47.9%

293K

7

1.51

11.0

50.1%

11.4

52.6%

313K

18

1.00

10.4

43.1%

10.9

47.6%

The energy ranking of the minimized native structure and locally minimized decoys generated from thermal fluctuations around the native state for 4PTI at 200K, 273K, 293K, and 313K, as well as comparison to the LMDS set. There are 250 thermal decoys generated at each temperature, and the rmsds with respect to native structure range from 3.0 to 5.0 A˚. Characteristics of the decoy sets include: average radius of gyration and percentage of nonpolar atoms that are exposed for decoys lower in energy than the native structure (low and <%NP>low) and the same quantities for decoys that have higher energy than the native structure (high and <%NP>high). For comparison, the radius of gyration for locally minimized native structure of 4PTI is 11.0 A˚, and the percentage of nonpolar atoms that are exposed is 43.5%.

rank the native as the lowest in energy. Thus polarizability may be a key level of physics for strongly discriminating native states from misfolds and even for helping to promote confidence in the best Model 1 submission in the context of the CASP exercise. We also generated more difficult decoy sets by thermal sampling near the native-state basin, which provides a better challenge for assessing the correctness of aqueous solvated protein free energy surfaces. Our energy function determines that there are a handful of energetically more favorable ‘‘misfolded’’ decoys that are characterized by having a smaller radius of gyration or greater percentage of exposed nonpolar atoms relative to the X-ray crystal structure. We have shown that the thermal decoys we generated reveal a clear temperature dependence on the effectiveness of our energy function that may be due to the missing description of side-chain entropy in our model. For example, the greater compaction of the lowenergy decoys could diminish the side-chain entropy of these structures and thereby restore the true free energy preference for the native structure. In summary, we have shown that a hydrophobic potential of mean force (HPMF) combined with a Generalized Born model for polarization of protein charge by the high dielectric solvent is an excellent solvation model for use in structure prediction on single-domain proteins. When compared to other physics-based energy functions, we find that we outperform these other tested energy functions, with both substantial improvements in native ranking and Z score drops of 0.5 to 2.5 units. In addition, we have found that our energy function also outperforms other tested energy functions on loop decoys that we will report on in the near future. EXPERIMENTAL PROCEDURES Summary of Decoy Sets Several early decoy sets were made publicly available through Decoys ‘R’ Us (Samudrala and Levitt, 2000), enabling the examination of the ability of different types of energy functions, ranging from knowledge-based energy functions to purely physics-based potentials (Berglund et al., 2004; Dehouck et al., 2006; Dominy and

Brooks, 2002; Fain et al., 2001; Felts et al., 2002; Gatchell et al., 2000; Hsieh and Luo, 2004; Lazaridis and Karplus, 1999; Lee and Kollman, 2001; McConkey et al., 2003; Mukherjee et al., 2005; Narang et al., 2006; Park and Levitt, 1996; Petrey and Honig, 2000; Shen and Sali, 2006; Simons et al., 1997; Wang et al., 2004; Zhu et al., 2003), to detect the native protein structure from a set of misfolded or ‘‘decoy’’ structures. These are either grossly misfolded structures (similar to the test on hemerythrin and the immunoglobulin VL domain described above) with relatively few decoys (1–30 misfolds) to large decoy sets (200–2000 misfolds) with misfolded structures that range over rmsd values between 1.0 A˚ and 20.0 A˚. More recent decoy sets have been designed to generate subtle misfolds that have more native-like structural elements, concentrated in an rmsd range of <3.0 A˚ (Simons et al., 1997; Tsai et al., 2003). The decoy sets are composed of target native structures that cover the all-a, all-b, and a/b fold classes derived from both X-ray and NMR data. We have used the decoy data sets listed in Tables 1–9 that differ substantially in how misfolded protein structures have been generated. The ordering of decoy sets in Tables 1–9 is meant to reflect an increasing level of difficulty for discriminating the native fold from misfolded structures. Greater levels of difficulty arise from several factors, including the creation of misfolds with significant native-like or reasonable generic protein secondary and tertiary features such as the Rosetta sets, the use of NMR structures to define the native state, when thousands instead of tens of nonnative conformations have been computer generated, proteins with prosthetic groups, and/or proteins that are part of larger domains. Energy Function Model The first term in Equation (1) has the following functional form: VProtein =

#bonds X

#angles X

i

i

kb ðbi  bo Þ2 +

+

#dihedrals X

kx ½1 + cosðnc + dÞ

i

+

kq ðqi  qo Þ2

N X

N X

i

i
qi qj + 43ij rij

"    #! 12 6 sij sij  : rij rij

(2)

The first three terms in Equation (2) represent the chain connectivity interactions, and the next two terms represent the nonbonded interactions, which comprise electrostatic and van der Waals interactions, respectively. We refer the reader to the literature for the definitions of all variables and parameters of the AMBER99 model (Wang et al., 2000). The second term in Equation (1), the Generalized Born energy (VGB) is given as

736 Structure 15, 727–740, June 2007 ª2007 Elsevier Ltd All rights reserved

Structure Hydrophobic PMF for Protein Structure Prediction

Table 11. Parameters of the Hydrophobic Potential of Mean Force Term in the Energy Function Parameters

1st Gaussian

2nd Gaussian

3rd Gaussian

c

3.81679

5.46692

7.11677

w

1.68589

1.39064

1.57417

h

0.73080

0.20016

0.09055



VGB = 

1 1 1  2 3p 3w

 XX i

i
qi qj ; fijGB ðrij Þ

  Rj  Ri ; bij = pðRi + Rs ÞðRj + Ri + 2Rs  rij Þ 1 + rij

(3a)

where qi and qj are the partial charge of atoms i and j, respectively, and fijGB is a continuous function in the form of " fijGB ðrij Þ = rij2 + Ri Rj exp

rij2 4Ri Rj

!#12 (3b)

;

where Ri and Rj are the Born radius of atoms i and j, respectively, and rij is the distance between them. We use the parameterization of Born radii as developed by Onufriev et al. (2000). The dielectric constant of the protein, 3p, is 4, and of the continuum aqueous solvent, 3w, is 80. The final term in Equation (1) is the hydrophobic potential of mean force, VHPMF, in which we use the following functional form:

VHPMF =

Nc X

tanhðSAi Þ

Nc X

type of atom i, and pij is a constant dependent on the connectivity between atoms i and j. bij is the solvent-accessible surface area of atom i removed by atom j:

where Ri, Rj, and Rs are the radii of atoms i, j, and the solvent probe respectively. We use the parameterization developed in Hasel et al. (1988). Local Minimization Procedure The native and decoy conformations are given as Cartesian coordinates of heavy atoms, but our energy function requires hydrogen positions to be specified as well. We use the CHARMM program (Brooks et al., 1983) to build the positions of the hydrogen atoms on the native and the given decoy structures. We evaluate the structures using our energy function both before and after local optimization. The structures are then optimized to their nearest local minimum using the BFGS (Broyden-Fletcher-Goldfarb-Shanno) limited memory quasiNewton method (Liu and Nocedal, 1989; Press et al., 1992). QuasiNewton methods are effective for large-scale minimization where the Hessian of the energy function is dense and gradient evaluations are expensive, as is the case for the energy function explored here. Limited memory quasi-Newton methods generate a sequence of approximate solutions by the iteration xk + 1 = xk  lk B1 k Vfðxk Þ;

tanhðSAj Þ

(5b)

(6)

i˛SAi >Ac

j˛SAj >Ac jsi 2    3 X rij  ck  ; hk exp  wk k=1

(4)

where the sum over i and j is over all aliphatic and aromatic carbon centers, and the evaluation of HPMF is restricted to carbon centers with an exposed surface area greater than an area cutoff, Ac. The hyperbolic tangent functions are used to create a continuous VHPMF function to smooth discontinuities introduced by the surface area cutoff (SA > Ac). Each of the three Gaussians is parameterized by position ck, depth hk, and width wk so as to describe the two minima and the barrier (Figure 1). Starting from an estimate of these parameters for a methanemethane potential of mean force profile in water, the HPMF parameters in Equation (4), including the value of Ac, were optimized in the context of Equation (1) over seven proteins in the 4-state-reduced decoy set given in Table 2. The HPMF parameters are listed in Table 11, and Ac was set to 6.0 A˚2 (by comparison, the total surface area of a carbon atom is 120.0 A˚2). We believe that this small training set is adequate since it generalizes so well to the much larger and more extensive test set, and furthermore the parameters are plausible physical values based on the methane-methane potential of mean force. We believe that the optimized cutoff, Ac, corresponds to the inherent error of the analytical but approximate estimate of the surface area term, and essentially atoms with exposed area less than Ac are in fact fully buried. While better numerical schemes exist that may be able to calculate the surface area more precisely, they are computationally expensive, while the cheaper analytical form has a correctible systematic error. The calculation of the solvent-accessible surface area of an atom, Ai, has the following functional form:

Ai = S i

YN  j=1

1

 pi pij bij ; Si

(5a)

where Si is the total solvent-accessible surface area of atom i when it is fully exposed to solvent, and pi is a constant depended on the atom

where Bk is an N 3 N positive definite matrix approximating the Hessian of f, and lk is a scalar step-length parameter. In the limited memory version, Bk is stored using only the 2m update vectors for the m most recent iterations, where m % 20. At each iteration, Bk is implicitly constructed by applying the saved vectors to some standard initial matrix such as the identity. Analysis of Native and Decoy Structures After local optimization, the native and the decoy structures of each protein are sorted in ascending order based on their energy. According to the thermodynamic hypothesis (Anfinsen, 1973), a protein’s native structure should be ranked first and therefore the lowest energy in the sorted list of native and decoys. Furthermore, the degree of discrimination of the native state from the decoys can be evaluated by calculating the Z score: Z score =

V V ; s

(7)

where V is the energy of the native, and V and s are the average energy and the standard deviation of the energy distribution of the misfolds, respectively. Z score measures the number of standard deviations of the native energy separated from the average energy of all of the misfolded structures. Molecular Dynamics for Sampling Near-Native States We also generated many near-native structures generated by thermal fluctuations around the protein native state. We take the native PDB file as the starting structure and use the molecular dynamics in the SANDER module of Amber 9 package (Case et al., 2005). The energy function used in the simulations is composed of the AMBER99 force field (Wang et al., 2000) and GBOBC model (Onufriev et al., 2004), which is one of the parameterized GB models. The tleap program is used to set atomic GB radii to MBondi2 values (Onufriev et al., 2004), and the molecular dynamics time step is 1 fs, and an Andersen thermostat (Andersen, 1980; Andrea et al., 1983) is used with velocity resampling every 1000 fs. The simulations are run at constant

Structure 15, 727–740, June 2007 ª2007 Elsevier Ltd All rights reserved 737

Structure Hydrophobic PMF for Protein Structure Prediction

temperatures of 200K, 273K, 293K, and 313K, and each simulation at each temperature is run for 3 ns and structures are collected every 15 ps. The generated structures and the corresponding native are then local minimized using our energy function, and the energy of the native is compared to the energy of the thermal decoy structures. ACKNOWLEDGMENTS

Crivelli, S., Eskow, E., Bader, B., Lamberti, V., Byrd, R., Schnabel, R., and Head-Gordon, T. (2002). A physical approach to protein structure prediction. Biophys. J. 82, 36–49. Dehouck, Y., Gilis, D., and Rooman, M. (2006). A new generation of statistical potentials for proteins. Biophys. J. 90, 4010–4017. Dill, K.A. (1990). Dominant forces in protein folding. Biochemistry 29, 7133–7155.

T.H.-G. gratefully acknowledges Schlumberger Fellowship for support while on sabbatical at the University of Cambridge and financial support under the NSF-Cyber program. M.S.L. would like to acknowledge support from the DOE/MICS program. N.L.F. thanks the Whitaker Foundation for a graduate research fellowship. We thank Harry Ming Tak Choi for his contribution on the early testing of the model. This research used resources of the National Energy Research Scientific Computing Center, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC0205CH11231.

Dominy, B.N., and Brooks, C.L. (2002). Identifying native-like protein structures using physics-based potentials. J. Comput. Chem. 23, 147–160.

Received: December 16, 2006 Revised: May 4, 2007 Accepted: May 7, 2007 Published: June 12, 2007

Felts, A.K., Gallicchio, E., Wallqvist, A., and Levy, R.M. (2002). Distinguishing native conformations of proteins from decoys with an effective free energy estimator based on the OPLS all-atom force field and the surface generalized born solvent model. Proteins 48, 404–422.

REFERENCES

Gallicchio, E., and Levy, R.M. (2004). AGBNP: an analytic implicit solvent model suitable for molecular dynamics simulations and high-resolution modeling. J. Comput. Chem. 25, 479–499.

Andersen, H.C. (1980). Molecular-dynamics simulations at constant pressure and-or temperature. J. Chem. Phys. 72, 2384–2393. Andrea, T.A., Swope, W.C., and Andersen, H.C. (1983). The role of long ranged forces in determining the structure and properties of liquid water. J. Chem. Phys. 79, 4576–4584. Anfinsen, C.B. (1973). Principles that govern folding of protein chains. Science 181, 223–230.

Eisenberg, D., and McLachlan, A.D. (1986). Solvation energy in protein folding and binding. Nature 319, 199–203. Fain, B., Xia, Y., and Levitt, M. (2001). Determination of optimal Chebyshev-expanded hydrophobic discrimination function for globular proteins. IBM Journal Research and Development 45, 525–532. Feig, M., and Brooks, C.L. (2002). Evaluating CASP4 predictions with physical energy functions. Proteins 49, 232–245.

Garcia, A.E., and Sanbonmatsu, K.Y. (2002). a-helical stabilization by side chain shielding of backbone hydrogen bonds. Proc. Natl. Acad. Sci. USA 99, 2782–2787. Gatchell, D.W., Dennis, S., and Vajda, S. (2000). Discrimination of nearnative protein structures from misfolded models by empirical free energy functions. Proteins 41, 518–534.

Baldwin, R.L. (1986). Temperature-dependence of the hydrophobic interaction in protein folding. Proc. Natl. Acad. Sci. USA 83, 8069–8072.

Gronenborn, A.M., and Clore, G.M. (1995). Structures of protein complexes by multidimensional heteronuclear magnetic-resonance spectroscopy. Crit. Rev. Biochem. Mol. Biol. 30, 351.

Berendsen, H.J.C., Postma, J.P.M., van Gunsteren, W.F., and Hermans, J. (1981). Interaction models for water in relation to protein hydration. In Intermolecular Forces, B. Pullman, ed. (Dordrecht, The Netherlands: D. Reidel Publishing Company), pp. 331–342.

Hasel, W., Hendrickson, T.F., and Still, W.C. (1988). A rapid approximation to the solvent accessible surface areas of atoms. Tetrahdron Computer Methodology 1, 103–116.

Berglund, A., Head, R.D., Welsh, E.A., and Marshall, G.R. (2004). ProVal: a protein-scoring function for the selection of native and near-native folds. Proteins 54, 289–302. Brooks, B.R., Bruccoleri, R.E., Olafson, B.D., States, D.J., Swaminathan, S., and Karplus, M. (1983). Charmm—a program for macromolecular energy, minimization, and dynamics calculations. J. Comput. Chem. 4, 187–217. Case, D.A., Cheatham, T.E., Darden, T., Gohlke, H., Luo, R., Merz, K.M., Onufriev, A., Simmerling, C., Wang, B., and Woods, R.J. (2005). The Amber biomolecular simulation programs. J. Comput. Chem. 26, 1668–1688. Chandler, D. (2005). Interfaces and the driving force of hydrophobic assembly. Nature 437, 640–647. Chen, B., Xing, J.H., and Siepmann, J.I. (2000). Development of polarizable water force fields for phase equilibrium calculations. J. Phys. Chem. B 104, 2391–2401. Chothia, C. (1974). Hydrophobic bonding and accessible surface area in proteins. Nature 248, 338–339. Constanciel, R., and Contreras, R. (1984). Self-consistent field-theory of solvent effects representation by continuum models—introduction of desolvation contribution. Theor. Chim. Acta 65, 1–11. Cornell, W.D., Cieplak, P., Bayly, C.I., Gould, I.R., Merz, K.M., Ferguson, D.M., Spellmeyer, D.C., Fox, T., Caldwell, J.W., and Kollman, P.A. (1995). A 2nd generation force-field for the simulation of proteins, nucleic-acids, and organic-molecules. J. Am. Chem. Soc. 117, 5179–5197.

Hawkins, G.D., Cramer, C.J., and Truhlar, D.G. (1995). Pairwise solute descreening of solute charges from a dielectric medium. Chem. Phys. Lett. 246, 122–129. Hawkins, G.D., Cramer, C.J., and Truhlar, D.G. (1996). Parametrized models of aqueous free energies of solvation based on pairwise descreening of solute atomic charges from a dielectric medium. J. Phys. Chem. 100, 19824–19839. Head-Gordon, T., Sorenson, J.M., Pertsemlidis, A., and Glaeser, R.M. (1997). Differences in hydration structure near hydrophobic and hydrophilic amino acids. Biophys. J. 73, 2106–2115. Holm, L., and Sander, C. (1992). Evaluation of protein models by atomic solvation preference. J. Mol. Biol. 225, 93–105. Horn, H.W., Swope, W.C., Pitera, J.W., Madura, J.D., Dick, T.J., Hura, G.L., and Head-Gordon, T. (2004). Development of an improved foursite water model for biomolecular simulations: TIP4P-Ew. J. Chem. Phys. 120, 9665–9678. Hornak, V., Abel, R., Okur, A., Strockbine, B., Roitberg, A., and Simmerling, C. (2006). Comparison of multiple amber force fields and development of improved protein backbone parameters. Proteins 65, 712–725. Hsieh, M.J., and Luo, R. (2004). Physical scoring function based on AMBER force field and Poisson-Boltzmann implicit solvent for protein structure prediction. Proteins 56, 475–486. Hura, G., Sorenson, J.M., Glaeser, R.M., and Head-Gordon, T. (1999). Solution X-ray scattering as a probe of hydration-dependent structuring of aqueous solutions. Perspect. Drug Discov. Des. 17, 97–118.

738 Structure 15, 727–740, June 2007 ª2007 Elsevier Ltd All rights reserved

Structure Hydrophobic PMF for Protein Structure Prediction

Jorgensen, W.L., and Madura, J.D. (1985). Temperature and size dependence for Monte-Carlo simulations of Tip4p water. Mol. Phys. 56, 1381–1392. Jorgensen, W.L., Chandrasekhar, J., Madura, J.D., Impey, R.W., and Klein, M.L. (1983). Comparison of simple potential functions for simulating liquid water. J. Chem. Phys. 79, 926–935. Jorgensen, W.L., Maxwell, D.S., and Tirado-Rives, J. (1996). Development and testing of the OPLS all-atom force field on conformational energetics and properties of organic liquids. J. Am. Chem. Soc. 118, 11225–11236. Kaminski, G.A., Friesner, R.A., Tirado-Rives, J., and Jorgensen, W.L. (2001). Evaluation and reparametrization of the OPLS-AA force field for proteins via comparison with accurate quantum chemical calculations on peptides. J. Phys. Chem. B 105, 6474–6487.

Nemethy, G., Pottle, M.S., and Scheraga, H.A. (1983). Energy parameters in polypeptides. 9. Updating of geometrical parameters, nonbonded interactions, and hydrogen-bond interactions for the naturally-occurring amino-acids. J. Phys. Chem. 87, 1883–1887. Novotny, J., Bruccoleri, R., and Karplus, M. (1984). An analysis of incorrectly folded protein models—implications for structure predictions. J. Mol. Biol. 177, 787–818. Onufriev, A., Bashford, D., and Case, D.A. (2000). Modification of the generalized Born model suitable for macromolecules. J. Phys. Chem. B 104, 3712–3720. Onufriev, A., Bashford, D., and Case, D.A. (2004). Exploring protein native states and large-scale conformational changes with a modified generalized born model. Proteins 55, 383–394.

Kauzmann, W. (1959). Some factors in the interpretation of protein denaturation. Adv. Protein Chem. 14, 1–63.

Pande, V.S., and Rokhsar, D.S. (1999). Molecular dynamics simulations of unfolding and refolding of a beta-hairpin fragment of protein G. Proc. Natl. Acad. Sci. USA 96, 9062–9067.

Keasar, C., and Levitt, M. (2003). A novel approach to decoy set generation: designing a physical energy function having local minima with native structure characteristics. J. Mol. Biol. 329, 159–174.

Park, B., and Levitt, M. (1996). Energy functions that discriminate X-ray and near-native folds from well-constructed decoys. J. Mol. Biol. 258, 367–392.

Lazaridis, T., and Karplus, M. (1999). Discrimination of the native from misfolded protein models with an energy function including implicit solvation. J. Mol. Biol. 288, 477–487.

Pertsemlidis, A., Saxena, A.M., Soper, A.K., Head-Gordon, T., and Glaeser, R.M. (1996). Direct evidence for modified solvent structure within the hydration shell of a hydrophobic amino acid. Proc. Natl. Acad. Sci. USA 93, 10769–10774.

Lee, B., and Richards, F.M. (1971). Interpretation of protein structures—estimation of static accessibility. J. Mol. Biol. 55, 379. Lee, M.R., and Kollman, P.A. (2001). Free-energy calculations highlight differences in accuracy between X-ray and NMR structures and add value to protein structure prediction. Structure 9, 905–916. Lee, M.C., and Duan, Y. (2004). Distinguish protein decoys by using a scoring function based on a new AMBER force field, short molecular dynamics simulations, and the generalized born solvent model. Proteins 55, 620–634. Lee, M.R., Tsai, J., Baker, D., and Kollman, P.A. (2001). Molecular dynamics in the endgame of protein structure prediction. J. Mol. Biol. 313, 417–430.

Pertsemlidis, A., Soper, A.K., Sorenson, J.M., and Head-Gordon, T. (1999). Evidence for microscopic, long-range hydration forces for a hydrophobic amino acid. Proc. Natl. Acad. Sci. USA 96, 481–486. Petrey, D., and Honig, B. (2000). Free energy determinants of tertiary structure and the evaluation of protein models. Protein Sci. 9, 2181– 2191. Ponder, J.W., and Case, D.A. (2003). Force fields for protein simulations. Adv. Protein Chem. 66, 27–85. Pratt, L.R., and Chandler, D. (1977). Theory of hydrophobic effect. J. Chem. Phys. 67, 3683–3704.

Liu, D.C., and Nocedal, J. (1989). On the limited memory Bfgs method for large-scale optimization. Math. Program. 45, 503–528.

Press, W.H., Flannery, B.P., Teukolsky, S.A., and Vetterling, W.T. (1992). Numerical Recipes in C: The Art of Scientific Computing, Second Edition (Cambridge, UK: Cambridge University Press).

Liu, P., Huang, X.H., Zhou, R.H., and Berne, B.J. (2005). Observation of a dewetting transition in the collapse of the melittin tetramer. Nature 437, 159–162.

Rick, S.W. (2001). Simulations of ice and liquid water over a range of temperatures using the fluctuating charge model. J. Chem. Phys. 114, 2276–2283.

MacKerell, A.D., Bashford, D., Bellott, M., Dunbrack, R.L., Evanseck, J.D., Field, M.J., Fischer, S., Gao, J., Guo, H., Ha, S., et al. (1998). All-atom empirical potential for molecular modeling and dynamics studies of proteins. J. Phys. Chem. B 102, 3586–3616.

Rick, S.W. (2004). A reoptimization of the five-site water potential (TIP5P) for use with Ewald sums. J. Chem. Phys. 120, 6085–6093.

Mahoney, M.W., and Jorgensen, W.L. (2000). A five-site model for liquid water and the reproduction of the density anomaly by rigid, nonpolarizable potential functions. J. Chem. Phys. 112, 8910–8922. McConkey, B.J., Sobolev, V., and Edelman, M. (2003). Discrimination of native protein structures using atom-atom contact scoring. Proc. Natl. Acad. Sci. USA 100, 3215–3220. Mitsutake, A., Sugita, Y., and Okamoto, Y. (2001). Generalized-ensemble algorithms for molecular simulations of biopolymers. Biopolymers 60, 96–123. Momany, F.A., Mcguire, R.F., Burgess, A.W., and Scheraga, H.A. (1975). Energy parameters in polypeptides. 7. Geometric parameters, partial atomic charges, nonbonded interactions, hydrogen-bond interactions, and intrinsic torsional potentials for naturally occurring aminoacids. J. Phys. Chem. 79, 2361–2381. Mukherjee, A., Bhimalapuram, P., and Bagchi, B. (2005). Orientationdependent potential of mean force for protein folding. J. Chem. Phys. 123, 14901. Narang, P., Bhushan, K., Bose, S., and Jayaram, B. (2006). Protein structure evaluation using an all-atom energy based empirical scoring function. J. Biomol. Struct. Dyn. 23, 385–406.

Rick, S.W., Stuart, S.J., and Berne, B.J. (1994). Dynamical fluctuating charge force-fields—application to liquid water. J. Chem. Phys. 101, 6141–6156. Samudrala, R., and Levitt, M. (2000). Decoys ‘R’ Us: a database of incorrect conformations to improve protein structure prediction. Protein Sci. 9, 1399–1401. Scott, W.R.P., Hunenberger, P.H., Tironi, I.G., Mark, A.E., Billeter, S.R., Fennen, J., Torda, A.E., Huber, T., Kruger, P., and van Gunsteren, W.F. (1999). The GROMOS biomolecular simulation program package. J. Phys. Chem. A 103, 3596–3607. Sharp, K.A., Nicholls, A., Fine, R.F., and Honig, B. (1991). Reconciling the magnitude of the microscopic and macroscopic hydrophobic effects. Science 252, 106–109. Shen, M.Y., and Sali, A. (2006). Statistical potential for assessment and prediction of protein structures. Protein Sci. 15, 2507–2524. Shrake, A., and Rupley, J.A. (1973). Environment and exposure to solvent of protein atoms—lysozyme and insulin. J. Mol. Biol. 79, 351–371. Simmerling, C., Strockbine, B., and Roitberg, A.E. (2002). All-atom structure prediction and folding simulations of a stable protein. J. Am. Chem. Soc. 124, 11258–11259. Simons, K.T., Kooperberg, C., Huang, E., and Baker, D. (1997). Assembly of protein tertiary structures from fragments with similar local

Structure 15, 727–740, June 2007 ª2007 Elsevier Ltd All rights reserved 739

Structure Hydrophobic PMF for Protein Structure Prediction

sequences using simulated annealing and Bayesian scoring functions. J. Mol. Biol. 268, 209–225. Simons, K.T., Bonneau, R., Ruczinski, I., and Baker, D. (1999). Ab initio protein structure prediction of CASP III targets using ROSETTA. Proteins (Suppl. 3), 171–176. Sippl, M.J., Nemethy, G., and Scheraga, H.A. (1984). Intermolecular potentials from crystal data. 6. Determination of empirical potentials for O-H=O=C hydrogen-bonds from packing configurations. J. Phys. Chem. 88, 6231–6233. Sorenson, J.M., Hura, G., Soper, A.K., Pertsemlidis, A., and HeadGordon, T. (1999). Determining the role of hydration forces in protein folding. J. Phys. Chem. B 103, 5413–5426. Spolar, R.S., Ha, J.H., and Record, M.T., Jr. (1989). Hydrophobic effect in protein folding and other noncovalent processes involving proteins. Proc. Natl. Acad. Sci. USA 86, 8382–8385. Still, W.C., Tempczyk, A., Hawley, R.C., and Hendrickson, T. (1990). Semianalytical treatment of solvation for molecular mechanics and dynamics. J. Am. Chem. Soc. 112, 6127–6129. Sugita, Y., and Okamoto, Y. (1999). Replica-exchange molecular dynamics method for protein folding. Chem. Phys. Lett. 314, 141–151. Swanson, J.M., Henchman, R.H., and McCammon, J.A. (2004). Revisiting free energy calculations: a theoretical connection to MM/PBSA and direct calculation of the association free energy. Biophys. J. 86, 67–74. ten Wolde, P.R. (2002). Hydrophobic interactions: an overview. J. Phys. Condens. Matter 14, 9445–9460. Tsai, J., Bonneau, R., Morozov, A.V., Kuhlman, B., Rohl, C.A., and Baker, D. (2003). An improved protein decoy set for testing energy functions for protein structure prediction. Proteins 53, 76–87. van der Vaart, A., Bursulaya, B.D., Brooks, C.L., and Merz, K.M. (2000). Are many-body effects important in protein folding? J. Phys. Chem. B 104, 9554–9563.

Vasquez, M., Nemethy, G., and Scheraga, H.A. (1994). Conformational energy calculations on polypeptides and proteins. Chem. Rev. 94, 2183–2239. Vorobjev, Y.N., Almagro, J.C., and Hermans, J. (1998). Discrimination between native and intentionally misfolded conformations of proteins: ES/IS, a new method for calculating conformational free energy that uses both dynamics simulations with an explicit solvent and an implicit solvent continuum model. Proteins 32, 399–413. Wagoner, J.A., and Baker, N.A. (2006). Assessing implicit models for nonpolar mean solvation forces: the importance of dispersion and volume terms. Proc. Natl. Acad. Sci. USA 103, 8331–8336. Wang, J.M., Cieplak, P., and Kollman, P.A. (2000). How well does a restrained electrostatic potential (RESP) model perform in calculating conformational energies of organic and biological molecules? J. Comput. Chem. 21, 1049–1074. Wang, K., Fain, B., Levitt, M., and Samudrala, R. (2004). Improved protein structure selection using decoy-dependent discriminatory functions. BMC Struct. Biol. 4, 8. Wang, Z.X., Zhang, W., Wu, C., Lei, H.X., Cieplak, P., and Duan, Y. (2006). Strike a balance: optimization of backbone torsion parameters of AMBER polarizable force field for simulations of proteins and peptides. J. Comput. Chem. 27, 781–790. Weiner, S.J., Kollman, P.A., Case, D.A., Singh, U.C., Ghio, C., Alagona, G., Profeta, S., and Weiner, P. (1984). A new force-field for molecular mechanical simulation of nucleic-acids and proteins. J. Am. Chem. Soc. 106, 765–784. Weiner, S.J., Kollman, P.A., Nguyen, D.T., and Case, D.A. (1986). An all atom force-field for simulations of proteins and nucleic-acids. J. Comput. Chem. 7, 230–252. Zhang, J.F., and Liu, J.S. (2006). On side-chain conformational entropy of proteins. PLoS Comput. Biol. 2, 1586–1591. Zhu, J., Zhu, Q.Q., Shi, Y.Y., and Liu, H.Y. (2003). How well can we predict native contacts in proteins based on decoy structures and their energies? Proteins 52, 598–608.

740 Structure 15, 727–740, June 2007 ª2007 Elsevier Ltd All rights reserved