J. Mol. Biol. (1998) 278, 439±456
Calculations on Folding of Segment B1 of Streptococcal Protein G Felix B. Sheinerman and Charles L. Brooks III* Department of Molecular Biology, The Scripps Research Institute, 10550 North Torrey Pines Road, La Jolla, CA 92037, USA
We present an investigation of the folding thermodynamics and mechanism of segment B1 of streptococcal protein G. Molecular dynamics simulations of the fully solvated protein are used to probe thermodynamically signi®cant states at different stages of folding. We performed several unfolding simulations to generate a database of initial conditions. The database is analyzed and clustered. The cluster centers extracted from this database were then used as starting points for umbrella sampling of the folding free energy landscape under folding conditions. The resulting sampling was combined with the weighted histogram analysis method. One and two-dimensional free energy surfaces were constructed along several order parameters and used to analyze the folding process. Our ®ndings indicate that an initial collapse precedes the formation of signi®cant native structure. Elements of local structure originate in the regions of the protein shown to have higher H/2H exchange protection factors in early stages of folding. A non-native contact, observed experimentally at the N terminus of the a-helix in a peptide excised from the protein, is seen to pre-organize the chain in early stages of folding. Collapse and early structure formation yields a compact globule with a signi®cant number of water molecules present. Desolvation of the protein core is coincident with the ®nal stages of folding from the compact state. # 1998 Academic Press Limited
*Corresponding author
Keywords: protein folding; folding mechanism; potential of mean force; molecular dynamics; protein G
Introduction Since An®nsen elucidated the nature of the protein folding problem (An®nsen, 1973), understanding the sequence requirements for the development of a speci®c ®nal folded structure, quests for the ``rules'' that determine this sequence to structure process as well as the physical principles that govern the formation of the native conformation have been underway. The physical side of the protein folding problem, investigation of the forces and mechanisms governing formation of the native conformation, is extremely important for the general understanding of protein function. As protein function is de®ned by protein structure, the ability to predict conformational transitions caused by changes in external conditions is essential to our understanding of protein function. Abbreviations used: GB1, segment B1 of streptococcal protein G; NOE, nuclear Overhauser enhancement; CI2, chymotrypsin inhibitor 2; GuHCl, guanidine-HCl; MD, molecular dynamics; PMF, potential mean force. 0022±2836/98/170439±18 $25.00/0/mb981688
Recent years have witnessed signi®cant advancements in our understanding of the physical side of protein folding. These advances have stemmed from both experimental and theoretical progress. Developments on the theory side are based on both advances in analytical treatments of the phenomenon of folding and the elaboration of numerical simulation techniques (Bryngelson & Wolynes, 1987; Dill et al., 1995; Guo & Brooks, 1997; Guo & Thirumalai, 1995; Klimov & Thirumalai, 1996; Leopold et al., 1992; LutheySchulten et al., 1995; Sali et al., 1994; Saven & Wolynes, 1996; Shakhnovich, 1997; Shakhnovich & Gutin, 1989; Shoemaker et al., 1997). The theoretical developments have lead to the formulation of a ``new'' view of protein folding (Dill & Chan, 1997; Onuchic et al., 1997). This view focuses on the description of general statistical properties of the energy landscape of a protein rather than the characterization of speci®c intermediates occurring during the folding reaction. The evolution of theoretical concepts has been stipulated by the development of experimental techniques for monitoring fast events and detect# 1998 Academic Press Limited
440 ing short-lived folding intermediates with increasing structural detail. NMR spectroscopy has played an especially important role in many of the developed methods. For instance, onedimensional NMR has been used to monitor real-time folding of apo bovine a-lactalbumin (Balbach et al., 1995). The native state formed cooperatively from an intermediate formed during the dead-time of the experiment. Moreover, this intermediate was found to differ from both the native and unfolded states and to resemble an acid-induced equilibrium intermediate. In another study performed on ribonuclease A (Kiefhaber et al., 1995), an early unfolding intermediate was detected by the loss of chemical shifts of resolved resonances. This state could not be detected by either CD measurements or NMR hydrogen exchange experiments. Pulsedlabeling hydrogen exchange measurements have been used to characterize early intermediates transiently formed during folding. The earliest detectable intermediate formed during apomyoglobin refolding was shown to be structurally similar to its equilibrium structure under acidic conditions (Jennings & Wright, 1993). When augmented by mass spectroscopy (Miranker et al., 1996), pulsed hydrogen exchange has been demonstrated to be a powerful tool in the investigation of protein folding. The experimental technique which is most similar in spirit to the computational methodology we apply below is native state hydrogen exchange (Bai & Englander, 1996; Bai et al., 1995). This method characterizes non-native conformations that the protein populates with very low probability under native conditions. Complementing the analytical and simpli®ed model developments and the advances in experimentation have been the development and use of simulation methods. Approximately six years ago, the methods of molecular dynamics simulations were applied to the study of the intermediates and mechanisms of protein unfolding (Brooks, 1992; Daggett & Levitt, 1992; Fan et al., 1991; Mark & van Gunsteren, 1992). Since then, a signi®cant number of unfolding studies have been performed on different proteins (Alonso & Daggett, 1995; Cas¯ish & Karplus, 1994, 1995; Daggett et al., 1996; Hunenberger et al., 1995; Li & Daggett, 1994, 1996; Tirado-Rives & Jorgensen, 1993; Vijayakumar et al., 1993). The majority of these studies used high temperature to destabilize the native state of the protein. The subject of our work is the application of the methods of molecular dynamics simulation with detailed models of a solvated protein to the study of non-native states of this protein, coexisting, albeit at lower probability, with the native state under native conditions. Thus, our goal is to sample thermodynamically signi®cant states that the protein populates during different stages of folding, and thereby to reconstruct the mechanism by which it folds under native con-
Mechanism of Protein G Folding
ditions. From early simulation studies on equilibrium properties of liquids, it is known that sampling of a small fraction of all states accessible to a system may be suf®cient to obtain good energetic and structural properties (Rahman & Stillinger, 1971). We believe that similar ideas should apply to the sampling of localized regions of conformational space of proteins. Furthermore, methods such as the biased, or umbrella, sampling technique have been developed and applied to enhance sampling in particular conformational regions of interest and to study conformational transitions of different scales in biomolecules (Brooks & Case, 1993). We note that within this framework the explicit timescale of a process, e.g. the folding time, does not play a direct role. The sampling is of thermodynamic states and the biasing potentials used to expedite this sampling concentrate and direct the sampling process. Thus, processes that occur over very long times can be characterized, in principle, through the free energy landscape, with relatively short simulations of locally accessible states. Kinetic mechanisms, at least in the gross sense of early non-speci®c collapse versus concommittant collapse and tertiary structure formation, can also be deduced. Our studies build on the pioneering work of (Boczko & Brooks, 1995), who reported the initial formulation of the methodology and applied it to the thermodynamics of folding of a three-helix bundle. As a model system for our study we have chosen segment B1 of streptococcal protein G (GB1), a small (56 residues) protein stable in isolation with a melting temperature of 87 C at neutral pH. Both X-ray and NMR structures of this protein have been solved (Achari et al., 1992; Gallagher et al., 1994; Gronenborn et al.,
Figure 1. Structure of segment B1 of streptococcal protein G depicted using MOLSCRIPT (Kraulis, 1991).
441
Mechanism of Protein G Folding
1991). Topologically, the protein consists of a four-stranded b-sheet and an a-helix (Figure 1). The protein contains no proline residues, disul®de bonds or prosthetic groups and thus its folding is thought to represent the general features of folding of small proteins. The topological assembly of GB1 is somewhat similar to that of another small a/b protein, chymotrypsin inhibitor 2 (CI2). The folding of CI2 has been studied extensively both experimentally (Itzhaki et al., 1995; Neira et al., 1996) and by simulations (Li & Daggett, 1994, 1996). GB1 has been used as a model system in a signi®cant number of experimental folding studies as well. The thermodynamics and kinetics of its folding have been investigated calorimetrically (Alexander et al., 1992a,b). Further, the study of Kuszewski et al. (1994) investigated the folding kinetics by quenched-¯ow deuterium-hydrogen exchange. In both kinetic studies, GB1 was observed to be a fast folding protein. In the study of Alexander et al., (1992b) GB1 was found to fold faster than 2 ms (the mixing time of the stopped-¯ow cell) under native conditions. Folding was monitored by the change in extinction of tryptophan and tyrosine residues (Trp43, Tyr3, Tyr33, Tyr45). In a deuterium-hydrogen exchange experiment (Kuszewski et al., 1994), that was carried out at pH 4.0, T 278 K and in the presence of 0.25 M GuHCl, the protein folded with a half-life of 5.2 ms, following initial collapse that occurred in less than 1 ms (dead-time of the quenched ¯ow apparatus). The intrinsic secondary structure propensities of the peptides excised from GB1 were studied by Blanco and co-workers (Blanco et al., 1994a,b; Blanco & Serrano, 1995). The authors observed that the C-terminal b-hairpin is stable in isolation and that the N-terminal b-hairpin and the a-helix are not. The urea-denatured state of the protein was characterized by multi-dimensional NMR spectroscopy (Frank et al., 1995). Although no indications of residual structure were observed, 15N relaxation experiments suggested that the conformational dynamics in the C-terminal b-hairpin at the N terminus and in the middle of the a-helix are somewhat restricted. Segment B1 of protein G was also used as a host system in the examination of b-sheet propensities of amino acid side-chains (Minor & Kim, 1994; Smith et al., 1994; Smith & Regan, 1995). It is also worth mentioning that GB1 is used as a test case for structure prediction algorithms. The overall topology of this protein was successfully reproduced by the algorithm recently developed by Skolnick and collaborators (Skolnick et al., 1997). In what follows we describe the methods and models used in our study. We spend signi®cant effort in describing and discussing our methods. We go to this effort because our approach is novel in folding studies employing all atom models. Following this description we present our ®ndings on the thermodynamics and mechanism of folding of
Figure 2. Potential of mean force (PMF), kcal/mol, as a Ê ) and the fracfunction of the radius of gyration, RG(A tion of native contacts broken, r. Contours are drawn every 0.5 kcal/mol.
segment B1 of streptococcal protein G. In the Appendix we give additional technical details of our simulations.
Results and Discussion Initial collapse The free energy landscape for the folding of GB1 is illustrated in Figure 2. The ®rst feature, apparent from an examination of the free energy surface as a function of the number of native contacts and the radius of gyration, is an almost vertical descent of the radius of gyration accompanied by the formation of a relatively small fraction of native contacts. Collapse to a radius of gyration of 110% of the native state value is associated with the formation of 35% of the native contacts. We do not observe preferential formation of hydrophobic or hydrophilic native contacts at this stage of folding. The shape of the folding surface of GB1 is close to a ``reversed-L shape''. This picture differs from the analysis of the folding of a small all-a-helical protein performed in this laboratory (Guo et al., 1997). The potential of mean force, as a function of the radius of gyration and the number of native contacts, is closer to the diagonal for this protein. An explanation for this difference might be that a higher fraction of local contacts in helical proteins can form while the polypeptide chain is still expanded. Thus, the general mechanism of folding may be related to the ®nal folded topology. Support for this idea comes from at least two other theoretical studies. Luthey-Schulten et al. (1995) performed an analysis of the formation of the secondary structure in a helix-forming heteropolymer. Their analysis indicates that, in the polymer model
442 considered, the overall collapse and the reduction of conformational entropy is accompanied by the formation of a signi®cant fraction of helical structure. Also, Guo & Brooks (1997) explored the folding of a minimalist off-lattice model of an all-b protein and observed a similar ``L-shaped'' free energy surface. Several experimental techniques can be used to assess the degree of compactness of the protein at different stages of folding. These methods as well as associated problems in their interpretation have been reviewed (Miranker & Dobson, 1996). It will be very interesting to see whether the observation that the folding mechanism is linked to the folded topology holds generally. We do not observe signi®cant free energy barriers in the potential of mean force as a function of the several order parameters we examined (number of native contacts, radius of gyration, number of water molecules in the protein core (see below)). An estimation of the kinetic rate of the folding process based on a free energy surface for such a multi-dimensional problem as protein folding represents a signi®cant challenge. We do not attempt to make such an estimate here. The solution to this problem might be sought in the consideration of protein folding as a diffusion process, and the development of a Fokker-Planck type equation to relate the rate of the diffusion on a multi-dimensional surface to the shape of the free energy as a function of several ``progress'' coordinates. In one recent lattice study it was shown that the rate of folding of the model 27-mer could be predicted with the use of Kramers' equation (Socci et al., 1996). The formalism is based on considerations of the rate of diffusion in the pre-barrier basin and the height of the free energy barrier, calculated along the coordinate de®ned by the total number of native contacts. The development of corresponding machinery for the kinetics of folding of realistic models of solvated proteins is certainly a much more involved task. We note here that a new estimate of the rate of the diffusion of a polypeptide chain in the
Mechanism of Protein G Folding
denatured state of proteins, based on the nanosecond-resolved spectroscopy (Hagen et al., 1996), suggests that the free energy barrier for such a fast folding protein as GB1, if it exists, is not high. Folding of a protein controlled principally by diffusion is envisioned as one of the possible scenarios (scenario ``0``) of protein folding in the landscape analysis of Wolynes and collaborators (Bryngelson et al., 1995). Very fast folding, occurring seemingly without a signi®cant free energy barrier, is observed for a number of small proteins (see Eaton et al. (1997), for a review).
Structure of the initial collapsed state In the study of GB1 folding by quenched deuterium-hydrogen (2H-H) exchange (Kuszewski et al., 1994), the authors reported the formation of an initial collapsed state during the dead-time of the experiment (<1 ms). In this state, several amide groups displayed higher protection factors than those expected for random coil, suggesting partial shielding from the solvent. The authors also monitored the ¯uorescence of a single tryptophan residue present in GB1 (Trp43). The ¯uorescence intensity of this residue is reduced in the denatured state when the residue is exposed to the solvent, relative to the native state, due to solvent quenching. Kuszewski et al. (1994) observed that the tryptophan ¯uorescence intensity recovers faster (less than 5 ms, dead-time of the stopped-¯ow apparatus) than the formation of native structure as monitored by 2H-H exchange. The experimental conditions of the 2H-H exchange experiment differ from the native conditions used in our calculation (as mentioned above, the experiment was carried out at pH 4, T 278 K and in the presence of 0.25 M GuHCl). There is also not suf®cient information to directly map the experimentally observed initial collapsed state onto any particular part of our free energy landscape. We therefore compared the apparent initial state rapidly formed
Figure 3. Accessible surface area Ê 2) of the Trp43 as a func(ASA, A tion of r (a) and the radius of gyraÊ ) (b). The ASA for tion RG(A structures was collected every 20 ps throughout the free energy sampling for all r-based partitions (shown at crosses). The bold line in (a) indicates the average value of the accessible surface area at a given point on the reaction coordinate. The straight line in (b) represents the best linear ®t of ASA versus RG.
Mechanism of Protein G Folding
443
Figure 4. Formation of native contacts along the reaction coordinate. The lower triangle of each panel contains contacts that possess high probability of formation, p > 0.5, in a corresponding region of the reaction coordinate. The upper triangle contains all 54 native contacts. The six panels correspond to the following regions along the reaction coordinate: 1, 0.7 < r < 0.8; 2, 0.6 < r < 0.7; 0.5 < r < 0.6; 4, 0.4 < r < 0.5; 5, 0. < r < 0.4; 6, 0.2 < r < 0.3.
in the experiment to the most unfolded states sampled in our calculation. We ®rst examined the exposure of the tryptophan residue. Figure 3(a) shows the accessible surface area of the side-chain of this residue along the reaction coordinate. Crosses indicate structures collected every 20 ps. The average value of the accessible surface area at a given value of the reaction coordinate is shown by the line. The exposure of the tryptophan side-chain in the fully unfolded Ê 2. Although in some structures this state is 212 A residue is almost completely exposed, on average, the degree of exposure in the most unfolded structures (based on the number of native contacts), is closer to the native state value than to the fully unfolded one. Figure 3(b) shows the exposure of the tryptophan residue as a function of the radius of gyration, a measure of the protein's overall compactness. The correlation is not particularly high (Pearson coef®cient: r 0.45). We conclude that in the large r region of our reaction coordinate, the ¯uorescence intensity of the tryptophan residue is
likely to already be close to its native state value, in accord with rapid quenching of the tryptophan ¯uorescence seen experimentally (Kuszewski et al., 1994). We next examined the structure present in the most unfolded part of the database. In Figure 4 (panel 1) we show contacts present in more than 50% of structures that project between 0.7 and 0.8 on the reaction coordinate (have 20 to 30% of total number of native contacts). The contacts emerge at the beginning of the a-helix, in the turn connecting the third and fourth b-strands and in the middle of a-helix (contact between residues 29 and 33). A contact also develops between the C terminus of the a-helix and the ®rst b-strand (residues 37 and 7). This result correlates very well with the pattern of protection factors found in the initial collapsed state in the 2H-H exchange experiment. Higher than average protection factors were found in the b3-b4 turn (Lys50, Thr51), the N terminus of the a-helix (Thr25, Ala26) and to a lesser degree
444
Figure 5. (a) Probabilities of formation for two native contacts between the N terminus of the a-helix and the b3-b4 turn. Probabilities are plotted along the reaction coordinate r. The contacts involve residues Ala23 and Lys50 (continuous line) and Glu27 and Phe52 (dotted line). (b) Probability of formation of a non-native contact along the reaction coordinate r. The contact is formed between residues Leu7 and Phe30. (c) The non-native contact formed in the 19± 41 fragment of GB1 (Blanco et al., 1997) pre-organizes the polypeptide chain early in folding. The non-native contact Val21-Ala26 is shown as the continuous line and the native contact Ala20-Ala26 is shown as a broken line.
in the middle of the a-helix (Lys31, Gln32 and Tyr33). The N terminus of the a-helix and the b3-b4 turn are spatially close in the native state of GB1. However, the absence of contacts between these two regions and visual examination of the protein con-
Mechanism of Protein G Folding
formations in this region of the sampling indicate that these regions are not adjacent in space in the early stage of folding in our calculations. Rather, structure forms independently in these two loci of GB1. In Figure 5(a) we show the probabilities of contacts between the N terminus of the a-helix and the b3-b4 turn along the reaction coordinate. It is seen that these probabilities ¯uctuate and approach one only at a late stage of folding. We observe one non-local non-native contact which persists in the early stages of folding. This contact forms between hydrophobic residues Leu7 and Phe30 (Figure 5(b)). In the native structure both of these residues are buried but displaced with respect to each other. Phe30 is located lower on the a-helix than the region contacting Leu7. The distance between the centers of geometry of the side-chains of these two residues in native state is Ê. 7.8 A Recently, Blanco et al. (1997) reported determination of a non-native contact present near the N terminus of the a-helix in a peptide excised from GB1 comprising the a-helix together with two loops on two ends of the helix (residues 19 to 41). We observed that this contact, Val21-Ala26, is present with signi®cant probability in the most unfolded structures from our database (Figure 5(c)). As the protein folds, this contact is replaced with a native contact Ala20-Ala26. Thus, this non-native contact, which is a hydrophobic staple, preorganizes the polypeptide chain early in folding. Blanco et al. suggest that formation of such a hydrophobic staple at the N terminus of an a-helix might partially stabilize the helix. The stabilization effect of this contact alone is probably not particularly high, as the helicity of the studied peptide is less than 10% as determined by circular dichroism spectroscopy. Blanco & Serrano (1995) also investigated the secondary structure propensities of peptides excised from GB1 that correspond to three segments of secondary structure. They found that the peptide corresponding to the a-helix and comprising residues 21 through 40 does not form any detectable helical conformation under native conditions. The far-ultraviolet CD spectrum was very close to that of a random coil and indicated a possible helical content of less than 5%. The assignment of NOEs con®rmed that the peptide is largely unstructured. We note that formation of non-local contacts Phe30-Leu7 and Asn37-Leu7, along with a hydrophobic staple at the N terminus of the helix, may explain early stabilization of the portion of the a-helix observed both in the pattern of protection factors (Kuszewski et al., 1994) and in our calculations. Good correlation of structural characteristics from the most unfolded states sampled in our calculation with the state rapidly formed in the folding experiments of Kuszewski et al. (1994) is very encouraging and allows us to suggest complementary information on the early stages of GB1 folding. Namely, we ®nd that non-native contacts play a
445
Mechanism of Protein G Folding
Ê ). Four panels of the Figure 6. Distances between oxygen and hydrogen atoms of native hydrogen bonds, ROH(A Figure correspond to four regions of secondary structure: three interfaces of the b-sheet and the a-helix.
role in organizing the early tertiary structure formation and that the early collapse intermediate may not consist of a single ``core'' as in the ®nal folded structure. Secondary structure In Figure 6 we show formation of secondary structure of GB1 along the folding coordinate. The average distances between hydrogen bond donor and acceptor for native hydrogen bonds (see Methods, on how ``native'' hydrogen bonds were selected) are plotted. Only i, i 4 a-helical hydrogen bonds are shown in the panel corresponding to the a-helix. If we assume that a hydrogen bond forms with a signi®cant probability if the average donor-acceptor distance does not exceed Ê (Young & Brooks, 1996), we see that b3-b4 3.5 A and b1-b2 turns form at r 0.35, whereas b1-b4 forms only at r 0.2. With the exception of two hydrogen bonds located at the C terminus, the a-helix forms early in folding. Ten out of 12 hydrogen bonds are formed at r 0.6. We conclude that
the b-sheet forms simultaneously with overall structure assembly, whereas the a-helix forms at the stages when only a small fraction of native contacts is formed. Analysis of the dynamics of GB1 in the native state (Sheinerman & Brooks, 1997) indicated that the b-sheet is more mobile than the a-helix. The mobility was judged based upon the amplitude of slow (t > 20 ps) ¯uctuations. We have suggested that the mobility of the b-sheet indicates that this region is more labile for unfolding, and can be expected to unfold ®rst and fold last. The present observation that the a-helix forms before the formation of the majority of contacts, whereas the b-sheet is formed only when 80% of contacts are already in place, suggests that, at least for this system, this conjecture is valid. Solvation of the protein interior We performed the following calculation to characterize the solvation of the protein core at Ê different stages of folding. A sphere of radius 8 A
446
Figure 7. Potential of mean force as a function of the number of core water molecules (see the text for the de®nition of the core) and the reaction coordinate r. Contours are shown every 0.5 kcal/mol.
around the protein center of mass is de®ned (the radius of gyration of GB1 in the native state is Ê , the shape of the GB1 is close to spherical, 10.8 A and ratio of the largest moment of inertia of the protein to the smallest is 1.3), and all water molecules within this sphere, in all structures collected during equilibrium sampling, are counted. The radius of the sphere was chosen to be suf®ciently small so that the number of water molecules present in the native region of the conformational space would not be strongly affected by thermal ¯uctuations of the protein exterior, yet at the same time it was chosen to be suf®ciently large to capture penetration of water into the protein interior. Ê to yield a nearly conWe found the choice of 8 A stant (6 2) number of water molecules inside the sphere during native dynamics. We noticed that these water molecules are not bound in particular loci, but rather enter the sphere at random locations. As the protein unfolds, the number of water molecules inside the sphere increases to 50 to 60. Thus, such a number is well de®ned for the native state and discriminates the native state from the unfolded manifold. In this sense it represents an order parameter for the folding reaction. The potential of mean force as a function of the reaction coordinate r and the number of water molecules in the protein core is shown in Figure 7. Comparison of these results and the free energy surface shown in Figure 2 (see also Figure A1 in the Appendix) indicates that at the metastable state, where 0.2 < r < 0.45, the radius of gyration of the protein is close to its value in the native state, yet the protein interior is signi®cantly more solvated than in the native state. The number of water molecules in the protein core at this stage of folding is 2 to 2.5 times larger than in the native
Mechanism of Protein G Folding
state. Thus, we observe that in our detailed model the disruption of a small fraction of native contacts (20%), accompanied by an insigni®cant increase of the overall volume in the globule, is suf®cient to allow initial penetration of water into the protein core. This result lies in contrast to the description of a similar event by Shakhnovich & Finkelstein (1989). Several assumptions about protein structure and energetics (e.g. that secondary structure elements move apart as rigid blocks) have likely lead these authors to overestimate the barrier associated with this transition. The presence of water in the protein globule after overall collapse is likely to be important for the mechanism of folding. Examination of trajectories has revealed several cases in which backbone hydrogen bonds formed through a water-mediated intermediate. Water may also serve as an effective lubricant, facilitating structural transitions. Formation of native contacts The formation of native contacts along the folding coordinate is illustrated in Figure 4. The panels show contacts having greater than 50% probability of formation in each of six regions of the reaction coordinate of width 0.1, ranging from r 0.8 to r 0.2. The upper triangle of each contact map contains all 54 native contacts. In the initial stage of folding (panel 1, 0.7 < r < 0.8), structure forms in the N terminus of the a-helix and the b3-b4 turn. As folding progresses, additional contacts form in these two regions. On the other hand, the contact between residues Leu7 and Asn37 disappears and Leu7 forms a contact with Thr16, located on the second b-strand. As folding continues, struc-
Figure 8. Number of native contacts having a given probability of formation in four regions of the reaction coordinate.
447
Mechanism of Protein G Folding
ture forms in other regions of the protein, including contacts between the ®rst and fourth b-strands (residues Tyr3 and Lys50, and Leu7 and Val54), between the middle of the a-helix and the b3-b4 turn (residues Phe30 and Phe52), and between the a-helix and the ®rst b-strand (residues Ala26 and Tyr3), etc. Contacts that form in the latest stage of folding include those between the N terminus of the a-helix and the b3-b4 turn (residues Ala23 and Lys50, and Glu27 and Phe52), contacts between the ®rst and fourth b-strands (residues Asn8 and Thr55) and those between the second b-strand and the middle of the a-helix (residues Thr18 and Val29, and Thr18 and Tyr33). At the intermediate stages of folding, contacts form with probabilities which are broadly distributed between zero and one. Figure 8 displays the number of contacts that have a given probability of formation in four regions of the reaction coordinate. A large fraction of contacts form with probabilities between 0.2 and 0.8 in the middle of the folding process, as indicated by intermediate values of the reaction coordinate. As the protein approaches the native state, the probability distribution becomes more localized. A putative compact misfolded state In this section we comment on one peculiar feature of our free energy surface. As is seen from the free energy landscape in Figure 2, there is a free Ê and energy minimum in the region RG 10.7 A r 0.35. This metastable conformational basis is more compact than the native state (radius of gyraÊ ). This result may be an artifact of tion is 10.7 A the sampling scheme employed, sampling of this region of conformational space emerged from a single initial condition. However, when we doubled the sampling for this initial condition (to 200 ps instead of 100 ps) we observed that the trajectory remained in the same basin, suggesting a
Figure 9. Cumulative probability to observe a protein conformation along the reaction coordinate r.
Table 1. Energy difference between most folded and unfolded states Epp U!F (kcal/mol) 142 34
Epw U!F (kcal/mol)
Etotal U!F (kcal/mol)
ÿ239 70
47 17
local stabilization. Thus, if the protein enters this basin during folding it must expand to further progress to the native state. The basin contains structures with a well-formed b4-b1 sheet (®ve interstrand hydrogen bonds, compared to three identi®ed as ``native'') and a well-formed b1-b2 interface. The interface between the third and the fourth b-strands is poorly de®ned and the C terminus of the a-helix is unwound. In principle, such a ``kinetic trap'' could be detected with pulsed hydrogen exchange experiments (Baldwin, 1995). Energetic components of free energy In Figure 9 we show the cumulative probability, Kf, for the protein structure to populate states with r < r*: r exp
ÿb W
r dr Kf 00:8 0 exp
ÿb W
r dr where b 1/kBT is the Boltzmann factor. By examining Kf as a function of r* we see that the deepest minimum on the free energy landscape, the basin corresponding to 0 < r < 0.2, yields an accumulated population of 90%. The population approaches 99% by r* 0.5. Since we do not know how the free energy surface changes with temperature, nor the location of the barrier separating the folded and unfolded states at the melting temperature, the relation of the values for free energy differences between regions of the reaction coordinate to the folding free energy inferred from the melting experiments is unclear. However, a direct comparison can be made to the experiments performed under native conditions. Though in this case the correspondence will be de®ned by the particular means used to monitor the folding reaction. For example, if far-UV CD spectra is used to follow folding, then the states with r < 0.6, in which the helix is almost completely formed, are likely to be characterized as folded. As is seen from Figure 3, the trytophan residue is buried in almost all conformations we sampled, and thus all sampled states are inferred to be folded based on the measurements of tryptophan extinction. If we use r* < 0.6 to de®ne the native manifold, we ®nd that the overall stability of the native state is near 4 kcal/mol. The protein-protein and protein-water interaction energies as functions of both r and the radius of gyration are shown in Figure 10. The protein-protein interaction favors the folded state, whereas the protein-water interaction favors the unfolded manifold. It is also seen that the differ-
448
Mechanism of Protein G Folding
Figure 10. The protein-protein (continuous line) and protein-water (broken line) interaction energy (kcal/mol) along the reaction coordinate r (a) and the radius of gyraÊ ) (b). tion RG(A
ence between the most expanded states and the folded state is larger than the difference between the most unfolded states (de®ned by the number of native contacts) and the folded state. Table 1 contains the differences in the protein-protein and the protein-water interaction energies between the most folded and the most unfolded states. The values are similar to those obtained by Boczko & Brooks (1995) in their study of folding of threehelix bundle. The energy difference between the folded and the most unfolded states was found to be 142 kcal/mol for protein-protein interactions and ÿ239 kcal/mol for protein-solvent interactions. The total energy, also reported in Table 1, is the sum of the protein-protein, protein-water and water-water interactions. Here we de®ne the folded state by r < 0.2 and the most unfolded states by 0.7 < r < 0.8. The error is estimated by the deviation of the energy sub-averages (corresponding to 50 to 100 ps of sampling) over segments of the reaction coordinate of width 0.01 from the total average calculated over the folded and unfolded manifold of states intervals (0 < r < 0.2 for folded and 0.7 < r < 0.8 for unfolded). The error for average energies of the heterogeneous unfolded state is approximately twice as large as the error for the folded state. Thus, the error in the difference, p de®ned as s s2f s2u , is to a large part determined by the error of the average energy estimation for the unfolded state.
Conclusions In this work we applied an importance sampling technique with molecular dynamics simulations to characterize the folding landscape of segment B1 of streptococcal protein G. We used a full atomic representation of the solvated protein structure and investigated folding under native conditions. To summarize, the folding of the protein occurs as follows. Structure originates in two distinct regions of the polypeptide chain. Overall collapse is accompanied by formation of 35% of the native contacts. These contacts are delocalized to a number of regions in the sequence and are ¯uctuating.
Free energy surfaces along either the number of native contacts or the radius of gyration do not exhibit appreciable barriers under native conditions. The a-helix forms early in folding, while the b-sheet assembles concomitantly with the overall topology. Interestingly, water is present in the protein core until a late stage of folding and is only forced from the interior once 80% of the native contacts and all elements of secondary structure have formed. An interesting continuation of the work presented here is an ongoing investigation of the folding thermodynamics under elevated temperatures. This work is directed towards understanding of the temperature dependence of the potential of mean force and characterization of the folding process under the conditions where folded and unfolded states are more equally populated.
Methods Overview In this section we describe the methodology that was used in this study. Brie¯y, the study was performed as follows. First we performed two native state simulations to provide a reference state, de®ned with the same force ®eld, molecular model and methods to treat long-range interactions that were used throughout the subsequent folding study. Next, a database of initial conditions for sampling was generated from three high temperature unfolding simulations performed in the presence of solvent. The database of structures was partitioned based on the total number of native contacts into 16 partitions (or bins). A set of characteristics de®ned to discriminate between structures and a clustering algorithm were used to analyze the diversity of conformations within the bins. From the cluster centers, a subset of structures representing the natural partitioning at each value of our partition (number of native contacts) were extracted for use as initial conditions in biased molecular dynamics sampling. Each conformation was equilibrated under conditions where the folding thermodynamics were to be studied (298 K and 1 atm). Sampling of thermodynamically important states was performed along the coordinate de®ned by the total number of native contacts using the umbrella sampling technique. Finally, we combined the data from different simulations to obtain an
449
Mechanism of Protein G Folding estimate of the density of states using the weighted histogram analysis method. The description of the molecular dynamics (MD) simulation protocol, equilibration procedure, assessment of the precision of sampling and the technical computational details are given in the Appendix for completeness.
Generation of initial conditions To provide a reference point for our folding studies, we performed two simulations of the native state of the GB1. The detailed results of these simulations are reported elsewhere (Sheinerman & Brooks, 1997). To summarize, the results relevant to this study are the following. Our simulations revealed a stable protein with a well-de®ned core. Two independent simulations started from the NMR and the X-ray structures converged to a single equilibrium basis of conformations. Fluctuations around the average structure were small, indicating that the native state of the protein is stable in our force ®eld (at least in a kinetic sense, during 2 ns of simulation). The native state simulations allowed us to de®ne order parameters discriminating the native manifold from the other conformations of the polypeptide chain as well as to obtain structural characteristics of the native manifold (``native'' contacts and ``native'' hydrogen bonds) in the manner outlined below. Having characterized the native state of the protein, we proceeded to generate the set of initial conditions to probe the thermodynamically signi®cant states of the protein at different stages of folding. This database of initial conditions for importance sampling should satisfy two general criteria. It should be diverse, in that the database should span a broad range of conformations rather than a single possible pathway (if multiple ``pathways'' exist). At the same time the database should not contain conformations of very high energy, which are very unlikely to be relevant to the folding process. Based on the large body of work performed in different laboratories, and on our own experience, we chose high temperature unfolding of the native protein as a means to generate such a database. We observed that the presence of water during unfolding is essential. The examination of a test unfolding simulation performed in vacuo indicated that the hydrophobic residues became exposed very rapidly, leading to a signi®cant decrease of the solvation energy of individual structures estimated with the algorithm of Wesson & Eisenberg (1992). This was not observed when the unfolding was performed in the presence of solvent at the temperature of our unfolding simulations (400 K). We performed three unfolding simulations starting from three different structures taken from the equilibrium part of the native state simulation. The temperature was maintained at 400 K. In every other respect, the MD protocol employed was analogous to the one used in the sampling simulations, as described in the Appendix. Each simulation was performed for 0.8 ns. Although the time course of the protein unfolding was quite different in the three simulations (Figure 11), analysis of the generated conformations indicated that the simulations produced highly overlapping distributions of structures. This was indicated because the structures obtained in different unfolding runs project on highly overlapping clouds in the plane of the radius of gyration and the number of native contacts and because the majority of clusters, obtained in the analysis of the database (see
Figure 11. Fraction of native contacts broken as a function of time during three unfolding simulations. Each simulation was run for 0.8 ns. Temperature was maintained at 400 K. Time is given in nanoseconds. Each point in the Figure represents a running average of the native contact fraction over 2 ps of unfolding simulation.
below), contained structures from different unfolding simulations. We found that a relatively straightforward and computationally inexpensive equilibration protocol (see the Appendix) allowed us to obtain a satisfactory equilibration of the structures generated by high temperature unfolding. Equilibration was judged by the restoration of the solvent-solvent interaction energy, characteristic for the native conditions (comparison was made to the energy obtained in the native state simulations) and by the general stabilization of the protein conformation. Reaction coordinate and biasing potential To use a biased (umbrella) sampling technique one has to de®ne a coordinate that describes progress of the process and to which a bias is to be applied. One obvious requirement for such a coordinate is that different states should project onto different regions of the coordinate. For the protein folding reaction, such a coordinate should discriminate between the native state and non-native conformations of the protein. Certainly, the choice of such a coordinate for the reaction of protein folding is not unique. To de®ne a coordinate in our study we have chosen one of the parameters that discriminates between different conformations of the protein, the number of native contacts. Implementation of the biasing potential along this coordinate was reasonably straightforward and computationally inexpensive. We de®ne a contact as being present if the centers of geometry of side-chains of two residues (i, j) are within Ê of each other. We used two native state simu6.5 A lations to de®ne contacts present over a signi®cant fraction of both simulations, and hence essential for protein stability. In our de®nition of the reaction coordinate we included all contacts formed between residues not adja-
450
Mechanism of Protein G Folding
Table 2. List of native contacts for GB1 No. 1 2 3 4 5 6 7 8 9 10 11
1st res.
2nd res. No.
Met1 Tyr3 Tyr3 Tyr3 Tyr3 Lys4 Leu5 Leu5 Leu5 Ile6 Ile6
Tyr3 Ala20 Ala23 Ala26 Lys50 Thr17 Leu7 Thr16 Phe30 Glu15 Thr51
12 13 14 15 16 17 18 19 20 21 22
1st res. Ile6 Leu7 Leu7 Leu7 Leu7 Leu7 Asn8 Asn8 Thr16 Thr16 Thr18
2nd res. No. Thr53 Thr16 Ala34 Asn37 Val39 Val54 Lys13 Thr55 Thr18 Tyr33 Ala20
23 24 25 26 27 28 29 30 31 32 33
1st res. The18 Thr18 Ala20 Asp22 Asp22 Ala23 Ala23 Ala24 Ala24 Thr25 Ala26
2nd res. No. Val29 Tyr33 Ala26 Ala24 Thr25 Ala26 Lys50 Glu27 Lys28 Lys28 Val29
34 35 36 37 38 39 40 41 42 43 44
1st res. Ala26 Glu27 Val29 Val29 Phe30 Lys31 Ala34 Ala34 Ala34 Asn37 Val39
2nd res. No. Phe30 Phe52 Gln32 Tyr33 Phe52 Trp43 Val39 Trp43 Val54 Val39 Val54
45 46 47 48 49 50 51 52 53 54
1st res.
2nd res.
Val39 Trp43 Thr44 Asp46 Asp46 Asp46 Thr49 Lys50 Thr51 Thr53
Glu56 Val54 Thr53 Ala48 Thr49 Thr51 Thr51 Phe52 Thr53 Thr55
res, residue.
cent in sequence (i.e. for i, j residue pairs with j > i 1) and present in both reference native simulations for more than two-thirds of the simulation time. Fifty-four contacts formed by 41 residues were identi®ed in this fashion (Table 2). Amongst these 54 contacts, 11 are formed within the a-helix, 18 within the b-sheet, 14 between the a-helix and the b-sheet, and the remaining 11 include residues within loop regions. Twenty-four contacts are formed between hydrophobic and hydrophilic residues, 15 between hydrophobic and 15 between hydrophilic residues. We de®ne the state of each contact by a continuous analogue of a step function. The state of a contact, x(i), is 1 if the distance, d(i), between the centers of geometry of Ê , and 0 if the the side-chains involved is less than 6.5 A Ê . x(i) is given by the distance is larger than 7 A expression. x
i
1
1 exp
20
d
i ÿ 6:75
1
Then, the position of a structure on the reaction coordinate, r, is de®ned by the weighted sum of the states of the native contacts: r
p
i
1 ÿ x
i p
i
2
The weight of each contact, p(i), is de®ned based on the fraction of time that this contact was present in the native trajectory (all weights of selected contacts, thus, are larger than 0.66 and smaller than 1). The native state in this de®nition corresponds to 0 on our reaction coordinate, whereas 1 corresponds to states possessing no native contacts. Sampling along the reaction coordinate was performed using the umbrella sampling technique (Valleau & Torrie, 1977). This method has been successfully used in several studies of the thermodynamics of conformational transitions in biomolecules (Brooks & Case, 1993). The reaction coordinate was divided into 20 bins (only 16 of which were occupied by our initial conditions), each covering 0.05. A quadratic biasing potential of the same form was used in each window. For window i, the center of which was at r(i) 0.05 i ÿ 0.025, the biasing potential added to the energy function was: 1 U k
r ÿ r
i2 2
3
The force constant k was set to 2000 kcal/mol, so that the value of the biasing potential is kBT at the border of
the bin. The forces on individual atoms are constructed by the straightforward application of chain rule differentiation of equation (3) using equations (1) and (2). The direction of the force is de®ned by the position of the entire structure on the reaction coordinate. For example, if the structure is to the right of the bin center (there are too few contacts), the force is directed towards the formation of a contact. The maximal value of the force on a Ê )) can be compared to given atom (about 80 kcal/(mol. A the direct electrostatic force between two charged atoms Ê from each other (about 66 kcal/(mol. A Ê )). located at 5 A It is seen that atoms of the protein experience a bias only during the short period of breaking or forming of a contact. Fully folded as well as unfolded parts of the protein do not ``feel'' any bias during the simulation. Sampling of the density of states along the reaction coordinate was used to construct one and two-dimensional surfaces of the potential of mean force by tabulation of histograms in these variables for each initial condition (window). The data from different windows were combined with the weighted histogram analysis method (WHAM), originally developed by Ferrenberg & Swendsen (1989). Application of the method to the simulation of biological systems was worked out by Kumar et al. (1992) and Boczko & Brooks (1993). These references contain a general formulation of the methodology. We proceed with a description of the clustering procedure that was used to analyze the created databases of structures. Database analysis/clustering To identify appropriate conformations for umbrella sampling simulations from our database of initial conditions, we used a hierarchichal clustering augmented by a stopping rule, that determines which level of the hierarchy produces the best partitioning (Jain & Dubes, 1988). The stopping rule was developed essentially empirically (Xu et al., 1993). E. M. Boczko & C. L. Brooks (unpublished results) have shown that this rule can be rationalized in the context of information theory. Clustering is stopped at the level i, when the function: E
i
M
i ÿ M
i 1 ; i n ÿ 1; n ÿ 2; . . . 3; 2 s
i ÿ s
i 1
4
reaches its maximum. Here n is the number of datapoints being partitioned, M(i) is the minimal betweencluster distance at the level i, s2(i) (diab)2, where the summation is done over all inter-cluster distances at the level i. The level where E(i) is maximal represents a par-
451
Mechanism of Protein G Folding titioning at which clusters are compact and well separated. Several comparative studies, by Jain & Dubes (1988), conclude that the minimum variance, or the Ward's (1963) method, of updating the distances between newly formed clusters outperforms other methods. Thus, we used this method in our clustering algorithm. To perform clustering of a conformational database one must de®ne a set of descriptors of the protein structure and the dissimilarity function. We have chosen three classes of structural characteristics to describe the protein conformation. The descriptors are chosen to de®ne tertiary packing, secondary structure and overall solvation: ``native'' contacts, ``native'' hydrogen bonds and the solvation energy of the structure, respectively. Selection of the 54 native contacts is described above. All oxygen-hydrogen pairs, where the oxygen and hydrogen belong to the backbone of residues not adjacent in sequence, were considered for the de®nition of ``native'' hydrogen bonds. We termed the pairs that have an averÊ in both native simuage distance of less than 3.5 A lations, ``native'' hydrogen bonds. Forty-six ``native'' hydrogen bonds were identi®ed. Finally, the solvation energy of each structure was used as another descriptor. The solvation energy was de®ned using the solvation parameters of (Wesson & Eisenberg, 1992) based upon the accessible surface area of the atoms and the empirical vapor to water free energy transfer of side-chain analogs. Thus, 101 parameters describing the protein conformation were de®ned. We used a single dissimilarity function in the space of these 101 descriptors. Different classes of the descriptors were scaled by the total variance of the class in all databases spanning the range from fully folded to fully unfolded structures. If in a particular database produced at a given point along the reaction coordinate, the variance in one class of descriptors is large compared to the total variance between folded and unfolded conformations, the distance in the space of the descriptors from this class will make a large contribution to the dissimilarity function, and vice versa. Denoting by dCi(u) the distance of the ith contact in the structure u, by dHj(u) the distance of the jth hydrogen bond in the structure u and by SE(u) the solvation energy of the structure u, we de®ne the dissimilarity between structures a and b: DIS
a; b
54 X 1
dCi
a ÿ dCi
b2 var
dc i1
46 X 1
dHj
a ÿ dHj
b2 var
dH j1
1
SE
a ÿ SE
b2 var
SE
5
A particular example illustrates the in¯uence of the scaling of three classes of descriptors on the result of clustering. In the database constructed from the structures that project between 0.45 and 0.5 on the reaction coordinate (bin number 100), the largest variance occurs in the space of hydrogen bonds. The ratio of the variance of this class (hydrogen bonds) in this database to the total variance of this class is 26.5%. The corresponding variances for the classes of native contacts and the solvation energy are 23.5% and 24.7%. When clustering produced with a single dissimilarity function is compared to the clustering produced with a dissimilarity function de®ned in the space of hydrogen bonds only, it is seen that both
partitionings produce three clusters. Two centers out of three coincide. In this database the largest diversity, as compared to the total diversity from the folded to the unfolded state, occurs in secondary structure context. This factor played a signi®cant role in de®ning how the database was partitioned. We performed the hierarchical clustering with the dissimilarity function de®ned above in 16 partitions of the overall database, with the partition i consisting of structures projecting between 0.05 (i ÿ 1) and 0.05 i on the reaction coordinate, r.
Acknowledgments This work was performed as a part of the PhD dissertation of F.B.S. We thank members of the thesis committee, Drs J. Skolnick, D. Case and J. Onuchic for many fruitful comments during different stages of the project. We are grateful to Drs Hirst, Vieth, Guo, Shirley and Mr Nochomovitz for invigorating discussions and assistance with the preparation of the manuscript. F.B.S. is thankful to Dr E. Shakhnovich for a stimulating discussion. Funding from the National Institutes of Health (GM48807) is acknowledged. Computational assistance and resources from the Pittsburgh Supercomputing Center and the MetaCenter Program are appreciated.
References Acjari, A., Hale, S. P., Howard, A. J., Clore, G. M., Gronenborn, A. M., Hardman, K. D. & Whitlow, M. Ê X-ray structure of the B2 immunoglo(1992). 1.67-A bulin-binding domain of streptococcal protein G and comparison to the NMR structure of the B1 domain. Biochemistry, 31, 10449± 10457. Alexander, P., Fahnestock, S., Lee, T., Orman, J. & Bryan, P. (1992a). Thermodynamic analysis of the folding of the streptococcal protein G IgG-binding domains B1 and B2: why small proteins tend to have high denaturation temperatures. Biochemistry, 31, 3597± 3603. Alexander, P., Orban, J. & Bryan, P. (1992b). Kinetic analysis of folding and unfolding the 56 amino acid IgG-binding domain of streptococcal protein G. Biochemistry, 31, 7243± 7248. Alonso, D. O. V. & Daggett, V. (1995). Molecular dynamics simulations of protein unfolding and limited refolding: characterization of partially unfolded states of ubiquitin in 60% methanol and in water. J. Mol. Biol. 247, 501± 520. An®nsen, C. B. (1973). Principles that govern the folding of protein chains. Science, 181, 223± 230. Bai, Y. & Englander, S. W. (1996). Future directions in folding: the multi-state nature of protein structure. Proteins: Struct. Funct. Geomet. 24, 145± 151. Bai, Y., Sosnick, T. R., Mayne, L. & Englander, S. W. (1995). Protein folding intermediates: native-state hydrogen exchange. Science, 269, 192± 197. Balbach, J., Forge, V., van Nuland, N. A. J., Winder, S. L., Hore, P. J. & Dobson, C. M. (1995). Following protein folding in real time using NMR spectroscopy. Nature Struct. Biol. 2, 865±870. Baldwin, R. L. (1995). The nature of protein folding pathways: the classical versus the new view. J. Biomol. NMR, 5, 103± 109.
452 Blanco, F. J. & Serrano, L. (1995). Folding of protein G B1 domain studied by the conformational characterization of fragments comprising its secondary structure elements. Eur. J. Biochem. 230, 634 ± 649. Blanco, F. J., Jimenez, M. A., Pineda, A., Rico, M., Santoro, J. & Nieto, J. L. (1994a). NMR solution structure of the isolated N-terminal fragment of protein-G B1 domain. Evidence of tri¯uoroethanol induced native-like b-hairpin formation. Biochemistry, 33, 6004± 6014. Blanco, F. J., Rivas, G. & Serrano, L. (1994b). A short linear peptide that folds into a native stable b-hairpin in aqueous solution. Nature Struct. Biol. 1, 584 ± 590. Blanco, F. J., Ortiz, A. R. & Serrano, L. (1997). Role of a non-native interaction in the folding of the protein G B1 domain as inferred from the conformational analysis of the a-helix fragment. Folding Design, 2, 123± 133. Boczko, E. M. & Brooks, C. L., III (1993). Constant-temperature free energy surfaces for physical and chemical processes. J. Phys. Chem. 97, 4509± 4513. Boczko, E. M. & Brooks, C. L., III (1995). First-principles calculation of the folding free energy of a threehelix bundle protein. Science, 269, 393 ± 396. Brooks, C. L., III (1992). Characterization of ``native'' apolyoglobin by molecular dynamics simulation. J. Mol. Biol. 227, 375±380. Brooks, C. L., III & Case, D. (1993). Simulations of peptide conformational dynamics and thermodynamics. Chem. Rev. 93, 2487± 2502. Bryngelson, J. D. & Wolynes, P. G. (1987). Spin glasses and the statistical mechanics of protein folding. Proc. Natl Acad. Sci. USA, 84, 7524± 7528. Bryngelson, J. D., Onuchic, J. N., Socci, N. D. & Wolynes, P. G. (1995). Funnels, pathways, and the energy landscape of protein folding: a synthesis. Proteins: Struct. Funct. Genet. 21, 167 ± 195. Ca¯isch, A. & Karplus, M. (1994). Molecular dynamics simulation of protein denaturation: solvation of the hydrophobic cores and secondary structure of barnase. Proc. Natl Acad. Sci. USA, 91, 1746± 1750. Ca¯isch, A. & Karplus, M. (1995). Acid and thermal denaturation of barnase investigated by molecular dynamics simulations. J. Mol. Biol. 252, 672 ± 708. Daggett, V. & Levitt, M. (1992). A model of the molten globule state from molecular dynamics simulations. Proc. Natl Acad. Sci. USA, 89, 5142± 5146. Daggett, V., Li, A., Itzhaki, L. S., Otzen, D. E. & Fersht, A. R. (1996). Structure of the transition state for folding of a protein derived from experiment and simulation. J. Mol. Biol. 257, 430 ± 440. Dill, K. A. & Chan, H. S. (1997). From Levinthal to pathways to funnels. Nature Struct. Biol. 4, 10 ± 19. Dill, K. A., Bromberg, S., Yue, K., Fiebig, K. M., Yee, D. P., Thomas, P. D. & Chan, H. S. (1995). Principles of protein folding a perspective from simple exact models. Protein Sci. 4, 561 ± 602. Eaton, W. A., Munoz, V., Thompson, P. A., Chan, C.-K. & Hofrichter, J. (1997). Submillisecond kinetics of protein folding. Curr. Opin. Struct. Biol. 7, 10± 14. Fan, P., Kominos, D., Kitchen, D. B., Levy, R. M. & Baum, J. (1991). Stabilization of a-helical secondary structure during high-temperature moleculardynamics simulations of a-lactalbumin. Chem. Phys. 158, 295± 301. Ferrenberg, A. M. & Swendsen, R. H. (1989). Optimized Monte Carlo data analysis. Phys. Rev. Letters, 63, 1195± 1198.
Mechanism of Protein G Folding Frank, M. K., Clore, G. M. & Gronenborn, A. M. (1995). Structural and dynamic characterization of the urea denatured state of the immunoglobulin binding domain of streptococcal protein G by multidimensional heteronuclear NMR spectroscopy. Protein Sci. 4, 2605± 2615. Gallagher, T., Alexander, P., Bryan, P. & Gilliland, G. L. (1994). Two crystal structures of the B1 immunoglobulin-binding domain of streptococcal protein G and comparison with NMR. Biochemistry, 33, 4721± 4729. Gronenborn, A. M., Filpula, D. R., Essig, N. Z., Achari, A., Whitlow, M., Wing®eld, P. T. & Clore, G. M. (1991). A novel, highly stable fold of the immunoglobulin binding domain of streptococcal protein G. Science, 253, 657± 661. Guo, Z. & Brooks, C. L., III (1997). Thermodynamics of protein folding: a statistical mechanical study of a small all b protein. Biopolymers, 42, 745± 757. Guo, Z. & Thirumalai, D. (1995). Kinetics of protein folding: nucleation mechanism, time scales, and pathways. Biopolymers, 36, 83± 102. Guo, Z., Brooks, C. L., III & Boczko, E. M. (1997). Exploring the folding free energy surface of a threehelix bundle protein. Proc. Natl Acad. Sci. USA, 94, 10161± 10166. Hagen, S. J., Hofrichter, J., Szabo, A. & Eaton, W. A. (1996). Diffusion-limited contact formation in unfolded cytochrom c: estimating the maximum rate of protein folding. Proc. Natl Acad. Sci. USA, 93, 11615± 11617. Hunenberger, P. H., Mark, A. E. & van Gunsteren, W. F. (1995). Computational approaches to study protein unfolding: henn egg white lysozyme as a case study. Proteins: Struct. Funct. Genet. 21, 196± 213. Itzhaki, L. S., Otzen, D. E. & Fersht, A. R. (1995). The structure of the transition state for folding of chymotrypsin inhibitor 2 analysed by protein engineering methods: evidence for a nucleationcondensation mechanism for protein folding. J. Mol. Biol. 254, 260 ± 288. Jain, A. K. & Dubes, R. C. (1988). Algorithms for Clustering Data, Prentice Hall, Englewood Cliffs, NJ. Jennings, P. A. & Wright, P. E. (1993). Formation of a molten globule intermediate early in the kinetic folding pathway of apomyoglobin. Science, 262, 892 ± 896. Kiefhaber, T., Labhardt, A. M. & Baldwin, R. L. (1995). Direct NMR evidence for an intermediate preceding the rate-limiting step in the unfolding of ribonuclease A. Nature, 375, 513± 515. Klimov, D. K. & Thirumalai, D. (1996). Factors governing the foldability of proteins. Proteins: Struct. Funct. Genet. 26, 411±441. Kraulis, P. J. (1991). MOLSCRIPT: a programme to produce both detailed and schematic plots of protein structures. J. Appl. Crystallog. 24, 946± 950. Kumar, S., Bouzida, D., Swendsen, R. H., Kollman, P. A. & Rosenberg, J. M. (1992). The weighted histogram analysis method for free-energy on biomolecules. I. The method. J. Comput. Chem. 13, 1011± 1021. Kuszewski, J., Clore, G. M. & Gronenborn, A. M. (1994). Fast folding of a prototypic polypeptide: the immunoglobulin binding domain of streptococcal protein G. Protein Sci. 3, 1945±1952. Leopold, P. E., Montal, M. & Onuchic, J. N. (1992). Protein folding funnels: a kinetic approach to the
453
Mechanism of Protein G Folding sequence-structure relationship. Proc. Natl Acad. Sci. USA, 89, 8721± 8725. Li, A. & Daggett, V. (1994). Characterization of the transition state of protein unfolding by use of molecular dynamics: chymotrypsin inhibitor 2. Proc. Natl Acad. Sci. USA, 91, 10430± 10434. Li, A. & Daggett, V. (1996). Identi®cation and characterization of the unfolding transition state of chymotrypsin inhibitor 2 by molecular dynamics simulations. J. Mol. Biol. 257, 412 ±429. Luthey-Schulten, Z., Ramirez, B. E. & Wolynes, P. G. (1995). Helix-coil, liquid crystal and spin glass transitions of a collapsed heteropolymer. J. Phys. Chem. 99, 2177± 2185. Mark, A. E. & van Gunsteren, W. F. (1992). Simulation of the thermal denaturation of hen egg white lysozyme: trapping the molten globule state. Biochemistry, 31, 7745± 7748. Monor, D. L., Jr & Kim, P. S. (1994). Context is a major determinant of b-sheet propensity. Nature, 371, 264 ± 267. Miranker, A. D. & Dobson, C. M. (1996). Collapse and cooperativity in protein folding. Curr. Opin. Struct. Biol. 6, 31 ±42. Miranker, A. D., Robinson, C. V., Radford, S. E. & Dobson, C. M. (1996). Investigation of protein folding by mass spectrometry. FASEB J. 10, 93 ± 101. Neira, J. L., Davis, B., Ladurner, A. G., Buckle, A. M., de Prat, Gay G. & Gersht, A. R. (1996). Towards the complete structural characterization of a protein folding pathway: the structures of the denatured, transition and native states for the association/folding of two complementary fragments of cleaved chymotrypsin inhibitor 2. Direct evidence for a nucleation-condensation mechanism. Folding Design, 1, 189 ± 208. Onuchic, J. N., Luthey-Schulten, Z. & Wolynes, P. G. (1997). Theory of protein folding: the energy landscape perspective. Annu. Rev. Phys. Chem. 48, 545± 600. Rahman, A. & Stillinger, F. H. (1971). Molecular dynamics study of liquid water. J. Phys. Chem. 55, 3336± 3359. Sali, A., Shakhnovich, E. & Karplus, M. (1994). Kinetics of protein folding. A lattice model study of the requirements for folding to the native state. J. Mol. Biol. 235, 1614± 1636. Saven, J. & Wolynes, P. G. (1996). Local conformational signals and the statistical thermodynamics of collapsed helical proteins. J. Mol. Biol. 257, 199 ± 216. Shakhnovich, E. I. (1997). Theoretical studies of proteinfolding thermodynamics and kinetics. Curr. Opin. Struct. Biol. 7, 29 ± 40. Shakhnovich, E. I. & Finkelstein, A. V. (1989). Theory of cooperative transition in protein molecules. I. Why denaturation of globular protein is a ®rst-order phase transition. Biopolymers, 28, 1667± 1680. Shakhnovich, E. I. & Gutin, A. M. (1989). Formation of unique structure in polypeptide chains. Theoretical investigation with the aid of a replica approach. Biophys. Chem. 34, 187 ±199. Sheinerman, F. B. & Brooks, C. L., III (1997). A molecular dynamics simulation study of segment B1 of protein G. Proteins: Struct. Funct. Genet. 29, 193± 202. Shoemaker, B. A., Wang, J. & Wlynes, P. G. (1997). Structural correlations in protein folding funnels. Proc. Natl Acad. Sci. USA, 94, 777 ± 782.
Skolnick, J., Kolinski, A. & Ortiz, A. R. (1997). MONSSTER: a method for folding proteins with a small number of distance restraints. J. Mol. Biol. 265, 217± 241. Smith, C. K. & Regan, L. (1995). Guidelines for protein design: the energetics of b-sheet side chain interactions. Science, 270, 980± 982. Smith, C. K., Withka, J. M. & Regan, L. (1994). A thermodynamic scale for the b-sheet forming tendencies of the amino acids. Biochemistry, 33, 5510± 5517. Socci, N. D., Onuchic, J. N. & Wolynes, P. G. (1996). Diffusive dynamics of the reaction coordinate for protein folding funnels. J. Chem. Phys. 104, 5860± 5868. Tirado-Rives, J. & Jorgensen, W. L. (1993). Molecular dynamics simulations of the unfolding of apolyoglobin in water. Biochemistry, 32, 4175± 4184. Valleau, J. P. & Torrie, G. M. (1977). Byways in statistical mechanics. In A Guide to Monte Carlo for Statistical Mechanics (Berne, B. J., ed.), Part A, pp. 169± 194, Plenum, New York. Vijayakumar, S., Vishveshwara, S., Ravishanker, G. & Beveridge, D. L. (1993). Differential stability of b-sheets and a-helices in b-lactamase: a high temperature molecular dynamics study of unfolding intermediates. Biophys. J. 65, 2304± 2312. Ward, J. H., Jr (1963). Hierarchical grouping to optimize an objective function. J. Am. Stat. Assoc. 58, 236 ± 244. Wesson, L. & Eisenberg, D. (1992). Atomic solvation parameters applied to molecular dynamics of proteins in solution. Protein Sci. 1, 227±235. Xu, S., Kamath, M. V. & Capson, D. W. (1993). Selection of partitions from hierarchy. Patt. Recog. Letters, 14, 7 ±15. Young, W. S. & Brooks, C. L., III (1996). A microscopic view of helix propagation: N and C-terminal helix growth in alanine helices. J. Mol. Biol. 259, 560± 572.
Appendix MD protocol All simulations were performed using the CHARMM macromolecular mechanics package (Brooks et al., 1983; Mertz et al., 1991). The protein was represented with the param19 parameter set. All charges were set to those corresponding to pH 7. The N and C termini of the protein were capped with neutral acetyl (Ace) and methylamide (Amn) groups. The TIP3P water model (Jorgensen et al., 1983) was employed. All X-H bonds, where X is a heavy atom, were kept rigid with the SHAKE algorithm (Ryckaert et al., 1977). The simulations were performed with a constant volume and constant solvent density. The protein was solvated within a truncated octahedron constructed Ê on a side (Allen & Tildesley, from a cube 62 A 1989). The solvent volume comprised 3659 water molecules. This volume was suf®cient to provide Ê shell around the protein in all simuat least a 12 A lations. Periodic boundary conditions were employed. During both molecular dynamics and minimization, the non-bonded forces were truncated at Ê . Truncation was done using an electrostatic 11 A
454 shifting function and a van der Waals switching function (Brooks et al., 1985). The dynamics were propagated using a 0.002 ps timestep. Non-bonded Ê list, interactions were processed using a 13 A updated every 25 steps (Allen & Tildesley, 1989; Verlet, 1967). Coordinates were saved every 100 steps. The temperature of the system during equilibration and sampling was maintained near 298 K by reassigning velocities from the Boltzmann distribution if the temperature drifted outside a window of 5 K around the target of 298 K. Simulations were performed on both scalar and vector platforms on the CRAY C-90, T3D and T3E supercomputers at the Pittsburgh Supercomputing Center. The biasing potential used to perform umbrella sampling along the reaction coordinate was found to impose a negligibly small computational cost and thus was implemented on a single processor with no appreciable decrease in the parallel performance.
Mechanism of Protein G Folding
atoms were removed. Variation of the cutoff disÊ to tance for the removal of the water from 2.5 A Ê was suf®cient to maintain a constant number 3.0 A of water molecules in all simulations. After the protein was inserted into the solvent, the system was minimized for 200 steps using steepest descent minimization. During the minimization, all protein atoms were constrained to their original positions by a harmonic potential with a Ê 2. Next, four force constant of 20 kcal/mol per A runs of equilibration dynamics, 5 ps each, were performed. The harmonic restraints, with the decreasing force constant (15, 10, 5 and 0 kcal/mol Ê 2) were applied on all protein atoms during per A the four equilibration runs. The biasing potential
Equilibration The truncated octahedron was constructed from a pre-equilibrated cubic box of water. This volume was additionally equilibrated for 200 ps with the corresponding periodic boundaries and was then employed for re-solvation of all protein structures used as initial conditions for sampling. Upon insertion of the protein into the truncated octahedron, all water molecules overlapping with protein
Figure A1. Potential of mean force (PMF) in kcal/mol, along the reaction coordinate r. Five surfaces based on 80% of data are shown in black. The surface based on 100% of data is shown in a dotted line.
Figure A2. Fraction of native contacts broken, r, the Ê ), and the constraint energy, radius of gyration, RG(A Ec(kcal/mol) as a function of time (in nanoseconds) during the RG-biased unfolding trajectory.
455
Mechanism of Protein G Folding
To investigate how the shape of the free energy surfaces is in¯uenced by the speci®c procedure we chose to generate initial conditions, we performed a different unfolding experiment. A protein structure taken from the native simulation was unfolded with a harmonic potential de®ned on the radius of gyration. An unfolding simulation (1 ns) was performed in water under normal conditions with a biasing potential of the form: 1 Uc k
RG ÿ 16:02 2
Figure A3. Two equilibration trajectories initiated from the structures at cluster enters obtained from the RGbiased unfolding trajectory. (a) Two trajectories superimposed with a two-dimensional potential of mean force. (b) Time course of two trajectories. Time is in nanoseÊ. conds and the radius of gyration, RG, is in A
de®ned on the native contacts was also applied during all stages of equilibration.
A1
where RG is the radius of gyration of the protein in angstrom units. The force constant k was initially Ê 2 and was increased every set to 1 kcal/mol per A Ê 2. Figure A2 shows the 100 ps by 1 kcal/mol per A time course of the constraint energy, radius of gyration and the position of the structure on the reaction coordinate. The database of structures generated in this fashion was clustered and a set of representative structures was extracted. Analysis of the database indicated that in these structures the loss of native contacts is more strongly correlated with overall expansion of the protein than in the conformations produced in the sampling from our original database of initial conditions. Thus, structures with the same number of native contacts had, in general, a larger radius of gyration. More rapid expansion of the protein structure was especially apparent in the region of 0.2 4 r 4 0.55, where rather compact conformations were produced in our original sampling. To test whether the sampling initiated from these structures would produce a different description of the folding surface, we performed two simulations starting from these alternative initial conditions. Simulations were performed with a protocol identical to the one used in our initial sampling. We observed that structures rapidly collapsed, approaching the free energy basin in one case (where the initial radius of gyration was quite large), and reaching it in the second (Figure A3). We believe that this result indicates that the free energy of regions where no sampling was produced is indeed high, and the absence of conformations sampled in these regions is not an artifact of the implicit bias produced by the manner in which we generated initial conditions.
Precision of the free energy surface A total of 9.6 ns of sampling generated from 76 trajectories was combined with the WHAM formalism to produce the potential of mean force (PMF) along the reaction coordinate, r. We used ®vefold cross-validation of the data to assess the precision of this surface. Thus, 20% of the data was randomly excluded from all bins. Five such sets of data, each containing 80% of the data, were then used to construct the potential of mean force. Figure A1 demonstrates ®ve surfaces constructed in this fashion, along with the original surface in which 100% of data was used. We observe that the surface is quite precise, as judged by this test.
Total computational cost of the project Each of the 76 initial conditions were equilibrated for 20 ps. The production (sampling) dynamics were performed for 100 ps for the majority of initial conditions. The sampling for the conditions from the bins numbered 10, 12, 15 and 16 was done for 200 ps. A total of 9.6 ns of sampling was generated. Sixty picoseconds of dynamics required approximately 7.8 CPU hours on 64 processors of the T3D, 2.8 CPU hours on 64 processors of the T3E and 3.3 CPU hours on 16 processors of the C90.
456
Mechanism of Protein G Folding
References Allen, M. P. & Tidesley, D. J. (1989). Computer Simulation of Liquids, Clarendon Press, Oxford. Brooks, B. R., Bruccoleri, R. E., Olafson, B. D., States, D. J., Swaminathan, S. & Karplus, M. (1983). CHARMM: a program for macromolecular energy, minimization, and dynamics calculations. J. Comput. Chem. 4, 187 ±217. Brooks, C. L., III, Pettitt, B. M. & Karplus, M. (1985). Structural and energetic effects of truncating long ranged interaction in ionic and polar ¯uids. J. Chem. Phys. 83, 5897± 5908. Jorgensen, W. L., Chandrasekhar, J., Madura, J., Impley, R. W. & Klein, M. L. (1983). Comparison of simple
potential functions for stimulating liquid water. J. Chem. Phys. 79, 926 ± 935. Mertz, J. E., Tobias, D. J., Brooks, C. L. & Singh, U. C. (1991). Vector and parallel algorithms for the molecular dynamics simulations of macromolecules on shared memory computers. J. Comput. Chem. 12, 1270± 1277. Ryckaert, J.-P., Ciccotti, G. & Berendsen, H. J. C. (1977). Numerical integration of the cartesian equations of motion of a system with constraints: molecular dynamics of n-alkanes. J. Comput. Phys. 23, 327 ± 341. Verlet, L. (1967). Computer ``experiments'' on classical ¯uids. I. Thermodynamical properties of LennardJones molecules. Phys. Rev. 159, 98± 103.
Edited by F. Cohen (Received 26 August 1997; received in revised form 20 January 1998; accepted 21 January 1998)