Structure, Vol. 13, 1587–1597, November, 2005, ª2005 Elsevier Ltd All rights reserved.
DOI 10.1016/j.str.2005.07.023
Folding of Small Helical Proteins Assisted by Small-Angle X-Ray Scattering Profiles Yinghao Wu,1 Xia Tian,2 Mingyang Lu,2 Mingzhi Chen,3 Qinghua Wang,2 and Jianpeng Ma1,2,3,* 1 Department of Bioengineering Rice University Houston, Texas 77005 2 Verna and Marrs McLean Department of Biochemistry and Molecular Biology 3 Graduate Program of Structural and Computational Biology and Molecular Biophysics Baylor College of Medicine One Baylor Plaza Houston, Texas 77030
Summary This paper reports a computational method for folding small helical proteins. The goal was to determine the overall topology of proteins given secondary structure assignment on sequence. In doing so, a Monte Carlo protocol, which combines coarse-grained normal modes and a Hamiltonian at a different scale, was developed to enhance sampling. In addition to the knowledge-based potential functions, a small-angle X-ray scattering (SAXS) profile was also used as a weak constraint for guiding the folding. The algorithm can deliver structural models with overall correct to˚ pology, which makes them similar to those of 5w6 A cryo-EM density maps. The success could contribute to make the SAXS technique a fast and inexpensive solution-phase experimental method for determining the overall topology of small, soluble, but noncrystallizable, helical proteins. Introduction One of the major goals of protein folding is to obtain the overall three-dimensional structure from a one-dimensional primary sequence of amino acids. Through decades of extensive studies, substantial progress has been made toward understanding the folding mechanisms and the prediction of three-dimensional structures (Blundell et al., 1987; Boczko and Brooks, 1995; Bowie et al., 1991; Cardenas and Elber, 2003; Dill and Chan, 1997; Dinner et al., 2000; Duan and Kollman, 1998; Elofsson et al., 1996; Guo and Thirumalai, 1995; Jones et al., 1992; Liwo et al., 2005; Onuchic et al., 1997; Pande et al., 2003; Pedersen and Moult, 1997; Sali et al., 1994; Simons et al., 1997; Srinivasan et al., 2004; Wedemeyer et al., 2000; Wolynes et al., 1995; Zhang and Skolnick, 2004; Zhou and Karplus, 1999). At the current stage, however, reliably predicting the overall topology is still a challenging problem for most proteins. In this paper, we report on the effort of determining the tertiary topology of small helical proteins or domains by using several constraints. The proteins were modeled
*Correspondence:
[email protected]
based on Ca traces and the secondary structures as standard helices. The starting conformations were randomly chosen and had extended tertiary structures. Knowledge-based potential functions and a small-angle X-ray scattering (SAXS) profile were used to guide the simulations. To efficiently sample conformational space, we also developed a novel, to our knowledge, Monte Carlo (MC) protocol, which was based on observations made from previous studies that found that the large-scale conformation changes are dominated by a small number of low-frequency normal modes (Ma, 2004, 2005). A distinct feature in this protocol is the separation of scale between Hamiltonian for calculating normal modes and for updating Metropolis criterion in MC simulation. A substantially improved sampling efficiency was observed in our study. Multiple folding trajectories that usually resulted in different final structures with fairly diverse topologies were simulated. A major issue, therefore, was to distinguish the most native-like topology among them. To do so, we adopted an approach developed in our recent study (Wu et al., 2005), which was originally designed to determine protein topology from the skeletons of secondary elements derived from low- to intermediate-resolution density maps. The approach was based on an essential observation that the average energy of an ensemble of structures with slightly different configurations around the original structure was a much more robust parameter for marking native topology than the energy of any of the individual structure in the ensemble. The underlying hypothesis is that native topology was chosen by evolution as the one that can accommodate the largest structural variations, not the one rigidly trapped in a deep, but narrow, conformational energy well. In this study, for each final folded structure, we separately generated an ensemble of structures around it and evaluated the average energy as the criterion for identifying the native topology. Our results showed that, in two-thirds of the testing cases with simulated scattering profiles, the native topology was successfully identified. Tests on real experimental scattering data also yielded a satisfactory outcome. Since we used Ca-based models, an approach that could not deliver atomic resolution structure, our focus was on getting the overall topology correct. It was shown that the approach was particularly effective for small, globular helical proteins or protein domains. Such a success could help increase the effective resolution of the SAXS method. This is particularly important since SAXS is a relatively low-cost and fast method compared with other solution-phase methods such as NMR. Results Test of 12 All-Helical Globular Proteins The simulation protocol was first tested on a total of 12 all-helical globular proteins that can be roughly split into four groups, which contain three, four, five, and six helices, respectively. They have an architecture of an orthogonal bundle or an up-down bundle (Orengo et al., 1997). The SAXS profiles for all of the proteins were
Structure 1588
Table 1. Results for 12 Testing Proteins
Helix Number
PDB Codes
Protein Size (Number of Residues)
Protein Architecture
Lowest Rmsd Topology
Rmsd with Loop (A˚)
Rmsd without Loop (A˚)
Three helices
1mbg 1prb 1erc 1eo0 1i2t 1eoq 1nkl 2cro 1ow5 1a1w 1ngr 1ich
40 45 40 76 61 64 78 65 60 91 74 87
orthogonal up-down up-down up-down orthogonal orthogonal orthogonal orthogonal orthogonal orthogonal orthogonal orthogonal
1st 1st 5th 1st 1st 3rd 1st 1st 7th 1st 1st 4th
3.98 3.76 3.78 5.77 5.52 5.53 5.47 6.85 5.10 7.02 5.35 6.09
4.10 3.39 3.07 5.57 5.24 5.12 5.37 6.02 4.45 6.19 5.38 5.82
Four helices
Five helices
Six helices
The number in the column ‘‘Lowest Rmsd Topology’’ describes the energetics ranking of the topology of the lowest rmsd in relation to the native topology. Therefore, ‘‘1st’’ means the energetics-based screening successfully identified the most native-like topology as the lowest one in energy. The values of rmsd calculated with and without loops are both listed.
computationally calculated. The native topologies of 8 out of 12 proteins were successfully identified with reasonable root mean square deviations (rmsd) from known crystal structures (Table 1). The errors in the final folded structures inevitably increased with the size of the proteins, but such magnitudes of rmsd were within the overall range of precision of the current structural prediction by other methods (Moult et al., 2003). In Figure 1, one successful case from each group, together with the corresponding crystal structure, was displayed. Clearly, for all groups, the overall topology
of the proteins was correctly reached. Note that for clarity the loops were not shown. All major helices were in the vicinity of their correct positions. Shown in Figure 2 is one of the folding trajectories of 1nkl from a conformation near the beginning of simulation (Figure 2A) to the end of simulation (Figure 2F). It is apparent that the overall sampling of the conformational space by the elastic normal mode-based MC protocol is efficient. Each trajectory for the largest protein we tested took about 48 hr on a single processor PC machine. Moreover, the speed of computing also depends on the total number
Figure 1. Comparison of the Most Nativelike Topology with the Corresponding Native Topology (A–D) For clarity, only the helices are drawn as cylinders. The dark color indicates native structures, and the light color indicates simulated structures. The values of rmsd without loops are 4.1 A˚ for 1mbg, 5.1 A˚ for 1eo0, 5.4 A˚ for 1nkl, and 6.2 A˚ for 1a1w.
SAXS-Assisted Folding of Small Helical Proteins 1589
Figure 2. The Snapshots Taken from One of the Folding Trajectories of Protein 1nkl (A–F) (A) shows a conformation near the beginning of simulation, and (F) shows the final folded conformation that has an rmsd of 5.5 A˚ to the native structure.
of blocks in BNM analysis that was related to the length of loops. However, our protocol was very fast for folding the small- to medium-sized helical proteins. Four typical final folded structures were shown for the three-helical protein 1mbg (Figure 3A) and for the fivehelical protein 1nkl (Figure 3B). For each protein, the first folded structure had the most native-like topology, which was successfully identified in our final screening. The results clearly demonstrated that the final folded structures from different folding trajectories had quite different topologies, and that some were very far away from the native topology. This fact necessitated the application of an additional energetics-based screening method in our protocol (see Experimental Procedures). The convergence of energy values during MC simulations was quick for all trajectories (Figure 4A), even though the rmsds of the final folded structures were fairly different (Figure 4B). Finally, as we had noted previously (Wu et al., 2005), the most native-like topology of 1nkl had the lowest average total energy in the perturbed ensemble of structures, the smallest standard deviation (SD) for the energy distribution, and the largest number of perturbed structures (Table 2). These results further support the finding that the native topology was chosen by evolution as the one that can accommodate the largest structural variations, not the one rigidly trapped in a deep, but narrow, conformational energy well. Analysis of Some Failed Cases To better understand its overall performance, here we present two typical cases in which our simulation protocol failed: myoglobin (PDB code 1bvc) and cytochrome c (PDB code 1akk). The structure of myoglobin has 8 helices (6 major ones) and 153 residues. Out of the ten trajectories,
none of them had a native-like topology. The closest had an rmsd of 10.52 A˚. If only the six major helices were compared, the rmsd was 8.0 A˚. A detailed comparison of the six major helices in the lowest-rmsd model with those of the native structure is shown in Figure 5A. The dark color indicates the native helices, and the light color indicates our modeled helices. Obviously, the lowest-rmsd model has a very similar shape compared with the native structure. The largest discrepancy from the native structure was in the last two major helices, which were swapped in the modeled structure. Note that since the native topology was not in the final ten folded structures by visual inspection, we didn’t apply energetics-based screening. Cytochrome c is a smaller four-helical protein (Figure 5B). The folding simulation also failed to produce the native-like topology out of the multiple trajectories. One possible reason for this failure was that the folding of the native structure of cytochrome c heavily involves its cofactor heme. Since we didn’t include heme in our simulation, this could contribute to the substantial error in the final folded structures. Also, there is a very long loop from residues 17 to 48 that extensively interacts with the heme group. This loop also contributed to the frustration of the folding simulation. Cases with Mismatch between Predicted and Native Secondary Structures In all applications discussed so far, we utilized the native secondary structure assignments taken from the crystal structures. For the purpose of practical applications, we also tested cases with mismatches between the predicted and native assignments. By using the consensus secondary structure prediction server (http://ibivu.cs. vu.nl/programs/sympredwww/), 3 out of 12 proteins
Structure 1590
Figure 4. Convergence of MC Simulation for Protein 1nkl, for Energy and for Rmsd (A and B) (A) indicates convergence of energy, and (B) indicates convergence of rmsd. Although the overall convergences for different trajectories were similar, the final structures differ substantially with regard to rmsd.
Figure 3. Four Final Folded Structures from Different Trajectories for 1mbg and 1nkl (A and B) For each protein ([A] 1mbg; [B] 1nkl), the first structure has the lowest rmsd value (most native-like). All folded structures have compact conformations and similar radii of gyration, but with quite different topologies.
had secondary structure mismatches. They were 1erc, 1prb, and 1ow5. For 1erc and 1prb, although they are all-helical proteins, the secondary structure prediction indicated that they had strands. In these cases, our folding simulations totally failed. In the case of 1ow5, which is a five-helical bundle, secondary structure prediction indicated four helical regions (two small helical regions were predicted as a single long helix). If we constructed the initial protein model based on the predicted secondary structures with the scattering profile still computed from the native crystal structures, the folding simulation successfully identified the most native-like topology with the lowest rmsd of 6.86 A˚ from the native structure
with loop regions, and 6.78 A˚ for the structure without loop regions (Figure 6A). (Note that in Table 1, when there was no mismatch, this protein has an rmsd 5.10 A˚ with loop regions and an rmsd of 4.45 A˚ without loop regions.) The global arrangement of all three helices is similar to the native structure, while the single long helix, predicted over a sequence region that should have two smaller helices in the native structure, lies at the average position between the two native small helices (Figure 6A, (3)). Another example that can illustrate the effects of errors in secondary structure assignment is a protein of PDB code 3icb (a five-helical orthogonal bundle protein). As in Figure 6B, the consensus prediction for the secondary structure gave a correct assignment, and, consequently, the folded topology was also correct (lowestrmsd structure without loop regions was 4.93 A˚ out of the ten folding trajectories [Figure 6B, (1)]). If we used the assignment by the YASPIN method (Lin et al., 2005), in which a small helix (residues 37–41) was merged into a preceding longer helix, the topology was more or less correct (lowest rmsd was 5.85 A˚ [Figure 6B, (2)]). If,
SAXS-Assisted Folding of Small Helical Proteins 1591
Table 2. Detailed Results for the Final Energetics-Based Ranking of Ten Folding Trajectories for 1nkl Trajectory Index
Average Rmsd (A˚)
Average Total Energy (RT)
SD of Total Energy (RT)
Average Long-Range Energy (RT)
SD of Long-Range Energy (RT)
Number of Randomized Structures in the Ensemble
1 2 3 4 5 6 7 8 9 10
10.7 6.6 10.8 8.9 10.5 7.2 5.4 10.5 11.7 10.6
2378.21 2399.08 2378.59 2396.36 2378.70 2399.85 2418.47 2403.07 2371.95 2388.36
50.74 45.99 61.71 44.87 50.91 42.71 37.75 38.95 45.91 48.5
254.42 254.91 242.19 249.77 231.08 253.26 264.92 252.19 234.79 240.16
26.93 22.08 33.91 33.35 40.08 26.14 25.59 24.25 35.62 36.13
253 353 322 373 472 445 563 510 416 412
however, we used the assignment by the PSIRed method (Jones, 1999), which mistakenly predicted the same small helix as a loop, the folded topology became much worse (lowest rmsd became 11.36 A˚, [Figure 6B, (3)]). These results indicate that, in general, if the mismatches between the predicted and native secondary structures are not too severe, our method is still capable of delivering reasonable overall topology. But failures
did occur when the mismatches were too large for our method to handle. Test of System with Experimental Scattering Data In order to test our simulation protocol by using real experimentally measured SAXS data, we performed the test on a system that had published scattering data
Figure 5. Examination of Failed Cases (A) Comparison of the lowest-rmsd structure with the native structure for myoglobin 1bvc. Six major long helices are illustrated one-by-one. Their amino acid sequence ranges are 3–18, 20–36, 59–78, 85–96, 100–119, and 124–149. The last two were swapped in the folded structure, but the overall architecture of the folded structure is not far away from that of the native structure. (B) Ribbon diagram for cytochrome c. The long loop that extensively interacts with the heme group is highlighted in a darker color.
Structure 1592
Figure 6. Comparison of the Lowest-Rmsd Structures with the Native Structure for 1ow5 and 3icb with Mismatches between the Predicted and Native Secondary Structures The darker color is for the native structure, and the lighter color is for the folded structure. (A) For 1ow5, a particular helix is highlighted by ribbon representation in each panel (1–4). Note the mismatch that occurs over the sequence region of two small, dark helices highlighted in (3). (B) For 3icb, results for different secondary structure assignments are shown.
available (Mehboob et al., 2003) (the maximal scattering vector s was set to 0.05 A˚21). The N-terminal region of spectrin is a relatively large three-helical bundle protein (149 residues), and the secondary structure prediction gave the same assignment as the native structure. Although the absolute size of this three-helical bundle was even larger than the largest six-helical bundle protein we tested, our method successfully found the most native-like topology (Figure 7) with an rmsd of 6.5 A˚. Convergence of Elastic Normal Mode-Based MC Simulation To demonstrate the convergence of the MC protocol we developed, we compared this protocol with a more traditional rigid-body MC simulation, in which each block was chosen randomly at each time step. After each block was chosen, the chosen block took a random move. If the chosen block contained only one Ca
atom, as in the loops, the random move was chosen on its three translational degrees of freedom. If the chosen block contained more than one Ca atom, as in helices, three additional rotational degrees of freedom were changed randomly. For comparison purposes, we used the same initial condition, the same energy potential, and the same weight value; the moving amplitude was also adjusted close to the normal mode-based MC simulation. The convergences of two simulation methods are shown in Figure 8. The normal mode-based MC simulation converged much faster and to a lower energy, or a more compact conformation, than the traditional rigidbody MC simulation. Although the calculation of elastic normal modes in each step took some additional time, the overall performance of normal mode-based MC is significantly better, especially in terms of the energy of the final folded structure. As a direct comparison, ten independent trajectories were run by the rigid-body MC
SAXS-Assisted Folding of Small Helical Proteins 1593
Figure 8. Comparison of the Convergence of Two MC Methods on Protein 1nkl The dotted line is for traditional rigid-body MC, and the solid line is for elastic normal mode-based MC. It is clear that the normal mode-based MC converges much faster and to a lower energy, i.e., a more compact conformation.
Figure 7. Comparison of the Most Native-like Topology with Native Structure for Spectrin The rmsd is 6.5 A˚. This protein was folded by using published experimental SAXS data (Mehboob et al., 2003) as constraints.
simulation for 1nkl. The final average rmsd to the native structure was more than 13 A˚, and the lowest rmsd was 11.2 A˚, compared with the average rmsd of 9.3 A˚ and the lowest rmsd of 5.4 A˚ obtained with the elastic normal mode-based MC simulation. More importantly, with the traditional rigid-body MC, the lowest-rmsd final structure out of the ten folding trajectories had a noncompact structure with a nonnative topology. These results suggest the better performance of the normal mode-based MC protocol (detailed analysis of performance of the normal mode-based MC will be given in a separate paper). Discussion In this paper, we reported a new, to our knowledge, computational protocol for determining tertiary topology of small, globular helical proteins, or protein domains, from sequence. Knowledge-based potential functions and coarse-grained protein models (Ca traces) were used in the folding simulations. A novel, to our knowledge, protocol of MC simulation was devel-
oped in which the scale in Hamiltonian was used for conformational random walk and for which the Metropolis criterion was separated. The random walk was based on elastic normal modes, and the Metropolis criterion was based on more accurate potential functions. Such a separation of Hamiltonian allows an effective sampling of the conformational space. It is important to point out that the potential functions used for Metropolis criterion had no restriction and that they can be any kind. Therefore, the protocol can be generalized to any case. Another distinct feature in this study was the utilization of SAXS data as soft constraints to assist folding. For a set of testing proteins we studied with both simulated and experimentally measured SAXS profiles, a satisfactory successful rate was achieved. We wish to emphasize that the purpose of making random configurational moves during MC simulation along the eigenvectors of elastic normal modes is to achieve the maximal sampling efficiency. Each normal mode is a single, but collective, degree of freedom. Moving along the low-frequency mode has the effect of achieving the largest structural change collectively with the smallest energetic cost. Our fundamental assumption is that, at any given instant, the direction of the conformational movement in the immediate future can be approximated by the normal modes calculated at that instant. Although the conventional concept of normal mode, as a harmonic approximation, does not seem to allow the system to cross the energy barrier along the eigenvector, in this case, the separation of scale of Hamiltonian overcomes this problem. As shown in our study, the structural movement along the coarse-grained elastic normal modes can efficiently cross energy barriers in the landscape defined by more sophisticated Hamiltonian later used for Metropolis criterion. It was shown in this study that the method was very effective for folding simulation because the folding process involves large-scale conformational rearrangement that drastically changes the shape of the molecule. Normal modes are good parameters to describe the conformational changes that
Structure 1594
substantially alter the overall shape of the molecule (Lu and Ma, 2005; Ma, 2004, 2005). From our study, it seems that the success of folding depends more on the complexity of the structure and less on the absolute size of the structure. The threehelical protein that we successfully tested with experimental scattering data was relatively simple in its topology, but it was large in size compared with other systems we tested. Myoglobin was a good example in which the complexity led to the failure of simulation. However, we wish to point out that, even for myoglobin, in which the native topology was not found, the final folded structure had four major helices correctly positioned. The remaining two helices had their positions swapped; the overall angles of them were still close enough to the native structures. In other words, if only the skeletons of secondary structure elements were concerned, all of the rough positions of helices would have been considered as correctly predicted. The results would be of a similar level of information as to what can be obtained from intermediate-resolution cryo-EM density maps (Jiang et al., 2001; Kong and Ma, 2003; Kong et al., 2004). In its current form, the protocol we used relies on the knowledge of secondary structure assignment on sequence. In a real application, the secondary structure assignment would have to be taken from computational prediction, which has about 80% accuracy in determining major secondary structures. Our method should even be able to handle cases in which small helices are missing in the sequence-based prediction. Of course, there are other factors that could affect the results of folding. For example, proteins containing large cofactors would probably be harder to fold by the current method than the ones without cofactors. In conclusion, we believe that, for small, globular helical proteins or protein domains, it is feasible to utilize SAXS data to assist structure prediction to a level of accuracy that is roughly equivalent to, or better than, intermediate-resolution density maps from other structural biology experiments. However, SAXS has the unique advantage of being able to quickly collect data on small, soluble proteins. It also provides a useful alternative technique for fields such as structural genomics, since a SAXS experiment would be much easier to be automated than X-ray crystallography. Together with the recent developments of other related computational methods (Costenaro et al., 2005; Davies et al., 2005; Kojima et al., 2004; Petoukhov et al., 2002; Svergun and Koch, 2002; Svergun et al., 2001; Walther et al., 2000; Zheng and Doniach, 2002, 2005) that also utilize SAXS data to model protein structures, we believe that these methods will eventually enable SAXS to become a mainstream experimental technique in the field of structural biology. Experimental Procedures Overall Procedure of the Simulation To fold the proteins, we assumed that the positions of secondary structures (helices in this case) in the sequence are known. The simulations were based on the Ca traces of proteins. Each run started from a random conformation in which the loops were modeled as extended and helices were kept in a standard form—a model similar to the diffusion-collision model (Karplus and Weaver, 1979, 1994). The
conformational space was then sampled by a novel, to our knowledge, MC protocol, which, in essence, separates the Hamiltonians between the random walk and Metropolis criterion. In each step of the simulation, a small set of low-frequency normal modes was first computed based on the Ca trace by elastic normal mode analysis (eNMA) (Atilgan et al., 2001) with the block normal mode (BNM) (Durand et al., 1994; Li and Cui, 2002; Tama et al., 2000) scheme for constructing the Hessian matrix (the helical regions were kept as rigid bodies during MC simulation). Then, the structure was changed along a randomly chosen mode by a random displacement. Different from what was used to compute normal modes, the energy function used to evaluate the acceptance included a short-range bonded term, a long-range nonbonded term, a hydrophobicity term, and a constraint derived from the SAXS profile (detailed below). Multiple trajectories were used to generate a set of folded structures, usually with quite different topologies. These structures were then used as candidates from which the topology that is the most similar to native topology (most native-like) was identified by using an effective procedure recently developed by us (Wu et al., 2005).
Protein Model and Initial Condition Each amino acid was represented by a single Ca atom. Two adjacent Ca atoms were assumed to be connected by a pseudo bond with an equilibrium length of 3.8 A˚. The conformation of the Ca trace for a protein of N residues was thus defined by 3N 2 6 parameters: N 2 1 virtual bonds {li}, N 2 2 virtual bond angles {qi}, N 2 3 dihedral angles {fi}. The Cartesian coordinates of the Ca trace of a protein were constructed by a method developed by Levitt and colleagues (Park and Levitt, 1995). Given the positions of the first three Ca atoms by: x1 = 0; y1 = 0; z1 = 0 x2 = 3:8; y2 = 0; z2 = 0 x3 = x2 + 3:8 cosðp 2 q2Þ y3 = y2 + 3:8 sinðp 2 q2Þ z3 = 0:
(1)
The coordinates of the ith Ca atom ri (i R 4) were calculated by: ri = ri 2 1 + 3:8 cosðp 2 qi 2 1 Þu + 3:8 sinðp 2 qi 2 1 Þ cosfi 2 2 v + 3:8 sinðp 2 qi 2 1 Þ sinfi 2 2 w;
(2)
where three orthogonal unit vectors, u, v, and w, were defined as: u=
v=
ri 2 1 2 ri 2 2 jri 2 1 2 ri 2 2 j
ðri 2 3 2 ri 2 2 Þ 2 ½ðri 2 3 2 ri 2 2 Þ,uu jðri 2 3 2 ri 2 2 Þ 2 ½ðri 2 3 2 ri 2 2 Þ,uuj w = u3v:
(3)
Since we assumed that the secondary structures were known, the helical regions were fixed at the ideal angles (q = 90º, f = 50º) (Bahar et al., 1997). For residues in the loop regions, the bond angles and dihedrals were set as completely extended (q = 120º, f = 180º). By starting our simulations from a maximally extended conformation, we expected to avoid bias. Monte Carlo Simulation Protocol Based on Elastic Normal Modes The sampling of conformational space during folding simulations was carried out by an MC protocol (Allen and Tildesley, 1980), which usually involves two major steps: (1) a random walk of the structure in configurational space, and (2) a Metropolis criterion for acceptance of the walk based on the Boltzmann distribution. In order to increase the sampling efficiency for conformational collapse during folding, we developed a novel, to our knowledge, protocol for implementing the MC simulation. The essence of this protocol is the separation of Hamiltonian in scale for the two steps. In the first step,
SAXS-Assisted Folding of Small Helical Proteins 1595
each random walk of the structure was along a low-frequency eigenvector of elastic normal mode calculated from a coarse-grained Hamiltonian for the elastic network. Then, in the second Metropolis step, the Hamiltonian used to evaluate the energy change was based on more accurate potential functions. Random Walk Based on Elastic Normal Modes In the first random walk step, the low-frequency normal modes of the structure were calculated by using a simplified potential function, 0 1jrij j%rc gX 2 V= : (4) sij ðjrij j 2 jrij0 jÞ ; sij = 0jrij0 j>rc 2 ij Here, rij and jrij0 j were the instantaneous and current values of distance, respectively, between Ca atom pair i and j. The force constant g was uniformly set to one, and the cutoff distance rc was set to 13 A˚. A distinct feature of elastic normal modes is that they are calculated from a potential function that regards the current coordinates as the equilibrium coordinates. Consequently, initial energy minimization was not required; thus, no additional structural distortion resulting from energy minimization would occur, which made it possible to carry out this normal mode-based Monte Carlo protocol step by step to propagate the simulation. Furthermore, since the helices were treated as rigid bodies, we used the well-established BNM approach (Durand et al., 1994; Li and Cui, 2002; Tama et al., 2000) to change the conformation of the protein. The BNM method first partitions a structure into n blocks. In our model, one block was chosen for each helix or each residue in the loop regions. For a particular conformation of a protein, a subset of low-frequency modes was calculated based on the current Ca network by using BNM and the elastic harmonic potential (Equation 4). The structure was then moved along a randomly chosen mode with a random displacement. The SHAKE algorithm (Ryckaert et al., 1977) was applied to keep the pseudo bond length unchanged. The newly updated structure was either accepted or rejected based on the Metropolis criterion according to a different Hamiltonian described in the next section. If it was accepted, the newly updated structure was set as the current structure of the protein for the next cycle. Hamiltonian Used in Metropolis Criterion The energy function we used in Metropolis update can be expressed as: UTotal = UAngle + UvdW + UNonbonded + UHydrophobic + UScattering
(5)
The first term on the right is for the pseudo bond angle constraint, which can be maintained by a harmonic potential, UAngle =
X 1 q 2 q0 2 kq : 2 qv i
(6)
The equilibrium bond angle q0 was set to 105º, the constant qv was set to 15º, and the force constant kq was 2. In order to prevent the protein model from interresidue collapse, a pseudo van der Waals potential, UvdW, was added with the form of (Berriz and Shakhnovich, 2001): UvdW = u0
d0 dij
12
22
d0 dij
6 ;
(7)
where dij was the instantaneous distance between residue i and j, d0 was a constant distance set to 5 A˚, which was the most favorable distance between two adjacent Ca atoms, and u0 was set to 1. The long-range nonbonded interactions were modeled by a residue-specific, distance-dependent potential extracted from the structural database by Bahar and Jernigan (Bahar and Jernigan, 1997). The knowledge-based potential can be written as: Unon 2 bonded =
X
j; rÞ; uði;
(8)
i< j
in which i and j were residue indices, r was the distance between the j; rÞ was the energy parameter. two residues, and uði; Since hydrophobic residues tend to be buried in the proteins, we constructed a crude penalty term to mimic this hydrophobic effect. A group of hydrophobic residues on the helices with the highest hy-
drophobicity indices (Kyte and Doolittle, 1982), such as Ile, Leu, Val, or Phe, were chosen. Then, the summation of distances between all pairs of the chosen hydrophobic residues was calculated as: UHydrophobic =
1 X . . 2 r 2 rj : 2 i; j˛HC i
(9)
Here, HC stands for the chosen hydrophobic core. This term has an effect of ‘‘pulling’’ the chosen set of the hydrophobic core toward a more globular packing geometry. Our experience indicates that it was better to choose a set of hydrophobic residues with very high hydrophobicity indices than to include all of the hydrophobic residues in the sequence. This is perhaps because those residues tend to be deeply buried in tertiary structures. Finally, the SAXS profile of proteins that gives information on their size and shape was incorporated as a constraint for the folding process by adopting a term in the potential function, UScattering, concerning the scattering profile as: UScattering = w
n X
2
jIT ðsi Þ 2 IM ðsi Þj ;
(10)
i=1
where Si is the scattering vector, and IT(si) and IM(si) are the scattering intensity of target structure and modeled structure, respectively. The constant w is a weight for balancing the contribution with respect to other energy terms. Similar to what has been reported in the literature (Walther et al., 2000; Zheng and Doniach, 2002), the scattering intensity I(s) of a model formed by N beads can be calculated by the Debye equation: IðsÞ = N + 2
N X sin 2psrij ; 2psrij i; j
(11)
where rij is the distance between a pair of beads. For larger systems, the above N 3 N calculation can be replaced by an alternative Debye equation in its pair-distance histogram form: IðsÞ = N + 2
nbins X i=1
gðri Þ
sinð2pjri jsÞ ; 2pjri js
(12)
where g(ri) is the pair-distance histogram of all singly counted pairwise distances, and the number of bins is nbins. To numerically represent the scattering profile I(s), the scattering vector s was discretized with the interval ds = 0.001 A˚21. So, if the maximal s was set to 0.1 A˚21, then the value of n in Equation 10 would be 100. The weight factor, w, was adjusted to a value at which the scattering term roughly made up one-fifth of the contribution to the total energy function in Equation 5. After the new structure was generated by a random walk along normal mode, the acceptance of this move was based on the Metropolis criterion: the new conformation was accepted based on probability p of: 2 UTotal 2 U0Total ; (13) pwexp C where UTotal was the energy of the new conformation, U0Total was the energy of the old conformation, and C was a constant related to Monte Carlo temperature that was empirically chosen. The initial value of constant C was adjusted from 1.5 to 3 depending on the size of the proteins. Generally, proteins with larger sizes got a larger value of C. Additionally, the parameter C was decreased in the final steps of the simulation so that the structure could converge to the lower-energy state. Correspondingly, the maximal displacement along the normal mode was also decreased in the final steps so that the conformation space can be finely sampled before the simulation is converged. Final Identification of Native Topology Using the coarse-grained potential for residual interactions and other simple constraints, one naturally cannot expect every simulation trajectory to converge to exactly the same native topology. The problem of deriving a three-dimensional model from a one-dimensional scattering profile was fundamentally underdetermined, especially when using solution scattering information. Therefore, for each
Structure 1596
case, multiple trajectories were generated independently, which led to a number of compact structures that were similar in shape and radius of gyration, but different in tertiary packing of helices. A major task then was to identify the most native-like topology among all of the folded structures. To do so, we adopted an energy-based screening procedure developed in our previous study (Wu et al., 2005) that is briefly outlined below. For each final folded structure, only the axes and directions of helices (as vectors) and their corresponding sequence identities were extracted from Ca coordinates. These vectors represented the packing geometry of a particular structure. Then, an ensemble of slightly perturbed structures was constructed to represent the conformational variations around the given topology. All of the structures in the ensemble were generated as a Ca-based trace representation. Each structure in the ensemble was constructed by two steps. The first step was helix placement: the standard helical segments were placed based on their extracted helical axes. To allow for enough structural deviations, the x, y, and z coordinates of the centers of mass of the standard helical segments were perturbed in a sphere of radius of 1.0 A˚. The three angles of rotation around the centers of mass were also randomized within a range of 5º. The second step was loop construction once the helices were all in place, which was done by an off-lattice MC procedure. First, the loops were built up to their full length in an extended conformation. Then, they were allowed to relax based on MC movement to connect two neighboring helices. After the construction of all perturbed structures in the ensemble, all members of the ensemble were separately energy refined by global optimization, which included the genetic algorithm (GA) optimization for rotations of the helices and MC relaxation for loop regions. By iteratively applying the procedure, all structures were guided globally to energetically more favorable states. After global optimization of all members of the structural ensemble, the average energy for the whole ensemble was evaluated. The average energies of all folded structures are then compared to locate the one with the lowest energy. Our previous study indicated that (Wu et al., 2005), even with severe errors inherent in the constructed structures and limited accuracy in the coarse-grained potential functions, the average energy of a slightly perturbed ensemble of structures around a given topology was a much more robust evaluator for the topology than any individual member in the ensemble, no matter how extensively the structures in the ensemble were optimized. Acknowledgments We wish to thank Tom Irving for helpful discussion in the early stage of the project. The authors gratefully acknowledge the support from the National Institutes of Health (R01-GM067801). M.C. and M.L. are partially supported by predoctoral fellowships from the W.M. Keck Foundation of the Gulf Coast Consortia through the Keck Center for Computational and Structural Biology. Received: May 25, 2005 Revised: July 21, 2005 Accepted: July 22, 2005 Published: November 8, 2005 References Allen, M.P., and Tildesley, D.J. (1980). Computer Simulation of Liquids (Oxford, UK: Clarendon Press). Atilgan, A.R., Durell, S.R., Jernigan, R.L., Demirel, M.C., Keskin, O., and Bahar, I. (2001). Anisotropy of fluctuation dynamics of proteins with an elastic network model. Biophys. J. 80, 505–515. Bahar, I., and Jernigan, R.L. (1997). Inter-residue potentials in globular proteins and the dominance of highly specific hydrophilic interactions at close separation. J. Mol. Biol. 266, 195–214. Bahar, I., Kaplan, M., and Jernigan, R.L. (1997). Short-range conformational energies, secondary structure propensities, and recognition of correct sequence-structure matches. Proteins 29, 292–308. Berriz, G.F., and Shakhnovich, E.I. (2001). Characterization of the folding kinetics of a three-helix bundle protein via a minimalist Langevin model. J. Mol. Biol. 310, 673–685.
Blundell, T.L., Sibanda, B.L., Sternberg, M.J., and Thornton, J.M. (1987). Knowledge-based prediction of protein structures and the design of novel molecules. Nature 326, 347–352. Boczko, E.M., and Brooks, C.L., 3rd. (1995). First-principles calculation of the folding free energy of a three-helix bundle protein. Science 269, 393–396. Bowie, J.U., Luthy, R., and Eisenberg, D. (1991). A method to identify protein sequences that fold into a known three-dimensional structure. Science 253, 164–170. Cardenas, A.E., and Elber, R. (2003). Kinetics of cytochrome C folding: atomically detailed simulations. Proteins 51, 245–257. Costenaro, L., Grossmann, J.G., Ebel, C., and Maxwell, A. (2005). Small-angle X-ray scattering reveals the solution structure of the full-length DNA gyrase a subunit. Structure 13, 287–296. Davies, J.M., Tsuruta, H., May, A.P., and Weis, W.I. (2005). Conformational changes of p97 during nucleotide hydrolysis determined by small-angle X-ray scattering. Structure 13, 183–195. Dill, K.A., and Chan, H.S. (1997). From Levinthal to pathways to funnels. Nat. Struct. Biol. 4, 10–19. Dinner, A.R., Sali, A., Smith, L.J., Dobson, C.M., and Karplus, M. (2000). Understanding protein folding via free-energy surfaces from theory and experiment. Trends Biochem. Sci. 25, 331–339. Duan, Y., and Kollman, P.A. (1998). Pathways to a protein folding intermediate observed in a 1-microsecond simulation in aqueous solution. Science 282, 740–744. Durand, P., Trinquier, G., and Sanejouand, Y.H. (1994). New approach for determining low-frequency normal-modes in macromolecules. Biopolymers 34, 759–771. Elofsson, A., Fischer, D., Rice, D.W., Le Grand, S.M., and Eisenberg, D. (1996). A study of combined structure/sequence profiles. Fold Des. 1, 451–461. Guo, Z., and Thirumalai, D. (1995). Kinetics of protein folding: nucleation mechanism, time scales, and pathways. Biopolymers 36, 83–102. Jiang, W., Baker, M.L., Ludtke, S.J., and Chiu, W. (2001). Bridging the information gap: computational tools for intermediate resolution structure interpretation. J. Mol. Biol. 308, 1033–1044. Jones, D.T. (1999). Protein secondary structure prediction based on position-specific scoring matrices. J. Mol. Biol. 292, 195–202. Jones, D.T., Taylor, W.R., and Thornton, J.M. (1992). A new approach to protein fold recognition. Nature 358, 86–89. Karplus, M., and Weaver, D.L. (1979). Diffusion-collision model for protein folding. Biopolymers 18, 1421–1437. Karplus, M., and Weaver, D.L. (1994). Protein folding dynamics: the diffusion-collision model and experimental data. Protein Sci. 3, 650– 668. Kojima, K., Timchenko, A.A., Higo, J., Ito, K., Kihara, H., and Takahashi, K. (2004). Structural refinement by restrained molecular-dynamics algorithm with small-angle X-ray scattering constraints for a biomolecule. J. Appl. Crystallogr. 37, 103–109. Kong, Y., and Ma, J. (2003). A structural-informatics approach for mining b-sheets: locating sheets in intermediate-resolution density maps. J. Mol. Biol. 332, 399–413. Kong, Y., Zhang, X., Baker, T.S., and Ma, J. (2004). A Structural-informatics approach for tracing b-sheets: building pseudo-Ca traces for b-strands in intermediate-resolution density maps. J. Mol. Biol. 339, 117–130. Kyte, J., and Doolittle, R.F. (1982). A simple method for displaying the hydropathic character of a protein. J. Mol. Biol. 157, 105–132. Li, G., and Cui, Q. (2002). A coarse-grained normal mode approach for macromolecules: an efficient implementation and application to Ca(2+)-ATPase. Biophys. J. 83, 2457–2474. Lin, K., Simossis, V.A., Taylor, W.R., and Heringa, J. (2005). A simple and fast secondary structure prediction method using hidden neural networks. Bioinformatics 21, 152–159. Liwo, A., Khalili, M., and Scheraga, H.A. (2005). Ab initio simulations of protein-folding pathways by molecular dynamics with the unitedresidue model of polypeptide chains. Proc. Natl. Acad. Sci. USA 102, 2362–2367.
SAXS-Assisted Folding of Small Helical Proteins 1597
Lu, M., and Ma, J. (2005). The role of shape in determining molecular motion. Biophys. J. 4, 2395–2401. Ma, J. (2004). New advances in normal mode analysis of supermolecular complexes and applications to structural refinement. Curr. Protein Pept. Sci. 5, 119–123. Ma, J. (2005). Usefulness and limitations of normal mode analysis in modeling dynamics of biomolecular complexes. Structure 13, 373– 380. Mehboob, S., Jacob, J., May, M., Kotula, L., Thiyagarajan, P., Johnson, M.E., and Fung, L.W. (2003). Structural analysis of the a Nterminal region of erythroid and nonerythroid spectrins by smallangle X-ray scattering. Biochemistry 42, 14702–14710. Moult, J., Fidelis, K., Zemla, A., and Hubbard, T. (2003). Critical assessment of methods of protein structure prediction (CASP)-round V. Proteins 53 (Suppl 6), 334–339. Onuchic, J.N., Luthey-Schulten, Z., and Wolynes, P.G. (1997). Theory of protein folding: the energy landscape perspective. Annu. Rev. Phys. Chem. 48, 545–600. Orengo, C.A., Michie, A.D., Jones, S., Jones, D.T., Swindells, M.B., and Thornton, J.M. (1997). CATH–a hierarchic classification of protein domain structures. Structure 5, 1093–1109. Pande, V.S., Baker, I., Chapman, J., Elmer, S.P., Khaliq, S., Larson, S.M., Rhee, Y.M., Shirts, M.R., Snow, C.D., Sorin, E.J., and Zagrovic, B. (2003). Atomistic protein folding simulations on the submillisecond time scale using worldwide distributed computing. Biopolymers 68, 91–109. Park, B.H., and Levitt, M. (1995). The complexity and accuracy of discrete state models of protein structure. J. Mol. Biol. 249, 493–507. Pedersen, J.T., and Moult, J. (1997). Protein folding simulations with genetic algorithms and a detailed molecular description. J. Mol. Biol. 269, 240–259. Petoukhov, M.V., Eady, N.A., Brown, K.A., and Svergun, D.I. (2002). Addition of missing loops and domains to protein models by x-ray solution scattering. Biophys. J. 83, 3113–3125. Ryckaert, J.P., Ciccotti, G., and Berendsen, H.J.C. (1977). Numerical integration of the cartesian equations of motion of a system with constraints: molecular dynamics of n-alkanes. J. Comput. Phys. 23, 327–341. Sali, A., Shakhnovich, E., and Karplus, M. (1994). How does a protein fold? Nature 369, 248–251. Simons, K.T., Kooperberg, C., Huang, E., and Baker, D. (1997). Assembly of protein tertiary structures from fragments with similar local sequences using simulated annealing and Bayesian scoring functions. J. Mol. Biol. 268, 209–225. Srinivasan, R., Fleming, P.J., and Rose, G.D. (2004). Ab initio protein folding using LINUS. Methods Enzymol. 383, 48–66. Svergun, D.I., and Koch, M.H. (2002). Advances in structure analysis using small-angle scattering in solution. Curr. Opin. Struct. Biol. 12, 654–660. Svergun, D.I., Petoukhov, M.V., and Koch, M.H. (2001). Determination of domain structure of proteins from X-ray solution scattering. Biophys. J. 80, 2946–2953. Tama, F., Gadea, F.X., Marques, O., and Sanejouand, Y.H. (2000). Building-block approach for determining low-frequency normal modes of macromolecules. Proteins 41, 1–7. Walther, D., Cohen, F.E., and Doniach, S. (2000). Reconstruction of low-resolution three-dimensional density maps from onedimensional small-angle x-ray solution scattering data for biomolecules. J. Appl. Crystallogr. 33, 350–363. Wedemeyer, W.J., Welker, E., Narayan, M., and Scheraga, H.A. (2000). Disulfide bonds and protein folding. Biochemistry 39, 4207–4216. Wolynes, P.G., Onuchic, J.N., and Thirumalai, D. (1995). Navigating the folding routes. Science 267, 1619–1620. Wu, Y., Chen, M., Lu, M., Wang, Q., and Ma, J. (2005). Determining protein topology from skeletons of secondary structures. J. Mol. Biol. 350, 571–586.
Zhang, Y., and Skolnick, J. (2004). Tertiary structure predictions on a comprehensive benchmark of medium to large size proteins. Biophys. J. 87, 2647–2655. Zheng, W., and Doniach, S. (2002). Protein structure prediction constrained by solution X-ray scattering data and structural homology identification. J. Mol. Biol. 316, 173–187. Zheng, W., and Doniach, S. (2005). Fold recognition aided by constraints from small angle X-ray scattering data. Protein Eng. Des. Sel. 18, 209–219. Zhou, Y., and Karplus, M. (1999). Interpreting the folding kinetics of helical proteins. Nature 401, 400–403.