Structural Interpretation of Paramagnetic Relaxation Enhancement-Derived Distances for Disordered Protein States

J. Mol. Biol. (2009) 390, 467–477; doi:10.1016/j.jmb.2009.05.019. Available online at www.sciencedirect.com


Debabani Ganguly and Jianhan Chen⁎

Department of Biochemistry, Kansas State University, Manhattan, KS 66506, USA

Received 9 March 2009; received in revised form 27 April 2009; accepted 12 May 2009. Available online 15 May 2009

Paramagnetic relaxation enhancement (PRE) is a powerful technique for studying transient tertiary organizations of unfolded and partially folded proteins. The heterogeneous and dynamic nature of disordered protein states, together with the r⁻⁶ dependence of PRE, presents significant challenges for reliable structural interpretation of PRE-derived distances. Without additional knowledge of accessible conformational substates, ensemble-simulation-based protocols have been used to calculate structure ensembles that appear to be consistent with the PRE distance restraints imposed on the ensemble level with the proper r⁻⁶ weighting. However, rigorous assessment of the reliability of such protocols has been difficult without intimate knowledge of the true nature of disordered protein states. Here we utilize sets of theoretical PRE distances derived from simulated structure ensembles that represent the folded, partially folded and unfolded states of a small protein to investigate the efficacy of ensemble-simulation-based structural interpretation of PRE distances. The results confirm a critical limitation that, due to r⁻⁶ weighting, only one or a few members need to satisfy the distance restraints and the rest of the ensemble are essentially unrestrained. Consequently, calculated structure ensembles will appear artificially heterogeneous no matter whether the PRE distances are derived from the folded, partially unfolded or unfolded state. Furthermore, the nature of the heterogeneous ensembles is largely determined by the protein model employed in structure calculation and reflects little on the true nature of the underlying disordered state. These findings suggest that PRE measurements on disordered protein states alone generally do not contain enough information for a reliable structural interpretation and that the latter will require additional knowledge of accessible conformational substates. Interestingly, when a very large number of PRE measurements is available, faithful structural interpretation might be possible with intermediate ensemble sizes under ideal conditions.

© 2009 Elsevier Ltd. All rights reserved.

Edited by A. G. Palmer III

Keywords: disordered protein states; ensemble simulation; intrinsic disorder; molecular dynamics; protein folding

*Corresponding author. E-mail address: [email protected].

Abbreviations used: PRE, paramagnetic relaxation enhancement; RDC, residual dipolar coupling; NOE, nuclear Overhauser effect; MD, molecular dynamics; REX-MD, replica exchange molecular dynamics.

Introduction

It has been recently recognized that many functional proteins lack stable tertiary structures and can exist as dynamical ensembles of disordered conformations under physiological conditions.1–3 These proteins are frequently involved in crucial areas such as regulation and cellular signaling, and their disordered nature might offer a range of unique advantages, allowing high specificity coupled with low affinity, rapid turnover for accurate responses and structural plasticity for binding multiple partners. Unfolded and partially unfolded states are also important for understanding the folding and stability of structured proteins, even though in this case the disordered states are only weakly occupied under physiological conditions and normally need to be probed under denaturing conditions.4,5 In a disordered state (either induced by denaturing conditions or due to intrinsic disorder under native conditions), the protein samples a heterogeneous and dynamical ensemble of conformations without a stable tertiary fold. These disordered protein conformations are not necessarily random. Instead, various residual structures often persist, both on the secondary and on the tertiary level, and might have important functional implications.2,4–8 A detailed structural characterization of disordered protein states is thus necessary for understanding a wide range of biological processes and diseases that involve protein folding, misfolding and aggregation.

Experimental structural characterization of disordered protein states has proven to be very difficult. The heterogeneous and dynamical nature essentially eliminates the feasibility of traditional X-ray crystallography-based high-resolution structural determination. Biomolecular nuclear magnetic resonance (NMR) spectroscopy is one of the most comprehensive techniques for structural characterization of disordered protein states.4,5,8 A range of NMR observables can be measured to infer structural organizations on both the secondary and the tertiary levels. For example, so-called secondary chemical shifts, defined as deviations of chemical shifts from certain established random coil values, provide information on the residue secondary-structure propensities,9 and residual dipolar couplings (RDCs) obtained in partially aligned media might be used to uncover the existence of both residual secondary structures and transient tertiary contacts.10 Interproton distances derived from nuclear Overhauser effects (NOEs), which provide the primary restraints for structure determination of folded proteins, are however of limited use for characterizing disordered states due to a general absence of (sequentially) long-range contacts within the NOE range of about 6 Å.4

Paramagnetic relaxation enhancement (PRE) coupled with site-directed spin-labeling techniques can overcome these limitations by providing distance information up to ∼35 Å and has recently emerged as a powerful tool for characterization of transient contacts.11–13 The unpaired electrons of a paramagnetic center lead to increases in the relaxation rates of nearby nuclei, and such relaxation enhancement effects are very sensitive to the electron–nuclear distance with an r⁻⁶ dependence.14 A drawback of PRE is that extrinsic paramagnetic centers often need to be introduced through chemical modifications, which usually involve engineering of a single Cys residue through site-directed mutagenesis followed by covalent attachment of a nitroxide spin label. These modifications can perturb the disordered ensembles, both due to changes of the protein sequence and as a result of potential (nonspecific) interactions of the protein with the spin label.5 Nevertheless, PRE experiments have been shown to be capable of providing reliable, albeit qualitative, descriptions of transient long-range contacts of denatured protein states.15,16

Structure Interpretation of PRE Distances

A quantitative, structural interpretation of experimental observables obtained on a disordered protein state is challenging. The measurements can no longer be interpreted in the context of a single conformation but must be represented as averaged quantities of a heterogeneous ensemble. In the limit of rapidly interconverting conformers, the observed relaxation rate enhancement reduces to a statistical averaging over all conformers in the ensemble,12

$$\Delta R = K \int P(r)\, r^{-6}\, dr, \tag{1}$$

where P(r) is the distance distribution function for a given electron–nuclear pair. K is a constant that is defined by the nuclear resonance frequency and the apparent electron–nuclear dipole–dipole interaction correlation time, with different functional forms depending on whether the longitudinal or transverse relaxation rate is concerned.14 Due to the r⁻⁶ weighting, the ensemble average in Eq. (1) is dominated by contributions from compact conformers. While this provides the sensitivity for detecting weakly populated transient contacts, it also adds to the challenges of quantitative structural interpretation of PRE. For folded proteins, P(r) is narrow and deviations of the r⁻⁶ weighted averaged distances [referred to as PRE distances hereinafter, $r_{\mathrm{PRE}} = (\Delta R/K)^{-1/6}$] from simple (linear) averages are small. Therefore, the r⁻⁶ weighting does not lead to any practical limitations in applications of PRE for rapid structure determination of folded proteins.17,18 For a disordered protein state with a broad P(r), PRE distances can be much smaller than (linear) average distances, and direct incorporation of PRE distances using conventional single-conformation-based NMR structural calculation protocols will substantially overestimate the compactness of the structure ensemble. Furthermore, simultaneity of PRE-identified contacts is not known. In other words, any single conformation in a heterogeneous disordered ensemble might only make a small subset of the observed transient contacts.
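As a rough illustration of how far the PRE distance can fall below the linear average, consider a purely hypothetical two-state distribution: a compact substate at 10 Å populated 10% of the time and an extended substate at 30 Å populated 90% of the time. These populations and distances are illustrative assumptions, not values from this work. Applying the r⁻⁶ average of Eq. (1) to this discrete P(r) gives:

```latex
\begin{align*}
\langle r^{-6} \rangle &= 0.1\,(10\ \text{\AA})^{-6} + 0.9\,(30\ \text{\AA})^{-6}
   \approx 1.012 \times 10^{-7}\ \text{\AA}^{-6}, \\
r_{\mathrm{PRE}} &= \langle r^{-6} \rangle^{-1/6} \approx 14.6\ \text{\AA}, \\
\langle r \rangle &= 0.1 \times 10\ \text{\AA} + 0.9 \times 30\ \text{\AA} = 28\ \text{\AA}.
\end{align*}
```

In this sketch a substate with only 10% population pulls the PRE distance to roughly half of the linear average, which is why treating a PRE distance as if it described a single conformation overestimates compactness.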
Simply relaxing the PRE-derived distances in single-conformation-based structure calculations, such as by setting large uncertainty bounds,19–22 will therefore still lead to overly ordered ensembles with structural features that do not necessarily coexist in the same conformer at any moment. The structural interpretation of PRE distances is essentially an inverse problem, that is, to estimate the P(r) distributions, or in a simpler case the averages 〈P(r)〉, from a set of measured ΔR based on Eq. (1). Clearly, this is a severely underdetermined problem for disordered protein states. Additional challenges also arise from large experimental uncertainties of PRE distances due to the flexibility of spin labels and experimental errors associated with relaxation rate measurements.19,23 In particular, the highly flexible label in principle needs to be represented by an ensemble of states itself, and the effects of motion of the label also have to be taken into account, for example in terms of order parameters.23 As such, computation of the disordered ensembles often requires pregeneration of possible conformational substates, either based on various physical insights or using simulation tools in the presence of experimental restraints, and focuses on determining the weight of each substate.24–26 These approaches depend critically on the reliability of the assumed substates.

For a general structural interpretation without reliable prior knowledge of the conformational substates, ensemble-simulation approaches have been recently proposed and applied to analyze the disordered states of several proteins.27–31 The main idea is to simulate Nrep noninteracting replicas of the protein simultaneously and impose the restraint potential based on ensemble-averaged distances,

$$E_{\mathrm{PRE}} = k_{\mathrm{PRE}} \sum_i \left[ \max\left( \left| r_i^{\mathrm{calc}} - r_i^{\mathrm{exp}} \right| - r_{\mathrm{bound}},\ 0 \right) \right]^2, \tag{2}$$

where $r_i^{\mathrm{exp}}$ is the experimental PRE distance and $r_i^{\mathrm{calc}}$ is the calculated value defined as

$$r_i^{\mathrm{calc}} = \left( N_{\mathrm{rep}}^{-1} \sum_{k=1}^{N_{\mathrm{rep}}} r_{i,k}^{-6} \right)^{-1/6}. \tag{3}$$

In Eq. (2), the half-width of the flat-bottom harmonic potential, rbound, represents the estimated uncertainty of PRE distances and is typically set to 5 Å for the analysis of disordered protein states. The peptide chain can be represented by either coarse-grained models (e.g., a Cα-only self-avoiding polymer model27) or all-atom classical mechanical force fields.28 Restrained molecular dynamics (MD) or Monte Carlo simulations are then used to generate conformation ensembles that sufficiently satisfy the PRE distance restraints. Validation of the calculated structure ensembles on disordered protein states so far has been mainly based on comparison of additional ensemble-averaged properties such as radius of gyration (Rg) or RDC with experimental results.5 However, a rigorous assessment of the effectiveness of various structural interpretation approaches and the reliability of the resulting disordered ensembles has not been possible with current limits in our understanding of disordered protein states.

In the present work, we utilize simulated structural ensembles with various degrees of disorder to derive theoretical PRE distances and examine how reliably the ensemble-simulation methods can be used to obtain a structural interpretation of these PRE distances without additional assumptions of or insights into the possible conformational substates. Since the underlying structure ensembles that give rise to these theoretical PRE distances are precisely known, it is possible to assess the reliability of structural interpretation unambiguously.
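The ensemble-averaged restraint of Eqs. (2) and (3) can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in this work: the function names `pre_distance` and `pre_restraint_energy`, the data layout (one list of per-replica distances for each restraint) and the example values are all assumptions made for the sketch.

```python
def pre_distance(distances):
    """r^-6-weighted ensemble average of one electron-nuclear distance, Eq. (3).

    `distances` holds the distance of the same pair in every replica (in Å).
    """
    n_rep = len(distances)
    mean_inv6 = sum(r ** -6 for r in distances) / n_rep
    return mean_inv6 ** (-1.0 / 6.0)


def pre_restraint_energy(replica_distances, r_exp, k_pre=1.0, r_bound=5.0):
    """Flat-bottom harmonic restraint energy of Eq. (2).

    replica_distances[i] lists the distances of restraint i in all replicas;
    r_exp[i] is the corresponding target PRE distance (hypothetical layout).
    """
    energy = 0.0
    for dists, target in zip(replica_distances, r_exp):
        # Only the part of the deviation that exceeds r_bound is penalized.
        violation = max(abs(pre_distance(dists) - target) - r_bound, 0.0)
        energy += k_pre * violation ** 2
    return energy


# Illustrative two-replica ensemble with a single 10 Å restraint:
# replica 1 sits at the target distance while replica 2 is moved away.
for r2 in (10.0, 20.0, 40.0, 80.0):
    r_calc = pre_distance([10.0, r2])
    e_pre = pre_restraint_energy([[10.0, r2]], [10.0])
    print(f"r2 = {r2:5.1f} Å   r_calc = {r_calc:6.3f} Å   E_PRE = {e_pre:.3f}")
```

In this toy case the calculated distance can never exceed 10 × 2^(1/6) ≈ 11.22 Å however far the second replica moves, so with `r_bound = 5.0` the restraint energy stays at zero.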

Results and Discussion

A theoretical consideration of the consequences of r⁻⁶ dependence

While the restrained ensemble-simulation strategy outlined in Eqs. (2) and (3) appears to be a straightforward and faithful interpretation of the PRE distances obtained on a disordered protein state, careful consideration reveals a severe limitation in practice. Specifically, in the presence of one or a few compact replicas that satisfy the PRE distance restraints, other members of the ensemble will be minimally restrained by the PRE pseudo energy function and are essentially free to adopt any conformations as dictated by the underlying peptide model. We illustrate this effect in Fig. 1, using a simple example where we only consider a two-replica ensemble with a single PRE distance restraint. If one of the replicas already satisfies the experimental PRE distance, there is only a slight dependence of the calculated PRE distance on the conformation of the second replica. In this particular example, the limiting calculated PRE distance when r2 approaches infinity is rcalc = 10 × (1/2)^(−1/6) = 11.22 Å. If we further consider the typical choice of rbound = 5 Å in the flat-bottom harmonic potential of Eq. (2), there will not be any distance violation regardless of the conformation of the second replica. Both the restraint potential energy and the gradient are zero for all possible values of r2 except at very small values as long as r1 = rexp. This simple example illustrates a fundamental limitation of r⁻⁶ weighted averaging that requires only a few replicas to satisfy the PRE distance restraints and allows most replicas to be essentially unrestrained in the ensemble-simulation approach. With large Nrep, the calculated structure ensembles will appear to be heterogeneous, as expected for a disordered protein state. However, the nature of the ensemble will depend almost solely on the peptide model that is employed in the calculation and does not necessarily resemble the true protein state that gives rise to the set of observed PRE distances. This is a severe consequence that can lead to misleading structural interpretations in experimental studies of unknown disordered protein states.

Fig. 1. Theoretical PRE distance, restraint potential energy and gradient as a function of r2 of a two-replica ensemble. All quantities were computed using Eqs. (2) and (3) with Nrep = 2, r1 = rexp = 10 Å, kPRE = 1.0 kcal/mol and rbound = 0.

Generation of structural ensembles and theoretical PRE distances

To further validate the consequences of the above-discussed limitation of the r⁻⁶ weighted averaging, theoretical PRE distances were computed from several simulated structure ensembles that represent the folded, partially folded or unfolded protein states. These structure ensembles were generated using coarse-grained simulations of the 56-residue protein G B1 domain (GB1) with a Gō-like model, shown in Fig. 2. Use of a minimalist model that is capable of capturing many essential features of protein folding ensures that the simulated structure ensembles would reasonably resemble true protein states without the necessity of expensive all-atom simulations. GB1 was chosen as the model protein for its small size and robust α/β fold. However, the results presented here are not expected to depend on the choice of model protein. Three structural ensembles with various degrees of disorder sampled at 270, 400 and 600 K were chosen to represent the folded, partially folded and unfolded states, respectively. As shown in Fig. 3, these ensembles show a gradual loss of structure and increasing levels of heterogeneity with higher temperature. In particular, the 600 K ensemble is virtually free of long-range order. Nonetheless, there are still substantial probabilities of observing compact conformations that are characterized by small radius of gyration (Rg), end-to-end distance and transient long-range contacts even at 600 K. Note that for a heterogeneous ensemble, average properties, even without nonlinear weighting such as in PRE, might still give a wrong impression of the level of structural order, as features that do not necessarily coexist in the same conformer at any moment will emerge together through averaging. For example, few structures sampled at 400 K, such as the representative ones shown in Fig. 3a, are nearly as native-like as the average contact map shown in Fig. 3b might suggest.

Fig. 2. A Cα-only model of protein GB1. Sites of theoretical spin-label attachment include M1, E15, K28 and T44 (green spheres, included in 4-, 8- and 16-labeling data sets); N8, D22, D36 and K50 (blue spheres, additionally included in 8- and 16-labeling data sets); L5, L12, T18, V29, Q32, D40, D46 and V54 (red spheres, additionally included in 16-labeling data set). Location of a representative long-range native contact between A26 and F52 is also marked (gray spheres).

Three sets of theoretical PRE distances (folded, partially folded and unfolded) were then calculated based on Eq. (3) for all residue pairs that form native contacts in the Protein Data Bank structure.32 We note that, experimentally, only a few residues are typically chosen for spin labeling, and sets of PRE distances between these labeling sites and the rest of the protein are measured. The theoretical PRE distances as constructed above thus have different distribution patterns. However, as will be demonstrated later, the observations discussed below do not depend on the difference in the pattern of the PRE distance network. In the Supplementary Materials, we provide a table of all the theoretical PRE distances in comparison with the corresponding linear averages (Table S1). Theoretical PRE broadenings for several representative sites of spin-label attachment are shown in Fig. S1.

Structural interpretation of the theoretical PRE distances

A simulated-annealing ensemble-simulation protocol was used to back-calculate the structure ensembles from the theoretical PRE distance sets. The protein is described by a self-avoiding Cα-only polymer model analogous to the one used by Lindorff-Larsen and coworkers.27 Many simulated-annealing trials were carried out for each combination of theoretical PRE data set (folded, partially folded and unfolded) and ensemble size (Nrep = 1, 2, 8 and 32) to generate at least 10,000 structures that completely satisfy the PRE distance restraints for each case. The probability distributions of key structural properties computed from these calculated ensembles are summarized in Fig. 4 and the representative snapshots are shown in Fig. 5.
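Returning briefly to the restraint itself, the weak influence of a distant replica noted for the two-replica example of Fig. 1 can be made explicit by differentiating the ensemble average of Eq. (3) with respect to the distance in a single replica k. The short derivation below is a sketch that follows directly from the definition and is not taken from the original analysis:

```latex
\begin{align*}
r_i^{\mathrm{calc}} &= \left( N_{\mathrm{rep}}^{-1} \sum_{l=1}^{N_{\mathrm{rep}}} r_{i,l}^{-6} \right)^{-1/6}, \\
\frac{\partial r_i^{\mathrm{calc}}}{\partial r_{i,k}}
  &= -\frac{1}{6} \left( N_{\mathrm{rep}}^{-1} \sum_{l=1}^{N_{\mathrm{rep}}} r_{i,l}^{-6} \right)^{-7/6}
     \left( -\frac{6}{N_{\mathrm{rep}}}\, r_{i,k}^{-7} \right)
   = \frac{1}{N_{\mathrm{rep}}} \left( \frac{r_i^{\mathrm{calc}}}{r_{i,k}} \right)^{7}.
\end{align*}
```

The sensitivity of the calculated PRE distance to replica k therefore decays with the seventh power of that replica's distance. For the settings of Fig. 1 (Nrep = 2, r1 = 10 Å), a second replica at 30 Å changes the calculated distance at a rate of only roughly 5 × 10⁻⁴ per unit change in r2, so any restraint force transmitted to it is correspondingly small.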
As expected, single-conformation-based interpretation of the PRE distances (with Nrep = 1) leads to overly compact structures that dramatically overemphasize the (native-like) structural features for heterogeneous protein states such as the partially folded and unfolded states. Strikingly, regardless of the theoretical PRE distance set used, the calculated ensembles always become increasingly heterogeneous with larger Nrep. In particular, with Nrep = 32, the ensembles calculated from the three PRE data sets are all highly disordered (e.g., see Fig. 5) and give rise to similarly broad distributions of all the structural properties examined (e.g., see Fig. 4). This observation is in fact expected from the theoretical consideration discussed above, as only a few replicas need to satisfy the PRE distance restraints and most members of the ensemble are essentially unrestrained due to the nature of r⁻⁶ averaging in the ensemble-simulation approach.

Fig. 3. Representative snapshots (a) and key structural properties (b and c) of three structure ensembles sampled by a REX-MD simulation at 270, 400 and 600 K. The properties shown include (b) average residue contact probabilities and (c) probability distributions of radius of gyration (Rg), end-to-end distance and distance between a representative long-range residue pair, A26–F52. The theoretical A26–F52 PRE distances are 8.8, 10.3 and 14.4 Å for the 270, 400 and 600 K ensembles, respectively. The corresponding linear average values are 8.9, 13.4 and 26.5 Å, respectively.

In Fig. 4, we also compare the probability distributions of key structural properties to those calculated from the true structure ensembles that were used to derive the theoretical PRE distances. Clearly, the only case where the calculated ensemble matches the true one reasonably well is single-conformation interpretation of the PRE distances for the folded state. This agrees with the proven success of the well-established NOE-based NMR structure determination methodology33 and recent work using PRE as a means for quick fold determination of structured proteins.17,18 However, the broad distributions that resulted from using large Nrep (e.g., 8 or larger) do not resemble the true ones that give rise to the theoretical PRE distances. Instead, they closely resemble the distributions obtained from a long equilibrium simulation of the free self-avoiding polymer model employed in the ensemble structure calculations. The calculated ensembles simply become increasingly “spoiled” by essentially unrestrained replicas with larger Nrep, and there is no optimal choice of Nrep that would lead to a realistic structure ensemble for the heterogeneous protein states. These results validate our expectations derived from theoretical considerations of the consequences of r⁻⁶ dependence that the nature of the heterogeneous ensembles obtained from the restrained ensemble simulations is determined by the protein model employed and reflects little on the true nature of the disordered protein states. As further demonstrated in Fig. S4, if a different protein model is used in the structure calculations, different heterogeneous ensembles will result and will reflect the properties of the new protein model, even though the same sets of PRE distances are satisfied on the ensemble level. We note that structure ensembles generated from multiple simulated-annealing runs are not proper thermodynamic ensembles, which is mainly why the distributions calculated with large Nrep only resemble, but do not match perfectly with, the distributions obtained from long simulations of the free polymers in Figs. 4 and S4.
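How few replicas need to stay compact can be estimated from Eqs. (2) and (3) under a deliberately idealized assumption: m replicas sit exactly at the target distance rexp and all others are so far apart that they contribute nothing to the r⁻⁶ average. The calculated distance is then rexp (Nrep/m)^(1/6), and the restraint is satisfied when m ≥ Nrep [rexp/(rexp + rbound)]⁶. The helper below, whose name and default values are assumptions for this sketch, evaluates that bound.

```python
import math


def compact_replicas_needed(n_rep, r_exp, r_bound=5.0):
    """Replicas that must sit at r_exp for a zero-energy restraint (Eq. 2).

    Idealization (an assumption of this sketch): the remaining replicas are
    infinitely far apart, so r_calc = r_exp * (n_rep / m) ** (1 / 6).
    """
    fraction = (r_exp / (r_exp + r_bound)) ** 6
    # At least one replica must be compact for the average to be finite.
    return max(1, math.ceil(n_rep * fraction))


# Ensemble sizes used in the structure calculations, with a 10 Å target.
for n_rep in (1, 2, 8, 32):
    m = compact_replicas_needed(n_rep, r_exp=10.0)
    print(f"Nrep = {n_rep:3d}: {m} compact replica(s) suffice")
```

For a 10 Å target with rbound = 5 Å, one compact replica is enough for Nrep up to 8 and three are enough for Nrep = 32 in this idealized picture. In a real chain the other replicas are at finite distances, and compact replicas may sit closer than rexp, so even fewer may be required.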

Fig. 4. Probability distributions of radius of gyration, end-to-end distance and a representative long-range contact distance (A26–F52) calculated from the theoretical PRE distances for the folded (top), partially unfolded (middle) and unfolded states (bottom). Four ensemble sizes were used in the PRE structure calculations, including Nrep = 1, 2, 8 and 32. The results are shown with black, brown, orange and blue traces, respectively. These distributions were calculated from at least 10,000 structures generated from simulated-annealing ensemble simulations using the self-avoiding polymer model. The green traces were computed from the true underlying structure ensembles, and the red traces were computed from a 1-μs equilibrium simulation of the unrestrained self-avoiding polymer at 300 K.

The average contact maps of the calculated structure ensembles are examined in Fig. 6. Consistent with the probability distributions examined in Fig. 4, there is a gradual loss of structural features with larger Nrep regardless of which PRE distance set is used. In principle, in the case of using protein models that do not give rise to any intrinsic tertiary organization such as the self-avoiding polymer model employed here (e.g., see Fig. S2), one might expect that contributions from the unrestrained members of the ensemble average out and the final average contact maps might resemble the true ones, even though they are contaminated with unrestrained conformers. This is partly true. For example, the contact map for the unfolded state computed with Nrep = 8 matches that of the true one (see Fig. 3b) reasonably well. However, such a match is coincidental. In actual studies of unknown protein states, an objective choice of the optimal ensemble size is typically not possible, and, as discussed above, such a choice generally does not exist.

Fig. 5. Randomly selected snapshots of the structure ensembles calculated using the theoretical PRE distance sets that correspond to folded (top), partially unfolded (middle) and unfolded states (bottom). The structure ensembles were calculated using the simulated-annealing ensemble-simulation protocol with various ensemble sizes (Nrep = 1, 2, 8 and 32).

Dependence on the pattern of PRE distance network and number of spin-labeling sites

As discussed above, experimentally a set of selected sites is chosen for spin labeling and distances between these sites and the rest of the protein are then measured, while the above numerical experiments are based on PRE distances between residue pairs that are in contact in the native structure. To examine whether the results would be affected by this difference in pattern of PRE distance network and how the calculated structure ensembles depend on the number of sites of spin labeling, additional sets of theoretical PRE distances were calculated for the simulated disordered state at 600 K. These sets were derived from 4, 8 and 16 selected sites of spin labeling evenly distributed throughout the sequence (see Fig. 2). In addition, a full set that includes all possible long-range contacts (i.e., between sites separated by three or more residues sequentially) was also generated. Such a full set represents the limiting case where all residues in the protein are chosen for spin labeling, which is generally not feasible in practice. The same ensemble simulated-annealing protocol was then used to generate structure ensembles that are consistent with these theoretical PRE distance sets using different ensemble sizes. The probability distributions of the same three key structural properties are summarized in Fig. 7. Again, similar observations can be made: the calculated ensembles become increasingly heterogeneous with larger Nrep and lead to broader distributions, and the nature of the heterogeneous structure ensembles largely reflects the intrinsic properties of the underlying protein model employed in the structural calculations and does not resemble that of the true disordered state. These conclusions hold even for the full set with the maximum number of PRE measurements that can be made in this model protein. Nonetheless, it appears that for the 16-labeling and full sets, structure ensembles calculated with the intermediate ensemble size of Nrep = 32 describe the true underlying disordered state well (e.g., compare the orange and green traces in Fig. 7). This indicates that with a sufficiently large number of PRE distances, an optimal choice of ensemble size might exist and allow a faithful structural interpretation. This is an important observation that might provide useful guidelines for interpreting actual PRE measurements made on an unknown disordered protein state. The empirical guidelines are that (1) about one-quarter or more of the protein, evenly distributed throughout the sequence, needs to be chosen for spin labeling and (2) an intermediate ensemble size, e.g., Nrep ∼ 24 to 32, should be used in ensemble-based structural calculation (see Fig. S5).

Under these conditions, the ratio of PRE distances per residue per replica is sufficiently large (e.g., greater than 4) and occurrence of unrestrained replicas during ensemble simulations might thus be effectively suppressed. Obtaining such a large number of PRE measurements is a very challenging task experimentally. However, this seems to be necessary to overcome the intrinsic limitations of PRE measurements due to r⁻⁶ averaging. We note that potential experimental errors in PRE-derived distances might further complicate the structural interpretation, and an even larger number of measurements might be required for bootstrapping in an actual setting. For large proteins, PRE distances between all nuclear–electron pairs might not be available and this will further increase the required number of spin-labeling sites. We have additionally investigated whether smaller assumed experimental uncertainties (e.g., rbound = 2.5 Å) or incorporation of an Rg restraint potential might help to improve the agreement between the true and back-calculated structure ensembles. The results suggest that neither measure can meaningfully improve the quality of calculated structure ensembles (data not shown).

Fig. 6. Average contact maps of structure ensembles calculated with the theoretical PRE distance sets that correspond to folded (top), partially unfolded (middle) and unfolded states (bottom). The structure ensembles were calculated using the simulated-annealing ensemble-simulation protocol with various ensemble sizes (Nrep = 1, 8 and 32).

Conclusions

While PRE is a powerful tool for experimental characterization of disordered protein states, the r⁻⁶ averaging nature of the derived nuclear–electron distances poses substantial challenges for their faithful structural interpretation. Without prior knowledge or assumption of the accessible underlying conformational substates, one can use ensemble-simulation approaches with PRE distance restraints imposed on the ensemble level with proper r⁻⁶ averaging, which in principle offer a straightforward means of calculating structure ensembles that are apparently consistent with the PRE measurements. However, r⁻⁶-weighted average distances are dominated by contributions from compact ensemble members. In practice, only one or a few members need to satisfy the distance restraints, and the rest of the ensemble can be minimally restrained by the PRE pseudo-energy potential. As a consequence, the calculated structure ensembles will appear artificially heterogeneous. The nature of the heterogeneous ensemble is mainly determined by the protein model employed in the structure calculation and reflects little of the true nature of the disordered state. These considerations are confirmed by numerical experiments with theoretical PRE distances computed from simulated folded, partially folded and unfolded states of a model protein, the use of which allows direct comparison between the true and back-calculated ensembles. Results from these numerical experiments demonstrate that PRE measurements alone do not, in general, provide sufficient constraints for an unambiguous structural interpretation of disordered protein states, and that the reliability of any ensemble-based structure calculation hinges on the protein model (and conformational generation technique) employed. The latter approach, even though intended as a structure calculation, should actually be viewed as a structure prediction, given the minimal effects of the restraint potentials on the majority of the ensemble members. These intrinsic limitations also suggest that satisfaction of PRE distances on the ensemble level does not provide a sufficient validation of calculated structure ensembles for disordered protein states, and validation based on agreement of PRE distances alone can be dangerously misleading.

Fig. 7. Probability distributions of radius of gyration, end-to-end distance, and a representative long-range contact distance (A26–F52) calculated from the four sets of theoretical PRE distances for the unfolded state. These data sets were derived from 4, 8 and 16 selected sites of spin labeling shown in Fig. 2 and a maximum set where all residues are spin-labeled. Results using four ensemble sizes (Nrep = 1, 8, 32 and 128) are shown, in comparison with those computed from the true underlying structure ensemble and the equilibrium ensemble of the unrestrained self-avoiding polymer at 300 K.

Numerical experiments with theoretical PRE distances also suggest a possible scenario where a very large number of PRE measurements (e.g., with at least one-quarter of the residues throughout the protein chosen for spin labeling) might be coupled with the use of an intermediate ensemble size (e.g., Nrep ∼ 24 to 32) to faithfully recover the true disordered structure ensemble. Under these conditions, the ratio of PRE distances per residue per replica is sufficiently large to suppress the occurrence of unrestrained replicas during ensemble simulations. This observation might provide useful guidelines for interpretation of PRE measurements on an unknown disordered protein state. However, it needs to be stressed that it has not been established how sensitively this stable interpretation regime depends on possible experimental errors, flexibility and motions of spin labels, nonuniform distribution of labeling sites and the nature of the actual disordered state. In practice, dependence of the calculated structure ensembles on the ensemble size might provide a useful hint. With an insufficient number of PRE measurements, the calculated structure ensemble is expected to evolve continuously with larger Nrep until it reaches the limiting behavior, as determined by the protein model employed. In contrast, with a large number of PRE measurements, the calculated ensemble might stabilize at intermediate ensemble sizes before further evolving to the limiting behavior (see Fig. S5). Nonetheless, additional information about the disordered state of interest, such as RDC and Rg measurements, should also be used to provide important independent validation. Complete cross-validation can be particularly important for avoiding overfitting when a large ensemble size is used to interpret the data.
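The dominance of compact members in an r⁻⁶-weighted average can be illustrated with a few lines of Python. This is only a minimal sketch: the function names and the toy ensemble (one member at 8 Å, 31 members at 40 Å) are hypothetical choices made for illustration and are not data from this work.

```python
def r6_average(distances):
    """Return the r^-6-weighted average distance, <r^-6>^(-1/6)."""
    mean_inv6 = sum(r ** -6 for r in distances) / len(distances)
    return mean_inv6 ** (-1.0 / 6.0)


def r6_weights(distances):
    """Return the fraction of <r^-6> contributed by each ensemble member."""
    inv6 = [r ** -6 for r in distances]
    total = sum(inv6)
    return [w / total for w in inv6]


# Hypothetical toy ensemble (an assumption for illustration only):
# one compact member at 8 A and 31 expanded members at 40 A.
ensemble = [8.0] + [40.0] * 31

avg = r6_average(ensemble)
weights = r6_weights(ensemble)
linear_mean = sum(ensemble) / len(ensemble)

print(f"linear mean distance   = {linear_mean:.1f} A")
print(f"r^-6 average distance  = {avg:.1f} A")
print(f"weight of compact copy = {weights[0]:.4f}")
```

In this toy case the r⁻⁶ average comes out near 14 Å although the linear mean is 39 Å, and the single compact member carries more than 99% of the weight. The remaining 31 members could adopt almost any expanded conformation without noticeably changing the averaged distance.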

Methods

Generation of folded, partially folded and unfolded structure ensembles

A "sequence-flavored" Gō-like model was first generated based on the 3D atomic structure of the 56-residue protein GB1 (Protein Data Bank code 3GB1³²) using the Multiscale Modeling Tools for Structural Biology (MMTSB) Go-Model Builder³⁴,³⁵†. This Gō-like model incorporates sequence-dependent Cα-based residue–residue interactions and includes additional knowledge-based, sequence-dependent (but native-structure-independent) pseudo-torsional potentials. It has been shown to provide an improved description of the folding mechanism and kinetics.³⁶ A replica exchange molecular dynamics (REX-MD) simulation was then carried out with the MMTSB Toolset³⁷ and CHARMM.³⁸ Eight replicas at 270, 300, 340, 360, 400, 440, 500 and 600 K were used. Exchanges of simulation temperatures were attempted every 10 ps, and the total length of the REX-MD simulation was 100 ns. All structural properties examined, including the averaged contact maps and the distributions of radius of gyration and pairwise distances, converged within 10 ns at all temperatures (data not shown). The conformations sampled during the last 50 ns were used to construct the structure ensembles at different temperatures, each of which contains 50,000 snapshots. Based on the degree of structural disorder, the ensembles sampled at 270, 400 and 600 K were chosen to represent the folded, partially folded and unfolded states of protein GB1, respectively.

Two Cα-only polymer models for structural calculations

Two Cα-only protein models were used in the structural calculations presented here. The first one is analogous to the self-avoiding polymer model previously used by Lindorff-Larsen et al.²⁷ The model was generated by simply deleting all native dispersion interactions in the Gō-like model described above.³⁴,³⁵ In addition to the hard-sphere repulsion, Cα beads were subjected to the native-structure-independent pseudo-torsional potentials. We also used a second polymer model (termed the NP polymer model), in which each Cα bead was assigned to be either nonpolar (N) or polar (P) depending on the nature of the corresponding residue in the original protein GB1 sequence. N-type beads interact with their own kind through the Lennard–Jones potential with an energy minimum ɛ of 0.5 kcal/mol. All other interactions (P–P and N–P) are purely repulsive. The equilibrium bond distance between two beads in the NP polymer model is fixed at 4.7 Å. As shown in Supplementary Fig. S2, these two polymer models do not give rise to significant intrinsic long-range structural features at 270 K or higher temperatures, even though the NP polymer model is slightly more compact and has a stronger temperature dependence due to the weak dispersion interactions between N-type particles.

Simulated-annealing restrained ensemble simulations

All structural calculations were done using a simulated-annealing restrained ensemble MD simulation protocol. Briefly, an ensemble of Nrep noninteracting copies of the system is simulated simultaneously using the replica facility in CHARMM.³⁸ All calculations are initiated from fully extended conformations. The PRE distance restraints were imposed on the ensemble level based on Eqs. (2) and (3), with kPRE = 1.0 kcal/mol/Å² and rbound = 5 Å. During annealing, the temperature is first slowly reduced from 1000 to 300 K over the course of 700 ps and then held at 300 K for another 300 ps, followed by 1000 steps of energy minimization with kPRE gradually increased to 75 kcal/mol/Å². The MD time step was 5 fs; larger ones might occasionally lead to unstable trajectories during annealing. In the current work, various ensemble sizes were used, including Nrep = 1, 2, 8, 32, 64 and 128. As demonstrated in Supplementary Fig. S3, this simple simulated-annealing protocol is highly efficient and can reliably generate structure ensembles that completely satisfy the PRE distance restraints with a virtually 100% success rate. In this work, at least 10,000 conformations were generated for each combination of theoretical PRE data set and ensemble size. Note that smaller numbers of simulated-annealing runs are required to generate the same total number of conformations with larger ensemble sizes.

† http://www.mmtsb.org
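To make the ensemble-level restraint more concrete, the following Python sketch evaluates a restraint energy and its per-replica gradient for a single spin-label/nucleus pair. Eqs. (2) and (3) are not reproduced in this section, so the functional form used here is an assumption: an r⁻⁶-averaged ensemble distance penalized by a flat-bottom harmonic term of half-width rbound and force constant kPRE, with an assumed prefactor of 1/2. The function names and the example distances are likewise hypothetical.

```python
def ensemble_distance(distances):
    """r^-6-averaged distance over the N replicas: (1/N * sum r_i^-6)^(-1/6)."""
    mean_inv6 = sum(r ** -6 for r in distances) / len(distances)
    return mean_inv6 ** (-1.0 / 6.0)


def pre_restraint_energy(distances, r_target, k_pre=1.0, r_bound=5.0):
    """Assumed flat-bottom harmonic penalty on the ensemble-averaged distance.

    No penalty while |R - r_target| <= r_bound; harmonic beyond that.
    Units: distances in Angstrom, k_pre in kcal/mol/A^2.
    """
    excess = abs(ensemble_distance(distances) - r_target) - r_bound
    return 0.5 * k_pre * excess ** 2 if excess > 0.0 else 0.0


def pre_restraint_gradient(distances, r_target, k_pre=1.0, r_bound=5.0):
    """dE/dr_i for every replica, using dR/dr_i = R^7 / (N * r_i^7)."""
    n = len(distances)
    r_avg = ensemble_distance(distances)
    diff = r_avg - r_target
    excess = abs(diff) - r_bound
    if excess <= 0.0:
        return [0.0] * n  # restraint satisfied: no force on any replica
    sign = 1.0 if diff > 0.0 else -1.0
    de_dr_avg = k_pre * excess * sign
    return [de_dr_avg * r_avg ** 7 / (n * r ** 7) for r in distances]


# Hypothetical 32-replica examples for a target distance of 15 A.
all_expanded = [40.0] * 32
one_compact = [8.0] + [40.0] * 31
partly_compact = [20.0] + [40.0] * 31

print("E (all expanded)      =", pre_restraint_energy(all_expanded, 15.0))
print("E (one compact copy)  =", pre_restraint_energy(one_compact, 15.0))
grad = pre_restraint_gradient(partly_compact, 15.0)
print("gradient ratio 20 A vs 40 A copy =", grad[0] / grad[1])
```

Under these assumptions, a single compact replica is enough to bring the restraint energy to zero, and while the restraint is still violated the gradient on a replica at 20 Å is 2⁷ = 128 times larger than on a replica at 40 Å. The restraint therefore pulls mainly on the copies that are already closest and leaves expanded copies nearly unrestrained.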

Acknowledgements

J.C. acknowledges helpful discussions with Peter E. Wright, Charles L. Brooks III and Daniel J. Felitsky when he was at the Scripps Research Institute, which led to the initiation of the current work. This work is contribution 09-314-J from the Kansas Agricultural Experiment Station.

Supplementary Data

Supplementary data associated with this article can be found, in the online version, at doi:10.1016/j.jmb.2009.05.019

References

1. Wright, P. E. & Dyson, H. J. (1999). Intrinsically unstructured proteins: re-assessing the protein structure–function paradigm. J. Mol. Biol. 293, 321–331.
2. Dyson, H. J. & Wright, P. E. (2005). Intrinsically unstructured proteins and their functions. Nat. Rev. Mol. Cell Biol. 6, 197–208.
3. Dunker, A. K., Brown, C. J., Lawson, J. D., Iakoucheva, L. M. & Obradovic, Z. (2002). Intrinsic disorder and protein function. Biochemistry, 41, 6573–6582.
4. Dyson, H. J. & Wright, P. E. (2004). Unfolded proteins and protein folding studied by NMR. Chem. Rev. 104, 3607–3622.
5. Mittag, T. & Forman-Kay, J. D. (2007). Atomic-level characterization of disordered protein ensembles. Curr. Opin. Struct. Biol. 17, 3–14.
6. Fuxreiter, M., Simon, I., Friedrich, P. & Tompa, P. (2004). Preformed structural elements feature in partner recognition by intrinsically unstructured proteins. J. Mol. Biol. 338, 1015–1026.
7. Receveur-Brechot, V., Bourhis, J. M., Uversky, V. N., Canard, B. & Longhi, S. (2006). Assessing protein disorder and induced folding. Proteins, 62, 24–45.
8. Shortle, D. R. (1996). Structural analysis of non-native states of proteins by NMR methods. Curr. Opin. Struct. Biol. 6, 24–30.
9. Wishart, D. S. & Sykes, B. D. (1994). The C-13 chemical-shift index: a simple method for the identification of protein secondary structure using C-13 chemical-shift data. J. Biomol. NMR, 4, 171–180.
10. Shortle, D. & Ackerman, M. S. (2001). Persistence of native-like topology in a denatured protein in 8 M urea. Science, 293, 487–489.
11. Gillespie, J. R. & Shortle, D. (1997). Characterization of long-range structure in the denatured state of staphylococcal nuclease. 1. Paramagnetic relaxation enhancement by nitroxide spin labels. J. Mol. Biol. 268, 158–169.
12. Gillespie, J. R. & Shortle, D. (1997). Characterization of long-range structure in the denatured state of staphylococcal nuclease. 2. Distance restraints from paramagnetic relaxation and calculation of an ensemble of structures. J. Mol. Biol. 268, 170–184.
13. Clore, G. M., Tang, C. & Iwahara, J. (2007). Elucidating transient macromolecular interactions using paramagnetic relaxation enhancement. Curr. Opin. Struct. Biol. 17, 603–616.
14. Bloembergen, N. & Morgan, L. O. (1961). Proton relaxation times in paramagnetic solutions. Effects of electron spin relaxation. J. Chem. Phys. 34, 842–850.
15. Yi, Q., Scalley-Kim, M. L., Alm, E. J. & Baker, D. (2000). NMR characterization of residual structure in the denatured state of protein L. J. Mol. Biol. 299, 1341–1351.
16. Lietzow, M. A., Jamin, M., Dyson, H. J. & Wright, P. E. (2002). Mapping long-range contacts in a highly unfolded protein. J. Mol. Biol. 322, 655–662.
17. Battiste, J. L. & Wagner, G. (2000). Utilization of site-directed spin labeling and high-resolution heteronuclear nuclear magnetic resonance for global fold determination of large proteins with limited nuclear Overhauser effect data. Biochemistry, 39, 5355–5365.
18. Gaponenko, V., Howarth, J. W., Columbus, L., Gasmi-Seabrook, G., Yuan, J., Hubbell, W. L. & Rosevear, P. R. (2000). Protein global fold determination using site-directed spin and isotope labeling. Protein Sci. 9, 302–309.
19. Gillespie, J. R. & Shortle, D. (1997). Characterization of long-range structure in the denatured state of staphylococcal nuclease. II. Distance restraints from paramagnetic relaxation and calculation of an ensemble of structures. J. Mol. Biol. 268, 170–184.
20. Bertoncini, C. W., Jung, Y. S., Fernandez, C. O., Hoyer, W., Griesinger, C., Jovin, T. M. & Zweckstetter, M. (2005). Release of long-range tertiary interactions potentiates aggregation of natively unstructured α-synuclein. Proc. Natl Acad. Sci. USA, 102, 1430–1435.
21. Song, J., Guo, L. W., Muradov, H., Artemyev, N. O., Ruoho, A. E. & Markley, J. L. (2008). Intrinsically disordered gamma-subunit of cGMP phosphodiesterase encodes functionally relevant transient secondary and tertiary structure. Proc. Natl Acad. Sci. USA, 105, 1505–1510.
22. Lowry, D. F., Stancik, A., Shrestha, R. M. & Daughdrill, G. W. (2008). Modeling the accessible conformations of the intrinsically unstructured transactivation domain of p53. Proteins, 71, 587–598.
23. Iwahara, J., Schwieters, C. D. & Clore, G. M. (2004). Ensemble approach for NMR structure refinement against H-1 paramagnetic relaxation enhancement data arising from a flexible paramagnetic group attached to a macromolecule. J. Am. Chem. Soc. 126, 5879–5896.
24. Tang, C., Schwieters, C. D. & Clore, G. M. (2007). Open-to-closed transition in apo maltose-binding protein observed by paramagnetic NMR. Nature, 449, 1078–1082.
25. Felitsky, D. J., Lietzow, M. A., Dyson, H. J. & Wright, P. E. (2008). Modeling transient collapsed states of an unfolded protein to provide insights into early folding events. Proc. Natl Acad. Sci. USA, 105, 6278–6283.
26. Marsh, J. A., Neale, C., Jack, F. E., Choy, W. Y., Lee, A. Y., Crowhurst, K. A. & Forman-Kay, J. D. (2007). Improved structural characterizations of the drkN SH3 domain unfolded state suggest a compact ensemble with native-like and non-native structure. J. Mol. Biol. 367, 1494–1510.
27. Lindorff-Larsen, K., Kristjansdottir, S., Teilum, K., Fieber, W., Dobson, C. M., Poulsen, F. M. & Vendruscolo, M. (2004). Determination of an ensemble of structures representing the denatured state of the bovine acyl-coenzyme A binding protein. J. Am. Chem. Soc. 126, 3291–3299.
28. Dedmon, M. M., Lindorff-Larsen, K., Christodoulou, J., Vendruscolo, M. & Dobson, C. M. (2005). Mapping long-range interactions in α-synuclein using spin-label NMR and ensemble molecular dynamics simulations. J. Am. Chem. Soc. 127, 476–477.
29. Kristjansdottir, S., Lindorff-Larsen, K., Fieber, W., Dobson, C. M., Vendruscolo, M. & Poulsen, F. M. (2005). Formation of native and non-native interactions in ensembles of denatured ACBP molecules from paramagnetic relaxation enhancement studies. J. Mol. Biol. 347, 1053–1062.
30. Francis, C. J., Lindorff-Larsen, K., Best, R. B. & Vendruscolo, M. (2006). Characterization of the residual structure in the unfolded state of the Delta 131 Delta fragment of staphylococcal nuclease. Proteins, 65, 145–152.
31. Calloni, G., Lendel, C., Campioni, S., Giannini, S., Gliozzi, A., Relini, A. et al. (2008). Structure and dynamics of a partially folded protein are decoupled from its mechanism of aggregation. J. Am. Chem. Soc. 130, 13040–13050.
32. Gronenborn, A. M., Filpula, D. R., Essig, N. Z., Achari, A., Whitlow, M., Wingfield, P. T. & Clore, G. M. (1991). A novel, highly stable fold of the immunoglobulin binding domain of streptococcal protein G. Science, 253, 657–661.
33. Wuthrich, K. (1986). NMR of Proteins and Nucleic Acids (Baker Lecture Series). Wiley-Interscience.
34. Karanicolas, J. & Brooks, C. L. (2002). The origins of asymmetry in the folding transition states of protein L and protein G. Protein Sci. 11, 2351–2361.
35. Karanicolas, J. & Brooks, C. L. (2003). Improved Gō-like models demonstrate the robustness of protein folding mechanisms towards non-native interactions. J. Mol. Biol. 334, 309–325.
36. Karanicolas, J. & Brooks, C. L. (2004). Integrating folding kinetics and protein function: biphasic kinetics and dual binding specificity in a WW domain. Proc. Natl Acad. Sci. USA, 101, 3432–3437.
37. Feig, M., Karanicolas, J. & Brooks, C. L., III (2004). MMTSB Tool Set: enhanced sampling and multiscale modeling methods for applications in structural biology. J. Mol. Graphics Model. 22, 377–395.
38. Brooks, B. R., Bruccoleri, R. E., Olafson, B. D., States, D. J., Swaminathan, S. & Karplus, M. (1983). CHARMM: a program for macromolecular energy, minimization, and dynamics calculations. J. Comput. Chem. 4, 187–217.