Mapping Local Conformational Landscapes of Proteins in Solution

Mapping Local Conformational Landscapes of Proteins in Solution

Resource Mapping Local Conformational Landscapes of Proteins in Solution Graphical Abstract Authors Mohammad ElGamacy, Michael Riss, Hongbo Zhu, Vin...

6MB Sizes 0 Downloads 36 Views

Resource

Mapping Local Conformational Landscapes of Proteins in Solution Graphical Abstract

Authors Mohammad ElGamacy, Michael Riss, Hongbo Zhu, Vincent Truffault, Murray Coles

Correspondence [email protected]

In Brief ElGamacy et al. introduce an approach to elucidating protein structure and dynamics by NMR spectroscopy. The method deploys spectral decomposition and an R factor quantifying agreement between structural models and input spectra. The resulting structures emphasize details of conformational diversity, central to understanding mechanisms such as molecular recognition, signaling and catalysis.

Highlights d

A sensitive measure of the accuracy of NMR structures of proteins

d

Assessing conformational diversity in proteins

d

Structure determination de novo without NOESY assignments or restraints

ElGamacy et al., 2019, Structure 27, 853–865 May 7, 2019 ª 2019 Elsevier Ltd. https://doi.org/10.1016/j.str.2019.03.005

Structure

Resource Mapping Local Conformational Landscapes of Proteins in Solution Mohammad ElGamacy,1 Michael Riss,2 Hongbo Zhu,1 Vincent Truffault,1 and Murray Coles1,3,*

€bingen, Germany of Protein Evolution, Max Planck Institute for Developmental Biology, Max-Planck-Ring 5, 72076 Tu of Informatics, Technical University of Munich, Boltzmannstrasse 3, 85748 Garching, Germany 3Lead Contact *Correspondence: [email protected] https://doi.org/10.1016/j.str.2019.03.005 1Department 2Department

SUMMARY

The ability of proteins to adopt multiple conformational states is essential to their function, and elucidating the details of such diversity under physiological conditions has been a major challenge. Here we present a generalized method for mapping protein population landscapes by NMR spectroscopy. Experimental NOESY spectra are directly compared with a set of expectation spectra back-calculated across an arbitrary conformational space. Signal decomposition of the experimental spectrum then directly yields the relative populations of local conformational microstates. In this way, averaged descriptions of conformation can be eliminated. As the method quantitatively compares experimental and expectation spectra, it inherently delivers an R factor expressing how well structural models explain the input data. We demonstrate that our method extracts sufficient information from a single 3D NOESY experiment to perform initial model building, refinement, and validation, thus offering a complete de novo structure determination protocol.

INTRODUCTION Proteins are inherently flexible molecules and undergo internal motions over a wide range of distance and timescales. In doing so they traverse a set of conformational microstates that are accessible either thermally or via free energy dissipation. Characterizing this landscape is vital in understanding a wide range of protein functions where conformational change is an underlying mechanism, including molecular recognition, enzyme catalysis, and allosteric signaling. NMR spectroscopy is uniquely placed to provide this data, offering atomic-level detail on samples in native-like physical states and at biological temperatures. For this reason, accurately representing conformational diversity in proteins has long been a goal in the field (Bonvin and Brunger, 1996; Burgi et al., 2001; Lindorff-Larsen et al., 2005; Richter et al., 2007). NMR data are time averaged over the course of observation periods in the experiment, a phenomenon that is routinely exploited to localize and characterize the timescales of internal

dynamics (Lange et al., 2008). However, the data are also ensemble averaged over all molecules in the sample volume and should provide information on the nature and population of the underlying conformational microstates. Attempts have been made over many years to exploit this fact to accurately represent the structure and dynamics of proteins simultaneously. These have been centered on ensemble representations, where the population of a conformer in the ensemble should reflect that of the protein in solution––i.e., the ‘‘Boltzmann ensemble’’ of thermally accessible states (Lindorff-Larsen et al., 2005). The computational approach has been to apply NMR-derived data as restraints in molecular dynamics simulations, requiring only that the ensemble as a whole, rather than any individual member, fulfills the restraints (Best and Vendruscolo, 2004; Bonvin et al., 1994; Clore and Schwieters, 2004, 2006; Fenwick et al., 2010; Lindorff-Larsen et al., 2005; Richter et al., 2007). However, these methods have all relied on proxies derived as averages from the input data––average distances and dihedral angles––rather than the data themselves, and the information obtained from these proxies has been insufficient to precisely define conformer populations. Early estimates indicated that the limit of precision (under-restraining) was reached with as few as two models (Bonvin and Brunger, 1996), a limit respected in most recent implementations (Fenwick et al., 2011; Richter et al., 2007). Here we adopt a markedly different approach based on spectral decomposition, reasoning that maximal structural information is extracted when a structural model recapitulates the input spectrum. We demonstrate that routinely collected spectra contain sufficient information to define the population of local conformers at considerably higher precision than has previously been possible. The underlying principle of this work is that NMR spectra can be approximated as a linear combination of the spectra of the contributing conformers, weighted by their population. Systematic back-calculation of expectation spectra across a conformational space should then allow reconstruction of the experimental spectra to recover these weighting factors. In NMR spectroscopy, the richest source of structural data are nuclear Overhauser effect spectroscopy (NOESY) spectra, which report on inter-proton distances within a detection limit of 5–6 A˚. Due to this short spatial range, a large fraction of NOESY intensity for proteins can be explained within short, linear sequence fragments. Each such fragment thus represents a subspace that could be searched systematically to provide detailed information on local dihedral angles and their distributions. Moreover, comparison of the back-calculated and experimental data would provide a quantitative quality measure: an NMR R factor. Structure 27, 853–865, May 7, 2019 ª 2019 Elsevier Ltd. 853

Figure 1. NOESY Back-Calculation in the CoMAND Method (A) Comparisons are shown for three different proteins, demonstrating the quality of back-calculation of CNH-NOESY spectra. The experimental spectra are shown in blue and the back-calculated spectra in green, while the residual signal is in red. The comparisons are for selected residues in example proteins used in (legend continued on next page)

854 Structure 27, 853–865, May 7, 2019

A difficulty in realizing this approach lies in the nature of NOESY data itself. The information content of NOESY spectra is very unevenly distributed across the observed intensities, and small, informative peaks can thus be overwhelmed by inaccuracies in back-calculation and spectral artifacts. For this reason, quantitative comparison of back-calculated and experimental spectra has been far less applicable in NMR structure determination than equivalent measures used in crystallography (Brunger et al., 1993; Huang et al., 2012; Spronk et al., 2004). Here we show that these obstacles can be largely averted in the 3D CNH-NOESY experiment (Diercks et al., 1999). This is an implementation of a 13C-heteronuclear single quantum coherence (HSQC)NOESY-15N-HSQC in which the indirect proton dimension has been omitted, thus displaying contacts to backbone amide protons in a well-resolved 13C dimension. It exploits the higher dispersion and more homogeneous effective linewidths of the heteronuclei, while suppressing water-exchange cross-peaks and obviating the need for stereospecific proton assignments. Crucially, it intrinsically lacks large, uninformative diagonal peaks and the associated baseline and truncation artifacts. Combined, these represent decisive advantages in the accuracy of backcalculation. Here we demonstrate that a single 3D CNH-NOESY spectrum contains sufficient information to map the population distributions of local dihedral subspaces. This conformational mapping provides highly detailed data for model building, with progress monitored by a quantitative R factor. In this way, both the reliance on a knowledge-base and the interpretation of spectra in terms of peak picking, assignment lists, or distance restraints can be eliminated. Moreover, refined ensembles that best reproduce the experimental spectra can be compiled from unbiased molecular dynamics trajectories. We validate this method against human ubiquitin (hUb), widely considered the gold standard for NMRbased protein structure determination (Fenwick et al., 2011; Lindorff-Larsen et al., 2016; Maltsev et al., 2014; Richter et al., 2007). We further demonstrate the generality of the method, applying it to solve the structures of four further example proteins. RESULTS An R Factor from CNH-NOESY Data We have adapted existing routines to back-calculate CNHNOESY spectra for ensembles of structures, where the output is a 1D 13C strip displaying contacts for a given amide proton. These are compiled for each backbone amide proton in a protein and compared directly with equivalent strips extracted from the experimental 3D matrix. An R factor expressing the discrepancy between experimental and expectation spectra is readily calculated as the fractional root-mean-squared residual (STAR Methods). This R factor is analogous to its crystallographic counterpart, except that it is calculated on a per-residue basis. Our back-calculation routines very accurately reproduce the intensities and line-

shapes of experimental CNH-NOESY data collected for our set of examples (Figure 1). The back-calculated spectra are also highly sensitive to local backbone and side-chain dihedral angles (Figures 2 and S1–S4), a prerequisite for conformational mapping. A fundamental concept in this work is that R factors should improve with an accurate and comprehensive description of conformational diversity. To test this, we use reference data for hUb, choosing three ensembles based on full sets of NOESY and other NMR-derived data that have been compiled according to different metrics. Two have employed ensemble-averaging protocols to elucidate internal motions. The first (PDB: 2NR2) was calculated under the minimal under-restraining, minimal over-restraining protocol (Richter et al., 2007), whereby all 128 structures are averaged to comply with backbone S2NH-order parameters derived from relaxation data (dynamic ensemble refinement [DER]), while distance restraints are average over pairwise subsets. The second (PDB: 2KOX) averaged distance data via similar pairwise subsets, but included a large set of residual dipolar coupling (RDC) data, averaged over blocks of 64 structures (Fenwick et al., 2011). The third (PDB: 2MJB) provides a static control ensemble close to the average structure (Maltsev et al., 2014). High R factors are obtained where these ensembles either over- or underestimate conformational diversity. Moreover, the use of a residue-wise R factor in evaluating the ensemble localizes such diversity (Figure 1B). Mapping Local Conformational Spaces In the CNH-NOESY, the vast majority of cross-peak intensity can be explained by intra-residue contacts and those to the immediately preceding residue. The R factor for residue i is thus strongly dependent on conformation in a shifted Ramachandran space defined by the backbone dihedral angles ci-1 (here denoted yi) and 4i. This is extended by including the relevant side-chain rotamers up to c1i and c2i, representing a periodic space within a dipeptide fragment that can be searched exhaustively (Figure 3A). The back-calculated spectra for these systematically sampled conformers constitute a ‘‘features set’’ that can be used to decompose the experimental spectra (Figure 3B). Calculating the respective population weights for each conformer is analogous to parts-based representations of complex spectral mixtures often encountered in other fields of spectroscopy (Xu et al., 2006). Decomposition of the experimental spectrum can be framed as a positive matrix factorization problem (Paatero and Tapper, 1994). The features set is represented by the matrix W, comprising backcalculated spectra for l conformers, resolved to m points along the 13 C dimension. A solution can thus be found for a vector of weights H to reconstruct the observed experimental spectrum V: 3n m31 l31 VzWH; V˛Rm R0 ; W˛RR0 ; H˛RR0 ;

where n = 1 in this case, as one spectrum is considered per-residue. To map the energetic landscape of the conformational

this study; two for human ubiquitin (hUb) and one each for KH-S1 and MlbQ. The comparisons are for ensembles compiled here to minimize R factors. In each case, the R factor is limited by the noise level of the experimental spectrum. (B) R factors plotted across the sequence for three literature ensembles for hUb. These ensembles have been compiled to emphasize different aspects of the hUb structure: PDB: 2KOX (Fenwick et al., 2011) and PDB: 2NR2 use ensemble averaging to elucidate internal motions and PDB: 2MJB to represents a static average pose. The average structures for all three ensembles are very similar and differences in R factors are therefore attributable to the different representations of conformational diversity (see also Figure S6). These are compared with the CoMAND ensemble (green line).

Structure 27, 853–865, May 7, 2019 855

Figure 2. R Factors Are Highly Sensitive to Local Conformation Literature ensembles for hUb over- and underestimate conformational diversity. Back-calculated CNH-NOESY strips for K48 are shown for the three reference ensembles and an ensemble calculated here to minimize the R factor. The respective ensembles and their side-chain rotamer distributions (c1/c2) are shown on the left. The resulting R factor comparisons are shown on the right, displayed as in Figure 1A. Neither the highly focused PDB: 2MJB ensemble (Maltsev et al., 2014) nor the more diverse PDB: 2NR2 (Richter et al., 2007) and PDB: 2KOX ensembles explain the NOESY data well. The R factors for these comparisons demonstrate their sensitivity to accurate representation of conformational diversity. The distribution obtained for K48 closely matches that for all lysine residues in the MolProbity database (Chen et al., 2010). Three further examples of this type are provided in Figure S2.

856 Structure 27, 853–865, May 7, 2019

Figure 3. Conformational Mapping in the CoMAND Method (A) The definition of the conformational search space. Note that in the shifted Ramachandran space, yi is equivalent to c(i-1). (B) The features matrix for L67 in hUb shown as a stacked plot where each row is a spectrum back-calculated via systematic conformational sampling, resulting in periodic intensity patterns. The order of sampling, from fastest to slowest, is c2, c1, y, and 4, with 10 steps for backbone and 120 steps for side-chain angles. The intensity of each peak in the spectrum displays a different dependency on the dihedrals, underlining the power of the data to discriminate individual conformations. The projection of this plot—i.e., all members of the features set overlaid—is shown above with individual peaks assigned. (C) Decomposing overlapped spectra. The top panel shows all members of the concatenated features set for Q2 (orange) and I30 (purple) in hUb. Twocomponent factorization successfully decomposes the completely overlapped experimental spectra, yielding the correct conformations of the respective residues (middle panel).

space, the estimates of the population weights obtained can in turn be expressed via a Boltzmann factor relative to a reference conformer (STAR Methods). Examples of these conformational maps for hUb are shown in Figure 4. An inherent question in factorization methods is the uniqueness of the solution. Here uniqueness is limited by cross-peaks that cannot be explained within the dipeptide space, but overlap with peak positions in the features set. The larger the fraction of such contamination, the more difficult finding a unique solution becomes. Given the low intensity of potentially contaminating peaks and the high dispersion of the 13C dimension, the extent of contamination is usually minor. The extreme case of contamination is the coincidence of 15N-HSQC positions for two or more residues, resulting in overlap of the experimental strips. Such a situation can be solved by concatenation of the features sets of the overlapped residues and solving in a multiple-dipeptide space. Figure 3C shows a typical example of this situation, where conformational maps have been obtained for two overlapped residues. Building Structural Models from Conformational Maps The conformational maps obtained from spectral decomposition provide rich information for structure determination. At the

simplest level, global minima can provide local torsion angles sufficient for model building. These initial dihedrals constitute an agnostic starting point, as they are derived directly from the data without recourse to heuristics or conformational databases. A unique feature of this method is the deployment of R factors as an objective convergence test that captures both local and longrange contacts. The latter can be isolated by examining the difference between R factors obtained for a linear peptide fragment and those from the full, folded model. We term this measure the fold factor (F), and it should be negative if the model explains long-range contacts well (STAR Methods). Note that positive fold factors are obtained when the folded model either underor overestimates peak intensities in the experimental spectrum. Figure 5 shows that the average fold factor (Fmean) is a sensitive overall measure of correct folding, while the sequence profile can localize misfolded or poorly defined regions. Owing to this objective measure of convergence, models from any source can be evaluated, independent of the model-building routine. To build initial models, we have employed two different approaches. The first is a Rosetta-based protocol, where backbone dihedral angles extracted from the factorization minima are used directly as input. The second is a purpose-designed molecular dynamics routine: simulated annealing replica seilschaft Structure 27, 853–865, May 7, 2019 857

Figure 4. Conformation Maps from Spectral Decomposition Conformational maps are shown for six examples residues representing different secondary structural contexts in hUb. They are expressed as heatmaps of conformer free energy change, relative to a two-conformer global minimum reference state. For non-glycine residues, a 2D (y, 4) slice through the full 3D or 4D map at the minimum c1/c2 position is shown. The map for G35 displays typical pseudo-symmetry about the 4 = 0 axis due to the achiral nature of glycine residues. In each map, the minima agree very well with the corresponding crystallographic conformations (PDB: 1UBQ; white diamonds).

858 Structure 27, 853–865, May 7, 2019

(SARS). Here, the conformer probability distributions derived from conformational mapping are encoded as a grid-based dihedral energy correction term (CMAP) (Figure 4; STAR Methods). In the standard force field, CMAPs are cross-terms that correlate two dihedrals and are used to confer the conformational tendencies of different amino acid types (Mackerell et al., 2004). Here we override these standard CMAPs with bespoke ones on a per-residue basis, providing an ensemble representation augmented by the experimental data. As examples, we present model building for four structure determination projects from our Institute, in addition to hUb. U3Sfl (125 residues) is a protein designed as a chimera of subdomain-sized fragments, KH-S1 (170 residues) is a fusion construct of the KH and S1 domains of E. coli exosomal polynucleotide phosphorylase. MlbQ (141 residues) is a protein implicated in self-resistance to endogenous lantibiotics in actinomycetes (Pozzi et al., 2016). The final example, polb4 (122 residues), is a protein designed to reconstruct the polymerase beta N-terminal domain using two unrelated peptide fragments (ElGamacy et al., 2018) and is presented here as a de novo structure determination. For U3Sfl, KH-S1, and hUb we applied the Rosetta protocol (STAR Methods). For MlbQ and polb4 we applied SARS, supplementing the CHARMM36 force field with bespoke conformational maps, and starting from completely extended chains. For MlbQ, folding was accelerated by the addition of ten unambiguous nuclear Overhauser effect (NOE) distance restraints. We used the average R factor across the full length of the protein as a criterion for selecting a single model; a single Rosetta decoy or a frame from the SARS runs. These models were very similar to respective reference structures (Figure 6). For hUb the reference is the crystal structure PDB: 1UBQ (backbone root-mean-square deviation [RMSD] = 0.49 A˚). For KH-S1, crystal structures for close homologs of the individual domains are available (PDB: 4AM3; RMSD = 1.48 A˚ and PDB: 4NNG; RMSD = 1.67 A˚). For MlbQ (PDB: 2MVO; RMSD = 1.92 A˚) and U3Sfl (RMSD = 1.98 A˚) the references were solution structures we had solved previously. As polb4 is presented as a de novo structure determination, we use the design target as the reference structure (RMSD = 1.47 A˚). Refining Structural Ensembles To construct ensembles with accurate descriptions of the underlying microstates, we applied a protocol based on unrestrained molecular dynamics simulations of canonical ensembles using the standard force field. We then aggregated sets of frames from these trajectories using the R factor as a selection criterion. This not only provides an unbiased coverage of conformational space, but eliminates potential over-restraining. The starting points for these trajectories were the initial models generated above (STAR Methods). This combination of model building

Figure 5. The R Factor as an Objective Target Function (A) Fold factors plotted across the sequence for hUb. Fold factors (F) are calculated on a per-residue basis as the difference between the R factor calculated for a linear peptide and that calculated for the full structure. This isolates the component of the R factor not explained by local contacts. Negative values indicate residues in well-folded environments. Values are shown for an initial folded model from Rosetta runs, plus a Rosetta structure

misfolded by a strand swap in the N-terminal a hairpin. These are compared with a representative of the final CoMAND ensemble (green line). (B) Comparison of the average fold factor (Fmean) versus the Rosetta score (Rosetta energy units) as selection criteria for well-folded hUb models. Both measures are plotted against the RMSD to the reference structure (PDB: 1UBQ) for the same set of 7,215 decoys with subzero Rosetta score. Structures with low Fmean are consistently close to the reference structure. See also Figure S6.

Structure 27, 853–865, May 7, 2019 859

Figure 6. Initial Model Building with CoMAND Models (orange) generated by either the Rosetta protocol (U3Sfl, KH-S1, and hUb) or the purpose-built SARS simulated annealing protocol (MlbQ and polb4; see also Video S1) are shown superimposed on their respective reference structures (purple). For U3Sfl and MlbQ the reference structures are previously solved solution structures. For KH-S1, the references are crystal structures for close homologs: KH domain PDB: 4AM3 (light purple) and S1 domain 4NNG (dark purple). For hUb the reference structure is the crystal structure (PDB: 1UBQ). The de novo structure determined for polb4 is compared with the design target (ElGamacy et al., 2018).

and refinement therefore represents a complete ‘‘white gloves’’ structure determination protocol, where NOESY spectra are not interpreted in terms of peaks lists, cross-peak assignments, or distance restraints. We name this de novo structure determination protocol CoMAND (for conformational mapping by analytical NOESY decomposition). Given the wealth of structural and dynamics data available for hUb, we compiled such a refined ensemble and compared it with the reference ensembles (PDB: 2KOX, 2MJB, and 2NR2). We conducted unrestrained molecular dynamics simulations in 860 Structure 27, 853–865, May 7, 2019

explicit solvent for microsecond timescales, seeded by the initial model. This was followed by a frame-picking procedure to minimize the average R factor across the ensemble (Figure 7; STAR Methods). Given the combinatorial complexity of this framepicking problem, we deploy a greedy optimization procedure that iteratively scans through the trajectory, adding the frame that most reduces the average R factor of the growing ensemble. The algorithm stops when it has converged; i.e., the R factor no longer decreases. To avoid over-fitting, the maximum ensemble size was capped at 20. The resulting ensemble (Figure 8; backbone RMSD = 0.63 A˚) shows better correlation to experimental NH-order parameters than the literature ensembles (Figure 9). Note that the timescales of motions associated with low-order parameters for hUb range from picoseconds in the C-terminal strand to microseconds in the internal loops (Lindorff-Larsen et al., 2016). Thus, the good match to experimental values across the hUb sequence indicates that the CoMAND ensemble captures motions over this range; i.e., commensurate with the timescale of the production trajectories. It is also comparable in predicting experimental scalar couplings and RDCs to ensembles that have been built on one or more of these observables, plus thousands of NOE restraints (Figures 9 and S5; Table S1). We then compiled ensembles for KH-S1 and MlbQ in a similar manner. These represent the more difficult cases of the remaining examples; KH-S1 as a two-domain protein with the highest molecular weight and MlbQ as the most complex topologically and with the lowest data quality. For the KH-S1 ensemble, the individual domains are well defined (backbone RMSDs: KH, 0.82 A˚; S1, 0.74 A˚), but some variability remains in the interdomain orientation (Figure 8). We compared the results of frame picking to minimize R factors for all residues with frame picking over a subset of inter-domain residues. Both results explained the inter-domain NOESY intensity equally well. The MlbQ ensemble is a difficult case, due to a complex topology including two longer loops. In addition, eight proline residues and several missing and overlapped residues reduce the overall coverage of strips for factorization. For proline residues, the lack of an amide proton means no strip will be observed in a CNHNOESY experiment. We hence adapted factorization routines for proline to observe their conformational space from the amide proton of the previous residue (STAR Methods). We also added CMAPs for eight residues after decomposing four pairs of overlapped residues. In model building using SARS, the calculations converged slowly and the R factor had plateaued toward the end of the simulation. In this case we seeded the refinement dynamics with the last frame of the SARS trajectory. This slow convergence was reflected in the frame picking, which converged after selecting 10 of a possible 20 frames. The ensemble obtained is well defined with a backbone RMSD over secondary structure of 0.77 A˚ (Figure 8). Structure statistics for the three ensembles are presented in Table S2. A central aim of this work is to show that NOESY spectra contain more information than is routinely extracted by conventional analysis, particularly in terms of local conformation. To demonstrate the best case, we applied a stochastic version of the greedy algorithm to perform frame picking from the unrestrained trajectories, in this case optimizing the R factor for single residues (Figures 7, S7, and S8; STAR Methods). This method routinely explains the experimental data down to the noise

Figure 7. Frame Picking in the CoMAND Method The ‘‘greedy’’ frame-picking method selects frames from the unrestrained molecular dynamics trajectory to minimize the R factor. The example used is K48 in human ubiquitin (see also Figures 2, S7, and S8). The top row shows the distribution of backbone and side-chain dihedrals in the molecular dynamics trajectory (10,000 frames), with the R factor comparison on the right. The lower row shows the selected ensemble (20 frames). Some conformations that are well represented in the trajectory, e.g., c1  60 side-chain rotamers, are unrepresented in the optimized ensemble.

level––e.g., the plots in Figure 1 were generated in this way. Applying this routine across all available residues of a protein localizes any residual discrepancies between the output ensemble and the input spectrum. Moreover, the stochastic nature of the routine allows it to be run multiple times for each residue, allowing an estimate of precision (STAR Methods). Even for residues sampling a range of conformations, e.g., hUb K48 (Figure 2), the precision in estimating the uncertainty in conformer populations was typically 5% (Figure S8), justifying the limit on ensemble sizes of 20 frames used in this work. DISCUSSION NOESY spectra can be seen as an encoding of a proton-proton contact map with an approximate upper distance limit of 5 A˚. If correctly decoded as a set of distance restraints, this information is sufficient to solve the structure with high accuracy and precision. However, crowded spectra and the consequent spectral overlap mean that the encoding is ambiguous. Spectral editing, for example, via additional frequency dimensions, can only partially alleviate this problem, often at considerable cost in €ntert, 2003). The consequences of this amexperiment time (Gu

biguity are not only that individual cross-peaks cannot be uniquely assigned––i.e., attributed to a specific proton-proton contact––but also that cross-peaks may comprise significant intensity from several contacts. Conventional NMR structure determination protocols interpret NOESY spectra through peak picking, assignment, and conversion into distance restraints under a paradigm of one peak; one assignment; one restraint. Even automated routines that specifically consider ambiguities will resolve to a single effective restraint per picked peak. This represents a compromise that is not justified by the underlying nature of the data, affecting either the accuracy or precision of distance estimates. In contrast, the CoMAND method makes no interpretation of cross-peaks, and thus stands outside this conventional assignment paradigm. Given the ambiguity of NOESY data, the incorporation of unambiguous data from other sources is advantageous in NMR structure determination protocols. Particularly useful are data that define local dihedral angles (e.g., scalar couplings), as these are poorly defined by imprecise NOE-based distance estimates. Backbone dihedral angle predictions derived from chemical shift heuristics using the program TALOS are very widely used (Shen et al., 2009a). For example, the CS-Rosetta approach exploits Structure 27, 853–865, May 7, 2019 861

Figure 8. Gallery of CoMAND Ensembles Ensembles are shown for KH-S1, MlbQ, and hUb (b strands, orange; a helices, blue), superimposed on the respective reference structures defined in Figure 6 (purple). The ensembles were compiled by frame picking from unrestrained molecular dynamics trajectories, using the overall R factor as a selection criterion. The ensembles for both KH-S1 and hUb were capped at 20 structures, while the frame-picking routine for MlbQ converged with 10 structures.

this data to build structural models within the Rosetta framework (Vernon et al., 2013). In contrast to these methods, we show that spectral decomposition can directly yield backbone dihedrals unambiguously and without heuristics. Moreover, we demonstrate that the NOESY data can map the underlying conformational landscape in a systematic fashion. A further key advantage over existing methods is that the whole process of structure determination, including resonance assignment, model building, refinement, and conformational mixture elucidation, can be objectively assessed by the R factor as a single metric. The objectivity of the R factor opens up a range of methodologies, which become independent of the method used to evaluate the results. For example, we have chosen to use unrestrained molecular dynamics simulations to sample the conformational diversity of the final ensembles. However, accelerated molecular dynamics sampling schemes or Monte Carlo methods would 862 Structure 27, 853–865, May 7, 2019

also be applicable and may have advantages where larger-scale motions are expected, or to test specific motional hypotheses. Conventional NMR structure determination has typically employed simulated annealing routines that produce ensembles of structures, all of which should fit the restraints equally well. Yet these restraints are based on average quantities and the structures thus converge on the average structure. Although these ensembles may achieve high precision–– as exemplified here by the PDB: 2MJB ensemble for hUb––they suppress correlated internal motions and poorly explain dynamics data, such as backbone S2NHorder parameters (Figure 9D). The other two reference structures for hUb used here were calculated using ensemble averaging of restraints to allow for conformational variation, leading to much better accordance with experimental order parameters (note that under the DER protocol the PDB: 2NR2 ensemble was restrained to do so [Lindorff-Larsen et al., 2005; Richter et al., 2007]). However, this improvement in explaining internal dynamics does not lead to a concomitant improvement in R factors; indeed, all three reference ensembles are comparable in this respect (Figure 1B). Thus, even these high-quality ensembles, based on large sets of distance, dihedral and coupling restraints and using careful computational methods, do not describe the conformational microstates sampled by hUb well. The CoMAND method produces an ensemble that is comparable in terms of describing dynamics data and performs well in explaining coupling data not used in compiling it. Yet the ensemble is based on data from a single CNH-NOESY spectrum. This demonstrates the depth of the structural information captured when NOESY spectra are analyzed holistically. In recent years, development of protein NMR methods has been striving toward larger proteins and more difficult cases, such as membrane proteins. These have often employed selectively deuterated samples, where the proton density is diluted to slow the fast proton relaxation associated with

Figure 9. Validation of CoMAND Ensembles (A) Evolution of the average sequence R factor and fold factor along the first replica of the SARS folding trajectory of polb4. The lowest R factor structure (time point of 6.4 ns) was chosen as an initial model (see also Video S1). (B) The RMSD from the native fold along the trajectory of the same SARS folding simulation. Traces for all five replicates are shown, illustrating the convergence of the protocol. (C) The CoMAND ensemble independently reproduces NMR observables. The correlation between residual RDC values back-calculated from the CoMAND ensemble and experimental values in four different alignment media is shown (Maltsev et al., 2014). The Q factor expressing the agreement between prediction and experiment for this dataset is 0.24. Good agreement is also obtained between back-calculated and experiment 3JHNHa coupling constants (correlation coefficient = 0.95; see Figure S3). RDC values report on the orientation of various bond vectors to an external molecular alignment medium and are thus sensitive to both local and global structure. 3JHNHa coupling constants report on local 4 angles. Neither parameter was used in compiling the CoMAND ensemble. (D) Calculated S2NH order parameter values across the sequence of hUb using the first 20 models of the PDB: 2MJB, 2KOX, 2NR2, and CoMAND ensembles. Secondary structures for hUb are shown above the plot. The CoMAND ensemble best reproduces experimental values derived from NMR relaxation analysis (Nederveen and Bonvin, 2005) (correlation coefficient = 0.82). Correlations for the reference ensembles are shown in Figure S5.

such samples. In the present implementation, CoMAND is limited by the use of CNH-NOESY spectra, necessitating uniformly protonated samples. This places an upper limit on the size of proteins and protein complexes on the order of 35 kDa. Nevertheless, CoMAND leverages a small set of spectra on a single sample into a high-resolution structure and is therefore applicable where protein concentration or stability are limiting. As the method involves minimum user intervention after the resonance assignment stage, it is also intrinsically suited to automation. However, the most unique feature of the method lies in the power to obtain accurate descriptions of protein conformational ensembles. We are currently applying the method to protein design projects, where these descriptions provide feedback to calibrate dynamics-based design algorithms. We anticipate that the method can be applied in similar ways to studying ligand binding and allosteric pro-

cesses, promising to elucidate subtle conformational changes in fine detail. STAR+METHODS Detailed methods are provided in the online version of this paper and include the following: d d d

KEY RESOURCES TABLE CONTACT FOR REAGENT AND RESOURCE SHARING METHOD DETAILS B NMR Spectroscopy B NOESY Back-Calculation B Calculation of R-factors B Calculation of Fold Factors B Factorisation and CMAP Construction Structure 27, 853–865, May 7, 2019 863

B

Model Building Using Rosetta SARS Simulations B Unrestrained Molecular Dynamics B Ensemble Building B Validation versus NMR Observables DATA AND SOFTWARE AVAILABILITY B

d

SUPPLEMENTAL INFORMATION Supplemental Information can be found online at https://doi.org/10.1016/j.str. 2019.03.005. ACKNOWLEDGMENTS This work was supported by institutional funds of the Max Planck Society. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. We thank Birte Ho¨cker, Maxime Audin, Remco Sprangers, and their co-workers for providing samples for U3Sfl, KH-S1, and hUb. Programming of the SHINE package was supported by Professors Alois Knoll and Horst Kessler at the Technical University, Mu€bingen) for helpnich. We thank Andre Noll (MPI for Developmental Biology, Tu ful discussions. AUTHOR CONTRIBUTIONS M.G. designed and implemented spectral decomposition routines and the SARS protocol, performed molecular dynamics calculations, and analyzed their output. M.C. implemented back-calculation routines. M.R. programmed Shine. H.Z. performed Rosetta calculations and analysis. V.T. and M.C. acquired and analyzed NMR spectra. M.C. conceived and directed the project. The manuscript was written by M.C. and M.G. and edited by all the authors. DECLARATION OF INTERESTS

Clore, G.M., and Schwieters, C.D. (2004). How much backbone motion in ubiquitin is required to account for dipolar coupling data measured in multiple alignment media as assessed by independent cross-validation? J. Am. Chem. Soc. 126, 2923–2938. Clore, G.M., and Schwieters, C.D. (2006). Concordance of residual dipolar couplings, backbone order parameters and crystallographic B-factors for a small alpha/beta protein: a unified picture of high probability, fast atomic motions in proteins. J. Mol. Biol. 355, 879–886. Cornilescu, G., Marquardt, J.L., Ottiger, M., and Bax, A. (1998). Validation of protein structure from anisotropic carbonyl chemical shifts in a dilute liquid crystalline phase. J. Am. Chem. Soc. 120, 6836–6837. Das, R., and Baker, D. (2008). Macromolecular modeling with Rosetta. Annu. Rev. Biochem. 77, 363–382. Diercks, T., Coles, M., and Kessler, H. (1999). An efficient strategy for assignment of cross-peaks in 3D heteronuclear NOESY experiments. J. Biomol. NMR 15, 177–180. ElGamacy, M., Coles, M., and Lupas, A. (2018). Asymmetric protein design from conserved supersecondary structures. J. Struct. Biol. 204, 380–387. Essmann, U., Perera, L., Berkowitz, M.L., Darden, T., Lee, H., and Pedersen, L.G. (1995). A smooth particle mesh Ewald method. J. Chem. Phys. 103, 8577–8593. Fenwick, R.B., Esteban-Martı´n, S., Richter, B., Lee, D., Walter, K.F.A., Milovanovic, D., Becker, S., Lakomek, N.A., Griesinger, C., and Salvatella, X. (2011). Weak long-range correlated motions in a surface patch of ubiquitin involved in molecular recognition. J. Am. Chem. Soc. 133, 10336–10339. Fenwick, R.B., Esteban-Martin, S., and Salvatella, X. (2010). Refinement of ensembles describing unstructured proteins using NMR residual dipolar couplings. J. Phys. Chem. Lett. 1, 3438–3441. Goddard, T.D. and Kneller, D.G. SPARKY 3 (University of California, San Fransisco). Gront, D., Kulp, D.W., Vernon, R.M., Strauss, C.E., and Baker, D. (2011). Generalized fragment picking in Rosetta: design, protocols and applications. PLoS One 6, e23294.

The authors declare no competing interests.

€ntert, P. (2003). Automated NMR protein structure calculation. Prog. Nucl. Gu Magn. Reson. Spectrosc. 43, 105–125.

Received: May 17, 2018 Revised: October 5, 2018 Accepted: March 7, 2019 Published: March 28, 2019

Huang, J., and MacKerell, A.D. (2013). CHARMM36 all-atom additive protein force field: validation based on comparison to NMR data. J. Comput. Chem. 34, 2135–2145.

SUPPORTING CITATIONS The following references appear in the Supplemental Information: Mulder, 2009; Zhou et al., 1999. REFERENCES Best, R.B., and Vendruscolo, M. (2004). Determination of protein structures consistent with NMR order parameters. J. Am. Chem. Soc. 126, 8090–8091. Bonvin, A.M., Boelens, R., and Kaptein, R. (1994). Time- and ensemble-averaged direct NOE restraints. J. Biomol. NMR 4, 143–149. Bonvin, A.M., and Brunger, A.T. (1996). Do NOE distances contain enough information to assess the relative populations of multi-conformer structures? J. Biomol. NMR 7, 72–76. Brunger, A., Clore, G., Gronenborn, A., Saffrich, R., and Nilges, M. (1993). Assessing the quality of solution nuclear magnetic resonance structures by complete cross-validation. Science 261, 328–331. Burgi, R., Pitera, J., and van Gunsteren, W.F. (2001). Assessing the effect of conformational averaging on the measured values of observables. J. Biomol. NMR 19, 305–320. Chen, V.B., Arendall, W.B., Headd, J.J., Keedy, D.A., Immormino, R.M., Kapral, G.J., Murray, L.W., Richardson, J.S., and Richardson, D.C. (2010). MolProbity: all-atom structure validation for macromolecular crystallography. Acta Crystallogr. D Biol. Crystallogr. 66, 12–21.

864 Structure 27, 853–865, May 7, 2019

Huang, Y.J., Rosato, A., Singh, G., and Montelione, G.T. (2012). RPF: a quality assessment tool for protein NMR structures. Nucleic Acids Res. 40, W542–W546. Lange, O.F., Lakomek, N.A., Fares, C., Schroder, G.F., Walter, K.F., Becker, S., Meiler, J., Grubmuller, H., Griesinger, C., and de Groot, B.L. (2008). Recognition dynamics up to microseconds revealed from an RDC-derived ubiquitin ensemble in solution. Science 320, 1471–1475. Lindorff-Larsen, K., Best, R.B., Depristo, M.A., Dobson, C.M., and Vendruscolo, M. (2005). Simultaneous determination of protein structure and dynamics. Nature 433, 128–132. Lindorff-Larsen, K., Maragakis, P., Piana, S., and Shaw, D.E. (2016). Picosecond to millisecond structural dynamics in human ubiquitin. J. Phys. Chem. B 120, 8313–8320. Mackerell, A.D., Feig, M., and Brooks, C.L. (2004). Extending the treatment of backbone energetics in protein force fields: limitations of gas-phase quantum mechanics in reproducing protein conformational distributions in molecular dynamics simulations. J. Comput. Chem. 25, 1400–1415. Maltsev, A.S., Grishaev, A., Roche, J., Zasloff, M., and Bax, A. (2014). Improved cross validation of a static ubiquitin structure derived from high precision residual dipolar couplings measured in a drug-based liquid crystalline phase. J. Am. Chem. Soc. 136, 3752–3755. Mulder, F.A.A. (2009). Leucine side-chain conformation and dynamics in proteins from 13C NMR chemical shifts. ChemBioChem 10, 1477–1479. Nederveen, A.J., and Bonvin, A.M.J.J. (2005). NMR relaxation and internal dynamics of ubiquitin from a 0.2 ms MD simulation. J. Chem. Theory Comput. 1, 363–374.

Paatero, P., and Tapper, U. (1994). Positive matrix factorization: a non-negative factor model with optimal utilization of error estimates of data values. Environmetrics 5, 111–126.

Vernon, R., Shen, Y., Baker, D., and Lange, O.F. (2013). Improved chemical shift based fragment selection for CS-Rosetta using Rosetta3 fragment picker. J. Biomol. NMR 57, 117–127.

Phillips, J.C., Braun, R., Wang, W., Gumbart, J., Tajkhorshid, E., Villa, E., Chipot, C., Skeel, R.D., Kale´, L., and Schulten, K. (2005). Scalable molecular dynamics with NAMD. J. Comput. Chem. 26, 1781–1802.

Wang, A.C., and Bax, A. (1996). Determination of the backbone dihedral angles phi in human ubiquitin from reparametrized empirical Karplus equations. J. Am. Chem. Soc. 118, 2483–2494.

Pozzi, R., Coles, M., Linke, D., Kulik, A., Nega, M., Wohlleben, W., and Stegmann, E. (2016). Distinct mechanisms contribute to immunity in the lantibiotic NAI-107 producer strain Microbispora ATCC PTA-5024. Environ. Microbiol. 18, 118–132.

Wang, Y.X., and Zhang, Y.J. (2013). Nonnegative matrix factorization: a comprehensive review. IEEE Trans. Knowl. Data Eng. 25, 1336–1353.

Richter, B., Gsponer, J., Va´rnai, P., Salvatella, X., and Vendruscolo, M. (2007). The MUMO (minimal under-restraining minimal over-restraining) method for the determination of native state ensembles of proteins. J. Biomol. NMR 37, 117–135. Shen, Y., Delaglio, F., Cornilescu, G., and Bax, A. (2009a). TALOS+: a hybrid method for predicting protein backbone torsion angles from NMR chemical shifts. J. Biomol. NMR 44, 213–223. Spronk, C.A.E.M., Nabuurs, S.B., Krieger, E., Vriend, G., and Vuister, G.W. (2004). Validation of protein structures derived by NMR spectroscopy. Prog. Nucl. Magn. Reson. Spectrosc. 45, 315–337.

Xu, Q., Sachs, J.R., Wang, T.-C., and Schaefer, W.H. (2006). Quantification and identification of components in solution mixtures from 1D proton NMR spectra using singular value decomposition. Anal. Chem. 78, 7175–7185. Zhou, Y., Vitkup, D., and Karplus, M. (1999). Native proteins are surfacemolten solids: application of the Lindemann criterion for the solid versus liquid state. J. Mol. Biol. 285, 1371–1375. Zhu, L., Dyson, H.J., and Wright, P.E. (1998). A NOESY-HSQC simulation program, SPIRIT. J. Biomol. NMR 11, 17–29. Zweckstetter, M., and Bax, A. (2000). Prediction of sterically induced alignment in a dilute liquid crystalline phase: aid to protein structure determination by NMR. J. Am. Chem. Soc. 122, 3791–3792.

Structure 27, 853–865, May 7, 2019 865

STAR+METHODS KEY RESOURCES TABLE

REAGENT or RESOURCE

SOURCE

IDENTIFIER

Chemicals, Peptides, and Recombinant Proteins 13

This paper

N/A

13

C/15N-labelled Human ubiquitin

Birte Ho¨cker, University of Bayreuth

N/A

13

Remco Sprangers, University of Regensburg

N/A

13

Pozzi et al., 2016

N/A

13

ElGamacy et al., 2018

N/A

C/15N-labelled U3Sfl C/15N-labelled KH-S1 C/15N-labelled MlbQ C/15N-labelled polb4

Deposited Data CoMAND ensemble for human ubiquitin

This paper

PDB: 6QF8

CoMAND ensemble for MlbQ

This paper

PDB: 6QFP

CoMAND ensemble for KH-S1

This paper

PDB: 6QH2

Conventional ensemble for polb4

ElGamacy et al., 2018

PDB: 6H5H

SHINE 1.2

This paper

https://github.com/MichaelRiss/Shine

CoMAND scripts

This paper

N/A

Topspin

Bruker BioSpin

https://www.bruker.com/products/mr/nmr/nmrsoftware/nmr-software/topspin/

Rosetta 3.6

Das and Baker, 2008

https://www.rosettacommons.org

NAMD 2.12

Phillips et al., 2005

http://www.ks.uiuc.edu/Research/namd

OpenMM 7.0

Stanford University

http://openmm.org

PALES

Zweckstetter and Bax, 2000

https://www3.mpibpc.mpg.de/groups/ zweckstetter/_links/software_pales.htm

PyMOL

Schro¨dinger

http://pymol.org

Molprobity

Chen et al., 2010

http://molprobity.biochem.duke.edu

Software and Algorithms

CONTACT FOR REAGENT AND RESOURCE SHARING Further information and requests for resources should be directed to and will be fulfilled by the Lead Contact, Murray Coles (murray. [email protected]). METHOD DETAILS NMR Spectroscopy NMR data for the current work were acquired as part of routine structure determination projects. Doubly labelled samples for MlbQ (0.5 mM, pH 8.0) and polb4 (0.5 mM, pH 7.1) were prepared as described previously (Pozzi et al., 2016; ElGamacy et al., 2018). Briefly, the respective sequences were cloned into plasmid vectors for expression in E. coli, in both case as N-terminally His6-tagged fusion proteins and utilizing a kanamycin resistance gene as a selection marker. Cells transformed with the resulting plasmids were cultivated in 13C,15N-labeled minimal medium, with protein expression induced by addition of IPTG. Cells lysates were collected and centrifuged, with the soluble fraction filtered and purified by Ni-NTA affinity chromatography. The polb4 sample was further purified by gel filtration on an equilibrated Superdex 75 column (GE Healthcare). The MlbQ sample was further purified by anion-exchange chromatography using a MonoQ H55/5 column (GE Healthcare) and a linear salt gradient. A sample for U3Sfl (0.8 mM, pH 7.6) was provided by Birte Ho¨cker (University of Bayreuth). Samples for KH-S1 (1.6 mM, pH 7.5) and human ubiquitin (2.0 mM, pH 5.8) were provided by Remco Sprangers (University of Regensburg). The ubiquitin sample was also purified by affinity chromatography and used without cleavage of an N-terminal hexa-histidine tag. Backbone and sidechain assignments for de novo structure determinations were obtained using standard triple resonance experiments. For human Ubiquitin literature values were used. Slight correction of 13C shifts against the respective CNH-NOESY spectra was necessary to account for calibration differences between spectrometers and spectrum types. 3D CNH-NOESY spectra were

e1 Structure 27, 853–865.e1–e5, May 7, 2019

acquired at 800 MHz on a Bruker Avance III spectrometer equipped with room temperature probehead. Indirect 13C dimensions were typically acquired with 100 time increments and processed with linear prediction and zero filling to 256 data points. The 13C sweep width was set to cover aliphatic carbon resonances; i.e. 10-73 ppm, resulting in a resolution of 30 Hz per point. At this resolution, 1 JCC couplings are unresolved and the spectra were run in non-constant time mode. Broadband 13C pulses were used to excite aromatic resonances and these were folded into the aliphatic window without phase inversion. CNH-NOESY spectra were analysed by extracting one-dimensional 13C sub-spectra chosen from a search area centred on assigned 15N-HSQC positions (typically 1-3 points in each dimension). As these sub-spectra contain only cross-peaks to a specific amide proton, choosing the strip with highest integral maximises the signal-to-noise. Residues with overlapping search areas were examined separately. In most cases strips with acceptable separation of signals could be obtained. Where this was not possible the residues were flagged as overlapped and a joint strip constructed by summing those at the estimated maxima of the respective components. A set of strips well separated from assigned HSQC positions were averaged to define a global noise level for the spectrum. NOESY Back-Calculation In order to back-calculate 3D CNH-NOESY spectra we modified the program SPIRIT (Zhu et al., 1998) by porting it to C++ and extending it to accommodate any combination of proton and heteronuclear dimensions. We name this program SHINE, for Simulation of Hetero-Indirect NOESY Experiments. The calculations are based on a full relaxation matrix and thus account for spin diffusion in static structures. Internal motion of the protein is treated by ensemble averaging over n contributing microstates, effectively applying an n-state jump model of motion, where the life-time of a microstate is assumed to be long compared to the interconversion time. This does not account for true time-averaged phenomena, such as motion-mediated spin-diffusion, which are treated as negligible for the current application. Inputs for the calculations are a chemical shift list, a test structure and a set of simulation parameters. The latter are largely spectral details, such as spectrometer frequencies and sweep widths, which are extracted automatically from the corresponding experimental files (Bruker format), but also include an estimate of a global molecular correlation time. Resonances are modelled as Gaussian lines, with 13C linewidths assigned on a class basis, taking into account unresolved 1JCC couplings. Here it should be noted that the short acquisition times in an indirect 13C dimension (<10 ms on an 800 MHz spectrometer) mean that effective lineshapes are largely governed by apodisation of the time domain. They are thus considerably more homogeneous than for a proton dimension. The computational demand of NOESY back-calculation depends on the number of protons in the relaxation network. For this reason, we employ sub-structures containing <150 atoms. These can be linear peptides or fragments of a folded structure. Linear peptides are typically tri- or penta-peptides where the test residue is in the second last position. Fragments are compiled at the residue level; residues are included in the sub-structure if they contain a proton within a given radius (typically 5 A˚) of a target residue proton. For each such sub-structure, a one-dimensional strip displaying contacts to a single backbone amide proton is calculated. In this mode, back-calculation typically takes less than 20 ms per conformer on a single processor core. The program also outputs a list of peak intensities that can be used to build multi-dimensional spectra suitable for viewing in SPARKY (Goddard and Kneller), with annotation of individual cross peaks. Calculation of features sets is performed for each residue for which an experimental spectrum is available. The starting structures are linear peptide fragments extracted from an arbitrary structural model. In the current work these were tripeptides centred on the test residue. For back-calculation of contacts to the amide proton of residue i this peptide is modified through a set of torsion angles in a shifted Ramachandran space: angles ci-1 (here denoted yi) and 4i and up to two sidechain c angles. The backbone angles were searched at 10 granularity, while sidechain angles sample all staggered rotamers. This results in a maximum of 11664 conformers. No checks for steric clashes are applied to test if a conformer is physically reasonable and bond lengths and angles remain constant throughout. An exception is for proline residues where the search space is restricted to a physically realistic range in 4i. As proline residues lack an amide proton, their features sets are compiled by back-calculation of strips for the amide proton of the previous residue, which is the most sensitive reporter on the y angle of proline, due to contacts between that amide proton and the proline Cd protons. The set of back-calculated spectra for each residue are stored as a single file. For residues flagged as overlapped, the features set files are concatenated to create a joint set. Calculation of R-factors We define the R-factor as the relative RMS residual between an experimental spectrum and the expectation spectrum back-calculated from a structural model. The use of RMS is in analogy to the quality factor (Q-factor) calculated for residual dipolar coupling data (Cornilescu et al., 1998). The use of RMS tends to emphasize large outliers relative to the R-factor used in crystallography, which averages absolute differences between experimental and back-calculated structure factors. It also provides a convenient definition of the optimum scaling factor for the back-calculated spectrum scalc, which can be calculated as: Scalc =

nexp ,ncalc kncalc k2

Structure 27, 853–865.e1–e5, May 7, 2019 e2

where vexp and vcalc are the experimental and back-calculated spectral intensities vectors, respectively. The R-factor is then calculated as: qffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi 2ffi Pn  i nexpi  Scalc ncalci qffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi R= Pn 2 i nexpi The theoretical range of the metric is from 0 to 1, however the maximum value can only be reached if there is no correspondence between peaks in the experimental and expectation spectra, in which case the scaling factor, scalc, will approach zero. The practical minimum value is limited by the RMS noise on a per-residue basis. Calculation of Fold Factors Back-calculation for a structure can be carried out either on a linear peptide or as a fragment of the full structure. The R-factor can therefore be calculated on either basis. Calculation for peptides cannot explain any peaks outside the linear context, whereas all peaks should be explained in a fragment. The difference between fragment and peptide R-factors therefore reports on the fraction of cross-peak intensity that can only be explained in the folded structure. Here we report this difference as a per-residue fold factor, F:  Rpentap Fi = Rfull i i where Rifull is based on a fragment from the full model and Ripenta is the pentapeptide based R-factor for residue i. As R-factors should decrease as more of the cross-peak intensity is explained, the fold factor will be consistently negative for well-folded structures. High positive values are indications of misfolding, while continuous stretches of values close to zero should only be seen for unstructured regions. As the minimum fold factor expected for a protein will vary with the secondary structure content and the contact order of the fold, fold factors are best used for internal comparisons, rather than as a general quality measure. Factorisation and CMAP Construction To decompose the experimental spectrum, a set of back-calculated spectra is required for all potentially relevant conformers. As an agnostic starting point we use the features set described above; i.e. the dipeptide conformational space is sampled systematically: Dy = 10 , D4 = 10 Dc1 = 120 , Dc2 = 120 . The aim of the factorisation is to solve for a set of factors h that weights each entry of the features set W to best explain the experimental spectrum v. The principal solution can be defined as: minkn  Whk2 hR0

As the elements of both h and W are positive, this solution can be obtained by positive matrix factorisation. First computing the Moore-Penrose pseudo-inverse of the features set matrix, W+, h can be derived directly as: pffiffiffiffiffiffiffiffiffiffiffiffiffiffi h = n2 W + 2 The uniqueness of the solution in positive matrix factorisation is limited by the ranks of the component matrices (Wang and Zhang, 2013). For an m x n matrix, the factorisation rank has the upper bound of min(m,n) and here rank(W) >> rank(v) = 1; i.e. the limiting factor is the rank of v. To stay close to this limit, we restrict the factorisation to a block-wise, two-conformer solution. A two-conformer solution is also convergent and computationally tractable for handling the fine granularity of conformational sampling used here. The solution provides a recovered spectrum that has a weight that is at least equal to any single-conformer solution. This two-component weight can thus be used as the highest propensity reference state href for estimating the relative, normalised propensities of every other conformer hi. A Boltzmann factor can be directly used to estimate the energy of every conformer according to: href DGref/i =e hi kT The above procedure is performed across the (y, 4) planes at every (c1, c2) combination, with the plane yielding the lowest R-factor embedded into the CHARMM36 force field to generate experimentally-derived backbone dihedral energy surfaces (CMAPs). Model Building Using Rosetta The Rosetta software package (Das and Baker, 2008) was used to build structural models using backbone dihedral restraints derived from conformational mapping (version 3.6). First, a Rosetta dihedral angle constraint file (.cst) was compiled. For each residue position i, a MultiConstraint field was written to comprise both dihedral angles yi and 4i. When multiple dihedral angles were possible for one residue position, an AmbigousConstraint field was used to include all possibilities. The Rosetta fragment picking program fragment_picker (Gront et al., 2011) was used to select 3mer and 9mer fragments satisfying the dihedral restraints from the PDB database. The DihedralConstraintsScore weight used was 500 and the minimum allowed 100. The SecondarySimilarity (weight 150, minimum 1.5) and RamaScore (weight 150, minimum 1.5) both used psipred. FragmentCrmsd was not used. Other fragment_picker parameters were defaults. The fragment database vall.jul19.2011 was searched and 200 3mer and 200 9mer fragments were picked

e3 Structure 27, 853–865.e1–e5, May 7, 2019

for each position in the protein. The Rosetta ab initio folding program AbinitioRelax was then used to fold the target proteins using these picked fragments (flag file: -ex1, -ex2 -use_input_sc, -flip_HNQ, -no_optH false, -silent_gz 1). Typically, 20,000 decoys were generated for each target protein. SARS Simulations The SARS framework was designed to provide robust and efficient conformational sampling without relying on any bioinformatic data, with the only input being experimentally determined CMAPs as residue-wise biasing potentials into the standard atomistic force field. The sampling acceleration scheme is based on the assumption that the native structure lies at the global minimum of a potential energy surface that is led to by successively deeper local minima. The algorithm initiates multiple replicas from the same fully extended peptide chain starting system. It then combines alternating rounds of simulated annealing between two temperature baths with conjugate gradient minimisation. Each such round constitutes a search step in a collective swarming behaviour that guides configuration exchange between replicas whenever a new minimum is reached. In this way, all of the high-energy replicas follow the lead of the lowest energy one. The implementation details and convergence properties of the SARS method will be detailed in a separate publication. Unrestrained Molecular Dynamics An initial low-resolution model generated by ROSETTA from the CNH-NOESY-acquired dihedral restraints was taken as the input coordinates for molecular dynamics simulations. The standard trajectories were acquired from several independent replicas conducted using the standard CHARMM36 force field (Huang and MacKerell, 2013). All of the simulations were performed in explicit TIP3P water as solvent, with solvent padding spanning 13.5 A˚ to the closest periodic boundary in a cuboidal box, and neutralised by 0.15 M Sodium Chloride in a cubic periodic unit cell. Energy minimisation was performed through 5000 steps of conjugate gradient minimisation, which was followed by 30 ns of NPT equilibration. The time step was set to 2 fs and a Langevin Piston was set to 1 Atm at oscillation period of 200 fs and damping period of 50 fs and a temperature of 298 K. A Langevin thermostat was accordingly set with damping coefficient of 1 ps-1. A nonbonded interactions distance cutoff was set to 12.0 A˚ at a switching distance of 10.0 A˚ with all nonbonded force and pair list evaluations were performed every timestep, and long-range electrostatics were computed using the Smooth Particle Mesh Ewald method (Essmann et al., 1995) as implemented in the NAMD engine (Phillips et al., 2005). Data was collected from the ensuing NVT trajectories, dumping coordinates every 5 ps for analysis. Frame picking was done from the production trajectories of 30 ns each based on the standard force field that would represent a steady state canonical ensemble. Ensemble Building To compile the final CoMAND ensemble for hUb, we performed frame picking from an equilibrium ensemble, such that the compiled conformers belong to microstates of minimal free energy and maximal entropy at the target temperature. This should provide more physically realistic final models compared to those collected from constrained tempering schemes with unrealistic Hamiltonians. To pick frames from the production trajectories we applied a greedy algorithm aimed at minimising the average R-factor for all residues where experimental CNH-NOESY strips were available. This algorithm is iterative and adds one structure per iteration, starting from an empty ensemble. At each iteration, the algorithm searches all frames, choosing the one that most decreases the overall average R-factor. The algorithm either converges - i.e. the R-factor no longer decreases - or reaches a set ensemble size chosen to respect under-restraining limits (here 20). A second implementation of the greedy algorithm was used to incorporate stochastic elements. Here the starting point is one of the 10 frames with the lowest R-factor, chosen at random. The algorithm then searches all frames in random order adding the first frame that decreases the overall R-factor in each iteration. This algorithm is designed to be applied to R-factor optimisation over for a single residue or small sub-sets of residues. Validation versus NMR Observables Expectation NMR observables, R-factors and fold-factors were calculated for the CoMAND and reference hUb ensembles. For uniformity, only the first 20 conformers from each ensemble were considered for the comparisons (an exception are the R-factors presented in Figure 1B, which were calculated for the whole ensemble; R-factors for the first 20 models of the larger ensembles are shown in Figure S9). For backbone amide order parameters, frames were aligned into a singular molecular frame of reference that best fits backbone atoms of residues 1 through 70. The order parameter, S2NH, was directly computed according to the method described by Nederveen and Bonvin (Nederveen and Bonvin, 2005) through the following equation: ! 3 X 3 X  2 1 2 SNH = 3 mi mj  1 2 i=1 j=1 where mi is the normalised internuclear NH bond vector in the molecular frame of reference and h*i represents the ensemble-averaged value. Expectation HN-Ha scalar coupling constants were calculated according to the following Karplus function: 3

JHN Ha = 6:98 cos2 q  1:38 cos q + 1:72

where q is the HN – N – Ca - Ha dihedral angle. These were compared to literature experimental values (Wang and Bax, 1996). Deviations from experimental residual dipolar couplings and an overall Q-factor for the CoMAND ensemble were calculated using the Structure 27, 853–865.e1–e5, May 7, 2019 e4

program PALES (Zweckstetter and Bax, 2000) using 996 backbone coupling published for 2MJB across four alignment media (Maltsev et al., 2014). DATA AND SOFTWARE AVAILABILITY The SHINE program is available at https://github.com/MichaelRiss/Shine. Other scripts associated with the CoMAND method are available from the lead contact on request. Coordinates for the CoMAND ensembles have been deposited in the Protein Data Bank under the following accession numbers: hUb, 6QF8; MlbQ, 6QFP; KH-S1, 6QH2. An ensemble for polb4 has been solved in parallel by conventional methods (ElGamacy et al., 2018) and deposited under the accession number 6H5H.

e5 Structure 27, 853–865.e1–e5, May 7, 2019