Unraveling the meaning of chemical shifts in protein NMR


BBA - Proteins and Proteomics xxx (xxxx) xxx–xxx

Contents lists available at ScienceDirect

BBA - Proteins and Proteomics journal homepage: www.elsevier.com/locate/bbapap

Unraveling the meaning of chemical shifts in protein NMR☆
Mark V. Berjanskii, David S. Wishart⁎
Department of Computing Science, University of Alberta, Edmonton T6G 2E9, Canada
Department of Biological Sciences, University of Alberta, Edmonton T6G 2E8, Canada

ARTICLE INFO

Keywords: NMR, Chemical shift, Protein, Structure, Prediction

ABSTRACT

Chemical shifts are among the most informative parameters in protein NMR. They provide a wealth of information about protein secondary and tertiary structure, protein flexibility, and protein-ligand binding. In this report, we review the progress in interpreting and utilizing protein chemical shifts that has occurred over the past 25 years, with a particular focus on the large body of work arising from our group and other Canadian NMR laboratories. More specifically, this review focuses on describing, assessing, and providing some historical context for various chemical shift-based methods to: (1) determine protein secondary and super-secondary structure; (2) derive protein torsion angles; (3) assess protein flexibility; (4) predict residue accessible surface area; (5) refine 3D protein structures; (6) determine 3D protein structures; and (7) characterize intrinsically disordered proteins. This review also briefly covers some of the methods that we previously developed to predict chemical shifts from 3D protein structures and/or protein sequence data. It is hoped that this review will help to increase awareness of the considerable utility of NMR chemical shifts in structural biology and facilitate more widespread adoption of chemical shift-based methods by NMR spectroscopists, structural biologists, protein biophysicists, and biochemists worldwide. This article is part of a Special Issue entitled: Biophysics in Canada, edited by Lewis Kay, John Baenziger, Albert Berghuis and Peter Tieleman.

1. Introduction

Chemical shifts are often known as the mileposts of NMR spectroscopy. They provide a robust, reliable, precise, and easily measured route to map out the covalent structure of organic molecules. In organic chemistry, NMR chemical shifts have been used to identify and determine the structure of small organic compounds for more than 60 years. This is because chemical shifts are exquisitely sensitive to pairwise bonds, molecular geometry, and the electronegativity of individual atoms. This structural sensitivity has allowed organic chemists to develop simple, heuristic rules and widely used chemical shift tables to help with small molecule structural interpretation and analysis. In protein NMR, the interpretation of chemical shifts has historically been much more difficult. This is because proteins are very large molecules with not just 5–10 chemical shifts, but with hundreds to thousands of chemical shifts. Furthermore, the covalent structure of proteins is relatively uniform and it is only the non-covalent structure (i.e. the secondary and tertiary structure) that leads to subtle, seemingly random chemical shift variations. As a result, protein chemical shifts have been used primarily as simple “ledger entries” to track NOEs (nuclear Overhauser effects) rather than as tools for protein structure determination. However, as highlighted in this review, there is much more to these subtle, seemingly random variations in protein chemical shifts. Over the past 25 years, a small number of protein NMR spectroscopists have been trying to tease out the relationships between protein chemical shifts and protein structure/dynamics. These efforts have proven to be remarkably successful. Today, it is possible to use protein chemical shifts to rapidly identify protein secondary and super-secondary structure [1–15], to calculate backbone and side-chain torsion angles [11,16–18], to determine residue-specific accessible surface areas [19], to measure protein flexibility [20,21], to generate protein structure models [22–29] and to precisely refine protein structures [30,31]. Interestingly, Canadian researchers have played a key role in the development of many of these protein chemical shift techniques. This review is intended to highlight some of this work and to provide some historical context with regard to what events or processes led to which discoveries. Overall, it is hoped that this manuscript will make readers more aware of how NMR chemical shifts can make protein structural analysis much easier, much faster, and far more informative.

☆ This article is part of a Special Issue entitled: Biophysics in Canada, edited by Lewis Kay, John Baenziger, Albert Berghuis and Peter Tieleman.
⁎ Corresponding author at: Department of Computing Science, University of Alberta, Edmonton T6G 2E8, Canada. E-mail address: [email protected] (D.S. Wishart).

http://dx.doi.org/10.1016/j.bbapap.2017.07.005 Received 22 March 2017; Received in revised form 29 June 2017; Accepted 7 July 2017 1570-9639/ © 2017 Published by Elsevier B.V.

Please cite this article as: Berjanskii, M.V., BBA - Proteins and Proteomics (2017), http://dx.doi.org/10.1016/j.bbapap.2017.07.005


M.V. Berjanskii, D.S. Wishart

Fig. 1. Example of CSI 3.0 output. CSI 3.0 was calculated for BMRB entry 6338. CSI values 1 and − 1 are shown as positive or negative bars.

2. Scope of this review

The focus of this review is on what chemical shifts tell us about protein structure and how chemical shifts can be used to determine protein structures or improve their quality. There are many useful papers and reviews covering the theory of protein chemical shifts and their application to protein structural biology [32–35]. To avoid repetition, we will limit the scope of this review to the areas of protein chemical shift analysis specifically explored by our laboratory and other Canadian laboratories over the past 25 years. Therefore, a total of 8 different subjects will be covered. These include the use of chemical shifts to: 1) determine protein secondary and super-secondary structure; 2) determine backbone and side-chain torsion angles; 3) determine protein flexibility; 4) calculate residue-specific accessible surface area; 5) perform protein structure refinement; 6) generate 3D protein structures, and 7) model intrinsically disordered proteins. In addition, we will describe 8) the development of methods to accurately predict chemical shifts from sequence and structure.

3. Secondary and super-secondary structure from chemical shifts

Secondary and super-secondary structures are a fundamental property of proteins and play a critical role in our understanding of protein structure, function, and evolution. Secondary structure refers to shorter-range regular structural features of proteins such as helices, β-turns, and β-strands. Super-secondary structure typically refers to longer-range secondary structure features, such as β-hairpins, β-sheet topologies or two or more contiguous secondary structure features. Secondary and super-secondary structure predictions are used by many protein fold recognition algorithms [36], protein threading methods [37], protein tertiary structure prediction techniques [33,38], and intrinsically disordered protein (IDP) identification tools [39]. Secondary structure information is also used by many scoring functions and molecular modelling programs to fold, refine, design or validate protein structures [23,31,40,41]. Secondary structure can provide protein modelers with approximate backbone torsion angles, qualitative maps of protein rigidity and flexibility, expected locations of hydrogen bonding in α-helices and β-strands, implied contact information for β-strands, and topological information for β-turns.

The sensitivity of chemical shifts to protein secondary structure was first discovered > 50 years ago [42,43]. However, it was not until the early 1990s that a sufficient quantity of assigned protein chemical shifts was available to detect any useful correlations between residue-specific chemical shifts and protein secondary structure [44]. In that paper [44], the Sykes group at the University of Alberta identified a clear and consistent relationship between amino acid 1H and 13C chemical shifts and the presence of helices and β-strands. On average, 1Hα shifts of protein residues, except glycine, were found to be shifted upfield with respect to their random coil values by 0.30 ppm when located in an α-helix and downfield by 0.46 ppm when located in a β-sheet. Likewise, 13Cα shifts were found to be shifted downfield by an average of 2.6 ppm in helices and upfield by an average of 1.1 ppm in β-sheets. Similar, albeit less pronounced and less consistent, trends were observed for other nuclei [33]. It was shortly after this discovery that the Sykes group at the University of Alberta developed a new method, called the Chemical Shift Index (CSI), for rapidly determining protein secondary structure from chemical shifts [1,2]. Prior to the development of the CSI, protein secondary structures were typically determined through a tedious analysis of NOE patterns and vicinal coupling constants. With the CSI method, NMR spectroscopists could immediately use their chemical shift assignments and a few simple rules to determine the location and extent of the secondary structure in their newly assigned protein. The chemical shift index uses a set of upper and lower thresholds to convert residue-specific secondary chemical shifts of backbone nuclei (1Hα, 13Cα, 13Cβ and 13C′) into three indices that correspond to three secondary structure states. Secondary chemical shifts that exceed the upper threshold are given an index of 1, while those that fall below the lower threshold are assigned an index of −1. Secondary shifts with values between the upper and lower thresholds are given an index of 0. The CSI method was originally developed for 1Hα chemical shifts [1] and then extended to include 13Cα, 13Cβ, and 13C′ secondary shifts [2] in order to improve method accuracy. This extended CSI method is called the consensus CSI. In the consensus version of CSI, β-sheet conformation is assigned to index 1, α-helix state is assigned to index −1, and random


coil state is assigned to index 0. Chemical shift indices are often displayed as a bar graph that indicates the start and end of α-helices, β-sheets, and random coil regions (Fig. 1). The agreement between CSI and secondary structures derived from X-ray based protein models is between 84 and 92%, depending on the set of proteins and secondary structure calculation method [2,45–47]. The CSI method has been integrated in several programs and webservers, including the NMRView program [48], the RCI webserver [49], and the PREDITOR webserver [17]. The CSI method is not without some problems. Early on, it was found that the CSI method was generally more accurate for α-helices (> 90%) than for β-strands (< 80%) [45,46]. It was also found that the performance of the CSI approach would suffer if chemical shift assignments were incomplete or mis-referenced. Since secondary chemical shifts depend on residue-specific random coil values, the CSI method was also sensitive to the choice of random coil shifts [46,47]. Furthermore, the CSI method could not identify other types of secondary structure such as 3/10 helices, helix capping boxes, and β-turns. To address these shortcomings, a new version of the CSI was developed by our laboratory in 2015 [15]. More specifically, to make secondary structure assignments more accurate and less dependent on assignment completeness and referencing errors, the secondary chemical shift information was supplemented with additional data and the method was optimized using machine learning techniques. This additional information included amino acid sequence data [50], flexibility data derived from the Random Coil Index or RCI [51], per-residue fractional accessible surface area derived from chemical shifts [19], residue conservation [52], and sequence-predicted secondary structure generated by PSIPRED [37]. These features were combined and trained using a multi-class Support Vector Machine method and a multi-residue Markov model for post-assignment filtering [15]. The enhanced version of CSI (called CSI 2.0) had a secondary structure identification accuracy of 90.6% and 89.4% on two different test sets. Given the general ambiguity and disagreement between secondary structure assignment calls (and existing programs), 90–91% accuracy is about as good as it is possible to get. In fact, CSI 2.0 performance was comparable to that of standard methods for secondary structure identification from protein coordinates, such as DSSP [53] and STRIDE [54]. The CSI 2.0 algorithm was also shown to be significantly more accurate (3–8%) than several other shift-based methods for secondary structure assignment, such as TALOS+ [16], TALOS-N [18], DANGLE [11], CSI [1,2], PSSI [3], Delta2D [13], and Psi-CSI [4]. CSI 2.0 is available as a webserver at http://csi.wishartlab.com. To increase the number of secondary structure types that the CSI method can identify, we recently undertook further improvements to the CSI algorithm. This new version (CSI 3.0) [14], which was developed in 2015, can identify eleven types of secondary and super-secondary structures: β-strand, α-helix, coil, β-hairpins, interior and edge β-strands, as well as 5 types of β-turns (type I, II, I′, II′ and VIII). To achieve this level of coverage, CSI 3.0 combines four different chemical shift-based tools: (1) CSI 2.0 to identify the locations of α-helices, β-strands, and coils; (2) TALOS-N to predict torsion angles; (3) the backbone random coil index (RCI) to predict backbone order parameters as measures of protein flexibility, and (4) the side-chain RCI to predict fractional accessible surface areas. Once all these analyses or predictions have been performed, the β-turn algorithm scans all rigid coil regions (RCI-derived order parameters > 0.7) and compares torsion angles predicted by TALOS-N to torsion angles expected for the central residues (i + 1) and (i + 2) in the five aforementioned types of β-turns [55]. If three of the four torsion angles fall within 30° and the remaining angle falls within 45° of their expected values in a particular β-turn, this β-turn type is assigned to this protein region. CSI 3.0 assigns β-hairpins when a region has two sequential β-strands that are connected by six or fewer residues containing a reverse β-turn. CSI 3.0 also employs a complex pattern-based algorithm for detecting internal and edge β-strands. This algorithm uses a scoring system that evaluates β-strands using a number of rules based on known patterns of alternating upfield/downfield Hα chemical shifts, characteristic accessible surface area features, residue flexibility, hydrophobicity patterns, and other features. When evaluated on a test set of 13 proteins [14], CSI 3.0 identified edge β-strands, interior β-strands, and non-strands with an accuracy of 73%, 88%, and 97%, respectively. CSI 3.0 was also able to detect type I β-turns with an accuracy of 98% and types II, I′, II′, and VIII β-turns with an accuracy of 100%. β-Hairpins were detected with an accuracy of 98%. α-Helices, β-strands, and coil regions were assigned with an accuracy of 98%, 96%, and 96%, respectively. The CSI 3.0 program is available as a webserver at http://csi3.wishartlab.com. An example of the CSI 3.0 webserver output is shown in Fig. 1. The CSI 2.0 and CSI 3.0 programs were developed not only to improve the accuracy and information content of shift-derived secondary structure but also to improve shift-based determination of 3D protein structures. This is a subject that we will visit in Section 9 of this review.
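Two of the rules described in this section are simple enough to sketch in code: the three-state index assignment used by the CSI method and the angle test that CSI 3.0 applies when assigning β-turns. The Python below is a minimal illustration and not the CSI or CSI 3.0 implementation. The function names, the ±0.1 ppm thresholds used in the example calls, and the idealized type I turn angles are assumptions introduced here, and the turn rule is read as "at least three angles within 30° and all four within 45°".

```python
from typing import Sequence


def csi_index(secondary_shift: float, lower: float, upper: float) -> int:
    """Return 1 above the upper threshold, -1 below the lower one, else 0."""
    if secondary_shift > upper:
        return 1
    if secondary_shift < lower:
        return -1
    return 0


def angle_diff(a: float, b: float) -> float:
    """Smallest absolute difference between two angles, in degrees (0 to 180)."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)


def matches_turn(predicted: Sequence[float], expected: Sequence[float],
                 tight: float = 30.0, loose: float = 45.0) -> bool:
    """Test four angles (phi, psi of residues i+1 and i+2) against one turn type.

    Interpretation assumed for this sketch: at least three angles lie within
    `tight` degrees of the expected values and all four lie within `loose`.
    """
    diffs = [angle_diff(p, e) for p, e in zip(predicted, expected)]
    return sum(d <= tight for d in diffs) >= 3 and all(d <= loose for d in diffs)


# Idealized type I beta-turn angles (textbook values, not taken from this review).
TYPE_I = (-60.0, -30.0, -90.0, 0.0)

# Hypothetical thresholds of +/-0.1 ppm for a 1H-alpha secondary shift.
print(csi_index(0.46, -0.1, 0.1))
print(matches_turn((-65.0, -25.0, -95.0, 40.0), TYPE_I))
```

With these illustrative thresholds, a 1Hα secondary shift of +0.46 ppm maps to index 1 and a shift of −0.30 ppm maps to index −1, in line with the β-strand and α-helix conventions of the consensus CSI.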

4. Torsion angles from chemical shifts

Torsion angle restraints play a key role in normalizing the geometry of polypeptide chains and defining the secondary structure of protein NMR models [56]. Torsion angle information is typically obtained from scalar couplings (e.g. 3JHNHα, 3JHα-1N, 3JC'Hα) and cross-correlated relaxation experiments [57–61]. These measurements involve analysis of peak intensities or peak splitting. However, for larger proteins, this kind of analysis can be very difficult or even impossible due to resonance broadening and low signal-to-noise in multi-dimensional NMR spectra. Chemical shifts offer an alternative approach for obtaining torsion angle information. Indeed, as we and other researchers found, 1Hα, 13Cα, and 13C′ chemical shifts are particularly sensitive to backbone ϕ/ψ angles and 15N shifts are significantly influenced by ψ angles of the preceding residue [44,62,63]. Several programs have been developed to predict torsion angles from chemical shifts. These include TALOS [64], TALOS+ [16], DANGLE [11], and TALOS-N [18,65]. Our laboratory has also been actively developing shift-based torsion angle predictors. In 2006, we described one of the first torsion angle predictors, called SHIFTOR [66]. This was later integrated into the PREDITOR webserver [17], which became the first web server for NMR-based protein torsion angle prediction and the first tool to predict protein ϕ, ψ, ω, and χ1 angles. PREDITOR predicts backbone torsion angles by comparing the chemical shifts and primary sequence of residue triplets from a query sequence with the chemical shifts and primary sequences of triplets stored in the PREDITOR database. To compare triplets, PREDITOR calculates a similarity score [17] that includes a weighted sum of 13Cα, 13Cβ, 13C′, 1HN, 1Hα, and 15N chemical shift differences between each triplet in the query protein and each triplet in the database. In addition to the chemical shift term, the score also includes a term describing the sequence similarity of the triplets, which is calculated using a special amino acid weight matrix [67]. The ten triplets with the best shift similarity scores are selected and their average torsion angles are used as the predicted torsion angles. PREDITOR uses probabilistic χ1 hypersurfaces [68] to predict χ1 angles in one of three states (−60°, +60° and 180°) based on ϕ and ψ angles. By default, PREDITOR assigns a trans conformation (180°) to all ω angles. However, if chemical shift data indicate the presence of a cis peptide bond at a proline, PREDITOR assigns a cis ω angle (0°) to the proline residue. PREDITOR uses the Random Coil Index or RCI [51] to assess the flexibility of protein backbones and to predict torsion angle uncertainties. PREDITOR also checks whether the query protein has a homologous structure available in the Protein Data Bank (PDB). If PREDITOR can find a homologue with sequence identity > 50%, the torsion angles of the homologue are mapped to the aligned portions


between the query sequence and the homologous sequence. The torsion angles predicted from chemical shifts and the torsion angles predicted from a PDB homologue (if it exists) are compared and merged. Tests of PREDITOR performance indicated that 85% of its ϕ/ψ predictions are within 30° of the correct values, 84% of its χ1 predictions are correct (using a 3-state g+, g−, trans criterion), and 99.97% of its ω angle predictions are correct (using a 2-state cis/trans criterion). In its hybrid mode (i.e. combined predictions from chemical shifts and sequence), PREDITOR demonstrated better accuracy than TALOS, TALOS+, and DANGLE [33]. PREDITOR is available as a webserver at http://www.preditor.ca.
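The triplet search described for PREDITOR can be sketched roughly as follows. This is only an illustration of the idea (rank database triplets by a weighted sum of shift differences, then average the torsion angles of the best matches) and is not PREDITOR's code. The data layout, the function names and the weights are hypothetical, the sequence-similarity term is left out, and a circular mean is used for the averaging so that angles near ±180° do not cancel, which is a choice made for this sketch.

```python
import math
from typing import Dict, List, Sequence, Tuple

Residue = Dict[str, float]             # nucleus label -> chemical shift (ppm)
Triplet = Sequence[Residue]            # three consecutive residues
Entry = Tuple[Triplet, float, float]   # (triplet, phi, psi) of the central residue


def triplet_distance(query: Triplet, entry: Triplet,
                     weights: Dict[str, float]) -> float:
    """Weighted sum of absolute shift differences; missing shifts are skipped."""
    total = 0.0
    for q_res, d_res in zip(query, entry):
        for nucleus, w in weights.items():
            if nucleus in q_res and nucleus in d_res:
                total += w * abs(q_res[nucleus] - d_res[nucleus])
    return total


def circular_mean(angles_deg: Sequence[float]) -> float:
    """Mean of angles in degrees computed on the unit circle."""
    s = sum(math.sin(math.radians(a)) for a in angles_deg)
    c = sum(math.cos(math.radians(a)) for a in angles_deg)
    return math.degrees(math.atan2(s, c))


def predict_phi_psi(query: Triplet, database: List[Entry],
                    weights: Dict[str, float], k: int = 10) -> Tuple[float, float]:
    """Average phi/psi over the k database triplets closest to the query."""
    ranked = sorted(database, key=lambda e: triplet_distance(query, e[0], weights))
    best = ranked[:k]
    return (circular_mean([e[1] for e in best]),
            circular_mean([e[2] for e in best]))
```

A production predictor would also need the sequence term, a far larger reference database, and an uncertainty estimate such as the RCI-based one mentioned in the text.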

ualberta.ca/rci. The RCI method was also incorporated into the TALOS+ [16], TALOS-N [65], and MICS [12] programs to provide error estimates for shift-derived torsion angle predictions. The RCI web server accepts chemical shift input in both SHIFTY and BMRB NMR-STAR formats and allows users to select a chemical shift reference correction, the set of random coil reference values, the type of nearest-neighbor residue corrections, the handling of end-effects, and the treatment of assignment gaps. The web server outputs text files and graphical plots of the RCI values as well as the predicted per-residue MD RMSD, NMR RMSD, and order parameters. The relationships between the RCI and other measures of backbone protein flexibility used by the web server are given by [20]:

$$S^2 = 1 - 0.4 \ln\left(1 + \mathrm{RCI} \times 17.7\right) \tag{3}$$

$$\text{MD RMSD} = \mathrm{RCI} \times 28.3\ \text{Å} \tag{4}$$

$$\text{NMR RMSD} = \mathrm{RCI} \times 16.7\ \text{Å} \tag{5}$$

$$\text{B-factor} = \mathrm{RCI}^{1/2} \times 142.0 \tag{6}$$

When tested on a set of 33 proteins, the average correlations between the RCI-derived S2, MD RMSD, per-residue NMR RMSD, and B-factors and their corresponding experimental values were 0.75, 0.82, 0.77, and 0.60, respectively [20]. Meanwhile, the average absolute errors for S2, MD RMSD, NMR RMSD, and B-factor predicted from RCI values were calculated to be 0.05, 0.40 Å, 0.43 Å, and 16.9 Å², respectively [20]. A more detailed description of the relationship between backbone flexibility (as described by the RCI) and protein dynamics or protein time scales is given in a paper we published in 2008 [20]. While most estimates of protein dynamics and flexibility are still obtained through NOEs, T1 and T2 measurements, these measurements are difficult, tedious and time consuming to collect. Often values are not measurable for all residues. On the other hand, RCI calculations are simple, quick, and comprehensive. Furthermore, they often reveal comparable information or even complementary information that is not detectable (or obvious) using traditional NOE, T1, and T2 measurements. As the field of NMR has evolved, the interest in understanding side-chain motions has picked up. This is because many enzymatic or ligand recognition processes depend on side-chain interactions and side-chain motions. In 2013, we decided to extend the RCI method to predict the mobility of protein side-chains. Since the amplitudes of total side-chain displacements are affected by both side-chain dynamics (e.g. interconversions among side-chain rotamers) and motions of the protein backbone (e.g. local fluctuations of backbone torsion angles, collective motions in protein loops and termini), the expression for the side-chain RCI required a combination of side-chain secondary chemical shifts with the backbone RCI. This was given as follows:

$$\mathrm{RCI}_{SC} = A \times \left\langle \sum_{a} k\,|\Delta\delta_{a}| \right\rangle^{-1} + B \times \mathrm{RCI}_{BB} \tag{7}$$

where a indicates side-chain atoms of a given residue, |Δδa| is the absolute secondary chemical shift of the side-chain atom a, and k, A, and B are weighting coefficients. RCIBB is the RCI of the backbone chemical shifts [20]. Comparisons of side-chain and backbone RCIs as well as side-chain and backbone RMSDs from molecular dynamics simulations revealed that side-chain total movements are dominated by different mechanisms. These mechanisms can be divided into three main groups: side-chain driven, backbone driven, and combined backbone-side-chain driven. According to a combined RCI and MD analysis (Fig. 2), the amplitudes of the total side-chain motions in rigid parts of a protein (α-helix, β-sheet) are dominated by the interconversions among side-chain rotamers (RCISC ≫ RCIBB, MD RMSDSC ≫ MD RMSDBB, side-chain driven mechanism). In very flexible parts of a protein (the termini, loops), total side-chain motions are primarily driven by backbone motions (RCISC ≈ RCIBB, MD RMSDSC ≈ MD RMSDBB, backbone driven). Between these two extremes, backbone motions and side-chain rotamer

5. Protein flexibility from chemical shifts

Chemical shifts have long been known to be affected by protein motions, especially with regard to fast, intermediate and slow-exchange phenomena. However, exchange data is not easily measurable for all residues in a protein and so this rich dynamic information was largely ignored (or at least viewed as unattainable) for many years. The first qualitative correlation between secondary chemical shifts (1Hα) and protein flexibility (as implied from X-ray B-factors) was discovered for E. coli thioredoxin in 1991 [44]. A similar correlation was demonstrated for ubiquitin three years later [69]. Based on these empirical observations, the concept of the Random Coil Index (RCI) was developed by our group in 2005. Random coil chemical shifts are defined as the characteristic chemical shifts of amino acid residues in completely disordered, “coil” regions of proteins. They arise from the rapid conformational exchange among all theoretically possible conformations of an unfolded polypeptide chain in the absence of long-range interactions [70,71]. When the conformation and mobility of an amino acid or a polypeptide region approaches the completely disordered, random coil state, the chemical shifts of its atoms typically move toward their random coil values. By combining the chemical shifts from multiple nuclei (13Cα, 13C′, 13Cβ, 15N, NH and 1Hα) into a single parameter, a reliable quantitative relationship between chemical shifts and protein flexibility could be derived [51]. In particular, the inverse of the averaged, weighted secondary shifts of the aforementioned nuclei was shown to correlate well with standard measures of protein flexibility, such as the Model-Free order parameter S2, the root mean square deviations (RMSD) of NMR and molecular dynamics (MD) ensembles, and the X-ray B-factor. This inverse average of six chemical shifts was named the Random Coil Index (RCI) because it quantitatively reports the relative degree to which a protein region adopts the random coil state. The RCI calculation involves several steps including the smoothing of secondary shifts over several adjacent residues, the use of neighboring residue corrections, chemical shift re-referencing, gap filling, chemical shift scaling, and numeric adjustments to prevent divide-by-zero problems. Once these steps have been completed, the RCI is calculated using the following expression:

$$\mathrm{RCI} = \left[\frac{A\,|\Delta\delta_{C\alpha}| + B\,|\Delta\delta_{CO}| + C\,|\Delta\delta_{C\beta}| + D\,|\Delta\delta_{N}| + E\,|\Delta\delta_{NH}| + F\,|\Delta\delta_{H\alpha}|}{n}\right]^{-1} \tag{1}$$

where |ΔδCα|, |ΔδCO|, |ΔδCβ|, |ΔδN|, |ΔδNH|, and |ΔδHα| are the absolute scaled values of the secondary chemical shifts (in ppm) of 13Cα, 13C′, 13Cβ, 15N, NH, and 1Hα, respectively. A, B, C, D, E, and F are nucleus-specific weighting coefficients [20] and n is the number of chemical shift types. When all 6 chemical shifts are available, the RCI is calculated as

$$\mathrm{RCI} = \left[\frac{0.74\,|\Delta\delta_{C\alpha}| + 0.72\,|\Delta\delta_{CO}| + 0.13\,|\Delta\delta_{C\beta}| + 0.38\,|\Delta\delta_{N}| + 0.15\,|\Delta\delta_{NH}| + 0.91\,|\Delta\delta_{H\alpha}|}{6}\right]^{-1} \tag{2}$$

The last steps of the protocol include “end-effect corrections” and smoothing by three-point averaging. A detailed description of the RCI protocol was first published in 2006 [72]. A webserver to calculate RCI values was made available to the public in 2007 [49] and can be accessed at http://wishart.biology.
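As a rough illustration of Eqs. (2) to (5), the Python sketch below computes the backbone RCI from six scaled secondary shifts and converts it to an order parameter and to RMSD estimates. It assumes the shifts have already been re-referenced, corrected, scaled and gap-filled as described in the text. The function names, the dictionary keys, the small floor used to avoid division by zero, and the way chain ends are handled in the three-point average are assumptions made for the example; the published protocol [20,72] defines the actual details.

```python
import math
from typing import Dict, List, Sequence

# Weighting coefficients of Eq. (2); the key names are labels chosen for this sketch.
WEIGHTS = {"CA": 0.74, "CO": 0.72, "CB": 0.13, "N": 0.38, "NH": 0.15, "HA": 0.91}


def backbone_rci(shifts: Dict[str, float], floor: float = 1e-6) -> float:
    """Eq. (2): inverse of the weighted mean of six absolute scaled secondary shifts.

    All six nuclei must be present. `floor` is a hypothetical guard against
    division by zero; the real protocol uses its own numeric adjustment.
    """
    weighted = sum(w * abs(shifts[nucleus]) for nucleus, w in WEIGHTS.items())
    return 1.0 / max(weighted / len(WEIGHTS), floor)


def smooth3(values: Sequence[float]) -> List[float]:
    """Three-point average; terminal residues use the two values available."""
    out = []
    for i in range(len(values)):
        window = values[max(0, i - 1): i + 2]
        out.append(sum(window) / len(window))
    return out


def order_parameter(rci: float) -> float:
    """Eq. (3): model-free order parameter S2 estimated from the RCI."""
    return 1.0 - 0.4 * math.log(1.0 + rci * 17.7)


def md_rmsd(rci: float) -> float:
    """Eq. (4): per-residue RMSD of an MD ensemble, in angstroms."""
    return rci * 28.3


def nmr_rmsd(rci: float) -> float:
    """Eq. (5): per-residue RMSD of an NMR ensemble, in angstroms."""
    return rci * 16.7
```

Because the RCI is an inverse, residues whose secondary shifts are all close to zero (random-coil-like) receive large RCI values and therefore low predicted order parameters, which is the behaviour the method relies on.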


Vendruscolo and Laio groups [73,74] can serve as excellent examples of how this method can be used to explore protein dynamics. Chemical-shift-biased metadynamics is conceptually complementary to the RCI methods described above. While RCI can only predict motional amplitudes, biased metadynamics can help visualize the atomic motions associated with chemical shifts.

6. Residue accessible surface area from chemical shifts

The accessible surface area (ASA) of protein residues can be extremely useful in protein structure determination [75], assessing protein folding energies [76], estimating protein-ligand binding constants [77], calculating changes in protein enthalpy/entropy [78] or for assessing the quality of protein structure predictions [79]. While ASA values can be easily measured after a protein structure is determined, knowing ASA values prior to structure determination or without a 3D structure is even more valuable. For instance, indirect measurements of per-residue ASA with targeted chemical modification, partial proteolysis, and mass spectrometry were used to provide constraints for low-resolution protein structure determination [80]. After discovering the correlation between the side-chain RCI and the ASA of protein residues, we decided to develop a fully dedicated method (called ShiftASA) to predict residue-specific ASA using chemical shift information [19]. ShiftASA converts RCI, shift-based secondary structure, and sequence-derived parameters into accurate estimates of fractional accessible surface areas (fASA). To achieve higher accuracy than the side-chain RCI, ShiftASA uses the following shift- and sequence-based features to predict fASA: (1) the side-chain RCI [21], (2) the backbone RCI [20], (3) chemical shift-derived secondary structure [3], (4) residue-specific hydrophobicity [81], (5) a residue conservation score [52], and (6) the sequence-predicted fASA generated by the SABLE program [82]. These features were combined and their weightings optimized using a Stochastic Gradient Boosted Tree Model [83]. When tested on a set of 92 proteins with complete or near-complete chemical shift assignments, ShiftASA demonstrated superior performance (rank-order coefficient of correlation with structure-derived fASA of 0.82) in comparison with a sequence-only method (SABLE with a correlation coefficient of 0.67) and a shift-only method (side-chain RCI with a correlation coefficient of 0.73). ShiftASA is available as a web server at http://shiftasa.wishartlab.com.

Fig. 2. Mechanisms of total side-chain motions identified by (A) MD and (B) side-chain RCI for PyJ (PDB ID: 1FAF). Mechanism I corresponds to the dominant contribution of side-chain rotameric jumps to side-chain displacements. Mechanism II indicates comparable contributions of side-chain and backbone flexibilities to side-chain total motions. Mechanism III corresponds to the dominant contribution of backbone motions to the amplitude of side-chain movements. Ellipses indicate protein regions corresponding to particular mechanisms of total side-chain motions. The picture is adapted from [21].

transitions may have comparable contributions to the total side-chain fluctuations (RCISC > RCIBB, MD RMSDSC > MD RMSDBB, backbone-side-chain driven mechanism). Unlike the backbone RCI, the side-chain RCI was found to correlate reasonably well with the fractional accessible surface area of protein residues (fASA), with a mean Pearson correlation coefficient of 0.73. This good agreement between RCI and fASA can be clearly seen, for example, with the PyJ protein structure (PDB ID: 1FAF) where the side-chains are colored based on the side-chain RCI values (Fig. 3). The side-chains of PyJ with large RCISC values are mostly exposed whereas side-chains with small RCISC values are mostly buried. The side-chain RCI was also found to correlate well with side-chain RMSDs of NMR ensembles, side-chain RMSDs of MD ensembles, and B-factors. The mean Pearson correlation coefficients were found to be 0.82, 0.80, and 0.74, respectively. The side-chain RCI can be used to quantitatively estimate side-chain RMSDs of NMR and MD ensembles, fractional ASA (fASA), and B-factors using the following scaling relationships:

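$$\text{MD RMSD}_{SC} = \mathrm{RCI}_{SC} \times 9.0\ \text{Å} \tag{8}$$

$$\text{NMR RMSD}_{SC} = \mathrm{RCI}_{SC} \times 6.0\ \text{Å} \tag{9}$$

$$\mathrm{ASA}_{f} = \mathrm{RCI}_{SC} \times 1.5 \tag{10}$$

$$B_{SC} = \mathrm{RCI}_{SC} \times 1.0\ \text{Å}^2 + 1.0 \tag{11}$$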
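The scaling relationships in Eqs. (8) to (11) amount to a few multiplications, as the following Python sketch shows. The function and key names are invented for the example, and the rigid/flexible labels simply reuse the RCISC cut-offs (0.11 and 0.15) quoted in the caption of Fig. 3; they are not a published classification scheme.

```python
from typing import Dict


def side_chain_estimates(rci_sc: float) -> Dict[str, float]:
    """Apply the linear scalings of Eqs. (8)-(11) to one side-chain RCI value."""
    return {
        "md_rmsd_sc": rci_sc * 9.0,             # Eq. (8), angstroms
        "nmr_rmsd_sc": rci_sc * 6.0,            # Eq. (9), angstroms
        "fasa": rci_sc * 1.5,                   # Eq. (10), fractional ASA
        "b_sc_normalized": rci_sc * 1.0 + 1.0,  # Eq. (11), normalized B-factor
    }


def mobility_label(rci_sc: float, low: float = 0.11, high: float = 0.15) -> str:
    """Label a side-chain with the cut-offs used for colouring in Fig. 3."""
    if rci_sc < low:
        return "rigid"
    if rci_sc > high:
        return "flexible"
    return "intermediate"


estimates = side_chain_estimates(0.2)
print(estimates["fasa"], mobility_label(0.2))
```

Since every relationship is linear in RCISC, the four estimates are perfectly correlated with one another; they differ only in scale and offset.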

7. Calculating chemical shifts from 3D structures

While the calculation of protein structural and dynamic parameters from chemical shifts was a primary focus of our work in the 1990s, it soon became apparent that the reverse capability (calculating chemical shifts from structural parameters) was also needed. As a result, a major focus of our lab's research in the late 1990s and early 2000s shifted to protein chemical shift calculation. Our initial motivation for chemical shift calculation was to assist with chemical shift assignments (when the X-ray structure is known). It soon became apparent that it could be used not only to assist with protein NMR assignments [84,85], but also for the identification of erroneous chemical shift assignments [86], the correction of mis-referenced chemical shifts [87], protein fragment-based assembly [24], protein homology modelling [23], protein threading [88], and the refinement of protein structures [31,89,90]. There are three main approaches to calculating protein chemical shifts: (1) predicting chemical shifts from sequence homology, (2) calculating chemical shifts from protein coordinates, and (3) combinations of both sequence-based and structure-based approaches. Sequence-based methods rest on the simple observation that similar protein sequences share similar structures, which in turn share similar chemical shifts. Therefore, reasonable predictions of chemical shifts can be obtained by aligning the primary sequence of a query protein against a previously assigned protein with a sufficiently high level of sequence identity to the query protein. The Sykes and Wishart

where BSC is the average side-chain B-factor normalized by the smallest observed B-factor value (BMIN), using equation BSC = (B-BMIN) + 1. Absolute prediction errors for MD RMSDSC, NMR RMSDSC, ASAf, and the normalized B-factor were 0.46 Å, 0.40 Å, 0.09, and 0.07 Å2, respectively. The side-chain RCI program is available as a webserver and as a stand-alone Python program at http://www.randomcoilindex.ca. While being outside the “Canadian scope” of this review, recent developments in chemical-shift-biased metadynamics methods should be briefly mentioned in this chapter. In these methods, chemical shifts help to define the free-energy landscapes in molecular dynamics simulations. Specifically, the agreement between experimental and predicted chemical shifts is used to create collective chemical shift variables that guide molecular dynamics simulations into regions of conformational space with the best agreement between experimental and predicted chemical shifts. Recent publications from the 5

BBA - Proteins and Proteomics xxx (xxxx) xxx–xxx

M.V. Berjanskii, D.S. Wishart

Fig. 3. Localization of PyJ (PDB ID: 1FAF) side-chains with low (< 0.11, violet color) and high (> 0.15, red color) RCISC values. Rigid side-chains (low RCISC) are located primarily in the PyJ core (A), whereas flexible side-chains (high RCISC) are mostly solvent exposed (B). The figure was generated with MolMol [162]. The picture is adopted from [21].

groups at the University of Alberta began collaborating on testing and implementing this idea at the end of the 1990s [84,91,92]. This “classic” bioinformatics approach is the basis of the very first protein chemical shift prediction program and web server, called SHIFTY [91], which is available at http://shifty.wishartlab.com. SHIFTY uses the classic Needleman-Wunsch sequence alignment technique [93] to score and align similar sequences. Once the sequence alignment step is complete, a second step involving the transfer of chemical shifts from the database match to the query sequence is performed. Query residues that exactly match the database residues are given the same chemical shifts. Query residues that differ from the database residues are given adjusted chemical shifts based on the secondary shifts corresponding to the database residues and the random coil shifts of the query residue. Appropriate adjustments for nearest neighbor effects (especially for 15N shifts and pre-proline residues) are also made during the alignment and shift-transfer process [94]. Typically, if a query protein shares > 35% sequence identity with a previously assigned protein, it is possible to attain chemical shift predictions with correlation coefficients between observed and predicted values exceeding 0.94 for 1Hα shifts, 0.85 for 1HN shifts, 0.98 for 13Cα shifts, and 0.90 for 15N shifts. The original SHIFTY database consisted of 193 proteins, but it has subsequently been expanded to > 2000 proteins via the RefDB database [87]. A key advantage of sequence-based approaches is that, as the chemical shift database expands, the predictions will tend to improve because the odds of finding a sequence homologue increase. A key disadvantage of sequence-based methods is that no predictions can be made if no sequence homologue can be found. An additional weakness lies in the fact that sequence-based predictions cannot account for conformational differences, nor can they be used for chemical shift refinement. Nevertheless, sequence-based chemical shift predictions are starting to be used in a number of diverse applications [23,41,84] and other chemical shift calculators [95–97].

Methods for structure-based chemical shift predictions can be classified into four main categories: 1) semi-classical methods; 2) quantum mechanical methods; 3) empirical methods; and 4) blended methods. Semi-classical approaches employ simplified or empirical equations derived from classical physics and experimental data [98–100]. Quantum mechanical methods use first principles to calculate nuclear shieldings and then convert them to chemical shifts [101]. Empirical methods rely on using chemical shift “hypersurfaces” or related “structure/shift” tables [62,63,102,103]. Finally, blended methods combine elements of the first three approaches to achieve better accuracy and better speed [63,95,96,104–115]. Excellent reviews of all four methodologies can be found elsewhere [33]. Here we will focus on the hybrid method, called SHIFTX, that we began developing in the early 2000s. SHIFTX combines empirical hypersurfaces with semi-classical methods (for ring current, electric field, hydrogen bond, and solvent effects) to calculate protein chemical shifts. Many of SHIFTX's semi-classical formulas and parameters were taken from previously published methods or formulas [44,116,117]. SHIFTX chemical shift hypersurfaces for backbone 1H, 13C and 15N shifts were constructed using a database of 37 proteins that had both high-resolution (< 2.1 Å) X-ray structures and mostly complete, reference-corrected 1H, 13C and 15N assignments. A total of 400 different hypersurfaces were initially built, each of which related nucleus-specific secondary chemical shifts to


backbone ϕ, ψ, ω, χ1 angles, neighboring residues, local secondary structure, hydrogen bonding status and other parameters. The 20 most informative hypersurfaces were then selected for each kind of nucleus (1Hα, 1HN, 13Cα, 13Cβ, 13C′, 15N) and used for prediction purposes. A smaller protein database was used to generate corresponding hypersurfaces for side-chain 1H shifts. In SHIFTX, hypersurface-predicted shifts are combined with chemical shifts calculated via the semi-classical approaches to account for ring current, electric field, hydrogen bond, and solvent effects. These effects are not easily captured by the hypersurface methods. Because many protein coordinate files lack explicit hydrogens, SHIFTX also calculates the position of protons using the program called REDUCE [118]. The net result is that SHIFTX is able to calculate backbone 1H, 13C and 15N shifts as well as side-chain 1H shifts using any standard PDB coordinate file as input. Because of its extensive use of pre-calculated empirical hypersurfaces and rapidly calculated semi-classical formulas, SHIFTX has proven to be very quick and quite accurate. In fact, at the time of its publication in 2002, SHIFTX was the most accurate protein chemical shift predictor and remained so for a number of years. SHIFTX is available both as a standalone program and a web server at http://shiftx.wishartlab.com. SHIFTX was further enhanced and updated in 2011 [97]. The second version of SHIFTX (SHIFTX2) used a much larger database of training proteins (> 160), included more parameters (χ2 and χ3 angles, solvent accessibility, H-bond geometry, pH, temperature), and employed machine learning methods to develop better parametric equations. It also integrated SHIFTY [91] and RefDB [87] to support sequence-based shift prediction. SHIFTX2 also supports the prediction of all main-chain and side-chain 1H, 13C and 15N shifts. 
SHIFTX2 achieves correlations between predicted and observed chemical shifts of 0.97 for 1Hα shifts, 0.97 for 1HN shifts, 0.99 for 13Cα shifts, 0.99 for 13Cβ shifts, 0.97 for 13C′ shifts, and 0.98 for 15N shifts. Extensive comparisons between SHIFTX2 and other programs such as SPARTA [95], SHIFTS [104], and CAMSHIFT [105] have shown that SHIFTX2 is one of the most comprehensive and most accurate chemical shift predictors developed to date. SHIFTX2 is available both as a web server and as a stand-alone program at http://www.shiftx2.ca.
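The SHIFTY shift-transfer rule described earlier in this section (identical residues inherit the database shift; differing residues receive the query residue's random coil shift plus the database residue's secondary shift) can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the SHIFTY code: the function names and the alignment representation are hypothetical, the nearest-neighbor corrections [94] are omitted, and the random coil values in the usage example are placeholders that should be replaced by a published random coil table.

```python
from typing import Dict, List, Optional, Tuple


def transfer_shift(query_res: str, db_res: str, db_shift: float,
                   random_coil: Dict[str, float]) -> float:
    """Transfer one shift from an aligned database residue to a query residue."""
    if query_res == db_res:
        # Identical residues: copy the database shift unchanged.
        return db_shift
    # Different residues: keep the database residue's secondary shift and
    # add it to the random coil shift of the query residue.
    secondary_shift = db_shift - random_coil[db_res]
    return random_coil[query_res] + secondary_shift


def transfer_aligned(alignment: List[Tuple[str, str, Optional[float]]],
                     random_coil: Dict[str, float]) -> List[Optional[float]]:
    """Apply the transfer rule along an alignment.

    Each tuple is (query residue, database residue, database shift);
    '-' marks a gap and None marks a missing assignment.
    """
    predicted: List[Optional[float]] = []
    for query_res, db_res, db_shift in alignment:
        if query_res == "-":
            continue  # gap in the query: there is no residue to predict
        if db_res == "-" or db_shift is None:
            predicted.append(None)  # nothing to transfer at this position
        else:
            predicted.append(transfer_shift(query_res, db_res, db_shift, random_coil))
    return predicted


if __name__ == "__main__":
    # Placeholder 13C-alpha random coil values (ppm), for illustration only.
    rc_ca = {"A": 52.5, "S": 58.3, "G": 45.1}
    aln = [("A", "A", 54.9), ("S", "A", 54.9), ("G", "-", None)]
    print(transfer_aligned(aln, rc_ca))
```

Passing the random coil table as an argument keeps the sketch independent of any particular reference set, which matters because the predicted value for a mismatched residue depends directly on the table chosen.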

8. Structure refinement with chemical shifts

Since the early 1980s, standard approaches for protein structure determination and refinement have depended almost exclusively on NOE-based techniques. NOE-based methods work remarkably well for the typical targets of protein NMR (proteins with MW < 15 kDa) [119]. Most NOEs can be easily fit to a 1/r⁶ distance function and converted into easy-to-use unambiguous distance restraints. However, NOE-based methods are not without their problems. NOEs are often difficult to measure for larger (> 15 kDa) proteins or proteins with flexible regions. Furthermore, NOE-derived distance restraints, even when combined with other NMR restraints, can still lead to significant errors in protein 3D models [120,121]. As judged by the Resolution-by-Proxy method [122], the quality of typical NOE-based protein structures is consistently inferior to X-ray based structures of the same protein. Furthermore, to obtain distance restraints from NOEs, time-consuming and error-prone analysis of multiple NOESY spectra is normally required.

To address the shortcomings of NOE-based methods, researchers have been exploring various ways to incorporate chemical shift data into the process of protein structure determination and refinement. The most popular options currently include employing shift-based torsion angles [11,17,18,64,65] as structural restraints and shift-based secondary structure [1,2] to provide approximate torsion angle or hydrogen bond constraints of regular secondary structure elements. Several attempts to include direct refinement against chemical shifts in protein structure refinement were made in the 1990s, with modest success. In particular, empirical shift calculators were implemented in AMBER [123] and XPLOR [124]. Small structural improvements were observed in regions in close spatial proximity to aromatic rings after a refinement of the carbonmonoxy myoglobin structure with proton chemical shifts in AMBER [89]. A year later, improved agreements between experimental and back-calculated chemical shifts were reported for human thioredoxin after proton and carbon chemical shift refinements in XPLOR [125,126]. However, no increases in 3D model accuracy were reported in either of these studies. Due to the lack of significant improvements in model quality upon chemical shift refinement in these early studies, work on protein chemical shift refinement largely stopped for the next 15 years. It wasn't until the CS-MD method was introduced in 2010 that the efficacy of chemical shift refinement was convincingly demonstrated [30]. CS-MD combines a conventional molecular dynamics force field with chemical shifts predicted by the CamShift program [105]. When chemical shift data were included in the refinement process, improvements in accuracy of distorted models as large as 5 Å were observed. However, CS-MD requires substantial computational resources and so it has only seen limited use beyond these initial demonstrations.

Inspired by the work of Vendruscolo and colleagues, we began to investigate how we could use SHIFTX and modified molecular dynamics techniques to facilitate protein structure refinement. We developed a new algorithm, called CS-GAMDy (Chemical Shift driven Genetic Algorithm for Molecular Dynamics) [31], that combined knowledge-based chemical shift and energy potentials with an integrated multi-objective MD biasing algorithm and a conventional genetic algorithm to perform model optimization (Fig. 4). The molecular dynamics in CS-GAMDy is conducted with XPLOR-NIH [124] and, thus, can be adapted to a wide range of restraints and refinement techniques. CS-GAMDy also takes advantage of various knowledge-based scoring functions, such as GOAP [127], RW [128], and GeNMR [41]. Additionally, CS-GAMDy incorporates the side-chain RCI [21] to improve the agreement between model accessible surface area (ASA) and the ASA derived from chemical shifts [21]. When tested on a set of distorted models (Table 1), CS-GAMDy demonstrated significantly better reductions in coordinate errors (up to 9 Å) than CS-MD or XPLOR-NIH. When tested on homology models, CS-GAMDy was able to improve their accuracy by as much as 3 Å [31]. CS-GAMDy is not without some limitations. The time and the amount of computing resources that CS-GAMDy needs to refine a protein model can be quite large. A successful CS-GAMDy job may require running 100–200 separate simulations requiring 3–7 days of compute time. Not every NMR group has access to these kinds of computing resources or to expertise in high-performance computing. Efforts are currently underway to create a publicly available CS-GAMDy web server and to reduce the running time from several days to several hours.

9. Chemical shift-driven 3D protein structure generation

From the earliest days of our work on protein chemical shifts, the primary goal has been to develop a chemical-shift-only technique for determining the 3D structures of proteins. The first successful demonstration that the structure of a protein could be determined using only chemical shift data was published by Wishart and Case in 2001 [45]. This novel approach used a technique known as chemical shift threading. In protein threading, templates are scored not only by sequence similarity but also by structural and/or physicochemical similarity, such as similarity of secondary structures, accessible surface area, residue hydrophobicity or even chemical shifts. The work described in [45] involved creating a database of chemical shifts predicted from known PDB structures and then matching the sequence and chemical shifts of the query protein to sequences and chemical shifts of proteins in the database. We demonstrated that this technique could produce a model of ubiquitin with good accuracy (RMSD < 1.5 Å) using only the protein sequence and the corresponding 1H and 13C chemical shifts.
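The idea of scoring a template by both sequence and chemical shift agreement can be illustrated with a toy scoring function. This is only a sketch under stated assumptions and not the method of [45]: the query and template are assumed to be already aligned position by position, and the function name, the weights `w_seq` and `w_shift`, and the tolerance `shift_scale` are hypothetical choices made for the example.

```python
import math
from typing import Optional, Sequence


def threading_score(query_seq: str, template_seq: str,
                    query_shifts: Sequence[Optional[float]],
                    template_shifts: Sequence[Optional[float]],
                    w_seq: float = 1.0, w_shift: float = 1.0,
                    shift_scale: float = 1.0) -> float:
    """Average per-position score combining residue identity and shift agreement."""
    n = len(query_seq)
    if not (n == len(template_seq) == len(query_shifts) == len(template_shifts)):
        raise ValueError("all inputs must describe the same aligned positions")
    if n == 0:
        return 0.0
    total = 0.0
    for q_res, t_res, q_cs, t_cs in zip(query_seq, template_seq,
                                        query_shifts, template_shifts):
        if q_res == t_res:
            total += w_seq  # sequence term: reward identical residues
        if q_cs is not None and t_cs is not None:
            # Shift term: 1.0 for identical shifts, decaying with the difference.
            total += w_shift * math.exp(-abs(q_cs - t_cs) / shift_scale)
    return total / n


if __name__ == "__main__":
    query = ("MQI", [55.1, 56.0, 60.2])
    templates = {
        "close": ("MQI", [55.0, 56.2, 60.0]),
        "distant": ("AGS", [52.5, 45.1, 58.3]),
    }
    ranked = sorted(templates,
                    key=lambda name: threading_score(query[0], templates[name][0],
                                                     query[1], templates[name][1]),
                    reverse=True)
    print(ranked)
```

In a real threading program the alignment itself is optimized with gaps, so a per-position score of this kind would sit inside a dynamic programming alignment rather than being applied to a fixed pairing.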


Fig. 4. CS-GAMDy protocol. (A) Main elements of the CS-GAMDy framework. XPLOR-NIH molecular dynamics and its force-field energies (i.e. covalent structure energies, knowledge-based energies, and experimental restraint energies) are at the core of the CS-GAMDy method. MD biasing takes advantage of fast scoring functions (i.e. GeNMR, RCI-ASA score) to rank independent MD runs that start from the same model but with different velocities and several other MD parameters. The genetic algorithm utilizes slower scoring functions to manage (i.e. rank and update) a population of biased MD models. (B) Genetic algorithm. Each iteration of the CS-GAMDy genetic algorithm (GA) evolves a population of 10 biased MD models. Each biased MD model is created by 20 steps of MD biasing. When the biased MD runs are done, their final models are evaluated and ranked by a GA scoring function, and the 20% of the population with the worst scores is replaced by the best-scoring model. The picture is adapted from [31].
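The population update described in panel (B) of Fig. 4 (rank the models, then replace the worst 20% with the best-scoring model) can be sketched as follows. This is a minimal illustration, not the CS-GAMDy implementation: the function name is hypothetical, the models are represented by arbitrary Python objects, and a lower score is assumed to be better.

```python
from typing import Callable, List, TypeVar

Model = TypeVar("Model")


def ga_replacement_step(population: List[Model],
                        score: Callable[[Model], float],
                        worst_fraction: float = 0.2) -> List[Model]:
    """Rank a population and overwrite its worst members with the best one.

    Assumes that a lower score means a better model.
    """
    if not population:
        raise ValueError("population must not be empty")
    if not 0.0 <= worst_fraction < 1.0:
        raise ValueError("worst_fraction must be in [0, 1)")
    ranked = sorted(population, key=score)            # best model first
    n_replace = int(len(ranked) * worst_fraction)     # e.g. 2 of 10 models
    best = ranked[0]
    survivors = ranked[:len(ranked) - n_replace]
    # In a real refinement each copy of `best` would be an independent model
    # that is restarted with different MD parameters in the next iteration.
    return survivors + [best] * n_replace


if __name__ == "__main__":
    # Toy population: each "model" is represented only by its score.
    scores = [5.0, 1.0, 4.0, 2.0, 9.0, 3.0, 8.0, 6.0, 7.0, 10.0]
    print(ga_replacement_step(scores, score=lambda s: s))
```

With a population of 10 and a 20% replacement fraction, the two worst models are dropped and the population size stays constant from one iteration to the next.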

Table 1. Model accuracy of the distorted protein models after refinement by CS-MD, XPLOR, and CS-GAMDy. Model accuracy is given as the backbone RMSD to the reference model (Å).

| Protein name | PDB ID | Initial model | Refined by CS-MD | Refined by XPLOR with NMR data | Refined by CS-GAMDy without NMR data | Refined by CS-GAMDy with NMR data |
|---|---|---|---|---|---|---|
| PyJ | 1FAF | 6.35 | 2.02 | 8.97 | 3.74 | 1.81 |
| Ubiquitin | 1UBQ | 3.57 | 1.92 | 2.33 | 0.64 | 0.49 |
| GB3 | 1P7E | 3.47 | 0.84 | 2.82 | 0.88 | 0.89 |
| Q5E7H1 | 2JVW | 6.40 | 1.11 | 14.19 | 2.3 | 1.11 |
| RPA3401 | 2JTV | 3.20 | 1.30 | 2.87 | 1.13 | 0.74 |
| RHOS4_26430 | 2JVM | 4.60 | 1.51 | 4.38 | 2.15 | 1.0 |
| Protein LX | 2JXT | 3.33 | 1.59 | 6.24 | 3.01 | 0.75 |
| PefI | 2JT1 | 6.75 | 1.67 | 14.89 | 1.06 | 1.71 |
| tRNA hydrolase domain | 2JVA | 5.27 | 1.88 | 5.23 | 3.35 | 1.85 |
| CSPA | 1MJC | 7.07 | 2.08 | 8.04 | 2.3 | 1.56 |
| Calbindin D9K | 3ICB | 4.81 | 2.15 | 4.68 | 3.26 | 1.17 |
| NE1242 | 2JV8 | 3.63 | 2.48 | 6.37 | 2.38 | 1.18 |
| Average | | 4.9 | 1.7 | 6.8 | 2.2 | 1.2 |

CS-MD performance data was taken from the work of Paul Robustelli and co-workers [30]. CS-GAMDy performance is taken from [31].


These initial successes in protein structural modelling via chemical shift data energized efforts to further develop this approach and to extend it to fragment-based methods of protein structure determination. While varying in details, fragment-based programs have three main steps where chemical shift data can be used: 1) fragment selection, 2) conformational sampling, and 3) final model selection [40]. In 2007, successful fragment-based modelling of protein structure from chemical shifts was reported by both the Vendruscolo group (the CHESHIRE algorithm) [22] and the Rose group [129]. In 2008, the Rosetta program for fragment-based protein folding [40] was further extended to include chemical shifts during the fragment selection and final model selection steps [24]. The new algorithm was named CS-Rosetta to distinguish it from the more general Rosetta program. As will be described later in this section, CS-Rosetta has become the method of choice in studying transient protein states by the Kay group at the University of Toronto. In 2008, our group extended the chemical shift threading concept further by combining shift-driven threading with shift-driven homology and fragment-based assembly [40] in a program called CS23D (Chemical Shift to 3D structure) [23]. CS23D uses CSI [2] and PREDITOR [17] to estimate secondary structure and torsion angles from submitted chemical shifts and then ranks the templates based on the agreement between shift-based and template-based values of these parameters. CS23D also employs SHIFTX [63] to predict backbone chemical shifts from protein coordinates for shift-based refinement of the resulting 3D structures. If a target protein has a high degree of sequence identity and secondary structure similarity to a protein already in the PDB, CS23D functions as a homology modelling server that uses shift-based template ranking and shift-based refinement. In this mode, CS23D typically outputs a 3D model in as little as 10 CPU minutes. 
On the other hand, if a query protein has very low sequence identity and poor secondary structure conservation with respect to proteins in the PDB, CS23D uses fragment-based assembly in Rosetta [40] to generate an initial model. Then, CS23D optimizes the model with extended chemical shift refinement. Between these two extreme scenarios, CS23D employs a method called THRIFTY (THReading with shIFTY) that uses a shift-based torsion angle alphabet to identify templates for comparative modelling. The accuracy of CS23D is comparable [33] to the accuracy of 3D protein structures produced by other shift-based structure determination programs, such as CS-Rosetta [24] and CHESHIRE [22]. However, the main advantage of CS23D is its speed. CS23D is available as a webserver at http://www.cs23d.ca. Like any program, CS23D has its limitations. Specifically, CS23D does not accept NOE-derived distance restraints (or any other types of NMR-based restraints). In 2009, we added the ability to use distance restraints in the CS23D-like algorithm, which was implemented in a separate web server called GeNMR [41]. GeNMR is available at http://www.genmr.ca. GeNMR supports protein structure generation from NOE-based distance restraints alone, chemical shifts alone, as well as distance restraints plus chemical shifts. Another area where CS23D performance was found to be weak was in the efficacy and efficiency of its threading algorithm (THRIFTY). To further enhance THRIFTY, an improved version called E-Thrifty (Enhanced-Thrifty) was recently developed by our group [130]. E-Thrifty takes advantage of the significant improvements arising from CSI 2.0 [15], as well as CSI 3.0 for supersecondary and structural motif identification [14]. It also employs SHIFTX2 [97] to calculate chemical shifts from protein coordinates, and the latest version of TALOS (i.e. TALOS-N) [65] to predict torsion angles from chemical shifts. 
E-Thrifty combines all these methods to assess chemical shift derived structural similarity between any given query protein and all the proteins in the PDB. E-Thrifty employs a Smith-Waterman local alignment algorithm with a variable gap penalty function to assess the sequence, chemical shift and torsion angle similarity or fitness [131]. Templates with the best fitness scores are then further clustered based on their structural similarity. Representatives of the best clusters can be used for building 3D models of the query

Table 2. Template recognition performance of three threading programs using a sequence identity cutoff of ≤30%. The E-Thrifty column shows the TM-score of the top template identified by E-Thrifty, whereas the next two columns show the TM-scores of the top templates identified by POMONA and PSI-BLAST. The TM-score reports the structural similarity of protein models, with 0 indicating no structural similarity and 1 indicating identical structures [133]. Table is adapted from [130].

| Query protein name | PDB/BMRB ID | Length/(fold) | E-Thrifty TM-score | POMONA TM-score | PSI-BLAST TM-score |
|---|---|---|---|---|---|
| KaiA | 1M2F/5031 | 135/(α/β) | 0.67 | 0.34 | 0.37 |
| NEDTH | 1F3Y/4448 | 165/(α/β) | 0.66 | 0.62 | 0.67 |
| NCS-1 | 2LCP/4378 | 190/(α) | 0.60 | 0.37 | 0.57 |
| Sortase | 1IJA/4879 | 148/(β) | 0.70 | 0.68 | 0.66 |
| PyJ | 1FAF/4403 | 79/(α) | 0.67 | 0.38 | 0.36 |
| ERp18 | 2K8V/15964 | 157/(α/β) | 0.55 | 0.51 | 0.44 |
| ApolPBP1A | 2JPO/15256 | 142/(α) | 0.63 | 0.29 | 0.63 |
| Pru Av 1 | 1E09/4671 | 159/(α/β) | 0.84 | 0.83 | 0.83 |
| Ets-1 | 2JV3/4205 | 110/(α) | 0.61 | 0.28 | 0.58 |
| cg2496 | 2KPT/16569 | 148/(α/β) | 0.63 | 0.64 | 0.0 |
| NCAM | 1EPFA/4162 | 191/(β) | 0.66 | 0.79 | 0.27 |
| PG | 2HZE/4113 | 108/(α/β) | 0.77 | 0.78 | 0.76 |
| AT5g22580 | 1RJJ/6011 | 111/(α/β) | 0.72 | 0.72 | 0.69 |
| N-WASP | 1MKE/5554 | 144/(α/β) | 0.64 | 0.62 | 0.57 |
| Grx2 | 1G7O/4318 | 215/(α) | 0.63 | 0.63 | 0.57 |
| Average TM-score | | | 0.66 | 0.57 | 0.53 |
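For readers unfamiliar with the TM-score used in Table 2, the following Python sketch shows how it is computed for one fixed superposition from the distances between aligned Cα pairs. It is an illustration under stated assumptions: the function names are hypothetical, the length-dependent scale d0 follows the commonly quoted form 1.24(L − 15)^(1/3) − 1.8 Å, and the 0.5 Å floor for very short proteins is an assumption borrowed from common implementations. The actual TM-score [133] is the maximum of this quantity over all superpositions, which is not attempted here.

```python
from typing import Sequence


def tm_d0(target_length: int) -> float:
    """Length-dependent distance scale d0 (Å) used to normalize the TM-score."""
    if target_length <= 21:
        return 0.5  # assumed floor for very short proteins
    return 1.24 * (target_length - 15) ** (1.0 / 3.0) - 1.8


def tm_score(distances: Sequence[float], target_length: int) -> float:
    """TM-score for a single, fixed superposition.

    distances: Cα-Cα distances (Å) of the aligned residue pairs.
    target_length: number of residues in the target (query) protein.
    """
    if target_length <= 0:
        raise ValueError("target_length must be positive")
    if len(distances) > target_length:
        raise ValueError("more aligned pairs than target residues")
    d0 = tm_d0(target_length)
    # Each aligned pair contributes between 0 and 1; unaligned residues add 0.
    return sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances) / target_length


if __name__ == "__main__":
    # Toy example: 100-residue target, 80 aligned pairs at 2 Å.
    print(round(tm_score([2.0] * 80, 100), 3))
```

Dividing by the target length rather than by the number of aligned pairs is what penalizes templates that cover only part of the query protein.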

protein with programs such as MODELLER [132] and then refined with CS-GAMDy or other chemical shift driven minimizers. When E-Thrifty performance is compared with the performance of other state-of-the-art threading programs (Table 2), such as PSI-BLAST and another chemical shift structure generation method, POMONA [26], E-Thrifty demonstrates 10–20% improvements in accuracy, as evident from the TM-scores [133] of protein 3D models. E-Thrifty is available as a webserver at http://www.ethrifty.ca. Chemical shift data can be utilized not only to determine structures of abundant, ground-state, long-lived protein conformations but also to gain insights into rare, short-lived or transient conformational states of proteins. These transient folding intermediates or weakly populated protein folding states are also called “excited” states. Chemical shifts of excited conformational states can be obtained via several NMR relaxation techniques, such as chemical-exchange saturation transfer [134,135], Carr-Purcell-Meiboom-Gill relaxation dispersion [136,137], and, in favorable cases, rotating frame relaxation rate measurements [136]. Lewis Kay's group at the University of Toronto has pioneered the application of chemical shifts (using the aforementioned techniques) in modelling the 3D structures of transient protein folding states [138]. In 2010, the Kay group determined the excited-state structure of the FF domain from HYPA/FBP11 using backbone 15N, 13C, and 1H chemical shifts (and amide bond-vector orientations) as the only source of experimental information. A year later, the Kay group determined the excited-state structure of a T4 lysozyme mutant using backbone chemical shifts and a specially developed CS-Rosetta refinement protocol [139]. In 2012, the excited-state structure of a mutant Fyn SH3 domain [140] was determined from chemical shifts using the CS-MD program [30]. 
The shift-derived structures of these weakly populated intermediate states have helped improve our understanding of transient structures, such as their rate-limiting effect on protein folding (via the FF domain), their influence on protein aggregation (via the Fyn SH3 domain), or their utility for engineering ground-state conformations with novel properties (T4 lysozyme). To gain community-wide acceptance, any new method needs to be tested thoroughly and independently by the scientific community. This has led to the development of CASP (Critical Assessment of Structure Prediction) and CASD (Critical Assessment of Structure Determination) “competitions”. The performance of CHESHIRE and CS-Rosetta has been extensively evaluated by several


CASD-NMR assessments [141,142]. These CASD-NMR competitions evaluate both NOE-based methods and chemical-shift-only methods of protein structure generation and have highlighted the importance of developing objective methods for judging the accuracy of structures determined via chemical shifts. With the rapid increase in chemical shift-only techniques, such as the aforementioned CHESHIRE and CS-Rosetta, as well as CS-MD, CS23D, CS-GAMDy, POMONA, E-Thrifty, and CS-TORUS [28], this group of methods probably deserves its own CASD-like competition.

10. Chemical shift-driven modelling of intrinsically disordered proteins

With the advent of greatly improved X-ray methods and the appearance of ultra-high resolution cryo-EM techniques, NMR faces increasing competition from these techniques in characterizing folded proteins. On the other hand, intrinsically disordered proteins (IDPs) are not amenable to study using X-ray crystallography or cryo-EM. In fact, NMR appears to be the only high-resolution technique that can be used to gain insights into conformational properties of these remarkably common and biologically important proteins. The fact that NMR is uniquely positioned to help in the characterization of IDPs has led to a substantial refocusing of efforts by the protein NMR community. Unlike folded proteins, IDPs lack the abundant NOEs typically used to determine and refine structures by conventional NMR techniques. Chemical shifts are among the few available NMR-derived parameters that can be used to gain some structural insights into IDPs. This realization has led to the development of several methods to assess or identify IDP secondary structure propensities from chemical shifts, such as δ2D [13] and ncSPC [143]. It has also led to the development of shift-driven predictions of torsion angle distributions [144,145] and the development of tools to generate ensembles of disordered proteins from chemical shifts and other NMR data [146–148]. One of the most common and successful approaches described to date combines the ASTEROIDS algorithm [149] with the Flexible-Meccano program [150]. Several specific sets of random coil chemical shifts [151–153] have been published to make the aforementioned methods more accurate. For more details about the progress in this field worldwide, we would like to refer the reader to several recent reviews [154–156].

In Canada, the efforts to decipher conformations of disordered proteins from chemical shifts have largely been pioneered by the Forman-Kay lab at the University of Toronto. Key to this work was the development of a method to calculate the residue-specific secondary structure propensity (SSP) of disordered proteins from chemical shifts. Their method, called SSP [7], combines six secondary chemical shifts (13Cα, 13Cβ, 13C′, 1HN, 1Hα, 15N) to calculate an SSP score that represents the expected fraction of α-helices or β-strands along the length of the IDP of interest. The SSP score is an average of the six secondary chemical shifts, weighted according to their sensitivity to different types of protein secondary structure. In addition, the SSP program re-references carbon chemical shifts and performs a 5-residue weighted averaging of combined secondary chemical shifts to improve program accuracy. The development of the SSP method in 2006 allowed the Forman-Kay group to successfully assess and compare secondary structure propensities of α- and γ-synuclein and to suggest that an increase in α-helical propensity in the amyloid-forming region of synuclein in various species correlates with a reduced fibrillation propensity. The SSP program can be downloaded at http://pound.med.utoronto.ca/~JFKlab. The Forman-Kay group was also the first to develop and successfully apply a chemical shift-driven program, called ENSEMBLE [157], to generate ensembles of intrinsically disordered proteins. ENSEMBLE accepts chemical shifts as well as other NMR data (NOE-based distance restraints, J-coupling constants, translational diffusion coefficients, radius of gyration) and fluorescence data. The program starts by generating tens of thousands of models using the TRADES program (TRajectory Directed Ensemble Sampling) [158]. The grand ensemble is then split into smaller subsets. Population weights of individual models in those smaller ensembles are optimized to maximize the agreement between the experimental data and the ensemble averages. The program produces a final ensemble that consists of a much smaller number (< 60) of essential conformers with significant population weights. The ENSEMBLE program has been successfully applied to study the structural properties of a highly unstable drkN SH3 domain [157,159] as well as several IDPs, including intrinsically disordered protein phosphatase 1 regulators [160] and intrinsically disordered Sic1 [161]. ENSEMBLE is available at http://pound.med.utoronto.ca/~JFKlab.

11. Conclusion

Thanks to > 25 years of work by a small number of NMR labs from around the world and across Canada, protein chemical shifts now provide NMR spectroscopists and structural biologists with a wealth of information about protein structure, dynamics, and function [33]. Finding the methods to extract this information has not been easy. Protein NMR shifts are surprisingly complex and are heavily influenced by many subtle effects arising from difficult-to-understand non-covalent forces and dynamics. Progress in this field has required not only the development of large databases of protein chemical shift data and protein coordinate data but also the development of newer or better methods for molecular dynamics simulation, pattern recognition, feature finding and data processing. As a result, many of today's techniques used to interpret protein chemical shifts require sophisticated machine learning tools, large computers, and complex computer programs. Chemical shift analysis can no longer be done with a pencil, a calculator or a spreadsheet. However, thanks to the internet, many (if not all) of these chemical shift interpretation tools are now easily available as webservers or downloadable programs. 
In particular, over the past 25 years, our group and our Canadian colleagues have made the following chemical shift-based tools available on the web: (1) CSI and CSI 2.0 for the rapid and accurate identification of α-helices, β-sheets and random coil regions; (2) CSI 3.0 for rapid identification of conventional secondary structures as well as supersecondary structures such as β-turns, β-hairpins, external and internal β-strands; (3) PREDITOR for accurately predicting ϕ, ψ, ω, and χ1 torsion angles; (4) RCI for estimating parameters of backbone and side-chain flexibility, such as residue-specific RMSD of NMR and MD ensembles, backbone order parameters, and B-factors; (5) ShiftASA for estimating residue-specific fractional accessible surface area; (6) SHIFTY, SHIFTX, and SHIFTX2 for accurately and rapidly predicting chemical shifts from sequence and structure, respectively; (7) CS-GAMDy for robust and accurate chemical shift refinement; (8) CS23D and GeNMR for rapid chemical-shift based protein structure determination; (9) E-Thrifty for rapid and robust threading via chemical shifts; (10) SSP for calculating secondary structure propensities of disordered proteins; and (11) ENSEMBLE for calculating ensembles of disordered proteins from chemical shifts and other biophysical data. We would like to stress that chemical shift analysis techniques are highly interconnected. A breakthrough in one area often stimulates progress in several others. For example, the development of the CSI method (in 1992) for secondary structure predictions from chemical shifts was a prerequisite for development of the backbone RCI method in 2005 to assess protein flexibility from chemical shifts. The backbone RCI method stimulated the development of SSP by the Forman-Kay group in 2007, which then led to the development of δ2D by the Vendruscolo group in 2012. 
Likewise, without the development of the backbone RCI method, the development of the side-chain RCI in 2013 would not have been possible and, without the side-chain RCI, the development of ShiftASA in 2015 for predicting accessible surface area from chemical shifts would not have been possible. In turn, ShiftASA and RCI predictions became important features in developing the next generations of CSI: CSI 2.0 (in 2014) and CSI 3.0 (in 2015). Likewise, the wealth of structural information from the CSI 3.0 and ShiftASA methods triggered the development of the E-Thrifty method for comparative modelling with chemical shifts. Finally, the progress in techniques for predicting chemical shifts from structure, such as SHIFTX, SPARTA, and CAMSHIFT, along with methods for predicting torsion angles from chemical shifts, such as TALOS, PREDITOR, and DANGLE, led to the development of programs for shift-based structure generation and refinement, such as CHESHIRE, CS-Rosetta, CS23D, ENSEMBLE, CS-MD, and CS-GAMDy, as well as more recent chemical-shift-biased metadynamics methods.

While the focus of this review has been on tools and techniques developed in our group and in the laboratories of our colleagues across Canada, it is important to emphasize that there are many other excellent chemical shift software tools and databases that have been developed around the world. Some of these have been briefly mentioned or highlighted, but space does not allow us to cover all of the tools or techniques available in the field. We encourage interested readers to explore these resources and to make use of the abundance of software, databases, and web servers that now make protein chemical shift analysis an incredibly useful and powerful approach for NMR spectroscopists and structural biologists.

Transparency Document

The Transparency document (http://dx.doi.org/10.1016/j.bbapap.2017.07.005) associated with this article can be found in the online version.

Acknowledgements

Financial support from the Natural Sciences and Engineering Research Council of Canada (NSERC), the Alberta Prion Research Institute (APRI) and PrioNet is gratefully acknowledged.

References

[1] D.S. Wishart, B.D. Sykes, F.M. Richards, The chemical shift index: a fast and simple method for the assignment of protein secondary structure through NMR spectroscopy, Biochemistry 31 (1992) 1647–1651.
[2] D.S. Wishart, B.D. Sykes, The 13C chemical-shift index: a simple method for the identification of protein secondary structure using 13C chemical-shift data, J. Biomol. NMR 4 (1994) 171–180.
[3] Y. Wang, O. Jardetzky, Probability-based protein secondary structure identification using combined NMR chemical-shift data, Protein Sci. 11 (2002) 852–861.
[4] L.H. Hung, R. Samudrala, Accurate and automated classification of protein secondary structure with PsiCSI, Protein Sci. 12 (2003) 288–295.
[5] D. Labudde, D. Leitner, M. Kruger, H. Oschkinat, Prediction algorithm for amino acid types with their secondary structure in proteins (PLATON) using chemical shifts, J. Biomol. NMR 25 (2003) 41–53.
[6] H.R. Eghbalnia, L. Wang, A. Bahrami, A. Assadi, J.L. Markley, Protein energetic conformational analysis from NMR chemical shifts (PECAN) and its use in determining secondary structural elements, J. Biomol. NMR 32 (2005) 71–81.
[7] J.A. Marsh, V.K. Singh, Z. Jia, J.D. Forman-Kay, Sensitivity of secondary structure propensities to sequence differences between alpha- and gamma-synuclein: implications for fibrillation, Protein Sci. 15 (2006) 2795–2804.
[8] C.C. Wang, J.H. Chen, W.C. Lai, W.J. Chuang, 2DCSi: identification of protein secondary structure and redox state using 2D cluster analysis of NMR chemical shifts, J. Biomol. NMR 38 (2007) 57–63.
[9] M. Swain, H.S. Atreya, CSSI-PRO: a method for secondary structure type editing, assignment and estimation in proteins using linear combination of backbone chemical shifts, J. Biomol. NMR 44 (2009) 185–194.
[10] Y. Zhao, B. Alipanahi, S.C. Li, M. Li, Protein secondary structure prediction using NMR chemical shift data, J. Bioinforma. Comput. Biol. 8 (2010) 867–884.
[11] M.S. Cheung, M.L. Maguire, T.J. Stevens, R.W. Broadhurst, DANGLE: a Bayesian inferential method for predicting protein backbone dihedral angles and secondary structure, J. Magn. Reson. 202 (2010) 223–233.
[12] Y. Shen, A. Bax, Identification of helix capping and β-turn motifs from NMR chemical shifts, J. Biomol. NMR 52 (2012) 211–232.
[13] C. Camilloni, A. De Simone, W.F. Vranken, M. Vendruscolo, Determination of secondary structure populations in disordered states of proteins using nuclear magnetic resonance chemical shifts, Biochemistry 51 (2012) 2224–2231.
[14] N.E. Hafsa, D. Arndt, D.S. Wishart, CSI 3.0: a web server for identifying secondary and super-secondary structure in proteins using NMR chemical shifts, Nucleic Acids Res. 43 (2015) W370–377.
[15] N.E. Hafsa, D.S. Wishart, CSI 2.0: a significantly improved version of the chemical shift index, J. Biomol. NMR 60 (2014) 131–146.
[16] Y. Shen, F. Delaglio, G. Cornilescu, A. Bax, TALOS+: a hybrid method for predicting protein backbone torsion angles from NMR chemical shifts, J. Biomol. NMR 44 (2009) 213–223.
[17] M.V. Berjanskii, S. Neal, D.S. Wishart, PREDITOR: a web server for predicting protein torsion angle restraints, Nucleic Acids Res. 34 (2006) W63–69.
[18] Y. Shen, A. Bax, Protein backbone and sidechain torsion angles predicted from NMR chemical shifts using artificial neural networks, J. Biomol. NMR 56 (2013) 227–241.
[19] N.E. Hafsa, D. Arndt, D.S. Wishart, Accessible surface area from NMR chemical shifts, J. Biomol. NMR 62 (2015) 387–401.
[20] M.V. Berjanskii, D.S. Wishart, Application of the random coil index to studying protein flexibility, J. Biomol. NMR 40 (2008) 31–48.
[21] M.V. Berjanskii, D.S. Wishart, A simple method to measure protein side-chain mobility using NMR chemical shifts, J. Am. Chem. Soc. 135 (2013) 14536–14539.
[22] A. Cavalli, X. Salvatella, C.M. Dobson, M. Vendruscolo, Protein structure determination from NMR chemical shifts, Proc. Natl. Acad. Sci. U. S. A. 104 (2007) 9615–9620.
[23] D.S. Wishart, D. Arndt, M. Berjanskii, P. Tang, J. Zhou, G. Lin, CS23D: a web server for rapid protein structure generation using NMR chemical shifts and sequence data, Nucleic Acids Res. 36 (2008) W496–502.
[24] Y. Shen, O. Lange, F. Delaglio, P. Rossi, J.M. Aramini, G. Liu, A. Eletsky, Y. Wu, K.K. Singarapu, A. Lemak, A. Ignatchenko, C.H. Arrowsmith, T. Szyperski, G.T. Montelione, D. Baker, A. Bax, Consistent blind protein structure generation from NMR chemical shift data, Proc. Natl. Acad. Sci. U. S. A. 105 (2008) 4685–4690.
[25] P. Robustelli, A. Cavalli, C.M. Dobson, M. Vendruscolo, X. Salvatella, Folding of small proteins by Monte Carlo simulations with chemical shift restraints without the use of molecular fragment replacement or structural homology, J. Phys. Chem. B 113 (2009) 7890–7896.
[26] Y. Shen, A. Bax, Homology modeling of larger proteins guided by chemical shifts, Nat. Methods 12 (2015) 747–750.
[27] V. Menon, B.K. Vallat, J.M. Dybas, A. Fiser, Modeling proteins using a supersecondary structure library and NMR chemical shift information, Structure 21 (2013) 891–899.
[28] W. Boomsma, P. Tian, J. Frellsen, J. Ferkinghoff-Borg, T. Hamelryck, K. Lindorff-Larsen, M. Vendruscolo, Equilibrium simulations of proteins using molecular fragment replacement and NMR chemical shifts, Proc. Natl. Acad. Sci. U. S. A. 111 (2014) 13852–13857.
[29] L.A. Bratholm, A.S. Christensen, T. Hamelryck, J.H. Jensen, Bayesian inference of protein structure from chemical shift data, PeerJ 3 (2015) e861.
[30] P. Robustelli, K. Kohlhoff, A. Cavalli, M. Vendruscolo, Using NMR chemical shifts as structural restraints in molecular dynamics simulations of proteins, Structure 18 (2010) 923–933.
[31] M. Berjanskii, D. Arndt, Y. Liang, D.S. Wishart, A robust algorithm for optimizing protein structures with NMR chemical shifts, J. Biomol. NMR 63 (2015) 255–264.
[32] L. Szilagyi, Chemical-shifts in proteins come of age, Prog. Nucl. Magn. Reson. Spectrosc. 27 (1995) 325–443.
[33] D.S. Wishart, Interpreting protein chemical shift data, Prog. Nucl. Magn. Reson. Spectrosc. 58 (2011) 62–87.
[34] D.A. Case, The use of chemical shifts and their anisotropies in biomolecular structure determination, Curr. Opin. Struct. Biol. 8 (1998) 624–630.
[35] E. Oldfield, Chemical shifts and three-dimensional protein structures, J. Biomol. NMR 5 (1995) 217–225.
[36] J. Soding, A. Biegert, A.N. Lupas, The HHpred interactive server for protein homology detection and structure prediction, Nucleic Acids Res. 33 (2005) W244–248.
[37] D.T. Jones, Protein secondary structure prediction based on position-specific scoring matrices, J. Mol. Biol. 292 (1999) 195–202.
[38] J. Soding, M. Remmert, Protein sequence comparison and fold recognition: progress and good-practice benchmarking, Curr. Opin. Struct. Biol. 21 (2011) 404–411.
[39] B. He, K. Wang, Y. Liu, B. Xue, V.N. Uversky, A.K. Dunker, Predicting intrinsic disorder in proteins: an overview, Cell Res. 19 (2009) 929–949.
[40] C.A. Rohl, C.E. Strauss, K.M. Misura, D. Baker, Protein structure prediction using Rosetta, Methods Enzymol. 383 (2004) 66–93.
[41] M. Berjanskii, P. Tang, J. Liang, J.A. Cruz, J. Zhou, Y. Zhou, E. Bassett, C. MacDonell, P. Lu, G. Lin, D.S. Wishart, GeNMR: a web server for rapid NMR-based protein structure determination, Nucleic Acids Res. 37 (2009) W670–677.
[42] J.L. Markley, D.H. Meadows, O. Jardetzky, Nuclear magnetic resonance studies of helix-coil transitions in polyamino acids, J. Mol. Biol. 27 (1967) 25–40.
[43] A. Pastore, V. Saudek, The relationship between chemical shift and secondary structure in proteins, J. Magn. Reson. 90 (1990) 165–176.
[44] D.S. Wishart, B.D. Sykes, F.M. Richards, Relationship between nuclear magnetic resonance chemical shift and protein secondary structure, J. Mol. Biol. 222 (1991) 311–333.
[45] D.S. Wishart, D.A. Case, Use of chemical shifts in macromolecular structure determination, Methods Enzymol. 338 (2001) 3–34.
[46] S.P. Mielke, V.V. Krishnan, Characterization of protein secondary structure from NMR chemical shifts, Prog. Nucl. Magn. Reson. Spectrosc. 54 (2009) 141–165.
[47] S.P. Mielke, V.V. Krishnan, An evaluation of chemical shift index-based secondary structure determination in proteins: influence of random coil chemical shifts, J. Biomol. NMR 30 (2004) 143–153.
[48] B.A. Johnson, Using NMRView to visualize and analyze the NMR spectra of macromolecules, Methods Mol. Biol. 278 (2004) 313–352.
[49] M.V. Berjanskii, D.S. Wishart, The RCI server: rapid and accurate calculation of protein flexibility using chemical shifts, Nucleic Acids Res. 35 (2007) W531–537.

[50] M. Levitt, Conformational preferences of amino acids in globular proteins, Biochemistry 17 (1978) 4277–4285.
[51] M.V. Berjanskii, D.S. Wishart, A simple method to predict protein flexibility using secondary chemical shifts, J. Am. Chem. Soc. 127 (2005) 14970–14971.
[52] W.S. Valdar, Scoring residue conservation, Proteins 48 (2002) 227–241.
[53] W. Kabsch, C. Sander, Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features, Biopolymers 22 (1983) 2577–2637.
[54] D. Frishman, P. Argos, Knowledge-based protein secondary structure assignment, Proteins 23 (1995) 566–579.
[55] E.G. Hutchinson, J.M. Thornton, A revised set of potentials for beta-turn formation in proteins, Protein Sci. 3 (1994) 2207–2216.
[56] J. Cavanagh, W.J. Fairbrother, A.G. Palmer III, N.J. Skelton, Protein NMR Spectroscopy: Principles and Practice, Academic Press, 1995.
[57] J.-S. Hu, A. Bax, Determination of ϕ and χ1 angles in proteins from 13C–13C three-bond J couplings measured by three-dimensional heteronuclear NMR. How planar is the peptide bond? J. Am. Chem. Soc. 119 (1997) 6360–6368.
[58] M. Pellecchia, R. Fattorusso, G. Wider, Determination of the dihedral angle psi based on J coupling measurements in N-15/C-13-labeled proteins, J. Am. Chem. Soc. 120 (1998) 6824–6825.
[59] N.J. West, L.J. Smith, Side-chains in native and random coil protein conformations. Analysis of NMR coupling constants and chi1 torsion angle preferences, J. Mol. Biol. 280 (1998) 867–877.
[60] K. Kloiber, W. Schuler, R. Konrat, Automated NMR determination of protein backbone dihedral angles from cross-correlated spin relaxation, J. Biomol. NMR 22 (2002) 349–363.
[61] T. Carlomagno, W. Bermel, C. Griesinger, Measuring the chi 1 torsion angle in protein by CH-CH cross-correlated relaxation: a new resolution-optimised experiment, J. Biomol. NMR 27 (2003) 151–157.
[62] S. Spera, A. Bax, Empirical correlation between protein backbone conformation and C-alpha and C-beta C-13 nuclear magnetic resonance chemical shifts, J. Am. Chem. Soc. 113 (1991) 5490–5492.
[63] S. Neal, A.M. Nip, H. Zhang, D.S. Wishart, Rapid and accurate calculation of protein 1H, 13C and 15N chemical shifts, J. Biomol. NMR 26 (2003) 215–240.
[64] G. Cornilescu, F. Delaglio, A. Bax, Protein backbone angle restraints from searching a database for chemical shift and sequence homology, J. Biomol. NMR 13 (1999) 289–302.
[65] Y. Shen, A. Bax, Protein structural information derived from NMR chemical shift with the neural network program TALOS-N, Methods Mol. Biol. 1260 (2015) 17–32.
[66] S. Neal, M. Berjanskii, H. Zhang, D.S. Wishart, Accurate prediction of protein torsion angles using chemical shifts and sequence homology, Magn. Reson. Chem. 44 (2006) S158–167 (Spec No).
[67] D.S. Wishart, R.F. Boyko, L. Willard, F.M. Richards, B.D. Sykes, SEQSEE: a comprehensive program suite for protein sequence analysis, Comput. Appl. Biosci. 10 (1994) 121–132.
[68] A.A. Canutescu, A.A. Shelenkov, R.L. Dunbrack Jr., A graph-theory algorithm for rapid protein side-chain prediction, Protein Sci. 12 (2003) 2001–2014.
[69] D.S. Wishart, B.D. Sykes, Chemical shifts as a tool for structure determination, Methods Enzymol. 239 (1994) 363–392.
[70] A. Bundi, K. Wuthrich, H-1-NMR parameters of the common amino-acid residues measured in aqueous-solutions of the linear tetrapeptides H-GLY-GLY-X-L-ALA-OH, Biopolymers 18 (1979) 285–297.
[71] J.A. Vila, D.R. Ripoll, H.A. Baldoni, H.A. Scheraga, Unblocked statistical-coil tetrapeptides and pentapeptides in aqueous solution: a theoretical study, J. Biomol. NMR 24 (2002) 245–262.
[72] M. Berjanskii, D.S. Wishart, NMR: prediction of protein flexibility, Nat. Protoc. 1 (2006) 683–688.
[73] C. Camilloni, A. Cavalli, M. Vendruscolo, Replica-averaged metadynamics, J. Chem. Theory Comput. 9 (2013) 5610–5617.
[74] D. Granata, C. Camilloni, M. Vendruscolo, A. Laio, Characterization of the free-energy landscapes of proteins by NMR-guided metadynamics, Proc. Natl. Acad. Sci. U. S. A. 110 (2013) 6817–6822.
[75] E. Durham, B. Dorr, N. Woetzel, R. Staritzbichler, J. Meiler, Solvent accessible surface area approximations for rapid and accurate protein structure prediction, J. Mol. Model. 15 (2009) 1093–1108.
[76] J. Mendes, A.M. Baptista, M.A. Carrondo, C.M. Soares, Implicit solvation in the self-consistent mean field theory method: sidechain modelling and prediction of folding free energies of protein mutants, J. Comput. Aided Mol. Des. 15 (2001) 721–740.
[77] J. Chen, N. Sawyer, L. Regan, Protein-protein interactions: general trends in the relationship between binding affinity and interfacial buried surface area, Protein Sci. 22 (2013) 510–515.
[78] P. Lavigne, J.R. Bagu, R. Boyko, L. Willard, C.F. Holmes, B.D. Sykes, Structure-based thermodynamic analysis of the dissociation of protein phosphatase-1 catalytic subunit and microcystin-LR docked complexes, Protein Sci. 9 (2000) 252–264.
[79] P. Benkert, S.C. Tosatto, D. Schomburg, QMEAN: a comprehensive scoring function for model quality assessment, Proteins 71 (2008) 261–277.
[80] J.J. Serpa, K.A.T. Makepeace, T.H. Borchers, D.S. Wishart, E.V. Petrotchenko, C.H. Borchers, Using isotopically-coded hydrogen peroxide as a surface modification reagent for the structural characterization of prion protein aggregates, J. Proteome 100 (2014) 160–166.
[81] J. Janin, Surface and inside volumes in globular proteins, Nature 277 (1979) 491–492.
[82] R. Adamczak, A. Porollo, J. Meller, Accurate prediction of solvent accessibility using neural networks–based regression, Proteins Struct. Funct. Bioinf. 56 (2004) 753–767.
[83] J. Friedman, T. Hastie, R. Tibshirani, The elements of statistical learning, Springer Series in Statistics, Springer, Berlin, 2001.
[84] W. Gronwald, L. Willard, T. Jellard, R.F. Boyko, K. Rajarathnam, D.S. Wishart, F.D. Sonnichsen, B.D. Sykes, CAMRA: chemical shift based computer aided protein NMR assignments, J. Biomol. NMR 12 (1998) 395–405.
[85] J.L. Pons, M.A. Delsuc, RESCUE: an artificial neural network tool for the NMR spectral assignment of proteins, J. Biomol. NMR 15 (1999) 15–26.
[86] B. Wang, Y. Wang, D.S. Wishart, A probabilistic approach for validating protein NMR chemical shift assignments, J. Biomol. NMR 47 (2010) 85–99.
[87] H. Zhang, S. Neal, D.S. Wishart, RefDB: a database of uniformly referenced protein chemical shifts, J. Biomol. NMR 25 (2003) 173–195.
[88] S.W. Ginzinger, J. Fischer, SimShift: identifying structural similarities from NMR chemical shifts, Bioinformatics 22 (2006) 460–465.
[89] K. Osapay, Y. Theriault, P.E. Wright, D.A. Case, Solution structure of carbonmonoxy myoglobin determined from nuclear magnetic resonance distance and chemical shift constraints, J. Mol. Biol. 244 (1994) 183–197.
[90] J.A. Vila, J.M. Aramini, P. Rossi, A. Kuzin, M. Su, J. Seetharaman, R. Xiao, L. Tong, G.T. Montelione, H.A. Scheraga, Quantum chemical 13C(alpha) chemical shift calculations for protein NMR structure determination, refinement, and validation, Proc. Natl. Acad. Sci. U. S. A. 105 (2008) 14389–14394.
[91] D.S. Wishart, M.S. Watson, R.F. Boyko, B.D. Sykes, Automated 1H and 13C chemical shift prediction using the BioMagResBank, J. Biomol. NMR 10 (1997) 329–336.
[92] W. Gronwald, R.F. Boyko, F.D. Sönnichsen, D.S. Wishart, B.D. Sykes, ORB, a homology-based program for the prediction of protein NMR chemical shifts, J. Biomol. NMR 10 (1997) 165–179.
[93] S.B. Needleman, C.D. Wunsch, A general method applicable to the search for similarities in the amino acid sequence of two proteins, J. Mol. Biol. 48 (1970) 443–453.
[94] S. Schwarzinger, G.J. Kroon, T.R. Foss, J. Chung, P.E. Wright, H.J. Dyson, Sequence-dependent correction of random coil NMR chemical shifts, J. Am. Chem. Soc. 123 (2001) 2970–2978.
[95] Y. Shen, A. Bax, Protein backbone chemical shifts predicted from searching a database for torsion angle and sequence homology, J. Biomol. NMR 38 (2007) 289–302.
[96] J. Meiler, PROSHIFT: protein chemical shift prediction using artificial neural networks, J. Biomol. NMR 26 (2003) 25–37.
[97] B. Han, Y. Liu, S.W. Ginzinger, D.S. Wishart, SHIFTX2: significantly improved protein chemical shift prediction, J. Biomol. NMR 50 (2011) 43–57.
[98] H. Sternlicht, D. Wilson, Magnetic resonance studies of macromolecules. I. Aromatic-methyl interactions and helical structure effects in lysozyme, Biochemistry 6 (1967) 2881–2892.
[99] H.L. Tigelaar, W.H. Flygare, Molecular Zeeman effect in formamide and the α-proton chemical shift in poly(L-alanine), J. Am. Chem. Soc. 94 (1972) 343–346.
[100] D.A. Case, Calibration of ring-current effects in proteins and nucleic acids, J. Biomol. NMR 6 (1995) 341–346.
[101] E. Oldfield, Chemical shifts in amino acids, peptides, and proteins: from quantum chemistry to drug design, Annu. Rev. Phys. Chem. 53 (2002) 349–378.
[102] R.D. Beger, P.H. Bolton, Protein phi and psi dihedral restraints determined from multidimensional hypersurface correlations of backbone chemical shifts and their use in the determination of protein tertiary structures, J. Biomol. NMR 10 (1997) 129–142.
[103] M. Iwadate, T. Asakura, M.P. Williamson, C alpha and C beta carbon-13 chemical shifts in proteins from an empirical database, J. Biomol. NMR 13 (1999) 199–211.
[104] X.P. Xu, D.A. Case, Automated prediction of 15N, 13Calpha, 13Cbeta and 13C′ chemical shifts in proteins using a density functional database, J. Biomol. NMR 21 (2001) 321–333.
[105] K.J. Kohlhoff, P. Robustelli, A. Cavalli, X. Salvatella, M. Vendruscolo, Fast and accurate predictions of protein NMR chemical shifts from interatomic distances, J. Am. Chem. Soc. 131 (2009) 13894–13895.
[106] J. Lehtivarjo, T. Hassinen, S.P. Korhonen, M. Perakyla, R. Laatikainen, 4D prediction of protein (1)H chemical shifts, J. Biomol. NMR 45 (2009) 413–426.
[107] J.A. Vila, Y.A. Arnautova, O.A. Martin, H.A. Scheraga, Quantum-mechanics-derived 13Calpha chemical shift server (CheShift) for protein structure validation, Proc. Natl. Acad. Sci. U. S. A. 106 (2009) 16972–16977.
[108] Z. Atieh, M. Aubert-Frecon, A.R. Allouche, Rapid, accurate and simple model to predict NMR chemical shifts for biological molecules, J. Phys. Chem. B 114 (2010) 16388–16392.
[109] A.B. Sahakyan, W.F. Vranken, A. Cavalli, M. Vendruscolo, Structure-based prediction of methyl chemical shifts in proteins, J. Biomol. NMR 50 (2011) 331–346.
[110] A.B. Sahakyan, W.F. Vranken, A. Cavalli, M. Vendruscolo, Using side-chain aromatic proton chemical shifts for a quantitative analysis of protein structures, Angew. Chem. Int. Ed. Eng. 50 (2011) 9620–9623.
[111] J.T. Nielsen, H.R. Eghbalnia, N.C. Nielsen, Chemical shift prediction for protein structure calculation and quality assessment using an optimally parameterized force field, Prog. Nucl. Magn. Reson. Spectrosc. 60 (2012) 1–28.
[112] D.W. Li, R. Bruschweiler, PPM: a side-chain and backbone chemical shift predictor for the assessment of protein conformational ensembles, J. Biomol. NMR 54 (2012) 257–265.
[113] A.S. Christensen, T.E. Linnet, M. Borg, W. Boomsma, K. Lindorff-Larsen, T. Hamelryck, J.H. Jensen, Protein structure validation and refinement using amide proton chemical shifts derived from quantum mechanics, PLoS One 8 (2013) e84123.
[114] A.S. Larsen, L.A. Bratholm, A.S. Christensen, M. Channir, J.H. Jensen, ProCS15: a DFT-based chemical shift predictor for backbone and Cbeta atoms in proteins, PeerJ 3 (2015) e1344.
[115] D. Li, R. Bruschweiler, PPM_One: a static protein structure based chemical shift predictor, J. Biomol. NMR 62 (2015) 403–409.
[116] K. Osapay, D.A. Case, A new analysis of proton chemical shifts in proteins, J. Am. Chem. Soc. 113 (1991) 9436–9444.
[117] A.C. de Dios, J.G. Pearson, E. Oldfield, Secondary and tertiary structural effects on protein NMR chemical shifts: an ab initio approach, Science 260 (1993) 1491–1496.
[118] J.M. Word, S.C. Lovell, J.S. Richardson, D.C. Richardson, Asparagine and glutamine: using hydrogen atom contacts in the choice of side-chain amide orientation, J. Mol. Biol. 285 (1999) 1735–1747.
[119] G. Wider, Structure determination of biological macromolecules in solution using nuclear magnetic resonance spectroscopy, BioTechniques 29 (2000) 1278–1282 (1284–1290, 1292 passim).
[120] B. Zagrovic, W.F. van Gunsteren, Comparing atomistic simulation data with the NMR experiment: how much can NOEs actually tell us? Proteins 63 (2006) 210–218.
[121] S.B. Nabuurs, C.A. Spronk, G.W. Vuister, G. Vriend, Traditional biomolecular structure determination by NMR spectroscopy allows for major errors, PLoS Comput. Biol. 2 (2006) e9.
[122] M. Berjanskii, J. Zhou, Y. Liang, G. Lin, D.S. Wishart, Resolution-by-proxy: a simple measure for assessing and comparing the overall quality of NMR protein structures, J. Biomol. NMR 53 (2012) 167–180.
[123] D.A. Case, T.E. Cheatham III, T. Darden, H. Gohlke, R. Luo, K.M. Merz Jr., A. Onufriev, C. Simmerling, B. Wang, R.J. Woods, The Amber biomolecular simulation programs, J. Comput. Chem. 26 (2005) 1668–1688.
[124] C.D. Schwieters, J.J. Kuszewski, N. Tjandra, G.M. Clore, The Xplor-NIH NMR molecular structure determination package, J. Magn. Reson. 160 (2003) 65–73.
[125] J. Kuszewski, A.M. Gronenborn, G.M. Clore, The impact of direct refinement against proton chemical shifts on protein structure determination by NMR, J. Magn. Reson. B 107 (1995) 293–297.
[126] J. Kuszewski, J. Qin, A.M. Gronenborn, G.M. Clore, The impact of direct refinement against 13C alpha and 13C beta chemical shifts on protein structure determination by NMR, J. Magn. Reson. B 106 (1995) 92–96.
[127] H. Zhou, J. Skolnick, GOAP: a generalized orientation-dependent, all-atom statistical potential for protein structure prediction, Biophys. J. 101 (2011) 2043–2052.
[128] J. Zhang, Y. Zhang, A novel side-chain orientation dependent potential derived from random-walk reference state for protein fold selection and structure prediction, PLoS One 5 (2010) e15386.
[129] H. Gong, Y. Shen, G.D. Rose, Building native protein conformation from NMR backbone chemical shifts using Monte Carlo fragment assembly, Protein Sci. 16 (2007) 1515–1521.
[130] N.E. Hafsa, M. Berjanskii, D. Arndt, D.S. Wishart, Rapid and reliable protein structure determination via chemical shift threading (manuscript under preparation), J. Biomol. NMR (2017).
[131] B. Rost, R. Schneider, C. Sander, Protein fold recognition by prediction-based threading, J. Mol. Biol. 270 (1997) 471–480.
[132] A. Sali, L. Potterton, F. Yuan, H. van Vlijmen, M. Karplus, Evaluation of comparative protein modeling by MODELLER, Proteins 23 (1995) 318–326.
[133] Y. Zhang, J. Skolnick, Scoring function for automated assessment of protein structure template quality, Proteins 57 (2004) 702–710.
[134] N.L. Fawzi, J. Ying, R. Ghirlando, D.A. Torchia, G.M. Clore, Atomic-resolution dynamics on the surface of amyloid-beta protofibrils probed by solution NMR, Nature 480 (2011) 268–272.
[135] P. Vallurupalli, G. Bouvignies, L.E. Kay, Studying “invisible” excited protein states in slow exchange with a major state conformation, J. Am. Chem. Soc. 134 (2012) 8148–8161.
[136] A.G. Palmer III, C.D. Kroenke, J.P. Loria, Nuclear magnetic resonance methods for quantifying microsecond-to-millisecond motions in biological macromolecules, Methods Enzymol. 339 (2001) 204–238.
[137] D.F. Hansen, P. Vallurupalli, L.E. Kay, Using relaxation dispersion NMR spectroscopy to determine structures of excited, invisible protein states, J. Biomol. NMR 41 (2008) 113–120.
[138] N.R. Skrynnikov, F.W. Dahlquist, L.E. Kay, Reconstructing NMR spectra of “invisible” excited protein states using HSQC and HMQC experiments, J. Am. Chem. Soc. 124 (2002) 12352–12360.
[139] G. Bouvignies, P. Vallurupalli, D.F. Hansen, B.E. Correia, O. Lange, A. Bah, R.M. Vernon, F.W. Dahlquist, D. Baker, L.E. Kay, Solution structure of a minor and transiently formed state of a T4 lysozyme mutant, Nature 477 (2011) 111–114.
[140] P. Neudecker, P. Robustelli, A. Cavalli, P. Walsh, P. Lundstrom, A. Zarrine-Afsar, S. Sharpe, M. Vendruscolo, L.E. Kay, Structure of an intermediate state in protein folding and aggregation, Science 336 (2012) 362–366.
[141] A. Rosato, J.M. Aramini, C. Arrowsmith, A. Bagaria, D. Baker, A. Cavalli, J.F. Doreleijers, A. Eletsky, A. Giachetti, P. Guerry, A. Gutmanas, P. Guntert, Y. He, T. Herrmann, Y.J. Huang, V. Jaravine, H.R. Jonker, M.A. Kennedy, O.F. Lange, G. Liu, T.E. Malliavin, R. Mani, B. Mao, G.T. Montelione, M. Nilges, P. Rossi, G. van der Schot, H. Schwalbe, T.A. Szyperski, M. Vendruscolo, R. Vernon, W.F. Vranken, S. Vries, G.W. Vuister, B. Wu, Y. Yang, A.M. Bonvin, Blind testing of routine, fully automated determination of protein structures from NMR data, Structure 20 (2012) 227–236.
[142] A. Rosato, W. Vranken, R.H. Fogh, T.J. Ragan, R. Tejero, K. Pederson, H.W. Lee, J.H. Prestegard, A. Yee, B. Wu, A. Lemak, S. Houliston, C.H. Arrowsmith, M. Kennedy, T.B. Acton, R. Xiao, G. Liu, G.T. Montelione, G.W. Vuister, The second round of critical assessment of automated structure determination of proteins by NMR: CASD-NMR-2013, J. Biomol. NMR 62 (2015) 413–424.
[143] K. Tamiola, F.A. Mulder, Using NMR chemical shifts to calculate the propensity for structural order and disorder in proteins, Biochem. Soc. Trans. 40 (2012) 1014–1020.
[144] A.B. Mantsyzov, A.S. Maltsev, J. Ying, Y. Shen, G. Hummer, A. Bax, A maximum entropy approach to the study of residue-specific backbone angle distributions in alpha-synuclein, an intrinsically disordered protein, Protein Sci. 23 (2014) 1275–1290.
[145] A.B. Mantsyzov, Y. Shen, J.H. Lee, G. Hummer, A. Bax, MERA: a webserver for evaluating backbone torsion angle distributions in dynamic and disordered proteins from NMR data, J. Biomol. NMR 63 (2015) 85–95.
[146] M.R. Jensen, L. Salmon, G. Nodet, M. Blackledge, Defining conformational ensembles of intrinsically disordered and partially folded proteins directly from chemical shifts, J. Am. Chem. Soc. 132 (2010) 1270–1272.
[147] S. Kashtanov, W. Borcherds, H. Wu, G.W. Daughdrill, F.M. Ytreberg, Using chemical shifts to assess transient secondary structure and generate ensemble structures of intrinsically disordered proteins, Methods Mol. Biol. 895 (2012) 139–152.
[148] H. Gong, S. Zhang, J. Wang, J. Zeng, Constructing structure ensembles of intrinsically disordered proteins from chemical shift data, J. Comput. Biol. 23 (2016) 300–310.
[149] G. Nodet, L. Salmon, V. Ozenne, S. Meier, M.R. Jensen, M. Blackledge, Quantitative description of backbone conformational sampling of unfolded proteins at amino acid resolution from NMR residual dipolar couplings, J. Am. Chem. Soc. 131 (2009) 17908–17918.
[150] V. Ozenne, F. Bauer, L. Salmon, J.R. Huang, M.R. Jensen, S. Segard, P. Bernado, C. Charavay, M. Blackledge, Flexible-meccano: a tool for the generation of explicit ensemble descriptions of intrinsically disordered proteins and their associated experimental observables, Bioinformatics 28 (2012) 1463–1470.
[151] A. De Simone, A. Cavalli, S.T. Hsu, W. Vranken, M. Vendruscolo, Accurate random coil chemical shifts from an analysis of loop regions in native states of proteins, J. Am. Chem. Soc. 131 (2009) 16332–16333.
[152] K. Tamiola, B. Acar, F.A. Mulder, Sequence-specific random coil chemical shifts of intrinsically disordered proteins, J. Am. Chem. Soc. 132 (2010) 18000–18003.
[153] M. Kjaergaard, F.M. Poulsen, Sequence correction of random coil chemical shifts: correlation between neighbor correction factors and changes in the Ramachandran distribution, J. Biomol. NMR 50 (2011) 157–165.
[154] M. Kjaergaard, F.M. Poulsen, Disordered proteins studied by chemical shifts, Prog. Nucl. Magn. Reson. Spectrosc. 60 (2012) 42–51.
[155] J. Kragelj, V. Ozenne, M. Blackledge, M.R. Jensen, Conformational propensities of intrinsically disordered proteins from NMR chemical shifts, ChemPhysChem 14 (2013) 3034–3045.
[156] P. Sormanni, D. Piovesan, G.T. Heller, M. Bonomi, P. Kukic, C. Camilloni, M. Fuxreiter, Z. Dosztanyi, R.V. Pappu, M.M. Babu, S. Longhi, P. Tompa, A.K. Dunker, V.N. Uversky, S.C. Tosatto, M. Vendruscolo, Simultaneous quantification of protein order and disorder, Nat. Chem. Biol. 13 (2017) 339–342.
[157] W.Y. Choy, J.D. Forman-Kay, Calculation of ensembles of structures representing the unfolded state of an SH3 domain, J. Mol. Biol. 308 (2001) 1011–1032.
[158] H.J. Feldman, C.W. Hogue, A fast method to sample real protein conformational space, Proteins 39 (2000) 112–131.
[159] J.A. Marsh, C. Neale, F.E. Jack, W.Y. Choy, A.Y. Lee, K.A. Crowhurst, J.D. Forman-Kay, Improved structural characterizations of the drkN SH3 domain unfolded state suggest a compact ensemble with native-like and non-native structure, J. Mol. Biol. 367 (2007) 1494–1510.
[160] J.A. Marsh, B. Dancheck, M.J. Ragusa, M. Allaire, J.D. Forman-Kay, W. Peti, Structural diversity in free and bound states of intrinsically disordered protein phosphatase 1 regulators, Structure 18 (2010) 1094–1103.
[161] T. Mittag, J. Marsh, A. Grishaev, S. Orlicky, H. Lin, F. Sicheri, M. Tyers, J.D. Forman-Kay, Structure/function implications in a dynamic complex of the intrinsically disordered Sic1 with the Cdc4 subunit of an SCF ubiquitin ligase, Structure 18 (2010) 494–506.
[162] R. Koradi, M. Billeter, K. Wüthrich, MOLMOL: a program for display and analysis of macromolecular structures, J. Mol. Graph. 14 (1996) 51–55.