Biochimica et Biophysica Acta 1834 (2013) 1520–1531
Capturing native/native like structures with a physico-chemical metric (pcSM) in protein folding
Avinash Mishra a, Satyanarayan Rao b, Aditya Mittal a, B. Jayaram a,b,c,⁎
a Kusuma School of Biological Sciences, Indian Institute of Technology, Hauz Khas, New Delhi 110016, India
b Supercomputing Facility for Bioinformatics & Computational Biology, Indian Institute of Technology, Hauz Khas, New Delhi 110016, India
c Department of Chemistry, Indian Institute of Technology, Hauz Khas, New Delhi 110016, India
Article info
Article history: Received 15 February 2013; Received in revised form 12 April 2013; Accepted 15 April 2013; Available online 7 May 2013
Keywords: Protein folding; Protein structure prediction; Decoy; Scoring function; Native structure
Abstract
Specification of the three dimensional structure of a protein from its amino acid sequence, also called a “Grand Challenge” problem, has eluded a solution for over six decades. A modestly successful strategy has evolved over the last couple of decades based on development of scoring functions (e.g. mimicking free energy) that can capture native or native-like structures from an ensemble of decoys generated as plausible candidates for the native structure. A scoring function must be fast enough in discriminating the native from unfolded/misfolded structures, and requires validation on large data set(s) to generate sufficient confidence in the score. Here we develop a scoring function called pcSM that detects the true native structure in the top 5 with 93% accuracy from an ensemble of candidate structures. If we eliminate the native from the ensemble of decoys, pcSM is able to capture a near-native structure (RMSD ≤ 5 Å) in the top 10 with 86% accuracy. The parameters considered in pcSM are a C-alpha Euclidean metric, secondary structural propensity, surface areas and an intramolecular energy function. pcSM has been tested on 415 systems consisting of 142,698 decoys (public and CASP; the largest reported hitherto in the literature). The average rank for the native is 2.38, a significant improvement over that existing in the literature. In-silico protein structure prediction requires robust scoring technique(s). Therefore, pcSM is easily amenable to integration into a successful protein structure prediction strategy. The tool is freely available at http://www.scfbio-iitd.res.in/software/pcsm.jsp. © 2013 Elsevier B.V. All rights reserved.
1. Introduction Prediction of the tertiary structure of a protein from its amino acid sequence continues to be a grand challenge in biology [1–3]. Modern textbook knowledge on understanding protein structures is essentially reflective of a body of literature having well defined secondary structures [4–6] that can be formed by a continuous series of amino acids in a primary sequence (a.k.a. window size), with each amino acid having a defined ‘propensity’ to be a part of a particular secondary structure. Prediction of secondary structure component of protein can be classified into various phases based on the accuracy of an algorithm. Most of these algorithms employ machine learning approaches for prediction. Initial methods were based on physico-chemical properties [7–9] where the accuracy was reported as 56–60% [10,11]. As the data availability increased with time, accuracy of the machine learning approaches improved and went up to 70% [12–16]. Subsequently,
⁎ Corresponding author at: Department of Chemistry, Indian Institute of Technology, Hauz Khas, New Delhi 110016, India. Tel.: +91 11 2659 1505, +91 11 2659 6786; fax: +91 11 2658 2037, +91 11 2659 7530. E-mail address:
[email protected] (B. Jayaram). URL: http://www.scfbio-iitd.res.in (B. Jayaram). 1570-9639/$ – see front matter © 2013 Elsevier B.V. All rights reserved. http://dx.doi.org/10.1016/j.bbapap.2013.04.023
due to improvements in sequence alignment, secondary structure prediction also advanced to 80% accuracy [17,18]. Modern protein tertiary structure prediction algorithms rely either on the concept of Anfinsen-inspired energy landscapes with native structures having the lowest free energies [19–33] or on the utilization of homology/template based models scored via statistical potentials [34–41] (for individual amino acids and occurrence of a variable series of peptide-bonded amino acids in primary sequences) developed from experimentally determined structures [42]. A thorough inspection of the literature on protein folding reveals the following broad classification of the algorithms developed for structure prediction—(a) homology/ template based [43–49], (b) fold recognition [50–56], (c) ab initio [57–61] and (d) hybrid methods [62] to compensate for and minimize the limitations of either of the methods. With a remarkable increase in the availability of computational resources in the last few years [63–80], coupled with an accessibility of growing high resolution crystal structure data through the Protein Data Bank, substantial progress has been made towards developing algorithms for predicting protein structures from primary sequences [81,82]. However, limitations of the existing algorithms are well appreciated, evidence for which is in the continued efforts to develop better algorithms [83]; number of CASP (Critical Assessment of protein Structure Prediction) participants since the first one as a function of the CASP iteration is shown
A. Mishra et al. / Biochimica et Biophysica Acta 1834 (2013) 1520–1531
in Supplementary Table S1. Regardless of the variety of algorithms, over the years, some computational filters have emerged to create and/or screen ensembles of 3D structures, from primary sequences, towards obtaining/identifying native structures. These include degree of compactness [84,85] and constraints on variations of loops connecting different secondary structures [86–88] etc. However, structural variations within different native structures lead to (sometimes severe) limitations of these computational filters. In this work, in an attempt to overcome the limitations of existing algorithms in capturing the native-like conformation out of an ensemble of protein structures, we develop a robust metric that effectively combines the chemical, physical, geometrical and energetic constraints known to be important in protein folding. While none of the constraints is sufficient to capture the native individually all the time in its fold diversity, we design a combination metric that captures the native/native-like structure out of an ensemble of structures with 95% success. We test the metric for its ability to capture the native structure from the well established ensembles of protein structures in the literature. Further, we also mimic the real scenario of protein structure prediction, where the native is not present in the ensemble, and test the ability of pcSM to capture structures under 5 Å RMSD, if there are any. We report that our algorithm performs better than any known existing algorithm for capturing native/native-like structures out of an ensemble of structures. Our results strongly highlight the need to combine chemical, physical, structural and energetic constraints in attempts to predict native structures.
2. Methods
2.1.
Metric components
Following are the parameters we have chosen for designing our scoring function:
(a) Secondary Structure Penalty (P) = PH + PS
The secondary structure of the sequence is predicted, a value Xi (either 0 or 1) is assigned to each position, and the result is termed the reference array. The secondary structure of a given decoy is then evaluated and Xi is assigned in the same way to give the decoy array. The XOR operator is applied to these two arrays (reference and decoy). We do this separately for helix and sheet; the formula for the helix case is given below. The sheet penalty PS is obtained in the same way, and PH and PS are added.

$$P_H = \sum_{i=1}^{n} \left\{ [X_1, X_2, X_3, \ldots, X_n]_{\mathrm{Reference}} \oplus [X_1, X_2, X_3, \ldots, X_n]_{\mathrm{Decoy}} \right\}$$

Xi = 1: presence of helix
Xi = 0: absence of helix
n: number of amino acids.
Here, ‘⊕’ refers to the XOR operator, a logical binary operator which is TRUE if only one of the operands is TRUE.
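The XOR-based penalty can be illustrated with a short Python sketch. It assumes three-state secondary-structure strings in which 'H' marks helix and 'E' marks sheet; the alphabet, the function names and the example strings are illustrative assumptions and are not taken from the pcSM implementation.

```python
def state_penalty(reference: str, decoy: str, state: str) -> int:
    """Count positions where exactly one of the two strings has `state` (XOR)."""
    if len(reference) != len(decoy):
        raise ValueError("reference and decoy must have the same length")
    ref_bits = [int(s == state) for s in reference]  # X_i for the reference array
    dec_bits = [int(s == state) for s in decoy]      # X_i for the decoy array
    return sum(r ^ d for r, d in zip(ref_bits, dec_bits))


def secondary_structure_penalty(reference: str, decoy: str) -> int:
    """P = P_H + P_S, with 'H' = helix and 'E' = sheet (assumed alphabet)."""
    p_h = state_penalty(reference, decoy, "H")
    p_s = state_penalty(reference, decoy, "E")
    return p_h + p_s


# Hypothetical 10-residue example.
predicted = "HHHHCCEEEE"  # predicted (reference) secondary structure
observed = "HHCCCCEEEH"   # secondary structure evaluated on a decoy
print(state_penalty(predicted, observed, "H"))   # helix mismatches, P_H
print(state_penalty(predicted, observed, "E"))   # sheet mismatches, P_S
print(secondary_structure_penalty(predicted, observed))
```

A decoy whose secondary structure agrees with the prediction at every position gets a penalty of zero, so lower values are better.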
Given a sequence of amino acids, the predicted secondary structure is treated as the reference. Mismatches between the secondary structural patterns of the decoy and the reference are given a penalty of 1, whereas matches are given a score of 0. Secondary structural elements can guide us in capturing the native from a pool of decoys. This parameter is quite efficient in filtering out those conformations whose secondary structures are not properly formed in the complete tertiary form. The implementation is illustrated in Supplementary Text S1. This filter, however, is limited by the accuracy of secondary structure prediction, which is currently placed at ~80%.
(b) Euclidean distance (M)

$$M = \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} CA_{ij} \tag{1}$$

CAij: Euclidean distance between the Cα atoms of the ith and jth residues
n: number of amino acids.
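As a minimal sketch, Eq. (1) as printed (a plain sum over all unique Cα pairs) can be written in a few lines of Python. The two-dimensional coordinates below are hypothetical and only mimic a six-residue peptide laid out on unit squares; they are not the coordinates used in the paper, and the sketch does not include any averaging over residue-type pairs.

```python
import math
from itertools import combinations


def ca_distance_sum(ca_coords):
    """Eq. (1): sum of Euclidean distances over all unique pairs i < j."""
    return sum(math.dist(a, b) for a, b in combinations(ca_coords, 2))


# Hypothetical six-residue peptide on a grid of unit squares.
compact = [(0, 0), (1, 0), (2, 0), (2, 1), (1, 1), (0, 1)]  # folded 2 x 3 block
extended = [(i, 0) for i in range(6)]                       # straight chain

m_compact = ca_distance_sum(compact)
m_extended = ca_distance_sum(extended)
print(round(m_compact, 2), round(m_extended, 2))
```

The extended chain gives the larger sum, which is the behaviour the metric relies on: compact conformations score lower.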
This parameter covers the globular compactness of the structures. There can be a total of 210 unique residue pairs with 20 amino acids. We calculate the mean of the Euclidean distances between the Cα atoms of all possible unique pairs [89–91]. All of them are then summed (Eq. (1)) to give a distance measure which we denote as M. A sample calculation on a peptide with six amino acids in two conformations is illustrated in Fig. 1. In this example, the extended structure (B), which is likely to be thermodynamically less stable, shows a higher value for M, while the compact, folded structure (A) shows a smaller value. To illustrate the efficiency of this metric, Fig. 2 shows its value for three decoys (D1, D2 and D3) with different conformations taken from the in-house decoy set for a protein with PDB ID: 1AT0. The M value is calculated using Eq. (1); D2 shows the least value, which makes it the better decoy.
(c) Surface areas
(i) Fractional area of exposed nonpolar residues (A1)

$$A_1 = \frac{\sum_{i}^{t} exA_{np}}{\sum_{i}^{n} exA_{T}} \tag{2}$$

exAnp: exposed area of a nonpolar residue
exAT: exposed area of a residue
t: total number of nonpolar residues
n: total number of residues.
Fig. 1. Illustration of Euclidean distance. MA represents a folded structure on a rectangle made of unit squares and MB represents an extended form.
(ii) Fractional area of exposed nonpolar part of residues (A2)

$$A_2 = \sum_{i}^{n} \frac{npA_i}{totA_i} \tag{3}$$

npAi: exposed area of nonpolar part of residue ‘i’
totAi: exposed area of residue ‘i’
n: total number of residues.

(iii) Weighted exposed area (A3)

$$A_3 = \sum_{i}^{n} m_i A_i \tag{4}$$

mi: molecular mass of residue ‘i’
Ai: exposed area of residue ‘i’
n: total number of residues.

(iv) Total surface area (A4)

$$A_4 = \sum_{i}^{n} A_i \tag{5}$$

Ai: exposed area of residue ‘i’
n: total number of residues.

Minimum accessible surface area of nonpolar residues is a widely recognized facet of native conformations. A variety of hydrophobic scales have been established [92,93], as have procedures for calculating surface area-related parameters [94]. Hydrophobic collapse is better represented by a loss of SASA because of the observed strong correlation between atomic surface area and solvation energy [95]. The propensity of any amino acid to be found on the surface or in the core depends on the side chain of that particular residue [96–98]. This rule works fine in most cases except where mis-aggregated structures are present in the decoy set. For examining the strength and limitation of this metric, the accessible surface area (ASA) [99] is calculated first for a given structure. This metric calculates the exposed area of each residue of the protein and computes the total exposed area by adding all the atomic values; this is termed exAT. Nonpolar residues are then selected from the protein and the total exposed area is calculated for them; this is termed exAnp. A1, A2, A3 and A4 are calculated using the formulas given in Eqs. (2), (3), (4) and (5) respectively. Fig. 3 describes how the exposed surface area varies between two different conformations of a protein. Here the second conformation is characterized by lower values of the A1, A2 and A3 parameters. The second conformation is indeed the native.

(d) Empirical potential energy function (E)

$$E_{elec}^{ij} = \frac{332\, q_i q_j}{D\, r_{ij}} \tag{6}$$

$$E_{VdW}^{ij} = \frac{C_{12}^{ij}}{r_{ij}^{12}} - \frac{C_{6}^{ij}}{r_{ij}^{6}} \tag{7}$$

rij = distance between pair of atoms i and j.

$$C_{12}^{ij} = \varepsilon \sigma^{12}; \quad C_{6}^{ij} = 2 \varepsilon \sigma^{6}.$$

Here σ and ε are the sum of van der Waals radii and well depth respectively of the interacting pair of atoms i and j.

$$E_{hyd}^{ij} = \frac{M_{12}}{r_{ij}^{12}} - \frac{M_{6}}{r_{ij}^{6}} \tag{8}$$

$$M_{12} = \varepsilon R^{12}; \quad M_{6} = 2 \varepsilon R^{6}.$$

Here, R is a distance variable and ε is set to 1.

$$E = \frac{\sum_{i}^{n-1} \sum_{j=i+1}^{n} \left( E_{elec}^{ij} + E_{VdW}^{ij} + E_{hyd}^{ij} \right)}{n} \tag{9}$$
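The pairwise terms of Eqs. (6)–(8) and the per-atom total of Eq. (9) can be sketched in Python as below. Several choices are assumptions made only for illustration: D is treated as a constant `dielectric`, the pair well depth is taken as the geometric mean of hypothetical per-atom well depths, and the distance variable R of the hydrophobic term is passed as a single constant `hyd_r`. The `Atom` record and all numerical values are hypothetical; no force-field parameters from the paper are reproduced.

```python
import math
from itertools import combinations
from typing import NamedTuple, Tuple


class Atom(NamedTuple):
    xyz: Tuple[float, float, float]  # Cartesian coordinates in Å
    charge: float                    # partial charge q_i
    radius: float                    # van der Waals radius in Å
    well_depth: float                # van der Waals well depth


def pair_energy(a: Atom, b: Atom, dielectric: float, hyd_r: float) -> float:
    """Sum of the three pair terms for atoms a and b."""
    r = math.dist(a.xyz, b.xyz)
    e_elec = 332.0 * a.charge * b.charge / (dielectric * r)     # Eq. (6)
    sigma = a.radius + b.radius                                  # sum of vdW radii
    eps = math.sqrt(a.well_depth * b.well_depth)                 # assumed mixing rule
    e_vdw = eps * ((sigma / r) ** 12 - 2.0 * (sigma / r) ** 6)   # Eq. (7)
    e_hyd = (hyd_r / r) ** 12 - 2.0 * (hyd_r / r) ** 6           # Eq. (8), eps = 1
    return e_elec + e_vdw + e_hyd


def energy_per_atom(atoms, dielectric: float = 1.0, hyd_r: float = 3.0) -> float:
    """Eq. (9): total over unique atom pairs divided by the number of atoms."""
    total = sum(pair_energy(a, b, dielectric, hyd_r)
                for a, b in combinations(atoms, 2))
    return total / len(atoms)


# Two hypothetical neutral atoms placed 3 Å apart.
pair = [Atom((0.0, 0.0, 0.0), 0.0, 1.5, 0.1),
        Atom((3.0, 0.0, 0.0), 0.0, 1.5, 0.1)]
print(energy_per_atom(pair))
```

With both atoms at the separation where the van der Waals and hydrophobic terms reach their minima, the pair energy is the sum of the two well depths with a negative sign, and dividing by the atom count gives the size-independent value.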
In Eq. (9), n is the number of atoms in the system and the summation is over all unique pairs of atoms. An empirical intramolecular interaction energy based scoring function was used earlier by Narang et al. [100–103] to discriminate the native, where the native conformation was at first rank energy-wise with respect to the decoys in 67 out of 69 systems. The energy function used contained contributions from electrostatics, van der Waals and a desolvation term captured by a Gurney function. In this study, we have mapped the Gurney description of desolvation on to a continuous potential [20], viz. a “hydrophobic interaction potential”, as shown in Eq. (8). Eq. (9) gives the total non-bonded energy per atom of the protein. Energy based scoring has been used in several studies to detect the native, but unfortunately the energy value is only meaningful when the given structure is energetically minimized. This minimization of energy is one of the compute-intensive steps. Another limitation, particularly with a large set of decoys, is degeneracy, i.e. structures with different RMSDs from the native having similar values for energy. Due to this degeneracy, multiple structures can possess the same energy value. Therefore, on the basis of energy alone, it would be difficult to pick the native or native-like structures. We remove primary atomic clashes by running short steepest descent (SD) and conjugate gradient (CG) energy minimizations. After calculating the all-atom energy of the energy-minimized molecule, we divide it by the number of atoms to obtain the energy (in kcal) per atom, an energy-based parameter that is independent of protein size (Eq. (9)). For further illustration of the energy score, we show two decoys (D1 and D2) for the same amino acid sequence (T0283 system of CASP7) and their corresponding energies per atom in Fig. 4. It may be noted that for this system, the minimum energy structure (D2) corresponds to the native.

Fig. 2. Variation of Cα distances among 3 different decoys. Sum of pairwise Cα distances of three different decoys, illustrating the fact that compactness of structure leads to lower values of M (PDB ID: 1AT0).

Fig. 3. Values of all area metrics for 2 different decoys. A1, A2, A3 and A4 are different numerical representations of exposed area components of a protein (PDB ID: 1A6Q). The values of these parameters are expected to be at a minimum for the native conformation.
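The four area descriptors A1 to A4 of Eqs. (2)–(5) can be sketched in Python once per-residue exposed areas are available from any ASA program. The `Residue` record, the set of residues treated as nonpolar and the numbers in the example are assumptions for illustration; the paper does not list the nonpolar set or the masses it uses.

```python
from typing import NamedTuple


class Residue(NamedTuple):
    name: str                # three-letter residue name
    exposed: float           # exposed area of the residue, Å^2
    nonpolar_exposed: float  # exposed area of the nonpolar part of the residue, Å^2
    mass: float              # molecular mass of the residue


# Assumed set of nonpolar residues (illustrative only).
NONPOLAR = {"ALA", "VAL", "LEU", "ILE", "MET", "PHE", "TRP", "PRO"}


def area_descriptors(residues):
    """Return (A1, A2, A3, A4) following Eqs. (2)-(5)."""
    total = sum(r.exposed for r in residues)
    if total == 0:
        raise ValueError("no exposed area in the structure")
    a1 = sum(r.exposed for r in residues if r.name in NONPOLAR) / total  # Eq. (2)
    # Fully buried residues (zero exposed area) are skipped to avoid 0/0.
    a2 = sum(r.nonpolar_exposed / r.exposed
             for r in residues if r.exposed > 0)                         # Eq. (3)
    a3 = sum(r.mass * r.exposed for r in residues)                       # Eq. (4)
    a4 = total                                                           # Eq. (5)
    return a1, a2, a3, a4


# Two hypothetical residues with made-up areas and masses.
toy = [Residue("LEU", 40.0, 30.0, 100.0), Residue("SER", 60.0, 15.0, 80.0)]
print(area_descriptors(toy))
```

Lower values of all four descriptors are expected for the native conformation, so each can be used directly as a score to be minimized.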
2.2. Cumulative parameter score (CS) Cumulative score (CS) is comprised of seven different parameters with each having its own individual native discriminating potential. To avoid sophisticated empirical formalisms, CS is designed as a linear combination of these seven parameters in such a way that native/ native like structures show minimum values and thus can be captured in top 10. Range of numerical values of these parameters differs significantly since each of the parameters has a different “arbitrary unit”. Thus a simple linear combination without any coefficient could result in a biased CS towards parameters with higher magnitudes. Therefore we decided to assign coefficients to each parameter
to bring contributions of each of the parameters on a comparable/similar scale. As a result the linear summation of the product of individual coefficients with the respective parameters allowed similar weightage to each parameter in CS as shown in Eq. (10) below:

$$CS = c_{A1} A_1 + c_{A2} A_2 + c_{A3} A_3 + c_{A4} A_4 + c_{P} \max(P_H, P_S) + c_{M1} M_1 \tag{10}$$

where A1, A2, A3, A4, M1, PH and PS are the parameters and “c” followed by the parameter name is the respective coefficient. The following steps were taken to optimize and fix the values of the coefficients in this work:
1. Four different systems from public and CASP decoys were randomly selected as the test set: these were 1ctf, 4rxn, 4pti and CASP9_T0546.
2. These four systems gave rise to 2199 decoys, including 4 distinct natives.
3. A1, A2, A3, A4, M1, PH and PS were calculated for these 2199 decoys.
4. Average values of each of the parameters calculated for the 2199 decoys were found to be A1 = 0.332, A2 = 51.89, A3 = 721,364.25, A4 = 5383.75, M1 = 4089.65, PH = 27.88 and PS = 13.32.
5. For simplicity, and inspired by first principles of linear programming, we multiplied the above average values with coefficients such that the resulting product would fall in a numerical range of 1 to 10. Thus A1 was multiplied with 10, resulting in 3.32 as the contribution of A1 in CS. Similarly, A2, A3, A4 and M1 were multiplied with 0.1, 0.00001, 0.001 and 0.001 respectively, resulting in 5.18, 7.21, 5.38 and 4.089 as their respective contributions in CS. This approach, while arbitrary, is straightforward and adequately addresses the possibility of any bias that may arise due to individual parameters in CS.
6. Since the secondary structure penalty parameters PH and PS are statistical in nature compared to the purely physico-chemical nature of A1, A2, A3, A4 and M1, coefficients for these two parameters were fixed using an iterative strategy. First, values of PH and PS were multiplied with random coefficient values between 0 and 0.3 so that the resulting contributions of these parameters to CS were between 1 and 10. Then 10 random generation cycles each for PH and PS, i.e. 100 cycles in total, were performed to calculate CS while keeping the other parameters fixed. It was found that a coefficient of 0.15 for PH and a coefficient of 0.21 for PS provided the best discriminating potential for the native in CS.
Therefore, the final values adopted for the coefficients were fixed as: cA1 = 10; cA2 = 0.1; cA3 = 0.00001; cA4 = 0.001; cM1 = 0.001; cP = 0.15 (PH) and 0.21 (PS).

Fig. 4. Energy value for 2 different conformations of the same protein. Energy (in kcal) per atom for two different decoys taken from the CASP7 (T0283) data set is calculated, where the structure with minimum energy corresponds to the native.

The implementation of pcSM is shown in Fig. 5 in the form of a flow chart. For any given decoy set, pcSM needs the predicted secondary structure of the amino acid sequence in order to evaluate parameter P. Other parameters such as A1, A2, A3, A4, M and E are calculated from the tertiary structure simultaneously. Short energy minimization is performed to remove atomic clashes. Afterwards, five structures each from P and A4 are chosen based on minimum scores and kept in the bin FINAL-RUN, which contains the structures on which the complete energy minimization (the last step of pcSM) has to be run.
After removal of atomic clashes, the top 50 structures are chosen for the next step. The best five structures on each of E, P, A1 and A2 among these 50 are chosen based on their minimum scores and added to the FINAL-RUN bin. The cumulative score (CS) is calculated for all 50 structures and their respective energies (E) are added to it; the sum is termed ‘S’. Based on minimum S score, the top 10 structures are then selected and added to the FINAL-RUN bin. The redundancy in the FINAL-RUN bin is removed to get unique structures for complete energy minimization. After the complete energy minimization, the top 10 as well as top 5 structures are selected based on minimum energy score (E), which is expected to capture the native or native-like structures.
2.3. Case study
Fig. 6 further illustrates the working of the combination metric pcSM on a sample case (PDB ID: 1EH2) belonging to the public decoy set Semfold. It depicts how the screening is done at different levels of the protocol based on the various metrics. Also please see the flow chart in Fig. 5. The total number of decoys in the data set is 11,441. The rank of the native on each individual metric is given in parentheses at each level. ‘A1’ performs best while ‘P’ does not do well at the initial level. The individual rankings at the initial level and at different subsequent stages can differ in other cases. At the next level, the top 50 decoys are selected on an energy basis after short minimization. Here the rank of the native on individual metrics has improved due to the reduction in the number of decoys. On the cumulative score (CS) the rank of the native is 1 among the top 50 decoys, which indicates the dominance of the ‘P’ and ‘A1’ scores in the value of CS. The native remains at the top after adding energy, as energy also picks the native at first position in the top 50 decoys. In the final stage, the native is captured at rank 1.
Fig. 5. Flow chart for pcSM.
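The cumulative score of Eq. (10), which the flow chart applies to the top 50 structures, can be sketched in Python with the coefficients fixed in Section 2.2. How the two coefficients 0.15 (PH) and 0.21 (PS) combine with max(PH, PS) is not spelled out in the text; the sketch assumes that the larger of the two penalties is scaled by its own coefficient. The dictionary layout and function name are hypothetical.

```python
# Coefficients fixed in Section 2.2.
COEFF = {"A1": 10.0, "A2": 0.1, "A3": 0.00001, "A4": 0.001, "M1": 0.001}
C_PH, C_PS = 0.15, 0.21


def cumulative_score(p):
    """Eq. (10) for one decoy; `p` maps parameter names to raw values."""
    area_and_distance = sum(COEFF[k] * p[k] for k in COEFF)
    # Assumed reading of c_P * max(P_H, P_S): scale whichever penalty is
    # larger by its own coefficient.
    if p["PH"] >= p["PS"]:
        penalty = C_PH * p["PH"]
    else:
        penalty = C_PS * p["PS"]
    return area_and_distance + penalty


# Average parameter values reported for the 2199 test-set decoys.
averages = {"A1": 0.332, "A2": 51.89, "A3": 721364.25, "A4": 5383.75,
            "M1": 4089.65, "PH": 27.88, "PS": 13.32}
print(round(cumulative_score(averages), 3))
```

Applied to the reported averages, every term lands between 1 and 10, which is the scaling the coefficients were chosen to achieve.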
Fig. 6. A case study of decoy set. A Semfold public decoy set was considered with PDB ID: 1EH2, containing 11,441 decoys. Number in parentheses refers to rank of native on given parameter. Short minimization: SD(35) CG(15), complete minimization: SD(75) CG(125).
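The screening levels described in Section 2.2 and shown in the case study can be summarized as a selection routine. The Python sketch below covers only the construction of the FINAL-RUN bin; the energy minimizations themselves are outside its scope. It assumes each decoy is a dictionary with hypothetical keys 'id', 'E' (energy per atom after short minimization), 'P', 'A1', 'A2', 'A4' and 'CS', all already computed.

```python
def lowest(decoys, key, k):
    """Return the k decoys with the smallest value of `key`."""
    return sorted(decoys, key=key)[:k]


def build_final_run(decoys):
    """Collect the FINAL-RUN bin from decoy records (hypothetical layout)."""
    picked = []
    # Five best structures on P and on A4 from the whole ensemble.
    for name in ("P", "A4"):
        picked += lowest(decoys, lambda d, n=name: d[n], 5)
    # Top 50 on energy after the short minimization.
    top50 = lowest(decoys, lambda d: d["E"], 50)
    # Five best of the top 50 on each of E, P, A1 and A2.
    for name in ("E", "P", "A1", "A2"):
        picked += lowest(top50, lambda d, n=name: d[n], 5)
    # Ten best of the top 50 on S = CS + E.
    picked += lowest(top50, lambda d: d["CS"] + d["E"], 10)
    # Remove redundancy while keeping first-seen order.
    unique = {d["id"]: d for d in picked}
    return list(unique.values())
```

The structures returned here would then go through the complete energy minimization and be ranked on minimum E to give the top 10 and top 5.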
3. Data selection
Primarily, two types of data set are chosen: (i) public decoys from Decoys ‘R’ Us (http://dd.compbio.washington.edu/) and (ii) server predictions of CASP experiments CASP5 to CASP9 (http://predictioncenter.org/download_area/). Table 1 describes the data sets and the number of systems therein. Overall, we have considered 415 systems comprising 142,698 decoys. Decoy generation methods discussed below have been taken from the documentation of the corresponding decoy set.
(i) Public decoys: (a) 4state_reduced (b) Fisa (c) Fisa_casp3 (d) Lattice_ssfit (e) Lmds (f) ROSETTA (g) rosetta_decoys_62 (h) CASP5.
The public decoys considered in this study (http://www.scfbio-iitd.res.in/software/pcsm/dataset/public-decoys) comprise 137 systems and contain 92,329 decoys (Supplementary Table S3).
(ii) CASP decoys: (a) CASP5 (b) CASP6 (c) CASP7 (d) CASP8 (e) CASP9.
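Table 1 Description of data set taken to evaluate the robustness of the scoring function (pcSM).

| Decoy set | Decoys | Systems | Number of decoys |
|---|---|---|---|
| CASP | CASP5 | 37 | 50,369 |
| | CASP6 | 37 | |
| | CASP7 | 51 | |
| | CASP8 | 75 | |
| | CASP9 | 78 | |
| Public decoys | 4state_reduced | 7 | 92,329 |
| | casp5 (public decoys) | 3 | |
| | Fisa | 4 | |
| | fisa_casp3 | 5 | |
| | John-2002 | 20 | |
| | Lattice_ssfit | 8 | |
| | Lmds | 10 | |
| | Rosetta | 21 | |
| | Rosetta 62 | 57 | |
| | Semfold | 2 | |
| Total | | 415 | 142,698 |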
In all, the CASP decoy set considered in this study (http://www.scfbio-iitd.res.in/software/pcsm/dataset/CASP-decoys) comprises 278 systems and contains 50,369 decoys/modeled structures generated by different servers (Supplementary Table S4).
4. Results
Table 2 shows the performance of the scoring function on the complete data set obtained from public and CASP decoys. Here the last column shows the performance of pcSM in capturing the native. Out of the 137 public decoy sets, pcSM is able to capture the native in the top 5 in 126 cases. In 5 systems, Rank_native lies between 6 and 10, and in 3 systems natives are above rank 10. There are 3 more systems where the native is not ranked by pcSM as it has not passed through the initial filters, namely the 2(a), 2(b) and 3(a) steps in the flow chart of pcSM (Fig. 5). Out of the 278 CASP decoy sets, pcSM is able to capture the native in the top 5 in 223 cases. In 27 systems Rank_native lies between 6 and 10 and in 17 systems it is above rank 10. There are 11 more systems in which the native is not ranked by pcSM as it has not passed through the initial filters, namely the 2(a), 2(b) and 3(a) steps in the flow chart of pcSM (Fig. 5).
4.1. A comparative analysis of pcSM
A few comparative studies of various scoring functions have been reported previously. One such study by Fiser's Group shows the average rank of native on CASP5 to CASP8 decoys with top 20 scoring
Table 2 Native capturing efficiency of pcSM on all the systems taken from public and CASP decoys.

| Decoy set | Number of systems | Number of decoys | Average Rank_native |
|---|---|---|---|
| Public decoys | 137 | 92,329 | 1.85 (3 NF a) |
| CASP decoys | 278 | 50,369 | 2.91 (11 NF a) |
| Total | 415 | 142,698 | 2.38 |

a NF: native did not cross the first filter, so there is no ranking of the native.
Table 3 Comparison of top 20 scoring functions considered by Fiser et al. with pcSM in CASP5–8 decoy sets in the presence of native.

| Scoring function | Avg. Rank_native |
|---|---|
| pcSM | 1.30 |
| RF_HA_SRS | 1.66 |
| Shortle2006 | 2.54 |
| VSCORE-pair | 2.81 |
| QMEANall_atom | 2.9 |
| QMEAN6 | 3.26 |
| RF_CB_SRS | 3.46 |
| RF_CB_SRS_OD | 3.6 |
| RF_CB_OD | 3.65 |
| VSCORE combined | 3.79 |
| Liang_geometric | 3.94 |
| OPUS_PSP | 4.11 |
| RF_CB | 4.31 |
| RF_HA | 4.37 |
| QMEAN-torsion | 4.66 |
| NAMD | 4.96 |
| Shortle2005 | 5.19 |
| QMEAN-pairwise | 5.86 |
| DOPE | 5.97 |
| PROSA-pair | 6.02 |
To gauge the relative performance of pcSM on public decoys, we have adapted the comparison format of Tian et al. [117] in which 7 different scoring functions [118–122] were considered. This data set includes 32 different proteins from public decoys containing all the decoys of the corresponding systems. Results are shown in Table 4. The rank of the native is given under the name of each scoring function. The average rank of the native calculated by pcSM is 3.3, which asserts its better performance over the other scoring functions given in Table 4. Samudrala and coworkers, while introducing LoCo, recently compared 30 scoring functions [123–140]. Table 5 shows a comparison of pcSM with these on 77 public decoy sets. Details of the decoy sets in this study are provided in Supplementary Table S5. pcSM gives an average rank of 4.04 for the native, which is better than the rank obtained by all other scoring functions mentioned. Overall, the results in Tables 2 to 5 indicate that the pcSM scoring function shows a very high efficiency in bracketing the native.
4.2. Detection of near native structure
functions as shown in Table 3. Varieties of scoring functions [104–116] are considered in this comparative study. The performance evaluation of pcSM has been done on the same data set consisting of 143 systems having 2628 decoys. The best average rank of the native reported earlier was 1.66 by RF_HA_SRS. pcSM showed a better performance compared to all other scoring functions with average rank of the native at 1.30 as shown in the second column of Table 3.
We also evaluated the potential of pcSM for capturing near-native structures from the ensemble. Here we eliminate the true native from the pool of decoys to mimic the real scenario of protein structure prediction. We consider a structure with RMSD less than 5 Å as near native. After eliminating the native, we tested whether pcSM is able to pick any near-native structure in the top 5 or not. In this case we considered only those decoy sets in which at least one structure meets the near-native criterion (≤ 5 Å RMSD); there are 100 such decoy sets in CASP and 106 in the public decoy set. Out of these 206 systems we are able to pick a near-native structure in 177 systems, i.e. with 86% accuracy in the top 10.
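The near-native test can be expressed as a simple check on the RMSDs of the decoys in rank order. The Python sketch below is illustrative only: the function name and the RMSD values are hypothetical, and the last line simply reproduces the arithmetic behind the reported success rate.

```python
def near_native_in_top(rmsds_by_rank, k, cutoff=5.0):
    """True if any of the k best-ranked decoys has RMSD <= cutoff (in Å)."""
    return any(rmsd <= cutoff for rmsd in rmsds_by_rank[:k])


# Hypothetical RMSDs (Å) of decoys listed in rank order, native removed.
ranked = [7.2, 6.4, 4.1, 9.0]
print(near_native_in_top(ranked, k=2))   # no decoy within 5 Å among the first two
print(near_native_in_top(ranked, k=10))  # the third-ranked decoy qualifies

# Success rate reported in the text: 177 of 206 systems.
print(f"{177 / 206:.1%}")
```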
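Table 4 Comparison of pcSM with other scoring functions on data from Decoys ‘R’ Us. Entries under each scoring function are the rank of the native.

| Decoy set | System name | Sequence length | RAPDF | Atomic KBP | DFIRE-A | DFIRE-B | PC2CA | DFMAC | NCACO | pcSM |
|---|---|---|---|---|---|---|---|---|---|---|
| 4state_reduced | 1ctf | 68 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| | 1r69 | 63 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| | 1sn3 | 65 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| | 2cro | 65 | 1 | 1 | 1 | 2 | 1 | 1 | 1 | 1 |
| | 3icb | 75 | 1 | 1 | 4 | 24 | 1 | 1 | 1 | 3 |
| | 4pti | 58 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| | 4rxn | 54 | 1 | 1 | 1 | 19 | 667 | 1 | 1 | 1 |
| Fisa | 1fc2 | 43 | 497 | 413 | 254 | 1 | 1 | 399 | 461 | 18 |
| | 1hdd-C | 57 | 17 | 25 | 1 | 1 | 1 | 1 | 21 | 1 |
| | 2cro | 65 | 14 | 24 | 1 | 1 | 1 | 1 | 3 | 1 |
| | 4icb | 76 | 1 | 6 | 1 | 1 | 1 | 1 | 1 | 1 |
| fisa_casp3 | 1bg8-A | 76 | 1 | 2 | 1 | 1 | 1 | 14 | 44 | 1 |
| | 1bl0 | 99 | 1 | 215 | 1 | 3 | 1 | 8 | 3 | 1 |
| | 1jwe | 114 | 1 | 4 | 1 | 1 | 1 | 1 | 6 | 1 |
| Lmds | 1b0n-B | 31 | 359 | 74 | 430 | 261 | 1 | 1 | 1 | 5 |
| | 1bba | 36 | 501 | 500 | 501 | 501 | 501 | 501 | 497 | 26 |
| | 1ctf | 68 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| | 1dtk | 57 | 116 | 31 | 1 | 5 | 2 | 70 | 8 | 1 |
| | 1fc2 | 43 | 501 | 501 | 501 | 441 | 53 | 501 | 113 | 28 |
| | 1igd | 61 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| | 1shf-A | 59 | 1 | 2 | 1 | 1 | 1 | 1 | 1 | 1 |
| | 2cro | 65 | 416 | 175 | 1 | 1 | 1 | 1 | 1 | 1 |
| | 2ovo | 56 | 4 | 1 | 1 | 27 | 1 | 1 | 1 | 1 |
| | 4pti | 58 | 157 | 13 | 1 | 1 | 1 | 3 | 1 | 1 |
| lattice_ssfit | 1beo | 98 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| | 1ctf | 68 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| | 1dkt-A | 72 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| | 1fca | 55 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| | 1nkl | 78 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 2 |
| | 1pgb | 56 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| | 1trl-A | 62 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| | 4icb | 76 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| Avg. Rank_native | | | 81.37 | 62.59 | 53.65 | 40.8 | 39.09 | 47.53 | 36.8 | 3.3 |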
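The "Avg. Rank_native" row of Table 4 is the arithmetic mean of a column of native ranks. As a small Python check, the ranks listed under PC2CA for the 32 systems (grouped here by decoy set) can be averaged; the variable name is illustrative.

```python
from statistics import mean

# Rank of the native under PC2CA for the 32 systems, in the row order of Table 4.
pc2ca_ranks = [1, 1, 1, 1, 1, 1, 667,            # 4state_reduced
               1, 1, 1, 1,                       # Fisa
               1, 1, 1,                          # fisa_casp3
               1, 501, 1, 2, 53, 1, 1, 1, 1, 1,  # Lmds
               1, 1, 1, 1, 1, 1, 1, 1]           # lattice_ssfit

print(len(pc2ca_ranks), round(mean(pc2ca_ranks), 2))
```

A few badly ranked systems dominate such an average, which is why a scoring function that rarely places the native far down the list ends up with a much lower mean rank.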
In addition to the data referred to by Fiser's group [100], we also evaluated pcSM on a much larger data set of CASP5 to CASP9 consisting of 278 systems comprising 50,369 decoys (see Table 2).
the chance of missing the native in the final bracket. Although the decoys considered in this study were generated by different protocols, pcSM showed a good potential to pick the native/native-like structure (93.3% in top 5). Interestingly, the ability of pcSM to discriminate the native structures from the ensembles of structures is independent of the type, number, conformation and size of proteins. Supplementary Table S3 shows the efficiency of pcSM on public decoys. Here the 1fc2 and 1bba systems are a couple of exceptions in which the rank of the native is above 5. The reason for this is not completely understood. There are 6 cases shown in Table 4 where pcSM failed to capture the native at the topmost rank. In these cases some other structures were captured in place of the native; we therefore calculated their RMSDs, shown in Supplementary Table S6, which essentially convey that where the native is not ranked at first position, a structure within 5 Å RMSD is captured in the top 5 by pcSM. pcSM consists of six parameters, all of which are effective when implemented synergistically to detect the native. Supplementary Table S7 shows the rank of the native calculated on each parameter individually. It can be seen that the parameters complement each other in several cases. The ability to discriminate the native from decoys switches from one parameter to another very often. This conveys that each one of them is important for discriminating the native in general. There are some cases where the individual performance of each parameter is poor but the cumulative score (CS) is able to pick the native. All the six parameters considered in the cumulative score result in 21 unique pairs. The pair correlations are shown in Supplementary Table S8. The values of correlation have been categorized into four classes: (1) weak, (2) average, (3) good and (4) high. It may be seen from the table that most of the pairs (18 out of 21) are weakly correlated, indicating very low redundancy in the choice of parameters.
In three of the 21 pairs, viz. A3–A4, A3–M and A4–M, a large number of systems fall in the high correlation band, which would suggest excluding some of these parameters from the scoring function. However, when this is read together with the data in Supplementary Table S7, A3, A4 and M assign different ranks to the native despite their high correlation. Therefore high correlation between a pair of parameters does not necessarily imply similar native discrimination capability, and it is better to include all parameters in pcSM. Fig. 5 illustrates that pcSM consists of three pathways to reach FINAL-RUN, and detection of the native for a given decoy set requires all three. These are: path I (1 → 2.a → 3.a), path II (1 → 2.a → 3.b → 4) and path III (1 → 2.b). Each pathway does not necessarily include all parameters. We assess the importance of each parameter by using the “leave one out” method, where one parameter is excluded from pcSM and the rank of the native is calculated for each
Table 5
A comparative analysis of 30 scoring functions on 77 public decoy sets.

Public decoys (77 systems)
Scoring function      Avg. Ranknative
pcSM                  4.04
DFMAC                 6.7
LoCo                  13.4
RF_CB_SRS_OD          19.3
GKS                   28.5
Qp                    28.8
SKOb                  30.3
HLPL                  31.4
Qm                    31.6
SKOa                  33.1
SKJG                  34.1
Qa [119]              37.4
BT                    45.8
TD                    47.7
MJ3                   50.6
MJ3h                  52.1
MS                    54
MSBM                  54.2
TEs                   54.2
BFKV                  54.5
General-four-body     56.3
MJPL                  57.8
TS                    66.1
VD                    73.7
Tel                   80
Four-body             81.8
Short-range           87.5
MJ2h                  101.3
MJ1                   124.5
RO                    248.3
ProSa 2003            44
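The quantity compared in Table 5, the average rank of the native, can be illustrated with a short sketch. It assumes that a lower score is better and that ties are resolved in favour of the native; neither convention is stated in this section, and the function names are hypothetical.

```python
def native_rank(native_score, decoy_scores):
    """Rank of the native within its decoy set (1 = best).

    Assumes lower scores are better; decoys that tie with the native
    are not counted as ranking ahead of it.
    """
    return 1 + sum(1 for score in decoy_scores if score < native_score)


def average_native_rank(systems):
    """systems is a list of (native_score, decoy_scores) pairs, one per decoy set."""
    ranks = [native_rank(native, decoys) for native, decoys in systems]
    return sum(ranks) / len(ranks)
```

For two toy systems in which the native ranks first and third, `average_native_rank` gives 2.0; an entry in Table 5 would correspond to this average taken over the 77 public decoy sets.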
5. Discussion

A variety of scoring functions have been used in the past to detect the native, but owing to the size and structural diversity of the natives, there is always a probability that some natives slip out of the selection box. In this study, a scoring function (pcSM) based on physicochemical properties has been introduced, and its efficiency in capturing the native structure is evaluated on diverse decoy sets comprising 415 different systems with 142,698 decoys and representing a vast sample space of protein conformations. pcSM has four surface area descriptors along with a pairwise Cα distance metric, a secondary structure penalty and an all-atom non-bonded energy. Appropriate combination of all the above properties in the scoring function minimized
Table 6
Importance of parameters in pcSM pathways.

a) Path I
Capturing native              Excluding ‘E’   Excluding ‘P’   Excluding ‘A1’   Excluding ‘A2’
Number of systems affected    27              21              2                9

b) Path II
Capturing native              Excluding ‘A1’  Excluding ‘A2’  Excluding ‘A3’   Excluding ‘A4’   Excluding ‘M’   Excluding ‘P’
Number of systems affected    18              94              12               16               3               80

c) Path III
Capturing native              Excluding ‘A4’  Excluding ‘P’
Number of systems affected    75              67

Number of systems affected means the systems where the native is missed out.
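The “leave one out” counts in Table 6 can be illustrated with the following sketch. It is a deliberate simplification: it treats the score as a single weighted linear sum and counts a system as affected when the native is captured in the top k with all parameters but missed once one parameter is dropped. The actual pcSM pathways apply the parameters in stages, and the weights, the value of k and the function names used here are hypothetical.

```python
def cumulative_score(features, weights, skip=None):
    """Weighted linear sum of parameter values, optionally leaving one parameter out."""
    return sum(w * features[name] for name, w in weights.items() if name != skip)


def native_in_top_k(native, decoys, weights, k, skip=None):
    """True if fewer than k decoys score better (lower) than the native."""
    native_cs = cumulative_score(native, weights, skip)
    better = sum(
        1 for decoy in decoys if cumulative_score(decoy, weights, skip) < native_cs
    )
    return better < k


def systems_affected(systems, weights, k=5):
    """For each parameter, count systems whose native is lost when it is excluded.

    systems is a list of (native_features, list_of_decoy_features) pairs,
    where each features object maps parameter names to values.
    """
    affected = {}
    for name in weights:
        affected[name] = sum(
            1
            for native, decoys in systems
            if native_in_top_k(native, decoys, weights, k)
            and not native_in_top_k(native, decoys, weights, k, skip=name)
        )
    return affected
```

A parameter with a large count in such an analysis is one whose removal most often causes the native to be missed, which is how the entries of Table 6 are read in the text.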
A. Mishra et al. / Biochimica et Biophysica Acta 1834 (2013) 1520–1531
Fig. 7. Overall strength of pcSM. The figure shows the discriminating strength of native or near native structures by pcSM in top 5 (0 Å RMSD structure corresponds to native). X-axis: RMSD (Å), Y-axis: percentage coverage for given RMSD.
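A coverage curve of the kind plotted in Fig. 7 can be sketched as follows. The input layout is an assumption: for each system, the RMSD values of its structures are listed in the order in which the scoring function ranked them, with the native contributing an RMSD of 0 Å. The function name is hypothetical.

```python
def top_k_coverage(ranked_rmsds, thresholds, k=5):
    """Percentage of systems with a structure within each RMSD threshold in the top k.

    ranked_rmsds holds one list per system: the RMSD (in Å) of each structure,
    ordered from best-scored to worst-scored.
    """
    total = len(ranked_rmsds)
    coverage = {}
    for threshold in thresholds:
        hits = sum(1 for rmsds in ranked_rmsds if min(rmsds[:k]) <= threshold)
        coverage[threshold] = 100.0 * hits / total
    return coverage
```

With a threshold of 0 Å this counts only systems whose native is in the top k, and with 5 Å it also counts systems where a near-native structure is captured, which matches the two ends of the curve discussed in the text.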
of the paths, with Table 6 summarizing the results. It is seen that exclusion of any parameter diminishes the strength of pcSM: if we exclude any parameter from any path, there is a chance of missing the native in the top 10. Descriptors based on accessible surface areas do not fit well for proteins of small sequence length, which may assume non-globular conformations. Other filters such as P, M and E are able to capture the native in such cases. The secondary structure filter (P) has a limitation in cases where decoys are generated following secondary structure prediction methods. In this case all decoys exhibit the same secondary structure, but other filters such as area (A1, A2, A3, A4) and Euclidean distance (M) perform well in penalizing non-native structures and bracketing the best structure. The energy (E) filter works quite efficiently in those cases where the native is not in a compact form, or the nonpolar residues are not minimally exposed, but the arrangement of atoms is energetically favorable. The overall strength of pcSM in capturing native/native-like structures in the top 5 for 415 systems is shown in Fig. 7. The native (0 Å RMSD) is captured in 84.1% of the cases, and a structure within 5 Å RMSD of the native is captured in 93.3% of the cases by pcSM. To further analyze the efficacy of the pcSM protocol, we ran a one-tailed p-value test using Ranknative as the sample data set (Fig. 8). The rank of the
native is extracted from Supplementary Tables S3 and S4, distributed into bins and plotted against the number of systems. Fig. 8 shows that in 304 cases Ranknative is below 1.36, in 45 cases Ranknative lies between 1.36 and 5.0, in 27 systems Ranknative lies between 5 and 10, and in 25 cases the rank of the native is greater than 10. The size of each bin in the distribution is 0.3 SD (SD = standard deviation). The test validates pcSM with the rank of the native below 5 at the 99% confidence level. Protein structure prediction demands a robust scoring function to discriminate correctly folded structures from a large pool of decoys. pcSM, developed in this work, is a solid step in this direction. It is important to note that the current application of pcSM involves linear coefficients in the scoring function, whose values are determined and fixed by a relatively straightforward linear programming approach; there is scope for even better performance by tuning these coefficients further with more advanced approaches.

6. Usage

The pcSM program is available at http://www.scfbio-iitd.res.in/software/pcsm.jsp in a user-friendly form, where the user can upload either a single
Fig. 8. Ranknative distribution for 401 systems.
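The binning behind a distribution such as Fig. 8 can be sketched as below, using the stated bin size of 0.3 SD. Two details are assumptions: the sample standard deviation is used, and bins are counted from zero. The p-value test itself is not reproduced here, and the function name is hypothetical.

```python
import math
from collections import Counter
from statistics import stdev


def bin_ranks(ranks, fraction=0.3):
    """Distribute native ranks into bins of width fraction * SD.

    Returns the bin width and a mapping from bin index to number of systems,
    where bin i covers the interval [i * width, (i + 1) * width).
    """
    width = fraction * stdev(ranks)  # sample SD; an assumed choice
    counts = Counter(math.floor(rank / width) for rank in ranks)
    return width, dict(sorted(counts.items()))
```

Plotting the bin counts against the bin positions gives a histogram of Ranknative in which most systems would be expected to fall into the lowest bins.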
structure to evaluate the score or multiple structures to bracket the top 10 structures. The prerequisites of pcSM are as follows:
1. PDB file format of protein is supported.
2. Single chain PDB is preferred.
3. Multiple structures can be uploaded by keeping the structures in a single directory and compressing them into TAR zipped format.
4. In order to receive email notification, the user can provide an email address.

Supplementary data to this article can be found online at http://dx.doi.org/10.1016/j.bbapap.2013.04.023.

Acknowledgements

Program support to the Supercomputing Facility for Bioinformatics & Computational Biology (SCFBio), IIT Delhi from the Department of Biotechnology, Govt. of India and DST-JST Collaboration on Biogrid computing are gratefully acknowledged.

References

[1] S. Salzberg, D. Searls, S. Kasif, Grand challenges in computational biology, Computational Methods in Molecular Biology, Elsevier Science, 1998. [2] R. Unger, J. Moult, Bull. Math. Biol. 55 (1993) 1183–1198. [3] A.S. Fraenkel, Bull. Math. Biol. 55 (1993) 1199–1210. [4] L. Pauling, R.B. Corey, H.R. Branson, The structure of proteins: two hydrogen-bonded helical configurations of the polypeptide chain, Proc. Natl. Acad. Sci. U. S. A. 37 (1951) 205–211. [5] L. Pauling, R.B. Corey, The pleated sheet, a new layer configuration of polypeptide chains, Proc. Natl. Acad. Sci. U. S. A. 37 (1951) 251–256. [6] L. Pauling, R.B. Corey, Atomic coordinates and structure factors for two helical configurations of polypeptide chains, Proc. Natl. Acad. Sci. U. S. A. 37 (1951) 235–240. [7] P.Y. Chou, G.D. Fasman, Prediction of protein conformation, Biochemistry 13 (1974) 222–245. [8] T.Z. Sen, R.L. Jernigan, J. Garnier, A. Kloczkowski, GOR V server for protein secondary structure prediction, Bioinformatics 21 (2005) 2787–2788. [9] V.I. Lim, Structural principles of the globular organization of protein chains, a stereochemical theory of globular protein secondary structure, J. Mol. Biol. 88 (1974) 857–872. [10] W. 
Kabsch, C. Sander, Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features, Biopolymers 22 (1983) 2577–2637. [11] W. Kabsch, C. Sander, How good are predictions of protein secondary structure? FEBS Lett. 155 (1983) 179–182. [12] B. Rost, C. Sander, Prediction of protein secondary structure at better than 70% accuracy, J. Mol. Biol. 232 (1993) 584–599. [13] M.J. Zvelebil, Prediction of protein secondary structure and active sites using the alignment of homologous sequences, J. Mol. Biol. 195 (1987) 957–961. [14] A.A. Salamov, V.V. Solovyev, Protein secondary structure prediction using local alignments, J. Mol. Biol. 268 (1997) 31–36. [15] P.K. Mehta, A simple and fast approach to prediction of protein secondary structure from multiply aligned sequences with accuracy above 70%, Protein Sci. 4 (1995) 2517–2525. [16] C. Geourjon, G. Deleage, SOPM: a self-optimized method for protein secondary structure prediction, Protein Eng. 7 (1994) 157–164. [17] D.T. Jones, Protein secondary structure prediction based on position-specific scoring matrices, J. Mol. Biol. 292 (1999) 195–202. [18] C. Cole, J.D. Barber, The Jpred 3 secondary structure prediction server, Nucleic Acids Res. 36 (2008) W197–W201. [19] H. Frauenfelder, S. Sligar, P.G. Wolynes, The energy landscapes and motions of proteins, Science 254 (1991) 1598–1603. [20] J.D. Bryngelson, J.N. Onuchic, N.D. Socci, P.G. Wolynes, Funnels, pathways, and the energy landscape of protein folding: a synthesis, Proteins 21 (1995) 167–195. [21] C.B. Anfinsen, Principles that govern the folding of protein chains, Science 181 (1973) 223–230. [22] K.A. Dill, H.S. Chan, From Levinthal to pathways to funnels, Nat. Struct. Biol. 4 (1997) 10–19. [23] K.A. Dill, S.B. Ozkan, T.R. Weikl, J.D. Chodera, V.A. Voelz, The protein folding problem: when will it be solved? Curr. Opin. Struct. Biol. 17 (2007) 342–346. [24] P. 
Kollman, Free energy calculations: applications to chemical and biochemical phenomena, Chem. Rev. 93 (1993) 2395–2417. [25] W. Jorgensen, Free energy calculations: a breakthrough for modeling organic chemistry in solution, Acc. Chem. Res. 22 (1989) 184–189. [26] W.F. Van Gunsteren, H.J.C. Berendsen, Computer simulation of molecular dynamics: methodology, applications and perspectives in chemistry, Angew. Chem. Int. Ed Engl. 29 (1990) 992–1023.
[27] D.L. Beveridge, F.M. DiCapua, Free energy via molecular simulation: applications to chemical and biomolecular systems, Annu. Rev. Biophys. Biomol. Struct. 18 (1989) 431–492. [28] B. Jayaram, D. Sprous, D.L. Beveridge, Solvation free energy of biomacromolecules: parameters for a modified generalized born model consistent with the AMBER force field, J. Phys. Chem. B 102 (1998) 9571–9576. [29] P. Kalra, T.V. Reddy, B. Jayaram, Free energy component analysis for drug design: a case study of HIV-1 protease-inhibitor binding, J. Med. Chem. 44 (2001) 4325–4338. [30] B. Jayaram, D. Sprous, M.A. Young, D.L. Beveridge, Free energy analysis of the conformational preferences of A and B forms of DNA in solution, J. Am. Chem. Soc. 120 (1998) 10629–10633. [31] P. Narang, K. Bhushan, S. Bose, B. Jayaram, A computational pathway for bracketing native-like structures for small alpha helical globular proteins, Phys. Chem. Chem. Phys. 7 (2005) 2364–2375. [32] J. Pillardy, C. Czaplewski, A. Liwo, W.J. Wedemeyer, J. Lee, D.R. Ripoll, P. Arlukowicz, S. Oldziej, Y.A. Arnautova, H.A. Scheraga, Development of physics-based energy functions that predict medium-resolution structures for proteins of the α, β and α/β structural classes, J. Phys. Chem. B 105 (2001) 7299–7311. [33] K.A. Dill, Dominant forces in protein folding, Biochemistry 29 (31) (1990) 7133–7155. [34] B.N. Dominy, E.I. Shakhnovich, Native atom types for knowledge-based potentials: application to binding energy prediction, J. Med. Chem. 47 (2004) 4538–4558. [35] J. Shimada, A.I. Ischenko, E.I. Shakhnovich, Analysis of knowledge-based protein-ligand potentials using a self-consistent method, Protein Sci. 9 (2000) 765–775. [36] T. Lazaridis, M. Karplus, Effective energy functions for protein structure prediction, Curr. Opin. Struct. Biol. 10 (2000) 139–145. [37] M.J. Sippl, Knowledge-based potentials for proteins, Curr. Opin. Struct. Biol. 5 (1995) 229–235. [38] H. Lu, J. 
Skolnick, A distance-dependent atomic knowledge-based potential for improved protein structure selection, Proteins: Struct. Funct. Genet. 44 (2001) 223–232. [39] R.L. Jernigan, I. Bahar, Structure derived potential and protein simulation, Curr. Opin. Struct. Biol. 6 (1996) 195–209. [40] D. Mohanty, B.N. Dominy, A. Kolinski, C.L. Brooks III, J. Skolnick, Correlation between knowledge‐based and detailed atomic potentials: application to the unfolding of the GCN4 leucine zipper, Proteins: Struct. Funct. Genet. 35 (1999) 447–452. [41] L. Thukral, S.R. Shenoy, K. Bhusan, B. Jayaram, ProRegIn: a regularity index for the selection of native-like tertiary structures of proteins, J. Biosci. 32 (1) (2007) 71–81. [42] P.W. Rose, The RCSB Protein Data Bank: redesigned web site and web services, Nucleic Acids Res. 39 (2011) D392–D401. [43] O. Lund, M. Nielsen, C. Lundegard, P. Worning, CPHmodels 2.0: X3M a computer program to extract 3dmodels, Abstract at the CASP5 Conference, 2002, p. A102. [44] N. Guex, M.C. Peitsch, SWISS-MODEL and the Swiss-PdbViewer: an environment for comparative protein modeling, Electrophoresis 18 (1997) 2714–2723. [45] C. Lambert, N. Leonard, X. De Bolle, E. Depiereux, ESyPred3D: prediction of proteins 3D structures, Bioinformatics 18 (2002) 1250–1256. [46] A. Sali, T. Blundell, Comparative protein modeling by satisfaction of spatial restraints, J. Mol. Biol. 234 (1993) 779–815. [47] C. Combat, M. Jambon, G. Deleage, C. Geourjon, Geno3D: automatic comparative molecular modelling of protein, Bioinformatics 18 (2002) 213–214. [48] P.A. Bates, L.A. Kelley, R.M. MacCallum, M. Sternberg, Enhancement of protein modeling by human intervention in applying the automatic programs 3D-JIGSAW and 3D-PSSM, Proteins 45 (2001) 39–46. [49] B. Contreras-Moreira, P.A. Bates, Domain fishing: a first step in protein comparative modeling, Bioinformatics 18 (2002) 1141–1142. [50] A. Lobley, M.I. Sadowski, D.T. 
Jones, pGenTHREADER and pDomTHREADER: new methods for improved protein fold recognition and superfamily discrimination, Bioinformatics 25 (2009) 1761–1767. [51] L. Jaroszewski, L. Rychlewski, Z. Li, W. Li, A. Godzik, FFAS03: a server for profile– profile sequence alignments, Nucleic Acids Res. 33 (2005) W284–W288. [52] L.A. Kelley, M.J. Sternberg, Protein structure prediction on the Web: a case study using the Phyre server, Nat. Protoc. 4 (2009) 363–371. [53] H. Zhou, S.B. Pandit, J. Skolnick, Performance of the Pro-sp3-TASSER server in CASP8, Proteins 77 (Suppl. 9) (2009) S123–S127. [54] H. Chen, J. Skolnick, M-TASSER: an algorithm for protein quaternary structure prediction, Biophys. J. 94 (2008) 918–928. [55] W. Zhang, S. Liu, Y. Zhou, SP5: improving protein fold recognition by using torsion angle profiles and profile-based gap penalty model, PLoS One 3 (2008) e2325. [56] N. Fernandez-Fuentes, J.M. Dybas, A. Fiser, Structural characteristics of novel protein folds, PLoS Comput. Biol. 6 (4) (2010). [57] J. Cheng, A.Z. Randell, M.J. Sweredoski, P. Baldi, SCRATCH: a protein structure and structural feature, Nucleic Acids Res. 33 (2005) W72–W76. [58] B. Jayaram, K. Bhushan, S.R. Shenoy, P. Narang, S. Bose, P. Agrawal, D. Sahu, V. Pandey, Bhageerath: an energy based web enabled computer software suite for limiting the search space of tertiary structures of small globular proteins, Nucleic Acids Res. 34 (2006) 6195–6204. [59] L.H. Hung, S.C. Ngan, T. Liu, R. Samudrala, PROTINFO: new algorithms for enhanced protein structure prediction, Nucleic Acids Res. 33 (2005) W77–W80.
[60] C.A. Rohl, C.E. Strauss, K.M. Misura, D. Baker, Protein structure prediction using Rosetta, Meth. Enzymol. 383 (2004) 66–93. [61] D.E. Kim, D. Chivian, D. Baker, Protein structure prediction and analysis using the Robetta server, Nucleic Acids Res. 32 (2004) W526–W531. [62] T. Ishida, K. Kinoshita, PrDOS: prediction of disordered protein regions from amino acid sequence, Nucleic Acids Res. 35 (2007), (Web Server issue). [63] J.A. McCammon, B.R. Gelin, M. Karplus, Dynamics of folded proteins, Nature 267 (1977) 585–590. [64] A. Li, V. Daggett, Investigation of the solution structure of chymotrypsin inhibitor 2 using molecular dynamics: comparison to X-ray crystallographic and NMR data, Protein Eng. 8 (1995) 1117–1128. [65] V. Daggett, M. Levitt, Molecular dynamics simulation of the molten globule state, Proc. Natl. Acad. Sci. U. S. A. 89 (1992) 5142–5146. [66] M. Levitt, Molecular dynamics of native protein: I. Computer simulation of trajectories, J. Mol. Biol. 168 (1983) 595–620. [67] M. Levitt, Molecular dynamics of native protein. II. Analysis and nature of motion, J. Mol. Biol. 168 (1983) 621–657. [68] J. Tirado-Rives, W.L. Jorgensen, Molecular dynamics simulations of the unfolding of apomyoglobin in water, Biochemistry 32 (1993) 4175–4184. [69] E.M. Boczko, C.L. Brooks III, First principles calculation of the free energy surface for folding of a three helix bundle protein, Science 269 (1995) 393–396. [70] E. Demchuk, D. Bashford, D.A. Case, Dynamics of a type VI reverse turn in a linear peptide in aqueous solution, Fold. Des. 2 (1997) 35–46. [71] X. Daura, B. Jaun, D. Seebach, W.F. van Gunsteren, A.E. Mark, Reversible peptide folding in solution by molecular dynamics simulation, J. Mol. Biol. 280 (1998) 925–932. [72] Y. Duan, P.A. Kollman, Pathways to a protein folding intermediate observed in a 1-microsecond simulation in aqueous solution, Science 282 (1998) 740–744. [73] U. Mayor, C.M. Johnson, V. Daggett, A.R. 
Fersht, Protein folding and unfolding in microseconds to nanoseconds by experiment and simulation, Proc. Natl. Acad. Sci. U. S. A. 97 (2000) 13518–13522. [74] B. Zagrovic, E.J. Sorin, V.S. Pande, Beta-hairpin folding simulations in atomistic detail using an implicit solvent model, J. Mol. Biol. 313 (2001) 151–169. [75] C. Simmerling, B. Strockbine, A.E. Roitberg, All-atom structure prediction and folding simulations of a stable protein, J. Am. Chem. Soc. 124 (2002) 11258–11259. [76] C.D. Snow, B. Zagrovic, V.S. Pande, The Trp cage: folding kinetics and unfolded state topology via molecular dynamics simulations, J. Am. Chem. Soc. 124 (2002) 14548–14549. [77] P.L. Freddolino, F. Liu, M. Gruebele, K. Schulten, Ten-microsecond MD simulation of a fast-folding WW domain, Biophys. J. 94 (2008) 75–77. [78] P.L. Freddolino, S. Park, B. Roux, K. Schulten, Force field bias in protein folding simulations, Biophys. J. 96 (2009) 3772–3780. [79] V.A. Voelz, G.R. Bowman, K. Beauchamp, V.S. Pande, Molecular simulation of ab initio protein folding for a millisecond folder NTL9(1-39), J. Am. Chem. Soc. 132 (2010) 1526–1528. [80] D.E. Shaw, P. Maragakis, K. Lindorff-Larsen, S. Piana, R.O. Dror, M.P. Eastwood, J.A. Bank, J.M. Jumper, J.K. Salmon, Y. Shan, W. Wriggers, Atomic-level characterization of the structural dynamics of proteins, Science 330 (2010) 341–346. [81] K.A. Dill, S.B. Ozkan, M.S. Shell, T.R. Weikl, The protein folding problem, Annu. Rev. Biophys. 37 (2008) 289–316. [82] S. Cooper, F. Khatib, A. Treuille, J. Barbero, J. Lee, M. Beenen, A. Leaver-Fay, D. Baker, Z. Popović, Foldit players, Predicting protein structures with a multiplayer online game, Nature 466 (2010) 756–770. [83] J. Moult, K. Fidelis, A. Kryshtafovych, C. Venclovas, Progress over the first decade of CASP experiments, Proteins: Struct. Funct. Bioinform. 7 (2005) 225–236. [84] J.C. 
Wootton, Non-globular domains in protein sequences: automated segmentation using complexity measures, Comput. Chem. 18 (1994) 269–285. [85] P.J. Flory, Principles of Polymer Chemistry, Cornell University, 1953. 428–429. [86] C. Venclovas, A. Zemla, K. Fidelis, J. Moult, Assessment of progress over the CASP experiments, Proteins 53 (Suppl. 6) (2003) 585–595. [87] A. Fiser, R.K. Do, A. Sali, Modeling of loops in protein structures, Protein Sci. 9 (2000) 1753–1773. [88] A. Kryshtafovych, C. Venclovas, K. Fidelis, J. Moult, Progress over first decade of CASP experiments, Proteins 61 (2005) 225–236. [89] A. Mittal, B. Jayaram, S.R. Shenoy, T.S. Bawa, A Stoichiometry driven universal spatial organization of backbones of folded proteins: are there Chargaff's rules for protein folding? J. Biomol. Struct. Dyn. 28 (2) (2010) 133–142. [90] V. Soundararajan, R. Raman, S. Raguram, V. Sasisekharan, R. Sasisekharan, Atomic interaction networks in the core of protein domains and their native folds, PLoS One 5 (2) (2010) e9391, http://dx.doi.org/10.1371/journal.pone.0009391. [91] A. Mittal, B. Jayaram, Backbones of folded proteins reveal novel invariant amino-acid neighborhoods, J. Biomol. Struct. Dyn. 28 (4) (2011) 443–454. [92] J. Janin, Surface and inside volume in globular protein, Nature 277 (1979) 491–492. [93] C.C. Palliser, D.A.D. Parry, Quantitative comparison of the ability of hydropathy scales to recognize surface β-strands in proteins, Proteins 42 (2001) 243–255. [94] R.P. Bahadur, P. Chakrabarti, Discriminating the native structure from decoys using scoring functions based on the residue packing in globular proteins, BMC Bioinform. 9 (2009) 76. [95] J. Chen, W.E. Stites, Packing is a key selection factor in the evolution of protein hydrophobic cores, Biochemistry 40 (2001) 15280–15289. [96] R. Wolfenden, L. Anderson, P.M. Cullis, C.C. Southgate, Affinities of amino acid side chains for solvent water, Biochemistry 20 (1981) 849–855.
[97] S.B. Dixit, R. Bhasin, E. Rajasekaran, B. Jayaram, Solvation thermodynamics of amino acids: assessment of the electrostatic contribution and force-field dependence, J. Chem. Soc., Faraday Trans. 93 (6) (1997) 1105–1113. [98] J. Kyte, R.F. Doolittle, A simple method for displaying the hydropathic character of a protein, J. Mol. Biol. 157 (1982) 105–132. [99] B. Lee, F.M. Richards, The interpretation of protein structures: estimation of static accessibility, J. Mol. Biol. 55 (1971) 379–400. [100] P. Narang, K. Bhushan, S. Bose, B. Jayaram, Protein structure evaluation using all-atom energy based empirical scoring function, J. Biomol. Struct. Dyn. 23 (2006) 385–406. [101] N. Arora, B. Jayaram, Energetics of base pairs in B-DNA in solution: an appraisal of potential functions and dielectric treatments, J. Phys. Chem. 102 (1998) 6139–6144. [102] N. Arora, B. Jayaram, Strength of hydrogen bonds in alpha helices, J. Comput. Chem. 18 (1997) 1245–1252. [103] M.A. Young, B. Jayaram, D.L. Beveridge, Local dielectric environment of B-DNA in solution: results from a 14 nanosecond molecular dynamics trajectory, J. Phys. Chem. 102 (1998) 7666–7669. [104] D. Rykunov, A. Fiser, New statistical potential for quality assessment of protein models and a survey of energy functions, BMC Bioinform. 11 (2010) 28. [105] Q. Fang, D. Shortle, Protein refolding in silico with atom-based statistical potentials and conformational search using a simple genetic algorithm, J. Mol. Biol. 359 (5) (2006) 1456. [106] B.J. McConkey, V. Sobolev, M. Edelman, Discrimination of native protein structures using atom–atom contact scoring, Proc. Natl. Acad. Sci. U. S. A. 100 (6) (2003) 3215. [107] P. Benkert, M. Kunzli, T. Schwede, QMEAN server for protein model quality estimation, Nucleic Acids Res. (2009), http://dx.doi.org/10.1093/nar/gkp322. [108] P. Benkert, S.C. Tosatto, D. Schomburg, QMEAN: a comprehensive scoring function for model quality assessment, Proteins 71 (1) (2008) 261–277. [109] B.J. 
McConkey, V. Sobolev, M. Edelman, Discrimination of native protein structures using atom–atom contact scoring, Proc. Natl. Acad. Sci. U. S. A. 100 (6) (2003) 3215. [110] J. Zhang, R. Chen, J. Liang, Empirical potential function for simplified protein models: combining contact and local sequence-structure descriptors, Proteins: Struct. Funct. Bioinform. 63 (4) (2006) 949–960. [111] M. Lu, A.D. Dousis, J. Ma, OPUS-PSP: an orientation-dependent statistical all-atom potential derived from side-chain packing, J. Mol. Biol. 376 (1) (2008) 288–301. [112] D. Rykunov, A. Fiser, Effects of amino acid composition, finite size of proteins, and sparse statistics on distance-dependent statistical pair potentials, Proteins: Struct. Funct. Bioinform. 67 (3) (2007) 59–568. [113] J.C. Phillips, R. Braun, W. Wang, J. Gumbart, E. Tajkhorshid, E. Villa, C. Chipot, R.D. Skeel, L. Kale, K. Schulten, Scalable molecular dynamics with NAMD, J. Comput. Chem. 26 (16) (2005) 781–1802. [114] Q. Fang, D. Shortle, A consistent set of statistical potentials for quantifying local side-chain and backbone interactions, Proteins 60 (1) (2005) 90. [115] M.Y. Shen, A. Sali, Statistical potential for assessment and prediction of protein structures, Protein Sci. 15 (11) (2006) 2507–2524. [116] M.J. Sippl, Recognition of errors in three-dimensional structures of proteins, Proteins 17 (4) (1993) 355. [117] Liqing Tian, Wu. Aiping, Yang Cao, Xiaoxi Dong, Hu. Yun, Taijiao Jiang, NCACO-score: an effective main-chain dependent scoring function for structure modeling, BMC Bioinform. 1 (2) (2011) 208. [118] Y. Makino, N. Itoh, A knowledge-based structure-discriminating function that requires only main-chain atom coordinates, BMC Struct. Biol. 8 (2008) 46. [119] H. Zhou, Y. Zhou, Distance-scaled, finite ideal-gas reference state improves structure-derived potentials of mean force for structure selection and stability prediction, Protein Sci. 11 (11) (2002) 2714–2726. [120] H. Lu, J. 
Skolnick, A distance-dependent atomic knowledge-based potential for improved protein structure selection, Proteins 44 (2001) 223–232. [121] R. Samudrala, J. Moult, An all-atom distance-dependent conditional probability discriminatory function for protein structure prediction, J. Mol. Biol. 275 (5) (1998) 895–916. [122] A. Godzik, A. Kolinski, J. Skolnick, Are proteins ideal mixtures of amino acids? Analysis of energy parameter sets, Protein Sci. 4 (10) (1995) 2107–2117. [123] U. Bastolla, J. Farwer, E.W. Knapp, M. Vendruscolo, How to guarantee optimal stability for most representative structures in the Protein Data Bank, Proteins 44 (2) (2001) 79–96. [124] J. Skolnick, L. Jaroszewski, A. Kolinski, A. Godzik, Derivation and testing of pair potentials for protein folding. When is the quasichemical approximation correct? Protein Sci. 6 (3) (1997) 676–688. [125] B. Park, M. Levitt, Energy functions that discriminate X-ray and near native folds from well-constructed decoys, J. Mol. Biol. 258 (2) (1996) 367–392. [126] J. Skolnick, A. Kolinski, A. Ortiz, Derivation of protein-specific pair potentials based on weak sequence fragment similarity, Proteins 38 (1) (2000) 3–16. [127] M.R. Betancourt, D. Thirumalai, Pair potentials for protein folding: choice of reference states and sensitivity of predicted native states to variations in the interaction schemes, Protein Sci. 8 (2) (1999) 361–369. [128] P.D. Thomas, K.A. Dill, An iterative method for extracting energy-like quantities from protein structures, Proc. Natl. Acad. Sci. U. S. A. 93 (21) (1996) 11628–11633. [129] S. Miyazawa, R.L. Jernigan, Estimation of effective inter residue contact energies from protein crystal structures: quasi-chemical approximation, Macromolecules 18 (3) (1985) 534–552.
[130] S. Miyazawa, R.L. Jernigan, Residue-residue potentials with a favorable contact pair term and an unfavorable high packing density term, for simulation and threading, J. Mol. Biol. 256 (3) (1996) 623–644. [131] S. Miyazawa, R.L. Jernigan, Self-consistent estimation of inter-residue protein contact energies based on an equilibrium mixture approximation of residues, Proteins 34 (1) (1999) 49–68. [132] L.A. Mirny, E.I. Shakhnovich, How to derive a protein folding potential? A new approach to an old problem, J. Mol. Biol. 264 (5) (1996) 1164–1179. [133] K.T. Simons, I. Ruczinski, C. Kooperberg, B.A. Fox, C. Bystroff, D. Baker, Improved recognition of native-like protein structures using a combination of sequence-dependent and sequence-independent features of proteins, Proteins 34 (1) (1999) 82–95. [134] K.T. Simons, C. Kooperberg, E. Huang, D. Baker, Assembly of protein tertiary structures from fragments with similar local sequences using simulated annealing and Bayesian scoring functions, J. Mol. Biol. 268 (1) (1997) 209–225.
[135] D. Tobi, G. Shafran, N. Linial, R. Elber, On the design and analysis of protein folding potentials, Proteins 40 (1) (2000) 71–85. [136] Y. Feng, A. Kloczkowski, R.L. Jernigan, Four-body contact potentials derived from two protein datasets to discriminate native structures from decoys, Proteins 68 (1) (2007) 57–66. [137] S. Tanaka, H.A. Scheraga, Medium- and long-range interaction parameters between amino acids for predicting three-dimensional structures of proteins, Macromolecules 9 (6) (1976) 945–950. [138] M. Vendruscolo, E. Domany, Pairwise contact potentials are unsuitable for protein folding, J. Chem. Phys. 109 (24) (1998) 11101–11108. [139] I. Bahar, M. Kaplan, R.L. Jernigan, Short-range conformational energies, secondary structure propensities, and recognition of correct sequence-structure matches, Proteins 29 (3) (1997) 292–308. [140] B. Robson, D.J. Osguthorpe, Refined models for computer simulation of protein folding. Applications to the study of conserved secondary structure and flexible hinge points during the folding of pancreatic trypsin inhibitor, J. Mol. Biol. 132 (1) (1979) 19–51.