An Evaluation Of Protein Secondary Structure Prediction Algorithms

Georgios Pappas, Jr. and Shankar Subramaniam
Department of Physiology and Biophysics, Beckman Institute, University of Illinois, Urbana, Illinois 61801
I. Introduction

Over the past years several algorithms have been developed to predict the secondary structure of proteins, based on very distinct theoretical approaches (Fasman, 1989; Eisenhaber et al., 1995). This list keeps growing incessantly, with authors eager to improve the predictive power of their methods. This frantic search exposes the current status of prediction accuracy, which is far from what would be needed to make reasonable inferences about tertiary structure, although it does not invalidate the use of these methods as rough starting points for modeling purposes (Rost and Sander, 1995; Schulz, 1987). The level of success reported by the authors ranges from 60% to 72%. Most of these results represent an overestimation due to incomplete cross-validation (Holley and Karplus, 1989) or to the lack of a reasonable number of test cases (Burgess et al., 1974). All the methods developed so far try to extract information, directly or indirectly (Lim, 1974), from the ever-growing databases of protein structures solved by X-ray crystallography. Unfortunately, the rate at which new structures are added to the structure databases is far from optimal. Chothia (1992) estimated that all proteins, once their structures are known, would fall into about one thousand folding classes, more than half of them yet to be discovered. If so, a great deal of the information contained in forthcoming structures is not yet available to the current methods, and we must therefore still rely on the future to see a coherent and realistic increase in the accuracy of secondary structure prediction methods.

Comparative analysis of the performance of various algorithms has been carried out in the past (Kabsch and Sander, 1983b). However, this task can be deceptive if factors such as the selection of proteins for the testing set and the choice of the scoring index are not handled properly. The present work aims to provide an updated evaluation of several predictive methods with a testing set whose size permits more accurate statistics, which in turn can measure the usefulness of the information gathered by those methods and identify trends that characterize the behavior of individual algorithms. Further, we present a uniform testing of these methods, vis-à-vis the size of the datasets, the measure of accuracy and proper cross-validation procedures.
II. Material And Methods

A. Secondary Structure Prediction Algorithms

Algorithms for secondary structure prediction are based upon diverse theoretical approaches. There are three mainstream classes of methods (Garnier and Levin, 1991):
1. Statistical: These rely on the assumption that amino acids have intrinsic propensities for the formation of a specific type of secondary structure. This information is collected by analysis of proteins with known tertiary structure, using either simple statistical principles (Chou and Fasman, 1974) or more elaborate ones such as information theory (Garnier et al., 1978). A minimal propensity-based sketch is shown after Table I.
2. Neural networks: These are highly nonlinear pattern recognition devices that mimic the organization of nervous systems. They are trained by adaptively learning a set of patterns and can extract high-order features of the input space, with the ability to generalize to unknown inputs.
3. Sequence similarity: These are based on the comparison of the protein to be predicted with an available database of known structures. The prediction is made by assigning the secondary structure of the database fragment that displays the highest sequence similarity to a segment of the test protein.
Other variations utilizing different methodologies often appear, but with no relative gain in accuracy. These include methods based on hidden Markov models (Sasagawa and Tajima, 1993), stereochemical principles (Lim, 1974) and statistical mechanics (Ptitsyn and Finkelstein, 1983), to cite just a few.
From the large list of available algorithms for secondary structure prediction, nine were selected to represent the main classes described above. They were chosen mainly because they are the most often cited in the literature and because they permit a relatively safe implementation as a computer program. The selected methods are summarized in Table I.
Table I. List of secondary structure prediction methods utilized

Method Code   Type                  Reference
BPS           Statistical           Burgess et al. (1974)
C_F           Statistical           Chou and Fasman (1974)
D_R           Statistical           Deleage and Roux (1987)
G_G           Statistical           Gascuel and Golmard (1988)
GGR           Information Theory    Gibrat et al. (1987)
GOR           Information Theory    Garnier et al. (1978)
H_K           Neural Networks       Holley and Karplus (1989)
L_G           Sequence Similarity   Levin and Garnier (1988)
Q_S           Neural Networks       Qian and Sejnowski (1988)
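A minimal sketch of the statistical class of methods is given below, assuming a simple sliding-window scheme in which each residue receives the state with the highest window-averaged propensity. The propensity values, window size and threshold are illustrative placeholders, loosely inspired by published helix and strand preferences; they do not reproduce the parameter set of any method in Table I.

```cpp
#include <cstdio>
#include <map>
#include <string>

// Illustrative helix/strand propensities for a few residue types. Real methods
// derive such values from a database of solved structures; the numbers below
// are placeholders for demonstration only.
struct Propensity { double helix, strand; };
static const std::map<char, Propensity> kProp = {
    {'A', {1.4, 0.8}}, {'E', {1.5, 0.4}}, {'L', {1.2, 1.3}},
    {'V', {1.1, 1.7}}, {'G', {0.6, 0.8}}, {'P', {0.6, 0.6}},
    {'S', {0.8, 0.8}}, {'K', {1.2, 0.7}}, {'I', {1.1, 1.6}},
};

int main() {
    const std::string seq = "AEKLVVGIPSAEEL";   // toy sequence
    const int half = 3;                          // window of 2*half+1 residues
    std::string pred(seq.size(), 'C');           // default everything to coil

    for (size_t i = 0; i < seq.size(); ++i) {
        double h = 0.0, e = 0.0;
        int n = 0;
        for (int d = -half; d <= half; ++d) {
            const int j = static_cast<int>(i) + d;
            if (j < 0 || j >= static_cast<int>(seq.size())) continue;
            const auto it = kProp.find(seq[j]);
            if (it == kProp.end()) continue;     // unknown residue type: skip
            h += it->second.helix;
            e += it->second.strand;
            ++n;
        }
        if (n == 0) continue;
        h /= n;
        e /= n;
        // Assign the state whose averaged propensity is favored and above 1.0.
        if (h > 1.0 && h >= e)     pred[i] = 'H';
        else if (e > 1.0 && e > h) pred[i] = 'E';
    }
    std::printf("seq : %s\npred: %s\n", seq.c_str(), pred.c_str());
    return 0;
}
```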
A software package called MultPred (Multiple Predictions) was developed in C++ and implements all but the C_F (Chou and Fasman, 1974) and GGR (Gibrat et al., 1987) methods, which were taken from the program ANTHEPROT (Deleage et al., 1988). Additionally, a joint prediction scheme (JOI, joint prediction) was utilized, in which the predictions from the different methods were combined and the structure predicted by the majority was assigned to the respective residue, in the same fashion implemented by Nishikawa and Noguchi (1991). Some methods provide three-state predictions (i.e., locating helices, sheets and coil regions) and others four-state predictions (the former plus β-turns). For the present analysis only three-state predictions were analyzed, and the four-state predictions were transformed to three-state ones by assigning the coil state to predicted turn regions. In the case of the BPS method (Burgess et al., 1974) the secondary structural propensities of the amino acids were recalculated following the original paper, because the original values were based on just 9 proteins. For the L_G method (Levin and Garnier, 1988) the database used to make the prediction was the one used in this work (see below) instead of the original given in the paper. This was necessary to avoid overestimated predictions owing to high sequence homology between the two datasets. However, we note that the L_G parameters are not optimized for the current protein database.
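The joint scheme can be illustrated by a simple per-residue majority vote over the individual three-state predictions, after collapsing any predicted turns to coil. The sketch below is a minimal illustration and not the actual MultPred code; in particular, the tie-breaking rule (defaulting to coil) is an assumption.

```cpp
#include <array>
#include <cstdio>
#include <string>
#include <vector>

// Collapse a four-state prediction (H, E, T, C) to three states by mapping
// predicted turns (T) to coil, as described in the text.
static char toThreeState(char s) { return s == 'T' ? 'C' : s; }

// Majority vote across methods for a single residue. Ties fall back to coil,
// which is an assumption; the original tie-breaking rule is not specified here.
static char majority(const std::vector<char>& votes) {
    std::array<int, 3> count{};                    // counts for H, E, C
    for (char v : votes) {
        if (v == 'H') ++count[0];
        else if (v == 'E') ++count[1];
        else ++count[2];
    }
    if (count[0] > count[1] && count[0] > count[2]) return 'H';
    if (count[1] > count[0] && count[1] > count[2]) return 'E';
    return 'C';
}

int main() {
    // Illustrative predictions for the same chain from three methods.
    const std::vector<std::string> preds = {
        "CCHHHHHHTTEEEECC",
        "CCHHHHHCCCEEEECC",
        "CCCHHHHHCCEEETCC",
    };

    const size_t len = preds.front().size();
    std::string joint(len, 'C');
    for (size_t i = 0; i < len; ++i) {
        std::vector<char> votes;
        for (const std::string& p : preds) votes.push_back(toThreeState(p[i]));
        joint[i] = majority(votes);
    }
    std::printf("JOI: %s\n", joint.c_str());
    return 0;
}
```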
B. Protein Database Selection

The testing set is composed of 148 proteins with resolution better than 2.4 Å and less than 25% sequence homology to each other (Hobohm et al., 1992). The total number of residues analyzed was 36229. Secondary structure assignments were taken from the DSSP program (Kabsch and Sander, 1983a) and were transformed to a three-state form according to the rules given by Levin and Garnier (1988); a minimal sketch of such a reduction is shown after Table II. Protein class assignments were based on the SCOP database (Murzin et al., 1995), dividing the set into 31 all-alpha, 31 all-beta, 51 alpha/beta, 21 alpha+beta and 14 irregular or multi-domain proteins. The relative secondary structure composition for each class is given below:
Table II. Relative composition in terms of secondary structure for the current database, distributed by protein class

Protein Class   Number of proteins   α-helix (%)   β-sheet (%)   Coil (%)
All Proteins    148                  31.93         19.88         48.19
All-Alpha        31                  56.74          3.22         40.04
All-Beta         31                   8.64         38.39         52.98
Alpha/Beta       51                  35.78         17.65         46.57
Alpha+Beta       21                  26.63         23.77         49.60
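The reduction of DSSP assignments to three states can be sketched as below. The particular eight-to-three mapping used here (H, G, I to helix; E, B to strand; everything else to coil) is a commonly used convention and is an assumption; the exact rules of Levin and Garnier (1988) applied in this work may differ in detail.

```cpp
#include <cstdio>
#include <string>

// Reduce an eight-state DSSP string to the three states H, E and C. The
// mapping below is a common convention and only an assumption here; it is not
// guaranteed to match the Levin and Garnier (1988) rules used in the paper.
static std::string dsspToThreeState(const std::string& dssp) {
    std::string out(dssp.size(), 'C');
    for (size_t i = 0; i < dssp.size(); ++i) {
        switch (dssp[i]) {
            case 'H': case 'G': case 'I': out[i] = 'H'; break;  // helices
            case 'E': case 'B':           out[i] = 'E'; break;  // strands, bridges
            default:                      out[i] = 'C'; break;  // turns, bends, coil
        }
    }
    return out;
}

int main() {
    const std::string dssp = "  TTHHHHHHGGG SEEEEE TTS EEEE  ";
    std::printf("DSSP   : %s\n3-state: %s\n",
                dssp.c_str(), dsspToThreeState(dssp).c_str());
    return 0;
}
```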
C. Accuracy Measurements

One key element in the performance analysis of secondary structure prediction methods is the proper selection of the accuracy measure to be employed. Three different types of predictive accuracy measures were used (Schulz and Schirmer, 1979):
1. Q3: calculated as the number of correctly predicted residues divided by the total number of amino acids. This index often produces high values because it does not penalize over-prediction.
2. Matthews correlation coefficient (Cs): the correlation coefficient between predicted and observed assignments, accounting for both positive and negative predictions. This index is specific to a particular structure s, and the formula is

$$C_{s} = \frac{wx - yz}{\sqrt{(w+y)(w+z)(x+y)(x+z)}}$$

where
s = α, β or coil
w = number of residues correctly predicted in structure s
x = number of residues correctly predicted as not being in structure s
y = number of residues under-predicted for structure s
z = number of residues over-predicted for structure s

3. Entropy-related information: This measure was introduced by Rost and Sander (1993) and is related to the probability of deviation between a random prediction and the actual prediction. Unlike Q3, its value is affected by over- and under-predictions, and it therefore provides a more reliable estimate of the significance of the accuracy. The formula is

$$\mathrm{Info} = 1 - \frac{\displaystyle\sum_{i=1}^{3} a_{i}\ln a_{i} - \sum_{i,j=1}^{3} A_{ij}\ln A_{ij}}{\displaystyle N\ln N - \sum_{i=1}^{3} b_{i}\ln b_{i}}$$

where
N = number of residues
a_i = number of residues predicted to be in secondary structure i
b_i = number of residues observed to be in secondary structure i
A_ij = number of residues predicted to be in i and observed to be in j
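A minimal sketch of how these three indices can be computed from paired predicted and observed three-state strings is given below. Function and variable names are illustrative and do not correspond to the MultPred implementation.

```cpp
#include <array>
#include <cmath>
#include <cstdio>
#include <string>

// Map a three-state character (H, E, anything else = coil) to an index 0..2.
static int stateIndex(char s) {
    if (s == 'H') return 0;   // helix
    if (s == 'E') return 1;   // strand
    return 2;                 // coil
}

int main() {
    // Illustrative observed/predicted strings; in practice these come from the
    // DSSP assignment and from one prediction method, respectively.
    const std::string obs  = "CCHHHHHHCCEEEECCHHHHCC";
    const std::string pred = "CCHHHHCCCCEEECCCHHHCCC";

    const int N = static_cast<int>(obs.size());
    std::array<std::array<double, 3>, 3> A{};   // A[i][j]: predicted i, observed j
    for (int k = 0; k < N; ++k)
        A[stateIndex(pred[k])][stateIndex(obs[k])] += 1.0;

    // Q3: fraction of residues whose predicted state matches the observed one.
    const double q3 = 100.0 * (A[0][0] + A[1][1] + A[2][2]) / N;

    // Matthews correlation coefficient for each state s.
    double C[3];
    for (int s = 0; s < 3; ++s) {
        const double w = A[s][s];                 // correctly predicted in s
        double y = 0.0, z = 0.0;                  // under- and over-predictions
        for (int t = 0; t < 3; ++t)
            if (t != s) { y += A[t][s]; z += A[s][t]; }
        const double x = N - w - y - z;           // correctly predicted not-s
        const double d = std::sqrt((w + y) * (w + z) * (x + y) * (x + z));
        C[s] = d > 0.0 ? (w * x - y * z) / d : 0.0;
    }

    // Information index: 1 - H(observed | predicted) / H(observed).
    const auto xlnx = [](double v) { return v > 0.0 ? v * std::log(v) : 0.0; };
    double num = 0.0, den = xlnx(N);
    for (int i = 0; i < 3; ++i) {
        const double a = A[i][0] + A[i][1] + A[i][2];   // predicted in state i
        const double b = A[0][i] + A[1][i] + A[2][i];   // observed in state i
        num += xlnx(a);
        den -= xlnx(b);
        for (int j = 0; j < 3; ++j) num -= xlnx(A[i][j]);
    }
    const double info = den > 0.0 ? 1.0 - num / den : 0.0;

    std::printf("Q3 = %.1f%%  Ca = %.2f  Cb = %.2f  Cc = %.2f  Info = %.2f\n",
                q3, C[0], C[1], C[2], info);
    return 0;
}
```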
D. Multidimensional Scaling

To help visualize how the secondary structure prediction methods relate to each other, the statistical technique called multidimensional scaling (MDS) was utilized. Given a matrix of dissimilarities between objects (in our case, the algorithms for secondary structure prediction), MDS finds a low-dimensional (here, two-dimensional) representation of the data, with one point per object, such that the distances in the new coordinate system match as closely as possible the original distances provided in the dissimilarity matrix (Cox and Cox, 1994). For this kind of analysis one of the crucial steps is the definition of the dissimilarity between two predictive algorithms. Q3 values and Matthews' correlation coefficients were calculated for all proteins in the database, resulting in an accuracy vector for each predictive method and each index (^mX, where m denotes the predictive method and the index is Q3, Cα, Cβ or Cc, with one component per protein). Given those vectors, dissimilarity matrices were calculated for each accuracy index over all predictive methods using Guttman's μ2 coefficient (Guttman, 1968), which is a measure of simple monotonic relationship between variables and is given by
$$\mu_{r,s} = \frac{\displaystyle\sum_{i,j}\left({}^{r}X_{i} - {}^{r}X_{j}\right)\left({}^{s}X_{i} - {}^{s}X_{j}\right)}{\displaystyle\sum_{i,j}\left|{}^{r}X_{i} - {}^{r}X_{j}\right|\,\left|{}^{s}X_{i} - {}^{s}X_{j}\right|}$$

where
μ_{r,s} = dissimilarity between methods r and s
^mX_i = accuracy index m for protein i
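A minimal sketch of how the dissimilarity matrices could be assembled from per-protein accuracy vectors is given below, assuming the standard form of Guttman's μ2 coefficient. Converting the coefficient into a distance-like quantity (1 - μ2 here) before feeding it to MDS is an interpretive assumption; all names and values are illustrative.

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

// Guttman's mu2 monotonicity coefficient between two accuracy vectors with one
// entry per protein. Values near 1 indicate a strong monotone relationship.
static double guttmanMu2(const std::vector<double>& x, const std::vector<double>& y) {
    double num = 0.0, den = 0.0;
    for (size_t i = 0; i < x.size(); ++i)
        for (size_t j = 0; j < x.size(); ++j) {
            const double dx = x[i] - x[j], dy = y[i] - y[j];
            num += dx * dy;
            den += std::fabs(dx) * std::fabs(dy);
        }
    return den > 0.0 ? num / den : 0.0;
}

int main() {
    // Illustrative per-protein Q3 values (as fractions) for three hypothetical
    // methods over five proteins; the real vectors hold 148 entries each.
    const std::vector<std::vector<double>> acc = {
        {0.55, 0.60, 0.48, 0.62, 0.51},   // method A
        {0.57, 0.63, 0.50, 0.60, 0.52},   // method B
        {0.49, 0.52, 0.58, 0.47, 0.61},   // method C
    };

    const size_t m = acc.size();
    // Dissimilarity matrix to feed into multidimensional scaling. Using
    // 1 - mu2, so that strongly monotone-related methods end up close
    // together, is an assumption about how the coefficient enters the MDS.
    std::vector<std::vector<double>> diss(m, std::vector<double>(m, 0.0));
    for (size_t r = 0; r < m; ++r)
        for (size_t s = 0; s < m; ++s)
            diss[r][s] = 1.0 - guttmanMu2(acc[r], acc[s]);

    for (size_t r = 0; r < m; ++r) {
        for (size_t s = 0; s < m; ++s) std::printf("%6.3f ", diss[r][s]);
        std::printf("\n");
    }
    return 0;
}
```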
III. Results And Discussion

A. Analysis Of Predictive Accuracy

The first step in the analysis was to obtain the secondary structure predictions for the 148 proteins in the test database with the selected methods. The accuracy results in terms of the Q3 index are given in Table III.
Table III. Newly calculated accuracies and the accuracies reported in the original papers, in terms of Q3

Method   New Accuracy (Q3, %)   Original Accuracy (Q3, %)
BPS      53.1 ± 7.5             61.3
C_F      55.5 ± 8.7             59.2
D_R      59.6 ± 9.4             61.3
G_G      55.3 ± 7.5             62.3
GGR      60.4 ± 8.1             63.0
GOR      57.4 ± 8.4             58.0
H_K      60.1 ± 8.4             63.2
L_G      55.2 ± 9.7             63.0
Q_S      59.6 ± 8.7             64.3
When analyzing the Q3 values it is clear that they are lower than the ones claimed by the authors in the original papers. This can basically be due to two factors: poor statistics as a consequence of a small number of test proteins, and lack of cross-validation of the results. The discrepancy between the reported values and the newly calculated ones is variable, indicating different degrees of generalization attained by each method. Additionally, it must be kept in mind that the database used in this work may contain proteins used in the training sets of some of the methods, which induces an overestimation of the accuracies. Nevertheless, all methods showed decreased accuracies.

However, despite Q3 being the most popular and widespread measure, it suffers from serious problems in terms of providing a reliable and significant accuracy estimate. The main drawback of Q3 is that it does not take into account under- and over-predictions, failing to capture the real significance of the results. For example, if we predict all the residues in the test database as coil, an average Q3 value of 48.19% is obtained, but the correlation coefficients and information values will be null. As an alternative way to analyze the accuracies it is possible to use the average Matthews' correlation coefficients and information values reported in Table IV for all predicted proteins. The use of these two measures is very scarce in the secondary structure prediction literature, despite their obvious superiority over Q3. In one of the few publications that utilize Matthews' correlation coefficients, Holley and Karplus (1989) reported values of Cα=41%, Cβ=32% and Cc=36%. In the new analysis those values decreased appreciably (Cα=32%, Cβ=25% and Cc=31%), clearly indicating poor generalization of the method to a larger set of proteins. This reinforces an important point for secondary structure evaluation, already noted by Rost and Sander (1994): the need for a testing set that is representative in terms of size and structural composition, permitting reliable statistical information to be gathered from the results.
Table IV. Average Matthews' correlation coefficients and information values for the 148 chains in the testing set. Standard deviations are shown in parentheses

Method   Cα (%)          Cβ (%)          Cc (%)          Information (%)
BPS      24.08 (16.51)    8.23 (5.34)    20.00 (12.56)   13.73 (14.22)
C_F      24.33 (18.61)   21.24 (17.12)   29.25 (12.38)   11.10 (6.69)
D_R      29.15 (14.98)   23.27 (16.68)   36.58 (11.65)   12.85 (6.30)
G_G      20.68 (13.49)   28.23 (18.04)   23.32 (11.27)    9.63 (5.07)
GGR      32.15 (16.92)   26.87 (17.98)   36.17 (11.87)   14.08 (6.68)
GOR      29.31 (17.32)   26.16 (17.35)   34.65 (11.66)   13.71 (6.75)
H_K      31.96 (19.66)   24.62 (16.69)   30.99 (13.10)   13.01 (7.34)
L_G      30.56 (12.75)   25.64 (17.54)   33.67 (10.22)   12.69 (5.31)
Q_S      28.66 (19.34)   22.63 (17.46)   33.31 (13.21)   12.85 (9.02)
JOI      34.68 (18.71)   27.67 (17.86)   35.44 (12.56)   14.97 (8.32)
To further extend the analysis, accuracies were measured in terms of correlation coefficients and information values independently for each protein structural class, in order to check whether there are biases particular to specific chain folds. The results are shown in Figures 1 and 2.

[Figure 1. Predictive accuracy in terms of information and Cα for each algorithm. The values are averaged over the respective class of proteins.]
[Figure 2. Variation of predictive accuracies (average) according to the protein class, as measured by the Cβ and Cc values.]
The first observation from this kind of analysis is that, for all the measures utilized, the behavior of the predictive methods varies significantly according to the protein fold family. This can be relevant in pointing out which method performs best for the prediction of a given structural element depending on the protein class. Conversely, it is also possible to diagnose critical points where the algorithms fail. From the information values in Figure 1 it can be observed that, in general, the prediction for the all-alpha and alpha+beta classes is more successful than for the all-beta and alpha/beta classes. The information values also show considerable variability both among the predictive methods and among the individual fold classes. Figures 1 and 2 reinforce this view with the Matthews' correlation coefficients for each type of secondary structure. However, a more striking observation arises: when the prediction is done for all-beta proteins the Cα is extremely low (<16% for all methods), and this is even more pronounced for Cβ when predicting all-alpha proteins (<7%). This means that when a protein is dominated by just one type of secondary structure element (all-alpha or all-beta), the prediction of a structure of the opposite type (like β-sheets in all-alpha proteins) is an almost random event. The correlation coefficients for coil, although better for all-alpha proteins, do not show as pronounced a variation as Cα and Cβ.

It is also possible to identify why the methods generally perform better for all-alpha and alpha+beta proteins. For all-alpha proteins the good performance is due more to the correct prediction of coil than of the α-helical residues; however, the two combine to give a fair overall performance. In the case of alpha+beta proteins, all methods display a better than average performance for both α-helices and β-strands. This may be due to the segregation of the domains, which may confer on each of them a homogeneous character that can be captured successfully by the methods. For the all-beta proteins the Cα is so low that it decreases the overall performance of the algorithms for this class.

The question now shifts to which method is the best in terms of performance. This is a complex matter because, even when using the same testing set and accuracy measures, the results are not totally comparable. Notwithstanding, one can observe that the joint prediction (JOI) is consistently the most successful in terms of Cα and the information index. Individually, the Q_S method (Qian and Sejnowski, 1988) performs well for all-alpha and alpha+beta proteins, while GOR (Garnier et al., 1978) is indicated for all-beta proteins and GGR (Gibrat et al., 1987) has an edge for alpha/beta proteins.
B. Multidimensional Scaling

In order to provide an easy way to visualize how the methods differ from each other, the accuracy values (Q3, Cα, Cβ and Cc) for each method were subjected to nonmetric multidimensional scaling analysis (Cox and Cox, 1994). The resulting graphs provide a two-dimensional representation of the methods and are shown in Figure 3.
[Figure 3. Multidimensional scaling analysis of the dissimilarities between accuracies of different protein secondary structure prediction methods, with panels for the Matthews' correlation coefficients and for Q3. The method codes can be found in Table I.]
As can be observed, the interrelationship between the methods varies appreciably with the accuracy measure used. This brings up an important observation about the accuracy indexes themselves: they extract different pieces of information about the accuracy of the methods, and they show that the predictive methods work differently to attain the same goal, which is not necessarily the optimal one. Additionally, it indicates that when reporting the performance of a method it is imperative to use several evaluation indexes in order to provide a less biased estimate of efficacy.

The distance between points in the graphs represents the dissimilarity of the methods within the statistical error of the construction. Therefore, clusters of methods indicate similar performance for the specific index, whereas points far apart indicate that the methods diverge in predictive terms. One observation is that the methods JOI, H_K and GGR cluster together for the indices Cα, Cβ and Cc, suggesting that they behave similarly; incidentally, those three are among the best in terms of accuracy. Another observation is that methods sharing the same theoretical framework, like H_K and Q_S (both neural network based), are located relatively close together (except perhaps in the case of Cα), maybe because they extract complementary information from the training set, even though the sets of proteins used for training were different.
IV. Conclusions

The present analysis might give rise to a somewhat pessimistic view of the effectiveness of protein secondary structure prediction algorithms. In fact, with the increasing number of proteins with known three-dimensional structure, constant re-evaluation of performance must take place in order to ascertain the validity of the methods. We note that the methods do not have the predictive power claimed by their authors when analyzed consistently using the 148 proteins selected in this study. Moreover, the situation is even worse for the Matthews correlation coefficients, which indicate that the predictions are poorly correlated with the actual structures.

The inherent variability of the predictive success rate with the protein fold class brings two important observations: (1) when reporting accuracies, the selection of the test set proteins should be balanced in order to include a representative number of each of the protein fold classes; (2) prior knowledge of the protein fold class (Chou and Zhang, 1995) can be a valuable aid for the predictions, since it allows the different algorithms to be used in combination to predict a specific structural element of the chain.

The apparent failure of the prediction methods can be explained by the fact that they take into account only short-range interactions, whose statistics are strongly influenced by the number of proteins used in the training process. It seems that this type of statistics tends to reach a plateau as the number of available structures increases, ruling out the existence of absolute structural propensities for individual amino acids.
Acknowledgments

This work was supported by an NSF grant (ASC89-02S29) to SS, a fellowship from the Brazilian government (CNPq) to GP, and a computational grant from the National Center for Supercomputing Applications (NCSA). We also thank Drs. A.F.P. de Araujo and M.M. Ventura for helpful discussions.
References

Bohm, G. (1996). Biophys. Chem., 59, 1-32.
Burgess, A.W., Ponnuswamy, P.K. and Scheraga, H.A. (1974). Israel J. Chem., 12, 239-286.
Chothia, C. (1992). Nature, 357, 543-544.
Chou, P.Y. and Fasman, G.D. (1974). Biochemistry, 13, 211-222.
Cox, T.F. and Cox, M.A.A. (1994). "Multidimensional Scaling", Monographs on Statistics and Applied Probability 59. Chapman and Hall, New York.
Deleage, G., Clerc, F.F., Roux, B. and Gautheron, D.C. (1988). CABIOS, 4 (3), 351-356.
Deleage, G. and Roux, B. (1987). Prot. Eng., 1 (4), 289-294.
Eisenhaber, F., Persson, B. and Argos, P. (1995). CRC Crit. Rev. Biochem., 30 (1), 1-94.
Garnier, J. and Levin, J.M. (1991). CABIOS, 7, 133-142.
Garnier, J., Osguthorpe, D.J. and Robson, B. (1978). J. Mol. Biol., 120, 97-120.
Gascuel, O. and Golmard, J.L. (1988). CABIOS, 4, 357-365.
Gibrat, J.-F., Garnier, J. and Robson, B. (1987). J. Mol. Biol., 198, 425-443.
Guttman, L. (1968). Psychometrika, 33, 469-506.
Hobohm, U., Scharf, M., Schneider, R. and Sander, C. (1992). Protein Sci., 1, 409-417.
Holley, L.H. and Karplus, M. (1989). Proc. Natl. Acad. Sci. USA, 86, 152-156.
Kabsch, W. and Sander, C. (1983a). Biopolymers, 22, 2577-2637.
Kabsch, W. and Sander, C. (1983b). FEBS Lett., 155, 179-182.
Levin, J.M. and Garnier, J. (1988). Biochim. Biophys. Acta, 955, 283-295.
Lim, V.I. (1974). J. Mol. Biol., 88, 873-894.
Murzin, A.G., Brenner, S.E., Hubbard, T. and Chothia, C. (1995). J. Mol. Biol., 247, 536-540.
Nishikawa, K. and Noguchi, T. (1991). Methods Enzymol., 202, 31-44.
Ptitsyn, O.B. and Finkelstein, A.V. (1983). Biopolymers, 22, 15-25.
Qian, N. and Sejnowski, T.J. (1988). J. Mol. Biol., 202, 865-884.
Rost, B. and Sander, C. (1993). J. Mol. Biol., 232, 584-599.
Rost, B. and Sander, C. (1994). J. Mol. Biol., 235, 13-26.
Rost, B. and Sander, C. (1995). Proteins, 23, 295-300.
Rumelhart, D.E. and McClelland, J.L. (1986). "Parallel Distributed Processing I". MIT Press, Cambridge, MA.
Sasagawa, F. and Tajima, K. (1993). CABIOS, 9 (2), 147-152.
Schulz, G.E. and Schirmer, R.H. (1979). "Principles of Protein Structure". Springer-Verlag, New York, NY.
Schulz, G.E. (1987). Annu. Rev. Biophys. Biophys. Chem., 17, 1-21.