Analytical Biochemistry 357 (2006) 116–121
www.elsevier.com/locate/yabio
Predicting protein structural class with pseudo-amino acid composition and support vector machine fusion network

Chao Chen, Xibin Zhou, Yuanxin Tian, Xiaoyong Zou *, Peixiang Cai
School of Chemistry and Chemical Engineering, Sun Yat-Sen University, Guangzhou 510275, PR China

Received 18 May 2006; available online 7 August 2006
Abstract

Because a priori knowledge of a protein structural class can provide useful information about its overall structure, the determination of protein structural class is a meaningful topic in protein science. However, with the rapid increase in newly found protein sequences entering databanks, it is both time-consuming and expensive to determine structural class based solely on experimental techniques. Therefore, it is vitally important to develop a computational method for predicting the protein structural class quickly and accurately. To meet this challenge, this article presents a dual-layer support vector machine (SVM) fusion network featuring a different pseudo-amino acid composition (PseAA). The PseAA here contains information related to the sequence order of a protein and to the distribution of the hydrophobic amino acids along its chain. As a showcase, the rigorous jackknife cross-validation test was performed on the two benchmark data sets constructed by Zhou. A significant enhancement in success rates was observed, indicating that the current approach may serve as a powerful complementary tool to other existing methods in this area.
© 2006 Elsevier Inc. All rights reserved.

Keywords: Support vector machine; Fusion; Amino acid composition; Pair-coupled amino acid composition; Pseudo-amino acid composition; Protein structural class
* Corresponding author. Fax: +86 20 84112245. E-mail address: [email protected] (X. Zou).
1 Abbreviations used: AA, amino acid composition; pair-coupled AA, pair-coupled amino acid composition; PseAA, pseudo-amino acid composition; SVM, support vector machine; RBF, radial basis function.
doi:10.1016/j.ab.2006.07.022

According to the definition by Levitt and Chothia [1], proteins can be classified into the following four structural classes: (i) all-α, (ii) all-β, (iii) α/β, and (iv) α+β. A series of previous studies showed that the structural class of a protein correlates strongly with its amino acid composition (AA).1 Indeed, most classifiers were constructed to predict protein structural classes based on their AAs [2–13] (for a systematic description of this area, see the comprehensive reviews by Chou [14–16]). In representing a protein sample with its AA alone, however, many important features associated with the sequence order are completely missed, undoubtedly reducing the success rate of prediction. In view of this, various descriptors were proposed to improve the predictive accuracy, including the pair-coupled amino acid composition (pair-coupled AA) [17], the polypeptide composition [18,19], the pseudo-amino acid composition (PseAA) [20,21], and other compositions [22,23]. Another important advance in this area is the introduction of the functional domain composition by Chou and Cai, which can significantly enhance the success rates in predicting protein structural class [24] and many other protein attributes [25–27]. Besides more accurate protein sample representations, the predictive quality can also be improved by methods such as boosting and bagging [28,29], which focus on, and give more weight to, the hard samples that could not be classified correctly by the previous weak classifiers. Alternatively, it is widely accepted that combining multiple classifiers can provide advantages over traditional monolithic approaches [30]. Considering that the various classifiers may make different, and perhaps complementary, errors, the aim is to design a composite system that outperforms
any individual classifier by pooling together the decisions of all classifiers. In view of the above facts, this article presents a framework with a dual-layer support vector machine (SVM) fusion network, following the main idea of Refs. [31–35]. In its first layer, three SVM classifiers are trained on different protein features. Their computational results are then combined and input into the second layer, where another SVM classifier performs the fusion and makes the final decisions adaptively. It is demonstrated on two different working data sets that the success rates are improved significantly.

Materials and methods
Protein features

A protein sample can be represented by its AA, pair-coupled AA, or PseAA. Because the first two have explicit definitions, only the PseAA is described below. Since its introduction [20], the PseAA has been used widely and successfully to improve the prediction quality in diverse applications of bioinformatics [36–41]. The sequence order effect along a protein chain can be approximately reflected by a set of sequence order correlation factors defined as follows [20]:

$$\theta_m = \frac{1}{L-m}\sum_{i=1}^{L-m}\Theta(R_i, R_{i+m}), \qquad m = 1, 2, \ldots, \lambda \;\; (\lambda < L). \tag{1}$$

In Eq. (1), L denotes the length of the protein, and θm is called the mth rank of coupling factor that harbors the mth sequence order correlation. It is worth noting that for various studies, the correlation function Θ(Ri, Rj) may take different appropriate forms. In this study, Θ(Ri, Rj) is formulated as

$$\Theta(R_i, R_j) = H(R_i)\,H(R_j), \tag{2}$$

where H(Ri) and H(Rj) are the hydrophobicity values of the amino acids Ri and Rj, respectively, taken from Ref. [42]. Eq. (2) is part of the amphiphilic PseAA formulated by Chou for predicting enzyme subfamily classes (see Eq. (3) in Ref. [40]) and membrane protein types (see Eq. (3) in Ref. [43]). Note that before being substituted into Eq. (2), the hydrophobicity values were all subjected to a standard conversion:

$$H(R_i) = \frac{H_0(R_i) - \sum_{k=1}^{20} H_0(R_k)/20}{\sqrt{\sum_{u=1}^{20}\left[H_0(R_u) - \sum_{k=1}^{20} H_0(R_k)/20\right]^2 \big/\, 20}}. \tag{3}$$

In Eq. (3), Ri (i = 1, 2, …, 20) denote the 20 native amino acids and H0(Ri) is the original hydrophobicity value of amino acid Ri.

In general, the larger the number λ of correlation factors, the more sequence order effects are incorporated. However, λ must be smaller than the number of amino acid residues in the shortest protein chain of the data set concerned. On the other hand, because of the information loss during jackknifing, the overall success rate of the jackknife test does not always increase monotonically with λ [20]. Because the jackknife test is accepted as the most objective method for cross-validation in statistics [9,12,14], the optimal value of λ should be the one that yields the best overall jackknife-tested rate. Note also that λ may have different optimal values for different training data sets. For the current study, the optimal value is λ = 11; that is, the dimension of the PseAA considered here is 20 + 11 = 31. Given a protein X, its PseAA is defined in this 31-D space as

$$\mathbf{X} = \left[x_1, x_2, \ldots, x_{31}\right]^{\mathrm{T}}, \tag{4}$$

where

$$x_u = \begin{cases} \dfrac{f_u}{\sum_{i=1}^{20} f_i + \omega \sum_{j=1}^{11} \theta_j}, & 1 \le u \le 20, \\[2ex] \dfrac{\omega\,\theta_{u-20}}{\sum_{i=1}^{20} f_i + \omega \sum_{j=1}^{11} \theta_j}, & 20 + 1 \le u \le 20 + 11. \end{cases} \tag{5}$$
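As a concrete illustration, the PseAA construction of Eqs. (1)–(5) can be sketched in code. This is a minimal sketch, not the authors' implementation: the Kyte–Doolittle hydropathy values [42] are hard-coded, and λ = 11 and ω = 0.1 follow the choices stated in the text.

```python
from math import sqrt

KD = {  # Kyte-Doolittle hydropathy values, Ref. [42]
    'A': 1.8, 'R': -4.5, 'N': -3.5, 'D': -3.5, 'C': 2.5,
    'Q': -3.5, 'E': -3.5, 'G': -0.4, 'H': -3.2, 'I': 4.5,
    'L': 3.8, 'K': -3.9, 'M': 1.9, 'F': 2.8, 'P': -1.6,
    'S': -0.8, 'T': -0.7, 'W': -0.9, 'Y': -1.3, 'V': 4.2,
}

# Eq. (3): standardize the raw hydrophobicity values over the 20 amino acids
_mean = sum(KD.values()) / 20.0
_sd = sqrt(sum((v - _mean) ** 2 for v in KD.values()) / 20.0)
H = {aa: (v - _mean) / _sd for aa, v in KD.items()}

def pseaa(seq, lam=11, omega=0.1):
    """Return the (20 + lam)-dimensional PseAA vector of Eqs. (4)-(5)."""
    L = len(seq)
    assert L > lam, "sequence must be longer than lambda"
    # First 20 components: normalized occurrence frequencies f_1..f_20
    f = [seq.count(aa) / L for aa in sorted(KD)]
    # Eqs. (1)-(2): sequence-order correlation factors theta_1..theta_lam
    theta = [sum(H[seq[i]] * H[seq[i + m]] for i in range(L - m)) / (L - m)
             for m in range(1, lam + 1)]
    denom = sum(f) + omega * sum(theta)
    return [fi / denom for fi in f] + [omega * t / denom for t in theta]
```

Because the frequencies sum to 1, the 31 components of the resulting vector always sum to 1 as well, which is a convenient sanity check.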
In Eq. (5), fi (i = 1, 2, …, 20), the same as in the conventional AA, are the normalized occurrence frequencies of the 20 native amino acids in the protein X, and θj (j = 1, 2, …, 11) are the j-tier sequence order correlation factors computed according to Eqs. (1)–(3). Of the 31 components, the first 20 reflect the effect of the AA, whereas components 20 + 1 through 20 + 11 reflect the effect of sequence order. The parameter ω is a weight factor that adjusts the 31 components to similar scales; here it was set to 0.1.

SVM classifiers

The SVM learning system, first proposed by Cortes and Vapnik [44], is based on statistical learning theory. Compared with other machine learning systems, the SVM has many attractive features, including the absence of local minima, its speed and scalability, and its ability to condense the information contained in the training set. During the past decade, SVMs have performed well in predicting protein secondary structure [32,45], protein subcellular localization [46], membrane protein types [47], and so on. In this research, the publicly available LIBSVM software is used [48]. With AA, pair-coupled AA, and PseAA as inputs to LIBSVM for training, three classifiers are constructed: SVM1, SVM2, and SVM3. Their computational results are the protein structural classes, which can be (i) all-α, (ii) all-β, (iii) α/β, or (iv) α+β. For this four-class problem, we use the "one versus others" method to transform it into two-class problems. Therefore, as far as one kind of composition descriptor is concerned (e.g., AA), there are actually four SVM classifiers (rather than just one): one for recognizing the all-α class, one for the all-β class, and the other two for the remaining two classes. Despite this, for simplicity, all four classifiers together are denoted as SVM1. Considering that the relationship between the composition descriptors and the structural classes is not simply linear, we select the radial basis function (RBF) as the kernel:

$$K(\vec{u}, \vec{v}) = \exp\left(-\gamma \lVert \vec{u} - \vec{v} \rVert^2\right). \tag{6}$$
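Eq. (6) itself is a one-liner in code. The sketch below is illustrative only (`rbf_kernel` is a hypothetical helper name, and the default γ = 100 mirrors the value selected by the grid search described in the text):

```python
from math import exp

def rbf_kernel(u, v, gamma=100.0):
    """K(u, v) = exp(-gamma * ||u - v||^2) for equal-length vectors, Eq. (6)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return exp(-gamma * sq_dist)
```

The kernel value is 1 when the two vectors coincide and decays toward 0 as they move apart, with γ controlling how quickly.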
Subsequently, the parameter set (γ, C) needs to be optimized. We perform grid searches for the maximal jackknife-tested overall rates, with γ ranging over 0.001, 0.01, 0.1, 1, 10, and 100 and C ranging over 0.1, 1, 10, 100, and 1000. Of all the possible combinations, the set (100, 100) is found to have the best average performance. Although particular sets could be used for different classifiers, for simplicity we choose the set (100, 100); the total departure from the optimal values is negligible.

Fusion

From the first-layer SVM classifiers there are 3 (actually, 4 × 3 = 12) computational results. One could employ methods such as vote counting or decision templates to make the final decision; furthermore, one could assign various weights to the 12 results according to the reliabilities of the different classifiers. Nevertheless, in this work we construct an SVM fusion system to make the decisions. Suppose a protein data set contains N samples; it can be represented as an N × 20 matrix with AA, an N × 210 matrix with pair-coupled AA, or an N × 31 matrix with PseAA. These matrices are input into the first-layer SVM1, SVM2, and SVM3, respectively. The outputs of the first-layer SVMs are three N × 4 matrices representing the probabilities that each protein sample belongs to each structural class. The three matrices are then combined to form a new N × 12 matrix that is used as the input of the second-layer SVM. Different from the first-layer SVMs, the relationship between the inputs and the outputs may be linear here. Accordingly, we select the linear kernel rather than the RBF kernel for fusion, in which case only one parameter (the regularization parameter C) needs to be optimized. We search over 0.001, 0.01, 0.1, 1, 10, and 100 and find that 0.01 is the optimal value for fusing three of the structural classes, that is, all except the α+β class; for α+β, the optimal value of C is 1.
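The matrix bookkeeping described above can be sketched as follows. This is only an illustration of how the second-layer input is assembled, not the authors' code: training of the linear-kernel second-layer SVM itself (done with LIBSVM in the article) is omitted, and `predict_single` is a hypothetical helper showing how one N × 4 probability matrix would be read.

```python
def fuse_inputs(p_aa, p_pair, p_pseaa):
    """Concatenate three N x 4 probability matrices (one per first-layer
    SVM) row-wise into the N x 12 matrix fed to the second-layer SVM."""
    assert len(p_aa) == len(p_pair) == len(p_pseaa)
    return [row_a + row_b + row_c
            for row_a, row_b, row_c in zip(p_aa, p_pair, p_pseaa)]

CLASSES = ("all-alpha", "all-beta", "alpha/beta", "alpha+beta")

def predict_single(prob_matrix):
    """Per-sample argmax decision for one first-layer classifier."""
    return [CLASSES[max(range(4), key=row.__getitem__)] for row in prob_matrix]
```

In the article the N × 12 matrix is not decided by a simple argmax but passed to the second-layer SVM, which learns how to weight the 12 entries adaptively.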
The whole procedure is shown in Fig. 1.
Fig. 1. Flowchart of the SVM fusion network.
Results and discussion

For the fusion network proposed here, we used the two data sets constructed by Zhou [9] to test its prediction quality. One consists of 277 domains: 70 all-α, 61 all-β, 81 α/β, and 65 α+β domains. The other consists of 498 domains: 107 all-α, 126 all-β, 136 α/β, and 129 α+β domains. In general, a prediction method is evaluated by the resubstitution test, the independent data set test, or the jackknife test. Of these three, the jackknife test is accepted as the most rigorous and objective [9,12,14]. In the jackknife test, each protein in the data set is singled out in turn as an independent test sample, and all of the rule parameters are calculated without using that protein. The success rates by the jackknife cross-validation test are listed in Table 1. As this table shows, the success rates are improved significantly. Especially for the most difficult case, α+β, the success rates for the two data sets are increased by 27.7% and 13.2% in comparison with the best of the three individual classifiers, and even by 9.2% and 12.4% in comparison with the values achieved by the oracle. Noting that the inputs of the fusion SVM classifier are 12-D vectors combining the computational results for all four structural classes, it is not surprising that the success rates by fusion can exceed the theoretical limits set by the oracle. The overall rates are also increased, by 7.2% and 3.6%. As illustrated in Table 2, we further carried out a comparison with some prior works. From this table, it can be seen that our method is superior or comparable to other
Table 1
Results of the jackknife test

Data set  Classifier  Success rate (%)
                      All-α   All-β   α/β    α+β    Overall
Z277      SVM1        82.9    90.2    93.8   52.3   80.5
          SVM2        82.9    85.3    93.8   44.6   77.6
          SVM3        87.1    88.5    91.4   52.3   80.5
          Oracle^a    87.1    91.8    93.8   70.8   83.8
          Fusion      85.7    90.2    93.8   80.0   87.7
Z498      SVM1        99.1    96.0    80.9   76.7   87.6
          SVM2        98.1    94.4    78.7   71.3   84.9
          SVM3        98.1    96.0    80.9   78.3   87.8
          Oracle^a    99.1    96.8    81.6   79.1   88.6
          Fusion      99.1    96.0    80.9   91.5   91.4

^a The oracle works as follows: assign the correct structural class label to protein sample X if at least one individual classifier produces the correct structural class label of X, so that the theoretical maximal success rate achievable by the fusion technique can be estimated [49].
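The oracle rule of footnote a translates directly into code: a sample counts as correct whenever at least one individual classifier labels it correctly. The sketch below is illustrative (`oracle_accuracy` is a hypothetical helper name, not from the article):

```python
def oracle_accuracy(predictions, truth):
    """Oracle success rate (%): a sample is a hit if at least one
    classifier's prediction matches the true label [49].

    predictions: list of per-classifier label lists, each of length N
    truth: list of N true labels
    """
    hits = sum(any(p[i] == t for p in predictions)
               for i, t in enumerate(truth))
    return 100.0 * hits / len(truth)
```

This makes clear why the oracle rate is an upper bound for any selection-based combiner, while a fusion classifier that re-weights all 12 probability outputs can, as Table 1 shows for α+β, exceed it.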
Table 2
Comparison with other algorithms by the jackknife test

Data set  Algorithm             Success rate (%)
                                All-α   All-β   α/β    α+β    Overall
Z277      Component coupled^a   84.3    82.0    81.5   67.7   79.1
          Neural network^b      68.6    85.2    86.4   56.9   74.7
          SVM^c                 74.3    82.0    87.7   72.3   79.4
          LogitBoost^d          81.4    88.5    92.6   72.3   84.1
          Rough Sets^e          77.1    77.0    93.8   66.2   79.4
          Our method            85.7    90.2    93.8   80.0   87.7
Z498      Component coupled^a   93.5    88.9    90.4   84.5   89.2
          Neural network^b      86.0    96.0    88.2   86.0   89.2
          SVM^c                 88.8    95.2    96.3   91.5   93.2
          LogitBoost^d          92.5    96.0    97.1   93.0   94.8
          Rough Sets^e          87.9    91.3    97.1   86.0   90.8
          Our method            99.1    96.0    80.9   91.5   91.4

^a Results are from Ref. [9].
^b Results are from Ref. [10].
^c Results are from Ref. [11].
^d Results are from Ref. [28].
^e Results are from Ref. [50].
existing methods. Accordingly, from both the rationality of the testing procedure and the success rates of the test results, the current SVM fusion network can significantly improve the prediction quality of protein structural class.

Conclusions

Rather than constructing a more complicated classifier, we believe that fusing relatively simple individual classifiers, which may play complementary roles to each other, is a good idea. In this article, a novel SVM fusion network for predicting the protein structural class has been presented. In its first layer, three classifiers are constructed with various statistics as inputs. The computational values are then combined and input into the second layer to be fused, where the final decisions are made. The results on two different working data sets show that the fusion network can significantly improve the predictive accuracy, even compared with the best individual classifier. Moreover, it can be anticipated that the current fusion network may also help to improve the success rates for many other protein attributes such as subcellular localization, membrane protein types, and enzyme family and subfamily classes.

Acknowledgments

The authors thank the anonymous reviewers whose constructive comments were very helpful in strengthening the presentation of this article. Financial support from the National Natural Science Foundation of China (20475068 and 20575082), the Natural Science Foundation of Guangdong Province (031577), and the Scientific Technology Project of Guangdong Province (2005B30101003) is acknowledged.
References

[1] M. Levitt, C. Chothia, Structural patterns in globular proteins, Nature 261 (1976) 552–558.
[2] K.C. Chou, C.T. Zhang, A correlation-coefficient method to predicting protein-structural classes from amino-acid compositions, Eur. J. Biochem. 207 (1992) 429–433.
[3] G.F. Zhou, X.H. Xu, C.T. Zhang, A weighting method for predicting protein structural class from amino-acid composition, Eur. J. Biochem. 210 (1992) 747–749.
[4] C.T. Zhang, K.C. Chou, An optimization approach to predicting protein structural class from amino-acid composition, Protein Sci. 1 (1992) 401–408.
[5] K.C. Chou, C.T. Zhang, Predicting protein-folding types by distance functions that make allowances for amino-acid interactions, J. Biol. Chem. 269 (1994) 22014–22020.
[6] C.T. Zhang, K.C. Chou, G.M. Maggiora, Predicting protein structural classes from amino-acid composition: application of fuzzy clustering, Protein Eng. 8 (1995) 425–435.
[7] K.C. Chou, A novel approach to predicting protein structural classes in a (20–1)-D amino-acid composition space, Proteins: Struct. Funct. Genet. 21 (1995) 319–344.
[8] I. Bahar, A.R. Atilgan, R.L. Jernigan, B. Erman, Understanding the recognition of protein structural classes by amino acid composition, Proteins 29 (1997) 172–185.
[9] G.P. Zhou, An intriguing controversy over protein structural class prediction, J. Protein Chem. 17 (1998) 729–738.
[10] Y.D. Cai, G.P. Zhou, Prediction of protein structural classes by neural network, Biochimie 82 (2000) 783–785.
[11] Y.D. Cai, X.J. Liu, X.B. Xu, G.P. Zhou, Support vector machines for predicting protein structural class, BMC Bioinform. 2 (2001) 1–5.
[12] G.P. Zhou, N. Assa-Munt, Some insights into protein structural class prediction, Proteins: Struct. Funct. Genet. 44 (2001) 57–59.
[13] H.B. Shen, J. Yang, X.J. Liu, K.C. Chou, Using supervised fuzzy clustering to predict protein structural classes, Biochem. Biophys. Res. Commun. 334 (2005) 577–581.
[14] K.C. Chou, C.T. Zhang, Prediction of protein structural classes, Crit. Rev. Biochem. Mol. Biol. 30 (1995) 275–349.
[15] K.C. Chou, Prediction of protein structural classes and subcellular locations, Curr. Protein Peptide Sci. 1 (2000) 171–208.
[16] K.C. Chou, Progress in protein structural class prediction and its impact to bioinformatics and proteomics, Curr. Protein Peptide Sci. 6 (2005) 423–436.
[17] K.C. Chou, Using pair-coupled amino acid composition to predict protein secondary structure content, J. Protein Chem. 18 (1999) 473–480.
[18] R.Y. Luo, Z.P. Feng, J.K. Liu, Prediction of protein structural class by amino acid and polypeptide composition, Eur. J. Biochem. 269 (2002) 4219–4225.
[19] X.D. Sun, R.B. Huang, Prediction of protein structural classes using support vector machines, Amino Acids 30 (2006) 469–475.
[20] K.C. Chou, Prediction of protein cellular attributes using pseudo-amino acid composition, Proteins: Struct. Funct. Genet. 43 (2001) 246–255.
[21] X. Xiao, S.H. Shao, Z.D. Huang, K.C. Chou, Using pseudo amino acid composition to predict protein structural classes: approached with complexity measure factor, J. Comput. Chem. 27 (2006) 478–482.
[22] Q.S. Du, D.Q. Wei, K.C. Chou, Correlations of amino acids in proteins, Peptides 24 (2003) 1863–1869.
[23] Q.S. Du, Z.Q. Jiang, W.Z. He, D.P. Li, K.C. Chou, Amino acid principal component analysis (AAPCA) and its applications in protein structural class prediction, J. Biomol. Struct. Dyn. 23 (2006) 635–640.
[24] K.C. Chou, Y.D. Cai, Predicting protein structural class by functional domain composition, Biochem. Biophys. Res. Commun. 321 (2004) 1007–1009.
[25] Y.D. Cai, K.C. Chou, Nearest neighbor algorithm for predicting protein subcellular location by combining functional domain composition and pseudo-amino acid composition, Biochem. Biophys. Res. Commun. 305 (2003) 407–411.
[26] Y.D. Cai, K.C. Chou, Using functional domain composition to predict enzyme family classes, J. Proteome Res. 4 (2005) 109–111.
[27] Y.D. Cai, K.C. Chou, Predicting enzyme subclass by functional domain composition and pseudo amino acid composition, J. Proteome Res. 4 (2005) 967–971.
[28] K.Y. Feng, Y.D. Cai, K.C. Chou, Boosting classifier for predicting protein domain structural class, Biochem. Biophys. Res. Commun. 334 (2005) 213–217.
[29] Y.D. Cai, K.Y. Feng, W.C. Lu, K.C. Chou, Using LogitBoost classifier to predict protein structural classes, J. Theor. Biol. 238 (2006) 172–176.
[30] L. Nanni, Fusion of classifiers for protein fold recognition, Neurocomputing 68 (2005) 315–321.
[31] C. Yan, D. Dobbs, V. Honavar, A two-stage classifier for identification of protein–protein interface residues, Bioinformatics 20 (Suppl. 1) (2004) i371–i378.
[32] J. Guo, H. Chen, Z.R. Sun, Y.L. Lin, A novel method for protein secondary structure prediction using dual-layer SVM and profiles, Proteins: Struct. Funct. Bioinform. 54 (2004) 738–743.
[33] M.N. Nguyen, J.C. Rajapakse, Prediction of protein relative solvent accessibility with a two-stage SVM approach, Proteins: Struct. Funct. Bioinform. 59 (2005) 30–37.
[34] M.N. Nguyen, J.C. Rajapakse, Two-stage multi-class support vector machines to protein secondary structure prediction, Pacific Symp. Biocomp. (2005) 346–357.
[35] M.N. Nguyen, J.C. Rajapakse, Two-stage support vector regression approach for predicting accessible surface areas of amino acids, Proteins: Struct. Funct. Bioinform. 63 (2006) 542–550.
[36] K.C. Chou, Y.D. Cai, Predicting protein quaternary structure by pseudo amino acid composition, Proteins: Struct. Funct. Genet. 53 (2003) 282–289.
[37] K.C. Chou, Y.D. Cai, Predicting subcellular localization of proteins by hybridizing functional domain composition and pseudo-amino acid composition, J. Cell. Biochem. 91 (2004) 1197–1203.
[38] H.B. Shen, K.C. Chou, Using optimized evidence-theoretic K-nearest neighbor classifier and pseudo-amino acid composition to predict membrane protein types, Biochem. Biophys. Res. Commun. 334 (2005) 288–292.
[39] H.B. Shen, K.C. Chou, Predicting protein subnuclear location with optimized evidence-theoretic K-nearest classifier and pseudo amino acid composition, Biochem. Biophys. Res. Commun. 337 (2005) 752–756.
[40] K.C. Chou, Using amphiphilic pseudo amino acid composition to predict enzyme subfamily classes, Bioinformatics 21 (2005) 10–19.
[41] S.W. Zhang, Q. Pan, H.C. Zhang, Z.C. Shao, J.Y. Shi, Prediction of protein homo-oligomer types by pseudo amino acid composition: approached with an improved feature extraction and naive Bayes feature fusion, Amino Acids 30 (2006) 461–468.
[42] J. Kyte, R.F. Doolittle, A simple method for displaying the hydropathic character of a protein, J. Mol. Biol. 157 (1982) 105–132.
[43] K.C. Chou, Y.D. Cai, Prediction of membrane protein types by incorporating amphipathic effects, J. Chem. Inform. Model. 45 (2005) 407–413.
[44] C. Cortes, V. Vapnik, Support-vector networks, Mach. Learn. 20 (1995) 273–297.
[45] M. Kumar, M. Bhasin, N.K. Natt, G.P.S. Raghava, BhairPred: prediction of β-hairpins in a protein from multiple alignment information using ANN and SVM techniques, Nucleic Acids Res. 33 (2005) W154–W159.
[46] K.C. Chou, Y.D. Cai, Using functional domain composition and support vector machines for prediction of protein subcellular location, J. Biol. Chem. 277 (2002) 45765–45769.
[47] Y.D. Cai, G.P. Zhou, K.C. Chou, Support vector machines for predicting membrane protein types by using functional domain composition, Biophys. J. 84 (2003) 3257–3263.
[48] C.C. Chang, C.J. Lin, LIBSVM: A Library for Support Vector Machines [software], 2001, www.csie.ntu.edu.tw/~cjlin/libsvm.
[49] L.I. Kuncheva, Switching between selection and fusion in combining classifiers: an experiment, IEEE Trans. Syst. Man Cybern. B Cybern. 32 (2002) 146–156.
[50] Y.F. Cao, S. Liu, L.D. Zhang, J. Qin, J. Wang, K.X. Tang, Prediction of protein structural class with Rough Sets, BMC Bioinform. 7 (2006), doi:10.1186/1471-2105-7-20.