A novel predictor for protein structural class based on integrated information of the secondary structure sequence


Biochimie xxx (2014) 1–6


Research paper

Lichao Zhang a, Xiqiang Zhao b,*, Liang Kong c, Shuxia Liu b

a Department of Biotechnology, College of Marine Life Science, Ocean University of China, Qingdao, PR China
b College of Mathematical Science, Ocean University of China, Qingdao, PR China
c College of Mathematics and Information Technology, Hebei Normal University of Science and Technology, Qinhuangdao, PR China

Article history: Received 23 October 2013; Accepted 11 May 2014; Available online xxx

Keywords: Sequence similarity; Protein structural class; Support vector machine; Position of α-helices and β-strands

Abstract

The structural class has become one of the most important features for characterizing the overall folding type of a protein and plays important roles in many aspects of protein research. At present, it is still a challenging problem to accurately predict the protein structural class of low-similarity sequences. In this study, an 18-dimensional integrated feature vector is proposed by fusing information about the content and position of the predicted secondary structure elements. The consistently high accuracies of jackknife and 10-fold cross-validation tests on different low-similarity benchmark datasets show that the proposed method is reliable and stable. Comparison of our results with other methods demonstrates that our method is an effective computational tool for protein structural class prediction, especially for low-similarity sequences. © 2014 Elsevier Masson SAS. All rights reserved.

1. Introduction

The first definition of protein structural class was introduced by Levitt and Chothia in 1976 [1]. Based on their pioneering work, four structural classes of globular proteins are usually distinguished: (1) the all-α class, which includes proteins with only a small number of strands; (2) the all-β class, with proteins containing only a small number of helices; (3) the α/β class, with proteins that include both helices and strands, where the strands are mostly parallel; and (4) the α+β class, which includes proteins with both helices and strands, where the strands are mostly anti-parallel. The structural class has become one of the most important features for characterizing the overall folding type of a protein and plays important roles in many aspects of protein research. More specifically, knowledge of the structural class has been applied to improve the accuracy of secondary structure prediction [2], to reduce the search space of possible conformations of the tertiary structure [3–5], and to implement heuristic approaches to tertiary structure determination. To date, protein structural class prediction has become a quite meaningful topic in bioinformatics [6,7]. Traditional laboratory-based methods assign the structural class to a protein by manual

* Corresponding author. Department of Mathematics, College of Mathematical Science, Ocean University of China, Songling Road, Qingdao 266100, PR China. Tel.: +86 53266787282. E-mail address: [email protected] (X. Zhao).

inspection of structures determined experimentally, for example by X-ray crystallography or nuclear magnetic resonance (NMR) spectroscopy, which is a time-consuming and expensive process. Thus, with the rapid development of genomics and proteomics, it is crucially important to develop fast and accurate computational methods to determine the structural class of the dramatically expanding set of newly discovered proteins.

One important aspect of predicting the structural class is to properly extract protein sequence information and form a feature vector. In earlier research, features were usually extracted from the amino acid (AA) sequence [4–13], such as the frequency of each AA in a given protein. Considering that these features ignore sequential order information, several order-aware features have been introduced, such as pseudo AA composition [14], collocation of AAs, functional domain composition [15], and the position-specific scoring matrix (PSSM) computed by the position-specific iterative basic local alignment search tool (PSI-BLAST) [16]. However, these methods perform poorly on low-similarity sequences, with accuracies between 50% and 70% [17]. Recently, several new features based on the predicted secondary structure sequence (SSS), such as the length of the longest α-helices and β-strands, have been proposed to improve prediction accuracy for low-similarity sequences [17–23]. After the feature vector is extracted from the protein sequence, it is used as the input to different types of machine learning algorithms, including neural networks [24], support vector machines (SVM) [25–29], fuzzy clustering [30], Bayesian classification [31], rough sets [32] and so on. A review by Chou


provided further details on the development of protein structural class prediction methods [6]. Although quite encouraging results have been achieved by many methods based on the predicted secondary structure, developing high-quality prediction methods, especially for low-similarity sequences, remains a challenging task.

In this study, an 18-dimensional integrated feature vector (IFV) is proposed by fusing the content and position information of the predicted secondary structure elements, and a multi-class support vector machine (SVM) is then used to predict the protein structural class on three different low-similarity benchmark datasets. In order to evaluate the proposed prediction method objectively, the jackknife cross-validation test and the 10-fold cross-validation test (10-CV) are applied. Because the integrated features represent enough protein sequence information to capture the relationship between protein sequence and structural class, the experimental results demonstrate that our method is an effective computational tool for protein structural class prediction. Moreover, the results suggest that further mining the integrated content and position information of the predicted secondary structure sequence is an effective way to improve prediction accuracy.

2. Materials and methods

2.1. Datasets

In order to give a comprehensive experimental comparison of different prediction algorithms, three widely used benchmark datasets with low sequence identity were employed in our study. The selected ASTRAL dataset (including 7 classes) has sequence similarity lower than 20% and contains 6424 sequences [21]. Among the 7 classes, the four major classes (all-α, all-β, α/β and α+β) were selected in this study. The resulting dataset of 5626 sequences was randomly divided into two equal subsets, one used as the training set (ASTRALtraining) and the other as the test set (ASTRALtest) [18]. The dataset 25PDB [10] comprises 1673 proteins of about 25% sequence similarity. The final dataset, named 640 [18,22], comprises 640 proteins of about 25% sequence similarity. The details of the four datasets are shown in Table 1.

2.2. Feature vector

Every residue in a protein sequence is predicted as one of three secondary structure elements, H (helix), E (strand) and C (coil), using PSIPRED. These secondary structure elements define the predicted secondary structure sequence (SSS) of a given protein. Based on the SSS, the following 27 features were used to identify the protein structural class, including 11 features reused from previous studies and 16 novel features. Below, we give the concrete details and investigate how these features contribute to the prediction results.

1. P(H) and P(E) [17], the fractions of H and E in the SSS, reflect the contents of H and E. They are formulated as

P(H) = N_H / N,  P(E) = N_E / N,

where N_H and N_E are the numbers of H and E in the SSS and N is the length of the SSS.

2. CMV_H, CMV_E and CMV_C [17], based on the SSS, have proved useful for protein structural class prediction since they reflect the spatial arrangements of H, E and C in the SSS. They are formulated as

CMV_H = (Σ_{j=1}^{N_H} P_{Hj}) / (N(N - 1)),
CMV_E = (Σ_{j=1}^{N_E} P_{Ej}) / (N(N - 1)),
CMV_C = (Σ_{j=1}^{N_C} P_{Cj}) / (N(N - 1)),

where N_C is the number of C in the SSS, and P_{Hj}, P_{Ej} and P_{Cj} are the positions of the j-th H, E and C in the SSS.

3. As the concept of protein structural class is defined for globular proteins, the lengths of the α-helices and β-strands can affect the spatial structure of a protein. The normalized lengths of the longest α-helices and β-strands in the SSS [23] (denoted by MaxsegH/N and MaxsegE/N) are significant for improving the prediction accuracy.

4. If two segments of E are separated by segments of H, these two segments of E tend to form parallel β-sheets; otherwise, they tend to form anti-parallel β-sheets. Taking the sequence EEEEECCHHHHHHCEEEECCCCHHHEEEECCCCEEEE as an example (Fig. 1), segment 1 and segment 2, as well as segment 2 and segment 3, are supposed to form parallel β-sheets, and segment 3 and segment 4 are supposed to form anti-parallel β-sheets. Considering that the β-strands in α/β proteins are usually arranged as parallel β-sheets, while in α+β proteins they are usually arranged as anti-parallel β-sheets, the number of β-strands (segments of E) that form parallel β-sheets and the number that form anti-parallel β-sheets are important for distinguishing the α/β and α+β classes. Here the normalized parallel and anti-parallel β-sheet counts (PnE/N and APnE/N) [20] were used.

5. The normalized maximum distances between adjacent segments E and H, and between adjacent segments H and E (MaxdEH/N and MaxdHE/N), were also used in this study.

The above 11 features were adopted because of their prior successful application in protein structural class prediction; a computational sketch of how they can be extracted from an SSS is given below. Next, we introduce 16 novel features to improve the prediction accuracy, and hope that
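As a concrete illustration of the reused features, the following Python sketch (not the authors' code) computes them from a predicted SSS string. The helper name content_position_features is ours, positions in the CMV sums are taken as 1-based, and the pair-wise parallel/anti-parallel bookkeeping and the coil-gap notion of segment distance are assumptions where the cited definitions leave room for interpretation.

# A sketch (not the authors' code) of the 11 reused content and position
# features computed from a predicted secondary structure sequence (SSS)
# over {H, E, C}.  The helper name and several counting conventions noted
# below are our own assumptions.
import re

def content_position_features(sss: str) -> dict:
    n = len(sss)
    nh, ne = sss.count("H"), sss.count("E")

    # 1. Content features P(H) and P(E).
    feats = {"P(H)": nh / n, "P(E)": ne / n}

    # 2. CMV_H, CMV_E, CMV_C: sum of (assumed 1-based) positions of each
    #    element, normalised by N(N - 1).
    for x in "HEC":
        feats[f"CMV_{x}"] = sum(i + 1 for i, s in enumerate(sss) if s == x) / (n * (n - 1))

    # 3. Normalised length of the longest H and E segments.
    for x in "HE":
        segs = re.findall(f"{x}+", sss)
        feats[f"Maxseg{x}/N"] = max((len(s) for s in segs), default=0) / n

    # 4. Parallel / anti-parallel bookkeeping: here each adjacent pair of E
    #    segments separated by a helix-containing gap is counted as parallel,
    #    otherwise anti-parallel (a simplified reading of the rule in [20]).
    e_segs = list(re.finditer("E+", sss))
    pn = apn = 0
    for a, b in zip(e_segs, e_segs[1:]):
        if "H" in sss[a.end():b.start()]:
            pn += 1
        else:
            apn += 1
    feats["PnE/N"], feats["APnE/N"] = pn / n, apn / n

    # 5. Normalised maximum distance between adjacent E->H and H->E segments,
    #    read here as the longest coil gap separating the two segments.
    for first, second, name in (("E", "H", "MaxdEH/N"), ("H", "E", "MaxdHE/N")):
        gaps = [m.start(2) - m.end(1)
                for m in re.finditer(f"({first}+)C*({second}+)", sss)]
        feats[name] = max(gaps, default=0) / n
    return feats

print(content_position_features("EEEEECCHHHHHHCEEEECCCCHHHEEEECCCCEEEE"))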

Table 1. The compositions of the datasets employed in our study.

Dataset          All-α   All-β   α/β    α+β    Total
ASTRALtraining   640     662     748    763    2813
ASTRALtest       640     662     747    764    2813
25PDB            443     443     346    441    1673
640              138     154     177    171    640

[Fig. 1: the example sequence EEEEECCHHHHHHCEEEECCCCHHHEEEECCCCEEEE with its four E segments numbered 1–4; segments 1–2 and 2–3 are marked as parallel sheets, segments 3–4 as an anti-parallel sheet.] Fig. 1. The representation of E segments composing parallel β-sheets or anti-parallel β-sheets directly from the predicted secondary structure sequence.


these novel features preserve enough information about the secondary structure to characterize the protein structural class.

6. Secondary structure elements H lying at different positions of the SSS can interact with each other to form long α-helices. Therefore, the trend helix probabilities THP are defined by the following equations:

THP1 = N(H*H)/N,  THP2 = N(H**H)/N,
THP3 = N(H***H)/N,  THP4 = N(H****H)/N,

where * denotes one of H, E and C, and N(H*H), N(H**H), N(H***H) and N(H****H) denote the numbers of segments with structure H*H, H**H, H***H and H****H.

Although the proteins in the α/β and α+β classes contain both α-helices and β-strands, they differ in at least two aspects. One is the directionality of the β-strands; the other is the distribution of α-helices and β-strands: α-helices and β-strands are largely separated in α/β proteins, but largely aggregated in α+β proteins. In order to reflect the distribution information of α-helix and β-strand segments effectively, the coil segments were ignored and three sequences were derived from the SSS. The first is the HE sequence (HES), obtained by deleting the coils from the SSS. Since at least three and two residues are generally required to form α-helices and β-strands, respectively, helices and strands that do not meet this size requirement are treated as coils; the resulting sequence is named the improved SSS (ISSS). The second sequence, named the improved HE sequence (IHES), is acquired by deleting the coils from the ISSS. The last sequence is the simplified sequence (SS) [23], obtained by the following steps: first, every segment of H, E and C in the SSS is replaced by the letter a, b and c, respectively; then, all letters c are removed. Here, the lengths of HES, IHES and SS are denoted by N′, N″ and N‴. The following 12 novel features are proposed based on these sequences to reflect the level of separation and aggregation of α-helices and β-strands in the SSS.

7. The fractions of HH and EE (PHH and PEE) in the HES are proposed to represent the level of aggregation of α-helices and β-strands. They are defined by

PHH = N(HH)/N′,  PEE = N(EE)/N′,

where N(HH) and N(EE) are the numbers of HH and EE in the HES, respectively.

8. The numbers of HH and EE in the IHES (N_IHH and N_IEE) reflect the level of aggregation of α-helices and β-strands, while the numbers of the composition elements HE and EH in the IHES (N_IHE and N_IEH) indicate transitions in which a β-strand terminates next to an α-helix, as occurs when strands fold into parallel β-sheets. Hence, a protein with higher N_IHH or N_IEE is more likely to belong to the α+β class, and one with higher N_IHE or N_IEH to the α/β class. Four novel features are defined by

PA1 = N_IHH / N‴,  PA2 = N_IEE / N‴,
PA3 = N_IHE / N‴,  PA4 = N_IEH / N‴.


9. Although α-helices and β-strands are included in both the α/β and α+β classes, their positions influence the formation of parallel and anti-parallel β-sheets. For example, given a simplified sequence SS = bbbaabab, if its aa segment were moved to the second position of the SS, the anti-parallel sheets formed by bbb might become parallel sheets. This motivated us to construct the following novel features according to the positions of the α-helices and β-strands. Based on the distances between adjacent b's and adjacent a's in the SS (D_β and D_α), six novel features are defined by

PSS1 = AverD_β,  PSS2 = AverD_α,
PSS3 = VarD_β,   PSS4 = VarD_α,
R_β = AverD_β / VarD_β,  R_α = AverD_α / VarD_α,

where AverD_β and AverD_α are the averages of D_β and D_α, and VarD_β and VarD_α are the variances of D_β and D_α, respectively. A computational sketch covering features 6–9 is given below.
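The sketch below (again our own illustration, not the authors' code) derives the HES, ISSS, IHES and SS sequences and computes the novel features 6–9 as we read their definitions; the overlapping-window counting, the population variance used for VarD and the neglect of degenerate inputs (sequences without helices or strands) are assumptions.

# A sketch (our own helpers, not the authors' code) of the derived sequences
# and of novel features 6-9.  Overlapping-window counts, index-difference
# distances and population variance are assumptions where the text is not
# explicit.
import re
from statistics import mean, pvariance

def derived_sequences(sss: str):
    """Return HES, IHES and SS as described in Section 2.2."""
    hes = sss.replace("C", "")
    # ISSS: helices shorter than 3 and strands shorter than 2 become coils.
    isss = re.sub(r"(?<!H)H{1,2}(?!H)", lambda m: "C" * len(m.group()), sss)
    isss = re.sub(r"(?<!E)E(?!E)", "C", isss)
    ihes = isss.replace("C", "")
    # SS: every SSS segment collapses to one letter (H->a, E->b, C->c),
    # after which the letters c are removed.
    ss = "".join({"H": "a", "E": "b", "C": "c"}[m.group()[0]]
                 for m in re.finditer(r"H+|E+|C+", sss)).replace("c", "")
    return hes, ihes, ss

def novel_features(sss: str) -> dict:
    n = len(sss)
    hes, ihes, ss = derived_sequences(sss)
    feats = {}
    # 6. Trend helix probabilities THP1..THP4 (H and H separated by k residues).
    for k in range(1, 5):
        feats[f"THP{k}"] = len(re.findall(f"(?=H.{{{k}}}H)", sss)) / n
    # 7. Aggregation in HES: overlapping HH / EE dimers, normalised by N'.
    feats["PHH"] = len(re.findall("(?=HH)", hes)) / len(hes)
    feats["PEE"] = len(re.findall("(?=EE)", hes)) / len(hes)
    # 8. PA1..PA4: HH, EE, HE, EH counts in IHES, normalised as in the text.
    for name, pat in (("PA1", "HH"), ("PA2", "EE"), ("PA3", "HE"), ("PA4", "EH")):
        feats[name] = len(re.findall(f"(?={pat})", ihes)) / len(ss)
    # 9. Distance statistics over the simplified sequence SS (letters a, b).
    def dist_stats(sym):
        pos = [i for i, c in enumerate(ss) if c == sym]
        d = [q - p for p, q in zip(pos, pos[1:])] or [0]
        return mean(d), pvariance(d)
    feats["PSS1"], feats["PSS3"] = dist_stats("b")   # AverD_beta, VarD_beta
    feats["PSS2"], feats["PSS4"] = dist_stats("a")   # AverD_alpha, VarD_alpha
    feats["R_beta"] = feats["PSS1"] / feats["PSS3"] if feats["PSS3"] else 0.0
    feats["R_alpha"] = feats["PSS2"] / feats["PSS4"] if feats["PSS4"] else 0.0
    return feats

print(novel_features("EEEEECCHHHHHHCEEEECCCCHHHEEEECCCCEEEE"))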

2.3. Feature selection

Feature selection is the process of identifying and removing as many irrelevant and redundant features as possible. This helps to obtain a more efficient prediction model and speeds up the computational analysis. Many feature selection methods have been used in a wide range of bioinformatics studies [33]. These methods can be divided into two main groups: filters and wrappers. Because they combine the feature selection procedure with a specific classifier, wrappers often achieve better results than filters. Thus, a wrapper approach based on the best-first search algorithm was adopted to choose a subset of the original features in this paper, using 10-fold cross-validation on the ASTRALtraining dataset with the SVM classifier described in Section 2.4 to avoid overfitting. Finally, an 18-dimensional integrated feature vector (IFV) was constructed from the above 27 features, fusing the information about the content and position of the predicted secondary structure elements.
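As an illustration of the wrapper idea only, the sketch below uses scikit-learn's SequentialFeatureSelector, which performs a plain greedy forward search rather than the best-first search used here, with an RBF SVM evaluated by 10-fold cross-validation; X and y are random placeholders for the 27 SSS-derived features and the class labels of ASTRALtraining.

# A sketch of wrapper-based feature selection.  The paper uses a best-first
# search; SequentialFeatureSelector below performs a plain greedy forward
# search and is used here only as an approximation of that wrapper.
import numpy as np
from sklearn.svm import SVC
from sklearn.feature_selection import SequentialFeatureSelector

rng = np.random.default_rng(0)
X = rng.random((200, 27))                 # stand-in for the 27 SSS-derived features
y = rng.integers(0, 4, size=200)          # stand-in for the 4 structural classes

selector = SequentialFeatureSelector(
    SVC(kernel="rbf"),                    # wrapper classifier, as in Section 2.4
    n_features_to_select=18,              # the paper keeps an 18-dimensional IFV
    direction="forward",
    scoring="accuracy",
    cv=10,                                # 10-fold CV on the training set
)
selector.fit(X, y)
print("selected feature indices:", np.flatnonzero(selector.get_support()))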

It can be formally expressed as

IFV = (p1, p2, p3, p4, p5, p6, p7, p8, p9, p10, p11, p12, p13, p14, p15, p16, p17, p18)^T,

where p1 = P(E), p2 = CMV_H, p3 = CMV_E, p4 = MaxsegH/N, p5 = MaxsegE/N, p6 = PnE/N, p7 = APnE/N, p8 = THP2, p9 = THP4, p10 = PHH, p11 = PEE, p12 = PA1, p13 = PA2, p14 = PA3, p15 = PA4, p16 = PSS2, p17 = PSS4, p18 = R_β.

2.4. Classification algorithm construction

The support vector machine (SVM), first introduced by Vapnik [34], is a popular learning algorithm mainly used in pattern recognition. It belongs to the family of margin-based classifiers and is a powerful method for prediction, classification and regression problems. Four kinds of kernel functions are available to perform the prediction: linear, polynomial, sigmoid and radial basis function (RBF). Empirical studies have shown that the RBF outperforms the other kernel functions [35,36]. Hence, the RBF was selected for the prediction in our study; it is defined as

K(x_i, x_j) = exp(-γ ‖x_i - x_j‖²).

The regularization parameter C [37] and the kernel parameter γ [37] were optimized by 10-fold cross-validation on the ASTRALtraining dataset with the grid search strategy of the LIBSVM software [37,38], where C ∈ [2^-5, 2^15] and γ ∈ [2^-15, 2^5]. Finally, the best overall accuracy was obtained with C = 2048 and γ = 0.013602, so these values were used in our model.
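The following sketch reproduces the same C/γ grid with scikit-learn's SVC (a LIBSVM wrapper) and GridSearchCV; the random X_train and y_train arrays are placeholders for the 18-dimensional IFVs and class labels, so the selected parameters will not reproduce the reported values.

# A sketch of the RBF-SVM parameter search over the same C / gamma grid.
# The paper runs LIBSVM's grid search tool; scikit-learn's SVC (a LIBSVM
# wrapper) with GridSearchCV is used here instead.  X_train and y_train are
# random placeholders, so the selected parameters will not reproduce the
# reported C = 2048 and gamma = 0.013602.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X_train = rng.random((200, 18))           # stand-in for the 18-dimensional IFVs
y_train = rng.integers(0, 4, size=200)    # stand-in for the 4 structural classes

param_grid = {
    "C": 2.0 ** np.arange(-5, 16, 2),     # C in [2^-5, 2^15]
    "gamma": 2.0 ** np.arange(-15, 6, 2), # gamma in [2^-15, 2^5]
}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, scoring="accuracy", cv=10)
search.fit(X_train, y_train)              # SVC handles the 4 classes internally (one-vs-one)
print("best parameters:", search.best_params_)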


2.5. Performance measures

In statistical prediction, the jackknife and 10-CV tests are widely used to assure the statistical validity of a predictor [3,39,40]. They were also employed to evaluate the reliability and stability of our method. For comprehensive evaluation, the individual sensitivity (Sens), the individual specificity (Spec) and the Matthews correlation coefficient (MCC) over each of the four structural classes, as well as the overall prediction accuracy (OA) over the entire dataset, are reported. They are defined as follows [41]:

Sens_j = TP_j / (TP_j + FN_j) = TP_j / |C_j|,

Spec_j = TN_j / (FP_j + TN_j) = TN_j / Σ_{k≠j} |C_k|,

MCC_j = (TP_j · TN_j - FP_j · FN_j) / √[(FP_j + TP_j)(TP_j + FN_j)(FP_j + TN_j)(TN_j + FN_j)],

OA = (Σ_j TP_j) / (Σ_j |C_j|),

where TN_j, TP_j, FN_j, FP_j and |C_j| are the numbers of true negatives, true positives, false negatives, false positives, and proteins in the structural class C_j, respectively.
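For illustration, the small helper below (our own, not part of the paper) evaluates Sens, Spec and MCC for each structural class and the overall accuracy OA directly from the definitions above, given lists of true and predicted class labels.

# A small helper (ours, for illustration) that evaluates Sens, Spec and MCC
# for each structural class and the overall accuracy OA, directly from the
# definitions above, given lists of true and predicted class labels.
import math

def class_metrics(y_true, y_pred, classes=("all-α", "all-β", "α/β", "α+β")):
    n = len(y_true)
    metrics = {}
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        tn = n - tp - fn - fp
        sens = tp / (tp + fn) if tp + fn else 0.0      # TP_j / |C_j|
        spec = tn / (fp + tn) if fp + tn else 0.0      # TN_j / sum of the other |C_k|
        denom = math.sqrt((fp + tp) * (tp + fn) * (fp + tn) * (tn + fn))
        mcc = (tp * tn - fp * fn) / denom if denom else 0.0
        metrics[c] = {"Sens": sens, "Spec": spec, "MCC": mcc}
    metrics["OA"] = sum(t == p for t, p in zip(y_true, y_pred)) / n
    return metrics

y_true = ["all-α", "all-β", "α/β", "α+β", "all-α", "α/β"]
y_pred = ["all-α", "all-β", "α+β", "α+β", "all-α", "α/β"]
print(class_metrics(y_true, y_pred))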

3. Results and discussion

3.1. Structural class prediction accuracy

To check the consistency of the proposed method, two test methods, jackknife and 10-CV, were used to evaluate the model on the benchmark datasets. The results are listed in Tables 2 and 3. As can be seen from Tables 2 and 3, the overall accuracies of both the

jackknife test and the 10-CV test were high for all datasets (all above 80%), indicating that our predictor is quite promising for generating reliable results. The prediction performance demonstrates that the predictor not only possesses high accuracy but is also quite stable, even though the dataset sizes and similarities differ. The jackknife test was employed here for the final predictor since it is not only rigorous and objective [41] but also provides a unique result for a given dataset. According to Table 2, the Sens, Spec and MCC values of the all-α class were the best for all datasets, while the values of the α+β class were the lowest; for example, the MCC was only 59.65% on the ASTRALtraining dataset. This implies that the former is the easiest class to predict and the latter the most difficult to identify. As mentioned by previous researchers, it is difficult to differentiate the α+β class because of its non-negligible overlap with the other classes [16]. The low accuracy of the α+β class also reflects the challenge of identifying anti-parallel sheets. The results of the 10-CV test shown in Table 3 are similar.

3.2. Analysis and comparison with other prediction methods

Table 2. The prediction quality of our method on four datasets by the jackknife test.

Dataset          Class   Sens (%)   Spec (%)   MCC (%)
ASTRALtraining   All-α   94.06      97.24      90.23
                 All-β   81.72      96.61      80.44
                 α/β     79.55      94.29      74.99
                 α+β     73.79      87.27      59.65
                 OA      81.80
ASTRALtest       All-α   95.16      98.25      93.05
                 All-β   80.97      96.33      79.33
                 α/β     83.94      93.27      76.60
                 α+β     72.51      88.73      60.73
                 OA      82.69
25PDB            All-α   94.81      96.99      90.90
                 All-β   82.39      96.26      80.62
                 α/β     81.21      96.01      78.24
                 α+β     77.32      89.45      65.41
                 OA      84.10
640              All-α   92.75      98.01      90.76
                 All-β   81.82      97.53      82.48
                 α/β     89.27      93.52      81.30
                 α+β     74.27      89.55      63.25
                 OA      84.22

Table 3. The prediction quality of our method on four datasets by the 10-CV test.

Dataset          Class   Sens (%)   Spec (%)   MCC (%)
ASTRALtraining   All-α   94.06      97.38      90.57
                 All-β   82.94      96.28      80.70
                 α/β     79.15      94.19      74.53
                 α+β     72.35      87.27      58.50
                 OA      81.59
ASTRALtest       All-α   95.00      98.11      92.66
                 All-β   80.82      96.47      79.59
                 α/β     84.21      92.98      76.42
                 α+β     72.24      88.92      60.79
                 OA      82.61
25PDB            All-α   95.26      96.83      91.01
                 All-β   83.50      96.67      82.33
                 α/β     81.79      96.08      79.08
                 α+β     77.10      89.77      65.69
                 OA      84.58
640              All-α   92.03      97.81      89.86
                 All-β   81.79      97.94      83.34
                 α/β     88.69      93.07      80.36
                 α+β     74.83      89.54      63.71
                 OA      84.05
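As a consistency check, note that OA is the |C_j|-weighted average of the per-class sensitivities, so the ASTRALtraining jackknife row of Table 2 follows from the class sizes in Table 1:

OA = (Σ_j |C_j| · Sens_j) / (Σ_j |C_j|) ≈ (640 × 0.9406 + 662 × 0.8172 + 748 × 0.7955 + 763 × 0.7379) / 2813 ≈ 0.818,

in agreement with the reported OA of 81.80%.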

As mentioned earlier, the novel features are intended to improve the accuracies. In order to show their contribution, experiments were performed with the jackknife test on all the mentioned datasets using the 18 features and using only the 7 reused features; the results are given in Table 4. According to Table 4, it is obvious that all prediction accuracies were improved after adding the novel features.

Table 4. Comparison of the accuracies between our method with 18 features and with only the 7 reused features.

Dataset          Method                     All-α   All-β   α/β     α+β     OA
ASTRALtraining   IFV                        94.06   81.72   79.55   73.79   81.80
                 Excluding novel features   92.97   77.95   77.81   69.46   79.03
ASTRALtest       IFV                        95.16   80.97   83.94   72.51   82.69
                 Excluding novel features   93.28   73.87   80.19   69.37   78.74
25PDB            IFV                        94.81   82.39   81.21   77.32   84.10
                 Excluding novel features   94.36   77.88   80.64   71.43   81.11
640              IFV                        92.75   81.82   89.27   74.27   84.22
                 Excluding novel features   86.23   77.92   87.57   72.51   80.94

The higher values are highlighted in bold face.


Below, our method is compared with 7 previously published methods, including the well-known SCPRED [17] and MODAS [21], which are often used as baselines for comparison. We also compared it with several competing structural class prediction methods: RKS-PPSC [22], Liu and Jia [20], Zhang et al. [19], Ding et al. [18] and Zhang et al. [23]. As shown in Table 5, the proposed method obtained the highest overall accuracies among all the tested methods on the ASTRALtest and 640 datasets (82.69% and 84.22%), improvements of 0.36% and 0.78% over the previous best-performing results [18]. On ASTRALtest, the all-α, all-β and α+β class accuracies were 0.63%, 3.48% and 1.04% higher than those of the method of Ding et al. [18]. For the 25PDB dataset, the accuracy of the all-β class was 1.13% higher than the previous best result [18]. Compared with our previous model [23], the overall accuracy was improved by 0.40%, and the all-β and α+β class accuracies were improved by 1.59% and 1.82%. For the 640 dataset, the overall accuracy and the all-β class accuracy were 0.78% and 5.20% higher than the previous best results of Ding et al. [18], respectively, and the α/β and α+β class accuracies were satisfactory, equal to the previous best values (89.27% and 74.27%). To evaluate our method further, predictions with our previous model [23] were also performed on the 640 dataset; the overall accuracy and the four structural class accuracies were 83.44%, 92.03%, 79.87%, 83.62% and 79.53%, respectively, so the accuracies of the overall, all-α, all-β and α/β classes were improved by 0.78%, 0.72%, 1.95% and 5.65%. From Table 5, some results were inferior to the best among the compared methods. This is partly because the features we used ignore some less common secondary structure elements such as β-turns, and because the real structures of proteins are much more complex than the theoretical model. Although the improvements look small, they are meaningful for identifying protein structural classes. For example, only around 38,221 PDB entries with 110,800 domains or proteins had known structural class labels in SCOP (as of February 2009), while there were more than 8,000,000 non-redundant protein sequences in the Protein database at the National Center for Biotechnology Information (NCBI). Hence, a 0.1% improvement in accuracy could help to find the accurate structural class labels of about 8000 proteins. These prediction improvements clearly demonstrate that our method is very promising for recognizing protein structural classes.

Table 5. Performance comparison of different methods on the four datasets.

Dataset          Reference    All-α   All-β   α/β     α+β     OA
ASTRALtraining   This study   94.06   81.72   79.55   73.79   81.80
ASTRALtest       [17]         93.13   78.33   83.38   64.27   79.14
                 [18]         94.53   77.49   87.28   71.47   82.33
                 This study   95.16   80.97   83.94   72.51   82.69
25PDB            [17]         92.60   80.10   74.00   71.00   79.70
                 [21]         92.30   83.70   81.20   68.30   81.40
                 [22]         92.80   83.30   85.80   70.10   82.90
                 [20]         92.60   81.30   81.50   76.00   82.90
                 [23]         95.70   80.80   82.40   75.50   83.70
                 [19]         95.00   85.60   81.50   73.20   83.90
                 [18]         95.03   81.26   83.24   77.55   84.34
                 This study   94.81   82.39   81.21   77.32   84.10
640              [17]         90.60   81.80   85.90   66.70   80.80
                 [21]         89.10   85.10   88.10   71.40   83.10
                 [18]         94.93   76.62   89.27   74.27   83.44
                 This study   92.75   81.82   89.27   74.27   84.22

4. Conclusions

We propose a novel, accurate method for protein structural class prediction. The novelty lies in an 18-dimensional integrated feature vector constructed by fusing the content and position information of the SSS. The consistent results of the jackknife and 10-CV tests demonstrate that the proposed method is reliable for low-similarity datasets. The proposed sequence representation includes 11 novel features and achieves satisfactory prediction accuracy compared with previous methods on the benchmark datasets. This is because the IFV accurately captures the relationship between the sequence and the protein structural class. These results show that the proposed method is a very powerful tool for protein structural class prediction, especially for low-similarity sequences.

Conflict of interest

The authors declare that they have no conflict of interest.

Acknowledgements


The authors thank the anonymous referees for many valuable suggestions that have improved this manuscript. This work is supported by the National Natural Science Foundation of China (No. 11271341) and the Fundamental Research Funds for the Central Universities (No. 201362031).

Appendix A. Supplementary data

Supplementary data related to this article can be found at http://dx.doi.org/10.1016/j.biochi.2014.05.008.

References

[1] M. Levitt, C. Chothia, Structural patterns in globular proteins, Nature 261 (1976) 552–558.
[2] M. Gromiha, S. Selvaraj, Protein secondary structure prediction in different structural classes, Protein Eng. 11 (1998) 249–251.
[3] K.C. Chou, Energy-optimized structure of antifreeze protein and its binding mechanism, J. Mol. Biol. 223 (1992) 509–517.
[4] K.C. Chou, C.T. Zhang, Prediction of protein structural classes, Crit. Rev. Biochem. Mol. Biol. 30 (1995) 275–349.
[5] I. Bahar, A.R. Atilgan, R.L. Jernigan, B. Erman, Understanding the recognition of protein structural classes by amino acid composition, Proteins 29 (1997) 172–185.
[6] K.C. Chou, Progress in protein structural class prediction and its impact to bioinformatics and proteomics, Curr. Protein Pept. Sci. 6 (2005) 423–436.
[7] K.C. Chou, Structural bioinformatics and its impact to biomedical science, Curr. Med. Chem. 11 (2004) 2105–2134.
[8] H. Nakashima, K. Nishikawa, T. Ooi, The folding type of a protein is relevant to the amino acid composition, J. Biochem. 99 (1986) 153–162.
[9] K.C. Chou, A novel approach to predicting protein structural classes in a (20-1)-D amino acid composition space, Proteins 21 (1995) 319–344.
[10] K.D. Kedarisetti, L.A. Kurgan, S. Dick, Classifier ensembles for protein structural class prediction with varying homology, Biochem. Biophys. Res. Commun. 348 (2006) 981–988.
[11] S. Costantini, A.M. Facchiano, Prediction of the protein structural class by specific peptide frequencies, Biochimie 91 (2009) 226–229.
[12] J.Y. Yang, Z.L. Peng, Z.G. Yu, R.J. Zhang, V. Anh, D.S. Wang, Prediction of protein structural classes by recurrence quantification analysis based on chaos game representation, J. Theor. Biol. 257 (2009) 618–626.
[13] K.C. Chou, Prediction of protein cellular attributes using pseudo-amino acid composition, Proteins 43 (2001) 246–255.
[14] Y.S. Ding, T.L. Zhang, K.C. Chou, Prediction of protein structure classes with pseudo amino acid composition and fuzzy support vector machine network, Protein Pept. Lett. 14 (2007) 811–815.
[15] K. Chou, Y. Cai, Prediction of protein subcellular locations by GO-FunD-PseAA predictor, Biochem. Biophys. Res. Commun. 321 (2004) 1007–1009.
[16] K. Chen, L.A. Kurgan, J. Ruan, Prediction of protein structural class using novel evolutionary collocation-based sequence representation, J. Comput. Chem. 29 (2008) 1596–1604.


[17] L.A. Kurgan, K. Cios, K. Chen, SCPRED: accurate prediction of protein structural class for sequences of twilight-zone similarity with predicting sequences, BMC Bioinform. 9 (2008) 226.
[18] S. Ding, S. Zhang, Y. Li, T. Wang, A novel protein structural classes prediction method based on predicted secondary structure, Biochimie 94 (2012) 1166–1171.
[19] S. Zhang, S. Ding, T. Wang, High-accuracy prediction of protein structural class for low-similarity sequences based on predicted secondary structure, Biochimie 93 (2011) 710–714.
[20] T. Liu, C. Jia, A high-accuracy protein structural class prediction algorithm using predicted secondary structural information, J. Theor. Biol. 267 (2010) 272–275.
[21] M.J. Mizianty, L. Kurgan, Modular prediction of protein structural classes from sequences of twilight-zone identity with predicting sequences, BMC Bioinform. 10 (2009) 414.
[22] J. Yang, Z. Peng, X. Chen, Prediction of protein structural classes for low homology sequences based on predicted secondary structure, BMC Bioinform. 11 (2010) S9.
[23] L. Zhang, X. Zhao, L. Kong, A protein structural class prediction method based on novel features, Biochimie 95 (2013) 1741–1744.
[24] D. Cai, G.P. Zhou, Prediction of protein structural classes by neural network, Biochimie 82 (2000) 783–785.
[25] A. Anand, G. Pugalenthi, P.N. Suganthan, Predicting protein structural class by SVM with class-wise optimized features and decision probabilities, J. Theor. Biol. 253 (2008) 375–380.
[26] D. Cai, X.J. Liu, X. Xu, G.P. Zhou, Support vector machines for predicting protein structural class, BMC Bioinform. 2 (2001) 3.
[27] D. Cai, X.J. Liu, X.B. Xu, K.C. Chou, Prediction of protein structural classes by support vector machines, Comput. Chem. 26 (2002) 293–296.
[28] C. Chen, Y.X. Tian, X.Y. Zou, P.X. Cai, J.Y. Mo, Using pseudo-amino acid composition and support vector machine to predict protein structural class, J. Theor. Biol. 243 (2006) 444–448.
[29] J.D. Qiu, S.H. Luo, J.H. Huang, R.P. Liang, Using support vector machines for prediction of protein structural classes based on discrete wavelet transform, J. Comput. Chem. 30 (2009) 1344–1350.
[30] H.B. Shen, J. Yang, X.J. Liu, K.C. Chou, Using supervised fuzzy clustering to predict protein structural classes, Biochem. Biophys. Res. Commun. 334 (2005) 577–581.
[31] Z.X. Wang, Z. Yuan, How good is prediction of protein structural class by the component-coupled method? Proteins 38 (2000) 165–175.
[32] Y.F. Cao, S. Liu, L.D. Zhang, J. Qin, J. Wang, K.X. Tang, Prediction of protein structural class with rough sets, BMC Bioinform. 7 (2006) 20.
[33] Y. Saeys, I. Inza, P. Larranaga, A review of feature selection techniques in bioinformatics, Bioinformatics 23 (2007) 2507–2517.
[34] V. Vapnik, Statistical Learning Theory, Wiley, New York, 1998.
[35] Z. Yuan, B. Huang, Prediction of protein accessible surface areas by support vector regression, Proteins 57 (2004) 558–564.
[36] Z. Yuan, T.L. Bailey, R.D. Teasdale, Prediction of protein B-factor profiles, Proteins 58 (2005) 905–912.
[37] C.C. Chang, C.J. Lin, LIBSVM: A Library for Support Vector Machines, 2001.
[38] http://www.csie.ntu.edu.tw/~cjlin/libsvm.
[39] L.A. Kurgan, L. Homaeian, Prediction of structural classes for protein sequences and domains - impact of prediction algorithms, sequence representation and homology, and test procedures on accuracy, Pattern Recognit. 39 (2006) 2323–2343.
[40] K.C. Chou, H.B. Shen, Recent progress in protein subcellular location prediction, Anal. Biochem. 370 (2007) 1–16.
[41] T. Liu, X. Zheng, J. Wang, Prediction of protein structural class for low-similarity sequences using support vector machine and PSI-BLAST profile, Biochimie 92 (2010) 1330–1334.
