Biochimie 101 (2014) 104–112
Contents lists available at ScienceDirect
Biochimie journal homepage: www.elsevier.com/locate/biochi
Research paper
High-accuracy prediction of protein structural classes using PseAA structural properties and secondary structural patterns
Junru Wang b,1, Yan Li a,1, Xiaoqing Liu c, Qi Dai a,*, Yuhua Yao a, Pingan He d
a College of Life Sciences, Zhejiang Sci-Tech University, Hangzhou 310018, People's Republic of China
b College of Mechanical Engineering and Automation, Zhejiang Sci-Tech University, Hangzhou 310018, People's Republic of China
c College of Sciences, Hangzhou Dianzi University, Hangzhou 310018, People's Republic of China
d College of Sciences, Zhejiang Sci-Tech University, Hangzhou 310018, People's Republic of China
Article info
Article history: Received 22 July 2013; Accepted 30 December 2013; Available online 8 January 2014
Keywords: PseAAs; Long-range structural property; Local structural correlation; Protein structural class prediction; Support vector machine

Abstract
Since the introduction of PseAAs and functional domains, promising results have been achieved in protein structural class prediction, but some challenges still exist in the representation of the PseAA structural correlation and structural domains. This paper proposes a high-accuracy prediction method using novel PseAA structural properties and secondary structural patterns, reflecting the long-range and local structural properties of the PseAAs and certain compact structural domains. The proposed method was tested against competing prediction methods in four experiments. The experimental results indicate that the proposed method achieved the best performance. Its overall accuracies for the datasets 25PDB, D640, FC699 and 1189 are 88.8%, 90.9%, 96.4% and 87.1%, which are 4.5%, 7.6%, 2% and 3.9% higher than the existing best-performing method. This understanding can be used to guide the development of more powerful methods for protein structural class prediction. The software and supplementary material are freely available at http://bioinfo.zstu.edu.cn/PseAA-SSP. © 2014 Elsevier Masson SAS. All rights reserved.
* Corresponding author. Tel.: +86 57186843746. E-mail address: [email protected] (Q. Dai).
1 Junru Wang and Yan Li contributed equally to this work as co-first authors.
0300-9084/$ – see front matter © 2014 Elsevier Masson SAS. All rights reserved. http://dx.doi.org/10.1016/j.biochi.2013.12.021

1. Introduction

In the post-genomic era, the study of sequence-to-structure relationships and functional annotation plays an important role in molecular biology. The function of a protein is related to its 3D structure and can be efficiently determined by sequence and structure analysis [1–5]. Knowledge of the protein structural class provides useful information towards the determination of protein structure [2]. According to the definition by Levitt and Chothia [3], proteins can be classified into the following four structural classes: all-α, all-β, α+β and α/β [4,5]. The two former classes include structures dominated by α-helices and β-strands, respectively. The two latter classes correspond to structures that include both helices and strands: in the α+β class these secondary structures are segregated, whereas in the α/β class they are interspersed. The structural class has become one of the important features of a protein and has played an important role in both experimental and theoretical studies in protein science, because prior knowledge of protein structural classes helps to improve the prediction accuracy of protein secondary and tertiary structure [6–11].

The exponential growth of newly discovered protein sequences has opened a large gap between the number of sequence-known and the number of structure-known proteins, and the time and cost burden of experimental methods for finding the three-dimensional structure is becoming ever harder to bear. Hence there is a critical need to develop automated methods for fast and accurate determination of protein structures in order to reduce the gap.

Owing to the importance of protein structural class prediction, significant efforts have been devoted to this problem during the past 30 years, aiming to find a prediction model that automatically determines the structural class from protein sequences and predicted secondary structures [8,12–14]. Previous studies have shown that the protein structural class is strongly correlated with the amino acid (AA) sequence, and that the structural class can be predicted from the AA composition. In representing a protein sample by its AA composition, however, many important features associated with the sequence order are completely missed, which undoubtedly reduces the success rate of prediction. In view of this, various descriptors were proposed to improve the predictive accuracy, including composition of short polypeptides [15,16], pseudo
J. Wang et al. / Biochimie 101 (2014) 104–112
AA composition [17], collocation of AA, function domain composition [18], and position-specific scoring matrix profiles computed by the position-specific iterative basic local alignment search tool (PSI-BLAST) [19].

Pseudo amino acid (PseAA) composition has been widely used to convert complicated protein sequences of various lengths into fixed-length digital feature vectors while keeping considerable sequence-order information. The concept of PseAA composition was originally proposed for improving the prediction quality of protein subcellular localization and membrane protein type [20]. The essence of PseAA composition is to keep using a discrete model to represent a protein sample, yet without completely losing its sequence-order information. According to its definition, the PseAA composition for a given protein sample is expressed by a set of 20 + λ discrete numbers, where the first 20 represent the 20 components of the classical amino acid composition, while the additional λ numbers incorporate some of its sequence-order information via different kinds of coupling modes. Since the concept of Chou's PseAAC was introduced, various PseAAC approaches have been developed to deal with a variety of problems in proteins and protein-related systems. Because of its wide usage, a very flexible PseAAC generator, called "PseAAC" [21], was recently established at the website http://chou.med.harvard.edu/bioinf/PseAAC/, by which users can generate 63 different kinds of PseAA composition. More recently, Liu et al. constructed PseAAC by using low-frequency Fourier spectrum analysis [22]. Xiao et al. introduced a kind of PseAAC by measuring the complexity of a protein digital signal sequence [23]. Then, Lin and Li proposed a novel method to generate PseAAC, which was based on the diversity of the amino acid and dipeptide composition [24]. Zhang et al. used the approximate entropy and hydrophobicity pattern of a protein sequence to construct PseAAC [14], and Xiao et al.
introduced the gray dynamic modeling to construct PseAAC [25].

Although promising results have been achieved in many cases, the above methods appear to be less effective on low-homology datasets whose average pair-wise sequence identities are less than 40% [10,26]. In order to improve the prediction accuracy for low-similarity proteins, several new features of predicted secondary structures have been proposed [11,27–31]. They exploit the fact that proteins with low sequence similarity but in the same structural class are likely to have high similarity in their corresponding secondary structure elements. Taking this fact into account, Kurgan et al. computed the content of predicted secondary structural elements (contentSE), normalized count of segments (NCount), length of the longest segment (MaxSeg), normalized length of the longest segment (NMaxSeg), average length of the segment (AvgSeg) and normalized average length of the segment (NAvgSeg) from the predicted secondary structures for protein structural class prediction [27]. Zheng and Kurgan counted the 3PATTERN of the predicted secondary structures to improve β-turn prediction [28]. In MODAS, the predicted secondary structure information was combined with evolutionary profiles to perform the prediction [11]. In 2010, Liu and Jia found that α-helices and β-strands alternate more frequently in α/β proteins than in α+β proteins, and counted their alternating frequency as well as the content of parallel and antiparallel β-sheets [29]. Zhang et al. computed the transition probability matrix (TPM) of the reduced predicted secondary structural sequences and added it to protein structural class prediction [30]. With the help of the features of predicted secondary structures, the prediction accuracy has been improved significantly, reaching between 80% and 85% on several low-similarity benchmark datasets.
The available approaches have achieved promising results in protein structural class prediction, but several critical problems still exist. First, the existing PseAA-based methods convert a protein from a character sequence to a numerical sequence and obtain its correlation-type PseAA components by the routine correlation equations, in which the λ-th order correlation factor is defined on two positions separated by λ letters. The λ-th order correlation factor therefore focuses mostly on the correlation of pairs of positions in the protein sequence, and it does not capture the total structural correlation of a PseAA pattern of length λ. Since long-range and local interactions are a major driving force of the protein folding process, the structural correlation among the contiguous λ residues along the protein sequence should not be ignored in protein structural class prediction. Second, some prediction methods extract various features of predicted secondary structure elements without considering the appearance of compact structural patterns or domains. It is well known that different combinations of domains result in a variety of protein structures, so compact structural patterns of the predicted secondary structures should be taken into account when predicting protein structural classes.

With the above problems in mind, we present a scheme to enhance protein structural class prediction by combining PseAA structural properties with predicted secondary structural patterns. The contents can be summarized as follows:

1. We explored a potential way to describe PseAA structural properties, including long-range structural properties of the PseAAs, local structural correlation of the PseAAs and the PseAA composition distribution.
2. To highlight the contribution of secondary structural patterns or domains, we converted a predicted secondary structure into a segment sequence composed of helix segments and strand segments. We then analyzed the distribution of a series of compact structural patterns, which are determinant factors in forming a particular structural class.
3. We implemented a multi-class support vector machine (SVM) to predict the protein structural class using both PseAA structural properties and secondary structural patterns. Through a comprehensive comparison, we addressed how well the proposed method performs in comparison with available competing prediction methods.
2. Materials and methods

2.1. Datasets

In order to facilitate comparison with previous studies, we selected four widely used low-homology benchmark datasets in which any pair of sequences shares twilight-zone similarity [11,26–30]. This means that any test sequence shares twilight-zone identity with any sequence in the training set used to generate the proposed classification model. The first dataset, referred to as 25PDB, was selected using the 25% PDBSELECT list [31], which includes proteins from PDB that were scanned at high resolution and have low identity, on average about 25%. The dataset was originally published in Ref. [26] and was used to benchmark two structural class prediction methods [10,32]. It contains 1673 proteins and domains. The second dataset, referred to as 1189, was downloaded from the RCSB Protein Data Bank with the PDB IDs listed in Ref. [26]. It contains 1092 proteins with 40% sequence identity. The third dataset, referred to as 640 (also written D640), was first studied by Chen et al. (2008) [19]. It contains 640 proteins with 25% sequence identity, and their classification labels were retrieved from the SCOP database [5]. The final dataset, named FC699, includes 858 sequences that share low (about 40%) identity with each other. More details are presented in Table 1.
Table 1. The number of proteins belonging to different structural classes in the datasets.

Dataset   All-α   All-β   α/β   α+β   Total
25PDB     443     443     346   441   1673
640       138     154     177   171   640
FC699     130     269     377   82    858
1189      223     294     334   241   1092
2.2. PseAA structural properties

The hydropathy profile is an important physicochemical property of amino acids that influences protein function and structure. Based on the hydropathy profile, the 20 amino acids can be classified into three groups: the internal group (I), the external group (E) and the ambivalent group (A). The reduced representation of a protein sequence is defined according to the following rule:

\[
\Phi(S(i)) =
\begin{cases}
I, & \text{if } S(i) \in \{F, I, L, M, V\} \\
E, & \text{if } S(i) \in \{D, E, H, K, N, Q, R\} \\
A, & \text{if } S(i) \in \{S, T, Y, C, W, G, P, A\}
\end{cases}
\]

where $S(i)$ represents the i-th letter of the protein sequence $S$, and $\Phi(S(i))$ represents the substitution for $S(i)$. Liu et al. derived this reduced alphabet from the hydropathy profile of the amino acids [33]. For example, given a protein sequence S = ESHFTCISLNEYAMQ, we get its reduced sequence Φ(S) = EAEIAAIAIEEAAIE. Here, we analyzed the structural properties of the PseAAs based on the reduced protein sequence.

2.2.1. PseAA composition (PseAAC)

There is a large body of literature on word statistics, where a sequence is interpreted as a succession of symbols and analyzed through the frequencies of its short segments [20]. A k-word $w_k$ is a series of k consecutive letters from the set {I, E, A}:

\[
w_k = w_{k1} w_{k2} \cdots w_{kk}, \qquad w_{ki} \in \{I, E, A\}, \quad i = 1, 2, \ldots, k.
\]

For a reduced protein sequence, let $f_i(w_k)$ denote the occurrence frequency of the i-th k-word. Here k-words are allowed to overlap in the sequence. The composition of the i-th k-word in the reduced protein sequence is then given by

\[
p_i(w_k) = \frac{f_i(w_k)}{\sum_{j=1}^{3^k} f_j(w_k)}.
\]

The reduced protein sequence in the composition space is then defined as

\[
p(\xi) = \left[ p_1(w_k), p_2(w_k), \ldots, p_{3^k}(w_k) \right].
\]
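The reduction rule and the k-word composition above can be illustrated with a minimal Python sketch. The function names, the lexicographic ordering of the $3^k$ words over "IEA", and the all-zero result for sequences shorter than k are assumptions made for illustration; the paper does not specify them.

```python
from collections import Counter
from itertools import product

# Hydropathy-based grouping from Section 2.2: internal (I), external (E), ambivalent (A).
GROUPS = {"I": "FILMV", "E": "DEHKNQR", "A": "STYCWGPA"}
REDUCE = {aa: group for group, members in GROUPS.items() for aa in members}


def reduce_sequence(seq):
    """Map each amino acid letter to its reduced-alphabet letter."""
    return "".join(REDUCE[aa] for aa in seq)


def kword_composition(reduced, k):
    """Relative frequency of every overlapping k-word over {I, E, A}."""
    counts = Counter(reduced[i:i + k] for i in range(len(reduced) - k + 1))
    total = sum(counts.values())
    # Assumption: words are enumerated in lexicographic order over "IEA".
    words = ["".join(w) for w in product("IEA", repeat=k)]
    if total == 0:  # sequence shorter than k (assumed convention)
        return {w: 0.0 for w in words}
    return {w: counts[w] / total for w in words}


reduced = reduce_sequence("ESHFTCISLNEYAMQ")
print(reduced)  # EAEIAAIAIEEAAIE
print(kword_composition(reduced, 1))
```

For k = 3 the function returns a 27-component vector whose entries sum to one.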
2.2.2. Long-range structural properties of PseAA (LrSPseAA)

In PseAA composition, the sequence order and length information of the protein are completely lost, which in turn affects the prediction accuracy. In biological sequences, the order of the elements is as important as their content. Taking the order of these elements into account, we explored the long-range structural correlation based on inter-word distances. Given a k-word $w_k$ consisting of k consecutive letters from the set {I, E, A}, we transformed a reduced protein sequence into a numerical sequence whose elements denote the interval distances between two nearest occurrences of $w_k$. The interval distance between two nearest occurrences of $w_k$ is a variable $\xi$. When we analyze PseAA structural properties, we are often interested not in the particular interval distance that occurs, but rather in some number associated with that outcome. Therefore, we calculated the probability $P_{(w_k)}(\xi)$ of each value of the variable $\xi$ and obtained its distribution function. In probability theory and statistics, the coefficient of variation is a normalized measure of dispersion of a probability distribution. In order to describe the long-range structural correlation of the PseAAs, we further calculated the long-range structural index LrSPseAA:

\[
\mathrm{LrSPseAA}(w_k) = \frac{E_{(w_k)}(\xi)}{\sqrt{E_{(w_k)}(\xi^2) - E_{(w_k)}(\xi)^2}}
\]

where $E_{(w_k)}(\xi)$ is the expectation value of the variable $\xi$, defined as follows:

\[
E_{(w_k)}(\xi) = \sum_{\xi} \xi \, P_{(w_k)}(\xi)
\]

LrSPseAA is a useful statistic for comparing the degree of long-range structural correlation from one reduced protein sequence to another, even if the means are drastically different from each other.

2.2.3. Local structural properties of PseAA (LcSPseAA)

PseAA has been used widely and successfully to improve the prediction quality in diverse applications of bioinformatics [34–39]. It was proposed that the sequence-order effect along a protein chain can be approximately reflected by a set of sequence-order correlation factors defined as follows [20]:

\[
\theta_m = \frac{1}{L-m} \sum_{i=1}^{L-m} \Theta(S_i, S_{i+m}), \qquad (m = 1, 2, \ldots, \lambda \text{ and } \lambda < L)
\]

where $L$ denotes the length of the protein, $\theta_m$ is called the m-th rank coupling factor that harbors the m-th sequence-order correlation factor, and $\Theta(S_i, S_j)$ is the correlation function. It is worth noting that $\theta_m$ reflects the sequence-order correlation between two residues whose positions are separated by m residues along a protein sequence. That is to say, the m-th order correlation factor focuses mostly on the correlation of pairs of positions in the protein sequence, and it does not capture the total structural correlation among the contiguous m residues. With this problem in mind, we explored a potential way to describe the total structural correlation of an m-window PseAA. Given the first PseAA pattern of length m, its structural correlation is defined as follows:

1) At the beginning, we calculated $\Lambda_{m/2+1,\,m/2-1}$ as follows:
\[
\Lambda_{\frac{m}{2}+1,\,\frac{m}{2}-1} =
\begin{cases}
\Theta\left(R_{\frac{m}{2}}, R_{\frac{m}{2}+1}\right), & \text{if } m \text{ is even}, \\
\Theta\left(R_{\frac{m}{2}}, R_{\frac{m}{2}+2}\right), & \text{if } m \text{ is odd},
\end{cases}
\]

where $\Theta(R_i, R_j)$ is the correlation function defined as

\[
\Theta(R_i, R_j) = H(R_i) + H(R_j),
\]

and $H(R_i)$ and $H(R_j)$ are the hydrophobicity values of the amino acids $R_i$ and $R_j$, respectively.

2) If i is equal to j, or in the case of i + j = m, we assumed $\Lambda_{i,j} = H(R_i)$ or $H(R_j)$; otherwise $\Lambda_{i,j}$ was calculated as follows:

\[
\Lambda_{i,j} = \Theta\left(\Lambda_{i-1,j-1}, \Lambda_{i-1,j+1}\right), \quad \text{where} \quad
\begin{cases}
1 \le i \le \frac{m}{2}+1, & 1 \le j \le i \\
\frac{m}{2}+1, & 1 \le j \le m - x
\end{cases}
\]

3) We repeated step 2) until we obtained $\Lambda_{3,1}, \Lambda_{5,1}, \ldots, \Lambda_{m-1,1}$. In combination with $R_1$ and $R_m$, we obtained the structural correlation of the m-window PseAA (m-WSS):

\[
m\text{-WSS}_1 = \Theta\left(S_1, \Lambda_{3,1}, \Lambda_{5,1}, \ldots, \Theta(\Lambda_{m-2}, S_m)\right).
\]

For example, Fig. 1 shows the local structural correlation among the contiguous 9 residues along a protein chain. We then used a sliding window of length m, shifting the frame one residue at a time from position 1 to L − m + 1, and calculated the structural correlation of the m-window PseAA of a protein S:

\[
\mathrm{LcSPseAA} = \frac{1}{L-m+1} \sum_{i=1}^{L-m+1} m\text{-WSS}_i.
\]
LcSPseAA is called the m-window correlation factor that reflects the local structural correlation among the contiguous m residues along a protein chain.
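As a concrete illustration of the long-range index of Section 2.2.2, the following minimal Python sketch computes the interval distances of a k-word in a reduced sequence and the ratio of their mean to their standard deviation, as the LrSPseAA formula is written. The function names, the choice to measure a distance as the difference between consecutive start positions, and the value 0.0 returned when the index is undefined (fewer than two occurrences, or zero variance) are assumptions of this sketch, not details given in the paper.

```python
from collections import Counter
from math import sqrt


def interval_distances(reduced, word):
    """Distances between start positions of consecutive occurrences of `word`."""
    k = len(word)
    positions = [i for i in range(len(reduced) - k + 1) if reduced[i:i + k] == word]
    return [b - a for a, b in zip(positions, positions[1:])]


def lrs_pseaa(reduced, word):
    """Mean of the interval distances divided by their standard deviation."""
    distances = interval_distances(reduced, word)
    if not distances:
        return 0.0  # assumed convention: word occurs fewer than twice
    n = len(distances)
    # Empirical probability P(xi) of each observed distance xi.
    prob = {xi: c / n for xi, c in Counter(distances).items()}
    mean = sum(xi * p for xi, p in prob.items())
    second_moment = sum(xi * xi * p for xi, p in prob.items())
    variance = second_moment - mean ** 2
    if variance <= 1e-12:
        return 0.0  # assumed convention: all distances are equal
    return mean / sqrt(variance)


reduced = "EAEIAAIAIEEAAIE"  # reduced sequence from the example in Section 2.2
print(interval_distances(reduced, "E"))  # [2, 7, 1, 4]
print(lrs_pseaa(reduced, "E"))
```

Applying `lrs_pseaa` to every k-word of a fixed length gives one long-range component per word.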
Fig. 1. Structural correlation of 9-window PseAAs (9-WSS) of the contiguous 9 residues along a protein chain.

2.3. Predicted secondary structural patterns

When discussing protein folds and classes we first need to identify the folding unit. Structural domains can be thought of as the most fundamental units of the protein structure that capture the basic features of the entire protein. Among such features are (1) stability, (2) compactness, (3) presence of the hydrophobic core, and (4) ability to fold independently. These structural properties of domains suggest that atomic interactions within domains are more extensive than those between domains [40,41]. From this, it follows that domains can be identified by looking for groups of residues with a maximum number of atomic contacts within a group, but a minimum number of contacts between the groups. Domains are the building blocks of proteins: different combinations of domains result in a variety of protein structures. Sequence information is often insufficient for identifying the protein structural classes because the same structure can be reached from widely divergent sequence space (typically down to 30% sequence identity). Therefore, knowledge of protein structure is often a prerequisite to the delineation of structural domains.

Some studies used PSIPRED [42] to predict protein secondary structures and analyzed the contents and spatial arrangements of secondary structural elements for protein structural class prediction, such as the content of the predicted secondary structure elements or segments [11,27–30]. Instead of focusing on the secondary structural elements or segments, this paper puts emphasis on the structural domains. Since the total number of distinct structural domains is currently hovering around one thousand, we are not able to take all of them into account when predicting protein structural classes. Instead, some predicted secondary structural patterns that are associated with structural classes were chosen to represent certain compact structural domains. This means that when we talk about protein structural classification we actually mean structural pattern classification.

There are three kinds of compact structural domains: the α domain, the β domain and the α/β domain [43]. In protein secondary structures, at least three continuous α-helices are needed to form an α domain, and a β domain is formed by the combination of four or more continuous β-strands into a compact antiparallel sheet or tubular structure. As for the α/β domain, the number of βαβ and βααβ units occurring consecutively (or nearly consecutively but interrupted by a ββ hairpin, namely βαβββαβ) in the protein secondary structure should be greater than or equal to three. If the fragment with consecutive βαβ elements occurs more than once in a secondary structure sequence, the longest one is taken. Using the above rules, the following secondary structural patterns can be identified when recognizing the compact structural domains:

\[
\mathrm{SSPattern} = \{\alpha\alpha\alpha,\ \beta\beta\beta,\ \alpha\beta\alpha,\ \beta\alpha\beta,\ \alpha\beta\beta\alpha\}
\]

Analysis of the distribution of the secondary structural patterns begins with reducing a secondary structure sequence into a segment sequence, which is composed of helix segments and strand segments (denoted by α and β, respectively). Here, a helix (strand) segment refers to a continuous run of H (E) symbols in the secondary structure sequence. To highlight the influence of the arrangement of α-helix and β-strand segments, the coil segments were ignored in the reduced segment sequence. From this, the distribution of the secondary structural patterns was computed as follows:

\[
P(\mathrm{SSPattern}) = \left[ p(\alpha\alpha\alpha),\ p(\beta\beta\beta),\ p(\alpha\beta\alpha),\ p(\beta\alpha\beta),\ p(\alpha\beta\beta\alpha) \right]
\]

where

\[
p(k) = \frac{f(k)}{\mathrm{Length}(\text{reduced segment sequence})}, \qquad k \in \mathrm{SSPattern},
\]

and $f(k)$ is the number of occurrences of the secondary structural pattern k in the given reduced segment sequence. For example, we took the 417th protein in the FC699 dataset and used PSIPRED to predict its secondary structural sequence "CEEEECCCCCCCCCCCCCCCCCCHHHHHHHHHCCCCEEEEECCCCCCCCCCCCCCCEEECCCCCCEEECCCCCCCCHHHHHHHHHHHHHHHHHHHHHCCCCCCEEEEECCCCCHHHHHHHHHHHHHCCCCEEEECCCCCCCCCCCCCCCHHHHHHHHHHHHHHHHHHHCCCCEEEEECCCHHHHHHHHHHHHHCCCCEEEEEEEECCCCCCCCCCCCHHHHHCCCCCCCCCCCCCCCCHHHHHHHHHHHHHCCCCEEEEECCCCCHHHHHHHHHCCCCHHHHHHHHHHCCHHHHHHHHHHHHHHHHHCCCC". With the transform method, we obtained its reduced segment sequence βαβββαβαβαβαβααβααα, which has length 19. We further calculated its distribution of the secondary structural patterns:

\[
P(\mathrm{SSPattern}) = \left[ \frac{1}{19},\ \frac{1}{19},\ \frac{5}{19},\ \frac{5}{19},\ 0 \right]
\]
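The reduction to a segment sequence and the pattern counting can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the Latin letters `a` and `b` stand in for α and β, pattern occurrences are counted with overlaps allowed, and the function names are hypothetical.

```python
from itertools import groupby

# 'a' = helix segment (alpha), 'b' = strand segment (beta)
PATTERNS = ("aaa", "bbb", "aba", "bab", "abba")


def segment_sequence(ss):
    """Collapse each run of H or E into one symbol and drop coil (C) runs."""
    symbol = {"H": "a", "E": "b"}
    return "".join(symbol[state] for state, _ in groupby(ss) if state in symbol)


def count_overlapping(text, pattern):
    """Count occurrences of `pattern` in `text`, allowing overlaps (assumption)."""
    return sum(text.startswith(pattern, i) for i in range(len(text) - len(pattern) + 1))


def pattern_distribution(ss):
    """p(k) = f(k) / length of the reduced segment sequence, for each pattern k."""
    seg = segment_sequence(ss)
    if not seg:
        return {p: 0.0 for p in PATTERNS}
    return {p: count_overlapping(seg, p) / len(seg) for p in PATTERNS}


# Hypothetical toy secondary structure string, not taken from any dataset.
toy = "CEEEECCHHHHCCEEECEEECCHHHCHHHCHHH"
print(segment_sequence(toy))  # babbaaa
print(pattern_distribution(toy))
```

Applied to the segment sequence of the FC699 example written as `babbbababababaabaaa`, `count_overlapping` gives 1, 1, 5, 5 and 0 for `aaa`, `bbb`, `aba`, `bab` and `abba`.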
2.4. Classification algorithm construction

The support vector machine (SVM) is a type of learning machine based on statistical learning theory. In this paper, we adopted Vapnik's support vector machine to predict the protein structural class [44], as implemented in a Matlab toolbox. There are four protein structural classes in total, so the prediction of protein structural class is a four-class classification problem. We therefore adopted a multi-class SVM using the "one-against-others" strategy. Given a test protein of unknown category, the SVM first maps the input vectors into a feature space (possibly of higher dimension). Within that space, it finds an optimized linear division to solve the two-class or multi-class problem [45]. Finally, a prediction label is assigned to the test sample accordingly. A more detailed description of SVM is given in Vapnik's book [44].

Among the three kinds of cross-validation methods (single-test-set analysis, sub-sampling and jackknife analysis), the jackknife test is considered the most effective one [46]. Here, we used it to evaluate the performance of the proposed method. We also considered standard performance measures over the structural classes, including the accuracy for class $C_j$ and the overall accuracy, defined as the fraction of the proteins of class $C_j$, or of all the proteins tested, that are classified correctly. In addition, Sensitivity (Sens), Specificity (Spec) and Matthew's Correlation Coefficient (MCC) were used to evaluate the prediction accuracy, where the MCC value ranges between −1 and 1, 0 represents random correlation, and larger positive (negative) values indicate better (lower) prediction quality for a given class. Explicitly, they are defined by the following formulas:

\[
\mathrm{Accuracy}_j = \frac{TP_j}{|C_j|}, \qquad
\text{Overall accuracy} = \frac{\sum_j TP_j}{\sum_j |C_j|},
\]

\[
\mathrm{Sens} = \frac{TP}{TP + FN}, \qquad
\mathrm{Spec} = \frac{TN}{FP + TN},
\]

\[
\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}},
\]

where $TP_j$ is the number of true positives, FP is the number of false positives, TN is the number of true negatives, FN is the number of false negatives, and $|C_j|$ is the number of proteins in each structural class $C_j$ (all-α, all-β, α/β and α+β classes).

2.5. Prediction assessment

We selected the Gaussian kernel as the kernel function for the SVM because of its superiority in solving nonlinear problems compared with other kernel functions [47]. The parameters were selected to obtain the highest possible overall prediction accuracy. A simple grid search strategy over C and gamma values based on 10-fold cross-validation was used for each dataset, where C and gamma were allowed to take values only between $2^{-10}$ and $2^{10}$.

3. Results and discussion
This section includes the experimental results, a comparison of the competing methods on four benchmark datasets and a discussion of the selected features. This work began by reducing the protein sequence into the reduced protein sequence with the help of the hydropathy profile of amino acids, and described the PseAA structural properties, including the PseAA composition and the long-range and local structural properties of the PseAAs. We then used PSIPRED to predict the secondary structures of proteins and calculated the distribution of the secondary structural patterns. Merging the above information, we obtained a 45-dimensional vector to represent the combined information of a given protein. This combined information was then fed into the support vector machine to predict the protein structural class. The overall accuracy and the accuracy for each structural class are reported in this section.
3.1. Prediction accuracy of the proposed method on four benchmark datasets

Four widely used datasets with low sequence identity were used in this study: 25PDB, which comprises 1673 proteins of about 25% sequence identity; D640, which includes 640 proteins of about 25% sequence identity; FC699, with 858 proteins of about 40% sequence identity; and 1189, which contains 1092 proteins of about 40% sequence identity. We used a simple grid search strategy over C and gamma values based on 10-fold cross-validation for each dataset, where C and gamma were allowed to take values only between $2^{-10}$ and $2^{10}$. The final values of C (gamma) used for 25PDB, D640, FC699 and 1189 are 55.7 (0.38), 194 (0.1), 675.6 (0.06) and 103.9 (0.2), respectively. The results obtained by the proposed method are shown in Tables 2 and 3.

From Table 2, we find that the prediction of proteins in the all-α class is always the best (with sensitivities and specificities higher than 90% and Matthew's correlation coefficients higher than 0.9 for all the datasets). However, it seems very challenging to predict the α+β class, as its sensitivity and specificity are relatively low when compared with the other classes. The low prediction accuracy may be due to its non-negligible overlap with the other classes.

In Table 3, the proposed method achieved an overall accuracy of 88.8% in the 25PDB experiment, with 95%, 91.4%, 77.5% and 88.7% for the all-α, all-β, α/β and α+β classes, respectively. For the D640 dataset, the overall accuracy of the proposed method is 90.9%, and the accuracies for all-α, all-β, α/β and α+β are 95.7%, 89.6%, 89.3% and 90.1%, respectively. In the FC699 experiment, the overall accuracy of the proposed method is 96.4%, with 98.5%, 98.1%, 97.6% and 81.7% for all-α, all-β, α/β and α+β, respectively. The overall accuracy of the proposed method for the 1189 dataset is 87.1%, with 96.4%, 92.9%, 82.0% and 78.4% for all-α, all-β, α/β and α+β, respectively.
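The per-class measures reported in Table 2 treat each structural class in turn as the positive class and the other three as negative. A minimal Python sketch of that one-versus-rest bookkeeping is given below; the confusion matrix is a hypothetical toy example (rows are true classes, columns are predicted classes), the function names are illustrative, and returning 0.0 when a denominator vanishes is an assumed convention.

```python
from math import sqrt


def one_vs_rest_metrics(confusion, cls):
    """Sensitivity, specificity and MCC for class index `cls` against the rest."""
    n = len(confusion)
    tp = confusion[cls][cls]
    fn = sum(confusion[cls]) - tp                         # true cls, predicted other
    fp = sum(confusion[r][cls] for r in range(n)) - tp    # true other, predicted cls
    tn = sum(map(sum, confusion)) - tp - fn - fp
    sens = tp / (tp + fn) if tp + fn else 0.0
    spec = tn / (tn + fp) if tn + fp else 0.0
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sens, spec, mcc


def overall_accuracy(confusion):
    """Fraction of all tested proteins that are classified correctly."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    return correct / sum(map(sum, confusion))


# Hypothetical 4-class confusion matrix (all-alpha, all-beta, alpha/beta, alpha+beta).
toy = [
    [8, 1, 0, 1],
    [0, 9, 1, 0],
    [1, 0, 7, 2],
    [0, 1, 2, 7],
]
print(overall_accuracy(toy))  # 0.775
print(one_vs_rest_metrics(toy, 0))
```

With a jackknife test, the confusion matrix would be accumulated by leaving each protein out once and recording its predicted class.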
Table 2. The prediction quality of our method on the four test datasets.

Dataset   Class   Sens (%)   Spec (%)   MCC
25PDB     All-α   95.03      96.99      0.91
          All-β   91.42      97.56      0.90
          α/β     77.46      95.93      0.75
          α+β     88.66      94.56      0.82
640       All-α   95.65      98.80      0.94
          All-β   89.61      98.15      0.89
          α/β     89.27      94.60      0.83
          α+β     90.06      96.16      0.86
FC699     All-α   98.46      99.59      0.98
          All-β   98.14      99.66      0.98
          α/β     97.61      96.67      0.94
          α+β     81.71      98.71      0.83
1189      All-α   96.41      98.96      0.95
          All-β   92.86      98.74      0.93
          α/β     82.04      91.29      0.73
          α+β     78.42      93.41      0.71
3.2. Comparison of the proposed method with the competing predictions To evaluate the efficiency of the proposed method, we compared it with competing prediction methods on the same datasets. In this section, we selected the accuracy of each class and overall accuracy as evaluation indexes that were summarized in Table 2. As for the 25PDB dataset, we compared the proposed method with competing methods including SCPRED [27], MODAS [48], RKSPPSC [49], IEA-PSSF [50], AADP-PSSM [51], Zhang et al. [30], Ding et al. [52], Xia et al. [53], AAC-PSSM-AC [54]. Table 2 lists the accuracy of each class and overall accuracy of all the evaluated prediction methods. From Table 2, we found that the proposed method outperforms all other methods. There are only two methods that provide the overall accuracy over 84%. One is the proposed method, and the other is the method proposed by Ding et al. [51]. But the
109
overall accuracy of the proposed method is 88.8%, which is 4.5% higher than Ding’s method [51]. In the D640 experiment, the evaluated methods consist of RKSPPSC [49], IEA-PSSF [50], Ding et al. [51]. Table 2 summarizes the accuracy of each class and overall accuracy of all the evaluated prediction methods. We noted that the proposed method achieves the highest overall prediction accuracy among all the evaluated methods with overall accuracy 90.9%, which is 7.46% higher than the next best one proposed by Ding et al. [38]. As for the FC699 experiment, we compared the proposed method with the competing methods such as SCPRED [27], 11 features [29], IEA-PSSF [50]. Table 2 represents the accuracy of each class and overall accuracy of all the evaluated prediction methods. Table 2 indicates that the proposed method achieves the best performance among all the evaluated methods, with overall accuracy 96.4%. The next best one is IEA-PSSF [50]. In the 1189 dataset, we compared the proposed method with RKS-PPSC [49], IEA-PSSF [50], AADP-PSSM [51], Zhang et al. [30], Ding et al. [51], Xia et al. [53], AAC-PSSM-AC [54]. Table 2 lists the accuracy of each class and overall accuracy of all the evaluated prediction methods. From Table 2, we found that the proposed method outperforms all other methods. There is only the proposed method that provides the overall accuracy over 85%. Its overall accuracy is 87.1%, which is 3.9% higher than the next best one proposed by Zhang et al. [30]. From the above experiments, we noted that the proposed method achieved the best performance among all the competing methods. Its overall accuracies for datasets 25PDB, D640, FC699 and 1189 are 88.8%, 90.9%, 96.4% and 87.4%, which are 4.5%, 7.6%, 2% and 3.9% higher than the existing best-performing method. We attributed higher overall accuracies of the proposed method to integration of the PseAA structural properties and secondary structural patterns. 3.3. Selection of word length for PseAAC and LrSPseAA
Table 3 The accuracy of each class and overall accuracy of the proposed method for four datasets, and comparison with the competing prediction methods based on predicted protein secondary structures. Prediction accuracy is given in %.

| Dataset | Method | All-α | All-β | α/β | α+β | Overall |
|---|---|---|---|---|---|---|
| 25PDB | SCPRED [27] | 92.6 | 80.1 | 74.0 | 71.0 | 79.7 |
| | MODAS [11] | 92.3 | 83.7 | 81.2 | 68.3 | 81.4 |
| | RKS-PPSC [49] | 92.8 | 83.3 | 85.8 | 70.1 | 82.9 |
| | IEA-PSSF [50] | 90.1 | 84.7 | 79.5 | 77.6 | 83.1 |
| | AADP-PSSM [51] | 69.1 | 83.7 | 85.6 | 35.7 | 70.7 |
| | Zhang et al. [30] | 95.0 | 85.6 | 81.5 | 73.2 | 83.9 |
| | Ding et al. [52] | 95.0 | 81.3 | 83.2 | 77.6 | 84.3 |
| | Xia et al. [53] | 92.6 | 72.5 | 71.7 | 71.0 | 77.2 |
| | AAC-PSSM-AC [54] | 85.3 | 81.7 | 73.7 | 55.3 | 74.1 |
| | This paper | 95.0 | 91.4 | 77.5 | 88.7 | 88.8 |
| D640 | IEA-PSSF [50] | 90.1 | 84.7 | 79.5 | 77.6 | 83.1 |
| | RKS-PPSC [49] | 89.1 | 85.1 | 88.1 | 71.4 | 83.1 |
| | Ding et al. [52] | 94.9 | 76.6 | 89.3 | 74.3 | 83.4 |
| | This paper | 95.7 | 89.6 | 89.3 | 90.1 | 90.9 |
| FC699 | SCPRED [27] | - | - | - | - | 87.5 |
| | 11 features [29] | 97.7 | 88.0 | 89.1 | 84.2 | 89.6 |
| | IEA-PSSF [50] | 96.9 | 95.9 | 97.1 | 74.4 | 94.5 |
| | This paper | 98.5 | 98.1 | 97.6 | 81.7 | 96.5 |
| 1189 | IEA-PSSF [50] | 90.1 | 84.7 | 79.5 | 77.6 | 83.1 |
| | RKS-PPSC [49] | 89.2 | 86.7 | 82.6 | 65.6 | 81.3 |
| | AADP-PSSM [51] | 83.3 | 78.1 | 76.3 | 54.4 | 72.9 |
| | Zhang et al. [30] | 92.4 | 87.4 | 82.0 | 71.0 | 83.2 |
| | Ding et al. [52] | 93.7 | 84.0 | 83.5 | 66.4 | 82.0 |
| | Xia et al. [53] | 95.6 | 81.0 | 78.9 | 71.9 | 80.1 |
| | AAC-PSSM-AC [54] | 80.7 | 86.4 | 81.4 | 45.2 | 74.6 |
| | This paper | 96.4 | 92.9 | 82.0 | 78.4 | 87.1 |
To obtain the PseAA composition and long-range structural properties of the PseAAs, we transformed a reduced protein sequence into a numerical sequence of the given element and analyzed the probability distribution of its interval distances. However, the composition and long-range structural properties of the PseAAs rely heavily on the length k of the given word w_k. We therefore selected the word length with a simple grid search over k from 1 to 5, based on 10-fold cross-validation on the four datasets. Fig. 2 shows the prediction accuracy of the structural properties of the PseAAs for different word lengths. Although a t-test indicates that the differences in prediction accuracy for k from 3 to 5 are not statistically significant, the overall prediction accuracies show two clear trends: (i) they increase from k = 1 to k = 3 for all four datasets, and the structural properties of the PseAAs achieve their best performance on all four at k = 3; (ii) the discriminative power of the structural properties of the PseAAs declines as k grows from 3 to 5, because the high dimension of the frequency vectors relative to the short length of the sequences makes the feature vectors very sparse. Therefore, we selected words of length 3 to construct the PseAA composition and long-range structural properties of the PseAAs.

3.4. Influence of parameters in LcSPseAA

A feature of the proposed method is that the local structural properties of the PseAAs were constructed upon total structural
Fig. 2. Prediction accuracy of the structural properties of PseAAs with different word lengths, where the word length is selected from 1 to 5 based on 10-fold cross-validation for four data sets.
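The sparsity argument behind the choice of k = 3 can be illustrated with a minimal Python sketch. It assumes a three-letter reduced alphabet (I, E, A); the residue-to-group mapping in `GROUPS` and all function names are hypothetical illustrations, not the authors' implementation. The sketch counts k-words, lists the interval distances between successive occurrences of a word, and reports the fraction of the 3^k possible words that never occur in a sequence.

```python
from collections import Counter
from itertools import product

# Hypothetical residue grouping (an assumption for illustration only).
GROUPS = {"I": "FILMV", "E": "DEHKNQR", "A": "ACGPSTWY"}
REDUCE = {aa: g for g, members in GROUPS.items() for aa in members}


def reduce_sequence(seq):
    """Map each residue to its group letter: I, E or A."""
    return "".join(REDUCE[aa] for aa in seq)


def kword_frequencies(reduced, k):
    """Relative frequency of each of the 3**k possible k-words."""
    n = len(reduced) - k + 1
    if n <= 0:
        raise ValueError("sequence is shorter than the word length")
    counts = Counter(reduced[i:i + k] for i in range(n))
    words = ("".join(w) for w in product("IEA", repeat=k))
    return {w: counts[w] / n for w in words}


def interval_distances(reduced, word):
    """Distances between successive occurrences of `word` (overlaps allowed)."""
    k = len(word)
    pos = [i for i in range(len(reduced) - k + 1) if reduced[i:i + k] == word]
    return [b - a for a, b in zip(pos, pos[1:])]


def zero_fraction(reduced, k):
    """Fraction of the possible k-words that never occur in the sequence."""
    freqs = kword_frequencies(reduced, k)
    return sum(v == 0 for v in freqs.values()) / len(freqs)


if __name__ == "__main__":
    # Toy sequence, not taken from any of the four datasets.
    reduced = reduce_sequence("MKVLAAGIDEFKRSTWYHNQPC" * 5)
    for k in range(1, 6):
        print(k, 3 ** k, round(zero_fraction(reduced, k), 3))
```

The number of possible words grows from 3 at k = 1 to 243 at k = 5, so for sequences of modest length most entries of the frequency vector are zero at large k, which is consistent with the decline described in the text.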
correlation of m-window PseAAs. This correlation describes the structural relationship among m contiguous residues along a protein chain and contributes to protein structural class prediction. However, the size of the LcSPseAA vector depends on the m-window. To understand the influence of the m-window, we selected the window size with a simple grid search over LcSPseAA vector sizes from 5 to 20, based on 10-fold cross-validation on the four datasets. Fig. 3 shows the prediction accuracy of the proposed method for different window sizes. Taking a closer look at Fig. 3, we found that the performance of the proposed method varies with the size of the LcSPseAA vector, and that the accuracy changes in a similar way for the datasets 25PDB, D640, FC699 and 1189. As would be expected, the overall prediction accuracy increases as the size of the LcSPseAA vector grows from 5 to 10 for all four datasets, and the proposed method achieves its best performance on all four when the size equals 10. Beyond that, the overall prediction accuracy worsens as the LcSPseAA size increases. Therefore, we selected an LcSPseAA size of 10 to construct the local structural properties of the PseAAs. In addition, the local structural properties of the PseAAs rely heavily on the correlation function Q(Ri,Rj), which is defined on the hydrophobicity of the amino acids. Here, we bypassed the hydrophobicity values of the amino acids and assigned values according to their groups: internal group (I), external group (E) and
Fig. 3. Prediction accuracy of the proposed method with different window sizes, where the window size of the LcSPseAA vector is from 5 to 20 based on 10-fold cross-validation for four data sets.
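The parameter selection behind Fig. 3 (and the word-length selection in Section 3.3) follows the same pattern: evaluate each candidate value by 10-fold cross-validation and keep the best one. A minimal Python sketch of that pattern is given below. The `evaluate` callback, which would train a classifier on the training indices with the given parameter and return its accuracy on the test indices, is a hypothetical placeholder; the toy scorer in the demo is not real data.

```python
import random


def k_fold_indices(n_samples, n_folds=10, seed=0):
    """Shuffle the sample indices and split them into n_folds disjoint folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[i::n_folds] for i in range(n_folds)]


def cross_val_accuracy(param, n_samples, evaluate, n_folds=10, seed=0):
    """Mean accuracy returned by `evaluate` over all folds for one parameter."""
    folds = k_fold_indices(n_samples, n_folds, seed)
    scores = []
    for i, test_idx in enumerate(folds):
        # Training set: every fold except the one held out for testing.
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        scores.append(evaluate(param, train_idx, test_idx))
    return sum(scores) / len(scores)


def grid_search(candidates, n_samples, evaluate, n_folds=10, seed=0):
    """Return the best candidate and the cross-validated score of each one."""
    scores = {
        p: cross_val_accuracy(p, n_samples, evaluate, n_folds, seed)
        for p in candidates
    }
    best = max(scores, key=scores.get)
    return best, scores


if __name__ == "__main__":
    # Toy scorer that peaks at a window size of 10 (illustrative only).
    def toy_evaluate(m, train_idx, test_idx):
        return 1.0 - abs(m - 10) / 100

    best, scores = grid_search(range(5, 21), n_samples=100, evaluate=toy_evaluate)
    print(best)
```

Keeping the fold assignment fixed across candidates (same `seed`) means every parameter value is scored on the same splits, which makes the comparison between window sizes fairer.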
ambivalent group (A). To evaluate the influence of these assignments, we adopted two assignment schemes, IEA[1, 1, 0] and IEA[0, 1, 1], to calculate the local structural properties of the PseAAs. In IEA[1, 1, 0], an amino acid in the internal group (I), external group (E) or ambivalent group (A) is assigned the hydrophobicity value 1, 1 or 0, respectively; in IEA[0, 1, 1], the values are 0, 1 and 1, respectively. That is, IEA[1, 1, 0] treats the internal and external groups alike, whereas IEA[0, 1, 1] treats the external and ambivalent groups alike. Fig. 4 compares the overall accuracies of the proposed method under the two schemes. IEA[0, 1, 1] outperforms IEA[1, 1, 0] on datasets 25PDB, D640, FC699 and 1189. These results suggest that the external group behaves similarly to the ambivalent group and that both are distinct from the internal group in the proposed local structural properties of the PseAAs.

3.5. Comparison of accuracies between different classification algorithms

This work uses the support vector machine (SVM) as the classifier. To evaluate its performance, we also tested two other classifiers implemented in the Matlab toolbox: the k-nearest neighbor algorithm (K-NN) with K = 9 and a back-propagation (BP) neural network with 9 neurons in the hidden layer. For each
Fig. 4. Comparison of overall accuracies of the proposed method based on the schemes IEA[1, 1, 0] and IEA[0, 1, 1]. In IEA[1, 1, 0], an amino acid in the internal group (I), external group (E) or ambivalent group (A) is assigned the hydrophobicity value 1, 1 or 0, respectively; in IEA[0, 1, 1], the values are 0, 1 and 1, respectively.
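The two assignment schemes compared in Fig. 4 can be made concrete with a short Python sketch. Two things are assumptions and not taken from this section: the membership of the I, E and A groups in `GROUPS`, and the form of the correlation function, which is taken here as the squared difference of the assigned values, averaged over all residue pairs at each lag from 1 to m. The actual definition of Q(Ri,Rj) used by the method may differ.

```python
# Hypothetical residue grouping (assumption; members are not listed here).
GROUPS = {"I": "FILMV", "E": "DEHKNQR", "A": "ACGPSTWY"}
GROUP_OF = {aa: g for g, members in GROUPS.items() for aa in members}

# Values assigned to the (I, E, A) groups under the two schemes of Fig. 4.
SCHEMES = {
    "IEA[1, 1, 0]": {"I": 1, "E": 1, "A": 0},
    "IEA[0, 1, 1]": {"I": 0, "E": 1, "A": 1},
}


def encode(seq, scheme):
    """Replace each residue by the value its group receives under `scheme`."""
    values = SCHEMES[scheme]
    return [values[GROUP_OF[aa]] for aa in seq]


def lag_correlations(values, m):
    """Assumed correlation: mean of (v_i - v_{i+j})**2 for each lag j = 1..m."""
    if len(values) <= m:
        raise ValueError("sequence must be longer than the window size m")
    result = []
    for j in range(1, m + 1):
        diffs = [(values[i] - values[i + j]) ** 2 for i in range(len(values) - j)]
        result.append(sum(diffs) / len(diffs))
    return result


if __name__ == "__main__":
    seq = "MKVLAAGIDEFKRSTWYHNQPC"  # toy sequence, not from the datasets
    for name in SCHEMES:
        print(name, [round(c, 2) for c in lag_correlations(encode(seq, name), 5)])
```

With binary values, each entry is simply the fraction of residue pairs at that lag whose groups are told apart by the scheme, so the two schemes differ only in which pair of groups they merge.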
Table 4 Comparison of accuracies between different classification algorithms. Prediction accuracy is given in %.

| Dataset | Method | All-α | All-β | α/β | α+β | Overall |
|---|---|---|---|---|---|---|
| 25PDB | SVM | 95.0 | 91.4 | 77.5 | 88.7 | 88.8 |
| | K-NN | 90.7 | 78.8 | 77.7 | 70.5 | 79.6 |
| | BP-NN | 84.9 | 77.2 | 76.0 | 65.9 | 76.0 |
| D640 | SVM | 95.7 | 89.6 | 89.3 | 90.1 | 90.9 |
| | K-NN | 91.3 | 73.4 | 88.1 | 68.4 | 80.0 |
| | BP-NN | 60.9 | 64.9 | 84.7 | 48.0 | 65.0 |
| FC699 | SVM | 98.5 | 98.1 | 97.6 | 81.7 | 96.5 |
| | K-NN | 99.2 | 90.3 | 94.2 | 57.3 | 90.2 |
| | BP-NN | 23.1 | 91.4 | 94.7 | 67.1 | 80.2 |
| 1189 | SVM | 96.4 | 92.9 | 82.0 | 78.4 | 87.1 |
| | K-NN | 92.4 | 84.0 | 81.1 | 57.7 | 79.0 |
| | BP-NN | 34.9 | 77.6 | 83.8 | 25.7 | 59.3 |
classifier, the 45-dimension vector was used to represent the combined information of the given proteins. All experiments were performed using the jackknife test, and the overall classification accuracies as well as the accuracies for each structural class are listed in Table 4. Table 4 shows that the SVM performs best among the three classifiers. The average overall accuracy of the SVM over the four datasets is 90.8%, which is 8.6–20.7% higher than that of the other two classifiers. This indicates that the SVM is more suitable for prediction of protein structural classes based on the PseAA structural properties and secondary structural patterns.
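The averages quoted above can be reproduced from the "Overall" column of Table 4. The following Python snippet is a small worked check; the dictionary layout and variable names are illustrative, and the numbers are transcribed from the table.

```python
# Overall accuracies (%) from Table 4, in the order 25PDB, D640, FC699, 1189.
overall = {
    "SVM": [88.8, 90.9, 96.5, 87.1],
    "K-NN": [79.6, 80.0, 90.2, 79.0],
    "BP-NN": [76.0, 65.0, 80.2, 59.3],
}

# Unweighted mean over the four datasets for each classifier.
means = {name: sum(acc) / len(acc) for name, acc in overall.items()}

# Margin of the SVM over each of the other two classifiers.
margins = {
    name: means["SVM"] - mean for name, mean in means.items() if name != "SVM"
}

if __name__ == "__main__":
    for name, mean in means.items():
        print(f"{name}: {mean:.1f}")
    for name, margin in margins.items():
        print(f"SVM - {name}: {margin:.1f}")
```

The SVM mean is 363.3 / 4 ≈ 90.8, the K-NN mean is 82.2 and the BP-NN mean is about 70.1, which gives the margins of roughly 8.6 and 20.7 reported in the text.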
4. Conclusion

Prediction of structural classes for low-homology datasets not only reveals the overall folding type of a given protein sequence, but also helps in finding proteins that form similar folds in spite of low sequence similarity. High-quality prediction would therefore benefit tertiary structure prediction for proteins that share low sequence identity with the sequences used for prediction. Numerous efficient methods have been proposed to predict protein structural classes for low-homology sequences, but challenges remain. In this paper, we proposed a novel scheme to improve prediction accuracy, which explores a potential way to capture the structural properties of the PseAAs and the secondary structural patterns. To do so, we first transformed a protein sequence into a reduced protein sequence according to the hydrophobicity of the amino acids, and described the distribution, long-range and local structural properties of the PseAAs with the proposed statistical algorithm. Rather than focusing on secondary structural elements or segments, we then analyzed the distribution of secondary structural patterns that are associated with compact structural domains.

The main goal of this paper is to provide a new statistical method to explore the structural properties of the PseAAs and the secondary structural patterns. The contribution of the proposed method can be deduced from its comparison with competing prediction methods. Its overall accuracies for datasets 25PDB, D640, FC699 and 1189 are 88.8%, 90.9%, 96.4% and 87.4%, which are 4.5%, 7.6%, 2% and 3.9% higher than the existing best-performing method. We attribute the higher overall accuracies of the proposed method to the novel PseAA structural properties and secondary structural patterns. Thus, this understanding can be used to guide development of more powerful methods for protein structural class prediction.
Acknowledgments

We thank the referees for many valuable comments that have improved this manuscript. This work is supported by the National Natural Science Foundation of China (61370015, 61170316, and 61003191).
References

[1] P. Klein, C. Delisi, Prediction of protein structural class from the amino-acid sequence, Biopolymers 25 (1986) 1659–1672.
[2] K.C. Chou, Structural bioinformatics and its impact to biomedical science and drug discovery, Front. Med. Chem. 3 (2006) 455–502.
[3] M. Levitt, C. Chothia, Structural patterns in globular proteins, Nature 261 (1976) 552–558.
[4] A. Andreeva, D. Howorth, S.E. Brenner, T.J. Hubbard, C. Chothia, A.G. Murzin, SCOP database in 2004: refinements integrate structure and sequence family data, Nucleic. Acids. Res. 32 (2004) D226–D229.
[5] A.G. Murzin, S.E. Brenner, T. Hubbard, C. Chothia, SCOP: a structural classification of protein database for the investigation of sequence and structures, J. Mol. Biol. 247 (1995) 536–540.
[6] P. Ferragina, R. Giancarlo, V. Greco, G. Manzini, G. Valiente, Compression-based classification of biological sequences and structures via the universal similarity metric: experimental assessment, BMC Bioinf. 8 (2007) 252.
[7] Q. Dai, T.M. Wang, Comparison study on k-word statistical measures for protein: from sequence to 'sequence space', BMC Bioinf. 9 (2008) 394.
[8] C. Chen, Y. Tian, X. Zou, P. Cai, J. Mo, Using pseudo-amino acid composition and support vector machine to predict protein structural class, J. Theor. Biol. 243 (2006) 444–448.
[9] K. Chou, Review: prediction of protein structural classes and subcellular locations, Curr. Protein. Pept. Sci. 1 (2000) 171–208.
[10] K.D. Kedarisetti, L.A. Kurgan, S. Dick, Classifier ensembles for protein structural class prediction with varying homology, Biochem. Biophys. Res. Commun. 348 (2006) 981–988.
[11] M.J. Mizianty, L. Kurgan, Modular prediction of protein structural classes from sequences of twilight-zone identity with predicting sequences, BMC Bioinf. 10 (2009) 414–438.
[12] P. Klein, C. Delisi, Prediction of protein structural class from amino acid sequence, Biopolymers 25 (1986) 1659–1672.
[13] K. Chou, A key driving force in determination of protein structural classes, Biochem. Biophys. Res. Commun. 264 (1999) 216–224.
[14] T.L. Zhang, Y.S. Ding, K. Chou, Prediction protein structural classes with pseudo-amino acid composition: approximate entropy and hydrophobicity pattern, J. Theor. Biol. 250 (2008) 186–193.
[15] R.Y. Luo, Z.P. Feng, J.K. Liu, Prediction of protein structural class by amino acid and polypeptide composition, Eur. J. Biochem. 269 (2002) 4219–4225.
[16] X.D. Sun, R.B. Huang, Prediction of protein structural classes using support vector machines, Amino Acids 30 (2006) 469–475.
[17] Y.S. Ding, T.L. Zhang, K.C. Chou, Prediction of protein structure classes with pseudo amino acid composition and fuzzy support vector machine network, Protein. Pept. Lett. 14 (2007) 811–815.
[18] K. Chou, Y. Cai, Prediction of protein subcellular locations by GO-FunD-PseAA predictor, Biochem. Biophys. Res. Commun. 321 (2004) 1007–1009.
[19] K. Chen, L.A. Kurgan, J. Ruan, Prediction of protein structural class using novel evolutionary collocation based sequence representation, J. Comput. Chem. 29 (2008) 1596–1604.
[20] K.C. Chou, Prediction of protein cellular attributes using pseudo amino acid composition, Proteins Struct. Funct. Genet. 43 (2001) 246–255.
[21] H.B. Shen, K.C. Chou, PseAAC: a flexible web-server for generating various kinds of protein pseudo amino acid composition, Anal. Biochem. 373 (2008) 386–388.
[22] H. Liu, M. Wang, K.C. Chou, Low-frequency Fourier spectrum for predicting membrane protein types, Biochem. Biophys. Res. Commun. 336 (2005) 737–739.
[23] X. Xiao, S.H. Shao, Z.D. Huang, K. Chou, Using pseudo amino acid composition to predict protein structural classes: approached with complexity measure factor, J. Comput. Chem. 27 (2006) 478–482.
[24] H. Lin, Q.Z. Li, Using pseudo amino acid composition to predict protein structural class: approached by incorporating 400 dipeptide components, J. Comput. Chem. 28 (2007) 1463–1466.
[25] X. Xiao, W.Z. Lin, K.C. Chou, Using grey dynamic modeling and pseudo amino acid composition to predict protein structural classes, J. Comput. Chem. 29 (2008) 2018–2024.
[26] L.A. Kurgan, L. Homaeian, Prediction of structural classes for protein sequences and domains-Impact of prediction algorithms, sequence representation and homology, and test procedures on accuracy, Pattern Recogn. 39 (2006) 2323–2343.
[27] L. Kurgan, K. Cios, K. Chen, SCPRED: accurate prediction of protein structural class for sequences of twilight-zone similarity with predicting sequences, BMC Bioinf. 9 (2008) 226–240.
[28] C. Zheng, L. Kurgan, Prediction of beta-turns at over 80% accuracy based on an ensemble of predicted secondary structures and multiple alignments, BMC Bioinf. 9 (2008) 430.
[29] T. Liu, C.Z. Jia, A high-accuracy protein structural class prediction algorithm using predicted secondary structural information, J. Theor. Biol. 267 (3) (2010) 272–275.
[30] S.L. Zhang, S.Y. Ding, T.M. Wang, High-accuracy prediction of protein structural class for low-similarity sequences based on predicted secondary structure, Biochimie 93 (2011) 710–714.
[31] U. Hobohm, C. Sander, Enlarged representative set of protein structures, Protein Sci. 3 (1994) 522–524.
[32] L. Kurgan, K. Chen, Prediction of protein structural class for the twilight zone sequences, Biochem. Biophys. Res. Commun. 357 (2007) 453–460.
[33] N. Liu, T. Wang, Protein-based phylogenetic analysis by using hydropathy profile of amino acids, FEBS. Lett. 580 (2006) 5321–5327.
[34] K.C. Chou, Y.D. Cai, Predicting protein quaternary structure by pseudo amino acid composition, Proteins Struct. Funct. Genet. 53 (2003) 282–289.
[35] K.C. Chou, Y.D. Cai, Predicting subcellular localization of proteins by hybridizing functional domain composition and pseudo-amino acid composition, J. Cell. Biochem. 91 (2004) 1197–1203.
[36] H.B. Shen, K.C. Chou, Using optimized evidence-theoretic K-nearest neighbor classifier and pseudo-amino acid composition to predict membrane protein types, Biochem. Biophys. Res. Commun. 334 (2005) 288–292.
[37] H.B. Shen, K.C. Chou, Predicting protein subnuclear location with optimized evidence-theoretic K-nearest classifier and pseudo amino acid composition, Biochem. Biophys. Res. Commun. 337 (2005) 752–756.
[38] K.C. Chou, Using amphiphilic pseudo amino acid composition to predict enzyme subfamily classes, Bioinformatics 21 (2005) 10–19.
[39] S.W. Zhang, Q. Pan, H.C. Zhang, Z.C. Shao, J.Y. Shi, Prediction of protein homo-oligomer types by pseudo amino acid composition: approached with an improved feature extraction and naive Bayes feature fusion, Amino Acids 30 (2006) 461–468.
[40] D.B. Wetlaufer, Nucleation, rapid folding, and globular intrachain regions in proteins, Proc. Natl. Acad. Sci. USA 70 (1973) 697–701.
[41] J.S. Richardson, The anatomy and taxonomy of protein structure, Adv. Protein. Chem. 34 (1981) 167–339.
[42] D.T. Jones, Protein secondary structure prediction based on position specific scoring matrices, J. Mol. Biol. 292 (1999) 195–202.
[43] L.F. Luo, X.Q. Li, Recognition and architecture of the framework structure of protein, Proteins 39 (2000) 9–25.
[44] V. Vapnik, The Nature of Statistical Learning Theory, Springer Verlag, 2000.
[45] Y. Cai, X. Liu, X. Xu, K. Chou, Prediction of protein structural classes by support vector machines, Comput. Chem. 26 (2002) 293–296.
[46] K. Chou, H. Shen, Recent progress in protein subcellular location prediction, Anal. Biochem. 370 (2007) 1–16.
[47] Z. Yuan, T.L. Bailey, R.D. Teasdale, Prediction of protein B-factor profiles, Proteins 58 (2005) 905–912.
[48] M.J. Mizianty, L. Kurgan, Meta prediction of protein crystallization propensity, Biochem. Biophys. Res. Commun. 390 (2009) 10–15.
[49] J.Y. Yang, Z.L. Peng, X. Chen, Prediction of protein structural classes for low-homology sequences based on predicted secondary structure, BMC Bioinf. 11 (2010) S9.
[50] Q. Dai, L. Wu, L.H. Li, Improving protein structural class prediction using novel combined sequence information and predicted secondary structural features, J. Comput. Chem. 32 (2011) 3393–3398.
[51] T.G. Liu, X.Q. Zheng, J. Wang, Prediction of protein structural class for low-similarity sequences, Biochimie 92 (2010) 1330–1334.
[52] S.Y. Ding, S.L. Zhang, Y. Li, T.M. Wang, A novel protein structural classes prediction method based on predicted secondary structure, Biochimie 94 (2012) 1166–1171.
[53] X.Y. Xia, M. Ge, Z.X. Wang, X.M. Pan, Accurate prediction of protein structural class, PLoS One 7 (2012) e37653.
[54] T.G. Liu, X.B. Geng, X.Q. Zheng, R.S. Li, J. Wang, Accurate prediction of protein structural class using auto covariance transformation of PSI-BLAST profiles, Amino Acids 42 (2012) 2243–2249.