Discriminating protein structure classes by incorporating Pseudo Average Chemical Shift to Chou's general PseAAC and Support Vector Machine


ARTICLE IN PRESS

COMM-3813; No. of Pages 9

Computer Methods and Programs in Biomedicine xxx (2014) xxx–xxx

journal homepage: www.intl.elsevierhealth.com/journals/cmpb

Discriminating protein structure classes by incorporating Pseudo Average Chemical Shift to Chou’s general PseAAC and Support Vector Machine

Maqsood Hayat ∗, Nadeem Iqbal

Department of Computer Science, Abdul Wali Khan University Mardan, Pakistan

Article info

Article history:
Received 13 February 2014
Received in revised form 9 June 2014
Accepted 13 June 2014

Keywords:
SVM
Protein structure classes
PseAA composition
Pseudo Average Chemical Shift

Abstract

Proteins control all biological functions in living species. Protein structure comprises four major classes: all-α, all-β, α+β, and α/β. Each class performs a different function according to its nature. Owing to the rapid growth of protein sequences in the databanks, identifying protein structure classes through conventional methods is difficult with respect to cost and time. Given the importance of protein structure classes, it is highly desirable to develop a computational model that discriminates them with high accuracy. For this purpose, we propose an in silico method incorporating Pseudo Average Chemical Shift and Support Vector Machine. Two feature extraction schemes, namely Pseudo Amino Acid Composition and Pseudo Average Chemical Shift, are used to explore valuable information from protein sequences. The performance of the proposed model is assessed on four benchmark datasets, 25PDB, 1189, 640, and 399, using the jackknife test. The success rates of the proposed model are 84.2%, 85.0%, 86.4%, and 89.2%, respectively, on the four datasets. The empirical results show that the proposed model compares favorably with existing models in the literature and might be useful for future research.

© 2014 Published by Elsevier Ireland Ltd.

1. Introduction

Proteins perform various functions in living organisms. The functions of a protein are closely associated with its structure [4]. The structure of a protein is determined by the behavior and spatial position of its amino acids. However, protein structure prediction is based entirely on the folding patterns of already known protein structures. Protein structures are categorized into four main classes, all-α, all-β, α+β, and α/β, according to the nature and organization of their secondary structural elements. The all-α class comprises helices, whereas the all-β class contains strands. The other two classes combine α helices and β strands: the α+β class is composed of anti-parallel β strands, whereas the α/β class is composed of parallel β strands. The prediction of protein structure classes is essential and is useful for studying and annotating protein function, regulation, and interactions [6]. For this purpose, considerable effort has been made, but critical challenges still exist in developing

∗

Corresponding author. Tel.: +92 937 542194; fax: +92 937 542194. E-mail addresses: [email protected], [email protected] (M. Hayat). http://dx.doi.org/10.1016/j.cmpb.2014.06.007 0169-2607/© 2014 Published by Elsevier Ireland Ltd.

Please cite this article in press as: M. Hayat, N. Iqbal, Discriminating protein structure classes by incorporating Pseudo Average Chemical Shift to Chou’s general PseAAC and Support Vector Machine, Comput. Methods Programs Biomed. (2014), http://dx.doi.org/10.1016/j.cmpb.2014.06.007


automated methods that can determine protein structure quickly and accurately. Therefore, a robust, reliable, and computationally intelligent model is required for identifying the structural class of a novel protein from its primary sequence. Various algorithms have been proposed for the prediction of protein structure classes since the 1980s. In this connection, numerous investigators have applied different protein sequence representation techniques for encoding protein sequences to extract distinct information. These include amino acid composition [17,67], dipeptide composition [61,78], Pseudo amino acid composition [19,55,85,93,94], function domain composition [22], amino acid sequence reverse encoding [27,64], Position Specific Scoring Matrix profile [44], evolutionary features, and PSI-Blast profiles [11]. Different learning algorithms have been utilized to predict protein structure classes more accurately, including artificial neural network [8], fuzzy clustering [75], Support Vector Machine [3,8,43,71], Bayesian classification [81], and ensemble classification [7,45,92]. The presence of homologous protein sequences strongly affects the performance of classification algorithms: when the similarity between the training and testing datasets is high, performance is overestimated, whereas when the similarity is low, performance is underestimated. Kurgan and Homaeian addressed the problem of varying similarity using ensemble classification [50]. Further, a series of efforts have been made to improve the prediction outcome of classification algorithms on low-similarity datasets.

To develop a useful computational model, several steps are essential, as described in a comprehensive review [21] and other literature [14,58,72,84]: the first step is to select a valid benchmark dataset; the second is to represent the instances with an effective formula; the third is to introduce a hypothesis for prediction; the fourth is to apply statistical cross-validation tests; and the final step is to establish a user-friendly web server for public access. To enhance the success rates of classification algorithms on low-similarity datasets, we propose an accurate and robust classification model for the prediction of protein structure classes. The model is designed using SVM in conjunction with Pseudo Average Chemical Shift features. The performance of the learning algorithm is evaluated on four low-similarity benchmark datasets using the jackknife cross-validation test, one of the most rigorous tests, which yields a unique result for a given dataset. The rest of the article is organized as follows. Section 2 presents the materials and methods, including the datasets, feature extraction methods, classification algorithm, and evaluation methods; Section 3 presents the results and discussion; and conclusions are drawn in the last section.

2. Materials and methods

2.1. Datasets

Domain-related benchmark datasets are always required for developing a robust and intelligent prediction system. For this purpose, we used four low-similarity benchmark datasets, which have been employed by many investigators to evaluate their automated models [49,64,88]. The datasets 25PDB and 1189 were downloaded from the RCSB Protein Data Bank [47,48,50]. Dataset 25PDB includes only protein sequences that have about 25% sequence identity; it comprises 1673 protein sequences. The second dataset, 1189, contains 1092 protein sequences with less than 40% sequence identity. The third dataset, referred to as 640 [11], contains 640 protein sequences with 25% sequence identity. The fourth dataset contains 399 protein sequences with less than 15% sequence identity [53]. The number of instances in each dataset is given in Table 1.

Table 1 – Number of instances in the given datasets.

Structure class   25PDB   1189   640   399
All-α             443     223    138   124
All-β             443     294    154   112
α/β               346     334    177   163 (mixed)
α+β               441     241    171
Total             1673    1092   640   399
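As a quick consistency check, the class counts in Table 1 can be summed and compared with the dataset sizes quoted in the text. The sketch below is illustrative only; the dictionary layout and the names `TABLE_1` and `dataset_sizes` are assumptions introduced here, and the counts are transcribed from Table 1.

```python
# Class counts transcribed from Table 1. Dataset 399 has three classes,
# the third of which is labelled "mixed" in the table.
TABLE_1 = {
    "25PDB": {"all-alpha": 443, "all-beta": 443, "alpha/beta": 346, "alpha+beta": 441},
    "1189": {"all-alpha": 223, "all-beta": 294, "alpha/beta": 334, "alpha+beta": 241},
    "640": {"all-alpha": 138, "all-beta": 154, "alpha/beta": 177, "alpha+beta": 171},
    "399": {"all-alpha": 124, "all-beta": 112, "mixed": 163},
}


def dataset_sizes(table):
    """Sum the per-class counts of every dataset."""
    return {name: sum(classes.values()) for name, classes in table.items()}


SIZES = dataset_sizes(TABLE_1)
print(SIZES)
```

The sums agree with the totals stated for each dataset (1673, 1092, 640, and 399 sequences).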

2.2. Feature extraction methods

The primary structure of a protein is a polymer of amino acids, which is formed and folded according to the attributes of the amino acids. These attributes are vital for the recognition of protein structure classes. Therefore, an efficient feature extraction method is required to extract significant information from the sequences of the protein structure classes. For this purpose, we used two feature extraction methods, namely PseAA composition and Pseudo Average Chemical Shift.

2.3. Pseudo Amino Acid Composition

The primary structure of proteins is a combination of 20 amino acids, so the simple amino acid composition explores only 20 numerical attributes [15–17,67]:

$$P = [p_1, p_2, p_3, \ldots, p_{20}]^T \tag{1}$$

where $p_1$ is the relative frequency of amino acid A, $p_2$ is the relative frequency of amino acid C, $p_{20}$ is the relative frequency of amino acid Y, $P$ is the composition vector of all amino acids, and $T$ denotes the transpose. However, simple amino acid composition only computes the relative frequency of each amino acid and does not retain information about the sequence order or the sequence length. Sometimes frequency information alone is not adequate for identifying protein sequences. Therefore, to extract sequence order and other essential information hidden in protein sequences, Chou introduced the concept of Pseudo Amino Acid Composition (PseAAC) [18,20], which combines frequency information with sequence arrangement information [18]. The concept of PseAAC has been intensively applied in almost all fields of computational proteomics, such as predicting GABA(A) receptor proteins [66], subcellular


Fig. 1 – Illustrates (1) the ﬁrst-rank of correlation factor, (2) the second-rank and (3) the third-rank of correlation factor. Adapted from Chou [21], with permission.

localizations [56], identifying bacterial virulent proteins [69], predicting supersecondary structure [97], predicting protein structural classes [73], predicting protein submitochondria locations [68], and predicting membrane protein types [45]. In addition, a web server was developed for PseAAC [38], and recently three powerful open-access software packages, ‘PseAAC-Builder’ [32], ‘propy’ [9], and ‘PseAA CompositionGeneral’ [31], were developed for generating various modes of Chou’s special PseAA composition. Pseudo k-tuple nucleotide composition was also used for identifying recombination spots and nucleosome positioning sequences [13,40]. Moreover, PseAA composition has been applied to the prediction of cell-wall lytic enzymes [30]. PseAA composition can be represented as

$$P = [p_1, \ldots, p_{20}, p_{20+1}, \ldots, p_{20+\lambda}, \ldots]^T \tag{2}$$

The first 20 components are the simple amino acid composition and the remaining components are the correlation factors of amino acids determined on the basis of hydrophobicity and hydrophilicity [18]:

$$
\begin{cases}
\theta_1 = \dfrac{1}{L-1} \displaystyle\sum_{i=1}^{L-1} J_{i,i+1} \\[2ex]
\theta_2 = \dfrac{1}{L-2} \displaystyle\sum_{i=1}^{L-2} J_{i,i+2} \\[2ex]
\theta_3 = \dfrac{1}{L-3} \displaystyle\sum_{i=1}^{L-3} J_{i,i+3} \\[1ex]
\quad\vdots \\[1ex]
\theta_\lambda = \dfrac{1}{L-\lambda} \displaystyle\sum_{i=1}^{L-\lambda} J_{i,i+\lambda}
\end{cases}
\tag{3}
$$

where $L$ is the length of the protein sequence, $\theta_1$ is the first-rank correlation factor, $\theta_2$ is the second-rank correlation factor, and $\theta_\lambda$ is the last-rank correlation factor, as illustrated in Fig. 1. We selected $\lambda = 30$, which means that the first 30 ranks of sequence-order correlation are taken into consideration. In this work, we utilized two physicochemical properties of amino acids, hydrophobicity and hydrophilicity. Thus, the feature vector has (20 + 60) = 80 dimensions.
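The construction of Eqs. (1)–(3) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the coupling term $J_{i,i+k}$ is not defined in the text, so the sketch assumes the product of the property values of the two residues; the weighting and normalization used in published PseAAC variants are omitted; and `TOY_SCALE` is a hypothetical property table, not a real hydrophobicity or hydrophilicity scale.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 residues in alphabetical order (A, C, ..., Y)


def amino_acid_composition(seq):
    """Eq. (1): relative frequency of each of the 20 amino acids."""
    counts = Counter(seq)
    return [counts[aa] / len(seq) for aa in AMINO_ACIDS]


def correlation_factors(seq, prop, lam):
    """Eq. (3): rank-1 to rank-lam correlation factors for one property.

    ASSUMPTION: J(i, i+k) is taken as prop[R_i] * prop[R_(i+k)].
    """
    values = [prop[aa] for aa in seq]
    length = len(values)
    if lam >= length:
        raise ValueError("lam must be smaller than the sequence length")
    return [
        sum(values[i] * values[i + k] for i in range(length - k)) / (length - k)
        for k in range(1, lam + 1)
    ]


def pseaac_sketch(seq, props, lam):
    """Concatenate the 20 frequencies with lam factors per property."""
    features = amino_acid_composition(seq)
    for prop in props:
        features.extend(correlation_factors(seq, prop, lam))
    return features


# Hypothetical toy property scale, used only to exercise the functions.
TOY_SCALE = {aa: 0.0 for aa in AMINO_ACIDS}
TOY_SCALE.update({"A": 1.0, "C": -1.0})

print(correlation_factors("ACAC", TOY_SCALE, 2))
```

With two property tables and `lam = 30`, `pseaac_sketch` returns 20 + 2 × 30 = 80 values, matching the dimension stated above.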

2.4. Pseudo Average Chemical Shift features

In molecules, the electron density varies according to the types of nuclei and bonds. Protons are sensitive to their chemical environment: the electrons circulating around a proton generate their own magnetic field, which alters the external field experienced by the proton. Protons therefore absorb at slightly different frequencies in different chemical environments. The chemical shift is the resonant frequency of a proton relative to a standard. It is one of the most important parameters measured by NMR spectroscopy, and it also works as a good indicator of local conformation. The chemical shifts of backbone atoms in proteins are tightly coupled with backbone dihedral angles and secondary structure types [60,77,82]. Numerous researchers have shown that the Average Chemical Shift of a specific nucleus in the protein backbone correlates well with its secondary structure [63,76,95]. Information about protein structure is essential for protein representation. Therefore, the Average Chemical Shift can be an efficient parameter for expressing information about protein secondary structure, and it has been utilized to enhance the discrimination power for various protein subcellular locations and other computational proteomics problems [33,34]. For calculating Pseudo Average Chemical Shift features, we obtained each protein’s secondary structure using Porter [70], which can be accessed online at http://distill.ucd.ie/porter/. Let A be a protein sequence of L residues, represented by P. Given the secondary structure of A, each amino acid in A is substituted by its Average Chemical Shift, so that P can be represented as:

$$P = [A^i_1, A^i_2, A^i_3, \ldots, A^i_L] \quad (i = {}^{15}\mathrm{N},\ {}^{13}\mathrm{C}_\alpha,\ {}^{1}\mathrm{H}_\alpha,\ {}^{1}\mathrm{H}_N) \tag{4}$$

where N stands for nitrogen, $\mathrm{C}_\alpha$ for the alpha carbon, $\mathrm{H}_\alpha$ for the alpha hydrogen, and $\mathrm{H}_N$ for the hydrogen bonded to nitrogen.


After substitution by the Average Chemical Shift, the protein sequence is expressed as follows:

$$P_{\mathrm{PseACS}} = [\varphi^i_0, \varphi^i_1, \varphi^i_2, \ldots, \varphi^i_\lambda] \quad (i = {}^{15}\mathrm{N},\ {}^{13}\mathrm{C}_\alpha,\ {}^{1}\mathrm{H}_\alpha,\ {}^{1}\mathrm{H}_N) \tag{5}$$

$$\varphi^i_\lambda = \frac{1}{L-\lambda} \sum_{k=1}^{L-\lambda} \left[ A^i_k - A^i_{k+\lambda} \right]^2 \quad (i = {}^{15}\mathrm{N},\ {}^{13}\mathrm{C}_\alpha,\ {}^{1}\mathrm{H}_\alpha,\ {}^{1}\mathrm{H}_N;\ \lambda < L) \tag{6}$$

where $i$ indicates the backbone atom. We investigated all combinations of backbone atom $i$ and auto-covariance length $\lambda$ to find the optimal parameters for prediction. We found that $\lambda = 25$ and $i = \mathrm{C}_\alpha,\ {}^{1}\mathrm{H}_N$ are suitable parameters for solving this problem. The obtained information is then provided to the Pseudo Average Chemical Shift web server, together with the original protein sequence, to extract the Pseudo Average Chemical Shift features [34,63]. The Pseudo Average Chemical Shift web server is accessible at: http://wlxy.imu.edu.cn/college/biostation/fuwu/acACS/index.asp.
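The lag term of Eq. (6) can be illustrated with a short Python sketch. It is only an approximation of the idea: the function name `pseudo_acs`, the argument names, and the input values are assumptions introduced here, and the sketch covers lags 1 to λ for a single backbone atom, leaving out the zeroth component of Eq. (5), which the text does not define.

```python
def pseudo_acs(shifts, max_lag):
    """Eq. (6) for lags 1..max_lag, for one backbone atom.

    `shifts` holds the average chemical shift assigned to each residue
    (A_1 ... A_L); each output value is the mean squared difference between
    residues that are `lag` positions apart.
    """
    length = len(shifts)
    if max_lag >= length:
        raise ValueError("max_lag must be smaller than the sequence length")
    return [
        sum((shifts[k] - shifts[k + lag]) ** 2 for k in range(length - lag))
        / (length - lag)
        for lag in range(1, max_lag + 1)
    ]


# Hypothetical shift values for a three-residue sequence, for illustration only.
print(pseudo_acs([1.0, 2.0, 4.0], 2))
```

For the toy input, lag 1 gives (1² + 2²)/2 = 2.5 and lag 2 gives 3²/1 = 9.0. In practice the function would be called once per selected backbone atom and the resulting vectors concatenated.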

2.5. Classification algorithm

Classification is the phase of machine learning where data is categorized into predefined classes. It is also called supervised learning, where the targets of these classes are known in advance. In a classification process, classes are represented on the basis of features and characteristics of already known data for which these classes are defined. In this study, we utilized Support Vector Machine (SVM) for the classification of protein structure classes. A detailed discussion of SVM follows.

2.6. Support Vector Machine

SVM is a classification method based on statistical learning theory [62,80]. It was developed by Vapnik in 1995 for binary classification problems and was later extended to multiclass problems. SVM has been extensively applied in various areas of bioinformatics, proteomics, pattern recognition, and data mining, such as the prediction of subcellular localization [79], outer membrane protein discrimination [41], prediction of membrane protein types [42], mitochondrial protein prediction [2], and prediction of protein methylation sites [74]. SVM maps the data into a high-dimensional space and seeks the separating hyperplane that maximizes the margin between the two classes of instances. The margin is the distance between the hyperplane and the closest points in the training set, which are called support vectors; maximizing it minimizes classification errors. The radial basis function (RBF) kernel was utilized for our SVM training. It is defined as:

$$K(x_i, x_j) = \exp\{-\gamma \, \|x_i - x_j\|^2\} \tag{7}$$

where the parameter gamma ‘γ’ represents the width of the Gaussian function. The cost parameter ‘c’ controls the trade-off between margin and classification error. Here, we used the LIBSVM package “libsvm-mat-2.88-1” and selected the LIBSVM parameters using an optimization technique [10].
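To make Eq. (7) concrete, the kernel value for a pair of feature vectors can be computed directly. The sketch below uses only the Python standard library; the function name `rbf_kernel` and the example vectors are assumptions for illustration, and the code is independent of the LIBSVM package used in the study.

```python
import math


def rbf_kernel(x_i, x_j, gamma):
    """Eq. (7): exp(-gamma * squared Euclidean distance between x_i and x_j)."""
    if len(x_i) != len(x_j):
        raise ValueError("feature vectors must have the same length")
    squared_distance = sum((a - b) ** 2 for a, b in zip(x_i, x_j))
    return math.exp(-gamma * squared_distance)


# Identical vectors give the maximum kernel value of 1; the value decays
# towards 0 as the vectors move apart, faster for larger gamma.
print(rbf_kernel([1.0, 2.0], [1.0, 2.0], gamma=0.5))
print(rbf_kernel([0.0, 0.0], [1.0, 1.0], gamma=0.5))
```

A larger `gamma` makes the kernel more local, so fewer training points influence each prediction; this is why gamma is tuned together with the cost parameter.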

2.7. Evaluation methods

In statistical prediction, three cross-validation tests, namely the jackknife, sub-sampling, and independent dataset tests, are commonly used by investigators to evaluate the effectiveness of their models [25]. Among these three methods, the jackknife test is deemed the least arbitrary and most rigorous, because it can exclude memory effects during the entire testing process and always yields a unique result for a given benchmark dataset [24,28,37,39,43,45,46,65,73,84,96]. Accordingly, the jackknife test was also utilized here to examine the quality of the present predictor. For a dataset of N proteins, it divides the dataset into N folds of one protein each; one fold is used for testing and the remaining N − 1 folds for training [5,12,29,38,51,52,54,90]. The evaluation process is repeated N times. The performance of the classification algorithm is measured through the following measures:

$$\text{Accuracy} = \frac{TP + TN}{TP + FN + TN + FP} \times 100 \tag{8}$$

$$\text{Sensitivity} = \frac{TP}{TP + FN} \times 100 \tag{9}$$

$$\text{Specificity} = \frac{TN}{FP + TN} \times 100 \tag{10}$$

$$\text{MCC}(i) = \frac{TP \times TN - FP \times FN}{\sqrt{[TP + FP][TP + FN][TN + FP][TN + FN]}} \tag{11}$$

$$\text{F-M} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{12}$$

where Precision = TP/(TP + FP) and Recall = TP/(TP + FN), and TP, FN, TN, and FP are the numbers of true positive, false negative, true negative, and false positive protein sequences, respectively. MCC is a discrete version of Pearson’s correlation coefficient that takes values in the interval [−1, 1]. A value of 1 means the classifier never makes any mistakes and a value of −1 means the classifier always makes mistakes. For more details, see Eqs. (9)–(13) in Ref. [86] or Eqs. (10)–(14) in Ref. [13].
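The jackknife procedure and the measures in Eqs. (8)–(12) can be sketched as follows. This is an illustrative outline rather than the authors' code: all function names are assumptions introduced here, the measures are computed one class at a time against the rest, and `nearest_neighbour` is a hypothetical stand-in classifier used only so that the example runs without LIBSVM.

```python
import math


def jackknife_predictions(features, labels, fit_predict):
    """Hold out each sample once, train on the other N - 1, predict the held-out one."""
    predictions = []
    for i in range(len(features)):
        train_x = features[:i] + features[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        predictions.append(fit_predict(train_x, train_y, features[i]))
    return predictions


def confusion_counts(labels, predictions, positive):
    """Count TP, FN, TN, FP for one class treated as positive against the rest."""
    tp = fn = tn = fp = 0
    for truth, guess in zip(labels, predictions):
        if truth == positive:
            tp, fn = (tp + 1, fn) if guess == positive else (tp, fn + 1)
        else:
            fp, tn = (fp + 1, tn) if guess == positive else (fp, tn + 1)
    return tp, fn, tn, fp


def _ratio(numerator, denominator):
    """Division that returns 0.0 instead of failing on an empty denominator."""
    return numerator / denominator if denominator else 0.0


def binary_metrics(tp, fn, tn, fp):
    """Eqs. (8)-(12) computed from the four confusion counts."""
    precision = _ratio(tp, tp + fp)
    recall = _ratio(tp, tp + fn)
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": 100.0 * _ratio(tp + tn, tp + fn + tn + fp),
        "sensitivity": 100.0 * recall,
        "specificity": 100.0 * _ratio(tn, fp + tn),
        "mcc": _ratio(tp * tn - fp * fn, mcc_denominator),
        "f_measure": _ratio(2 * precision * recall, precision + recall),
    }


def nearest_neighbour(train_x, train_y, query):
    """Hypothetical stand-in classifier (1-nearest neighbour), not the paper's SVM."""
    distances = [sum((a - b) ** 2 for a, b in zip(x, query)) for x in train_x]
    return train_y[distances.index(min(distances))]
```

Calling `jackknife_predictions(features, labels, nearest_neighbour)` returns one prediction per sample; passing those to `confusion_counts` and then `binary_metrics` gives the per-class measures. Any classifier with the same three-argument call shape could be substituted for the stand-in.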

3. Results and discussion

In this study, we have used two feature extraction strategies, PseAA composition and Pseudo Average Chemical Shift. SVM is used as the classification algorithm. The performance of the proposed system using each feature extraction strategy is described below.

3.1. Performance of PseAA Composition features

The success rates of SVM in conjunction with the PseAA composition feature space on three benchmark datasets are listed in Table 2. On the 25PDB dataset, SVM achieved an overall accuracy of 80.6%, whereas the accuracy for the individual protein structure classes is 86.0% for all-α, 82.2% for all-β, 79.6% for α/β, and 74.6% for α+β. On the 1189 dataset, SVM obtained an overall accuracy of 82.7%, while the accuracy for each protein structure class is 90.8%, 83.4%, 85.9%, and 70.4%, respectively, for all-α, all-β, α/β, and α+β. Likewise, on the 640 dataset, the overall accuracy of SVM is 81.3% and the accuracy for each protein structure class is 78.0% for all-α, 74.7% for all-β, 88.7% for α/β, and 81.9% for α+β. We also evaluated the performance of the SVM through other performance measures, including sensitivity, specificity, F-measure, and MCC, which are reported in Tables 3–5 for the three datasets. For the first dataset, 25PDB, the overall sensitivity, specificity, MCC, and F-measure of the proposed model are 85.5%, 79.0%, 0.58, and 0.36, respectively. The overall and class-wise sensitivity, specificity, MCC, and F-measure of the proposed model on the 1189 dataset are reported in Table 4; the overall sensitivity, specificity, MCC, and F-measure are 80.7%, 83.3%, 0.55, and 0.32, respectively. For the 640 dataset, the overall and class-wise sensitivity, specificity, MCC, and F-measure are listed in Table 5; the overall sensitivity is 80.4%, specificity is 83.3%, MCC is 0.55, and F-measure is 0.32. On the 399 dataset, the proposed model obtained an overall accuracy of 83.2%, as reported in Table 7.

Table 2 – Individual and overall success rates of each structure class using three benchmark datasets.

                                              Prediction accuracy (%)
Dataset   Feature extraction strategy         All-α   All-β   α/β    α+β    Overall
25PDB     PseAA Composition                   86.0    82.2    79.6   74.6   80.6
25PDB     Pseudo Average Chemical Shift       92.2    85.4    82.2   76.9   84.2
1189      PseAA Composition                   90.8    83.4    85.9   70.4   82.7
1189      Pseudo Average Chemical Shift       92.2    84.0    88.6   74.6   85.0
640       PseAA Composition                   78.0    74.7    88.7   81.9   81.3
640       Pseudo Average Chemical Shift       85.0    83.0    86.3   83.2   86.4

Table 3 – Prediction performance on 25PDB dataset.

Feature extraction strategy       Structure class   Sensitivity (%)   Specificity (%)   MCC    F-measure
PseAA Composition                 All-α             86.0              78.6              0.58   0.38
PseAA Composition                 All-β             84.0              79.3              0.57   0.37
PseAA Composition                 α/β               87.8              78.4              0.57   0.34
PseAA Composition                 α+β               84.1              79.5              0.56   0.34
PseAA Composition                 Overall           85.5              79.0              0.58   0.36
Pseudo Average Chemical Shift     All-α             92.2              81.4              0.66   0.40
Pseudo Average Chemical Shift     All-β             88.8              82.6              0.65   0.39
Pseudo Average Chemical Shift     α/β               91.5              82.0              0.65   0.35
Pseudo Average Chemical Shift     α+β               87.5              83.2              0.64   0.35
Pseudo Average Chemical Shift     Overall           90.0              82.3              0.65   0.37

3.2. Performance of Pseudo Average Chemical Shift features

The success rates of the proposed model using Pseudo Average Chemical Shift based features are shown in Table 2. On the first dataset, 25PDB, the proposed model obtained an overall accuracy of 84.2%, with 92.2%, 85.4%, 82.2%, and 76.9% for all-α, all-β, α/β, and α+β, respectively. On the second dataset, 1189, the proposed model yielded an overall accuracy of 85.0%, with 92.2% for all-α, 84.0% for all-β, 88.6% for α/β, and 74.6% for α+β. On the third dataset, 640, our proposed model achieved an accuracy of 85.0% for all-α, 83.0% for all-β, 86.3% for α/β, and 83.2% for α+β, with an overall accuracy of 86.4%. Apart from accuracy, we also computed other performance measures, including sensitivity, specificity, MCC, and F-measure, to evaluate the discrimination power of the SVM as well as the Pseudo Average Chemical Shift based features. For the 25PDB dataset, the overall and class-wise sensitivity, specificity, MCC, and F-measure are listed in Table 3. The overall sensitivity, specificity, MCC, and F-measure are 90.0%, 82.3%, 0.65, and 0.37, respectively. The sensitivity for each class is 92.2%, 88.8%, 91.5%, and 87.5% for all-α, all-β, α/β, and α+β, respectively, whereas the specificity is 81.4%, 82.6%, 82.0%, and 83.2%. Similarly, the MCC is 0.66, 0.65, 0.65, and 0.64 and the F-measure is 0.40, 0.39, 0.35, and 0.35 for all-α, all-β, α/β, and α+β, respectively. Tables 4 and 5 show the overall and class-wise sensitivity, specificity, MCC, and F-measure for the 1189 and 640 datasets. For 1189, the overall sensitivity, specificity, MCC, and F-measure are 90.0%, 83.5%, 0.65, and 0.35, respectively. Similarly, for 640 the overall sensitivity is 85.8%, specificity is 83.3%, MCC is 0.61, and F-measure is 0.33. The success rates of the proposed model on the 399 dataset are listed in Table 7, where the overall accuracy is 89.2%.

From the results above, we ascribe the strong performance of our model to the Pseudo Average Chemical Shift feature vector used to represent protein sequences. Pseudo Average Chemical Shift differs from composition-based methods such as amino acid composition and dipeptide composition, and from order information-based extraction methods such as pseudo-amino acid composition and amphiphilic pseudo-amino acid composition [13,83,86,89]. Our method includes structure information and represents a different and satisfactory approach for distinguishing protein structural classes. We believe that Pseudo Average Chemical Shift is also a useful and efficient feature extraction tool for other bioinformatics problems [33–36].

Table 4 – Prediction performance on 1189 dataset.

Feature extraction strategy       Structure class   Sensitivity (%)   Specificity (%)   MCC    F-measure
PseAA Composition                 All-α             90.8              80.6              0.67   0.32
PseAA Composition                 All-β             86.5              81.5              0.60   0.35
PseAA Composition                 α/β               91.3              80.0              0.63   0.37
PseAA Composition                 α+β               86.4              81.5              0.61   0.35
PseAA Composition                 Overall           88.8              80.9              0.61   0.35
Pseudo Average Chemical Shift     All-α             92.2              83.1              0.64   0.32
Pseudo Average Chemical Shift     All-β             87.5              84.1              0.65   0.35
Pseudo Average Chemical Shift     α/β               92.0              82.6              0.67   0.37
Pseudo Average Chemical Shift     α+β               88.0              84.0              0.65   0.36
Pseudo Average Chemical Shift     Overall           90.0              83.5              0.65   0.35

Table 5 – Prediction performance on 640 dataset.

Feature extraction strategy       Structure class   Sensitivity (%)   Specificity (%)   MCC    F-measure
PseAA Composition                 All-α             78.0              82.0              0.53   0.30
PseAA Composition                 All-β             76.2              82.6              0.53   0.30
PseAA Composition                 α/β               84.7              80.1              0.57   0.34
PseAA Composition                 α+β               83.9              80.3              0.57   0.35
PseAA Composition                 Overall           80.7              83.3              0.55   0.32
Pseudo Average Chemical Shift     All-α             84.0              83.8              0.67   0.31
Pseudo Average Chemical Shift     All-β             83.0              84.2              0.60   0.32
Pseudo Average Chemical Shift     α/β               88.8              82.4              0.63   0.34
Pseudo Average Chemical Shift     α+β               87.5              82.7              0.63   0.35
Pseudo Average Chemical Shift     Overall           85.8              83.3              0.61   0.33

Table 6 – Comparison with existing methods on three datasets.

                             Prediction accuracy (%)
Dataset   Reference          All-α   All-β   α/β    α+β    Overall
25PDB     [50]               69.1    61.6    38.3   60.1   57.1
25PDB     [26]               60.6    60.7    67.9   44.3   58.6
25PDB     [47]               N/A     N/A     N/A    N/A    59.9
25PDB     [88]               64.3    65.0    65.0   61.7   64.0
25PDB     [59]               83.3    78.1    76.3   54.4   72.9
25PDB     [49]               92.6    80.1    74.0   71.0   79.7
25PDB     [64]               92.3    83.7    81.2   68.3   81.4
25PDB     [87]               92.8    83.3    85.8   70.1   82.9
25PDB     Our proposed       92.2    85.4    82.2   76.9   84.2
1189      [50]               57.0    62.9    64.6   25.3   53.9
1189      [26]               N/A     N/A     N/A    N/A    59.9
1189      [47]               N/A     N/A     N/A    N/A    58.9
1189      [88]               62.3    67.7    66.5   63.1   65.2
1189      [59]               69.1    83.7    85.6   35.7   70.7
1189      [49]               89.1    86.7    89.6   53.8   80.6
1189      [64]               92.3    87.1    87.9   65.4   83.5
1189      [87]               89.2    86.7    82.6   65.6   81.3
1189      Our proposed       92.2    84.0    88.6   74.6   85.0
640       [1]                N/A     N/A     N/A    N/A    85.0
640       Our proposed       85.0    83.0    86.3   83.2   86.4
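The effect of the feature choice can be summarized by subtracting the overall accuracies reported for the two feature sets. The sketch below simply restates the overall accuracies from Tables 2 and 7; the dictionary and function names are assumptions introduced for illustration.

```python
# Overall accuracy (%) as (PseAA composition, Pseudo Average Chemical Shift),
# transcribed from Table 2 (25PDB, 1189, 640) and Table 7 (399).
OVERALL_ACCURACY = {
    "25PDB": (80.6, 84.2),
    "1189": (82.7, 85.0),
    "640": (81.3, 86.4),
    "399": (83.2, 89.2),
}


def accuracy_gains(results):
    """Gain in percentage points of the second feature set over the first."""
    return {name: round(pacs - pseaac, 1) for name, (pseaac, pacs) in results.items()}


GAINS = accuracy_gains(OVERALL_ACCURACY)
print(GAINS)
```

The gain is positive on all four datasets, ranging from 2.3 percentage points on 1189 to 6.0 on 399.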

3.3. Performance comparison with existing models

Owing to the rapid growth of protein sequences in protein databanks, the importance of computational methods has increased. In the last twenty years, several automated models have been developed for the prediction of protein structure classes. In each case, the aim of the researchers was to enhance the success rate of their model. In this regard, we compared our proposed model with existing models to show its value. The comparison between the proposed model and existing models is listed in Table 6. For protein structure class prediction, the first model was developed by Kurgan and Chen using two datasets, 25PDB and 1189. Their model obtained accuracies of 69.1% and 57.0% for all-α, 61.6% and 62.9% for all-β, 38.3% and 64.6% for α/β, and 60.1% and 25.3% for α+β, whereas the overall accuracy is 57.1% and 53.9% [48]. The model proposed by Yang et al. achieved overall accuracies of 64.0% and 65.2% on the 25PDB and 1189 datasets, respectively [87]. The recent approach developed by Zhang et al. for the prediction of protein structure classes achieved accuracies of 92.8% and 89.2% for all-α, 83.3% and 86.7% for all-β, 85.8% and 82.6% for α/β, and 70.1% and 65.6% for α+β, respectively [91]. Consequently, its overall accuracy is 82.9% and 81.3% on the 25PDB and 1189 datasets, respectively. The accuracies obtained by our proposed model are 92.2% for all-α, 85.4% for all-β, 82.2% for α/β, and 76.9% for α+β, leading to an overall accuracy of 84.2% on the 25PDB dataset. On the 1189 dataset, the accuracies are 92.2% for all-α, 84.0% for all-β, 88.6% for α/β, and 74.6% for α+β, whereas the overall accuracy is 85%. The 640 dataset has only been used by Adl et al. to evaluate the performance of their model [1]; they reported only the overall accuracy of 85.0%. On this dataset, we report not only the overall accuracy but also the individual class accuracies. The accuracy values for each class are 85.0%, 83.0%, 86.3%, and 83.2% for all-α, all-β, α/β, and α+β, respectively, and the overall accuracy is 86.4%. We also evaluated the performance of the proposed model on the fourth dataset. On this dataset, the approach published by Lin et al. yielded 88.0% overall accuracy [53], while our proposed model obtained 89.2% overall accuracy, as shown in Table 7. The overall success rates of our proposed model are thus higher than those of the existing models reported in the literature. The empirical results show that our proposed model is promising compared to existing results. These results may contribute to improving protein structure prediction.

Since user-friendly and publicly accessible web servers represent the future direction for developing practically more useful models, simulated methods, or predictors [23,57], we shall make efforts in our future work to provide a web server for the method presented in this paper.
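The comparison can also be expressed as the margin between the proposed model and the best previously reported overall accuracy on each dataset. The sketch below restates the overall accuracies from Tables 6 and 7; the reference labels are the bracketed citation numbers used in those tables, and all variable and function names are assumptions introduced for illustration.

```python
# Overall accuracy (%) of earlier methods, keyed by the citation labels
# used in Tables 6 and 7.
PRIOR_OVERALL = {
    "25PDB": {"[50]": 57.1, "[26]": 58.6, "[47]": 59.9, "[88]": 64.0,
              "[59]": 72.9, "[49]": 79.7, "[64]": 81.4, "[87]": 82.9},
    "1189": {"[50]": 53.9, "[26]": 59.9, "[47]": 58.9, "[88]": 65.2,
             "[59]": 70.7, "[49]": 80.6, "[64]": 83.5, "[87]": 81.3},
    "640": {"[1]": 85.0},
    "399": {"[53]": 88.0},
}

# Overall accuracy (%) of the proposed model on the same datasets.
PROPOSED_OVERALL = {"25PDB": 84.2, "1189": 85.0, "640": 86.4, "399": 89.2}


def margin_over_best(prior, proposed):
    """For each dataset, return (best earlier reference, margin in percentage points)."""
    margins = {}
    for name, scores in prior.items():
        best_ref = max(scores, key=scores.get)
        margins[name] = (best_ref, round(proposed[name] - scores[best_ref], 1))
    return margins


MARGINS = margin_over_best(PRIOR_OVERALL, PROPOSED_OVERALL)
print(MARGINS)
```

On these figures the margin in overall accuracy is between 1.2 and 1.5 percentage points on each dataset.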


Table 7 – Performance comparison with existing methods on dataset 399.

                                   Prediction accuracy (%)
Dataset   Reference                All-α   All-β   Mixed α,β   Overall
399       [53]                     N/A     N/A     N/A         88.0
399       Proposed (PseAAC)        79.0    80.3    88.3        83.2
399       Proposed (PACs)          87.9    87.5    91.4        89.2

PACs: Pseudo Average Chemical Shift.

4. Conclusions

In this study, we proposed an efficient and reliable prediction model for discriminating protein structure classes. The model is based on SVM in conjunction with Pseudo Average Chemical Shift features. We examined two feature extraction strategies, PseAA composition and Pseudo Average Chemical Shift. The results show that the discrimination power of Pseudo Average Chemical Shift is better than that of PseAA composition on each of the datasets in terms of overall accuracy. The jackknife test was applied to evaluate the performance of the proposed model on four low-similarity benchmark datasets. Our proposed model achieved 84.2% accuracy on 25PDB, 85.0% on 1189, 86.4% on 640, and 89.2% on the 399 dataset. These results show that the overall accuracy of our proposed model is higher than that of the existing models reported in the literature. It is anticipated that our proposed model might be helpful in future research, particularly on low-similarity datasets.

Conflict of interest

The authors declare that they have no conflict of interest.

References

[1] A.A. Adl, A.N. Dalini, B. Xue, V.N. Uversky, X. Qian, Accurate prediction of protein structural classes using functional domains and predicted secondary structure sequences, J. Biomol. Struct. Dynam. 29 (2012) 623–633.
[2] T.H. Afridi, A. Khan, Y.S. Lee, Mito-GSAAC: mitochondria prediction using genetic ensemble classifier and split amino acid composition, Amino Acids 42 (2012) 1443–1454.
[3] A. Anand, G. Pugalenthi, P.N. Suganthan, Predicting protein structural class by SVM with class-wise optimized features and decision probabilities, J. Theor. Biol. 253 (2008) 375–380.
[4] C. Anfinsen, Principles that govern the folding of protein chains, Science 181 (1973) 223–230.
[5] S. Babaei, A. Geranmayeh, S.A. Seyyedsalehi, Protein secondary structure prediction using modular reciprocal bidirectional recurrent neural networks, Comput. Methods Programs Biomed. 100 (2010) 237–247.
[6] I. Bahar, A.R. Atilgan, R.L. Jernigan, B. Erman, Understanding the recognition of protein structural classes by amino acid composition, Proteins 29 (1997) 172–185.
[7] Y.D. Cai, K.Y. Feng, W.C. Lu, K.C. Chou, Using LogitBoost classifier to predict protein structural classes, J. Theor. Biol. 238 (2006) 172–176.
[8] Y.D. Cai, G.P. Zhou, Prediction of protein structural classes by neural network, Biochimie 82 (2000) 783–785.
[9] D.S. Cao, Q.S. Xu, Y.Z. Liang, Propy: a tool to generate various modes of Chou’s PseAAC, Bioinformatics 29 (2013) 960–962.
[10] C.C. Chang, C.J. Lin, A library for support vector machines, ACM Trans. Intell. Syst. Technol. 27 (2011) 1–27.
[11] K. Chen, L.A. Kurgan, J.S. Ruan, Prediction of protein structural class using novel evolutionary collocation-based sequence representation, J. Comput. Chem. 29 (2008) 1596–1604.
[12] L. Chen, J. Lu, N. Zhang, T. Huang, Y.D. Cai, A hybrid method for prediction and repositioning of drug Anatomical Therapeutic Chemical classes, Mol. Biosyst. 10 (2014) 868–877.
[13] W. Chen, P.M. Feng, H. Lin, K.C. Chou, iRSpot-PseDNC: identify recombination spots with pseudo dinucleotide composition, Nucleic Acids Res. 41 (2013) e68.
[14] W. Chen, H. Lin, P.M. Feng, C. Ding, Y.C. Zuo, K.C. Chou, iNuc-PhysChem: a sequence-based predictor for identifying nucleosomes via physicochemical properties, PLOS ONE 7 (2012) e47843.
[15] J.J. Chou, C.T. Zhang, A joint prediction of the folding types of 1490 human proteins from their genetic codons, J. Theor. Biol. 161 (1993) 251–262.
[16] K.C. Chou, A novel approach to predicting protein structural classes in a (20-1)-D amino acid composition space, Proteins: Struct. Funct. Genet. 21 (1995) 319–344.
[17] K.C. Chou, A key driving force in determination of protein structural classes, Biochem. Biophys. Res. Commun. 264 (1999) 216–224.
[18] K.C. Chou, Prediction of protein subcellular attributes using pseudo-amino acid composition, Proteins: Struct. Funct. Genet. 43 (2001) 246–255.
[19] K.C. Chou, Prediction of protein cellular attributes using pseudo-amino acid composition, Proteins 43 (2001) 246–255.
[20] K.C. Chou, Using amphiphilic pseudo amino acid composition to predict enzyme subfamily classes, Bioinformatics 21 (2005) 10–19.
[21] K.C. Chou, Some remarks on protein attribute prediction and pseudo amino acid composition (50th Anniversary Year Review), J. Theor. Biol. 273 (2011) 236–247.
[22] K.C. Chou, Y.D. Cai, Predicting protein structural class by functional domain composition, Biochem. Biophys. Res. Commun. 321 (2004) 1007–1009.
[23] K.C. Chou, H.B. Shen, Review: recent advances in developing web-servers for predicting protein attributes, Nat. Sci. 2 (2009) 63–92.
[24] K.C. Chou, Z.C. Wu, X. Xiao, iLoc-Hum: using accumulation-label scale to predict subcellular locations of human proteins with both single and multiple sites, Mol. Biosyst. 8 (2012) 629–641.
[25] K.C. Chou, C.T. Zhang, Prediction of protein structural classes, Crit. Rev. Biochem. Mol. Biol. 30 (1995) 275–349.


[26] S. Costantini, A.M. Facchiano, Prediction of the protein structural class by specific peptide frequencies, Biochimie 91 (2009) 226–229.
[27] P. Deschavanne, P. Tuffery, Exploring an alignment free approach for protein classification and structural class prediction, Biochimie 90 (2008) 615–625.
[28] C. Ding, L.F. Yuan, S.H. Guo, H. Lin, W. Chen, Identification of mycobacterial membrane proteins and their types using over-represented tripeptide compositions, J. Proteomics 77 (2012) 321–328.
[29] H. Ding, L. Liu, F.B. Guo, J. Huang, H. Lin, Identify Golgi protein types with modified Mahalanobis discriminant algorithm and pseudo amino acid composition, Protein Pept. Lett. 18 (2011) 58–63.
[30] H. Ding, L. Luo, H. Lin, Prediction of cell wall lytic enzymes using Chou’s amphiphilic pseudo amino acid composition, Protein Pept. Lett. 16 (2009) 351–355.
[31] P. Du, S. Gu, Y. Jiao, PseAAC-General: fast building various modes of general form of Chou’s pseudo-amino acid composition for large-scale protein datasets, Int. J. Mol. Sci. 15 (2014) 3495–3509.
[32] P. Du, X. Wang, C. Xu, Y. Gao, PseAAC-Builder: a cross-platform stand-alone program for generating various special Chou’s pseudo-amino acid compositions, Anal. Biochem. 425 (2012) 117–119.
[33] G.L. Fan, Q.Z. Li, Predict mycobacterial proteins subcellular locations by incorporating pseudo-average chemical shift into the general form of Chou’s pseudo amino acid composition, J. Theor. Biol. 304 (2012) 88–95.
[34] G.L. Fan, Q.Z. Li, Predicting protein submitochondria locations by combining different descriptors into the general form of Chou’s pseudo amino acid composition, Amino Acids 43 (2012) 545–555.
[35] G.L. Fan, Q.Z. Li, Discriminating bioluminescent proteins by incorporating average chemical shift and evolutionary information into the general form of Chou’s pseudo amino acid composition, J. Theor. Biol. 334 (2013) 45–51.
[36] G.L. Fan, Q.Z. Li, Predicting acidic and alkaline enzymes by incorporating the average chemical shift and gene ontology informations into the general form of Chou’s PseAAC, Process Biochem. 48 (2013) 1048–1053.
[37] P.M. Feng, H. Ding, W. Chen, H. Lin, Naïve Bayes classifier with feature selection to identify phage virion proteins, Comput. Math. Methods Med. (2013) 530696.
[38] P.M. Feng, H. Lin, W. Chen, Identification of antioxidants from sequence information using naïve Bayes, Comput. Math. Methods Med. (2013).
[39] D.N. Georgiou, T.E. Karakasidis, J.J. Nieto, A. Torres, Use of fuzzy clustering technique and matrices to classify amino acids and its impact to Chou’s pseudo amino acid composition, J. Theor. Biol. 257 (2009) 17–26.
[40] S.H. Guo, E.Z. Deng, L.Q. Xu, H. Ding, H. Lin, W. Chen, et al., iNuc-PseKNC: a sequence-based predictor for predicting nucleosome positioning in genomes with pseudo k-tuple nucleotide composition, Bioinformatics 30 (11) (2014) 1522–1529.
[41] M. Hayat, A. Khan, Discriminating outer membrane proteins with fuzzy k-nearest neighbor algorithms based on the general form of Chou’s PseAAC, Protein Pept. Lett. 18 (2011) 411–421.
[42] M. Hayat, A. Khan, Predicting membrane protein types by fusing composite protein sequence features into pseudo amino acid composition, J. Theor. Biol. 271 (2011) 10–17.
[43] M. Hayat, A. Khan, Mem-PHybrid: hybrid features based prediction system for classifying membrane protein types, Anal. Biochem. 424 (2012) 35–44.
[44] M. Hayat, A. Khan, MemHyb: predicting membrane protein types by hybridizing SAAC and PSSM, J. Theor. Biol. 292 (2012) 93–102.
[45] M. Hayat, A. Khan, M. Yeasin, Prediction of membrane proteins using split amino acid composition and ensemble classification, J. Amino Acids 42 (2012) 2447–2460.
[46] C. Jia, T. Liu, A.K. Chang, Y. Zhai, Prediction of mitochondrial proteins of malaria parasite using bi-profile Bayes feature extraction, Biochimie 93 (2011) 778–782.
[47] K.D. Kedarisetti, L. Kurgan, S. Dick, Classifier ensembles for protein structural class prediction with varying homology, Biochem. Biophys. Res. Commun. 348 (2006).
[48] L. Kurgan, K. Chen, Prediction of protein structural class for the twilight zone sequences, Biochem. Biophys. Res. Commun. 357 (2007) 453–460.
[49] L. Kurgan, K. Cios, K. Chen, SCPRED: accurate prediction of protein structural class for sequences of twilight-zone similarity with predicting sequences, BMC Bioinformatics 9 (2008) 226.
[50] L. Kurgan, L. Homaeian, Prediction of structural classes for protein sequences and domains – impact of prediction algorithms, sequence representation and homology, and test procedures on accuracy, Pattern Recogn. 39 (2006) 2323–2343.
[51] H. Lin, W. Chen, Prediction of thermophilic proteins using feature selection technique, J. Microbiol. Methods 84 (2011) 67–70.
[52] H. Lin, W. Chen, L.F. Yuan, Z.Q. Li, H. Ding, Using over-represented tetrapeptides to predict protein submitochondria locations, Acta Biotheor. 61 (2013) 259–268.
[53] H. Lin, C. Ding, Q. Song, P. Yang, H. Ding, K.J. Deng, et al., The prediction of protein structural class using averaged chemical shifts, J. Biomol. Struct. Dynam. 29 (2012) 1147–1153.
[54] H. Lin, H. Ding, F.B. Guo, J. Huang, Prediction of subcellular location of mycobacterial protein using feature selection techniques, Mol. Divers. 14 (2010) 667–671.
[55] H. Lin, Q.Z. Li, Using pseudo amino acid composition to predict protein structural class: approached by incorporating 400 dipeptide components, J. Comput. Chem. 28 (2007) 1463–1466.
[56] H. Lin, H. Wang, H. Ding, Y.L. Chen, Q.Z. Li, Prediction of subcellular localization of apoptosis protein using Chou’s pseudo amino acid composition, Acta Biotheor. 57 (2009) 321–330.
[57] S.X. Lin, J. Lapointe, Theoretical and experimental biology in one, J. Biomed. Sci. Eng. 6 (2013) 435–442.
[58] B. Liu, D. Zhang, R. Xu, J. Xu, X. Wang, Q. Chen, et al., Combining evolutionary information extracted from frequency profiles with sequence-based kernels for protein remote homology detection, Bioinformatics 30 (2014) 472–479.
[59] T. Liu, X. Zheng, J. Wang, Prediction of protein structural class for low-similarity sequences using support vector machine and PSI-BLAST profile, Biochimie 92 (2010) 1330–1334.
[60] P. Luginbuhl, T. Szyperski, K. Wuthrich, Statistical basis for the use of 13Cα chemical shifts in protein structure determination, J. Magn. Reson. B 109 (1995) 229–233.
[61] R.Y. Luo, Z.P. Feng, J.K. Liu, Prediction of protein structural class by amino acid and polypeptide composition, Eur. J. Biochem. 269 (2002) 4219–4225.
[62] A. Majid, S. Ali, M. Iqbal, N. Kausar, Prediction of human breast and colon cancers from imbalanced data using nearest neighbor and support vector machines, Comput. Methods Programs Biomed. 113 (2014) 792–808.
[63] S.P. Mielke, V.V. Krishnan, Protein structural class identification directly from NMR spectra using averaged chemical shifts, Bioinformatics 19 (2003) 2054–2064.


[64] M.J. Mizianty, L. Kurgan, Modular prediction of protein structural classes from sequences of twilight-zone identity with predicting sequences, BMC Bioinformatics 10 (2009) 414.
[65] H. Mohabatkar, Prediction of cyclin proteins using Chou’s pseudo amino acid composition, Protein Pept. Lett. 17 (2010) 1207–1214.
[66] H. Mohabatkar, M. Mohammad Beigi, A. Esmaeili, Prediction of GABA(A) receptor proteins using the concept of Chou’s pseudo-amino acid composition and support vector machine, J. Theor. Biol. 281 (2011) 18–23.
[67] H. Nakashima, K. Nishikawa, T. Ooi, The folding type of a protein is relevant to the amino acid composition, J. Biochem. 99 (1986) 153–162.
[68] L. Nanni, A. Lumini, Genetic programming for creating Chou’s pseudo amino acid based features for submitochondria localization, Amino Acids 34 (2008) 653–660.
[69] L. Nanni, A. Lumini, D. Gupta, A. Garg, Identifying bacterial virulent proteins by fusing a set of classifiers based on variants of Chou’s pseudo amino acid composition and on evolutionary information, IEEE/ACM Trans. Comput. Biol. Bioinform. 9 (2012) 467–475.
[70] G. Pollastri, A. McLysaght, Porter: a new, accurate server for protein secondary structure prediction, Bioinformatics 21 (2005) 1719–1720.
[71] J.D. Qiu, S.H. Luo, J.H. Huang, R.P. Liang, Using support vector machines for prediction of protein structural classes based on discrete wavelet transform, J. Comput. Chem. 30 (2009) 1344–1350.
[72] W.R. Qiu, X. Xiao, K.C. Chou, iRSpot-TNCPseAAC: identify recombination spots with trinucleotide composition and pseudo amino acid components, Int. J. Mol. Sci. 15 (2014) 1746–1766.
[73] S.S. Sahu, G. Panda, A novel feature representation method based on Chou’s pseudo amino acid composition for protein structural class prediction, Comput. Biol. Chem. 34 (2010) 320–327.
[74] J. Shao, D. Xu, S.N. Tsai, Y. Wang, S.M. Ngai, Computational identification of protein methylation sites through bi-profile Bayes feature extraction, J. PLoS ONE 4 (2009).
[75] H.B. Shen, J. Yang, X.J. Liu, K.C. Chou, Using supervised fuzzy clustering to predict protein structural classes, Biochem. Biophys. Res. Commun. 334 (2005) 577–581.
[76] A. Sibley, M. Cosman, V. Krishnan, An empirical correlation between secondary structure content and averaged chemical shifts in proteins, J. Biophys. 84 (2003) 1223–1227.
[77] S. Spera, A. Bax, Empirical correlation between protein backbone conformation and C-alpha and C-beta 13C nuclear magnetic resonance chemical shifts, J. Am. Chem. Soc. 113 (1991) 5490–5492.
[78] X.D. Sun, R.B. Huang, Prediction of protein structural classes using support vector machines, Amino Acids 30 (2006) 469–475.
[79] M. Tahir, A. Khan, A. Majid, Protein subcellular localization of fluorescence imagery using spatial and transform domain features, Bioinformatics 28 (2012) 91–97.


[80] V.N. Vapnik, Statistical Learning Theory, Wiley, New York, 1998.
[81] Z.X. Wang, Z. Yuan, How good is prediction of protein structural class by the component-coupled method, Proteins 38 (2000) 165–175.
[82] D. Wishart, B. Sykes, F. Richards, Relationship between nuclear magnetic resonance chemical shift and protein secondary structure, J. Mol. Biol. 222 (1991) 311–333.
[83] X. Xiao, W.Z. Lin, K.C. Chou, Recent advances in predicting protein classification and their applications to drug development, Curr. Top. Med. Chem. 13 (2013) 1622–1635.
[84] X. Xiao, J.L. Min, P. Wang, K.C. Chou, iCDI-PseFpt: identify the channel–drug interaction in cellular networking with PseAAC and molecular fingerprints, J. Theor. Biol. 337C (2013) 71–79.
[85] X. Xiao, S.H. Shao, Z.D. Huang, K.C. Chou, Using pseudo amino acid composition to predict protein structural classes: approached with complexity measure factor, J. Comput. Chem. 27 (2006) 478–482.
[86] Y. Xu, J. Ding, L.Y. Wu, K.C. Chou, iSNO-PseAAC: predict cysteine S-nitrosylation sites in proteins by incorporating position specific amino acid propensity into pseudo amino acid composition, PLOS ONE 8 (2013) e55844.
[87] J.Y. Yang, Z.L. Peng, X. Chen, Prediction of protein structural classes for low-homology sequences based on predicted secondary structure, BMC Bioinformatics 9 (2010) 11.
[88] J.Y. Yang, Z.L. Peng, Z.G. Yu, R.J. Zhang, V. Anh, D.S. Wang, Prediction of protein structural classes by recurrence quantification analysis based on chaos game representation, J. Theor. Biol. 257 (2009) 618–626.
[89] D.J. Yu, J. Hu, X.W. Wu, H.B. Shen, J. Chen, Z.M. Tang, et al., Learning protein multi-view features in complex space, Amino Acids 44 (2013) 1365–1379.
[90] L.F. Yuan, C. Ding, S.H. Guo, H. Ding, W. Chen, H. Lin, Prediction of the types of ion channel-targeted conotoxins based on radial basis function network, Toxicol. In Vitro 27 (2013) 852–856.
[91] S. Zhang, S. Ding, T. Wang, High-accuracy prediction of protein structural class for low-similarity sequences based on predicted secondary structure, Biochimie (2011) 1–5.
[92] S. Zhang, L. Yang, T. Wang, Use of information discrepancy measure to compare protein secondary structures, J. Mol. Struct. Theochem. 909 (2009) 102–106.
[93] T.L. Zhang, Y.S. Ding, Using pseudo amino acid composition and binary-tree support vector machines to predict protein structural classes, Amino Acids 33 (2007) 623–629.
[94] T.L. Zhang, Y.S. Ding, K.C. Chou, Prediction protein structural classes with pseudo-amino acid composition: approximate entropy and hydrophobicity pattern, J. Theor. Biol. 250 (2008) 186–193.
[95] Y. Zhao, B. Alipanahi, S.C. Li, M. Li, Protein secondary structure prediction using NMR chemical shift data, J. Bioinform. Comput. Biol. 8 (2010) 867–884.
[96] G.P. Zhou, An intriguing controversy over protein structural class prediction, J. Protein Chem. 17 (1998) 729–738.
[97] D. Zou, Z. He, J. He, Y. Xia, Supersecondary structure prediction using Chou’s pseudo amino acid composition, J. Comput. Chem. 32 (2011) 271–278.
