Prediction of flavin mono-nucleotide binding sites using modified PSSM profile and ensemble support vector machine


Computers in Biology and Medicine 42 (2012) 1053–1059


Xia Wang a, Gang Mi b, Cuicui Wang a, Yongqing Zhang c, Juan Li a, Yanzhi Guo a,*, Xuemei Pu a, Menglong Li a,*

a College of Chemistry, Sichuan University, Chengdu 610064, PR China
b College of Life Science, Sichuan University, Chengdu 610064, PR China
c College of Computer Science, Sichuan University, Chengdu 610064, PR China

Article info

Article history: Received 18 March 2012; Accepted 13 August 2012

Keywords: Flavin mono-nucleotide (FMN); Binding site prediction; Position specific score matrix (PSSM); Ensemble classifier; Support vector machine (SVM)

Abstract

Flavin mono-nucleotide (FMN) is closely involved in many biological processes. In this study, a computational method is proposed to identify FMN binding sites from the amino acid sequences of proteins alone. A modified position specific score matrix (PSSM) was used to characterize the local sequence environment, and it gave a visible improvement in performance. An ensemble support vector machine (SVM) was applied to address the imbalanced data problem. In addition, an independent dataset was built to evaluate the practical performance of the method, on which an accuracy of 87.87% was achieved. These results demonstrate that the method is effective in predicting FMN-binding sites. © 2012 Elsevier Ltd. All rights reserved.

1. Introduction

Previous studies show that most proteins cannot perform their biological functions alone, and cofactors often play pivotal roles in completing protein functions. These protein cofactors are organic compounds or metal atoms that associate with proteins and strengthen the stability of the resulting complexes [1]. Knowing the sites at which proteins interact with cofactors or ligands helps researchers understand protein function and mechanism more specifically.

FMN (flavin mono-nucleotide) is a coenzyme of flavoproteins that takes part in biological processes such as energy production and cellular respiration, especially in electron transfer [2]. FMN is an electron carrier molecule that functions as a hydrogen acceptor. It exists in three oxidation states during the catalytic cycle: oxidized (FMN), semiquinone (FMNH•) and hydroquinone (FMNH2). Its strong oxidizing power and its ability to transfer one or two electrons make FMN a critical part of the electron transfer system. For example, flavodoxin is a flavoprotein and an electron transfer protein involved in photosynthetic reactions. All flavodoxins carry a molecule of FMN, which confers redox properties on the protein [3]. Flavodoxins have been shown to be critical in some pathogenic organisms and could be used as targets in drug design [4]. Because FMN is an important part of flavodoxins, recognizing FMN binding sites is helpful for investigating the flavin recognition mechanism and for further research on photosynthesis. Moreover, FMN serves as a coenzyme for several reduction–oxidation enzymes and so plays an important role in energy metabolism. It is also involved in the metabolism of folate, vitamin B12 and other vitamins [5]. Identifying FMN sites is therefore of great significance for designing related inhibitors or antagonists [6].

However, identifying FMN binding sites experimentally is expensive and time-consuming, and computational methods have been developed as an alternative. For example, Saito et al. reported an empirical approach to identify nucleotide-binding sites [7], and Kelley et al. developed a machine learning approach to discriminate flavin adenine dinucleotide binding sites from nicotinamide adenine dinucleotide binding sites [8]. In this paper we focus specifically on FMN binding sites and develop a new method to recognize them using the support vector machine (SVM). Instead of examining protein structure, the method identifies FMN sites from the amino acid sequences of proteins. SVM models were constructed from physicochemical properties and from evolutionary information, respectively. For the evolutionary information, we used not the original position specific score matrix (PSSM) profile but a modified PSSM profile, scaled by a sliding window and a smoothing window, to characterize the local environment of each FMN site. The data imbalance problem that commonly arises in protein binding-site research is conventionally handled by two approaches, the ensemble support vector machine classifier [9] and the synthetic minority over-sampling technique (SMOTE) [10]. Both approaches were applied, separately, to the imbalanced dataset in this paper.

* Corresponding authors. Tel.: +86 28 85413330; fax: +86 28 85412356. E-mail addresses: [email protected] (Y. Guo), [email protected] (M. Li).
0010-4825/$ - see front matter © 2012 Elsevier Ltd. All rights reserved. http://dx.doi.org/10.1016/j.compbiomed.2012.08.005

2. Material and method

To develop a genuinely useful statistical predictor for identifying FMN binding sites, we followed the procedure set out in Chou's recent review [11]: (i) construct a valid dataset to train and test the predictor; (ii) formulate an effective mathematical expression that truly reflects the intrinsic correlation of the protein samples; (iii) develop a powerful algorithm to perform the prediction; (iv) use proper cross-validation tests to evaluate the performance of the predictor; (v) establish a user-friendly, publicly accessible web server for the predictor.

2.1. Data mining

In this paper, 467 FMN-binding protein IDs were extracted from the SuperSite documentation [12], and the protein chains were downloaded from the Protein Data Bank (PDB) [13]. To avoid redundancy and homology bias, proteins sharing more than 25% sequence identity with others in the dataset were removed using blastclust [14]. LPC (Ligand Protein Contact) [15] was then used to check whether FMN binding sites were present on the chosen proteins and to confirm their positions. Finally, 111 proteins comprising 2369 binding sites and 28,136 non-binding sites were retained. Thirty proteins containing 628 positive and 7712 negative samples, named dataset P30, were randomly selected as an independent set to evaluate the practical performance of the method. The other 81 proteins, containing 1741 positive and 20,424 negative samples and named dataset P81, were used to construct the model. The PDB IDs of all 111 proteins used in this paper are listed in Supplementary Table S1.

2.2. Feature extraction

2.2.1. Evolutionary information

Biology is a natural science. All biological species, and likewise their protein sequences, have developed from a very limited number of ancestors [16]. Evolution involves mutations, insertions and deletions of residues [17], among other changes. As these changes accumulate over a long period, many similarities between the initial and the resultant amino acid sequences are gradually eliminated, but the corresponding proteins may still share many common attributes, such as essentially the same biological function and similar binding sites. To incorporate this evolutionary information into the feature vector, we used data derived from the position specific scoring matrix (PSSM) [18]. Evolutionary information from the PSSM is now widely used as a feature to characterize the functional sites of proteins [19,20]. In this paper, the PSSM profile was generated by PSI-BLAST, so a protein of n amino acids was transformed into an n × 20 matrix.

However, the original PSSM expresses only the evolutionary information of the binding site itself. The residues surrounding a binding site usually affect the binding process, so their evolutionary information should be incorporated as well. In this paper, a PSSM modified by a sliding window and a smoothing window was used to represent the local environment of the binding sites [21–23]. The details of the sliding and smoothing windows are elaborated in [24]. After the sliding window size and the smoothing window size were optimized, the improved PSSM profile was constructed. For a protein of n residues, the PSSM can be expressed as follows:

\[
P_{\mathrm{PSSM}} =
\begin{bmatrix}
A_{1\to 1} & A_{1\to 2} & A_{1\to 3} & \cdots & A_{1\to 20} \\
A_{2\to 1} & A_{2\to 2} & A_{2\to 3} & \cdots & A_{2\to 20} \\
A_{3\to 1} & A_{3\to 2} & A_{3\to 3} & \cdots & A_{3\to 20} \\
\vdots & \vdots & \vdots & A_{i\to j} & \vdots \\
A_{n\to 1} & A_{n\to 2} & A_{n\to 3} & \cdots & A_{n\to 20}
\end{bmatrix}
\tag{1}
\]

where A_{i→j} is the evolutionary score for the ith amino acid of the sequence mutating to amino acid type j. Here, the general form of PseAAC [11,25] was adopted to present the input feature more clearly. For an FMN binding site (A_i), the PSSM can thus be formulated as the following vector:

\[
P_{\mathrm{PSSM}}(A_i) = \begin{bmatrix} C_1 & C_2 & \cdots & C_j & \cdots & C_{20} \end{bmatrix}^{T}
\tag{2}
\]

where C_j = A_{i→j} (1 ≤ j ≤ 20).

2.2.2. Physicochemical properties

Physicochemical properties of amino acids have also been used for binding site prediction [26–28]. Following the work of Mishra [28], 544 physicochemical properties were downloaded from AAindex [29–32]. After the amino acid indices containing the value "N/A" were removed, 531 physicochemical properties, named PP531, remained to express the protein sequence information [33]. Each amino acid was replaced by its values for the 531 properties, which gave an n × 531 matrix, where n is the length of the protein sequence. It can be expressed as follows:

\[
P_{\mathrm{PP531}} =
\begin{bmatrix}
A_{1,1} & A_{1,2} & A_{1,3} & \cdots & A_{1,531} \\
A_{2,1} & A_{2,2} & A_{2,3} & \cdots & A_{2,531} \\
A_{3,1} & A_{3,2} & A_{3,3} & \cdots & A_{3,531} \\
\vdots & \vdots & \vdots & A_{i,j} & \vdots \\
A_{n,1} & A_{n,2} & A_{n,3} & \cdots & A_{n,531}
\end{bmatrix}
\tag{3}
\]

where A_{i,j} is the value of the jth physicochemical property of the ith amino acid. The general form for an FMN binding site (A_i) can be described as the following vector:

\[
P_{\mathrm{PP531}}(A_i) = \begin{bmatrix} C_1 & C_2 & \cdots & C_j & \cdots & C_{531} \end{bmatrix}^{T}
\tag{4}
\]

where C_j = A_{i,j} (1 ≤ j ≤ 531).

2.3. Ensemble support vector machine and synthetic minority over-sampling technique (SMOTE)

Data imbalance is common in the prediction of protein functional sites. A large ratio of negatives to positives leads to high prediction accuracy for the majority class but poor prediction accuracy for the minority class [34–36]. In the training set there were 1741 binding sites and 20,424 non-binding sites, a ratio of roughly 1:12. Two methods, ensemble SVM and SMOTE, were therefore used to address the imbalance.

SVM is a machine learning approach based on the structural risk minimization principle of the statistical learning theory proposed by Vapnik [37]. The LIBSVM software (version 3.0) was freely downloaded from http://www.csie.ntu.edu.tw/~cjlin/libsvm/ [38]. SVM has been widely applied in biology, for example in predicting protein secondary structures [39,40] and classifying protein functions [41–44]. In this paper, the radial basis function (RBF) was chosen as the kernel function. The regularization parameter C and the kernel width parameter γ were tuned until an optimal SVM model was obtained.
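The input to each SVM is the modified PSSM vector of Section 2.2.1. As a rough illustration of how such a vector might be assembled, the sketch below follows one reading of the smoothed-PSSM scheme cited as [24]: each PSSM row is first replaced by the sum of the rows in a centred smoothing window, and the smoothed rows in a centred sliding window are then concatenated. The function names, the use of a plain sum and the zero padding at the chain termini are assumptions made for illustration, not details confirmed by this paper; the default sizes 11 and 7 are the values selected in Section 3.1.1.

```python
# Illustrative sketch only. Summation smoothing and zero padding at the
# termini are assumptions; the paper defers these details to reference [24].

def smooth_pssm(pssm, smooth_w):
    """Replace each row by the sum of the rows inside a centred window."""
    half = smooth_w // 2
    n = len(pssm)
    width = len(pssm[0])
    smoothed = []
    for i in range(n):
        row = [0.0] * width
        # Rows outside the sequence contribute nothing (implicit zero padding).
        for k in range(max(0, i - half), min(n, i + half + 1)):
            for j in range(width):
                row[j] += pssm[k][j]
        smoothed.append(row)
    return smoothed


def window_features(rows, center, slide_w):
    """Concatenate the rows in a centred sliding window, zero-padded at the ends."""
    half = slide_w // 2
    width = len(rows[0])
    features = []
    for k in range(center - half, center + half + 1):
        if 0 <= k < len(rows):
            features.extend(rows[k])
        else:
            features.extend([0.0] * width)
    return features


def modified_pssm_features(pssm, slide_w=11, smooth_w=7):
    """Return one feature vector per residue of an n x 20 PSSM."""
    smoothed = smooth_pssm(pssm, smooth_w)
    return [window_features(smoothed, i, slide_w) for i in range(len(pssm))]
```

Under these assumptions, a residue is described by 11 × 20 = 220 values, and a smoothing window of size 1 leaves the PSSM unchanged, which corresponds to using the sliding window alone.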

The ensemble SVM classifier assembles multiple sub-SVM classifiers [9] and is regarded as a powerful tool for improving prediction performance [45]. Ensemble classifiers have shown their superiority on imbalanced data in previous works [46,47]. In this paper, a set of negative samples equal in number to the positives was drawn at random from all negatives, and this was repeated 30 times. This yielded 30 sub-training sets with a 1:1 ratio of negative to positive samples, from which 30 SVM models were trained. The final prediction was determined by a majority voting scheme [47–49]. The flow chart of the ensemble SVM classifier is shown in Fig. 1.

An alternative method, SMOTE (synthetic minority over-sampling technique), has also been reported to resolve the imbalanced data problem in predictions [50,51], and it was applied to our dataset as well. SMOTE is a preprocessing technique for imbalanced data that combines over-sampling of the minority class with under-sampling of the majority class. In our case the positive samples are far fewer than the negatives, so pseudo-positive samples were generated with the SMOTE algorithm to balance the dataset. SMOTE is described in detail by Chawla et al. [10].

2.4. Model validation and measurement

Three methods are commonly used to examine the effectiveness of a predictor: the independent dataset test, the subsampling (K-fold cross-validation) test and the jackknife test [52]. The jackknife test is regarded as the least arbitrary because it always yields a unique result for a given dataset. The reasons are as follows.

(1) In an independent dataset test, all samples used to test the predictor lie outside the training dataset, which excludes the "memory" effect or bias. However, the choice of independent samples can be quite arbitrary unless their number is large enough, and this arbitrariness may lead to completely different results.

(2) In the K-fold cross-validation test, the procedure usually used in the literature is 5-fold, 7-fold or 10-fold cross-validation. The problem with this kind of subsampling test is that the number of possible ways of dividing a benchmark dataset is astronomical even for a very simple dataset, as elucidated in [11]. In any actual subsampling test, therefore, only an extremely small fraction of the possible selections is taken into account. Since different selections lead to different results even for the same benchmark dataset and the same predictor, the subsampling test cannot avoid arbitrariness either. A test method that cannot yield a unique outcome cannot be regarded as a good one.

(3) In the jackknife test, every sample in the benchmark dataset is singled out in turn and tested by the predictor trained on the remaining samples. During jackknifing, both the training dataset and the testing dataset are open, and each sample moves in turn between the two. The jackknife test excludes the "memory" effect, and it avoids arbitrariness because its outcome is always unique for a given benchmark dataset.

Accordingly, the jackknife test has been increasingly and widely used to examine the quality of various predictors (see, e.g. [41,53–58]). However, to reduce computational time, we adopted the 5-fold cross-validation test and the independent dataset test in this study, as many investigators using SVM as the prediction engine have done [59].

Sensitivity (Se), specificity (Sp), overall accuracy (Acc) and Matthews correlation coefficient (MCC) were used as evaluation criteria. They are defined by the following equations:

\[
Se = \frac{TP}{TP + FN} \times 100\%
\tag{5}
\]

\[
Sp = \frac{TN}{TN + FP} \times 100\%
\tag{6}
\]

\[
Acc = \frac{TN + TP}{TN + TP + FP + FN} \times 100\%
\tag{7}
\]

\[
MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP) \times (TP + FN) \times (TN + FP) \times (TN + FN)}}
\tag{8}
\]

where TP, TN, FP and FN represent true positives, true negatives, false positives and false negatives, respectively. In addition, the receiver operating characteristic (ROC) curve is regarded as one of the most reliable approaches for evaluating classifier performance [60], and we also used it to evaluate the quality of our classifiers. The area under the ROC curve (AUC), which lies between 0 and 1, was calculated to assess each model. A higher AUC indicates a better model, and a classifier with an AUC below 0.5 is not valid [61].
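To make Eqs. (5)–(8) concrete, the following minimal sketch computes the four criteria from confusion-matrix counts, together with a rank-based AUC (the probability that a randomly chosen positive scores higher than a randomly chosen negative, with ties counted as one half). The function names and the choice to return MCC = 0 when the denominator vanishes are illustrative assumptions, not part of the original method.

```python
import math


def evaluate(tp, tn, fp, fn):
    """Se, Sp and Acc in percent, and MCC, following Eqs. (5)-(8)."""
    se = 100.0 * tp / (tp + fn)
    sp = 100.0 * tn / (tn + fp)
    acc = 100.0 * (tn + tp) / (tn + tp + fp + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumption: report 0 when a row or column of the confusion matrix is empty.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Se": se, "Sp": sp, "Acc": acc, "MCC": mcc}


def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

The pairwise AUC is quadratic in the number of samples, which is acceptable for a small check; for datasets of the size used here a sort-based formulation would be preferable.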

Fig. 1. The flow chart of the ensemble SVM classifier.

3. Results and discussion

3.1. Selection of input feature

3.1.1. Optimizing window size of the modified PSSM profile

To obtain the optimal modified PSSM profile, the sliding window size and the smoothing window size were optimized using SVM, with respect to the Acc and MCC values on the whole dataset. Fig. 2 shows the prediction results for the different sliding window sizes and smoothing window sizes, and the detailed results are listed in Supplementary Table S2. The sliding window size was varied from 1 to 15. As shown in Fig. 2(a), a sliding window size of 11 gives the highest Acc and MCC, so the sliding window size was set to 11 in the rest of the study. The smoothing window size was then varied from 1 to 11. Fig. 2(b) shows that a smoothing window size of 1 achieves the highest Acc but the lowest MCC, so it is not reliable. The highest MCC is obtained, equally, with smoothing window sizes of 5 and 7, for which Acc is high and almost equal. Considering the neighboring effect of the residues surrounding each FMN-binding site, the longer smoothing window size of 7 was selected. The sliding window size of 11 and the smoothing window size of 7 were thus chosen as the optimal window sizes for modifying the PSSM.

Fig. 2. (a) The prediction results of the different sliding window sizes. (b) The prediction results of the different smoothing window sizes.

3.1.2. Comparison between the modified PSSM and PP531

The SVM model based on PP531 was built on the same dataset as the model based on the modified PSSM. Fig. 3 compares the ROC curves of the two models. The AUC based on PP531 was only 0.6067, whereas the AUC based on the modified PSSM reached 0.9303. The modified PSSM profile clearly performed far better, and it was therefore selected as the input feature in the subsequent work.

Fig. 3. The ROC curves based on 531 physicochemical properties (PP531) and modified PSSM profile.

3.2. Resolving imbalanced dataset problem

Data imbalance is always an issue in binding site prediction, because non-binding residues far outnumber binding residues. Models trained on such an imbalanced dataset tend to predict most residues as negative. For FMN binding site prediction, the SVM model built on the original dataset P81 gave a sensitivity of no more than 60%. Three methods for resolving the imbalance were used in this paper. (1) Some studies balance the positive and negative data by randomly extracting a very small fraction of the negative samples as training data [62–64]. Based on the optimal PSSM profile, a model was constructed in this way with a 1:1 ratio of positive to negative examples. However, the results are not dependable or objective, because a model constructed in this way does not cover most of the information in the original data. (2) The second method is the ensemble SVM, with which all negative samples were used during construction of the model. (3) The SMOTE technique was also used to construct a model.

Fig. 4 shows the ROC curves of the three methods, and all prediction results are listed in Table 1. The first method gave a satisfactory performance, with an Acc of 85.60%, Se of 85.46%, Sp of 85.62% and AUC of 0.9346, but, as noted above, it discards most of the negative data. With the ensemble SVM, Acc reached 87.30%, with Se of 86.85%, Sp of 87.35% and AUC of 0.9427. Although the accuracy of the ensemble classifier was only about 2 percentage points higher than that of the first method, it was much more reliable and robust. The SMOTE-based model yielded the highest accuracy, 90.46%, but the lowest sensitivity, 79.79%.
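The second method, balanced random undersampling followed by majority voting, can be sketched as below. This is a minimal illustration under stated assumptions rather than the authors' implementation: the base learner is a nearest-centroid stand-in so that the example runs with the standard library alone, whereas the paper trains RBF-kernel SVMs with LIBSVM, and the function names, the fixed seed and the tie rule (a 15:15 split is predicted as negative) are all hypothetical choices.

```python
import random


def train_centroid(X, y):
    """Stand-in base learner (NOT an SVM): nearest class centroid."""
    dims = len(X[0])
    centroids = {}
    for label in (0, 1):
        rows = [x for x, t in zip(X, y) if t == label]
        centroids[label] = [sum(r[d] for r in rows) / len(rows) for d in range(dims)]

    def predict(x):
        dist = {c: sum((a - b) ** 2 for a, b in zip(x, centroids[c])) for c in centroids}
        return 1 if dist[1] < dist[0] else 0

    return predict


def train_ensemble(X, y, n_models=30, train_fn=train_centroid, seed=0):
    """Train n_models sub-classifiers, each on all positives plus an
    equally sized random subset of the negatives (1:1 ratio)."""
    rng = random.Random(seed)
    pos = [x for x, t in zip(X, y) if t == 1]
    neg = [x for x, t in zip(X, y) if t == 0]
    models = []
    for _ in range(n_models):
        sub_neg = rng.sample(neg, len(pos))  # drawn without replacement
        models.append(train_fn(pos + sub_neg, [1] * len(pos) + [0] * len(sub_neg)))
    return models


def predict_majority(models, x):
    """Label x as binding (1) only if more than half of the models vote 1."""
    votes = sum(m(x) for m in models)
    return 1 if 2 * votes > len(models) else 0
```

With the P81 counts, 30 subsets of 1741 negatives amount to 52,230 draws from 20,424 negatives. If the subsets are drawn independently, as assumed here, a given negative is missed by every subset with probability (1 − 1741/20,424)^30 ≈ 0.07, so roughly 93% of the negatives are expected to appear in at least one sub-training set.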
Ensemble SVM is therefore more effective than SMOTE in solving this imbalance problem. The final classifier in this paper is the ensemble SVM model based on the training set P81 and the optimal PSSM profile.

Fig. 4. The ROC curves of the three methods: SVM, ensemble SVM and SMOTE, respectively.

Table 1. The results of SVM and SMOTE compared with the ensemble SVM method.

Methods         Acc (%)    Se (%)    Sp (%)    AUC
Ensemble SVM    87.30      86.85     87.35     0.9427
SMOTE           90.46      79.79     91.38     0.9261
SVM             85.60      85.46     85.62     0.9346

3.3. Performance of the independent dataset by ensemble SVM

The independent dataset P30 was used to assess the practical prediction ability of the method. As shown in Table 2, the method performed well: the sensitivity, specificity and accuracy were 71.50%, 89.20% and 87.87%, respectively.

Table 2. Prediction results of the independent dataset (P30) based on the proposed method.

Dataset    Sample numbers                    Acc (%)    Se (%)    Sp (%)
P30        Positive: 628; Negative: 7712     87.87      71.50     89.20

Two proteins, 1FX1_A and 1X77_A, were chosen as examples to further demonstrate the practical performance of our method. Fig. 5 shows representative prediction results in the context of the three-dimensional structures of 1FX1_A and 1X77_A. The TPs, TNs, FNs and FPs are shown in red, blue, green and yellow, respectively. For 1FX1_A, 19 of the 20 FMN-binding residues are correctly predicted and only 13 of the 127 non-binding residues are misclassified as binding ones. For 1X77_A, which has 21 binding residues and 153 non-binding residues, 17 binding residues and 133 non-binding residues are predicted correctly, so the overall prediction accuracy for this protein is 86.21%. The detailed prediction results for all proteins in P30 are listed in Supplementary Table S3. The results indicate that the proposed method can be a useful tool for predicting FMN-binding sites and is of practical value for understanding FMN-protein interactions.
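The figures quoted in this section can be cross-checked with a few lines of arithmetic. The sketch below recomputes the per-protein accuracy from the residue counts given above and shows that the P30 accuracy in Table 2 is the class-size-weighted average of Se and Sp. The helper names are illustrative, and the confusion-matrix entries for each protein are derived from the counts in the text rather than taken from Supplementary Table S3.

```python
def accuracy(tp, tn, fp, fn):
    """Overall accuracy in percent, as in Eq. (7)."""
    return 100.0 * (tp + tn) / (tp + tn + fp + fn)


def accuracy_from_rates(se, sp, n_pos, n_neg):
    """Acc (%) implied by Se (%) and Sp (%) for given class sizes."""
    return (se * n_pos + sp * n_neg) / (n_pos + n_neg)


# 1FX1_A: 19 of 20 binding residues found, 13 of 127 non-binding misclassified.
acc_1fx1 = accuracy(tp=19, tn=127 - 13, fp=13, fn=20 - 19)

# 1X77_A: 17 of 21 binding and 133 of 153 non-binding residues correct.
acc_1x77 = accuracy(tp=17, tn=133, fp=153 - 133, fn=21 - 17)

# P30 (Table 2): Se = 71.50%, Sp = 89.20%, 628 positives, 7712 negatives.
acc_p30 = accuracy_from_rates(71.50, 89.20, 628, 7712)

print(round(acc_1fx1, 2), round(acc_1x77, 2), round(acc_p30, 2))
# prints 90.48 86.21 87.87
```

Because negatives make up about 92% of P30, the overall accuracy sits close to the specificity, which is why Se and Sp are reported alongside Acc.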

Fig. 5. Representative prediction results are shown in the context of three-dimensional structures for 1X77_A and 1FX1_A. The correctly predicted FMN-binding residues are in red; the correctly predicted non-binding residues are in blue; the binding residues predicted as negatives are in green; the non-binding residues predicted as positives are in yellow. (For interpretation of the references to color in this figure caption, the reader is referred to the web version of this article.)

4. Conclusion

In this paper, a new computational method was developed to predict FMN binding sites using the ensemble SVM and the modified PSSM profile. This profile, scaled by the sliding window and the smoothing window, incorporates information on the residues surrounding the FMN binding sites. The prediction performance demonstrates that the method is an efficient way to predict FMN binding sites, and that the ensemble SVM can effectively solve the imbalanced dataset problem. We hope that our method can be a useful supplement for identifying unknown FMN-binding sites. User-friendly and publicly accessible web servers represent the future direction for developing practically more useful models, simulated methods or predictors [65]. We shall make every effort in future work to provide a web server for the method presented in this paper.

5. Summary

Flavin mono-nucleotide (FMN) is a cofactor of flavoproteins and is closely involved in biological processes such as energy production and cellular respiration. By binding to proteins, FMN affects their functions, so it is essential to identify the FMN binding sites of proteins. In this study, a computational method based on the support vector machine (SVM) was proposed to identify FMN binding sites from the amino acid sequences of proteins alone. A modified position specific score matrix (PSSM), scaled with a sliding window and a smoothing window, was used to characterize the local sequence environment of the FMN binding sites. Compared with the original PSSM profile, a visible improvement in performance was obtained, with an accuracy of 85.60% and an area under the receiver operating characteristic curve (AUC) of 0.93. In addition, the synthetic minority over-sampling technique (SMOTE) and the ensemble SVM were used to solve the imbalanced data problem, and the ensemble SVM classifier showed the better performance. Finally, an independent dataset was built to further evaluate the practical performance of the method, on which a satisfactory accuracy of 87.87% was achieved. This demonstrates that the method is effective in predicting FMN-binding sites in proteins.

Conflict of interest statement

None declared.

Acknowledgement

This work was supported by the National Natural Science Foundation of China (Nos. 20905054, 20972103, 20973115) and the Specialized Research Fund for the Doctoral Program of Higher Education (No. 20090181120058). We greatly appreciate the anonymous reviewers for their patient review and valuable suggestions.

Appendix A. Supporting information

Supplementary data associated with this article can be found in the online version at http://dx.doi.org/10.1016/j.compbiomed.2012.08.005.

References [1] L. Campos, J. Sancho, Native-specific stabilization of flavodoxin by the FMN cofactor: structural and thermodynamical explanation, Proteins 63 (2006) 581–594. [2] J.M. Granjeiro, C.V. Ferreira, M.B. Jucfi, E.M. Taga, H. Aoyama, Bovine kidney low molecular weight acid phosphatase: FMN-dependent kinetics, IUBMB Life 41 (1997) 1201–1208. [3] S. Maldonado, A. Lostao, I. MP., F.-R. J., G.G. C., B.G. E., Rubio J.A., A. Luquita, F. Daoudi, J. Sancho, Apoflavodoxin: structure, stability, and FMN binding, Biochimie 80 (1998) 813–820. [4] M. Martı´nez-Ju´lvez, N. Cremades, M. Bueno, I. Pe´rez-Dorado, C. Maya, S. Cuesta-Lo´pez, D. Prada, F. Falo, J.A. Hermoso, J. Sancho, Common conformational changes in flavodoxins induced by FMN and anion binding: the structure of helicobacter pylori apoflavodoxin, Proteins 69 (2007) 581–594. [5] M. Akimoto, Y. Sato, T. Okubo, H. Todo, T. Hasegawa, K. Sugibayashi, Conversion of FAD to FMN and riboflavin in plasma: effects of measuring method, Biol. Pharm. Bull. 29 (2006) 1779–1782. [6] M. Morita, S. Nakamura, K. Shimizu, Highly accurate method for ligandbinding site prediction in unbound state (apo) protein structures, Proteins 73 (2008) 468–479. [7] M. Saito, M. Go, T. Shirai, An empirical approach for detecting nucleotidebinding sites on proteins, Protein Eng. Des. Sel. 19 (2006) 67–75. [8] L.A. Kelley, P.J. Shrimpton, S.H. Muggleton, M.J.E. Sternberg, Discovering rules for protein–ligand specificity using support vector inductive logic programming, Protein Eng. Des. Sel. 22 (2009) 561–567. [9] T. Dietterich, Ensemble methods in machine learning, Lecture Notes in Computer Science. vol. 1857, 2000, pp. 1–15. [10] N.V. Chawla, K.W. Bowyer, L.O. Hall, W.P. Kegelmeyer, SMOTE: synthetic minority over-sampling technique, J. Artif. Intell. Res. 16 (2002) 321–357. [11] K.C. Chou, Some remarks on protein attribute prediction and pseudo amino acid composition, J. Theor. Biol. 273 (2011) 236–247.

¨ [12] R.A. Bauer, S. Gunther, D. Jansen, C. Heeger, P.F. Thaben, R. Preissner, SuperSite: dictionary of metabolite and drug binding sites in proteins, Nucleic Acids Res. 37 (2009) D195. [13] H.M. Berman, J. Westbrook, Z. Feng, G. Gilliland, T. Bhat, H. Weissig, I.N. Shindyalov, P.E. Bourne, The protein data bank, Nucleic Acids Res. 28 (2000) 235–242. [14] S.F. Altschul, T.L. Madden, A.A. Sch ffer, J. Zhang, Z. Zhang, W. Miller, D.J. Lipman, Gapped BLAST and PSI-BLAST: a new generation of protein database search programs, Nucleic Acids Res. 25 (1997) 3389. [15] V. Sobolev, A. Sorokine, J. Prilusky, E.E. Abola, M. Edelman, Automated analysis of interatomic contacts in proteins, Bioinformatics 15 (1999) 327. [16] K.C. Chou, Structural bioinformatics and its impact to biomedical science, Curr. Med. Chem. 11 (2004) 2105–2134. [17] K.C. Chou, The convergence-divergence duality in lectin domains of selectin family and its implications, FEBS Lett. 363 (1995) 123–126. ¨ [18] A.A. Schaffer, L. Aravind, T.L. Madden, S. Shavirin, J.L. Spouge, Y.I. Wolf, E.V. Koonin, S.F. Altschul, Improving the accuracy of PSI-BLAST protein database searches with composition-based statistics and other refinements, Nucleic Acids Res. 29 (2001) 2994–3005. [19] S. Ahmad, A. Sarai, PSSM-based prediction of DNA binding sites in proteins, BMC Bioinformatics 6 (2005) 33. [20] J. Chauhan, N. Mishra, G. Raghava, Identification of ATP binding residues of a protein from its primary sequence, BMC Bioinformatics 10 (2009) 434. [21] C. Wang, Y. Fang, J. Xiao, M. Li, Identification of RNA-binding sites in proteins by integrating various sequence information, Amino acids 40 (2010) 1–10. [22] X. Guang, Y. Guo, J. Xiao, X. Wang, J. Sun, W. Xiong, M. Li, Predicting the state of cysteines based on sequence information, J. Theor. Biol. 267 (2010) 312–318. [23] W. Xiong, Y. Guo, M. Li, Prediction of lipid-binding sites based on support vector machine and position specific scoring matrix, Protein J. 29 (2010) 1–5. 
[24] C.W. Cheng, E. Su, J.K. Hwang, T.Y. Sung, W.L. Hsu, Predicting RNA-binding sites of proteins using support vector machines and evolutionary information, BMC Bioinformatics 9 (2008) S6.
[25] K.C. Chou, Prediction of protein cellular attributes using pseudo-amino acid composition, Proteins: Structure, Funct. Bioinformatics 43 (2001) 246–255.
[26] L. Deng, J. Guan, Q. Dong, S. Zhou, Prediction of protein–protein interaction sites using an ensemble method, BMC Bioinformatics 10 (2009) 426.
[27] Y.D. Cai, L. Lu, Predicting N-terminal acetylation based on feature selection method, Biochem. Biophys. Res. Commun. 372 (2008) 862–865.
[28] N. Mishra, G. Raghava, Prediction of FAD interacting residues in a protein from its primary sequence using evolutionary information, BMC Bioinformatics 11 (2010) S48.
[29] K. Nakai, A. Kidera, M. Kanehisa, Cluster analysis of amino acid indices for prediction of protein structure and function, Protein Eng. Des. Sel. 2 (1988) 93–100.
[30] K. Tomii, M. Kanehisa, Analysis of amino acid indices and mutation matrices for sequence comparison and structure prediction of proteins, Protein Eng. Des. Sel. 9 (1996) 27–36.
[31] S. Kawashima, H. Ogata, M. Kanehisa, AAindex: amino acid index database, Nucleic Acids Res. 27 (1999) 368.
[32] S. Kawashima, M. Kanehisa, AAindex: amino acid index database, Nucleic Acids Res. 28 (2000) 374.
[33] T.Y. Lee, J.B.K. Hsu, F.M. Lin, W.C. Chang, P.C. Hsu, H.D. Huang, N-Ace: using solvent accessibility and physicochemical properties to identify protein N-acetylation sites, J. Comput. Chem. 31 (2010).
[34] L. Xu, M.Y. Chow, A classification approach for power distribution systems fault cause identification, IEEE Trans. Power Syst. 21 (2006) 53–60.
[35] Z.H. Zhou, X.Y. Liu, Training cost-sensitive neural networks with methods addressing the class imbalance problem, IEEE Trans. Knowl. Data Eng. 18 (2006) 63–77.
[36] M.C. Chen, L.S. Chen, C.C. Hsu, W.R. Zeng, An information granulation based data mining approach for classifying imbalanced data, Inf. Sci. (NY) 178 (2008) 3214–3227.
[37] V.N. Vapnik, Statistical Learning Theory, Wiley, New York, 1998.
[38] C.C. Chang, C.J. Lin, LIBSVM: a library for support vector machines, 2001, Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
[39] N. Zhang, G. Duan, S. Gao, J. Ruan, T. Zhang, Prediction of the parallel/antiparallel orientation of beta-strands using amino acid pairing preferences and support vector machines, J. Theor. Biol. 263 (2010) 360–368.
[40] C. Chen, L. Chen, X. Zou, P. Cai, Prediction of protein secondary structure content by using the concept of Chou’s pseudo amino acid composition and support vector machine, Protein Pept. Lett. 16 (2009) 27–31.
[41] H. Mohabatkar, M. Mohammad Beigi, A. Esmaeili, Prediction of GABAA receptor proteins using the concept of Chou’s pseudo-amino acid composition and support vector machine, J. Theor. Biol. (2011).
[42] X.B. Zhou, C. Chen, Z.C. Li, X.Y. Zou, Using Chou’s amphiphilic pseudo-amino acid composition and support vector machine for prediction of enzyme subfamily classes, J. Theor. Biol. 248 (2007) 546–551.
[43] J.D. Qiu, J.H. Huang, S.P. Shi, R.P. Liang, Using the concept of Chou’s pseudo amino acid composition to predict enzyme family classes: an approach with support vector machine based on discrete wavelet transform, Protein Pept. Lett. 17 (2010) 715–722.
[44] Y.D. Cai, G.P. Zhou, K.C. Chou, Support vector machines for predicting membrane protein types by using functional domain composition, Biophys. J. 84 (2003) 3257–3263.
[45] K.C. Chou, H.B. Shen, Recent progress in protein subcellular location prediction, Anal. Biochem. 370 (2007) 1–16.

[46] Y. Xu, X.B. Wang, J. Ding, L.Y. Wu, N.Y. Deng, Lysine acetylation sites prediction using an ensemble of support vector machine classifiers, J. Theor. Biol. 264 (2010) 130–135.
[47] H.B. Shen, K.C. Chou, Ensemble classifier for protein fold pattern recognition, Bioinformatics 22 (2006) 1717–1722.
[48] D. Ruta, B. Gabrys, Classifier selection for majority voting, Inf. Fusion 6 (2005) 63–81.
[49] H.C. Kim, S. Pang, H.M. Je, D. Kim, S. Yang Bang, Constructing support vector machine ensemble, Pattern Recognition 36 (2003) 2757–2767.
[50] S. Garcia, J. Derrac, I. Triguero, C.J. Carmona, F. Herrera, Evolutionary-based selection of generalized instances for imbalanced classification, Knowl. Based Syst. 25 (2012) 3–12.
[51] Y. Tang, Y.Q. Zhang, N.V. Chawla, S. Krasser, SVMs modeling for highly imbalanced classification, IEEE Trans. Syst. Man Cybern. Part B: Cybern. 39 (2009) 281–288.
[52] K.C. Chou, C.T. Zhang, Prediction of protein structural classes, Crit. Rev. Biochem. Mol. Biol. 30 (1995) 275–349.
[53] M. Esmaeili, H. Mohabatkar, S. Mohsenzadeh, Using the concept of Chou’s pseudo amino acid composition for risk type prediction of human papillomaviruses, J. Theor. Biol. 263 (2010) 203–209.
[54] C. Chen, Z.B. Shen, X.Y. Zou, Dual-layer wavelet SVM for predicting protein structural class via the general form of Chou’s pseudo amino acid composition, Protein Pept. Lett. 19 (2012) 422–429.
[55] K.C. Chou, Z.C. Wu, X. Xiao, iLoc-Hum: using the accumulation-label scale to predict subcellular locations of human proteins with both single and multiple sites, Mol. BioSyst. 8 (2012) 629–641.

[56] Z.C. Wu, X. Xiao, K.C. Chou, iLoc-Gpos: a multi-layer classifier for predicting the subcellular localization of singleplex and multiplex Gram-positive bacterial proteins, Protein Pept. Lett. 19 (2012) 4–14.
[57] K.C. Chou, Z.C. Wu, X. Xiao, iLoc-Euk: a multi-label classifier for predicting the subcellular localization of singleplex and multiplex eukaryotic proteins, PLoS ONE 6 (2011) e18258.
[58] M. Hayat, A. Khan, Discriminating outer membrane proteins with fuzzy K-nearest neighbor algorithms based on the general form of Chou’s PseAAC, Protein Pept. Lett. 19 (2012) 411–421.
[59] R. Kohavi, A study of cross-validation and bootstrap for accuracy estimation and model selection, in: Proceedings of the 14th International Joint Conference on Artificial Intelligence, Citeseer, 1995, pp. 1137–1145.
[60] T. Fawcett, An introduction to ROC analysis, Pattern Recognition Lett. 27 (2006) 861–874.
[61] A.P. Bradley, The use of the area under the ROC curve in the evaluation of machine learning algorithms, Pattern Recognition 30 (1997) 1145–1159.
[62] F. Gnad, S. Ren, C. Choudhary, J. Cox, M. Mann, Predicting post-translational lysine acetylation using support vector machines, Bioinformatics 26 (2010) 1666–1668.
[63] L. Kiemer, J.D. Bendtsen, N. Blom, NetAcet: prediction of N-terminal acetylation sites, Bioinformatics 21 (2005) 1269–1270.
[64] Y. Liu, Y. Lin, A novel method for N-terminal acetylation prediction, Genomics Proteomics Bioinformatics 2 (2004) 253–255.
[65] K.C. Chou, H.B. Shen, Review: recent advances in developing web-servers for predicting protein attributes, Nat. Sci. 1 (2009) 63–92.