Predicting lysine phosphoglycerylation with fuzzy SVM by incorporating k-spaced amino acid pairs into Chou's general PseAAC


Journal of Theoretical Biology ∎ (∎∎∎∎) ∎∎∎–∎∎∎


Contents lists available at ScienceDirect

Journal of Theoretical Biology journal homepage: www.elsevier.com/locate/yjtbi

Zhe Ju, Jun-Zhe Cao, Hong Gu*

School of Control Science and Engineering, Dalian University of Technology, #2 Ling-gong Road, Dalian 116024, People's Republic of China

HIGHLIGHTS

• A novel predictor is built to identify lysine phosphoglycerylation sites.
• The CKSAAP feature is used to analyze and predict lysine phosphoglycerylation sites.
• The proposed method is more effective than the existing method.
• A MATLAB software package is available for prediction.

ARTICLE INFO

Article history:
Received 27 November 2015
Received in revised form 14 February 2016
Accepted 15 February 2016

Keywords:
Post-translational modification
Phosphoglycerylation
k-spaced amino acid pairs
Fuzzy support vector machine

ABSTRACT

As a new type of post-translational modification, lysine phosphoglycerylation plays a key role in regulating the glycolytic process and metabolism in cells. Because traditional experimental methods are time-consuming and labor-intensive, it is important to develop computational methods to identify potential phosphoglycerylation sites. However, the prediction performance of the existing phosphoglycerylation site predictor is not satisfactory. In this study, a novel predictor named CKSAAP_PhoglySite is developed to predict phosphoglycerylation sites by using the composition of k-spaced amino acid pairs and a fuzzy support vector machine. On the one hand, after assessing several encoding schemes, we find that the composition of k-spaced amino acid pairs is more suitable for representing the protein sequence around phosphoglycerylation sites than other encoding schemes. On the other hand, the proposed fuzzy support vector machine algorithm can effectively handle the imbalance and noise in the phosphoglycerylation site training dataset. Experimental results indicate that CKSAAP_PhoglySite significantly outperforms the existing phosphoglycerylation site predictor Phogly-PseAAC. A MATLAB software package for CKSAAP_PhoglySite can be freely downloaded from https://github.com/juzhe1120/Matlab_Software/blob/master/CKSAAP_PhoglySite_Matlab_Software.zip.

© 2016 Published by Elsevier Ltd.

* Corresponding author. Tel.: +86 411 84705858.
E-mail addresses: [email protected] (Z. Ju), [email protected] (J.-Z. Cao), [email protected] (H. Gu).

http://dx.doi.org/10.1016/j.jtbi.2016.02.020
0022-5193/© 2016 Published by Elsevier Ltd.

Please cite this article as: Ju, Z., et al., Predicting lysine phosphoglycerylation with fuzzy SVM by incorporating k-spaced amino acid pairs into Chou's general PseAAC. J. Theor. Biol. (2016), http://dx.doi.org/10.1016/j.jtbi.2016.02.020

1. Introduction

Some of the important post-translational modifications (PTMs), such as acetylation, ubiquitination, methylation and sumoylation, can occur at the active ε-amino groups of specific lysine residues (Lanouette et al., 2014). These protein lysine modifications play a key role in regulating various protein functions (Liu et al., 2014). Recently, a new type of non-enzymatic lysine modification, phosphoglycerylation, has been identified in both mouse liver and human cells. Phosphoglycerylation is a dynamic and reversible biochemical process in which a primary glycolytic intermediate, 1,3-bisphosphoglycerate (1,3-BPG), reacts with specific lysine residues in a substrate protein to form 3-phosphoglyceryl-lysine (pgK) (Moellering and Cravatt, 2013). Furthermore, Moellering and Cravatt (2013) found that phosphoglycerylation inhibits glycolytic enzymes and, in cells exposed to high glucose, accumulates on these enzymes to create a potential feedback mechanism that contributes to the buildup and redirection of glycolytic intermediates to alternate biosynthetic pathways. However, the role of phosphoglycerylation in cellular regulation remains poorly characterized.

To better understand the molecular mechanisms of phosphoglycerylation, the fundamental step is to identify phosphoglycerylation substrates and sites with high accuracy. Large-scale proteomics methods such as mass spectrometry are usually time-consuming and labor-intensive. Computational studies of PTMs are therefore gaining increasing attention (Chou, 2015; Xu and Chou, 2016), such as the identification of protein cysteine S-nitrosylation sites (Xu et al., 2013a, 2013b), methylation sites (Qiu et al., 2014; Ju et al., 2015), hydroxylation sites (Xu et al., 2014b), nitrotyrosine sites (Xu et al., 2014c) and ubiquitination sites (Qiu et al., 2015). Recently, Xu et al. (2015a) developed a predictor named Phogly-PseAAC to predict phosphoglycerylation sites using position-specific amino acid propensity and the k-nearest neighbor (KNN) algorithm. However, the prediction performance of Phogly-PseAAC, with a specificity of 75.57% and a sensitivity of 68.87%, is not satisfactory.

To improve the performance of computational phosphoglycerylation predictors, it is important to find an effective encoding technique for representing the sequence context around phosphoglycerylation sites. In addition, the number of phosphoglycerylation sites is much smaller than that of non-phosphoglycerylation sites, which means the training dataset is highly imbalanced. Moreover, due to the limitations of experimental conditions and techniques, there may be some noisy samples in the training dataset. In particular, the non-annotated lysine residues may contain a small number of phosphoglycerylation sites that have not yet been experimentally validated.

To address the above issues, a novel predictor named CKSAAP_PhoglySite was proposed to predict phosphoglycerylation sites from protein sequences using the composition of k-spaced amino acid pairs (CKSAAP) and a fuzzy support vector machine (SVM). On the one hand, after assessing several encoding schemes, we found that the CKSAAP encoding is more suitable for representing the protein sequence around phosphoglycerylation sites than other encoding schemes.
On the other hand, to overcome the disadvantage of an imbalanced and noisy training dataset, an imbalanced fuzzy support vector machine algorithm was developed to construct a stable classifier for predicting phosphoglycerylation sites. In fact, fuzzy approaches have been used by many previous investigators to study various important biological problems (Ding and Zhang, 2007, 2008; Wang and Xiao, 2011; Xiao and Wang, 2011; Xiao et al., 2013a, 2013b, 2013c). We compared our method with two popular imbalanced SVM algorithms (DEC and FSVM-CIL) and one existing phosphoglycerylation site predictor (Phogly-PseAAC) on the training dataset. Experimental results indicated that our method is more effective and can identify more phosphoglycerylation sites from query proteins than the other methods. Finally, we also analyzed the differences between phosphoglycerylation and non-phosphoglycerylation sites. These analytical and predictive results might offer useful clues for studying the mechanisms of phosphoglycerylation and for related experimental validation.

As demonstrated by a series of recent publications (Chen et al., 2014; Lin et al., 2014; Liu et al., 2015a; Jia et al., 2015) in compliance with Chou's 5-step rule (Chou, 2011), to establish a really useful sequence-based statistical predictor for a biological system, the following five guidelines should be observed: (a) construct a valid benchmark dataset to train and test the predictor; (b) extract effective features from protein sequences that truly reflect their intrinsic correlation with the target to be predicted; (c) develop a powerful machine learning algorithm to operate the prediction; (d) properly perform cross-validation tests to evaluate the performance of the predictor; (e) establish a user-friendly web-server (or software package) for the predictor that is accessible to the public. Below, we describe these steps one by one.

2. Materials and methods

2.1. Dataset

Xu's training set (Xu et al., 2015a) was used to train and evaluate our model. It was extracted from the protein lysine modification database CPLM (Liu et al., 2014) and consists of 106 experimentally annotated phosphoglycerylation sites and 1408 lysine sites without a phosphoglycerylation annotation. The sliding window method was used to encode every lysine residue K in the dataset. According to Xu's work (Xu et al., 2015a) and our preliminary trials (see Supplementary material S1), the window size was selected as 15. Thus, every sample of the training and testing datasets was represented as a peptide segment of length 15, with 7 residues upstream and 7 residues downstream of the central lysine residue K. To unify the length of each peptide, the dummy residue 'X' was used to fill positions without sufficient residues. The phosphoglycerylated peptides were used as positive samples, while the non-phosphoglycerylated peptides were used as negative samples. The training set is provided in Supplementary material S2.

2.2. Feature extraction and coding

After assessing several encoding schemes, we found that the composition of k-spaced amino acid pairs (CKSAAP) encoding is more suitable for representing the protein sequence around phosphoglycerylation sites than other encoding schemes, including amino acid composition (AAC), binary encoding (BE), position-specific scoring matrix (PSSM) and secondary structure (SS) (see Supplementary material S3). Feature extraction methods such as AAC, BE, CKSAAP, PSSM and SS can all be covered by the general pseudo-amino acid composition (PseAAC). Since the concept of pseudo-amino acid composition (Chou, 2001), or Chou's PseAAC (Du et al., 2012; Cao et al., 2013; Lin and Lapointe, 2013), was proposed, it has been widely used in computational proteomics (Du et al., 2014; Chou, 2015).
Because it has been widely and increasingly used, four open-access software tools, called 'PseAAC-Builder' (Du et al., 2012), 'propy' (Cao et al., 2013), 'PseAAC-General' (Du et al., 2014), and 'Pse-in-One' (Liu et al., 2015b), were established: the first and second generate various modes of Chou's special PseAAC; the third generates various modes of Chou's general PseAAC; the fourth can generate not only varieties of PseAAC defined by users themselves but also various feature vectors for DNA/RNA sequences (Chen and Lin, 2015).

For a given peptide, the CKSAAP feature contains 441 types of residue pairs (AA, AC, AD, …, XX). The CKSAAP encoding scheme calculates the occurrence frequencies of the k-spaced amino acid pairs in the peptide, where a k-spaced amino acid pair is a pair of residues separated by k other residues. For example, the CKSAAP encoding of a peptide for k = 2 is a 441-dimensional feature vector defined as:

$$\left( \frac{N_{AxxA}}{N_{Total}}, \frac{N_{AxxC}}{N_{Total}}, \frac{N_{AxxD}}{N_{Total}}, \ldots, \frac{N_{XxxX}}{N_{Total}} \right)^{T}_{441} \qquad (1)$$

where 'x' means any one of the 21 residue types (the 20 amino acids and the dummy residue 'X') and $N_{Total}$ is the total number of 2-spaced amino acid pairs in the peptide. Here, CKSAAP with k = 0, 1, 2, 3 and 4 was used to encode each training peptide as a 2205-dimensional vector (5 × 441).

2.3. Feature selection

The F-score feature selection method (Chen and Lin, 2006) was employed to remove the redundant and irrelevant features.
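To make the encoding of Section 2.2 concrete, the following Python sketch pads a lysine-centred window with 'X' and computes the CKSAAP vector of Eq. (1) for k = 0 to 4. It is an illustration only, not the authors' MATLAB package: the function names `extract_window` and `cksaap` and the ordering of the alphabet are assumptions made here, and residues outside the 21-letter alphabet are not handled.

```python
from itertools import product

# 20 standard amino acids plus the dummy residue 'X' used for padding.
# The ordering (alphabetical, 'X' last) is an assumption of this sketch.
ALPHABET = "ACDEFGHIKLMNPQRSTVWYX"
PAIRS = ["".join(p) for p in product(ALPHABET, repeat=2)]  # 441 ordered pairs
PAIR_INDEX = {pair: i for i, pair in enumerate(PAIRS)}


def extract_window(sequence, site, flank=7, pad="X"):
    """Return the peptide of length 2*flank+1 centred on 0-based index `site`."""
    left = max(0, site - flank)
    right = min(len(sequence), site + flank + 1)
    left_pad = pad * (flank - (site - left))
    right_pad = pad * (flank - (right - site - 1))
    return left_pad + sequence[left:right] + right_pad


def cksaap(peptide, k_values=(0, 1, 2, 3, 4)):
    """Concatenate the 441 k-spaced pair frequencies for each k in k_values."""
    features = []
    for k in k_values:
        counts = [0] * len(PAIRS)
        n_total = len(peptide) - k - 1  # number of k-spaced pairs in the peptide
        for i in range(n_total):
            # A KeyError here means the peptide contains a non-standard residue.
            counts[PAIR_INDEX[peptide[i] + peptide[i + k + 1]]] += 1
        features.extend(c / n_total for c in counts)
    return features


peptide = extract_window("MAKLV", site=2)  # toy sequence, lysine at index 2
vector = cksaap(peptide)
print(peptide, len(vector))
```

For a 15-residue window, each k contributes 441 frequencies computed over 15 − k − 1 pairs, so the concatenated vector has 5 × 441 = 2205 entries.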


The F-score of the j-th feature is defined as

$$F(j) = \frac{\left(\bar{x}_j^{(+)} - \bar{x}_j\right)^2 + \left(\bar{x}_j^{(-)} - \bar{x}_j\right)^2}{\frac{1}{m^+ - 1}\sum_{k=1}^{m^+}\left(x_{k,j}^{(+)} - \bar{x}_j^{(+)}\right)^2 + \frac{1}{m^- - 1}\sum_{k=1}^{m^-}\left(x_{k,j}^{(-)} - \bar{x}_j^{(-)}\right)^2} \qquad (2)$$

where $\bar{x}_j$, $\bar{x}_j^{(+)}$ and $\bar{x}_j^{(-)}$ are the mean values of the j-th feature in the whole, positive and negative training samples, respectively; $m^+$ is the number of positive training samples; $m^-$ is the number of negative training samples; $x_{k,j}^{(+)}$ is the j-th feature of the k-th positive training sample; and $x_{k,j}^{(-)}$ is the j-th feature of the k-th negative training sample.

2.4. Prediction methods
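The F-score of Eq. (2) in Section 2.3 is simple to compute, and a minimal standard-library sketch is given here as an illustration. The function names `f_scores` and `top_features` are hypothetical, and the handling of a zero denominator is an assumption of this sketch, since Eq. (2) is undefined in that case.

```python
from statistics import mean, variance


def f_scores(samples, labels):
    """Return the F-score of Eq. (2) for every feature column.

    samples: list of equal-length feature lists; labels: +1 or -1 per sample.
    Each class needs at least two samples for the sample variance to exist.
    """
    pos = [x for x, t in zip(samples, labels) if t == 1]
    neg = [x for x, t in zip(samples, labels) if t == -1]
    scores = []
    for j in range(len(samples[0])):
        col_all = [x[j] for x in samples]
        col_pos = [x[j] for x in pos]
        col_neg = [x[j] for x in neg]
        numerator = (mean(col_pos) - mean(col_all)) ** 2 + (mean(col_neg) - mean(col_all)) ** 2
        # statistics.variance is the sample variance, i.e. it divides by m - 1.
        denominator = variance(col_pos) + variance(col_neg)
        # Assumption: a feature that is constant within both classes scores 0.
        scores.append(numerator / denominator if denominator > 0 else 0.0)
    return scores


def top_features(samples, labels, s):
    """Indices of the s highest-scoring features, best first."""
    scores = f_scores(samples, labels)
    return sorted(range(len(scores)), key=lambda j: -scores[j])[:s]


toy_x = [[1.0, 5.0], [2.0, 5.5], [8.0, 5.2], [9.0, 4.9]]
toy_t = [1, 1, -1, -1]
print(f_scores(toy_x, toy_t), top_features(toy_x, toy_t, 1))
```

In the toy data the first feature separates the two classes clearly and the second does not, so the first feature receives the much larger score and is ranked first.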

2.4.1. Fuzzy support vector machine

As an effective and popular machine learning algorithm, SVM has been widely applied to the prediction of various PTM sites (Xu et al., 2014a, 2015b; Qiu et al., 2014). In a standard SVM, each training sample is considered to be of equal importance and is assigned the same weight. However, the training set for the prediction of phosphoglycerylation sites is highly imbalanced (the ratio between phosphoglycerylated and non-phosphoglycerylated peptides is roughly 1:13). In addition, there may be some noisy samples in the training dataset. Therefore, it is more reasonable to assign different weights to different samples on the basis of their importance and the class imbalance than to assign the same weight to all of them. In this study, a fuzzy SVM was developed to distinguish phosphoglycerylation from non-phosphoglycerylation sites.

To facilitate the description, Xu's training set is denoted as $(X, T) = \{(x_i, t_i),\ i = 1, 2, \ldots, l\}$. Assume that the first p examples are positive (i.e., $t_i = 1,\ i = 1, 2, \ldots, p$), while the rest are negative (i.e., $t_i = -1,\ i = p+1, p+2, \ldots, l$). The fuzzy SVM can be written as follows:

$$\min_{\omega, \xi}\ \frac{1}{2}\|\omega\|^2 + C^+ \sum_{i=1}^{p} s_i^+ \xi_i + C^- \sum_{i=p+1}^{l} s_i^- \xi_i$$

$$\text{s.t.}\quad t_i\left(\omega \cdot \Phi(x_i) + b\right) \ge 1 - \xi_i,\quad \xi_i \ge 0,\quad i = 1, 2, \ldots, l \qquad (3)$$

where $\Phi(x)$ is the non-linear feature mapping; $\xi_i$, $i = 1, 2, \ldots, l$, are slack variables; $C^+$ and $C^-$ are the misclassification penalty factors for positive and negative examples, respectively; $s_i^+$ and $s_i^-$ are the fuzzy memberships, which reflect the importance of $x_i$ in its own class.

2.4.2. Assigning fuzzy membership values

Lin and Wang (2002) proposed the first fuzzy SVM algorithm, named F-SVM, in which the fuzzy membership is defined by the distance between a training sample and its class center. A sample is considered more important and assigned a higher fuzzy membership value when it is closer to its class center, whereas it is treated as less important (such as noise or an outlier) and assigned a lower fuzzy membership value when it is farther away from its class center. The fuzzy membership in Lin and Wang (2002) is given as follows:

$$s_i^+ = 1 - \frac{d_i^{cen+}}{\max_j\left(d_j^{cen+}\right) + \delta},\quad i = 1, 2, \ldots, p \qquad (4)$$

$$s_i^- = 1 - \frac{d_i^{cen-}}{\max_j\left(d_j^{cen-}\right) + \delta},\quad i = p+1, p+2, \ldots, l \qquad (5)$$

where $d_i^{cen+} = \left\| x_i - \frac{1}{p}\sum_{j=1}^{p} x_j \right\|$, $d_i^{cen-} = \left\| x_i - \frac{1}{l-p}\sum_{j=p+1}^{l} x_j \right\|$, and $\delta$ is a very small positive value that guarantees the fuzzy membership is always greater than zero. However, for irregularly distributed samples, this method may treat some noisy samples as normal samples, because it does not take into account the closeness of the samples.

In this study, a simple K-nearest-neighbor strategy was used to estimate the closeness around training samples. For a positive sample $x_i$, the K nearest neighbors of $x_i$ in the positive training dataset are denoted as $N_K^+(x_i)$; the average distance between $x_i$ and $N_K^+(x_i)$ is defined as:

$$D_i^+ = \frac{1}{K}\sum_{x_j \in N_K^+(x_i)} \|x_i - x_j\| \qquad (6)$$

Similarly, for a negative sample $x_i$, the K nearest neighbors of $x_i$ in the negative training dataset are denoted as $N_K^-(x_i)$; the average distance between $x_i$ and $N_K^-(x_i)$ is defined as:

$$D_i^- = \frac{1}{K}\sum_{x_j \in N_K^-(x_i)} \|x_i - x_j\| \qquad (7)$$

The higher the value of $D_i^+$ ($D_i^-$), the sparser the sample density, and the more likely the sample $x_i$ is to be noise or an outlier in its own class. Here, the fuzzy membership is defined not only by the distance between a sample and its class center, but also by the closeness around the sample. According to Eqs. (4)–(7), the fuzzy membership can be defined as follows:

$$s_i^+ = \left(1 - \alpha \cdot \frac{D_i^+ - \min_j D_j^+}{\max_j D_j^+ - \min_j D_j^+ + \delta} - (1 - \alpha) \cdot \frac{d_i^{cen+}}{\max_j\left(d_j^{cen+}\right) + \delta}\right)^m,\quad i = 1, 2, \ldots, p \qquad (8)$$

$$s_i^- = \left(1 - \alpha \cdot \frac{D_i^- - \min_j D_j^-}{\max_j D_j^- - \min_j D_j^- + \delta} - (1 - \alpha) \cdot \frac{d_i^{cen-}}{\max_j\left(d_j^{cen-}\right) + \delta}\right)^m,\quad i = p+1, p+2, \ldots, l \qquad (9)$$

where $\alpha \in [0, 1]$ and $m > 0$.

As mentioned above, the training set for the prediction of phosphoglycerylation sites is highly imbalanced. By assigning a larger penalty factor to the positive (minority class) samples than to the negative (majority class) samples, the effects of class imbalance can be reduced. Based on the results in Veropoulos et al. (1999) and Batuwita and Palade (2013), reasonably good classification results for the DEC algorithm can be obtained when $C^+/C^-$ is set to the majority-to-minority class ratio. Here, the penalty factor $C^+$ is set to $C(l - p)/p$ and the penalty factor $C^-$ is set to $C$ ($C > 0$).

2.5. Cross-validation and performance assessment

The independent dataset test, K-fold cross-validation test, and jackknife test are often used to evaluate the anticipated accuracy of a predictor in statistical prediction. Among the three methods, the jackknife test is considered to be the most objective and least arbitrary because it always yields a unique result for a given training dataset (Chou, 2011). However, to reduce computational time, 10-fold cross-validation was adopted to evaluate our model in this study. In 10-fold cross-validation, the training dataset is randomly divided into 10 subsets of approximately equal size. One subset is held out as the testing dataset and the remaining nine subsets are used as the training dataset. This procedure is repeated 10 times, so that each subset is used for testing once, and the final prediction result is the average over the 10 testing subsets. Here, to obtain a reliable estimate, the whole 10-fold cross-validation procedure is repeated 10 times.

The detailed training process of our model is as follows. Firstly, the F-score is used to rank the importance of the 2205 CKSAAP features. Then, for each set of top s (s = 50, 100, 150, …, 2200) features, our algorithm is implemented and evaluated by 10-fold cross-validation. Finally, the parameters with the highest average Gm in 10-fold cross-validation are used to construct CKSAAP_PhoglySite.

The radial basis function (RBF) kernel $K(x_i, x_j) = \Phi(x_i)^T \Phi(x_j) = \exp\left(-\gamma \|x_i - x_j\|^2\right)$ was used in the SVM models. The kernel parameter $\gamma$ was selected from $\{2^{-12}, 2^{-11}, \ldots, 2^{0}\}$; the penalty parameter C was selected from $\{2^{0}, 2^{1}, \ldots, 2^{12}\}$; the parameter $\alpha$ was selected from {0, 0.1, …, 1}; the parameter K was set to 10 and m was set to 1. Libsvm-weights-3.20, a variant of LIBSVM (Chang and Lin, 2011), was used to train the fuzzy SVM models (the software is available at https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/#weights_for_data_instances).

Six widely accepted measurements, namely Sensitivity (Sn), Specificity (Sp), Precision (Pre), Accuracy (ACC), Matthew's correlation coefficient (MCC) and G-mean (Gm), were used to evaluate the prediction performance of CKSAAP_PhoglySite. In accordance with Eq. (14) of Chen et al. (2013), they are defined as:

$$\begin{cases}
Sn = 1 - \dfrac{N^+_-}{N^+} \\[2mm]
Sp = 1 - \dfrac{N^-_+}{N^-} \\[2mm]
Pre = 1 - \dfrac{N^-_+}{N^+ - N^+_- + N^-_+} \\[2mm]
ACC = 1 - \dfrac{N^+_- + N^-_+}{N^+ + N^-} \\[2mm]
MCC = \dfrac{1 - \left(\dfrac{N^+_-}{N^+} + \dfrac{N^-_+}{N^-}\right)}{\sqrt{\left(1 + \dfrac{N^-_+ - N^+_-}{N^+}\right)\left(1 + \dfrac{N^+_- - N^-_+}{N^-}\right)}} \\[2mm]
Gm = \sqrt{Sn \times Sp}
\end{cases} \qquad (10)$$

where $N^+$ is the total number of phosphoglycerylation sites investigated, $N^+_-$ is the number of phosphoglycerylation sites incorrectly predicted as non-phosphoglycerylation sites, $N^-$ is the total number of non-phosphoglycerylation sites investigated, and $N^-_+$ is the number of non-phosphoglycerylation sites incorrectly predicted as phosphoglycerylation sites. Among these measurements, Gm was used to evaluate the performance of the proposed fuzzy SVM algorithm on the imbalanced and noisy training dataset, as is common in imbalanced classification (Batuwita and Palade, 2010). This set of metrics is valid only for single-label systems. For multi-label systems, which have become more frequent in system biology (Wu and Xiao, 2011; Wu and Xiao, 2012) and system medicine (Xiao et al., 2013c), a completely different set of metrics, as defined in Chou (2013), is needed.
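The membership assignment of Section 2.4.2 can be sketched in a few lines of Python. The sketch follows Eqs. (4)–(9) as read here, with α weighting the neighbor-density term and 1 − α the class-center term. It is an illustration under stated assumptions rather than the authors' implementation: the function names, the default value of `delta`, and the capping of K for very small classes are all choices made for this example.

```python
import math


def fuzzy_memberships(points, k=10, alpha=0.5, m=1.0, delta=1e-6):
    """Memberships in the form of Eqs. (8)/(9) for the samples of ONE class.

    points: list of feature vectors belonging to the same class (at least 2).
    """
    n, dim = len(points), len(points[0])
    center = [sum(p[d] for p in points) / n for d in range(dim)]
    d_cen = [math.dist(p, center) for p in points]  # distance to class center

    # Average distance to the K nearest neighbors within the class, Eqs. (6)/(7).
    k = min(k, n - 1)  # assumption: cap K when the class is very small
    density = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        density.append(sum(dists[:k]) / k)

    d_max, dens_min, dens_max = max(d_cen), min(density), max(density)
    return [
        (1.0
         - alpha * (D - dens_min) / (dens_max - dens_min + delta)
         - (1.0 - alpha) * d / (d_max + delta)) ** m
        for d, D in zip(d_cen, density)
    ]


def class_penalties(C, n_pos, n_neg):
    """Return (C+, C-) with C+ = C * (l - p) / p and C- = C."""
    return C * n_neg / n_pos, C


toy_class = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [10.0, 10.0]]
weights = fuzzy_memberships(toy_class, k=2)
print([round(w, 3) for w in weights], class_penalties(1.0, 106, 1408))
```

In Eq. (3) each slack variable is multiplied by the product of the class penalty and the membership, so a per-sample weight of that form is what a weighted SVM solver would receive. With 106 positive and 1408 negative samples, the positive class penalty is about 13.3 times the negative one.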

3. Results and discussions

3.1. Comparison with other imbalanced SVM algorithms

To evaluate the effectiveness of the proposed fuzzy SVM algorithm for phosphoglycerylation site prediction, we compared it with two other popular imbalanced SVM algorithms, DEC (Different Error Costs) SVM (Veropoulos et al., 1999) and FSVM-CIL (Batuwita and Palade, 2010), on Xu's training set. The 10-fold cross-validation results of the three algorithms are shown in Table 1. As shown in Table 1, the Gm of CKSAAP_PhoglySite (0.8634) was higher than that of DEC (0.8540), $\text{FSVM-CIL}^{cen}_{lin}$ (0.8534) and $\text{FSVM-CIL}^{cen}_{exp}$ (0.8630). The highest value of Gm indicates that, compared with the other two imbalanced support vector machine algorithms DEC and FSVM-CIL, the proposed fuzzy support vector machine algorithm achieved better performance on the imbalanced and noisy training dataset. The optimal parameters of DEC, $\text{FSVM-CIL}^{cen}_{lin}$, $\text{FSVM-CIL}^{cen}_{exp}$ and CKSAAP_PhoglySite are given in Supplementary material S4.

Table 1
The 10-fold cross-validation performance of DEC, FSVM-CIL and CKSAAP_PhoglySite.

Method               Sn                   Sp                 Gm
DEC                  0.8142 ± 0.0267^a    0.8961 ± 0.0027    0.8540 ± 0.0132
FSVM-CIL^cen_lin     0.8104 ± 0.0163      0.8987 ± 0.0021    0.8534 ± 0.0085
FSVM-CIL^cen_exp     0.8594 ± 0.0186      0.8667 ± 0.0032    0.8630 ± 0.0100
CKSAAP_PhoglySite    0.8462 ± 0.0100      0.8809 ± 0.0030    0.8634 ± 0.0059

a The corresponding measurement is given as average value ± standard deviation.

3.2. Comparison with existing prediction method

Phogly-PseAAC (Xu et al., 2015a) was recently proposed and is the only available online server for the prediction of phosphoglycerylation sites, so it is natural to compare CKSAAP_PhoglySite with it. The jackknife test result of Phogly-PseAAC was taken from the literature (Xu et al., 2015a). Therefore, for a fair comparison, CKSAAP_PhoglySite was also evaluated by the jackknife test, using the optimal parameters obtained in 10-fold cross-validation. The results of the two methods are given in Table 2. As shown in Table 2, CKSAAP_PhoglySite achieved much better jackknife test performance than Phogly-PseAAC. For example, the predictive Sn of CKSAAP_PhoglySite (0.8491) was much higher than that of Phogly-PseAAC (0.6887); the predictive Sp of CKSAAP_PhoglySite (0.8828) was also much higher than that of Phogly-PseAAC (0.7557). These results indicate that CKSAAP_PhoglySite is more effective and can identify more phosphoglycerylation sites from the training proteins.

Table 2
Comparison of CKSAAP_PhoglySite with Phogly-PseAAC.

Method                 Sn        Sp        Pre       ACC       MCC       Gm
Phogly-PseAAC^a        0.6887    0.7557    0.1751    0.7510    0.2538    0.7214
CKSAAP_PhoglySite^a    0.8491    0.8828    0.3529    0.8804    0.4990    0.8658
CKSAAP_PhoglySite^b    0.8462    0.8809    0.3486    0.8785    0.4940    0.8634

a The corresponding results were obtained by jackknife test.
b The corresponding results were obtained by 10-fold cross-validation.

3.3. Top ranked k-spaced amino acid pairs

As mentioned above, the optimal feature set, consisting of 300 CKSAAP features, was obtained by 10-fold cross-validation and feature selection. To better understand the sequence pattern of phosphoglycerylation, the top 20 k-spaced residue pairs in the optimal feature set are presented in Table 3. Moreover, the amino acid frequencies around the phosphoglycerylation and non-phosphoglycerylation sites, generated with WebLogo (Crooks et al., 2004), are given in Fig. 1. As can be seen from Table 3 and Fig. 1, the distributions of these top features were remarkably different around the two types of sites. For example, the 0-spaced amino acid pair AK was significantly enriched in position pairs (−2/−1, 2/3 and 3/4, respectively) surrounding the phosphoglycerylation sites, whereas the 4-spaced amino acid pair KxxxxT was depleted in position pair (0/5) around the phosphoglycerylation sites. The 300 optimal CKSAAP features are listed in Supplementary material S5; they may offer some useful clues for studying the molecular pattern of phosphoglycerylation.

Table 3
The top 20 optimal CKSAAP features ranked by F-score method.

Order    CKSAAP feature    Order    CKSAAP feature
1        KxxxxT            11       PG
2        ExG               12       IxxL
3        YD                13       TxxL
4        CxxL              14       EF
5        AK                15       ExxxR
6        QxxxxL            16       TD
7        VE                17       FxQ
8        ExR               18       DxT
9        VxxxxL            19       TxI
10       GQ                20       PG

Fig. 1. Amino acid frequencies surrounding the phosphoglycerylation and non-phosphoglycerylation sites.
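The jackknife row of Table 2 can be cross-checked against the definitions in Eq. (10), with Gm taken as the geometric mean of Sn and Sp. The sketch below does this in Python. The error counts used (16 missed positive sites and 165 false positives) are not stated in the paper; they are back-calculated here from the reported Sn and Sp and the dataset sizes of 106 and 1408, and the function name `chou_metrics` is hypothetical.

```python
import math


def chou_metrics(n_pos, n_neg, fn, fp):
    """Six measures in the form of Eq. (10).

    n_pos, n_neg: total positive / negative sites (N+ and N-);
    fn: positives predicted as negative; fp: negatives predicted as positive.
    """
    sn = 1 - fn / n_pos
    sp = 1 - fp / n_neg
    pre = 1 - fp / (n_pos - fn + fp)
    acc = 1 - (fn + fp) / (n_pos + n_neg)
    mcc = (1 - (fn / n_pos + fp / n_neg)) / math.sqrt(
        (1 + (fp - fn) / n_pos) * (1 + (fn - fp) / n_neg)
    )
    gm = math.sqrt(sn * sp)
    return {"Sn": sn, "Sp": sp, "Pre": pre, "ACC": acc, "MCC": mcc, "Gm": gm}


# Assumed counts, inferred from the reported jackknife Sn and Sp
# (not given in the paper).
result = chou_metrics(n_pos=106, n_neg=1408, fn=16, fp=165)
print({name: round(value, 4) for name, value in result.items()})
```

With these counts all six values agree with the CKSAAP_PhoglySite jackknife row of Table 2 to four decimal places. The low precision (about 0.35) alongside high Sn and Sp is a direct consequence of the roughly 1:13 class ratio.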

4. Conclusions

In this study, we developed a novel predictor named CKSAAP_PhoglySite to identify phosphoglycerylation sites by using the CKSAAP feature encoding and a fuzzy SVM algorithm. To the best of our knowledge, this is the first time a fuzzy SVM algorithm has been applied to the prediction of post-translational modification sites. Experimental results showed that CKSAAP_PhoglySite significantly outperformed the existing phosphoglycerylation site predictor Phogly-PseAAC. We believe that our method can also be applied to predict other types of post-translational modification sites. Since user-friendly and publicly accessible web-servers represent the future direction for developing practically more useful predictors (Chou and Shen, 2009), we shall make efforts to establish a web-server for the new model presented in this paper.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (Nos. 61502074 and 61305034), the Specialized Research Fund for the Doctoral Program of Higher Education (No. 20120041110008) and the Dalian University of Technology Fundamental Research Fund (DUT15RC(3)030).

Appendix A. Supplementary material

Supplementary data associated with this article can be found in the online version at doi:10.1016/j.jtbi.2016.02.020.

References

Batuwita, R., Palade, V., 2010. FSVM-CIL: fuzzy support vector machines for class imbalance learning. IEEE Trans. Fuzzy Syst. 18, 558–571.
Batuwita, R., Palade, V., 2013. Class imbalance learning methods for support vector machines. Imbalanced Learn.: Found. Algorithms Appl., 83–99.
Cao, D.S., Xu, Q.S., Liang, Y.Z., 2013. Propy: a tool to generate various modes of Chou's PseAAC. Bioinformatics 29, 960–962.
Chang, C.C., Lin, C.J., 2011. LIBSVM: a library for support vector machines. ACM Trans. Intell. Syst. Technol. 2, 27.
Chen, W., Feng, P.M., Deng, E.Z., 2014. iTIS-PseTNC: a sequence-based predictor for identifying translation initiation site in human genes using pseudo trinucleotide composition. Anal. Biochem. 462, 76–83.
Chen, W., Feng, P.M., Lin, H., 2013. iRSpot-PseDNC: identify recombination spots with pseudo dinucleotide composition. Nucleic Acids Res. 41, e68.
Chen, W., Lin, H., 2015. Pseudo nucleotide composition or PseKNC: an effective formulation for analyzing genomic sequences. Mol. Biosyst. 11, 2620–2634.
Chen, Y.W., Lin, C.J., 2006. Combining SVMs with various feature selection strategies. In: Feature Extraction. Springer, pp. 315–324.
Chou, K.C., 2001. Prediction of protein cellular attributes using pseudo amino acid composition. Proteins: Struct. Funct. Genet. 43, 246–255.
Chou, K.C., 2011. Some remarks on protein attribute prediction and pseudo amino acid composition. J. Theor. Biol. 273, 236–247.
Chou, K.C., 2013. Some remarks on predicting multi-label attributes in molecular biosystems. Mol. Biosyst. 9, 1092–1100.
Chou, K.C., 2015. Impacts of bioinformatics to medicinal chemistry. Med. Chem. 11, 218–234.
Chou, K.C., Shen, H.B., 2009. Review: recent advances in developing web-servers for predicting protein attributes. Nat. Sci. 1, 63–92.
Crooks, G.E., Hon, G., Chandonia, J.M., Brenner, S.E., 2004. WebLogo: a sequence logo generator. Genome Res. 14, 1188–1190.
Ding, Y.S., Zhang, T.L., 2007. Prediction of protein structure classes with pseudo amino acid composition and fuzzy support vector machine network. Protein Pept. Lett. 14, 811–815.
Du, P., Gu, S., Jiao, Y., 2014. PseAAC-General: fast building various modes of general form of Chou's pseudo-amino acid composition for large-scale protein datasets. Int. J. Mol. Sci. 15, 3495–3506.
Du, P., Wang, X., Xu, C., Gao, Y., 2012. PseAAC-Builder: a cross-platform stand-alone program for generating various special Chou's pseudo-amino acid compositions. Anal. Biochem. 425, 117–119.
Jia, J., Liu, Z., Xiao, X., 2015. iPPI-Esml: an ensemble classifier for identifying the interactions of proteins by incorporating their physicochemical properties and wavelet transforms into PseAAC. J. Theor. Biol. 377, 47–56.
Ju, Z., Cao, J.Z., Gu, H., 2015. iLM-2L: a two-level predictor for identifying protein lysine methylation sites and their methylation degrees by incorporating K-gap amino acid pairs into Chou's general PseAAC. J. Theor. Biol. 385, 50–57.
Lanouette, S., Mongeon, V., Figeys, D., 2014. The functional diversity of protein lysine methylation. Mol. Syst. Biol. 10, 724.
Lin, C.F., Wang, S.D., 2002. Fuzzy support vector machines. IEEE Trans. Neural Netw. 13, 464–471.
Lin, H., Deng, E.Z., Ding, H., 2014. iPro54-PseKNC: a sequence-based predictor for identifying sigma-54 promoters in prokaryote with pseudo k-tuple nucleotide composition. Nucleic Acids Res. 42, 12961–12972.
Lin, S.X., Lapointe, J., 2013. Theoretical and experimental biology in one - a symposium in honour of Professor Kuo-Chen Chou's 50th anniversary and Professor Richard Giegé's 40th anniversary of their scientific careers. Biomed. Sci. Eng. 6, 435–442.
Liu, B., Fang, L., Liu, F., Wang, X., 2015a. Identification of real microRNA precursors with a pseudo structure status composition approach. PLoS One 10, e0121501.
Liu, B., Liu, F., Wang, X., Chen, J., 2015b. Pse-in-One: a web server for generating various modes of pseudo components of DNA, RNA, and protein sequences. Nucleic Acids Res. 43, W65–W71.

Please cite this article as: Ju, Z., et al., Predicting lysine phosphoglycerylation with fuzzy SVM by incorporating k-spaced amino acid pairs into Chou's general PseAAC. J. Theor. Biol. (2016), http://dx.doi.org/10.1016/j.jtbi.2016.02.020



Liu, Z.X., Wang, Y.B., Gao, T.S., Pan, Z.C., Chen, H., Yang, Q., Cheng, Z.Y., Guo, A.Y., Ren, J., Xue, Y., 2014. CPLM: a database of protein lysine modifications. Nucleic Acids Res. 42, D531–D536.
Moellering, R.E., Cravatt, B.F., 2013. Functional lysine modification by an intrinsically reactive primary glycolytic metabolite. Science 341, 549–553.
Qiu, W.R., Xiao, X., Lin, W.Z., 2015. iUbiq-Lys: prediction of lysine ubiquitination sites in proteins by extracting sequence evolution information via a grey system model. J. Biomol. Struct. Dyn. 33, 1731–1742.
Qiu, W.R., Xiao, X., Lin, W.Z., Chou, K.C., 2014. iMethyl-PseAAC: identification of protein methylation sites via a pseudo amino acid composition approach. Biomed. Res. Int.
Veropoulos, K., Campbell, C., Cristianini, N., 1999. Controlling the sensitivity of support vector machines. In: Proceedings of the International Joint Conference on AI, pp. 55–60.
Wang, P., Xiao, X., 2011. NR-2L: a two-level predictor for identifying nuclear receptor subfamilies based on sequence-derived features. PLoS One 6, e23505.
Wu, Z.C., Xiao, X., 2011. iLoc-Euk: a multi-label classifier for predicting the subcellular localization of singleplex and multiplex eukaryotic proteins. PLoS One 6, e18258.
Wu, Z.C., Xiao, X., 2012. iLoc-Hum: using accumulation-label scale to predict subcellular locations of human proteins with both single and multiple sites. Mol. Biosyst. 8, 629–641.
Xiao, X., Min, J.L., Wang, P., 2013a. iGPCR-Drug: a web server for predicting interaction between GPCRs and drugs in cellular networking. PLoS One 8, e72234.
Xiao, X., Min, J.L., Wang, P., 2013b. iCDI-PseFpt: identify the channel-drug interaction in cellular networking with PseAAC and molecular fingerprints. J. Theor. Biol. 337C, 71–79.
Xiao, X., Wang, P., 2011. GPCR-2L: predicting G protein-coupled receptors and their types by hybridizing two different modes of pseudo amino acid compositions. Mol. Biosyst. 7, 911–919.
Xiao, X., Wang, P., Lin, W.Z., Jia, J.H., Chou, K.C., 2013c. iAMP-2L: a two-level multilabel classifier for identifying antimicrobial peptides and their functional types. Anal. Biochem. 436, 168–177.
Xu, Y., Chou, K.C., 2016. Recent progress in predicting posttranslational modification sites in proteins. Curr. Top. Med. Chem. 16, 591–603.
Xu, Y., Ding, J., Wu, L.Y., 2013a. iSNO-PseAAC: predict cysteine S-nitrosylation sites in proteins by incorporating position specific amino acid propensity into pseudo amino acid composition. PLoS One 8, e55844.
Xu, Y., Ding, Y.X., Ding, J., Deng, N.Y., 2015a. Phogly-PseAAC: prediction of lysine phosphoglycerylation in proteins incorporating with position-specific propensity. J. Theor. Biol. 379, 10–15.
Xu, Y., Ding, Y.X., Ding, J., Lei, Y.H., Wu, L.Y., Deng, N.Y., 2015b. iSuc-PseAAC: predicting lysine succinylation in proteins by incorporating peptide position-specific propensity. Sci. Rep. 5.
Xu, Y., Shao, X.J., Wu, L.Y., 2013b. iSNO-AAPair: incorporating amino acid pairwise coupling into PseAAC for predicting cysteine S-nitrosylation sites in proteins. PeerJ 1, e171.
Xu, Y., Wang, X., Wang, Y., Tian, Y., Shao, X., Wu, L.Y., Deng, N.Y., 2014a. Prediction of posttranslational modification sites from amino acid sequences with kernel methods. J. Theor. Biol. 344, 78–87.
Xu, Y., Wen, X., Shao, X.J., 2014b. iHyd-PseAAC: predicting hydroxyproline and hydroxylysine in proteins by incorporating dipeptide position-specific propensity into pseudo amino acid composition. Int. J. Mol. Sci. 15, 7594–7610.
Xu, Y., Wen, X., Wen, L.S., Wu, L.Y., 2014c. iNitro-Tyr: prediction of nitrotyrosine sites in proteins with general pseudo amino acid composition. PLoS One 9, e105018.
Zhang, T.L., Ding, Y.S., 2008. Prediction protein structural classes with pseudo amino acid composition: approximate entropy and hydrophobicity pattern. J. Theor. Biol. 250, 186–193.
