iLM-2L: A two-level predictor for identifying protein lysine methylation sites and their methylation degrees by incorporating K-gap amino acid pairs into Chou's general PseAAC


Journal of Theoretical Biology ∎ (∎∎∎∎) ∎∎∎–∎∎∎


Contents lists available at ScienceDirect

Journal of Theoretical Biology journal homepage: www.elsevier.com/locate/yjtbi

Zhe Ju, Jun-Zhe Cao, Hong Gu*

School of Control Science and Engineering, Dalian University of Technology, #2 Ling-gong Road, Dalian 116024, People's Republic of China

HIGHLIGHTS

• A two-level multi-label predictor is built to identify methyllysine sites and their methylation degrees.
• The CKSAAP feature is used to predict and analyze lysine methylation sites.
• The proposed method is more effective than existing methods.
• A Matlab software package is available for prediction.

ARTICLE INFO

Article history:
Received 15 May 2015
Received in revised form 6 July 2015
Accepted 23 July 2015

Keywords:
K-spaced amino acid pair
Multi-label classification
Post-translational modification
Support vector machine

ABSTRACT

As one of the most critical post-translational modifications, lysine methylation plays a key role in regulating various protein functions. To understand the molecular mechanism of lysine methylation, it is important to identify lysine methylation sites and their methylation degrees accurately. Because traditional experimental methods are time-consuming and labor-intensive, several computational methods have been developed for the identification of methylation sites. However, the prediction accuracy of existing computational methods is still unsatisfactory. Moreover, they focus only on predicting whether a query lysine residue is a methylation site, without considering its methylation degrees. In this paper, a novel two-level predictor named iLM-2L is proposed to predict lysine methylation sites and their methylation degrees using the composition of k-spaced amino acid pairs (CKSAAP) feature coding scheme and the support vector machine algorithm. The 1st level identifies whether a query lysine residue is a methylation site, and the 2nd level identifies which methylation degree(s) the query lysine residue belongs to if it has been predicted as a methyllysine site in the 1st level. In the jackknife test, iLM-2L achieves a Sensitivity of 76.46%, a Specificity of 91.90%, an Accuracy of 85.31% and a Matthews correlation coefficient of 0.6994 for the 1st level, as well as a Precision of 84.81%, an Accuracy of 79.35%, a Recall of 80.83%, an Absolute_True of 73.89% and a Hamming_loss of 15.63% for the 2nd level. As illustrated by the independent test, iLM-2L significantly outperforms other existing lysine methylation site predictors. A Matlab software package for iLM-2L can be freely downloaded from https://github.com/juzhe1120/Matlab_Software/blob/master/iLM-2L_Matlab_Software.rar.
© 2015 Published by Elsevier Ltd.

1. Introduction

Lysine methylation is an important and common protein post-translational modification (PTM) in both prokaryotes and eukaryotes. Lysine methylation of histone proteins was first identified in the 1960s (Murray, 1964) and plays crucial roles in various biological processes, such as heterochromatin compaction, X-chromosome inactivation and transcriptional silencing or activation (Lee et al., 2005; Martin and Zhang, 2005). Subsequently, researchers found that lysine methylation also occurs in non-histone proteins, where it plays extensive roles in regulating protein activity, protein stability, protein–protein interactions and subcellular localization (Hart-Smith et al., 2014; Hamamoto et al., 2015). Since a lysine residue can be methylated once, twice or three times by lysine methyltransferases (KMTs), lysine methylation has three degrees: mono-, di-, and tri-methylation (Bannister and Kouzarides, 2005). A growing body of evidence shows that lysine methylation is related to either

* Corresponding author. Tel.: +86 411 84705858. E-mail addresses: [email protected] (J. Zhe), [email protected] (J.-Z. Cao), [email protected] (H. Gu).

http://dx.doi.org/10.1016/j.jtbi.2015.07.030
0022-5193/© 2015 Published by Elsevier Ltd.

Please cite this article as: Zhe, J., et al., iLM-2L: A two-level predictor for identifying protein lysine methylation sites and their methylation degrees by incorporating K-gap amino acid pairs into.... J. Theor. Biol. (2015), http://dx.doi.org/10.1016/j.jtbi.2015.07.030


gene activation or repression, depending on the lysine methylation site and its methylation degree (Paik et al., 2007). Because lysine methylation and its KMTs play important roles in gene regulation, they are associated with various human diseases, such as cancer (Hamamoto et al., 2015; Varier and Timmers, 2011) and diabetic nephropathy (Sun et al., 2014). Thus, knowledge of lysine methylation would benefit drug design for related human diseases. To better understand the modification dynamics and molecular mechanism of lysine methylation, the fundamental but crucial step is the accurate identification of lysine methylation sites and their methylation degrees. Several conventional experimental approaches, such as ChIP-chip (Johnson et al., 2008), methylation-specific antibodies (Turner, 2002) and mass spectrometry (Snijders et al., 2010), have been developed to identify methylation sites. However, these approaches are usually time-consuming and labor-intensive, so it is desirable to develop computational methods to identify potential lysine methylation sites and their methylation degrees. In contrast with conventional experimental approaches, computational methods are usually accurate and convenient, and they can provide useful information for further experimental verification. Indeed, computational studies of PTMs are attracting growing attention (Chou, 2015), including the prediction of protein cysteine S-nitrosylation sites (Xu et al., 2013a, 2013b; Jia et al., 2014; Zhang et al., 2014), protein methylation sites (Qiu et al., 2014), protein hydroxylation sites (Xu et al., 2014a), protein tyrosine nitration sites (Xu et al., 2014b) and DNA methylation sites (Liu et al., 2015e).

Up to now, several computational methods have been developed for the identification of methylation sites. Based on the hypothesis that methylation sites tend to be intrinsically disordered, Daily et al. (2005) presented the first method for the prediction of arginine and lysine methylation sites. Chen et al. (2006) developed an online methylation site predictor, MeMo, based on the binary coding scheme and the Support Vector Machine (SVM) algorithm. Shao et al. (2009) proposed an SVM-based predictor named BPB-PPMS, in which the Bi-profile Bayes feature extraction method was employed to encode lysine-centered and arginine-centered peptides. Using binary coding, accessible surface area and secondary structure features, Shien et al. (2009) developed an online tool called MASA for the prediction of methylation sites. Hu et al. (2011) proposed an algorithm for predicting methylation sites that combined amino acid factors, position-specific scoring matrix and disorder score features with a nearest neighbor algorithm. Shi et al. (2012) constructed an online server called PMeS to predict protein methylation sites using an enhanced feature encoding scheme and SVM. Zhang et al. (2013) developed an online methylation site prediction tool called CKSAA_Methsite using the composition of k-spaced amino acid pairs (CKSAAP) and SVM. Recently, Qiu et al. (2014) developed a web server named iMethyl-PseAAC to predict methylation sites using the general form of pseudo-amino acid composition and SVM. However, the prediction performance of these existing methods is still unsatisfactory. Moreover, they focus only on predicting whether a query lysine residue is a methylation site, without considering its possible methylation degrees. In fact, some regulatory functions depend not only on the specific lysine methylation sites but also on their methylation degrees. More importantly, each of the three methylation degrees may play a different regulatory role.

For example, methylation of histone H4 lysine 20 (H4K20) plays critical roles in various cellular processes such as gene expression, cell cycle progression and DNA damage repair. Mono-methylation of H4K20 regulates cell cycle progression and gene expression, whereas di- and tri-methylation of H4K20 are required for DNA damage checkpoint activation and maintenance of heterochromatin structures, respectively (Wang and Jia, 2009). As another example, the p53 protein is a tumor suppressor that prevents cancer formation. Mono-methylation of p53K370 represses p53 function, whereas di-methylation of p53K370 activates p53 regulation by providing an interaction surface for the binding of p53-binding protein 1 (Huang et al., 2007). A lysine methylation site has only one of the three methylation degrees at a time, but it may have the other two at different times. Therefore, the identification of lysine methylation sites and their methylation degrees should be treated as a two-level multi-label classification problem.

For the above reasons, we developed a novel two-level predictor called iLM-2L for identifying lysine methylation sites and their methylation degrees using CKSAAP feature encoding and SVM. The 1st level of iLM-2L identifies whether a query lysine residue is a methylation site, and the 2nd level identifies which methylation degree(s) the query lysine residue belongs to if it has been identified as a methylation site in the 1st level. As illustrated by the jackknife test and the independent test, our method significantly outperformed five existing predictors. These experimental results indicate that iLM-2L is a powerful tool for identifying protein lysine methylation sites and their methylation degrees.

According to Chou's 5-step rule (Chou, 2011), as demonstrated by a series of recent publications (Xu et al., 2014b; Chen et al., 2014a; Lin et al., 2014; Liu et al., 2015a; Jia et al., 2015), establishing a really useful sequence-based statistical predictor for a biological system requires the following five steps: (a) a valid benchmark dataset to train and test the predictor; (b) an effective feature encoding scheme to represent the biological sequences concerned; (c) a powerful machine learning algorithm to operate the prediction; (d) a proper cross-validation method to objectively measure the performance of the predictor; and (e) a user-friendly web server for the public to use. Below, we describe how these steps were carried out one by one.

2. Materials and methods

2.1. Dataset

Qiu's training set and independent test set (Qiu et al., 2014), which were extracted from UniProt/Swiss-Prot (version 2013_06), were used to train and evaluate the 1st level of our model. Qiu's training set consisted of 226 methyllysine sites and 1518 non-methyllysine sites; Qiu's independent test set consisted of 14 methyllysine sites and 26 non-methyllysine sites. In practical applications of a methylation site prediction system, the input is often an entire protein sequence. Therefore, we randomly selected 20 proteins with 38 methyllysine sites and 543 non-methyllysine sites from UniProt/Swiss-Prot (version 2015_03) to construct our own independent test set. Our independent test set shares no samples with Qiu's training set.

To train and evaluate the 2nd level of our model, we collected from UniProt/Swiss-Prot the methylation degrees corresponding to the methyllysine sites in Qiu's training set. The training dataset used in the 2nd level consisted of 226 methyllysine sites, of which 181 have one methylation degree, 15 have two different methylation degrees, and 30 have three different methylation degrees. To further validate the 2nd level prediction of iLM-2L, we also collected from UniProt/Swiss-Prot the methylation degrees corresponding to the methyllysine sites in our 1st level independent test set, and used them as the independent test dataset for the 2nd level. This independent test dataset consisted of 38 methyllysine sites, of which 27 have one methylation degree, 2 have two different methylation degrees, and 9 have three different methylation degrees.

A sliding window was used to represent every lysine residue K in the aforementioned datasets. To ensure the uniform


length of each peptide, a dummy residue 'X' was used to fill every position for which the sequence does not supply a residue. Window size is an important parameter of a prediction model; the selection of the optimal window size for our model is discussed later. In this study, peptides centered on a methyllysine site were used as positive samples, while peptides centered on a non-methyllysine site were used as negative samples. The datasets used in the 1st and 2nd levels of our model are provided in Supplementary material S1.
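As a minimal sketch of the peptide extraction above and of the CKSAAP encoding defined in Section 2.2, the Python code below pads a protein sequence with 'X', cuts a window around every lysine, and converts each window into the 2205-dimensional vector for k = 0 to 4. The function names (`lysine_windows`, `cksaap`), the alphabetical ordering of the pairs and the mapping of any non-standard residue to 'X' are illustrative assumptions; the released iLM-2L package is written in Matlab and may differ in these details.

```python
from itertools import product

STANDARD = "ACDEFGHIKLMNPQRSTVWY"          # 20 standard amino acids
ALPHABET = STANDARD + "X"                   # plus the filler residue: 21 symbols
PAIRS = ["".join(p) for p in product(ALPHABET, repeat=2)]   # 441 ordered pairs
PAIR_INDEX = {pair: i for i, pair in enumerate(PAIRS)}


def lysine_windows(sequence, window_size):
    """Yield (position, peptide) for every lysine; termini are padded with 'X'."""
    # Assumption: residues outside the 20 standard ones are treated as 'X'.
    clean = "".join(c if c in STANDARD else "X" for c in sequence.upper())
    half = window_size // 2                 # window_size is expected to be odd
    padded = "X" * half + clean + "X" * half
    for pos, residue in enumerate(clean):
        if residue == "K":
            # The residue sits at index pos + half of the padded string,
            # so the window centered on it starts at index pos.
            yield pos, padded[pos:pos + window_size]


def cksaap(peptide, gaps=(0, 1, 2, 3, 4)):
    """Concatenate the k-spaced pair frequencies (441 values per gap k)."""
    features = []
    for k in gaps:
        counts = [0] * len(PAIRS)
        n_total = len(peptide) - k - 1      # number of k-spaced pairs
        for i in range(n_total):
            counts[PAIR_INDEX[peptide[i] + peptide[i + k + 1]]] += 1
        features.extend(c / n_total for c in counts)
    return features


if __name__ == "__main__":
    for position, peptide in lysine_windows("MKAAK", 7):
        print(position, peptide, len(cksaap(peptide)))
```

Within each gap k the 441 frequencies sum to one, because every k-spaced pair in the window is counted exactly once.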

2.2. Feature construction

One of the most important but also most difficult problems in computational biology and biomedicine today is how to formulate a biological sequence with a discrete model or a vector, yet still keep considerable sequence-order information. This is because all the existing operation engines, such as Support Vector Machine (SVM) and Neural Network (NN), can only handle vectors but not sequence samples, as elaborated in Chou (2015). However, a vector defined in a discrete model may completely lose the sequence-order information. To avoid this for proteins, the pseudo-amino acid composition, or PseAAC, was proposed. Since the concept of Chou's PseAAC (Du et al., 2012; Cao et al., 2013; Lin and Lapointe, 2013) was proposed, it has penetrated into many biomedicine and drug development areas (Zhong and Zhou, 2014) and nearly all areas of computational proteomics (Lin et al., 2009; Khan et al., 2015; Dehzangi et al., 2015; Kumar et al., 2015; Mondal and Pai, 2014; Wang et al., 2015; Du et al., 2014; Chen and Lin, 2015). Because it has been widely and increasingly used, three powerful open-access software packages, 'PseAAC-Builder' (Du et al., 2012), 'propy' (Lin and Lapointe, 2013), and 'PseAAC-General' (Du et al., 2014), were recently established: the former two generate various modes of Chou's special PseAAC, while the third generates those of Chou's general PseAAC (Chou, 2011), including not only all the special modes of feature vectors for proteins but also higher-level feature vectors such as the "Functional Domain" mode (see Eqs. (9) and (10) of Chou (2011)), the "Gene Ontology" mode (see Eqs. (11) and (12) of Chou (2011)), and the "Sequential Evolution" or "PSSM" mode (see Eqs. (13) and (14) of Chou (2011)).

Encouraged by the successes of using PseAAC to deal with protein/peptide sequences, four web servers (Chen et al., 2014b; Chen et al., 2015; Liu et al., 2015b, 2015c) were developed for generating various feature vectors for DNA/RNA sequences. In particular, a powerful web server called Pse-in-One (Liu et al., 2015d) has recently been established that can generate any desired feature vectors for protein/peptide and DNA/RNA sequences according to the needs of users' studies.

In this study, our prediction method incorporates the composition of k-spaced amino acid pairs (CKSAAP) features into the general form of Chou's pseudo-amino acid composition. According to Eqs. (6)–(13) of Chou (2011), the feature vector for any protein (or peptide) sequence can be covered by the general form of Chou's pseudo-amino acid composition (Du et al., 2014)

\[ F = \left[ f_1, f_2, \ldots, f_{\Omega} \right]^{T} \tag{1} \]

where T is the transpose operator, the components $f_1, f_2, \ldots$ depend on how the desired information is extracted from the statistical samples concerned, and Ω is the dimension of the feature vector F. A sequence fragment of 2m + 1 amino acids can contain 441 possible types of amino acid pairs (i.e., AA, AC, AD, …, XX). The CKSAAP encoding scheme calculates the occurrence frequencies of the k-spaced amino acid pairs in a fragment. For example, the CKSAAP encoding of a peptide for k = 1 is a 441-dimensional feature vector defined as

\[ \left( \frac{N_{AxA}}{N_{Total}}, \frac{N_{AxC}}{N_{Total}}, \frac{N_{AxD}}{N_{Total}}, \ldots, \frac{N_{XxX}}{N_{Total}} \right)^{T}_{441} \tag{2} \]

where 'x' represents any one of the 21 amino acids (the 20 standard residues plus the filler 'X') and $N_{Total}$ represents the total number of 1-spaced amino acid pairs. Here, CKSAAP with k = 0, 1, 2, 3 and 4 was used to encode each lysine fragment as a 2205-dimensional feature vector $F = (f_1, f_2, \ldots, f_{2205})^{T}$.

2.3. Prediction methods

As an effective machine learning technique, SVM has been successfully applied to many biological problems in recent years, including the prediction of various PTM sites. An SVM with RBF kernel was used to train both the 1st and 2nd levels of iLM-2L. Given a training dataset $(X, T) = \{(x_i, t_i),\ i = 1, 2, \ldots, N\}$, where $t_i \in \{-1, 1\}$ is the label of sample $x_i$, an SVM solves the following minimization problem:

\[ \min_{\omega,\,\xi}\ \frac{1}{2}\|\omega\|^{2} + C\sum_{i=1}^{N}\|\xi_i\|^{2} \quad \text{s.t.}\quad t_i\left(\omega^{T}\Phi(x_i) + b\right) \ge 1 - \xi_i,\quad i = 1, 2, \ldots, N \tag{3} \]

where $\Phi(x)$ is the feature mapping satisfying Mercer's theorem. Suppose $(\omega^{*}, b^{*})$ is the solution of (3). The final SVM classifier can be written as $f(x) = \operatorname{sign}(g(x))$, where $g(x) = \omega^{*T}\Phi(x) + b^{*}$ and

\[ \operatorname{sign}(x) = \begin{cases} 1, & \text{if } x \ge 0 \\ -1, & \text{if } x < 0 \end{cases} \]

The RBF kernel function can be written as

\[ K(x_i, x_j) = \Phi(x_i)^{T}\Phi(x_j) = \exp\left(-\gamma\|x_i - x_j\|^{2}\right) \]

The parameter γ of the RBF kernel determines how the samples are mapped into a high-dimensional space. The penalty factor C controls the trade-off between the complexity of the model and the approximation error. In this study, Libsvm (Chang and Lin, 2011) was used to train the SVM models, and a grid search was applied to tune the parameters in the jackknife test. The penalty parameter C was selected from {0.1, 0.2, 0.5, 1, 2, 5, 10, 20, 50, 100, 200, 500, 1000, 2000, 5000, 10,000}, and the kernel parameter γ was selected from {0.00001, 0.00002, 0.00005, 0.0001, 0.0002, 0.0005, 0.001, 0.002, 0.005, 0.01, 0.02, 0.05, 0.1, 0.2, 0.5, 1}. The resulting 16 × 16 = 256 pairs (C, γ) were searched for the optimal SVM parameters in both levels of our model.

The 1st level of our model identifies whether a query lysine residue is a methyllysine site; this is a binary classification problem. As mentioned in Section 2.1, Qiu's training set consisted of 226 positive samples and 1518 negative samples. Because the ratio of positive to negative samples in Qiu's training set is low (about 1:6.7), we randomly divided the negative samples into 5 groups of 303, 303, 303, 303 and 306 samples. The 226 positive samples were combined with each of the 5 groups of negative samples to generate 5 new training sets, named training set 1 to training set 5. To simplify the description, we denote training set i as $(X_i, T_i) = \{(x_{ij}, t_{ij}),\ j = 1, 2, \ldots, M_i\}$ $(i = 1, 2, 3, 4, 5)$, where $M_i$ is the number of samples in training set i, the lysine site $x_{ij}$ is a vector of 2205 dimensions, and $t_{ij}$ is the label of $x_{ij}$: $t_{ij}$ is 1 if $x_{ij}$ is a methyllysine site and −1 otherwise.
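The parameter grid, the RBF kernel and the partition of the negative samples described above can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the authors' Matlab/Libsvm code: the names `PARAM_GRID`, `rbf_kernel` and `balanced_training_sets`, the fixed random seed, and the choice to give the remainder of the division to the last group are all hypothetical.

```python
import math
import random
from itertools import product

# Candidate values listed in the text for the penalty factor C and kernel parameter gamma.
C_GRID = [0.1, 0.2, 0.5, 1, 2, 5, 10, 20, 50, 100, 200, 500,
          1000, 2000, 5000, 10000]
GAMMA_GRID = [1e-5, 2e-5, 5e-5, 1e-4, 2e-4, 5e-4, 1e-3, 2e-3, 5e-3,
              1e-2, 2e-2, 5e-2, 0.1, 0.2, 0.5, 1]
PARAM_GRID = list(product(C_GRID, GAMMA_GRID))   # 16 x 16 = 256 (C, gamma) pairs


def rbf_kernel(x, y, gamma):
    """K(x, y) = exp(-gamma * ||x - y||^2) for two equal-length vectors."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


def balanced_training_sets(positives, negatives, n_groups=5, seed=0):
    """Split the negatives into n_groups random groups and pair each with all positives.

    Assumption: the last group absorbs the remainder, which reproduces the
    303/303/303/303/306 split for 1518 negatives.
    """
    rng = random.Random(seed)
    shuffled = list(negatives)
    rng.shuffle(shuffled)
    size = len(shuffled) // n_groups
    training_sets = []
    for g in range(n_groups):
        start = g * size
        end = start + size if g < n_groups - 1 else len(shuffled)
        group = shuffled[start:end]
        samples = list(positives) + group
        labels = [1] * len(positives) + [-1] * len(group)
        training_sets.append((samples, labels))
    return training_sets
```

Each of the five sets would then be searched over `PARAM_GRID` independently, so the negatives of one set never influence the classifier trained on another.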
Thus Qiu's training set can be represented as $(X, T) = \bigcup_{i=1}^{5}(X_i, T_i)$. The goal of the 1st level of our model is to learn a function $\Lambda: X \rightarrow T$ that predicts the single label of an unseen sample x. For each training set $(X_i, T_i)$ $(i = 1, 2, 3, 4, 5)$, we first trained an SVM classifier $\Lambda_i$ using the CKSAAP features. However, not all CKSAAP features contribute to the difference between methyllysine and non-methyllysine sites. Therefore, a well-established feature selection method named F-score (Chen and Lin, 2006) was used to remove the irrelevant and redundant features. Specifically, for each window size w (w = 7, 9, 11, 13, 15, 17, 19, 21), penalty parameter C and kernel parameter γ, the F-score method was first used to rank the 2205 CKSAAP features by importance, $F_{rank} = (f_1^{*}, f_2^{*}, \ldots, f_{2205}^{*})^{T}$, so that the more important a feature is, the smaller its index. Then, for each feature subset $S = \{f_1^{*}, f_2^{*}, \ldots, f_s^{*}\}$ consisting of the top s k-spaced amino acid pairs, with s ∈ {50, 100, …, 2200}, an SVM was constructed and evaluated by the jackknife test. Finally, the feature set $S^{*}$, window size $w^{*}$, penalty parameter $C^{*}$ and kernel parameter $\gamma^{*}$ with the best jackknife performance were selected as the optimal parameters. The F-score of the j-th feature is defined as

\[ F(j) = \frac{\left(\bar{x}_j^{(+)} - \bar{x}_j\right)^{2} + \left(\bar{x}_j^{(-)} - \bar{x}_j\right)^{2}}{\dfrac{1}{m_{+} - 1}\displaystyle\sum_{k=1}^{m_{+}}\left(x_{k,j}^{(+)} - \bar{x}_j^{(+)}\right)^{2} + \dfrac{1}{m_{-} - 1}\displaystyle\sum_{k=1}^{m_{-}}\left(x_{k,j}^{(-)} - \bar{x}_j^{(-)}\right)^{2}} \tag{4} \]

where $\bar{x}_j$, $\bar{x}_j^{(+)}$ and $\bar{x}_j^{(-)}$ are the mean values of the j-th feature over all, positive and negative samples, respectively; $m_{+}$ and $m_{-}$ denote the numbers of positive and negative samples; and $x_{k,j}^{(+)}$ and $x_{k,j}^{(-)}$ denote the j-th feature of the k-th positive sample and of the k-th negative sample, respectively.

Five independent SVM classifiers $\Lambda_i = \operatorname{sign}(\lambda_i)$ $(i = 1, 2, 3, 4, 5)$ were trained using the above method, where $\lambda_i$ is the decision function of $\Lambda_i$. Finally, for a given unseen sample x, the output of the 1st level of iLM-2L can be written as

\[ \Lambda(x) = \operatorname{sign}\left(\sum_{i=1}^{5}\lambda_i(x)\right) \]

The sample x is predicted to be a methyllysine site if $\Lambda(x)$ is 1 and a non-methyllysine site otherwise.

The 2nd level of our model identifies which methylation degree(s) a query methyllysine site has; this is a multi-label classification problem. We denote the training set used in the 2nd level as $(Y, L) = \{(y_i, l_i),\ i = 1, 2, \ldots, 226\}$, where the methyllysine site $y_i$ is a vector of 2205 dimensions and $l_i = (l_{i1}, l_{i2}, l_{i3})$ is the label vector of $y_i$: $l_{i1}$ is 1 if $y_i$ has the mono-methylated degree and −1 otherwise; $l_{i2}$ is 1 if $y_i$ has the di-methylated degree and −1 otherwise; $l_{i3}$ is 1 if $y_i$ has the tri-methylated degree and −1 otherwise. The goal of the 2nd level of our model is to learn a function $\Psi: Y \rightarrow L$ that predicts the multi-label l of an unseen sample y. We transformed this multi-label classification problem into a seven-class classification problem, since there are only three possible methylation degrees for a given methyllysine site and hence $2^{3} - 1 = 7$ non-empty combinations. The multi-label set L = {(1,1,1), (1,1,−1), (1,−1,1), (1,−1,−1), (−1,1,1), (−1,1,−1), (−1,−1,1)} was therefore transformed into the single-label set Q = {1, 2, 3, 4, 5, 6, 7}. The compositions of these seven classes are given in Table 1.

Table 1
Compositions of the seven classes after label transformation.

Number of  Number of  Number of  Number of  Number of  Number of  Number of
Class 1    Class 2    Class 3    Class 4    Class 5    Class 6    Class 7
30         1          120        3          11         5          56

Then, the "one-versus-one (OVO)" technique (Cristianini and Shawe-Taylor, 2000) was used to handle this seven-class classification problem with Libsvm. Because the seven classes are highly imbalanced (see Table 1), in the OVO method we set the penalty factor of the SVMs to $C/N_i$ for class i, where $N_i$ is the number of training samples of class i. Specifically, for class i and class j $(i, j \in \{1, 2, \ldots, 7\},\ i \ne j)$, a weighted SVM classifier $\Psi_{ij}$ was first trained by solving the following minimization problem, in which the $N_i$ samples of class i are indexed first and the $N_j$ samples of class j afterwards:

\[ \min_{\omega,\,\xi}\ \frac{1}{2}\|\omega\|^{2} + \frac{C}{N_i}\sum_{k=1}^{N_i}\|\xi_k\|^{2} + \frac{C}{N_j}\sum_{k=N_i+1}^{N_i+N_j}\|\xi_k\|^{2} \quad \text{s.t.}\quad q_k\left(\omega^{T}\Phi(y_k) + b\right) \ge 1 - \xi_k,\quad k = 1, 2, \ldots, N_i + N_j \tag{5} \]

where $q_k$ is the label of $y_k$: $q_k$ is 1 if $y_k$ belongs to class i and −1 otherwise. Then, for a given unseen sample y, a predicted single label $i^{*}$ is given by a simple majority vote:

\[ i^{*} = \arg\max_{i \in \{1, 2, \ldots, 7\}}\ \sum_{\substack{j \in \{1, 2, \ldots, 7\} \\ j \ne i}} \Psi_{ij}(y) \]

Finally, the output of the 2nd level of iLM-2L can be written as $\Psi(y) = L(i^{*})$, where L(i) represents the i-th element of L.

2.4. Performance measurement

It should be pointed out that the performance measurements for single-label and multi-label classification should be different (Chou and Shen, 2007). The 1st level prediction of iLM-2L, like other existing methylation site prediction methods, is a single-label classification problem. Four widely accepted measurements for single-label classifiers were employed in the 1st level: Sensitivity (Sn), Specificity (Sp), Accuracy (ACC) and Matthews correlation coefficient (MCC). They are defined as

\[ Sn = \frac{TP}{TP + FN} \tag{6} \]

\[ Sp = \frac{TN}{TN + FP} \tag{7} \]

\[ ACC = \frac{TP + TN}{TP + FP + TN + FN} \tag{8} \]

\[ MCC = \frac{TP \times TN - FN \times FP}{\sqrt{(TP + FN)(TN + FP)(TP + FP)(TN + FN)}} \tag{9} \]

where TP, TN, FP and FN denote the numbers of true positives, true negatives, false positives and false negatives, respectively. To most biologists, however, the four metrics (6)–(9) lack intuitiveness and are not easy to understand, especially the MCC. Here we adopt the formulation proposed in Chen et al. (2013), according to which the four metrics (6)–(9) can be rewritten as

\[ Sn = 1 - \frac{N^{+}_{-}}{N^{+}} \tag{10} \]

\[ Sp = 1 - \frac{N^{-}_{+}}{N^{-}} \tag{11} \]

\[ ACC = 1 - \frac{N^{+}_{-} + N^{-}_{+}}{N^{+} + N^{-}} \tag{12} \]

\[ MCC = \frac{1 - \left(\dfrac{N^{+}_{-}}{N^{+}} + \dfrac{N^{-}_{+}}{N^{-}}\right)}{\sqrt{\left(1 + \dfrac{N^{-}_{+} - N^{+}_{-}}{N^{+}}\right)\left(1 + \dfrac{N^{+}_{-} - N^{-}_{+}}{N^{-}}\right)}} \tag{13} \]

where $N^{+}$ is the total number of methyllysine sites investigated, $N^{+}_{-}$ is the number of methyllysine sites incorrectly predicted as non-methyllysine sites, $N^{-}$ is the total number of non-methyllysine sites investigated, and $N^{-}_{+}$ is the number of non-methyllysine sites incorrectly predicted as methyllysine sites. In terms of Eqs. (6)–(9), $N^{+} = TP + FN$, $N^{+}_{-} = FN$, $N^{-} = TN + FP$ and $N^{-}_{+} = FP$. It is clear from (10)–(13) that when $N^{+}_{-} = 0$, meaning that none of the methyllysine sites was incorrectly predicted as a non-methyllysine site, the sensitivity Sn = 1; when $N^{+}_{-} = N^{+}$, meaning that all the methyllysine sites were incorrectly predicted as non-methyllysine sites, the sensitivity Sn = 0. Likewise, when $N^{-}_{+} = 0$, meaning that none of the non-methyllysine sites was incorrectly predicted as a methyllysine site, the specificity Sp = 1, whereas when $N^{-}_{+} = N^{-}$, meaning that all the non-methyllysine sites were incorrectly predicted as methyllysine sites, the specificity Sp = 0. When $N^{+}_{-} = N^{-}_{+} = 0$, meaning that none of the methyllysine sites in the positive dataset and none of the non-methyllysine sites in the negative dataset was incorrectly predicted, the overall accuracy ACC = 1 and MCC = 1; when $N^{+}_{-} = N^{+}$ and $N^{-}_{+} = N^{-}$, meaning that


all the methyllysine sites in the positive dataset and all the non-methyllysine sites in the negative dataset were incorrectly predicted, the overall accuracy ACC = 0 and MCC = −1; whereas when $N^{+}_{-} = N^{+}/2$ and $N^{-}_{+} = N^{-}/2$, we have ACC = 0.5 and MCC = 0, meaning no better than random prediction. As the above discussion based on (10)–(13) shows, the meanings of Sn, Sp, ACC and MCC become much more intelligible and intuitive in this form.

The 2nd level of iLM-2L is a multi-label classification problem. The following five widely accepted measurements for multi-label classifiers were employed in the 2nd level prediction:

\[ \text{Hamming\_loss} = \frac{1}{N}\sum_{i=1}^{N}\frac{\|L_i \cup L_i^{*}\| - \|L_i \cap L_i^{*}\|}{M} \tag{14} \]

\[ \text{Accuracy} = \frac{1}{N}\sum_{i=1}^{N}\frac{\|L_i \cap L_i^{*}\|}{\|L_i \cup L_i^{*}\|} \tag{15} \]

\[ \text{Precision} = \frac{1}{N}\sum_{i=1}^{N}\frac{\|L_i \cap L_i^{*}\|}{\|L_i^{*}\|} \tag{16} \]

\[ \text{Recall} = \frac{1}{N}\sum_{i=1}^{N}\frac{\|L_i \cap L_i^{*}\|}{\|L_i\|} \tag{17} \]

\[ \text{Absolute\_True} = \frac{1}{N}\sum_{i=1}^{N}\Delta\left(L_i, L_i^{*}\right) \tag{18} \]

where N denotes the number of test samples; M denotes the number of all possible different methylation statuses; $L_i$ represents the actual label set of the i-th test sample and $L_i^{*}$ the predicted label set of the i-th test sample; ∪ represents the union of two sets, ∩ represents the intersection of two sets, ‖·‖ represents the number of elements in a set, and

\[ \Delta\left(L_i, L_i^{*}\right) = \begin{cases} 1, & \text{if } L_i \text{ is identical to } L_i^{*} \\ 0, & \text{otherwise} \end{cases} \tag{19} \]

Note that the lower the Hamming loss, the better the performance of a multi-label classifier; for the other four measurements, the higher the value, the better the performance. For a detailed explanation of Eqs. (14)–(18), please refer to Eq. (16) of Chou (2013).

2.5. Cross-validation

Three cross-validation methods are often used to validate a predictor: the independent dataset test, the subsampling test, and the jackknife test. However, as demonstrated by Eqs. (28)–(32) in Chou (2011), considerable arbitrariness exists in the independent dataset test and the subsampling test, and only the jackknife test always yields a unique outcome for a given benchmark dataset. Therefore, the jackknife test was used to evaluate the performance of our model. In addition, the independent dataset test was used to compare our method with existing prediction methods.

Table 2
The predictive performance of the jackknife test in the 1st level of iLM-2L with various window sizes.

Window size   Sn (%)   Sp (%)   ACC (%)   MCC
7             77.43    88.54    83.80     0.6681
9             71.68    92.16    83.42     0.6645
11            76.46    91.90    85.31     0.6994
13            79.20    89.46    85.08     0.6955
15            76.90    89.33    84.02     0.6730
17            76.46    86.36    82.14     0.6335
19            72.12    90.45    82.63     0.6446
21            74.87    88.21    82.52     0.6409

5
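The jackknife (leave-one-out) protocol described in Section 2.5 can be sketched as follows. This is only an illustration of the protocol: the 1-nearest-neighbour classifier is a stand-in and is not the support vector machine used by iLM-2L, and the function names and toy data are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

Vector = Tuple[float, ...]


def nearest_neighbour(train_x: Sequence[Vector], train_y: Sequence[int],
                      query: Vector) -> int:
    """Stand-in classifier (NOT the SVM of iLM-2L): label of the closest sample."""
    def squared_distance(v: Vector) -> float:
        return sum((a - b) ** 2 for a, b in zip(v, query))

    best = min(range(len(train_x)), key=lambda j: squared_distance(train_x[j]))
    return train_y[best]


def jackknife(samples: Sequence[Vector], labels: Sequence[int],
              train_and_predict: Callable[..., int]) -> List[int]:
    """Predict every sample with a model trained on all the other samples."""
    predictions = []
    for i in range(len(samples)):
        # Leave sample i out of the training data, then predict it.
        train_x = list(samples[:i]) + list(samples[i + 1:])
        train_y = list(labels[:i]) + list(labels[i + 1:])
        predictions.append(train_and_predict(train_x, train_y, samples[i]))
    return predictions


# Hypothetical one-dimensional toy data: two well-separated classes.
toy_x = [(0.0,), (0.1,), (0.2,), (1.0,), (1.1,), (1.2,)]
toy_y = [0, 0, 0, 1, 1, 1]
toy_pred = jackknife(toy_x, toy_y, nearest_neighbour)
```

Because every sample is left out exactly once and no random split is involved, repeating the procedure on the same benchmark dataset gives the same predictions, which is the uniqueness property the text refers to.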

3. Results and discussion

3.1. Prediction performance

For the 1st level of iLM-2L, the parameters C, γ, s and w were determined by maximizing the MCC value in the jackknife test. For the 2nd level of iLM-2L, the parameters C, γ and w were determined by minimizing the Hamming_loss value in the jackknife test. The optimal parameters of the two levels of iLM-2L are listed in Supplementary material S2.

Table 2 shows the jackknife test performance of the 1st level of iLM-2L with various window sizes. The 1st level of iLM-2L with window size 11 yielded the highest averaged MCC value, 0.6994, so the window size of each peptide was set to 11 in the 1st level identification. The jackknife test performance of the 1st level of iLM-2L for each of the five training sets is shown in Table 3. We also compared iLM-2L with Qiu's predictor iMethyl-PseAAC on Qiu's training set. The jackknife test results of the two methods are shown in Table 4. The Sn, Sp, ACC, and MCC of iLM-2L (76.46%, 91.90%, 85.31%, and 0.70) were much higher than those of iMethyl-PseAAC (71.81%, 80.56%, 76.19%, and 0.53), respectively. This result indicates that our method is more effective and can identify more lysine methylation sites from query proteins than iMethyl-PseAAC. Moreover, the average MCC of our method improved from 0.44 to 0.70 after F-score feature selection (Table 4), which suggests that the F-score feature selection method effectively improves the prediction performance of our model.

Table 5 shows the jackknife test performance of the 2nd level of iLM-2L with various window sizes. Our method with window size 15 yielded the lowest Hamming_loss value, 0.1563. Therefore, the optimal window size was set to 15 in the 2nd level prediction. For a multi-label system, the absolute-true success rate for each of the individual labels is meaningless and misleading (Xiao et al., 2013). Instead, the absolute-true success rates for the methyllysine sites with different numbers of methylation statuses (or labels) were taken into account. As shown in Table 6, the absolute-true rates for the methyllysine sites with 1 and 3 methylation statuses by iLM-2L were much higher than those by completely random guess (CRG), whereas the absolute-true rate for the methyllysine sites with 2 methylation statuses by iLM-2L was slightly lower than that by CRG. This could be because the number of training samples with 2 methylation statuses is relatively small. The CRG rates were calculated by the following equation:

$$P(\mathrm{CRG}) = \frac{1}{M}\cdot\frac{1}{C(M,m)} = \frac{1}{M \cdot M!/\left((M-m)!\,m!\right)} \quad (m \le M) \tag{20}$$

where M is the total number of all the methylation statuses and m is the number of the current methylation statuses. It is noteworthy that no feature selection method was used in the 2nd level prediction, owing to the promising jackknife test performance of our model with all the 2205 CKSAAP features.
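To make the 2nd level evaluation concrete, the multi-label measures of Eqs. (14)–(18) and the CRG rate of Eq. (20) can be sketched as below. This is a minimal sketch, not the authors' Matlab code: the function names, the label names ("mono", "di", "tri") and the three toy samples are hypothetical, and every label set is assumed to be non-empty.

```python
from math import comb
from typing import Dict, List, Set


def multilabel_measures(actual: List[Set[str]], predicted: List[Set[str]],
                        m_total: int) -> Dict[str, float]:
    """Average the per-sample terms of Eqs. (14)-(18) over all N test samples.

    Assumption: every actual and predicted label set is non-empty.
    """
    n = len(actual)
    totals = {"hamming_loss": 0.0, "accuracy": 0.0, "precision": 0.0,
              "recall": 0.0, "absolute_true": 0.0}
    for true_set, pred_set in zip(actual, predicted):
        inter = len(true_set & pred_set)
        union = len(true_set | pred_set)
        totals["hamming_loss"] += (union - inter) / m_total
        totals["accuracy"] += inter / union
        totals["precision"] += inter / len(pred_set)
        totals["recall"] += inter / len(true_set)
        totals["absolute_true"] += 1.0 if true_set == pred_set else 0.0
    return {name: value / n for name, value in totals.items()}


def crg_rate(m_total: int, m_current: int) -> float:
    """Completely random guess rate of Eq. (20): 1 / (M * C(M, m))."""
    return 1.0 / (m_total * comb(m_total, m_current))


# Hypothetical toy data with M = 3 possible methylation statuses.
actual = [{"mono"}, {"mono", "di"}, {"mono", "di", "tri"}]
predicted = [{"mono"}, {"mono"}, {"mono", "di", "tri"}]
measures = multilabel_measures(actual, predicted, m_total=3)
crg_percent = [round(100 * crg_rate(3, m), 2) for m in (1, 2, 3)]
```

With M = 3, the CRG rates for m = 1, 2 and 3 evaluate to 11.11%, 11.11% and 33.33%, which are the values listed in the CRG column of Table 6.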

Table 3
The predictive performance of jackknife test in the 1st level of iLM-2L with window size 11 on 5 training sets.

Training set   Sn (%)   Sp (%)   ACC (%)   MCC
1              74.78    92.41    84.88     0.6912
2              75.66    93.07    85.63     0.7071
3              77.88    93.07    86.58     0.7259
4              76.11    89.77    83.93     0.6701
5              77.88    91.18    85.53     0.7027
Average        76.46    91.90    85.31     0.6994


Table 4
The jackknife test performances of iMethyl-PseAAC and the 1st level of iLM-2L.

Method           Sn (%)   Sp (%)   ACC (%)   MCC
iMethyl-PseAAC   71.81    80.56    76.19     0.53
iLM-2L           76.46    91.90    85.31     0.70
iLM-2L^a         34.96    97.70    70.92     0.44

^a iLM-2L without F-score feature selection.

Table 5
The predictive performance of jackknife test in the 2nd level of iLM-2L with various window sizes.

Window size   Precision (%)   Accuracy (%)   Recall (%)   Absolute_True (%)   Hamming_loss
7             82.89           77.14          77.88        72.12               0.1696
9             81.27           75.07          75.66        69.47               0.1858
11            83.48           77.51          78.98        71.68               0.1696
13            84.37           78.69          80.16        73.01               0.1593
15            84.81           79.35          80.83        73.89               0.1563
17            83.33           78.02          78.76        73.45               0.1637
19            83.70           78.39          79.20        73.89               0.1608
21            84.29           78.98          80.09        74.34               0.1578

Table 6
The absolute true success rates for the methyllysine sites with different numbers of methylation statuses.

Number of methylation statuses   Number of methyllysine sites   Absolute_True (%)   CRG (%)
1                                181                            148/181 = 81.77     11.11
2                                15                             1/15 = 6.67         11.11
3                                30                             18/30 = 60.00       33.33

Table 7
Comparison with other methods on Qiu's independent test set.

Method           Sn (%)    Sp (%)   ACC (%)   MCC
MeMo             100.00    61.54    75.00     0.60
MASA^a           85.71     61.54    70.00     0.45
BPB-PPMS^b       71.43     73.08    72.50     0.43
PMeS             78.57     57.69    65.00     0.35
iMethyl-PseAAC   100.00    61.54    75.00     0.60
iLM-2L           100.00    65.38    77.50     0.63

^a Prediction sensitivity was set to 80%.
^b Threshold was set to 0.5.

Table 8
Comparison with other methods on our independent test set.

Method           Sn (%)   Sp (%)   ACC (%)   MCC
MeMo             26.32    93.19    88.81     0.1768
MASA^a           18.42    98.90    93.63     0.2895
BPB-PPMS^b       23.68    90.06    85.71     0.1093
PMeS             47.37    98.34    95.01     0.5369
iMethyl-PseAAC   76.32    65.93    66.61     0.2165
iLM-2L           71.05    81.40    80.72     0.3129

^a Prediction sensitivity was set to 80%.
^b Threshold was set to 0.5.

Table 9
The independent test performance of the 2nd level of iLM-2L.

Precision (%)   Accuracy (%)   Recall (%)   Absolute_True (%)   Hamming_loss
86.40           75.00          78.07        65.79               0.2018

Table 10
The top 15 features of five optimal feature sets ranked by F-score feature selection method.

Rank   Optimal feature set 1   Optimal feature set 2   Optimal feature set 3   Optimal feature set 4   Optimal feature set 5
1      KxxK                    LxK                     VxxxH                   KxxK                    KxxK
2      AR                      PG                      XxxxxK                  XxxxxK                  VxxxH
3      RK                      KxxK                    XX                      XX                      XxxxxK
4      XxxxxK                  XxxxxK                  XM                      VxxxH                   XX
5      HR                      XX                      PG                      XM                      KxxR
6      XX                      VxxxH                   PxT                     AxxxF                   KS
7      VxxxH                   XM                      AxxS                    ExxxxK                  AR
8      GxxxxH                  NP                      KxxxxX                  AR                      AxxxF
9      XM                      ExxV                    RxS                     KxxxxX                  XM
10     VxT                     ExxxxK                  DxxK                    GD                      VxR
11     LxK                     TxR                     KxxK                    KxxxxY                  SxG
12     AxR                     KxxxxX                  AR                      GxxxxP                  VT
13     AxxxF                   PxT                     RxxxxS                  PxxxxP                  KK
14     AxxS                    AR                      PxxY                    HL                      PG
15     KxxxxX                  KxxxR                   IP                      KxK                     KxxxxX

To further evaluate the effectiveness of iLM-2L, we compared it with existing prediction methods. However, all the existing predictors can only predict whether a lysine residue is a methyllysine site or a non-methyllysine site, i.e., the 1st level prediction of iLM-2L; none of them can deal with the 2nd level prediction of iLM-2L. Therefore, the comparison was limited to the 1st level prediction of iLM-2L. In addition, the models proposed by Daily et al. (2005) and Hu et al. (2011) did not provide a web-server, and the web-server of CKSAAP_Methsite (Zhang et al., 2013) did not work. Hence we compared the 1st level of iLM-2L with five existing predictors: MeMo, MASA, BPB-PPMS, PMeS, and iMethyl-PseAAC. To facilitate the comparison, the sensitivity of MASA was set to 80% and the threshold of BPB-PPMS was set to 0.5. The results of the different methods on Qiu's independent test set are given in Table 7. As shown in Table 7, iLM-2L reached the highest Sn, ACC and MCC values of 100%, 77.5% and 0.63, respectively, on Qiu's independent test set.

Table 8 lists the results of the various methods on our independent test set. Because our independent test set is imbalanced (38 methyllysine sites and 543 non-methyllysine sites), the MCC value could not accurately evaluate the performance of the various methods on it. Although PMeS achieved the highest MCC value (0.5369), its Sn value (47.37%) was much lower than that of iLM-2L (71.05%). This indicates that PMeS tends to predict a query lysine residue as a non-methyllysine site and identifies significantly fewer methyllysine sites than iLM-2L. Moreover, the Sn value of iMethyl-PseAAC (76.32%) was slightly higher than that of iLM-2L (71.05%), but the Sp value of iMethyl-PseAAC (65.93%) was much lower than that of iLM-2L (81.40%). This means that iLM-2L can identify more non-methyllysine sites than iMethyl-PseAAC at a similar level of Sn. In short, iLM-2L outperformed the existing methylation site predictors significantly on both Qiu's independent test set and our independent test set. It is worth noting that, in the independent testing phase, we directly submitted the test proteins to these five web-servers and obtained the predicted results instead of re-implementing these prediction methods.

To further validate the 2nd level prediction of iLM-2L, an independent test was also carried out. As shown in Table 9, iLM-2L achieved a satisfactory performance with a Precision of 86.40%, an Accuracy of 75.00%, a Recall of 78.07%, an Absolute_True of 65.79% and a Hamming_loss of 20.18% on our independent test set for the 2nd level.
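The relationship between Sn, Sp, ACC and MCC on the imbalanced independent test set can be checked with a short calculation. The sketch below uses the standard confusion-matrix forms of the four measures, which are assumed here to be equivalent to Eqs. (10)–(13). The counts are not reported in the paper; they are inferred from the Sn and Sp percentages in Table 8 together with the stated set sizes (38 methyllysine and 543 non-methyllysine sites), and the function name is hypothetical.

```python
from math import sqrt
from typing import Tuple


def binary_measures(tp: int, fn: int, tn: int, fp: int) -> Tuple[float, float, float, float]:
    """Return (Sn, Sp, ACC, MCC) from confusion-matrix counts."""
    sn = tp / (tp + fn)
    sp = tn / (tn + fp)
    acc = (tp + tn) / (tp + fn + tn + fp)
    denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is undefined when a margin is empty; 0.0 is used here by convention.
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return sn, sp, acc, mcc


# Counts inferred from Table 8 (assumption): 38 positives, 543 negatives.
ilm2l = binary_measures(tp=27, fn=11, tn=442, fp=101)  # Sn 71.05%, Sp 81.40%
pmes = binary_measures(tp=18, fn=20, tn=534, fp=9)     # Sn 47.37%, Sp 98.34%

# Boundary cases discussed for Acc and MCC:
all_wrong = binary_measures(tp=0, fn=38, tn=0, fp=543)  # Acc = 0, MCC = -1
half_right = binary_measures(tp=5, fn=5, tn=5, fp=5)    # Acc = 0.5, MCC = 0
```

Under these inferred counts the calculation reproduces the ACC and MCC values listed in Table 8 for both methods. PMeS makes only 9 false-positive calls among 543 negatives, which keeps its MCC high even though it misses 20 of the 38 methyllysine sites.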


Fig. 1. Amino acid frequencies surrounding the methyllysine and non-methyllysine sites, inferred from Qiu's training set.

3.2. The significant features

As mentioned in Section 2.3, five optimal CKSAAP feature sets were obtained by the F-score method in the 1st level prediction. To better understand the differences between the methyllysine and non-methyllysine sites in terms of the CKSAAP feature, the top 15 ranked residue pairs of each optimal CKSAAP feature set are given in Table 10. As shown in Table 10, some residue pairs, such as KxxK, AR, XxxxxK and VxxxH, appeared in the top 15 ranked residue pairs of every optimal feature set. This indicates that these residue pairs might be important for the identification of lysine methylation sites. To describe the top ranked residue pairs intuitively, amino acid frequencies surrounding the methyllysine and non-methyllysine sites were plotted with WebLogo (Crooks et al., 2004). As can be seen from Fig. 1, although there were no obviously conserved residues around the methyllysine and non-methyllysine sites, clear differences between the two types of sites exist. For example, the 0-spaced amino acid pair AR was significantly enriched in position pair (−2/−1) surrounding the methyllysine sites; as another example, the 2-spaced amino acid pair KxxK was depleted in position pairs (−1/2 and 2/5) around the methyllysine sites. The five complete optimal CKSAAP feature sets are given in Supplementary material S3; they may provide some useful clues for studying the sequence patterns around methyllysine sites.
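The pair names in Table 10 and the position pairs quoted above can be illustrated with a small sketch of k-spaced pair counting. It rests on several assumptions: a 21-letter alphabet (the 20 standard residues plus a filler X) and gaps k = 0 to 4, chosen because 21 × 21 × 5 reproduces the 2205 CKSAAP features mentioned in Section 3.1; raw counts rather than normalized frequencies; and the central lysine taken as position 0. The function names and the example window are hypothetical and do not come from the iLM-2L package.

```python
from collections import Counter
from itertools import product
from typing import List, Tuple

# Assumption: 20 standard residues plus a filler "X" (21 letters in total).
ALPHABET = "ACDEFGHIKLMNPQRSTVWY" + "X"
K_MAX = 4  # assumption: gaps k = 0..4, so 21 * 21 * 5 = 2205 features


def pair_label(first: str, second: str, k: int) -> str:
    """Name a k-spaced pair in the style of Table 10, e.g. ('K', 'K', 2) -> 'KxxK'."""
    return first + "x" * k + second


def feature_names(k_max: int = K_MAX) -> List[str]:
    """All possible k-spaced pair names for k = 0..k_max."""
    return [pair_label(a, b, k)
            for k in range(k_max + 1)
            for a, b in product(ALPHABET, repeat=2)]


def cksaap_counts(peptide: str, k_max: int = K_MAX) -> Counter:
    """Count every k-spaced residue pair occurring in the peptide window."""
    counts = Counter()
    for k in range(k_max + 1):
        for i in range(len(peptide) - k - 1):
            counts[pair_label(peptide[i], peptide[i + k + 1], k)] += 1
    return counts


def pair_positions(peptide: str, first: str, second: str, k: int) -> List[Tuple[int, int]]:
    """Positions of a k-spaced pair relative to the central residue (position 0)."""
    centre = len(peptide) // 2
    return [(i - centre, i + k + 1 - centre)
            for i in range(len(peptide) - k - 1)
            if peptide[i] == first and peptide[i + k + 1] == second]


# Arbitrary 11-residue window chosen for illustration; lysine at position 0.
window = "TKAARKSAPAT"
counts = cksaap_counts(window)
ar_positions = pair_positions(window, "A", "R", 0)
```

In this window the 0-spaced pair AR occupies position pair (−2/−1), the same position pair discussed above for AR.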

4. Conclusion

In this study, we developed a novel two-level predictor, iLM-2L, for identifying protein lysine methylation sites and their methylation degrees. To the best of our knowledge, this is the first time a machine learning technique has been applied to identify the methylation degrees of lysine methylation sites. Our experimental results have shown that iLM-2L significantly outperformed the five existing lysine methylation site predictors: MeMo, MASA, BPB-PPMS, PMeS, and iMethyl-PseAAC. Moreover, we analyzed the differences between the methyllysine and non-methyllysine sites in terms of the CKSAAP feature. These systematic analyses and predictions might contribute to an understanding of the mechanisms of the lysine methylation system and provide guidance for related experimental validations. As demonstrated in a series of recent publications on new prediction methods (Liu et al., 2015b; Chen et al., 2014a, 2013; Lin et al., 2014; Jia et al., 2015; Ding et al., 2014; Guo et al., 2014), user-friendly and publicly accessible web-servers significantly enhance the impact of such methods (Chou, 2015). We shall therefore make efforts in our future work to provide a web-server for the prediction method presented in this paper.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (No. 61305034); the Specialized Research Fund for the Doctoral Program of Higher Education (No. 20120041110008); and the Dalian University of Technology Fundamental Research Fund (No. DUT15RC(3)030).

Appendix A. Supporting information

Supplementary data associated with this article can be found in the online version at http://dx.doi.org/10.1016/j.jtbi.2015.07.030.

References

Bannister, A.J., Kouzarides, T., 2005. Reversing histone methylation. Nature 436, 1103–1106.
Cao, D.S., Xu, Q.S., Liang, Y.Z., 2013. Propy: a tool to generate various modes of Chou's PseAAC. Bioinformatics 29, 960–962.
Chang, C.C., Lin, C.J., 2011. Libsvm: a library for support vector machines. ACM Trans. Intell. Syst. Technol. (TIST) 2, 27.
Chen, H., Xue, Y., Huang, N., Yao, X., Sun, Z., 2006. Memo: a web tool for prediction of protein methylation modifications. Nucl. Acids Res. 34, W249–W253.
Chen, W., Feng, P.M., Deng, E.Z., 2014a. iTIS-PseTNC: a sequence-based predictor for identifying translation initiation site in human genes using pseudo trinucleotide composition. Anal. Biochem. 462, 76–83.
Chen, W., Feng, P.M., Lin, H., 2013. iRSpot-PseDNC: identify recombination spots with pseudo dinucleotide composition. Nucl. Acids Res. 41, e68.
Chen, W., Lei, T.Y., Jin, D.C., 2014b. PseKNC: a flexible web-server for generating pseudo K-tuple nucleotide composition. Anal. Biochem. 456, 53–60.
Chen, W., Lin, H., 2015. Pseudo nucleotide composition or PseKNC: an effective formulation for analyzing genomic sequences. Mol. BioSyst., 10.1039/c5mb00155b.
Chen, W., Zhang, X., Brooker, J., Lin, H., 2015. PseKNC-General: a cross-platform package for generating various modes of pseudo nucleotide compositions. Bioinformatics 31, 119–120.
Chen, Y.W., Lin, C.J., 2006. Combining svms with various feature selection strategies. Springer, pp. 315–324.


Chou, K.C., 2011. Some remarks on protein attribute prediction and pseudo amino acid composition. J. Theor. Biol. 273, 236–247.
Chou, K.C., 2013. Some remarks on predicting multi-label attributes in molecular biosystems. Mol. Biosyst. 9, 1092–1100.
Chou, K.C., 2015. Impacts of bioinformatics to medicinal chemistry. Med. Chem. 11, 218–234.
Chou, K.C., Shen, H.B., 2007. Recent progress in protein subcellular location prediction. Anal. Biochem. 370, 1–16.
Cristianini, N., Shawe-Taylor, J., 2000. An Introduction to Support Vector Machines and Other Kernel-based Learning Methods. Cambridge University Press.
Crooks, G.E., Hon, G., Chandonia, J.M., Brenner, S.E., 2004. Weblogo: a sequence logo generator. Genome Res. 14, 1188–1190.
Daily, K.M., Radivojac, P., Dunker, A.K., 2005. Intrinsic disorder and protein modifications: building an svm predictor for methylation. In: Proceedings of the IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology, CIBCB'05, 2005, pp. 1–7.
Dehzangi, A., Heffernan, R., Sharma, A., Lyons, J., Paliwal, K., Sattar, A., 2015. Gram-positive and Gram-negative protein subcellular localization by incorporating evolutionary-based descriptors into Chou's general PseAAC. J. Theor. Biol. 364, 284–294.
Ding, H., Deng, E.Z., Yuan, L.F., Liu, L., 2014. iCTX-Type: a sequence-based predictor for identifying the types of conotoxins in targeting ion channels. BioMed Res. Int., 2014.
Du, P., Gu, S., Jiao, Y., 2014. PseAAC-General: fast building various modes of general form of Chou's pseudo-amino acid composition for large-scale protein datasets. Int. J. Mol. Sci. 15, 3495–3506.
Du, P., Wang, X., Xu, C., Gao, Y., 2012. PseAAC-Builder: a cross-platform stand-alone program for generating various special Chou's pseudo-amino acid compositions. Anal. Biochem. 425, 117–119.
Guo, S.H., Deng, E.Z., Xu, L.Q., Ding, H., 2014. iNuc-PseKNC: a sequence-based predictor for predicting nucleosome positioning in genomes with pseudo k-tuple nucleotide composition. Bioinformatics 30, 1522–1529.
Hamamoto, R., Saloura, V., Nakamura, Y., 2015. Critical roles of non-histone protein lysine methylation in human tumorigenesis. Nat. Rev. Cancer 15, 110–124.
Hart-Smith, G., Chia, S.Z., Low, J.K., McKay, M.J., Molloy, M.P., Wilkins, M.R., 2014. Stoichiometry of saccharomyces cerevisiae lysine methylation: insights into non-histone protein lysine methyltransferase activity. J. Proteome Res. 13, 1744–1756.
Hu, L.L., Li, Z., Wang, K., Niu, S., Shi, X.H., Cai, Y.D., Li, H.P., 2011. Prediction and analysis of protein methylarginine and methyllysine based on multisequence features. Biopolymers 95, 763–771.
Huang, J., Sengupta, R., Espejo, A.B., Lee, M.G., Dorsey, J.A., Richter, M., Berger, S.L., 2007. p53 is regulated by the lysine demethylase LSD1. Nature 449, 105–108.
Jia, C., Lin, X., Wang, Z., 2014. Prediction of protein S-nitrosylation sites based on adapted normal distribution bi-profile Bayes and Chou's pseudo amino acid composition. Int. J. Mol. Sci. 15, 10410–10423.
Jia, J., Liu, Z., Xiao, X., 2015. iPPI-Esml: an ensemble classifier for identifying the interactions of proteins by incorporating their physicochemical properties and wavelet transforms into PseAAC. J. Theor. Biol. 377, 47–56.
Johnson, D.S., Li, W., Gordon, D.B., Bhattacharjee, A., Curry, B., Ghosh, J., Brizuela, L., Carroll, J.S., Brown, M., Flicek, P., et al., 2008. Systematic evaluation of variability in chip-chip experiments using predefined dna targets. Genome Res. 18, 393–403.
Khan, Z.U., Hayat, M., Khan, M.A., 2015. Discrimination of acidic and alkaline enzyme using Chou's pseudo amino acid composition in conjunction with probabilistic neural network model. J. Theor. Biol. 365, 197–203.
Kumar, R., Srivastava, A., Kumari, B., Kumar, M., 2015. Prediction of beta-lactamase and its class by Chou's pseudo-amino acid composition and support vector machine. J. Theor. Biol. 365, 96–103.
Lee, D.Y., Teyssier, C., Strahl, B.D., Stallcup, M.R., 2005. Role of protein methylation in regulation of transcription. Endocr. Rev. 26, 147–170.
Lin, H., Deng, E.Z., Ding, H., 2014. iPro54-PseKNC: a sequence-based predictor for identifying sigma-54 promoters in prokaryote with pseudo k-tuple nucleotide composition. Nucl. Acids Res. 42, 12961–12972.
Lin, H., Wang, H., Ding, H., Chen, Y.L., Li, Q.Z., 2009. Prediction of subcellular localization of apoptosis protein using Chou's pseudo amino acid composition. Acta Biotheor. 57, 321–330.
Lin, S.X., Lapointe, J., 2013. Theoretical and experimental biology in one. J. Biomed. Sci. Eng. 6, 435–442.
Liu, B., Fang, L., Liu, F., Wang, X., Chen, J., 2015a. Identification of real microRNA precursors with a pseudo structure status composition approach. PLoS One 10, e0121501.

Liu, B., Liu, F., Fang, L., 2015b. repRNA: a web server for generating various feature vectors of RNA sequences. Mol. Genet. Genom., 10.1007/s00438-015-10787.2015.
Liu, B., Liu, F., Fang, L., Wang, X., 2015c. repDNA: a Python package to generate various modes of feature vectors for DNA sequences by incorporating user-defined physicochemical properties and sequence-order effects. Bioinformatics 31, 1307–1309.
Liu, B., Liu, F., Wang, X., Chen, J., Fang, L., 2015d. Pse-in-One: a web server for generating various modes of pseudo components of DNA, RNA, and protein sequences. Nucl. Acids Res., http://dx.doi.org/10.1093/nar/gkv458.
Liu, Z., Xiao, X., Qiu, W.R., 2015e. iDNA-Methyl: identifying DNA methylation sites via pseudo trinucleotide composition. Anal. Biochem. 474, 69–77.
Martin, C., Zhang, Y., 2005. The diverse functions of histone lysine methylation. Nat. Rev. Mol. Cell Biol. 6, 838–849.
Mondal, S., Pai, P.P., 2014. Chou's pseudo amino acid composition improves sequence-based antifreeze protein prediction. J. Theor. Biol. 356, 30–35.
Murray, K., 1964. The occurrence of ε-N-methyl lysine in histones. Biochemistry 3, 10–15.
Paik, W.K., Paik, D.C., Kim, S., 2007. Historical review: the field of protein methylation. Trends Biochem. Sci. 32, 146–152.
Qiu, W.R., Xiao, X., Lin, W.Z., Chou, K.C., 2014. iMethyl-PseAAC: identification of protein methylation sites via a pseudo amino acid composition approach. BioMed Res. Int., 2014.
Shao, J., Xu, D., Tsai, S.N., Wang, Y., Ngai, S.M., 2009. Computational identification of protein methylation sites through bi-profile bayes feature extraction. PLoS One 4, e4920.
Shi, S.P., Qiu, J.D., Sun, X.Y., Suo, S.B., Huang, S.Y., Liang, R.P., 2012. Pmes: prediction of methylation sites based on enhanced feature encoding scheme. PLoS One 7, e38772.
Shien, D.M., Lee, T.Y., Chang, W.C., Hsu, J.B.K., Horng, J.T., Hsu, P.C., Wang, T.Y., Huang, H.D., 2009. Incorporating structural characteristics for identification of protein methylation sites. J. Comput. Chem. 30, 1532–1543.
Snijders, A.P., Hung, M.L., Wilson, S.A., Dickman, M.J., 2010. Analysis of arginine and lysine methylation utilizing peptide separations at neutral ph and electron transfer dissociation mass spectrometry. J. Am. Soc. Mass Spectrom. 21, 88–96.
Sun, G.D., Cui, W.P., Guo, Q.Y., Miao, L.N., 2014. Histone lysine methylation in diabetic nephropathy. J. Diabetes Res.
Turner, B.M., 2002. Cellular memory and the histone code. Cell 111, 285–291.
Varier, R.A., Timmers, H.M., 2011. Histone lysine methylation and demethylation pathways in cancer. Biochim. Biophys. Acta (BBA) – Rev. Cancer 1815, 75–89.
Wang, X., Zhang, W., Zhang, Q., Li, G.Z., 2015. MultiP-SChlo: multi-label protein subchloroplast localization prediction with Chou's pseudo amino acid composition and a novel multi-label classifier. Bioinformatics, http://dx.doi.org/10.1093/bioinformatics/btv1212.
Wang, Y., Jia, S., 2009. Degrees make all the difference: the multifunctionality of histone h4 lysine 20 methylation. Epigenetics 4, 273–276.
Xiao, X., Wang, P., Lin, W.Z., Jia, J.H., Chou, K.C., 2013. iamp-2l: a two-level multi-label classifier for identifying antimicrobial peptides and their functional types. Anal. Biochem. 436, 168–177.
Xu, Y., Ding, J., Wu, L.Y., 2013a. iSNO-PseAAC: predict cysteine S-nitrosylation sites in proteins by incorporating position specific amino acid propensity into pseudo amino acid composition. PLoS One 8, e55844.
Xu, Y., Shao, X.J., Wu, L.Y., 2013b. iSNO-AAPair: incorporating amino acid pairwise coupling into PseAAC for predicting cysteine S-nitrosylation sites in proteins. PeerJ 1, e171.
Xu, Y., Wen, X., Shao, X.J., Deng, N.Y., 2014a. iHyd-PseAAC: predicting hydroxyproline and hydroxylysine in proteins by incorporating dipeptide position-specific propensity into pseudo amino acid composition. Int. J. Mol. Sci. 15, 7594–7610.
Xu, Y., Wen, X., Wen, L.S., Wu, L.Y., 2014b. iNitro-Tyr: prediction of nitrotyrosine sites in proteins with general pseudo amino acid composition. PLoS One 9, e105018.
Zhang, J., Zhao, X., Sun, P., Ma, Z., 2014. PSNO: predicting cysteine S-nitrosylation sites by incorporating various sequence-derived features into the general form of Chou's PseAAC. Int. J. Mol. Sci. 15, 11204–11219.
Zhang, W., Xu, X., Yin, M., Luo, N., Zhang, J., Wang, J., 2013. Prediction of methylation sites using the composition of k-spaced amino acid pairs. Protein Pept. Lett. 20, 911–917.
Zhong, W.Z., Zhou, S.F., 2014. Molecular science for drug development and biomedicine. Int. J. Mol. Sci. 15, 20072–20078.
