Prediction and functional analysis of prokaryote lysine acetylation site by incorporating six types of features into Chou's general PseAAC


Journal of Theoretical Biology 461 (2019) 92–101


Journal of Theoretical Biology journal homepage: www.elsevier.com/locate/jtb

Guodong Chen, Man Cao, Jialin Yu, Xinyun Guo, Shaoping Shi∗

Department of Mathematics and Numerical Simulation and High-Performance Computing Laboratory, School of Sciences, Nanchang University, Nanchang 330031, China

Article info

Article history: Received 16 August 2018; Revised 9 October 2018; Accepted 22 October 2018; Available online 23 October 2018

Keywords: Information gain; Elastic net; Post-translational modifications; Predictor

Abstract

Lysine acetylation is one of the most important types of protein post-translational modification (PTM) and is widely involved in cellular regulatory processes. To fully understand the regulatory mechanism of acetylation, identifying acetylation sites is the first and most important step. However, experimental identification of protein acetylation sites is often time-consuming and expensive, so predicting PTM sites by computational methods has become popular in recent years. Here, we developed a novel method, ProAcePred 2.0, to predict species-specific prokaryote lysine acetylation sites. We employed information gain, an efficient position-specific analysis strategy, to construct a position-specific window for each acetylation peptide, then incorporated different types of features and adopted the elastic net algorithm to optimize the feature vectors for model learning. On the training datasets, the prediction models achieved area under the receiver operating characteristic curve (AUC) values of 0.78, 0.752, 0.783, 0.718, 0.839 and 0.826 for Escherichia coli, Corynebacterium glutamicum, Mycobacterium tuberculosis, Bacillus subtilis, S. typhimurium and Geobacillus kaustophilus, respectively. On independent test datasets, our method was highly competitive with other methods for the majority of species. In addition, functional analyses demonstrated that the acetylated proteins of different organisms are preferentially involved in different biological processes and pathways. The detailed analyses in this paper could help to clarify the acetylation mechanism and provide guidance for related experimental validation. A user-friendly online web service of ProAcePred 2.0 is freely available at http://computbiol.ncu.edu.cn/PAPred.

© 2018 Elsevier Ltd. All rights reserved.

∗ Corresponding author. E-mail address: [email protected] (S. Shi).

https://doi.org/10.1016/j.jtbi.2018.10.047 0022-5193/© 2018 Elsevier Ltd. All rights reserved.

1. Introduction

Nɛ-lysine acetylation is a PTM that affects protein structure, function and stability and is involved in diverse pathways in eukaryotes and prokaryotes (Barak et al., 2006; Song et al., 2016; Starai and Escalante-Semerena, 2004). In bacteria, the ɛ-amino group of a lysine residue can be acetylated either by lysine acetyltransferases (YfiQ/Pat) or by a non-enzymatic mechanism that depends on the activities of acetyl-phosphate (AcP) or acetyl-coenzyme A (AcCoA) (Barak et al., 2006; Starai and Escalante-Semerena, 2004; Weinert et al., 2013). Recently, more experimental observations have suggested that lysine acetylation broadly impacts bacterial physiology and bacterial virulence (Liang et al., 2011; Ren et al., 2017; Wang et al., 2010). For example, in Listeria monocytogenes and Campylobacter jejuni, the response regulator protein CheY is involved in bacterial virulence (e.g., adherence and invasion) (Dons et al., 2004; Yao et al., 1997). Because acetylation is critical to the activity of CheY, it is highly possible that acetylation regulates bacterial virulence through CheY. Similarly, the transcriptional regulator protein RcsB contributes to bacterial virulence in Erwinia amylovora and Salmonella (Bereswill et al., 1997; Domínguez-Bernal et al., 2004; Mouslim et al., 2004). A study in Escherichia coli (E.coli) first identified that acetylation is involved in regulating RcsB activity, and that acetylation of lysine 154 of RcsB impairs its function, affecting flagella biosynthesis and bacterial motility and decreasing acid stress survival (Castaño-Cerezo et al., 2015). Therefore, it is essential to understand the regulatory mechanism of acetylation, and the first step is to identify the acetylation sites.

With advances in technology, many prokaryote acetylation substrates and sites have been identified by radioactivity detection (Welsch and Nelsestuen, 1988), mass spectrometry (Zhou et al., 2004), and chromatin immunoprecipitation (ChIP) (Umlauf et al., 2004). For example, Zhang’s group identified 349 acetylated proteins and 1070 acetylation sites in E.coli (Zhang et al., 2013), and Xie’s group identified 658 acetylated proteins and 1128 acetylation sites in Mycobacterium tuberculosis (M.tuberculosis) (Xie et al., 2015). However, these experimental methods are usually time-consuming and expensive. Therefore, alternative computational methods are necessary for high-throughput identification of prokaryote acetylation sites.

Fig. 1. Flow chart of the ProAcePred 2.0 approach, which includes sequence fragment optimization, feature calculation, feature selection and model training (AAC: amino acid compositions, AASA: average accessible surface area, EBGW: encoding based on grouped weight, KNN: K nearest neighbors, PWAA: position weight amino acid compositions, AAP: amino acid pair compositions).

In the last few years, many PTM-prediction papers have been published (Feng et al., 2017, 2018; Qiu et al., 2014, 2015, 2018; Xu et al., 2013a, 2013b, 2014a, 2014b). Various computational models have been proposed to predict eukaryote acetylated lysine sites, such as PredMod (Basu et al., 2009), ASEB (Li et al., 2012), LAceP (Hou et al., 2014), LysAcet (Li et al., 2009), N-Ace (Lee et al., 2010), EnsemblePail (Xu et al., 2010), Phosida (Gnad et al., 2010), PLMLA (Shi et al., 2012), PSKAcePred (Suo et al., 2012) and BRABSB (Shao et al., 2012). Only three prediction methods have been designed for predicting prokaryote acetylation sites. Song’s group developed a method termed SSPKA for eukaryote and prokaryote acetylation site prediction (covering two prokaryote species, E.coli and Salmonella typhimurium (S.typhimurium)); it uses a random forest classifier that combines sequence-derived and functional features, with the minimum redundancy maximum relevance (mRMR) approach to optimize the feature vectors (Li et al., 2014). Subsequently, Hu’s group developed a novel tool, KA-predictor, to predict species-specific lysine acetylation sites, again covering the two prokaryote species E.coli and S.typhimurium (Wuyun et al., 2016). In our previous work, we collected 7288 non-redundant prokaryotic acetylation sites across nine different species and developed a tool, ProAcePred, for predicting species-specific prokaryote acetylation sites by combining three types of features with an elastic net (EN) algorithm to

optimize feature vectors (Chen et al., 2018). Recently, the number of identified prokaryote lysine acetylation sites has greatly increased across a wide variety of organisms, which provides a great opportunity to improve the prediction quality for prokaryote-specific acetylation sites and to understand some of the regulatory mechanisms of prokaryote acetylation. In this study, we aimed to update ProAcePred and develop a useful tool, named ProAcePred 2.0, to predict lysine acetylation sites in six prokaryote species. We manually collected 14,145 experimentally identified prokaryote lysine acetylation sites from 4176 proteins across six different prokaryote species: E.coli, S.typhimurium, Bacillus subtilis (B.subtilis), M.tuberculosis, Corynebacterium glutamicum (C.glutamicum) and Geobacillus kaustophilus (G.kaustophilus). ProAcePred was adopted and further improved by using the information gain (IG) method to select important position-specific residues, which constitute a new acetylation peptide derived from the primary sequence fragment.

To develop a really useful sequence-based statistical predictor for a biological system, as reported in a series of recent publications (Cheng et al., 2017; Feng et al., 2013; Liu et al., 2018; Song et al., 2018a, 2018b; Yang et al., 2018), one should observe Chou’s 5-step rule (Chou, 2011): (i) how to construct or select a valid benchmark dataset to train and test the predictor; (ii) how to formulate the biological sequence samples with an effective mathematical expression that can truly reflect their intrinsic correlation with the target to be predicted; (iii) how to introduce or develop a powerful algorithm (or engine) to operate the prediction; (iv) how to properly perform cross-validation tests to objectively evaluate the anticipated accuracy of the predictor; (v) how to establish a user-friendly web server for the predictor that is accessible to the public. Below, we describe how we dealt with these steps one by one. A flowchart of the ProAcePred 2.0 approach is given in Fig. 1.

2. Materials and methods

2.1. Data collection and preprocessing

In order to develop a reliable predictor, a new non-redundant dataset was constructed. Acetylation is predominantly found on lysine residues, and our prokaryote lysine acetylation data cover six species: E.coli, S.typhimurium, B.subtilis, M.tuberculosis, C.glutamicum and G.kaustophilus. These data were extracted from the PLMD database (Xu et al., 2017), which as of 2017 had collected and integrated 284,780 modification events in 53,501 proteins across 176 eukaryotes and prokaryotes for up to 20 types of protein lysine modification. From PLMD, we collected an original acetyl-lysine dataset containing 1860 acetylated proteins and 9188 acetylation sites of E.coli, 186 acetylated proteins and 254 acetylation sites of S.typhimurium, 762 acetylated proteins and 2036 acetylation sites of B.subtilis, 661 acetylated proteins and 1129 acetylation sites of M.tuberculosis, 593 acetylated proteins and 1285 acetylation sites of C.glutamicum, and 114 acetylated proteins and 253 acetylation sites of G.kaustophilus. We then clustered the protein sequences of each species separately with CD-HIT (Li and Godzik, 2006) at a 30% identity threshold to eliminate homologous protein sequences (detailed information is shown in Supplementary Table S1). Afterward, six independent test datasets were constructed by randomly selecting 10% of the non-homologous protein entries of each species. The remaining proteins were used as the training dataset for each species (detailed information is shown in Supplementary Table S2). Thereafter, the experimentally validated acetylated lysine (K) fragments were extracted as the positive set, and the remaining K residues in these proteins were considered the negative set (non-acetylation sites).

For both the positive and negative sets, we defined a local window for each acetylation or non-acetylation fragment. The window is denoted by a sequence fragment $x = S_{-L} S_{-(L-1)} \cdots S_{-1} S_0 S_1 \cdots S_{L-1} S_L$, in which $S_0$ represents the central residue K and $S_i$ ($i = \pm 1, \pm 2, \ldots, \pm L$) represents a residue adjacent to it. Because structural studies have shown that the peptide substrates coupled with lysine acetyltransferase domains typically do not exceed 14–20 amino acids in length (Marmorstein, 2001), we initially chose a window of 10 residues (L = 10) upstream and downstream of the acetylation or non-acetylation site, so that the whole peptide has length 21 (if the sequence is too short on either side, we add virtual residues O). We then used CD-HIT again (threshold of 30%) to eliminate homologous sequence fragments from the positive and negative sets of the training and independent test datasets, respectively. The detailed information for the training and independent test datasets is displayed in Supplementary Tables S3-S4. For cross-validation, all of the non-redundant positive samples were placed in the positive training set, and a balanced negative training set was randomly drawn from the non-redundant negative samples, giving a balanced sample for each of the six species. The detailed original datasets, training sets and independent test sets of the six species can be downloaded from our website (file names ‘Original Datasets.xlsx’, ‘Training Datasets.xlsx’ and ‘Independent Datasets.xlsx’, respectively).
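The window construction described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `extract_windows` and `split_by_label`, their parameters, and the toy sequence are hypothetical; only the half-width L = 10, the central residue K and the virtual residue O come from the text.

```python
def extract_windows(sequence, half_width=10, pad="O"):
    """Return (1-based position, fragment) for every lysine (K) in `sequence`.

    Each fragment has length 2 * half_width + 1 with K at its centre; positions
    that fall outside the protein are filled with the virtual residue `pad`.
    """
    padded = pad * half_width + sequence + pad * half_width
    windows = []
    for i, residue in enumerate(sequence):
        if residue == "K":
            # Index i in `sequence` is index i + half_width in `padded`,
            # so the window centred on it starts at padded index i.
            windows.append((i + 1, padded[i:i + 2 * half_width + 1]))
    return windows


def split_by_label(windows, acetylated_positions):
    """Split windows into positive (acetylated) and negative fragments."""
    positives = [frag for pos, frag in windows if pos in acetylated_positions]
    negatives = [frag for pos, frag in windows if pos not in acetylated_positions]
    return positives, negatives


# Toy example (hypothetical sequence; half-width 2 instead of 10 for brevity).
windows = extract_windows("MKAK", half_width=2)
positives, negatives = split_by_label(windows, acetylated_positions={2})
```

In this toy case the lysine at position 2 is treated as the experimentally validated site, so its fragment goes to the positive set and the other lysine fragment to the negative set. Homology reduction with CD-HIT and the random balancing of negatives are separate steps that this sketch does not cover.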

2.2. IG

For a given protein sequence fragment, the degree of conservation varies from site to site, and some residues near the central site contribute little to the identification of lysine acetylation sites (Weinert et al., 2011). In addition, our previous work showed that the sequence patterns are species-specific (Chen et al., 2018). Therefore, we need a method to select the specific residues that have a positive influence on the predictive models. Shannon provided an effective information-theoretical concept for measuring the uncertainty of a given system (Shannon, 1997), and we chose IG, a correlation measure based on this concept, to process the datasets. The theory of the IG method is as follows. The information entropy $H_s(X)$ at position s over all sequence fragments is defined as:

$$H_s(X) = -\sum_{i} P_s(x_i) \log_2 P_s(x_i) \qquad (1)$$

and the entropy of X after observing the values of another variable Y is defined as:

$$H_s(X|Y) = -\sum_{j} P_s(y_j) \sum_{i} P_s(x_i|y_j) \log_2 P_s(x_i|y_j) \qquad (2)$$

where s = 1, 2, …, n; i = 1, 2; j = 1, 2, …, 21; and n is the length of the sequence fragment. $P_s(x_i)$ is the prior probability of class i (acetylation or non-acetylation) over all sequence fragments, $P_s(y_j)$ is the probability of the jth amino acid occurring at position s in those fragments, and $P_s(x_i|y_j)$ is the posterior probability of class i given that the jth amino acid occurs at position s. The amount by which the entropy of X decreases reflects the additional information about X provided by Y and is called the IG:

$$IG_s(X|Y) = H_s(X) - H_s(X|Y) \qquad (3)$$
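Eqs. (1)–(3) can be made concrete with a short Python sketch that computes the IG of every window position from labelled fragments. It is only an illustration of the definitions: the function names, the 1/0 label coding and the toy fragments are assumptions introduced here, and the probabilities are estimated as plain relative frequencies, which the paper does not spell out.

```python
from collections import Counter
from math import log2


def entropy(probabilities):
    """Shannon entropy in bits; zero-probability terms contribute nothing."""
    return -sum(p * log2(p) for p in probabilities if p > 0)


def information_gain(fragments, labels):
    """Return IG_s for each position s of equal-length fragments.

    `labels` holds one class per fragment (e.g. 1 = acetylation,
    0 = non-acetylation). Residue symbols may include the virtual residue O.
    """
    n = len(fragments)
    # Eq. (1): entropy of the class prior (identical for every position).
    h_x = entropy(count / n for count in Counter(labels).values())
    gains = []
    for s in range(len(fragments[0])):
        # Group the class labels by the residue observed at position s.
        labels_by_residue = {}
        for fragment, label in zip(fragments, labels):
            labels_by_residue.setdefault(fragment[s], []).append(label)
        # Eq. (2): conditional entropy of the class given the residue.
        h_x_given_y = 0.0
        for group in labels_by_residue.values():
            p_y = len(group) / n
            h_x_given_y += p_y * entropy(c / len(group) for c in Counter(group).values())
        # Eq. (3): information gain at position s.
        gains.append(h_x - h_x_given_y)
    return gains


# Toy data (hypothetical): position 0 separates the classes perfectly,
# positions 1 and 2 carry no class information.
toy_fragments = ["AKC", "AKD", "GKC", "GKD"]
toy_labels = [1, 1, 0, 0]
toy_gains = information_gain(toy_fragments, toy_labels)
```

Ranking positions by these values and keeping the top-scoring ones is one way to read the IG window construction used later; the exact selection rule applied by the authors is not reproduced here.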

The above theory shows that the higher the IG value of a position, the more important the corresponding residue is for predicting acetylation sites.

2.3. Feature extraction and optimization methods

With the explosive growth of biological sequences in the post-genomic era, one of the most important but also most difficult problems in computational biology is how to express a biological sequence as a discrete model or vector while still keeping considerable sequence-order information or key pattern characteristics. This is because existing machine-learning algorithms can only handle vectors, not sequence samples, as elucidated in a comprehensive review (Chou, 2015). However, a vector defined in a discrete model may completely lose the sequence-pattern information. To avoid this for proteins, the pseudo amino acid composition (Chou, 2001a, 2001b), or PseAAC (Chou, 2005), was proposed. Since then, Chou’s PseAAC has been widely used in nearly all areas of computational proteomics (Akbar and Hayat, 2018; Arif et al., 2018; Ju and Wang, 2018; Mei and Zhao, 2018; Qiu et al., 2018). Because of this wide and increasing use, three open-access software packages, called ‘PseAAC-Builder’, ‘propy’ and ‘PseAAC-General’, were established: the former two generate various modes of Chou’s special PseAAC (Chou, 2009), while the third generates those of Chou’s general PseAAC (Chou, 2011), including not only all the special modes of feature vectors for proteins but also higher-level feature vectors such as the "Functional Domain" mode, "Gene Ontology" mode, and "Sequential Evolution" or "PSSM" mode. More recently, a web server called ‘Pse-in-One’ (Liu et al., 2015) and its updated version ‘Pse-in-One2.0’ (Liu et al., 2017) have been established, which can generate any desired feature vectors for protein/peptide and DNA/RNA sequences according to the users’ needs or their own definitions.

Fig. 2. Function annotation of lysine acetylation in C.glutamicum (three classes of GO terms, including biological processes, molecular functions, and cellular components, are adopted, and statistical enrichment analysis of the GO terms is performed with the binomial distribution with p < 0.01).

In our study, we used six types of pseudo amino acid composition to extract acetylation peptide information (Fig. 1): amino acid compositions (AAC), average accessible surface area (AASA), encoding based on grouped weight (EBGW), K nearest neighbors (KNN), position weight amino acid compositions (PWAA) and amino acid pair compositions (AAP) (a detailed feature description is given in Supplementary Text S1). Although the six feature types contain much more protein information, they also contain noisy and redundant information that may adversely affect model learning, for example by decreasing prediction performance, lengthening classifier training and possibly biasing model predictions, so feature optimization is necessary. We adopted the EN algorithm to optimize the merged features of all types and form the final optimal feature set. Building on the lasso theory proposed by Tibshirani (1996), Zou and Hastie proposed the EN method. Similar to the lasso, the EN simultaneously performs automatic variable selection and continuous shrinkage, and it can select groups of correlated variables (Zou and Hastie, 2005; Zhou et al., 2015; Xiao and Yin, 2017). Suppose that the data set has n observations with p predictors. Let $Y = (y_1, y_2, \ldots, y_n)^T$ be the response and $X = (X_1, \ldots, X_p)$ be the

model matrix, where $X_j = (x_{1j}, \ldots, x_{nj})^T$, j = 1, …, p, and $\beta = (\beta_1, \beta_2, \ldots, \beta_p)^T$ is the sparse coefficient vector. After a location and scale transformation, we can assume that the response is centered and the predictors are standardized:

$$\frac{1}{n}\sum_{i=1}^{n} y_i = 0, \quad \frac{1}{n}\sum_{i=1}^{n} x_{ij} = 0 \quad \text{and} \quad \frac{1}{n}\sum_{i=1}^{n} x_{ij}^2 = 1, \quad \text{for } j = 1, 2, \cdots, p. \qquad (4)$$

Given the data set (Y, X) and the parameters (s, λ2), the EN estimate is defined as:

$$\hat{\beta}(\text{elastic net}) = \arg\min_{\beta} \; \beta^T \left( \frac{X^T X + \lambda_2 I}{1 + \lambda_2} \right) \beta - 2 Y^T X \beta + s|\beta|_1, \qquad (5)$$

where $|\beta|^2 = \sum_{j=1}^{p} \beta_j^2$ and $|\beta|_1 = \sum_{j=1}^{p} |\beta_j|$ satisfy the constraint $\left(1 - \frac{\lambda_2}{s+\lambda_2}\right)|\beta|_1 + \frac{\lambda_2}{s+\lambda_2}|\beta|^2 \le s$, and T denotes the matrix transpose.

Note that the elastic net has two tuning parameters (λ2, s), which need to be adjusted. In this study, for each of the six species we selected the optimal s value for a given λ2 value (steps = 500; the steps value controls when the EN regression stops), and X represents the merged six types of feature

vectors, and Y represents the sample labels ($y_i = 1$ if the sequence fragment is an acetylation sequence; otherwise $y_i = -1$).

Fig. 3. The information gain values at different positions of residues in the sequence fragments.

Fig. 4. Sequence logo illustration generated by Two Sample Logo for acetylation sites sequence information of six species (P-value < 0.05; t-test).

2.4. Model learning and evaluation

Support vector machine (SVM) is a popular machine learning algorithm based on statistical learning theory (Noble, 2006). The idea is to map the input samples into a higher-dimensional space using a kernel function and then to find a hyperplane that separates the two classes. In our method, a radial basis function (RBF) is chosen as the kernel function, and two parameters, the penalty parameter C and the kernel width parameter γ, are tuned on the training set using the grid search strategy in LIBSVM (version 3.1) (http://www.csie.ntu.edu.tw/~cjlin/libsvm/).
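To make the EN feature-selection step of Section 2.3 more tangible, the following Python sketch fits an elastic net by cyclic coordinate descent and keeps the features whose coefficients are non-zero. Note the swapped-in technique: the paper states the problem in the Zou and Hastie form with parameters (λ2, s) and a step limit, whereas this sketch, as an assumption made for brevity, minimizes the penalized objective (1/2n)·||y − Xβ||² + lam1·|β|₁ + (lam2/2)·|β|² with hypothetical parameters `lam1` and `lam2`. Function names and the toy data are likewise illustrative only.

```python
def soft_threshold(value, threshold):
    """Shrink `value` toward zero by `threshold` (the lasso proximal step)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def elastic_net(X, y, lam1, lam2, n_iter=200):
    """Cyclic coordinate descent for the penalized elastic net.

    X is a list of rows with centred, standardized columns; y is centred
    (here: labels +1 / -1). Returns the coefficient list beta.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    residual = list(y)  # y - X @ beta, with beta initially zero
    for _ in range(n_iter):
        for j in range(p):
            column = [row[j] for row in X]
            # Correlation of feature j with the residual that excludes
            # feature j's own current contribution.
            rho = sum(c * (r + c * beta[j]) for c, r in zip(column, residual)) / n
            scale = sum(c * c for c in column) / n
            new_beta_j = soft_threshold(rho, lam1) / (scale + lam2)
            delta = new_beta_j - beta[j]
            if delta != 0.0:
                residual = [r - c * delta for c, r in zip(column, residual)]
                beta[j] = new_beta_j
    return beta


def selected_features(beta, tol=1e-12):
    """Indices of features that survive the shrinkage."""
    return [j for j, b in enumerate(beta) if abs(b) > tol]


# Toy data (hypothetical): feature 0 equals the label, feature 1 is
# uncorrelated with it, so only feature 0 should be retained.
X_toy = [[1, 1], [1, -1], [-1, 1], [-1, -1]]
y_toy = [1, 1, -1, -1]
beta_toy = elastic_net(X_toy, y_toy, lam1=0.1, lam2=0.1)
kept = selected_features(beta_toy)
```

In a pipeline like the one described here, the retained feature columns would then be passed to the RBF-kernel SVM, with the penalty strengths chosen by cross-validated AUC; that outer loop is omitted from the sketch.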


To evaluate model performance, 10-fold cross-validation is performed. In addition, to provide a more intuitive and easier-to-understand measure of prediction quality, we adopted a set of four metrics, accuracy (Acc), sensitivity (Sn), specificity (Sp) and the Matthews correlation coefficient (MCC), based on the formulation used by Chou (Chou, 2001a, 2001b) in studying signal peptide prediction. According to Chou’s formulation, Acc, Sn, Sp and MCC can be expressed as Eq. (6) (Chen et al., 2013; Cheng et al., 2017a, 2017b, 2017c, 2017d; Lin et al., 2014; Liu et al., 2018; Xiao et al., 2017).

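$$\begin{cases}
Sn = 1 - \dfrac{N^{+}_{-}}{N^{+}}, & 0 \le Sn \le 1 \\[2ex]
Sp = 1 - \dfrac{N^{-}_{+}}{N^{-}}, & 0 \le Sp \le 1 \\[2ex]
Acc = 1 - \dfrac{N^{+}_{-} + N^{-}_{+}}{N^{+} + N^{-}}, & 0 \le Acc \le 1 \\[2ex]
MCC = \dfrac{1 - \left(\dfrac{N^{+}_{-}}{N^{+}} + \dfrac{N^{-}_{+}}{N^{-}}\right)}{\sqrt{\left(1 + \dfrac{N^{-}_{+} - N^{+}_{-}}{N^{+}}\right)\left(1 + \dfrac{N^{+}_{-} - N^{-}_{+}}{N^{-}}\right)}}, & -1 \le MCC \le 1
\end{cases} \qquad (6)$$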
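As a quick check of Eq. (6), the four metrics can be computed directly from the four counts it uses. The sketch below is illustrative only: the function name `chou_metrics`, its argument names and the example counts are assumptions. Its arguments correspond to N⁺ (all acetylation sites), N⁺₋ (acetylation sites predicted as non-acetylation), N⁻ (all non-acetylation sites) and N⁻₊ (non-acetylation sites predicted as acetylation).

```python
from math import sqrt


def chou_metrics(n_pos, n_pos_missed, n_neg, n_neg_missed):
    """Sn, Sp, Acc and MCC in the form of Eq. (6).

    n_pos        : N+   total acetylation sites
    n_pos_missed : N+_- acetylation sites predicted as non-acetylation
    n_neg        : N-   total non-acetylation sites
    n_neg_missed : N-_+ non-acetylation sites predicted as acetylation
    """
    sn = 1 - n_pos_missed / n_pos
    sp = 1 - n_neg_missed / n_neg
    acc = 1 - (n_pos_missed + n_neg_missed) / (n_pos + n_neg)
    numerator = 1 - (n_pos_missed / n_pos + n_neg_missed / n_neg)
    # The denominator is zero when every sample receives the same predicted
    # class; this sketch does not special-case that situation.
    denominator = sqrt(
        (1 + (n_neg_missed - n_pos_missed) / n_pos)
        * (1 + (n_pos_missed - n_neg_missed) / n_neg)
    )
    return sn, sp, acc, numerator / denominator


# Hypothetical counts: 100 positives (20 missed), 100 negatives (10 missed).
sn, sp, acc, mcc = chou_metrics(100, 20, 100, 10)
```

With these counts the MCC from Eq. (6) agrees with the familiar confusion-matrix form (TP·TN − FP·FN)/√((TP+FP)(TP+FN)(TN+FP)(TN+FN)), which is the sense in which Chou's formulation is a rewriting rather than a different measure.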

In Eq. (6), $N^{+}$ represents the total number of acetylation sites investigated, whereas $N^{+}_{-}$ is the number of true acetylation sites incorrectly predicted to be non-acetylation sites; $N^{-}$ represents the total number of non-acetylation sites investigated, whereas $N^{-}_{+}$ is the number of non-acetylation sites incorrectly predicted to be acetylation sites. In addition, receiver operating characteristic (ROC) curves are plotted from Sp and Sn at different thresholds, and the area under the ROC curve (AUC) is calculated by the trapezoidal approximation.

3. Results and discussion

3.1. Functional analysis

To further characterize the acetylated proteins, we classified the non-homologous acetylated proteins according to cell component (CC), molecular function (MF) and biological process (BP) Gene Ontology (GO) annotations using the DAVID 6.8 tool (Huang et al., 2009). Furthermore, Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analysis was performed to investigate the pathways in the six species. The detailed information is listed in Supplementary Table S5 and can be downloaded from our website http://computbiol.ncu.edu.cn/PAPred (file name ‘Function Annotation.xlsx’). As shown in Fig. 2, for C.glutamicum the most significant BP, CC and MF terms are translation (GO:0006412), cytoplasm (GO:0005737) and structural constituent of ribosome (GO:0003735), respectively. Statistically significant CC, MF, BP and KEGG results for the six organisms are listed in Supplementary Tables S6-S11. The analyses show that the six organisms have both commonalities and differences. For example, they are all statistically enriched in cytoplasm (GO:0005737) for CC, translation (GO:0006412) for BP, and ATP binding (GO:0005524) for MF.
Meanwhile, some acetylated proteins of E.coli are located in the nucleoid (GO:0009295) for CC, whereas the acetylated proteins of the other species are not; the C.glutamicum acetylated proteins are involved in the carbohydrate metabolic process (GO:0005975), a BP in which those of the other species do not occur; and some acetylated proteins in B.subtilis have the GTP binding (GO:0005525) function, an MF not found for the other species. Furthermore, Tables S6-S11 reveal some KEGG properties of the six species. The most significantly enriched pathway, ribosome, implies a potential role of prokaryote acetylation in protein synthesis. The results also suggest that prokaryote acetylation is significantly enriched in multiple metabolic pathways, which is consistent with the GO annotation. There are also some differences among the six organisms. Microbial metabolism in diverse environments and carbon metabolism are enriched in C.glutamicum, B.subtilis and S.typhimurium but not in E.coli, M.tuberculosis and G.kaustophilus. To sum up, functional analysis reveals potential impacts of prokaryote lysine acetylation on enzymes involved in metabolism and on other metabolism-related cellular processes. Meanwhile, the differences among the six organisms confirm the necessity of developing species-specific computational tools for predicting prokaryote acetylation.

3.2. Determination of the best window size by IG

We used the IG method to analyze position-specific residues in the initial sequence (window size 21, −10∼K∼10). Fig. 3 displays the IG values at each residue position for the six species. Positions with high IG values are the significant positions in the surrounding region. We find that the IG values vary considerably between positions and that the local acetylation sequences of each of the six species have their own characteristics. Residues closer to the site do not always have the higher values; in contrast, some residues far from the acetylated site have higher values. For example, for G.kaustophilus, positions −10, −9 and −8 have high values compared with positions −7, −6 and −1. However, the IG values of positions −2, −1, 1 and 2 are higher than those of other positions, especially position 1. The sequence logos generated by Two Sample Logo (Vacic et al., 2006) for the acetylation site sequences of the six species (P-value < 0.05; t-test) also illustrate the position specificity of the different species (Fig. 4). For example, glutamic acid (E) is enriched at position −1 for E.coli, B.subtilis and M.tuberculosis, and aspartic acid (D) is enriched at position +1 for E.coli, B.subtilis and G.kaustophilus. Arginine (R) is enriched upstream in E.coli acetylation sequence fragments, phenylalanine (F) and R are enriched at positions −2 and +1, respectively, for C.glutamicum, and R is enriched downstream in M.tuberculosis acetylation fragments. Therefore, to improve the prediction performance for prokaryote acetylation sites, we need to choose the amino acid positions with higher IG values to rebuild new sequence fragments for the six species.

Table 1. The sizes and positions of the IG windows in the E.coli sequence fragments.

IG window size    Positions in original 21-mer acetylation sequence fragment
9                 −10, −9, −8, −3, −2, −1, 1, 2, 3
11                −10, −9, −8, −6, −4, −3, −2, −1, 1, 2, 3
13                −10, −9, −8, −7, −6, −4, −3, −2, −1, 1, 2, 3, 4
15                −10, −9, −8, −7, −6, −5, −4, −3, −2, −1, 1, 2, 3, 4, 7
17                −10, −9, −8, −7, −6, −5, −4, −3, −2, −1, 1, 2, 3, 4, 5, 7, 8

In this study, we defined a new sequence fragment based on IG values for each of the six species. Table 1 shows the five different IG window sizes derived from the IG values for E.coli; the IG window sizes of the other species are given in Supplementary Tables S12-S16. We further investigated the effect of IG window size on prediction performance through 10-fold cross-validation. Compared with a single feature, a combination of features reflects more protein sequence information and leads to a certain improvement in prediction performance (Cao et al., 2018; Li et al., 2014; Shi et al., 2015; Wen et al., 2016; Wang et al., 2017). Therefore, we combined all six types of features to learn the predictive model and chose the best window size based on the AUC value. The training results for the different window sizes in the six species are given in Supplementary Table S17. We observed that the best AUC values were 0.749, 0.747, 0.762, 0.714, 0.682 and 0.681 for E.coli, C.glutamicum, M.tuberculosis, B.subtilis, S.typhimurium and G.kaustophilus, respectively. Compared to initial sequence (window

size is 21) in six species, these AUC values increased by 0.008, 0.013, 0.029, 0.035, 0.08 and 0.037, respectively. Therefore, we selected best IG window sizes of 13, 9, 13, 9, 17 and 9 for E.coli, C.glutamicum, M.tuberculosis, B.subtilis, S.typhimurium and G.kaustophilus, respectively.

3.3. Optimization of feature vectors by EN

In this step, we adopted the EN algorithm to reduce the dimensionality of the feature vectors. The optimal prediction performance on the training datasets of the six species is listed in Table 2.

Table 2. The optimal prediction performance of IG and IG with EN at the best window size.

Species          Method    Acc    Sn     Sp     MCC    AUC    Dimension
E.coli           IG        0.739  0.759  0.724  0.481  0.749  516
E.coli           IG + EN   0.772  0.758  0.789  0.545  0.78   113
C.glutamicum     IG        0.731  0.729  0.734  0.462  0.747  512
C.glutamicum     IG + EN   0.756  0.741  0.775  0.514  0.752  135
M.tuberculosis   IG        0.761  0.758  0.766  0.523  0.762  516
M.tuberculosis   IG + EN   0.783  0.78   0.788  0.566  0.783  152
B.subtilis       IG        0.71   0.716  0.705  0.42   0.714  512
B.subtilis       IG + EN   0.719  0.714  0.726  0.439  0.718  112
S.typhimurium    IG        0.666  0.668  0.671  0.335  0.682  520
S.typhimurium    IG + EN   0.821  0.822  0.834  0.649  0.839  173
G.kaustophilus   IG        0.668  0.676  0.668  0.34   0.681  512
G.kaustophilus   IG + EN   0.807  0.808  0.82   0.621  0.826  167

Fig. 5. The ratio of remaining feature vectors by EN for six optimal models.

For the parameters of the EN algorithm, we first picked a (relatively small) grid of values for λ2, namely 0.1, 0.05, 0.3, 0.1, 0.01 and 0.2 for E.coli, C.glutamicum, M.tuberculosis, B.subtilis, S.typhimurium and G.kaustophilus, respectively. Then, for each λ2, five options for s were considered (s = 0.1, 0.2, 0.3, 0.4, 0.5). We ran the EN algorithm with each s value, which output the optimal feature vectors for that s value. Finally, we selected the best s value according to the AUC obtained by 10-fold cross-validation with the SVM on the corresponding optimal feature vectors. The detailed training results are shown in Supplementary Tables S18-S23. Based on the AUC values for the six species, the optimal s values were 0.2, 0.2, 0.3, 0.2, 0.2 and 0.3 for E.coli, C.glutamicum, M.tuberculosis, B.subtilis, S.typhimurium and G.kaustophilus, respectively. From Table 2, we observed that the prediction performance of IG combined with EN (IG + EN) was higher than that of IG alone. In particular, for S.typhimurium and G.kaustophilus, the AUC improved from 0.682 to 0.839 and from 0.681 to 0.826, respectively. Furthermore, the IG + EN method also improved on each single feature type (the prediction performance of the single feature types is listed in Supplementary Table S24). These results suggested that EN method can

effectively extract some important feature vectors to improve prediction performance. In addition, we analyzed the relative importance and contribution of the different features in the different species (the ratios of remaining feature vectors are shown in Fig. 5, and their numbers are listed in Supplementary Table S25). We observed that the KNN feature played an important role in the models of all six species, and that the different feature types differed in importance across the six models. For example, the ratios of KNN in S.typhimurium and G.kaustophilus were lower than in the other species, and the AUC values of the KNN feature in S.typhimurium and G.kaustophilus were also relatively lower than in the other species (Table S24). Apart from the KNN feature, the ratios of the PWAA feature in C.glutamicum and S.typhimurium were higher than in the other species, and the AASA ratio in G.kaustophilus was higher than in the other species. Overall, the KNN, PWAA and AASA features are relatively important to the six prediction models.

3.4. Comparison with other prediction tools

To further evaluate the performance of ProAcePred 2.0, we compared it with some widely used lysine acetylation prediction tools on the independent test sets. Since 2009, many lysine acetylation prediction methods have been developed, but some of their web servers do not work. In the end, five prediction tools were selected: EnsemblePail, Phosida, PLMLA, PSKAcePred and ProAcePred. Because PLMLA, EnsemblePail and PSKAcePred incorporated both prokaryotic and eukaryotic data into their training datasets, we used these three tools to predict the independent datasets of the six species. In addition, Phosida was specifically designed for predicting Homo sapiens acetylation; we therefore used it to predict the six species in order to examine the difference in substrate specificity between prokaryote and eukaryote acetylation sites.

Table 3. Comparison of AUC values between our method and other tools.

Species          Our Method  ProAcePred  PLMLA  PSKAcePred  Phosida  EnsemblePail
E.coli           0.831       0.651       0.535  0.572       0.493    0.539
C.glutamicum     0.763       0.792       0.604  0.619       0.591    0.525
M.tuberculosis   0.798       0.834       0.629  0.599       0.684    0.643
B.subtilis       0.814       0.617       0.554  0.565       0.474    0.607
S.typhimurium    0.8         0.802       0.615  0.69        0.609    0.297
G.kaustophilus   0.831       0.729       0.649  0.659       0.582    0.379
Since ProAcePred is a species-specific prokaryote acetyl-lysine prediction method, we submitted the data of each of the six species to the corresponding model in ProAcePred. The AUC values of our method and the other tools are compared in Table 3 (other detailed comparison results are listed in Supplementary Table S26). Table 3 shows that the AUC values of PLMLA, PSKAcePred, Phosida and EnsemblePail were far below those of ProAcePred for all six species. For our method, the AUC values were 0.831, 0.763, 0.798, 0.814, 0.8 and 0.831 for E. coli, C. glutamicum, M. tuberculosis, B. subtilis, S. typhimurium and G. kaustophilus, respectively. The AUC values of our method were slightly lower than those of ProAcePred for C. glutamicum, M. tuberculosis and S. typhimurium, but our method reached a high accuracy for the other species. These results indicate that distinguishing between species is important for predicting PTM sites and that our method is effective for prokaryote lysine acetylation prediction.

4. Conclusion

In this study, we proposed a new method named ProAcePred 2.0 to predict prokaryote lysine acetylation, which achieved a promising performance and outperformed other prediction tools. We first employed the IG method to select important position-specific residues and assemble them into a new sequence fragment based on the primary sequence fragment, and then adopted an EN algorithm to optimize the feature vectors, which significantly improved the prediction performance of the six species models. In addition, we observed that acetylated proteins are distributed mainly in the ribosome and cytoplasm. User-friendly and publicly accessible web servers represent the current trend in developing computational methods (Chou and Shen, 2009). They have significantly enhanced the impact of computational biology on medical science (Chou, 2015), driving medicinal chemistry into an unprecedented revolution (Chou, 2017). We therefore also provide a web server for the new method reported in this paper at http://computbiol.ncu.edu.cn/PAPred. We expect that ProAcePred 2.0 can provide instructive help for further experimental investigation of prokaryote acetylation.

Acknowledgments

This work was supported by grants from the National Natural Science Foundation of China (21665016 and 21305062), and the Natural Science Foundation of Jiangxi Province (20151BAB203022).

Supplementary materials

Supplementary material associated with this article can be found, in the online version, at doi:10.1016/j.jtbi.2018.10.047.

References

Akbar, S., Hayat, M., 2018. iMethyl-STTNC: Identification of N(6)-methyladenosine sites by extending the Idea of SAAC into Chou’s PseAAC to formulate RNA sequences. J. Theor. Biol. 455, 205–211. https://doi.org/10.1016/j.jtbi.2018.07.018.
Arif, M., Hayat, M., Jan, Z., 2018. Imem-2lsaac: a two-level model for discrimination of membrane proteins and their types by extending the notion of saac into Chou’s pseudo amino acid composition. J. Theor. Biol. 442, 11–21. https://doi.org/10.1016/j.jtbi.2018.01.008.
Barak, R., Yan, J., Shainskaya, A., Eisenbach, M., 2006. The chemotaxis response regulator chey can catalyze its own acetylation. J. Mol. Biol. 359, 251–265. https://doi.org/10.1016/j.jmb.2006.03.033.
Basu, A., Rose, K.L., Zhang, J., Beavis, R.C., Ueberheide, B., Garcia, B.A., et al., 2009. Proteome-wide prediction of acetylation substrates. Proc. Natl. Acad. Sci. USA 106, 13785–13790. https://doi.org/10.1073/pnas.0906801106.
Bereswill, S., Geider, K., 1997. Characterization of the rcsb gene from Erwinia amylovora and its influence on exopolysaccharide synthesis and virulence of the fire blight pathogen. J. Bacteriol. 179, 1354–1361. https://doi.org/10.1128/jb.179.4.1354-1361.1997.
Cao, M., Chen, G.D., Wang, L.N., Wen, P.P., Shi, S.P., 2018. Computational prediction and analysis for tyrosine post-translational modifications via elastic net. J. Chem. Inf. Model. 58, 1272–1281. https://doi.org/10.1021/acs.jcim.7b00688.
Castaño-Cerezo, S., Bernal, V., Post, H., Fuhrer, T., Cappadona, S., Sánchez-Díaz, N.C., et al., 2015. Protein acetylation affects acetate metabolism, motility and acid stress response in Escherichia coli. Mol. Syst. Biol. 10, 762. https://doi.org/10.15252/msb.20145227.
Chen, G., Cao, M., Luo, K., Wang, L., Wen, P., Shi, S., 2018. Proacepred: prokaryote lysine acetylation sites prediction based on elastic net feature optimization. Bioinformatics. https://doi.org/10.1093/bioinformatics/bty444.
Chen, W., Feng, P.M., Lin, H., Chou, K.C., 2013. Irspot-psednc: identify recombination spots with pseudo dinucleotide composition. Nucleic Acids Res. 41, e68. https://doi.org/10.1093/nar/gks1450.


Cheng, X., Xiao, X., Chou, K.C., 2017a. pLoc-mVirus: predict subcellular localization of multi-location virus proteins via incorporating the optimal GO information into general PseAAC. Gene 628, 315–321. https://doi.org/10.1016/j.gene.2017.07.036.
Cheng, X., Zhao, S.G., Lin, W.Z., Xiao, X., Chou, K.C., 2017b. Ploc-manimal: predict subcellular localization of animal proteins with both single and multiple sites. Bioinformatics 33, 3524–3531. https://doi.org/10.1093/bioinformatics/btx476.
Cheng, X., Zhao, S.G., Xiao, X., Chou, K.C., 2017c. Iatc-mhyb: a hybrid multi-label classifier for predicting the classification of anatomical therapeutic chemicals. Oncotarget 8, 58494–58503. https://doi.org/10.18632/oncotarget.17028.
Cheng, X., Zhao, S.G., Xiao, X., et al., 2017d. iATC-mISF: a multi-label classifier for predicting the classes of anatomical therapeutic chemicals. Bioinformatics 33, 341–346. https://doi.org/10.1093/bioinformatics/btw644.
Chou, K.C., 2001a. Prediction of protein cellular attributes using pseudo-amino acid composition. Proteins: Struct. Funct. Genet. 44, 246–255. https://doi.org/10.1002/prot.1035.
Chou, K.C., 2001b. Using subsite coupling to predict signal peptides. Protein Eng. 14, 75–79. https://doi.org/10.1093/protein/14.2.75.
Chou, K.C., 2005. Using amphiphilic pseudo amino acid composition to predict enzyme subfamily classes. Bioinformatics 21, 10–19. https://doi.org/10.1093/bioinformatics/bth466.
Chou, K.C., 2009. Pseudo amino acid composition and its applications in bioinformatics, proteomics and system biology. Curr. Proteom. 6, 262–274. https://doi.org/10.2174/157016409789973707.
Chou, K.C., 2011. Some remarks on protein attribute prediction and pseudo amino acid composition. J. Theor. Biol. 273, 236–247. https://doi.org/10.1016/j.jtbi.2010.12.024.
Chou, K.C., 2015. Impacts of bioinformatics to medicinal chemistry. Med. Chem. 11, 218–234. https://doi.org/10.2174/1573406411666141229162834.
Chou, K.C., 2017. An unprecedented revolution in medicinal chemistry driven by the progress of biological science. Curr. Top. Med. Chem. 17, 2337–2358. https://doi.org/10.2174/1568026617666170414145508.
Chou, K.C., Shen, H.B., 2009. Review: recent advances in developing web-servers for predicting protein attributes. Nat. Sci. 1, 63–92. https://doi.org/10.4236/ns.2009.12011.
Domínguez-Bernal, G., Pucciarelli, M.G., Ramos-Morales, F., García-Quintanilla, M., Cano, D.A., Casadesús, J., et al., 2004. Repression of the rcsc-yojn-rcsb phosphorelay by the igaa protein is a requisite for Salmonella virulence. Mol. Microbiol. 53, 1437–1449. https://doi.org/10.1111/j.1365-2958.2004.04213.x.
Dons, L., Eriksson, E., Jin, Y., Rottenberg, M.E., Kristensson, K., Larsen, C.N., et al., 2004. Role of flagellin and the two-component chea/chey system of Listeria monocytogenes in host cell invasion and virulence. Infect. Immun. 72, 3237–3244. https://doi.org/10.1128/IAI.72.6.3237.
Feng, P., Ding, H., Yang, H., et al., 2017. Irna-psecoll: identifying the occurrence sites of different RNA modifications by incorporating collective effects of nucleotides into pseknc. Mol. Ther. – Nucleic Acids 7, 155–163. https://doi.org/10.1016/j.omtn.2017.03.006.
Feng, P., Yang, H., Ding, H., et al., 2018. Idna6ma-pseknc: identifying DNA n6-methyladenosine sites by incorporating nucleotide physicochemical properties into pseknc. Genomics. https://doi.org/10.1016/j.ygeno.2018.01.005.
Feng, P.M., Chen, W., Lin, H., et al., 2013. iHSP-PseRAAAC: identifying the heat shock protein families using pseudo reduced amino acid alphabet composition. Anal. Biochem. 442, 118–125. https://doi.org/10.1016/j.ab.2013.05.024.
Gnad, F., Ren, S., Choudhary, C., Cox, J., Mann, M., 2010. Predicting post-translational lysine acetylation using support vector machines. Bioinformatics 26, 1666–1668. https://doi.org/10.1093/bioinformatics/btq260.
Hou, T., Zheng, G., Zhang, P., Jia, J., Li, J., Xie, L., et al., 2014. Lacep: lysine acetylation site prediction using logistic regression classifiers. PLoS One 9, e89575. https://doi.org/10.1371/journal.pone.0089575.
Huang, D.W., Sherman, B.T., Lempicki, R.A., 2009. Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat. Protoc. 4, 44–57. https://doi.org/10.1038/nprot.2008.211.
Ju, Z., Wang, S.Y., 2018. Prediction of citrullination sites by incorporating k-spaced amino acid pairs into Chou’s general pseudo amino acid composition. Gene 664, 78–83. https://doi.org/10.1016/j.gene.2018.04.055.
Lee, T.Y., Hsu, J.B., Lin, F.M., Chang, W.C., Hsu, P.C., Huang, H.D., 2010. N-ace: using solvent accessibility and physicochemical properties to identify protein N-acetylation sites. J. Comput. Chem. 31, 2759–2771. https://doi.org/10.1002/jcc.21569.
Li, S., Li, H., Li, M., Shyr, Y., Xie, L., Li, Y., 2009. Improved prediction of lysine acetylation by support vector machines. Protein Pept. Lett. 16, 977–983. https://doi.org/10.2174/092986609788923338.
Li, T., Du, Y., Wang, L., Huang, L., Li, W., Lu, M., et al., 2012. Characterization and prediction of lysine (k)-acetyl-transferase specific acetylation sites. Mol. Cell. Proteom. 11, M111.011080. https://doi.org/10.1074/mcp.M111.011080.
Li, W., Godzik, A., 2006. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 22, 1658–1659. https://doi.org/10.1093/bioinformatics/btl158.
Li, Y., Wang, M., Wang, H., Tan, H., Zhang, Z., Webb, G.I., et al., 2014. Accurate in silico identification of species-specific acetylation sites by integrating protein sequence-derived and functional features. Sci. Rep. 4, 5765. https://doi.org/10.1038/srep05765.
Liang, W., Malhotra, A., Deutscher, M.P., 2011. Acetylation regulates the stability of a bacterial protein: growth stage-dependent modification of rnase r. Mol. Cell 44, 160–166. https://doi.org/10.1016/j.molcel.2011.06.037.

Lin, H., Deng, E.Z., Ding, H., Chen, W., Chou, K.C., 2014. Ipro54-pseknc: a sequence-based predictor for identifying sigma-54 promoters in prokaryote with pseudo k-tuple nucleotide composition. Nucleic Acids Res. 42, 12961–12972. https://doi.org/10.1093/nar/gku1019.
Liu, B., Liu, F., Wang, X., Chen, J., Fang, L., Chou, K.C., 2015. Pse-in-one: a web server for generating various modes of pseudo components of DNA, RNA, and protein sequences. Nucleic Acids Res. 43, W65–W71. https://doi.org/10.1093/nar/gkv458.
Liu, B., Wu, H., Chou, K.C., 2017. Pse-in-one 2.0: an improved package of web servers for generating various modes of pseudo components of DNA, RNA, and protein sequences. Nat. Sci. 9, 67–91. https://doi.org/10.4236/ns.2017.94007.
Liu, B., Yang, F., Huang, D.S., et al., 2018. iPromoter-2L: a two-layer predictor for identifying promoters and their types by multi-window-based PseKNC. Bioinformatics 34, 33–40. https://doi.org/10.1093/bioinformatics/btx579.
Marmorstein, R., 2001. Structure and function of histone acetyltransferases. Cell. Mol. Life Sci. 58, 693–703. https://doi.org/10.1007/PL00000893.
Mei, J., Zhao, J., 2018. Analysis and prediction of presynaptic and postsynaptic neurotoxins by Chou’s general pseudo amino acid composition and motif features. J. Theor. Biol. 447, 147–153. https://doi.org/10.1016/j.jtbi.2018.03.034.
Mouslim, C., Delgado, M., Groisman, E.A., 2004. Activation of the rcsc/yojn/rcsb phosphorelay system attenuates Salmonella virulence. Mol. Microbiol. 54, 386–395. https://doi.org/10.1111/j.1365-2958.2004.04293.x.
Noble, W.S., 2006. What is a support vector machine? Nat. Biotechnol. 24, 1565–1567. https://doi.org/10.1038/nbt1206-1565.
Qiu, W., Li, S., Cui, X., et al., 2018a. Predicting protein submitochondrial locations by incorporating the pseudo-position specific scoring matrix into the general Chou’s pseudo-amino acid composition. J. Theor. Biol. 450, 86–103. https://doi.org/10.1016/j.jtbi.2018.04.026.
Qiu, W.R., Sun, B.Q., Xiao, X., et al., 2018b. Ikcr-pseens: identify lysine crotonylation sites in histone proteins with pseudo components and ensemble classifier. Genomics 110, 239–246. https://doi.org/10.1016/j.ygeno.2017.10.008.
Qiu, W.R., Xiao, X., Lin, W.Z., Chou, K.C., 2014. Imethyl-pseaac: identification of protein methylation sites via a pseudo amino acid composition approach. Biomed. Res. Int. 2014, 947416. https://doi.org/10.1155/2014/947416.
Qiu, W.R., Xiao, X., Lin, W.Z., Chou, K.C., 2015. Iubiq-lys: prediction of lysine ubiquitination sites in proteins by extracting sequence evolution information via a gray system model. J. Biomol. Struct. Dyn. 33, 1731–1742. https://doi.org/10.1080/07391102.2014.968875.
Ren, J., Sang, Y., Lu, J., Yao, Y.F., 2017. Protein acetylation and its role in bacterial virulence. Trends Microbiol. 25, 768–779. https://doi.org/10.1016/j.tim.2017.04.001.
Shannon, C.E., 1997. The mathematical theory of communication (Reprinted). MD Comput. 14, 306–317.
Shao, J., Xu, D., Hu, L., Kwan, Y.W., Wang, Y., Kong, X., et al., 2012. Systematic analysis of human lysine acetylation proteins and accurate prediction of human lysine acetylation through bi-relative adapted binomial score bayes feature representation. Mol. BioSyst. 8, 2964–2973. https://doi.org/10.1039/c2mb25251a.
Shi, S.P., Qiu, J.D., Sun, X.Y., Suo, S.B., Huang, S.Y., Liang, R.P., 2012. Plmla: prediction of lysine methylation and lysine acetylation by combining multiple features. Mol. BioSyst. 8, 1520–1527. https://doi.org/10.1039/c2mb05502c.
Shi, S.P., Xu, H.D., Wen, P.P., Qiu, J.D., 2015. Progress and challenges in predicting protein methylation sites. Mol. BioSyst. 11, 2610–2619. https://doi.org/10.1039/C5MB00259A.
Song, J., Li, F., Takemoto, K., et al., 2018a. PREvaIL, an integrative approach for inferring catalytic residues using sequence, structural and network features in a machine learning framework. J. Theor. Biol. 443, 125–137. https://doi.org/10.1016/j.jtbi.2018.01.023.
Song, J., Wang, Y., Li, F., et al., 2018b. iProt-Sub: a comprehensive package for accurately mapping and predicting protease-specific substrates and cleavage sites. Brief. Bioinf. https://doi.org/10.1093/bib/bby028.
Song, L., Wang, G., Malhotra, A., Deutscher, M.P., Liang, W., 2016. Reversible acetylation on lys501 regulates the activity of rnase ii. Nucleic Acids Res. 44, 1979–1988. https://doi.org/10.1093/nar/gkw053.
Starai, V.J., Escalante-Semerena, J.C., 2004. Identification of the protein acetyltransferase (pat) enzyme that acetylates acetyl-coa synthetase in Salmonella enterica. J. Mol. Biol. 340, 1005–1012. https://doi.org/10.1016/j.jmb.2004.05.010.
Suo, S.B., Qiu, J.D., Shi, S.P., Sun, X.Y., Huang, S.Y., Chen, X., et al., 2012. Position-specific analysis and prediction for protein lysine acetylation based on multiple features. PLoS One 7, e49108. https://doi.org/10.1371/journal.pone.0049108.
Tibshirani, R., 1996. Regression shrinkage and selection via the lasso. J. R. Stat. Soc. B 58, 267–288.
Umlauf, D., Goto, Y., Feil, R., 2004. Site-specific analysis of histone methylation and acetylation. Methods Mol. Biol. 287, 99–120.
Vacic, V., Iakoucheva, L.M., Radivojac, P., 2006. Two sample logo: a graphical representation of the differences between two sets of sequence alignments. Bioinformatics 22, 1536–1537. https://doi.org/10.1093/bioinformatics/btl151.
Wang, L.N., Shi, S.P., Xu, H.D., Wen, P.P., Qiu, J.D., 2017. Computational prediction of species-specific malonylation sites via enhanced characteristic strategy. Bioinformatics 33, 1457–1463. https://doi.org/10.1093/bioinformatics/btw755.
Wang, Q., Zhang, Y., Yang, C., Xiong, H., Lin, Y., Yao, J., et al., 2010. Acetylation of metabolic enzymes coordinates carbon source utilization and metabolic flux. Science 327, 1004–1007. https://doi.org/10.1126/science.1179687.
Weinert, B.T., Wagner, S.A., Horn, H., Henriksen, P., Liu, W.R., Olsen, J.V., et al., 2011. Proteome-wide mapping of the Drosophila acetylome demonstrates a high degree of conservation of lysine acetylation. Sci. Signal. 4, ra48. https://doi.org/10.1126/scisignal.2001902.

Weinert, B., Iesmantavicius, V., Wagner, S., et al., 2013. Acetyl-phosphate is a critical determinant of lysine acetylation in E. coli. Mol. Cell 51, 265–272. https://doi.org/10.1016/j.molcel.2013.06.003.
Welsch, D.J., Nelsestuen, G.L., 1988. Amino-terminal alanine functions in a calcium-specific process essential for membrane binding by prothrombin fragment 1. Biochemistry 27, 4939–4945. https://doi.org/10.1021/bi00413a052.
Wen, P.P., Shi, S.P., Xu, H.D., Wang, L.N., Qiu, J.D., 2016. Accurate in silico prediction of species-specific methylation sites based on information gain feature optimization. Bioinformatics 32, 3107–3115. https://doi.org/10.1093/bioinformatics/btw377.
Wuyun, Q., Zheng, W., Zhang, Y., Ruan, J., Hu, G., 2016. Improved species-specific lysine acetylation site prediction based on a large variety of features set. PLoS One 11, e0155370. https://doi.org/10.1371/journal.pone.0155370.
Xiao, X., Cheng, X., Su, S., Mao, Q., Chou, K.C., 2017. Ploc-mgpos: incorporate key gene ontology information into general pseaac for predicting subcellular localization of gram-positive bacterial proteins. Nat. Sci. 9, 331–349. https://doi.org/10.4236/ns.2017.99032.
Xiao, X.Y., Yin, H.W., 2017. Achieving higher order of convergence for solving systems of nonlinear equations. Appl. Math. Comput. 311, 251–261. https://doi.org/10.1016/j.amc.2017.05.033.
Xie, L., Wang, X., Zeng, J., Zhou, M., Duan, X., Li, Q., et al., 2015. Proteome-wide lysine acetylation profiling of the human pathogen Mycobacterium tuberculosis. Int. J. Biochem. Cell Biol. 59, 193–202. https://doi.org/10.1016/j.biocel.2014.11.010.
Xu, H., Zhou, J., Lin, S., Deng, W., Zhang, Y., Xue, Y., 2017. Plmd: an updated data resource of protein lysine modifications. J. Genet. Genom. 44, 243–250. https://doi.org/10.1016/j.jgg.2017.03.007.
Xu, Y., Ding, J., Wu, L.Y., Chou, K.C., 2013a. Isno-pseaac: predict cysteine s-nitrosylation sites in proteins by incorporating position specific amino acid propensity into pseudo amino acid composition. PLoS One 8, e55844. https://doi.org/10.1371/journal.pone.0055844.


Xu, Y., Shao, X.J., Wu, L.Y., Deng, N.Y., Chou, K.C., 2013b. Isno-aapair: incorporating amino acid pairwise coupling into pseaac for predicting cysteine s-nitrosylation sites in proteins. Peerj 1, e171. https://doi.org/10.7717/peerj.171.
Xu, Y., Wang, X.B., Ding, J., Wu, L.Y., Deng, N.Y., 2010. Lysine acetylation sites prediction using an ensemble of support vector machine classifiers. J. Theor. Biol. 264, 130–135. https://doi.org/10.1016/j.jtbi.2010.01.013.
Xu, Y., Wen, X., Shao, X.J., Deng, N.Y., Chou, K.C., 2014a. Ihyd-pseaac: predicting hydroxyproline and hydroxylysine in proteins by incorporating dipeptide position-specific propensity into pseudo amino acid composition. Int. J. Mol. Sci. 15, 7594–7610. https://doi.org/10.3390/ijms15057594.
Xu, Y., Wen, X., Wen, L.S., Wu, L.Y., et al., 2014b. Initro-tyr: prediction of nitrotyrosine sites in proteins with general pseudo amino acid composition. PLoS One 9, e105018. https://doi.org/10.1371/journal.pone.0105018.
Yang, H., Qiu, W.R., et al., 2018. iRSpot-Pse6NC: identifying recombination spots in Saccharomyces cerevisiae by incorporating hexamer composition into general PseKNC. Int. J. Biol. Sci. 14, 883–891. https://doi.org/10.7150/ijbs.24616.
Yao, R., Burr, D.H., Guerry, P., 1997. Chey-mediated modulation of Campylobacter jejuni virulence. Mol. Microbiol. 23, 1021–1031. https://doi.org/10.1046/j.1365-2958.1997.2861650.x.
Zhang, K., Zheng, S., Yang, J.S., Chen, Y., Cheng, Z., 2013. Comprehensive profiling of protein lysine acetylation in Escherichia coli. J. Proteome Res. 12, 844–851. https://doi.org/10.1021/pr300912q.
Zhou, H., Boyle, R., Aebersold, R., 2004. Quantitative protein analysis by solid phase isotope tagging and mass spectrometry. Methods Mol. Biol. 261, 511–518.
Zhou, R., Wang, X., Tang, X.B., 2015. A generalization of the Hermitian and skew-Hermitian splitting iteration method for solving Sylvester equations. Appl. Math. Comput. 271, 609–617. https://doi.org/10.1016/j.amc.2015.09.027.
Zou, H., Hastie, T., 2005. Regularization and variable selection via the elastic net. J. R. Stat. Soc. B 67, 301–320. https://doi.org/10.1111/j.1467-9868.2005.00503.x.