Chemometrics and Intelligent Laboratory Systems 146 (2015) 102–107
Identifying protein arginine methylation sites using global features of protein sequence coupled with support vector machine optimized by particle swarm optimization algorithm

Yan Zhang a, Lijuan Tang a, Hongyan Zou b,⁎, Qin Yang a, Xinliang Yu a, Jianhui Jiang a, Hailong Wu a, Ruqin Yu a,⁎

a State Key Laboratory of Chemo/Biosensing and Chemometrics, College of Chemistry and Chemical Engineering, Hunan University, Changsha 410082, PR China
b Key Laboratory of Luminescent and Real-Time Analytical Chemistry (Southwest University), Ministry of Education, College of Chemistry and Chemical Engineering, Southwest University, Chongqing 400715, PR China
Article info

Article history:
Received 14 July 2014
Received in revised form 5 May 2015
Accepted 9 May 2015
Available online 18 May 2015

Keywords:
Arginine methylation sites
Support vector machine
Particle swarm optimization
Global features of protein sequence
Abstract

Protein methylation, which plays vital roles in signal transduction and many cellular processes, is one of the most common protein post-translational modifications. Identification of methylation sites is very helpful for understanding the fundamental molecular mechanisms of methylation-related biological processes. In silico prediction of methylation sites has emerged as a powerful approach for identifying methylation, and it also facilitates downstream characterization and site-specific investigation. Herein, we propose a novel strategy for the prediction of methylation sites based on a combination of the pseudo amino acid composition (PseAAC) and the protein chain description as global features of protein sequence. These global features comprehensively use amino acid composition and sequence-order information, along with the physicochemical properties and structural characteristics of the amino acids. A support vector machine (SVM) is used to build the prediction model for methylation sites on the basis of the global features of protein sequence. Meanwhile, a global stochastic optimization technique, the particle swarm optimization (PSO) algorithm, is employed to search effectively for the optimal SVM parameters. The prediction accuracy, sensitivity, specificity and Matthews correlation coefficient for the independent prediction set are 98.11%, 96.23%, 100% and 96.30%, respectively, indicating that the method is effective in identifying protein arginine methylation sites. For comparison, other predictors are also constructed based on different feature extraction and modeling strategies. The results show that the proposed method can greatly improve the performance of arginine methylation site prediction. © 2015 Elsevier B.V. All rights reserved.
⁎ Corresponding authors. Tel./fax: +86 731 88822577. E-mail addresses: [email protected] (H. Zou), [email protected] (R. Yu).
http://dx.doi.org/10.1016/j.chemolab.2015.05.011 0169-7439/© 2015 Elsevier B.V. All rights reserved.

1. Introduction

Protein post-translational modifications (PTMs), which are chemical modifications at distinct amino acid side chains or peptide linkages, play crucial roles in functional proteomics by regulating the activity, localization, interaction and degradation of proteins. As one of the most common PTMs, protein methylation is involved in dozens of biological processes and is critical in adjusting protein physicochemical properties, conformations and functions [1–3]. Although protein methylation was discovered several decades ago, it remains far less studied than other PTMs. Protein methylation can occur at nitrogen atoms of either the backbone or the side chain of various amino acid residues, such as lysine (K), arginine (R) and histidine (H), or at oxygen atoms of aspartate (D) and glutamate (E) residues. To understand the fundamental molecular mechanisms of methylation-related biological processes, a prerequisite is to accurately identify the methylation sites on protein sequences. Some techniques have been developed for the detection of protein methylation based on methylation-specific antibodies and mass spectrometry [4,5]. However, these traditional methods are often labor-intensive, time-consuming and rather expensive.

Developments in information technology have provided new opportunities to solve these problems, and several in silico methods have been developed for predicting methylation sites directly from the primary sequence of proteins. In 2005, Daily et al. [6] built a predictor for arginine and lysine methylation based on the hypothesis that PTMs occur preferentially in intrinsically disordered regions [7] of protein sequences, coupled with a machine learning algorithm, the support vector machine (SVM) [8–11]. In 2006, using an orthogonal binary coding of each amino acid residue to describe the protein sequence information, Chen et al. [12] developed the first online server, called MeMo, for methylation prediction. The sequence description used was quite simple, however, and the prediction accuracy was not satisfactory. Subsequently, physicochemical and biochemical properties, position weight amino acid composition, or the frequency of each amino acid at each position in the datasets were proposed to extract the protein sequence information [13–15]. The prediction accuracies for methylation sites improved once such features of amino acid residues were included in predictor modeling. By introducing physicochemical, sequence evolution, biochemical, amino acid composition and structural disorder information of protein sequences into an SVM model, Chou and co-workers developed a predictor for arginine and lysine methylation sites with improved performance [16].

Herein, we propose a combination of the pseudo amino acid composition (PseAAC) and the protein chain description for global feature extraction from protein sequences for methylation site prediction. PseAAC is used to globally characterize the amino acid composition and sequence-order information of the proteins [17]. In addition, the protein chain description, which has been successfully used for the prediction of protein folding class [18], includes both physicochemical properties of amino acids and structural characteristics surrounding methylation sites. SVM is then used to build the prediction model for methylation sites on the basis of the extracted global features of the protein sequences. SVM is a machine learning algorithm based on the structural risk minimization (SRM) principle and has been widely used in PTM site prediction because of its desirable generalization performance. To search effectively for the optimal SVM parameters, a global stochastic optimization technique, the particle swarm optimization (PSO) algorithm [19–21], is employed. As a heuristic search method, PSO is similar to the genetic algorithm (GA) [22] in that both evolutionary heuristics are population-based search methods.
Hassan et al., who compared PSO and GA in detail, concluded that PSO is more computationally efficient than GA [23]. It has been demonstrated that PSO is a robust algorithm with relatively high efficiency in converging to a desired optimum when used for model training in most cases [24–27]. The proposed global feature extraction scheme for protein sequences, coupled with the parameter-free PSO–SVM modeling method, was applied to the prediction of arginine methylation sites. Arginine methylation, which governs a variety of gene regulation and signal transduction processes [2,3], is one of the most important types of protein methylation. For comparison, other predictors are also constructed based on different feature extraction and modeling strategies. The results show that the proposed method can greatly improve the performance of arginine methylation site prediction owing to the combination of global protein feature extraction with parameter-free modeling, indicating that such a strategy holds great promise for modeling and predicting methylation types.

2. Methods

2.1. Feature construction

To develop a powerful predictor for a protein system, one of the key steps is to formulate the protein or peptide samples with an effective mathematical expression that can truly reflect their intrinsic correlation with the target to be predicted [28]. In this paper, we combined PseAAC and the protein chain description to form a global feature extraction scheme. PseAAC couples the amino acid composition with the sequence-order information. The protein chain description reflects the distribution of amino acid residues with different structural and physicochemical properties around the methylation sites. All descriptors can be calculated with the Propy program [29] for a given protein fragment sequence.

2.1.1. Pseudo amino acid composition (PseAAC)

The concept of pseudo amino acid composition (PseAAC), developed on the basis of amino acid composition (AAC), was originally introduced by Chou [17]. Consider a protein or peptide chain of L amino acid residues:

R_1 R_2 R_3 R_4 R_5 R_6 ⋯ R_L

where R_i (i = 1, 2, ⋯, L) represents the i-th residue, each being one of the 20 native amino acids. According to the amino acid composition (AAC) model [30], a protein or peptide can be expressed by

$$P = [f_1 \; f_2 \; \cdots \; f_{20}]^{T} \tag{1}$$

where f_i (i = 1, 2, ⋯, 20) are the normalized occurrence frequencies of the 20 native amino acids in the protein, and T is the transposing operator. Therefore, the amino acid composition of a protein can be easily derived once the protein sequence is known. However, the AAC representation of the protein or peptide P loses the sequence-order information, and hence the prediction quality might be considerably limited. To overcome this limitation, the pseudo amino acid composition (PseAAC) was proposed, formulated as:

$$P = [p_1 \; p_2 \; \cdots \; p_{20} \; p_{20+1} \; \cdots \; p_{20+\lambda}]^{T} \tag{2}$$

where the (20 + λ) components are given by

$$p_u = \begin{cases} \dfrac{f_u}{\sum_{i=1}^{20} f_i + \omega \sum_{j=1}^{\lambda} \theta_j}, & 1 \le u \le 20 \\[2ex] \dfrac{\omega\,\theta_{u-20}}{\sum_{i=1}^{20} f_i + \omega \sum_{j=1}^{\lambda} \theta_j}, & 20 + 1 \le u \le 20 + \lambda,\ \lambda < L \end{cases} \tag{3}$$

with

$$\begin{cases} \theta_1 = \dfrac{1}{L-1} \sum_{i=1}^{L-1} \Theta(R_i, R_{i+1}) \\[1.5ex] \theta_2 = \dfrac{1}{L-2} \sum_{i=1}^{L-2} \Theta(R_i, R_{i+2}) \\ \quad\vdots \\ \theta_\lambda = \dfrac{1}{L-\lambda} \sum_{i=1}^{L-\lambda} \Theta(R_i, R_{i+\lambda}) \end{cases} \qquad (\lambda < L) \tag{4}$$

where ω is the weight factor, θ_1 is the first-tier correlation factor, θ_2 is the second-tier correlation factor, and so forth, while the correlation function is

$$\Theta(R_i, R_j) = H(R_i)\, H(R_j) \tag{5}$$
where H(R_i) is the physicochemical property score of amino acid R_i, and H(R_j) is the corresponding value for amino acid R_j. Since the hydrophobicity [31] of amino acid residues has a strong influence on protein structure and function, incorporating it may provide helpful information for prediction. Therefore, hydrophobicity was taken into account in this study. The van der Waals volume was also included, following the work of Shi et al. [15]. The numerical values of the two physicochemical properties for each of the 20 native amino acids can be obtained from the amino acid index (AAindex) database [32].

2.1.2. Protein chain description

Dubchak et al. [18] first proposed protein chain descriptors comprising the overall Composition (C), Transition (T), and Distribution (D) of amino acid attributes. They used the three descriptors C, T and D to describe, respectively, the overall composition of a given amino acid property in a protein, the frequencies with which the property changes along the entire length of the protein, and the distribution pattern of the property along the sequence. In detail, the first descriptor, C, is the proportion of amino acids with a given property in the entire protein. The second descriptor, T, characterizes the percent frequency with which one amino acid group is followed by another, different amino acid group. The third descriptor, D, is described by five chain lengths (in percent), within which the first, 25%, 50%, 75%, and 100% of the amino acids with a certain property are contained. Because our target is a protein fragment, we chose only the C and T descriptors in this paper. Seven types of amino acid attributes, namely hydrophobicity, normalized van der Waals volume, polarity, polarizability, charge, secondary structure and solvent accessibility, were used for calculating these descriptors [29]. In total, we obtain 21 Composition (C) and 21 Transition (T) descriptor values. Table S1 in Supplementary data lists the 20 amino acid attributes and the division of the amino acids.

2.2. Prediction model

2.2.1. Support vector machine (SVM)

The support vector machine (SVM), proposed by Vapnik and his coworkers in 1995 [8], is a promising learning technique that was originally developed for pattern recognition problems [33–36]. Compared with other learning machines, SVM is built on the structural risk minimization (SRM) principle and has desirable generalization performance. It has demonstrated good performance in model estimation problems in numerous successful applications [12–16]. The basic theory of SVM is briefly reviewed in the following. Consider a dataset {X_i, y_i}, i = 1, 2, …, l, with y_i ∈ {−1, 1} and X_i ∈ R^d. The SVM algorithm looks for the separating hyperplane with the largest margin, which can be formulated as follows:

$$y_i (X_i \cdot w + b) - 1 \ge 0, \quad i = 1, 2, \ldots, l, \quad y_i \in \{-1, 1\}, \quad X_i \in \mathbb{R}^d \tag{6}$$
where w is the weight vector and b is the threshold. The margin is 2/||w||, with w normal to the hyperplane, so maximizing the margin is equivalent to minimizing (1/2)||w||^2. To facilitate the calculation, the Lagrange optimization method and a kernel function are introduced, and the objective function becomes:

$$\max_{\alpha} Q(\alpha) = \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{l} \alpha_i \alpha_j y_i y_j K(x_i, x_j) \tag{7}$$

$$\text{s.t.} \quad \sum_{i=1}^{l} \alpha_i y_i = 0 \tag{8}$$

$$\alpha_i \in [0, C], \quad i = 1, \ldots, l \tag{9}$$

The decision function is given by:

$$f(x) = \sum_{i=1}^{l} \alpha_i y_i K(x_i, x) + b \tag{10}$$

$$y = \operatorname{sgn}(f(x)) \tag{11}$$
where α_i are the Lagrange multipliers and C is the penalty parameter, which determines the trade-off between the empirical error and the model complexity. As described above, except for the penalty parameter C, the other parameters in SVM can be obtained by solving a quadratic programming problem. When the exact nonlinear model is unknown, the Gaussian radial basis function, characterized by its kernel width, is frequently used as the kernel. The kernel width reflects the interaction between support vectors. If it is too small, the interaction between support vectors is weak, leading to inferior generalization performance; if it is too large, the interaction between support vectors is too strong and the accuracy of the model cannot be guaranteed. Therefore, selecting a proper kernel width is important for both the approximation accuracy and the generalization performance. When insufficient or excessive support vectors are used in the kernel transform, SVM is still exposed to a substantial risk of underfitting or overfitting. On the other hand, the number of free parameters in the SVM model equals the number of support vectors, and training an SVM becomes computationally intensive, even prohibitive, when the training set is too large. Developing an effective approach for selecting the parameters of the SVM model is therefore of considerable significance for combating these problems and improving the learning and generalization performance of SVM.

2.2.2. Particle swarm optimization (PSO) optimized SVM

The particle swarm optimization (PSO) algorithm [19–21], derived from simulating the behavior of birds searching for food, is a stochastic global optimization method. In PSO, the potential solutions, called particles, fly over the problem space by following the current optimum particles. Each particle keeps track of the coordinates in the problem space that are associated with the best solution (fitness value) it has encountered so far. This is called the personal best position (pBest) of particle i and is represented as p_i = (p_i1, p_i2, …, p_iD). Another best value, the global best position (gBest), represented as p_g = (p_g1, p_g2, …, p_gD), is the best value obtained so far by all particles in the solution space. Each particle updates its velocity v_i = (v_i1, v_i2, …, v_iD) and position x_i = (x_i1, x_i2, …, x_iD) by tracking these two best values according to the following two equations:

$$v_{id}(\text{new}) = w\, v_{id}(\text{old}) + c_1 r_1 (p_{id} - x_{id}) + c_2 r_2 (p_{gd} - x_{id}) \tag{12}$$

$$x_{id}(\text{new}) = x_{id}(\text{old}) + \mu\, v_{id}(\text{new}) \tag{13}$$
where w is an inertia weight that balances the global search and the local search, and r_1 and r_2 are random numbers in the interval (0, 1). The two positive constants c_1 and c_2, called learning factors, are both generally set to 2. In Eq. (13), μ, a random number uniformly distributed in (0, 1), is the restriction factor that determines the velocity weight. The particle swarm optimization concept consists of changing, at each time step, the velocity of each particle toward its pBest and gBest positions. Acceleration is weighted by a random term, with separate random numbers generated for the acceleration toward the pBest and gBest positions. The algorithm stops when the minimum error criterion is met or the user-defined limit on the number of iterations is reached.

In PSO–SVM, PSO is employed to search for the optimal SVM solution, using minimization of the classification error rate as the criterion for parameter determination. Each particle is encoded as a real string representing the penalty parameter and the kernel width parameter. As the particles move through the problem space, the solution with the minimum prediction error is obtained. Optimizing the penalty parameter and the kernel width parameter of the SVM model jointly helps the model avoid becoming trapped in local optima and improves its performance.

2.3. Assessing predictive performance

For published predictive tools, the predictive performance is usually determined through a cross-validation procedure. Conventionally, the total data are divided into two parts, called the training set (used for the refinement of algorithm parameters) and the testing set (used for the calculation of sensitivity, specificity and so on). It is critically important that the data from the test set are not included in any of the algorithmic refinement procedures, in order to avoid overfitted results.
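Looking back at Section 2.2.2, the particle updates of Eqs. (12) and (13) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' Matlab implementation: the function name `pso_minimize`, the inertia weight of 0.7, the swarm size, the clipping of positions to the search box, and the drawing of μ once per particle per step are all choices made here for illustration. The toy `surrogate_error` merely stands in for the quantity PSO–SVM would minimize, namely the classification error of an SVM trained with a given penalty parameter and kernel width and evaluated on data kept apart from the test set.

```python
import random


def pso_minimize(fitness, bounds, n_particles=20, n_iter=50,
                 w=0.7, c1=2.0, c2=2.0, seed=0):
    """Minimize `fitness` over a box with the updates of Eqs. (12)-(13).

    w, n_particles and the clipping to `bounds` are illustrative assumptions.
    """
    rng = random.Random(seed)
    dim = len(bounds)
    # Random initial positions inside the box, zero initial velocities.
    x = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_particles)]
    v = [[0.0] * dim for _ in range(n_particles)]
    p_best = [xi[:] for xi in x]                 # pBest positions
    p_best_val = [fitness(xi) for xi in x]
    g = min(range(n_particles), key=p_best_val.__getitem__)
    g_best, g_best_val = p_best[g][:], p_best_val[g]   # gBest position

    for _ in range(n_iter):
        for i in range(n_particles):
            mu = rng.random()                    # restriction factor of Eq. (13)
            for d, (lo, hi) in enumerate(bounds):
                r1, r2 = rng.random(), rng.random()
                # Eq. (12): inertia + pull toward pBest + pull toward gBest.
                v[i][d] = (w * v[i][d]
                           + c1 * r1 * (p_best[i][d] - x[i][d])
                           + c2 * r2 * (g_best[d] - x[i][d]))
                # Eq. (13), followed by clipping to the search box (assumption).
                x[i][d] = min(max(x[i][d] + mu * v[i][d], lo), hi)
            val = fitness(x[i])
            if val < p_best_val[i]:
                p_best[i], p_best_val[i] = x[i][:], val
                if val < g_best_val:
                    g_best, g_best_val = x[i][:], val
    return g_best, g_best_val


def surrogate_error(position):
    """Hypothetical smooth stand-in for a held-out classification error.

    position = (log10 of penalty parameter C, log10 of kernel width sigma).
    """
    log_c, log_sigma = position
    return (log_c - 3.0) ** 2 + (log_sigma - 0.3) ** 2


search_box = [(0.0, 5.0), (-2.0, 2.0)]
best_position, best_error = pso_minimize(surrogate_error, search_box)
print("best position:", best_position, "error:", best_error)
```

In an actual PSO–SVM run, `surrogate_error` would be replaced by a function that trains an SVM with the candidate parameters and returns its error rate on held-out data; searching in log space is a common convenience and is also an assumption here.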
According to a comprehensive review [37], one option for selecting important data features without biasing an algorithm involves splitting the total data into three subsets. The first and second subsets should then be used to determine important data features to be included in the algorithm, while the third subset should be used only to determine the performance metrics of the approach [37].
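The three-subset idea can be sketched as follows. The sketch assumes a plain per-class random shuffle and hypothetical names (`three_way_split`, `fractions`); it illustrates only the bookkeeping that keeps the third subset away from any tuning, and it is not meant to reproduce a particular published partitioning algorithm.

```python
import random


def three_way_split(samples, labels, fractions=(0.5, 0.25, 0.25), seed=0):
    """Split samples into (training, monitoring, prediction) subsets.

    Each class is shuffled and cut separately so that all three subsets keep
    the class balance. The third fraction is implied by the first two.
    """
    rng = random.Random(seed)
    subsets = ([], [], [])
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_train = round(fractions[0] * len(idx))
        n_monitor = round(fractions[1] * len(idx))
        cuts = (idx[:n_train],
                idx[n_train:n_train + n_monitor],
                idx[n_train + n_monitor:])       # remainder is held out
        for subset, part in zip(subsets, cuts):
            subset.extend((samples[i], labels[i]) for i in part)
    return subsets


fragments = [f"fragment_{i}" for i in range(16)]   # placeholder samples
labels = [1] * 8 + [-1] * 8                        # 1 = positive, -1 = negative
training, monitoring, prediction = three_way_split(fragments, labels)
print(len(training), len(monitoring), len(prediction))
```

Only `training` and `monitoring` would be touched while features and parameters are being chosen; `prediction` is looked at once, when the final performance is reported.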
In this paper, the dataset was split into three independent subsets, called the training set, the monitoring set and the prediction set, by the DUPLEX method [38]. The training set with 204 samples was used for training the model, and the monitoring set with 104 samples was used for tuning the parameters of the SVM model by the PSO algorithm, to reduce the risk of overfitting. The prediction set with 106 samples was used to evaluate the performance of the model. The predictive performance is most commonly assessed using four metrics: Sn (sensitivity), Sp (specificity), Acc (accuracy) and MCC (Matthews correlation coefficient). These metrics are calculated using the equations:

$$Sn = \frac{TP}{TP + FN} \tag{14}$$

$$Sp = \frac{TN}{TN + FP} \tag{15}$$

$$Acc = \frac{TP + TN}{TP + FP + TN + FN} \tag{16}$$

$$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FN)(TP + FP)(TN + FN)(TN + FP)}} \tag{17}$$
where TP is the number of positive samples predicted correctly (true positives), TN is the number of negative samples predicted correctly (true negatives), FP is the number of negative samples predicted as positive (false positives) and FN is the number of positive samples predicted as negative (false negatives). Sensitivity (Sn) and specificity (Sp) give the fractions of correctly predicted positive samples (methylation) and negative samples (non-methylation), respectively, while accuracy (Acc) is the fraction of correct predictions over both positive and negative samples. The Matthews correlation coefficient (MCC) [39], a weighted measure that reflects both the sensitivity and the specificity of the prediction algorithm, has increasingly been used for measuring the predictive capability of classifiers. All the algorithms used in this study were written in Matlab 5.3 and run on a personal computer (Intel Pentium(R) Dual-Core CPU E5700, 3.00 GHz, 2 GB RAM).

3. Dataset

3.1. Data collection

All training data were downloaded from the UniProtKB/Swiss-Prot database (version 2013_11, www.uniprot.org) using the keywords “Omega-N-methylated arginine”, “symmetric dimethylarginine”, “Omega-N-methylarginine” and “asymmetric dimethylarginine”. In total, 109 proteins having at least one experimentally verified site were found. The proteins with experimentally validated arginine methylation sites were defined as the positive dataset, excluding those annotated as ‘potential’, ‘probable’ or ‘by similarity’ in the description field. The negative dataset included arginine (R) residues on the same proteins that were not marked with any methylation information. To avoid overestimating the predictive performance, we clustered the protein sequences with a threshold of 40% identity using the CD-HIT program [40] to remove highly homologous sequences. Then, a sliding window strategy was used to extract positive and negative data from the protein sequences as training data, represented by peptide sequences with arginine symmetrically surrounded by flanking residues. Six residues upstream and six residues downstream of the sites in the protein sequences were extracted to form the positive and negative samples. After strictly following the above procedures, we obtained 207 high-quality positive sites and 2282 negative sites. In order to keep the balance between positive and negative samples, the final dataset is composed of all methylated fragments and some non-methylated fragments that are randomly selected. The proportion of 1:1 aims to avoid a biased prediction that may be caused by a very large number of negative samples and a quite small number of positive ones [14,15]. The final dataset can be found in Supplementary data (Table S2).

4. Results and discussion

4.1. Choosing the best window size

Since each methylation or non-methylation site has its profile features taken from a sequence fragment containing the n nearest residues (spatially), it is crucial to determine the appropriate window size and understand its effect on the prediction performance. With different window sliding sizes, different residues around the methylation sites were used to construct the models. Window sliding sizes from 4 to 11 were tested. As shown in Fig. 1, most of the prediction accuracies were above 0.9, indicating that the method is effective. From the histogram, we found that the result obtained with a window sliding size of 6 was the best. Therefore, a window sliding size of 6 was chosen in this study.

Fig. 1. Prediction results of PSO–SVM with different window sliding sizes.

4.2. Investigation of different features
As described in the Methods section, our feature construction included two types of features: the pseudo amino acid composition (PseAAC) and the protein chain description, comprising C and T. Here we constructed three models based on PseAAC, C&T, and PseAAC + C&T, respectively. With a window sliding size of 6, the performance of the PSO–SVM models trained with the various features is shown in Table 1. According to Table 1, the model trained with the combination of PseAAC and C&T (PseAAC + C&T) shows an obvious improvement over the models trained with either feature type alone. This demonstrates that both types of features contribute to distinguishing between methylation sites and non-methylation sites, and that there is a strong complementary effect between them. Hence, the combination of PseAAC and C&T was selected as the global features of protein sequence for learning the predictive model.

Table 1
The performance obtained by PSO–SVM with different combinations of protein encoding features. The window size was 6.

Training features   Data set         Sn (%)   Sp (%)   Acc (%)   MCC (%)
PseAAC              Monitoring set   75.00    76.92    75.96     51.93
PseAAC              Prediction set   71.70    81.13    76.42     53.07
C&T                 Monitoring set   84.62    98.08    91.35     83.45
C&T                 Prediction set   79.25    81.13    80.19     60.39
PseAAC + C&T        Monitoring set   96.15    100      98.08     96.23
PseAAC + C&T        Prediction set   96.23    100      98.11     96.30

4.3. Comparison with other methods

Table 2
Comparison among KNN, BPNN, SVM, GA–SVM and PSO–SVM classifiers.

Classifier    Data set         Sn (%)   Sp (%)   Acc (%)   MCC (%)
KNN           Monitoring set   69.23    84.62    76.92     54.49
KNN           Prediction set   64.15    69.81    66.98     34.02
BPNN          Monitoring set   77.30    82.32    79.81     59.85
BPNN          Prediction set   79.32    85.28    82.30     64.86
SVM           Monitoring set   90.38    100      95.19     90.81
SVM           Prediction set   90.57    98.11    94.34     88.93
GA–SVM        Monitoring set   96.15    100      98.08     96.23
GA–SVM        Prediction set   94.34    97.17    97.17     94.49
PSO–SVM (a)   Monitoring set   96.15    100      98.08     96.23
PSO–SVM (a)   Prediction set   96.23    100      98.11     96.30

(a) The optimal value for the penalty parameter C is 2854.10 and for the kernel width σ is 1.99 in PSO–SVM.

Fig. 2. Converging performance of PSO–SVM.

In this paper, a support vector machine optimized by particle swarm optimization (PSO–SVM) is used to build the prediction model for methylation sites on the basis of the global features of protein sequence. The performance of PSO–SVM with the optimal features (i.e. PseAAC + C&T) is given in Table 2. The prediction accuracy, sensitivity, specificity and Matthews correlation coefficient for the independent prediction set were 98.11%, 96.23%, 100% and 96.30%, respectively, indicating that the method performs well in identifying protein arginine methylation sites. Furthermore, the efficiency of PSO was investigated via
observing the convergence process of PSO–SVM. As shown in Fig. 2, PSO converges within 50 cycles, implying high efficiency in reaching an optimal solution. To further evaluate the PSO–SVM classifier, K-Nearest Neighbor (KNN) [41], back propagation neural networks (BPNN) [42] and a conventional non-linear SVM were employed to identify potential protein methylation sites on the same datasets. In addition, the genetic algorithm (GA) was tried as the optimization method for building an SVM prediction model. BPNN was run 50 times with different initial parameter values to avoid a poor result caused by random initialization, and the result in Table 2 is the average performance of the 50 BPNN models. The optimal hidden layer, with ten nodes, was chosen by evaluating the performance of BPNN models with 8 to 20 nodes in the hidden layer. The monitoring set with 104 samples was used to reduce the risk of BPNN overfitting the training data. As shown in Table 2, PSO–SVM, GA–SVM and the conventional SVM perform much better than BPNN and KNN. For the conventional SVM, a radial basis function was chosen as the kernel function, and the penalty parameter and the kernel width parameter were optimized using a grid search strategy. This suggests that the SVM algorithm solves the nonlinear problem in multi-dimensional space more effectively than the KNN or BPNN algorithm. In addition, PSO–SVM and GA–SVM perform much better than the conventional SVM in Table 2, which shows that selecting the parameters of the SVM model is of considerable significance for improving its learning and generalization performance. As shown in Table 2, PSO–SVM gave better performance than GA–SVM. The proposed PSO–SVM method thus reliably improves the performance of arginine methylation site prediction.
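As a quick consistency check on the PSO–SVM prediction-set figures in Table 2, the four metrics defined in Section 2.3 can be recomputed from a confusion matrix. The counts used below are an assumption, since the paper reports only percentages: 53 positive and 53 negative fragments in the 106-sample prediction set (in line with the 1:1 ratio), with two false negatives and no false positives. The function name `classification_metrics` is likewise illustrative.

```python
from math import sqrt


def classification_metrics(tp, tn, fp, fn):
    """Return Sn, Sp, Acc and MCC as fractions, following Eqs. (14)-(17)."""
    sn = tp / (tp + fn)
    sp = tn / (tn + fp)
    acc = (tp + tn) / (tp + fp + tn + fn)
    denom = sqrt((tp + fn) * (tp + fp) * (tn + fn) * (tn + fp))
    # MCC is undefined when a row or column of the confusion matrix is empty;
    # returning 0.0 in that case is a convention chosen for this sketch.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Sn": sn, "Sp": sp, "Acc": acc, "MCC": mcc}


# Assumed counts (not reported in the paper): 53 positives, 53 negatives,
# 2 false negatives, 0 false positives.
metrics = classification_metrics(tp=51, tn=53, fp=0, fn=2)
for name, value in metrics.items():
    print(f"{name}: {100 * value:.2f}%")
```

With these assumed counts, the rounded percentages come out as 96.23 (Sn), 100.00 (Sp), 98.11 (Acc) and 96.30 (MCC), which agrees with the values reported for PSO–SVM on the prediction set.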
Moreover, several important arginine methylation site prediction tools, namely MeMo [12], BPB-PPMS [13], Methy_SVMIACO [14] and PMeS [15], are summarized together with PSO–SVM in Table 3, with the performance figures of these predictors cited from the original articles. It is worth noting that, when the global features of protein sequence are used to describe the dataset, the conventional SVM (results shown in Table 2) outperforms the other SVM-based methylation prediction tools in Table 3. This indicates that the global features of protein sequence employed here are superior to previous encoding schemes. Nevertheless, the conventional SVM still performed worse than PSO–SVM in the present study. In brief, the comparison shows that the improvements can be attributed to the adoption of the optimal global features of protein sequence and a proper classifier.

Table 3
Summary of the reported performance of some important methylation site prediction tools and the proposed method.

Prediction tools      Feature encoding scheme                                  Classifier     Sn (%)   Sp (%)   Acc (%)   MCC (%)
MeMo [12]             Binary encoding                                          SVM            69.6     89.2     86.7      –
BPB-PPMS [13]         Bi-profile Bayes feature extraction                      SVM            74.71    94.32    87.98     –
Methy_SVMIACO [14]    Various properties of amino acid from AAindex database   IACO (a)–SVM   89.03    94.07    91.56     83.23
PMeS [15]             SPC (b) + PWAA (c) + ASA (d) + VDWV (e)                  SVM            86.18    90.24    88.21     76.61
The proposed method   PseAAC + C&T                                             PSO–SVM        96.23    100      98.11     96.30

(a) IACO, improved ant colony optimization algorithm.
(b) SPC, the sparse property coding.
(c) PWAA, position weight amino acid composition.
(d) ASA, accessible surface area.
(e) VDWV, normalized van der Waals volume.

5. Conclusion

In this paper, we proposed a novel strategy for the prediction of methylation sites based on a combination of the pseudo amino acid composition (PseAAC) and the protein chain description as global features of protein sequence, coupled with a support vector machine optimized by particle swarm optimization (PSO–SVM). The feature encoding scheme in this study incorporated amino acid composition information, sequence-order information and the physicochemical properties of residues together with structural characteristics to improve the prediction of protein methylation sites. Feature analysis showed that both types of features contributed to distinguishing between methylation sites and non-methylation sites. Compared with the conventional SVM, whose parameters are optimized by a grid search strategy, using the stochastic global optimization method PSO to optimize the SVM parameters markedly enhances the performance of arginine methylation site prediction. The prediction results on the independent test set demonstrated that the proposed method achieved promising performance and outperformed other methylation prediction tools. It can be anticipated that the method might be useful for guiding future experiments aimed at identifying potential methylation sites in proteins of interest.

Conflicts of interest

The authors declared that they have no conflicts of interest to this work.

Acknowledgment

This work was supported by NSFC (21205034, 21025521, 21035001, 21190041, 91317312), Ministry of Education of the People's Republic of China (New Teachers, 20120161120032), Hunan Provincial Natural Science Foundation (13JJ4031), Fundamental Research Funds for the Central Universities and Young Scholar Support Program of Hunan University.

Appendix A. Supplementary data

Supplementary data to this article can be found online at http://dx.doi.org/10.1016/j.chemolab.2015.05.011.

References

[1] W.K. Paik, D.C. Paik, S. Kim, Historical review: the field of protein methylation, Trends Biochem. Sci. 32 (2007) 146–152.
[2] K.B. Sylvestersen, H. Horn, S. Jungmichel, L.J. Jensen, M.L. Nielsen, Proteomic analysis of arginine methylation sites in human cells reveals dynamic regulation during transcriptional arrest, Mol. Cell. Proteomics 13 (2014) 2072–2088.
[3] M.T. Bedford, S.
Richard, Arginine methylation: an emerging regulator of protein function, Mol. Cell 18 (2005) 263–272. [4] B.M. Turner, Cellular memory and the histone code, Cell 111 (2002) 285–291. [5] A. Guo, H. Gu, J. Zhou, D. Mulhern, Y. Wang, K.A. Lee, V. Yang, M. Aguiar, J. Kornhauser, X. Jia, Immunoaffinity enrichment and mass spectrometry analysis of protein methylation, Mol. Cell. Proteomics 13 (2014) 372–387. [6] K.M. Daily, P. Radivojac, A.K. Dunker, Intrinsic disorder and protein modifications: building an SVM predictor for methylation, Computational Intelligence in Bioinformatics and Computational Biology, 2005. CIBCB'05. Proceedings of the 2005 IEEE Symposium on, IEEE 2005, pp. 1–7. [7] A.K. Dunker, J.D. Lawson, C.J. Brown, R.M. Williams, P. Romero, J.S. Oh, C.J. Oldfield, A.M. Campen, C.M. Ratliff, K.W. Hipps, Intrinsically disordered protein, J. Mol. Graph. Model. 19 (2001) 26–59. [8] C. Cortes, V. Vapnik, Support-vector networks, Mach. Learn. 20 (1995) 273–297. [9] H. Li, Y. Liang, Q. Xu, Support vector machines and its applications in chemistry, Chemom. Intell. Lab. 95 (2009) 188–198. [10] U. Thissen, M. Pepers, B. Üstün, W.J. Melssen, L.M.C. Buydens, Comparing support vector machines to PLS for spectral regression applications, Chemom. Intell. Lab. 73 (2004) 169–179. [11] S.J. Dixon, R.G. Brereton, Comparison of performance of five common classifiers represented as boundary methods: Euclidean Distance to Centroids, Linear Discriminant Analysis, Quadratic Discriminant Analysis, Learning Vector Quantization and Support Vector Machines, as dependent on data structure, Chemom. Intell. Lab. 95 (2009) 1–17. [12] H. Chen, Y. Xue, N. Huang, X. Yao, Z. Sun, MeMo: a web tool for prediction of protein methylation modifications, Nucleic Acids Res. 34 (2006) 249–253.
107
[13] J. Shao, D. Xu, S.-N. Tsai, Y. Wang, S.-M. Ngai, Computational identification of protein methylation sites through bi-profile Bayes feature extraction, Plos One 4 (2009) e4920. [14] Z.C. Li, X. Zhou, Z. Dai, X.Y. Zou, Identification of protein methylation sites by coupling improved ant colony optimization algorithm and support vector machine, Anal. Chim. Acta 703 (2011) 163–171. [15] S.P. Shi, J.D. Qiu, X.Y. Sun, S.B. Suo, S.Y. Huang, R.-P. Liang, PMeS: prediction of methylation sites based on enhanced feature encoding scheme, Plos One 7 (2012) e38772. [16] W.R. Qiu, X. Xiao, W.Z. Lin, K.C. Chou, iMethyl-PseAAC: Identification of protein methylation sites via a pseudo amino acid composition approach, Biomed. Res. Int. (2014), http://dx.doi.org/10.1155/2014/947416. [17] K.C. Chou, Prediction of protein cellular attributes using pseudo‐amino acid composition, Proteins 43 (2001) 246–255. [18] I. Dubchak, I. Muchnik, S.R. Holbrook, S.-H. Kim, Prediction of protein folding class using global description of amino acid sequence, Proc. Natl. Acad. Sci. U. S. A. 92 (1995) 8700–8704. [19] J. Kennedy, R. Eberhart, Particle swarm optimization, Proceedings of IEEE international conference on neural networks, Perth, Australia 1995, pp. 1942–1948. [20] Y. Shi, R. Eberhart, A modified particle swarm optimizer, Evolutionary Computation Proceedings, 1998. IEEE World Congress on Computational Intelligence., The 1998 IEEE International Conference on, IEEE 1998, pp. 69–73. [21] Y. Shi, R.C. Eberhart, Fuzzy adaptive particle swarm optimization, Evolutionary Computation, 2001. Proceedings of the 2001 Congress on, IEEE 2001, pp. 101–106. [22] D.E. Golberg, Genetic Algorithms in Search, Optimization, and Machine Learning, Addion wesley, 1989. [23] R. Hassan, B. Cohanim, O. De Weck, G. Venter, A comparison of particle swarm optimization and the genetic algorithm, Proceedings of the 1st AIAA multidisciplinary design optimization specialist conference 2005, pp. 18–21. [24] K. Chen, T. 
Li, T. Cao, Tribe-PSO: a novel global optimization algorithm and its application in molecular docking, Chemom. Intell. Lab. 82 (2006) 248–259. [25] H. Shinzawa, J.-H. Jiang, M. Iwahashi, I. Noda, Y. Ozaki, Self-modeling curve resolution (SMCR) by particle swarm optimization (PSO), Anal. Chim. Acta 595 (2007) 275–281. [26] J.-H. Wen, K.-J. Zhong, L.-J. Tang, J.-H. Jiang, H.-L. Wu, G.-L. Shen, R.-Q. Yu, Adaptive variable-weighted support vector machine as optimized by particle swarm optimization algorithm with application of QSAR studies, Talanta 84 (2011) 13–18. [27] X. Yu, R. Yu, L. Tang, Q. Guo, Y. Zhang, Y. Zhou, Q. Yang, X. He, X. Yang, K. Wang, Recognition of candidate aptamer sequences for human hepatocellular carcinoma in SELEX screening using structure–activity relationships, Chemom. Intell. Lab. 136 (2014) 10–14. [28] K.C. Chou, Some remarks on protein attribute prediction and pseudo amino acid composition, J. Theor. Biol. 273 (2011) 236–247. [29] D.S. Cao, Q.S. Xu, Y.Z. Liang, propy: a tool to generate various modes of Chou's PseAAC, Bioinformatics 29 (2013) 960–962. [30] H. Nakashima, K. Nishikawa, T. Ooi, The folding type of a protein is relevant to the amino acid composition, J. Biochem. 99 (1986) 153–162. [31] C. Tanford, Contribution of hydrophobic interactions to the stability of the globular conformation of proteins, J. Am. Chem. Soc. 84 (1962) 4240–4247. [32] S. Kawashima, H. Ogata, M. Kanehisa, AAindex: amino acid index database, Nucleic Acids Res. 27 (1999) 368–369. [33] R.J. Vong, T.V. Larson, W.H. Zoller, A multivariate chemical classification of rainwater samples, Chemom. Intell. Lab. 3 (1988) 99–109. [34] B.K. Lavine, P.C. Jurs, D.R. Henry, R.K.V. Meer, J.A. Pino, J.E. McMurry, Pattern recognition studies of complex chromatographic data sets: Design and analysis of pattern recognition experiments, Chemom. Intell. Lab. 3 (1988) 79–89. [35] I.E. Frank, S. Lanteri, Classification models: Discriminant analysis, SIMCA, CART, Chemom. Intell. Lab. 
5 (1989) 247–256. [36] J. Kim, A. Mowat, P. Poole, N. Kasabov, Linear and non-linear pattern recognition models for classification of fruit from visible–near infrared spectra, Chemom. Intell. Lab. 51 (2000) 201–216. [37] D. Schwartz, Prediction of lysine post-translational modifications using bioinformatic tools, Essays Biochem. 52 (2012) 165–177. [38] R.D. Snee, Validation of regression models: methods and examples, Technometrics 19 (1977) 415–428. [39] B.W. Matthews, Comparison of the predicted and observed secondary structure of T4 phage lysozyme, BBA-Protein Struct. Mol. 405 (1975) 442–451. [40] L. Fu, B. Niu, Z. Zhu, S. Wu, W. Li, CD-HIT: accelerated for clustering the nextgeneration sequencing data, Bioinformatics 28 (2012) 3150–3152. [41] B.R. Kowalski, C. Bender, K-Nearest Neighbor Classification Rule (pattern recognition) applied to nuclear magnetic resonance spectral interpretation, Anal. Chem. 44 (1972) 1405–1411. [42] S. Grossberg, Nonlinear neural networks: principles, mechanisms, and architectures, Neural Netw. 1 (1988) 17–61.