Highly accurate prediction of protein self-interactions by incorporating the average block and PSSM information into the general PseAAC


Accepted Manuscript

Jing-Xuan Zhai, Tian-Jie Cao, Ji-Yong An, Yong-Tao Bian

PII: S0022-5193(17)30375-2
DOI: 10.1016/j.jtbi.2017.08.009
Reference: YJTBI 9173

To appear in: Journal of Theoretical Biology

Received date: 12 May 2017
Revised date: 5 August 2017
Accepted date: 8 August 2017

Please cite this article as: Jing-Xuan Zhai, Tian-Jie Cao, Ji-Yong An, Yong-Tao Bian, Highly accurate prediction of protein self-interactions by incorporating the average block and PSSM information into the general PseAAC, Journal of Theoretical Biology (2017), doi: 10.1016/j.jtbi.2017.08.009

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.


Highlights   


We proposed a novel computational model called RVM-AB for predicting SIPs.

We developed an effective feature extraction method named Average Blocks (AB), which is combined with PSSM and PCA to transform a protein sequence into a feature vector.

A robust machine-learning algorithm (RVM) was employed to carry out classification.

The performance of RVM-AB was assessed and compared with the state-of-the-art support vector machine (SVM) classifier and other existing methods on yeast and human datasets.


Jing-Xuan Zhai1, Tian-Jie Cao1, Ji-Yong An1, Yong-Tao Bian1
([email protected], [email protected], [email protected], [email protected])

1 School of Computer Science and Technology, China University of Mining and Technology, Xuzhou, Jiangsu 21116, China

Abstract: Determining whether proteins interact with their partners is a challenging task in fundamental research. Protein self-interaction (SIP) is a special case of protein-protein interaction (PPI) and plays a key role in the regulation of cellular functions. Because experimental identification of self-interactions has limitations, it is important to develop an effective computational tool for predicting SIPs from protein sequences. In this study, we developed a novel computational method called RVM-AB that combines the Relevance Vector Machine (RVM) model and Average Blocks (AB) for detecting SIPs from protein sequences. First, the Average Blocks (AB) feature extraction method is applied to the Position Specific Scoring Matrix (PSSM) of each protein sequence. Second, Principal Component Analysis (PCA) is used to reduce the dimension of the AB vector and thereby the influence of noise. Then, the RVM algorithm is used for classification, and the performance of RVM-AB is assessed and compared with the state-of-the-art support vector machine (SVM) classifier and other existing methods on yeast and human datasets. Under fivefold cross-validation, the RVM-AB model achieved accuracies of 93.01% and 97.72% on the yeast and human datasets respectively, which are significantly better than those of the SVM-based method and other previous methods. The experimental results show that the RVM-AB prediction model is efficient and robust and can serve as an automatic decision support tool for detecting SIPs. To facilitate future proteomics research, the RVM-AB server is freely available for academic use at http://219.219.62.123:8888/SIP_AB.

Keywords: SIPs, AB, Position-specific scoring matrix, Protein sequences, RVM

1. Introduction


Detecting protein-protein interactions (PPIs) is key to understanding cell metabolism and function. A fundamental question is whether proteins can interact with their partners. Protein self-interaction (SIP) is a special case of PPI in which the two interaction partners are identical copies encoded by the same gene, resulting in the formation of a homo-oligomer. In recent years, many studies have shown that homo-oligomerization plays an essential role in a wide range of biological processes, such as gene expression regulation, signal transduction, enzyme activation and immune response [1-5]. SIP is therefore an essential factor in the regulation of protein function. SIP can significantly extend the functional diversity of proteins without increasing the size of the genome. In addition, SIP helps to improve the stability of a protein and prevent its denaturation by reducing its surface area [6]. Up to now, many computational methods for predicting PPIs have been developed.


Pitre et al. [7] proposed a Protein-Protein Interaction Prediction Engine called PIPE, which can detect PPIs for any target pair of proteins of the yeast Saccharomyces cerevisiae. Xia et al. [8] proposed a sequence-based method for PPI prediction that uses rotation forest as the classifier and an autocorrelation descriptor for feature extraction. However, these methods have limitations for protein self-interaction detection. They usually rely on correlational information between protein pairs, such as co-expression, co-localization and co-evolution, which is not available for SIPs. In addition, the datasets used to predict PPIs do not cover interactions between identical partners. Therefore, these computational approaches are not suitable for SIP detection. In a previous study, Liu et al. [9] proposed a method called SLIPPER for SIP prediction, which integrates multiple representative known properties to construct a prediction model. The disadvantage of this method is that it cannot deal with proteins that are not covered by the current human interactome. Given these limitations, developing an efficient computational method for SIP detection is becoming increasingly important. In this paper, a novel computational method is proposed for detecting SIPs using only protein sequence data. The major novelty of the proposed method is threefold: (1) the Average Blocks (AB) feature extraction method is applied to the Position Specific Scoring Matrix (PSSM) to represent protein sequences; (2) Principal Component Analysis (PCA) is employed to reduce the dimension of the AB vector and thereby the influence of noise; (3) the Relevance Vector Machine (RVM) classifier is used for classification.

The proposed method proceeds as follows: first, each protein sequence is expressed as a PSSM; second, each PSSM is converted into a 400-dimensional feature vector using the Average Blocks (AB) descriptor to capture useful information; third, the dimension of the AB vector is reduced using PCA to improve prediction accuracy; finally, the RVM model is used to perform classification. Two SIP datasets (yeast and human) are used in the experiments. The experimental results are superior to those of SVM and other previous methods, which demonstrates that the proposed method is suitable for SIP detection and performs well in SIP prediction. As demonstrated by a series of recent publications [10-25], establishing a robust and effective SIP predictor involves the following five steps [26]: (1) construct or select a valid benchmark dataset for the prediction model; (2) employ a novel feature extraction method to transform protein sequences into feature vectors; (3) employ a robust machine learning algorithm to carry out the classification; (4) use appropriate cross-validation tests to assess the prediction performance of the classifier; (5) make a user-friendly web-server for the prediction model accessible to the public, to facilitate future proteomics research.

2. Materials and Methodology

2.1. Dataset

The UniProt database [27] contains 20,199 curated human protein sequences. PPI data can be obtained from diverse resources, including DIP [28], BioGRID [29], IntAct [30], InnateDB [31] and MatrixDB [32]. In this paper, we constructed PPI data that contain only interactions between two identical protein sequences and whose interaction type is defined as ‘direct interaction’ in the relevant databases. Consequently, we obtained 2994 human self-interacting protein sequences. To evaluate the performance of the proposed prediction model, the experimental datasets were constructed in the following three steps [33]: (1) we removed protein sequences shorter than 50 residues or longer than 5000 residues from the whole human proteome; (2) for the positive dataset, we selected self-interacting proteins that satisfy at least one of the following conditions: (a) the self-interaction has been detected by at least two kinds of large-scale experiments or one small-scale experiment; (b) the protein has been defined as a homo-oligomer (including homodimer and homotrimer) in UniProt; (c) the self-interaction has been reported by at least two publications; (3) to construct the negative dataset, we removed all types of SIPs (including proteins annotated as ‘direct interaction’ and the more extensive ‘physical association’) from the whole human proteome and the UniProt database. The resulting human dataset contained 1441 SIPs as positives and 15,938 non-SIPs as negatives [33]. In addition, to further demonstrate the prediction performance of RVM-AB, we constructed a yeast dataset with 710 positive and 5511 negative protein sequences [33] using the same procedure.

2.2. General Pseudo Amino Acid Composition

One of the most important and difficult problems in computational biology is how to represent a protein sequence with a discrete model or a vector, because machine-learning algorithms can only handle vectors, not protein sequences [34]. A feature vector, however, may lose the sequence-pattern information of the protein. To avoid completely losing this information, the pseudo amino acid composition (PseAAC) [35] was proposed. PseAAC has penetrated into many areas of biomedicine and drug development [36, 37] and has been applied in many computational proteomics studies [38-41], as well as in a long list of references cited in [42, 43]. Recently, a very powerful web-server called Pse-in-One [43, 44] was established that can generate any desired feature vectors for protein sequences. In our study, the "Average Blocks" and "PSSM" modes of the general PseAAC are employed to predict SIPs.

2.3. Position Specific Scoring Matrix


Position Specific Scoring Matrix (PSSM) is a useful tool that was originally applied for detecting distantly related proteins. Each protein sequence can be transformed into a PSSM [45] by using the Position Specific Iterated BLAST (PSI-BLAST) [46]. A given protein sequence can be expressed as a PSSM, an N×20 matrix $M = \{M_{ij} : i = 1, \ldots, N;\ j = 1, \ldots, 20\}$, where N is the length of the protein sequence and 20 is the number of amino acid types. For the query protein sequence, the PSSM assigns a score $M_{ij}$ to the jth amino acid at the ith position. The score is defined as $M_{ij} = \sum_{k=1}^{20} p(i,k) \times q(j,k)$, where p(i, k) is the frequency with which the kth amino acid appears at position i of the probe, and q(j, k) is the value of Dayhoff’s mutation matrix between the jth and kth amino acids. As a result, a well conserved position usually has a high score and a weakly conserved position has a low score. PSSMs have proved helpful for predicting protein quaternary structural attributes, disulfide connectivity and folding patterns [47, 48]. Thus, PSI-BLAST is employed in this paper to construct the PSSM of each protein sequence for predicting SIPs. To obtain highly and widely homologous sequences, the e-value parameter of PSI-BLAST was set to 0.001 and three iterations were used. Consequently, the PSSM of each protein sequence is a matrix of N×20 elements, where N is the number of residues of the protein and the 20 columns correspond to the 20 amino acids.
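As a minimal sketch of the score definition in this section, the following Python computes M_ij as the sum over k of p(i, k) × q(j, k) for toy inputs. The function name `pssm_scores` and the three-letter alphabet are illustrative assumptions; in the paper the PSSM is produced by PSI-BLAST, not by code like this.

```python
def pssm_scores(p, q):
    """Return M with M[i][j] = sum_k p[i][k] * q[j][k].

    p: per-position residue frequencies (one row per sequence position).
    q: substitution scores between residue types (square matrix).
    """
    n_types = len(q)
    return [
        [sum(p_row[k] * q[j][k] for k in range(n_types)) for j in range(n_types)]
        for p_row in p
    ]


# Toy example with a 3-letter alphabet instead of the 20 amino acids
# (hypothetical numbers, chosen only to make the arithmetic easy to follow).
p = [[1.0, 0.0, 0.0],   # position 1: only residue type 0 is observed
     [0.5, 0.5, 0.0]]   # position 2: types 0 and 1 are equally frequent
q = [[2, -1, -1],
     [-1, 2, -1],
     [-1, -1, 2]]
M = pssm_scores(p, q)
```

In this toy case the fully conserved first position receives the highest score for its own residue type, which mirrors the statement that well conserved positions tend to score high.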

2.4. Average Blocks

The characteristics of the Average Blocks (AB) descriptor were originally described in [49]. Creating informative features is a challenging task for machine learning-based methods. Because protein sequences contain different numbers of amino acids, a PSSM cannot be transformed directly into a feature vector; doing so would yield feature vectors of different lengths. To solve this problem, a feature extraction method called averaged PSSM profiles over blocks (Average Blocks) is used to create the feature vectors. Each block contains 5% of a protein sequence, so every protein sequence is divided into 20 blocks regardless of its length. Each block contributes 20 features derived from the 20 columns of the PSSM. The corresponding formula is:

$$AB(k) = \frac{20}{N} \sum_{p=1}^{N/20} Mt\left(p + (i-1) \times \frac{N}{20},\ j\right), \quad i = 1, \ldots, 20;\ j = 1, \ldots, 20;\ k = j + 20 \times (i-1) \tag{1}$$

where N/20, which is 5% of the length of the protein sequence, is the size of each block, and $Mt(p + (i-1) \times N/20,\ j)$ is the PSSM entry in column j at the pth position of the ith block. Thus, each protein sequence has 20 blocks and is expressed as a 400-dimensional vector. The theoretical basis of Average Blocks is that the residue conservation tendencies in the same domain family are similar, and the locations of domains in the same family are closely related to the length of the sequences [49]. In our application, each protein sequence of the yeast and human datasets was converted into a 400-dimensional vector using Average Blocks. To reduce the influence of noise and improve the prediction accuracy, the dimension of the yeast and human feature vectors was then reduced from 400 to 350 using Principal Component Analysis (PCA) [50].
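The Average Blocks computation can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: the function name `average_blocks` is hypothetical, and the integer rounding of block boundaries for sequence lengths that are not a multiple of 20 is a choice made here, since the formula uses N/20 directly.

```python
def average_blocks(pssm, n_blocks=20):
    """Average each PSSM column within n_blocks consecutive sequence blocks.

    pssm: list of N rows, each holding 20 scores (one per amino acid).
    Returns a flat list of n_blocks * 20 values ordered block by block, so
    that element k - 1 (0-based) corresponds to AB(k), k = j + 20 * (i - 1).
    """
    n = len(pssm)
    if n < n_blocks:
        raise ValueError("sequence is shorter than the number of blocks")
    n_cols = len(pssm[0])
    features = []
    for i in range(n_blocks):
        # Integer block boundaries: an assumption for lengths that are
        # not an exact multiple of n_blocks.
        start = i * n // n_blocks
        end = (i + 1) * n // n_blocks
        block = pssm[start:end]
        for j in range(n_cols):
            features.append(sum(row[j] for row in block) / len(block))
    return features


# Toy PSSM with 40 residues: every score in row r equals r, so block i
# (rows 2i and 2i + 1) averages to 2i + 0.5 in every column.
toy = [[float(r)] * 20 for r in range(40)]
vec = average_blocks(toy)
```

Whatever the sequence length, the output always has 400 entries, which is what makes the descriptor usable as a fixed-length input to PCA and a classifier.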

2.5. Relevance Vector Machine

The characteristics of the Relevance Vector Machine are described in [51]. Let $\{x_n, t_n\}_{n=1}^{N}$, $x_n \in R^d$, be the training set of a binary classification problem, where $t_n \in \{0, 1\}$ is the label of the nth training sample. It is assumed that $t_i = y_i + \varepsilon_i$, where $y_i = w^T \varphi(x_i) = \sum_{j=1}^{N} w_j K(x_i, x_j) + w_0$ is the classification model and $\varepsilon_i$ is additive noise with zero mean and variance $\sigma^2$, so that $\varepsilon_i \sim N(0, \sigma^2)$ and $t_i \sim N(y_i, \sigma^2)$. Assuming that the training samples are independent and identically distributed, the vector t obeys the following distribution:

$$p(t \mid x, w, \sigma^2) = (2\pi\sigma^2)^{-N/2} \exp\left\{-\frac{1}{2\sigma^2} \lVert t - \Phi w \rVert^2\right\} \tag{2}$$

where $\Phi$ is the design matrix

$$\Phi = \begin{pmatrix} 1 & K(x_1, x_1) & \cdots & K(x_1, x_N) \\ \vdots & \vdots & \ddots & \vdots \\ 1 & K(x_N, x_1) & \cdots & K(x_N, x_N) \end{pmatrix} \tag{3}$$

The training labels t are used to predict the label $t_*$ of a test sample:

$$p(t_* \mid t) = \int p(t_* \mid w, \sigma^2)\, p(w, \sigma^2 \mid t)\, dw\, d\sigma^2 \tag{4}$$

To make most components of the weight vector w zero and to reduce the number of kernel function evaluations, a prior is placed on w. Each $w_i$ is assumed to follow a zero-mean Gaussian distribution with variance $\alpha_i^{-1}$, that is, $w_i \sim N(0, \alpha_i^{-1})$ and $p(w \mid \alpha) = \prod_{i=0}^{N} p(w_i \mid \alpha_i)$, where $\alpha$ is the vector of hyper-parameters of the prior distribution of w. Then

$$p(t_* \mid t) = \int p(t_* \mid w, \alpha, \sigma^2)\, p(w, \alpha, \sigma^2 \mid t)\, dw\, d\alpha\, d\sigma^2 \tag{5}$$

$$p(t_* \mid w, \alpha, \sigma^2) = N(t_* \mid y(x_*; w), \sigma^2) \tag{6}$$

Since $p(w, \alpha, \sigma^2 \mid t)$ cannot be obtained by direct integration, it is decomposed using Bayes' rule:

$$p(w, \alpha, \sigma^2 \mid t) = p(w \mid \alpha, \sigma^2, t)\, p(\alpha, \sigma^2 \mid t) \tag{7}$$

$$p(w \mid \alpha, \sigma^2, t) = \frac{p(t \mid w, \sigma^2)\, p(w \mid \alpha)}{p(t \mid \alpha, \sigma^2)} \tag{8}$$

Integrating the product of $p(t \mid w, \sigma^2)$ and $p(w \mid \alpha)$ over w gives

$$p(t \mid \alpha, \sigma^2) = (2\pi)^{-N/2} \lvert \Omega \rvert^{-1/2} \exp\left(-\frac{t^T \Omega^{-1} t}{2}\right) \tag{9}$$

$$\Omega = \sigma^2 I + \Phi A^{-1} \Phi^T, \quad A = \mathrm{diag}(\alpha_0, \alpha_1, \ldots, \alpha_N) \tag{10}$$

$$p(w \mid \alpha, \sigma^2, t) = (2\pi)^{-(N+1)/2} \lvert \Sigma \rvert^{-1/2} \exp\left(-\frac{(w - \mu)^T \Sigma^{-1} (w - \mu)}{2}\right) \tag{11}$$

$$\Sigma = (\sigma^{-2} \Phi^T \Phi + A)^{-1} \tag{12}$$

$$\mu = \sigma^{-2} \Sigma \Phi^T t \tag{13}$$

Because $p(\alpha, \sigma^2 \mid t) \propto p(t \mid \alpha, \sigma^2)\, p(\alpha)\, p(\sigma^2)$ cannot be evaluated analytically, the solution is approximated by maximizing the likelihood:

$$(\alpha_{MP}, \sigma^2_{MP}) = \arg\max_{\alpha, \sigma^2} p(t \mid \alpha, \sigma^2) \tag{14}$$

$\alpha_{MP}$ and $\sigma^2_{MP}$ are obtained iteratively:

$$\alpha_i^{new} = \frac{\gamma_i}{\mu_i^2}, \qquad (\sigma^2)^{new} = \frac{\lVert t - \Phi\mu \rVert^2}{N - \sum_{i=0}^{N} \gamma_i}, \qquad \gamma_i = 1 - \alpha_i \Sigma_{ii} \tag{15}$$

Here $\Sigma_{ii}$ is the ith diagonal element of $\Sigma$. Starting from initial values of $\alpha$ and $\sigma^2$, the estimates $\alpha_{MP}$ and $\sigma^2_{MP}$ are approached by repeatedly applying the updates in formula (15). After enough iterations, most $\alpha_i$ tend to infinity, so the corresponding weights $w_i$ become zero, while the remaining $\alpha_i$ stay finite. The training vectors $x_i$ associated with these finite $\alpha_i$ are referred to as the relevance vectors.
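To make the iteration in formula (15) more concrete, here is a small Python sketch of the design matrix from formula (3) and of one re-estimation pass. It is only an illustration: the names `gaussian_kernel`, `design_matrix` and `reestimate` are hypothetical, the scaling of the Gaussian kernel by `width` is an assumption (the paper does not state the exact form), and the posterior mean and covariance from formulas (12) and (13) are taken as given inputs rather than computed, since that step needs a matrix inverse.

```python
import math


def gaussian_kernel(x, y, width):
    """Gaussian kernel; the exact scaling by `width` is an assumption."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (width ** 2))


def design_matrix(xs, width):
    """Rows [1, K(x_n, x_1), ..., K(x_n, x_N)] as in formula (3)."""
    return [[1.0] + [gaussian_kernel(xn, xm, width) for xm in xs] for xn in xs]


def reestimate(alpha, sigma_diag, mu, residual_sq, n_samples):
    """One pass of the updates in formula (15).

    alpha: current hyper-parameters (one per weight, bias included).
    sigma_diag: diagonal entries of the posterior covariance.
    mu: posterior mean weights (assumed non-zero here).
    residual_sq: squared norm of t minus the design matrix times mu.
    """
    gamma = [1.0 - a * s for a, s in zip(alpha, sigma_diag)]
    new_alpha = [g / (m ** 2) for g, m in zip(gamma, mu)]
    new_sigma2 = residual_sq / (n_samples - sum(gamma))
    return new_alpha, new_sigma2, gamma


# Two 1-D training points give a 2 x 3 design matrix (bias + 2 kernels).
phi = design_matrix([[0.0], [1.0]], width=1.0)

# Made-up posterior statistics, for illustration only.
new_alpha, new_sigma2, gamma = reestimate(
    alpha=[1.0, 2.0, 4.0],
    sigma_diag=[0.5, 0.25, 0.125],
    mu=[1.0, 0.5, 0.5],
    residual_sq=0.5,
    n_samples=2,
)
```

A full implementation would alternate this step with a recomputation of the posterior mean and covariance, and would prune weights whose hyper-parameter grows very large; those parts are omitted here.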

2.6. Performance Evaluation

To evaluate the feasibility and effectiveness of the proposed method, we calculated five measures: Accuracy (Ac), Sensitivity (Sn), Specificity (Sp), Precision (Pe) and Matthews’s correlation coefficient (Mcc) [19, 52-59]. They are defined as follows:

$$Ac = \frac{TP + TN}{TP + FP + TN + FN} \tag{16}$$

$$Sn = \frac{TP}{TP + FN} \tag{17}$$

$$Sp = \frac{TN}{FP + TN} \tag{18}$$

$$Pe = \frac{TP}{FP + TP} \tag{19}$$

$$Mcc = \frac{(TP \times TN) - (FP \times FN)}{\sqrt{(TP + FN) \times (TN + FP) \times (TP + FP) \times (TN + FN)}} \tag{20}$$

where TP, FP, TN and FN denote the numbers of true positives, false positives, true negatives and false negatives, respectively. True positives are true interacting pairs that are correctly predicted. True negatives are true non-interacting pairs that are correctly predicted, false positives are true non-interacting pairs that are falsely predicted to be interacting, and false negatives are true interacting pairs that are falsely predicted to be non-interacting. Moreover, a Receiver Operating Characteristic (ROC) curve was created to evaluate the performance of the proposed method.
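The five measures can be computed directly from the four confusion-matrix counts. The sketch below is a minimal illustration with a hypothetical function name `evaluate`; sensitivity is computed with the standard definition TP / (TP + FN), and the counts in the example are made up rather than taken from the paper.

```python
import math


def evaluate(tp, fp, tn, fn):
    """Return Ac, Sn, Sp, Pe and Mcc from confusion-matrix counts."""
    ac = (tp + tn) / (tp + fp + tn + fn)
    sn = tp / (tp + fn)          # sensitivity (recall on positives)
    sp = tn / (tn + fp)          # specificity (recall on negatives)
    pe = tp / (tp + fp)          # precision
    denom = math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom
    return {"Ac": ac, "Sn": sn, "Sp": sp, "Pe": pe, "Mcc": mcc}


# Made-up counts for illustration only (not results from the paper).
scores = evaluate(tp=80, fp=10, tn=900, fn=20)
```

With strongly imbalanced classes, as in the SIP datasets, accuracy can stay high even when sensitivity is low, which is why Mcc is reported alongside it.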

3. Results and Discussion


3.1. Performance of the proposed method


Computational experiments were performed on the yeast and human datasets. To avoid over-fitting and to verify the validity and stability of the proposed RVM-AB prediction model, each dataset was divided into a training set and an independent test set. More specifically, 1/6 of each dataset was randomly selected as the independent test set and the remainder was used as the training set. In statistical prediction, three cross-validation methods are commonly used to assess the effectiveness of a predictor: the independent dataset test, the subsampling test, and the jackknife test [26]. Of these, the jackknife test always yields a unique result for a given benchmark dataset and has been widely recognized and increasingly used by investigators to examine the quality of various predictors [38-41]. In our experiment, however, fivefold cross-validation was used to benchmark the performance of the proposed model in order to reduce the computational time. The RVM-AB prediction model has several parameters that should be optimized; to ensure fairness, they were set to the same values for the yeast and human datasets. The Gaussian function was chosen as the kernel function with three parameters: width=2.8, initapla=1/N and beta=0, where width is the width of the Gaussian function, N is the number of training samples, and beta indicates whether the model is used for classification or regression. The results of the proposed RVM-AB model on the yeast and human datasets are shown in Tables 1 and 2. Table 1 shows that on the yeast dataset the proposed method achieved average Accuracy, Sensitivity, Precision, and Mcc of 93.01%, 59.50%, 74.58%, and 65.21%, with standard deviations of 0.8%, 4.3%, 4.1%, and 3.9%, respectively.

Similarly good results were obtained on the human dataset, with average Accuracy, Sensitivity, Precision, and Mcc of 97.72%, 81.66%, 89.52%, and 84.60% and standard deviations of 0.4%, 4.0%, 1.3%, and 2.6%, respectively. Tables 1 and 2 indicate that the proposed method is accurate, robust, and effective for predicting SIPs. The good prediction results may be attributed to the choice of feature extraction method and classifier. The feature extraction method has three main strengths: (1) The PSSM is a very useful representation of a protein sequence: it not only describes the order information but also retains sufficient prior information about the sequence. As a result, each protein sequence can be represented as a PSSM that contains the useful information for predicting PPIs. (2) Average Blocks builds on the observation that the residue conservation tendencies in the same domain family are similar and that the locations of domains in the same family are closely related to the length of the sequence [35]. As a result, it further improves the performance of the proposed prediction model. (3) While preserving the integrity of the information in the feature vector, the dimension of each feature vector is reduced using Principal Component Analysis (PCA) to reduce the influence of noise. Thus, the experimental results demonstrate that the feature vector extracted from the PSSM using Average Blocks combined with PCA is very suitable for SIP prediction.

Table 1. Fivefold cross-validation results obtained by using the proposed method on the yeast dataset

Testing set    Ac (%)       Sn (%)       Pe (%)       Mcc (%)
1              94.21        62.16        79.31        69.79
2              93.05        62.31        77.88        67.99
3              91.89        57.60        69.90        62.14
4              92.57        52.68        71.08        60.10
5              93.34        62.71        74.75        67.02
Average        93.01±0.8    59.50±4.3    74.58±4.1    65.21±3.9

Table 2. Fivefold cross-validation results obtained by using the proposed method on the human dataset

Testing set    Ac (%)       Sn (%)       Pe (%)       Mcc (%)
1              97.55        80.85        87.96        83.41
2              97.27        78.23        88.58        82.23
3              97.65        78.97        90.64        83.69
4              98.31        88.33        90.99        88.92
5              97.82        81.94        89.42        84.75
Average        97.72±0.4    81.66±4.0    89.52±1.3    84.60±2.6
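The fivefold protocol behind Tables 1 and 2 can be sketched as follows. This is a simplified illustration, not the authors' pipeline: the function names `five_fold_indices` and `summarize` are hypothetical, the random shuffling with a fixed seed is an assumption, and the use of the sample standard deviation is also an assumption, since the paper does not say which estimator produced the ± values.

```python
import random
import statistics


def five_fold_indices(n_samples, n_folds=5, seed=0):
    """Shuffle sample indices and deal them into n_folds test folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[f::n_folds] for f in range(n_folds)]


def summarize(values):
    """Mean and sample standard deviation of per-fold scores."""
    return statistics.mean(values), statistics.stdev(values)


# Each fold serves once as the test set; the other four form the training set.
folds = five_fold_indices(23)

# Per-fold accuracies (%) on yeast, copied from Table 1.
mean_ac, sd_ac = summarize([94.21, 93.05, 91.89, 92.57, 93.34])
```

Rounded to two decimals, the mean of the five yeast accuracies reproduces the 93.01% reported in Table 1.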

3.2. Comparison with the SVM-based Method

Although the proposed method achieved good predictive performance, to further assess the proposed classifier we compared the prediction accuracy of the RVM classifier with that of the state-of-the-art support vector machine (SVM) classifier on the yeast and human datasets using the same feature extraction method (AB). The LIBSVM tool [60] was employed to perform classification with SVM. The SVM classifier has several parameters that need to be optimized. Here, we selected a radial basis function (RBF) as the kernel function of SVM. The RBF kernel parameters were optimized using a grid search and were set to c=0.1 and g=0.1. The results of RVM and SVM on the yeast and human datasets are shown in Table 3 and Table 4 respectively, and the corresponding ROC curves are compared in Figure 1 and Figure 2. As Table 3 shows, the SVM classifier obtained 90.27% average Accuracy, 21.11% average Sensitivity, 79.07% average Precision, and 39.71% average Mcc on the yeast dataset, whereas the RVM classifier achieved 93.01% average Accuracy, 59.50% average Sensitivity, 74.58% average Precision, and 65.21% average Mcc. Likewise, Table 4 shows that the RVM classifier achieved 97.72% average Accuracy, 81.66% average Sensitivity, 89.52% average Precision, and 84.60% average Mcc on the human dataset, whereas the SVM classifier obtained 94.77% average Accuracy, 45.43% average Sensitivity, 82.77% average Precision, and 60.20% average Mcc. These results show that the RVM classifier performs significantly better than the SVM classifier in Accuracy, Sensitivity and Mcc. Similarly, as shown in Figure 1 and Figure 2, the ROC curves of the RVM classifier are also significantly better than those of the SVM classifier. This shows that the RVM classifier is an accurate and robust classifier for predicting SIPs. The better classification performance of the RVM classifier may be explained by two reasons: (1) the number of kernel function evaluations is greatly reduced in the RVM classifier; (2) the kernel functions of the SVM classifier must satisfy Mercer's condition, a restriction that the RVM classifier does not have. As a result, the proposed prediction model can obtain higher accuracy for detecting SIPs.

Table 3. Fivefold cross-validation results obtained by using our proposed method and SVM on the yeast dataset

Model          Testing set    Ac (%)       Sn (%)       Pe (%)       Mcc (%)
RVM+PSSM+AB    1              94.21        62.16        79.31        69.79
               2              93.05        62.31        77.88        67.99
               3              91.89        57.60        69.90        62.14
               4              92.57        52.68        71.08        60.10
               5              93.34        62.71        74.75        67.02
               Average        93.01±0.8    59.50±4.3    74.58±4.1    65.21±3.9
SVM+PSSM+AB    1              90.35        17.12        70.37        34.33
               2              89.58        22.31        80.56        41.12
               3              89.86        19.20        85.71        39.11
               4              91.22        22.32        86.21        42.47
               5              90.35        24.58        72.50        41.53
               Average        90.27±0.6    21.11±2.9    79.07±7.4    39.71±3.2

[Figure: ROC curves (Sensitivity versus 1 - Specificity) for RVM+PSSM+AB and SVM+PSSM+AB on yeast]

Figure 1. Comparison of ROC curves between RVM and SVM on the yeast dataset.

Table 4. Fivefold cross-validation results obtained by using our proposed method and SVM on the human dataset

Model          Testing set    Ac (%)       Sn (%)       Pe (%)       Mcc (%)
RVM+PSSM+AB    1              97.55        80.85        87.96        83.41
               2              97.27        78.23        88.58        82.23
               3              97.65        78.97        90.64        83.69
               4              98.31        88.33        90.99        88.92
               5              97.82        81.94        89.42        84.75
               Average        97.72±0.4    81.66±4.0    89.52±1.3    84.60±2.6
SVM+PSSM+AB    1              94.68        42.98        83.47        58.80
               2              94.75        47.58        84.29        62.13
               3              94.96        48.50        81.29        61.75
               4              94.75        46.67        82.35        60.89
               5              94.72        41.41        82.46        57.41
               Average        94.77±0.1    45.43±3.0    82.77±1.1    60.20±2.0

[Figure: ROC curves (Sensitivity versus 1 - Specificity) for RVM+PSSM+AB and SVM+PSSM+AB on the human dataset]

Figure 2. Comparison of ROC curves between RVM and SVM on the human dataset.

3.3. Comparison with Other Methods


To demonstrate the effectiveness of the proposed method, the final model, RVM-AB, was also compared with three existing SIP predictors, SLIPPER [9], CRS [33] and SPAR [33], and three PPI predictors, DXECPPI [61], PPIevo [62] and LocFuse [63], on the yeast and human datasets. The results of these methods are listed in Table 5 and Table 6. Table 5 shows that on the yeast dataset the average accuracy of the final model is clearly higher than that of the other six methods. Its specificity and MCC are also the highest, while its sensitivity (59.49%) is comparable to that of CRS and PPIevo and lower than that of SLIPPER. Similarly, as shown in Table 6, the accuracy, specificity and MCC of the final model on the human dataset are significantly better than those of the six other methods. Tables 5 and 6 therefore indicate that the overall performance of the proposed model is superior to that of the six existing methods. All experimental results show that RVM-AB improves the prediction accuracy compared with the existing approaches considered here. Thanks to the use of a good classifier and a novel feature extraction method, the proposed model obtained good prediction results, which further shows that the proposed method is suitable for SIP prediction.

Table 5. Predicting ability of different methods on the yeast dataset

Model              Ac (%)    Sp (%)    Sn (%)    MCC
SLIPPER [9]        71.90     72.18     69.72     0.2842
DXECPPI [61]       87.46     94.93     29.44     0.2825
PPIevo [62]        66.28     87.46     60.14     0.1801
LocFuse [63]       66.66     68.10     55.49     0.1577
CRS [33]           72.69     74.37     59.58     0.2368
SPAR [33]          76.96     80.02     53.24     0.2484
Proposed method    93.01     97.35     59.49     0.6521

Table 6. Predicting ability of different methods on the human dataset

Model              Ac (%)    Sp (%)    Sn (%)    MCC
SLIPPER [9]        91.10     95.06     47.26     0.4197
DXECPPI [61]       30.90     25.83     87.08     0.0825
PPIevo [62]        78.04     25.82     87.83     0.2082
LocFuse [63]       80.66     80.50     50.83     0.2026
CRS [33]           91.54     96.72     34.17     0.3633
SPAR [33]          92.09     97.40     33.33     0.3836
Proposed method    97.72     99.15     81.66     0.8460

4. Conclusion


In this paper, based on sequence information to detect SIPs,a novel computational method is developed which is called RVM-AB. The RVM-AB model was created by combining RVM classifier with Average Blocks and Position Specific Scoring Matrix. Experimental results obtained from the proposed model on yeast and human datasets demonstrated that the prediction accuracy is significantly higher than that of the method based on SVM classifier and other exiting approaches. The major improvements of the proposed method may be attributed to as the following reasons (1) the employment of an effective feature extraction method that can ensure the residue conservation tendencies in the same domain family are similar and the locations of domains in the same family are closely related to the length of the sequence. This method can capture useful evolutionary information to improve performance efficiency. (2) PCA can integrate the useful information and reduce the influence of noise, which offer some help for improving the prediction accuracy. (3)The RVM classifier model used by the proposed model is very suitable for predicting SIPs. In conclusion, the RVM-AB is an efficient, reliable, and powerful prediction model and can be a useful tool for future proteomics research. For the future study, more effective feature extraction methods and machine learning techniques will be explored to detect SIPs. As pointed out in [64] that user-friendly and publicly accessible web-servers represent the future direction for developing practically more useful methods. In future work, we shall make efforts to provide a user-friendly web-server for predicting SIPs [12, 13, 16, 19, 24, 65]. 
Author Contributions: Jing-Xuan Zhai and Ji-Yong An conceived the algorithm, carried out analyses, prepared the data sets, carried out experiments, and wrote the manuscript; Tian-Jie Cao and Yong-Tao Bian designed, performed, and analyzed experiments and wrote the manuscript. All authors read and approved the final manuscript.

Conflicts of Interest: The authors declare no conflict of interest.


References

1. Baisamy, L., N. Jurisch, and D. Diviani, Leucine zipper-mediated homo-oligomerization regulates the Rho-GEF activity of AKAP-Lbc. Journal of Biological Chemistry, 2005. 280(15): p. 15405-12.

2. Hattori, T., et al., C/EBP family transcription factors are degraded by the proteasome but stabilized by forming dimer. Oncogene, 2003. 22(9): p. 1273-80.

3. Katsamba, P., et al., Linking molecular affinity and cellular specificity in cadherin-mediated adhesion. Proceedings of the National Academy of Sciences of the United States of America, 2009. 106(28): p. 11594-9.

4. Koike, R., A. Kidera, and M. Ota, Alteration of oligomeric state and domain architecture is essential for functional transformation between transferase and hydrolase with the same scaffold. Protein Science, 2009. 18(10): p. 2060-2066.

5. Woodcock, J.M., et al., The dimeric versus monomeric status of 14-3-3ζ is controlled by phosphorylation of Ser58 at the dimer interface. Journal of Biological Chemistry, 2003. 278(38): p. 36323-36327.

6. Miller, S., et al., The accessible surface area and stability of oligomeric proteins. Nature, 1987. 328(6133): p. 834-6.

7. Pitre, S., et al., PIPE: a protein-protein interaction prediction engine based on the re-occurring short polypeptide sequences between known interacting protein pairs. BMC Bioinformatics, 2006. 7(10): p. 763-769.

8. Xia, J.F., K. Han, and D.S. Huang, Sequence-based prediction of protein-protein interactions by means of rotation forest and autocorrelation descriptor. Protein & Peptide Letters, 2010. 17(1): p. 137-45.

9. Liu, Z., et al., Proteome-wide prediction of self-interacting proteins based on multiple properties. Molecular & Cellular Proteomics, 2013. 12(6): p. 1689-1700.

10. Jia, J., et al., iPPI-Esml: An ensemble classifier for identifying the interactions of proteins by incorporating their physicochemical properties and wavelet transforms into PseAAC. Journal of Theoretical Biology, 2015. 377: p. 47.

11. Liu, L.M., Y. Xu, and K.C. Chou, iPGK-PseAAC: identify lysine phosphoglycerylation sites in proteins by incorporating four different tiers of amino acid pairwise coupling information into the general PseAAC. Medicinal Chemistry, 2017.

12. Feng, P., et al., iRNA-PseColl: Identifying the Occurrence Sites of Different RNA Modifications by Incorporating Collective Effects of Nucleotides into PseKNC. Molecular Therapy Nucleic Acids, 2017. 7: p. 155-163.

13. Liu, B., Y. Fan, and K.C. Chou, 2L-piRNA: A Two-Layer Ensemble Classifier for Identifying Piwi-Interacting RNAs and Their Function. Molecular Therapy Nucleic Acids, 2017. 7(C): p. 267.

14. Jia, J., et al., iCar-PseCp: identify carbonylation sites in proteins by Monte Carlo sampling and incorporating sequence coupled effects into general PseAAC. Oncotarget, 2016. 7(23): p. 34558-34570.

15. Qiu, W.R., et al., iHyd-PseCp: Identify hydroxyproline and hydroxylysine in proteins by incorporating sequence-coupled effects into general PseAAC. Oncotarget, 2016. 7(28): p. 44310-44321.

16. Chen, W., et al., iRNA-AI: identifying the adenosine to inosine editing sites in RNA sequences. Oncotarget, 2017. 8(3): p. 4208.

17. Cheng, X., et al., iATC-mHyb: a hybrid multi-label classifier for predicting the classification of anatomical therapeutic chemicals. Oncotarget, 2017.

18. Liu, B., et al., Pse-Analysis: a python package for DNA/RNA and protein/peptide sequence analysis based on pseudo components and kernel methods. Oncotarget, 2017. 8(8): p. 13338-13343.

19. Qiu, W.R., et al., iRNAm5C-PseDNC: identifying RNA 5-methylcytosine sites by incorporating physical-chemical properties into pseudo dinucleotide composition. Oncotarget, 2017. 8(25): p. 41178.

20. Qiu, W.R., et al., iPTM-mLys: identifying multiple lysine PTM sites and their different types. Bioinformatics, 2016. 32(20): p. 3116.

21. Qiu, W.R., et al., iPhos-PseEn: Identifying phosphorylation sites in proteins by fusing different pseudo components into an ensemble classifier. Oncotarget, 2016. 7(32): p. 51270-51283.

22. Su, Q., et al., Prediction of the aquatic toxicity of aromatic compounds to tetrahymena pyriformis through support vector regression. Oncotarget, 2017.

23. Wei, C., et al., iRNA-PseU: Identifying RNA pseudouridine sites. Molecular Therapy Nucleic Acids, 2016. 5(7): p. e332.

24. Xiang, C., et al., iATC-mISF: a multi-label classifier for predicting the classes of anatomical therapeutic chemicals. Bioinformatics, 2016.

25. Xu, Y., et al., iPreny-PseAAC: identify C-terminal cysteine prenylation sites in proteins by incorporating two tiers of sequence couplings into PseAAC. Medicinal Chemistry, 2017. 13(999): p. 1-1.

26. Chou, K.C., Some remarks on protein attribute prediction and pseudo amino acid composition. Journal of Theoretical Biology, 2011. 273(1): p. 236-47.

27. UniProt Consortium, UniProt: a hub for protein information. Nucleic Acids Research, 2014. 43(D1): p. D204-12.

28. Salwinski, L., et al., The Database of Interacting Proteins: 2004 update. Nucleic Acids Research, 2004. 32(22): p. D449-D451.

29. Chatr-Aryamontri, A., et al., The BioGRID interaction database: 2015 update. Nucleic Acids Research, 2015. 43(Database issue): p. 470-8.

30. Orchard, S., et al., The MIntAct project--IntAct as a common curation platform for 11 molecular interaction databases. Nucleic Acids Research, 2014. 42: p. 358-63.

31. Breuer, K., et al., InnateDB: Systems biology of innate immunity and beyond - Recent updates and continuing curation. Nucleic Acids Research, 2012. 41(Database issue): p. D1228-D1233.

32. Launay, G., et al., MatrixDB, the extracellular matrix interaction database: updated content, a new navigator and expanded functionalities. Nucleic Acids Research, 2015. 43(D1): p. 321-7.

33. Liu, X., et al., SPAR: a random forest-based predictor for self-interacting proteins with fine-grained domain information. Amino Acids, 2016: p. 1-11.

34. Chou, K.C., Impacts of bioinformatics to medicinal chemistry. Medicinal Chemistry, 2015. 11(3): p. 218.

35. Chou, K.C., Using amphiphilic pseudo amino acid composition to predict enzyme subfamily classes. Bioinformatics, 2005. 21(1): p. 10-9.

36. Zhong, W.Z. and S.F. Zhou, Molecular Science for Drug Development and Biomedicine. International Journal of Molecular Sciences, 2014. 15(11): p. 20072.

37. Zhou, G.P. and W.Z. Zhong, Perspectives in Medicinal Chemistry. Current Topics in Medicinal Chemistry, 2016. 16(4): p. 381.

38. Dehzangi, A., et al., Gram-positive and Gram-negative protein subcellular localization by incorporating evolutionary-based descriptors into Chou's general PseAAC. Journal of Theoretical Biology, 2015. 364: p. 284.

39. Khan, M., et al., Unb-DPC: Identify mycobacterial membrane protein types by incorporating un-biased dipeptide composition into Chou's general PseAAC. Journal of Theoretical Biology, 2017. 415: p. 13-19.

40. Meher, P.K., et al., Predicting antimicrobial peptides with improved accuracy by incorporating the compositional, physico-chemical and structural features into Chou's general PseAAC. Scientific Reports, 2017. 7.

41. Rahimi, M., M.R. Bakhtiarizadeh, and A. Mohammadisangcheshmeh, OOgenesis_Pred: A sequence-based method for predicting oogenesis proteins by six different modes of Chou's pseudo amino acid composition. Journal of Theoretical Biology, 2017. 414: p. 128-136.

42. Chou, K.C., An unprecedented revolution in medicinal chemistry driven by the progress of biological science. Current Topics in Medicinal Chemistry, 2017. 17: p. 2337-2358.

43. Liu, B., H. Wu, and K.C. Chou, Pse-in-One 2.0: An Improved Package of Web Servers for Generating Various Modes of Pseudo Components of DNA, RNA, and Protein Sequences. Natural Science, 2017. 09(4): p. 67-91.

44. Liu, B., et al., Pse-in-One: a web server for generating various modes of pseudo components of DNA, RNA, and protein sequences. Nucleic Acids Research, 2015. 43(Web Server issue): p. W65-W71.

45. Gribskov, M., A.D. McLachlan, and D. Eisenberg, Profile analysis: detection of distantly related proteins. Proceedings of the National Academy of Sciences of the United States of America, 1987. 84(13): p. 4355-8.

46. Altschul, S.F. and E.V. Koonin, Iterated profile searches with PSI-BLAST--a tool for discovery in protein databases. Trends in Biochemical Sciences, 1998. 23(11): p. 444-447.

47. Georgiou, D.N., T.E. Karakasidis, and A.C. Megaritis, A short survey on genetic sequences, Chou's pseudo amino acid composition and its combination with fuzzy set theory. Maternal & Child Health Care of China, 2013. 7(1): p. 41-48.

48. Georgiou, D.N., et al., A study of entropy/clarity of genetic sequences using metric spaces and fuzzy sets. Journal of Theoretical Biology, 2010. 267(1): p. 95-105.

49. Jeong, J.C., X. Lin, and X.W. Chen, On position-specific scoring matrix for protein function prediction. IEEE/ACM Transactions on Computational Biology & Bioinformatics, 2011. 8(2): p. 308-315.

50. Du, Q., et al., Amino Acid Principal Component Analysis (AAPCA) and its Applications in Protein Structural Class Prediction. Journal of Biomolecular Structure & Dynamics, 2006. 23(6): p. 635.

51. Tipping, M.E., Sparse bayesian learning and the relevance vector machine. Journal of Machine Learning Research, 2001. 1(3): p. 211-244.

52. Chen, W., et al., iRSpot-PseDNC: identify recombination spots with pseudo dinucleotide composition. Nucleic Acids Research, 2013. 41(6): p. e68.

53. Feng, P., et al., iRNA-PseColl: Identifying the Occurrence Sites of Different RNA Modifications by Incorporating Collective Effects of Nucleotides into PseKNC. Mol Ther Nucleic Acids, 2017. 7(C): p. 155-163.

54. Lin, H., et al., iPro54-PseKNC: a sequence-based predictor for identifying sigma-54 promoters in prokaryote with pseudo k-tuple nucleotide composition. Nucleic Acids Research, 2014. 42(21): p. 12961-72.

55. Liu, B., et al., iRSpot-EL: identify recombination spots with an ensemble learning approach. Bioinformatics, 2017. 33(1): p. 35-41.

56. Qiu, W.R., et al., iPhos-PseEvo: Identifying Human Phosphorylated Proteins by Incorporating Evolutionary Information into General PseAAC via Grey System Theory. Molecular Informatics, 2017.

57. Wei, C., et al., iRNA-AI: identifying the adenosine to inosine editing sites in RNA sequences. Oncotarget, 2017. 8(3): p. 4208.

58. Xu, Y., et al., iSNO-PseAAC: predict cysteine S-nitrosylation sites in proteins by incorporating position specific amino acid propensity into pseudo amino acid composition. PLoS ONE, 2013. 8(2): p. e55844.

59. Yan, X., et al., iSNO-AAPair: incorporating amino acid pairwise coupling into PseAAC for predicting cysteine S-nitrosylation sites in proteins. PeerJ, 2013. 1: p. e171.

60. Chang, C.C. and C.J. Lin, LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems & Technology, 2011. 2(3): p. 389-396.

61. Du, X., et al., A Novel Feature Extraction Scheme with Ensemble Coding for Protein–Protein Interaction Prediction. International Journal of Molecular Sciences, 2014. 15(7): p. 12731-49.

62. Zahiri, J., et al., PPIevo: Protein-Protein Interaction Prediction from PSSM Based Evolutionary Information. Genomics, 2013. 102(4): p. 237-42.

63. Zahiri, J., et al., LocFuse: Human protein–protein interaction prediction via classifier fusion using protein localization information. Genomics, 2014. 104(6): p. 496-503.

64. Chou, K.C. and H.B. Shen, Review: Recent advances in developing web-servers for predicting protein attributes. Natural Science, 2009. 1(2): p. 63-92.

65. Liu, B., et al., iRSpot-EL: identify recombination spots with an ensemble learning approach. Bioinformatics, 2017. 33(1): p. 35.