Computational prediction of protein ubiquitination sites mapping on Arabidopsis thaliana

Computational prediction of protein ubiquitination sites mapping on Arabidopsis thaliana

Journal Pre-proof Computational Prediction of Protein Ubiquitination Sites Mapping on Arabidopsis thaliana Md. Parvez Mosharaf, Md. Mehedi Hassan, Fee...

2MB Sizes 0 Downloads 67 Views

Journal Pre-proof Computational Prediction of Protein Ubiquitination Sites Mapping on Arabidopsis thaliana Md. Parvez Mosharaf, Md. Mehedi Hassan, Fee Faysal Ahmed, Mst. Shamima Khatun, Mohammad Ali Moni, Md. Nurul Haque Mollah

PII:

S1476-9271(18)30965-4

DOI:

https://doi.org/10.1016/j.compbiolchem.2020.107238

Reference:

CBAC 107238

To appear in:

Computational Biology and Chemistry

Received Date:

16 December 2018

Revised Date:

22 January 2020

Accepted Date:

18 February 2020

Please cite this article as: Parvez Mosharaf M, Hassan MM, Ahmed FF, Khatun MS, Moni MA, Haque Mollah MN, Computational Prediction of Protein Ubiquitination Sites Mapping on Arabidopsis thaliana, Computational Biology and Chemistry (2020), doi: https://doi.org/10.1016/j.compbiolchem.2020.107238

This is a PDF file of an article that has undergone enhancements after acceptance, such as the addition of a cover page and metadata, and formatting for readability, but it is not yet the definitive version of record. This version will undergo additional copyediting, typesetting and review before it is published in its final form, but we are providing this version to give early visibility of the article. Please note that, during the production process, errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain. © 2019 Published by Elsevier.

Computational Prediction of Protein Ubiquitination Sites Mapping on Arabidopsis thaliana

of

Md. Parvez Mosharaf1, Md. Mehedi Hassan2, Fee Faysal Ahmed1, 3, Mst. Shamima Khatun2, Mohammad Ali Moni4 and Md. Nurul Haque Mollah1* [email protected]

1

Bioinformatics Lab., Department of Statistics, Rajshahi University, Rajshahi-6205, Bangladesh. Department of Bioscience and Bioinformatics, Kyushu Institute of Technology, 680-4 Kawazu, Iizuka, Fukuoka 8208502, Japan. 3 Department of Mathematics, Jashore University of Science and Technology, Jashore, Bangladesh 4 The University of Sydney, Sydney Medical School, School of Medical Sciences, Discipline of Biomedical Science, Sydney, New South Wales, Australia.

-p

ro

2

re

*Correspondence: Md. Nurul Haque Mollah

Jo

ur na

lP

Graphical abstract

of ro -p re lP

Highlights

Ubiquitination sites prediction CKSAAP encoding scheme Random Forest Classification Arabidopsis thaliana

Jo

   

ur na

The working flowchart of the proposed in silico prediction method is represented here. The various sample ratios were considered to train the supervised learning algorithm. To evaluate the model performance cross validation test and independent test dataset were used.

Abstract

of

Among the protein post-translational modifications (PTMs), ubiquitination is considered as one of the most significant processes which can regulate the cellular functions and various diseases. Identification of ubiquitination sites becomes important for understanding the mechanisms of ubiquitination-related biological processes. Both experimental and computational approaches are available for identifying ubiquitination sites based on protein sequences of different species. The experimental approaches are time-consuming, laborious and costly. In silico prediction is an alternative time saving, easier and cost-effective approach for identifying ubiquitination sites. Moreover, the sequence patterns in the different species around the ubiquitination sites are not similar which demands species-specific predictors. Therefore, in this study, we have proposed a novel computational method for identifying ubiquitination sites based on protein sequences of A. thaliana species which will be robust against outlying observations also. Through the comparative study of two encoding schemes and three classifiers, the random forest (RF) based predictor was selected as the best predictor under the CKSAAP encoding scheme with 1:1 ratio of positive and negative samples (i.e. ubiquitinated and non-ubiquitinated) in training dataset. The proposed predictor produced the area under the ROC curve (AUC score) as 0.91 and 0.86 for 5-fold cross-validation test with the training dataset and the independent test dataset of A. thaliana respectively. The proposed RF based predictor also performed much better than the other existing ubiquitination sites predictors for A. thaliana. Keywords: Arabidopsis thaliana species, Protein sequences, Ubiquitination sites, CKSAAP encoding, Random forest.

ro

Data Availability: The Computational Codes and instructions for implementing the proposed methodology can be downloaded at http://www.bbcba.org/softwares/ArabidopsisUbq.zip

-p

1. Introduction

ur na

lP

re

Ubiquitination is one of the most critical and significant protein post-translational modification (PTM) which can regulate the cellular functions. It is also an enzymatic process, in which a substrate protein is attached by a small regulatory protein (called ubiquitin) (Welchman et al., 2005, Herrmann et al., 2007). In the ubiquitination process, the small regulatory protein ubiquitin, either a single ubiquitin or chain of ubiquitin’s adjoin to targeted lysine (K) residues on the protein substrate to be degraded. This process work in three steps, they are activation, conjugation and ligation. Activation is performed by ubiquitin activating enzymes (E1s), ubiquitin conjugating enzymes (E2s) complete the conjugation, and remain ligation is done by ubiquitin ligases (E3s) enzymes (Welchman et al., 2005, Herrmann et al., 2007, Tung et al., 2008). It is known that protein post-translational modification on any cell is highly correlated with lots of biological processes and intimately engaged with different kinds of diseases. As a post-translational modification, ubiquitination is strongly related to various complex biological processes and diseases in plants and animals. Distinct kinds of significant regulatory functions and relevant disorders because of ubiquitination have been found, for instances, hypersensitive response, proteasomal degradation and down regulation, endocytosis and sorting, transcription and DNA repairing, signal transduction, and Alzheimer, Parkinson, infectious diseases, cancers etc. In the biological processes, these protein regulation functions are considered as important as other regulation functions (Welchman et al., 2005, Herrmann et al., 2007, Tung et al., 2008, Walsh et al., 2014).

Jo

It is clear that ubiquitination has significant roles in protein regulation function in the biological system; large scale research initiatives have been conducted successfully to disclose the molecular characteristics and regulatory roles of ubiquitination process through the entire biological processes. For this purpose, to identified the ubiquitination sites in a ubiquitinated protein, different experimental procedures have been devoted, for example ubiquitin antibodies and ubiquitin binding proteins, high-throughput Mass Spectrometry (MS) techniques, combinations of liquid chromatography and mass spectrometry (Xu et al., 2010, Kim et al., 2011, Kirkpatrick, 2005, Peng, 2003, Wagner, 2011). Because of the dynamic, rapid and reversible properties of ubiquitination process, the existing experimental methods become highly expensive, labor-intensive as well as time-consuming. On the other hand, species-specific different computational (in silico) approaches have been developed to identify the ubiquitination sites; those are considered as alternative time saving, easier and cost-effective approaches. Different computational approaches are developed using distinguish methods, features and properties (Chen J et al. 2019; He W et al., 2018; Chen Z et al., 2018;

Fei H et al., 2018; Wang J R et al., 2017; Cai and Jiang, 2016;, Chen Z et al., 2016; Zhou Y et al., 2013; Radivojac et al., 2010; Chen et al., 2011; Chen et al., 2013; Tung and Ho, 2008).

of

In different species, the pattern of ubiquitination sites are not conserved (Kumar and Vellaichamy, 2019; Zhen et al., 2014; Chen Z et al., 2014). For this reason, the existing ubiquitination sites predictors are not appropriate to predict the multi-species ubiquitination sites properly (Chen Z et al., 2014) for different organisms. Because, the predictors were trained by training datasets from a specific organism such as yeast (Saccharomyces cerevisiae), Homo sapiens, Mus musculus, but perform well for that specific organism. In earlier study, existing predictors were implemented in an A. thaliana dataset and shown that the performance was similar to random prediction due to the absent of A. thaliana protein sequence in training dataset (Chen Z et al., 2014). So, it was necessary demand to develop species-specific ubiquitination sites predictor to gain the suitable prediction. However, in case of A. thaliana species, only one predictor has been developed (Chen J et al. 2019). Though the existing predictor predict well, but there is still had some space to improve. Therefore, in this study, an attempt is made to develop a novel computational method for identifying ubiquitination sites based on protein sequences of A. thaliana species.

-p

ro

To develop the novel predictor, we considered three different popular classifiers (RF, SVM and NB) for comparative study on the prediction of ubiquitination sites using the experimentally identified ubiquitinated protein sequences of A. thaliana. The composition of k-spaced amino acid pairs (CKSAAP) was selected as a better encoding scheme through a comparative study to develop the proposed predictor. The proposed RF based predictor showed the better performance measured by ROC curves and AUC scores for both training and independent test datasets. We provide the whole development procedure in details for the new proposed method in the next section.

lP

re

2. Methods and Materials In brief, for predicting ubiquitination sites, our method is an RF-based prediction method. It was developed using the comparative study of the two consecutive sequence encoding features (CKSAAP and binary encoding method), comparative study of three different classifications (i.e. SVM: Support Vector Machine, NB: Naïve Bayes Classifier and RF: Random Forest Classifier) methods on the basis of optimum window size 27. The working flowchart of this proposed prediction method is shown in (Fig. 1). After exploring two types of encoding methods the CKSAAP encoding method based RF classifier was utilized to build the proposed predictor. After optimizing the parameter and evaluate the performance, the best model was proposed.

Jo

ur na

2.1 Datasets Arrangement In this current study, ubiquitinated protein sequence training dataset were used to train the models and check the performances in independent test dataset accordingly. The ubiquitinated protein sequences were arranged from the previously published paper (Walton et al., (2016).Plant Cell 10.1105/tpc.15.00878) where the experimentally validated ubiquitination sites (lysine residues) were reported. The ubiquitination sites annotations were extracted from the UniProtKB/Swiss-Prot and NCBI protein sequence database (http://www.ncbi.nlm.nih.gov/ protein/) regarding the model plant A. thaliana. Among the total 1607 downloaded protein sequences, a random sample (without replacement) of 250 protein sequence was selected to train the model which contained 522 ubiquitinated sites. Among the selected proteins some proteins were excluded due to the lysine position line mismatch. . As well as to overcome the model overfitting issue in the predictor, the CD-HIT server (http://weizhonglab.ucsd.edu/cdhit_suite/cgi-bin/index.cgi) (Li W et. al. 2002, 2006; Wang JR et al. 2017) was used to reduce the sequence homology. We have used 30%, 40%, and 50% sequence identity threshold to address the homology redundancy. After removing as similar patterns (Ahmed S et. al, 2018), we found 463, 482, and 491 ubiquitination sites for 30%, 40%, and 50% identity cutoff. According to the literature review, it was observed that most of the ubiquitination site predictor was developed by considering the 40% sequence identity approach to reduce the homology redundancy [Chen J et. al, 2019]. Based on model performance and the literature review, the 40% sequence identity threshold was chosen to explore the analysis for the fare comparison with the other existing ubiquitination site predictors. Finally, we considered 239 ubiquitinated protein sequences having 482 experimentally validated ubiquitination sites. In this study, the experimentally validated lysine ubiquitination sites were represented as positive samples (i.e. ubiquitinated sites), while all the remaining lysine residues were noted as negative samples (i.e. non-ubiquitinated sites). The training dataset contained 6753 non-ubiquitination sites (i.e. negative samples) which showed an extreme unbalance in the training dataset. The non-ubiquitination sites were randomly

selected from all negative samples to construct the training dataset of 1:1, 1:2 and 1:3 ratio of the positive vs. negative samples to train the models.

lP

re

-p

ro

of

To check the model’s performances, the cross-validation test and the independent test datasets were used. From the downloaded protein sequences, the independent test datasets were constructed from 250 randomly selected protein sequences which were not included into the training dataset. We checked the model performance stability accessing the prediction performance for the independent test dataset considering all positive and negative samples. Nevertheless, the assessed training set contained 1:1 ratio of positive and negative samples whether the performances of all training models were checked through the jackknife and several cross-validation tests. The whole datasets are available at http://www.bbcba.org/softwares/ArabidopsisUbq.zip

ur na

Fig. 1. The working flowchart of the proposed in silico prediction method is represented here. The various sample ratios were considered to train the supervised learning algorithm. To evaluate the model performance cross validation test and independent test dataset were used.

Jo

2.2 Encoding Schemes Here, in order to encode the protein sequences into suitable sequence fragments so that the sequences are transformed into a numeric vector; we considered optimum fragment size 27, which is called window size. Every fragment contained an ubiquitination (lysine residue (k)) or non-ubiquitination site in the middle position. To overcome the ultimate step; encoding, two types of well-known amino acid encoding scheme i.e. CKSAAP encoding and binary encoding method were obtained to encode the sequence fragments for the classifiers model rather than a simple binary representation. 2.2.1 CKSAAP encoding One of the widely used encoding methods in bioinformatics is the composition of k-spaced residue pairs (CKSAAP) in the sequence fragments (Chen et al., 2014, Hasan et al., 2018; Hasan et al., 2016; Hasan et al., 2019; Hasan et al., 2017). The value of k (stands for the space between two amino acids) defines the combination of the amino acid pairs for each encoded fragments. For instance, if k=0 (considering the 0-spaced) with the composition of window size 2r+1 and 21 types of amino acids (including the gap (O)), then the 0-spaced amino acid pairs (i.e. AA, AC, AD, …, OO) and it contains (21×21) = 441 types pairs. Similarly, “AxxA” will be for k=2 and so on. If Ntotal is the total number of residue pairs in a fragment (for instance, when the sequence window length L is considered as 27 and space k = 0, 1, 2, 3, 4, 5 then Ntotal = (L - k – 1) will be 26, 25, 24, 23, 22 and 21, respectively) and N AA, NAC, …, NOO exhibit the number of the

certain pair of amino acid residues within the sequence window. Then the feature vector was calculated by the given formula, 𝑁𝐴𝐴 𝑁𝐴𝐶 𝑁𝑜𝑜 ( , ,… , ) 𝑁𝑡𝑜𝑡𝑎𝑙 𝑁𝑡𝑜𝑡𝑎𝑙 𝑁𝑡𝑜𝑡𝑎𝑙 441

In this study, the optimum window size was 27 and considered k= 0, 1, 2, 3, 4 and 5 sequentially whereas the ideal kmax was set at 5, reproducing {21 × (kmax + 1) × 21} = 2646 feature dimension (amino acid pairs) that were composed to compute the corresponding feature vector for each sequence fragment according to the given formula. In the result vector, correspondence to every feature (amino acid pairs), the value was represented the ratio of their frequency with respect to the Ntotal. This ultimate encoded matrix for all the residue pairs was utilized for the supervised learning algorithm to construct the proposed predictor.

-p

ro

of

2.2.2 Binary encoding To compute the position specific information from protein sequence fragments, binary encoding with a unified dimension was assessed besides the CKSAAP encoding method to make the predictor more robust. In this study, The 21 types of amino acid residues (including gap (O)) were ordered like ACDEFGHIKLMNPQRSTVWYO. For constructing the binary vector with the query protein sequence, the A amino acid was represented by 21-dimentional binary vector like 100000000000000000000. And similarly other each amino acid was coded as C (010000000000000000000), D (001000000000000000000), …, O (000000000000000000001). Since the selected optimum window-size of surrounding ubiquitination sites was considered as 27 and the center position was fixed for lysine (k) residue. As a result, the binary encoding scheme outputted total 21×26=546 feature dimension because of the exclusion of central k residue from consideration.

lP

re

2.3 Feature selection As mentioned in ‘‘Encoding Schemes’’, each investigated ubiquitinated or non-ubiquitinated fragment was encoded into as a high dimensional features vector. However, in some cases, high dimensional features create some computational complexity in some popular classifiers during the development of class predictor. Therefore, we consider the feature filtering approach to select the important features, since some features are not so important usually to determine the surrounding ubiquitinated or non-ubiquitinated sites. Since the CKAAP encoding method provided the position specific frequency ratio values for each amino acid pair which are not normally distributed, in this paper, we selected the top 1500 most significant features using the Wilcoxon Sign Rank Test approach between positive and negative groups.

Jo

ur na

2.4 Random forest classifier In proteomics, random forest classifier has been widely practiced as a significant supervised learning algorithm (Hasan et al., 2015; Chen et al., 2014; Hasan et al., 2019; Hasan et al., 2018, Khatun et al., 2019; Khatun et al., 2019; Hasan et al., 2019). It also shows robustness in presence of noise and outliers comparatively to others algorithm (Breiman et al., 2001). Basically, random forest classifier has been built based on collection of decision tree classifiers where a randomly selected sample subset was used to train each decision tree. The decision tree making strategy is described as follows. If there are K features and considering with replacement sampling scheme, N samples are selected randomly. After that, among the F features, the best split node is selected which is the key to build the decision tree and finally the decision tree is made without pruning as large as possible. Generally, the maximum number of votes is considered to construct the forest when the votes are provided by the all-individual trees. In case of our study, by the votes among the number of trees, the RF predicted two classes i.e. either positive sites (ubiquitinated sites) or negative sites (nonubiquitinated sites). The random forest classifier was applied on the encoded data matrix which actually contains the position specific numeric values obtained by CKSAAP encoding scheme. The RF classification provided classification score ranges from 0 to 1. Based on the RF classification scores (≥0.50), the lysine (K) positions were declared as positive sites and scores <0.50 are as non-ubiquitination sites. The lysine positions which get the score more closely to 1 are declared more confidently as ubiquitination sites. 2.5 Model training As mentioned before, three classification algorithms i.e. RF, SVM and Naïve Bayes were used in this study to construct the proposed ubiquitination sites predictor. In order to train the classifiers, the training dataset was used whether their prediction accuracy increase to distinguish the ubiquitination and non- ubiquitination sites perfectly. Here, for the SVM kernel radial based function was used when the RF was implemented under the consideration of optimum values of the

parameters. The models were trained with various ratios of positive and negative samples through different cross validation. Among the three classifiers, RF was selected as best classifier to classify the ubiquitination site and nonubiquitination sites when the model was trained by 1:1 ratio of positive and negative samples and with 5-fold cross validation test. 2.6 Model performance evaluation and cross validation To observe the performance of the suggested ubiquitination sites predictor in silico method, four widely used performance measures denoted as sensitivity (Sn), specificity (Sp), accuracy (Ac) and the Matthew correlation coefficient (MCC) were used. These can be formulated as follows: 𝑡𝑝 + 𝑡𝑛 𝑡𝑝 + 𝑓𝑛 + 𝑡𝑛 + 𝑓𝑝 𝑡𝑝 𝑡𝑝 + 𝑓𝑛

𝑆𝑝𝑒𝑐𝑖𝑓𝑖𝑐𝑖𝑡𝑦 (𝑆𝑝) =

𝑡𝑛 𝑡𝑛 + 𝑓𝑝

(𝑡𝑝 × 𝑡𝑛) − (𝑓𝑝 × 𝑓𝑛)

√(𝑡𝑝 + 𝑓𝑝) × (𝑡𝑝 + 𝑓𝑛) × (𝑡𝑛 + 𝑓𝑛) × (𝑡𝑛 + 𝑓𝑝)

-p

𝑀𝑎𝑡𝑡ℎ𝑒𝑤 𝑐𝑜𝑟𝑟𝑒𝑙𝑎𝑡𝑖𝑜𝑛 𝑐𝑜𝑒𝑓𝑓𝑖𝑐𝑖𝑒𝑛𝑡 (𝑀𝐶𝐶) =

ro

𝑆𝑒𝑛𝑠𝑖𝑡𝑖𝑣𝑖𝑡𝑦 (𝑆𝑛) =

of

𝐴𝑐𝑐𝑢𝑟𝑎𝑐𝑦 (𝐴𝑐) =

3. Results

ur na

lP

re

Where tp, fp, fn and tn are standing for the true positive, false positive, false negative and true negative respectively. In addition, for providing a comprehensive performance measurement we also considered the Receiver Operating Characteristic (ROC) curve; which represents the graphical representation of true positive rate (i.e. sensitivity) as a function of false positive rate (i.e. 1-specificity) for all possible cut-off values. Moreover, the area under the ROC curve (AUC) was used to quantify the overall performance of the proposed method. The values of AUC closer to 1 indicate the better performance. On the other hand, the jackknife cross-validation and 2-, 5-fold cross validation tests were taken under consideration to check the performance of the prediction model. Formally, in 5-fold cross validation tests, it creates five subgroups dataset from the training dataset with approximately equal size. Then, the classifier was trained using the four subgroups while the remaining one set was used as the test dataset to assess the performance of the classifier and the process run accordingly 5 times. The model performance was also justified through the prediction performance for independent test dataset. The performance measures Sn, Sp and MCC were calculated at the threshold value FPR=0.20 (fp rate), while the values of AUC measure (total areas under the ROC curve) were calculated by threshold independent scores.

and discussion

Jo

3.1 Performance assessment of the models In nature, the ubiquitination and non-ubiquitination sites are highly unbalanced. The raw training dataset was utilized to develop the proposed predictor, was highly imbalanced with the positive and negative samples. Naturally, the unbalanced datasets affect the computational accuracy strongly. To overcome this problem, like other PTM site predictors we used a comparatively balanced dataset with positive and negative samples to train the classification models. We used 1:1, 1:2 and 1:3 ratios of positive and negative samples for training datasets. Finally, a 1:1 ratio of positive and negative samples was considered as training dataset to train and develop the RF based predictor in this study. In this work, one of the significant tasks was to select the appropriate encoding scheme. We conducted two single feature encoding-based methods (i.e. CKSAAP and binary encoding) to create the encoded vector from the sequence fragment considering the optimum window size 27. Furthermore, we compared the 30%, 40%, and 50% identity cutoff at the sequence level by CD-HIT (Huang et al., 2010) on the training dataset. After removing as similar patterns, we found 463, 482, and 491 ubiquitination sites for 30%, 40%, and 50% identity cutoff, respectively. To avoid the possible biased problem, we randomly pooled negative samples from the whole negative sample to maintain a 1:1 positive versus negative. Then we performed each dataset by a 5-fold cross-validation test using RF classifier with the both encoding respectively. The performance for the two encoding scheme was assessed based on the model’s prediction performance.

The CKSAAP encoding method performed significantly better than the binary encoding scheme in predicting the ubiquitination sites. Based on the performance results, the CKSAAP encoding-based models were chosen to encode the sequence fragments into numeric vectors for applying classification models. So, in this case, we suggested that the CKSAAP encoding based model has better performances than other encoding method. The overall performance is shown in Table 1. Moreover, we found that the 40% redundancy cutoff achieved slightly better than 30% and 50% redundancy. Finally, a 40% sequence cutoff was chosen in our final prediction model with the CKSAAP encoding method.

of

Table 1 AUC values by 5-fold CV test with different sequence redundancy threshold Sequence Identity Cutoff Random Forest Model (n)* Binary Encoding CKSAAP Encoding 30% (n=232) 0.843 0.877 40% (n=239) 0.864 0.910 50% (n=243) 0.851 0.892 * means the number of protein sequences obtained under that particular sequence identity cutoff

-p

ro

Then, the performances of the three classification methods (SVM, NB and RF) were optimized through cross-validation tests, when the jackknife cross-validation and 2-, 5-fold cross validation tests were implemented on the training dataset to select the best classifier. Among the three classifiers, the RF reflected the promising performance in all aspects than the others. To check the models performance, we firstly applied the jackknife cross-validation test when the random forest classifier achieved AUC score=0.887 with 82.3% accuracy for 1:1 training dataset by using the CKSAAP encoding method. The performance outweighed when the 2-, 5-fold cross validation tests were implemented. The RF classifier revealed the highest AUC values 0.910 with the accuracy of 83.8% in 5-fold cross validation test, which is significantly better than others are. The overall performances of RF models are provided in Table 2.

Sn

Sp

PR

Jackknife

0.743

0.902

0.753

2-fold

0.757

0.892

0.758

5-fold

0.772

0.915

Ac

MCC

AUC

0.823

0.661

0.887

0.825

0.669

0.888

0.838

0.680

0.910

lP

CV-fold

re

Table 2 Prediction performance obtained from different cross-validation tests by CKSAAP Based RF algorithm

0.763

Jo

ur na

Furthermore, we plotted the ROC curve for the RF classifier along with the other two classifiers (Fig. 2). It has been clearly indicated that the RF classifier with the CKSAAP encoding method outer performs noticeably (with AUC = 0.910) than SVM (with AUC= 0.889) and Naïve Bayes classifier (with AUC= 0.877) under the 5-fold cross validation test.

of ro

-p

Fig. 2. ROC curve of three classifiers of the selected training dataset (ratio 1:1) based on 5-fold cross-validtion test with the CKSAAP encoding scheme.

re

On the other hand, the RF classifier reflected the remarkable performance with 91.5% specificity, 77.2% sensitivity, 83.8% accuracy and the MCC is 0.680 for the 5-fold cross validation test with the CKSAAP encoding method. That is why finally, this combination of random forest model was selected as the proposed ubiquitination sites predictor.

lP

3.2 Results comparison after feature selection The top 1500 informative features were arranged from the encoded feature vector to explore the most substantial amino acid pairs surrounding the ubiquitination and non-ubiquitination sites. We extracted features by a nonparametric method as Wilcoxon signed rank test from 1:1 ratio of positive and negative sample. However, the results were almost remaining unchanged after feature selection. Therefore, we used the full features dataset for ubiquitination sites prediction.

Jo

ur na

3.3 Performance assessment for independent test dataset In this study, the independent test dataset was constructed by randomly selected 250 ubiquitinated protein sequences from the remains downloaded dataset after taking the training dataset. The independent test dataset contained totally 408 ubiquitination sites and 8098 non-ubiquitination sites. These independent test protein sequences were not included into the training sample that have been used in this study. The RF based predictor with CKSAAP and binary encoding scheme were used to predict the ubiquitination site. The 250 independent test protein sequences were used to check the performance stability of the proposed predictor. The four performance measurements i.e. sensitivity (Sn), specificity (Sp), accuracy (Ac) and MCC along with the AUC score were considered to assess the predictor’s performance in independent test dataset. The proposed CKSAAP encoding based prediction method exhibited the Sn, Sp, Ac and MCC as 80%, 78%, 80% and 58% respectfully. The prediction performance under the binary encoding scheme is lower than the CKSAAP encoding (Table 3). On the other hand, the area under the ROC curve (Fig. 3) clearly indicated that the overall performance is significantly better as the AUC value is 86%. The entire results are summarized in Table 3. 3.4 Performance comparison of proposed method with other existing predictor and discussion Chen Z et al., 2014 showed that the existing ubiquitination sites predictors are good for species-specific prediction. The proposed ubiquitination site predictor for A. thaliana performed more accurately which was noticed by cross-validation test and prediction performance for independent test dataset. The only ubiquitination sites predictor for A. thaliana named AraUbiSite (Chen J et al. 2019) was developed which was basically a SVM method based model. For AraUbiSite development, the performances were assessed using the independent test dataset made by randomly selected sequence fragments containing ubiquitinated and non-ubiquitinated sites in a 1:3 ratio. Naturally, the ubiquitinated protein sequences are highly unbalanced with ubiquitinated and non-ubiquitinated sites. To predict the ubiquitination sites for

of

every lysine (K) position of an unknown independent protein sequence, each lysine (K) residue should be assessed and no sampling should be considered for independent test dataset. Otherwise some position may not be considered for prediction. This unbalance property may affect the prediction performance when an entire protein sequence will be used to predict the ubiquitination sites. To compare the prediction performance of the proposed method with the existing ubiquitination prediction server AraUbiSite (http://systbio.cau.edu.cn/araubisite/index.php), the prediction performances were observed for independent test dataset. In order to get the fair comparison, the same independent test dataset was submitted into the ArUbiSite ubiquitination sites prediction server and the prediction SVM scores were collected. The server did not provided scores for some protein sequence and also for some lysine (K) position. Then the performance measurements were calculated and it was shown that the proposed method achieved better performance than the ArUbiSite for the same independent test dataset. The performance was measured through the Sn, Sp, Ac, MCC and AUC values. The proposed method showed 80% accuracy and AUC was 0.86 (Table 3) when the AraUbiSite showed 73% accuracy with AUC 0.82 for the same independent test dataset (Fig 3) under the threshold of false positive rate FPR)=0.20. The other performance measurements are also clearly demonstrated that the proposed method performed reasonably better than the existing AraUbiSite predictor (Table 3).

RF + Binary Encoding

0.595 0.796 0.696

0.399 0.803

RF + CKSAAP Encoding

0.801

0.580

0.782

0.802

0.861

-p

Proposed Method

ro

Table 3 The prediction performance comparison between the AraUbiSite and the proposed method with binary and CKSAAP encoding for same independent test dataset. Prediction Methods Classifier+ Encoding Method Sn Sp Ac MCC AUC AraUbiSite Predictor 0.665 0.799 0.732 0.479 0.824

Jo

ur na

lP

re

The proposed method obtained better performance than existing predictors for independent test dataset with the increasing prediction performances as well as AUC value (0.86) along with the other performance measurements. In comparison, the performance of the proposed method for ubiquitination sites prediction in A. thaliana sequences clearly indicates the importance to use the method.

Fig. 3. Combined ROC curve for the independent test dataset by using the proposed CKSAAP+RF based classifier, Binary+RF based classifier and the AraUbiSite ubiquitination site prediction server.

ro

of

The proposed method utilized the CKSAAP encoding scheme, which produced the encoded vector from the sequence fragments. The engagement of the position specific amino acids combination around the ubiquitination and nonubiquitination sites for the both training and test dataset are shown from the two-sample logo software (http://www.twosamplelogo.org/cgi-bin/tsl/tsl.cgi). The height of the residues in the X-axis was in proportion to the percentage of corresponding residue in the positive and negative samples (i.e. over- and under-represented). Interestingly, the enrichment of several amino acid residues (such as R, E, K, L) is highly exhibited around the ubiquitination sites in the protein sequences from A. thaliana (Fig. 4).

re

-p

Fig. 4. The engagement of amino acids residues around the ubiquitination sites compared to non-ubiquitination sites is represented by Two-Sample Logos software (statistical t-test, p-value<0.05). This displays the significant positional difference between the compositional amino acids of the ubiquitinated and non-ubiquitinated peptides, particularly from -13 to -1 and +1 to +13 positions.

ur na

lP

4. Limitations Though the proposed method is significantly better and robust than the existing prediction models, there still have some space to be improved. Firstly, the proposed method used the dataset which was reported based on the proteome-wide ubiquitination sites mapping (Walton et al. 2016). Thus it may loss some inevitable features about the ubiquitination sites in A. thaliana which will be more helpful for developing the more robust and accurate ubiquitination site predictor for the model plant. Secondly, the proposed method utilized around 500 randomly selected protein sequence data to develop the predictor. The inclusion of more datasets may improve the prediction performance. In future we would like to utilize more comprehensive dataset to develop ubiquitination site predictor. Thirdly, a genome wide online prediction server will be introduced as soon as possible to make the predictor more user friendly. The prediction server will be available at http://www.bbcba.org/software.

Jo

5. Conclusion In this paper, we proposed a novel random forest based in silico predictor for predicting protein ubiquitination sites of A. thaliana species. To develop this novel predictor, three different popular classifiers (RF, SVM and NB) was considered for comparative study on the prediction of ubiquitination sites using the experimentally identified ubiquitinated protein sequences of A. thaliana. The composition of k-spaced amino acid pairs (CKSAAP) was selected as a better encoding scheme through a comparative study to develop the proposed predictor. The proposed RF based predictor showed the best performance measured by AUC scores for both cross-validation test and independent test datasets that achieved 0.910 and 0.86 respectively. The prediction performance outweighed significantly compared to the existing ubiquitination sites prediction server AraUbiSite. The output of the proposed method will help to explore the mechanisms of ubiquitination-related biological processes and also speed up the explanation and analysis of the ubiquitination-related cellular functions in the model plant A. thaliana. It can be utilized from the robustness point of view also. To implement the proposed method more widely, we provided the computational codes and instructions, which can be downloaded at http://www.bbcba.org/softwares/ArabidopsisUbq.zip. An online ubiquitination prediction server will be developed as soon as possible to expand the usability of the proposed method all over the world that will be available at http://www.bbcba.org/software.

Conflict of interest: none

Jo

ur na

lP

re

-p

ro

of

Acknowledgement I acknowledged to the Ministry of Science and Technology, Bangladesh for providing the financial support through NST Fellowship for this research work.

Reference Walton A, Stes E, Cybulski N, Van Bel M, Inigo S. 2016. It's Time for Some "Site"-Seeing: Novel Tools to Monitor the Ubiquitin Landscape in Arabidopsis thaliana. The Plant Cell, 28 (1) 6-16

Breiman L. 2001, “Random forests,” Machine Learning, 45, 5-32. Binghuang Cai, Xia Jiang, 2016. Computational methods for ubiquitination site prediction using physicochemical properties of protein sequences. BMC Bioinformatics; 17:116 Cai Y, Huang T, Hu L, Shi X, Xie L, Li Y., 2012. Prediction of lysine ubiquitination with mRMR feature selection and analysis. Amino Acids.;42:1387–95.

ro

of

Chen Z, Zhou Y, Zhang Z, Song J, 2014. Towards more accurate prediction of ubiquitination sites: a comprehensive review of current methods, tools and features. Brief Bioinform, Advance Access, doi:10.1093/bib/bbu031

Herrmann J, Lerman LO, Lerman A., 2007. Ubiquitin and ubiquitin-like proteins in protein regulation. Circ Res.;100(9):1276–91.

-p

Hasan, M.M. and Kurata, H (2018). GPSuc: Global Prediction of Generic and Species-specific Succinylation Sites by Aggregating Multiple Sequence Features. PLoS One, 13(10): e0200283.

re

Hasan M.M., Guo D. and Kurata H, 2017. Computational identification of protein S-sulfenylation sites by incorporating the multiple sequence features information. Mol Biosyst 13, 2545-2550.

lP

Hasan M.M., Khatun M. S and Kurata H, 2018. A comprehensive review of in silico analysis for protein Ssulfenylation sites. Protein & Peptide Letters, 25 (9), 815-821. He Fei, Wang Rui, L Jiagen, Bao Lingling, Xu Dong, Zhao Xiaowei, 2018. Large-scale prediction of protein ubiquitination sites using a multimodal deep architecture, BMC Syst Biol. 22;12(Suppl 6):109.

ur na

Hasan, M.M., Rashid, M.M., Khatun, M.S., Kurata, H., 2019. Computational identification of microbial phosphorylation sites by the enhanced characteristics of sequence information, Scientific reports 9(1), 8258. Hasan, M.M., Khatun, M.S., Mollah, M.N.H., Yong, C., and Guo, D, 2017. A systematic identification of speciesspecific protein succinylation sites using joint element features information. Int J Nanomedicine 12, 6303-6315.

Jo

Hasan, M.M., Khatun, M.S., Mollah, M.N.H., Yong C and Guo D, 2018. NTyroSite: Computational Identification of Protein Nitrotyrosine Sites Using Sequence Evolutionary Features. Molecules 23, 1667. Hasan, M.M., Yang, S., Zhou, Y., and Mollah, M.N., 2016. SuccinSite: a computational tool for the prediction of protein succinylation sites by exploiting the amino acid patterns and properties. Mol Biosyst 12, 786-795. Hasan, M.M., Zhou, Y., Lu, X., Li, J., Song, J. and Zhang, Z, 2015. Computational Identification of Protein Pupylation Sites by Using Profile-Based Composition of k-Spaced Amino Acid Pairs. PLoS One 10, e0129635. Khatun, S., Hasan M, and Kurata, H, 2019. Efficient computational model for identification of antitubercular peptides by integrating amino acid patterns and properties. FEBS Lett 593(21):3029-3039. Khatun, M.S., Hasan, M.M., Kurata, H., 2019. PreAIP: Computational prediction of anti-inflammatory peptides by integrating multiple complementary features. Front Genet,10:129.

Hasan, M.M., Manavalan, B., Khatun, M.S., and Kurata, H. 2019. Prediction of S-nitrosylation sites by integrating support vector machines and random forest. Mol Omics 2;15(6):451-458. Jiajing Chen, Jianan Zhao, Shiping Yang, Zhen Chen and Ziding Zhang 2019. “Prediction of Protein Ubiquitination Sites in Arabidopsis thaliana”, Current Bioinformatics (2019) 14: 614. https://doi.org/10.2174/1574893614666190311141647 Jyun-Rong Wang, Wen-Lin Huang, Ming-Ju Tsai, Kai-Ti Hsu, Hui-Ling Huang, Shinn-Ying Ho, 2017. ESA-UbiSite: accurate prediction of human ubiquitination sites by identifying a set of effective negatives. Bioinformatics;33(5):661668

of

Kirkpatrick DS, Denison C, Gygi SP., 2005. Weighing in on ubiquitin: the expanding role of mass-spectrometry-based proteomics. Nat Cell Biol.;7(8):750–7. Kim W, Bennett EJ, Huttlin EL, Guo A, Li J, Possemato A, 2011. Systematic and quantitative assessment of the ubiquitin-modified proteome. Mol Cell.;44:325–40.

ro

Hasan, M.M., Khatun, M.S., Kurata H., 2019. Large-Scale Assessment of Bioinformatics Tools for Lysine Succinylation Sites, Cells, 8(2).

-p

Li W, Jaroszewski L, Godzik A, 2002. Tolerating some redundancy significantly speeds up clustering of large protein databases. Bioinformatics; 18(1): 77-82.

re

Li W, Godzik A., 2006. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics ; 22(13): 1658-9.

lP

Meinke D. W., Cherry J. M., Dean C., Rounsley S. D. and Koornneef M., 1998. Arabidopsis thaliana: A Model Plant for Genome Analysis. Science: Vol. 282, Issue 5389, pp. 662-682. DOI: 10.1126/science.282.5389.662 Peng JM, Schwartz D, Elias JE, Thoreen CC, Cheng D, Marsischky G, 2003. A proteomics approach to understanding protein ubiquitination. Nat Biotechnol.;21:921–6.

ur na

Radivojac P, Vacic V, Haynes C, Cocklin RR, Mohan A, Heyen JW, Goebl MG, Iakoucheva LM. 2010. Identification, analysis and prediction of protein ubiquitination sites. Proteins.;78:365–80.

Jo

Shandar Ahmad, Philip Prathipati, Lokesh P Tripathi, Yi-An Chen, Ajay Arya, Yoichi Murakami, Kenji Mizuguchi, Integrating sequence and gene expression information predicts genome-wide DNA-binding proteins and suggests a cooperative mechanism, Nucleic Acids Research, Volume 46, Issue 1, 9 January 2018, Pages 54– 70, https://doi.org/10.1093/nar/gkx1166 Tung CW, Ho SY., 2008. Computational identification of ubiquitylation sites from protein sequences. BMC Bioinformatics; 9:310. Vanitha Shyamili Kumar and Adaikkalam Vellaichamy, 2019. Sequence and structure‐based characterization of ubiquitination sites in human and yeast proteins using Chou's sample formulation, Proteins;87(8):646-657 Wang JR, Huang WL, Tsai MJ, Hsu KT, Huang HL, Ho SY, 2017. ESAUbiSite: accurate prediction of human ubiquitination sites by identifying a set of effective negatives. Bioinformatics (Oxford, England).; 33(5):661-8.

Wagner SA, Beli P, Weinert BT, Nielsen ML, Cox J, Mann M, Choudhary, 2011. C. A proteome-wide, quantitative survey of in vivo ubiquitylation sites reveals widespread regulatory roles. Mol Cell Proteomics.;10(10):M111.013284. Wenying He, Leyi Wei, Quan Zou, 2018. Research progress in protein posttranslational modification site prediction. Brief Funct Genomics ;18(4):220-229 Welchman RL, Gordon C, Mayer RJ., 2005. Ubiquitin and ubiquitin-like proteins as multifunctional signals. Nat Rev Mol Cell Biol.;6(8):599–609. Walsh I, Domenico TD, Tosatto SCE., 2014. RUBI: rapid proteomic-scale prediction of lysine ubiquitination and factors influencing predictor performance, Amino Acids.;46:853–62.

of

Xu G, Paige JS, Jaffrey SR., 2010. Global analysis of lysine ubiquitination by ubiquitin remnant immune affinity profiling. Nat Biotechnol.;28:868–73.

ro

Yuan Zhou, Sixue Liu, Jiangning Song, Ziding Zhang, 2013. Structural Propensities of Human Ubiquitination Sites: Accessibility, Centrality and Local Conformation. PLoS One ; 8(12): e83167

-p

Zhen Chen, Xuhan Liu, Fuyi Li, Chen Li, Tatiana Marquez-Lago, André Leier, Tatsuya Akutsu, Geoffrey I. Webb, Dakang Xu, Alexander Ian Smith, 2018. Large-scale comparative assessment of computational predictors for lysine post-translational modification sites, Brief Bioinform. 2018 Oct 4. doi: 10.1093/bib/bby089

Jo

ur na

lP

re

Zhen Chen, Yuan Zhou, Jiangning Song, Ziding Zhang, 2016. hCKSAAP_UbSite: improved prediction of human ubiquitination sites by exploiting amino acid pattern and properties. Biochim Biophys Acta ; 1834(8): 1461–1467.