Accepted Manuscript
Effective DNA binding protein prediction by using key features via Chou's general PseAAC
Sheikh Adilina, Dewan Md Farid, Swakkhar Shatabda

PII: S0022-5193(18)30503-4
DOI: https://doi.org/10.1016/j.jtbi.2018.10.027
Reference: YJTBI 9670

To appear in: Journal of Theoretical Biology

Received date: 6 August 2018
Revised date: 7 October 2018
Accepted date: 10 October 2018

Please cite this article as: Sheikh Adilina, Dewan Md Farid, Swakkhar Shatabda, Effective DNA binding protein prediction by using key features via Chou's general PseAAC, Journal of Theoretical Biology (2018), doi: https://doi.org/10.1016/j.jtbi.2018.10.027

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.
Highlights

• Effective and simple to extract sequence based features
• Grouped and recursive feature selection
• Reduced over-fitting on the train set
• Extra Tree classifier based predictor
• Codes and data freely available
Effective DNA binding protein prediction by using key features via Chou's general PseAAC

Sheikh Adilina, Dewan Md Farid and Swakkhar Shatabda
Department of Computer Science and Engineering, United International University, Plot 2, United City, Madani Avenue, Satarkul, Badda, Dhaka-1212, Bangladesh
Email: [email protected]

Abstract

DNA-binding proteins (DBPs) are responsible for several cellular functions, ranging from the immune system to the transport of oxygen. In recent studies, scientists have used supervised machine learning based methods that use information from the protein sequence only to classify DBPs. Most of these methods work effectively on the training set, but the performance of most of them degrades on the independent test set, which leaves room for improving the prediction method by reducing over-fitting. In this paper, we have extracted several features solely from the protein sequence and carried out two different types of feature selection on them. Our results are comparable on the training set and significantly improved on the independent test set. On the independent test set our accuracy was 82.26%, a 1.62% improvement over the previous best state-of-the-art methods. Performance in terms of sensitivity and area under the receiver operating characteristic curve for the independent test set was also higher, at 0.95 and 0.823 respectively. We have made our methods and data available for other researchers at: https://github.com/SkAdilina/DNA_Binding.
1 Introduction
Proteins are chains of linked amino acids formed within the cell, as encoded by RNA transcribed from DNA sequences. Several important tasks in the immune system, muscle contraction, transport of oxygen and many other functions are carried out by proteins. Proteins bind to the DNA molecule to regulate gene expression; therefore, DNA-binding proteins (DBPs) are vital. However, recognizing DBPs is a challenge in itself. Several experimental, or in vitro, methods exist for recognition of DBPs, including genetic analysis [1], X-ray crystallography [2], chromatin immunoprecipitation on microarrays [3] and filter binding assays [4].
Many computational methods have been applied for recognition of DBPs. Classification of DBPs using machine learning algorithms is preferred by researchers, as these methods are less expensive and time saving [5], unlike the experimental methods, which are incapable of handling the sudden outburst of biological sequences in the postgenomic era. One of the most important and difficult problems in computational biology is expressing a biological sequence with a discrete model or a vector while at the same time keeping considerable sequence-order information or key pattern characteristics. This is because all the existing machine learning algorithms can only handle vectors, not sequence samples, as elucidated in a comprehensive review [6]. A vector defined in a discrete model may completely lose all the sequence-pattern information. To avoid this problem, the pseudo amino acid composition [7], or PseAAC [8], was proposed by Chou. Since then, the concept has been widely used in nearly all areas of computational proteomics [9-14] and also in the long list of papers cited in [15].
In recent years, scientists have worked on several methods of identifying DBPs using the sequence alone. Back in 2007, DNABinder [16] was proposed. Even though DNABinder turned out to be efficient, it was heavily dependent on the database being used. In the same year another method was proposed [17] in which PseAAC was used alongside dipeptide composition and auto-cross covariance (ACC) to form numeric series. Each of these series was passed through a Support Vector Machine (SVM) to obtain the final result.

In 2010, Robert E. Langlois and Hui Lu came up with BLAST [18], a pattern-based machine learning protocol to identify DNA-binding proteins. Since then, many improvements have been made in prediction of DBPs using information from the protein sequence only. Features were extracted from the sequence, without any dependency on other information, and were trained using a Random Forest classifier in DNA-Prot [19]. A Grey Model was added to the DNA-Prot predictor to create iDNA-Prot [20]. Promising results were achieved in [21], where the correlation and weighting factors of PseAAC were used and experiments were carried out using the Random Forest classifier.

Chou's general form of pseudo amino acid composition (PseAAC) [22] was incorporated with evolutionary information and added to the features, and then a Support Vector Machine was used to form the predictor iDNAPro-PseAAC [5]. Evolutionary information was also included alongside PseAAC by using the top-n-gram approach in [23]. Combining PseAAC with overall amino acid composition, physicochemical composition distance transformation and SVM resulted in an accuracy of 80% on the test dataset [24]. The iDNAPro-PseAAC outperformed a handful of state-of-the-art methods. Distance coupling, alphabet reduction and reduction in the dimension of the PseAAC feature set in the iDNA-Prot|dis [25] predictor made the classification process faster. This new technique outperformed all the then-existing predictors. In Kmer1 + ACC [26], proteins were transformed into vectors of the same length using kmer composition and then input to a support vector machine to identify DNA-binding proteins. Wei et al. used local Pse-PSSM (Pseudo Position-Specific Scoring Matrix) features in Local-DPP [27] and trained the features using a Random Forest classifier. PSSM and ACC were used in the making of DBPred [28], where again a Random Forest classifier was used to rank the features; however, a Gaussian Naive Bayes classifier was used to finally train the feature set.
HMM profile based monogram and bigram features were shown to be more effective for finding DBPs than PSSM profile based features; HMMBinder was proposed based on these features [29]. A mix of structural and evolutionary features was used in iDNAProt-ES, proposed in [30]. The concept of PseAAC was again used in a recent work [31] to form the DPP-PseAAC predictor. In DPP-PseAAC, features were ranked with the help of a random forest classifier, and features were deleted until an optimal feature set was achieved. A Support Vector Machine was then used to train the feature set, and the experimental results showed a very high accuracy rate.
Due to the widespread increase in the use of PseAAC, three powerful open access software tools, called PseAACBuilder, propy and PseAACGeneral, were recently established: the former two are for generating various modes of Chou's special PseAAC [32], while the third is for those of Chou's general PseAAC [33], including not only all the special modes of feature vectors for proteins but also higher level feature vectors such as the "Functional Domain" mode [33], "Gene Ontology" mode [33], and "Sequential Evolution" or "PSSM" mode [33]. Encouraged by the successes of using PseAAC to deal with protein/peptide sequences, the concept of PseKNC (Pseudo K-tuple Nucleotide Composition) [34] was developed for generating various feature vectors for DNA/RNA sequences [35,36], which have proved very useful as well. In particular, powerful web-servers called Pse-in-One [37] and Pse-in-One 2.0 [38] (the updated version of [37]) were recently established. These web-servers can be used to generate any desired feature vectors for protein/peptide and DNA/RNA sequences according to the needs of users' studies.

In this paper, we focus only on sequence based features, as they have proved effective and simple to generate. We have extracted several types of features using only the protein sequences. We also discuss two different methods of carrying out feature selection: first, we grouped similar features together and recorded the performance of individual groups as well as of selected combinations of groups. Second, we performed recursive feature selection on the entire feature set and selected the best features. Based on these features we have built classification models for prediction of DBPs. We have trained and tested our models on standard benchmark datasets. Moreover, we compared our results with the state-of-the-art predictors and obtained promising results. The accuracy of our feature selection methods is comparable to the other predictors on the training set and shows significant improvements on the independent test set. We have made our methods and data available for other researchers at: https://github.com/SkAdilina/DNA_Binding.
2 Materials and Methods

In this section, we describe the datasets and the methods used in this paper. Kuo-Chen Chou [39] suggested his famous five-step rule for any machine learning based protein attribute classification research, which has also been followed in several other recent papers [40-46]. The five steps are: i) selection of standard benchmark datasets, ii) protein sample representation, iii) selection of algorithms, iv) performance evaluation, and v) development of a web application. We follow the same steps in this paper, except that we did not implement any web application. The rest of this section describes the first four steps in detail.

A brief summary of our work is given in Figure 1. Initially, we extracted features from the protein sequence only. Then we performed two types of feature selection on the extracted features: Grouped Feature Selection and Recursive Feature Selection. We recorded the train accuracy and selected the best feature sets from each of the feature selection methods. We then tested the accuracy of these feature sets on the independent test dataset.

Figure 1: Overview of methodology used in this paper.

2.1 Dataset Description
The problem of prediction of DBPs is formulated as a binary classification problem. For any binary classification problem in the supervised learning setting, construction of training and testing datasets is the primary task. Any dataset for a binary classification problem consists of a subset of positive samples and another subset of negative samples. Formally:

S = S+ ∪ S−    (1)

Here S+ is the set of DBPs or positive samples and S− is the set of non-DBPs or negative samples. As the training dataset, we have selected the most widely used dataset in the literature [27-31], known as the benchmark dataset. It contains 1075 protein sequences, 525 of which are positive samples (DNA-binding proteins) and 550 are negative samples (non DNA-binding proteins). As the independent test set, we have used the independent set proposed in [28]. This dataset contains a total of 186 instances with an equal number of positive and negative samples. A summary of the datasets used in this paper is given in Table 1.
Table 1: Summary of the datasets.

Dataset                 DNA Binding Proteins   Non-Binding Proteins   Total
Training Dataset        525                    550                    1075
Independent Test Set    93                     93                     186

2.2 Protein Sample Representation
Each sample in the dataset is a sequence of amino acid residues. Formally,

P = R_1 R_2 R_3 · · · R_L    (2)

Here, P ∈ S is a protein sequence of length L and the R_i are individual amino acid residue symbols. Each of these samples is either positive or negative. Positive samples, or DBPs, are labeled +1, and negative samples, or non-DBPs, are labeled 0. In the next phase each of these samples is used to extract sequence based features. The remainder of this section describes the details of the features extracted in this paper.

1. Monograms: The frequency of each amino acid was counted and normalized by dividing the count by the length of the protein sequence. Since there are 20 different amino acids, this group contains 20 features.

F_A = (1/L) Σ_{i=1}^{L} match(R_i, a_j)    (3)

where:
L = length of the protein sequence
a_j = an amino acid symbol in the amino acid alphabet Σ
R_i = the amino acid residue at position i in protein P

The function match compares any two strings. Formally,

match(s_1, s_2) = 1 if s_1 == s_2, and 0 otherwise.    (4)
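As an illustration, the monogram feature (group A) amounts to a normalized residue count. The sketch below assumes a standard 20-letter alphabet; the function name and example sequence are ours, not the authors' code:

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20-letter alphabet Sigma

def monogram_features(sequence):
    """Normalized frequency of each amino acid (sketch of Eq. 3): 20 features."""
    counts = Counter(sequence)
    length = len(sequence)
    return [counts[a] / length for a in AMINO_ACIDS]

features = monogram_features("MKVLAAGK")
print(len(features))  # 20
```

Because every residue of a valid sequence is counted exactly once, the 20 values sum to 1 for sequences drawn from Σ.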
Thus we get 20 different monograms.

2. Bigram: The frequency of each pair of two consecutive amino acids was counted and then normalized. Since there are 20 different amino acids, taking all possible pairs results in a total of 400 combinations.

F_B = (1/L) Σ_{i=1}^{L−1} match(R_i R_{i+1}, s_j)    (5)

where:
s_j = a di-amino acid string taken from Σ^2

3. Trigram: In total, 8000 features were created by counting each triple of three consecutive amino acids. The feature was normalized by dividing the count by the length of the protein instance.

F_C = (1/L) Σ_{i=1}^{L−2} match(R_i R_{i+1} R_{i+2}, s_j)    (6)

where:
s_j = a tri-amino acid string taken from Σ^3

4. Gapped bigram: Gapped bigrams [47,48] were formed by counting the occurrences of two amino acids at a certain distance, ignoring the sequence in between.

F_D = (1/L) Σ_{i=1}^{L−g} match(R_i R_{i+g+1}, s_j),  (1 ≤ g ≤ 20)    (7)

where:
s_j = a di-amino acid string taken from Σ^2
g = the gap between the amino acids

In this paper, we have used gaps g = 1, 2, · · · , 20. Thus the total number of features generated was 8000.
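A sketch of the gapped bigram computation, under the same assumptions as before (our function names, a standard 20-letter alphabet); for each gap g it emits 400 normalized pair counts, so 20 gaps give the 8000 features of group D:

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"                           # the alphabet Sigma
PAIRS = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]   # Sigma^2: 400 pairs

def gapped_bigram_features(sequence, max_gap=20):
    """Normalized counts of residue pairs g positions apart (sketch of Eq. 7)."""
    length = len(sequence)
    features = []
    for g in range(1, max_gap + 1):
        counts = dict.fromkeys(PAIRS, 0)
        # pair R_i with R_{i+g+1}, skipping the g residues in between
        for i in range(length - g - 1):
            pair = sequence[i] + sequence[i + g + 1]
            if pair in counts:
                counts[pair] += 1
        features.extend(c / length for c in counts.values())
    return features
```

For example, with max_gap = 1 the sequence "ACDE" yields the pairs "AD" and "CE", each normalized by the sequence length.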
5. Monogram Percentile Separation: The purpose of this feature is to track the occurrences of each amino acid within a certain prefix of the protein sequence. This was done by counting each amino acid in partial sequences: the number of occurrences of each amino acid within the first 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90% and 100% of the entire sequence was recorded and divided by the length of the partial sequence. This was done for all 20 amino acids.

F_E = (1/P_L) Σ_{i=1}^{P_L} match(R_i, a_j)    (8)

where:
P_L = length of the partial protein sequence
a_j = an amino acid taken from the alphabet Σ
6. Bigram Percentile Separation: This feature was extracted using the same concept as the Monogram Percentile Separation feature. The only difference is that instead of individual amino acids, pairs of 2 consecutive amino acids were considered.

F_F = (1/P_L) Σ_{i=1}^{P_L−1} match(R_i R_{i+1}, s_j)    (9)

where:
P_L = length of the partial protein sequence
s_j = a di-amino acid string taken from the alphabet Σ^2

7. Nearest Neighbor Bigram: If amino acid a_j is the closest occurrence to amino acid a_i, then a_i and a_j are said to be nearest neighbors. This concept was used in creating this feature set. Here, the first 30 nearest occurrences of each amino acid pair were considered, creating 12000 features in total. The Nearest Neighbor Bigram [25] feature set was normalized in the same way as the previous feature sets.

F_G = (1/L) distance(a_i, a_k),  (1 ≤ i ≤ 20, k = 1, 2, 3, · · · , 30)    (10)

where:
distance(x, y) = distance between the positions of x and y in the sequence
a_i = an occurrence of amino acid a at position i
a_k = the k-th nearest neighbor of a_i, for all possible symbols a ∈ Σ
Thus, each protein sequence P ∈ S is converted into a feature vector as follows:

F = [F_A F_B F_C F_D F_E F_F F_G]    (11)

A summary of the features generated in this paper is given in Table 2.

Table 2: Summary of Features.

Group Name   Feature Name               Total No. of Features
A            monogram                   20
B            bigram                     400
C            trigram                    8000
D            gapped bigram              8000
E            monogram percentile        200
F            bigram percentile          4000
G            nearest neighbor bigram    12000
                                        Total = 32620

2.3 Feature Selection
In this section, we describe the two feature selection algorithms that we have used in this paper.

2.3.1 Grouped Feature Selection

The pseudo-code for the grouped feature selection is given in Algorithm 1. The basic idea of the algorithm is to add feature groups following a forward selection strategy. The algorithm starts with an empty set, and in each iteration it adds a single group not yet in the set to see whether the training accuracy improves. The algorithm stops adding further groups when adding them no longer improves training accuracy. The details of the experimental results found with this feature selection technique are presented in the next section.
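The forward strategy described above can be sketched as a greedy loop over feature groups. Here `evaluate` stands in for the 10-fold cross-validated train accuracy; the toy scoring function in the usage lines is ours and only illustrates the stopping rule, not the paper's actual accuracies:

```python
def grouped_forward_selection(groups, evaluate):
    """Greedy forward selection over feature groups (a sketch, not the
    authors' exact implementation).

    `groups` maps group names to their feature columns; `evaluate` is assumed
    to return the cross-validated train accuracy for a list of group names.
    """
    selected = []
    best_accuracy = evaluate(selected)
    while True:
        remaining = [g for g in groups if g not in selected]
        if not remaining:
            break
        # try adding each remaining group to the current best set
        accuracy, group = max((evaluate(selected + [g]), g) for g in remaining)
        if accuracy <= best_accuracy:
            break  # no improvement: stop adding groups
        selected.append(group)
        best_accuracy = accuracy
    return selected, best_accuracy

# toy usage: the score peaks once groups G, C, E and F are all included
groups = {name: None for name in "ABCDEFG"}
evaluate = lambda s: len(set(s) & set("GCEF")) - 0.5 * len(set(s) - set("GCEF"))
selected, best = grouped_forward_selection(groups, evaluate)
print(sorted(selected))  # ['C', 'E', 'F', 'G']
```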
Algorithm 1 Grouped Feature Selection
1: group_size ← 0
2: best_group ← empty
3: max_accuracy ← 0
4: new_accuracy ← 0
5: do
6:   increment group_size
7:   max_accuracy ← new_accuracy
8:   for all remaining individual groups do
9:     add 1 group to best_group
10:    perform 10-fold cross validation on train dataset
11:    record accuracy for all classifiers
12:  for each combination do
13:    calculate frequency of best train accuracy
14:    calculate average train accuracy
15:  identify group with maximum frequency of best train accuracy
16:  add group to best_group
17:  identify group with maximum average train accuracy
18:  update new_accuracy
19:  add group to best_group
20: while (new_accuracy > max_accuracy)

2.3.2 Recursive Feature Selection

In recursive feature selection (summarized in Algorithm 2), we ran the Random Forest classifier on the benchmark dataset using the entire feature set. In each loop the feature with the lowest importance ranking was removed from the feature set. This was repeated until no feature was left. We then compared the results, selected the feature set with the maximum train accuracy and tested it on the independent dataset.

Algorithm 2 Recursive Feature Selection
1: total_features ← countFeatures()
2: while total_features > 0 do
3:   perform 10-fold cross validation on train dataset using Random Forest classifier
4:   rank features by importance
5:   remove feature with least ranking
6:   decrement total_features
7: identify feature set with highest train accuracy
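This kind of recursive elimination is also available in the scikit-learn library that the experiments in this paper use. The sketch below runs `RFE` with a Random Forest on synthetic data; the dataset, parameters and final feature count are our illustrative choices, not the paper's 32,620-feature pipeline:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

# synthetic stand-in for the benchmark feature matrix (assumption, not the paper's data)
X, y = make_classification(n_samples=200, n_features=50,
                           n_informative=10, random_state=0)

# recursively drop the least important feature, one per round, until 10 remain
selector = RFE(estimator=RandomForestClassifier(n_estimators=50, random_state=0),
               n_features_to_select=10, step=1)
selector.fit(X, y)

X_reduced = X[:, selector.support_]
print(X_reduced.shape)  # (200, 10)
```

With `step=1` the estimator is refit after each single-feature removal, mirroring the one-at-a-time elimination of Algorithm 2.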
2.4 Classification Algorithms

Several classification algorithms are used in the experiments in this paper. In this section, we provide a very brief description of each algorithm.
1. Decision Tree: Decision tree [49] classifiers are tree based classifiers where attributes are used in a hierarchical manner to find the labels of the samples as leaves of a decision tree.

2. Random Forest: Random forests [50] are an ensemble of decision trees, each learned using randomly selected features at each split of the algorithm.

3. Extra Tree Classifier: Extra tree classifiers [51] are similar to random forests, except that the split thresholds are also chosen at random rather than searched for exhaustively.

4. Logistic Regression: The logistic regression classifier [52] is a linear classifier that uses a hyperplane to separate the samples in a binary classification problem. It uses a decision rule of the following form:

ŷ = sign(w_0 + w_1 x_1 + · · · + w_n x_n)

5. K-Nearest Neighbor: The KNN algorithm [53] is a lazy, instance based classification algorithm where the label of a given instance is decided based on the nearest neighbors of that instance in the feature space.

6. Linear Discriminant Analysis: Linear discriminant analysis [54] expresses the dependent variable or label as a linear combination of the other features. It looks for those features that can best explain the data.

7. Support Vector Machine: The support vector machine is a maximum margin classifier that tries to separate the two classes in a binary classification problem. It uses a decision rule of the following form:

h(x) = sign( Σ_j α_j y_j (x · x_j) − b )

Here, the x_j are the support vectors that define the maximum margin.

8. Gaussian Naive Bayes: The naive Bayes classifier [52] uses a maximum a posteriori decision rule of the following form:

ŷ = argmax_{k ∈ {1,···,K}} p(C_k) Π_{i=1}^{n} p(x_i | C_k)

9. AdaBoost: The adaptive boosting algorithm, or AdaBoost [55], is an ensemble approach where a number of weak classifiers are boosted by adaptively re-weighting wrongly classified instances at each iteration of the algorithm. The decision rule used by an AdaBoost classifier is as follows:

h(x) = sign(α_1 h_1(x) + α_2 h_2(x) + · · · )

Here, h_i is the weak classifier at iteration i and α_i is the weight associated with it.
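All nine classifiers are available in the scikit-learn library used for the experiments in this paper. The sketch below instantiates them with default parameters (our choice, not necessarily the paper's settings) and scores each by 10-fold cross-validation on synthetic data:

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (RandomForestClassifier, ExtraTreesClassifier,
                              AdaBoostClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score
from sklearn.datasets import make_classification

classifiers = {
    "Decision Tree": DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(),
    "Extra Tree Classifier": ExtraTreesClassifier(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "K-Nearest Neighbour": KNeighborsClassifier(),
    "Linear Discriminant Analysis": LinearDiscriminantAnalysis(),
    "Support Vector Machine": SVC(),
    "Gaussian Naive Bayes": GaussianNB(),
    "AdaBoost": AdaBoostClassifier(),
}

# synthetic stand-in data; the paper uses the benchmark feature matrix instead
X, y = make_classification(n_samples=300, n_features=20, random_state=0)
for name, clf in classifiers.items():
    scores = cross_val_score(clf, X, y, cv=10, scoring="accuracy")
    print(f"{name}: {scores.mean():.3f}")
```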
2.5 Performance Evaluation
In order to validate any machine learning based prediction method, it is very important to select the evaluation methods and metrics. In this paper we use six performance evaluation metrics that are widely used in the literature, including for prediction of DBPs [27,29-31,56,57]. They are: Accuracy, Sensitivity or Recall, Specificity, the Matthews correlation coefficient (MCC) [58], the area under the receiver operating characteristic curve (auROC) and the area under the precision-recall curve (auPR). The first four metrics are defined below:

Accuracy = (tp + tn) / (tp + tn + fp + fn)    (12)

Sensitivity = tp / p = tp / (tp + fn)    (13)

Specificity = tn / n = tn / (tn + fp)    (14)

MCC = ((tp × tn) − (fp × fn)) / sqrt((tp + fp)(tp + fn)(tn + fp)(tn + fn))    (15)

where:
n = number of real negative samples
p = number of real positive samples
tp = number of positive samples correctly predicted as positive
tn = number of negative samples correctly predicted as negative
fp = number of negative samples incorrectly predicted as positive
fn = number of positive samples incorrectly predicted as negative

The first three metrics (Accuracy, Sensitivity or Recall, and Specificity) have their values in the range [0,1]. Here a maximum value of 1 means a perfect classifier and a minimum value of 0 means the worst classifier possible.

The fourth metric, MCC, has been widely used to evaluate classification quality for single-label systems. The value of MCC ranges from +1 to −1, where +1 refers to a perfect classifier, −1 refers to the worst possible classifier and 0 refers to a random classifier. Having said that, multi-label systems are becoming more common in recent years in systems biology [59-65], systems medicine [66] and biomedicine [67]. None of these metrics is valid for multi-label systems, where an instance can belong to several classes at the same time. Fortunately, a completely different set of metrics exists for that setting and is defined in [68].
The other two metrics, auROC and auPR, are more appropriate for probabilistic predictors that depend on thresholds for the prediction of labels. The receiver operating characteristic (ROC) curve has the TPR (true positive rate) on the y-axis and the FPR (false positive rate) on the x-axis, where TPR is the ratio of tp to p and FPR is the ratio of fp to n. The TPR and FPR are calculated for all possible thresholds and plotted, and auROC is the area under this curve [69]. The area under the precision-recall curve (auPR) is a similar metric. Both of these metrics have their values in the range [0,1]; 1 means a perfect classifier, and for auROC a value of 0.5 means a random classifier in a binary classification task.
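All six metrics can be computed with scikit-learn. The toy labels and scores below are ours and serve only to illustrate the definitions (specificity is obtained as the recall of the negative class):

```python
from sklearn.metrics import (accuracy_score, recall_score, matthews_corrcoef,
                             roc_auc_score, average_precision_score)

y_true = [1, 1, 1, 0, 0, 0, 1, 0]                     # 1 = DBP, 0 = non-DBP
y_prob = [0.9, 0.8, 0.4, 0.3, 0.2, 0.6, 0.7, 0.1]     # predicted P(DBP)
y_pred = [1 if p >= 0.5 else 0 for p in y_prob]       # threshold at 0.5

print("Accuracy   :", accuracy_score(y_true, y_pred))            # 0.75
print("Sensitivity:", recall_score(y_true, y_pred))               # tp / (tp + fn)
print("Specificity:", recall_score(y_true, y_pred, pos_label=0))  # tn / (tn + fp)
print("MCC        :", matthews_corrcoef(y_true, y_pred))          # 0.5
print("auROC      :", roc_auc_score(y_true, y_prob))              # threshold-free
print("auPR       :", average_precision_score(y_true, y_prob))
```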
Another important decision is the sampling method used to validate the predictors. Two methods are used in this paper: k-fold cross validation and an independent test set. The k-fold cross validation technique is used for training the classifiers on the benchmark dataset. In k-fold cross validation, the training dataset is partitioned into k sub-datasets of the same size. In each iteration, k − 1 sub-datasets are used for training and the one remaining sub-dataset is used for testing. This process is carried out k times and the final result is the average of all k results. We used k = 10 for all of our experiments on the training dataset. However, since an independent test set is available for this problem and widely used in the literature, we have also tested the trained models using that dataset.
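The k-fold protocol can be sketched with scikit-learn's `KFold`. The toy data and the stand-in fold score are ours; we use k = 5 here for brevity, whereas the paper uses k = 10:

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(10)                                   # 10 toy samples
kf = KFold(n_splits=5, shuffle=True, random_state=0)

scores = []
for train_idx, test_idx in kf.split(X):
    # in a real run: fit on X[train_idx], evaluate on X[test_idx]
    scores.append(len(train_idx) / len(X))          # stand-in for a fold accuracy
final = np.mean(scores)                             # report the average over folds
print(final)  # 0.8: each fold trains on k - 1 = 4 of the 5 parts
```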
3 Results and Discussion

In this section, we describe the details of the experiments done in this paper. All experiments were conducted on a computing machine provided by CITS, United International University. The machine was equipped with an 8-core Dell R730 Intel Xeon processor (E5-2630 v3) running at 2.4 GHz, with 18.5 GB of memory. All programs were written in Python 3.6 using the Scikit-learn library [70]. Each experiment was run 10 times and only the averages are reported.
3.1 Results on Grouped Feature Selection

In grouped feature selection (shown in Algorithm 1), the entire feature set was divided into 7 different groups (A, B, C, D, E, F and G), described in Table 2. In the beginning, we calculated the train accuracy of each group. We then selected the best group using two different criteria:

1. The group with the highest frequency of best train accuracy
2. The group with the maximum average train accuracy

We took the best feature group and added it to a running set of features, each time adding another group to it. The best performing group was kept constant, and the accuracy of every possible combination obtained by adding one more group was recorded. We stopped adding groups when the accuracy stopped increasing and started to decrease instead. We used the several classification algorithms mentioned in the previous section.

First, we used only the individual feature groups to see their performance on the train set. The experimental results were recorded in terms of train accuracy. In Figure 2, we show the histograms of the feature groups achieving the best results in each iteration of the experiments. We note that Group A achieved the best train accuracy the highest number of times. In Figure 3, the average train accuracy achieved by the different feature groups is shown; here, the best average accuracy was achieved by Group G.
Figure 2: Frequency of maximum train accuracy of individual Groups A, B, C, D, E, F and G. We selected Group A since it has the highest frequency.

Figure 3: Average train accuracy of individual Groups A, B, C, D, E, F and G. We selected the group with the maximum average accuracy, i.e. Group G.
Then, we added another group to the best groups recorded in the previous iteration and recorded the performance of all possible combinations obtained by adding one more group at each iteration. As in the previous iteration, the frequency of best train accuracy and the average train accuracy were taken as the selection criteria. Figure 4 and Figure 5 show the plots of these two measures for the groups in the second round. We selected Groups AC and AG from the combinations of A, and Groups GC and GF from the combinations of G.

Figure 4: Frequency of maximum train accuracy of groups in the second loop. We selected Group AG from the combinations of A and Groups GC and GF from the combinations of G.
Figure 5: Average train accuracy of groups in the second loop. We selected Group AC from the combinations of A and Group GC from the combinations of G.

In the third iteration of the loop, as can be seen from Figure 6 and Figure 7, the group combinations AGC, AGF, GCB, GCE and GCF performed the best.

Figure 6: Frequency of maximum train accuracy of groups in the third loop. We selected the best combinations for each best performing group (AC, AG, GC, GF) chosen at the end of the second iteration of the loop. Groups AGF, GFA, GCE, GCF and ACG were selected. (We ignored ACF in this loop because, taking the overall average into account, ACG clearly outperforms ACF.)
Figure 7: Average train accuracy of groups in the third iteration of the loop. The best combinations were Groups ACG, AGF, GCB, GCE and GCF.

We continued the experiments and stopped when the accuracy began to decrease. We achieved our final result in the fourth iteration of the loop: the combination GCEF had the highest frequency of best train accuracy and also the maximum average train accuracy among all the combinations. The results are shown graphically in Figure 8 and Figure 9.

Figure 8: Frequency of maximum train accuracy of groups in the fourth loop.

Figure 9: Average train accuracy of groups in the fourth loop.
3.1.1 Classifier Selection

From the experiments in the feature selection step, we conclude that the best performing groups are G, C, E and F. The next step was to select the best performing classifier for this selected group of features. We tried the classifiers mentioned in the previous section. The detailed results, in terms of train set accuracy found in cross validation and test set accuracy, are given in Figure 10, Table 3 and Table 4. From Figure 10, it can be seen that the best classifier is the Extra Tree classifier. The Random Forest classifier follows closely with the second best train and test accuracy. While Support Vector Machine, Logistic Regression and AdaBoost have fairly good train and test accuracy, Decision Tree and Gaussian Naive Bayes have the lowest train and test accuracy respectively. We obtained the best train accuracy using the Linear Discriminant Analysis (LDA) classifier; however, its test accuracy is among the worst. We can say that over-fitting has occurred, and therefore LDA cannot be considered a good classifier here.
Figure 10: Graphical representation of train and test accuracy of the best group combination, GCEF.

Table 3: Summary of experimental results of the best group combination (GCEF) on the training set.

Classifier                     Accuracy   Recall   Specificity   auROC   auPR    MCC
Random Forest                  68.15      0.59     0.771         0.735   0.702   0.366
Extra Tree Classifier          70.21      0.61     0.797         0.751   0.721   0.407
Support Vector Machine         67.97      0.68     0.682         0.726   0.699   0.363
Logistic Regression            67.87      0.67     0.687         0.735   0.710   0.361
AdaBoost                       67.70      0.65     0.699         0.736   0.726   0.355
Decision Tree                  60.74      0.60     0.613         0.606   0.567   0.202
Gaussian Naive Bayes           63.79      0.82     0.459         0.643   0.578   0.305
K-Nearest Neighbour            66.49      0.80     0.530         0.725   0.671   0.349
Linear Discriminant Analysis   73.36      0.74     0.727         0.793   0.784   0.470
Table 4: Summary of Experimental Results of the Best Group Combination (GCEF) on the Test Set

Classifier                    Accuracy  Recall  Specificity  auROC  auPR   MCC
Random Forest                 77.96     0.78    0.780        0.780  0.702  0.577
Extra Tree Classifier         82.26     0.95    0.699        0.823  0.745  0.666
Support Vector Machine        72.04     0.94    0.505        0.720  0.644  0.488
Logistic Regression           73.12     0.94    0.527        0.731  0.654  0.507
AdaBoost                      70.97     0.85    0.570        0.710  0.639  0.437
Decision Tree                 75.81     0.94    0.581        0.758  0.678  0.483
Gaussian Naive Bayes          62.37     0.86    0.387        0.624  0.572  0.281
K-Nearest Neighbour           63.44     0.84    0.430        0.634  0.580  0.295
Linear Discriminant Analysis  68.82     0.86    0.516        0.688  0.620  0.401

3.2 Results on Recursive Feature Selection
After the recursive feature selection, we counted the number of features from the different groups present in the best feature set. Note that this experiment was only done on the train set in the first phase. The percentage of features present in the best feature set from each feature group is shown in Table 5, from which we can further deduce that the best performing groups are C, D, E and F.

Table 5: Summary of the percentage of features of each group present in the feature set of the Recursive Feature Selection

Group  Total features  Features present  Percentage
A      20              0                 0.00%
B      400             0                 0.00%
C      8000            7986              99.83%
D      8000            5749              71.86%
E      200             188               94.00%
F      4000            3904              97.60%
G      12000           11995             99.96%
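The per-group counting of this kind can be sketched as follows, assuming scikit-learn's RFE; the group boundaries and sizes below are small illustrative stand-ins, not the real group sizes:

```python
# Illustrative count of features surviving recursive feature
# elimination (RFE) per feature group.
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=100, n_features=30, random_state=0)
# Hypothetical group boundaries over the 30 columns.
groups = {"A": range(0, 10), "B": range(10, 20), "C": range(20, 30)}

rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=15).fit(X, y)
for name, idx in groups.items():
    kept = int(rfe.support_[list(idx)].sum())  # features of this group kept
    print(f"Group {name}: {kept}/{len(idx)} ({100.0 * kept / len(idx):.1f}%)")
```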
3.2.1 Classifier Selection
Here too, we have performed another set of experiments to select the best classifier. The results are depicted in Figure 11, Table 6 and Table 7. In Figure 11, we can clearly see that the Extra Tree Classifier provides the highest test accuracy but its train accuracy is very low. Consequently, we have selected the Random Forest classifier as the best classifier for the recursive feature selection. AdaBoost has the third best test accuracy. The classification behavior of Linear Discriminant Analysis (LDA), Support Vector Machine, Logistic Regression, Decision Tree and Gaussian Naive Bayes is the same as in the Grouped Feature Selection experiments. After selecting the best feature subset using the recursive feature selection, we tested it on the testing dataset. The experimental results are provided in Table 7, and the detailed experimental results of the best feature set on the training dataset are given in Table 6.
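The train-versus-test comparison used throughout this selection step can be sketched as below; the dataset and classifier settings are illustrative:

```python
# Illustrative train-vs-test check: cross-validation accuracy on the
# train split against accuracy on a held-out split; a large gap
# between the two would indicate overfitting.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_classification(n_samples=400, n_features=30, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=1)

clf = RandomForestClassifier(n_estimators=100, random_state=1)
cv_acc = cross_val_score(clf, X_tr, y_tr, cv=10).mean()  # train-side estimate
test_acc = clf.fit(X_tr, y_tr).score(X_te, y_te)         # held-out accuracy
print(f"CV accuracy: {cv_acc:.3f}, test accuracy: {test_acc:.3f}")
```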
Table 6: Summary of Experimental Results of the Best Feature Set on the Training Set (Recursive Feature Selection)

Classifier                    Accuracy  Recall  Specificity  auROC  auPR   MCC
Random Forest                 71.04     0.62    0.799        0.751  0.724  0.426
Extra Tree Classifier         67.78     0.59    0.762        0.732  0.699  0.358
Support Vector Machine        67.97     0.67    0.685        0.726  0.696  0.363
Logistic Regression           68.24     0.67    0.695        0.734  0.707  0.368
AdaBoost                      69.27     0.66    0.725        0.751  0.735  0.385
Decision Tree                 63.34     0.63    0.637        0.630  0.574  0.242
Gaussian Naive Bayes          60.16     0.45    0.756        0.601  0.558  0.206
K-Nearest Neighbour           66.21     0.80    0.528        0.724  0.671  0.343
Linear Discriminant Analysis  73.63     0.74    0.729        0.798  0.787  0.476
Table 7: Summary of Experimental Results of the Best Feature Set on the Test Set (Recursive Feature Selection)

Classifier                    Accuracy  Recall  Specificity  auROC  auPR   MCC
Random Forest                 76.88     0.77    0.769        0.769  0.696  0.547
Extra Tree Classifier         79.03     0.91    0.667        0.790  0.713  0.599
Support Vector Machine        72.04     0.94    0.505        0.720  0.644  0.488
Logistic Regression           73.12     0.94    0.527        0.731  0.654  0.507
AdaBoost                      76.34     0.94    0.591        0.763  0.683  0.561
Decision Tree                 73.66     0.92    0.548        0.737  0.659  0.543
Gaussian Naive Bayes          73.12     0.86    0.602        0.731  0.658  0.479
K-Nearest Neighbour           62.90     0.84    0.419        0.629  0.576  0.284
Linear Discriminant Analysis  69.89     0.81    0.591        0.699  0.632  0.407
Figure 11: Graphical representation of Train and Test Accuracy of the best feature set obtained using the Recursive Feature Selection method

3.3 Comparison Using ROC Analysis
We have also compared the ROC curves of both the grouped and recursive feature selection methods, as we stated ROC as one of our performance evaluation metrics. Figure 12 contains the ROC curves of both grouped and recursive feature selection on the training dataset, and Figure 13 contains the results on the test dataset. Both figures contain a red dotted line, which represents the ROC curve of a predictor that is no better than random guessing; the further away a predictor's ROC curve is from this line, the better its performance. Even though the accuracies of the two feature selection methods on the train dataset are slightly different, their ROC curves are almost the same, as can be clearly seen in Figure 12. However, there is a noticeable difference in the performance of the methods on the test dataset. It can be seen in Figure 13 that the curve of Recursive Feature Selection is a little closer to the red dotted line. Hence, we can conclude that the Grouped Feature Selection method is better than the Recursive Feature Selection method.
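A minimal sketch of such an auROC comparison, using scikit-learn on synthetic data (a predictor on the red dotted diagonal, i.e. random guessing, would score 0.5):

```python
# Illustrative auROC comparison of the two final models; data and
# hyperparameters are stand-ins, not our actual pipelines.
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

models = [("Grouped FS (Extra Trees)", ExtraTreesClassifier(random_state=0)),
          ("Recursive FS (Random Forest)", RandomForestClassifier(random_state=0))]
for name, clf in models:
    # Class-1 probabilities give the scores for the ROC analysis.
    probs = clf.fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
    print(f"{name}: auROC = {roc_auc_score(y_te, probs):.3f}")
```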
Figure 12: ROC curve of the Performance of both Feature Sets on Training Data
Figure 13: ROC curve of the Performance of both Feature Sets on Testing Data
3.4 Comparison With Other Methods

From the previous section, the Extra Tree Classifier emerged as the best performing classification algorithm on the features selected using the Grouped Feature Selection method, while Random Forest performed best on the features found by Recursive Feature Selection. To strengthen our claim, we also compared the performance of these two methods, on both the benchmark training dataset and the test dataset, with other state-of-the-art methods. We have considered nine previous methods: DNAProt [19], DNABinder [16], Kmer1+ACC [26], iDNAProt [20], iDNAPro-PseAAC [5], HMMBinder [29], Local-DPP [27], iDNAProt-ES [30] and DPP-PseAAC [31]. We did not re-run the other predictors since we are working on the same datasets; we have taken the results as reported in the literature [29, 31].
Table 8: Comparison of Performance with the state-of-the-art methods on the Training Set

Classifier                   Accuracy  Recall  Specificity  auROC  auPR   MCC
Grouped Feature Selection    70.21     0.61    0.797        0.751  0.721  0.41
Recursive Feature Selection  71.04     0.62    0.799        0.751  0.724  0.43
DNAProt                      72.55     0.83    0.598        0.789  -      0.44
DNAbinder                    73.58     0.66    0.804        0.815  -      0.47
Kmer1 + ACC                  75.23     0.77    0.738        0.828  -      0.50
iDNAProt                     75.40     0.84    0.647        0.761  -      0.50
iDNAPro-PseAAC               76.76     0.84    0.745        0.902  -      0.53
HMMBinder                    86.33     0.87    0.855        0.941  -      0.72
iDNAProt-ES                  90.18     0.90    0.900        0.988  -      0.80
Local-DPP                    79.20     0.84    0.745        -      -      0.59
DPP-PseAAC                   95.91     0.94    0.976        -      -      0.92
We have reported the performance of the nine other algorithms and our methods on the benchmark train dataset in Table 8. From the values reported in Table 8, it is clear that the best performing method is the most recent one, DPP-PseAAC by Rahman et al. [31], which achieved a near perfect performance on the benchmark training dataset. However, to ensure the effectiveness of a method, one must perform an independent test with a separate set.
Table 9: Comparison of Performance with the state-of-the-art methods on the Test Set

Classifier                   Accuracy  Recall  Specificity  auROC  auPR   MCC
Grouped Feature Selection    82.26     0.95    0.699        0.823  0.745  0.67
Recursive Feature Selection  76.88     0.77    0.769        0.769  0.696  0.55
DNAProt                      61.80     0.70    0.538        0.240  0.607  0.24
DNAbinder                    60.80     0.57    0.645        0.216  0.752  0.22
Kmer1 + ACC                  70.96     0.83    0.591        0.431  0.775  0.43
iDNAProt                     67.20     0.68    0.667        0.344  -      0.34
iDNAPro-PseAAC               69.89     0.77    0.624        0.402  -      0.40
HMMBinder                    69.02     0.61    0.763        0.632  -      0.39
iDNAProt-ES                  80.64     0.81    0.800        0.843  -      0.61
Local-DPP                    79.00     0.92    0.656        0.798  -      0.63
DPP-PseAAC                   77.42     0.83    0.709        -      -      0.55
We have trained a model using the benchmark dataset and tested it with the independent test set. Overfitting becomes evident only when a separate independent test set is used. The comparison of results on the independent test set is shown in Table 9. Here, we note that our method is the best performing one, and its test accuracy is higher than its train accuracy, which is a clear indication of the absence of overfitting. We can also note that the previous best method, DPP-PseAAC, performs poorly on the test set compared to its performance on the train set, which leads us to believe that it is a case of overfitting.

3.5 Availability of Our Method
We have provided all the files used in our experiments in this online repository: https://github.com/SkAdilina/DNA_Binding. It contains the code for training the models for both the Grouped and Recursive feature selection, as well as the code used for recording the results for all metrics and classifiers. The features extracted for both methods have also been provided.
4 Conclusion
In this paper, we have presented several features extracted solely from the protein sequence. We then performed grouped and recursive feature selection on the feature set. The final models for the grouped and recursive feature selection were developed with the Extra Tree Classifier and the Random Forest classifier, respectively. Our results improved significantly on the test set. However, the results on the train set still have room for improvement while keeping the high accuracy on the test set. As pointed out in [71], user-friendly and publicly accessible web servers represent the future direction for developing practically more useful predictors or any computational tools. They will also significantly enhance the impact of theoretical work on medical science [72], driving medicinal chemistry into an unprecedented revolution [73]. Moreover, user-friendly web servers have been provided in a series of recent publications [74–83], and we shall therefore make efforts in our future work to provide a web server for the prediction method presented in this paper. For the time being, we have made all methods and data available for anyone to use at: https://github.com/SkAdilina/DNA_Binding.
References
[1] Katie Freeman, Marc Gwadz, and David Shore. Molecular and genetic analysis of the toxic effect of rap1 overexpression in yeast. Genetics, 141(4):1253–1262, 1995.
[2] Chia-Cheng Chou, Ting-Wan Lin, Chin-Yu Chen, and Andrew H-J Wang. Crystal structure of the hyperthermophilic archaeal dna-binding protein sso10b2 at a resolution of 1.85 angstroms. Journal of Bacteriology, 185(14):4066–4073, 2003.

[3] Michael J Buck and Jason D Lieb. Chip-chip: considerations for the design, analysis, and application of genome-wide chromatin immunoprecipitation experiments. Genomics, 83(3):349–360, 2004.
[4] Reham Helwa and Jörg D Hoheisel. Analysis of dna–protein interactions: from nitrocellulose filter binding assays to microarray studies. Analytical and Bioanalytical Chemistry, 398(6):2551–2561, 2010.

[5] Bin Liu, Shanyi Wang, and Xiaolong Wang. Dna binding protein identification by combining pseudo amino acid composition and profile-based protein representation. Scientific Reports, 5:15479, 2015.
[6] Xuhua Xia. Bioinformatics and drug discovery. Current Topics in Medicinal Chemistry, 17:1709–26, 2017.

[7] Kuo-Chen Chou. Prediction of protein cellular attributes using pseudo-amino acid composition. Proteins: Structure, Function, and Bioinformatics, 43(3):246–255, 2001.
[8] Kuo-Chen Chou. Using amphiphilic pseudo amino acid composition to predict enzyme subfamily classes. Bioinformatics, 21(1):10–19, 2005.

[9] Y. Fang, Y. Guo, Y. Feng, and M. Li. Predicting DNA-binding proteins: approached from chou's pseudo amino acid composition and other specific sequence features. Amino Acids, 34(1):103–109, jul 2007.

[10] Xiao-Wei Zhao, Xiang-Tao Li, Zhi-Qiang Ma, and Ming-Hao Yin. Identify dna-binding proteins with optimal chou's amino acid composition. Protein & Peptide Letters, 19:398–405, 2012.
[11] Bin Liu, Jinghao Xu, Shixi Fan, Ruifeng Xu, Jiyun Zhou, and Xiaolong Wang. Psedna-pro: Dna-binding protein identification by combining chou's pseaac and physicochemical distance transformation. Molecular Informatics, 34, 09 2014.
[12] M Saifur Rahman, Swakkhar Shatabda, Sanjay Saha, Mohammad Kaykobad, and Mohammad Rahman. Dpp-pseaac: A dna-binding protein prediction model using chou’s general pseaac. Journal of theoretical biology, 452, 05 2018.
[13] Md Abdullah Al Maruf and Swakkhar Shatabda. irspot-sf: Prediction of recombination hotspots by incorporating sequence based features into chou's pseudo components. Genomics, 2018.

[14] Zhe Ju and Shi-Yun Wang. Prediction of citrullination sites by incorporating k-spaced amino acid pairs into chou's general pseudo amino acid composition. Gene, 664, 04 2018.
[15] Kuo-Chen Chou. An unprecedented revolution in medicinal chemistry driven by the progress of biological science. Current Topics in Medicinal Chemistry, 17(21):2337–2358, 2017.

[16] Manish Kumar, Michael M Gromiha, and Gajendra PS Raghava. Identification of dna-binding proteins using support vector machines and evolutionary profiles. BMC Bioinformatics, 8(1):463, 2007.
[17] Y. Fang, Y. Guo, Y. Feng, and M. Li. Predicting dna-binding proteins: approached from chou’s pseudo amino acid composition and other specific sequence features. Amino Acids, 34(1):103–109, Jan 2008.
[18] Robert E Langlois and Hui Lu. Boosting the prediction and understanding of dna-binding domains from sequence. Nucleic acids research, 38(10):3149–3158, 2010.
[19] K Krishna Kumar, Ganesan Pugalenthi, and PN Suganthan. Dna-prot: identification of dna binding proteins from protein sequence information using random forest. Journal of Biomolecular Structure and Dynamics, 26(6):679–686, 2009.

[20] Wei-Zhong Lin, Jian-An Fang, Xuan Xiao, and Kuo-Chen Chou. idna-prot: identification of dna binding proteins using random forest with grey model. PloS One, 6(9):e24756, 2011.

[21] Xiao-Wei Zhao, Xiang-Tao Li, Zhi-Qiang Ma, and Ming-Hao Yin. Identify dna-binding proteins with optimal chou's amino acid composition. Protein and Peptide Letters, 19(4):398–405, 2012.

[22] Kuo-Chen Chou. A novel approach to predicting protein structural classes in a (20–1)-d amino acid composition space. Proteins: Structure, Function, and Bioinformatics, 21(4):319–344, 1995.

[23] Ruifeng Xu, Jiyun Zhou, Bin Liu, Yulan He, Quan Zou, Xiaolong Wang, and Kuo-Chen Chou. Identification of dna-binding proteins by incorporating evolutionary information into pseudo amino acid composition via the top-n-gram approach. Journal of Biomolecular Structure and Dynamics, 33(8):1720–1730, 2015. PMID: 25252709.
[24] Bin Liu, Jinghao Xu, Shixi Fan, Ruifeng Xu, Jiyun Zhou, and Xiaolong Wang. Psedna-pro: Dna-binding protein identification by combining chou's pseaac and physicochemical distance transformation. Molecular Informatics, 34(1):8–17, 2015.

[25] Bin Liu, Jinghao Xu, Xun Lan, Ruifeng Xu, Jiyun Zhou, Xiaolong Wang, and Kuo-Chen Chou. idna-prot|dis: identifying dna-binding proteins by incorporating amino acid distance-pairs and reduced alphabet profile into the general pseudo amino acid composition. PloS One, 9(9):e106691, 2014.
[26] Qiwen Dong, Shanyi Wang, Kai Wang, Xuan Liu, and Bin Liu. Identification of dna-binding proteins by auto-cross covariance transformation. In Bioinformatics and Biomedicine (BIBM), 2015 IEEE International Conference on, pages 470–475. IEEE, 2015.

[27] Leyi Wei, Jijun Tang, and Quan Zou. Local-dpp: An improved dna-binding protein prediction method by exploring local evolutionary information. Information Sciences, 384:135–144, 2017.

[28] Wangchao Lou, Xiaoqing Wang, Fan Chen, Yixiao Chen, Bo Jiang, and Hua Zhang. Sequence based prediction of dna-binding proteins based on hybrid feature selection using random forest and gaussian naive bayes. PLoS One, 9(1):e86703, 2014.
[29] Rianon Zaman, Shahana Yasmin Chowdhury, Mahmood A Rashid, Alok Sharma, Abdollah Dehzangi, and Swakkhar Shatabda. Hmmbinder: Dna-binding protein prediction using hmm profile based features. BioMed Research International, 2017, 2017.

[30] Shahana Yasmin Chowdhury, Swakkhar Shatabda, and Abdollah Dehzangi. iDNAProt-ES: Identification of dna-binding proteins using evolutionary and structural features. Scientific Reports, 7(1):14938, 2017.

[31] M Saifur Rahman, Swakkhar Shatabda, Sanjay Saha, M Kaykobad, and M Sohel Rahman. Dpp-pseaac: A dna-binding protein prediction model using chou's general pseaac. Journal of Theoretical Biology, 452:22–34, 2018.

[32] Kuo-Chen Chou. Pseudo amino acid composition and its applications in bioinformatics, proteomics and system biology. Current Proteomics, 6, 12 2009.

[33] Kuo-Chen Chou. Some remarks on protein attribute prediction and pseudo amino acid composition. Journal of Theoretical Biology, 273:236–47, 03 2011.
[34] Wei Chen, Tian-Yu Lei, Dian-Chuan Jin, Hao Lin, and Kuo-Chen Chou. Pseknc: A flexible web server for generating pseudo k-tuple nucleotide composition. Analytical Biochemistry, 456, 04 2014.
[35] Wei Chen, Peng-Mian Feng, Hao Lin, and Kuo-Chen Chou. iss-psednc: Identifying splicing sites using pseudo dinucleotide composition. BioMed Research International, 2014.

[36] Wei Chen, Hao Lin, and Kuo-Chen Chou. Pseudo nucleotide composition or pseknc: an effective formulation for analyzing genomic sequences. Molecular BioSystems, 2015.

[37] Bin Liu, Fule Liu, Xiaolong Wang, Junjie Chen, Longyun Fang, and Kuo-Chen Chou. Pse-in-one: a web server for generating various modes of pseudo components of dna, rna, and protein sequences. Nucleic Acids Research, 43:W65–71, 2015.
[38] Bin Liu, Hao Wu, and Kuo-Chen Chou. Pse-in-one 2.0: An improved package of web servers for generating various modes of pseudo components of dna, rna, and protein sequences. Natural Science, 09:67–91, 01 2017.

[39] Kuo-Chen Chou. Some remarks on protein attribute prediction and pseudo amino acid composition. Journal of Theoretical Biology, 273(1):236–247, 2011.
[40] Jianhua Jia, Zi Liu, Xuan Xiao, Bingxiang Liu, and Kuo-Chen Chou. ippi-esml: An ensemble classifier for identifying the interactions of proteins by incorporating their physicochemical properties and wavelet transforms into pseaac. Journal of Theoretical Biology, 377:47–56, 2015.

[41] Lei Cai, Tao Huang, Jingjing Su, Xinxin Zhang, Wenzhong Chen, Fuquan Zhang, Lin He, and Kuo-Chen Chou. Implications of newly identified brain eqtl genes and their interactors in schizophrenia. Molecular Therapy - Nucleic Acids, 12:433–42, 2018.
[42] Jianhua Jia, Zi Liu, Xuan Xiao, Bingxiang Liu, and Kuo-Chen Chou. isuc-pseopt: Identifying lysine succinylation sites in proteins by incorporating sequence-coupling effects into pseudo components and optimizing imbalanced training dataset. Analytical Biochemistry, 497:48–56, 2016.

[43] Wei Chen, Pengmian Feng, Hui Yang, Hui Ding, Hao Lin, and Kuo-Chen Chou. irna-3typea: Identifying three types of modification at rnas adenosine sites. Molecular Therapy - Nucleic Acids, 11:468–474, 2018.
[44] Wang-Ren Qiu, Bi-Qian Sun, Xuan Xiao, Zhao-Chun Xu, Jian-Hua Jia, and Kuo-Chen Chou. ikcr-pseens: Identify lysine crotonylation sites in histone proteins with pseudo components and ensemble classifier. Genomics, 110(5):239–246, 2018.

[45] Xiang Cheng, Wei-Zhong Lin, Xuan Xiao, and Kuo-Chen Chou. ploc bal-manimal: predict subcellular localization of animal proteins by balancing training dataset and pseaac. Bioinformatics, page bty628, 2018.
[46] Jianhua Jia, Zi Liu, Xuan Xiao, Bingxiang Liu, and Kuo-Chen Chou. psuc-lys: Predict lysine succinylation sites in proteins with pseaac and ensemble random forest approach. Journal of Theoretical Biology, 394:223–230, 2016.

[47] Jia-Ming Chang, Emily Chia-Yu Su, Allan Lo, Hua-Sheng Chiu, Ting-Yi Sung, and Wen-Lian Hsu. Psldoc: Protein subcellular localization prediction based on gapped-dipeptides and probabilistic latent semantic analysis. Proteins: Structure, Function, and Bioinformatics, 72(2):693–710, 2008.

[48] Mahmoud Ghandi, Morteza Mohammad-Noori, and Michael A Beer. Robust k-mer frequency estimation using gapped k-mers. Journal of Mathematical Biology, 69(2):469–500, 2014.
[49] S Rasoul Safavian and David Landgrebe. A survey of decision tree classifier methodology. IEEE Transactions on Systems, Man, and Cybernetics, 21(3):660–674, 1991.

[50] Tin Kam Ho. Random decision forests. In Document Analysis and Recognition, 1995., Proceedings of the Third International Conference on, volume 1, pages 278–282. IEEE, 1995.

[51] Pierre Geurts, Damien Ernst, and Louis Wehenkel. Extremely randomized trees. Machine Learning, 63(1):3–42, 2006.

[52] Andrew Y Ng and Michael I Jordan. On discriminative vs. generative classifiers: A comparison of logistic regression and naive bayes. In Advances in Neural Information Processing Systems, pages 841–848, 2002.
[53] Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of machine learning. MIT press, 2012.
[54] Alan Julian Izenman. Linear discriminant analysis. In Modern multivariate statistical techniques, pages 237–280. Springer, 2013.
[55] Robert E Schapire. The boosting approach to machine learning: An overview. In Nonlinear Estimation and Classification, pages 149–171. Springer, 2003.

[56] Siddiqur Rahman, Usma Aktar, Rafsan Jani, and Swakkhar Shatabda. ipromoter-fsen: Identification of bacterial σ70 promoter sequences using feature subspace based ensemble classifier. Genomics, 2018.
[57] Md Mofijul Islam, Sanjay Saha, Md Mahmudur Rahman, Swakkhar Shatabda, Dewan Md Farid, and Abdollah Dehzangi. iprotgly-ss: identifying protein glycation sites using sequence and structure based features. Proteins: Structure, Function, and Bioinformatics, 2018.

[58] Wei Chen, Peng-Mian Feng, Hao Lin, and Kuo-Chen Chou. irspot-psednc: identify recombination spots with pseudo dinucleotide composition. Nucleic Acids Research, 41(6):e68, 2013.
[59] Xiang Cheng, Xuan Xiao, and Kuo-Chen Chou. ploc-mplant: predict subcellular localization of multi-location plant proteins by incorporating the optimal go information into general pseaac. Mol. BioSyst., 13:1722–1727, 2017.
[60] Xuan Xiao, Xiang Cheng, Genqiang Chen, Qi Mao, and Kuo-Chen Chou. ploc bal-mgpos: Predict subcellular localization of gram-positive bacterial proteins by quasi-balancing training dataset and pseaac. Genomics, 2018.

[61] Xiang Cheng, Xuan Xiao, and Kuo-Chen Chou. ploc-mvirus: Predict subcellular localization of multi-location virus proteins via incorporating the optimal go information into general pseaac. Gene, 628:315–321, 2017.
[62] Xiang Cheng, Shu-Guang Zhao, Wei-Zhong Lin, Xuan Xiao, and Kuo-Chen Chou. ploc-manimal: predict subcellular localization of animal proteins with both single and multiple sites. Bioinformatics, 33(22):3524–3531, 2017.
[63] Xuan Xiao, Xiang Cheng, Genqiang Chen, Qi Mao, and Kuo-Chen Chou. ploc bal-mgpos: Predict subcellular localization of gram-positive bacterial proteins by quasi-balancing training dataset and pseaac. Genomics, 2018.

[64] Xiang Cheng, Xuan Xiao, and Kuo-Chen Chou. ploc-mgneg: Predict subcellular localization of gram-negative bacterial proteins by deep gene ontology learning via general pseaac. Genomics, 110(4):231–239, 2018.
[65] Xiang Cheng, Xuan Xiao, and Kuo-Chen Chou. ploc-meuk: Predict subcellular localization of multi-label eukaryotic proteins by extracting the key go information into general pseaac. Genomics, 110(1):50–58, 2018.

[66] Xiang Cheng, Shu-Guang Zhao, Xuan Xiao, and Kuo-Chen Chou. iatc-misf: a multi-label classifier for predicting the classes of anatomical therapeutic chemicals. Bioinformatics, 33(3):341–346, 2017.
[67] Wang-Ren Qiu, Bi-Qian Sun, Xuan Xiao, Zhao-Chun Xu, and Kuo-Chen Chou. iptm-mlys: identifying multiple lysine ptm sites and their different types. Bioinformatics, 32(20):3116–3123, 2016.

[68] Kuo-Chen Chou. Some remarks on predicting multi-label attributes in molecular biosystems. Mol. BioSyst., 9:1092–1100, 2013.

[69] Tom Fawcett. An introduction to roc analysis. Pattern Recognition Letters, 27(8):861–874, 2006. ROC Analysis in Pattern Recognition.
[70] Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al. Scikit-learn: Machine learning in python. Journal of Machine Learning Research, 12(Oct):2825–2830, 2011.

[71] Kuo-Chen Chou and Hong-Bin Shen. Review: Recent advances in developing web-servers for predicting protein attributes. Natural Science, 1(2):30, 2009.
[72] Kuo-Chen Chou. Impacts of bioinformatics to medicinal chemistry. Medicinal chemistry (Shariqah (United Arab Emirates)), 11, 12 2014.
[73] Kuo-Chen Chou. An unprecedented revolution in medicinal chemistry driven by the progress of biological science. Current Topics in Medicinal Chemistry, 17(21):2337–2358, 2017.
[74] Peng-Mian Feng, Wei Chen, Hao Lin, and Kuo-Chen Chou. ihsp-pseraaac: Identifying the heat shock protein families using pseudo reduced amino acid alphabet composition. Analytical Biochemistry, 442(1):118–125, 2013.

[75] Wei Chen, Peng-Mian Feng, Hao Lin, and Kuo-Chen Chou. iss-psednc: Identifying splicing sites using pseudo dinucleotide composition. BioMed Research International, page 12, 2014.

[76] Wei Chen, Pengmian Feng, Hui Ding, Hao Lin, and Kuo-Chen Chou. irna-methyl: Identifying n6-methyladenosine sites using pseudo nucleotide composition. Analytical Biochemistry, 490:26–33, 2015.
[77] Wei Chen, Hui Ding, Pengmian Feng, Hao Lin, and Kuo-Chen Chou. iacp: a sequence-based tool for identifying anticancer peptides. Oncotarget, 7(13):16895–16909, 03 2016.

[78] Bin Liu, Longyun Fang, Fule Liu, Xiaolong Wang, Junjie Chen, and Kuo-Chen Chou. Identification of real microrna precursors with a pseudo structure status composition approach. PLOS ONE, 10(3):1–20, 03 2015.
[79] Bin Liu, Longyun Fang, Ren Long, Xun Lan, and Kuo-Chen Chou. ienhancer-2l: a two-layer predictor for identifying enhancers and their strength by pseudo k-tuple nucleotide composition. Bioinformatics, 32(3):362–369, 2016.
[80] Jianhua Jia, Zi Liu, Xuan Xiao, and Kuo-Chen Chou. icar-psecp: identify carbonylation sites in proteins by monte carlo sampling and incorporating sequence coupled effects into general pseaac. Oncotarget, 7:34558–70, 06 2016.

[81] Bin Liu, Hao Wu, Deyuan Zhang, and Xiaolong Wang. Pse-analysis: a python package for dna/rna and protein/peptide sequence analysis based on pseudo components and kernel methods. Oncotarget, 8:13338–43, 02 2017.
[82] Wang-Ren Qiu, Bi-Qian Sun, Xuan Xiao, and Zhao-Chun Xu. ihyd-psecp: Identify hydroxyproline and hydroxylysine in proteins by incorporating sequence-coupled effects into general pseaac. Oncotarget, 7, 06 2016.
[83] Wang-Ren Qiu, Xuan Xiao, Zhao-Chun Xu, and Kuo-Chen Chou. iphos-pseen: Identifying phosphorylation sites in proteins by fusing different pseudo components into an ensemble classifier. Oncotarget, 7:51270–83, 08 2016.