Accepted Manuscript
Effective DNA binding protein prediction by using key features via Chou's general PseAAC
Sheikh Adilina, Dewan Md Farid, Swakkhar Shatabda

PII: S0022-5193(18)30503-4
DOI: https://doi.org/10.1016/j.jtbi.2018.10.027
Reference: YJTBI 9670

To appear in: Journal of Theoretical Biology

Received date: 6 August 2018
Revised date: 7 October 2018
Accepted date: 10 October 2018

Please cite this article as: Sheikh Adilina, Dewan Md Farid, Swakkhar Shatabda, Effective DNA binding protein prediction by using key features via Chou's general PseAAC, Journal of Theoretical Biology (2018), doi: https://doi.org/10.1016/j.jtbi.2018.10.027

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.
Highlights

• Effective and simple to extract sequence based features
• Grouped and recursive feature selection
• Reduced over-fitting on the train set
• Extra Tree classifier based predictor
• Codes and data freely available
Effective DNA binding protein prediction by using key features via Chou's general PseAAC

Sheikh Adilina, Dewan Md Farid and Swakkhar Shatabda
Department of Computer Science and Engineering, United International University, Plot 2, United City, Madani Avenue, Satarkul, Badda, Dhaka-1212, Bangladesh
Email: [email protected]

Abstract

DNA-binding proteins (DBPs) are responsible for several cellular functions, ranging from the immune system to the transport of oxygen. In recent studies, scientists have used supervised machine learning based methods that use information from the protein sequence only to classify DBPs. Most of these methods work effectively on the training set, but the performance of most of them degrades on the independent test set, which leaves room for improving the prediction method by reducing over-fitting. In this paper, we have extracted several features solely from the protein sequence and carried out two different types of feature selection on them. Our results are comparable on the training set and significantly improved on the independent test set. On the independent test set our accuracy was 82.26%, a 1.62% improvement over the previous best state-of-the-art methods. Performance in terms of sensitivity and area under the receiver operating characteristic curve for the independent test set was also higher, at 0.95 and 0.823 respectively. We have made our methods and data available for other researchers at: https://github.com/SkAdilina/DNA_Binding.
1 Introduction
Proteins are chains of linked amino acids formed within the cell, as encoded by RNA transcribed from DNA sequences. Several important tasks in the immune system, muscle contraction, transport of oxygen and many other functions are carried out by proteins. Proteins bind to the DNA molecule to regulate gene expression; therefore, DNA-binding proteins (DBPs) are vital. However, recognizing DBPs is a challenge in itself. Several experimental, or in vitro, methods exist for recognition of DBPs, including genetic analysis [1], X-ray crystallography [2], chromatin immunoprecipitation on microarrays [3] and filter binding assays [4].
Many computational methods have been applied for recognition of DBPs. Classification of DBPs using machine learning algorithms is preferred by researchers, as these methods are less expensive and time saving [5], unlike the experimental methods, which are incapable of handling the sudden outburst of biological sequences in the postgenomic era. One of the most important and difficult problems in computational biology is expressing a biological sequence with a discrete model or a vector while at the same time keeping considerable sequence-order information or key pattern characteristics. This is because all the existing machine learning algorithms can only handle vectors, not sequence samples, as elucidated in a comprehensive review [6]. A vector defined in a discrete model may completely lose all the sequence-pattern information. To avoid this problem, the pseudo amino acid composition [7], or PseAAC [8], was proposed by Chou. Since then, the concept has been widely used in nearly all areas of computational proteomics [9-14] and also in the long list of papers cited in [15].
In recent years, scientists have worked on several methods of identifying DBPs using the sequence alone. Back in 2007, DNABinder [16] was proposed. Even though DNABinder turned out to be efficient, it was heavily dependent on the database being used. In the same year another method was proposed [17] in which PseAAC was used alongside dipeptide composition and auto-cross covariance (ACC) to form numeric series. Each of these series was passed through a Support Vector Machine (SVM) to obtain the final result.

In 2010, Robert E. Langlois and Hui Lu came up with BLAST [18], a pattern-based machine learning protocol to identify DNA-binding proteins. Since then, many improvements have been made in prediction of DBPs using information from the protein sequence only. Features were extracted from the sequence, without any dependency on other information, and were trained using a Random Forest classifier in DNA-Prot [19]. A Grey Model was added to the DNA-Prot predictor to create iDNA-Prot [20]. Promising results were achieved in [21], where the correlation and weighting factors of PseAAC were used and experiments were carried out using the Random Forest classifier.

Chou's general form of pseudo amino acid composition (PseAAC) [22] was incorporated with evolutionary information and added to the features, and then a Support Vector Machine was used to form the predictor iDNAPro-PseAAC [5]. Evolutionary information was also included alongside PseAAC by using the top-n-gram approach in [23]. Combining PseAAC with overall amino acid composition, physicochemical composition distance transformation and SVM resulted in an accuracy of 80% on the test dataset [24]. The iDNAPro-PseAAC outperformed a handful of state-of-the-art methods. Distance coupling, alphabet reduction and reduction in the dimension of the PseAAC feature set in the iDNA-Prot|dis [25] predictor made the classification process faster. This new technique outperformed all the then-existing predictors. In Kmer1 + ACC [26], proteins were transformed into vectors of the same length using kmer composition and then input to a support vector machine to identify DNA-binding proteins. Wei et al. used local Pse-PSSM (Pseudo Position-Specific Scoring Matrix) features in Local-DPP [27] and trained the features using a Random Forest classifier. PSSM and ACC were used in the making of DBPred [28], where again a Random Forest classifier was used to rank the features; however, a Gaussian Naive Bayes classifier was used to finally train the feature set.
HMM profile based monogram and bigram features were shown to be more effective for finding DBPs than PSSM profile based features; HMMBinder was proposed based on these features [29]. A mix of structural and evolutionary features was used in iDNAProt-ES, proposed in [30]. The concept of PseAAC was again used in a recent work [31] to form the DPP-PseAAC predictor. In DPP-PseAAC, features were ranked with the help of a random forest classifier, and features were deleted until an optimal feature set was achieved. A Support Vector Machine was then used to train the feature set, and the experimental results showed a very high accuracy rate.
Due to the widespread increase in the use of PseAAC, three powerful open access software tools, called PseAACBuilder, propy and PseAACGeneral, were recently established: the former two are for generating various modes of Chou's special PseAAC [32], while the third is for those of Chou's general PseAAC [33], including not only all the special modes of feature vectors for proteins but also higher level feature vectors such as the "Functional Domain" mode [33], "Gene Ontology" mode [33], and "Sequential Evolution" or "PSSM" mode [33]. Encouraged by the successes of using PseAAC to deal with protein/peptide sequences, the concept of PseKNC (Pseudo K-tuple Nucleotide Composition) [34] was developed for generating various feature vectors for DNA/RNA sequences [35,36], which have proved very useful as well. In particular, powerful web-servers called Pse-in-One [37] and Pse-in-One 2.0 [38] (the updated version of [37]) were recently established. These web-servers can be used to generate any desired feature vectors for protein/peptide and DNA/RNA sequences according to the needs of users' studies.

In this paper, we focus only on sequence based features, as they have proved effective and simple to generate. We have extracted several types of features using only the protein sequences. We also discuss two different methods of carrying out feature selection: first, we grouped similar features together and recorded the performance of individual groups as well as of selected combinations of groups. Second, we performed recursive feature selection on the entire feature set and selected the best features. Based on these features we have built classification models for prediction of DBPs. We have trained and tested our models on standard benchmark datasets. Moreover, we compared our results with the state-of-the-art predictors and obtained promising results. The accuracy of our feature selection methods is comparable to the other predictors on the training set and shows significant improvements on the independent test set. We have made our methods and data available for other researchers at: https://github.com/SkAdilina/DNA_Binding.
2 Materials and Methods

In this section, we describe the datasets and the methods used in this paper. Kuo-Chen Chou [39] suggested his famous five-step rule for any machine learning based protein attribute classification research, which has also been followed in several other recent papers [40-46]. The five steps are: i) selection of standard benchmark datasets, ii) protein sample representation, iii) selection of algorithms, iv) performance evaluation, and v) development of a web application. We follow the same steps in this paper, except that we did not implement any web application. The rest of this section describes the first four steps in detail.

A brief summary of our work is given in Figure 1. Initially, we extracted features from the protein sequence only. Then we performed two types of feature selection on the extracted features: Grouped Feature Selection and Recursive Feature Selection. We recorded the train accuracy and selected the best feature sets from each of the feature selection methods. We then tested the accuracy of these feature sets on the independent test dataset.

Figure 1: Overview of methodology used in this paper.

2.1 Dataset Description
The problem of prediction of DBPs is formulated as a binary classification problem. For any binary classification problem in the supervised learning setting, construction of training and testing datasets is the primary task. Any dataset for a binary classification problem consists of a subset of positive samples and another subset of negative samples. Formally:

S = S+ ∪ S−    (1)

Here S+ is the set of DBPs or positive samples and S− is the set of non-DBPs or negative samples. As the training dataset, we have selected the most widely used dataset in the literature [27-31], known as the benchmark dataset. It contains 1075 protein sequences, 525 of which are positive samples (DNA-binding proteins) and 550 are negative samples (non DNA-binding proteins). As the independent test set, we have used the independent set proposed in [28]. This dataset contains a total of 186 instances with an equal number of positive and negative samples. A summary of the datasets used in this paper is given in Table 1.
Table 1: Summary of the datasets.

Dataset                 DNA Binding Proteins   Non-Binding Proteins   Total
Training Dataset        525                    550                    1075
Independent Test Set    93                     93                     186

2.2 Protein Sample Representation
Each sample in the dataset is a sequence of amino acid residues. Formally,

P = R_1 R_2 R_3 · · · R_L    (2)

Here, P ∈ S is a protein sequence of length L and the R_i are individual amino acid residue symbols. Each of these samples is either positive or negative. Positive samples, or DBPs, are labeled +1, and negative samples, or non-DBPs, are labeled 0. In the next phase each of these samples is used to extract sequence based features. The remainder of this section describes the details of the features extracted in this paper.

1. Monograms: The frequency of each amino acid was counted and normalized by dividing the count by the length of the protein sequence. Since there are 20 different amino acids, this group contains 20 features.

F_A = (1/L) Σ_{i=1}^{L} match(R_i, a_j)    (3)

where:
L = length of the protein sequence
a_j = an amino acid symbol in the amino acid alphabet Σ
R_i = the amino acid residue at position i in protein P

The function match compares any two strings. Formally,

match(s_1, s_2) = 1 if s_1 == s_2, and 0 otherwise.    (4)
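As an illustration, the monogram feature (group A) amounts to a normalized residue count. The sketch below assumes a standard 20-letter alphabet; the function name and example sequence are ours, not the authors' code:

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20-letter alphabet Sigma

def monogram_features(sequence):
    """Normalized frequency of each amino acid (sketch of Eq. 3): 20 features."""
    counts = Counter(sequence)
    length = len(sequence)
    return [counts[a] / length for a in AMINO_ACIDS]

features = monogram_features("MKVLAAGK")
print(len(features))  # 20
```

Because every residue of a valid sequence is counted exactly once, the 20 values sum to 1 for sequences drawn from Σ.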
Thus we get 20 different monograms.

2. Bigram: The frequency of each pair of two consecutive amino acids was counted and then normalized. Since there are 20 different amino acids, taking all possible pairs results in a total of 400 combinations.

F_B = (1/L) Σ_{i=1}^{L−1} match(R_i R_{i+1}, s_j)    (5)

where:
s_j = a di-amino acid string taken from Σ^2

3. Trigram: In total, 8000 features were created by counting each triple of three consecutive amino acids. The feature was normalized by dividing the count by the length of the protein instance.

F_C = (1/L) Σ_{i=1}^{L−2} match(R_i R_{i+1} R_{i+2}, s_j)    (6)

where:
s_j = a tri-amino acid string taken from Σ^3

4. Gapped bigram: Gapped bigrams [47,48] were formed by counting the occurrences of two amino acids at a certain distance, ignoring the sequence in between.

F_D = (1/L) Σ_{i=1}^{L−g} match(R_i R_{i+g+1}, s_j),  (1 ≤ g ≤ 20)    (7)

where:
s_j = a di-amino acid string taken from Σ^2
g = the gap between the amino acids

In this paper, we have used gaps g = 1, 2, · · · , 20. Thus the total number of features generated was 8000.
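A sketch of the gapped bigram computation, under the same assumptions as before (our function names, a standard 20-letter alphabet); for each gap g it emits 400 normalized pair counts, so 20 gaps give the 8000 features of group D:

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"                           # the alphabet Sigma
PAIRS = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]   # Sigma^2: 400 pairs

def gapped_bigram_features(sequence, max_gap=20):
    """Normalized counts of residue pairs g positions apart (sketch of Eq. 7)."""
    length = len(sequence)
    features = []
    for g in range(1, max_gap + 1):
        counts = dict.fromkeys(PAIRS, 0)
        # pair R_i with R_{i+g+1}, skipping the g residues in between
        for i in range(length - g - 1):
            pair = sequence[i] + sequence[i + g + 1]
            if pair in counts:
                counts[pair] += 1
        features.extend(c / length for c in counts.values())
    return features
```

For example, with max_gap = 1 the sequence "ACDE" yields the pairs "AD" and "CE", each normalized by the sequence length.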
5. Monogram Percentile Separation: The purpose of this feature is to track the occurrences of each amino acid within a certain prefix of the protein sequence. This was done by counting each amino acid in partial sequences: the number of occurrences of each amino acid within the first 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90% and 100% of the entire sequence was recorded and divided by the length of the partial sequence. This was done for all 20 amino acids.

F_E = (1/P_L) Σ_{i=1}^{P_L} match(R_i, a_j)    (8)

where:
P_L = length of the partial protein sequence
a_j = an amino acid taken from the alphabet Σ
6. Bigram Percentile Separation: This feature was extracted using the same concept as the Monogram Percentile Separation feature. The only difference is that instead of individual amino acids, pairs of 2 consecutive amino acids were considered.

F_F = (1/P_L) Σ_{i=1}^{P_L−1} match(R_i R_{i+1}, s_j)    (9)

where:
P_L = length of the partial protein sequence
s_j = a di-amino acid string taken from the alphabet Σ^2

7. Nearest Neighbor Bigram: If amino acid a_j is the closest occurrence to amino acid a_i, then a_i and a_j are said to be nearest neighbors. This concept was used in creating this feature set. Here, the first 30 nearest occurrences of each amino acid pair were considered, creating 12000 features in total. The Nearest Neighbor Bigram [25] feature set was normalized in the same way as the previous feature sets.

F_G = (1/L) distance(a_i, a_k),  (1 ≤ i ≤ 20, k = 1, 2, 3, · · · , 30)    (10)

where:
distance(x, y) = distance between the positions of x and y in the sequence
a_i = an occurrence of amino acid a at position i
a_k = the k-th nearest neighbor of a_i, for all possible symbols a ∈ Σ
Thus, each protein sequence P ∈ S is converted into a feature vector as follows:

F = [F_A F_B F_C F_D F_E F_F F_G]    (11)

A summary of the features generated in this paper is given in Table 2.

Table 2: Summary of Features.

Group Name   Feature Name               Total No. of Features
A            monogram                   20
B            bigram                     400
C            trigram                    8000
D            gapped bigram              8000
E            monogram percentile        200
F            bigram percentile          4000
G            nearest neighbor bigram    12000
                                        Total = 32620

2.3 Feature Selection
In this section, we describe the two feature selection algorithms that we have used in this paper.

2.3.1 Grouped Feature Selection

The pseudo-code for the grouped feature selection is given in Algorithm 1. The basic idea of the algorithm is to add feature groups following a forward selection strategy. The algorithm starts with an empty set, and in each iteration it adds a single group not yet in the set to see whether the training accuracy improves. The algorithm stops adding further groups when adding them no longer improves training accuracy. The details of the experimental results found with this feature selection technique are presented in the next section.
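The forward strategy described above can be sketched as a greedy loop over feature groups. Here `evaluate` stands in for the 10-fold cross-validated train accuracy; the toy scoring function in the usage lines is ours and only illustrates the stopping rule, not the paper's actual accuracies:

```python
def grouped_forward_selection(groups, evaluate):
    """Greedy forward selection over feature groups (a sketch, not the
    authors' exact implementation).

    `groups` maps group names to their feature columns; `evaluate` is assumed
    to return the cross-validated train accuracy for a list of group names.
    """
    selected = []
    best_accuracy = evaluate(selected)
    while True:
        remaining = [g for g in groups if g not in selected]
        if not remaining:
            break
        # try adding each remaining group to the current best set
        accuracy, group = max((evaluate(selected + [g]), g) for g in remaining)
        if accuracy <= best_accuracy:
            break  # no improvement: stop adding groups
        selected.append(group)
        best_accuracy = accuracy
    return selected, best_accuracy

# toy usage: the score peaks once groups G, C, E and F are all included
groups = {name: None for name in "ABCDEFG"}
evaluate = lambda s: len(set(s) & set("GCEF")) - 0.5 * len(set(s) - set("GCEF"))
selected, best = grouped_forward_selection(groups, evaluate)
print(sorted(selected))  # ['C', 'E', 'F', 'G']
```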
Algorithm 1 Grouped Feature Selection
1: group_size ← 0
2: best_group ← empty
3: max_accuracy ← 0
4: new_accuracy ← 0
5: do
6:   increment group_size
7:   max_accuracy ← new_accuracy
8:   for all remaining individual groups do
9:     add 1 group to best_group
10:    perform 10-fold cross validation on train dataset
11:    record accuracy for all classifiers
12:  for each combination do
13:    calculate frequency of best train accuracy
14:    calculate average train accuracy
15:  identify group with maximum frequency of best train accuracy
16:  add group to best_group
17:  identify group with maximum average train accuracy
18:  update new_accuracy
19:  add group to best_group
20: while (new_accuracy > max_accuracy)

2.3.2 Recursive Feature Selection

In recursive feature selection (summarized in Algorithm 2), we ran the Random Forest classifier on the benchmark dataset using the entire feature set. In each loop the feature with the lowest importance ranking was removed from the feature set. This was repeated until no feature was left. We then compared the results, selected the feature set with the maximum train accuracy and tested it on the independent dataset.

Algorithm 2 Recursive Feature Selection
1: total_features ← countFeatures()
2: while total_features > 0 do
3:   perform 10-fold cross validation on train dataset using Random Forest classifier
4:   rank features by importance
5:   remove feature with least ranking
6:   decrement total_features
7: identify feature set with highest train accuracy
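This kind of recursive elimination is also available in the scikit-learn library that the experiments in this paper use. The sketch below runs `RFE` with a Random Forest on synthetic data; the dataset, parameters and final feature count are our illustrative choices, not the paper's 32,620-feature pipeline:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

# synthetic stand-in for the benchmark feature matrix (assumption, not the paper's data)
X, y = make_classification(n_samples=200, n_features=50,
                           n_informative=10, random_state=0)

# recursively drop the least important feature, one per round, until 10 remain
selector = RFE(estimator=RandomForestClassifier(n_estimators=50, random_state=0),
               n_features_to_select=10, step=1)
selector.fit(X, y)

X_reduced = X[:, selector.support_]
print(X_reduced.shape)  # (200, 10)
```

With `step=1` the estimator is refit after each single-feature removal, mirroring the one-at-a-time elimination of Algorithm 2.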
2.4 Classification Algorithms

Several classification algorithms are used in the experiments in this paper. In this section, we provide a very brief description of each algorithm.
1. Decision Tree: Decision tree [49] classifiers are tree based classifiers where attributes are used in a hierarchical manner to find the labels of the samples as leaves of a decision tree.

2. Random Forest: Random forests [50] are an ensemble of decision trees, each learned using randomly selected features at each split of the algorithm.

3. Extra Tree Classifier: Extra tree classifiers [51] are similar to random forests, except that the split thresholds are also chosen at random rather than searched for exhaustively.

4. Logistic Regression: The logistic regression classifier [52] is a linear classifier that uses a hyperplane to separate the samples in a binary classification problem. It uses a decision rule of the following form:

ŷ = sign(w_0 + w_1 x_1 + · · · + w_n x_n)

5. K-Nearest Neighbor: The KNN algorithm [53] is a lazy, instance based classification algorithm where the label of a given instance is decided based on the nearest neighbors of that instance in the feature space.

6. Linear Discriminant Analysis: Linear discriminant analysis [54] expresses the dependent variable or label as a linear combination of the other features. It looks for those features that can best explain the data.

7. Support Vector Machine: The support vector machine is a maximum margin classifier that tries to separate the two classes in a binary classification problem. It uses a decision rule of the following form:

h(x) = sign( Σ_j α_j y_j (x · x_j) − b )

Here, the x_j are the support vectors that define the maximum margin.

8. Gaussian Naive Bayes: The naive Bayes classifier [52] uses a maximum a posteriori decision rule of the following form:

ŷ = argmax_{k ∈ {1,···,K}} p(C_k) Π_{i=1}^{n} p(x_i | C_k)

9. AdaBoost: The adaptive boosting algorithm, or AdaBoost [55], is an ensemble approach where a number of weak classifiers are boosted by adaptively re-weighting wrongly classified instances at each iteration of the algorithm. The decision rule used by an AdaBoost classifier is as follows:

h(x) = sign(α_1 h_1(x) + α_2 h_2(x) + · · · )

Here, h_i is the weak classifier at iteration i and α_i is the weight associated with it.
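All nine classifiers are available in the scikit-learn library used for the experiments in this paper. The sketch below instantiates them with default parameters (our choice, not necessarily the paper's settings) and scores each by 10-fold cross-validation on synthetic data:

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (RandomForestClassifier, ExtraTreesClassifier,
                              AdaBoostClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score
from sklearn.datasets import make_classification

classifiers = {
    "Decision Tree": DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(),
    "Extra Tree Classifier": ExtraTreesClassifier(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "K-Nearest Neighbour": KNeighborsClassifier(),
    "Linear Discriminant Analysis": LinearDiscriminantAnalysis(),
    "Support Vector Machine": SVC(),
    "Gaussian Naive Bayes": GaussianNB(),
    "AdaBoost": AdaBoostClassifier(),
}

# synthetic stand-in data; the paper uses the benchmark feature matrix instead
X, y = make_classification(n_samples=300, n_features=20, random_state=0)
for name, clf in classifiers.items():
    scores = cross_val_score(clf, X, y, cv=10, scoring="accuracy")
    print(f"{name}: {scores.mean():.3f}")
```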
2.5 Performance Evaluation
In order to validate any machine learning based prediction method, it is very important to select the evaluation methods and metrics. In this paper we use six performance evaluation metrics that are widely used in the literature, including for prediction of DBPs [27,29-31,56,57]. They are: Accuracy, Sensitivity or Recall, Specificity, the Matthews correlation coefficient (MCC) [58], the area under the receiver operating characteristic curve (auROC) and the area under the precision-recall curve (auPR). The first four metrics are defined below:

Accuracy = (tp + tn) / (tp + tn + fp + fn)    (12)

Sensitivity = tp / p = tp / (tp + fn)    (13)

Specificity = tn / n = tn / (tn + fp)    (14)

MCC = ((tp × tn) − (fp × fn)) / sqrt((tp + fp)(tp + fn)(tn + fp)(tn + fn))    (15)

where:
n = number of real negative samples
p = number of real positive samples
tp = number of positive samples correctly predicted as positive
tn = number of negative samples correctly predicted as negative
fp = number of negative samples incorrectly predicted as positive
fn = number of positive samples incorrectly predicted as negative

The first three metrics (Accuracy, Sensitivity or Recall, and Specificity) have their values in the range [0,1]. Here a maximum value of 1 means a perfect classifier and a minimum value of 0 means the worst classifier possible.

The fourth metric, MCC, has been widely used to evaluate classification quality for single-label systems. The value of MCC ranges from +1 to −1, where +1 refers to a perfect classifier, −1 refers to the worst possible classifier and 0 refers to a random classifier. Having said that, multi-label systems are becoming more common in recent years in systems biology [59-65], systems medicine [66] and biomedicine [67]. None of these metrics is valid for multi-label systems, where an instance can belong to several classes at the same time. Fortunately, a completely different set of metrics exists for that setting and is defined in [68].
The other two metrics, auROC and auPR, are more appropriate for probabilistic predictors that depend on thresholds for the prediction of labels. The receiver operating characteristic (ROC) curve has the TPR (true positive rate) on the y-axis and the FPR (false positive rate) on the x-axis, where TPR is the ratio of tp to p and FPR is the ratio of fp to n. The TPR and FPR are calculated for all possible thresholds and plotted, and auROC is the area under this curve [69]. The area under the precision-recall curve (auPR) is a similar metric. Both of these metrics have their values in the range [0,1]; 1 means a perfect classifier, and for auROC a value of 0.5 means a random classifier in a binary classification task.
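All six metrics can be computed with scikit-learn. The toy labels and scores below are ours and serve only to illustrate the definitions (specificity is obtained as the recall of the negative class):

```python
from sklearn.metrics import (accuracy_score, recall_score, matthews_corrcoef,
                             roc_auc_score, average_precision_score)

y_true = [1, 1, 1, 0, 0, 0, 1, 0]                     # 1 = DBP, 0 = non-DBP
y_prob = [0.9, 0.8, 0.4, 0.3, 0.2, 0.6, 0.7, 0.1]     # predicted P(DBP)
y_pred = [1 if p >= 0.5 else 0 for p in y_prob]       # threshold at 0.5

print("Accuracy   :", accuracy_score(y_true, y_pred))            # 0.75
print("Sensitivity:", recall_score(y_true, y_pred))               # tp / (tp + fn)
print("Specificity:", recall_score(y_true, y_pred, pos_label=0))  # tn / (tn + fp)
print("MCC        :", matthews_corrcoef(y_true, y_pred))          # 0.5
print("auROC      :", roc_auc_score(y_true, y_prob))              # threshold-free
print("auPR       :", average_precision_score(y_true, y_prob))
```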
Another important decision is the sampling method used to validate the predictors. Two methods are used in this paper: k-fold cross validation and an independent test set. The k-fold cross validation technique is used for training the classifiers on the benchmark dataset. In k-fold cross validation, the training dataset is partitioned into k sub-datasets of the same size. In each iteration, k − 1 sub-datasets are used for training and the one remaining sub-dataset is used for testing. This process is carried out k times and the final result is the average of all k results. We used k = 10 for all of our experiments on the training dataset. However, since an independent test set is available for this problem and widely used in the literature, we have also tested the trained models using that dataset.
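The k-fold protocol can be sketched with scikit-learn's `KFold`. The toy data and the stand-in fold score are ours; we use k = 5 here for brevity, whereas the paper uses k = 10:

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(10)                                   # 10 toy samples
kf = KFold(n_splits=5, shuffle=True, random_state=0)

scores = []
for train_idx, test_idx in kf.split(X):
    # in a real run: fit on X[train_idx], evaluate on X[test_idx]
    scores.append(len(train_idx) / len(X))          # stand-in for a fold accuracy
final = np.mean(scores)                             # report the average over folds
print(final)  # 0.8: each fold trains on k - 1 = 4 of the 5 parts
```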
3 Results and Discussion

In this section, we describe the details of the experiments done in this paper. All experiments were conducted on a computing machine provided by CITS, United International University. The machine was equipped with an 8-core Dell R730 Intel Xeon processor (E5-2630 v3) running at 2.4 GHz, with 18.5 GB of memory. All programs were written in Python 3.6 using the Scikit-learn library [70]. Each experiment was run 10 times and only the averages are reported.
3.1 Results on Grouped Feature Selection

In grouped feature selection (shown in Algorithm 1), the entire feature set was divided into 7 different groups (A, B, C, D, E, F and G), described in Table 2. In the beginning, we calculated the train accuracy of each group. We then selected the best group using two different criteria:

1. The group with the highest frequency of best train accuracy
2. The group with the maximum average train accuracy

We took the best feature group and added it to a running set of features, each time adding another group to it. The best performing group was kept constant, and the accuracy of every possible combination obtained by adding one more group was recorded. We stopped adding groups when the accuracy stopped increasing and started to decrease instead. We used the several classification algorithms mentioned in the previous section.

First, we used only the individual feature groups to see their performance on the train set. The experimental results were recorded in terms of train accuracy. In Figure 2, we show the histograms of the feature groups achieving the best results in each iteration of the experiments. We note that Group A achieved the best train accuracy the highest number of times. In Figure 3, the average train accuracy achieved by the different feature groups is shown; here, the best average accuracy was achieved by Group G.
Figure 2: Frequency of maximum train accuracy of individual Groups A, B, C, D, E, F and G. We selected Group A since it has the highest frequency.

Figure 3: Average train accuracy of individual Groups A, B, C, D, E, F and G. We selected the group with the maximum average accuracy, i.e. Group G.
Then, we added another group to the best groups recorded in the previous iteration and recorded the performance of all possible combinations obtained by adding one more group at each iteration. As in the previous iteration, the frequency of best train accuracy and the average train accuracy were taken as the selection criteria. Figure 4 and Figure 5 show the plots of these two measures for the groups in the second round. We selected Groups AC and AG from the combinations of A, and Groups GC and GF from the combinations of G.

Figure 4: Frequency of maximum train accuracy of groups in the second loop. We selected Group AG from the combinations of A and Groups GC and GF from the combinations of G.
Figure 5: Average train accuracy of groups in the second loop. We selected Group AC from the combinations of A and Group GC from the combinations of G.

In the third iteration of the loop, as can be seen from Figure 6 and Figure 7, the group combinations AGC, AGF, GCB, GCE and GCF performed the best.

Figure 6: Frequency of maximum train accuracy of groups in the third loop. We selected the best combinations for each best performing group (AC, AG, GC, GF) chosen at the end of the second iteration of the loop. Groups AGF, GFA, GCE, GCF and ACG were selected. (We ignored ACF in this loop because, taking the overall average into account, ACG clearly outperforms ACF.)
Figure 7: Average train accuracy of groups in the third iteration of the loop. The best combinations were Groups ACG, AGF, GCB, GCE and GCF.

We continued the experiments and stopped when the accuracy began to decrease. We achieved our final result in the fourth iteration of the loop: the combination GCEF had the highest frequency of best train accuracy and also the maximum average train accuracy among all the combinations. The results are shown graphically in Figure 8 and Figure 9.

Figure 8: Frequency of maximum train accuracy of groups in the fourth loop.

Figure 9: Average train accuracy of groups in the fourth loop.
3.1.1 Classifier Selection

From the experiments in the feature selection step, we conclude that the best performing groups are G, C, E and F. The next step was to select the best performing classifier for this selected group of features. We tried the classifiers mentioned in the previous section. The detailed results, in terms of train set accuracy found in cross validation and test set accuracy, are given in Figure 10, Table 3 and Table 4. From Figure 10, it can be seen that the best classifier is the Extra Tree classifier. The Random Forest classifier follows closely with the second best train and test accuracy. While Support Vector Machine, Logistic Regression and AdaBoost have fairly good train and test accuracy, Decision Tree and Gaussian Naive Bayes have the lowest train and test accuracy respectively. We obtained the best train accuracy using the Linear Discriminant Analysis (LDA) classifier; however, its test accuracy is among the worst. We can say that over-fitting has occurred, and therefore LDA cannot be considered a good classifier here.
Figure 10: Graphical representation of train and test accuracy of the best group combination, GCEF.

Table 3: Summary of experimental results of the best group combination (GCEF) on the training set.

Classifier                     Accuracy   Recall   Specificity   auROC   auPR    MCC
Random Forest                  68.15      0.59     0.771         0.735   0.702   0.366
Extra Tree Classifier          70.21      0.61     0.797         0.751   0.721   0.407
Support Vector Machine         67.97      0.68     0.682         0.726   0.699   0.363
Logistic Regression            67.87      0.67     0.687         0.735   0.710   0.361
AdaBoost                       67.70      0.65     0.699         0.736   0.726   0.355
Decision Tree                  60.74      0.60     0.613         0.606   0.567   0.202
Gaussian Naive Bayes           63.79      0.82     0.459         0.643   0.578   0.305
K-Nearest Neighbour            66.49      0.80     0.530         0.725   0.671   0.349
Linear Discriminant Analysis   73.36      0.74     0.727         0.793   0.784   0.470
Table 4: Summary of Experimental Results of the Best Group Combination (GCEF) on the Test Set

Classifier                    Accuracy  Recall  Specificity  auROC  auPR   MCC
Random Forest                 77.96     0.78    0.780        0.780  0.702  0.577
Extra Tree Classifier         82.26     0.95    0.699        0.823  0.745  0.666
Support Vector Machine        72.04     0.94    0.505        0.720  0.644  0.488
Logistic Regression           73.12     0.94    0.527        0.731  0.654  0.507
AdaBoost                      70.97     0.85    0.570        0.710  0.639  0.437
Decision Tree                 75.81     0.94    0.581        0.758  0.678  0.483
Gaussian Naive Bayes          62.37     0.86    0.387        0.624  0.572  0.281
K-Nearest Neighbour           63.44     0.84    0.430        0.634  0.580  0.295
Linear Discriminant Analysis  68.82     0.86    0.516        0.688  0.620  0.401

3.2 Results on Recursive Feature Selection
After the recursive feature selection, we counted the number of features from the different groups present in the best feature set. Note that this experiment was only done on the train set in the first phase. The percentage of features present in the best feature set from each feature group is shown in Table 5, from which we can further deduce that the best performing groups are C, D, E and F.

Table 5: Summary of the percentage of features of each group present in the feature set of the Recursive Feature Selection

Group  Total features  Features present  Percentage
A      20              0                 0.00%
B      400             0                 0.00%
C      8000            7986              99.83%
D      8000            5749              71.86%
E      200             188               94.00%
F      4000            3904              97.60%
G      12000           11995             99.96%
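The per-group counting of this kind can be sketched as follows, assuming scikit-learn's RFE; the group boundaries and sizes below are small illustrative stand-ins, not the real group sizes:

```python
# Illustrative count of features surviving recursive feature
# elimination (RFE) per feature group.
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=100, n_features=30, random_state=0)
# Hypothetical group boundaries over the 30 columns.
groups = {"A": range(0, 10), "B": range(10, 20), "C": range(20, 30)}

rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=15).fit(X, y)
for name, idx in groups.items():
    kept = int(rfe.support_[list(idx)].sum())  # features of this group kept
    print(f"Group {name}: {kept}/{len(idx)} ({100.0 * kept / len(idx):.1f}%)")
```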
3.2.1 Classifier Selection
Here too, we have performed another set of experiments to select the best classifier. The results are depicted in Figure 11, Table 6 and Table 7. In Figure 11, we can clearly see that the Extra Tree Classifier provides the highest test accuracy but its train accuracy is very low. Consequently, we have selected the Random Forest classifier as the best classifier for the recursive feature selection. AdaBoost has the third best test accuracy. The classification behavior of Linear Discriminant Analysis (LDA), Support Vector Machine, Logistic Regression, Decision Tree and Gaussian Naive Bayes is the same as in the Grouped Feature Selection experiments. After selecting the best feature subset using the recursive feature selection, we tested it on the testing dataset. The experimental results are provided in Table 7, and the detailed experimental results of the best feature set on the training dataset are given in Table 6.
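The train-versus-test comparison used throughout this selection step can be sketched as below; the dataset and classifier settings are illustrative:

```python
# Illustrative train-vs-test check: cross-validation accuracy on the
# train split against accuracy on a held-out split; a large gap
# between the two would indicate overfitting.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_classification(n_samples=400, n_features=30, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=1)

clf = RandomForestClassifier(n_estimators=100, random_state=1)
cv_acc = cross_val_score(clf, X_tr, y_tr, cv=10).mean()  # train-side estimate
test_acc = clf.fit(X_tr, y_tr).score(X_te, y_te)         # held-out accuracy
print(f"CV accuracy: {cv_acc:.3f}, test accuracy: {test_acc:.3f}")
```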
Table 6: Summary of Experimental Results of the Best Feature Set on the Training Set (Recursive Feature Selection)

Classifier                    Accuracy  Recall  Specificity  auROC  auPR   MCC
Random Forest                 71.04     0.62    0.799        0.751  0.724  0.426
Extra Tree Classifier         67.78     0.59    0.762        0.732  0.699  0.358
Support Vector Machine        67.97     0.67    0.685        0.726  0.696  0.363
Logistic Regression           68.24     0.67    0.695        0.734  0.707  0.368
AdaBoost                      69.27     0.66    0.725        0.751  0.735  0.385
Decision Tree                 63.34     0.63    0.637        0.630  0.574  0.242
Gaussian Naive Bayes          60.16     0.45    0.756        0.601  0.558  0.206
K-Nearest Neighbour           66.21     0.80    0.528        0.724  0.671  0.343
Linear Discriminant Analysis  73.63     0.74    0.729        0.798  0.787  0.476
Table 7: Summary of Experimental Results of the Best Feature Set on the Test Set (Recursive Feature Selection)

Classifier                    Accuracy  Recall  Specificity  auROC  auPR   MCC
Random Forest                 76.88     0.77    0.769        0.769  0.696  0.547
Extra Tree Classifier         79.03     0.91    0.667        0.790  0.713  0.599
Support Vector Machine        72.04     0.94    0.505        0.720  0.644  0.488
Logistic Regression           73.12     0.94    0.527        0.731  0.654  0.507
AdaBoost                      76.34     0.94    0.591        0.763  0.683  0.561
Decision Tree                 73.66     0.92    0.548        0.737  0.659  0.543
Gaussian Naive Bayes          73.12     0.86    0.602        0.731  0.658  0.479
K-Nearest Neighbour           62.90     0.84    0.419        0.629  0.576  0.284
Linear Discriminant Analysis  69.89     0.81    0.591        0.699  0.632  0.407
Figure 11: Graphical representation of Train and Test Accuracy of the best feature set obtained using the Recursive Feature Selection method

3.3 Comparison Using ROC Analysis
We have also compared the ROC curves of both the grouped and recursive feature selection methods, as we stated ROC as one of our performance evaluation metrics. Figure 12 contains the ROC curves of both grouped and recursive feature selection on the training dataset, and Figure 13 contains the results on the test dataset. Both figures contain a red dotted line, which represents the ROC curve of a predictor that is no better than random guessing; the further away a predictor's ROC curve is from this line, the better its performance. Even though the accuracies of the two feature selection methods on the train dataset are slightly different, their ROC curves are almost the same, as can be clearly seen in Figure 12. However, there is a noticeable difference in the performance of the methods on the test dataset. It can be seen in Figure 13 that the curve of Recursive Feature Selection is a little closer to the red dotted line. Hence, we can conclude that the Grouped Feature Selection method is better than the Recursive Feature Selection method.
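A minimal sketch of such an auROC comparison, using scikit-learn on synthetic data (a predictor on the red dotted diagonal, i.e. random guessing, would score 0.5):

```python
# Illustrative auROC comparison of the two final models; data and
# hyperparameters are stand-ins, not our actual pipelines.
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

models = [("Grouped FS (Extra Trees)", ExtraTreesClassifier(random_state=0)),
          ("Recursive FS (Random Forest)", RandomForestClassifier(random_state=0))]
for name, clf in models:
    # Class-1 probabilities give the scores for the ROC analysis.
    probs = clf.fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
    print(f"{name}: auROC = {roc_auc_score(y_te, probs):.3f}")
```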
Figure 12: ROC curve of the Performance of both Feature Sets on Training Data
Figure 13: ROC curve of the Performance of both Feature Sets on Testing Data
3.4 Comparison With Other Methods

From the previous section, the Extra Tree Classifier emerged as the best performing classification algorithm on the features selected using the Grouped Feature Selection method, while Random Forest performed best on the features found by Recursive Feature Selection. To strengthen our claim, we also compared the performance of these two methods, on both the benchmark training dataset and the test dataset, with other state-of-the-art methods. We have considered nine previous methods: DNAProt [19], DNABinder [16], Kmer1+ACC [26], iDNAProt [20], iDNAPro-PseAAC [5], HMMBinder [29], Local-DPP [27], iDNAProt-ES [30] and DPP-PseAAC [31]. We did not re-run the other predictors since we are working on the same datasets; we have taken the results as reported in the literature [29, 31].
Table 8: Comparison of Performance with the state-of-the-art methods on the Training Set

Classifier                   Accuracy  Recall  Specificity  auROC  auPR   MCC
Grouped Feature Selection    70.21     0.61    0.797        0.751  0.721  0.41
Recursive Feature Selection  71.04     0.62    0.799        0.751  0.724  0.43
DNAProt                      72.55     0.83    0.598        0.789  -      0.44
DNAbinder                    73.58     0.66    0.804        0.815  -      0.47
Kmer1 + ACC                  75.23     0.77    0.738        0.828  -      0.50
iDNAProt                     75.40     0.84    0.647        0.761  -      0.50
iDNAPro-PseAAC               76.76     0.84    0.745        0.902  -      0.53
HMMBinder                    86.33     0.87    0.855        0.941  -      0.72
iDNAProt-ES                  90.18     0.90    0.900        0.988  -      0.80
Local-DPP                    79.20     0.84    0.745        -      -      0.59
DPP-PseAAC                   95.91     0.94    0.976        -      -      0.92
We have reported the performance of the nine other algorithms and our methods on the benchmark train dataset in Table 8. From the values reported in Table 8, it is clear that the best performing method is the most recent one, DPP-PseAAC by Rahman et al. [31], which achieved a near perfect performance on the benchmark training dataset. However, to ensure the effectiveness of a method, one must perform an independent test with a separate set.
Table 9: Comparison of Performance with the state-of-the-art methods on the Test Set

Classifier                   Accuracy  Recall  Specificity  auROC  auPR   MCC
Grouped Feature Selection    82.26     0.95    0.699        0.823  0.745  0.67
Recursive Feature Selection  76.88     0.77    0.769        0.769  0.696  0.55
DNAProt                      61.80     0.70    0.538        0.240  0.607  0.24
DNAbinder                    60.80     0.57    0.645        0.216  0.752  0.22
Kmer1 + ACC                  70.96     0.83    0.591        0.431  0.775  0.43
iDNAProt                     67.20     0.68    0.667        0.344  -      0.34
iDNAPro-PseAAC               69.89     0.77    0.624        0.402  -      0.40
HMMBinder                    69.02     0.61    0.763        0.632  -      0.39
iDNAProt-ES                  80.64     0.81    0.800        0.843  -      0.61
Local-DPP                    79.00     0.92    0.656        0.798  -      0.63
DPP-PseAAC                   77.42     0.83    0.709        -      -      0.55
We have trained a model using the benchmark dataset and tested it with the independent test set. Overfitting becomes evident only when a separate independent test set is used. The comparison of results on the independent test set is shown in Table 9. Here, we note that our method is the best performing one, and its test accuracy is higher than its train accuracy, which is a clear indication of the absence of overfitting. We can also note that the previous best method, DPP-PseAAC, performs poorly on the test set compared to its performance on the train set, which leads us to believe that it is a case of overfitting.

3.5 Availability of Our Method
We have provided all the files used in our experiments in this online repository: https://github.com/SkAdilina/DNA_Binding. It contains the code for training the models for both the Grouped and Recursive feature selection, as well as the code used for recording the results for all metrics and classifiers. The features extracted for both methods have also been provided.
4 Conclusion
In this paper, we have presented several features extracted solely from the protein sequence. We then performed grouped and recursive feature selection on the feature set. The final models for the grouped and recursive feature selection were developed with the Extra Tree Classifier and the Random Forest classifier, respectively. Our results improved significantly on the test set. However, the results on the train set still have room for improvement while keeping the high accuracy on the test set. As pointed out in [71], user-friendly and publicly accessible web servers represent the future direction for developing practically more useful predictors or any computational tools. They will also significantly enhance the impact of theoretical work on medical science [72], driving medicinal chemistry into an unprecedented revolution [73]. Moreover, user-friendly web servers have been provided in a series of recent publications [74–83], and we shall therefore make efforts in our future work to provide a web server for the prediction method presented in this paper. For the time being, we have made all methods and data available for anyone to use at: https://github.com/SkAdilina/DNA_Binding.
References
[1] Katie Freeman, Marc Gwadz, and David Shore. Molecular and genetic analysis of the toxic effect of rap1 overexpression in yeast. Genetics, 141(4):1253–1262, 1995.
[2] Chia-Cheng Chou, Ting-Wan Lin, Chin-Yu Chen, and Andrew H-J Wang. Crystal structure of the hyperthermophilic archaeal dna-binding protein sso10b2 at a resolution of 1.85 angstroms. Journal of Bacteriology, 185(14):4066–4073, 2003.

[3] Michael J Buck and Jason D Lieb. Chip-chip: considerations for the design, analysis, and application of genome-wide chromatin immunoprecipitation experiments. Genomics, 83(3):349–360, 2004.
[4] Reham Helwa and Jörg D Hoheisel. Analysis of dna–protein interactions: from nitrocellulose filter binding assays to microarray studies. Analytical and Bioanalytical Chemistry, 398(6):2551–2561, 2010.

[5] Bin Liu, Shanyi Wang, and Xiaolong Wang. Dna binding protein identification by combining pseudo amino acid composition and profile-based protein representation. Scientific Reports, 5:15479, 2015.
[6] Xuhua Xia. Bioinformatics and drug discovery. Current Topics in Medicinal Chemistry, 17:1709–26, 2017.

[7] Kuo-Chen Chou. Prediction of protein cellular attributes using pseudo-amino acid composition. Proteins: Structure, Function, and Bioinformatics, 43(3):246–255, 2001.
[8] Kuo-Chen Chou. Using amphiphilic pseudo amino acid composition to predict enzyme subfamily classes. Bioinformatics, 21(1):10–19, 2005.

[9] Y. Fang, Y. Guo, Y. Feng, and M. Li. Predicting DNA-binding proteins: approached from chou's pseudo amino acid composition and other specific sequence features. Amino Acids, 34(1):103–109, jul 2007.

[10] Xiao-Wei Zhao, Xiang-Tao Li, Zhi-Qiang Ma, and Ming-Hao Yin. Identify dna-binding proteins with optimal chou's amino acid composition. Protein & Peptide Letters, 19:398–405, 2012.
[11] Bin Liu, Jinghao Xu, Shixi Fan, Ruifeng Xu, Jiyun Zhou, and Xiaolong Wang. Psedna-pro: Dna-binding protein identification by combining chou's pseaac and physicochemical distance transformation. Molecular Informatics, 34, 09 2014.
[12] M Saifur Rahman, Swakkhar Shatabda, Sanjay Saha, Mohammad Kaykobad, and Mohammad Rahman. Dpp-pseaac: A dna-binding protein prediction model using chou’s general pseaac. Journal of theoretical biology, 452, 05 2018.
[13] Md Abdullah Al Maruf and Swakkhar Shatabda. irspot-sf: Prediction of recombination hotspots by incorporating sequence based features into chou's pseudo components. Genomics, 2018.

[14] Zhe Ju and Shi-Yun Wang. Prediction of citrullination sites by incorporating k-spaced amino acid pairs into chou's general pseudo amino acid composition. Gene, 664, 04 2018.
[15] Kuo-Chen Chou. An unprecedented revolution in medicinal chemistry driven by the progress of biological science. Current Topics in Medicinal Chemistry, 17(21):2337–2358, 2017.

[16] Manish Kumar, Michael M Gromiha, and Gajendra PS Raghava. Identification of dna-binding proteins using support vector machines and evolutionary profiles. BMC Bioinformatics, 8(1):463, 2007.
[17] Y. Fang, Y. Guo, Y. Feng, and M. Li. Predicting dna-binding proteins: approached from chou’s pseudo amino acid composition and other specific sequence features. Amino Acids, 34(1):103–109, Jan 2008.
[18] Robert E Langlois and Hui Lu. Boosting the prediction and understanding of dna-binding domains from sequence. Nucleic acids research, 38(10):3149–3158, 2010.
[19] K Krishna Kumar, Ganesan Pugalenthi, and PN Suganthan. Dna-prot: identification of dna binding proteins from protein sequence information using random forest. Journal of Biomolecular Structure and Dynamics, 26(6):679–686, 2009.

[20] Wei-Zhong Lin, Jian-An Fang, Xuan Xiao, and Kuo-Chen Chou. idna-prot: identification of dna binding proteins using random forest with grey model. PloS One, 6(9):e24756, 2011.

[21] Xiao-Wei Zhao, Xiang-Tao Li, Zhi-Qiang Ma, and Ming-Hao Yin. Identify dna-binding proteins with optimal chou's amino acid composition. Protein and Peptide Letters, 19(4):398–405, 2012.

[22] Kuo-Chen Chou. A novel approach to predicting protein structural classes in a (20–1)-d amino acid composition space. Proteins: Structure, Function, and Bioinformatics, 21(4):319–344, 1995.

[23] Ruifeng Xu, Jiyun Zhou, Bin Liu, Yulan He, Quan Zou, Xiaolong Wang, and Kuo-Chen Chou. Identification of dna-binding proteins by incorporating evolutionary information into pseudo amino acid composition via the top-n-gram approach. Journal of Biomolecular Structure and Dynamics, 33(8):1720–1730, 2015. PMID: 25252709.
[24] Bin Liu, Jinghao Xu, Shixi Fan, Ruifeng Xu, Jiyun Zhou, and Xiaolong Wang. Psedna-pro: Dna-binding protein identification by combining chou's pseaac and physicochemical distance transformation. Molecular Informatics, 34(1):8–17, 2015.

[25] Bin Liu, Jinghao Xu, Xun Lan, Ruifeng Xu, Jiyun Zhou, Xiaolong Wang, and Kuo-Chen Chou. idna-prot|dis: identifying dna-binding proteins by incorporating amino acid distance-pairs and reduced alphabet profile into the general pseudo amino acid composition. PloS One, 9(9):e106691, 2014.
[26] Qiwen Dong, Shanyi Wang, Kai Wang, Xuan Liu, and Bin Liu. Identification of dna-binding proteins by auto-cross covariance transformation. In Bioinformatics and Biomedicine (BIBM), 2015 IEEE International Conference on, pages 470–475. IEEE, 2015.

[27] Leyi Wei, Jijun Tang, and Quan Zou. Local-dpp: An improved dna-binding protein prediction method by exploring local evolutionary information. Information Sciences, 384:135–144, 2017.

[28] Wangchao Lou, Xiaoqing Wang, Fan Chen, Yixiao Chen, Bo Jiang, and Hua Zhang. Sequence based prediction of dna-binding proteins based on hybrid feature selection using random forest and gaussian naive bayes. PLoS One, 9(1):e86703, 2014.
[29] Rianon Zaman, Shahana Yasmin Chowdhury, Mahmood A Rashid, Alok Sharma, Abdollah Dehzangi, and Swakkhar Shatabda. Hmmbinder: Dna-binding protein prediction using hmm profile based features. BioMed Research International, 2017, 2017.

[30] Shahana Yasmin Chowdhury, Swakkhar Shatabda, and Abdollah Dehzangi. iDNAProt-ES: Identification of dna-binding proteins using evolutionary and structural features. Scientific Reports, 7(1):14938, 2017.

[31] M Saifur Rahman, Swakkhar Shatabda, Sanjay Saha, M Kaykobad, and M Sohel Rahman. Dpp-pseaac: A dna-binding protein prediction model using chou's general pseaac. Journal of Theoretical Biology, 452:22–34, 2018.

[32] Kuo-Chen Chou. Pseudo amino acid composition and its applications in bioinformatics, proteomics and system biology. Current Proteomics, 6, 12 2009.

[33] Kuo-Chen Chou. Some remarks on protein attribute prediction and pseudo amino acid composition. Journal of Theoretical Biology, 273:236–47, 03 2011.
[34] Wei Chen, Tian-Yu Lei, Dian-Chuan Jin, Hao Lin, and Kuo-Chen Chou. Pseknc: A flexible web server for generating pseudo k-tuple nucleotide composition. Analytical Biochemistry, 456, 04 2014.
[35] Wei Chen, Peng-Mian Feng, Hao Lin, and Kuo-Chen Chou. iss-psednc: Identifying splicing sites using pseudo dinucleotide composition. BioMed Research International, 2014.

[36] Wei Chen, Hao Lin, and Kuo-Chen Chou. Pseudo nucleotide composition or pseknc: an effective formulation for analyzing genomic sequences. Molecular BioSystems, 2015.

[37] Bin Liu, Fule Liu, Xiaolong Wang, Junjie Chen, Longyun Fang, and Kuo-Chen Chou. Pse-in-one: a web server for generating various modes of pseudo components of dna, rna, and protein sequences. Nucleic Acids Research, 43:W65–71, 2015.
[38] Bin Liu, Hao Wu, and Kuo-Chen Chou. Pse-in-one 2.0: An improved package of web servers for generating various modes of pseudo components of dna, rna, and protein sequences. Natural Science, 09:67–91, 01 2017.

[39] Kuo-Chen Chou. Some remarks on protein attribute prediction and pseudo amino acid composition. Journal of Theoretical Biology, 273(1):236–247, 2011.
[40] Jianhua Jia, Zi Liu, Xuan Xiao, Bingxiang Liu, and Kuo-Chen Chou. ippi-esml: An ensemble classifier for identifying the interactions of proteins by incorporating their physicochemical properties and wavelet transforms into pseaac. Journal of Theoretical Biology, 377:47–56, 2015.

[41] Lei Cai, Tao Huang, Jingjing Su, Xinxin Zhang, Wenzhong Chen, Fuquan Zhang, Lin He, and Kuo-Chen Chou. Implications of newly identified brain eqtl genes and their interactors in schizophrenia. Molecular Therapy - Nucleic Acids, 12:433–42, 2018.
[42] Jianhua Jia, Zi Liu, Xuan Xiao, Bingxiang Liu, and Kuo-Chen Chou. isuc-pseopt: Identifying lysine succinylation sites in proteins by incorporating sequence-coupling effects into pseudo components and optimizing imbalanced training dataset. Analytical Biochemistry, 497:48–56, 2016.

[43] Wei Chen, Pengmian Feng, Hui Yang, Hui Ding, Hao Lin, and Kuo-Chen Chou. irna-3typea: Identifying three types of modification at rnas adenosine sites. Molecular Therapy - Nucleic Acids, 11:468–474, 2018.
[44] Wang-Ren Qiu, Bi-Qian Sun, Xuan Xiao, Zhao-Chun Xu, Jian-Hua Jia, and Kuo-Chen Chou. ikcr-pseens: Identify lysine crotonylation sites in histone proteins with pseudo components and ensemble classifier. Genomics, 110(5):239–246, 2018.

[45] Xiang Cheng, Wei-Zhong Lin, Xuan Xiao, and Kuo-Chen Chou. ploc bal-manimal: predict subcellular localization of animal proteins by balancing training dataset and pseaac. Bioinformatics, page bty628, 2018.
[46] Jianhua Jia, Zi Liu, Xuan Xiao, Bingxiang Liu, and Kuo-Chen Chou. psuc-lys: Predict lysine succinylation sites in proteins with pseaac and ensemble random forest approach. Journal of Theoretical Biology, 394:223–230, 2016.

[47] Jia-Ming Chang, Emily Chia-Yu Su, Allan Lo, Hua-Sheng Chiu, Ting-Yi Sung, and Wen-Lian Hsu. Psldoc: Protein subcellular localization prediction based on gapped-dipeptides and probabilistic latent semantic analysis. Proteins: Structure, Function, and Bioinformatics, 72(2):693–710, 2008.

[48] Mahmoud Ghandi, Morteza Mohammad-Noori, and Michael A Beer. Robust k-mer frequency estimation using gapped k-mers. Journal of Mathematical Biology, 69(2):469–500, 2014.
[49] S Rasoul Safavian and David Landgrebe. A survey of decision tree classifier methodology. IEEE Transactions on Systems, Man, and Cybernetics, 21(3):660–674, 1991.

[50] Tin Kam Ho. Random decision forests. In Document Analysis and Recognition, 1995., Proceedings of the Third International Conference on, volume 1, pages 278–282. IEEE, 1995.

[51] Pierre Geurts, Damien Ernst, and Louis Wehenkel. Extremely randomized trees. Machine Learning, 63(1):3–42, 2006.

[52] Andrew Y Ng and Michael I Jordan. On discriminative vs. generative classifiers: A comparison of logistic regression and naive bayes. In Advances in Neural Information Processing Systems, pages 841–848, 2002.
[53] Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of machine learning. MIT press, 2012.
[54] Alan Julian Izenman. Linear discriminant analysis. In Modern multivariate statistical techniques, pages 237–280. Springer, 2013.
[55] Robert E Schapire. The boosting approach to machine learning: An overview. In Nonlinear Estimation and Classification, pages 149–171. Springer, 2003.

[56] Siddiqur Rahman, Usma Aktar, Rafsan Jani, and Swakkhar Shatabda. ipromoter-fsen: Identification of bacterial σ70 promoter sequences using feature subspace based ensemble classifier. Genomics, 2018.
[57] Md Mofijul Islam, Sanjay Saha, Md Mahmudur Rahman, Swakkhar Shatabda, Dewan Md Farid, and Abdollah Dehzangi. iprotgly-ss: identifying protein glycation sites using sequence and structure based features. Proteins: Structure, Function, and Bioinformatics, 2018.

[58] Wei Chen, Peng-Mian Feng, Hao Lin, and Kuo-Chen Chou. irspot-psednc: identify recombination spots with pseudo dinucleotide composition. Nucleic Acids Research, 41(6):e68, 2013.
[59] Xiang Cheng, Xuan Xiao, and Kuo-Chen Chou. ploc-mplant: predict subcellular localization of multi-location plant proteins by incorporating the optimal go information into general pseaac. Mol. BioSyst., 13:1722–1727, 2017.
[60] Xuan Xiao, Xiang Cheng, Genqiang Chen, Qi Mao, and Kuo-Chen Chou. ploc bal-mgpos: Predict subcellular localization of gram-positive bacterial proteins by quasi-balancing training dataset and pseaac. Genomics, 2018.

[61] Xiang Cheng, Xuan Xiao, and Kuo-Chen Chou. ploc-mvirus: Predict subcellular localization of multi-location virus proteins via incorporating the optimal go information into general pseaac. Gene, 628:315–321, 2017.
[62] Xiang Cheng, Shu-Guang Zhao, Wei-Zhong Lin, Xuan Xiao, and Kuo-Chen Chou. ploc-manimal: predict subcellular localization of animal proteins with both single and multiple sites. Bioinformatics, 33(22):3524–3531, 2017.
[63] Xuan Xiao, Xiang Cheng, Genqiang Chen, Qi Mao, and Kuo-Chen Chou. ploc bal-mgpos: Predict subcellular localization of gram-positive bacterial proteins by quasi-balancing training dataset and pseaac. Genomics, 2018.

[64] Xiang Cheng, Xuan Xiao, and Kuo-Chen Chou. ploc-mgneg: Predict subcellular localization of gram-negative bacterial proteins by deep gene ontology learning via general pseaac. Genomics, 110(4):231–239, 2018.
[65] Xiang Cheng, Xuan Xiao, and Kuo-Chen Chou. ploc-meuk: Predict subcellular localization of multi-label eukaryotic proteins by extracting the key go information into general pseaac. Genomics, 110(1):50–58, 2018.

[66] Xiang Cheng, Shu-Guang Zhao, Xuan Xiao, and Kuo-Chen Chou. iatc-misf: a multi-label classifier for predicting the classes of anatomical therapeutic chemicals. Bioinformatics, 33(3):341–346, 2017.
[67] Wang-Ren Qiu, Bi-Qian Sun, Xuan Xiao, Zhao-Chun Xu, and Kuo-Chen Chou. iptm-mlys: identifying multiple lysine ptm sites and their different types. Bioinformatics, 32(20):3116–3123, 2016.

[68] Kuo-Chen Chou. Some remarks on predicting multi-label attributes in molecular biosystems. Mol. BioSyst., 9:1092–1100, 2013.

[69] Tom Fawcett. An introduction to roc analysis. Pattern Recognition Letters, 27(8):861–874, 2006. ROC Analysis in Pattern Recognition.
[70] Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al. Scikit-learn: Machine learning in python. Journal of Machine Learning Research, 12(Oct):2825–2830, 2011.

[71] Kuo-Chen Chou and Hong-Bin Shen. Review: Recent advances in developing web-servers for predicting protein attributes. Natural Science, 1(2):30, 2009.
[72] Kuo-Chen Chou. Impacts of bioinformatics to medicinal chemistry. Medicinal chemistry (Shariqah (United Arab Emirates)), 11, 12 2014.
[73] Kuo-Chen Chou. An unprecedented revolution in medicinal chemistry driven by the progress of biological science. Current Topics in Medicinal Chemistry, 17(21):2337–2358, 2017.
[74] Peng-Mian Feng, Wei Chen, Hao Lin, and Kuo-Chen Chou. ihsp-pseraaac: Identifying the heat shock protein families using pseudo reduced amino acid alphabet composition. Analytical Biochemistry, 442(1):118–125, 2013.

[75] Wei Chen, Peng-Mian Feng, Hao Lin, and Kuo-Chen Chou. iss-psednc: Identifying splicing sites using pseudo dinucleotide composition. BioMed Research International, page 12, 2014.

[76] Wei Chen, Pengmian Feng, Hui Ding, Hao Lin, and Kuo-Chen Chou. irna-methyl: Identifying n6-methyladenosine sites using pseudo nucleotide composition. Analytical Biochemistry, 490:26–33, 2015.
[77] Wei Chen, Hui Ding, Pengmian Feng, Hao Lin, and Kuo-Chen Chou. iacp: a sequence-based tool for identifying anticancer peptides. Oncotarget, 7(13):16895–16909, 03 2016.

[78] Bin Liu, Longyun Fang, Fule Liu, Xiaolong Wang, Junjie Chen, and Kuo-Chen Chou. Identification of real microrna precursors with a pseudo structure status composition approach. PLOS ONE, 10(3):1–20, 03 2015.
[79] Bin Liu, Longyun Fang, Ren Long, Xun Lan, and Kuo-Chen Chou. ienhancer-2l: a two-layer predictor for identifying enhancers and their strength by pseudo k-tuple nucleotide composition. Bioinformatics, 32(3):362–369, 2016.
[80] Jianhua Jia, Zi Liu, Xuan Xiao, and Kuo-Chen Chou. icar-psecp: identify carbonylation sites in proteins by monte carlo sampling and incorporating sequence coupled effects into general pseaac. Oncotarget, 7:34558–70, 06 2016.

[81] Bin Liu, Hao Wu, Deyuan Zhang, and Xiaolong Wang. Pse-analysis: a python package for dna/rna and protein/peptide sequence analysis based on pseudo components and kernel methods. Oncotarget, 8:13338–43, 02 2017.
[82] Wang-Ren Qiu, Bi-Qian Sun, Xuan Xiao, and Zhao-Chun Xu. ihyd-psecp: Identify hydroxyproline and hydroxylysine in proteins by incorporating sequence-coupled effects into general pseaac. Oncotarget, 7, 06 2016.
[83] Wang-Ren Qiu, Xuan Xiao, Zhao-Chun Xu, and Kuo-Chen Chou. iphos-pseen: Identifying phosphorylation sites in proteins by fusing different pseudo components into an ensemble classifier. Oncotarget, 7:51270–83, 08 2016.