Accepted Manuscript
Title: Identification of Coenzyme-Binding Proteins with Machine Learning Algorithms
Authors: Yong Liu, Zhiwei Kong, Tao Ran, Alfredo Sahagun-Ruiz, Zhixiong He, Chuanshe Zhou, Zhiliang Tan
PII: S1476-9271(18)30230-5
DOI: https://doi.org/10.1016/j.compbiolchem.2019.01.014
Reference: CBAC 7013
To appear in: Computational Biology and Chemistry
Received date: 2 April 2018
Revised date: 11 September 2018
Accepted date: 25 January 2019
Please cite this article as: Liu Y, Kong Z, Ran T, Sahagun-Ruiz A, He Z, Zhou C, Tan Z, Identification of Coenzyme-Binding Proteins with Machine Learning Algorithms, Computational Biology and Chemistry (2019), https://doi.org/10.1016/j.compbiolchem.2019.01.014 This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.
Identification of Coenzyme-Binding Proteins with Machine Learning Algorithms

Yong Liu 1,2, Zhiwei Kong 1,5, Tao Ran 3, Alfredo Sahagun-Ruiz 4, Zhixiong He 1,2,*, Chuanshe Zhou 1,2, Zhiliang Tan 1,2

1 Key Laboratory for Agro-Ecological Processes in Subtropical Region, National Engineering Laboratory for Pollution Control and Waste Utilization in Livestock and Poultry Production, South Central Experimental Station of Animal Nutrition and Feed Science in the Ministry of Agriculture, Institute of Subtropical Agriculture, The Chinese Academy of Sciences, Changsha, Hunan 410125, P.R. China
2 Hunan Co-Innovation Center of Animal Production Safety, CICAPS, Changsha, Hunan 410128, P.R. China
3 Lethbridge Research and Development Centre, Agriculture and Agri-Food Canada, Lethbridge, Alberta T1J 4B1, Canada
4 Department of Microbiology and Immunology, Faculty of Veterinary Medicine and Animal Science, National Autonomous University of Mexico, Universidad 3000, Copilco Coyoacán, CP 04510 México D.F., México
5 University of the Chinese Academy of Sciences, Beijing 100049, P.R. China
* Corresponding Author: [email protected]
Highlights:
1. Coenzyme-binding proteins play a vital role in cellular metabolic processes such as fatty acid biosynthesis, enzyme and gene regulation, lipid synthesis, vesicular trafficking, and the donation of acyl-CoA esters for β-oxidation.
2. Only a small subset of the proteins found in nature has been recognized as having coenzyme-binding activity. In silico screening for new peptides with coenzyme-binding activity may therefore be an efficient approach.
3. A dataset of 2,897 proteins, 456 of which have coenzyme-binding activity, was created. Forty-two Star Graph Topological Indices (TIs) were calculated for each protein sequence with the S2SNet application. All of these TIs were used as inputs to several classification techniques in the machine learning software Weka.
4. The best model found was a Random Forest classifier based on only three features of the embedded and non-embedded graphs. It performed well, with an AUROC of 0.971, a true positive (TP) rate of 91.7%, and a false positive (FP) rate of 7.6%.
5. Machine learning classification models of this kind, used to predict new proteins with coenzyme-binding activity, could be useful for future drug development or research on enzyme catalysis and metabolism.
Abstract: Coenzyme-binding proteins play a vital role in cellular metabolic processes such as fatty acid biosynthesis, enzyme and gene regulation, lipid synthesis, vesicular trafficking, and the donation of acyl-CoA esters for β-oxidation. Based on the theory of Star Graph Topological Indices (SGTIs) of protein primary sequences, we propose a method for developing a first classification model that predicts proteins with coenzyme-binding properties. To model the properties of coenzyme-binding proteins, we created a dataset of 2,897 proteins, 456 of which have coenzyme-binding activity. The SGTIs of the peptide sequences were calculated with the S2SNet application and used as inputs to several classification techniques in the machine learning software Weka. A Random Forest classifier based on 3 features of the embedded and non-embedded graphs was identified as the best predictive model for coenzyme-binding proteins. This model achieved a true positive (TP) rate of 91.7%, a false positive (FP) rate of 7.6%, and an Area Under the Receiver Operating Characteristic Curve (AUROC) of 0.971. Predicting new coenzyme-binding proteins with this model could be useful for further drug development or research on enzyme metabolism.
Keywords: Coenzyme-binding, Protein sequence, Topological indices, Classification model, Random Forest.
1. Introduction
Coenzymes are organic non-protein molecules that bind to enzymes (apoenzymes) to activate them and to take part in the activity of the resulting holoenzymes. Vitamins make up the coenzyme prosthetic groups and play an important role in enzyme activity. The binding of a coenzyme to its enzyme is the first and key step of enzyme metabolism [1, 2]. Part or all of the enzyme protein makes the key contribution to the binding process by interacting selectively and non-covalently with a particular coenzyme [3]. Coenzymes are directly involved in, and altered by, enzymatic reactions, in which they transport groups such as hydride ions, phosphate groups and acetyl groups between enzymes. Coenzymes such as NAD and NADP are involved in transcription, DNA repair, signal transduction and detoxification [4-6]. Ames et al. [1] reported that coenzyme binding and enzymatic activity are reduced in enzymes with variant amino acids, and that this may be remediable by raising the coenzyme (vitamin) concentration. This implies that the coenzyme-binding ability and the activity of an enzyme depend on its primary structure [7]. We therefore used machine learning methods to identify the coenzyme-binding ability of an enzyme from its primary amino acid sequence. The coenzyme-binding proteins were obtained from the RCSB Protein Data Bank (PDB) [8] using the Gene Ontology (GO) [9] term GO:0050662. Coenzyme binders are divided into different protein types according to the coenzyme they bind. These include flavin mononucleotide (FMN) binding proteins [10], nicotinamide adenine dinucleotide (NAD+) binding proteins [11, 12], NADP binding proteins [13], coenzyme F420 binding proteins [14, 15], fatty-acyl-CoA binding proteins [16], flavin adenine dinucleotide binding proteins [17, 18], lipoic acid binding proteins [19], molybdopterin cofactor binding proteins [20], and thiamine pyrophosphate binding proteins [21, 22], among others. Some coenzymes and the corresponding coenzyme-binding proteins are presented in Table 1.
Coenzyme-binding proteins play a vital role in many biosynthetic processes. Take the acyl-CoA binding protein (ACBP) as an example. ACBP is involved in multiple cellular processes [23], including cell growth [24], enzyme regulation [23], acyl chain ceramide synthesis [25], modification of fatty acid biosynthesis [26], sphingolipid synthesis [27], and stimulation of steroid hormone synthesis [28]. However, few reports describe the functions of other coenzyme-binding proteins. It is therefore valuable to determine whether or not a new protein has coenzyme-binding ability. The main objective of the present work is to develop a new theoretical model that can predict, through in silico screening, whether a new peptide has coenzyme-binding properties, thereby reducing the cost and the number of molecules needed in an actual trial.
Peptide primary sequence information is used to develop the classification models, following the Quantitative Structure-Activity Relationship (QSAR) approach [29]. In this study, the amino acid sequence information is extracted as embedded and non-embedded Star Graph topological indices (SGTIs) [30]. The SGTIs are calculated from several mathematical indices, such as the Wiener, Schultz, Randić connectivity, and Gutman indices [31]. QSAR classification models that predict protein function are usually based on Linear Discriminant Analysis (LDA) [32] or on non-linear Artificial Neural Networks (ANNs) [33]. In addition to ANNs, machine learning algorithms such as Naive Bayes, Random Tree, Random Forest (RF) and Support Vector Machine (SVM) have been widely used as predictive classification models. In previous works, classification models based on protein sequence graph descriptors have been shown to predict biological properties of proteins, for example for cancer-related proteins [34], lipid-binding proteins [35], antioxidant proteins [36], and nucleotide-binding proteins [37].
Combining protein molecular structure profiles with Machine Learning (ML) techniques has thus proved to be a successful approach for developing predictive classification models from the primary structures of protein sequences. Here, we propose a classification model that uses Star Graph primary sequence descriptors (SGTIs) to predict, for the first time, proteins with coenzyme-binding activity.
Table 1. The structure of some coenzymes and examples of coenzyme-binding proteins
[Columns in the original table: Coenzyme; Abbre.; Chemical Structure (image); Corresponding coenzyme-binding proteins (enzymes): List of enzymes, Enzyme sample, PDB-ID, 3D Structure (image); References]

Coenzyme (Abbre.): FMN
List of enzymes: Pentaerythritol tetranitrate reductase, NADH dehydrogenase
Enzyme sample | PDB-ID | Reference
Pentaerythritol tetranitrate reductase | 1H50 | Barna et al., 2001 [38]
NADH dehydrogenase (NADH-quinone oxidoreductase) | 4XDB | Sena et al., 2015 [39]

Coenzyme (Abbre.): NAD+
List of enzymes: NADH-ubiquinone oxidoreductase, NADH dehydrogenase, fumarate reductase, glutamate dehydrogenase, aldose reductase, glucose-6-phosphate dehydrogenase, and methylene tetrahydrofolate reductase
Enzyme sample | PDB-ID | Reference
Fumarate reductase | 4YXD | Inaoka et al., 2015 [40]
Glutamate dehydrogenase | 2YFQ | Oliveira et al., 2012 [41]

Coenzyme (Abbre.): NADP+
List of enzymes: NADP-linked malic enzyme, NADP-linked isocitrate dehydrogenase, NADP-linked glutamate dehydrogenase, nicotinamide nucleotide transhydrogenase, NADH kinase, and malate dehydrogenase (oxaloacetate-decarboxylating) (NADP+)
Enzyme sample | PDB-ID | Reference
Isocitrate dehydrogenase | 9ICD | Hurley et al., 1991 [42]
Glutamate dehydrogenase | 4BHT | Sharkey et al., 2013 [43]
Malate dehydrogenase | 1Y7T | Tomita et al., 2005 [44]
Nicotinamide nucleotide transhydrogenase | 1F8G | Buckley et al., 2000 [45]

Coenzyme (Abbre.): Coenzyme F420
List of enzymes: Coenzyme F420 hydrogenase, 5,10-methylenetetrahydromethanopterin reductase, methylenetetrahydromethanopterin dehydrogenase
Enzyme sample | PDB-ID | Reference
Coenzyme F420 hydrogenase | 3ZF | Mills et al., 2013 [46]
Methylenetetrahydromethanopterin dehydrogenase | 3IQE | Ceh et al., 2009 [47]

Coenzyme (Abbre.): Molybdopterin
List of enzymes: Xanthine oxidase, DMSO reductase, sulfite oxidase, nitrate reductase, ethylbenzene dehydrogenase, glyceraldehyde-3-phosphate ferredoxin oxidoreductase, respiratory arsenate reductase, carbon monoxide dehydrogenase, formate dehydrogenase, purine hydroxylase, thiosulfate reductase
Enzyme sample | PDB-ID | Reference
Xanthine oxidase | 4YRW | Nishino et al., 2015 [17]
Nitrate reductase | 1Q16 | Bertero et al., 2003 [48]
Ethylbenzene dehydrogenase | 2IVF | Kloer et al., 2006 [49]

2. Materials and Methods
The amino acid sequences (primary structures) of the coenzyme-binding proteins (CoEnBP, positive group) were obtained from the Protein Data Bank (PDB) [8]. The dataset of proteins without coenzyme-binding activity (non-CoEnBP, negative group) was downloaded from PISCES CulledPDB [50]. The S2SNet application was used to transform the amino acid sequences into Star Graphs [31] and then to calculate the corresponding SGTIs. All SGTIs (42 attributes, embedded and non-embedded) of each sequence, as characterized by the corresponding graph, were input to the Weka software [51] as the supervised data used to find the best ML classification model. The ML model obtained can be used to predict the coenzyme-binding ability of a new amino acid sequence. The workflow and main methodologies used in this study are presented in Fig. 1.

Fig. 1. Workflow of predictive classification of coenzyme-binding protein properties based on SGTI descriptors of protein primary sequences

2.1. Protein dataset establishment
In the current study, two datasets (positive and negative) were constructed by extracting primary amino acid sequences. The positive dataset consisted of 456 proteins reported to have coenzyme-binding ability (denoted CoEnBP) from PDB (http://www.rcsb.org/pdb/browse/jbrowse.do?t=6&useMenu=no) [8], whereas the negative dataset included 2,441 proteins without coenzyme-binding ability (denoted non-CoEnBP) from PISCES CulledPDB (http://dunbrack.fccc.edu/PISCES.php) [50]. For the positive dataset, we took the following steps in the web interface of the PDB database: open the “Advanced Search” interface, choose “Molecular Function Browser”, and select the “Coenzyme binding” keyword. The negative dataset was collected from PISCES CulledPDB [50] in the same way as in our previous report [37]. In detail, a set of protein sequences without coenzyme-binding ability and with an identity of less than 20%, a resolution of 1.6 Å, and an R-factor of 0.25 was randomly collected. Here, a sequence identity of 20% or 25% or more, as a degree of correspondence, implies a similarity in function [37]. The PISCES server uses a Z-score of 3.5 as the threshold for accepting possible evolutionary relationships [52]. Because PISCES alignments are local, two proteins that share a common domain with a Z-score > 3.5 are excluded from our non-CoEnBP dataset.
2.2. Star Graph Topological Indices
The construction of Star Graphs from the primary amino acid (AA) sequences of proteins was described by Munteanu and colleagues [37, 53]. Here, we used the Sequence to Star Networks (S2SNet) application to calculate the SGTIs of the primary amino acid sequences [37, 53, 54]. In detail, each node (or vertex) of the Star Graph (SG) represents an amino acid, and the amino acids are connected through peptide bonds. The SG of a protein primary sequence is a special tree with μ vertices and μ−1 degrees of freedom [37, 55, 56]. In all, the SG contains 26 possible branches. The non-embedded connectivity represents the AA type, position and frequency, whereas the embedded connectivity also reflects the chemical peptide bonds [53].
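The construction described above can be sketched in a few lines of Python. This is a minimal illustration under our own assumptions, not S2SNet's implementation: the function name `star_graph_edges`, the dummy centre node, and the rule that each amino acid type forms one branch filled in order of appearance are all chosen for the example.

```python
def star_graph_edges(sequence, embedded=False):
    """Return (labels, edges) for a star graph built from an amino acid sequence.

    Assumptions of this sketch: node 0 is a dummy centre, each amino acid type
    owns one branch, and residues join their branch in order of appearance.
    With embedded=True, the bonds between sequence neighbours are added too.
    """
    labels = ["centre"]
    edges = set()
    branch_tip = {}        # amino acid type -> last node placed on its branch
    node_of_position = []  # sequence position -> node id
    for residue in sequence:
        node = len(labels)
        labels.append(residue)
        parent = branch_tip.get(residue, 0)  # first residue of a type hangs off the centre
        edges.add((parent, node))
        branch_tip[residue] = node
        node_of_position.append(node)
    if embedded:
        # Embedded graph: also connect residues that are adjacent in the chain.
        for a, b in zip(node_of_position, node_of_position[1:]):
            edges.add((min(a, b), max(a, b)))
    return labels, sorted(edges)


# Hypothetical toy sequence, for illustration only.
labels, tree_edges = star_graph_edges("ACAC")
_, embedded_edges = star_graph_edges("ACAC", embedded=True)
```

The non-embedded graph is a tree (one edge fewer than nodes), while the embedded graph gains extra edges, and therefore cycles, from the chain connectivity.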
All SGTIs were calculated with Markov normalization and without weights, and the power (£) of the matrices or indices was varied from 0 to 5. The power £ mimics the interaction between AAs separated by a distance of £ in the primary protein sequence, as in Markov chains. The embedded and non-embedded SGTIs, 42 TI variables in all, were grouped into types according to the properties of the index. These include Shannon entropies, traces of the μ connectivity matrices, the Wiener and Gutman topological indices, the Randić connectivity indices, and the Balaban distance connectivity index, among others. All variable names used here appear in our full dataset and follow our previous reports [37, 53].
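To make the role of Markov normalization and the matrix power concrete, the following sketch computes a Shannon entropy and a trace from a small connectivity matrix. It is a hedged illustration, not S2SNet's exact formulas: we assume row normalization, a uniform starting distribution propagated `n` steps for the entropy, and the trace of the `n`-th power of the normalized matrix. All function names are ours.

```python
import math


def markov_normalise(matrix):
    """Divide each row by its sum so that every row is a probability vector."""
    out = []
    for row in matrix:
        total = sum(row)
        out.append([v / total if total else 0.0 for v in row])
    return out


def mat_mul(a, b):
    """Plain matrix product of two lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def mat_power(matrix, n):
    """n-th power of a square matrix (n = 0 gives the identity)."""
    size = len(matrix)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, matrix)
    return result


def shannon_entropy(markov, n):
    """Entropy of a uniform start distribution after n Markov steps (assumed form)."""
    size = len(markov)
    powered = mat_power(markov, n)
    p = [sum(powered[i][j] for i in range(size)) / size for j in range(size)]
    return -sum(v * math.log(v) for v in p if v > 0)


def trace_of_power(markov, n):
    """Trace of the n-th power of the normalized connectivity matrix."""
    powered = mat_power(markov, n)
    return sum(powered[i][i] for i in range(len(powered)))


# Hypothetical 3-node path graph; powers 0..5 mirror the range used in the text.
adjacency = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
markov = markov_normalise(adjacency)
entropies = [shannon_entropy(markov, n) for n in range(6)]
```

Each power yields one value per index type, which is how a handful of index families expands into a larger set of attributes.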
2.3. Classification Model Construction
Weka is an application that contains many machine learning classifiers [57, 58]. Here, we used eight classifiers to classify the full supervised dataset calculated with the S2SNet application. The full dataset was resampled and normalized under supervised and unsupervised conditions. The classifiers were RF [59], Random Tree [60], K-Star [61], J-Rip [62], Lib-Linear [63], Multilayer Perceptron (MLP) [64], Lib-SVM (Support Vector Machines) [65], and Naïve Bayes, a Bayesian technique [66]. The classifiers were built with the default parameter settings of the Weka application, and 10-fold cross-validation was used for all classifiers [67].
Random forests, or random decision forests, are a widely accepted type of classifier. They are a combination of tree predictors in which each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest [59]. RFs combine decision trees to fit the data, with the desired class as the output variable, following the concept of “ensemble learning”. Robustness is one of the main advantages of RF over other techniques such as MLP and SVM [68]. In particular, RF tends not to overfit, and its error converges as the number of random trees becomes large. Each tree is grown by a stochastic process that considers K randomly chosen attributes at each node. K-Star is an instance-based algorithm that differs from other instance-based learners in using an entropy-based distance function [69, 70]. J-Rip is also a widely accepted machine learning algorithm, a propositional rule learner that implements Repeated Incremental Pruning to Produce Error Reduction (RIPPER) [62]. The multilayer perceptron (MLP) is a non-linear ML technique, a feed-forward artificial neural network that maps a set of inputs onto a set of corresponding outputs [64]. The Lib-SVM technique relies on the key concept of a kernel, a function that maps the data into a higher-dimensional space in which data that were not linearly separable become linearly separable [71]. In addition, SVM gives very good results with high-dimensional data [72, 73].
To evaluate the classifiers and select the best one, we considered several well-known accuracy measures: Sensitivity (true positive rate, TP rate), false positive rate (FP rate), Precision (positive predictive value, PPV), F-measure and the Area Under the Receiver Operating Characteristic Curve (AUROC) [74]. The F-measure is the harmonic mean of precision and recall and forms a trade-off between them [75]. The higher the precision, the less effort is wasted in testing or inspection; the higher the recall, the fewer defective modules go undetected. The ROC curve is a graphical plot that characterizes a binary classifier system as its discrimination threshold (DT) is varied. It plots the TP rate against the FP rate over a series of DT values, and the AUROC is the area under this curve. AUROC is considered a better measure for comparing classifiers [76].
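The accuracy measures named above follow directly from confusion-matrix counts. The sketch below is illustrative only: the counts and scores are hypothetical, the function names are ours, and the AUROC is computed with the standard rank-based formulation (the fraction of positive/negative pairs ranked correctly, ties counting half) rather than by whatever procedure Weka uses internally.

```python
def binary_metrics(tp, fn, fp, tn):
    """Accuracy measures for the positive class from confusion-matrix counts."""
    tp_rate = tp / (tp + fn)    # sensitivity, also called recall
    fp_rate = fp / (fp + tn)
    precision = tp / (tp + fp)  # positive predictive value
    f_measure = 2 * precision * tp_rate / (precision + tp_rate)  # harmonic mean
    return {"tp_rate": tp_rate, "fp_rate": fp_rate,
            "precision": precision, "f_measure": f_measure}


def auroc(scores, labels):
    """Rank-based AUROC: chance that a random positive outscores a random negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical counts and classifier scores, for illustration only.
metrics = binary_metrics(tp=90, fn=10, fp=20, tn=180)
area = auroc([0.9, 0.8, 0.4, 0.3], [1, 0, 1, 0])
```

Because AUROC depends only on how the scores rank positives against negatives, it does not change with the choice of a single threshold, which is one reason it is convenient for comparing classifiers.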
3. Results and discussion
3.1. Classifiers with all 42 features
In the present study, the dataset with 42 attributes was tested with eight classifiers. First, we obtained high-performance CoEnBP/non-CoEnBP machine learning classification models with all 42 features (Table 2). The RF, Random Tree and K-Star models achieved a TP rate and Precision > 89%, an FP rate < 10%, and an AUROC > 0.91. The other classifiers performed worse, with lower AUROC and TP rate values and a higher FP rate. With all 42 features, RF was the best classification model, with a higher TP rate and AUROC and a lower FP rate than the other classifiers.
Table 2. CoEnBP/non-CoEnBP classification models using all 42 attributes
Weka classifiers | TP rate (%) | FP rate (%) | Precision (%) | F-Measure | AUROC
RF | 92.6 | 6.8 | 93.1 | 0.926 | 0.984
Random Tree | 91.5 | 7.8 | 92.0 | 0.915 | 0.918
J-Rip | 75.9 | 23.9 | 76.1 | 0.759 | 0.775
K-Star | 89.9 | 9.2 | 91.0 | 0.898 | 0.977
Lib-Linear | 70.0 | 30.4 | 70.0 | 0.700 | 0.698
Lib-SVM | 70.4 | 30.6 | 70.3 | 0.702 | 0.699
Multilayer Perceptron | 69.9 | 30.7 | 69.9 | 0.698 | 0.755
Naïve Bayes | 70.4 | 30.8 | 70.5 | 0.701 | 0.749

3.2. Classifiers with feature selection
Because models built on all 42 attributes may overfit, it is advisable to filter out redundant features, on the principle that the simpler the model, the better. Feature selection was therefore carried out in Weka (attribute evaluator: CfsSubsetEval -P 1 -E 1; search method: BestFirst -D 1 -N 5) to remove highly correlated features. The six features retained after Weka feature selection (J, eSh3, eSh4, eSh5, eX2 and e1XR) were tested with the 8 classifiers (Table 3). Among the new models, only the RF and Random Tree classifiers maintained a high performance similar to that of the models built with all 42 attributes. In contrast, the TP rate (71.5%) and the AUROC (0.799) of K-Star dropped sharply, while its FP rate (29.3%) increased dramatically. RF, the best classifier with the six selected features in the present study, showed high performance, with a TP rate of 91.5%, a precision of 92.0% and an AUROC of 0.966, together with a relatively low FP rate (7.9%). This demonstrates that the RF classifier can maintain strong overall performance even with fewer features (six selected attributes instead of 42).

Table 3. CoEnBP/non-CoEnBP classification models with six attributes selected by Feature Selection
Classifiers | TP rate (%) | FP rate (%) | Precision (%) | F-Measure | AUROC
RF | 91.5 | 7.9 | 92.0 | 0.915 | 0.966
Random Tree | 90.9 | 8.3 | 91.6 | 0.909 | 0.913
J-Rip | 72.8 | 27.4 | 72.8 | 0.728 | 0.731
K-Star | 71.5 | 29.3 | 71.5 | 0.714 | 0.799
Lib-Linear | 69.4 | 31.0 | 69.4 | 0.694 | 0.635
Lib-SVM | 70.1 | 30.7 | 70.0 | 0.699 | 0.697
Multilayer Perceptron | 69.1 | 31.4 | 69.1 | 0.691 | 0.727
Naïve Bayes | 68.4 | 31.4 | 68.7 | 0.685 | 0.720

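The idea behind correlation-based feature subset selection can be illustrated with the CFS merit score, which rewards features that correlate with the class and penalizes features that correlate with each other. The sketch below is an assumption-laden illustration, not Weka's code: it uses absolute Pearson correlation as a simple stand-in (Weka's evaluator may use a different correlation measure), and the function names and toy data are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists (0.0 if degenerate)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def cfs_merit(features, target):
    """CFS merit k*r_cf / sqrt(k + k*(k-1)*r_ff) for a dict of feature columns."""
    names = list(features)
    k = len(names)
    # Mean feature-class correlation.
    r_cf = sum(abs(pearson(features[f], target)) for f in names) / k
    if k == 1:
        return r_cf
    # Mean feature-feature correlation over all pairs.
    pairs = [(a, b) for i, a in enumerate(names) for b in names[i + 1:]]
    r_ff = sum(abs(pearson(features[a], features[b])) for a, b in pairs) / len(pairs)
    return k * r_cf / math.sqrt(k + k * (k - 1) * r_ff)


# Hypothetical toy data: "b" is a rescaled copy of "a", so it adds no merit.
target = [0, 0, 1, 1]
single = cfs_merit({"a": [1, 2, 3, 4]}, target)
with_copy = cfs_merit({"a": [1, 2, 3, 4], "b": [2, 4, 6, 8]}, target)
```

In this toy case the merit is unchanged by the redundant copy, which is why a search over subsets guided by such a score tends to discard highly correlated attributes.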
3.3. RF Classifier with different selected features
Further work then attempted to reduce the number of attributes, starting from the six selected features. Datasets composed of different combinations of these features were evaluated by the performance of the RF model (Table 4). We first removed eSh3 and eSh4 from the six selected features because their properties are similar to those of eSh5. The resulting RF classifier with 4 selected features (J, e1XR, eX2, eSh5) gave a slightly higher prediction performance than the previous model with six selected features. Continuing in this way, we built RF models with further combinations of the selected features. The dataset containing the J, eSh5 and eX2 features yielded the best classification performance with the RF classifier. The final RF model with only 3 selected features (J, eSh5 and eX2) therefore achieved high performance, with TP rate = 91.7%, FP rate = 7.6%, and AUROC = 0.971. It is interesting to note that the model with 3 selected features performed even better than the model built with six selected features.

Table 4. CoEnBP/non-CoEnBP classification models with different attribute datasets by RF
Attributes | TP rate (%) | FP rate (%) | Precision (%) | F-Measure | AUROC
J, eSh3, eSh4, eSh5, e1XR, eX2 | 91.5 | 7.9 | 92.0 | 0.915 | 0.966
J, eSh5, e1XR, eX2 | 91.7 | 7.6 | 92.2 | 0.918 | 0.970
J, eSh5, e1XR | 91.3 | 8.0 | 91.9 | 0.913 | 0.968
J, eSh5, eX2 | 91.7 | 7.6 | 92.3 | 0.918 | 0.971
J, eX2 | 91.3 | 8.0 | 91.9 | 0.913 | 0.965
J, eSh5 | 90.8 | 8.4 | 91.5 | 0.908 | 0.965
eSh5, eX2 | 90.9 | 8.4 | 91.6 | 0.909 | 0.946
J | 90.5 | 8.9 | 91.0 | 0.905 | 0.916
eX2 | 91.1 | 8.3 | 91.6 | 0.911 | 0.924

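The comparison in Table 4 amounts to ranking feature subsets by their reported scores. As a small illustration, the sketch below encodes the TP rate, FP rate and AUROC values from Table 4 and picks the best subset; the ranking rule (highest AUROC first, then higher TP rate, lower FP rate, and fewer features) is our assumption, since the text does not state a formal tie-breaking rule.

```python
# (TP rate %, FP rate %, AUROC) per feature subset, copied from Table 4.
table4 = {
    ("J", "eSh3", "eSh4", "eSh5", "e1XR", "eX2"): (91.5, 7.9, 0.966),
    ("J", "eSh5", "e1XR", "eX2"): (91.7, 7.6, 0.970),
    ("J", "eSh5", "e1XR"): (91.3, 8.0, 0.968),
    ("J", "eSh5", "eX2"): (91.7, 7.6, 0.971),
    ("J", "eX2"): (91.3, 8.0, 0.965),
    ("J", "eSh5"): (90.8, 8.4, 0.965),
    ("eSh5", "eX2"): (90.9, 8.4, 0.946),
    ("J",): (90.5, 8.9, 0.916),
    ("eX2",): (91.1, 8.3, 0.924),
}


def rank_key(item):
    """Sort key: prefer high AUROC, high TP rate, low FP rate, then fewer features."""
    subset, (tp_rate, fp_rate, auroc) = item
    return (-auroc, -tp_rate, fp_rate, len(subset))


best_subset = min(table4.items(), key=rank_key)[0]
```

Under this rule the selected subset is (J, eSh5, eX2), in agreement with the model chosen in the text.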
3.4. Best combinatorial features with different classifiers
Finally, we aimed to find the best classifier for the selected features (J, eSh5, and eX2) (Table 5). The results demonstrated that the RF classifier again gave the best classification performance among the classifiers used in this study. In short, the RF classifier built with 3 features (J, eSh5 and eX2) calculated from Star Graph topological indices can be viewed as the best classification model for predicting whether or not an uncharacterized protein sequence has coenzyme-binding ability. For coenzyme-binding properties, the powers 2 and 5 may reflect a vital role of the interaction between amino acids at a distance of 2 or 5 in the amino acid main chain, similar to the important role of power 5 for nucleotide-binding peptides in our previous report [37].

Table 5. CoEnBP/non-CoEnBP classification results based on the best feature dataset obtained (J, eSh5, and eX2)
Weka classifiers | TP rate (%) | FP rate (%) | Precision (%) | F-Measure | AUROC
RF | 91.7 | 7.6 | 92.3 | 0.918 | 0.971
Random Tree | 91.4 | 7.9 | 92.0 | 0.914 | 0.918
J-Rip | 72.3 | 28.0 | 72.3 | 0.723 | 0.720
K-Star | 70.8 | 30.2 | 70.8 | 0.706 | 0.772
Lib-Linear | 69.5 | 31.1 | 69.5 | 0.695 | 0.692
Lib-SVM | 70.1 | 30.7 | 70.1 | 0.700 | 0.697
Multilayer Perceptron | 69.4 | 31.1 | 69.3 | 0.693 | 0.737
Naïve Bayes | 68.2 | 31.6 | 68.5 | 0.683 | 0.746

Among the eight classifiers, RF yielded better classification performance than the others on the coenzyme-binding dataset used in the present study. In addition, the RF classifier with only 3 selected features, the Balaban distance connectivity index (J), the embedded Randić connectivity index with power = 2 (eX2) and the embedded Shannon entropy descriptor with power = 5 (eSh5), provided the best classification performance. It combines high performance with few feature attributes that encode the peptide sequence information of the CoEnBP/non-CoEnBP dataset.
Conclusion
In all, the present study proposes a first machine learning classification model designed to identify uncharacterized proteins with coenzyme-binding ability from Star Graph topological indices extracted from primary amino acid sequences. The proposed random forest model, based on 3 feature attributes (J, eSh5 and eX2), shows high predictive performance, with a TP rate = 91.7%, an FP rate = 7.6%, a precision = 92.3%, and an AUROC = 0.971. This provides further evidence for combining machine learning classifiers with molecular Star Graph descriptors of primary amino acid sequences to predict peptide function. The model can be used to predict new peptides with coenzyme-binding ability for future drug development, or to suggest new ideas in research on enzyme catalysis and metabolism.
Acknowledgements
All authors thank the National Key Research and Development Plan of China (2016YFD0700201), the National Natural Science Foundation of China (Grant No. 31702141), and the Chinese Academy of Sciences (CAS) Pioneer Hundred Talents Program (Y823042010) for their joint support. In particular, Dr. Y. Liu acknowledges the General Directorate of Academic Personnel Affairs of the National Autonomous University of Mexico (PAPIIT-UNAM IN224717) for financial support of his postdoctoral research. In addition, all authors thank Dr. E. Frank, M.A. Hall and I.H. Witten for providing the free machine learning application Weka used to complete this work, and the developers of the S2SNet application used to calculate the Star Graph topological indices of the primary protein sequences.
References:
[1] B.N. Ames, I. Elson-Schwab, E.A. Silver, The American journal of clinical nutrition, 75 (2002) 616-658.
[2] D.F. Aktas, P.F. Cook, Biochimica et Biophysica Acta (BBA) - Proteins and Proteomics, 1784 (2008) 2059-2064.
[3] A. Glasfeld, Journal of Chemical Education, 81 (2004) 646.
[4] M. Ziegler, European journal of biochemistry / FEBS, 267 (2000) 1550-1564.
[5] M. Rizzi, H. Schindelin, Current opinion in structural biology, 12 (2002) 709-720.
[6] F. Berger, M.H. Ramirez-Hernandez, M. Ziegler, Trends in biochemical sciences, 29 (2004) 111-118.
[7] H. Chen, J.A. Piccirilli, M.E. Harris, D.M. York, Biochimica et Biophysica Acta (BBA) - Proteins and Proteomics, 1854 (2015) 1795-1800.
[8] H.M. Berman, J. Westbrook, Z. Feng, G. Gilliland, T.N. Bhat, H. Weissig, I.N. Shindyalov, P.E. Bourne, Nucleic Acids Research, 28 (2000) 235-242.
[9] M. Ashburner, C.A. Ball, J.A. Blake, D. Botstein, H. Butler, J.M. Cherry, A.P. Davis, K. Dolinski, S.S. Dwight, J.T. Eppig, M.A. Harris, D.P. Hill, L. Issel-Tarver, A. Kasarskis, S. Lewis, J.C. Matese, J.E. Richardson, M. Ringwald, G.M. Rubin, G. Sherlock, Nat Genet, 25 (2000) 25-29.
[10] J. Walter, S. Hausmann, T. Drepper, M. Puls, T. Eggert, M. Dihné, PLOS ONE, 7 (2012) e43921.
[11] L. Zeng, W.-H. Shin, X. Zhu, S.H. Park, C. Park, W.A. Tao, D. Kihara, Journal of Proteome Research, 16 (2017) 470-480.
[12] P. Poltronieri, N. Čerekovic, Challenges, 9 (2018) 3.
[13] R. Chofor, S. Sooriyaarachchi, M.D. Risseeuw, T. Bergfors, J. Pouyez, C. Johny, A. Haymond, A. Everaert, C.S. Dowd, L. Maes, T. Coenye, A. Alex, R.D. Couch, T.A. Jones, J. Wouters, S.L. Mowbray, S. Van Calenbergh, Journal of medicinal chemistry, (2015).
[14] T. Rumpf, M. Schiedel, B. Karaman, C. Roessler, B.J. North, A. Lehotzky, J. Oláh, K.I. Ladwein, K. Schmidtkunz, M. Gajer, M. Pannek, C. Steegborn, D.A. Sinclair, S. Gerhardt, J. Ovádi, M. Schutkowski, W. Sippl, O. Einsle, M. Jung, Nat Commun, 6 (2015).
[15] A.V. Crain, J.B. Broderick, Biochimica et Biophysica Acta (BBA) - Proteins and Proteomics, 1834 (2013) 2512-2519.
[16] W. Shi, G. Kovacikova, W. Lin, R.K. Taylor, K. Skorupski, F.J. Kull, Nat Commun, 6 (2015) 6032.
[17] T. Nishino, K. Okamoto, Y. Kawaguchi, T. Matsumura, B.T. Eger, E.F. Pai, T. Nishino, The FEBS journal, (2015).
[18] S.I. Bibikov, L.A. Barnes, Y. Gitin, J.S. Parkinson, Proceedings of the National Academy of Sciences of the United States of America, 97 (2000) 5830-5835.
[19] K. Okamura-Ikeda, H. Hosaka, N. Maita, K. Fujiwara, A.C. Yoshizawa, A. Nakagawa, H. Taniguchi, J Biol Chem, 285 (2010) 18684-18692.
[20] H. Ishikita, B.T. Eger, K. Okamoto, T. Nishino, E.F. Pai, Journal of the American Chemical Society, 134 (2012) 999-1009.
[21] J.B. Siegel, A.L. Smith, S. Poust, A.J. Wargacki, A. Bar-Even, C. Louw, B.W. Shen, C.B. Eiben, H.M. Tran, E. Noor, J.L. Gallaher, J. Bale, Y. Yoshikuni, M.H. Gelb, J.D. Keasling, B.L. Stoddard, M.E. Lidstrom, D. Baker, Proceedings of the National Academy of Sciences of the United States of America, 112 (2015) 3704-3709.
[22] A. Serganov, A. Polonskaia, A.T. Phan, R.R. Breaker, D.J. Patel, Nature, 441 (2006) 1167-1171.
[23] M. Burton, T.M. Rose, N.J. Færgeman, J. Knudsen, Biochemical Journal, 392 (2005) 299-307.
[24] F.T. Harris, S.M. Rahman, M. Hassanein, J. Qian, M.D. Hoeksema, H. Chen, R. Eisenberg, P. Chaurand, R.M. Caprioli, M. Shiota, P.P. Massion, Cancer prevention research (Philadelphia, Pa.), 7 (2014) 748-757.
[25] N.S. Ferreira, H. Engelsby, D. Neess, S.L. Kelly, G. Volpert, A.H. Merrill, A.H. Futerman, N.J. Færgeman, Journal of Biological Chemistry, 292 (2017) 7588-7597.
[26] L. Lee, C. Anthony DeBono, D.R. Campagna, D.C. Young, D. Branch Moody, M.D. Fleming, Journal of Investigative Dermatology, 127 (2007) 16-23.
[27] B. Gaigg, T.B.F. Neergaard, R. Schneiter, J.K. Hansen, N.J. Færgeman, N.A. Jensen, J.R. Andersen, J. Friis, R. Sandhoff, H.D. Schrøder, J. Knudsen, Molecular Biology of the Cell, 12 (2001) 1147-1160.
[28] M.J. Besman, K. Yanagibashi, T.D. Lee, M. Kawamura, P.F. Hall, J.E. Shively, Proceedings of the National Academy of Sciences of the United States of America, 86 (1989) 4897-4901.
[29] J. Devillers, A.T. Balaban, Topological Indices and Related Descriptors in QSAR and QSPR, Gordon and Breach, The Netherlands, (1999).
[30] M. Randić, J. Zupan, D. Vikic-Topic, J Mol Graph Model, (2007) 290-305.
[31] C.R. Munteanu, A.L. Magalhães, E. Uriarte, H. González-Díaz, Journal of theoretical biology, 257 (2009) 303-311.
[32] H. Van Waterbeemd, Discriminant Analysis for Activity Prediction, in: H. Van Waterbeemd (Ed.) Chemometric methods in molecular design, Wiley-VCH, New York, 1995, pp. 265-282.
[33] D. Rivero, E. Fernandez-Blanco, J. Dorado, A. Pazos, in: Evolutionary Computation (CEC), 2011 IEEE Congress on, IEEE, 2011, pp. 587-592.
[34] C.R. Munteanu, A.L. Magalhaes, E. Uriarte, H. Gonzalez-Diaz, Journal of theoretical biology, 257 (2009) 303-311.
[35] H. Gonzalez-Diaz, C.R. Munteanu, L. Postelnicu, F. Prado-Prado, M. Gestal, A. Pazos, Mol Biosyst, 8 (2012) 851-862.
[36] E. Fernandez-Blanco, V. Aguiar-Pulido, C.R. Munteanu, J. Dorado, Journal of theoretical biology, 317 (2012) 331-337.
[37] Y. Liu, C.R. Munteanu, E.F. Blanco, Z. Tan, A.S.d. Riego, A. Pazos, Molecular Informatics, 34 (2015).
[38] T.M. Barna, H. Khan, N.C. Bruce, I. Barsukov, N.S. Scrutton, P.C. Moody, Journal of molecular biology, 310 (2001) 433-447.
[39] F.V. Sena, A.P. Batista, T. Catarino, J.A. Brito, M. Archer, M. Viertler, T. Madl, E.J. Cabrita, M.M. Pereira, Molecular microbiology, 98 (2015) 272-288.
[40] D.K. Inaoka, T. Shiba, D. Sato, E.O. Balogun, T. Sasaki, M. Nagahama, M. Oda, S. Matsuoka, J. Ohmori, T. Honma, M. Inoue, K. Kita, S. Harada, International journal of molecular sciences, 16 (2015) 15287-15308.
[41] T. Oliveira, S. Panjikar, J.B. Carrigan, M. Hamza, M.A. Sharkey, P.C. Engel, A.R. Khan, Journal of structural biology, 177 (2012) 543-552.
[42] J.H. Hurley, A.M. Dean, D.E. Koshland, Jr., R.M. Stroud, Biochemistry, 30 (1991) 8671-8678.
[43] M.A. Sharkey, T.F. Oliveira, P.C. Engel, A.R. Khan, The FEBS journal, 280 (2013) 10.1111/febs.12439.
[44] T. Tomita, S. Fushinobu, T. Kuzuyama, M. Nishiyama, Biochemical and biophysical research communications, 334 (2005) 613-618.
[45] P.A. Buckley, J. Baz Jackson, T. Schneider, S.A. White, D.W. Rice, P.J. Baker, Structure (London, England : 1993), 8 (2000) 809-815.
[46] D.J. Mills, S. Vitt, M. Strauss, S. Shima, J. Vonck, eLife, 2 (2013) e00218.
[47] K. Ceh, U. Demmer, E. Warkentin, J. Moll, R.K. Thauer, S. Shima, U. Ermler, Biochemistry, 48 (2009) 10098-10105.
[48] M.G. Bertero, R.A. Rothery, M. Palak, C. Hou, D. Lim, F. Blasco, J.H. Weiner, N.C. Strynadka, Nature structural biology, 10 (2003) 681-687.
[49] D.P. Kloer, C. Hagel, J. Heider, G.E. Schulz, Structure (London, England : 1993), 14 (2006) 1377-1388.
[50] G. Wang, R.L. Dunbrack, Jr., Bioinformatics, 19 (2003) 1589-1591.
[51] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, I.H. Witten, SIGKDD Explorations, 11 (2009).
[52] J. Xu, Y. Zhang, Bioinformatics, 26 (2010) 889-895.
[53] C.R. Munteanu, A.L. Magalhaes, A. Duardo-Sanchez, A. Pazos, H. Gonzalez-Diaz, Current Bioinformatics, 8 (2013) 429-437.
[54] M. González-Durruthy, J. Monserrat, B. Rasulev, G. Casañola-Martín, J. Barreiro Sorrivas, S. Paraíso-Medina, V. Maojo, H. González-Díaz, A. Pazos, C. Munteanu, Nanomaterials, 7 (2017) 386.
[55] A.A. Dobrynin, R. Entringer, I. Gutman, Acta Applicandae Mathematica, 66 (2001) 211-249.
[56] H. Hua, S. Zhang, Discrete Applied Mathematics, 160 (2012) 1152-1163.
[57] I.H. Witten, E. Frank, M.A. Hall, C.J. Pal, Data Mining, Fourth Edition: Practical Machine Learning Tools and Techniques, Morgan Kaufmann Publishers Inc., (2016).
[58] I.H. Witten, E. Frank, M.A. Hall, C.J. Pal, Chapter 10 - Deep learning, in: I.H. Witten, E. Frank, M.A. Hall, C.J. Pal (Eds.) Data Mining (Fourth Edition), Morgan Kaufmann, 2017, pp. 417-466.
[59] L. Breiman, Machine Learning, 45 (2001) 5-32.
[60] C.M. Bishop, Pattern recognition and machine learning, Springer, (2006).
[61] J.G. Cleary, L.E. Trigg, in: Machine Learning International Workshop, Morgan Kaufmann Publishers, Inc., 1995, pp. 108-114.
[62] W.W. Cohen, Fast Effective Rule Induction, in: Twelfth International Conference on Machine Learning, 1995, pp. 115-123.
[63] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, C.-J. Lin, J. Mach. Learn. Res., 9 (2008) 1871-1874.
[64] F. Rosenblatt, Principles of neurodynamics; perceptrons and the theory of brain mechanisms, Spartan Books, Washington, (1962).
[65] C.-C. Chang, C.-J. Lin, ACM Trans. Intell. Syst. Technol., 2 (2011) 1-27.
[66] G.H. John, P. Langley, in: 11th Conference on Uncertainty in Artificial Intelligence, Morgan Kaufmann, Montreal, Quebec, 1995, pp. 338-345.
[67] G.J. McLachlan, K.-A. Do, C. Ambroise, Analyzing microarray gene expression data, Wiley, (2004).
[68] S.B. Kotsiantis, in: Proceedings of the 2007 conference on Emerging Artificial Intelligence Applications in Computer Engineering: Real Word AI Systems with Applications in eHealth, HCI, Information Retrieval and Pervasive Technologies, IOS Press, 2007, pp. 3-24.
[69] C.E. Shannon, W. Weaver, R.E. Blahut, B. Hajek, The mathematical theory of communication, University of Illinois press Urbana, (1949).
[70] D.J. MacKay, Information theory, inference and learning algorithms, Cambridge university press, (2003).
[71] V.N. Vapnik, in, Nauka, English Translation Springer Verlag, 1982, 1979.
[72] O. Chapelle, P. Haffner, V.N. Vapnik, IEEE Transactions on Neural Networks, 10 (1999) 1055-1064.
[73] L.S. Moulin, A.P. Alves Da Silva, M.A. El-Sharkawi, R.J. Marks Ii, IEEE Transactions on Power Systems, 19 (2004) 818-825.
[74] C. Ferri, J. Hernandez-Orallo, R. Modroiu, Pattern Recogn. Lett., 30 (2009) 27-38.
[75] I. Witten, E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, Second Edition (Morgan Kaufmann Series in Data Management Systems), Morgan Kaufmann, (2005).
[76] H. Jin, IEEE Transactions on Knowledge and Data Engineering, 17 (2005) 299-310.