Enzyme Classification using Multiclass Support Vector Machine and Feature Subset Selection

Debasmita Pradhan (1), Sudarsan Padhy (1), Biswajit Sahoo (2)

(1) Department of Computer Science and Engineering, Silicon Institute of Technology, Silicon Hills, Patia, Bhubaneswar, 751024
(2) School of Computer Engineering, KIIT University, Bhubaneswar, 751024
Highlights

- Primary sequences of proteins are collected from the ExPASy portal and converted to vectors of physico-chemical properties, from which the properties that contribute significantly to protein function classification are identified using feature subset selection algorithms.
- Because wet-lab methods are expensive and time consuming, proteins are classified into enzymes and non-enzymes, and enzymes subsequently into their six functional classes, using a multiclass SVM.
- The proposed model generalizes an existing model to include multiclass classification and to identify the most significant features affecting protein function.
- The model performs well with respect to several performance measures.
Abstract: Proteins are the macromolecules responsible for almost all biological processes in a cell. With the availability of a large number of protein sequences from different sequencing projects, the challenge for scientists is to characterize their functions. As wet-lab methods are time consuming and expensive, many computational methods such as FASTA, PSI-BLAST, DNA microarray clustering, and nearest-neighborhood classification on protein-protein interaction networks have been proposed. The support vector machine is one such method, used successfully for problems such as protein fold recognition and protein structure prediction. Cai et al. (2003) used SVMs, with physico-chemical properties representing the protein sequences, to classify proteins into different functional classes and to predict their function. In this paper a model comprising feature subset selection followed by a multiclass support vector machine is proposed to determine the functional class of a newly generated protein sequence. To train and test the model, 32 physico-chemical properties of enzymes from the 6 enzyme classes are considered. To determine the features that contribute significantly to functional classification, the Sequential Forward Floating Selection (SFFS), Orthogonal Forward Selection (OFS) and SVM Recursive Feature Elimination (SVM-RFE) algorithms are used, and it is observed that, of the 32 properties considered initially, only 20 features are sufficient to classify the proteins into their functional classes with an accuracy ranging from 91% to 94%. On comparison, OFS followed by SVM performs better than the other methods. Our model generalizes the existing model to include multiclass classification and to identify the most significant features affecting protein function.

Keywords: Enzyme Classification, Multiclass SVM, Sequential Forward Floating Selection (SFFS), Orthogonal Forward Selection (OFS), SVM Recursive Feature Elimination (SVM-RFE), Random Forest (RF)
1. Introduction
Proteins are important macromolecules responsible for almost all biological processes in a cell, such as growth, function, cell metabolism and maintenance. With the availability of a large number of biological sequences obtained from different sequencing projects [1,2], the challenge for scientists is to determine the functions of the newly generated protein sequences in order to understand the biological processes [3-5]. There are many methods available for functional annotation of newly sequenced proteins. The wet-lab approach to functional characterization of proteins is time consuming and expensive, whereas computational approaches are fast and cost effective. The classical computational approaches to function prediction use programs such as FASTA [6] and PSI-BLAST [7], which are based on homology between annotated sequences and the unannotated (new) sequence. Methods of comparative genomics are also used for the prediction of protein function [8]; they consider proteins to be functionally linked if they have similar phylogenetic profiles [9,10]. Some authors, such as David J. Lockhart et al. and Mark Schena [11,12], designed clustering algorithms for DNA-microarray data to predict protein function, based on the assumption that genes with correlated expression profiles are functionally related [13,14]. Protein-protein interaction networks are also used for prediction of protein function via a nearest-neighborhood approach [15], based on the fact that proteins may interact for a common purpose; but as protein-protein interaction data are noisy, the prediction accuracy is low. Some methods [16,17] predict the function of a protein by classifying it into a specific functional class based on sequence similarity. These methods work well if the similarity between sequences is significant; however, the prediction becomes essentially random if the similarity between two sequences falls below a threshold. The support vector machine method [18] has been used for protein fold recognition [19,20], protein structure prediction [21-23], protein-protein interaction prediction, and protein function classification [24]. In these problems the physico-chemical properties of proteins, computed from their sequences, are used as input. Cai et al. [24] used a binary SVM classifier to predict the functional class of a protein. They considered functional classes such as RNA-binding proteins, protein homodimers, drug absorption proteins, drug delivery proteins, drug excretion proteins, and Class-I and Class-II drug metabolizing enzymes, used 1808 physico-chemical properties such as hydrophobicity, polarity, polarizability, charge, surface tension, and secondary structure to represent a protein sequence, and obtained accuracies in
the range 88% to 99% for different classes. Moreover, as the dimension of the feature vector used is very high, the computation is expensive. In our model, in the first step a binary classifier is designed to classify a protein sequence as enzyme or non-enzyme. In the second step a multiclass classifier is designed to predict the functional class of the protein from the six available enzyme classes: oxidoreductases, transferases, hydrolases, lyases, isomerases, and ligases. To implement the model, 32 physico-chemical properties are initially considered: number of amino acids, theoretical pI, amino acid compositions (20), number of negatively charged residues, number of positively charged residues, atomic compositions (5), aliphatic index, and hydrophobicity. Since many of these features may carry redundant information, the Sequential Forward Floating Selection (SFFS) [25], Orthogonal Forward Selection (OFS) [26] and SVM Recursive Feature Elimination (SVM-RFE) [27,28] algorithms are applied to identify the most significant features for classifying the proteins. SFFS identifies the amino acid compositions Ala(A), Asn(N), Cys(C), Gln(Q), Glu(E), Ile(I), Leu(L), Lys(K), Met(M), Phe(F), Pro(P), Ser(S), Thr(T), Trp(W), Tyr(Y), Val(V) and the atomic compositions Hydrogen(H), Nitrogen(N), Oxygen(O), Sulfur(S) as the more significant features, whereas OFS identifies the aliphatic index, number of amino acids, the atomic compositions Carbon(C) and Oxygen(O), the amino acid compositions Cys(C), Asp(D), Arg(R), Phe(F), Gly(G), Pro(P), His(H), Ile(I), Thr(T), Trp(W), Leu(L), Gln(Q), Lys(K), Tyr(Y), the number of positively charged residues, and the number of negatively charged residues as the more significant features. When SVM-RFE is applied, it drops seven features, namely number of amino acids, theoretical pI, Cys(C), Gly(G), Ile(I), Carbon(C) and Sulfur(S), yielding 25 significant features, with which an accuracy range of 90.6149%-93.5275% is obtained. The results of these three algorithms show that Gln(Q), Leu(L), Lys(K), Phe(F), Pro(P), Thr(T), Trp(W), Tyr(Y), and Oxygen(O) play a major role in the functional classification of proteins. Using all 32 features, i.e., without feature selection (WFS), an accuracy range from 90.9699% to 93.6455% is obtained, whereas using SFFS with 20 significant features an accuracy range from 90.3010% to 92.3077% is obtained, and using OFS with 20 significant features an accuracy range from 89.6321% to 94.3144% is obtained. Our model thus finds that 20 (atomic and amino acid compositions) of the 32 physico-chemical properties are sufficient to predict the functional class of a protein with high accuracy. The performance of our model is compared with the Random Forest classification algorithm [29], whose average accuracy is 86.7314%; all three feature selection variants discussed above achieve a better average accuracy than the Random Forest model. The rest of the paper is organized as follows. Section 2 presents the multiclass support vector machine, the Sequential Forward Floating Selection algorithm, and the Orthogonal Forward Selection algorithm. Section 3 describes the proposed model, Section 4 discusses the results and performance of our model, and Section 5 concludes the work.
2. Preliminaries

2.1 Multiclass Support Vector Machine
The support vector machine described in Appendix A is a binary classifier, i.e., it classifies objects belonging to two distinct classes. However, real-world problems often require classifying objects into more than two classes. There are several approaches to using SVMs for multiclass classification; the following are the most frequently used.

a. One-versus-the-rest classification. This approach constructs as many support vector machines as there are classes, i.e., given $M$ classes it constructs $M$ binary SVM classifiers $f_1, \dots, f_M$. To construct $f_i$, the $i$th classifier ($i = 1, \dots, M$), an SVM is trained with the patterns of the $i$th class as positive samples and the patterns of all remaining classes as negative samples. An unknown sample $X$ is classified by presenting it to each classifier and applying a majority voting technique: the class label with maximum frequency is assigned to $X$. A major limitation of this approach is that the training samples used to build each binary model are highly unbalanced.

b. Pairwise classification. The pairwise classification technique avoids the limitation of the above method by constructing a decision surface for each pair of classes. Given the training set $D = \{(x_i, y_i)\}$, with $x_i \in \mathbb{R}^n$ and $y_i \in \{1, 2, \dots, M\}$, this method generates $M(M-1)/2$ classifiers, one for each pair of classes. Let $f_{ij}$ be the classifier that separates classes $i$ and $j$, with $i < j$ and $i, j \in \{1, 2, \dots, M\}$; $f_{ij}$ is trained taking $D_i$ as the positive class and $D_j$ as the negative class, where $D_i$ denotes the samples in $D$ with class label $i$. The output of the classifier $f_{ji}$ is $-f_{ij}$. Once the classifiers are trained, an unknown sample $X$ is classified by presenting it to each of the $M(M-1)/2$ classifiers; each classifier assigns a class label to the sample, and the class label with the highest count is taken as the label of $X$.
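As a concrete illustration (not part of the paper's original implementation), the following minimal Python sketch shows both strategies using scikit-learn, which is assumed to be available; the toy data, kernel and parameter values are placeholders.

```python
import numpy as np
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

rng = np.random.default_rng(0)
M, n = 6, 32                                  # six enzyme classes, 32 features
X = rng.normal(size=(300, n)) + np.repeat(np.arange(M), 50)[:, None]
y = np.repeat(np.arange(1, M + 1), 50)        # toy, roughly separable data

# a. One-versus-the-rest: M binary SVMs, each trained with one class as
#    positive and all remaining classes as negative.
ovr = OneVsRestClassifier(SVC(kernel="rbf")).fit(X, y)

# b. Pairwise (one-versus-one): M(M-1)/2 = 15 binary SVMs, one per pair of
#    classes; an unknown sample receives the label with the highest vote count.
ovo = OneVsOneClassifier(SVC(kernel="rbf")).fit(X, y)

print(ovr.predict(X[:2]), ovo.predict(X[:2]))
```

Note that scikit-learn's SVC itself also applies the one-versus-one scheme internally when given multiclass input.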
2.2 Feature Subset Selection

Given a set of $n$ features, the goal of feature subset selection is to select a subset of $d$ features ($d < n$) without significantly degrading the performance of the recognition system [25].
2.2.1 Sequential Forward Floating Selection Method (SFFS) [25]

This method starts with an empty feature set. In successive steps, features are included or excluded depending on a class separability measure. We use the class separability measure $C$ of a feature set $S$ defined by

$C(S) = \mathrm{trace}(S_w^{-1} S_b)$,

where $S_w$ is the within-class scatter matrix,

$S_w = \sum_{i=1}^{c} p_i s_i$,

$p_i$ is the prior probability of class $i$, $s_i$ is the covariance matrix of class $i$, and $c$ is the number of classes. $S_b$ is the between-class scatter matrix,

$S_b = \sum_{i=1}^{c} p_i (m_i - m_0)(m_i - m_0)^T$,

where $m_0 = \sum_{i=1}^{c} p_i m_i$ is the global mean vector and $m_i$ is the mean vector of class $i$.
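As an illustrative sketch (NumPy assumed; not the authors' code), this measure can be computed for a candidate feature subset as follows, with a pseudo-inverse guarding against a singular $S_w$:

```python
import numpy as np

def class_separability(X, y):
    """C(S) = trace(S_w^{-1} S_b) for the N x d feature-subset matrix X."""
    classes, counts = np.unique(y, return_counts=True)
    priors = counts / len(y)                    # class probabilities p_i
    m0 = X.mean(axis=0)                         # global mean vector
    d = X.shape[1]
    Sw, Sb = np.zeros((d, d)), np.zeros((d, d))
    for p, c in zip(priors, classes):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)                    # class mean vector m_i
        Sw += p * np.cov(Xc, rowvar=False)      # within-class scatter
        Sb += p * np.outer(mc - m0, mc - m0)    # between-class scatter
    return float(np.trace(np.linalg.pinv(Sw) @ Sb))
```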
The following algorithm describes the SFFS technique.

Algorithm (SFFS):
1. Let $F_c$ be the set of currently selected features; initially $F_c = \emptyset$.
2. (Inclusion) Select the best remaining feature (the one with the highest class separability measure), $x^+ = \arg\max_{x \notin F_c} C(F_c \cup \{x\})$, and set $F_c = F_c \cup \{x^+\}$.
3. (Conditional exclusion) Find the least significant feature in $F_c$, i.e., the feature contributing least to the class separability measure, $x^- = \arg\max_{x \in F_c} C(F_c \setminus \{x\})$. If $C(F_c \setminus \{x^-\}) > C(F_c)$, set $F_c = F_c \setminus \{x^-\}$ and go to step 3; otherwise go to step 2.

The search terminates when the desired number of features has been selected.
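A compact sketch of this floating search, under the assumption that `criterion` is a callable such as the `class_separability` function above; the target subset size `d` and all names are illustrative:

```python
def sffs(X, y, d, criterion):
    """Return d feature indices chosen by sequential forward floating selection."""
    selected, remaining = [], list(range(X.shape[1]))
    while len(selected) < d:
        # Step 2 (inclusion): add the most significant remaining feature.
        best = max(remaining, key=lambda f: criterion(X[:, selected + [f]], y))
        selected.append(best)
        remaining.remove(best)
        # Step 3 (conditional exclusion): drop features while doing so improves C.
        while len(selected) > 2:
            worst = max(selected,
                        key=lambda f: criterion(X[:, [g for g in selected if g != f]], y))
            reduced = [g for g in selected if g != worst]
            if criterion(X[:, reduced], y) > criterion(X[:, selected], y):
                selected = reduced
                remaining.append(worst)
            else:
                break
    return selected

# e.g. subset = sffs(X, y, 20, class_separability)
```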
2.2.2 Orthogonal Forward Selection Method (OFS) [26]

The OFS method uses Gram-Schmidt orthogonal decomposition of the feature matrix. As the features are decorrelated in the orthogonal space, they can be evaluated and selected independently of one another. After the subset of features is selected in the orthogonal space, it is mapped back to the corresponding subset in the original feature space, which serves as the desired significant subset. The following algorithm describes the method [26].

Let there be $N$ samples $X_1, X_2, \dots, X_N$ in $D$, each represented by an $n$-dimensional vector $X_j = (x_{1j}, x_{2j}, \dots, x_{nj})^T$, $j = 1, 2, \dots, N$. The feature vector $x_i$ is defined as $x_i = (x_{i1}, x_{i2}, \dots, x_{iN})^T$, $i = 1, 2, \dots, n$, and the feature matrix is

$X = [x_1, x_2, \dots, x_n] = \begin{pmatrix} x_{11} & x_{21} & \dots & x_{n1} \\ x_{12} & x_{22} & \dots & x_{n2} \\ \vdots & \vdots & & \vdots \\ x_{1N} & x_{2N} & \dots & x_{nN} \end{pmatrix}$.

Algorithm (OFS):
1. Let all features $x_i$ ($i = 1, 2, \dots, n$) be candidates to enter the feature matrix. Set $q_1^{(i)} = x_i$ and compute the Mahalanobis distance measure provided by each $q_1^{(i)}$, $i = 1, 2, \dots, n$. Let $x_j$ yield the maximum class separability; set $q_1 = x_j$.
2. For all remaining $n-1$ features, compute $q_2^{(i)} = x_i - \alpha_{12}^{(i)} q_1$, $1 \le i \le n$, $i \ne j$, where $\alpha_{12}^{(i)} = q_1^T x_i / q_1^T q_1$, and compute the corresponding Mahalanobis distance measures. The feature providing the maximum class separation is identified and added to the feature subset.
3. The process continues, orthogonalizing the remaining candidates against all selected features, until the class separability measure provided by the next best feature is less than a prespecified threshold.
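The following Python sketch illustrates the idea (again an assumption-laden illustration, not the authors' implementation); for simplicity a one-dimensional scatter-ratio score, the 1-D analogue of the separability measure of Section 2.2.1, stands in for the Mahalanobis distance measure of [26]:

```python
import numpy as np

def separability_1d(q, y):
    """1-D between-class over within-class scatter; a simplified stand-in
    for the Mahalanobis distance measure used in [26]."""
    m0, sb, sw = q.mean(), 0.0, 0.0
    for c in np.unique(y):
        qc = q[y == c]
        p = len(qc) / len(q)
        sb += p * (qc.mean() - m0) ** 2
        sw += p * qc.var()
    return sb / (sw + 1e-12)

def ofs(X, y, d):
    """Select d feature indices by Gram-Schmidt orthogonal forward selection."""
    n = X.shape[1]
    Q = X.astype(float).copy()          # working, progressively orthogonalized features
    selected = []
    for _ in range(d):
        scores = [separability_1d(Q[:, i], y) if i not in selected else -np.inf
                  for i in range(n)]
        j = int(np.argmax(scores))      # best feature in the orthogonal space
        selected.append(j)
        qj = Q[:, j]
        for i in range(n):              # decorrelate remaining candidates against q_j
            if i not in selected:
                alpha = (qj @ Q[:, i]) / (qj @ qj)
                Q[:, i] -= alpha * qj
    return selected                     # indices map back to the original features
```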
3. Proposed Method
We are given a data set $D$ of protein sequences and their class labels. Each sequence is represented by 32 physico-chemical properties, so the data set contains the protein sequences represented by 32 features each along with their class labels, i.e., $D = \{(X_i, y_i)\}$, $i \in \{1, 2, \dots, N\}$, where $X_i \in \mathbb{R}^{32}$ is the feature vector of the $i$th sequence and $y_i \in \{0, 1\}$ for the classification of a protein as enzyme or non-enzyme, or $y_i \in \{1, 2, \dots, M\}$ for classifying the enzyme class. The objective of our work is to develop a model that predicts the class label of a new protein in two steps. In the first step a binary SVM classifier $F_1$ is developed, which classifies a new protein as enzyme or non-enzyme. In the second step a multiclass SVM classifier $F_2$ is developed to predict the functional class label of the new protein. Given a new sequence, our model first determines whether it is an enzyme or non-enzyme sequence; if it is an enzyme, the model predicts its class label from the six available enzyme classes. To train the model, 32 physico-chemical properties are initially considered to represent a protein sequence. Since many of these properties are redundant, the Sequential Forward Floating Selection algorithm and the Orthogonal Forward Selection algorithm are applied to identify important, non-redundant features that contribute significantly to predicting the functional class of enzymes. To handle non-linearity in the data, the radial basis function (RBF) kernel is used throughout the implementation of the SVM. The proposed work is described in the flowchart given in Figure 1.
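A hedged sketch of this two-step decision procedure follows; `extract_features` is a hypothetical helper (one possible version is sketched in Section 3.1 below), and the training-data variables are placeholders rather than the paper's actual data.

```python
from sklearn.svm import SVC

# F1: binary RBF-kernel SVM, enzyme (1) vs. non-enzyme (0).
F1 = SVC(kernel="rbf").fit(X_binary, y_binary)        # placeholder training data
# F2: pairwise multiclass SVM over the six enzyme classes; scikit-learn's SVC
# applies the one-versus-one (pairwise) scheme internally.
F2 = SVC(kernel="rbf").fit(X_enzymes, y_enzyme_class)

def predict_protein(sequence):
    x = extract_features(sequence)       # hypothetical: 32 (or selected) features
    if F1.predict([x])[0] == 0:
        return "non-enzyme"
    return int(F2.predict([x])[0])       # main enzyme class label in {1, ..., 6}
```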
3.1 Data Description and Preprocessing
The protein sequences are collected from the SwissProt database. The ProtParam tool of the Expasy.org portal is used to compute the physico-chemical properties of the proteins; the motivation for using these features comes from the work of Cai et al. [24]. The properties include: number of amino acids, molecular weight, theoretical pI, amino acid compositions, number of negatively charged residues, number of positively charged residues, atomic compositions, aliphatic index, and hydrophobicity. Table 1 lists these properties and their dimensions; in total, 32 features are used to represent each protein. To implement the model, enzyme proteins are first taken as positive samples and hemoglobin proteins as negative samples to train the binary SVM classifier $F_1$. Then proteins from the six enzyme classes, oxidoreductases, transferases, hydrolases, lyases, isomerases and ligases, are used to train the multiclass SVM classifier $F_2$.
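By way of illustration, several of the Table 1 properties can be computed from a primary sequence with Biopython's ProtParam module (an assumption; the authors used the ExPASy ProtParam web tool, which reports the same quantities). The aliphatic index below follows Ikai's formula as used by ProtParam, hydrophobicity is taken here as the GRAVY score, and atomic composition is omitted for brevity.

```python
from Bio.SeqUtils.ProtParam import ProteinAnalysis

def extract_features(seq):
    pa = ProteinAnalysis(seq)
    frac = pa.get_amino_acids_percent()            # 20 amino acid fractions
    aliphatic_index = 100.0 * (frac["A"] + 2.9 * frac["V"]
                               + 3.9 * (frac["I"] + frac["L"]))  # Ikai's formula
    return {
        "n_amino_acids": len(seq),
        "molecular_weight": pa.molecular_weight(),
        "theoretical_pI": pa.isoelectric_point(),
        "neg_charged_residues": seq.count("D") + seq.count("E"),
        "pos_charged_residues": seq.count("K") + seq.count("R"),
        "aliphatic_index": aliphatic_index,
        "hydrophobicity": pa.gravy(),              # grand average of hydropathy
        **{"frac_" + a: v for a, v in frac.items()},
    }

print(extract_features("MKWVTFISLLLLFSSAYSRGV"))   # toy example sequence
```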
3.2 Performance Measures
To measure the performance of the binary classifier $F_1$, the parameters true positives (tp), true negatives (tn), false positives (fp), false negatives (fn) and the confusion matrix [Table 2] are considered. These parameters are used to compute performance measures such as Accuracy, Precision, and Recall [Table 3] for $F_1$. For the multiclass classifier $F_2$, the confusion matrix and the per-class parameters true positives (tp_i), true negatives (tn_i), false positives (fp_i), and false negatives (fn_i) for the $i$th class are computed. Using these parameters, the Average Accuracy, $Precision_\mu$, $Recall_\mu$, $Precision_M$ and $Recall_M$ are computed [Table 4][30]. Further, the ROC curve for the multiclass classifier [31] is plotted and shown in Figure 2.
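For reference, the Table 4 measures can be computed directly from an $l \times l$ confusion matrix; the following sketch (NumPy assumed, names illustrative) follows the definitions of [30].

```python
import numpy as np

def multiclass_measures(cm):
    """Measures of Table 4 from an l x l confusion matrix cm
    (rows = actual classes, columns = predicted classes)."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    fp = cm.sum(axis=0) - tp            # predicted as class i but actually not
    fn = cm.sum(axis=1) - tp            # actually class i but predicted otherwise
    tn = cm.sum() - tp - fp - fn
    return {
        "average_accuracy": np.mean((tp + tn) / (tp + tn + fp + fn)),
        "precision_micro": tp.sum() / (tp + fp).sum(),
        "recall_micro": tp.sum() / (tp + fn).sum(),
        "precision_macro": np.mean(tp / (tp + fp)),
        "recall_macro": np.mean(tp / (tp + fn)),
    }
```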
4. Results and Discussion

To train the binary support vector machine classifier $F_1$, 200 distinct enzyme proteins are taken as positive samples and 200 hemoglobin proteins are taken as negative samples. The classifier is then tested on 62 proteins comprising a mixture of enzyme and hemoglobin proteins. The accuracy of the model on this test set is 98.3871%. Table 5 summarizes the performance measures of the classifier $F_1$ with and without feature selection.
After training and testing the binary classifier, the six enzyme classes oxidoreductases, transferases, hydrolases, lyases, isomerases, and ligases are considered to train the multiclass support vector machine classifier $F_2$: 219 samples from oxidoreductases, 225 from transferases, 200 from hydrolases, 207 from lyases, 198 from isomerases and 163 from ligases are taken. The model is trained using the pairwise multiclass classification algorithm, both without and with feature selection. First the model is trained using all 32 physico-chemical properties of the proteins; the average accuracy with all features is 93.6455%. Then the Sequential Forward Floating Selection (SFFS) and Orthogonal Forward Selection (OFS) algorithms are used to find the most significant features. The accuracy of the model with SFFS ranges from 90.3010% to 92.3077%, whereas with OFS it ranges from 89.6321% to 94.3144%. As the support vector machine itself can be used for feature selection, we also implemented SVM-RFE [27,28] to select relevant features. For a two-class classification problem, at every iteration SVM-RFE ranks and removes the feature $i$ whose removal minimizes the ranking criterion

$R_c(i) = \|w^{(-i)}\|^2 = \sum_{k,j} \alpha_k^{*(-i)} \alpha_j^{*(-i)} y_k y_j K^{(-i)}(x_k, x_j)$,

where $K^{(-i)}$ is the kernel function computed with feature $i$ removed; a Gaussian kernel is used in our implementation of SVM-RFE. As our model is a six-class classifier built from 15 pairwise classifiers, each iteration yields 15 candidate features to drop, so a majority vote is applied to select the single feature actually dropped at each iteration. In this way 7 features are dropped from the original set and the remaining 25 features are used to implement the model; with these 25 features an accuracy range of 90.6149%-93.5275% is obtained. For comparison, a Random Forest classifier is implemented, giving an average accuracy of 86.7314%. Appendix B contains the confusion matrices, and Table 6 summarizes the performance measures of all the models implemented. To visualize the performance of the implemented models, bar graphs of average accuracy, $Precision_\mu$ and $Recall_\mu$ are given as Figures B.1, B.2, and B.3 in Appendix B.
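A sketch of one elimination step with the majority vote described above (an illustration under assumptions, not the authors' code): for brevity, the widely used linear-kernel criterion $w_i^2$ from [27] replaces the Gaussian-kernel criterion $R_c(i)$, since it is available directly from scikit-learn's SVC.

```python
import numpy as np
from itertools import combinations
from sklearn.svm import SVC

def rfe_step(X, y, feature_idx):
    """Drop one feature by majority vote over all pairwise classifiers."""
    votes = np.zeros(len(feature_idx))
    for a, b in combinations(np.unique(y), 2):       # 15 pairs for six classes
        mask = (y == a) | (y == b)
        clf = SVC(kernel="linear").fit(X[mask][:, feature_idx], y[mask])
        w2 = (clf.coef_ ** 2).ravel()                # per-feature ranking criterion
        votes[int(np.argmin(w2))] += 1               # this pair's candidate to drop
    drop = feature_idx[int(np.argmax(votes))]        # feature dropped most often
    return [f for f in feature_idx if f != drop]
```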
The results show that our model gives the best performance when features are selected by OFS. The multiclass receiver operating characteristic (ROC) curve [31] is computed for the model with each feature selection algorithm; to save space, only the multiclass ROC curve of the model with the OFS feature selection algorithm is given in Figure 2, as it gives the best performance. The curve shows a high average area under the curve (AUC). Figure 3 gives the class-wise accuracy of the models implemented.

It is observed that the accuracy of SVM with OFS is maximum for all classes except classes 2 and 5 [Figure 3]. With respect to all other performance measures, average accuracy, $Precision_\mu$, $Recall_\mu$, $Precision_M$ and $Recall_M$, SVM with OFS outperforms the other algorithms. For instance, the average accuracy of the model without feature selection is 93.6455%, with SFFS 92.3077%, with OFS 94.3144%, with SVM-RFE 93.5275%, and with Random Forest 86.7314%. So we may conclude that the overall performance of SVM-OFS is better than that of the other algorithms considered here. SFFS, OFS and SVM-RFE each select their own set of features; although the sets differ, the features selected in common by all three algorithms are Gln(Q), Leu(L), Lys(K), Phe(F), Pro(P), Thr(T), Trp(W), Tyr(Y), and Oxygen(O). So we may conclude that these features play a significant role in the functional classification of proteins.
5. Conclusion and Future Work

In this paper a model comprising feature subset selection followed by a multiclass support vector machine is proposed to determine the functional class of a newly generated protein sequence. To train and test the model, 32 physico-chemical properties of enzymes from the 6 enzyme classes are considered. To identify the features that contribute significantly to the functional classification of proteins, the SFFS, OFS and SVM-RFE algorithms are applied. The results of all three algorithms show that Gln(Q), Leu(L), Lys(K), Phe(F), Pro(P), Thr(T), Trp(W), Tyr(Y), and Oxygen(O) play a major role in protein function prediction. It is observed that the features selected by OFS yield the best classification performance with respect to average accuracy, $Precision_\mu$, $Recall_\mu$, $Precision_M$ and $Recall_M$. For instance, with all 32 features an accuracy range from 90.9699% to 93.6455% is obtained, with the 20 features selected by OFS an accuracy range from 89.6321% to 94.3144% is obtained, with the 25 features selected by SVM-RFE an accuracy range from 90.6149% to 93.5275% is obtained, and with Random Forest an average accuracy of 86.7314% is obtained. Our model generalizes the existing model [24] to include multiclass classification and to identify the most significant features affecting protein function, thereby reducing computation time. To handle non-linearity in the data, the radial basis function is used as the kernel with the multiclass support vector machine. In future work we plan to extend the classification to the subclass level and to develop a kernel that, along with the physico-chemical properties, accepts protein primary sequences directly as input.
Acknowledgement

The authors thank the reviewers for their valuable comments.
Appendix A. Support Vector Machine for Binary Classification

The support vector machine (SVM) was pioneered by Vapnik in 1995. It is a linear classifier based on the idea of constructing a hyperplane as the decision surface in such a way that the margin of separation between the hyperplane and the closest positive and negative samples is maximized. Consider the training patterns $\{(X_i, d_i)\}_{i=1}^{N}$, where $X_i \in \mathbb{R}^n$ is the $i$th input pattern and $d_i \in \{-1, +1\}$ is the corresponding desired class label. It is assumed that the subset of patterns with $d_i = +1$ and the subset with $d_i = -1$ are linearly separable. Let the equation of the decision hyperplane be

$W^T X + b = 0$,   (1)

where $W \in \mathbb{R}^n$ and $b \in \mathbb{R}$. The objective of the SVM is to compute $W$ and $b$ in such a way that the margin of separation of the hyperplane from the closest data points is maximized (Figure A.1); such a hyperplane is called the optimal hyperplane.
Figure A.1: Linearly separable data and Optimal Hyperplane
The problem of computing the optimal $W$ and $b$ can be formulated as the following optimization problem $(P)$:

Minimize $\frac{1}{2} W^T W$   (2)

subject to the constraints $d_i (W^T X_i + b) \ge 1$ for $i = 1, \dots, N$.

This problem can be solved by the primal-dual method, where the primal problem is to minimize the Lagrangian function

$J(W, b, \alpha) = \frac{1}{2} W^T W - \sum_{i=1}^{N} \alpha_i [d_i (W^T X_i + b) - 1]$   (3)

over all $W$ and $b$, where the $\alpha_i$, $i = 1, 2, \dots, N$, are Lagrange multipliers. The solution of problem $(P)$ yields

$W = \sum_{i=1}^{N} \alpha_i d_i X_i$,   (4)

where the $\alpha_i$ are determined by solving the dual problem, which can be stated as:

Maximize $Q(\alpha) = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j d_i d_j X_i^T X_j$   (5)

subject to the constraints

(1) $\sum_{i=1}^{N} \alpha_i d_i = 0$   (6)

(2) $\alpha_i \ge 0$ for $i = 1, \dots, N$.   (7)

Let $\alpha_i^*$ be the optimal solution of (5)-(7). From the Karush-Kuhn-Tucker conditions it follows that when $\alpha_j^* > 0$,

$d_j (W^T X_j + b) = 1$,   (8)

which yields

$b = d_j - W^T X_j$ for some index $j$.   (9)

Then from equations (1), (4) and (9) the equation of the optimal hyperplane is

$\sum_{i=1}^{N} \alpha_i^* d_i X_i^T X + d_j - \sum_{i=1}^{N} \alpha_i^* d_i X_i^T X_j = 0$.   (10)
When the given data set is not linearly separable, the problem $(P)$ for determining the optimal $W$ and $b$ is modified as follows $(P_1)$:

Minimize $\frac{1}{2} W^T W + C \sum_{i=1}^{N} \xi_i$   (11)

subject to $d_i (W^T X_i + b) \ge 1 - \xi_i$   (12)

and $\xi_i \ge 0$ for $i = 1, \dots, N$,   (13)

where the $\xi_i$ are slack variables introduced to appropriately handle data points of the type $X_i$ and $X_j$ shown in Figure A.2.
Figure A.2 : Soft Margin / Non-Linear Classifier
It is observed that if $0 \le \xi_i \le 1$, constraints (12) and (13) are satisfied and $X_i$ is correctly classified; however, if $\xi_i > 1$ then $X_i$ is not correctly classified. So, to minimize the number of misclassifications, the term $C \sum_i \xi_i$ is included in the objective function, where $\sum_i \xi_i$ is an upper bound on the number of misclassifications and $C > 0$ is a user-specified constant (known as the regularization parameter) used to modulate the trade-off between maximizing the margin and minimizing the misclassifications. When problem $(P_1)$ is solved by the primal-dual method, the corresponding dual problem is:

Maximize $Q(\alpha) = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j d_i d_j X_i^T X_j$

subject to

$\sum_{i=1}^{N} \alpha_i d_i = 0$ and $0 \le \alpha_i \le C$, $i = 1, \dots, N$,

and $b = d_j - W^T X_j$ for some index $j$ with $0 < \alpha_j < C$.

The above method for the nonlinearly separable case is further improved by introducing a kernel function $K: X \times X \to \mathbb{R}$ such that

$K(x, z) = \varphi(x)^T \varphi(z)$,   (14)

where $X$ is the input space and $\varphi: X \to \mathbb{R}^m$ ($m > n$) is a mapping such that the set $\{\varphi(x_i)\}_{i=1}^{N}$ is linearly separable in $\mathbb{R}^m$. The optimal value of $W$ then satisfies

$W = \sum_{i=1}^{N} \alpha_i d_i \varphi(x_i)$.   (15)
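As a small numerical illustration (not the authors' code), the following sketch evaluates an RBF kernel $K(x, z) = \exp(-\gamma \|x - z\|^2)$ in place of $\varphi(x)^T \varphi(z)$ and computes a decision value of the form derived in (17) below, for assumed dual variables, labels and bias:

```python
import numpy as np

def rbf_kernel(x, z, gamma=0.5):
    return np.exp(-gamma * np.sum((x - z) ** 2))

def decision_value(x, X_sv, d_sv, alpha_sv, b):
    """f(x) = sum_i alpha_i d_i K(X_i, x) + b over the support vectors."""
    return sum(a * d * rbf_kernel(xi, x)
               for a, d, xi in zip(alpha_sv, d_sv, X_sv)) + b

X_sv = np.array([[0.0, 0.0], [1.0, 1.0]])   # toy support vectors
print(decision_value(np.array([0.2, 0.1]), X_sv, d_sv=[-1, +1],
                     alpha_sv=[1.0, 1.0], b=0.0))
```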
The dual problem then reduces to:

Maximize $Q(\alpha) = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j d_i d_j K(X_i, X_j)$

subject to the constraints

(1) $\sum_{i=1}^{N} \alpha_i d_i = 0$

(2) $\alpha_i \ge 0$, $i = 1, \dots, N$,

and

$b = d_j - W^T \varphi(X_j)$.   (16)

Using the expression for $W$ (Eqn. 15), $b$ (Eqn. 16) and the definition of the kernel (Eqn. 14), the decision function $f(X)$ is computed as

$f(X) = W^T \varphi(X) + b = \sum_i \alpha_i d_i K(X_i, X) + d_j - \sum_i \alpha_i d_i K(X_i, X_j)$.   (17)

A new data point $x$ is then classified into class 1 ($d = +1$) if $f(x) \ge 0$ and into class 2 ($d = -1$) if $f(x) < 0$.

Appendix B. Confusion Matrices and Bar Graphs of the Models Implemented
Actual Class    Predicted: Enzyme    Predicted: Non-Enzyme
Enzyme                 31                      1
Non-Enzyme              0                     30

Table B.1: Confusion matrix for the binary classifier without feature selection
Actual Class    Predicted: Enzyme    Predicted: Non-Enzyme
Enzyme                 30                      0
Non-Enzyme              0                     32

Table B.2: Confusion matrix for the binary classifier using OFS
Actual \ Predicted    EC1   EC2   EC3   EC4   EC5   EC6
EC1                    46     0     1     1     0     2
EC2                     1    53     0     1     0     1
EC3                     0     0    44     1     0     5
EC4                     0     0     0    42     0     0
EC5                     4     0     0     0    46     0
EC6                     1     0     0     1     0    39

Table B.3: Confusion matrix of the multiclass classifier without feature selection
Actual \ Predicted    EC1   EC2   EC3   EC4   EC5   EC6
EC1                    47     0     1     2     0     0
EC2                     0    55     0     1     0     0
EC3                     0     0    44     0     0     6
EC4                     5     0     0    47     0     0
EC5                     2     0     0     0    48     0
EC6                     3     0     1     2     0    35

Table B.4: Confusion matrix for the multiclass classifier using SFFS
Actual \ Predicted    EC1   EC2   EC3   EC4   EC5   EC6
EC1                    50     0     0     0     0     0
EC2                     0    53     0     0     3     0
EC3                     0     0    47     1     0     2
EC4                     0     1     1    49     1     0
EC5                     1     0     0     2    47     0
EC6                     0     1     0     0     4    36

Table B.5: Confusion matrix for the multiclass classifier using OFS
Actual \ Predicted    EC1   EC2   EC3   EC4   EC5   EC6
EC1                    47     0     0     1     1     1
EC2                     3    51     0     2     0     0
EC3                     1     0    47     1     0     1
EC4                     3     1     0    46     0     2
EC5                     2     1     0     0    47     0
EC6                     2     1     1     2     0    45

Table B.6: Confusion matrix for the multiclass classifier using SVM-RFE
Actual \ Predicted    EC1   EC2   EC3   EC4   EC5   EC6
EC1                    41     2     1     2     3     1
EC2                     0    54     0     1     1     0
EC3                     0     1    45     1     0     3
EC4                     1     2     1    44     4     0
EC5                     3     1     0     0    46     0
EC6                     1     1     1     3     0    45

Table B.7: Confusion matrix for the Random Forest classifier
Figure B.1: Average accuracy of all the models.
Figure B.2: Comparison of Precision_μ of all the models.
Figure B.3: Comparison of Recall_μ of all the models.
References

1. Koonin, Eugene V., Roman L. Tatusov, and Michael Y. Galperin. "Beyond complete genomes: from sequence to structure and function." Current Opinion in Structural Biology, Vol. 8, no. 3 (1998): 355-363.
2. Fetrow, Jacquelyn S., and Jeffrey Skolnick. "Method for prediction of protein function from sequence using the sequence-to-structure-to-function paradigm with application to glutaredoxins/thioredoxins and T1 ribonucleases." Journal of Molecular Biology, Vol. 281, no. 5 (1998): 949-968.
3. Siomi, Haruhiko, and Gideon Dreyfuss. "RNA-binding proteins as regulators of gene expression." Current Opinion in Genetics & Development, Vol. 7, no. 3 (1997): 345-353.
4. Draper, David E. "Themes in RNA-protein recognition." Journal of Molecular Biology, Vol. 293, no. 2 (1999): 255-270.
5. Koonin, Eugene V., Roman L. Tatusov, and Michael Y. Galperin. "Beyond complete genomes: from sequence to structure and function." Current Opinion in Structural Biology, Vol. 8, no. 3 (1998): 355-363.
6. Pearson, William R., and David J. Lipman. "Improved tools for biological sequence comparison." Proceedings of the National Academy of Sciences, Vol. 85, no. 8 (1988): 2444-2448.
7. Altschul, Stephen F., Warren Gish, Webb Miller, Eugene W. Myers, and David J. Lipman. "Basic local alignment search tool." Journal of Molecular Biology, Vol. 215, no. 3 (1990): 403-410.
8. Pellegrini, Matteo, Edward M. Marcotte, Michael J. Thompson, David Eisenberg, and Todd O. Yeates. "Assigning protein functions by comparative genome analysis: protein phylogenetic profiles." Proceedings of the National Academy of Sciences, Vol. 96, no. 8 (1999): 4285-4288.
9. Marcotte, Edward M., Ioannis Xenarios, Alexander M. Van der Bliek, and David Eisenberg. "Localizing proteins in the cell from their phylogenetic profiles." Proceedings of the National Academy of Sciences, Vol. 97, no. 22 (2000): 12115-12120.
10. Zheng, Yu, Richard J. Roberts, and Simon Kasif. "Genomic functional annotation using co-evolution profiles of gene clusters." Genome Biology, Vol. 3, no. 11 (2002): research0060-1.
11. Lockhart, David J., Helin Dong, Michael C. Byrne, Maximillian T. Follettie, Michael V. Gallo, Mark S. Chee, Michael Mittmann et al. "Expression monitoring by hybridization to high-density oligonucleotide arrays." Nature Biotechnology, Vol. 14, no. 13 (1996): 1675-1680.
12. Schena, Mark, Dari Shalon, Ronald W. Davis, and Patrick O. Brown. "Quantitative monitoring of gene expression patterns with a complementary DNA microarray." Science, Vol. 270, no. 5235 (1995): 467.
13. Eisen, Michael B., Paul T. Spellman, Patrick O. Brown, and David Botstein. "Cluster analysis and display of genome-wide expression patterns." Proceedings of the National Academy of Sciences, Vol. 95, no. 25 (1998): 14863-14868.
14. Zhou, Xianghong, Ming-Chih J. Kao, and Wing Hung Wong. "Transitive functional annotation by shortest-path analysis of gene expression data." Proceedings of the National Academy of Sciences, Vol. 99, no. 20 (2002): 12783-12788.
15. Lin, Chuan, Daxin Jiang, and Aidong Zhang. "Prediction of protein function using common-neighbors in protein-protein interaction networks." In BioInformatics and BioEngineering, 2006 (BIBE 2006), Sixth IEEE Symposium on, pp. 251-260. IEEE, 2006.
16. Tatusov, Roman L., Darren A. Natale, Igor V. Garkavtsev, Tatiana A. Tatusova, Uma T. Shankavaram, Bachoti S. Rao, Boris Kiryutin, Michael Y. Galperin, Natalie D. Fedorova, and Eugene V. Koonin. "The COG database: new developments in phylogenetic classification of proteins from complete genomes." Nucleic Acids Research, Vol. 29, no. 1 (2001): 22-28.
17. Jones, Philip, David Binns, Hsin-Yu Chang, Matthew Fraser, Weizhong Li, Craig McAnulla, Hamish McWilliam et al. "InterProScan 5: genome-scale protein function classification." Bioinformatics, Vol. 30, no. 9 (2014): 1236-1240.
18. Vapnik, Vladimir. The Nature of Statistical Learning Theory. Springer Science & Business Media, 2013.
19. Ding, Chris H. Q., and Inna Dubchak. "Multi-class protein fold recognition using support vector machines and neural networks." Bioinformatics, Vol. 17, no. 4 (2001): 349-358.
20. Cai, Yu-Dong, Xiao-Jun Liu, Xue-biao Xu, and Kuo-Chen Chou. "Support vector machines for the classification and prediction of β-turn types." Journal of Peptide Science, Vol. 8, no. 7 (2002): 297-301.
21. Yuan, Zheng, Kevin Burrage, and John S. Mattick. "Prediction of protein solvent accessibility using support vector machines." Proteins: Structure, Function, and Bioinformatics, Vol. 48, no. 3 (2002): 566-570.
22. Hua, Sujun, and Zhirong Sun. "A novel method of protein secondary structure prediction with high segment overlap measure: support vector machine approach." Journal of Molecular Biology, Vol. 308, no. 2 (2001): 397-407.
23. Cai, Yu-Dong, Xiao-Jun Liu, Xue-biao Xu, and Kuo-Chen Chou. "Prediction of protein structural classes by support vector machines." Computers & Chemistry, Vol. 26, no. 3 (2002): 293-296.
24. Cai, C. Z., W. L. Wang, L. Z. Sun, and Y. Z. Chen. "Protein function classification via support vector machine approach." Mathematical Biosciences, Vol. 185, no. 2 (2003): 111-122.
25. Pudil, Pavel, Jana Novovičová, and Josef Kittler. "Floating search methods in feature selection." Pattern Recognition Letters, Vol. 15, no. 11 (1994): 1119-1125.
26. Mao, Kezhi Z. "Orthogonal forward selection and backward elimination algorithms for feature subset selection." IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), Vol. 34, no. 1 (2004): 629-634.
27. Guyon, Isabelle, Jason Weston, Stephen Barnhill, and Vladimir Vapnik. "Gene selection for cancer classification using support vector machines." Machine Learning, Vol. 46, no. 1 (2002): 389-422.
28. Rakotomamonjy, Alain. "Variable selection using SVM-based criteria." Journal of Machine Learning Research, Vol. 3 (Mar 2003): 1357-1370.
29. Liaw, Andy, and Matthew Wiener. "Classification and regression by randomForest." R News, Vol. 2, no. 3 (2002): 18-22.
30. Sokolova, Marina, and Guy Lapalme. "A systematic analysis of performance measures for classification tasks." Information Processing & Management, Vol. 45, no. 4 (2009): 427-437.
31. Liu, Bing. Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data. Springer Science & Business Media, 2007.
Figure 1: Flow chart of the proposed model. (Data collection; representation of each sequence as a vector of features (real numbers); selection of relevant features; training of the binary SVM $F_1$ for enzymes/non-enzymes and of the multiclass SVM $F_2$ for the six enzyme classes, each without/with feature selection. A new protein sequence is presented to $F_1$; if the output is not an enzyme, "non-enzyme" is returned and the procedure stops, otherwise the class label $j$ of the enzyme sequence is predicted using $F_2$ and returned.)
Figure 2: Multiclass ROC curve (true positive rate versus false positive rate) for the classifier with the OFS feature selection algorithm, showing curves for EC1-EC6 and their average.
Figure 3: Class-wise accuracy of all the models.
Table 1: Physico-chemical properties and their dimensions

Name                                     Dimension
Molecular weight                         1
Number of amino acids                    1
Theoretical pI                           1
Amino acid composition                   20
Number of positively charged residues    1
Number of negatively charged residues    1
Atomic composition                       5
Aliphatic index                          1
Hydrophobicity                           1
Table 2: Confusion matrix for a binary classifier

Data Class        Classified as Positive    Classified as Negative
Positive Class    True Positive (tp)        False Negative (fn)
Negative Class    False Positive (fp)       True Negative (tn)
Table 3: Performance measures for a binary classifier

Measure      Formula                            Evaluation Focus
Accuracy     (tp + tn) / (tp + fn + fp + tn)    Overall effectiveness of a classifier
Precision    tp / (tp + fp)                     Agreement of the data labels with the positive labels given by the classifier
Recall       tp / (tp + fn)                     Effectiveness of a classifier at identifying positive labels
Table 4: Performance measures for a multiclass classifier [30]

Measure            Formula                                                              Evaluation Focus
Average Accuracy   $\frac{1}{l}\sum_{i=1}^{l}\frac{tp_i+tn_i}{tp_i+fn_i+fp_i+tn_i}$     Average per-class effectiveness of a classifier
$Precision_\mu$    $\sum_{i=1}^{l}tp_i \,/\, \sum_{i=1}^{l}(tp_i+fp_i)$                  Agreement of the data class labels with those of the classifier, calculated from sums of per-text decisions
$Recall_\mu$       $\sum_{i=1}^{l}tp_i \,/\, \sum_{i=1}^{l}(tp_i+fn_i)$                  Effectiveness of a classifier at identifying class labels, calculated from sums of per-text decisions
$Precision_M$      $\frac{1}{l}\sum_{i=1}^{l}\frac{tp_i}{tp_i+fp_i}$                     Average per-class agreement of the data class labels with those of the classifier
$Recall_M$         $\frac{1}{l}\sum_{i=1}^{l}\frac{tp_i}{tp_i+fn_i}$                     Average per-class effectiveness of a classifier at identifying class labels
Table 5: Performance measures for the binary classifier with/without feature selection

Performance Measure    Without Feature Selection    With OFS Feature Selection
Accuracy (%)           98.3871                      100
Precision              0.9677                       1
Recall                 1                            1
Table 6: Performance measures for the multiclass classifiers with/without feature selection and Random Forest

Performance Measure     Without Feature Selection    SFFS       OFS        SVM-RFE    Random Forest
Average Accuracy (%)    93.6455                      92.3077    94.3144    93.5275    86.7314
Precision_μ             0.9365                       0.9231     0.9431     0.9353     0.8673
Recall_μ                0.9365                       0.9231     0.9431     0.9353     0.8673
Precision_M             0.1561                       0.1538     0.1572     0.1559     0.1446
Recall_M                0.1561                       0.1538     0.1572     0.1559     0.1446