Accepted Manuscript

Title: Enzyme Classification using Multiclass Support Vector Machine and Feature Subset Selection
Authors: Debasmita Pradhan, Sudarsan Padhy, Biswajit Sahoo
PII: S1476-9271(17)30329-8
DOI: http://dx.doi.org/10.1016/j.compbiolchem.2017.08.009
Reference: CBAC 6717
To appear in: Computational Biology and Chemistry
Received date: 25-5-2017
Revised date: 15-7-2017
Accepted date: 15-8-2017

Please cite this article as: Pradhan, Debasmita, Padhy, Sudarsan, Sahoo, Biswajit, Enzyme Classification using Multiclass Support Vector Machine and Feature Subset Selection. Computational Biology and Chemistry, http://dx.doi.org/10.1016/j.compbiolchem.2017.08.009

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

Enzyme Classification using Multiclass Support Vector Machine and Feature Subset Selection

Debasmita Pradhan(1), Sudarsan Padhy(1), Biswajit Sahoo(2)

(1) Department of Computer Science and Engineering, Silicon Institute of Technology, Silicon Hills, Patia, Bhubaneswar, 751024
(2) School of Computer Engineering, KIIT University, Bhubaneswar, 751024


Highlights

- Primary sequences of proteins are collected from the ExPASy portal and converted to vectors of physico-chemical properties, from which the properties that contribute significantly to protein function classification are identified using feature subset selection algorithms.
- Because wet-lab methods are expensive and time consuming, proteins are classified into enzymes and non-enzymes, and enzymes subsequently into their six functional classes, using a multiclass SVM.
- The proposed model generalizes the existing model to include multiclass classification and to identify the most significant features affecting protein function.
- The model shows good performance with respect to many performance measures.

Abstract: Proteins are the macromolecules responsible for almost all biological processes in a cell. With the availability of a large number of protein sequences from different sequencing projects, the challenge for scientists is to characterize their functions. As wet-lab methods are time consuming and expensive, many computational methods such as FASTA, PSI-BLAST, DNA microarray clustering, and nearest-neighborhood classification on protein-protein interaction networks have been proposed. The support vector machine is one such method, used successfully for problems such as protein fold recognition and protein structure prediction. Cai et al. (2003) used SVMs to classify proteins into different functional classes and to predict their functions, representing each protein sequence by its physico-chemical properties. In this paper, a model comprising feature subset selection followed by a multiclass support vector machine is proposed to determine the functional class of a newly generated protein sequence. To train and test the model, 32 physico-chemical properties of enzymes from 6 enzyme classes are considered. To determine the features that contribute significantly to functional classification, the Sequential Forward Floating Selection (SFFS), Orthogonal Forward Selection (OFS), and SVM Recursive Feature Elimination (SVM-RFE) algorithms are used, and it is observed that, of the 32 properties considered initially, only 20 features are sufficient to classify the proteins into their functional classes with an accuracy ranging from 91% to 94%. On comparison it is seen that OFS followed by SVM performs better than the other methods. Our model generalizes the existing model to include multiclass classification and to identify the most significant features affecting protein function.

Keywords: Enzyme Classification, Multiclass SVM, Sequential Forward Floating Selection (SFFS), Orthogonal Forward Selection (OFS), SVM Recursive Feature Elimination (SVM-RFE), Random Forest (RF)

1. Introduction

Proteins are important macromolecules responsible for almost all biological processes in a cell, such as growth, function, cell metabolism, and maintenance. With the availability of a large number of biological sequences obtained from different sequencing projects [1,2], the challenge for scientists is to determine the functions of newly generated protein sequences in order to understand the underlying biological processes [3-5]. Many methods are available for functional annotation of newly sequenced proteins. Wet-lab functional characterization of proteins is time consuming and expensive, whereas computational approaches are fast and cost effective.

The classical computational approaches to function prediction use programs such as FASTA [6] and PSI-BLAST [7], which are based on homology between annotated sequences and the unannotated (new) sequence. Methods of comparative genomics are also used for the prediction of protein function [8]; they consider proteins to be functionally linked if they have similar phylogenetic profiles [9,10]. Some authors, such as David J. Lockhart et al. and Mark Schena [11,12], designed clustering algorithms for DNA-microarray data that predict protein function on the assumption that genes with correlated expression profiles are functionally related [13,14]. Protein-protein interaction networks are also used for function prediction with a nearest-neighborhood approach [15], based on the fact that proteins may interact for a common purpose; however, as protein-protein interaction data are noisy, the prediction accuracy is low. Some methods [16,17] predict the function of a protein by classifying it into a specific functional class based on sequence similarity. These methods work well if the similarity between sequences is significant, but the prediction becomes random if the similarity between two sequences does not reach a threshold.

The support vector machine [18] has been used for protein fold recognition [19,20], protein structure prediction [21-23], protein-protein interaction prediction, and protein function classification [24]. In these problems, the physico-chemical properties of proteins, computed from their sequences, are used as input. Cai et al. [24] used a binary SVM classifier to predict the functional class of a protein. They considered functional classes such as RNA-binding proteins, protein homodimers, drug absorption proteins, drug delivery proteins, drug excretion proteins, and Class-I and Class-II drug metabolizing enzymes, used 1808 physico-chemical properties (hydrophobicity, polarity, polarizability, charge, surface tension, secondary structure, etc.) to represent each protein sequence, and obtained accuracies in the range of 88% to 99% for different classes. However, because the dimension of the feature vector is very high, the computation takes considerable time.

In our model, a binary classifier is first designed to classify a protein sequence as enzyme or non-enzyme. In the second step, a multiclass classifier is designed to predict the functional class of the protein among the six enzyme classes: oxidoreductases, transferases, hydrolases, lyases, isomerases, and ligases. To implement the model, 32 physico-chemical properties are initially considered: number of amino acids, theoretical pI, amino acid composition (20), number of negatively charged residues, number of positively charged residues, atomic composition (5), aliphatic index, and hydrophobicity. Since many of these features may carry redundant information, the Sequential Forward Floating Selection (SFFS) [25], Orthogonal Forward Selection (OFS) [26], and SVM Recursive Feature Elimination (SVM-RFE) [27,28] algorithms are applied to identify the most significant features for classifying the proteins. SFFS finds the amino acid compositions Ala(A), Asn(N), Cys(C), Gln(Q), Glu(E), Ile(I), Leu(L), Lys(K), Met(M), Phe(F), Pro(P), Ser(S), Thr(T), Trp(W), Tyr(Y), and Val(V) and the atomic compositions Hydrogen(H), Nitrogen(N), Oxygen(O), and Sulfur(S) to be the most significant features, whereas OFS finds the aliphatic index, the number of amino acids, the atomic compositions Carbon(C) and Oxygen(O), the amino acid compositions Cys(C), Asp(D), Arg(R), Phe(F), Gly(G), Pro(P), His(H), Ile(I), Thr(T), Trp(W), Leu(L), Gln(Q), Lys(K), and Tyr(Y), the number of positively charged residues, and the number of negatively charged residues to be the most significant. When SVM-RFE is applied, it drops seven features (number of amino acids, theoretical pI, Cys(C), Gly(G), Ile(I), Carbon(C), and Sulfur(S)) to yield 25 significant features, with which an accuracy range of 90.6149%-93.5275% is obtained. The results of these three algorithms show that Gln(Q), Leu(L), Lys(K), Phe(F), Pro(P), Thr(T), Trp(W), Tyr(Y), and Oxygen(O) play a major role in the functional classification of proteins.

Using all 32 features, i.e., without feature selection (WFS), an accuracy range from 90.9699% to 93.6455% is obtained, whereas the SFFS algorithm with 20 significant features gives an accuracy from 90.3010% to 92.3077% and the OFS algorithm with 20 significant features gives an accuracy from 89.6321% to 94.3144%. Our model thus finds that 20 of the 32 physico-chemical properties (atomic and amino acid compositions) are sufficient to predict the functional class of a protein with high accuracy. The performance of our model is also compared with the Random Forest classification algorithm [29], which obtains an average accuracy of 86.7314%; all three models discussed above have better average accuracy than the Random Forest model.

The rest of the paper is organized as follows. Section 2 presents the multiclass support vector machine, the sequential forward floating selection algorithm, and the orthogonal forward selection algorithm. Section 3 describes the proposed model, Section 4 discusses the results and performance of our model, and Section 5 concludes the work.

2. Preliminaries

2.1 Multiclass Support Vector Machine

The support vector machine described in Appendix A is a binary classifier, i.e., it classifies objects belonging to two distinct classes. Real-world problems, however, often require classifying objects into more than two classes. There are many approaches for using SVMs for multiclass classification; the following two are the most frequently used.

a. One-versus-the-rest classification

This approach constructs as many support vector machines as there are classes in the classification problem, i.e., given $M$ classes it constructs $M$ binary SVM classifiers $f_1, \ldots, f_M$. To construct $f_i$, the $i$-th classifier ($i = 1, \ldots, M$), an SVM is trained by taking the patterns of the $i$-th class as positive samples and the patterns of the remaining classes as negative samples. An unknown sample $X$ is classified by presenting it to each classifier and applying a voting technique: the class label with maximum frequency is assigned to $X$. One major limitation of this approach is that the training samples used to build each binary model are highly unbalanced.

b. Pairwise classification

The pairwise classification technique avoids the limitation of the above method by constructing a decision surface for each pair of classes. Given the training set $D = \{(x_i, y_i)\}$, $x_i \in \mathbb{R}^n$ and $y_i \in \{1, 2, \ldots, M\}$, this method generates $M(M-1)/2$ classifiers, one for each pair of classes. Let $f_{ij}$ be the classifier that separates the pair of classes $i$ and $j$, with $i < j$ and $i, j \in \{1, 2, \ldots, M\}$; $f_{ij}$ is trained taking $D_i$ as the positive class and $D_j$ as the negative class, where $D_i$ is the set of samples in $D$ with class label $i$. The output of the classifier $f_{ji}$ is $-f_{ij}$. Once the classifiers are trained, an unknown sample $X$ is classified by presenting it to each of the $M(M-1)/2$ classifiers; each classifier assigns a class label to the sample, and the label with the highest count is taken as the label of $X$.
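As an illustration of these two constructions, the sketch below uses scikit-learn's one-versus-rest and one-versus-one wrappers around a binary RBF-kernel SVM; the data are randomly generated stand-ins for the 32-dimensional protein feature vectors, not the actual enzyme data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

# Toy stand-in for six enzyme classes described by 32 features (M = 6).
X, y = make_classification(n_samples=600, n_features=32, n_informative=10,
                           n_classes=6, random_state=0)

# One versus the rest: M binary SVMs, class i against all remaining classes.
ovr = OneVsRestClassifier(SVC(kernel="rbf", gamma="scale")).fit(X, y)

# Pairwise (one versus one): M(M-1)/2 = 15 binary SVMs with majority voting.
ovo = OneVsOneClassifier(SVC(kernel="rbf", gamma="scale")).fit(X, y)

print(ovr.predict(X[:5]))
print(ovo.predict(X[:5]))
```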

2.2 Feature Subset Selection

Given a set of $n$ features, the goal of feature subset selection is to select a subset of $d$ features ($d < n$) without significantly degrading the performance of the recognition system [25].

2.2.1 Sequential Forward Floating Selection Method (SFFS) [25]

This method starts with an empty feature set. In successive steps, features are included or excluded depending on a class separability measure. We use the class separability measure $C$ of a set $S$ defined by

$$C(S) = \mathrm{trace}(S_w^{-1} S_b),$$

where $S_w$ is the within-class scatter matrix,

$$S_w = \sum_{i=1}^{c} p_i s_i,$$

$p_i$ is the probability of class $i$, $s_i$ is the covariance matrix of class $i$, and $c$ is the number of classes. $S_b$ is the between-class scatter matrix,

$$S_b = \sum_{i=1}^{c} p_i (m_i - m_0)(m_i - m_0)^T,$$

where $m_0 = \sum_{i=1}^{c} p_i m_i$ is the global mean vector and $m_i$ is the mean vector of class $i$. The following algorithm describes the SFFS technique.

Algorithm (SFFS):
1. Let $F_c$ be the set of selected features and $F_p$ a temporary set of selected features. Initially $F_p = \emptyset$.
2. Select the best feature (the one giving the highest class separability measure): $x^+ = \arg\max_x C(F_p \cup \{x\})$; set $F_c = F_p \cup \{x^+\}$.
3. Select the worst feature (the one contributing least to the class separability measure): $x^- = \arg\max_{x \in F_c} C(F_c \setminus \{x\})$.
4. If $C(F_c \setminus \{x^-\}) > C(F_c)$, then set $F_c = F_c \setminus \{x^-\}$ and go to step 3; otherwise set $F_p = F_c$ and go to step 2.
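A minimal Python sketch of SFFS under these definitions might look as follows; `X` (an N x n sample matrix) and `y` (a label vector) are hypothetical inputs, and the floating step is guarded so the feature just added is never immediately removed.

```python
import numpy as np

def separability(X, y, feats):
    """C(S) = trace(Sw^-1 Sb) computed on the feature subset `feats`."""
    Xs = X[:, feats]
    m0 = Xs.mean(axis=0)                      # global mean vector
    Sw = np.zeros((len(feats), len(feats)))
    Sb = np.zeros_like(Sw)
    for c in np.unique(y):
        Xc, pc = Xs[y == c], np.mean(y == c)  # class samples, class probability
        Sw += pc * np.atleast_2d(np.cov(Xc.T))
        diff = Xc.mean(axis=0) - m0
        Sb += pc * np.outer(diff, diff)
    return np.trace(np.linalg.pinv(Sw) @ Sb)

def sffs(X, y, d):
    """Sequential Forward Floating Selection (steps 1-4 above)."""
    Fc = []
    while len(Fc) < d:
        # Step 2: add the best remaining feature.
        rest = [i for i in range(X.shape[1]) if i not in Fc]
        best = max(rest, key=lambda i: separability(X, y, Fc + [i]))
        Fc.append(best)
        # Steps 3-4: float backwards while removing a feature improves C.
        while len(Fc) > 2:
            worst = max(Fc, key=lambda i: separability(X, y, [j for j in Fc if j != i]))
            if worst == best:                 # never drop the feature just added
                break
            reduced = [j for j in Fc if j != worst]
            if separability(X, y, reduced) > separability(X, y, Fc):
                Fc = reduced
            else:
                break
    return Fc
```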

2.2.2 Orthogonal Forward Selection Method (OFS) [26]

The OFS method uses a Gram-Schmidt orthogonal decomposition of the feature matrix. As the features are decorrelated in the orthogonal space, they can be evaluated and selected independently of each other. After a subset of features is selected in the orthogonal space, it is mapped back to the corresponding subset of features in the original feature space, which serves as the desired significant subset. The following algorithm describes the method [26].

Let there be $N$ samples $X_1, X_2, \ldots, X_N$ in $D$, each represented by an $n$-dimensional vector $X_j = (x_{1j}, x_{2j}, \ldots, x_{nj})^T$, $j = 1, 2, \ldots, N$. The feature vector $x_i$ is defined as $x_i = (x_{i1}, x_{i2}, \ldots, x_{iN})^T$, $i = 1, 2, \ldots, n$, and the feature matrix $X$ is

$$X = [x_1, x_2, \ldots, x_n] = \begin{pmatrix} x_{11} & x_{21} & \cdots & x_{n1} \\ x_{12} & x_{22} & \cdots & x_{n2} \\ \vdots & \vdots & & \vdots \\ x_{1N} & x_{2N} & \cdots & x_{nN} \end{pmatrix}.$$

Algorithm (OFS):
1. Let all features $x_i$ ($i = 1, 2, \ldots, n$) be candidates to enter the feature subset. Set $q_1^{(i)} = x_i$ and compute the Mahalanobis distance measure provided by each $q_1^{(i)}$, $i = 1, 2, \ldots, n$. Let $x_j$ yield the maximum class separability; set $q_1 = x_j$.
2. For all remaining $n - 1$ features, compute
$$q_2^{(i)} = x_i - \alpha_{12}^{(i)} q_1, \quad 1 \le i \le n,\; i \ne j, \qquad \text{where} \quad \alpha_{12}^{(i)} = \frac{q_1^T x_i}{q_1^T q_1},$$
and compute the corresponding Mahalanobis distance measures. The feature that provides the maximum class separation is identified and added to the feature subset.
3. The process continues until the class separability measure provided by the next best feature is less than a prespecified threshold.
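The sketch below illustrates the Gram-Schmidt mechanics of OFS for a two-class problem; for simplicity a Fisher-style score (squared class-mean difference over variance) is substituted for the Mahalanobis distance measure used in the paper.

```python
import numpy as np

def ofs(X, y, d, threshold=1e-4):
    """Orthogonal Forward Selection sketch for a two-class problem
    (y in {0, 1}). A Fisher-style score stands in for the Mahalanobis
    distance measure."""
    Q = X.astype(float).copy()   # working columns, orthogonalized as we go
    selected = []
    for _ in range(d):
        best, best_score = None, -np.inf
        for i in range(X.shape[1]):
            if i in selected:
                continue
            q = Q[:, i]
            score = (q[y == 1].mean() - q[y == 0].mean()) ** 2 / (q.var() + 1e-12)
            if score > best_score:
                best, best_score = i, score
        if best_score < threshold:            # stopping rule of step 3
            break
        selected.append(best)
        qb = Q[:, best].copy()
        # Gram-Schmidt step: decorrelate the remaining candidates from qb.
        for i in range(X.shape[1]):
            if i not in selected:
                Q[:, i] -= (qb @ Q[:, i]) / (qb @ qb + 1e-12) * qb
    return selected
```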

3. Proposed Method

We are given a data set $D$ of protein sequences and their class labels. Each sequence is represented by 32 physico-chemical properties, so the data set contains feature vectors together with class labels, i.e., $D = \{(X_i, y_i)\}$, $i \in \{1, 2, \ldots, N\}$, where $X_i \in \mathbb{R}^{32}$ is the feature vector of the $i$-th sequence and $y_i \in \{0, 1\}$ for classifying a protein as enzyme or non-enzyme, or $y_i \in \{1, 2, \ldots, M\}$ for classifying the enzyme class. The objective of our work is to develop a model that predicts the class label of a new protein in two steps. In the first step a binary SVM classifier $F_1$ is developed, which classifies a new protein as enzyme or non-enzyme. In the second step a multiclass SVM classifier $F_2$ is developed to predict the functional class label of the new protein. Given a new sequence, our model first determines whether it is an enzyme or a non-enzyme sequence; if it is an enzyme, the model predicts its class label among the six available enzyme classes. To train the model, we initially considered 32 physico-chemical properties to represent a protein sequence. Since many of these properties are redundant, we applied the sequential forward floating selection and orthogonal forward selection algorithms to identify important, non-redundant features that contribute significantly to predicting the functional class of enzymes. To handle non-linearity in the data, the radial basis function (RBF) kernel is used throughout the implementation of the SVM. Our proposed work is described in the flowchart given in Figure 1 and illustrated by the sketch below.
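A minimal sketch of the two-step pipeline, chaining F1 and F2; the arrays are random placeholders standing in for the 32-feature vectors (y1 marks enzyme/non-enzyme, y2 the six EC classes).

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X1, y1 = rng.normal(size=(400, 32)), rng.integers(0, 2, 400)   # enzyme vs non-enzyme
X2, y2 = rng.normal(size=(600, 32)), rng.integers(1, 7, 600)   # six EC classes

F1 = SVC(kernel="rbf", gamma="scale").fit(X1, y1)              # binary classifier
F2 = SVC(kernel="rbf", gamma="scale").fit(X2, y2)              # SVC is pairwise (one-vs-one) by construction

def predict(x):
    """Two-step prediction: F1 screens for enzymes, F2 assigns the EC class."""
    if F1.predict(x.reshape(1, -1))[0] == 0:
        return "non-enzyme"
    return int(F2.predict(x.reshape(1, -1))[0])

print(predict(rng.normal(size=32)))
```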

3.1 Data Description and Preprocessing

The protein sequences are collected from the SwissProt database. The ProtParam tool of the ExPASy portal is used to compute the physico-chemical properties of the proteins; the motivation for using these features comes from the work of Cai et al. [24]. The properties include: number of amino acids, molecular weight, theoretical pI, amino acid composition, number of negatively charged residues, number of positively charged residues, atomic composition, aliphatic index, and hydrophobicity (a sketch of their computation is given below). Table 1 lists these properties and their dimensions; in total, 32 features represent each protein. To implement the model, enzyme proteins are first taken as positive samples and hemoglobin proteins as negative samples to train the binary SVM classifier F1. Proteins from the 6 enzyme classes (oxidoreductases, transferases, hydrolases, lyases, isomerases, and ligases) are then used to train the multiclass SVM classifier F2.
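A sketch of this feature computation, assuming Biopython is installed: its ProtParam module exposes most, though not all, of these properties (atomic composition is omitted here, and the GRAVY score is used as the hydrophobicity value); the aliphatic index is computed from the amino-acid fractions using the standard ExPASy formula.

```python
from Bio.SeqUtils.ProtParam import ProteinAnalysis

def protparam_features(seq):
    """Subset of the 32 ProtParam-style properties for one sequence."""
    pa = ProteinAnalysis(seq)
    pct = pa.get_amino_acids_percent()             # 20 amino-acid fractions
    counts = pa.count_amino_acids()
    # Aliphatic index: X_Ala + 2.9 X_Val + 3.9 (X_Ile + X_Leu), in mole percent.
    aliphatic = 100 * (pct["A"] + 2.9 * pct["V"] + 3.9 * (pct["I"] + pct["L"]))
    return [
        len(seq),                                  # number of amino acids
        pa.molecular_weight(),
        pa.isoelectric_point(),                    # theoretical pI
        *pct.values(),                             # amino acid composition (20)
        counts["D"] + counts["E"],                 # negatively charged residues
        counts["R"] + counts["K"],                 # positively charged residues
        aliphatic,
        pa.gravy(),                                # hydrophobicity (GRAVY)
    ]

print(protparam_features("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"))
```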

3.2 Performance Measure

To measure the performance of the binary classifier F1, the quantities true positives (tp), true negatives (tn), false positives (fp), and false negatives (fn) and the confusion matrix [Table 2] are considered. These quantities are used to compute performance measures such as accuracy, precision, and recall [Table 3] for F1. For the multiclass classifier F2, the confusion matrix and the per-class quantities true positives (tp_i), true negatives (tn_i), false positives (fp_i), and false negatives (fn_i) for the i-th class are computed. Using these, the average accuracy, Precision_μ, Recall_μ, Precision_M, and Recall_M are computed [Table 4][30]. Further, the ROC curve for the multiclass classifier [31] is plotted and shown in Figure 2.

4. Results and Discussion

To train the binary support vector machine classifier F1, 200 distinct enzyme proteins are taken as positive samples and 200 hemoglobin proteins as negative samples. The classifier is then tested on 62 proteins comprising a mixture of enzyme and hemoglobin proteins; its accuracy on this test set is 98.3871%. Table 5 summarizes the performance measures of the classifier F1 with and without feature selection.

After training and testing the binary classifier, the six enzyme classes oxidoreductases, transferases, hydrolases, lyases, isomerases, and ligases are considered to train the multiclass support vector machine classifier F2: 219 samples are taken from oxidoreductases, 225 from transferases, 200 from hydrolases, 207 from lyases, 198 from isomerases, and 163 from ligases. The model is trained using the pairwise multiclass classification algorithm, both without and with feature selection. First the model is trained using all 32 physico-chemical properties; its average accuracy with all features is 93.6455%. Then the Sequential Forward Floating Selection (SFFS) and Orthogonal Forward Selection (OFS) algorithms are used to find the most significant features. The accuracy of the model with SFFS ranges from 90.3010% to 92.3077%, whereas with OFS it ranges from 89.6321% to 94.3144%.

As the support vector machine itself can be used for feature selection, we also implemented SVM-RFE [27,28] for selecting relevant features. For a two-class classification problem, at every iteration SVM-RFE ranks and removes the feature $i$ whose removal minimizes the ranking function

$$R_c(i) = \|w\|^2 - \|w^{(i)}\|^2 = \|w\|^2 - \sum_{k,j} \alpha_k^{*(i)} \alpha_j^{*(i)} y_k y_j K^{(i)}(x_k, x_j),$$

where the superscript $(i)$ denotes quantities computed with feature $i$ removed and $K^{(i)}$ is the corresponding kernel function. A Gaussian kernel is used in our implementation of SVM-RFE. As our model is a six-class classifier built from 15 pairwise classifiers, each iteration nominates up to 15 features to be dropped, so we apply majority voting to select the single feature actually dropped at each iteration. In this way 7 features are dropped from the original set, and the remaining 25 features are used to implement the model, giving an accuracy range of 90.6149%-93.5275%. For comparison, a Random Forest classifier is also implemented, obtaining an average accuracy of 86.7314%. Appendix B contains the confusion matrices, and Table 6 summarizes the performance measures of all the models implemented. To visualize the performance of the implemented models, bar graphs of average accuracy, Precision_μ, and Recall_μ are given as Figures B.1, B.2, and B.3 in Appendix B. A sketch of the binary kernel SVM-RFE ranking step is given below.
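The sketch implements one elimination pass for the binary Gaussian-kernel case (the six-class majority-voting wrapper is omitted); it relies on scikit-learn's SVC, whose dual_coef_ attribute stores the products α_i y_i for the support vectors.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

def svm_rfe_rbf(X, y, n_keep, gamma=0.1, C=1.0):
    """Binary kernel SVM-RFE: repeatedly drop the feature whose removal
    changes the margin term ||w||^2 = sum_kj a_k a_j y_k y_j K(x_k, x_j)
    the least (the ranking function Rc(i) above)."""
    active = list(range(X.shape[1]))
    while len(active) > n_keep:
        Xa = X[:, active]
        svc = SVC(kernel="rbf", gamma=gamma, C=C).fit(Xa, y)
        ay = svc.dual_coef_.ravel()                # alpha_i * y_i at the optimum
        S = Xa[svc.support_]                       # support vectors
        w2 = (np.outer(ay, ay) * rbf_kernel(S, S, gamma=gamma)).sum()
        scores = []
        for j in range(len(active)):               # ||w^(j)||^2 with feature j removed
            Sj = np.delete(S, j, axis=1)
            w2_j = (np.outer(ay, ay) * rbf_kernel(Sj, Sj, gamma=gamma)).sum()
            scores.append(abs(w2 - w2_j))
        active.pop(int(np.argmin(scores)))         # least informative feature
    return active
```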

The results show that our model performs best when features are selected by OFS. The multiclass receiver operating characteristic (ROC) curve [31] is computed for the model with each feature selection algorithm; to save space, only the multiclass ROC curve of the model with OFS feature selection is given in Figure 2, as it performs better than the others. The curve gives a high average area under the curve (AUC). Figure 3 gives the class-wise accuracy of the implemented models.

It is observed that the accuracy of SVM with OFS is maximum for all classes except classes 2 and 5 [Figure 3]. With respect to all other performance measures (average accuracy, Precision_μ, Recall_μ, Precision_M, and Recall_M), SVM with OFS outperforms the other algorithms. For instance, the average accuracy of the model is 93.6455% without feature selection, 92.3077% with SFFS, 94.3144% with OFS, 93.5275% with SVM-RFE, and 86.7314% with Random Forest. So we may conclude that the overall performance of SVM-OFS is better than that of the other algorithms considered here. SFFS, OFS, and SVM-RFE each have their own set of selected features; although they select different features, the features selected by all three algorithms are Gln(Q), Ile(I), Leu(L), Lys(K), Phe(F), Pro(P), Thr(T), Trp(W), Tyr(Y), and Oxygen(O). We may therefore conclude that these features play a significant role in the functional classification of proteins.

5. Conclusion and Future Work

In this paper a model comprising feature subset selection followed by a multiclass support vector machine is proposed to determine the functional class of a newly generated protein sequence. To train and test the model, 32 physico-chemical properties of enzymes from 6 classes are considered. To identify the features that contribute significantly to functional classification, the SFFS, OFS, and SVM-RFE algorithms are applied. The results of all the algorithms show that Gln(Q), Leu(L), Lys(K), Phe(F), Pro(P), Thr(T), Trp(W), Tyr(Y), and Oxygen(O) play a major role in protein function prediction. It is observed that the features selected by OFS yield the best classification performance with respect to average accuracy, Precision_μ, Recall_μ, Precision_M, and Recall_M. For instance, with all 32 features an accuracy range from 90.9699% to 93.6455% is obtained, with the 20 features selected by OFS an accuracy range from 89.6321% to 94.3144% is obtained, with the 25 features selected by SVM-RFE an accuracy range from 90.6149% to 93.5275% is obtained, and with Random Forest an average accuracy of 86.7314% is obtained. Our model generalizes the existing model [24] to include multiclass classification and to identify the most significant features affecting protein function, in order to reduce computation time. To handle non-linearity in the data, the radial basis function kernel is used with the multiclass support vector machine. In future work we plan to extend the classification to the subclass level and to develop a kernel so that, along with the physico-chemical properties, the protein primary sequences can be given directly as input.

Acknowledgement

The authors thank the reviewers for their valuable comments.

Appendix A: Support Vector Machine for Binary Classification

The support vector machine (SVM) was pioneered by Vapnik in 1995. It is a linear classifier based on the idea of constructing a hyperplane as the decision surface in such a way that the margin of separation between the hyperplane and the closest positive and negative samples is maximized. Consider the training patterns $\{(X_i, d_i)\}_{i=1}^{N}$, where $X_i \in \mathbb{R}^n$ is the $i$-th input pattern and $d_i \in \{-1, +1\}$ is the corresponding desired class label. It is assumed that the subset of patterns with $d_i = +1$ and the subset with $d_i = -1$ are linearly separable. Let the equation of the decision hyperplane be

$$W^T X + b = 0, \qquad (1)$$

where $W \in \mathbb{R}^n$ and $b \in \mathbb{R}$. The objective of the SVM is to compute $W$ and $b$ in such a way that the margin of separation of the hyperplane from the closest data points is maximized (Figure A.1). Such a hyperplane is called the optimal hyperplane.

Figure A.1: Linearly separable data and optimal hyperplane

The problem of computing the optimal $W$ and $b$ can be formulated as solving the following optimization problem $(P)$:

$$\text{Minimize } \frac{1}{2} W^T W \quad \text{subject to the constraints } d_i (W^T X_i + b) \ge 1 \text{ for } i = 1, \ldots, N. \qquad (2)$$

This problem can be solved using the primal-dual method, where the primal problem is to minimize the Lagrangian function

$$J(W, b, \alpha) = \frac{1}{2} W^T W - \sum_{i=1}^{N} \alpha_i \left[ d_i (W^T X_i + b) - 1 \right] \qquad (3)$$

over all $W$ and $b$, where the $\alpha_i$, $i = 1, 2, \ldots, N$, are Lagrange multipliers. The solution of this problem $(P)$ yields

$$W = \sum_{i=1}^{N} \alpha_i d_i X_i, \qquad (4)$$

where the $\alpha_i$ are Lagrange multipliers to be determined by solving the dual problem, which can be stated as:

Maximize
$$Q(\alpha) = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j d_i d_j X_i^T X_j \qquad (5)$$

subject to the constraints

$$\sum_{i=1}^{N} \alpha_i d_i = 0 \qquad (6)$$

$$\alpha_i \ge 0 \quad \text{for } i = 1, \ldots, N. \qquad (7)$$

Let $\alpha_i^*$ be the optimal solution of equations (5)-(7). From the Karush-Kuhn-Tucker conditions it follows that when $\alpha_j^* > 0$,

$$d_j (W^T X_j + b) = 1, \qquad (8)$$

which yields

$$b = d_j - W^T X_j \quad \text{for some index } j. \qquad (9)$$

Then from equations (1), (4), and (9), the equation of the optimal hyperplane is

$$\sum_{i=1}^{N} \alpha_i^* d_i X_i^T X + d_j - \sum_{i=1}^{N} \alpha_i^* d_i X_i^T X_j = 0. \qquad (10)$$

When the given data set is not linearly separable, the problem $(P)$ for determining the optimal $W$ and $b$ is modified as follows ($P_1$):

Minimize
$$\frac{1}{2} W^T W + C \sum_{i=1}^{N} \xi_i \qquad (11)$$

subject to
$$d_i (W^T X_i + b) \ge 1 - \xi_i \qquad (12)$$
$$\xi_i \ge 0 \quad \text{for } i = 1, \ldots, N, \qquad (13)$$

where the $\xi_i$ are slack variables introduced to appropriately handle data points such as $X_i$ and $X_j$ shown in Figure A.2.

Figure A.2: Soft margin / non-linear classifier

It is observed that if $0 < \xi_i \le 1$, constraints (12) and (13) are satisfied and $X_i$ is correctly classified; however, if $\xi_i > 1$ the point is not correctly classified. So, to minimize the number of misclassifications, the term $C \sum_i \xi_i$ is included in the objective function, where $\sum_i \xi_i$ represents an upper bound on the number of misclassifications and $C > 0$ is a user-specified constant (known as the regularization parameter) used to modulate the tradeoff between maximizing the margin and minimizing the misclassifications. When the problem ($P_1$) is solved using the primal-dual method, the corresponding dual problem is:

Maximize
$$Q(\alpha) = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j d_i d_j X_i^T X_j$$

subject to

$$\sum_{i=1}^{N} \alpha_i d_i = 0, \qquad 0 \le \alpha_i \le C, \quad i = 1, \ldots, N,$$

with $b = d_j - W^T X_j$ for some index $j$ with $0 < \alpha_j < C$.

The above method for the non-linearly separable case is further improved by introducing a kernel function $K: X \times X \to \mathbb{R}$ defined as

$$K(x, z) = \varphi(x)^T \varphi(z), \qquad (14)$$

where $X$ is the input space and $\varphi: X \to \mathbb{R}^m$ ($m > n$) is a mapping such that the set $\{\varphi(x_i)\}_{i=1}^{N}$ is linearly separable in $\mathbb{R}^m$. The optimal value of $W$ then satisfies

$$W = \sum_{i=1}^{N} \alpha_i d_i \varphi(x_i), \qquad (15)$$

and the dual problem reduces to:

Maximize
$$Q(\alpha) = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j d_i d_j K(X_i, X_j)$$

subject to the constraints

$$\sum_{i=1}^{N} \alpha_i d_i = 0, \qquad \alpha_i \ge 0, \quad i = 1, \ldots, N,$$

and
$$b = d_j - W^T \varphi(X_j). \qquad (16)$$

Using the expressions for $W$ (Eq. 15) and $b$ (Eq. 16) and the definition of the kernel (Eq. 14), the decision function $f(X)$ is computed as

$$f(X) = W^T \varphi(X) + b = \sum_i \alpha_i d_i K(X_i, X) + d_j - \sum_i \alpha_i d_i K(X_i, X_j). \qquad (17)$$

A new data point $x$ is then classified into class 1 ($d = +1$) if $f(x) > 0$ and into class 2 ($d = -1$) if $f(x) < 0$.
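As a check on Eq. (17), the sketch below fits an RBF-kernel SVM with scikit-learn and reconstructs its decision values directly from the support vectors, the stored α_i d_i products, and the bias term; the toy data are placeholders for the real feature vectors.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

# Toy data standing in for the enzyme / non-enzyme feature vectors.
X, y = make_classification(n_samples=200, n_features=32, random_state=0)
d = 2 * y - 1                                   # labels in {-1, +1} as in Appendix A

clf = SVC(kernel="rbf", gamma=0.05, C=1.0).fit(X, d)

# Eq. (17): f(x) = sum_i alpha_i d_i K(X_i, x) + b, summed over support vectors.
K = rbf_kernel(clf.support_vectors_, X, gamma=0.05)
f = clf.dual_coef_ @ K + clf.intercept_         # dual_coef_ holds alpha_i * d_i

assert np.allclose(f.ravel(), clf.decision_function(X))
```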

Appendix B: Confusion Matrices and Bar Graphs of the Models Implemented

Table B.1: Confusion matrix for the binary classifier without feature selection

Actual \ Predicted    Enzyme   Non-Enzyme
Enzyme                  31         1
Non-Enzyme               0        30

Table B.2: Confusion matrix for the binary classifier using OFS

Actual \ Predicted    Enzyme   Non-Enzyme
Enzyme                  30         0
Non-Enzyme               0        32

Table B.3: Confusion matrix of the multiclass classifier without feature selection

Actual \ Predicted   EC1   EC2   EC3   EC4   EC5   EC6
EC1                   46     0     1     1     0     2
EC2                    1    53     0     1     0     1
EC3                    0     0    44     1     0     5
EC4                    0     0     0    42     0     0
EC5                    4     0     0     0    46     0
EC6                    1     0     0     1     0    39

Table B.4: Confusion matrix for the multiclass classifier using SFFS

Actual \ Predicted   EC1   EC2   EC3   EC4   EC5   EC6
EC1                   47     0     1     2     0     0
EC2                    0    55     0     1     0     0
EC3                    0     0    44     0     0     6
EC4                    5     0     0    47     0     0
EC5                    2     0     0     0    48     0
EC6                    3     0     1     2     0    35

Table B.5: Confusion matrix for the multiclass classifier using OFS

Actual \ Predicted   EC1   EC2   EC3   EC4   EC5   EC6
EC1                   50     0     0     0     0     0
EC2                    0    53     0     0     3     0
EC3                    0     0    47     1     0     2
EC4                    0     1     1    49     1     0
EC5                    1     0     0     2    47     0
EC6                    0     1     0     0     4    36

Table B.6: Confusion matrix for the multiclass classifier using SVM-RFE

Actual \ Predicted   EC1   EC2   EC3   EC4   EC5   EC6
EC1                   47     0     0     1     1     1
EC2                    3    51     0     2     0     0
EC3                    1     0    47     1     0     1
EC4                    3     1     0    46     0     2
EC5                    2     1     0     0    47     0
EC6                    2     1     1     2     0    45

Table B.7: Confusion matrix for the Random Forest classifier

Actual \ Predicted   EC1   EC2   EC3   EC4   EC5   EC6
EC1                   41     2     1     2     3     1
EC2                    0    54     0     1     1     0
EC3                    0     1    45     1     0     3
EC4                    1     2     1    44     4     0
EC5                    3     1     0     0    46     0
EC6                    1     1     1     3     0    45

Figure B.1: Average accuracy of all the models.

Figure B.2: Comparison of Precision_μ of all the models.

Figure B.3: Comparison of Recall_μ of all the models.

References

1. Koonin, Eugene V., Roman L. Tatusov, and Michael Y. Galperin. "Beyond complete genomes: from sequence to structure and function." Current Opinion in Structural Biology, Vol. 8, no. 3 (1998): 355-363.
2. Fetrow, Jacquelyn S., and Jeffrey Skolnick. "Method for prediction of protein function from sequence using the sequence-to-structure-to-function paradigm with application to glutaredoxins/thioredoxins and T1 ribonucleases." Journal of Molecular Biology, Vol. 281, no. 5 (1998): 949-968.
3. Siomi, Haruhiko, and Gideon Dreyfuss. "RNA-binding proteins as regulators of gene expression." Current Opinion in Genetics & Development, Vol. 7, no. 3 (1997): 345-353.
4. Draper, David E. "Themes in RNA-protein recognition." Journal of Molecular Biology, Vol. 293, no. 2 (1999): 255-270.
5. Koonin, Eugene V., Roman L. Tatusov, and Michael Y. Galperin. "Beyond complete genomes: from sequence to structure and function." Current Opinion in Structural Biology, Vol. 8, no. 3 (1998): 355-363.
6. Pearson, William R., and David J. Lipman. "Improved tools for biological sequence comparison." Proceedings of the National Academy of Sciences, Vol. 85, no. 8 (1988): 2444-2448.
7. Altschul, Stephen F., Warren Gish, Webb Miller, Eugene W. Myers, and David J. Lipman. "Basic local alignment search tool." Journal of Molecular Biology, Vol. 215, no. 3 (1990): 403-410.
8. Pellegrini, Matteo, Edward M. Marcotte, Michael J. Thompson, David Eisenberg, and Todd O. Yeates. "Assigning protein functions by comparative genome analysis: protein phylogenetic profiles." Proceedings of the National Academy of Sciences, Vol. 96, no. 8 (1999): 4285-4288.
9. Marcotte, Edward M., Ioannis Xenarios, Alexander M. Van der Bliek, and David Eisenberg. "Localizing proteins in the cell from their phylogenetic profiles." Proceedings of the National Academy of Sciences, Vol. 97, no. 22 (2000): 12115-12120.
10. Zheng, Yu, Richard J. Roberts, and Simon Kasif. "Genomic functional annotation using co-evolution profiles of gene clusters." Genome Biology, Vol. 3, no. 11 (2002): research0060.1.
11. Lockhart, David J., Helin Dong, Michael C. Byrne, Maximillian T. Follettie, Michael V. Gallo, Mark S. Chee, Michael Mittmann et al. "Expression monitoring by hybridization to high-density oligonucleotide arrays." Nature Biotechnology, Vol. 14, no. 13 (1996): 1675-1680.
12. Schena, Mark, Dari Shalon, Ronald W. Davis, and Patrick O. Brown. "Quantitative monitoring of gene expression patterns with a complementary DNA microarray." Science, Vol. 270, no. 5235 (1995): 467.
13. Eisen, Michael B., Paul T. Spellman, Patrick O. Brown, and David Botstein. "Cluster analysis and display of genome-wide expression patterns." Proceedings of the National Academy of Sciences, Vol. 95, no. 25 (1998): 14863-14868.
14. Zhou, Xianghong, Ming-Chih J. Kao, and Wing Hung Wong. "Transitive functional annotation by shortest-path analysis of gene expression data." Proceedings of the National Academy of Sciences, Vol. 99, no. 20 (2002): 12783-12788.
15. Lin, Chuan, Daxin Jiang, and Aidong Zhang. "Prediction of protein function using common-neighbors in protein-protein interaction networks." In Sixth IEEE Symposium on BioInformatics and BioEngineering (BIBE 2006), pp. 251-260. IEEE, 2006.
16. Tatusov, Roman L., Darren A. Natale, Igor V. Garkavtsev, Tatiana A. Tatusova, Uma T. Shankavaram, Bachoti S. Rao, Boris Kiryutin, Michael Y. Galperin, Natalie D. Fedorova, and Eugene V. Koonin. "The COG database: new developments in phylogenetic classification of proteins from complete genomes." Nucleic Acids Research, Vol. 29, no. 1 (2001): 22-28.
17. Jones, Philip, David Binns, Hsin-Yu Chang, Matthew Fraser, Weizhong Li, Craig McAnulla, Hamish McWilliam et al. "InterProScan 5: genome-scale protein function classification." Bioinformatics, Vol. 30, no. 9 (2014): 1236-1240.
18. Vapnik, Vladimir. The Nature of Statistical Learning Theory. Springer Science & Business Media, 2013.
19. Ding, Chris H.Q., and Inna Dubchak. "Multi-class protein fold recognition using support vector machines and neural networks." Bioinformatics, Vol. 17, no. 4 (2001): 349-358.
20. Cai, Yu-Dong, Xiao-Jun Liu, Xue-biao Xu, and Kuo-Chen Chou. "Support vector machines for the classification and prediction of β-turn types." Journal of Peptide Science, Vol. 8, no. 7 (2002): 297-301.
21. Yuan, Zheng, Kevin Burrage, and John S. Mattick. "Prediction of protein solvent accessibility using support vector machines." Proteins: Structure, Function, and Bioinformatics, Vol. 48, no. 3 (2002): 566-570.
22. Hua, Sujun, and Zhirong Sun. "A novel method of protein secondary structure prediction with high segment overlap measure: support vector machine approach." Journal of Molecular Biology, Vol. 308, no. 2 (2001): 397-407.
23. Cai, Yu-Dong, Xiao-Jun Liu, Xue-biao Xu, and Kuo-Chen Chou. "Prediction of protein structural classes by support vector machines." Computers & Chemistry, Vol. 26, no. 3 (2002): 293-296.
24. Cai, C. Z., W. L. Wang, L. Z. Sun, and Y. Z. Chen. "Protein function classification via support vector machine approach." Mathematical Biosciences, Vol. 185, no. 2 (2003): 111-122.
25. Pudil, Pavel, Jana Novovičová, and Josef Kittler. "Floating search methods in feature selection." Pattern Recognition Letters, Vol. 15, no. 11 (1994): 1119-1125.
26. Mao, Kezhi Z. "Orthogonal forward selection and backward elimination algorithms for feature subset selection." IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), Vol. 34, no. 1 (2004): 629-634.
27. Guyon, Isabelle, Jason Weston, Stephen Barnhill, and Vladimir Vapnik. "Gene selection for cancer classification using support vector machines." Machine Learning, Vol. 46, no. 1 (2002): 389-422.
28. Rakotomamonjy, Alain. "Variable selection using SVM-based criteria." Journal of Machine Learning Research, Vol. 3 (2003): 1357-1370.
29. Liaw, Andy, and Matthew Wiener. "Classification and regression by randomForest." R News, Vol. 2, no. 3 (2002): 18-22.
30. Sokolova, Marina, and Guy Lapalme. "A systematic analysis of performance measures for classification tasks." Information Processing & Management, Vol. 45, no. 4 (2009): 427-437.
31. Liu, Bing. Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data. Springer Science & Business Media, 2007.

Figure 1: Flow chart of the proposed model. Sequences are collected and represented as vectors of features (real numbers); relevant features are selected; the binary SVM F1 (enzyme vs. non-enzyme) and the multiclass SVM F2 (six enzyme classes) are trained without/with feature selection. For a new protein sequence, F1 is applied first; if the output is not an enzyme, the model outputs "non-enzyme" and stops; otherwise F2 predicts and outputs the class label j of the enzyme sequence.

Figure 2: Multiclass ROC curve (true positive rate vs. false positive rate) for the classifier with the OFS feature selection algorithm, with one curve per class (EC1-EC6) and their average.

Figure 3: Class-wise accuracy of all the models

Table 1: Physico-chemical properties and their dimensions

Name                                     Dimension
Molecular weight                             1
Number of amino acids                        1
Theoretical pI                               1
Amino acid composition                      20
Number of positively charged residues        1
Number of negatively charged residues        1
Atomic composition                           5
Aliphatic index                              1
Hydrophobicity                               1

Table 2: Confusion matrix for a binary classifier

Data Class       Classified as Positive    Classified as Negative
Positive Class   True Positive (tp)        False Negative (fn)
Negative Class   False Positive (fp)       True Negative (tn)

Table 3: Performance measures for a binary classifier

Measure     Formula                            Evaluation Focus
Accuracy    (tp + tn) / (tp + fn + fp + tn)    Overall effectiveness of a classifier
Precision   tp / (tp + fp)                     Agreement of the data labels with the positive labels given by the classifier
Recall      tp / (tp + fn)                     Effectiveness of a classifier at identifying positive labels

Table 4: Performance measures for a multiclass classifier with l classes [30] (all sums run over i = 1, ..., l)

Measure            Formula                                            Evaluation Focus
Average Accuracy   (1/l) Σᵢ (tpᵢ + tnᵢ) / (tpᵢ + tnᵢ + fpᵢ + fnᵢ)     Average per-class effectiveness of a classifier
Precision_μ        Σᵢ tpᵢ / Σᵢ (tpᵢ + fpᵢ)                            Agreement of the data class labels with those of the classifier, calculated from sums of per-text decisions
Recall_μ           Σᵢ tpᵢ / Σᵢ (tpᵢ + fnᵢ)                            Effectiveness of a classifier at identifying class labels, calculated from sums of per-text decisions
Precision_M        (1/l) Σᵢ tpᵢ / (tpᵢ + fpᵢ)                         Average per-class agreement of the data class labels with those of the classifier
Recall_M           (1/l) Σᵢ tpᵢ / (tpᵢ + fnᵢ)                         Average per-class effectiveness of a classifier at identifying class labels
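These measures can be computed directly from an l x l confusion matrix. The sketch below implements the Table 4 formulas in their standard micro/macro forms; fed the OFS confusion matrix of Table B.5, it reproduces the micro-averaged precision and recall of 0.9431 reported in Table 6.

```python
import numpy as np

def multiclass_measures(cm):
    """Table 4 measures from a confusion matrix (rows = actual, cols = predicted)."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    fp = cm.sum(axis=0) - tp
    fn = cm.sum(axis=1) - tp
    tn = cm.sum() - tp - fp - fn
    return {
        "average_accuracy": np.mean((tp + tn) / (tp + tn + fp + fn)),
        "precision_micro": tp.sum() / (tp + fp).sum(),
        "recall_micro": tp.sum() / (tp + fn).sum(),
        "precision_macro": np.mean(tp / (tp + fp)),
        "recall_macro": np.mean(tp / (tp + fn)),
    }

# Confusion matrix of the OFS model (Table B.5).
cm_ofs = [[50, 0, 0, 0, 0, 0],
          [0, 53, 0, 0, 3, 0],
          [0, 0, 47, 1, 0, 2],
          [0, 1, 1, 49, 1, 0],
          [1, 0, 0, 2, 47, 0],
          [0, 1, 0, 0, 4, 36]]
print(multiclass_measures(cm_ofs))   # precision_micro = recall_micro = 0.9431...
```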

Table 5: Performance measures for the binary classifier with/without feature selection

Performance Measure    Without Feature Selection    With OFS Feature Selection
Accuracy (%)           98.3871                      100
Precision              0.9677                       1
Recall                 1                            1

Table 6: Performance measures for the multiclass classifiers with/without feature selection and Random Forest

Performance Measure       WFS       SFFS      OFS       SVM-RFE   Random Forest
Average Accuracy (%)      93.6455   92.3077   94.3144   93.5275   86.7314
Precision_μ               0.9365    0.9231    0.9431    0.9353    0.8673
Recall_μ                  0.9365    0.9231    0.9431    0.9353    0.8673
Precision_M               0.1561    0.1538    0.1572    0.1559    0.1446
Recall_M                  0.1561    0.1538    0.1572    0.1559    0.1446