Neurocomputing 68 (2005) 315–321 www.elsevier.com/locate/neucom
Letters
Fusion of classifiers for protein fold recognition

Loris Nanni

DEIS, IEIIT-CNR, Università di Bologna, Viale Risorgimento 2, 40136 Bologna, Italy

Received 16 January 2005; received in revised form 1 March 2005; accepted 6 March 2005
Available online 13 May 2005

Communicated by R. W. Newcomb
Abstract

Predicting the three-dimensional structure of a protein from its amino acid sequence is an important problem in bioinformatics and a challenging task for machine learning algorithms. Given numerical features, existing machine learning techniques can then be applied to learn and classify proteins represented by these features. We show that, by combining Fisher’s linear classifier and the K-local hyperplane distance nearest neighbor classifier, we obtain an error rate lower than those previously published in the literature.
© 2005 Elsevier B.V. All rights reserved.

Keywords: Fusion of classifiers; Protein fold recognition; Machine learning algorithms
1. Introduction

The folding problem is central in molecular biology and can be formulated as follows: given the primary structure (the linear sequence of amino acids in a protein molecule) of a protein, how can the three-dimensional (3-D) fold be deduced from it? The ability to predict protein structure from the amino-acid sequence would have a tremendous impact on biotechnology and drug design. Recently, several works have approached the problem of predicting the 3-D structure of a protein with machine learning techniques. [3] employed ensembles of both three-layer feed-forward neural networks (NN) (trained with the
conjugate gradient method) and support vector machines (one-vs-all, unique one-vs-all and one-vs-one methods for building multi-class support vector machines). Each ensemble consisted of many two-class classifiers. [1] used a 131-dimensional feature vector and an ensemble of four-layer discretized interpretable multi-layer perceptrons (DIMLP), where each network learns all protein folds simultaneously; bagging and arcing were used to combine the outputs of the DIMLPs. [2] selected NNs and support vector machines (SVM) as the basic building blocks of a two-level classification scheme. NNs with a single hidden layer were used: the multi-layer perceptron (MLP), the radial basis function network (RBFN), and the general regression neural network (GRNN). [9] performed independent classification at each level, i.e. they did not utilize a hierarchical classifier; they trained the MLP and RBFN with 400 new features based on the hydrophobicity of the amino acids. In [7] a modified nearest neighbor algorithm, the K-local hyperplane distance nearest neighbor, was applied to protein fold recognition based on features derived from secondary structure. The combination of multiple classifiers has been shown to be suitable for improving the recognition performance in difficult classification problems [6]. We show that, by combining Fisher’s linear classifier and the K-local hyperplane distance nearest neighbor, we obtain an error rate lower than those previously published in the literature.
2. System

To use machine learning methods, feature vectors are extracted from the protein sequences. Six feature groups are extracted: amino-acid composition, predicted secondary structure, hydrophobicity, normalized van der Waals volume, polarity, and polarizability. All feature groups except the first have dimensionality 21, whereas the composition has dimensionality 20. Thus, a feature vector combining the six groups has 125 dimensions (components). In addition, the protein length is included, so each feature vector has 126 components in total. K-local hyperplane distance nearest neighbor (HKNN) [7] is trained using these 126 features. The same features are also projected onto a 126-dimensional space by the Karhunen–Loève transform (KL) [4], and Fisher’s linear classifier is trained on the KL-transformed features (Fig. 1).
Fig. 1. The proposed system: the features are fed directly to HKNN and, after the KL transform, to FLC; the outputs of the two classifiers are combined by the weighted sum rule.
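As a rough illustration of the pipeline in Fig. 1, the sketch below assembles the 126-dimensional feature vector (20 composition values, five 21-dimensional property descriptors and the protein length) and fits a KL transform for the FLC branch, while the HKNN branch uses the raw vector. All function names are hypothetical, and the whitening step is an assumption (the letter only defines the eigenvector and eigenvalue matrices of the covariance matrix); the actual feature extraction follows [3].

```python
import numpy as np

# Hypothetical assembly of the 126-dimensional feature vector described above:
# amino-acid composition (20) + five property descriptors (5 x 21) + protein length (1).
def build_feature_vector(composition, secondary, hydrophobicity,
                         volume, polarity, polarizability, length):
    parts = [secondary, hydrophobicity, volume, polarity, polarizability]
    assert composition.shape == (20,) and all(p.shape == (21,) for p in parts)
    return np.concatenate([composition, *parts, [float(length)]])   # 126 values

# Karhunen-Loeve transform fitted on the training matrix X (one row per protein).
def fit_kl(X):
    mean = X.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(X, rowvar=False))
    order = np.argsort(eigvals)[::-1]                # largest eigenvalues first
    return mean, eigvecs[:, order], eigvals[order]

def apply_kl(x, mean, eigvecs, eigvals, eps=1e-12):
    # Project the centered vector onto the eigenvectors; the 1/sqrt(lambda)
    # whitening is our assumption, not stated explicitly in the letter.
    return (eigvecs.T @ (x - mean)) / np.sqrt(eigvals + eps)
```

HKNN would then be applied to the raw 126-dimensional vectors and FLC to the KL-transformed ones, with the two outputs merged by the weighted sum rule of Section 2.4.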
2.1. Fisher’s linear classifier (FLC)

In this work, we use Fisher’s linear classifier [4]. FLC finds the linear discriminant function between the classes in the dataset by minimizing the errors in the least-squares sense. For each class a combined linear classifier is computed, separating it from the other classes. Patterns are assigned to the class for which the (combined) classifier yields the highest posterior probability.

2.2. Feature transform (FT)

Feature transformation is a process through which a new set of features is created from an existing one represented in a vector space. Here
$$\bar{x} = \frac{1}{n}\sum_{j=1}^{n} X_j \tag{1}$$

is the mean vector of the $n$ training feature vectors $X_j$; $\Phi = [\varphi_1, \varphi_2, \ldots, \varphi_{kd}]$ and $\Lambda = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_{kd})$ are the matrices of the first $kd$ eigenvectors and eigenvalues ($\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_{kd}$) of the data covariance matrix. The KL transform projects each centered feature vector onto these eigenvectors, and the resulting features are used to train FLC.

2.3. K-local hyperplane distance nearest neighbor (HKNN) algorithm

HKNN [7] is a modified k-nearest neighbor (KNN) algorithm intended to raise the classification performance of the conventional KNN to the level of SVMs, which are considered by many to be the state of the art in pattern recognition. Unlike an SVM, which builds a (non-linear) decision surface separating the different classes of the data in a high-dimensional feature space, HKNN tries to find this surface directly in the input space. HKNN computes the distance of a test point $x$ to $L$ local hyperplanes, where $L$ is the number of classes. The $\ell$-th hyperplane is built from the $K$ nearest neighbors of $x$ in the training set that belong to the $\ell$-th class. The test point $x$ is associated with the class whose hyperplane is closest to $x$. The $\ell$-th local hyperplane can be expressed as

$$LH_{\ell}^{K}(x) = \Big\{ p \;\Big|\; p = \bar{n} + \sum_{k=1}^{K} \alpha_k V_k,\; \alpha_k \in \mathbb{R} \Big\}, \tag{2}$$

where $N_k$ is the $k$-th nearest neighbor of $x$, $\bar{n} = \frac{1}{K}\sum_{k=1}^{K} N_k$ is the centroid of the $K$-neighborhood, and $V_k = N_k - \bar{n}$. $LH_{\ell}^{K}(x)$ defines a $(K-1)$-dimensional hyperplane. To determine the $\alpha_k$ one needs to solve the following linear system [7]:

$$(V' V)\,\alpha = V'(x - \bar{n}), \tag{3}$$
where $\alpha = (\alpha_1, \ldots, \alpha_K)'$ and $V$ is a matrix whose columns are the $V_k$ vectors defined earlier. To penalize large values of $\alpha_k$, a penalty term $\lambda$ can be introduced. The distance of the point $x$ to the hyperplane $LH_{\ell}^{K}(x)$ is then defined as

$$d^{2}\big(x, LH_{\ell}^{K}(x)\big) = \Big\| x - \bar{n} - \sum_{k=1}^{K} \alpha_k V_k \Big\|^{2} + \lambda \sum_{k=1}^{K} \alpha_k^{2}, \tag{4}$$

where the $\alpha_k$ are found by solving

$$(V' V + \lambda I)\,\alpha = V'(x - \bar{n}), \tag{5}$$
where $I$ is the $(K \times K)$ identity matrix. HKNN therefore has two parameters: $K$ and the penalty term $\lambda$ used to find the hyperplane. For a more detailed mathematical description of HKNN please refer to [7].

2.4. Weighted sum rule

The performances of the individual classifiers differ, so different weights should be used when combining them: we sum the outputs of the individual classifiers with different weights. In this paper the weight of HKNN is fixed to 1, and we tested three values (1, 0.75 and 0.5) as the weight of FLC. The weight of HKNN is kept higher than that of FLC because HKNN outperforms FLC. The weighted sum of the two classifiers’ confidence scores [4] is calculated as

$$S = Th \cdot FLC_s + HKNN_s, \tag{6}$$

where $FLC_s$ and $HKNN_s$ are the confidence scores of FLC and HKNN, respectively, and $Th$ is the weight of FLC. For a more detailed mathematical description of the weighted sum rule please refer to [8].
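A minimal NumPy sketch of the HKNN decision rule of Eqs. (2)-(5) may help fix ideas. It is not the implementation used in [7]; the function name and the assumption that every class has at least K training proteins are ours, and since the letter does not specify how the per-class confidence score is built, the sketch returns only the predicted class and the class distances.

```python
import numpy as np

def hknn_predict(x, X_train, y_train, K=7, lam=8.0):
    """Assign x to the class whose K-local hyperplane (Eq. (2)) is closest, per Eqs. (4)-(5)."""
    dists = {}
    for c in np.unique(y_train):
        Xc = X_train[y_train == c]                              # training points of class c
        idx = np.argsort(np.linalg.norm(Xc - x, axis=1))[:K]    # K nearest neighbors in class c
        N = Xc[idx]
        n_bar = N.mean(axis=0)                                  # centroid of the K-neighborhood
        V = (N - n_bar).T                                       # columns are V_k = N_k - n_bar
        # Solve (V'V + lambda * I) alpha = V'(x - n_bar)        -> Eq. (5)
        alpha = np.linalg.solve(V.T @ V + lam * np.eye(V.shape[1]), V.T @ (x - n_bar))
        # Penalized squared distance to the local hyperplane    -> Eq. (4)
        dists[c] = np.sum((x - n_bar - V @ alpha) ** 2) + lam * np.sum(alpha ** 2)
    best = min(dists, key=dists.get)
    return best, dists
```

With K = 7 and λ = 8 this corresponds to the first HKNN configuration reported in Table 2.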
3. Experimental results

Two datasets derived from the SCOP (Structural Classification of Proteins) database were used. These datasets are available online (http://crd.lbl.gov/cding/protein/) and their detailed description can be found in [3]. Each dataset contains the 27 most populated folds, each represented by seven or more proteins and corresponding to four major structural classes. Table 1 shows the protein distribution among these classes. The first dataset of 385 proteins, also known as the independent test dataset, is composed of protein sequences with less than 40% identity with each other and less than 35% identity with the proteins of the second dataset; in fact, 90% of the proteins of the first dataset have less than 25% sequence identity with the proteins of the second dataset. The second dataset consists of 313 proteins having (for each pair of proteins) no more than 35% sequence identity for aligned subsequences longer than 80 residues.
Table 1
Distribution of the proteins among the structural classes of the two datasets

Structural class    Dataset 1    Dataset 2
α                       61           55
β                      117          109
α/β                    145          115
α+β                     62           34
Total                  385          313
Table 2
Performance (%) of HKNN and FLC

Classifier              Dataset 1    Dataset 2
HKNN (K = 7, λ = 8)        56.8         —
HKNN (K = 4, λ = 20)       55           48.7
HKNN (K = 4, λ = 1)        57.4         47.1
HKNN (K = 3, λ = 1)        53.5         45.1
FLC                        52.5         47.4
Table 3
Performance (%) of the ‘‘Weighted Sum Rule’’ (Dataset 1)

        Th = 1    Th = 0.75    Th = 0.5
FUS1     57.1       58.4         59
FUS2     56.1       57.4         58.7
FUS3     56.6       58           59.2
FUS4     56.3       57.1         57.7
For the first dataset, the system was trained on the second dataset and then tested on the first one to measure the average predictive accuracy. For the second dataset, the average predictive accuracy was estimated by 10-fold cross-validation: 10 folds of approximately the same size are created, 9 of them are used as the training set and the remaining one as the testing set, and the folds are then shifted by one position. In Tables 2–5, we report the performance of the different classifiers: ‘‘HKNN K = x, λ = y’’ means that HKNN is trained with K = x and λ = y; FUS1 is the fusion of FLC and HKNN (K = 7, λ = 8); FUS2 the fusion of FLC and HKNN (K = 4, λ = 20); FUS3 the fusion of FLC and HKNN (K = 4, λ = 1); FUS4 the fusion of FLC and HKNN (K = 3, λ = 1); Th is the weight of FLC in the ‘‘weighted sum rule’’. A small illustrative sketch of this fusion step is given below.
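The sketch below shows how the weighted sum rule of Eq. (6) would produce the FUS1–FUS4 predictions from per-class confidence scores. How the confidence scores are scaled is not specified in the letter, so the comparable-scaling assumption (and all names) are illustrative only.

```python
import numpy as np

def weighted_sum_fusion(flc_scores, hknn_scores, th):
    """Eq. (6): S = Th * FLC_s + HKNN_s, computed per class.

    flc_scores, hknn_scores: arrays of shape (n_proteins, n_classes) holding the
    per-class confidence scores of FLC and HKNN, assumed comparably scaled
    (e.g. normalized to [0, 1]). th is the FLC weight; HKNN's weight is 1.
    Returns the index of the predicted fold for each protein.
    """
    fused = th * flc_scores + hknn_scores
    return fused.argmax(axis=1)

# Toy usage with the three FLC weights tested in the paper (27 SCOP folds).
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    flc, hknn = rng.random((5, 27)), rng.random((5, 27))
    for th in (1.0, 0.75, 0.5):
        print(th, weighted_sum_fusion(flc, hknn, th))
```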
Table 4
Performance (%) of the ‘‘Weighted Sum Rule’’ (Dataset 2)

        Th = 1    Th = 0.75    Th = 0.5
FUS2     53.2       52.3         52
FUS3     52.9       52           52
FUS4     51.3       51           52.3
Table 5
Performance (%) of different methods

Method          Dataset 1    Dataset 2
HKNN               57.4         47.1
FLC                52.5         47.4
FUS2               58.7         53.2
FUS3               59.2         52
DIMLP-B [1]        61.2         45
DIMLP-A [1]        59.1         46.7
MLP [2]            48.8         —
GRNN [2]           44.2         —
RBFN [2]           49.4         —
SVM [2]            51.4         —
RBFN [9]           51.2         —
SVM [3]            56           45.4
In this paper (as in [3]), the measure adopted for performance evaluation is the Q percentage accuracy. Suppose there are $N_p = n_1 + \ldots + n_{tp}$ test proteins, where $n_1$ are observed to belong to class 1, and so on. Suppose that, out of the $n_1$ proteins, $c_1$ are correctly recognized as belonging to class 1, and so on, so that in total $C = c_1 + \ldots + c_{tp}$ proteins are correctly recognized. The total accuracy is $Q = C/N_p$. Our approach outperformed all competitors on Dataset 2 and almost all competitors, except DIMLP-B (bagged DIMLP), on Dataset 1. We want to stress that DIMLPs require many parameters to be tuned, and these parameters were fixed so as to maximize the performance on Dataset 1.
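For completeness, a two-line sketch of the Q measure; the arrays are toy values.

```python
import numpy as np

def q_accuracy(y_true, y_pred):
    """Total accuracy Q = C / Np: fraction of test proteins whose fold is predicted correctly."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))

print(q_accuracy([0, 1, 2, 2], [0, 1, 2, 1]))  # 3 of 4 correct -> Q = 0.75
```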
4. Conclusions

In this paper, we investigated the fusion of classifiers applied to protein fold recognition and tested it on a real-world dataset. It is well known in the literature [5] that classifier ensembles that enforce diversity fare better than those that do not. To enforce diversity we combine classifiers based on different methodologies; moreover, these classifiers are trained using different features. It is believed [5] that classifiers based on different methodologies or different features offer
complementary information about the patterns to be classified. We propose an ensemble that combines a non-density-based classifier (FLC) and a hyperplane-based classifier (HKNN); moreover, FLC and HKNN are trained using different features. The obtained results are very encouraging, but we are dealing with a relatively small dataset, where four misclassifications already amount to 1% of error and many folds have only a few representatives in the training set. Hence, this issue should be addressed in future research.
Acknowledgment

The author would like to thank C.H.Q. Ding at Lawrence Berkeley National Laboratory for sharing the protein fold datasets.

References

[1] G. Bologna, R.D. Appel, A comparison study on protein fold recognition, in: Proceedings of the Ninth International Conference on Neural Information Processing, vol. 5, Singapore, November 18–22, 2002, pp. 2492–2496.
[2] I.F. Chung, C.D. Huang, Y.H. Shen, C.T. Lin, Recognition of structure classification of protein folding by NN and SVM hierarchical learning architecture, in: Artificial Neural Networks and Neural Information Processing—ICANN/ICONIP, vol. 2714, Istanbul, Turkey, June 26–29, 2003, pp. 1159–1167.
[3] C. Ding, I. Dubchak, Multi-class protein fold recognition using support vector machines and neural networks, Bioinformatics 17 (4) (2001) 349–358.
[4] R.O. Duda, P.E. Hart, D.G. Stork, Pattern Classification, second ed., Wiley, New York, 2000.
[5] L.I. Kuncheva, Diversity in multiple classifier systems, Inf. Fusion 6 (1) (2005) 3–4.
[6] L.I. Kuncheva, R.K. Kountchev, Generating classifier outputs of fixed accuracy and diversity, Pattern Recogn. Lett. 23 (5) (2002) 593–600.
[7] O. Okun, Protein fold recognition with K-local hyperplane distance nearest neighbor algorithm, in: Proceedings of the Second European Workshop on Data Mining and Text Mining in Bioinformatics, vol. 1, Pisa, Italy, September 24, 2004, pp. 51–57.
[8] M.G.H. Ong, T. Connie, A.T.B. Jin, D.N.C. Ling, A single-sensor hand geometry and palmprint verification system, in: Multimedia Biometrics Methods and Applications Workshop, vol. 1, Berkeley, USA, November 8, 2003, pp. 100–106.
[9] N.R. Pal, D. Chakraborty, Some new features for protein fold recognition, in: Artificial Neural Networks and Neural Information Processing—ICANN/ICONIP, vol. 2714, Istanbul, Turkey, June 26–29, 2003, pp. 1176–1183.