Neurocomputing 68 (2005) 315–321 www.elsevier.com/locate/neucom
Letters
Fusion of classifiers for protein fold recognition

Loris Nanni

DEIS, IEIIT-CNR, Università di Bologna, Viale Risorgimento 2, 40136 Bologna, Italy

Received 16 January 2005; received in revised form 1 March 2005; accepted 6 March 2005
Available online 13 May 2005

Communicated by R. W. Newcomb
Abstract

Predicting the three-dimensional structure of a protein from its amino acid sequence is an important problem in bioinformatics and a challenging task for machine learning algorithms. Given numerical features, existing machine learning techniques can then be applied to learn and classify proteins represented by these features. We show that, by combining Fisher’s linear classifier and the K-local hyperplane distance nearest neighbor classifier, we obtain an error rate lower than those previously published in the literature.
© 2005 Elsevier B.V. All rights reserved.

Keywords: Fusion of classifiers; Protein fold recognition; Machine learning algorithms
1. Introduction

The folding problem is central in molecular biology and can be formulated as follows: given the primary structure (the linear sequence of amino acids in a protein molecule) of a protein, how can the three-dimensional (3-D) fold be deduced from it? The ability to predict protein structure from the amino-acid sequence would have a tremendous impact on biotechnology and drug design. Recently, several works have approached the problem of predicting the 3-D structure of a protein with machine learning techniques. [3] employed ensembles of both three-layer feed-forward neural networks (NN) (trained with the
conjugate gradient method) and support vector machines (one-vs-all, unique one-vs-all and one-vs-one methods for building multi-class support vector machines). Each ensemble consisted of many two-class classifiers. [1] used a 131-dimensional feature vector and an ensemble of four-layer discretized interpretable multi-layer perceptrons (DIMLP), where each network learns all protein folds simultaneously; bagging and arcing were used to combine the outputs of the DIMLPs. [2] selected NNs and support vector machines (SVM) as the basic building blocks of a two-level classification scheme. NNs with a single hidden layer were used: the multi-layer perceptron (MLP), the radial basis function network (RBFN), and the general regression neural network (GRNN). [9] performed independent classification at each level, i.e. they did not utilize a hierarchical classifier; they trained the MLP and RBFN with 400 new features based on the hydrophobicity of the amino acids. In [7] a modified nearest neighbor algorithm, the K-local hyperplane distance nearest neighbor, was applied to protein fold recognition based on features derived from secondary structure. The combination of multiple classifiers has been shown to be suitable for improving the recognition performance in difficult classification problems [6]. We show that, by combining Fisher’s linear classifier and the K-local hyperplane distance nearest neighbor, we obtain an error rate lower than those previously published in the literature.
2. System

To use machine learning methods, feature vectors are extracted from the protein sequences. Six feature groups are extracted: amino-acid composition, predicted secondary structure, hydrophobicity, normalized van der Waals volume, polarity, and polarizability. All feature groups except the first have dimensionality 21, whereas the composition has dimensionality 20. Thus, a feature vector combining the six groups has 125 dimensions (components). In addition, the protein length is included, so each feature vector has 126 components in total. K-local hyperplane distance nearest neighbor (HKNN) [7] is trained using these 126 features. The same features are also projected onto a 126-dimensional space by the Karhunen–Loève transform (KL) [4], and Fisher’s linear classifier is trained on the KL-transformed features (Fig. 1).
Fig. 1. The proposed system: the features are fed directly to HKNN and, after the KL transform, to FLC; the outputs of the two classifiers are combined by the weighted sum rule.
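As a rough illustration of the pipeline in Fig. 1, the sketch below assembles the 126-dimensional feature vector (20 composition values, five 21-dimensional property descriptors and the protein length) and fits a KL transform for the FLC branch, while the HKNN branch uses the raw vector. All function names are hypothetical, and the whitening step is an assumption (the letter only defines the eigenvector and eigenvalue matrices of the covariance matrix); the actual feature extraction follows [3].

```python
import numpy as np

# Hypothetical assembly of the 126-dimensional feature vector described above:
# amino-acid composition (20) + five property descriptors (5 x 21) + protein length (1).
def build_feature_vector(composition, secondary, hydrophobicity,
                         volume, polarity, polarizability, length):
    parts = [secondary, hydrophobicity, volume, polarity, polarizability]
    assert composition.shape == (20,) and all(p.shape == (21,) for p in parts)
    return np.concatenate([composition, *parts, [float(length)]])   # 126 values

# Karhunen-Loeve transform fitted on the training matrix X (one row per protein).
def fit_kl(X):
    mean = X.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(X, rowvar=False))
    order = np.argsort(eigvals)[::-1]                # largest eigenvalues first
    return mean, eigvecs[:, order], eigvals[order]

def apply_kl(x, mean, eigvecs, eigvals, eps=1e-12):
    # Project the centered vector onto the eigenvectors; the 1/sqrt(lambda)
    # whitening is our assumption, not stated explicitly in the letter.
    return (eigvecs.T @ (x - mean)) / np.sqrt(eigvals + eps)
```

HKNN would then be applied to the raw 126-dimensional vectors and FLC to the KL-transformed ones, with the two outputs merged by the weighted sum rule of Section 2.4.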
2.1. Fisher’s linear classifier (FLC)

In this work, we use Fisher’s linear classifier [4]. FLC finds the linear discriminant function between the classes in the dataset by minimizing the errors in the least-squares sense. For each class a combined linear classifier is computed, separating it from the other classes. Patterns are assigned to the class for which the (combined) classifier yields the highest posterior probability.

2.2. Feature transform (FT)

Feature transformation is a process through which a new set of features is created from an existing one represented in a vector space. Here
$$\bar{x} = \frac{1}{n}\sum_{j=1}^{n} X_j \tag{1}$$

is the mean vector of the $n$ training feature vectors $X_j$; $\Phi = [\varphi_1, \varphi_2, \ldots, \varphi_{kd}]$ and $\Lambda = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_{kd})$ are the matrices of the first $kd$ eigenvectors and eigenvalues ($\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_{kd}$) of the data covariance matrix. The KL transform projects each centered feature vector onto these eigenvectors, and the resulting features are used to train FLC.

2.3. K-local hyperplane distance nearest neighbor (HKNN) algorithm

HKNN [7] is a modified k-nearest neighbor (KNN) algorithm intended to raise the classification performance of the conventional KNN to the level of SVMs, which are considered by many to be the state of the art in pattern recognition. Unlike an SVM, which builds a (non-linear) decision surface separating the different classes of the data in a high-dimensional feature space, HKNN tries to find this surface directly in the input space. HKNN computes the distance of a test point $x$ to $L$ local hyperplanes, where $L$ is the number of classes. The $\ell$-th hyperplane is built from the $K$ nearest neighbors of $x$ in the training set that belong to the $\ell$-th class. The test point $x$ is associated with the class whose hyperplane is closest to $x$. The $\ell$-th local hyperplane can be expressed as

$$LH_{\ell}^{K}(x) = \Big\{ p \;\Big|\; p = \bar{n} + \sum_{k=1}^{K} \alpha_k V_k,\; \alpha_k \in \mathbb{R} \Big\}, \tag{2}$$

where $N_k$ is the $k$-th nearest neighbor of $x$, $\bar{n} = \frac{1}{K}\sum_{k=1}^{K} N_k$ is the centroid of the $K$-neighborhood, and $V_k = N_k - \bar{n}$. $LH_{\ell}^{K}(x)$ defines a $(K-1)$-dimensional hyperplane. To determine the $\alpha_k$ one needs to solve the following linear system [7]:

$$(V' V)\,\alpha = V'(x - \bar{n}), \tag{3}$$
where $\alpha = (\alpha_1, \ldots, \alpha_K)'$ and $V$ is a matrix whose columns are the $V_k$ vectors defined earlier. To penalize large values of $\alpha_k$, a penalty term $\lambda$ can be introduced. The distance of the point $x$ to the hyperplane $LH_{\ell}^{K}(x)$ is then defined as

$$d^{2}\big(x, LH_{\ell}^{K}(x)\big) = \Big\| x - \bar{n} - \sum_{k=1}^{K} \alpha_k V_k \Big\|^{2} + \lambda \sum_{k=1}^{K} \alpha_k^{2}, \tag{4}$$

where the $\alpha_k$ are found by solving

$$(V' V + \lambda I)\,\alpha = V'(x - \bar{n}), \tag{5}$$
where $I$ is the $(K \times K)$ identity matrix. HKNN therefore has two parameters: $K$ and the penalty term $\lambda$ used to find the hyperplane. For a more detailed mathematical description of HKNN please refer to [7].

2.4. Weighted sum rule

The performances of the individual classifiers differ, so different weights should be used when combining them: we sum the outputs of the individual classifiers with different weights. In this paper the weight of HKNN is fixed to 1, and we tested three values (1, 0.75 and 0.5) as the weight of FLC. The weight of HKNN is kept higher than that of FLC because HKNN outperforms FLC. The weighted sum of the two classifiers’ confidence scores [4] is calculated as

$$S = Th \cdot FLC_s + HKNN_s, \tag{6}$$

where $FLC_s$ and $HKNN_s$ are the confidence scores of FLC and HKNN, respectively, and $Th$ is the weight of FLC. For a more detailed mathematical description of the weighted sum rule please refer to [8].
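A minimal NumPy sketch of the HKNN decision rule of Eqs. (2)-(5) may help fix ideas. It is not the implementation used in [7]; the function name and the assumption that every class has at least K training proteins are ours, and since the letter does not specify how the per-class confidence score is built, the sketch returns only the predicted class and the class distances.

```python
import numpy as np

def hknn_predict(x, X_train, y_train, K=7, lam=8.0):
    """Assign x to the class whose K-local hyperplane (Eq. (2)) is closest, per Eqs. (4)-(5)."""
    dists = {}
    for c in np.unique(y_train):
        Xc = X_train[y_train == c]                              # training points of class c
        idx = np.argsort(np.linalg.norm(Xc - x, axis=1))[:K]    # K nearest neighbors in class c
        N = Xc[idx]
        n_bar = N.mean(axis=0)                                  # centroid of the K-neighborhood
        V = (N - n_bar).T                                       # columns are V_k = N_k - n_bar
        # Solve (V'V + lambda * I) alpha = V'(x - n_bar)        -> Eq. (5)
        alpha = np.linalg.solve(V.T @ V + lam * np.eye(V.shape[1]), V.T @ (x - n_bar))
        # Penalized squared distance to the local hyperplane    -> Eq. (4)
        dists[c] = np.sum((x - n_bar - V @ alpha) ** 2) + lam * np.sum(alpha ** 2)
    best = min(dists, key=dists.get)
    return best, dists
```

With K = 7 and λ = 8 this corresponds to the first HKNN configuration reported in Table 2.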
3. Experimental results

Two datasets derived from the SCOP (Structural Classification of Proteins) database were used. These datasets are available online (http://crd.lbl.gov/cding/protein/) and their detailed description can be found in [3]. Each dataset contains the 27 most populated folds, each represented by seven or more proteins and corresponding to four major structural classes. Table 1 shows the protein distribution among these classes. The first dataset of 385 proteins, also known as the independent test dataset, is composed of protein sequences with less than 40% identity with each other and less than 35% identity with the proteins of the second dataset; in fact, 90% of the proteins of the first dataset have less than 25% sequence identity with the proteins of the second dataset. The second dataset consists of 313 proteins having (for each pair of proteins) no more than 35% sequence identity for aligned subsequences longer than 80 residues.
Table 1
Distribution of the proteins among the structural classes of the two datasets

Structural class    Dataset 1    Dataset 2
α                       61           55
β                      117          109
α/β                    145          115
α+β                     62           34
Total                  385          313
Table 2
Performance (%) of HKNN and FLC

Classifier              Dataset 1    Dataset 2
HKNN (K = 7, λ = 8)        56.8         —
HKNN (K = 4, λ = 20)       55           48.7
HKNN (K = 4, λ = 1)        57.4         47.1
HKNN (K = 3, λ = 1)        53.5         45.1
FLC                        52.5         47.4
Table 3
Performance (%) of the ‘‘Weighted Sum Rule’’ (Dataset 1)

        Th = 1    Th = 0.75    Th = 0.5
FUS1     57.1       58.4         59
FUS2     56.1       57.4         58.7
FUS3     56.6       58           59.2
FUS4     56.3       57.1         57.7
For the first dataset, the system was trained on the second dataset and then tested on the first one to measure the average predictive accuracy. For the second dataset, the average predictive accuracy was estimated by 10-fold cross-validation: 10 folds of approximately the same size are created, 9 of them are used as the training set and the remaining one as the testing set, and the folds are then shifted by one position. In Tables 2–5, we report the performance of the different classifiers: ‘‘HKNN K = x, λ = y’’ means that HKNN is trained with K = x and λ = y; FUS1 is the fusion of FLC and HKNN (K = 7, λ = 8); FUS2 the fusion of FLC and HKNN (K = 4, λ = 20); FUS3 the fusion of FLC and HKNN (K = 4, λ = 1); FUS4 the fusion of FLC and HKNN (K = 3, λ = 1); Th is the weight of FLC in the ‘‘weighted sum rule’’. A small illustrative sketch of this fusion step is given below.
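The sketch below shows how the weighted sum rule of Eq. (6) would produce the FUS1–FUS4 predictions from per-class confidence scores. How the confidence scores are scaled is not specified in the letter, so the comparable-scaling assumption (and all names) are illustrative only.

```python
import numpy as np

def weighted_sum_fusion(flc_scores, hknn_scores, th):
    """Eq. (6): S = Th * FLC_s + HKNN_s, computed per class.

    flc_scores, hknn_scores: arrays of shape (n_proteins, n_classes) holding the
    per-class confidence scores of FLC and HKNN, assumed comparably scaled
    (e.g. normalized to [0, 1]). th is the FLC weight; HKNN's weight is 1.
    Returns the index of the predicted fold for each protein.
    """
    fused = th * flc_scores + hknn_scores
    return fused.argmax(axis=1)

# Toy usage with the three FLC weights tested in the paper (27 SCOP folds).
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    flc, hknn = rng.random((5, 27)), rng.random((5, 27))
    for th in (1.0, 0.75, 0.5):
        print(th, weighted_sum_fusion(flc, hknn, th))
```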
Table 4
Performance (%) of the ‘‘Weighted Sum Rule’’ (Dataset 2)

        Th = 1    Th = 0.75    Th = 0.5
FUS2     53.2       52.3         52
FUS3     52.9       52           52
FUS4     51.3       51           52.3
Table 5
Performance (%) of different methods

Method          Dataset 1    Dataset 2
HKNN               57.4         47.1
FLC                52.5         47.4
FUS2               58.7         53.2
FUS3               59.2         52
DIMLP-B [1]        61.2         45
DIMLP-A [1]        59.1         46.7
MLP [2]            48.8         —
GRNN [2]           44.2         —
RBFN [2]           49.4         —
SVM [2]            51.4         —
RBFN [9]           51.2         —
SVM [3]            56           45.4
In this paper (as in [3]), the measure adopted for performance evaluation is the Q percentage accuracy. Suppose there are $N_p = n_1 + \ldots + n_{tp}$ test proteins, where $n_1$ are observed to belong to class 1, and so on. Suppose that, out of the $n_1$ proteins, $c_1$ are correctly recognized as belonging to class 1, and so on, so that in total $C = c_1 + \ldots + c_{tp}$ proteins are correctly recognized. The total accuracy is $Q = C/N_p$. Our approach outperformed all competitors on Dataset 2 and almost all competitors, except DIMLP-B (bagged DIMLP), on Dataset 1. We want to stress that DIMLPs require many parameters to be tuned, and these parameters were fixed so as to maximize the performance on Dataset 1.
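For completeness, a two-line sketch of the Q measure; the arrays are toy values.

```python
import numpy as np

def q_accuracy(y_true, y_pred):
    """Total accuracy Q = C / Np: fraction of test proteins whose fold is predicted correctly."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))

print(q_accuracy([0, 1, 2, 2], [0, 1, 2, 1]))  # 3 of 4 correct -> Q = 0.75
```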
4. Conclusions

In this paper, we investigated the fusion of classifiers applied to protein fold recognition and tested it on a real-world dataset. It is well known in the literature [5] that classifier ensembles that enforce diversity fare better than those that do not. To enforce diversity we combine classifiers based on different methodologies; moreover, these classifiers are trained using different features. It is believed [5] that classifiers based on different methodologies or different features offer
complementary information about the patterns to be classified. We propose an ensemble that combines a non-density-based classifier (FLC) and a hyperplane-based classifier (HKNN); moreover, FLC and HKNN are trained using different features. The obtained results are very encouraging, but we are dealing with a relatively small dataset, where four misclassifications already amount to 1% of error and many folds have only a few representatives in the training set. Hence, this issue should be addressed in future research.
Acknowledgment

The author would like to thank C.H.Q. Ding at Lawrence Berkeley National Laboratory for sharing the protein fold datasets.

References

[1] G. Bologna, R.D. Appel, A comparison study on protein fold recognition, in: Proceedings of the Ninth International Conference on Neural Information Processing, vol. 5, Singapore, November 18–22, 2002, pp. 2492–2496.
[2] I.F. Chung, C.D. Huang, Y.H. Shen, C.T. Lin, Recognition of structure classification of protein folding by NN and SVM hierarchical learning architecture, in: Artificial Neural Networks and Neural Information Processing—ICANN/ICONIP, vol. 2714, Istanbul, Turkey, June 26–29, 2003, pp. 1159–1167.
[3] C. Ding, I. Dubchak, Multi-class protein fold recognition using support vector machines and neural networks, Bioinformatics 17 (4) (2001) 349–358.
[4] R.O. Duda, P.E. Hart, D.G. Stork, Pattern Classification, second ed., Wiley, New York, 2000.
[5] L.I. Kuncheva, Diversity in multiple classifier systems, Inf. Fusion 6 (1) (2005) 3–4.
[6] L.I. Kuncheva, R.K. Kountchev, Generating classifier outputs of fixed accuracy and diversity, Pattern Recogn. Lett. 23 (5) (2002) 593–600.
[7] O. Okun, Protein fold recognition with K-local hyperplane distance nearest neighbor algorithm, in: Proceedings of the Second European Workshop on Data Mining and Text Mining in Bioinformatics, vol. 1, Pisa, Italy, September 24, 2004, pp. 51–57.
[8] M.G.H. Ong, T. Connie, A.T.B. Jin, D.N.C. Ling, A single-sensor hand geometry and palmprint verification system, in: Multimedia Biometrics Methods and Applications Workshop, vol. 1, Berkeley, USA, November 8, 2003, pp. 100–106.
[9] N.R. Pal, D. Chakraborty, Some new features for protein fold recognition, in: Artificial Neural Networks and Neural Information Processing—ICANN/ICONIP, vol. 2714, Istanbul, Turkey, June 26–29, 2003, pp. 1176–1183.