Computational Biology and Chemistry 32 (2008) 298–301
Computational Biology and Chemistry journal homepage: www.elsevier.com/locate/compbiolchem
Short Communication
A method for discovering transmembrane beta-barrel proteins in Gram-negative bacterial proteomes☆
Jing Hu, Changhui Yan∗
Department of Computer Science, Utah State University, Logan, UT 84322, USA
Article info
Article history: Received 29 November 2007; received in revised form 18 March 2008; accepted 19 March 2008
Keywords: Beta-barrel; Prediction; k-Nearest neighbor; Weighted Euclidian distance
Abstract
Transmembrane β-barrel (TMB) proteins play pivotal roles in many aspects of bacterial functions. This paper presents a k-nearest neighbor (K-NN) method for discriminating TMB and non-TMB proteins. We start with a method that makes predictions based on a distance computed from residue composition and gradually improve the prediction performance by including homologous sequences and searching for a set of residues and di-peptides for calculating the distance. The final method achieves an accuracy of 97.1%, with 0.876 MCC, 86.4% sensitivity and 98.8% specificity. A web server based on the proposed method is available at http://yanbioinformatics.cs.usu.edu:8080/TMBKNNsubmit.
Published by Elsevier Ltd
1. Introduction
Transmembrane β-barrel (TMB) proteins perform diverse functional roles including bacterial adhesion, structural integrity of the cell wall, and material transport (Koebnik et al., 2000; Schulz, 2000; Wimley, 2003). Unlike transmembrane α-helical proteins, which can be easily identified by their long hydrophobic transmembrane regions, TMB proteins are much harder to identify because their short transmembrane segments vary widely (Koebnik et al., 2000). A few methods have been developed to identify TMB proteins by exploring properties such as sequence profiles (Gnanasekaran et al., 2000), β-barrel score and signal peptides (Schleiff et al., 2003), the distribution of multiple properties on protein sequences (Zhai and Saier, 2002), and residue composition and predicted secondary structure (Liu et al., 2003). Garrow et al. (2005a,b) developed the TMB-Hunt method to identify TMB proteins based on residue composition, and Berven et al. (2004) developed the BOMP method, which identifies TMB proteins using a combination of pattern search, a β-barrel score based on amino acid distribution, and a filter that explores the abundance of asparagine and isoleucine in the protein. Both TMB-Hunt and BOMP have web servers available. That makes it possible to compare them with the
☆ Author’s contributions: CY conceived and designed the study, performed the analysis and drafted the manuscript. JH carried out the computation. ∗ Corresponding author at: Department of Computer Science, Utah State University, Old Main Hill 4205, Logan, UT 84322, USA. Tel.: +1 435 797 2570; fax: +1 435 797 3265. E-mail address:
[email protected] (C. Yan). 1476-9271/$ – see front matter. Published by Elsevier Ltd doi:10.1016/j.compbiolchem.2008.03.010
current study. In addition to those methods that predict whether a protein is a TMB protein, many other methods predict the topology of TMB proteins. Some of these topology-predicting methods can also discriminate TMB proteins from non-TMB proteins. In a recent study, Bagos et al. (2005) made a systematic comparison of the topology-predicting methods. In that comparison, PRED-TMBB (Bagos et al., 2004) and PROFtmb (Bigelow and Rost, 2006) achieved the best and second-best scores in predicting the topology of TMB proteins. Both methods can be used to discriminate TMB proteins. Thus, we will also compare our method with PRED-TMBB and PROFtmb.
2. Materials and Methods
2.1. Datasets
Transmembrane β-barrel (TMB) proteins were obtained from the SCOP database (“Transmembrane beta-barrels” family) (Murzin et al., 1995) and the Transporter Classification Database (“β-Barrel porins” subclass) (Saier et al., 2006). Redundant proteins were removed using BLAST (Altschul et al., 1997) so that mutual identity is less than 25% in the dataset. Proteins with fewer than 50 amino acids and proteins that were not from Gram-negative bacteria were removed. The final dataset consisted of 119 TMB proteins. Non-TMB proteins were obtained from the PSORTdb database (Rey et al., 2005) and redundant proteins were removed so that mutual identity is less than 25% in the dataset. These non-TMB proteins were divided into six groups based on their subcellular locations: 245 proteins from “Cytoplasmic”, 195 proteins from “CytoplasmicMembrane”, 15 proteins from “Cytoplasmic, CytoplasmicMembrane”,
J. Hu, C. Yan / Computational Biology and Chemistry 32 (2008) 298–301
165 proteins from “Periplasmic”, 35 proteins from “Periplasmic, CytoplasmicMembrane”, and 87 proteins from “Extracellular”. Together with the TMB group, this gives seven groups of proteins in total (available at http://yanbioinformatics.cs.usu.edu:8080/TMBKNNsubmit).
Table 1 Comparisons of the proposed method (K-NN) with similarity search in the discrimination of TMB and non-TMB proteins
Methods compared: K-NN (single sequence)a; K-NN (homologous sequences)b; Similarity searchc; K-NN (19 residues + 24 di-peptides)d
2.2. Weighted Euclidean Distance (WED) and k-Nearest Neighbor Method
Fivefold cross-validation was used to evaluate the proposed method. Residue composition of each protein was calculated. Average residue composition for each group of proteins was calculated using proteins in the training set. For each test protein, its distance to a protein (referred to as train protein) in the training set was calculated using

$$D = \sum_i \frac{\left(x_i^{\mathrm{test}} - x_i^{\mathrm{train}}\right)^2}{\bar{x}_i^{\mathrm{train}}},$$

where $x_i^{\mathrm{test}}$ is the residue composition of the test protein, $x_i^{\mathrm{train}}$ is the composition of train protein, and $\bar{x}_i^{\mathrm{train}}$ is the average composition of the group that train protein belongs to. Note that $\sum_i \left(x_i^{\mathrm{test}} - x_i^{\mathrm{train}}\right)^2$ gives the Euclidean distance between the test protein and train protein. Here, in the calculation of $D$, each item within the summation is weighted by a factor $1/\bar{x}_i^{\mathrm{train}}$. Therefore, $D$ is referred to as weighted Euclidean distance (WED). For a test protein, its WEDs to every protein in the training set were calculated. Then, for each of the seven groups, the k smallest distances were chosen and the distance from the test protein to the group was defined as the average of these k smallest distances. The test protein was predicted to be TMB if its distance to the TMB group was the smallest. Otherwise, it was predicted to be non-TMB. In this study, different values of k were tried. The best performance was achieved when k = 4. Thus, the results with k = 4 are reported.
2.3. Greedy Approach to Select Residues and Di-peptides
We used a greedy method to search for residues and di-peptides that were useful for the prediction of TMB proteins. The algorithm is a simplified version of the BestFirst method included in Weka (Witten and Frank, 2005). The greedy search started with a feature set that included the 20 amino acids. The feature set was reduced by removing amino acids from the set one at a time, until removing any amino acid from the feature set would reduce the prediction performance. Then, the feature set was grown by adding di-peptides into the set one at a time until adding any more di-peptides into the set would decrease the prediction performance. In the end, we obtained a set of features that included 19 residues and 24 di-peptides.
2.4. Performance Measures
Prediction performance was evaluated based on sensitivity, specificity, accuracy and MCC:

$$\mathrm{sensitivity} = \frac{TP}{TP + FN}; \qquad \mathrm{specificity} = \frac{TN}{TN + FP};$$

$$\mathrm{accuracy} = \frac{TP + TN}{TP + FN + TN + FP};$$

$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FN)(TP + FP)(TN + FP)(TN + FN)}}$$

where TP, TN, FP and FN are the numbers of true positives, true negatives, false positives and false negatives, respectively, with TMB proteins taken as the positive class.
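The distance-based classification of Section 2.2 and the measures of Section 2.4 can be sketched together in Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the `eps` guard against a zero group average, and the use of single-sequence composition over the 20 amino acids are all hypothetical choices, and the distance is computed exactly as the summation is written above (no square root).

```python
from collections import Counter
from math import sqrt
from statistics import mean

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def composition(seq, features=AMINO_ACIDS):
    """Fraction of each residue in a single sequence (hypothetical helper)."""
    counts = Counter(seq)
    total = sum(counts[a] for a in features) or 1
    return [counts[a] / total for a in features]

def group_means(training):
    """Average composition per group; training maps group name -> list of vectors."""
    return {g: [mean(col) for col in zip(*vecs)] for g, vecs in training.items()}

def wed(x_test, x_train, x_bar, eps=1e-9):
    """Weighted Euclidean distance: squared differences divided by the group average.
    eps is an assumption that avoids division by zero for absent residues."""
    return sum((t - r) ** 2 / max(m, eps)
               for t, r, m in zip(x_test, x_train, x_bar))

def is_tmb(x_test, training, k=4, tmb_label="TMB"):
    """Predict TMB if the mean of the k smallest WEDs is lowest for the TMB group."""
    means = group_means(training)
    group_dist = {}
    for g, vecs in training.items():
        dists = sorted(wed(x_test, v, means[g]) for v in vecs)
        group_dist[g] = mean(dists[:k])
    return min(group_dist, key=group_dist.get) == tmb_label

def metrics(tp, tn, fp, fn):
    """Sensitivity, specificity, accuracy and MCC from confusion-matrix counts."""
    denom = sqrt((tp + fn) * (tp + fp) * (tn + fp) * (tn + fn))
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "mcc": (tp * tn - fp * fn) / denom if denom else 0.0,
    }

# Toy usage with made-up sequences (not data from the study).
toy_training = {
    "TMB": [composition("GYGYGYNN"), composition("GYNGYN")],
    "Cytoplasmic": [composition("LLKKEEAA"), composition("LKEALKEA")],
}
print(is_tmb(composition("GYGNGY"), toy_training, k=2))
print(metrics(tp=8, tn=90, fp=10, fn=2))
```

Because each group distance averages only the k nearest training proteins, a group with many members does not dominate simply by size; with fewer than k members the sketch averages whatever distances are available.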
3. Results
3.1. The Proposed Method can Distinguish Between TMB and Non-TMB Proteins
We developed a WED for measuring the distance between a protein and a group, and a K-NN method for classifying proteins into
Table 1
Method                                   Accuracy (%)   MCC     Sensitivity (%)   Specificity (%)
K-NN (single sequence) a                 91.5           0.633   64.5              95.8
K-NN (homologous sequences) b            94.4           0.757   74.6              97.6
Similarity search c                      75.4           0.439   86.4              73.6
K-NN (19 residues + 24 di-peptides) d    97.1           0.876   86.4              98.8
a For each protein, only the protein itself was used to calculate residue composition. Twenty amino acids were used to calculate the weighted Euclidean distances.
b For each protein, 50 homologous proteins and itself were used to calculate residue composition. Twenty amino acids were used to calculate the weighted Euclidean distances.
c Predictions based on sequence similarity.
d For each protein, 50 homologous proteins and itself were used to calculate residue composition. Nineteen residues and 24 di-peptides were used to calculate weighted Euclidean distances.
TMB and non-TMB classes (Section 2.2). The method was evaluated using a fivefold cross-validation. The results (Table 1, row 2) show that the method achieves 91.5% overall accuracy with 0.633 MCC; 64.5% of the TMB proteins (sensitivity) and 95.8% of the non-TMB proteins (specificity) are correctly identified. Then, for each protein, the BLAST program (Altschul et al., 1997) was used to search for homologous sequences in the NCBI non-redundant database using threshold E = 0.0001. The fifty best hits from the returned result and the query protein were used to calculate the residue composition for the query protein. This new residue composition was used to calculate WEDs. Proteins were classified based on the new WEDs. Comparisons (Table 1, rows 2 and 3) show that including homologous information improves the performance considerably: the accuracy increases to 94.4%; MCC reaches 0.757; 74.6% of the TMB proteins (sensitivity) and 97.6% of the non-TMB proteins (specificity) are correctly identified. Is it possible that a test protein is correctly recognized only because it shares some homologous sequences with some proteins in the training set? To answer this question, we repeated the experiments while ensuring that the homologous sets of any two proteins do not overlap. The results show that adding this requirement does not have noticeable effects on the performance. How good is the performance if predictions are made solely based on sequence similarity? To answer this question, we classified a test protein into the class of the protein from the training set that shares the highest similarity with it. Table 1 (row 4) shows that this approach only achieves 75.4% accuracy with 0.439 MCC, 86.4% sensitivity and 73.6% specificity. In comparison, the K-NN method with homologous sequences achieves 94.4% accuracy and 0.757 MCC. We then used a greedy approach to search for a combination of residues and di-peptides that are useful for predicting TMB proteins (Section 2.3).
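The two-phase greedy search of Section 2.3 can be sketched as follows. This is an illustrative sketch only: `score` is a hypothetical callback (for example, a cross-validated MCC for a given feature list), and the tie-handling is an assumption, since the text does not say how ties are treated. Here a removal is accepted when the score does not drop, and an addition is accepted only when the score strictly improves.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]

def greedy_select(score, residues=AMINO_ACIDS, dipeptides=DIPEPTIDES):
    """Greedy feature search.

    score(features) -> float, higher is better (hypothetical callback).
    Returns the selected feature list and its score.
    """
    selected = list(residues)
    best = score(selected)

    # Phase 1: remove residues one at a time while the score does not drop.
    while len(selected) > 1:
        trials = [(score([f for f in selected if f != r]), r) for r in selected]
        top, drop = max(trials)
        if top < best:  # every removal would reduce performance
            break
        selected.remove(drop)
        best = top

    # Phase 2: add di-peptides one at a time while the score improves.
    remaining = list(dipeptides)
    while remaining:
        trials = [(score(selected + [d]), d) for d in remaining]
        top, add = max(trials)
        if top <= best:  # assumption: require a strict improvement
            break
        selected.append(add)
        remaining.remove(add)
        best = top

    return selected, best

# Toy score (made up for illustration): reward a few "useful" features
# and lightly penalise every other feature.
USEFUL = {"G", "Y", "N", "GY"}

def toy_score(features):
    good = sum(1 for f in features if f in USEFUL)
    bad = sum(1 for f in features if f not in USEFUL)
    return good - 0.1 * bad

features, value = greedy_select(toy_score)
print(sorted(features), value)
```

Each step evaluates every candidate removal or addition, so the cost is one call to `score` per candidate per step; with 20 residues and 400 di-peptides this stays tractable when each evaluation is a cross-validation run over a small dataset.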
In the end, 19 residues and 24 di-peptides were chosen. The compositions of these residues and di-peptides were used to calculate WEDs. Proteins were classified based on the new WEDs. The results (Table 1, row 5) show that this strategy improves the performance, reaching 97.1% accuracy and 0.876 MCC.
3.2. Comparisons with Other Methods
As discussed in Baldi et al. (2000), in a two-class classification, if the numbers of examples in the two classes are not equal, MCC is a better measure for evaluating classification methods. In this
Table 2 Comparisons with other methods
Method                                        Accuracy (%)   MCC     Sensitivity (%)   Specificity (%)
K-NN a                                        97.1           0.876   86.4              98.8
BOMP (with BLAST search) b                    95.0           0.787   79.8              97.4
TMB-Hunt (with evolutionary information) c    93.7           0.747   81.5              95.7
PRED-TMBB d                                   64.3           0.342   89.1              60.4
PROFtmb e                                     92.6           0.684   71.4              96.0
a The method proposed in this study.
b Method developed by Berven et al. (2004). When BLAST search is selected, BOMP can achieve better performance. The results reported here were achieved with BLAST selected.
c Method developed by Garrow et al. (2005a,b). When evolutionary information is used, TMB-Hunt can achieve better performance. The results reported here were achieved with evolutionary information.
d Method developed by Bagos et al. (2004). Three decoding methods are provided on the web server. The posterior decoding was reported to achieve the best performance in Bagos et al. (2005). The results reported here were achieved using posterior decoding.
e Method developed by Bigelow and Rost (2006). Z ≥ 10 and Z ≥ 6 are suggested as the prediction cutoff on the PROFtmb website. We tried both values. PROFtmb achieved better performance with Z ≥ 6 as threshold. The results reported here were achieved with Z ≥ 6.
study, the numbers of examples in the two classes (TMB and non-TMB) are not equal. Therefore, we will use MCC as the primary measure to compare different methods. The datasets used in this study were submitted to the servers of BOMP, TMB-Hunt, PRED-TMBB and PROFtmb. The returned results were compared with the results achieved by the K-NN method. Table 2 shows that the K-NN method achieves the highest MCC (0.876), accuracy and specificity among the methods compared, although PRED-TMBB has a higher sensitivity (89.1% versus 86.4%). It is worth pointing out that the datasets used in the current study are likely to overlap with the datasets that were used to train the BOMP, TMB-Hunt, PRED-TMBB and PROFtmb servers. Thus, when we evaluated these methods by submitting our datasets to the web servers, the performance of these methods might have been overestimated. In contrast, our K-NN method was evaluated using a fivefold cross-validation on a dataset in which the mutual identity among the proteins is less than 25%. Remarkably, our method still outperforms the others on MCC under this condition.
3.3. Web Server
A web server was developed based on the proposed method (available at http://yanbioinformatics.cs.usu.edu:8080/TMBKNNsubmit). The server can run in two modes: with or without homologous sequences. The method runs faster when homologous sequences are not used, but it achieves more accurate predictions when they are used. Detailed instructions for users are available on the server.
3.4. Proteome Scanning
We scanned the proteomes of 11 Gram-negative bacteria (downloaded from http://ca.expasy.org/sprot/hamap/) using our server. The results are available at http://yanbioinformatics.cs.usu.edu:8080/TMBKNNsubmit. Here, we will analyze the predictions on the proteome of Escherichia coli in detail, since this proteome is relatively well studied compared with the others. The E. coli proteome consists of 4319 proteins. One hundred and forty-four of them were predicted to be TMB proteins by the K-NN method.
Among these 144 hits, 12 are found in the TMB dataset which was used to train the server, 49 proteins are annotated as “outer membrane proteins” in SwissProt, and 15 share very high similarity with some TMB proteins in the training dataset (E < 0.0001 in BLAST search). Thus, we have high confidence that these 76 proteins are true positives. Besides these true positives, 22 proteins are annotated with “membrane”, “Cell membrane” and “multi-pass membrane protein” in SwissProt. Only 1 of these 22 proteins is predicted to be a transmembrane α-helical protein by both TMHMM (Krogh et al., 2001) and PSORTb (Gardy et al., 2005). Thus, most of these proteins are likely TMB proteins. Among the remaining 46 proteins, 27 are annotated with subcellular locations other than outer membranes. Thus, these 27 proteins are false positives. The remaining 19 proteins may suggest new TMB proteins that have not been previously discovered.
4. Summary
We start with a K-NN method that identifies TMB proteins using a WED calculated from the composition of 20 amino acids. Then, the method is improved by including homologous sequences and considering the composition of di-peptides. The final method achieves an accuracy of 97.1%, with 0.876 MCC, 86.4% sensitivity and 98.8% specificity. Comparisons show that our method outperforms previously published methods. In addition to its superior performance, the proposed method is simple and fast. Thus, the proposed method can be easily applied at a proteomic scale.
References
Altschul, S., Madden, T., Schaffer, A., Zhang, J., Zhang, Z., Miller, W., Lipman, D., 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucl. Acids Res. 25, 3389–3402. Bagos, P., Liakopoulos, T., Hamodrakas, S., 2005. Evaluation of methods for predicting the topology of beta-barrel outer membrane proteins and a consensus prediction method. BMC Bioinformatics 6, 7. Bagos, P.G., Liakopoulos, T.D., Spyropoulos, I.C., Hamodrakas, S.J., 2004. PRED-TMBB: a web server for predicting the topology of beta-barrel outer membrane proteins. Nucl. Acids Res. 32, W400–W404.
Baldi, P., Brunak, S., Chauvin, Y., Andersen, C.A.F., 2000. Assessing the accuracy of prediction algorithms for classification: an overview. Bioinformatics 16, 412–424. Berven, F.S., Flikka, K., Jensen, H.B., Eidhammer, I., 2004. BOMP: a program to predict integral beta-barrel outer membrane proteins encoded within genomes of Gram-negative bacteria. Nucl. Acids Res. 32, W394–W399. Bigelow, H., Rost, B., 2006. PROFtmb: a web server for predicting bacterial transmembrane beta barrel proteins. Nucl. Acids Res. 34, W186–W188. Gardy, J.L., Laird, M.R., Chen, F., Rey, S., Walsh, C.J., Ester, M., Brinkman, F.S.L., 2005. PSORTb v.2.0: expanded prediction of bacterial protein subcellular localization and insights gained from comparative proteome analysis. Bioinformatics 21, 617–623. Garrow, A., Agnew, A., Westhead, D., 2005a. TMB-Hunt: an amino acid composition based method to screen proteomes for beta-barrel transmembrane proteins. BMC Bioinformatics 6, 56. Garrow, A.G., Agnew, A., Westhead, D.R., 2005b. TMB-Hunt: a web server to screen sequence sets for transmembrane beta-barrel proteins. Nucl. Acids Res. 33, W188–W192. Gnanasekaran, T.V., Peri, S., Arockiasamy, A., Krishnaswamy, S., 2000. Profiles from structure based sequence alignment of porins can identify beta stranded integral membrane proteins. Bioinformatics 16, 839–842. Koebnik, R., Locher, K.P., Van Gelder, P., 2000. Structure and function of bacterial outer membrane proteins: barrels in a nutshell. Mol. Microbiol. 37, 239–253. Krogh, A., Larsson, B., Heijne, G.v., Sonnhammer, E.L.L., 2001. Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes. J. Mol. Biol. 305, 567–580. Liu, Q., Zhu, Y., Wang, B., Li, Y., 2003. Identification of beta-barrel membrane proteins based on amino acid composition properties and predicted secondary structure. Comp. Biol. Chem. 27, 355–361. Murzin, A.G., Brenner, S.E., Hubbard, T., Chothia, C., 1995. 
SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol. 247, 536–540. Rey, S., Acab, M., Gardy, J.L., Laird, M.R., deFays, K., Lambert, C., Brinkman, F.S.L., 2005. PSORTdb: a protein subcellular localization database for bacteria. Nucl. Acids Res. 33, D164–D168. Saier Jr., M.H., Tran, C.V., Barabote, R.D., 2006. TCDB: the transporter classification database for membrane transport protein analyses and information. Nucl. Acids Res. 34, D181–D186.
Schleiff, E., Eichacker, L.A., Eckart, K., Becker, T., Mirus, O., Stahl, T., Soll, J., 2003. Prediction of the plant beta-barrel proteome: a case study of the chloroplast outer envelope. Protein Sci. 12, 748–759. Schulz, G.E., 2000. Beta-barrel membrane proteins. Curr. Opin. Struct. Biol. 10, 443–447. Wimley, W.C., 2003. The versatile beta-barrel membrane protein. Curr. Opin. Struct. Biol. 13, 404–411.
Witten, I.H., Frank, E., 2005. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, San Francisco. Zhai, Y., Saier Jr., M.H., 2002. The beta-barrel finder (BBF) program, allowing identification of outer membrane beta-barrel proteins encoded within prokaryotic genomes. Protein Sci. 11, 2196–2207.