Scoring hidden Markov models to discriminate β-barrel membrane proteins


Computational Biology and Chemistry 28 (2004) 189–194

Yong Deng a,1, Qi Liu b,1, Yi-Xue Li b,∗,1

a School of Electronics & Information Technology, Shanghai Jiao Tong University, Shanghai 200030, PR China
b Bioinformation Center, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai 200031, PR China

Received 4 February 2004; received in revised form 26 February 2004; accepted 26 February 2004

Abstract

A new method is presented for the identification of β-barrel membrane proteins. It is based on a hidden Markov model (HMM) whose architecture obeys the construction principles of these proteins. Once the HMM is trained, a log-odds score relative to a null model is used to discriminate β-barrel membrane proteins from other proteins. The method achieves false positive and false negative rates of about 10% in a six-fold cross-validation procedure. The results compare favorably with existing methods. The method is proposed as a valuable tool for quickly scanning the proteomes of entirely sequenced organisms for β-barrel membrane proteins.

© 2004 Elsevier Ltd. All rights reserved.

Keywords: β-Barrel membrane proteins; Hidden Markov model; Log-odds score

1. Introduction

Computational methods for identifying potential integral membrane proteins have become increasingly important as a result of whole-genome sequencing projects. Of the two known structural motifs for membrane proteins, β-barrels appear to be more difficult to identify than α-helices, for several reasons. First, the relative paucity of known structures has hindered the development of prediction methods. Second, transmembrane β-strands are amphipathic and thus more difficult to find than runs of consecutive hydrophobic residues. Third, many soluble proteins also contain amphipathic β-sheets similar to transmembrane β-strands. Although neural network-based methods achieve reasonable accuracy in predicting the transmembrane regions of β-barrels (Diederichs et al., 1998; Jacoboni et al., 2001), they distinguish poorly between membrane and soluble proteins. Recently, Wimley (2002) developed an algorithm to identify β-barrel membrane proteins based on information gathered from 15 proteins of this type with known structures. Although good performance is reported when it is applied to screen genomes, the reported results indicate only a low false positive rate and give little information about the false negative rate. That is, it is not clear whether the method, developed on 15 β-barrel membrane proteins, performs equally well on new ones.

β-barrel membrane proteins are found in the outer membranes (OMs) of bacteria and are likely present in the OMs of mitochondria and chloroplasts. This constrained environment limits the ways in which β-strands can be arranged. Therefore, it is possible to extract some simple construction principles of β-barrel membrane proteins even from a small dataset of known structures (Bowie, 2000). Schulz (2000) summarized the main structural features shared by all known β-barrel membrane proteins. First, the number of β-strands is even and the N and C termini are at the periplasmic side. Second, the connections between strands at the periplasmic side are “short turns” of a couple of residues, while connections at the external side are usually “long loops.” Third, most transmembrane β-strands follow a dyad repeat topology, in which hydrophobic and hydrophilic residues alternate between the two faces (inside and outside) of the barrel. Fourth, the barrel is flanked by two “girdles” of aromatic side chains, which interact favorably with the interface layers of the membrane. Based on these common features, we developed an HMM for predicting the topology of β-barrel membrane proteins (Liu et al., 2003). That HMM incorporates these features (the dyad repeat topology, strand and loop lengths, and the bands of aromatic residues) into one model and achieves a prediction power comparable to or exceeding that of previous methods. Based on the same HMM architecture, this paper presents a method to identify β-barrel membrane proteins. By comparing the probability that a sequence is generated by the β-barrel membrane protein model (the trained HMM) with the probability that it is generated by the non-β-barrel membrane protein model (the null model), the method achieves good performance in identifying β-barrel membrane proteins.

∗ Corresponding author. Tel.: +86-21-62932851; fax: +86-21-62932851. E-mail address: [email protected] (Y.-X. Li).
1 These authors contributed equally.
1476-9271/$ – see front matter © 2004 Elsevier Ltd. All rights reserved. doi:10.1016/j.compbiolchem.2004.02.004

2. Materials and methods

2.1. Datasets

We compiled a set of 132 β-barrel membrane proteins (Table 1), most of which (129) were extracted from the “β-barrel porins” subclass of the Transport Protein Database (TC-DB) (http://tcdb.ucsd.edu/tcdb/tcclass.php). The TC-DB details a comprehensive classification system for membrane transport proteins known as the Transport Commission (TC) system (Saier, 2000). Most proteins in this subclass are reported, with experimental evidence, to be located in the outer membrane and to contain a high fraction of β-strand secondary structure; they are therefore considered to be β-barrel membrane proteins with high probability. In addition, three β-barrel membrane proteins (OMPX_ECOLI, OMPT_ECOLI, PA1_ECOLI) with available three-dimensional (3D) structures that are not included in TC-DB were also collected. A negative set of 1736 soluble proteins was extracted from the Protein Data Bank (Berman et al., 2000), using a 25% sequence identity cut-off.

Table 1
The set of β-barrel membrane proteins

Protein code: OMPF_ECOLI, PHOE_ECOLI, OMPC_ECOLI, NMPC_ECOLI, OMPF_XENNE, OMPU_VIBCH, OM25_HAEIN, OMP_BORPE, PORI_BPPA2, OMA_NEIGO, Q05520, OM32_COMAC, OM1E_CHLPS, OMP2_CHLTR, LAMB_ECOLI, SCRY_SALTY, YIEC_ECOLI, Q44667, OM3A_RHILV, PORP_PSEAE, PORO_PSEAE, OMPA_ECOLI, PORF_PSEAE, Y899_MYCTU, Q52677, P95467, PORI_RHOBL, POR1_YEAST, POR3_MOUSE, POR1_BOVIN, POR1_WHEAT, PORI_PEA, SLAP_BACST, O65943, FADL_ECOLI, Q9RBI, TSX_ECOLI, FAED_ECOLI, PAPC_ECOLI, FIMC_BORPE, AIDA_ECOLI, Q52298, PERT_BORPE, Q45343, Q45340, IGA_NEIGO, HAP_HAEIN, P77070, Q47692, PRTS_SERMA, VAC2_HELPY, Q9ADU7, ALGE_PSEAE, FEPA_ECOLI, FECA_ECOLI, FHUE_ECOLI, FHUA_ECOLI, O54507, CIRA_ECOLI, PFEA_PSEAE, Q9LAE4, FPVA_PSEAE, TB11_NEIMB, Q51104, HEMR_YEREN, HPUB_NEIMC, O87343, Q9ALL8, BTUB_ECOLI, RHTA_RHIME, P95494, P72121, Q00620, P72473, Q45780, Q93TH9, VIUA_VIBCH, FYUA_YEREN, P77076, O05123, Q02728, KSD1_ECOLI, CTRA_NEIMA, WZA_ECOLI, TOLC_ECOLI, PRTF_ERWCH, CNRC_ALCEU, CZCC_ALCEU, CYAE_BORPE, NODT_RHILT, FUSA_BURCE, Q45083, SILC_SALTY, CUSC_ECOLI, PORB_PSEAE, HLYB_SERMA, HLYB_PROMI, FHAC_BORPE, HXB2_HAEIN, OMPG_ECOLI, P76773, GSPD_KLEPN, GSPD_PSEAE, PILQ_PSEAE, COME_HAEIN, YSCC_YEREN, INVG_SALTY, HRPH_PSESY, VG4_BPF1, NOLW_RHISN, Q9S142, O86994, Q59991, Q9RLP7, PORD_PSEAE, Q51510, O24779, Q9XC30, O84986, Q48391, Q9ZLD5, O49929, Q9SM57, Q41050, Q9ZAN1, Q47905, Q9KK91, OMB_NEISI, Q934G3, OMPT_ECOLI, OMPA_ECOLI, PA1_ECOLI

Superfamily: The General Bacterial Porin Family; The Chlamydial Porin Family; The Sugar Porin Family; The Brucella-Rhizobium Porin Family; The Pseudomonas OprP Porin Family; The OmpA-OmpF Porin Family; The Rhodobacter PorCa Porin Family; The Mitochondrial and Plastid Porin Family; The FadL Outer Membrane Protein Family; The Nucleoside-specific Channel-forming Outer; The Outer Membrane Fimbrial Usher Porin; The Autotransporter Family; The Alginate Export Porin Family; The Outer Membrane Receptor Family; The Raffinose Porin Family; The Short Chain Amide and Urea Porin Family; The Outer Membrane Auxiliary Protein Family; The Outer Membrane Factor Family; The Glucose-selective OprB Porin Family; The Two-Partner Secretion Family; The OmpG Porin Family; The Outer Bacterial Membrane Secretin Family; The Cyanobacterial Porin Family; The Mycobacterial Porin Family; The Outer Membrane Porin Family; The Cyclodextrin Porin Family; The Helicobacter Outer Membrane Porin Family; OEP24 Family; OEP21 Family; OEP16 Family; The Campylobacter jejuni Major Outer Membrane Porin Family; The Fusobacterial Outer Membrane Porin Family; The Vibrio Chitoporin/Neisserial Porin Family; The Oligogalacturonate-specific Porin Family; Not in TC-DB

2.2. Cross-validation

The discrimination accuracy is estimated by six-fold cross-validation. The set of 132 β-barrel membrane proteins is divided into six subsets of equal size, such that any two sequences from different subsets share less than 25% identity. Each time, one of the six subsets is used as the test set and the other five are combined to train the HMM. The procedure is repeated six times, so that every sequence is in a test set exactly once and in a training set five times. This gives an estimate of the accuracy of our method on sequences that were not used for training. The false negative rate is calculated by summing the false negatives of each model on its corresponding test set and dividing by the size of the entire positive set (132 proteins). The false positive rate is obtained by averaging the results of all six models on the entire negative set (1736 proteins).

2.3. HMM training

The architecture of the HMM is the same as the one built to predict transmembrane regions (Liu et al., 2003). Given a set of training sequences s(1), s(2), ..., s(n), the model is estimated by maximizing the overall likelihood of those sequences, Prob(sequences|model), which is simply the product of the probabilities calculated for each sequence:

\[ \mathrm{Prob}(\mathrm{sequences} \mid \mathrm{model}) = \prod_{i=1}^{n} \mathrm{Prob}(s(i) \mid \mathrm{model}) \tag{1} \]

where Prob(s(i)|model), the probability that the model generates sequence s(i), is a sum over all possible state paths that could produce that sequence:

\[ \mathrm{Prob}(s(i) \mid \mathrm{model}) = \sum_{\mathrm{paths}} \mathrm{Prob}(s(i), \mathrm{path} \mid \mathrm{model}) \tag{2} \]

The Baum–Welch algorithm, as described by Rabiner (1989), is used to find this model. The algorithm can be viewed as an iterative adaptation of the model to fit the training sequences. Each iteration increases the overall likelihood Prob(sequences|model), so the algorithm is expected to converge toward a model that assigns maximal probability to the training sequences.

However, the Baum–Welch algorithm is only guaranteed to find a local maximum, not the global one. To address this problem, we start the algorithm several times from different initial models. The resulting models represent different local maxima, and we pick the one with the highest probability. We choose the initial models carefully to increase the chance of terminating close to the true global maximum, taking advantage of the following knowledge: the aromatic amino acids Phe, Tyr and Trp have the highest emission probabilities in the “G” (girdle) states; hydrophobic residues such as Val, Ile and Leu have higher (lower) emission probabilities than hydrophilic residues such as Glu, Asn and Lys in the “BO” (“BI”) states (see Liu et al. (2003) for details of these states). In addition, noise is injected into the model before each re-estimation to further mitigate the local optimum problem. We initially add a large amount of noise and then decrease the noise level linearly to zero over 15 iterations.

2.4. Scoring sequences

If the trained HMM is a good model for β-barrel membrane proteins, sequences that belong to this class will receive a higher probability from the trained HMM (H1) than from the null model (H0), i.e. P(s|H1) > P(s|H0). Since the raw probability P(s|H1) depends strongly on the length of the sequence s, it cannot be used directly to classify the sequence. The log-odds score (Eq. (3)) indicates whether the probability that s is generated by the model H1 is larger than the probability that it is generated by the null model H0; it is therefore used instead of P(s|H1) to predict whether a new protein is a β-barrel membrane protein.

\[ \mathrm{score}(s) = \log \frac{P(s \mid H_1)}{P(s \mid H_0)} \tag{3} \]

Three different null models are investigated here:

(1) Flat: all amino acids are equally likely,

\[ p_A = p_C = \cdots = p_Y = \frac{1}{20} \]

(2) Background: the amino acid distribution is calculated from all proteins in SWISS-PROT.
(3) Train counts: the amino acid distribution is calculated from the training set.
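As a minimal sketch of Eqs. (2) and (3), the sum over state paths can be computed with the standard forward recursion and then compared with a null model that emits residues independently. Everything named below is an illustrative assumption: the function names, the use of natural logarithms, the pseudocount in `composition`, and above all the two-state `TOY_HMM`, which is a hypothetical stand-in and not the architecture of Liu et al. (2003). The sketch also assumes an HMM without an explicit end state.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) over a list of log-values."""
    m = max(values)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(v - m) for v in values))


def forward_log_prob(seq, log_start, log_trans, log_emit):
    """log P(seq | HMM): the sum over all state paths of Eq. (2),
    computed by the forward recursion in log space."""
    n_states = len(log_start)
    alpha = [log_start[k] + log_emit[k][seq[0]] for k in range(n_states)]
    for symbol in seq[1:]:
        alpha = [
            log_emit[k][symbol]
            + logsumexp([alpha[j] + log_trans[j][k] for j in range(n_states)])
            for k in range(n_states)
        ]
    return logsumexp(alpha)


def null_log_prob(seq, freqs):
    """log P(seq | H0) for a null model with independent residues."""
    return sum(math.log(freqs[a]) for a in seq)


def log_odds(seq, hmm, null_freqs):
    """Eq. (3): log P(seq | H1) - log P(seq | H0)."""
    return forward_log_prob(seq, *hmm) - null_log_prob(seq, null_freqs)


def composition(sequences, pseudocount=1.0):
    """Residue frequencies counted from a set of sequences; usable for a
    'background' or 'train counts' style null model (pseudocount assumed)."""
    counts = {a: pseudocount for a in AMINO_ACIDS}
    for s in sequences:
        for a in s:
            counts[a] += 1
    total = sum(counts.values())
    return {a: c / total for a, c in counts.items()}


# Flat null model: every amino acid has probability 1/20.
FLAT = {a: 1.0 / 20 for a in AMINO_ACIDS}


def _emit(favoured, weight):
    """Log emission table giving `weight` of the mass to `favoured` residues."""
    rest = (1.0 - weight) / (20 - len(favoured))
    return {
        a: math.log(weight / len(favoured) if a in favoured else rest)
        for a in AMINO_ACIDS
    }


# Hypothetical toy model: state 0 prefers hydrophobic/aromatic residues,
# state 1 prefers hydrophilic ones, and the states tend to alternate.
TOY_HMM = (
    [math.log(0.5), math.log(0.5)],
    [[math.log(0.1), math.log(0.9)], [math.log(0.9), math.log(0.1)]],
    [_emit("VILFWY", 0.7), _emit("DEKNQRST", 0.7)],
)

score = log_odds("VKIEVTLNYR", TOY_HMM, FLAT)
print(f"log-odds vs. flat null: {score:.2f}")
```

Because the null model factorizes over residues, log P(s|H0) grows with sequence length in the same way as log P(s|H1), which is why the difference is less length-dependent than the raw probability. Swapping `FLAT` for `composition(some_sequences)` gives the other two kinds of null model.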

3. Results and discussion

Discrimination between β-barrel membrane proteins and soluble proteins

The discrimination accuracy is estimated using the cross-validation models (see Section 2). Fig. 1 shows the false positive rate and false negative rate as a function of the threshold score, based on the background null model. At a threshold of 0, false positive and false negative rates of about 10% are obtained, i.e. 157 false positives and 13 false negatives.

Fig. 1. The false positive rate and false negative rate as a function of threshold score setting based on background null model.

The thirteen β-barrel membrane proteins misclassified as soluble ones are listed in Table 2. Y899_MYCTU and Q9ADU7 are hypothetical proteins. Q41050 belongs to a class of proteins located in the chloroplastic outer envelope membrane. It forms a cation-selective high-conductance channel with permeability to amines and amino acids. Although it functions analogously to the outer membrane porins, its putative structure would be distinct from the classical porin structure, which is made up of amphiphilic β-strands. Secondary structure prediction and the CD spectra of the reconstituted protein indicate that Q41050 contains both β-barrel and α-helical elements (Pohlmeyer et al., 1997; Steinkamp et al., 2000), which possibly leads to its misclassification. All the other incorrectly predicted proteins belong to two families, the Outer Bacterial Membrane Secretin Family and the Outer Membrane Auxiliary (OMA) Protein Family. Moreover, one member of the OMA family, WzaK30, forms oligomeric ring-like structures resembling the secretins of type II and III protein translocation systems (Drummelsmith and Whitfield, 2000). The negative scores obtained by these proteins suggest that they may follow structural principles different from those summarized by Schulz (2000).
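The error rates quoted for a threshold of 0 follow the definitions of Section 2.2. The sketch below is one way to compute them from per-fold scores. It assumes that a sequence is called positive when its score is strictly greater than the threshold (the text does not state how ties are handled), and the function and argument names are hypothetical. The last two lines only redo the arithmetic on the counts reported in the text.

```python
def cross_validated_rates(test_scores_by_fold, negative_scores_by_fold, threshold=0.0):
    """Return (false_negative_rate, false_positive_rate).

    test_scores_by_fold[k]: scores model k assigns to its own held-out positives.
    negative_scores_by_fold[k]: scores model k assigns to the whole negative set.
    Assumption: a sequence is predicted positive when score > threshold.
    """
    # False negatives are summed over the folds and divided by all positives.
    n_positives = sum(len(scores) for scores in test_scores_by_fold)
    false_negatives = sum(
        1 for scores in test_scores_by_fold for x in scores if x <= threshold
    )
    # The false positive rate is averaged over the models.
    fp_rates = [
        sum(1 for x in scores if x > threshold) / len(scores)
        for scores in negative_scores_by_fold
    ]
    return false_negatives / n_positives, sum(fp_rates) / len(fp_rates)


# Arithmetic check on the counts reported at threshold 0.
print(f"false negative rate: {13 / 132:.3f}")    # 0.098
print(f"false positive rate: {157 / 1736:.3f}")  # 0.090
```

Both values round to the "about 10%" stated in the text.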

Fig. 2. The false positive rate and false negative rate as a function of β-barrel score threshold.

4. Comparison with other methods

Recently, Wimley (2002) developed a method to identify β-barrel membrane proteins. The method derives the relative abundance of each amino acid on the barrel exterior and interior from the architecture of 15 known structures. Based on this information, a single β-barrel score is calculated for each protein, and proteins with high scores are predicted to be β-barrel membrane proteins. Although this method seems to perform well when applied to screen the genomes of Escherichia coli and Pseudomonas aeruginosa, the reported results indicate only the false positive rate and give little information about the false negative rate. To estimate its discrimination accuracy, we tested the method on the dataset used in this paper. Fig. 2 shows the false positive rate and false negative rate as a function of the β-barrel score threshold. At a threshold of 1.7 (the value used by Wimley (2002) to select possible outer membrane proteins), a low rate of false positives (about 1%) but a very high rate of false negatives (about 51%) is obtained.

Table 2
Thirteen β-barrel membrane proteins misclassified as soluble ones

Protein       Score     Comments
Y899_MYCTU    −0.33     Hypothetical protein
Q9ADU7        −10.03    Hypothetical protein
Q41050        −0.30     Possibly a mixed structure of α/β-folds
GSPD_PSEAE    −0.92     The Outer Bacterial Membrane Secretin (Secretin) Family
PILQ_PSEAE    −4.52     The Outer Bacterial Membrane Secretin (Secretin) Family
COME_HAEIN    −15.97    The Outer Bacterial Membrane Secretin (Secretin) Family
YSCC_YEREN    −13.12    The Outer Bacterial Membrane Secretin (Secretin) Family
INVG_SALTY    −11.60    The Outer Bacterial Membrane Secretin (Secretin) Family
HRPH_PSESY    −10.66    The Outer Bacterial Membrane Secretin (Secretin) Family
NOLW_RHISN    −7.04     The Outer Bacterial Membrane Secretin (Secretin) Family
Q02728        −3.07     The Outer Membrane Auxiliary (OMA) Protein Family
CTRA_NEIMA    −2.53     The Outer Membrane Auxiliary (OMA) Protein Family
WZA_ECOLI     −2.03     The Outer Membrane Auxiliary (OMA) Protein Family


Fig. 3. The number of false negatives vs. the number of false positives for different thresholds.
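A curve of the kind shown in Fig. 3 can be traced by sweeping the threshold over the observed scores and counting the errors at each setting. The sketch below is only illustrative: the function name is hypothetical, the scores are made-up toy values, and it pools all scores into two lists rather than reproducing the per-fold bookkeeping of the cross-validation. As an assumption, a sequence is called positive when its score is strictly greater than the threshold.

```python
def tradeoff_curve(positive_scores, negative_scores):
    """Return a list of (threshold, false_positives, false_negatives),
    using every distinct observed score as a candidate threshold.
    Assumption: a sequence is predicted positive when score > threshold."""
    thresholds = sorted(set(positive_scores) | set(negative_scores))
    curve = []
    for t in thresholds:
        false_positives = sum(1 for x in negative_scores if x > t)
        false_negatives = sum(1 for x in positive_scores if x <= t)
        curve.append((t, false_positives, false_negatives))
    return curve


# Toy scores, not data from the paper.
toy_positives = [5.0, 2.0, -1.0]
toy_negatives = [3.0, -2.0, -4.0]
for threshold, fp, fn in tradeoff_curve(toy_positives, toy_negatives):
    print(f"threshold={threshold:+.1f}  false positives={fp}  false negatives={fn}")
```

Raising the threshold can only lower the false positive count and raise the false negative count, so a curve that turns vertical marks positives that no acceptable threshold recovers.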

We also compare our results, obtained with the different null models, with those of the “barrel score” method developed by Wimley (2002). Fig. 3 plots the number of false negatives versus the number of false positives for different thresholds. As can be seen from the figure, the best trade-off between false negatives and false positives is obtained with the flat and background null models. The flat and background curves become vertical at 3–10 false negatives, whereas the “barrel score” curve becomes vertical at 20 false negatives and the train counts curve at 35–40 false negatives, indicating that more sequences cannot be found regardless of the number of false positives. For a low number of false positives (about four), there are about 48–54 false negatives for the flat, background and train counts null models, whereas there are about 120 false negatives for the “barrel score” method. Our HMM-based method with the flat or background null model thus provides better discrimination accuracy than the “barrel score” approach.

The poor performance of the “barrel score” method is possibly due to the following reasons. First, the set of 15 known structures is too small to yield statistically significant information. Second, these 15 structures are not representative of all β-barrel membrane proteins, which leads to poor performance on new ones. Third, the assumption that a prototypical β-hairpin is composed of two 10-residue β-strands separated by a 5-residue loop is fairly crude. In contrast, our HMM-based method incorporates amino acid distribution, loop length and strand length into one model and gains information from a much larger set. Therefore, our method is more powerful than the “barrel score” method in identifying β-barrel membrane proteins.

Zhai and Saier (2002) designed the β-barrel finder (BBF) program based on hydropathy, amphipathicity and secondary structure analysis of six β-barrel membrane proteins of known 3D structure. Most transmembrane β-strands are characterized by peaks of both hydrophobicity and amphipathicity. This is in agreement with the common features summarized by Schulz (2000): an increased proportion of hydrophobic residues, and alternating hydrophilic and hydrophobic residues, in the transmembrane region. The BBF program, applied to E. coli, retrieves most of the outer membrane proteins with established β-barrel structures as well as many probable outer membrane proteins. However, since the BBF program combines three preexisting programs, its performance is limited by predictive inaccuracies and unusual positions of transmembrane signal segments, as well as by inadequate secondary structure predictions.

Martelli et al. (2002) developed an HMM on 12 non-redundant β-barrel membrane proteins of known 3D structure and used the normalized probability of each sequence being emitted by the model to discriminate β-barrel membrane proteins. Compared with the 10% false negative rate obtained by our method at a 10% false positive rate, the method of Martelli et al. gave a 14% false negative rate. The good performance achieved by these two methods suggests that HMM-based methods compare favorably with other methods because of their capacity to capture the basic features of most β-barrel membrane proteins.

4.1. Screening of genomic data

The model trained on all 132 sequences is applied to screen the genome of P. aeruginosa for β-barrel membrane proteins. The background null model is used to obtain the log-odds score. To increase the positive predictive value (the probability that a positive test result is a true positive), a higher threshold score of 10 is used. As a result, 141 proteins are selected. Of these 141 proteins, 76 are known or proposed outer membrane proteins; 52 are unidentified open reading frames or hypothetical proteins; 4 are not β-barrel membrane proteins (false positives); and 9 are flagellin, flagellar hook proteins, flagellar basal-body rod proteins, or fimbrial or adhesive proteins that reside in or pass through the outer membrane.
Recently, some of the 52 hypothetical proteins have been proposed to be outer membrane porins that function in the export of proteins (Yen et al., 2002). For example, hypothetical protein PA4652 belongs to the FUP family, PA0328 to the AT family, and PA0040, PA0692, PA2463, PA2543, PA3339 and PA4540 to the TPS family. Compared with the results obtained by the “barrel score” method (listed in Table 3), our HMM-based method detects more β-barrel membrane proteins at almost the same number of false positives.

Table 3
High-scoring proteins in genomic data of Pseudomonas aeruginosa

Category                                                      HMM-based method    Barrel score method
Known or proposed outer membrane protein                      76                  50
Unidentified or hypothetical proteins                         52                  65
Proteins that reside in or pass through the outer membrane    9                   5
Other proteins (false positives)                              4                   5
Total                                                         141                 125
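The counts in Table 3 can be checked with a few lines of arithmetic. The sketch below only restates the numbers given in the table; the dictionary keys and the function name are hypothetical labels. It does not estimate the positive predictive value, since the status of the unidentified or hypothetical proteins is not resolved in the text.

```python
# Counts copied from Table 3; the key names are illustrative labels.
TABLE3 = {
    "HMM-based method": {
        "known_or_proposed_omp": 76,
        "unidentified_or_hypothetical": 52,
        "reside_in_or_pass_through_om": 9,
        "false_positives": 4,
    },
    "Barrel score method": {
        "known_or_proposed_omp": 50,
        "unidentified_or_hypothetical": 65,
        "reside_in_or_pass_through_om": 5,
        "false_positives": 5,
    },
}


def summarize(counts):
    """Total number of selected proteins and the share of two categories."""
    total = sum(counts.values())
    return {
        "total": total,
        "known_fraction": counts["known_or_proposed_omp"] / total,
        "false_positive_fraction": counts["false_positives"] / total,
    }


for method, counts in TABLE3.items():
    s = summarize(counts)
    print(
        f"{method}: total={s['total']}, "
        f"known or proposed={s['known_fraction']:.3f}, "
        f"false positives={s['false_positive_fraction']:.3f}"
    )
```

The totals reproduce the 141 and 125 proteins of the table, and the HMM-based selection contains a larger share of known or proposed outer membrane proteins at a similar number of false positives.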

5. Conclusions

The identification of β-barrel membrane proteins is more difficult than that of α-helical ones. In this paper, an HMM whose architecture obeys the construction principles of β-barrel membrane proteins is trained, and a log-odds score is then used to discriminate β-barrel membrane proteins. The results on 132 β-barrel membrane proteins and 1736 soluble proteins using cross-validation models, and on the complete genome of Pseudomonas aeruginosa, demonstrate that the method can serve as a valuable tool for scanning genomic data for β-barrel membrane proteins.

Acknowledgements

The work is partially supported by Shanghai Nature Foundation (no. 03ZR14065). The authors would like to thank the anonymous referees for their helpful comments on the paper.

References

Berman, H.M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T.N., Weissig, H., Shindyalov, I.N., Bourne, P.E., 2000. The Protein Data Bank. Nucleic Acids Res. 28, 235–242.
Bowie, J.U., 2000. Are we destined to repeat history. Curr. Opin. Struct. Biol. 10, 435–437.
Diederichs, K., Freigang, J., Umhau, S., Zeth, K., Breed, J., 1998. Prediction by a neural network of outer membrane beta-strand protein topology. Protein Sci. 7, 2413–2420.
Drummelsmith, J., Whitfield, C., 2000. Translocation of group 1 capsular polysaccharide to the surface of Escherichia coli requires a multimeric complex in the outer membrane. EMBO J. 19, 57–66.
Jacoboni, I., Martelli, P.L., Fariselli, P., De Pinto, V., Casadio, R., 2001. Prediction of the transmembrane regions of beta-barrel membrane proteins with a neural network-based predictor. Protein Sci. 10, 779–787.
Liu, Q., Zhu, Y.S., Wang, B.H., Li, Y.X., 2003. A HMM-based method to predict the transmembrane regions of beta-barrel membrane proteins. Comput. Biol. Chem. 27, 69–76.
Martelli, P.L., Fariselli, P., Krogh, A., Casadio, R., 2002. A sequence-profile-based HMM for predicting and discriminating beta barrel membrane proteins. Bioinformatics 18 (Suppl. 1), S46–S53.
Pohlmeyer, K., Soll, J., Steinkamp, T., Hinnah, S.C., Wagner, R., 1997. Isolation and characterization of an amino acid-selective channel protein present in the chloroplastic outer envelope membrane. Proc. Natl. Acad. Sci. U.S.A. 94, 9504–9509.
Rabiner, L.R., 1989. A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE 77, 257–286.
Saier Jr., M.H., 2000. A functional-phylogenetic classification system for transmembrane solute transporters. Microbiol. Mol. Biol. Rev. 64, 354–411.
Schulz, G.E., 2000. beta-Barrel membrane proteins. Curr. Opin. Struct. Biol. 10, 443–447.
Steinkamp, T., Hill, K., Hinnah, S.C., Wagner, R., Rohl, T., Pohlmeyer, K., Soll, J., 2000. Identification of the pore-forming region of the outer chloroplast envelope protein OEP16. J. Biol. Chem. 275, 11758–11764.
Wimley, W.C., 2002. Toward genomic identification of beta-barrel membrane proteins: composition and architecture of known structures. Protein Sci. 11, 301–312.
Yen, M.R., Peabody, C.R., Partovi, S.M., Zhai, Y., Tseng, Y.H., Saier, M.H., 2002. Protein-translocating outer membrane porins of Gram-negative bacteria. Biochim. Biophys. Acta 1562, 6–31.
Zhai, Y., Saier Jr., M.H., 2002. The beta-barrel finder (BBF) program, allowing identification of outer membrane beta-barrel proteins encoded within prokaryotic genomes. Protein Sci. 11, 2196–2207.