ARTICLE IN PRESS
Journal of Theoretical Biology 238 (2006) 395–400 www.elsevier.com/locate/yjtbi
Predicting membrane protein type by functional domain composition and pseudo-amino acid composition
Yu-Dong Cai a,b,*, Kuo-Chen Chou b,c
a Biomolecular Sciences Department, University of Manchester Institute of Science & Technology, P.O. Box 88, Manchester, M60 1QD, UK
b Gordon Life Science Institute, San Diego, CA 92130, USA
c Tianjin Institute of Bioinformatics and Drug Discovery, Tianjin, China
Received 11 April 2005; received in revised form 25 May 2005; accepted 26 May 2005
Available online 25 July 2005
Abstract
Given the sequence of a protein, how can we predict whether it is a membrane protein or a non-membrane protein? If it is a membrane protein, which membrane protein type does it belong to? Since these questions bear directly on the function of an uncharacterized protein, their importance is self-evident. In particular, with the explosion of protein sequences entering databanks and the much slower progress in determining their functions by biochemical experiments, an automated method that can answer these questions quickly is highly desirable. By hybridizing the functional domain (FunD) composition and the pseudo-amino acid (PseAA) composition, a new strategy called the FunD–PseAA predictor was introduced. To test the power of the predictor, a highly non-homologous data set was constructed in which no protein has ≥25% sequence identity to any other. The overall success rate obtained with the FunD–PseAA predictor on this data set by the jackknife cross-validation test was 85% in distinguishing membrane proteins from non-membrane proteins, and 91% in identifying the membrane protein type among the following 5 categories: (1) type-1 membrane protein, (2) type-2 membrane protein, (3) multipass transmembrane protein, (4) lipid chain-anchored membrane protein, and (5) GPI-anchored membrane protein. These rates are much higher than those obtained by other methods on the same stringent data set, indicating that the FunD–PseAA predictor may become a useful high-throughput tool in bioinformatics and proteomics.
© 2005 Elsevier Ltd. All rights reserved.
Keywords: Type-1; Type-2; Multi-pass transmembrane; Lipid chain-anchored; GPI-anchored; Function-related feature; Less than 25% sequence identity; FunD–PseAA predictor
* Corresponding author. Tel.: +44 161 200 8936; fax: +44 161 236 6409. E-mail address: [email protected] (Y.-D. Cai).
0022-5193/$ - see front matter © 2005 Elsevier Ltd. All rights reserved. doi:10.1016/j.jtbi.2005.05.035
1. Introduction
Membrane-bound proteins, or membrane proteins, are generally classified into the following five types (Chou and Elrod, 1999a): (1) type-1 membrane proteins (Fig. 1a), (2) type-2 membrane proteins (Fig. 1b), (3) multipass transmembrane proteins (Fig. 1c), (4) lipid chain-anchored membrane proteins (Fig. 1d), and (5) GPI-anchored membrane proteins (Fig. 1e). The way in which a membrane protein is associated with the lipid bilayer is closely related to its function (Alberts et al., 1994; Lodish et al., 1995). Transmembrane proteins, for example, can function on both sides of the membrane or transport molecules across it, whereas proteins that function on only one side of the lipid bilayer are often associated exclusively with either the lipid monolayer or a protein domain on that side. Therefore, an automated method for identifying the type of a newly found membrane protein is highly desirable. Indeed, various studies have been conducted in this regard (Cai et al., 2003, 2004; Chou, 2000; Chou and Elrod, 1999a; Feng and Zhang, 2000; Guo, 2002; Tusnády and Simon, 2001; Wang et al., 2004, 2005). However, all these studies have the
[Figure 1 graphic: panels (a) to (e); labels mark the N- and C-termini, the extracellular or luminal side, the lipid bilayer, the cytoplasmic side, and the GPI anchor.]
Fig. 1. Schematic drawing of the five types of membrane proteins: (a) type-1 transmembrane, (b) type-2 transmembrane, (c) multipass transmembrane, (d) lipid chain-anchored membrane, and (e) GPI-anchored membrane. As the figure shows, although both type I and type II membrane proteins are single-pass transmembrane proteins, type I has a cytoplasmic C-terminus and an extracellular or luminal N-terminus for plasma membrane or organelle membrane, respectively, while the arrangement of the N- and C-termini in type II membrane proteins is just the reverse. No such distinction was drawn between the extracellular (or luminal) and cytoplasmic sides for the other three types in the current classification scheme. Reproduced from Chou (2001) with permission.
following problems that need to be addressed further. (1) The identification was confined to the case in which the query protein was already known to be a membrane protein. To make the method more practical and widely useful, the first question to consider is whether the query protein is a membrane protein or a non-membrane protein. Only if it turns out to be a membrane protein does it make sense to identify which membrane protein type it belongs to. (2) No clear cutoff on sequence identity was applied to the training data sets to avoid redundancy and homologous bias, and hence the prediction success rates reported in those studies might be overestimated. (3) Although the recently developed HMMTOP method (Tusnády and Simon, 2001) can be used to predict the localization of helical transmembrane segments and the topology of transmembrane proteins, and might be used indirectly to predict type-1 and type-2 membrane proteins, predictions of lipid chain-anchored membrane proteins and GPI-anchored membrane proteins are beyond the reach of this method. In this paper these problems are explicitly addressed.
Besides, in some of the studies mentioned above (see, e.g., Chou and Elrod, 1999a; Cai et al., 2004), a protein sample was represented by its amino acid composition alone, and hence all sequence-order effects were ignored during the prediction process. This would certainly limit the potential for improving the prediction quality. To overcome this problem, the pseudo-amino acid (PseAA) composition (Du et al., 2004; Guo, 2002; Wang et al., 2004) and the functional domain (FunD) composition (Cai et al., 2003) were, respectively, used to represent the protein sample. Although the former can incorporate some sequence-order effects through a set of correlation factors, as elucidated in the paper that originally introduced the concept of PseAA composition (Chou, 2001), those sequence-order effects that cannot be reflected by the correlation factors are missed, and they might play an important role in identifying membrane protein type. Moreover, although the FunD composition, originally introduced for improving the prediction of protein subcellular location (Chou and Cai, 2002), may cover most of the sequence-order effects, not all the proteins concerned can be defined in terms of the FunD composition, owing to the limitations of the current FunD database. Those undefined proteins cannot be identified at all. In view of this, a new representation is introduced in the present study that defines a protein sample by hybridizing the FunD composition and the PseAA composition. The new representation, called the FunD–PseAA composition, lets the two complement each other: the FunD composition is used wherever it is defined, and the PseAA composition covers the proteins that the FunD composition leaves undefined.
2. Materials
The UniProt/Swiss-Prot database at www.ebi.ac.uk/swissprot (Release 44, 5 July 2004) was used to construct the data sets for the current study according to the following procedures. (1) In order to obtain a high-quality, well-defined training data set for membrane protein types, the data were strictly screened as described in Chou and Elrod (1999a) to exclude those with ambiguous annotations. (2) Sequences with fewer than 50 amino acid residues were removed, since they are regarded as protein fragments rather than proteins. (3) To avoid any homologous bias, a redundancy cutoff was imposed with a culling program (Wang and Dunbrack Jr., 2003) to winnow out those sequences that had ≥25% sequence identity to any other in the same subset. This yielded a total of 2763 membrane protein sequences, consisting of 219 type-1 membrane proteins, 140 type-2 membrane proteins, 2137 multipass transmembrane proteins, 195 lipid chain-anchored membrane proteins, and 72 GPI-anchored membrane proteins. Meanwhile, a total of 3038 non-membrane protein sequences were randomly taken from UniProt/Swiss-Prot by the same procedures. The accession numbers of the 2763 membrane proteins (classified into 5 types) and the 3038 non-membrane proteins are given in the On-line Supplementary Materials A. The data set thus constructed is much more stringent than the one constructed recently with a threshold of 40% (Chou and Cai, 2005).
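The length filter of procedure (2) and the subset sizes quoted above can be illustrated with a minimal sketch. The function name `drop_fragments` and the toy records are hypothetical; the annotation screening of procedure (1) and the 25% identity culling of procedure (3) depend on external tools and are not reproduced here.

```python
# Minimal sketch of procedure (2): discard sequences shorter than 50 residues.
# The function name and the toy records below are illustrative assumptions.
MIN_LENGTH = 50  # sequences with fewer residues are treated as fragments


def drop_fragments(records, min_length=MIN_LENGTH):
    """Keep only (accession, sequence) pairs with at least min_length residues."""
    return [(acc, seq) for acc, seq in records if len(seq) >= min_length]


# Subset sizes of the membrane protein data set as reported in the text.
SUBSET_SIZES = {
    "type-1": 219,
    "type-2": 140,
    "multipass": 2137,
    "lipid chain-anchored": 195,
    "GPI-anchored": 72,
}

if __name__ == "__main__":
    toy = [("X0001", "M" * 49), ("X0002", "M" * 50), ("X0003", "M" * 120)]
    print([acc for acc, _ in drop_fragments(toy)])  # ['X0002', 'X0003']
    print(sum(SUBSET_SIZES.values()))               # 2763
```

The five subset sizes add up to the 2763 membrane proteins used throughout the study.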
3. Method
To improve the quality of predicting membrane protein type, a key step is to find an effective representation that contains as much information about a protein as possible. Given a protein, its entire sequence of course contains the most complete information. Unfortunately, if the entire sequence were used to formulate the statistical prediction algorithm, one would face the difficulty of dealing with an almost infinite number of sample patterns, as elaborated by Chou (2001). In order to formulate a feasible statistical prediction algorithm, a protein must be expressed in terms of a set of discrete numbers, such as the 20 amino acid components used by many previous investigators in various prediction algorithms (Bahar et al., 1997; Cedano et al., 1997; Chandonia and Karplus, 1995; Chou, 1989, 1995; Chou and Zhang, 1994; Deleage and Roux, 1987; Klein, 1986; Klein and Delisi, 1986; Kneller et al., 1990; Metfessel et al., 1993; Nakashima et al., 1986; Zhou, 1998; Zhou and Assa-Munt, 2001). We are thus confronted with a dilemma: if we wish to include the complete information, the prediction becomes unfeasible; if we wish to make the prediction feasible, some important information must be ignored. In view of this, can we find a compromise, i.e. a new protein representation that consists of a set of discrete numbers but also bears as many important sequence-order-related features as possible? This can be realized by introducing the FunD–PseAA composition as formulated below.
As mentioned at the beginning, the type of a membrane protein is closely related to its function. It is therefore anticipated that the prediction quality will be significantly enhanced if we can find a feasible approach to use the knowledge of FunDs to define a protein sample. The integrated domain and motif database (Apweiler et al., 2001), or InterPro database, consists of many sequences with well-known FunD types. InterPro release 6.2 (24 April, 2003) contains 7785 entries that are available from the web site at http://www.ebi.ac.uk/interpro. Using each of the 7785 FunDs as a vector base, a sequence can be defined as a 7785D (dimensional) vector according to the following steps.
3.1. Step 1: FunD composition
Use the program IPRSCAN (Apweiler et al., 2001) to search the InterPro database for a given protein. If there is a hit (e.g. IPR001979, meaning the protein contains a sequence segment very similar to that of the 1979th domain of the InterPro database), then the 1979th component of the protein in the 7785D FunD space is assigned 1; otherwise, 0. A hit in InterPro is a match of a protein signature to a protein sequence in a given region. How a hit is defined depends on the method used to create the signature. Profiles and hidden Markov models define hits as those that score within a predefined threshold e-value. The thresholds are defined by the member databases of InterPro (e.g. Pfam, Prints, etc.) such that only hits with e-values below the threshold are considered to be correct. For a detailed description, refer to Apweiler et al. (2001). By following the above procedure, the protein can be explicitly formulated as follows:
$$ P = \begin{bmatrix} p_1 \\ p_2 \\ \vdots \\ p_j \\ \vdots \\ p_{7785} \end{bmatrix}, \tag{1} $$
where
$$ p_j = \begin{cases} 1, & \text{if a hit is found for the } j\text{th entry of the InterPro database,} \\ 0, & \text{otherwise.} \end{cases} \tag{2} $$
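Eqs. (1) and (2) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: it assumes, following the IPR001979 example in the text, that the numeric part of an InterPro accession is the index of the component to set, and the names `fund_vector` and `is_naught` are hypothetical. Running IPRSCAN itself is not shown; the sketch starts from a list of accessions that were hit.

```python
import re

FUND_DIM = 7785  # number of InterPro entries used as the vector base in the text


def fund_vector(hit_accessions, dim=FUND_DIM):
    """Return the binary FunD vector of Eqs. (1)-(2) as a list of 0/1 values.

    Assumption: the numeric part of an accession such as "IPR001979" is the
    index j of the component p_j, as in the example given in the text.
    """
    vec = [0] * dim
    for acc in hit_accessions:
        match = re.fullmatch(r"IPR(\d{6})", acc)
        if match is None:
            raise ValueError(f"not an InterPro accession: {acc!r}")
        j = int(match.group(1))  # "IPR001979" -> 1979
        if not 1 <= j <= dim:
            raise ValueError(f"accession {acc} lies outside the {dim}D space")
        vec[j - 1] = 1  # component p_j is stored at 0-based position j - 1
    return vec


def is_naught(vec):
    """True when no hit was found, i.e. every component is 0."""
    return not any(vec)


if __name__ == "__main__":
    p = fund_vector(["IPR001979"])
    print(len(p), p[1978], sum(p))     # 7785 1 1
    print(is_naught(fund_vector([])))  # True
```

A protein without any hit yields a vector for which `is_naught` returns `True`.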
Defined in this way, a protein corresponds to a 7785D vector P with each of the 7785 FunD patterns as a base for the vector space. In other words, rather than the 20D space of the amino acid composition approach used by many previous investigators (Bahar et al., 1997; Cedano et al., 1997; Chandonia and Karplus, 1995; Chou, 1989, 1995; Chou and Zhang, 1994; Deleage and Roux, 1987; Klein, 1986; Klein and Delisi, 1986; Kneller et al., 1990; Metfessel et al., 1993; Nakashima et al., 1986; Zhou, 1998; Zhou and Assa-Munt, 2001), a protein is now represented in terms of the FunD composition in a 7785D space. By doing so, not only some sequence-order-related features but also some function-related features are naturally incorporated in the representation.
3.2. Step 2: PseAA composition
If no hit is found in the entire InterPro database, the protein P formulated by Eq. (1) will correspond to a naught vector. To cope with such a circumstance, the protein is instead defined in the (20+λ)D PseAA (pseudo-amino acid composition) space (Chou, 2001), as given below:
$$ P = \begin{bmatrix} p_1 \\ p_2 \\ \vdots \\ p_{20} \\ p_{20+1} \\ \vdots \\ p_{20+\lambda} \end{bmatrix}, \tag{3} $$
where $p_1, p_2, \ldots, p_{20}$ represent the 20 components of the classical amino acid composition (Chou and Zhang, 1993; Zhou, 1998), while $p_{20+1}$ is the first-tier sequence-order correlation factor, $p_{20+2}$ the second-tier sequence-order correlation factor, and so forth (Chou, 2001). It is the additional λ components in Eq. (3) that incorporate some sequence-coupling effects into the vector representation of a protein. Generally speaking, the larger λ is, the more sequence-coupling effects are incorporated. However, λ cannot exceed the length of a protein (i.e., the total number of its constituent residues). Also, if λ is too large, the overall success rate by jackknife tests might decrease owing to the reduction of the cluster-tolerant capacity (Chou, 1999). Therefore, for different training data sets, λ may have different optimal values. For the current study, the optimal value of λ is 37. Given a protein, the (20 + 37) = 57 PseAA components in Eq. (3) can be easily derived by following the procedures described by Chou (2001). Thus, any protein that corresponds to a naught vector in the 7785D FunD space (Eq. (2)) can always be explicitly defined in the 57D PseAA space (Eq. (3)).
The prediction was performed with the ISort (Intimate Sorting) predictor, which can be briefly described as follows. Suppose there are N proteins ($P_1, P_2, \ldots, P_N$) which have been classified into categories 1, 2, …, m.
Now, for a query protein P, how can we predict which category it belongs to? To deal with this problem, let us define the following scale to measure the similarity between P and $P_i$ (i = 1, 2, …, N):
$$ \Lambda(P, P_i) = \frac{P \cdot P_i}{\|P\| \, \|P_i\|}, \quad (i = 1, 2, \ldots, N), \tag{4} $$
where $P \cdot P_i$ is the dot product of vectors P and $P_i$, and $\|P\|$ and $\|P_i\|$ are their respective moduli. Obviously, when $P \equiv P_i$, we have $\Lambda(P, P_i) = 1$, meaning they have perfect or 100% similarity. Since all components of the vectors concerned are non-negative, the similarity lies within the range of 0 and 1; i.e. $0 \le \Lambda(P, P_i) \le 1$. Accordingly, the ISort predictor can be formulated as follows. If the similarity between P and $P_k$ (k = 1, 2, …, or N) is the highest; i.e.
$$ \Lambda(P, P_k) = \mathrm{Max}\{\Lambda(P, P_1), \Lambda(P, P_2), \ldots, \Lambda(P, P_N)\}, \tag{5} $$
where the operator Max means taking the maximum among those in the braces, then the query protein P is predicted to belong to the same category as $P_k$. If there is a tie, the query protein may not be uniquely determined, but such cases rarely occur. The ISort predictor is particularly useful when the distributions of the samples are unknown.
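The similarity of Eq. (4), the ISort rule of Eq. (5), and the fallback from the FunD space to the PseAA space described in Steps 1 and 2 can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the toy low-dimensional vectors, the choice to let the first training entry win a tie, and the convention that a naught vector scores 0 are all assumptions made here. The PseAA vectors are taken as given, since their derivation follows Chou (2001) and is not reproduced in the text.

```python
import math


def similarity(p, q):
    """Eq. (4): dot product of p and q divided by the product of their moduli."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm > 0 else 0.0  # assumption: a naught vector scores 0


def isort_predict(query, training):
    """Eq. (5): return the category of the training vector most similar to query.

    training is a list of (vector, category) pairs. On a tie the earliest
    entry wins, which is an arbitrary choice made for this sketch.
    """
    best_category, best_score = None, -1.0
    for vector, category in training:
        score = similarity(query, vector)
        if score > best_score:
            best_category, best_score = category, score
    return best_category


def hybrid_predict(fund_vec, pseaa_vec, fund_training, pseaa_training):
    """Use the FunD space when the query has at least one hit, else the PseAA space."""
    if any(fund_vec):
        return isort_predict(fund_vec, fund_training)
    return isort_predict(pseaa_vec, pseaa_training)


if __name__ == "__main__":
    # Toy 4D "FunD" and 3D "PseAA" training sets (hypothetical values).
    fund_training = [([1, 0, 0, 1], "membrane"), ([0, 1, 1, 0], "non-membrane")]
    pseaa_training = [([0.5, 0.3, 0.2], "membrane"), ([0.1, 0.1, 0.8], "non-membrane")]
    # Query with a hit: decided in the FunD space.
    print(hybrid_predict([0, 1, 0, 0], [0.2, 0.2, 0.6], fund_training, pseaa_training))
    # Query without any hit: decided in the PseAA space.
    print(hybrid_predict([0, 0, 0, 0], [0.6, 0.3, 0.1], fund_training, pseaa_training))
```

In this toy run the first query is assigned to "non-membrane" through the FunD space and the second to "membrane" through the PseAA space.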
4. Results and discussion
The computation was performed on a Silicon Graphics IRIS Indigo workstation (Elan 4000). For the proteins listed in the On-line Supplementary Materials A we obtained the following results according to Steps 1–2 of Section 3: (1) Of the 2763 membrane protein sequences, 1933 got hits and hence were defined in the 7785D FunD space, and the remainder were defined in the 57D PseAA space. (2) Of the 3038 non-membrane protein sequences, 1957 were defined in the 7785D FunD space, and the remainder were defined in the 57D PseAA space. A breakdown of the protein entries defined in the two spaces is given in Table 1, from which we can see that, if the definition of proteins were based only on the FunD database, 2763 − 1933 = 830 proteins in the membrane protein set and 3038 − 1957 = 1081 in the non-membrane protein set would have no definition, leading to a failure to identify their attributes. That is why it is so important to hybridize with the PseAA composition, by which not only can a protein always be defined but its sequence-order effects can also be taken into account to a considerable extent (Chou, 2001). Thus, the hybrid algorithm was operated according to the following procedure: if a query protein was defined by the FunD composition, then the ISort-7785D FunD predictor was used to predict its attribute; otherwise, the ISort-57D PseAA predictor was used.
As is well known, in statistical prediction the single independent data set test, the sub-sampling test and the jackknife test are the three methods often used for cross-validation. Of these three, the jackknife test is deemed the most rigorous and objective. See a monograph (Mardia et al., 1979) for the underlying mathematical principle, and a review (Chou and Zhang, 1995) for a comprehensive discussion. Therefore, the jackknife test has been used by more and more investigators (Du et al., 2004; Feng, 2001; Hua and Sun, 2001; Luo et al., 2002; Pan et al., 2003; Wang et al., 2004; Yuan, 1999; Zhou, 1998; Zhou and Assa-Munt, 2001; Zhou and Doctor, 2003) in examining the power of various prediction methods. Accordingly, the real power of a predictor should be measured by the success rate of the jackknife test.
The overall jackknife success rate obtained by the current FunD–PseAA hybridization approach in identifying membrane and non-membrane proteins is 85.05% (Table 2). The corresponding rate for identifying the 5 membrane protein types is 91.31% (Table 3). In contrast, when the same data set was used for the jackknife test with predictors based on the amino acid composition alone, such as the least Hamming distance algorithm (Chou, 1989), the least Euclidean distance algorithm (Nakashima et al., 1986), and the ProtLoc predictor (Cedano et al., 1997), the overall success rates in identifying the 5 membrane protein types were only within the range of 66–69%, indicating that the current predictor is far superior. Meanwhile, it is seen from Table 3 that the jackknife success rate for predicting GPI-anchored membrane proteins is 50%, markedly lower than the rates for the other membrane protein types. This is because the subset of GPI-anchored membrane proteins is small, containing only 72 proteins after winnowing away those with ≥25% sequence identity, and hence has a much lower cluster-tolerant capacity (Chou, 1999). Therefore, the information loss during the jackknifing process has a greater impact on the success rate (Chou, 1999; Chou and Elrod, 1999b; Chou and Zhang, 1995). However, as more newly found protein sequences enter the databanks, the subset of GPI-anchored membrane proteins may grow and the corresponding success rate is expected to improve as well.
Table 1
Breakdown of the protein entries (a) into the group defined in the 7785D FunD space (Eq. (1)) and the group defined in the 57D PseAA space (Eq. (3))
Data set                 7785D FunD space   57D PseAA space   Total
Membrane proteins        1933               830               2763
Non-membrane proteins    1957               1081              3038
(a) From the On-line Supplementary Materials A.
5. Conclusion
Cell membranes are crucial to the life of a cell. Although the basic structure of biological membranes is provided by the lipid bilayer, most of the specific functions are carried out by membrane proteins. Knowledge of membrane protein type often offers important clues toward determining the function of an uncharacterized protein. Therefore, predicting the type of a membrane protein from its primary sequence, or even just identifying whether an uncharacterized protein is a membrane protein or not, is very important. The high success rates achieved in this study imply that the FunD–PseAA predictor may become a complementary tool to the experimental approaches, speeding up the characterization of newly found proteins. The FunD–PseAA predictor will be available as a web-server at www.pami.sjtu.edu.cn/kcchou.
Table 2
Success rates in identifying membrane and non-membrane proteins by the jackknife test
Membrane proteins     Non-membrane proteins   Overall
2407/2763 = 87.12%    2527/3038 = 83.18%      4934/5801 = 85.05%
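The rates in Table 2 come from the jackknife test, in which each protein is in turn left out of the data set and predicted from all the others. The sketch below illustrates that procedure on toy data. It is an assumption-laden illustration rather than the authors' program: the names `cosine`, `nearest_category` and `jackknife_success_rate` are hypothetical, ties go to the first training entry, and the five toy samples are invented.

```python
import math


def cosine(p, q):
    """Dot product of p and q divided by the product of their moduli."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p) * sum(b * b for b in q))
    return dot / norm if norm > 0 else 0.0


def nearest_category(query, training):
    """Category of the most similar training vector (first one wins on ties)."""
    return max(training, key=lambda item: cosine(query, item[0]))[1]


def jackknife_success_rate(samples, predict=nearest_category):
    """Leave each sample out in turn, predict it from the rest.

    samples is a list of (vector, category) pairs with at least two entries.
    Returns (number predicted correctly, number of samples).
    """
    correct = 0
    for i, (vector, category) in enumerate(samples):
        rest = samples[:i] + samples[i + 1:]  # data set without sample i
        if predict(vector, rest) == category:
            correct += 1
    return correct, len(samples)


if __name__ == "__main__":
    toy = [
        ([1, 0, 0], "A"),
        ([1, 0.1, 0], "A"),
        ([0, 1, 0], "B"),
        ([0, 1, 0.1], "B"),
        ([0, 0, 1], "A"),  # isolated sample: its nearest neighbour is in "B"
    ]
    correct, total = jackknife_success_rate(toy)
    print(f"{correct}/{total} = {100 * correct / total:.2f}%")  # 4/5 = 80.00%
```

The isolated fifth sample is misclassified once it is removed from the training set, which mirrors the remark in the text that small subsets lose more information during jackknifing.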
Acknowledgments
The authors wish to thank the three anonymous reviewers for their constructive comments, which were very helpful in strengthening the presentation of this work.
Appendix A. Supplementary data
The on-line version of the article contains additional supplementary data. Please visit doi:10.1016/j.jtbi.2005.05.035.
Table 3
Success rates in identifying membrane protein type by the jackknife test
Membrane protein type                      Correct/Total   Success rate
Type I membrane proteins                   183/219         83.56%
Type II membrane proteins                  100/140         71.43%
Multipass transmembrane proteins           2085/2137       97.57%
Lipid chain-anchored membrane proteins     119/195         61.03%
GPI-anchored membrane proteins             36/72           50.00%
Overall                                    2523/2763       91.31%
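The percentages in Tables 2 and 3 follow directly from the counts, and the overall rate is the pooled ratio of all correct predictions to all proteins. The short check below recomputes them; the dictionary and function names are illustrative, while the counts are copied from the two tables.

```python
# Counts of (correctly predicted, total) proteins copied from Tables 2 and 3.
TABLE_2 = {"membrane": (2407, 2763), "non-membrane": (2527, 3038)}
TABLE_3 = {
    "type I": (183, 219),
    "type II": (100, 140),
    "multipass": (2085, 2137),
    "lipid chain-anchored": (119, 195),
    "GPI-anchored": (36, 72),
}


def rates(table):
    """Return per-class percentages plus the pooled correct, total and overall rate."""
    per_class = {name: 100.0 * c / n for name, (c, n) in table.items()}
    correct = sum(c for c, _ in table.values())
    total = sum(n for _, n in table.values())
    return per_class, correct, total, 100.0 * correct / total


if __name__ == "__main__":
    for label, table in (("Table 2", TABLE_2), ("Table 3", TABLE_3)):
        per_class, correct, total, overall = rates(table)
        for name, rate in per_class.items():
            print(f"{label}  {name}: {rate:.2f}%")
        print(f"{label}  overall: {correct}/{total} = {overall:.2f}%")
```

Because the overall rate is pooled, the multipass subset (2137 of the 2763 membrane proteins) carries most of the weight in the 91.31% figure.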
References
Alberts, B., Bray, D., Lewis, J., Raff, M., Roberts, K., Watson, J.D., 1994. Molecular Biology of the Cell, Third ed. Garland Publishing, New York & London (Chapter 1).
Apweiler, R., Attwood, T.K., Bairoch, A., Bateman, A., Birney, E., Biswas, M., Bucher, P., Cerutti, L., Corpet, F., Croning, M.D.R., Durbin, R., Falquet, L., Fleischmann, W., Gouzy, L., Hermjakob, H., Hulo, N., Jonassen, I., Kahn, D., Kanapin, A., Karavidopoulou, Y., Lopez, R., Marx, B., Mulder, N.J., Oinn, T.M., Pagni, M., Servant, F., Sigrist, C.J.A., Zdobnov, E.M., 2001. The InterPro database, an integrated documentation resource for protein families, domains and functional sites. Nucleic Acids Res. 29, 37–40.
Bahar, I., Atilgan, A.R., Jernigan, R.L., Erman, B., 1997. Understanding the recognition of protein structural classes by amino acid composition. Proteins: Struct. Funct. Genet. 29, 172–185.
Cai, Y.D., Zhou, G.P., Chou, K.C., 2003. Support vector machines for predicting membrane protein types by using functional domain composition. Biophys. J. 84, 3257–3263.
Cai, Y.D., Pong-Wong, R., Feng, K., Jen, J.C.H., Chou, K.C., 2004. Application of SVM to predict membrane protein types. J. Theor. Biol. 226, 373–376.
Cedano, J., Aloy, P., Pérez-Pons, J.A., Querol, E., 1997. Relation between amino acid composition and cellular location of proteins. J. Mol. Biol. 266, 594–600.
Chandonia, J.M., Karplus, M., 1995. Neural networks for secondary structure and structural class prediction. Protein Sci. 4, 275–285.
Chou, J.J., Zhang, C.T., 1993. A joint prediction of the folding types of 1490 human proteins from their genetic codons. J. Theor. Biol. 161, 251–262.
Chou, K.C., 1995. A novel approach to predicting protein structural classes in a (20-1)-D amino acid composition space. Proteins: Struct. Funct. Genet. 21, 319–344.
Chou, K.C., 1999. A key driving force in determination of protein structural classes. Biochem. Biophys. Res. Commun. 264, 216–224.
Chou, K.C., 2000. Review: Prediction of protein structural classes and subcellular locations. Curr. Protein Peptide Sci. 1, 171–208.
Chou, K.C., 2001. Prediction of protein cellular attributes using pseudo amino acid composition. Proteins: Struct. Funct. Genet. 43, 246–255 (Erratum: 44, 60).
Chou, K.C., Cai, Y.D., 2002. Using functional domain composition and support vector machines for prediction of protein subcellular location. J. Biol. Chem. 277, 45765–45769.
Chou, K.C., Cai, Y.D., 2005. Using GO-PseAA predictor to identify membrane proteins and their types. Biochem. Biophys. Res. Commun. 327, 845–847.
Chou, K.C., Elrod, D.W., 1999a. Prediction of membrane protein types and subcellular locations. Proteins: Struct. Funct. Genet. 34, 137–153.
Chou, K.C., Elrod, D.W., 1999b. Protein subcellular location prediction. Protein Eng. 12, 107–118.
Chou, K.C., Zhang, C.T., 1994. Predicting protein folding types by distance functions that make allowances for amino acid interactions. J. Biol. Chem. 269, 22014–22020.
Chou, K.C., Zhang, C.T., 1995. Review: Prediction of protein structural classes. Crit. Rev. Biochem. Mol. Biol. 30, 275–349.
Chou, P.Y., 1989. Prediction of protein structural classes from amino acid composition. In: Fasman, G.D. (Ed.), Prediction of Protein Structure and the Principles of Protein Conformation. Plenum Press, New York, pp. 549–586.
Deleage, G., Roux, B., 1987. An algorithm for protein secondary structure prediction based on class prediction. Protein Eng. 1, 289–294.
Du, Q.S., Wang, S.Q., Wei, D.Q., Zhu, Y., Guo, H., Sirois, S., Chou, K.C., 2004. Polyprotein cleavage mechanism of SARS CoV Mpro and chemical modification of octapeptide. Peptides 25, 1857–1864.
Feng, Z.P., 2001. Prediction of the subcellular location of prokaryotic proteins based on a new representation of the amino acid composition. Biopolymers 58, 491–499.
Feng, Z.P., Zhang, C.T., 2000. Prediction of membrane protein types based on the hydrophobic index of amino acids. J. Protein Chem. 19, 269–275.
Guo, Z.M., 2002. Prediction of membrane protein types by using pattern recognition method based on pseudo amino acid composition. Master Thesis, Bio-X Life Science Research Center, Shanghai Jiaotong University.
Hua, S., Sun, Z., 2001. Support vector machine approach for protein subcellular localization prediction. Bioinformatics 17, 721–728.
Klein, P., 1986. Prediction of protein structural class by discriminant analysis. Biochim. Biophys. Acta 874, 205–215.
Klein, P., Delisi, C., 1986. Prediction of protein structural class from amino acid sequence. Biopolymers 25, 1659–1672.
Kneller, D.G., Cohen, F.E., Langridge, R., 1990. Improvements in protein secondary structure prediction by an enhanced neural network. J. Mol. Biol. 214, 171–182.
Lodish, H., Baltimore, D., Berk, A., Zipursky, S.L., Matsudaira, P., Darnell, J., 1995. Molecular Cell Biology, Third ed. Scientific American Books, New York (Chapter 3).
Luo, R.Y., Feng, Z.P., Liu, J.K., 2002. Prediction of protein structural class by amino acid and polypeptide composition. Eur. J. Biochem. 269, 4219–4225.
Mardia, K.V., Kent, J.T., Bibby, J.M., 1979. Multivariate Analysis. Academic Press, London (Chapter 11 Discriminant analysis; Chapter 12 Multivariate analysis of variance; Chapter 13 Cluster analysis; pp. 322–381).
Metfessel, B.A., Saurugger, P.N., Connelly, D.P., Rich, S.T., 1993. Cross-validation of protein structural class prediction using statistical clustering and neural networks. Protein Sci. 2, 1171–1182.
Nakashima, H., Nishikawa, K., Ooi, T., 1986. The folding type of a protein is relevant to the amino acid composition. J. Biochem. 99, 152–162.
Pan, Y.X., Zhang, Z.Z., Guo, Z.M., Feng, G.Y., Huang, Z.D., He, L., 2003. Application of pseudo amino acid composition for predicting protein subcellular location: stochastic signal processing approach. J. Protein Chem. 22, 395–402.
Tusnády, G.E., Simon, I., 2001. The HMMTOP transmembrane topology prediction server. Bioinformatics 17, 849–850.
Wang, G., Dunbrack Jr., R.L., 2003. PISCES: a protein sequence culling server. Bioinformatics 19, 1589–1591.
Wang, M., Yang, J., Liu, G.P., Xu, Z.J., Chou, K.C., 2004. Weighted-support vector machines for predicting membrane protein types based on pseudo amino acid composition. Protein Eng. Des. Sel. 17, 509–516.
Wang, M., Yang, J., Xu, Z.J., Chou, K.C., 2005. SLLE for predicting membrane protein types. J. Theor. Biol. 232, 7–15.
Yuan, Z., 1999. Prediction of protein subcellular locations using Markov chain models. FEBS Lett. 451, 23–26.
Zhou, G.P., 1998. An intriguing controversy over protein structural class prediction. J. Protein Chem. 17, 729–738.
Zhou, G.P., Assa-Munt, N., 2001. Some insights into protein structural class prediction. Proteins: Struct. Funct. Genet. 44, 57–59.
Zhou, G.P., Doctor, K., 2003. Subcellular location prediction of apoptosis proteins. Proteins: Struct. Funct. Genet. 50, 44–48.