BBRC Biochemical and Biophysical Research Communications 311 (2003) 743–747 www.elsevier.com/locate/ybbrc
A new hybrid approach to predict subcellular localization of proteins by incorporating gene ontology

Kuo-Chen Chou a,b,* and Yu-Dong Cai c

a Gordon Life Science Institute, San Diego, CA 92130, USA
b Tianjin Institute of Bioinformatics and Drug Discovery (TIBDD), Tianjin, China
c Biomolecular Sciences Department, UMIST, P.O. Box 88, Manchester M60 1QD, UK
Received 7 October 2003
Abstract

Based on the recent development in the gene ontology and functional domain databases, a new hybridization approach is developed for predicting protein subcellular location by combining the gene product, functional domain, and quasi-sequence-order effects. As a showcase, the same prokaryotic and eukaryotic datasets, which were studied by many previous investigators, are used for demonstration. The overall success rate by the jackknife test for the prokaryotic set is 94.7% and that for the eukaryotic set 92.9%. These are so far the highest success rates achieved for the two datasets by following a rigorous cross-validation test procedure, suggesting that such a hybrid approach may become a very useful high-throughput tool in the area of bioinformatics, proteomics, as well as molecular cell biology. The very high success rates also reflect the fact that the subcellular localization of a protein is closely correlated with: (1) the biological objective to which the gene or gene product contributes, (2) the biochemical activity of a gene product, and (3) the place in the cell where a gene product is active.

© 2003 Elsevier Inc. All rights reserved.

Keywords: Gene ontology; Functional domain composition; Pseudo-amino acid composition; InterPro database; Hybrid space; Intimate sorting algorithm; ISort predictor; Protein subcellular location
Prediction of protein subcellular location is currently a very active topic in molecular biology because it involves three essential features of a protein: its biological objective, its biochemical activity, and the place in the cell where the gene product is active. Many efforts have been made to develop computational methods for quickly predicting the localization of proteins in a cell [1–13]. Each of these approaches has its own merits as well as its limitations. Recently, a hybrid approach [14] was developed to predict protein subcellular location by combining functional domain composition [15] and pseudo-amino acid composition [16]. Compared with the reports by many previous investigators, the success rates thus obtained were remarkably improved, particularly for the eukaryotic proteins (Table 1 of [14]). However, as shown in the same table, the overall success rate for the prokaryotic proteins by the hybrid approach is still not the highest. To deal with this situation, a new hybrid approach is proposed that incorporates knowledge from the recent development of gene ontology. With this approach, the Intimate Sorting Algorithm is developed to predict the subcellular localization of proteins, and the highest success rates are obtained for both the prokaryotic and the eukaryotic proteins.

* Corresponding author. Fax: 1-858-455-0954. E-mail addresses: [email protected], [email protected] (K.C. Chou).
0006-291X/$ - see front matter © 2003 Elsevier Inc. All rights reserved. doi:10.1016/j.bbrc.2003.10.062
Hybridization of gene ontology, functional domain, and pseudo-amino acid compositions

To improve the quality of predicting protein subcellular location, a logical step is to capture the core features of a protein that are intimately related to its localization in a cell. According to the Gene Ontology (GO) Consortium [17], the GO database was established based on the
K.-C. Chou, Y.-D. Cai / Biochemical and Biophysical Research Communications 311 (2003) 743–747
Table 1
Breakdown of the proteins defined in the hybridization space of gene ontology, functional domain, and pseudo-amino acid composition

Dataset        1930D GO_compress space    7785D InterPro space    57D pseudo-amino acid composition space    Total
Prokaryotic    870                        65                      62                                         997
Eukaryotic     1852                       364                     211                                        2427
following criteria: (a) biological process, referring to a biological objective to which the gene or gene product contributes; (b) molecular function, defined as the biochemical activity of a gene product; and (c) cellular component, referring to the place in the cell where a gene product is active. Since these three criteria are not only attributes of genes, gene products, or gene-product groups, but also core features reflecting subcellular localization [17–19], it is anticipated that the prediction quality will be enhanced if the GO database is used to define proteins according to the following procedures.

By mapping InterPro [20] entries to GO, one can get a list of data called “InterPro2GO” (ftp://ftp.ebi.ac.uk/pub/databases/interpro/interpro2go/), where each InterPro entry corresponds to a GO number. The relationships between InterPro and GO may be one-to-many, “reflecting the biological reality that a particular protein may function in several processes, contain domains that carry out diverse molecular functions, and participate in multiple alternative interactions with other proteins, organelles or locations in the cell” [17]. For example, “IPR000003” corresponds to “GO:0003677,” “GO:0004879,” “GO:0005496,” “GO:0006355,” and “GO:0005634.” Also, since the current GO database is still far from complete, some InterPro entries (such as IPR000001, IPR000002, and IPR000004) have no corresponding GO numbers in the InterPro2GO list. Moreover, the GO numbers in InterPro2GO do not increase consecutively, and hence a reorganization and compression procedure was applied to renumber them. For example, after such a procedure, the original GO numbers GO:0000012, GO:0000015, GO:0000030, ..., GO:0046413 would become GO_compress: 0000001, GO_compress: 0000002, GO_compress: 0000003, ..., GO_compress: 0001930, respectively.
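The renumbering described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `compress_go_numbers` and `compressed_label` are hypothetical, the input is assumed to be a plain list of GO accession strings already extracted from the mapping file, and the exact output label format is a guess based on the examples in the text.

```python
def compress_go_numbers(go_numbers):
    """Map each distinct GO accession to a consecutive 1-based index.

    Hypothetical helper: the paper does not give its implementation.
    """
    # Sort by the numeric part so the original ordering of GO numbers is kept.
    distinct = sorted(set(go_numbers), key=lambda acc: int(acc.split(":")[1]))
    return {acc: rank for rank, acc in enumerate(distinct, start=1)}


def compressed_label(rank):
    """Format an index in a GO_compress-like style (formatting is an assumption)."""
    return f"GO_compress:{rank:07d}"


if __name__ == "__main__":
    # Duplicates and unordered input are tolerated.
    toy = ["GO:0000030", "GO:0000012", "GO:0000015", "GO:0000012"]
    for acc, rank in compress_go_numbers(toy).items():
        print(acc, "->", compressed_label(rank))
```

With the full list of GO numbers occurring in the mapping, the largest index would be the dimension of the compressed space (1930 in the text).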
The GO database thus obtained is called GO_compress database, whose dimensions were reduced to 1930 from 46,413 in the original GO database. Each of the 1930 entities in the GO_compress database will serve as a base to define a protein. However, the current GO numbers do not give a complete coverage in the sense that some proteins might not belong to any of the GO numbers. Although the problem will eventually no longer exist as GO increases in size, to cope with such a situation right now, a hybrid approach was introduced
by combining GO with the functional domain composition [15] and pseudo-amino acid composition [16], as described below.

(1) Use the program IPRSCAN [20] to search the InterPro functional domain database (release 6.1) [20] for a given protein. If there is a hit corresponding to the ith number of the GO_compress database, then the ith component of the protein in the 1930D GO_compress space is assigned 1; otherwise, 0. Thus, the protein can be formulated as

\[ P = \begin{bmatrix} a_1 \\ a_2 \\ \vdots \\ a_i \\ \vdots \\ a_{1930} \end{bmatrix}, \tag{1} \]

where

\[ a_i = \begin{cases} 1, & \text{hit found in GO\_compress}, \\ 0, & \text{otherwise}. \end{cases} \tag{2} \]

(2) If no hit (i.e., no corresponding GO number) is found at all in the entire 1930D GO_compress space, the protein should be defined in the 7785D InterPro FunD (functional domain composition) space [20], as given below

\[ P = \begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_j \\ \vdots \\ b_{7785} \end{bmatrix}, \tag{3} \]

where

\[ b_j = \begin{cases} 1, & \text{hit found in InterPro FunD database}, \\ 0, & \text{otherwise}. \end{cases} \tag{4} \]
(3) If no hit is found at all even in the entire 7785D FunD space, the protein should be defined in the (20 + k)D PseAA (pseudo-amino acid composition) space, as given below

\[ P = \begin{bmatrix} c_1 \\ c_2 \\ \vdots \\ c_{20} \\ c_{20+1} \\ \vdots \\ c_{20+k} \end{bmatrix}, \tag{5} \]

where $c_1, c_2, \ldots, c_{20}$ represent the 20 components of the classical amino acid composition, while $c_{20+1}$ is the first-tier sequence-order correlation factor, $c_{20+2}$ the second-tier sequence-order correlation factor, and so forth (cf. Fig. 1 of [16]). Generally speaking, the larger the number of these correlation factors, the more sequence-order effects are incorporated. However, the number k cannot exceed the length of a protein (i.e., the total number of its residues). Also, if k is too large, the overall success rate by jackknife tests might be reduced [16]. Therefore, k may have different optimal values for different training datasets. For the current study, the optimal value of k is 37; i.e., the dimension of the pseudo-amino acid composition considered is 20 + 37 = 57. Given a protein, the 57 pseudo-amino acid components in Eq. (5) can be easily derived by following the procedures described in the original paper [16] that introduced the concept of pseudo-amino acid composition.
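The three-step definition in Eqs. (1)–(5) amounts to a fallback chain, which can be sketched as follows. This is a minimal illustration, not the authors' implementation: `define_protein` and `binary_vector` are hypothetical names, the hit lists are assumed to be 1-based indices already obtained from an IPRSCAN search, and the PseAA components are assumed to have been computed beforehand by the procedure of [16].

```python
GO_DIM = 1930     # dimension of the GO_compress space, Eq. (1)
FUND_DIM = 7785   # dimension of the InterPro FunD space, Eq. (3)


def binary_vector(hit_indices, dim):
    """Return a 0/1 list of length dim with 1 at each 1-based hit index."""
    vec = [0] * dim
    for i in hit_indices:
        vec[i - 1] = 1
    return vec


def define_protein(go_hits, fund_hits, pseaa):
    """Pick the first space in which the protein has a non-zero definition.

    go_hits / fund_hits: 1-based indices of hits (possibly empty).
    pseaa: precomputed pseudo-amino acid components (57 values in the text).
    Returns (space_name, vector).
    """
    if go_hits:                      # step (1): GO_compress space
        return "GO", binary_vector(go_hits, GO_DIM)
    if fund_hits:                    # step (2): functional domain space
        return "FunD", binary_vector(fund_hits, FUND_DIM)
    return "PseAA", list(pseaa)      # step (3): always defined
```

Returning the name of the space together with the vector makes it straightforward to compare a query only with training proteins defined in the same space.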
The intimate sorting algorithm

Suppose there are N proteins ($P_1, P_2, \ldots, P_N$) which have been classified into categories $1, 2, \ldots, l$. Now, for a query protein P, how can we predict which category it belongs to? To deal with this problem, we introduce a new algorithm, the Intimate Sorting Algorithm, abbreviated as the ISort predictor. First, let us define the similarity between P and $P_i$ ($i = 1, 2, \ldots, N$) as

\[ \Phi(P, P_i) = \frac{P \cdot P_i}{\|P\| \, \|P_i\|} \quad (i = 1, 2, \ldots, N), \tag{6} \]

where $P \cdot P_i$ is the dot product of vectors P and $P_i$, and $\|P\|$ and $\|P_i\|$ are their respective moduli. Obviously, when $P \equiv P_i$, we have $\Phi(P, P_i) = 1$, meaning they have perfect or 100% similarity. Generally speaking, the similarity is within the range of 0 and 1; i.e., $0 \le \Phi(P, P_i) \le 1$. Accordingly, the Intimate Sorting Algorithm can be formulated as follows. If the similarity between P and $P_k$ ($k = 1, 2, \ldots$, or N) is the highest; i.e.,

\[ \Phi(P, P_k) = \max\{\Phi(P, P_1), \Phi(P, P_2), \ldots, \Phi(P, P_N)\}, \tag{7} \]

then the query protein P is predicted to belong to the same category as $P_k$. If there is a tie, the query protein is not uniquely determined, but such cases rarely occur.

The following self-consistency principle should be followed in practical use of the hybridization approach. If a query protein was defined in the 1930D GO_compress space (see Eq. (1)), then the prediction should be carried out based on those proteins in the training set that could be defined in the same 1930D space. If all of the components for the query protein in the 1930D GO_compress space were zero and hence it was defined by shifting to the 7785D functional domain space (see Eq. (3)), then the prediction should be conducted on the basis that all the rule parameters were derived from the same 7785D space. Finally, if all the components for the query protein in the 7785D functional domain space were also zero and its definition must be made by shifting to the (20 + k)D pseudo-amino acid composition space (see Eq. (5)), then the prediction should be carried out according to the principle that all the proteins in the training set be defined in the same pseudo-amino acid composition space as well. Accordingly, the current ISort predictor actually consists of three sub-predictors: (a) the ISort-1930D GO predictor that operates in the 1930D GO_compress space, (b) the ISort-7785D FunD predictor that operates in the 7785D functional domain composition space, and (c) the ISort-57D PseAA predictor that operates in the 57D pseudo-amino acid composition space with k = 37. The entire process is called the GO-FunD-PseAA hybridization approach.
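As an illustration of the similarity measure of Eq. (6) and the decision rule of Eq. (7), the following sketch implements a nearest-neighbor search under the normalized dot product. It is a hedged example rather than the authors' code: the names `similarity` and `isort_predict` are hypothetical, the training set is assumed to contain only proteins defined in the same space as the query (the self-consistency principle), and ties are resolved by keeping the first best match, which is an arbitrary choice since the text leaves ties undetermined.

```python
import math


def similarity(p, q):
    """Eq. (6): dot product of p and q divided by the product of their moduli."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    # A zero vector has no direction; treat its similarity as 0 (assumption).
    return dot / norm if norm > 0 else 0.0


def isort_predict(query, training):
    """Eq. (7): return (label, similarity) of the most similar training protein.

    training: list of (vector, label) pairs, all defined in the same space
    as the query. On a tie the first best match is kept (assumption).
    """
    best_label, best_sim = None, -1.0
    for vec, label in training:
        s = similarity(query, vec)
        if s > best_sim:
            best_label, best_sim = label, s
    return best_label, best_sim


if __name__ == "__main__":
    toy_training = [([1, 0, 1, 0], "cytoplasmic"), ([0, 1, 0, 1], "periplasmic")]
    print(isort_predict([1, 0, 1, 1], toy_training))
```

In a full pipeline one such call would be made per sub-predictor, with the training list restricted to the GO, FunD, or PseAA space in which the query was defined.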
Results and discussion

To benchmark the prediction quality of the current method against others, the datasets constructed by Reinhardt and Hubbard [5] are used for demonstration. We use datasets constructed by other investigators rather than our own to make the showcase more objective and compelling. This also facilitates the comparison, because these data have been used by a number of previous investigators [8–11,21,22] to test their prediction methods, with explicit results reported in the literature. The datasets consist of two parts: the prokaryotic set and the eukaryotic set. The former contains 997 prokaryotic protein sequences, of which 688 are cytoplasmic, 107 extracellular, and 202 periplasmic; the latter contains 2427 eukaryotic protein sequences, of which 684 are cytoplasmic, 325 extracellular, 321 mitochondrial, and 1097 nuclear. The computation was carried out on a Silicon Graphics IRIS Indigo workstation (Elan 4000).

According to the search procedures described above, we obtained the following results. (a) For the 997 protein sequences in the prokaryotic set, 870 got hits in the GO database and hence were defined in the 1930D GO_compress space, 65 of the remainder got hits in the InterPro FunD database (release 6.1) and hence were defined in the 7785D FunD space, and the remaining 62 proteins were defined in the 57D PseAA space. (b) For the 2427 protein sequences in the eukaryotic set, 1852 were defined in the 1930D GO_compress space, 364 in the 7785D InterPro FunD space, and 211 in the 57D PseAA space. This means that, if only the GO database were used, 997 − 870 = 127 proteins in the prokaryotic set and 2427 − 1852 = 575 in the eukaryotic set would have no definition, leading to a failure to identify their localization. Even after incorporating the InterPro FunD database, we still have 62 proteins in the prokaryotic set and 211
Table 2
Overall success rates reported by investigators using different prediction methods for the prokaryotic and eukaryotic datasets^a, respectively

                           Prokaryotic set^b               Eukaryotic set^c
Investigators              Re-substitution    Jackknife    Re-substitution    Jackknife
Chou and Elrod [21]        90.4%              86.5%        N/A                N/A
Yuan [8]                   N/A                89.1%        N/A                73.0%
Cai and Chou [22]          96.1%              84.4%        95.6%              70.6%
Feng [9]                   93.5%              89.2%        N/A                N/A
Feng and Zhang [10]        97.7%              90.4%        N/A                N/A
Hua and Sun [11]           N/A                91.4%        N/A                79.4%
Cai and Chou [14]          100%               89.3%        100%               90.4%
Authors of this paper      100%               94.7%        100%               92.9%

a The datasets used here were constructed by Reinhardt and Hubbard [5].
b Contains 997 protein sequences, of which 688 are cytoplasmic, 107 extracellular, and 202 periplasmic.
c Contains 2427 protein sequences, of which 684 are cytoplasmic, 325 extracellular, 321 mitochondrial, and 1097 nuclear.
proteins in the eukaryotic set without definition (Table 1). That is why it is so important to hybridize with the pseudo-amino acid composition (PseAA), by which a protein can always be defined and its sequence-order effects can be considerably reflected [16]. Thus, the hybrid algorithm was operated according to the following flowchart: if a query protein was defined in the GO_compress database, the ISort-1930D GO predictor was used to predict its subcellular location; if the query protein could not be defined in the GO_compress database but could be defined in the InterPro FunD database, the ISort-7785D FunD predictor was used; if the query protein could be defined in neither the GO_compress database nor the InterPro FunD database, the ISort-57D PseAA predictor was used.

The demonstration is performed by the re-substitution test and the jackknife test. The results thus obtained are summarized in Table 2, where, to facilitate comparison, the corresponding rates reported by previous investigators are listed as well. The table shows that the overall success rates by the re-substitution test for the 997 proteins in the prokaryotic set and the 2427 proteins in the eukaryotic set are both 100%, the same as the results obtained by the previous hybrid approach [14]. However, the re-substitution test is generally used only to check the self-consistency of a prediction method [23,24]. To examine the power of a predictor, a cross-validation procedure is needed [24]. As is well known, the single independent dataset test, the sub-sampling test, and the jackknife test are the three procedures often used for cross-validation in the literature [23]. Of these three, the jackknife test is regarded as the most objective and effective one [25]. The mathematical principle and a comprehensive discussion of this can be found in a monograph [26] and a review paper [23], respectively.
Also, to help readers understand this, an illustration is given in the Appendix. Accordingly, the real power of a predictor should be reflected by the success rate of the jackknife test. As shown in Table 2, the overall jackknife success rates obtained by the current method are 94.7% for the prokaryotic set and 92.9% for the eukaryotic set, both being the highest compared with the rates reported previously. The current approach yields the highest success rates because it combines the gene product, functional domain, and sequence-order effects, fully consistent with the fact that the subcellular localization of a protein is closely related to its gene product and function in both the prokaryotic and eukaryotic cases.
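For readers unfamiliar with the jackknife test, the procedure can be sketched as a leave-one-out loop: each protein is singled out in turn as the query, and the predictor is built from all the remaining proteins. The sketch below is only illustrative: `jackknife_success_rate`, `nearest_label`, and `cosine` are hypothetical names, the toy vectors are invented for demonstration and are not data from the paper, and the single-space setup ignores the three-space hybridization for brevity.

```python
import math


def cosine(p, q):
    """Normalized dot product of two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm > 0 else 0.0


def nearest_label(query, training):
    """Label of the training vector most similar to the query (first wins on ties)."""
    best = max(training, key=lambda item: cosine(query, item[0]))
    return best[1]


def jackknife_success_rate(dataset):
    """Leave each sample out in turn, predict it from the rest, return the hit rate."""
    correct = 0
    for i, (vec, label) in enumerate(dataset):
        rest = dataset[:i] + dataset[i + 1:]   # training set without sample i
        if nearest_label(vec, rest) == label:
            correct += 1
    return correct / len(dataset)


if __name__ == "__main__":
    toy = [([1, 0], "a"), ([1, 0.1], "a"), ([0, 1], "b"), ([0.1, 1], "b")]
    print(jackknife_success_rate(toy))
```

Because every sample is tested exactly once and never appears in its own training set, the result does not depend on an arbitrary choice of split, unlike a sub-sampling test.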
Conclusion

Introducing the gene ontology (GO) approach, hybridized with the functional domain composition (FunD) approach [15] and the pseudo-amino acid composition (PseAA) approach [16], allows the three to complement one another in capturing the core features of a protein that are intimately related to its localization in a cell. Meanwhile, the intimate sorting (ISort) algorithm allows these approaches to bring out the best in one another. This is why the jackknife success rates obtained by the current method for both the prokaryotic and eukaryotic proteins are superior to all those reported previously. Such an outcome is fully consistent with scientific logic, because the current hybrid approach combines the gene product, functional domain, and quasi-sequence-order effects. The gene product is closely correlated with the biological process, molecular function, and cellular component, while the functional domain [15] and quasi-sequence-order [16] have each proved to play an important role in determining the subcellular localization of a protein.
Appendix

It is instructive to give a discussion through the following illustration. If the cross-validation was performed by the sub-sampling test as treated by Reinhardt and Hubbard [5], “all data sets were split into three equally sized subsets.” Thus, the number of possible splits for the prokaryotic set would be $P = P_1 P_2 P_3$, where

\[ P_1 = \frac{687!}{229!\,229!\,229!\,3!}, \qquad P_2 = \frac{105!}{35!\,35!\,35!\,3!}, \qquad P_3 = \frac{201!}{67!\,67!\,67!\,3!}. \]

Here 687, 105, and 201 are the largest multiples of three not exceeding the class sizes 688, 107, and 202 of the prokaryotic set. Of $P_1$, $P_2$, and $P_3$, the smallest is $P_2 \approx 1.6 \times 10^{47}$, indicating that the number of possible splits would be $P > 10^{141}$. This is an astronomical figure, far too large to be handled by any existing computer. For the eukaryotic set, the number of possible splits would be even greater. Therefore, in any practical sub-sampling test, only a very tiny fraction of the possible splits was investigated; i.e., the tests were far from complete. The success rates thus obtained might bear considerable arbitrariness and are likely to be overestimated. For example, by using the current prediction method, some sub-sampling tests led to overall success rates of 99.2% and 98.9% for the prokaryotic and eukaryotic sets, respectively. These rates were obviously overestimated due to the incompleteness of the sub-samplings investigated. Accordingly, the success rates derived from jackknife tests as reported in Table 2 are much more objective and rigorous.
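The magnitudes quoted in the Appendix can be checked with exact integer arithmetic. The short sketch below evaluates the three factors with Python's standard `math.factorial`; the helper name `splits_into_three` is hypothetical, and the formula it implements is simply the expression n!/(k! k! k! 3!) with k = n/3 given above.

```python
from math import factorial, log10


def splits_into_three(n):
    """Number of ways to split n items into three equally sized, unordered subsets."""
    if n % 3 != 0:
        raise ValueError("n must be divisible by 3")
    k = n // 3
    # The quotient is an exact integer, so floor division loses nothing.
    return factorial(n) // (factorial(k) ** 3 * factorial(3))


if __name__ == "__main__":
    p1 = splits_into_three(687)   # cytoplasmic proteins
    p2 = splits_into_three(105)   # extracellular proteins
    p3 = splits_into_three(201)   # periplasmic proteins
    total = p1 * p2 * p3
    for name, value in [("P1", p1), ("P2", p2), ("P3", p3), ("P", total)]:
        print(f"{name} is about 10^{log10(value):.1f}")
```

Running it shows that the product is in fact many orders of magnitude larger than the lower bound of 10^141 quoted in the text, which only strengthens the argument that an exhaustive sub-sampling test is infeasible.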
References

[1] K. Nakai, M. Kanehisa, Genomics 14 (1992) 897–911.
[2] M.G. Claros, S. Brunak, G. von Heijne, Curr. Opin. Struct. Biol. 7 (1997) 394–398.
[3] H. Nakashima, K. Nishikawa, J. Mol. Biol. 238 (1994) 54–61.
[4] J. Cedano, P. Aloy, J.A. Pérez-Pons, E. Querol, J. Mol. Biol. 266 (1997) 594–600.
[5] A. Reinhardt, T. Hubbard, Nucleic Acids Res. 26 (1998) 2230–2236.
[6] K.C. Chou, D.W. Elrod, Protein Eng. 12 (1999) 107–118.
[7] O. Emanuelsson, H. Nielsen, S. Brunak, G. von Heijne, J. Mol. Biol. 300 (2000) 1005–1016.
[8] Z. Yuan, FEBS Lett. 451 (1999) 23–26.
[9] Z.P. Feng, Biopolymers 58 (2001) 491–499.
[10] Z.P. Feng, C.T. Zhang, Int. J. Biol. Macromol. 28 (2001) 255–261.
[11] S. Hua, Z. Sun, Bioinformatics 17 (2001) 721–728.
[12] G.P. Zhou, K. Doctor, PROTEINS: Structure, Function, and Genetics 50 (2003) 44–48.
[13] Y.X. Pan, Z.Z. Zhang, Z.M. Guo, G.Y. Feng, Z.D. Huang, L. He, J. Protein Chem. 22 (2003) 395–402.
[14] Y.D. Cai, K.C. Chou, Biochem. Biophys. Res. Commun. 305 (2003) 407–411.
[15] K.C. Chou, Y.D. Cai, J. Biol. Chem. 277 (2002) 45765–45769.
[16] K.C. Chou, PROTEINS: Structure, Function, and Genetics 43 (2001) 246–255 (Erratum: Proteins: Struct. Funct. Genet. 2001, vol. 44, 60).
[17] M. Ashburner, C.A. Ball, J.A. Blake, D. Botstein, H. Butler, J.M. Cherry, A.P. Davis, K. Dolinski, S.S. Dwight, J.T. Eppig, M.A. Harris, D.P. Hill, L. Issel-Tarver, A. Kasarskis, S. Lewis, J.C. Matese, J.E. Richardson, M. Ringwald, G.M. Rubin, G. Sherlock, Nat. Genet. 25 (2000) 25–29.
[18] J.J. Chou, H. Matsuo, H. Duan, G. Wagner, Cell 94 (1998) 171–180.
[19] J.J. Chou, H. Li, G.S. Salvesen, J. Yuan, G. Wagner, Cell 96 (1999) 615–624.
[20] R. Apweiler, T.K. Attwood, A. Bairoch, A. Bateman, E. Birney, M. Biswas, P. Bucher, L. Cerutti, F. Corpet, M.D.R. Croning, R. Durbin, L. Falquet, W. Fleischmann, L. Gouzy, H. Hermjakob, N. Hulo, I. Jonassen, D. Kahn, A. Kanapin, Y. Karavidopoulou, R. Lopez, B. Marx, N.J. Mulder, T.M. Oinn, M. Pagni, F. Servant, C.J.A. Sigrist, E.M. Zdobnov, Nucleic Acids Res. 29 (2001) 37–40.
[21] K.C. Chou, D.W. Elrod, Biochem. Biophys. Res. Commun. 252 (1998) 63–68.
[22] Y.D. Cai, K.C. Chou, Mol. Cell Biol. Res. Commun. 4 (2000) 172–173.
[23] K.C. Chou, C.T. Zhang, Crit. Rev. Biochem. Mol. Biol. 30 (1995) 275–349.
[24] G.P. Zhou, J. Protein Chem. 17 (1998) 729–738.
[25] G.P. Zhou, N. Assa-Munt, PROTEINS: Structure, Function, and Genetics 44 (2001) 57–59.
[26] K.V. Mardia, J.T. Kent, J.M. Bibby, in: Multivariate Analysis, Academic Press, London, pp. 322 and 381.