Forum
TRENDS in Genetics Vol.17 No.12 December 2001
Web Watch
Harnessing the cellular immune system to the gene-prediction cart

Yael Altuvia, Gila Lithwick and Hanah Margalit

Prediction of genes and verification of their bona fide expression in the cell are major challenges of the post-genomic era. Here, we demonstrate how information from the apparently unrelated field of cellular immunology can be recruited for these challenging tasks. The cellular immune system presents short peptides that are the degradation products of both foreign and self-proteins expressed in the cell. We carried out a comprehensive search comparing these peptides to all accumulated human sequence data. Our findings illustrate how these ‘presented self-peptides’ are informative for the identification of new genes, for hypothetical gene verification, for verifying gene expression at the protein level and for supporting splice junctions.
Gene prediction in higher eukaryotes has never been considered an easy task. Yet it was only recently, with the draft of the human genome at hand, that the full extent of this challenge could be appreciated. The sparseness of the genes, the low exon to intron ratio and the high number of genes capable of alternative splicing make the prediction of human genes and human gene products very difficult [1,2]. Moreover, even when a predicted gene has experimental support in the form of mRNA, the challenge of proving that its protein product is actually expressed in human cells still remains. Here, we illustrate how information from the apparently unrelated field of cellular immunology can aid in gene detection and in the confirmation of otherwise hypothetical genes and proteins.

The use of naturally processed peptides in gene annotation
One of the major roles of the cellular immune system is to identify and destroy cells expressing non-self proteins. To this end, short peptides derived from endogenous proteins, termed naturally processed peptides (NPPs), are constantly presented by major histocompatibility complex (MHC) class I molecules on the cells’ surface. Cytotoxic T cells, trained to recognize non-self ligands, survey the cells regularly, ready to destroy cells presenting foreign peptides. In normal uninfected cells, however, all NPPs originate from self cellular products that were marked for degradation. These include native proteins, as well as ‘defective ribosomal products’ (DRiPs) consisting of various damaged proteins and protein fragments [3]. Thus, NPPs can be viewed as the remnants of translated gene products, providing leads to their source genes. Moreover, matching between an NPP sequence and a protein sequence translated from the human genome provides an indication of bona fide gene expression, as it designates the presence of this protein in a human cell.

Fortunately, a relatively large database (named SYFPEITHI) of hundreds of individually sequenced NPPs eluted from MHC molecules has been organized and is available publicly [4] (http://www.uni-tuebingen.de/uni/kxi). We used this database to demonstrate how sequence comparison between the NPPs and various human sequence databases can aid in the human genome annotation, and provide additional information regarding gene expression. We extracted the sequences of 514 MHC class I restricted NPPs of eight amino acids or more, and carried out extensive comparisons against all relevant databases (Box 1, Table I). First, the sequences were compared with all accumulated human sequence data, in the form of proteins, mRNA, expressed sequence tags (ESTs), and human protein and mRNA predictions by GenScan [5]. In addition, the availability of the draft of the human genome (http://genome.ucsc.edu, ftp://ncbi.nlm.nih.gov/genomes/H_sapiens/) allowed us to search thoroughly for the source proteins of the NPPs in the translated human genome, matching each peptide to all its possible occurrences in all reading frames. Finally, the peptides were compared with the general nonredundant protein and translated nucleotide databases of NCBI. The results of these comparisons were compiled into a detailed web table available at http://bioinfo.md.huji.ac.il/marg/IMtoGENE, a sample of which is shown in Fig. 1.

Box 1. Search methods

The NCBI (http://www.ncbi.nlm.nih.gov) and Ensembl (GenScan predictions from http://www.ensembl.org/) databases were searched with the BLAST program (version 2.0.13) (Refs a,b). Because the databases are very large and the query sequences are very short, we performed both gapped and ungapped searches with word size two, high E-value (ranging from 30 000 to 100 000, depending on the database size) and no repeat filtering. For each query only the top ten hits were kept. The hits were then filtered further, so that only hits with at most one mismatch were considered. Peptides that did not correspond to a human protein or translated mRNA sequence, or to the NCBI human genome sequence, were further analyzed using the UCSC human draft genome sequences and browser, and the following three programs:

(1) The UCSC BLAT program, which searches the UCSC human genome draft sequences (http://genome.ucsc.edu);
(2) Our EXACT-F utility, which looks for exact matches of peptide queries on all reading frames of the chromosomal sequences;
(3) Our SPLICE-F program, which looks for split occurrences of the peptides, indicating that the sequence encoding them might span a splice site. We used this program for peptides longer than eight amino acids. The splice junction consensus used was the canonical GT/AG junction, and we required a maximum intron length of 10 000 base pairs and a minimum exon length of ten amino acids. Splice sites were scored based on the human GT/AG base frequency matrices from the Intron Sequence and Information database (http://www.introns.com/).

The EXACT-F and SPLICE-F programs are available on request.

Table I. Databases used

Database                      Release      Size (bp)        Search method
NCBI human protein            Apr 2001     11 419 126       BLASTP
NCBI human mRNA               Apr 2001     58 181 352       TBLASTN
NCBI human genome             May 2001     2 840 245 397    TBLASTN
NCBI est_human                July 2001    1 742 153 739    TBLASTN
NCBI nr                       June 2001    222 117 092      BLASTP
NCBI nt                       June 2001    3 367 259 498    TBLASTN
Ensembl predicted peptides    Apr 2001     30 717 263       BLASTP
Ensembl predicted cDNAs       Apr 2001     92 820 993       TBLASTN
UCSC human genome draft       Apr 2001     3 257 144 810    BLAT, EXACT-F, SPLICE-F

References (Box 1)
a Altschul, S.F. et al. (1990) Basic local alignment search tool. J. Mol. Biol. 215, 403–410
b Altschul, S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25, 3389–3402

http://tig.trends.com
0168-9525/01/$ – see front matter © 2001 Elsevier Science Ltd. All rights reserved. PII: S0168-9525(01)02496-9

Gene detection and the validation of gene expression
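The exact-match search on all reading frames described in Box 1 can be illustrated with a short sketch. The Python code below is not the EXACT-F utility, whose implementation is not described in this article; it is a minimal illustration that assumes an unambiguous DNA string, the standard genetic code, and hypothetical names (`find_peptide`, `toy_dna`). It translates both strands in all three frames and reports every position at which the peptide occurs exactly.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order ('*' marks a stop codon).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(dna):
    """Return the reverse complement of an A/C/G/T string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate from the first base; codons with unknown symbols become 'X'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def find_peptide(dna, peptide):
    """List exact occurrences of `peptide` in the six-frame translation of `dna`.

    Each hit is (strand, frame, offset), where offset is the 0-based nucleotide
    position of the first codon on the strand that was searched (for '-', the
    position is counted on the reverse complement).
    """
    dna = dna.upper()
    hits = []
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for frame in range(3):
            protein = translate(seq[frame:])
            start = protein.find(peptide)
            while start != -1:
                hits.append((strand, frame, frame + 3 * start))
                start = protein.find(peptide, start + 1)
    return hits


# Toy sequence (made up for illustration): two leading bases, nine codons
# spelling MTIEMRTTR, then a stop codon.
codons = ["ATG", "ACT", "ATT", "GAA", "ATG", "CGT", "ACT", "ACT", "CGT"]
toy_dna = "CC" + "".join(codons) + "TAA"
print(find_peptide(toy_dna, "MTIEMRTTR"))
```

For chromosome-scale sequences one would probably process the sequence in overlapping windows rather than translate it in one piece; that detail is left out of the sketch.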
Many of the sequences in the database (~73%) perfectly matched a human protein and/or translated mRNA sequence (most of these matches were documented previously in the SYFPEITHI database). Such matches are especially informative for hypothetical proteins derived by conceptual translation, as they imply that the proteins are actually translated and expressed in the cell; for example, the peptide SLPDFGISYV in Fig. 1. The NPPs can even confirm expression of paralogs, as can be seen for the peptides DIREEKTSW and DIREEKASW (Fig. 1), which are nearly identical and were eluted from MHC molecules of the same cells. The former matches the ring-finger protein 21 (encoded by the RNF21 gene), and the latter matches its predicted paralog (encoded by the TRIM5 gene). Evidence of transcription exists for both genes. Thus, these two peptides provide evidence for the expression of both paralogous genes at the protein level.

Fig. 1. The applicability of naturally processed peptides (NPPs) to gene prediction. Examples of the NPP sequence comparison to the public databases (Box 1), as they appear in our web table (http://bioinfo.md.huji.ac.il/marg/IMtoGENE). The second column provides the definition of the protein as documented in the SYFPEITHI database of NPPs [4]. Each peptide is assigned a colored ‘hit string’ that describes its matches in the following databases (in this order): (1) ProtHum (white), the human protein database from NCBI; (2) mRNAHum (light grey), the human mRNA database from NCBI; (3) ChrHum (dark grey), the human genomic sequences from NCBI; (4) nr (pale yellow), the NCBI protein non-redundant database; (5) nt (yellow), the NCBI nucleotide non-redundant database; (6) GSProtHum (light pink), the Ensembl GenScan protein predictions; (7) GSmRNAHum (dark pink), the Ensembl GenScan mRNA predictions; (8) ESTHum (light brown), the NCBI human expressed sequence tag (EST) database. The database hits are marked as: +, perfect hit; ~, hit with a single mismatch; –, no hit. Hits with more than one mismatch were not considered. The top three hits for each database can be viewed through the hyperlinks of the ‘hit string’. In addition, in the Top Hit Details column, representative hits (up to three per database) are shown according to the rules described at http://bioinfo.md.huji.ac.il/marg/IMtoGENE. For NPPs without hits (i.e. all the hits are –), this field is marked N/A (not applicable). MATCHES are highlighted in red if a perfect match was found. The position (POS) of the peptide sequence within a DNA hit sequence is accompanied by a + if it was found on the plus strand of a translated nucleotide sequence, and – otherwise. The Information field indicates how the specific NPP adds to, or confirms, our knowledge with regard to gene prediction and proof of expression.

Forty-eight peptides that matched a human protein and/or translated mRNA sequence perfectly could not be associated with any region of the human genome sequence. A detailed analysis revealed that most of the sequences encoding these peptides fell into two categories: in most cases (~63%), they spanned a splice site; in another ~14% of the cases, the peptide matched perfectly to a translated mRNA sequence and not a translated genomic sequence owing to a single nucleotide difference between the DNA and mRNA sequences. The peptide NYGGGNYGSGSY (Fig. 1), which was found with a mismatch in a human protein sequence and a translated human mRNA sequence, yet not at all in the translated genomic sequence, exemplifies both phenomena. This peptide matched human ribonucleoprotein A2 with one substitution of Ser↔Asn at position 11. The codon at this position is AAT (Asn) in human and ACT (Thr) in the mouse homolog (SwissProt: O88569). Interestingly, one mutation AAT→AGT at the same alterable nucleotide will substitute Ser in place of Asn, suggesting a mutational hot spot or single nucleotide polymorphism (SNP) at this position. In addition, the peptide could be found in the translated human genome only by its two sub-sequences whose boundaries precisely fitted a splice site. Thus, NPPs can hint at possible SNPs and can support putative splice junctions.

Approximately 27% of the NPPs did not match any human protein or translated mRNA sequence. A fraction of those (~4% of the total NPPs) were found in human pathogen sequences in accordance with the documentation in the SYFPEITHI database. These were probably eluted from infected cells and were not analyzed further. Another small fraction of these peptides (~2% of the total NPPs) matched a sequence in at least one of the other databases. These matches are especially attractive as they provide clues for new gene candidates (Fig. 1).

Three peptides, THNPQAPVL, ELNPNAEVW and MTIEMRTTR, demonstrate potential new genes with varying degrees of supporting evidence. The peptide THNPQAPVL was found in a mammalian (Rattus norvegicus) protein, Chp. It was also found on human chromosome 15 in a region where the translation product is similar in sequence to the Chp protein. In addition, this region is predicted to be a gene by GenScan. Thus, the peptide THNPQAPVL pinpoints a potential new gene homologous to a known mammalian gene, and supports previous gene predictions for this region.

The peptide ELNPNAEVW did not correspond to any known sequence in the databases searched. However, it matched a 117 amino acid open reading frame (ORF) encoded on chromosome 10, in a region that is predicted to be a gene by GenScan.
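An ORF of the kind matched by ELNPNAEVW can, in principle, be recovered by extending a peptide hit in both directions until a stop codon is reached. The sketch below makes that idea concrete under stated assumptions: an ORF is taken to be a stop-to-stop stretch in a single reading frame (the definition behind the 117 amino acid ORF is not given in the text), the standard genetic code is used, and the function name `enclosing_orf` and the toy sequence are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order ('*' marks a stop codon).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(dna):
    """Return the reverse complement of an A/C/G/T string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate from the first base; codons with unknown symbols become 'X'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def enclosing_orf(dna, peptide):
    """Return (strand, frame, orf) for the first stop-to-stop stretch that
    contains `peptide`, searching the plus strand first, or None if absent."""
    dna = dna.upper()
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for frame in range(3):
            # Splitting the translation on '*' yields the stop-to-stop stretches.
            for segment in translate(seq[frame:]).split("*"):
                if peptide in segment:
                    return strand, frame, segment
    return None


# Toy sequence (made up for illustration): stop codon, eleven codons, stop codon.
codons = ["GGT", "GAA", "CTG", "AAC", "CCG", "AAC",
          "GCT", "GAA", "GTT", "TGG", "AAA"]
toy = "TAA" + "".join(codons) + "TGA"
hit = enclosing_orf(toy, "ELNPNAEVW")
if hit is not None:
    strand, frame, orf = hit
    print(strand, frame, len(orf), orf)
```

Whether such a stretch corresponds to a real exon still has to be judged from independent evidence, such as a gene prediction or EST support for the same region.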
The match between the peptide ELNPNAEVW and a product of a hypothetical gene lends support to this gene prediction and to the expression of that gene in the cell.

The peptide MTIEMRTTR is an NPP that did not match any known or predicted gene product, yet it matched a 29 amino acid ORF on chromosome 10. In addition, this ORF could be associated with nine ESTs from the ‘human dbEST’ database. Thus, it could be a product of a new gene with the 29 amino acid ORF being one of its exons.

Conclusions
The various examples given here demonstrate the relevance of NPPs to human gene detection and to validation of hypothetical proteins. These peptides are reported continuously, and the SYFPEITHI database is being constantly updated. Thus, they can be used as an additional source of information for the human genome annotation. Furthermore, one can envision the usefulness of NPPs for indication of tissue specificity, both for newly detected proteins and for well-defined proteins.

Interestingly, many NPPs were found in highly expressed proteins (e.g. histones, ribosomal proteins). This might suggest that many source proteins of other NPPs in the database could be highly expressed as well. Thus, the NPPs could provide additional information regarding the level of protein expression.

It should be noted that because the peptides are short (usually 9±1 amino acids), they could often be matched to more than one protein. However, even in such cases the presence of the peptides increases the likelihood of expression of one of the putative proteins. Likewise, because alternative splicing is common in human genes [1], an NPP might match several splice variants. Still, the NPPs provide a level of information beyond the expression of mRNA because they indicate the translation of the exons where they are found.

Although we have focused on human NPPs, NPPs can be informative for other organisms as well. For example, a large database of mouse NPPs eluted from mouse MHC molecules is also available [4]. A similar analysis of these NPPs can aid in the annotation of the mouse genome, the sequencing of which is currently in progress. Thus, elution of NPPs from different cells and tissues is useful not only for immunological purposes, but also for gene prediction and for inference on expression patterns at the protein level. They constitute an available, growing resource that can provide valuable information for the annotation of the human genome.

Acknowledgements
This study was supported by the Israeli Cancer Research Foundation. Yael Altuvia and Gila Lithwick contributed equally to this work.

Yael Altuvia
Gila Lithwick
Hanah Margalit*
Dept of Molecular Genetics and Biotechnology, The Hebrew University – Hadassah Medical School, PO Box 12272, Jerusalem 91120, Israel.
*e-mail: [email protected]

References
1 International Human Genome Sequencing Consortium (2001) Initial sequencing and analysis of the human genome. Nature 409, 860–921
2 Koonin, E.V. (2001) Computational genomics. Curr. Biol. 11, R155–R158
3 Yewdell, J. et al. (2001) At the crossroads of cell biology and immunology: DRiPs and other sources of peptide ligands for MHC class I molecules. J. Cell Sci. 114, 845–851
4 Rammensee, H. et al. (1999) SYFPEITHI: database for MHC ligands and peptide motifs. Immunogenetics 50, 213–219
5 Burge, C. and Karlin, S. (1997) Prediction of complete gene structures in human genomic DNA. J. Mol. Biol. 268, 78–94
Erratum Flavell, A. [2001] Retrotransposons Rule in Carry-le-Rouet. Trends in Genetics 17, 489–490 This report on the meeting ‘Retrotransposons: Their Impact on Organisms, the Genome and Biodiversity’ should have acknowledged the support of the European Science Foundation, without whom the meeting could not have taken place. PII: S0168-9525(01)025044-6