Gene 461 (2010) 1–4
Review
An overview of the current status of eukaryote gene prediction strategies
Roy D. Sleator ⁎
Department of Biological Sciences, Cork Institute of Technology, Rossa Avenue, Bishopstown, Cork, Ireland
Article info
Article history: Received 22 March 2010 Received in revised form 15 April 2010 Accepted 16 April 2010 Available online 27 April 2010
Abstract
As sequence data continues to be generated at an exponential rate, our dependence on effective in silico gene prediction methods is also increasing. Herein, I review the current state of eukaryote gene prediction methods: their strengths, weaknesses and future directions. © 2010 Elsevier B.V. All rights reserved.
Received by A.J. van Wijnen
Keyword: Gene prediction
1. Introduction

The publication of the draft human genome sequence in 2001 marked a watershed in the genomics era (Lander et al., 2001; Venter et al., 2001). However, rather than heralding the end of large-scale sequencing projects, the completion of the human genome project enabled sequencing facilities to turn their considerable resources to even more ambitious projects (Sleator et al., 2008). One exciting example is the human microbiome initiative, which aims to sequence the totality of all microbes in, or on, the human body, thus providing us with an extended view of ourselves as super-organisms (Turnbaugh et al., 2007; Sleator, 2010). However, as sequence data continues to increase exponentially, our ability to annotate the information and to accurately pinpoint coding regions has lagged considerably behind.

While prokaryote gene annotation can be complicated by overlapping regions, which make identification of translation start sites difficult (Palleja et al., 2008), eukaryote gene structure prediction is even more complex (Lewis et al., 2000). In addition to a low density of coding sequence (∼3% in human DNA), eukaryote coding regions (exons) are often widely interspersed with non-coding intervening sequences (introns) (Lander et al., 2001; Venter et al., 2001). Furthermore, eukaryote coding sequences are subject to alternative splicing, a process of shuffling genetic information which facilitates the synthesis of more than one protein from a single gene sequence (Schellenberg et al., 2008). Indeed, it is estimated that
Abbreviations: MM, Markov model; PWM, positional weight matrices; WAM, weight array model; IMM, interpolated Markov models; HMM, hidden Markov models; GHMM, generalized hidden Markov models; SMCRF, semi-Markov conditional random field; EHMM, evolutionary hidden Markov models; UTRs, un-translated regions; FASTA, Fast-all; BLAST, basic local alignment search tool; TSS, transcriptional signal sensors; GFF, general feature format; ncRNA, non-coding RNAs; miRNAs, microRNAs.
⁎ Fax: +353 21 432 6851. E-mail address: [email protected].
0378-1119/$ – see front matter © 2010 Elsevier B.V. All rights reserved. doi:10.1016/j.gene.2010.04.008
more than 95% of human genes show evidence of at least one alternative splice site.

Current in silico gene prediction methods involve two distinct aspects: the first centres on the type of information utilized by gene prediction programs (i.e. the evidence for the existence of a gene based on functional signal recognition), while the second involves the algorithms employed by these programs to accurately predict gene structure and organisation (Fig. 1). Although gene prediction has been the subject of several excellent review papers (Do and Choi, 2006; Brent, 2007; Flicek, 2007), the current study is designed to appeal to expert and non-expert readers alike, providing a concise overview of the current status of eukaryote gene prediction models. It begins with a brief overview of sensor recognition and gene-finders, their strengths and weaknesses, before concluding with an outline of some outstanding issues which still need to be addressed.

2. Functional sensor recognition

Just as large-scale genome sequencing projects relied on the existence of molecular markers to facilitate genome assembly, gene annotation strategies rely on sensors within the DNA sequence to allow accurate delineation of gene structure and organisation (Mathe et al., 2002). Two types of sensor (content and signal) are routinely used to locate genes in the genomic sequence:

(i) Content sensors classify DNA into coding regions and non-coding regions (introns, intergenic regions and un-translated regions (UTRs)). Content sensors can be further divided into extrinsic and intrinsic sensors. Based on the assumption that coding sequences are more conserved than non-coding ones (Mathe et al., 2002), extrinsic content sensors exploit homology searches to identify highly conserved exons. Using local alignment methods (ranging from the optimal Smith–Waterman algorithm
Fig. 1. Schematic overview of eukaryote gene prediction methods and the underlying sensors routinely used to locate genes in genomic sequences.
to fast heuristic approaches such as FASTA and BLAST), two approaches can be employed. The first relies on inter-genomic (cross-species) comparisons; while this strategy may be effective, it is limited by the constraints of phylogenetic distance. The second approach overcomes this limitation by employing intra-genomic comparisons, which provide data for multigenic families, representing a large percentage of existing genes (e.g. up to 80% for Arabidopsis). A significant failing of extrinsic approaches is that they are limited to homologies within the database; if no homologs exist, no data can be extracted. Intrinsic content sensors, on the other hand, focus on specific innate characteristics of the DNA sequence itself, which help to predict the likelihood that the sequence in question “codes” for a protein. The most obvious indicator of coding versus non-coding sequence identified to date is hexamer frequency (i.e. the frequency of 6-nucleotide-long words) (Mathe et al., 2002). Other useful intrinsic content sensors include nucleotide composition, codon usage and base occurrence periodicity. Coding regions are defined by three Markov models (MMs; see Box 1), one for each position inside a codon. These three-periodic MMs are based on the k-mer (especially hexamer) composition of coding sequence and are trained on a set of known sequences before being used to detect a particular content. Region-specific content sensors for coding and non-coding regions, or even for different subtypes of non-coding regions, have been developed (Mathe et al., 2002).

(ii) Signal sensors detect the presence of functional sites specific to a gene. To date, signals relating to transcription, translation and splicing have all been employed to facilitate gene identification and structure prediction.
Transcriptional signal sensors (TSS) include the initiator or cap signal located at the transcriptional start site and the upstream TATA box promoter signal, as well as the polyadenylation signal (a consensus AATAAA hexamer) located 20 to 30 bp downstream of the coding region. Translational signals include the “Kozak signal” located immediately upstream of the start codon (Kozak, 1996). However, given that higher eukaryote genes in particular harbour multiple exons, accurate gene structure prediction in these organisms relies heavily on the identification of splice site signals (Stamm, 2008), specifically donor and acceptor sites (GT-AG on the intron sequence) and branch points (CU[A/G]A[C/U] located 20–50 bp upstream of the AG acceptor).
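The consensus signals described in this section (the GT donor and AG acceptor dinucleotides and the AATAAA polyadenylation hexamer) can be located with a simple exact-match scan. The following Python sketch is illustrative only: the function names and the toy sequence are hypothetical, and it is not how any particular gene-finder works. In practice, signal sensors usually score candidate sites with probabilistic models such as the PWMs and WAMs listed in Box 1, because exact consensus matches occur very frequently by chance.

```python
import re


def find_signal_candidates(seq):
    """Return 0-based start positions of simple consensus signals.

    Assumption: exact string matches only; no scoring of flanking context.
    """
    seq = seq.upper()
    # Zero-width lookaheads report every occurrence, including overlapping ones.
    donors = [m.start() for m in re.finditer(r"(?=GT)", seq)]
    acceptors = [m.start() for m in re.finditer(r"(?=AG)", seq)]
    polya = [m.start() for m in re.finditer(r"(?=AATAAA)", seq)]
    return {"donor_GT": donors, "acceptor_AG": acceptors, "polyA_AATAAA": polya}


def candidate_introns(seq, min_len=4):
    """Pair each GT with each downstream AG to list GT...AG spans.

    Spans are (start, end) with an exclusive end, so seq[start:end]
    begins with GT and ends with AG. min_len is a hypothetical cut-off.
    """
    sites = find_signal_candidates(seq)
    spans = []
    for d in sites["donor_GT"]:
        for a in sites["acceptor_AG"]:
            end = a + 2  # one past the G of the AG acceptor
            if end - d >= min_len:
                spans.append((d, end))
    return spans


if __name__ == "__main__":
    toy = "ATGGTAAGTCCCTAGGCAATAAA"  # hypothetical sequence, for illustration
    print(find_signal_candidates(toy))
    print(candidate_introns(toy))
```

Even on this 23-nucleotide toy sequence the scan proposes several overlapping GT...AG spans, which illustrates why signal sensors alone are insufficient and are combined with content sensors.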
3. Gene predictor programs: strengths and weaknesses

Gene prediction programs can be divided into two classes: empirical and ab initio gene-finders.

(i) Empirical gene predictors, also referred to as sequence similarity based gene-finders, identify genes based on homology searches of known databases (genomic DNA, cDNA, dbEST, or protein). The comparison of two (or more) homologous genomic sequences (either inter- or intra-species) facilitates the identification of conserved exons. When combined with signal sensors, this information allows us to refine region boundaries and more accurately model gene structure and organisation.

(ii) Ab initio (or de novo) gene-finders rely on sequence information afforded by both signal and content sensors (Do and Choi, 2006). The algorithms employed by these programs to model gene structure include neural networks, Fourier transforms and, most commonly, Markov models (outlined in Box 1). Ab initio gene-finders can be categorized based on the number of genome sequences employed for gene analysis and include single-, dual- and multiple-genome predictors (Brent and Guigo, 2004). Single-genome predictors, such as GENSCAN (Burge and Karlin, 1997), which focus exclusively on one genome, are faster and easier to run than the equivalent dual- or multi-genome predictors. Furthermore, given that only one genome is considered, single-genome predictors are not restricted by phylogenetic distance, i.e. by the availability of closely related genomic sequences. However, dependence on a single genome can be restrictive, particularly given that newly sequenced genomes may contain as few as 50% known genes from which to estimate model parameters (Guigo et al., 2000). To help overcome this limitation, dual-genome predictors, such as TWINSCAN (Flicek et al., 2003), have been developed to exploit sequence conservation between two related genomes (e.g. mouse and man).
Alignments are performed first and the resulting data are used to inform prediction algorithms such as Hidden Markov Models (outlined in Box 1). However, there remain inherent uncertainties in reconstructing the lineages of genomic regions for two organisms as distantly related as human and mouse, owing to the extent of genomic restructuring which has occurred since their last common ancestor. Given that the genomes of
Box 1
An overview of Markov Models in sequence analysis and gene prediction.

A Markov model (MM) is a stochastic model which assumes that the probability of a particular nucleotide occurring at a given position depends only on the k previous nucleotides. Here k is the order of the MM; the larger k is, the more finely the MM can characterize dependencies between adjacent nucleotides. Such a model is defined by the conditional probabilities P(X|k previous nucleotides), where X = A, T, G or C. In order to build a Markov model, a learning set of sequences, on which these probabilities will be estimated, is required. The most frequently used categories of MMs in eukaryote gene prediction methods are outlined below:

Positional weight matrices (PWM): The simplest MMs are homogeneous zero order MMs which assume that each base occurs independently with a given frequency. Such simple models are often used for non-coding regions.
Weight array model (WAM): An inhomogeneous higher order MM capable of capturing potential dependencies between adjacent positions of a signal.
Three-periodic Markov model: Characterizes coding sequences. Coding regions are defined by three MMs, one for each position inside a codon.
Interpolated Markov models (IMM): IMMs combine statistics from several MMs, from order zero to a given order k (typically k = 8), according to the information available.
Hidden Markov models (HMM): HMMs allow for insertions and deletions and so variation in signal length.
Generalized Hidden Markov models (GHMM): GHMMs allow a string, rather than a single symbol, as the output of a state.
Semi-Markov conditional random field (SMCRF): A more flexible variation of GHMM which allows a wider range of biological features to be incorporated with fewer technical concerns (Bernal et al., 2007).
Evolutionary Hidden Markov Models (EHMM): EHMMs model molecular evolution as a Markov process in two dimensions: a substitution process over time at each site in the aligned genomes, which is guided by a phylogenetic tree; and a process by which the rate of evolution changes from one site to the next (Brent and Guigo, 2004).
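The definition of an order-k Markov model given in Box 1 can be made concrete with a short sketch. The Python code below estimates the conditional probabilities P(X | k previous nucleotides) from a learning set and uses two such models to compute a log-odds score for a query sequence. It is a minimal illustration under stated assumptions: the function names, the add-one pseudocount and the tiny training sets are all hypothetical, and real gene-finders use far larger training sets and, for coding regions, three separate position-specific models.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"


def train_markov(seqs, k, pseudocount=1.0):
    """Estimate P(X | k previous nucleotides) from a learning set.

    Returns a function prob(context, x). The pseudocount (an assumption
    of this sketch) keeps unseen transitions from having probability zero.
    """
    counts = defaultdict(lambda: defaultdict(float))
    for s in seqs:
        s = s.upper()
        for i in range(k, len(s)):
            context, x = s[i - k:i], s[i]
            # Skip windows containing ambiguous symbols such as N.
            if x in ALPHABET and all(c in ALPHABET for c in context):
                counts[context][x] += 1.0

    def prob(context, x):
        seen = counts.get(context, {})
        total = sum(seen.get(b, 0.0) for b in ALPHABET)
        total += pseudocount * len(ALPHABET)
        return (seen.get(x, 0.0) + pseudocount) / total

    return prob


def log_likelihood(seq, prob, k):
    """Sum of log P(x_i | previous k nucleotides) over the sequence."""
    seq = seq.upper()
    return sum(math.log(prob(seq[i - k:i], seq[i])) for i in range(k, len(seq)))


def log_odds(seq, model_a, model_b, k):
    """Positive values mean the sequence fits model_a better than model_b."""
    return log_likelihood(seq, model_a, k) - log_likelihood(seq, model_b, k)


if __name__ == "__main__":
    # Hypothetical toy learning sets standing in for coding / non-coding data.
    coding = train_markov(["ACACACAC"], k=1)
    noncoding = train_markov(["AAAAAAAA"], k=1)
    print(coding("A", "C"))
    print(log_odds("ACAC", coding, noncoding, k=1))
```

Increasing k lets the model capture longer-range dependencies, but the number of contexts grows as 4^k, so more training data is needed; this trade-off is what the interpolated Markov models in Box 1 address by combining several orders.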
the higher primates can be aligned much more accurately than those of human and mouse, it would appear that multi-genome predictors involving several closely related species (using EHMMs) are significantly more attractive than dual-genome predictors focusing on two distantly related genomes (Boffelli et al., 2003).

(iii) Combining gene-predictor outputs. Coupling the extrinsic approach of empirical gene-finders with intrinsic ab initio prediction programs significantly improves gene prediction protocols (Allen et al., 2004). GenomeScan, developed by Yeh et al. (2001), is an extension of GENSCAN which incorporates similarity with a protein retrieved by BLASTX, thus combining extrinsic and intrinsic approaches to gene identification. In GenomeScan, regions of higher similarity (on the basis of BLASTX E-values) are accorded more confidence than comparable regions of lower similarity. Thus, GenomeScan predictions may sometimes ignore a region that either has weak intrinsic properties (e.g. poor splice signals) or is inconsistent with other extrinsic information. As a result, GenomeScan accuracy is significantly higher than that of GENSCAN when related sequences are available.

4. Conclusions and future prospects

Although significant advances continue to be made in the gene prediction arena, several issues still need to be addressed (Do and Choi, 2006). As outlined previously by Claverie (1997), existing sensors relying on known sequences, either in the form of training sets or databases, are highly conservative and as such relatively inflexible. Furthermore, accuracy of gene prediction is highly dependent on database quality: while in extrinsic gene prediction erroneous data only affect the analysed data itself, in intrinsic prediction they can lead to corrupted training sets which dramatically affect overall program performance. In addition to problems with the
data, the design of the programs themselves can be problematic. Until recently, little commonality existed between newly developed gene prediction programs (Mathe et al., 2002). Little or no equivalence in outputs or vocabulary made cooperative data analysis by more than one program difficult, if not impossible. By designing a general feature format (GFF) to standardise all gene predictor outputs, it will be possible to develop common tools for downstream analysis: evaluation, graphical representation and ultimately the development of combination predictors.

Other factors complicating eukaryote gene prediction include the presence of extended introns (e.g. the human dystrophin gene consists of >99% introns, some of which are >100 kb). Long introns are particularly problematic when they border short exons; for example, some Arabidopsis genes contain exons which are only 3 bp long, making them extremely difficult to detect, especially given that missing such exons does not introduce a frame shift (Mathe et al., 2002). In addition, unusual examples of eukaryote gene structure and function continue to be identified; overlapping genes, for example, although more characteristic of prokaryotes, have nonetheless been reported in the genomes of both plants and animals (Quesada et al., 1999). Furthermore, though originally believed to occur exclusively in prokaryotes, polycistronic genes have also been identified in the eukaryote Caenorhabditis elegans (Blumenthal, 1998). As non-canonical cases continue to be uncovered, ever-increasing levels of sophistication will be required from newly designed gene prediction methods. To further complicate the issue, in addition to protein-coding genes, a large proportion of the human genome gives rise to RNA sequences that do not encode proteins (Taft et al., 2010).
Also known as non-coding RNAs (ncRNAs), these genes are predicted to play an important role in the regulation of eukaryote gene expression (Forrest et al., 2009; Oulas et al., 2009). Indeed, microRNAs (miRNAs), a subgroup of ncRNAs, are predicted to control the activity of approximately 30% of all protein-coding genes in mammals (Li et al., 2009). Given that a significant fraction of ncRNAs are short and/or poorly
conserved in sequence, the conceptually simple approach of homology-based transfer becomes a complex and technically demanding task, one which is further complicated by a paucity of information on RNA families. Although several recent efforts to customize sequence-based search tools for ncRNA applications have shown some success, such as the use of semi-global alignments and the development of methods for fragmented pattern search (Mosig et al., 2009), much still needs to be achieved in this area.

Finally, irrespective of the level of sophistication achieved, or the reliability of the data obtained, gene prediction methods remain just that: predictions. In silico analysis must always be followed by in vitro and/or in vivo “wet lab” experimentation to confirm the existence of a putative gene and the functionality of its predicted protein product.

Acknowledgments

The author wishes to acknowledge the continued financial assistance of the Health Research Board (HRB), the Food Institutional Research Measure (FIRM) through the Department of Agriculture and the Alimentary Pharmabiotic Centre (APC) through Science Foundation Ireland (SFI).

References

Allen, J.E., Pertea, M., Salzberg, S.L., 2004. Computational gene prediction using multiple sources of evidence. Genome Res. 14, 142–148.
Bernal, A., Crammer, K., Hatzigeorgiou, A., Pereira, F., 2007. Global discriminative learning for higher-accuracy computational gene prediction. PLoS Comput. Biol. 3, e54.
Blumenthal, T., 1998. Gene clusters and polycistronic transcription in eukaryotes. Bioessays 20, 480–487.
Boffelli, D., et al., 2003. Phylogenetic shadowing of primate sequences to find functional regions of the human genome. Science 299, 1391–1394.
Brent, M.R., 2007. How does eukaryotic gene prediction work? Nat. Biotech. 25, 883–885.
Brent, M.R., Guigo, R., 2004. Recent advances in gene structure prediction. Curr. Opin. Struct. Biol. 14, 264–272.
Burge, C., Karlin, S., 1997. Prediction of complete gene structures in human genomic DNA. J. Mol. Biol. 268, 78–94.
Claverie, J.M., 1997. Computational methods for the identification of genes in vertebrate genomic sequences. Hum. Mol. Genet. 6, 1735–1744.
Do, J.H., Choi, D.K., 2006. Computational approaches to gene prediction. J. Microbiol. 44, 137–144.
Flicek, P., 2007. Gene prediction: compare and CONTRAST. Genome Biol. 8, 233.
Flicek, P., Keibler, E., Hu, P., Korf, I., Brent, M.R., 2003. Leveraging the mouse genome for gene prediction in human: from whole-genome shotgun reads to a global synteny map. Genome Res. 13, 46–54.
Forrest, A.R.R., Abdelhamid, R.F., Carninci, P., 2009. Annotating non-coding transcription using functional genomics strategies. Brief. Funct. Genomics Proteomics 8, 437–443.
Guigo, R., Agarwal, P., Abril, J.F., Burset, M., Fickett, J.W., 2000. An assessment of gene prediction accuracy in large DNA sequences. Genome Res. 10, 1631–1642.
Kozak, M., 1996. Interpreting cDNA sequences: some insights from studies on translation. Mamm. Genome 7, 563–574.
Lander, E.S., et al., 2001. Initial sequencing and analysis of the human genome. Nature 409, 860–921.
Lewis, S., Ashburner, M., Reese, M.G., 2000. Annotating eukaryote genomes. Curr. Opin. Struct. Biol. 10, 349–354.
Li, M., Marin-Muller, C., Bharadwaj, U., Chow, K.H., Yao, Q., Chen, C., 2009. MicroRNAs: control and loss of control in human physiology and disease. World J. Surg. 33, 667–684.
Mathe, C., Sagot, M.F., Schiex, T., Rouze, P., 2002. Current methods of gene prediction, their strengths and weaknesses. Nucleic Acids Res. 30, 4103–4117.
Mosig, A., Zhu, L., Stadler, P.F., 2009. Customized strategies for discovering distant ncRNA homologs. Brief. Funct. Genomics Proteomics 8, 451–460.
Oulas, A., Reczko, M., Poirazi, P., 2009. MicroRNAs and cancer - the search begins! IEEE Trans. Inf. Technol. Biomed. 13, 67–77.
Palleja, A., Harrington, E.D., Bork, P., 2008. Large gene overlaps in prokaryotic genomes: result of functional constraints or mispredictions? BMC Genomics 9, 335.
Quesada, V., Ponce, M.R., Micol, J.L., 1999. OTC and AUL1, two convergent and overlapping genes in the nuclear genome of Arabidopsis thaliana. FEBS Lett. 461, 101–106.
Schellenberg, M.J., Ritchie, D.B., MacMillan, A.M., 2008. Pre-mRNA splicing: a complex picture in higher definition. Trends Biochem. Sci. 33, 243–246.
Sleator, R.D., 2010. The human superorganism - of microbes and men. Med. Hypotheses 74, 214–215.
Sleator, R.D., Shortall, C., Hill, C., 2008. Metagenomics. Lett. Appl. Microbiol. 47, 361–366.
Stamm, S., 2008. Regulation of alternative splicing by reversible protein phosphorylation. J. Biol. Chem. 283, 1223–1227.
Taft, R.J., Pang, K.C., Mercer, T.R., Dinger, M., Mattick, J.S., 2010. Non-coding RNAs: regulators of disease. J. Pathol. 220, 126–139.
Turnbaugh, P.J., Ley, R.E., Hamady, M., Fraser-Liggett, C.M., Knight, R., Gordon, J.I., 2007. The human microbiome project. Nature 449, 804–810.
Venter, J.C., et al., 2001. The sequence of the human genome. Science 291, 1304–1351.
Yeh, R.F., Lim, L.P., Burge, C.B., 2001. Computational inference of homologous gene structures in the human genome. Genome Res. 11, 803–816.