Genome annotation: which tools do we have for it?
Pierre Rouzé*, Nathalie Pavy† and Stephane Rombauts‡
Genome data have to be converted into knowledge to be useful to biologists. Many valuable computational tools have already been developed to help annotation of plant genome sequences, and these may be improved further, for example by identification of more gene regulatory elements. The lack of a standard computer-assisted annotation platform for eukaryotic genomes remains a major bottle-neck.
Addresses
Laboratoire Associé de l'INRA (France), Department of Genetics, Flanders Interuniversity Institute of Biotechnology, University of Ghent, KL Ledeganckstraat 35, B-9000 Gent, Belgium
*e-mail: [email protected]
†e-mail: [email protected]
‡e-mail: [email protected]
Current Opinion in Plant Biology 1999, 2:90-95
http://biomednet.com/elecref/1369526600200090
© Elsevier Science Ltd ISSN 1369-5266
Abbreviations
AGI Arabidopsis genome initiative
EST expressed sequence tag
Introduction
High throughput genome sequencing projects are producing an enormous amount of raw sequence data. For plants, the complete sequence of the Arabidopsis thaliana nuclear genome is expected for the year 2000; about one-third of the expected 120 Mb whole sequence is already available [1,2••] and sequencing of the rice genome has begun (see S Goff, this issue, pp 86-89). Unannotated sequences are not useful as such, except to find a given DNA sequence, because rich database queries cannot be done and the relevance of homology searches cannot be evaluated. What biologists are looking for is information derived from the genome sequence (markers, genes, mRNA or deduced protein sequence), not the sequence itself. To be meaningful to them, the genome sequences have to be converted into biologically significant knowledge; annotation is the first step toward knowledge acquisition. Contrary to a recent opinion [3], we believe that this step is better done as soon as possible. Annotation is certainly a risky process and at present its result is far from optimal, even for well-documented bacterial genomes [4•]. But it is a trade-off: the benefit for the plant research community is the ability to mine the sequence data more efficiently, and hence to plan classic and genome-wide experiments which would yield wider biological knowledge more quickly, and would include a validation of the annotation itself. As pointed out by Meinke et al. [2••], a long list of fundamental and technological research advances has already resulted from the Arabidopsis genome initiative (AGI); for a number of these advances the availability of annotated sequence played an important role.
For a genomic DNA sequence, annotation is the process by which an anonymous sequence is documented, by positioning along the sequence the various sites and segments that are involved in the genome functionality. Many elements of possible biological relevance that are not, or not yet, linked to a particular gene may well be annotated, such as matrix attachment regions (MARs) [5] and sequence repeats. Nevertheless, the basic objects in genome annotation are obviously the genes and the related elements that are involved in their structure, their expression, and their function or the function of their encoded products. Here, we will particularly consider recent annotation tools that have been developed for plant nuclear genomes, and will also present tools and results obtained from other genomes, as much more information is available from bacterial, yeast and mammalian (especially human) genome projects. Besides, plant nuclear genomes are not the only ones of interest. The genomes of the many plant-interacting organisms for which genomics projects are ongoing [6], as well as organellar genomes, are useful sources of information for plant biologists.
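To make the idea of "positioning sites and segments along the sequence" concrete, the following minimal Python sketch shows one possible way to represent annotated objects on a contig. It is an illustration only: the class names `Feature` and `Gene`, the 1-based inclusive coordinates and the `overlapping` helper are assumptions made for this example, not the data model of any annotation system discussed in this review.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Feature:
    """A site or segment positioned on a contig (hypothetical model)."""
    kind: str    # e.g. "exon", "MAR", "repeat"
    start: int   # 1-based, inclusive
    end: int     # 1-based, inclusive
    strand: str  # "+" or "-"

    def __post_init__(self) -> None:
        if self.strand not in ("+", "-"):
            raise ValueError("strand must be '+' or '-'")
        if not 1 <= self.start <= self.end:
            raise ValueError("need 1 <= start <= end")

    @property
    def length(self) -> int:
        return self.end - self.start + 1


@dataclass
class Gene:
    """A compound object: a named gene built from exon features."""
    name: str
    strand: str
    exons: List[Feature] = field(default_factory=list)

    def span(self) -> Tuple[int, int]:
        # Leftmost and rightmost annotated positions of the gene.
        return (min(e.start for e in self.exons),
                max(e.end for e in self.exons))

    def coding_length(self) -> int:
        # Sum of exon lengths (introns excluded).
        return sum(e.length for e in self.exons)


def overlapping(a: Gene, b: Gene) -> bool:
    """True if the two gene spans share at least one position,
    regardless of strand."""
    a0, a1 = a.span()
    b0, b1 = b.span()
    return a0 <= b1 and b0 <= a1


gene_a = Gene("geneA", "+", [Feature("exon", 101, 250, "+"),
                             Feature("exon", 400, 520, "+")])
gene_b = Gene("geneB", "-", [Feature("exon", 500, 700, "-")])
```

Keeping the strand on every feature is a deliberate choice in this sketch, because a genome contig carries genes on both strands and their spans may overlap.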
Structural and functional annotation
Although gene annotation has been in existence as long as the sequence databases themselves, and is familiar to any biologist who has cloned and sequenced a particular gene, annotating genomic sequences is in fact a new kind of task. Many tools used for sequences of individual genes are inadequate, firstly because of the size of the genome contig sequences. Sequence size in itself is often a problem for pre-genome software, together with the inability to deal with sequences containing several genes on both strands. Secondly, and more importantly, the sequence of a single gene often comes as a result of a carefully designed experimental approach which supplies biological information, contrary to systematically obtained genome sequences [3]. Functional genomics analysis in plants [7•] should partially fill this gap by providing genome-wide experimental data on function and expression of genes, and will undoubtedly contribute to a much improved functional annotation in the near future.
Annotation is a two-step process. The first step can be described as structural annotation and consists of finding the location of the biologically relevant sites and strings, ending up with a coherent model for the whole sequence where each object is properly defined and each object component has a unique location. As the major objects are genes, this step is often likened to gene finding, although it is a true annotation step. After structural annotation has been carried out, there is a generic description of the whole sequence, which can now be used in a more powerful way for database searches (e.g. for deduced protein sequences)
and for experimental purposes (e.g. to design primers for exons). The second step is an information processing step, and is best described as functional annotation. It consists of attributing a range of specific biologically relevant information (e.g. species, source, gene, gene product or domain name and function) to the sequence as a whole, to each compound object and to each individual gene component. This information may reside inside the database proper, for example in specified fields and in feature tables as in the GenBank/EMBL/DDBJ public DNA databases, or may be obtained through links to other databases [8,9•]. SWISS-PROT, which now offers links to 29 different databases [10•], is the best example of functional annotation, and this expert-curated protein database represents a turning point in interconnected databases.
Structural annotation: finding and locating the genes
Computational approaches to gene finding are a very active field in bioinformatics, with significant advances in the past few years. Several reviews aimed at a variety of readerships have been written recently on this issue [11,12•,13•,14,15••,16••] and a useful bibliography on computational gene recognition is maintained on the Internet by Li [17]. Fundamentally, there are two ways to find a gene. The more obvious way is to search in sequence databases for a homologue. Besides giving the exon/intron structure, the homology approach also gives some indication of gene function. This is done using the familiar BLAST or FASTA suite of programs; BLASTN searching in DNA databases is useful for the really close homologues, which include expressed sequence tags (ESTs), whereas BLASTX searching in protein databases is much more sensitive.
New flavours of BLAST have been introduced: allowing gapping (BLAST2) and higher search sensitivity through an iterative process (PSI-BLAST) [18], taking into account pattern information [19], or providing interactive and specially tuned searching for large contig analysis (PowerBLAST) [20]. These approaches, nevertheless, depend on the existence of homologues in the database, and on their correct annotation [3]. In practice, for the Arabidopsis genome, a close homologue can be found for about one third of the genes, and either distant or partial similarities for an additional third, at best. Modeling gene structure from homology alone is valid only for the first third. Specific software such as PROCRUSTES [21•,22], EST-GENOME [23], sim4 [24•] and EbEST [25] has been written to find gene structure through comparison with ESTs or distant cDNAs. AAT is an analysis and annotation tool with programs for comparing the query sequence with a protein database [26], and is the only one routinely used for Arabidopsis. Bork and Koonin [27] have given a perceptive analysis of the issue of predicted protein sequences and pointed to the potential problems and bottlenecks of this exercise: first, the lack of an accepted gene prediction system integrating a robust and updated suite of sequence analysis methods and, second, the poor or misleading information in databases, which easily ends up in bad annotations which then propagate.
To search for genes with no homologue, and to confirm and extend the ones with poor homologues, programs have been developed to find genes only from the knowledge of the sequence. They rely on characteristic gene properties that leave clues in their sequence [11], which are mostly linked to their ability to be expressed, for example regularity and nucleotide bias in the coding sequence, and motifs and composition bias for transcription and translation. Many algorithms have been written to find either gene elements (e.g. start or splice sites) or compound objects (e.g. exons), or to model genes as a whole. Obviously the task differs from organism to organism. For prokaryotes, where the coding information is dense and uninterrupted, it is easier than for higher eukaryotes; two gene prediction programs are the current standards: GeneMark [28] and Glimmer [29]. For eukaryotes, software development is largely driven towards the human genome. This software, as well as software devised for yeast or C. elegans, cannot be used as such for the plant genomes. Even if the basic molecular gene replication and expression mechanisms are conserved, there are significant taxa peculiarities (e.g. trans-splicing in worms) and the genome style variation is a reality [30•]. Software therefore has to be specifically developed for each genome or adapted, at best retrained, from one genome to the other. Software developed for Arabidopsis, a dicot, is often not adapted to rice, a monocot. Likewise, the validation measures made for a given software in a given species, for example humans [31], do not necessarily apply in another species for which this software may have been adapted, such as Arabidopsis.
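As a hedged illustration of what "finding genes only from the knowledge of the sequence" means in the simplest, prokaryote-like case, the sketch below scans both strands for uninterrupted open reading frames. It is a toy: it is not the method of GeneMark or Glimmer, which rely on statistical models of coding bias, and it ignores introns entirely. The function names and the `min_codons` threshold are assumptions made for this example.

```python
STOPS = {"TAA", "TAG", "TGA"}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def _orfs_one_strand(seq, min_codons):
    """Yield (start, end) as 0-based half-open intervals for ATG...stop
    open reading frames in the three frames of one strand."""
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first ATG opens the frame
            elif start is not None and codon in STOPS:
                # Count codons from ATG up to, but excluding, the stop.
                if (i - start) // 3 >= min_codons:
                    yield start, i + 3
                start = None


def find_orfs(seq, min_codons=3):
    """Return sorted (start, end, strand) tuples in 1-based inclusive
    coordinates of the forward strand."""
    seq = seq.upper()
    n = len(seq)
    hits = []
    for s, e in _orfs_one_strand(seq, min_codons):
        hits.append((s + 1, e, "+"))
    for s, e in _orfs_one_strand(reverse_complement(seq), min_codons):
        # Map reverse-strand coordinates back onto the forward strand.
        hits.append((n - e + 1, n - s, "-"))
    return sorted(hits)


forward_example = "CC" + "ATGAAACCCGGGTAA" + "CC"
print(find_orfs(forward_example))
print(find_orfs(reverse_complement(forward_example)))
```

In an intron-containing eukaryotic genome such a scan recovers at best isolated exons, which is one reason why the eukaryotic programs discussed below have to model splice sites and whole gene structures.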
Several annotation programs have recently been developed or adapted for gene prediction in Arabidopsis, most of them used by the various AGI consortiums (see Tables 1 and 2): SplicePredictor [32], NetPlantGene [33] and NetGene [34•] for splice site prediction, NetStart [35] for translational start, GeneMark [28], the ACeDB GeneFinder (P Green, personal communication), GRAIL [36], MZEF [37•] and Solovyev's gene-finder suite (FGENEA, FEXA and ASPL) [38] for exon prediction, and GENSCAN [39•] and GeneMark.hmm [40] for gene modelling on both strands. No evaluation of the respective performance of this software is yet available, but we are in the process of assessing it. Surprisingly, GeneMark.hmm, a program not yet utilized for annotation, appears to be the most efficient gene predictor for Arabidopsis (N Pavy, S Rombauts, VV Ramana Daluvuri, C Mathe, P Dehais et al., unpublished data). GeneGenerator [41•] is an integrative gene prediction program developed for maize; a version adapted for Arabidopsis is in preparation (J Kleffe, personal communication). Even if the performance of this software in Arabidopsis is reasonably high for sites or individual exons, it has been observed during careful manual annotation of a 400 Kb contig (N Terryn et al., personal communication) that gene modeling remains poor whatever the program: many genes are wrongly fused or split, border exons are often missing or added wrongly, and overlapping genes are not rare, although this has never been experimentally observed. Similar observations were made for
the human [13•] and C. elegans [42•] genomes. In order to improve the prediction of gene borders and to gain further biologically relevant information, the prediction of 5' and 3' untranslated regions [43] and especially of core promoters has been attempted [44•,45••], with limited success so far [15••]. Computer prediction of tRNA genes is efficient [46]. Other biological objects or remarkable features are often reported, such as transposable elements and repeats in Arabidopsis, for which two useful databases have been developed [47]. Software for MAR prediction has recently been developed, on the basis of statistics of motifs often found associated with MARs in vertebrates [5]; its value for plants needs further checking.
Table 1
The Arabidopsis Genome Initiative consortia and their annotation tools.
(a) Web sites of the consortia involved in Arabidopsis sequencing and annotation groups.
Consortium   Institution                                                               URL
TIGR         The Institute for Genomic Research                                        http://www.tigr.org/tdb/at/atgenome.html
CSHL         Cold Spring Harbor Laboratory                                             http://nucleus.cshl.org/protarab/
KAOS         Kazusa Institute                                                          http://zearth.kazusa.or.jp/arabi/
SPP          Stanford, University of Pennsylvania, Plant Genome Expression Center      http://sequence-www.stanford.edu/SPP.html
ESSA         Munich Information center for Protein Sequences (MIPS)                    http://www.mips.biochem.mpg.de/proj/thal/
Genoscope    Centre National de Séquençage (France), annotation: MIPS                  http://www.genoscope.cns.fr/
(b) Prediction programs used by the different annotation groups.
Annotators (rows): TIGR, CSHL, SPP, MIPS, KAOS.
Programs (columns): NetPlantGene, MZEF, Gene-Finder, GeneFinder, GENSCAN, GRAIL, AAT, GeneMark, tRNAscan-SE.
[The X marks assigning programs to groups are not recoverable from the extracted layout.]
*For tRNA search, MIPS uses tRNA scanner [61]. †tRNA search is done at the Kazusa Institute by searching for similarity in an RNA database.
Functional annotation - the link to biological knowledge
Functional annotation of gene database entries relies on the expertise of scientists. For genome annotation, especially for the first pass, which has to be performed shortly after the release of a contig sequence, the annotator does not have the time to act as an expert. As there is no a priori knowledge of the sequence, every attribute of information has to be assessed by analogy, using a variety of computer tools. Many aspects of this process are well analysed in a recent exhaustive review [48••]. One should notice first that, up to now, it is essentially the molecular function of the gene products that is annotated this way, and only more rarely the biological function of the gene. The information on protein function can be derived from full length sequence comparisons, or better through multiple alignments, pattern and domain searches and predictions of location and structure, for which a large spectrum of computer tools, too numerous to be cited here, is now available [16••], many of them through the Internet. Predicting the molecular function of
the putative proteins encoded by the genes has many documented pitfalls [4•,27,48••], which include the risk of propagating a wrong annotation [3]. The only way to prevent this would be for experts to check each entry manually [4•]. Clearly, the quality of the previous structural annotation step will affect the quality of functional annotation. This is especially true for the many cases where the structural annotation comes from the interaction between intrinsic gene prediction and extrinsic database searches. This often ends up with correct labels given to genes listed with the wrong structure, or with wrong labels given to genes with the correct structure, or both, as we observed ourselves for Arabidopsis. As an alternative, or a help, to the time-consuming manual human expertise, active research towards automatic or computer-assisted procedures to fetch knowledge from the literature is ongoing [49-51], but is only practical at present if the literature source is a normalised text such as database entries. For proper annotation, nomenclature is a major concern which has been undervalued. Fortunately for plants, the International Society for Plant Molecular Biology has taken an early initiative to work towards a kingdom-wide gene nomenclature, available in the MENDEL database [52•].
Generating genome-wide annotations
The conversion of raw genome data into knowledge, here specifically sequence annotation, involves a series of interdependent tasks. Manual annotation is no longer a valid solution and several systems have been designed to cope with this flow of incoming genome sequences. Automatic annotation such as that done by GeneQuiz [53] has proven to be error prone in prokaryotes [4•]. Genotator [54] and GAIA [55] are interactive graphical environments, allowing
human expertise, that have recently been designed to face the needs of eukaryote genome annotations. In a further category, Imagene [56•], which has recently been used for B. subtilis annotation, has the additional capacity to chain the various component tasks needed for annotation, in scenarios piloted by their individual results. For Arabidopsis a simpler management tool called ANNOTATOR is available [26] and is used by several AGI consortiums. As stated by Bork and Koonin [27], 'there is' obviously 'a lack of a widely accepted, robust and continuously updated suite of sequence analysis methods integrated into a coherent and efficient prediction system'.
Table 2
Web sites for annotation programs.
Program                               URL
PROCRUSTES                            http://www-hto.usc.edu/software/procrustes/index.html
EbEST                                 http://ares.ifrc.mcw.edu/EBEST/ebest.html
AAT (analysis and annotation tool)    http://genome.cs.mtu.edu/aat.html
ALIGN (pairwise sequence alignment)   http://genome.cs.mtu.edu/align/align.html
MAP (multiple sequence alignment)     http://genome.cs.mtu.edu/map/map.html
GLIMMER                               http://www.cs.jhu.edu/labs/compbio/glimmer.html
GRAIL                                 http://compbio.ornl.gov/Grail-1.3/intro.html
SPLICEPREDICTOR                       http://gremlin1.zool.iastate.edu/~volker/SplicePredictor.html
NETPLANTGENE                          http://www.cbs.dtu.dk/services/NetPGene/
NETGENE2                              http://www.cbs.dtu.dk/services/NetGene2/
NETSTART                              http://www.cbs.dtu.dk/services/NetStart/
MZEF                                  http://siclio.cshl.org/genefinder/
FGENEA, FEXA and ASPL                 http://dot.imgen.bcm.tmc.edu:9331/gene-finder/gf.html
GENSCAN                               http://genomic.stanford.edu/GENSCANW.html
GeneMark.hmm                          http://genemark.biology.gatech.edu/GeneMark.hmmchoice.html
GeneMark                              http://genemark.biology.gatech.edu/GeneMark/webgenemark.html
tRNAscan-SE                           http://www.genetics.wustl.edu/eddy/tRNAscan-se/
AtRepBase                             http://nucleus.cshl.org/protarab/AtRepbase.htm
Proper annotation of genome data is crucial for genomics [48••]. The huge potential of genome-wide approaches resides indeed in the capacity to integrate the various kinds of data into a higher level of biological knowledge for one species and to allow comparison between genomes. This implies that some work on semantics and ontology is necessary to allow a full interconnectivity between databases [9•]. Gelbart [57] gives a striking example of such a need when debating the definition of a gene, which has a conceptual definition for geneticists but is ascribed to a sequence entity in databases. This debate has practical implications in the way the data should be represented in the different databases: insertion mutagenesis, expression, proteome and sequence databases that will all be components of an Arabidopsis knowledge base. Special care
should be paid to this issue in plants where relatively few cases of alternative expression of genes are experimentally documented compared to vertebrates, but where the real extent of their occurrence is unknown.
Open problems and future trends
Although many advanced annotation tools exist, computational annotation of genome sequences is still in its infancy. Gene finding software must still be significantly improved, especially for plant genomes, which are not the main focus of development. As an example, exon prediction programs have been improved by taking into account the clustering of Arabidopsis coding sequences into two classes [58•]. The existing tools are intended to find the mainstream highly expressed protein-encoding genes and have neglected many other genes of biological significance, such as low expressed genes, RNA-encoding genes (besides tRNAs) and rare introns [59,60•], or objects with less available information, such as promoters, transcriptional and translational control regions or MARs. There is ample room for the improvement of gene modeling and gene function prediction using improved identification and classification of promoters and transcriptional control regions. Annotation will evolve with time to cope with and feed new experiments; it is of major importance to track the way annotation has developed, and to clearly identify what has an experimental basis and what is derived information.
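The requirement to track how an annotation has developed, and to separate experimentally supported from derived information, can be sketched as a small record-keeping structure. The Python example below is only one possible design under stated assumptions: the names `Evidence` and `AnnotatedGene`, the two evidence kinds, the revision counter and the gene identifier used in the usage lines are all hypothetical and do not describe any existing database schema.

```python
from dataclasses import dataclass, field
from typing import List

EXPERIMENTAL = "experimental"  # e.g. full-length cDNA, insertion mutant
DERIVED = "derived"            # e.g. gene prediction, similarity search


@dataclass(frozen=True)
class Evidence:
    """One dated step in the life of an annotation (hypothetical)."""
    kind: str      # EXPERIMENTAL or DERIVED
    source: str    # free-text description of where the claim came from
    revision: int  # 1 for the first entry, then increasing


@dataclass
class AnnotatedGene:
    gene_id: str
    history: List[Evidence] = field(default_factory=list)

    def add_evidence(self, kind: str, source: str) -> Evidence:
        """Append a new evidence record; earlier records are never edited,
        so the development of the annotation stays traceable."""
        if kind not in (EXPERIMENTAL, DERIVED):
            raise ValueError("unknown evidence kind: %r" % kind)
        record = Evidence(kind, source, len(self.history) + 1)
        self.history.append(record)
        return record

    def has_experimental_support(self) -> bool:
        return any(e.kind == EXPERIMENTAL for e in self.history)


gene = AnnotatedGene("example_gene_1")  # hypothetical identifier
gene.add_evidence(DERIVED, "gene model from a prediction program")
supported_before = gene.has_experimental_support()
gene.add_evidence(EXPERIMENTAL, "matching full-length cDNA")
supported_after = gene.has_experimental_support()
```

An append-only history is used here so that a later experimental result adds to, rather than silently replaces, the computational prediction that preceded it.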
The status of Arabidopsis genome annotation will change dramatically with the availability of expression and insertion data [7•], which will secure the structural and functional annotation of a large fraction of the genes, and serve to improve the computational methods for the rest awaiting experimental support. Similarly, sequencing the rice genome will also deeply change our understanding and annotation of the genes which are common to both plants. The challenge is in our capacity to integrate these many data, as well as to use the dispersed expertise of the whole research community.
Note added in proof
The work referred to in the text as N Terryn et al. has now been accepted for publication [62].
Acknowledgements
We gratefully acknowledge Roderic Guigó, Juergen Kleffe, Klaus Mayer and Larry Parnell for communicating unpublished information, Patrice Déhais and Luc Van Wiemeersch for continuous expert support with hardware and software and Martine De Cock for editorial assistance. Pierre Rouzé is a research director and Nathalie Pavy a contract scientist of the Institut National de la Recherche Agronomique (France).
References and recommended reading
Papers of particular interest, published within the annual period of review, have been highlighted as:
• of special interest
•• of outstanding interest
1. Bevan M, Bancroft I, Bent E, Love K, Goodman H, Dean C, Bergkamp R, Dirkse W, Van Staveren M, Stiekema W et al.: Analysis of 1.9 Mb of contiguous sequence from chromosome 4 of Arabidopsis thaliana. Nature 1998, 391:485-488.
2. •• Meinke DW, Cherry JM, Dean C, Rounsley SD, Koornneef M: Arabidopsis thaliana: a model plant for genome analysis. Science 1998, 282:662-682. This is a very useful review and progress report on the present status of genomics research using Arabidopsis thaliana as a model plant.
3. Wheelan SJ, Boguski MS: Late-night thoughts on the sequence annotation problem. Genome Res 1998, 8:168-169.
4. • Galperin MY, Koonin EV: Sources of systematic error in functional annotation of genomes: domain rearrangement, non-orthologous gene displacement, and operon disruption. In Silico Biol 1998, 1:0007. By looking at bacterial genome annotations obtained through similarity search, the authors identify several causes of inadequate annotation which lead to 'database explosion': non-critical use of databases and functional inference, low complexity regions, ignoring multi-domain organisation (http://www.bioinfo.de/isb/1998/01/0007/). A useful paper, together with [27], for those who would like to avoid similar mistakes.
5. Singh GB, Kramer JA, Krawetz SA: Mathematical model to predict regions of chromatin attachment to the nuclear matrix. Nucleic Acids Res 1997, 25:1419-1425.
6. Preston GM, Haubold B, Rainey PB: Bacterial genomics and adaptation to life on plants: implications for the evolution of pathogenicity and symbiosis. Curr Opin Microbiol 1998, 1:589-597.
7. • Bouchez D, Höfte H: Functional genomics in plants. Plant Physiol 1998, 118:725-732. The first comprehensive review giving a broad spectrum of the 'toolbox for the global study of gene function in plants' and critically presenting the different methods presently available.
8. Baker PG, Brass A: Recent developments in biological sequence databases. Curr Opin Biotech 1998, 9:54-58.
9. • Frishman D, Heumann K, Lesk A, Mewes HW: Comprehensive, comprehensible, distributed and intelligent databases: current status. Bioinformatics 1998, 14:551-561. A review focusing on the challenges, and the conceptual and technical aspects of biological databases.
10. • Bairoch A, Apweiler R: The Swiss-Prot protein sequence data bank and its supplement TrEMBL in 1999. Nucleic Acids Res 1999, 27:49-54. A short paper presenting the reality of an expert-curated database and its computer-generated complement, as a relational node between various databases. The entire issue of the journal is worth consulting to discover the wide range of existing biological databases.
11. Fickett JW: The gene identification problem: an overview for developers. Comput Chem 1996, 20:103-118.
12. • Claverie JM: Computational methods for the identification of genes in vertebrate genomic sequences. Hum Mol Genet 1997, 6:1735-1744. A critical and comparative review on the software available for gene prediction, with a historical perspective over the last 15 years and an analytical presentation of the methods behind the software. The author concludes by pointing to conservatism as the common limitation of current methods.
13. • Guigó R: Computational gene identification: an open problem. Comput Chem 1997, 21:215-222. The author presents the current limitations of gene finding programs and their low efficiency for gene modeling.
14. Krogh A: Gene finding: putting the parts together. In Guide to Human Genome Computing, edn 2. Edited by Martin Bishop. Oxford: Academic Press; 1998:261-274.
15. •• Burge CB, Karlin S: Finding the genes in genomic DNA. Curr Opin Struct Biol 1998, 8:346-354. This review presents the rationale of gene finding, and is useful to understand the methodologies (especially GENSCAN) and analyse the current limitations.
16. •• Trends guide to Bioinformatics. Trends Supplement 1998. This very useful supplement to the Trends series of journals contains a few articles that are either fundamental or pragmatic and easy to read for their target audience of biologists. They include features on the most important every-day issues in using bioinformatics and deal with computational issues arising from genome annotation.
17. Computational gene recognition on World Wide Web URL: http://linkage.rockefeller.edu/wli/gene/
18. Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997, 25:3389-3402.
19. Zhang Z, Schäffer AA, Miller W, Madden TL, Lipman DJ, Koonin EV, Altschul SF: Protein sequence similarity searches using patterns as seeds. Nucleic Acids Res 1998, 26:3986-3990.
20. Zhang J, Madden TL: PowerBLAST: a new network BLAST application for interactive or automated sequence analysis and annotation. Genome Res 1997, 7:649-656.
21. • Sze SH, Pevzner PA: Las Vegas algorithms for gene recognition: suboptimal and error-tolerant spliced alignment. J Comput Biol 1997, 4:297-309. This is a paper presenting a more efficient algorithm for the spliced alignment method used in PROCRUSTES [22], the principle of which is to search using the exon-intron boundaries that maximize the score of alignment of a cDNA with the genomic sequence.
22. Mironov AA, Roytberg MA, Pevzner PA, Gelfand MS: Performance guarantee gene predictions via spliced alignment. Genomics 1998, 51:332-339.
23. Mott R: EST-GENOME: a program to align spliced DNA sequences to unspliced genomic DNA. Comput Appl Biosci 1997, 13:477-478.
24. • Florea L, Hartzell G, Zhang Z, Rubin GM, Miller W: A computer program for aligning a cDNA sequence with a genomic DNA sequence. Genome Res 1998, 8:967-974. This paper presents a new program which can be used to find splice sites by homology searching, and compares its performance to previous similar tools.
25. Jiang J, Jacob HJ: EbEST: an automated tool using expressed sequence tags to delineate gene structure. Genome Res 1998, 8:268-275.
26. Huang X, Adams MD, Zhou H, Kerlavage AR: A tool for analyzing and annotating genomic sequences. Genomics 1997, 46:37-45.
27. Bork P, Koonin EV: Predicting functions from protein sequences: where are the bottlenecks? Nat Genet 1998, 18:313-318.
28. Borodovsky M, McIninch J: GENMARK: parallel gene recognition for both DNA strands. Comput Chem 1993, 17:123-133.
29. Salzberg SL, Delcher AL, Kasif S, White O: Microbial gene identification using interpolated Markov models. Nucleic Acids Res 1998, 26:544-548.
30. • Karlin S: Global dinucleotide signatures and analysis of genomic heterogeneity. Curr Opin Microbiol 1998, 1:598-610. This paper reviews the concept of 'genomic signature' introduced by the author to refer to the remarkable stability of dinucleotide abundance in an organism and its conservation in closely related taxa, allowing discrimination between organisms and identification of horizontal transfer.
31. Burset M, Guigó R: Evaluation of gene structure prediction programs. Genomics 1996, 34:353-367.
32. Brendel V, Kleffe J: Prediction of locally optimal splice sites in plant pre-mRNA with applications to gene identification in Arabidopsis genomic DNA. Nucleic Acids Res 1998, 26:4746-4757.
33. Hebsgaard SM, Korning PG, Tolstrup N, Engelbrecht J, Rouzé P, Brunak S: Splice site prediction in Arabidopsis thaliana pre-mRNA by combining local and global sequence information. Nucleic Acids Res 1996, 24:3439-3452.
34. • Tolstrup N, Rouzé P, Brunak S: A branch point consensus from Arabidopsis found by non-circular analysis allows for better prediction of acceptor sites. Nucleic Acids Res 1997, 25:3159-3163. In this paper, the authors show first that it is possible to localize the branch-point in documented Arabidopsis intron sequences using a Hidden Markov Model of the U2 snRNA site, and then that predicting branch-points in contigs in addition to splice sites [33] increases the performance for acceptor sites. NetGene uses this method and in addition predicts the position of the splice sites in the codon, which could be used later in gene modelling.
35. Pedersen AG, Nielsen H: Neural network prediction of translation initiation sites in eukaryotes: perspectives for EST and genome analysis. In Proceedings of the Fifth International Conference on Intelligent Systems for Molecular Biology. Menlo Park: AAAI Press; 1997:226-233.
36. Xu Y, Uberbacher EC: Automated gene identification in large-scale genomic sequences. J Comput Biol 1997, 4:325-336.
37. • Zhang MQ: Identification of protein-coding regions in Arabidopsis thaliana genome based on quadratic discriminant analysis. Plant Mol Biol 1998, 37:803-806.
38. Solovyev V, Salamov A: The gene-finder computer tools for analysis of human and model organisms genome sequences. In Proceedings of the Fifth International Conference on Intelligent Systems for Molecular Biology. Menlo Park: AAAI Press; 1997:294-302.
40. Lukashin AV, Borodovsky M: GeneMark.hmm: new solutions for gene finding. Nucleic Acids Res 1998, 26:1107-1115.
41. • Kleffe J, Hermann K, Vahrson W, Wittig B, Brendel V: GeneGenerator - a flexible algorithm for gene prediction and its application to maize sequences. Bioinformatics 1998, 14:232-243. The first gene prediction software for a monocot.
42. • The C. elegans Sequencing Consortium: Genome sequence of the nematode C. elegans: a platform for investigating biology. Science 1998, 282:2012-2018.
43. Milanesi L, Muselli M, Arrigo P: Hamming clustering method for signals prediction in 5' and 3' regions of eukaryotic genes. Comput Appl Biosci 1996, 12:399-404.
44. • Fickett JW, Hatzigeorgiou AG: Eukaryotic promoter recognition. Genome Res 1997, 7:861-878. A documented review of computational promoter recognition, with an excellent first part on biology and a second on a fair comparative analysis of the available tools. The conclusion shows that there is much to be done yet.
45. •• Duret L, Bucher P: Searching for regulatory elements in human noncoding sequences. Curr Opin Struct Biol 1997, 7:399-406. Another, more conceptual, view on the same topic which presents the power of a comparative phylogenetic approach to identify new unknown regulatory elements.
Y, Yamamoto
repeats
Y, Okazaki
T, Uchiyama
of the Fiffh lnfernafional Conference Biology. Menlo Park: AAAl Press;
jintit~&r~i~&An interesting point made by completely sequenced
papers.
extraction to the knowledge
Bioinformatics
and names inforrt&ion
1998,
14:600-607.
F, Julliard
L, Pillet
in biological extraction.
In
on Intelligent 1997:218-225.
A: Automatic
D, Rechenmann
M, Yuan Y:
and back. J Mol Biol
base from biological
text: application
Web
I, Takagi T: Automatic
Proceedings for Molecular
M, Valencia
Wide
F, Huynen
to genomes
of knowledge
Proux
on World
Y, Eisenhaber
from genes
constructing
Andrade
for improved sequence. Nucleic
Systems
1.5
of keywords from domain of protein
V, Jacq B: Detecting gene texts: a first step toward In Proceedings from the Ninth S, Takagi T:
52. Lonsdale D: Nomenclature regulation. Nature 1998, 391:118.
The International Society for Plant Molecular Biology initiative to develop a kingdom-wide nomenclature can be found on the World Wide Web URL: MENDEL: http://www.mendel.ac.uk/ or http://genome-www.stanford.edu/Mendel/; see also AtDB: http://genome-www.stanford.edu/Arabidopsis/nomencl.html for additional information on nomenclature.
53.
Scharf M, Schneider R, Casari G, Bork P, Valencia A, Ouzounis C, Sander C: GeneQuiz: a workbench for sequence analysis. In Proceedings of the Second International Conference on Intelligent Systems for Molecular Biology. Menlo Park: AAAI Press; 1994:348-353.
54. Harris NL: Genotator: a workbench for sequence annotation. Genome Res 1997, 7:754-762.
55. Bailey LC, Fisher S, Schug J, Crabtree J, Gibson M, Overton GC: annotation of genomic sequence. Genome Res 8:234-250.
56. Médigue C, Rechenmann F, Danchin A, Viari A: Imagene: an integrated computer environment for sequence annotation and analysis. Bioinformatics 1999, 15:in press.
A new kind of integrated annotation environment is presented which is a task management system incorporating a data management system and a graphical tool. A prototype of Imagene was used for B. subtilis genome sequence annotation.
57. Gelbart W: Databases in genomic research. Science 1998, 282:659-661.
Mathé C, Peresetsky A, Déhais P, Van Montagu M, Rouzé P: Classification of Arabidopsis thaliana gene sequences: clustering of coding sequences into two groups according to codon usage improves gene prediction. J Mol Biol 1999, 285:1977-1991.
This paper reports the clustering of Arabidopsis coding sequences into two classes according to their codon usage, the main difference lying in the use of T and C at the third codon position. The classes correlate with gene expression. Using GeneMark, it was shown that using two gene models instead of one improves gene prediction.
59.
Wu HJ, Gaubier-Comella Rouzé P: AU-AC splicing
at least 10⁹ years
frx
genes.
44.
45.
T, Diaz-Lazcoz
function:
Workshop on Genome Informatics. Edited by Miyano Tokyo: Universal Academy Press Inc.; 1998:72-80.
con-
The first report on the full sequence of a higher eukaryote, with a genome size comparable to Arabidopsis. The accompanying papers from the same issue are worth reading too, as an illustration of what genomics is offering to biology now.
43.
MilzsRsi
Ohta
symbols pertinent
V:
The C. elegans
tool
W, Wittig
P, Dandekar
and
for
.
K, Vahrson
51.
56. .
41.
J, Hermann
Bork
Predicting
Pavy
a program in genomic
Annotation of transposons and genomic URL: http://nucleus.cshl.org/protarab/TnAnnotation.htm
1998,
Burge C, Karlin S: Prediction of complete gene structures in human genomic DNA. J Mol Biol 1997, 268:78-94.
This paper describes GENSCAN, the most efficient gene prediction software to date for vertebrates. GENSCAN makes use of many signals and compositional statistical properties, with a dependency rule for splice sites. It allows the search for multiple genes on both strands.
Lukashin AV, Borodovsky gene finding. Nucleic
SR: tRNAscan-SE: of transfer RNA genes Res 1997, 25:955-964.
GAIA: framework
.
TM, Eddy
scientific families.
for
39.
40.
49.
fianr
This paper is an application of a previous one dedicated to the human genome, describing MZEF for Arabidopsis. The method uses quadratic discriminant analysis, a generalisation of linear discriminant analysis, as used by Solovyev and Salamov [38].
1998, 283:707-725. gap between phenotype and genotype as a goal. in this review is on using the added value provided genomes for gene function prediction.
50.
Pedersen AG, Nielsen H: Neural network prediction of translation initiation sites in eukaryotes: perspectives for EST and genome analysis. In Proceedings of the Fifth International Conference on
36.
*
35. Tolstrup N, Rouzé P, Brunak S: A branch point consensus from Arabidopsis found by non-circular analysis allows for better prediction of acceptor sites. Nucleic Acids Res 1997, 25:3159-3163.
Acids 47.
Nucleic
Lowe
tools
detection
l
ciikw&?3: pre-mRNA
by combining local and global sequence Acids Res 1996, 24:3439-3452.
46.
48.
prediction
321
thaliana
which
P, Delseny
M, Grellet
in Arabidopsis: old. Nat Genet 1996,
F, Van Montagu
non-canonical
M,
introns
are
14:383-384.
Sharp PA, Burge CB: Classification of introns: U2-type or U12-type. Cell 1997, 91:875-879.
A synthetic presentation of the recent finding of a second class of spliceosomal introns found in animals and plants [59]. Their distinctive splicing and signature are described together with the donor and the branch sites.
60.
61. 62.
Fichant GA, Burks C: Identifying potential tRNA genes in genomic DNA sequences. J Mol Biol 1991, 220:659-671.
Terryn N, Heijnen L, De Keyser A, Van Asseldonck M, DeClercq R: Evidence for an ancient chromosomal duplication in Arabidopsis thaliana by sequencing and analysing a 400 kb contig at the APETALA 2 locus on chromosome 4. FEBS Lett 1999, 445:237-245.