Genome annotation: which tools do we have for it?


Pierre Rouzé*, Nathalie Pavy† and Stephane Rombauts‡

Genome data have to be converted into knowledge to be useful to biologists. Many valuable computational tools have already been developed to help annotation of plant genome sequences, and these may be improved further, for example by identification of more gene regulatory elements. The lack of a standard computer-assisted annotation platform for eukaryotic genomes remains a major bottle-neck.

Addresses
Laboratoire Associé de l'INRA (France), Department of Genetics, Flanders Interuniversity Institute of Biotechnology, University of Ghent, KL Ledeganckstraat 35, B-9000 Gent, Belgium
*e-mail: [email protected]
†e-mail: [email protected]
‡e-mail: [email protected]

Current Opinion in Plant Biology 1999, 2:90-95 (Genome studies and molecular genetics)
http://biomednet.com/elecref/1369526600200090
© Elsevier Science Ltd ISSN 1369-5266

Abbreviations
AGI  Arabidopsis genome initiative
EST  expressed sequence tag

Introduction

High throughput genome sequencing projects are producing an enormous amount of raw sequence data. For plants, the complete sequence of the Arabidopsis thaliana nuclear genome is expected for the year 2000; about one-third of the expected 120 Mb whole sequence is already available [1,2••] and sequencing of the rice genome has begun (see S Goff, this issue, pp 86-89). Unannotated sequences are not useful as such, except to find a given DNA sequence, because rich database queries cannot be done and the relevance of homology searches cannot be evaluated. What biologists are looking for is information derived from the genome sequence (markers, genes, mRNA or deduced protein sequence), not the sequence itself. To be meaningful to them, the genome sequences have to be converted into biologically significant knowledge; annotation is the first step toward knowledge acquisition. Contrary to a recent opinion [3], we believe that this step is best done as soon as possible. Annotation is certainly a risky process and at present its result is far from optimal, even for well-documented bacterial genomes [4•]. But it is a trade-off: the benefit for the plant research community is the ability to mine the sequence data more efficiently, and hence to plan classic and genome-wide experiments which would yield wider biological knowledge more quickly and would include a validation of the annotation itself. As pointed out by Meinke et al. [2••], a long list of fundamental and technological research advances has already resulted from the Arabidopsis genome initiative (AGI); for a number of these advances the availability of annotated sequence played an important role.

For a genomic DNA sequence, annotation is the process by which an anonymous sequence is documented, by positioning along the sequence the various sites and segments that are involved in the genome functionality. Many elements of possible biological relevance that are not, or not yet, linked to a particular gene may well be annotated, such as matrix attachment regions (MARs) [5] and sequence repeats. Nevertheless, the basic objects in genome annotation are obviously the genes and the related elements that are involved in their structure, their expression, and their function or the function of their encoded products. Here, we will particularly consider recent annotation tools that have been developed for plant nuclear genomes, and will also present tools and results obtained from other genomes, as much more information is available from bacterial, yeast and mammalian (especially human) genome projects. Besides, plant nuclear genomes are not the only ones of interest. The genomes of the many plant-interacting organisms for which genomics projects are ongoing [6], as well as organellar genomes, are useful sources of information for plant biologists.

Structural and functional annotation

Although gene annotation has existed for as long as the sequence databases themselves, and is familiar to any biologist who has cloned and sequenced a particular gene, annotating genomic sequences is in fact a new kind of task. Many tools used for genomic sequences of individual genes are inadequate, firstly because of the size of the genome contig sequences. Sequence size in itself is often a problem for pre-genome software, together with the inability to deal with sequences containing several genes on both strands. Secondly, and more importantly, the sequence of a single gene often comes as a result of a carefully designed experimental approach which supplies biological information, contrary to systematically obtained genome sequences [3]. Functional genomics analysis in plants [7•] should partially fill this gap by providing genome-wide experimental data on function and expression of genes, and will undoubtedly contribute to a much improved functional annotation in the near future.

Annotation is a two-step process. The first step can be described as structural annotation and consists of finding the location of the biologically relevant sites and strings, ending up with a coherent model for the whole sequence where each object is properly defined and each object component has a unique location. As the major objects are genes, this step is often likened to gene finding, although it is a true annotation step. After structural annotation has been carried out, there is a generic description of the whole sequence, which can now be used in a more powerful way for database searches (e.g. for deduced protein sequences) and for experimental purposes (e.g. to design primers for exons). The second step is an information processing step, and is best described as functional annotation. It consists of attributing a range of specific biologically relevant information (e.g. species, source, gene, gene product or domain name and function) to the sequence as a whole, to each compound object and to each individual gene component. This information may reside inside the database proper, for example in specified fields and in feature tables as in the GenBank/EMBL/DDBJ public DNA databases, or may be obtained through links to other databases [8,9•]. SWISS-PROT, which now offers links to 29 different databases [10•], is the best example of functional annotation, and this expert-curated protein database represents a turning point in interconnected databases.

Structural annotation: finding and locating the genes

Computational approaches to gene finding are a very active field in bioinformatics, with significant advances in the past few years. Several reviews on this issue, aimed at a variety of readerships, have been written recently [11,12•,13•,14,15••,16••] and a useful bibliography on computational gene recognition is maintained on the Internet by Li [17]. Fundamentally, there are two ways to find a gene. The more obvious way is to search in sequence databases for a homologue. Besides giving the exon/intron structure, the homology approach also gives some indication of gene function. This is done using the familiar BLAST or FASTA suite of programs; BLASTN searching in DNA databases is useful for the really close homologues, which include expressed sequence tags (ESTs), whereas BLASTX searching in protein databases is much more sensitive. New flavours of BLAST have been introduced: allowing gapping (BLAST2) and higher search sensitivity through an iterative process (PSI-BLAST) [18], taking into account pattern information [19], or providing interactive and specially tuned searching for large contig analysis (PowerBLAST) [20]. These approaches, nevertheless, depend on the existence of homologues in the database, and on their correct annotation [3]. In practice, for the Arabidopsis genome, a close homologue can be found for about one third of the genes, and either distant or partial similarities for an additional third, at best. Modeling gene structure from homology alone is valid only for the first third. Specific software such as PROCRUSTES [21•,22], EST-GENOME [23], sim4 [24•] and EbEST [25] has been written to find gene structures through comparison with ESTs or distant cDNAs. AAT is an analysis and annotation tool with programs for comparing the query sequence with a protein database [26], and is the only one routinely used for Arabidopsis. Bork and Koonin [27] have given a perceptive analysis of the issue of predicted protein sequences and pointed to the potential problems and bottlenecks of this exercise: first, the lack of an accepted gene prediction system integrating a robust and updated suite of sequence analysis methods and, second, the poor or misleading information in databases, which easily ends up in bad annotations that then propagate.
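The outcome of structural annotation described above (objects with a unique location, genes on both strands, exons usable for deduced sequences or primer design) can be illustrated with a small data structure. This is a minimal sketch under stated assumptions, not the data model of any tool cited here: the names `Exon`, `GeneModel`, `spliced_sequence` and `reverse_complement` are hypothetical, and coordinates are assumed to be 1-based and inclusive on the forward strand of the contig.

```python
from dataclasses import dataclass, field
from typing import List

# Translation table for complementing unambiguous DNA bases.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return seq.translate(_COMPLEMENT)[::-1]


@dataclass(frozen=True)
class Exon:
    start: int  # 1-based, inclusive, forward-strand coordinate (assumption)
    end: int    # 1-based, inclusive


@dataclass
class GeneModel:
    name: str
    strand: str  # "+" or "-"
    exons: List[Exon] = field(default_factory=list)

    def spliced_sequence(self, contig: str) -> str:
        """Join the exons in contig order and orient them on the gene strand."""
        ordered = sorted(self.exons, key=lambda e: e.start)
        # Each component must have a unique location: reject overlapping exons.
        for left, right in zip(ordered, ordered[1:]):
            if right.start <= left.end:
                raise ValueError(f"overlapping exons in {self.name}")
        spliced = "".join(contig[e.start - 1:e.end] for e in ordered)
        return spliced if self.strand == "+" else reverse_complement(spliced)
```

Keeping forward-strand coordinates for both strands and reversing only when a sequence is extracted is one possible design choice; it keeps all objects on a contig comparable by position.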


To search for genes with no homologue, and to confirm and extend the ones with poor homologues, programs have been developed to find genes only from the knowledge of the sequence. They rely on characteristic gene properties that leave clues in the sequence [11], mostly linked to the ability of genes to be expressed, for example regularity and nucleotide bias in the coding sequence, and motifs and composition bias for transcription and translation. Many algorithms have been written to find either gene elements (e.g. start or splice sites) or compound objects (e.g. exons), or to model genes as a whole. Obviously the task differs from organism to organism. For prokaryotes, where the coding information is dense and uninterrupted, it is easier than for higher eukaryotes; two gene prediction programs are the current standards: GeneMark [28] and Glimmer [29]. For eukaryotes, software development is largely driven towards the human genome. This software, as well as software devised for yeast or C. elegans, cannot be used as such for the plant genomes. Even if the basic molecular gene replication and expression mechanisms are conserved, there are significant taxon-specific peculiarities (e.g. trans-splicing in worms) and the variation in genome style is a reality [30•]. Software therefore has to be specifically developed for each genome or adapted, at best retrained, from one genome to the other. Software developed for Arabidopsis, a dicot, is often not adapted to rice, a monocot. Likewise, the validation measures made for a given software in a given species, for example humans [31], do not necessarily apply in another species for which this software may have been adapted, such as Arabidopsis.
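To make the notion of a validation measure concrete, the sketch below computes nucleotide-level sensitivity and specificity for one predicted gene structure against a reference annotation. It assumes the usual gene-finding definitions, Sn = TP/(TP+FN) and Sp = TP/(TP+FP), where a true positive is a position that is coding in both; the function names and the representation of exons as (start, end) pairs with 1-based inclusive coordinates are hypothetical choices made for illustration.

```python
def coding_positions(exons):
    """Return the set of 1-based positions covered by (start, end) exons."""
    covered = set()
    for start, end in exons:
        covered.update(range(start, end + 1))
    return covered


def nucleotide_accuracy(annotated_exons, predicted_exons):
    """Return (sensitivity, specificity) at the nucleotide level.

    Sensitivity is the fraction of annotated coding positions that were
    predicted; specificity is the fraction of predicted coding positions
    that are annotated (assumed definitions, see the lead-in).
    """
    truth = coding_positions(annotated_exons)
    predicted = coding_positions(predicted_exons)
    true_pos = len(truth & predicted)
    false_neg = len(truth - predicted)
    false_pos = len(predicted - truth)
    sensitivity = true_pos / (true_pos + false_neg) if truth else float("nan")
    specificity = true_pos / (true_pos + false_pos) if predicted else float("nan")
    return sensitivity, specificity
```

Because such figures depend on the reference set used, values obtained on one species say little about another, which is the point made in the paragraph above.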
Several annotation programs have recently been developed or adapted for gene prediction in Arabidopsis, most of them used by the various AGI consortia (see Tables 1 and 2): SplicePredictor [32], NetPlantGene [33] and NetGene2 [34•] for splice site prediction; NetStart [35] for translational start; GeneMark [28], the ACeDB GeneFinder (P Green, personal communication), GRAIL [36], MZEF [37•] and Solovyev's gene-finder suite (FGENEA, FEXA and ASPL) [38] for exon prediction; and GENSCAN [39•] and GeneMark.hmm [40] for gene modelling on both strands. No evaluation of the respective performance of this software is yet available, but we are in the process of assessing it. Surprisingly, GeneMark.hmm, a program not yet utilized for annotation, appears to be the most efficient gene predictor for Arabidopsis (N Pavy, S Rombauts, VV Ramana Daluvuri, C Mathe, P Dehais et al., unpublished data). GeneGenerator [41•] is an integrative gene prediction program developed for maize; a version adapted for Arabidopsis is in preparation (J Kleffe, personal communication). Even if the performance of this software in Arabidopsis is reasonably high for sites or individual exons, it has been observed during careful manual annotation of a 400 kb contig (N Terryn et al., personal communication) that gene modelling remains poor whatever the program: many genes are wrongly fused or split, border exons are often missing or wrongly added, and predicted overlapping genes are not rare, although such overlaps have never been experimentally observed. Similar observations were made for the human [13•] and C. elegans [42•] genomes.

In order to improve the prediction of gene borders and to gain further biologically relevant information, the prediction of 5' and 3' untranslated regions [43] and especially of core promoters has been attempted [44•,45••], with limited success so far [15••]. Computer prediction of tRNA genes is efficient [46]. Other biological objects or remarkable features are often reported, such as transposable elements and repeats in Arabidopsis, for which two useful databases have been developed [47]. Software for MAR prediction has recently been developed, on the basis of statistics of motifs often found associated with MARs in vertebrates [5]; its value for plants needs further checking.

Table 1

The Arabidopsis Genome Initiative consortia and their annotation tools.

(a) Web sites of the consortia involved in Arabidopsis sequencing and annotation groups.

Consortium  Institution  URL
TIGR  The Institute for Genomic Research  http://www.tigr.org/tdb/at/atgenome.html
CSHL  Cold Spring Harbor Laboratory  http://nucleus.cshl.org/protarab/
KAOS  Kazusa Institute  http://zearth.kazusa.or.jp/arabi/
SPP  Stanford University, University of Pennsylvania, Plant Genome Expression Center  http://sequencewww.stanford.edu/SPP.html
ESSA  Munich Information center for Protein Sequences (MIPS)  http://www.mips.biochem.mpg.de/proj/thal/
Genoscope  Centre National de Séquençage (France), annotation: MIPS  http://www.genoscope.cns.fr/

(b) Prediction programs used by the different annotation groups.

Annotators: TIGR, CSHL, SPP, MIPS, KAOS.
Programs: NetPlantGene, MZEF, Gene-Finder, GeneFinder, GENSCAN, GRAIL, AAT, GeneMark, tRNAscan-SE.
For tRNA search, MIPS uses tRNA scanner [61]. tRNA search is done at the Kazusa Institute by searching for similarity in an RNA database.

Functional annotation - the link to biological knowledge

Functional annotation of gene database entries relies on the expertise of scientists. For genome annotation, especially for the first pass, which has to be performed shortly after the release of a contig sequence, the annotator does not have the time to act as an expert. As there is no a priori knowledge of the sequence, every attribute of information has to be assessed by analogy, using a variety of computer tools. Many aspects of this process are well analysed in a recent exhaustive review [48••]. One should notice first that, up to now, it is essentially the molecular function of the gene products that is annotated this way, and only more rarely the biological function of the gene. The information on protein function can be derived from full-length sequence comparisons or, better, through multiple alignments, pattern and domain searches and predictions of location and structure, for which a large spectrum of computer tools (too numerous to be cited here) is now available [16••], many of them through the Internet. Predicting the molecular function of the putative proteins encoded by the genes has many documented pitfalls [4•,27,48••], which include the risk of propagating a wrong annotation [3]. The only way to prevent this would be for experts to check each entry manually [4•].

Clearly, the quality of the previous structural annotation step will affect the quality of functional annotation. This is especially true for the many cases where the structural annotation comes from the interaction between intrinsic gene prediction and extrinsic database searches. This often ends up with correct labels being given to genes listed with the wrong structure, or wrong labels being given to genes with the correct structure, or both, as we observed ourselves for Arabidopsis. As an alternative, or a help, to time-consuming manual human expertise, active research towards automatic or computer-assisted procedures to fetch knowledge from the literature is ongoing [49-51], but is only practical at present if the literature source is a normalised text such as database entries. For proper annotation, nomenclature is a major concern which has been underestimated. Fortunately for plants, the International Society for Plant Molecular Biology has taken an early initiative to work towards a kingdom-wide gene nomenclature, available in the MENDEL database [52].

Generating genome-wide annotations

The conversion of raw genome data into knowledge, here specifically sequence annotation, involves a series of interdependent tasks. Manual annotation is no longer a valid solution and several systems have been designed to cope with this flow of incoming genome sequences. Automatic annotation such as that done by GeneQuiz [53] has proven to be error prone in prokaryotes [4•]. Genotator [54] and GAIA [55] are interactive graphical environments, allowing human expertise, that have recently been designed to face the needs of eukaryote genome annotation. In a further category, Imagene [56•], which has recently been used for B. subtilis annotation, has the additional capacity to chain the various component tasks needed for annotation, in scenarios piloted by their individual results. For Arabidopsis, a simpler management tool called ANNOTATOR is available [26] and is used by several AGI consortia. As stated by Bork and Koonin [27], 'there is' obviously 'a lack of a widely accepted, robust and continuously updated suite of sequence analysis methods integrated into a coherent and efficient prediction system'.

Proper annotation of genome data is crucial for genomics [48••]. The huge potential of genome-wide approaches resides indeed in the capacity to integrate the various kinds of data into a higher level of biological knowledge for one species and to allow comparison between genomes. This implies that some work on semantics and ontology is necessary to allow full interconnectivity between databases [9•]. Gelbart [57] gives a striking example of such a need when debating the definition of a gene, which has a conceptual definition for geneticists but is ascribed to a sequence entity in databases. This debate has practical implications for the way the data should be represented in the different databases (insertion mutagenesis, expression, proteome and sequence databases) that will all be components of an Arabidopsis knowledge base. Special care should be paid to this issue in plants, where relatively few cases of alternative expression of genes are experimentally documented compared to vertebrates, but where the real extent of their occurrence is unknown.

Table 2

Web sites for annotation programs.

Program  URL
PROCRUSTES  http://www-hto.usc.edu/software/procrustes/index.html
EbEST  http://ares.ifrc.mcw.edu/EBEST/ebest.html
AAT (analysis and annotation tool)  http://genome.cs.mtu.edu/aat.html
ALIGN (pairwise sequence alignment)  http://genome.cs.mtu.edu/align/align.html
MAP (multiple sequence alignment)  http://genome.cs.mtu.edu/map/map.html
GLIMMER  http://www.cs.jhu.edu/labs/compbio/glimmer.html
GRAIL  http://compbio.ornl.gov/Grail-1.3/intro.html
SPLICEPREDICTOR  http://gremlin1.zool.iastate.edu/~volker/SplicePredictor.html
NETPLANTGENE  http://www.cbs.dtu.dk/services/NetPGene/
NETGENE2  http://www.cbs.dtu.dk/services/NetGene2/
NETSTART  http://www.cbs.dtu.dk/services/NetStart/
MZEF  http://siclio.cshl.org/genefinder/
FGENEA, FEXA and ASPL  http://dot.imgen.bcm.tmc.edu:9331/gene-finder/gf.html
GENSCAN  http://genomic.stanford.edu/GENSCANW.html
GeneMark.hmm  http://genemark.biology.gatech.edu/GeneMark.hmmchoice.html
GeneMark  http://genemark.biology.gatech.edu/GeneMark/webgenemark.html
tRNAscan-SE  http://www.genetics.wustl.edu/eddy/tRNAscan-se/
AtRepBase  http://nucleus.cshl.org/protarab/AtRepbase.htm

Open problems and future trends

Although many advanced annotation tools exist, computational annotation of genome sequences is still in its infancy. Gene finding software must still be significantly improved, especially for plant genomes, which are not the main focus of development. As an example, exon prediction programs have been improved by taking into account the clustering of Arabidopsis coding sequences into two classes [58•]. The existing tools are intended to find the mainstream, highly expressed protein-encoding genes and have neglected many other genes of biological significance, such as genes expressed at low levels, RNA-encoding genes (besides tRNAs) and rare introns [59,60•], or objects with less available information, such as promoters, transcriptional and translational control regions or MARs. There is ample room for the improvement of gene modeling and gene function prediction using improved identification and classification of promoters and transcriptional control regions. Annotation will evolve with time to cope with and feed new experiments; it is of major importance to track the way annotation has developed, and to clearly identify what has an experimental basis and what is derived information.
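The requirement to separate what has an experimental basis from what is derived can be sketched as a record that carries its evidence type and, for transferred labels, the entry it was copied from. This is only an illustration under stated assumptions: the `Evidence` categories, the `Annotation` fields and the function `has_experimental_basis` are hypothetical names, not part of any database or tool discussed here.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Evidence(Enum):
    EXPERIMENTAL = "experimental"  # supported by a direct experiment
    SIMILARITY = "similarity"      # transferred from a homologue
    PREDICTION = "prediction"      # output of an intrinsic gene finder


@dataclass
class Annotation:
    gene: str
    label: str
    evidence: Evidence
    source: str                         # publication, database entry or program
    derived_from: Optional[str] = None  # gene whose annotation was copied


def has_experimental_basis(gene: str, annotations: List[Annotation]) -> bool:
    """Follow the chain of transferred annotations back to its origin.

    Returns True only if an experimentally supported entry is reached;
    unknown genes and circular transfers both yield False.
    """
    by_gene = {a.gene: a for a in annotations}
    seen = set()
    current = by_gene.get(gene)
    while current is not None and current.gene not in seen:
        if current.evidence is Evidence.EXPERIMENTAL:
            return True
        seen.add(current.gene)
        current = by_gene.get(current.derived_from)
    return False
```

Recording `derived_from` explicitly is what makes it possible to revisit every dependent entry when the original annotation is corrected, instead of letting an error propagate unnoticed.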


The status of Arabidopsis genome annotation will change dramatically with the availability of expression and insertion data [7•], which will secure the structural and functional annotation of a large fraction of the genes, and serve to improve the computational methods for the rest awaiting experimental support. Similarly, sequencing the rice genome will deeply change our understanding and annotation of the genes which are common to both plants. The challenge lies in our capacity to integrate these many data, as well as to use the dispersed expertise of the whole research community.

Note added in proof

The work referred to in the text as N Terryn et al. has now been accepted for publication [62].

Acknowledgements

We gratefully acknowledge Roderic Guigó, Juergen Kleffe, Klaus Mayer and Larry Parnell for communicating unpublished information, Patrice Déhais and Luc Van Wiemeersch for continuous expert support with hardware and software and Martine De Cock for editorial assistance. Pierre Rouzé is a research director and Nathalie Pavy a contract scientist of the Institut National de la Recherche Agronomique (France).

References and recommended reading

Papers of particular interest, published within the annual period of review, have been highlighted as:
• of special interest
•• of outstanding interest

1. Bevan M, Bancroft I, Bent E, Love K, Goodman H, Dean C, Bergkamp R, Dirkse W, Van Staveren M, Stiekema W et al.: Analysis of 1.9 Mb of contiguous sequence from chromosome 4 of Arabidopsis thaliana. Nature 1998, 391:485-488.

2. •• Meinke DW, Cherry JM, Dean C, Rounsley SD, Koornneef M: Arabidopsis thaliana: a model plant for genome analysis. Science 1998, 282:662-682.
This is a very useful review and progress report on the present status of genomics research using Arabidopsis thaliana as a model plant.

3. Wheelan SJ, Boguski MS: Late-night thoughts on the sequence annotation problem. Genome Res 1998, 8:168-169.

4. • Galperin MY, Koonin EV: Sources of systematic error in functional annotation of genomes: domain rearrangement, non-orthologous gene displacement, and operon disruption. In Silico Biol 1998, 1:0007.
By looking at bacterial genome annotations obtained through similarity search, the authors identify several causes of inadequate annotation which lead to 'database explosion': non-critical use of databases and functional inference, low complexity regions, ignoring multi-domain organisation (http://www.bioinfo.de/isb/1998/01/0007/). A useful paper, together with [27], for those who would like to avoid similar mistakes.

5. Singh GB, Kramer JA, Krawetz SA: Mathematical model to predict regions of chromatin attachment to the nuclear matrix. Nucleic Acids Res 1997, 25:1419-1425.

6. Preston GM, Haubold B, Rainey PB: Bacterial genomics and adaptation to life on plants: implications for the evolution of pathogenicity and symbiosis. Curr Opin Microbiol 1998, 1:589-597.

7. • Bouchez D, Höfte H: Functional genomics in plants. Plant Physiol 1998, 118:725-732.
The first comprehensive review giving a broad spectrum of the 'toolbox for the global study of gene function in plants' and critically presenting the different methods presently available.

8. Baker PG, Brass A: Recent developments in biological sequence databases. Curr Opin Biotech 1998, 9:54-58.

9. • Frishman D, Heumann K, Lesk A, Mewes HW: Comprehensive, comprehensible, distributed and intelligent databases: current status. Bioinformatics 1998, 14:551-561.
A review focusing on the challenges, and the conceptual and technical aspects of biological databases.

10. • Bairoch A, Apweiler R: The SWISS-PROT protein sequence data bank and its supplement TrEMBL in 1999. Nucleic Acids Res 1999, 27:49-54.
A short paper presenting the reality of an expert-curated database and its computer-generated complement, as a relational node between various databases. The entire issue of the journal is worth consulting to discover the wide range of existing biological databases.

11. Fickett JW: The gene identification problem: an overview for developers. Comput Chem 1996, 20:103-118.

12. • Claverie JM: Computational methods for the identification of genes in vertebrate genomic sequences. Hum Mol Genet 1997, 6:1735-1744.
A critical and comparative review on the software available for gene prediction, with a historical perspective over the last 15 years and an analytical presentation of the methods behind the software. The author concludes by pointing to conservatism as the common limitation of current methods.

13. • Guigó R: Computational gene identification: an open problem. Comput Chem 1997, 21:215-222.
The author presents the current limitations of gene finding programs and their low efficiency for gene modeling.

14. Krogh A: Gene finding: putting the parts together. In Guide to Human Genome Computing, edn 2. Edited by Martin Bishop. Oxford: Academic Press; 1998:261-274.

15. •• Burge CB, Karlin S: Finding the genes in genomic DNA. Curr Opin Struct Biol 1998, 8:346-354.
This review presents the rationale of gene finding, and is useful to understand the methodologies (especially GENSCAN) and analyse the current limitations.

16. •• Trends guide to Bioinformatics. Trends Supplement 1998.
This very useful supplement to the Trends series of journals contains a few articles that are either fundamental or pragmatic and easy to read for their target audience of biologists. They include features on the most important every-day issues in using bioinformatics and deal with computational issues arising from genome annotation.

17. Computational gene recognition on World Wide Web URL: http://linkage.rockefeller.edu/wli/gene/

18. Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997, 25:3389-3402.

19. Zhang Z, Schäffer AA, Miller W, Madden TL, Lipman DJ, Koonin EV, Altschul SF: Protein sequence similarity searches using patterns as seeds. Nucleic Acids Res 1998, 26:3986-3990.

20. Zhang J, Madden TL: PowerBLAST: a new network BLAST application for interactive or automated sequence analysis and annotation. Genome Res 1997, 7:649-656.

21. • Sze SH, Pevzner PA: Las Vegas algorithms for gene recognition: suboptimal and error-tolerant spliced alignment. J Comput Biol 1997, 4:297-309.
This is a paper presenting a more efficient algorithm for the spliced alignment method used in PROCRUSTES [22], the principle of which is to search using the exon-intron boundaries that maximize the score of alignment of a cDNA with the genomic sequence.

22. Mironov AA, Roytberg MA, Pevzner PA, Gelfand MS: Performance guarantee gene predictions via spliced alignment. Genomics 1998, 51:332-339.

23. Mott R: EST-GENOME: a program to align spliced DNA sequences to unspliced genomic DNA. Comput Appl Biosci 1997, 13:477-478.

24. • Florea L, Hartzell G, Zhang Z, Rubin GM, Miller W: A computer program for aligning a cDNA sequence with a genomic DNA sequence. Genome Res 1998, 8:967-974.
This paper presents a new program which can be used to find splice sites by homology searching, and compares its performance to previous similar tools.

25. Jiang J, Jacob HJ: EbEST: an automated tool using expressed sequence tags to delineate gene structure. Genome Res 1998, 8:268-275.

26. Huang X, Adams MD, Zhou H, Kerlavage AR: A tool for analyzing and annotating genomic sequences. Genomics 1997, 46:37-45.

27. Bork P, Koonin EV: Predicting functions from protein sequences: where are the bottlenecks? Nat Genet 1998, 18:313-318.

28. Borodovsky M, McIninch J: GENMARK: parallel gene recognition for both DNA strands. Comput Chem 1993, 17:123-133.

29. Salzberg SL, Delcher AL, Kasif S, White O: Microbial gene identification using interpolated Markov models. Nucleic Acids Res 1998, 26:544-548.

30. • Karlin S: Global dinucleotide signatures and analysis of genomic heterogeneity. Curr Opin Microbiol 1998, 1:598-610.
This paper reviews the concept of 'genomic signature' introduced by the author to refer to the remarkable stability of dinucleotide abundance in an organism and its conservation in closely related taxa, allowing discrimination between organisms and identification of horizontal transfer.

31. Burset M, Guigó R: Evaluation of gene structure prediction programs. Genomics 1996, 34:353-367.

32. [...] with applications to gene identification in Arabidopsis genomic DNA. Nucleic Acids Res 1998, 26:4746-4757.

33. [...] information. [...]

34. • [...]
In this paper, the authors show first that it is possible to localize the branch-point in documented Arabidopsis intron sequences using a Hidden Markov Model of the U2 snRNA site, and then that predicting branch-points in contigs in addition to splice sites [33] increases the performance for acceptor sites. NetGene2 uses this method and in addition predicts the position of the splice sites in the codon, which could be used later on in gene modelling.

35. [...] Intelligent Systems for Molecular Biology. Menlo Park: AAAI Press; 1997:226-233.

36. Xu Y, Uberbacher EC: Automated gene identification in large-scale genomic sequences. J Comput Biol 1997, 4:325-336.

37. • [...] thaliana genome based on quadratic discriminant analysis. [...] Mol Biol 1998, 37:803-806.

38. Solovyev V, Salamov A: The gene-finder computer tools for analysis of human and model organisms genome sequences. In Proceedings of the Fifth International Conference on Intelligent Systems for Molecular Biology. Menlo Park: AAAI Press; 1997:294-302.

39. • [...]

40. [...] M: GeneMark.hmm: new solutions [...]. Nucleic Acids Res 1998, 26:1107-1115.

41. • Kleffe [...] Brendel [...]: GeneGenerator - a flexible algorithm for gene prediction and its application to maize sequences. Bioinformatics 1998, 14:232-243.
The first gene prediction [...] for a monocot, using a low [...] approach.

42. • [...] Sequencing Consortium: Genome sequence of the nematode C. elegans: a platform for investigating biology. Science 1998, 282:2012-2018.

43. [...] signals prediction in 5' and 3' regions of eukaryotic [...]. Comput Appl Biosci 1996, 12:399-404.

44. • Fickett JW, Hatzigeorgiou AG: Eukaryotic promoter recognition. Genome Res 1997, 7:861-878.
A documented review of computational promoter recognition, with an excellent first part on biology and a second on a fair comparative analysis of the available tools. The conclusion shows that there is much to be done yet.

45. •• Duret L, Bucher P: Searching for regulatory elements in human noncoding sequences. Curr Opin Struct Biol 1997, 7:399-406.
Another, more conceptual, view on the same topic which presents the power of a comparative phylogenetic approach to identify new unknown regulatory elements.

46. [...] for improved [...] sequence. Nucleic [...]

47. [...] repeats [...] on World Wide Web [...]

48. •• [...] Y, Eisenhaber F, Huynen M, Yuan Y: [...] from genes to genomes and back. J Mol Biol [...]
An interesting point made by [...] completely sequenced [...]

49. Andrade M, Valencia A: Automatic extraction of keywords from [...] text: application to the knowledge domain of protein [...]. Bioinformatics 1998, 14:600-607.

50. [...] Y, Yamamoto Y, Okazaki T, Uchiyama I, Takagi T: Automatic constructing of knowledge base from biological papers. In Proceedings of the Fifth International Conference on Intelligent Systems for Molecular Biology. Menlo Park: AAAI Press; 1997:218-225.

51. Proux D, Rechenmann F, Julliard L, Pillet V, Jacq B: Detecting gene [...] and names in biological texts: a first step toward [...] information extraction. In Proceedings from the Ninth [...]

52. • Lonsdale D: Nomenclature regulation. Nature 1998, 391:118.
The International Society for Plant Molecular Biology initiative to develop a kingdom-wide nomenclature can be found on the World Wide Web URL: MENDEL: http://www.mendel.ac.uk/ or http://genome-www.stanford.edu/Mendel/; see also AtDB: http://genome-www.stanford.edu/Arabidopsis/nomencl.html for additional information on nomenclature.

53. Scharf M, Schneider R, Casari G, Bork P, Valencia A, Ouzounis C, Sander C: GeneQuiz: a workbench for sequence analysis. In Proceedings of the Second International Conference on Intelligent Systems for Molecular Biology. Menlo Park: AAAI Press; 1994:348-353.

54. Harris NL: Genotator: a workbench for sequence annotation. Genome Res 1997, 7:754-762.

55. Bailey LC, Fisher S, Schug J, Crabtree J, Gibson M, Overton GC: [...] annotation of genomic sequence. Genome Res 1998, 8:234-250.

MBcligue

C, Rechenmann

integrated computer analysis. Bioinformafics

F, Dan&in A, Viari A: Imagene: an environment for sequence annotation 1999, 15:in press.

A new kind of integrated annotation environment manaB*men> pSaYmm incDrpora%n~ a bata graphical tool. A prototype of lmagene was sequence annotation. 57.

Gelbart W: Databases 282:659-661.

56. .

Math&

end

is presented which is a task managemenf sydem anb a used for 6. subfilis genome

C, Peresetsky

in genomic A, DBhais

research.

Science

P, Van Montagu

Classification of Arebidopsis the/iene of coding sequences into two groups

M, Rouzb

1998, P:

oene seauences: clusterino according to codon usege-

improves gene pm&2h, i, MD> Bj,) X+99,2X%;> 933-> 9x0. This paper reports the clustering of Arabidopsis coding sequences into two classes according to their codon usage, the main difference lying in the use of T and C at the third codon position. The classes correlate with gene expression. Using GeneMark, it was shown that using two gene models instead of one improves Qene prediction. 59.

Wu HJ, Gaubier-Comella Rouzb P: AU-AA splicing

at least 1 Og years

frx

genes.

44.

45.

T, Diaz-Lazcoz

function:

95

Rombauts

Woksbu~ on Genome J!tufmaatics. E&ted by Mpnu Tokyo: Universal Academy Press Inc.; 1998:72-80.

con-

on the full sequence of a higher eukaryote, with a genome to Arabidopsis. The accompanying papers from the same reading too, as an illustration of what genomics is offering to

MilzsRsi

Ohta

symbols partkent

V:

The C. etegans

The first report size comparable issue are worth biology now. 43.

tool

W, Wittig

P, Dandekar

and

for

.

K, Vahrson

51.

56. .

41.

J, Hermann

Bork

Predicting

Pavy

a program in genomic

Annotation of transposons and Qenomic URL: http~/nucleuscshl,org/profarab/Tn4nnotatjon.htm

1998,

Burge C, Karlin S: Prediion of complete gene structures in human genomic DNA. I MO/ Bioll997, 268:78-94. This paper describes GENSCAN, the most efficient gene prediction ware to date for vertebrates. GENSCAN makes use of many signals comp&ti~ona~ sIa$s%xza\ propefi,es, ti)‘rh a dependency ru)e ‘ror spS)ce It allows the search for multiple Qenes on both strands. Lukashin AV, Borodovsky gene finding. Nucleic

SR: tRNAscan-SE: of transfer RNA genes Res 1997, 25195.5964.

GAIA: framework

.

for it? RouzB,

TM, Eddy

scientific families.

for

39.

40.

49.

fianr

This paper IS an appllcatlon ot a previous one dedicated to the human atenome, besci,ti)no MZEF ‘mr Plratidopsjs. The metior5 ‘1s uino ouabraijc ziscriminant analysk, a generalisation oi linear discriminant analy&,‘as used by Solovyev and Salamov [361.

do we have

1996, 283:707-725. ~s~ytn-,~~ii\tr;~~~cr~~~Rnnwi~dln~~~i~T; gap between phenotype and genotype as a goal. in this review is on using the added value provided genomes for acne function prediction.

50.

AG, Nielsen H: Neural network prediction of translation sites in eukervotes: _ .oersoectives for EST and oenome In Proceedings of the Fiffh lnfemafional Conference on

Pedersen

initiation analysis.

36.

*

N. Rouzb

prediction

35.

Acids 47.

Nucleic

P. Brunak S: A branch ooint consensus from found by non-circular analysis allows for better of acceptor sites. Nucleic Acids Res 1997, 25:3159-3163.

Tolstruo Arebiiopsk

Lowe

tools

detection

l

ciikw&?3: pre-mRNA

by combining local and global sequence Acids Res 1996, 24:3439-3452.

46.

48.

prediction

321

theliane

which

P, Delseny

M, Grellet

in Arebidopsis: old. Nat Genef 1996,

F, Van Montagu

non-canonical

M,

introns

are

14:383-384.

Sharpe PA, Burge CB: Classification of introns: U2-type or U12 type. Cell 1997, 91:875-879. i synthetic presentation of the recent finding of a second class of spliceosomal introns found in animal and plants [59]. Their distinctive splicing and signature are described together with the donor and the branch sites.

60.

61. 62.

C: Identifying potential tRNA J MD/ Viol 1991, 220:659-671.

genes

DNA sequences. Terryn N, Heijnen

L, De Keyser

M, DeClercq

Fichant

GA, Burks

Evidence theliene APETALA 445~237.245.

A, Van Asseldonck

in genomic

for an ancient chrcmosomal duplication in Arabidomis by sequencing and anelysing a 400 kb contig et the’ 2 locus on chromosome 4. FEBS Lea 1999,

R: