Annotating the genome of Medicago truncatula
Christopher D Town

Medicago truncatula will be among the first plant species to benefit from the completion of a whole-genome sequencing project. For each of these species, Arabidopsis, rice and now poplar and Medicago, annotation, the process of identifying gene structures and defining their functions, is essential for the research community to benefit from the sequence data generated. Annotation of the Arabidopsis genome involved gene-by-gene curation of the entire genome, but the larger genomes of rice, Medicago and other species necessitate the automation of the annotation process. Profiting from the experience gained from previous whole-genome efforts, a uniform set of Medicago gene annotations has been generated by coordinated international effort and, along with other views of the genome data, has been provided to the research community at several websites.

Addresses
The Institute for Genomic Research, 9712 Medical Center Drive, Rockville, Maryland 20850, USA
Corresponding author: Town, Christopher D ([email protected])
Current Opinion in Plant Biology 2006, 9:122–127
This review comes from a themed issue on Genome studies and molecular genetics
Edited by Nevin D Young and Randy C Shoemaker
Available online 2nd February 2006
1369-5266/$ – see front matter © 2005 Elsevier Ltd. All rights reserved.
DOI 10.1016/j.pbi.2006.01.004
Introduction
Genome annotation is the process by which plain DNA sequence is analyzed and information is added to specific regions to describe features of biological interest, the most important of which are protein-coding genes. There are two distinct components to genome annotation: prediction of gene structures (i.e. transcription and translational start and stop locations, exons/introns, and untranslated regions) and description of the putative function of the genes. The Medicago Genome Project, which aims to sequence the euchromatic gene space on a bacterial artificial chromosome (BAC)-by-BAC basis [1], is now well over halfway to completion. As of October 2005, 160 Mbp of sequence was either complete or in progress, with approximately 25 000 genes predicted by an automated annotation pipeline. The aim of this review
is to describe previous plant genome annotation efforts and how they have influenced the methods now being used for Medicago.

Arabidopsis genome annotation occurred in two distinct phases. During the sequencing process itself, each group performed manual (i.e. human-curated) annotation on all the BACs that it sequenced, and submitted the annotated BAC sequence to the public databases (GenBank, European Molecular Biology Laboratory [EMBL] and DNA Data Bank of Japan [DDBJ]). As a result of the different processes and the variable standards and nomenclature used by each group, the collective annotation was very heterogeneous and often confusing to the user community. In response to these shortcomings, the US National Science Foundation (NSF) funded The Institute for Genomic Research (TIGR) to perform a complete genome re-annotation using a uniform set of processes and nomenclature. Over a three-year period, both the structure and the function of every gene were subjected to some degree of human scrutiny (curation). The strategies, methods and results are described in [2].

The International Rice Genome Sequencing Program (IRGSP) had some similarities to the Arabidopsis project in that the sequencing was distributed between numerous centers and the resulting annotation was heterogeneous. Again, TIGR was funded by NSF to perform a complete genome re-annotation. In this case, however, the process was much more automated, with only limited curation of the data and consequently a significantly smaller outlay of resources. This process is well under way [3] and will continue until December 2007.

Lessons learned from the Arabidopsis re-annotation effort influenced the strategies that were implemented for the rice genome.
Particularly important in both efforts was the recognition that consistent and thorough annotation is best done when the entire genome has been sequenced, so that the description of the structure and function of each gene can be done in the context of all related members of that gene family. From an operational perspective, it also became clear that it is much simpler to perform gene prediction directly on chromosome-scale pseudomolecules, representing DNA sequence assembled from BACs, rather than to annotate individual BACs and then transfer the annotation to pseudomolecules at a later stage. Both these lessons have implications for Medicago genome annotation. The first lesson (i.e. the importance of gene families) played an important role in the widespread agreement among the sequencing groups and funding agencies to invest only a modest effort into the annotation process until sequencing was completed.
Evolution of large-scale, automated annotation strategies and their implications for Medicago
During both the initial annotation of the Arabidopsis genome and its re-annotation at TIGR, the process involved a blend of computational processes and human curation. Gene prediction algorithms and database searches were run on the finished DNA sequences and on provisional gene models. Subsequently, the gene models were refined in the light of experimental data (DNA and protein alignments from both Arabidopsis and other species) on a gene-by-gene basis by a human curator. The names of each protein were assigned on the basis of inspection of the best database matches and the credibility of the annotations of these matches. As can be imagined, this is a labor-intensive task, even for the relatively small Arabidopsis genome, and involved well over 200 person-months of time for the 120 Mb Arabidopsis genome with its 27 000+ protein-coding genes. Some of the human burden of updates was relieved by the development of a program that automatically corrected gene models on the basis of new Arabidopsis full-length (FL)-cDNA/expressed sequence tag (EST) evidence (Program to Assemble Spliced Alignments [PASA]; [4]). Although alignments with DNA or protein sequences from other species could be used to flag gene models that were inconsistent with such evidence, the correction of gene models still required human intervention [2].

The TIGR rice whole-genome annotation was planned to be a largely automated process. FGENESH, trained on a set of curated monocot gene models (www.softberry.com), was compared with other gene prediction programs, found to be the most accurate and hence adopted for the default set of gene models. PASA was then used to refine gene models using rice FL-cDNA and EST data. Work in progress at TIGR involves the training and evaluation of ‘JIGSAW’ [5], a successor program to ‘Combiner’ [6].
JIGSAW weights evidence from both ab initio predictions and database matches from multiple species to produce the best model, thereby serving a role similar to that provided by human curators in the Arabidopsis project.

The assignment of accurate and meaningful names or descriptions to structurally annotated genes is more difficult to automate. In the case of Arabidopsis, the functions assigned to predicted proteins were based upon human curation that included examination of the best database matches, investigation of any relevant literature (at least to the level of the abstract), and the filtering and rejection of any matches to proteins that were considered to be poorly annotated, thus avoiding what is often called ‘transitive error propagation’. TIGR has an in-house database of protein sequences collected from all major public repositories that has been rendered non-redundant. The protein descriptions in this database are concatenated in the FASTA header line in a hierarchy that
reflects the level of curation and credibility associated with the database from which a particular description originates. The process for automated gene naming in rice takes the descriptions of the best hit(s) to this database and uses a series of decision steps that include the elimination of redundant or non-informative annotations to arrive at a reasonable protein description that is useful to the research community. Because this TIGR database incorporates the protein descriptions from the latest Arabidopsis annotation that have received human curation, the automatically applied descriptions are for the most part sensible and reliable.
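The decision steps described above can be illustrated with a minimal sketch. It assumes that each database hit carries a list of descriptions already ordered from the most to the least credible source, and that hits are supplied best-first. The list of non-informative terms, the default name and the function names are hypothetical choices made for illustration; they are not taken from the TIGR pipeline.

```python
# Illustrative sketch only: the term list and function names are assumptions,
# not part of the TIGR rice naming pipeline.
NON_INFORMATIVE = (
    "hypothetical protein",
    "unknown protein",
    "unnamed protein product",
    "expressed protein",
    "predicted protein",
)


def is_informative(description):
    """Return True if the description says something about function."""
    text = description.strip().lower()
    return bool(text) and not any(term in text for term in NON_INFORMATIVE)


def choose_description(hits, default="hypothetical protein"):
    """Pick a protein description from ranked database hits.

    hits: list of hits, best hit first; each hit is a list of descriptions
    ordered from the most credible source to the least credible one.
    """
    seen = set()
    for hit in hits:
        for description in hit:
            key = description.strip().lower()
            if key in seen:  # skip redundant descriptions
                continue
            seen.add(key)
            if is_informative(description):
                return description.strip()
    return default


hits = [
    ["hypothetical protein", "Unknown protein"],
    ["RmlC-like cupin", "cupin family protein"],
]
print(choose_description(hits))  # prints "RmlC-like cupin"
```

Walking the hits in order of credibility means that a curated description, where one exists, is preferred over an uncurated one, which is the property that the hierarchy in the FASTA header line is intended to provide.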
Medicago genome annotation
Medicago genome sequencing began at the University of Oklahoma in 2001 [7], with annotation in the form of automated gene predictions by Genscan [8] and FGENESH [9] provided through a GBrowse [10] viewer. For the coordinated international genome initiative, it was agreed that only automated annotation would be performed until the genome sequence was deemed ‘complete’. The original intention was that each of the two US sequencing groups (at TIGR and at the University of Oklahoma) would be responsible for the annotation of the BACs on the chromosomes that they sequenced. In the European Union, both sequencing groups (John Innes Centre/Sanger Centre and Institut National de la Recherche Agronomique [INRA]-Toulouse/Genoscope) delegated the responsibility for data management and annotation to the Munich Institute for Protein Sequence (MIPS). Because the automated annotation was intended to provide simply a working draft of Medicago gene structures and protein sequences, only finished un-annotated DNA sequence was to be submitted to the public databases (GenBank and EMBL), with the annotations hosted at the respective sequencing and informatics websites associated with the project. It soon became clear, however, that this model was not in the best interests of the user community. Not only did the sequencing/informatics centers use different annotation pipelines for their respective chromosomes, but efforts by several centers to generate ‘whole-genome’ datasets were producing a plethora of gene models and predicted protein datasets that could only confuse the average user.
Evolution of IMGAG: the International Medicago Genome Annotation Group
To address this issue, bioinformaticians from the various groups coalesced into the self-identified International Medicago Genome Annotation Group (IMGAG; see www.medicago.org/genome/IMGAG.php for details), an acronym that also serves as a namespace identifier for its annotations. After evaluating the various pipelines, the IMGAG agreed on a single pipeline that would be distributed across several centers and would draw upon their respective resources, experience and expertise. The resulting gene-structure predictions would be adopted by this group and their associated databases and web resources, and promoted to the research community as the canonical set of M. truncatula genome annotations.

A schematic of the IMGAG pipeline is shown in Figure 1. The core of the pipeline is the Eugene system [11], which combines both intrinsic (ab initio predictions) and extrinsic data (cDNA/EST and protein matches) to produce its gene structure predictions (models) and thus resembles the JIGSAW program. In addition to its own splice-site prediction modules, Eugene, as implemented in the IMGAG pipeline, accepts FGENESH predictions that have been enhanced by PASA-based cDNA/EST updates (http://medicago.toulouse.inra.fr/imgag_egn/cgi-bin/egn_getinfo.cgi).

The naming or description of gene products follows a similar protocol to that used for rice, the main difference being that precedence is given to matches in the INTERPRO [12] collection of databases, before defaulting to the TIGR protein database. Other data from these searches are captured as evidence and made available for text searching. It is important for users to be aware of the limitations and pitfalls of this approach and to make intelligent use of domain and other information that may be provided along with the basic annotation.
Figure 1. The IMGAG pipeline.

Accuracy of gene prediction programs and pipelines
Several gene prediction programs that were used for Medicago genome annotation before the evolution of the IMGAG pipeline were evaluated for accuracy against training sets that consisted of genes with full-length expressed sequence support and manually curated genes that were supported by protein evidence. FGENESH-dicot (www.softberry.com) was more accurate than the other tested programs (Genscan+, genemark.hmm, glimmerA and grail, in that order). The original ‘dicot’ version of FGENESH was actually trained on Arabidopsis, but performed surprisingly well on Medicago DNA sequence. However, an initiative by the Center for Computational Genomics and Biology (CCGB) at the University of Minnesota led to the re-training of the FGENESH program for Medicago gene models and the generation of a Medicago-specific matrix that would be expected to perform better.

A comparison of the gene-structure predictions of FGENESH-dicot (Arabidopsis-trained) with those of FGENESH-Mt (Medicago-trained) on a set of 443 finished BACs revealed that 10 000 genes were predicted using the dicot matrix and 13 500 by the Medicago matrix. All but 99 of the genes predicted with the dicot matrix were also predicted by FGENESH-Mt, whereas 3368 were predicted only by the Medicago matrix. The majority of the genes in this latter set were short: 70% were less than 800 bp and 95% less than 2.4 kb. Only a fraction of these predictions had matches to either other plant gene indices or a protein database (16% and 9%, respectively; F Cheung, C Town, unpublished). This is a surprising difference that warrants further investigation to facilitate the discovery and annotation of the complete set of Medicago genes. In a small-scale reverse transcription PCR (RT–PCR) experiment using gene-specific primers, most (if not all) of the predicted genes were transcribed but very few (if any) of the transcripts were spliced according to the FGENESH-Mt predictions (J Redman, Y Xiao, C Town, unpublished). Larger-scale experiments are planned; meanwhile, these FGENESH-Mt-specific gene predictions are assigned to the ‘low confidence’ category (see below).
A recent evaluation of the overall accuracy of IMGAG pipeline structure predictions showed that it can reach a sensitivity and a specificity of 80% at the gene level and 95% at the exon level (T Schiex, pers. comm.), with most of the discrepancies occurring in initial exons (choice of start codon). This level of performance is reached in an ideal situation in which the gene model is fully covered by ESTs.
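To make the exon-level figures concrete, the sketch below computes sensitivity and specificity for a set of predicted exons against a reference set. It assumes the convention commonly used in gene-prediction benchmarking, in which an exon counts as correct only if both boundaries match exactly, sensitivity is the fraction of reference exons that were predicted, and specificity is the fraction of predicted exons that are correct. The article does not state which definitions were used in the IMGAG evaluation, and the coordinates and function name here are invented for illustration.

```python
def exon_accuracy(predicted, reference):
    """Exon-level sensitivity and specificity.

    predicted, reference: iterables of (start, end) exon coordinates.
    An exon is a true positive only if both boundaries match exactly.
    """
    pred = set(predicted)
    ref = set(reference)
    true_positives = len(pred & ref)
    sensitivity = true_positives / len(ref) if ref else 0.0
    specificity = true_positives / len(pred) if pred else 0.0
    return sensitivity, specificity


# Hypothetical coordinates: one exon has a wrong 3' boundary and one
# predicted exon has no counterpart in the reference.
reference = [(100, 200), (300, 400), (500, 600), (700, 800)]
predicted = [(100, 200), (300, 400), (500, 620), (700, 800), (900, 950)]

sn, sp = exon_accuracy(predicted, reference)
print(sn, sp)  # prints "0.75 0.6"
```

The same calculation can be applied at the gene level by treating a gene as correct only when every one of its exons matches, which is why gene-level figures are lower than exon-level figures.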
Accessing and visualizing the annotation data
The IMGAG annotation consists of, first, flat files of predicted coding sequences (CDS) and proteins in FASTA format with descriptive header lines (Box 1), from which redundancies that are due to annotation of the same gene on different BACs have been removed; and second, complete annotation of each BAC in TIGR XML format (see ftp.tigr.org for the TIGR XML document type definition [DTD]). IMGAG annotation is applied only to finished BACs in GenBank (high-throughput genomic sequence [HTGS] Phase 3: see http://www.ncbi.nlm.nih.gov/HTGS/), and is run in a batch mode (monthly or in increments of 50 new BACs), so that recently completed BACs, as well as all the unfinished sequences, do not have IMGAG gene predictions. To address this issue, and to provide an up-to-date view of how the individual BAC sequences are assembling into sequence contigs, a daily contigging pipeline was developed (http://www.tigr.org/tigr-scripts/medicago/contig_location_association.pl). This pipeline assembles all available genomic sequence (phases 1, 2 and 3) from GenBank into sequence-based contigs, using a process similar to that used by the TIGR gene indices [13]. The contigs are then run through FGENESH-Mt for prediction of gene structures, CDS and protein sequences, which are displayed through GBrowse and made available for searching by BLAST. These predictions may be less robust than those produced by the IMGAG pipeline and lack functional assignments, but they provide a valuable first look at the coding potential of all of the Medicago sequence generated to date.

Another limitation of the IMGAG process at this time is that the annotations are not submitted to GenBank as part of the BAC record, and thus the CDS and protein predictions are not incorporated into the non-redundant (nr) databases. It is anticipated that this situation will be resolved by the time this article is published.

Although a well-annotated set of genes is essential for generating searchable datasets, and the acceptance of a single set of annotations will facilitate communication and minimize confusion in the user community, it should be borne in mind that predictions are just that. Many bioinformaticians feel that the user is better served by providing views of the genome that include both multiple sets of gene predictions and various kinds of supporting evidence, such as cDNA/EST or protein matches. Hence, most of the Medicago genome project data centers provide both the IMGAG annotation and any other features deemed relevant on a genome browser, the most common of which is GBrowse. Other genome browsers include site-specific tools that have been developed in house (e.g. a browser at CCGB built on Ensembl [14,15] or those developed at TIGR and MIPS). A list of the websites that provide genome views and search capabilities for the Medicago genome annotation is given in Table 1. A more comprehensive description of Medicago web resources can be found in [16].

Box 1. FASTA header format for IMGAG predictions.
>IMGA|AC146331_42.3 RmlC-like cupin AC146331.34 13984-14242 H EGN_Mt050217 20050419
The successive elements in this set of information are:
IMGA: registered namespace that uniquely identifies the source of the annotation.
AC146331_42.3: gene identifier that includes the BAC GenBank accession joined to the gene identifier and version number.
RmlC-like cupin: name or description of the gene.
AC146331.34 13984-14242: GenBank accession, including version number, followed by the coordinates of the 5′ and 3′ ends of the gene.
H: annotation evidence code: F, full coverage/FL-cDNA; E, expressed/EST matches; H, homology/heterologous; I, intrinsic/ab initio/inferred/hypothetical; L, ‘low quality’ gene calls (gene calls not in F, E or H).
EGN_Mt050217: version of Eugene pipeline used for annotation.
20050419: date of annotation run.

Table 1. Annotation data for the Medicago genome.
TIGR (http://www.tigr.org/tdb/e2k1/mta1/)
IMGAG annotation, searchable by keyword, locus and BAC name. GBrowse view of all BACs and sequence contigs with first-pass FGENESH annotation, tentative consensus sequence (TC)/EST matches and other evidence. Downloads of CDS, protein FASTA files and XML formatted BAC annotation.
MIPS (http://mips.gsf.de/proj/plant/jsf/medi/index.jsp)
IMGAG annotation. GBrowse and DBBrowser views of IMGAG annotation. Text search of clones, genes and contigs. Sequence search of BAC nucleotide sequences. Downloads of CDS, protein FASTA files and XML formatted BAC annotation.
OU (http://www.genome.ou.edu/medicago.html)
GBrowse view of FGENESH-Mt and Genscan predictions and IMGAG annotation on BACs sequenced at OU as well as a variety of other evidence. BLAST and text-searching capabilities.
CCGB (http://ensembl.ccgb.umn.edu/Medicago_truncatula)
Ensembl views of IMGAG annotation. DAS reference server. Downloads of transcript, protein, and flanking sequences in multiple formats. Direct MySQL database access.
UMN (http://www.medicago.org/genome/)
Central coordinating site for US Medicago sequencing effort. Includes a registry of BACs being sequenced, information on genome scale assemblies, genome statistics, marker sequences, BLAST searching as well as downloads of IMGAG annotations.
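For users who download the FASTA files, the header layout described in Box 1 can be split into its elements with a few lines of code. The sketch below is a hedged illustration rather than a tool distributed by IMGAG. It assumes that the namespace and gene identifier are separated by a vertical bar, that the coordinates are written as start-end, and that the last five whitespace-separated fields are always the accession, coordinates, evidence code, pipeline version and run date, so that everything between the identifier and those fields is the free-text description. The class and function names are hypothetical.

```python
from typing import NamedTuple

# Evidence codes as listed in Box 1.
EVIDENCE_CODES = {
    "F": "full coverage/FL-cDNA",
    "E": "expressed/EST matches",
    "H": "homology/heterologous",
    "I": "intrinsic/ab initio/inferred/hypothetical",
    "L": "low quality gene call",
}


class ImgagHeader(NamedTuple):
    namespace: str
    gene_id: str
    description: str
    accession: str
    start: int
    end: int
    evidence: str
    pipeline: str
    run_date: str


def parse_imgag_header(line):
    """Split an IMGAG FASTA header line into its Box 1 elements.

    Assumes the last five fields are fixed and the description is
    whatever lies between the identifier and those fields.
    """
    fields = line.strip().lstrip(">").split()
    if len(fields) < 6:
        raise ValueError("header has too few fields: %r" % line)
    namespace, _, gene_id = fields[0].partition("|")
    accession, coords, evidence, pipeline, run_date = fields[-5:]
    description = " ".join(fields[1:-5])
    start, _, end = coords.partition("-")
    return ImgagHeader(namespace, gene_id, description, accession,
                       int(start), int(end), evidence, pipeline, run_date)


header = parse_imgag_header(
    ">IMGA|AC146331_42.3 RmlC-like cupin AC146331.34 13984-14242 "
    "H EGN_Mt050217 20050419"
)
print(header.gene_id, header.description, EVIDENCE_CODES[header.evidence])
# prints "AC146331_42.3 RmlC-like cupin homology/heterologous"
```

Parsing from the right-hand end of the line is a deliberate choice: gene descriptions can contain any number of words, whereas the trailing fields have a fixed order.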
The future of Medicago genome annotation
The expectation of both the sequencing and informatics groups as well as the user community is that the Medicago Genome Project will ultimately produce a uniform, high-quality annotation of the entire Medicago gene space. Already, the IMGAG processes ensure a degree of consistency across all finished sequence, regardless of its source, that was not realized during the first round of annotation of either the Arabidopsis or the rice genomes. However, no effort has yet been made to use paralogous gene families to check and improve the consistency of either gene structures or the naming of gene functions. This is clearly important, but it remains to be seen how easily this can be accomplished using the automated processes necessitated by the limited resources available for Medicago. One way to improve functional annotation will be to encourage community input through the use of browser-based tools, like that for Arabidopsis at PlantGDB [17] and at TIGR OSA1 for rice (http://rice.tigr.org/tdb/e2k1/osa1/ca/rice_ca.shtml).
Conclusions
The focus of this article has been on the structural and functional annotation of the Medicago genome at the sequence level. Databases such as The Arabidopsis Information Resource [18], however, contain a wider variety of data, such as expression profiles, associated mutants, phenotypes, literature, that can be overlaid on genes. As many of these data-types require human curation, such databases are relatively costly to develop and maintain. Other databases, such as the rice database at TIGR, have to date restricted their value addition to data-types that can be mapped to the genome (and hence genes) by sequence association. These include the locations of expression array probes, the sites of insertion mutants, and the location of genetic markers. After the Medicago genome sequence is nearing completion and pseudomolecules have been constructed, it should be an early goal of the Medicago informatics groups to provide similar enhancements to the basic gene structure–function annotation. Once the annotation has matured and stable identifiers have been established, it can be enhanced by mapping the genes to expression data and, in conjunction with resources such as the Legume Information System (LIS; www.comparative-legumes.org/), by establishing informative relationships between Medicago genes and their likely orthologs in other legume species.

Acknowledgements
Medicago sequencing at TIGR is supported by the US National Science Foundation (NSF) Cooperative Agreement No. DBI-0321460. The author would like to acknowledge the work of the many members of the sequencing groups in both the US and the EU, and particularly the members of IMGAG for their comments on this manuscript and their hard work and cooperative spirit in the annotation project.

References and recommended reading
Papers of particular interest, published within the annual period of review, have been highlighted as:
• of special interest
•• of outstanding interest

1. Young ND, Cannon SB, Sato S, Kim D, Cook DR, Town CD, Roe BA, Tabata S: Sequencing the genespaces of Medicago truncatula and Lotus japonicus. Plant Physiol 2005, 137:1174-1181.
A good overview of the Medicago truncatula genome project.

2. Haas BJ, Wortman JR, Ronning CM, Hannick LI, Smith RK Jr, Maiti R, Chan AP, Yu C, Farzad M, Wu D et al.: Complete reannotation of the Arabidopsis genome: methods, tools, protocols and the final release. BMC Biol 2005, 3:7.
A comprehensive description of the re-annotation of the Arabidopsis genome. The authors provide a good overview of how both gene structure and gene function are annotated.

3. Yuan Q, Ouyang S, Wang A, Zhu W, Maiti R, Lin H, Hamilton J, Haas B, Sultana R, Cheung F et al.: The institute for genomic research Osa1 rice genome annotation database. Plant Physiol 2005, 138:18-26.
A good description of the automated annotation pipeline used for the TIGR re-annotation of the rice genome.

4. Haas BJ, Delcher AL, Mount SM, Wortman JR, Smith RK Jr, Hannick LI, Maiti R, Ronning CM, Rusch DB, Town CD et al.: Improving the Arabidopsis genome annotation using maximal transcript alignment assemblies. Nucleic Acids Res 2003, 31:5654-5666.

5. Allen JE, Salzberg SL: JIGSAW: integration of multiple sources of evidence for gene prediction. Bioinformatics 2005, 21:3596-3603.

6. Allen JE, Pertea M, Salzberg SL: Computational gene prediction using multiple sources of evidence. Genome Res 2004, 14:142-148.
A good description of the issues involved in computationally merging ab initio and experimental evidence that supports gene predictions. This paper is a bit less technical than the JIGSAW paper [5].

7. Roe BA, Kupfer DM: Sequencing gene rich regions of Medicago truncatula, a model legume. In Molecular Breeding of Forage and Turf. Edited by Hopkins ZY, Yang ZY, Mian R, Sledge ME, Barker RE. Kluwer Academic Publishers; 2004:333-344.

8. Burge C, Karlin S: Prediction of complete gene structures in human genomic DNA. J Mol Biol 1997, 268:78-94.

9. Salamov AA, Solovyev VV: Ab initio gene finding in Drosophila genomic DNA. Genome Res 2000, 10:516-522.

10. Stein LD, Mungall C, Shu S, Caudy M, Mangone M, Day A, Nickerson E, Stajich JE, Harris TW, Arva A et al.: The generic genome browser: a building block for a model organism system database. Genome Res 2002, 12:1599-1610.

11. Foissac S, Bardou P, Moisan A, Cros MJ, Schiex T: EUGENE’HOM: a generic similarity-based gene finder using multiple homologous sequences. Nucleic Acids Res 2003, 31:3742-3745.

12. Mulder NJ, Apweiler R, Attwood TK, Bairoch A, Bateman A, Binns D, Bradley P, Bork P, Bucher P, Cerutti L et al.: InterPro, progress and status in 2005. Nucleic Acids Res 2005, 33:D201-D205.

13. Pertea G, Huang X, Liang F, Antonescu V, Sultana R, Karamycheva S, Lee Y, White J, Cheung F, Parvizi B et al.: TIGR gene indices clustering tools (TGICL): a software system for fast clustering of large EST datasets. Bioinformatics 2003, 19:651-652.

14. Birney E, Andrews TD, Bevan P, Caccamo M, Chen Y, Clarke L, Coates G, Cuff J, Curwen V, Cutts T et al.: An overview of Ensembl. Genome Res 2004, 14:925-928.

15. Hubbard T, Andrews D, Caccamo M, Cameron G, Chen Y, Clamp M, Clarke L, Coates G, Cox T, Cunningham F et al.: Ensembl 2005. Nucleic Acids Res 2005, 33:D447-D453.

16. Cannon SB, Crow JA, Heuer ML, Wang X, Cannon EK, Dwan C, Lamblin AF, Vasdewani J, Mudge J, Cook A et al.: Databases and information integration for the Medicago truncatula genome and transcriptome. Plant Physiol 2005, 138:38-46.
A good and comprehensive summary of M. truncatula genome informatics.

17. Schlueter SD, Wilkerson MD, Huala E, Rhee SY, Brendel V: Community-based gene structure annotation. Trends Plant Sci 2005, 10:9-14.

18. Rhee SY, Beavis W, Berardini TZ, Chen G, Dixon D, Doyle A, Garcia-Hernandez M, Huala E, Lander G, Montoya M et al.: The Arabidopsis Information Resource (TAIR): a model organism database providing a centralized, curated gateway to Arabidopsis biology, research materials and community. Nucleic Acids Res 2003, 31:224-228.