Journal of Biomedical Informatics 39 (2006) 314–320 www.elsevier.com/locate/yjbin
Beyond the data deluge: Data integration and bio-ontologies Judith A. Blake *, Carol J. Bult The Jackson Laboratory, Bar Harbor, ME, USA Received 31 August 2005 Available online 21 February 2006
Abstract Biomedical research is increasingly a data-driven science. New technologies support the generation of genome-scale data sets of sequences, sequence variants, transcripts, and proteins; genetic elements underpinning understanding of biomedicine and disease. Information systems designed to manage these data, and the functional insights (biological knowledge) that come from the analysis of these data, are critical to mining large, heterogeneous data sets for new biologically relevant patterns, to generating hypotheses for experimental validation, and ultimately, to building models of how biological systems work. Bio-ontologies have an essential role in supporting two key approaches to effective interpretation of genome-scale data sets: data integration and comparative genomics. To date, bio-ontologies such as the Gene Ontology have been used primarily in community genome databases as structured controlled terminologies and as data aggregators. In this paper we use the Gene Ontology (GO) and the Mouse Genome Informatics (MGI) database as use cases to illustrate the impact of bio-ontologies on data integration and for comparative genomics. Despite the profound impact ontologies are having on the digital categorization of biological knowledge, new biomedical research and the expanding and changing nature of biological information have limited the development of bio-ontologies to support dynamic reasoning for knowledge discovery. Ó 2006 Elsevier Inc. All rights reserved. Keywords: Gene Ontology; Mouse Genome Informatics; Data integration; Biomedical ontologies; Bio-ontologies
1. Introduction The era of genome analysis was launched with such landmark publications as the first large-scale survey of transcriptional activity in the human genome using expressed sequence tags (ESTs) [1] and the first complete genome of a cellular organism [2]. In relatively quick succession, the complete genome sequences of several major biomedical model organisms were published including yeast (Saccharomyces cerevisiae) [3], nematode (Caenorhabditis elegans) [4], fruit fly (Drosophila melanogaster) [5], and the laboratory mouse (Mus musculus) [6]. With the availability of the genome sequences of these model systems and genome sequence of the human [7,8], biomedical researchers were able to compare and contrast catalogs of genome features. The explosion of sequence-based comparative genomics made even more apparent the potential and *
Corresponding author. Fax: +1 207 288 6132. E-mail address:
[email protected] (J.A. Blake).
1532-0464/$ - see front matter Ó 2006 Elsevier Inc. All rights reserved. doi:10.1016/j.jbi.2006.01.003
power of experimental studies in model organisms to illuminate aspects of human biology and disease [9]. In addition to the analysis of genome features, new technologies, such as microarrays and massively parallel signature sequencing, allow researchers to investigate global transcriptional activity in different cell and tissue types [10,11]. Technologies that permit comprehensive surveys of the protein content of cells and tissues are rapidly emerging and will undoubtedly reveal much about the nature of the proteome and the relationship of the genome, transcriptome, and proteome to one another [12,13]. The challenge of assembling and maintaining a comprehensive catalog of an organism’s genome features is matched by the challenge of managing the range of attribute information associated with these features. Our knowledge about each genome or cellular component is both complex and incomplete. Tens of thousands of scientific papers are published each year that correct old information and add new details about millions of distinct biological entities. One of the major challenges faced by
J.A. Blake, C.J. Bult / Journal of Biomedical Informatics 39 (2006) 314–320
the biomedical research community is how to access, analyze, and visualize heterogeneous data in ways that lead to novel insights into biological processes or that lead to the formulation of a hypothesis that can be tested experimentally. Beyond the annotation of each individual gene, transcript, and protein in an organism’s genome, is the much larger challenge of detecting and representing the networks and interactions of these parts in cells and multi-cellular complexes. Eventually, the characterization of these networks will allow biomedical researchers to directly connect gene function to organismal phenotype. Coincident with the emergence of genome-scale data generation technologies has been the development of ontologies that are specific for biomedical research domains. Ontology-based knowledge representations have been and are being developed in many domains to facilitate information extraction and retrieval and to support data interoperability in biology [14,15]. More than dictionaries or thesauri, bio-ontologies formally represent relationships between defined biological terms such that the terminologies can be used both by humans and by computers to exchange and explore information. Data integration and comparative genomics are two common approaches to interpreting genomic data. In this paper, we explore the impact of bio-ontologies for both data integration and comparative genomics. Our viewpoint is a decidedly practical one. The issues we address are not about the challenges of ontology development per se; rather we focus on how a specific ontology development effort, the Gene Ontology, is being used to facilitate data integration in the organism-specific Mouse Genome Informatics (MGI) database and, in the context of information systems, how it is designed to facilitate sharing biological knowledge across organisms. We begin by briefly describing the Gene Ontology (GO) project. We then describe the Mouse Genome Informatics (MGI) resource and the role that GO plays in providing semantic consistency to functional annotations for mouse genes. Finally, we discuss some of the challenges before us as we move beyond descriptive models of biological systems to predictive models. 2. The Gene Ontology: representative Bio-Ontology The Gene Ontology (GO) project was founded to advance the development and utilization of bio-ontologies and semantic standards for molecular biology [16]. The GO Consortium, the coordinating group for this project, is developing structured controlled terminologies of terms for the biological domains of molecular function, biological process, and cellular location of gene products. The GO Consortium members develop the structured set of terms, define the relationships and paths between terms, and then annotate genes and gene products using the terms. The ontologies and annotations are provided publicly as part of the GO database resource [http://www.geneontology.org/]. Gene Ontology is a work in progress and incorpo-
315
rates community input to help prioritize needed changes and improvements. Workshops that include domain and ontology development experts are held on a regular basis to continue with expansion and improvement of the resource. The GO Consortium includes the major model organism database groups (mouse, fly, yeast, Arabidopsis, worm, nematode, and rat) and a major sequence database resource (UniProt). In addition, many other genome annotation groups and bioinformatics centers participate both in the development of the ontologies and the use of the GO annotation process for functional annotation systems. The GO web site provides access to the ontologies, annotations, documentation, bibliography, and tools to support data analysis using the power of ontological information structures. GO has become a community standard for genome annotations and, together with other ontologies for biology such as anatomies [17,18] and cell types [19] provides semantic and ontological representations for biology. These resources are now essential for the information management of genetics and genomics data relevant to biomedicine. The Open Biological Ontologies resource (http:// obo.sourceforge.net/) provides a repository of community bio-ontologies. 3. The Mouse Genome Informatics database: representative database Mouse Genome Informatics (MGI; http://www. informatics.jax.org/) is the recognized community database for the laboratory mouse [20].1 MGI integrates genetic and genomic data for the mouse in order to fulfill its mission as an informatics resource that facilitates the use of the mouse as a model system for understanding human biology and disease processes. MGI provides extensive information about mouse genes, sequences, alleles, mutant phenotypes, strain polymorphisms, gene and protein function, developmental gene expression patterns, and curated mammalian homology data. MGI is also a platform for computational assessment of integrated biological data with the goal of identifying candidate genes associated with complex phenotypes. Data integration is a primary focus of MGI [21]. Data integration means identifying disparate data that describe the same biological entity (e.g., gene, transcript, protein, etc.). Data integration is critical to knowledge discovery because it allows different information about the same entity to be related in new ways [22]. Many hurdles to achieving data integration exist. For example, the same gene can be referred to by multiple different names in the literature. Or, the same name can be applied to more than one gene. Differ1 Similar model organism genome databases are maintained for yeast (http://www.yeastgenome.org/), Drosophila (http://flybase.bio.indiana. edu/), C. elegans (http://www.wormbase.org/), zebra fish (http://zfin. org/), and rat (http://rgd.mcw.edu/) among others.
316
J.A. Blake, C.J. Bult / Journal of Biomedical Informatics 39 (2006) 314–320
ent data about the same gene can be obtained from various sources. Sequence data obtained from a variety of resources may include overlapping sequence sets identified by different primary identifiers. In addition to data integration challenges, MGI must deal with incomplete data and inconsistent data and conflicts in data that appear over time. Key strategies employed by the MGI staff for data integration include use of unique, permanent accessioned identifiers for major biological entities to preserve crossreference integrity and the use of controlled terminologies and ontologies for functional annotations to achieve semantic normalization both with mouse-specific data as well as across multiple model organism databases. These integration processes enable comprehensive and accurate recall of diverse kinds of data. The combination of reference integrity and semantic normalization enable complex queries that facilitate knowledge discovery. 4. GO at MGI The incorporation of the Gene Ontology within MGI provides semantic integration and access to heterogeneous
data sets assembled from data loads and the biomedical literature. In this context, the ontologies are used as controlled terminologies for the assignment of consistent functional annotations. The assignment of appropriate GO terms to genes in MGI enables researchers to retrieve from the database all genes that are associated with a specific biological process, molecular function or cellular location. An example of how the Gene Ontology is used to facilitate data integration is illustrated by the functional annotation of the mouse hexosaminidase A gene (Hexa, MGI:96073) in MGI. The gene product of the Hexa gene is involved in hydrolase activity and one of the GO molecular function terms assigned to this gene by MGI curators is b-N-acetylhexosaminidase activity (GO:0004563) (Fig. 1). Data from seven publications support this functional annotation for Hexa [23–29]. Although each of these papers supports the GO term assignment, the nature of the evidence in each paper differs and there is substantial variation in the words that the authors of these papers use to describe the function of Hexa. In this case, GO is used primarily as a controlled terminology to enforce standards
Fig. 1. GO annotations for the Hexa gene in MGI. The molecular function annotation of b-N-acetylhexosaminidase activity is supported by multiple references and several different lines of evidence. The reference numbers are MGI journal numbers and provide a link to PubMed. The rows of this table provide a subset of the GO annotation record [not included on the web page are GO term Accession ID, Taxon ID, etc.]. One row is emphasized for example. The columns represented are (A) GO domain—here ‘Molecular Function’; (B) GO Term—here ‘b-N-acetylhexosaminidase activity’; (C) GO Evidence Code—here ‘IGI’’ for ‘Inferred from Genetic Interaction.’ This provides a high-level characterization of the type of experimental or computation method used to provide the data for this annotation; (D) Inferred From—here ‘MGI:96074’ is the MGI public Accession ID for the gene with which Hexa interacted in the experiment reported by this annotation. (E) Ref(s)—here ‘J:36305’ is the MGI journal identifier number. This is the citation for the experiment that informs this annotation row. The MGI journal records link to PubMed. The annotation row, then, provides the GO annotation for the gene as well as links to supporting evidence and citations.
J.A. Blake, C.J. Bult / Journal of Biomedical Informatics 39 (2006) 314–320
in how the functions of genes are described. Such naming standards are one of the keys to data integration [21]. In the Hexa example above, the advantage of the ontological structure of GO is not readily apparent. It could be argued that a simple flat list of terms would suffice to achieve integration in this and similar cases. However, the advantage of the ontological structure of GO becomes evident when the retrieval of integrated data becomes necessary or desirable. For example, a researcher may wish to retrieve all mouse genes whose products are involved in ‘lipid metabolism.’ If a simple flat list of controlled terminology terms is used for functional annotation, each term must be explicitly associated with the gene product in order for the Hexa gene to be returned in response to this query. Because of the structure of the GO, however, it is possible to traverse the relationships of the terms in the GO hierarchy to retrieve all of the genes that are annotated specifically to the term, ‘lipid metabolism,’ but also to any of the child terms, including ‘sequestering of lipid,’ the actual term to which Hexa is annotated (Fig. 2). A further use of GO annotation sets is to provide reference information that can be used for initial functional annotations. The comprehensive annotation of well-known genes provides the reference information that can be used to provide initial functional annotations for relatively uncharacterized genes via comparative sequence and domain analysis (Fig. 3). These computational assignments are identified by the evidence and citation information provided in the annotations, and can thus be included or not as appropriate for the use of the annotation sets. These initial annotation assignments can be extremely useful for experimental biologists making crucial decisions as to allocation of research resources for further characterization of specific genes. While the use of GO for functional annotations is a powerful tool for data integration of information
317
from the biomedical literature, the curation of the literature also provides a primary mechanism for continually updating the GO. All bio-ontologies must evolve as biological knowledge about genes and gene products emerges. The curation of literature reveals new and changing knowledge about biological systems and can result in changes to an ontology. For example, a recent emphasis by the MGI curation staff to fully annotate genes involved with the biological process of ‘regulation of blood pressure’ in the mouse resulted in 43 new terms being added to the GO. One immediate consequence of the data aggregation made available through structured controlled terminologies is the ability to develop statistical measures that evaluate the significance of experimental data sets relative to the complete set of information about a given system. For example, the analysis of data generated by such genome-scale technologies as gene microarray data illustrates the interplay between increasingly global experimental paradigms and the critical role that biomedical ontologies play in interpretation of large-scale data sets [11]. In many microarray experiments, the outcome is a list of tens to hundreds of genes that are differentially expressed in two or more samples under different experimental conditions. While gleaning biological context from a list of genes is a daunting task, it is made tractable by knowledgebases and ontologies. Indeed, a number of software applications have emerged in recent years that leverage the availability of Gene Ontology annotations to help researchers identify biological themes in their gene lists [30]. GO is not the only bio-ontology or structured controlled terminology included in the MGI resource. The Mouse Anatomy [18] and the Mammalian Phenotype Ontology [31] are also used for gene and gene product annotation by MGI curators. These, along with multiple sets of controlled terminologies, including most prominently the offi-
Fig. 2. A portion of the GO hierarchy showing the position of ‘lipid metabolism’ in the Biological Process ontology. Note that there are, in MGI, 444 mouse genes with 812 annotations assigned to this term or one of its subterms. At the level of the ‘sequestering of lipid’ term, there are only eight genes with nine annotations. One of the genes annotated to ‘sequestering of lipid’ term is Hexa.
318
J.A. Blake, C.J. Bult / Journal of Biomedical Informatics 39 (2006) 314–320
Fig. 3. The use of structured controlled terminologies enables robust data aggregation and recovery of information. As illustrated here, reviewed computational analysis (RCA) and inferred from electronic annotation (IEA) assignments of GO terms provide initial annotations for 12 relatively uncharacterized RIKEN gene models to ‘lipid metabolism’ and its subterms. These annotations provide experimental biologists with initial assessments of gene characteristics and help in the design of future experiments.
cial nomenclature for mouse genes, strains, and alleles, provide the semantic structure to support the data integration and retrieval needs of a major bioinformatics resource. 5. Ontologies and comparative genomics Organism-centric databases such as MGI will continue to play an important role in biomedical research. Recently, the availability of sequenced genomes has driven the use of sequence-based comparative methods. These approaches have provided tremendous insights into genome organization and have revealed novel biological features such as conserved non-coding regions that are often completely conserved at the nucleotide level from human to pufferfish [32]. Comparative biology also includes comparing and contrasting biological knowledge about the functions and roles that genes and gene products play in different organisms. GO has provided a common language for describing universal biological processes which, in turn, has facilitated both in-depth understanding of biology within a single organism and for comparing biological processes across multiple organisms. Consistent use of biological terms across different organisms means that researchers can retrieve data for a single organism or for multiple organisms accurately and consistently. Being able to compare data across species makes it possible to use and play to the strengths of all the different model organisms to study the function of genes. For example, GO annotations for many genomes are integrated in the GO database [www.godatabase.org]. Searching the GO database for annotations to the term ‘sequestering of lipid,’ the term discussed earlier in this paper, returns 35 gene products from six organisms [budding yeast, fission yeast, weed, fly, rat,
and mouse] that have been annotated to this term or one of its subterms. This set of annotations is independent of the evolutionary relationships between the gene products (although they might actually be homologs), reflecting, rather, the shared biology. Thus, we can use the combined expertise in biological annotations for diverse genomes to provide a robust set of gene products that share a defined aspect of their biology as another measure of comparative biology. 6. Biological knowledge in computable form The examples of data integration and recovery as represented by the MGI and GO systems demonstrate both the promise and challenge of data representations. On the one hand, the use of ontologies and the careful mapping of data across experimental systems provide a comprehensive mechanism to recover robust sets of relevant data in response to complex queries. On the other hand, these representations lack the precision in representing the full context surrounding knowledge that is described in a typical journal article comprised of unstructured text, tables, and figures. To fully represent the complexities of information presented in a biomedical study require a granularity of data representation for heterogeneous data that is difficult to acquire in computable form. For example, the presentation of phenotype data from a study of a targeted mutation of a single gene in the laboratory mouse requires the complete and controlled description of the genetic construct used, the genotypes of the mice studied, the details of the experimental assays, the explanation of the phenotypic results, and the narrative of the context for the study in relationship to current knowledge. Indeed, changing any one of these contexts can change the outcome of an experiment and complicate the represen-
J.A. Blake, C.J. Bult / Journal of Biomedical Informatics 39 (2006) 314–320
tation of knowledge about a gene or gene product. These accounts of the experiment as reported in the literature also include the perspective of the study in relation to human biology and diseases, often the genesis of the investigation in the first place. Evidence ontologies under development are a critical next step in the maturation of the bio-ontology field. There are active communities devoted to the representation of experimental information for microarray [33], toxicogenomic [34], and other experimental data, and more efforts are under discussion. The current challenge for data management is to facilitate the intersection of the expressiveness of the biomedical literature with the data retrieval and correlation capabilities of the structure information tracked with ontologies and stored in databases. The biomedical literature brings the expertise and experimental results of scientists into the context of the current knowledge. In the age of genomic-sized data sets, the biomedical literature is increasingly archaic as a form of transmission of scientific knowledge for computers, yet the written word is essential for humans. Can we improve the computability over biomedical information by using bioinformatics tools at the level of biomedical publications? Much work in this area has been undertaken by the natural language processing (NLP) community with critical assessments undertaken to evaluation information retrieval using GO annotations [35,36], but the core indexing of the literature to controlled terminologies is still primarily undertaken by curators at bioinformatics resources such as the NLM Mesh indexers [37]. Ultimately, the integration of structured terminologies and bio-ontologies into the scientific reports as part of the publication process may provide the means to capture both the human readable and the computationally tractable forms of semantic understanding. 7. The role and future of biomedical ontologies Biomedical ontologies provide the semantic structure that supports integration and comparison of large, complex data sets in biology. To date, the practical focus for ontology development and application has been on the capture of domain knowledge such as function and location. Ontologies support uniform data encoding thus providing semantic integration of information derived from multiple resources such as free-text publications or functional annotations of sequences from multiple providers. Such semantic integration facilitates the exchange and integration of data between resources and is essential for future development of semantic web services. Attention to standards of ontology development [38] ensure these domain-specific ontologies are useful as terminologies for annotation systems and that they can be used in formal ontological representations in biomedicine as these resources develop. Bio-ontologies are having a significant impact on data integration, access, and analysis through their use in capturing and structuring biological data. As the genomic and biomedical informatics systems coalesce,
319
the bio-ontologies will be of critical importance to the success of the data integration and retrieval. Ultimately, the construction and use of bio-ontologies as formal, logical systems will enable computational reasoning over complex semantic systems. The extension of bio-ontologies to model or represent biological systems faces formidable challenges, including the representation of environment, scale, and temporal–spatial context (i.e., complexity), as well as the incompleteness and ever changing nature of biological knowledge. The rapid advances in development of biomedical ontologies brings both encouragements and challenges to our efforts to integrate and access data for knowledge discovery. Acknowledgments J.A.B. and C.J.B. are funded through their work with the GO and MGI Consortiums. The GO project is funded by NIH/NHGRI HG02273. MGI database resources are funded by NIH/NHGRI (HG00330, HG02273), NIH/ NICHD (HD33745), and NCI (CA89713). The genesis of this paper was presented at the Workshop on Ontology and Biomedical Informatics (Rome, 2005) supported by the International Medical Informatics Association. References [1] Adams MD, Kerlavage AR, Fleischmann RD, Fuldner RA, Bult CJ, Lee NH, et al. Initial assessment of human gene diversity and expression patterns based upon 83 million nucleotides of cDNA sequence. Nature 1995;77(Suppl.):3–174. [2] Fleischmann RD, Adams MD, White O, Clayton RA, Kirkness EF, Kerlavage AR, et al. Whole-genome random sequencing and assembly of Haemophilus influenzae Rd. Science 1995;269:496–512. [3] Wood V, Gwilliam R, Rajandream MA, Lyne M, Lyne R, Stewart A, et al. The genome sequence of Schizosaccharomyces pombe. Nature 2002;415:871–80. [4] The C. elegans Sequencing Consortium. Genome sequence of the nematode C. elegans: a platform for investigating biology. Science 1998;282:2012–18. [5] Adams MD, Celniker SE, Holt RA, Evans CA, Gocayne JD, Amanatides PG, et al. The genome sequence of Drosophila melanogaster. Science 2000;287:2185–95. [6] Mouse Genome Sequence Consortium and Mouse Genome Analysis Group. Initial sequencing and comparative analysis of the mouse genome. Nature 2002;420:520–62. [7] Lander ES et al. Initial sequencing and analysis of the human genome. Nature 2001;409:860–921. [8] Venter JC et al. The sequence of the human genome. Science 2001;291:1304–51. [9] Rubin GM, Yandell MD, Wortman JR, Gabor Miklos GL, Nelson CR, Hariharan IK, et al. The genome sequence of Drosophila melanogaster. Science 2000;287:2185–95. [10] Oudes AJ, Roach JC, Walashek LS, Eichner LJ, True LD, Vessella RL, et al. Application of Affymetrix array and Massively Parallel Signature Sequencing for identification of genes involved in prostate cancer progression. BMC Cancer 2005;5:86. [11] Zhang W, Morris QD, Chang R, Shai O, Bakowski MA, Mitsakakis N, et al. The functional landscape of mouse gene expression. J Biol 2004;3:21. Epub 2004 Dec 6. [12] Elias JE, Haas W, Faherty BK, Gygi SP. Comparative evaluation of mass spectrometry platforms used in large-scale proteomics investigations. Nat Methods 2005;2:667–75.
320
J.A. Blake, C.J. Bult / Journal of Biomedical Informatics 39 (2006) 314–320
[13] Guan JQ, Chance MR. Structural proteomics of macromolecular assemblies using oxidative footprinting and mass spectrometry. Trends Biochem Sci 2005;Aug 2. [14] Bard J, Rhee S. Ontologies in biology: design, applications and future challenges. Nat Rev Genet 2004;5:213–22. [15] Stevens R, Goble CA, Bechhofer S. Ontology-based knowledge representation for bioinformatics. Brief Bioinform 2000;1:398–414. [16] The Gene Ontology Consortium. Gene Ontology: tool for the unification of biology. Nat Genet 2000;25:25–9. [17] Rosse C, Mejino JL. A reference ontology for biomedical informatics: the Foundational Model of Anatomy. J Biomed Inform 2003;36:478–500. [18] Hayamizu TF, Mangan M, Corradi JP, Kadin JA, Ringwald M. The Adult Mouse Anatomical Dictionary: a tool for annotating and integrating data. Genome Biol 2005;6:R9. Epub 2005 Feb 15. [19]
. [20] Blake JA, Eppig JT, Richardson JE, Bult CJ, Kadin JA, and the Mouse Genome Database Group. The Mouse Genome Database (MGD): Integration Nexus for the Laboratory Mouse. Nucleic Acids Res 2001;29:91–4. [21] Bult CJ. Data integration standards in model organisms: from genotype to phenotype in the laboratory mouse. Drug Discov Today 2002;1:163–8. [22] Bult CJ, Richardson JE, Blake JA, Kadin JA, Ringwald M, Eppig JT, and the Mouse Genome Informatics Staff. Mouse genome informatics in a new age of biological inquiry. In: Proceedings of the IEEE international symposium on bio-informatics and biomedical engineering; 2000. p. 29–32. [23] Cohen-Tannoudji M, Marchand P, Akli S, Sheardown SA, Puech JP, Kress C, et al. Disruption of murine Hexa gene leads to enzymatic deficiency and to neuronal lysosomal storage, similar to that observed in Tay–Sachs disease. Mamm Genome 1995;6:844–9. [24] Yuziuk JA, Bertoni C, Beccari T, Orlacchio A, Wu YY, Li SC, et al. Specificity of mouse GM2 activator protein and beta-N-acetylhexosaminidases A and B. Similarities and differences with their human counterparts in the catabolism of GM2. J Biol Chem 1998;273:66–72. [25] Lankar D, Vincent-Schneider H, Briken V, Yokozeki T, Raposo G, Bonnerot C. Dynamics of major histocompatibility complex class II
[26]
[27]
[28]
[29]
[30] [31]
[32]
[33] [34] [35]
[36]
[37] [38]
compartments during B cell receptor-mediated cell activation. J Exp Med 2002;4:461–72. Sango K, McDonald MP, Crawley JN, Mack ML, Tifft CJ, Skop E, et al. Mice lacking both subunits of lysosomal beta-hexosaminidase display gangliosidosis and mucopolysaccharidosis. Nat Genet 1996;14:348–52. Yamanaka S, Johnson MD, Grinberg A, Westphal H, Crawley JN, Taniike M, et al. Targeted disruption of the Hexa gene results in mice with biochemical and pathologic features of Tay–Sachs disease. Proc Natl Acad Sci USA 1994;21:9975–9. Phaneuf D, Wakamatsu N, Huang JQ, Borowski A, Peterson AC, Fortunato SR, et al. Dramatically different phenotypes in mouse models of human Tay–Sachs and Sandhoff diseases. Hum Mol Genet 1996;5:1–14. Guidotti JE, Mignon A, Haase G, Caillaud C, McDonell N, Kahn A, et al. Adenoviral gene therapy of the Tay–Sachs disease in hexosaminidase A-deficient knock-out mice. Hum Mol Genet 1999;5:831–8.
. Smith CL, Goldsmith CA, Eppig JT. The Mammalian Phenotype Ontology as a tool for annotating, analyzing and comparing phenotypic information. Genome Biol 2005;6:R7. Epub 2004 Dec 15. Sandelin A, Bailey P, Bruce S, Engstrom PG, Klos JM, Wasserman WW, et al. Arrays of ultraconserved non-coding regions span the loci of key developmental genes in vertebrate genomes. BMC Genomics 2004;5:99.
.
. Hirschman L, Yeh A, Blaschke C, Valencia A. Overview of BioCreAtIvE: critical assessment of information extraction for biology. BMC Bioinformatics 2005;6(Suppl. 1):S1. Epub 2005 May 24. Camon EB, Barrell DG, Dimmer EC, Lee V, Magrane M, Maslen J, et al. An evaluation of GO annotation retrieval for BioCreAtIvE and GOA. BMC Bioinformatics 2005;6(Suppl. 1):S17.
. Smith B, Ceusters W, Klagges B, Kohler J, Kumar A, Lomax J, et al. Relations in biomedical ontologies. Genome Biol 2005;6:R46. Epub 2005 Apr 28.