Mutation Research 595 (2006) 137–144
Comparative Mouse Genomics Centers Consortium: The Mouse Genotype Database
Jesse C. Wiley a, Manjula Prattipati a, Ching-Ping Lin b, Warren Ladiges a,∗
a Department of Comparative Medicine, University of Washington, Seattle, WA 98195, United States
b Department of Biomedical and Health Informatics, University of Washington, Seattle, WA 98195, United States
Received 7 September 2005; received in revised form 28 October 2005; accepted 2 November 2005 Available online 27 January 2006
Abstract
The Comparative Mouse Genomics Centers Consortium (CMGCC) is a branch of the Environmental Genome Project, sponsored by the National Institute of Environmental Health Sciences (NIEHS), that focuses on identifying human single nucleotide polymorphisms (SNPs) that may confer disease susceptibility within the human population. The goal of the CMGCC (http://www.niehs.nih.gov/cmgcc/) is to make genetic mouse models for human SNPs within cell cycle control, DNA replication and DNA repair genes that may be associated with human pathologies. To facilitate information sharing and analysis within the consortium, a set of informatics resources has been generated to support the mouse model development efforts. The primary entry point for information about the mouse models developed by the consortium is the CMGCC Genotype Database (http://mrages.niehs.nih.gov/genotype/), which maintains both a consortium-specific and a public-access display of the available and developing mouse models.
© 2005 Elsevier B.V. All rights reserved.
Keywords: Mouse Genotype Database; Genomics; Mouse models; Genetic variation; DNA repair
1. Background The completion of the human genome project has ushered in the next era of genetic exploration—the role of variability in genetics within the human population.
Abbreviations: CED, CMGCC Epidemiological Database; CGD, CMGCC Genotype Database; CMGCC, Comparative Mouse Genomics Centers Consortium; EGP, Environmental Genome Project; ES cells, embryonic stem cells; GEDAT, Gene Expression Data Analysis Tool; GO, gene ontology; MFD, Mouse Federated Database; SNP, single nucleotide polymorphism
∗ Corresponding author: W. Ladiges.
0027-5107/$ – see front matter © 2005 Elsevier B.V. All rights reserved. doi:10.1016/j.mrfmmm.2005.11.010
The primary form of genetic variation between individuals is single base changes in the DNA, known as single nucleotide polymorphisms (SNPs). SNPs may account for up to 90% of the genetic variability observed between individuals [1], and have been linked to heritable susceptibility to diseases ranging from cancer to Alzheimer’s disease [2]. Over 7 million SNPs have been identified within the human genome [3], and estimates suggest that 10–15 million SNPs may exist [4]. Although only a relatively small fraction of SNPs fall within the coding region of a gene, there may be 20,000–30,000 SNPs that alter the protein sequence [3]. Consequently, most of the human proteome may have polymorphic variants. The high frequency of SNPs within the human genome demonstrates the ubiquitous nature of these genetic polymorphisms, and the plausibility of
their playing a major role in disease prevalence. Beyond disease prevalence, susceptibility to environmental risk factors may also be affected by genetic polymorphisms, as the human toxicological response varies enormously within the population. Biological responses to environmental exposure to toxic agents are regulated by complicated environment–gene interactions [5]. Allelic differences in specific genes can modify biological responses to environmental health risks [6]. Consequently, SNPs may play a role in determining which individuals are at risk for a panoply of disorders arising from both genetic predispositions and environmental interactions. Genes involved in DNA repair processes and cell cycle control are known to be involved in several human diseases, most notably pathologies related to cancer and aging [7–9]. However, it is less clear what role human genetic variability within this subset of the genome plays in susceptibility to specific pathologies. In order to examine this issue, the Environmental Genome Project (EGP) [10–16] has begun in-depth re-sequencing of the human genes involved in DNA repair, signaling and cell cycle control (http://www.niehs.nih.gov/envgenom/genes.htm). Through the Comparative Mouse Genomics Centers Consortium (CMGCC, http://www.niehs.nih.gov/cmgcc/), one of the major components of the EGP, the identified SNPs are being incorporated into engineered mouse models, and a suite of informatics tools to analyze the animals has been developed. The CMGCC encompasses five centers located at the University of Washington, University of Texas Health Sciences Center San Antonio, University of Texas MD Anderson Cancer Center, Harvard University and the University of Cincinnati. The goal of the CMGCC is to generate transgenic and targeted mouse models bearing human SNPs for the study of human genetic variation in disease prevalence; these models will be made available to the general scientific community.
In order to foster that goal, the CMGCC Genotype Database (CGD) has just been released to the public. This paper describes the CGD, the structure and design imperative, and the mouse models currently available. 2. Construction and content 2.1. Overview The CMGCC Genotype Database has two levels of access: secure (for consortium members and affiliates) and open (intended for public browsing). The secure and open views of the database both maintain web-based front-ends. The secure version of the database is intended for collecting and sharing data
between groups that intend to work with the specific mouse models. Consequently, the secure front-end to the database allows workflow-based storage of information intended to track and share protocols, experimental data and bioinformatics information. The goal of the private view of the database is to provide a “portal” representation of genetic information about the animal model. While the secure version of the database is currently restricted to consortium members, members of the general scientific community will be given access to it with consortium approval. Consequently, all of the mouse-specific protocol information can be made available to the scientific community as the CMGCC-developed mouse models are employed in research efforts outside the consortium. The public view of the database draws authorized information from the same database instantiation, yet presents a detailed summary of the animal models, the current development state of the project, and links out to publicly available bioinformatics resources. The database schema is based upon the individual tasks associated with the development of genetically engineered mice. Each aspect of the development of the animals is stored in the database to facilitate later publications and data sharing between groups working with any particular mouse model. Specific information associated with each experimental assay, both with embryonic stem cell lineages (ES cells) and candidate founder line animals, is stored within the database to promote reusable protocol information. The root node of each record is the genetic identity of the specific animal model. As the data structure maintains a gene-oriented representation, each model is integrated with publicly available bioinformatic genetic information (such as GO terms and NCBI LocusLink/Entrez Gene data). Fig. 1 shows an example ‘portal view’ of a specific mouse model within the secure version of the database.
Specific modules represent different aspects of pertinent information associated with the specific animal models. 2.2. Representation of complex haplotypes and genetic crosses One interest of the CMGCC is the role of genetic interactions between biologically associated genes. Consequently, the CGD was designed to have the capacity to represent any number of genes, and any number of SNPs per gene. Within the user interface, there is a gene selection tool, which directs the information entered by the user to that particular gene. Consequently, NCBI referent data, SNP positional characteristics and GO terms
Fig. 1. Secure interface for model submission and viewing. The image above is taken from the consortium member front-end to the database. Each component shown above maps to one of the modules of the database. Consequently, data can be submitted or retrieved through each specific module. Central storage of this information is intended to promote detailed understanding of different aspects of the genetic data and mouse model construction for consortium members and collaborators. Through the consortium front-end to the CGD, protocol and data sharing can be directly administered.
can all be associated with distinct genes and polymorphisms within a complex genetic model. This provides the capacity to track haplotypes within which several SNPs may travel together, as is commonly observed within haplotype blocks [17]. Additionally, the multigene tracking capacity enables functional information about multiple genes to be associated with individual mouse models, as is necessary in polygenic crosses. For example, specific GO terms can be stored for each gene involved in a genetic cross to highlight the potential biological significance of any particular animal model.
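The multi-gene, multi-SNP representation described above can be pictured with a small data-structure sketch. This is not the CGD schema, which the paper does not publish; the class names, field names, gene symbols (GENE_A, GENE_B) and SNP labels are all hypothetical and serve only to show how one model record might hold any number of genes, each with its own SNPs and GO terms.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Snp:
    """One polymorphism; both fields are illustrative placeholders."""
    identifier: str   # hypothetical SNP label, not a real dbSNP ID
    position: int     # position within the gene's reference sequence


@dataclass
class Gene:
    symbol: str
    go_terms: List[str] = field(default_factory=list)
    snps: List[Snp] = field(default_factory=list)


@dataclass
class MouseModel:
    """Root record: the genetic identity of one animal model."""
    name: str
    genes: Dict[str, Gene] = field(default_factory=dict)

    def gene(self, symbol: str) -> Gene:
        # Plays the role of a gene selection tool: later entries are
        # routed to the record for this gene, created on first use.
        return self.genes.setdefault(symbol, Gene(symbol))

    def add_snp(self, symbol: str, snp: Snp) -> None:
        self.gene(symbol).snps.append(snp)

    def haplotype(self, symbol: str) -> Tuple[str, ...]:
        # SNPs carried together on one gene, ordered by position.
        snps = sorted(self.gene(symbol).snps, key=lambda s: s.position)
        return tuple(s.identifier for s in snps)


model = MouseModel("example polygenic cross")
model.add_snp("GENE_A", Snp("snp_a2", 310))
model.add_snp("GENE_A", Snp("snp_a1", 120))
model.add_snp("GENE_B", Snp("snp_b1", 45))
model.gene("GENE_A").go_terms.append("DNA repair")
model.gene("GENE_B").go_terms.append("cell cycle")

print(model.haplotype("GENE_A"))   # ('snp_a1', 'snp_a2')
print(sorted(model.genes))         # ['GENE_A', 'GENE_B']
```

Keying the genes by symbol keeps per-gene annotations separate, so a polygenic cross simply becomes a model with more than one entry.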
2.3. Public interface
The public interface draws from the CGD data records and provides public representation of the mouse model information deemed by the consortium members to be of the greatest public utility. The public web-interface uses a classic three-tier system for data representation. The primary tier is the search page, which enables complex Boolean search strategies to be performed. The second tier is the results page, which provides a summary listing of attributes for each mouse model matching the search criteria. The third tier is the details page for any given mouse model, which includes the majority of data within the database. The primary details omitted from the public database are workflow-based details, such as protocol and reagent information. The public database is also linked to other consortium resources, such as the Epidemiology Database, an integrated compendium of epidemiological studies examining the genes under investigation, and the Animal and Pathology databases. The Animal and Pathology databases are resources developed by other members of the consortium, intended to give experimental elaboration of the phenotypes associated with the animals generated by the CMGCC. Additionally, there are both a basic and an advanced query tool that allow the user to search the mouse models based upon numerous different components of the animal record, including Gene Ontology classification. An example of the results returned from a search is shown in Fig. 2, wherein a search for all currently available mouse models was executed through the public search interface.
3. Discussion and utility
The goal of the CGD is to provide an information conduit for the scientific community about the mouse models developed by the CMGCC, and to facilitate data storage and sharing within the consortium. One aspect of the information sharing promoted by the CGD is the integration of biological data from both CMGCC informatics teams and other public bioinformatics resources. The unified CMGCC informatics tools are referred to as the Mouse Federated Database (MFD) system. The MFD is a semi-interoperable suite of resources designed to facilitate the analysis of the genetic polymorphisms and mouse model data under study by the CMGCC. The MFD will be discussed in greater detail in the following sections.
3.1. Experimental data sharing
As mentioned within the previous sections, the CGD employs a modular design to store information along the workflow path. Given that mouse model development is the primary aim of the consortium, it may be necessary to save some aspects of the experimental data associated with the genotyping process for periods of time prior to publication. In order to minimize the loss of information across time, the CGD accepts submission of mouse model projects prior to completion. The status of any project is regularly updated by the contributing scientists. The experimental data generated across the developmental process can also be submitted to the CGD. The goal of this workflow model is two-fold. First, a central information storage resource will promote the description and publication of the data generated using the individual mouse models. Second, the communication between collaborating groups of scientists ought to be fostered by central storage of detailed protocol information. A component of the protocol data is very specific reagent information. Hence, the antibodies used in Western blot analysis, or the primer sequences made for PCR genotyping, can be obtained through the on-line resource. As the animal models become disseminated to the scientific community, the on-line availability of detailed protocol and reagent information should minimize the work incumbent upon contributing scientists to communicate these details. Additionally, the effort required from individual scientists ought to be minimized by the capacity to re-use the stored reagent information in the description of future models.
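The re-use of stored reagent information discussed under experimental data sharing can be illustrated with a short sketch. It is only an assumption about how such re-use might work, not a description of the CGD implementation; the `Reagent` and `ReagentRegistry` names, the model names and the primer sequence are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Reagent:
    kind: str     # e.g. "antibody" or "primer"
    name: str     # unique label used to look the reagent up again
    detail: str   # e.g. a primer sequence or an antibody source


class ReagentRegistry:
    """Stores each reagent once and records which models refer to it."""

    def __init__(self) -> None:
        self._reagents: Dict[str, Reagent] = {}
        self._usage: Dict[str, List[str]] = {}

    def register(self, reagent: Reagent) -> Reagent:
        # Registering the same name twice keeps the first stored record.
        return self._reagents.setdefault(reagent.name, reagent)

    def attach(self, model_name: str, reagent_name: str) -> None:
        if reagent_name not in self._reagents:
            raise KeyError(f"unknown reagent: {reagent_name}")
        models = self._usage.setdefault(reagent_name, [])
        if model_name not in models:
            models.append(model_name)

    def models_using(self, reagent_name: str) -> List[str]:
        return list(self._usage.get(reagent_name, []))


registry = ReagentRegistry()
# Placeholder sequence, not a primer from the paper.
primer = registry.register(Reagent("primer", "geno_fwd_1", "ACGTACGTACGTACGTAC"))
registry.attach("model_1", primer.name)
registry.attach("model_2", primer.name)   # second model re-uses the record

print(registry.models_using("geno_fwd_1"))   # ['model_1', 'model_2']
```

Because later models refer to an existing record instead of re-entering it, a contributing scientist describes each primer or antibody once.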
3.2. Bioinformatic data sharing The CGD is designed to function as an information portal into specific mouse models. Linkage through the CGD to web-based public domain bioinformatics resources facilitates this goal by providing a wealth of genetic information about the individual models. The bioinformatic data are derived from a variety of sources
Fig. 2. The CMGCC public front-end to the CGD is shown above. There are three levels to the public interface: the search page, the results page and the individual mouse model. Here, we demonstrate the results of an exhaustive search for all currently available mouse models generated by the CMGCC. All of the informatics resources are tethered to the individual animal record. The search page provides both a simple search and an advanced search option; the latter is depicted within the figure here. Also, in both the consortium and public sites, the HGNC genomic nomenclature is provided to facilitate an exhaustive search for models matching a specific gene. Most features associated with the individual animal models present within the consortium-specific front-end are available within the public site, with the notable exception of procedural and experimental data.
covering topics such as primary genetic information (genomic and mRNA sequences), protein sequence, gene modeling information and SNP identification, as well as associated disease information and biological pathway models. The goal is to provide the broadest informational coverage about not only the mouse models, but also the genes that have been chosen by the CMGCC for modeling. The web-portal approach is used within both the public and
secured entries to the database. While the secured interface provides access to primary experimental data absent in the public representation, the public interface enables greater search capacity of the animal models resident within the database.
3.3. Genetic and polymorphic tools associated with the CGD
The CGD links to several other bioinformatics resources cultivated by the consortium. The informatics tools generated by the consortium centers are directed toward the analysis of the polymorphisms, the genes of interest, or mouse phenotypic data. Consequently, the CGD attempts to utilize the breadth of consortium informatics tools within the mouse portal. A few of the resources that will ultimately be associated with the individual animal records are not yet completed, but should be on-line soon. The Gene Information module of the CGD stores links to two different CMGCC resources: the GeneServer and PolyDom (developed at University of Cincinnati by Dr. Aronow’s group). The GeneServer is a compendium of genetic information specifically about the genes under investigation by the CMGCC. One difficulty the consortium has encountered in organizing investigations across a diverse platform of biological specialists is the lack of a common reference point for the discussion of SNP impacts. For example, high-throughput sequence data may be clear and informative to the geneticists but may use nomenclatures that are more ambiguous to specialists in three-dimensional protein chemistry. In order to bridge the gaps in communication, and generate models that are more intuitively obvious to individuals with diverse backgrounds, we began developing manual ‘gene models’ (http://scmgc.cmo.washington.edu/GeneModels/index.html). These models graphically depict the location of SNPs in the genomic sequence, the mRNA and the protein sequence.
Inference about the potential impact of these polymorphisms upon protein function is augmented by overlaying the protein domains onto the primary protein sequence representation. Manual construction of gene models is laborious and time-consuming; the process is therefore being automated. PolyDom is a visualization tool that maps SNPs onto the domain-based representation of the protein. Additionally, the PolyDom model is associated with a three-dimensional rendering of the non-synonymous SNP, demonstrating the location of the amino acid substitution within the structure of the protein. PolyDom only represents the SNP relative to the protein sequence.
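The idea of overlaying protein domains onto the protein sequence to locate an amino acid substitution can be sketched in a few lines. This is not PolyDom's algorithm or interface, neither of which is described here; the domain names, residue ranges and substitution labels below are invented purely for illustration, and the substitution format (reference residue, position, variant residue) is an assumption.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Domain:
    name: str
    start: int   # first residue of the domain, 1-based, inclusive
    end: int     # last residue of the domain, 1-based, inclusive


def domain_for_residue(domains: List[Domain], residue: int) -> Optional[Domain]:
    """Return the first domain containing the residue, or None."""
    for domain in domains:
        if domain.start <= residue <= domain.end:
            return domain
    return None


def annotate(domains: List[Domain], substitutions: List[str]) -> Dict[str, Optional[str]]:
    """Map labels such as 'A50T' (assumed format) to a domain name or None."""
    result: Dict[str, Optional[str]] = {}
    for sub in substitutions:
        residue = int(sub[1:-1])   # digits between the two residue letters
        hit = domain_for_residue(domains, residue)
        result[sub] = hit.name if hit else None
    return result


# Hypothetical protein with two annotated domains.
domains = [Domain("domain_1", 10, 80), Domain("domain_2", 120, 200)]
print(annotate(domains, ["A50T", "G100S", "L120V"]))
# {'A50T': 'domain_1', 'G100S': None, 'L120V': 'domain_2'}
```

A substitution that falls inside an annotated domain is a more natural candidate for functional follow-up than one in an unannotated region, which is the kind of inference the overlay is meant to support.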
In a collaborative effort between the University of Cincinnati and our center, under the direction of Dr. Aronow, SNPlots is currently under development. SNPlots provides a graphical rendering of the non-synonymous SNP position in the genomic DNA sequence, the mRNA, and within the domain-demarcated protein sequence. Prototypes of SNPlots have already been circulated within the consortium. Ultimately, SNPlots gene records will be stored as the associated ‘gene model’. One component of the EGP project involves the re-sequencing of the human genes of interest to the consortium. The EGP re-sequencing effort involves a much higher sample number than was used in the original sequencing of the human genome, in order to identify polymorphisms and the frequencies with which they appear. The SNPs identified by this sequencing effort are stored within the GeneSNPs database (see Table 2). The CGD links the animal model to the polymorphism data associated with the gene through GeneSNP. Namely, within the SNP Position module of the CGD, linkage to both dbSNP and GeneSNP is maintained. The positional information about the SNP is currently entered manually by the contributing scientists. However, over time we hope to automate this process by extracting the positional information for specific SNPs from the SNPlots genomic representation. CMGCC selection of appropriate genes for mouse modeling efforts has utilized numerous analytic criteria, including in silico prediction of the SNP impacts and epidemiological evidence. The identification of SNP position, and subsequent predicted functional impact analysis using SIFT (http://blocks.fhcrc.org/sift/SIFT.htm), are two aspects of the selection criteria. However, epidemiological data suggesting a connection between human pathological conditions and specific polymorphisms were also weighed heavily in the selection criteria.
In order to facilitate the incorporation of epidemiological data into gene selection criteria, we have adapted previously published epidemiological review data into the CMGCC Epidemiology Database (http://mrages.niehs.nih.gov/epidemiology/public/search/). CED contains three primary tables: Polymorphisms in DNA Repair Genes, Epidemiological Studies of DNA Repair Polymorphisms and Risk of Various Cancers, and Results of Epidemiologic Studies of DNA Repair Polymorphisms and Risk of Various Cancers. Each of these tables is searchable and is currently being maintained by the UW center. The animal models for which epidemiological data are available within CED have links to these records within the public entry to the CGD.
3.4. CGD linkage to mouse phenotypic information
The vast majority of consortium efforts to date have been focused upon the development of mouse models. Currently, as the consortium moves into a period heavily focused upon the characterization of the cultivated animal models, the phenotypic data analysis resources will be more heavily employed. Two resources generated by the UTHSCSA branch of the consortium have focused on different aspects of the mouse phenotypic data: the Animal and the Pathology databases. The Animal database focuses upon gross metabolic and morphological data, such as weight and age of individual mice within experimental groups. Analysis of weight and survival across time can be performed by the application. The Pathology database is an ontology-driven resource for veterinary pathologists to enter their characterization of specific mouse models. CGD models for which data exist within the Animal and Pathology databases are linked to those data from the individual animal record in the public interface. Other analytical tools have been developed for the characterization of mouse phenotypic data, such as GEDAT (an online resource for the analysis of microarray data, developed by Dr. MacCloud’s group at MD Anderson) and microarray tools cultivated at the UTHSCSA center. Data from these resources will be linked to the CGD as they become available.

Table 1
Demographics of the CMGCC Genotype Database

Number of animal models              73
  Transgenic models                  19
  Targeted models                    54
Number of available animal models    33
  Transgenics                        14
  Targeted                           19
Number of genes represented          33
Different research areas intended    35

3.5. Public access to CMGCC mouse models
The CGD is the primary entry point for the mouse model repository. The CGD tracks the model projects prior to their completion. However, once the mouse models are finished and validated they can be submitted to the MMRRC mouse model repository through the CMGCC.
Once the animals are stored within the repository, the information about public ordering is entered into the database, and made available through both the public and secure versions of the database. 3.6. Current state and contents of the CGD The CGD currently contains the mouse models developed within the early years of the consortium. Currently, there are 73 animal models within CGD, 33 of which are completed and available (see Table 1). These mouse models are developed to cover a broad number of different research areas, denoted by the 35 different intended purposes for the animal models. These animals are being developed to explore topics from cancer to aging related pathologies such as Alzheimer’s disease. In order to provide a brief overview of the efforts of the consortium to date, Table 1 shows the number of mouse models currently available and under development. Clearly, the number of mouse models will grow in the next few years of the consortium. However, the models already generated represent a robust effort on the part of the CMGCC, and may already present valuable opportunities for collaborations between consortium and non-consortium members.
Table 2
EGP and CMGCC related informatic resources

Name of resource        Purpose                   Author              Web address
MGD (public interface)  Mouse models              UW CMGCC            http://mrages.niehs.nih.gov/genotype/public/search/
MGD (consortium)        Model entry               UW CMGCC            http://mrages.niehs.nih.gov/genotype/
EGP homepage            Project info              NIEHS               http://www.niehs.nih.gov/envgenom/
CMGCC homepage          Project info              NIEHS               http://www.niehs.nih.gov/cmgcc/
Seattle CMGCC           Local resources           UW CMGCC            http://scmgc.cmo.washington.edu/
CED                     Epidemiology              UW CMGCC            http://mrages.niehs.nih.gov/epidemiology/
MPHASYS                 Animal/pathology data     UTHSCSA             http://mrages.niehs.nih.gov:8080/cmgcc-mphasys/home.html
PolyDom                 SNP mapping               UC CMGCC            http://polydoms.cchmc.org/polydoms/index.jsp
GeneServer              CMGCC genes               UC CMGCC            http://genome.chmcc.org/geneserver/
GeneSNP                 CMGCC SNPdb               University of Utah  http://www.genome.utah.edu/genesnps/
EGP SNP sequencing      Human SNP identification  UW-Nickerson        http://egp.gs.washington.edu/
TraFac                  Promoter analysis         UC CMGCC            http://pixel.niehs.nih.gov:8080/trafaccurated/index.jsp
GEDAT                   Microarray analysis       MD Anderson CMGCC   http://spi.mdacc.tmc.edu/
3.7. Compendium of consortium informatics resources
Many of the CMGCC resources developed by other groups are linked to the CGD. In order to provide easier reference to the informatics and biological resources mentioned earlier, Table 2 provides the URL, as well as the authoring group, for many CMGCC related informatics resources. MPHASYS functions as the interface to the Animal and Pathology databases.
4. Conclusions
The goal of the CMGCC is to develop mouse models based on human genetic variation which will spark future research within the broader scientific community. We hope that the CGD will contribute to communication and collaborations between consortium and non-consortium scientists interested in the role of SNPs within CMGCC target genes. As part of the EGP, the CMGCC is particularly interested in the role that polymorphic variants of human genes may play in the biological response to sensitive environmental factors. We strongly urge interested scientists to examine the CGD and make use of its resources, including the facile means of contacting the developing scientists.
Acknowledgements
Many thanks to members of the CMGCC supported by U01ES11045 (Ladiges, PI) for their thoughtful input into the CGD. Brent Calder and John David Garza at UTHSCSA were of invaluable assistance in the development of this resource. Jerry Niehls, Larry Hall and F.O. Finch within the infrastructure support team at NIEHS were also invaluable in implementing this work. Further, the input of Jan Vijg and Bruce Aronow was extremely helpful at various points during the CGD development.
References
[1] A.J. Brookes, The essence of SNPs, Gene 234 (2) (1999) 177–186.
[2] E.L. Goode, C.M. Ulrich, J.D. Potter, Polymorphisms in DNA repair genes and associations with cancer risk, Cancer Epidemiol. Biomarkers Prev. 11 (12) (2002) 1513–1530.
[3] C.S. Carlson, et al., Mapping complex disease loci in whole-genome association studies, Nature 429 (6990) (2004) 446–452.
[4] W.Y. Wang, et al., Genome-wide association studies: theoretical and practical concerns, Nat. Rev. Genet. 6 (2) (2005) 109–118.
[5] P. Brennan, Gene-environment interaction and aetiology of cancer: what does it mean and how can we measure it? Carcinogenesis 23 (3) (2002) 381–387.
[6] S.N. Kelada, et al., The role of genetic polymorphisms in environmental health, Environ. Health Perspect. 111 (8) (2003) 1055–1064.
[7] R.G. Dumitrescu, I. Cotarla, Understanding breast cancer risk - where do we stand in 2005? J. Cell. Mol. Med. 9 (1) (2005) 208–221.
[8] J. Thacker, The RAD51 gene family, genetic instability and cancer, Cancer Lett. 219 (2) (2005) 125–135.
[9] J. Campisi, Senescent cells, tumor suppression, and organismal aging: good citizens, bad neighbors, Cell 120 (4) (2005) 513–522.
[10] S.L. Shalat, J.Y. Hong, M. Gallo, The Environmental Genome Project, Epidemiology 9 (2) (1998) 211–212.
[11] F.P. Guengerich, The Environmental Genome Project: functional analysis of polymorphisms, Environ. Health Perspect. 106 (7) (1998) 365–368.
[12] V. Brower, Looking for the trigger. The Environmental Genome Project to uncover the interactions of genes and the environment in disease, EMBO Rep. 4 (5) (2003) 452–454.
[13] S.H. Wilson, K. Olden, The environmental genome project: phase I and beyond, Mol. Interv. 4 (3) (2004) 147–156.
[14] S.M. Booker, Environmental genome project: a positive sequence of events, Environ. Health Perspect. 109 (1) (2001) A22–A23.
[15] J. Wakefield, Environmental genome project: focusing on differences to understand the whole, Environ. Health Perspect. 110 (12) (2002) A757–A759.
[16] R.R. Sharp, J.C. Barrett, The Environmental Genome Project and bioethics, Kennedy Inst. Ethics J. 9 (2) (1999) 175–188.
[17] A.G. Clark, The role of haplotypes in candidate gene studies, Genet. Epidemiol. 27 (4) (2004) 321–333.