Mouse genomics: Making sense of the sequence

Mouse genomics: Making sense of the sequence

Dispatch R311 Mouse genomics: Making sense of the sequence Ian J. Jackson Interpretation of the human genome sequence relies on studies of model ge...

54KB Sizes 0 Downloads 88 Views

Dispatch

R311

Mouse genomics: Making sense of the sequence Ian J. Jackson

Interpretation of the human genome sequence relies on studies of model genetic organisms. Mouse genetics and genomics will help to identify all the genes, and to determine their function. Address: MRC Human Genetics Unit, Western General Hospital, Crewe Road, Edinburgh, EH4 2XU, UK. Current Biology 2001, 11:R311–R314 0960-9822/01/$ – see front matter © 2001 Elsevier Science Ltd. All rights reserved.

For once the hype surrounding the publication of the draft human genome sequence [1,2] is justified; practically all of the important human DNA now resides in publicly accessible databases. Of course this fantastic resource generates innumerable questions, but these boil down to two fundamental ones. Where are the genes? And what does each gene do? Ongoing work on the mouse genome will provide important leads to answering these questions. One commonly used method for identifying genes in genomic sequence is to find the sequence represented in cDNA libraries, and a recently published collection of mouse cDNA sequences adds substantially to these [3]. Furthermore, the similarity between mouse and human genomes means that the respective genomic DNA sequences can be easily aligned, and the most highly conserved segments (mostly corresponding to coding exons) readily spotted. Several mouse genome sequencing projects are at various stages of production of qualitatively different datasets, all of which will be invaluable for annotating the human sequence. Finally, a recently announced international consortium aims to discover a function for all mouse genes, and by homology for all human genes [4]. The genome projects have brought to biology a new modus operandi; just as molecular technologies revolutionised cell and developmental biology 15 or 20 years ago, so genomic approaches will fundamentally change the way we carry out biological experimentation. Mouse as a genetic model

Many organisms are valid genetic models of humans. If a human gene has a clearly identifiable equivalent in another sequenced genome, then useful information should be gained from studying the model organism. There are over 1,000 genes that are present in single copy in the human, nematode worm and fruitfly genomes [1]. On the whole, however, the gene content in invertebrate models appears to be quite different from human. For example, humans

encode 616 G-protein-coupled receptors, whereas flies have 146 and worms, 284 [2]. In many cases it is just not possible to find a gene that is equivalent in a comparison of human to invertebrates. By contrast, the mouse is evolutionarily much closer to humans and its gene content largely identical. The mouse and human genomes are derived from a common mammalian ancestor, and as few as 100 chromosomal rearrangements separate each genome from this ancestral genome [1,5]. Genes linked in one species are often linked in the other; and genomic sequence comparisons indicate that gene order is typically conserved over many megabases of DNA [1]. More mouse cDNAs

Expressed sequence tags, or ESTs, are short sequence reads, typically of a few hundred bases, from the ends of cDNA clones. A few caveats aside, these represent transcripts and therefore genes, and so have been invaluable for finding genes in genomic sequence. Millions of ESTs have been produced from hundreds of cDNA libraries. Projects such as Unigene (www.ncbi.nlm.nih.gov/Unigene) have used automated methods to assign ESTs into clusters on the basis of sequence matches, and these clusters provide one estimate of gene number. These are overestimates, partly because a substantial fraction of ESTs derive from genomic DNA rather than cDNA, and partly because multiple, non-overlapping clusters can derive from the same gene. Furthermore, ESTs by their nature are short ‘tags’, which often do not contain the protein coding segment of the transcript, and so do not help in cataloguing functional gene content, and as they do not have conserved features — coding potential — they are often not useful for cross-species comparisons. A better representation of mouse cDNA has been produced by human curation and annotation by the Genome Exploration Group at RIKEN, Japan [3]. In this project, almost one million mouse cDNA sequence tags from numerous libraries were clustered, from which about 21,000 clones were sequenced. These sequences contained redundancies identifiable by cross comparisons, and further redundancies in which non-overlapping sequences derived from the same known gene. By extrapolating the incidence of this latter redundancy across their collection of novel sequences, the authors could estimate that they have representatives of just under 13,000 unique genes. RIKEN hosted an annotation ‘jamboree’ at which curators examined the sequences and, where possible, assigned definitions to the cDNA on the basis of likely function or similarity to known genes. A key tool in the annotation is the vocabulary developed by the Gene Ontology (GO)

R312

Current Biology Vol 11 No 8

Consortium [6]. GO annotations assign to gene products standard terms that describe the biological process, the molecular function and the cellular component or location of the product. GO terms are intended to enable gene function and content information to be readily interpretable across species. This cDNA resource has already proved useful in measuring the gene content of the draft human sequence. The International Human Genome Sequencing Consortium attempted to compile an index of genes from the available sequence [1]. They derived a list of over 31,000 predictions, almost 15,000 of which are known genes with about 17,000 predicted by various methods (which probably have a fairly high false-positive rate). When the RIKEN set was compared to the 31,000 predicted genes, 69% of them showed sequence similarity. If the same RIKEN sequences were used to search the total human sequence, 81% found a match. So 81% of the mouse cDNA set detects a match against the whole human genome sequence, but only 69% pick up hits to the human gene index, indicating that the human gene index underrepresents the gene set and contains 69/81 or 85% of the mouse cDNA collection. The reverse comparison — of the human set to the RIKEN collection — found 69% were represented, and for known genes was 78%. So the comparisons indicate that both collections of genes are incomplete, but there are some problems in deciding how incomplete. The mouse cDNAs were selected to bias for novel genes, but we do not know the effect of that bias on overall representation in the collection (although we know that only 78% of known genes are present). Mouse genome sequencing

These comparisons, as well as other analyses, are the basis for the surprisingly low prediction for the human gene number of 32,000 (another, higher estimate based on the same data by F.A. Wright et al. can be found in an electronic preprint available at genomebiology.com). A firmer estimate will come from doing a whole genome comparison of human to mouse. All the current methods used to predict genes in genomic sequence are subject to error. Methods using matches to cDNA may underestimate because of incomplete representation in the libraries, whereas ab initio methods produce overestimates that must be tempered by additional evidence, such as similarity to already described genes, which in turn will overcompensate and miss truly novel genes. Mouse genomic sequence will give a new and powerful means of finding genes. The human and rodent lineages diverged sufficiently long ago that only sequences subject to selection, such as exons, will retain extensive similarity. Coincident with the publication of the draft human sequence, a consortium funded by public, charity and industrial sources released the first batch of the mouse

genome sequence, with the intention of releasing threefold coverage by April 2001. This sequence is a whole genome shotgun, which essentially means it is random reads, each of a few hundred base pairs, from throughout the genome, and these will not overlap into larger contiguous segments to any significant degree. Instead, the intention is that the mouse data will align along the human sequence, indicating conserved sequence. This is currently viewable at the Ensembl web site (www.ensembl.org). At the moment, these mouse matches should be treated with caution, but indications are that they will be useful crossspecies sequence tags, whose location in the human genome is defined. The mouse genome is also being sequenced at a higher level of coverage in a clone-by-clone approach. Much of the genome will be completed at ‘draft’ level in 2001, and certain regions have been targeted for production of finished sequence by October 2002 [7]. These sequences will enable a large-scale overview of sequence similarity between mouse and human genomes, and identification of likely conserved exons and other features. Figure 1 shows a ‘percent identity plot’ of a 100 kilobase region of the X chromosomes of mouse and human [8,9]. Exons of known genes are clearly distinguished by the higher percent identity compared to surrounding DNA, and putative novel genes may be identified. More mouse mutations

Mouse studies will also be of key importance in providing information about gene function. The phenotype of a mouse with a mutation in a single gene provides clear evidence of at least one function of that gene. There are several ways that such mutant mice can be made, the most widely used over the past decade being targeting genes by homologous recombination in embryonic stem cells. Several thousand genes have so far been mutagenised in this way, but this is a long way short of the total number in the genome, and the technique is very labour intensive. Methods have been developed to accelerate the stem cell approach, in particular the use of gene traps which cause mutations by the random insertion of marker DNA, via which the insertion site can be sequenced to identify the disrupted gene before a mutant mouse is generated. This is a genotype-driven technology, in that the identity of the gene is known before the mutation is made. A complementary, phenotype-driven approach is to create point mutations at random and identify mice with informative phenotypes from the progeny. The availability of the mouse genome sequence should permit the mutated genes to be readily identified. The last few years has seen the development of numerous phenotype-driven programmes, utilising the powerful mutagen ethyl nitrosourea. Initial studies, in the UK and in Germany, have catalogued several hundred new dominant mutations

Dispatch

R313

Figure 1 Percent identity plot (PIP) comparing a region of mouse and human X chromosomes [8,9]. The human sequence is represented on the abscissa and percentage sequence identity is plotted on the ordinate. The symbols above the plot represent features of the human sequence, including confirmed and putative exons which are depicted as numbered black boxes. ECRA1–ECRA3 are evolutionarily conserved regions that may predict a gene. Calractin and Nsdh1 are known mouse and human genes. (Figure from [9].)

magea9 100% 75%

0k

2k

4k

6k

8k

10k

12k

14k

16k

50% 20k

18k

exon 6.4f2

ECRA1-A3

1

A1A2 100% 75%

20k

22k

24k

26k

28k

30k

32k

34k

36k

50% 40k

38k

Caltractin

ECRA1-A3 A3

5

4

Nsdhl

3 2

11 100% 75%

40k

42k

44k

46k

48k

50k

52k

54k

56k

58k

50% 60k

Nsdhl 2

3 100% 75%

60k

62k

64k

66k

68k

70k

72k

74k

76k

78k

50% 80k

Nsdhl 4

5

6

7

8 100% 75%

80k

82k

84k

86k

88k

90k

92k

94k

96k

98k

50% 100k

Current Biology

conferring a range of behavioural, developmental and other phenotypes [10,11]. More ENU programmes are now underway throughout the world. The NIH has funded several large collaborative projects which are looking for dominant and recessive mutations that affect development, nervous system function and complex behaviour [7]. Screens elsewhere in the world are also expanding to find recessive mutations. The mouse genetics community has recognised that it is now possible to set the goal of creating a collection of mouse lines that together have a mutation in each gene of the genome. A recent publication in Science [4] marks the formation of the International Mouse Mutagenesis Consortium, which brings together geneticists from across the world with the common goal of assigning a function to every gene.

The genomic view of biology

About 20 years ago, the techniques of molecular biology began to be used to study cell biology and developmental biology. It was more than the methodology that was brought to bear, but a particular philosophy, which was that complex processes could be reduced to simple, tractable, interactions. Now, another fundamental change is underway in biology with the advent of genomic techniques. Mass collections of data, whether sequence, expression profiles, protein content, molecular interactions or mutations generate information on whole systems rather than on isolated parts. The philosophy is that we can gain meaningful information from problems whose answers are too large and complex to be written in a lab notebook, and can only reside in a computer. Biologists will have to change the way we think about experiments to take advantage of these resources.

R314

Current Biology Vol 11 No 8

References 1. International Human Genome Sequencing Consortium: Initial sequencing and analysis of the human genome. Nature 2001, 409:860-921. 2. Venter JC, Adams MD, Myers EW, Li PW, Mural RJ, Sutton GG, Smith HO, Yandell M, Evans CA, Holt RA, et al.: The sequence of the human genome. Science 2001, 291:1304-1351. 3. The RIKEN Genome Exploration Research Group Phase II Team and the FANTOM Consortium: Functional annotation of a full-length mouse cDNA collection. Nature 2001, 409:685-689. 4. The International Mouse Mutagenesis Consortium: Annotating genome sequences with biological functions in mice. Science 2001, 291:1251-1255. 5. Nadeau JH, Taylor, BA: Lengths of chromosomal segments conserved since divergence of mouse and man. Proc Natl Acad Sci USA 1984, 81:814-818. 6. The Gene Ontology Consortium: Gene ontology: tool for the unification of biology. Nat Genetics 2000, 25:25-29. 7. Graham B, Battey E, Jordan E: Report of second follow-up workshop on priority setting for mouse genomics. Mamm Genome 2001, 12:1-2. 8. Hardison RC, Oeltjen J, Miller W: Long human-mouse sequence alignments reveal novel regulatory elements: a reason to sequence the mouse genome. Genome Res 1997, 7:959-966. 9. Mallon AM, Platzer M, Bate R, Gloeckner G, Botcherby MR, Nordsiek G, Strivens MA, Kioschis P, Dangel A, et al.: Comparative genome sequence analysis of the Bpa/Str region in mouse and man. Genome Res 2000, 10:758-775. 10. Nolan PM, Peters J, Strivens M, Rogers D, Hagan J, Spurr N, Gray IC, Vizor L, Brooker D, Whitehill E, et al.: A systematic, genome-wide, phenotype-driven mutagenesis programme for gene function studies in the mouse. Nat Genetics 2000, 25:440-443. 11. Hrabé de Angelis M, Flaswinkel H, Fuchs H, Rathkolb B, Soewarto D, Marschall S, Heffner S, Pargent W, Wuensch K, Jung M, et al.: Genome-wide, large-scale production of mutant mice by ENU mutagenesis. Nat Genetics 2000, 25:444-447.