The molecular natural history of the human genome


Research Update


TRENDS in Ecology & Evolution Vol.16 No.8 August 2001


Michael Lynch

A remarkable pair of recently published studies provides the first glimpse of the fine-scale structure of the human genome sequence. The data revealed by these investigations, and their future refinements, provide an infrastructure that will forever transform the intellectual playing field for evolutionary biologists.

The 15 February 2001 issue of Nature displayed a landmark series of papers outlining the fine-scale features of the draft human genome sequence revealed by a publicly funded consortium, the International Human Genome Sequencing Consortium (IHGSC; Ref. 1). Just a day later, a parallel series of reports appeared in Science, in this case drawing from a privately funded project (the Celera project) that enjoyed unrestricted access to the publicly funded database (Ref. 2). Because of the asymmetry of information flow between the two projects, Celera had the clear upper hand in terms of power of analysis. The two groups employed rather different sequencing and processing strategies and, not surprisingly, there has been a fair amount of posturing as to who has produced the superior product, in spite of the considerable consensus between the two reports.

The human genome is big, although by no means the largest, and, by my estimate, a printing of the entire sequence would have required approximately 300 000 pages of Nature or Science. The pressure to publish early was intense, and there are a few important caveats to keep in mind. First, only ~90% of the sequence has actually been completed, with <20% of the genome being represented in contigs >100 kb and half of it falling in contigs <22 kb. This is significant because the average human gene is approximately 30 kb in length (i.e. larger than the average contig). Thousands of gaps remain to be filled, and although both groups are hard at work trying to obtain the finishing touches, many months will pass before the finished product is available, and some of the information will be essentially unattainable by conventional sequencing methods. Second, at least half of the protein-coding loci identified are simply candidates suggested by computer algorithms and hence await more rigorous evaluation. Third, the genomic sequences are chimeric (both within and between projects) in the sense that they are derived from DNA extracted from an array of individuals, whose ethnic and geographical backgrounds will remain permanently unknown. Thus, strictly speaking, characterization of the human genome is not complete, nor does it accurately represent any single member of our species. Rather, the Nature and Science reports are best viewed as a pair of rapidly assembled and hypothetical abstracts of a documentary that has not yet been completely written.

Fairly well-curated sequences of two human chromosomes (21 and 22) have been available for well over a year and, assuming that they are fairly typical, characterization of the remainder of the genome was not expected to lead to too many new revelations. However, certain aspects of genome structure can only be ascertained after a nearly complete sequence has become available. A few of the more interesting observations are the subject of this article.

Our debris-laden genome

Approximately half of the human genome consists of sequences that are obviously associated with transposable-element activity, and a large fraction of the remaining noncoding DNA might be a product of such activity but too divergent to be recognized as such. So much for intelligent design. The numbers and ages of mobile elements in humans greatly exceed those in Drosophila melanogaster and Caenorhabditis elegans, and much of this difference is the result of the huge proliferation of two families (Alu and LINE1 elements), which together account for ~60% of interspersed elements in humans. Part of the reason for this large accumulation of excess DNA could be that the rate of deletion of nonfunctional DNA from the human genome is extraordinarily low. By one estimate, the half-life of such sequences in modern humans is of the order of 800 million years, compared to ~12 million years in flies (Ref. 3).

The high incidence of ‘junk’ DNA in the human genome is even more remarkable when one considers an additional claim by the IHGSC – that there has been a substantial recent decline in the activity (birth rate) of all transposons throughout the human genome. The validity of this conjecture is unclear. To evaluate the age distribution of interspersed repeats, the IHGSC derived a consensus sequence for each of the major families and then simply estimated the divergence of each extant sequence from the consensus. Finding that few elements were similar to the consensus, they reasoned that all of the elements must be quite old. However, the signature of a newborn element is its degree of identity not with a reconstructed consensus but with another extant element. A more rigorous assessment of the degree of activity of mobile elements in humans will require a proper phylogenetic analysis as well as a population-level survey for site-specific presence/absence polymorphisms.

0169–5347/01/$ – see front matter © 2001 Elsevier Science Ltd. All rights reserved. PII: S0169-5347(01)02242-X

An unexpectedly small gene count?

Although our genome size is approximately 30 times greater than that of C. elegans in nucleotide number, the number of coding genes might be no more than twofold greater. The IHGSC and Celera papers both estimate that the human genome encodes 30 000–40 000 genes, compared to the estimate of 19 000 for C. elegans. However, both sets of counts are to a large extent based on indirect computational searches rather than on direct observation. Although gene-prediction algorithms have become quite sophisticated, various aspects of the human genome, including intron number and size (commonly hundreds to thousands of bases), present considerable computational challenges. It will not be surprising if the current counts are underestimated by as much as twofold, and a more recent computational screening suggests the human number is more in the range of 65 000 to 75 000 (Ref. 4).

Complete sequences from cDNA libraries will help clarify which hypothetical genes are actually transcribed, but they will also almost certainly reveal previously undetected genes. For example, from a recent screening of a cDNA library extracted from just a single tissue (the brain) of D. melanogaster, >10% of the observed clones had not been previously identified by whole-genome gene-prediction surveys for this species (Ref. 5). cDNA analysis is unlikely to illuminate all of our functional genes, as transcripts of genes that are only active during brief stages of development, under rare environmental circumstances, or in restricted cell lineages may be very elusive. In addition, because nongene sequences are probably sometimes transcribed, an unknown fraction of the members of a cDNA library can also be expected to be false positives. An alternative approach to verifying gene expression – the use of tiled microarrays (Ref. 6) – has similar limitations.

The most powerful tool for identifying functional human genes (along with their associated regulatory elements) will almost certainly be comparative analysis with an array of related species. Within the next few months, nearly complete genomic sequences will become available for mouse (Mus musculus), rat (Rattus norvegicus), pufferfish (Fugu rubripes) and zebrafish (Danio rerio), and we will enjoy those for numerous other vertebrate species shortly thereafter.

Counting problems aside, the actual number of unique transcripts in the human genome is clearly of the order of 10^5 or larger. Roughly a third of all human genes appear to experience alternative splicing (i.e. different ways of stitching exon sequences together). The average number of transcripts per human gene is approximately three, which contrasts with the situation in C. elegans, where the average is ~1.3. Thus, the number of distinct proteins employed in humans could easily be three or more times that in worms, and a further increase in our genetic repertoire must result from greater complexity in the tissue-specificity of gene regulation. It is becoming increasingly clear that these two aspects of gene structure – alternative splicing and regulatory-region complexity – provide the molecular basis of pleiotropy, a phenomenon that evolutionary biologists have long regarded as a fundamental, but mechanistically mystical, constraining process in adaptive evolution. Finally, it should be remembered that the number of potential two-gene interactions increases with the square of gene number, three-gene interactions with the cube of gene number, and so on.
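
The combinatorial argument can be made concrete with a short calculation. The sketch below is illustrative only: it counts unordered k-gene combinations with binomial coefficients and compares one gene set against another that is twice as large. The function names are hypothetical, the baseline of 19 000 simply reuses the C. elegans estimate quoted earlier, and a count of possible combinations says nothing about how many interactions are biologically real.

```python
from math import comb


def potential_interactions(n_genes: int, order: int) -> int:
    """Number of distinct unordered sets of `order` genes among `n_genes`."""
    return comb(n_genes, order)


def fold_increase(n_small: int, n_large: int, order: int) -> float:
    """Ratio of potential `order`-gene combinations between two gene counts."""
    return potential_interactions(n_large, order) / potential_interactions(n_small, order)


if __name__ == "__main__":
    baseline = 19_000        # assumed baseline gene count (illustrative)
    doubled = 2 * baseline   # a hypothetical twofold larger gene set
    for order in (2, 3):
        ratio = fold_increase(baseline, doubled, order)
        print(f"{order}-gene combinations: {ratio:.2f}-fold more")
```

For gene counts this large, doubling the gene number raises the number of possible pairs roughly fourfold and the number of possible triplets roughly eightfold.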
Thus, even a twofold difference between species in gene number can translate into a several-fold difference in epistatic interactions.

A high rate of origin of new (hopeful) genes

The human genome harbors a very large number of duplicate genes – at least 40% of our genes are represented in two or more copies (Ref. 7). Although many of these duplicates probably arose before the origin of mammals, the number of recently derived copies is remarkably high. For example, at least 5% of our genome is represented in two or more locations, in the form of large (1–200 kb) segmental duplications. Celera located a total of 1077 segmental duplications, 781 of which contained five or more shared genes. These are very conservative estimates, because assembly problems almost certainly resulted in the exclusion of many true duplicates. Indeed, the wide distribution of recently derived segmental duplications is one of the primary impediments to accurate reconstruction of chromosomal sequences from small (approximately 1 kb) random sequences, the main strategy of the Celera group. The high incidence of segmental duplications in humans is well beyond that seen in D. melanogaster and C. elegans, but not greatly different from that observed in Arabidopsis (Ref. 8), although the Arabidopsis genome has been substantially patterned by one or more ancient polyploidization events.

Because many of the interchromosomal duplications in humans have a very high degree of sequence similarity, the IHGSC suggests that they arose during a recent period of intense duplication activity. However, an alternative explanation for such a pattern is that segmental duplications have a high rate of eradication. The distribution of sequence similarity between duplicate segments must ultimately reflect the joint processes of birth and death of such segments (Refs 9,10). Many pairs of human segmental duplicates are ≪1% divergent, whereas comparative analysis of homologous nucleotide positions throughout the human genome implies an average divergence of ~0.1% per nucleotide site. This suggests that a substantial fraction of the segmental duplications observed in these studies might be younger than the mean coalescence time of neutral nucleotide sites and hence may not even be fixed in the human population. Such presence/absence polymorphisms are already known to exist for duplicate olfactory receptor genes in the human population (Ref. 11).
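
The logic of comparing duplicate divergence with allelic divergence can be sketched under a strict molecular clock. This is a simplified illustration, not the analysis used in either genome paper: it assumes that duplicate copies and allelic sequences accumulate neutral substitutions at the same per-site rate, and it ignores gene conversion between copies. Under those assumptions the substitution rate cancels out of the comparison. The function name and the example divergence values are hypothetical.

```python
def age_relative_to_coalescence(dup_divergence: float,
                                allelic_divergence: float = 0.001) -> float:
    """Duplicate age as a fraction of the mean neutral coalescence time.

    With a per-site rate mu, two duplicate copies separated for t generations
    differ by d = 2*mu*t, and two alleles whose mean coalescence time is T
    differ by pi = 2*mu*T, so t/T = d/pi.
    The default pi = 0.001 is the ~0.1% average divergence quoted in the text.
    """
    if dup_divergence < 0 or allelic_divergence <= 0:
        raise ValueError("divergences must be non-negative, with pi > 0")
    return dup_divergence / allelic_divergence


if __name__ == "__main__":
    # Hypothetical per-site divergences between duplicate segments.
    for d in (0.0002, 0.001, 0.01):
        ratio = age_relative_to_coalescence(d)
        status = "younger than" if ratio < 1 else "at least as old as"
        print(f"d = {d:.4f}: {ratio:.1f}x the mean coalescence time ({status} it)")
```

A ratio below 1 marks a duplicate that, under these assumptions, arose after the average coalescence of neutral sites and so could still be segregating as a presence/absence polymorphism.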
Thus, although most of our attention in evolutionary quantitative genetics has been focused on nucleotide polymorphisms, the possibility that phenotypic variation depends in important ways on individual differences in gene content merits consideration.

The human segmental duplications are not randomly distributed. Regions near the centromeres consist almost entirely of such segments, and telomeric regions are also enriched. Moreover, several cases exist in which a duplicated segment has been reduplicated and redistributed to still another location. Chromosome 19 appears to be especially unusual in this regard, with large blocks of genes having been moved by duplicative transposition to virtually all of the other chromosomes. It might be no coincidence that chromosome 19 is unusually rich in Alu elements, which may promote interchromosomal exchange.

No analysis of duplicate genes in humans would be complete without considering Ohno’s hypothesis that one or more basal polyploidization events provided the fuel for the origin of morphological novelties in vertebrates (Ref. 12). Although this hypothesis has been treated with increasing skepticism and ridicule over the past few years (Refs 13–15), and a very limited analysis by the IHGSC fails to support it, all of the phylogenetic tests of Ohno’s hypothesis have several potential shortcomings, including a shortage of well-defined sequences from basal chordates known to contain single-copy genes, an inability to discriminate secondary duplications from putative events associated with earlier polyploidization, and inattention to information on map positions. Because of the substantial amount of sequence divergence, genome rearrangement, and gene deletion that occurs over a time span of approximately one billion years (summing over two descendant lineages), the challenges to testing Ohno’s hypothesis in a proper fashion are formidable, and a rigorous formal analysis remains to be done.

Extinction of the Ecdysozoa?

With the (nearly) complete genomic sequences now available for a nematode, a fly, and a vertebrate, what about the phylogenetic arrangement of these three organisms? A definitive answer to this question has significant implications for our understanding of the origins of morphological diversity in animals, as the vast majority of work in development and genetics is performed on members of these three clades. Although classical morphological analysis had historically positioned nematodes as basal to the protostome (e.g. fly) – deuterostome (e.g. vertebrate) divergence, a recent analysis of 18S rRNA sequences led to the suggestion that all molting animals (including flies and nematodes) are members of the same monophyletic clade (the Ecdysozoa; Ref. 16). It is risky and no longer necessary to base a deep phylogenetic analysis on a single gene, and a study based on 50 genes supports the traditional basal position of nematodes (Ref. 17), as does another study based on four highly conserved proteins (Ref. 18). With sequences of thousands of homologous genes now available in all three lineages, it should be possible to settle the matter once and for all with a massively large-scale phylogenetic analysis. Although the IHGSC notes that the human genome appears to share about 1.5 times as many homologous genes with D. melanogaster as with C. elegans, neither this observation nor sequence comparisons can be translated into measures of phylogenetic affinity until key out-groups have been entered into the analysis. Fortunately, there already is a fungus (Saccharomyces cerevisiae) and a plant (Arabidopsis thaliana) to work with, although a cnidarian or a sponge would presumably be more informative.

The future

The IHGSC and Celera projects provide just an opening glimpse of the structure of the human genome, mainly providing a contrast with single members of two of our sister animal phyla (arthropods and nematodes). Much of this century will be spent trying to elucidate the sources of variation within our own species. There is much to explain, as heritabilities for morphological and behavioral traits in humans are quite large compared to those seen in other species. Driven by the insatiable financial dreams of the pharmaceutical and biotechnology industries, progress is likely to be rapid. Methods for sequencing DNA will change radically over the next few years, perhaps to the point that the reading of a megabase of DNA can be performed on a time scale not much greater than the reading of a megabyte of data by current computers. Whole genome sequences will then be available for numerous members of our population and for many of our closest living primate relatives, and it is no longer far-fetched to think that comparative molecular data will be attainable for our closest extinct relatives (Ref. 19).

Making the connection between variation at the molecular and phenotypic levels will require a lot of biology, and it is clear that the rich array of tools from quantitative genetics and evolutionary biology will play a central role in this endeavor. Molecular evolution will no longer be focused entirely on single-gene analyses but on explaining the origin, proliferation, and occasional demise of entire networks of genes. It is a privilege to live during the time in which the essence of the unique ingenuity that got us here might actually become biologically understandable.

Acknowledgements

I thank J. Crow, D. Hartl, P. Phillips, and J. Postlethwait for helpful comments.

References

1 International Human Genome Sequencing Consortium (2001) Initial sequencing and analysis of the human genome. Nature 409, 860–921
2 Venter, J.C. et al. (2001) The sequence of the human genome. Science 291, 1304–1351
3 Petrov, D.A. and Hartl, D.L. (1999) Patterns of nucleotide substitution in Drosophila and mammalian genomes. Proc. Natl. Acad. Sci. U. S. A. 96, 1475–1479
4 Wright, F.A. et al. A draft annotation and overview of the human genome. Genome Biol. (in press)
5 Posey, K.L. et al. (2001) Survey of transcripts in the adult Drosophila brain. Genome Biol. (in press)
6 Shoemaker, D.D. et al. (2001) Experimental annotation of the human genome using microarray technology. Nature 409, 922–927
7 Li, W-H. et al. (2001) Evolutionary analyses of the human genome. Nature 409, 847–850
8 Vision, T.J. et al. (2000) The origins of genomic duplications in Arabidopsis. Science 290, 2114–2117
9 Nei, M. et al. (1997) Evolution by the birth-and-death process in multigene families of the vertebrate immune system. Proc. Natl. Acad. Sci. U. S. A. 94, 7799–7806
10 Lynch, M. and Conery, J. (2000) The evolutionary fate and consequences of duplicate genes. Science 290, 1151–1154
11 Trask, B.J. et al. (1998) Members of the olfactory receptor gene family are contained in large blocks of DNA duplicated polymorphically near the ends of human chromosomes. Hum. Mol. Genet. 7, 13–26
12 Ohno, S. (1970) Evolution by Gene Duplication, Springer-Verlag
13 Skrabanek, L. and Wolfe, K.H. (1998) Eukaryote genome duplication – where’s the evidence? Curr. Opin. Genet. Dev. 8, 694–700
14 Hughes, A.L. (1999) Phylogenies of developmentally important proteins do not support the hypothesis of two rounds of genome duplication early in vertebrate history. J. Mol. Evol. 48, 565–576
15 Martin, A. (2001) Is tetralogy true? Lack of support for the ‘one-to-four rule’. Mol. Biol. Evol. 18, 89–93
16 Aguinaldo, A.M. et al. (1997) Evidence for a clade of nematodes, arthropods and other moulting animals. Nature 387, 489–493
17 Wang, D.Y. et al. (1999) Divergence time estimates for the early history of animal phyla and the origin of plants, animals and fungi. Proc. R. Soc. London B Biol. Sci. 266, 163–171
18 Baldauf, S.L. et al. (2000) A kingdom-level phylogeny of eukaryotes based on combined protein data. Science 290, 972–977
19 Ovchinnikov, I.V. et al. (2000) Molecular analysis of Neanderthal DNA from the northern Caucasus. Nature 404, 490–493

Michael Lynch Dept of Biology, Indiana University, Bloomington, IN 47405, USA. e-mail: [email protected]