Opinion
TRENDS in Genetics Vol.21 No.11 November 2005
Gene factories, microfunctionalization and the evolution of gene families
John M. Hancock
Medical Research Council Mammalian Genetics Unit, Harwell, Oxfordshire, UK, OX11 0RD
Gene duplication has long been considered an important force in genome evolution. In this article, I consider families of tandemly duplicated genes that show ‘microfunctionalization’: genes encoding similar proteins with subtly different functions, such as olfactory receptors. I discuss the genomic processes giving rise to such microfunctionalized gene families and suggest that, like sites of chromosomal rearrangement and breakage, they are associated with relatively high concentrations of repetitive elements. I suggest that microfunctionalized gene families arise within gene factories: genomic regions rich in repetitive elements that undergo increased levels of unequal crossing-over.
Introduction
The advent of the genomics era is bringing about a revolution in our ways of thinking about genome evolution. For the first time it is possible to perform detailed investigations of the sequence-level organization of eukaryotic genomes. This has brought about a revival of interest in several areas that were difficult to address by traditional molecular evolutionary techniques, and were therefore relatively neglected. One such area is the process of gene duplication and its underlying mechanisms [1,2]. In this article, I discuss a particular class of duplicate genes, those that lie in close proximity to one another in the genome. I focus on the processes that might give rise to such gene clusters and suggest that they arise in ‘gene factories’ driven by recombinogenic sequences derived from transposable elements.
Gene duplication in gene and genome evolution
Duplication events in genomes can be divided into two broad classes: whole-genome duplications (WGD), in which the total chromosome complement of an organism is doubled, and segmental duplication, in which segments of a genome are duplicated [3]. At one extreme of segmental duplication is single gene duplication, which can give rise to local clusters of homologous genes.
Corresponding author: Hancock, J.M. ([email protected]). Available online 8 September 2005
Gene clusters of this kind have been known for a long time; well-known examples include the globins [4], immunoglobulins [5], rDNA [6], histones [7] and Hox genes [8]. Estimates of the proportions of metazoan genomes that contain duplicate genes are 8–20% for a wide range of organisms from yeast to humans [9], although these estimates depend on the criteria used to identify
duplicates (e.g. Ref. [10]). The sizes of these homologous groups in the human genome vary from two to 479 genes [10], again depending on the definition of a duplicate. Many of these duplicate genes are conserved even between distantly related species (outparalogs), but a significant proportion are inparalogs, that is, duplicates that have arisen since the divergence of the two species in question (see Ref. [11] for definitions of out- and inparalogs). For example, in the mouse genome, 20% of genes have been suggested to have an inparalogous relationship to human genes [12]. Inparalogs are of particular interest because they might provide insights into the selective pressures that result from changes in a species’ ‘lifestyle’ during and after speciation [12–14].
Evolution of duplicate genes
The classical view of gene duplication is that it provides the opportunity for the evolution of new functions [2]. One way of conceptualizing this is that one copy of the duplicated gene retains its original function, whereas the second copy diverges (neofunctionalization [15]). The second copy is generally considered to be the new copy, although it is not obvious that there is any mechanism to identify the ‘original’ version. Divergence between the two copies might be at the level of protein sequence or regulatory sequence, enabling the divergent copy to be expressed in a new (or more restricted) part of the organism, with this changed expression producing phenotypic effects. Deleterious mutations in the divergent copy will result in its elimination or pseudogenization (nonfunctionalization). Duplicate genes are also likely to be lost almost immediately after the duplication event by genetic drift, but this is not relevant to the scenarios discussed here because I am considering cases where the duplicate has been fixed or is close to fixation. Duplicate copies of the gene tend to retain similarity owing to the action of gene conversion [16].
0168-9525/$ - see front matter © 2005 Elsevier Ltd. All rights reserved. doi:10.1016/j.tig.2005.08.008
More recent work on gene duplication has refined this view by introducing the concept of subfunctionalization [15]. Rather than one gene retaining an ancestral function and a second one diverging in sequence, mutations can occur in both genes, resulting in the retention of different aspects of the original function or the evolution of novel functions or expression patterns. These models of gene family evolution in essence attempt to explain gene families, such as the Hox family, in which individual duplicates have adopted different functions over evolutionary time. However, in many cases duplicates have another role: to provide the opportunity
for large-scale mobilization of a particular class of genes at particular stages in the cell cycle or development (e.g. histones need to be produced in large quantities during DNA replication, as do rRNA molecules during rapid cell growth). Genes of this type are under selective pressure to remain essentially homogeneous at the sequence level, forming multigene families. There is still considerable divergence of opinion in explaining how homogeneity of this type of gene family is maintained, but two general mechanisms have been suggested: homogenization by unequal crossing-over or gene conversion [17] and birth-and-death evolution (in which gene duplication is so rapid that copies carrying deleterious mutations can be eliminated by selection while the gene copy-number is maintained by replacement) [18].
Lying between these extreme examples is a class of gene family in which the individual copies are similar but not identical and, although playing essentially the same role in the organism, are subtly different in function, giving the organism a variety of types of activity within a general class. I describe these gene families as ‘microfunctionalized’, to distinguish them from homogeneous and fully adapted families. The immunoglobulins are a good example of a microfunctionalized gene family, providing great flexibility in antigen recognition but essentially fulfilling the same function. Another example is the olfactory receptor family. In this example, each receptor responds to a particular environmental chemical but, again, each has an essentially identical role in the organism [19].
Mechanisms of duplication
As currently understood, there are two types of process that can give rise to gene duplication: unequal crossing-over (UCO) and transposition. UCO takes place when two genetically similar, but nonhomologous, chromosome regions become aligned during chromosome pairing and undergo crossing-over.
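The reciprocal outcome of such a misaligned exchange can be illustrated with a minimal sketch. It assumes a deliberately simplified model that is not taken from the article: each chromosome is a list of labelled blocks, the two homologues pair out of register at similar repeats flanking a gene, and a single crossover occurs at the paired repeats. The function name `unequal_crossover` and the block labels are hypothetical.

```python
from typing import List, Tuple


def unequal_crossover(
    chrom_a: List[str], chrom_b: List[str], break_a: int, break_b: int
) -> Tuple[List[str], List[str]]:
    """Exchange chromosome arms at two breakpoints that are out of register.

    break_a and break_b are list positions on each homologue; if they differ,
    the pairing was misaligned and the products differ in length.
    """
    recombinant_1 = chrom_a[:break_a] + chrom_b[break_b:]
    recombinant_2 = chrom_b[:break_b] + chrom_a[break_a:]
    return recombinant_1, recombinant_2


# Toy locus (hypothetical): one gene flanked by two similar repeats.
homologue_a = ["left", "repeat", "gene", "repeat", "right"]
homologue_b = ["left", "repeat", "gene", "repeat", "right"]

# The distal repeat of A (end position 4) pairs with the proximal repeat of B
# (end position 2), so the crossover joins the two arms out of register.
duplicated, deleted = unequal_crossover(homologue_a, homologue_b, 4, 2)

print("duplication product:", duplicated)
print("deletion product:   ", deleted)
```

In this toy case one product carries two copies of the gene and the other carries none, while the total number of blocks across the two chromosomes is unchanged. Real events depend on sequence similarity, breakpoint position within the repeat and subsequent fixation, none of which is modelled here.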
Crossing-over between such misaligned regions deletes a region on one of the recombining chromosomes and duplicates it on the other. The deleted or duplicated version of the chromosome can then become fixed in the population, especially if it confers some selective advantage. UCO typically takes place between identical or nearly identical sequences. As well as giving rise to gene duplication and deletion, UCO can also result in homogenization of the sequences of members of gene families when they lie close to one another [17]. Simulations of the action of UCO suggest that, because it acts much more rapidly than mutation on the sequences of an array, it tends to ensure that the component members of an array remain identical in sequence, while allowing for the spread of new variants from time to time [20,21]. In addition to UCO and gene conversion [17,22], homogenization can also result from birth-and-death processes [18].
Transposition can give rise to gene duplication by two mechanistically distinct processes. Retrotransposition can produce processed retrogenes by making and inserting DNA copies of mRNAs into the genome, whereas active DNA transposons can potentially carry copies of genes with them when they transpose, producing new copies
elsewhere in the genome. These transposed copies are inserted at pseudo-random positions in the genome and are unlikely to be tandemly arranged.
Although studies of tandemly arranged gene families have generally considered the elevated UCO levels in these regions to reflect the high sequence similarity between the gene copies, there is sufficient evidence that transposition can interact with UCO to generate gene duplications and deletions by ectopic (illegitimate) recombination. Several studies have shown that sites of chromosomal breakage and evolutionary rearrangement correlate with high densities of transposable elements (TEs) (reviewed in Ref. [3]), whereas Alu elements and long interspersed nuclear elements (LINEs) have been implicated in >40 disease-related deletions [23–25]. A good example of TE–UCO interaction, because of its relative frequency, is provided by Charcot-Marie-Tooth neuropathy Type 1a (CMT1a) and tomaculous neuropathy (hereditary neuropathy with liability to pressure palsies, HNPP), both of which are caused by ectopic recombination at the human PMP22 locus, which encodes the 22-kDa peripheral myelin protein. Duplication of the gene gives rise to CMT1a as a result of overexpression, whereas deletion produces HNPP. Both rearrangements involve copies of a mariner-like TE located on either side of PMP22 [26], which result in misalignment during recombination. Interestingly, double-strand breaks induced by the mariner transposase can induce recombination directly [26].
Gene factories?
We recently investigated the sequence organization of the Del36H region of the mouse genome. Del36H is a microscopically visible deletion of ~20% of mouse chromosome 13 [27], 12.7 Mb in length, showing conserved synteny with human chromosome 6p22.1–6p22.3 and 6p25 [14]. The region is of potential interest genetically because the deletion results in some observable phenotypes, and several disease loci map to the syntenic human region on chromosome 6p.
Interestingly, the region contains several tandemly organized groups of genes: histones, vomeronasal receptors, serpins and prolactins [14] (Figure 1). The histone genes belong to a well-characterised gene array in which members of each histone type (H2a, H2b, H3 and H4) are homogeneous in sequence [28]. The other three large gene families show a degree of functional and sequence diversity: the vomeronasal receptors contribute to mate choice (e.g. Ref. [29]), prolactins contribute to the increased reproductive rate of mice (which produce litters with an average size of 10–12 at a spacing of as little as 20 days in the wild) by promoting lactation, whereas the serpins show restricted patterns of expression, with many found in reproductive tissues [30]. These last three gene families show small sequence differences between genes, representing microfunctionalization, and have more members in mouse than in human. It has been suggested that this reflects a preferential expansion in the mouse, or rodent, lineage [12], but the sizes of these gene families, and other comparable families, remain to be established in other mammals. Indeed, it is possible that some of these gene families might be significantly smaller in humans than in other mammals, as is seen for the olfactory receptor gene
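[Figure 1 graphic not reproduced; labels recovered from it, with numbers of pseudogenes in brackets. Mouse (Mm) Del36H, 12.7 Mb: His 57(3), Vnr 67(34), Btn 2; Prl 26(3); Srp 27(10). Human (Hs) 6p22.1–6p22.3 and 6p25.2–6p25.3, 7.8 Mb and 2.7 Mb: His 66(11), Vnr 5(5), Btn 7; Srp (1)4 (label garbled in extraction); Prl 1.]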
Figure 1. Relative expansion and contraction of gene families in the Del36H region, and its syntenic regions on human chromosome 6. The three regions of Del36H containing variable-sized gene families [14] are identified by different colours. The gene families are identified by the codes listed under Abbreviations. Total numbers of genes are given for each species with numbers of pseudogenes in brackets (e.g. the Del36H histone clusters total 57 genes including three pseudogenes). The larger of each gene family is underlined. The sizes of each chromosome region illustrated are given in megabases; the sequence orientations are indicated by arrows. Abbreviations: His, histones; Vnr, vomeronasal receptors; Btn, butyrophilins; Prl, prolactins; Srp, serpins.
family [31], owing to gene loss or nonfunctionalization; some of the features of mouse lifestyle are possibly more typical of mammals in general than are those of humans.
A preliminary analysis of the distributions of TEs along Del36H [14] showed that regions containing tandemly repeated gene families were enriched in two general classes of repetitive element. The regions containing the histone and vomeronasal receptor genes and the region containing the serpin genes were rich in long terminal repeat (LTR) elements, whereas the prolactin- and serpin-gene-rich regions were rich in LINE elements. This suggested an association between elevated levels of ectopic recombination mediated by TEs and the evolutionary origin of these gene clusters. A more detailed analysis of the individual repeat types enriched in these segments of Del36H is presented in Table 1 and confirms these
associations, although the different regions show different spectra of over-represented elements. The most prominent types of over-represented element in this analysis are LINE elements and, in particular, canonical L1 elements, which are strongly over-represented in all three regions. Endogenous retroviral elements, and retroviral sequences generally, are also commonly over-represented. We suggested [14] that regions rich in TEs, which undergo relatively high frequencies of UCO giving rise to high frequencies of gene duplication, can be considered ‘gene factories’. This suggestion differs from previous ideas on the evolution of tandemly repeated gene families in emphasizing the role of TEs in stimulating UCO and in its emphasis on the evolutionary origins of gene clusters. Although segmental duplication and chromosome breakage have been associated with TEs previously, this is not
Table 1. Repetitive elements most over-represented in segments of the Del36H sequence containing expanded gene families (a, b)

| Rank | Segment 1 element | Excess | Segment 3 element | Excess | Segment 9 element | Excess | Segments 1, 3 and 9 element | Excess |
|---|---|---|---|---|---|---|---|---|
| 1 | B1_MM | 41.6 | AT_RICH | 49.4 | L1 | 46.0 | L1 | 100.4 |
| 2 | ETNERV2 | 36.6 | L1_MM | 45.0 | LX2 | 31.1 | L1_MM | 68.4 |
| 3 | LX6 | 27.2 | LX2 | 43.2 | L1_MM | 20.8 | LX2 | 57.0 |
| 4 | MYSERV | 24.5 | L1 | 37.2 | L1F | 19.9 | ETNERV2 | 56.5 |
| 5 | L1 | 17.2 | LX | 19.9 | ETNERV2 | 17.4 | MYSERV | 38.1 |
| 6 | RMER1B | 14.3 | LX9 | 19.5 | RMER17A2 | 10.1 | LX6 | 34.7 |
| 7 | LX3 | 13.6 | LX6 | 17.8 | MYSERV | 9.8 | L1F | 24.4 |
| 8 | ORR1A1 | 12.5 | RMER15 | 17.0 | RLTR11A | 9.6 | LX5 | 19.3 |
| 9 | ETNERV3 | 11.5 | LX8 | 16.5 | MMVL30-INT | 8.8 | L1VL2 | 18.5 |
| 10 | B2_MM2 | 11.4 | LX5 | 15.4 | MMETN-INT | 8.6 | RLTR13D | 17.7 |

a Abbreviations: L, LINE; ETNERV2,3, MYSERV, RLTR13D, MMETN and RLTR11A, retroviral sequences; ORR1A1, a MaLR retrovirus-like internal sequence; MMVL30, an internal sequence of the mouse VL30 retroelement; B1 and B2, SINEs; RMER1B, RMER15 and RMER17A2, medium reiteration frequency repeats.
b Segments 1, 3 and 9 of the Del36H region of mouse chromosome 13 contain amplified gene families as follows: Segment 1: histones and vomeronasal receptors; Segment 3: prolactins; Segment 9: serpins [14]. The table shows the ten element types, as detected by RepeatMasker (http://www.repeatmasker.org), that show the greatest numerical over-representation in each region and in combined regions 1, 3 and 9. L1 elements are highly over-represented in all three regions. Over-representations are based on expectations calculated from overall repeat counts in the region and segment lengths, assuming a uniform distribution.
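The over-representation measure described in footnote b can be sketched as follows. This is an illustration under stated assumptions rather than the procedure actually used: it takes ‘excess’ to mean the observed count minus the count expected if copies were spread uniformly along the region, which is consistent with the table (the three per-segment L1 values, 17.2, 37.2 and 46.0, sum to the combined value of 100.4). Only the 12.7 Mb region length comes from the text; the counts, segment lengths and function names are hypothetical.

```python
def expected_count(total_count: int, segment_length: float, region_length: float) -> float:
    """Copies expected in a segment if elements were uniformly distributed."""
    return total_count * segment_length / region_length


def excess(observed: int, total_count: int, segment_length: float, region_length: float) -> float:
    """Numerical over-representation: observed minus uniform expectation."""
    return observed - expected_count(total_count, segment_length, region_length)


REGION_MB = 12.7   # length of Del36H, from the text
total_copies = 500  # hypothetical count of one element type across the region

# Hypothetical segments: name -> (length in Mb, observed copies of the element)
segments = {"seg_a": (1.0, 80), "seg_b": (0.5, 45)}

per_segment = {
    name: excess(observed, total_copies, length, REGION_MB)
    for name, (length, observed) in segments.items()
}

# Excess for the segments taken together, computed from pooled counts and lengths.
combined = excess(
    sum(observed for _, observed in segments.values()),
    total_copies,
    sum(length for length, _ in segments.values()),
    REGION_MB,
)

for name, value in per_segment.items():
    print(f"{name}: excess = {value:.1f}")
print(f"combined: excess = {combined:.1f}")
```

Because both the observed and the expected counts add across segments, the combined excess equals the sum of the per-segment values. A negative value would indicate under-representation of an element in a segment.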
so for locally clustered gene families of this kind, and these analyses have not dealt explicitly with the conditions giving rise to the founding of gene clusters.
The gene factory hypothesis identifies regions rich in TEs as sites of preferential formation of tandem gene clusters. The alternative is a uniform probability of cluster formation across the genome. Recent reports of copy number polymorphisms (CNPs) of DNA segments in the human and mouse genomes [32–34] would be consistent with this second idea if they were uniformly distributed and represented single gene duplications. However, many CNPs occur preferentially at particular genomic loci [32–34]. CNP loci are also often involved in human chromosomal rearrangements, and associations have also been suggested with evolution by segmental duplication [32,34]. CNPs are also generally larger than single genes, although the ≥200 kb resolution of the methods used to detect them does not exclude the possibility that some single-gene CNPs occur.
Analysis of repeat sequences that are associated with segmental duplications in the human genome indicates a strong association with Alu elements [35], whereas short interspersed nuclear elements (SINEs) are not prominent in the Del36H analyses. The association of gene clusters with LINEs seen in Del36H suggests that these elements have been primary drivers of UCO in the evolution of this region and perhaps the mouse genome generally, consistent with greater activity of LINEs in the mouse [26]. However, perhaps the only requirement to stimulate UCO is an elevated concentration of one or more element types, irrespective of class. Other differences in the evolution of this kind of gene family have been reported between the human and rodent lineages; notably, different relative rates of local duplication and translocation appear to give rise to differences in the proportions of gene family members located on the same or different chromosomes [36].
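The contrast drawn above between the gene factory hypothesis and a uniform probability of cluster formation suggests a simple quantitative check, sketched here under strong simplifying assumptions. The sketch treats clusters as arising independently and treats the fraction of the genome classed as TE-rich as known. It then asks how surprising the observed number of clusters in TE-rich regions would be under the uniform model, using an exact one-sided binomial tail. All numbers and names are hypothetical and are not data from the article.

```python
from math import comb


def binomial_upper_tail(k: int, n: int, p: float) -> float:
    """Return P(X >= k) for X ~ Binomial(n, p), computed exactly."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Hypothetical toy inputs, for illustration only.
n_clusters = 10          # tandem gene clusters identified in a genome
te_rich_fraction = 0.2   # fraction of the genome classed as TE-rich
clusters_in_te_rich = 6  # clusters located within TE-rich regions

# Under the uniform model each cluster falls in a TE-rich region
# with probability equal to the TE-rich fraction of the genome.
expected = n_clusters * te_rich_fraction
p_value = binomial_upper_tail(clusters_in_te_rich, n_clusters, te_rich_fraction)

print(f"expected in TE-rich regions under the uniform model: {expected:.1f}")
print(f"P(at least {clusters_in_te_rich} in TE-rich regions) = {p_value:.4f}")
```

A small tail probability would argue against the uniform model but would not by itself show that TE clusters pre-date gene clusters; that requires comparative data.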
It is possible that TE-rich regions are preferential targets for transposition and even that, once the concentration of TEs in these regions has started to stimulate recombination, their high rates of recombination expose them to more transposition (perhaps as a result of the higher exposure of DNA breaks). This could lead to a virtuous cycle of transposition and recombination in these regions.
In future, comparative analyses of the relative distributions of gene clusters and TEs will be needed to test the hypothesis that TE clusters pre-date gene clusters. Other areas of future study will be to test how often TEs are associated with tandemly repeated gene families, how consistently particular types of element are associated with different gene clusters and whether these correlations are consistent between species. In addition, it will be important to investigate the relative rates of different processes within regions of this kind, because the gene factory hypothesis predicts that the turnover processes (UCO and gene conversion) taking place in these regions do not give rise to sequence homogenization but rather enable neo-, sub- and nonfunctionalization of duplicate copies.
Concluding remarks
Along with large-scale segmental duplication, it is becoming clear that tandem duplication of genes within restricted localities of the genome is an important force in
genome evolution and, potentially, in the adaptive evolution both of species and gene families. It is important in this context to distinguish microfunctionalized gene families, such as those seen in the Del36H region, from fully adapted and homogeneous types. The association of high concentrations of TEs with tandem gene families and with sites of chromosome reorganization, and their potential role in driving gene factories, strengthens the view that TEs have an important role in genome evolution by giving rise to genomic rearrangements at a variety of scales [23].
Acknowledgements
I thank the Medical Research Council for financial support and AnnMarie Mallon for useful discussions. I also thank an anonymous referee for suggestions in clarifying some aspects of the arguments presented in this article. This article describes research funded as part of the MRC UK Mouse Sequencing Consortium.
References
1 Taylor, J.S. and Raes, J. (2005) Small-scale gene duplications. In The Evolution of the Genome (Gregory, T.R., ed.), pp. 289–327, Elsevier Academic Press
2 Ohno, S. (1970) Evolution by Gene Duplication, Springer-Verlag
3 Eichler, E.E. and Sankoff, D. (2003) Structural dynamics of eukaryotic chromosome evolution. Science 301, 793–797
4 Hardison, R. (1998) Hemoglobins from bacteria to man: evolution of different patterns of gene expression. J. Exp. Biol. 201, 1099–1117
5 Eason, D.D. et al. (2004) Mechanisms of antigen receptor evolution. Semin. Immunol. 16, 215–226
6 Hillis, D.M. and Dixon, M.T. (1991) Ribosomal DNA: molecular evolution and phylogenetic inference. Q. Rev. Biol. 66, 411–453
7 Hentschel, C.C. and Birnstiel, M.L. (1981) The organization and expression of histone gene families. Cell 25, 301–313
8 Wagner, G.P. et al. (2003) Hox cluster duplications and the opportunity for evolutionary novelties. Proc. Natl. Acad. Sci. U. S. A. 100, 14603–14606
9 Moore, R.C. and Purugganan, M.D. (2003) The early stages of duplicate gene evolution. Proc. Natl. Acad. Sci. U. S. A. 100, 15682–15687
10 Li, W.H. et al. (2001) Evolutionary analyses of the human genome. Nature 409, 847–849
11 Sonnhammer, E.L. and Koonin, E.V. (2002) Orthology, paralogy and proposed classification for paralog subtypes. Trends Genet. 18, 619–620
12 Waterston, R.H. et al. (2002) Initial sequencing and comparative analysis of the mouse genome. Nature 420, 520–562
13 Emes, R.D. et al. (2003) Comparison of the genomes of human and mouse lays the foundation of genome zoology. Hum. Mol. Genet. 12, 701–709
14 Mallon, A.M. et al. (2004) Organization and evolution of a gene-rich region of the mouse genome: a 12.7-Mb region deleted in the Del(13)Svea36H mouse. Genome Res. 14, 1888–1901
15 Gao, L.Z. and Innan, H. (2004) Very low gene duplication rate in the yeast genome. Science 306, 1367–1370
16 Lynch, M. and Force, A. (2000) The probability of duplicate gene preservation by subfunctionalization. Genetics 154, 459–473
17 Dover, G. (1982) Molecular drive: a cohesive mode of species evolution. Nature 299, 111–117
18 Nei, M. and Hughes, A.L. (1992) Balanced polymorphism and evolution by the birth-and-death process in the MHC loci. In 11th Histocompatibility Workshop and Conference (Tsuji, K. et al., eds), pp. 27–38, Oxford University Press
19 Gaillard, I. et al. (2004) Olfactory receptors. Cell. Mol. Life Sci. 61, 456–469
20 Smith, G.P. (1976) Evolution of repeated DNA sequences by unequal crossover. Science 191, 528–535
21 Alkan, C. et al. (2004) The role of unequal crossover in alpha-satellite DNA evolution: a computational analysis. J. Comput. Biol. 11, 933–944
22 Ohta, T. (1983) On the evolution of multigene families. Theor. Popul. Biol. 23, 216–240
23 Kazazian, H.H., Jr. (2004) Mobile elements: drivers of genome evolution. Science 303, 1626–1632
24 Deininger, P.L. and Batzer, M.A. (1999) Alu repeats and human disease. Mol. Genet. Metab. 67, 183–193
25 Ostertag, E.M. and Kazazian, H.H., Jr. (2001) Biology of mammalian L1 retrotransposons. Annu. Rev. Genet. 35, 501–538
26 Reiter, L.T. et al. (1996) A recombination hotspot responsible for two inherited peripheral neuropathies is located near a mariner transposon-like element. Nat. Genet. 12, 288–297
27 Arkell, R.M. et al. (2001) Genetic, physical, and phenotypic characterization of the Del(13)Svea36H mouse. Mamm. Genome 12, 687–694
28 Wang, Z.F. et al. (1996) Characterization of the mouse histone gene cluster on chromosome 13: 45 histone genes in three patches spread over 1 Mb. Genome Res. 6, 688–701
29 Keverne, E.B. (1999) The vomeronasal organ. Science 286, 716–720
30 Kaiserman, D. et al. (2002) Comparison of human chromosome 6p25 with mouse chromosome 13 reveals a greatly expanded ov-serpin gene repertoire in the mouse. Genomics 79, 349–362
31 Gilad, Y. et al. (2003) Human specific loss of olfactory receptor genes. Proc. Natl. Acad. Sci. U. S. A. 100, 3324–3327
32 Sebat, J. et al. (2004) Large-scale copy number polymorphism in the human genome. Science 305, 525–528
33 Iafrate, A.J. et al. (2004) Detection of large-scale variation in the human genome. Nat. Genet. 36, 949–951
34 Li, J. et al. (2004) Genomic segmental polymorphisms in inbred mouse strains. Nat. Genet. 36, 952–954
35 Bailey, J.A. et al. (2003) An Alu transposition model for the origin and expansion of human segmental duplications. Am. J. Hum. Genet. 73, 823–834
36 Friedman, R. and Hughes, A.L. (2004) Two patterns of genome organization in mammals: the chromosomal distribution of duplicate genes in human and mouse. Mol. Biol. Evol. 21, 1008–1013