Why do genes have introns? Recombination might add a new piece to the puzzle


Research Update


TRENDS in Genetics Vol.17 No.4 April 2001



Laurent Duret

Much progress has been made recently regarding when and how spliceosomal introns invaded eukaryotic genomes. Although the ‘intron early–intron late’ debate seems to be settled, the original and essential question remains: why have introns at all – do they have a purpose? Analyses of the relationship between intron length and recombination in Drosophila shed new light on the forces that drive the evolution of introns. Comeron and Kreitman proposed recently that introns are advantageous because they enhance within-gene recombination and therefore increase selection efficacy (Hill–Robertson effects). However, their observations can also be explained by alternative neutralist models.

The phylogenetic distribution of spliceosomal introns clearly shows that most of them have been gained during eukaryotic evolution, and indeed there is no evidence that any spliceosomal intron existed before the prokaryote–eukaryote split (Ref. 1). Several examples of recent gains indicate that intron insertion is an ongoing process, at least in some eukaryotic lineages (Ref. 1). Detailed biochemical analyses reveal compelling similarities between the mechanisms of splicing for SPLICEOSOMAL INTRONS (see Glossary) and GROUP II INTRONS, which strongly suggests that these two splicing processes share a common ancestor (Ref. 1). The spliceosome might have originated from a group II intron horizontally transferred from a prokaryote to the eukaryotic nucleus, possibly through a mitochondrial ancestor (Ref. 2).

Why do genes keep introns?

Because of their mere presence within genes, introns incur an extra cost in energy and time during replication and transcription. Why, therefore, did they flourish in eukaryotic genomes? Are introns just selfish DNA elements that took advantage of the uncoupling of transcription and translation in eukaryotes to invade protein-coding genes? Or do they confer a selective advantage that outweighs the extra cost that they incur?

Various models have been proposed in line with this latter hypothesis. First, thanks to alternative splicing, a single gene might encode many proteins. (It is estimated that more than 30% of human genes undergo alternative splicing; Ref. 3.) Thus, the acquisition of introns would have been positively selected as a source of functional diversity (Ref. 4). Second, introns might contain functional elements (regulatory elements, alternative promoters or antisense promoters). Overall, introns evolve rapidly, at about the same rate as synonymous sites, which suggests that their sequence is generally only weakly constrained by selection (Ref. 5). However, some of them contain highly conserved regulatory regions (Ref. 6). It is noteworthy that the frequency of such conserved elements is relatively high within the first intron of protein-coding genes (L. Duret, unpublished). This intron is also, on average, the longest one (Refs 7,8). Thus, one potential advantage of introns is to harbour regulatory elements downstream of the promoter, without interfering with the coding of the protein. Third, introns might also contain other genes. The most striking example is the case of small nucleolar RNA genes, which are transcribed as part of the parent pre-mRNA and then processed from the intron to yield the mature snoRNA (Ref. 9). This particular organization allows the coordinated expression of proteins and snoRNA.

http://tig.trends.com 0168–9525/01/$ – see front matter © 2001 Elsevier Science Ltd. All rights reserved. PII: S0168-9525(01)02236-3


Fourth, Kricker et al. (Ref. 10) proposed that the fragmentation of genes into exons might be favoured because it limits the risk of illegitimate recombination with paralogous sequences (pseudogenes or other copies in multigenic families) and therefore reduces the risk of chromosomal rearrangements. Although these models clearly highlight the advantages that introns could confer, it is difficult to determine whether these advantages are the cause of the spread of introns, or whether they simply reflect their opportunistic domestication by eukaryotic genomes.

The evolution of introns

Thanks to the data generated by genome projects, large-scale inter- and intragenomic comparative analyses are now feasible. Because the cost incurred by an intron is expected to be proportional to its size, the analysis of the factors that affect intron size should help us to understand whether their evolution is governed by selection and, if so, how. Notably, population genetics models predict that, all else being equal, the efficacy of natural selection is decreased in regions of low recombination rate compared with regions of high recombination rate (Ref. 11; the so-called ‘Hill–Robertson’ effects, Box 1). In other words, if intron length is governed by natural selection, there should be a correlation between intron length and the rate of recombination. Thus, two studies on the relationship between intron length and rate of recombination in Drosophila melanogaster were conducted recently to test whether intron length is a neutral or a selected trait (Refs 12,13).
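The two-locus scenario of Box 1 can be explored with a small simulation. The sketch below is only an illustration, not a method used in the studies discussed here: it assumes a haploid Wright–Fisher population, and the fitness values, population size, number of replicates and the function names (x_fixes, count_x_fixations) are all hypothetical choices. It counts how often the new allele X, which arises linked to the deleterious allele y, reaches fixation without recombination and with free recombination.

```python
import random

# Relative fitness of the four haplotypes, ordered as in Box 1: XY > xY > Xy > xy.
# The numerical values are illustrative assumptions.
FITNESS = {("X", "Y"): 1.15, ("x", "Y"): 1.10, ("X", "y"): 1.05, ("x", "y"): 1.00}


def x_fixes(pop_size, rec_rate, rng):
    """Run one haploid Wright-Fisher population until allele X is lost or fixed."""
    half = pop_size // 2
    # Start with xY and xy haplotypes, plus a single new X allele linked to y.
    pop = [("x", "Y")] * half + [("x", "y")] * (pop_size - half - 1) + [("X", "y")]
    while True:
        n_x = sum(1 for hap in pop if hap[0] == "X")
        if n_x == 0:
            return False
        if n_x == pop_size:
            return True
        weights = [FITNESS[hap] for hap in pop]
        # Draw two parents per offspring, each in proportion to its fitness.
        parents = rng.choices(pop, weights=weights, k=2 * pop_size)
        next_pop = []
        for i in range(pop_size):
            first, second = parents[2 * i], parents[2 * i + 1]
            if rng.random() < rec_rate:
                # Crossover: first locus from one parent, second locus from the other.
                next_pop.append((first[0], second[1]))
            else:
                next_pop.append(first)
        pop = next_pop


def count_x_fixations(rec_rate, replicates=300, pop_size=100, seed=1):
    """Count the replicates (out of `replicates`) in which X reaches fixation."""
    rng = random.Random(seed)
    return sum(x_fixes(pop_size, rec_rate, rng) for _ in range(replicates))


if __name__ == "__main__":
    for r in (0.0, 0.5):
        print(f"recombination rate {r}: X fixed in {count_x_fixations(r)} of 300 runs")
```

Under these assumptions, X should fix far more often when the two loci recombine freely than when they are completely linked, which is the interference described in Box 1. The exact counts depend on the arbitrary parameter values chosen here.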

Glossary

C-value paradox: There is a wide variation of genome size (C-value) among species. Unexpectedly, the C-value is not correlated with the number of genes or with the morphological complexity of organisms. For example, some amoebas have 200 times more DNA than humans (Ref. a).
Group II introns: These are found in bacteria, mitochondria and chloroplast genomes. These introns are self splicing (they catalyse their own splicing reaction), and some of them are able to transpose within genomes. Spliceosomal introns probably evolved from an ancestral group II intron.
Indels: Insertions or deletions.
Spliceosomal introns: Eukaryotic nuclear genes are often disrupted by introns. These introns are called spliceosomal introns, because their removal is catalysed by the spliceosome (an elaborate complex of proteins and small nuclear RNAs).

Reference
a Thomas, C.A. (1971) The genetic organization of chromosomes. Annu. Rev. Genet. 5, 237–256

Box 1. Hill–Robertson effect

Population genetics models show that the efficacy of selection (as opposed to genetic drift) is decreased in regions of reduced recombination (the Hill–Robertson effect). To illustrate this effect, consider two genetically linked loci, each with two alleles (X, x and Y, y), and assume that the relative fitness of the different haplotypes is: XY > xY > Xy > xy. Consider a population that initially contained only xY and xy haplotypes. Then a new, advantageous allele, X, arises. Suppose that, by chance, this new allele is associated with the relatively deleterious allele y (haplotype Xy). If there is no recombination, the haplotype XY will never be produced in the population. Thus, selection will favour the fixation of the xY haplotype, and hence X will be driven out of the population. Conversely, if there is recombination between the two loci, the haplotype XY will be produced, and selection will favour the fixation of favourable alleles at both loci. In other words, there is interference between the effects of selection at linked loci. Such interference is maximized when the selection coefficients are of equivalent magnitude and when recombination rate is low. A consequence of this effect is that, all else being equal, the efficacy of selection (i.e. the probability of fixation of favourable alleles) varies along chromosomes according to the local rate of recombination.

Selective model

Carvalho and Clark (Ref. 12) proposed that intron length reflects an equilibrium between a mutational pressure tending to increase their size and selection favouring deletions. Because of Hill–Robertson effects, this equilibrium should depend on recombination rate: introns should be larger in regions of low recombination rate (where selection is less efficient) than in regions of high recombination rate (Refs 14,15). Consistent with this model, they found that larger introns occur preferentially in regions of low recombination (Ref. 12). Note however that, in contradiction with the model, there is no evidence of selection favouring deletions (Ref. 16). Indeed, if there was a selective pressure in favour of deletions, then large deletions should have a higher probability of being fixed in the population than small ones. But analysis of deletions in defective transposable elements shows no evidence for such a pattern (Ref. 16).

Comeron and Kreitman confirmed, not only in Drosophila, but also in human, that intron length is negatively correlated with the rate of recombination (Ref. 13). But, interestingly, they observed that deletions are more frequent than insertions. This latter observation raises an important point: if there is a mutational pressure biased towards deletion, this should cause the rapid collapse of intron length. Hence, the presence of long introns can only mean that selection must favour their length. Why would long introns be advantageous? First, very short introns are counterselected because a minimum intron length is required for the splicing reaction (Refs 12,13). Second, and this is the original idea proposed by Comeron and Kreitman, long introns might be favoured because they enhance recombination between mutations under the influence of selection in adjacent exons. Indeed, theoretical studies have shown that, in regions of very low recombination, modifiers that increase recombination between selected loci can be selectively beneficial and will be preferentially fixed in the genome, even if these modifiers are slightly deleterious (Refs 17,18). The advantage of such modifiers is all the more important if the Hill–Robertson effects are strong (low recombination rate or small population size). The correlation between intron length and recombination is therefore consistent with this model (Fig. 1). Interspecies comparisons also support this model: D. melanogaster has both a lower rate of recombination and a smaller effective population size than Drosophila simulans, as well as having longer introns. The fact that the correlation between intron length and recombination was observed both in Drosophila and in human led the authors to suggest that their model might be valid for all eukaryotic genomes (but see below).

Fig. 1. Relationship between average intron length and recombination rates in D. melanogaster (adapted, with permission, from Ref. 13). [Plot not reproduced: average intron length (kb) against recombination rate (×10⁻⁵ per bp per generation).] Intron length is negatively correlated with recombination rate (Spearman’s rank correlation Rs = −0.2; P = 1.3 × 10⁻⁵; n = 447 genes with one or more introns). Recombination rate is measured by comparing genetic maps (in cM) to physical maps (in bp): for each chromosome, a polynomial regression curve of the genetic distance as a function of the quantity of DNA is generated. Recombination rate, as a function of the position along a chromosome, is estimated by taking the derivative of the polynomial function for each chromosome (Ref. 13). Figure kindly provided by J. Comeron.

Neutral model

The selective model of Comeron and Kreitman is elegant, but further work will be necessary to test it. Notably, Carvalho and Clark found that the correlation between intron size and recombination rate is basically established by introns less than 100 bp in length. For larger introns, there is no significant correlation (Ref. 12). Simulation studies would be useful to determine whether the relationship between intron length and recombination expected from Comeron and Kreitman’s model fits with observed data. Moreover, the evidence supporting that model is not very strong. There are two points that are essential for the demonstration of Comeron and Kreitman’s hypothesis: (1) there is a mutational pressure that tends to reduce the size of introns and (2) the correlation between intron size and recombination reflects variations in selective pressure and not in mutation pattern. To determine the pattern of INDEL mutation in Drosophila, the authors analysed data on polymorphic insertion and deletion events. As mentioned above, these data show that deletions are more frequent than insertions [in agreement with other studies based on the analysis of retrotransposons in Drosophila (Ref. 16) and pseudogenes in mammals (Ref. 19)]. Moreover, the ratio of deletion to insertion is similar in regions of high or low recombination, which indicates that there is no strong variation in indel mutation patterns. Although the observations in Drosophila are compatible with their model, it is not certain that the data are robust enough to reject alternative explanations.

First, it should be noted that the great majority of the indels analysed are short (1–10 bp): the 136 polymorphic indels identified in introns totalled about 2 kb. Given the paucity of polymorphism data presently available, it is possible that rare large insertions (e.g. 100 times less frequent than short indels) have been overlooked. Thus, it is possible that rare large insertions [e.g. transposable elements (TEs) are typically 6–8 kb in length] have, in fact, a much stronger impact on the length of noncoding regions than the numerous short indels. Comeron and Kreitman argue that they did not detect any TEs within their data set. However, introns might contain ancient TEs that are no longer recognizable and still contribute to intron size. In human, it is clear that TEs (which make up about 42% of euchromatic DNA) have a major impact on intron length (Refs 20,21).

Second, it should be stressed that the correlation between intron size and recombination is weak (Fig. 1). Thanks to the large set of introns analysed (n = 447), it can be shown that this correlation is statistically significant (P = 1.3 × 10⁻⁵; Ref. 13). However, such a weak effect is probably undetectable with the polymorphism data available (n = 31). More data will therefore be necessary to assess the impact of rare large insertions (e.g. TEs) in Drosophila and to determine whether indel mutational pattern varies with recombination.
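The dependence of statistical significance on sample size can be made concrete with a short calculation. The sketch below is only an illustration under stated assumptions: it uses the statistic t = Rs × sqrt((n − 2)/(1 − Rs²)) and approximates its distribution by a standard normal instead of Student's t, and it supposes, hypothetically, that an effect of the same strength (Rs = −0.2) were measured on a sample of 31 points. The function name approx_two_sided_p is introduced here for the example.

```python
import math


def approx_two_sided_p(r_s, n):
    """Approximate two-sided P value for a rank correlation r_s observed on n points.

    Uses t = r_s * sqrt((n - 2) / (1 - r_s**2)) and, as a simplifying
    assumption, a standard normal reference distribution instead of
    Student's t with n - 2 degrees of freedom.
    """
    t = r_s * math.sqrt((n - 2) / (1.0 - r_s ** 2))
    return math.erfc(abs(t) / math.sqrt(2.0))


if __name__ == "__main__":
    for n in (447, 31):
        print(f"n = {n}: approximate P = {approx_two_sided_p(-0.2, n):.2g}")
```

With n = 447 the approximate P value is of the order of 10⁻⁵, in line with the value quoted above, whereas with n = 31 a correlation of the same strength would be far from the conventional 0.05 threshold. The normal approximation is cruder for the small sample, but the conclusion does not hinge on it.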
Data from Caenorhabditis elegans provide an interesting viewpoint on this issue. In this species, TEs are located preferentially in regions of high recombination, in both intergenic regions and introns (Ref. 22). This pattern is not consistent with any selectionist model (including Comeron and Kreitman’s model). The most probable explanation is that their distribution reflects variations of the pattern of TE insertion. In contrast to that in human or Drosophila, intron length in C. elegans is positively correlated with recombination rate (n = 41 707 introns; Spearman’s rank correlation Rs = +0.14; P < 10⁻⁴; G. Marais, pers. commun.). It is likely that variations in intron length with recombination are due to these differences in the rate of TE insertion. This highlights the fact that the correlation between intron length and recombination does not necessarily reflect variations in selective pressure.

Conclusion

In conclusion, the present data do not allow the teasing apart of the selectionist and neutralist models for the evolution of intron size in Drosophila. It is possible that variations of intron length within and between genomes essentially reflect differences in the dynamic equilibrium between rare TE insertions and frequent, but short, base deletions rather than variations in selective pressure. Although it remains speculative, the model of Comeron and Kreitman is attractive. It is generally recognized that the evolutionary advantage of recombination is to increase the efficacy of selection by allowing the combination of favourable alleles at linked loci. Long introns and intergenic regions undoubtedly favour within- and between-gene recombination. Thus, Comeron and Kreitman’s model provides a novel way of thinking about the C-VALUE PARADOX. However, further investigations are required to determine whether this advantage is sufficient to outweigh the cost incurred by these noncoding regions.

References

1 Logsdon, J.M. (1998) The recent origin of spliceosomal introns revisited. Curr. Opin. Genet. Dev. 8, 637–648
2 Cavalier-Smith, T. (1991) Intron phylogeny: a new hypothesis. Trends Genet. 7, 145–148
3 Hanke, J. et al. (1999) Alternative splicing of human genes: more the rule than the exception? Trends Genet. 15, 389–390
4 Dibb, N.J. (1993) Why do genes have introns? FEBS Lett. 325, 135–139
5 Smith, N.G.C. and Hurst, L.D. (1998) Sensitivity of patterns of molecular evolution to alterations in methodology: a critique of Hughes and Yeager. J. Mol. Evol. 47, 493–500
6 Duret, L. and Bucher, P. (1997) Searching for regulatory elements in human noncoding sequences. Curr. Opin. Struct. Biol. 7, 399–406
7 Smith, M.W. (1988) Structure of vertebrate genes: a statistical analysis implicating selection. J. Mol. Evol. 27, 45–55
8 Maroni, G. (1994) The organization of Drosophila genes. DNA Seq. 4, 347–354
9 Maxwell, E.S. and Fournier, M.J. (1995) The small nucleolar RNAs. Annu. Rev. Biochem. 64, 897–934
10 Kricker, M.C. et al. (1992) Duplication-targeted DNA methylation and mutagenesis in the evolution of eukaryotic chromosomes. Proc. Natl. Acad. Sci. U. S. A. 89, 1075–1079
11 Hill, W.G. and Robertson, A. (1966) The effect of linkage on limits to artificial selection. Genet. Res. 8, 269–294
12 Carvalho, A.B. and Clark, A.G. (1999) Intron size and natural selection. Nature 401, 344
13 Comeron, J.M. and Kreitman, M. (2000) The correlation between intron length and recombination in Drosophila. Dynamic equilibrium between mutational and selective forces. Genetics 156, 1175–1190
14 Charlesworth, B. (1996) Genome evolution – the changing sizes of genes. Nature 384, 315–316
15 Hurst, L.D. et al. (1999) Small introns tend to occur in GC-rich regions in some but not all vertebrates. Trends Genet. 15, 437–439
16 Petrov, D.A. et al. (2000) Evidence for DNA loss as a determinant of genome size. Science 287, 1060–1062
17 Otto, S.P. and Barton, N.H. (1997) The evolution of recombination: removing the limits to natural selection. Genetics 147, 879–906
18 Hey, J. (1998) Selfish genes, pleiotropy and the origin of recombination. Genetics 149, 2089–2097
19 Ophir, R. and Graur, D. (1997) Patterns and rates of indel evolution in processed pseudogenes from humans and murids. Gene 205, 191–202
20 Smit, A.F. (1999) Interspersed repeats and other mementos of transposable elements in mammalian genomes. Curr. Opin. Genet. Dev. 9, 657–663
21 Duret, L. and Hurst, L.D. The elevated G and C content at exonic third sites is not evidence against neutralist models of isochore evolution. Mol. Biol. Evol. (in press)
22 Duret, L. et al. (2000) Transposons but not retrotransposons are located preferentially in regions of high recombination rate in Caenorhabditis elegans. Genetics 156, 1661–1669

L. Duret Laboratoire de Biométrie et Biologie Évolutive, CNRS UMR 5558, Université Claude Bernard–Lyon 1, 43 Bd du 11 Novembre 1918, 69622 Villeurbanne, Cedex, France. e-mail: [email protected]

Genome Analysis

Transcription unit conservation in the three domains of life: a perspective from Escherichia coli

Gabriel Moreno-Hagelsieb, Victor Treviño, Ernesto Pérez-Rueda, Temple F. Smith and Julio Collado-Vides

Here we address the question of the degree to which genes within experimentally characterized operons in one organism (Escherichia coli) are conserved in other genomes. We found that two genes adjacent within an operon are more likely both to have an ortholog in other organisms, regardless of relative position, than genes adjacent on the same strand but in two different transcription units. They are also more likely to occur next to, or fused to, one another in other genomes. Genes frequently conserved adjacent to each other, especially among evolutionarily distant species, must be part of the same transcription unit in most of them.

Analyses of genome organization have shown that gene order differs between genomes (Ref. 1), and that such order deteriorates much faster than protein sequence identities (Ref. 2). Despite this, it has been possible to find some conserved gene clusters with protein products that either physically interact (Ref. 3) or have an otherwise related function (Ref. 4). Such clusters have been related to operons implicitly in the texts and explicitly in the examples. Nonetheless, if we understand operons as a collection of adjacent genes transcribed into a single messenger RNA, or polycistronic transcription unit (TU), there is still a need to demonstrate that the conserved clusters correspond to operons. Here, we demonstrate that genes within experimentally characterized operons in Escherichia coli have evident tendencies towards conservation in other genomes, and that pairs of genes showing a high conservation of vicinity might actually be part of the same TU in most prokaryotic organisms.

Our computer analyses were based on the comparison of the conservation of adjacent pairs of genes transcribed in the same direction in E. coli. The genes were from two collections built as described previously (Ref. 5): a collection of pairs found in operons in RegulonDB, a database compiled from the literature on regulation of transcription in E. coli (Ref. 6) (612 pairs out of 269 operons), and, as a control, a dataset of adjacent pairs found at the boundaries of TUs (405 pairs); that is, the last gene in a TU, and the first in the next one. We call these two sets ‘within-operon pairs’ and ‘boundary pairs’, respectively.

To find orthologs of E. coli genes in other organisms, we ran gapped BLASTP (Ref. 7) comparisons of all the protein sequences corresponding to all the open reading frames (ORFs) of E. coli, against every protein sequence corresponding to the ORFs of all other genomes obtained from GenBank (Ref. 8). We used an expectation value cutoff of 0.01. We kept only those results where the alignment covered at least 50% of one of the sequences. Our putative orthologs were those genes with protein products that were overall best hits to the E. coli query proteins. We used this uni-directional best hits definition of orthology instead of the more common bi-directional one (i.e. the query is also the best hit when its best hit is used as query), because we observed that the data self-clean as the analyses advance. The definition also facilitates the finding of fusions and of synteny (conservation of gene order), which is another indication of orthology (Ref. 2).

Co-occurrence of genes among genomes

If the proteins coded by two genes have a related function (for instance, take part in sequential steps of a pathway), they would be expected to co-occur in different genomes. Thus, it has been suggested (Ref. 2) and demonstrated (Ref. 9) that functional relationships of genes can be inferred if such genes have similar ‘phylogenetic profiles’. Figure 1a shows that this trend characterizes adjacent genes within operons, most of which are formed from functionally related genes. If the two members of an E. coli pair each have an ortholog in another genome, regardless of the relative positions of these orthologs within that genome, we call them an ortholog pair, or a co-occurring pair. If only

http://tig.trends.com 0168–9525/01/$ – see front matter © 2001 Elsevier Science Ltd. All rights reserved. PII: S0168-9525(01)02241-7