Processed pseudogenes: the ‘fossilized footprints’ of past gene expression


in the future. Nelson [13] makes a coherent and eloquent plea for incorporating more effective (and interesting) strategies into evolution instruction; strategies that make students think rather than merely memorize. Instructors also need to be encouraged to develop tools to assess evolutionary learning. What does it mean to say that your students understand evolution? How do you know what they know? Progress is being made at an institutional level in encouraging teachers to take a scientific approach to teaching, wherein evidence of learning is collected, critically assessed and used to improve instruction [14]; published examples of scientific teaching are also increasing. Within evolutionary biology, for example, Nelson [13] has done pioneering work on the advantages of active (versus passive) instructional methods; Nehm and Reilly [3] explored student understanding of natural selection; and Catley and Novick analyzed how students interpret phylogenetic trees [15]. When guided by the use of educational literature (a form of metacognition), teaching tends to improve as practitioners master the use of validated instructional materials and pedagogical techniques. However, such evidence-based teaching is still the exception rather than the rule.

Concluding remarks

Assuming that these challenges can be overcome, then great strides can be made in the public understanding of evolution. Recalling that all biology and most general science teachers in primary and secondary schools have taken a college biology course, improvements in the teaching of evolution within these college courses can have a huge long-term impact. By integrating molecular perspectives into the teaching of evolution (and vice versa), great improvements can be expected in evolutionary understanding on the college campus and beyond. This will help to build a scientific and technological workforce that is

Trends in Genetics

Vol.25 No.10

better able to apply evolutionary principles to guide decision making. Equally importantly, the improved teaching of evolution in introductory biology courses will help the public become better informed as to the nature of science and the centrality of evolutionary thinking to understanding life on Earth.

References

1 Nelson, C.E. (2007) Teaching evolution effectively: a central dilemma and alternative strategies. McGill J. Edu. 42, 265–283
2 Alters, B.J. and Nelson, C.E. (2002) Teaching evolution in college. Evolution 56, 1891–1901
3 Nehm, R.H. and Reilly, L. (2007) Biology majors’ knowledge and misconceptions of natural selection. BioScience 57, 263–272
4 Hillis, D.M. (2007) Making evolution relevant and exciting to biology students. Evolution 61, 1261–1264
5 Futuyma, D.J. (1998) Wherefore and whither the naturalist? Am. Nat. 151, 1–6
6 Grant, P.R. (2000) What does it mean to be a naturalist at the end of the twentieth century? Am. Nat. 155, 1–12
7 Wilson, E.O. (2000) On the future of conservation biology. Conserv. Biol. 14, 1–3
8 McGlynn, T.P. (2008) Natural history education for students heading into the century of biology. Am. Biol. Teach. 70, 109–111
9 Moore, A. (2008) Science teaching must evolve. Nature 453, 31–32
10 Hoekstra, H.E. et al. (2006) A single amino acid mutation contributes to adaptive beach mouse color pattern. Science 313, 101–104
11 Yokoyama, S. et al. (2006) Tertiary structure and spectral tuning of UV and violet pigments in vertebrates. Gene 365, 95–103
12 Dean, A.M. and Thornton, J.W. (2007) Mechanistic approaches to the study of evolution: the functional synthesis. Nat. Rev. Genet. 8, 675–688
13 Nelson, C.E. (2008) Teaching evolution (and all of biology) more effectively: strategies for engagement, critical reasoning, and confronting misconceptions. Integr. Comp. Biol. 48, 213–225
14 Handelsman, J. et al. (2004) Scientific teaching. Science 304, 521–522
15 Catley, K.M. and Novick, L.R. (2008) Seeing the wood for the trees: an analysis of evolutionary diagrams in biology textbooks. BioScience 58, 976–987

0168-9525/$ – see front matter © 2009 Elsevier Ltd. All rights reserved. doi:10.1016/j.tig.2009.09.001 Available online 2 October 2009

Genome Analysis

Processed pseudogenes: the ‘fossilized footprints’ of past gene expression

Ondrej Podlaha1,2 and Jianzhi Zhang1

1 Department of Ecology and Evolutionary Biology, University of Michigan, Ann Arbor, MI 48109, USA
2 Department of Ecology and Evolutionary Biology, Yale University, New Haven, CT 06518, USA

Although our knowledge of the genes and genomes of extinct organisms is improving as a result of progress in sequencing ancient DNA, the transcriptomes of extinct organisms remain inaccessible, owing to the rapid degradation of messenger RNA after death. We provide empirical evidence that gene expression levels in the reproductive tissues of mice and during early mouse development correlate highly with the rate of inherited retroposition: the source of processed pseudogenes in the genome. Thus, processed pseudogenes might serve as fossilized footprints of the expression of their parent genes, shedding light on ancient transcriptomes that could provide significant insights into the evolution of gene expression.

Corresponding author: Zhang, J. ([email protected]).

Accessing the transcriptomes of ancient organisms

Sequencing the genomes of ancient extinct organisms is a ‘dream come true’ for evolutionary geneticists. With the unravelling of whole or partial mitochondrial and nuclear genome sequences of extinct species, including the woolly mammoth [1–3] and Neanderthal [4–6], many important


Figure 1. The number of young processed pseudogenes is correlated with parent gene expression in mouse. (a) The age distribution of mouse processed pseudogenes. Age is measured by Kimura’s two-parameter nucleotide distance between a pseudogene and its parent gene at third codon positions. (b) Mean Fisher’s Z transformed


evolutionary events, from human origins to past epidemics and crop domestication, can be dissected to an unprecedented resolution. For example, Neanderthal genome sequences have informed and will continue to inform us about the potential genetic contribution of Neanderthals to the contemporary human gene pool [4–7] and will enable the identification of human-specific genetic changes that occurred in the past few hundred thousand years of human evolution.

Research in evolutionary developmental biology has repeatedly demonstrated the importance of gene expression changes in phenotypic evolution [8–10]. However, owing to the instability of messenger RNA (mRNA), its extraction from ancient organisms is beyond reach. Although prediction of gene expression levels and patterns from DNA sequences has been attempted in unicellular model organisms [11], it is not yet possible in complex multicellular organisms, particularly if the prediction is based on ancient DNA sequences that are incomplete or inaccurate. Although it is theoretically possible to infer the expression of a gene in an ancestral organism using the expression data from extant organisms [12], such inferences are likely to be unreliable, because gene expression evolves rapidly [13–15]. Here, we explore the possibility that processed pseudogenes provide information about past gene expression levels and present an analytical framework for retrieving such information.

Are processed pseudogenes the fossilized footprints of past gene expression?

The genomes of many eukaryotes, including humans, are ‘littered’ with pseudogenes: defunct relatives of known genes that have lost their protein-coding ability. Many pseudogenes originate through germline retroposition, where the mRNA of a gene is reverse-transcribed to complementary DNA and reinserted into the genome at random [16]. Most of these reinserted retrogenes lack promoters and are ‘dead-on-arrival’. They gradually accumulate mutations that disrupt open reading frames (ORFs), and are eventually recognizable as processed pseudogenes (Pcs). Pcs have been used for archiving splice variants [17] and for studying rates and patterns of neutral mutations [18].

In principle, the number of Pcs of a given parent gene depends on the abundance of the mRNA of the parent gene in the germline, which is determined jointly by the rates of mRNA production and degradation [19,20]. Indeed, a large proportion of Pcs are from housekeeping genes, which generally are highly transcribed and have stable mRNAs [21]. However, a recent comparison between the expression level of a gene and the number of its ‘Pc offspring’ failed to find gene expression level to be a dominant determinant of the rate of retroposition [21]. Here, we hypothesize that the present-day expression level of a gene should be best correlated with the number of its young Pc offspring, not all Pc offspring, because expression of a gene in the germline might not have been constant during


evolution. Because retroposition occurring at earlier stages of development (i.e. embryogenesis) is more likely to impact the germline than that in later stages, we examined mouse expression data from a variety of tissues at different developmental stages (see the supplementary material online). Our datasets include the comprehensive catalogue of mouse pseudogenes (www.pseudogene.org) and 23 mouse microarray expression datasets from Gene Expression Omnibus (www.ncbi.nlm.nih.gov/geo/) (Table S1).

As a proxy for Pc age, we calculated the nucleotide sequence divergence (d) between each of 5315 Pcs and their respective parent genes in mouse, at third codon positions, using Kimura’s two-parameter distance [22]. We then binned the pseudogenes in increments of d = 0.05. The age distribution of Pcs in our dataset (Figure 1a) reflects the combined effects of an exponential decay of non-functional sequences in the genome and an accrual of pseudogene characteristics over time. The former effect is due to the gradual deletion of non-functional sequences from the genome and the gradual divergence of pseudogenes from their parent genes beyond recognition. The latter effect occurs because very recently formed Pcs are under-detected owing to the lack of ORF-disrupting substitutions (see also below). The age distribution of Pcs is also highly influenced by the cellular reverse transcriptase activity in the past, which in mammals is determined by the activity of long interspersed element 1 (LINE1) [20].

We correlated parent gene expression in various wild-type tissues at various developmental stages with the number of Pc offspring of different age groups. We found the correlations to be greater for expression levels in reproductive and embryonic tissues than for those in somatic tissues (e.g. lung, kidney, liver, and neural tissues) across all Pc age categories (Figure 1b; Table S1 and Figure S1 in the supplementary material online).
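The age proxy described above can be illustrated with a short Python sketch. It is not the authors' pipeline. It computes Kimura's two-parameter distance, d = −0.5 ln[(1 − 2P − Q)√(1 − 2Q)], where P and Q are the proportions of transitional and transversional differences, over the third codon positions of a parent coding sequence and its aligned pseudogene, and then assigns the 0.05-wide age bin. The function names and the toy sequences are hypothetical, and the sketch assumes the two sequences are already aligned to the same length with the parent sequence in frame and free of gaps.

```python
import math

PURINES = {"A", "G"}

def k2p_third_positions(parent_cds, pseudogene):
    """Kimura two-parameter distance over third codon positions.

    Assumes (hypothetically) that both sequences are pre-aligned, equal in
    length, and that the parent CDS starts in frame and contains no gaps.
    Columns with a gap or ambiguous base in either sequence are skipped.
    """
    if len(parent_cds) != len(pseudogene):
        raise ValueError("sequences must be aligned to the same length")
    sites = transitions = transversions = 0
    for i in range(2, len(parent_cds), 3):  # indices 2, 5, 8, ... are third positions
        a, b = parent_cds[i].upper(), pseudogene[i].upper()
        if a not in "ACGT" or b not in "ACGT":
            continue
        sites += 1
        if a == b:
            continue
        # Purine<->purine or pyrimidine<->pyrimidine changes are transitions.
        if (a in PURINES) == (b in PURINES):
            transitions += 1
        else:
            transversions += 1
    if sites == 0:
        raise ValueError("no comparable third codon positions")
    p, q = transitions / sites, transversions / sites
    x, y = 1 - 2 * p - q, 1 - 2 * q
    if x <= 0 or y <= 0:
        raise ValueError("sequences too divergent for the K2P correction")
    return -0.5 * math.log(x * math.sqrt(y))

def age_bin(d, width=0.05):
    """Index of the bin of the given width that contains d (bin 0 is [0, width))."""
    return int(d // width)

# Toy example: ten codons, one transition and one transversion at third positions.
parent = "AAA" * 10
pseudo = "AAG" + "AAC" + "AAA" * 8
d = k2p_third_positions(parent, pseudo)
print(d, age_bin(d))
```

In practice the alignment step, the handling of frameshifts in the pseudogene, and the treatment of saturated comparisons would all need more care than this sketch gives them.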
When considering all Pcs younger than d, we observed the correlation to reach its maximum at d = 0.25 (Figure S1). In theory, the lower the d value, the higher the correlation. In reality, however, very recently formed Pcs are under-detected (see above). Based on the rates of point and insertion/deletion mutations in mammals, we calculated that retropseudogenes should be detectable when d = 0.1 (see the supplementary material online). The probable reason that the correlation is not higher for d = 0.1 than for d = 0.25 is that the former group contains far fewer genes than the latter group.

We divided all Pcs into three broad age groups referred to as young (d < 0.25), middle (d = 0.25–0.65), and old (d > 0.65) Pcs. It is apparent that the correlation is much higher for the young group than for the middle and old groups (Figure 1b). The highest correlation was found between the parent gene expression levels during embryonic gonad development and the number of Pc offspring with d < 0.25 (Pearson’s R = 0.54, P < 10⁻¹¹⁵; Spearman’s r = 0.38, P < 10⁻⁵⁰; Figure 1c). For this dataset, the number of Pcs can be used to predict the expression levels of their parent genes to some extent. For example,

correlations of parent gene expression and the number of processed pseudogene offspring of different age groups. Z = 0.5 ln[(1 + R)/(1 − R)], where R is Pearson’s correlation coefficient. Unlike R, Z is normally distributed. Error bars represent one standard error. Note that ‘reproductive’ includes embryonic expressions that potentially predate the separation of germlines. (c) Pearson’s correlation between the number of young processed pseudogenes (d < 0.25) and their parent gene expression in mouse embryonic testis 14.5 days post conception (R = 0.54, P < 10⁻¹¹⁵). For the same data set, Spearman’s rank correlation is r = 0.38 (P < 10⁻⁵⁰). When the apparent outlier with the highest number of pseudogene offspring is removed, Pearson’s correlation becomes 0.58 (P < 10⁻¹²⁴) and Spearman’s correlation becomes 0.37 (P < 10⁻⁵⁰).
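The Fisher transformation quoted in the caption can be sketched as follows. The first function implements Z = 0.5 ln[(1 + R)/(1 − R)] exactly as stated. The second is only an assumption about how a panel such as Figure 1b might be assembled: it averages the Z values of several datasets from one tissue class and reports the standard error of that mean. The function names and the input correlations are hypothetical.

```python
import math
import statistics

def fisher_z(r):
    """Fisher's Z transform of a Pearson correlation coefficient."""
    if not -1 < r < 1:
        raise ValueError("r must lie strictly between -1 and 1")
    return 0.5 * math.log((1 + r) / (1 - r))

def mean_z_with_se(correlations):
    """Mean of Z-transformed correlations and the standard error of that mean.

    Assumption: each correlation comes from a separate expression dataset,
    and the error bar is the standard error across datasets.
    """
    zs = [fisher_z(r) for r in correlations]
    mean = statistics.mean(zs)
    if len(zs) > 1:
        se = statistics.stdev(zs) / math.sqrt(len(zs))
    else:
        se = float("nan")  # a single dataset gives no spread estimate
    return mean, se

# Hypothetical correlations from three datasets of one tissue class.
print(mean_z_with_se([0.45, 0.54, 0.50]))
```

Averaging on the Z scale rather than the R scale is the usual reason for applying the transform, because Z is approximately normally distributed whereas R is bounded and skewed.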


expression levels of parent genes with five Pcs are significantly higher than those of parent genes with four Pcs (P = 0.003, one-tailed Mann-Whitney U test; Table S2 in the supplementary material online). Note that because microarray-based measurement of gene expression is imprecise, particularly for comparisons among genes as in the present case [14], the actual correlation between parent gene expression and Pc offspring number is expected to be even greater. These results confirm that the expression levels of parent genes affect the rate of retroposition and are consistent with the understanding that inheritable retroposition occurs in germlines or in early embryos before the separation of germlines. The relatively high correlation (R  0.3) for some somatic tissues is probably caused by the presence of genes


in the data set (especially housekeeping genes) with correlated expression between somatic tissues and germlines.

Given the strong correlation between the number of recent Pcs and the current expression levels of their parent genes, Pcs of different ages can be treated as fossilized footprints to infer past expression levels of parent genes in embryonic or reproductive tissues, which might provide insights into the evolution of gene expression. For example, overall, 62%, 28%, and 10% of mouse Pcs fall into young, middle and old age groups, respectively (Figure 2a). By contrast, mouse Hmgb2 (encoding the high-mobility group box 2 protein), which is important for male gonad development and spermatogenesis [23], gave rise to 65 Pcs, of which 16 (25%), 8 (12%), and 41 (63%) belong to the young,

Figure 2. Variation of gene expression level over evolutionary time, probed from biased age distributions of processed pseudogenes. (a) Examples of parent genes whose processed pseudogenes have age distributions significantly different from the genomic distribution (eukaryotic translation initiation factor 1 [Eif1], high-mobility group box 2 [Hmgb2], and ribosomal protein S6 [Rps6]) or similar to the genomic distribution (ribosomal protein L7 [Rpl7]). (b) The top 20 mouse and human parent genes whose processed pseudogenes have age distributions significantly different from the genomic distribution (ranked by Q-values). P-values are from χ² tests and Q-values are false discovery rates.
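The comparison of one gene's Pc age distribution with the genomic distribution can be sketched as a χ² goodness-of-fit test. The example below uses the Hmgb2 counts given in the text (16 young, 8 middle, 41 old Pcs) against the rounded genomic proportions (62%, 28%, 10%). It is a minimal illustration under those rounded inputs, not a reconstruction of the authors' exact test, so it should not be expected to reproduce the published P-value digit for digit. The function names are hypothetical. With three age groups the test has two degrees of freedom, for which the upper-tail probability has the closed form exp(−χ²/2).

```python
import math

def chi2_goodness_of_fit(observed, expected_fractions):
    """Pearson chi-square statistic for observed counts vs expected proportions."""
    total = sum(observed)
    return sum(
        (o - total * f) ** 2 / (total * f)
        for o, f in zip(observed, expected_fractions)
    )

def chi2_pvalue_df2(chi2):
    """Upper-tail P-value of a chi-square statistic with exactly 2 degrees of freedom."""
    return math.exp(-chi2 / 2)

# Counts and proportions as quoted in the text (proportions are rounded).
hmgb2_counts = [16, 8, 41]              # young, middle, old
genomic_fractions = [0.62, 0.28, 0.10]  # young, middle, old

chi2 = chi2_goodness_of_fit(hmgb2_counts, genomic_fractions)
p = chi2_pvalue_df2(chi2)
print(f"chi2 = {chi2:.2f}, P = {p:.2e}")
```

With these inputs the statistic is a little over 203, driven almost entirely by the excess of old Pcs. Testing every parent gene in this way yields many P-values, which is why the caption reports false discovery rates alongside them.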


middle and old age groups, respectively. This age distribution is significantly different from the genomic distribution (χ² test, P < 10⁻⁴⁵) and shows a decrease in the expression of Hmgb2 (compared with the genomic average) in the evolution leading to mouse (Figure 2a). By examining all parent genes in the same way, we observed that 59 (3.2%) mouse and 37 (1.9%) human parent genes differ significantly from their respective genomic averages in age distribution at a false discovery rate of 1%, presumably resulting from evolutionary changes in expression level in embryonic tissues or germlines (Figure 2b; Table S3 in the supplementary material online). A similar approach can be used to study the evolutionary trajectory of expression divergence between duplicate genes.

Caveats to this approach

Here, we showed strong correlation between the number of recent Pcs and the current expression levels of their parent genes in mouse. Because this phenomenon is a result of the retroposition rate being determined by the abundance of mRNA, we predict that similarly strong correlations exist in other organisms, although it is not feasible to validate this prediction at this time, owing to the lack of appropriate gene expression data from germlines or during early development.

We illustrated the use of Pcs in analysing the evolution of gene expression. The overall accuracy of similar analyses in any extant or ancient genome depends on three important factors: identification of Pcs, identification of the parent genes of Pcs, and the estimation of Pc ages.

Identification of Pcs is not a trivial task and it is usually executed by searching in a sequenced genome for protein-coding-gene homologues that contain ORF-disrupting substitutions and lack introns. By design, this approach is limited to Pcs of intron-containing parent genes, due to the difficulty of distinguishing between processed and non-processed pseudogenes of intron-less parent genes (10% in mouse [24]).
Furthermore, because it takes on average a few million years for ORF-disrupting mutations to occur and become fixed in Pcs (of humans and related primates) [25], very young Pcs will be under-detected, hampering the study of parent gene expression in the very recent past. Other factors, such as the E-value cutoff used in the BLAST search of homologous sequences, affect the identification of Pcs. For example, very old Pcs are hard to detect even with loose E-value cutoffs, simply because these Pcs are too different in DNA sequence from their parent genes. But, as long as the same criteria are applied to all Pcs, the bias for individual parent genes is expected to be minimal.

Currently, Pcs are assigned to their functional parent genes by BLAST searches and are assumed to have arisen from their parent genes through retroposition. However, it is possible that a Pc is generated secondarily by simple duplication of another Pc, rather than by retroposition of its functional parent gene. In theory, the number of duplication-generated Pcs is expected to be proportional to the number of retroposition-generated Pcs. Thus, treating all Pcs as retroposition-derived should not bias the age distribution of a parent gene relative to that of the genomic average. In reality, however, the segmental duplication rate varies widely across the genome and it might be necessary to differentiate between duplication-generated and retroposition-generated Pcs. For this purpose, phylogenetic analysis will be useful, as the sister clade of each duplication-generated Pc would not include the functional parent gene.

Another complication in identifying the parent gene of a Pc is that the parent gene might no longer exist, for two reasons. First, the parent gene might have been lost during evolution, which will lead to the identification of an incorrect parent gene or no parent gene, depending on whether the true parent gene has a paralogue in the genome. Second, a parent gene might have been duplicated to generate two (or more) daughter genes since the birth of the Pc that is under investigation. In this case, BLAST can assign the Pc to either daughter gene, depending on which one evolves more slowly, although the true parent gene should be the common ancestor of these daughter genes and can be identified through phylogenetic analysis.

We measured the age of a Pc by its nucleotide sequence divergence d from its functional parent gene at the parent gene’s third codon positions. This approach is applicable to both retroposition-generated and duplication-generated Pcs. However, it could underestimate the ages of Pcs of more conserved parent genes relative to those of less conserved parent genes. This is because, even at third codon positions, a substantial fraction of mutations are non-synonymous (31% if there is no transition/transversion bias) and less likely to be fixed in a conserved parent gene than in a less conserved one. Thus, for two Pcs of the same true age, the one derived from the more conserved parent gene tends to look younger than the one derived from the less conserved parent gene. This bias could be alleviated if d is calculated using only fourfold degenerate sites of parent genes, but such estimates of d might have larger variances owing to smaller numbers of usable sites.
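The figure of 31% quoted above can be checked with a small enumeration over the standard genetic code. The sketch below rests on assumptions that the text does not spell out: all 61 sense codons are weighted equally, each third-position base is equally likely to change to any of the other three (no transition/transversion bias), and changes to a stop codon are counted as non-synonymous. Under those assumptions 57 of the 183 possible third-position changes alter the encoded product, which is about 31%. Whether the authors derived the number this way is not stated.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code in TCAG order (first base varies slowest); '*' marks stops.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def third_position_nonsynonymous_fraction():
    """Fraction of third-position point changes in sense codons that change the product.

    Assumptions: equal codon usage, equal rates to each alternative base,
    and nonsense changes counted as non-synonymous.
    """
    changed = total = 0
    for codon, aa in CODON_TABLE.items():
        if aa == "*":
            continue  # only mutations starting from sense codons are considered
        for base in BASES:
            if base == codon[2]:
                continue
            total += 1
            if CODON_TABLE[codon[:2] + base] != aa:
                changed += 1
    return changed / total

print(f"{third_position_nonsynonymous_fraction():.3f}")
```

Restricting the same loop to fourfold degenerate codon families would give a fraction of zero by definition, which is the rationale for the alternative estimator of d mentioned in the text.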
Transcribed but untranslated regions (UTRs) of parent genes are frequently present in Pcs and may be used for estimating d. However, because UTRs of parent genes are subject to stronger functional constraints than fourfold degenerate sites [26], they are less ideal for estimation of Pc age.

We used the Pc information to infer relative changes of gene expression during evolution. Is it possible to infer the absolute gene expression level in the past? We found that even among parent genes with the same number of young Pcs, their log2-transformed microarray expression levels still vary substantially, with a standard deviation that often reaches a quarter of the mean (Table S2 in the supplementary material online). Thus, it is probably difficult to infer precisely the absolute expression level of a gene in the past from the number of its Pcs.

These caveats notwithstanding, our results point to a fascinating observation that the Pc record embedded in the genome of an organism provides a uniquely informative glimpse of past gene expression levels, which complements the knowledge gained from sequencing ancient extinct genomes to provide a more comprehensive picture of evolution. A recent study suggested that gene expression differences between species are generally smaller at the protein level than at the mRNA level [27]. Our approach can infer mRNA expression changes, but not protein expression changes. Future research should improve the reliability of the retrieval of ancient transcriptome information and apply it to the study of evolution.

Acknowledgements

We thank members of the Zhang laboratory and three anonymous reviewers for valuable comments. This work was supported by research grants from the National Institutes of Health to J.Z.

Appendix A. Supplementary data

Supplementary data associated with this article can be found, in the online version, at doi:10.1016/j.tig.2009.09.002.

References

1 Miller, W. et al. (2008) Sequencing the nuclear genome of the extinct woolly mammoth. Nature 456, 387–390
2 Krause, J. et al. (2006) Multiplex amplification of the mammoth mitochondrial genome and the evolution of Elephantidae. Nature 439, 724–727
3 Rogaev, E.I. et al. (2006) Complete mitochondrial genome and phylogeny of Pleistocene mammoth Mammuthus primigenius. PLoS Biol. 4, e73
4 Green, R.E. et al. (2006) Analysis of one million base pairs of Neanderthal DNA. Nature 444, 330–336
5 Noonan, J.P. et al. (2006) Sequencing and analysis of Neanderthal genomic DNA. Science 314, 1113–1118
6 Green, R.E. et al. (2008) A complete Neandertal mitochondrial genome sequence determined by high-throughput sequencing. Cell 134, 416–426
7 Krings, M. et al. (1997) Neandertal DNA sequences and the origin of modern humans. Cell 90, 19–30
8 Stern, D.L. and Orgogozo, V. (2008) The loci of evolution: how predictable is genetic evolution? Evolution 62, 2155–2177
9 Carroll, S.B. (2008) Evo-devo and an expanding evolutionary synthesis: a genetic theory of morphological evolution. Cell 134, 25–36
10 Wray, G.A. (2007) The evolutionary significance of cis-regulatory mutations. Nat. Rev. Genet. 8, 206–216
11 Beer, M.A. and Tavazoie, S. (2004) Predicting gene expression from sequence. Cell 117, 185–198
12 Gu, X. (2004) Statistical framework for phylogenomic analysis of gene family expression profiles. Genetics 167, 531–542

13 Khaitovich, P. et al. (2004) A neutral model of transcriptome evolution. PLoS Biol. 2, E132
14 Liao, B.Y. and Zhang, J. (2006) Evolutionary conservation of expression profiles between human and mouse orthologous genes. Mol. Biol. Evol. 23, 530–540
15 Gu, Z. et al. (2002) Rapid divergence in expression between duplicate genes inferred from microarray data. Trends Genet. 18, 609–613
16 Kaessmann, H. et al. (2009) RNA-based gene duplication: mechanistic and evolutionary insights. Nat. Rev. Genet. 10, 19–31
17 Shemesh, R. et al. (2006) Genomic fossils as a snapshot of the human transcriptome. Proc. Natl Acad. Sci. USA 103, 1364–1369
18 Graur, D. et al. (1989) Deletions in processed pseudogenes accumulate faster in rodents than in humans. J. Mol. Evol. 28, 279–285
19 Pavlicek, A. et al. (2006) Retroposition of processed pseudogenes: the impact of RNA stability and translational control. Trends Genet. 22, 69–73
20 Zhang, Z. et al. (2004) Comparative analysis of processed pseudogenes in the mouse and human genomes. Trends Genet. 20, 62–67
21 Balasubramanian, S. et al. (2009) Comparative analysis of processed ribosomal protein pseudogenes in four mammalian genomes. Genome Biol. 10, R2
22 Kimura, M. (1980) A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. J. Mol. Evol. 16, 111–120
23 Ronfani, L. et al. (2001) Reduced fertility and spermatogenesis defects in mice lacking chromosomal protein Hmgb2. Development 128, 1265–1273
24 Sakharkar, K.R. et al. (2006) Functional and evolutionary analyses on expressed intronless genes in the mouse genome. FEBS Lett. 580, 1472–1478
25 Zhang, J. and Webb, D.M. (2003) Evolutionary deterioration of the vomeronasal pheromone transduction pathway in catarrhine primates. Proc. Natl Acad. Sci. USA 100, 8337–8341
26 Li, W-H. (1997) Molecular Evolution, Sinauer, p. 184
27 Schrimpf, S.P. et al. (2009) Comparative functional analysis of the Caenorhabditis elegans and Drosophila melanogaster proteomes. PLoS Biol. 7, e48

0168-9525/$ – see front matter © 2009 Elsevier Ltd. All rights reserved. doi:10.1016/j.tig.2009.09.002 Available online 30 September 2009

Genome Analysis

Different gene regulation strategies revealed by analysis of binding motifs

Zeba Wunderlich1 and Leonid A. Mirny2

1 Department of Systems Biology, Harvard Medical School, Boston, MA 02115, USA
2 Harvard–MIT Division of Health Sciences and Technology, Massachusetts Institute of Technology, Cambridge, MA 02139, USA

Coordinated regulation of gene expression relies on transcription factors (TFs) binding to specific DNA sites. Our large-scale information–theoretical analysis of >950 TF-binding motifs demonstrates that prokaryotes and eukaryotes use strikingly different strategies to target TFs to specific genome locations. Although bacterial TFs can recognize a specific DNA site in the genomic background, eukaryotic TFs exhibit widespread, nonfunctional binding and require clustering of sites to achieve specificity. We find support for this mechanism in a range of experimental studies and in our evolutionary analysis of DNA-binding domains. Our systematic characterization of binding motifs provides a quantitative assessment of the differences in transcription regulation in prokaryotes and eukaryotes.

Corresponding author: Mirny, L.A. ([email protected]).

DNA binding and gene regulation

Classical experiments have demonstrated that strong binding of a TF to its cognate site in a promoter is sufficient to alter gene expression [1]. Significant effort has been put into experimentally determining [2–6] and computationally inferring [7–10] motifs recognized by TFs, and determining the occupancy of promoters by TFs [11]. The motifs and binding locations of a TF have in turn been used to predict which genes it regulates and their expression levels [12]. Such studies rely on linking the binding of TFs to DNA with the regulation of nearby genes.