Genomics 86 (2005) 708 – 717 www.elsevier.com/locate/ygeno
Discovery of a large number of previously unrecognized mitochondrial pseudogenes in fish genomes

Agostinho Antunes *, Maria João Ramos

REQUIMTE, Grupo de Química Teórica e Computacional – Departamento de Química, Faculdade de Ciências, Universidade do Porto, Rua do Campo Alegre, 687, 4169-007 Porto, Portugal

Received 20 April 2005; accepted 1 August 2005
Available online 19 September 2005
Abstract

Nuclear inserted copies of mitochondrial origin (numts) vary widely among eukaryotes, with human and plant genomes harboring the largest repertoires. Numts were previously thought to be absent from fish species, but the recent release of three fish nuclear genome sequences provides the resource to obtain a more comprehensive insight into the extent of mtDNA transfer in fishes. From the sequence analyses of the genomes of Fugu rubripes, Tetraodon nigroviridis, and Danio rerio, we have identified 2, 5, and 10 recent numt integrations, respectively, which integrated into those genomes less than 0.6 million years (Myr) ago. Such results contradict the hypothesis of absence or rarity of numts in fishes, as (i) the ratio of numts to the total size of the nuclear genome in T. nigroviridis exceeded the ratio observed in several higher vertebrate species (e.g., chicken, mouse, and rat) and was surpassed only by humans, and (ii) the mtDNA coverage transferred to the nuclear genome of D. rerio is exceeded only by human and mouse, within the whole range of eukaryotic genomes surveyed for numts. Additionally, 335, 336, and 471 old numts (>12.5 Myr) were detected in F. rubripes, T. nigroviridis, and D. rerio, respectively. Surprisingly, old numts are inserted preferentially into known or predicted genes, as inferred for recent numts in human. However, because in fish genomes such integrations are old, they are likely to represent evolutionary successes and they may be considered a potentially important evolutionary mechanism for the enhancement of genomic coding regions.

© 2005 Elsevier Inc. All rights reserved.

Keywords: Gene transfer; Genome evolution; Mitochondrial pseudogene; Mutagenesis; Numt; Teleosts
The mitochondrion plays a central role in the fate and evolution of eukaryotic cells. In addition to its importance in oxidative phosphorylation (supplying up to 95% of the cell's energy), other critical cellular functions have recently been identified, including its role in apoptosis [1]. However, the functionality of the mitochondrion is not ensured solely by the gene products synthesized from its own genome. Indeed, other products (~1000 proteins) are encoded by the nuclear genome, translated by the protein machinery of the cell cytoplasm, and then transferred to the mitochondrial organelle. These genes have been progressively transferred from the organelle to the nucleus [2,3]. This functional gene transfer is still active in plants and green algae [4,5], but not in animals, in which the mitochondrial
* Corresponding author. E-mail address: [email protected] (A. Antunes). 0888-7543/$ - see front matter © 2005 Elsevier Inc. All rights reserved. doi:10.1016/j.ygeno.2005.08.002
genetic code differs from the standard code [6]. Nevertheless, nuclear inserted copies of mtDNA origin, often referred to as numts [7], continue to integrate in animals but are likely "dead on arrival" (pseudogenes). The human genome sequence database has provided a broad view of the extent of mtDNA transfer, facilitated the identification of transfer mechanisms, and elucidated the evolutionary dynamics of numts [8–11]. Several hundred numts have integrated into the human nuclear genome over a continuous evolutionary process [10], but numt copies may also arise from duplication after insertion into the nucleus [9,12,13]. The prevalence of reported numts varies widely among eukaryotes (reviewed in [14]), with human and plant genomes harboring the largest numt repertoires [15,16]. Interestingly, although numts have been searched for in fish, they seem to be absent in such organisms [14]. More than 99% of living fishes are represented by the teleosts (Teleostei), a monophyletic group of ray-finned fishes (Actinopterygii). They
A. Antunes, M.J. Ramos / Genomics 86 (2005) 708 – 717
have undergone an unparalleled, rapid, and successful radiation, which gave rise to more than 23,600 fish species [17], more than half the number of extant vertebrate species. Although such fishes are well studied, with over 2,111,461 sequences deposited in GenBank as of April 20, 2005, only a single nuclear copy of the ATP6 mtDNA gene has been described to date, in Galaxias spp. [18]. Recent preliminary studies on genome drafts of two teleost species, the pufferfish Fugu rubripes and the zebrafish Danio rerio, reported a few alignments found with BLASTN [15,16], but no further analyses were performed to evaluate the extent and characteristics of promising numts. In this study, we attempt to reconstruct the evolutionary dynamics of numt integration over time in three teleost species: the pufferfishes F. rubripes and Tetraodon nigroviridis and the zebrafish D. rerio. From the data currently available, fishes appear less affected by numt integrations, but the recent release of three fish nuclear genome sequences provides a valuable resource to obtain a more comprehensive insight into the extent of mtDNA transfer in fishes and its significance for the study of genome evolution.

Results

BLASTN searches for nucleotide sequence homology between the three fish nuclear genomes and their mtDNA genomes retrieved 918 alignments. We infer the
presence of at least 2, 5, and 10 numts in F. rubripes, T. nigroviridis, and D. rerio, respectively (Fig. 1; Supplementary Appendix 1). The mtDNA fraction covered by numts was higher in D. rerio (91.5%) compared to the two pufferfishes (35.9 and 59.2%) (Table 1; Fig. 1). Interestingly, 77% of the observed numts were larger than 2000 bp, while the remaining numts varied from 106 to 614 bp. The numt contribution in the three genomes ranged from 0.0018 to 0.003% (Table 1). The similarity of numts detected by BLASTN with the homologous mtDNA sequences varied from 98.6 to 99.9%, 81.2 to 97.4%, and 98.2 to 100%, in F. rubripes, T. nigroviridis, and D. rerio, respectively (Supplementary Appendix 1). Both point mutations and indels were observed. In protein-coding genes, the amino acid divergence increased further with frame disruption (frameshifts and in-frame stop codons) and changes resulting from differences in the vertebrate mitochondrion code relative to the standard code (methionine → isoleucine and tryptophan → stop codon). Nucleotide divergence of the T. nigroviridis numts was overestimated as the F. rubripes mtDNA genome was used as query (see Material and Methods), but using T. nigroviridis species-specific mtDNA sequences available from GenBank (control region and CytB) resulted, as expected, in higher similarity hits. Thus, the older age of T. nigroviridis numts (3.5 to 6.3 Myr, Table 1) is certainly overestimated. The coalescent age of numts in F. rubripes and D. rerio was similar and recent (0.2 to 0.6 Myr; Table 1). Inspection of numt
Fig. 1. Representation of the BLASTN numts detected in the three fish genomes surveyed and their homology with the mtDNA genome. Numbers in each numt segment are identical to those referred to in Supplementary Appendix 1. The genetic map of the mtDNA molecule represented corresponds to the pufferfish Fugu rubripes (adapted, permission of the publisher, from [55]). Protein coding genes are shown by solid bars, rDNAs are shown by open bars, and tRNAs are identified by their one-letter code and, in the case of leucine and serine, by their codon family. ND, NADH dehydrogenase subunit; CO, cytochrome c oxidase subunit; ATP, ATP synthase F0 subunit; CYTb, cytochrome b.
Table 1
Sizes of the mtDNA and nuclear fish genomes and of respective numts detected by BLASTN at a threshold of 10⁻⁴

                          mtDNA                              Nuclear genome (Mb)    Numts
Species                   Total size (bp)  Transferred (%)   Coverage   Total       Base pairs   × 10⁻³ %   Age (Myr)
Tetraodon nigroviridis    16,447 a         59.2              342        350         10,114       3.0        3.5–6.3 a
Fugu rubripes             16,447           35.3              329        365         5811         1.8        0.2–0.4
Danio rerio               16,596           91.5              1560       1600        32,454       2.1        0.2–0.6

a Estimated using the mtDNA genome of F. rubripes; see Material and Methods.
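The numt percentages in Table 1 can be checked with a few lines of arithmetic. The sketch below assumes that the "× 10⁻³ %" column is the numt base pairs divided by the sequenced (coverage) genome size; this formula is not stated explicitly in the article, but it reproduces the tabulated values to one decimal place. The names `TABLE_1` and `numt_fraction_percent` are illustrative only.

```python
# Values transcribed from Table 1. The formula is an assumption: numt base
# pairs divided by the sequenced (coverage) nuclear genome size.
TABLE_1 = {
    "Tetraodon nigroviridis": {"numt_bp": 10_114, "coverage_mb": 342},
    "Fugu rubripes": {"numt_bp": 5_811, "coverage_mb": 329},
    "Danio rerio": {"numt_bp": 32_454, "coverage_mb": 1_560},
}


def numt_fraction_percent(numt_bp: int, coverage_mb: float) -> float:
    """Return the share of the nuclear genome made up of numts, in percent."""
    return 100.0 * numt_bp / (coverage_mb * 1_000_000)


for species, row in TABLE_1.items():
    pct = numt_fraction_percent(row["numt_bp"], row["coverage_mb"])
    # Express the result in units of 10^-3 %, as in the table.
    print(f"{species}: {pct / 1e-3:.1f} x 10^-3 %")
```

Using the total genome size instead of the coverage gives slightly lower values (for example about 1.6 rather than 1.8 for F. rubripes), which is why the coverage column is assumed here.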
integration sites revealed that novel predicted genes (Ensembl, Genoscope, and Genscan) overlap with portions of some numts in all three fish genomes. The small D. rerio numt_Da3 located in chromosome 1 has integrated into the intron of the Genscan00000044746 gene and it was flanked by a repeat motif (Fig. 2). Phylogenetic analyses of some BLASTN numts and homologous mtDNA sequences of the control region in F. rubripes and T. nigroviridis (842 bp; Fig. 3A) and of the 16S rRNA region in D. rerio (525 bp; results not shown) supported the recent integration of such numts. The TBLASTN homology searches performed with amino acid sequences for each of the 13 mtDNA protein-coding genes (see Material and Methods) on the three fish nuclear genomes revealed 31,853 alignments, which led us to infer the presence of 320, 336, and 506 additional numts (undetected by BLASTN) in F. rubripes, T. nigroviridis, and D. rerio, respectively (Fig. 4; Supplementary Appendix 2). Some of the more ubiquitous numts, namely those homologous to cytochrome b (CytB), were detected in all chromosomes of T. nigroviridis (2n C = 21) and D. rerio (2n C = 25) (the karyotype of F. rubripes is not available), indicating that multiple
independent insertions likely occurred. However, independent insertions probably have also been followed by secondary nuclear transfers. In fact, the phylogenetic analysis of several TBLASTN CytB numts in F. rubripes and T. nigroviridis suggests two duplication events prior to the divergence of both species (Fig. 3B). In T. nigroviridis one CytB numt duplicated first from chromosome (Chr) 3 to Chr 2 and then from Chr 2 to Chr 1 (Fig. 3B). The distribution of numts detected by TBLASTN varied given the species (D. rerio typically exhibited more numts) and given the gene content (CytB and ND5 numts were much more abundant) (Fig. 4; Supplementary Appendix 2). Most of the TBLASTN numts corresponded to the ‘‘disrupted and fragmented’’ category (in some cases with complex patterns of duplication and fragment inversion), while only a few examples were of ‘‘intact numts’’ (see Material and Methods). Numts recovered with the BLOSUM62 matrix reproduced basically those previously detected by BLASTN, with the exception of a new CO1 intact numt in T. nigroviridis. The BLOSUM80 and BLOSUM100 also recovered the BLASTN numts and the
Fig. 2. Schematic representation of some fish numt insertions in genes. The exons are represented by boxes and the introns by lines. The F. rubripes and the T. nigroviridis numts were retrieved with TBLASTN searches using the F. rubripes COI amino acid sequence as query. The D. rerio numt (numt_Da3; Supplementary Appendix 1) was depicted with BLASTN searches. The F. rubripes numt represented is an example of a "disrupted numt."
Fig. 3. Phylogenetic analyses of mitochondrial and homologous nuclear pseudogenes. (A) Neighbor-joining phylogenetic tree for the mtDNA control region (CR) and homologous numts detected by BLASTN searches in F. rubripes and T. nigroviridis. (B) Neighbor-joining phylogenetic tree for the mtDNA cytochrome b (CytB) and several homologous numts detected by TBLASTN searches in F. rubripes and T. nigroviridis. Bootstrap values are placed at each branchpoint for the neighbor-joining/maximum parsimony/maximum likelihood phylogenetic analyses (NJ/MP/ML). The symbol (.) represents nodes with bootstrap support <50 or an inferred polytomy in the bootstrap 50% majority-rule consensus tree. GenBank accession numbers for the CR sequences are as follows: F. rubripes NC_004299, Fugu exascurus AY599623, Fugu stictonotus AY599627, Tetraodon biocellatus AJ248457, Tetraodon fluviatilis AJ248474, T. nigroviridis AJ248471, Masturus lanceolatus AB191712, Mola mola AB191719. GenBank accession numbers for the CytB sequences are as follows: D. rerio NC_002333, M. mola NC_005836, Fugu oblongus AY128529, F. stictonotus AY267362, T. biocellatus AJ248554, T. fluviatilis AJ248571, T. nigroviridis AJ248546.
CO1 intact numt in T. nigroviridis, but the scores, probabilities, and e values improved considerably compared to the BLOSUM62 results, as would be expected given that the previous matrices are derived from a threshold of 80 and 100% identity, respectively, being more appropriate for less divergent sequences. From each fish genome, we estimated the age of several TBLASTN numts. We selected numt insertions that exhibited higher similarity with homologous
mtDNA, which would define the youngest numts detected by TBLASTN searches. From our calculations, the younger TBLASTN numts were a CO1 intact numt in T. nigroviridis (scaffold25641) at 12.5 Myr, an ND3 intact numt in D. rerio (Zv4 NA17965) at 18.5 Myr, and an ATP6 intact numt in F. rubripes (scaffold 6158) at 21.1 Myr. Most TBLASTN numts detected in F. rubripes and D. rerio were integrated prior to 18.5 Myr ago, which would roughly indicate a gap
Fig. 4. TBLASTN numts homologous to each of the 13 mtDNA protein-coding genes and detected in the three fish nuclear genomes. TBLASTN numts depicted with distinct block substitution matrices (BLOSUM; see Material and Methods) are represented by different shades. Te, T. nigroviridis; Fu, F. rubripes; and Da, D. rerio. The number of amino acids corresponding to each of the 13 mtDNA genes is given below each gene name. A single number means similar gene sizes for both F. rubripes and D. rerio; two numbers represent size differences within the same gene in F. rubripes and D. rerio.
of nearly 18 Myr between the incorporation of numts detected by BLASTN and by TBLASTN searches. Despite the methodological limitation for dating T. nigroviridis numts (see above), the gap seems to be shorter in T. nigroviridis (~6 Myr).
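The gaps quoted above follow from simple subtraction. The sketch below assumes that the "gap" is the age of the youngest TBLASTN numt minus the upper age bound of the BLASTN numts in Table 1; this reading is an assumption, although it matches the roughly 18 and 6 Myr figures in the text. The dictionary and function names are illustrative.

```python
# Ages in Myr: upper bound of the BLASTN numt ages (Table 1) and the youngest
# TBLASTN numt reported in the text. Treating their difference as the "gap"
# is an assumed reading of the article.
AGES_MYR = {
    "T. nigroviridis": {"oldest_blastn": 6.3, "youngest_tblastn": 12.5},
    "F. rubripes": {"oldest_blastn": 0.4, "youngest_tblastn": 21.1},
    "D. rerio": {"oldest_blastn": 0.6, "youngest_tblastn": 18.5},
}


def integration_gap(oldest_blastn: float, youngest_tblastn: float) -> float:
    """Time (Myr) separating the recent (BLASTN) and old (TBLASTN) numts."""
    return youngest_tblastn - oldest_blastn


for species, ages in AGES_MYR.items():
    gap = integration_gap(ages["oldest_blastn"], ages["youngest_tblastn"])
    print(f"{species}: gap of about {gap:.1f} Myr")
```

Under this assumption the gaps come to about 6.2 Myr for T. nigroviridis, 20.7 Myr for F. rubripes, and 17.9 Myr for D. rerio.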
Fig. 5. Proportion of TBLASTN numts inserted in intergenic (IG) and intragenic (I, intron; E, exon) regions. Only the numts homologous to the mtDNA genes ND1, CO1, CO3, ND5, and CytB are represented. Te, T. nigroviridis; Fu, F. rubripes; and Da, D. rerio.
Surprisingly, the integration of TBLASTN numts in the fish genomes occurs mainly in known or predicted coding regions (introns/exons), rather than in intergenic regions (Fig. 5; see also some selected examples on Fig. 2). The compact genomes of the two pufferfishes showed, however, a much higher rate of numt insertions in exons compared to D. rerio. The remarkable similarity of the patterns of numt integration in T. nigroviridis and F. rubripes genomes (Fig. 5) suggests that many numts may be shared between both species, a fact that is further supported by the phylogenetic analysis of
Fig. 6. Schematic representation of the phylogenetic relationships among the three fish species studied. (a) Divergence of the Euteleostei [17]. (b) Genome contraction of the smooth puffers (Tetraodontidae) after their divergence from the spiny puffers (Diodontidae) [23]. (c) Most recent common ancestor between F. rubripes and T. nigroviridis [19].
several TBLASTN CytB numts (Fig. 3B), which predated the emergence of both species (i.e., numt insertion in the ancestor of both species). Moreover, such CytB numts presumably originated 19.7 to 24.6 Myr ago, which would be consistent with the assumed evolutionary history for these pufferfishes (i.e., divergence from a common ancestor between 18 and 30 Myr ago [19]; Fig. 6).

Discussion

Why have numts remained previously undetected in fishes?

Although numerous fish species have been surveyed for mtDNA variation, reports of numt insertions are virtually nonexistent in fishes [14]. Our results provide, however, evidence that contradicts such empirical observations. First, the ratio of BLASTN numts to the total size of the nuclear genome in T. nigroviridis (3 × 10⁻³%) exceeded the ratio observed in several higher vertebrate species (chicken, 0.78 × 10⁻³%; mouse, 2.1 × 10⁻³%; and rat, 0.2 × 10⁻³%) and was surpassed only by humans, who exhibit the largest numt repertoire described to date in vertebrates [15,16], and second, the mtDNA coverage transferred to the nuclear genome of D. rerio (92%) is exceeded only by human (98%) and mouse (99%), within the whole range of eukaryotic genomes surveyed for numts [16]. Given such unexpected results, why have numts in fish stayed undiscovered for so long? Our data may provide useful insights to help understand such a pattern. First, all BLASTN numts described in the genomes of F. rubripes and D. rerio showed less than 2% overall sequence divergence (the higher divergence of T. nigroviridis numts has been overestimated; see Results), but depending on the gene region compared, numt sequences may harbor almost complete identity to authentic mtDNA. Likewise, species such as gorillas, with a large variety of numt sequences bearing high similarity to mtDNA, can make analysis of mtDNA using standard approaches impossible, as numts would become indistinguishable [20].
The identification of such recent numts, as with sequence heteroplasmy, requires much more complicated data collection and analysis [21]. Second, a large gap of ~18 Myr separates the numts detected by BLASTN from those detected by TBLASTN in F. rubripes and D. rerio. Such a gap seems to be smaller in T. nigroviridis, but even so, larger than 6 Myr. The use of extant mtDNA nucleotide sequences as a probe in BLASTN searches, as would be expected, will allow only the recovery of relatively recent numts [11]. Original insertions are large, but decay over evolutionary time into smaller fragments of increased sequence divergence, and TBLASTN searches have successfully retrieved much older and divergent numts. Third, most of the TBLASTN numts in the three fish species surveyed were disrupted numts, which, in conjunction with the previous point, will further reduce the successful amplification of such old numts. Those TBLASTN numts would likely not be detected by PCR amplification in a routine mtDNA analysis, and we further confirmed this with several in silico PCR experiments (results not shown).
Evolutionary dynamics of fish numts

Because T. nigroviridis and F. rubripes (the "smooth" pufferfishes) have the smallest genomes characterized to date in vertebrates, they may yield important insights for the evolution of numts in fishes. Comparisons of mutation profiles between the smooth puffers (Tetraodontidae) and their sister group the spiny puffers (Diodontidae), whose genome size is twice as large (~800 Mb) [22], suggest that a reduction in the rate of large insertions, rather than an increase in large deletions, was the probable cause of the reduction in the genome size of the smooth puffers, which seems to have occurred in the past 50 to 70 Myr [23] (Fig. 6). Given the high resistance to insertions in Tetraodontiformes, the identification of large recent numt integrations in the nuclear genomes of T. nigroviridis and F. rubripes (up to 6454 and 3689 bp, respectively) is remarkable. In fact, pseudogenes were previously thought to be an extremely rare feature in pufferfishes, with a single documented case in F. rubripes [24]. Recent whole-genome analysis showed that, although repetitive sequences account for <15% of the F. rubripes genome, almost every class of transposable elements (TEs) known in eukaryotes is represented, but a large number seem to be of recent origin (<5% nucleotide divergence), suggesting that the propagation of these elements is restricted due to the high bias for DNA loss in both F. rubripes and T. nigroviridis [25,26]. Given that both recent and old numts have been detected in these two pufferfishes (and despite their high tendency to exhibit compact genome sizes), continued insertion of mtDNA into these fish genomes seems likely, as suggested previously for humans and other organisms [e.g., 10,14]. However, assuming that these two fish genomes have had similar rates of DNA loss [25,26], the differences in the ratio of recent numt integrations between T. nigroviridis and F.
rubripes (3 × 10⁻³% and 1.8 × 10⁻³%, respectively) and the inversely proportional gap of time between recent and old numt insertions (6.2 and 20.7 Myr, respectively) (Fig. 7) clearly suggest that the frequency of DNA transfer from the mitochondria to the nucleus differs between species. The fixation rate of numts may increase with certain population demographic events (e.g., bottlenecks and range expansion [27]), and species with
Fig. 7. Proportion of recent numts inserted and gap of time between recent and old numt insertions per fish genome. Fu, F. rubripes; Te, T. nigroviridis; and Da, D. rerio.
distinct modes of life may accumulate such events at different frequencies. Freshwater fishes typically show high levels of population substructure due to small and irregular population demographics, while marine fishes show reduced levels of substructure due to larger and more stable population demographics [28,29]. This may explain the higher proportion of numts and increased integration frequency (i.e., short gap between recent/old numts) observed in the freshwater fish T. nigroviridis relative to the marine fish F. rubripes (a higher proportion of numts was also observed in the freshwater fish D. rerio: 2.1 × 10⁻³%, Fig. 7). In fact, migration and demographic history are known to be strong forces determining the dynamics of TEs within genomes and populations of species [30]. Future fine-scale analyses to unravel the extent of numt colonization in fish genomes would provide some insight, as revealed recently for the radiation of tiger beetles [31]. However, if T. nigroviridis shows a "more" continued pattern of colonization of its genome by mtDNA relative to F. rubripes, the fact that the overall number of old numts remained similar between species (Fig. 4) suggests an efficient dynamic among numt insertion/duplication/deletion, as would be expected when genomes are constrained for being small. By contrast, the genome of D. rerio, despite its reduced proportion of recent numts compared to T. nigroviridis, presents an overall number of old numts that exceeds by almost 30% those observed in T. nigroviridis. This may explain the marked differences existing between the two pufferfishes and D. rerio regarding the preferential integration of the TBLASTN numts in genes (Fig. 5; see further detail in the next section). Most of the D. rerio numts were inserted in intronic/intergenic regions, while in T. nigroviridis and F.
rubripes most numts were integrated simultaneously in portions of exonic/ intronic/intergenic regions, a pattern that could have been biased by the higher rate of noncoding DNA loss in the compact genomes of these pufferfishes. Once mtDNA fragments become incorporated into the nuclear genome, they immediately are exposed to different modes of evolution (lower mutation rates due to nuclear DNA repair, a distinct genetic code, and the possibility of recombination), which will influence the divergence patterns between the two sequences [7,32,33]. Because the identification of old numts is limited by our ability to detect decayed divergent sequences homologous to mtDNA, we would expect that the slow-evolving mtDNA genes would retrieve the best hits of the genomic ‘‘fossils’’—numts (as mtDNA evolves generally at much higher mutation rates than nuclear DNA). A comparison of the amino acid variability of mtDNA protein-coding genes in 36 vertebrate taxa showed that for the CO1 and CytB genes the amino acid changes are the lowest [34]. Thus, identification of higher numbers of COI and CytB numts should be expected under neutrality. However, CytB numts significantly exceed COI numts by four- to fivefold (t test, p = 0.003, df = 4) (Fig. 4; Supplementary Appendix 2). Similarly, ND5 numts significantly exceed COI numts by three- to sixfold (t test, p = 0.001, df = 4) and are integrated largely in exons, even in the genome of D. rerio (Fig. 5).
This provides a much higher insertion rate than expected, suggesting that positive selection may favor the insertion of CytB and/or ND5 numts (since CytB and ND5 genes are almost contiguous in the mtDNA genome, a selective integration or selective maintenance of one of these gene regions could have favored the integration or maintenance of the other).

Preferential integration of fish numts in genes

The systematic and precise cataloging of pseudogenes can improve gene prediction and genomic annotation efforts, as the existence of pseudogenes often introduces errors in the predicted gene sets (it was initially suggested that up to 22% of the once predicted human genes could be pseudogenes [35,36]). Thus, numts should be an element of concern for the prediction and annotation of genes. In fact, we detected several newly predicted genes in the three fish genomes overlapping BLASTN numts that are unlikely to represent real functional genes. Although such segments may harbor evidence that characterizes functional gene regions, namely sufficient open reading frame (ORF) length and homology with species-specific expressed sequence tags (ESTs), the possibility that the ESTs derive from mitochondrial pseudogenes located in the nuclear genome is very low. Numts were identified in a wide range of species, but up to now there is no evidence for their expression in animals. If some numts could be expressed, we have to hypothesize their expression at a high level, as ESTs are identified by superficial screening of cDNA libraries. The small percentage of ESTs of mitochondrial origin (2%) observed in the F. rubripes EST libraries [37] and the identification of large ORFs using the vertebrate mitochondrial genetic code support the mitochondrial origin of such ESTs. However, if uncertainty exists concerning the integration of large BLASTN numts in gene regions, the D. rerio small numt_Da3 located in one intron of a predicted gene (Fig. 2) clearly challenges that point of view.
Additionally, evidence for the preferential insertion of TBLASTN numts in genes rather than intergenic or repetitive regions, in all three fish genomes studied (Fig. 5), suggests that numts in fishes integrate mostly in genes. While previous findings indicated that regions with low gene content were more prone to numt insertions [10,38], a recent study dramatically changed that perspective, by discovering that human species-specific numts preferentially integrate (~80%) into known or predicted genes [27]. Our findings are remarkably congruent with those obtained in humans [27]. However, fish genomes showed several differences regarding the pattern of numt integration in genes. First, in fish genomes, numts integrated into genes were mostly of old ancestry (>12.5 to 21 Myr). By contrast, human numts inserted into genes were of very recent origin [27]. Second, in fish genomes, numts integrated into genes were mostly disrupted numts, which may increase the success of insertion or maintenance. By contrast, human numts integrated into genes were of short (intact) sequences [27]. Considering that in humans, disrupted pseudogenes represent only a small percentage of the overall
pseudogenes (<12% [39]), the higher frequency of disrupted numts observed in fish genomes suggests that further studies would be important to understanding the role of this mechanism in the evolution of fish numts. Third, in fish genomes, old inserted numts that may have modified the exon/intron pattern of a gene are likely to represent evolutionary successes. Otherwise they would have been subjected to negative selection and removed from the population without leaving descendants [40]. By contrast, the recent numt insertions into human genes are new and evolutionarily unproven. We thus expect that a fraction of such insertions would be harmful (mutagenic) in humans. Indeed, recent findings suggest that two numts, occasionally found as new insertions in the human genome, are associated with diseases in humans [41,42]. Although the insertion of numts was previously considered functionless [13,43], recent human numt insertions were shown to be potentially mutagenic events that could damage the functional integrity of the human genome [27]. From our study, old numts in fish genomes seem to be preferentially integrated into genes, as inferred for recent numts in human. However, because in fish genomes such integrations happened more than 12.5 Myr ago, they are likely to represent evolutionary successes and they should be considered as a potential evolutionary mechanism for the enhancement of genomic coding regions. The fact that fish genomes seem to be "plastic" in comparison with other vertebrate genomes, as genetic changes such as polyploidization, gene duplication, gain of spliceosomal introns, and speciation are more frequent in fishes [44], may explain in part the different dynamics of numt insertion in fish genomes.
Material and methods

Search for homologs of mtDNA in the fish genomes

BLASTN searches [45] were performed with the whole mtDNA genome used as a query sequence against the corresponding organismal nuclear genome, setting the maximum expectation value to e = 10⁻⁴ and using no filters, to recover hits that are biologically significant [15,16]. Complete mtDNA genomes were retrieved from The National Center for Biotechnology Information (http://www.ncbi.nlm.nih.gov/) for F. rubripes (NC_004299) and D. rerio (NC_002333). The complete mtDNA genome for T. nigroviridis was not available and similarity searches against the draft nuclear genome were performed using the F. rubripes mtDNA genome, the phylogenetically closest one available (most recent common ancestor between 18 and 30 Myr ago [19]; Fig. 6). Draft nuclear genomes were publicly available at the Ensembl Genome Browser (http://www.ensembl.org/) (Fugu v2.0, May 2004; TETRAODON7 and Zebrafish-WTSI Zv4, September 2004). As protein sequences provided more robust similarity signals than noncoding sequences, we additionally used the six-frame TBLASTN [46] to unveil further stretches of homology compared to BLASTN results [10]. TBLASTN searches were performed individually for each of the 13 mtDNA protein-coding genes. An expected score threshold of 10⁻² to 10⁻³ rejects most alignments between distantly related proteins (<25 to 30% identity) [25], while above this "twilight zone" of similarity, 90% of the alignments are likely to represent true biological homologies [47]. We have thus used a conservative expectation cutoff score of e = 10⁻⁴. The SEG filtering option was selected to reduce the likelihood of obtaining matches with nuclear sequences of low complexity that are unrelated biologically with the query sequence. Because numts that have been inserted at different times in the nuclear genome have different
evolutionary distance divergence (higher divergence of old numts compared to recent ones), we have applied a distinct range of block substitution matrices (BLOSUM) [48] appropriate for sequences that are more evolutionarily divergent to those far less divergent: BLOSUM62, BLOSUM80, and BLOSUM100.
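The expectation cutoff described above can be illustrated with a small filter over BLAST hits. This is a minimal sketch, not the authors' pipeline (they used the Ensembl BLASTView interface): it assumes hits are available in the common 12-column tab-separated layout (query, subject, percent identity, alignment length, mismatches, gap opens, query start/end, subject start/end, E-value, bit score). The `Hit` type, the function names, and the example rows are hypothetical.

```python
from typing import Iterable, List, NamedTuple


class Hit(NamedTuple):
    query: str
    subject: str
    identity: float
    evalue: float


def parse_tabular(lines: Iterable[str]) -> List[Hit]:
    """Parse 12-column tab-separated BLAST hits (assumed layout, see above)."""
    hits = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        hits.append(Hit(fields[0], fields[1], float(fields[2]), float(fields[10])))
    return hits


def filter_hits(hits: Iterable[Hit], max_evalue: float = 1e-4) -> List[Hit]:
    """Keep hits whose E-value does not exceed the cutoff (10^-4 in the text)."""
    return [hit for hit in hits if hit.evalue <= max_evalue]


# Hypothetical rows for illustration only; not data from the article.
EXAMPLE_ROWS = [
    "mtDNA\tscaffold_A\t99.1\t614\t5\t0\t1\t614\t2000\t2613\t0.0\t1100",
    "mtDNA\tscaffold_B\t82.0\t120\t20\t2\t700\t819\t50\t169\t3e-20\t95.0",
    "mtDNA\tscaffold_C\t78.5\t40\t8\t1\t900\t939\t10\t49\t0.5\t28.0",
]

kept = filter_hits(parse_tabular(EXAMPLE_ROWS))
print([hit.subject for hit in kept])
```

The third hypothetical row is discarded because its E-value (0.5) is above the cutoff.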
Mapping of the mitochondrial pseudogenes in the fish genomes

The BLASTView in the Ensembl Web site was used to analyze results from BLAST searches. The contigs in which these BLAST hits were observed have been investigated in more detail to infer the chromosomal position and characteristics of the numts and the region of integration, information that could be gathered by examining the maps and annotation provided in ContigView and ExportView links, respectively. Numts that have been recently inserted into the nuclear genome have great similarity in size to the mtDNA homologous segments (intact numts). Thus, BLASTN alignments returning from the same contig were considered to be the same integration event if they had identical orientation and overlapping bases or, when containing gaps, they were considered to be the same if the gap was mostly similar in size to the expected mitochondrial fragment missing. By contrast, old numts may have undergone considerable rearrangements, namely secondary large insertion events [9], which may extend considerably the initial integration length of the numt (disrupted numts). Since TBLASTN allows the recovery of distant homologous alignments (suitable for the detection of older numts), alignments from the same contig could be considered to be the same integration event even if they were disrupted by large insertions. Typically, disrupted numts consisted of several fragments intercalated with large insertions, but such fragments followed the initial ordered structure of the homologous mtDNA gene. A more complex numt organization may include segment duplication and segment inversions, but with the increased sequence decay of older numts, structure may become so disrupted that only smaller numt portions could be identified (numt fragments).
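The rule for merging BLASTN alignments into integration events can be sketched as follows. This is an illustrative reading of the criteria above, not the authors' code: the `Alignment` fields, the 20% `tolerance` used for "mostly similar in size", and the choice to compare each alignment only with the previous one on the contig are all assumptions, and the circularity of the mtDNA molecule is ignored.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Alignment:
    contig: str
    strand: int      # +1 or -1 relative to the mtDNA query
    nuc_start: int   # nuclear coordinates, nuc_start <= nuc_end
    nuc_end: int
    mt_start: int    # mtDNA coordinates, mt_start <= mt_end
    mt_end: int


def same_event(a: Alignment, b: Alignment, tolerance: float = 0.2) -> bool:
    """Decide whether two alignments plausibly belong to one numt insertion."""
    if a.contig != b.contig or a.strand != b.strand:
        return False
    first, second = sorted((a, b), key=lambda x: x.nuc_start)
    nuc_gap = second.nuc_start - first.nuc_end
    if nuc_gap <= 0:
        return True  # overlapping (or abutting) bases on the contig
    mt_first, mt_second = sorted((a, b), key=lambda x: x.mt_start)
    mt_gap = mt_second.mt_start - mt_first.mt_end
    if mt_gap <= 0:
        return False  # nuclear gap without a matching missing mtDNA fragment
    # The nuclear gap should be close in size to the missing mtDNA fragment.
    return abs(nuc_gap - mt_gap) <= tolerance * mt_gap


def group_events(alignments: List[Alignment], tolerance: float = 0.2) -> List[List[Alignment]]:
    """Group alignments into putative integration events (greedy, per contig)."""
    events: List[List[Alignment]] = []
    ordered = sorted(alignments, key=lambda x: (x.contig, x.strand, x.nuc_start))
    for aln in ordered:
        if events and same_event(events[-1][-1], aln, tolerance):
            events[-1].append(aln)
        else:
            events.append([aln])
    return events
```

For disrupted (TBLASTN) numts the text allows large insertions between fragments, so the gap-size test would need to be relaxed; that case is not covered by this sketch.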
Molecular dating of the mitochondrial pseudogenes in the fish genomes
We investigated the timing of divergence of all numts detected by BLASTN. Nucleotide numt sequences were aligned with the corresponding mtDNA sequences using ClustalX [49], and the overall genetic divergence was inferred using the Kimura two-parameter model, with complete deletion of gaps, in MEGA2 [50]. The date of origin of each numt was estimated by applying the equation of Li et al. [51], whereby the fraction of sequence divergence is $y = (\mu_1 + \mu_2)\,t$, where $\mu_1 = 2.5 \times 10^{-8}$ substitutions/site/year for mtDNA [52], $\mu_2 = 4.0 \times 10^{-9}$ substitutions/site/year for nuclear pseudogenes [53], and $t$ is the time elapsed.
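As an illustration of the dating step, the following Python sketch computes a Kimura two-parameter distance between two aligned sequences and converts it into an age using the two substitution rates quoted above. It is only a sketch: the analyses reported here were run in MEGA2, and the function names `k2p_distance` and `insertion_age_years` are hypothetical. Gapped or ambiguous columns are skipped, which for a pair of sequences is equivalent to complete deletion.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

MT_RATE = 2.5e-8   # substitutions/site/year for mtDNA (rate quoted in the text)
NUC_RATE = 4.0e-9  # substitutions/site/year for nuclear pseudogenes (quoted in the text)


def k2p_distance(seq1: str, seq2: str) -> float:
    """Kimura two-parameter distance between two aligned nucleotide sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    sites = transitions = transversions = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue  # drop columns with gaps or ambiguous bases
        sites += 1
        if a == b:
            continue
        pair = {a, b}
        if pair <= PURINES or pair <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    if sites == 0:
        raise ValueError("no comparable sites")
    p = transitions / sites
    q = transversions / sites
    # d = -1/2 * ln((1 - 2P - Q) * sqrt(1 - 2Q)); undefined for saturated pairs
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))


def insertion_age_years(divergence: float,
                        mt_rate: float = MT_RATE,
                        nuc_rate: float = NUC_RATE) -> float:
    """Solve divergence = (mt_rate + nuc_rate) * t for t, in years."""
    return divergence / (mt_rate + nuc_rate)
```

With these rates, a divergence of 0.0174 corresponds to about 0.6 Myr, since 0.0174 / (2.9 × 10^-8) = 600,000 years.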
Molecular phylogenetics of the mitochondrial pseudogenes
To unravel the phylogenetic relationships among numts and between numts and the homologous mtDNA sequences, we performed multiple alignments using ClustalX [49]. The selection of accurate alignment datasets was complicated because most old numts are disrupted or fragmented, which makes the retrieved sequences heterogeneous in length (sometimes covering only the first 20% of the complete gene and sometimes only the last 20%). Phylogenetic analyses were performed in MEGA2 [50] and PAUP* 4.0b2a [54] using three approaches: (i) the neighbor-joining tree-building algorithm with a Kimura two-parameter model; (ii) maximum parsimony, with a heuristic search of starting trees obtained by stepwise addition; and (iii) maximum likelihood, incorporating a gamma-corrected HKY85 model with parameters estimated from the dataset. The reliability of the nodes defined by the phylogenetic trees was assessed using 100 bootstrap replications.
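The bootstrap step can be illustrated with a short Python sketch of the column resampling that underlies it. This shows the general technique only, not the MEGA2 or PAUP* implementation; the function names `bootstrap_alignment` and `bootstrap_replicates` and the `seed` parameter are hypothetical. Each replicate would then be passed to the tree-building method, and the support for a node is the percentage of replicate trees that recover it.

```python
import random
from typing import Dict, List


def bootstrap_alignment(alignment: Dict[str, str],
                        rng: random.Random) -> Dict[str, str]:
    """Resample alignment columns with replacement (one pseudo-replicate)."""
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same aligned length")
    length = lengths.pop()
    # Draw as many column indices as the alignment has columns.
    columns = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[i] for i in columns)
            for name, seq in alignment.items()}


def bootstrap_replicates(alignment: Dict[str, str],
                         n_replicates: int = 100,
                         seed: int = 0) -> List[Dict[str, str]]:
    """Generate pseudo-replicate alignments (100 by default, as in the text)."""
    rng = random.Random(seed)
    return [bootstrap_alignment(alignment, rng) for _ in range(n_replicates)]
```

Because whole columns are resampled, every sequence in a replicate receives the same sites, so the homology between the numt and mtDNA sequences is preserved within each replicate.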
Acknowledgments
Comments made by Victor David and two anonymous referees improved a previous version of this article.
Appendix A. Supplementary data
Supplementary data associated with this article can be found, in the online version, at doi:10.1016/j.ygeno.2005.08.002.
References
[1] S. Desagher, J.C. Martinou, Mitochondria as the central control point of apoptosis, Trends Cell Biol. 10 (2000) 369–377.
[2] J.L. Blanchard, M. Lynch, Organellar genes: why do they end up in the nucleus? Trends Genet. 16 (2000) 315–320.
[3] S.D. Dyall, P.J. Johnson, Origins of hydrogenosomes and mitochondria: evolution and organelle biogenesis, Curr. Opin. Microbiol. 3 (2000) 404–411.
[4] K.L. Adams, Y.L. Qiu, M. Stoutemyer, J.D. Palmer, Punctuated evolution of mitochondrial gene content: high and variable rates of mitochondrial gene loss and transfer to the nucleus during angiosperm evolution, Proc. Natl. Acad. Sci. USA 99 (2002) 9905–9912.
[5] S. Funes, et al., The typically mitochondrial DNA-encoded ATP6 subunit of the F1F0-ATPase is encoded by a nuclear gene in Chlamydomonas reinhardtii, J. Biol. Chem. 277 (2002) 6051–6058.
[6] D.R. Wolstenholme, Genetic novelties in mitochondrial genomes of multicellular animals, Curr. Opin. Genet. Dev. 2 (1992) 918–925.
[7] J.V. Lopez, N. Yuhki, R. Masuda, W. Modi, S.J. O’Brien, Numt, a recent transfer and tandem amplification of mitochondrial DNA to the nuclear genome of the domestic cat, J. Mol. Evol. 39 (1994) 174–190.
[8] T. Mourier, A.J. Hansen, E. Willerslev, P. Arctander, The human genome project reveals a continuous transfer of large mitochondrial fragments to the nucleus, Mol. Biol. Evol. 18 (2001) 1833–1837.
[9] Y. Tourmen, et al., Structure and chromosomal distribution of human mitochondrial pseudogenes, Genomics 80 (2002) 71–77.
[10] M. Woischnik, C.T. Moraes, Pattern of organization of human mitochondrial pseudogenes in the nuclear genome, Genome Res. 12 (2002) 885–893.
[11] D. Mishmar, E. Ruiz-Pesini, M. Brandon, D.C. Wallace, Mitochondrial DNA-like sequences in the nucleus (NUMTs): insights into our African origins and the mechanism of foreign DNA integration, Hum. Mutat. 23 (2004) 125–133.
[12] D. Bensasson, M.W. Feldman, D.A. Petrov, Rates of DNA duplication and mitochondrial DNA insertion in the human genome, J. Mol. Evol. 57 (2003) 343–354.
[13] E. Hazkani-Covo, R. Sorek, D. Graur, Evolutionary dynamics of large numts in the human genome: rarity of independent insertions and abundance of post-insertion duplications, J. Mol. Evol. 56 (2003) 169–174.
[14] D. Bensasson, D.X. Zhang, D.L. Hartl, G.M. Hewitt, Mitochondrial pseudogenes: evolution’s misplaced witnesses, Trends Ecol. Evol. 16 (2001) 314–321.
[15] S.L. Pereira, A.J. Baker, Low number of mitochondrial pseudogenes in the chicken (Gallus gallus) nuclear genome: implications for molecular inference of population history and phylogenetics, BMC Evol. Biol. 4 (2004) 17.
[16] E. Richly, D. Leister, NUMTs in sequenced eukaryotic genomes, Mol. Biol. Evol. 21 (2004) 1081–1084.
[17] J.S. Nelson, Fishes of the World, Wiley, New York, 1994.
[18] J.M. Waters, G.P. Wallis, Cladogenesis and loss of the marine life-history phase in freshwater galaxiid fishes (Osmeriformes: Galaxiidae), Evolution 55 (2001) 587–597.
[19] O. Jaillon, et al., Genome duplication in the teleost fish Tetraodon nigroviridis reveals the early vertebrate proto-karyotype, Nature 431 (2004) 946–957.
[20] O. Thalmann, J. Hebler, H.N. Poinar, S. Paabo, L. Vigilant, Unreliable mtDNA data due to nuclear insertions: a cautionary tale from analysis of humans and other great apes, Mol. Ecol. 13 (2004) 321–335.
[21] O. Thalmann, D. Serre, M. Hofreiter, D. Lukas, J. Eriksson, L. Vigilant, Nuclear insertions help and hinder inference of the evolutionary history of gorilla mtDNA, Mol. Ecol. 14 (2005) 179–188.
[22] E.L. Brainerd, S.S. Slutz, E.K. Hall, R.W. Phillis, Patterns of genome size evolution in tetraodontiform fishes, Evolution 55 (2001) 2363–2368.
[23] D.E. Neafsey, S.R. Palumbi, Genome size evolution in pufferfish: a comparative analysis of diodontid and tetraodontid pufferfish genomes, Genome Res. 13 (2003) 821–830.
[24] M.S. Clark, P. Pontarotti, A. Gilles, A. Kelly, G. Elgar, Identification and characterization of a beta proteasome subunit cluster in the Japanese pufferfish (Fugu rubripes), J. Immunol. 165 (2000) 4446–4452.
[25] S. Aparicio, et al., Whole-genome shotgun assembly and analysis of the genome of Fugu rubripes, Science 297 (2002) 1301–1310.
[26] C. Dasilva, et al., Remarkable compartmentalization of transposable elements and pseudogenes in the heterochromatin of the Tetraodon nigroviridis genome, Proc. Natl. Acad. Sci. USA 99 (2002) 13636–13641.
[27] M. Ricchetti, F. Tekaia, B. Dujon, Continued colonization of the human genome by mitochondrial DNA, PLoS Biol. 2 (2004) 1313–1324.
[28] R.D. Ward, M. Woodward, D.O.F. Skibinski, A comparison of genetic diversity levels in marine, freshwater and anadromous fishes, J. Fish Biol. 44 (1994) 213–232.
[29] J.A. DeWoody, J.C. Avise, Microsatellite variation in marine, freshwater, and anadromous fishes compared with other animals, J. Fish Biol. 56 (2000) 461–473.
[30] G. Deceliere, S. Charles, C. Biémont, The dynamics of transposable elements in structured populations, Genetics 169 (2005) 467–474.
[31] J. Pons, A.P. Vogler, Complex pattern of coalescence and fast evolution of a mitochondrial rRNA pseudogene in a recent radiation of tiger beetles, Mol. Biol. Evol. (2005) 991–1000.
[32] J.V. Lopez, S. Cevario, S.J. O’Brien, Complete nucleotide sequences of the domestic cat (Felis catus) mitochondrial genome and a transposed mtDNA tandem repeat (Numt) in the nuclear genome, Genomics 33 (1996) 229–246.
[33] J.V. Lopez, M. Culver, J.C. Stephens, W.E. Johnson, S.J. O’Brien, Rates of nuclear and cytoplasmic mitochondrial DNA sequence divergence in mammals, Mol. Biol. Evol. 14 (1997) 277–286.
[34] R.E. Broughton, J.E. Milam, B.A. Roe, The complete sequence of the zebrafish (Danio rerio) mitochondrial genome and evolutionary patterns in vertebrate mitochondrial DNA, Genome Res. 11 (2001) 1958–1967.
[35] R.F. Yeh, L.P. Lim, C. Burge, Computational inference of homologous gene structures in the human genome, Genome Res. 11 (2001) 803–816.
[36] P.M. Harrison, et al., Molecular fossils in the human genome: identification and analysis of processed and non-processed pseudogenes in chromosomes 21 and 22, Genome Res. 12 (2002) 272–280.
[37] M.S. Clark, et al., Fugu ESTs: new resources for transcription analysis and genome annotation, Genome Res. 13 (2003) 2747–2753.
[38] M. Ricchetti, C. Fairhead, B. Dujon, Mitochondrial DNA repairs double-strand breaks in yeast chromosomes, Nature 402 (1999) 96–100.
[39] Z. Zhang, P.M. Harrison, Y. Liu, M. Gerstein, Millions of years of evolution preserved: a comprehensive catalog of the processed pseudogenes in the human genome, Genome Res. 13 (2003) 2541–2558.
[40] C. Biémont, A. Tsitrone, C. Vieira, C. Hoogland, Transposable element distribution in Drosophila, Genetics 147 (1997) 1997–1999.
[41] K. Borensztajn, et al., Characterization of two novel splice site mutations in human factor VII gene causing severe plasma factor VII deficiency and bleeding diathesis, J. Haematol 117 (2002) 168–171.
[42] C. Turner, et al., Human genetic disease caused by de novo mitochondrial–nuclear DNA transfer, Hum. Genet. 112 (2003) 303–309.
[43] N.T. Perna, T.D. Kocher, Mitochondrial DNA: molecular fossils in the nucleus, Curr. Biol. 6 (1996) 128–129.
[44] B. Venkatesh, Evolution and diversity of fish genomes, Curr. Opin. Genet. Dev. 13 (2003) 588–592.
[45] S.F. Altschul, W. Gish, W. Miller, E.W. Myers, D.J. Lipman, Basic local alignment search tool, J. Mol. Biol. 215 (1990) 403–410.
[46] S.F. Altschul, et al., Gapped BLAST and PSI-BLAST: a new generation of protein database search programs, Nucleic Acids Res. 25 (1997) 3389–3402.
[47] B. Rost, Twilight zone of protein sequence alignments, Protein Eng. 12 (1999) 85–94.
[48] S. Henikoff, J.G. Henikoff, Amino acid substitution matrices from protein blocks, Proc. Natl. Acad. Sci. USA 89 (1992) 10915–10919.
[49] J.D. Thompson, T.J. Gibson, F. Plewniak, F. Jeanmougin, D.G. Higgins, The CLUSTAL-X Windows interface: flexible strategies for multiple sequence alignment aided by quality analysis tools, Nucleic Acids Res. 25 (1997) 4876–4882.
[50] S. Kumar, K. Tamura, I.B. Jakobsen, M. Nei, MEGA2: molecular evolutionary genetics analysis software, Bioinformatics 17 (2001) 1244–1245.
[51] W.H. Li, T. Gojobori, M. Nei, Pseudogenes as a paradigm of neutral evolution, Nature 292 (1981) 237–239.
[52] M. Hasegawa, H. Kishino, T. Yano, Dating of the human–ape splitting by a molecular clock of mitochondrial DNA, J. Mol. Evol. 22 (1985) 160–174.
[53] W.H. Li, Molecular Evolution, Sinauer, Sunderland, MA, 1997.
[54] D.L. Swofford, ‘PAUP* Phylogenetic Analysis Using Parsimony and Other Methods’ computer program, Sinauer, Sunderland, MA, 2001.
[55] C. Elmerot, U. Arnason, T. Gojobori, A. Janke, The mitochondrial genome of the pufferfish, Fugu rubripes, and ordinal teleostean relationships, Gene 295 (2002) 163–172.