The genomic context of retrocopies increases their chance of functional relevancy in mammals


Journal Pre-proof

João Paulo Machado, Agostinho Antunes

PII: S0888-7543(18)30577-9

DOI: https://doi.org/10.1016/j.ygeno.2020.01.013

Reference: YGENO 9451

To appear in: Genomics

Received date: 12 October 2018

Revised date: 3 January 2020

Accepted date: 21 January 2020

Please cite this article as: J. Machado and A. Antunes, The genomic context of retrocopies increases their chance of functional relevancy in mammals, Genomics (2019), https://doi.org/10.1016/j.ygeno.2020.01.013

This is a PDF file of an article that has undergone enhancements after acceptance, such as the addition of a cover page and metadata, and formatting for readability, but it is not yet the definitive version of record. This version will undergo additional copyediting, typesetting and review before it is published in its final form, but we are providing this version to give early visibility of the article. Please note that, during the production process, errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

© 2019 Published by Elsevier.


João Paulo Machado 1,2 and Agostinho Antunes 2,3,#

1 CIMAR/CIIMAR, Interdisciplinary Centre of Marine and Environmental Research, University of Porto, Terminal de Cruzeiros do Porto de Leixões, Av. General Norton de Matos, s/n, 4450–208 Porto, Portugal.

2 Abel Salazar Biomedical Sciences Institute (ICBAS), University of Porto, Porto, Portugal.

3 Department of Biology, Faculty of Sciences, University of Porto, 4169 007 Porto, Portugal.

# Corresponding author: Agostinho Antunes

Full postal address

Phone: (+351) – 220402742

Email: [email protected]

Keywords: Adaptive, Functional relevancy, Genomics, Mammals, Retrocopies

Abstract

Described as “junk” DNA, pseudogenes are dead structures of previously active genes present in genomes. Pseudogenes are categorized into two main classes: processed pseudogenes, formed through retrotransposition, and non-processed pseudogenes, which typically originate from gene decay following duplication events. The term “processed pseudogene” has been replaced by “retrocopy”, since these sequences are likely to evolve new functional roles and become retrogenes. Here, we surveyed 38,080 retrocopies from the chimpanzee, dog, human, mouse, and rat genomes to assess their potential adaptive value. Retrocopies inserted in the same chromosome as the parental gene have a higher chance of remaining potentially “active” (absence of premature stop codons and frameshifts) (~26.1%), while those placed in a different chromosome have a more than twofold lower chance of remaining potentially “active” (~7.52%). The genomic context of their placement seems to be associated with their expression. Retrocopies placed in intragenic regions and in the same sense as the “host” gene have a higher chance of being expressed than those in other genomic contexts. The proximity of retrocopies to their parental gene is associated with a lower decay rate, and their location likely influences their expression. Thus, despite their unclear role, retrocopies are probably involved in adaptive processes. Our results provide evidence of natural selection acting on retrocopies.

1. Introduction

In 1977, a structure termed “pseudogene” was reported for the first time in the Xenopus laevis oocyte-type 5S RNA [1]. Pseudogenes are present from bacteria to vertebrates [2] and are considered non-functional sequences of genomic DNA that originated from functional genes [3]. Two main processes may give rise to pseudogenes from functional copies: (1) retro-transposition, leading to the lack of a promoter ("dead on arrival"), and (2) decay of functional genes (frameshifts and/or premature stop codons), mostly in duplicated copies but also in non-duplicated copies [4]. Therefore, pseudogenes are often divided into two broad classes: processed pseudogenes, corresponding to those retro-transposed back into a genome via an RNA intermediate; and non-processed pseudogenes, which are genomic remains of duplicated genes or residues of dead genes, also known as unitary pseudogenes [5, 6]. Described as non-functional genomic fragments, and commonly referred to as “defunct” genes, they are assumed to evolve under neutrality [7]. However, their presence might be functionally relevant, as supported by either transcription data or a high degree of conservation [8, 9].

Retrotransposed genes were routinely labelled “processed pseudogenes”, but recently the term “retrocopies” has been adopted instead to reflect their functional roles [10]. Therefore, sequences reverse-transcribed through an RNA intermediate and inserted in the genome as “new” DNA are now designated retrocopies (originating from retroduplication) [11]. Gene duplication is a pivotal mechanism for generating evolutionary novelty [11, 12], along with the duplication of retrocopies through segmental duplication [11]. These two mechanisms differ in the type of copies generated: tandem duplications usually produce copies that inherit the genetic features of the original gene (e.g. introns and promoter), whereas retrocopies lack introns and promoters [11]. Because retrocopies lack a promoter, their expression depends on the recruitment of regulatory elements, such as the transcriptional mechanisms of the host genes that surround their insertion site [11, 13]. In contrast to segmental duplicates, transcribed retrocopies are more likely to evolve “new expression patterns” and therefore “new functions” [11]. The high transcriptional activity allows a substantial number of functional retrocopies to emerge as retrogenes, highlighting the role of natural selection in the retrocopy “repository” [12].

The genomic abundance of retrocopies depends on the rate of gene duplication/loss [14]. Mammals have a high number of retrocopies, which varies with the species and the annotation pipeline [15, 16]. Two recent works suggested that 1,286 out of 7,849 retrocopies (~16.4%) [10] and 615 out of 4,927 retrocopies (~12.5%) [17] are expressed in the human genome. Contrary to the theoretical expectation that any gene can form a pseudogene, the relative proportion has been reported to differ from gene to gene [18]. Housekeeping genes, genes highly expressed in germ-line cells, and those participating in basic metabolic regulation have multiple corresponding pseudogenes [16, 19-21]. This may be due to the high expression levels of these genes, which increase their likelihood of accumulating mutations [22] or of being retro-transposed into the genome [23]. Moreover, the GC content of the genomic regions where the pseudogenes are placed also affects their mutation rate [24].

Although retrocopies were initially regarded as non-functional units, here we provide support that non-random processes shape the retrocopy “repository”. Their destination and their proximity to the parental gene influence the occurrence of frameshifts and stop codons. Furthermore, the genomic context of retrocopies acts as a major determinant of their expression. Here we present evidence supporting the adaptive role of retrocopies in mammalian genomes.

2. Methods

2.1. Retrocopies retrieval and identification

The retrocopies were retrieved from the pseudogene.org database for Canis familiaris (dog, build 50), Homo sapiens (human, build 83), Mus musculus (mouse, build 84), Pan troglodytes (chimpanzee, build 50), and Rattus norvegicus (rat, build 74). From this database, we retrieved the pseudogenes annotated as processed pseudogenes (retrocopies), after excluding those categorized as “ambiguous” and “duplicated”. The data retrieved from pseudogene.org encompass C. familiaris (6,001) (Table S1), H. sapiens (8,074) (Table S2), M. musculus (9,809) (Table S3), P. troglodytes (7,097) (Table S4) and R. norvegicus (7,099) (Table S5). These 38,080 pseudogenes were reduced to 35,277 after discarding those associated with mitochondrial DNA or with an inconclusive identity of the parental gene coordinates (chromosome). The parental gene coordinates were obtained through the corresponding Ensembl protein ID using Ensembl BioMart. The parental genes were retrieved for the Ensembl releases: C. familiaris (60), H. sapiens (94), M. musculus (95), P. troglodytes (54), and R. norvegicus (67). Retrocopies showing signs of loss of protein-coding ability (premature stop codons and frameshifts) were considered “inactive”, while those still showing protein-coding ability were considered “active” [25].

2.2. Expression of retrocopies

The expression data for human retrocopies were retrieved from RetrogeneDB [17] and RCPedia [10]. From RetrogeneDB, which is based on the Ensembl annotation, we retrieved information about expression and the ORF (Open Reading Frame) (188 retrocopies). From RCPedia, we retrieved all 7,849 retrocopies and their expression relative to their genomic context (intragenic or intergenic). Both RetrogeneDB and RCPedia use multi-step pipelines [10, 17]. To evaluate retrocopy expression, the former uses short-read libraries from various tissues, taken from selected records of the NCBI SRA database [17], whereas the latter uses RNA-seq data from six tissues (brain, cerebellum, heart, liver, kidney and testis) [10, 26].

2.3. Insertion simulation

For the retrocopies inserted in the same chromosome as the parental gene, we conducted a simulation study to test whether the “retrocopy-parental gene” distance was random or biased. To accomplish this test, we followed two steps: 1) the positions of the parental genes and retrocopies were collected; 2) for each chromosome, the positions of the retrocopies were shuffled within the list of retrocopy positions. We then measured the distance between each parental gene and its retrocopy. Both analyses were repeated three times to inspect the consistency of the obtained data. While the majority of the retrocopies arose through retro-transposition events, some arose from the duplication of previously retro-transposed genes, and these events lead to duplicated retrocopies. To avoid possible bias introduced by segmental duplication regions, which are highly rich in retrocopies [27], the same distance analyses were performed after discarding the retrocopies inserted in these regions. Data on the retrocopies and segmental duplications were retrieved from the Segmental Duplication Database [28]. Additionally, to avoid potential bias due to duplications of the retrocopy-parental gene pairs, we filtered the retrocopies inserted in the same chromosome. The filter was created by BLASTing the 5 kbp flanking the retrocopy location, with a minimum alignment of 2.5 kbp and within 10 kbp.
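The shuffling step described above can be sketched as follows. This is a minimal illustration in Python, not the authors' script: the function names, the toy coordinates, and the use of a plain permutation of retrocopy positions within a single chromosome are assumptions made for the example.

```python
import random
from statistics import median


def pair_distances(parent_pos, retro_pos):
    """Absolute distance (bp) between each parental gene and its retrocopy."""
    return [abs(p - r) for p, r in zip(parent_pos, retro_pos)]


def shuffled_distances(parent_pos, retro_pos, rng):
    """Permute the retrocopy positions among the pairs of one chromosome,
    then recompute the parental gene-retrocopy distances."""
    permuted = list(retro_pos)
    rng.shuffle(permuted)
    return pair_distances(parent_pos, permuted)


# Hypothetical coordinates (bp) for five pairs on one chromosome.
parents = [1_000_000, 25_000_000, 60_000_000, 90_000_000, 140_000_000]
retros = [1_400_000, 24_100_000, 63_000_000, 91_500_000, 139_000_000]

rng = random.Random(42)  # fixed seed so the sketch is repeatable
observed = pair_distances(parents, retros)
# Three shuffled replicates, mirroring the three repetitions in the text.
replicates = [shuffled_distances(parents, retros, rng) for _ in range(3)]

print("observed median distance:", median(observed))
for i, rep in enumerate(replicates, 1):
    print(f"replicate {i} median distance:", median(rep))
```

The observed and shuffled distance lists could then be compared with a rank-based test such as the Mann-Whitney test, for instance with `scipy.stats.mannwhitneyu` if SciPy is available.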

To test for a biased difference in the proportions of retrocopies with stop codons or an altered ORF relative to their insertion chromosome, we performed a simulation randomly inserting these retrocopies in the five studied genomes using an “in house” script. To accomplish this, the positions of the parental genes and retrocopies were read, and each retrocopy was then randomly labelled “active” or “inactive”, up to the observed number of “active” retrocopies in each species. The proportions were then compared using Fisher’s exact test.

2.4. Gene evolution under neutrality

Pseudogenes are assumed to evolve under neutrality. If this assumption is correct, the remnant sequences of previously functional copies will “accept” all mutations. The sole difference lies in the ratio of transitions to transversions. Transitions are expected to be twice as frequent as transversions, and this is an explanatory model for sequences evolving under neutrality [29]. Here, we tested the effect of time and mutation insertion on pseudogenes evolving under an emulated Kimura model [30], with the model parameters set to a tS/tV ratio (kappa value) of 2, a value frequently observed in mammalian genomes [29]. In each step, two random mutations were inserted, with a transition being twice as likely as a transversion, up to a maximum of 400 steps (800 mutations). At each step the sequences were submitted to BLAST searches (either BLASTn or BLASTx), and the genes were considered lost when the first of the following criteria was met: 1) the expected value (E-value) for the gene of interest no longer met the 1 × 10⁻¹⁰ cut-off; or 2) an unrelated gene was present among the top-hit results. The mutation rate and the predicted time to gene loss were calculated under the assumption of 2.41 × 10⁻⁹ mutations/site/year (averaged value observed for fossil record and sequences) [31].
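The mutation-insertion procedure can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the authors' implementation: a transition probability of 2/3 is used to express "twice as likely as a transversion", mutation sites are drawn uniformly, the BLAST step is omitted (only the first premature stop codon is tracked), and the function names and the toy sequence are hypothetical.

```python
import random

TRANSITION = {"A": "G", "G": "A", "C": "T", "T": "C"}
TRANSVERSIONS = {"A": "CT", "G": "CT", "C": "AG", "T": "AG"}
STOP_CODONS = {"TAA", "TAG", "TGA"}


def mutate_once(seq, rng, p_transition=2 / 3):
    """Apply one point mutation at a random site of the list `seq` (in place)."""
    i = rng.randrange(len(seq))
    base = seq[i]
    if rng.random() < p_transition:
        seq[i] = TRANSITION[base]                 # purine<->purine or pyrimidine<->pyrimidine
    else:
        seq[i] = rng.choice(TRANSVERSIONS[base])  # purine<->pyrimidine


def has_premature_stop(seq):
    """True if an in-frame stop codon occurs before the final codon.
    Assumes the sequence length is a multiple of three."""
    codons = ["".join(seq[i:i + 3]) for i in range(0, len(seq) - 3, 3)]
    return any(codon in STOP_CODONS for codon in codons)


def steps_until_stop(cds, rng, max_steps=400, per_step=2):
    """Number of steps until the first premature stop codon, or None."""
    seq = list(cds)
    for step in range(1, max_steps + 1):
        for _ in range(per_step):
            mutate_once(seq, rng)
        if has_premature_stop(seq):
            return step
    return None


# Hypothetical 60-nt coding sequence (ATG ... TAA), for illustration only.
toy_cds = "ATG" + "GCTGAACGTCTGAAAGGT" * 3 + "TAA"
result = steps_until_stop(toy_cds, random.Random(1))
print("first premature stop codon at step:", result)
```

In the procedure described in the text, each step would additionally be followed by a BLASTn or BLASTx search of the mutated sequence; that part depends on an external database and is left out of the sketch.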

3. Results

3.1. Distribution in the Chromosomes

The retrocopies follow a random distribution across the chromosomes, given the relation between chromosome length and number of retrocopies (Table S6). This trend appears to be general in mammals, since similar correlations were observed in the five mammals studied (chimpanzee, P. troglodytes; dog, C. familiaris; human, H. sapiens; mouse, M. musculus; and rat, R. norvegicus) (Table S6). Yet the relation is nearly absent between chromosome length and the number of retrocopies originating from each chromosome (Table S6). The mouse chromosome Y was excluded from these analyses given the high number of retrocopies allocated to it (~7 times more than the average of the other chromosomes).

In the human genome, as in the other four species, we detected a non-random position, relative to the parental gene, of the retrocopies without signs of disablement (i.e. without stop codons and maintaining the ORF). We observed evidence of disablement in 92.4% of the 29,832 retrocopies inserted in a different chromosome from the parental gene. Among the retrocopies inserted in the same chromosome (5,445), disablement decreased to about 73.9%. The disablement ratio of retrocopies placed in the same chromosome is significantly lower than that of those placed in a different chromosome (Fisher’s exact test, p-value < 2.2e-16) (Table 1 and Table S7). The proportions in the five species were significantly different (Table 1). In the human genome we were able to use 7,446 retrocopies, of which 6,564 are inserted in a different chromosome and 882 in the same chromosome as the parental gene. The insertion site relative to the parental gene seems associated with disablement, since 559 (~8.51%) of the retrocopies in a different chromosome and 104 (~11.79%) of those in the same chromosome are “active”. While less evident, this disproportion between retrocopies inserted in the same or a different chromosome is statistically significant (Fisher’s exact test, p-value ~ 0.02). For the other four species the same trend was observed, with a higher proportion of “active” retrocopies among those inserted in the same chromosome as the parental gene (Table 1 and Fig. S1). Therefore, the proportion of retrocopies with stop codons and/or frameshifts is significantly higher among those relocated to a different chromosome than among those inserted in the same chromosome as the parental gene.

Of the 7,446 human retrocopies used here, 2,251 are in segmental duplication regions. After removing those regions, the same trend is observed, since ~8.69% of the retrocopies in a different chromosome from the parental gene and ~10.6% of those in the same chromosome are “active”. The difference in the proportions (Fisher’s exact test, p-value ~ 0.018) is observed even when the segmental duplication regions are discarded.

Even when retrocopies were excluded based on distance thresholds of 1 Mbp, 0.1 Mbp, 0.01 Mbp, 0.001 Mbp and 0.0001 Mbp, to take into account possible gene conversion effects, Fisher’s exact test shows the same trend (Table S8). The proportion of “active” retrocopies in the same chromosome is significantly higher than that of those placed in a different chromosome from the parental gene.
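The comparison of “active” and “inactive” counts between the two insertion categories can be illustrated with a small, self-contained two-sided Fisher’s exact test. This is a sketch under stated assumptions, not the authors’ analysis: the function names are hypothetical, the two-sided rule used here (summing all tables no more probable than the observed one) is one common convention, and the counts are the human and total rows of Table 1. The resulting p-values need not match the rounded values quoted in the text.

```python
from math import exp, lgamma


def _log_comb(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Rows: same / different chromosome; columns: "active" / "inactive".
    """
    row1, row2, col1 = a + b, c + d, a + c
    log_den = _log_comb(row1 + row2, col1)

    def log_p(x):
        # Hypergeometric log-probability of x "active" copies in row 1.
        return _log_comb(row1, x) + _log_comb(row2, col1 - x) - log_den

    lo, hi = max(0, col1 - row2), min(row1, col1)
    log_obs = log_p(a)
    # Sum the probabilities of every table no more likely than the observed one.
    total = sum(
        exp(lp)
        for lp in (log_p(x) for x in range(lo, hi + 1))
        if lp <= log_obs + 1e-7
    )
    return min(1.0, total)


# Counts taken from Table 1: (active, inactive) for same vs different chromosome.
p_human = fisher_exact_two_sided(104, 778, 559, 6005)
p_total = fisher_exact_two_sided(1127, 4318, 2281, 27551)
print("human:", p_human)
print("total:", p_total)
```

Log-gamma terms are used instead of explicit binomial coefficients so that tables with tens of thousands of retrocopies do not overflow floating-point arithmetic.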

For the retrocopies present in the five genomes, we tested a random distribution of those without stop codons and maintaining the ORF (here considered “active”). We kept the original positions (parent gene and retrocopy) but allowed retrocopies to be randomly considered “inactive” or “active”, up to the number observed as “active”, and obtained results from 100 replicates [32]. In this scenario, allowing 663 retrocopies (for the human replicates) to be without stop codons, only six replicates had a significantly higher proportion of retrocopies without stop codons and maintaining the ORF in the same chromosome than in a different chromosome. The simulations repeated for the other four species also revealed cases in which the proportion of retrocopies inserted in the same chromosome as the parental gene was significantly higher: 2 in chimpanzee, 6 in dog, 1 in mouse and 5 in rat. Overall, only 4% of the simulations revealed a significantly higher proportion of “active” retrocopies in the same chromosome; the observed pattern is therefore unlikely to be a random observation.

3.2. Insertion in the same Chromosome

The Mann-Whitney test shows that the observed positions are significantly closer to the parental gene than expected under random insertion (p-value < 0.01). This is also observed in simulations using the original positions to constrain the initial placement (p-value < 0.01) (Fig. 1A and Fig. 1B). This trend was observed for each species, suggesting that retrocopies are placed closer to the parental gene. Together, these two simulation scenarios (original positions and completely random [33]) revealed that retrocopies are closer to the parental gene than predicted by a random distribution, whether segmental duplication regions are included or excluded (Fig. 1A and Fig. 1B).

The presence of segmental duplications (SDs) may alter the measured distance between the “parental gene-retrocopy” pair, particularly for SD pairs containing both the retrocopy and the parental gene. In the human genome, 277 of the 882 retrocopies are within segmental duplications [27]. After removing the SD regions, we also obtained a significant result (Mann-Whitney-Wilcoxon test, p < 0.01); consequently, the retrocopies are closer to the parental gene than under a completely random distribution (Fig. 1A and Fig. 1B). The disablement of retrocopies follows an observable distance-based trend. The distance of the “retrocopy-parent gene” pair is significantly lower for the retrocopies considered “active” (absence of stop codons and maintenance of the ORF) than for those considered “inactive” (Mann-Whitney U test, p-value = 0. 18). Indeed, most retrocopies without premature stop codons are closer to the parental gene (Fig. 2). Around 50% of the retrocopies without stop codons and maintaining the ORF are placed within 3 Mbp (million base pairs) of the parental gene. Among those with stop codons or missense mutations, ~50% are placed within 3.6 Mbp of the parental gene (Fig. 2). Furthermore, ~71.0% of the retrocopies with in-frame stop codons or missense mutations and ~74.1% of the retrocopies without stop codons and maintaining the ORF were observed within 25 Mbp.

3.3. Processed Pseudogenes Expression

Along with the absence of stop codons, expression is taken as a sign that retrocopies act as functional elements. On the other hand, symptoms of non-functionality include frame disablement (premature stop codons) and decay or incompleteness of the pseudogene nucleotide sequence. Several transcribed pseudogenes are disabled [8], but some disabled pseudogenes have shown evidence of being functionally relevant [34]. The data from RetrogeneDB show that, of the 615 expressed retrocopies in the human genome (Fig. 3A), only 188 maintain the ORF, i.e. nearly 30.5% (Fig. 3B). Data collected from RCPedia showed an association between genomic context and expression, since 26% of the retrocopies located in intragenic regions and in the same sense as the “receiver” gene are expressed (Fig. 3C). By contrast, only 14% and 16% are expressed when placed in intergenic positions or in intragenic positions on the opposite sense, respectively (Fig. 3C). In these two cases, expression is significantly lower than for retrocopies inserted in intragenic positions in the same sense as the “host” gene (Z-score, p-values < 0.0001).

3.4. Genes evolving under neutrality

Given the detection of retrocopies under adaptation [35], it is relevant to assess how long it takes for a gene to become undetectable through BLAST searches (BLASTn or BLASTx) while evolving under complete neutrality. Using the cut-off value of 1 × 10⁻¹⁰ and assuming an evolutionary rate of 2.415 × 10⁻⁹ mutations/site/year [31], we estimated for eight genes (BTF3, CYC1, GAPDH, H3F3B, PPIA, RPL7A, SCOP and RPL39) the time needed for the BLAST searches to fall below the empirical cut-off (Fig. S3). Our data suggested nearly 229.46 ± 100.82 million years (Myr) (Table S9) for a gene to become undetectable through BLAST searches, but the insertion of the first stop codon would occur earlier, around 26.23 ± 22.95 Myr.
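The relation between elapsed time and accumulated mutations under a constant rate can be made explicit with a short worked example. It is only a sketch: it assumes a strictly linear accumulation of mutations at the quoted rate, and the function names and the 1,000-nt sequence length are hypothetical.

```python
RATE = 2.415e-9  # mutations per site per year, as quoted in the text [31]


def expected_mutations_per_site(myr, rate=RATE):
    """Expected mutations per site after `myr` million years."""
    return rate * myr * 1e6


def myr_for_mutations(n_mutations, seq_length, rate=RATE):
    """Million years needed to accumulate `n_mutations` on `seq_length` sites."""
    return n_mutations / (seq_length * rate) / 1e6


# Mean times reported above: first stop codon and loss of BLAST detectability.
print(round(expected_mutations_per_site(26.23), 4))   # about 0.0633 per site
print(round(expected_mutations_per_site(229.46), 4))  # about 0.5541 per site

# Hypothetical 1,000-nt gene receiving the maximum of 800 simulated mutations.
print(round(myr_for_mutations(800, 1000), 2))         # about 331.26 Myr
```

Under these assumptions, the mean time to the first stop codon corresponds to roughly six mutations per 100 sites, and the mean time to loss of detectability to roughly one mutation for every two sites.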

4. Discussion

4.1. Retrocopies origin and their location

The random placement of human retrocopies has been previously described as a “bombardment” along evolution [15], leading to the hypothesis that retro-transposition is an efficient process for introducing regulatory elements into the genome “in search” of new target genes [36]. Consistently, this “state of war” was detected in the five analyzed mammalian species (chimpanzee, dog, human, mouse and rat). This seems to be a general trend in mammals, since larger chromosomes have a higher number of retrocopies. As previously observed in the human genome, retrocopy allocation is mainly affected by chromosome length [15]. In contrast, the results point to a random process in the origin of retrocopies, given the absence of a relation between chromosome length and the number of “parental” genes retro-copied. Several factors have been proposed to influence the likelihood of a gene forming a retrocopy, such as expression level and gene length [18], or mRNA stability [37]. Additionally, gene length has been mentioned as a major determinant of pseudogene formation, since longer genes tend to “produce” more non-processed pseudogenes, whereas retro-pseudogenes are typically formed from short protein-coding genes [38-40]. Together, this suggests that the raw material for forming “new” retrocopies is highly associated with the nature of the parental genes or their mRNA molecules. These properties have been shown to have an adaptive value, such as mRNA stability [41, 42], gene expression [43, 44] and gene length [45].

4.2. Location as factor for Retrocopies “survival”

Despite the random insertion of retrocopies, here we report an association between their placement and the frequency of premature stop codons or frameshifts (i.e. indicative of loss of function) [8]. Those placed in the same chromosome showed less evidence of disablement. Indeed, simulated scenarios showed that the association between retrocopy distribution and disablement is highly unlikely to be a random event. The same trend is detected even when excluding retrocopies within a range compatible with gene conversion and those in duplicated regions. Additionally, retrocopies inserted in the same chromosome as the parental gene tend to be placed closer to it than expected under a completely random distribution.

The survival of retrocopies requires a de novo promoter or the recruitment of one from the genomic environment around the placement site [35, 46]. A closer location seems advantageous, since some retrocopies adjacent to the parental gene tend to be co-regulated [47]. Later work showed an absence of co-regulation and, instead, a correlation between the expression of retrocopies and the parental gene, irrespective of their distance [48]. Thus, after random allocation, we hypothesize that a closer location confers advantages and increases their chances of becoming functionally relevant. Another possibility is that a closer location, as observed for the retrocopies allocated in the same chromosome, implies that they are young retrocopies. The detection of pseudogenes (retrocopies) has previously been described as skewed by age [49], as newly formed retrocopies are identified more easily than older ones [50]. Despite the different ages of the retrocopies estimated in previous studies [51], the random insertion argues against an association between age and the absence of signs of disablement. Their placement seems random, but they are selectively disabled depending on the insertion location. This is also supported by the lower average similarity between the parental gene and retrocopies inserted in a different chromosome. Alternatively, a closer location increases their chances of becoming functionally relevant through a more favorable expression context. Therefore, being closer to the parental gene would lead to a decreased number of substitutions, which would increase the detection of retrocopies and consequently enhance the frequency of pseudogenes closer to the parental gene.

4.3. Retrocopies acting as “repositories” of information

Based on the analysis of retrocopies in the human and mouse genomes, it has been estimated that “new” retrocopies form at a rate of about 1-2% per gene per million years [52]. In the human genome, gene duplications happen at a predicted rate of 0.9% per gene per million years [52]. Thus, the emergence of “new genes” through retrocopies may be even more relevant than gene duplications. Their maintenance also appears to be of prime relevance, since ~40% of the retrocopies are shared between human and mouse [53]. Under a neutral evolutionary pace, it will take ~244 Myr for a gene to become “undetectable”, while a hypothetical frameshift might occur much earlier (~26 Myr). This implies that pseudogenes, either processed or non-processed, that are detected through BLAST searches and are older than ~244 Myr underwent a period of non-neutral evolution. The expression of retrocopies, and their predicted or proven functionality, is also observed in some non-processed pseudogenes, making them interesting and important functional units [54].

The persistence of retrocopies leads to an increased probability of pseudogene resurrection [52]. This shows the great relevance of pseudogenes in the generation of evolutionary novelties, likely acting as “repositories” of innovation, even when evolving at a neutral evolutionary pace.

4.4. Genomic landscape as a contributor to retrocopies expression

The maintenance of expression patterns might be important for regulatory processes [23, 55] and probably enables the functional “resurrection” of pseudogenes, frequently described in human and mouse pseudogenes [52]. The lack of an appropriate regulatory environment often leads to the degeneration of retrocopies [56], and we found that placement of a processed pseudogene near the parental gene leads to an increased probability of retention. Concordantly, interdependence between a parental gene and its pseudogene was reported for ABCC6 and ABCC6P1, since their co-expression resulted in decreased ABCC6 expression [57]. Moreover, gene order in eukaryotes is non-random, as genes placed close together tend to be co-expressed and co-functional [58]. The allocation site seems closely associated with the chances of retrocopies being expressed, as placement in intragenic regions in the same sense as the “receiver” gene provides an adequate “expression context”. By contrast, when they are placed in intergenic regions, or in intragenic regions in the opposite sense, their chances of being expressed decrease. After the insertion of a retrocopy, the first step towards becoming a retrogene (the functional counterpart of a retrocopy) is its expression [47]. Therefore, the genomic context of retrocopy allocation influences the probability of becoming a retrogene.

Our results reveal three relevant aspects of retrocopies: 1) retrocopies inserted in the same chromosome as the parental gene have higher chances to “survive”; 2) retrocopies inserted in the same chromosome as the parental gene are closer to it than expected under a completely random process; 3) the placement of retrocopies influences their chances of expression.

4.5. Retrocopies as functional units

Searching for functional elements in genomes is of prime relevance for evolutionary biology. In recent years, several arguments have been raised regarding the functionality of retrocopies. The processes that lead to their emergence and ubiquitous presence in mammalian genomes are therefore of major relevance for understanding their function. While retrocopies seem randomly placed and dispersed in genomes, their location apparently determines their chances of acquiring functional relevancy. Retro-transposed pseudogenes closer to the parental gene (same chromosome) more than double their chances of survival (i.e. lack of perceived stop codons and maintenance of the ORF). In addition, those placed in intragenic regions, in the same strand as the ‘receiver’ gene, have higher chances of being expressed. Thus, our results highlight a mechanism of natural selection acting on retrocopies, revealing a functional relevancy in their maintenance and placing them as potential functional units. Here, we described several non-random processes occurring in retrocopies, which suggest that retrocopies are subject to natural selection and that their persistence in genomes is a non-random process.

5. Acknowledgements

We are thankful for the comments provided by the Associate Editor and two anonymous reviewers, which helped improve a previous version of this manuscript. The authors acknowledge the Portuguese Fundação para a Ciência e a Tecnologia (FCT) for financial support to JPM (SFRH/BD/65245/2009). This work was further supported by a grant from Iceland, Liechtenstein and Norway through the EEA Financial Mechanism and the Norwegian Financial Mechanism. AA was partially supported by the Strategic Funding UID/Multi/04423/2019 through national funds provided by FCT and the European Regional Development Fund (ERDF) in the framework of the programme PT2020, and by the FCT projects PTDC/AAG-GLO/6887/2014 (POCI-01-0124-FEDER-016845) and PTDC/CTA-AMB/31774/2017 (POCI-01-0145-FEDER/031774/2017).

Journal Pre-proof

Tables

Table 1 – Comparison between retrocopies inserted in the same chromosome as the parental gene and those inserted in a different chromosome: proportion of “active” retrocopies and identity compared with the parental gene. The term “active” refers to retrocopies that maintain the ORF and lack stop codons; Active (%) is the proportion of “active” retrocopies. SD – segmental duplications; w/o – without; ǂ – total excluding the row for Human restricted to non-segmental-duplicated regions.

| Species | Same chr. “Active” | Same chr. “Inactive” | Same chr. Active (%) | Same chr. Identity (%) | Same chr. Fraction (%) | Different chr. “Active” | Different chr. “Inactive” | Different chr. Active (%) | Different chr. Identity (%) | Different chr. Fraction (%) | Fisher Exact Test (p-value) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Chimpanzee | 150 | 600 | 20.00 | 72.73 | 92.15 | 527 | 5,344 | 8.98 | 76.56 | 94.80 | <2.20E-16 |
| Dog | 168 | 487 | 25.65 | 73.17 | 91.02 | 415 | 4,774 | 8.00 | 75.77 | 94.88 | <2.20E-16 |
| Human | 104 | 778 | 11.79 | 72.92 | 92.97 | 559 | 6,005 | 8.51 | 77.21 | 94.64 | 0.001 |
| Mouse | 389 | 1,591 | 19.65 | 80.02 | 95.50 | 521 | 6,716 | 7.20 | 75.27 | 95.50 | <2.20E-16 |
| Rat | 316 | 862 | 26.82 | 76.76 | 91.79 | 259 | 4,712 | 5.21 | 72.25 | 94.46 | <2.20E-16 |
| Human (w/o SD) | 58 | 547 | 10.60 | 70.63 | 92.30 | 390 | 4,489 | 8.69 | 75.73 | 94.64 | 0.104 |
| Total (ǂ) | 1,127 | 4,318 | 26.10 | 76.58 | 93.47 | 2,281 | 27,551 | 7.52 | 75.42 | 94.89 | <2.20E-16 |
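The comparison summarised in Table 1 can be illustrated with a minimal sketch of a two-sided Fisher exact test on one 2×2 table (“active” vs “inactive”, same vs different chromosome). This section does not state which software the authors used, so the pure-Python implementation and the function names `active_percent` and `fisher_exact_two_sided` are assumptions made for illustration; only the chimpanzee counts are taken from the table.

```python
from math import exp, lgamma


def log_comb(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)


def active_percent(active, inactive):
    """Percentage of retrocopies that are 'active' among active + inactive."""
    return 100.0 * active / (active + inactive)


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    log_denominator = log_comb(row1 + row2, col1)

    def log_prob(x):
        # Hypergeometric log-probability of seeing x in the top-left cell.
        return log_comb(row1, x) + log_comb(row2, col1 - x) - log_denominator

    observed = log_prob(a)
    total = 0.0
    for x in range(max(0, col1 - row2), min(row1, col1) + 1):
        lp = log_prob(x)
        # Sum every table that is no more probable than the observed one
        # (small tolerance so that exact ties are counted).
        if lp <= observed + 1e-7:
            total += exp(lp)
    return min(1.0, total)


# Chimpanzee row of Table 1: (active, inactive) per insertion context.
same_active, same_inactive = 150, 600
diff_active, diff_inactive = 527, 5344
print(f"same chromosome: {active_percent(same_active, same_inactive):.2f}% active")
print(f"different chromosome: {active_percent(diff_active, diff_inactive):.2f}% active")
p_value = fisher_exact_two_sided(same_active, same_inactive, diff_active, diff_inactive)
print(f"two-sided Fisher exact p-value: {p_value:.3g}")
```

The same call can be repeated for any other row by substituting its four counts; log-gamma arithmetic is used so that counts in the thousands do not overflow.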

Figures

Fig. 1 - Retrocopies inserted in the same chromosome: distance to parental gene and influence of segmental duplication in the human genome. Distance between retrocopies retro-transposed to the same chromosome and their parental gene. Shown are the observed distance; a random scenario, in which the retrocopies were randomly distributed in each chromosome; and the observed distance between retrocopies and parental gene after the removal of segmental duplication regions. A) Retrocopies inserted in the same chromosome in the human genome. B) After excluding segmental duplicated regions.
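The random scenario mentioned in the Fig. 1 caption can be sketched as follows, assuming that “randomly distributed” means insertion positions drawn uniformly along the chromosome. This is not the authors' script: the function names, chromosome length, parental gene position and observed distances are all hypothetical values chosen for illustration.

```python
import random
from statistics import median


def random_distances(parent_pos, chrom_length, n_retrocopies, rng):
    """Distances (bp) to the parental gene for uniformly placed retrocopies."""
    # Assumption: each insertion site is drawn uniformly from 0..chrom_length-1.
    return [abs(rng.randrange(chrom_length) - parent_pos)
            for _ in range(n_retrocopies)]


def share_random_farther(observed, parent_pos, chrom_length, rng, n_random=10_000):
    """Share of random placements lying farther away than the observed median."""
    simulated = random_distances(parent_pos, chrom_length, n_random, rng)
    observed_median = median(observed)
    return sum(d > observed_median for d in simulated) / n_random


rng = random.Random(0)  # fixed seed so the sketch is reproducible
# Hypothetical inputs: a 100 Mb chromosome with the parental gene at 40 Mb.
observed = [1_200_000, 3_500_000, 800_000, 15_000_000]
share = share_random_farther(observed, 40_000_000, 100_000_000, rng)
print(f"{share:.2f} of random placements are farther than the observed median")
```

A share close to 1 would indicate that the observed retrocopies sit nearer to the parental gene than uniform placement would predict.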

Fig. 2 - Retrocopies inserted in the same chromosome: distance to parental gene and presence of stop codons. Comparative empirical cumulative distribution of retrocopies inserted in the same chromosome. The orange line represents retrocopies containing stop codons and the blue line represents the retrocopies free of stop codons.
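An empirical cumulative distribution of the kind plotted in Fig. 2 can be computed with a short sketch like the one below. The helper name `make_ecdf` and the distance values are hypothetical and serve only to show the calculation; they are not data from the study.

```python
from bisect import bisect_right


def make_ecdf(values):
    """Return a function giving the fraction of values less than or equal to x."""
    ordered = sorted(values)
    n = len(ordered)

    def ecdf(x):
        # bisect_right counts how many sorted values are <= x.
        return bisect_right(ordered, x) / n

    return ecdf


# Hypothetical distances (bp) between retrocopies and their parental gene.
with_stop_codons = [5_000_000, 12_000_000, 30_000_000, 45_000_000]
free_of_stop_codons = [200_000, 900_000, 4_000_000, 20_000_000]

ecdf_stop = make_ecdf(with_stop_codons)
ecdf_free = make_ecdf(free_of_stop_codons)
for threshold in (1_000_000, 10_000_000, 50_000_000):
    print(threshold, ecdf_stop(threshold), ecdf_free(threshold))
```

Comparing the two functions at the same distance shows which group accumulates faster near the parental gene, which is what the two lines in the figure convey.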

Fig. 3 - Retrocopy expression and genomic context. Data retrieved from RetrogeneDB: A) Expressed retrocopies; B) Expressed retrocopies that maintain the ORF; text labels refer to parental gene names, and the links are colored according to the chromosome where the pseudogene is placed. Data retrieved from RCPedia: C) Schematic representation of the expressed retrocopies and the genomic context: 1) hypothetical “host” gene with two exons, 2) additional hypothetical gene. The ribbons indicate the orientation of the reading frame.

References

[1] C. Jacq, J.R. Miller, G.G. Brownlee, A pseudogene structure in 5S DNA of Xenopus laevis, Cell, 12 (1977) 109-120.
[2] A.J. Mighell, N.R. Smith, P.A. Robinson, A.F. Markham, Vertebrate pseudogenes, FEBS letters, 468 (2000) 109-114.
[3] E.S. Balakirev, F.J. Ayala, Pseudogenes: are they "junk" or functional DNA?, Annual review of genetics, 37 (2003) 123-151.
[4] A.N. Khachane, P.M. Harrison, Assessing the genomic evidence for conserved transcribed pseudogenes under selection, BMC genomics, 10 (2009) 435.
[5] D. Zheng, A. Frankish, R. Baertsch, P. Kapranov, A. Reymond, S.W. Choo, Y. Lu, F. Denoeud, S.E. Antonarakis, M. Snyder, Y. Ruan, C.L. Wei, T.R. Gingeras, R. Guigo, J. Harrow, M.B. Gerstein, Pseudogenes in the ENCODE regions: consensus annotation, analysis of transcription, and evolution, Genome research, 17 (2007) 839-851.
[6] Z.D. Zhang, A. Frankish, T. Hunt, J. Harrow, M. Gerstein, Identification and analysis of unitary pseudogenes: historic and contemporary gene losses in humans and other primates, Genome Biol, 11 (2010) R26.
[7] R. Martinez-Arias, E. Mateu, J. Bertranpetit, F. Calafell, Profiles of accepted mutation: from neutrality in a pseudogene to disease-causing mutation on its homologous gene, Human genetics, 109 (2001) 7-10.
[8] P.M. Harrison, D. Zheng, Z. Zhang, N. Carriero, M. Gerstein, Transcribed processed pseudogenes in the human genome: an intermediate form of expressed retrosequence lacking protein-coding ability, Nucleic acids research, 33 (2005) 2374-2383.
[9] O. Svensson, L. Arvestad, J. Lagergren, Genome-wide survey for biologically functional pseudogenes, PLoS computational biology, 2 (2006) e46.
[10] F.C. Navarro, P.A. Galante, RCPedia: a database of retrocopied genes, Bioinformatics, 29 (2013) 1235-1237.
[11] H. Kaessmann, N. Vinckenbosch, M. Long, RNA-based gene duplication: mechanistic and evolutionary insights, Nature Reviews Genetics, 10 (2009) 19-31.
[12] J. Zhang, Evolution by gene duplication: an update, Trends in ecology & evolution, 18 (2003) 292-298.
[13] N. Vinckenbosch, I. Dupanloup, H. Kaessmann, Evolutionary fate of retroposed gene copies in the human genome, Proceedings of the National Academy of Sciences of the United States of America, 103 (2006) 3220-3225.
[14] O. Podlaha, J. Zhang, Pseudogenes and their evolution, eLS, (2010).
[15] Z. Zhang, P.M. Harrison, Y. Liu, M. Gerstein, Millions of years of evolution preserved: a comprehensive catalog of the processed pseudogenes in the human genome, Genome research, 13 (2003) 2541-2558.
[16] Z. Zhang, M. Gerstein, Large-scale analysis of pseudogenes in the human genome, Current opinion in genetics & development, 14 (2004) 328-335.
[17] M. Kabza, J. Ciomborowska, I. Makalowska, RetrogeneDB - a database of animal retrogenes, Molecular biology and evolution, (2014).
[18] W. Li, W. Yang, X.J. Wang, Pseudogenes: pseudo or real functional elements?, Journal of genetics and genomics = Yi chuan xue bao, 40 (2013) 171-177.
[19] S. Frederiksen, H. Cao, B. Lomholt, G. Levan, C. Hallenberg, The rat 5S rRNA bona fide gene repeat maps to chromosome 19q12-->qter and the pseudogene repeat maps to 12q12, Cytogenetics and cell genetics, 76 (1997) 101-106.

[20] B. Pei, C. Sisu, A. Frankish, C. Howald, L. Habegger, X.J. Mu, R. Harte, S. Balasubramanian, A. Tanzer, M. Diekhans, A. Reymond, T.J. Hubbard, J. Harrow, M.B. Gerstein, The GENCODE pseudogene resource, Genome Biol, 13 (2012) R51.
[21] J. Zhang, Y.P. Zhang, Pseudogenization of the tumor-growth promoter angiogenin in a leaf-eating monkey, Gene, 308 (2003) 95-101.
[22] C. Park, W. Qian, J. Zhang, Genomic evidence for elevated mutation rates in highly expressed genes, EMBO reports, 13 (2012) 1123-1129.
[23] L. Poliseno, L. Salmena, J. Zhang, B. Carver, W.J. Haveman, P.P. Pandolfi, A coding-independent function of gene and pseudogene mRNAs regulates tumour biology, Nature, 465 (2010) 1033-1038.
[24] C.D. Bustamante, R. Nielsen, D.L. Hartl, A maximum likelihood method for analyzing pseudogene evolution: implications for silent site evolution in humans and rodents, Molecular biology and evolution, 19 (2002) 110-117.
[25] P.M. Harrison, D. Zheng, Z. Zhang, N. Carriero, M. Gerstein, Transcribed processed pseudogenes in the human genome: an intermediate form of expressed retrosequence lacking protein-coding ability, Nucleic acids research, 33 (2005) 2374-2383.
[26] D. Brawand, M. Soumillon, A. Necsulea, P. Julien, G. Csárdi, P. Harrigan, M. Weier, A. Liechti, A. Aximu-Petri, M. Kircher, The evolution of gene expression levels in mammalian organs, Nature, 478 (2011) 343-348.
[27] E. Khurana, H.Y. Lam, C. Cheng, N. Carriero, P. Cayting, M.B. Gerstein, Segmental duplications in the human genome reveal details of pseudogene formation, Nucleic acids research, 38 (2010) 6997-7007.
[28] J.A. Bailey, Z. Gu, R.A. Clark, K. Reinert, R.V. Samonte, S. Schwartz, M.D. Adams, E.W. Myers, P.W. Li, E.E. Eichler, Recent segmental duplications in the human genome, Science, 297 (2002) 1003-1007.
[29] J. Wakeley, The excess of transitions among nucleotide substitutions: new methods of estimating transition bias underscore its significance, Trends Ecol Evol, 11 (1996) 158-162.
[30] M. Kimura, A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences, Journal of molecular evolution, 16 (1980) 111-120.
[31] S. Kumar, S. Subramanian, Mutation rates in mammalian genomes, Proc Natl Acad Sci U S A, 99 (2002) 803-808.
[32] https://github.com/lege-hub/genome_simulation, in, 2019.
[33] https://github.com/lege-hub/random_retrocopies, in, 2019.
[34] J. Xu, J. Zhang, Are Human Translated Pseudogenes Functional?, Molecular biology and evolution, (2015).
[35] C. Casola, E. Betrán, The Genomic Impact of Gene Retrocopies: What Have We Learned from Comparative Genomics, Population Genomics, and Transcriptomic Analyses?, Genome Biology and Evolution, 9 (2017) 1351-1373.
[36] J. Brosius, Genomes were forged by massive bombardments with retroelements and retrosequences, Genetica, 107 (1999) 209-238.
[37] A. Pavlicek, A.J. Gentles, J. Paces, V. Paces, J. Jurka, Retroposition of processed pseudogenes: the impact of RNA stability and translational control, Trends Genet, 22 (2006) 69-73.
[38] I. Goncalves, L. Duret, D. Mouchiroud, Nature and structure of human genes that generate retropseudogenes, Genome research, 10 (2000) 672-678.

[39] Z. Zhang, P. Harrison, M. Gerstein, Identification and analysis of over 2000 ribosomal protein pseudogenes in the human genome, Genome research, 12 (2002) 1466-1482.
[40] A.N. Khachane, P.M. Harrison, Strong association between pseudogenization mechanisms and gene sequence length, Biology direct, 4 (2009) 38.
[41] C. Dressaire, F. Picard, E. Redon, P. Loubiere, I. Queinnec, L. Girbal, M. Cocaign-Bousquet, Role of mRNA stability during bacterial adaptation, PloS one, 8 (2013) e59059.
[42] K. Yamanaka, M. Inouye, Selective mRNA degradation by polynucleotide phosphorylase in cold shock adaptation in Escherichia coli, Journal of bacteriology, 183 (2001) 2808-2816.
[43] H.B. Fraser, Gene expression drives local adaptation in humans, Genome research, 23 (2013) 1089-1096.
[44] L. Lopez-Maury, S. Marguerat, J. Bahler, Tuning gene expression to changing environments: from rapid responses to evolutionary adaptation, Nature reviews. Genetics, 9 (2008) 583-593.
[45] D.J. Lipman, A. Souvorov, E.V. Koonin, A.R. Panchenko, T.A. Tatusova, The relationship of protein conservation and sequence length, BMC evolutionary biology, 2 (2002) 20.
[46] F.N. Carelli, T. Hayakawa, Y. Go, H. Imai, M. Warnefors, H. Kaessmann, The life history of retrocopies illuminates the evolution of new mammalian genes, Genome research, 26 (2016) 301-314.
[47] M.N. Cabili, C. Trapnell, L. Goff, M. Koziol, B. Tazon-Vega, A. Regev, J.L. Rinn, Integrative annotation of human large intergenic noncoding RNAs reveals global properties and specific subclasses, Genes & development, 25 (2011) 1915-1927.
[48] X. Guo, M. Lin, S. Rockowitz, H.M. Lachman, D. Zheng, Characterization of human pseudogene-derived non-coding RNAs for functional potential, PloS one, 9 (2014) e93972.
[49] A.C. Marques, I. Dupanloup, N. Vinckenbosch, A. Reymond, H. Kaessmann, Emergence of young human genes after a burst of retroposition in primates, PLoS Biol, 3 (2005) e357.
[50] C.-H. Kuo, H. Ochman, The extinction dynamics of bacterial pseudogenes, PLoS Genet, 6 (2010) e1001050.
[51] D. Pan, L. Zhang, Burst of young retrogenes and independent retrogene formation in mammals, PloS one, 4 (2009) e5040.
[52] H. Sakai, K.O. Koyanagi, T. Imanishi, T. Itoh, T. Gojobori, Frequent emergence and functional resurrection of processed pseudogenes in the human and mouse genomes, Gene, 389 (2007) 196-203.
[53] Z. Zhang, N. Carriero, M. Gerstein, Comparative analysis of processed pseudogenes in the mouse and human genomes, Trends Genet, 20 (2004) 62-67.
[54] R.M. Branca, L.M. Orre, H.J. Johansson, V. Granholm, M. Huss, A. Perez-Bercoff, J. Forshed, L. Kall, J. Lehtio, HiRIEF LC-MS enables deep proteome coverage and unbiased proteogenomics, Nature methods, 11 (2014) 59-62.
[55] O.H. Tam, A.A. Aravin, P. Stein, A. Girard, E.P. Murchison, S. Cheloufi, E. Hodges, M. Anger, R. Sachidanandam, R.M. Schultz, G.J. Hannon, Pseudogene-derived small interfering RNAs regulate gene expression in mouse oocytes, Nature, 453 (2008) 534-538.
[56] I. D'Errico, G. Gadaleta, C. Saccone, Pseudogenes in metazoa: origin and features, Briefings in functional genomics & proteomics, 3 (2004) 157-167.

[57] A.P. Piehler, M. Hellum, J.J. Wenzel, E. Kaminski, K.B. Haug, P. Kierulf, W.E. Kaminski, The human ABC transporter pseudogene family: Evidence for transcription and gene-pseudogene interference, BMC genomics, 9 (2008) 165.
[58] P. Michalak, Coexpression, coregulation, and cofunctionality of neighboring genes in eukaryotic genomes, Genomics, 91 (2008) 243-248.

Supplemental Material

Fig. S1. Processed pseudogenes inserted in the same chromosome: distance to parental gene. Distance between retrocopies retro-transposed to the same chromosome and the parental gene. a) Chimpanzee, b) Dog, c) Mouse and d) Rat.

Fig. S2. BLAST results from the simulation of 8 genes evolving under neutrality. In each subfigure, the left panel shows the results from BLASTn and the right panel those from BLASTx for the following genes: a) BTF3, b) CYC1, c) H3F3B, d) GAPDH, e) PPIA, f) RPL7A, g) RPL39 and h) SCOP. The y-axis is log-scaled. The x-axis (n) represents the sequential number of the BLAST run, and each step (n) corresponds to two mutations. The reference line corresponds to the empirical threshold of 1E-10; values above this threshold were discarded or partially discarded and are not represented in the plots.

Table S1. Retrocopies in dog. Data retrieved from pseudogenes.org, and the data for the parental gene location. #N/D - No data (excluded from the analysis).

Table S2. Retrocopies in human. Data retrieved from pseudogenes.org, and the data for the parental gene location. #N/D - No data (excluded from the analysis). SDD - Segmental duplicated region.

Table S3. Retrocopies in mouse. Data retrieved from pseudogenes.org, and the data for the parental gene location. #N/D - No data (excluded from the analysis).

Table S4. Retrocopies in chimpanzee. Data retrieved from pseudogenes.org, and the data for the parental gene location. #N/D - No data (excluded from the analysis).

Table S5. Retrocopies in rat. Data retrieved from pseudogenes.org, and the data for the parental gene location. #N/D - No data (excluded from the analysis).

Table S6. Individual correlation between the retrocopies location and parental gene origin. (*) significant correlation.

Table S8. Retrocopies “active” and “inactive” excluded based on gene conversion range.

Table S9. Estimated age of processed pseudogene misidentification after evolving under neutrality.

Authors’ contributions

JPM performed the phylogenetic, evolutionary and bioinformatic analyses and drafted the manuscript. AA participated in the design, genetic analyses, drafting and coordination of the study. All authors read and approved the final manuscript.

Highlights:

The genomic context of retrocopies increases their chance of functional relevancy in mammals

Retrocopies inserted in the same chromosome as the parental gene have higher chances of remaining potentially “active”, while those placed in a different chromosome have a twofold lower chance of remaining potentially “active”.

Retrocopies placed in intragenic regions and in the same sense as the “host” gene have higher chances of being expressed relative to other genomic contexts.

The proximity of retrocopies to their parental gene is possibly associated with a lower decay rate, and their location likely influences their expression.

Retrocopies are probably involved in adaptive processes.

Figure 1

Figure 2

Figure 3