Influence of the transposable element neighborhood on human gene expression in normal and tumor tissues


Gene 396 (2007) 303 – 311 www.elsevier.com/locate/gene

Influence of the transposable element neighborhood on human gene expression in normal and tumor tissues Emmanuelle Lerat ⁎, Marie Sémon Université de Lyon, Université Lyon 1, CNRS, UMR 5558, Laboratoire de Biométrie et Biologie Evolutive, 43 boulevard du 11 novembre 1918, Villeurbanne F-69622, France Received 1 December 2006; received in revised form 16 March 2007; accepted 2 April 2007 Available online 6 April 2007 Received by I. King Jordan

Abstract Transposable elements (TEs) are genomic sequences able to replicate themselves, and to move from one chromosomal position to another within the genome. Many TEs contain their own regulatory regions, which means that they may influence the expression of neighboring genes. TEs may also be activated and transcribed in various cancers. We therefore tested whether gene expression in normal and tumor tissues is influenced by the neighboring TEs. To do this, we associated all human genes to the nearest TEs. We analyzed the expression of these genes in normal and tumor tissues using SAGE and EST data, and related this to the presence and type of TEs in their vicinity. We confirmed that TEs tend to be located in antisense orientation relative to their hosting genes. We found that the average number of tissues where a gene is expressed varies depending on the type of TEs located near the gene, and that the difference in expression level between normal and tumor tissues is greatest for genes that host SINE elements. This deregulation increases with the number of SINE copies in the gene vicinity. This suggests that SINE elements might contribute to the cascade of gene deregulation in cancer cells. © 2007 Elsevier B.V. All rights reserved. Keywords: Transposable elements; Cancer; Gene expression

Abbreviations: TE, Transposable Element; ORF, Open Reading Frame; LINE, Long Interspersed Nuclear Element; SINE, Short Interspersed Nuclear Element; LTR, Long Terminal Repeat; EST, Expressed Sequence Tag; SAGE, Serial Analysis of Gene Expression; ERV, endogenous retrovirus.

⁎ Corresponding author. Laboratoire Biométrie et Biologie Evolutive, Université Claude Bernard, Lyon 1, UMR–CNRS 5558, Bat. Mendel, 69622 Villeurbanne cedex, France. Tel.: +33 4 72 43 29 18; fax: +33 4 72 43 13 88. E-mail address: [email protected] (E. Lerat).

0378-1119/$ - see front matter © 2007 Elsevier B.V. All rights reserved. doi:10.1016/j.gene.2007.04.002

1. Introduction

Transposable elements (TEs) have been identified in the genomes of almost all living organisms. They are distinguished by their ability to move from one chromosomal position to another within the genome, and to replicate themselves. Mammalian genomes are particularly TE-rich, and a high proportion of their sequence is derived from TEs (45% of the human genome (International Human Genome Sequencing Consortium, 2001) and 38.5% of the mouse genome (Mouse Genome Sequencing Consortium, 2002)). Another large fraction of mammalian genomes is probably also TE-derived, but has diverged beyond recognition (Medstrand et al., 2005).

TEs are divided into two main classes on the basis of their transposition intermediate (Finnegan, 1992). The first class includes the retrotransposons that use an RNA intermediate, and move by a “copy and paste” mechanism. Within this class, two subclasses have been identified, depending on the presence or absence of a Long Terminal Repeat (LTR) at the extremities of the elements. Non-LTR retrotransposons include two categories of elements: the Long Interspersed Nuclear Elements (LINEs) and the Short Interspersed Nuclear Elements (SINEs), the latter being dependent upon the former to transpose. The second class of TEs consists of the transposons that use a DNA intermediate and move via a “cut and paste” mechanism. With the exception of SINEs, active TEs all contain Open Reading Frames (ORFs) that encode all the proteins required for their transposition. All TEs are dependent on internal regulatory regions. In LTR retrotransposons, the promoter is located within the LTR, whereas in LINEs, it is located in the 5′ UnTranslated Region


E. Lerat, M. Sémon / Gene 396 (2007) 303–311

(UTR). Interestingly, one particular human LINE (L1) has a second promoter in the antisense orientation in the +400 to +600 bp region of the 5′ UTR (Nigumann et al., 2002). LTR and LINE elements both use an RNA polymerase II, whereas SINEs have an internal type-III polymerase promoter.

TEs contain internal regulatory regions, and this allows them to influence the transcription of the genes near where they are inserted (Britten and Davidson, 1969). In the mouse genome, some LTR retrotransposons have been reported to act as alternative promoters. They provide exonic regions for a subset of genes expressed in oocytes and preimplantation embryos, and permit synchronous developmental expression of these genes (Peaston et al., 2004). The non-LTR retrotransposon L1 can undergo somatic transposition in mouse brain, which alters the levels of gene expression in vitro, potentially influencing neural cell fate and generating neuronal diversity (Muotri et al., 2005). Other examples have shown that ancient insertions have contributed to gene regulation (reviewed in Britten, 1996). For instance, the androgen dependence of a mouse sex-linked gene results from the insertion of a retrotransposon in the 5′ region (Stavenhagen and Robins, 1988). On the genomic scale, various analyses have demonstrated that a high proportion of current promoters is derived from TE sequences, suggesting that TEs may commonly affect the evolution of human gene regulation. For example, one quarter of the experimentally identified promoters in mammals contain TE-derived sequences (Jordan et al., 2003; Van de Lagemaat et al., 2003). Moreover, some TE insertions have been shown to modify the tissue-specific expression of genes. This is the case for an estrogen biosynthesis gene, CYP19, the placenta-specific transcription of which is driven by an alternative promoter derived from an LTR (Van de Lagemaat et al., 2003).

TEs are a powerful cause of point mutations and can induce diseases. For example, more than 30 genetic diseases and 16 cancers have been attributed to homologous recombination between SINEs (Deininger and Batzer, 1999). Recombination involving LINE elements has also been observed to lead to the formation of tumors in the esophagus and in the female reproductive tract (Segal et al., 1999). The insertion of a retrovirus at particular locations in the genome can also activate oncogenes (Barklis et al., 1986). For instance, the avian leukosis virus, which induces tumors in chickens, is inserted near the c-myc oncogene (Hayward et al., 1981). This oncogene has also been altered by an intronic insertion of a LINE element in a breast cancer (Morse et al., 1988), by upstream insertions of a LINE in tumor development in dogs (Katzir et al., 1987), and by recombination with a LINE in rat immunocytomas (Pear et al., 1988). Several carcinogenic factors, such as benzo(a)pyrene (Stribinskis and Ramos, 2006) or gamma radiation (Farkash et al., 2006), have also been shown to activate TEs. All these findings imply that the epigenetic activation of TEs is able to cause point mutations and genomic instability, thus leading to cancer.

Some studies have shown that TEs can be activated under tumor conditions, although no direct relationship has been demonstrated between the disease and the element. For example, some specific endogenous retroviruses (ERVs)

produce viral particles and exhibit reverse transcriptase activity in human melanoma cells (Muster et al., 2003); TE expression is enhanced in urothelial and renal carcinoma cells (Florl et al., 1999), in human leukemia (Patzke et al., 2002; Depil et al., 2002), and in colorectal (Debniak et al., 2001) and human breast cancers (Wang-Johanning et al., 2003). Hypomethylation of LINEs and HERV-W retrotransposons, which is associated with high level of gene expression (Dante et al., 1992), has been observed in various cancers (Florl et al., 1999; Menendez et al., 2004; Ehrlich, 2003). It has been suggested that hypomethylation of TEs may promote genomic instability, and thus facilitate tumor progression (Xu and Deng, 2002). Because TEs may be activated under certain conditions and may influence gene expression, they could be responsible for particular gene expression patterns in tumor tissues. Analyses of gene expression profiles in normal and tumor tissues have indeed revealed differential expression of particular genes (Schaner et al., 2003; Bucca et al., 2004; Taniwaki et al., 2006). For example, some genes are highly expressed in epithelial ovarian cancer, whereas under normal conditions they are generally almost silent (Welsh et al., 2001). These findings indicate that changes in gene expression occur during tumor development, but the direct cause of this deregulation has typically not been identified. In this paper, we use the complete sequence of the human genome to detect TEs in the neighborhood of genes. We analyze the expression of human genes in 19 pairs of normal and tumor tissues. We check whether the presence of a TE is associated with a change in gene expression between normal and tumor tissues, which could indicate that TEs play a role in gene deregulation in cancer. We show that the presence of TEs is correlated to the average number of normal tissues in which a gene is expressed. 
We find that genes associated with SINE elements are particularly deregulated under tumor conditions, and that this deregulation increases with the number of SINE copies.

2. Materials and methods

2.1. Expression and genome data

We obtained six million ESTs from human tissues from GenBank (release October 2004 (Benson et al., 2004)). To allow us to assess the tissue origin of each EST with sufficient accuracy, we excluded cDNA libraries based on cell cultures, pooled organs, or unidentified tissues. After pooling all ESTs from libraries corresponding to the same tissue type, we retained only tissues that had been sampled with at least 10,000 ESTs to limit stochastic variations in expression measures. A total of 19 of these tissues were represented in both normal and tumor states and were thus included in our analysis. We downloaded transcribed sequences from Ensembl and retained one randomly chosen transcript per gene (version 24, October 2004, Hubbard et al., 2005). The association between genes and ESTs was determined by comparing CDS to the EST dataset using MEGABLAST (Zhang et al., 2000). We retained alignments showing at least 95% identity over 100 nucleotides or more. The choice of this criterion is a compromise between high sensitivity (the ability to associate each EST to its parental gene) and high specificity (the ability to distinguish between different members of highly conserved gene families). Finally, for each gene we converted absolute EST counts into relative EST counts (cpm, counts per million). The final data set contains 33,592 human genes.

SAGE experiment results were retrieved from the SAGE Genie website (ftp://cgap.ncbi.nih.gov/SAGE/Download; Liang, 2002) for human data. These libraries were grouped into the same 19 tissue types, each of them containing at least 20,000 tags in both the tumor and normal stages. To be able to determine the expression pattern of a given gene using SAGE, it is necessary to know the sequence of its 3′ end. Because gene prediction methods may fail to detect these 3′ non-translated regions, we restricted our analyses to genes for which an mRNA sequence had been described, and manually curated in the RefSeq database (Pruitt et al., 2003). The tag (a sequence of 10 bp upstream of the last NlaIII restriction site in 3′) was extracted from these mRNAs. In some cases, one tag matched more than one RefSeq mRNA. In these cases, we checked the genomic location of these mRNAs to determine whether they corresponded to alternative transcripts of the same gene or to different genes. Tags mapping to different locations were removed from the data set. We randomly selected one RefSeq mRNA per gene per set of alternative-splicing variants. The absolute tag count was normalized as described for EST data. We mapped RefSeq mRNAs to Ensembl genes (Hubbard et al., 2005) using Ensembl cross-references. We retained 34,408 human RefSeq mRNAs that were non-redundant and unambiguously located on the human genome.
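The conversion of absolute counts into counts per million (cpm) described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name, the gene identifiers and the counts are hypothetical, and it assumes that the denominator is the total number of ESTs (or tags) assigned to genes in the pooled library for one tissue, which the text does not state explicitly.

```python
def counts_to_cpm(counts):
    """Convert absolute counts for one pooled library into counts per million.

    `counts` maps a gene identifier to its absolute EST (or SAGE tag) count.
    Assumption: the library size is the sum of all counts in the mapping.
    """
    total = sum(counts.values())
    if total == 0:
        # An empty library carries no expression information.
        return {gene: 0.0 for gene in counts}
    return {gene: n * 1_000_000 / total for gene, n in counts.items()}


# Hypothetical counts for one pooled tissue library (10,000 ESTs in total).
liver_counts = {"GENE_A": 30, "GENE_B": 0, "GENE_C": 9970}
liver_cpm = counts_to_cpm(liver_counts)
```

Expressing every library on the same per-million scale is what makes values comparable between tissues that were sampled at different depths.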
All the analyses were done using either the level of expression (cpm measured for a given gene in a given tissue) or the expression breadth (the number of tissues in which a gene is expressed with at least one cpm).

2.2. Identifying neighboring transposable elements

We defined a “gene neighborhood” as the region 10 kb upstream and downstream of a gene. (We repeated all analyses using a smaller neighborhood of 1 kb; these results are reported in Supplementary information.) We detected TEs integrated into gene neighborhoods using RepeatMasker (http://www.repeatmasker.org) on the nucleic sequence of the human genome (Ensembl, version 24, October 2004). TEs were classified as SINEs, LINEs, DNA transposons or LTR retroelements (including endogenous retroviruses). We only considered complete TEs, i.e. TEs that could contain active promoter sequences. Solo-LTRs were therefore considered as complete elements, while we did not include truncated elements in our analyses. We defined the position of an inserted TE relative to a gene: “upstream” (in the 5′ neighborhood), “downstream” (in the 3′ neighborhood) or “within introns” (inside the sequence defined as a gene), and whether it was oriented in the sense of transcription of the gene (“sense”) or in the opposite orientation (“antisense”). We thus distinguished six ways for a TE to be inserted: upstream and in sense direction, upstream and in antisense direction, downstream and in sense direction, downstream and in antisense direction, within introns and in sense direction, and within introns and in antisense direction.

2.3. Statistical analyses

Statistical tests were performed using R (R Development Core Team, 2006, http://www.R-project.org). We performed statistical comparisons using a nonparametric test, the Wilcoxon rank sum test, which does not require the two sampled populations to follow normal distributions or to have equal variances. The difference between the expression in normal and tumor tissues was estimated using a Euclidean distance (d). This is the sum for all the tissues of the squared differences between the expression level in normal and tumor tissues. For each gene, this distance measures the deregulation in tumors by quantifying the difference in expression between normal and tumor tissues. We identified four classes of deregulation in tumors, by dividing the dataset into four subsets, each containing the same number of genes (7601 to 7603 genes). The levels of deregulation in tumor tissue in these subsets were small (d < 2.66), medium (2.66 < d < 12.38), high (12.38 < d < 29.32) or very high (d > 29.32), respectively. We also computed a modified version of the Euclidean distance, after removing the effect of expression breadth by a partial correlation. To do this, we fitted a linear model between the Euclidean distance and the expression breadth (average number of tissues where the gene is expressed). The residuals of this model were no longer correlated with the expression breadth, and we used them as the “residual Euclidean distance”.

3. Results

We linked all the human genes annotated in the complete genome sequence to TEs mapped within a 10 kb distance upstream and downstream, and inside the gene (results obtained with 1 kb distance are shown in Supplementary Tables 2 and 3). Genes without any TE in their vicinity (i.e. within 10 kb or inside the gene) were taken to be non-TE-hosting genes. 
For each TE-hosting gene, we noted the position (within introns, downstream or upstream) and the orientation (sense of transcription or opposite sense) of every TE located in its neighborhood. The expression data were collected for each gene using EST and SAGE libraries from 19 tissues under normal and tumor conditions. 3.1. Number and position of complete TEs inserted near genes Table 1 shows the number and position of TEs inserted within 10 kb of genes based on EST data. LTR retroelements, whatever their position relative to the gene, display a significant tendency to be inserted in an antisense orientation versus sense orientation: the P-values of the Chi-square tests comparing the numbers of LTR in sense and antisense orientation are for each position (upstream, introns, downstream) lower than 10e−10. LINEs and DNA transposons also show this tendency, but only when they are inserted within introns (P-value = 0.001 and P-value = 2.09e−7


Table 1
Number and positions of complete TEs near genes according to EST data

Families                        Position          Sense     Antisense   P-value (Chi-square tests)
LTR retrotransposons and ERVs   Upstream           6119        7025     2.73e−15
                                Within introns     2109        6727     <10e−16
                                Downstream         5363        6079     2.18e−11
LINEs                           Upstream            222         203     NS
                                Within introns       54          93     0.001
                                Downstream          226         255     NS
SINEs                           Upstream         61,790      62,739     0.007
                                Within introns   64,743      71,322     <10e−16
                                Downstream       53,060      51,980     0.00086
DNA transposons                 Upstream           3860        3920     NS
                                Within introns     5701        6269     2.09e−7
                                Downstream         3817        3754     NS

NS: not significant.
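The orientation tests summarized in Table 1 can be illustrated with a short sketch. The text does not spell out the null hypothesis of the Chi-square tests; the code below assumes a goodness-of-fit test against equal numbers of sense and antisense insertions (one degree of freedom). The function name is hypothetical. Under that assumption, the first row of the table (6119 sense, 7025 antisense) gives a P-value on the order of 1e−15, which is in line with the tabulated value.

```python
import math


def orientation_chi_square(sense, antisense):
    """Chi-square goodness-of-fit test against a 1:1 sense/antisense ratio.

    Returns (statistic, p_value). With one degree of freedom, the upper tail
    of the chi-square distribution equals erfc(sqrt(statistic / 2)).
    Assumption: the null hypothesis is an equal expected count per orientation.
    """
    expected = (sense + antisense) / 2
    statistic = ((sense - expected) ** 2 + (antisense - expected) ** 2) / expected
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# LTR retroelements inserted upstream of genes (counts taken from Table 1).
ltr_upstream_stat, ltr_upstream_p = orientation_chi_square(6119, 7025)
```

Using the closed form for one degree of freedom avoids any dependency beyond the standard library; a statistics package would normally be used for tables with more categories.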

respectively). There is a significant paucity of LINEs and LTR retroelements in introns, especially in the sense orientation in the case of the LTR retroelements (P-value < 10e−16). When they are located within introns, SINEs are also more frequent in antisense (P-value < 10e−16).

3.2. Gene expression patterns in the vicinity of LTR retroelements, DNA transposons, and LINEs

We selected genes associated with only one type of TE (either LTR retroelements, LINEs or DNA transposons) in order to eliminate possible conflicting influences of several different TEs on gene expression. By comparison to the rest of the genome, SINEs are more abundant in regions of high gene density, within the introns and in the neighborhood of genes (International Human Genome Sequencing Consortium, 2001). In our dataset, only 1292 genes had no SINE in their vicinity (4.2% of all genes). To begin with, we therefore only considered genes associated with a single type of TE (DNA transposons, LTR retroelements, LINEs), and ignored the presence or absence of SINEs, to ensure that our dataset was not too greatly reduced. This gave us 15,865 genes (SAGE data) or 17,786 genes (EST data) expressed in at least one tissue. We then divided these genes into 18 classes, where each class consisted of genes containing elements of a single type, inserted in a particular orientation, and in a particular location relative to the gene (for example, genes containing DNA transposons inserted in the sense orientation upstream of the gene). We also established a “reference set” consisting of 7837 genes (SAGE data) or 8602 genes (EST data) without any LINEs, LTR retroelements or DNA transposons in their vicinity. For each gene, we then calculated the expression breadth (average number of tissues where they were expressed), under normal and tumor conditions, using SAGE and EST data (Table 2). Under normal conditions, the expression breadth varied significantly between the 19 classes of TE-hosting and non-TE-hosting genes depending on the position and type of the TEs (Table 2; Kruskal–Wallis test, P-value < 10e−16). For instance, genes neighboring LINEs are generally expressed in fewer tissues than those in the reference set.
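As an illustration of the expression breadth used in this comparison, the sketch below counts the tissues in which a gene reaches at least one cpm, following the definition given in Section 2.1. The function name, the tissue names and the values are hypothetical.

```python
def expression_breadth(cpm_by_tissue, threshold=1.0):
    """Number of tissues in which a gene is expressed with at least `threshold` cpm."""
    return sum(1 for cpm in cpm_by_tissue.values() if cpm >= threshold)


# Hypothetical cpm values for one gene in three normal tissues.
gene_normal_cpm = {"brain": 12.5, "liver": 0.0, "lung": 1.0}
breadth = expression_breadth(gene_normal_cpm)
```

Averaging this count over the genes of a class would give one cell of Table 2 for a given condition and data source.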

To test whether the presence of TEs influences the normal expression of genes in particular tissues, we compared the proportions of TE-hosting and non-TE-hosting genes expressed in each of the 19 normal tissues. These values were not directly comparable, because expression breadth is not similar in these two sets of genes. For example, in each tissue, the number of genes expressed was systematically lower among the genes with LINEs in their vicinity than among the genes in the reference set. To take this into account, we randomly removed genes from the reference set until the distribution of the average number of tissues for the remaining genes was the same as in the TE-hosting gene set. We found that the proportion of genes expressed in each of the various tissues was not significantly different between each set of TE-hosting genes and the non-TE-hosting set (data not shown). This does not mean, however, that the presence of TEs had no effect on gene expression in normal tissues in particular cases, but that overall, TEs did not seem to drive any tissue-specific pattern of expression in the genes that are located in their neighborhood.

We then compared the average expression breadth between normal and tumor conditions for the 19 gene sets. We observed that the genes in every set (with or without TE in their vicinity, and whatever the type of TE present) tend to be expressed in more tissues in tumors than in normal tissues (Table 2). This is significant in the reference set (Wilcoxon test, P-value < 10e−16), in the three sets of genes hosting LTR retroelements inserted in antisense orientation, and in the three sets of genes hosting DNA transposons in the sense orientation. Even for those sets for which the difference is not significant, the same trend is typically observed. For instance, the tests are not significant for LINEs, despite some strong differences between normal and tumoral tissues, most likely because of the small number of genes in these datasets. We therefore think it likely that all genes, regardless of TE proximity, are expressed in more tissues in cancers.

3.3. TEs and gene deregulation in tumor tissues

We measured the difference of the levels of expression between normal and tumor conditions using a Euclidean distance. For each gene, the distance is the sum for all tissues of the squared differences in expression levels between normal and tumor states (see Section 2.3). Therefore, the bigger the Euclidean distance computed, the greater the deregulation in tumor tissues. For this analysis we used SAGE data, which are more reliable than ESTs for measuring the levels of gene expression (Sun et al., 2004). We wanted to investigate the influence of LINEs, LTR retroelements, DNA transposons and SINEs on tumoral expression. We therefore performed four independent tests, one for each of these types of TEs. In the case of SINEs for instance, we compared the deregulation in tumors in the set of genes that possessed at least one SINE in their neighborhood to the deregulation in the set of genes that contained no SINE in their vicinity. The average Euclidean distance was computed for each of these two categories, and the significance of the difference in deregulation between the two sets was determined using a Wilcoxon test. The results (in Table 3) show that the Euclidean distance was significantly higher for genes hosting SINE elements


Table 2
Average numbers of tissues in which genes are expressed using EST and SAGE data, depending on the type and position of TE(s) within 10 kb of these genes

                                 ---------------- ESTs ----------------    ---------------- SAGE ----------------
Position         Orientation     Genes    Normal    Tumor    P-value       Genes    Normal    Tumor    P-value

LTR retrotransposons and ERVs
Upstream         +                 943     4.36     4.57     0.01            855     4.34     4.49     NS
Upstream         −                1122     4.02     4.24     0.0006         1048     4.07     4.31     0.00029
Within introns   +                 184     3.91     4.18     NS              156     3.82     3.97     NS
Within introns   −                 782     4.99     5.44     5.8e−7          406     4.98     5.48     1.37e−7
Downstream       +                 809     4.00     4.11     NS              737     4.03     4.16     0.03
Downstream       −                 912     3.93     4.12     0.01            837     3.94     4.14     0.014

LINEs
Upstream         +                  34     3.00     2.94     NS               29     3.03     3.10     NS
Upstream         −                  40     1.22     1.20     NS               36     1.28     1.33     NS
Within introns   +                   6     1.67     3.50     NS                4     2.25     4.75     NS
Within introns   −                   5     6.40     6.00     NS                4     6.25     6.75     NS
Downstream       +                  42     2.95     2.57     NS               35     3.03     2.51     NS
Downstream       −                  64     3.22     3.47     NS               55     3.16     3.38     NS

DNA transposons
Upstream         +                 686     4.62     4.85     0.008           624     4.58     4.79     0.02
Upstream         −                 669     4.60     4.58     NS              609     4.56     4.54     NS
Within introns   +                 721     5.85     6.14     0.01            646     5.81     6.07     0.04
Within introns   −                 840     5.65     5.98     0.0002          737     5.65     5.91     0.0093
Downstream       +                 674     4.74     5.16     4.3e−6          616     4.69     5.11     1.12e−5
Downstream       −                 651     4.63     4.75     NS              594     4.57     4.70     NS

Non-TE-hosting genes              8602     4.82     5.17     <2.2e−16       7837     4.81     5.17     <2.2e−16

Genes: number of genes. Normal, Tumor: average number of normal or tumor tissues in which the genes are expressed. P-value: Wilcoxon test. +: TE inserted in the same direction as the transcript. −: TE inserted in the opposite direction to the transcript. NS: not significant.

than for genes with no SINE in their vicinity (P-value < 10e−16; P-value = 0.05 for the windows of 1 kb, results shown in Supplementary Table 3); similar results were obtained for DNA transposons. Genes having SINEs or DNA transposons in their neighborhood are therefore more deregulated in tumors than genes having no SINE or DNA transposon in their vicinity. In contrast, we found that the Euclidean distance is lower for genes hosting LINEs or LTR retroelements than for genes without these elements (P-value < 10e−16 in both cases).

We investigated the hypothesis that the average number of tissues (expression breadth) could be a confounding factor causing the correlation between the presence of TEs and gene deregulation in tumors. As shown in Table 2, the type of TEs present in the neighborhood of a gene is associated with the expression breadth of this gene. For instance, genes with LINEs in their vicinity display greater-than-average tissue-specificity. The Euclidean distance used to measure the gene deregulation in tumors also increased with the expression breadth (Spearman rho = 0.88, P-value < 10e−16). This is probably at least partly a statistical artifact, because it is obviously impossible to detect gene deregulation in tumors if the gene is so weakly expressed that transcripts are not detectable in any tissue. To correct for this artifact, we eliminated the effect of expression breadth on the deregulation value by calculating a partial correlation. We computed the residuals of the correlation between the Euclidean distance and the expression breadth, and compared this “residual Euclidean distance” for genes with and without particular TEs. If the residual distance were independent of the presence of TEs, then the effect we observed previously would simply be a by-product of the pattern of expression. As seen in Table 3, the differences in expression remained significant for the SINEs, the DNA transposons and the LINEs (Wilcoxon tests, P-value < 0.01 after Bonferroni correction), but not for the LTR retroelements.
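The two quantities compared here can be sketched in a few lines. This is an illustrative reading of the text, not the authors' R code: the distance follows the definition of Section 2.3 (sum over tissues of squared differences between normal and tumor expression), and the "residual Euclidean distance" is taken to be the residual of an ordinary least-squares fit of the distance on the expression breadth. All function names and values are hypothetical.

```python
def deregulation_distance(normal_cpm, tumor_cpm):
    """Sum over tissues of squared differences between normal and tumor cpm."""
    return sum((normal_cpm[t] - tumor_cpm[t]) ** 2 for t in normal_cpm)


def linear_residuals(x, y):
    """Residuals of the least-squares line y = intercept + slope * x.

    Assumes x contains at least two distinct values.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]


# One hypothetical gene measured in two tissue pairs.
d = deregulation_distance({"liver": 5.0, "lung": 1.0}, {"liver": 2.0, "lung": 3.0})

# Hypothetical breadths and distances for three genes.
breadths = [1, 2, 3]
distances = [1.0, 2.0, 4.0]
residual_distances = linear_residuals(breadths, distances)
```

By construction, the residuals of a least-squares fit are uncorrelated with the predictor, so any remaining difference between genes with and without a given TE type cannot be attributed to a linear effect of expression breadth.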

Table 3
The average Euclidean distance between genes versus the presence of TEs (SAGE data)

Type of TEs                               Number     Average     Standard deviation   P-value     P-value for the
                                          of genes   Euclidean   of the Euclidean                 residual Euclidean
                                                     distance    distance                         distance
SINEs                           Present    29,116     28.52        84.94              <10^−16     <10^−16
                                Absent       1292     24.88       105.77
LINEs                           Present       925     20.39        59.64              <10^−16     10^−9
                                Absent     29,483     28.62        86.62
LTR retrotransposons and ERVs   Present    16,620     26.34        81.08              <10^−16     0.03
                                Absent     13,788     30.81        91.40
DNA transposons                 Present    15,249     27.48        81.25              10^−7       <10^−16
                                Absent     15,159     29.26        91.40


Table 4
Frequency of genes with at least one TE in the various classes of gene expression divergence in normal and tumor tissues

                                 Expression divergence class
Type of TEs                      Low           Medium        High          Very high     P-value
                                 (N = 7602)    (N = 7603)    (N = 7601)    (N = 7602)    (Chi-square tests)
SINEs                            0.920         0.956         0.977         0.978         <10^−16
LINEs                            0.047         0.032         0.022         0.021         <10^−16
LTR retrotransposons and ERVs    0.577         0.560         0.549         0.500         <10^−16
DNA transposons (a)              0.451         0.520         0.544         0.490         <10^−16

(a) Non-linear relationship between expression divergence and the presence of TE.

We further investigated these observations (made on the total deregulation measured over all the tissues) by examining each tissue separately. For each gene and for each tissue, we measured (with SAGE data) the normalized difference of the levels of expression between normal and tumor state. Four independent tests were performed, one for each class of TE. In each of the tests, the genes were grouped into two classes, with or without TE in proximity, and we compared the deregulation between the two classes in each of the 19 tissues using the Wilcoxon test. The results are presented in Supplementary Table 3. Genes with LINEs in their vicinity are less deregulated in cancers than genes without LINE elements in their neighborhood: this test is significant for 17 tissues out of 19. We observed the same tendency for LTR retroelements (15 tests are significant). Deregulation is higher for genes with SINEs in the vicinity than for other genes; this is true for every tissue in the dataset. DNA transposons show the same tendency, with significant tests for 14 tissues. In summary, these detailed observations made on the analysis of each tissue separately show the same trends that were observed for total deregulation across tissues.

To further investigate the impact of the different classes of TEs on gene deregulation in tumors, we determined the proportions of genes hosting particular TEs for various degrees of gene deregulation. We divided the set of 30,408 genes into four equal-sized classes with gene deregulation in tumor (measured by the Euclidean distance) ranging from low to very high. We observed that the proportion of genes hosting SINEs and DNA transposons increased with the intensity of gene deregulation, and found the opposite tendency for genes hosting LTR retroelements and LINEs (Table 4). For example, 50% of the most highly deregulated genes had neighboring LTR retroelements, whereas 57.7% of genes displaying little divergence had LTR retroelements in their vicinity. The proportion of genes with LTR retroelements or LINEs in their vicinity decreased linearly with the deregulation of the genes in tumors (Table 4). We observed a linear increase in genes hosting

Fig. 1. Frequency of genes according to the number of SINEs in their vicinity, and to the divergence of their expression in normal and tumor tissues. The dataset is divided into four classes of equal size on the basis of the deregulation (each contains 7602 genes). The scale of gray intensities corresponds to four classes with increasing levels of deregulation (ranging from light-gray: low deregulation, to black: high deregulation). The genes identified by their level of deregulation were also divided into five classes of variable size on the basis of the number of SINE elements in their vicinity. The number of genes (N) belonging to the different classes is shown on the x-axis. Within each of these classes of SINE number, the height of the bars represents the frequency of genes belonging to each deregulation class (so that the sum of the height of the bars is equal to one within each SINE number class).
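The tabulation behind Fig. 1 can be sketched as follows. Genes are first binned by the number of SINEs in their vicinity (0, 1 to 3, 4 to 6, 7 to 11, over 11), and the frequency of each deregulation class is then computed within each bin so that the frequencies sum to one per bin. The function names and the example genes are hypothetical, and the sketch is an illustration rather than the authors' implementation.

```python
from collections import Counter

# Upper and lower bounds of the closed SINE-number classes; larger counts fall in ">11".
SINE_BINS = [(0, 0, "0"), (1, 3, "1-3"), (4, 6, "4-6"), (7, 11, "7-11")]


def sine_class(n_sines):
    """Label of the SINE-number class for a gene with `n_sines` neighboring SINEs."""
    for low, high, label in SINE_BINS:
        if low <= n_sines <= high:
            return label
    return ">11"


def class_frequencies(genes):
    """Frequency of each deregulation class within each SINE-number class.

    `genes` is an iterable of (n_sines, deregulation_class) pairs.
    """
    counts = {}
    for n_sines, dereg in genes:
        counts.setdefault(sine_class(n_sines), Counter())[dereg] += 1
    return {
        label: {dereg: c / sum(counter.values()) for dereg, c in counter.items()}
        for label, counter in counts.items()
    }


# Hypothetical genes: (number of SINEs in the vicinity, deregulation class).
example_genes = [(0, "low"), (0, "low"), (2, "medium"),
                 (5, "high"), (12, "very high"), (15, "high")]
frequencies = class_frequencies(example_genes)
```

If the number of SINEs had no relation to deregulation, each of the four deregulation classes would be expected at a frequency close to 0.25 within every bin, since the classes are of equal size overall.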


SINEs, whereas for genes hosting DNA transposons, the relationship was not linear, and therefore more difficult to interpret.

We decided to focus on SINEs, whose presence is linked to an increase of deregulation, because our main goal was to determine whether TE insertions are able to contribute to the deregulation of gene expression in cancer. We checked whether the number of SINE copies in the gene neighborhood was related to the level of gene deregulation. The dataset of 30,408 genes was divided into five classes of genes on the basis of the number of SINEs in their vicinity (0, 1 to 3, 4 to 6, 7 to 11, over 11). For each of these classes, we computed the relative proportions of the four classes of deregulation (low, medium, high and very high) described above (Fig. 1: the sum of the frequencies within each class of the number of SINEs is 1). If the number of SINEs in the vicinity of genes had no effect on the degree of gene deregulation, we would have expected the genes to be equally distributed in the different deregulation classes. We tested the deviation from the expected frequency of 0.25 by Chi-square tests applied to the five classes of numbers of SINEs. The P-values of these tests were all significant (P-value < 10e−16 for every class, except P-value = 0.006 for the 4–6 class). These findings indicate that the presence of a larger number of SINE elements in the vicinity of a gene was associated with a higher deregulation of gene expression in tumor tissues.

4. Discussion

This analysis of the position of TE insertions relative to genes has confirmed the previously reported bias towards antisense insertions, particularly when LINEs and LTR retroelements were inserted within introns (Medstrand et al., 2002; Smit, 1999; Sémon and Duret, 2004). 
This bias is generally explained by the presence of regulatory motifs, such as polyadenylation signals, inside the elements, which could have detrimental effects on the organisms, as a result of the production of truncated proteins for example, if these motifs have the same orientation as the transcript. The biased orientation and under-representation of TEs in introns is therefore probably attributable to selective disadvantages. Unlike Van de Lagemaat et al. (2003), we observed the same orientation bias for LTR retroelements inserted in 5′ and 3′ regions, with a significant preference for the antisense orientation. However, in their article, Van de Lagemaat et al. (2003) only considered LTR retroelements that overlapped 5′ and 3′ termini of gene untranslated regions, in other words, those elements very close to the genes. We therefore checked our data for TEs located within 5 kb, 2 kb and 1 kb of the genes. However, our data still showed the same excess of LTR retroelements inserted in antisense orientation whatever their position (data not shown). Van de Lagemaat et al. (2003) thought that their results confirmed that LTR retroelements are involved in transcript formation by providing genes with promoters and polyadenylation signals. However, the elements that can provide such sequences are mainly degraded sequences that have been co-opted by the genome, and of which only the parts providing a selective advantage have been conserved (Britten, 1996). This could account for the difference we observed, because we took only complete TEs into consideration.

Our findings also show that the number of normal tissues in which the genes are expressed varies with the type of TEs they host. For example, genes hosting LINEs tend to be expressed in fewer tissues than genes hosting other types of TEs. It is known that TE frequency is not random in the genome: the different classes of TE are distributed differently with respect to surrounding GC content (Medstrand et al., 2002). For instance, SINEs tend to be inserted in GC-rich regions, while LINEs and LTR retroelements tend to be located in GC-poor regions. Because there is a weak correlation between the GC-content in a genomic region and the expression breadth of the genes that belong to that region (Sémon et al., 2005), it is not unexpected that, for instance, genes with LINEs in their vicinity are expressed in fewer tissues than genes neighboring SINEs. We confirm previous observations that particular genes hosting LINEs (e.g. murine and human genes involved in stress and in the defense response) are expressed in fewer tissues than average (Van de Lagemaat et al., 2003). A functional explanation for the fact that genes hosting LINEs tend to be weakly expressed has been proposed: when inserted into an intron, L1 elements can attenuate the expression of target genes by premature truncation of RNAs (for antisense insertions) or transcriptional elongation defects (for sense insertions) (Han et al., 2004). This has been demonstrated experimentally, and also by showing that in the human genome, the quantity of LINE L1 sequence in the introns of poorly expressed genes is higher than in the introns of highly expressed genes (Han et al., 2004). Our results, by showing that genes with LINEs in their vicinity tend to be tissue-specific, confirm this observation, because the measure of expression breadth we used is highly correlated to the level of expression (see for instance Lercher et al., 2002). 
The transcriptome obtained for the same tissue in different individuals is not identical, because of differences in genetic, environmental and physiological conditions. We studied the impact of one of these factors (cancer) on the transcriptome. The publicly available data do not permit us to pair, for each tissue, data obtained from tumor and normal cells extracted from the same individual. We therefore pooled libraries from different individuals to obtain a measure of the level of expression in normal and tumor conditions in each tissue. Pooling data from different individuals for each tissue is a way of limiting the effect of genetic and physiological factors.

Genes hosting LTR retroelements and LINEs appear to be less deregulated than others in tumors. One possible interpretation of this finding could be that LINEs actively prevent gene deregulation in tumors. However, this explanation contradicts observations that the hypomethylation of LINEs leads to an increase in their transcription level in tumor tissue (Ehrlich, 2003; Roman-Gomez et al., 2005). LINEs and LTR retroelements are under-represented both within genes and in their vicinity, suggesting that these TEs may have been removed by negative selection (Medstrand et al., 2002). It is therefore possible that most of the LINE insertions tolerated near the genes do not affect gene expression, even under tumor conditions.

On average, genes with SINEs in their vicinity showed more differences in expression between tumor and normal conditions than genes without SINEs in their vicinity. This effect increases with the number of SINEs. In normal cells, SINEs are methylated, and so no expression of these elements is detected (Kochanek et al., 1993). However, it has been shown that in tumor cells, SINEs are under-methylated, and their level of expression via polymerase III increases (Li et al., 2000). SINEs can also influence gene expression. For example, the down-regulation of human epsilon-globin is prevented by the presence of an Alu inserted in the −2.2 kb region of the gene (Wu et al., 1990), and an Alu sequence confers estrogen responsiveness on the BRCA1 gene (Norris et al., 1995). The human glycoprotein hormone alpha-subunit gene (Scofield et al., 2000) and the nicotinic acetylcholine receptor alpha6 (CHNRA6) (Ebihara et al., 2002) are both down-regulated by SINE elements. Moreover, RNA polymerase III promoters have been shown to enhance the transcription of RNA polymerase II genes, indicating that SINEs could indeed interfere with gene expression (Oliviero and Monaci, 1988). Several mechanisms have been suggested to explain this ability of SINEs to affect gene expression: SINEs may contain sequences able to bind nuclear proteins or, once they have been inserted, they may create new binding sites that act as transcription regulators. Hence, the presence of SINEs near genes would have no detrimental effect in normal cells, since these TEs are silenced by methylation. In tumor cells, however, where this inhibition is abolished, SINEs could influence the expression of neighboring genes, even if there is no direct link between tumor development and this activity. We cannot rule out the possibility, however, that in particular cases an individual SINE becomes active in a normal cell and promotes a cascade of events leading to tumorigenesis.
As genomic data are not currently available for tumor cells, it is not possible to take into account the possible effects of new insertions that may occur in the genomes of those cells. However, the present analysis gives some insight into the influence of fixed insertions.

5. Conclusion

We compared the patterns of expression of genes as a function of the type of TEs they host. We show that genes containing different types of TEs in their vicinity are expressed in different numbers of tissues, and show different levels of deregulation between normal and tumor conditions. Genes containing LINEs and LTR retroelements in their vicinity are less deregulated in cancer than the other genes. This may indicate that most of these TEs are tolerated insertions that have little effect on gene expression under the tumor conditions tested. In contrast, genes containing SINEs in their vicinity show more deregulation in tumors than the rest of the genes. This might indicate that, even if they are not the cause of tumors, SINEs contribute to the deregulation cascade leading to the formation of cancer cells.

Acknowledgements

EL has been funded by the “Fondation pour la Recherche Médicale”. We would like to thank Christian Biémont for his critical reading of the manuscript and Monika Ghosh for the English correction.

Appendix A. Supplementary data

Supplementary data associated with this article can be found in the online version, at doi:10.1016/j.gene.2007.04.002.

References

Barklis, E., Mulligan, R.C., Jaenisch, R., 1986. Chromosomal position or virus mutation permits retrovirus expression in embryonal carcinoma cells. Cell 47, 391–399.
Benson, D.A., Karsch-Mizrachi, I., Lipman, D.J., Ostell, J., Wheeler, D.L., 2004. GenBank: update. Nucleic Acids Res. 32, D23–D26.
Britten, R.J., 1996. DNA sequence insertion and evolutionary variation in gene regulation. Proc. Natl. Acad. Sci. U. S. A. 93, 9374–9377.
Britten, R.J., Davidson, E.H., 1969. Gene regulation for higher cells: a theory. Science 165, 349–358.
Bucca, G., Carruba, G., Saetta, A., Muti, P., Castagnetta, L., Smith, C.P., 2004. Gene expression profiling of human cancers. Ann. N.Y. Acad. Sci. 1028, 28–37.
Dante, R., Dante-Paire, J., Dominique, R., Roizes, G., 1992. Methylation patterns of long interspersed repeated DNA and alphoid repetitive DNA from human cell lines and tumors. Anticancer Res. 12, 559–564.
Debniak, T., et al., 2001. Comparison of Alu-PCR, microsatelite instability, and immunohistochemical analyses in finding features characteristic for hereditary nonpolyposis colorectal cancer. J. Cancer Res. Clin. Oncol. 127, 565–569.
Deininger, P.L., Batzer, M.A., 1999. Alu repeats and human disease. Mol. Genet. Metab. 67, 183–193.
Depil, S., Roche, C., Dussart, P., Prin, L., 2002. Expression of a human endogenous retrovirus, HERV-K, in the blood cells of leukemia patients. Leukemia 16, 254–259.
Ebihara, M., Ohba, H., Ohno, S.I., Yoshikawa, T., 2002. Genomic organization and promoter analysis of the human nicotinic acetylcholine receptor alpha6 subunit (CHNRA6) gene: Alu and other elements direct transcriptional repression. Gene 298, 101–108.
Ehrlich, M., 2003. Expression of various genes is controlled by DNA methylation during mammalian development. J. Cell. Biochem. 88, 899–910.
Farkash, E.A., Kao, G.D., Horman, S.R., Prak, E.T., 2006. Gamma radiation increases endonuclease-dependent L1 retrotransposition in a cultured cell assay. Nucleic Acids Res. 34, 1196–1204.
Finnegan, D.J., 1992. Transposable elements. Curr. Opin. Genet. Dev. 2, 861–867.
Florl, A.R., Löwer, R., Schmitz-Dräger, B.J., Schulz, W.A., 1999. DNA methylation and expression of LINE-1 and HERV-K provirus sequences in urothelial and renal cell carcinomas. Br. J. Cancer 80, 1312–1321.
Han, J.S., Szak, S.T., Boeke, J.D., 2004. Transcriptional disruption by the L1 retrotransposon and implications for mammalian transcriptomes. Nature 429, 268–274.
Hayward, W.S., Neel, B.G., Astrin, S.M., 1981. Activation of a cellular onc gene by promoter insertion in ALV-induced lymphoid leukosis. Nature 290, 475–480.
Hubbard, T., et al., 2005. Ensembl 2005. Nucleic Acids Res. 33 (Database issue), D447–D453.
International Human Genome Sequencing Consortium, 2001. Initial sequencing and analysis of the human genome. Nature 409, 860–921.
Jordan, I.K., Rogozin, I.B., Glazko, G.V., Koonin, E.V., 2003. Origin of a substantial fraction of human regulatory sequences from transposable elements. Trends Genet. 19, 68–72.
Katzir, N., Arman, E., Cohen, D., Givol, D., Rechavi, G., 1987. Common origin of transmissible venereal tumors (TVT) in dogs. Oncogene 1, 445–448.
Kochanek, S., Renz, D., Doerfler, W., 1993. DNA methylation in the Alu sequences of diploid and haploid primary human cells. EMBO J. 12, 1141–1151.
Lercher, M.J., Urrutia, A.O., Hurst, L.D., 2002. Clustering of housekeeping genes provides a unified model of gene order in the human genome. Nat. Genet. 31, 180–183.
Li, T.H., Kim, C., Rubin, C.M., Schmid, C.W., 2000. K562 cells implicate increased chromatin accessibility in Alu transcriptional activation. Nucleic Acids Res. 28, 3031–3039.
Liang, P., 2002. SAGE Genie: a suite with panoramic view of gene expression. Proc. Natl. Acad. Sci. U. S. A. 99, 11547–11548.
Medstrand, P., van de Lagemaat, L., Mager, D.L., 2002. Retroelement distributions in the human genome: variation associated with age and proximity to genes. Genome Res. 12, 1483–1495.
Medstrand, P., van de Lagemaat, L.N., Dunn, C.A., Landry, J.R., Svenback, D., Mager, D.L., 2005. Impact of transposable elements on the evolution of mammalian gene regulation. Cytogenet. Genome Res. 110, 342–352.
Menendez, L., Benigno, B.B., McDonald, J.F., 2004. L1 and HERV-W retrotransposons are hypomethylated in human ovarian carcinomas. Mol. Cancer 3, e12.
Morse, B., Rothberg, P.G., South, V.J., Spandorfer, J.M., Astrin, S.M., 1988. Insertional mutagenesis of the myc locus by a LINE-1 sequence in a human breast carcinoma. Nature 333, 87–90.
Mouse Genome Sequencing Consortium, 2002. Initial sequencing and comparative analysis of the mouse genome. Nature 420, 520–562.
Muotri, A.R., Chu, V.T., Marchetto, M.C., Deng, W., Moran, J.V., Gage, F.H., 2005. Somatic mosaicism in neuronal precursor cells mediated by L1 retrotransposition. Nature 435, 903–910.
Muster, T., et al., 2003. An endogenous retrovirus derived from human melanoma cells. Cancer Res. 63, 8735–8741.
Nigumann, P., Redik, K., Matlik, K., Speek, M., 2002. Many human genes are transcribed from the antisense promoter of L1 retrotransposon. Genomics 79, 628–634.
Norris, J., et al., 1995. Identification of a new subclass of Alu DNA repeats which can function as estrogen receptor-dependent transcriptional enhancers. J. Biol. Chem. 270, 22777–22782.
Oliviero, S., Monaci, P., 1988. RNA polymerase III promoter elements enhance transcription of RNA polymerase II genes. Nucleic Acids Res. 16, 1285–1293.
Patzke, S., Lindeskog, M., Munthe, E., Aasheim, H.C., 2002. Characterization of a novel human endogenous retrovirus, HERV-H/F, expressed in human leukemia cell lines. Virology 303, 164–173.
Pear, W.S., et al., 1988. Aberrant class switching juxtaposes c-myc with a middle repetitive element (LINE) and an IgH intron in two spontaneously arising rat immunocytomas. Oncogene 2, 499–507.
Peaston, A.E., et al., 2004. Retrotransposons regulate host genes in mouse oocytes and preimplantation embryos. Dev. Cell 7, 597–606.
Pruitt, K.D., Tatusova, T., Maglott, D.R., 2003. NCBI reference sequence project: update and current status. Nucleic Acids Res. 31, 34–37.
Roman-Gomez, J., et al., 2005. Promoter hypomethylation of the LINE-1 retrotransposable elements activates sense/antisense transcription and marks the progression of chronic myeloid leukemia. Oncogene 24, 7213–7223.
Schaner, M.E., et al., 2003. Gene expression patterns in ovarian carcinomas. Mol. Biol. Cell. 14, 4376–4386.
Scofield, M.A., Xiong, W., Haas, M.J., Zeng, Y., Cox, G.S., 2000. Sequence analysis of the human glycoprotein hormone alpha-subunit gene 5′-flanking DNA and identification of a potential regulatory element as an Alu repetitive sequence. Biochim. Biophys. Acta 1493, 302–318.
Segal, Y., et al., 1999. LINE-1 elements at the sites of molecular rearrangements in Alport syndrome-diffuse leiomyomatosis. Am. J. Hum. Genet. 64, 62–69.
Sémon, M., Duret, L., 2004. Evidence that functional transcription units cover at least half of the human genome. Trends Genet. 20, 229–232.
Sémon, M., Mouchiroud, D., Duret, L., 2005. Relationship between gene expression and GC-content in mammals: statistical significance and biological relevance. Hum. Mol. Genet. 14, 421–427.
Smit, A.F.A., 1999. Interspersed repeats and other mementos of transposable elements in mammalian genomes. Curr. Opin. Genet. Dev. 9, 657–663.
Stavenhagen, J.B., Robins, D.M., 1988. An ancient provirus has imposed androgen regulation on the adjacent mouse sex-limited protein gene. Cell 55, 247–254.
Stribinskis, V., Ramos, K.S., 2006. Activation of human long interspersed nuclear element 1 retrotransposition by benzo(a)pyrene, an ubiquitous environmental carcinogen. Cancer Res. 66, 2616–2620.
Sun, M., Zhou, G., Lee, S., Chen, J., Shi, R.Z., Wang, S.M., 2004. SAGE is far more sensitive than EST for detecting low-abundance transcripts. BMC Genomics 5, e1.
Taniwaki, M., et al., 2006. Gene expression profiles of small-cell lung cancers: molecular signatures of lung cancer. Int. J. Oncol. 29, 567–575.
Van de Lagemaat, L.N., Landry, J.-R., Mager, D.L., Medstrand, P., 2003. Transposable elements in mammals promote regulatory variation and diversification of genes with specialized functions. Trends Genet. 19, 530–536.
Wang-Johanning, F., Frost, A.R., Jian, B., Epp, L., Lu, D.W., Johanning, G.L., 2003. Quantitation of HERV-K env gene expression and splicing in human breast cancer. Oncogene 22, 1528–1535.
Welsh, J.B., et al., 2001. Analysis of gene expression profiles in normal and neoplastic ovarian tissue samples identifies candidate molecular markers of epithelial ovarian cancer. Proc. Natl. Acad. Sci. U. S. A. 98, 1176–1181.
Wu, J., Grindlay, G.J., Bushel, P., Mendelsohn, L., Allan, M., 1990. Negative regulation of the human epsilon-globin gene by transcriptional interference: role of an Alu repetitive element. Mol. Cell. Biol. 10, 1209–1216.
Xu, T.H., Deng, K.J., 2002. Transposable elements and tumor progression. Med. Hypnoanal. 58, 293–296.
Zhang, Z., Schwartz, S., Wagner, L., Miller, W., 2000. A greedy algorithm for aligning DNA sequences. J. Comput. Biol. 7, 203–214.