Comprehensive analysis of epigenetic pattern of long noncoding RNA loci in colorectal cancer
Qi Liao, Linbo Chen, Jianfa Liu, Tao Yang, Jincheng Li, Xiaohong Zhang, Jinshun Zhao
PII: S0378-1119(16)30747-8
DOI: 10.1016/j.gene.2016.09.020
Reference: GENE 41582
To appear in: Gene
Received date: 21 May 2016
Revised date: 24 August 2016
Accepted date: 14 September 2016
Please cite this article as: Liao, Qi, Chen, Linbo, Liu, Jianfa, Yang, Tao, Li, Jincheng, Zhang, Xiaohong, Zhao, Jinshun, Comprehensive analysis of epigenetic pattern of long noncoding RNA loci in colorectal cancer, Gene (2016), doi: 10.1016/j.gene.2016.09.020
This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.
ACCEPTED MANUSCRIPT Comprehensive analysis of epigenetic pattern of long noncoding RNA loci in colorectal cancer
Qi Liao 1, Linbo Chen 2, Jianfa Liu 1, Tao Yang 1, Jincheng Li 1, Xiaohong Zhang 1, Jinshun Zhao 1,*
1. Department of Prevention Medicine, School of Medicine, Ningbo University, Ningbo, Zhejiang, 315211, China;
2. Department of Gastroenterology, The Affiliated Hospital of Ningbo University School of Medicine, Ningbo, 315020, China;
*. Correspondence should be addressed to Dr. Jinshun Zhao ([email protected])
Abstract
Colorectal cancer (CRC) is one of the most common and severe cancers worldwide. CRC develops through the accumulation of genetic and epigenetic alterations in colon cells. Work over the last decade has proposed that epigenetic changes of protein coding genes, such as DNA methylation and histone modification, play an important role in CRC development. However, the epigenetic patterns and features of lncRNAs in CRC remain unclear. Here, we comprehensively analyze the patterns of DNA methylation, H3K4me3 and H3K27me3 on both protein coding genes and lncRNAs. We found several interesting results that may help to discriminate lncRNAs from protein coding genes. For example, the signals of DNA methylation and H3K4me3, but not H3K27me3, are higher on protein coding genes than on lncRNAs; the three epigenetic marks show different distributions at promoters, at termination regions and across the whole gene between protein coding genes and lncRNAs, especially DNA methylation, which shows a regular signal tendency consistent with the principle of gene transcription. In addition, we analyzed the effects of epigenetic marks on protein coding gene and lncRNA expression in the HCT116 colon cell line. Most of the results were consistent with previous reports, for example that H3K27me3 is a repressive mark. Furthermore, we analyzed the relationships among the three epigenetic marks and found that DNA methylation and H3K4me3 were positively correlated in promoter and termination regions for both protein coding genes and lncRNAs. In summary, our results provide clues for further study of the pathology of CRC.
Keywords: Colorectal cancer (CRC), epigenetic, long noncoding RNAs (lncRNAs), DNA methylation, H3K4me3, H3K27me3
Introduction
Colorectal cancer (CRC) is the third most common cancer worldwide, causing more than 1.2 million new cases and an estimated 608,000 deaths globally each year, which accounts for 8% of all cancer deaths [1]. Many factors have been shown to be involved in CRC, including genetic and environmental factors. For decades, scientists have mainly focused on the genetic factors, and a considerable number of associated SNP variants and copy number variations have been found [2-5]. However, the etiology of CRC remains poorly understood. Increasing evidence now shows that disruption of microenvironment-mediated epigenetic marks can have a widespread influence on gene expression without changes in DNA sequence and plays an important role in the carcinogenesis of CRC [6, 7]. It has been reported that the accumulation of both genetic and epigenetic alterations is the main cause of CRC initiation and development [8].
Epigenetic marks, which mainly include DNA methylation and histone modification, play a critical role in transcription regulation [9]. DNA hypermethylation in gene promoters usually results in low expression or even silencing [10], while DNA hypomethylation, in contrast, leads to up-regulation of genes. In addition, there are various types of histone modifications in mammals, such as acetylation, phosphorylation, methylation, ubiquitylation and SUMOylation. Different histones (H2A, H2B, H3 and H4) are involved, resulting in dozens of distinct histone modifications, each with its own meaning. For example, H3K4me3 is associated with transcription initiation [11, 12] and H3K36me3 is related to transcription elongation [13-15], while H3K27me3 is correlated with RNA polymerase pausing and elongation repression. Combinations of histone modifications that occur together may also have a special function. For example, H3K27me3 occurs together with the activating mark H3K4me3 in regions referred to as bivalent domains [16, 17]. There are also strong relationships between expression and histone modification. The power of histone marks to predict gene expression with a linear regression model was demonstrated by Karlic et al. [18]. In addition, Cheng et al. constructed support vector regression models to integrate histone modifications and reported that histone modifications and transcription factors are statistically redundant for predicting gene expression levels [19]. All of the above suggests tight ties between expression and epigenetic marks. In CRC, there have been many reports of epigenetic disruption. For example, USP44, a tumor suppressor, was found to be DNA hypermethylated in CRC cell lines and most colorectal adenomas [20]. Loss of the Polycomb mark H3K27me3 from bivalent H3K4me3- and H3K27me3-associated promoters was accompanied by activation of genes related to cancer progression in CRC [21]. However, the genome-wide epigenetic pattern of both long noncoding RNAs and protein coding genes in CRC is still a mystery. It is urgent and necessary to carry out a comprehensive analysis of epigenetic marks in CRC.
Long noncoding RNAs (lncRNAs) are noncoding RNAs longer than 200 nt that share similar sequence characteristics with mRNAs. Accumulating evidence suggests that lncRNAs play an important role in regulating various types of molecules, from proteins to noncoding RNAs such as microRNAs. LncRNAs can act as regulators that affect the expression, localization or stability of protein coding genes and thereby participate in a number of biological processes, including immune response, development and gene imprinting. Epigenetic alterations of several lncRNAs in CRC have also been observed. For example, HOTAIR can regulate polycomb-dependent chromatin modification and is associated with poor prognosis in CRC [22]. However, no comprehensive comparison of the epigenetic patterns of lncRNAs and protein coding genes in CRC has been reported. Understanding the role and influence of epigenetic marks is at the heart of understanding transcriptional regulation in CRC. Here, we show the genome-wide patterns and main features of three major epigenetic marks (DNA methylation, H3K4me3 and H3K27me3) on lncRNAs and protein coding genes in CRC. We then analyze the relationships between epigenetic changes and the expression of lncRNAs and protein coding genes. Our results give an overview of epigenetic patterns in CRC and provide clues for further study of the epigenetic mechanisms of CRC pathogenesis.
Material and Method
Data source and data processing
A public dataset including MethylCap-seq and ChIP-seq for H3K4me3 and H3K27me3 in the WT HCT116 cell line (GSE39068) [23] was downloaded from the GEO database. MACS was used to call peaks with default parameters against the hg19 genome [24]. Based on the position of its summit, each peak was then assigned a location relative to protein coding genes and lncRNAs: promoter ([-2000bp, 100bp] around the transcription start site), exon, intron or termination ([2000bp, -100bp] around the transcription termination site). Protein coding genes were downloaded from the RefSeq database [25], while lncRNAs were collected from the GENCODE, RefSeq and UCSC databases [26-28]. To obtain more accurate results, we only retained protein coding genes and lncRNAs for which the overlap between any two transcripts was no more than 30% of the maximum length they cover. For the RNA-seq dataset, we first mapped the reads to the genome with TopHat using default parameters [29]; the expression of both protein coding genes and lncRNAs was then estimated with Cufflinks using default parameters [30].
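The summit-based location assignment described above can be illustrated with a minimal Python sketch. It is not the authors' script: the `Transcript` class, the function name `classify_summit`, the precedence order (promoter, then termination, then exon, then intron) and the reading of the termination window as 100bp upstream to 2000bp downstream of the termination site are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Transcript:
    """Hypothetical transcript model used only for this illustration."""
    chrom: str
    strand: str                      # "+" or "-"
    exons: List[Tuple[int, int]]     # sorted, half-open [start, end) intervals


def classify_summit(tx: Transcript, chrom: str, summit: int,
                    up: int = 2000, down: int = 100) -> Optional[str]:
    """Return promoter / termination / exon / intron, or None if unrelated."""
    if chrom != tx.chrom:
        return None
    start, end = tx.exons[0][0], tx.exons[-1][1]
    sign = 1 if tx.strand == "+" else -1
    tss = start if tx.strand == "+" else end - 1
    tts = end - 1 if tx.strand == "+" else start
    rel_tss = (summit - tss) * sign  # negative means upstream of the TSS
    rel_tts = (summit - tts) * sign  # negative means upstream of the TTS
    # Assumed precedence: promoter and termination windows win over gene body.
    if -up <= rel_tss <= down:
        return "promoter"
    # Assumed window: 100bp upstream to 2000bp downstream of the TTS.
    if -down <= rel_tts <= up:
        return "termination"
    if not (start <= summit < end):
        return None
    for exon_start, exon_end in tx.exons:
        if exon_start <= summit < exon_end:
            return "exon"
    return "intron"
```

For a plus-strand transcript with exons (1000, 1200) and (1500, 1800), a summit at position 1300 falls between the two exons and outside both windows, so it is labelled as intronic.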
Calculation of epigenetic modification signals
For each protein coding gene and lncRNA, we divided the gene structure into promoter, each exon, each intron and termination regions. RPKM (reads per kilobase per million reads) values were then calculated for each region with a Perl script to represent the epigenetic modification signal in that region. For each gene, we divided the region of [-5000bp, 5000bp] around the transcription start site (TSS) and the region of [-5000bp, 5000bp] around the transcription termination site (TTS) into 100 equal sub-regions (bins) of 100bp each. We then calculated the RPKM of each bin to find the signal tendency around the TSS and TTS. In addition, we divided each exon and intron into 10 equal windows and calculated their RPKM to find the signal tendency of lncRNAs and protein coding genes across the whole gene.
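As a rough illustration of the signal calculation, the sketch below computes RPKM for a region and for the 100 bins of 100bp around a TSS. The original analysis used a Perl script that is not shown; this Python version is only an assumed equivalent. In particular, representing each read by a single genomic position and the function names `rpkm` and `tss_bin_rpkm` are assumptions.

```python
from typing import Iterable, List


def rpkm(read_count: int, region_length_bp: int, total_mapped_reads: int) -> float:
    """Reads per kilobase of region per million mapped reads."""
    return read_count * 1e9 / (region_length_bp * total_mapped_reads)


def tss_bin_rpkm(read_positions: Iterable[int], tss: int,
                 total_mapped_reads: int, strand: str = "+",
                 flank: int = 5000, bin_size: int = 100) -> List[float]:
    """RPKM of each bin in [-flank, +flank) around the TSS, upstream first."""
    n_bins = 2 * flank // bin_size
    counts = [0] * n_bins
    sign = 1 if strand == "+" else -1
    for pos in read_positions:
        rel = (pos - tss) * sign          # strand-aware offset from the TSS
        if -flank <= rel < flank:
            counts[(rel + flank) // bin_size] += 1
    return [rpkm(c, bin_size, total_mapped_reads) for c in counts]
```

The same binning can be applied around the TTS by passing the TTS coordinate instead of the TSS; the 10 windows per exon or intron would use a window size that depends on the feature length.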
Random dataset construction
For each epigenetic mark, we randomly selected the same number of non-overlapping regions with the same length distribution as a control. This step was repeated 100 times, yielding 100 random peak datasets. The random peaks were also mapped to the genomic regions of lncRNAs and protein coding genes, and the random distribution of peaks across the different genomic regions was calculated for each mark. Let k denote the number of random datasets whose percentage (for example, the percentage of peaks located in protein coding genes) is larger than the true percentage; the P-value was then defined as k divided by 100. For the epigenetic signal of each gene, we randomly selected two lncRNAs or two protein coding genes; if they satisfied the conditions that i) the two lncRNAs/protein coding genes are not the same and ii) the two lncRNAs/protein coding genes have not been swapped with each other before, we swapped the signals of the two lncRNAs/protein coding genes. The swap was repeated 10,000 times. The same method was used to generate random expression profiles.
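The two randomization procedures can be sketched in Python as follows. This is an illustrative reading of the text, not the authors' code. The function names are hypothetical, and it is an assumption that a rejected draw (a pair already swapped) still counts as one of the 10,000 repetitions.

```python
import random
from typing import Dict, Sequence


def empirical_p(true_pct: float, random_pcts: Sequence[float]) -> float:
    """P = k / N, where k random datasets exceed the observed percentage."""
    k = sum(1 for p in random_pcts if p > true_pct)
    return k / len(random_pcts)


def swap_shuffle(signals: Dict[str, float], n_swaps: int = 10000,
                 seed: int = 0) -> Dict[str, float]:
    """Randomize gene signals by swapping pairs, each pair at most once."""
    rng = random.Random(seed)
    genes = list(signals)
    shuffled = dict(signals)
    swapped = set()
    max_pairs = len(genes) * (len(genes) - 1) // 2
    for _ in range(n_swaps):
        if len(swapped) == max_pairs:     # every possible pair already used
            break
        a, b = rng.sample(genes, 2)       # condition i): two distinct genes
        pair = frozenset((a, b))
        if pair in swapped:               # condition ii): not swapped before
            continue
        shuffled[a], shuffled[b] = shuffled[b], shuffled[a]
        swapped.add(pair)
    return shuffled
```

With 100 random datasets the smallest non-zero P-value is 0.01, so a result with k = 0 would be reported as p<0.01.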
Result
Basic statistical information of epigenetic marks on protein coding genes and lncRNAs in CRC
In total, 51,240, 15,152 and 2,407 peaks of DNA hypermethylation, H3K4me3 and H3K27me3, respectively, were identified. The number of DNA methylation peaks is the highest while that of H3K27me3 is the lowest, suggesting that DNA methylation is a global epigenetic change in CRC. Among the three kinds of epigenetic marks, 58.6% of H3K4me3 and 54.4% of DNA methylation marks are located in protein coding gene regions, much higher than the 26.1% for H3K27me3 (Figure 1). To test whether the percentage of each mark is significant, we constructed random datasets (see Method). By comparison with random, we found that both DNA methylation and H3K4me3 marks are more likely to be associated with transcribed genes. For H3K4me3 in particular, the percentage of marks in regions containing both lncRNAs and protein coding genes is much higher than random (16.0% vs. 5.0%). H3K4me3 is usually located in promoter regions and is well known as an active mark for transcription. The relatively high percentage of H3K4me3 marks in regions containing both lncRNAs and protein coding genes suggests active transcription in these regions. However, we found that H3K27me3 is not enriched in gene regions compared with random. As expected, the percentages of protein coding genes and lncRNAs carrying a DNA hypermethylation mark are also the highest among the three epigenetic marks (63.0% and 27.6%, respectively), followed by H3K4me3 (54.0% and 16.2%, respectively) and H3K27me3 (3.5% and 2.0%, respectively). Although only a small percentage of protein coding genes and lncRNAs carry H3K27me3, this mark is important for regulating gene expression in multiple biological processes. It has been reported that lncRNAs expressed at lower levels may be associated with higher levels of H3K27me3 at their promoters [31].
Co-regulation of DNA methylation and H3K4me3 is more common in protein coding genes
The expression of both protein coding genes and lncRNAs is usually regulated by multiple epigenetic marks. However, it is still unclear whether co-regulation by multiple epigenetic marks differs between protein coding genes and lncRNAs in CRC. In this study, we found that 64.5% of epigenetically regulated lncRNAs (lncRNAs with at least one kind of epigenetic mark) are controlled by DNA methylation only (p<0.01, 53.1% in random), compared with only 36.6% of protein coding genes, which is less than random (p<0.01, 61.5% in random). However, co-regulation by DNA methylation and H3K4me3 (DNA methylation - H3K4me3) is more common for protein coding genes than for lncRNAs (36.6% vs. 9.3%, compared with 0.4% vs. 0.4% in random), suggesting that DNA methylation positively co-occurs with H3K4me3 for protein coding genes but not for lncRNAs (Figure 2). Although previous findings show that H3K4me3 inversely correlates with DNA methylation at promoters to silence genes in colon cancer [32], it was reported earlier that DNA methylation in the gene body is positively correlated with gene expression [33]. Moreover, the functional consequences of DNA methylation in the gene body may be related to cross-talk with histone methylation [34, 35]. Therefore, the more frequent co-occurrence of DNA methylation and H3K4me3 on protein coding genes suggests that DNA methylation may be useful for H3K4me3 maintenance. Our results also show that a higher percentage of lncRNAs than of protein coding genes carry H3K27me3 only (3.2% vs. 0.5%, compared with 2.1% vs. 1.5% in random; Figure 2), indicating an important role of H3K27me3 on lncRNAs in CRC, which may account for the large number of low-expressed lncRNAs.
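The percentages of mark combinations reported above can be tallied as in the following minimal Python sketch. The input format (a mapping from gene to the set of marks found on it), the mark labels and the function name are assumptions for illustration; genes without any mark are excluded, matching the definition of epigenetically regulated genes in the text.

```python
from collections import Counter
from typing import Dict, Set


def combination_percentages(marks_by_gene: Dict[str, Set[str]]) -> Dict[str, float]:
    """Percentage of marked genes carrying each exact combination of marks."""
    # Keep only genes with at least one mark ("epigenetically regulated").
    regulated = [marks for marks in marks_by_gene.values() if marks]
    counts = Counter(" - ".join(sorted(marks)) for marks in regulated)
    total = len(regulated)
    return {combo: 100.0 * n / total for combo, n in counts.items()}
```

Running the same function on marks derived from the random peak datasets would give the random percentages used for comparison.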
Genome location distributions of epigenetic marks are different between protein coding genes and lncRNAs
The genomic regions were divided into four types: promoter, defined as 2kb upstream of the transcription start site (TSS) to 100bp downstream of the TSS; intron; exon; and termination, defined as the transcription termination site (TTS) to 2kb downstream of the TTS. To obtain more accurate results, only protein coding genes and lncRNAs whose genomic regions overlapped each other by no more than 30% were selected for further analysis. In total, 12,021 protein coding genes and 16,097 lncRNAs were obtained. In general, most peaks of all three epigenetic marks are located in introns of both lncRNAs and protein coding genes. However, many more DNA methylation peaks were found in exons of protein coding genes than of lncRNAs, while more H3K27me3 and H3K4me3 peaks were found in introns of protein coding genes than of lncRNAs (Figure 3). At promoters, more H3K4me3 and DNA methylation peaks were found in lncRNAs than in protein coding genes. In termination regions, more peaks of all three epigenetic marks were found in lncRNAs than in protein coding genes (Figure 3). Compared with random, H3K4me3 is significantly enriched at promoters and exons of both protein coding genes and lncRNAs (p<0.01). In addition, DNA methylation at exons of protein coding genes is also much more frequent than random (p<0.01), suggesting an important function of DNA methylation in the gene body. However, no significant differences from random were found for H3K27me3 (Figure 3). This suggests that H3K27me3 may occur in intergenic regions and function over long distances, consistent with the previous finding that H3K27me3 peaks do not specifically mark promoters but are more equally distributed over genes and intergenic regions [36].
Signals of DNA methylation and H3K4me3 show regular tendency across the whole genes
To compare the signals of epigenetic marks between lncRNAs and protein coding genes, the RPKM of each of the four types of genomic regions was calculated. For genes with multiple exons or introns, the maximum signal was used to represent the exon and intron signal of the gene. We found that the signal of DNA methylation was the highest while that of H3K27me3 was the lowest (Figure 4A-C). For DNA methylation and H3K4me3, the signal of protein coding genes was higher than that of lncRNAs in every type of genomic region (Figure 4B and 4C). However, for H3K27me3 the signal of lncRNAs showed no difference from that of protein coding genes, further indicating the important role of H3K27me3 on lncRNAs in CRC (Figure 4A). We then analyzed the distribution of DNA methylation, H3K4me3 and H3K27me3 signals across the whole gene. Protein coding genes and lncRNAs with 4 exons were selected. Each exon and intron was divided into 10 equal windows, and the RPKM of each window was calculated. By plotting the mean signal of each window across the whole gene, we found that the DNA methylation signal of protein coding genes follows a regular pattern, with higher signal in exons and lower signal in introns. This suggests that DNA methylation may be associated with splicing of protein coding genes. However, no obvious tendency was observed for lncRNAs (Figure 4F). The same results were found for protein coding genes and lncRNAs with 5 exons. The H3K4me3 signal also follows a regular tendency, being higher at the promoter and the first exon and lower toward the end of the gene for both protein coding genes and lncRNAs (Figure 4E). This suggests that H3K4me3 is an active and necessary mark for gene transcription. In contrast, we did not find a regular tendency of the H3K27me3 signal across the whole gene for either protein coding genes or lncRNAs (Figure 4D). The different shapes of the DNA methylation and H3K4me3 signal distributions may be used to discriminate lncRNAs from protein coding genes.
Signals of DNA methylation and H3K4me3 show regular tendency on the promoter and termination region
As the signal at the promoter region is thought to be very important for gene regulation, we further calculated the RPKM of each 100bp window from -5kb to 5kb of the TSS and from -5kb to 5kb of the TTS. We found that the DNA methylation signal at the promoter shows an inverted “V” shape, with the maximum signal around the TSS (Figure 5C) and lower signal on the two sides. The signal at the termination region did not show the same shape. There, the DNA methylation signal of protein coding genes increases from -5kb to -1kb of the TTS but drops sharply to its lowest point at the TTS (Figure 5F). However, the DNA methylation signal on lncRNAs is nearly the same across the whole gene (Figure 5C), indicating that DNA methylation in the gene body of protein coding genes may be associated with gene transcription. For H3K4me3, the distribution shows two adjacent summits separated by a dip at the TSS. The first summit is lower than the second for protein coding genes, while the two are nearly equal for lncRNAs (Figure 5B). Although the signals at the termination region of both protein coding genes and lncRNAs are much lower than those at the promoter, they show different shapes of signal distribution there. The shape for protein coding genes resembles a “V” with the lowest point at the TTS, while that for lncRNAs is nearly the opposite, showing an inverted “V” with the highest point around -4kb from the TTS (Figure 5E). For H3K27me3, the signal distribution of protein coding genes at the promoter is roughly the inverse of that of lncRNAs. The signal of lncRNAs is lower than that of protein coding genes upstream of the TSS but higher downstream of the TSS. Therefore, there is a sharp decrease around the TSS for protein coding genes and a sharp increase around the TSS for lncRNAs (Figure 5A). However, the signal of lncRNAs at the termination region is always higher than that of protein coding genes (Figure 5D), further indicating the important role of H3K27me3 on lncRNAs in CRC. The different features of the signal distributions of protein coding genes and lncRNAs in promoter and termination regions may be useful for distinguishing lncRNAs from protein coding genes.
Relationship between gene expression and the signal of epigenetic marks
We further analyzed the relationship between the signal of each epigenetic mark and gene expression for protein coding genes and lncRNAs. From previous research, it is known that a higher DNA methylation level at the promoter represses or even silences gene expression, that a higher H3K27me3 signal also down-regulates the corresponding genes, and that a higher H3K4me3 level activates transcription. In this section, we divided the genes into eleven groups according to their expression level. The first group contains genes with no expression (expression value of 0). The other genes were divided into 10 equal groups labeled 1-10 by expression level, with label 1 having the lowest expression and label 10 the highest. The signal distributions of the epigenetic marks on the genes at each level were then analyzed.
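The grouping into eleven expression levels can be sketched as follows. This is a minimal Python illustration under stated assumptions: the function name is hypothetical, groups are formed by rank (so tied values may be split across neighbouring groups), and group sizes are equal only when the number of expressed genes is a multiple of 10.

```python
from typing import Dict


def expression_groups(expr: Dict[str, float]) -> Dict[str, int]:
    """Assign level 0 to unexpressed genes and levels 1-10 by expression rank."""
    groups = {gene: 0 for gene, value in expr.items() if value == 0}
    expressed = sorted((g for g, v in expr.items() if v > 0), key=lambda g: expr[g])
    n = len(expressed)
    for rank, gene in enumerate(expressed):
        # rank * 10 // n runs from 0 to 9, so levels run from 1 to 10.
        groups[gene] = rank * 10 // n + 1
    return groups
```

With 20 expressed genes, for example, each of the levels 1 to 10 receives two genes, and any gene with an expression value of 0 is placed in level 0.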
As expected, we found that the H3K4me3 signal is positively correlated with expression for both protein coding genes and lncRNAs (Figure 6B). Apart from the genes with no expression, DNA methylation and H3K27me3 signals at the promoter are negatively correlated with expression for both protein coding genes and lncRNAs, as expected (Figure 6A). However, for the genes with no expression, the signals of DNA methylation and H3K27me3 are not the highest (Figure 6A and 6D). This suggests that the lack of expression may be caused by the low H3K4me3 signal at the promoter, further suggesting that H3K4me3 is an important and necessary active mark for gene expression. That is, although DNA methylation and H3K27me3 are responsible for low gene expression, low gene expression is not necessarily caused by a high signal of DNA methylation and H3K27me3. We then examined whether the signals of the epigenetic marks at the termination region show the same tendency as those at the promoter. We found that the signals of H3K27me3 and H3K4me3 at the termination region of both protein coding genes and lncRNAs show a tendency similar to that at the promoter (Figure 6H-I and 6K-L). However, the tendencies of DNA methylation at the termination region of both protein coding genes and lncRNAs are not obvious (Figure 6J and 6M).
We also examined the correlation between gene expression and the signal of each epigenetic modification in the different kinds of genomic regions. For H3K4me3 and H3K27me3, the tendency between gene expression and modification signal was similar, although the correlation between gene expression and H3K4me3 at introns and termination regions was not as strong as that at promoters and exons (Figure 7A and 7B). The negative correlations between expression and the DNA methylation signal at introns and exons of protein coding genes are weaker than those at promoter and termination regions, which is consistent with the fact that DNA methylation in the gene body is positively correlated with gene expression [33]. Although the correlation between expression and the DNA methylation signal of lncRNAs is still negative, it is weak, especially in termination regions, introns and exons (Figure 7A and 7B). This suggests that the expression of lncRNAs may be regulated and affected by multiple mechanisms.
Relationship among the signal of DNA methylation, H3K4me3 and H3K27me3
We then analyzed the relationship among the signals of DNA methylation, H3K4me3 and H3K27me3 in different genomic regions. To remove the effect of extreme values and obtain more robust results, we first filtered out the top and bottom 20% of signals for each epigenetic modification in each genomic region and then combined the remaining genes. Pearson correlation coefficients were then calculated between epigenetic modifications in each genomic region. We found that the signals of DNA methylation and H3K4me3 were positively correlated at promoter and termination regions for both protein coding genes and lncRNAs, while no correlation was found in exons and introns compared with random (Figure 8A and 8B). In addition, the signals of DNA methylation and H3K27me3 at the promoter were also positively correlated for lncRNAs but not for protein coding genes (Figure 8A and 8B). Because H3K4me3 and DNA methylation have opposite functions, the positive correlation between them at promoter and termination regions further suggests that expression is subject to antagonistic regulation in order to maintain a relatively stable state.
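The filtering and correlation step can be sketched in Python as below. It is an illustration under assumptions rather than the authors' procedure: "combining the remaining genes" is interpreted here as taking the genes that survive filtering for both marks, and the function names are hypothetical.

```python
import math
from typing import Dict, List, Set


def trim_extremes(signal: Dict[str, float], frac: float = 0.2) -> Set[str]:
    """Genes left after dropping the top and bottom `frac` of signal values."""
    ranked = sorted(signal, key=signal.get)
    cut = int(len(ranked) * frac)
    return set(ranked[cut:len(ranked) - cut])


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation coefficient (raises if either input is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def trimmed_correlation(sig_a: Dict[str, float], sig_b: Dict[str, float],
                        frac: float = 0.2) -> float:
    """Correlate two marks over genes kept for both (assumed interpretation)."""
    kept = sorted(trim_extremes(sig_a, frac) & trim_extremes(sig_b, frac))
    return pearson([sig_a[g] for g in kept], [sig_b[g] for g in kept])
```

The random comparison mentioned in the text would apply the same function to the swapped signal profiles described in the Method section.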
Discussion
Epigenetics is one of the most rapidly expanding and prevalent fields in biology. Epigenetic marks such as DNA methylation and histone modification are known to regulate the expression of both protein coding genes and lncRNAs, and have been verified to play a critical role in most biological processes, including human disease [37]. Long noncoding RNAs are involved in epigenetic regulation at multiple levels. On the one hand, lncRNAs can regulate the epigenetic pattern of protein coding genes by interacting with polycomb group complexes [38]. For example, the lncRNA HOTAIR represses the expression of HOXD by recruiting PRC2 to modify the H3K27me3 state of HOXD [39]. On the other hand, lncRNAs are themselves regulated by multiple epigenetic marks. A genome-wide analysis of epigenetic features at lncRNA loci found diverse regulation by multiple epigenetic marks [40], suggesting an important role of epigenetic regulation on lncRNAs. Epigenetic alteration has already been reported to be an important factor in the development and progression of CRC [41, 42]. Some epigenetic marks have even been validated as biomarkers for early diagnosis, prognosis or therapy choice in CRC. For example, DNA hypermethylation and downregulation of TFPI2 in CRC was considered an excellent candidate biomarker [43]. Here, we give a comprehensive analysis and comparison of epigenetic patterns between protein coding genes and lncRNAs in colorectal cancer, and further analyze the relationship between expression and epigenetic signal as well as the relationships among different epigenetic marks.
A number of studies have identified protein coding genes that can be used as epigenetic biomarkers of CRC [44-49]. In this study, we found that DNA methylation, H3K4me3 and H3K27me3 modifications occur more often in protein coding genes than in lncRNA loci, suggesting that protein coding genes may be more widely regulated by epigenetic marks. H3K27me3 occupancy on genes is regulated by the Polycomb group complex PRC2 and has been validated as a repressive epigenetic mark that can silence gene expression. Over the past several years, scientists have mainly focused on the role of lncRNAs in regulating H3K27me3 modification of target protein coding genes, whereas studies of H3K27me3 aberration on lncRNAs themselves are limited. In this study, we found that over half of H3K27me3 marks are located in intergenic regions and that their distribution across genomic loci does not differ from random, indicating that H3K27me3 may modify chromatin structure to regulate gene expression in trans.
Different kinds of epigenetic marks may co-occur to regulate the same gene. It has been reported that specific combinations of histone marks at promoters and enhancers are correlated with certain biological processes such as cell fate or cancer [50]. In this study, we found that many more lncRNAs than protein coding genes were regulated by DNA methylation only or H3K27me3 only, while DNA methylation and H3K4me3 usually co-occurred in protein coding genes. The combination of the repressive mark H3K27me3 and the activating mark H3K4me3 in the same region, called a “bivalent mark”, also plays a critical role in colorectal cancer [21]. We found that the percentage of “H3K27me3-H3K4me3” in lncRNA loci is relatively higher than in protein coding genes (0.46% vs. 0.16%, respectively), although both percentages are low. Genes with a bivalent mark may play an important role in CRC development. The different combination patterns of epigenetic marks on protein coding genes and lncRNAs suggest different modes of epigenetic regulation.
Epigenetic marks are key regulators of gene expression. In general, DNA methylation and H3K27me3 are repressive marks while H3K4me3 is an active mark. We found the same relationships between epigenetic marks and expression in CRC. However, there are still several differences between protein coding genes and lncRNAs in how epigenetic marks regulate expression. We found that the relationship between H3K4me3 in exons and expression was not as strong for lncRNAs as for protein coding genes. In addition, the negative relationships between DNA methylation and expression were much weaker for lncRNAs than for protein coding genes in every genomic region. This may suggest a different mechanism of regulation by DNA methylation on lncRNAs.
In conclusion, we provide a comprehensive analysis of the patterns of three important epigenetic marks on protein coding genes and lncRNAs and of their relationship to expression in CRC. The different epigenetic features of protein coding genes and lncRNAs may be useful for distinguishing lncRNAs from protein coding genes and for annotating the functions of lncRNAs. In addition, our results may provide clues for further understanding the mechanisms of CRC development and progression.
Acknowledgement
Supported by Zhejiang Provincial Natural Science Foundation of China (No. LQ13C060002), National Natural Science Foundation of China (No. 31301084), School Research Foundation of Ningbo University (XKL14D2097, XYL14023) and Wang Kuangcheng Education Foundation of Ningbo University.
Figure legends
Figure 1 The percentage of genomic regions for three types of epigenetic modifications. The label ‘lncRNAs’ on the x axis means the genomic regions of lncRNAs, ‘coding’ means the genomic regions of protein coding genes, ‘lncRNA-coding’ means the genomic regions that transcribe both lncRNAs and protein coding genes, while ‘other’ means the genomic regions that include neither lncRNAs nor protein coding genes.
Figure 2 The combination of epigenetic marks between protein coding genes and lncRNAs. The green bar represents the combination of epigenetic patterns for protein coding genes by random peaks, while the yellow bar represents that for lncRNAs by random peaks.
Figure 3 The distribution of genomic regions for three types of epigenetic marks between protein coding genes and lncRNAs. The green bar represents the distribution of genomic regions of epigenetic marks for protein coding genes by random peaks, while the yellow bar represents that for lncRNAs by random peaks. A. The distribution of DNA methylation location in the genomic region. B. The distribution of H3K4me3 location in the genomic region. C. The distribution of H3K27me3 location in the genomic region.
Figure 4 The signal (RPKM value) of epigenetic marks between protein coding genes and lncRNAs. A. The signal of H3K27me3. B. The signal of H3K4me3. C. The signal of DNA methylation. D. The signal tendency of H3K27me3 across genes with 4 exons. E. The signal tendency of H3K4me3 across genes with 4 exons. F. The signal tendency of DNA methylation across genes with 4 exons.
Figure 5 The signal (RPKM value) tendency of epigenetic marks around TSS and TTS. A. The signal tendency of H3K27me3 around TSS. B. The signal tendency of H3K4me3 around TSS. C. The signal tendency of DNA methylation around TSS. D. The signal tendency of H3K27me3 around TTS. E. The signal tendency of H3K4me3 around TTS. F. The signal tendency of DNA methylation around TTS.
Figure 6 The signal tendency of epigenetic marks on protein coding genes and lncRNAs with different expression levels around TSS and TTS. The protein coding genes and lncRNAs were divided into 11 levels according to their expression level. Level 0 includes the genes with no expression. Levels 1 to 10 each include 10% of the remaining genes, with level 10 containing the genes with the highest expression. A-F show the signal tendency around the TSS while H-M show the signal tendency around the TTS.
Figure 7 The relationship between epigenetic marks and expression. H3K27me3-coding means the correlation between the signal of H3K27me3 and expression for protein coding genes. H3K27me3-lncRNA means the correlation between the signal of H3K27me3 and expression for lncRNAs. H3K4me3-coding means the correlation between the signal of H3K4me3 and expression for protein coding genes. H3K4me3-lncRNA means the correlation between the signal of H3K4me3 and expression for lncRNAs. DNA methylation-coding means the correlation between the signal of DNA methylation and expression for protein coding genes. DNA methylation-lncRNA means the correlation between the signal of DNA methylation and expression for lncRNAs. The value in the table is the Pearson correlation coefficient between the signal of epigenetic marks and expression. A. The correlation between the signal of epigenetic marks in different genomic locations and the expression of protein coding genes or lncRNAs. B. The correlation between the signal of random epigenetic marks in different genomic locations and the expression of protein coding genes or lncRNAs.
Figure 8 The relationship among the signals of epigenetic marks in different genomic locations. ‘DNA-H3K4 coding’ means the relationship between DNA methylation and H3K4me3 of protein coding genes. ‘DNA-H3K4 lncRNAs’ means the relationship between DNA methylation and H3K4me3 of lncRNAs. ‘DNA-H3K27 coding’ means the relationship between DNA methylation and H3K27me3 of protein coding genes. ‘DNA-H3K27 lncRNAs’ means the relationship between DNA methylation and H3K27me3 of lncRNAs. ‘H3K27-H3K4 coding’ means the relationship between H3K27me3 and H3K4me3 of protein coding genes. ‘H3K27-H3K4 lncRNAs’ means the relationship between H3K27me3 and H3K4me3 of lncRNAs. The value in the table is the Pearson correlation coefficient between the signals of two kinds of epigenetic marks. A. The correlation among the signals of epigenetic marks in different genomic locations. B. The correlation among the signals of random epigenetic marks in different genomic locations.
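The expression-level grouping described in the Figure 6 legend (level 0 for genes with no expression, then ten equal-sized groups of the remaining genes in ascending order of expression) can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the authors' code: the function name `assign_expression_levels` is hypothetical, expression values are assumed to be non-negative, and ties are broken by input order, which the legend does not specify.

```python
def assign_expression_levels(expression):
    """Map each gene to a level 0..10 from a dict of gene -> expression value.

    Level 0: genes with zero expression.
    Levels 1..10: the expressed genes, sorted by expression and split into
    ten (approximately) equal-sized groups; level 10 holds the highest values.
    Assumes expression values are non-negative (an assumption of this sketch).
    """
    levels = {gene: 0 for gene, value in expression.items() if value == 0}
    expressed = sorted(
        (gene for gene, value in expression.items() if value > 0),
        key=lambda gene: expression[gene],
    )
    n = len(expressed)
    for rank, gene in enumerate(expressed):
        # rank * 10 // n runs from 0 to 9, so levels run from 1 to 10.
        levels[gene] = rank * 10 // n + 1
    return levels

# Made-up toy values (hypothetical, NOT data from the study):
# one silent gene plus twenty expressed genes with values 1..20.
toy = {"g0": 0.0}
toy.update({f"g{i}": float(i) for i in range(1, 21)})
levels = assign_expression_levels(toy)
```

With twenty expressed genes, each of levels 1 to 10 receives two genes, so the two lowest-expressed genes fall in level 1 and the two highest in level 10.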
References [1]
Ferlay, J., H.R. Shin, F. Bray, et al. Estimates of worldwide burden of cancer in 2008: GLOBOCAN 2008 [J]. Int J Cancer: 2010, 127(12): 2893-2917
[2]
Ashktorab, H., A.A. Schaffer, M. Daremipouran, et al. Distinct genetic alterations in colorectal 15
ACCEPTED MANUSCRIPT cancer [J]. PLoS One: 2010, 5(1): e8879 [3]
Bruin, S.C., C. Klijn, G.J. Liefers, et al. Specific genomic aberrations in primary colorectal cancer are associated with liver metastases [J]. BMC Cancer: 2010, 10: 662
[4]
Nakao, M., S. Kawauchi, T. Uchiyama, et al. DNA copy number aberrations associated with
T
the clinicopathological features of colorectal cancers: Identification of genomic biomarkers by array-based comparative genomic hybridization [J]. Oncol Rep: 2011, 25(6): 1603-1611 Munoz-Bellvis, L., C. Fontanillo, M. Gonzalez-Gonzalez, et al. Unique genetic profile of
IP
[5]
sporadic colorectal cancer liver metastasis versus primary tumors as defined by high-density
SC R
single-nucleotide polymorphism arrays [J]. Mod Pathol: 2012, 25(4): 590-601 [6]
Gronbaek, K., C. Hother, and P.A. Jones. Epigenetic changes in cancer [J]. APMIS: 2007, 115(10): 1039-1059
[7]
Esteller, M. Cancer epigenomics: DNA methylomes and histone-modification maps [J]. Nat
[8]
NU
Rev Genet: 2007, 8(4): 286-298
Mikkelsen, T.S., M. Ku, D.B. Jaffe, et al. Genome-wide maps of chromatin state in pluripotent and lineage-committed cells [J]. Nature: 2007, 448(7153): 553-560 Berger, S.L. The complex language of chromatin regulation during transcription [J]. Nature:
MA
[9]
2007, 447(7143): 407-412 [10]
Boyes, J. and A. Bird. DNA methylation inhibits transcription indirectly via a methyl-CpG binding protein [J]. Cell: 1991, 64(6): 1123-1134 Shukla, A., P. Chaurasia, and S.R. Bhaumik. Histone methylation and ubiquitination with their
D
[11]
1419-1433 [12]
TE
cross-talk and roles in gene expression and stability [J]. Cell Mol Life Sci: 2009, 66(8): Ruthenburg, A.J., H. Li, D.J. Patel, et al. Multivalent engagement of chromatin modifications
[13]
CE P
by linked binding modules [J]. Nat Rev Mol Cell Biol: 2007, 8(12): 983-994 Bannister, A.J., R. Schneider, F.A. Myers, et al. Spatial distribution of di- and tri-methyl lysine 36 of histone H3 at active genes [J]. J Biol Chem: 2005, 280(18): 17732-17736 [14]
Barski, A., S. Cuddapah, K. Cui, et al. High-resolution profiling of histone methylations in the
[15]
AC
human genome [J]. Cell: 2007, 129(4): 823-837 Edmunds, J.W., L.C. Mahadevan, and A.L. Clayton. Dynamic histone H3 methylation during gene induction: HYPB/Setd2 mediates all H3K36 trimethylation [J]. EMBO J: 2008, 27(2): 406-420
[16]
Vastenhouw, N.L. and A.F. Schier. Bivalent histone modifications in early embryogenesis [J]. Curr Opin Cell Biol: 2012, 24(3): 374-386
[17]
Bernstein, B.E., T.S. Mikkelsen, X. Xie, et al. A bivalent chromatin structure marks key developmental genes in embryonic stem cells [J]. Cell: 2006, 125(2): 315-326
[18]
Karlic, R., H.R. Chung, J. Lasserre, et al. Histone modification levels are predictive for gene expression [J]. Proc Natl Acad Sci U S A: 2010, 107(7): 2926-2931
[19]
Cheng, C. and M. Gerstein. Modeling the relative relationship of transcription factor binding and histone modifications to gene expression levels in mouse embryonic stem cells [J]. Nucleic Acids Res: 2012, 40(2): 553-568
[20]
Sloane, M.A., J.W. Wong, D. Perera, et al. Epigenetic inactivation of the candidate tumor suppressor USP44 is a frequent and early event in colorectal neoplasia [J]. Epigenetics: 2014, 9(8): 1092-1100
[21]
Hahn, M.A., A.X. Li, X. Wu, et al. Loss of the polycomb mark from bivalent promoters leads 16
ACCEPTED MANUSCRIPT to activation of cancer-promoting genes in colorectal tumors [J]. Cancer Res: 2014, 74(13): 3617-3629 [22]
Kogo, R., T. Shimamura, K. Mimori, et al. Long noncoding RNA HOTAIR regulates polycomb-dependent chromatin modification and is associated with poor prognosis in
[23]
T
colorectal cancers [J]. Cancer Res: 2011, 71(20): 6320-6326 Barrett, T., D.B. Troup, S.E. Wilhite, et al. NCBI GEO: mining tens of millions of expression
IP
profiles--database and tools update [J]. Nucleic Acids Res: 2007, 35(Database issue): D760-765
Zhang, Y., T. Liu, C.A. Meyer, et al. Model-based analysis of ChIP-Seq (MACS) [J]. Genome
SC R
[24]
Biol: 2008, 9(9): R137 [25]
Pruitt, K.D., T. Tatusova, and D.R. Maglott. NCBI reference sequences (RefSeq): a curated 2007, 35(Database issue): D61-65
[26]
Pruitt, K.D. and D.R. Maglott. RefSeq and LocusLink: NCBI gene-centered resources [J]. Nucleic Acids Res: 2001, 29(1): 137-140
Derrien, T., R. Johnson, G. Bussotti, et al. The GENCODE v7 catalog of human long
MA
[27]
NU
non-redundant sequence database of genomes, transcripts and proteins [J]. Nucleic Acids Res:
noncoding RNAs: analysis of their gene structure, evolution, and expression [J]. Genome Res: 2012, 22(9): 1775-1789 [28]
Karolchik, D., A.S. Hinrichs, T.S. Furey, et al. The UCSC Table Browser data retrieval tool [J]. Trapnell, C., L. Pachter, and S.L. Salzberg. TopHat: discovering splice junctions with
TE
[29]
D
Nucleic Acids Res: 2004, 32(Database issue): D493-496 RNA-Seq [J]. Bioinformatics: 2009, 25(9): 1105-1111 [30]
Trapnell, C., B.A. Williams, G. Pertea, et al. Transcript assembly and quantification by
CE P
RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation [J]. Nat Biotechnol: 2010, 28(5): 511-515 [31]
Wu, S.C., E.M. Kallin, and Y. Zhang. Role of H3K27 methylation in the regulation of lncRNA
[32]
Balasubramanian, D., B. Akhtar-Zaidi, L. Song, et al. H3K4me3 inversely correlates with
[33]
AC
expression [J]. Cell Res: 2010, 20(10): 1109-1116 DNA methylation at a large class of non-CpG-island-containing start sites [J]. Genome Med: 2012, 4(5): 47 Ball, M.P., J.B. Li, Y. Gao, et al. Targeted and genome-scale strategies reveal gene-body methylation signatures in human cells [J]. Nat Biotechnol: 2009, 27(4): 361-368 [34]
Cedar, H. and Y. Bergman. Linking DNA methylation and histone modification: patterns and paradigms [J]. Nat Rev Genet: 2009, 10(5): 295-304
[35]
Jones, P.A. Functions of DNA methylation: islands, start sites, gene bodies and beyond [J]. Nat Rev Genet: 2012, 13(7): 484-492
[36]
Pauler, F.M., M.A. Sloane, R. Huang, et al. H3K27me3 forms BLOCs over silent genes and intergenic regions and specifies a histone banding pattern on a mouse autosomal chromosome [J]. Genome Res: 2009, 19(2): 221-233
[37]
Portela, A. and M. Esteller. Epigenetic modifications and human disease [J]. Nat Biotechnol: 2010, 28(10): 1057-1068
[38]
Beisel, C. and R. Paro. Silencing chromatin: comparing modes and mechanisms [J]. Nat Rev Genet: 2011, 12(2): 123-135
[39]
Gupta, R.A., N. Shah, K.C. Wang, et al. Long non-coding RNA HOTAIR reprograms 17
ACCEPTED MANUSCRIPT chromatin state to promote cancer metastasis [J]. Nature: 2010, 464(7291): 1071-1076 [40]
Sati, S., S. Ghosh, V. Jain, et al. Genome-wide analysis reveals distinct patterns of epigenetic features in long non-coding RNA loci [J]. Nucleic Acids Res: 2012, 40(20): 10018-10031
[41]
Wang, X., Y.Y. Kuang, and X.T. Hu. Advances in epigenetic biomarker research in colorectal
[42]
T
cancer [J]. World J Gastroenterol: 2014, 20(15): 4276-4287 Zoratto, F., L. Rossi, M. Verrico, et al. Focus on genetic and epigenetic events of colorectal
IP
cancer pathogenesis: implications for molecular diagnosis [J]. Tumour Biol: 2014, 35(7): 6195-6206
Glockner, S.C., M. Dhir, J.M. Yi, et al. Methylation of TFPI2 in stool DNA: a potential novel
SC R
[43]
biomarker for the detection of colorectal cancer [J]. Cancer Res: 2009, 69(11): 4691-4699 [44]
Herman, J.G. and S.B. Baylin. Gene silencing in cancer in association with promoter hypermethylation [J]. N Engl J Med: 2003, 349(21): 2042-2054
Suzuki, H., D.N. Watkins, K.W. Jair, et al. Epigenetic inactivation of SFRP genes allows
NU
[45]
constitutive WNT signaling in colorectal cancer [J]. Nat Genet: 2004, 36(4): 417-422 [46]
Akiyama, Y., N. Watkins, H. Suzuki, et al. GATA-4 and GATA-5 transcription factor genes
MA
and potential downstream antitumor target genes are epigenetically silenced in colorectal and gastric cancer [J]. Mol Cell Biol: 2003, 23(23): 8429-8439 [47]
Easwaran, H.P., L. Van Neste, L. Cope, et al. Aberrant silencing of cancer-related genes by CpG hypermethylation occurs independently of their spatial organization in the nucleus [J]. Poeta, M.L., E. Massi, P. Parrella, et al. Aberrant promoter methylation of beta-1,4
TE
[48]
D
Cancer Res: 2010, 70(20): 8015-8024 galactosyltransferase 1 as potential cancer-specific biomarker of colorectal tumors [J]. Genes Chromosomes Cancer: 2012, 51(12): 1133-1143 Hrasovec, S., N. Hauptman, D. Glavac, et al. TMEM25 is a candidate biomarker methylated
CE P
[49]
and down-regulated in colorectal cancer [J]. Dis Markers: 2013, 34(2): 93-104 Rada-Iglesias, A., R. Bajpai, T. Swigut, et al. A unique chromatin signature uncovers early developmental enhancers in humans [J]. Nature: 2011, 470(7333): 279-283
AC
[50]
Abbreviations
Colorectal cancer: CRC
long noncoding RNAs: lncRNAs
transcript start site: TSS
transcript termination site: TTS
Reads Per Kilobases per Million reads: RPKM
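For readers unfamiliar with the RPKM unit used for the signal values in the figures, the conventional definition (reads in a region, scaled by region length in kilobases and by total mapped reads in millions) can be written as a short sketch. The function name `rpkm` and its parameter names are assumptions for illustration. The manuscript's exact normalisation procedure is not restated here, so this reflects only the standard formula.

```python
def rpkm(read_count, region_length_bp, total_mapped_reads):
    """Reads per kilobase of region per million mapped reads.

    RPKM = read_count * 1e9 / (region_length_bp * total_mapped_reads)
    """
    if region_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("region length and total mapped reads must be positive")
    return read_count * 1e9 / (region_length_bp * total_mapped_reads)

# Hypothetical numbers: 500 reads in a 2,000 bp region, 10 million mapped reads.
example = rpkm(500, 2000, 10_000_000)
```

Here 500 reads over 2 kb is 250 reads per kilobase, and dividing by 10 million mapped reads (10 in units of millions) gives an RPKM of 25.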
Highlights
1. We compared the genomic location distributions of three kinds of epigenetic marks between protein coding genes and lncRNAs, and analyzed the co-occurrence of epigenetic marks on the same gene loci, the signal value of epigenetic marks, and the signal tendency of epigenetic marks in the promoter and termination regions of both protein coding genes and lncRNAs.
2. We examined the relationship between expression and the signal of epigenetic marks and found that DNA methylation and H3K27me3 are repressive marks and H3K4me3 is an active mark, as expected.
3. We analyzed the relationship among the signals of DNA methylation, H3K4me3 and H3K27me3 and found that the signals of DNA methylation and H3K4me3 were positively correlated in the promoter and termination regions for both protein coding genes and lncRNAs, suggesting that expression is affected by antagonistic regulation to obtain a relatively stable state.