Characterization and machine learning prediction of allele-specific DNA methylation Jianlin He, Ming-an Sun, Zhong Wang, Qianfei Wang, Qing Li, Hehuang Xie PII: DOI: Reference:
S0888-7543(15)30033-1 doi: 10.1016/j.ygeno.2015.09.007 YGENO 8771
To appear in:
Genomics
Received date: Revised date: Accepted date:
22 August 2015 15 September 2015 20 September 2015
Please cite this article as: Jianlin He, Ming-an Sun, Zhong Wang, Qianfei Wang, Qing Li, Hehuang Xie, Characterization and machine learning prediction of allele-specific DNA methylation, Genomics (2015), doi: 10.1016/j.ygeno.2015.09.007
This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.
ACCEPTED MANUSCRIPT Characterization and Machine Learning Prediction of Allele-Specific DNA
PT
Methylation Jianlin He1$, Ming-an Sun2$, Zhong Wang3,4, Qianfei Wang1, Qing Li3,4*, Hehuang Xie1,2,5* Laboratory of Genome Variation and Precision Biomedicine, Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing, 100101, China 2 Epigenomics and Computational Biology Lab, Virginia Bioinformatics Institute; 5 Department of Biological Sciences, Virginia Tech, VA 24060, USA 3 School of Pharmaceutical Sciences; 4Center for Cellular & Structural biology, Sun Yat-Sen University, Guangzhou, 510080, China
NU
SC
RI
1
MA
$
These authors have contributed equally to this work. To whom correspondence should be addressed.
*
TE
AC CE P
Mingan Sun, Ph.D. Email:
[email protected]
D
Jianlin He Email:
[email protected]
Zhong Wang, Ph.D. Email:
[email protected]
Qiangfei Wang, Ph.D. Email:
[email protected]
Qing Li, Ph.D. Email:
[email protected] Hehuang Xie, Ph.D. Email:
[email protected]
1 / 26
ACCEPTED MANUSCRIPT ABSTRACT A large collection of Single Nucleotide Polymorphisms (SNPs) has been identified in the human genome. Currently, the epigenetic influences of SNPs on their neighboring CpG sites
PT
remain elusive. A growing body of evidence suggests that locus-specific information, including
RI
genomic features and local epigenetic state, may play important roles in the epigenetic readout of SNPs. In this study, we made use of mouse methylomes with known SNPs to develop statistical
SC
models for the prediction of SNP associated allele-specific DNA methylation (ASM). ASM has been classified into parent-of-origin dependent ASM (P-ASM) and sequence-dependent allele-
NU
specific DNA methylation (S-ASM), which comprises scattered-S-ASM (sS-ASM) and clustered-S-ASM (cS-ASM). We found that P-ASM and cS-ASM CpG sites are both enriched in
MA
CpG rich regions, promoters and exons, while sS-ASM CpG sites are enriched in simple repeat and regions with high frequent SNP occurrence. Using Lasso-grouped Logistic Regression (LGLR), we selected 21 out of 282 genomic and methylation related features that are powerful in
D
distinguishing cS-ASM CpG sites and trained the classifiers with machine learning techniques.
TE
Based on 5-fold cross-validation, the logistic regression classifier was found to be the best for cS-ASM prediction with an ACC of 0.77, an AUC of 0.84 and a MCC of 0.54. Lastly, we applied
AC CE P
the logistic regression classifier on human brain methylome and predicted 608 genes associated with cS-ASM. Gene ontology term enrichment analysis indicated that these cS-ASM associated genes are significantly enriched in the category coding for transcripts with alternative splicing forms. In summary, this study provided an analytical procedure for cS-ASM prediction and shed new light on the understanding of different types of ASM events. Keywords Allele-specific DNA methylation, SNP, epigenetic variation, logistic regression classifier
2 / 26
ACCEPTED MANUSCRIPT 1. Introduction According to dbSNP (http://www.ncbi.nlm.nih.gov/projects/SNP/snp_summary.cgi), over 62 million SNPs have been identified in the human genome with 43 million validated. Apparently,
PT
some SNPs may affect the methylation levels of neighboring CpG sites and lead to complex diseases/traits in the human population. There is a growing interest in the determination of SNP
RI
associated allele-specific methylation events for better understanding of the relationship between
SC
genetic and epigenetic variations. Mechanistically, ASM events can be either parental-of-origin dependent (imprinted) or sequence-dependent [1]. The parental-of-origin dependent ASM (P-
NU
ASM) may arise during gametogenesis for gametic imprints or in post-implantation embryos for somatic imprints [2, 3]. The number of P-ASM regions are very limited but each P-ASM region
MA
may span over mega-bases in the genome [2]. In contrast, sequence-dependent ASM (S-ASM) at non-imprinted loci has been found to be prevalent but highly localized in genomes and frequently restricted to a very small number of CpG sites [4, 5]. Some of these S-ASM CpG sites
D
may co-localize in a narrow genomic region and result in clustered-S-ASM (cS-ASM). Since cS-
TE
ASM may affect a significant number of CpG sites within functional elements, there is a growing interest to determine cS-ASM regions in genome.
AC CE P
ASM loci may be identified experimentally with the comparisons of methylation levels between two SNP alleles. To quantitatively assess ASM on a genome-wide scale, the highdensity SNP microarrays have been exploited intensively [1, 6, 7]. In such a procedure, genomic DNA was first digested with methylation sensitive enzymes. Differently methylated alleles showed distinct enzymatic digestion patterns, which result in allele-skewed hybridization signals on SNP arrays. Apparently, such an approach heavily depends on the availability of methylation sensitive enzymes and their recognition sites in the genome. In combination with bisulfite treatment, high-throughput sequencing technologies offer alternative ways to identify ASM loci. Bisulfite sequencing data provides the methylation information for CpG sites on the two alleles in addition to sequence variations. The CpG sites adjacent to SNPs may be grouped by alleles and their methylation statuses may be subjected to standard statistical tests such as the Fisher’s Exact Test [4, 5]. However, identifying SNPs from bisulfite converted sequence reads is a challenging task. It was estimated that an average 30Xs sequence read depth would be required to call 96% SNP accurately from bisulfite sequencing data [8]. Unfortunately, most of the current high-throughput methylation datasets are with the average sequence coverage around 10Xs.
3 / 26
ACCEPTED MANUSCRIPT To overcome experimental limitations, a few statistical models have been proposed to predict ASM merely based on the distribution of methylation patterns observed in bisulfite sequencing data [9, 10]. Fang et al. proposed a model selection method to identify ASM for a fixed genomic
PT
interval. Using a fixed-width sliding window, regions of ASM were determined by comparing Bayesian information criterion (BIC) under non-allele-specific and allele-specific models. Peng
RI
and Ecker first identified partially methylated regions (PMRs) according to mapped reads
SC
originating from heterogeneous epigenomes. They adopted a mixed model to extract features of methylation patterns and classify PMRs into ASM or non-ASM. While these computational
NU
analyses largely extended the reservoir of known ASM regions, the predictive accuracy is highly limited to the hypothesis that all genomic regions with equal frequency of hyper and hypo
MA
methylation patterns are associated with ASM. For instance, equal frequency of hyper and hypo methylation patterns may be observed in a cell population with two distinct cell subsets mixed at equal ratio. Therefore, it is possible that the predicted ASM regions are actually the cell-subset-
D
specific methylated regions, which also show heterogeneous methylation patterns.
TE
A recent study performed reciprocal crosses between two distantly related inbred mouse strains (129x1/SvJ and Cast/EiJ) and determined the brain methylomes for F1 offsprings at
AC CE P
single base resolution [5]. Over 20 million SNPs present in the genomes of the two strains provide a high density SNP map (one SNP in every 133 bp). A total of 1,952 CpG dinucleotides and 131,765 CpG dinucleotides were identified as P-ASM and S-ASM, respectively. In the previous study, cS-ASM regions were defined as genomic loci with at least five S-ASM CpGs within a 2 kb window and at least 75% of these S-ASM CpGs show methylation preference toward the same allele [5]. Accordingly, a total of 9,030 S-ASM CpG sites were clustered into 1,051 cS-ASM regions. In this study, we made use of such experimentally validated ASM loci to explore the characteristics of three types of ASM events (P-ASM, cS-ASM, and sS-ASM) and designed a computational approach to infer cS-ASM. A workflow was presented in Figure 1. 2. Results and discussion 2.1 Distinct characteristics of P-ASM, cS-ASM and sS-ASM CpGs To gain a better understanding of the characteristics of distinct ASM events, we compiled four sets of CpG sites with at least 10Xs read coverage generated previously [5]: 1,952 P-ASM, 122,735 sS-ASM, 9,030 cS-ASM and 5,070,572 control CpG sites. We first examined the
4 / 26
ACCEPTED MANUSCRIPT distribution of ASM sites at functional genomic regions. As reported previously, P-ASM events preferentially occur at promoters and within gene regions [5]. A near six-fold enrichment in promoter regions was observed for P-ASM CpG sites (Figure 2A). cS-ASM tends to occur at
PT
promoters and exons but less frequently in introns and UTRs. To examine the association between the histone modifications and ASM, we made use of the histone modification
RI
information of H3K27ac, H3K4me1 and H3K27ac in mouse frontal cortex [11]. Consistent with
SC
the findings that P-ASM and cS-ASM CpG sites were enriched in promoter regions, these sites tend to reside in regions with H3K4me3, the active promoter mark (Figure 2B). In addition, P-
NU
ASM and cS-ASM sites are significantly enriched within genomic regions with H3K27ac marks. Since H3K27ac marks distinguish active enhancers from poised ones with H3K4me1 alone [12,
MA
13], this result suggests that P-ASM and cS-ASM may have an influence on the enhancer activities.
Repetitive elements, such as direct repeats, may participate in the establishment and
D
maintenance of allele-specific methylation patterns [14]. We examined the distribution of ASM
TE
sites within various repetitive elements (Figure 2C). All ASM sites tend to occur frequently within simple repeats. P-ASM sites are depleted from repeats including SINEs, LTR, LINE and
AC CE P
segmental duplication. In contrast to P-ASM CpG sites, cS-ASM and sS-ASM sites are prone to occur within segmental duplication. It has been reported that CpG islands (CGIs) and their flanking regions (CGI shores) often show tissue-specific methylation patterns [15, 16]. We examined the association between ASM CpG sites and CGIs/CGI shores. P-ASM CpG sites are highly enriched in CGIs and CGI shores (Figure 2D). cS-ASM CpG sites also show moderate enrichment in CGIs and CGI shores, but sS-ASM CpG sites are depleted from CGIs. We next extracted the genomic sequences flanking the ASM CpG sites and obtained sequence features including GC content, CpG density, CpG obs/exp and the frequency of all 4-mer motifs. P-ASM CpG sites were found to be enriched in regions with high GC content, high CpG density and high CpG obs/exp ratio (Figure S1A-C). As expected, sS-ASM CpG sites and control sites are with a similar level of GC content. Further examination of evolutionary conservation indicated that PASM and cS-ASM are more conserved than sS-ASM and control sites (Figure S2A). The average conversation scores of P-ASM and cS-ASM regions were found to be 0.15 and 0.14, respectively. The average conversation scores of sS-ASM and control regions are 0.09 and 0.11, respectively. To inspect whether the genomic regions hosting three types of ASM are associated
5 / 26
ACCEPTED MANUSCRIPT with different levels of genetic variation, we determined the SNP frequencies in 100-bp sequences flanking target CpG sites, and found that SNPs occurs at higher frequency surrounding sS-ASM and cS-ASM sites but deplete from P-ASM regions (Figure S2B).
PT
We further assessed the degree of methylation variations by weighted methylation entropy, which is modified from the original formula of methylation entropy [17]. For each sequence read,
RI
the number of transitions between the methylation states of neighboring CpG sites was
SC
considered in this study (Figure S3A). Genomic segments with only completely methylated and/or unmethylated reads will have weight methylation entropy of zero (Figure S3B-E). Thus,
NU
the weighted methylation entropy reflects the co-methylation effect between neighboring CpG sites. With such a modification, weight methylation entropy may distinguish the deterministic
MA
methylation events from stochastic ones. The average weighted methylation entropy for the PASM regions was found to be much lower than those of the cS-ASM and the sS-ASM, in particular for segments with intermediate methylation levels (Figure S3F).
D
In mammalian genomes, the methylation levels of CpG sites follow a bimodal distribution
TE
with a small proportion of CpG sites demonstrating intermediate (40-60%) methylation levels [18]. However, for CpG sites associated with ASM, the methylated and unmethylated patterns
AC CE P
are expected to be roughly equally distributed. Not surprisingly, we found that P-ASM and cSASM CpG were half-methylated with average values of 0.48 and 0.50, respectively. The methylation average of the sS-ASM CpG sites is 0.60 and the control CpG sites are hypermethylated with an average of 0.72. Since many genomic features are known to be correlated with the level of DNA methylation, we further divided CpG sites into five intervals according to their methylation levels. For all four sets of CpG sites, low-methylated CpGs are significantly enriched in high CpG density regions such as CGIs and regions with active histone modifications, such as H3K4me3, a mark for active transcription (Figure 2E). 2.2 Machine learning model in mouse dataset The numbers of P-ASM loci were determined to be very limited in both mouse and human genomes [5, 19]. However, a large number of S-ASM sites may be present in the human genome. Since S-ASM CpG sites are with unique characteristics, we next explored the possibility to predict cS-ASM and sS-ASM events. For each CpG, we compiled a total of 282 features (Table S1): 1) methylation related features, including methylation levels of targeted and neighboring CpG sites; 2) genomic features, including gene structure; 3) DNA-related attributes, including 6 / 26
ACCEPTED MANUSCRIPT DNA sequence properties [20], DNA motif [21] and SNP frequency. For all 282 features, we performed pair-wised comparisons between ASM and control CpG sites. Compared with control CpG sites, cS-ASM sites show significant differences (P<0.01, Wilcox Rank Sum test or Fisher’s
PT
exact test followed by Bonferroni correction) in over 220 features. In contrast, for sS-ASM, only fourteen features were found to have significant differences. For both types of S-ASM events,
RI
the top ten significant characteristics were summarized in Table 1.
SC
In order to select the most important predictors, we compared Lasso-grouped Logistic Regression (LGLR) with SVM-based Recursive Feature Elimination (SVM-RFE) (Figure S4).
NU
The predictors selected by LGLR show better classification performance than those selected by SVM-RFE (Wilcoxon rank sum test, p-value=4.36E-3, Figure S4D). Based on the predictors
MA
selected with LGLR, we employed logistic regression, LGLR, Bayesian network, Naïve Bayes, SVM and J48 tree to distinguish cS-ASM and sS-ASM from control sites (Table 2). Among all the methods tested, logistic regression had the best performance for cS-ASM prediction. Using
D
5-fold cross-validation, the values of AUC, ACC and MCC for the logistic regression approach
TE
were 0.85, 0.78 and 0.55, respectively (Figure 3A). On the other hand, the best prediction method for the sS-ASM was found to be the Bayesian network with the values of AUC, ACC and
AC CE P
MCC as 0.75, 0.69 and 0.39, respectively (Figure 3B). We further conducted the random sampling for 10 times to determine the AUC values and confirmed the consistency in performance for all methods tested (Figure 3C&D). Since cS-ASM sites are with distinct characteristics and the logistic regression approach showed a satisfactory performance in cS-ASM prediction, we next questioned whether cS-ASM sites can be distinguished from the P-ASM and sS-ASM CpG sites as well. We pooled P-ASM, sS-ASM and control CpG sites together as a control set for cS-ASM CpG sites. The performance of logistic regression with AUC of 0.84, ACC of 0.77 and MCC of 0.55 was achieved. The sensitivity and specificity of the logistic regression classifier were both at 0.77. To ensure such performance is not simply resulted from the enrichment of CpG sites in cS-ASM, we assessed the classifier using control sets at different levels of CpG density (Figure S5A, B). We found that the AUC of 0.8 was achieved with logistic regression for control sets with a wide range of CpG density. In order to improve positive predictive value and reduce false positive results, we adjusted the threshold of logistic regression classifier [22], and determined that the maximal positive predictive value to be 0.85 with the false positive rate as 5.6% when the threshold for
7 / 26
ACCEPTED MANUSCRIPT logistic regression classifier was set as 0.81 (Figure S6). Based on LGLR, we selected a total of 21 out of 282 as features contributed significantly for cS-ASM prediction, including top 10 features such as methylation signal and variation, CpG Obs/Exp, CGI and CGI shore, SNP
PT
frequency, as well as repeat element such as simple repeat, segmental duplication and SINE (Figure 4A). The performance of logistic regression classifier was determined as: AUC of 0.84,
RI
ACC of 0.77 and MCC of 0.54 by 5-fold cross-validation (Figure 4B).
SC
2.3 Application to human dataset
To test whether the machine learning approach trained with the mouse dataset could be
NU
applied on the human data, we re-analyzed a human brain methylome [23] and obtained the methylation profiles for 14,959,255 CpG sites with 10Xs read coverage. Based on UCSC
MA
genome annotation, we extracted features including gene structure, DNA-related attributes, SNP (snp135), and phastCons scores. We also explored The NIH Roadmap Epigenetics Project to obtain the information of histone modifications in the human adult brain, including H3K27ac,
D
H3k4me1 and H3k4me3 [24]. We next exploited the Bis-SNP [8], a bisulfite SNP calling
TE
program, and identified a total of 2.51 million putative genetic variations between the two alleles in the human brain methylome. Approximately 60% of these putative genetic variations (1.5
AC CE P
million) have been documented in dbSNP database. According to the presence of genetic variations, we assigned the sequence reads to two alleles. Within the assigned reads, we identified 355,941 CpG sites with at least 10Xs read coverage. With the logistic regression classifier trained previously, a total of 12,264 CpG sites were predicted to be associated with cSASM events. Notably, 6,273 (51.5% out of 12,264) predicted cS-ASM sites are within 2 kb from each other and can be merged into 522 regions. To examine whether the cS-ASM prediction is reliable, we asked how many predicted cSASM CpG sites showed significant methylation bias between the two alleles. Following the procedure described in the previous study [5], we performed Fisher’s exact test for the 355,941 CpG sites with at least 10Xs read coverage and identified 55,556 CpG sites to be allelic specific methylated. The FDR was estimated to be 1.4% with permutation test. After filtering the P-ASM sites in 67 known imprinting regions [19], a total of 55,531 CpG sites were identified as S-ASM CpG sites. Due to the lack of SNP information for parental genome, these S-ASM CpG sites cannot be further assigned to two alleles. We next asked whether these S-ASM CpG sites overlap cS-ASM clusters predicted by our machine learning approach. In the mouse genome, 9,030 S8 / 26
ACCEPTED MANUSCRIPT ASM sites (6.9% of 131,765) were determined as cS-ASM. In the human brain methylome, we found that 2,267 (4.1% of 55,531) of these S-ASM CpG sites were included in the list of 12,264 cS-ASM sites we predicted. In addition, the enrichment of S-ASM sites in the predicted cS-ASM
PT
regions were significant (Fisher’s exact test, p-value=8.57E-13). We further applied such an analytical procedure on two additional human brain methylomes (GSM913596 and GSM913597).
RI
It has been shown that the read coverage has a significant impact on SNP identification using
SC
Bis-SNP software [8]. As expected, we observed a significant decrease in the number of cS-ASM predicted with the methylomes of low read coverage (Table S2).
NU
To explore the functional relevance of cS-ASM events, we identified cS-ASM associated genes in human and mouse brains, which were defined as genes hosting cS-ASM regions within
MA
the gene body or within 10 kb from TSSs. From the human brain methylome with better reads coverage, we identified 1,950 genes associated with cS-ASM predicted (Table S3A). From mouse brain methylomes, a total of 845 genes associated cS-ASM CpG sites [5] were identified
D
(Table S3B). Using NCBI DAVID software [25], we found that the cS-ASM associated genes
TE
are significantly enriched in biological process involved regulation of small GTPase mediated signal transduction (p-value=5.05E-8, Bonferroni adjusted p-value=1.24E-04) and in biological
AC CE P
function associated with GTPase regulator activity (p-value=4.78E-7, Bonferroni adjusted pvalue=3.65E-04). In addition, over 47% of human genes and 35% of mouse genes associated with cS-ASM events, code for transcripts with alternative splicing forms (Table S3A, B). This suggests genetic variations may result in epigenetic differences between alleles within brain tissues and eventually lead to heterogeneity in brain transcriptome. 2.4 Discussion
In this study, we analyzed ASM associated CpG sites identified in mouse methylomes to determine the molecular characteristics of P-ASM, cS-ASM and sS-ASM CpG sites. The methylation levels of P-ASM CpG sites are in a narrow range centered on 0.5 and 4-CpG segments containing P-ASM CpG sites have low weighted methylation pattern entropy. Both PASM and cS-ASM CpG sites tend to be enriched in promoter regions, exons, CGI and CGI shores, while sS-ASM CpG sites are enriched in simple repeats and segmental duplication regions. Out of 282 features examined, over 220 features show significant differences between cS-ASM CpG sites and controls, but only a handful of features show significantly differences between sS-ASM and control CpG sites. Although these features could be mammalian “brain9 / 26
ACCEPTED MANUSCRIPT tissue-specific”, the procedure developed in this study may be applied on other tissues and species. It has long been an interest to identify ASM events in mammalian genomes. Particularly,
PT
aberrant imprinting has been associated with many human diseases [26-28], and recently, cSASM has been shown to correlate tightly with allele-specific gene expression [5]. Experimental
RI
identification of ASM events requires the generation of ultra-deep sequencing data from the
SC
genomes of many individuals (human) or hybrids of distantly related strains (animal). Two ASM prediction methods [29, 30] have been developed and used independent Bernoulli distribution to
NU
model the methylation status of CpG sites within a genomic segment and an expectationmaximization (EM) algorithm to estimate the allele of origin. In this study, we systematically
MA
analyzed the methylation profile, trans-factor binding pattern, genomic structure and DNArelated attributes. We relaxed the equal allele frequency assumption made in the aforementioned prediction methods [29, 30] but exploited a set of features determined with the LGLR procedure
D
for machine learning. In addition, we incorporated recent advances on P-ASM identification [19]
TE
into the filtering procedure and focused on the modeling of cS-ASM. The machine learning based methods involve careful training data selection and intensive
AC CE P
classifier training. A trained classifier effectively produces a 'black-box' prediction and is vulnerable to the biases and errors that the training data might have. In this study, we took advantage of the high-throughput bisulfite sequencing data at base-resolution that was generated with reciprocal crosses between two distantly related inbred mouse strains [5]. With 131,765 CpG dinucleotides already identified as sequence-dependent ASM, we were able to obtain detailed methylation pattern distribution and genomic features for ASM. Logistic regression classifier was found to have the best performance for cS-ASM prediction. Such a classifier depends on the features collected. In some scenarios, collecting a full set of features may not be feasible, in particular for trans-factor binding information including histone modifications. In this study, we used mouse data to train classifiers. When applied to other species, the prediction accuracy may be limited if species-specific epigenetic regulatory mechanisms exist. Despite these limitations, we successfully applied the classifier on the human adult brain and predicted 12,264 cS-ASM sites, which covered 2,267 S-ASM CpG identified with Fisher’s exact test. Worthy of noting, Bis-SNP requires 30Xs coverage data to obtain 96% efficiency for SNP identification. The number of S-ASM sites in human brain may be significantly under-estimated.
10 / 26
ACCEPTED MANUSCRIPT 3. Conclusions In summary, our analyses provide new knowledge to ASM classification and a powerful classifier for cS-ASM prediction. The procedure for DNA methylation pattern analysis outlined
PT
in this study may uncover novel information embedded in methylation data and also provides a
RI
new dimension to understand the interactions between genetic and epigenetic mutations. Although we cannot completely rule out the cell-subset specific methylation and strand-specific
SC
methylation events, the combination of the computational approach developed in this study and the genome-wide hairpin bisulfite sequencing technique [31] will resolve such limitation in
NU
future. 4. Materials and methods
MA
4.1 Datasets
Genome-wide base-resolution MethylC-Seq datasets (including GSM830248, GSM830249,
D
GSM753569 and GSM753570) of mouse frontal cortex from reciprocal crosses between two
TE
distantly related inbred strains (Cast and 129) were downloaded from NCBI Gene Expression Ominibus [5]. In total, 1,952 CpG sites from P-ASM and 131,765 CpG sites from S-ASM
AC CE P
identified in a previous study were extracted. Control datasets were randomly selected from the pool of 5,925,555 CpG sites with at least 10Xs reads coverage, in which completely methylated or unmethylated (721,356), P-ASM (1,952) and S-ASM (131,765) CpG sites were excluded. 4.2 Feature extraction and evaluation Genomic features were obtained from the UCSC Genome database [32], including the annotations for gene structure, CGI, repetitive elements (RepeatMasker), segmental duplication, SNP (snp128), and phastCons scores for placental mammal subset. Promoters were defined as 2kb regions in the upstream of transcription start sites (TSSs). As a feature representing genomic variation, SNP frequencies within 100bp flanking the ASM sites were determined. DNA-related attributes including GC content, CpG density, CpG obs/exp and frequency of a total of 256 sequence motifs (4-mer) were extracted from the 1kb region flanking ASM site. The binding regions of histone modification H3K27ac, H3K4me1 and H3K4me3 for mouse frontal cortex were obtained from ENCODE Project [33]. CpG methylation levels were calculated as the fraction of methylcytosine calls from all reads covering the CpG sites. As a feature representing
11 / 26
ACCEPTED MANUSCRIPT epigenetic variation, the average weighted methylation entropy of the 2kb flanking ASM site was determined. Weighted methylation entropy was calculated according to the formula from a previous study [5] with the adjustment to accommodate the methylation transitions for four
PT
neighboring CpG sites (Figure S3). All features are encoded as: 1) continuous numbers, including methylation level, weighted methylation entropy and phastCons score and so forth; or
RI
2) binary numbers, i.e. the presence/absence in features such as gene structure, CGI, repetitive
SC
elements, and etc.
Lasso-grouped logistic regression (LGLR) and SVM-based RFE were used to rank all
NU
available features and to select the significant ones for classification. For LGLR, the tuning parameter was scaled by grid and determined based on the classification performance, i.e. AUC
MA
value. To rank predictors, logistic model was re-trained 10 times to sort the average absolute value of coefficient and to exclude the predictors with zero value, which are less valuable in ASM prediction. In a sequentially backward elimination manner, SVM-based RFE ranks the
D
features by the change in objective function when removing one feature. The ranking iteration
TE
will be terminated when all features are ranked [34]. Notably, with eliminating a feature in each step of the SVM-RFE procedure, the error rate caused by the eliminating feature was determined
AC CE P
by an independent testing dataset in contrast to the training dataset. The SVM-RFE procedure was summarized as follows:
1. Start with an empty ranked features list R=[] and the feature list F=[1,2,…,N]; 2. Sequential backward elimination: a) Use all features in F to train a SVM classifier; b) Compute the ranking criteria using the following formulation for the features in F; C w , w a y x , for all i as support vector. c) Select the feature with the smallest ranking criterion: S f argmin f C f . d) Add the selected feature sf into the ranked features list R set: R=[ S f , R]; e) Remove the feature sf from the feature list F: F=F-[ S f ]; 3. Repeat step (2) until all features are ranked. 4. Output list R with ranked features. i
2 i
i
k
k
k
k
The Wilcox Rank Sum test and the Fisher’s exact test were used to evaluate the significance of features, continuous numbers and binary numbers, respectively. Bonferroni correction was performed to control FWER, and Benjamini and Hochberg correction (BH correction) was used to control FDR [35, 36]. 4.3 Weighted methylation entropy 12 / 26
ACCEPTED MANUSCRIPT The formula for weight methylation entropy was a modified version of methylation entropy [17]. “wi” was used to assess the methylation transitions in between neighboring CpG sites within a sequence read. More specifically, for a 4-CpG segment, “wi” was defined as the number
PT
of methylation transitions between the four neighboring CpG sites within a read divided by 3. For examples, for a 4-CpG segment with a methylation pattern of ‘1010’ or ‘0101’, all four CpG
RI
sites are with a methylation state different from its neighbors and the weighted methylation
SC
entropy is 1. In such a case, the methylation transitions along the four neighboring CpG sites occur three times, which is the maximal methylation transitions may occur in 4-CpG segments.
along the four CpG sites and thus wi is 2/3.
MA
4.4 Machine learning techniques
NU
For a 4-CpG segment with a methylation pattern as ‘1011’, there are two methylation transitions
Machine learning techniques explored in this study include Bayesian network, Naïve Bayes,
D
SVM, Decision tree, Lasso-Grouped Logistic Regression (LGLR) and logistic regression. A library for Support Vector Machines (LIBSVM) was used to implement support vector machines
TE
(SVM) [37]. Based on SVM with RBF kernel, LIBSVM was implemented to perform the
AC CE P
classification of ASM. Java package Weka [38] was used to implement four classifiers, based on the Bayesian network, J48, naïve Bayes and logistic regression, respectively. To perform LGLR, group lasso logistic regression (R package, grplasso) was used and the parameters were determined based on a grid algorithm [39]. 4.5 Evaluation of classifiers
The performance of each classifier was evaluated using a standard 5-fold cross-validation. In order to cope with unbalanced positive and negative training datasets, we randomly selected an equal number of positive and negative training sets in each class. In the 5-fold cross-validation, the datasets consisted of positive and negative sets that were uniformly partitioned into 5 subsets. Each of the subsets had an equal ratio of positive and negative sets. Each classifier was trained 5 times, each time using 4 subsets for training and the remaining 5th subset for testing. The predictive performance was assessed with the evaluation indicators including Receiver Operating Characteristic curve (ROC curve), accuracy (ACC), Matthew’s Correlation coefficient (MCC), the Area Under the ROC Curve (AUC) [40] and positive predictive value (PPV) [41]. The AUC value indicates prediction accuracy (Bhasin, et al., 2005), e.g. from 0.9 to 1 (excellent), from 0.7
13 / 26
ACCEPTED MANUSCRIPT to 0.9 (good) and from 0.5 to 0.7 (poor). The threshold-dependent parameters sensitivity (SE), specificity (SP), ACC, MCC and PPV were calculated by the following formulations, respectively [20].
PT
SE TP / (TP FN ) , SP TN / (TN FP) , ACC (TP TN ) / (TP TN FN FP) , PPV TP / (TP FP) ,
.
RI
MCC ((TP * TN ) (FN * FP)) / (TP * FN ) *(TN * FP) *(TP * FP) *(TN * FN )
SC
Where TP is true positive (ASM sites predicted accurately); FN is false negative (non-ASM sites predicted inaccurately); TN is true negative (non-ASM sites predicted accurately); and FP is
NU
false positive (ASM sites predicted inaccurately). In addition to AUC, we calculated the area under the ROC curve using discrete numerical integration.
MA
4.6 Independent validation with human fetal brain methylome The human brain methylomes (GSM913595, GSM913596 and GSM913597) were
D
downloaded from the GEO database [23]. Sequencing reads were mapped to the hg19 reference genome using the Bismark program [42]. The Bis-SNP program was used to call the SNP
TE
embedded Bisulfite dataset [8] using the reads that were mapped with high quality (Phred
AC CE P
score >= 20). SNPs with a minimum score of 20 were used to assign SNP-associated sequence reads to two alleles. CpG sites within each assigned read were then identified. Allele-specific methylated CpG sites were determined based on the Fisher’s exact test in a 2x2 contingency table for the numbers of methylated and unmethylated cytosines identified for two alleles to test whether a specific methylation event occurrences in two alleles. False discovery rate (FDR) is estimated via permuting the allelic origins of all reads used to identify ASM CpG sites, i.e. randomly assigning reads to two arbitrary alleles [5]. S-ASM sites with FDR less than 0.05 were identified and further filtered with known imprinted sites. The logistic regression classifier trained with the mouse dataset was applied on human brain methylomes to predict cS-ASM. The threshold of logistic regression classifier was optimized to achieve the maximal PPV and smaller FPR. The predicted cS-ASM sites within a 2kb range were clustered into cS-ASM regions with the minimum cluster size as 5 CpG sites. 4.7 Gene functional analyses According to the UCSC gene annotation, gene region was defined as genomic region from transcription start site (TSS) to transcription termination site and including the proximal 14 / 26
ACCEPTED MANUSCRIPT promoter region (up to 10kb upstream of TSS). Genes overlapped with the predicted cS-ASM CpG sites were subjected to GO term enrichment analysis with functional annotation tools in the DAVID Bioinformatics Resources [25]. After Bonferroni adjustment, the significance p-value
PT
threshold was set as 0.05.
RI
Competing interests
SC
The authors declare no competing interests. Acknowledgements
NU
This work was supported by the VBI new faculty startup fund for H.X. and the National Science Foundation of China [Grant number 81270633]. We thank Dr. Wei Xie (School of Life Sciences,
MA
Tsinghua University) for providing mouse ASM dataset. Contributions
D
H.X. designed and supervised the research with assistance from Q.L.; J.H. and M.S. analyzed the
TE
data and developed analytical procedure with help from Z.W., Q.W., and Q.L.; J.H. and M.S. contributed equally to this work; J.H., M.S. and H.X. drafted the manuscript and all authors
AC CE P
discussed the results, read and approve the manuscript.
15 / 26
ACCEPTED MANUSCRIPT REFERENCE
6.
7.
8. 9.
10. 11.
12. 13.
14. 15.
16.
PT
RI
SC
NU
5.
MA
4.
D
3.
TE
2.
Kerkel K, Spadola A, Yuan E, Kosek J, Jiang L, Hod E, Li K, Murty VV, Schupf N, Vilain E et al: Genomic surveys by methylation-sensitive SNP analysis identify sequence-dependent allele-specific DNA methylation. Nat Genet 2008, 40(7):904-908. John RM, Lefebvre L: Developmental regulation of somatic imprints. Differentiation; research in biological diversity 2011, 81(5):270-280. Reik W, Walter J: Genomic imprinting: Parental influence on the genome. Nat Rev Genet 2001, 2(1):21-32. Shoemaker R, Deng J, Wang W, Zhang K: Allele-specific methylation is prevalent and is contributed by CpG-SNPs in the human genome. Genome Res 2010, 20(7):883-889. Xie W, Barr CL, Kim A, Yue F, Lee AY, Eubanks J, Dempster EL, Ren B: Baseresolution analyses of sequence and parent-of-origin dependent DNA methylation in the mouse genome. Cell 2012, 148(4):816-831. Schalkwyk LC, Meaburn EL, Smith R, Dempster EL, Jeffries AR, Davies MN, Plomin R, Mill J: Allelic Skewing of DNA Methylation Is Widespread across the Genome. Am J Hum Genet 2010, 86(2):196-212. Paliwal A, Temkin AM, Kerkel K, Yale A, Yotova I, Drost N, Lax S, Nhan-Chang CL, Powell C, Borczuk A et al: Comparative Anatomy of Chromosomal Domains with Imprinted and Non-Imprinted Allele-Specific DNA Methylation. Plos Genet 2013, 9(8). Liu YP, Siegmund KD, Laird PW, Berman BP: Bis-SNP: Combined DNA methylation and SNP calling for Bisulfite-seq data. Genome Biol 2012, 13(7). Fang F, Hodges E, Molaro A, Dean M, Hannon GJ, Smith AD: Genomic landscape of human allele-specific DNA methylation. P Natl Acad Sci USA 2012, 109(19):73327337. Peng Q, Ecker JR: Detection of allele-specific methylation through a generalized heterogeneous epigenome model. Bioinformatics 2012, 28(12):I163-I171. Consortium EP, Birney E, Stamatoyannopoulos JA, Dutta A, Guigo R, Gingeras TR, Margulies EH, Weng Z, Snyder M, Dermitzakis ET et al: Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature 2007, 447(7146):799-816. Zentner GE, Tesar PJ, Scacheri PC: Epigenetic signatures distinguish multiple classes of enhancers with distinct cellular functions. Genome research 2011, 21(8):1273-1283. Creyghton MP, Cheng AW, Welstead GG, Kooistra T, Carey BW, Steine EJ, Hanna J, Lodato MA, Frampton GM, Sharp PA et al: Histone H3K27ac separates active from poised enhancers and predicts developmental state. Proceedings of the National Academy of Sciences of the United States of America 2010, 107(50):21931-21936. Constancia M, Pickard B, Kelsey G, Reik W: Imprinting mechanisms. Genome research 1998, 8(9):881-900. Irizarry RA, Ladd-Acosta C, Wen B, Wu Z, Montano C, Onyango P, Cui H, Gabo K, Rongione M, Webster M et al: The human colon cancer methylome shows similar hypo- and hypermethylation at conserved tissue-specific CpG island shores. Nature genetics 2009, 41(2):178-186. Doi A, Park IH, Wen B, Murakami P, Aryee MJ, Irizarry R, Herb B, Ladd-Acosta C, Rho J, Loewer S et al: Differential methylation of tissue- and cancer-specific CpG island
AC CE P
1.
16 / 26
ACCEPTED MANUSCRIPT
22.
23.
24.
25. 26.
27.
28. 29.
30. 31.
PT
RI
SC
NU
21.
MA
20.
D
19.
TE
18.
AC CE P
17.
shores distinguishes human induced pluripotent stem cells, embryonic stem cells and fibroblasts. Nature genetics 2009, 41(12):1350-1353. Xie H, Wang M, de Andrade A, Bonaldo Mde F, Galat V, Arndt K, Rajaram V, Goldman S, Tomita T, Soares MB: Genome-wide quantitative assessment of variation in DNA methylation patterns. Nucleic acids research 2011, 39(10):4099-4108. Lister R, Pelizzola M, Dowen RH, Hawkins RD, Hon G, Tonti-Filippini J, Nery JR, Lee L, Ye Z, Ngo QM et al: Human DNA methylomes at base resolution show widespread epigenomic differences. Nature 2009, 462(7271):315-322. Court F, Tayama C, Romanelli V, Martin-Trujillo A, Iglesias-Platas I, Okamura K, Sugahara N, Simon C, Moore H, Harness JV et al: Genome-wide parent-of-origin DNA methylation analysis reveals the intricacies of human imprinting and suggests a germline methylation-independent mechanism of establishment. Genome research 2014, 24(4):554-569. Bhasin M, Zhang H, Reinherz EL, Reche PA: Prediction of methylated CpGs in DNA sequences using a support vector machine. Febs Lett 2005, 579(20):4302-4308. Feltus FA, Lee EK, Costello JF, Plass C, Vertino PM: Predicting aberrant CpG island methylation. Proceedings of the National Academy of Sciences of the United States of America 2003, 100(21):12253-12258. Penny KI, Chesney T: Imputation methods to deal with missing values when data mining trauma injury data. ITI 2006: Proceedings of the 28th International Conference on Information Technology Interfaces 2006:213-218. Zeng J, Konopka G, Hunt BG, Preuss TM, Geschwind D, Yi SV: Divergent WholeGenome Methylation Maps of Human and Chimpanzee Brains Reveal Epigenetic Basis of Human Regulatory Evolution. Am J Hum Genet 2012, 91(3):455-465. Bernstein BE, Stamatoyannopoulos JA, Costello JF, Ren B, Milosavljevic A, Meissner A, Kellis M, Marra MA, Beaudet AL, Ecker JR et al: The NIH Roadmap Epigenomics Mapping Consortium. Nature biotechnology 2010, 28(10):1045-1048. Huang DW, Sherman BT, Lempicki RA: Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat Protoc 2009, 4(1):44-57. Cheng YW, Idrees K, Shattock R, Khan SA, Zeng Z, Brennan CW, Paty P, Barany F: Loss of imprinting and marked gene elevation are 2 forms of aberrant IGF2 expression in colorectal cancer. International journal of cancer Journal international du cancer 2010, 127(3):568-577. Deng T, Kuang Y, Zhang D, Wang L, Sun R, Xu G, Wang Z, Fei J: Disruption of imprinting and aberrant embryo development in completely inbred embryonic stem cell-derived mice. Development, growth & differentiation 2007, 49(7):603-610. Xu YQ, Grundy P, Polychronakos C: Aberrant imprinting of the insulin-like growth factor II receptor gene in Wilms' tumor. Oncogene 1997, 14(9):1041-1046. Fang F, Hodges E, Molaro A, Dean M, Hannon GJ, Smith AD: Genomic landscape of human allele-specific DNA methylation. Proceedings of the National Academy of Sciences of the United States of America 2012, 109(19):7332-7337. Peng Q, Ecker JR: Detection of allele-specific methylation through a generalized heterogeneous epigenome model. Bioinformatics 2012, 28(12):i163-171. Zhao L, Sun MA, Li Z, Bai X, Yu M, Wang M, Liang L, Shao X, Arnovitz S, Wang Q et al: The dynamics of DNA methylation fidelity during mouse embryonic stem cell self-renewal and differentiation. Genome research 2014, 10.1101/gr.163147.113.
17 / 26
ACCEPTED MANUSCRIPT
37. 38. 39. 40. 41. 42.
PT
RI
SC
NU
36.
MA
35.
D
34.
TE
33.
Kuhn RM, Haussler D, Kent WJ: The UCSC genome browser and associated tools. Brief Bioinform 2013, 14(2):144-161. Birney E, Stamatoyannopoulos JA, Dutta A, Guigo R, Gingeras TR, Margulies EH, Weng ZP, Snyder M, Dermitzakis ET, Stamatoyannopoulos JA et al: Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature 2007, 447(7146):799-816. Das R, Dimitrova N, Xuan ZY, Rollins RA, Haghighi F, Edwards JR, Ju JY, Bestor TH, Zhang MQ: Computational prediction of methylation status in human genomic sequences. P Natl Acad Sci USA 2006, 103(28):10713-10716. Benjamini Y, Hochberg Y: Controlling the False Discovery Rate - a Practical and Powerful Approach to Multiple Testing. J Roy Stat Soc B Met 1995, 57(1):289-300. Bock C, Paulsen M, Tierling S, Mikeska T, Lengauer T, Walter J: CpG island methylation in human lymphocytes is highly correlated with DNA sequence, repeats, and predicted DNA structure. PLoS genetics 2006, 2(3):243-252. Chang CC, Lin CJ: LIBSVM: A Library for Support Vector Machines. Acm T Intel Syst Tec 2011, 2(3). Frank E, Hall M, Trigg L, Holmes G, Witten IH: Data mining in bioinformatics using Weka. Bioinformatics 2004, 20(15):2479-2481. Meier L, van de Geer SA, Buhlmann P: The group lasso for logistic regression. J R Stat Soc B 2008, 70:53-71. Zhou X, Li ZC, Dai Z, Zou XY: Prediction of methylation CpGs and their methylation degrees in human DNA sequences. Comput Biol Med 2012, 42(4):408-413. Altman DG, Bland JM: Diagnostic-Tests-2 - Predictive Values .4. Brit Med J 1994, 309(6947):102-102. Krueger F, Andrews SR: Bismark: a flexible aligner and methylation caller for Bisulfite-Seq applications. Bioinformatics 2011, 27(11):1571-1572.
AC CE P
32.
18 / 26
ACCEPTED MANUSCRIPT FIGURE LEGENDS: Figure 1 A workflow for ASM classification and prediction. The workflow includes four
PT
sections: data processing, features extraction, model evaluation and application. Figure 2 Characterization of genomic and methylation related features for P-ASM, cS-ASM and
RI
sS-ASM CpG sites. Barplots show the distribution of ASM CpG sites in (A) different genomic regions; (B) genomic loci with distinct histone modifications; (C) repeat elements; (D) CGIs and
SC
CGI shores. The co-enrichments at genomic loci were demonstrated as fold change compared with random sets plotted as dashed lines. The asterisk represents that the enrichment is
NU
significant (FWER-adjusted, p < 0.01). (E) Heatmap shows the enrichment of ASM CpG sites in
sites were grouped into five intervals.
MA
genomic loci with distinct features. According to DNA methylation level, the four sets of CpG
Figure 3 Machine learning prediction of cS-ASM and sS-ASM. (A&B) ROC curves illustrate
D
the performance of classifiers for the prediction cS-ASM and sS-ASM, respectively. (C&D) Box
AC CE P
respectively.
TE
plots show the distribution of AUC values via random sampling for cS-ASM and sS-ASM,
Figure 4 Feature selection based on LGLR and ROC curve of logistic regression classifier for cS-
ASM prediction. (A) Top 10 features selected by LGLR. The standard errors were obtained with ten independent trainings. (B) ROC curve illustrate the performance of logistic regression classifier based on 5-fold cross-validation. In (A) and (B), the positive dataset includes cS-ASM sites and the negative dataset includes P-ASM, sS-ASM and control CpG sites.
19 / 26
ACCEPTED MANUSCRIPT TABLE: Table 1 Top ten characteristic features of S-ASM CpG sites
PT
Methylation_level Repeat_Simplerepeat SNP_snp Methylation_weighted_entropy_4CGsegment Methylation_level_4CGsegment Repeat_Segmentduplication CGIandCGIshore_CGI Motif_TGCT TransActing_H3K4me1 Motif_GGTC
Methylation level of target CpG site control if or not in Simple Repeat sS-ASM Occurrence of SNP in 100 bp flanking region sS-ASM Weighted methylation entropy of the continuous 4 CG segment sS-ASM Methylation level of the continuous 4 CG segment control if or not in Segmental Duplication sS-ASM if or not in CpG island control Occurrence of TGCT in 1 kb flanking region sS-ASM if or not in H3K4me1 region sS-ASM Occurrence of GGTC in 1 kb flanking region control
Significance 2.23E-308 2.23E-308 2.23E-308 2.23E-308 2.23E-308 2.23E-308 2.23E-308 2.23E-308 2.23E-308 4.83E-262 2.23E-308 2.23E-308 1.08E-89 7.53E-27 7.89E-17 1.16E-16 1.16E-16 7.36E-10 1.59E-09 1.08E-08
MA
NU
SC
RI
Feature Description Methylation level of target CpG site CpG density calculated from 1 kb flanking region if or not in CpG island CpG Obs/Exp in 1 kb flanking region if or not in exon Occurrence of GCGC in 1 kb flanking region if or not in H3K27ac region if or not in H3K4me3 region Occurrence of CGCG in 1 kb flanking region Occurrence of CCGC in 1 kb flanking region
D
1 2 3 4 5 6 7 8 9 10
Higher Value for control cS-ASM cS-ASM cS-ASM cS-ASM cS-ASM cS-ASM cS-ASM cS-ASM cS-ASM
Feature Name Methylation_level CpGDensity_CpG CGIandCGIshore_CGI CpGDensity_CpGObs/Exp Gene_exon Motif_GCGC TransActing_H3K27ac TransActing_H3K4me3 Motif_CGCG Motif_CCGC
TE
Rank 1 2 3 4 5 6 7 8 9 10
Scattering S-ASM
Clustering S-ASM
Type of ASM
AC CE P
Note: For continuous features, Wilcoxon rank sum test with Bonferroni correction was used for multiple testing. For discrete features, Fisher’s test with Bonferroni correction was used. The rightmost column displays single-test p-values, which significance threshold was set as 1%. The significance threshold after multiple testing correction is 0.01/282 = 3.55E-05. Detailed information is provided in Table S1.
Table 2 Performance of classifiers for cS-ASM and sS-ASM prediction based on 5-fold cross-validation Classifier Logistic Regress LGLR SVM Bayesian Net J48 Tree Naïve Bayes
Threshold
Specificity
Sensitivity
AUC
ACC
MCC
cS-ASM
sS-ASM
cS-ASM
sS-ASM
cS-ASM
sS-ASM
cS-ASM
sS-ASM
cS-ASM
sS-ASM
cS-ASM
sS-ASM
0.50 0.42 -0.40 0.50 0.67 0.50
0.50 0.47 -0.10 0.50 0.67 0.50
0.78 0.71 0.74 0.71 0.79 0.68
0.67 0.68 0.70 0.69 0.61 0.59
0.78 0.84 0.77 0.71 0.79 0.68
0.67 0.67 0.65 0.69 0.61 0.59
0.85 0.84 0.82 0.78 0.79 0.75
0.73 0.73 0.73 0.75 0.61 0.61
0.78 0.77 0.76 0.71 0.79 0.68
0.67 0.67 0.67 0.69 0.61 0.60
0.55 0.55 0.52 0.42 0.58 0.37
0.34 0.35 0.35 0.39 0.22 0.18
20 / 26
ACCEPTED MANUSCRIPT ADDITIONAL SUPPLEMENTARY FILES Figure S1 Distributions of DNA sequence related attributes for cS-ASM, sS-ASM, P-ASM and
PT
control CpG sites. Density plots estimated by Gauss kernel were shown for (A) CpG density (B) CpG obs/exp and (C) GC content, respectively
RI
Figure S2 Distributions of (A) PhastCons scores; (B) SNP frequencies in 100bp flanking
SC
sequences for P-ASM, cS-ASM, sS-ASM and control CpG sites
Figure S3 The formula of weighted methylation entropy, the examples for genomic loci with
NU
various methylation entropies in a cell population and the distribution of weighted methylated entropy for ASM loci. (A) The formula for weighted methylation entropy. wi represents the
MA
weight of methylation pattern i, which is used to assess the methylation transitions in between neighboring CpG sites within a sequence read. (B–E) Genomic loci with various weighted methylation entropies. (F) The box plots show the distributions of weighted methylation entropy.
D
The X-axis represents the average methylation levels of 4-CpG segments associated with distinct
TE
ASM events, which are divided into ten intervals.
AC CE P
Figure S4 Predictor selection by LGLR and SVM-RFE. (A) The curve of error rates show the rank of features selected with SVM-based recursion feature elimination (SVM-RFE). (B) The curve of AUC values shows the classification performance over tuning parameters in lassogrouped logistic regression (LGLR). (C) Comparison of cS-ASM prediction using different number of predictors selected by SVM-RFE and LGLR. (D) Comparison of cS-ASM prediction with logistic regression classifier using the top 20 features selected by LGLR or SVM-RFE (Wilcoxon rank sum test, p-value=4.36E-3). Figure S5 The impact of CpG density on AUC. (A) The AUC values using control sets with different levels of CpG density. (B) The ROC curves using control sets with different levels of CpG density. Figure S6 Selection of the optimal threshold for logistic regression classifier. (A) Positive predictive values and (B) FPR (1-Specificity) were determined with different thresholds in logistic regression classifier. Table S1 A full list of features collected for machine learning modeling
21 / 26
ACCEPTED MANUSCRIPT Table S2 Summary of cS-ASM loci predicted from three human adult brain methylomes Table S3 GO functional analysis of the genes associated with cS-ASM events in the human brain
AC CE P
TE
D
MA
NU
SC
RI
PT
methylomes
22 / 26
AC CE P
TE
D
MA
NU
SC
RI
PT
ACCEPTED MANUSCRIPT
Fig. 1
23 / 26
AC CE P
TE
D
MA
NU
SC
RI
PT
ACCEPTED MANUSCRIPT
Fig. 2
24 / 26
Fig. 3
AC CE P
TE
D
MA
NU
SC
RI
PT
ACCEPTED MANUSCRIPT
25 / 26
TE
D
MA
NU
SC
RI
PT
ACCEPTED MANUSCRIPT
AC CE P
Fig. 4
Highlights •
Three kinds of allele-specific DNA methylation events, P-ASM, sS-ASM and cS-ASM, have distinct characteristics
•
The combinations of a small set of DNA sequence, genomic and methylation related features are powerful in distinguishing different types of ASM
•
cS-ASM can be successfully predicted with logistic regression classifier
•
cS-ASM associated genes in brain are frequently the transcripts with alternative splicing forms
26 / 26