Genomics 87 (2006) 572 – 579 www.elsevier.com/locate/ygeno
DNA motifs associated with aberrant CpG island methylation F. Alex Feltus a,1 , Eva K. Lee a,b , Joseph F. Costello c , Christoph Plass d , Paula M. Vertino a,⁎ a
Department of Radiation Oncology and Winship Cancer Institute, Emory University School of Medicine, 1365-C Clifton Road. NE, Atlanta, GA 30322, USA b School of Industrial and Systems Engineering, Georgia Institute of Technology, Atlanta, GA 30332, USA c The Brain Tumor Research Center and Department of Neurological Surgery, UCSF, San Francisco, CA 94143, USA d Department of Medical Microbiology and Immunology, Division of Human Cancer Genetics, The Ohio State University, Columbus, OH 43210, USA Received 29 August 2005; accepted 21 December 2005 Available online 17 February 2006
Abstract Epigenetic silencing involving the aberrant methylation of promoter region CpG islands is widely recognized as a tumor suppressor silencing mechanism in cancer. However, the molecular pathways underlying aberrant DNA methylation remain elusive. Recently we showed that, on a genome-wide level, CpG island loci differ in their intrinsic susceptibility to aberrant methylation and that this susceptibility can be predicted based on underlying sequence context. These data suggest that there are sequence/structural features that contribute to the protection from or susceptibility to aberrant methylation. Here we use motif elicitation coupled with classification techniques to identify DNA sequence motifs that selectively define methylation-prone or methylation-resistant CpG islands. Motifs common to 28 methylation-prone or 47 methylation-resistant CpG island-containing genomic fragments were determined using the MEME and MAST algorithms (http://meme.sdsc.edu). The five most discriminatory motifs derived from methylation-prone sequences were found to be associated with CpG islands in general and were nonrandomly distributed throughout the genome. In contrast, the eight most discriminatory motifs derived from the methylation-resistant CpG islands were randomly distributed throughout the genome. Interestingly, this latter group tended to associate with Alu and other repetitive sequences. Used together, the frequency of occurrence of these motifs successfully discriminated methylation-prone and methylation-resistant CpG island groups with an accuracy of 87% after 10-fold crossvalidation. The motifs identified here are candidate methylation-targeting or methylation-protection DNA sequences. © 2005 Elsevier Inc. All rights reserved. Keywords: Epigenetics; DNA methylation; Discriminant analysis; Classification techniques; Repetitive DNA; Alu
Introduction CpG islands are short stretches (500–2000 bp) of genomic DNA enriched for the dinucleotide, 5′-CpG-3′, which is the substrate for methylation by DNA methyltransferases. While most CpG sites in the human genome are methylated, those in CpG islands are typically unmethylated in normal tissue. In human cancers, de novo methylation of CpG island sequences is accompanied by gene silencing and can serve as an alternative to mutation or deletion in the inactivation of tumor suppressor and other genes. Examples include the VHL gene in renal cell carcinomas [1], the RB gene in retinoblastomas [2], the cell ⁎ Corresponding author. E-mail addresses:
[email protected] (F.A. Feltus),
[email protected] (P.M. Vertino). 1 Current address: University of Georgia, AGTEC Building, Room 225, 111 Riverbend Road, Athens, GA 30605, USA. 0888-7543/$ - see front matter © 2005 Elsevier Inc. All rights reserved. doi:10.1016/j.ygeno.2005.12.016
cycle inhibitor CDKN2A in many epithelial cancers [3], the mismatch repair gene MLH1 in sporadic colon cancer [4], and CDH1 in breast, bladder, and prostate cancer [5]. In the case of CDH1, it has been demonstrated that monoallelic loss due to methylation-induced silencing provides the second genetic “hit” in hereditary gastric cancer [6]. Thus, aberrant methylation of CpG islands plays a critical role in the initiation and progression of cancer. At present, it is not known why some CpG islands succumb to aberrant methylation during tumor progression while others are protected from it. One possibility is that local sequence features contribute to the susceptibility to, or protection from, de novo methylation. For example, there is evidence to suggest that methylation can spread from repetitive DNA [7,8,9]. Unusual DNA structures (e.g., non-B-DNA) are also suggested to serve as targets of de novo methylation [10,11]. Alternatively, methylation may be mistargeted in cancer cells through physical interaction of DNA methyltransferases with sequence-specific
F.A. Feltus et al. / Genomics 87 (2006) 572–579
transcription factors such as RB [12] and PML-RAR [13]. Sp1-binding sites have been demonstrated to protect the murine APRT promoter from methylation [14,15]. Finally, there is mounting evidence that DNA-binding factors which function in the organization of chromatin domains, such as CTCF, also play a role in determining methylation status at imprinted genes and other loci [16–20]. Collectively, these studies suggest that there are local features which act in cis to influence the establishment and/or stability of DNA methylation patterns. The identification of such factors and their function will lead to a better understanding the molecular mechanisms underlying altered DNA methylation patterning in aging and carcinogenesis. Using a human cell culture model in which de novo methylation of CpG islands is driven by ectopic expression of DNMT1 [21], we recently showed that CpG island loci differ in their innate susceptibility to de novo methylation [22]. By applying DNA pattern recognition and supervised learning techniques, we found that methylation-prone and methylation-resistant CpG islands could be distinguished based on their underlying sequence context [22]. These data demonstrated that there is a sequence signature associated with susceptibility to, or protection from, aberrant methylation, at least as it occurs in the DNMT1 overexpression model. We now extend these studies in an initial attempt toward identifying cis-acting elements that influence the propensity for aberrant DNA methylation. Using a combined approach coupling motif elicitation to define DNA patterns and classification techniques, we identified a sequence signature based on the frequency of 13 DNA motifs that can discriminate methylation-prone or methylation-resistant CpG islands. It is hypothesized that these motifs may play a role in the local susceptibility of CpG islands to aberrant DNA methylation. Results Methylation-prone and methylation-resistant CpG island data sets In a previous study, we determined the methylation state of 1749 CpG islands in human fibroblasts clones overexpressing the DNA methyltransferase, DNMT1, using restriction landmark genomic scanning (RLGS) [22]. Methylation-prone (MP) loci were defined as those sequences that were consistently hypermethylated in 3 of 3 independent DNMT1 overexpressing clones but were never methylated in 3 vector-only control clones. Methylation-resistant (MR) CpG islands were defined as those that were never methylated in all 6 cell clones. The 4000bp genomic sequence circumscribing the CpG island center (2000 bp upstream and 2000 bp downstream) was determined for 28 MP and 47 MR CpG islands for which sequence information was available. These functionally defined CpG island fragments were used in motif analysis.
573
methylation-resistant CpG islands using the program, MEME (http://meme.sdsc.edu) [23]. MEME derives sequence motifs from a training set of sequences and builds a position-specific scoring matrix wherein there is a probability associated with the occurrence of each base at each position. Twenty-eight methylation-prone or 47 methylation-resistant CpG island sequences (4 kb centered on the CpG island; see Ref. [22]) were used as input into the MEME algorithm, and the first 20 “bestfit” motifs were obtained for each data set. These 40 motifs were then individually aligned to the entire data set of 75 sequences with the motif alignment program, MAST (http://meme.sdsc.edu) [24] to determine the number of occurrences of each motif in the methylation-prone and methylation-resistant data sets. Only those motif hits with a position-specific goodness-of-fit P value of less than 10e-6 were considered. A t test was then used to compare the frequency of each motif between methylation-prone and methylation-resistant CpG island data sets. Motifs with frequencies that were not significantly different between the methylation-prone and methylation-resistant groups (P N 0.01) were excluded from further analyses. This primary filter yielded 8 motifs derived from the methylation-resistant sequences (MR) and five motifs derived from the methylation-prone sequences (MP) (Table 1). The spatial positions of these 13 motifs within the CpG islands in both data sets are shown in Figs. 1 and 2, and the number of occurrences is listed in Table 1. The position-specific scoring matrices of these motifs are available as Supplemental Data and can be used to probe any DNA sequence set using MAST. Motif frequencies in the human genome We next examined the more global distribution of the motifs in CpG island and non-CpG island DNA. To accomplish this, we extracted all 2348 annotated CpG islands from human chromosome 1 (http://genome.ucsc.edu). The 4-kb CpG island genomic fragments were constructed by locating the center position of each CpG island and extracting 2000 bp of sequence flanking both sides of this position. In a similar manner, a control data set of 2348 random 4000-bp fragments was also generated. The occurrences of the Mr and Mp motifs was then determined in these data sets using MAST. The results in Table 2 show a strong association of methylation-prone motifs with CpG islands in general relative to random DNA (t test; P = 0.035). In contrast, the methylation-resistant motifs were similarly distributed in both CpG island and non-CpG island DNA (t test; P = 0.243). These data indicate that the MP motifs are nonrandomly distributed in the genome and are selectively associated with CpG island DNA. Interestingly, although randomly distributed throughout the genome, there appeared to be a strong association between the methylation-resistant motifs and Alu sequences (Fig. 1B). Methylation-prediction potential
Motif discovery To identify motifs associated with methylation susceptibility, we derived sequences common to the methylation-prone or the
Next we tested the potential of the 13 motifs to classify methylation-prone and methylation-resistant CpG islands using a linear-optimization-based discriminant analysis-supervised
574
F.A. Feltus et al. / Genomics 87 (2006) 572–579
Table 1 Summary of motif occurrences in 75 CpG island fragments Motif
MP-α MP-β MP-γ MP-δ MP-ε MR-α MR-β MR-γ MR-δ MR-ε MR-ζ MR-η MR-τ a b c
Length (bp)
20 21 29 21 11 29 15 21 37 28 29 21 21
Consensus sequence
Methylation prone (n = 28) a
GGCTGCGGGGGCAGCAGCTG AAGAAGGGAGAGAAGGAGGAA TCCTCTCCCTTGTCTTCCTCCTCCTCCTC GGGGTGGGGGAGGGGGAGGAG CTCTCCCAAGC TTTTTTTTTTTTTTTTTGAGACAGAGTCT GTCAGGAGTTTGAGA GCCCAGGCTGGAGTGCAGTGG GCCTGTAATCCCAGCTACTCAGGAGGCTGAGGCAGGA AAAGTGCTGGGATTACAGGCGTGAGCCA CTCACTGCAACCTCCGCCTCCCGGGTTCA TGATCCGCCCGCCTCGGCCTC CCAGCCTGGCCAACATGGTGA
Methylation resistant (n = 47) b
t test
Sum
μ
σ
Sum
μ
σ
pc
32 35 28 35 16 7 2 12 12 6 21 12 9
1.1 1.3 1.0 1.3 0.6 0.3 0.1 0.4 0.4 0.2 0.8 0.4 0.3
0.8 0.9 0.9 0.9 0.7 0.4 0.3 0.7 0.6 0.5 0.8 0.6 0.7
13 14 9 26 5 58 35 57 54 33 66 40 38
0.3 0.3 0.2 0.6 0.1 1.2 0.7 1.2 1.1 0.7 1.4 0.9 0.8
0.6 0.5 0.4 0.9 0.4 1.2 0.9 0.8 1.3 0.8 0.9 0.7 0.9
5.06E-06 7.34E-06 1.38E-04 2.07E-03 2.23E-03 2.87E-06 1.63E-05 2.67E-05 1.66E-03 2.19E-03 2.67E-03 5.73E-03 8.52E-03
Total number (Sum), mean (μ), and standard deviation (σ) of MAST hits (P b 10E-6) in 28 methylation-prone CpG island fragments (4000 bp). Total number (Sum), mean (μ), and standard deviation (σ) of MAST hits (P b 10E-6) in 47 methylation-prone CpG island fragments (4000 bp). Student's t test comparing the average number of MAST hits between the two CpG island classes.
learning method [22,25–27]. In this method, a numeric vector representing the number of occurrences of the 13 motifs is used as the input variable to develop a predictive rule. Two strategies were used to validate the method. In the first strategy, all 75 CpG island sequences served as the data set and the performance was estimated by 10-fold cross-validation. This analysis resulted in the correct classification of 22 of 28 methylation-prone (79%) and 41 of 47 (87%) methylation-resistant sequences, for an overall rate of correct classification of 84%. The second strategy was designed to gauge whether comparable rates of classification are obtained when a smaller training set is used to generate the rule. Fifteen methylation-prone and 15 methylation-resistant sequences were randomly selected as the data set and the accuracy was estimated by 10-fold cross-validation. The rate of correct classification was 79% for the methylation-prone and 94% for the methylation-resistant CpG islands. These same 30 sequences were then used as a training set to generate a classifier which was then applied to the remaining 45 CpG islands that had not been used in the training module. In this case, the rate of correct classification was 79% for the methylation-prone sequences and 81% for the methylation-resistant CpG islands. The supervised learning approach is therefore quite reliable in that by using less than half of the total sequences as the training set, the predictive power is very close to the unbiased estimate. The results of these validation tests suggest that a discriminant function derived from this linear-optimization-based discriminant analysis method, and exploiting these 13 motifs, can be expected to correctly predict sequences with an accuracy of ∼80%. In a final set of experiments, we used an unsupervised approach (PAM) in which the feature vector representing the number of occurrences of the 13 motifs was used as input to directly classify the 75 CpG island sequences. In this case, the sequences were classified with an accuracy of 93% (26 of 28 correct) for the methylation-prone and 79% (37 of 47 correct) for the methylation-resistant sequences, for an overall rate of correct classification of 84%. This correct classification rate is very
similar to that obtained from the supervised learning method above, reinforcing the discriminatory power of the 13 motifs. Discussion Gene silencing associated with aberrant methylation of promoter region CpG islands is one mechanism leading to tumor suppressor gene inactivation in human cancers. However, whereas this is a common mechanism of inactivation of some tumor suppressor genes (CDH1, MLH1, CDKN2A), there are others (e.g., VHL, TP53, MSH2) that that although subject to frequent genetic inactivation in cancer cells, are rarely affected by such epigenetic silencing. In this study, we describe the identification of sequence motifs that can reliably discriminate CpG island loci with different propensities for de novo methylation. Methylation-prone and methylation-resistant sequences could be discriminated based on the frequency of 13 novel sequence motifs with an overall accuracy of 84–87% in both supervised and unsupervised learning approaches. These data add further support to the notion that there is a sequence signature associated with the propensity toward aberrant methylation. The identification of such factors may contain clues to the mechanism of aberrant CpG island methylation that accompanies aging and carcinogenesis. In previous work, we showed that genomic loci differ in their innate susceptibility to aberrant methylation and that the propensity toward this aberrant methylation could be predicted based on the underlying sequence context. In that study, we used pattern recognition and supervised learning to identify short DNA sequence patterns that were capable of discriminating methylation-prone and methylation-resistant sequences with an accuracy of ∼82% in blind predictions [22]. Seven DNA patterns (TCCCCCNC; TTTCCTNC; TCCNCCNCCC; GGAGNAAG; GAGANAAG; GCCACCCC; GAGGAGGNNG) were selected as having the highest discrimination potential between the two CpG island classes. Here we used a motif
F.A. Feltus et al. / Genomics 87 (2006) 572–579
575
Fig. 1. Spatial distribution of the 5 methylation-prone motifs in all 75 CpG island loci. MEME-derived position-specific probability matrices were used to identify motifs occurrences in the methylation-prone (top) or methylation-resistant (bottom) CpG island fragments (4000 bp) using the MAST algorithm. In each case the region encompassing the CpG island is indicated by a gray box. Methylation-prone (MP) motifs and Alu elements (Alu) are shown according to the color legend.
elicitation protocol, MEME, to identify motifs associated with a propensity toward aberrant methylation. The MEME algorithm does not identify simple consensus sequences but rather derives probability matrices wherein there is a probability associated
with each base occurring at each position [23]. The sequence motifs described here thus differ from the short sequence patterns derived in our previous studies in that they are longer (11–37 bp) and represent weighted probability matrices instead
576
F.A. Feltus et al. / Genomics 87 (2006) 572–579
Fig. 2. Spatial distribution of the 7 methylation-resistant motifs in all 75 CpG island loci. MEME-derived position-specific probability matrices were used to identify motifs occurrences in the methylation-prone (top) or methylation-resistant (bottom) CpG island fragments (4000 bp) using the MAST algorithm. In each case the region encompassing the CpG island is indicated by a gray box. Methylation-resistant (MR) motifs and Alu elements (Alu) are shown according to the color legend.
of exact match sequence expressions, allowing for a higher degree of sequence divergence. In addition, the motifs were derived based on their selective association with either methylation-prone or methylation-resistant sequences. This
becomes an important consideration as the factors that confer susceptibility to, or protection from, aberrant methylation are likely to be differentially represented in the two data sets. The inclusion of sequence divergence and increased length of
F.A. Feltus et al. / Genomics 87 (2006) 572–579
577
Table 2 Motif frequencies on chromosome 1 Motif
Length (bp)
CpG Islands a
Random DNA b
Consensus sequence
MP-α MP-β MP-γ MP-δ MP-ε MR-α MR-β MR-γ MR-δ MR-ε MR-ζ MR-η MR–τ
20 21 29 21 11 29 15 21 37 28 29 21 21
1234 1499 1565 1907 164 4630 1450 2664 5144 3062 3709 4566 2685
100 1221 630 468 60 6017 1678 2733 6330 3954 3363 4626 3316
GGCTGCGGGGGCAGCAGCTG AAGAAGGGAGAGAAGGAGGAA TCCTCTCCCTTGTCTTCCTCCTCCTCCTC GGGGTGGGGGAGGGGGAGGAG CTCTCCCAAGC TTTTTTTTTTTTTTTTTGAGACAGAGTCT GTCAGGAGTTTGAGA GCCCAGGCTGGAGTGCAGTGG GCCTGTAATCCCAGCTACTCAGGAGGCTGAGGCAGGA AAAGTGCTGGGATTACAGGCGTGAGCCA CTCACTGCAACCTCCGCCTCCCGGGTTCA TGATCCGCCCGCCTCGGCCTC CCAGCCTGGCCAACATGGTGA
a b
MAST hits (P b 10E-6) to 2348 CpG island-centered 4000-bp fragments from chromosome 1. MAST hits (P b 10E-6) to 2348 randomly generated 4000-bp fragments from chromosome 1.
the motifs described here should increase the probability of detecting biologically relevant DNA patterns. A similar approach was recently used to discover a motif signature predictive of imprinted genes [28]. Several genetic elements have been proposed to contribute to methylation propensity [7–20,29,30]. For example, Hasse and Schultz have shown that multiple fragments of the rat αfetoprotein gene conferred de novo methylation on a linked reporter gene when transiently transfected into F9 mouse embryonal carcinoma cells [31]. These regions all contained B1 and B2 SINEs, the murine equivalent of primate Alu elements, suggesting that repetitive DNA elements may serve as a nidus from which methylation can spread [8]. Similarly, methylation has been suggested to spread from human Alu repeats in the CDH1 [7] and TP53 genes [9]. Interestingly, 5/8 of the methylation-resistant motifs (MR-β, MR-τ, MR-δ, MR-ε, MR-ζ) show homology to Alu sequences and clustered within bona fide Alu elements surrounding the CpG islands (see Fig. 1B). These data also are consistent with the our previous observation that Alu elements are more frequently associated with the methylation-resistant than methylation-prone CpG islands [22]. The functional significance of this association is not entirely clear, but may reflect a selection against retrotransposition into methylation-prone regions during the Alu expansion or differential rates of Alu loss through inter-Alu recombination events [32]. Motifs associated with methylation propensity could also represent binding sites for proteins involved in the promotion or prevention of aberrant DNA methylation. It has been proposed that mistargeting of DNA methyltransferases by association with sequence specific transcription factors, such as RB [12] or the PML-RAR [13] fusion protein created chromosomal translocation in promyelocytic leukemias, could promote aberrant methylation in cancer cells. Although the methylation-prone motifs derived here do not obviously resemble E2F or RAR-binding sites, it remains formally possible that other DNA-binding factors may be involved. Likewise, Sp1 sites have been suggested to protect the APRT promoter from methylation [14,15]. More recently, binding of the insulator
protein CTCF has been shown to protect a linked transgene from heterochromatin-mediated extinction and subsequent de novo DNA methylation [33]. Indeed, the normal function of CTCF appears to be the establishment of boundaries between adjacent chromatin domains and the maintenance proper imprinting at the H19-Igf2 locus [34]. CTCF binding to a differentially methylated domain (DMD) upstream of the H19 gene is required to maintain the unmethylated state and proper expression of the maternal H19 allele [16,17]. CTCF-binding elements do not conform to a simple consensus owing to differential usage of its 12 zinc fingers for DNA binding and such sites must be determined empirically [35], so it is difficult to predict whether any of the current methylation-resistant motifs represent CTCF sites, but this is an area for further investigation. The idea that sequence context can influence aberrant methylation does not necessarily indicate the involvement of a sequence-specific binding factor. Alternatively, such motifs may reflect a common DNA secondary structure that is important methylation susceptibility, as suggested for the Fragile-X repeat [10], and it may be that similar secondary structures can be formed by dissimilar sequences. In this regard, the incorporation of a method that allows for sequence divergence via the use of motif profiles may be important for identifying structurally conserved, but sequence-divergent elements. It should be noted that the factors that influence susceptibility to aberrant methylation in the DNMT1 overexpression model are not necessarily the same as those that dictate de novo methylation occurring in human cancers. Whether the motif signature derived here will be more broadly applicable to sequences that are commonly methylated in cancer remains to be determined. One complication with such studies is how to define methylation propensity across cell types and systems. There are some genes like GSTP1 and VHL which are frequently methylated, but only in a single tumor type, and thus could be considered methylation prone or resistant depending on the situation. In this regard, it is noteworthy that the CpG islands that are methylation prone in the DNMT1 overexpression model tend to correspond to those that are methylated in multiple tumor
578
F.A. Feltus et al. / Genomics 87 (2006) 572–579
types rather than a tumor-type-specific manner in genome-wide studies in human cancers [42]. Nevertheless, the methods employed should be directly translatable to genome-wide CpG island methylation data sets from the analysis of human tumors (e.g., [42,43]). Our current efforts are focused on the identification of sequence features that reliably discriminate methylation-prone and -resistant CpG islands in tumor-type-specific classifiers, as well as development of sequence-feature-based algorithms for tumor classification. In summary, a set of 13 novel sequence motifs derived from methylation-prone or methylation-resistant CpG island sequences were discovered. Only the methylation-prone motifs were selectively associated with CpG islands, while the methylation-resistant motifs appeared to be randomly distributed along chromosome 1, presumably due to their strong association with Alu elements. Together, these 13 motifs were able to predict the methylation state of a CpG island ∼80% of the time. This work represents a first-generation strategy to discover candidate cis-acting methylation-susceptibility motifs using a genome-scanning approach and their effect upon CpG island methylation can be evaluated through transfection and transgenic approaches. A better understanding of the molecular mechanisms of CpG methylation susceptibility will enhance our ability to predict and possibly block epigenetic gene silencing in cancer.
best-fit motifs were obtained for each sequence set using the ZOOPs model with default parameters. The position-specific scoring matrix for each of the 40 motifs from both CpG island classes was then used to probe a database of the 75 CpG island sequences (28 methylation prone/47 methylation resistant) using the motif alignment program, MAST (http://meme.sdsc.edu) [24]. The frequency and position of all motif hits with a goodness-of-fit (P b 0.000001) were extracted from the MAST output file using custom Perl scripts. A Student’s t test was used to compare the frequency of occurrence of each motif between the methylation-prone and methylation-resistant data sets. Those motifs whose mean frequencies were significantly different at a threshold of P b 0.01 were selected for further analysis. Classification based upon supervised or unsupervised learning approaches was performed as previously described [22,25–27] and the accuracy of prediction was estimated by 10-fold cross-validation.
Materials and methods
References
Sequence data sets Methylation-prone and methylation-resistant CpG islands were identified as previously described [22]. Briefly, genome-wide CpG island methylation profiles were generated by Restriction Landmark Genomic Scanning. This technique is based on the two-dimensional separation of genomic NotI restriction fragments. The methylation state for each CpG island was detected as the presence or absence of DNA cleavage by the methylation-sensitive enzyme, NotI, which is almost exclusively found in CpG islands [36]. Methylation-prone CpG islands (66 total) were defined as those loci that were consistently methylated in multiple DNMT1 overexpressing clones but not in vector-only control clones. Methylation-resistant CpG islands were defined as those loci that were never methylated in any clone. Using a NotI/EcoRV boundary library [37] or an in silico restriction digest of chromosomal assemblies [38], sequence information was obtained for 28 methylation-prone and 47 methylation-resistant CpG islands. The extent of the CpG island was determined using the CpG report algorithm (EMBOSS [39]) which defines a CpG island as a region N200 bp, with a C + G content N60% and an observed over expected CpG frequency N0.6. For each of the 75 CpG islands in the study, the center of the CpG island was determined, and 2 kb of sequence information in either direction was extracted from the Celera [40] and the Human Genome Project [41] databases to give a uniform set of 4-kb sequence fragments centered on the CpG island. Human chromosome 1 CpG islands were extracted from the July 2003 freeze of the human genome (http://genome.ucsc.edu). CpG island start–stop positions (2348 total) were extracted from the UCSC annotations, and the center of the annotated CpG island was determined. Two kilobases flanking each side of the CpG island center was extracted using custom Perl scripts. A control data set of 2348 randomly generated 4000-bp fragments was also extracted from chromosome 1.
Motif discovery and classification Twenty-eight methylation-prone or 47 methylation-resistant CpG island fragments were used as input for MEME ([23]; http://meme.sdsc.edu). Twenty
Acknowledgments This work was supported by grants from the National Cancer Institute (2R01-CA077337, to PMV) and the National Science Foundation (DMI-0300435, to E.K.L.) and a postdoctoral fellowship from the American Cancer Society (PF-GMC-102929, to F.A.F.). Appendix A. Supplementary data Supplementary data associated with this article can be found in the online version at doi:10.1016/j.ygeno.2005.12.016.
[1] J.G. Herman, et al., Silencing of the VHL tumor-suppressor gene by DNA methylation in renal carcinoma, Proc. Natl. Acad. Sci. USA 91 (1994) 9700–9714. [2] T. Sakai, et al., Allele-specific hypermethylation of the retinoblastoma tumor-suppressor gene, Am. J. Hum. Genet. 48 (1991) 880–888. [3] A. Merlo, et al., 5′ CpG island methylation is associated with transcriptional silencing of the tumour suppressor p16/CDKN2/MTS1 in human cancers, Nat. Med. 1 (1995) 686–692. [4] M.F. Kane, et al., Methylation of the hMLH1 promoter correlates with lack of expression of hMLH1 in sporadic colon tumors and mismatch repairdefective human tumor cells, Cancer Res. 57 (1997) 808–811. [5] J.R. Graff, et al., E-cadherin expression is silenced by DNA hypermethylation in human breast and prostate carcinomas, Cancer Res. 55 (1995) 5195–5199. [6] W.M. Grady, et al., Methylation of the CDH1 promoter as the second genetic hit in hereditary diffuse gastric cancer, Nat. Genet. 26 (2000) 16–17. [7] J.R. Graff, J.G. Herman, S. Myohanen, S.B. Baylin, P.M. Vertino, Mapping patterns of CpG island methylation in normal and neoplastic cells implicates both upstream and downstream regions in the de novo methylation, J. Biol. Chem. 272 (1997) 22322–22329. [8] P.A. Yates, R.W. Burman, P. Mummaneni, S. Krussel, M.S. Turker, Tandem B1 elements located in a mouse methylation center provide a target for de novo DNA methylation, J. Biol. Chem. 274 (1999) 36357–36361. [9] A.N. Magewu, P.A. Jones, Ubiquitous and tenacious methylation of the CpG site in codon 248 of the p53 gene may explain its frequent appearance as a mutational hot spot in human cancer, Mol. Cell. Biol. 14 (1994) 4225–4232. [10] X. Chen, et al., Hairpins are formed by the single DNA strands of the fragile X triplet repeats: structure and biological implications, Proc. Natl. Acad. Sci. USA 92 (1995) 5199–5203. [11] A. Laayoun, S.S. Smith, Methylation of slipped duplexes, snapbacks and cruciforms by human DNA(cytosine-5)methyltransferase, Nucleic Acids Res. 23 (1995) 1584–1589.
F.A. Feltus et al. / Genomics 87 (2006) 572–579 [12] K.D. Robertson, et al., DNMT1 forms a complex with Rb, E2F1 and HDAC1 and represses transcription from E2F-responsive promoters, Nat. Genet. 25 (2000) 338–342. [13] L. Di Croce, et al., Methyltransferase recruitment and DNA hypermethylation of target promoters by an oncogenic transcription factor, Science 295 (2002) 1079–1082. [14] M. Brandeis, et al., Sp1 elements protect a CpG island from de novo methylation, Nature 371 (1994) 435–438. [15] D. Macleod, J. Charlton, J. Mullins, A.P. Bird, Sp1 sites in the mouse aprt gene promoter are required to prevent methylation of the CpG island, Genes Dev. 8 (1994) 2282–2292. [16] C.J. Schoenherr, J.M. Levorse, S.M. Tilghman, CTCF maintains differential methylation at the Igf2/H19 locus, Nat. Genet. 33 (2003) 66–69. [17] V. Pant, et al., The nucleotides responsible for the direct physical contact between the chromatin insulator protein CTCF and the H19 imprinting control region manifest parent of origin-specific long-distance insulation and methylation-free domains, Genes Dev. 17 (2003) 586–590. [18] A.C. Bell, G. Felsenfeld, Methylation of a CTCF-dependent boundary controls imprinted expression of the Igf2 gene, Nature 405 (2000) 482–485. [19] A.T. Hark, et al., CTCF mediates methylation-sensitive enhancer-blocking activity at the H19/Igf2 locus, Nature 405 (2000) 486–489. [20] C. Kanduri, et al., Functional association of CTCF with the insulator upstream of the H19 gene is parent of origin-specific and methylationsensitive, Curr. Biol. 10 (2000) 853–856. [21] P.M. Vertino, R.W. Yen, J. Gao, S.B. Baylin, De novo methylation of CpG island sequences in human fibroblasts overexpressing DNA (cytosine-5-)methyltransferase, Mol. Cell. Biol. 16 (1996) 4555–4565. [22] F.A. Feltus, E.K. Lee, J.F. Costello, C. Plass, P.M. Vertino, Predicting aberrant CpG island methylation, Proc. Natl. Acad. Sci. USA 100 (2003) 12253–12258. [23] T.L. Bailey, C.P. Elkan, Unsupervised learning of multiple motifs in biopolymers using expectation maximization, Machine Learn. J. 21 (1995) 51–83. [24] T.L. Bailey, M. Gribskov, Combining evidence using p-values: application to sequence homology searches, Bioinformatics 14 (1998) 48–54. [25] R.J. Gallagher, E.K. Lee, D. Patterson, Constrained discriminant analysis via 0/1 mixed integer programming, Ann. Operations Res. 74 (1997) 65–88. [26] R.J. Gallagher, E.K. Lee, D. Patterson, An optimization model for constrained discriminant analysis and numerical experiments with iris, thyroid, and heart disease datasets, Proc. AMIA Annu. Fall Symp. (1996) 209–213. [27] E.K. Lee, R. Gallagher, D. Patterson, A linear programming approach to discriminant analysis with a reserved judgement region, INFORMS J. Comput. 15 (2003) 23–41.
579
[28] Z. Wang, et al., Comparative sequence analysis of imprinted genes between human and mouse to reveal imprinting signatures, Genomics 83 (2004) 395–401. [29] P. Mummaneni, P.L. Bishop, M.S. Turker, A cis-acting element accounts for a conserved methylation pattern upstream of the mouse adenine phosphoribosyltransferase gene, J. Biol. Chem. 268 (1993) 552–558. [30] P. Mummaneni, K.A. Walker, P.L. Bishop, M.S. Turker, Epigenetic gene inactivation induced by a cis-acting methylation center, J. Biol. Chem. 270 (1995) 788–792. [31] A. Hasse, W.A. Schultz, Enhancement of reporter gene de novo methylation by DNA fragments from the alpha-fetoprotein control region, J. Biol. Chem. 269 (1994) 1821–1826. [32] M.A. Batzer, P.L. Deininger, Alu repeats and human genomic diversity, Nat. Rev. Genet. 3 (2002) 370–379. [33] V.J. Mutskov, C.M. Farrell, P.A. Wade, A.P. Wolffe, G. Felsenfeld, The barrier function of an insulator couples high histone acetylation levels with specific protection of promoter DNA from methylation, Genes Dev. 16 (2002) 1540–1554. [34] A. Lewis, A. Murrell, Genomic imprinting: CTCF protects the boundaries, Curr. Biol. 14 (2004) R284–R286. [35] R. Ohlsson, R. Renkawitz, V. Lobanenkov, CTCF is a uniquely versatile transcription regulator linked to epigenetics and disease, Trends Genet. 17 (2001) 520–527. [36] S. Lindsay, A.P. Bird, Use of restriction enzymes to detect potential gene sequences in mammalian DNA, Nature 327 (1987) 336–338. [37] C. Plass, et al., An arrayed human not I-EcoRV boundary library as a tool for RLGS spot analysis, DNA Res. 4 (1997) 253–255. [38] G. Zardo, et al., Integrated genomic and epigenomic analyses pinpoint biallelic gene inactivation in tumors, Nat. Genet. 32 (2002) 453–458. [39] S.A. Olson, EMBOSS opens up sequence analysis. European Molecular Biology Open Software Suite, Brief Bioinform. 3 (2002) 87–91. [40] J.C. Venter, et al., The sequence of the human genome, Science 291 (2001) 1304–1351. [41] E.S. Lander, et al., Initial sequencing and analysis of the human genome, Nature 409 (2001) 860–921. [42] T.T. Ching, A.K. Maunakea, P. Jun, C. Hong, G. Zardo, D. Pinkel, D.G. Albertson, J. Fridlyand, J.H. Mao, K. Shchors, W.A. Weiss, J.F. Costello, Epigenome analyses using BAC microarrays identify evolutionary conservation of tissue-specific methylation of SHANK3, Nat. Genet. 37 (2005) 645–651. [43] J.F. Costello, M.C. Fruhwald, D.J. Smiraglia, L.J. Rush, G.P. Robertson, X. Gao, F.A. Wright, J.D. Feramisco, P. Peltomaki, J.C. Lang, D.E. Schuller, L. Yu, C.D. Bloomfield, M.A. Caligiuri, A. Yates, R. Nishikawa, H. Su Huang, N.J. Petrelli, X. Zhang, M.S. O'Dorisio, W.A. Held, W.K. Cavenee, C. Plass, Aberrant CpG-island methylation has non-random and tumour-type-specific patterns, Nat. Genet. 24 (2000) 132–138.