Resource
Bivariate Genomic Footprinting Detects Changes in Transcription Factor Activity Graphical Abstract
Authors Songjoon Baek, Ido Goldstein, Gordon L. Hager
Correspondence
[email protected] (I.G.),
[email protected] (G.L.H.)
In Brief Baek et al. describe a method to identify changes in transcription factor activity based on telltale signs they leave on chromatin. This approach reliably predicts the activity of various transcription factors from chromatin accessibility information. The method provides an unbiased way to isolate relevant factors without assays targeting specific factors.
Highlights d
Only one-fifth of TF motifs show protection from DNA cleavage (‘‘footprint’’)
d
Enzyme-specific DNA cut bias profoundly and unavoidably affects footprint estimation
d
BaGFoot unbiasedly detects TF activity via TF-associated changes in DNA accessibility
d
BaGFoot overcomes DNA cut bias and is amenable to both DNase-seq and ATAC-seq
Baek et al., 2017, Cell Reports 19, 1710–1722 May 23, 2017 http://dx.doi.org/10.1016/j.celrep.2017.05.003
Cell Reports
Resource Bivariate Genomic Footprinting Detects Changes in Transcription Factor Activity Songjoon Baek,1,2 Ido Goldstein,1,2,* and Gordon L. Hager1,3,* 1Lab
of Receptor Biology and Gene Expression, The National Cancer Institute, NIH, Bethesda, MD 20892, USA authors contributed equally 3Lead Contact *Correspondence:
[email protected] (I.G.),
[email protected] (G.L.H.) http://dx.doi.org/10.1016/j.celrep.2017.05.003 2These
SUMMARY
In response to activating signals, transcription factors (TFs) bind DNA and regulate gene expression. TF binding can be measured by protection of the bound sequence from DNase digestion (i.e., footprint). Here, we report that 80% of TF binding motifs do not show a measurable footprint, partly because of a variable cleavage pattern within the motif sequence. To more faithfully portray the effect of TFs on chromatin, we developed an algorithm that captures two TF-dependent effects on chromatin accessibility: footprinting and motif-flanking accessibility. The algorithm, termed bivariate genomic footprinting (BaGFoot), efficiently detects TF activity. BaGFoot is robust to different accessibility assays (DNase-seq, ATAC-seq), all examined peak-calling programs, and a variety of cut bias correction approaches. BaGFoot reliably predicts TF binding and provides valuable information regarding the TFs affecting chromatin accessibility in various biological systems and following various biological events, including in cases where an absolute footprint cannot be determined. INTRODUCTION Regulation of gene expression by transcription factors (TFs) is fundamental to every aspect of cell biology. Because of their intimate involvement in virtually all biological processes and their well-documented participation in pathological conditions, TFs and their binding to DNA regulatory elements (i.e., enhancers) have been extensively studied (Voss and Hager, 2014). Enhancer regions span a few hundred bases and serve as a binding hub for various proteins—TFs, chromatin remodelers, histone- and DNA-modifying enzymes, etc. Enhancers are also accessible to nucleases and are hypersensitive to DNase digestion (i.e., DNase hypersensitive sites [DHSs], interchangeably termed peaks or hotspots). This property has been used to map the complete enhancer repertoire of cells in a genome-wide manner via partial digestion of DNA with DNase coupled with massive parallel sequencing (DNase-seq) (Boyle et al., 2008). Many tech-
niques that map the accessible regions of the genome by nuclease cleavage have been developed, including the widely used assay for transposable-accessible chromatin followed by sequencing (ATAC-seq) (Buenrostro et al., 2013). Changes in enhancer accessibility either between cell types or following a signal within the same cell population are indicative of changing enhancer activity. Quantifying a significant change in accessibility between two biological conditions is commonly used to isolate a subset of enhancers. The subset of differentially accessible enhancers is used to derive TF binding motifs that are statistically enriched in the subpopulation compared with total enhancers. The enriched motif might, in some cases, be indicative of increased TF binding. Although very useful and commonly used, this approach has critical pitfalls. First, an enrichment of a motif sequence is not direct evidence for TF binding. Second, such enrichment analyses often yield many motif sequences with very low p values. A p value cutoff is done randomly and cannot be determined empirically. Therefore, choosing TFs to follow with downstream experiments is challenging. Third, motif enrichment is strictly dependent on isolating a subset of enhancers based on bulk prominent changes in accessibility. This dramatically reduces the number of enhancers used for analyses (in most cases, 95%–99% of enhancers are excluded from analysis). In some cases, very few enhancers show significant changes for several reasons (e.g., poor sequencing quality, few replicates, or the nature of the biological signal that leads to minor, hard-to-detect changes), casting further doubts regarding the validity of findings. As detailed above, the entire sequence spanning an enhancer region is more DNase-accessible compared with non-regulatory regions of the genome. However, the few nucleotides that constitute the TF motif within the enhancer are sometimes locally protected from DNase when bound by the TF. This results in a relative lower DNase cleavage rate at the motif compared with other regions within the DHS. This TF-dependent local protection is termed a TF ‘‘footprint’’ (Galas and Schmitz, 1978) and has long served as a marker for TF binding and activity. In genome-wide chromatin accessibility assays, it became possible to find TF footprints by determining the relative decrease of DNase cuts at motif regions (Sung et al., 2016; Vierstra and Stamatoyannopoulos, 2016). This computational procedure, termed genomic footprinting, is increasingly applied, and several computational pipelines are available (Gusmao et al., 2016).
1710 Cell Reports 19, 1710–1722, May 23, 2017 This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
Genomic footprinting was initially regarded as an alternative way to measure TF binding that could replace more direct methods, such as chromatin immunoprecipitation followed by sequencing (ChIP-seq) (Neph et al., 2012). However, with the increasing application of genomic footprinting, several aspects of TF and DNase function were found to affect footprint features. First, the identification of individual footprint events (as opposed to motif-centered, genome-wide footprint patterns) has proven to be very challenging (Baek and Sung, 2016). The field has mostly shifted to determining aggregate footprints at the genome-wide level, represented as decreasing DNase cleavage events at the motif site compared with its flanking region. Second, inherent DNase cut bias was shown to fundamentally affect the assessment of footprinting capacity of some TFs. Notably, several TF footprints were found to be completely unrelated to TF binding but, rather, represent DNase cut bias for DNA and were thus termed ‘‘signatures’’ (He et al., 2014; Sung et al., 2014). Third, several TFs were shown to lack a footprint altogether despite prominent binding to DNA (Grøntved et al., 2015; He et al., 2014; Sung et al., 2014; Swinstead et al., 2016). The pitfalls in conventional motif enrichment analyses and the newly discovered layers of complexity in genomic footprinting warrant further advances in computational methods to determine TF activity from genomic accessibility data. Here we describe a computational approach that utilizes genome-wide accessibility data to detect changes in TF activity between two biological conditions. This approach overcomes many of the limiting factors hindering current methods, such as the need for a subset of enhancers with a significant bulk change in accessibility in motif enrichment analysis. Also, in contrast to existing footprinting methods, the method can detect changes in TF activity even in TFs where a genomic footprint pattern cannot be measured. This becomes a critical issue because we show that lack of a measurable footprint is the rule rather than the exception for most TF motifs. RESULTS A Motif-wide Assessment of Footprinting Patterns Reveals that Most TF Motifs Do Not Measurably Protect from DNase Cutting The utility of genomic footprinting relies on a local decrease in nuclease/transposase accessibility because of TF binding. Computationally, this decrease is evaluated genome-wide by aggregating enzymatic cleavage events (i.e., ‘‘cut counts’’) at and around a motif. Such a motif-centered aggregate plot has two distinct features: a decreased cut count at the motif compared with the flanking regions (footprint) and a zigzag pattern of variable cut frequency between nucleotides within the motif sequence (signature) (Figures 1A and 1B, bottom). The signature pattern derives primarily from sequence-dependent enzymatic cut bias and is also observed when cutting naked DNA (Figure 1B, bottom; He et al., 2014; Sung et al., 2014). Genomic footprinting patterns of many TFs in various cell types have been reported in recent years. In some reports, a TF was found to leave no footprint in motif sequences bound by the factor, although a signature was evident (Grøntved et al., 2015; He et al., 2014; Sung et al., 2014; Swinstead et al.,
2016). We wished to explore whether the lack of a footprint is a sporadic phenomenon occurring in a few cases or a more general feature of TFs. To achieve this genomic perspective, we generated a computational workflow for quantifying the genome-wide footprint depths of all known TF motifs. First, TF motifs were collected from three databases (JASPAR, TRANSFAC, and UniPROBE), and motif occurrences in the mouse genome were scanned using Find Individual Motif Occurrences (FIMO) (Grant et al., 2011). DHS sites were determined in a high-quality DNase-seq experiment performed on mouse liver (Goldstein et al., 2017). Motifs outside DHSs were excluded, and the DNase cut count in each motif occurrence was calculated and normalized to total reads in the sample. To determine the baseline, cut counts were also calculated and normalized in the motif-flanking region (± 200 bp from the motif center). Then, for each motif, cut counts of all genomic motif occurrences were aggregated and plotted (Figure 1C). Inherent DNase cut bias can profoundly affect a footprint’s signature. This bias was previously corrected with data from DNase-digested naked DNA (He et al., 2014; Lazarovici et al., 2013; Stergachis et al., 2014; Sung et al., 2014). However, this method of bias correction is specific for DNase-seq and is not applicable to other chromatin accessibility assays that will present their own bias (e.g., ATAC-seq). Also, technical variability between experiments and alternate commercial nuclease sources might present a different degree of bias. To overcome these issues, we employed a dataset-intrinsic bias correction methodology. We counted DNase cuts genome-wide (including cuts outside of DHSs) and calculated the cut frequency of all possible nucleotide hexamer combinations normalized to their frequency of occurrence in the genome as described previously (Lazarovici et al., 2013; Table S1). Thus, we compiled a dataset-intrinsic estimation of the cut bias of each nucleotide in our dataset. We then bias-corrected the cuts within DHSs by calculating the log ratio of observed cut counts (aggregated from all motif occurrences within DHSs) divided by the expected cut counts (deduced from hexamer cut bias). In many cases, the cut bias correction smoothed the zigzag patterns of motif signatures, allowing for an easier assessment of footprinting depth (Figure 1B; Figure S1). Finally, we determined the genome-wide footprint depth (FPD) of each motif as the depression of the average log ratio value within the motif compared with the flanking accessibility baseline (Figure 1C). We determined FPDs in mouse liver for all known motifs following digestion with three different nucleases and sorted them by FPD value (Figure 2A; Figure S2A; Table S2). Surprisingly, we found a continuum of FPD values ranging from a negative value (i.e., local protection from cutting at motifs) to a positive value. To validate that the positive FPD values did not stem from our dataset-intrinsic bias correction, we also determined FPD values following bias correction based on three naked DNA controls (Lazarovici et al., 2013; Yardımcı et al., 2014). In addition, we performed the dataset-intrinsic cut bias only on cuts within DHSs to dismiss the possibility that cuts outside of DHSs were skewing our results. The pattern of continuous FPD values ranging from a negative to a positive value was observed regardless of the correction approach (Figure S2B). Moreover, the identity of motifs with a positive or a negative
Cell Reports 19, 1710–1722, May 23, 2017 1711
C
A
T CAGG
C
A
A G T
A
C
T CAGG
C
CTTC G
T C
A G T
bound by CEBPB unbound
bias correction
norm. cut count
log ratio obs / exp 9.1 0.0
2.99 1.13 0.44 0.16 0.00
0
-9.1
30
-30
s cu erv cu ct ts ed e ts d
-30
TGCAAT
G
G
G A
ob
footprint
ex
pe
signature motif
0
30
baseline
0.0
motif Genome-wide cut count
A
CTTC G
T C
−0.5
motif
TGCAAT
G
G
FPD
−1.0
Number of cleavage events per bp
Cut DNA fragments
A
A
G A
Footprint depth
B
TF
all CEBP motifs within peaks sorted by cut count (high to low)
TF
−1.5
DNase hypersensitive site
A
−40
−20
0
20
log ratio (obs/exp)
40
distance to motif center (bp)
C
Accessible chromatin assay (e.g. DNase-seq)
2kb
Peak calling
1
calculate cut bias on all cut counts in dataset
aggregate all motifs within peaks
4
Exclude motifs outside peaks collect all known motifs
G
A T
C
filter motifs 2 -4 (FIMO p < 10 )
3 calculate normalized DNase cut counts
5 correct for cut bias and calculate footprint depth for all motif known motifs
footprint depth (FPD) 6 motif
Figure 1. Motif-wide Genomic Footprinting Workflow (A) TFs bind DNA regulatory elements that are hypersensitive to DNase digestion. Upon binding, TFs relatively protect their binding sites from DNase digestion, leading to a footprint detected by reduced cleavage events at the motif. A genome-wide footprint pattern is visible in an aggregation plot that includes all motif occurrences. DNase cutting is strictly affected by DNA sequence, leading to a signature at the motif sequence with a typical zigzag pattern. (B) A heatmap and aggregate plot of cut counts in mouse liver at all CEBP motif occurrences (n = 12,318). Uncorrected data (observed cuts) are shown on the left. Values corrected for cut bias (expected cuts) are shown on the right. The correlation between decreased cleavage events at the motif and TF binding is shown by ChIP-seq data of CEBPB on the far right obtained from Goldstein et al. (2017). (C) Genomic footprinting workflow. (1) DHS sites are determined. (2) Motif occurrences of all known motifs within the genome assembly are scanned. (3) For each known motif, all motif occurrences within DHSs are aggregated, and a cut count aggregate plot is generated by counting cleavage events (normalized to total reads in the sample). (4) Cut bias is calculated for every hexamer in the genome by measuring cleavage events throughout the dataset. (5) Cut bias is corrected for every motif by a log ratio of observed cuts divided by expected cuts. (6) FPD is determined as the distance between the motif-flanking baseline value and the average value within the motif. See also Figure S1.
FPD value was preserved across correction approaches, suggesting that motifs do not randomly obtain FPD values because of different bias estimates. Rather, specific motifs are prone to present a positive FPD value regardless of bias correction approach (Table S2). A positive FPD value could be interpreted as increased DNase digestion at the motif compared with its flanking region. How-
1712 Cell Reports 19, 1710–1722, May 23, 2017
ever, inspection of uncorrected footprints presenting a positive FPD value revealed a signature pattern with considerable cut count variability between nucleotides comprising the motifs. In most cases, some nucleotides’ cut count values were below the flanking region baseline, whereas others were above it, a phenomenon we termed an ‘‘incoherent signature’’ (Figure 2B; Figure S2C). This suggests that there is not a uniform increase
A
0.6
footprint depth (FPD)
0.4
positive FPD
0.2 0.0 -0.2 -0.4 -0.6
dist. of total motifs (n = 634)
-0.8 -1.0 negative FPD
-1.2 -1.4
B
motifs signature below baseline (n = 61)
signature crosses baseline = incoherent signature
signature above baseline
baseline
(n = 565)
(n = 8)
bias correction
C
0.6
footprint depth (FPD)
0.4
positive FPD
n = 61
0.2 0.0 -0.2 -0.4 -0.6 -0.8 -1.0
negative FPD
-1.2
n=8
-1.4
1.0
1.5
μ1=−0.42, p=0.2 μ2=0.016, p=0.8
0.0
0.5
density
2.0
D
2.5
motifs
in cut count across the motif as the corrected plot and the FPD value might indicate. Rather, the positive FPD value is a result of averaging intra-signature variability. If this is true, then positive FPD values may not represent a true biological event but stem from incoherent signature patterns. To explore this option, we filtered FPDs according to a stringent criterion. Motifs with a mixed pattern (whereby, prior to correction, some nucleotides have a positive value and others have a negative value; i.e., incoherent signatures) were excluded from further analysis. Motifs where all nucleotides showed a consistent value (i.e., all showed either a positive or a negative value) were included and plotted (Figure 2B). As expected with such a stringent criterion, 90% of motifs were discounted. Remarkably, the vast majority of motifs that met this cutoff showed a negative FPD (n = 61, 9% of total motifs) with only eight showing a positive FPD value (1% of total motifs). Importantly, although all positive FPDs were very low (<0.1), negative FPD values had a wide spectrum of values, with the lowest value reaching 1.4 (Figure 2C). This is in stark contrast to unfiltered FPDs, where negative and positive values were found at a similar rate (56% negative and 44% positive). This observation suggests that a positive FPD value is an artifact stemming from the variability in DNase cutting between individual nucleotides comprising the motif (i.e., variability within motif signature leading to an incoherent signature). This leads to FPD values slightly below or above the baseline and reflects a feature of the DNA sequence rather than a feature of TF activity. Thus, when estimating FPD, one in fact conflates two factors determining the final value: the actual protection from DNase cutting together with intra-signature cut variability. Indeed, when examining the total group of FPD values, we found that they fit a mix of two populations, each normally distributed. One population includes 80% of motifs and averages around zero FPD (average FPD = 0.016) with low variance, whereas the second population consists of 20% of motifs, is much more variable, and averages at an FPD of 0.42 (Figure 2D). Of note, this is not a liver-specific phenomenon and was observed in every one of the eight datasets we examined (Table S3). The parameters of these two populations are consistent with a model wherein most motifs average around an FPD of zero because of signature incoherence, whereas a fifth of motifs lead to a measurable footprint, averaging a negative FPD, indicating protection from cleavage. Thus, the FPD values of most motifs in all examined datasets are so profoundly affected by intra-signature variability that we cannot estimate whether they leave a footprint or not. The lack of a footprint in many motifs could be partially explained by the lack of expression of the TFs binding these motifs in the examined tissue (Gusmao et al., 2016; Stergachis et al., 2014). However, this cannot account for many footprint-lacking
−1.5
−1.0
−0.5
0.0
0.5
1.0
footprint depth
Figure 2. Most TF Motifs Do Not Show a Measurable Footprint Because of Incoherent Signatures (A) Log ratio-corrected FPD values in mouse liver for all motifs. FPDs span a wide spectrum of values, with a prominent fraction of motifs showing a positive value.
(B) To filter out incoherent signatures, a cutoff was determined whereby any signature crossing the flanking baseline is excluded from further analysis. (C) Log ratio-corrected FPD values in mouse liver for motifs meeting the criteria described in (B). After filtering, very few positive values remain, and the population consists mainly of motifs with a negative FPD value. (D) A probability density plot reveals a mixture of two normal distributions within FPD values (m, mean; p, mixture weight). See also Figure S2.
Cell Reports 19, 1710–1722, May 23, 2017 1713
motifs in the datasets we examined (Table S3) because many footprint-lacking motifs are bound by TFs that are expressed in the given cell type. For example, some motifs lacking a footprint in the liver dataset bind TFs that are expressed in the liver and play pivotal roles in liver biology by binding chromatin. Examples include the glucocorticoid receptor (GR), FoxA1/2, FoxO1, PPARA, STAT3, STAT5, and RORA (Figure S2C). Taken together, these findings reveal that measuring absolute FPD is dramatically affected by intra-signature variability and that 80% of motifs do not show a measurable footprint. Within this group of motifs, one cannot determine whether this is due to lack of TF expression or lack of protection from DNase cutting or whether such protection exists but is masked by intra-signature cut variability. Bivariate Genomic Footprinting The observation that most motifs do not present an easily measured footprint pattern and that absolute FPD calculation is confounded by intrinsic intra-signature cut variability seriously impedes conclusions drawn by genomic footprinting. This led us to design a computational method that would more faithfully capture a TF’s effect on chromatin accessibility in a manner resistant to these drawbacks. In some cases, alterations in TF expression (Gusmao et al., 2016) or activity (Goldstein et al., 2017) can lead to changes in FPD. In addition to this direct effect, TF binding events also lead to major changes in the accessibility of the chromatin environment around their binding sites by recruitment of chromatin-remodeling complexes, histonemodifying enzymes, hetero-dimerization, and assisted loading of other TFs (Voss and Hager, 2014). Indeed, several studies have shown that TF activation leads to increases in the accessibility of the motif-flanking regions (He et al., 2012; Swinstead et al., 2016), and this property has been used in predicting TF activity (Sherwood et al., 2014). We set out to generate a framework that will utilize changes in FPD and flanking accessibility (FA, set as ± 200 bp from the motif center) to predict changes in TF activity following a biological signal (e.g., untreated versus treated cells, tumor versus normal, etc.). We designed an algorithm termed bivariate genomic footprinting (BaGFoot), which enables a genomic, motif-wide comparison of FPD and FA. For every experimental condition, each motif was assigned an FPD value and a FA value. Then, to compare the difference between two conditions, we subtracted the FPD and FA of condition A from condition B, yielding DFPD and DFA values (Figures 3A and 3B). The D values for all motifs are presented in a scatterplot, DFA in the x axis and DFPD in the y axis. Because most TFs are not expected to change in activity and affect accessibility following a specific signal, most D values would center around the origin. Conversely, condition-dependent changes in TF activity may be reflected in deviation from the origin. For example, if the activity of a TF is increased in condition B, leading to an increase in FPD and FA (Figure 3A), then both D values would increase, and the motif bound by the TF would be found in the first quadrant (Figure 3C). To statistically evaluate the extent of the change in FA and FPD, we present the data in a ‘‘bag plot,’’ a visual way to present statistics of bivariate data (Rousseeuw et al., 1999). This presentation is analogous to the well-known ‘‘box and whiskers plot,’’
1714 Cell Reports 19, 1710–1722, May 23, 2017
which shows the main statistical characteristics of univariate data. In a bag plot, the bag area encompasses 50% or less of data points (analogous to the box in a boxplot). The ‘‘fence’’ area usually encompasses 97%–100% of data points (analogous to the whiskers in a boxplot). Any data point outside of the fence is defined as an outlier; similarly to a boxplot, this analysis does not dictate a minimal percentage of outliers (Figure 3C). Because most TF motifs are expected to localize around the origin, the population median would be proximal to the origin, and motifs with a significant deviation from the median would be characterized as outliers (Figure 3C). Statistically significant change in FA/FPD was evaluated by chi-square distribution. To examine the efficiency of BaGFoot in predicting TF activity, we first analyzed DNase-seq data from mouse liver following fasting, a metabolic stress known to affect many TFs (Goldstein et al., 2017; Goldstein and Hager, 2015). Comparing the fed to the fasted states identified four outlier groups of motifs deviating from the fence area. A group consisting of CREB-related motifs increased in both FPD and FA. A group of CEBP-related motifs increased mainly in FPD. The GR motif increased only in FA, and a group of STAT motifs decreased in both FPD and FA. All outlier motifs represent statistically significant changes following fasting, whereas the motifs within the fence are not significant (Figure 3D). To further assess outlier significance, we applied a two-sample t test comparing the outlier motifs in subsamples of data (90% of reads compared with the original read count, ten subsamples in each condition). All outlier motifs were statistically significantly called in both the FA and FPD parameters (p < 0.01). Remarkably, the four outlier groups of motifs bind TFs that are well known mediators of the hepatic response to fasting (Goldstein and Hager, 2015). In line with the predictions made by BaGFoot, the binding of CREB, GR, and CEBPB were all significantly increased following fasting, as measured by ChIP-seq under the same experimental conditions (Goldstein et al., 2017). BaGFoot Is Robust to Different Peak Calling Methods, Bias Correction Approaches, and Low-Read Datasets DHS sites in this study were identified by the DNase2Hotspot program. To examine whether BaGFoot can provide a reproducible output regardless of the peak calling algorithm, we also used MACS2 or F-seq to determine DHSs and executed BaGFoot on the two peak lists. The number of called peaks varied considerably between the three algorithms (DNase2Hotspots, 78,821; model-based analysis for ChIP-seq [MACS], 115,760; F-seq, 324,606). Nonetheless, the BaGFoot output from MACS-called or F-seq-called peaks (Figures S3A and S3B) was extremely similar to the output from DNase2Hotspots-called peaks (Figure 3D). Even when BaGFoot was applied to the entire genome with no peak calling step, the output closely resembled that for the DHS-exclusive BaGFoot analysis (Figure S3C). Of note, BaGFoot analysis on the entire genome with no focus on called peaks is computationally intense and is not recommended for most datasets in which peak calling is possible. Because bias correction profoundly affects absolute FPD values (Figure 2), we examined whether it also affects BaGFoot output. We performed BaGFoot on naked DNA-corrected data where the cut bias pattern of each DNA hexamer is determined
condition A
Number of DNase cuts per bp
ΔFA condition A ΔFPD
condition B
increased TF activity
Y axis value: difference in footprint depth (ΔFPD)
X axis value: difference in flanking accessibility (ΔFA)
motif
D 0.3
[condition B] - [condition A]
z
z
z
z z z z z z
only FPD decrease
only FA increase ΔFA outlier motif
Cebpa
Cebp Aare
n.s. P < 0.05 P < 0.01 red text = q < 0.05
Jundm2 Atf1_2
Hlf Cebp:Ap1
JunD
Atf1_1
c−Jun:Cre Creb1 HNF HNF6 F6z
0.1
z
Liver [fasted] - [fed]
Pax7_3 P ax7_3 ax7_ 7_ _3z Pax ax7 z ax7_1
Nfil3z
Cre
Dbx1 Db bx1 x1 1z z Po ou3 ou3f1 ou z P1 ou2f1_2 z zx1 Lm L m mx1b 1b Pou P ou3 o u3 Ho H xa4 xaax6_2 Pa4z z Sou ru yz3f4 Hlx1 Hlx x1 1z xa zb Ar iid id5a d5a Hoxxxb Ho b7 7z Sox21 1z Cphxz zrz zhx3 Lh Tlx22 Pdx1_2z Lh h1_1 zz6z z H Ho o8zxxc6 xc c53 c6 6ou4 Hn n nf1 f1 fLh 1z HH oL xd xd8 D Db bxxx2 b bx2 2x6 1_2 z Nkx6 1zzzx17 Pou1N 1N z1_ Dlx1 D Dl Dlx lx1 1−1 zz fp2Barx 187 P f2 arx x2 Hdxz zz B H Ho oou3f Hnf1a H nff azH nf1a z3 H xa1 Hoxa1 Ho xa10 zz z 9 ozxAlx Ho x1xc4 b4 44H 2 z Al7 xx3 z hx9 x2a z Sox14z Lmx1a So Hoxa Ho Hoxa xa7 a7 a7_ 7 xb4 z zz zPho z Ho H oxa7 2B Ba arx a rx xx1 12 1 En1_1 En Dl x2 2z z6−3 Esx E xx1 1zMsx zsx Ho oNkx ox xxb8 b8xx6−3 b8 M x1 N Lhx3_1z z1z zVsx1 Prop1 x1 1 Pax2z So arFt1 1_ _d 1z3 z z Hoxa6 Ho ox1 xaz6zCar Dd Ddit3:Cebpa d3 Gmeb1 b1 1z Otp O pozxd Pax7_2z Ho zz HOXA HO A2z 3z Ho oxd13 xc8 c8z zz P ou3f3 Uncx4.1 1 Jun:Ap1 z TBP_3 3 Ap1_1 Ap1 1zz Hnf1b Hnf1b f1b bz Hoxd10z P300 H xb13 Klf4_1z zz z Zfp128 28 Cdx2_2 x2 2z 3z Hif1b 1b GRE_1 1z Brc z Br Brca1 z Obo Ob O x5_2z Klf4_2 4Max_2 _2z z z ox5_1z Obo Arid3a_2 aB O TLX1::NF NFI FIC Cz A 2z Eklfz z Usf2zz z Nff−E rf2 rfEAp1 Arz PH ou2f1_1 Ho xxa9 xb9 ba9 b9 99zz 1z8z zy z Myc:Max Myc:M M Nf−E N Nf− −E Klf7P PR H Ho o x z Cdx1 Cdx Cd d x x1 1 z Isgf3g Mef2 Mef2a Me Mef ef2a 2a a_ a _2z _2 _1z b Nrf1_2 Gata3−pr G yzz Nfya Nfy azMax_1 OCT4−S SOXNk 2kx T3 NA G bHLHE Hz 40z Nkx3− N kx3−1 x3− x3−1 2zANOG A z Atf3 N z Nfy FOXL FO OXL1zIrf5Ir z z ARE Irf1 rIrf2 rf12zzMef z z Mef2c Myc y2zUs 1zf1 Usf ssf1 f1_2z1 Egr2 Eg Irf8z Myc_2 yc c 2zn−Myc_1 z2 zz 61 z Nrf1_ nM rf1 f1_1z1b2 Myyyc_3 Myc 3zE zz zT NFkB−p53−p50 p Ho oT xa13 Egr1_1 1 z1z Sp4z G GA A3 Ac3 3_2 O zz2z Gata6 TG T1 1isre 1 isre Eg gr1 z 1z NFkB−p65−p50 z −My Myc_2 M My yc 22 Irf4_1 Irf z H n−M Hi NFkB−p65 kB−p6 6−p65_1 65−Re 5−Rel R zl z z Oct4 ct4 Hmbo x1z Irf3 z Egr1 Pax5 ax5 5_ 5 _3gz 4z _ Tcf1 32zz T zz SGa Spd Mv1 izzf z1tfz_2 Bapx px1 1 r1 3z z GEt 3zZfEgr1 ztb33 EMiz Etv vE z z a4 4 Zbt Zb zz 2−5 5_2 5 _2 E2 E2f RelA R Re llA Az z tta z Ga z GTtta4 Ga ata1_ a at tta ta1 a1 1_Gat 2−5 p1 p 1_2 1 _Z 2zbzztb PRel 2 ttv vS4 Glk1 6z P oou2f2 z zb al1::Gata1 a Ga G ata1 a ata at ta1 a1 a 1z2zz z zz E3 tv Et 1Elk Ets1 5kk1 E Elk1 lk 1 Et E2f2 Nfatc2 Nfatc Nf fatc cz2zGata5 z Gz z kz k3 3zz3 z IsreC Sp100 zfp281 z NFkB− NF kB B0z−p65_2 −p6 p CA Runx R unx 4 E2 E2f4 E 2f4 2 4z_E2 E2f7 f z zzz f7 z_ NFk FkB F kkB RNF Re e1_3 Nkx3−1 3zz Mef2 Mef2a_1 2a 2a_1 13zRel z Ppa Pp P pRX par are reR Elk4 El E llk4 kk4 4z 2z 2zzrgz PE zA Nkx2−3 PPAR PP PAR RG RG :::R :: :pa RXR X XR RA PU 1:Irf 1: 1:Ir :Irf Irrfzfzz Runx_ Runx nP nx xz_3z.1:Irf z NFkB−cRel−p50 z Elk1 E Elk lk1 kz2z_ _2 2 Rx R xxra xr ra rR a z 3Eomes_ Zfx_ Zf Z fx xxT _T Spib Spi Sp pib pi ib b1 Tzcf1_1 z Tf fzap2 aFpli2 2a_ 2a a_ a a_2 Nr2 2e3 z2 s_S 1zp Insm IIn nzsm mv1 m1 14z p5 Nk 2 9z Nkx2−9 z2 E2f E 2fP z v1z1Glis2 _1 zz z 3f5 zz z M My Myf5 z Sfpi1 1zZ za_2 Gabp Ga ab abp bpa bpa a 2zlk1_1 z Zn Znf 1z4 43 3z zG NeuroD N Dxz px5 53_1 53 3nf _ff1 AR R −1z h53 ha al alfs alf llfsst lfs ffs sit ssi iit5 tz zz Ets E Et tts s1 1 R Run Ru np53_1 z Elf3_2 ze 1: ENF E−A E− −TA b o ZN Nb Fcfcp2 143 |a Re R e est:N es est st:N tt:N N rsfuznx 6z zz z E2f1_2 E2f Elf5_2 Elf5 5Ets1 2z1:E− z G z2f1_2z Mzf1 Mzf1_ ff1 f1_ 1_ _2 Ascl2 z23z z2 G ER Rfp G Zf Zfp Z fp p4 42 423 z4 MyE My zpz Ap2a A Ap2 2alph 2a 2 aGab alph lph lpha lp p zb aM z z zEbf S Elf1 zE z zz a z z BO B ORIS OR O RIS_1 R T2b ap2e z z2p2 pbf z2 Tccf T czffc cfc fcp2l1_ cp2l1 cp ccp2 pE p2 2lB 2 ll1 1?_zA Ets1_2 1_2 z1z z z cf f ap2b p Zic2 Z Ewsr1:Fli1 wsr1 w ws sr sr1 r1 r 1:F 1 :F : F Fli l i 1 HE Zic c1 PU.1_1z z Tea T ead ad4 ad ad4 d4zRelz Zic3 c3 NF kB Tfa Tf fap2c Ebf1_3 E bf1_3z Tcf Tf ap2a_1 p2a z z EWS E EW WS:ERG WS W :E −fusionz 1zCTC CTCF_2 C 2z MyffzEb SPDEFz M z Ebf1_2
GRE_2
−ΔFPD 0.0
zz z z z z z zz z
FPD+FA increase
Stat6_1 Stat6_2
Tead
Stat3_3 Bcl6_2
−0.1
only FPD increase
0.2
−ΔFPD
FPD+FA decrease
200 bp
flanking accessibility increases
C
only FA decrease
B
footprint deepens
motif
population median
condition condition A B
condition B
A
Stat1_2 Stat5 Stat1_1
−0.5
PU.1_2 2z
Stat3+Il23
Stat3_2 Bcl6_1 Stat3_1
Stat4
−0.4
−0.3
−0.2
−0.1
0.0 ΔFA
0.1
0.2
Figure 3. BaGFoot: Quantifying the Changes in FPD and FA between Two Experimental Conditions (A) The effects on chromatin accessibility mediated by TF activation can be measured via changes in accessibility within the motif and in the sequence flanking it. (B) Changes in FPD and FA following a biological signal are measured for every motif genome-wide, and D values are given. (C) D Values for all motifs are plotted in a bag plot. The population median is marked in light orange. The bag area (dark blue) is the region where 50% of the population is located. The fence area (light blue) is the region where most (typically 97%–99%) motifs are located. Motifs outside of the fence and with p < 0.05 are determined as statistically significant outliers; namely, motifs with a significant condition-dependent change in FA and/or FPD. (D) Bag plot depicting DFA and DFPD in mouse liver following fasting. Statistically significant outlier motifs are marked in colored squares. See also Figures S3 and S4.
by DNase-digested naked DNA (Table S1; Lazarovici et al., 2013; Yardımcı et al., 2014). The plot following the various naked DNA corrections (Figures S4A–S4C) was essentially identical to our dataset-intrinsic correction approach described above (Figure 3D). Additionally, the BaGFoot output was unaffected when we corrected for bias based only on cuts within DHSs (Figure S4D) rather than throughout the genome (Figure 3D), suggesting that the dataset-intrinsic correction can be done on either population. Furthermore, the output was not affected even when the cleavage rates were randomly shuffled (Figure S4E). The resilience of BaGFoot to the correction approach prompted us to examine the output following two mock corrections—one where the DNase cleavage rate was assumed to be uniform throughout hexamers and a second where the hexamer frequency was assumed to be uniformly distributed in the genome (Table S1). Even with these simulated corrections, BaGFoot performed similarly (Figures S4F and S4G). Thus, we conclude that BaGFoot is robust to different corrections and even performs similarly when no correction is done (equal cleav-
age rate, Figure S4G). The resistance of BaGFoot to assay bias stems from the fact that BaGFoot estimates the difference between two conditions. By definition, bias stemming from enzymatic activity should not change between similarly processed samples. Because we subtract one value from another, the cut bias that is identical between conditions is negated, and the D value represents only relevant changes. A major drawback in extrapolating TF activity from genomewide accessibility data is the need to determine a subset of enhancers that show a bulk change in accessibility between conditions in a statistically significant manner and then analyze that sub-population for motif enrichment. Because most DHSs do not change dramatically in their accessibility, they are excluded from analysis, and the subset of included DHSs represents a small fraction of total enhancers. Furthermore, this subset is sometimes challenging to determine in a statistically significant manner because of few replicates or poor quality of data. We postulated that BaGFoot would be able to find relevant TFs even in DHSs that do not show a statistically significant
Cell Reports 19, 1710–1722, May 23, 2017 1715
difference in accessibility between conditions. We employed the fed/fasted dataset and discounted all DHSs with a bulk change in accessibility following fasting (R2-fold, adjusted p % 0.05). Performing BaGFoot only on unchanged DHSs (Figure 4A) provided an almost identical output with the same outlier motifs as those found on total DHSs (Figure 3D). Many footprinting algorithms require deeply sequenced DNase-seq data to detect and measure absolute footprints (Baek and Sung, 2016). Because BaGFoot relies on relative rather than absolute measurements, we surmised that it might be robust even in low-read datasets. To test this hypothesis, we randomly subsampled reads from the fed/fasted dataset and artificially generated datasets with a lower number of reads (90%, 80% . 10% of reads compared with the original dataset). Each subsampling percentage tier was randomly selected ten times (90%, n = 10, etc.) to a total of 90 different datasets. Then, we performed BaGFoot on these datasets. DFPD and DFA values were reliably determined with very low variance in as low as 20% of reads for six examined motifs (Figure 4B; Figure S5). Remarkably, the vast majority of outliers identified in the original dataset were also identified as outliers in all subsampled datasets, suggesting that BaGFoot is able to detect changes in TF activity even in low-read datasets (Figure 4C; Table S4). Taken together, BaGFoot efficiently detects outlier motifs regardless of peak calling algorithm or bias correction method. BaGFoot also reliably isolates outlier motifs in datasets with varying read depths. BaGFoot Detects Relevant Motifs following a Wide Range of Biological Signals and in Various Cell Types The response to fasting is a systemic one, mediated by various hormonal and metabolic stimuli. We wished to determine whether BaGFoot can retrieve TFs relevant to several biological events. First, we examined BaGFoot’s performance following a TF-specific activating signal. To that end, we analyzed DNaseseq data obtained following short-term treatment of cells with a GR agonist—dexamethasone (John et al., 2011). Comparing dexamethasone-treated cells with their untreated counterparts revealed increases in the accessibility around Fox and Sox motifs, suggesting that GR facilitates chromatin opening around these motifs, thereby, promoting both Fox and Sox TF activity (Figure 5A). Both Sox and Fox motifs do not leave a footprint (Figure S6A) and, therefore, show only an increase in FA along the x axis. This observation highlights the advantage of BaGFoot compared with a univariate examination of footprinting capacity that would be uninformative in this case because of lack of footprints. Interestingly, a role for GR in enhancing FoxA1 binding was recently described (Swinstead et al., 2016) and is in line with our finding that activating GR leads to increasing accessibility around Fox motifs. A similar relationship between Sox and GR is undescribed and warrants further investigation. The two other motifs detected, Pax4 and Zfp105, are very prevalent in the DHSs, with over 70,000 occurrences, suggesting that this motif cannot provide reliable information regarding TF binding because most of these motifs are unbound. Next we examined the sensitivity of our algorithm in detecting lineage-determining factors in early developmental stages. He-
1716 Cell Reports 19, 1710–1722, May 23, 2017
mangioblasts are the early common precursors for hematopoietic and endothelial cells; thus, the differentiation of mesodermal cells to hemangioblasts is thought to constitute the first step toward hematopoiesis (Cao and Yao, 2011). We compared DNase-seq of hemangioblasts with their parental mesodermal cells using BaGFoot (Goode et al., 2016). Increases in accessibility around the GATA motif were found in hemangioblasts compared with their parental mesodermal cells (Figure 5B). Supporting the validity of our findings are observations indicating that GATA1 is a pivotal erythroid-determining TF during hematopoiesis with documented roles in hemangioblasts (Cao and Yao, 2011; Yokomizo et al., 2007). The accessibility data described above were based on DNase hypersensitivity. The ability to recover footprints from other chromatin accessibility assays has been questioned (Vierstra and Stamatoyannopoulos, 2016). We performed BaGFoot analysis on a recent dataset utilizing ATAC-seq to compare accessibility between a group of primary small-cell lung cancer (SCLC) tissues and a group consisting of SCLC liver metastases (Denny et al., 2016). The major motif showing increased accessibility and a deeper footprint was NF1 (Figure 5C). This is in agreement with the discovered role of NFIB in regulating metastatic development (Denny et al., 2016). Intriguingly, the motif for the hepatic TF HNF6 was also called as an outlier, raising the possibility that the liver microenvironment affected metastatic cells to express liver TFs. However, although the NF1 motif was still called an outlier when correcting the data based on naked DNA-digested ATAC-seq, the HNF6 motif was not (Figure S6B). Taken together, the data suggest that BaGFoot is efficient in predicting relevant TFs in various settings, including systemic metabolic stress, early developmental stages, specific activation of a TF, and in tumor-to-metastases transitions. Candidate Motifs Discovered by BaGFoot Are Bound by the Predicted TFs The outlier motifs detected in the liver following fasting (Figure 3D) directly correspond to increased binding of three TFs following fasting (Goldstein et al., 2017). To further validate changes in TF binding behaviors corresponding to outlier motifs, we processed multiple datasets with the BaGFoot algorithm. First we analyzed data in which mesodermal cells were in vitro-differentiated into macrophages (Goode et al., 2016). All the outlier motifs in the BaGFoot bag plot comparing macrophages with their mesodermal origin cells (Figure 6A) have been reported to bind TFs with critical roles in macrophage activity and lineage determination (Tussiwand and Gautier, 2015). Attesting to the validity of BaGFoot’s observations, ChIP-seq data revealed increased binding of CEBPB (bound by an identified outlier motif) in macrophages compared with mesodermal cells (Figure 6B). The purpose of BaGFoot is to suggest putative motifs for further exploration using only accessibility data. Therefore, in the lack of ChIP-seq data, BaGFoot considers both bound and unbound motifs. With the availability of CEBPB ChIP-seq in this dataset, we divided CEBP motifs into two groups: bound motifs and unbound motifs. Although unbound motifs were close to the origin, bound motifs were extreme outliers (extending further from the value for total motifs; Figure 6C).
0.3
Liver [fasted] - [fed] only unchanged DHS
0.2
A
Cebp Aare
0.1
Atf1_2
Cebp:Ap1
Atf1_1
JunD
c−Jun:Cre Creb1 Pax7_3 ax7 3z Sryz Sox14z
HNF6 HNF 6z
Pax7_1 7 1z
Cre
Nfil3z
Sox21 1z
Dbx1z
Cphxz Pou2f1_2 Ho ozx xa4z z Hoxxzcc6 Ho 6h Pdx1_2z Lmx1b L Lmx1 1 1b Lhx3_ L Lhx 3zx1 PzbozLhx3 z _ z Lhx L hx 1z Hoxb7 PL ou3f4 xb b7 P ax6_2 z Lhx5 z x5 zou3f2 Nkx6−1_1 6 1_1 1P Tlx22L z 2z M Barx2 Bar Db Dbx2 D x2zf2 4f3 3zbx2 zBarx Zfp187 z 1z4 Msx1 Hnf1 H nf1 nf f1 1znf1 zalx1 zx1 z d8 _1z Nkx6 Nkx6−1_2 5a Hnf1 Hn H nf n f1a ff1 1H z Hoxa10 Hoxa Hox Ho xxb xa Alx3 Alx x3zz x z Dlx z10 zHoxc5 xc4 c4 Pou1f lxx x1 z z HoHo Hoxb8 H Ho z1H Hoxb4 H Hox Hoxb4_2 x4 D xb4_2 2 Ddit3:Cebpaz z Pho x2a z Ho z1 Nkx6−3 Nkx6 Hoxa7_2 H xa7_ 7LEn 7En En1_ n1 1_ _1 zH Lmx1az Hdxz D1zlx lx2 x2 hx9 9z_ z Prop1 Vsx1 Lhx3_1 h 3 1z Hmx3 Hmx mx mx3 x3 3zx1z Hoxa7 Hox Ho xz 1z5 z x2 HoH Ho xzxo cN c8 zx2b Nkx2−5_1 Nkx2− 2 5_1 1 z o P 2b b z Otp p Hoxd1 13 Jun:Ap1z Hoxa6 Hoxa Ho xa a6 6z HO OXA XA32zz Pbx1_3 z Uncx4.1 Ap1 Ap1_1 P300zz zz z Hoxb13 H oPxb13 xou3f3 b13f3 3z Hoxd10z Hif1bz TLX1:: ::NFIC NFI z NF z Bbx Arid3a_2 z Obozx x5_2 2 z z Ap1_2 A p 2Zfp128
0.0
Pax2z
Gmeb1 eb1 1z GRE_2 2z
Klf4_1z
Klf4_2 4_2 _ zx_2z Max Eklfz Obox5_1 o1zrf4 x5_1 Pax7 x7 7_2 2zObo z Usf2 sf2 zz Irf4_2 4 2z 41 Isgf3gz Mef2a_2 Klf7z Mef2a 2a 2 a_22zz2f1_1 Myc:M M yc:Ma c:Maxz Nkx3−1_2 N kx3 kx 3f2 1_ Sozx18 8z z x_1 z NfyazMax_1 Nfy Nrf1_2 O 4−SOX OCT O 2−TC TCF C −NANOG Arz 1z OG Nfyz Nfy zz zyz Irx6z Irf5z Irf1 bHLHE b HPR LHz 40 I f2zz Irf2 Atf3 Atf Gata3−primar Gat Mef2c f2cz FOXL O 1z Irf8zM Usf1 sf1_ s f1_ _2 2 z _1 z sf1 Myc Myc_1 M 1zU _z1 g_1 p161 z z f AR _Rhlhb2 E n−My −Myc Myc_1z Nrf1 Sp4 4z z z lh hb2 NFkB−p53−p50 3 p50zT z T1isre 1i 1is re Egr1_1 E Egr 1 1z NFkB−p FkB B−p6 p p6 p65 65−Rel 65−Rel 65 5 Rel Rez NFkB−p65−p50 zz z HEgr1 Hif− Oct2 O t2z Egr1 r1z_2 2z HoOct4 14G 3zzata6 z O 1O NFkB−p −p6 p65_1z z n−My My M yyc 2zz I f6 Brf x1 Irf4_ rf4 f4 _zIrf3 1z1zz HmbIrf P ax x5_ x5 x 5 _3 3 Sp10 S 100 1 00z 00 Egr1 gr1 1 4zr1 Gata4 Ga G Gata 4z z zz z1_2z3z Spdef pd d GEt 1 Nfatc2 f Eztv Etv v1 v1_2 v z7403 E 2p 2f3 z 5 2z 2−5 TcfNkx2 T z t EzEt z btb33 btb Sp S p1 1_z2 H a:Ar Hif1a:Ar As1 rnt rn nElk t5zEtv6 z 2 z tv3 ttv v3 vz 33zzzz RelA Re ePou2f2 lA lA zz Tal1 T al1 al1:: l1Nkx2 ::G ::Ga ::: :G :Ga :Isre Gat a2ztta1 ta a1 1 Ets1 1nt zz z P o NFkB−p65_2 p _2z Elk1 E El k1 k 1Elk3 _3 N 6zz ou2f3 4 z z k3 zz zz z Nr2e3 Nr2 e3 3z E2f7 E2f E 2 7Zfp281 Rel Stat6_1 1z NFk NF N Fk kB Bel E2f4 E2f4 E2f 4z k3 CArG C CA rGzN Elk4_2 Elk El E Elk4_ k4 k 4 _1 2zz z zunx_3 Nkx2− Elk4 zz Fli1 F Runx_3 R unx 33 z3z Nkf2a ata5 1z1_3 Six6_3 Mef2a_1 M Mef2a Mef2 Mef2a_ YY YR 1zA 1z z2zE NF NFkB−cRel−p50 Elf Elf2 Ergz Pl P Pp p z Elk1 lk1z z2z PPARG PP GR ::R :: R RXR XR X Zfx Zf f 2zTf lilp2 z E2 E2f 2RA z zZ Rxr xRX rE zElf3 f 2z z Tfap2 fFli ap ap2 p_2 p2a 2Elk1_1 2a a_2 2z z Elf5_2 Elf5 _2 2zs1:E z Eomes_1 Nkx2 T Tcf1_1 TNkx2−9 ead ea adz z Gabpa Ga Gabpa_ bpa_ a_ a 2za z bpa Pl Pla gG zzlag Etv1_ E Et tv1 tv v1_1 v 1 1P Zn Zn nf143 nf1 f143 43 3zEtv zf1 Ets1:E−bo 1 Eib − bo ox Sfpi1 pi1Zn pi1_1 pi1 i1_1 Spib S ib 2zf6z Ettts ts1:R ts1: 1zyo :oR Run Ru u nx zxzfpi1 z un TzA cfcp2l1_ zGcm1z M zn y:R AR R−half −hal alfs fsi fs isz te1:R te Mzf1 Mz Mzf1_ ff1_ _2 2z2f1_2 z z Ascl2 z Res es est E Gy E 2A A z z Ap2alpha Ap p2a p2a alp pha p ha aGa Gabpa_ z z z zA Zfp4 Z Zf ffp p42 p4 p 4RG 42 2 23 EzIS s1_1 Tead4z T z3z SfZf yoD z E 2a Elf1 BOR BO B O ORIS OR RIS R CT CTC Ap A pw zp2l1 Ets1_ Zic2 zfC Tcfc f2 fc cp2l1_ cp c p2l1 2l1_ 1_2 2bf1_1 Ew E wsr1:Fli wsr1:Fli1 ws wsr 1 zFli1 lT 1azz z r1: Zic1 Zic czz1z z cffap2b pT p2 2bfzap2ez 2 E PU.1HE 1zB?bf EWS E :ERG−fusionz Zic3 c3 z NFkB−cRe cRel Rel T f 2 z1z Zfx 1z CTCF_2 Tf ap2a_1 ap2a Ebf1_ 1_3 1 3zyfz zfa SPDEFzEbf 2 Myf M My Ebf1_2z
GRE_1 1z
Stat6_2 Stat3_3 Stat3+Il23
−0.1
Bcl6_2 Stat5
−0.3
−0.2
B
Stat3_2
PU.1_2 2z
Bcl6_1 Stat3_1
Stat1_2
Stat4
Stat1_1
−0.1
0.0 ΔFA
0.1
Creb1 ΔFPD
0.3 0.2 0.1 0.0 -0.1
l l l
0.1 0.0 -0.1 -0.2
l
l
10 20 30 40 50 60 70 80 90 100
10 20 30 40 50 60 70 80 90 100
Rora
Rora ΔFA
ΔFPD
0.3 0.2 0.1 0.0 -0.1
l
l
l
0.1 0.0 -0.1 -0.2
l
l
ΔFA
ΔFPD
l
l
l
l
l l
10 20 30 40 50 60 70 80 90 100
subsampling percentage
0.3
l
Stat3 0.1 0.0 -0.1 -0.2
10 20 30 40 50 60 70 80 90 100
subsampling percentage
outlier called in X% of datasets:
Cebpa Cebp
1%~20% 20%~79% 80%~99% 100%
Jundm2
Aare Hlf
Atf1_2 Atf1_1
JunD Cebp:Ap1
c−Jun:Cre Creb1
0.1
0.2
ll
10 20 30 40 50 60 70 80 90 100
10 20 30 40 50 60 70 80 90 100
Stat3 0.3 0.2 0.1 0.0 -0.1
0.2
(A) Bag plot depicting DFA and DFPD in mouse liver following fasting. All DHSs with a significant bulk change between conditions (fold R 2, adjusted p % 0.05) were excluded, leading to only very subtle changes in outlier motifs detected. (B) Boxplots depicting DFA and DFPD values of selected motifs in randomly generated datasets. The number of BaGFoot input reads was reduced to the indicated percentage (percent reads from the original mouse liver dataset; each percentage tier was randomly generated ten times). (C) Bag plot depicting all outliers called in the entirety of subsampled datasets (n = 90). Each motif was determined an outlier in X% of datasets. Circle color and size represent binned ranks. For example, an outlier determined as such in 100% of datasets is marked in a large blue circle. See also Figure S5.
Creb1 ΔFA
−ΔFPD
Jundm2
Hlf
n.s. P < 0.05 P < 0.01 red text = q < 0.05
TBP_3 3z
C
Figure 4. BaGFoot Efficiently Detects TF Activity in Datasets with Low Reads and with Weak Changes in Enhancer Hypersensitivity
Cebpa
HNF6
Cre
Sry Sox21
GRE_2
Gmeb1 TBP_3
Ar
−ΔFPD 0.0
Mef2a_2 Nkx3−1_2 FOXL1
GRE_1
Six6_2 Tcf1_2
Six6_1
Sp100 NFkB−p65_2
Stat6_1 Stat3_3 Bcl6_2
−0.1
bHLHE40
Stat4
Stat5
Stat3+Il23
Stat1_2
Tead
Stat6_2
NFkB−cRel
Stat3_2
PU.1_2
Bcl6_1 Stat3_1
Stat1_1
−0.5
−0.4
−0.3
−0.2
−0.1
0.0 ΔFA
0.1
0.2
Cell Reports 19, 1710–1722, May 23, 2017 1717
A
B
C
Figure 5. BaGFoot Detects TF Activity in a Variety of Biological Systems and following Various Biological Transitions (A–C) Bag plots depicting DFA and DFPD in (A) dexamethasone-treated mammary adenocarcinoma cells compared with untreated cells, (B) hemangioblasts compared with their cells of origin (mesodermal cells), and (C) a group of tumors consisting of metastatic tumors (hyper) compared with a primary tumor group (hypo). See also Figure S6.
1718 Cell Reports 19, 1710–1722, May 23, 2017
This shows that the changes in FPD/FA are due to TF binding and not motif sequence. To further explore the efficiency of BaGFoot in detecting biologically relevant TFs, we examined a DNase-seq dataset from T lymphoblast cells (Bevington et al., 2016). Upon encountering an antigen and activation of T cell receptor (TCR) signaling, naive T cells become lymphoblasts and begin to rapidly proliferate to eventually become effector cells. Lymphoblasts respond more efficiently to a second TCR activation by phorbolmyristate-acetate/ionomycin (PMA/I) (Bevington et al., 2016). Comparing activated versus resting lymphoblast cells using BaGFoot, we found known TCR-activated TFs such as AP-1 and NFAT (Figure 6D). In line with BaGFoot’s prediction, the AP-1 TF was activated in this setting, as measured by increased binding (Figure 6E). The motifs bound by AP-1 were responsible for the increases in FA and FPD, whereas unbound motifs had no effect (Figure 6F). The TFs described above were also detected in the original motif analysis performed on a subset of DHSs that increased in accessibility upon PMA/I treatment (Bevington et al., 2016). However, BaGFoot also detected the motifs bound by BATF and IRF4, two TFs that play important roles in T cell biology (Man and Kallies, 2015; Figure 6D). These were not detected in conventional motif analysis, demonstrating that, in some cases, BaGFoot can identify critical TFs that were missed with existing methods. Interestingly, some motifs appeared in the second quadrant, showing an increase in FPD but a decrease in FA. This accessibility pattern would be consistent with a repressor that decreases its surrounding accessibility as it binds to its motif. Indeed, two motifs found in the second quadrant were reported to bind repressors: EGR (Feng et al., 2015) and Sp4 (Kwon et al., 1999). As for the motifs shown in Figure 5A, three frequently occurring motifs were detected as outliers in this dataset (Pax4, Zfp281, and Zfp740), but we cannot deduce TF activity from this observation because of the abnormally high number of motif occurrence in DHSs (>50,000). We also examined a dataset where changes in chromatin accessibility were measured during early adipogenesis (Siersbæk et al., 2011). The CEBP motif was called as an outlier (Figure S6C), in line with the role of CEBPB in the early establishment of a chromatin landscape facilitating adipogenesis (Siersbæk et al., 2011). Indeed, CEBPB binding increased following treatment, and the bound motifs were responsible for the shift in the FA/FPD parameters (Figures S6D and S6E). Finally, we processed ATAC-seq data from mouse embryonic stem cells (mESCs) where the level of Oct4, a critical stemness TF, was knocked down (King and Klose, 2017). The most distant outlier motif showing a decrease in FPD and FA was the Oct4 motif (Pou5f1; Figure 6G). Interestingly, the motifs for Sox2 and NANOG also showed the same trend, in line with findings showing Oct4-dependent binding of these TFs (King and Klose, 2017). Examining ChIP-seq data for the three TFs revealed that, as predicted by BaGFoot, binding of all three was reduced following Oct4 downregulation (Figure 6H). The binding of Oct4, Sox2, and NANOG was shown to be regulated by the chromatin remodeler BRG1 (King and Klose, 2017). Accordingly, BaGFoot analysis showed a reduction in both FPD and FA for
A Cebpa
Irf4_3 Ap1:Isre_1 Ap1−Irf4
Mafa fa Isgf3g g
Mef2a_2 2 Nfil3 N Nfi 3 Cphxx Maff ff Irf1 1
FOXC O 1 MaffIIrf4 r 4 Pdx1_2 Pdx Pd xx1 1 2 JunD D
Mafb_1 1
Irf2 2 Hif1a:Ar Hif1a:A A Mafb_2 2 Egr1_3 3 Hif 1a ar ntt Bhlhb2 2 Irf8 8 RXRA::VDR A:: ::VD VDR Tr4 T 4 c−Jun:Cre e Elk4_2 2 T1isre e 1 Pax2 ax c Mafk_2 2 Mef2c PF AX 3 FKHR FKHR−fusion Kk 6−fu fu u n Hlx1 Nkx6−3 Czf Hoxb4_1 oxb4_1 xb4 1Brca1 TLX X1:: 1::NFIC NFIC NF FIC C3: Mzf1_2 M zP f1 fa 1 22 1Mzf1 Sp1_2 1_2 24Zfp281 BFrcca 1o1 S ix1_1 xx1_1 1__1 Z p2 oa1 xxo1 1 Elk4_1 1 1 Elk1_1 Irf4_1 Irf Irf4 f4 1 Tbx5 5 Hoxa6 Ho xa6 6f4 T fe2a Tcf a E Et S1t:s1 ERG G n RunxRunx_1 4uEWS Zfp740 0 Ets1_2 sE 2 −fusion se TBP PU.1_1 1 Rora_2 o r _2 Sox17_2 So oBx1 xx17_2 7 3Isre 2sr Fli1_1 1 H Ro Hnf Zbtb7b izf zfF f kB−p50,p52 NF 52 2bt b Fli1 Nur77 r77Dlx4 Elk1_2 1Mizf 2 Ets1:Irf4 Mef2a_1 M Me ef Elf5 e 15 2 D x4 Irx3_2 Irx3 24N Sp1_1 1 RUN NX Rfx1_1 R Rfx Elf5_2 NFK N FK KB1 K KB B1x31 1 Spib S b Etv3 E2f7 E2f 2f7 7 Pax7_3 7 33P YY 1_1 1 _1t 4 _1 Insm1 Ins 1YY PU U.1 ..1:Irf 1: f E 1 Gcm1 1 Nrf1_2 2f6 Rora Ro oPra ora or ra a_1 _1 _ 1 Sfpi1_2 2 Elf3_2 lf 2 Elf5_1 Elk4_3 Elk4 k 3 k4 E2f6 E2f 2f 6 1 ax7_2 ax 7 b 25 Run Spdef_1 1 E E2f1_2 2 Sonx x2 Runx_2 x_ 2 Rfx4 4 5 Irx4 x4 4xx5 Runx_3 Run un un nxxx_3 u3f F R FO __3xa Nkx1−1 Nkx1 143 1 1 BORIS S xu3 xa5 a3f3 a5 53 Sox Rfx 1 Zn Znf1 Znf Zn nf f143 14 143 1 43 4 Ho oxa1 xS xa 1 H xc8 Ho 8 Sfpi1_1 Tccfc cf ccp pRe 1_2 22A R Res es s35 st t_2 CTCF_1 1 ZNF So oHo 1xxa2 13 3a2 11Pff1 Zbtb33 Z Zbtb b33 b3 b 3 H a 2 CTCF_2 CTCF 2 ax5 5_2 3 ZNF1 43| AF F Klf4_1 K 1 Nrf1_1 1 Rfx1_2 R Rf f fx x1_2 2 Hoxb8 H Ho xb8 b b8 8 Irx3_1 x3_1 EE2f1_ E2f2 Nkx6−1_2 N Nk kx6 x6 1_2 _2 2 Zfp128 8 E2f1 E2f E2f1_1 22 2f1 f1 1_1 1 E f Eklf Glis2 2 E2f1_3 E2f1 f1 3 Irx2 rx2 2 So S o0 x1 12 2IIrx6 Dlx2 2 Pax5_3 3 Zfp161 61 1Zic Zarx1 Zfp 05 51 Zic1 1 r 6 Ba Barx1 Bar Barx B rx1 1 Myb yb_2 b_2 _Bmy _2 2bm Hoxd3 H 3 NF1: N FEn1 1:FO 1: :FO FOX XA A1 F Zic3 c3 3c myb y1Zbtb3 yb bbtb3 En1 En1_ En 1O _X 1C Plagl1 1 Cutl1_2 2 HLH HM 1B Zic2 Zic ic c2 2 Evx1 Ev E v 1C FEV V Ho Zfp423 Zf ffp423 p p423 3 NHLH1 Ap2gamma a Ho oP xa HFEV Hlxb9 l EV lx lxb9 b b3 3 CHR CH H HR R H xb3 x 3 P o u p53_1 1 ou ou4 o u4 f Pcc6 u 2f 2f1 2h Gat Gat a3tta a3 a prima Zfx_2 2 Ap2alpha Ho H oxxc Ascl2 2 Gsh2_3990.2 Gsh Gs h2 2_39 3990 3 9 990.2 1M 2 l ha Myb_3 yGata Ipf1 p2f1 1 M1G Txx1 e _ d2 Dl 51 Sta aNa t1 Hn Hnf1a H nf n ff1 f1a 16u2 a2 Nano N a ano nog no og o g Foxxl1 Pbx1_ Pbx1 Pb P bead bx1 bx1_ b bx _2 Foxq1 oxq1 o xxq q1 1 M SPta tG aSta T fap2eT Tcf Raxx EBNA1 R oxd3 o d3F d3 SCL St Tcffap2b p b Gata4 G BNA1 BNA Stat5 tat5 5 PG E St TBP1_1 1 Vsx1 SCL Meox1 1 Dlx1 lx1 1 ax6_1 a ax6 6 6_1 1x Evx2 S 0 ou3f1 14 FHo Tfap2a_2 f a 2 Pitx3 3 Sp100 Arx M 3xx2 Msx Ms 2IIs SAr 18a xa3_1 1So Tcffap2cc Arid5a r d5a Olig2 g2 2 E 2A ObMsx2 E2A V9ax2 aBa 2arhl2 Tcf12 T ssx x 4 Obo x5_1 5_1 B Barhl2 ar a hl2 2 Ebf1_1 1 Smad4 Bsx B x Ebf1_2 Ebf1 2 Nkx2−9 Hoxb7 b7 7 S Stat TAL T A2L91 AL 1::T 1:: TCF F3 Gata6 6 HO O6_2 X XA Atoh1 A h1 1 1 Stat6_1 Lhx6_1 1b Tfap Tf fap2a_1 ap2a_ Stat6 Sta NeuroD1 N eAtoh D1 Hoxb4_ Hoxb Ho xb b4 Ebf1_3 E 3 6 2 Lhx3_2 L 2 Lhx6_2 TBP BP P_2 2 5 Hmx1 Hmx 1 Gata5 HNF6 6 Og2xx H 2 Hmx2 En2 2 Ho d8 Barhl1 B ho 1x Hmx3 mxxd8 mx x3 3F Nobo No obo FOXL FO OXL X 1 Gfi fi Phox2a ao 4 Nkx1−2 2 Lhx4 Smad3_1 1 HEB? ?
PU.1_2
n.s. P < 0.05 P < 0.01 red text = q < 0.05
E2A:PU.1 1
Spic picc
Homez Homez
Lhx8 8
Nfya
−1.0
−0.5
0.0 ΔFA
0.5
1.0
8
0
l 0.4 2
Nfya Nfy
−1.0
Ap1−Irf4
NFKB1z
Glis2 2zSp100z Myc_2z
NFkB−p50,p52z
Creb1 Eklf Klf4_1
Gfi
Lhx8
Lhx6_1
−0.5
0.0 ΔFA
0.5
n.s. P < 0.05 P < 0.01 red text = q < 0.05
1.0
0
0.5 1.0 1.5
1.0
distance from peak center (bp)
l
l called peaks l total peaks l no called peak
l l 1.0
Ap2alpha Ap2gamma
Cebp
Oct4
Sox2_2
Pou2f3
Hnf1 Alx3 OCT4−SOX2−TCF−NANOG
Hoxa3_1
Pou5f1
−0.8
−0.6
n.s. P < 0.05 P < 0.01 red text = q < 0.05
E 2 Emx2 Hoxd13
−0.4
−0.2
0.0 ΔFA
0.2
WT Oct4 KD
0 10 20 30 40 50
Tfap2a_1
IsrTc Isre cffap2e e Nobox x Ascl2 2 Tcf2 T 2 Runx_3 Ru R 3 Myff TBP_2 2 NFkB B Nkx2−5_1 1 CRX X Irf4_1 _1 NHLH1 1 Nkx3 kx3− x3−1_3 3 Pax4 Pax4_2 a _2 RUNX2 2 Foxh1 1 Mybl1 1 Ap1:Isre_2 Hmbox1 Hmbo x1 1re 2 NFkB− B−p −p50,p52 ,p 2 f _2 Spib b RXRA::VDR RX XR R XRA X R2 Tfap2a_2 kx2 x2 2−9 2− 9 NFkB−p65− 5Cphx Obo O ox6 CNkx2 x2 Rora Ro orra_2 Hoxc12 2aMyo Myb_3 3 yo o D Obo box1 1 NFkB− B−c −cRel−p50 el p5O el−p5 0b MyoG M y Na g TkunxR Rreb1 Rre Rr re eb1 eb b1p4 Ror R Rorc orc rc rc Ho NFkB− Runx 4e p4rrd 41 2 Nkx3−1 Nkx3 1 Hmx2 Hif Hi H iifff1 1b 1 bd 2 NFKB1 NFKB N NFK FK2B1 Zsca a an Prdm Prd Pr P dm d m 1xd12 NFkB−p65_2 B−p65 B−p6 5FK Y1Y Y1 Y1_ 8H Arid5a a5 Mzf1_2 Mzf1 Mzf M 1_2 Zfp128 Ebf1_ Ebf1 _YY Hlff I1Mz Gata5 Nkx3−1_1 Ap A p p1 _zzf 1ff1 Bsx sxx OtxT_e Six1 Hoxc9 Ho xc9_1 _1 ea ad4 Aa e Prdm14 4 Run Runx_ un _Os un 1 sMtf Osr1 Gata4 G 4 tf1 1 Nkx3−1_2 kare 3 3− 2 CHR R Poax7_2 21 Nkx3 S 3x12 xA 2r0 R Eb Ebf Ebf1 b bf bf1 f 1_3 1 Ebf1 Ebf Eb E bf1 bf b f f1 1 2 1 40 Ar p63 31 Smad3_1 S 3_1 1 Re elA elA A ad3 Oli Olig O Ol i y 2b_ M yb_ yb b b_2 Gfi G fi EMy fi 3 NFkB−p65_1 S C CL2 Gata6 6 bo x2 2T2c P 3 cfPitx3 cf1_1 Zbt Zb Zbtb Zfx ffx xM 1fb Tocf1_2 Pitx1 Pit 1 Obox2 Gata1_2 GatT Gata a1_2 a 1_2 22 N::N ff1 1a Maf Ma af f1b1_ _2 2l1 TLX1: T TLX1 LX1 L LX X1 X1::N X 1Zf 1: ::: ::N N 1 GA Ata T A3_2 m1st Pitx2 23 Rest:Nrsf R esst: t Nrs Nrsf sfRes s fm1 R Re sG sHif2a t if2a H Hif2a Hif2 f2 a3 Nkx6−3 kx6−3 ta3 a3−primar a3 a 3− y Ettf1_2 Egr1_2 Eg E gZ AEts At 1 2ta Gata2_1 1 Zffx Zfx_2 Zfx fx_22Atf FO XD1 D1 FOXF2 FO OO XF2 X 21 H xa11F x PPARG PARG RG Gxa1 D 5 Dlx5 Gata1_1 Gata1 1 d_ 2C c− c −xx5 Jun:C J Cre e Sox8 8 HE Ho PJu ax5 a ax 5_ 5 _3 3:C Elk1_ Elk1 EP Msx3 3 Etkttv v4 v 4_JFxr Stat4 tat4 4R F nD rD VDR C Creb Cr reb b1 b 1 Sp100 0 e x1 1 Atf1 tf1_1eo Elk4 E k4 4A Zfp187 p1 7 Hoxc6 6Pax7 4 THcf3 c 3xc1 _1 _ 11 Nrf1 Nr f11 C1re re Nkx6−1 Elkk4 kN 4_f1 S 23 Sp Spz1 1 27 3 Gsc_2327.3 G 2327 Nrf1 N r 2 rf StElk4 Sp1 1 ESp1_ Et Ta al 1::Gata1 1::Gata 1 Dlx1 Fl 1 Fl Fli Fli1 1_1 _ 13l1 Stat1_1 1 v3 Elk1 Elk1_ E El G kP1a Gabp 44_2 ax ax2 x32lk x2 El Elk lkk4 E2 E2f3 3ff2 2 2N Mizf M Mi izf zf Gabpa Gab abpE ab Nkx2−2 2 Hox Ho H xa7_1 xa 1 E2f 2 2f 2f4 ff4 47Dlx2 Dlx D x2 x 2E2 VH V ax2x E2 E 2 2f7 2f f7 NfE2 Nf att: a2 :Ap Ap Ap1 p1 p 16−1 6 161 233Irf8 r8 Z13b Zbtb Zb btb3 bt tb33 ttb3 b3 b Sox1 1 Hoxd3 Ho xd d Sox7 7 Hltf Hoxc8 8 NR2F1 N R R2F R2x R2 2F F1 F 1 1 So 0x fat Irf1 rf1 1 So x2_1 Lhx3_2 2 Hdx dxx SOX9 SO O 9 Etv5 5 Hif1a:Arntt Ho H xd8 8 Usf1_1 1 E 1 bx Bbx B x Uncx4.1 1 Pou6f1_1 1 Six1_1 Usf1 sf1_2 sf1 SUs Home H om me eEr ez rrf f3 3D Nkx2−6 6 Ezrrr rra ra aV Shox2 2 Lhx9 hx9 9 GAT G AE A DR4 ax14 Po ou uGA 3 3f4 4TA−D Ho Nkx2−5_2 2 Ho H oxc5 5 Lhx1 Ph axa3_2 ax4 4_12 4_1 T1isre T1is T T1 1 sre 1i e Lb Lhx2 Lh hx2 24 Cdx dx2 x2 Elk E lkk12 1_2 1 2 Prrx Phox2b o x2b b1 Nkx3 N Nkax7 Nk kx k x3−2 x3 37−2 −_ 2 21 H 6 Pkx3 ax _1 1 Titf1 1 T 2 TBP_ 1 xTlx22 Phox2a a Rax Nfatc2 f 2 Hoxc4 4 En En2 E na2 Srf_1 Esrra E Hoxa9 Hoxa H Ho xa9 xa x PH P 4f 4 ff3 39 Nkx2−4 4 Otp p Sry y Hoxa5 5oxa1 ga Ho H 1 Ho oIsH xb8 b8 83g Alx4 4 M 2 Msx2 F9oxl1 1 Hlxb9 b9 Hoxb6 6 Ms Msx1 Ho Hoxb4_2 2Og2x Car 1 1 O x Mef2a_2 2 Lhx5 5r tt1_1 Hoxd10 0 Arx x Isx xf1 2 Lhx4 x4 Pou2f1_2 Gbx2 2 Prop1 1 Sox14 4 Rho h x6 6 Barhl1 B Ba arhll1 Vsx1 1 Lmx1a a 1 H xa4 Ho x 4p1 Evx1 Hnf1a Hnf1 H f1 f1 1a a Lhx6_2 6 2p1 PouHnf Cdx2_2 Cdx x2 2 x Dbx2 Db D 2 Evx2 2 Arid3a_2 3a 2 Esx1 1 Prrx2_2 2 P ax6_2 2 Hoxd xd1 d1 Cart1_2 2a Zfp105 p 5 Dmbx1 bx1 1 Pou3f1 1 Hlx1 Lhx3_3 Lhx Dbx1 x1 1
-1.5 -1.0 -0.5
0
0.5 1.0 1.5
distance from peak center (kp)
2.0
ΔFA
2.5
3.0
Sox2 ChIP sites WT Oct4 KD
-1.5 -1.0 -0.5
0
0.5 1.0 1.5
distance from Oct4 peak center (kp)
NANOG ChIP sites 0 10 20 30 40 50
Tcfap2b Tcfap2c
Pou2f2
1.5
Oct4 ChIP sites
average tag density
Mef2a_1
average tag density
−0.15 −0.10 −0.05 0.00 0.05 0.10
1.2 1.1
-1.5 -1.0 -0.5
H Runx R Ru Runx_ unx_2 Zbtb btb7b b Obox5_1 5_1 1 Z
Jun:Ap1
l ll
mESCs [OCT4 KD] - [WT]
Oct2 Lmx1b Pou2f1_1 Pou3f3
5
Ap1_2 l ll Ap1_1
1.5
G NFkB−p53−p50
4
0
Plag1z
Hif1a:Arntz Myc_3z
n−Myc_2z Rreb1z
ΔFA
1.3
−ΔFPD
E2f1_3z
Zic1 ic1 c1z Zic3z z Zic2 Zic
3
1.4
NT PMA/I
0 1 2 3 4 5 6
E2f1_2z
0.0
Irf4_3
NFkB NF kB p65 Relz kB−p65−Rel NFkB−p65−p50z NFkB−cRel−p50 el p50 p5 z z NFR −p53−p50 z RelA llA NFE2L2z S x14 So 4z Ap1:Isre_1 p1:Isre_1 e 1z NFkBz Oct4z Mafa fz Pdx1_2z Oct2z Lhx3_1z Sox21z NFkB−cRelz Teadz T Pou4f3z z Aare Fo F oFO 3z 1z FOXC F FOX OX z z Rel el Nr2e3 Sryz n−Myc n n− −Myc_1 −M Myc_ M 1zz.1_2 Dlx4z PU U .1 2z z−fusionz z Pou2f2 Po Stat3_1 St tat3_1 Gc2 z Dbx1 z PAX P AX2z3 3: FKHR R Elk1_2 E Stat6 Stat6_2 Hif−1a Hif 1az Hif2a 2a az zz zz z Rora Ro R o ora rra a_ a_2 a _z 2z FOX F FO O X _1 z P:oFK u2f FOXL FO F Oz2XL X XL1 z Elk4_2 Elk4_ FO F O XA A 1_1 1z SPDEFz z 1 Mef2a_2 ef2a_ ef2a_2 ef2a ef2a_ f2a 2a 2a_ a_ _2 Elf1z Elk4_1z z zz z 1_1 _2 _ 2z E2f4z z ANO OCT CT4− 4 SO OX2− −TCF T Hn TC ANOG A NOG NO N OG O Hn nf1 nf ff1 1NA Elk1_1z E2f7E El Elk z6Gz z Pax2z Ho xd3 zEts1_1z z z Mizf Zfp161z Zfp Nkx6−1_2 Nk _z E2f6 Fli1 1z E2f2 z _2z Dlx D Dlx2 lx2 x2 2zz Arid5a Pax7_2 ax Nrf1_1 f1 1zz zP Ho S z zTfap2a_2 Dlx1 D llx1 x1 1zxc8 Ehf_2 Ehf_ z Elk4 3 fap2a_ _2 _ 2 zx6 I1zrrx rx6 x6zDl Gabpa_2 abpa bpa pa_2 zz Ho H oP xa3_ xxa3 xa a31f _5 1Ir Sp1_1z z Usf1 Usf1_ Usf1_1 U z l1 z2z P ou2 ou o u2f1_2 u u2 2H 2fH 2f 2f1 1zo zz 3d zz Gabp Gab G ab a zpa pa _ pa_ z ox2 x2fz Tcf fap2e1 z z Ez1lfz f4 4zz z T u u3 Hoxxb H b6b b6 xd8 xxd zz Ehf_1 Eh E Db D bx x3o E_2 Elf Elf2 lff2 lf2 fE 2zz1_2 z3 z_Ho zz Et 4zEtv5 Etv4 E Etv6 Et Etv v6 x xb7 xb b b7 7 E2f3z Irx I rx3 x3 3_ 1 z z Spde z Nu N u ur r r7 7 77 7 z Erg rg P zz z z P a ax4 xN Irx I rx2 P a ax x x7 7 _ 1 zz Etv5 v3 5z zx S SEt ff_1 1z Irx rx5 x5z13z G bpa xb bxx1 b5 51z2 Gabpa_3 Gabpa Gabp zz zz Ho H o1xar z z Nrf1_2z 84z Esx E Es ssx Lmx Lm mx x4 z z Ets1_3 3z Etv1_1 Ba Bar B arx a arx2 rrx2 xz 2H z Ets1Et 5z z z Elk1_3 Elk1 lk1 k1_3 k1 k1_ z3z z Ho 6z11 17_1 x1 aH a9 9zzoMxa6 Sox3 So xH 0zoSxa Etv1_2 1z z xa5 z xa1 H Ho o3 xa7 a90 a7 70.2 5Ho Atf3z Ap2a Ap2alpha p2alpha z o 2So z b Gsh2_3 G sh2_39 sh 2_ _3990 399 3990 39 9 990.2 0H2mx1b 2o Zbtb33z 8z z h z Rfx4 zz P 7_2 3zox8 zz Nkx3 N Nk Nkx kkx3 kx x3H x3 3za −1 −1 1_ Lhx5zA z xc6 zox Nkx3− N k 3− kx 3− −1_ −1 _H _3 3b Hoxb4_ Ho xb b b4 b4_2 Alx xH YY1 YY Y1 1z bHLHE40z bx1_ _ z z4z 2zLhx3 Lhx3_3 Ho 6z N x2 Nkx Nkx2− xx2−9 2−9 2 9zvvi1 3−9 z 5 z z zBarx1z Isl2 Mtf1JJund z z Rfx1_1 R Rf fffy l2z z2 P En n2 Ev E vixd1 iE iEn iE 1zn zz AE toh1 oh1z 2z ARE o Gfy− GfP Gf y1 −Staf −S S5_2 tafz Ho Tcf cf1 fzNk 1 kH 1zO En1_2 En1 Arx A z 9z H z z Rfx1_2z Nkx N 2ff3X 2A Pax5 P ax5_3 3Zfx_2 RENkx2−5 RE Lze ef XA zz Hoxa4 H Hoxa Ho xa kx 2−5 2− 2 5_33Le z cf3_2 z PU.1:Irf PU . z z Lhx1 1z4z zcf Zfp42 Zfp p423 p42 p zz bb1xx6 z G Gmeb1 sx2 2T z Nanog anog an a ano no n nog om gxd1 Hoxb13 Hox Ho H oO xB 3z BORISzZ Atf1_2zC A NTkkx Nk −1M −1 _ _1 1z z Bsx zzz zzz z zz P PZfp CTCF F −SatelliteEle −Sa le leax5_1 eme ement Usf2 cf3 3_ 3 _1 _ X−boxz Neur z Ho cf4 Vsx1 zz Klf7z S roD euroD eur oD D1 D Bhlhb2z Hmx1 x zPrrx2_ x1 z 2zz z z Hmx m mx mx2 Evx2z z Hmx mxx3 TBP_1z B 22rrh Ba Bar rhl2 hl2 h Lhx2 hxHm Nf1 1z Nf1_1 Gbx2 x2zz Lhx3_2 Lh 3 2z Smad3_1 cc−Jun:Crez Hlf lffz CTCF_2z Nkx1−2 Lhx4 L 4z JJunDz z Lhx6_2 Atf1_1z TLX1::NFICz Pbx3z Raxz z Klf4_2 4 2z Og2xz Izsx s Noboxz Pbx1_2 1 2 CTCF_1z B hl1z Barhl1 Nkx1 Nk Nkx1−1 1 1z Crez Myc_1z
E2f1_1z
−0.5
NFkB−p65_2
Egr1 r1 4z
Sp1_2 _2z
junB ChIP sites
Nfatc2 Batf_2 Nf−E2 Ap1_3 Irf4_2 Nfat Mafk_1 Nrf2 Nfat:Ap1 Batf_1 NFkB−p65_1 Bach1
Pax4_2
Zfp281
l l
F
Fosl2
average tag density
1.5 1.0 0.5
Sp4
Ap1:Isre_2
Egr2
l Cebp:Ap1
l
Ap1_2Jun:Ap1 P300 Ap1_1 Hif1b
Egr1_2 Zfp740
0.8
0.5 1.0 1.5
distance from peak center (kp)
E
Egr1_3
l
1.5
Lymphoblasts [PMA/I] - [NT]
Egr1_1
−ΔFPD
6 -1.5 -1.0 -0.5
D
−ΔFPD
1.2
average tag density
−0.5
0.0
E2f4 4 1 Klf7 Klf4_2 K Klf lfff4 2 Gmeb1 Egr1 E g 4 Sp4 4
Nfy
Batf_1
Usf2 2
Max_3 3
Egr2 2 Eg E gr1_1 1 Rreb1 Rre 1
mes. mac.
10 20 30 40 50
Atf1 2 Atf1_2 Usf1_2 sf1_2 _2 2 Myc_2 Myc 2 Usf 1M Usf1_1 Max_1 1 Max_2 Max _2 Jundm2 2 n−Myc_1 1 Atf1_1 1 Myc_3 3 Myc:Maxx
Irf4_2
−ΔFPD
Cebp:Ap1 Bach1 NFE2L2
4
Ap1_3
2
0.5
n−Myc_2 2
Egr1_2 _2 2
Fosl2 Batf_2
Nrf2
Cebpa l l Cebp
l
0
1.0
Nf−E2 Aare
Myc_1 Atf3
−ΔFPD
Ap1:Isre_2
Mafk_1 Hlf
l called peaks l total peaks l no called peak
1.6
CEBPB ChIP sites
Cebp
bHLHE40
C
P300 Hif1b Ap1_1 Jun:Ap1 Ap1_2
average tag density
1.5
B Hematopoiesis [macrophages] - [mesodermal]
WT Oct4 KD
-1.5 -1.0 -0.5
0
0.5 1.0 1.5
distance from Oct4 peak center (kp)
Figure 6. BaGFoot Reliably Predicts Changes in TF Behavior between Conditions (A–C) Bag plot of macrophages compared with their cells of origin revealed CEBP as an outlier motif (A). That motif was increasingly bound by CEBPB in macrophages, as measured by ChIP-seq obtained from Goode et al. (2016) (B). Bound CEBP motifs were responsible for the increase in FA and FPD, whereas unbound motifs had no effect (C). (D–F) Bag plot of PMA/I-treated lymphoblasts compared with untreated lymphoblasts revealed AP-1 as an outlier motif (D). The AP-1 motif was increasingly bound by junB in treated cells, as measured by ChIP-seq obtained from Bevington et al. (2016) (E). Bound AP-1 motifs were responsible for the increase in FA and FPD, whereas unbound motifs had no effect (F). (G and H) Bag plot of mESCs following Oct4 downregulation finds the Oct4, Sox2, and NANOG motifs as outliers (G). The proteins binding these motifs are more weakly bound at Oct4 enhancers, as measured by ChIP-seq obtained from King and Klose (2017) (H). See also Figure S6.
Cell Reports 19, 1710–1722, May 23, 2017 1719
all three motifs in cells where the BRG1 level was knocked down (Figure S6F). Taken together, the results from eight different datasets collectively show that the BaGFoot algorithm efficiently predicts TF activity following short- and long-term activating signals in various settings (differentiation, metabolic stress, immunogenic response, cancer-related chromatin alterations, etc.). BaGFoot can detect motifs that elude conventional motif enrichment analysis as well as motifs that show no measurable footprint. Importantly, the BaGFoot algorithm operates well on ATAC-seq data, indicating its utility for this increasingly utilized methodology. Finally, BaGFoot’s predictions of TF activity were confirmed by ChIP-seq in five different datasets. DISCUSSION TF footprinting approaches have evolved over the last few years, leading to advances in our understanding of TF biology. By the same token, inference of TF activity from footprinting was realized to be more complex than initially imagined because footprint patterns were discovered to be affected by several determinants. Here we show that most TFs do not leave a measurable footprint on DNA, either because of biological properties of the TF or intra-signature cut variability. To negate signature variability, identify footprint-lacking TFs, and to maximize information obtained from accessibility assays, we devised a computational workflow incorporating two properties of TFs that are indirectly measured by accessibility information. Upon binding, TFs increase their FA by recruiting other factors to the enhancer region and, in some cases, also leave a footprint. BaGFoot quantifies the difference in these two parameters between two experimental conditions. A signal that would activate a certain TF, leading to increased binding, has the potential to lead to increased FA and FPD. A critical feature of BaGFoot is an unbiased output that does not require prior knowledge on TFs activated in the examined pathway. This is especially advantageous under circumstances where the TFs relevant to the examined biological condition are either unknown or are too numerous to asses individually. This stands in stark contrast to ChIP-seq experiments that necessitate both knowledge of the target TF as well as TF-specific antibodies that are often difficult to obtain. Thus, when using BaGFoot, one accessibility experiment can direct our attention to relevant TFs in an unbiased manner. Indeed, a nascent, proof-of-principle version of the algorithm was included in a recently published paper; there, the algorithm was able to narrow down a list of 20 potential TFs involved in the regulation of fasting to just three TFs (Goldstein et al., 2017). Additionally, because BaGFoot relies on FA as well as on FPD, it is able to detect TF activity of footprint-lacking TFs. This represents a major advance compared with methods relying strictly on footprinting capacity. Commonly, detection of bulk changes in enhancer accessibility followed by motif enrichment is used to predict TF binding. We found that BaGFoot is superior to conventional motif enrichment analysis for several reasons: in BaGFoot there is no need to analyze a subpopulation of differentially accessible enhancers; BaGFoot can detect relevant TFs omitted by motif analysis;
1720 Cell Reports 19, 1710–1722, May 23, 2017
and BaGFoot measures properties directly related to TF activity and is therefore more indicative of actual TF binding. In addition to motif enrichment, other approaches were recently employed to discern TF activity from accessibility data. As found with BaGFoot, increases in motif-flanking accessibility were correlated with increases in TF activity during differentiation (Sherwood et al., 2014). Additionally, increases in TF gene expression (Gusmao et al., 2016) and binding (Pique-Regi et al., 2011) were correlated with footprinting capacity. The strength of BaGFoot is in combining these two TF properties in a systematic and robust way to a single, statistically evaluated output portraying the overall effect of TFs on chromatin accessibility. Although presenting clear advantages, several aspects of BaGFoot may, in some cases, limit conclusions drawn by it. First, in contrast to ChIP, which directly measures TF binding, BaGFoot only suggests TF binding by proxy. Second, because motifs are sometimes bound by a family of TFs, isolating the involved TF from the family might be challenging. In some cases, this is easily resolved when gene expression data is available for the relevant cell type and unexpressed TFs could be excluded. Third, a minority of TFs bind within inaccessible regions (Thurman et al., 2012). Because BaGFoot is reliant on quantifying enough cleavage events, it is restricted to accessible regions and would therefore not be able to detect these TFs. Last, BaGFoot considers all motifs within DHS sites, some of which are unbound by TFs, potentially leading to under-representation of a TF’s effect on chromatin (i.e., some actual outliers would not be called because most motifs are unbound and ‘‘dilute’’ the signal). Thus, TF-specific downstream experiments may be needed to validate BaGFoot’s output. Nonetheless, in all five datasets where this issue was tested, outliers detected by BaGFoot were associated with differential binding of the relevant TF. In the other three datasets, the outlier motifs were previously shown to bind TFs under the same conditions. In summary, BaGFoot efficiently detects TF activity following various biological events. BaGFoot is robust to different peakcalling algorithms, low-read datasets, and different genomic accessibility methods. BaGFoot can isolate footprint-lacking TFs. Last, the algorithm functions seamlessly even in datasets where differentially accessible enhancers are not observed. Therefore, BaGFoot presents a major improvement in the retrieval of differential TF activity from genomic accessibility experiments and provides valuable information regarding the TFs affecting chromatin accessibility in biological pathways.
EXPERIMENTAL PROCEDURES BaGFoot In the BaGFoot scatterplot, each point represents a motif, and the x axis values represent the change in the average normalized tag counts between two experimental conditions of DNase data of the ± 200-bp region at each motif center (FA). The y axis values represent the change in FPD. To visualize the location, spread, and skewness and to pinpoint outliers among motifs, we employed the bag plot approach (Rousseeuw et al., 1999). We observed that the population median (the point with the highest possible Tukey depth) locates close to the origin. This implies that our normalization works effectively in measuring both FA and FPD. The inner polygon (bag) contains at most 50% of the TFs. The outer polygon fence is formed by inflating the bag geometrically by a default factor of 3.
Publicly Available Data Used in the Study All data used in this study were obtained from published datasets cited in the main text and deposited in the GEO. The accession numbers for naked DNA data are GEO: GSE18927, GSE32970, and GSE92674. The accession numbers for chromatin digestions/ChIP-seq are GEO: GSE72087, GSE39982, GSE67465, GSE81258, GSE69101, GSE26189, GSE87822, and GSE27826 (see also Table S3). Code Availability Under GNU General Public License, version 3.0, all software implementations used in this paper are available at https://sourceforge.net/projects/bagfootr/. SUPPLEMENTAL INFORMATION Supplemental Information includes Supplemental Experimental Procedures, six figures, and four tables and can be found with this article online at http:// dx.doi.org/10.1016/j.celrep.2017.05.003. AUTHOR CONTRIBUTIONS S.B. conceived and wrote the algorithm, designed and performed analyses, and wrote the Experimental Procedures. I.G. designed analyses, provided conceptualization and interpretation, coordinated the project, and wrote the manuscript. G.L.H. supervised the project and helped with conceptualization and interpretation as well as with manuscript writing. ACKNOWLEDGMENTS We thank Prof. Hongshik Ahn (Stony Brook University) for comments that greatly helped to improve the algorithm and Erin E. Swinstead for help with manuscript editing. This work was supported by the Intramural Research Program of the NIH, the National Cancer Institute, and the Center for Cancer Research. Received: January 23, 2017 Revised: April 4, 2017 Accepted: April 26, 2017 Published: May 23, 2017 REFERENCES Baek, S., and Sung, M.H. (2016). Genome-Scale Analysis of Cell-Specific Regulatory Codes Using Nuclear Enzymes. Methods Mol. Biol. 1418, 225–240. Bevington, S.L., Cauchy, P., Piper, J., Bertrand, E., Lalli, N., Jarvis, R.C., Gilding, L.N., Ott, S., Bonifer, C., and Cockerill, P.N. (2016). Inducible chromatin priming is associated with the establishment of immunological memory in T cells. EMBO J. 35, 515–535. Boyle, A.P., Davis, S., Shulha, H.P., Meltzer, P., Margulies, E.H., Weng, Z., Furey, T.S., and Crawford, G.E. (2008). High-resolution mapping and characterization of open chromatin across the genome. Cell 132, 311–322. Buenrostro, J.D., Giresi, P.G., Zaba, L.C., Chang, H.Y., and Greenleaf, W.J. (2013). Transposition of native chromatin for fast and sensitive epigenomic profiling of open chromatin, DNA-binding proteins and nucleosome position. Nat. Methods 10, 1213–1218. Cao, N., and Yao, Z.X. (2011). The hemangioblast: from concept to authentication. Anat. Rec. (Hoboken) 294, 580–588. €ner, B.M., Denny, S.K., Yang, D., Chuang, C.H., Brady, J.J., Lim, J.S., Gru Chiou, S.H., Schep, A.N., Baral, J., Hamard, C., et al. (2016). Nfib Promotes Metastasis through a Widespread Increase in Chromatin Accessibility. Cell 166, 328–342. Feng, Y., Desjardins, C.A., Cooper, O., Kontor, A., Nocco, S.E., and Naya, F.J. (2015). EGR1 Functions as a Potent Repressor of MEF2 Transcriptional Activity. PLoS ONE 10, e0127641. Galas, D.J., and Schmitz, A. (1978). DNAse footprinting: a simple method for the detection of protein-DNA binding specificity. Nucleic Acids Res. 5, 3157–3170.
Goldstein, I., and Hager, G.L. (2015). Transcriptional and chromatin regulation during fasting - The genomic era. Trends Endocrinol. Metab. 26, 699–710. Goldstein, I., Baek, S., Presman, D.M., Paakinaho, V., Swinstead, E.E., and Hager, G.L. (2017). Transcription factor assisted loading and enhancer dynamics dictate the hepatic fasting response. Genome Res. 27, 427–439. Goode, D.K., Obier, N., Vijayabaskar, M.S., Lie-A-Ling, M., Lilly, A.J., Hannah, R., Lichtinger, M., Batta, K., Florkowska, M., Patel, R., et al. (2016). Dynamic Gene Regulatory Networks Drive Hematopoietic Specification and Differentiation. Dev. Cell 36, 572–587. Grant, C.E., Bailey, T.L., and Noble, W.S. (2011). FIMO: scanning for occurrences of a given motif. Bioinformatics 27, 1017–1018. Grøntved, L., Waterfall, J.J., Kim, D.W., Baek, S., Sung, M.H., Zhao, L., Park, J.W., Nielsen, R., Walker, R.L., Zhu, Y.J., et al. (2015). Transcriptional activation by the thyroid hormone receptor through ligand-dependent receptor recruitment and chromatin remodelling. Nat. Commun. 6, 7048. Gusmao, E.G., Allhoff, M., Zenke, M., and Costa, I.G. (2016). Analysis of computational footprinting methods for DNase sequencing experiments. Nat. Methods 13, 303–309. He, H.H., Meyer, C.A., Chen, M.W., Jordan, V.C., Brown, M., and Liu, X.S. (2012). Differential DNase I hypersensitivity reveals factor-dependent chromatin dynamics. Genome Res. 22, 1015–1025. He, H.H., Meyer, C.A., Hu, S.S., Chen, M.W., Zang, C., Liu, Y., Rao, P.K., Fei, T., Xu, H., Long, H., et al. (2014). Refined DNase-seq protocol and data analysis reveals intrinsic bias in transcription factor footprint identification. Nat. Methods 11, 73–78. John, S., Sabo, P.J., Thurman, R.E., Sung, M.H., Biddie, S.C., Johnson, T.A., Hager, G.L., and Stamatoyannopoulos, J.A. (2011). Chromatin accessibility pre-determines glucocorticoid receptor binding patterns. Nat. Genet. 43, 264–268. King, H.W., and Klose, R.J. (2017). The pioneer factor OCT4 requires the chromatin remodeller BRG1 to support gene regulatory element function in mouse embryonic stem cells. eLife 6, e22631. Kwon, H.S., Kim, M.S., Edenberg, H.J., and Hur, M.W. (1999). Sp3 and Sp4 can repress transcription by competing with Sp1 for the core cis-elements on the human ADH5/FDH minimal promoter. J. Biol. Chem. 274, 20–28. Lazarovici, A., Zhou, T., Shafer, A., Dantas Machado, A.C., Riley, T.R., Sandstrom, R., Sabo, P.J., Lu, Y., Rohs, R., Stamatoyannopoulos, J.A., and Bussemaker, H.J. (2013). Probing DNA shape and methylation state on a genomic scale with DNase I. Proc. Natl. Acad. Sci. USA 110, 6376–6381. Man, K., and Kallies, A. (2015). Synchronizing transcriptional control of T cell metabolism and function. Nat. Rev. Immunol. 15, 574–584. Neph, S., Vierstra, J., Stergachis, A.B., Reynolds, A.P., Haugen, E., Vernot, B., Thurman, R.E., John, S., Sandstrom, R., Johnson, A.K., et al. (2012). An expansive human regulatory lexicon encoded in transcription factor footprints. Nature 489, 83–90. Pique-Regi, R., Degner, J.F., Pai, A.A., Gaffney, D.J., Gilad, Y., and Pritchard, J.K. (2011). Accurate inference of transcription factor binding from DNA sequence and chromatin accessibility data. Genome Res. 21, 447–455. Rousseeuw, P.J., Ruts, I., and Tukey, J.W. (1999). The Bagplot: A Bivariate Boxplot. Am. Stat. 53, 382–387. Sherwood, R.I., Hashimoto, T., O’Donnell, C.W., Lewis, S., Barkal, A.A., van Hoff, J.P., Karun, V., Jaakkola, T., and Gifford, D.K. (2014). Discovery of directional and nondirectional pioneer transcription factors by modeling DNase profile magnitude and shape. Nat. Biotechnol. 32, 171–178. Siersbæk, R., Nielsen, R., John, S., Sung, M.H., Baek, S., Loft, A., Hager, G.L., and Mandrup, S. (2011). Extensive chromatin remodelling and establishment of transcription factor ‘hotspots’ during early adipogenesis. EMBO J. 30, 1459–1472. Stergachis, A.B., Neph, S., Sandstrom, R., Haugen, E., Reynolds, A.P., Zhang, M., Byron, R., Canfield, T., Stelhing-Sun, S., Lee, K., et al. (2014). Conservation of trans-acting circuitry during mammalian regulatory evolution. Nature 515, 365–370.
Cell Reports 19, 1710–1722, May 23, 2017 1721
Sung, M.H., Guertin, M.J., Baek, S., and Hager, G.L. (2014). DNase footprint signatures are dictated by factor dynamics and DNA sequence. Mol. Cell 56, 275–285. Sung, M.-H., Baek, S., and Hager, G.L. (2016). Genome-wide footprinting: ready for prime time? Nat. Methods 13, 222–228. Swinstead, E.E., Miranda, T.B., Paakinaho, V., Baek, S., Goldstein, I., Hawkins, M., Karpova, T.S., Ball, D., Mazza, D., Lavis, L.D., et al. (2016). Steroid Receptors Reprogram FoxA1 Occupancy through Dynamic Chromatin Transitions. Cell 165, 593–605. Thurman, R.E., Rynes, E., Humbert, R., Vierstra, J., Maurano, M.T., Haugen, E., Sheffield, N.C., Stergachis, A.B., Wang, H., Vernot, B., et al. (2012). The accessible chromatin landscape of the human genome. Nature 489, 75–82.
1722 Cell Reports 19, 1710–1722, May 23, 2017
Tussiwand, R., and Gautier, E.L. (2015). Transcriptional Regulation of Mononuclear Phagocyte Development. Front. Immunol. 6, 533. Vierstra, J., and Stamatoyannopoulos, J.A. (2016). Genomic footprinting. Nat. Methods 13, 213–221. Voss, T.C., and Hager, G.L. (2014). Dynamic regulation of transcriptional states by chromatin and transcription factors. Nat. Rev. Genet. 15, 69–81. Yardımcı, G.G., Frank, C.L., Crawford, G.E., and Ohler, U. (2014). Explicit DNase sequence bias modeling enables high-resolution transcription factor footprint detection. Nucleic Acids Res. 42, 11865–11878. Yokomizo, T., Takahashi, S., Mochizuki, N., Kuroha, T., Ema, M., Wakamatsu, A., Shimizu, R., Ohneda, O., Osato, M., Okada, H., et al. (2007). Characterization of GATA-1(+) hemangioblastic cells in the mouse embryo. EMBO J. 26, 184–196.