Identification of altered cis-regulatory elements in human disease


Review

Anthony Mathelier, Wenqiang Shi, and Wyeth W. Wasserman
Centre for Molecular Medicine and Therapeutics, 950 West 28th Avenue, Vancouver, BC V5Z 4H4, Canada

It has long been appreciated that variations in regulatory regions of genes can impact gene expression. With the advent of whole-genome sequencing (WGS), it has become possible to begin cataloging these noncoding variants. Evidence continues to accumulate linking clinical cases with cis-regulatory element disruption in a wide range of diseases. Identifying variants is becoming routine, but assessing their impact on regulation remains challenging. Bioinformatics approaches that identify variations functionally altering transcription factor (TF) binding are increasingly important for meeting this challenge. We present the current state of computational tools and resources for identifying the genomic regulatory components (cis-regulatory regions and TF binding sites, TFBSs) controlling gene transcriptional regulation. We review how such approaches can be used to interpret the potential disease causality of point mutations and small insertions or deletions. We hope this will further motivate the development of methods enabling the identification of etiological cis-regulatory variations.

Introduction
Recent advances in high-throughput sequencing technologies have revolutionized our capacity to decipher the human genome, providing new opportunities to better understand common and rare genetic diseases. Over the past 5 years it has become increasingly common to use whole-exome sequencing (WES, see Glossary) to identify genetic mutations with a phenotypic effect causal for a disease. This approach, which sequences the 2% of the human genome that codes for proteins, reveals mutations affecting protein function. These regions are well characterized, and diverse tools such as SIFT [1] and PolyPhen-2 [2] are available to assess the deleterious impact of mutations on protein function. However, an increasing number of cases are accumulating for which no mutation in protein-coding genes has been found to confidently explain the observed phenotype.
Although causal mutations might be missed by computational pipelines, or overlooked owing to an incomplete understanding of gene function, a meaningful portion of these cases are likely caused by mutations lying in the remaining 98% of the human genome [3].

Corresponding author: Wasserman, W.W. ([email protected]).
Keywords: cis-regulatory elements; transcription factor binding sites (TFBSs); whole-genome sequencing (WGS); ChIP-seq; noncoding variants; personalized medicine.
0168-9525/© 2015 Elsevier Ltd. All rights reserved. http://dx.doi.org/10.1016/j.tig.2014.12.003

Thanks to the decreasing cost of sequencing technology, one can now perform WGS for an individual patient at a cost approaching USD 1000. We expect WGS to become a common medical test in the coming years, eventually replacing WES, which is now entering clinical use. With WGS data, one has access to the regions of the genome that determine where and when genes are transcribed. These regions are termed cis-regulatory regions and are composed of cis-regulatory elements (Figure 1). TFBSs are the core elements of such cis-regulatory regions. Accumulating evidence indicates the importance of cis-regulatory element alteration in multiple diseases. This is illustrated by the human gene mutation

Glossary
Allele-specific binding (ASB): an ASB location corresponds to a genomic region where a TF predominantly occupies one allele (Figure 2).
Cis-regulatory element: a TFBS located within a cis-regulatory region (Figure 1).
Cis-regulatory region: a noncoding regulatory DNA region that controls the expression of genes on the same chromosome (promoters, enhancers, insulators, etc.) and is composed of cis-regulatory elements (Figure 1).
Enhancer: a genomic region controlling the rate of expression of one or multiple distal genes (Figure 1).
Enhancer RNA (eRNA): an RNA molecule obtained from the transcription of a DNA sequence at enhancers ([88] for review).
Expression quantitative trait locus (eQTL): a genomic region observed to be associated with gene expression modulation in a population.
Expression quantitative trait nucleotide (eQTN): the specific causal SNP located in an eQTL.
Genome-wide association study (GWAS): analyzes common genetic variants in a population to highlight variants associated with a specific trait.
Indel: a small insertion or deletion of nucleotides in a genome.
Lineage-specific selection: sequences constrained by lineage-specific selection show selection within specific species or branches of a phylogenetic tree.
Position weight matrix (PWM): in the context of this review, a PWM (or position-specific scoring matrix) is a quantitative model of TF binding that produces scores correlated with TF–DNA binding energies ([30] for review).
Promoter: a functional regulatory region of a gene at which the transcription machinery binds to initialize transcription (Figure 1).
Single-nucleotide polymorphism (SNP): a variant at a specific DNA locus that is observed with a specified frequency across a population.
Single-nucleotide variant (SNV): a variant observed at a specific DNA locus. The frequency in a population is not considered (in contrast to a SNP). A SNV may be a SNP, but a SNP is always a SNV.
Transcription factors (TFs): proteins that control the rate of transcription of genes. Sequence-specific DNA-binding TFs are a subset of these proteins which bind to the DNA in a sequence-specific manner (Figure 1).
Transcription factor binding site (TFBS): a region in the DNA which is specifically bound by a TF (Figure 1).
Transcription start-site (TSS): the position where the RNA polymerase initiates the production of RNA from the DNA (Figure 1).
Whole-exome sequencing (WES): characterization of the complete DNA sequences corresponding to exons (usually focusing on protein-coding gene exons).
Whole-genome sequencing (WGS): characterization of the complete DNA sequence of the genome of an individual.

Trends in Genetics, February 2015, Vol. 31, No. 2


[Figure 1, panels (A)–(D). Labels: histone octamer, TFs, enhancer, nucleosome, cohesin, RNA transcripts, TSS, DNA, regulatory proteins, promoters, RNA Pol II. Data tracks: CAGE, DNase I, H3K4me3, H3K27ac, TF ChIP-seq, ChIA-PET.]

Figure 1. Overview of cis-regulation. (A) A human chromosome is organized in the cell through compaction of nucleosomes, in other words DNA strands rolled around a histone octamer. (B) Zoomed view at a transcriptional regulatory event where an enhancer is brought close to a promoter. Transcription factors (TFs) and other regulatory proteins provide a favorable environment for the RNA polymerase II (RNA Pol II) to produce RNA transcripts. Active cis-regulatory regions are represented by plain red boxes, whereas a dashed box represents an inactive one. Note that transcripts are produced at transcription start-sites (TSSs) both from the regulated gene and from the enhancer region. (C) The 2D conformation presented in (B) is given as a linear representation with the localization of the promoters, the enhancer, the TFs, the RNA Pol II, the TSSs, and the RNA transcripts on the DNA. (D) Examples of experimentally derived data providing information on transcriptional regulation (H3K4me3, H3K27ac, TF ChIP-seq), gene expression and enhancer activity (CAGE), chromatin accessibility (DNase I hypersensitivity), and chromatin conformation (RNA Pol II ChIA-PET) (see Table 1 for details about these features). Note that green bars represent reads mapped to the positive strand, whereas purple bars represent reads mapped to the negative strand.

database (HGMD [4]), which includes more than 3000 disease-implicated mutations categorized as ‘regulatory’ in its 2014.2 professional version. A meta-analysis of 1200 GWAS SNPs showed that more than one third of noncoding variants are likely to be causal for the phenotypic or disease traits observed [5]. There are a growing number of specific cases as well. For example, a recent study highlighted a rare variant disrupting the binding of YY1 to the promoter of GDF5, resulting in reduced expression associated with osteoarthritis [6]. Mutations in an enhancer regulating the expression of the SOX10 TF have been identified as contributing to Hirschsprung disease [7]. An expression defect of the GATA2 gene has been associated with mutations in a cis-regulatory enhancer, leading to the development of the MonoMAC syndrome [8]. TFBS motif-disrupting alleles have been shown to lead to congenital heart defects in patients with Holt–Oram syndrome [9]. More disease cases such as cohesinopathies, diabetes, and cancers linked to variants lying in cis-regulatory regions are reviewed in [10–13].

Of particular relevance is cancer, which has been described as a disease of disrupted gene regulation. It is therefore unsurprising that noncoding variations are linked to tumorigenesis. For example, a recent study [14] of 863 human tumors highlighted recurrent mutations in the promoter regions of PLEKHS1, WDR74, and SDHD, along with those already known to be associated with the TERT gene. The recurrent mutations found in the SDHD promoter have been shown to be associated with reduced gene expression. Recent advances in WGS are beginning to reveal the presence of regulatory sequence alterations in proximity to many of the genes prone to protein-coding alterations in the same cancer classes [15,16].

Although WGS has the power to identify variants in noncoding regions, these variants are much more difficult to assess than variants in coding regions. A key challenge in accelerating the discovery of cis-regulatory variations in human diseases is developing a robust bioinformatics approach to highlight likely functional changes. By analogy to the successful approaches to protein-coding analysis, in which protein domains and evolutionary patterns are considered, prioritizing functional cis-regulatory alterations requires the identification of the functional cis-regulatory regions and the cis-regulatory elements (TFBSs) within them. One then has to assess the impact on gene expression (in an appropriate context) of variations overlapping the cis-regulatory elements and relate this to the observed phenotype through the dysregulation of controlled genes.

In this review we aim to inform readers of current computational approaches and resources for the identification of cis-regulatory regions and TFBSs in the human genome, with a specific focus on how such tools can empower the analysis and interpretation of single-nucleotide variations (SNVs) and small indels detected in WGS. The impact of structural and copy-number variations on gene expression falls outside the scope of this review ([17] for review). A general perspective of transcriptional regulation is presented to assist the reader in conceptualizing the design of the bioinformatics approaches (Box 1).


Box 1. A perspective on transcriptional cis-regulation
In framing the computational analysis of cis-regulatory regions, it is helpful to approach transcriptional regulation from a simplified perspective. Some considerations that inform our interpretation of methods, data, and the literature are given below.

Within the nucleus there is a mixture of densely packed and more-accessible DNA, with some regional locations where elevated concentrations of specific proteins are found. Segments of DNA can be proximal, allowing interactions (both inter- and intra-chromosomal). Within this setting, DNA-binding TFs load onto DNA (possibly with elevated rates at specific positions) and slide along the backbone, intermittently encountering a suitable position at which to engage the internal bases, allowing a more stable interaction. This is an important point because we anticipate that the TFs will convert from a non-specific backbone interaction to a sequence-specific interaction, which may alter the shape of the protein and/or the DNA. The presence of a DNA-binding TF may either bring or catalyze the recruitment of additional proteins (either DNA-binding TFs or other types of TFs). These additional proteins may stabilize binding or create epigenetic changes to increase or decrease access of transcriptional machinery to the chromatin. There is extensive interaction between TFs, with clear evidence in the literature of interactivity between multiple TFs, as well as between TFBSs. It is unclear whether, or what portion of, these interactions are direct (physical contact) or indirect (e.g., multiple TFs acting through the same segment at different times). The TF–DNA interactions are not long-lasting, but may recur frequently if the context is appropriate.

We further perceive that the favorable TF-binding positions fall within a continuum of binding strengths, and that the binding strength may be a target of selective pressure that could influence the placement along a continuum (with the clear caveat that stronger binding does not necessarily mean that the site is more likely to be functional). A great many details remain to be resolved and there can be substantive debate on the model above. We use it as an intellectual framework for interpreting results.

Identifying cis-regulatory regions in the human genome
Although a comprehensive inventory of all functional sequences and their activities across the diversity of cells and conditions is prohibitive, researchers have accumulated data that allow the identification of a vast number of regulatory regions. We succinctly introduce the reader to classes and collections of data that we perceive to be of particular value for the detection of regulatory regions (Table 1).

Properties and experimental data of cis-regulatory regions
The activity of regulatory regions is controlled through the interplay between epigenomic modifications, conformation of the chromatin, and binding of TFs. While highly valuable, the range of histone marks and TF binding events that can be experimentally profiled remains constrained by the limited number of high-quality and specific antibodies. Active cis-regulatory regions are associated with open chromatin, which can be identified through DNase I hypersensitivity, FAIRE-seq [18], or ATAC-seq [19] experiments. In addition to experimentally derived data, one can use genomic sequence characteristics to define cis-regulatory regions, such as conservation ([20] for review) or dinucleotide composition for enhancer prediction [21].

Laboratory methods can now identify active cis-regulatory regions in bulk. For instance, the self-transcribing active regulatory region sequencing (STARR-seq) technology allows the genome-wide identification of active enhancers by assaying millions of candidate DNA sequences from the genomes of model organisms [22]. Recently, the FANTOM5 consortium [23,24] screened hundreds of mammalian samples for active promoters and enhancers derived from cap analysis gene expression (CAGE) data (Box 2), providing atlases of active cis-regulatory regions. These compilations provide a rich source of training data for new informatics methods.

Identifying regulatory regions through machine learning
Drawing upon the growing body of genome-scale data, new machine learning-based methods have emerged for the annotation of cis-regulatory regions. Unsupervised models (in which observed data properties inform the classification of genome segments into groups) [25] segment the genome into classes, some of which may be highly enriched for annotated regions such as promoters or enhancers. A recent assessment indicates that 26% of the tested enhancers (from [26]) show experimental validation of activity [27].
Recently, supervised methods (in which extensive training data are used to focus predictions on specific types of regions) for enhancer prediction have emerged (reviewed in [28]). As more experimentally derived datasets become available for training, the supervised approaches will become increasingly powerful for delineating the locations of

Table 1. Examples of features used for the identification of cis-regulatory regions

Feature: Description
Transcription factor binding sites (TFBSs): TFBSs are core elements of cis-regulatory regions and can be thought of as on/off switches controlling transcription.
Histone modifications: Post-translational modifications of the N-terminal tail of histone proteins. Histone modifications affect the overall chromatin structure and are associated with cis-regulatory regions.
Nucleosome: Basic organizational unit of eukaryotic chromatin composed of an octamer of histone protein cores and a segment of DNA. Nucleosomes are usually depleted at promoters and enhancers.
Open chromatin: Regions highly accessible for TFs and other proteins, usually associated with active genome activity.
DNA methylation: Role in gene regulation varies with cell context. Methylation at promoter regions is usually associated with gene silencing.
Chromatin conformation: Chromatin conformation analyses identify genomic regions that may be linearly very distant but interact closely within the 3D nuclear organization.
Conservation: Assessing the conservation of nucleotides between species through genome alignments can highlight regions likely to be functional.
Nucleotide sequence properties: Analysis of the nucleotide composition of the genome. For instance, the G+C content pattern helps to identify nucleosome positioning, and CpG islands are over-represented in promoters.

Box 2. Active promoters and enhancers derived from CAGE data
The CAGE technology extracts and sequences the 5′ ends of transcribed RNAs in a population of cells [87]. By mapping the reads to the genome one can determine the precise localization of the active transcription start sites (TSSs). In contrast to traditional RNA-seq, CAGE allows the detection of multiple TSSs associated with a single gene with very high precision. Previous studies showed that enhancers can be transcribed to give rise to enhancer RNAs (eRNAs, reviewed in [88]), which are short, exosome-sensitive, unspliced RNAs. Interestingly, the transcription observed at enhancers is captured by CAGE deep-sequencing experiments, and displays a characteristic bidirectionality at enhancer edges (see CAGE track in Figure 1 in main text). This bidirectional signature revealed thousands of active enhancers in human and mouse CAGE samples in the FANTOM5 project [23,24]. By correlating gene expression from promoter TSSs with enhancer activity (both derived from CAGE), the studies provide some insights into the relationships between enhancers and genes. The CAGE-seq technology provides a unique opportunity to delineate active regulatory regions on a genome scale in a sample-specific manner.
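The bidirectional signature described in Box 2 can be illustrated with a small sketch. The score below, the balance of forward-strand and reverse-strand CAGE reads in a candidate window, is one simple way to quantify bidirectionality; the function names, the 0.8 cutoff, and the minimum read count are illustrative assumptions and are not taken from the text.

```python
def directionality(forward_reads: int, reverse_reads: int) -> float:
    """Strand balance in [-1, 1]; 0 means perfectly balanced transcription."""
    total = forward_reads + reverse_reads
    if total == 0:
        raise ValueError("no CAGE reads in window")
    return (forward_reads - reverse_reads) / total


def looks_bidirectional(forward_reads: int, reverse_reads: int,
                        max_abs_score: float = 0.8,
                        min_reads: int = 2) -> bool:
    """Flag a window as a candidate enhancer-like bidirectional locus.

    max_abs_score and min_reads are hypothetical thresholds chosen for
    illustration only.
    """
    if forward_reads + reverse_reads < min_reads:
        return False
    return abs(directionality(forward_reads, reverse_reads)) < max_abs_score


if __name__ == "__main__":
    print(looks_bidirectional(12, 9))   # balanced window
    print(looks_bidirectional(50, 1))   # strongly one-sided, promoter-like
```

A real analysis would also need to handle read normalization and the distance between the two divergent transcription initiation sites, which this sketch ignores.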

cis-regulatory regions. At present most predicted regions are determined by unsupervised approaches, or directly by laboratory methods (e.g., STARR-Seq [22]).

Identifying cis-regulatory elements – the TFBSs

TFBSs are core elements of cis-regulatory regions
Delineating the functional cis-regulatory regions in the human genome provides a first step towards the identification of disease-causing mutations. Located in cis-regulatory regions, functional TFBSs can be conceptually considered as the on/off switches for gene transcription. Several examples show that their disruption can lead to altered gene expression causal for human diseases. The binding of TFs to DNA is a complex interplay between amino acid–nucleotide interactions and topological properties [29]. The positions at which sequence-specific DNA-binding TFs transition from non-specific backbone interactions to specific base interactions can be predicted. TFBSs are classically modeled by position weight matrices (PWMs), which summarize the binding preferences of a TF, usually assuming independence between positions within the TFBS. Detailed descriptions have been presented (e.g., [30]). Multiple studies over the past decade have highlighted the need to go beyond classical PWMs to model features captured by high-throughput and low-throughput TF–DNA binding experiments ([31] for review), even though the classical PWMs perform well for most TFs [32]. Diverse public databases collect, generate, and organize TF binding profiles (JASPAR [33], SwissRegulon [34], HOCOMOCO [35], UniProbe [36], FactorBook [37], HOMER [38], and CIS-BP [39]). At present the databases are focused on classic matrix-based models.

Predicting TFBSs
Amongst the many TF binding assays, the ChIP-seq technique [40] is the most widely used method for genome-scale identification of in vivo TF–DNA interactions [41]. Although the data can be key for modeling TF binding, one should be cautious when interpreting individual ChIP-seq observed regions (Box 3). Predicting TFBSs within ChIP-seq peaks using binding models is a widely used practice for prioritizing predictions of functional cis-regulatory variations with potential impact on transcriptional dysregulation (e.g., [15,42–44]). However, evidence of direct binding of a TF to the DNA does not imply functionality [29], although it does focus analysis on a subset of the genome more likely to contain active TFBSs. In the absence of ChIP-seq data for a particular TF in a specific condition, cell, or tissue, the search for variations disrupting potential binding sites can be constrained to active regulatory regions derived from, for instance, histone modification, open chromatin, or CAGE data (Table 1). The identification of TFBSs within these regions can be performed by dedicated tools such as CENTIPEDE [45], MILLIPEDE [46], Wellington [47], and PIQ [48], assuming a binding model has been generated from experimental data.

Assessing the impact of variations on TF–DNA interaction
Given a variation overlapping a potential TFBS, the next challenge is to predict whether the alteration is likely to have a functional consequence on gene regulation. Existing predictive approaches can involve multiple lines of evidence that a position has a sequence-specific function. TF–DNA interactions arise from the interplay between DNA sequence motifs, chromatin accessibility, epigenetic marks, and interactions with cofactors (Figure 1). To prioritize the variants most likely to disrupt functional regulatory elements, one has to consider multiple aspects of TF–DNA interactions. In the following section we provide the reader with the main features to be considered when prioritizing variants with a regulatory impact; the specific computational tools using these features will be introduced in the next section.

Collecting reliable reference datasets
To evaluate the impact of variants it can be pivotal to compile a set of differentially bound allele pairs where a sequence variation is causal for the TF binding bias.
A subset of cell lines have been studied by both the ENCODE and the 1000 Genomes projects, providing the community with TF binding, epigenetic, and genotyping data from the same cellular context, allowing in-depth analyses of the impact of variations within TFBSs [43,49,50]. Allele-specific binding (ASB) events provide the advantage of focusing on variants within the same cellular environment [50,51] (Figure 2). ASB events can be determined using a binomial test applied to ChIP-seq data, where the underlying heterozygous positions are known [50,51]. Multiple pipelines are now available to detect ASB events from TF ChIP-seq experiments [51–53]. Detectable ASB events represent a small portion of ChIP-seq datasets, usually less than 1% of all peaks [51]. ASB events are enriched for disease-associated SNPs and, when situated within 100 bp of promoter transcription start sites (TSSs), are strongly associated with gene expression alteration [50].

Interpreting TF binding alteration
Differential TF binding has been analyzed across individuals (either tissue samples or cell strains) or at heterozygous sites in a cell. Motif-altering SNVs can account for a subset


Box 3. What to expect when you are ChIP’ing?
Detailed recommendations for comprehensive analysis of ChIP-seq data are provided in [89]. While TFBSs are usually 8–15 nt in length, ChIP-seq peaks (regions with an enrichment of mapped reads) are commonly 300 nt in length, with a sub-area covered by the maximum number of reads (peak maximum, Figure IA). One anticipates the peaks to be enriched for the DNA motif recognized by the ChIP’ed TF in the vicinity of the peak maximum [90], with the motif being detectable using de novo motif discovery and known motif-enrichment detection tools (e.g., the MEME suite [91], RSAT [92], ChIPMunk [93], HOMER [38], and oPOSSUM-3 [94]). Nevertheless, ChIP-seq experiments do not always exhibit enrichment of the expected motif, nor do all peak regions contain an occurrence of the motif [95–98].

Indeed, a set of peaks can be conceptually decomposed into three subsets which correspond to: (i) a set of peaks with evidence of direct binding, (ii) a set of peaks arising from indirect binding, and (iii) the set of the remaining peaks (Figure I). Direct evidence of binding for peaks in (i) can be obtained by searching for the canonical motif. This provides a set of highly confident TFBSs with both experimental evidence of binding through ChIP-seq and computational evidence of direct binding through the motif (Figure IB). The second set (ii) of peaks is difficult to distinguish from the third without knowing the protein partner interacting with the DNA. Peaks from set (iii) might derive from structural organization of the genome, unspecific binding, or experimental artefacts [99] (Figure IB,C). Some regions corresponding to ChIP artefacts have been reported to be associated with high levels of transcription [100–103].

Multiple studies have highlighted ‘HOT’ (high occupancy of transcription-related proteins) regions [95], or TF clusters [96], that recurrently appear in ChIP-seq data of multiple TFs without enrichment of motifs for the ChIP’ed TFs. Such regions might be related to the chromatin conformation within the nucleus because they correlate with binding of proteins such as CTCF and cohesin which, together with ZNF143, are enriched at interacting loci identified through ChIA-PET experiments [104]. Recurrently observed enrichment of TFBS motifs across ChIP-seq datasets for diverse TFs has identified several motifs associated with frequently recovered regions; the motifs include CTCF and ZNF143 motifs, as well as ETS-like and JUN-like motifs [97].

[Figure I panels. (A) Steps: shearing; recovering; sequencing and mapping; peak maxima. (B) Scatter plot; x axis: motif distance to peak maxima (−400 to 400); y axis: motif score (70 to 100). (C) Categories: direct binding with canonical motif; indirect binding; unknown (CTCF, ZNF143, unspecific binding, etc.).]

Figure I. What to expect from ChIP-seq data. (A) Schematic overview of the ChIP-sequencing procedure. Transcription factors (TFs), red octagons, are chemically crosslinked to the DNA, either directly or indirectly, which is then sheared. DNA segments bound by the TF under study are recovered through an antibody specific to the TF. Deep sequencing is performed, and reads are mapped onto a reference genome to predict bound regions (ChIP-seq peaks) where the peak maximum corresponds to the area with the highest amount of overlapping reads. (B) ChIP-seq peaks can be analyzed by scanning the regions with the canonical TF binding profile (motif) associated with the ChIP’ed TF. For each peak, we can record the position of the most similar sequence to the TF binding profile. The x axis provides the distance of the most similar motifs to the peak maximum. The y axis provides the corresponding scores of the motifs (computed from the TF binding profile). The plot is generated from the ENCODE SRF ChIP-seq experiment performed in embryonic stem cells (H1-hESC). (C) Regions recovered by ChIP-seq can be divided into three categories. The first contains the peaks showing evidence of a direct binding of the TF with a canonical motif predicted close to the peak maximum. The second is composed of peaks derived from indirect binding of the TF to the DNA through another protein. Third, a category of peaks captured for unknown reasons, and that might reflect the structural organization of the genome, unspecific binding, or experimental artifacts.
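The analysis described for panel (B) of Figure I, recording for each peak the best-scoring motif occurrence and its distance to the peak maximum, can be sketched as follows. This is a minimal illustration under stated assumptions: the count matrix, the pseudocount, the uniform background, and all function names are hypothetical; only the forward strand is scanned; sequences are assumed to contain only uppercase A, C, G, T; and raw log-odds scores are returned rather than the percentage-style scores shown on the figure axis.

```python
import math

BASES = "ACGT"


def log_odds_pwm(counts, pseudocount=1.0, background=0.25):
    """Turn per-position base counts into a log2-odds PWM (list of dicts)."""
    pwm = []
    for column in counts:
        total = sum(column.get(b, 0) for b in BASES) + 4 * pseudocount
        pwm.append({
            b: math.log2(((column.get(b, 0) + pseudocount) / total) / background)
            for b in BASES
        })
    return pwm


def best_hit(sequence, pwm):
    """Return (start, score) of the highest-scoring forward-strand window."""
    width = len(pwm)
    if len(sequence) < width:
        raise ValueError("sequence shorter than motif")
    best = None
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(col[base] for col, base in zip(pwm, window))
        if best is None or score > best[1]:
            best = (start, score)
    return best


def distance_to_peak_max(sequence, pwm, peak_max_index):
    """Distance from the centre of the best hit to the peak maximum, and its score."""
    start, score = best_hit(sequence, pwm)
    centre = start + len(pwm) // 2
    return centre - peak_max_index, score


# Toy motif with consensus TGAC (hypothetical counts, illustration only).
TOY_COUNTS = [{"T": 10}, {"G": 10}, {"A": 10}, {"C": 10}]
TOY_PWM = log_odds_pwm(TOY_COUNTS)
TOY_PEAK = "AAATGACAAA"

if __name__ == "__main__":
    print(best_hit(TOY_PEAK, TOY_PWM))
    print(distance_to_peak_max(TOY_PEAK, TOY_PWM, peak_max_index=5))
```

Plotting the returned distance against the score for every peak in a dataset would give a scatter of the kind described in the caption; a production scan would also cover the reverse strand and handle ambiguous bases.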

of observed binding differences [49,54] (Figure 3A,B). The alleles closest to the consensus motif preferentially show elevated binding [43,50,55]. In addition, altered motifs proximal to the experimental ChIP-seq peak maximum are more strongly associated with differential binding [54]. Most ASB events do not overlap the ChIP’ed TF motif, indicating that other mechanisms account for a portion of the events [43,50]. The presence (or absence) of cofactors (i.e., TFs acting cooperatively) binding nearby could account for a subset of these events (Figure 3A,C) [54,56]. For instance, NF-κB binding differences between individuals are correlated with SNVs altering TFBSs of its cofactors [56]. Overall, variation within the ChIP’ed TF motif or a cofactor motif can lead to altered TF–DNA interaction and explain a significant proportion of differential events. The relative impact of other potential contributory mechanisms remains to be evaluated.

Chromatin marks and TF binding
TFs can show binding preferences in the context of specific open chromatin and histone modifications, and this preference can be used to inform TFBS prediction


[55]. Taken together, it is possible that the binding of some TFs depends on specific chromatin marks, whereas that of others does not (such as so-called pioneer factors, see [29]). Attributing differential TF binding to epigenetic alterations, cofactor binding, or canonical TFBS disruption remains a challenge.

[Figure 2 panels. (A) and (B): ChIP-seq reads mapped to alleles A1 and A2 (panel A) and to alleles B1 and B2 (panel B).]

Figure 2. Allele-specific and heterozygous binding. (A) Allele-specific binding events can be derived from heterozygous loci where there is a strong preference for ChIP-seq reads to map to only one allele. (B) Non-allele-specific binding regions harbor reads mapping similarly to both alleles. This can occur when the transcription factor (TF) recognizes the two alleles equally or when the variation does not disrupt the active binding site.
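The text notes that ASB events can be determined with a binomial test at known heterozygous positions. A minimal sketch of that idea follows, using an exact two-sided test against the null hypothesis that reads come equally from both alleles. The function names, the site identifiers, and the significance threshold are hypothetical; the sketch applies no correction for multiple testing or for reference-allele mapping bias, both of which a real pipeline would need to consider.

```python
from math import comb


def binomial_two_sided_p(ref_reads: int, alt_reads: int) -> float:
    """Exact two-sided binomial p-value under equal allelic ratio (p = 0.5)."""
    n = ref_reads + alt_reads
    if n == 0:
        return 1.0
    k = min(ref_reads, alt_reads)
    # Probability of seeing k or fewer reads on the minor allele.
    lower_tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    # The null distribution is symmetric, so double the tail and cap at 1.
    return min(1.0, 2 * lower_tail)


def call_asb(sites: dict, alpha: float = 0.01) -> list:
    """Return identifiers of heterozygous sites with allelic imbalance.

    sites maps a site identifier to (ref_reads, alt_reads); alpha is an
    illustrative, uncorrected threshold.
    """
    return [site for site, (ref, alt) in sites.items()
            if binomial_two_sided_p(ref, alt) < alpha]


if __name__ == "__main__":
    example = {"site_a": (10, 0), "site_b": (5, 5), "site_c": (8, 2)}
    for name, (ref, alt) in example.items():
        print(name, binomial_two_sided_p(ref, alt))
    print(call_asb(example))
```

With low read depth the test has little power, which is consistent with the observation that detectable ASB events cover only a small fraction of peaks.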

[45]. However, the dependence between chromatin marks and TF binding is not always clear [57]. There are subsets of epigenetic marks linked to better TF binding (active marks), and subsets linked to reduced binding (repressive marks) [58] (Figure 3D). From the opposite perspective, studies show the contribution of TF binding to the specification of histone modifications [38,59]. Moreover, SNPs are enriched at regions showing variable epigenetic marks between individuals when compared to invariant regions

[Figure 3 panels (A)–(D). Key: TFs, TFBS, variation, repressive mark.]

Figure 3. Schematic view of transcription factor (TF) binding alteration. (A) TF binding events in a ‘normal’ environment. Two TFs bind to their respective TF binding sites (TFBSs) and stabilize each other’s binding. (B) A variant lying within one of the TFBSs disrupts the binding of the TF to the DNA. This event is represented with lighter shading for the TF. The disruption of the binding can be due to a change in (i) a nucleotide recognized by the TF, or (ii) the DNA shape conformation altering the binding of the TF to the DNA. (C) Another scenario is that a variant lying within one of the TFBSs can disrupt the binding of both TFs to the DNA. In such a scenario, the TFs are considered as cofactors, where the binding of one TF is necessary for the binding of the other TF. (D) A variant can also be associated with a modification of the epigenetic environment, potentially repressing the binding of a TF to the DNA.


TFBS conservation
Sequence conservation can inform the interpretation of variations within cis-regulatory elements (as it does for protein-coding sequence). Linking TFBS sequence conservation and functionality is not trivial. For instance, an early experimental study of regulatory elements in human and mouse estimated that about 32–40% of human functional regions are not functional in mouse [60]. Moreover, TF-bound regions identified in large-scale ChIP-seq experiments show limited conservation across species, with only 10–22% of peaks being conserved [61]. Although many ChIP-identified regions are not conserved, those overlapping evolutionarily conserved sequences are more frequently associated with functional roles [62,63]. Moreover, genomic sequences under lineage-specific selection have been used to filter cis-regulatory variants in cancer WGS analyses [15].

TFBS redundancy
Studies have shown that the presence of redundant binding sites in cis-regulatory regions can maintain the pattern and level of expression, even in the event of sequence alterations. As reviewed in [64], TFBSs can be considered as cumulative or incremental inputs of targeted gene regulation, and individual TF binding events modulate the potency of the regulatory region. The redundancy can be considered as a buffering mechanism, through which a disrupted TFBS can be compensated by another nearby binding site in the same cis-regulatory region [65,66]. It may therefore be important to consider redundant TFBSs when interpreting the functional impact of variation on gene regulation.

Experimental assessment of the impact of variants on gene expression
New experimental techniques allow the assessment of the impact of variants on gene expression. Massively parallel reporter assays allow in vivo dissection of the impact of cis-regulatory regions on gene transcription. In [67], the authors analyzed three mouse liver enhancers by synthesizing >100 000 mutant haplotypes diverging by 2–3% from wild type. The study showed that most of the variants had a modest effect on enhancer function, and only 22% significantly affected expression. The impactful variants were more conserved and overlapped predicted liver-specific TFBSs. Looking at SNVs from a human whole-genome analysis, a similar proportion of exonic SNVs (16%) were annotated as loss-of-function variants [68] by either VEP [69] or ANNOVAR [70]. Finally, massively parallel assays have been used to analyze the functional impact of variants in TFBSs of five activators and repressors in 2000 predicted human enhancers [71].

Examples of computational tools for prioritizing cis-regulatory variants
Several tools have been developed to predict the impact of variants within cis-regulatory elements by integrating both experimentally derived and sequence-based features. Early methods focused mainly on the use of TF binding profiles to evaluate the impact of variants on TF–DNA binding strength. RAVEN [72] uses phylogenetic footprinting information along with PWM score differences between reference and alternative alleles. The is-rSNP [73] and regSNP [74] tools compare the distribution of PWM scores at experimentally validated and background variants to assign P-values to TFBS-altering events. The sTRAP program [75] uses a biophysical model with available TF binding profiles to assess binding affinity differences between wild type and mutated sequences at TFBSs, expressed as P-value differences between the reference and alternative alleles. More recent software tools incorporate epigenetic data along with TF binding profiles to assess the cis-regulatory impact of germline variants. RegulomeDB [76] computes a heuristic score from the number of regulatory features overlapping the variants, but it does not assess whether variants are disrupting TFBSs. GWAS3D [77] integrates chromosome-capture information together with epigenetic marks, impact on binding affinity based on scores from PWMs, and conservation to prioritize regulatory variants. The impact of the variants on TF–DNA affinity is assessed by comparing the log-odds probabilities of binding between the reference and alternative alleles to a null empirical distribution. Most recently, machine-learning approaches have been used to predict variants with pathogenic effects. The GWAVA tool combines genomic information with epigenetic datasets to prioritize the variants, and is trained on disease variants annotated in the HGMD [4] as regulatory mutations versus control variants from the 1000 Genomes project [78]. The recent tools CADD [79] and DANN [80] predict pathogenic variants in both coding and noncoding regions using support vector machine and deep neural network approaches, respectively. Although the noncoding variants are prioritized to highlight those most likely to disrupt transcriptional cis-regulation, the tools do not assess the impact of variants on TF–DNA binding but instead use TFBSs predicted within ChIP-seq datasets using PWMs.

Integrating genomic and transcriptomic information
To bring the pieces together, we consider the challenge of interpreting the combination of genomic and transcriptomic data to predict variations causal for altered expression. Much work has focused on the detection of relationships between genotype and expression. Linkage-based approaches can be applied to determine expression quantitative trait loci (eQTLs), and computational tools have been developed to predict the causal variants (reviewed in [81]). For instance, in [82] three tissues from healthy twin women were analyzed using gene expression arrays and SNP genotyping, revealing tissue-dependent regulatory correlations. The cis-eQTLs clustered preferentially

Trends in Genetics February 2015, Vol. 31, No. 2

and symmetrically around TSSs. The regulatory architecture of gene eQTLs has been similarly explored using HapMap-profiled lymphoblastoid cell lines [83]. The rapid compilation of cis-eQTLs has raised hopes for the identification of causal regulatory variants (as opposed to correlated markers), but progress has been limited. Two groups observed that cis-eQTNs (expression quantitative trait nucleotides) frequently disrupt a downstream promoter element [83,84]. Another study [85] reported that 40% of eQTNs are situated within DNase I hypersensitive or modified histone regions (which cover only 4.5% of the genome). Overlapping with TF ChIP-seq data, eQTNs were found to alter TFBSs for specific profiled TFs [83]. Analysis of lymphoblastoid RNA-seq and genotype data for 462 individuals from the 1000 Genomes Project revealed that cis-eQTLs most likely to be causal for diseases are highly enriched in specific noncoding elements (e.g., TF peaks, DNase I hypersensitive sites, active promoters, and strong enhancers) [44]. Although these correlative analyses are starting to incorporate analysis of TFBS alteration to prioritize candidate causal alterations, the limitations of the predictive methods remain an impediment.

Computational methods have recently started to emerge that unite subsets of the TFBS analysis tools to prioritize candidate regulatory variants. The FunSeq and FunSeq2 tools allow the identification of regulatory variants arising in cancer by integrating epigenetic data, TF ChIP-seq data, sequence conservation, and TFBS motif alteration, as well as cancer-focused gene network analysis [15,86]. Such integrative approaches are likely to be a key focus of the field in the coming years.

Box 4. Outstanding questions

Elucidating enhancer–gene association
We have reviewed current methods allowing the identification of the cis-regulatory regions and elements controlling gene transcription. Although variants disrupting TF–DNA interactions lying within promoters are likely to alter the expression of the corresponding gene, elucidating which gene is altered by variants found at distal regulatory regions remains a hurdle. The challenge is to reveal the 3D geography of the nucleus, showing which enhancers and promoters are spatially proximal and are likely to functionally interact. Classically, variants are often associated with the most linearly proximal promoter. Such an approach does not account for cases in which regulatory regions act on more distant targets, nor for regulatory regions that act upon multiple promoters. Experimentally, promoter–enhancer interactions can be derived from chromatin conformation capture or derived experiments such as ChIA-PET (mediated by RNA Pol II) (reviewed in [105]). A recent study in mouse samples revealed that such interactions are dynamic, with a high cell type-specificity [106], whereas another analysis found that enhancer–promoter relationships in one tissue can be derived from genomic data available in other tissues, suggesting more stable interactions [107]. These disparate results highlight the complexity of the problem, and represent preliminary steps towards the identification of enhancer–promoter interactions necessary for predicting the specific gene expression impact of variants lying within distal regulatory sequences.

Creating cis-regulatory elements
While the literature presented here advances the identification of variants altering gene expression through the disruption of cis-regulatory elements, the recognition of variants creating active transcriptional elements has not been fully addressed. There is inadequate evidence of cases in which TFBSs are created. Preliminary steps have been introduced by the FunSeq2 software, which considers gained TFBSs by scanning sequences around variants at promoters and enhancers with PWMs [86], but the demonstration of informatics approaches that are capable of predicting the creation of bona fide cis-regulatory elements remains to be achieved.

Concluding remarks
Regulatory sequence alterations are expected to be significant contributors to human phenotypes. As described, researchers have an increasing capacity to identify sequence variations situated within cis-regulatory regions, and to predict which of the variations are likely to alter TF–DNA interactions. The key step of demonstrating causality for expression and/or disease-related phenotypes remains limited to detailed mechanistic studies (Box 4).

Acknowledgments
A.M., W.S., and W.W.W. are supported by the Genome Canada Large-Scale Applied Research Grant 174CDE. Funding has been provided by the BC Children’s Hospital Foundation and the Child and Family Research Institute to A.M. The China Scholarship Council provides funding to W.S. We thank the four reviewers and the editor for helpful comments and suggestions. We thank members of the laboratory of W.W.W. for discussions, especially Chih-yu Chen, Yifeng Li, Casper Shyr, and Maja Tarailo-Graovac for helpful comments on the manuscript. We thank Dora Pak for management support. We apologize to the authors of papers that we were not able to cite and discuss owing to space limitations.

References
1 Ng, P.C. and Henikoff, S. (2001) Predicting deleterious amino acid substitutions. Genome Res. 11, 863–874
2 Adzhubei, I.A. et al. (2010) A method and server for predicting damaging missense mutations. Nat. Methods 7, 248–249
3 Manolio, T.A. et al. (2008) A HapMap harvest of insights into the genetics of common disease. J. Clin. Invest. 118, 1590–1605
4 Stenson, P.D. et al. (2009) The Human Gene Mutation Database: 2008 update. Genome Med. 1, 13
5 Visel, A. et al. (2009) Genomic views of distant-acting enhancers. Nature 461, 199–205
6 Dodd, A.W. et al. (2013) A rare variant in the osteoarthritis-associated locus GDF5 is functional and reveals a site that can be manipulated to modulate GDF5 expression. Eur. J. Hum. Genet. 21, 517–521
7 Lecerf, L. et al. (2014) An impairment of long distance SOX10 regulatory elements underlies isolated Hirschsprung disease. Hum. Mutat. 35, 303–307
8 Hsu, A.P. et al. (2013) GATA2 haploinsufficiency caused by mutations in a conserved intronic element leads to MonoMAC syndrome. Blood 121, 3830–3837
9 Smemo, S. et al. (2012) Regulatory variation in a TBX5 enhancer leads to isolated congenital heart disease. Hum. Mol. Genet. 21, 3255–3263
10 Ward, L.D. and Kellis, M. (2012) Interpreting noncoding genetic variation in complex traits and human disease. Nat. Biotechnol. 30, 1095–1106
11 Lee, T.I. and Young, R.A. (2013) Transcriptional regulation and its misregulation in disease. Cell 152, 1237–1251
12 Sur, I. et al. (2013) Lessons from functional analysis of genome-wide association studies. Cancer Res. 73, 4180–4184
13 Friedensohn, S. and Sawarkar, R. (2014) Cis-regulatory variation: significance in biomedicine and evolution. Cell Tissue Res. 356, 495–505
14 Weinhold, N. et al. (2014) Genome-wide analysis of noncoding regulatory mutations in cancer. Nat. Genet. 46, 1160–1165
15 Khurana, E. et al. (2013) Integrative annotation of variants from 1092 humans: application to cancer genomics. Science 342, 1235587
16 Ongen, H. et al. (2014) Putative cis-regulatory drivers in colorectal cancer. Nature 512, 87–90
17 Mikhail, F.M. (2014) Copy number variations and human genetic disease. Curr. Opin. Pediatr. 26, 646–652



18 Song, L. et al. (2011) Open chromatin defined by DNase I and FAIRE identifies regulatory elements that shape cell-type identity. Genome Res. 21, 1757–1767
19 Buenrostro, J.D. et al. (2013) Transposition of native chromatin for fast and sensitive epigenomic profiling of open chromatin, DNA-binding proteins and nucleosome position. Nat. Methods 10, 1213–1218
20 Dowell, R.D. (2010) Transcription factor binding variation in the evolution of gene regulation. Trends Genet. 26, 468–475
21 Yanez-Cuna, J.O. et al. (2014) Dissection of thousands of cell type-specific enhancers identifies dinucleotide repeat motifs as general enhancer features. Genome Res. 24, 1147–1156
22 Arnold, C.D. et al. (2013) Genome-wide quantitative enhancer activity maps identified by STARR-seq. Science 339, 1074–1077
23 The FANTOM Consortium and the RIKEN PMI and CLST (DGT) (2014) A promoter-level mammalian expression atlas. Nature 507, 462–470
24 Andersson, R. et al. (2014) An atlas of active enhancers across human cell types and tissues. Nature 507, 455–461
25 Hoffman, M.M. et al. (2013) Integrative annotation of chromatin elements from ENCODE data. Nucleic Acids Res. 41, 827–841
26 Ernst, J. et al. (2011) Mapping and analysis of chromatin state dynamics in nine human cell types. Nature 473, 43–49
27 Kwasnieski, J.C. et al. (2014) High-throughput functional testing of ENCODE segmentation predictions. Genome Res. 24, 1595–1602
28 Shlyueva, D. et al. (2014) Transcriptional enhancers: from properties to genome-wide predictions. Nat. Rev. Genet. 15, 272–286
29 Slattery, M. et al. (2014) Absence of a simple code: how transcription factors read the genome. Trends Biochem. Sci. 39, 381–399
30 Stormo, G.D. (2013) Modeling the specificity of protein–DNA interactions. Quant. Biol. 1, 115–130
31 Siggers, T. and Gordan, R. (2014) Protein–DNA binding: complexities and multi-protein codes. Nucleic Acids Res. 42, 2099–2111
32 Weirauch, M.T. et al. (2013) Evaluation of methods for modeling transcription factor sequence specificity. Nat. Biotechnol. 31, 126–134
33 Mathelier, A. et al. (2014) JASPAR 2014: an extensively expanded and updated open-access database of transcription factor binding profiles. Nucleic Acids Res. 42, D142–D147
34 Pachkov, M. et al. (2013) SwissRegulon, a database of genome-wide annotations of regulatory sites: recent updates. Nucleic Acids Res. 41, D214–D220
35 Kulakovskiy, I.V. et al. (2013) HOCOMOCO: a comprehensive collection of human transcription factor binding sites models. Nucleic Acids Res. 41, D195–D202
36 Hume, M.A. et al. (2014) UniPROBE, update 2015: new tools and content for the online database of protein-binding microarray data on protein–DNA interactions. Nucleic Acids Res. Published online November 5, 2014. (http://dx.doi.org/10.1093/nar/gku1045)
37 Wang, J. et al. (2013) Factorbook.org: a Wiki-based database for transcription factor-binding data generated by the ENCODE consortium. Nucleic Acids Res. 41, D171–D176
38 Heinz, S. et al. (2010) Simple combinations of lineage-determining transcription factors prime cis-regulatory elements required for macrophage and B cell identities. Mol. Cell 38, 576–589
39 Weirauch, M.T. et al. (2014) Determination and inference of eukaryotic transcription factor sequence specificity. Cell 158, 1431–1443
40 Johnson, D.S. et al. (2007) Genome-wide mapping of in vivo protein–DNA interactions. Science 316, 1497–1502
41 Furey, T.S. (2012) ChIP-seq and beyond: new and improved methodologies to detect and characterize protein–DNA interactions. Nat. Rev. Genet. 13, 840–852
42 Ritchie, G.R. et al. (2014) Functional annotation of noncoding sequence variants. Nat. Methods 11, 294–296
43 Kilpinen, H. et al. (2013) Coordinated effects of sequence variation on DNA binding, chromatin structure, and transcription. Science 342, 744–747
44 Lappalainen, T. et al. (2013) Transcriptome and genome sequencing uncovers functional variation in humans. Nature 501, 506–511
45 Pique-Regi, R. et al. (2011) Accurate inference of transcription factor binding from DNA sequence and chromatin accessibility data. Genome Res. 21, 447–455

46 Luo, K. and Hartemink, A.J. (2013) Using DNase digestion data to accurately identify transcription factor binding sites. Pac. Symp. Biocomput. 2013, 80–91
47 Piper, J. et al. (2013) Wellington: a novel method for the accurate identification of digital genomic footprints from DNase-seq data. Nucleic Acids Res. 41, e201
48 Sherwood, R.I. et al. (2014) Discovery of directional and nondirectional pioneer transcription factors by modeling DNase profile magnitude and shape. Nat. Biotechnol. 32, 171–178
49 Kasowski, M. et al. (2010) Variation in transcription factor binding among humans. Science 328, 232–235
50 Reddy, T.E. et al. (2012) Effects of sequence variation on differential allelic transcription factor occupancy and gene expression. Genome Res. 22, 860–869
51 Rozowsky, J. et al. (2011) AlleleSeq: analysis of allele-specific expression and binding in a network framework. Mol. Syst. Biol. 7, 522
52 Waszak, S.M. et al. (2014) Identification and removal of low-complexity sites in allele-specific analysis of ChIP-seq data. Bioinformatics 30, 165–171
53 Younesy, H. et al. (2014) ALEA: a toolbox for allele-specific epigenomics analysis. Bioinformatics 30, 1172–1174
54 Heinz, S. et al. (2013) Effect of natural genetic variation on enhancer selection and function. Nature 503, 487–492
55 Kasowski, M. et al. (2013) Extensive variation in chromatin states across humans. Science 342, 750–752
56 Karczewski, K.J. et al. (2011) Cooperative transcription factor associations discovered using regulatory variation. Proc. Natl. Acad. Sci. U.S.A. 108, 13353–13358
57 Li, M.J. et al. (2014) Exploring the function of genetic variants in the non-coding genomic regions: approaches for identifying human regulatory variants affecting gene expression. Brief. Bioinform. Published online June 10, 2014. (http://dx.doi.org/10.1093/bib/bbu018)
58 Chen, C.C. et al. (2013) Understanding variation in transcription factor binding by modeling transcription factor genome–epigenome interactions. PLoS Comput. Biol. 9, e1003367
59 Serandour, A.A. et al. (2011) Epigenetic switch involved in activation of pioneer factor FOXA1-dependent enhancers. Genome Res. 21, 555–565
60 Dermitzakis, E.T. and Clark, A.G. (2002) Evolution of transcription factor binding sites in Mammalian gene regulatory regions: conservation and turnover. Mol. Biol. Evol. 19, 1114–1121
61 Schmidt, D. et al. (2010) Five-vertebrate ChIP-seq reveals the evolutionary dynamics of transcription factor binding. Science 328, 1036–1040
62 Whitfield, T.W. et al. (2012) Functional analysis of transcription factor binding sites in human promoters. Genome Biol. 13, R50
63 Handstad, T. et al. (2011) A ChIP-Seq benchmark shows that sequence conservation mainly improves detection of strong transcription factor binding sites. PLoS ONE 6, e18430
64 Spivakov, M. (2014) Spurious transcription factor binding: nonfunctional or genetically redundant? Bioessays 36, 798–806
65 Maurano, M.T. et al. (2012) Widespread site-dependent buffering of human regulatory polymorphism. PLoS Genet. 8, e1002599
66 Spivakov, M. et al. (2012) Analysis of variation at transcription factor binding sites in Drosophila and humans. Genome Biol. 13, R49
67 Patwardhan, R.P. et al. (2012) Massively parallel functional dissection of mammalian enhancers in vivo. Nat. Biotechnol. 30, 265–270
68 McCarthy, D.J. et al. (2014) Choice of transcripts and software has a large effect on variant annotation. Genome Med. 6, 26
69 McLaren, W. et al. (2010) Deriving the consequences of genomic variants with the Ensembl API and SNP Effect Predictor. Bioinformatics 26, 2069–2070
70 Wang, K. et al. (2010) ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Res. 38, e164
71 Kheradpour, P. et al. (2013) Systematic dissection of regulatory motifs in 2000 predicted human enhancers using a massively parallel reporter assay. Genome Res. 23, 800–811
72 Andersen, M.C. et al. (2008) In silico detection of sequence variations modifying transcriptional regulation. PLoS Comput. Biol. 4, e5


73 Macintyre, G. et al. (2010) is-rSNP: a novel technique for in silico regulatory SNP detection. Bioinformatics 26, i524–i530
74 Teng, M. et al. (2012) regSNPs: a strategy for prioritizing regulatory single nucleotide substitutions. Bioinformatics 28, 1879–1886
75 Manke, T. et al. (2010) Quantifying the effect of sequence variation on regulatory interactions. Hum. Mutat. 31, 477–483
76 Boyle, A.P. et al. (2012) Annotation of functional variation in personal genomes using RegulomeDB. Genome Res. 22, 1790–1797
77 Li, M.J. et al. (2013) GWAS3D: Detecting human regulatory variants by integrative analysis of genome-wide associations, chromosome interactions and histone modifications. Nucleic Acids Res. 41, W150–W158
78 The 1000 Genomes Project Consortium (2010) A map of human genome variation from population-scale sequencing. Nature 467, 1061–1073
79 Kircher, M. et al. (2014) A general framework for estimating the relative pathogenicity of human genetic variants. Nat. Genet. 46, 310–315
80 Quang, D. et al. (2014) DANN: a deep learning approach for annotating the pathogenicity of genetic variants. Bioinformatics Published online October 22, 2014. (http://dx.doi.org/10.1093/bioinformatics/btu703)
81 Battle, A. and Montgomery, S.B. (2014) Determining causality and consequence of expression quantitative trait loci. Hum. Genet. 133, 727–735
82 Nica, A.C. et al. (2011) The architecture of gene regulatory variation across multiple human tissues: the MuTHER study. PLoS Genet. 7, e1002003
83 Gaffney, D.J. et al. (2012) Dissecting the regulatory architecture of gene expression QTLs. Genome Biol. 13, R7
84 Veyrieras, J.B. et al. (2008) High-resolution mapping of expression-QTLs yields insight into human gene regulation. PLoS Genet. 4, e1000214
85 Burke, T.W. and Kadonaga, J.T. (1996) Drosophila TFIID binds to a conserved downstream basal promoter element that is present in many TATA-box-deficient promoters. Genes Dev. 10, 711–724
86 Fu, Y. et al. (2014) FunSeq2: a framework for prioritizing noncoding regulatory variants in cancer. Genome Biol. 15, 480
87 Shiraki, T. et al. (2003) Cap analysis gene expression for high-throughput analysis of transcriptional starting point and identification of promoter usage. Proc. Natl. Acad. Sci. U.S.A. 100, 15776–15781
88 Natoli, G. and Andrau, J.C. (2012) Noncoding transcription at enhancers: general principles and functional models. Annu. Rev. Genet. 46, 1–19
89 Bailey, T. et al. (2013) Practical guidelines for the comprehensive analysis of ChIP-seq data. PLoS Comput. Biol. 9, e1003326
90 Bailey, T.L. and Machanick, P. (2012) Inferring direct DNA binding from ChIP-seq. Nucleic Acids Res. 40, e128
91 Ma, W. et al. (2014) Motif-based analysis of large nucleotide data sets using MEME-ChIP. Nat. Protoc. 9, 1428–1450
92 Thomas-Chollier, M. et al. (2012) A complete workflow for the analysis of full-size ChIP-seq (and similar) data sets using peak-motifs. Nat. Protoc. 7, 1551–1568
93 Kulakovskiy, I.V. et al. (2010) Deep and wide digging for binding motifs in ChIP-Seq data. Bioinformatics 26, 2622–2623
94 Kwon, A.T. et al. (2012) oPOSSUM-3: advanced analysis of regulatory motif over-representation across genes or ChIP-Seq datasets. G3 2, 987–1002
95 Yip, K.Y. et al. (2012) Classification of human genomic regions based on experimentally determined binding sites of more than 100 transcription-related factors. Genome Biol. 13, R48
96 Yan, J. et al. (2013) Transcription factor binding in human cells occurs in dense clusters formed around cohesin anchor sites. Cell 154, 801–813
97 Worsley Hunt, R. and Wasserman, W.W. (2014) Non-targeted transcription factors motifs are a systemic component of ChIP-seq datasets. Genome Biol. 15, 412
98 Kheradpour, P. and Kellis, M. (2014) Systematic discovery and characterization of regulatory motifs in ENCODE TF binding experiments. Nucleic Acids Res. 42, 2976–2987
99 Worsley Hunt, R. et al. (2014) Improving analysis of transcription factor binding sites within ChIP-Seq data based on topological motif enrichment. BMC Genomics 15, 472

100 Fan, X. and Struhl, K. (2009) Where does mediator bind in vivo? PLoS ONE 4, e5029
101 Teytelman, L. et al. (2013) Highly expressed loci are vulnerable to misleading ChIP localization of multiple unrelated proteins. Proc. Natl. Acad. Sci. U.S.A. 110, 18602–18607
102 Park, D. et al. (2013) Widespread misinterpretable ChIP-seq bias in yeast. PLoS ONE 8, e83506
103 Kasinathan, S. et al. (2014) High-resolution mapping of transcription factor binding sites on native chromatin. Nat. Methods 11, 203–209



104 Heidari, N. et al. (2014) Genome-wide map of regulatory interactions in the human genome. Genome Res. 24, 1905–1917
105 van Steensel, B. and Dekker, J. (2010) Genomics tools for unraveling chromosome architecture. Nat. Biotechnol. 28, 1089–1095
106 Zhang, Y. et al. (2013) Chromatin connectivity maps reveal dynamic promoter-enhancer long-range associations. Nature 504, 306–310
107 O’Connor, T.R. and Bailey, T.L. (2014) Creating and validating cis-regulatory maps of tissue-specific gene expression regulation. Nucleic Acids Res. 42, 11000–11010