Decoding the non-coding genome: Opportunities and challenges of genomic and epigenomic consortium data


Accepted Manuscript

PII: S2452-3100(18)30075-1
DOI: 10.1016/j.coisb.2018.09.002
Reference: COISB 191
To appear in: Current Opinion in Systems Biology
Received Date: 5 July 2018
Accepted Date: 1 September 2018

Please cite this article as: Pratt H, Weng Z, Decoding the Non-Coding Genome: Opportunities and Challenges of Genomic and Epigenomic Consortium Data, Current Opinion in Systems Biology (2018), doi: 10.1016/j.coisb.2018.09.002. This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.


Decoding the Non-Coding Genome: Opportunities and Challenges of Genomic and Epigenomic Consortium Data

Henry Pratt1, Zhiping Weng1,*

1 Program in Bioinformatics and Integrative Biology, University of Massachusetts Medical School, Worcester, MA 01605, USA

* Corresponding author: [email protected], 1-(508) 856-8866, 368 Plantation St., Albert Sherman Center, 5th floor, Room 1067, Worcester MA 01605

Abstract

Publicly-available next-generation sequencing data has greatly expanded in the past decade, particularly through the work of several international consortia. These collaborative efforts have applied numerous assays profiling diverse features of gene regulation, such as genome-wide chromatin structure, transcriptional activity, and transcription factor binding, to thousands of biosamples from several organisms. Newly-developed computational analyses and statistical methods link findings to gene expression changes and phenotypic changes. Integrative analysis of these datasets holds the potential to revolutionize our understanding of organismal development, cell type differentiation, cellular response to stimuli, and disease mechanisms. However, standardized methods for data access, uniform data processing, and integrative analysis largely do not exist, hindering the impacts of these efforts. Here we review advancements made by consortia and directions of ongoing efforts, as well as challenges in accessing and analyzing publicly-available consortium data and emerging tools for addressing these challenges.

Introduction

The Human Genome Project revealed that coding exons comprise only 2% of the genome’s base pairs [1]. The vast remainder includes introns, pseudogenes, non-coding RNAs, transposons, and various regulatory elements such as enhancers and promoters. A complete understanding of these non-coding regions is crucial in understanding development, cellular differentiation, and disease: more than 90% of known disease variants are non-coding, and non-coding regions harbor binding sites for hundreds of transcription factors [2–4].


Decoding the regulatory function of the non-coding genome is a multifaceted effort with contributions from many fields, biological and computational alike. Thus, investigative efforts have been in large part the efforts of various consortia. Their findings have substantially advanced our understanding of gene regulation and its role in physiology and pathophysiology; new assays have been developed to target every phase of gene regulation, and new computational methods have been developed to interpret the results. The scale of data produced by consortia is immense, however, and techniques to effectively harness the wealth of knowledge they offer continue to evolve. Here we review some of the most prominent epigenomic consortium efforts, their contributions, and challenges in integrating and applying their data. A timeline of these consortia is presented in Figure 1.

Opportunities and Advancements: Regulatory Elements

Regulatory elements, such as promoters, enhancers, and insulators, are a primary focus of many epigenetic consortia. Estimates suggest these elements represent up to 40% of the genome, and they play key roles in organismal development and diseases [2,5]. Several consortia aim to characterize all regulatory elements in the human genome; many offer catalogs of candidate elements for investigation by the community. Catalogs of regulatory elements offered by consortia are summarized in Table 1.

The Roadmap Epigenomics and ENCODE consortia identify regulatory elements by generating reference epigenomes, or profiles of histone mark and chromatin accessibility data genome-wide. Roadmap produced chromatin state models in 111 human cell types from 28 tissues using machine learning techniques [6,7]. ENCODE has combined DNase-seq and ChIP-seq data from 662 cell types to produce a Registry of 1.31 million human candidate cis-regulatory elements (ccREs) and 431 k candidate mouse elements, the largest collection available (Moore et al., unpublished). The ENCODE Portal [8] also offers data from numerous other epigenetic assays, totaling ~11,000 human and mouse datasets (9,000 by ENCODE and 2,000 by Roadmap). Together, these projects have mapped transcription factor (TF) regulatory and co-occupancy networks in development and disease [9], annotated novel lncRNAs and TSSs [10,11], characterized RNA regulatory elements [12,13], and furthered understanding of the role of DNA replication and epigenetic modification in cellular differentiation and somatic mutagenesis [14,15].

The FANTOM Consortium uses transcriptional data derived from cap analysis of gene expression (CAGE) [16] to characterize regulatory elements in hundreds of human primary cells and tissues, more than 200 cancer cell lines, and several mouse developmental timepoints [17,18]. The Consortium has annotated nearly 200 k human promoters [17] and 65 k human enhancers [18,19], as well as more than 10 k novel lncRNAs [20]. Enrichment for GWAS SNPs near lncRNA promoters and enhancers suggests novel disease-specific roles for these elements [19,20]. FANTOM also offers detailed profiles of cellular responses to stimuli [18]. FANTOM's data are available for download or visualization using its ZENBU browser [21].

Other consortia directly link variants to gene expression independently of regulatory element activity. The Genotype-Tissue Expression (GTEx) project has collected nearly 12,000 samples from 53 tissues in 714 donors and performed quantitative trait locus (QTL) analysis, which statistically links genomic variants to gene expression (eQTL), histone modifications (hQTL), and splicing events (splice QTL). GTEx data are available for download and visualization [22].

| Project | Method | Cell types | Element types | # active elements (genome coverage) |
|---|---|---|---|---|
| ENCODE Phase III (Moore et al., unpublished) | DNase, H3K4me3, H3K27ac, and CTCF | 662 | 3 (promoter-like, enhancer-like, and CTCF-only) | 1.3 million (20.8%) |
| FANTOM [17–19] | CAGE-based annotation | 573 | 2 (promoter, enhancer) | 184,827 promoters (0.1%); 65,423 enhancers (0.6%) |
| Ensembl regulatory build [23] | ChromHMM, Segway [7,24] | 88 | 9 | 446,602 (12.9%) (2015) |
| ENCODE Phase II, Roadmap [2,6] | ChromHMM [7] | 127 | 15, 18 | 553,594 Enh state (39.4%); 146,860 TssA state (4.9%) |

Table 1: Catalogs of human regulatory elements currently made available by consortia, with cell type and genome coverage.
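The genome coverage figures in Table 1 can be read as the fraction of base pairs that fall inside at least one annotated element. The following is a minimal sketch of that calculation, assuming BED-style half-open intervals and a toy genome size; the function names and example coordinates are hypothetical and do not reproduce any consortium's actual procedure.

```python
from typing import Iterable, List, Tuple

# (chromosome, start, end), half-open as in BED files; an assumed convention.
Interval = Tuple[str, int, int]


def merge_intervals(intervals: Iterable[Interval]) -> List[Interval]:
    """Merge overlapping or touching intervals so no base is counted twice."""
    merged: List[Interval] = []
    for chrom, start, end in sorted(intervals):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            prev_chrom, prev_start, prev_end = merged[-1]
            merged[-1] = (prev_chrom, prev_start, max(prev_end, end))
        else:
            merged.append((chrom, start, end))
    return merged


def genome_coverage(intervals: Iterable[Interval], genome_size: int) -> float:
    """Fraction of the genome covered by at least one element."""
    covered = sum(end - start for _, start, end in merge_intervals(intervals))
    return covered / genome_size


# Toy example: two overlapping elements on chr1 and one on chr2.
toy_elements = [("chr1", 100, 300), ("chr1", 250, 400), ("chr2", 0, 100)]
toy_fraction = genome_coverage(toy_elements, genome_size=1000)
```

Merging before summing matters because catalogs built across many cell types contain overlapping elements; summing raw lengths would overstate coverage.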

Opportunities and Advancements: Regulatory Elements in Diseased Samples

Roadmap, ENCODE, FANTOM, and GTEx have primarily focused on samples from asymptomatic donors. Similar efforts to identify disease-specific epigenetic variations in diseased samples have been led by other consortia.

The BLUEPRINT Project provides nearly 600 reference epigenomes for more than 100 adaptive and innate immune lineages, from both asymptomatic and diseased donors. Disease phenotypes include ALL, AML, CLL, APL, and T1DM [25]. Similar efforts, the DEEP Project and Canadian Epigenetics Environment and Health Research Consortium (CEEHRC), seek to profile epigenomes of samples from patients with inflammatory diseases and cancers. Data from all these consortia will lend insight into the epigenetic changes underlying autoimmune processes, autoinflammatory processes, and metabolic disturbances.


The PsychENCODE Project focuses on human brain development and psychopathology, performing various epigenetic assays in FACS-sorted neurons from healthy and diseased brain samples. Disorders studied include autism spectrum disorder (ASD) and schizophrenia, in both human samples and animal models. The project will expand into single cell transcriptomics to map ASD gene regulatory networks in the dorsolateral prefrontal cortex, and also produces transcriptomic and ChIP-seq data on iPSC-derived neurons to study brain development [26].


The Cancer Genome Atlas (TCGA) characterizes the genomic and epigenomic landscapes of tumors and cancers. The project produced gene expression and genotyping data in 33 tumor types, including 10 rare cancers, from 11,000 cancer patients. Analysis has uncovered surprising similarities between cancers in seemingly distant tissues, identified novel cancer subtypes, and informed clinical trials for novel therapeutic targets, and will continue to bridge understanding between the molecular mechanisms underlying cancer and their clinical implications [27].


Challenges: Validating Predicted Regulatory Regions


The aforementioned efforts have provided millions of regulatory feature annotations in the non-coding genome. Clearly, proper prioritization is critical for informed and efficient investigation of results on this scale. To this end, experimental validation of computational results, particularly predicted regulatory elements, is a critical objective of most consortia. Transgenic assays, in which the element of interest is cloned into an embryo coupled to a reporter, have been commonly used [28,29]. The VISTA Enhancer Database contains 1,574 enhancer elements with validated activity in up to eight different tissues using transgenic mouse assays [30]. ENCODE and FANTOM have also validated subsets of their regulatory element predictions with transgenic approaches [19]. The 1,574 enhancers validated by VISTA represent less than 1% of the total regulatory elements predicted across all consortium efforts, however, so future validation will likely apply massively parallel reporter assays (MPRA). Classically, MPRA places tens of thousands of candidate elements upstream of promoters coupled to barcoded reporter genes [31]. Recent variants use AAV-mediated delivery to assess enhancer/reporter activity in vivo [32]. These assays do not assess elements’ activity in native chromosomal context, however; to address this, lentivirus may be used to integrate candidate enhancers into target cell chromatin, producing more reproducible and predictable results compared to episomal-based assays [33].
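The review does not describe how a barcoded reporter readout is summarized. One common convention, assumed here for illustration only, is to compare each barcode's RNA count with its DNA (input library) count and to aggregate the log ratios per candidate element. The function name, pseudocount, and toy counts below are hypothetical.

```python
import math
from collections import defaultdict
from statistics import median
from typing import Dict


def element_activity(
    barcode_to_element: Dict[str, str],
    rna_counts: Dict[str, int],
    dna_counts: Dict[str, int],
    pseudocount: float = 1.0,
) -> Dict[str, float]:
    """Median log2(RNA/DNA) over the barcodes assigned to each element.

    Barcodes absent from the DNA library are skipped, since their ratio
    is undefined. The pseudocount is an assumed smoothing choice.
    """
    ratios = defaultdict(list)
    for barcode, element in barcode_to_element.items():
        dna = dna_counts.get(barcode, 0)
        if dna == 0:
            continue
        rna = rna_counts.get(barcode, 0)
        ratios[element].append(
            math.log2((rna + pseudocount) / (dna + pseudocount))
        )
    return {element: median(values) for element, values in ratios.items()}


# Toy data: two barcodes for a candidate enhancer, one for a control.
toy_map = {"AAA": "enh1", "AAC": "enh1", "CCC": "ctrl"}
toy_rna = {"AAA": 7, "AAC": 15, "CCC": 1}
toy_dna = {"AAA": 1, "AAC": 1, "CCC": 1}
toy_activity = element_activity(toy_map, toy_rna, toy_dna)
```

Using several barcodes per element and taking a median is one way to limit the influence of a single barcode with unusual behavior; published analyses differ in normalization and statistical testing.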


STARR-seq is an even higher-throughput reporter assay where millions of DNA fragments are cloned into plasmids such that active enhancers activate their own transcription [34]. When input is sheared genomic DNA, output provides a genome-wide readout of enhancer activity [35]. This approach may be biased toward open chromatin, but STARR-seq is capable of identifying enhancers from heterochromatic regions as well, which may represent poised enhancers [36]. Alternatively, STARR-seq may be applied to immunoprecipitated fragments. Application to glucocorticoid receptor ChIP-seq fragments, for example, revealed that the glucocorticoid receptor confers enhancer activity by tethering distant DNA sequences to enhancer regions through bridging interactions with other transcription factors [37].


More recently, CRISPR-based assays have been developed to edit candidate regulatory elements in their native context. These techniques can approach saturation mutagenesis; applications have characterized important functional sequences in a previously-identified enhancer of the gene BCL11A [38] and identified a novel class of enhancers whose deletion only temporarily impacts expression of their target gene, POU5F1 [39]. CRISPR assays may also be used to probe enhancer-promoter links, as discussed in the next section.


Given the link between regulatory element transcription and activity [40], nascent transcription profiling is an alternative approach for regulatory element validation. PRO-seq, a recently developed nuclear run-on assay based on the similar GRO-seq [41], maps RNA polymerase engagement genome-wide at single base-pair resolution [42]. Relative to RNA-seq, polymerase ChIP-seq, and CAGE, PRO-seq offers significantly improved sensitivity for detection of unstable transcripts such as eRNAs across many orders of magnitude of abundance [43]. Subtle detected differences in transcription initiation and elongation can be used by machine learning approaches to identify active promoters and enhancers from GRO-seq and PRO-seq data, which are more likely to overlap trait-associated variants and tissue-specific eQTLs than regulatory elements predicted without GRO-seq or PRO-seq signal [44]. Combined with experimental approaches such as CRISPR-based assays and 3C (described in the next section), such approaches have already characterized disease-specific enhancers, including a novel enhancer 3´ of the oncogene KIT in acute myeloid leukemia cells whose activity is repressed by anti-cancer BET inhibitor drugs [45]. PRO-seq may prove an increasingly valuable assay for identifying and characterizing cell type-specific regulatory elements in the future.

MPRA have not yet been broadly applied by consortia, although the fourth phase of ENCODE will greatly expand experimental validation of predicted regulatory elements. Regulatory element validation and prediction are in a unique position in that neither truly represents a gold standard: it is impossible to perfectly replicate in vivo conditions with existing techniques, and assays for regulatory element identification, such as ChIP-seq, suffer from noisy signal and inevitable false positives. Validation and prediction will likely continue to evolve and improve one another as ongoing consortium efforts like ENCODE Phase IV further understanding of regulatory element characteristics and functions.

Challenges: Predicting the Target Genes of Regulatory Elements

Once active regulatory elements are identified, they must be placed in their physiologic and pathophysiologic context to inform hypotheses regarding functional roles in cellular processes and disease. A common first step is to link element activity to gene expression. Consortia approach this problem with experimental, computational, and statistical methods.

Assays such as Hi-C [46], 5C [47], and ChIA-PET [48] seek 3-dimensional interactions between regulatory elements and target genes. Hi-C was applied in the Roadmap Project, identifying patterns of nuclear reorganization during cell lineage commitment [49], and ENCODE has performed 31 Hi-C experiments and 70 additional ChIA-PET and 5C experiments, including some primary cell and tissue coverage [2]. The BLUEPRINT Consortium has also performed Hi-C on 17 primary hematopoietic cell types, profiling regulatory element contacts for more than 31,000 promoters [50]. Hi-C production is also underway for PsychENCODE schizophrenia samples.


Alternatively, regulatory element to promoter links may be inferred by QTL analysis. GTEx [22] and BLUEPRINT [25,51] both offer QTL results, and although TCGA itself does not provide an eQTL catalog, a recent effort by Gong et al. produced a list of nearly 6 million eQTLs associated with the 33 tumor types profiled by TCGA, including approximately 22,000 eQTLs associated with patient survival and approximately 330,000 intersecting existing GWAS loci [52]. Importantly, QTLs themselves only provide evidence of correlation and not causation; genes and QTLs may be indirectly linked, or the arrow of causation may run from gene expression to histone modification in the case of hQTLs.
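At its core, an eQTL test asks whether expression of a gene varies with genotype at a variant. The sketch below illustrates this with an ordinary least-squares fit of expression on allele dosage (0, 1, or 2 copies of the alternate allele). It is a simplified illustration under stated assumptions: the function name and toy values are hypothetical, and real analyses additionally adjust for covariates such as ancestry and technical factors and correct for multiple testing.

```python
from typing import Sequence, Tuple


def eqtl_effect(
    dosages: Sequence[float], expression: Sequence[float]
) -> Tuple[float, float]:
    """Return (slope, r_squared) for expression regressed on allele dosage.

    The slope is the estimated expression change per copy of the
    alternate allele; r_squared is the share of variance it explains.
    """
    n = len(dosages)
    if n != len(expression) or n < 3:
        raise ValueError("need matched dosage/expression vectors, n >= 3")
    mean_x = sum(dosages) / n
    mean_y = sum(expression) / n
    sxx = sum((x - mean_x) ** 2 for x in dosages)
    syy = sum((y - mean_y) ** 2 for y in expression)
    sxy = sum(
        (x - mean_x) * (y - mean_y) for x, y in zip(dosages, expression)
    )
    if sxx == 0:
        raise ValueError("variant is monomorphic in this sample")
    slope = sxy / sxx
    r_squared = (sxy * sxy) / (sxx * syy) if syy > 0 else 0.0
    return slope, r_squared


# Toy cohort: six donors, expression rising with each alternate allele.
toy_dosages = [0, 0, 1, 1, 2, 2]
toy_expression = [1.0, 1.2, 2.0, 2.2, 3.0, 3.2]
toy_slope, toy_r2 = eqtl_effect(toy_dosages, toy_expression)
```

A strong slope here is still only an association: as the text notes, the variant may be indirectly linked to the gene, so such results prioritize candidates rather than establish mechanism.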


Recent advances in CRISPR/Cas9 technology probe promoter-enhancer links by editing putative regulatory elements. Screening may be accomplished by systematically deleting kilobase-sized regions within a few megabases of a target gene (CREST-seq) or by inducing heterochromatin using CRISPR interference (CRISPRi) [53,54]. These techniques have identified 45 enhancers regulating POU5F1, 7 elements enhancing MYC, and 2 elements repressing MYC [53,54]. They have also revealed novel biology, including the complex role of CTCF binding site orientation in determining enhancer-promoter contacts and insulator function [55] and that up to 3% of promoters may act as enhancers for distal genes, perhaps coordinating rapid gene expression response to stimulus [56]. These dual promoter-enhancers are highly specific in their distal target genes, their enhancer function is not necessarily correlated with their promoter activity, and they are themselves poorly responsive to other enhancers; investigation of the mechanisms underpinning these peculiarities offers fruitful avenues for further research [57].

When experimental data are not available, a common goal is to computationally predict regulatory element to promoter links. This is made especially important, and challenging, by findings from the above work that regulatory element to promoter links are highly cell type-specific [49,50]. Machine learning approaches have been developed, incorporating sequence-based and ChIP-seq-based features [58,59]; however, such models generalize poorly to different cell types. As of this writing, there is no standardized predictive approach in use by consortia, and an optimal method of predicting regulatory element to promoter links from epigenetic data across cell types remains an open problem in bioinformatics.

Challenges: Data and Metadata Access

Validated regulatory elements and their gene expression impacts, along with continued data production in new cell types and disease contexts, offer an invaluable, ever-expanding resource for community investigation. From a community user’s perspective, however, reproducible data access and consumption can be a daunting challenge. Consortium data is immense: a single consortium’s raw and processed data frequently exceeds a petabyte, or a million gigabytes. Reproducible consumption requires detailed metadata annotations, a Portal interface for browsing data, and an application programming interface (API) so users’ computer programs may query data in bulk.


Various consortia, including PsychENCODE, use the Synapse platform, which provides a Portal and an API with clients so users may query data using R, Python, and Java; Synapse does not, however, define explicit metadata requirements nor require unique IDs for annotated objects, hindering reproducibility. ENCODE, IHEC, and TCGA have developed comprehensive custom systems. IHEC’s DeepBlue offers a Portal, a metadata database, and an XML-RPC API supported by numerous programming languages. Extensive metadata requirements are not defined, but DeepBlue’s database accommodates extended data for individual experiments [60]. ENCODE and TCGA both use JSON, a lightweight, human-readable data format, to annotate experiments, files, biosamples, and software with detailed metadata. Both provide a representational state transfer (REST) API, standardizing queries for data and metadata. Objects have permanent unique IDs for use in publications, and experimental protocols and bioinformatic pipelines are annotated, facilitating reproducibility [8,61]. ENCODE and TCGA users must, however, write their own clients to interact with the APIs.
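Because users of REST-based portals must write their own clients, a minimal sketch of such a client may be useful. It uses only the Python standard library. The base URL, the `/search/` path, and the `type`, `limit`, and `format` query parameters are assumptions based on the ENCODE Portal's public interface and should be checked against the portal's current documentation; the helper names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Assumed base URL for the ENCODE Portal; verify before use.
ENCODE_BASE = "https://www.encodeproject.org"


def build_search_url(base: str, **filters) -> str:
    """Build a metadata search URL that requests a JSON response.

    Filters are sorted so the same query always yields the same URL,
    which makes queries easy to record in methods sections and logs.
    """
    params = dict(filters)
    params["format"] = "json"
    return f"{base}/search/?{urlencode(sorted(params.items()))}"


def fetch_json(url: str, timeout: float = 30.0) -> dict:
    """Perform a GET request and decode the JSON body (needs network)."""
    request = Request(url, headers={"Accept": "application/json"})
    with urlopen(request, timeout=timeout) as response:
        return json.load(response)


# Example query: the first five experiment records (not executed here).
example_url = build_search_url(ENCODE_BASE, type="Experiment", limit=5)
```

Calling `fetch_json(example_url)` would return the portal's JSON response as a dictionary, assuming the endpoint behaves as described. Recording the exact URL alongside the permanent object IDs in the response is one way to keep such queries reproducible.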

Reproducible use of data and metadata from different consortia currently requires significant effort from the user. Progress is being made toward standardization, which would greatly benefit the community. A summary of data and metadata systems used by consortia is presented in Table 2.

| Consortium | Metadata Database | Portal | API |
|---|---|---|---|
| ENCODE | yes, JSON format, ENCODED [62] | yes, https://encodeproject.org/ | yes, REST-based and GraphQL-based; no clients |
| FANTOM | no database; static README files | yes, http://fantom.gsc.riken.jp/ | no |
| Roadmap | no database; Google Sheet | yes, http://www.roadmapepigenomics.org/ | no |
| BLUEPRINT | yes, visualizable through portal, using DeepBlue [60] | yes, http://dcc.blueprint-epigenome.eu/ | yes, DeepBlue, XML-RPC based, clients available |
| PsychENCODE | yes, via Synapse; querying requires linking database tables manually | yes, via Synapse https://www.synapse.org/#!Synapse:syn4921369 | yes, via Synapse; clients available |
| TCGA | yes, based on GDC data model | yes, https://gdc.cancer.gov/ | yes, REST-based; no client |
| GTEx | no database; static file download | yes, https://www.gtexportal.org/ | no |

Table 2: A selection of consortia with a summary of the data access systems they provide, including metadata databases, portals, and APIs and associated clients.

Challenges: Reproducible Ground-Level Data Processing


For consistent integration with consortium results, community users may aim to process their own data with consortium software. Raw next-generation sequencing data is analyzed using a combination of software referred to as a pipeline, which typically aligns reads to the reference genome, generates signal, and calls peaks. There are no widely-used standards for pipelines as of this writing. Numerous algorithms exist for each step, and identifying an algorithm robust to differences in data quality and optimal for intended downstream analysis is challenging, as genome-wide datasets frequently lack gold standards or ground truth datasets for validation. Consortia use a variety of algorithms in their pipelines, and reproducibly running consortium pipelines in a user’s own compute environment or in the Cloud may require significant effort.


Two emerging technologies, workflow languages and containerization techniques, are easing the reproducibility challenge. Workflow languages specify pipeline steps and associated inputs, outputs, and parameters in a platform-independent manner. Examples include Common Workflow Language (CWL) [63], Workflow Description Language (WDL), Big Data Script (BDS) [64], and SnakeMake [65]. The strengths and weaknesses of individual workflow languages are reviewed elsewhere [66]. The most popular containerization technique is currently Docker, which allows the precise versions of required software to be specified in a manifest called a Dockerfile and precompiled into a downloadable environment called a Docker image. This ensures that the same versions of all software will be used by all users, and that the pipeline can run on any platform and produce identical output from identical input, assuming deterministic software. Docker images are well supported on various cloud computing platforms.

The use of workflow languages and Docker images varies by consortium, as summarized in Table 3. Currently, few consortia are taking full advantage of these technologies, but their use is expanding. The future will likely see full containerization of pipeline software and associated workflows, which will greatly facilitate optimal, reproducible analysis of next-generation sequencing data by the community.

| Consortium | Pipeline code available? | Pipeline workflow language / platform | Pipeline Docker images available? |
|---|---|---|---|
| ENCODE | yes, https://github.com/ENCODE-DCC | WDL, DNANexus | yes, https://quay.io/encode-dcc |
| FANTOM | variable; some source packages at http://fantom.gsc.riken.jp/software/ | N/A | no |
| Roadmap | no; documentation at https://egg2.wustl.edu/roadmap/web_portal/processed_data.html | N/A | no |
| BLUEPRINT | no; pipeline documentation at http://dcc.blueprint-epigenome.eu/#/md/methods | N/A | no |
| PsychENCODE | variable; ChIP-seq source at https://github.com/wenglab/psychip_snakemake | SnakeMake | no |
| TCGA | variable; pipeline documentation at https://docs.gdc.cancer.gov/Data/Introduction/ | N/A | no |
| GTEx | yes, https://www.github.com/broadinstitute/gtex-pipeline | WDL | yes, https://hub.docker.com/r/broadinstitute/ |

Table 3: A selection of consortia and the reproducibility of their pipelines, including utilization of a workflow language and packaging of code and executables into Docker images.
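To illustrate what a workflow language contributes, the sketch below declares each pipeline step only by its inputs and outputs and lets the program derive a valid execution order, which is the central idea shared by CWL, WDL, and SnakeMake. It is written in plain Python rather than any of those languages, and the step and file names are hypothetical stand-ins for the align, signal, and peak-calling stages described in this section.

```python
from dataclasses import dataclass
from graphlib import TopologicalSorter
from typing import Dict, List, Set, Tuple


@dataclass(frozen=True)
class Step:
    """A pipeline step described only by the files it reads and writes."""
    name: str
    inputs: Tuple[str, ...]
    outputs: Tuple[str, ...]


def execution_order(steps: List[Step]) -> List[str]:
    """Order steps so each runs after the steps producing its inputs."""
    producer: Dict[str, str] = {
        out: step.name for step in steps for out in step.outputs
    }
    # Inputs with no producer (e.g. raw reads) are external and add no edge.
    graph: Dict[str, Set[str]] = {
        step.name: {producer[i] for i in step.inputs if i in producer}
        for step in steps
    }
    return list(TopologicalSorter(graph).static_order())


# Hypothetical three-step pipeline, deliberately listed out of order.
PIPELINE = [
    Step("call_peaks", ("aligned.bam", "signal.bigWig"), ("peaks.bed",)),
    Step("align", ("reads.fastq", "genome.fa"), ("aligned.bam",)),
    Step("signal", ("aligned.bam",), ("signal.bigWig",)),
]
order = execution_order(PIPELINE)
```

Because the order is derived from declared dependencies rather than from the sequence in which steps are written, the same description can be executed on a laptop, a cluster, or a cloud platform; pairing each step with a fixed Docker image would then pin the software versions as well.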

Challenges: Integrating Data across Consortia

Standardization of pipelines and data access will greatly facilitate integrative analyses across consortium data. Although the aforementioned challenges are daunting, several efforts are already underway to unite results from different consortia into a common body of scientific knowledge. An example of successful integration is ENCODE’s recent expansion to include all available Roadmap data in the ENCODE Portal. This effort involved reannotating hundreds of Roadmap experiments with standardized metadata and reprocessing all raw Roadmap data with ENCODE uniform pipelines [61]. This represents a complete post-hoc integration of these two projects: data from both may be seamlessly searched together by the same criteria, and users may be confident in consistent ground-level processing. This consistency has allowed for Roadmap data to be integrated into ENCODE’s Encyclopedia of candidate regulatory elements.


The International Human Epigenome Consortium (IHEC) aims to make 1,000 reference epigenomes available by 2020 [67]. Its standards for data production guide eight member consortia including Roadmap, ENCODE, BLUEPRINT, and DEEP, and its DeepBlue web server provides a unified system for access to more than 33,000 consortium datasets, although the system does not provide standardized metadata as ENCODE has with Roadmap [60]. Similarly, the Cistrome Project offers a database of 25,000 datasets extracted from the Gene Expression Omnibus (GEO) [68,69]. Cistrome makes an effort to extract metadata and reprocess raw data using a standardized pipeline, which includes comprehensive quality control metrics to account for varying data quality between sources [69,70]. The pipeline, along with many other software tools, is publicly available to help standardize user-driven analysis [70]. A sister project, Cistrome Cancer, also incorporates data from TCGA to profile pathologic enhancer activity and identify target genes for transcription factors [71].


Care must be taken in such integrative efforts. Numerous variables, many non-biological, influence genome-wide experiments. This is well known to cause batch effects, where technical properties such as collection and processing date account for a large portion of sample variability, sometimes more than true biological properties [72]. Standardized data production and biological replicates alleviate this, but with hundreds of labs contributing to consortia, some batch effects are inevitable. Correction algorithms may be applied to adjust for batch effects statistically, such as by incorporating them into linear models along with biological features or by Bayesian inference [73], as reviewed in detail elsewhere [74]. Although uniform processing and QC have been addressed by integrative efforts, batch effects have yet to be explored in depth and remain a concern even when analyzing data from different labs within the same consortium.
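The linear-model approach mentioned above can be sketched as follows: a measured value is modeled as an intercept plus a biological effect plus a batch effect, all three are fitted jointly by least squares, and only the batch term is subtracted. This is a minimal illustration assuming a single binary biological condition and two batches; the function names and toy data are hypothetical, and it does not implement the Bayesian methods cited in the text.

```python
from typing import List, Sequence, Tuple


def solve_linear_system(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            # Happens when batch and condition are perfectly confounded.
            raise ValueError("design matrix is rank deficient")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(aug[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (aug[row][n] - tail) / aug[row][row]
    return x


def remove_batch_effect(
    values: Sequence[float], condition: Sequence[int], batch: Sequence[int]
) -> Tuple[float, float, List[float]]:
    """Fit value ~ intercept + condition + batch; subtract the batch term.

    Returns (condition_effect, batch_effect, corrected_values).
    """
    design = [[1.0, float(c), float(b)] for c, b in zip(condition, batch)]
    p = 3
    # Normal equations: (X^T X) beta = X^T y.
    xtx = [
        [sum(row[i] * row[j] for row in design) for j in range(p)]
        for i in range(p)
    ]
    xty = [sum(row[i] * y for row, y in zip(design, values)) for i in range(p)]
    _, condition_effect, batch_effect = solve_linear_system(xtx, xty)
    corrected = [y - batch_effect * b for y, b in zip(values, batch)]
    return condition_effect, batch_effect, corrected


# Toy data: true condition effect 2, true batch effect 3, baseline 5.
toy_condition = [0, 0, 1, 1, 0, 0, 1, 1]
toy_batch = [0, 0, 0, 0, 1, 1, 1, 1]
toy_values = [5.0, 5.0, 7.0, 7.0, 8.0, 8.0, 10.0, 10.0]
cond_eff, batch_eff, toy_corrected = remove_batch_effect(
    toy_values, toy_condition, toy_batch
)
```

Fitting the biological and batch terms together is what protects real signal from being removed. If every sample of one condition came from a single batch, the two effects could not be separated, and the solver raises an error instead of returning an arbitrary split.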


These integrative efforts have made progress in uniting the vast knowledge produced by multiple consortia. Considerable work remains to be done, however, to take full advantage of available data. The future likely holds a resource offering all the advantages of IHEC, Cistrome, and ENCODE: unified ground-level and integrative-level processing of consortium and community data, with a standardized system for metadata and data representation and access. A common resource will also require a more comprehensive exploration of batch effect, and an algorithm to quantify and correct it to the greatest extent possible. Such a unified system would revolutionize the interpretation of epigenetic data, and streamline the integration of new work into the greater context of knowledge in the field.

Conclusion

Since the conclusion of the Human Genome Project, consortia have performed tens of thousands of experiments to characterize the non-coding genome. The resulting data offer invaluable resources for understanding the regulatory landscapes of human cells, the impact of regulatory element activity on gene expression, and the role of non-coding sequences in normal physiology and disease. Reproducible integrative analysis of this vast store of knowledge remains a significant challenge, but ongoing efforts and emerging technologies promise to bring reproducible, standardized analysis techniques and unified resources in coming years, offering continually-improving understanding of regulatory elements and their functional roles in cellular processes and disease. Unification of consortium data has the potential to revolutionize current knowledge, as well as the incorporation of future knowledge, in the epigenetic field.

Acknowledgements

This work was funded by the NIH grant HG009446 awarded to Zhiping Weng.


References

Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, Baldwin J, Devon K, Dewar K, Doyle M, FitzHugh W, et al.: Initial sequencing and analysis of the human genome. Nature 2001, 409:860–921.

2.

•• ENCODE Project Consortium: An integrated encyclopedia of DNA elements in the human genome. Nature 2012, 489:57–74.

SC

1.

M AN U

The flagship manuscript of Phase II of the ENCODE Project, describing major findings from its reference epigenomes and associated epigenetic experiments. Wang J, Zhuang J, Iyer S, Lin X, Whitfield TW, Greven MC, Pierce BG, Dong X, Kundaje A, Cheng Y, et al.: Sequence features and chromatin structure around the genomic regions bound by 119 human transcription factors. Genome Res 2012, 22:1798–1812.

4.

Thurman RE, Rynes E, Humbert R, Vierstra J, Maurano MT, Haugen E, Sheffield NC, Stergachis AB, Wang H, Vernot B, et al.: The accessible chromatin landscape of the human genome. Nature 2012, 489:75.

5.

Stamatoyannopoulos JA: What does our genome encode? Genome Res 2012, 22:1602–1611.

6.

•• Roadmap Epigenomics Consortium, Kundaje A, Meuleman W, Ernst J, Bilenky M, Yen A, Heravi-Moussavi A, Kheradpour P, Zhang Z, Wang J, et al.: Integrative analysis of 111 reference human epigenomes. Nature 2015, 518:317–330.

EP

TE D

3.

AC C

The flagship manuscript of the Roadmap Epigenomics Project, describing major findings from their 111 reference epigenomes. 7.

Ernst J, Kellis M: ChromHMM: automating chromatin-state discovery and characterization. Nat Methods 2012, 9:215.

8.

Sloan CA, Chan ET, Davidson JM, Malladi VS, Strattan JS, Hitz BC, Gabdank I, Narayanan AK, Ho M, Lee BT, et al.: ENCODE data at the ENCODE portal. Nucleic Acids Res 2016, 44:D726–32.

9.

Neph S, Stergachis AB, Reynolds A, Sandstrom R, Borenstein E, Stamatoyannopoulos JA: Circuitry and dynamics of human transcription factor regulatory networks. Cell 2012, 150:1274–1286.

10. Lagarde J, Uszczynska-Ratajczak B, Carbonell S, Pérez-Lluch S, Abad A, Davis C, Gingeras TR, Frankish A, Harrow J, Guigo R, et al.: High-throughput annotation of full-length long noncoding RNAs with capture long-read sequencing. Nat Genet 2017, 49:1731–1740.

11. Batut P, Gingeras TR: RAMPAGE: promoter activity profiling by paired-end sequencing of 5’-complete cDNAs. Curr Protoc Mol Biol 2013, 104:Unit 25B.11.

12. Lambert N, Robertson A, Jangi M, McGeary S, Sharp PA, Burge CB: RNA Bind-n-Seq: quantitative assessment of the sequence and structural binding specificity of RNA binding proteins. Mol Cell 2014, 54:887–900.

13. Van Nostrand EL, Pratt GA, Shishkin AA, Gelboin-Burkhart C, Fang MY, Sundararaman B, Blue SM, Nguyen TB, Surka C, Elkins K, et al.: Robust transcriptome-wide discovery of RNA-binding protein binding sites with enhanced CLIP (eCLIP). Nat Methods 2016, 13:508–514.

14. Lawrence MS, Stojanov P, Polak P, Kryukov GV, Cibulskis K, Sivachenko A, Carter SL, Stewart C, Mermel CH, Roberts SA, et al.: Mutational heterogeneity in cancer and the search for new cancer-associated genes. Nature 2013, 499:214–218.

15. Polak P, Karlić R, Koren A, Thurman R, Sandstrom R, Lawrence M, Reynolds A, Rynes E, Vlahoviček K, Stamatoyannopoulos JA, et al.: Cell-of-origin chromatin organization shapes the mutational landscape of cancer. Nature 2015, 518:360–364.

16. Kodzius R, Kojima M, Nishiyori H, Nakamura M, Fukuda S, Tagami M, Sasaki D, Imamura K, Kai C, Harbers M, et al.: CAGE: cap analysis of gene expression. Nat Methods 2006, 3:211–222.

17. FANTOM Consortium and the RIKEN PMI and CLST (DGT), Forrest ARR, Kawaji H, Rehli M, Baillie JK, de Hoon MJL, Haberle V, Lassmann T, Kulakovskiy IV, Lizio M, et al.: A promoter-level mammalian expression atlas. Nature 2014, 507:462–470.


18. Arner E, Daub CO, Vitting-Seerup K, Andersson R, Lilje B, Drabløs F, Lennartsson A, Rönnerblad M, Hrydziuszko O, Vitezic M, et al.: Transcribed enhancers lead waves of coordinated transcription in transitioning mammalian cells. Science 2015, 347:1010–1014.


19. •• Andersson R, Gebhard C, Miguel-Escalada I, Hoof I, Bornholdt J, Boyd M, Chen Y, Zhao X, Schmidl C, Suzuki T, et al.: An atlas of active enhancers across human cell types and tissues. Nature 2014, 507:455–461.

Describes a collection of candidate enhancers identified by the FANTOM Project using cap analysis of gene expression (CAGE) as regions with distinct bidirectional transcription patterns.

20. Hon C-C, Ramilowski JA, Harshbarger J, Bertin N, Rackham OJL, Gough J, Denisenko E, Schmeier S, Poulsen TM, Severin J, et al.: An atlas of human long non-coding RNAs with accurate 5’ ends. Nature 2017, 543:199–204.

21. Severin J, Lizio M, Harshbarger J, Kawaji H, Daub CO, Hayashizaki Y, FANTOM Consortium, Bertin N, Forrest ARR: Interactive visualization and analysis of large-scale sequencing datasets using ZENBU. Nat Biotechnol 2014, 32:217–219.

22. •• GTEx Consortium: Genetic effects on gene expression across human tissues. Nature 2017, 550:204.

Describes major findings of the GTEx Project, including cis-eQTLs in 44 human tissues and associated analysis.

23. Zerbino DR, Wilder SP, Johnson N, Juettemann T, Flicek PR: The Ensembl Regulatory Build. Genome Biol 2015, 16:56.

24. Hoffman MM, Buske OJ, Wang J, Weng Z, Bilmes JA, Noble WS: Unsupervised pattern discovery in human chromatin structure through genomic segmentation. Nat Methods 2012, 9:473.

25. • Martens JHA, Stunnenberg HG: BLUEPRINT: mapping human blood cell epigenomes. Haematologica 2013, 98:1487–1489.

Describes the design and goals of the BLUEPRINT Project, which aims to map epigenomes in leukemia and autoimmune diseases, as well as the first data release.

26. • PsychENCODE Consortium, Akbarian S, Liu C, Knowles JA, Vaccarino FM, Farnham PJ, Crawford GE, Jaffe AE, Pinto D, Dracheva S, et al.: The PsychENCODE project. Nat Neurosci 2015, 18:1707–1712.

Describes the design and goals of the PsychENCODE Project, which aims to map the epigenetics of brain development and psychiatric disorders.

27. • Hutter C, Zenklusen JC: The Cancer Genome Atlas: Creating Lasting Value beyond Its Data. Cell 2018, 173:283–285.

Summarizes the impact of TCGA, which produced genetic and epigenetic data from hundreds of samples and 33 tumor types.

28. Yauk CL, Gingerich JD, Soper L, MacMahon A, Foster WG, Douglas GR: A lacZ transgenic mouse assay for the detection of mutations in follicular granulosa cells. Mutat Res 2005, 578:117–123.

29. Pennacchio LA, Ahituv N, Moses AM, Prabhakar S, Nobrega MA, Shoukry M, Minovitsky S, Dubchak I, Holt A, Lewis KD, et al.: In vivo enhancer analysis of human conserved non-coding sequences. Nature 2006, 444:499–502.


30. Visel A, Minovitsky S, Dubchak I, Pennacchio LA: VISTA Enhancer Browser--a database of tissue-specific human enhancers. Nucleic Acids Res 2007, 35:D88–92.


31. Patwardhan RP, Lee C, Litvin O, Young DL, Pe’er D, Shendure J: High-resolution analysis of DNA regulatory elements by synthetic saturation mutagenesis. Nat Biotechnol 2009, 27:1173–1175.

32. Shen SQ, Myers CA, Hughes AEO, Byrne LC, Flannery JG, Corbo JC: Massively parallel cis-regulatory analysis in the mammalian central nervous system. Genome Res 2016, 26:238–255.

33. • Inoue F, Kircher M, Martin B, Cooper GM, Witten DM, McManus MT, Ahituv N, Shendure J: A systematic comparison reveals substantial differences in chromosomal versus episomal encoding of enhancer activity. Genome Res 2017, 27:38–52.

Describes a lentivirus-based MPRA for functionally validating thousands of candidate regulatory elements integrated into chromatin. This produces more reproducible results than episomal-based assays and promises to further understanding of the influence of the chromatin landscape on regulatory element function.

34. Arnold CD, Gerlach D, Stelzer C, Boryń ŁM, Rath M, Stark A: Genome-wide quantitative enhancer activity maps identified by STARR-seq. Science 2013, 339:1074–1077.

35. Muerdter F, Boryń ŁM, Arnold CD: STARR-seq - principles and applications. Genomics 2015, 106:145–150.

36. Liu Y, Yu S, Dhiman VK, Brunetti T, Eckart H, White KP: Functional assessment of human enhancer activities using whole-genome STARR-sequencing. Genome Biol 2017, 18:219.

37. Vockley CM, D’Ippolito AM, McDowell IC, Majoros WH, Safi A, Song L, Crawford GE, Reddy TE: Direct GR Binding Sites Potentiate Clusters of TF Binding across the Human Genome. Cell 2016, 166:1269–1281.e19.

38. Canver MC, Smith EC, Sher F, Pinello L, Sanjana NE, Shalem O, Chen DD, Schupp PG, Vinjamur DS, Garcia SP, et al.: BCL11A enhancer dissection by Cas9-mediated in situ saturating mutagenesis. Nature 2015, 527:192–197.

39. Diao Y, Li B, Meng Z, Jung I, Lee AY, Dixon J, Maliskova L, Guan K-L, Shen Y, Ren B: A new class of temporarily phenotypic enhancers identified by CRISPR/Cas9-mediated genetic screening. Genome Res 2016, 26:397–405.

40. Kim T-K, Hemberg M, Gray JM, Costa AM, Bear DM, Wu J, Harmin DA, Laptewicz M, Barbara-Haley K, Kuersten S, et al.: Widespread transcription at neuronal activity-regulated enhancers. Nature 2010, 465:182–187.

41. Core LJ, Waterfall JJ, Lis JT: Nascent RNA sequencing reveals widespread pausing and divergent initiation at human promoters. Science 2008, 322:1845–1848.

42. Kwak H, Fuda NJ, Core LJ, Lis JT: Precise maps of RNA polymerase reveal how promoters direct initiation and pausing. Science 2013, 339:950–953.

43. Mahat DB, Kwak H, Booth GT, Jonkers IH, Danko CG, Patel RK, Waters CT, Munson K, Core LJ, Lis JT: Base-pair-resolution genome-wide mapping of active RNA polymerases using precision nuclear run-on (PRO-seq). Nat Protoc 2016, 11:1455–1476.

44. Danko CG, Hyland SL, Core LJ, Martins AL, Waters CT, Lee HW, Cheung VG, Kraus WL, Lis JT, Siepel A: Identification of active transcriptional regulatory elements from GRO-seq data. Nat Methods 2015, 12:433–438.

45. • Zhao Y, Liu Q, Acharya P, Stengel KR, Sheng Q, Zhou X, Kwak H, Fischer MA, Bradner JE, Strickland SA, et al.: High-Resolution Mapping of RNA Polymerases Identifies Mechanisms of Sensitivity and Resistance to BET Inhibitors in t(8;21) AML. Cell Rep 2016, 16:2003–2016.

This BLUEPRINT-associated project profiled histone modifications, gene expression, and genotypes for more than 100 monocyte, T-cell, and neutrophil samples, and mapped lineage-specific quantitative trait loci.

46. Lieberman-Aiden E, van Berkum NL, Williams L, Imakaev M, Ragoczy T, Telling A, Amit I, Lajoie BR, Sabo PJ, Dorschner MO, et al.: Comprehensive mapping of long-range interactions reveals folding principles of the human genome. Science 2009, 326:289–293.

47. Dostie J, Richmond TA, Arnaout RA, Selzer RR, Lee WL, Honan TA, Rubio ED, Krumm A, Lamb J, Nusbaum C, et al.: Chromosome Conformation Capture Carbon Copy (5C): a massively parallel solution for mapping interactions between genomic elements. Genome Res 2006, 16:1299–1309.

48. Li G, Cai L, Chang H, Hong P, Zhou Q, Kulakova EV, Kolchanov NA, Ruan Y: Chromatin Interaction Analysis with Paired-End Tag (ChIA-PET) sequencing technology and application. BMC Genomics 2014, 15 Suppl 12:S11.


49. Dixon JR, Jung I, Selvaraj S, Shen Y, Antosiewicz-Bourget JE, Lee AY, Ye Z, Kim A, Rajagopal N, Xie W, et al.: Chromatin architecture reorganization during stem cell differentiation. Nature 2015, 518:331–336.


50. Javierre BM, Burren OS, Wilder SP, Kreuzhuber R, Hill SM, Sewitz S, Cairns J, Wingett SW, Várnai C, Thiecke MJ, et al.: Lineage-Specific Genome Architecture Links Enhancers and Non-coding Disease Variants to Target Gene Promoters. Cell 2016, 167:1369–1384.e19.

51. Chen L, Ge B, Casale FP, Vasquez L, Kwan T, Garrido-Martín D, Watt S, Yan Y, Kundu K, Ecker S, et al.: Genetic Drivers of Epigenetic and Transcriptional Variation in Human Immune Cells. Cell 2016, 167:1398–1414.e24.

52. Gong J, Mei S, Liu C, Xiang Y, Ye Y, Zhang Z, Feng J, Liu R, Diao L, Guo A-Y, et al.: PancanQTL: systematic identification of cis-eQTLs and trans-eQTLs in 33 cancer types. Nucleic Acids Res 2018, 46:D971–D976.

53. Fulco CP, Munschauer M, Anyoha R, Munson G, Grossman SR, Perez EM, Kane M, Cleary B, Lander ES, Engreitz JM: Systematic mapping of functional enhancer-promoter connections with CRISPR interference. Science 2016, 354:769–773.

54. • Diao Y, Fang R, Li B, Meng Z, Yu J, Qiu Y, Lin KC, Huang H, Liu T, Marina RJ, et al.: A tiling-deletion-based genetic screen for cis-regulatory element identification in mammalian cells. Nat Methods 2017, 14:629–635.


Describes CREST-seq, a CRISPR-based assay for assessing regulatory element activity in its native context by systematically deleting kilobase-sized candidate regulatory elements from the genome.

55. Guo Y, Xu Q, Canzio D, Shou J, Li J, Gorkin DU, Jung I, Wu H, Zhai Y, Tang Y, et al.: CRISPR Inversion of CTCF Sites Alters Genome Topology and Enhancer/Promoter Function. Cell 2015, 162:900–910.

56. Dao LTM, Galindo-Albarrán AO, Castro-Mondragon JA, Andrieu-Soler C, Medina-Rivera A, Souaid C, Charbonnier G, Griffon A, Vanhille L, Stephen T, et al.: Genome-wide characterization of mammalian promoters with distal enhancer functions. Nat Genet 2017, 49:1073–1081.

57. Catarino RR, Neumayr C, Stark A: Promoting transcription over long distances. Nat Genet 2017, 49:972–973.

58. Yang Y, Zhang R, Singh S, Ma J: Exploiting sequence-based features for predicting enhancer–promoter interactions. Bioinformatics 2017, 33:i252–i260.

59. Whalen S, Truty RM, Pollard KS: Enhancer-promoter interactions are encoded by complex genomic signatures on looping chromatin. Nat Genet 2016, 48:488–496.

60. Albrecht F, List M, Bock C, Lengauer T: DeepBlue epigenomic data server: programmatic data retrieval and analysis of epigenome region sets. Nucleic Acids Res 2016, 44:W581–6.

61. Davis CA, Hitz BC, Sloan CA, Chan ET, Davidson JM, Gabdank I, Hilton JA, Jain K, Baymuradov UK, Narayanan AK, et al.: The Encyclopedia of DNA elements (ENCODE): data portal update. Nucleic Acids Res 2018, 46:D794–D801.

62. Hong EL, Sloan CA, Chan ET, Davidson JM, Malladi VS, Strattan JS, Hitz BC, Gabdank I, Narayanan AK, Ho M, et al.: Principles of metadata organization at the ENCODE data coordination center. Database 2016, 2016.

63. Amstutz P, Andeer R, Chapman B, Chilton J, Crusoe MR, Valls Guimera R, Carrasco Hernandez G, Ivkovic S, Kartashov A, Kern J, et al.: Common Workflow Language, draft 3. 2016, doi:10.6084/m9.figshare.3115156.v1.

64. Cingolani P, Sladek R, Blanchette M: BigDataScript: a scripting language for data pipelines. Bioinformatics 2015, 31:10–16.

65. Köster J, Rahmann S: Snakemake—a scalable bioinformatics workflow engine. Bioinformatics 2012, 28:2520–2522.

66. • Leipzig J: A review of bioinformatic pipeline frameworks. Brief Bioinform 2017, 18:530–536.

Reviews the strengths and weaknesses of different workflow languages for writing reproducible bioinformatic pipelines.

67. Stunnenberg HG, Hirst M: The International Human Epigenome Consortium: A Blueprint for Scientific Collaboration and Discovery. Cell 2016, 167:1145–1149.

68. Barrett T, Wilhite SE, Ledoux P, Evangelista C, Kim IF, Tomashevsky M, Marshall KA, Phillippy KH, Sherman PM, Holko M, et al.: NCBI GEO: archive for functional genomics data sets--update. Nucleic Acids Res 2013, 41:D991–5.


69. Liu T, Ortiz JA, Taing L, Meyer CA, Lee B, Zhang Y, Shin H, Wong SS, Ma J, Lei Y, et al.: Cistrome: an integrative platform for transcriptional regulation studies. Genome Biol 2011, 12:R83.

70. • Mei S, Qin Q, Wu Q, Sun H, Zheng R, Zang C, Zhu M, Wu J, Shi X, Taing L, et al.: Cistrome Data Browser: a data portal for ChIP-Seq and chromatin accessibility data in human and mouse. Nucleic Acids Res 2017, 45:D658–D662.

Describes the integration of nearly 25,000 human and mouse ChIP-seq datasets, curated from various community and consortium sources and processed uniformly with the Cistrome pipeline to form a central database.

71. Mei S, Meyer CA, Zheng R, Qin Q, Wu Q, Jiang P, Li B, Shi X, Wang B, Fan J, et al.: Cistrome Cancer: A Web Resource for Integrative Gene Regulation Modeling in Cancer. Cancer Res 2017, 77:e19–e22.

72. Leek JT, Scharpf RB, Bravo HC, Simcha D, Langmead B, Johnson WE, Geman D, Baggerly K, Irizarry RA: Tackling the widespread and critical impact of batch effects in high-throughput data. Nat Rev Genet 2010, 11:733–739.

73. Johnson WE, Li C, Rabinovic A: Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics 2007, 8:118–127.

74. • Goh WWB, Wang W, Wong L: Why Batch Effects Matter in Omics Data, and How to Avoid Them. Trends Biotechnol 2017, 35:498–507.

Reviews the impacts of batch effect in analyzing genomic and epigenomic data from multiple sources, and methods for avoiding or correcting it.

[Figure: timeline from 1990 to 2018 of major consortium projects: Human Genome Project; FANTOM1 through FANTOM6; ENCODE I through IV; TCGA; Roadmap Epigenome Project; GTEx Project; BLUEPRINT; DEEP; PsychENCODE.]


Highlights

● Consortia have produced vast quantities of epigenetic data in order to characterize regulatory elements.
● New techniques are improving prediction and efficient experimental validation of regulatory elements.
● Standardized, reproducible data and metadata access systems and bioinformatic pipelines are crucial to take full advantage of consortium data.
● Integrative efforts are making progress at unifying consortium data.