Candidate Cancer Driver Mutations in Distal Regulatory Elements and Long-Range Chromatin Interaction Networks

Candidate Cancer Driver Mutations in Distal Regulatory Elements and Long-Range Chromatin Interaction Networks

Article Candidate Cancer Driver Mutations in Distal Regulatory Elements and Long-Range Chromatin Interaction Networks Graphical Abstract Authors € l...

4MB Sizes 0 Downloads 12 Views

Article

Candidate Cancer Driver Mutations in Distal Regulatory Elements and Long-Range Chromatin Interaction Networks Graphical Abstract

Authors € la-Reimand, Helen Zhu, Liis Uusku Keren Isaev, ..., Paul C. Boutros, € ri Reimand Michael D. Wilson, Ju

Correspondence [email protected]

In Brief Cancer is driven by somatic mutations in critical genes, but few non-coding drivers are known. In a pan-cancer analysis, Zhu et al. identified frequently mutated, multitissue regulatory elements with chromatin loops to distal genes. Genomic deletion of one region caused deregulation of cancer genes, pathways, and proliferation in human cells.

Highlights d

Pan-cancer driver analysis highlights frequently mutated regulatory elements (FMREs)

d

FMREs are active in many tissues and interact with genes via chromatin loops

d

FMRE deletion in human cells caused alterations in pathway activity and proliferation

d

Additional less-frequent regulatory mutations are enriched at cancer genes and pathways

Zhu et al., 2020, Molecular Cell 77, 1–15 March 19, 2020 ª 2020 Elsevier Inc. https://doi.org/10.1016/j.molcel.2019.12.027

Please cite this article in press as: Zhu et al., Candidate Cancer Driver Mutations in Distal Regulatory Elements and Long-Range Chromatin Interaction Networks, Molecular Cell (2020), https://doi.org/10.1016/j.molcel.2019.12.027

Molecular Cell

Article Candidate Cancer Driver Mutations in Distal Regulatory Elements and Long-Range Chromatin Interaction Networks € la-Reimand,3,4,12 Keren Isaev,1,2,12 Lina Wadi,1 Azad Alizada,3 Shimin Shuai,1,5 Vincent Huang,1 Helen Zhu,1,2,12 Liis Uusku Dike Aduluso-Nwaobasi,1 Marta Paczkowska,1 Diala Abd-Rabbo,1 Oliver Ocsenas,1,2 Minggao Liang,3,5 J. Drew Thompson,1 Yao Li,1 Luyao Ruan,1 Michal Krassowski,1 Irakli Dzneladze,1 Jared T. Simpson,1,7 Mathieu Lupien,2,6 € ri Reimand1,2,13,* Lincoln D. Stein,1,5 Paul C. Boutros,2,8,9,10,11 Michael D. Wilson,3,5 and Ju 1Computational

Biology Program, Ontario Institute for Cancer Research, 661 University Avenue Suite 510, Toronto, ON M5G 0A3, Canada of Medical Biophysics, University of Toronto, 101 College Street Suite 15-701, Toronto, ON M5G 1L7, Canada 3Program in Genetics and Genome Biology, SickKids Research Institute, Peter Gilgan Centre for Research and Learning (PGCRL), 686 Bay Street, Toronto, ON M5G 0A4, Canada 4Division of Gene Technology, Department of Chemistry and Biotechnology, Tallinn University of Technology, Akadeemia tee 15, Tallinn 12618, Estonia 5Department of Molecular Genetics, University of Toronto, 1 King’s College Circle, Toronto, ON M5S 1A8, Canada 6Princess Margaret Cancer Centre, 101 College Street, Toronto, ON M5G 0A3, Canada 7Department of Computer Science, University of Toronto, 214 College Street, Toronto, ON M5T 3A1, Canada 8Department of Human Genetics, University of California Los Angeles, 10833 Le Conte Avenue, Los Angeles, CA 90095, USA 9Department of Urology, University of California Los Angeles, 200 Medical Plaza Driveway #140, Los Angeles, CA 90024, USA 10Institute of Precision Health, University of California Los Angeles, 10833 Le Conte Avenue, Los Angeles, CA 90024, USA 11Jonsson Comprehensive Cancer Centre, University of California Los Angeles, 10833 Le Conte Avenue, Los Angeles, CA 90024, USA 12These authors contributed equally 13Lead Contact *Correspondence: [email protected] https://doi.org/10.1016/j.molcel.2019.12.027 2Department

SUMMARY

INTRODUCTION

A comprehensive catalog of cancer driver mutations is essential for understanding tumorigenesis and developing therapies. Exome-sequencing studies have mapped many protein-coding drivers, yet few non-coding drivers are known because genomewide discovery is challenging. We developed a driver discovery method, ActiveDriverWGS, and analyzed 120,788 cis-regulatory modules (CRMs) across 1,844 whole tumor genomes from the ICGC-TCGA PCAWG project. We found 30 CRMs with enriched SNVs and indels (FDR < 0.05). These frequently mutated regulatory elements (FMREs) were ubiquitously active in human tissues, showed long-range chromatin interactions and mRNA abundance associations with target genes, and were enriched in motif-rewiring mutations and structural variants. Genomic deletion of one FMRE in human cells caused proliferative deficiencies and transcriptional deregulation of cancer genes CCNB1IP1, CDH1, and CDKN2B, validating observations in FMRE-mutated tumors. Pathway analysis revealed further sub-significant FMREs at cancer genes and processes, indicating an unexplored landscape of infrequent driver mutations in the non-coding genome.

Cancer is driven by somatic mutations that affect critical genes and hallmark pathways (Hanahan and Weinberg, 2011). Completing the catalog of cancer drivers is key to understanding biological mechanisms and developing precision therapies and biomarkers. Driver mutations such as single-nucleotide variants (SNVs) are positively selected and occur more frequently than expected from genome-wide mutation rates, show elevated functional impact, and are enriched in cancer pathways (Creixell et al., 2015; Gonzalez-Perez et al., 2013; Khurana et al., 2016). However, their discovery is challenged by numerous passenger mutations and sequencing artifacts apparent in cancer genomes. The Cancer Genome Atlas (TCGA) (Cancer Genome Atlas Research Network et al., 2013), the International Cancer Genome Consortium (ICGC) (Hudson et al., 2010), and others have assembled datasets of cancer genomes and identified hundreds of driver mutations in protein-coding genes (Bailey et al., 2018). In contrast, few non-coding drivers are known. Confirmed driver mutations in the TERT promoter create new transcription factor (TF) binding sites, increase TERT transcription, and confer replicative immortality (Bell et al., 2015; Fredriksson et al., 2014; Horn et al., 2013; Huang et al., 2013; Weinhold et al., 2014). Driver discovery in whole-genome sequencing (WGS) datasets has been limited for several reasons. First, most efforts have focused on genic non-coding elements such as promoters and non-coding RNAs (Fredriksson et al., 2014; Fujimoto et al., 2016; Khurana et al., 2013; Lanzo´s et al., 2017; Melton et al., 2015; Weinhold

Molecular Cell 77, 1–15, March 19, 2020 ª 2020 Elsevier Inc. 1

Please cite this article in press as: Zhu et al., Candidate Cancer Driver Mutations in Distal Regulatory Elements and Long-Range Chromatin Interaction Networks, Molecular Cell (2020), https://doi.org/10.1016/j.molcel.2019.12.027

Genome-wide cancer driver discovery with ActiveDriverWGS

chr chr1 chr8 chr7

Merge & filter Candidate region (with active sites) Background sequence (±50 kbps) Poisson regression + tricnucleotide covariates

Associate FMREs with with putative target genes

Frequently mutated regulatory elements (FMREs)

75

50

50

25

25

0

0

4

VHL

5:1295292

CDKN2A

2

0

Predicted proportion of driver mutations per region (%)

Mutations / 100 bp (log 2)

Functional validation of FMRE (CRISPR, RNA-seq, proliferation) TERT WDR74 ICK

5:1295292 11:62608244 6:52860289 9:35658253 15:45493088 1:16940092 2:243153934 16:22308900 6:27870674 1:204475807 14:21081834 19:893248 1:28834956 22:29065573 2:10237806 1:161510036 1:165562391 X:117388771 11:65270169 9:44435985 1:161500602 8:133204434 4:173751654 4:190660002 10:96162552 4:775934 2:92261102 14:106938468 11:110072233 7:137139144

620 252 150

75

KRAS TP53 PIK3CA VHL CDKN2A SMAD4 PTEN PBRM1 ARID1A RB1 SETD2 IDH1 FBXW7 BRAF MEN1 APC MAP2K4 PIK3R1 SPOP OR2G6 KDM6A NFE2L2 CTNNB1 CBFB B2M CDH1 GATA3 CDKN1B NRAS MAP3K1 KMT2C ACVR1B ZFP36L2 ATM FOXA1 BAP1 NOTCH1 DDX3X AKT1 HRAS DDR2 TGFBR2 RBM10 H3F3A RNF43 KMT2D KDM5C EPHA2

Patients with mutations

100

TP53 KRAS

F

E 100

80

60

11:65270169

40

P=0.032

KMT2D

CDS FMRE

P=0.047

mut-

mRNA changes in FMRE enrichment in pathways FMRE-mutated tumors

Long-range chromatin interactions

C

D

alt trinuc patient T ACA ID001 G TAG ID002 ID005

Downstream analysis mut+

B 100

ref C A AA

MALAT1

120k cisregulatory modules (CRMs)

4.2M TF binding sites (TFBS) from ENCODE

pos 123 567 456

CHEK2

cell lines

Transcription factors (TFs)

b) SNVs, indels in 1,844 tumor genomes + trinucleotide mutation context

ZKSCAN3 MDM4 CCNB1IP1

a) Pre-defined genomic regions

A

5:1295292 11:62608244 6:52860289 9:35658253 15:45493088 1:16940092 2:243153934 16:22308900 6:27870674 1:204475807 14:21081834 19:893248 1:28834956 22:29065573 2:10237806 1:161510036 1:165562391 X:117388771 11:65270169 9:44435985 1:161500602 8:133204434 4:173751654 4:190660002 10:96162552 4:775934 2:92261102 14:106938468 11:110072233 7:137139144

OncodriveFML DriverPower (14) (24)

*

ActiveDriverWGS (30)

11 1

0

** * *

10

NBR (22)

6 0

1

0

6

1 3

12

0

0

3

5:1295292 (TERT)

1:204475807 (MDM4) 6:27870674 (ZKSCAN3)

11:65270169 (MALAT1)

WGS dataset

6:52860289 (ICK)

PCAWG (1,844) ICGC (3,181)

0%

2%

14:21081834 (CCNB1IP1)

4%

22:29065573 (CHEK2)

Patients mutated

Cancer type Panc−AdenoCA (232)

CNS−PiloAstro (89)

Kidney−ChRCC (43)

Lung−AdenoCA (33)

Myeloid−AML (13)

Prost−AdenoCA (199)

Panc−Endocrine (81)

ColoRect−AdenoCA (43)

Biliary−AdenoCA (33)

Breast−LobularCa (13)

Breast−AdenoCa (195)

Stomach−AdenoCA (65)

Uterus−AdenoCA (41)

Myeloid−MPN (23)

Bone−Epith (11)

Kidney−RCC (143)

Head−SCC (55)

Bone−Osteosarc (41)

Bladder−TCC (23)

Bone−Cart (9)

CNS−Medullo (141)

Thy−AdenoCA (48)

CNS−GBM (38)

CNS−Oligo (18)

Breast−DCIS (3)

Ovary−AdenoCA (110)

Lung−SCC (45)

Bone−Leiomyo (34)

Cervix−SCC (18)

Myeloid−MDS (2) Cervix−AdenoCA (2)

Figure 1. Discovery of Frequently Mutated Regulatory Elements (FMREs) as Candidate Non-coding Cancer Drivers (A) Outline of study. We designed the driver discovery method ActiveDriverWGS that identifies frequently mutated regions given a pre-defined list of genomic regions and somatic mutations from WGS data. The trinucleotide context of regions and mutations is considered for mutation rate analysis. We performed a driver (legend continued on next page)

2 Molecular Cell 77, 1–15, March 19, 2020

Please cite this article in press as: Zhu et al., Candidate Cancer Driver Mutations in Distal Regulatory Elements and Long-Range Chromatin Interaction Networks, Molecular Cell (2020), https://doi.org/10.1016/j.molcel.2019.12.027

et al., 2014), while intergenic regulatory elements remain largely unexplored. However, distal regulatory elements interact with target genes via chromatin long-range interactions (Canela € la-Reimand et al., 2016) and et al., 2017; Rao et al., 2014; Uusku thus provide a potential mechanism for non-coding drivers and a framework for their annotation. For example, enhancer mutations of PAX5 and TAL1 occur in leukemia (Mansour et al., 2014; Puente et al., 2015). Second, driver discovery is complicated by variable somatic mutation rates, such as megabasescale associations with transcription, replication, and chromatin state (Lawrence et al., 2013; Polak et al., 2015; Reijns et al., 2015; Schuster-Bo¨ckler and Lehner, 2012), variation specific to classes of elements such as exons and TF binding sites (Bailey et al., 2015; Frigola et al., 2017; Hnisz et al., 2016; Kaiser et al., 2016; Katainen et al., 2015; Sabarinathan et al., 2016; Visser et al., 2012), and nucleotide-level mutational signatures (Alexandrov et al., 2013, 2019). Third, non-coding mutations can be spread across regions with similar function, whereas proteincoding mutations often affect only a few non-silent bases, particularly for activating driver mutations (Vogelstein et al., 2013). Lastly, genome-wide driver analysis is statistically underpowered as far fewer WGS datasets than exome-sequencing datasets are available currently, while the vast non-coding genome leads to increased multiple-testing penalties. We explored the gene-regulatory effects of non-coding somatic mutations using the large and uniformly processed WGS dataset of 1,844 tumor-normal pairs from the Pan-cancer Analysis of Whole Genomes (PCAWG) project (Campbell et al., 2017). Using driver discovery analysis, we identified frequently mutated regulatory elements (FMREs) across a genome-wide set of TF binding sites, then predicted their regulatory target genes using long-range chromatin interactions and found FMRE mutations that associated with TF motif rewiring and mRNA abundance changes in tumor transcriptomes. We also provide functional validation data for one distal FMRE in the regulation of cancer genes and proliferation pathways. Thus, a subset of non-coding mutations in ubiquitous regulatory elements may be involved in oncogenesis via long-range chromatin interaction networks. RESULTS Genome-wide Discovery of Cancer Driver Mutations with ActiveDriverWGS We analyzed a cohort of 1,844 whole cancer genomes of 31 cancer types and 14 million SNVs and indels. The ICGC/TCGA

PCAWG Consortium aggregated WGS data from 2,658 tumors across 38 tumor types generated by the ICGC and TCGA projects, reanalyzed the data with standardized pipelines to align to the human genome (hs37d5), and identified germline variants and somatically acquired mutations (Campbell et al., 2017). To provide a conservative analysis of non-coding driver mutations, we focused on the high-confidence set of 2,583 tumors and also excluded genomes of tumors that could individually influence the results. These were specifically 69 hypermutated genomes and all genomes of four cancer types with previous evidence of elevated mutation rates in TF binding sites (i.e., melanomas [Sabarinathan et al., 2016], lymphomas [Lohr et al., 2012], liver and esophageal cancers [Hnisz et al., 2016]). We confirmed an inflation of findings solely explained by these cancer types in initial analyses (STAR Methods). We created ActiveDriverWGS, a genome-wide driver discovery method that quantifies the enrichment of mutations in predefined genomic regions relative to an expected background model (Figures 1A and S1; STAR Methods). In this study, we used the genomic regions of 120,788 cis-regulatory modules (CRMs) covering 73.3 Mbps of the genome with 1.1 million binding sites of 99 human TFs profiled in ENCODE (ENCODE Project Consortium, 2012). ActiveDriverWGS identifies candidate noncoding drivers among such regions by evaluating SNVs and indels in each region relative to a region-specific expected mutation model trained on the adjacent genomic windows (±50 kbps) using Poisson regression. We selected this narrow sequence window for inferring expected mutation burden to adjust for larger, megabase-scale fluctuations of mutation rates observed in cancer genomes (Lawrence et al., 2013; Polak et al., 2015; Reijns et al., 2015; Schuster-Bo¨ckler and Lehner, 2012). ActiveDriverWGS also models mutations in the reference trinucleotide and alternative allele context to adjust for mutational signatures (Alexandrov et al., 2013). Together, these regional estimates provide models of expected mutation rates for detecting candidate drivers with significant excess of mutations. To validate ActiveDriverWGS, we first quantified its ability to recover cancer drivers among protein-coding genes (CDS). We found 48 CDS drivers in the pan-cancer cohort (FDR < 0.05), including a 22-fold enrichment of known cancer genes annotated in the COSMIC Cancer Gene Census database (44 found versus 2 expected, Fisher’s exact p = 4.7 3 1060, Figure 1B) (Futreal et al., 2004). Cohorts of specific cancer types revealed 71 genes including 60 known drivers (Figure S2). For mechanistic predictions, ActiveDriverWGS optionally finds drivers with known functional sites that are specifically enriched

discovery of cis-regulatory modules (CRMs) and found frequently mutated regulatory elements (FMREs). FMREs were associated with target genes using longrange chromatin interactions. Tumor RNA-seq data were analyzed to identify genes potentially regulated by the FMRE mutations. One FMRE was validated functionally. (B) Protein-coding driver genes detected with ActiveDriverWGS in the pan-cancer dataset (FDR < 0.05). Known cancer genes are shown in bold and are significantly overrepresented, validating our method. Colors indicate cancer types. (C) FMREs detected with ActiveDriverWGS in the pan-cancer dataset as candidate non-coding drivers (FDR < 0.05). Predicted target genes of selected FMREs are indicated. (D) Comparison of mutation rates (left) and a predicted proportion of driver mutations (right) in FMREs and protein-coding drivers from the pan-cancer analyses. Genes with outlier values are named. (E) Relative mutation frequencies of FMREs in the discovery cohort from PCAWG and an additional validation cohort from the ICGC data portal. (F) Venn diagram of FMREs detected using alternative driver discovery methods. Prioritized FMREs are indicated with arrows and were found by multiple complementary strategies, lending confidence to these findings as novel non-coding drivers.

Molecular Cell 77, 1–15, March 19, 2020 3

Please cite this article in press as: Zhu et al., Candidate Cancer Driver Mutations in Distal Regulatory Elements and Long-Range Chromatin Interaction Networks, Molecular Cell (2020), https://doi.org/10.1016/j.molcel.2019.12.027

in mutations. Using a collection of protein post-translational modification (PTM) sites (Krassowski et al., 2018), we found several drivers with mutations enriched at PTM sites (TP53, IDH1, BRAF, PIK3R1, CTNNB1, H3F3A). Another driver analysis focusing on promoters, UTRs, and lncRNAs found the lncRNA NEAT1, 50 UTR of MTG2, PLEKHS1 promoter, and a few others, confirming our ability to recover non-coding regions with frequent mutations (Rheinbay et al., 2017). Lastly, we benchmarked ActiveDriverWGS using multiple approaches (Figure S1, STAR Methods). We found that modeling of trinucleotide mutations improved model specificity while exclusion of hypermutated genomes improved sensitivity. ActiveDriverWGS showed robustness to parameter choices and also outperformed most similar methods in terms of protein-coding driver discovery in PCAWG. Thus, ActiveDriverWGS is an accurate model for genome-wide cancer driver discovery. Driver Analysis Reveals Frequently Mutated Regulatory Elements (FMREs) Driver discovery across 120,788 CRMs revealed 30 frequently mutated regulatory elements (FMREs) with 740 SNVs and 71 indels that affected 532/1,844 tumors (29%) (ActiveDriverWGS FDR < 0.05; Figure 1C, Table S1). The majority of FMREs (17/ 30) and CRMs (75%) did not overlap any UTRs, promoters, or lncRNA genes studied in the PCAWG project (Rheinbay et al., 2017), indicating that our findings are complementary to earlier studies (Figure S2). We predicted the target genes of FMREs based on their location in promoters or any long-range chromatin interactions connecting FMREs and promoters. We used a published HiC map of long-range chromatin interactions (Rao et al., 2014) and selected 11,282/43,219 conserved chromatin interactions detected in at least two human cell lines to obtain a highconfidence set for our pan-cancer analysis. As a result, 21 genes were assigned to FMREs based on long-range chromatin interactions and 15 genes were assigned to FMREs located in their promoters. We prioritized five FMREs based on their cancer-associated target genes. CCNB1IP1 encodes an inhibitor of cell cycle and a regulator of meiotic recombination (Qiao et al., 2014; Singh et al., 2007; Toby et al., 2003) and is annotated as a tumor suppressor gene in the Cancer Gene Census database. CCNB1IP1 showed chromatin interactions with the FMRE 14:21081834 located 280 kbps from the gene (24 mutated tumors versus 6 expected, ActiveDriverWGS FDR = 4.5 3 104) (Figures 3B and 3C). The kinase ICK was associated with the FMRE 6:52860289 via chromatin interactions of 60 kbps (33 tumors versus 6 expected, FDR = 4.6 3 109). ICK has been described in the context of proliferation in glioblastoma (Yang et al., 2013). The transcription factor ZKSCAN3 was associated with the FMRE 6:27870674 via chromatin interactions of 446 kbps (27 tumors versus 7 expected, FDR = 3.6 3 104). ZKSCAN3 has been described in the context of autophagy and cyclin D2 expression (Chauhan et al., 2013; Yang et al., 2011). To extend target gene annotations, we queried the genome within ± 25 kbps of FMREs and found 38 additional genes including two known cancer genes. The FMRE 22:29065573 was located 18 kbps downstream of the tumor suppressor CHEK2 in the intron of TTC28 (10 tumors versus 1 expected, FDR = 5.5 3 104). The FMRE 4 Molecular Cell 77, 1–15, March 19, 2020

1:204475807 was located 9 kbps upstream of the oncogene MDM4 and 11 kbps upstream of the signaling gene PIK3C2B (24 tumors versus 5 expected, FDR = 4.4 3 104). These adjacent FMREs may reflect extended gene promoters or enhancers with undocumented short-range chromatin interactions. Three FMREs were found as positive controls or recurrently mutated regions: the promoters of TERT and WDR74 and the lncRNA MALAT1. In sum, FMREs reside further outside commonly analyzed promoters and potentially associate with known and candidate cancer genes. The relatively low number of FMREs suggested an overall lower mutation frequency compared to CDS drivers. However, as FMREs were shorter in sequence than CDS drivers (median 821 versus 2,086 bps), their relative mutation rate was higher (2.0 versus 1.5 mutations per 100 bps; Wilcoxon p = 0.032) (Figure 1D). FMREs carried higher proportions of putative driver mutations relative to passenger mutations, compared to CDS drivers (median 81% versus 73% mutations, Wilcoxon p = 0.047) (Figure 1D). To derive the proportion of putative driver mutations in each region, we counted the mutations observed in the pan-cancer cohort and subtracted the passengers estimated from the background model of ActiveDriverWGS. These data suggest that FMREs also undergo positive selection in cancer genomes. We next validated the FMREs computationally using additional WGS data and alternative driver discovery tools. First, we evaluated FMREs in an independent set of variant calls of 3,181 whole cancer genomes downloaded from the ICGC data portal. FMREs were mutated in a similar proportion of tumors in this validation cohort (695/3,181 of 22%; Figures 1E and S3). Mutation frequencies of FMREs were strongly correlated in the two cohorts when adjusting for sequence length (R2 = 0.93, p = 2.7 3 1010). Second, we used three alternative methods for driver discovery among CRMs (NBR [Nik-Zainal et al., 2016], OncodriveFML [Mularoni et al., 2016], DriverPower [Shuai et al., 2017]). The majority of FMREs (20/30) were also detected by at least one method, more than expected from chance alone (0 expected, Fisher’s exact p = 2.9 3 1064; Figure 1F). Complementary signals of positive selection, mutation impact, and functional clustering analyzed in these methods lend confidence to FMREs. Lastly, the majority of mutations in the five selected FMREs and the TERT promoter were considered true positives (201/207 or 97%) based on a manual review of raw sequencing data for sequence context, read coverage, and strand bias. These data support our hypothesis of FMREs as cancer drivers. We performed driver discovery in PCAWG cohorts of individual cancer types and found few FMREs aside the TERT promoter (Figure S4). The lack of detectable tissue-specific FMREs was explained by a power analysis. The pan-cancer cohort of 1,844 samples was powered to detect considerably less-frequent mutations (effect size f = 0.03) compared to the largest tissue-specific cohort (f = 0.18; 232 tumors) and the median cohort (f = 0.44; 41 tumors). To tackle this issue of limited statistical power, we restricted our driver analysis to the 30 pan-cancer FMREs and used cohorts of specific cancer types to re-discover these regions as tissue-specific FMREs. The majority of FMREs (27/30) were found in at least one cancer type (restricted FDR < 0.05), due to the reduced effect of multiple testing correction. Thus,

Please cite this article in press as: Zhu et al., Candidate Cancer Driver Mutations in Distal Regulatory Elements and Long-Range Chromatin Interaction Networks, Molecular Cell (2020), https://doi.org/10.1016/j.molcel.2019.12.027

analysis of FMREs in tissue-specific cohorts is likely limited by the size of current WGS datasets. FMREs Are Enriched in Multiple Classes of Somatic Alterations Somatic driver mutations in different cancers are dominated by either SNVs and indels or copy-number and structural variants (Ciriello et al., 2013). We asked whether multiple mutation classes were enriched in the 27 novel FMREs (excluding TERT, WDR74, MALAT1) compared to control regions of randomly sampled CRMs with similar sequence length and transcription status (Figure 2A). As expected, FMREs were enriched in 494 SNVs (130 expected; pCRM < 106) and 47 indels (11 expected; pCRM = 1.3 3 105). This comparison of all FMREs to a matched set of control CRMs verified the ActiveDriverWGS analysis that compared each FMRE to its adjacent background sequence. We then analyzed 210,718 structural variants (SVs) comprising duplications, deletions, inversions, and translocations profiled in the PCAWG cohort (Li et al., 2017). All SV breakpoints combined were enriched in FMREs compared to control CRMs (17 observed versus 7 expected, pCRM = 0.0075) as were translocation breakpoints (6 observed versus 1 one expected, pCRM = 0.026) (Figure 2A). Analysis of focal SV segments (<100 kbps) showed an enrichment of deletions at FMREs (13 observed versus 5 expected, pCRM = 0.035) while other classes of SVs were not enriched. We reviewed an extended set of 48 inversions and translocations and annotated 16/30 FMREs with 91 potential target genes within ±10 kbps of breakpoints (Figure 2B). Five translocations were associated with mRNA abundance changes of ALK, TERT, LINC00221, and PIK3C2B in single tumors (z-score > 4). A translocation of the FMRE 11:65270169 (MALAT1) and the intron of the oncogenic kinase ALK marked a 1,500-fold increase in ALK mRNA abundance in a thyroid adenocarcinoma (405.54 versus median 0.28 FPKM-UQ, z = 6.71). An SV breakpoint between the FMRE 1:204475807 and PIK3C2B promoter marked a 4.5-fold mRNA increase of PIK3C2B in a breast tumor (21.71 versus 4.80 FPKM-UQ, z = 5.27). These examples suggest that SVs also rewire FMREmediated gene regulation, but the analysis is underpowered compared to our driver discovery due to many fewer documented SVs. In summary, diverse genetic alterations observed at FMREs support their role in cancer biology. FMREs Are Ubiquitous Regulatory Elements with Frequent Chromatin Interactions and Mutated Sequence Motifs We studied the epigenetic annotations of FMREs. In normal tissues, most of FMREs (16/30) were annotated as promoters, enhancers, or bivalent regions in >90% of the compendium of 127 human tissue types and cell lines recorded in the Roadmap Epigenomics project (Roadmap Epigenomics Consortium et al., 2015) (Figure 2C). The epigenetic annotations were over-represented in the 27 novel FMREs compared to control CRMs: active transcription start sites in 110 tissues, heterochromatin in 47 tissues, and enhancers in 17 tissues were enriched (pCRM < 0.05). Besides normal tissues, most FMREs (23/30) occurred at pancancer regulatory elements derived from a collection of 411 pri-

mary tumors in the TCGA project (Corces et al., 2018). FMREs were also bound by a larger number of TFs in ENCODE compared to all CRMs (median 57 versus 10, Wilcoxon p = 0.0026) and had increased sequence length (median 647 versus 465 bps, p = 0.0034), similarly to high-occupancy developmental enhancer regions described earlier (Gerstein et al., 2010; Kvon et al., 2012). FMREs co-occurred with anchors of 21 long-range chromatin interactions, more than expected from matched CRMs (11 expected, pCRM = 0.019) (Figure 2D). To study the impact of mutations in the 27 novel FMREs, we analyzed their sequence motifs bound by TFs using the FunSeq2 method (Fu et al., 2014). A third of SNVs in FMREs significantly affected motifs, with 149 SNVs causing motif losses and 22 SNVs causing motif gains (39 motif-rewiring SNVs expected, pCRM < 106) (Figure 2E). By analyzing TFs separately, we found 19 TFs with unexpectedly frequent motif-rewiring mutations compared to matching CRMs (pCRM < 0.05; R5 mutations) (Figure 2F). These included the chromatin architectural proteins CTCF, RAD21, and YY1 involved in chromatin interactions and promoter-enhancer contacts (Canela et al., 2017; Rao et al., €la-Reimand et al., 2016; Weintraub et al., 2017). 2014; Uusku FMRE mutations also occurred closer to midpoints of ChIPseq peaks in ENCODE where in vivo binding sites are known to occur, suggesting their functional impact (23.3% versus 8.4% SNVs, Kruskal-Wallis p = 2.7 3 1059; Figure 2G). These data show that FMREs are ubiquitous regulatory elements in diverse normal tissues and tumors that interact with target genes distally through long-range chromatin interactions, indicating potential functions in gene regulation and chromatin architecture. Distal FMRE Mutations Associate with Reduced Transcription of Tumor Suppressor CCNB1IP1 We asked whether the mutations in FMREs were associated with mRNA abundance changes in the matched transcriptomes of FMRE-mutated tumors. We had sufficient data to analyze 23 target genes of 12 FMREs using matched PCAWG transcriptomes available for a subset of tumors (831/1,844 tumors or 45%) (Calabrese et al., 2018). To model mRNA abundance changes, we used matched gene copy number calls from PCAWG as these may also explain mRNA abundance changes (Gerstung et al., 2018). As a result, we found 9/23 genes that showed significant associations of mRNA abundance and FMRE mutations (TERT, RCC1, ZKSCAN3, NBPF1, CCNB1IP1, ICK, EPB41, SQRDL, WDR74; F-test, FDR < 0.2) (Figure 3A). As a positive control, TERT promoter mutations associated with increased mRNA in 11 thyroid tumors (p = 6.8 3 106, median 0.077 versus 0.00 FPKM-UQ). No significant mRNA differences were apparent for MDM4 and CHEK2. This analysis is restricted to only a subset of FMRE-mutated tumors due to incompleteness of transcriptomic data, so additional target genes likely remain hidden. The tumor suppressor CCNB1IP1 (HEI10) showed significantly lower mRNA abundance in six tumors with non-coding mutations in the FMRE 14:21081834, including three kidney tumors and three breast tumors with mRNA data (p = 0.031, log fold-change log2FC = 0.65). This distal FMRE shows a longrange chromatin interaction with CCNB1IP1 through 280 kbps of genome that potentially mediates promoter-enhancer

Molecular Cell 77, 1–15, March 19, 2020 5

Please cite this article in press as: Zhu et al., Candidate Cancer Driver Mutations in Distal Regulatory Elements and Long-Range Chromatin Interaction Networks, Molecular Cell (2020), https://doi.org/10.1016/j.molcel.2019.12.027

P<10-6

600

Focal deletions

P<0.0075 P<8.9x10-4

60

P<10-6

40 40

50 20 25

0

0 observed in FMREs

20

0

expected in CRMs

NBPF1 1:16940092 CROCCP2 RCC1 SNHG3 *1:2883 HIAT1 4956 FCGR HSPA 2A 1:16 6 PIK 150060 *1:2 3C2B 2 MD 0447 M4 5807

rf61 CXo 14 6A S LC

1

Y X

34

39 15

22

43

2 2:

2

21

20

HEXCA10 IM NAG 2 S CW PYY PIP4KC25 2B

Q1 SH

18

3

17

16

3 PCGF 4 4:77593 CPLX1 BEND4 SEPT11 PPA2

4

15

15:45493088 SHF

8468 0693 14:1 C00221 LIN 81834 10 11 *14:2 NASE R

5

14

13

2 NL 5 MB EK N

TERT *5:129 DIAP 5292 H1

6 P M PP1 KI RPS R10 F2 1 5 8B

12

10

2W UBE

SIT1 9:35658253 RMRP CCDC107

ARHGEF39

2

FO XJ

9

11 :11 00 72 23 11 :65 SC RD 3 2 Y X MA701 L1 6 L S A 9 11:6 SL NHGT1 260 C3A 1 WD 82442 R STX74 5

11

7

1

8

R 3A C

40

20

200

100

0

C

19

P<10-6

P<1.1x10-5

0

expected genome-wide

B CRLF1 C19orf60 UBA52 SF3A2 PLEKHJ1 DOT1L ABHD17A KLF16 APC2 RPS15 DAZAP1 SBNO2 R3HDM4 19:893248 MED16

0

CY 2:1 S1 AL 023 78 C K 06 N OM C2 CKA MD N o P 1 LR GE rf82 5 R F FI P1

200

Motif rewiring

E

P<0.019

P<0.013

75

400

Chromatin interactions

D

P<0.035

Number of mutations

P<10-6

Number of mutations

SV breakpoints

P<1.3x10-5

5:1295292 11:62608244 6:52860289 9:35658253 15:45493088 1:16940092 2:243153934 16:22308900 6:27870674 1:204475807 14:21081834 19:893248 1:28834956 22:29065573 2:10237806 1:161510036 1:165562391 X:117388771 11:65270169 9:44435985 1:161500602 8:133204434 4:173751654 4:190660002 10:96162552 4:775934 2:92261102 14:106938468 11:110072233 7:137139144

FMRE

Translocation with mRNA association

F

*

* * * *

*** ** *** *** *** ** ** ** ** ** ** *** ** *** *** * *** ** **

● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●

0

10 20 30 Number of mutations

Motif break

G

Motif gain

P=2.7x10-59 40

20

0 0

50

100

Chromatin states of normal tissues promoter

CTCF YY1 CDX2 ZNF35 MEF2 HOXA9 TATA RAD21 NFKB CACD BHLHE40 HDAC2 RXRA CBX5 CACBP BCL ESRRA AP1 SPI1

Mutations (%)

100

TF motif

Indels

P<10-6

Number of interactions

SNVs

A

enhancer

bivalent

0

1-10 11-50 51+ Distance (bp) FMRE

CRM

Figure 2. Genetic and Epigenetic Annotations of FMREs Indicate Their Pan-tissue Oncogenic Roles (A) Enrichment of multiple classes of mutations in FMREs. Boxplots show mutations observed in FMREs (dark red) compared to control CRMs matched by sequence length and transcription status (pink; pCRM) and control regions sampled genome-wide (gray; pGW). Distribution of observed values in FMREs was derived using resampling. (B) Circos plot of translocations adjacent to FMREs (±10 kbps) and putative their target genes. Orange lines connect translocations with mRNA changes in corresponding tumors. (C) Gene regulatory annotations of FMREs. Barplot shows the numbers of normal human tissues with regulatory annotations at FMREs in the Roadmap Epigenomics dataset. Purple circles indicate FMREs that are also regulatory elements in primary tumors in TCGA. Asterisks indicate prioritized FMREs. (D) Enrichment of long-range chromatin interactions in FMREs. Boxplots show number of distinct interactions in FMREs compared to control CRMs and random genomic regions. (E) Enrichment of motif-rewiring mutations in FMREs compared to control CRMs and random genomic regions. (F) TFs and chromatin architectural proteins with frequently mutated DNA motifs in FMREs (pCRM < 0.05). Points and error bars show the expected numbers of motif-altering mutations sampled from control CRMs. (G) Sequence distances between mutations and midpoints of TF binding sites (ChIP-seq peaks) in FMREs and all CRMs. Mutations in FMREs appeared closer to midpoints, indicating potential functions. p value from a Kruskal-Wallis test is shown.

contacts (Figure 3B). The target gene CCNB1IP1 encodes a ubiquitin E3 ligase that negatively regulates cell cycle progression at the G2/M transition by inhibiting cyclin B1 (CCNB1)

6 Molecular Cell 77, 1–15, March 19, 2020

(Toby et al., 2003). Knockdown of CCNB1IP1 in cancer cell lines increased cell motility and invasion, implicating its role in tumor progression and metastasis (Singh et al., 2007).

Please cite this article in press as: Zhu et al., Candidate Cancer Driver Mutations in Distal Regulatory Elements and Long-Range Chromatin Interaction Networks, Molecular Cell (2020), https://doi.org/10.1016/j.molcel.2019.12.027

A

B

C

D

Figure 3. Integrative Analysis of Mutations, mRNA Abundance, and Chromatin Interactions Reveals a Distal Non-coding Candidate Driver of CCNB1IP1 (A) FMRE target genes with significant associations of FMRE mutations and mRNA abundance (FDR < 0.2, F-test). Boxplots show mRNA abundance in tumors with and without non-coding mutations in FMREs (Mut versus Ref). Each dot indicates a tumor sample colored by copy number status of the gene. Numbers of mutated samples are indicated below the boxplots. (B) Genomic location of the FMRE 14:21081834 (dark red) and its putative target gene CCNB1IP1. Long-range chromatin interactions spanning 280 kbps are shown below. CRMs (pink) and gene promoters (teal) are also shown. (C) Left: lollipop plot shows mutation frequency of FMRE 14:21081834 (dark red) and adjacent background sequence (gray) (left). Middle: lollipop plot shows 24 somatic mutations in the FMRE, including the six SNVs indicated by triangles that affected TF binding motifs. Right: Histogram shows the number of TF binding sites per nucleotide in the FMRE. (D) Sequence logos of TF binding motifs affected by six mutations in the FMRE that caused motif gains and motif losses. Representative TF motifs for each mutation are shown. Asterisk indicates the ETS TF motif found in the tumor with matching mRNA data.

The FMRE of CCNB1IP1 is likely active in multiple types of cancer as it is annotated as a regulatory element in the majority of normal human tissues as well as a pan-cancer element in primary tumors in TCGA. The FMRE overlaps with 85 types of TF binding sites measured in 79 cell lines in ENCODE (Figure 3C). Several

mutations in the FMRE (6/24) caused significant alterations of TF binding motifs in the DNA sequence: TF motifs of ETS, SPI1, and SRY were induced and TF motifs of RREB1, EBF1, and IRF were removed in individual tumor genomes (Figure 3D). SPI1 and other ETS family TFs are known as oncogenes often

Molecular Cell 77, 1–15, March 19, 2020 7

Please cite this article in press as: Zhu et al., Candidate Cancer Driver Mutations in Distal Regulatory Elements and Long-Range Chromatin Interaction Networks, Molecular Cell (2020), https://doi.org/10.1016/j.molcel.2019.12.027

C

A

D

B

E

H

F

G

I

Figure 4. Functional Validation of FMRE 14:21081834 as a Distal Regulatory Region of CCNB1IP1 (A) Homozygous deletion of the FMRE region in three clonal cell lines (top) and validation with Sanger sequencing (bottom). HEK293 cells were used. FMRE-KO denotes cell lines with FMRE 14:21081834 deletion. (B) qPCR validation of FMRE-KO cell lines. See Figure S5 for details. (C) mRNA abundance of FMRE target CCNB1IP1 in FMRE-KO cells (dark red) compared to wild type (WT; gray) and CRISPR-treated controls (CTL, dark gray) measured using RT-qPCR. t test p values are shown. (legend continued on next page)

8 Molecular Cell 77, 1–15, March 19, 2020

Please cite this article in press as: Zhu et al., Candidate Cancer Driver Mutations in Distal Regulatory Elements and Long-Range Chromatin Interaction Networks, Molecular Cell (2020), https://doi.org/10.1016/j.molcel.2019.12.027

activated in cancer (Delattre et al., 1992; Moreau-Gachelin et al., 1988; Tomlins et al., 2005). Interestingly, one mutation induced a de novo motif of the ETS family TF also seen in the TERT promoter (Bell et al., 2015; Huang et al., 2013). Motif losses induced by mutations were also associated with TFs involved in cancer pathways such as Ras signal transduction (RREB1), differentiation and B cell development (EBF1), and Jak-Stat signaling (IRF) (Liao, 2009; Tamura et al., 2008; Thiagalingam et al., 1996). Such gene-regulatory associations suggest new mechanistic hypotheses for FMRE mutations. Together, transcriptional inhibition of CCNB1IP1 in FMRE-mutated tumors is consistent with the known tumor suppressor role of this gene and suggests functionality of these distal non-coding mutations in cancer biology. Genomic Deletion of FMRE Causes Altered Proliferation of HEK293 Cells and Pathway Deregulation Reflected in Patient Tumors The association of distal non-coding mutations with tumor suppressor inhibition prompted us to functionally validate the FMRE of CCNB1IP1. We deleted the FMRE 14:21081834 (1.3 kbps) using CRISPR/Cas9 genome editing in HEK293 cells, developed single cells into three independent cultures with homozygous FMRE deletions (knockouts; FMRE-KO), and validated these using PCR genotyping and Sanger sequencing (Figures 4A, 4B, and S5). HEK293 embryonic kidney cells were used as a preliminary model to test kidney cancer mutations identified in the driver analysis. Using RT-qPCR experiments, we found that FMRE-KO cells showed reduced mRNA abundance of CCNB1IP1 compared to wild-type cells of different passages (log2FC = 0.67, Welch t test p = 0.0072) (Figure 4C). As expected, CCNB1IP1 mRNA abundance in FMRE-KO cells was also lower compared to CRISPR-transfected control cells (log2FC = 0.65, p = 0.0097). In contrast, no changes in CCNB1IP1 mRNA between wild-type and control cells were apparent (p = 0.86; log2FC = 0.014), indicating that mRNA changes were not caused by genome editing procedures. Inhibition of CCNB1IP1 mRNA in FMRE-KO cells implicates the distal FMRE in the regulation of this putative tumor suppressor and supports the association of SNVs and mRNA inhibition we observed in patient tumors. Global transcriptome profiling (RNA-seq) of FMRE-KO cells confirmed the reduced mRNA abundance of CCNB1IP1 (logFC = 0.72, limma/voom p = 0.0031) (Figure 4D). In total, 56 genes showed genome-wide differences in mRNA abundance (FDR < 0.1) (Figure 4E), whereas known cancer genes

were surprisingly frequent in this set (CDKN2A, HEY1, PSIP1, CDH1, MYD88, MLLT3; Fisher’s exact p = 0.012). We confirmed the significant mRNA changes of four cancer genes (CDKN2A, CDH1, HEY1, PSIP1) using RT-qPCR in FMRE-KO and wildtype cells (Figure 4F). Pathway enrichment analysis of the differentially expressed genes highlighted 21 Gene Ontology processes and Reactome pathways (FDR < 0.05) that were consistent with the role of CCNB1IP1 in the G2/M transition of the cell cycle, such as regulation of mitotic spindle assembly (FDR = 9.9 3 104; CHMP5, HSPA1B, HSPA1A) and negative regulation of protein ubiquitination (FDR = 0.048; CDKN2A, HSPA1B, HSPA1A) (Figure 4G). Developmental, stress response, and cellular senescence processes were also observed, such as oxidative stress induced senescence (FDR = 0.0079; CDKN2B, CDKN2A, HIST1H4H, HIST1H4C, HIST1H3F) and immune system development (FDR = 0.016; CA2, CR2, FAM213A, CDKN2B, CDKN2A, HIST1H4H, HIST1H4C, HSPA1B, HSPA1A, HIST1H3F). The genes with differential mRNA abundance may represent direct and indirect regulatory targets of the FMRE. We then asked whether the 56 genes detected in FMRE-KO cell lines were also differentially expressed in the patient tumors with FMRE mutations in PCAWG. We focused on the three FMRE-mutated kidney cancers with mRNA data and confirmed that 10/56 genes showed significant mRNA abundance changes and a matching direction of change (F-test, p < 0.1) (Figure 4E), using the stringent copy-number adjusted mRNA analysis described above. Strikingly, these include three cancer genes involved in cell cycle, transformation, and metastasis (Figure 4H). CDH1 (cadherin-1) encodes a major suppressor of tumor invasion and metastasis (Vleminckx et al., 1991). CDH1 mRNA abundance showed a 2.1-fold increase in the FMRE-mutated kidney tumors in PCAWG (p = 0.023) and a 2.5-fold increase in FMREKO cells (FDR = 0.087, p = 2.5 3 103). CDH1 upregulation in FMRE-KO cells was also validated using RT-qPCR (2.7-fold mRNA increase, p = 0.044) (Figure 4F). CDKN2B (cyclin-dependent kinase inhibitor 2B) encodes the tumor suppressor P15 that controls cell cycle progression and arrest in response to the TGF-b pathway (Hannon and Beach, 1994). CDKN2B mRNA abundance showed a 1.9-fold decrease in the FMRE-mutated kidney tumors in PCAWG (p = 0.079) and a 2.2-fold decrease in FMRE-KO cell lines (FDR = 0.047, p = 8.1 3 105) (Figure 4H). CDKN2B is frequently co-deleted with the tumor suppressor CDKN2A that was also differentially expressed in FMRE-KO cells. PSIP1 (LEDGF) encodes a regulator of differentiation and neurogenesis that drives transformation in leukemia (Ahuja

(D) mRNA abundance of CCNB1IP1 in FMRE-KO cells (dark red) compared to WT cells (gray) measured using RNA-seq. p value from a Limma/voom analysis is shown. (E) Volcano plot shows genes with differential mRNA abundance in FMRE-KO cells relative to WT cells measured using RNA-seq. Teal boxes show genes with matching mRNA changes in FMRE-mutated kidney tumors in PCAWG. (F) Validation of mRNA abundance changes of known cancer genes in FMRE-KO cells compared to WT controls using RT-qPCR. p values from t tests are shown. (G) Enrichment map of significantly deregulated pathways in FMRE-KO cells. Pathway-associated genes from the volcano plot are listed. Known cancer genes are in boldface letters. Circles represent GO terms and Reactome pathways, lines connect those with many shared genes, node size shows number of genes in the pathway, and colors correspond to strength of enrichment. (H) Comparison of mRNA abundance levels of CDH1, CDKN2B, and PSIP1 in FMRE-mutated kidney tumors (Mut) and non-mutated tumors (Ref). These changes match our observations in FMRE-KO cell lines (E and F) and suggest associations with metastasis, senescence, and cell cycle control. (I) Proliferation curve of FMRE-KO cells (dark red) and WT cells (gray). Three biological replicates for FMRE-KO and WT cells were measured with 60 technical replicates each. Error bars show ±1 SD and Wilcoxon p values are shown.

Molecular Cell 77, 1–15, March 19, 2020 9

Please cite this article in press as: Zhu et al., Candidate Cancer Driver Mutations in Distal Regulatory Elements and Long-Range Chromatin Interaction Networks, Molecular Cell (2020), https://doi.org/10.1016/j.molcel.2019.12.027

et al., 2000; Yokoyama and Cleary, 2008). PSIP1 was downregulated in FMRE-mutated patient tumors and FMRE-KO cells (1.5fold, p = 0.048; and 1.7-fold, p = 1.1 3 103, FDR = 0.053; respectively) (Figure 4H) and was also validated using RTqPCR (Figure 4F). These results are limited by the low number of FMRE-mutated tumor transcriptomes available for analysis, and the relatively small statistical effects we measured. The matched mRNA changes suggest that our cellular model of FMRE deletion partially captures the gene-regulatory associations of the putative non-coding driver mutations in observed in patient tumors. We conducted a proliferation assay to measure the impact of the FMRE deletion on a hallmark cancer phenotype. Wild-type and FMRE-KO cells were compared at five time points over 9 days using three biological and 60 technical replicates for each group. FMRE-KO cells proliferated significantly slower compared to wild types at all time points (Wilcoxon p % 108, fold-change [FC] 0.73–0.92) except for the first time point at 24 h (p = 0.57, FC = 0.998) (Figure 4I). This assay provides functional evidence of the FMRE in regulating cell proliferation and supports our findings of cancer genes and pathways found in RNA-seq data. In particular, the predicted direct FMRE target CCNB1IP1 restricts cell cycle progression and causes reduced cancer cell proliferation upon gene knockdown (Singh et al., 2007; Toby et al., 2003). Alternatively, this reduced proliferation may result from an interplay of CDH1 or CDKN2B that are also deregulated in FMRE-mutated cells and patient tumors. It is tempting to speculate on the role of this FMRE in later tumor development, senescence, and metastasis, although more functional and mechanistic evidence is required. Taken together, these data provide preliminary functional evidence of one FMRE as a bona fide non-coding cancer driver. Integrative Pathway Enrichment Analysis Highlights Further FMREs at Known Cancer Genes Pathway and network analysis improves the sensitivity of cancer driver discovery by identifying less-frequently mutated genes through their interactions with key driver genes and pathways (Creixell et al., 2015; Reimand et al., 2019). We asked whether the entire list of CRMs was enriched in known cancer genes, biological processes of Gene Ontology, and molecular pathways of Reactome. Assuming that each gene would be regulated by multiple CRMs, we prioritized the genes by the integrated mutational significance of CRMs in their promoters, loop-associated enhancers, intronic regions, and adjacent genomic sequence, followed by analysis of enriched biological processes and pathways using the ActivePathways method (Paczkowska et al., 2018). Pathway analysis that integrated multiple CRMs per gene detected 29 significantly enriched GO terms with 452 genes (ActivePathways Q < 0.05) (Figure 5A). Cancer-related processes were enriched, such as cell division (Q = 0.015, 36 genes including CHEK2, MYC, SEPT9, NUMA1, ZBTB16), regulation of Ras protein signal transduction (Q = 0.021, 31 genes including ECT2L, RAF1, P2RY8, BCR), and embryo development (Q = 6.1 3 104, 96 genes including 18 COSMIC genes). Genes involved in actin cytoskeleton organization were also enriched (Q = 2.0 3 107, 73 genes including ARHGAP26, TPM4, 10 Molecular Cell 77, 1–15, March 19, 2020

LCP1, FOXP1, FLNA, SRC, TRIM27, BCR, JAK2, MYH9). The analysis selected 55 known cancer genes of the Cancer Gene Census database that were only apparent by integrative analysis of multiple CRMs and remained sub-significant in the driver analysis of CRMs (16 genes expected, p = 8.45 3 1016) (Figure 5B). Further, some genes were prioritized due to single CRMs (e.g., PTEN, PER1, NCOA2) and others due to a combination of proximal and distal CRMs (e.g., ARHGAP26, MYC, CDKN1B). Most identified pathways (15/29) were also only found through integration of proximal and distal CRMs and not by analyzing any single CRM category alone. These findings are in agreement with the PCAWG consensus analysis that identified additional pathway-implicated driver genes with non-coding mutations (Reyna et al., 2018). Thus, cancer genes and pathways may be modulated by infrequent non-coding mutations that affect groups of related CRMs and are challenging to detect individually. We also evaluated the co-occurrence of mutations in FMREs and CDS drivers to discover their genetic interactions. We tested pairs of 30 FMREs and 48 CDS driver genes and found 9/1,440 pairs where the co-occurrence of mutations in both regions was significantly higher than expected (Fisher’s exact test, FDR < 0.1) (Figure 5C). This affected eight known cancer genes (IDH1, NFE2L2, TP53, B2M, KDM5C, VHL, PIK3CA, ATM) and nine FMREs. Among known drivers, TERT promoter mutations significantly co-occurred with IDH1 CDS mutations in ten tumors (FDR = 0.0010, two tumors expected). Mutations in TP53 cooccurred with mutations in two FMREs, including the FMRE 6:27870674 that had distal chromatin interactions and mRNA abundance associations with the TF ZKSCAN3. Mutation cooccurrence suggests the existence of cooperative interactions of non-coding mutations and established cancer genes. In summary, these analyses outline a wider landscape of infrequent non-coding mutations in regulatory elements that are involved in cancer genes and pathways, yet remain undetected in current WGS datasets due to limited power. DISCUSSION We report frequently mutated regulatory elements as putative non-coding drivers apparent in a heterogeneous pan-cancer dataset. We hypothesize that mutations in FMREs cause deregulation of oncogenes and tumor suppressors by altering cis-regulatory and chromatin interactions. We present an integrative analysis of genomic, epigenomic, and transcriptomic datasets to find such regions and functional validation for one distal FMRE. Pathway analysis indicates numerous sub-significant FMREs at known cancer genes and pathways, underlining the potential for further discoveries in large WGS datasets. FMREs were found as pan-cancer drivers and these indeed are active regulatory elements across healthy and tumor tissues. However, fewer FMREs were found compared to protein-coding driver genes. There are several potential explanations. First, genes are regulated through multiple CRMs whose individual mutation frequency is therefore likely lower. Second, non-coding mutations may be functional only in the presence of other driver mutations. Third, the pan-cancer cohort we analyzed is heterogeneous and biased toward certain cancer types. Fourth, most

Actin filament organization Hormone-mediated signalling pathway

Response to stress & oxygen

Cell morphogenesis & differentiation

Inhibition of transcription

Hippo signalling

Protein membrane localization

FMRE evidence of pathways Intronic

Chromatin interaction

Combined

** .

* * * * * * * * * * * * * * * ** ** ** ** ** ** ** ** ** ***************

** ** * *

* ****** *

** . ** *

. ** . * * .

*

**

.

** *

* *** * ****** * *** .

*

** * ** * . ** *** * *

*

** **

* .

*

*

** * ** .* *

Embryo & CNS development

* *

Transmembrane tyrosine kinase signalling

RAS & GTPase signalling

TERT CCNB1IP1 ARHGAP26 MDM4 CHEK2 FCGR2B RUNX1 SS18L1 LCK NIN ECT2L FAT4 NCOA2 CDKN1B PER1 CREB3L1 MYC TCF12 PTEN DDR2 TPM4 ITK SEPT9 LCP1 NUP214 CTNND1 FOXP1 SUFU NUMA1 FLNA ELL RHOH ZBTB16 SRC DNMT3A ANK1 RAF1 P2RY8 TRIM27 ACVR2A HMGA1 EPAS1 ACSL3 BCR RPL10 JAK2 MYH9 TFEB RARA NCOA1 HOXA11 NFATC2 AFF3 CHST11 ERBB2

C

5:1295292 & IDH1 FDR=0.001 81

. P<0.1 * P<0.05 ** P<0.01 *** P<0.001

4

6

20

25

9 4

8:133204434 & B2M FDR=0.043

TP53 & 6:27870674 FDR=0.043 8

601

12

10

19

3

1:16940092 & KDM5C FDR=0.053

29

TP53 & 4:173751654 FDR=0.053

26

609

VHL & 2:10237806 FDR=0.071 78

2 11

5

PIK3CA & 1:161500602 FDR=0.081

10

143

5

12 7

ATM & 15:45493088 FDR=0.099 47

−log10 P, ActiveDriverWGS 2

NFE2L2 & 4:190660002 FDR=0.023

10

di re . c * * . ** * . ** in t * tro . . ** . . * * * * * * * * * * n * * lo * * o ad pe * * d * ja ce . ** . * ** * nt . ** ** in * * * te gr at ed . . . . . . . . . . . . . . . . . . * * * * * * * *

B

Known cancer genes in FMRE-enriched pathways

A

* ***

Please cite this article in press as: Zhu et al., Candidate Cancer Driver Mutations in Distal Regulatory Elements and Long-Range Chromatin Interaction Networks, Molecular Cell (2020), https://doi.org/10.1016/j.molcel.2019.12.027

>8

21 5

FMRE CDS driver

Figure 5. Integrative Pathway Analysis Highlights Further Non-coding Mutations at Known Cancer Genes (A) Enrichment map shows pathways and processes detected through an integrative analysis of multiple CRMs per gene (ActivePathways, FDR < 0.05). Nodes corresponding to pathways are colored by the location of CRMs. Orange nodes indicate pathways that were detected by combining multiple types of CRMs. (B) Known cancer genes of the Cancer Gene Census identified in the pathway analysis of CRMs. Up to four CRMs are highlighted for every gene (direct, intronic, looped, and adjacent), and colored based on the unadjusted p value from the driver analysis. Genes are ranked by integrated significance of the four classes of CRMs. Cancer genes shown in boldface letters were also identified in the ActiveDriverWGS analysis of individual CRMs. (C) Significant co-occurrence of mutations in FMREs and protein-coding driver genes. Venn diagrams show pairs of FMREs and CDS drivers with unexpectedly high number of tumors with both elements mutated (Fishers exact test, FDR < 0.1), highlighting potentially synergistic genetic interactions. Counts of affected tumors are indicated.

regulatory elements are thought to be tissue specific, but only the pan-cancer WGS dataset was powered to detect FMREs and tissue-specific CRMs likely remain underrepresented. Associating FMRE mutations and mRNA expression was even less powered due to limited transcriptomics data. Fifth, FMRE discovery was subject to stringent multiple-testing correction as CRM regions were 6-fold more abundant than protein-coding genes. Lastly, non-coding driver mutations may be less frequent overall as these regulate processes later in tumor development where the tumors have higher levels of genetic heterogeneity. Our findings should be interpreted with caution as the somatic mutation landscape has complex associations with transcription, chromatin state, and gene regulation that may also explain the apparent positive selection at FMREs. For example, chromatin state explains variation in megabase-scale mutation rates, while active regulatory elements have been associated with both increased and reduced mutation rates (Frigola et al., 2017; Gonzalez-Perez et al., 2019; Hnisz et al., 2016; Katainen et al., 2015; €bler et al., 2019; Lawrence et al., 2013; Polak et al., 2014, Ku 2015; Reijns et al., 2015; Sabarinathan et al., 2016; Schuster-

Bo¨ckler and Lehner, 2012). To account for major genome-wide covariates of mutation rates, ActiveDriverWGS compares each regulatory element with its adjacent flanking sequence of at least 100 kbps and thus provides increased resolution compared to megabase-scale estimates. We also confirmed the increased mutation rate of FMREs relative to CRMs with similar length and transcription status. These analyses suggest that the mutation enrichment in FMREs exceeds a baseline level expected from regional and genome-wide covariates. Further computational and experimental work is required to establish FMREs as bona fide cancer drivers. As proof of concept, we functionally validated one FMRE in human cells. Genomic deletion of this distal CRM caused transcriptional repression of the predicted target CCNB1IP1, deregulation of additional cancer pathways and genes (e.g., CDH1, CDKN2B), and a cell proliferation defect. Interestingly, important genes involved in cell cycle control, senescence, and metastasis were transcriptionally altered both in FMRE-deleted cell lines and FMRE-mutated patient tumors. We speculate that the FMRE is involved in the transcriptional regulation of late tumor

Molecular Cell 77, 1–15, March 19, 2020 11

Please cite this article in press as: Zhu et al., Candidate Cancer Driver Mutations in Distal Regulatory Elements and Long-Range Chromatin Interaction Networks, Molecular Cell (2020), https://doi.org/10.1016/j.molcel.2019.12.027

development in processes such as invasion and metastasis. However, our findings should be interpreted with caution as these come with some important caveats. First, deletion of the entire FMRE in cell lines is only a limited model of the SNVs and indels observed in tumor genomes. Second, the HEK293 cells we use were originally derived from an embryonic kidney and therefore do not represent typical cells found in renal cell carcinoma of the kidney where the putative driver mutations and mRNA associations were identified. Further work is needed to decipher the impact of these non-coding mutations in diverse cellular and animal models of cancer. Integration of cancer genome variation with cis-regulatory networks, long-range chromatin interactions, and transcriptomic data is a powerful approach for highlighting potentially functional non-coding mutations. Larger WGS datasets of specific cancer types coupled with transcriptomes are required to find infrequent mutations in the vast non-coding genome and to outline their gene-regulatory functions. As cis-regulatory elements are largely specific to tissues and tumor types, detailed epigenomic profiles of tumors are also needed. High-resolution datasets of chromatin architecture are required to identify target genes of distal regulatory elements that are challenging to map currently. WGS datasets with comprehensive clinical information are required to find therapy-specific drivers and biomarkers in the non-coding genome. As gene regulatory networks are complex and redundant, pathway and network analysis of CRM combinations is likely to increase sensitivity. In conclusion, our study suggests that rare non-coding mutations in cancer genomes provide an underappreciated contribution to the hallmarks of cancer through cis-regulatory and chromatin architectural mechanisms. STAR+METHODS Detailed methods are provided in the online version of this paper and include the following: d d d

d

d

KEY RESOURCES TABLE LEAD CONTACT AND MATERIALS AVAILABILITY EXPERIMENTAL MODEL AND SUBJECT DETAILS B Human subjects B Cell lines METHOD DETAILS B The ActiveDriverWGS method B Benchmarking of ActiveDriverWGS QUANTIFICATION AND STATISTICAL ANALYSIS B Genomic regions B Driver analysis of coding and non-coding regions using ActiveDriverWGS B Additional driver discovery methods for CRM analysis B Validating FMREs in additional cancer whole genome sequences B Chromatin states and long-range chromatin interactions of FMREs B Somatic mutations and structural variants in FMREs B Evaluating the enrichment of annotations and mutations in FMREs B Predicting target genes of FMREs and associating with mRNA abundance

12 Molecular Cell 77, 1–15, March 19, 2020

B

d

Pathway enrichment analysis of CRMs using ActivePathways B Analyzing mutation co-occurrence in FMREs and CDS drivers B Individual vetting of mutations in highlighted FMREs B RNA extraction and reverse transcription B Quantifying transcript abundance of FMRE target genes using RT-qPCR B RNA-seq and data analysis of FMRE-KO cells B Cellular proliferation assay DATA AND CODE AVAILABILITY

SUPPLEMENTAL INFORMATION Supplemental Information can be found online at https://doi.org/10.1016/j. molcel.2019.12.027. ACKNOWLEDGMENTS We are grateful to Drs. Federico Abascal, Gary Bader, Robert Bristow, Peter Campbell, Sean Egan, Gad Getz, Nuria Lopez-Bigas, Inigo Martincorena, Jakob Pedersen, Esther Rheinbay, and Josh Stuart for comments on the manuscript, Inigo Martincorena, Federico Abascal, and Loris Mularoni for contributing data, Carolyn Ptak and the OICR Genomics facility team for RNA-sequencing experiments, and members of the PCAWG Drivers and Functional Interpretation Working Group for discussions. We acknowledge Life Science Editors for editorial assistance. We acknowledge the contributions of the many clinical networks across ICGC and TCGA who provided samples and data to the PCAWG Consortium and the contributions of the PCAWG Technical Working Group and the PCAWG Germline Working Group for collation, realignment, and harmonized variant calling of the cancer genomes used in this study. We thank the patients and their families for their participation in the ICGC and TCGA projects. This research was partially funded by a Canadian Institutes of Health Research (CIHR) Project Grant to J.R., Ontario Institute for Cancer Research (OICR) Investigator Awards to J.R. and P.C.B., Operating Grants from Cancer Research Society (CRS) and Isaiah 40:31 Memorial Fund to J.R. and M.D.W. (#21089, #21428, #21311), Natural Sciences and Engineering Research Council of Canada (NSERC) Discovery Grant to J.R. (#RGPIN-2016-06485), the OICR Brain Tumor Translational Research Initiative funding to J.R., the OICR Biostatistics Training Initiative student fellowships to L.R. and Y.L., CIHR Canadian Graduate Scholarship to H.Z., MBP Excellence Award to K.I., and The Estonian Research Council (PUTJD145) fellowship to L.U.-R. P.C.B. was supported by a CIHR New Investigator Award. M.D.W. was partially supported by an Early Researcher Award from the Ontario Ministry of Research and Innovation and a Tier 2 Canada Research Chair from CIHR. A.A. and M. Liang were partially supported by a NSERC grant (RGPIN-2019-07041) to M.D.W. Funding from the OICR is provided by the Government of Ontario. AUTHOR CONTRIBUTIONS H.Z., L.U.-R., K.I., L.W., S.S., V.H., D.A.-N., M.P., D.A.-R., O.O., M. Liang, Y.L., L.R., M.K., I.D., and J.R. analyzed and visualized the data. L.U.-R. and A.A. performed the validation experiments. L.U.-R. designed and led the validation experiments. H.Z. and J.D.T. developed the software package for driver analysis. J.T.S., M. Lupien, L.D.S., P.C.B., and M.D.W. contributed discussions, methods, reagents and expertise. J.R. supervised the study, led the computational analysis and design of the computational methods. J.R. led writing of the manuscript with significant contributions from L.U.-R., M.D.W., P.C.B., K.I., H.Z., M. Lupien, and L.D.S. All authors reviewed and provided input to the manuscript. DECLARATION OF INTEREST The authors declare no competing interests.

Please cite this article in press as: Zhu et al., Candidate Cancer Driver Mutations in Distal Regulatory Elements and Long-Range Chromatin Interaction Networks, Molecular Cell (2020), https://doi.org/10.1016/j.molcel.2019.12.027

Received: January 2, 2019 Revised: June 4, 2019 Accepted: December 24, 2019 Published: January 17, 2020 REFERENCES Ahuja, H.G., Hong, J., Aplan, P.D., Tcheurekdjian, L., Forman, S.J., and Slovak, M.L. (2000). t(9;11)(p22;p15) in acute myeloid leukemia results in a fusion between NUP98 and the gene encoding transcriptional coactivators p52 and p75-lens epithelium-derived growth factor (LEDGF). Cancer Res. 60, 6227–6229. Alexandrov, L.B., Kim, J., Haradhvala, N.J., Huang, M.N., Ng, A.W.T., Wu, Y., Boot, A., Covington, K.R., Gordenin, D.A., Bergstrom, E.N., et al. (2019). The Repertoire of Mutational Signatures in Human Cancer. bioRxiv. https://doi. org/10.1101/322859. Alexandrov, L.B., Nik-Zainal, S., Wedge, D.C., Aparicio, S.A., Behjati, S., Biankin, A.V., Bignell, G.R., Bolli, N., Borg, A., Børresen-Dale, A.L., et al.; Australian Pancreatic Cancer Genome Initiative; ICGC Breast Cancer Consortium; ICGC MMML-Seq Consortium; ICGC PedBrain (2013). Signatures of mutational processes in human cancer. Nature 500, 415–421. Bailey, S.D., Zhang, X., Desai, K., Aid, M., Corradin, O., Cowper-Sal Lari, R., Akhtar-Zaidi, B., Scacheri, P.C., Haibe-Kains, B., and Lupien, M. (2015). ZNF143 provides sequence specificity to secure chromatin interactions at gene promoters. Nat. Commun. 2, 6186. Bailey, M.H., Tokheim, C., Porta-Pardo, E., Sengupta, S., Bertrand, D., Weerasinghe, A., Colaprico, A., Wendl, M.C., Kim, J., Reardon, B., et al.; MC3 Working Group; Cancer Genome Atlas Research Network (2018). Comprehensive Characterization of Cancer Driver Genes and Mutations. Cell 174, 1034–1035. Bell, R.J., Rube, H.T., Kreig, A., Mancini, A., Fouse, S.D., Nagarajan, R.P., Choi, S., Hong, C., He, D., Pekmezci, M., et al. (2015). Cancer. The transcription factor GABP selectively binds and activates the mutant TERT promoter in cancer. Science 348, 1036–1039. Calabrese, C., Davidson, N.R., Fonseca, N.A., He, Y., Kahles, A., Lehmann, K.-V., Liu, F., Shiraishi, Y., Soulette, C.M., et al.; PCAWG Transcriptome Core Group (2018). Genomic basis of RNA alterations revealed by wholegenome analyses of 27 cancer types. bioRxiv. https://doi.org/10.1101/ 183889. Cancer Genome Atlas Research Network, Weinstein, J.N., Collisson, E.A., Mills, G.B., Shaw, K.R., Ozenberger, B.A., Ellrott, K., Shmulevich, I., Sander, C., and Stuart, J.M. (2013). The Cancer Genome Atlas Pan-Cancer analysis project. Nat. Genet. 45, 1113–1120. Campbell, P.J., Getz, G., Stuart, J.M., Korbel, J.O., and Stein, L.D. (2017). Pancancer analysis of whole genomes. biorXiv. https://doi.org/10.1101/162784. Canela, A., Maman, Y., Jung, S., Wong, N., Callen, E., Day, A., Kieffer-Kwon, K.R., Pekowska, A., Zhang, H., Rao, S.S.P., et al. (2017). Genome Organization Drives Chromosome Fragility. Cell 170, 507–521.e18. Chauhan, S., Goodwin, J.G., Chauhan, S., Manyam, G., Wang, J., Kamat, A.M., and Boyd, D.D. (2013). ZKSCAN3 is a master transcriptional repressor of autophagy. Mol. Cell 50, 16–28. Chen, H., and Boutros, P.C. (2011). VennDiagram: a package for the generation of highly-customizable Venn and Euler diagrams in R. BMC Bioinformatics 12, 35. Ciriello, G., Miller, M.L., Aksoy, B.A., Senbabaoglu, Y., Schultz, N., and Sander, C. (2013). Emerging landscape of oncogenic signatures across human cancers. Nat. Genet. 45, 1127–1133. Corces, M.R., Granja, J.M., Shams, S., Louie, B.H., Seoane, J.A., Zhou, W., Silva, T.C., Groeneveld, C., Wong, C.K., Cho, S.W., et al.; Cancer Genome Atlas Analysis Network (2018). The chromatin accessibility landscape of primary human cancers. Science 362, 362. Creixell, P., Reimand, J., Haider, S., Wu, G., Shibata, T., Vazquez, M., Mustonen, V., Gonzalez-Perez, A., Pearson, J., Sander, C., et al.; Mutation Consequences and Pathway Analysis Working Group of the International

Cancer Genome Consortium (2015). Pathway and network analysis of cancer genomes. Nat. Methods 12, 615–621. Delattre, O., Zucman, J., Plougastel, B., Desmaze, C., Melot, T., Peter, M., Kovar, H., Joubert, I., de Jong, P., Rouleau, G., et al. (1992). Gene fusion with an ETS DNA-binding domain caused by chromosome translocation in human tumours. Nature 359, 162–165. Eisenhardt, A.E., Sprenger, A., Ro¨ring, M., Herr, R., Weinberg, F., Ko¨hler, M., Braun, S., Orth, J., Diedrich, B., Lanner, U., et al. (2016). Phospho-proteomic analyses of B-Raf protein complexes reveal new regulatory principles. Oncotarget 7, 26628–26652. ENCODE Project Consortium (2012). An integrated encyclopedia of DNA elements in the human genome. Nature 489, 57–74. Fredriksson, N.J., Ny, L., Nilsson, J.A., and Larsson, E. (2014). Systematic analysis of noncoding somatic mutations and gene expression alterations across 14 tumor types. Nat. Genet. 46, 1258–1263. Frigola, J., Sabarinathan, R., Mularoni, L., Muin˜os, F., Gonzalez-Perez, A., and Lo´pez-Bigas, N. (2017). Reduced mutation rate in exons due to differential mismatch repair. Nat. Genet. 49, 1684–1692. Fu, Y., Liu, Z., Lou, S., Bedford, J., Mu, X.J., Yip, K.Y., Khurana, E., and Gerstein, M. (2014). FunSeq2: a framework for prioritizing noncoding regulatory variants in cancer. Genome Biol. 15, 480. Fujimoto, A., Furuta, M., Totoki, Y., Tsunoda, T., Kato, M., Shiraishi, Y., Tanaka, H., Taniguchi, H., Kawakami, Y., Ueno, M., et al. (2016). Wholegenome mutational landscape and characterization of noncoding and structural mutations in liver cancer. Nat. Genet. 48, 500–509. Fungtammasan, A., Walsh, E., Chiaromonte, F., Eckert, K.A., and Makova, K.D. (2012). A genome-wide analysis of common fragile sites: what features determine chromosomal instability in the human genome? Genome Res. 22, 993–1005. Futreal, P.A., Coin, L., Marshall, M., Down, T., Hubbard, T., Wooster, R., Rahman, N., and Stratton, M.R. (2004). A census of human cancer genes. Nat. Rev. Cancer 4, 177–183. Gerstein, M.B., Lu, Z.J., Van Nostrand, E.L., Cheng, C., Arshinoff, B.I., Liu, T., Yip, K.Y., Robilotto, R., Rechtsteiner, A., Ikegami, K., et al.; modENCODE Consortium (2010). Integrative analysis of the Caenorhabditis elegans genome by the modENCODE project. Science 330, 1775–1787. Gerstung, M., Jolly, C., Leshchiner, I., Dentro, S.C., Gonzalez, S., Rosebrock, D., Mitchell, T.J., Rubanova, Y., Anur, P., Yu, K., et al. (2018). The evolutionary history of 2,658 cancers. bioRxiv. https://doi.org/10.1101/161562. Gonzalez-Perez, A., Mustonen, V., Reva, B., Ritchie, G.R., Creixell, P., Karchin, R., Vazquez, M., Fink, J.L., Kassahn, K.S., Pearson, J.V., et al.; International Cancer Genome Consortium Mutation Pathways and Consequences Subgroup of the Bioinformatics Analyses Working Group (2013). Computational approaches to identify functional genetic variants in cancer genomes. Nat. Methods 10, 723–729. Gonzalez-Perez, A., Sabarinathan, R., and Lopez-Bigas, N. (2019). Local Determinants of the Mutational Landscape of the Human Genome. Cell 177, 101–114. Hanahan, D., and Weinberg, R.A. (2011). Hallmarks of cancer: the next generation. Cell 144, 646–674. Hannon, G.J., and Beach, D. (1994). p15INK4B is a potential effector of TGFbeta-induced cell cycle arrest. Nature 371, 257–261. Hnisz, D., Weintraub, A.S., Day, D.S., Valton, A.L., Bak, R.O., Li, C.H., Goldmann, J., Lajoie, B.R., Fan, Z.P., Sigova, A.A., et al. (2016). Activation of proto-oncogenes by disruption of chromosome neighborhoods. Science 351, 1454–1458. Horn, S., Figl, A., Rachakonda, P.S., Fischer, C., Sucker, A., Gast, A., Kadel, S., Moll, I., Nagore, E., Hemminki, K., et al. (2013). TERT promoter mutations in familial and sporadic melanoma. Science 339, 959–961. Hu, Y., Yan, C., Hsu, C.H., Chen, Q.R., Niu, K., Komatsoulis, G.A., and Meerzaman, D. (2014). OmicCircos: A Simple-to-Use R Package for the Circular Visualization of Multidimensional Omics Data. Cancer Inform. 13, 13–20.

Molecular Cell 77, 1–15, March 19, 2020 13

Please cite this article in press as: Zhu et al., Candidate Cancer Driver Mutations in Distal Regulatory Elements and Long-Range Chromatin Interaction Networks, Molecular Cell (2020), https://doi.org/10.1016/j.molcel.2019.12.027

Huang, F.W., Hodis, E., Xu, M.J., Kryukov, G.V., Chin, L., and Garraway, L.A. (2013). Highly recurrent TERT promoter mutations in human melanoma. Science 339, 957–959.

Discovery and prioritization of somatic mutations in diffuse large B-cell lymphoma (DLBCL) by whole-exome sequencing. Proc. Natl. Acad. Sci. USA 109, 3879–3884.

Huang, Y.F., Gulko, B., and Siepel, A. (2017). Fast, scalable prediction of deleterious noncoding variants from functional and population genomic data. Nat. Genet. 49, 618–624.

Mansour, M.R., Abraham, B.J., Anders, L., Berezovskaya, A., Gutierrez, A., Durbin, A.D., Etchin, J., Lawton, L., Sallan, S.E., Silverman, L.B., et al. (2014). Oncogene regulation. An oncogenic super-enhancer formed through somatic mutation of a noncoding intergenic element. Science 346, 1373–1377.

Hudson, T.J., Anderson, W., Artez, A., Barker, A.D., Bell, C., Bernabe´, R.R., Bhan, M.K., Calvo, F., Eerola, I., Gerhard, D.S., et al.; International Cancer Genome Consortium (2010). International network of cancer genome projects. Nature 464, 993–998.  Juul, M., Bertl, J., Guo, Q., Nielsen, M.M., Switnicki, M., Hornshøj, H., Madsen, T., Hobolth, A., and Pedersen, J.S. (2017). Non-coding cancer driver candidates identified with a sample- and position-specific model of the somatic mutation rate. eLife 6, 6. Kaiser, V.B., Taylor, M.S., and Semple, C.A. (2016). Mutational Biases Drive Elevated Rates of Substitution at Regulatory Sites across Cancer Types. PLoS Genet. 12, e1006207. €nen, E., Palin, K., Kivioja, T., Va €lima €ki, N., Gylfe, Katainen, R., Dave, K., Pitka €nninen, U.A., Cajuso, T., et al. (2015). CTCF/coheA.E., Ristolainen, H., Ha sin-binding sites are frequently mutated in cancer. Nat. Genet. 47, 818–821. Khurana, E., Fu, Y., Colonna, V., Mu, X.J., Kang, H.M., Lappalainen, T., Sboner, A., Lochovsky, L., Chen, J., Harmanci, A., et al.; 1000 Genomes Project Consortium (2013). Integrative annotation of variants from 1092 humans: application to cancer genomics. Science 342, 1235587. Khurana, E., Fu, Y., Chakravarty, D., Demichelis, F., Rubin, M.A., and Gerstein, M. (2016). Role of non-coding sequence variants in cancer. Nat. Rev. Genet. 17, 93–108. Ko¨hler, M., Ro¨ring, M., Schorch, B., Heilmann, K., Stickel, N., Fiala, G.J., Schmitt, L.C., Braun, S., Ehrenfeld, S., Uhl, F.M., et al. (2016). Activation loop phosphorylation regulates B-Raf in vivo and transformation by B-Raf mutants. EMBO J. 35, 143–161. Krassowski, M., Paczkowska, M., Cullion, K., Huang, T., Dzneladze, I., Ouellette, B.F.F., Yamada, J.T., Fradet-Turcotte, A., and Reimand, J. (2018). ActiveDriverDB: human disease mutations and genome variation in posttranslational modification sites of proteins. Nucleic Acids Res. 46, D901–D910. , R., Haradhvala, N.J., Ha, K., Kim, J., Kuzman, M., Jiao, W., €bler, K., Karlic Ku Gakkhar, S., Mouw, K.W., Braunstein, L.Z., et al. (2019). Tumor mutational landscape is a record of the pre-malignant state. bioRxiv. https://doi.org/10. 1101/517565.

Martincorena, I., Raine, K.M., Gerstung, M., Dawson, K.J., Haase, K., Van Loo, P., Davies, H., Stratton, M.R., and Campbell, P.J. (2017). Universal Patterns of Selection in Cancer and Somatic Tissues. Cell 171, 1029–1041.e21. Melton, C., Reuter, J.A., Spacek, D.V., and Snyder, M. (2015). Recurrent somatic mutations in regulatory regions of human cancer genomes. Nat. Genet. 47, 710–716. Moreau-Gachelin, F., Tavitian, A., and Tambourin, P. (1988). Spi-1 is a putative oncogene in virally induced murine erythroleukaemias. Nature 331, 277–280. Mularoni, L., Sabarinathan, R., Deu-Pons, J., Gonzalez-Perez, A., and Lo´pezBigas, N. (2016). OncodriveFML: a general framework to identify coding and non-coding regions with cancer driver mutations. Genome Biol. 17, 128. Nik-Zainal, S., Davies, H., Staaf, J., Ramakrishna, M., Glodzik, D., Zou, X., Martincorena, I., Alexandrov, L.B., Martin, S., Wedge, D.C., et al. (2016). Landscape of somatic mutations in 560 breast cancer whole-genome sequences. Nature 534, 47–54. Polak, P., Lawrence, M.S., Haugen, E., Stoletzki, N., Stojanov, P., Thurman, R.E., Garraway, L.A., Mirkin, S., Getz, G., Stamatoyannopoulos, J.A., and Sunyaev, S.R. (2014). Reduced local mutation density in regulatory DNA of cancer genomes is linked to DNA repair. Nat. Biotechnol. 32, 71–75. Paczkowska, M., Barenboim, J., Sintupisut, N., Fox, N.C., Zhu, H., AbdRabbo, D., Boutros, P.C., and Reimand, J.; PCAWG Network and Pathway Analysis Group (2018). Integrative pathway enrichment analysis of multivariate omics data. bioRxiv. https://doi.org/10.1101/399113. , R., Koren, A., Thurman, R., Sandstrom, R., Lawrence, M., Polak, P., Karlic ek, K., Stamatoyannopoulos, J.A., and Reynolds, A., Rynes, E., Vlahovic Sunyaev, S.R. (2015). Cell-of-origin chromatin organization shapes the mutational landscape of cancer. Nature 518, 360–364. Puente, X.S., Bea`, S., Valde´s-Mas, R., Villamor, N., Gutie´rrez-Abril, J., Martı´nSubero, J.I., Munar, M., Rubio-Pe´rez, C., Jares, P., Aymerich, M., et al. (2015). Non-coding recurrent mutations in chronic lymphocytic leukaemia. Nature 526, 519–524.

Kvon, E.Z., Stampfel, G., Ya´n˜ez-Cuna, J.O., Dickson, B.J., and Stark, A. (2012). HOT regions function as patterned developmental enhancers and have a distinct cis-regulatory signature. Genes Dev. 26, 908–913.

Qiao, H., Prasada Rao, H.B., Yang, Y., Fong, J.H., Cloutier, J.M., Deacon, D.C., Nagel, K.E., Swartz, R.K., Strong, E., Holloway, J.K., et al. (2014). Antagonistic roles of ubiquitin ligase HEI10 and SUMO ligase RNF212 regulate meiotic recombination. Nat. Genet. 46, 194–199.

Lanzo´s, A., Carlevaro-Fita, J., Mularoni, L., Reverter, F., Palumbo, E., Guigo´, R., and Johnson, R. (2017). Discovery of Cancer Driver Long Noncoding RNAs across 1112 Tumour Genomes: New Candidates and Distinguishing Features. Sci. Rep. 7, 41544.

Rao, S.S., Huntley, M.H., Durand, N.C., Stamenova, E.K., Bochkov, I.D., Robinson, J.T., Sanborn, A.L., Machol, I., Omer, A.D., Lander, E.S., and Aiden, E.L. (2014). A 3D map of the human genome at kilobase resolution reveals principles of chromatin looping. Cell 159, 1665–1680.

Law, C.W., Chen, Y., Shi, W., and Smyth, G.K. (2014). voom: Precision weights unlock linear model analysis tools for RNA-seq read counts. Genome Biol. 15, R29.

Reijns, M.A.M., Kemp, H., Ding, J., de Proce´, S.M., Jackson, A.P., and Taylor, M.S. (2015). Lagging-strand replication shapes the mutational landscape of the genome. Nature 518, 502–506.

Lawrence, M.S., Stojanov, P., Polak, P., Kryukov, G.V., Cibulskis, K., Sivachenko, A., Carter, S.L., Stewart, C., Mermel, C.H., Roberts, S.A., et al. (2013). Mutational heterogeneity in cancer and the search for new cancerassociated genes. Nature 499, 214–218.

Reimand, J., and Bader, G.D. (2013). Systematic analysis of somatic mutations in phosphorylation signaling predicts novel cancer drivers. Mol. Syst. Biol. 9, 637.

Li, Y., Roberts, N.D., Weischenfeldt, J., Wala, J.A., Shapira, O., Schumacher, S.E., Khurana, E., Korbel, J., Imielinski, M., Beroukhim, R., and Campbell, P.J.; PCAWG Network (2017). Patterns of structural variation in human cancer. bioRxiv. https://doi.org/10.1101/181339. Liao, D. (2009). Emerging roles of the EBF family of transcription factors in tumor suppression. Mol. Cancer Res. 7, 1893–1901. Liao, Y., Smyth, G.K., and Shi, W. (2013). The Subread aligner: fast, accurate and scalable read mapping by seed-and-vote. Nucleic Acids Res. 41, e108. Lohr, J.G., Stojanov, P., Lawrence, M.S., Auclair, D., Chapuy, B., Sougnez, C., Cruz-Gordillo, P., Knoechel, B., Asmann, Y.W., Slager, S.L., et al. (2012).

14 Molecular Cell 77, 1–15, March 19, 2020

Reimand, J., Kull, M., Peterson, H., Hansen, J., and Vilo, J. (2007). g:Profiler–a web-based toolset for functional profiling of gene lists from large-scale experiments. Nucleic Acids Res. 35, W193–200. Reimand, J., Isserlin, R., Voisin, V., Kucera, M., Tannus-Lopes, C., Rostamianfar, A., Wadi, L., Meyer, M., Wong, J., Xu, C., et al. (2019). Pathway enrichment analysis and visualization of omics data using g:Profiler, GSEA, Cytoscape and EnrichmentMap. Nat. Protoc. 14, 482–517. Reyna, M.A., Haan, D., Paczkowska, M., Verbeke, L.P.C., Vazquez, M., Kahraman, A., Pulido-Tamayo, S., Barenboim, J., Wadi, L., Dhingra, P., et al. (2018). Pathway and network analysis of more than 2,500 whole cancer genomes. biorXiv. https://doi.org/10.1101/385294.

Please cite this article in press as: Zhu et al., Candidate Cancer Driver Mutations in Distal Regulatory Elements and Long-Range Chromatin Interaction Networks, Molecular Cell (2020), https://doi.org/10.1016/j.molcel.2019.12.027

Roadmap Epigenomics Consortium,, Kundaje, A., Meuleman, W., Ernst, J., Bilenky, M., Yen, A., Heravi-Moussavi, A., Kheradpour, P., Zhang, Z., Wang, J., Ziller, M.J., et al. (2015). Integrative analysis of 111 reference human epigenomes. Nature 518, 317–330.

€la-Reimand, L., Hou, H., Samavarchi-Tehrani, P., Rudan, M.V., Liang, Uusku M., Medina-Rivera, A., Mohammed, H., Schmidt, D., Schwalie, P., Young, E.J., et al. (2016). Topoisomerase II beta interacts with cohesin and CTCF at topological domain borders. Genome Biol. 17, 182.

Rheinbay, E., Nielsen, M.M., Abascal, F., Tiao, G., Hornshøj, H., Hess, J.M., Pedersen, R.I., Feuerbach, L., Sabarinathan, R., Madsen, T., et al. (2017). Discovery and characterization of coding and non-coding driver mutations in more than 2,500 whole cancer genomes. biorXiv. https://doi.org/10.1101/ 237313.

Visser, M., Kayser, M., and Palstra, R.J. (2012). HERC2 rs12913832 modulates human pigmentation by attenuating chromatin-loop formation between a longrange enhancer and the OCA2 promoter. Genome Res. 22, 446–455.

Sabarinathan, R., Mularoni, L., Deu-Pons, J., Gonzalez-Perez, A., and Lo´pezBigas, N. (2016). Nucleotide excision repair is impaired by binding of transcription factors to DNA. Nature 532, 264–267. Schuster-Bo¨ckler, B., and Lehner, B. (2012). Chromatin organization is a major influence on regional mutation rates in human cancer cells. Nature 488, 504–507. Shuai, S., Gallinger, S., and Stein, L.; PCAWG Drivers and Functional Interpretation Group and the ICGC/TCGA Pan-Cancer Analysis of Whole Genomes Network (2017). DriverPower: Combined burden and functional impact tests for cancer driver discovery. biorXiv. https://doi.org/10.1101/ 215244. Singh, M.K., Nicolas, E., Gherraby, W., Dadke, D., Lessin, S., and Golemis, E.A. (2007). HEI10 negatively regulates cell invasion by inhibiting cyclin B/Cdk1 and other promotility proteins. Oncogene 26, 4825–4832. Tamura, T., Yanai, H., Savitsky, D., and Taniguchi, T. (2008). The IRF family transcription factors in immunity and oncogenesis. Annu. Rev. Immunol. 26, 535–584. Thiagalingam, A., De Bustros, A., Borges, M., Jasti, R., Compton, D., Diamond, L., Mabry, M., Ball, D.W., Baylin, S.B., and Nelkin, B.D. (1996). RREB-1, a novel zinc finger protein, is involved in the differentiation response to Ras in human medullary thyroid carcinomas. Mol. Cell. Biol. 16, 5335–5345. Toby, G.G., Gherraby, W., Coleman, T.R., and Golemis, E.A. (2003). A novel RING finger protein, human enhancer of invasion 10, alters mitotic progression through regulation of cyclin B levels. Mol. Cell. Biol. 23, 2109–2122. Tomlins, S.A., Rhodes, D.R., Perner, S., Dhanasekaran, S.M., Mehra, R., Sun, X.W., Varambally, S., Cao, X., Tchinda, J., Kuefer, R., et al. (2005). Recurrent fusion of TMPRSS2 and ETS transcription factor genes in prostate cancer. Science 310, 644–648.

Vleminckx, K., Vakaet, L., Jr., Mareel, M., Fiers, W., and van Roy, F. (1991). Genetic manipulation of E-cadherin expression by epithelial tumor cells reveals an invasion suppressor role. Cell 66, 107–119. Vogelstein, B., Papadopoulos, N., Velculescu, V.E., Zhou, S., Diaz, L.A., Jr., and Kinzler, K.W. (2013). Cancer genome landscapes. Science 339, 1546–1558. Wagih, O. (2017). ggseqlogo: a versatile R package for drawing sequence logos. Bioinformatics 33, 3645–3647. Wagih, O., Reimand, J., and Bader, G.D. (2015). MIMP: predicting the impact of mutations on kinase-substrate phosphorylation. Nat. Methods 12, 531–533. Weinhold, N., Jacobsen, A., Schultz, N., Sander, C., and Lee, W. (2014). Genome-wide analysis of noncoding regulatory mutations in cancer. Nat. Genet. 46, 1160–1165. Weintraub, A.S., Li, C.H., Zamudio, A.V., Sigova, A.A., Hannett, N.M., Day, D.S., Abraham, B.J., Cohen, M.A., Nabet, B., Buckley, D.L., et al. (2017). YY1 Is a Structural Regulator of Enhancer-Promoter Loops. Cell 171, 1573– 1588.e28. Yang, L., Wang, H., Kornblau, S.M., Graber, D.A., Zhang, N., Matthews, J.A., Wang, M., Weber, D.M., Thomas, S.K., Shah, J.J., et al. (2011). Evidence of a role for the novel zinc-finger transcription factor ZKSCAN3 in modulating Cyclin D2 expression in multiple myeloma. Oncogene 30, 1329–1340. €kela €, T.P. (2013). CCRK depletion inhibits glioblasYang, Y., Roine, N., and Ma toma cell proliferation in a cilium-dependent manner. EMBO Rep. 14, 741–747. Ye, J., Coulouris, G., Zaretskaya, I., Cutcutache, I., Rozen, S., and Madden, T.L. (2012). Primer-BLAST: a tool to design target-specific primers for polymerase chain reaction. BMC Bioinformatics 13, 134. Yokoyama, A., and Cleary, M.L. (2008). Menin critically links MLL proteins with LEDGF on cancer-associated target genes. Cancer Cell 14, 36–46.

Molecular Cell 77, 1–15, March 19, 2020 15

Please cite this article in press as: Zhu et al., Candidate Cancer Driver Mutations in Distal Regulatory Elements and Long-Range Chromatin Interaction Networks, Molecular Cell (2020), https://doi.org/10.1016/j.molcel.2019.12.027

STAR+METHODS KEY RESOURCES TABLE

REAGENT or RESOURCE

SOURCE

IDENTIFIER

Bacterial and Virus Strains pSpCas9(BB)-2A-GFP plasmid

Addgene

Cat#PX458

One Shot Stbl3 Chemically Competent E. coli

Thermo Fisher

Cat#C737303

Critical Commercial Assays Lipofectamine 3000

Thermo Fisher

Cat#L3000015

DNeasy Blood & Tissue Kit

QIAGEN

Cat#69506

RNeasy Plus Mini kit

QIAGEN

Cat#74136

QIAprep Spin Miniprep Kit

QIAGEN

Cat#27104

SuperScript VILO cDNA Synthesis Kit

Thermo Fisher

Cat#11754-050

Advanced qPCR mastermix with Supergreen LO-ROX

Wisent

Cat#800-435-UL

KAPA Hyper RNA kit

Roche

Cat# 08098107702; KK8541

AlamarBlue Reagent

Thermo Fisher

Cat#DAL1025

Deposited Data RNA-seq data of FMRE-KO and WT HEK293 cells Somatic variant calls, transcript abundance, and other core data generated by the ICGC/TCGA Pan-cancer Analysis of Whole Genomes Consortium

ArrayExpress accession number E-MTAB-8631 Campbell et al., 2017

https://dcc.icgc.org/releases/ PCAWG

ATCC

Cat#CRL-1573

sgRNAs for deletion of FMRE 14:21081834: 50 -CACCGTCATCACTC TATTGCTCCAC-30 (50 sgRNA) and 50 -CACCGTCGCCTTCCGCTTAA TATGG-30 (30 sgRNA).

Integrated DNA Technologies (IDT)

N/A

qPCR primers for validating FMRE 14:21081834 deletion: PCR-1: F1 50 -TCTGTTGCCAGGAAGGGATA-30 , and R1 50 -AGGAAAAAG CCGTGGAAAAT-30 ; PCR-2: F1, and R2 50 -GGGATGTGAAGAAG CAAAGC-30

Integrated DNA Technologies (IDT)

N/A

Gene specific RT-qPCR primers: CCNB1IP1 (forward 50 GACTG CGACCAGAGATCGTG 30 ; reverse 50 CCCTCAGCCTTGCTGAA ATTG 30 ), the reference gene GAPDH (forward 50 ACATCGCTC AGACACCATG 30 ; reverse 50 TGTAGTTGAGGTCAATGAAGGG 30 ), CDKN2A (forward 50 GACCTGGCTGAGGAGCTG 30 ; reverse 50 GGGCATGGTTACTGCCTCTG 30 ), HEY1 (forward 50 CGGCTCTA GGTTCCATGTCC 30 ; reverse 50 GCTTAGCAGATCCCTGCTTCT 30 ), CDH1 (forward 50 TTACTGCCCCCAGAGGATGA 30 ; reverse 50 TGCAACGTCGTTACGAGTCA 30 ), PSIP1 (forward 50 CTGGAGAC CTCATCTTCGCC 30 ; reverse 50 GGGTGGCTTTACAGCTCCAT 30 )

Integrated DNA Technologies (IDT)

N/A

ActiveDriverWGS

This paper

https://github.com/reimandlab/ ActiveDriverWGSR

ActivePathways

Paczkowska et al., 2018

https://cran.r-project.org/web/ packages/ActivePathways/

g:Profiler, web server used on July 13st 2018 (version 18-06-21)

Reimand et al., 2007

https://biit.cs.ut.ee/gprofiler

MIT CRISPR Designer

N/A

http://crispr.mit.edu

edgeR v3.10.5 and limma v3.24.15

Law et al., 2014

https://bioconductor.org

Experimental Models: Cell Lines HEK293 cells Oligonucleotides

Software and Algorithms

e1 Molecular Cell 77, 1–15.e1–e10, March 19, 2020

Please cite this article in press as: Zhu et al., Candidate Cancer Driver Mutations in Distal Regulatory Elements and Long-Range Chromatin Interaction Networks, Molecular Cell (2020), https://doi.org/10.1016/j.molcel.2019.12.027

LEAD CONTACT AND MATERIALS AVAILABILITY €ri Further information and requests for resources and reagents should be directed to and will be fulfilled by the Lead Contact, Ju Reimand ([email protected]). This study did not generate new unique reagents. Cell lines generated in this study may be made available on request but we may require a payment and/or a completed Materials Transfer Agreement if there is potential for commercial application. EXPERIMENTAL MODEL AND SUBJECT DETAILS Human subjects Sequencing of human subjects’ tissue was performed by ICGC and TCGA consortium members under a series of locally approved Institutional Review Board (IRB) protocols as described earlier (Hudson et al., 2010). Informed consent was obtained from all human participants. Ethical review of the current data analysis project was granted by the University of Toronto Research Ethics Board (REB) under protocol #30278, ‘‘Pan-cancer Analysis of Whole Genomes: PCAWG.’’ We analyzed the Pan-cancer Analysis of Whole Genomes (PCAWG) project of the International Cancer Genome Consortium (ICGC) and the Cancer Genome Atlas (TCGA) dataset of 1,844 whole cancer genomes of 31 cancer types with 14.2 million somatic single nucleotide variants (SNVs) and indels (Campbell et al., 2017). This represented a subset of the consensus variant calls in 2,693 tumors sequenced in the ICGC-TCGA PCAWG Project. The subset of tumors for this study was derived using the following procedure. First, we focused on the whitelisted subset of 2,583 tumors. We filtered 69 hyper-mutated samples with more than 90,000 mutations (30 mutations/Mb) that contributed 47% of all mutations. To avoid confounding factors due to previously-observed stronger mutation rates in transcription factor binding sites in samples of specific cancer types with atypical mutational processes, we further excluded all 670 samples of four cancer types: melanoma (65), lymph-related cancers (BNHL (104), CLL (90), NOS (2)), esophageal adenocarcinoma (95), and liver hepatocellular carcinoma (314). This step avoided a leakage of strong confounding signals from specific cancer types into our pan-cancer driver analysis, as suggested by initial analyses of specific cancer cohorts. Nearly all cohorts of individual cancer types revealed no or only a few frequently mutated regulatory elements (FMREs) as expected from our power analysis, whereas the four cancer types exceptionally revealed dozens of FMREs. Mutation enrichment in these four cancer types likely occurs due to specific mutational processes in TF binding sites. While such FMREs may include driver mutations, these mutational processes confound pan-cancer driver analyses and should be analyzed separately. We chose to focus on a high-confidence set of cancer types with no putative confounding factors and excluded these cancer types from this study. Cell lines We created a homozygous deletion of a non-coding element in HEK293 cell lines (Figure 4A,B, Figure S5). To functionally validate the FMRE 14:21081834, we induced a deletion of the entire region (1.3 kbps) in HEK293 cells using CRISPR/Cas9 genome editing and derived three independent clonal cell lines with homozygous deletions of the FMRE by culturing single genome-edited cells and screening the resulting colonies for FMRE deletions. For genome editing, single guide RNAs (sgRNAs) targeting the 50 and 30 boundaries of the FMRE (chr14:21081182-21082486) were designed using the MIT CRISPR Designer software (http://crispr.mit.edu). The following sgRNAs were designed to induce 1,342 bp FMRE deletions (knockouts; KO) in HEK293 cell lines, resulting in FMRE-KO cell lines: 50 -CACCGTCATCACTCTATTGCTCCAC-30 (50 sgRNA) and 50 -CACCGTCGCCTTCCGCTTAATATGG-30 (30 sgRNA). The designed sgRNAs were ordered from Integrated DNA Technologies (IDT) as custom DNA oligonucleotides and cloned into the pSpCas9(BB)-2A-GFP plasmid (PX458, Addgene). Transfections were performed with Lipofectamine 3000 Reagent (Thermo Fisher Scientific) on HEK293 cells cultured in 6-well plates in DMEM medium supplemented with 10% (vol/vol) FBS at 37 C and 5% CO2. After 24 h of incubation, the GFP+ fluorescent cells were isolated by FACS sorting and returned to incubator for further growth. Next, single cells were isolated by serial dilutions and cultured in 96-well plates at concentration of 0.5 cells per well. The colonies were inspected for clonal appearance after 1 week and expanded for additional 2-3 weeks. The clonal populations that reached 90% confluency were subsequently split for DNA and RNA extractions. FMRE-KO clonal cell lines were identified by targeting the deleted genomic region and its boundaries with PCR experiments. FMRE-KO cell lines were grown from single cells, established and selected over at least five passages to enable functional experiments. Therefore, no acute genotoxic stress would be expected as cells recover from genome editing experiments over passaging. Genomic DNA (gDNA) from wildtype and CRISPR/Cas9-edited clonal cell populations was extracted using DNeasy Blood & Tissue Kit (QIAGEN) according to manufacturer’s instructions. The quantity and quality of gDNA was assessed with NanoDrop ND-2000 (Thermo Scientific). FMRE-KO cell lines were identified by two subsequent PCR experiments. We amplified 100-200 ng of gDNA using recombinant Taq DNA polymerase with 1X PCR buffer and 1.5mM MgCl (Invitrogen), and 500 nM of each primer (PCR-1: F1 50 -TCTGTTGCCAGGAAGGGATA-30 , and R1 50 -AGG AAAAAGCCGTGGAAAAT-30 ; wildtype band 1,564 bps, knockout band 222 bps; PCR-2: F1, and R2 50 -GGGATGTGAAGAAGCA AAGC-30 , wildtype or band 555 bps, no detectable PCR product in knockout). The homozygous genotype of these cell lines was confirmed using Sanger sequencing (ACGT Corp) on PCR products (222 bps) amplified from the gDNA spanning the target sites of sgRNAs (Figure S5).

Molecular Cell 77, 1–15.e1–e10, March 19, 2020 e2

Please cite this article in press as: Zhu et al., Candidate Cancer Driver Mutations in Distal Regulatory Elements and Long-Range Chromatin Interaction Networks, Molecular Cell (2020), https://doi.org/10.1016/j.molcel.2019.12.027

METHOD DETAILS The ActiveDriverWGS method Candidate drivers among protein-coding genes and non-coding regions (CRMs) were identified with ActiveDriverWGS, our novel mutation enrichment method that tests whether a pre-determined genomic element of interest is significantly more mutated than expected from a background mutation model constructed from a relevant background sequence. ActiveDriverWGS extends our earlier work on the discovery of protein-coding driver mutations (Reimand and Bader, 2013). ActiveDriverWGS is a local mutation enrichment model based on Poisson generalized linear regression that determines the expected number of mutations in a genomic element by observing mutations in a background window of at least 100 kbps around the element, including ± 50 kbps upstream and downstream of the region. If an element comprises multiple disjointed segments (e.g., exons), every such segment contributes up to ± 50 kbps of additional background sequence from flanking regions (e.g., introns), resulting in more background sequence provided to regions spanning multiple segments. This additional background sequence space adjusts for regional variation in mutation rates. ActiveDriverWGS also considers the trinucleotide composition of the element and the background sequence, and the trinucleotide context of mutations (i.e., corresponding to mutation signatures) with reference and alternative allele as cofactors in the regression model, expanded to 96 cofactors for SNVs that account for nucleotide complementarity. An additional 97th cofactor is used for indel mutations for which trinucleotide content is not considered. To simplify the model, we discard mutation signatures that have zero mutations in the tested element and the background sequence and thus affect no coefficients. To be conservative, each tested element is only considered to be mutated once per every patient and additional mutations are discarded. This reduces potential inflation of significance stemming from locally hypermutated regions. ActiveDriverWGS conducts analysis of deviance using chi-square tests to evaluate the different distribution of mutations in a genomic element of interest compared to a background sequence, using pairs of nested regression models as the null and alternative hypotheses (H0 versus H1). Each nucleotide is represented four times in the regression: once for each three alternative alleles and once for indel mutations. Each such nucleotide combination may have zero or more mutations. The chi-square test evaluates whether accounting for the element of interest as an additional coefficient (i.e., the binary variable isElement) provides a significantly better model fit compared to the null model that considers only the overall distribution of mutations in trinucleotide contexts (i.e., the multinomial variable trinucleotideContext): H0 : PoisðnMutationsÞ  trinucleotideContext

H1 : PoisðnMutationsÞ  trinucleotideContext + isElement A significant p value in this combined test indicates that the tested element has a significantly different mutation rate compared to the background sequence. To distinguish regions with excess mutations from regions with fewer than expected mutations, ActiveDriverWGS additionally computes confidence intervals to the expected number of mutations by sampling expected mutation counts from the null model H0. The alternative hypothesis H1 is accepted only if the number of observed mutations significantly exceeds expected background mutations, i.e., the 95% quantile of mutations samples from the null model. In contrast, if the confidence intervals indicate significant excess of mutations in the background and depletion of mutations in the element (i.e., negative selection), we invert corresponding small p values from the Chi-square test (p = 1-P) to conservatively discard regions that display reduced variation due to negative selection or technical factors. The tested genomic elements with no mutations are conservatively assigned p = 1. The p values resulting from the first test are corrected for multiple testing across all tested regions using the BenjaminiHochberg False Discovery Rate (FDR) procedure and genes with FDR<0.05 are considered significant. To estimate the expected passenger and driver mutations per element, we compute the number of passenger mutations as the median number of expected mutations from the null model H0. The corresponding number of driver mutations is the number of observed mutations in the element minus the number of median expected mutations in that element predicted from the null model H0. The user may optionally determine enrichment of driver mutations in a pre-defined database of active sites such as post-translational modification sites in proteins or TF binding sites in DNA. If such data are present, a second analysis of deviance and chi-square tests compares the model H1 with an additional model H2: H1 : PoisðnMutationsÞ  trinucleotideContext + isElement + isSite The second comparison quantifies the enrichment of mutations in active sites (i.e., binary variable isSite) relative to mutations in the candidate driver element and its expected background sequence. The P values from the second test are also computed with a chi-square test (H2 versus. H1) and corrected with the FDR procedure through restricted hypothesis testing that only considers candidate driver elements that passed the first test at FDR<0.05. Benchmarking of ActiveDriverWGS We benchmarked ActiveDriverWGS using simulated mutations, various parameter settings, and comparison to alternative driver discovery methods (Figure S1). First, ActiveDriverWGS was tested with simulated mutations to evaluate potential false discoveries. To generate simulated mutation data, we split the genome into non-overlapping windows of 50 kbps and randomly re-assigned PCAWG

e3 Molecular Cell 77, 1–15.e1–e10, March 19, 2020

Please cite this article in press as: Zhu et al., Candidate Cancer Driver Mutations in Distal Regulatory Elements and Long-Range Chromatin Interaction Networks, Molecular Cell (2020), https://doi.org/10.1016/j.molcel.2019.12.027

pan-cancer single nucleotide variants in each window to alternative positions of the same trinucleotide context using sampling with replacement. Indels were also randomly shuffled with no accounting for trinucleotide context. Besides in-house simulated data, we also tested ActiveDriverWGS on two additional sets of simulations from the PCAWG drivers working group (i.e., Sanger and Broad simulations) (Rheinbay et al., 2017). Results from 672 ActiveDriverWGS runs with three sets of simulated mutations, 32 cohorts (cohorts of specific cancer types and the pan-cancer cohort) and seven types of genomic elements were collected. In sum, these revealed only eleven significant findings at FDR<0.05, suggesting low false discovery rates of the method. We tested ActiveDriverWGS with different parametrizations. First, we varied the size of background windows for establishing expected mutation rates: ± 10, ± 25, ± 50, ± 75, and ± 100 kbps. The method was found to be robust to variations in background window, however the ± 50 kbp window provided the strongest enrichment of known cancer genes in the pan-cancer CDS analysis. Second, we tested the importance of model cofactors corresponding to mutation trinucleotide contexts in the regression model of ActiveDriverWGS. This confirmed the importance of trinucleotides for driver discovery, as the exclusion of these cofactors increased false positives among pan-cancer CDS drivers (47/96 versus 4/48 non-cancer genes found using naive versus trinucleotide-adjusted models). Third, we tested the alternative option of including hyper-mutated tumor genomes in the pan-cancer cohort. Inclusion of these 69 genomes led to a recovery of fewer CDS drivers in the pan-cancer cohort (26/27 versus 44/48 known cancer genes found in the PCAWG cohorts with versus without hypermutated genomes). This increase false negative rate was likely observed due to increased statistical noise of frequent passenger mutations. Thus, the final analyses excluded hyper-mutated samples and cofactors for mutation signatures were included. Fourth, we tested ActiveDriverWGS with several sets of variant calls from the PCAWG project. The PCAWG project used four variant calling pipelines, while the final PCAWG consensus variant callset included variants detected by at least two of four pipelines (2+/4). The consensus variant calling strategy of the PCAWG project was determined systematically by the consortium by experimentally validating a subset of detected variants using qPCR and evaluating the performance of various consensus strategies in recovering the experimentally-validated variant set (Campbell et al., 2017). The 2+/4 pipeline was the most optimal as it recovered experimentally-validated variants with the greatest balance of sensitivity and precision, with a particular advantage toward variants with low allele frequency. We investigated this further and found that false-positive driver genes were called in the stringent variant callset. In contrast, driver discovery in non-coding regions led to fewer results in more stringently filtered variant callsets, including the loss of TERT promoter mutations in the most stringent variant callset. These differences are likely apparent due to an overly stringent filtering of lower-frequency mutations in the non-coding space. In protein-coding sequences, such falsely-filtered variant calls led to an attenuated rate of expected mutations and inflated the significance of mutations within these candidate driver genes. In non-coding regions, false-negative variant calls in candidate driver regions reduced their significance. These results suggest that appropriate variant calling is important for driver discovery, while overly stringent filtering of variants may lead to false positives and false negatives in driver discovery. The PCAWG consensus variant calls (2+/4) were used in our study. We compared the performance of ActiveDriverWGS with seven alternative methods for discovering drivers in protein-coding sequences. Alternative driver calls were derived from the PCAWG driver discovery project and comprised 44 PCAWG cohorts of specific cancer types, meta-cohorts of similar cancer types, and pan-cancer cohorts (Rheinbay et al., 2017). The methods included dNdScv (Martincorena et al., 2017), DriverPower (Shuai et al., 2017), ExInAtor (Lanzo´s et al., 2017), MutSigCV (Lawrence et al., 2013), ncdDetect (Juul et al., 2017), NBR (Nik-Zainal et al., 2016), and OncodriveFML (Mularoni et al., 2016). We evaluated two metrics for every method and sample cohort: enrichment of known cancer genes (i.e., COSMIC Cancer Gene Census genes) among detected genes as a proxy of sensitivity (evaluated with Fisher’s exact tests), and the F1-score as a balanced metric of sensitivity and specificity. In this comparison, ActiveDriverWGS was consistently the tool with the best performance that was also available for non-coding driver discovery. The other best-performing tools MutSigCV and dNdScv are limited to CDS genes and not applicable to non-coding regions. QUANTIFICATION AND STATISTICAL ANALYSIS Genomic regions Our driver discovery pipeline was run separately for multiple classes of genomic regions of the human genome hg19. Cis-regulatory modules (CRMs) derived from the ENCODE project comprised clusters of TF binding sites (TFBS) measured in chromatin immunoprecipitation (ChIP-seq) experiments retrieved from the UCSC Genome Browser. We used the dataset of 4.4 million binding sites in 90 cell lines and excluded sites that were only observed in one cell line. TFBSs of RNA polymerases POLR2A and POLR3G were excluded as these were much longer than typical TFBSs. The remaining 1.06 million binding sites of 99 TFs measured in 90 cell lines were merged into consecutive regions based on R 1bp of common sequence. We discarded regions bound by single TFs and reduced the set to 147,648 cis-regulatory modules (CRMs; i.e., clusters of TFBSs). Sequence regions corresponding to CDS sequence and splice sites were subtracted from CRMs, resulting in CRMs comprising multiple disjoint segments. Segments with length less than 50 bps were discarded. CRMs were further filtered to exclude those located in common fragile sites (Fungtammasan et al., 2012). We also excluded CRMs in regions with excessive germline variation in the PCAWG cohort. Specifically, the upper bound of permitted germline variation was determined through a positive control gene FOXA1, i.e., the gene predicted as a driver in our pan-cancer analysis that showed the highest germline per-nucleotide mutation frequency in PCAWG. As a result, the merging

Molecular Cell 77, 1–15.e1–e10, March 19, 2020 e4

Please cite this article in press as: Zhu et al., Candidate Cancer Driver Mutations in Distal Regulatory Elements and Long-Range Chromatin Interaction Networks, Molecular Cell (2020), https://doi.org/10.1016/j.molcel.2019.12.027

and filtering procedure led to a final dataset of 120,788 CRMs. TFBS clustering led to a non-redundant dataset of CRMs representative of multiple cell lines that we used for driver discovery in the heterogeneous PCAWG pan-cancer cohort. Driver analysis of coding and non-coding regions using ActiveDriverWGS The 120,788 cis-regulatory modules (CRMs) for non-coding driver analysis were derived from ENCODE as described above. In addition to CRMs, we analyzed several types of genomic elements: protein-coding sequences (CDS), untranslated regions of CDS genes (50 UTR, 30 UTR), promoters of coding genes (promDomain), and gene bodies of long non-coding RNAs (lncRNA), derived from the PCAWG consensus dataset (Rheinbay et al., 2017). Driver analyses were performed on the pan-cancer cohort as well as 31 individual cohorts of specific cancer types. Multiple testing correction and filtering (FDR<0.05) was performed separately for each combination of cancer type and element type. Analysis of cohorts of specific cancer types detected only few FMREs: the TERT promoter was found in six cancer types and two additional FMREs were found in specific cancer types, likely owing to limited power of these cohorts (Figure S4). We conducted a power analysis to evaluate the detectable effect size of the PCAWG pan-cancer cohort and cohorts of specific cancer types (Figure S4). Power analysis shows that cohorts of specific cancer types are less-powered to detect infrequent mutations in CRMs. Power analysis of chi-square tests was performed using the pwr R package and the pwr.chisq.test function. Effect size for a range of numbers corresponding to sizes of cohorts was computed with relative degrees of freedom in comparing H0 and H1 in ActiveDriverWGS (df = 1) and significance (p = 0.05). This process was repeated for several values of power (0.6-0.9) and data were plotted as line plots. Sample sizes corresponding to the size of the pan-cancer cohort, largest cohorts of specific cancer types and the median cohort were printed over the power curve. We performed driver discovery with restricted hypothesis testing to find candidate FMREs in specific cancer types and increase sensitivity in these smaller cohorts. We selected the limited set of 30 FMREs detected in the pan-cancer analysis, conducted driver analyses on these in each of 31 cancer types, and adjusted the results for multiple testing across all cancer types (FDR<0.05). This approach resulted in fewer testable hypotheses (930) compared to a driver analysis of all CRMs (120,788) and thus increased sensitivity. However, the results were limited to regions with pan-cancer enrichment of somatic mutations while additional regions with mutations in a certain cohort likely remained undetected (Figure S4). We compared the relative mutation frequencies and predicted proportions of driver mutations of protein-coding driver genes (CDS) and FMREs from the ActiveDriverWGS analysis of the pan-cancer cohort. In the first analysis, we compared FMREs and CDS drivers in terms of mutations per 100 bps of sequence using a non-parametric Wilcoxon test. In the second analysis, we first predicted the expected number of mutations for each element (CDS driver and FMRE) using the ActiveDriverWGS model (FDR<0.05). The expected number of mutations per element was derived from the null background model of ActiveDriverWGS that was trained on each element and its adjacent flanking sequence as described above. The expected number of mutations corresponded to the median value of 1000 predictions from the null model sampled using the Poisson distribution. Using the expected number of mutations as a proxy of passenger mutation count in each element, we derived the number of predicted driver mutations per element as the number of observed mutations minus the number of expected mutations. The ratio of predicted driver mutation count in a region over the count of all observed mutations in that region was computed for all FMREs and CDS drivers. A non-parametric Wilcoxon text was applied to compare FMREs and CDS drivers in terms of these predicted proportions of driver mutations (Figure 1D). ActiveDriverWGS also highlights driver genes where somatic mutations are specifically enriched in a database of active sites such as protein phosphorylation sites. To demonstrate this feature, we analyzed CDS drivers using a set of 385,185 post-translational modification (PTM) sites from the ActiveDriverDB database (Krassowski et al., 2018). We identified PTM-associated mutations in five pan-cancer drivers (IDH1, BRAF, PIK3R1, CTNNB1, H3F3A). For example, 25/35 SNVs in BRAF affect phosphorylation sites (12 expected, FDRsite = 4.3x104 from ActiveDriverWGS). The V600E substitution in BRAF is predicted to affect the adjacent phosphorylation site S602 (Eisenhardt et al., 2016; Ko¨hler et al., 2016) and create a sequence motif bound by Polo-like kinase 1 by the MIMP algorithm (Wagih et al., 2015). Analysis of active sites can lead to mechanistic hypotheses for the impact of specific mutations in driver genes and non-coding regions. Additional driver discovery methods for CRM analysis Three independent driver discovery methods NBR (Nik-Zainal et al., 2016), DriverPower (Shuai et al., 2017), and OncodriveFML (Mularoni et al., 2016) were used to analyze CRMs besides ActiveDriverWGS. The tools were used as described in the PCAWG driver study (Rheinbay et al., 2017) on the set of CRMs and statistically significant results were reported separately for each method (FDR<0.05) (Figure 1F). Each method uses different statistical models, cofactors, mutation impact scores and/or clustering metrics and thus provides complementary evidence to our findings. NBR uses a negative binomial regression to estimate the background mutation rate of each element as described earlier (Nik-Zainal et al., 2016). DriverPower comprises a combined burden and functional impact test for coding and non-coding cancer driver elements (Shuai et al., 2017). In DriverPower, the predicted mutation rate for CRMs was further calibrated with functional impact scores measured by LINSIGHT scores (Huang et al., 2017). OncodriveFML infers positive selection of genomic regions and functional impact of mutations to find cancer driver genes and non-coding regions (Mularoni et al., 2016).

e5 Molecular Cell 77, 1–15.e1–e10, March 19, 2020

Please cite this article in press as: Zhu et al., Candidate Cancer Driver Mutations in Distal Regulatory Elements and Long-Range Chromatin Interaction Networks, Molecular Cell (2020), https://doi.org/10.1016/j.molcel.2019.12.027

Validating FMREs in additional cancer whole genome sequences To confirm the mutation frequency of FMREs in further tumor samples, we collected an additional dataset of somatic variant calls from the International Cancer Genome Consortium (ICGC) Data Portal (Data Release 27) (Figure 1E, Figure S3). The dataset included 3,181 whole genome patient samples of 34 cohorts and/or cancer types after excluding donors represented in the PCAWG cohort. We computed genomic overlaps between FMREs and somatic mutations in the dataset to quantify the number of mutations and affected patients for each FMRE for all samples. FMRE mutation rates in the PCAWG discovery dataset and the ICGC validation dataset was compared using ANOVA and F-test on the number of donors with FMRE mutations by controlling for region length as a covariate. The validation dataset should be interpreted with caution due to widely non-uniform data processing and heterogeneity of tumors in the cohort. Chromatin states and long-range chromatin interactions of FMREs CRMs and FMREs were annotated with chromatin states and long-range chromatin interactions to interpret their gene regulatory and epigenetic context (Figure 2c,d). Chromatin states of 127 healthy human tissues and cell lines were derived from the Roadmap Epigenomics project (Roadmap Epigenomics Consortium et al., 2015). Data from the core model of 15 chromatin states were used. Annotations for each chromatin state and tissue or cell line were considered separately in permutation tests described below. To summarize the data, we merged chromatin states corresponding to promoters (TssA, TssAFlnk), enhancers (Enh, EnhG), and bivalent regions (TssBiv, BivFlnk, EnhBiv). We also reviewed chromatin accessibility data of primary tumors to annotate FMREs. We analyzed cis-regulatory elements of primary tumors derived using ATAC-seq assays in a panel of 410 tumor samples of 23 cancer types in the TCGA project (Corces et al., 2018), using the hg19 chromosomal coordinates of these regions provided in the dataset. Regulatory elements were considered to co-occur with FMREs given a minimum of 1 bps sequence overlap. Chromatin long-range interactions (i.e., chromatin loops) from eight human cell lines were derived from a HiC dataset (Rao et al., 2014). To obtain a highconfidence subset of chromatin interactions, we merged interactions whose anchors shared overlapping sequence at both 50 and 30 ends. The resulting anchors spanned the unions of initial anchors considered in merging. We filtered interactions that were apparent only in one cell line. This resulted in 11,282/43,219 high-confidence interactions using the anchor definitions of the original dataset. Long-range chromatin interactions were considered to interact with a putative target gene if one anchor of a given interaction overlapped the 50 UTR or promoter of the target gene while the other anchor overlapped with the FMRE and had no overlap with the gene 50 UTR or promoter. Somatic mutations and structural variants in FMREs We analyzed different types of somatic mutations in the PCAWG project to evaluate FMREs as cancer drivers (Figure 2A,B). Structural variant (SV) calls and copy number alteration (CNA) calls of the tumor samples originate from the PCAWG project (Gerstung et al., 2018; Li et al., 2017). For CNAs, we used total copy number calls per segments and excluded segments with missing values. For SVs, we studied coordinates of breakpoints as well as short sequence segments (%100 kbps) defined by breakpoints. All SVs were examined together and also separately by alteration types (deletions, inversions, duplications, translocations). Segment analysis only considered deletions, inversions, and duplications. SVs in FMREs did not co-occur with SNVs and indels, as only 6/811 SNVs and indels in FMREs were closer than 5 kbps to SV breakpoints in matching patients and only 23/811 appeared in high copy-number regions (> 5 copies), alleviating concerns of technical biases in variant detection in structurally variable regions. For mRNA abundance analysis of FMRE mutations, genomic segments with copy number alterations were further processed to obtain gene-level copy number estimates, defined as median values of CNA segments overlapping with gene exons. Gene-level CNA calls were used as covariates in the mRNA abundance analysis. Structural variants in FMREs were associated with differential transcript abundance. We used a ± 10 kbps window around balanced SV breakpoints (inversions and translocations) to identify genes and FMREs where SVs potentially altered transcript abundance through transcriptional regulation mediated by FMREs. Two possibilities were considered: a) an FMRE and a target gene co-located at the same SV breakpoint that would potentially disrupt an active regulatory interaction of the FMRE and the gene, and b) an FMRE and a gene located at opposite SV breakpoints that would potentially initiate a de novo regulatory interaction of the FMRE and the target gene through enhancer hijacking and similar mechanisms. SVs were visualized using the omicCircos R package v1.14.0 (Hu et al., 2014). Z-scores were used to compare expression differences between the single tumor samples affected by the SV and other non-mutated tumors of the same cancer type. A z-value threshold was used for selecting differentially expressed genes (absolute z-score > 4). FunSeq2 software (v2.10) (Fu et al., 2014) was used to predict gains and losses of TF binding motifs induced by single nucleotide variants, using default parameters (Figure 2E,F, 3D). As many TFs had more than one predicted DNA-binding motif in the FunSeq2 database, predictions from FunSeq2 were filtered such that only one motif of a given TF was considered altered by any mutation. Motifs were visualized using the ggseqlogo R package (Wagih, 2017) (v0.1). Evaluating the enrichment of annotations and mutations in FMREs We studied genomic and epigenomic annotations and cancer mutations of FMREs relative to all CRMs (Figure 2A,D,E,F). We used custom permutation tests where subsets of CRMs were sampled as controls to estimate the expected number of each class of annotation or mutation. We distributed all CRMs into 100 distinct bins of similar sequence length and transcription status. Transcription status was determined as a binary variable based on CRM overlap with bodies of CDS and/or lncRNA genes. To estimate the

Molecular Cell 77, 1–15.e1–e10, March 19, 2020 e6

Please cite this article in press as: Zhu et al., Candidate Cancer Driver Mutations in Distal Regulatory Elements and Long-Range Chromatin Interaction Networks, Molecular Cell (2020), https://doi.org/10.1016/j.molcel.2019.12.027

expected numbers of annotations (mutations) in the set of FMREs, we sampled 10,000,000 random sets of CRMs from the bins such that each bin contributed a set of random CRMs corresponding to the number of FMREs in that bin. Statistical significance of annotation enrichment was estimated as an empirical p value, i.e., the fraction of permutations that showed equivalent or higher number of annotations (mutations) compared to the true set of FMREs. To avoid biasing the permutation tests by known non-coding regions with frequent somatic mutations, we only focused on novel FMREs and excluded the three known FMREs (TERT promoter, WDR74 promoter, MALAT1 lncRNA). Besides length- and transcription-adjusted sampling of CRMs, we also sampled random genome regions of equivalent genomic lengths as controls in a separate group of independent tests. Confidence intervals for observed numbers of FMRE annotations were derived with resampling. We first compared FMREs and CRMs regarding enrichment of SNVs and indels. This approach validated the earlier analysis using ActiveDriverWGS: the permutation test considered the set of all FRMEs relative to similar CRMs sampled genome-wide as background sequence, while ActiveDriverWGS considered one CRM at a time relative to an adjacent background sequence. In additional analyses, we confirmed enrichment of different classes of mutations including copy-number alterations and structural variants that were not used initially for FMRE discovery however provided additional support to FMREs as potentially driven by multiple classes of mutations. Enrichment of motif-rewiring mutations in FMREs was computed relative to expected values from matching CRMs in three configurations: all motif-rewiring mutations, mutations associated with gains and losses separately, and mutations associated with motifs of all specific TFs separately. In the latter analysis, TF with at least five motif-rewiring mutations in FMREs and significant enrichment were considered as hits (PCRM<0.05). To further study the functional impact of FMRE mutations, we evaluated their distances to the nearest peak summits apparent in ENCODE ChIP-seq datasets, defined as the median coordinates of individual TF binding sites. Mutations in all CRMs were evaluated as controls (Figure 2G). Mutations in CRMs and FMREs were binned according to the shortest distances to peak summits. Non-parametric Kruskal-Wallis tests were used to compare bin distributions of FMREs and CRMs. Predicting target genes of FMREs and associating with mRNA abundance To evaluate the functional impact of FMRE mutations, we first mapped potential target genes of FMREs using two criteria: direct FMREs occurring in gene promoters and 50 UTRs, and distal FMREs interacting with gene promoters and 50 UTRs via long-range chromatin interactions (Rao et al., 2014) (Table S1). Consensus chromatin interactions conserved in at least two cell lines were used for target gene prediction. To associate specific FMREs and mRNA abundance of their target genes, we used matching RNA-seq data from PCAWG (Calabrese et al., 2018), measured as log1p-transformed upper-quartile normalized values of fragments per kilobase of transcript per million (log1p FPKM-UQ). Matched transcriptomic data was available for 45% of PCAWG tumors (831/1,844) and tumors with no data were excluded from the mRNA analysis. To estimate impact of mutations in mRNA abundance, we compared tumors with mutations in a specific FMRE to tumors with no mutations in the FMRE, limiting to cancer types where at least three mutated samples with mRNA data were available (Figure 3A). FMREs and/or cancer types with fewer mutated samples were excluded to increase confidence. Linear regression models were used to determine whether tumors with mutations in a given FMRE showed significantly different mRNA abundance compared to tumors with no mutations in the FMRE. Gene copy numbers were used as covariates to control for differential mRNA abundance caused by genomic amplifications and deletions. Samples of each cancer type were tested separately to capture tissue-specific effects. However, additional pan-cancer analyses were conducted if multiple cancer types had a sufficient number of mutated samples available for a particular FMRE. We then combined samples of multiple cancer types and modeled cancer type-specific effects using regression covariates. ANOVA F-tests comparing null and alternative models were used for evaluating associations of mRNA abundance and mutations in FMREs. In the null model log1ptransformed mRNA abundance of a gene was determined by the gene copy number, while in the alternative model the mRNA abundance was determined by the linear combination of the gene copy number and the FMRE mutation status. An additional covariate was added to both null and alternative models if samples of multiple cancer types were pooled in cases where a sufficient number of mutated samples was available from multiple cancer types. For each FMRE-gene pair, individual cancer types and pooled samples were tested separately and the most significant result for each FMRE-gene pair was selected, followed by adjustment for multiple testing correction. Significant results were selected based on a lenient threshold (FDR<0.2) (Figure 3A), to increase sensitivity of this underpowered analysis that was limited by partial availability of matched transcriptomic data and heterogeneity of the pan-cancer cohort. Numbers of tumor samples with FMRE mutations are shown in the figure. Pathway enrichment analysis of CRMs using ActivePathways To analyze additional cancer genes, molecular pathways and processes enriched in CRMs and somatic mutations, we first associated every protein-coding gene with all CRMs potentially controlling its mRNA abundance. Four categories of CRMs in an order of priority were mapped: CRMs directly located in gene promoters (direct), distal CRMs with chromatin interactions to gene promoters (looped), CRMs located in gene introns (intronic), and intergenic CRMs within ± 25 kbps of a target gene (adjacent), as described above. For each protein-coding gene, we then selected a single CRM with the strongest enrichment of somatic mutations (i.e., minimum unadjusted p value from ActiveDriverWGS) for each category. Duplicated counting of the same CRM to multiple categories for the same gene was not permitted to avoid inflation of statistical significance. Duplicated CRMs were removed based on priority of CRM categories (direct, looped, intronic, adjacent) and replaced by the next CRM in the significance ranking. For example, the CRM of the TERT promoter mutation hotspot was assigned to the top-priority direct category, removed from the lower-priority adjacent category to avoid double counting, and replaced by the next most significant CRM in the adjacent category. Categories where no

e7 Molecular Cell 77, 1–15.e1–e10, March 19, 2020

Please cite this article in press as: Zhu et al., Candidate Cancer Driver Mutations in Distal Regulatory Elements and Long-Range Chromatin Interaction Networks, Molecular Cell (2020), https://doi.org/10.1016/j.molcel.2019.12.027

CRMs were available were finally assigned conservative p values of 1. As a result, each gene was associated with up to four nonredundant CRMs or FMREs with the most significant enrichment of cancer mutations. This prioritization strategy allowed us to integrate mutation enrichment signals across multiple disjoint CRMs potentially regulating the mRNA abundance of a gene proximally or distally. As the next step, we conducted pathway enrichment analysis of these data using ActivePathways (Paczkowska et al., 2018), an integrative pathway analysis method that uses data fusion techniques to determine significantly enriched pathways and processes in multi-omics datasets by weighing evidence from each dataset (Figure 5A). The matrix of all CDS genes with the p values of related CRMs was provided as input to ActivePathways, with the four categories of CRMs with corresponding p values from ActiveDriverWGS. ActivePathways was executed using default parameters. The GMT file with functional gene sets included Gene Ontology biological processes and Reactome pathways and was retrieved from the g:Profiler web server on July 13st 2018 (g:Profiler version 18-06-21) (Reimand et al., 2007). Enriched pathways were corrected for multiple testing and filtered using default settings of ActivePathways (Holm Q<0.05). Pathway-associated genes were listed from the overlap column from ActivePathways output. Genes from in the pathway analysis that were annotated in COSMIC Cancer Gene Census were selected for further analysis (Figure 5B). Enriched pathways were visualized using the enrichment map approach using standard procedures and manually curated to highlight the most relevant unique biological themes (Reimand et al., 2019). We excluded histone genes from the final pathway analysis by assigning p = 1 to these genes, as initial analyses showed many results driven primarily by histones, likely due to their clustering on the genome that biased single FMREs and CRMs to be assigned to multiple histone genes. Analyzing mutation co-occurrence in FMREs and CDS drivers To determine co-occurring mutations and potential synergistic genetic interactions, we tested 1,440 pairs of 30 FMREs and 48 protein-coding driver genes that were detected in the pan-cancer cohort (ActiveDriverWGS FDR<0.05) (Figure 5C). Significance of co-occurring mutations was estimated with Fisher’s exact tests using all 1,844 tumors as the background set, followed by multiple testing correction with the FDR procedure. Gene-FMRE pairs were considered significant using a lenient threshold (FDR<0.1) to increase sensitivity, as this analysis tested a large number of gene-FMRE pairs while most of the identified cancer driver genes and FMREs were relatively sparsely mutated, reducing statistical power. Venn diagrams were created with the VennDiagram R package (Chen and Boutros, 2011). Individual vetting of mutations in highlighted FMREs Mini-BAM files for samples with a variant in the following FMREs were downloaded from PCAWG GNOS servers: chr5: 12948591295726 (TERT promoter), chr14: 21081182-21082486 (chromatin interactions with CCNB1IP1 promoter), chr1: 204475015204476599 (upstream of MDM4), chr6: 52859342-52861236 (chromatin interactions with ICK promoter), chr6: 2787002827871319 (chromatin interactions with ZKSCAN3 promoter), and chr22: 29065413-29065733 (downstream of CHEK2). FMRE variants were manually examined in the IGV software v2.3.97. Variants that were missing or were called within palindromic regions were marked as false positives. Variants were flagged as low confidence if they occurred on one strand (forward or reverse), had fewer than four reads, or were found within a homopolymer run. Variants were highlighted, but not flagged, if they had four supporting reads, only one supporting read on one of the strands, in a low coverage region, or in a region with strand bias. Variants that were found in regions with strand bias and showed strand bias in their supporting reads were highlighted but not flagged. A subset of all mutations (40/197 or 20%) and expression-correlated mutations (3/31 or 10%) were highlighted due to strand bias or low variant allele frequency genome. The fraction of highlighted mutations was similar in the positive control FMRE of the TERT promoter (14/83 or 17%), underlining challenges with lower sequencing coverage in the non-coding genome. The false positive rate is similar to the overall variant calling error rate of 5% of the PCAWG project and lends confidence to the mutations underlying our driver predictions. RNA extraction and reverse transcription Total RNA extraction of HEK293 cells was carried out using RNeasy Plus Mini kit (QIAGEN) according to manufacturer’s instructions. Extracted RNA was quantified using the Qubit fluorometer (Life Technologies) and quality was assessed using the Tapestation 2200 device (Agilent). RNA integrity number (RIN) was obtained for each RNA sample and indicated high quality (RIN range 9.4 to 9.8). Total RNA (2.5 ug) was reverse-transcribed using SuperScriptTM VILOTM cDNA Synthesis Kit (Thermo Fisher) following manufacturer’s protocol. Quantifying transcript abundance of FMRE target genes using RT-qPCR Quantitative reverse transcription PCR (RT-qPCR) was used to compare transcript abundance of CCNB1IP1 in FMRE-KO cells, wildtype HEK293 cells of different passages, and CRISPR-treated HEK293 cells to validate our observations in genomic analysis of patient tumors (Figure 4C). Wildtype and CRISPR-treated HEK293 cells were used as controls. Several passages of wildtype cells were used as controls to account for variation related to passaging. CRISPR-treated control cells were treated with CRISPR and guideRNAs similarly to FMRE-KO cells, grown from single-cell clones, and were subsequently confirmed to have no deletion at the FMRE locus. Gene specific RT-qPCR primers amplifying CCNB1IP1 (forward 50 GACTGCGACCAGAGATCGTG 30 ; reverse 50 CCCTCAGCCTTGCTGAAATTG 30 ) and the reference gene GAPDH (forward 50 ACATCGCTCAGACACCATG 30 ; reverse 50 TGTAGTTGAGGTCAATGAAGGG 30 ) were designed using the NCBI Primer BLAST software(Ye et al., 2012). Additional

Molecular Cell 77, 1–15.e1–e10, March 19, 2020 e8

Please cite this article in press as: Zhu et al., Candidate Cancer Driver Mutations in Distal Regulatory Elements and Long-Range Chromatin Interaction Networks, Molecular Cell (2020), https://doi.org/10.1016/j.molcel.2019.12.027

RT-qPCR primers amplifying the genes CDKN2A (forward 50 GACCTGGCTGAGGAGCTG 30 ; reverse 50 GGGCATGGTTACTGCCT CTG 30 ), HEY1 (forward 50 CGGCTCTAGGTTCCATGTCC 30 ; reverse 50 GCTTAGCAGATCCCTGCTTCT 30 ), CDH1 (forward 50 TTACTGCCCCCAGAGGATGA 30 ; reverse 50 TGCAACGTCGTTACGAGTCA 30 ), PSIP1 (forward 50 CTGGAGACCTCATCTTC GCC 30 ; reverse 50 GGGTGGCTTTACAGCTCCAT 30 ) were designed to validate their differential transcript levels in wildtype HEK293 and FMRE-KO groups as detected by our RNA-seq experiment (see below) (Figure 4F). RT-qPCR reactions were conducted in a 384-well plate using ABI PRISM 7000 Real-Time system (Applied Biosystems). Each reaction was performed in triplicate and in a 20 mL volume containing 1X advanced low-ROX qPCR master-mix with Supergreen (Wisent), 400 nM of each primer and 5 ng of template cDNA. The following cycling conditions were used: 95 C for 2 min, followed by 40 cycles of 95 C for 5 s and 60 C for 30 s. Each experiment included a no-template control. Ct values for amplification of the target genes and the control gene GAPDH in wildtype (n = 3) and FMRE-KO (n = 3) samples were obtained and relative quantities and fold change were calculated using the delta-delta Ct method. The Welch two-sample t test was used to estimate the difference and statistical significance of fold changes between wildtype and FMRE-KO groups. RNA-seq and data analysis of FMRE-KO cells High-throughput RNA sequencing (RNA-seq) was performed on three HEK293 FMRE-KO clonal cell lines and three wildtype HEK293 cell lines of different passages (Figure 4D,E). Sequencing was completed at the core genomics facility of Ontario Institute for Cancer Research (OICR) using standard protocols. RNA-seq libraries for wildtype HEK293 cell lines and FMRE-KO cell lines were constructed using the KAPA Hyper RNA kit (Roche), quantified using the Fragment Analyzer device (Advanced Analytical), and pooled according to concentrations determined by qPCR. Sequencing was performed on the HiSeq 2500 sequencer (Illumina), with all samples pooled on one 2x100 bp high output lane. The raw sequencing dataset was quality-controlled and aligned and FASTQ files were generated using standard data processing pipelines of the OICR genomics core facility. Reads were aligned to the human reference genome GRCh38.p12 downloaded from the Ensembl database (v93). Paired-end alignment was performed using the subread method via the Rsubread package v1.18.0 using default parameters(Liao et al., 2013), resulting in 86%–88% mapped reads per sample. Transcripts were quantified using the function featureCounts using external gene annotations retrieved from Ensembl (Homo_sapiens.GRCh38.78.gtf). Differential mRNA abundance analysis was performed using the limma/voom method (Law et al., 2014) with R packages edgeR v3.10.5 and limma v3.24.15. Genes with low mRNA abundance were filtered such that only genes with above-threshold transcript abundance in at least two samples were maintained in the analysis (counts per million (CPM)>0.5 per sample). 13,994 CDS genes were maintained in the analysis and non-coding genes were filtered. Read counts were normalized and transformed with voom. Differential mRNA abundance analysis of HEK293 FMRE-KO cells relative to wildtype HEK293 cells used empirical Bayes estimation of statistical significance and multiple testing correction. Differentially expressed genes were selected using FDR<0.1 (Figure 4d,e). Pathway enrichment analysis of differentially expressed genes was conducted with the g:Profiler software (Reimand et al., 2007) (version 2018-10-02) (Figure 4G). We used the ordered gene list analysis option in the R package with input genes ranked by statistical significance of differential mRNA abundance (default multiple testing correction method in g:Profiler, Q<0.05). For a conservative analysis, the background set of genes was limited to those detected at above-threshold levels in the RNA-seq experiment (CPM > 0.5 in R 2 samples). We only considered gene sets of Gene Ontology biological processes and Reactome pathways and restricted gene set size (5 % x % 1000) and gene set overlap with differentially expressed genes (x R 3) for conservative interpretation. Results of the analysis were visualized using the enrichment map approach using standard procedures and manually curated to highlight the most relevant unique biological themes and the associated genes from mRNA analysis (Reimand et al., 2019). To validate our findings of mRNA abundance changes in FMRE-KO HEK293 cell lines, we analyzed the reported genes (56 genes with FDR < 0.1) in patient tumor transcriptomics data in PCAWG (Figure 4H). We focused on the kidney cancer cohort (kidney-RCC) and the subset of 117 tumors with matching transcriptomics and WGS data. For each gene, we compared its transcript abundance levels of three FMRE-mutated samples of FMRE 14:21081834 with the transcript abundance levels of the rest of the cohort, using the conservative approach of copy-number adjusted log-linear regression models and ANOVA tests similarly to the discovery of mRNAassociated FMREs described above. We selected the 12 genes with significant differences in transcript abundance in transcriptomic data of patient tumors (unadjusted p < 0.1). Two additional genes were filtered as these showed inconsistent directions of change in mRNA abundance between cell lines and patient tumors (GRHPR, HSPA1L). In total, ten genes with significant and matching mRNA abundance changes between cell lines (FMRE-KO versus WT HEK293 cells) and patient tumors (FMRE-mutated and FMRE-reference tumors) were reported. The lenient p value cutoff was applied in this validation analysis in order to account for the small number of patient tumors with FMRE-mutations and the stringent copy-number adjusted tests used. Cellular proliferation assay Proliferation of wildtype HEK293 and FMRE-KO cell lines over time was measured using alamarBlue reagent (Thermo Scientific) and fluorescent readout on a SpectraMax Gemini EM Microplate Spectrofluorometer microplate reader (Molecular Devices) and SoftMax Pro software (Figure 4I). 500 cells per well were seeded in 96-well plates and grown in DMSO (10% FBS) at 37 C and 5% CO2 incubator. Three biological replicates of FMRE-KO cell lines as well as three wildtype HEK293 control cell lines were used with six 96-well plates per time point in total. Several passages of wildtype cells were used as controls to account for variation related to passaging. Cell numbers were quantified on days 1, 4, 6, 7 and 9 after the start of the experiment. Preliminary experiments

e9 Molecular Cell 77, 1–15.e1–e10, March 19, 2020

Please cite this article in press as: Zhu et al., Candidate Cancer Driver Mutations in Distal Regulatory Elements and Long-Range Chromatin Interaction Networks, Molecular Cell (2020), https://doi.org/10.1016/j.molcel.2019.12.027

showed that the cells reached full confluency by day 10. The alamarBlue reagent was added to the cells in the growth media at each well (10% (v/v)), incubated for 2 h, and fluorescence intensity signal was measured at 570 nm of excitation and 585 nm of emission wavelengths. Results were analyzed by measuring and plotting fluorescence intensity values over time as proxy of cell count. To avoid confounding spatial effects, 36 measurements per plate corresponding to edges and corners of plates were excluded from the analysis, resulting in 60 fluorescence intensity values per cell line and time point. At each time point, we compared fluorescence intensity values between pooled wildtype cells and pooled FMRE-KO cells and estimated statistical significance of such differences using the non-parametric Wilcoxon test. Growth curves were plotted with error bars indicating standard deviation. DATA AND CODE AVAILABILITY The driver discovery method ActiveDriverWGS is freely available as R source code and an R package in the Comprehensive R Archive Network (CRAN) repository. Custom R scripts used for other analyses in the study are available upon request. The source code of ActiveDriverWGS is also available at the GitHub web site https://github.com/reimandlab/ActiveDriverWGSR. Raw RNA-seq data of HEK293 FMRE-KO cell lines are available in the ArrayExpress database (accession number E-MTAB-8631). Somatic variant calls, transcript abundance, and other core data generated by the ICGC/TCGA Pan-cancer Analysis of Whole Genomes Consortium is described in the marker paper (Campbell et al., 2017) and available for download at https://dcc.icgc.org/ releases/PCAWG. Additional information on accessing the data, including raw read files, can be found at https://docs.icgc.org/ pcawg/data/. In accordance with the data access policies of the ICGC and TCGA projects, most molecular, clinical and specimen data are in an open tier which does not require access approval. To access potentially identifiable information, such as germline alleles and underlying sequencing data, researchers will need to apply to the TCGA Data Access Committee (DAC) via dbGaP (https://dbgap.ncbi.nlm.nih.gov/aa/wga.cgi?page=login) for access to the TCGA portion of the dataset, and to the ICGC Data Access Compliance Office (DACO; https://icgc.org/daco) for the ICGC portion. In addition, to access somatic single nucleotide variants derived from TCGA donors, researchers will also need to obtain dbGaP authorization.

Molecular Cell 77, 1–15.e1–e10, March 19, 2020 e10