Insertions and Deletions Target Lineage-Defining Genes in Human Cancers

Insertions and Deletions Target Lineage-Defining Genes in Human Cancers

Article Insertions and Deletions Target Lineage-Defining Genes in Human Cancers Graphical Abstract Authors Marcin Imielinski, Guangwu Guo, Matthew M...

37MB Sizes 0 Downloads 55 Views

Article

Insertions and Deletions Target Lineage-Defining Genes in Human Cancers Graphical Abstract

Authors Marcin Imielinski, Guangwu Guo, Matthew Meyerson

Correspondence [email protected] (M.I.), [email protected]. edu (M.M.)

In Brief Certain highly expressed, lineagedefining genes are targeted by a novel somatic mutational process.

Highlights d

d

Noncoding indels link cellular lineage and cancer Surfactant gene 30 UTR are most significant indel targets in lung adenocarcinoma

d

Lineage-defining genes harbor noncoding indel hotspots in multiple cancer types

d

Hotspot indels associate with AATAATD motif and specific chromatin contexts

Imielinski et al., 2017, Cell 168, 1–13 January 26, 2017 ª 2017 Elsevier Inc. http://dx.doi.org/10.1016/j.cell.2016.12.025

Please cite this article in press as: Imielinski et al., Insertions and Deletions Target Lineage-Defining Genes in Human Cancers, Cell (2017), http://dx.doi.org/10.1016/j.cell.2016.12.025

Article Insertions and Deletions Target Lineage-Defining Genes in Human Cancers Marcin Imielinski,1,2,* Guangwu Guo,3,4 and Matthew Meyerson3,4,5,6,* 1Department of Pathology and Laboratory Medicine, Englander Institute for Precision Medicine, Institute for Computational Biomedicine, and Meyer Cancer Center, Weill Cornell Medicine, New York, NY 10065, USA 2New York Genome Center, New York, NY 10013, USA 3Cancer Program, Broad Institute of Harvard and MIT, Cambridge, MA 02142, USA 4Department of Medical Oncology, Dana Farber Cancer Institute, Harvard Medical School, Boston, MA 02215, USA 5Department of Pathology, Brigham and Women’s Hospital, Harvard Medical School, Boston, MA 02215, USA 6Lead Contact *Correspondence: [email protected] (M.I.), [email protected] (M.M.) http://dx.doi.org/10.1016/j.cell.2016.12.025

SUMMARY

Certain cell types function as factories, secreting large quantities of one or more proteins that are central to the physiology of the respective organ. Examples include surfactant proteins in lung alveoli, albumin in liver parenchyma, and lipase in the stomach lining. Whole-genome sequencing analysis of lung adenocarcinomas revealed noncoding somatic mutational hotspots near VMP1/MIR21 and indel hotspots in surfactant protein genes (SFTPA1, SFTPB, and SFTPC). Extrapolation to other solid cancers demonstrated highly recurrent and tumor-type-specific indel hotspots targeting the noncoding regions of highly expressed genes defining certain secretory cellular lineages: albumin (ALB) in liver carcinoma, gastric lipase (LIPF) in stomach carcinoma, and thyroglobulin (TG) in thyroid carcinoma. The sequence contexts of indels targeting lineage-defining genes were significantly enriched in the AATAATD DNA motif and specific chromatin contexts, including H3K27ac and H3K36me3. Our findings illuminate a prevalent and hitherto unrecognized mutational process linking cellular lineage and cancer. INTRODUCTION Large-scale sequencing of human tumor samples has implicated unexpected pathways and mutational processes in carcinogenesis (Garraway and Lander, 2013; Vogelstein et al., 2013). The growing power of whole-genome sequencing now enables the discovery of significantly altered loci in noncoding sequences. Examples include the common mutation of the TERT promoter in many cancer lineages (Horn et al., 2013; Huang et al., 2013), alterations of the TAL1 super-enhancer in acute T-lymphoblastic leukemia (Mansour et al., 2014), and the identification of MYC and KLF5 enhancer duplication in multiple epithelial cancer lineages (Zhang et al., 2016). The vast majority of analytic efforts to nominate mutation hotspots in cancer genomes using statistical approaches have

focused on protein-coding regions in whole-exome capture data. The 98% of the genome that does not code for proteins includes transcribed but untranslated exons of genes, introns, and noncoding regulatory genetic elements, some of which may harbor clinically important and targetable DNA alterations (Khurana et al., 2016; Weinhold et al., 2014). For whole-exome analysis, statistically calibrated approaches such as InVEx (Hodis et al., 2012) and MutSigCV (Lawrence et al., 2014) have been developed to correct for the mutational heterogeneity that otherwise results in the nomination of spurious hotspots in late-replicating and poorly expressed genes. Such calibration is important in the analysis of tumor types harboring high burdens of neutral mutations, such as lung cancer and melanoma (Hodis et al., 2012; Imielinski et al., 2012). This challenge becomes particularly daunting with whole-genome sequencing analysis, where the number of hypotheses (i.e., candidate regions) is large while the number of samples is small, relative to whole-exome capture data. Lung adenocarcinoma is the most common type of lung cancer and a prototype for precision oncology (Pao and Hutchinson, 2012). Though several large-scale sequencing studies have resolved the landscape of recurrent coding alterations in lung adenocarcinoma in significant detail (Ding et al., 2008; Govindan et al., 2012; Imielinski et al., 2012; Kan et al., 2010; Cancer Genome Atlas Research Network, 2014a), the analysis of whole genome sequences has thus far revealed only rare noncoding mutations (Weinhold et al., 2014). In the present study, we analyzed whole genome sequences of lung adenocarcinoma using a somatic burden test based on Gamma-Poisson regression (Hilbe, 2014) for analysis of both insertion/deletion (indel) and single nucleotide variant (SNV) somatic mutations. To our surprise, in addition to alterations of known cancer genes and a noncoding mutation hotspot near MIR21 and VMP1, we found recurrent somatic indel mutations in noncoding regions of surfactant protein genes, the major transcriptional product of type II pneumocytes in the lung. Through statistical analysis of whole genome sequences across a diverse collection of cancers, we found that other tumor types harbor similarly prevalent hotspots of noncoding somatic indel mutations, targeting a class of lineage-defining genes. These highly expressed genes define cell types that play essential biosynthetic roles in the physiology of their respective organs and (in the majority of Cell 168, 1–13, January 26, 2017 ª 2017 Elsevier Inc. 1

Please cite this article in press as: Imielinski et al., Insertions and Deletions Target Lineage-Defining Genes in Human Cancers, Cell (2017), http://dx.doi.org/10.1016/j.cell.2016.12.025

n tiled genome-wide hypotheses

A

(10-kb intervals, 500-bp spacing)

Hypothesis i of 1, ..., n

Significance model Significance of i:

Somatic WGS variants from 79 lung adenocarcinomas

Genomic covariates, ci j

Observed count:

Eligible territory width, wi

Expected count:

Genome-wide peak definition

B 12

STK11

KRAS

10 8

VMP1, mir21

6

Indel 14

TP53

Observed –log10(P)

Observed –log10(P)

C

SNV 14

4 2 0

SFTPB

SFTPC, BMP1

12 10

SFTPA1

8 6 4

VMP1, mir21

TP53 EGFR

2 0

0

2

4

6

8

10

12

14

Expected –log10(P)

0

2

4

6

8

10

12

14

Expected –log10(P)

Figure 1. Genome-wide Scan for Noncoding Somatic Mutations in Lung Adenocarcinoma (A) Schematic of the Gamma-Poisson regression model used to model genomic variation in mutation density and identify noncoding somatic mutation hotspots in lung adenocarcinoma. The model is applied to n overlapping intervals representing a genome-wide hypothesis set by identifying a combination of k genomic covariate weights aj ; j ˛f1; .; kg and shape parameter q that is most likely to fit the observed mutation count data yi ; i ˛f1; .; ng: and the k covariate values cji ; j ˛f1; .; kg. p values are computed by identifying the probability of observing a mutation count greater than or equal to the observed mutation count yi under the Gamma-Poisson distribution given the expected value mi and shape parameter q. The model was fit using k = 8 covariates (STAR Methods, see Figure S1 for more detailed schematic). (B and C) (B) Quantile-quantile plots showing p values for SNV and (C) indel densities across 79 lung adenocarcinoma WGS cases. In these plots, l refers to the slope of the y = lx line fitting --log10 quantiles of observed p values and the uniform distribution. See also Figure S1 and Table S1.

cases) represent the precise cell of origin for the respective cancers. The frequent indel mutation of this gene class is a previously undescribed feature of cancer genomes and is likely to inform our understanding of the mutational processes and molecular pathogenesis of human cancers. RESULTS Recurrent Mutation Hotspots in Lung Cancer Genomes To identify regions of the genome under positive somatic mutational selection in lung cancer, we analyzed whole-genome sequencing reads from 79 lung adenocarcinoma tumor-normal

2 Cell 168, 1–13, January 26, 2017

pairs (Imielinski et al., 2012; Cancer Genome Atlas Research Network, 2014a). These cases comprise predominantly early stage and treatment-naive surgical resection specimens. We tallied mutation calls across a genome-wide hypothesis set of 2.823 million overlapping 10 Kbp intervals across 2.429 Gbp of eligible genomic territory to identify candidate noncoding hotspots in the genome (Figure 1A). We applied Gamma-Poisson regression (Hilbe, 2014) to statistically account for the regional genomic heterogeneity in neutral somatic variant densities (Lawrence et al., 2013; Polak et al., 2015; Schuster-Bo¨ckler and Lehner, 2012) by modeling counts of patients harboring indel or SNV in each interval as a function of eligible territory width and eight genomic covariates (Figure 1A and STAR Methods). Given this model, we computed p values for mutation enrichment in each interval i with mutation count yi as the probability PðYRyi j mi ; qÞ, using the expected mutation count mi and shape parameter q from the Gamma-Poisson regression fit (Figures 1A and S1A). Quantile-quantile (Q-Q) plots can be used to demonstrate how well a set of genome-wide P-values approximate a uniform distribution through the calculation of a genomic inflation factor, which is the slope l of a line y = lx that fits a set of log10 transformed observed and expected quantiles (Pearson and Manolio, 2008). Algorithms like InVEx (Hodis et al., 2012) and MutSigCV (Lawrence et al., 2014) demonstrate statistically calibrated Q-Q plots (l near 1) when applied to whole-exome sequencing data but are not readily adaptable to whole-genome analysis. Application of our Gamma-Poisson regression model yielded p values closely aligned to the uniform distribution, as demonstrated by Q-Q plots with l values at 1.01 for SNV and 1.00 for indel analyses (Figures 1B and 1C). The resulting most significant loci are shown in Table S1A (SNV) and Table S1B (indel). Loci significant for SNVs (FDR < 0.1, Figure 1B) corresponded to known coding mutation driver alterations in TP53 (p = 3.6 3 1014), STK11 (p = 3.6 3 108), and KRAS (p = 3.6 3 107), supporting the relevance of this analysis (Table S1A and Figure 1B). Among the significant SNV loci was a previously uncharacterized noncoding mutation hotspot overlapping VMP1 and MIR21 (Figures 1B, 2A, and Tables S1A and S1B). MIR21 encodes a microRNA whose overexpression has been linked to tumorigenesis in lung cancer and other tumor types (Seike et al., 2009). This locus was recently shown to be recurrently amplified in lung adenocarcinoma (Campbell et al., 2016). VMP1 encodes a vacuolar membrane protein that is functionally linked to autophagy and has been found to be recurrently rearranged in several cancer types, including breast and esophageal cancer (Blum et al., 2016; Inaki et al., 2011). Interestingly, the VMP1/MIR21 locus was also the ninth most highly ranked hotspot in the indel analysis (Figure 1C), though it did not pass the FDR threshold of 0.1 (p = 6.4 3 106). The SNV and indels contributing to these hotspots did not affect any coding positions of VMP1 or MIR21, but clustered in a nearly 40 Kbp region of open chromatin and H3K27ac (Figure 2A), as profiled in a lung adenocarcinoma cell line A549. Samples harboring indels or SNVs in this locus demonstrated significantly higher levels of MIR21 expression, while there was no significant association of mutation status with expression of nearby protein-coding genes (TUBD1, VMP1) (Figure 2B).

Please cite this article in press as: Imielinski et al., Insertions and Deletions Target Lineage-Defining Genes in Human Cancers, Cell (2017), http://dx.doi.org/10.1016/j.cell.2016.12.025

A

Figure 2. Lung Adenocarcinoma Harbor Noncoding Mutational Hotspot at MIR21 (A) Genomic track plot of SNV and indel hotspots identified in a whole-genome scan of lung adenocarcinoma in the vicinity of MIR21. Tracks represent RNA-seq and chromatin data from ENCODE for lung adenocarcinoma cell line A549. (B) Violin plots show mutant versus WT expression data for three genes in the vicinity of the nominated mutation hotspot (MIR21, TUBD1, VMP1). See also Figure S2.

B

Lung Cancer Indels Target Surfactant Protein Genes The three most significantly altered loci in the indel analysis of lung adenocarcinomas (Figure 1C and Table S1B) overlapped the genes SFTPB (p = 1.8 3 1014), SFTPA1 (p = 4.8 3 1010), and SFTPC/BMP1 (p = 1.3 3 107). SFTPB, SFTPA1, and SFTPC encode surfactant proteins that are specific markers of type II pneumocytes in the normal lung, where they help generate the surface tension required to maintain open air spaces (Haagsman and Diemel, 2001). Though germline mutations in SFTPA1 have been recently linked to adult lung cancer (Nathan et al., 2016), none of these genes have previously been implicated as somatic mutational targets in lung cancer or other tumor types. Surfactant proteins harbor divergent domain structures (SFTPB – 4 Saposin domains, SFTPC – BRICHOS domain, SFTPA1 – Collagen-like

and C-type lectin domains) and are not paralogs, though they contribute to the same pathway (Haagsman and Diemel, 2001). The corresponding genomic loci (which we will refer to as SFTP loci) lie on different chromosomes (2, 10, and 8, respectively) and consist of early replicating regions with average GC content (Figure S1B). Upon visual inspection, the somatic mutations in these loci were supported by multiple tumor-DNA-specific alignments of high mapping quality, without strand bias or an excess of mismatches. Furthermore, loci harboring these mutations were not enriched in alignments with low allele frequency mismatches or low mapping quality reads (Table S1B). Though SFTP loci were highly enriched in the indel analysis, they showed no significant deviation from background with respect to SNV density (Figure S1B). Other loci in the indel analysis passing a false discovery threshold of 0.1 were TP53 (p = 8.8 3 107) and MYO5C (p = 1.8 3 106) (Table S1B). Indel density in EGFR was also nominally enriched (p = 2.7 3 106), though it did not pass the genome-wide false discovery threshold (FDR = 0.12). Unlike with SFTP loci, indels within TP53 and EGFR hotspots comprised exclusively protein-coding variants. Overall, somatic indel mutations in SFTP loci were found within or near the 30 UTRs of SFTPB, SFTPA1, and SFTPC (Figure 3A–3C). In total, 18/79 (23%) of lung adenocarcinomas harbored one of 21 indels occurring at SFTPB (10 cases), SFTPA1 (5 cases), or SFTPC/BMP1 (4 cases). Among SFTPBassociated indels, only one mutation was predicted to alter transcript structure through the perturbation of a splice site, while the majority were located within (8/11 events) or downstream (2/11 events) of the annotated SFTPB 30 UTR. Events associated with SFTPC (5 events) and SFTPA1 (5 events) loci occurred exclusively at or downstream of the annotated 30 UTR of the respective gene. In summary, the vast majority of SFTP indel mutations (19/21) occurring at bases at or downstream of the 30 UTR were not covered by standard exome capture, as demonstrated

Cell 168, 1–13, January 26, 2017 3

Please cite this article in press as: Imielinski et al., Insertions and Deletions Target Lineage-Defining Genes in Human Cancers, Cell (2017), http://dx.doi.org/10.1016/j.cell.2016.12.025

A

B

D

C

E

Figure 3. Surfactant Protein Loci 30 UTR Are the Most Significant Somatic Indel Hotspots in Lung Adenocarcinoma Genomes (A–C) Lollipop plots demonstrating the genomic distribution of mutations in (A) SFTPB, (B) SFTPA1, and (C) SFTPC at two scales of resolution. Gray rectangles shown above the lollipops indicate top peak regions in the significance analysis. Model fits and significance plots are shown in Figure S1B and Table S1C. (D) Expression values from 2,917 normal human tissue samples from the GTEx database for SFTPA1, SFTPB, and SFTPC. Tissue types on x-axis are abbreviated as follows: Adipose Tissue (AD), Adrenal Gland (AG), Bladder (BD), Blood (BL), Brain (BR), Breast (BS), Blood Vessel (BV), Cervix Uteri (CE), Colon (CO), Esophagus (ES), Fallopian Tube (FT), Heart (HA), Kidney (KI), Liver (LI), Lung (LU), Muscle (MU), Nerve (NE), Ovary (OV), Pancreas (PA), Pituitary (PI), Prostate (PR), Salivary Gland (SA), Skin (SK), Small Intestine (SM), Spleen (SP), Stomach (ST), Testis (TE), Thyroid (TH), Uterus (UT), and Vagina (VA). (E) Histograms of mutation frequencies of SFTPA1, SFTPB, and SFTPC somatic indels in lung adenocarcinoma and 12 other tumor types comprising 487 WGS sequenced tumor-normal pairs. See also Figure S3 and Table S2.

by the analysis of aggregate base-resolution coverage profiles across more than 1,000 lung cancer exome sequences (Campbell et al., 2016). SFTP indels were not significantly associated with known DNA alterations in lung adenocarcinoma or smoking status (Figures S2A–S2D), nor with cis gene expression, splicing, or methylation at the respective SFTP gene or other neighboring genes (Figures S2E and S2F). However, we observed significant gene expression differences at the pathway level when comparing RNA-seq profiles of SFTP mutant and wild-type (WT) lung adenocarcinomas (Table S2 and Figure S3A), including significant upregulation of peptide chain elongation (p = 9.8 3 107), 30 UTR-mediated translational regulation (p = 9.8 3 105), mitochondrial fatty acid beta oxidation (p = 1.6 3 104), and respiratory electron transport (p = 1.8 3 104) pathways. These pathway changes were comparable in magnitude to those observed for an analysis comparing TP53 mutant versus WT transcriptomes in the same dataset, which yielded pathways with established roles in TP53 biology, such as DNA replication, DNA recombination, and cell cycle control (Figure S3B). SFTPB, SFTPA1, and SFTPC demonstrate striking lung-specific expression in Genotype-Tissue Expression (GTEx) data

4 Cell 168, 1–13, January 26, 2017

(Figure 3D) (Mele´ et al., 2015), obtained from healthy human tissues. To examine whether the corresponding somatic indel patterns were tissue specific, we analyzed whole genome sequences from 487 tumors representing 12 additional tumor types. Scanning these genomes for somatic variants in the three SFTP loci, we only found two additional samples harboring SFTP indels across these 487 tumors: one in a poorly differentiated, late-stage lung squamous cell cancer and one in a gastric adenocarcinoma. This represents a 25-fold enrichment (95% CI: [13.2, 47.4]) in lung adenocarcinoma versus other tumor types, even after correcting for sample-specific variations in indel density (p = 5.6 3 1023, Wald test, logistic regression) (Figure 3E). Multi-cancer Analysis of Mutation and Lineage Given the specificity of SFTP gene expression to lung tissue and SFTP somatic indels to lung adenocarcinoma, we hypothesized that other tumor types might harbor similar noncoding indel enrichment at highly expressed and lineage-specific genes. Through semi-supervised analysis (see STAR Methods and Figures S4A and S4B) of 2,917 GTEx samples spanning 30 normal tissues (Mele´ et al., 2015), we identified three clusters of genes

Please cite this article in press as: Imielinski et al., Insertions and Deletions Target Lineage-Defining Genes in Human Cancers, Cell (2017), http://dx.doi.org/10.1016/j.cell.2016.12.025

A

Tumor 1 Tissue A

HK

60

ML

47 126

RPKM

RPKM ≥ 1000 in ≥ 1 tissue

LS 20,345 genes

233 genes

Tumor 2 Tissue B

Tumor 3 Tissue C

Tumor 4 Tissue B Mutation density comparisons between housekeeping (HK), multi-lineage (ML), and lineagespecific (LS) gene territories within a given (EN or EF) tumor context

EN EF

EN EF

EN EF

EN EF

Mutation density comparisons between expression-native (EN) vs. expression-foreign (EF) tumor contexts within a given (HK, ML or LS) gene class

B

P = 0.29 P = 4.8 × 10–24

C

P = 0.26 P = 1.5 × 10–4

Relative variant density

P = 3.4 × 10–35

P = 1.7 × 10–7

100

Expressionnative 100

1

1

P = 0.19

P = 0.044

P = 9.5 × 10–70

Expressionforeign 100

1

1

ML

LS

Expressionnative

P = 0.83

100

HK

HK (housekeeping) ML (multi-lineage) LS (lineage-specific)

P = 0.15

HK (housekeeping) ML (multi-lineage) LS (lineage-specific)

P = 0.15

Expressionforeign

HK

Indel

ML

LS

SNV

Figure 4. Somatic Indels Target Lineage-Specific Genes in an Expression-Native Context (A) Schematic of analysis of mutation density and lineage context across highly expressed genes and tumors from multiple tissue types. 233 highly expressed genes were identified from analysis of GTEx (Mele´ et al., 2015) and clustered into housekeeping (HK), multi-lineage (ML), and lineage-specific (LS) categories (see STAR Methods and Figures S4A and S4B and Table S3A for details). Mutation densities were then compared in tumors where a given gene territory was expression-native (EN) or -foreign (EF) on the basis of that tumor’s tissue of origin (Table S3A). (B and C) (B) Violin plots comparing indel and (C) SNV densities in LS/ML/HK gene territories in tumor types in EN and EF tumor contexts across 487 WGSsequenced cases across 12 (non-lung adenocarcinoma) cancer types. p values (Wald test, see STAR Methods) above the plot represent pairwise EN comparisons, while p values in the middle represent EN versus EF comparisons within a single LS, ML, or HK territory class. See also Figure S4 and Table S3.

with > 1,000 RPKM median expression in at least one tissue, comprising 60 housekeeping genes (e.g., RPL13, HLA-B), 47 multi-lineage genes (e.g., APOD, IGFBP7), and 126 lineage-specific genes (INS, CYP2E1, PGA5). We examined somatic mutation densities (normalized to average per-sample mutation density) in these genomic territories (gene ± 10 Kbp flanking sequence) across 487 whole-genome sequenced samples spanning 12 cancer types other than lung adenocarcinoma. For each cancer type, we identified lineage-specific, multi-lineage, and housekeeping gene territories that were either native or foreign to that cancer’s lineage context based on their expression in healthy GTEx tissues (Table S3A). We compared variant densities in each tumor context (expression-native versus -foreign) and territory class (lineage-specific versus multi-lineage versus housekeeping) and evaluated group differences using Gamma-Poisson regression (Figure 4A and Table S3B). We found that lineage-specific gene territories were 14.3-fold (95% CI: [10.7, 19.2], p = 9.5 3 1070, Wald

test, Figure 4B) enriched in indels in the expression-native versus -foreign tumor context (Figure 4B). Lineage-specific territories were also significantly enriched in indels relative to both multi-lineage (p = 4.7 3 1024) and housekeeping (p = 1.8 3 1035) territories when examining only expression-native tumor contexts (Figure 4B). In contrast, following Bonferroni correction for six comparisons (p < 0.0083, adjusted p < 0.05), there was no significant expression-native versus -foreign enrichment for indel densities at multi-lineage (p = 0.044) or housekeeping territories (p = 0.188) or between multi-lineage and housekeeping territories in expression-native tumors (p = 0.29) (Figure 4B). There was no significant difference in the density of SNVs between expression-native and -foreign tumors at lineage-specific genes (p = 0.146, Figure 4C). There was a significant, albeit modest (< 2-fold), SNV enrichment in lineage-specific versus multi-lineage genes (p = 1.7 3 107, RR: 1.47, 95% CI: [1.27, 1.70]) and lineage-specific versus housekeeping genes (p = 1.5 3 104, RR: 1.30, 95% CI: [1.13, 1.48]) (Figure 4C and Cell 168, 1–13, January 26, 2017 5

Please cite this article in press as: Imielinski et al., Insertions and Deletions Target Lineage-Defining Genes in Human Cancers, Cell (2017), http://dx.doi.org/10.1016/j.cell.2016.12.025

A

B

C

D

Figure 5. Noncoding Lineage-Specific Indels Target Albumin, Thyroglobulin, and Gastric Lipase in Gastric, Thyroid, and Liver Cancer (A–C) Lollipop plots demonstrating the genomic distribution of (A) ALB, (B) TG, and (C) LIPF hotspot somatic indels within and outside of their expression-native context (hepatocellular carcinoma, thyroid cancer, and gastric cancer, respectively). (D) Expression values from 2,917 normal human tissue samples and 30 tissues for ALB (albumin), TG (thyroglobulin), and LIPF (gastric lipase) obtained from GTEx. See Figure 3 legend for tissue type abbreviations. See also Figures S5 and S6 and Table S4.

Table S3C) in native tumors. These results suggest a mutational process selectively generating indels at lineage-specific genes within an expression-native tumor context. Four of 11 non-lung adenocarcinoma tumor types demonstrated significant expression-native versus -foreign enrichment (adjusted p < 0.05) of indel density in lineage-specific gene territories: hepatocellular carcinoma (LIHC, p = 2.8 3 1015), gastric adenocarcinoma (STAD, p = 6.8 3 1010), papillary thyroid carcinoma (THCA, p = 5.8 3 108), and cutaneous melanoma (SKCM, p = 0.0043), following Bonferroni correction (adjusted p < 0.05) for 11 hypotheses (Figure S4C and Table S3B). (Bladder tissue was not assigned any lineage-specific genes; hence, bladder cancer could not be evaluated in these analyses.) We probed the 233 highly expressed genes to identify those preferentially mutated between native and foreign tumor contexts. We identified five genes with significant indel enrichment (Bonferroni adjusted p < 0.05, 52 hypotheses, Wald test, logistic regression, Figure S4D and Table S3D): ALB in hepatocellular carcinoma (p = 2.1 3 1024), ALDOB in hepatocellular and kidney clear cell carcinoma (p = 3.6 3 107), and FGG in hepatocellular carcinoma (p = 4.0 3 106), ‘TG in thyroid carcinoma (p = 1.3 3 1013), and LIPF in gastric adenocarcinoma (p = 6.7 3 1013) (Figures 5A–5C). As with SFTP genes, all of these genes were strongly and preferentially expressed in the presumed tissue of origin of the cancer type in which they were found to be mutated (Figure 5D). ALB encodes albumin, which is the most abundant protein in human plasma and is synthesized primarily by hepatocytes

6 Cell 168, 1–13, January 26, 2017

(Farrugia, 2010). ALB was targeted by indels in 41% (22/54) of hepatocellular carcinoma cases and showed 17.6-fold enrichment of indels (95% CI: [10.2, 30.6]) in hepatocellular carcinoma versus other tumor types, mirroring its tissue-specific expression pattern (Figure 5A and Figure S4D). A recent whole-genome sequencing study in an independent Japanese hepatocellular carcinoma cohort found ALB mutations in 14% of 268 patients (Fujimoto et al., 2016). TG encodes thyroglobulin, a protein produced by follicular cells of the thyroid that alone comprises more than half of that organ’s mass (Boron and Boulpaep, 2008). TG is specifically expressed in thyroid tissue and was targeted by indels in 43% (20/47) of thyroid carcinoma samples, representing a 9.03-fold enrichment in thyroid carcinomas relative to other tumor types (95% CI: [5.04, 16.2]) (Figures 5B and S4D). LIPF, encoding gastric lipase, is a secretory product of chief cells that reside in the antral and fundic gastric mucosa (Roussel et al., 1999). This gene was targeted by indels in 23% (9/39) of gastric adenocarcinoma samples and showed 15.5-fold enrichment of indels (95% CI: [7.34, 32.7]) in gastric cancer versus other tumor types (Figures 5C and S4D and Table S3D). ALB and TG showed significant, but more modest (3- to 5-fold) enrichment in SNV density between native and foreign tumors (Figure S4D and Table S3D). As with SFTP locus mutations, ALB, TG, and LIPF indels predominantly target noncoding sequences (Figure 5A–5C and Table S4A). Tables of indel mutation calls at LIPF and TG loci with image links to genome viewer variant snapshots are provided as Table S4A.

Please cite this article in press as: Imielinski et al., Insertions and Deletions Target Lineage-Defining Genes in Human Cancers, Cell (2017), http://dx.doi.org/10.1016/j.cell.2016.12.025

Noncoding mutations occurring in introns and untranslated regions of genes may exert biological effects by perturbing splicing, gene expression, and methylation, among other mechanisms. We tested cis associations between tumor gene expression or splicing and somatic indel status in loci nominated in these analyses (ALB, FGG, ALDOB, LIPF, TG) and did not find any significant associations (Figures S5A–S5J). Testing methylation differences between mutant and WT samples, we observed a significant reduction of ALB methylation in liver cancers harboring noncoding ALB indels (p = 9.0 3 105), but not at other genes in the genomic vicinity of the ALB locus (Figure S5K). There were no associations between cis methylation and indel status across the other indel hotspots. As with SFTP mutations in lung adenocarcinomas, we observed significant pathway differences when comparing transcriptomes of samples that were mutant versus WT for lineage indel hotspot mutations (Table S2). In liver cancer (LIHC), we found significant upregulation of fatty acid metabolism (p = 3.1 3 105), glycine, serine and threonine metabolism (p = 8.5 3 105), and numerous amino acid and redox metabolism pathway differences between tumors that harbored noncoding indels in ALB, FGG, and ALDOB versus tumors that were WT for all of these loci (Figure S6A). In thyroid cancer, we found significant downregulation of immune pathways in tumors harboring TG noncoding indels versus WT tumors, marked by downregulation of HLA expression (allograft rejection, graft versus host disease pathways), CTLA4 pathway, and PD1 signaling pathway (Figure S6B). The common theme among these observed gene set changes is not immediately clear. Lineage-Specific Indels Are Enriched in Specific Chromatin Contexts To examine the topographic context of lineage indel hotspots, we analyzed somatic variant data with respect to tumor-tissue matched (Tables S5A and S5B) chromatin features identified by ENCODE (ENCODE Project Consortium, 2012), the Epigenomics Roadmap (Roadmap Epigenomics Consortium et al., 2015), and other annotations (see STAR Methods for details). For a given feature, such as H3K4me3, we used Gamma-Poisson regression to examine the enrichment of indels in peak versus non-peak associated subsets of the hotspot-associated territory, comparing this enrichment to that associated with other highly expressed genes and background (1,000 randomly chosen genes) in 100 bp, 1 Kbp, and 10 Kbp windows defined around the peak region. Applying 440 analyses across 110 chromatin marks yielded 64 significant associations of enrichment or depletion across 25 epigenetic features following Bonferroni correction (Figure 6A and Table S5C). We detected several chromatin marks with indel density enrichment in hotspot regions relative to highly expressed genes or background, most significantly H3K79me2, H420me1, H3K36me3, H3K4me3, and H3K27Ac. H3K27ac and H3K36me3 were associated with indel mutations across all four tumor types at a window width of 1 Kbp or greater (Figure 6B) and were enriched in their indel densities relative to both highly expressed genes and background (Figure 6C). The strongest ChIP-seq signal for these chromatin marks was found in the lung (A549 cell) and liver cancer (HepG2 cell) models,

though thyroid (‘‘Fetal adrenal’’) and gastric (‘‘Gastric digestive’’) cancer models also contributed to this association (Figure 6D). Other enrichment signals were driven either by a single tumor type—in particular, several transcription factor binding site associations (NRSF, TCF4) and Pol2 that were driven by liver cancer and HepG2 cells. Conversely, several chromatin marks were associated with significant indel density depletion in lineage-indel hotspot regions, including H2BK5ac and H2BK15ac, compared to both other highly expressed genes and background (Figure S7). In addition to ENCODE features, we examined hotspot-specific indel depletion or enrichment in the vicinity of nuclear compartment transitions, including TAD boundaries (Jin et al., 2013), loop domains (Rao et al., 2014), and alternate polyadenylation sites (APA) (You et al., 2015). We found significant enrichment of indels within 10 Kbp of loop domains, previously defined (Rao et al., 2014) through high-resolution Hi-C analyses as sites demarcating nuclear compartments. This association was driven by a highly significant depletion of somatic indels in loop domain neighborhoods in background genes (p = 4.2 3 1019), which was absent in both highly expressed genes (p = 0.30) and hotspot genes (p = 0.22) (Figure S7C). We also found significant enrichment of indels in hotspot regions within 1 Kbp of APA sites, but this was not significant relative to other highly expressed genes. Instead, this pattern was driven by both groups having an elevated APA-associated indel density relative to background (Figure S7D). Though these associations may not be relevant to the biology of hotspot loci, the relationship of indel densities with expression, loop domains, and APA regions appears significant and may warrant further investigation. The enrichment of recurrent somatic indels in loci encoding highly transcribed and lineage-specific genes in lung, gastric, and thyroid cancer suggests the signature of a somatic mutational process shaping the cancer genome. One candidate for such a process is transcription-associated mutagenesis (TAM), which creates indels that expand or contract polynucleotide repeats following the collision of replication and transcription machinery at highly transcribed genes (Jinks-Robertson and Bhagwat, 2014). TAM has been previously invoked as a hypothetical mechanism for somatic mutation enrichment in COL2A1, a highly expressed and cartilage-specific gene, in chondrosarcoma (Tarpey et al., 2013). To investigate this hypothesis, we probed the immediate sequence context of somatic indels targeting highly expressed genes across the 13 tumor datasets (including lung adenocarcinoma). We found that the inserted or deleted sequences associated with lineage-specific hotspot indels were not significantly more likely to expand or contract a repeat than non-hotspot events (p = 0.12, Fisher’s exact test, Tables S4B and S4C). Though lineage-associated indels were not associated with an enrichment of this particular TAM signature, a large fraction (R 90%) of noncoding somatic indels involved a repeat contraction or expansion whether or not they were contained inside a lineage-specific hotspot. We did not identify any significant G versus C or A versus T transcriptional strand bias either within or around inserted or deleted sequences. We did, however, find significant enrichment of AT (versus GC) bases associated with hotspot indels, both within deleted sequences (p = 7.7 3 105, Cell 168, 1–13, January 26, 2017 7

Please cite this article in press as: Imielinski et al., Insertions and Deletions Target Lineage-Defining Genes in Human Cancers, Cell (2017), http://dx.doi.org/10.1016/j.cell.2016.12.025

10–19

10–12

Depleted feature

0 100 1000 10000

NRSF H3K79me2 H4K20me1 H3K36me3 Pol2 H3K4me3 H3K27ac TCF4 Other

H2BK5ac H2BK15ac H2AK5ac H2BK120ac H3K4ac HNF4G (SC−6558) H4K8ac H4K91ac Other

B

Window

Enriched feature

Tumor type

DNase

LUAD LIHC THCA STAD

H3K27ac H3K36me3

Direction

Features

Significance

10

–26

Depleted Enriched Not significant

H3K4me1 H3K4me2 H3K4me3

C Relative indel density

A

H3K27ac (10 kb; P = 1.1 × 10–17)

P = 0.37

1000

10

Non−peak Peak

Background

H3K56ac

H3K9ac H4K20me1 0.1

1.0

0

10.0

100

Effect size

D H3K27Ac H3K36me3

1000 10,000

Window

LUAD

Non−peak Peak

Non−peak Peak

Highly expressed

Hotspot

H3K36me3 (10 kb; P = 5.5 × 10–18)

H3K79me2

–5

Relative indel density

10

P = 1.2 × 10–17 P = 0.54 P = 7.9 × 10–8 P = 0.12 P = 1.8 × 10–16

LIHC

P = 4.2 × 10–24 P = 0.00068 P = 2.9 × 10–6 –19 P = 0.89 P = 0.00054 P = 1.2 × 10

1000

10

Non−peak Peak

Background

Non−peak Peak

Non−peak Peak

Highly expressed

THCA

Hotspot

STAD

A549 (E114)

HepG2 (E118)

Fetal etal adrenal adre adren adrena r gland gland nd (E080) (E080 (E08 (E E08 0

Gastric digestive (E094)

A549 (E114)

HepG2 (E118)

Fetal Feta etal a adrenal adrena adren adre r gland gla d (E080) glan (E080 (E08 E08 E 08

Gastric digestive (E094)

10 kb

10 kb

10 kb

10 kb

10 kb

10 kb

10 kb

Indels Genes 85.87 85.88 85.89 85.90 85.91

Chr 2

22.00 22.01 22.02 22.03 22.04

Chr 8

74.25

74.27

74.29

Chr 4

155.51

155.53

155.55 104.17

Chr 4

104.19

Chr 9

104.21

133.9

134.0

Chr 8

134.1

90.41

90.42

90.43

90.44

90.45

Chr 10

Figure 6. Hotspot Indels Associate with Specific Epigenomic Contexts Topographic feature enrichment analysis. (A) Volcano plot demonstrating top enriched and depleted topographic features, including chromatin marks, transcription factor binding sites, and loop domain annotations. Indel densities in peak-associated regions (within 0, 100 bp, 1 Kbp, 10 Kbp of a peak) were evaluated using Gamma-Poisson regression at hotspot genes versus other highly expressed genes and a panel of 1,000 randomly chosen genes. (B) Matrix demonstrating mutational support for significantly enriched features stratified by tumor type and window distance. (C) Violin plots demonstrating supporting data for two of the most significantly enriched features in the Roadmap Epigenomics dataset. The interaction p value (Wald test) is shown at the top, and additional p values associated with individual pairwise comparisons are indicated using rectangular connectors. (D) Genomic track plots demonstrating peaks (red ranges) and high-resolution (-log10 P value of ChIP-seq enrichment) signals for H3K27Ac and H3K36me3 near select mutational hotspots in lung, liver, thyroid, and gastric cancer. See STAR Methods for analytic details. See also Table S5.

Wilcoxon rank-sum test, Table S4C) and their 5-base genomic neighborhoods (p = 2.55 3 104), though the biological significance of this enrichment is unclear. A recent bacterial study has proposed that replication-transcription collision is a mechanism of indel mutagenesis (Sankar et al., 2016). This mechanism is thought to be distinct from TAM, though it may be used to explain transcription-associated indel events. To probe signatures of replication-transcription collisions, we examined indel patterns as a function of replication and transcription strands. A recently published study by Haradhvala et al. (2016) examined strand asymmetry of SNVs across many cancer types and ascertained replication direction through the analysis of replication-timing datasets. We applied Haradhvala et al. (2016) annotation to examine indel patterns in highly expressed genes. Specifically, we classified genic regions as being either ‘‘co-directional’’ (if the associated strand of transcription matched the strand of replication) or ‘‘head-on’’ (if the transcription strand was opposite to the replication strand), where ‘‘+’’ (vs. ‘‘’’) transcripts were considered co-directional (vs. head-on) to right- (vs. left-) replicating genomic regions. Examining mutation patterns across highly expressed genes in 13 can-

8 Cell 168, 1–13, January 26, 2017

cers, we found significant indel enrichment in ‘‘head-on’’ gene territories in the native expression context (p = 0.0035) (Figure S7E). This asymmetry suggests that expression-associated indels may be the result of head-on replication-transcription collisions; however, the degree of this enrichment is mild and likely does not account fully for the lineage-specific indel phenomenon driving observed indel hotspots. Lineage-Specific Indels Arise near a Motif, AATAATD To examine previously undescribed sequence contexts that might be associated with lineage-specific indel hotspots, we examined sequence features in the 50-base neighborhood of highly expressed indels. Applying the DREME algorithm (http:// meme-suite.org/), we discovered significant enrichment of an AATAATD/HATTATT (E = 5.9 3 1010) motif in the vicinity of hotspot- versus non-hotspot-associated indels (Figure 7A). (In this motif notation, D and H are IUPAC ambiguity codes describing and ‘‘A, G, T, not C’’ and ‘‘A, C, T, not G,’’ respectively.) The motif was present in 38/107 (35%) of hotspot-associated indels but only 19/355 (5.3%) of non-hotspot-associated indels (OR = 10.3, 95% CI: [4.89, 22.1], p = 5.7 3 1011), with particularly

Bits

A

B

2

+

1 0

Bits

2



1

Fraction of mutations with motif

Please cite this article in press as: Imielinski et al., Insertions and Deletions Target Lineage-Defining Genes in Human Cancers, Cell (2017), http://dx.doi.org/10.1016/j.cell.2016.12.025

0

P = 4.2 × 10–14

Mutation contexts

C

1.00

****

**

**

*** ****

0.75

** p < 10–2 *** p < 10–4 **** p < 10–6

0.50 0.25 0.00 Non-hotspot All

E = 5.9 × 10–10

LUAD

LIHC THCA STAD

Lineage-specific hotspot

Site of lesion GTCACGGGACCAGTATTGTTTCCTCATTTTTTATAAGGACACTGAAGCTTG|GAGGAGTTAAATGTTTTGAGTATTATTCCAGAGAGCAAGTGGCAGAGGCT GAACTTCTTCATCTCGAGTTCTAATTTGATTGCATTATTGTCTGAGAGACTGTTTGTTATTATGTCAGTTCTTTTTCATTTGTTGAGTGTTTTACTTCCAATT TCCACCCTGGGTGACACAGAGAGATCCTATTTCAAAAACAAATAAATAAATAATTTTAAAAAAGAGTTGACTAGCTGGTAGGAGGACAGAATTGGCAGGAAGT GAAAGCACCACAAAAATTTCAACATTTACATAAGGTGTCATTATCACGACTATTATTTGCTTAATGGGTCCAGAGCTTGCAGCTGGTTTGTGTATGTTGGAG ACTGATGATTCCAATAATGAGAAAGAAAAATAATGCAAGAATGTAAAATGATATACAGTGCAATTTAGATCTTTTCTTGAGATGGTTTCAATTCTGGAATC TTGAATAATTTAGAGGACGCTGTCCTTTTTGTCCTAAAAAAAGGGACAGATATTTAAGTTCTATTTATTTATAAAATCTTGGACTCTTATTCTAATGGTTCATTATT CATAATATGGCACTTCCAAAATCTGAATAATATATAATTGCAATGACATACTTCTTTTCAGAGATTTACTGAAAAGAAATTTGTTGACACTACATAACGTGA CTATGCCAAAGTGGTAGGTTTATTGTTGGAAAAAAATGTAGTTCTTTGACTGATGATTCCAATAATGAGAAAGAAAAATAATGCAAGAATGTAAAATGATAT TATTATGCTGATAAGAGTACCCAGAATAAAATGAATAACTTTTTAAAGACAAAATCCTCTGTTATAATATTGCTAAAATTATTCAGAGTAATATTGTGGAT CAAATTCCAGAATGCGTAAGTAATTTTTATTGACTGATTTTTTTTATCAATTTGTAATTATTTAAGACTTAATATATGAGCCACCTAGCATAGAACTTTTAAGAATGAAA TAAAGTGCTGGGATTACAGGCATGAGCCACTGTGCCCAGCCGACAGATACTATTATTATTTCCATTCTACCGAGAAGGAGACTAAGGCTCTGATCATTTAAA AAGATATTTTGAAGTTTATTGAAACAGGATACAATCTTTCTGAAAAATTTAAGATAGACAAATTATTTAATGTATTACGAAGATATGTATATATGGTTGTTATAATTG ATGTAATTGTGGGGAGGTGACGGGCTTTTGAGAAGGAGGAGAACTATTATTAAGCAATCTAAAGATACTGTCTTCTTGTGAGGAAGAACTGAGTTTCTCATT CCCTTATTCACAATTCTGAAATCTAAAAACTCTGAAGACCAAAAGTTTTCTCATAAGTTTTAGTAAATTATTTGGTGGCAAAATCTGACCTGAACCTATCTGTCATCTTTA TAACATATTATTTAGAGGAATAAGACAGAATTAGCCAAAGGGGCTTACTCT|AAGACCTGGGAGAAAGACCTGGGAGTCAGGGCACAGGAAAGATGACATTT CTCAAGAATGGAATAATTTATCAGAAAGCACTTCTTAAGAAAATACTTAGCAGTTTCCAAAGAAAATATAAAATTACTCTTCTGAAAGGAATACTTATTTTTGTCTTCTTATTTTTGTTATCT ATGGATAGGATCCTCTCCTTCACCTCTCTGTCAGCACTGGCACATGAGACA|AACATTTTTACCAAGGATTTGATGAATATTTTCAATAATAGATTTTGGCC ACATATAATAATCAATAGACTTCTTGCCAATCGTTTACTAGAGTTAAATAATGTCAACAAAGAGCCTGGTTCTGTGTCGTGAACATCTGCTAATACTTCTCT CTGCTATTTTGTTCTCAGCCTTTGAAAACACCCCATTTTAACATCCTGTTATAATAATAACGCCTACTTTGCAGATTCACCCTGGTGACTAAGGGTGTGAAT TTTATTATTTAGACCTGTTTTAAGTTTATAGAAAAGTGAGCAGAAAGGACAGAGAGTTCCCATACACTCCCTCTCCCCCTCCTTCTTGCCCTGCTATTTAT TATGTATTTAGCTAGTGCCTCACACATAACATATGAGTATAGAGCCTACCTATTATTACTATTATCTCATGTGCCTGAAGACACTTCACTGACCTACTGATT ATAACATTTTATTTATTGCATTATTTTCTTCTTAACCTAAGGATTTGTTTTTTAGACTTTAGATCCCTAGAATTTTTAAATTTTTATTCATCATTTTGGAATTTGA CTCTACCTCCAGAGTCTGTGATTATTATTACTGAGGTGGAGCTGAAACTCAGAGAGAGAGGGAGATCAGCTGCTCAATATGGGCTCAGCAAGGACCAGGAG TCTGGATGTCAGTGTCTCAATCTGTGAAATGAGAAAGGAGAATAGTAGCTA|TTTGAGTCATTATTAGCACTAATTAAGACAATCTCTGTAAACTGTTGAGC AAGGAATCTAATTAGATTTCAATTTGTCTATTCCATAGGTTTTTAATATTATTATACTTTCAGAGCATATTAAGCACTCTTAAATAGAATTCTTCTTCTTATTTTTTTTAGA CTCTACCTCCAGAGTCTGTGATTATTATTACTGAGGTGGAGCTGAAACTCAGAGAGAGAGGGAGATCAGCTGCTCAATATGGGCTCAGCAAGGACCAGGAGAC CACCATCCTTAGTAAGTCCAAAATAATGTCACACTTATAAAAACAAACAAAACAACAACAACAACACACACACACACCTTTTTCACCAGTTCCTTGTTCTTT TGTCCTCTCCTTTCTGAAGTACTAATACCTGATATAATAAGCTAATAATTCTAACTGTTCATGGAATTTCTGGTCTTCTTAAACAATACCCTTTCCTTTGTTG TGCTCAGATTTCCAATTATTTCATTAATAATGTTTATTTATGAAATGTTTTATACTTTGTTTAGATTGAAATTCAAATAAGGCCTTTATCTTTTATTACTTGA AGAGATTTCATCTTCTGCGTAATAATTTGCAGAAGACCATCTTCAGTCTCC|AATGAAGATGGCCAATTTGTTTCACTATGGCCACAGTCCGCTATATTAAT TCTCGAAGTATTGCTCCTAACATACAATTTCTCAATAAATATTTATTGAATAATTTATTTCTGACACACTAAAAATGAAGTGATTTAAAACTGTCCTCTACAT AATAGAGAAAGACAGGATTTATATTAAATGACATGGAACACAGCATACACATTATTATTAAATCTGACATTTCATTTCAGGGATAGAAATAATTCCCACCTA CTAAAATTTTAAAAATATCTAGCCAGTGCTCAGATTTCCAATTATTTCATTAATAATGTTTATTTATGAAATGTTTTATACTTTGTTTAGATTGAAATTCAA TGCACCTCATGATATGTCCTCTACTGCATGGCAACCATCCACATGGGAGCTCTTCTTTAATTATTGGAGCTTATTATTATGTGCCTATATAATAAATGCCTT AGGTATTACTTATTAGGATTATTATTCCTTTGCTCTTACATATTAGACAATCTTCTTATAATAACTTATAGGCTATTATAAGTTATTATAACTTATAGTAGG TTGATATTTTTGGTAAGGTCATTTTTTCCCTAATTGAATATTCTTTTCTAATTATTTTTATAGATTCAGGGAGTACAAGTGCAGTTCTGTTACGTGGATATA

+ Motif: AATAATD

– Motif: HATTATT

Deletion: AGAG

LUAD

LIHC

THCA

STAD

Insertion: A|T

Figure 7. Hotspot Indels Arise in the Vicinity of an AATAATD Sequence Motif (A) Logo of AATAATD motif (and its reverse complement, HATTATT) significantly enriched in the sequence neighborhood of 107 lineage-specific hotspot indels (associated with LIPF, TG, or SFTP loci) relative to 355 somatic indels associated with other highly expressed genes. E-value calculated using DREME (http:// meme-suite.org/tools/dreme). ‘‘D’’ and ‘‘H’’ are IUPAC ambiguity codes describing ‘‘A, G, T, not C’’ and ‘‘A, C, T, not G,’’ respectively. (B) Fraction of indel events harboring the AATAATD motif in their sequence neighborhood stratified by tumor type. Bars and whiskers respectively denote the means and standard deviations of beta posterior densities estimating the fraction of events harboring the motif in the given category. p values were obtained using Fisher’s exact test. (C) 50-base sequence context around 36 of 107 lineage-specific hotspot indels that harbor the AATAATD motif in their sequence neighborhood, with sequences oriented to the (+) transcript strand of the associated gene. The motifs are color coded with respect to strand. The site of the indel lesion is located in the middle of each sequence. See also Figure S7.

strong enrichment for LIPF/STAD events (10/11 events, OR = 170, 95% CI: [22, 7300], p = 2 3 1011). The correlation was not driven purely by indels associated with the LIPF hotspot; there was also a significant enrichment of this motif among non-LIPF events (OR = 7.2, 95% CI: [3.7, 15], p = 1.2 3 109) (Figure 7B). The AATAATD motif occurred on either side, or occasionally both sides, of the indel lesion (oriented with respect to the transcribed strand of the respective genes), did not exhibit transcriptional strand bias, and occasionally was disrupted by a deletion event (Figure 7C). A subset of this motif (AATAAT or AAUAAU) comprises the sequence target of a microRNA MIR126, which has been previously linked to cancer and metastasis (Png et al., 2011). AATAATD does not match any known eukaryotic motif in TOMTOM (http://meme-suite.org/tools/ tomtom); however, it resembles the AATAAA polyadenylation site conserved across eukaryotes (Proudfoot, 2011).

DISCUSSION We have applied a whole-genome sequencing analysis approach to nominate recurrently somatically mutated regions across 79 lung adenocarcinoma cases, among these a hotspot of noncoding SNVs and indels in the vicinity of MIR21, a microRNA previously shown to be amplified and overexpressed in lung adenocarcinoma. Through this analysis, we discovered a phenomenon of recurrent somatic indel mutations in multiple cancer types located in lineage-specific genes that exhibit high expression in the cognate tissues of the respective cancers. Larger-scale whole genome and targeted sequencing analyses will further delineate the distribution and prevalence of these lineage-specific indel mutations across a broader range of cancer types. One possible interpretation of the high prevalence (20%–40%) and statistically significant enrichment of these lineage-specific

Cell 168, 1–13, January 26, 2017 9

Please cite this article in press as: Imielinski et al., Insertions and Deletions Target Lineage-Defining Genes in Human Cancers, Cell (2017), http://dx.doi.org/10.1016/j.cell.2016.12.025

mutations is that these genes (i.e., SFTPB, ALB, TG, LIPF) may be targets of positive selection in their respective cancer types. Mutations in SFTP loci were the most highly significant somatic indel alterations discovered in lung adenocarcinoma, using a statistically calibrated genome-wide test that corrected for known covariates of neutral mutation density. The remaining loci emerged through a focused analysis between normal tissue gene expression and somatic mutation patterns in tumors, yielding a similarly strong statistical signal of expression-native indel enrichment at several lineage-specific genes. Alternatively, these lineage-specific indel mutation patterns may be the result of a focal and previously undescribed mutational process. The correlation of these mutations to several genomic features (e.g., chromatin marks and AATAATD sequence motif) may reflect features of the mutational process that generated them rather than indicate a particular direction of selection pressure during tumor evolution. It is thus appropriate to entertain both hypotheses (driver hotspot versus mutational process) as possible explanations for the described lineage-specific mutational phenomenon. If these loci are indeed cancer-relevant genes, they may represent dominant oncogenes or recessive tumor suppressors. Since malignant transformation is frequently associated with dedifferentiation, a trivial explanation for these observations may be that these mutations halt the production of a metabolically expensive protein product as a tumor cell sheds the specialized characteristics of its cell of origin. According to this hypothesis, we might expect to see these genes also frequently targeted by deletions or truncating mutations. This, however, has not been observed in large-scale exomic mutation studies of these tumor types (Imielinski et al., 2012; Cancer Genome Atlas Research Network, 2014a, 2014b, 2014c) or GISTIC analysis of focal copy number alterations (https://www. broadinstitute.org/tcga/gistic/) (Zack et al., 2013), with the exception of ALB mutations in hepatocellular carcinoma (Fujimoto et al., 2016; Schulze et al., 2015). Furthermore, our analysis does not demonstrate significant reduction of cis gene expression of the corresponding transcript in mutant tumors compared to WT. Though alterations in cancer cell gene expression could be masked in these samples through sample contamination with non-neoplastic epithelium expressing high transcript quantities, the above observations argue against a simple loss-offunction role for these indel mutations. If these mutations drive cancer through a dominant oncogenic effect, they would likely affect the epigenetic or transcriptional state of cancer cells, including splicing, methylation, noncoding RNA expression, or distribution of chromatin marks. Though our analyses do not demonstrate any mutation-associated cis changes in transcription, exon splicing, or methylation across the various tumor types that harbor such hotspots (LUAD, LIHC, THCA, STAD) (with the exception of a cis reduction of ALB methylation levels in the setting of ALB indels) (Figures S2 and S5), we observe statistically significant changes of gene expression in trans at the pathway level. In lung adenocarcinoma, the signal strength of pathway level gene expression changes with SFTP gene mutations is comparable to those obtained from an identical gene set analysis comparing TP53 mutant and WT samples. While the TP53 gene set analysis

10 Cell 168, 1–13, January 26, 2017

reveals pathways involved in TP53 biology (cell-cycle progression, DNA replication), trans expression changes associated with SFTP noncoding mutants point to pathways involved in 30 UTR RNA processing and protein synthesis. It is unclear how 30 UTR mutations that appear to be transcriptionally inert in cis might globally upregulate RNA processing and protein synthesis in trans, though detailed transcriptomic and proteomic comparisons of mutant and WT isogenic cell lines may shed light into possible mechanisms. One possibility is that these pathway changes are not causally tied to the associated genetic lesions, but rather, they tag a particular evolutionary trajectory in tumorigenesis, which is reflected in the transcriptional signature; however, we do not observe a significant association of SFTP mutations with any known lung adenocarcinoma driver, which one would expect to be linked to a reproducible evolutionary path. Functional characterization of the phenotypic impact of these noncoding mutations through transgene overexpression or genome editing in controlled cellular or animal models of cancer will be necessary to evaluate their potential as bona fide cancer drivers. Additional proteomic profiling of genotyped tumor samples and cell lines will determine whether these mutations exert their effects by altering translation in cis (e.g., whether SFTPB 30 UTR mutations alter SFTPB translation without perturbing gene expression or splicing). If the observed indel patterns are not the result of selection, they may constitute a previously uncharacterized transcriptionassociated somatic mutation phenomenon. The relationship between transcription and mutation has been studied extensively in yeast (Jinks-Robertson and Bhagwat, 2014) and recently explored in human cancer genome studies (Chapman et al., 2011; Haradhvala et al., 2016; Lawrence et al., 2013; Pleasance et al., 2010a, 2010b). The indel patterns reported in this study do not readily fall under previously described phenomena such as transcription coupled repair (which results in reduced mutation density in highly expressed genes) or transcriptionally coupled damage (which is associated with A/G SNV in hepatocellular cancer) (Haradhvala et al., 2016). They also cannot be readily explained by mismatch repair deficiency or microsatellite instability (Kim et al., 2013; Supek and Lehner, 2015; Zhao et al., 2014), as they do not cluster in hyper-mutator patients, are not enriched in microsatellite contexts, and arise in focal genomic regions rather than affecting genome-wide mutation distributions. Furthermore, they do not exhibit signatures of TAM or transcriptionassociated recombination (TAR), which have been characterized in yeast and E. coli and proposed as a source of genome instability in human cancer (Kim and Jinks-Robertson, 2012). While somatic indels in lineage genes are enriched in genes that are transcribed in a head-on orientation relative to the direction of replication (Figure S7E), this correlation appears to account for a minority of the expression-native enrichment that is associated with the key indel hotspots that we have nominated in liver, lung, gastric, and thyroid cancer. The indel hotspots described in this study target a special class of loci, encoding protein products that are manufactured in large quantities by a single cell type within a specific organ. Moreover, the secretion of these lineage-defining proteins is a primary function of that cell type and is vital to healthy organ and systems physiology. For example, SFTPB, SFTPA1, and SFTPC proteins

Please cite this article in press as: Imielinski et al., Insertions and Deletions Target Lineage-Defining Genes in Human Cancers, Cell (2017), http://dx.doi.org/10.1016/j.cell.2016.12.025

are secreted from type II alveolar cells to coat 300 million microscopic air-spaces in the human lung, allowing alveoli to efficiently inflate and preventing pulmonary collapse (Andreeva et al., 2007). Albumin, the product of the ALB gene, comprises more than half of the blood plasma protein mass and is responsible for the oncotic pressure that maintains intravascular volume (Farrugia, 2010). Thyroglobulin, the product of the TG gene, comprises over half of the mass of the thyroid gland, where it is used to synthesize thyroid hormone (Boron and Boulpaep, 2008). Gastric lipase, the product of the LIPF gene, catalyzes the majority of lipid hydrolysis in the stomach (Roussel et al., 1999). Further sequencing and analysis will reveal which additional tumor types are affected by the mutational phenomenon described in this study and whether it is a hallmark of carcinomas arising from secretory epithelial cell types or a more general phenomenon. Our results build upon recently discovered links between epigenetic features of normal tissues and tumor-specific mutation patterns (Polak et al., 2015). However, in contrast to the broad megabase-level correlations that Polak and colleagues observed between tissue-specific epigenetic patterns and local somatic SNV densities in cancer, our data demonstrate a much more focal pattern of tissue specificity for indel mutations. Though the mutation patterns observed in our whole-genome sequencing analyses (results not shown) do not suggest that these lesions are subclonal or present in normaladjacent tissues, deep sequencing of normal, pre-malignant, and malignant epithelium will be needed to examine the precise timing of these mutations in tumor evolution. The clinical implications of recurrent indel mutations in lineage-specific genes across multiple cancer types remain to be fully elucidated. Reproducible links between the transcriptional or epigenetic state of a healthy cell and the mutational state of a tumor may be diagnostically useful in cases where a poorly differentiated cancer has drifted phenotypically from its cell or tissue of origin, including but not limited to the diagnostic dilemma of carcinoma of unknown primary origin. Analysis of lineage-specific indel patterns may also illuminate the study of field cancerization through deep sequencing of tumor-adjacent tissues. Finally, the tissue and cancer selectivity of lineage-specific indel mutations could be exploited in the future for early cancer detection or for circulating tumor DNA monitoring. STAR+METHODS

d

d

QUANTIFICATION AND STATISTICAL ANALYSIS B Genome-wide modeling of neutral mutation density B Lineage-context analyses of indel and SNV density B Epigenomic feature enrichment analysis B Replication direction analysis DATA AND SOFTWARE AVAILABILITY B Additional Resources

SUPPLEMENTAL INFORMATION Supplemental Information includes seven figures and five tables and can be found with this article online at http://dx.doi.org/10.1016/j.cell.2016.12.025. AUTHOR CONTRIBUTIONS M.I. developed custom software and performed all data analyses. G.G. performed independent data analyses supporting Figure 1. M.I. and G.G. performed computational genomic data processing. M.I. and M.M. designed data analyses and wrote the manuscript. ACKNOWLEDGMENTS The authors would like to thank Aruna Ramachandran for helpful comments and careful editing during manuscript preparation, Leslie Gaffney for help with graphical design of figures, and Jeanie Joe for meticulous proofreading. The authors would also like to thank Korbinian Strimmer for correspondence regarding p value randomization and Heng Li for providing details of his universal mask. The authors thank Yoichiro Mitsuishi and Joshua Francis for helpful discussions. M.I. is supported by a Burroughs-Wellcome Career Award for Medical Scientists. This work was supported by grants from the National Cancer Institute as part of The Cancer Genome Atlas project: U24CA126546, U24CA143867, U24CA143845, U24CA126544, and U24CA143883. This work was also funded by the Department of Defense W81XWH-12-1-0269 (M.M.), the American Cancer Society Research Professor Award (M.M.), and the National Cancer Institute R35CA197568 (M.M.). Received: May 13, 2016 Revised: October 25, 2016 Accepted: December 16, 2016 Published: January 12, 2017 REFERENCES Alexandrov, L.B., Nik-Zainal, S., Wedge, D.C., Aparicio, S.A., Behjati, S., Biankin, A.V., Bignell, G.R., Bolli, N., Borg, A., Børresen-Dale, A.-L., et al.; Australian Pancreatic Cancer Genome Initiative; ICGC Breast Cancer Consortium; ICGC MMML-Seq Consortium; ICGC PedBrain (2013). Signatures of mutational processes in human cancer. Nature 500, 415–421.

Detailed methods are provided in the online version of this paper and include the following:

Andreeva, A.V., Kutuzov, M.A., and Voyno-Yasenetskaya, T.A. (2007). Regulation of surfactant secretion in alveolar type II cells. Am. J. Physiol. Lung Cell. Mol. Physiol. 293, L259–L271.

KEY RESOURCES TABLE CONTACT FOR REAGENT AND RESOURCE SHARING EXPERIMENTAL MODEL AND SUBJECT DETAILS B Human Subjects METHOD DETAILS B Sequence data and processing B Genome-wide noncoding mutational scan B GTEx expression analysis B Epigenomic data provenance B Cis and trans analyses of expression and methylation B Sequence motif analysis

Baca, S.C., Prandi, D., Lawrence, M.S., Mosquera, J.M., Romanel, A., Drier, Y., Park, K., Kitabayashi, N., MacDonald, T.Y., Ghandi, M., et al. (2013). Punctuated evolution of prostate cancer genomes. Cell 153, 666–677.

d d d d

Bailey, T.L., Boden, M., Buske, F.A., Frith, M., Grant, C.E., Clementi, L., Ren, J., Li, W.W., and Noble, W.S. (2009). MEME SUITE: tools for motif discovery and searching. Nucleic Acids Res. 37, W202–W208. Berger, M.F., Hodis, E., Heffernan, T.P., Deribe, Y.L., Lawrence, M.S., Protopopov, A., Ivanova, E., Watson, I.R., Nickerson, E., Ghosh, P., et al. (2012). Melanoma genome sequencing reveals frequent PREX2 mutations. Nature 485, 502–506. Blum, A.E., Venkitachalam, S., Guo, Y., Kieber-Emmons, A.M., Ravi, L., Chandar, A.K., Iyer, P.G., Canto, M.I., Wang, J.S., Shaheen, N.J., et al. (2016). RNA

Cell 168, 1–13, January 26, 2017 11

Please cite this article in press as: Imielinski et al., Insertions and Deletions Target Lineage-Defining Genes in Human Cancers, Cell (2017), http://dx.doi.org/10.1016/j.cell.2016.12.025

Sequencing Identifies Transcriptionally Viable Gene Fusions in Esophageal Adenocarcinomas. Cancer Res. 76, 5628–5633. Boron, W.F., and Boulpaep, E.L. (2008). Medical Physiology: A Cellular and Molecular Approach (Elsevier Health Sciences). Brennan, C.W., Verhaak, R.G., McKenna, A., Campos, B., Noushmehr, H., Salama, S.R., Zheng, S., Chakravarty, D., Sanborn, J.Z., Berman, S.H., et al.; TCGA Research Network (2013). The somatic genomic landscape of glioblastoma. Cell 155, 462–477. Campbell, J.D., Alexandrov, A., Kim, J., Wala, J., Berger, A.H., Pedamallu, C.S., Shukla, S.A., Guo, G., Brooks, A.N., Murray, B.A., et al. (2016). Distinct patterns of somatic genome alterations in lung adenocarcinomas and squamous cell carcinomas. Nat. Genet. 48, 607–616. Cancer Genome Atlas Network (2012). Comprehensive molecular portraits of human breast tumours. Nature 490, 61–70. Cancer Genome Atlas Network (2015a). Comprehensive genomic characterization of head and neck squamous cell carcinomas. Nature 517, 576–582. Cancer Genome Atlas Network (2015b). Genomic Classification of Cutaneous Melanoma. Cell 161, 1681–1696. Cancer Genome Atlas Research Network (2012). Comprehensive genomic characterization of squamous cell lung cancers. Nature 489, 519–525. Cancer Genome Atlas Research Network (2013). Comprehensive molecular characterization of clear cell renal cell carcinoma. Nature 499, 43–49. Cancer Genome Atlas Research Network (2014a). Comprehensive molecular profiling of lung adenocarcinoma. Nature 511, 543–550. Cancer Genome Atlas Research Network (2014b). Comprehensive molecular characterization of gastric adenocarcinoma. Nature 513, 202–209. Cancer Genome Atlas Research Network (2014c). Integrated genomic characterization of papillary thyroid carcinoma. Cell 159, 676–690. Cancer Genome Atlas Research Network (2014d). Comprehensive molecular characterization of urothelial bladder carcinoma. Nature 507, 315–322. Cancer Genome Atlas Research Network (2015). The Molecular Taxonomy of Primary Prostate Cancer. Cell 163, 1011–1025. Ceccarelli, M., Barthel, F.P., Malta, T.M., Sabedot, T.S., Salama, S.R., Murray, B.A., Morozova, O., Newton, Y., Radenbaugh, A., Pagnotta, S.M., et al.; TCGA Research Network (2016). Molecular Profiling Reveals Biologically Discrete Subsets and Pathways of Progression in Diffuse Glioma. Cell 164, 550–563. Chapman, M.A., Lawrence, M.S., Keats, J.J., Cibulskis, K., Sougnez, C., Schinzel, A.C., Harview, C.L., Brunet, J.-P., Ahmann, G.J., Adli, M., et al. (2011). Initial genome sequencing and analysis of multiple myeloma. Nature 471, 467–472. Dickhaus, T. (2014). Simultaneous Statistical Inference (Berlin, Heidelberg: Springer Science & Business Media). Ding, L., Getz, G., Wheeler, D.A., Mardis, E.R., McLellan, M.D., Cibulskis, K., Sougnez, C., Greulich, H., Muzny, D.M., Morgan, M.B., et al. (2008). Somatic mutations affect key pathways in lung adenocarcinoma. Nature 455, 1069–1075. ENCODE Project Consortium (2012). An integrated encyclopedia of DNA elements in the human genome. Nature 489, 57–74. Ernst, J., and Kellis, M. (2012). ChromHMM: automating chromatin-state discovery and characterization. Nat. Methods 9, 215–216. Farrugia, A. (2010). Albumin usage in clinical medicine: tradition or therapeutic? Transfus. Med. Rev. 24, 53–63. Fujimoto, A., Furuta, M., Totoki, Y., Tsunoda, T., Kato, M., Shiraishi, Y., Tanaka, H., Taniguchi, H., Kawakami, Y., Ueno, M., et al. (2016). Whole-genome mutational landscape and characterization of noncoding and structural mutations in liver cancer. Nat. Genet. 48, 500–509.

Ha, G., Roth, A., Khattra, J., Ho, J., Yap, D., Prentice, L.M., Melnyk, N., McPherson, A., Bashashati, A., Laks, E., et al. (2014). TITAN: inference of copy number architectures in clonal cell populations from tumor wholegenome sequence data. Genome Res. 24, 1881–1893. Haagsman, H.P., and Diemel, R.V. (2001). Surfactant-associated proteins: functions and structural variation. Comp. Biochem. Physiol. A. Mol. Integr. Physiol. 129, 91–108. Haradhvala, N.J., Polak, P., Stojanov, P., Covington, K.R., Shinbrot, E., Hess, J.M., Rheinbay, E., Kim, J., Maruvka, Y.E., Braunstein, L.Z., et al. (2016). Mutational Strand Asymmetries in Cancer Genomes Reveal Mechanisms of DNA Damage and Repair. Cell 164, 538–549. Harrow, J., Frankish, A., Gonzalez, J.M., Tapanari, E., Diekhans, M., Kokocinski, F., Aken, B.L., Barrell, D., Zadissa, A., Searle, S., et al. (2012). GENCODE: the reference human genome annotation for The ENCODE Project. Genome Res. 22, 1760–1774. Hilbe, J.M. (2014). Modeling Count Data (Cambridge University Press). Hodis, E., Watson, I.R., Kryukov, G.V., Arold, S.T., Imielinski, M., Theurillat, J.-P., Nickerson, E., Auclair, D., Li, L., Place, C., et al. (2012). A landscape of driver mutations in melanoma. Cell 150, 251–263. Horn, S., Figl, A., Rachakonda, P.S., Fischer, C., Sucker, A., Gast, A., Kadel, S., Moll, I., Nagore, E., Hemminki, K., et al. (2013). TERT promoter mutations in familial and sporadic melanoma. Science 339, 959–961. Huang, F.W., Hodis, E., Xu, M.J., Kryukov, G.V., Chin, L., and Garraway, L.A. (2013). Highly recurrent TERT promoter mutations in human melanoma. Science 339, 957–959. Imielinski, M., Berger, A.H., Hammerman, P.S., Hernandez, B., Pugh, T.J., Hodis, E., Cho, J., Suh, J., Capelletti, M., Sivachenko, A., et al. (2012). Mapping the hallmarks of lung adenocarcinoma with massively parallel sequencing. Cell 150, 1107–1120. Inaki, K., Hillmer, A.M., Ukil, L., Yao, F., Woo, X.Y., Vardy, L.A., Zawack, K.F.B., Lee, C.W.H., Ariyaratne, P.N., Chan, Y.S., et al. (2011). Transcriptional consequences of genomic structural aberrations in breast cancer. Genome Res. 21, 676–687. Jin, F., Li, Y., Dixon, J.R., Selvaraj, S., Ye, Z., Lee, A.Y., Yen, C.-A., Schmitt, A.D., Espinoza, C.A., and Ren, B. (2013). A high-resolution map of the threedimensional chromatin interactome in human cells. Nature 503, 290–294. Jinks-Robertson, S., and Bhagwat, A.S. (2014). Transcription-associated mutagenesis. Annu. Rev. Genet. 48, 341–359. Kan, Z., Jaiswal, B.S., Stinson, J., Janakiraman, V., Bhatt, D., Stern, H.M., Yue, P., Haverty, P.M., Bourgon, R., Zheng, J., et al. (2010). Diverse somatic mutation patterns and pathway alterations in human cancers. Nature 466, 869–873. Khurana, E., Fu, Y., Chakravarty, D., Demichelis, F., Rubin, M.A., and Gerstein, M. (2016). Role of non-coding sequence variants in cancer. Nat. Rev. Genet. 17, 93–108. Kim, N., and Jinks-Robertson, S. (2012). Transcription as a source of genome instability. Nat. Rev. Genet. 13, 204–214. Kim, T.-M., Laird, P.W., and Park, P.J. (2013). The landscape of microsatellite instability in colorectal and endometrial cancer genomes. Cell 155, 858–868. Koren, A., Polak, P., Nemesh, J., Michaelson, J.J., Sebat, J., Sunyaev, S.R., and McCarroll, S.A. (2012). Differential relationship of DNA replication timing to different forms of human mutation and variation. Am. J. Hum. Genet. 91, 1033–1040. , R., Ghosh, S., Polak, P., EgKoren, A., Handsaker, R.E., Kamitaki, N., Karlic gan, K., and McCarroll, S.A. (2014). Genetic variation in human DNA replication timing. Cell 159, 1015–1026.

Garraway, L.A., and Lander, E.S. (2013). Lessons from the cancer genome. Cell 153, 17–37.

Lawrence, M.S., Stojanov, P., Polak, P., Kryukov, G.V., Cibulskis, K., Sivachenko, A., Carter, S.L., Stewart, C., Mermel, C.H., Roberts, S.A., et al. (2013). Mutational heterogeneity in cancer and the search for new cancerassociated genes. Nature 499, 214–218.

Govindan, R., Ding, L., Griffith, M., Subramanian, J., Dees, N.D., Kanchi, K.L., Maher, C.A., Fulton, R., Fulton, L., Wallis, J., et al. (2012). Genomic landscape of non-small cell lung cancer in smokers and never-smokers. Cell 150, 1121–1134.

Lawrence, M.S., Stojanov, P., Mermel, C.H., Robinson, J.T., Garraway, L.A., Golub, T.R., Meyerson, M., Gabriel, S.B., Lander, E.S., and Getz, G. (2014). Discovery and saturation analysis of cancer genes across 21 tumour types. Nature 505, 495–501.

12 Cell 168, 1–13, January 26, 2017

Please cite this article in press as: Imielinski et al., Insertions and Deletions Target Lineage-Defining Genes in Human Cancers, Cell (2017), http://dx.doi.org/10.1016/j.cell.2016.12.025

Li, H. (2014). Toward better understanding of artifacts in variant calling from high-coverage samples. Bioinformatics 30, 2843–2851.

lysosomal acid lipase, two lipolytic enzymes of medical interest. J. Biol. Chem. 274, 16995–17002.

Li, H., and Durbin, R. (2009). Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics 25, 1754–1760.

Sankar, T.S., Wastuwidyaningtyas, B.D., Dong, Y., Lewis, S.A., and Wang, J.D. (2016). The nature of mutations induced by replication–transcription collisions. Nature 535, 178–181.

Mallick, S., Li, H., Lipson, M., Mathieson, I., Gymrek, M., Racimo, F., Zhao, M., Chennagiri, N., Nordenfelt, S., Tandon, A., et al. (2016). The Simons Genome Diversity Project: 300 genomes from 142 diverse populations. Nature 538, 201–206. Mansour, M.R., Abraham, B.J., Anders, L., Berezovskaya, A., Gutierrez, A., Durbin, A.D., Etchin, J., Lawton, L., Sallan, S.E., Silverman, L.B., et al. (2014). Oncogene regulation. An oncogenic super-enhancer formed through somatic mutation of a noncoding intergenic element. Science 346, 1373–1377.

Saunders, C.T., Wong, W.S.W., Swamy, S., Becq, J., Murray, L.J., and Cheetham, R.K. (2012). Strelka: accurate somatic small-variant calling from sequenced tumor-normal sample pairs. Bioinformatics 28, 1811–1817. Schulze, K., Imbeaud, S., Letouze´, E., Alexandrov, L.B., Calderaro, J., Rebouissou, S., Couchy, G., Meiller, C., Shinde, J., Soysouvanh, F., et al. (2015). Exome sequencing of hepatocellular carcinomas identifies new mutational signatures and potential therapeutic targets. Nat. Genet. 47, 505–511.

Mele´, M., Ferreira, P.G., Reverter, F., DeLuca, D.S., Monlong, J., Sammeth, M., Young, T.R., Goldmann, J.M., Pervouchine, D.D., Sullivan, T.J., et al.; GTEx Consortium (2015). Human genomics. The human transcriptome across tissues and individuals. Science 348, 660–665.

Schuster-Bo¨ckler, B., and Lehner, B. (2012). Chromatin organization is a major influence on regional mutation rates in human cancer cells. Nature 488, 504–507.

Nathan, N., Giraud, V., Picard, C., Nunes, H., Dastot-Le Moal, F., Copin, B., Galeron, L., De Ligniville, A., Kuziner, N., Reynaud-Gaubert, M., et al. (2016). Germline SFTPA1 mutation in familial idiopathic interstitial pneumonia and lung cancer. Hum. Mol. Genet. 25, 1457–1467.

Seike, M., Goto, A., Okano, T., Bowman, E.D., Schetter, A.J., Horikawa, I., Mathe, E.A., Jen, J., Yang, P., Sugimura, H., et al. (2009). MiR-21 is an EGFR-regulated anti-apoptotic factor in lung cancer in never-smokers. Proc. Natl. Acad. Sci. U.S.A. 106, 12085–12090.

Pao, W., and Hutchinson, K.E. (2012). Chipping away at the lung cancer genome. Nat. Med. 18, 349–351.

Supek, F., and Lehner, B. (2015). Differential DNA mismatch repair underlies mutation rate variation across the human genome. Nature 521, 81–84.

Pearson, T.A., and Manolio, T.A. (2008). How to interpret a genome-wide association study. JAMA 299, 1335–1344.

Tarpey, P.S., Behjati, S., Cooke, S.L., Van Loo, P., Wedge, D.C., Pillay, N., Marshall, J., O’Meara, S., Davies, H., Nik-Zainal, S., et al. (2013). Frequent mutation of the major cartilage collagen gene COL2A1 in chondrosarcoma. Nat. Genet. 45, 923–926.

Pleasance, E.D., Cheetham, R.K., Stephens, P.J., McBride, D.J., Humphray, S.J., Greenman, C.D., Varela, I., Lin, M.-L., Ordo´n˜ez, G.R., Bignell, G.R., et al. (2010a). A comprehensive catalogue of somatic mutations from a human cancer genome. Nature 463, 191–196. Pleasance, E.D., Stephens, P.J., O’Meara, S., McBride, D.J., Meynert, A., Jones, D., Lin, M.-L., Beare, D., Lau, K.W., Greenman, C., et al. (2010b). A small-cell lung cancer genome with complex signatures of tobacco exposure. Nature 463, 184–190.

Vogelstein, B., Papadopoulos, N., Velculescu, V.E., Zhou, S., Diaz, L.A., Jr., and Kinzler, K.W. (2013). Cancer genome landscapes. Science 339, 1546–1558. Weinhold, N., Jacobsen, A., Schultz, N., Sander, C., and Lee, W. (2014). Genome-wide analysis of noncoding regulatory mutations in cancer. Nat. Genet. 46, 1160–1165.

Png, K.J., Halberg, N., Yoshida, M., and Tavazoie, S.F. (2011). A microRNA regulon that mediates endothelial recruitment and metastasis by cancer cells. Nature 481, 190–194.

Wu, D., and Smyth, G.K. (2012). Camera: a competitive gene set test accounting for inter-gene correlation. Nucleic Acids Res. 40, e133.

, R., Koren, A., Thurman, R., Sandstrom, R., Lawrence, M.S., Polak, P., Karlic ek, K., Stamatoyannopoulos, J.A., and SuReynolds, A., Rynes, E., Vlahovic nyaev, S.R. (2015). Cell-of-origin chromatin organization shapes the mutational landscape of cancer. Nature 518, 360–364.

You, L., Wu, J., Feng, Y., Fu, Y., Guo, Y., Long, L., Zhang, H., Luan, Y., Tian, P., Chen, L., et al. (2015). APASdb: a database describing alternative poly(A) sites and selection of heterogeneous cleavage sites downstream of poly(A) signals. Nucleic Acids Res. 43, D59–D67.

Proudfoot, N.J. (2011). Ending the message: poly(A) signals then and now. Genes Dev. 25, 1770–1782.

Zack, T.I., Schumacher, S.E., Carter, S.L., Cherniack, A.D., Saksena, G., Tabak, B., Lawrence, M.S., Zhang, C.-Z., Wala, J., Mermel, C.H., et al. (2013). Pan-cancer patterns of somatic copy number alteration. Nat. Genet. 45, 1134–1140.

Rao, S.S.P., Huntley, M.H., Durand, N.C., Stamenova, E.K., Bochkov, I.D., Robinson, J.T., Sanborn, A.L., Machol, I., Omer, A.D., Lander, E.S., and Aiden, E.L. (2014). A 3D map of the human genome at kilobase resolution reveals principles of chromatin looping. Cell 159, 1665–1680. Roadmap Epigenomics Consortium, Kundaje, A., Meuleman, W., Ernst, J., Bilenky, M., Yen, A., Heravi-Moussavi, A., Kheradpour, P., Zhang, Z., Wang, J., Ziller, M.J., et al. (2015). Integrative analysis of 111 reference human epigenomes. Nature 518, 317–330. Roussel, A., Canaan, S., Egloff, M.P., Rivie`re, M., Dupuis, L., Verger, R., and Cambillau, C. (1999). Crystal structure of human gastric lipase and model of

Zhang, X., Choi, P.S., Francis, J.M., Imielinski, M., Watanabe, H., Cherniack, A.D., and Meyerson, M. (2016). Identification of focally amplified lineage-specific super-enhancers in human epithelial cancers. Nat. Genet. 48, 176–182. Zhao, H., Thienpont, B., Yesilyurt, B.T., Moisse, M., Reumers, J., Coenegrachts, L., Sagaert, X., Schrauwen, S., Smeets, D., Matthijs, G., et al.; ANECS (2014). Mismatch repair deficiency endows tumors with a unique mutation signature and sensitivity to DNA double-strand breaks. eLife 3, e02725– e02726.

Cell 168, 1–13, January 26, 2017 13

Please cite this article in press as: Imielinski et al., Insertions and Deletions Target Lineage-Defining Genes in Human Cancers, Cell (2017), http://dx.doi.org/10.1016/j.cell.2016.12.025

STAR+METHODS KEY RESOURCES TABLE

REAGENT or RESOURCE

SOURCE

IDENTIFIER

Human reference genome hg19

Genome Reference Consortium

http://hgdownload.cse.ucsc.edu/goldenpath/ hg19/chromosomes/

GENCODE v19

(Harrow et al., 2012)

https://www.gencodegenes.org/releases/ 19.html

Whole genome and RNA-sequencing .bam or .fastq files for STAD, LUAD, LUSC, SKCM, THCA, BLCA, BRCA, GBM, HNSC, LIHC, KIRC, and LGG projects, whole-exome sequencing .bam files for LUAD project

The Cancer Genome Atlas (TCGA)

dbGaP: phs000178.v1.p1

TCGA Level 3 gene expression, splicing, methylation, clinical, and miRNA profiles for STAD, LUSC, SKCM, THCA, BLCA, BRCA, GBM, HNSC, LIHC, KIRC, and LGG samples

Firehose Broad Institute TCGA GDAC

http://gdac.broadinstitute.org

Lung adenocarcinoma whole genome sequencing .bam files

(Imielinski et al., 2012)

dbGaP: phs000488.v1.p1

Prostate adenocarcinoma whole genome sequencing .bam files

(Baca et al., 2013)

dbGaP: phs000447.v1.p1

Metastatic melanoma whole genome sequencing .bam files

(Berger et al., 2012)

dbGaP: phs000452.v1.p1

ENCODE ChIP-seq, DNaseI seq, and TFBS profiles

(ENCODE Project Consortium, 2012)

http://www.genome.gov/ENCODE/

Roadmap Epigenomics ChIP-seq, ChromHMM, DNaseI hypersensitivity, and TFBS profiles

(Roadmap Epigenomics Consortium et al., 2015)

http://egg2.wustl.edu/roadmap/web_portal/

Deposited Data

Genome-Tissue Expression project

(Mele´ et al., 2015)

http://www.gtexportal.org

Molecular signatures database: Canonical Pathways gene sets

Broad Institute

http://software.broadinstitute.org/gsea/ msigdb GEO: GSE43070

TAD boundary annotations

(Jin et al., 2013)

Loop domain annotations

(Rao et al., 2014)

GEO: GSE63525

Alternate polyadenylation sites

(You et al., 2015)

http://genome.bucm.edu.cn/utr/

Replication direction annotation

(Haradhvala et al., 2016)

http://www.broadinstitute.org/cancer/cga/ AsymTools

Replication timing annotation

(Koren et al., 2012)

http://mccarrolllab.com/wp-content/uploads/ 2015/03/Koren-et-al-Table-S2.zip

Software and Algorithms BWA v0.5.9

(Li and Durbin, 2009)

http://bio-bwa.sourceforge.net

Picard v1.8

Broad Institute

https://broadinstitute.github.io/picard/

GATK v3.1

Broad Institute

https://software.broadinstitute.org/gatk/

Strelka v2.0

(Saunders et al., 2012)

https://sites.google.com/site/ strelkasomaticvariantcaller/home

R 3.3 (data.table, stats, MASS)

The R Project

https://www.r-project.org/

Bioconductor 3.4 (GenomicRanges, Rsamtools, Biostrings, limma)

Bioconductor

https://www.bioconductor.org/

TITAN

(Ha et al., 2014)

http://compbio.bccrc.ca/software/titan/

fish.hook

Imielinski lab

https://github.com/mskilab/fish.hook

skitools

Imielinski lab

https://github.com/mskilab/skitools

gUtils

Imielinski lab

https://github.com/mskilab/gUtils

gTrack

Imielinski lab

https://github.com/mskilab/gTrack (Continued on next page)

e1 Cell 168, 1–13.e1–e6, January 26, 2017

Please cite this article in press as: Imielinski et al., Insertions and Deletions Target Lineage-Defining Genes in Human Cancers, Cell (2017), http://dx.doi.org/10.1016/j.cell.2016.12.025

Continued REAGENT or RESOURCE

SOURCE

IDENTIFIER

MuTrix

Imielinski lab

https://github.com/mskilab/MuTrix

MEME suite v4.7

(Bailey et al., 2009)

http://meme-suite.org/

Heng Li, personal communication

https://github.com/lh3/sgdp-fermi/releases/ download/v1/sgdp-263-hs37d5.tgz

Other Universal mask for whole genome sequencing variant calling

CONTACT FOR REAGENT AND RESOURCE SHARING Further information and requests for resources and reagents should be directed to and will be fulfilled by Lead Contact Matthew Meyerson ([email protected]). EXPERIMENTAL MODEL AND SUBJECT DETAILS Human Subjects Cancer genome sequence data were generated through informed consent as part of previously published sequencing studies and analyzed in accordance with each original studies’ data use guidelines and restrictions. Data are available via dbGAP controlled access (http://www.ncbi.nlm.nih.gov/dbgap-controlled, accession numbers: dbGaP: phs000488.v1.p1, dbGaP: phs000447.v1.p1, dbGaP: phs000452.v1.p1, dbGaP: phs000178.v1.p1). Gender and age characteristics of study participants were as follows: Prostate adenocarcinoma from (Baca et al., 2013) (31 cases, 100% male, median age 62 years [range 46-73]); Cutaneous melanomas from (Berger et al., 2012) (25 cases, 68% male, median age 49.5 years [range 25-77]); Lung adenocarcinoma from (Imielinski et al., 2012) (29 cases, 72.4% male, median age 64 years [range 43-83]); Bladder cancers from (Cancer Genome Atlas Research Network, 2014d) (23 cases, 65.2% male, median age 65 years [range 34-84]); Breast carcinoma from (Cancer Genome Atlas Network, 2012) (95 cases, 1.1% male, median age 58 years [range 30-89]); Glioblastoma from (Brennan et al., 2013) (27 cases, 63% male, median age 60 years [range 21-76]); Head and neck squamous from (Cancer Genome Atlas Network, 2015a) (16 cases, 81.2% male, median age 53 years [range 29-75]); Kidney chromophobe from (Cancer Genome Atlas Research Network, 2013) (5 cases, 60% male, median age 62 years [range 51-77]); Low grade glioma from (Ceccarelli et al., 2016) (19 cases, 42.1% male, median age 39 years [range 17-62]); Hepatocellular carcinoma from TCGA (54 cases, 55.6% male, median age 65.5 years [range 23-84]); Lung adenocarcinoma from (Cancer Genome Atlas Research Network, 2014a) (50 cases, 48% male, median age 66 years [range 41-82]); Lung squamous cell from (Cancer Genome Atlas Research Network, 2012) (50 cases, 76% male, median age 68 years [range 47-83]); Prostate adenocarcinoma from (Cancer Genome Atlas Research Network, 2015) (20 cases, 100% male, median age 60 years [range 46-73]); Cutaneous melanoma from (Cancer Genome Atlas Network, 2015b) (33 cases, 78.8% male, median age 52 years [range 25-81]); Gastric adenocarcinoma from (Cancer Genome Atlas Research Network, 2014b) (40 cases, 65% male, median age 70 years [range 39-90]); Thyroid carcinoma from (Cancer Genome Atlas Research Network, 2014c) (49 cases, 22.4% male, median age 49 years [range 17-85]). METHOD DETAILS Sequence data and processing We obtained WGS reads for 50 TCGA lung adenocarcinoma (LUAD), 50 lung squamous cancer (LUSC), 33 cutaneous melanoma (SKCM), 49 papillary thyroid carcinoma (THCA), 23 bladder cancers (BLCA), 95 breast cancer (BRCA), 27 glioblastoma (GBM), 16 head and neck squamous (HNSC), 54 hepatocellular carcinomas (LIHC), 5 kidney cancer chromophobe (KIRC), 19 low-grade glioma (LGG), 40 gastric cancers (STAD), and 20 prostate cancer (PRAD) tumor-normal pairs via dbGAP access phs000178.v1.p1 (Cancer Genome Atlas Network, 2012, 2015b; Cancer Genome Atlas Research Network, 2012, 2013, 2014a, 2014b, 2014c, 2014d, 2015). We also obtained WGS reads for 29 lung adenocarcinoma cases from Imielinski et al. (dbGAP phs000488.v1.p1) (Imielinski et al., 2012), 31 prostatic adenocarcinoma cases from Baca et al. (dbGAP phs000447.v1.p1) (Baca et al., 2013), and 25 metastatic melanoma cases from Berger et al. (dbGAP phs000452.v1.p1) (Berger et al., 2012). Paired-end read data were aligned to hg19 using BWA aln and sampe v0.5.9 (http://bio-bwa.sourceforge.net/) (Li and Durbin, 2009) We used Picard v1.8 (https://broadinstitute.github.io/ picard/) and Genome Analysis Toolkit v3.1 (https://software.broadinstitute.org/gatk/) for downstream duplicate marking, base quality score recalibration, and local realignment around indels in tumor and normal. We called somatic SNVs and indels using Strelka v2.0.13 (Saunders et al., 2012). Whole genome somatic variants were filtered against a ‘‘universal mask’’ (Mallick et al., 2016) implementing the principles outlined in (Li, 2014). This mask specifies genomic regions that are likely to harbor recurrent artifactual variant calls. The mask is available as a gzipped bed file at https://github.com/lh3/sgdp-fermi/releases/download/v1/sgdp-263-hs37d5.tgz, was created using a script detailed in https://gist.github.com/lh3/9d6dcfc3436a735ef197. Briefly, this script filters regions that have either (1) low mappability,

Cell 168, 1–13.e1–e6, January 26, 2017 e2

Please cite this article in press as: Imielinski et al., Insertions and Deletions Target Lineage-Defining Genes in Human Cancers, Cell (2017), http://dx.doi.org/10.1016/j.cell.2016.12.025

(2) low complexity, or (3) are enriched in aberrant SNP calls in the 1000 genomes project (http://www.internationalgenome.org/). Low mappability hg19 regions were defined from genomic positions for which 37 or fewer of all possible 75-mers intersecting that position cannot be mapped elsewhere in the genome with at most one mismatch or gap. Low complexity hg19 regions were defined by running the mdust program (https://github.com/lh3/mdust), UCSC RepeatMasker (http://www.genome.ucsc.edu), and homopolymers that span 7 or more bases. SNP-enriched regions corresponding to regions of likely hg19 misassembly were chosen as clusters of pre-filtered 1000 Genomes Project SNP calls that harbored excess heterozygosity. All instances of ‘‘eligible territory width’’ in the manuscript are made with reference to the intersection of this interval mask with a particular genomic region (e.g., gene, 10 Kbp tile) Genome-wide noncoding mutational scan To identify regions of the genome under positive somatic mutational selection in lung cancer, we analyzed whole genome sequencing reads from 79 lung adenocarcinoma tumor-normal pairs (Imielinski et al., 2012; Cancer Genome Atlas Research Network, 2014a) aligned to hg19 by BWA (Li and Durbin, 2009), with mean tumor coverage of 54.9-fold (range 41-86) and mean normal coverage of 39.8-fold (range 23-49). These samples were drawn from predominantly early stage and treatment naive surgical resection cases. We generated variant calls using the Strelka (Saunders et al., 2012) algorithm filtered by a genomic interval mask outled in Mallick et al. (2016), yielding 4.65 million somatic single nucleotide variants (SNV) and 36,333 somatic insertion/deletion (indel) variants, corresponding to a mean SNV density of 14.3 per Mbp (range 0.13-99) and mean indel density of 0.09 per Mbp (range 0.0008-0.41) across 2.429 Gbp of eligible territory. Masked regions (666 Mbp) corresponded to regions of low-complexity, low-mappability, and misassembled sequence on hg19, implementing the principles described in (Li, 2014). We tallied mutation calls across the intersection of 6.191 million regularly spaced 10 Kbp intervals (each overlapping by 9.5 Kbp) and the 2.429 Gbp of eligible territory. We used this spacing and interval size as a genome-wide hypothesis set for identifying candidate noncoding hotspots in the genome (Figure 1A), with a focus on regulatory elements (e.g., enhancers or promoters) which are usually less than 10 Kbp in size and would presumably appear as local peaks of mutational density in the analysis. Among the 6.191 million tiles, we excluded regions with fewer than 75% eligible bases or greater than 95% quiescent chromatin as assessed by ChromHMM (Ernst and Kellis, 2012) analysis of the A549 lung adenocarcinoma cell line ENCODE profiles (https://www.genome.gov/ENCODE/), yielding a final set of 2.823 million intervals for hypothesis testing (see ‘‘Genome-wide modeling of neutral mutation density’’ section below). Among eligible subset of these intervals, we computed the values of 8 genomic covariates: GC content, CpG and TpC percentage, replication timing, DNaseI hypersensitivity, quiescent and active chromatin in the A549 cell line, and the 1 Mbp regional mutational density. The values of covariates were aggregated across the covered subsets of the 2.823 million candidate intervals, and fit to observed counts by maximum likelihood. Parameter fits for SNV and indel models are shown in Table S1C. GTEx expression analysis RPKM values were downloaded from the GTEx Portal (www.gtexportal.org/home) for 2917 samples and 30 tissue types (Mele´ et al., 2015). To identify highly expressed genes, we examined the histogram of tissue medians of expression across 20,345 genes and 2917 GTEx samples spanning 30 normal tissues (Mele´ et al., 2015) and identified an upper mode containing 233 genes with expression above 1000 RPKM (Figure S4A). We used complete-linkage clustering with a Euclidean distance metric to cluster median tissue gene expression across highly expressed genes across 30 tissue types and labeled the three top-most clusters (lineage-specific, multi-lineage, housekeeping) based on visual inspection of the dendrogram and heatmap results (Figure S4B). We mapped tumor types in our WGS tumor analysis to tissues-of-origin (Table S3A). We mapped gene and tumor pairs into expression-native and foreign categories if a gene was found to have a 100 or greater RPKM in the tissue from which that tumor is presumed to arise (e.g., gastric cancer and stomach). Examples of highly expressed genes classified as housekeeping included those coding for ribosomal proteins (e.g., RPL13), human leukocyte antigen genes (e.g., HLA-B), and metabolic enzymes (e.g., GADPH). Examples of highly expressed multi-lineage genes include those encoding growth factors (e.g., IGFBP7), apolipoproteins (e.g., APOD), and proteoglycans (e.g., BGN). Highly expressed, lineage-specific genes included those encoding insulin in the pancreatic islets (e.g., INS), cytochrome P450 in the liver (e.g., CYP2E1), and pepsinogen in the stomach (e.g., PGA5). Epigenomic data provenance We compiled transcription factor binding sites and chromatin marks from reference epigenomes profiled in ENCODE and Roadmap Epigenomics data portals (ENCODE Project Consortium, 2012; Roadmap Epigenomics Consortium et al., 2015). We also obtained genomic annotations of topologically associated domain (TAD) boundaries (Jin et al., 2013), 3-D loop domains (Rao et al., 2014), and alternate polyadenylation sites (You et al., 2015) via web links obtained from the respective publications. For ENCODE and Roadmap data, there were multiple profiles per assay (e.g., chromatin or transcription factor ChIP-seq) requiring choice of an optimal tissue matched reference epigenome. To achieve this, we manually annotated ENCODE and Roadmap reference epigenomes and TCGA tumor samples analyzed in our study with respect to a matrix of binary histopathological and anatomic features (Tables S5A and S5B). Using this annotation, we mapped TCGA tumors to their closest reference epigenomes in ENCODE (http://www. mskilab.com/publications/cell2017/ENCODEvsTCGA/index.html) and ROADMAP (http://www.mskilab.com/publications/cell2017/ ROADMAPvsTCGA/index.html) on the basis of their Euclidean distance in feature space (Tables S5A and S5B), allowing the choice of the closest reference epigenome to be made for each sample in a systematic fashion. For other feature types (TAD boundaries, Loop domains, APA), we used a single annotation for all analyses. In the case of TAD boundaries, we used IMR90 profiles as the

e3 Cell 168, 1–13.e1–e6, January 26, 2017

Please cite this article in press as: Imielinski et al., Insertions and Deletions Target Lineage-Defining Genes in Human Cancers, Cell (2017), http://dx.doi.org/10.1016/j.cell.2016.12.025

reference cell type. For loop domain analyses, we chose features that were found in the deeply profiled GM12878 cell type and at least one other lower depth in situ Hi-C profile generated by (Rao et al., 2014). For APA data, there was a single track that represented the union of all discovered alternate polyadenylation sites. Cis and trans analyses of expression and methylation We compared mutant versus WT gene, exon, microRNA expression using the functions voom, eBayes, lmFit within the limma (v3.29) R package in R Bioconductor (https://bioconductor.org/). We compared mutant versus WT methylation using Wilcoxon test on methylation (Beta) values obtained from the Broad GDAC TCGA portal (https://gdac.broadinstitute.org/). We assessed gene set enrichments using the CAMERA (Wu and Smyth, 2012) algorithm via ‘‘camera’’ function in limma. For gene sets we used the Canonical Pathways components of the MsigDB v5.1 (http://software.broadinstitute.org/gsea/msigdb) comprising 1330 gene sets. We applied ‘‘camera’’ with inter.gene.cor parameter set to 0.08. Though 0.05 is the default setting, we found that 0.05 resulted in inflated Q-Q plots when permuting sample labels. We chose 0.08 after sweeping parameter values from 0.01 to 0.1 and finding 0.08 as the setting that, on average, yielded uniform P-value distributions with sample label permutations. Sequence motif analysis We queried hg19 for sequence context in the vicinity of hotspot- and non-hotspot-associated indels. We used the rtracklayer and Biostrings packages in R BioConductor (https://bioconductor.org/) to determine strand-bias, AT-bias, and other simple sequence context characteristics. We used the MEME suite (http://meme-suite.org/) to search for de novo motifs that were enriched in the neighborhood of hotspot-associated indels. Specifically, we used the DREME algorithm (http://meme-suite.org/tools/dreme) with an E value threshold of 0.05 to search for motifs between 3 and 8 sequences in length. We then used Fisher’s Exact Test to determine enrichments of the AATAATD motif in subsets of indels (i.e., associated with specific tumor type-specific hotspots). We also use beta densities to visualize the posterior mean and standard deviation of the fraction of events harboring the AATAATD motif. QUANTIFICATION AND STATISTICAL ANALYSIS Genome-wide modeling of neutral mutation density We applied Gamma-Poisson regression to predict local heterogeneity in the neutral somatic mutation density using sequence, interval, and numeric track covariates in a generalized linear model framework (Figure 1A, Figure S1A, Table S1). We chose the GammaPoisson distribution as a model of over-dispersed count data (Hilbe, 2014). The model was applied to overlapping 10 Kbp tiles staggered at 500 bp intervals along the genome, each corresponding to a hypothesis about a particular region being a target of somatic selection (or non-background mutation processes). We used 8 genomic covariates to predict the local genomic mutation density. For sequence covariates, we used GC, TpC, and CpG fraction as sequence contexts previously associated with mutation signatures in lung cancer (Alexandrov et al., 2013; Imielinski et al., 2012; Lawrence et al., 2013; Pleasance et al., 2010b). For interval track covariates, we used DNaseI hypersensitive sites, quiescent chromatin, and active chromatin in A549 lung cancer cell lines. To obtain these tracks, we downloaded A549 DNaseI hypersensitivity data and other A549 chromatin marks (H3K27me3, H3K4me1, H3K27ac, H3K4me3, H3K36me3) from ENCODE (https://www. genome.gov/ENCODE/). Active and quiescent chromatin regions were defined after applying a 15-state ChromHMM model to input chromatin data (Ernst and Kellis, 2012). We delineated states with H3K27ac marks as ‘‘active’’ and states showing no marks or only H3K27me3 marks as ‘‘quiescent.’’ Numeric covariates comprising replication timing data were obtained from Koren et al. (Koren et al., 2014) and 1 Mbp local somatic variant density was computed from the data. The values of all covariates and mutation counts were computed in the ‘‘eligible territory’’ of each interval i ˛f1; .; ng where eligibility was defined via intersection with a publicly available whole genome interval mask (see ‘‘Sequence Data and Processing’’ section above). Sequence and interval covariate values were computed for each hypothesis i as the fraction of the eligible positions in interval i that matched the given sequence feature or intersected the interval track. Numeric covariates for each hypothesis i were computed as the mean value of the numeric track within the eligible subset of interval i. The mutation count yi was computed as the number of samples in the dataset harboring a mutation in the eligible subset of interval i. We removed intervals with fewer than 75% eligible bases or greater than 95% quiescent chromatin to yield a total of 2.823 million genome-wide hypotheses. We modeled mutation counts yi in each interval i, given eligible territory width wi and k covariate values cji ; j ˛f1; ::; kg, as 0 1 k P aj cji B C yi  GammaPoisson@wi e j = 1 ; qA k P

aj cji

where wi e j = 1 is the mean parameter and q is the shape parameter. The k + 1 parameters of the model (q, aj ; j ˛f1; .; kg) were estimated by maximum likelihood using the MASS R package (http://cran.r-project.org/web/packages/MASS/index.html) on a random

Cell 168, 1–13.e1–e6, January 26, 2017 e4

Please cite this article in press as: Imielinski et al., Insertions and Deletions Target Lineage-Defining Genes in Human Cancers, Cell (2017), http://dx.doi.org/10.1016/j.cell.2016.12.025

b j ˛f1; .; kg we then computed randomized P-values 100,000 subsample of the 2.823 million data matrix rows. Given the ML fit, b q; a pi  Uniformðai ; bi Þ for each hypothesis i using the right tail masses 0 1 k P ba j cji B C qA ai = PGammaPoisson @yRyi + 1; wi e j = 1 ; b

and

0

k P

B bi = PGammaPoisson @yRyi ; wi e j = 1

ba j cji

1 C ;b q A:

P-value randomization is a standard approach used to generate uniformly distributed p values from discrete null distributions (Dickhaus, 2014). Analyses were performed separately for SNVs. and indels. p values were mapped to FDR values using the Benjamini-Hochberg procedure in the R stats package. Since significant (FDR<0.1) candidate intervals overlapped each other at peak regions, we chose a subset of non-overlapping and maximally significant intervals that covered each peak region to comprise the final set of significant loci. Q-Q plots and associated l ‘‘genomic inflation’’ values were computed by --log10 transforming observed quantiles of P values and expected quantiles of the uniform distribution, then fitting a line y = lx through the transformed points. The modeling framework (including covariate calculations, model fitting, and Q-Q plot computation) is provided as an R package (fish.hook, https://github.com/mskilab/fish.hook). Lineage-context analyses of indel and SNV density We compared indel and SNV densities across variants in classes (lineage-specific, multi-lineage, and housekeeping) of 233 highly expressed genes in expression-native and -foreign tumor sample contexts using Gamma-Poisson regression to model mutation counts. We used regression to correct for sample-specific differences in mutation density, which may occur when comparing across tumor types. For expression-native vs. expression-foreign comparisons (Figure 4), we chose a gene class (e.g. multi-lineage genes) to analyze and modeled mutation counts in expression-native ðyi;1 Þ and expression-foreign ðyi;0 Þ genomic territories associated with that gene class across all tumor samples i ˛f1; .ng   yi;c  GammaPoisson wi;c ri ea + bc ; q where c ˛f0; 1g is a binary indicator variable denoting expression-native (1) or expression-foreign (0) status, wi;c is the eligible width for the class c territory in sample i, and ri is the mean (per Mbp) variant density in sample i. We inferred model parameters ða; b; qÞ and 95% confidence intervals for expression-native vs. -foreign enrichment ðln bÞ through maximum likelihood estimation and obtained two-tailed P-values using the Wald test. We applied this test for all three territory classes (lineage-specific, multi-lineage, and housekeeping). We also applied this analysis to individual tumor types to compare indel densities between expression-native and -foreign lineage-specific territories and obtain tumor type-specific P-values and effect sizes. The latter analyses could only be performed in tumor types that were assigned at least one expression-native, lineage-specific territory in the upstream GTEx analysis (Table S3A). This comprised 11 of the 12 non-LUAD tumor types analyzed in this study (all with the exception of bladder cancer). We employed a related analysis (Figure 4) to compare mutation densities in pairs of gene classes within the expression-native tumor context. Namely, for a given pair of gene classes (e.g., housekeeping versus multi-lineage), we modeled mutation counts in the first class ðyi;0 Þ and second class ðyi;1 Þ across all tumor samples i ˛f1; .; ng as above except c ˛f0; 1g indicates gene class status in the given pair, wi;c indicates the eligible width for the expression-native subset of gene class c in sample i, and eb indicates class 1 versus class 0 enrichment as a relative risk. To identify genes that were differentially mutated in expression-native vs. -foreign context (Figure 5) we applied logistic regression on (dichotomous) mutation status in each sample i as a function of average per-sample mutation density ri . Specifically, we modeled the probability of mutation status yi;g = 1 in sample i and gene g with expression-native / -foreign status ci;g as fðlnðri Þ + a + bci;g Þ; where fðxÞ = 1 +1ex . We computed adjusted odds-ratios eb and associated 95% confidence intervals by maximum likelihood and P-values using the Wald test via the ‘‘stats’’ package in R, for each gene g separately. Only genes that were mutated in three or more cases were included in this analysis. Epigenomic feature enrichment analysis We used Gamma-Poisson regression to assess significant enrichment or depletion of indel density in epigenomic peak regions associated with three groups (hotspot genes, other highly expressed genes, 1000 randomly selected genes) in liver, lung, thyroid, and gastric cancer (Figure 6). For a given assay with peak regions (e.g., ChIP-seq peaks, APA sites) and a given window (0, 100, 1000, 10,000 bp) around them we computed the intersection of genes in a given class with (padded) peak regions. We modeled indel counts in groups of peak versus non-peak regions across samples as a function of eligible territory, using peak x group interaction

e5 Cell 168, 1–13.e1–e6, January 26, 2017

Please cite this article in press as: Imielinski et al., Insertions and Deletions Target Lineage-Defining Genes in Human Cancers, Cell (2017), http://dx.doi.org/10.1016/j.cell.2016.12.025

terms ðg1 ; g2 Þ to capture significant enrichment or depletion of peak-associated indels in hotspot regions versus other territory classes. The model is represented by the following equation:   yi;p;g  GammaPoisson wi;p;g ri ea + b0 p + b1 I1 ðgÞ + b2 I2 ðgÞ + g1 pI1 ðgÞ + g2 pI2 ðgÞ ; q for all i ˛f1; .; ng samples, p ˛f0; 1g peak vs. non-peak regions, and g ˛f0; 1; 2g groups (where the values of g indicate hotspot ð2Þ vs. other highly expressed ð1Þ vs. background genes ð0Þ). wi;p;g is eligible territory width for sample i in territory group g with peak status p, ri is the mean (per Mbp) variant density in sample i. The term Ix ðgÞ is an indicator function which is 1 when g = x and 0 otherwise. We inferred model parameters ða; b0 ; b1; b2 ; g1 ; g2 ; qÞ through maximum likelihood estimation and obtained two-tailed P-values using the Wald test. We applied the above model to 440 assay 3 window combinations, and compiled effect sizes and P-values for the g2 term representing the interaction between group 2 and peak status. We applied Bonferroni multiple hypothesis correction for 440 hypotheses and applied a corrected P-value threshold of Pcorrected <0.05 to assess significance. Replication direction analysis We obtained annotations of replication direction from (Haradhvala et al., 2016) and crossed with the GENCODE v19 gene annotation of highly expressed genes to label genes as either ‘‘head-on’’ or ‘‘co-directional’’ with respect to their transcriptional strand and replication orientation. We excluded 2 Gbp of genomic territory that were given an indeterminate replication direction. In each sample, we then subdivided expression-native and -foreign territories with respect to their head-on versus co-directional status. We then applied Gamma-Poisson regression to model expression-native mutation counts yi;c;d as a function of expression-native status c ˛f0; 1g and head-on replication-transcription orientation d ˛f0; 1g across samples i ˛f1; .; ng   yi;c;d  GammaPoisson wi;c;d ri ea + bc + gd + dcd ; q where wi;c;d is the eligible territory width in sample i for territory with expression-native status c with and replication-transcription orientation d, ri is the mean (per Mbp) variant density in sample i. We inferred model parameters ða; b; g; d; qÞ and 95% confidence intervals for the relative risk (ed) corresponding to the interaction term between expression-native status and replication-transcription orientation using maximum likelihood estimation and a two-tailed P-value via the Wald test. DATA AND SOFTWARE AVAILABILITY Custom software tools developed for this study are available as open source R packages. These include fish.hook (https://github. com/mskilab/fish.hook), an R package for applying Gamma-Poisson regression to nominate genome-wide hotspots of mutation density and examine differential enrichment in mutation density across classes of genomic intervals, while taking into account covariates and eligibility. Additional tools comprise custom packages for manipulating genomic intervals (gUtils, https://github.com/ mskilab/gUtils) and visualizing complex genomic tracks (gTrack, https://github.com/mskilab/gTrack). Additional violin plots, gene set visualizations of CAMERA results, and mutation matrices were generated using custom functions in skitools (https://github. com/mskilab/skitools) and muTrix (https://github.com/mskilab/muTrix). Additional Resources Links to multi-track visualizations of statistical signals near top mutation hotspots and alignment data supporting key mutational events are provided at www.mskilab.com/publications/cell2017. Links to these images can also be found among the corresponding supplementary tables in the paper.

Cell 168, 1–13.e1–e6, January 26, 2017 e6

Supplemental Figures A

B

Figure S1. Schematic and Top Hits of a Genome-wide Scan for Recurrent Noncoding Mutations in Lung Adenocarcinoma, Related to Figure 1 (A) Detailed schematic of fish.hook Gamma-Poisson regression model applied to identify genome-wide significant hotspots of indel and SNV mutations in lung adenocarcinoma whole genome sequences. (B) Genomic snapshot of the SFTP loci and their associated covariates, showing both indel and SNV counts in the region of the nominated hotspot.

% prevalence 0

20 40 60 80 100

Cis Gene Loci

E

SFTPB SFTPC SFTPA1

B

VAMP8 VAMP5 RNF181 USP39 GNLY chr2:85.8 MB chr2:85.8 MB chr2:85.8 MB chr2:85.8 MB chr2:85.9 MB P= 0.12 P= 0.94 P= 0.041 P= 0.56 P= 0.38

SFTPB Mut SFTPB WT

MIR21/VMP1

ATOH8 chr2:86 MB P= 0.98

GGCX TMEM150A C2orf68 SFTPB chr2:85.8 MB chr2:85.8 MB chr2:85.8 MB chr2:85.9 MB P= 0.44 P= 0.3 P= 0.79 P= 0.22

15

Expression

A

10 5 0 Mut

WT

Mut

FGF17 chr8:21.9 MB P= 0.32

U2AF1

WT

Mut

WT

Mut

FAM160B2 SFTPC chr8:21.9 MB chr8:22 MB P= 0.98 P= 0.084

WT

Mut

BMP1 chr8:22 MB P= 0.12

WT

Mut

POLR3D chr8:22.1 MB P= 0.36

WT

NUDT18 chr8:22 MB P= 0.68

Mut

WT

Mut

HR chr8:22 MB P= 0.65

WT

Mut

REEP4 chr8:22 MB P= 0.76

WT

LGI3 chr8:22 MB P= 0.046

Mut

WT

PHYHIP chr8:22.1 MB P= 0.43

SFTPC Mut

CTNNB1

SFTPC WT

Mutation Types Silent Missense Splice Site Frame Shift In Frame Indel Other non syn. Nonsense Noncoding hotspot indel Noncoding non-hotspot indel

RBM10 EGFR ARID1A STK11 SMARCA4 KRAS

10

5

0

−5 Mut

WT

Mut

EIF5AL1 chr10:81.3 MB P= 0.05

WT

Mut

WT

Mut

WT

Mut

WT

Mut

WT

Mut

WT

Mut

WT

Mut

WT

Mut

WT

SFTPA1 SFTPA2 chr10:81.4 MB chr10:81.3 MB P= 0.43 P= 0.63

15

SFTPA1 Mut SFTPA1 WT

Expression

ERBB2

Expression

15

KEAP1

10 5 0 −5 Mut

WT

Mut

WT

Mut

WT

BRAF NF1 TP53

F

ATM APC

SFTP Exons

SFTPB Mut

CDKN2A

SFTPB WT

TERT Del Amp

SFTPC Mut

NKX2−1

SFTPC WT

MIR21

D

LUAD−TCGA−05−4396−Tumor−SM−23JGI

LUAD−TCGA−44−2665−Tumor−SM−1A2MT

LUAD−TCGA−44−6148−Tumor−SM−1ZKHZ

LUAD−TCGA−50−6597−Tumor−SM−23JHT

LUAD−TCGA−05−5429−Tumor−SM−1WKRP

LUAD−TCGA−44−2659−Tumor−SM−26X6X

LUAD−TCGA−05−4398−Tumor−SM−26X77

LUAD−TCGA−38−4630−Tumor−SM−3L4DD

LUAD−TCGA−55−6982−Tumor−SM−26X96

LUAD−TCGA−55−6986−Tumor−SM−26X8E

LUAD−TCGA−75−6203−Tumor−SM−1ZKIE

LUAD−S01356−Tumor

LUAD−S01407−Tumor

LUAD−S01346−Tumor

LUAD−TCGA−78−7143−Tumor−SM−2FWRO

LUAD−TCGA−50−5930−Tumor−SM−1ZKIM

LUAD−TCGA−55−1596−Tumor−SM−1A2MJ

LUAD−TCGA−44−2666−Tumor−SM−1A2MY

LUAD−E01278−Tumor

LUAD−TCGA−75−7030−Tumor−SM−26X9E

LUAD−TCGA−55−6984−Tumor−SM−26X9K

LUAD−S01473−Tumor

LUAD−TCGA−55−8299−Tumor−SM−342D9

LUAD−TCGA−91−6840−Tumor−SM−26X91

LUAD−2GUGK−Tumor

LUAD−TCGA−73−4659−Tumor−SM−26X74

LUAD−TCGA−05−4420−Tumor−SM−26X7C

LUAD−S01467−Tumor

LUAD−TCGA−05−4397−Tumor−SM−26X72

LUAD−FH5PJ−Tumor

LUAD−E01217−Tumor

LUAD−TCGA−49−4510−Tumor−SM−26X7N

LUAD−S01331−Tumor

LUAD−TCGA−05−4432−Tumor−SM−26X7M

LUAD−S01478−Tumor

LUAD−S01468−Tumor

LUAD−E00934−Tumor

LUAD−S00488−Tumor

LUAD−TCGA−50−6591−Tumor−SM−1ZKIY

LUAD−TCGA−78−7156−Tumor−SM−2FWSH

LUAD−TCGA−38−4628−Tumor−SM−26X7R

LUAD−TCGA−67−3771−Tumor−SM−1A2M8

LUAD−TCGA−05−4389−Tumor−SM−26X6W

LUAD−S01345−Tumor

LUAD−D02326−Tumor

LUAD−TCGA−55−1594−Tumor−SM−1A2M7

LUAD−TCGA−73−4666−Tumor−SM−26X79

LUAD−TCGA−91−6847−Tumor−SM−26X98

LUAD−TCGA−75−5147−Tumor−SM−1WKQZ

LUAD−TCGA−67−3772−Tumor−SM−1A2ME

LUAD−5V8LT−Tumor

LUAD−TCGA−64−1680−Tumor−SM−1DUJ2

LUAD−TCGA−55−6972−Tumor−SM−26X9J

LUAD−S01405−Tumor

LUAD−TCGA−78−7146−Tumor−SM−2FWS7

LUAD−TCGA−78−7535−Tumor−SM−2HMPN

LU−A08−43−Tumor

LUAD−TCGA−55−7281−Tumor−SM−2FWSF

LUAD−TJW61−Tumor

LUAD−TCGA−78−7158−Tumor−SM−2FWSQ

LUAD−TCGA−49−4512−Tumor−SM−23JGK

LUAD−TCGA−67−6215−Tumor−SM−1ZKID

LUAD−TCGA−49−6742−Tumor−SM−23JH7

LUAD−TCGA−49−4486−Tumor−SM−26X7D

LUAD−S01341−Tumor

LUAD−TCGA−05−4395−Tumor−SM−3L4CV

LUAD−TCGA−50−5932−Tumor−SM−1ZKJ9

LUAD−AEIUF−Tumor

LUAD−S01381−Tumor

LUAD−E01014−Tumor

LUAD−TCGA−97−8171−Tumor−SM−342DO

LUAD−U6SJ7−Tumor

LUAD−S01302−Tumor

LUAD−E01317−Tumor

LUAD−TCGA−50−5066−Tumor−SM−1WKRQ

LUAD−TCGA−05−4422−Tumor−SM−26X7H

LUAD−S01404−Tumor

LUAD−TCGA−64−1678−Tumor−SM−1DUIY

LUAD−QY22Z−Tumor

0

20 40 60 80 100

% prevalence

Never Smoker Ex Smoker Current Smoker N/A

exon 6 P= 0.23

Mut

WT

Mut

WT

Mut

exon 2 P= 0.067

WT

Mut

exon 3 P= 0.07

WT

Mut

WT

exon 4 P= 0.071

Mut

WT

exon 5 P= 0.22

exon 4 P= 0.23

exon 3 P= 0.23

exon 2 P= 0.23

exon 1 P= 0.24

Mut

WT

Mut

WT

Mut

WT

Mut

WT

Mut

WT

exon 5 P= 0.052

SFTPC

5 0 WT

exon 1 P= 0.4

SFTPA1 WT

exon 7 P= 0.22

SFTPB WT

Mut

SFTPA1 Mut

exon 8 P= 0.22

10

Smoking Status

Smoking status

exon 9 P= 0.21

0

exon 1 P= 0.049

WT NA

CCND1

exon 10 P= 0.23

5

Mut

Expression

MYC CDKN2A

exon 11 P= 0.19

10

CN Types

Expression

C

Expression

exon 12 P= 0.32

PIK3CA

Mut

exon 2 P= 0.53

WT

Mut

exon 3 P= 0.52

WT

exon 4 P= 0.5

Mut

WT

exon 5 P= 0.5

Mut

exon 6 P= 0.48

WT

exon 7 P= 0.46

8

SFTPA1

4 0 Mut

WT

Mut

WT

Mut

WT

Mut

WT

Mut

WT

Mut

WT

Mut

WT

Lung adenocarcinoma WGS cases

Figure S2. Mutational Correlates of Top Loci in Lung Adenocarcinoma, Related to Figure 2 (A–D) (A) Matrix of alteration status at SFTP loci and (B) other recurrently mutated or (C) recurrently copy-number-altered loci in lung adenocarcinoma or (D) smoking status across 79 lung adenocarcinoma whole genome sequences. There was no significant correlation of SFTP or MIR21 locus mutation status at known loci or smoking status in lung adenocarcinoma, following multiple hypothesis correction. (E and F) (E) Correlation of cis regional gene expression and (F) exon expression with indel status at SFTPB, SFTPA1, and SFTPC loci in lung adenocarcinoma tumor samples. p values for differential expression were obtained using the limma and voom R packages.

A

B

Expression vs. SFTP mutation status correlations

Expression vs. TP53 mutation status correlations

0.5 Pearson r

0.5 Pearson r

0.0 −0.5

0.0 −0.5

0

5000

10000

15000

0

17683

Gene ranks

5000

10000

15000

17683

Gene ranks DNA STRAND ELONGATION (Dir = Up, P = 1.1e−05, FDR = 0.0087) C MCM7

GINS1

RFC4

C MCM6

GINS4

C CDC45

DNA2

C MCM8

RFC5

PRIM1

KEGG RIBOSOME (Dir = Up, P = 5.3e−07, FDR = 0.00065) RPL30

RPS23

RPS4Y1

RPL23A

RPL6

RPL8

RPLP2

RPL37A

RPL37

RPS24

G2 M CHECKPOINTS (Dir = Up, P = 1.9e−05, FDR = 0.0087) CHEK1

MCM7

CHEK2

RFC4

ORC1

MCM6

CDK1

DBF4

MCM10

CDC6

PEPTIDE CHAIN ELONGATION (Dir = Up, P = 9.8e−07, FDR = 0.00065) RPL30

RPS23

RPS4Y1

RPL23A

RPL6

RPL8

RPLP2

RPL37A

RPL37

RPS24

ACTIVATION OF THE PRE REPLICATIVE COMPLEX (Dir = Up, P = 2e−05, FDR = 0.0087) MCM7

INFLUENZA VIRAL RNA TRANSCRIPTION AND REPLICATION (Dir = Up, P = 2.3e−05, FDR = 0.0088) RPL30

RPS23

RPS4Y1

RPL23A

RPL6

RPL8

POLR2L

RPLP2

RPL37A

ORC1

MCM6

DBF4

MCM10

RPL37

CHEK1

C MCM7

RFC4

C MCM6

ORC1

RPS23

RPS4Y1

RPL23A

RPL6

RPL8

RPLP2

RPL37A

RPL37

PSME2

RPS24

RPS23

RPS4Y1

RPL23A

RPL6

RPL8

RPLP2

RPL37A

RPL37

POLE2

CDC45

MCM8

DBF4

MCM10

CDC6

C CDC45

CDC7

MCM7

ORC1

MCM6

DBF4

MCM10

PSMA6

PSME1

CDC6

CDC7

SYNTHESIS OF DNA (Dir = Up, P = 4.4e−05, FDR = 0.0098)

SRP DEPENDENT COTRANSLATIONAL PROTEIN TARGETING TO MEMBRANE (Dir = Up, P = 7.9e−05, FDR = 0.021) RPL30

CDC7

M G1 TRANSITION (Dir = Up, P = 3.7e−05, FDR = 0.0097)

3 UTR MEDIATED TRANSLATIONAL REGULATION (Dir = Up, P = 2.6e−05, FDR = 0.0088) RPL30

CDC6

ACTIVATION OF ATR IN RESPONSE TO REPLICATION STRESS (Dir = Up, P = 3.5e−05, FDR = 0.0097)

PSME2

MCM7

GINS1

RFC4

ORC1

MCM6

GINS4

PSMA6

DNA2

CCNA2

RPS24

ASSEMBLY OF THE PRE REPLICATIVE COMPLEX (Dir = Up, P = 6.7e−05, FDR = 0.013) NONSENSE MEDIATED DECAY ENHANCED BY THE EXON JUNCTION COMPLEX (Dir = Up, P = 0.00011, FDR = 0.025) RPL30

RPS23

RPS4Y1

RPL23A

RPL6

RPL8

RPLP2

RPL37A

RPL37

PSME2

MCM7

MCM7

GINS1

ORC1

MCM6

PSMA6

PSME1

CDC6

PSMA5

MCM8

CDT1

RPS24

UNWINDING OF DNA (Dir = Up, P = 7.9e−05, FDR = 0.013) MCM6

GINS4

CDC45

MCM8

MCM4

MCM2

GINS2

MCM5

MITOCHONDRIAL FATTY ACID BETA OXIDATION (Dir = Up, P = 0.00016, FDR = 0.026) DECR1

ACADS

ACADVL

ECHS1

MCEE

HADHB

ACADM

ECI1

HADHA

ACADL

EXTENSION OF TELOMERES (Dir = Up, P = 1e−04, FDR = 0.013) RFC4

DNA2

POLE2

RFC5

PRIM1

RFC3

TERT

POLD1

PCNA

FEN1

FORMATION OF THE TERNARY COMPLEX AND SUBSEQUENTLY THE 43S COMPLEX (Dir = Up, P = 0.00017, FDR = 0.026) RPS23

RPS4Y1

RPS24

RPS17

RPS18

RPS12

EIF3F

RPS25

RPS27

RPS19

KEGG DNA REPLICATION (Dir = Up, P = 0.00011, FDR = 0.013) MCM7

RFC4

MCM6

DNA2

POLE2

RFC5

PRIM1

RNASEH2A

MCM4

MCM2

RESPIRATORY ELECTRON TRANSPORT (Dir = Up, P = 0.00018, FDR = 0.026) UQCRBP1

UQCRB

NDUFV2

COX7C

NDUFA10

UQCRQ

NDUFB8

UQCRC1

CYC1

DNA REPLICATION (Dir = Up, P = 0.00011, FDR = 0.013)

NDUFB1 PSME2

MCM7

AURKB

RESPIRATORY ELECTRON TRANSPORT ATP SYNTHESIS BY CHEMIOSMOTIC COUPLING (Dir = Up, P = 2e−04, FDR = 0.026) UQCRBP1

UQCRB

NDUFV2

COX7C

NDUFA10

UQCRQ

NDUFB8

ATP5G1

UQCRC1

CYC1

EME1

RPS4Y1

RPS24

RPS17

RPS18

RPS12

EIF3F

RPS25

RPS27

CENPA

GINS1

RFC4

DSN1

ORC1

CDC20

XRCC2

BLM

BRCA2

RAD54L

RAD54B

MUS81

XRCC3

RAD51

POLD1

LAGGING STRAND SYNTHESIS (Dir = Up, P = 0.00013, FDR = 0.013)

ACTIVATION OF THE MRNA UPON BINDING OF THE CAP BINDING COMPLEX AND EIFS (Dir = Up, P = 0.00078, FDR = 0.094) RPS23

SGO2

KEGG HOMOLOGOUS RECOMBINATION (Dir = Up, P = 0.00012, FDR = 0.013)

RFC4

DNA2

RFC5

PRIM1

RFC3

POLD1

PCNA

FEN1

RFC2

POLD3

RPS19

MITOTIC M-M/G1 PHASES (Dir = Up, P = 0.00014, FDR = 0.013) TRANSLATION (Dir = Up, P = 0.00086, FDR = 0.095) RPL30

RPS23

RPS4Y1

RPL23A

RPL6

RPL8

RPLP2

RPL37A

RPL37

PSME2

MCM7

PSME2

ORC1

AURKB

SGO2

CENPA

DSN1

ORC1

CDC20

MCM6

KIF2C

RPS24

CDT1 ASSOCIATION WITH THE CDC6 ORC ORIGIN COMPLEX (Dir = Up, P = 0.00016, FDR = 0.014)

Gene ranks

PSMA6

PSME1

CDC6

PSMA5

MCM8

CDT1

ORC6

PSMC2

Gene ranks 0

5000

10000

15000

17683

0

5000

10000

15000

17683

Figure S3. Gene Set Enrichment Analysis of Mutant and WT Samples in Lung Adenocarcinoma, Related to Figure 3 (A and B) (A) Notch plot of gene sets arising from CAMERA analysis (Wu and Smyth, 2012) using canonical pathways in the molecular signatures database (MSigDB, http://software.broadinstitute.org/gsea/msigdb). Notch plots visualize gene set expression correlations with SFTP and (B) TP53 hotspot mutation status. Pearson correlations are shown in the top waterfall panel. Gene sets are arranged in the bottom panel, labeled with P and FDR values from the CAMERA analysis (Wu and Smyth, 2012). The x position of each notch represents gene rank and its color the Pearson r of the expression to phenotype correlation. In (A), all gene sets with FDR<0.1 are shown, while in (B) only the top 15 most significantly differentially expressed are shown.

C

40

60

80

Relative indel density

Skin

102

103

104

Stomach

Liver Tumor Sample Context Expression-native (EN) Expression-foreign (EF)

1

EN

101

1

Thyroid

100

0

20

Frequency

100

A

EF

EN

EF

EN

EF

EN

EF

105 P = 4.3 x 10-3

P = 5.8 x 10-8

P = 6.8 x 10-10

P = 2.8 x 10-15

Maximum median tissue RPKM+1

RPKM 102

103

104

D

105

House-keeping Multi-lineage

Uterus

ALB

20

LIPF TG

15 10

ALDOB

5

FGG

0 0.1

1.0

10.0

Enrichment (Odds Ratio)

Gene class Housekeeping Multi−lineage Lineage−specific

SNV 25 20 15 10

ALB

5

TG

0 0.1

1.0

10.0

Enrichment (Odds Ratio)

Esophagus

Blood Vessel

Ovary

Bladder

Prostate

Cervix Uteri

Fallopian Tube

Adipose Tissue

Skin

Nerve

Breast

Vagina

Thyroid

Lung

Colon

Stomach

Small Intestine

Brain

Spleen

Salivary Gland

Testis

Kidney

Blood

Pituitary

Adrenal Gland

Liver

Heart

Muscle

Pancreas

Lineage-specific

Indel 25

Genes

RPL19 RPL8 RPL18AP3 RPL13A RPL13AP5 RP11−543P15.1 RPS16 RPL39P3 RPLP0 RPL10 EEF2 RP11−466H18.1 RPL3P4 EEF1A1P5 RPL3 ACTG1 RP11−138I1.4 OAZ1 RPS13 RP11−3P17.3 RPS25 RPL5 RPS6 RPL13 RPS2 RPL10A RPS8 TXNIP TMSB4X TMSB10 IFITM3 MTND2P28 hsa−mir−6723 PEBP1 HSPB1 ALDOA RP5−977B1.11 MYL9 GPX3 HLA−E HLA−C CD74 B2M HLA−B RP11−124N14.3 VIM IFITM2 S100A11 RPS12 RPS11 RPLP2 RPS27A RPS18 TPT1 RPLP1 FTH1 ACTB RP5−940J5.9 GAPDH FTL SLC25A4 AKR1B1 YBX3 PTGDS APOE FABP5P7 KRT10 SLPI C3 SERPINA3 FABP4 FABP3 HBB HBA2 SRGN HBA1 S100A9 S100A8 IGLC1 IGKJ4 IGLC3 IGLC2 IGHA2 IGHA1 IGHG1 IGHG2 IGHM RP11−1143G9.4 LYZ IGFBP7 DSTN A2M MT2A BGN FHL1 PMP22 CRYAB APOD ACTA2 TAGLN FLNA TPM2 MGP IGFBP5 MYH11 DES AC053503.6 APOA2 APCS HPX ORM2 CRP APOH FGB AGXT FGA ORM1 FGG CYP2E1 APOC3 TTR FGL1 AMBP ALB HP APOA1 SERPINA1 ALDOB SAA1 SAA2−SAA4 SAA2 RBP4 APOC1 SPP1 MPZ CTRC PNLIPRP1 RP11−331F4.4 CELA2A AMY2A INS CELA2B CUZD1 PRSS3P2 PRSS3P1 PNLIPRP2 GP2 CTRB2 CTRB1 CELA3B PLA2G1B SYCN CPA2 CPB1 CPA1 CLPS CELA3A PNLIP CEL PRSS1 REG1B REG3A PRSS3 REG1A SPINK1 LGALS4 PHGR1 PIGR CTD−2331C18.5 PGA3 GKN1 PGA5 LIPF PGC CYP17A1 CYP11B1 CYP21A2 HSD3B2 STAR PHF7 PRM2 PRM1 TNP1 SFTPA2 SFTPA1 SFTPC SFTPB TG KLK2 KLK3 DEFA5 DEFA6 FDCSP MUC7 ZG16B LHB MIR7−3HG POMC PRL GH1 CGA NNAT AC019349.5 KRT13 KRT6A SPRR1A SPRR3 KRT5 KRT14 KRT1 MYL2 MB CTD−2201G16.1 MYH7 ACTA1 CKM TNNC1 EEF1A2 TNNC2 MYL1 MYLPF MYBPC1 MYH1 TNNT1 KLHL41 TNNI2 ENO3 PYGM MYL3 ANKRD1 TNNT2 TNNI3 ACTC1

−log10(p)

101

0

−log10(p)

B

GTEx Tissues Figure S4. Analysis of Healthy Tissue Gene Expression and Somatic Mutational Density across Cancer, Related to Figure 4 (A) Histogram of median gene expression values in GTEx (http://www.gtexportal.org/home) showing the maximum of median per-tissue gene expression values for each gene in GTEx. The red line represents the threshold used to define ‘‘highly expressed genes.’’ (B) Heatmap of log median RPKM values across 30 tissue types and 233 genes demonstrating 1000 or greater median RPKM value in at least one tissue type. The dendrogram on the left was obtained through complete-linkage hierarchical clustering and a Euclidean distance metric. Gene clusters were defined by choosing a tree cutoff to yield 3 clusters. Housekeeping, Multi-lineage, and Lineage-specific labels were assigned to gene clusters following visual inspection of the heatmap and dendrogram. (C) Boxplots demonstrating statistically significant (Wald test, Gamma-Poisson regression) tumor-type specific differences in indel density for expression-native (EN) versus expression-foreign (EF) tumor contexts. Each notch and box represents maximum likelihood estimate and 95% CI for the rate parameter for the given subgroup. Each data point represents a tumor sample and EN versus EF gene context combination. (D) Volcano plots of P-values (Wald test, logistic regression) and odds ratios for EN versus EF enrichment at specific house-keeping, multi-lineage, and lineagespecific genes across 487 cases.

ALB chr4:74.3 MB P= 0.47

10

ALDOB Mut

5

ALDOB WT

0 −5

5

WT

Mut

WT

Mut

0 −5

PLRG1 chr4:155.5 MB P= 0.032

LRAT chr4:155.5 MB P= 0.3

FGB chr4:155.5 MB P= 0.31

D

FGG chr4:155.5 MB P= 0.28

FGA chr4:155.5 MB P= 0.36

Expression

10

TG Mut

5

TG WT

0 Mut

WT

Mut

Mut

WT

Mut

WT

Mut

WT

exon 1 exon 2 exon 3 exon 4 exon 5 exon 6 exon 7 P= 0.66 P= 0.63 P= 0.58 P= 0.61 P= 0.59 P= 0.55 P= 0.51

Mut

WT

Mut

WT

Mut

exon 8 P= 0.5

E 12.5

WT

LIPF chr10:90.4 MB P= 0.92

10.0 7.5

LIPF Mut

5.0

LIPF WT

0

2.5

−5

0.0 Mut WT

Mut

WT

Mut

WT

Mut

WT

WT

H

FGG

exon 9 exon 10 exon 11 exon 12 exon 13 exon 14 exon 15 P= 0.49 P= 0.48 P= 0.51 P= 0.48 P= 0.49 P= 0.45 P= 0.43

exon 9 P= 0.35

exon 8 P= 0.32

exon 7 P= 0.29

exon 6 P= 0.25

exon 5 P= 0.2

exon 4 P= 0.17

10.0

12

7.5 8

5.0 2.5

4 WT

Mut

WT

Mut

WT

Mut

WT

Mut

WT

Mut

WT

G

Mut

WT

Mut

WT

Mut

WT

Mut

WT

Mut

WT

Mut

WT

Mut

WT

Mut

WT

Mut

WT

ALDOB Expression

exon 9 P= 0.35

ALDOB WT

WT

SLA chr8:134 MB P= 0.47

12.5

Mut

ALDOB Mut

Mut

5

Expression

ALB WT

Expression

ALB Mut

WT

10

ALB 16

Mut

15

Mut

F

WT

WISP1 TG PHF20L1 chr8:133.8 MB chr8:133.9 MB chr8:134.2 MB P= 0.61 P= 0.19 P= 0.9

15

−5

WT

WT

Expression

C

FGG WT

10

Mut Mut

FGG Mut

LPPR1 ZNF189 RNF20 BAAT MRPL50 ALDOB chr9:103.8 MB chr9:104.2 MB chr9:104.3 MB chr9:104.1 MB chr9:104.1 MB chr9:104.2 MB P= 0.9 P= 0.72 P= 0.4 P= 0.79 P= 0.77 P= 0.96 15

Expression

ALB WT

B

AFM chr4:74.3 MB P= 0.051

15

Expression

ALB Mut

AFP chr4:74.3 MB P= 0.82

Expression

A

exon 8 P= 0.32

exon 7 P= 0.29

exon 6 P= 0.25

exon 5 P= 0.2

exon 4 P= 0.17

exon 3 P= 0.16

exon 2 P= 0.17

0.0 Mut

WT

Mut

exon 3 P= 0.16

WT

exon 2 P= 0.17

Mut

WT

Mut

WT

WT

Mut

WT

12.5 10.0

exon 1 P= 0.16

Mut

exon 1 P= 0.16

12.5

7.5

10.0

5.0

7.5

2.5

5.0

0.0

FGG Mut FGG WT

Mut

2.5

WT

Mut

WT

Mut

WT

0.0 Mut

WT

Mut

WT

Mut

WT

Mut

WT

Mut

WT

Mut

WT

Mut

WT

Mut

I

WT

Mut

WT

TG exon 1 P= 0.73

exon 2 P= 0.73

exon 3 P= 0.68

exon 4 P= 0.68

exon 5 P= 0.66

exon 6 P= 0.61

exon 7 P= 0.61

exon 8 P= 0.62

exon 9 P= 0.7

exon 10 P= 0.63

exon 11 P= 0.57

exon 12 P= 0.5

exon 13 P= 0.49

exon 14 P= 0.51

exon 15 P= 0.5

exon 16 P= 0.46

exon 17 P= 0.45

15

10

5

0

TG WT

Mut

Expression

TG Mut

WT

Mut

exon 18 P= 0.43

WT

Mut

exon 19 P= 0.46

WT

exon 20 P= 0.48

Mut

WT

Mut

exon 21 P= 0.47

WT

exon 22 P= 0.46

Mut

WT

exon 23 P= 0.44

Mut

WT

Mut

exon 24 P= 0.43

WT

exon 25 P= 0.43

Mut

WT

exon 26 P= 0.46

Mut

WT

Mut

WT

Mut

WT

Mut

WT

Mut

WT

Mut

WT

Mut

WT

exon 27 P= 0.37

exon 28 P= 0.36

exon 29 P= 0.36

exon 30 P= 0.34

exon 31 P= 0.31

exon 32 P= 0.32

exon 33 P= 0.3

Mut

Mut

Mut

Mut

Mut

Mut

Mut

Mut

WT

15

10

5

0

Mut

WT

Mut

exon 34 P= 0.29

WT

Mut

exon 35 P= 0.29

WT

exon 36 P= 0.29

Mut

WT

Mut

exon 37 P= 0.3

WT

exon 38 P= 0.26

Mut

WT

Mut

exon 39 P= 0.24

WT

Mut

exon 40 P= 0.2

WT

exon 41 P= 0.17

Mut

WT

exon 42 P= 0.16

WT

exon 43 P= 0.12

WT

exon 44 P= 0.089

WT

exon 45 P= 0.06

WT

exon 46 P= 0.04

WT

exon 47 P= 0.025

WT

WT

exon 48 P= 0.017

15

10

5

0

Mut

WT

Mut

WT

Mut

WT

Mut

WT

Mut

J

WT

Mut

WT

Mut

WT

Mut

WT

WT

Mut

WT

Mut

WT

exon 3 P= 0.87

exon 4 P= 0.97

Mut

WT

0

exon 6 P= 1

10

Mut

WT

exon 7 P= 0.97

Mut

WT

exon 8 P= 0.96

WT

Mut

WT

ALB

1.00

WT

Mut

exon 5 P= 0.94

5

Mut

LIPF Mut

Mut

Methylation

Expression

10

exon 2 P= 0.94

WT

K

LIPF exon 1 P= 0.99

Mut

Mut

WT

exon 9 P= 0.99

Mut

WT

exon 10 P= 0.98

ANKRD17 P=0.11

ALB P=9 x 10-5

AFP P=0.73

AFM P=0.59

RASSF6 P=0.1

0.75 0.50 0.25 0.00 Mut

WT

Mut

WT

Mut

WT

Mut

WT

Mut

WT

LIPF WT 5

ALB Mut ALB WT

0 Mut

WT

Mut

WT

Mut

WT

Mut

WT

Mut

WT

(legend on next page)

Figure S5. Genomic Correlates of Liver, Thyroid, and Gastric Cancer Indel Hotspot Mutations, Related to Figure 5 (A–J) Correlation of cis (A–E) regional gene expression and (F–J) exon expression associated with ALB, ALDOB, FGG, TG, and LIPF loci in LIHC, LIHC, LIHC, THCA, and STAD samples, respectively that are mutant versus wild type for somatic indel mutations in the respective locus. P-values for expression were obtained using the limma and voom R packages. (K) Differential methylation in the vicinity of the ALB gene in LIHC samples that are mutant or wild type for somatic ALB indels. P-values for methylation were obtained by applying a Wilcoxon test to the log transformed methylation values in mutant and wild type samples. Associations with methylation status were not observed for other indel hotspot loci (data not shown).

A Pearson r

B

Expression vs. ALB / FGG / ALDOB mutation status correlations in LIHC

0.5

Expression vs. TG mutation status correlations in THCA 0.5

Pearson r

0.0 −0.5

0.0 −0.5

0

5000

10000

15000

17250

Gene ranks

0

5000

10000

15000

16880

Gene ranks KEGG FATTY ACID METABOLISM (Dir = Up, P = 3.1e−05, FDR = 0.017) ECI2

ALDH2

ALDH7A1

ACAT2

ADH4

ACAA1

GCDH

ADH1A

EHHADH

KEGG ALLOGRAFT REJECTION (Dir = Down, P = 3.6e−05, FDR = 0.026)

ECHS1

HLA−B

XENOBIOTICS (Dir = Up, P = 4.7e−05, FDR = 0.017) CYP2B6

CYP2D6

CYP3A4

CYP2A6

CYP1A1

CYP2A13

CYP2C9

CYP2C8

CYP2C19

CYP1A2

HLA−B

SYNTHESIS OF BILE ACIDS AND BILE SALTS VIA 7ALPHA HYDROXYCHOLESTEROL (Dir = Up, P = 5.2e−05, FDR = 0.017) ACOX2

AKR1D1

SLC27A5

CYP8B1

AKR1C4

SLC27A2

HSD17B4

CYP7A1

ACOT8

GGCX

PROC

F7

PROZ

PROS1

VKORC1

F9

FURIN

ADH4

ADH1A

GSTA2

MGST2

CYP2B6

ZAP70

MGST1

ADH6

GCAT

AGXT

GLYCTK

GNMT

SHMT1

SARDH

PIPOX

GLDC

GSTA4

ALDH7A1

PCCB

ACAT2

ALDH6A1

EHHADH

MLYCD

ECHS1

CD3E

ADH4

MASP2

C8A

ADH1A

CYP2B6

DGAT2

ADH6

CYP3A4

CYP4A22

UGT1A4

CD3E

MASP1

C3

C4A

MBL2

C1QB

CD247

ABAT

SERPINC1

PROC

F7

PROS1

FGA

FGG

FGB

HLA−DPA1

HLA−DPA1

HLA−DOA

HLA−DOA

HLA−DQA1

HLA−DOB

HLA−DQA1

CD247

HLA−C

HLA−C

HLA−DRB1

HLA−DPA1

HLA−DRB1

PIK3CA

UGT2B4

ITK

CD3D

HLA−DRB5

HLA−G

KLRC1

HLA−E

HLA−DRB5

HLA−G

HLA−E

HLA−DRB5

HLA−G

ICA1

HSPD1

HLA−DPB1

CD4

CD3G

HLA−DPA1

HLA−DRB5

IL10

HLA−DOA

HLA−C

HLA−E

HLA−DRB5

HLA−G

HLA−DPB1

PTPRC

CD4

CD3G

HLA−DPA1

HLA−DRB5

CD80

HLA−DRB1

CD86

HLA−DRA

CD3G

CD28

ICOSLG

CD3E

CD247

FASLG

CCR5

FAS

CD4

CD3G

CD28

B2M

CD3G

BIOCARTA CTL PATHWAY (Dir = Down, P = 0.00089, FDR = 0.12)

C7 C6

F5

HLA−E

BIOCARTA TCAPOPTOSIS PATHWAY (Dir = Down, P = 0.00057, FDR = 0.094)

GZMB

BIOCARTA EXTRINSIC PATHWAY (Dir = Up, P = 0.00016, FDR = 0.021) F2

CD247

CD28

BIOCARTA COMP PATHWAY (Dir = Up, P = 0.00016, FDR = 0.021) C2

HLA−C

BIOCARTA CTLA4 PATHWAY (Dir = Down, P = 0.00038, FDR = 0.073)

KEGG RETINOL METABOLISM (Dir = Up, P = 0.00014, FDR = 0.021) DHRS4

HLA−DOA

PHOSPHORYLATION OF CD3 AND TCR ZETA CHAINS (Dir = Down, P = 0.00034, FDR = 0.073)

ALAS1

LDHC

HLA−DOB

HLA−DOB

HLA−B

KEGG PROPANOATE METABOLISM (Dir = Up, P = 1e−04, FDR = 0.02) ALDH2

IL10

KEGG AUTOIMMUNE THYROID DISEASE (Dir = Down, P = 0.00028, FDR = 0.073)

KEGG GLYCINE SERINE AND THREONINE METABOLISM (Dir = Up, P = 8.5e−05, FDR = 0.019) CBS

HLA−DPA1

TRANSLOCATION OF ZAP 70 TO IMMUNOLOGICAL SYNAPSE (Dir = Down, P = 0.00024, FDR = 0.073)

F10

CYP2D6

CD28

CD28

KEGG DRUG METABOLISM CYTOCHROME P450 (Dir = Up, P = 6.4e−05, FDR = 0.017) GSTA1

HLA−DOB

KEGG TYPE I DIABETES MELLITUS (Dir = Down, P = 0.00019, FDR = 0.073)

SCP2

GAMMA CARBOXYLATION TRANSPORT AND AMINO TERMINAL CLEAVAGE OF PROTEINS (Dir = Up, P = 6e−05, FDR = 0.017) F2

CD28

KEGG GRAFT VERSUS HOST DISEASE (Dir = Down, P = 3.9e−05, FDR = 0.026)

PRF1

CD3D

ITGB2

CD3E

CD247

FASLG

FAS

PD1 SIGNALING (Dir = Down, P = 0.00097, FDR = 0.12)

F10

CD3E

Gene ranks

CD247

HLA−DQA1

HLA−DRB1

HLA−DPB1

CD4

CD3G

HLA−DPA1

HLA−DRB5

CD274

Gene ranks 0

5000

10000

15000

17250

0

5000

10000

15000

16880

Figure S6. Gene Set Enrichment Analysis of Samples Harboring Mutations at Liver and Thyroid Indel Hotspots, Related to Figure 5 (A and B) Notch plot of the top 10 most significant gene sets arising from CAMERA analysis (Wu and Smyth, 2012) of (A) ALB / FGG / ALDOB mutant versus wild type liver cancers and (B) TG wild type and mutant thyroid cancers using canonical pathways in the molecular signatures database (MSigDB, http://software. broadinstitute.org/gsea/msigdb). Pearson correlations are shown in the top waterfall panel. Gene sets are arranged in the bottom portion of the panel, labeled with P and FDR values from the CAMERA analysis. The x position of each notch represents gene rank and its color the Pearson r of the expression to phenotype correlation.

A

B

H2BK5ac, 10kbp (P = 4.2 × 10–10)

H2BK15ac, 10kbp (P = 1.7 × 10–8)

P = 0.38

1000

10

non−peak

peak

Background

non−peak

peak

Highly expressed

non−peak

peak

–8

Relative indel density

Relative indel density

–10

P = 3.8 × 10 –6 P = 0.17 P = 6.2 × 10 P = 5.6 × 10–6 P = 0.25

P = 0.24

P = 1.8 × 10 P = 0.44 P = 0.00028 P = 6.2 × 10–5 P = 0.69

1000

10

non−peak

Hotspot

peak

Background

non−peak

peak

Highly expressed

non−peak

peak

Hotspot

E

P = 0.0035

Orientation

D

Loop domains, 10kbp (P = 8.9 × 10–9) P = 9.4 × 10–9 P = 0.064 P = 0.18 P = 4.2 × 10 P = 0.30 –19

P = 0.22

1000

10

Out

In

Background

Out

In

Highly expressed

Out

Hotspot

In

Relative indel density

Relative indel density

C

Alternate polyadenylation sites, 10Kbp (P = 7.8 × 10–6) P = 8.0 × 10–6 P = 0.0077 P = 0.67 –5 P = 0.093 P = 8.8 × 10

P = 0.00027

Relative indel density

Head-on 100

1

Co-directional 100

1

Expression- Expressionforeign native

1000

Context 10

Out

In

Background

Out

In

Highly expressed

Out

In

Hotspot

Figure S7. Chromatin Features Associated with Significant Enrichment or Deletion of Indels in Hotspot Loci, Related to Figure 7 (A and B) Violin plots demonstrating supporting data for the most significantly depleted topographic feature-window combinations in the Roadmap Epigenomics (A) 10kb windows around H2BK5 acetylation and (B) H2BK15 acetylation. The interaction P-value is shown at the top, and additional P-values associated with individual pairwise comparisons are indicated using rectangular connectors. These P-values were computed from post-hoc pairwise analyses of gene classes (e.g., hotspot versus highly expressed) or pairs of feature labels within a gene class (i.e., peak versus non-peak), obtained using Wald test on Gamma-Poisson regression terms (see STAR Methods for details). (C) Violin plots demonstrating indel density enrichment in 10 kb neighborhoods of loop domains (Rao et al. 2014) for both highly expressed and hotspot genes relative to background. (D) Violin plots demonstrating indel density enrichment in 1 kb neighborhoods of alternate polyadenylation (APA) sites. (E) Violin plots demonstrating elevated indel density in ‘‘head-on’’ transcription / replication units in the expression-native context. P-value obtained from Wald test on the interaction term for colliding and expression-native labels following Gamma-Poisson regression.