Maximizing the Utility of Cancer Transcriptomic Data


Review

Yu Xiang,1,3 Youqiong Ye,1,3 Zhao Zhang,1 and Leng Han1,2,*

Transcriptomic profiling has been applied to large numbers of cancer samples by large-scale consortia, including The Cancer Genome Atlas, the International Cancer Genome Consortium, and the Cancer Cell Line Encyclopedia. Advances in mining cancer transcriptomic data enable us to understand the complexity of the cancer transcriptome and thereby to discover new biomarkers and therapeutic targets. In this paper, we review computational resources for deep mining of transcriptomic data to identify, quantify, and determine the functional effects and clinical utility of transcriptomic events, including noncoding RNAs, post-transcriptional regulation, exogenous RNAs, and transcribed genetic variants. These approaches can be applied to other complex diseases, thereby greatly extending the impact of this work.

Highlights

The explosive volume of cancer transcriptomic data provides opportunities to understand transcriptomic events in the genome.

Advances in mining cancer transcriptomic data enable us to understand the complexity of the transcriptome.

Transcriptomic events, including noncoding RNA, post-transcriptional regulation, exogenous RNA, and transcribed genetic variants, inform the discovery of novel biomarkers and therapeutic targets.

The Explosive Volume of Cancer Transcriptomic Data

RNA sequencing (RNA-seq), one of the most popular applications of high-throughput sequencing technology, is used to measure gene expression levels across the transcriptome. More than 200 000 human RNA-seq samples have been deposited in a public data repository, the Sequence Read Archive^i, and the number is increasing continuously (Figure 1). Among these, more than 60 000 are cancer-related samples. RNA-seq comprises a series of related protocols, including poly(A)-selected RNA-seq, total RNA-seq, small RNA-seq, and 5′ or 3′ end-enriched RNA-seq [1]. Poly(A)-selected RNA-seq is one of the most popular and cost-effective protocols. Several cancer genomic projects have released large-scale, high-quality poly(A)-selected RNA-seq data, including The Cancer Genome Atlas (TCGA)^ii (10 000 samples) [2], International Cancer Genome Consortium (ICGC)^iii (12 000 samples) [3], Cancer Cell Line Encyclopedia (CCLE)^iv (1000 samples) [4], St. Jude Pediatric Cancer^v (4000 samples) [5], and MiTranscriptome (6500 samples) [6]. Comprehensive analyses of these transcriptomes provide unique opportunities to understand oncogenic processes [7] and signaling pathways [8]. Deep mining of cancer transcriptomic data provides the opportunity to develop new strategies for cancer treatment [1]. These new discoveries are being translated into the clinic for diagnostic and prognostic purposes [9].

Noncoding RNA

The human genome encodes approximately 20 000 protein-coding genes and a large number of noncoding RNAs [10]. These noncoding RNAs have emerged as biomarkers and therapeutic targets [11]. Most types of noncoding RNAs can be captured by RNA-seq, including long noncoding RNA (lncRNA) (see Glossary), pseudogenes, enhancer RNA (eRNA), small nucleolar RNA (snoRNA), and circular RNA (circRNA) (Figure 2A).

Trends in Cancer, December 2018, Vol. 4, No. 12

1 Department of Biochemistry and Molecular Biology, McGovern Medical School at The University of Texas Health Science Center at Houston, Houston, TX 77030, USA
2 Center for Precision Health, The University of Texas Health Science Center at Houston, Houston, TX 77030, USA
3 These authors contributed equally

*Correspondence: [email protected] (L. Han).

https://doi.org/10.1016/j.trecan.2018.09.009 © 2018 Elsevier Inc. All rights reserved.


Figure 1. The Explosive Volume of Cancer Transcriptomic Data. The number of RNA-seq samples deposited in the public data repository (upper panel; y-axis: number of RNA-seq samples in SRA, from 1000 to 300 000; x-axis: year, 2010 to 2018; key: all samples and cancer-related samples), and large-scale cancer genomic projects (bottom panel: TCGA, ICGC, CCLE, TARGET, MiTranscriptome, and St. Jude PeCan). CCLE, Cancer Cell Line Encyclopedia; ICGC, International Cancer Genome Consortium; SRA, Sequence Read Archive; TARGET, Therapeutically Applicable Research To Generate Effective Treatments; TCGA, The Cancer Genome Atlas.

Glossary

Allele-specific expression (ASE): differential abundance of multiple alleles of a transcript in a diploid organism.
Alternative polyadenylation (APA): RNA 3′ end processing that generates distinct 3′ termini.
Alternative splicing (AS): a process by which identical pre-mRNA molecules are spliced in different ways.
Circular RNA (circRNA): a novel RNA molecule that forms a covalently closed circular structure.
Enhancer RNA (eRNA): noncoding RNAs transcribed from enhancer regions.
Exogenous RNA: RNA originating from exogenous species.
Expression quantitative trait locus (eQTL): genomic locus that accounts for all or a fraction of the changes in mRNA expression levels.
Gene fusion: a process by which hybrid genes form from two previously independent genes.
Long noncoding RNAs (lncRNAs): noncoding RNAs longer than 200 nucleotides that lack protein-coding potential.
Pseudogene: dysfunctional copies of protein-coding genes that have lost protein-coding ability.
RNA editing: a post-transcriptional mechanism that confers nucleotide changes in RNA transcripts.
Single-nucleotide variant (SNV): an alteration of a single nucleotide in a DNA sequence.
Small nucleolar RNAs (snoRNAs): noncoding RNAs predominantly found in the nucleolus and which primarily guide chemical modifications of other RNAs.

Figure 2. Overview of Transcriptomic Events Captured by RNA-seq. (A) Noncoding RNAs, including long noncoding RNA (lncRNA), pseudogene, enhancer RNA (eRNA), small nucleolar RNA (snoRNA), and circular RNA (circRNA). (B) Post-transcriptional regulation, including alternative splicing, alternative polyadenylation, and RNA editing. (C) Exogenous RNAs, including viral RNA and bacterial RNA. (D) Transcribed genetic variants, including gene fusion, single-nucleotide variant (SNV), allele-specific expression (ASE), and expression quantitative trait loci (eQTL). PAS, Polyadenylation site.

Table 1. Summary of Databases for Cancer-Related Transcriptomic Events

Category | Database | Description | URL | Refs
Long noncoding RNA | MiTranscriptome | A database of human lncRNA from over 6000 samples spanning diverse cancer and tissue types | http://www.mitranscriptome.org | [6]
Long noncoding RNA | TANRIC | An interactive open platform to explore the function of lncRNAs in cancer | https://bioinformatics.mdanderson.org/main/TANRIC:Overview | [17]
Long noncoding RNA | lnc2Cancer | A manually curated database that provides experimentally supported associations between lncRNA and cancer | http://www.bio-bigdata.com/lnc2cancer | [19]
Long noncoding RNA | Lnc2Catlas | An atlas of long noncoding RNAs associated with risk of cancers | https://lnc2catlas.bioinfotech.org | [18]
Long noncoding RNA | lnCaNet | A comprehensive regulatory network resource for lncRNA and cancer genes | http://lncanet.bioinfo-minzhao.org | [20]
Pseudogene | TCGA_PseudoGene | Pseudogene expression profile across multiple cancer types | https://www.synapse.org/#!Synapse:syn1732077/files | [26]
Small nucleolar RNA | SNORic | A data portal to explore snoRNAs in cancer | http://bioinfo.life.hust.edu.cn/SNORic | [44]
Circular RNA | CSCD | A cancer-specific circRNA database | http://gb.whu.edu.cn/CSCD | [57]
Alternative splicing | TCGASpliceSeq | A data portal to explore the alternative splicing of TCGA tumors | http://projects.insilico.us.com/TCGASpliceSeq | [66]
Alternative polyadenylation | PolyASite | A repository for 3′ end sequencing data | http://www.polyasite.unibas.ch | [81]
Alternative polyadenylation | PolyA_DB | A database cataloging cleavage and polyadenylation sites (PASs) in several genomes | http://www.polya-db.org/v3 | [82]
Alternative polyadenylation | APADB | A database of vertebrate polyadenylation sites determined by 3′ end sequencing | http://tools.genxpro.net/apadb | [83]
Alternative polyadenylation | APASdb | A web-accessible database of APA sites to visualize the precise map and usage quantification of different APA isoforms | http://genome.bucm.edu.cn/utr | [84]
Alternative polyadenylation | TC3A | Comprehensive resource of APA usage in TCGA samples | http://tc3a.org | [85]
Alternative polyadenylation | CCLE APA | TCGA and CCLE APA data on Synapse: syn7888354 | https://www.synapse.org/#!Synapse:syn7888354 | [80]
RNA editing | TCGA_RNAediting | TCGA RNA editing data on Synapse: syn2374375 | https://www.synapse.org/#!Synapse:syn2374375 | [102]
Exogenous RNA | Dr.VIS | A database of human disease-related viral integration sites | http://www.bioinfo.org/drvis | [111]
Exogenous RNA | MicroView | A database for analysis and visualization of the presence of bacteria in TCGA data | http://microview.igs.umaryland.edu/tcga_v1 | [113]
Gene fusion | ChimerDB 3.0 | An enhanced database for gene fusions from cancer transcriptome and literature data mining | http://ercsb.ewha.ac.kr/fusiongene | [122]
Gene fusion | COSMIC fusion | Manually curated gene fusions from peer reviewed publications | https://cancer.sanger.ac.uk/cosmic/fusion | [123]
Gene fusion | FusionCancer | Annotated information of gene fusions from 591 recently published RNA-seq datasets across 15 cancer types | http://donglab.ecnu.edu.cn/databases/FusionCancer | [124]
Gene fusion | Tumorfusions | A resource for cancer-associated transcript fusions from TCGA | http://www.tumorfusions.org | [125]
eQTLs | PancanQTL | Systematic identification of cis-eQTLs and trans-eQTLs in TCGA | http://bioinfo.life.hust.edu.cn/PancanQTL | [149]

lncRNA

lncRNAs are transcripts >200 nucleotides in length that lack protein-coding potential [12]. Emerging evidence has demonstrated significant roles of lncRNAs in cancer through interaction with proteins, other RNAs, and lipids [13]. They are also promising clinical biomarkers and therapeutic targets [14]. Tremendous efforts have been made to annotate lncRNAs in the human genome, such as ENCODE^vi [15]. In particular, LNCat (lncRNA atlas) has collected 24 lncRNA annotation resources that refer to >200 000 lncRNAs in over 50 tissues and cell lines [16]. Quantification of lncRNA expression based on existing annotations is straightforward: map RNA-seq reads to annotated lncRNA regions, then calculate and normalize the expression level as for protein-coding genes. De novo identification of lncRNAs requires a rigorous computational pipeline for ab initio transcriptome assembly, followed by assessment of coding potential [6]. Through this effort, MiTranscriptome curated a large amount of RNA-seq data from tumors, normal tissues, and cell lines from 25 independent studies, and provides the landscape of lncRNAs in the cancer transcriptome [6]. TANRIC provides an interactive, open platform for exploring the expression landscape of lncRNAs across multiple cancer types [17] and theoretically allows users to query the expression level of any novel transcript. Analysis based on expression profiles from TANRIC showed that lncRNA expression could define a novel subtype, thus demonstrating potential roles of lncRNAs in cancer pathogenesis and biomarker development. Several databases have been developed to study lncRNAs in cancer: Lnc2Cancer and Lnc2Catlas provide interactive interfaces to examine associations between lncRNAs and human cancers [18,19], while LnCaNet provides a pan-cancer coexpression network for lncRNAs and cancer genes in TCGA samples [20] (Table 1). Using these existing resources, further studies have systematically characterized the functional alterations of lncRNAs across multiple cancer types. For example, lncRNAs are altered in cancers at the transcriptional, genomic, and epigenetic levels [21] and can synergistically dysregulate cancer pathways both in a tumor-specific manner and across multiple tumor contexts [22].
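The annotation-based quantification described for lncRNAs (count reads in annotated regions, then normalize as for protein-coding genes) can be sketched as follows. This is a minimal illustration that assumes read counts per transcript are already available; transcripts per million (TPM) is used here as one common normalization and is our choice, not one prescribed by the resources above. The transcript names and numbers are hypothetical.

```python
from typing import Dict


def tpm(counts: Dict[str, float], lengths_bp: Dict[str, int]) -> Dict[str, float]:
    """Convert raw read counts to transcripts per million (TPM)."""
    # Step 1: reads per kilobase corrects each count for transcript length.
    rpk = {name: counts[name] / (lengths_bp[name] / 1000.0) for name in counts}
    # Step 2: rescale so that the values of one sample sum to one million.
    scale = sum(rpk.values()) / 1_000_000.0
    if scale == 0:
        return {name: 0.0 for name in counts}
    return {name: value / scale for name, value in rpk.items()}


# Hypothetical counts for two lncRNAs and one protein-coding gene.
counts = {"LNC_A": 100, "LNC_B": 300, "GENE_C": 600}
lengths = {"LNC_A": 1000, "LNC_B": 3000, "GENE_C": 2000}
values = tpm(counts, lengths)
```

LNC_A and LNC_B receive the same TPM because the threefold difference in counts is explained entirely by the threefold difference in length.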

Pseudogene

Pseudogenes are dysfunctional copies of protein-coding genes that have lost the ability to produce proteins through the accumulation of deleterious mutations [23]. Pseudogenes such as PTENP1 and KRASP1 play significant roles in cancer, likely by sequestering microRNAs (miRNAs) and thereby regulating their wild-type cognate genes [24]. Tremendous efforts have been undertaken to annotate pseudogenes in the human genome through the ENCODE project [15] and the Yale pseudogene database^vii [25]. A remaining challenge in accurately quantifying pseudogene expression from RNA-seq is to distinguish between a pseudogene and its wild-type gene, because of their high level of sequence similarity. One way to address the potential issue of cross-mapping is to retain only reads with high alignability [26]. Another way is to filter out RNA-seq reads that are likely to have originated from coding genes [27]. The first systematic analysis of pseudogenes across multiple cancer types demonstrated their functional effects [28]. A further study in a TCGA dataset showed that tumor subtypes defined by pseudogene expression are highly concordant with subtypes defined by other molecular data, and that the pseudogene expression subtypes may provide additional prognostic power [26]. The expression profile of pseudogenes across multiple cancer types was released through Synapse (syn1732077) for further analysis (Table 1).
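The alignability filter mentioned above can be illustrated with a small sketch. It assumes a per-base alignability track (values between 0 and 1) is available for each annotated pseudogene; the data layout, the threshold of 0.9, and the function names are hypothetical and do not reproduce the pipeline of the cited study.

```python
from typing import Dict, List, Tuple

# (read_id, feature, start, end) with half-open coordinates relative to the feature.
Read = Tuple[str, str, int, int]


def mean_alignability(track: List[float], start: int, end: int) -> float:
    """Average per-base alignability over the region covered by a read."""
    window = track[start:end]
    return sum(window) / len(window) if window else 0.0


def count_unambiguous(
    reads: List[Read],
    tracks: Dict[str, List[float]],
    min_score: float = 0.9,
) -> Dict[str, int]:
    """Count only reads that fall in regions where alignment is unambiguous."""
    counts: Dict[str, int] = {}
    for _read_id, feature, start, end in reads:
        if mean_alignability(tracks[feature], start, end) >= min_score:
            counts[feature] = counts.get(feature, 0) + 1
    return counts


# Toy track: the first 50 bases are unique, the last 50 resemble the parent gene.
tracks = {"PTENP1": [1.0] * 50 + [0.2] * 50}
reads = [
    ("r1", "PTENP1", 0, 40),    # unique region, kept
    ("r2", "PTENP1", 60, 100),  # parent-like region, dropped
    ("r3", "PTENP1", 30, 70),   # straddles both regions, dropped
]
kept = count_unambiguous(reads, tracks)
```

Raising `min_score` trades sensitivity for a lower risk of counting reads that originate from the parent gene.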

eRNA

Enhancers, which are noncoding genomic regions that act spatially with targeted promoters to regulate downstream genes [29], show tight association with disease, including cancer [30]. The identification and prediction of enhancers requires comprehensive analysis of genomic features, transcription factor binding features, and epigenetic features [31]. Current annotations of enhancers include ENCODE [15], FANTOM^viii [32], and EnhancerAtlas^ix [33]. Enhancers have been found to be broadly transcribed, leading to the production of enhancer-derived RNAs, also known as eRNAs [34]. Expression of eRNAs is considered an essential signature of enhancer activation, which makes RNA-seq a meaningful way to study enhancer activity [34]. For example, androgen receptor (AR)-induced eRNA may regulate the expression of AR-dependent oncogenes in prostate cancer, which influences androgen deprivation therapy [35]. P53-induced eRNAs are involved in the P53-dependent cell cycle arrest process across multiple cancer cell lines [36]. A recent study utilized high-quality expressed enhancer annotations from FANTOM and characterized the expressed enhancers across 9000 samples from TCGA. This study revealed a positive correlation between global enhancer activation and aneuploidy. The study further identified enhancers as key regulators for therapeutic targets, including PD-L1, which highlights the clinical implications of enhancers [37]. These findings opened a new door to the study of expressed enhancers in large-scale cancer transcriptomic data.

snoRNA

snoRNAs are noncoding RNAs, 60–300 nucleotides in length, that are predominantly found in the nucleolus [38]. They associate with ribonucleoproteins (RNPs) to form snoRNP particles and guide the modification of ribosomal RNAs and spliceosomal RNAs [38]. Annotations of snoRNAs are collected in UCSC Genome Browser^x [39] and GENCODE^x [40]. Emerging evidence has revealed functional roles of snoRNAs in oncogenesis. For example, SNORD50A/B is deleted across multiple cancer types and the deletion is associated with poorer survival [41]. Unfortunately, only a few snoRNAs can be captured by RNA-seq, likely because snoRNAs lack poly(A) tails [42]. SnoRNA-seq has been developed by restricting the insert size of sequencing libraries to between 40 and 200 bp [43]; however, this method has not been applied to a large number of cancer samples. Alternatively, miRNA-seq provides the opportunity to study the expression landscape of snoRNAs. miRNA-seq is not designed to capture the full snoRNA repertoire, but is probably the most appropriate way to quantify the snoRNA expression landscape from TCGA omic data [44]. One study revealed the regulation pattern of snoRNAs and identified 46 clinically relevant snoRNAs [44]. More importantly, the study released a data portal, snoRNA in cancer (SNORic), which allows users to query snoRNAs for further study [44] (Table 1).

circRNA

circRNA is a novel noncoding RNA characterized by a covalently closed circular structure [45], which makes it unique compared with linear RNAs. circRNAs are generated in a ‘back-splicing’ process, in which a downstream 5′ splice site is joined to an upstream 3′ splice site. circRNAs play important roles in cancer by functioning as miRNA sponges [46]. Several computational tools have been developed to detect circRNAs. These tools either align reads directly against a reference genome, such as find_circ^xii [47], CIRCexplorer^xiii [48], CIRI^xiv [49], CIRI-AS^xv [50], and circRNA_finder^xvi [51], or align reads against a pseudo-reference built from genomic annotation to detect back-spliced junction reads, such as NCLscan^xvii [52], KNIFE^xviii [53], and PTESFinder^xix [54]. These tools adopt different strategies [55] with different resulting levels of performance [56]. Therefore, it might be more appropriate to combine multiple tools to reduce false positives [56]. To explore the landscape of circRNAs in cancer, a cancer-specific circRNA database (CSCD) was developed and collected >270 000 cancer-specific circRNAs, as well as their RNA regulatory binding sites and miRNA binding sites [57]. This valuable resource is the first database for cancer-specific circRNAs across a large number of samples (Table 1).
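One simple way to combine multiple detection tools, as suggested above, is to keep only back-splice junctions reported by at least two of them. The sketch below assumes each tool's output has already been parsed into a set of junction coordinates; the parsing step, the coordinates, and the `min_tools` threshold are hypothetical.

```python
from collections import Counter
from typing import Dict, Set, Tuple

# A back-splice junction: (chromosome, start, end, strand).
Junction = Tuple[str, int, int, str]


def consensus_junctions(
    calls: Dict[str, Set[Junction]], min_tools: int = 2
) -> Set[Junction]:
    """Keep junctions reported by at least `min_tools` independent tools."""
    support = Counter(j for junctions in calls.values() for j in junctions)
    return {j for j, n in support.items() if n >= min_tools}


# Hypothetical parsed calls from three of the tools named in the text.
calls = {
    "find_circ": {("chr1", 1000, 2500, "+"), ("chr2", 400, 900, "-")},
    "CIRI": {("chr1", 1000, 2500, "+"), ("chr3", 50, 700, "+")},
    "CIRCexplorer": {("chr1", 1000, 2500, "+"), ("chr2", 400, 900, "-")},
}
confident = consensus_junctions(calls)
```

Requiring agreement lowers the false-positive rate at the cost of discarding junctions that only one strategy is able to detect.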

Post-transcriptional Regulation

Human transcripts are under extensive post-transcriptional regulation [58], including alternative splicing (AS), alternative polyadenylation (APA), and RNA editing (Figure 2B).

AS

AS is the process by which identical pre-mRNA molecules are spliced in different ways [59]. Up to 94% of human multi-exon genes are alternatively spliced, contributing to complexity in the transcriptome and leading to protein diversity [60,61]. Aberrant splicing events or mutations in splice-site sequences frequently occur in cancer [62]. Specific isoforms of genes are associated with the hallmarks of cancer [59], making them promising therapeutic targets in cancer treatment [62]. Dozens of AS detection tools have been developed [63], facilitating the exploration of AS patterns across large-scale cancer transcriptomic datasets. Mutation-mediated AS events have been used to classify cancer patients into tumor subtypes with distinct clinical features across 33 cancer types [64]. Analysis across 16 cancer types showed frequently retained introns in transcripts from tumor samples compared with matched normal samples, suggesting significant roles of novel peptides in cancer [65]. TCGASpliceSeq provides a comprehensive map of AS events across 33 cancer types from TCGA [66] (Table 1).

APA

APA is a widespread form of mRNA 3′ end processing that provides a means to regulate mRNA metabolism, including mRNA stability, translation efficiency, and intracellular localization [67,68]. Studies have highlighted the important roles of APA in human cancer. For example, proliferating cells tend to produce mRNAs with shortened 3′ untranslated regions (UTRs) [69], and transcripts of several proto-oncogenes in tumor cells tend to have 3′ UTR shortening [70]. In addition to the development of experimental protocols to capture APA sites [71,72], several bioinformatic tools have been developed to identify APA events from standard RNA-seq data. These tools can be classified into two categories: one is based on prior annotation of the poly(A) site, and includes MISO^xx [73], 3USS^xxi [74], and QAPA^xxii [75]; the other uses de novo identification and quantification of dynamic APA events and thus is more powerful in cancer research. Tools in the second category include DaPars^xxiii [76], APAtrap^xxiv [77], and TAPAS^xxv [78]. TAPAS, in particular, can detect more than two APA sites, as well as APA sites before the last exon [78]. With these powerful tools, genome-wide APA analysis using large-scale cancer transcriptomic datasets has revolutionized our understanding of the important roles of APA in cancer. Recent studies have revealed global shortening of 3′ UTRs in tumor samples compared with normal samples, which allows transcripts to avoid miRNA-mediated repression [76]. Further deep mining has shown surprising enrichment of 3′ UTR shortening among genes that are likely to function as competing-endogenous RNAs (ceRNAs) for tumor-suppressor genes [79]. Another study showed striking global 3′ UTR shortening in cancer cell lines and identified an appreciable number of clinically relevant APA events [80]. More importantly, this study provided evidence that APA events could affect drug sensitivity, especially for drugs that target chromatin modifiers, suggesting clinical utility in cancer.

Several databases provide resources for further analysis of APA events in cancer. PolyASite [81], PolyA_DB [82], APADB [83], and APASdb [84] have collected APA events detected from PolyA-seq in a limited number of cancer samples. APA events across 800 cancer cell lines can be accessed through Synapse [80]. In particular, TC3A provides the most comprehensive resource of APA usage for >10 000 cancer samples across 32 cancer types [85], which is a valuable resource for further APA analysis (Table 1).
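To make the notion of 3′ UTR shortening concrete, the following sketch estimates the fraction of transcripts that use the distal polyadenylation site from per-base read coverage along a 3′ UTR. It relies on a simplifying assumption of ours: the region upstream of the proximal site is covered by both isoforms, whereas the region downstream is covered only by the long isoform. It is not the statistical model of any tool named above, and the coverage values are invented.

```python
from typing import Sequence


def distal_usage(coverage: Sequence[float], proximal_pas: int) -> float:
    """Estimate the fraction of transcripts ending at the distal poly(A) site.

    coverage: per-base read depth along the 3' UTR, ordered 5' to 3'.
    proximal_pas: index of the proximal poly(A) site within `coverage`.
    """
    upstream = coverage[:proximal_pas]
    downstream = coverage[proximal_pas:]
    if not upstream or not downstream:
        raise ValueError("proximal_pas must split the UTR into two regions")
    mean_up = sum(upstream) / len(upstream)        # long + short isoforms
    mean_down = sum(downstream) / len(downstream)  # long isoform only
    if mean_up == 0:
        return 0.0
    return min(mean_down / mean_up, 1.0)


# Hypothetical 1 kb 3' UTR with a proximal site at position 500.
normal = [100] * 500 + [80] * 500
tumor = [100] * 500 + [30] * 500
change = distal_usage(tumor, 500) - distal_usage(normal, 500)
```

A negative `change` indicates reduced distal-site usage in the tumor sample, which corresponds to 3′ UTR shortening.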

RNA Editing

RNA editing is a post-transcriptional event that changes single nucleotides in RNA transcripts without altering the genomic DNA [86]. RNA editing events can lead to modulation of AS in mRNA, missense codon changes, and modifications of noncoding RNAs [86], including both miRNAs [87] and lncRNAs [88]. RNA editing may also contribute to proteomic diversity in cancer [89]. Tools developed to identify RNA editing sites by comparing RNA-seq with DNA-seq include REDItools^xxvi [90], JACUSA^xxvii [91], and RES-Scanner^xxviii [92]. A further study identified RNA editing sites from RNA-seq alone by taking advantage of the fact that the same editing sites are more likely than rare single-nucleotide polymorphisms (SNPs) to be present in different individuals [93]. Similar approaches have been utilized in other tools, including GIREMI^xxix [94], RDDpred^xxx [95], RASER^xxxi [96], RED-ML^xxxii [97], and RNAEditor^xxxiii [98], maximizing the power to identify RNA editing sites from enormous RNA-seq datasets. Building on these computational tools, databases with millions of high-confidence RNA editing sites have been established, including RADAR^xxxiv [99] and REDIportal^xxxv [100]. These data resources provide the research community with good opportunities to study the global pattern of A-to-I editing in healthy individuals and/or patients. Using the cancer transcriptome to understand the role of RNA editing in cancer, a pilot study identified a novel recoding RNA editing event in AZIN1 as a potential driver in human hepatocellular carcinoma [101]. A further study characterized the genomic landscape and clinical relevance of RNA editing using large-scale RNA-seq data from TCGA and suggested that globally altered RNA editing patterns are likely to be affected by ADAR1 [102]. That study identified an appreciable number of clinically relevant events and experimentally demonstrated the effects of RNA editing on drug sensitivity. More importantly, the editing level for each detectable RNA editing site across TCGA samples was deposited in Synapse (syn2374375) [102] (Table 1).

Beyond the functional effects of individual RNA editing sites, a further study showed that A-to-I editing is the major source of mRNA variability in cancer, and that the global Alu editing index could predict patient survival [103].
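The idea of calling editing sites from RNA-seq alone, by requiring that a candidate site recurs in several individuals, can be sketched as follows. Because inosine is read as guanosine during sequencing, the editing level at an A-to-I site is taken as the fraction of G reads among A and G reads. The thresholds, the data layout, and the coordinates are hypothetical and do not reproduce any of the tools cited above.

```python
from typing import Dict, List, Set, Tuple

Site = Tuple[str, int]  # (chromosome, position)


def editing_level(a_reads: int, g_reads: int) -> float:
    """Fraction of reads carrying G at a genomically encoded A."""
    total = a_reads + g_reads
    return g_reads / total if total else 0.0


def recurrent_sites(
    per_sample: List[Dict[Site, Tuple[int, int]]],
    min_level: float = 0.1,
    min_depth: int = 10,
    min_samples: int = 2,
) -> Set[Site]:
    """Keep candidate sites that look edited in several individuals.

    A rare SNP is unlikely to be shared by many unrelated samples, whereas
    a genuine editing site tends to recur.
    """
    support: Dict[Site, int] = {}
    for sample in per_sample:
        for site, (a_reads, g_reads) in sample.items():
            deep_enough = a_reads + g_reads >= min_depth
            if deep_enough and editing_level(a_reads, g_reads) >= min_level:
                support[site] = support.get(site, 0) + 1
    return {site for site, n in support.items() if n >= min_samples}


# Hypothetical (A reads, G reads) per candidate site in three samples.
samples = [
    {("chr1", 1000): (30, 10), ("chr5", 77): (0, 40)},
    {("chr1", 1000): (25, 15), ("chr5", 77): (50, 0)},
    {("chr1", 1000): (40, 8), ("chr5", 77): (45, 0)},
]
sites = recurrent_sites(samples)
```

In this toy input, the site on chr5 shows G reads in only one sample and is therefore treated as a probable private variant rather than an editing site.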

Exogenous RNA

Around 15% of human cancers are estimated to have links to microbial infections (e.g., viral and/or bacterial infection) [104] (Figure 2C). Microbes can contribute to carcinogenesis through multiple mechanisms, such as altering proliferation of the host cell and influencing the host’s metabolism and immune system [105]. Microbes may even affect cancer immunotherapy [106,107]. Therefore, the detection of viral and bacterial genetic material within a tumor is critical for cancer diagnosis and prognosis. PathSeq^xxxvi introduced the concept of identifying non-human nucleic acids that may indicate candidate microbes by subtracting reads that align to human reference sequences and aligning the remaining reads to microbial reference sequences [108]. This idea has been applied in other computational tools, such as VirusSeq^xxxvii [109] and VirusFinder^xxxviii [110]. Dr.VIS provides a database of oncogenic viral integration sites in various human diseases, including cancer [111]. Several studies have further explored the interplay between microbes and cancer by using TCGA RNA-seq data. A systematic screening of viral expression in 4433 tumor samples provided a comprehensive map of tumor-associated viruses and identified recurrent fusions between viral and host genes [112]. Another study suggested the possibility of identifying bacteria associated with particular tumor types by carefully considering all bacterial taxa present in cancer, their relative abundance, and batch effects. This study also provided an interactive website, MicroView, to examine the presence of bacteria in TCGA samples [113]. The microbiota is also associated with tumor expression profiles in patients with breast cancer, suggesting complicated interplay between the microbiota and human cancer [114].
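The subtractive strategy described above (discard reads that align to the human reference, then align what remains to microbial references) can be sketched as a two-step filter. The two aligner callables below are toy stand-ins based on exact substring matching and are purely hypothetical; a real analysis would use a proper read aligner, and this is not the implementation of PathSeq or of the other tools named.

```python
from typing import Callable, Dict, Iterable, Optional


def subtractive_screen(
    reads: Iterable[str],
    aligns_to_human: Callable[[str], bool],
    best_microbe: Callable[[str], Optional[str]],
) -> Dict[str, int]:
    """Count non-human reads per candidate microbe."""
    counts: Dict[str, int] = {}
    for read in reads:
        if aligns_to_human(read):
            continue  # subtraction step: drop host-derived reads
        taxon = best_microbe(read)
        if taxon is not None:
            counts[taxon] = counts.get(taxon, 0) + 1
    return counts


# Toy references; real references would be full genomes.
HUMAN_REF = "ACGTACGTTTGACCA"
MICROBE_REFS = {"virus_X": "GGGCCCAAATTT", "bacterium_Y": "TTTTCCCCGGGG"}


def toy_human(read: str) -> bool:
    return read in HUMAN_REF


def toy_microbe(read: str) -> Optional[str]:
    for taxon, sequence in MICROBE_REFS.items():
        if read in sequence:
            return taxon
    return None


reads = ["ACGTAC", "GGGCCC", "CCCAAA", "CCCCGG", "AAAAAA"]
hits = subtractive_screen(reads, toy_human, toy_microbe)
```

Reads that match neither the host nor any microbial reference (the last read here) are simply left unassigned.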


Transcribed Genetic Variant

Genetic variants, including structural variants and single-nucleotide variants (SNVs), are generally detected by DNA-seq [115]. RNA-seq has been utilized to capture genetic variants that impact transcription, such as gene fusion, transcribed SNVs, allele-specific expression (ASE), and expression quantitative trait loci (eQTL) (Figure 2D). RNA-seq in particular provides additional insights into these genetic variants, especially for samples for which we lack DNA-seq data.

Gene Fusion

Gene fusions are an important class of cancer drivers that are generated from genomic structural rearrangements, transcription read-throughs of neighboring genes, and/or splicing of mRNA molecules [116]. These events have the potential to generate chimeric proteins with altered functions, which can serve as biomarkers for the diagnosis of specific cancer types and are thus promising therapeutic targets [116]. For example, chronic myelogenous leukemia (CML) is characterized by BCR–ABL1 fusion, and the drug imatinib, an inhibitor of the tyrosine kinase activity of BCR–ABL, has been successfully used to treat patients with chronic-phase CML [117]. More than 30 bioinformatic algorithms have been developed to detect gene fusions [118]. These algorithms utilize split reads (a single read that aligns to two different genes) or spanning reads (a read pair in which each mate aligns to a different gene) to identify fusion events [119]. The determination of gene fusions for the full cancer genome has further revealed that gene fusions play critical roles in oncogenesis. Novel and recurrent fusions have been shown to drive tumorigenesis through constitutive activation of kinases [120]. A systematic study using TCGA RNA-seq data identified a total of 25 664 gene fusions and revealed that gene fusions involving oncogenes tend to exhibit increased gene expression, whereas fusions involving tumor suppressors have the opposite effect on gene expression [121]. Several comprehensive databases for cancer-related gene fusions have been constructed (Table 1). For example, ChimerDB [122] and COSMIC Gene Fusion [123] provide manually curated gene fusions from previous publications. FusionCancer compiled fusion genes from 591 published human cancer RNA-seq datasets [124]. TumorFusions catalogued 20 731 gene fusions detected across 33 cancer types, using TCGA RNA-seq data [125]. These resources facilitate further studies of the functional effects of gene fusions in cancer.
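The two kinds of supporting evidence described above, split reads and spanning reads, can be distinguished with a few lines of code once each mate has been annotated with the genes its aligned segments hit. The input representation below is a hypothetical simplification and does not correspond to the output format of any specific fusion caller.

```python
from collections import Counter
from typing import Dict, List, Tuple

# For each read pair: genes hit by the aligned segments of mate 1 and mate 2.
ReadPair = Tuple[List[str], List[str]]
GenePair = Tuple[str, str]


def fusion_evidence(pairs: List[ReadPair]) -> Dict[GenePair, Counter]:
    """Tally split-read and spanning-read support for each gene pair."""
    evidence: Dict[GenePair, Counter] = {}
    for mate1, mate2 in pairs:
        genes1, genes2 = set(mate1), set(mate2)
        key = None
        kind = None
        # Split read: a single mate whose segments align to two genes.
        for genes in (genes1, genes2):
            if len(genes) == 2:
                first, second = sorted(genes)
                key, kind = (first, second), "split"
                break
        # Spanning pair: each mate aligns entirely to a different gene.
        if kind is None and len(genes1) == 1 and len(genes2) == 1 and genes1 != genes2:
            first, second = sorted(genes1 | genes2)
            key, kind = (first, second), "spanning"
        if key is not None:
            evidence.setdefault(key, Counter())[kind] += 1
    return evidence


# Hypothetical read pairs around a BCR-ABL1 junction.
pairs = [
    (["BCR", "ABL1"], ["ABL1"]),  # mate 1 crosses the junction: split read
    (["BCR"], ["ABL1"]),          # mates on either side: spanning pair
    (["BCR"], ["BCR"]),           # ordinary pair, no fusion evidence
    (["ABL1"], ["BCR"]),          # spanning pair, opposite orientation
]
support = fusion_evidence(pairs)
```

Fusion callers typically require both kinds of evidence before reporting an event, since either alone can arise from alignment artifacts.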

SNV

SNVs are alterations of a single nucleotide in a DNA sequence. Nonsynonymous SNVs (nsSNVs) may affect protein function, including expression, folding, binding affinity, and post-translational modification [126], and thus drive disease progression or contribute to phenotypic traits. The TCGA consortium established the mutational landscape across 33 cancer types through the analysis of DNA-seq [115]. It is also possible to call SNVs for genes with relatively high expression levels from RNA-seq [127]. Dozens of tools have been developed to call variants from whole-genome or whole-exome sequencing, such as MuSE^xxxix [128]. Some tools developed for DNA-seq can also accept RNA-seq data to call variants; these include VarScan2^xl [129], RADIA^xli [130], Seurat^xlii [131], and VarDict^xliii [132]. Several tools have been developed to specifically identify SNVs from RNA-seq, including eSNV-detect^xliv [133], SNPiR^xlv [134], and VaDiR^xlvi [135]. However, due to the interference of RNA editing sites, high error rates for alignment near splice junctions, read-depth differences between lowly and highly expressed genes, and errors introduced through reverse transcription, it is still more accurate to call SNVs from DNA-seq [127]. Nonetheless, the enormous volume of RNA-seq data provides an alternative approach to identify transcribed SNVs, especially for samples for which we have no DNA-seq data.
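Several of the error sources listed above can be addressed with simple post-hoc filters on candidate variants. The sketch below removes candidates that coincide with known RNA editing sites, lie close to splice junctions, or have insufficient depth or allele fraction. All thresholds, field names, and coordinates are hypothetical, and the code is not taken from any of the callers named in the text.

```python
from typing import Iterable, List, NamedTuple, Set, Tuple


class Candidate(NamedTuple):
    chrom: str
    pos: int
    ref: str
    alt: str
    depth: int
    alt_reads: int


def filter_rna_snvs(
    candidates: Iterable[Candidate],
    editing_sites: Set[Tuple[str, int]],
    junctions: Set[Tuple[str, int]],
    min_depth: int = 10,
    min_alt_fraction: float = 0.2,
    junction_margin: int = 4,
) -> List[Candidate]:
    """Drop candidates that are likely artifacts of RNA-seq variant calling."""
    kept = []
    for c in candidates:
        # Low coverage: lowly expressed genes give unreliable calls.
        if c.depth < max(min_depth, 1):
            continue
        if c.alt_reads / c.depth < min_alt_fraction:
            continue
        # Known RNA editing sites mimic A>G (or T>C) variants.
        if (c.chrom, c.pos) in editing_sites:
            continue
        # Alignment errors accumulate near splice junctions.
        near_junction = any(
            chrom == c.chrom and abs(c.pos - jpos) <= junction_margin
            for chrom, jpos in junctions
        )
        if near_junction:
            continue
        kept.append(c)
    return kept


candidates = [
    Candidate("chr1", 100, "A", "G", 50, 20),  # known editing site
    Candidate("chr1", 203, "C", "T", 40, 15),  # 3 bp from a junction
    Candidate("chr2", 500, "G", "A", 5, 3),    # too shallow
    Candidate("chr2", 900, "T", "C", 60, 30),  # passes all filters
]
passed = filter_rna_snvs(
    candidates,
    editing_sites={("chr1", 100)},
    junctions={("chr1", 200)},
)
```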

ASE

ASE refers to the unequal expression of multiple alleles of a transcript in a diploid organism. ASE may arise through X-chromosome inactivation, genetic imprinting, and uniparental disomy [136]. AS, protein-truncating mutations, and other post-transcriptional mechanisms may also cause ASE [137,138]. ASE plays important roles in human cancer; for example, ASE of the transforming growth factor beta receptor 1 (TGFBR1) is associated with the risk of colon cancer [139]. Accurate ASE quantification requires the use of personalized haplotype genome alignment, strict alignment quality control, and intragenic SNP aggregation [140]. Many tools have been developed to quantify ASE from RNA-seq and genotyping data [141], such as MMSEQ^xlvii [142]. It is also possible to quantify ASE from RNA-seq data alone. For example, MBASED^xlviii [143] aggregates information across multiple SNV loci to obtain a measurement of ASE at the gene level, and has been used to identify extensive ASE in cancer. Furthermore, SCALE^xlix [144] detects ASE from single-cell RNA-seq data with appropriate adjustment for technical variability.
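Gene-level aggregation of allelic counts across SNV loci can be illustrated with a deliberately naive sketch: reads are pooled per haplotype and tested against the 1:1 expectation with an exact binomial test. It assumes that the SNVs are already phased and that reads are independent, which real data often violate; it is not the statistical model of MBASED or of any other tool cited above, and the counts are invented.

```python
from math import comb
from typing import List, Tuple


def binom_two_sided(k: int, n: int, p: float = 0.5) -> float:
    """Exact two-sided binomial p-value (sum of outcomes no likelier than k)."""
    pmf = [comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(n + 1)]
    threshold = pmf[k] * (1 + 1e-9)  # tolerance for floating-point ties
    return min(1.0, sum(x for x in pmf if x <= threshold))


def gene_ase(snv_counts: List[Tuple[int, int]]) -> Tuple[float, float]:
    """Pool phased allelic counts across SNVs in one gene.

    snv_counts: (haplotype 1 reads, haplotype 2 reads) for each SNV.
    Returns the haplotype 1 fraction and a p-value against a 1:1 ratio.
    """
    hap1 = sum(a for a, _ in snv_counts)
    hap2 = sum(b for _, b in snv_counts)
    total = hap1 + hap2
    if total == 0:
        return 0.5, 1.0
    return hap1 / total, binom_two_sided(hap1, total)


# Hypothetical phased counts at three heterozygous SNVs of one gene.
skewed_ratio, skewed_p = gene_ase([(30, 10), (25, 15), (40, 20)])
balanced_ratio, balanced_p = gene_ase([(20, 20), (15, 15)])
```

Pooling across loci increases power relative to testing each SNV separately, but it also makes the result sensitive to phasing errors.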

eQTL

eQTLs are genomic loci that account for all or a fraction of the changes in mRNA expression levels [145]. By linking genotype and expression data, eQTL analysis is a powerful approach to identify candidate susceptibility genes for cancer risk [146–148]. PancanQTL is a comprehensive resource for cis- and trans-eQTLs across 33 cancer types from TCGA, and has enabled researchers to further investigate the role of genetic variation in tumorigenesis [149] (Table 1).
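Linking genotype and expression data, in its simplest form, amounts to regressing the expression of a gene on the allele dosage (0, 1, or 2 copies of the alternative allele) at a variant. The sketch below computes the least-squares slope and the Pearson correlation for one variant and gene pair. It omits the covariates and multiple-testing correction that a real eQTL analysis requires, and the data are invented.

```python
from math import sqrt
from typing import Sequence, Tuple


def eqtl_effect(
    dosages: Sequence[float], expression: Sequence[float]
) -> Tuple[float, float]:
    """Least-squares slope and Pearson r of expression on allele dosage."""
    n = len(dosages)
    if n != len(expression) or n < 3:
        raise ValueError("need at least three matched samples")
    mean_x = sum(dosages) / n
    mean_y = sum(expression) / n
    sxx = sum((x - mean_x) ** 2 for x in dosages)
    syy = sum((y - mean_y) ** 2 for y in expression)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(dosages, expression))
    if sxx == 0:
        raise ValueError("genotype does not vary across samples")
    slope = sxy / sxx  # expression change per alternative allele
    r = sxy / sqrt(sxx * syy) if syy > 0 else 0.0
    return slope, r


# Hypothetical dosages and normalized expression values for six samples.
dosages = [0, 0, 1, 1, 2, 2]
expression = [1.0, 1.2, 2.0, 2.2, 3.0, 3.2]
slope, r = eqtl_effect(dosages, expression)
```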

Concluding Remarks

Integrating various levels of transcriptomic data remains a technical challenge (e.g., batch effects and inconsistent data formats and preprocessing). Therefore, the large-scale datasets to which transcriptomic profiling has been applied, including TCGA, ICGC, and CCLE, provide uniquely comprehensive opportunities for integrative analysis. Alterations at the gene, gene set, pathway, and network levels have been characterized through transcriptomic profiling [1]. For example, the TCGA pan-cancer project characterized alterations of gene expression in multiple cancer signaling pathways [8], their effects on the immune landscape [150], and pharmacogenomic interactions with the circadian clock in cancer chronotherapy [151]. However, these analyses alone could not determine the full utility of cancer transcriptomic data.

In the past decade, the research community has demonstrated the significance of deep mining of transcriptomic data and revealed the endless complexity of the transcriptome. For example, recent studies uncovered the diversity of noncoding RNAs [152], such as lncRNAs, pseudogenes, and circRNAs. Meanwhile, post-transcriptional regulation, including AS, APA, and RNA editing, has been shown to add another layer to this complexity. Further integration of transcriptomic data with other types of omics data will be a powerful way to identify transcribed genetic variants, including ASE and eQTLs. To capture these transcriptomic events, different methods have been developed that fully consider the features of the RNA molecules of interest [153]. Standard RNA-seq has limitations for the study of these transcriptomic events, but the enormous sample sizes and rigorous computational tools already available maximize the discovery power.

Nevertheless, several challenges remain (see Outstanding Questions). (i) How do we efficiently store and process sample sizes of huge magnitude? (ii) How do we develop accurate algorithms to identify and/or quantify transcriptomic events by increasing both sensitivity and specificity? (iii) How do we design appropriate experimental approaches (e.g., GRO-seq) to capture these transcriptomic events? (iv) Will breakthroughs in sequencing technologies (e.g., long-read sequencing) overcome the current limitations? (v) Can these methods be applied to the single-cell level to further reveal heterogeneity?

Cloud computing, that is, shared pools of configurable computer system resources (e.g., the Seven Bridges Cancer Genomics Cloud [154]), will be a solution for processing massive transcriptomic data. Advanced algorithms, including deep learning [155] (e.g., multilayer perceptrons, convolutional neural networks, and long short-term memory networks) and Bayesian approaches [156], may increase the sensitivity and specificity with which transcriptomic events are identified, even at the single-cell level. Appropriate design of RNA-seq protocols (e.g., 5′-end RNA-sequencing) [157] and long-read sequencing (e.g., nanopore sequencing) [157] will enable the discovery of novel transcriptomic events. Using the approaches we have reviewed and more, the research community has identified novel biomarkers and therapeutic targets at a pivotal juncture in the big data era. The field of genomics has been revolutionized to extend far beyond quantifying the expression of protein-coding genes. More importantly, the approaches reviewed herein, which were primarily developed for cancer research, can be applied to other complex diseases, thereby greatly leveraging their impact.

Outstanding Questions

How do we efficiently store and process sample sizes of huge magnitude?
How do we develop accurate algorithms to identify and/or quantify transcriptomic events by increasing both sensitivity and specificity?
How do we design appropriate experimental approaches to capture these transcriptomic events?
Will breakthroughs in sequencing technologies (e.g., long-read sequencing) overcome the current limitations?
Can these methods be applied to the single-cell level to further reveal heterogeneity?

Acknowledgments

We regret that page limitations have prevented us from including all the relevant methods and related work in this review. This work was supported by the Cancer Prevention & Research Institute of Texas (RR150085 to L.H.). We gratefully acknowledge the consortia, including TCGA, and research groups that have provided accessible data. We thank LeeAnn Chastain for editorial assistance.

Trends in Cancer, December 2018, Vol. 4, No. 12

Resources

i https://www.ncbi.nlm.nih.gov/sra/
ii https://cancergenome.nih.gov/
iii https://dcc.icgc.org/
iv https://portals.broadinstitute.org/ccle
v https://pecan.stjude.cloud/home
vi https://www.encodeproject.org/
vii https://pseudogene.org/
viii https://fantom.gsc.riken.jp/
ix www.enhanceratlas.org/
x https://genome.ucsc.edu/
xi https://www.gencodegenes.org
xii https://github.com/marvin-jens/find_circ
xiii https://github.com/YangLab/CIRCexplorer
xiv https://sourceforge.net/projects/ciri/
xv https://sourceforge.net/projects/ciri/files/CIRI-AS/
xvi https://github.com/orzechoj/circRNA_finder
xvii https://github.com/TreesLab/NCLscan
xviii https://github.com/lindaszabo/KNIFE
xix https://sourceforge.net/projects/ptesfinder-v1/
xx http://genes.mit.edu/burgelab/miso/
xxi http://www.biocomputing.it/3uss_server
xxii https://github.com/morrislab/qapa
xxiii https://github.com/ZhengXia/dapars
xxiv https://sourceforge.net/projects/apatrap/
xxv https://github.com/arefeen/TAPAS
xxvi https://sourceforge.net/projects/reditools/
xxvii https://github.com/dieterich-lab/JACUSA
xxviii https://github.com/ZhangLabSZ/RES-Scanner
xxix https://github.com/zhqingit/giremi
xxx http://epigenomics.snu.ac.kr/RDDpred/
xxxi https://www.ibp.ucla.edu/research/xiao/RASER.html
xxxii https://github.com/BGIRED/RED-ML
xxxiii http://rnaeditor.uni-frankfurt.de/
xxxiv http://rnaedit.com/
xxxv http://srv00.recas.ba.infn.it/atlas/
xxxvi http://software.broadinstitute.org/pathseq/index.html
xxxvii http://odin.mdacc.tmc.edu/xsu1/VirusSeq.html
xxxviii https://bioinfo.uth.edu/VirusFinder/
xxxix https://github.com/danielfan/MuSE
xl http://varscan.sourceforge.net
xli https://github.com/aradenbaugh/radia
xlii https://github.com/alexischr/seurat
xliii https://github.com/AstraZeneca-NGS/VarDict
xliv http://bioinformaticstools.mayo.edu/research/esnv-detect/
xlv http://lilab.stanford.edu/SNPiR/readme
xlvi http://gigadb.org/dataset/100360
xlvii https://github.com/eturro/mmseq
xlviii https://bioconductor.org/packages/release/bioc/html/MBASED.html
xlix https://github.com/yuchaojiang/SCALE

References

1. Cieslik, M. and Chinnaiyan, A.M. (2018) Cancer transcriptome profiling at the juncture of clinical translation. Nat. Rev. Genet. 19, 93–109
2. Cancer Genome Atlas Research Network et al. (2013) The Cancer Genome Atlas Pan-Cancer analysis project. Nat. Genet. 45, 1113–1120
3. International Cancer Genome Consortium (2010) International network of cancer genome projects. Nature 464, 993–998
4. Barretina, J. et al. (2012) The Cancer Cell Line Encyclopedia enables predictive modelling of anticancer drug sensitivity. Nature 483, 603–607
5. Ma, X. et al. (2018) Pan-cancer genome and transcriptome analyses of 1,699 paediatric leukaemias and solid tumours. Nature 555, 371–376
6. Iyer, M. et al. (2015) The landscape of long noncoding RNAs in the human transcriptome. Nat. Genet. 47, 199–208
7. Ding, L. et al. (2018) Perspective on oncogenic processes at the end of the beginning of cancer genomics. Cell 173, 305–320
8. Sanchez-Vega, F. et al. (2018) Oncogenic signaling pathways in the cancer genome atlas. Cell 173, 321–337
9. Byron, S.A. et al. (2016) Translating RNA sequencing into clinical diagnostics: opportunities and challenges. Nat. Rev. Genet. 17, 257–271
10. Djebali, S. et al. (2012) Landscape of transcription in human cells. Nature 489, 101–108
11. St. Laurent, G. et al. (2015) The landscape of long noncoding RNA classification. Trends Genet. 31, 249–251
12. Mattick, J.S. and Rinn, J.L. (2015) Discovery and annotation of long noncoding RNAs. Nat. Struct. Mol. Biol. 22, 5
13. Lin, C. and Yang, L. (2018) Long noncoding RNA in cancer: wiring signaling circuitry. Trends Cell Biol. 28, 287–301
14. Sahu, A. et al. (2015) Long noncoding RNAs in cancer: from function to translation. Trends Cancer 1, 93–109
15. ENCODE Project Consortium et al. (2012) An integrated encyclopedia of DNA elements in the human genome. Nature 489, 57–74
16. Xu, J. et al. (2017) A comprehensive overview of lncRNA annotation resources. Brief. Bioinform. 18, 236–249
17. Li, J. et al. (2015) TANRIC: an interactive open platform to explore the function of lncRNAs in cancer. Cancer Res. 75, 3728–3737
18. Ren, C. et al. (2018) Lnc2Catlas: an atlas of long noncoding RNAs associated with risk of cancers. Sci. Rep. 8, 1909
19. Ning, S. et al. (2016) Lnc2Cancer: a manually curated database of experimentally supported lncRNAs associated with various human cancers. Nucleic Acids Res. 44, D980–D985
20. Liu, Y. and Zhao, M. (2016) LnCaNet: pan-cancer co-expression network for human lncRNA and cancer genes. Bioinformatics 32, 1595–1597
21. Holliday, D.L. et al. (2016) Comprehensive genomic characterization of long non-coding RNAs across human cancers. Cancer Cell 28, 529–540
22. Chiu, H.S. et al. (2018) Pan-cancer analysis of lncRNA regulation supports their targeting of cancer genes in each tumor context. Cell Rep. 23, 297–312
23. Pink, R.C. et al. (2011) Pseudogenes: pseudo-functional or key regulators in health and disease? RNA 17, 792–798
24. Poliseno, L. et al. (2010) A coding-independent function of gene and pseudogene mRNAs regulates tumour biology. Nature 465, 1033–1038
25. Karro, J.E. et al. (2007) Pseudogene.org: a comprehensive database and comparison platform for pseudogene annotation. Nucleic Acids Res. 35, D55–D60
26. Han, L. et al. (2014) The pan-cancer analysis of pseudogene expression reveals biologically and clinically relevant tumour subtypes. Nat. Commun. 5, 3963
27. Guo, X. et al. (2014) Characterization of human pseudogene-derived non-coding RNAs for functional potential. PLoS One 9, e93972
28. Kalyana-Sundaram, S. et al. (2012) Expressed pseudogenes in the transcriptional landscape of human cancers. Cell 149, 1622–1634
29. Schmitt, A.D. et al. (2016) Genome-wide mapping and analysis of chromosome architecture. Nat. Rev. Mol. Cell Biol. 17, 743–755
30. Sur, I. and Taipale, J. (2016) The role of enhancers in cancer. Nat. Rev. Cancer 16, 483–493
31. Wang, C. et al. (2013) Computational identification of active enhancers in model organisms. Genomics Proteomics Bioinformatics 11, 142–150
32. FANTOM Consortium et al. (2014) A promoter-level mammalian expression atlas. Nature 507, 462–470
33. Gao, T. et al. (2016) EnhancerAtlas: a resource for enhancer annotation and analysis in 105 human cell/tissue types. Bioinformatics 32, 3543–3551
34. Li, W. et al. (2016) Enhancers as non-coding RNA transcription units: recent insights and future perspectives. Nat. Rev. Genet. 17, 207
35. Hsieh, C.-L. et al. (2014) Enhancer RNAs participate in androgen receptor-driven looping that selectively enhances gene activation. Proc. Natl. Acad. Sci. U. S. A. 111, 7319–7324
36. Melo, C.A. et al. (2013) eRNAs are required for p53-dependent enhancer activity and gene transcription. Mol. Cell 49, 524–535
37. Chen, H. et al. (2018) A pan-cancer analysis of enhancer expression in nearly 9000 patient samples. Cell 173, 386–399
38. Jorjani, H. et al. (2016) An updated human snoRNAome. Nucleic Acids Res. 44, 5068–5082
39. Casper, J. et al. (2018) The UCSC Genome Browser database: 2018 update. Nucleic Acids Res. 46, D762–D769
40. Harrow, J. et al. (2012) GENCODE: the reference human genome annotation for the ENCODE project. Genome Res. 22, 1760–1774
41. Siprashvili, Z. et al. (2016) The noncoding RNAs SNORD50A and SNORD50B bind K-Ras and are recurrently deleted in human cancer. Nat. Genet. 48, 53–58
42. Kim, M. et al. (2006) Distinct pathways for snoRNA and mRNA termination. Mol. Cell 24, 723–734
43. Zhou, F. et al. (2017) AML1-ETO requires enhanced C/D box snoRNA/RNP formation to induce self-renewal and leukaemia. Nat. Cell Biol. 19, 844–855
44. Gong, J. et al. (2017) A pan-cancer analysis of the expression and clinical relevance of small nucleolar RNAs in human cancer. Cell Rep. 21, 1968–1981
45. Salzman, J. (2016) Circular RNA expression: its potential regulation and function. Trends Genet. 32, 309–316
46. Kristensen, L.S. et al. (2018) Circular RNAs in cancer: opportunities and challenges in the field. Oncogene 37, 555–565
47. Memczak, S. et al. (2013) Circular RNAs are a large class of animal RNAs with regulatory potency. Nature 495, 333–338
48. Zhang, X.O. et al. (2014) Complementary sequence-mediated exon circularization. Cell 159, 134–147
49. Gao, Y. et al. (2015) CIRI: an efficient and unbiased algorithm for de novo circular RNA identification. Genome Biol. 16, 4
50. Gao, Y. et al. (2016) Comprehensive identification of internal structure and alternative splicing events in circular RNAs. Nat. Commun. 7, 12060
51. Westholm, J.O. et al. (2014) Genome-wide analysis of Drosophila circular RNAs reveals their structural and sequence properties and age-dependent neural accumulation. Cell Rep. 9, 1966–1980
52. Chuang, T.J. et al. (2016) NCLscan: accurate identification of non-co-linear transcripts (fusion, trans-splicing and circular RNA) with a good balance between sensitivity and precision. Nucleic Acids Res. 44, e29
53. Szabo, L. et al. (2015) Statistically based splicing detection reveals neural enrichment and tissue-specific induction of circular RNA during human fetal development. Genome Biol. 16, 126
54. Izuogu, O.G. et al. (2016) PTESFinder: a computational method to identify post-transcriptional exon shuffling (PTES) events. BMC Bioinformatics 17, 31
55. Gao, Y. and Zhao, F. (2018) Computational strategies for exploring circular RNAs. Trends Genet. 34, 389–400
56. Hansen, T.B. (2018) Improved circRNA identification by combining prediction algorithms. Front. Cell Dev. Biol. 6, 1–9
57. Xia, S. et al. (2018) CSCD: a database for cancer-specific circular RNAs. Nucleic Acids Res. 46, D925–D929
58. de Klerk, E. and ’t Hoen, P.A.C. (2015) Alternative mRNA transcription, processing, and translation: insights from RNA sequencing. Trends Genet. 31, 128–139
59. Oltean, S. and Bates, D.O. (2014) Hallmarks of alternative splicing in cancer. Oncogene 33, 5311–5318
60. Wang, E.T. et al. (2008) Alternative isoform regulation in human tissue transcriptomes. Nature 456, 470–476
61. Pan, Q. et al. (2008) Deep surveying of alternative splicing complexity in the human transcriptome by high-throughput sequencing. Nat. Genet. 40, 1413–1415
62. Lee, S.C.-W. and Abdel-Wahab, O. (2016) Therapeutic targeting of splicing in cancer. Nat. Med. 22, 976–986
63. Carazo, F. et al. (2018) Upstream analysis of alternative splicing: a review of computational approaches to predict context-dependent splicing factors. Brief. Bioinform. Published online January 29, 2018. http://dx.doi.org/10.1093/bib/bby005
64. Li, Y. et al. (2017) Revealing the determinants of widespread alternative splicing perturbation in cancer. Cell Rep. 21, 798–812
65. Dvinge, H. and Bradley, R.K. (2015) Widespread intron retention diversifies most cancer transcriptomes. Genome Med. 7, 1–13
66. Ryan, M. et al. (2016) TCGASpliceSeq a compendium of alternative mRNA splicing in cancer. Nucleic Acids Res. 44, D1018–D1022
67. Di Giammartino, D.C. et al. (2011) Mechanisms and consequences of alternative polyadenylation. Mol. Cell 43, 853–866
68. Tian, B. and Manley, J.L. (2017) Alternative polyadenylation of mRNA precursors. Nat. Rev. Mol. Cell Biol. 18, 18–30
69. Sandberg, R. et al. (2008) Proliferating cells express mRNAs with shortened 3′ untranslated regions and fewer microRNA target sites. Science 320, 1643–1647
70. Mayr, C. and Bartel, D.P. (2009) Widespread shortening of 3′ UTRs by alternative cleavage and polyadenylation activates oncogenes in cancer cells. Cell 138, 673–684
71. Hoque, M. et al. (2013) Analysis of alternative cleavage and polyadenylation by 3′ region extraction and deep sequencing. Nat. Methods 10, 133–139
72. Sun, Y. et al. (2012) Genome-wide alternative polyadenylation in animals: insights from high-throughput technologies. J. Mol. Cell Biol. 4, 352–361
73. Katz, Y. et al. (2010) Analysis and design of RNA sequencing experiments for identifying isoform regulation. Nat. Methods 7, 1009–1015
74. Le Pera, L. et al. (2015) 3USS: a web server for detecting alternative 3′ UTRs from RNA-seq experiments. Bioinformatics 31, 1845–1847
75. Ha, K.C.H. et al. (2018) QAPA: a new method for the systematic analysis of alternative polyadenylation from RNA-seq data. Genome Biol. 19, 45
76. Xia, Z. et al. (2014) Dynamic analyses of alternative polyadenylation from RNA-seq reveal a 3′-UTR landscape across seven tumour types. Nat. Commun. 5, 5274
77. Ye, C. et al. (2018) APAtrap: identification and quantification of alternative polyadenylation sites from RNA-seq data. Bioinformatics 34, 1841–1849
78. Yang, J.S. et al. (2014) TAPAS: tool for alternative polyadenylation site analysis. Bioinformatics 30, 2989–2990
79. Park, H.J. et al. (2018) 3′ UTR shortening represses tumor-suppressor genes in trans by disrupting ceRNA crosstalk. Nat. Genet. 50, 783–789
80. Xiang, Y. et al. (2018) Comprehensive characterization of alternative polyadenylation in human cancer. J. Natl. Cancer Inst. 110, 379–389
81. Gruber, A.J. et al. (2016) A comprehensive analysis of 3′ end sequencing data sets reveals novel polyadenylation signals and the repressive role of heterogeneous ribonucleoprotein C on cleavage and polyadenylation. Genome Res. 26, 1145–1159
82. Wang, R. et al. (2018) PolyA-DB 3 catalogs cleavage and polyadenylation sites identified by deep sequencing in multiple genomes. Nucleic Acids Res. 46, D315–D319
83. Müller, S. et al. (2014) APADB: a database for alternative polyadenylation and microRNA regulation events. Database (Oxford) 2014, bau076
84. You, L. et al. (2015) APASdb: a database describing alternative poly(A) sites and selection of heterogeneous cleavage sites downstream of poly(A) signals. Nucleic Acids Res. 43, D59–D67
85. Feng, X. et al. (2018) TC3A: the cancer 3′ UTR atlas. Nucleic Acids Res. 46, D1027–D1030
86. Xu, X. et al. (2018) The role of A-to-I RNA editing in cancer development. Curr. Opin. Genet. Dev. 48, 51–56
87. Wang, Y. et al. (2017) Systematic characterization of A-to-I RNA editing hotspots in microRNAs across human cancers. Genome Res. 27, 1112–1125
88. Gong, J. et al. (2017) LNCediting: a database for functional effects of RNA editing in lncRNAs. Nucleic Acids Res. 45, D79–D84
89. Peng, X. et al. (2018) A-to-I RNA editing contributes to proteomic diversity in cancer. Cancer Cell 33, 817–828
90. Picardi, E. and Pesole, G. (2013) REDItools: high-throughput RNA editing detection made easy. Bioinformatics 29, 1813–1814
91. Piechotta, M. et al. (2017) JACUSA: site-specific identification of RNA editing events from replicate sequencing data. BMC Bioinformatics 18, 7
92. Wang, Z. et al. (2016) RES-Scanner: a software package for genome-wide identification of RNA-editing sites. GigaScience 5, 37
93. Ramaswami, G. et al. (2013) Identifying RNA editing sites using RNA sequencing data alone. Nat. Methods 10, 128–132
94. Zhang, Q. and Xiao, X. (2015) Genome sequence–independent identification of RNA editing sites. Nat. Methods 12, 347
95. Kim, M. et al. (2016) RDDpred: a condition-specific RNA-editing prediction model from RNA-seq data. BMC Genomics 17, 5
96. Ahn, J. and Xiao, X. (2015) RASER: reads aligner for SNPs and editing sites of RNA. Bioinformatics 31, 3906–3913
97. Xiong, H. et al. (2017) RED-ML: a novel, effective RNA editing detection method based on machine learning. GigaScience 6, 1–8
98. John, D. et al. (2017) RNAEditor: easy detection of RNA editing events and the introduction of editing islands. Brief. Bioinform. 18, 993–1001
99. Ramaswami, G. and Li, J.B. (2014) RADAR: a rigorously annotated database of A-to-I RNA editing. Nucleic Acids Res. 42, D109–D113
100. Picardi, E. et al. (2017) REDIportal: a comprehensive database of A-to-I RNA editing events in humans. Nucleic Acids Res. 45, D750–D757
101. Chen, L. et al. (2013) Recoding RNA editing of AZIN1 predisposes to hepatocellular carcinoma. Nat. Med. 19, 209–216
102. Han, L. et al. (2015) The genomic landscape and clinical relevance of A-to-I RNA editing in human cancers. Cancer Cell 28, 515–528
103. Paz-Yaacov, N. et al. (2015) Elevated RNA editing activity is a major contributor to transcriptomic diversity in tumors. Cell Rep. 13, 267–276
104. Plummer, M. et al. (2016) Global burden of cancers attributable to infections in 2012: a synthetic analysis. Lancet Glob. Health 4, e609–e616
105. Garrett, W.S. (2015) Cancer and the microbiota. Science 348, 80–86
106. Gopalakrishnan, V. et al. (2018) Gut microbiome modulates response to anti-PD-1 immunotherapy in melanoma patients. Science 359, 97–103
107. Routy, B. et al. (2018) Gut microbiome influences efficacy of PD-1-based immunotherapy against epithelial tumors. Science 359, 91–97
108. Kostic, A.D. et al. (2011) PathSeq: software to identify or discover microbes by deep sequencing of human tissue. Nat. Biotechnol. 29, 393–396
109. Chen, Y. et al. (2013) VirusSeq: software to identify viruses and their integration sites using next-generation sequencing of human cancer tissue. Bioinformatics 29, 266–267
110. Wang, Q. et al. (2013) VirusFinder: software for efficient and accurate detection of viruses and their integration sites in host genomes through next generation sequencing data. PLoS One 8, e64465
111. Bai, B. et al. (2015) Dr.VIS v2.0: an updated database of human disease-related viral integration sites in the era of high-throughput deep sequencing. Nucleic Acids Res. 43, D887–D892
112. Tang, K.W. et al. (2013) The landscape of viral expression and host gene fusion and adaptation in human cancer. Nat. Commun. 4, 2513
113. Robinson, K.M. et al. (2017) Distinguishing potential bacteria-tumor associations from contamination in a secondary data analysis of public cancer genome sequence data. Microbiome 5, 9
114. Thompson, K.J. et al. (2017) A comprehensive analysis of breast cancer microbiota and host gene expression. PLoS One 12, e0188873
115. Bailey, M.H. et al. (2018) Comprehensive characterization of cancer driver genes and mutations. Cell 173, 371–385
116. Mertens, F. et al. (2015) The emerging complexity of gene fusions in cancer. Nat. Rev. Cancer 15, 371–381
117. Ren, R. (2005) Mechanisms of BCR-ABL in the pathogenesis of chronic myelogenous leukaemia. Nat. Rev. Cancer 5, 172–183
118. Latysheva, N.S. and Babu, M.M. (2016) Discovering and understanding oncogenic gene fusions through data intensive computational approaches. Nucleic Acids Res. 44, 4487–4503
119. Wang, Q. et al. (2013) Application of next generation sequencing to human gene fusion detection: computational tools, features and perspectives. Brief. Bioinform. 14, 506–519
120. Stransky, N. et al. (2014) The landscape of kinase fusions in cancer. Nat. Commun. 5, 4846
121. Gao, Q. et al. (2018) Driver fusions and their implications in the development and treatment of human cancers. Cell Rep. 23, 227–238
122. Lee, M. et al. (2017) ChimerDB 3.0: an enhanced database for fusion genes from cancer transcriptome and literature data mining. Nucleic Acids Res. 45, D784–D789
123. Forbes, S.A. et al. (2017) COSMIC: somatic cancer genetics at high-resolution. Nucleic Acids Res. 45, D777–D783
124. Wang, Y. et al. (2015) FusionCancer: a database of cancer fusion genes derived from RNA-seq data. Diagn. Pathol. 10, 131
125. Hu, X. et al. (2018) TumorFusions: an integrative resource for cancer-associated transcript fusions. Nucleic Acids Res. 46, D1144–D1149
126. Reimand, J. et al. (2015) Evolutionary constraint and disease associations of post-translational modification sites in human genomes. PLoS Genet. 11, 1–24
127. Xu, C. (2018) A review of somatic single nucleotide variant calling algorithms for next-generation sequencing data. Comput. Struct. Biotechnol. J. 16, 15–24
128. Fan, Y. et al. (2016) MuSE: accounting for tumor heterogeneity using a sample-specific error model improves sensitivity and specificity in mutation calling from sequencing data. Genome Biol. 17, 178
129. Koboldt, D.C. et al. (2012) VarScan 2: somatic mutation and copy number alteration discovery in cancer by exome sequencing. Genome Res. 22, 568–576
130. Radenbaugh, A.J. et al. (2014) RADIA: RNA and DNA integrated analysis for somatic mutation detection. PLoS One 9, e111516
131. Christoforides, A. et al. (2013) Identification of somatic mutations in cancer through Bayesian-based analysis of sequenced genome pairs. BMC Genomics 14, 302
132. Lai, Z. et al. (2016) VarDict: a novel and versatile variant caller for next-generation sequencing in cancer research. Nucleic Acids Res. 44, 1–11
133. Tang, X. et al. (2014) The eSNV-detect: a computational system to identify expressed single nucleotide variants from transcriptome sequencing data. Nucleic Acids Res. 42, 1–11
134. Piskol, R. et al. (2013) Reliable identification of genomic variants from RNA-seq data. Am. J. Hum. Genet. 93, 641–651
135. Neums, L. et al. (2018) VaDiR: an integrated approach to Variant Detection in RNA. GigaScience 7, 1–13
136. Santoni, F.A. et al. (2017) Detection of imprinted genes by single-cell allele-specific gene expression. Am. J. Hum. Genet. 100, 444–453
137. Klimpe, S. et al. (2011) Evaluating the effect of spastin splice mutations by quantitative allele-specific expression assay. Eur. J. Neurol. 18, 99–105
138. Li, G. et al. (2012) Identification of allele-specific alternative mRNA processing via transcriptome sequencing. Nucleic Acids Res. 40, 1–13
139. Tomsic, J. et al. (2010) Allele-specific expression of TGFBR1 in colon cancer patients. Carcinogenesis 31, 1800–1804
140. Wood, D.L.A. et al. (2015) Recommendations for accurate resolution of gene and isoform allele-specific expression in RNA-seq data. PLoS One 10, 1–27
141. Gu, F. and Wang, X. (2015) Analysis of allele specific expression - a survey. Tsinghua Sci. Technol. 20, 513–529
142. Turro, E. et al. (2011) Haplotype and isoform specific expression estimation using multi-mapping RNA-seq reads. Genome Biol. 12, 1–15
143. Mayba, O. et al. (2014) MBASED: allele-specific expression detection in cancer tissues and cell lines. Genome Biol. 15, 1–21
144. Jiang, Y. et al. (2017) SCALE: modeling allele-specific gene expression by single-cell RNA sequencing. Genome Biol. 18, 1–15
145. Rockman, M.V. and Kruglyak, L. (2006) Genetics of global gene expression. Nat. Rev. Genet. 7, 862–872
146. Li, Q. et al. (2013) Integrative eQTL-based analyses reveal the biology of breast cancer risk loci. Cell 152, 633–641
147. Albert, F.W. and Kruglyak, L. (2015) The role of regulatory variation in complex traits and disease. Nat. Rev. Genet. 16, 197–212
148. Lawrenson, K. et al. (2015) Cis-eQTL analysis and functional validation of candidate susceptibility genes for high-grade serous ovarian cancer. Nat. Commun. 6, 8234
149. Gong, J. et al. (2018) PancanQTL: systematic identification of cis-eQTLs and trans-eQTLs in 33 cancer types. Nucleic Acids Res. 46, D971–D976
150. Thorsson, V. et al. (2018) The immune landscape of cancer. Immunity 48, 812–830
151. Ye, Y. et al. (2018) The genomic landscape and pharmacogenomic interactions of clock genes in cancer chronotherapy. Cell Syst. 6, 314–328
152. Wu, H. et al. (2017) The diversity of long noncoding RNAs and their generation. Trends Genet. 33, 540–552
153. Hrdlickova, R. et al. (2017) RNA-seq methods for transcriptome analysis. Wiley Interdiscip. Rev. RNA Published online January 2017. http://dx.doi.org/10.1002/wrna.1364
154. Lau, J.W. et al. (2017) The cancer genomics cloud: collaborative, reproducible, and democratized - a new paradigm in large-scale computational research. Cancer Res. 77, e3–e6
155. LeCun, Y. et al. (2015) Deep learning. Nature 521, 436–444
156. Song, Y. et al. (2017) Single-cell alternative splicing analysis with expedition reveals splicing dynamics during neuron differentiation. Mol. Cell 67, 148–161
157. Adiconis, X. et al. (2018) Comprehensive comparative analysis of 5′-end RNA-sequencing methods. Nat. Methods 15, 505–511