A pan-cancer study of copy number gain and up-regulation in human oncogenes


Life Sciences 211 (2018) 206–214

Contents lists available at ScienceDirect

Life Sciences journal homepage: www.elsevier.com/locate/lifescie

YongKiat Wee a, TianFang Wang a, Yining Liu b, Xiaoyan Li c, Min Zhao a,*

a School of Engineering, Faculty of Science, Health, Education and Engineering, University of the Sunshine Coast, Queensland 4558, Australia
b The School of Public Health, Institute for Chemical Carcinogenesis, Guangzhou Medical University, 195 Dongfengxi Road, Guangzhou 510182, China
c Beijing Anzhen Hospital, Capital Medical University, Beijing Institute of Heart, Lung & Blood Vessel Disease, Beijing, China

ARTICLE INFO

Keywords: Oncogene; Pan-cancer; Copy number variation; Copy number gain; Gene expression

ABSTRACT

Aim: Research on copy number variations (CNVs) in oncogenes has been limited, so we conducted a systematic pan-cancer analysis of CNVs and the associated gene expression changes. The aim of the present study was to provide insight into the relationships between gene expression and oncogenesis. Main methods: We collected all the oncogenes from the ONGene database and overlapped them with the CNVs of TCGA tumour samples from the Catalogue of Somatic Mutations in Cancer database. We further conducted an integrative analysis of CNVs and gene expression using data from the matched TCGA tumour samples. Key findings: We found 637 oncogenes associated with CNVs in 5900 tumour samples. There were 204 oncogenes with frequent copy number gain (CNG). These 204 oncogenes were enriched in cancer-related pathways, including the MAPK cascade and Ras GTPase signalling pathways. By integrating CNVs and gene expression changes in the corresponding tumour samples, we identified 95 oncogenes with consistent CNG occurrence and up-regulation, which may represent a recurrent driving force for oncogenesis. Surprisingly, eight oncogenes showed concordant CNG and gene up-regulation in at least 250 tumour samples: INTS8 (355), ECT2 (326), LSM1 (310), DDHD2 (298), COPS5 (286), EIF3E (281), TPD52 (258) and ERBB2 (254). Significance: As the first report on abundant CNGs in oncogenes and concordant changes in gene expression, our results may be valuable for the design of CNV-based cancer diagnostic strategies.

1. Introduction

Cancer is a major cause of death worldwide [1] and is a consequence of unlimited cell growth and proliferation. The uncontrolled growth gives rise to a tumour, and the cancer cells may spread into healthy tissue (metastasis) and even affect the blood and circulatory systems [2]. Cancer is a consequence of gene mutation and arises from the accumulation of somatic genetic alterations [2]. Mutations in ‘cancer genes’ are divided into two groups: ‘driver’ and ‘passenger’ [3]. ‘Passenger’ mutations have neutral effects on the clonal expansion of the cancer cells and do not stimulate growth [3]. Conversely, ‘driver’ mutations are usually involved in cancer progression, and a mutation within genes such as oncogenes confers a selective growth advantage [4]. These oncogenes often encode proteins in the signalling pathways that maintain normal cell growth. In general, oncogenes arise from mutations in proto-oncogenes, which are found in all normal cells, and play a role in stimulating excessive cell division. The mutated forms of proto-oncogenes are found in the tumour cells [5]. Gene mutations and translocations can occur during cancer initiation, whereas amplification normally occurs during progression. Research into oncogenes has been critically important in the treatment of cancer because the outcomes can be applied diagnostically to determine the seriousness and stage of the disease, and can help in the discovery of potential markers as a guide to future gene therapy [6].

Cancer development involves a sequence of genetic abnormalities, including single nucleotide mutations and copy number variants (CNVs). CNVs are DNA segments in the human genome, ranging in size from thousands to millions of bases, that vary in copy number. Such copy number variations can result in gene dosage imbalances [7]. There are two categories of CNVs: copy number loss (CNL) denotes the deletion of gene copies, while copy number gain (CNG) denotes the addition of gene copies. It is important to understand CNVs when examining disease-associated changes, and a baseline of human genomic variation needs to be created through

Corresponding author. E-mail address: [email protected] (M. Zhao).

https://doi.org/10.1016/j.lfs.2018.09.032 Received 24 July 2018; Received in revised form 14 September 2018; Accepted 18 September 2018 Available online 19 September 2018 0024-3205/ © 2018 Elsevier Inc. All rights reserved.


Y. Wee et al.

of gain (CNGs) and loss (CNLs) samples across multiple TCGA cancers. The CNG frequency for each gene was also calculated as the number of gain samples divided by the total number of gain and loss samples across multiple cancers. This information was used to provide cut-off values for filtering: number of CNG samples ≥ 20; gain/loss ratio > 2; and CNG frequency > 0.1. We applied the sample-number and gain/loss-ratio cut-offs to identify the oncogenes with consistent CNGs. To filter out the CNVs in the human population, we defined a threshold value of duplication frequency > 0.1. Finally, to determine the oncogenes with concordant CNGs and up-regulation, the threshold value (ratio of Gain_Over/Loss_Under) was set to > 30 samples, which yielded 95 genes with consistent CNGs and over-expression. The main reason for this was to identify a reliable gene list with consistent CNG and over-expression. We tried different cut-off values and narrowed the gene list down to < 100 genes, a size better suited to functional analysis.
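The per-gene statistics described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the `CnvCounts` record, the function names and the toy counts are hypothetical, and only the sample-number and gain/loss-ratio cut-offs are applied here, with the CNG frequency computed for later filtering.

```python
from dataclasses import dataclass


@dataclass
class CnvCounts:
    """Hypothetical per-gene record of pan-cancer CNV sample counts."""
    gene: str
    gain_samples: int  # number of samples with copy number gain
    loss_samples: int  # number of samples with copy number loss


def gain_loss_ratio(c: CnvCounts) -> float:
    # A gene with no loss samples is given an infinite ratio (assumption),
    # so it passes any finite ratio cut-off.
    if c.loss_samples == 0:
        return float("inf")
    return c.gain_samples / c.loss_samples


def cng_frequency(c: CnvCounts) -> float:
    # Gain samples divided by the total of gain and loss samples.
    total = c.gain_samples + c.loss_samples
    return c.gain_samples / total if total else 0.0


def frequent_cng_genes(counts, min_gain=20, min_ratio=2.0):
    # Keep genes with >= min_gain gain samples and a gain/loss ratio > min_ratio.
    return [
        c.gene
        for c in counts
        if c.gain_samples >= min_gain and gain_loss_ratio(c) > min_ratio
    ]


# Toy counts, invented for illustration only.
toy = [
    CnvCounts("GENE_A", 50, 10),
    CnvCounts("GENE_B", 19, 1),
    CnvCounts("GENE_C", 40, 30),
]
selected = frequent_cng_genes(toy)
```

In this toy input only `GENE_A` passes both cut-offs: `GENE_B` has too few gain samples and `GENE_C` has a gain/loss ratio below 2.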

the analysis of the whole-genome CNVs [7,8]. Traditional methods, such as light microscopy for cytogenetic analyses, have been used to detect the presence of large fragment deletions and duplications [9]. A large group of copy-number gains or losses has been associated with the development of disease [7,10]. Furthermore, some CNVs were found among individuals with susceptibility to disease, affecting genes such as oncogenes and tumour-suppressor genes in cancer [7,10]. Elevated expression of oncogenes via gene amplification is a common event in human cancer. These amplified genes must be overexpressed in order to function as driver alterations. For example, Yamaguchi et al. conducted an integrative analysis of copy number and gene expression profiling to discover potential driver genes in 1454 solid tumours. There were 64 known driver oncogenes found in 587 tumours based on their gene expression profiling and CNVs. The authors compared the mRNA expression levels of these 64 known driver oncogenes using microarray analysis to assess the fold change in expression between tumours and matched normal tissues. Genes with expression elevated ≥ 5-fold in tumour tissues were classed as overexpressed. The gene expression results of 12 of the 64 known oncogenes were then integrated with matched genomic copy number results to explore the relationship between CNVs and gene amplification. The authors defined the genes with copy number ≥ 6 as overexpressed genes [11]. Ding et al. developed a hierarchical Bayes statistical model, xseq, to systematically quantify and study the effect of somatic mutations on expression profiles. The authors focused only on the cis-effects of loss-of-function mutations in tumour suppressor genes. The xseq model bases its predictions on measurable signals from mutations with functional effects on mRNA transcripts. It applies a precomputed ‘influence graph’ to integrate prior gene-gene relationship knowledge into its modelling framework [12]. Previous studies have investigated the correlation between CNVs and gene expression across different types of human cancer [13,14]. However, there has been limited systematic study of this relationship in oncogenes. To address this gap, we conducted a pan-cancer CNV analysis of all the human oncogenes in order to explore the overall perspective of the CNV features. From this study, we may also be able to cross-validate some observations from earlier studies, such as the concordance of copy number gain and up-regulation of oncogenes. The results may provide a better understanding of the relationship between CNV and gene expression changes in the progression of cancer.

2.1. CNVs in human population control data

In order to find the highly frequent CNVs in cancer samples, we downloaded the CNV data for the human population from the DECIPHER v9.11 database [19] at https://decipher.sanger.ac.uk/files/downloads/population_cnv.txt.gz. Using the population CNV file, we extracted the common CNVs and their duplication and deletion frequencies; these served as control data in the analysis. Since the CNV frequency data for the healthy population were provided by chromosomal region, gene symbols were required for mapping. Hence, we downloaded the genomic location information for all the human RefSeq genes from the UCSC Genome Browser database on 20 Oct 2016 at http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/refGene.txt.gz. Using Bedtools (V2.26.0), we mapped the healthy-population CNV data to the RefSeq data based on overlapping chromosomal locations. The corresponding genomic locations in GRCh38 were annotated with the control data. As oncogenes normally act through gain of function in cancer development, we only compared the CNG frequency from COSMIC with the duplication frequency in the population control data. The genes with frequent pan-cancer CNGs were defined by filtering common CNVs using the cut-off value of CNG frequency in population control data of > 0.1. As a result, 204 oncogenes were identified from a list of genes with frequent CNGs in comparison with the CNVs in population data. The 204 genes were used for further analysis, including functional enrichment and mapping to gene expression data.
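The overlap-based mapping between population CNV regions and gene locations can be illustrated with a small pure-Python sketch. It is not Bedtools and does not reproduce its options; the tuple layouts, function names and coordinates are hypothetical, and the quadratic loop is only suitable for toy inputs.

```python
def overlaps(a_start, a_end, b_start, b_end):
    # Half-open intervals [start, end), the convention used by BED files.
    return a_start < b_end and b_start < a_end


def annotate_cnvs(cnv_regions, genes):
    """Map each gene symbol to the duplication frequencies of overlapping CNVs.

    cnv_regions: iterable of (chrom, start, end, dup_freq)   -- assumed layout
    genes:       iterable of (chrom, start, end, symbol)     -- assumed layout
    """
    genes = list(genes)
    hits = {}
    for chrom, start, end, dup_freq in cnv_regions:
        for g_chrom, g_start, g_end, symbol in genes:
            if chrom == g_chrom and overlaps(start, end, g_start, g_end):
                hits.setdefault(symbol, []).append(dup_freq)
    return hits


# Toy coordinates, invented for illustration only.
cnvs = [
    ("chr1", 100, 200, 0.15),
    ("chr1", 500, 600, 0.02),
    ("chr2", 100, 200, 0.30),
]
ref_genes = [
    ("chr1", 150, 300, "GENE_A"),
    ("chr1", 600, 700, "GENE_B"),  # starts exactly where a CNV ends: no overlap
    ("chr2", 50, 120, "GENE_C"),
]
annotated = annotate_cnvs(cnvs, ref_genes)
```

A gene that is absent from the result, such as `GENE_B` here, has no overlapping population CNV and therefore carries no duplication frequency to compare against.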

2. Materials and methods

The data for 803 curated oncogenes were downloaded from the ONGene database (http://ongene.bioinfo-minzhao.org/) [15]. The data were in plain text format and included basic information such as oncogene IDs and gene symbols. The ONGene database was developed from literature curation of oncogene-related research and contains the curated genes, literature and functional annotations. It can be used as a guide for large-scale genetic screening related to oncogenes, and also as a categorised oncogene catalogue for experimental validation and integrative analysis of cancer genomics [15]. To perform a series of systematic analyses of CNVs in oncogenes, The Cancer Genome Atlas (TCGA) CNV data [16] were downloaded from the Catalogue of Somatic Mutations in Cancer (COSMIC) database (V78, GRCh38) [17] and were used to investigate CNVs at the pan-cancer level. TCGA has identified and profiled the molecular alterations of a large number of tumour samples at the DNA, RNA, protein and epigenetic levels [18]. Fig. 1 shows the pipeline for identification of concordant copy number gain and over-expression of oncogenes in human cancer. The CNVs for the oncogenes were extracted based on the official gene symbol. A total of 637 oncogenes overlapped with both gained and lost gene copies in TCGA cancer samples. The gain-loss ratio for each oncogene was calculated based on the number

2.2. Gene expression analysis in frequent oncogenes with CNGs

All TCGA expression data were downloaded from the COSMIC database (V78) in order to explore the concordant changes of gene expression in oncogenes with CNVs. The analysis targeted only the gene expression changes in the same TCGA samples that carried CNGs in oncogenes. The COSMIC data consist of FPKM, Z-score and RSEM values. Fragments Per Kilobase of transcript per Million mapped reads (FPKM) represents the relative expression of a transcript. The RNA-Seq platform generates trimmed short reads, and FPKM is calculated from these reads during gene expression quantification. RSEM is another well-known measure for quantifying transcripts in RNA-Seq data [20]. The Z-score of the expression data was used to identify whether an oncogene was over-expressed or under-expressed. The Z-score for a sample is the number of standard deviations by which its expression lies away from the mean expression in the reference. In the formula below, x represents the expression in the tumour sample, μ the mean expression in the reference samples and δ the standard deviation of expression in the reference samples:


Fig. 1. Pipeline for the discovery of consistent copy number gain and up-regulation of oncogenes in pan-cancer. This flowchart shows the pipeline for finding the oncogenes with consistent copy number gain (CNG) and correspondingly increased gene expression.

$$Z = \frac{x - \mu}{\delta}$$

data for REVIGO, which then produced a semantic similarity-based scatterplot of the GO terms from Toppfun.

We used a Z-score threshold of > 2 to determine the over-expressed oncogenes in each TCGA sample and focused on the concordant over-expression of the oncogenes with CNGs in the same cancer samples. The number of samples with consistent over-expression and CNGs was calculated for each gene. After investigating the matched TCGA samples for both copy number gain and gene over-expression, we applied a threshold of > 30 TCGA samples, which resulted in 95 oncogenes with consistent CNGs and up-regulated gene expression. To examine the CNV patterns in TCGA samples, an integrative analysis was performed using the free web resource cBioPortal (http://cbioportal.org). The cBioPortal for Cancer Genomics allows users to explore, analyse and visualize multidimensional cancer genomics data. The list of 95 genes was then used to explore correlation plots between CNVs and expression in the TCGA samples using cBioPortal.
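The concordance count described above can be sketched in a few lines. This is an illustration under stated assumptions rather than the authors' code: the dictionary layouts, function names and toy data are hypothetical, and the Z-scores are assumed to be precomputed per gene and sample, as in the COSMIC expression data.

```python
def z_score(x, mu, delta):
    # Z = (x - mu) / delta, with mu and delta taken from the reference samples.
    return (x - mu) / delta


def concordant_sample_count(gain_samples, z_by_sample, z_cutoff=2.0):
    # Count samples that carry a CNG and are also over-expressed (Z > cutoff).
    # Samples without an expression value are never counted.
    return sum(
        1 for s in gain_samples
        if z_by_sample.get(s, float("-inf")) > z_cutoff
    )


def concordant_genes(gains, zscores, min_samples=30):
    """gains:   {gene: set of sample IDs with a CNG}          -- assumed layout
    zscores: {gene: {sample ID: expression Z-score}}        -- assumed layout
    Returns {gene: concordant sample count} for genes above min_samples."""
    result = {}
    for gene, samples in gains.items():
        n = concordant_sample_count(samples, zscores.get(gene, {}))
        if n > min_samples:
            result[gene] = n
    return result


# Toy data, invented for illustration only.
gains = {
    "GENE_A": {f"S{i}" for i in range(40)},
    "GENE_B": {"S1", "S2"},
}
zscores = {
    "GENE_A": {f"S{i}": (3.0 if i < 35 else 0.5) for i in range(40)},
    "GENE_B": {"S1": 5.0, "S2": 5.0},
}
concordant = concordant_genes(gains, zscores)
```

Here `GENE_A` is retained with 35 concordant samples, while `GENE_B` is dropped because two samples do not exceed the threshold of 30.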

3. Results and discussion

3.1. Pan-cancer analysis to identify oncogenes with frequent copy number gain

To explore the relationship between CNV and expression in all the human oncogenes, we developed a bioinformatics framework to identify concordant copy number gain (CNG) and over-expression (Fig. 1). The process first involved downloading a list of 803 oncogenes from the ONGene database. In order to present a global analysis of the CNVs in human cancers, 637 oncogenes (see Additional file 1) were mapped to the CNV information for TCGA samples. To measure the statistical significance of our gene selection, we performed a hypergeometric test, which assesses whether an overlap is larger than expected by chance. There is a total of 61,088 genes in the NCBI gene database (ftp://ftp.ncbi.nih.gov/gene/DATA/GENE_INFO/Mammalia/Homo_sapiens.gene_info.gz). Among these genes, we found 10,495 genes with a CNG/CNL ratio > 2 and number of CNG samples ≥ 20, which are the same criteria we used to select 637 genes from the 803 oncogenes. By applying the hypergeometric test to the four numbers (61,088, 10,495, 803, 637), we obtained a significant result with P-value = 3.30E−364. The test also shows that, if the oncogenes followed the same distribution of abundant-CNG genes as in COSMIC, only about 280 genes would be selected from the 803 genes. This 5.20-fold over-enrichment compared to expectation, with a P-value lower than 0.05, means that our selection of 637 from 803 genes is not random. Starting from the 637 genes, we further filtered out the common CNVs in the healthy human population. In detail, the CNV data from the DECIPHER v9.11 database served as control data to filter out the common CNVs that are not related to cancer development. The mapped file of oncogenes and CNVs was then overlapped with the control data to generate a list of 357 annotated oncogenes without redundancy (see Methods).
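A hypergeometric upper-tail test on four such numbers can be sketched with the Python standard library alone. Because a P-value of this magnitude is far below the smallest representable double, the sketch works in log space and returns log10(P). The function names are hypothetical, and the sketch is not guaranteed to reproduce the figures quoted above, since the exact test configuration used by the authors is not described.

```python
import math


def log_choose(n, k):
    # Natural log of the binomial coefficient C(n, k), via log-gamma.
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)


def log10_hypergeom_upper_tail(N, K, n, k):
    """log10 of P(X >= k) for X ~ Hypergeometric(N, K, n).

    N: background size, K: background genes meeting the criteria,
    n: number of genes drawn, k: drawn genes meeting the criteria.
    """
    hi = min(n, K)
    if k > hi:
        return float("-inf")
    log_terms = [
        log_choose(K, x) + log_choose(N - K, n - x) - log_choose(N, n)
        for x in range(k, hi + 1)
    ]
    # Log-sum-exp keeps the sum finite even when every term underflows.
    m = max(log_terms)
    log_p = m + math.log(sum(math.exp(t - m) for t in log_terms))
    return log_p / math.log(10)


def expected_overlap(N, K, n):
    # Mean of the hypergeometric distribution: n * K / N.
    return n * K / N


# The four numbers quoted in the text: background, abundant-CNG genes,
# curated oncogenes, oncogenes meeting the criteria.
log10_p = log10_hypergeom_upper_tail(61088, 10495, 803, 637)
expected = expected_overlap(61088, 10495, 803)
```

Reporting log10(P) rather than P avoids printing 0.0 for results that are simply too small for floating-point arithmetic.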
For each oncogene, we calculated the ratio of gain and loss copies, which was used for identification of oncogenes

2.3. Functional enrichment analysis

The final stage of genetic analyses usually involves producing a list of genes and a description of their molecular functions and mechanisms. Pathway analysis generates the biological information underlying the list of genes of interest. This information can be identified by mapping the genes to their gene ontology (GO) terms. Subsequently, terms within the gene set can be assessed statistically as over- or under-represented, or ranked according to their P-values [21]. The molecular functions of the 204 oncogenes were analysed using Toppfun (http://toppgene.cchmc.org) [22]. Toppfun is a web resource that allows users to explore gene ontology (GO) molecular functions, cellular components, biological processes and pathways. The database is updated regularly and provides an extensive collection of genome annotations. From the Toppfun results, a total of 50 enriched GO terms was generated, and we extracted the GO IDs and their corresponding P-values for visualization using REVIGO (http://revigo.irb.hr/) [23]. REVIGO summarized the long list of GO terms and removed the redundant ones. The GO results served as input


Fig. 2. (A) Gene enrichment analysis of 204 human oncogenes with concordant CNGs. The scatterplot presents the summarized gene ontology (GO) terms of all 204 oncogenes with CNGs. Circles indicate the GO clusters and are plotted in two-dimensional space according to their semantic similarities to other GO terms. Circle size shows the log10 P-value (the larger the corrected P-value, the smaller the circle) and circle colour is directly proportional to the frequency of the GO term in the GOA database. (B) A general pan-cancer landscape of CNV alterations in the 95 oncogenes with up-regulated gene expression conceivably caused by CNGs. Blue represents deletions; red indicates CNGs. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

checkpoints, and this leads to uncontrolled cell division. A well-studied example is the fusion oncogene between BCR and ABL that is found in many chronic myelogenous leukaemia patients [25]; one study also demonstrated high copy number gains of the BCR/ABL1 fusion in samples from patients with chronic myeloid leukaemia [26]. The mutated ABL gene is translocated within the BCR gene on chromosome 22, and the formation of this fusion protein causes unregulated cell cycling and leads to cancer development [25]. In addition, STAT3 regulates the activity of survival proteins and tumour suppressors, halts apoptosis and the cell cycle, inactivates the p53 pathway, and causes inhibition of senescence [27]. STAT3 has been identified in different types of human cancers and has been shown to be an important part of the downstream pathways of numerous other oncogenes [27]. From our analysis, we found 637 oncogenes associated with CNVs in 5900 tumour samples. Although the 204 oncogenes are a subset of the 637 oncogenes and both lists are enriched in some similar cancer pathways, the pathways enriched in the final 204 oncogenes with consistent gene expression changes and CNG were not randomly selected. For example, the MAPK signalling pathway was not revealed in the functional enrichment analysis of the 637 oncogenes.

with frequent CNGs, and cut-off values were defined to create a list of genes with consistent CNGs. As oncogenes are well known as ‘gain of function’ genes, only oncogenes with CNGs were used. Therefore, the threshold required that the number of samples with CNG be at least double the number of samples with copy number loss (see the Materials and methods section). As a result, the output file contained a total of 204 oncogenes with frequent CNGs. To explore the functions of the 204 genes, GO enrichment analysis was performed. The results (Fig. 2A) show that the genes are involved in most of the vital processes in cancer development, including apoptosis (adjusted P-value = 7.93E−32), cell cycle (adjusted P-value = 7.73E−22) and cell division (adjusted P-value = 8.55E−15). Oncogenes play a major role in the cell cycle regulation processes that trigger cellular division and growth [24]. The proteins encoded by oncogenes propel the cell cycle forward, normally provoking cells to move from one of the G (gap) phases to other cell cycle stages such as chromosome segregation (mitosis) or chromosome replication (S phase). The abnormal protein products stimulate excessive tumour growth, without arresting the cell at certain


Fig. 3. Sample-based mutational patterns for the top 10 genes with the highest amplification rate. (A) TCGA ovarian serous cystadenocarcinoma. (B) TCGA oesophageal carcinoma. (C) TCGA lung squamous cell carcinoma. Columns indicate samples and rows indicate genes. The colour bars represent genomic alterations such as CNAs and somatic mutations.

summary, our enrichment analysis revealed that these oncogenes with frequent CNGs are involved in many basic processes of cancer development.

These enriched biological process pathways were specific to the 204 oncogenes. In addition, we narrowed the gene list down from 637 to 204 with a requirement of consistent expression change, as some copy number changes may have neutral effects. The final 204 oncogenes have a distinct functional distribution, influencing processes such as cell adhesion, cell mortality and cell division. In addition, we found that the 95 oncogenes play an important role in the regulation of cell differentiation (adjusted P-value = 2.73E−34) and protein phosphorylation (adjusted P-value = 3.55E−23). Moreover, oncogenes with CNGs may have a basic function in cell development processes such as growth (adjusted P-value = 2.61E−17) [28], as well as in biological adhesion (adjusted P-value = 1.16E−16) and the MAPK cascade (adjusted P-value = 2.22E−18). MAPK pathways are stimulated by different intracellular and extracellular signals, including hormones, growth factors and cytokines [29]. These signalling pathways connect to the machinery that controls basic biological activities such as apoptosis, growth, survival and differentiation [30]. During tumorigenesis, gene alterations can drive the MAPK pathway into a hyperactive state and also dysregulate kinase activity [29]. Numerous oncogenic driver mutations have been identified in the MAPK/ERK pathway across a variety of human cancer types, including the BRAF V600E mutation and mutations in KRAS, and these mutated genes eventually cause over-activation of the MAPK/ERK pathway in cancer progression [31]. Most of the signalling pathways are modulated by the Ras GTPases, which act as molecular controllers in cellular processes. For example, Ras oncogenes have a significant role in the downstream signalling pathways involved in cancer formation [29]. In

3.2. Gene expression increased with copy number gain on oncogenes at matched tumour samples

To further validate the functions of the oncogenes with frequent CNGs, we investigated the correlation between CNGs and up-regulated expression using the matched tumour samples. To find genes with concordant CNGs and up-regulation, we counted the number of matched samples in both the CNV and expression data for each oncogene. After examining the matched TCGA samples for both copy number gain and gene over-expression, we identified 95 oncogenes with consistent CNGs and up-regulated gene expression in > 30 samples (see Additional file 2). The functional enrichment analysis confirmed that these genes are related to cancer progression through the cell cycle (adjusted P-value = 1.670E−15) and biological pathways in cancer (adjusted P-value = 1.137E−9). Our further pan-cancer mutational analysis (Fig. 2B) showed that these 95 oncogenes have a high mutation rate in the tumour samples, particularly through gene amplification. For example, TCGA ovarian serous cystadenocarcinoma had the highest frequency of genetic alterations, with 511 cases (88.0%) exhibiting at least one copy number change. The frequency of alteration in the cBio Cancer Genomics Portal is defined by mutation, copy number amplification, or homozygous deletion in tumour samples. The frequency of


Fig. 4. Expression analysis of three up-regulated oncogenes with CNGs and their survival curves: ECT2, INTS8 and LSM1. (A) The expression level of ECT2 in TCGA ovarian serous cystadenocarcinoma. (B) The expression level of INTS8 in TCGA breast invasive carcinoma. (C) The expression level of LSM1 in TCGA breast invasive carcinoma. (D) Overall survival analysis of ECT2 in TCGA ovarian serous cystadenocarcinoma. (E) Overall survival analysis of INTS8 in TCGA breast invasive carcinoma. (F) Overall survival analysis of LSM1 in TCGA breast invasive carcinoma. Plots were derived from cBioPortal based on Kaplan-Meier analysis. Blue lines indicate lower expression and red lines indicate higher expression. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

is a commonly activated oncogene implicated in the pathogenesis of human B cell lymphomas [32]. BCL6 is a transcriptional repressor in germinal centres, which are an important site for antibody affinity maturation [32]. However, its role in ovarian, lung and oesophageal cancers has been overlooked to date. FNDC3B encodes fibronectin type III domain containing 3B, and amplification of FNDC3B has been observed in oesophageal, lung, ovarian and breast cancers [33]. Over-expression of FNDC3B triggers different cancer pathways, such as Rb1 and TGFβ signalling [33]. MECOM is a well-known oncogene in a variety of cancer types and its expression is commonly associated with poor patient survival [34]. SKIL is involved in the SMAD pathway, which modulates cell growth and differentiation through transforming growth factor-beta (TGFB) [35]. A few studies have identified SKIL as a driver gene at 3q26, and the co-expression of TLOC1 and SKIL activated cooperative cell transformation [36]. The over-expression of SKIL is correlated with copy number gain of 3q26, and its amplification has been identified in human breast, lung and ovarian cancer [36]. SOX2 is a member of the SOX gene family, which encodes transcription factors with a high-specificity DNA binding domain [37]. SOX2 is overexpressed in several types of cancer and influences the functional state of cancer cells through signalling processes [37]. Most of our results show the importance of these oncogenes in cancer progression through high copy number gains in multiple cancers, whereas previously this had been validated only in some specific cancer types.

the amplification event in these 95 oncogenes was > 80.0% in the ovarian serous cystadenocarcinoma cohort. In addition, in TCGA lung squamous cell carcinoma, there were 424 cases (85.0%) with at least one copy number change, and more than 80% of the lung squamous cell carcinomas involved gene amplifications. Copy number changes (both CNGs and CNLs) in > 60% of cases were identified across 16 cancer datasets from 8 different human cancer types: breast cancer, head and neck squamous cell carcinoma, lung cancer, bladder urothelial carcinoma, sarcoma, glioblastoma multiforme, stomach adenocarcinoma, and cervical squamous cell carcinoma and endocervical adenocarcinoma. In addition, there was only one TCGA breast cancer patient with over 70% of CNGs. Most of the cancer patients had CNGs, and deletion events affected only a small patient group. Therefore, the result shows the importance of these 95 oncogenes in cancer development, which function through a large number of copy number gains. In addition to the sample analysis, we explored the genomic alterations in multiple genes across several tumour cohorts (Fig. 3). We used the OncoPrint in cBioPortal to query alterations in these 95 oncogenes in the TCGA ovarian serous cystadenocarcinoma, TCGA oesophageal carcinoma and TCGA lung squamous cell carcinoma tumour samples. An OncoPrint is a graphical display of gene alterations across human tumour samples. We further identified the 10 genes with the highest amplification rate in each of the three cohorts (Fig. 3). For ovarian serous cystadenocarcinoma, there was a total of nine genes with > 20.0% alteration frequency. ECT2, FNDC3B, MECOM, PTP4A3 and SKIL showed the highest alteration frequency, with > 32.0% amplification. In the TCGA oesophageal carcinoma cohort, seven genes showed > 20.0% genetic alteration frequency, and most of these alterations were amplifications. FNDC3B, MECOM, SKIL and TP63 had the highest amplification frequencies, ranging from 26% to 100%. We observed similarly strong features in the lung squamous cell carcinoma dataset, in which a total of seven genes exceeded 40.0% alteration frequency. The genes with the highest frequency were SOX2 (48.0%), FNDC3B (45.0%), ECT2 (44.0%), MECOM (44.0%), SKIL (44.0%), TP63 (41.0%) and BCL6 (40.0%). Therefore, these 95 genes have been associated with gene amplification events across the three TCGA cohorts. We further identified seven genes (BCL6, ECT2, FNDC3B, MECOM, SKIL, SOX2 and TP63) with high amplification frequency compared to other genes across all the cancer types. These genes are all involved in basic cancer processes. However, some of the studies described below were conducted only in specific cancer types. Their common roles across cancer types deserve more investigation, which may be important for clinical application in multiple cancers. For example, BCL6

3.3. Three oncogenes with recurrent CNGs to induce the up-regulation in over 300 tumour samples from multiple cancers

By selecting the genes with high numbers of tumour samples (> 300 samples) showing over-expression, we further identified three oncogenic drivers at the pan-cancer level: ECT2, INTS8 and LSM1 (Fig. 4). These genes have been explored in only a limited number of cancer types, and their functions in multiple cancers may reveal common mechanisms of oncogenesis. For instance, the ECT2 gene, known as epithelial cell transforming sequence 2, was identified as a proto-oncogene capable of transforming NIH/3T3 fibroblasts [38]. ECT2 is overexpressed in various types of human tumours, such as lung, oesophageal and brain tumours [39–41], and is usually amplified as part of the 3q26 amplicon [42,43]. However, it has not been explored in other cancers. More surprisingly, INTS8 (integrator complex subunit 8) encodes a subunit of the integrator complex, which plays a role in the cleavage of small nuclear RNAs. However, it is not known whether there is any correlation between INTS8 and cancer development. There are few reports that INTS8 undergoes mutations in


Fig. 5. The network of the 7 oncogenes with highest amplification rates. The network represents the molecular function-based relationship between these 7 oncogenes and the linker genes in cancer development. The purple box indicates the oncogenes, while the blue box indicates the linker genes. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

oncogenes in different types of cancer. Both LSM1 and INTS8 gave a significant result in TCGA breast invasive carcinoma (P-value < 2.2E−16). ECT2 also gave a significant result in TCGA ovarian serous cystadenocarcinoma (P-value = 1.19E−07). The difference was statistically significant (P < 0.05) in all comparisons, which suggests that CNG triggers gene expression and may serve as one of the important factors in cancer progression. To explore the prognostic value of the expression of these three potential cross-cancer biomarkers, we used the four cancer types available in the Kaplan-Meier Plotter online platform (www.kmplot.com): breast, ovarian, lung and gastric cancer. We evaluated all three genes to examine their impact on recurrence-free survival (RFS) in patients with these four cancer types. The desired Affymetrix probe IDs were valid: 21890_at (INTS8), 203534_at (LSM1) and 219787_s_at (ECT2). Survival curves were plotted for all patients with breast (n = 1015; Fig. S2), ovarian (n = 1816; Fig. S3), lung (n = 2457; Fig. S4) and gastric (n = 1815; Fig. S5) cancer. When the patients were split into four groups according to cancer type, these three genes were associated with RFS. Interestingly, high expression of these three genes was statistically significant (P < 0.05) in all four cancer types in the KM Plotter analysis. Overall, these results may help to further validate the reliability and reproducibility of these three cross-cancer oncogenes with consistent CNGs and over-expression in > 300 TCGA samples, and may ultimately aid in assessing patients' risk profiles.
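The comparison of expression between samples with copy number gain (amplification and gain merged) and diploid samples can be illustrated with a short sketch. The text only states that a t-test was used, so the Welch (unequal variance) form below is an assumption, and the expression values are hypothetical rather than taken from the study. The sketch returns the test statistic and the Welch-Satterthwaite degrees of freedom; a P-value would additionally require the t distribution, for example through a statistics library.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(group_a, group_b):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom)."""
    n_a, n_b = len(group_a), len(group_b)
    # Squared standard error contributed by each group (sample variance / n).
    se2_a = variance(group_a) / n_a
    se2_b = variance(group_b) / n_b
    t_stat = (mean(group_a) - mean(group_b)) / sqrt(se2_a + se2_b)
    dof = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1)
    )
    return t_stat, dof


# Hypothetical log2 expression values, not taken from the study.
cng_expression = [8.1, 8.4, 7.9, 8.8, 8.6]      # amplification + gain
diploid_expression = [7.0, 7.3, 6.8, 7.1, 7.4]  # diploid

t_stat, dof = welch_t(cng_expression, diploid_expression)
print(f"t = {t_stat:.2f}, df = {dof:.1f}")
```

A positive statistic here means higher mean expression in the copy number gain group, which is the direction of concordance discussed in the text.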

peripheral T cell lymphoma when compared with non-cancerous samples from different patients [44]. Moreover, LSM1 is a member of the Sm/Lsm family of proteins, which are involved in mRNA degradation, and has also been identified as the cancer-associated Sm-like (CaSm) oncogene. Over-expression of LSM1 has been reported in human pancreatic and breast cancer, with amplification occurring at the 8p11–12 region in breast cancer patients [45]. To identify potentially overlooked cancer types for these three genes, the tumour dataset with the highest amplification frequency and the most significant overall survival value was selected for each type of cancer. The survival analysis in cBioPortal assesses survival differences between altered and non-altered patient sets by calling patients with either up-regulation or down-regulation “altered”, and those with a z-score between −2 and 2 “non-altered”. CNGs in the oncogene ECT2 accounted for > 20.0% of cases in ovarian serous cystadenocarcinoma. Another oncogene, INTS8, was amplified in 130 patients (16.2%) in a breast invasive carcinoma dataset. The frequency of gain-of-function in LSM1 was higher in breast invasive carcinoma patients (12.7%, 104 cases) than in prostate and head and neck cancer (6.2%, 32 cases; 8.2%, 23 cases). The frequency of oncogenes with CNGs and over-expression across the tumour samples suggests that this could be a common mechanism in cancer development. Furthermore, we showed that all these genes were consistently over-expressed in the tumour samples with CNGs (see Additional file 3). Overall survival was compared between tumour samples with and without alterations in each of the oncogenes, using the dataset that contained the highest number of tumour samples. For example, breast invasive carcinoma patients with genetic alterations in LSM1 showed a significant difference in overall survival (adjusted P-value 1.65E−07; Fig. 4F).
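The z-score rule described above, together with the notion of concordant CNG and up-regulation, can be sketched as follows. This is a minimal illustration under stated assumptions: the reference set used to compute the z-scores, the sample identifiers and all values are hypothetical, and the way cBioPortal derives its z-scores is not reproduced here. Only the ±2 cut-off comes from the text.

```python
from statistics import mean, stdev

Z_THRESHOLD = 2.0  # cut-off quoted in the text for "altered" expression


def z_scores(values, reference):
    """Z-score each value against the mean and SD of a reference set."""
    mu, sd = mean(reference), stdev(reference)
    return [(v - mu) / sd for v in values]


def classify(z):
    """Label a sample by its expression z-score."""
    if z > Z_THRESHOLD:
        return "up"
    if z < -Z_THRESHOLD:
        return "down"
    return "non-altered"


# Hypothetical data: expression values and copy-number calls per sample.
reference_expr = [5.0, 5.2, 4.8, 5.1, 4.9]  # assumed reference samples
tumour_expr = {"S1": 6.4, "S2": 5.0, "S3": 5.9, "S4": 4.1}
copy_number = {"S1": "gain", "S2": "diploid", "S3": "gain", "S4": "loss"}

zs = dict(zip(tumour_expr, z_scores(list(tumour_expr.values()), reference_expr)))
labels = {sample: classify(z) for sample, z in zs.items()}

# A sample is concordant when it carries a gain and is also up-regulated.
concordant = [s for s in labels
              if copy_number[s] == "gain" and labels[s] == "up"]
print(labels)
print("CNG with up-regulation:", concordant)
```

Counting the concordant samples per gene across cohorts gives the kind of tally used to rank oncogenes by the number of tumour samples with both CNG and over-expression.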
The median survival for breast cancer patients with genetic alterations in INTS8 and LSM1 was 113.70 and 120.53 months, respectively, while that of patients without genetic alterations in either gene was 146.39 months. However, the median survival for ovarian cancer patients with genetic alterations in ECT2 was 48.29 months, while that of patients without genetic alterations was 42.02 months. We observed that TCGA breast cancer patients with genetic alterations in INTS8 and LSM1 had significantly better overall survival compared with TCGA ovarian cancer patients with genetic alterations in ECT2. The expression levels of LSM1, INTS8 and ECT2 mRNA were compared between the two groups of samples. The expression data were downloaded from cBioPortal (Fig. 4) and statistically analysed using a t-test that compared the merged copy number gain group (amplification and gain) with the diploid group. A P-value of < 0.05 indicated that the difference was statistically significant. We conducted a t-test analysis for the three

3.4. Interconnectivity of oncogenes with high frequency of gene amplification and up-regulation

To further explore the functions of the oncogenes with frequent CNGs, we applied network analysis and constructed a gene network to identify the global connections of the genes of interest. Specifically, we focused on the seven genes (BCL6, ECT2, FNDC3B, MECOM, SKIL, SOX2 and TP63) identified by our expression analyses as oncogenes with high-frequency CNGs and consistent gene up-regulation in multiple cancers. The derived network (Fig. 5) comprised six of the seven core genes (the sole core gene not connected with other genes in the network is ECT2) and another 20 linker genes that were shown to connect these genes. By focusing on genes with the highest number of interactions, we found seven genes with five or more connections: MECOM (11 connections), C4BPB (9), SKIL (8), GPR20 and TLN2 (each with 6 connections), and RERGL and FNDC3B (each with 5 connections). Interestingly, four of them are linker genes (C4BPB, GPR20, TLN2 and RERGL), which may be significant for cancer progression due to their high connectivity with the recurrent oncogenes.
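Connection counts of this kind can be obtained by tallying node degrees from an edge list. The sketch below is illustrative only: the edges are hypothetical and do not reproduce the network in Fig. 5, and the function names are introduced here for the example.

```python
from collections import Counter

# Hypothetical edge list; the real edges come from the network in Fig. 5.
edges = [
    ("MECOM", "C4BPB"),
    ("MECOM", "SKIL"),
    ("MECOM", "TLN2"),
    ("SKIL", "C4BPB"),
    ("FNDC3B", "TLN2"),
]


def degree_table(edge_list):
    """Count distinct neighbours (connections) for every node."""
    neighbours = {}
    for a, b in edge_list:
        if a == b:
            continue  # ignore self-loops
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    return Counter({node: len(adj) for node, adj in neighbours.items()})


def hubs(edge_list, min_connections):
    """Return (node, degree) pairs meeting the threshold, highest first."""
    ranked = degree_table(edge_list).most_common()
    return [(node, deg) for node, deg in ranked if deg >= min_connections]


print(hubs(edges, min_connections=2))
```

Using sets of neighbours means that duplicated edges are counted once, so the degree reflects distinct interaction partners rather than the number of supporting edges.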


Fig. 6. A pan-cancer view of copy number variation (CNV) features based on 7 linker genes potentially induced by up-regulated gene expression and copy number gains (CNGs).

counterparts, lncRNA genes have fewer exons, show lower selectivity in the evolutionary process and are expressed at lower abundance. In addition, the expression of lncRNAs is highly specific to tumour or tissue type [47]. The roles of non-coding oncogenes in cancer progression could be determined through additional analysis of large-scale oncogene data and the respective gene expression changes. The limited sample size in some cancer types may cause many CNVs with lower frequencies to be discarded. In addition, signals outside the predesigned probes may be lost, as TCGA largely depends on CGH arrays comparing normal and cancer samples to distinguish different types of CNVs. This results in limited sample sizes and indicates the presence of many undiscovered structural variants in cancer development. From the OncoPrint analysis of the 95 oncogenes, we observed seven oncogenes with the highest amplification in TCGA ovarian serous cystadenocarcinoma, TCGA oesophageal carcinoma and TCGA lung squamous cell carcinoma. The results indicate that these seven oncogenes (BCL6, ECT2, FNDC3B, MECOM, SKIL, SOX2 and TP63) are likely to be important target genes for cancer therapies and may also be associated with patient survival. Further experimental analysis and validation may give insight into the potential molecular mechanisms underlying copy number gain and recurrent over-expression.

To further explore the potential function of the 20 linker genes, which were identified using GeneMANIA, we conducted an enrichment analysis using ToppFun. The results show that these linker genes are enriched in cell-cell junction (adjusted P-value = 5.821E−5), bicellular tight junction (adjusted P-value = 2.225E−4) and occluding junction (adjusted P-value = 2.400E−4) terms. The genes involved in cell-cell junction are F11R, TLN2, CLDN1, LIN7A and ABCB1. A few studies have reported that deregulation of the cell junction plays an important role in prostate, lung and ovarian cancer [46]. For example, signalling between cells and the maintenance of epithelial tissue integrity are regulated at the cell junction sites of intercellular adhesion. The expression of cell junction components is associated with the epithelial-mesenchymal transition (EMT), which is one of the biological processes involved in embryogenesis. Dysregulation of cell junction adhesion affects the EMT and subsequently triggers oncogenesis in cancer development. The additional mutational analysis of these 20 linker genes (Fig. 6) shows a high amplification frequency across the tumour samples. Cases with > 50% genetic alterations, including both CNLs and CNGs, were identified in eight cancer datasets from six cancer types: ovarian serous cystadenocarcinoma, pancreatic cancer, lung squamous cell carcinoma, oesophageal carcinoma, neuroendocrine prostate carcinoma and breast cancer. For example, the TCGA ovarian serous cystadenocarcinoma patients had > 70% genetic alteration in CNGs. Overall, most patients in the cancer cohorts (> 50%) had CNGs rather than CNLs. This implies that these linker genes also play a significant role in cancer progression through a high frequency of copy number gains.
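The idea behind an adjusted enrichment P-value can be illustrated with a generic over-representation test. The sketch below assumes a hypergeometric test followed by Benjamini-Hochberg correction; the text does not state which test or correction produced the reported values, so this is not a reconstruction of the ToppFun computation. The background size, term sizes and the second term are hypothetical.

```python
from math import comb


def hypergeom_pvalue(k, n, K, N):
    """P(X >= k): chance of k or more term genes in a list of n genes,
    when K of the N background genes carry the term."""
    upper = min(n, K)
    total = comb(N, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / total


def benjamini_hochberg(pvalues):
    """Return BH-adjusted P-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down so monotonicity can be enforced.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical counts: (list genes with the term, background genes with the term).
N_BACKGROUND, N_LIST = 20000, 20
terms = {"cell-cell junction": (5, 400), "hypothetical term B": (1, 900)}

raw = [hypergeom_pvalue(k, N_LIST, K, N_BACKGROUND) for k, K in terms.values()]
for name, p_adj in zip(terms, benjamini_hochberg(raw)):
    print(f"{name}: adjusted P = {p_adj:.3g}")
```

Exact integer binomial coefficients are used so that the tail probability stays accurate even for a large background; the correction step matters because many terms are tested at once.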
Oncogenes are mutated genes that have the potential to induce cancer because they are often expressed at high levels, which results in uncontrolled cell growth. As oncogenes are known as “gain of function” genes in cancer progression, we concentrated on the consistent pattern of CNG and gene up-regulation in this study. Intriguingly, a large number of CNGs were correlated with up-regulation of the cellular levels of oncogenes. The cancer genomic datasets were primarily based on the public TCGA data. One of the limitations of this study is that we only integrated protein-coding genes and not non-coding genes such as long non-coding RNAs (lncRNAs). Compared to their protein-coding

4. Conclusions

The huge amount of cancer genomics data can now be applied as a basic molecular genetic model to assist in cancer prognosis and to discover the underlying molecular mechanisms, including initiation and progression, in cancer development. The results of the pan-cancer analysis of copy number variations in human oncogenes presented here will allow us to focus on identifying the significance of oncogenes in cancer progression. The gain of copy number in oncogenes may promote the gene expression changes involved in oncogenesis. There is strong evidence that oncogenes with recurrent copy number gain play a major role in cancer-related pathways. Moreover, the copy number gain in several oncogenes may be associated with patient survival.

Supplementary data to this article can be found online at https://doi.org/10.1016/j.lfs.2018.09.032.


Acknowledgments

This work was supported by the research start-up fellowship of University of Sunshine Coast to MZ. We would like to express our gratitude to Prof. Richard Burns for review and comments on this manuscript.

References

[1] L. Torre, R. Siegel, D.A. Jermal, Global Cancer Facts & Figures, 3rd edition, 3 American Cancer Society, 2015, p. 1.
[2] G. Cooper, The Cell: A Molecular Approach, 2nd ed., Sinauer Associates, 2000.
[3] C. Greenman, P. Stephens, R. Smith, G.L. Dalgliesh, C. Hunter, G. Bignell, et al., Patterns of somatic mutation in human cancer genomes, Nature 446 (2007) 153–158.
[4] M. Zhao, P. Kim, R. Mitra, J. Zhao, Z. Zhao, TSGene 2.0: an updated literature-based knowledgebase for tumor suppressor genes, Nucleic Acids Res. 44 (2016) D1023–D1031.
[5] S.Y. Luo, D.C. Lam, Oncogenic driver mutations in lung cancer, Transl. Respir. Med. 1 (2013) 6.
[6] M. Zhao, J. Sun, Z. Zhao, Distinct and competitive regulatory patterns of tumor suppressor genes and oncogenes in ovarian cancer, PLoS One 7 (2012) e44175.
[7] K.K. Wong, R.J. Deleeuw, N.S. Dosanjh, L.R. Kimm, Z. Cheng, D.E. Horsman, et al., A comprehensive analysis of common copy-number variations in the human genome, Am. J. Hum. Genet. 80 (2007) 91–104.
[8] X. Zheng-Bradley, P. Flicek, Applications of the 1000 genomes project resources, Brief Funct. Genomics 12 (2016).
[9] N. Mahdieh, B. Rabbani, An overview of mutation detection methods in genetic disorders, Iran. J. Pediatr. 23 (2013) 375–388.
[10] I. Ionita-Laza, A.J. Rogers, C. Lange, B.A. Raby, C. Lee, Genetic association analysis of copy-number variation (CNV) in human disease pathogenesis, Genomics 93 (2009) 22–26.
[11] K. Ohshima, K. Hatakeyama, T. Nagashima, Y. Watanabe, K. Kanto, Y. Doi, et al., Integrated analysis of gene expression and copy number identified potential cancer driver genes with amplification-dependent overexpression in 1,454 solid tumors, Sci. Rep. 7 (2017) 641.
[12] J. Ding, M.K. McConechy, H.M. Horlings, G. Ha, F. Chun Chan, T. Funnell, et al., Systematic analysis of somatic mutations impacting gene expression in 12 tumour types, Nat. Commun. 6 (2015) 8554.
[13] T.P. Lu, L.C. Lai, M.H. Tsai, P.C. Chen, C.P. Hsu, J.M. Lee, et al., Integrated analyses of copy number variations and gene expression in lung adenocarcinoma, PLoS One 6 (2011) e24829.
[14] R. Wei, M. Zhao, C.H. Zheng, M. Zhao, J. Xia, Concordance between somatic copy number loss and down-regulated expression: a pan-cancer study of cancer predisposition genes, Sci. Rep. 6 (2016) 37358.
[15] Y. Liu, J. Sun, M. Zhao, ONGene: a literature-based database for human oncogenes, J. Genet. Genomics 44 (2017) 119–121.
[16] T.I. Zack, S.E. Schumacher, S.L. Carter, A.D. Cherniack, G. Saksena, B. Tabak, et al., Pan-cancer patterns of somatic copy number alteration, Nat. Genet. 45 (2013) 1134–1140.
[17] S.A. Forbes, D. Beare, P. Gunasekaran, K. Leung, N. Bindal, H. Boutselakis, et al., COSMIC: exploring the world's knowledge of somatic mutations in human cancer, Nucleic Acids Res. 43 (2015) D805–D811.
[18] Cancer Genome Atlas Research, J.N. Weinstein, E.A. Collisson, G.B. Mills, K.R. Shaw, B.A. Ozenberger, et al., The cancer genome atlas pan-cancer analysis project, Nat. Genet. 45 (2013) 1113–1120.
[19] E. Bragin, E.A. Chatzimichali, C.F. Wright, M.E. Hurles, H.V. Firth, A.P. Bevan, et al., DECIPHER: database for the interpretation of phenotype-linked plausibly pathogenic sequence and copy-number variation, Nucleic Acids Res. 42 (2014) D993–D1000.
[20] B. Li, C.N. Dewey, RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome, BMC Bioinf. 12 (2011) 323.
[21] H. Tipney, L. Hunter, An introduction to effective use of enrichment analysis software, Hum. Genomics 4 (2010) 202–206.
[22] J. Chen, E.E. Bardes, B.J. Aronow, A.G. Jegga, ToppGene suite for gene list enrichment analysis and candidate gene prioritization, Nucleic Acids Res. 37 (2009) W305–W311.
[23] F. Supek, M. Bosnjak, N. Skunca, T. Smuc, REVIGO summarizes and visualizes long lists of gene ontology terms, PLoS One 6 (2011) e21800.
[24] M. Hajjari, A. Salavaty, HOTAIR: an oncogenic long non-coding RNA in different cancers, Cancer Biol. Med. 12 (2015) 1–9.
[25] F. Pane, M. Intrieri, C. Quintarelli, B. Izzo, G.C. Muccioli, F. Salvatore, BCR/ABL genes and leukemic phenotype: from molecular mechanisms to clinical correlations, Oncogene 21 (2002) 8652–8667.
[26] A. Virgili, E.P. Nacheva, Genomic amplification of BCR/ABL1 and a region downstream of ABL1 in chronic myeloid leukaemia: a FISH mapping study of CML patients and cell lines, Mol. Cytogenet. 3 (2010) 15.
[27] B. Barre, A. Vigneron, N. Perkins, I.B. Roninson, E. Gamelin, O. Coqueret, The STAT3 oncogene as a predictive marker of drug resistance, Trends Mol. Med. 13 (2007) 4–11.
[28] R.E. Willis, Targeted cancer therapy: vital oncogenes and a new molecular genetic paradigm for cancer initiation progression and treatment, Int. J. Mol. Sci. 17 (2016).
[29] A.S. Dhillon, S. Hagan, O. Rath, W. Kolch, MAP kinase signalling pathways in cancer, Oncogene 26 (2007) 3279–3290.
[30] E.K. Kim, E.J. Choi, Pathological roles of MAPK signaling pathways in human diseases, Biochim. Biophys. Acta 1802 (2010) 396–405.
[31] M. Burotto, V.L. Chiou, J.M. Lee, E.C. Kohn, The MAPK pathway across different malignancies: a new perspective, Cancer 120 (2014) 3446–3456.
[32] A. Gallavotti, Q. Zhao, J. Kyozuka, R.B. Meeley, M.K. Ritter, J.F. Doebley, et al., The role of barren stalk1 in the architecture of maize, Nature 432 (2004) 630–635.
[33] C. Cai, M. Rajaram, X. Zhou, Q. Liu, J. Marchica, J. Li, et al., Activation of multiple cancer pathways and tumor maintenance function of the 3q amplified oncogene FNDC3B, Cell Cycle 11 (2012) 1773–1781.
[34] A. Sayadi, J. Jeyakani, S.H. Seet, C.L. Wei, G. Bourque, F.A. Bard, et al., Functional features of EVI1 and EVI1Delta324 isoforms of MECOM gene in genome-wide transcription regulation and oncogenicity, Oncogene 35 (2016) 2311–2321.
[35] A.C. Tecalco-Cruz, M. Sosa-Garrocho, G. Vazquez-Victorio, L. Ortiz-Garcia, E. Dominguez-Huttinger, M. Macias-Silva, Transforming growth factor-beta/SMAD target gene SKIL is negatively regulated by the transcriptional cofactor complex SNON-SMAD4, J. Biol. Chem. 287 (2012) 26764–26776.
[36] D. Hagerstrand, A. Tong, S.E. Schumacher, N. Ilic, R.R. Shen, H.W. Cheung, et al., Systematic interrogation of 3q26 identifies TLOC1 and SKIL as cancer drivers, Cancer Discov. 3 (2013) 1044–1057.
[37] K. Weina, J. Utikal, SOX2 and cancer: current research and its implications in the clinic, Clin. Transl. Med. 3 (2014) 19.
[38] T. Miki, C. Smith, J. Long, A. Eva, T. Fleming, Oncogene ect2 is related to regulators of small GTP-binding proteins, Nature (1993).
[39] D. Hirata, T. Yamabuki, D. Miki, T. Ito, E. Tsuchiya, M. Fujita, et al., Involvement of epithelial cell transforming sequence-2 oncoantigen in lung and esophageal cancer progression, Clin. Cancer Res. 15 (2009) 256–266.
[40] B. Salhia, N.L. Tran, A. Chan, A. Wolf, M. Nakada, F. Rutka, et al., The guanine nucleotide exchange factors trio, Ect2, and Vav3 mediate the invasive behavior of glioblastoma, Am. J. Pathol. 173 (2008) 1828–1838.
[41] M. Sano, N. Genkai, N. Yajima, N. Tsuchiya, J. Homma, R. Tanaka, et al., Expression level of ECT2 proto-oncogene correlates with prognosis in glioma patients, Oncol. Rep. 16 (2006) 1093–1098.
[42] A.M. Eder, X. Sui, D.G. Rosen, L.K. Nolden, K.W. Cheng, J.P. Lahad, et al., Atypical PKCiota contributes to poor prognosis through loss of apical-basal polarity and cyclin E overexpression in ovarian cancer, Proc. Natl. Acad. Sci. U. S. A. 102 (2005) 12519–12524.
[43] Y. Han, F. Wei, X. Xu, Y. Cai, B. Chen, J. Wang, et al., Establishment and comparative genomic hybridization analysis of human esophageal carcinomas cell line EC9706, Zhonghua Yi Xue Yi Chuan Xue Za Zhi 19 (2002) 455–457.
[44] F. Yin, L. Shu, X. Liu, T. Li, T. Peng, Y. Nan, et al., Microarray-based identification of genes associated with cancer progression and prognosis in hepatocellular carcinoma, J. Exp. Clin. Cancer Res. 35 (2016) 127.
[45] K.L. Streicher, Z.Q. Yang, S. Draghici, S.P. Ethier, Transforming function of the LSM1 oncogene in human breast cancers with the 8p11–12 amplicon, Oncogene 26 (2007) 2104–2114.
[46] A.J. Knights, A.P. Funnell, M. Crossley, R.C. Pearson, Holding tight: cell junctions and cancer spread, Trends Cancer Res. 8 (2012) 61–69.
[47] Y. Liu, M. Zhao, lnCaNet: pan-cancer co-expression network for human lncRNA and cancer genes, Bioinformatics 32 (2016) 1595–1597.
