Protein coding gene CRNKL1 as a potential prognostic biomarker in esophageal adenocarcinoma



Artificial Intelligence in Medicine 76 (2017) 1–6

Contents lists available at ScienceDirect

Artificial Intelligence in Medicine journal homepage: www.elsevier.com/locate/aiim

Zhen Li a, Qianlan Yao a, Songjian Zhao a, Zhen Wang b,∗, Yixue Li a,b,∗

a School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, 800 Dong Chuan Road, Shanghai, China
b Key Lab of Computational Biology, CAS-MPG Partner Institute for Computational Biology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, China

Article info

Article history: Received 12 December 2016; received in revised form 12 January 2017; accepted 19 January 2017

Keywords: Esophageal adenocarcinoma (EAC); Prognosis; Biomarker; CRNKL1

Abstract

Background: Esophageal adenocarcinoma (EAC) is one of the most aggressive gastroesophageal cancers. PTGS2, EGFR, ERBB2 and TP53 are the traditional EAC prognostic biomarkers, but their ability to predict overall survival remains limited.
Objectives: To identify an improved biomarker for predicting the prognosis of EAC from the expression profile.
Materials and methods: Differential co-expression analysis and differential expression analysis were performed to identify EAC-related genes. 5-fold cross-validation was used to select a prognostic biomarker from the 532 EAC-related genes.
Results: CRNKL1 was identified as a prognostic biomarker for predicting the survival of EAC patients. It significantly stratified EAC patients into high-risk and low-risk groups and performed much better than the traditional biomarkers. Furthermore, ROC curve analysis showed that CRNKL1 had the highest area under the curve (AUC), reaching a sensitivity of 83.33% and a specificity of 78.57%.
Conclusions: Our results suggest that CRNKL1 might be a novel prognostic biomarker with better predictive ability than the traditional biomarkers, which may offer a new opportunity for clinical application in EAC.

© 2017 Elsevier B.V. All rights reserved.

1. Introduction

Esophageal adenocarcinoma (EAC) is a subtype of esophageal cancer that is associated with overweight, obesity, and chronic gastroesophageal reflux disease [1–3]. In recent decades, the incidence of EAC has increased faster than that of other malignancies in western countries [4]; only 16% of patients survive more than 5 years after diagnosis, and the median survival time is less than 1 year [5]. Despite recent advances in prognosis and therapy, survival rates and survival times have not improved substantially. EAC is usually detected at a middle or late stage, leading to poor survival [6,7]. Because the onset of EAC is insidious, patients generally miss the best opportunity for treatment; if the disease is caught in time, it can be treated with relative ease. Therefore, it is necessary to develop effective biomarkers for predicting the prognosis of EAC patients, which could provide a survival benefit.

A biological marker indicates the condition of a disease, whether normal or abnormal, and should be sensitive, specific and cost-effective. In current clinical practice, PTGS2, EGFR, ERBB2 and TP53 are considered the optimal prognostic biomarkers for EAC [7–11]. COX-2, an inducible enzyme, is encoded by the PTGS2 gene, and its expression affects the survival outcomes of EAC patients [12,13]. EGFR and ERBB2 are members of the EGFR family and are involved in at least 3 different oncogenic pathways. Increased EGFR expression is correlated with higher tumor stages and worse overall survival [10,14]. HER-2 is encoded by the ERBB2 gene; its overexpression and amplification have both been associated with poor cancer-specific survival [15,16]. TP53 is a typical tumor suppressor gene and its overexpression is commonly observed in adenocarcinoma of the esophagus, but its prognostic value appears limited [17,18]. However, these biomarkers improve the prediction of EAC outcome only to a limited extent, with low specificity and restricted sensitivity.

The aim of this study was to investigate the expression pattern of EAC genes and to find a prognostic biomarker that performs better than the well-studied biomarkers. Differential co-expression analysis and differential expression analysis were used to identify the EAC-related genes. 5-fold cross-validation was performed to select a potential prognostic biomarker. In both the training and testing datasets, the prognostic biomarker could divide EAC patients into two groups with markedly different outcomes. Compared with traditional EAC biomarkers, the new one achieved higher sensitivity and specificity. These findings suggest that our study identified a marker with better prognostic power in EAC.

∗ Corresponding authors at: Key Lab of Computational Biology, CAS-MPG Partner Institute for Computational Biology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, China. E-mail addresses: [email protected] (Z. Wang), [email protected] (Y. Li).
http://dx.doi.org/10.1016/j.artmed.2017.01.002
0933-3657/© 2017 Elsevier B.V. All rights reserved.

Fig. 1. The biological functions of DCGs. The cutoff of the functional enrichment analysis for DCGs was a Benjamini-adjusted p-value < 0.01.

2. Materials and methods

2.1. Dataset

The normalized gene expression dataset of EAC was downloaded from the UCSC Cancer Genomics Browser (https://genome-cancer.ucsc.edu/proj/site/hgHeatmap/) [19]. The cancer datasets of the UCSC Cancer Genomics Browser are derived from The Cancer Genome Atlas (TCGA) database. The EAC dataset comprised 87 EAC and 10 normal tissue samples, covering 19063 genes (mRNAs). After removing the genes with more than 80% missing values across the 97 samples, 16458 genes remained for analysis. We randomly selected 55 EAC samples as the training dataset and used the remaining 32 EAC samples as the testing dataset.

2.2. EAC related genes

Differential co-expression analysis of the expression profile of the training dataset was conducted in the R environment (V3.2.3) using

Fig. 2. An integrative pipeline for identification of a prognostic biomarker. (A) Part of the differential co-expression network. The red ellipses are DCGs, the blue ellipses are the important DCGs regarded as candidate EAC-related genes, and the yellow ellipses are the remaining genes. (B) Differential expression analysis was performed on the expression profile to identify differentially expressed genes among the important DCGs (Benjamini-adjusted p-value < 0.01). (C) 5-fold cross-validation was used to identify a biomarker for predicting the survival of EAC patients. HR analysis was used to calculate the prognostic accuracies of the candidates. The gene with the highest prognostic accuracy is shown in the plot. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)


Fig. 3. CRNKL1 predicts the clinical outcomes of EAC patients. Kaplan-Meier survival curves showed CRNKL1 was able to distinguish EAC patients with different clinical outcomes in (A) the training dataset, and (B) the testing dataset. The survival days are shown along the x-axis and overall survival rates are shown along the y-axis.

DCGL package (V2.0) [20–22]. In DCGL, the differential co-expression profile (DCp) method identifies differentially co-expressed genes (DCGs); in simulation studies it proved superior to other popular methods because it exploits the quantitative co-expression change of each gene pair in the co-expression network. The differential co-expression enrichment (DCe) method identifies both DCGs and differentially co-expressed links (DCLs). First, DCLs are identified by estimating the degree of change in correlation between the two conditions for each gene pair. Then, a gene is called a DCG if it is surrounded by significantly more DCLs than other genes. Finally, DCe reports DCLs and DCGs by estimating the degree of enrichment of DCLs. In our study, genes were filtered by expression variance with default options, which preserved a total of 9482 genes. DCp and DCe analysis identified 3790 DCGs involved in 1478300 DCLs.

GO term enrichment analysis was performed with the Database for Annotation, Visualization and Integrated Discovery (DAVID) software [23,24] to investigate the functional roles of DCGs in the development of EAC (Benjamini-adjusted p-value < 0.01). Gene enrichment analysis, based on 1601 human drug targets from DrugBank [25] and 572 cancer genes from the Cancer Gene Census [26], was performed to verify the association of DCGs with the disease.

The 3790 DCGs were taken as the initial EAC-related gene set. Because the number of DCLs and the p-value are two critical screening indexes that indicate the importance of a DCG, we used them to pick out the important DCGs:

1. The number of DCLs in which a DCG is involved must be greater than the average number of DCLs over all DCGs;
2. The Benjamini-adjusted p-value of the DCG must be less than 0.01.

According to these two criteria, 1595 important DCGs were obtained.
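The link-counting idea behind the first selection criterion can be sketched in a few lines of Python. This is not the DCGL implementation of DCp or DCe: the sketch simply flags a gene pair as a differential link when its Pearson correlation changes by at least a hypothetical cutoff (`min_change`) between two conditions, and then keeps the genes that take part in more links than average. All function names, the cutoff and the toy data are assumptions made for illustration.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def differential_links(normal, tumor, min_change=1.0):
    """Gene pairs whose correlation differs by at least min_change.

    min_change is a hypothetical cutoff; DCGL uses its own statistics.
    """
    links = []
    for g1, g2 in combinations(sorted(normal), 2):
        change = abs(pearson(normal[g1], normal[g2]) - pearson(tumor[g1], tumor[g2]))
        if change >= min_change:
            links.append((g1, g2))
    return links


def genes_above_mean_links(links):
    """Genes involved in more links than the average linked gene."""
    counts = {}
    for g1, g2 in links:
        counts[g1] = counts.get(g1, 0) + 1
        counts[g2] = counts.get(g2, 0) + 1
    mean = sum(counts.values()) / len(counts)
    return sorted(g for g, c in counts.items() if c > mean)


# Toy expression profiles (made up): gene A reverses its behaviour in tumor.
normal = {"A": [1, 2, 3, 4], "B": [2, 4, 6, 8], "C": [1, 3, 2, 4], "D": [4, 3, 2, 1]}
tumor = {"A": [4, 3, 2, 1], "B": [2, 4, 6, 8], "C": [1, 3, 2, 4], "D": [4, 3, 2, 1]}

links = differential_links(normal, tumor)
hubs = genes_above_mean_links(links)
print(links, hubs)
```

In this toy case only gene A changes its correlation with the other genes, so it is the only gene whose link count exceeds the average.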
To further select EAC-related genes, we considered differential expression between normal and tumor tissues (t-test, Benjamini-adjusted p-value < 0.01), and identified 532 genes closely related to the progression of EAC.

2.3. Prognostic biomarker selection

5-fold cross-validation was performed to identify a valuable biomarker from the 532 EAC-related genes. Hazard ratio (HR) analysis was performed to calculate the prognostic accuracy of each gene.

The HR of each gene was evaluated using the survcomp package in R, and a univariate Cox regression model was used to analyze the relationship between gene expression and survival time. Genes were ranked by p-value, and the 5 genes with the most significant p-values were selected as prognostic candidates. The optimal one was the gene with the most significant average p-value across the 5-fold cross-validation.

2.4. Statistical analysis

High-risk and low-risk groups of EAC patients were formed by the K-means algorithm (K = 2) [27–29] based on the expression profile of the identified prognostic biomarker. Kaplan-Meier survival analysis was performed for the two groups, and statistical significance was assessed with the log-rank test using the R survival package [30]. Receiver operating characteristic (ROC) curve analysis was used for validation in the testing dataset and in the whole dataset, and the areas under the curve (AUC) were compared in the R environment (V3.2.3) using the ROCR package. An AUC of 0.5 was used as the threshold for predictive power: a larger AUC indicates better predictive ability, whereas an AUC equal to or smaller than 0.5 indicates no predictive ability.

3. Results

3.1. EAC related genes

The training dataset used for differential co-expression analysis included a total of 16458 genes across 55 EAC and 10 normal tissue samples. After filtering by expression variance, 9482 genes were preserved. As a result, 3790 DCGs involved in 1478300 DCLs were obtained for further analysis. Functional enrichment analysis showed that the DCGs were mainly involved in immune cell activation and regulation of cell death (Fig. 1). In addition, the DCGs were significantly enriched in drug targets (p-value = 9.85E-6), but not in cancer genes (p-value = 0.34) (Fig. S1). This suggested that these DCGs might play a role in EAC tumorigenesis.
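The text does not state which test produced the enrichment p-values reported above. One common choice for this kind of gene-set overlap is a one-sided hypergeometric test, sketched here in Python with made-up counts. The function name and the numbers are assumptions for illustration, not values or methods taken from the study.

```python
from math import comb


def hypergeom_enrichment_p(population, annotated, selected, overlap):
    """P(X >= overlap) when `selected` genes are drawn without replacement
    from `population` genes, of which `annotated` belong to the gene set."""
    upper = min(annotated, selected)
    total = comb(population, selected)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, selected - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Toy example (hypothetical counts): 10 genes in total, 4 of them annotated,
# 3 genes selected, 2 of the selected genes annotated.
p_value = hypergeom_enrichment_p(10, 4, 3, 2)
print(f"enrichment p-value = {p_value:.4f}")
```

A small p-value means the overlap is larger than expected by chance, which is how an enrichment such as the drug-target result would be read under this assumed test.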
Functional enrichment analysis and gene enrichment analysis suggested that the 3790 DCGs might be functional genes in the development of EAC. Based on the number of DCLs and the significance of the p-value of the DCGs, 1595 important DCGs were selected from


all DCGs (Fig. 2A). Next, differential expression analysis identified 532 differentially expressed genes among the 1595 DCGs (Fig. 2B). These 532 genes were the final EAC-related genes used as prognostic candidates.

3.2. Identification of a potential prognostic biomarker

Many protein-coding genes are well-known cancer-associated genes and are widely used as prognostic biomarkers. The 532 EAC-related genes were both differentially co-expressed and differentially expressed, and served as candidates for biomarker screening. In the 5-fold cross-validation, the gene with the most significant p-value was taken as the target. CRNKL1 was identified as a potential prognostic biomarker of EAC (Fig. 2C). Regression analysis showed that CRNKL1 could be an independent predictor for EAC patients (HR = 4.027, 95% CI 1.6187–10.0181, p = 2.53E-3 in the training dataset; HR = 4.4738, 95% CI 1.6452–12.1651, p = 2.92E-3 in the testing dataset).

3.3. Predictive ability of CRNKL1

Using CRNKL1 expression, the 55 EAC patients of the training dataset were classified by the K-means algorithm into two groups whose survival differed significantly in Kaplan-Meier analysis (log-rank test, p-value = 1.37E-2) (Fig. 3A). Of these, 24 patients formed the high-risk group (median overall survival 221.5 days, <42% of patients alive) and 31 patients formed the low-risk group (median overall survival 408 days, >77% of patients alive). To validate the generalizability of CRNKL1, we tested its predictive ability in the testing dataset of 32 EAC patients. It again significantly stratified the patients into two groups (log-rank test, p-value = 2.61E-2) (Fig. 3B): 16 patients (median overall survival 406 days, <25% of patients alive) were classified into the high-risk group and 16 patients (median overall survival 485.5 days, >62% of patients alive) into the low-risk group.
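The grouping and testing steps described here (K-means with K = 2 on a single gene's expression, followed by a log-rank test) can be illustrated with a minimal pure-Python sketch. It is not the R code used in the study: the function names, the choice of initial cluster centers and the toy data are assumptions, and the p-value uses the chi-square approximation with one degree of freedom.

```python
from math import erfc, sqrt


def two_means(values, max_iter=100):
    """1-D K-means with K = 2; label 0 = lower cluster, 1 = upper cluster."""
    low, high = min(values), max(values)  # assumed initial centers
    labels = None
    for _ in range(max_iter):
        new_labels = [1 if abs(v - high) < abs(v - low) else 0 for v in values]
        if new_labels == labels:
            break
        labels = new_labels
        lower = [v for v, lab in zip(values, labels) if lab == 0]
        upper = [v for v, lab in zip(values, labels) if lab == 1]
        low = sum(lower) / len(lower)
        if upper:
            high = sum(upper) / len(upper)
    return labels


def logrank_p(times_a, events_a, times_b, events_b):
    """Two-group log-rank test; events are 1 (death) or 0 (censored)."""
    data = [(t, e, 0) for t, e in zip(times_a, events_a)]
    data += [(t, e, 1) for t, e in zip(times_b, events_b)]
    observed_a = expected_a = variance = 0.0
    for t in sorted({ti for ti, e, _ in data if e}):
        n = sum(1 for ti, _, _ in data if ti >= t)              # at risk overall
        n_a = sum(1 for ti, _, g in data if ti >= t and g == 0)  # at risk in group a
        d = sum(1 for ti, e, _ in data if ti == t and e)         # deaths at t
        d_a = sum(1 for ti, e, g in data if ti == t and e and g == 0)
        observed_a += d_a
        expected_a += d * n_a / n
        if n > 1:
            variance += d * (n_a / n) * (1 - n_a / n) * (n - d) / (n - 1)
    if variance == 0:
        return 1.0
    chi2 = (observed_a - expected_a) ** 2 / variance
    return erfc(sqrt(chi2 / 2))  # upper tail of chi-square with 1 df


# Toy data (made up): expression of one gene, survival days, all deaths observed.
expression = [0.9, 1.0, 1.1, 1.2, 1.3, 4.8, 4.9, 5.0, 5.1, 5.2]
days = [6, 7, 8, 9, 10, 1, 2, 3, 4, 5]
died = [1] * 10

labels = two_means(expression)
group0 = [i for i, lab in enumerate(labels) if lab == 0]
group1 = [i for i, lab in enumerate(labels) if lab == 1]
p_value = logrank_p(
    [days[i] for i in group0], [died[i] for i in group0],
    [days[i] for i in group1], [died[i] for i in group1],
)
print(labels, p_value)
```

The sketch only labels the two clusters as lower and upper expression; which cluster corresponds to the high-risk group has to be read from the survival curves.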
Next, we compared the predictive ability of CRNKL1 with that of the traditional EAC prognostic biomarkers (PTGS2, EGFR, ERBB2 and TP53). The conventional biomarkers could not significantly stratify EAC patients into high-risk and low-risk groups in either the training dataset or the testing dataset (Fig. 4). Furthermore, we calculated the distance of every sample to the center of the low-risk group identified in the training dataset, and plotted an ROC curve from these distances to find a cut-off that could distinguish low-risk patients from the other EAC patients. In the testing dataset, the ROC curve showed that the predictive ability of CRNKL1 was higher than that of PTGS2, EGFR, ERBB2 and TP53. At the optimal cut-off, CRNKL1 had 83.33% sensitivity and 78.57% specificity, with an AUC of 0.8058 (Fig. 5). We also used the total dataset to confirm that the predictive ability of CRNKL1 was much better than that of the conventional biomarkers; the result was consistent with that obtained in the testing dataset (Fig. S2).

4. Discussion

Fig. 4. The traditional prognostic biomarkers (PTGS2, EGFR, ERBB2 and TP53) predict the clinical outcomes of EAC patients. Kaplan-Meier survival curves showed the conventional biomarkers could not significantly classify the EAC patients into different clinical outcomes in (A–D) the training dataset, and (E–H) the testing dataset. The survival days are shown along the x-axis and overall survival rates are shown along the y-axis.

The high mortality of esophageal adenocarcinoma is due to the lack of an effective biomarker. PTGS2, EGFR, ERBB2 and TP53 have been extensively reported; they are closely associated with survival in EAC and can serve as traditional EAC biomarkers. However, all of them are limited in their ability to predict overall survival in EAC patients. In this study, CRNKL1 was identified as a potential prognostic indicator from tens of thousands of genes by using the expression profile of EAC. Differential co-expression analysis and differential expression analysis were integrated to identify the EAC-related genes. Differential co-expression analysis showed that CRNKL1 was differentially co-expressed with a large number of genes, which

suggested that its expression might be directly or indirectly connected with many genes and that it is an important gene in EAC. This analysis compensated for the shortcomings of differential expression analysis [31,32]. 5-fold cross-validation was then used to select a potential prognostic biomarker from the EAC-related genes. Unlike PTGS2, EGFR, ERBB2 and TP53, CRNKL1 was an effective biomarker that could significantly classify EAC patients into different groups in both the training dataset and the testing dataset. Furthermore, the AUC of CRNKL1 was much higher than that of the traditional EAC biomarkers, with a sensitivity of 83.33% and a specificity of 78.57%.
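To make the reported quantities concrete, the following Python sketch shows how an AUC and a sensitivity/specificity pair at a given cut-off are computed from per-sample scores. It is an illustration only, not the ROCR analysis of the study: the function names, the scores and the labels are made up, and the score stands in for any numeric quantity such as the distance to a cluster center.

```python
def auc(scores, labels):
    """AUC as the probability that a positive sample outscores a negative one
    (ties count as one half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def sensitivity_specificity(scores, labels, cutoff):
    """A sample is called positive when its score is >= cutoff."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= cutoff)
    tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < cutoff)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    return tp / n_pos, tn / n_neg


# Toy data (made up): higher score = more likely to be in the positive class.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 1, 0, 1, 0, 0, 0]

area = auc(scores, labels)
sens, spec = sensitivity_specificity(scores, labels, cutoff=0.7)
print(area, sens, spec)
```

An AUC of 0.5 corresponds to random ranking, so the further the value lies above 0.5, the better the score separates the two classes.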

Fig. 5. The ROC curve for predicting EAC patient survival using biomarkers in the testing dataset. The AUC is shown in the plots; CRNKL1 has the highest AUC.

CRNKL1 (Crooked Neck Pre-mRNA Splicing Factor 1) is located on chromosome 20, has at least 15 exons, and is expressed abundantly in many tissues at both fetal and adult stages [33]. CRNKL1 is thought to be involved in the pre-mRNA splicing process, and loss of function would presumably lead to splicing defects [34,35]. Functional enrichment analysis showed that the genes differentially co-expressed with CRNKL1 tended to take part in signaling pathways, the apoptotic process and regulation of transcription from the RNA polymerase II promoter, all of which are closely related to cell fate (Fig. S3). CRNKL1 has been reported to be mutated in basal cell carcinomas [35]. To our knowledge, CRNKL1 has not been reported to serve as a prognostic biomarker in any cancer.

5. Conclusion

Our study has shown that CRNKL1 can predict the survival of patients with EAC, and that its predictive power is superior to that of the current traditional prognostic biomarkers. Whether CRNKL1 can improve the clinical diagnosis of EAC still needs to be tested in a sufficient number of clinical trials, but our results provide a novel opportunity for prognosis in EAC.

Competing interests

The authors declare no competing financial interests.

Author contributions

ZL, ZW and YXL conceived the concept for this work; ZL performed the analyses and wrote the manuscript; QLY, SJZ and ZW improved the article.

Acknowledgements

This work was supported by the National High Technology Research and Development Program of China (2015AA020104) and the National Key Research and Development Program on Precision Medicine (2016YFC0901700).

Appendix A. Supplementary data

Supplementary data associated with this article can be found, in the online version, at http://dx.doi.org/10.1016/j.artmed.2017.01.002.

References

[1] Enzinger PC, Mayer RJ. Esophageal cancer. New Engl J Med 2003;349(23):2241–52.
[2] Li B. Global cancer statistics, 2012. CA Cancer J Clin 2012;61(1):344–64, 1.
[3] Procter DS. Oesophageal carcinoma. South African Med J = Suid-Afrikaanse tydskrif vir geneeskunde 1981;47(8):348–51.
[4] Thrift AP, Whiteman DC. The incidence of esophageal adenocarcinoma continues to rise: analysis of period and birth cohort effects on recent trends. Ann Oncol 2012;23(12):3155–62.
[5] Rubenstein JH, Shaheen NJ. Epidemiology, diagnosis, and management of esophageal adenocarcinoma. Gastroenterology 2015;149(2):302–17.
[6] Lin EW, Karakasheva TA, Hicks PD, Bass AJ, Rustgi AK. The tumor microenvironment in esophageal cancer. Oncogene 2016;35(41):5337–49.
[7] Hong L, Han Y, Zhang H, Fan D. Prognostic markers in esophageal cancer: from basic research to clinical use. Expert Review of Gastroenterology & Hepatology 2015;9(7):1–3.
[8] Fouad YM, Mostafa I, Yehia R, El-Khayat H. Biomarkers of Barrett's esophagus. World J Gastrointest Pathophysiol 2014;5(4):450–6.
[9] Lei Z, Ma J, Yu H, Liu J, Wei Z, Liu H, et al. Targeted therapy in esophageal cancer. Expert Review of Gastroenterology & Hepatology 2016;10(8):1–10.
[10] Denlinger CE, Thompson RK. Molecular basis of esophageal cancer development and progression. Surg Clin North Am 2012;92(5):1089–103.
[11] Tänzer M, Liebl M, Quante M. Molecular biomarkers in esophageal, gastric, and colorectal adenocarcinoma. Pharmacology & Therapeutics 2013;140(2):133–47.
[12] Prins MJ, Verhage RJ, ten Kate FJ, Van HR. Cyclooxygenase isoenzyme-2 and vascular endothelial growth factor are associated with poor prognosis in esophageal adenocarcinoma. J Gastrointest Surg 2012;16(5):956–66.
[13] France M, Drew PA, Dodd T, Watson DI. Cyclo-oxygenase-2 expression in esophageal adenocarcinoma as a determinant of clinical outcome following esophagectomy. Dis Esophagus 2004;17(2):136–40.
[14] Navarini D, Gurski RR, Madalosso CA, Aita L, Meurer L, Fornari F. Epidermal growth factor receptor expression in esophageal adenocarcinoma: relationship with tumor stage and survival after esophagectomy. Cambridge: University Press; 2007. p. 1548–55.
[15] Prins MJ, Ruurda JP, van Diest PJ, Van HR, Ten Kate FJ. The significance of the HER-2 status in esophageal adenocarcinoma for survival: an immunohistochemical and an in situ hybridization study. Ann Oncol Off J Eur Soc Med Oncol 2013;24(5):1290–7.
[16] Yoon HH, Shi Q, Sukov WR, Wiktor AE, Khan M, Sattler CA, et al. Association of HER2/ErbB2 expression and gene amplification with pathologic features and prognosis in esophageal adenocarcinomas. Clin Cancer Res Off J Am Assoc Cancer Res 2012;18(2):546–54.
[17] Duhaylongsod FG, Gottfried MR, Iglehart JD, Vaughn AL, Wolfe WG. The significance of c-erb B-2 and p53 immunoreactivity in patients with adenocarcinoma of the esophagus. Ann Surg 1995;221(6):683–4, discussion 683–684.
[18] Nason KS. Predicting response to neoadjuvant therapy in esophageal cancer with p53 genotyping: a fortune-teller's crystal ball or a viable prognostic tool? J Thoracic Cardiovasc Surg 2014;148(5):2286–7.
[19] Cline MS, Craft B, Swatloski T, Goldman M, Ma S, Haussler D, et al. Exploring TCGA pan-cancer data at the UCSC cancer genomics browser. Sci Rep 2013;3(10):2652.
[20] Yang J, Yu H, Liu BH, Zhao Z, Liu L, Ma LX, et al. DCGL v2.0: an R package for unveiling differential regulation from differential co-expression. PLoS One 2013;8(11):e79729.
[21] Yu H, Liu BH, Ye ZQ, Li C, Li YX, Li YY. Link-based quantitative methods to identify differentially coexpressed genes and gene pairs. BMC Bioinf 2011;12(1):315.
[22] Liu BH, Yu H, Tu K, Li C, Li YX, Li YY. DCGL: an R package for identifying differentially coexpressed genes and links from gene expression microarray data. Bioinformatics 2010;26(20):2637–8.
[23] Huang DW, Sherman BT, Lempicki RA. Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat Protocol 2009;4(1):44–57.
[24] Huang DW, Sherman BT, Lempicki RA. Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists. Nucleic Acids Res 2009;37(1):1–13.
[25] Wishart DS, Knox C, Guo AC, Cheng D, Shrivastava S, Dan T, et al. DrugBank: a knowledgebase for drugs, drug actions and drug targets. Nucleic Acids Res 2008;36(Suppl. 1):901–6.
[26] Futreal PA, Coin L, Marshall M, Down T, Hubbard T, Wooster R, et al. A census of human cancer genes. Nat Rev Cancer 2004;4(3):177–83.
[27] Seiler M, et al. ConsensusCluster: a software tool for unsupervised cluster discovery in numerical data. Omics J Integr Biol 2010;14(1):109–13.
[28] Miles GD, Seiler M, Rodriguez L. Identifying microRNA/mRNA dysregulations in ovarian cancer. BMC Res Notes 2012;5(1):164.
[29] Haller F, et al. Prognostic role of E2F1 and members of the CDKN2A network in gastrointestinal stromal tumors. Clin Cancer Res Off J Am Assoc Cancer Res 2005;11(18):6589–97.
[30] Kramar A, Com-Nougué C. Estimate of adjusted survival curves. Revue d'Épidémiologie et de Santé Publique 1990;38(38):149–52.
[31] de la Fuente A. From 'differential expression' to 'differential networking' – identification of dysfunctional regulatory networks in diseases. Trends Genet 2010;26(7):326–33.
[32] Cui X, Churchill GA. Statistical tests for differential expression in cDNA microarray experiments. Genome Biol 2003;4(4):210.
[33] Lai CH, Chiu JY, Lin W. Identification of the human crooked neck gene by comparative gene identification. Biochimica et Biophysica Acta (BBA) - Gene Structure and Expression 2001;1517(3):449–54.
[34] Chung S, Zhou Z, Huddleston KA, Harrison DA, Reed R, Coleman TA, et al. Crooked neck is a component of the human spliceosome and implicated in the splicing process. Biochim Biophys Acta 2002;1576(3):287–97.
[35] Jayaraman SS, Rayhan DJ, Hazany S, Kolodney MS. Mutational landscape of basal cell carcinomas by whole-exome sequencing. J Invest Dermatol 2014;134(1):213–20.