Forum

Mining Public Databases for Precision Oncology

Jason Roszik1,2,* and Vivek Subbiah3,*

Millions of dollars have been spent on creating public databases. To date, these data reside in isolated ‘silos’. Real-world realization of precision oncology, the right drug for the right patient at the right time, may be possible only if the right data come to the right clinic at the right time.

Mining Public Databases: Unearthing the Gold Mines

Precision oncology implies that patients receive personalized therapy based on their molecular alterations. The clinical availability of next-generation sequencing technologies has opened new avenues for therapy. However, with the rapid rise of these technologies comes a huge challenge: using such large datasets for clinical translation. Oncogenes that are overexpressed, mutated, or otherwise altered are being targeted by drugs developed specifically for a limited number of tumor types. However, matching patients to novel therapies is often difficult, even when mutation and gene or protein expression data are available for the patient, mainly because it is hard to judge the relevance of genomic alterations for which only a limited literature exists. One solution is to integrate and leverage large public cancer databases to identify whether the tumor of a patient has an alteration that might be exploited by certain therapies. Using an integrated system spanning several cancers and a large number of patients, we might be able to identify oncogenic drivers and drug indications for alterations that are individually very rare.
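As a minimal sketch of this kind of matching, the snippet below looks up a patient's reported alterations in a small alteration-to-therapy table. The gene/drug pairs and the lookup structure are illustrative placeholders, not a clinical knowledge base or a specific database schema.

```python
# Minimal sketch: matching a patient's reported alterations against a small,
# illustrative alteration-to-therapy lookup table. The pairs below are
# placeholders, not clinical recommendations.
from typing import Dict, List, Tuple

# Hypothetical curated table: alteration -> candidate therapy class
ACTIONABLE: Dict[str, str] = {
    "BRAF V600E": "BRAF/MEK inhibition",
    "ERBB2 amplification": "HER2-targeted therapy",
    "NTRK1 fusion": "TRK inhibition",
}

def match_therapies(patient_alterations: List[str]) -> List[Tuple[str, str]]:
    """Return (alteration, therapy) pairs for alterations found in the lookup table."""
    return [(alt, ACTIONABLE[alt]) for alt in patient_alterations if alt in ACTIONABLE]

if __name__ == "__main__":
    patient = ["TP53 R175H", "BRAF V600E"]           # example patient profile
    for alteration, therapy in match_therapies(patient):
        print(f"{alteration}: consider {therapy}")   # -> BRAF V600E: consider BRAF/MEK inhibition
```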

Although multiple next-generation sequencing databases have been created in the past years, their use in precision oncology remains limited. The Cancer Genome Atlas (TCGA)i [1] is currently the largest collection of processed genomic data, including mutations, copy number alterations, and gene and protein expression, for more than 30 cancer types. Similarly, the Therapeutically Applicable Research to Generate Effective Treatments (TARGET)ii project has made genomic data available for childhood cancers, including acute lymphoblastic leukemia, acute myeloid leukemia, kidney tumors, neuroblastoma, and osteosarcoma. These data are accessible from the Genomic Data Commons Data Portal of the National Cancer Instituteiii, and raw sequencing data for these and many other sequencing projects are available through the Genotypes and Phenotypes (dbGaP) database of the National Center for Biotechnology Information (NCBI)iv.

Sequencing panels are frequently used in the clinic, and in addition to identifying targetable mutations these data can be applied to estimate total mutation load and predict immunotherapy response [2]. The American Association for Cancer Research (AACR) project Genomics Evidence Neoplasia Information Exchange (GENIE)v is a collection of clinical-grade cancer genomic information that, in addition to mutations and copy number alterations, contains gene fusion and clinical data. Although panel sequencing is performed commercially by several sequencing companies and at major cancer centers, most of these databases are not publicly accessible. Furthermore, panel sequencing datasets contain no gene or protein expression data, only copy number alterations for a limited number of genes, and the availability of clinical data is also limited.
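To illustrate the mutation-load estimate mentioned above, the sketch below applies the common normalization of nonsynonymous mutation count to the panel's coding footprint in megabases. The panel size and variant list are hypothetical; published approaches such as [2] rely on defined gene sets and additional filtering.

```python
# Minimal sketch of a tumor mutation burden (TMB) estimate from panel sequencing:
# nonsynonymous mutation count normalized to the panel's coding footprint.
# PANEL_SIZE_MB and the variant list are hypothetical examples.
PANEL_SIZE_MB = 1.2   # assumed coding territory covered by the panel, in megabases

variants = [
    {"gene": "TP53", "effect": "missense"},
    {"gene": "KRAS", "effect": "missense"},
    {"gene": "MLH1", "effect": "synonymous"},   # typically excluded from TMB
]

nonsynonymous = [v for v in variants if v["effect"] != "synonymous"]
tmb = len(nonsynonymous) / PANEL_SIZE_MB
print(f"Estimated TMB: {tmb:.1f} mutations/Mb")   # 2 / 1.2 ≈ 1.7
```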

Another useful database is the Human Protein Atlasvi, a comprehensive resource for protein expression and localization in 44 normal human tissue types and in 17 cancer types from approximately 8000 patients. In addition, cancer cell line exome and RNA sequencing data, as well as drug screening data, are available from the Cancer Cell Line Encyclopedia (CCLE)vii and the Genomics of Drug Sensitivity in Cancer (GDSC)viii projects; these can be used to identify overexpressed drug targets and to relate drug sensitivity to gene expression [3]. Most of the data portals mentioned herein provide data analytics capabilities, and other web-based tools, such as the cBioPortalix, can also be used to analyze sequencing data. However, these analytics portals remain a work in progress, and we need tools that provide more sophisticated analytics without requiring bioinformatics or programming expertise.
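The snippet below is a sketch of one such expression-to-sensitivity analysis, assuming locally exported tables of gene expression and drug response (e.g., log IC50) indexed by cell line. The file names, gene, and drug column are hypothetical; real CCLE/GDSC exports would need their own parsing and matching of cell line identifiers.

```python
# Sketch: relating a candidate target's expression to drug sensitivity across cell
# lines, assuming two locally downloaded tables indexed by cell line. File names and
# column labels are hypothetical.
import pandas as pd
from scipy.stats import spearmanr

expr = pd.read_csv("cell_line_expression.csv", index_col="cell_line")   # genes as columns
resp = pd.read_csv("drug_response_ic50.csv", index_col="cell_line")     # drugs as columns

shared = expr.index.intersection(resp.index)       # cell lines present in both tables
x = expr.loc[shared, "ERBB2"]                      # expression of a candidate target
y = resp.loc[shared, "lapatinib_ln_ic50"]          # log IC50 for a matched drug

rho, p = spearmanr(x, y)
print(f"Spearman rho = {rho:.2f} (p = {p:.2g}) across {len(shared)} cell lines")
```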

Another potential application area for sequencing databases is predicting unwanted toxicities. Healthy-tissue toxicity is often an important concern in Phase I trials. To anticipate such issues, we can analyze RNA and protein expression data from normal tissues. The Genotype-Tissue Expression (GTEx)x project [4] contains RNA sequencing data for more than 11 000 samples from 53 human tissues, and its web-based interface can be used to quickly estimate the expression level of a gene in normal tissues. A limitation is that many tissue types are not included in GTEx, and a drug target expressed in one of these may still cause significant toxicity.
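A minimal sketch of this toxicity screen is shown below, assuming a locally downloaded GTEx summary matrix of median expression per tissue in GCT format (two header lines, then ‘Name’, ‘Description’, and one column per tissue). The file path, layout, and target gene are assumptions for illustration only.

```python
# Sketch: flagging normal tissues with high expression of a drug target, assuming a
# local GTEx median-TPM-per-tissue matrix in GCT-like format. Path and gene are examples.
import pandas as pd

GCT_PATH = "GTEx_gene_median_tpm.gct"   # assumed local copy of a GTEx summary file
TARGET = "ERBB2"                        # candidate drug target (gene symbol)

median_tpm = pd.read_csv(GCT_PATH, sep="\t", skiprows=2)
row = median_tpm.loc[median_tpm["Description"] == TARGET]
tissue_expr = row.drop(columns=["Name", "Description"]).iloc[0].astype(float)

# Tissues where the target is most highly expressed are candidates for on-target,
# off-tumor toxicity; a tissue absent from GTEx is simply not flagged here.
print(tissue_expr.sort_values(ascending=False).head(10))
```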

Figure 1. Current Public Databases Reside in Isolated Silos. Although multiple large databases are publicly available for further analysis, connecting de-identified data stored in these repositories is difficult or impossible. To make precision medicine possible, we will need to connect all of these databases so that they ‘talk’ to each other. Examples of such large databases included in the figure are The Cancer Genome Atlas (TCGA), the Cancer Cell Line Encyclopedia (CCLE), the Genomics Evidence Neoplasia Information Exchange (GENIE), the Genomics of Drug Sensitivity in Cancer (GDSC), the Genotype-Tissue Expression (GTEx) project, and the Human Protein Atlas (HPA).

The abovementioned databases are great resources and are publicly available. However, several limitations exist: first, many of the data require complex bioinformatics algorithms to decipher; second, the data reside in separate silos (Figure 1); and finally, and most importantly, the clinical annotations may not be perfect.

Data privacy and security issues make access to and integration of the data more complicated. Anonymization is important to protect patient privacy; however, it requires extensive data preprocessing and makes connecting de-identified records across multiple repositories difficult or impossible. For example, an institution that provided samples to TCGA may not be able to integrate publicly available TCGA sequencing-based results with data from a clinical trial in which the same patients participated. Different data formats also contribute to silo creation, including data generated within the same organization by analysts in different departments using different versions of the same software. There can also be multiple copies of the same data, creating redundancy in an integrated system. These problems exist within single institutions and become even more of an issue when data are brought together from multiple organizations, especially because cancer data silos have been built over years, and in some cases decades, by competing organizations and people.

Garbage In–Garbage Out

Big data come with big challenges, and we need to come up with even bigger solutions to address these problems. This all starts with which fields and which types of data go into these databases; as the saying in the world of big data goes, ‘garbage in–garbage out’. We therefore need to ensure that all of these data carry the necessary elements so that there is a common language and the data do not reside in silos but are interconnected and ‘talk’ to each other. The good news is that a multistakeholder consensus has recently outlined the core clinical data elements needed for cancer genomic repositories. The Center for Medical Technology Policyxi and the Molecular Evidence Development Consortium/Cure-Onexii assembled more than 50 public and private stakeholders into a multi-institutional working group to address this moving forward [5]. They found that the major organizations [e.g., the National Cancer Institute (NCI), the American Association for Cancer Research (AACR), and the American Society of Clinical Oncology (ASCO)] each collected between 40 and 540 data elements, but fewer than ten elements were common to all databases. The working group has proposed a common, minimal, and manageable ‘required’ dataset for all genomic-scale projects. Provided that database organizations adhere to this required dataset, prospective data collection programs will enable faster translation of precision medicine to the clinic.

But what do we do with all of the public databases that already exist, on which millions of dollars have been spent for large-scale analyses? These data residing in isolated silos need to be interconnected and, in many cases, intraconnected. Consensus groups are working on algorithms to evaluate whether these datasets can communicate with each other, and web-based tools are being developed to provide analytics; however, these websites usually offer relatively basic analytics. One suggestion is to integrate the common-denominator data elements that are available in all datasets, as sketched below. Although retrospective, this may help to optimize prospective data collection as technologies mature.
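The sketch below illustrates that common-denominator idea: only the fields shared by two repositories with otherwise different schemas are mapped to common names and combined. The field names, the mapping, and the example records are hypothetical; a real harmonization effort would follow the consensus element list described in [5].

```python
# Sketch: retrospectively integrating only the common-denominator fields from two
# repositories whose schemas otherwise differ. Field names and records are hypothetical.
import pandas as pd

repo_a = pd.DataFrame({
    "patient_id": ["A1", "A2"],
    "diagnosis": ["melanoma", "osteosarcoma"],
    "variant": ["BRAF V600E", "TP53 R273H"],
    "smoking_history": ["never", "former"],   # element missing from repo B, so dropped
})
repo_b = pd.DataFrame({
    "case": ["B7"],
    "dx": ["melanoma"],
    "mutation": ["NRAS Q61K"],
})

# Map each source schema onto the shared minimal element names, then concatenate.
SHARED = ["patient_id", "diagnosis", "variant"]
b_renamed = repo_b.rename(columns={"case": "patient_id", "dx": "diagnosis", "mutation": "variant"})
combined = pd.concat([repo_a[SHARED], b_renamed[SHARED]], ignore_index=True)
print(combined)
```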

Concluding Remarks

To optimize the use of sequencing data in precision medicine, we will therefore need tools that can efficiently analyze large amounts of data, predict response, and prioritize available treatments and clinical trials. Real-world realization of precision oncology, the right drug for the right patient at the right time, may be possible in the near future only if the right data come to the right clinician at the right time.

Resources
i https://cancergenome.nih.gov/
ii https://ocg.cancer.gov/programs/target
iii https://gdc.cancer.gov/
iv www.ncbi.nlm.nih.gov/gap
v www.aacr.org/Research/Research/Pages/aacr-project-genie.aspx
vi www.proteinatlas.org/
vii https://portals.broadinstitute.org/ccle/
viii www.cancerrxgene.org/
ix www.cbioportal.org/
x www.gtexportal.org/
xi www.cmtpnet.org/
xii https://cure-one.org/

1 Department of Melanoma Medical Oncology, The University of Texas MD Anderson Cancer Center, 1515 Holcombe Blvd, Houston, TX 77030, USA
2 Department of Genomic Medicine, The University of Texas MD Anderson Cancer Center, 1515 Holcombe Blvd, Houston, TX 77030, USA
3 Department of Investigational Cancer Therapeutics, The University of Texas MD Anderson Cancer Center, 1515 Holcombe Blvd, Houston, TX 77030, USA

*Correspondence: [email protected] (J. Roszik) and [email protected] (V. Subbiah).

https://doi.org/10.1016/j.trecan.2018.04.008

References
1. Tomczak, K. et al. (2015) The Cancer Genome Atlas (TCGA): an immeasurable source of knowledge. Contemp. Oncol. (Pozn.) 19, A68–A77
2. Roszik, J. et al. (2016) Novel algorithmic approach predicts tumor mutation load and correlates with immunotherapy clinical outcomes using a defined gene mutation set. BMC Med. 14, 168
3. Qin, Y. et al. (2017) A tool for discovering drug sensitivity and gene expression associations in cancer cells. PLoS One 12, e0176763
4. GTEx Consortium (2013) The Genotype-Tissue Expression (GTEx) project. Nat. Genet. 45, 580–585
5. Conley, R.B. et al. (2017) Core clinical data elements for cancer genomic repositories: a multi-stakeholder consensus. Cell 171, 982–986
