10 Analytic methods for systems medicine

Panagiotis V.S. Vasileiou 1, Gkikas Magiorkinis 2, Pagona Lagiou 2,3, Vassilis Gorgoulis 1,4,5,6

1 Molecular Carcinogenesis Group, Department of Histology and Embryology, Medical School, National and Kapodistrian University of Athens, Athens, Greece; 2 Department of Hygiene, Epidemiology and Medical Statistics, Medical School, National and Kapodistrian University of Athens, Athens, Greece; 3 Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA, United States; 4 Faculty Institute for Cancer Sciences, Manchester Academic Health Sciences Centre, University of Manchester, Manchester, United Kingdom; 5 Biomedical Research Foundation, Academy of Athens, Athens, Greece; 6 Center for New Biotechnologies and Precision Medicine, Medical School, National and Kapodistrian University of Athens, Athens, Greece

Abstract

The explosion of computing power and the advent of high-throughput technologies have enabled medicine to enter the era of the systems approach. The basic principle of systems medicine is that the human organism is a mosaic of interacting subnetworks. Accordingly, analytic methods should account for the multidimensional and dynamic nature of this interdependence. Built on various computational models and novel mathematical methodologies, analytic methods in systems medicine are designed to handle input from all relevant levels within the patient's system, from the microscopic to the macroscopic. This chapter introduces the data sources used in systems medicine and the challenges raised at every step of the workflow by the inherent quality issues, high dimensionality, and heterogeneity of biomedical data. It then explains the sequential steps followed during analysis, namely data preprocessing and data mining.

Introduction

Going back to the beginnings of systems theory in 1950, Bertalanffy, an Austrian biologist born in a village near Vienna and known as one of the founders of general systems theory, claimed in his paper "An outline of general system theory" that
organismic conceptions in biology "assert the necessity of investigating not only parts but also the relations of organization resulting from a dynamic interaction and manifesting themselves by the difference in behavior of parts in isolation and in the whole organism."1 Based on this notion, the general consensus that outlines the fundamental concept of systems medicine is that the human organism is a "network of networks," a mosaic of microsystems.2 Between all the distinct elements that make up the human organism as a whole, various levels of interaction exist, both macroscopically and microscopically. In essence, this group of subnetworks is the means by which molecular information flows through the system and, ultimately, drives the phenotype in a multiscale, nonlinear, high-dimensional, and dynamic manner. From a thermodynamic point of view, the interacting units (genome, proteins, molecules, cells, organs, and, finally, individuals) constitute an open system, meaning that they are constantly and mercilessly subjected to entropy, "the general trend of the universe toward death and disorder," as stated by James R. Newman.3 At the same time, living systems also exhibit other characteristic behaviors and responses (namely homeostasis, resilience, and hormesis) (Table 10.1) that allow them to cope with harsh environmental conditions, overcome stressful situations, and survive, an ability that involves adaptive changes at the cellular, subcellular, and organismal levels.4,5

Table 10.1 Definitions of homeostasis, resilience, and hormesis.

Homeostasis. Definition: the tendency of living organisms to actively maintain the fairly stable conditions necessary for survival via negative (and/or positive) feedback loops. Relevant terms: steady state, dynamic equilibrium. Paradigms: body temperature regulation; DNA damage response and repair circuits.

Resilience. Definition: the cell's capacity to withstand an insult, neither collapsing directly nor entering a cell death program but instead recovering from toxicant exposure, with the possibility of developing tolerance to the next hit. Relevant terms: tolerance, adaptation. Paradigms: chemotherapy drug resistance of tumor cell subpopulations; epigenetic changes that create a "scar" as a consequence of the stress the cell has experienced.

Hormesis. Definition: a biphasic dose response through which low levels of a stressful stimulus exert beneficial effects and activate an adaptive response that augments the resistance of the cell or organism to a subsequent, higher dose. Relevant terms: preconditioning, adaptive stress response ("whatever does not kill you makes you stronger"). Paradigms: ischemic preconditioning.


Importantly, through the prism of systems medicine, this web of interplaying, interacting, and interdependent biomedical elements should be approached not only topologically but, most importantly, temporally, that is, with respect to the fourth dimension, time, because both time-symmetric and time-asymmetric variations exist between them.6 Taking all of the above into consideration, it becomes clear what a monumental, challenging, and complex undertaking systems medicine analytics is, given that each step of the analysis must respect the level of the parts, the level of the whole, and their spatiotemporal dynamic relations (Fig. 10.1). Analytics in systems medicine obviously exceed traditional methods of information analysis. The development and implementation of analytic methods in systems medicine rest on hallmarks such as the explosion of computing power, the use of various computational models, and novel mathematical/statistical methodologies. These tools introduced high-throughput technologies, based on state-of-the-art insights into human physiology and pathology, and made it feasible to capture and record vast amounts of information about each individual patient over a large timescale. Unfortunately, the data accumulated are both voluminous and complex, making analytics in systems medicine an even more demanding task. Of great importance, the main objective of analytics in systems medicine is not merely the collection of these data from the extremely large pool of knowledge generated but rather how these databases can be put into meaningful formats that are most useful to healthcare providers. In other words, analytics in systems medicine aim to fill the methodological gap between data capture and the extraction of biomedical knowledge to improve the quality of healthcare.

Figure 10.1 The different levels of observation and monitoring in the context of systems medicine highlight the need for a multilayer approach to systems medicine analytics. (Levels shown: subcellular: biomolecules, i.e., nucleic acids, proteins, electrolytes, etc.; cellular: cells, tissues, organs; organismal: social networks, human ecology, i.e., diet, physical activity, population density, microbial exposure, and environmental exposure; temporal: circadian rhythms, age-related shifts, acute vs. chronic perturbations.)


In this regard, multidisciplinary alliances (including physicians, biologists, epidemiologists, statisticians, bioinformaticians, information technology specialists, geneticists, etc.) as well as collaborations on a worldwide basis are the sine qua non. The importance of cooperation, so that valuable information is not lost, had already been pointed out by Boulding back in 1956, who claimed that "the more science breaks into subgroups, and the less communication is possible among the disciplines, ... the greater chance there is that the total growth of knowledge is being slowed down by the loss of relevant communications."7

Data sources used in systems medicine analytics and the need for integration

Today's availability of computing power and high-throughput technologies has increased the world's technological capacity to handle information. The implementation of these achievements in the biomedical field, on an almost routine basis, has introduced medicine to the era of big data.8 The "holy grail" of systems medicine analytics is predicting cellular and organismal traits at the highest level of confidence, detail, and reliability through the exploitation of biomedical big data, mostly referring to multiomics and electronic medical record (EMR) data. A nonexhaustive list of information sources includes the genome, transcriptome, proteome, metabolome, microbiome, epigenome, interactome, and diseasome; data derived from EMRs (including demographic, clinical, experimental, machine, and sensor data); a legion of environmental factors (the exposome); and data derived from social media and web services, smartphones, handheld devices, and implantable electronic devices capable of self-interrogation and monitoring (such as implantable cardiac defibrillators, cardiac pacemakers, implantable loop recorders, etc.). Even histopathological and radiological images are now subjected to computational quantification based on imaging informatics algorithms.9,10 The management of all this supporting information intertwines computational model systems and novel methodologies designed to handle input from all relevant levels within the patient's system, from the smallest scale to the largest (molecules, cells, tissues, organs, environment, epidemiology, sociology). Each platform individually can provide a comprehensive portrait of a distinct biological state; however, the regulatory events that explain how these changes occurred can only be inferred by combining or fully integrating the multiple levels of information.11 Regarding -omics, the extent to which one level of information explains the next one (as defined by the central dogma of molecular biology) is the first question raised when such datasets are obtained collectively, but, surprisingly, the expected correspondence does not always hold.
For example, transcriptomes and their encoded products show a moderate correlation of 40% to 60%, depending on the model used and the technology applied.12,13 To illustrate how huge a challenge effective and efficient data capture is in the clinical setting: globally, patients report their symptoms in over 200 spoken languages, in a total of 19,217 hospitals and healthcare facilities worldwide. Doctors record symptoms using 1 of over 1000 EMR systems and the ICD-10 (International Classification of Diseases, Tenth Revision) coding system, which lists over 13,000 diagnoses, for each of which a suitable documentation approach with an appropriate data model should be implemented. In principle, each data item on a case report form (CRF) could be derived from one medical concept (e.g., patient age). Given that a typical CRF consists of approximately 40 data items and that there are over 300,000 nonsynonymous concepts available (based on SNOMED, a clinical terminology platform created to support clinical decision-making and analytics in software programs), one comes to the conclusion that "there are many more CRFs than atoms in the universe" (1.5 × 10^171 vs. 1 × 10^80)!14 On the other hand, the human genome contains 30,000 to 35,000 genes and encodes for nearly 100 trillion cells in the human body.15,16 Between each hierarchical level (from genes to messenger RNAs and onward to proteins), modifications are made (e.g., alternative splicing), as are numerous interactions among thousands of molecules, thus establishing a hypercomplex regulatory system. Analytics in systems medicine cover the integration of all this extraordinary amount of complex, heterogeneous information, in terms of data analysis, modeling, interpretation, validation, and quality control.17 The term integration refers to the accessibility of different data types of interest via a single platform, the feasibility of systematic querying through specific data formats and platform infrastructures, and the linking of different types of patient-specific data, such as molecular and clinical data.18 Integration can be achieved via four categories of approaches, into which all existing statistical methods fit: correlation-based, concatenation-based, multivariate-based, and pathway-based integration.19
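To make the concatenation-based and correlation-based flavors of integration concrete, the sketch below joins two hypothetical patient-level -omic matrices on their sample identifiers and then computes cross-omic correlations. The file names, feature layout, and the use of pandas are illustrative assumptions for this example, not methods prescribed by the chapter.

```python
# Minimal sketch: concatenation- and correlation-based multi-omics integration.
# Assumes hypothetical CSV files with one row per patient ("sample_id" column);
# names and shapes are illustrative only.
import pandas as pd

# Load two -omic layers, e.g., transcript counts and protein abundances.
rna = pd.read_csv("transcriptome.csv", index_col="sample_id")
prot = pd.read_csv("proteome.csv", index_col="sample_id")

# Concatenation-based integration: align samples and stack features side by side,
# yielding one wide matrix that downstream (multivariate) methods can consume.
combined = rna.join(prot, how="inner", lsuffix="_rna", rsuffix="_prot")
print("integrated matrix:", combined.shape)

# Correlation-based integration: correlate each transcript with each protein
# across the shared samples (e.g., to inspect the moderate mRNA-protein agreement).
shared = rna.index.intersection(prot.index)
corr = pd.concat([rna.loc[shared], prot.loc[shared]], axis=1).corr().loc[rna.columns, prot.columns]
print(corr.round(2))
```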

Challenges regarding big data

The origin of the term big data dates back to 1997, when National Aeronautics and Space Administration (NASA) scientists used it to describe the inability to display and analyze datasets too large to be stored on a computer's hard disk.20 Big data raises challenges on all aspects of the workflow, from data capture and storage to cleaning, analysis, visualization, and sharing.8 The hallmark of big data is its massive size and complexity.21 Widely accepted notions to describe the complexity of big data are the three "Vs": volume (size), variety (diversity), and velocity (frequency of update).


Figure 10.2 Conceptual classification of the challenges related to the characteristics of big data: the seven Vs (volume, variety, velocity, veracity, volatility, variability, and value). See text for more details.

Some authors add veracity, volatility, variability, and value. Altogether, these seven Vs are considered the key data management challenges associated with big data and thus represent obstacles in systems medicine analytics (Fig. 10.2).22 Volume is the main characteristic of big data. Owing to the massive expansion in data storage capabilities, the amount of data available globally is expected to rise sharply in the years ahead; it has been estimated to reach the extraordinary size of 44 zettabytes, or 44 trillion gigabytes, in 2020, corresponding to a 50-fold increase compared with 2011 (Table 10.2).8 But how big is "big"? According to Baro et al., a dataset qualifies as a "big dataset" only if log(n × p) is greater than or equal to 7, where n is the number of statistical individuals and p is the number of variables.23 In simple terms, we consider data to be big when, for example, the product of the number of individuals and their recorded associated variables exceeds 10 million (see the short check after this paragraph). Variety refers to the heterogeneity and complexity of multiple datasets. The term translates into the "aggregation of widely disparate sources of data or mash-ups of data derived from independent sources."24 The flood of data can be of structured format (organized in rows and columns, compliant with traditional database methods) or of semistructured and/or unstructured (i.e., neither machine-readable nor computable) format. Most biomedical datasets are created for operational purposes and are therefore largely unstructured (80% to 90% of generated data): streaming videos, imaging studies, audio files, free-text physician notes, anatomic hand drawings with informal annotations, social media updates, as well as log files, click data, machine and sensor physiological measures (signals), etc.
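As a minimal illustration of the Baro et al. volume criterion described above, the following sketch checks whether a dataset of given dimensions clears the log(n × p) ≥ 7 threshold; the example dimensions are invented for illustration.

```python
# Minimal sketch of the Baro et al. "big dataset" criterion: log10(n * p) >= 7.
import math

def is_big_dataset(n_individuals: int, n_variables: int) -> bool:
    """Return True if the dataset qualifies as 'big' under log10(n * p) >= 7."""
    return math.log10(n_individuals * n_variables) >= 7

# Example: 250,000 patients with 50 recorded variables each -> 1.25e7 data points.
print(is_big_dataset(250_000, 50))   # True  (log10 is about 7.1)
print(is_big_dataset(10_000, 40))    # False (log10 is about 5.6)
```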


Table 10.2 Data sizes with paradigms.

1 BYTE: 1 byte (a single character); 10 bytes (a single word); 100 bytes (a telegram).
1 KILOBYTE (1,000 bytes): 1 kilobyte (a very short story); 10 kilobytes (an encyclopedic page); 50 kilobytes (a compressed document image page).
1 MEGABYTE (1,000,000 bytes): 1 megabyte (a small novel); 10 megabytes (a minute of high-fidelity sound); 100 megabytes (1 meter of shelved books); 500 megabytes (a CD-ROM).
1 GIGABYTE (1,000,000,000 bytes): 1 gigabyte (a movie at TV quality); 100 gigabytes (a floor of academic journals).
1 TERABYTE (1,000,000,000,000 bytes): 1 terabyte (all the X-ray films in a large hospital); 10 terabytes (the printed collection of the US Library of Congress); 50 terabytes (the contents of a large mass storage system).
1 PETABYTE (1,000,000,000,000,000 bytes): 2 petabytes (all US academic research libraries); 20 petabytes (all production of hard-disk drives in 1995); 200 petabytes (all printed material).
1 EXABYTE: 5 exabytes (all words ever spoken by human beings).
1 ZETTABYTE, 1 YOTTABYTE, 1 XENOTTABYTE, 1 SHILENTNOBYTE, 1 DOMEGEMEGROTTEBYTE: no everyday paradigms given.

Unfortunately, unstructured data, no matter how voluminous, cannot by themselves fulfill the promises of big data. Advanced analytics, including machine learning algorithms and artificial intelligence built on recently developed natural language processing and text analytics, are increasingly used to address this issue, but they are still in their infancy.25 By contrast, structured data (such as those included in a typical EMR), for example patient demographics, diagnoses based on the ICD-10 coding system, laboratory tests, and vital signs, are readily analyzed in spreadsheet or database formats. Unfortunately, structured data account for only one-fifth of available healthcare information.8 Another key management challenge associated with big data is velocity, namely the speed and frequency of data creation, generation, delivery, processing, and analysis on a 24-hours-a-day, 7-days-a-week basis. Nowadays, data arrive at or near real time.26 Notably, data are produced much faster than ever before in the history of humankind, with 90% of all existing knowledge having been created in the past 2 years.8 In addition, big data are characterized by volatility, a trait that describes the rate of change and the consistency of the data over time. Volatility is relevant to abrupt, sudden, and unintentional modification, shifting, or instability. Relevant questions are "How old does your data need to be before it is considered irrelevant, historic, or no longer useful?" or "How long does data need to be kept for?" Of note, data currency, availability beyond a limited amount of time, and rapid retrieval of information when required are of paramount importance. Veracity refers to the accuracy and truthfulness of the dataset, as well as the provenance and reliability of the data source. To make this clear, one might ask:
Who created this set of data? What methodology did they follow in collecting the data? Has the information been edited or modified by anyone else? This is one of the unfortunate characteristics of big data: it is not simply the quality of the data itself that matters but also how trustworthy the data source is and how meaningful it is to the analysis based on it.27 Veracity entails removing bias, inconsistencies, and duplication, identifying the methodologically relevant data points, and, finally, interpreting results in a proper, relevant, and actionable way. Obviously, this is especially important when combining basic research with clinical big data. Knowledge of the data's veracity provides a better understanding of the risks associated with analyses and decisions based on a particular dataset. Variability, in the context of big data, concerns inconsistencies in the data resulting from multiple disparate data types and sources.28 These must be found by anomaly and outlier detection methods for any meaningful analytics to occur (a minimal example follows below). Variability may also refer to the inconsistent speed at which big data are loaded into a database. Frankly, data in themselves have minimal value, no matter how big the data library is. Extracting meaningful information is therefore the challenge of utmost importance for big data analytics.29 Substantial value should be extracted from the data and useful messages should be uncovered. In this regard, visualization methods that allow making sense of the data need to be created.30
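The sketch below gives one minimal, illustrative way to perform the duplicate removal and outlier flagging mentioned above, using pandas and a simple z-score rule; the column names, values, and threshold are assumptions made for the example, not prescriptions from the chapter.

```python
# Minimal sketch: duplicate removal (veracity) and simple outlier flagging (variability).
import pandas as pd

def clean_measurements(df: pd.DataFrame, value_col: str, z_thresh: float = 3.0) -> pd.DataFrame:
    """Drop exact duplicate records and flag values more than z_thresh SDs from the mean."""
    deduped = df.drop_duplicates()
    z = (deduped[value_col] - deduped[value_col].mean()) / deduped[value_col].std(ddof=0)
    return deduped.assign(is_outlier=z.abs() > z_thresh)

# Hypothetical serum creatinine values: one duplicated record and one implausible entry.
labs = pd.DataFrame({
    "patient_id": [1, 2, 2, 3, 4, 5, 6, 7],
    "creatinine_mg_dl": [0.9, 1.0, 1.0, 1.1, 0.8, 1.0, 1.2, 9.0],
})
# A loose threshold is used here because the toy sample is tiny.
print(clean_measurements(labs, "creatinine_mg_dl", z_thresh=2.0))
```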

Data preprocessing

Data preprocessing of biomedical and/or clinical data is a crucial and meaningful step in systems medicine analytics, within the frame of the well-known knowledge discovery from data process (Fig. 10.3).31,32 Preprocessing refers to a framework of systematic and detailed methodologies for converting raw data into a format that is acceptable to model learning algorithms and therefore suitable for starting a data mining process. It mainly addresses the noise, redundancies, inconsistencies, incompleteness, and multiple, possibly irrelevant features that unfortunately characterize real-world medical datasets. Furthermore, sophisticated data preprocessing methods deal with the volume and velocity challenges of big data.33 Selecting the right combination of preprocessing methods has a considerable impact on the classification potential of a dataset and largely determines the reliability, suitability, and quality of the hidden knowledge that will be extracted afterward, during the data mining project. Of note, the data preparation phase is time consuming, taking up to 50% or sometimes up to 80% of the total project time.34-36 Data preprocessing tasks include both data preparation (transformation, integration, cleaning, normalization) and data reduction tasks (feature selection, instance selection, discretization).33
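As one minimal, illustrative combination of a data preparation step (normalization) and a data reduction step (feature selection), the sketch below drops near-constant numeric features and z-scores the rest; the variance cutoff and column names are assumptions made for the example.

```python
# Minimal sketch: variance-based feature selection (data reduction) followed by
# z-score normalization (data preparation).
import pandas as pd

def normalize_and_reduce(df: pd.DataFrame, min_variance: float = 1e-3) -> pd.DataFrame:
    """Drop near-constant numeric features, then z-score the remaining ones."""
    numeric = df.select_dtypes("number")
    keep = numeric.columns[numeric.var() > min_variance]                   # feature selection
    return (numeric[keep] - numeric[keep].mean()) / numeric[keep].std()    # normalization

# Hypothetical clinical features; "assay_version" is constant and gets dropped.
patients = pd.DataFrame({
    "age": [54, 61, 47, 70],
    "systolic_bp": [128, 142, 119, 150],
    "assay_version": [2.0, 2.0, 2.0, 2.0],
})
print(normalize_and_reduce(patients))
```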


Figure 10.3 The sequential steps of the knowledge discovery from data process: data selection, preprocessing, mining, and interpretation/visualization, leading to knowledge.

Different types of data require different processing technologies; most structured data commonly need classic preprocessing technologies, whereas the processing of semistructured or unstructured data requires more complex and challenging methods. One of the most common issues encountered in real-world data is missing values. They may be attributed to inadequate sampling methods, cost concerns, or limitations in the acquisition process (e.g., data entry mistakes, misinterpretation of original documents when entering values). Of note, information embedded in EMRs is inherently disorganized. Missing values pose a great challenge in the analytic procedure; they may result in poor knowledge extraction or wrong conclusions, complicate data manipulation, and produce strong biases if handled inappropriately. The option of discarding missing values (by implementing methods such as listwise or pairwise deletion) is simple and effortless but, unfortunately, not trouble-free, because it is associated with loss of statistical power, introduction of bias, and underestimation of variances. Other approaches include data imputation methods (e.g., mean filling, nearest neighbor); coming from statistics, these methods use maximum likelihood procedures and model the probability functions of the data. These techniques, even though highly impactful, are considered outdated. Instead, modern, more robust imputation methods, such as interpolation, expectation maximization, maximum likelihood, and multiple imputation, as well as machine learning (ML) techniques, are becoming popular nowadays. Unfortunately, these methods have their own limitations; for example, interpolation ignores associations between different features, whereas model-based filling does not take missingness mechanisms into consideration.
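To make the simpler imputation strategies concrete, the sketch below fills missing laboratory values by mean filling and by a nearest-neighbor approach; it assumes scikit-learn is available, and the values are invented for illustration.

```python
# Minimal sketch: mean filling vs. nearest-neighbor imputation of missing values.
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

# Hypothetical lab panel with missing entries (np.nan).
X = np.array([
    [5.1, 140.0, np.nan],
    [4.8, np.nan, 0.9],
    [6.2, 138.0, 1.1],
    [5.5, 142.0, 1.0],
])

mean_filled = SimpleImputer(strategy="mean").fit_transform(X)  # column-mean filling
knn_filled = KNNImputer(n_neighbors=2).fit_transform(X)        # nearest-neighbor filling

print(mean_filled.round(2))
print(knn_filled.round(2))
```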


In any case, missing value imputation is a hard problem in which many relationships among the data have to be taken into consideration to estimate the best possible value to replace the missing one. A second inherent quality issue, relevant to the preprocessing of high-resolution waveform data such as those derived from blood pressure sensors, electrocardiography, photoplethysmography, etc., is noise processing. Noise refers to an abnormal attribute value in a data source (also known as an illegal value), including corruptions in the waveform and artifacts. It can affect the input features, the output values, or both; in the latter case, the introduced bias is greater. Importantly, even partial noise correction is claimed to be beneficial.37 The processing of noisy data includes binning, regression, outlier analysis, and retrieval of other data sources. Three main approaches are commonly used in noise processing. The first is noise filtering, which identifies and removes noisy instances in the training data (median filtering, Kalman filtering, model-based filtering); the problem of noise is further tackled with sensor fusion methods and signal quality assessment. Because of their inherent data quality issues, -omic data preprocessing is based on different tools. Both massive high-throughput sequencing (HTS) and mass spectrometry (MS)-based protein identification "omics" technologies rely on the identification of short sequence reads or peptides, either by direct polynucleotide sequencing or by mass spectrum-based peptide matching, which are finally aligned to a reference genome or proteome, respectively. In HTS experiments, millions of short sequence reads are aligned to a reference genome, and the number of reads that fall into a particular genomic region is recorded as read count data.38 An example of a major challenge with HTS data is differential expression analysis in RNA-seq data, with an unexpectedly large variability of sequence count data among transcripts. More specifically, read counts observed at a particular transcript location are limited by the depth of sequencing coverage (i.e., how many times a part of the genome has been "read") and are dependent on the relative abundance of other transcripts (see the normalization sketch below). A main difference in the principles of the two technologies is that HTS data are obtained in a parallel, spatially resolved manner for all reads, whereas proteomics data are collected sequentially, with priority given to the peptides that are retained less by the chromatography column used for sample complexity reduction before MS. The latter can introduce some degree of stochasticity in the identification and quantification of lowly expressed proteins, which can be almost eliminated by metabolic or chemical isotopic amino acid labeling or by data-independent acquisition methods. These unique features of HTS and MS data have been the driving force for the development of a number of computationally intensive statistical tools for analyzing sequence data with uncertainty, performing data normalization, and detecting differential expression.
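As a minimal illustration of the kind of read-count normalization such tools perform, the sketch below converts raw RNA-seq read counts to counts per million (CPM) so that samples with different sequencing depths become comparable; the gene names and counts are invented, and CPM is only one simple normalization among many.

```python
# Minimal sketch: counts-per-million (CPM) normalization of RNA-seq read counts,
# correcting for differences in sequencing depth (library size) between samples.
import pandas as pd

counts = pd.DataFrame(
    {"sample_A": [500, 1500, 8000], "sample_B": [900, 3100, 16000]},
    index=["GENE1", "GENE2", "GENE3"],
)

library_size = counts.sum(axis=0)              # total reads per sample
cpm = counts.div(library_size, axis=1) * 1e6   # scale each sample to one million reads

print(cpm.round(1))
```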


Table 10.3 Statistical pipelines/tools implemented in -omic data preprocessing.

Genomic: GMAP, BWA, STAR, GATK, SAMtools.
Transcriptomic: GMAP, BWA, STAR, HTSeq, BEDTools, RSEM, Cufflinks, deFuse, TopHat-Fusion, Trans-ABySS, Trinity, Scripture.
Epigenomic: MACS, SISSRs.
Proteomic: OpenMS, MZmine 2.
Metabolomic: OpenMS, MZmine 2.

Table 10.3 summarizes selected tools for -omic data preprocessing.17 The common preprocessing step for genomic, transcriptomic, and epigenomic data is sequence mapping.39 Genomic preprocessing uses per-base differences for small-scale variant detection and read-pair-based, read-depth-based, split-read-based, and assembly-based methods for large-scale variant detection. Epigenomic data preprocessing is based on genome-wide measurements of protein-DNA interaction by chromatin immunoprecipitation and on quantitative measurements of transcriptomes; here, the aim is to identify the density of reads along the genome, call significant peaks, and model the background noise. Preprocessing for MS data used in proteomic and metabolomic studies includes alignment, baseline correction, and peak detection and quantification.17,39 Pathway analysis is another widely used processing step for RNA-seq data. There is a variety of tools, but no "gold standard" method for functional pathway analysis of high-throughput scale data40: overrepresentation analysis approaches (e.g., Onto-Express, GoMiner, ClueGO), functional class scoring approaches (e.g., GSEA), and pathway topology-based tools (e.g., Pathway-Express).40-42 In addition, reconstructions of regulatory networks and associated toolkits, such as metabolic networks (e.g., Recon 1, Recon 2, SEED, IOMA, MADE) and gene regulatory networks (e.g., Boolean methods, ordinary differential equation (ODE) models), have been developed to address the inherent data quality issues of HTS data via a dynamic approach.40,43,44
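As a minimal illustration of the overrepresentation analysis idea behind tools such as Onto-Express or GoMiner, the sketch below uses a hypergeometric test to ask whether a gene list is enriched for members of one pathway; the counts are invented, and scipy is assumed to be available.

```python
# Minimal sketch: pathway overrepresentation analysis via a hypergeometric test.
from scipy.stats import hypergeom

background_genes = 20000   # genes assayed in total
pathway_genes = 150        # background genes annotated to the pathway of interest
hits = 400                 # differentially expressed genes
hits_in_pathway = 12       # of those, how many fall in the pathway

# P(observing >= hits_in_pathway pathway genes by chance among the hits).
p_value = hypergeom.sf(hits_in_pathway - 1, background_genes, pathway_genes, hits)
print(f"enrichment p-value = {p_value:.3e}")
```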


Data mining

To discover hidden patterns and trends within healthcare datasets and derive needful information, the application of data mining techniques has become a requisite.45-47 The term data mining refers to the application of user-oriented algorithmic approaches to extract instructive patterns from raw data.48 Data mining is a multidisciplinary task incorporating novel approaches from information science (Fig. 10.4). Generally, data mining algorithms are classified into two categories: predictive and descriptive. The purpose of the former is mostly to predict a future outcome rather than describe an existing behavior, whereas descriptive algorithms aim to discover patterns and identify associations between the attributes represented by the data (Fig. 10.5).17 Of great importance, medical databases contain both static information, such as patient gender, demographic information (e.g., age, race/ethnicity), blood type, comorbidities, and text-based physician notes, and temporal data, meaning sequences of data referring to changeable and potentially modifiable parameters that are recorded over multiple visits, such as symptoms, physical findings, abnormal or normal laboratory blood tests, imaging data, disease diagnoses, medications, therapeutic interventions, or even environmental factors. Accordingly, two main strategies exist for data mining: static endpoint prediction and temporal data mining.17

Figure 10.4 Data mining is a multidisciplinary task that uses machine learning, statistics, artificial intelligence, and database technology.


Figure 10.5 Overview of the main data mining strategies. Descriptive analytics scrutinizes data to identify patterns, associations, dependencies, and relationships among multimodal data, whereas predictive analytics forecasts future outcomes based on data captured at different time points. Static endpoint prediction relates features to endpoints; temporal data mining handles temporal patterns with regular or irregular sampling frequencies.

The static endpoint prediction strategy models the relationship between selected clinical features (e.g., dual antiplatelet therapy) and targeted clinical endpoints (e.g., major bleeding) through the implementation of three groups of techniques: classification, regression analysis, and association rule learning (Fig. 10.6).17,49 Classification is one of the most commonly applied data mining methods in the healthcare sector. It divides data samples into target classes. The data classification process involves a learning (or training) dataset and a prediction dataset. The training set consists of a set of attributes together with the outcome the algorithm learns to predict. The prediction set consists of the same set of attributes as the training set, but in this case the prediction attribute is yet to be known.
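The sketch below illustrates the training set/prediction set distinction with a decision tree classifier, one of the classification algorithms listed below; the toy features (age, therapy flag) and the bleeding endpoint are invented, and scikit-learn is assumed to be available.

```python
# Minimal sketch: training a classifier on labeled records, then predicting the
# (unknown) endpoint for new records that share the same attributes.
from sklearn.tree import DecisionTreeClassifier

# Training set: attributes plus the known clinical endpoint (1 = major bleeding).
X_train = [[72, 1], [55, 0], [80, 1], [48, 0], [66, 1], [59, 0]]  # [age, dual antiplatelet therapy]
y_train = [1, 0, 1, 0, 1, 0]

# Prediction set: same attributes, endpoint not yet known.
X_new = [[77, 1], [50, 0]]

model = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_train, y_train)
print(model.predict(X_new))  # predicted endpoints for the new patients
```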

Figure 10.6 Overview of data mining statistical tools: classification/clustering, regression analysis, association rule learning, and temporal prediction and trend analysis.


Some well-known classification algorithms used in healthcare are decision trees, K-nearest neighbors, ZeroR, support vector machines, neural networks, and Bayesian methods. Clustering is an unsupervised learning method in which a large database is divided into a number of small subgroups, called clusters, based on similarities between data points. In other words, data points within the same cluster are more similar to each other than to the data points of other clusters. Clustering differs from classification in that (a) it has no predefined classes (classification does), (b) its goal is descriptive rather than predictive, and (c) it is a method of unsupervised learning, meaning that it relies only on independent variables (classification analyzes both independent and dependent variables).49 Regression analysis is another useful data mining technique. It is widely used in the medical field for predicting diseases or the survivability of a patient. Regression is mainly a statistical tool used to inspect relationships between independent variables (i.e., features) and dependent variables (i.e., endpoints). Regression models can be classified as linear or nonlinear. Of note, linear regression is restricted to numerical data, whereas for categorical data a nonlinear model, such as logistic regression, can be used. In addition, association rule learning is a rule-based ML method that can be applied to databases for association mining. The key objective of association mining algorithms is efficiency rather than accuracy. It discovers frequent patterns and relationships among a set of data items; for example, it has been used by researchers to detect associations among diseases and their prescribed drugs (illustrated in the sketch after this paragraph).50 From a different perspective, data may contain attributes generated and recorded at different time points (e.g., a diagnosis, a specific treatment, a drug response), each depicting only a snapshot. Therefore, finding meaningful relationships in the data may require considering the temporal order of the attributes. In this case, temporal association rule mining is proposed to elucidate causality between an event and an outcome.17 Selected methods for doing so are hidden Markov models and conditional random fields.
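To make the association rule idea concrete, the sketch below computes support, confidence, and lift for one hypothetical rule ("disease X implies drug Y") from toy co-occurrence records; the records are invented for illustration.

```python
# Minimal sketch: support, confidence, and lift for the rule {disease_X} -> {drug_Y}.
records = [
    {"disease_X", "drug_Y"},
    {"disease_X", "drug_Y", "drug_Z"},
    {"disease_X"},
    {"drug_Y"},
    {"disease_Z", "drug_Z"},
]

n = len(records)
count_x = sum("disease_X" in r for r in records)
count_y = sum("drug_Y" in r for r in records)
count_xy = sum({"disease_X", "drug_Y"} <= r for r in records)

support = count_xy / n            # how often the pair occurs together
confidence = count_xy / count_x   # P(drug_Y | disease_X)
lift = confidence / (count_y / n) # association strength beyond chance co-occurrence

print(f"support={support:.2f}, confidence={confidence:.2f}, lift={lift:.2f}")
```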

Initiatives and perspectives

In the era of systems medicine, data across the biological, clinical, behavioral, and environmental spectrum are aggregated, integrated, and fused. Importantly, what is really going to prove transformative for medicine and release the full potential of the systems approach is not the amount of data collected but, mostly, the algorithms applied. In other words, no matter how big, big data are useless unless analyzed, interpreted, and acted on.51


The shift toward electronic, digitized data platforms for the management of healthcare information creates a pioneering and still unexploited resource for generating novel knowledge in medicine. This is the first big step regarding systems medicine analytics. The driving force has been the development of new data storage and data retrieval architectures, coupled with on-demand, scalable computational resources driven by cloud-based technologies, together with semantic web technologies (semantics).52 Semantics exploit data storage capacity at the petabyte level and beyond, thus facilitating limitless documentation, integration, access, analysis, and interoperability of data, previously considered impossible. At the moment, semantically annotated data items and forms mostly refer to -omics rather than clinical data, owing to ethical concerns about the possibility of reidentifying individuals.53,54 In this regard, efforts have been made to deal with the issue of anonymization and sharing. One example is iDASH (integrating Data for Analysis, anonymization, and SHaring), which focuses on algorithms and tools for sharing data in a privacy-preserving manner.55 Unfortunately, missing semantic annotation in databases is the main reason for data integration and migration problems. Various successful projects addressing the issue of defining a semantic mapping from a database schema to an ontology, thus bridging the gap between clinical care and research, have been developed, such as eMERGE, SHRINE, and SHARPn.14,56-59 Despite their potential, semantic web technologies continue to evolve more slowly than anticipated, because current methods remain difficult to use and rely mostly on human annotation.52 Promising software platforms for the development of applications that can handle big data in the field of systems medicine have entered the scene. One such open-source distributed data processing framework is Hadoop MapReduce. Furthermore, big data silos have been developed in the context of multiple ongoing consortium initiatives. Beyond their basic role of data storage, these projects can also serve as platforms for large-scale test cases of proposed data integration methods and of data reuse/metadata practices. The Trans-Omics for Precision Medicine (TOPMed) program of the National Heart, Lung, and Blood Institute of the US National Institutes of Health (NIH) generates vast amounts of RNA-seq, proteomics, and metabolomics data; at the moment, whole genome sequences have been generated for more than 100,000 people by TOPMed. Moreover, a growing volume of multilevel profiles is becoming available through The Cancer Genome Atlas (TCGA) consortium, namely the Pan-Cancer dataset, currently comprising more than 30 solid tumor types. TCGA, the International Cancer Genome Consortium, and the Therapeutically Applicable Research to Generate Effective Treatments consortium represent an -omic data integration approach, whereas other initiatives, such as the US 1000 Genomes Project and the UK-based 100,000 Genomes Project, use single -omic data.


New statistical tools from the field of ML are also mandatory and critical for transforming the data contained in data warehouses into medical knowledge. ML has already entered the scene and has become an omnipresent and indispensable tool for dealing with complex problems in the sciences. ML has the potential to handle a huge number of predictors and combine them in nonlinear and interactive ways, thus allowing massive datasets to be utilized, transformed, and analyzed. Especially in systems medicine, ML is anticipated to improve diagnostic accuracy and introduce more accurate prognostic models through the application of "hybrid" algorithms; using such algorithms is expected to become routine in everyday clinical practice.51 Additionally, a key role in the optimal use of big data in systems medicine belongs to visual analytics approaches.30 Visual analytics foster the synthesis of the information that big data provide and facilitate decision-making processes in real time. Interactive visualization is considered the best way to understand large, multisource, variable-type, and time-varying data. In this respect, visual analytics has been defined as "the science of analytical reasoning facilitated by interactive visual interfaces."60 In the forthcoming years, the future holds a protagonist role for systems medicine in both the clinical and the research setting. The analytic methods used will be of paramount importance because, as mentioned above, it is not the amount of available data but, instead, the ability to access them efficiently that matters. From this perspective, new, more effective ways to analyze the data are becoming more and more imperative. Unfortunately, barriers do exist and need to be overcome. For example, the ability to incorporate environmental factors is at the moment limited, in terms of both how to assess them and how to integrate them into the analyses.61 According to Silverman, another caveat is that most of the -omics measurements are not longitudinal. In addition, time-related restrictions still pose challenges; for example, many of the currently available biospecimens gathered and stored in big data silos were collected after disease presentation, so their value for creating biomarkers of incident disease is questionable.61 New, rigorous methodological approaches and analytic strategies are definitely required to create new evidence. In this respect, there is a strong need for global and interdisciplinary collaboration to address all the challenges of efficient and effective analytics in systems medicine, including the relevant ethical issues.

References

1. Bertalanffy L. General System Theory: Foundations, Development, Applications. New York: George Braziller; 1968.
2. Trachana K, Bargaje R, Glusman G, Price ND, Huang S, Hood LE. Taking systems medicine to heart. Circ Res. 2018;122(9):1276-1289. https://doi.org/10.1161/CIRCRESAHA.117.310999.
3. Gorgoulis VG, Pefani DE, Pateras IS, Trougakos IP. Integrating the DNA damage and protein stress responses during cancer development and treatment. J Pathol. 2018. https://doi.org/10.1002/path.5097 [Epub ahead of print].
4. Smirnova L, Harris G, Leist M, Hartung T. Cellular resilience. ALTEX. 2015;32(4):247-260. https://doi.org/10.14573/altex.1509271.
5. Mattson MP. Hormesis defined. Ageing Res Rev. 2008;7(1):1-7.
6. Tibbitt MW, Anseth KS. Dynamic microenvironments: the fourth dimension. Sci Transl Med. 2012;4(160):160ps24.
7. Boulding K. General system theory: the skeleton of a science. General Systems. 1956;1:11-17.
8. Austin C, Kusumoto F. The application of Big Data in medicine: current implications and future directions. J Interv Card Electrophysiol. 2016;47:51-59.
9. Kumar V, Gu Y, Basu S, et al. Radiomics: the process and the challenges. Magn Reson Imaging. 2012;30(9):1234-1248.
10. Kristensen VN, Lingaerde OC, Russnes HG, Vollan HK, Frigessi A, Børresen-Dale AL. Principles and methods of integrative genomic analyses in cancer. Nat Rev Canc. 2014;14(5):299-313.
11. Green S, Wolkenhauer O. Integration in action. EMBO Rep. 2012;13:769-771.
12. de Sousa Abreu R, Penalva LO, Marcotte EM, Vogel C. Global signatures of protein and mRNA expression levels. Mol Biosyst. 2009;5:1512-1526.
13. Ghaemmaghami S, Huh WK, Bower K, et al. Global analysis of protein expression in yeast. Nature. 2003;425:737-741.
14. Dugas M. Clinical research informatics: recent advances and future directions. Yearb Med Inform. 2015;10:174-177.
15. Chanda SK, Caldwell JS. Fulfilling the promise: drug discovery in the post-genomic era. Drug Discov Today. 2003;8:168-174.
16. Guttmacher AE, Collins FS. Genomic medicine - a primer. N Engl J Med. 2002;347:1512-1520.
17. Wu P-Y, Cheng C-W, Kaddi CD, Venugopalan J, Hoffman R, Wang MD. Omic and electronic health records big data analytics for precision medicine. IEEE Trans Biomed Eng. 2017;64:263-273.
18. Bauer CR, Knecht C, Fretter C, et al. Interdisciplinary approach towards a systems medicine toolbox using the example of inflammatory diseases. Briefings Bioinf. 2017;18:479-487.
19. Cavill R, Jennen D, Kleinjans J, Briedé JJ. Transcriptomic and metabolomic data integration. Briefings Bioinf. 2016;17(5):891-901.
20. Cox M, Ellsworth D. Managing Big Data for Scientific Visualization. Vol. 97. New York, NY, USA: ACM Siggraph; 1997:21-38.
21. Marx V. The big challenges of big data. Nature. 2013;498:255-260.
22. Sivarajah U, Kamal MM, Irani Z, Weerakkody V. Critical analysis of Big Data challenges and analytical methods. J Bus Res. 2017;70:263-286.
23. Baro E, Degoul S, Beuscart R, Chazard E. Toward a literature-driven definition of big data in healthcare. BioMed Res Int. 2015;2015:639021.
24. Berger ML, Doban V. Big data, advanced analytics and the future of comparative effectiveness research. J Comp Eff Res. 2014;3(2):167-176. https://doi.org/10.2217/cer.14.2.
25. Luo L, Li L, Hu J, et al. A hybrid solution for extracting structured medical information from unstructured data in medical records via a double-reading/entry system. BMC Med Inf Decis Mak. 2016;16(1):114.
26. Moore KD, Eyestone K, Coddington DC. The big deal about big data. Healthc Financ Manag. 2013;67:60-68.
27. Dereli T, Coşkun Y, Kolker E, Güner Ö, Ağırbaşlı M, Özdemir V. Big data and ethics review for health systems research in LMICs: understanding risk, uncertainty and ignorance - and catching the black swans? Am J Bioeth. 2014;14(2):48-50.
28. Carusi A. Validation and variability: dual challenges on the path from systems biology to systems medicine. Stud Hist Philos Biol Biomed Sci. 2014;48(Part A):28-37.
29. Huberman BA. Sociology of science: big data deserve a bigger audience. Nature. 2012;482(7385):308.
30. Kamal N, Wiebe S, Engbers JDT, Hill MD. Big data and visual analytics in health and medicine: from pipe dream to reality. J Health Med Inform. 2014;5:e125.
31. Han J, Kamber M, Pei J. Data Mining: Concepts and Techniques. 3rd ed. Burlington: Morgan Kaufmann Publishers Inc; 2011.
32. Zaki MJ, Meira W. Data Mining and Analysis: Fundamental Concepts and Algorithms. New York: Cambridge University Press; 2014.
33. García S, Ramírez-Gallego S, Luengo J, Benítez JM, Herrera F. Big data preprocessing: methods and prospects. Big Data Anal. 2016;1:9.
34. Pyle D. Data Preparation for Data Mining. San Francisco: Morgan Kaufmann Publishers Inc.; 1999.
35. Duhamel A, Nuttens MC, Devos P, et al. A preprocessing method for improving data mining techniques. Application to a large medical diabetes database. Stud Health Technol Inf. 2003;95:269-274.
36. Zhang S, Zhang C, Yang Q. Data preparation for data mining. Appl Artif Intell. 2003;17:375-381.
37. Zhu X, Wu X. Class noise vs. attribute noise: a quantitative study. Artif Intell Rev. 2004;22:177-210.
38. Chu C, Fang Z, Hua X, et al. deGPS is a powerful tool for detecting differential expression in RNA-sequencing studies. BMC Genomics. 2015;16:455.
39. Pepke S, Wold B, Mortazavi A. Computation for ChIP-seq and RNA-seq studies. Nat Methods. 2009;6(11 Suppl):S22-S32.
40. Belle A, Thiagarajan R, Soroushmehr SM, Navidi F, Beard DA, Najarian K. Big data analytics in healthcare. BioMed Res Int. 2015;2015:370194.
41. Khatri P, Sirota M, Butte AJ. Ten years of pathway analysis: current approaches and outstanding challenges. PLoS Comput Biol. 2012;8(2):e1002375.
42. Huang DW, Sherman BT, Lempicki RA. Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists. Nucleic Acids Res. 2009;37(1):1-13.
43. Thiele I, Swainston N, Fleming RMT, et al. A community-driven global reconstruction of human metabolism. Nat Biotechnol. 2013;31(5):419-425.
44. Marbach D, Costello JC, Küffner R, et al. Wisdom of crowds for robust gene network inference. Nat Methods. 2012;9(8):796-804.
45. Larose DT. Discovering Knowledge in Data: An Introduction to Data Mining. New York: John Wiley; 2005.
46. Koh HC, Tan G. Data mining applications in healthcare. J Healthc Inf Manag. 2005;19(2):64-72.
47. Kantardzic M. Data Mining: Concepts, Models, Methods, and Algorithms. New Jersey: John Wiley; 2003.
48. Grupe FH, Owrang MM. Database mining - discovering new knowledge and competitive advantage. Inf Syst Manag. 1995;12:26-31.
49. Ahmad P, Qamar S, Rizvi SQA. Techniques of data mining in healthcare: a review. Int J Comput Appl. 2015;120:38-50.
50. Altaf W, Shahbaz M, Guergachi A. Applications of association rule mining in health informatics: a survey. Artif Intell Rev. 2017;47:313.
51. Obermeyer Z, Emanuel EJ. Predicting the future - big data, machine learning, and clinical medicine. N Engl J Med. 2016;375:1216-1219.
52. Weng C, Kahn MG. Clinical research informatics for Big Data and precision medicine. IMIA Yearb Med Inf. 2016;1:211-218.
53. Ohm P. Broken promises of privacy: responding to the surprising failure of anonymization. UCLA Law Rev. 2010;57:1701-1777.
54. Naveed M, Ayday E, Clayton EW, et al. Privacy in the genomic era. ACM Comput Surv. 2015;48(1).
55. Ohno-Machado L, Bafna V, Boxwala AA, et al. iDASH: integrating data for analysis, anonymization, and sharing. J Am Med Inform Assoc. 2012;19(2):196-201.
56. Dugas M. Missing semantic annotation in databases - the root cause for data integration and migration problems in information systems. Methods Inf Med. 2014;53(6):516-517.
57. Newton KM, Peissig PL, Kho AN, et al. Validation of electronic medical record-based phenotyping algorithms: results and lessons learned from the eMERGE network. J Am Med Inform Assoc. 2013;20(e1):e147-e154.
58. McMurry AJ, Murphy SN, MacFadden D, et al. SHRINE: enabling nationally scalable multisite disease studies. PLoS One. 2013;8(3):e55811.
59. Pathak J, Bailey KR, Beebe CE, et al. Normalization and standardization of electronic health records for high-throughput phenotyping: the SHARPn consortium. J Am Med Inform Assoc. 2013;20(e2):e341-e348.
60. Thomas JJ, Cook KA. A visual analytics agenda. IEEE Comput Graph Appl. 2006:10-13.
61. Schmidt HHHW, Baumbach J, Loscalzo J, Agusti A, Silverman EK, Azevedo V. Systems Medicine; 2018. Ahead of print. http://doi.org/10.1089/sysm.2017.29000.rtd.

Further reading

Alyass A, Turcotte M, Meyre D. From big data analysis to personalized medicine for all: challenges and opportunities. BMC Med Genomics. 2015;8:33.
