Mining Big Data in biomedicine and health care

Accepted Manuscript

Guest Editorial

Samah Fodeh, Qing Zeng

PII: S1532-0464(16)30127-7
DOI: http://dx.doi.org/10.1016/j.jbi.2016.09.014
Reference: YJBIN 2648

To appear in: Journal of Biomedical Informatics

Received Date: 14 September 2016
Accepted Date: 20 September 2016

Please cite this article as: Fodeh, S., Zeng, Q., Mining Big Data in Biomedicine and Health Care, Journal of Biomedical Informatics (2016), doi: http://dx.doi.org/10.1016/j.jbi.2016.09.014



Guest Editorial: Mining Big Data in Biomedicine and Health Care

Introduction

The notion of "Big Data" is often characterized by the three Vs: volume, variety, and velocity [1]. While the concept is most often associated with the volume dimension, variety is especially relevant to biomedical data, given the different types of data characterizing patients: structured and tabulated data (e.g., labs, medications), unstructured data in the form of narratives written by clinicians, and diagnostic imaging data (e.g., X-rays and CT scans). Velocity refers to the rate of growth and change in such datasets and has gained special prominence in biomedicine as a result of high-throughput technologies.

Despite the rapid increase in the collection and analysis of big data, the biomedical and healthcare research communities are only beginning to capitalize on the transformative opportunities these data provide. Healthcare data, for example, were recently predicted to grow from approximately 500 petabytes in 2012 to 25,000 petabytes by 2020 [2]. As large and complex data sets become increasingly available to the research community, more advanced and sophisticated big data analytical techniques are needed to exploit and manage them.

Applying analytical techniques to big data offers great benefit for biomedical research, allowing the identification and extraction of relevant information and facilitating more rapid discovery of meaningful patterns and new threads of knowledge. To accelerate the use of big data, we have seen the recent initiation of many sponsored research programs, including the Obama Administration's $200 million Big Data Research and Development Initiative, which was launched "to greatly improve the tools and techniques needed to access, organize, and glean discoveries from huge volumes of digital data" [3].


Machine learning and data mining methods can be used to mine significant knowledge from a variety of large and heterogeneous textual and tabulated data sources, supporting biomedical research and healthcare delivery. One example of a Big Data project is Oncospace [4], initiated at Johns Hopkins University, which studies the association of patients with their related treatment and outcome data. Oncospace hosts a data lake that readily permits the biomedical data-science activities of patient and outcome selection along with feature extraction. In another project, Mount Sinai Medical Center in the U.S. uses technologies from Ayasdi, a big data company, to analyze all genetic sequences of Escherichia coli, including over one million DNA variants, to investigate why bacterial strains resist antibiotics [5]. Microsoft's HealthVault [6], launched in 2007, is an excellent application of medical big data; its goal is to manage individual health information along with individual and family medical devices.

On the other hand, the majority of existing models that utilize computational techniques for disease prediction and longitudinal analysis of clinical data [7-10] have limitations that hinder their broad-scale adoption in real-world settings. These models were derived from clinical trial databases that represent patient populations with limited generalizability (e.g., care provided in closely monitored settings at larger academic medical centers, smaller patient sample sizes, lack of heterogeneity in the patient population) [11].

In light of the current interest and activity in the informatics research community, and the challenges associated with analytical and predictive efforts to date, we disseminated a call for papers for a special issue of the Journal of Biomedical Informatics (JBI) focused on technologies that advance the community's effort to mine Big Biomedical Data [12]. Our goal has been to produce a special issue that establishes the state of the art in computational systems and resources related to Big Data. We received a total of 35 submissions, from which 16 manuscripts were ultimately accepted for publication after rigorous peer review. The articles were selected based on their scientific contribution to the field. One important observation we made is that the notion of "Big Data" is misconceived by some in the research community: many of the papers we rejected at the initial evaluation proposed to test and apply their methods to datasets that did not meet any of the dimensions of big data (volume, variety, and velocity). While we acknowledge that the creation of annotated data (a reference standard) for the evaluation of a newly proposed system can be labor intensive, expensive, and not always feasible, the test data should demonstrate heterogeneity and their volume should at least be in the hundreds.

In this special issue we offer articles organized in three categories: (i) contributions related to predictive models, (ii) research devoted to unsupervised and descriptive data mining, and (iii) research concerning the use of big longitudinal data. In this overview editorial we provide a snapshot of the research described in the special issue, all of which seeks to advance the development of computational techniques for big data mining.

Research on Predictive Models

Developing computational EHR-based predictive models to forecast future instances or the severity of disease helps to improve the quality of healthcare. Projected benefits include: (1) helping healthcare providers manage their patients effectively by providing appropriate treatment options, (2) reducing the incidence of hospitalization, a major driver of costs, and (3) helping clinical researchers design clinical trials by targeting high-risk patients with heterogeneous characteristics for disease-modifying therapeutic interventions [13].

For heart failure survival prediction, Taslimitehrani et al [14] modified their previous classification algorithm, Contrast Pattern Aided Logistic Regression (CPXR(Log)), to develop and validate prognostic risk models. The approach was used to predict 1-, 2-, and 5-year survival in heart failure based on data from electronic health records (EHRs) at the Mayo Clinic. Patients' comorbidities were included in the model to obtain more accurate predictions. The CPXR(Log) method constructs a pattern-aided logistic regression model defined by several patterns and corresponding local logistic regression models. CPXR(Log) outperformed standard logistic regression, random forest, support vector machine (SVM), decision tree, and AdaBoost methods, capturing the fairly complicated interactions between predictor and response variables for heart failure and reflecting the heterogeneity of the disease.
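To make the pattern-aided idea concrete, here is a minimal sketch in which hand-written boolean "patterns" partition patients and each subgroup gets its own local logistic regression. This is not the authors' CPXR(Log) implementation: the data, the patterns, and the routing rule are all hypothetical, and the real algorithm mines contrast patterns from the data rather than taking them as given.

```python
# Minimal sketch of a pattern-aided logistic regression, loosely inspired by
# CPXR(Log) [14]; patterns and data here are hypothetical placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))            # synthetic predictors (e.g., labs, vitals)
y = (X[:, 0] * (X[:, 1] > 0) + rng.normal(scale=0.5, size=500) > 0).astype(int)

# A "pattern" is a boolean condition on the predictors; each pattern gets its
# own local logistic regression model.
patterns = [lambda X: X[:, 1] > 0, lambda X: X[:, 1] <= 0]
local_models = []
for match in patterns:
    idx = match(X)
    local_models.append((match, LogisticRegression().fit(X[idx], y[idx])))

def predict(X_new):
    """Route each patient to the local model of its matching pattern."""
    out = np.empty(len(X_new))
    for match, model in local_models:
        idx = match(X_new)
        out[idx] = model.predict_proba(X_new[idx])[:, 1]
    return out

print(predict(X[:5]))
```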


For type 2 diabetes mellitus (DM2) detection, EHR-based predictive models have been shown to be successful in a study by Anderson et al [15]. Their results are especially important in that approximately 25% of DM2 patients in the United States are undiagnosed due to inadequate screening. The authors developed a pre-screening tool to predict current DM2, using multivariate logistic regression and a random-forest probabilistic model for out-of-sample validation. Applied to the current number of undiagnosed individuals in the United States, the tool predicted that incorporating EHR phenotype screening would identify an additional 400,000 patients with active, untreated diabetes compared to conventional pre-screening models. While this study showed the importance of incorporating EHR data to improve prediction, the study by Pineda et al [16] concluded that the predictions of computational methods can surpass experts' predictions. In previous research, Pineda et al showed that influenza can be detected using electronic emergency department free-text reports. In the manuscript included in our special issue, they studied seven machine-learning classifiers for influenza detection, comparing their diagnostic capabilities against an expert-built influenza Bayesian classifier. All classifiers achieved areas under the receiver operating characteristic (ROC) curve (AUC) ranging from 0.88 to 0.93 and performed significantly better than the expert-built Bayesian model.
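For illustration, a toy version of such a classifier comparison might look like the following scikit-learn sketch. The reports, labels, and feature pipeline are synthetic stand-ins, not the study's emergency-department corpus or its seven classifiers.

```python
# Illustrative comparison of text classifiers by ROC AUC, in the spirit of the
# influenza-detection study [16]; all data below are synthetic placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

reports = ["fever cough myalgia", "ankle sprain", "fever chills cough",
           "laceration left hand", "cough congestion fever", "chest pain"] * 50
labels = [1, 0, 1, 0, 1, 0] * 50          # 1 = influenza-positive (synthetic)

X = TfidfVectorizer().fit_transform(reports)
X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.3, random_state=0)

for clf in (LogisticRegression(), MultinomialNB(),
            RandomForestClassifier(random_state=0)):
    auc = roc_auc_score(y_te, clf.fit(X_tr, y_tr).predict_proba(X_te)[:, 1])
    print(f"{type(clf).__name__}: AUC = {auc:.2f}")
```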

In the same line of research but with a different focus, Feldman et al [17] discussed how Big Data relates to the effective utilization of data-centric tools within clinical practice. Such utilization faces two significant challenges: (1) the scalability of computational tools, requiring parallel or distributed implementations, and (2) the interpretability of the tools' output. To address these challenges, they present a case study investigating the issues associated with scaling CARE, a disease prediction algorithm, to accommodate large medical datasets, and with analyzing diagnosis predictions by providing contextual information that conveys the certainty of the results to a physician.

In another study, Jung et al [18] explore the impact on predictive models of non-stationarity in clinical data, broadly defined as occurring when the data-generating process being modeled changes over time. The data-generating process in the study was the routine care of wounds as recorded in the EHR, where observational data would be used to predict delayed wound healing. The authors proposed that changes in the wound care process over time may have a substantial impact on the development of predictive models. Model development was carried out under different degrees of non-stationarity, simulated by using different train-test split procedures. They concluded that non-stationarity in the data-generating process requires the development of more complex predictive models.
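The splitting idea can be illustrated with a small simulation: a random split mixes time periods, while a temporal split trains on early records and tests on later ones. The drift term, features, and model below are assumptions for the sketch, not the authors' wound data or pipeline.

```python
# Minimal sketch of probing non-stationarity with different train/test splits,
# in the spirit of [18]. Data are synthetic, with coefficients that drift over time.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
n = 2000
t = np.sort(rng.uniform(0, 1, n))               # visit time, normalized
X = rng.normal(size=(n, 3))
drift = 2.0 * t                                  # the process changes over time
y = (X[:, 0] + drift * X[:, 1] + rng.normal(size=n) > 0).astype(int)

def evaluate(train_idx, test_idx, label):
    clf = RandomForestClassifier(random_state=0).fit(X[train_idx], y[train_idx])
    auc = roc_auc_score(y[test_idx], clf.predict_proba(X[test_idx])[:, 1])
    print(f"{label}: AUC = {auc:.2f}")

perm = rng.permutation(n)                        # random split ignores time
evaluate(perm[: n // 2], perm[n // 2 :], "random split")
evaluate(np.arange(n // 2), np.arange(n // 2, n), "temporal split")
```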

While many studies focused on the prediction of diseases, the work by Yao et al [19] proposed a framework to predict the treatment methods within a prescription. They proposed to mine treatment patterns in Traditional Chinese Medicine (TCM) clinical cases using a supervised topic model and TCM domain knowledge. The framework first identifies symptoms and medicines, labels syndromes on each case, and determines treatment methods using a TCM domain ontology; it then mines treatment patterns, especially medicine-usage patterns in prescriptions, via a supervised topic model. Finally, it uses medicines and prescription topics as features to predict the treatment methods within a prescription. The framework may help to illuminate further clinical research.
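As a rough illustration of using topic proportions as predictive features, the sketch below substitutes plain unsupervised LDA for the authors' supervised topic model; the prescriptions, vocabulary, and treatment labels are toy assumptions.

```python
# Hedged sketch: topic proportions as features for predicting a treatment label,
# loosely following the idea in [19]. Plain LDA stands in for the authors'
# supervised topic model; all data are synthetic.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.linear_model import LogisticRegression

prescriptions = ["ginseng astragalus licorice", "ephedra cinnamon ginger",
                 "astragalus licorice jujube", "cinnamon ginger peony"] * 25
treatment = [0, 1, 0, 1] * 25                    # synthetic treatment-method label

counts = CountVectorizer().fit_transform(prescriptions)
topics = LatentDirichletAllocation(n_components=2,
                                   random_state=0).fit_transform(counts)
clf = LogisticRegression().fit(topics, treatment)
print(clf.predict(topics[:4]))                   # predicted treatment methods
```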

Unsupervised and Descriptive Mining

Querying very big biomedical datasets is a challenge due to the complexity of biomedical data. Fong et al [20] evaluated the results of three different free-text search methods, including a unique topic-modeling adaptation, n-gram models, and structured-element weights, with a use case involving patient falls. The Latent Dirichlet Allocation (LDA) topic modeling used is an unsupervised machine-learning method. The search methods were applied to a total of 49,859 patient-safety reports. Overall, the topic-model search method found the most relevant patient-safety event reports, though it missed relevant reports identified by the other approaches. The study suggests that multiple search strategies may be combined for best results.
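A minimal sketch of topic-model-based retrieval, assuming scikit-learn's LDA: fit topics on the corpus, map a query into topic space, and rank reports by cosine similarity. The corpus, query, and ranking rule are illustrative stand-ins for the study's adaptation.

```python
# Topic-model retrieval sketch in the spirit of [20]; reports here are toy
# stand-ins for patient-safety event narratives.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.metrics.pairwise import cosine_similarity

reports = ["patient fell out of bed at night", "medication dose error",
           "fall in bathroom unwitnessed", "IV pump alarm ignored",
           "slipped on wet floor near bed"] * 20

vec = CountVectorizer().fit(reports)
lda = LatentDirichletAllocation(n_components=3, random_state=0)
doc_topics = lda.fit_transform(vec.transform(reports))

query_topics = lda.transform(vec.transform(["patient fall from bed"]))
scores = cosine_similarity(query_topics, doc_topics)[0]
print([reports[i] for i in np.argsort(scores)[::-1][:3]])   # top-3 matches
```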

Istephan et al [21] examined the feasibility of a framework for efficient querying of large volumes of medical imaging data, using epilepsy as the domain focus. For a query, the proposed framework uses structured data as a filter and then performs feature extraction in a distributed manner via Hadoop [22]. The study used two types of criteria to validate the feasibility of the proposed framework: the ability/accuracy of fulfilling an advanced medical query and the efficiency that Hadoop provides.

Many prior studies in the field have mined EHR data to predict clinical outcomes for clinical research. The study by Zhang et al [23] developed a data-driven methodology for extracting common clinical pathways from EHRs. The study modeled each patient's multidimensional clinical records as one-dimensional sequences using novel constructs designed to capture information on each visit's purpose, procedures, medications, and diagnoses. Characterizing visit sequences as Markov chains, the authors extracted significant transitions and presented them as clinical pathways. The approach was tested on 1,576 chronic kidney disease patient cases.

The study by Pettey et al [24] applied network projection methods to explore the co-occurrence of diagnosis codes among homeless patients in large clinical datasets consisting of administrative ICD-9 codes. A key contribution of the method is the ability to visualize events with temporality and proximity to homelessness. By adapting network projection methods from social network analysis and applying them to a use case of homelessness among military veterans, the study confirmed existing knowledge regarding risk factors and uncovered possible pathways to homelessness among such individuals.
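The Markov-chain step can be illustrated compactly: count visit-to-visit transitions and keep the high-probability ones. The visit types and the 0.4 threshold below are hypothetical; the authors' notion of "significant" transitions in [23] is more involved than a simple cutoff.

```python
# Minimal sketch of treating visit sequences as a first-order Markov chain and
# extracting high-probability transitions, loosely following the idea in [23].
from collections import Counter, defaultdict

sequences = [["referral", "lab", "nephrology", "lab", "nephrology"],
             ["referral", "nephrology", "imaging", "nephrology"],
             ["referral", "lab", "nephrology", "imaging"]] * 10

counts = defaultdict(Counter)
for seq in sequences:
    for a, b in zip(seq, seq[1:]):
        counts[a][b] += 1                      # tally observed transitions

# Keep transitions whose estimated probability clears a (hypothetical) threshold.
pathway = {(a, b): n / sum(nexts.values())
           for a, nexts in counts.items()
           for b, n in nexts.items()
           if n / sum(nexts.values()) >= 0.4}
for (a, b), p in sorted(pathway.items(), key=lambda kv: -kv[1]):
    print(f"{a} -> {b}: {p:.2f}")
```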

Mining Longitudinal Big Data

Large amounts of pooled longitudinal observational medical data potentially hold a wealth of information and have been recognized as potential sources for gaining new knowledge. However, the application of data mining and machine learning methods to temporal data is still a developing research area and extracting useful information from such data series is a challenging task.

Using the Health Improvement Network (THIN) database, an electronic healthcare database containing UK primary-care records with time-stamped medical events (e.g., myocardial infarction or vomiting) and drug prescriptions for over 3.7 million active patients, Reps et al [25] sought to detect adverse drug events using a set of causality considerations developed by the epidemiologist Bradford Hill. The Bradford Hill considerations examine various perspectives of a drug-and-outcome relationship to determine whether it shows causal traits. While the problem of adverse drug effect (ADE) detection using machine learning has been studied before, the novelty of this work is its effort to enrich the model with causality considerations to distinguish more effectively between ADEs and non-ADEs.

Xie et al [26] put longitudinal big data to a very interesting use: predicting the number of hospital days in the coming year for individuals from a general insured population based on their insurance claims data. They generated four bagged-regression-tree prediction models by windowing the data into four different timescales (bimonthly, quarterly, half-yearly, and yearly) to explore the impact of varying the temporal resolution on the predictive power of such models. They evaluated their models on three populations in the claims data: the whole population, a senior sub-population, and a non-senior sub-population. The half-yearly windowed model delivered the best performance on all three populations, predicting 'no hospitalization' versus 'at least one day in hospital' with a Matthews correlation coefficient (MCC) of 0.426. This study is very useful for insurance companies planning future policies and the cost of care.

In a different yet related study on hospitalization cost, Warner et al [27] proposed to detect hospital-acquired complications (HACs) using temporal clinical data. This work could be particularly useful given that the early detection and prevention of HACs could greatly reduce strains on the US healthcare system and improve patient morbidity and mortality rates. They developed a machine-learning model with two classifiers for predicting the occurrence of HACs within five distinct categories: medical, surgical, bleeding/clotting, infectious, and other complications. As reported, at least $10 billion of excess hospital costs could be saved in the US alone using their model.
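As a small illustration of the evaluation used by Xie et al, the sketch below trains a bagged regression tree on synthetic windowed claim counts and scores the binarized 'no hospitalization' versus 'at least one day' outcome with MCC. The data, features, and 0.5 binarization threshold are placeholders, not the study's claims data or model configuration.

```python
# Illustrative use of the Matthews correlation coefficient (MCC), the metric
# reported in [26], with bagged regression trees on synthetic claim counts.
import numpy as np
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import matthews_corrcoef

rng = np.random.default_rng(2)
X = rng.poisson(2.0, size=(1000, 6))            # e.g., windowed claim counts
days = np.maximum(0, X[:, 0] + X[:, 1] - 4 + rng.normal(size=1000)).round()

model = BaggingRegressor(DecisionTreeRegressor(), n_estimators=50,
                         random_state=0).fit(X[:800], days[:800])
pred_days = model.predict(X[800:])

# Binarize both truth and prediction before computing MCC.
mcc = matthews_corrcoef(days[800:] > 0, pred_days > 0.5)
print(f"MCC = {mcc:.3f}")
```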

For all the knowledge Big Data has to offer, it suffers from the problems of censoring and missing data. In their work, Vock et al [28] presented a general-purpose approach to account for right-censored outcomes (which occur when the outcome is not observed during the study, whether because the patient dropped out or did not experience the outcome) in temporal data using inverse probability of censoring weighting (IPCW). They incorporated IPCW into a number of machine-learning algorithms used to mine big healthcare data, including Bayesian networks, k-nearest neighbors, decision trees, and generalized additive models. Their approach led to better predictions than conventional ad-hoc approaches when applied to predicting the 5-year risk of experiencing a cardiovascular adverse event using clinical data from a large U.S. Midwestern healthcare system.

To address the problem of missing data, and to avoid traditional methods that often discard records with missing values, Lin et al [29] proposed twelve different rough-set-based measures to assess the strength of adverse drug effect signals, based on characteristic-set-based approximation. Six of the twelve measures proposed in their rough-set-based approach were feasible for signal detection. Experiments with these measures showed that their timeline warning of suspicious ADE signals was similar to that of traditional methods that delete missing elements, and sometimes detection was earlier than with those methods.

In another effort to address the missing-values problem in temporal data, Rahman et al [30] proposed an imputation method called FLk-NN that enables imputation of missing values even when all data at a time point are missing and when there are different types of missing data both within and across variables. FLk-NN incorporates time-lagged correlations both within and across variables by combining two imputation methods, based on an extension to k-NN and on the Fourier transform. When applied to three datasets (simulated and actual Type 1 diabetes datasets, and multi-modality neurological ICU monitoring), FLk-NN achieved the highest imputation accuracy compared with several methods representing different types of imputation, such as MCI [31] and EM [32]. The authors report that their method can impute data even if 50% of a dataset is missing, with many consecutively missing values and in the presence of fully empty instances in the data.
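The core of IPCW can be sketched briefly: estimate the censoring distribution, keep subjects whose status at the prediction horizon is known, and weight each by the inverse probability of remaining uncensored, so a standard classifier can be trained on the weighted data. The sketch assumes the lifelines package for the Kaplan-Meier estimate; the data and 5-year horizon are synthetic, and the formulation in [28] is more general.

```python
# Hedged sketch of inverse probability of censoring weighting (IPCW), the idea
# behind [28]; everything below is synthetic illustration.
import numpy as np
from lifelines import KaplanMeierFitter
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(3)
n, horizon = 2000, 5.0
X = rng.normal(size=(n, 4))
event_time = rng.exponential(8.0 * np.exp(-0.5 * X[:, 0]))   # true event times
censor_time = rng.exponential(10.0, n)                       # dropout times
observed = np.minimum(event_time, censor_time)
had_event = event_time <= censor_time

# Fit the censoring distribution G(t) by reversing the event indicator.
km = KaplanMeierFitter().fit(observed, event_observed=~had_event)

# Keep subjects whose 5-year status is known: event before the horizon, or
# followed past the horizon event-free; weight each by 1 / G(min(T, horizon)).
known = (had_event & (observed <= horizon)) | (observed > horizon)
t_eval = np.minimum(observed[known], horizon)
weights = 1.0 / km.survival_function_at_times(t_eval).to_numpy()
label = (had_event & (observed <= horizon))[known]

clf = RandomForestClassifier(random_state=0)
clf.fit(X[known], label, sample_weight=weights)
print(clf.predict_proba(X[:5])[:, 1])            # 5-year risk estimates
```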

In sum, the published manuscripts in this issue provide new methods to extract knowledge from different types of clinical Big Data. With the rapid increase in the three Vs of Big Data (volume, variety, and velocity), the development of innovative and scalable computational methods is an urgent and pressing need.

Samah Fodeh, PhD
Assistant Professor
Yale University
New Haven, CT, United States
E-mail address: [email protected]


Qing Zeng, PhD
Professor of Biomedical Informatics
George Washington University
Washington, DC, United States
E-mail address: [email protected]


References

1. Laney, D., 3D data management: Controlling data volume, velocity and variety. META Group Research Note, 2001. 6: p. 70.
2. Roski, J., G.W. Bo-Linn, and T.A. Andrews, Creating value in health care through big data: opportunities and policy implications. Health Affairs, 2014. 33(7): p. 1115-1122.
3. Kalil, T., Big Data is a Big Deal. 2012 [cited 6/7/2016].
4. McNutt, T., et al., OncoSpace: A new paradigm for clinical research and decision support in radiation oncology. In: International Conference on the Use of Computers in Radiation Therapy, 2010.
5. Chen, M., S. Mao, and Y. Liu, Big data: a survey. Mobile Networks and Applications, 2014. 19(2): p. 171-209.
6. Hachman, M., Microsoft launches 'HealthVault' records-storage site. Retrieved October 4, 2007.
7. Levy, W.C., et al., The Seattle Heart Failure Model: prediction of survival in heart failure. Circulation, 2006. 113(11): p. 1424-1433.
8. Koelling, T.M., S. Joseph, and K.D. Aaronson, Heart failure survival score continues to predict clinical outcomes in patients with heart failure receiving β-blockers. The Journal of Heart and Lung Transplantation, 2004. 23(12): p. 1414-1422.
9. Brophy, J.M., et al., A multivariate model for predicting mortality in patients with heart failure and systolic dysfunction. The American Journal of Medicine, 2004. 116(5): p. 300-304.
10. Hassanien, A.E., et al., Rough sets in medical informatics applications. In: Applications of Soft Computing, 2009, Springer. p. 23-30.
11. Califf, R.M. and M.J. Pencina, Predictive models in heart failure: Who cares? Circulation: Heart Failure, 2013. 6(5): p. 877-878.
12. Fodeh, S. and Q. Zeng, Call for papers: Special issue on mining big data in biomedicine and health care. Journal of Biomedical Informatics, 2014. 51: p. 1-2.
13. Dunlay, S.M., N.L. Pereira, and S.S. Kushwaha, Contemporary strategies in the diagnosis and management of heart failure. Mayo Clinic Proceedings, 2014. Elsevier.
14. Taslimitehrani, V., et al., Developing EHR-driven heart failure risk prediction models using CPXR(Log) with the probabilistic loss function. Journal of Biomedical Informatics, 2016. 60: p. 260-269.
15. Anderson, A.E., et al., Electronic health record phenotyping improves detection and screening of type 2 diabetes in the general United States population: A cross-sectional, unselected, retrospective study. Journal of Biomedical Informatics, 2015.
16. Pineda, A.L., et al., Comparison of machine learning classifiers for influenza detection from emergency department free-text reports. Journal of Biomedical Informatics, 2015. 58: p. 60-69.
17. Feldman, K., D. Davis, and N.V. Chawla, Scaling and contextualizing personalized healthcare: A case study of disease prediction algorithm integration. Journal of Biomedical Informatics, 2015. 57: p. 377-385.
18. Jung, K. and N.H. Shah, Implications of non-stationarity on predictive modeling using EHRs. Journal of Biomedical Informatics, 2015. 58: p. 168-174.
19. Yao, L., et al., Discovering treatment pattern in Traditional Chinese Medicine clinical cases by exploiting supervised topic model and domain knowledge. Journal of Biomedical Informatics, 2015. 58: p. 260-267.
20. Fong, A., A.Z. Hettinger, and R.M. Ratwani, Exploring methods for identifying related patient safety events using structured and unstructured data. Journal of Biomedical Informatics, 2015. 58: p. 89-95.
21. Istephan, S. and M.-R. Siadat, Unstructured medical image query using big data - An epilepsy case study. Journal of Biomedical Informatics, 2016. 59: p. 218-226.
22. White, T., Hadoop: The Definitive Guide. 2009: O'Reilly Media, Inc.
23. Zhang, Y., R. Padman, and N. Patel, Paving the COWpath: Learning and visualizing clinical pathways from electronic health record data. Journal of Biomedical Informatics, 2015. 58: p. 186-197.
24. Pettey, W.B., et al., Using network projections to explore co-incidence and context in large clinical datasets: Application to homelessness among US Veterans. Journal of Biomedical Informatics, 2016. 61: p. 203-213.
25. Reps, J.M., et al., A supervised adverse drug reaction signalling framework imitating Bradford Hill's causality considerations. Journal of Biomedical Informatics, 2015. 56: p. 356-368.
26. Xie, Y., et al., Analyzing health insurance claims on different timescales to predict days in hospital. Journal of Biomedical Informatics, 2016. 60: p. 187-196.
27. Warner, J.L., et al., Classification of hospital acquired complications using temporal clinical information from a large electronic health record. Journal of Biomedical Informatics, 2016. 59: p. 209-217.
28. Vock, D.M., et al., Adapting machine learning techniques to censored time-to-event health record data: A general-purpose approach using inverse probability of censoring weighting. Journal of Biomedical Informatics, 2016. 61: p. 119-131.
29. Lin, W.-Y., et al., Rough-set-based ADR signaling from spontaneous reporting data with missing values. Journal of Biomedical Informatics, 2015. 58: p. 235-246.
30. Rahman, S.A., et al., Combining Fourier and lagged k-nearest neighbor imputation for biomedical time series data. Journal of Biomedical Informatics, 2015. 58: p. 198-207.
31. Allison, P.D., Missing Data. Vol. 136. 2001: Sage Publications.
32. Schneider, T., Analysis of incomplete climate data: Estimation of mean values and covariance matrices and imputation of missing values. Journal of Climate, 2001. 14(5): p. 853-871.