Can J Diabetes 39 (2015) 235–238
Canadian Journal of Diabetes journal homepage: www.canadianjournalofdiabetes.com
Perspectives in Practice
Clinical Diabetes Research Using Data Mining: A Canadian Perspective
Baiju R. Shah MD, PhD a,b,c,*, Lorraine L. Lipscombe MD, MSc a,b,d
a Department of Medicine, University of Toronto, Toronto, Canada
b Institute for Clinical Evaluative Sciences, Toronto, Canada
c Department of Medicine, Sunnybrook Health Sciences Centre, Toronto, Canada
d Department of Medicine, Women’s College Hospital, Toronto, Canada
Article info
Article history: Received 1 December 2014; received in revised form 11 February 2015; accepted 23 February 2015
Keywords: administrative data; data mining; research methods

Abstract
With the advent of the digitization of large amounts of information and the computer power capable of analyzing this volume of information, data mining is increasingly being applied to medical research. Datasets created for administration of the healthcare system provide a wealth of information from different healthcare sectors, and Canadian provinces’ single-payer universal healthcare systems mean that data are more comprehensive and complete in this country than in many other jurisdictions. The increasing ability to also link clinical information, such as electronic medical records, laboratory test results and disease registries, has broadened the types of data available for analysis. Data-mining methods have been used in many different areas of diabetes clinical research, including classic epidemiology, effectiveness research, population health and health services research. Although methodologic challenges and privacy concerns remain important barriers to using these techniques, data mining remains a powerful tool for clinical research. © 2015 Canadian Diabetes Association
* Address for correspondence: Baiju R. Shah, MD, PhD, G106, 2075 Bayview Avenue, Toronto, Ontario M4N 3M5, Canada. E-mail address: [email protected]
1499-2671/$ - see front matter © 2015 Canadian Diabetes Association
http://dx.doi.org/10.1016/j.jcjd.2015.02.005

Introduction

Data mining is the process of exploring and analyzing large datasets to find previously unknown relationships or correlations and to summarize the data in novel ways that are both understandable and useful (1). Applications of data mining, such as credit card fraud detection and the targeting of marketing campaigns, are common in many industries. The mining of health data is being applied increasingly in medical research to track healthcare patterns and explore disease hypotheses. Here, we review the unique Canadian role in clinical research using data mining, some applications of data mining in clinical diabetes research and some ongoing challenges facing the application of data mining in health research.

Data mining has become possible because of the advent of big data: the creation and collection of vast amounts of information in a digital format. Digitized data have dramatically increased in
volume, velocity and variety. According to IBM, 2.5 exabytes (2,500,000,000,000 megabytes) of data were generated each day in 2012. The types of information being collected have also expanded to include both structured and unstructured data, such as images, videos, sensor outputs and gene sequences. Major advances in computing power have, therefore, been necessary to allow meaningful and efficient access to and analysis of such large volumes of data. In computer science, Moore’s law observes that microprocessor complexity doubles approximately every 2 years (2); hence, the average smartphone today has computing power that is several orders of magnitude greater than that of the original IBM personal computer. These 2 parallel technologic advances of the past few decades have been the catalysts that permitted the development of modern data mining.

As in other industries, the digitization of data has been adopted increasingly in the healthcare setting. Administrative and financial information in large institutions such as hospitals has long been managed electronically; now, clinical information such as radiology images and primary care physicians’ records is also being digitized. In all cases, these large datasets are collected for reasons other than analysis, such as for clinical care or administration of the healthcare system. However, they can be secondarily exploited for clinical research purposes. (Data mining techniques are also used to exploit large datasets in other areas of health research, such as genome-wide association studies or functional neuroimaging, but these are beyond the scope of this review.)

In the past, observational health research relied on small registries, health surveys or selected cohorts. The use of large secondary databases increases the potential sample size and power for such studies and provides unique opportunities to explore novel questions in unselected population-based cohorts.
Furthermore, because these databases are being created anyway for other purposes, both the costs of doing research and the time taken to obtain results can be substantially lower than when data are collected prospectively.

Canada’s Place in Healthcare Data Mining

Canada’s healthcare system is uniquely positioned to develop large databases suitable for population-based research (3). Within each province and territory, virtually every permanent resident is eligible for healthcare services provided by a single payer; therefore, the data collected to administer the system include virtually the entire population. Very few inpatient and physician services fall outside the public system, so virtually all healthcare interactions in these sectors can be captured. In addition, each province and territory assigns a unique personal identifier (the health card number) to all residents, which is captured as part of clinical care; therefore, individuals’ records from various databases and various healthcare sectors can be linked.

Administrative data sources detailing many different healthcare sectors are available (though availability varies among provinces), including health insurance registration data, inpatient and outpatient hospital care, day surgery, emergency department care, physician services, homecare, long-term care, vital statistics, prescriptions and healthcare provider data. These are increasingly being joined by large databases of clinical information, including laboratory tests, healthcare providers’ electronic medical records, and cancer- and other disease-based registries.

Several Canadian provinces have developed data warehouses, such as Population Data BC, the Manitoba Centre for Health Policy, Ontario’s Institute for Clinical Evaluative Sciences and Health Data Nova Scotia.
The purpose of these warehouses is to provide a more convenient repository for these databases and advanced expertise in the linkage and analyses of information for research and healthcare evaluation.
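The linkage of individuals' records across databases by a unique personal identifier, as described above, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration rather than a description of any warehouse's actual procedure: the field names, the sample records, the secret key, and the choice of a keyed hash (HMAC-SHA256) to stand in for the real identifier.

```python
import hashlib
import hmac
from collections import defaultdict

# Hypothetical secret held by the data custodian (assumption, not from the article).
LINKAGE_KEY = b"example-secret-key"


def encode_id(health_card_number: str) -> str:
    """Replace a health card number with a keyed hash so that records can be
    linked without exposing the original identifier."""
    return hmac.new(LINKAGE_KEY, health_card_number.encode(), hashlib.sha256).hexdigest()


def link_records(*datasets):
    """Group records from several named datasets by encoded identifier.

    Each dataset is a (name, rows) pair; each row is a dict that contains a
    'health_card_number' field (hypothetical field name).
    """
    linked = defaultdict(lambda: defaultdict(list))
    for name, rows in datasets:
        for row in rows:
            row = dict(row)  # copy so the caller's data are not modified
            key = encode_id(row.pop("health_card_number"))
            linked[key][name].append(row)
    return linked


# Illustrative records only; the diagnosis values are example ICD codes.
hospital = [
    {"health_card_number": "1234567890", "admit_date": "2012-03-01", "diagnosis": "E11"},
]
claims = [
    {"health_card_number": "1234567890", "service_date": "2012-01-15", "diagnosis": "250"},
    {"health_card_number": "9876543210", "service_date": "2012-02-20", "diagnosis": "401"},
]

linked = link_records(("hospital", hospital), ("claims", claims))
for key, sources in linked.items():
    # Print a short prefix of the encoded identifier and record counts per source.
    print(key[:8], {name: len(rows) for name, rows in sources.items()})
```

Because the same identifier always produces the same encoded key, records for one person from the hospital and physician-claims sources fall into the same group, while the health card number itself never appears in the linked output.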
Uses of Data Mining for Clinical Diabetes Research

Modern data mining can be used in the same way that population-level data have always been used: for classic epidemiologic studies searching for patterns of and causes of disease in populations. When John Snow became the father of modern epidemiology while trying to understand the 1854 cholera outbreak in London, he was using a 19th century version of big data: a map of Soho plotting the homes of cholera victims. With 21st century digitized healthcare administrative data, we can continue to perform such classic epidemiologic studies, such as comparing diabetes prevalence across provinces (4) or among neighbourhoods within a city (5). These epidemiologic techniques can also be used to detect associations between diabetes and other medical conditions that are not traditional complications of the disease, such as hip fracture, breast cancer or sepsis (6–8).

By virtue of being “big,” the data available from healthcare administrative sources may be useful for finding associations that would be too difficult to find in smaller datasets. For example, the relationships between gestational diabetes and cardiovascular risk factors and between gestational diabetes and proxies for cardiovascular disease were established with prospective cohort studies and cross-sectional analyses of clinical data. However, only mining of large databases could find an association between gestational diabetes and cardiovascular events because such events are so rare in reproductive-age women (9).

Phase IV postmarketing surveillance for adverse events caused by new drugs requires large numbers of patients with a wide variety of medical conditions to be exposed to drugs over the long term. The adverse events are often very rare and not necessarily known or suspected a priori. Hence, data mining is a particularly valuable tool for drug safety and pharmacoepidemiologic research.
Data mining was used in many of the studies evaluating whether there is a cardiovascular risk associated with thiazolidinediones (10–12) and in studies linking statins and fluoroquinolones with glucose abnormalities (13–15). Data mining also contributes to understanding the safety of older pharmaceutical agents, such as sulfonylureas and metformin (16–19). Distributed data networks are being created to enable evaluation of rarer adverse drug effects across very large populations. For example, the Canadian Network for Observational Drug Effect Studies (CNODES) is a pan-Canadian collaboration of researchers that provides information regarding drug safety by combining data from 9 population-based databases using standardized methods (13,20,21).

Another avenue of clinical research in which analysis of secondary data sources is invaluable is effectiveness research: studying whether an intervention works in routine clinical care (in contrast to efficacy research, which studies whether it works in carefully selected patients in idealized settings such as a randomized trial). Because effectiveness, by definition, requires examination of the real-world care of “regular” patients, routinely collected administrative data are an ideal source for conducting this research. For example, although the use of spironolactone in patients with heart failure reduced mortality in a randomized trial, its use in routine clinical care was associated with an abrupt increase in hyperkalemia-associated mortality (22).

Similarly, these data are ideally suited for population health questions because they include information on the entire unselected population. Thus, Canadian researchers have used data mining to explore the relationships among neighbourhood walkability, poverty and diabetes risk (23) and the influence of sex and ethnicity on macrovascular complications of diabetes (24,25).
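For distributed networks such as the one described above, one common way to combine results from several databases is to analyze each database separately and then pool the site-level estimates. The sketch below shows fixed-effect inverse-variance pooling of log ratios. This is an assumption chosen for illustration: the article does not state which pooling method is used, and the site names, ratios and standard errors are invented.

```python
import math


def pool_fixed_effect(estimates):
    """Inverse-variance (fixed-effect) pooling.

    `estimates` is a list of (log_ratio, standard_error) pairs, one per site.
    Returns the pooled log ratio and its standard error.
    """
    weights = [1.0 / se ** 2 for _, se in estimates]
    total_weight = sum(weights)
    pooled = sum(w * b for w, (b, _) in zip(weights, estimates)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    return pooled, pooled_se


# Hypothetical site-level results: (log of the ratio estimate, standard error).
site_results = {
    "Site A": (math.log(1.20), 0.10),
    "Site B": (math.log(1.10), 0.15),
    "Site C": (math.log(1.30), 0.20),
}

log_rr, se = pool_fixed_effect(list(site_results.values()))
rr = math.exp(log_rr)
ci = (math.exp(log_rr - 1.96 * se), math.exp(log_rr + 1.96 * se))
print(f"Pooled ratio {rr:.2f} (95% CI {ci[0]:.2f} to {ci[1]:.2f})")
```

Only summary estimates leave each site in this design, so no individual-level records need to be shared. A random-effects model may be preferable when results differ substantially among sites.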
Data mining can also be used for health services research and evaluation of healthcare system performance. Because these data are often collected for healthcare administrative purposes, they are particularly suited for this type of research. For example, the impact
of changes to healthcare programs or policies can be ascertained in “natural experiments” (26,27), and the quality of care delivered to patients by the healthcare system can be evaluated (28–30).

Methodologic Challenges

Thus, there are many lines of clinical research for which the mining of large secondary data sources is a useful and efficient methodology. However, there remain several limitations to these methods that will continue to challenge researchers, decision makers, clinicians and others using the outputs of these analyses.

First, because the data are often collected for other purposes (such as healthcare administration or clinical care), they may not be collected in a standardized or consistent format, and important information may be missing or unreliable. For example, physician service claims include 1 or more diagnostic codes that are meant to describe the conditions being dealt with during a particular encounter with a patient. However, because the diagnostic coding usually has little bearing on the amount paid to the physician, neither the physician submitting the claim nor the health ministry receiving it has much at stake in ensuring its accuracy. Thus, this essential piece of diagnostic information that is relied upon in analyses to identify patients with particular diagnoses may, in fact, be invalid. Users of these databases have to be knowledgeable and transparent about how the data were collected, their accuracy and their gaps. This limitation highlights the necessity of good data-mining methods; to produce high-quality impactful research, the algorithms for identifying diseases and capturing outcomes must be validated. For example, algorithms that identify diabetes using administrative databases have been validated in several different Canadian populations (31–36).

Second, the protection of privacy is paramount when using these data, particularly when the data were not originally collected for research purposes.
Individuals included in the data have not given express consent for inclusion in research, so the data must be anonymized before they can be analyzed. This process is often fairly straightforward for administrative data, where a health card number can be encrypted, or a date of birth can be converted to a year of birth. However, there are significant challenges in anonymizing text data such as electronic health records because identifying information such as names and addresses can be included in text fields (37). Furthermore, identification of an individual may become possible, even with anonymized data, when linking different datasets from various sources; procedures to guard against this are essential.

Third, there is heterogeneity among provinces in the scope of available data. For example, some provinces collect data concerning prescriptions filled by all residents, whereas others record only prescriptions filled by residents older than 65 years. As a result, research questions in which this additional data breadth is important can be answered only in some provinces (20). Even when common data are available, provincial privacy legislation limits sharing of raw data, so analyses have to be conducted within each jurisdiction and pooled to create national estimates (20). Across all provinces, there are very limited data concerning utilization of nonphysician care providers, such as diabetes educators, who nonetheless are essential parts of clinical care. Furthermore, in several provinces data availability remains very limited, and gaining access to data for research purposes is time consuming and expensive. These barriers will have to be overcome to maximize the comprehensiveness and rigour of national data mining research.

Fourth, as with all observational research, residual confounding by unmeasured or unknown factors remains a significant concern.
This is particularly relevant with secondary databases, which have substantial breadth but which tend to lack depth in terms of potential confounding variables, such as body mass index, blood
pressure levels and smoking status. Methodologic advances such as high-dimensional propensity scores seek to address these limitations (38), but ongoing development of new analytic techniques and data linkages is needed.

Last, to be most powerful, large databases have to be linked together so that a broader view of patients’ health status and healthcare utilization can be obtained. However, the databases are often siloed, and making these linkages can be difficult. Indeed, connecting databases from different sectors of the healthcare system can be challenging enough; linking data concerning important determinants of health, such as income, education, employment and immigration, is even more challenging. The development of data warehouses in various provinces has facilitated this process, but truly national studies remain challenging.

Conclusions

In conclusion, the mining of large databases can yield important research findings that influence clinical care and healthcare delivery. The healthcare systems of Canadian provinces have created large repositories of administrative data covering entire populations, which are increasingly being joined by other large clinical databases and registries. There are important barriers to using these data, including both methodologic challenges and privacy concerns. However, the mining of these growing data sources provides a unique and powerful tool to explore disease hypotheses, investigate population health and identify healthcare patterns so as to understand and improve the health of Canadians.

Author Contributions

Both authors contributed substantially to the conception of the article, drafted or revised it for important intellectual content, and gave final approval of the version to be published.

References

1. Hand D, Mannila H, Smyth P. Principles of data mining. Cambridge: The MIT Press, 2001.
2. Moore GE. Cramming more components onto integrated circuits. Electronics 1965;38:114–7.
3. Brown C. Barriers to accessing data are bad medicine. CMAJ 2014;186:1203.
4. Public Health Agency of Canada. Diabetes in Canada: Facts and figures from a public health perspective. Ottawa: Public Health Agency of Canada, 2011.
5. Glazier RH, Booth GL, Gozdyra P, et al. Neighbourhood environments and resources for healthy living: A focus on diabetes in Toronto. Toronto: Institute for Clinical Evaluative Sciences, 2007.
6. Lipscombe LL, Jamal SA, Booth GL, Hawker GA. The risk of hip fractures in older individuals with diabetes: A population-based study. Diabetes Care 2007;30:835–41.
7. Lipscombe LL, Goodwin PJ, Zinman B, et al. Diabetes mellitus and breast cancer: A retrospective population-based cohort study. Breast Cancer Res Tr 2006;98:349–56.
8. Shah BR, Hux JE. Quantifying the risk of infectious diseases for people with diabetes. Diabetes Care 2003;26:510–3.
9. Shah BR, Retnakaran R, Booth GL. Increased risk of cardiovascular disease in young women following gestational diabetes mellitus. Diabetes Care 2008;31:1668–9.
10. Lipscombe LL, Gomes T, Levesque LE, et al. Thiazolidinediones and cardiovascular outcomes in older patients with diabetes. JAMA 2007;298:2634–43.
11. Dormuth CR, Maclure M, Carney G, et al. Rosiglitazone and myocardial infarction in patients previously prescribed metformin. PLOS ONE 2009;4:e6080.
12. Juurlink DN, Gomes T, Lipscombe LL, et al. Adverse cardiovascular events during treatment with pioglitazone and rosiglitazone: Population-based cohort study. BMJ 2009;339:b2942.
13. Dormuth CR, Filion KB, Paterson JM, et al. Higher potency statins and the risk of new diabetes: Multicentre, observational study of administrative databases. BMJ 2014;348:g3244.
14. Carter AA, Gomes T, Camacho X, et al. Risk of incident diabetes among patients treated with statins: Population-based study. BMJ 2013;346:f2610.
15. Park-Wyllie LY, Juurlink DN, Kopp A, et al. Outpatient gatifloxacin therapy and dysglycemia in older adults. New Engl J Med 2006;354:1352–61.
16. Abdelmoneim AS, Eurich DT, Gamble JM, et al. Risk of acute coronary events associated with glyburide compared with gliclazide use in patients with type 2 diabetes: A nested case-control study. Diabetes Obes Metab 2014;16:22–9.
17. McAlister FA, Eurich DT, Majumdar SR, Johnson JA. The risk of heart failure in patients with type 2 diabetes treated with oral agent monotherapy. Eur J Heart Fail 2008;10:703–8.
18. Eurich DT, Majumdar SR, McAlister FA, et al. Improved clinical outcomes associated with metformin in patients with diabetes and heart failure. Diabetes Care 2005;28:2345–51.
19. Stang M, Wysowski DK, Butler-Jones D. Incidence of lactic acidosis in metformin users. Diabetes Care 1999;22:925–7.
20. Suissa S, Henry D, Caetano P, et al. CNODES: the Canadian Network for Observational Drug Effect Studies. Open Med 2012;6:e134–40.
21. Lipscombe LL, Austin PC, Alessi-Severini S, et al. Atypical antipsychotics and hyperglycemic emergencies: Multicentre, retrospective cohort study of administrative data. Schizophr Res 2014;154:54–60.
22. Juurlink DN, Mamdani MM, Lee DS, et al. Rates of hyperkalemia after publication of the Randomized Aldactone Evaluation Study. New Engl J Med 2004;351:543–51.
23. Booth GL, Creatore MI, Moineddin R, et al. Unwalkable neighborhoods, poverty, and the risk of diabetes among recent immigrants to Canada compared with long-term residents. Diabetes Care 2013;36:302–8.
24. Khan NA, Wang H, Anand S, et al. Ethnicity and sex affect diabetes incidence and outcomes. Diabetes Care 2011;34:96–101.
25. Shah BR, Victor JC, Chiu M, et al. Cardiovascular complications and mortality after diabetes diagnosis for South Asian and Chinese patients: A population-based cohort study. Diabetes Care 2013;36:2670–6.
26. Kiran T, Victor JC, Kopp A, et al. The relationship between financial incentives and quality of diabetes care in Ontario, Canada. Diabetes Care 2012;35:1038–46.
27. Kiran T, Kopp A, Moineddin R, et al. Unintended consequences of delisting routine eye exams on retinopathy screening for people with diabetes in Ontario, Canada. CMAJ 2013;185:E167–73.
28. Brown LC, Johnson JA, Majumdar SR, et al. Evidence of suboptimal management of cardiovascular risk in patients with type 2 diabetes mellitus and symptomatic atherosclerosis. CMAJ 2004;171:1189–92.
29. Manns BJ, Tonelli M, Zhang J, et al. Enrolment in primary care networks: Impact on outcomes and processes of care for patients with diabetes. CMAJ 2012;184:e144–52.
30. Shah BR, Cauch-Dudek K, Pigeau L. Diabetes prevalence and care in the Métis population of Ontario, Canada. Diabetes Care 2011;34:2555–6.
31. Hux JE, Ivis F, Flintoft V, Bica A. Diabetes in Ontario: Determination of prevalence and incidence using a validated administrative data algorithm. Diabetes Care 2002;25:512–6.
32. Chen G, Khan N, Walker R, Quan H. Validating ICD coding algorithms for diabetes mellitus from administrative data. Diabetes Res Clin Pract 2010;89:189–95.
33. Southern DA, Roberts B, Edwards A, et al. Validity of administrative data claim-based methods for identifying individuals with diabetes at a population level. Can J Public Health 2010;101:61–4.
34. Guttmann A, Nakhla M, Henderson M, et al. Validation of a health administrative data algorithm for assessing the epidemiology of diabetes in Canadian children. Pediatr Diabetes 2010;11:122–8.
35. Dart AB, Martens PJ, Sellers EA, et al. Validation of a pediatric diabetes case definition using administrative health data in Manitoba, Canada. Diabetes Care 2011;34:898–903.
36. Amed S, Vanderloo SE, Metzger D, et al. Validation of diabetes case definitions using administrative claims data. Diabet Med 2011;28:424–7.
37. Tu K, Klein-Geltink J, Mitiku TF, et al. De-identification of primary care electronic medical records free-text data in Ontario, Canada. BMC Med Inform Decis 2010;10:35.
38. Schneeweiss S, Rassen JA, Glynn RJ, et al. High-dimensional propensity score adjustment in studies of treatment effects using healthcare claims data. Epidemiology 2009;20:512–22.