Using the Electronic Medical Record to Identify Patients at High Risk for Frequent Emergency Department Visits and High System Costs


Accepted Manuscript

David W. Frost, MD, Shankar Vembu, PhD, Jiayi Wang, B.Sc, Karen Tu, MD, Quaid Morris, PhD, Howard B. Abrams, MD

PII: S0002-9343(16)31308-0
DOI: 10.1016/j.amjmed.2016.12.008
Reference: AJM 13846
To appear in: The American Journal of Medicine
Received Date: 29 July 2016
Revised Date: 2 December 2016
Accepted Date: 2 December 2016

Please cite this article as: Frost DW, Vembu S, Wang J, Tu K, Morris Q, Abrams HB, Using the Electronic Medical Record to Identify Patients at High Risk for Frequent ED Visits and High System Costs, The American Journal of Medicine (2017), doi: 10.1016/j.amjmed.2016.12.008. This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.


Disclosures: None.

Corresponding author: Dr. David W Frost, Toronto Western Hospital, New East Wing 8-424, 399 Bathurst St., Toronto, Ontario, Canada M5T2S8. Tel 416-603-5800 x 2936. Fax 416-603-6495. Email [email protected]

David W Frost, MD (a,c,d,j); Shankar Vembu, PhD (e,j); Jiayi Wang, B.Sc (e,j); Karen Tu, MD (b,c,j,k); Quaid Morris, PhD (e,f,g,h,i,j); Howard B Abrams, MD (a,c,d,j)

a Division of General Internal Medicine
b Department of Family and Community Medicine and Institute of Health Policy, Management and Evaluation, University of Toronto
c University Health Network
d OpenLab at University Health Network
e Donnelly Center for Cellular and Biomolecular Research
f Banting and Best Department of Medical Research
Departments of g Medical Genetics, h Electrical and Computer Engineering, i Computer Science
j University of Toronto
k Institute for Clinical Evaluative Sciences

Funding: This study was funded by an operating grant from “Building Bridges to Integrate Care” (BRIDGES) and supported by the Institute for Clinical Evaluative Sciences (ICES), which is funded by an annual grant from the Ontario Ministry of Health and Long-Term Care (MOHLTC). The opinions, results and conclusions reported in this paper are those of the authors and are independent from the funding sources. No endorsement by ICES or the Ontario MOHLTC is intended or should be inferred. The funder played no role in the design or execution of the study.

Conflict of Interest: None of the authors have any actual or potential conflict of interest.

Contributorship statement: All authors had access to the data. Data analysis was performed by SV, JW, and QM. DF drafted the manuscript and all authors reviewed and revised it. All authors approved the final submission.

Running head: EMR-based prediction of future high users

Keywords: Electronic medical records, machine learning, high users, frequent ED visits, predictive modelling

ABSTRACT


Background: A small proportion of patients accounts for a very high proportion of healthcare utilization. Accurate pre-emptive identification may facilitate tailored intervention. We sought to determine whether Machine Learning techniques using text from a family practice Electronic Medical Record (EMR) can be used to predict future high Emergency Department (ED) use and total costs by patients who are not yet high ED users or high cost to the healthcare system.


Methods: Text from fields of the Cumulative Patient Profile within an EMR (PS Suite) of 43,111 patients was indexed. Separate training and validation cohorts were created. After processing, 11,905 words were used to fit a logistic regression model. The primary outcomes of interest in the 12 months following prediction were 1) 3 or more ED visits and, 2) being in the top 5% in healthcare expenditures. Outcomes were assessed through linkage to administrative databases housed at the Institute for Clinical Evaluative Sciences (ICES).


Results: In the model to predict frequent ED visits, after excluding patients who were high ED users in the previous year, the area under the receiver operating characteristic curve (AUROC) was 0.71. Using the same methodology, the model to predict top 5% in total system costs had an AUROC of 0.76.


Conclusion: Machine learning techniques can be applied to analyze free text contained in EMRs. This dataset is more predictive of patients who will generate future high costs than future ED visits. It remains to be seen if these predictions can be used to reduce costs by early interventions in this cohort of patients.


INTRODUCTION


An aging population living longer with a high burden of chronic disease poses a challenge to healthcare systems, including a substantial increase in total healthcare costs, as well as in Emergency Department (ED) visits and hospital admissions [1, 2]. In the province of Ontario, 1% of the population accounts for 49% of combined hospital and homecare costs, and 5% of the population accounts for 84% of these costs [3]. Interventions are needed to reduce preventable ED visits and limit total costs if public healthcare systems are to remain sustainable in the face of these demographic pressures.


There is growing interest in identifying the “high user” patient population and targeting interventions towards it, in an effort to provide better patient care at a lower cost [4, 5]. Although identifying and targeting current “high users” of the healthcare system is logical, feasible, and often practiced, this may not be the optimal patient population on whom to focus efforts. The well-described phenomenon of “regression to the mean”, whereby patients who frequently access care become less frequent users over time in the absence of intervention [6, 7], suggests that there may be significant benefits to preemptively identifying patients who are at risk of becoming “high users” through predictive modeling. Lewis and others [8, 9] have described the concept of “impactability models”, which refers to optimizing the likelihood that an intervention will have an impact on a given population. This concept is rooted in the fact that the interventions in question, which aim to reduce ED visits, admissions, and total costs, are often themselves resource intensive and have had variable success [10-12].


There are several predictive tools in the published literature. These use various methods and data sources, including patient surveys [13-15], laboratory tests [15], administrative data [16, 17], electronic medical records (EMRs) [18-20], or combinations of these. There are also several existing tools that predict, after a sentinel hospital admission, the risk of readmission [12, 16, 21]. Many are dependent on administrative data. Few include data on social determinants of health, which are powerful predictors of future healthcare system use [22, 23]. Fewer still utilize data that can be easily accessed by primary care providers in order to identify patients within their own roster who may generate future high costs or ED visits. Our goal was to derive and validate a novel tool using data available in the Cumulative Patient Profile (CPP) of family physicians’ EMRs, to predict future ED visits and high overall system costs. Importantly, this included data on social determinants of health in addition to the past medical history, comorbidities, and medications.


The approach of using EMRs to make clinically meaningful predictions has been successfully employed in prediction of suicidality [24], readmissions [18-20], in-hospital cardiac arrest [25], and sepsis [26]. We hypothesized that the rich data in this text would have predictive value that could augment, or even obviate, the need for administrative data. To our knowledge, the use of machine learning with a free text dataset to make clinically meaningful predictions has not been reported previously. After a pilot study at a single site (Toronto Western Hospital Family Health Team), we set out to derive and validate an EMR text-based prediction algorithm. We examined 2 primary outcomes and derived a different predictive algorithm for each: 1) frequent ED visits, which, in addition to being costly [27], are associated with poor patient satisfaction [28], and 2) individual-level total healthcare system utilization cost. The study was approved by the Research Ethics Board at University Health Network, as well as the Sunnybrook Health Sciences Centre Research Ethics Board.


METHODS


Our model was derived and validated using the Electronic Medical Record Administrative data Linked Database (EMRALD) housed at the Institute for Clinical Evaluative Sciences (ICES) in Toronto, Ontario. The methodology behind this database is described in detail elsewhere [29, 30]. At the time of the study it contained approximately 300,000 patient records from across Ontario, using the market-leading PS Suite® EMR. The EMR and administrative (outcome) data are date-stamped, allowing predictions to be made based on EMR data as of a given extraction date and outcomes of interest to be observed in subsequent timeframes. To allow sufficient time to observe outcomes of interest, we confined our analysis to EMR data that were extracted between September 2011 and March 2012. The predictive model was developed based on this extraction date, and outcome data were then obtained for the one year before and the one year after the extraction date. To create the model, we used the following Cumulative Patient Profile (CPP) fields: active problem list, past medical history, family history, medications, allergies, social history, and risk factors.


Using data from the fall of 2011 to the spring of 2012 resulted in 116 family physicians from 28 clinics and over 150,000 patients eligible for this study. Figure 1 illustrates the distribution of the cohort. We required the family physician to have had the EMR for at least 12 months and the start of the patient record in the EMR to be at least 12 months prior to the date the data were extracted. The patient cohort was further restricted to patients rostered to one of the participating physicians (125,315 patients) and those with a populated CPP (101,174 patients). Through random selection, half the cohort was designated the “training cohort” and half the “validation cohort” (50,587 each). We then limited the cohorts to patients 50 years of age or older, yielding 21,680 patients in the training cohort and 21,431 in the validation cohort. The rationale for the limitation to older adults is that the characteristics of high system use differ between younger and older patients (e.g. complex pediatric patients, trauma) [5, 31], and that potential solutions to these issues might be significantly different from those of an aging cohort with co-morbid chronic disease.

System-wide outcome data for ED visits were obtained through NACRS (National Ambulatory Care Reporting System), which comprehensively captures all ED visits across the province [32]. Person-level data linkage was performed across primary care EMR data and administrative data including Canadian Institute for Health Information (CIHI), NACRS and the Ontario Health Insurance Plan (OHIP). The training cohort consisted of 21,680 patient records, of which 1,237 (5.7%) had three or more ED visits in the one year after the 2012 extraction date and were labeled as high ED users.
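The cohort construction described above can be sketched in a few lines of code. This is a minimal illustration only, not the authors' implementation: the record fields (`age`, `ed_visits_next_year`), the function names, the fixed random seed, and the toy records are all assumptions introduced here.

```python
import random

def split_cohort(patient_ids, seed=0):
    """Randomly assign half of the patients to training and half to validation."""
    ids = list(patient_ids)
    random.Random(seed).shuffle(ids)  # the seed is an assumption, for repeatability
    half = len(ids) // 2
    return ids[:half], ids[half:]

def restrict_to_older_adults(records, min_age=50):
    """Keep only patients aged min_age or older (the study used 50)."""
    return [r for r in records if r["age"] >= min_age]

def label_high_ed_user(record, min_visits=3):
    """1 if the patient had three or more ED visits in the following year, else 0."""
    return 1 if record["ed_visits_next_year"] >= min_visits else 0

# Toy records with hypothetical field names; these are not study data.
records = [
    {"id": 1, "age": 72, "ed_visits_next_year": 4},
    {"id": 2, "age": 45, "ed_visits_next_year": 0},
    {"id": 3, "age": 58, "ed_visits_next_year": 1},
    {"id": 4, "age": 66, "ed_visits_next_year": 3},
]

train_ids, valid_ids = split_cohort([r["id"] for r in records])
older = restrict_to_older_adults(records)
labels = [label_high_ed_user(r) for r in older]

# Prevalence reported for the training cohort: 1,237 of 21,680 patients.
prevalence = 1237 / 21680
```

The last line simply reproduces the reported 5.7% prevalence of high ED users in the training cohort from the two counts given in the text.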


We then indexed all free text from the CPP. This was followed by basic text pre-processing, including removal of frequent and infrequent terms, resulting in a vocabulary of 11,905 terms that were used as features to fit a logistic regression model [33]. In the same cohorts, we also examined total individual-level healthcare system utilization costs using methodology described elsewhere [34]. This dataset aggregates individual person-level costs associated with several healthcare service categories: acute inpatient hospitalization, ED visits, surgery, rehabilitation, complex care, long term care, physician services, home care, prescription drugs, and equipment. We included the total cost of the patient to the health care system in the year preceding the data extraction date, using this amount as both an outcome and as a covariate in the model in separate exercises. In keeping with the published literature, we were most interested in those patients in the top 1% and 5% of costs, as they account for a very large proportion of health care expenditures [31].
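The text-to-features step and the model fit described above might look roughly like the following sketch. Several things here are assumptions rather than details from the study: whitespace tokenization, the document-frequency cut-offs (`min_df`, `max_df_ratio`), binary term indicators, and the use of an L1 (lasso) penalty, which is suggested only by the citation to [33]. The solver is a plain proximal gradient loop written for illustration, and the notes are invented toy strings.

```python
import math
from collections import Counter

def build_vocabulary(documents, min_df=2, max_df_ratio=0.9):
    """Keep terms that are neither too rare nor too common across documents."""
    doc_freq = Counter()
    for text in documents:
        doc_freq.update(set(text.lower().split()))  # count each term once per document
    n_docs = len(documents)
    return sorted(term for term, df in doc_freq.items()
                  if df >= min_df and df / n_docs <= max_df_ratio)

def to_features(text, vocabulary):
    """Binary bag-of-words vector: 1.0 if the term occurs in the text, else 0.0."""
    tokens = set(text.lower().split())
    return [1.0 if term in tokens else 0.0 for term in vocabulary]

def predict_proba(x, weights, bias):
    """Logistic model: probability of the positive class."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))

def soft_threshold(value, amount):
    """Shrink a weight toward zero; this is what makes L1-penalized weights sparse."""
    if value > amount:
        return value - amount
    if value < -amount:
        return value + amount
    return 0.0

def fit_l1_logistic(X, y, l1=0.01, learning_rate=0.5, epochs=500):
    """Proximal gradient descent for L1-penalized logistic regression (bias unpenalized)."""
    n, d = len(X), len(X[0])
    weights, bias = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            error = predict_proba(xi, weights, bias) - yi
            grad_b += error / n
            for j, xij in enumerate(xi):
                grad_w[j] += error * xij / n
        bias -= learning_rate * grad_b
        weights = [soft_threshold(w - learning_rate * g, learning_rate * l1)
                   for w, g in zip(weights, grad_w)]
    return weights, bias

# Invented toy notes; "pt" is too frequent and "gout" too rare to be kept.
notes = [
    "pt copd smoker lives alone",
    "pt copd chf lives alone",
    "pt hypertension married",
    "pt hypertension married nonsmoker gout",
    "pt chf smoker",
    "pt married nonsmoker",
]
outcomes = [1, 1, 0, 0, 1, 0]

vocab = build_vocabulary(notes)
X = [to_features(note, vocab) for note in notes]
weights, bias = fit_l1_logistic(X, outcomes)
probs = [predict_proba(x, weights, bias) for x in X]
```

At the scale reported in the study (11,905 terms, more than 20,000 records per cohort), a sparse-matrix representation and a library solver would be the practical choice; the loop above is only meant to make the steps explicit.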


RESULTS

Model to predict frequent ED visits:


We assessed the performance of the model on only those patients in the validation set who were not high ED users in the year preceding the extraction date (i.e. they had <3 ED visits in 2012) but who became high users following the extraction date. There were 895 such patients in the validation cohort and the classification accuracy of the model on these patients was 74.8%. The AUROC for this group was 0.71, as shown in Figure 2.
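For readers less familiar with the AUROC, it can be read as the probability that a randomly chosen patient who became a high user receives a higher model score than a randomly chosen patient who did not. The sketch below computes it directly from that definition; the function name and the toy scores are assumptions made for illustration and are unrelated to the study data.

```python
def auroc(scores, labels):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Toy example: 8 of the 9 positive/negative pairs are ordered correctly.
toy_scores = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]
toy_labels = [1, 1, 1, 0, 0, 0]
toy_auroc = auroc(toy_scores, toy_labels)
```

The pairwise loop is quadratic in cohort size; a rank-based formulation gives the same value more efficiently on large cohorts.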


Choosing the point where both the sensitivity and specificity are optimized on this ROC curve leads to sensitivity and specificity of around 71%. Given the high cost of interventions, a higher specificity (lower false positive rate) may be desirable at the expense of sensitivity. If such a characteristic were desired, 47% sensitivity and 80% specificity can be achieved with the current model.
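The trade-off described above amounts to choosing a score threshold on the ROC curve. As a hedged sketch (the helper names and toy scores are assumptions, not part of the study), one way to pick the most sensitive threshold that still meets a required specificity is:

```python
def sensitivity_specificity(scores, labels, threshold):
    """Classify score >= threshold as positive and return (sensitivity, specificity)."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < threshold)
    tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < threshold)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)

def best_threshold_for_specificity(scores, labels, min_specificity):
    """Among thresholds meeting the specificity floor, return the most sensitive one.

    Returns (threshold, sensitivity, specificity), or None if no threshold qualifies.
    """
    best = None
    for threshold in sorted(set(scores)):
        sens, spec = sensitivity_specificity(scores, labels, threshold)
        if spec >= min_specificity and (best is None or sens > best[1]):
            best = (threshold, sens, spec)
    return best

# Toy scores: four patients with the outcome, five without.
toy_scores = [0.9, 0.7, 0.4, 0.3, 0.8, 0.5, 0.2, 0.1, 0.05]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0, 0]
operating_point = best_threshold_for_specificity(toy_scores, toy_labels, 0.8)
```

In the toy data, requiring 80% specificity leaves a sensitivity of 50%, which mirrors the direction of the trade-off reported for the model (47% sensitivity at 80% specificity) without reproducing it.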


Model to predict total costs:

We developed a different model for costs. We followed similar methodology to the model for ED visits, and used the same training and validation cohorts. Table 1 outlines the performance of the model in the validation cohort in predicting the top 1% and 5% of users in terms of total costs without including prior costs in the model. The ROC curve for the top 5% total cost dataset is shown in Figure 3; in predicting this group in the subsequent year, the model achieved 71% sensitivity and 73% specificity.


Using the same model, we then turned our attention to the group of patients who were not in the top percentiles of total costs before the extraction date and who subsequently entered this group after the data extraction date. For example, in the year before the extraction date, 20,359 patients were not in the top 5%. Of these patients, 651 entered the top 5% within 12 months after the data extraction date. Table 2 shows the performance of the model in predicting the top 1% and 5% of total system costs in this group. The ROC curve is shown in Figure 4.
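To put these counts in context, the following back-of-envelope sketch combines the 651 of 20,359 patients above with the sensitivity and specificity reported in Table 2 for the top 5% group (0.647 and 0.745). The positive predictive value it produces is a derived illustration under the assumption that those two figures apply at a single operating point to exactly this group; it is not a result reported in the study, and the function names are introduced here.

```python
def expected_confusion(n_total, n_positive, sensitivity, specificity):
    """Expected true-positive and false-positive counts at a given operating point."""
    n_negative = n_total - n_positive
    true_positives = sensitivity * n_positive
    false_positives = (1.0 - specificity) * n_negative
    return true_positives, false_positives

def positive_predictive_value(true_positives, false_positives):
    """Share of flagged patients who actually go on to have the outcome."""
    return true_positives / (true_positives + false_positives)

# Counts from the text; sensitivity and specificity from Table 2 (5% row).
tp, fp = expected_confusion(20359, 651, 0.647, 0.745)
ppv = positive_predictive_value(tp, fp)
base_rate = 651 / 20359
```

Under these assumptions, the share of flagged patients who enter the top 5% would be a little under 8%, compared with a base rate of about 3% in the group as a whole.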

INTERPRETATION


We created and prospectively validated two novel prediction tools using free text mined from the cumulative patient profile of primary care EMRs: a model to predict frequent ED visits, and a different model to predict high total system costs. The ED use model performed reasonably well, with an AUROC of 0.69, notably without inclusion of any prior system use in the model. With the EMRALD cohort we were able to accurately replicate a true predictive task by attempting to predict high users based on the CPP in the years preceding the high use. The predictive model for total costs performed even better, with AUROC of 0.79 and 0.78 for prediction of the top 1% and 5% respectively. When patients who were already in the top 5% of costs were excluded, 65% sensitivity and 75% specificity could be achieved in predicting those patients who would enter this group in the subsequent 12 months. Given the very high proportion of total healthcare costs incurred by such a small group, accurate prediction of which patients will enter this group can be an extremely valuable resource in guiding remedial efforts towards where the yield will be highest.


The strengths of this study include a large patient cohort of over 43,000 records, with 100% success in linking the cohort to administrative data for outcome assessment. The lack of requirement for administrative data increases the generalizability of this model and its usability at the primary care practice level. The study also has several weaknesses. The EMRALD database is currently limited to those practices using the PS Suite® EMR. The use of free text introduces complexities such as synonyms and alternate or incorrect spellings, and context is not captured. There may also be local idiosyncrasies of documentation that limit generalizability. However, our very large sample, both in terms of records and the size of the vocabulary, likely mitigates many of these problems. Further, because of database limitations, we only had 12 months of outcome data. It is likely that the predictive power would increase if the goal were to predict outcomes over a longer time period.


This study was undertaken in the province of Ontario, Canada. There are well documented differences between Canada and other countries, most notably the United States, in the organization and delivery of healthcare [35]. However, the demographics of the 2 countries are remarkably similar in many respects [36], and the striking difference in the rate of uninsured people is diminishing, with over 90% of Americans now having health insurance [37]. Even in the event this particular model cannot be applied directly in the United States or elsewhere, the unique methodology and source of data we have employed lends itself to the creation of a similar index for a given jurisdiction.


This study focused primarily on the cohort of patients who were not high ED users and did not have high system costs before the extraction date and who subsequently entered these groups, as we see these as the patients who may benefit most from intensive intervention. In predicting patients in the top 5% of total system costs in the subsequent 1 year among those who were not previously in this group, the model performed well, with 65% sensitivity and 75% specificity. These performance characteristics compare favorably with other models, which notably often require the patient to have been a high user previously. Reuben et al [15] reported 45.5% sensitivity and 81.5% specificity for a multi-source tool to predict ED use in the next 3 years. Lyon et al [14] reported both sensitivity and specificity in the 65% range for a self-reported questionnaire to predict ED visits. The locally developed HARP model, which requires data from hospital admission, has 49% sensitivity and 74% specificity for predicting hospital readmission within 30 days [38].


The implications of such prediction tools to the practicing clinician are of particular interest. It is conceivable that a given ambulatory practice, or a broader system, may employ the methodology described here to leverage the information contained in the text of their EMR to create a stratified list of those patients most likely to become high users or incur high costs. Importantly, our study demonstrates that substantial predictive power is possible even among those patients who are not yet high users or high cost. Once a clinician or, more optimally, a multidisciplinary team, is alerted to these patients and analyzes what factors are likely to contribute to this high risk, remedial action may be undertaken, to the benefit of the patient and the system.

In conclusion, we have shown that machine learning techniques can be applied to analyze the free text contained in primary care practices’ EMRs. A resulting index can predict the outcomes of frequent ED visits and high total system costs but the clinical utility of this remains to be tested. We are not aware of any other published indices developed to predict high system costs or frequent ED visits based on primary care EMR data. The ability to accurately target interventions at the primary care level, to a high risk group not experiencing the outcome of interest at the time of prediction, has the potential to improve patient care and reduce costs substantially.

Areas for future inquiry include the creation of a simplified version of the index, which may allow application at the individual practice level, and comparison of the predictive performance of EMR text-based tools to more readily available administrative data. Whether this type of prospective, “preventative” identification of patients at high risk can accurately enough predict the highest-yield populations at whom to target intervention, is also an area for future research.


ACKNOWLEDGEMENTS


We thank Jacqueline Young for her assistance in the preparation of the EMR data for analysis, as well as Diana Toubassi and Lesley Adcock for assistance with the pilot phase of the project.

REFERENCES

1. Portrait of the Canadian Population in by age and sex. [January 5, 2016]; Available from: http://www12.statcan.ca/english/census06/analysis/agesex/NatlPortrait2.cfm.
2. Why health care renewal matters: Learning from Canadians with chronic health conditions. [January 5, 2016]; Available from: http://www.healthcouncilcanada.ca/tree/2.20Outcomes2FINAL.pdf.
3. Ideas and Opportunities for Bending the Health Care Cost Curve. Available from: https://www.oha.com/KnowledgeCentre/Library/Documents/Bending%20the%20Health%20Care%20Cost%20Curve%20(Final%20Report%20-%20April%2013%202010).pdf.
4. Powers, B.W. and S.K. Chaguturu, ACOs and High-Cost Patients. N Engl J Med, 2016. 374(3): p. 203-5.
5. Wodchis, W.P., P.C. Austin, and D.A. Henry, A 3-year study of high-cost users of health care. CMAJ, 2016.
6. Roland, M., et al., Follow up of people aged 65 and over with a history of emergency admissions: analysis of routine admission data. BMJ, 2005. 330(7486): p. 289-92.
7. Georghiu, T., Steventon, A., Billings, J., Blunt, I., Lewis, G., Bardsley, M., Predictive risk and healthcare: an overview, 2011, Nuffield Trust, United Kingdom.
8. Lewis, G.H., "Impactibility models": identifying the subgroup of high-risk patients most amenable to hospital-avoidance programs. Milbank Q, 2010. 88(2): p. 240-55.
9. Weber, C. and K. Neeser, Using individualized predictive disease modeling to identify patients with the potential to benefit from a disease management program for diabetes mellitus. Dis Manag, 2006. 9(4): p. 242-56.
10. Leppin, A.L., et al., Preventing 30-day hospital readmissions: a systematic review and meta-analysis of randomized trials. JAMA Intern Med, 2014. 174(7): p. 1095-107.
11. Purdey, S. and A. Huntley, Predicting and preventing avoidable hospital admissions: a review. J R Coll Physicians Edinb, 2013. 43(4): p. 340-4.
12. Dhalla, I.A., et al., Effect of a postdischarge virtual ward on readmission or death for high-risk patients: a randomized clinical trial. JAMA, 2014. 312(13): p. 1305-12.
13. Freedman, J.D., et al., Using a mailed survey to predict hospital admission among patients older than 80. J Am Geriatr Soc, 1996. 44(6): p. 689-92.
14. Lyon, D., et al., Predicting the likelihood of emergency admission to hospital of older people: development and validation of the Emergency Admission Risk Likelihood Index (EARLI). Fam Pract, 2007. 24(2): p. 158-67.
15. Reuben, D.B., et al., Development of a method to identify seniors at high risk for high hospital utilization. Med Care, 2002. 40(9): p. 782-93.
16. Bottle, A., P. Aylin, and A. Majeed, Identifying patients at high risk of emergency hospital admissions: a logistic regression analysis. J R Soc Med, 2006. 99(8): p. 406-14.
17. Donnan, P.T., et al., Development and validation of a model for predicting emergency admissions over the next year (PEONY): a UK historical cohort study. Arch Intern Med, 2008. 168(13): p. 1416-22.
18. Amarasingham, R., et al., Electronic medical record-based multicondition models to predict the risk of 30 day readmission or death among adult medicine patients: validation and comparison to existing models. BMC Med Inform Decis Mak, 2015. 15: p. 39.
19. Nijhawan, A.E., et al., An electronic medical record-based model to predict 30-day risk of readmission and death among HIV-infected inpatients. J Acquir Immune Defic Syndr, 2012. 61(3): p. 349-58.
20. Rana, S., et al., Predicting unplanned readmission after myocardial infarction from routinely collected administrative hospital data. Aust Health Rev, 2014. 38(4): p. 377-82.
21. Billings, J., et al., Case finding for patients at risk of readmission to hospital: development of algorithm to identify high risk patients. BMJ, 2006. 333(7563): p. 327.
22. Adler, N.E. and W.W. Stead, Patients in context--EHR capture of social and behavioral determinants of health. N Engl J Med, 2015. 372(8): p. 698-701.
23. Fitzpatrick, T., et al., Looking Beyond Income and Education: Socioeconomic Status Gradients Among Future High-Cost Users of Health Care. Am J Prev Med, 2015. 49(2): p. 161-71.
24. Tran, T., et al., Risk stratification using data from electronic medical records better predicts suicide risks than clinician assessments. BMC Psychiatry, 2014. 14: p. 76.
25. Churpek, M.M., et al., Using electronic health record data to develop and validate a prediction model for adverse outcomes in the wards*. Crit Care Med, 2014. 42(4): p. 841-8.
26. Thiel, S.W., et al., Early prediction of septic shock in hospitalized patients. J Hosp Med, 2010. 5(1): p. 19-25.
27. Shumway, M., et al., Cost-effectiveness of clinical case management for ED frequent users: results of a randomized trial. Am J Emerg Med, 2008. 26(2): p. 155-64.
28. Kamali, M.F., et al., Emergency department waiting room: many requests, many insured and many primary care physician referrals. Int J Emerg Med, 2013. 6(1): p. 35.
29. Tu, K., et al., Evaluation of Electronic Medical Record Administrative data Linked Database (EMRALD). Am J Manag Care, 2014. 20(1): p. e15-21.
30. Tu, K., et al., Are family physicians comprehensively using electronic medical records such that the data can be used for secondary purposes? A Canadian perspective. BMC Med Inform Decis Mak, 2015. 15: p. 67.
31. Rosella, L.C., et al., High-cost health care users in Ontario, Canada: demographic, socioeconomic, and health status characteristics. BMC Health Serv Res, 2014. 14: p. 532.
32. National Ambulatory Care Reporting System (NACRS). [January 5, 2016]; Available from: http://www.cihi.ca/CIHI-extportal/internet/en/document/types+of+care/hospital+care/emergency+care/NACRS_METADATA.
33. Tibshirani, R., Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society. Series B (methodological), 1996. 1: p. 267-288.
34. Wodchis, W., Bushmeneva K, Nikitovic M, McKillop I., Guidelines on Person-Level Costing Using Administrative Databases in Ontario, in Working Paper Series. 2013: Health System Performance Research Network.
35. Lewis, S., A system in name only--access, variation, and reform in Canada's provinces. N Engl J Med, 2015. 372(6): p. 497-500.
36. The World Factbook. [November 25, 2016]; Available from: https://www.cia.gov/library/publications/the-world-factbook/.
37. Health Insurance Coverage in the United States: 2015. [November 25, 2016]; Available from: http://census.gov/library/publications/2016/demo/p60-257.html.
38. Hospital Admission Risk Prediction (HARP) – a new tool for supporting providers and patients, 2013, Health Quality Ontario.


Figure 1: Distribution of the patient cohort.

Figure 2: Performance of the model (in the validation cohort) in predicting high ED use among those who were not high ED users in the prior year.

Figure 3: Performance of the model (in the validation cohort) in predicting patients being in the top 5% of total costs. This model did not include prior costs.

Figure 4: Performance of the model (in the validation cohort) in predicting patients being in the top 5% of total costs among those patients who were not in the top 5% of total costs in the prior year.

Table 1: Performance of the model (in the validation cohort) in predicting the top 1% and 5% of users in terms of total costs without including prior costs in the model.

Percentile of cost    AUROC    Sensitivity    Specificity
1%                    0.787    0.660          0.765
5%                    0.784    0.713          0.730

Table 2: Performance of the model in predicting the top 1% and 5% of total system costs among those patients who were not in the top 5% of total costs in the prior year.

Percentile of cost    AUROC    Sensitivity    Specificity
1%                    0.759    0.600          0.768
5%                    0.758    0.647          0.745

Figure 1 (flow diagram, reconstructed from the extracted text):

125,315 rostered patients where EMR used for at least 12 months
    24,141 with unpopulated Cumulative Patient Profile
    101,174 with populated Cumulative Patient Profile
        50,587 included in training cohort
            21,680 patients 50 years old or older included in training cohort
        50,587 included in validation cohort
            21,431 patients 50 years old or older included in validation cohort


Clinical Significance

-A small number of patients accounts for a disproportionately large amount of ED visits and total system costs.
-Machine learning techniques can be applied to analyze free text contained in primary care EMRs.
-This dataset is more predictive of patients who will generate future high costs than future ED visits.
-This is a promising method by which to focus resource-intensive interventions on those patients most likely to benefit.