British Journal of Anaesthesia, 118 (3): 391–9 (2017) doi: 10.1093/bja/aew476 Critical Care
Risk prediction models for delirium in the intensive care unit after cardiac surgery: a systematic review and independent external validation
A. Lee1,*, J. L. Mu1, G. M. Joynt1, C. H. Chiu1, V. K. W. Lai1, T. Gin1 and M. J. Underwood2
1 Department of Anaesthesia and Intensive Care, The Chinese University of Hong Kong, Hong Kong, China and 2 Division of Cardiothoracic Surgery, Department of Surgery, The Chinese University of Hong Kong, Hong Kong, China
*Corresponding author: E-mail:
[email protected]
Abstract
Numerous risk prediction models are available for predicting delirium after cardiac surgery, but few have been directly compared with one another or been validated in an independent data set. We conducted a systematic review to identify validated risk prediction models of delirium (using the Confusion Assessment Method-Intensive Care Unit tool) after cardiac surgery and assessed the transportability of the risk prediction models on a prospective cohort of 600 consecutive patients undergoing cardiac surgery at a university hospital in Hong Kong from July 2013 to July 2015. The discrimination (c-statistic), calibration (GiViTI calibration belt), and clinical usefulness (decision curve analysis) of the risk prediction models were examined in a stepwise manner. Three published high-quality intensive care unit delirium risk prediction models (n=5939) were identified: Katznelson, the original PRE-DELIRIC, and the international recalibrated PRE-DELIRIC model. Delirium occurred in 83 patients (13.8%, 95% CI: 11.2–16.9%). After updating the intercept and regression coefficients in the Katznelson model, there was fair discrimination (0.62, 95% CI: 0.58–0.66) and good calibration. As the original PRE-DELIRIC model was already validated externally and recalibrated in six countries, we performed a logistic calibration on the recalibrated model and found acceptable discrimination (0.75, 95% CI: 0.72–0.79) and good calibration. Decision curve analysis demonstrated that the recalibrated PRE-DELIRIC risk model was marginally more clinically useful than the Katznelson model. Current models predict delirium risk in the intensive care unit after cardiac surgery with only fair to moderate accuracy and are insufficient for routine clinical use.
Key words: cardiac surgical procedures; decision support techniques; delirium; postoperative complications; review, systematic; validation studies
Editorial decision: November 14, 2016; Accepted: December 30, 2016
© The Author 2017. Published by Oxford University Press on behalf of the British Journal of Anaesthesia. All rights reserved.

Editor’s key points
• Delirium remains an important issue following cardiac surgery. Prediction of patients who will suffer delirium aids in providing appropriate post-surgical care and facilities, and has the potential to improve outcome.
• The authors examined the effectiveness of established risk-prediction models in a large multi-centre cohort of patients in Hong Kong.
• Moderate predictive accuracy was found, indicating potential future usefulness, but there was inadequate accuracy to allow current, routine, clinical use to be recommended.

Delirium is a serious clinical condition that is often underdetected and not treated promptly by health-care professionals.1 The incidence of delirium after cardiac surgery ranges from 6 to 52%.2 Although delirium is reversible, it is a precursor to poor patient outcomes. These include longer duration of mechanical ventilation [mean difference (MD) 1.79 days, 95% CI: 0.31–3.27],3 a prolonged stay in the intensive care unit (ICU; MD 1.38 days, 95% CI: 0.99–1.77) and in hospital (MD 0.97 days, 95% CI: 0.61–1.33),3 and persistent cognitive impairment after surgery.4 Many studies have also supported an association between delirium and an increased risk of short-term mortality [relative risk (RR) 2.19, 95% CI: 1.78–2.70]3 and mortality up to 10 yr [hazard ratio (HR) 1.65, 95% CI: 1.38–1.97].5 However, another study suggested that there was no association between delirium and mortality during ICU stay after adjusting for
a change in severity of disease before the onset of delirium (HR 1.19, 95% CI: 0.75–1.89).6 Nevertheless, the health-care costs associated with ICU delirium are substantial, with both higher ICU (39%, 95% CI: 12–72%) and hospital costs (31%, 95% CI: 1–70%).7 Although the exact aetiology of delirium is unknown, its pathophysiology appears to be multifactorial.8 Predisposing factors (e.g. elderly patients, medical co-morbidities, and cognitive, functional, visual, and hearing impairments) and precipitating factors (e.g. severity of illness, continuous infusion of benzodiazepine, blood product transfusion, and prolonged duration of mechanical ventilation) combine to trigger delirium.2 8–10 Thus, risk prediction models incorporating strong risk factors are likely to be useful to support clinical decision-making. By targeting drug prophylaxis and other non-pharmacological interventions for patients at high risk of developing delirium, the incidence, severity, and duration of delirium may be reduced. Numerous risk prediction models are available for predicting delirium after cardiac surgery,11 but few have been directly compared with one another or been validated in an independent data set. Ideally, risk prediction models should undergo internal validation to ensure reproducibility and external validation to support generalizability before implementation into clinical practice.12 After training our bedside ICU nurses on using the Confusion Assessment Method (CAM)-ICU screening tool for detecting delirium,13 we searched for the most appropriate validated risk prediction model for early identification of patients at risk of developing delirium in the ICU after cardiac surgery. The objective of this study was therefore to externally validate and assess the performance of all published validated risk prediction models of delirium, using the CAM-ICU assessment tool for detection of delirium. 
First, we performed a systematic review to identify all potential prediction models, and then critically appraised all validated risk prediction models using the CHARMS reporting guideline.14 Then we updated the validated risk prediction models in a step-wise manner on a cohort of 600 consecutive patients undergoing cardiac surgery at a university hospital in Hong Kong from July 2013 to July 2015.
Methods
Systematic review
We searched electronic databases (Ovid MEDLINE and EMBASE) for validated clinical risk prediction models for ICU delirium, published from January 1990 onwards, which would be applicable on the first day after cardiac surgery. A systematic search was performed in October 2012 and repeated on April 15, 2016. We adopted the search filter for prognostic prediction studies described by Geersing and colleagues.15 Further studies were identified from reviewing the reference lists of retrieved studies and review articles. We restricted the language of publication to English and Chinese.

Criteria for considering studies
The intended scope of the review was to identify published prognostic scores to help identify adults who will or will not develop ICU delirium after undergoing cardiac surgery. Studies were included if they met the following criteria: (i) prospective or retrospective cohort of adult patients admitted to an ICU after cardiac surgery; (ii) patients were assessed for delirium using the CAM-ICU assessment tool (CAM-ICU has a higher discrimination performance than the Intensive Care Delirium Screening Checklist);16 and (iii) prognostic models that reported internal validation of the development data set (random split of data or resampling methods, such as bootstrapping) or reported external validation (temporal, geographical, different setting, different investigators)14 to predict the future occurrence of ICU delirium. We excluded risk prediction models developed in children and risk prediction models where delirium was measured after the first day after cardiac surgery. Disagreements were resolved by discussion between the authors.

Data extraction and critical appraisal
After screening the titles and abstracts and selecting the potentially eligible articles for full-text review (A.L.), a second author (V.K.W.L.) checked the selection. Two authors (A.L. and V.K.W.L.) independently extracted the following data: authors, year of publication, country where the study was conducted, study design, sample size, study population characteristics, risk predictors, model performance (calibration and discrimination measures), type of validation (internal or external), and incidence of delirium.
Using the CHARMS checklist,14 we rated the five risk domains (participant selection, predictor assessment, outcome assessment, attrition, and analysis for the development of the prediction model) as low, moderate, or high using criteria previously described.17 Discrepancies in data extraction were resolved by discussion among the authors.
Validation cohort
Setting and participant characteristics
The reporting of this study followed the TRIPOD checklist for prediction model development and validation.18 The Joint Chinese University of Hong Kong-New Territories East Cluster Clinical Research Ethics Committee approved the protocol for this prospective cohort study (CRE-2012.564). After written informed consent was given, we recruited 600 consecutive adult patients admitted to the ICU after undergoing emergency and elective cardiac surgery at The Prince of Wales Hospital in Hong Kong, a university teaching hospital. Patients were excluded if they had a sustained Richmond Agitation and Sedation Scale score19 of −4 or −5 throughout the ICU admission, major auditory or visual disorders, were mentally incompetent, had no CAM-ICU assessment recorded, or were unable to understand Chinese or English. All patients received standardized surgical processes and perioperative care under existing protocols for postoperative ICU sedation, analgesia, and weaning from mechanical ventilation. Two authors (J.L.M. and C.H.C.) collected patient characteristic data that included age, sex, ASA physical status, logistic EuroSCORE, urgency of ICU admission, details of surgical procedures, duration of anaesthesia, duration of mechanical ventilation, ICU length of stay, re-admission to the ICU, duration of the hospital stay, and the 30 day mortality status from the patient’s medical record and from the Hospital Authority Clinical Management System electronic database.

Predictors
We used the same definitions of the predictors as described in the original risk prediction models. The risk factors included in the risk prediction models were collected by one of the investigators (J.L.M.) during the first day after cardiac surgery. These risk factors included age, preoperative depression, preoperative creatinine >150 µmol litre−1, preoperative use of statins, combined coronary artery bypass graft and valvular surgery, red blood cell transfusion of >5 units and perioperative intra-aortic balloon pump support, severity of illness score (APACHE II score), coma, infection, metabolic acidosis, use of sedatives and morphine, urea concentration, and urgent ICU admission.20–22 For non-delirious patients discharged from the ICU within 24 h after cardiac surgery, the measurements for the PRE-DELIRIC models21 22 were based on the last set before ICU discharge. The attending ICU physicians were blinded to the results of the risk prediction models to minimize performance bias.

Outcome
The primary outcome was CAM-ICU assessment for ICU delirium13 or the use of haloperidol for the treatment of ICU delirium.21 The bedside nurses performed CAM-ICU assessments three times a day (once per 8 h shift). The duration of follow-up was until the patient’s discharge from the ICU or the time to diagnosis of ICU delirium, whichever was shorter for risk prediction purposes. Pharmacological treatment of delirium (haloperidol, dexmedetomidine, quetiapine) was prescribed at the discretion of the attending specialist intensive care physician. The nurses performing CAM-ICU assessment were unaware of the study objectives and were blinded to the risk factors being collected by the authors.
Before the cohort study began, the inter-rater reliabilities between ICU bedside nurses and three persons (two ICU physicians and a research nurse) trained by a psychiatrist were measured. There was substantial agreement in a sample of 34 ICU patients (κ 0.77, 95% CI: 0.35–0.93). Of the 34 ICU patients, five were delirious (14.7%, 95% CI: 5.6–29.6%).
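The inter-rater agreement statistic quoted above (Cohen's κ) compares observed agreement with the agreement expected by chance. The following is a minimal sketch of that calculation only; the function name and the ten toy ratings are hypothetical illustrations and are not the study's data or the authors' code.

```python
from collections import Counter

def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same subjects."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of subjects")
    n = len(rater_a)
    # Observed proportion of subjects on which the raters agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance from each rater's marginal frequencies.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    categories = set(rater_a) | set(rater_b)
    expected = sum(count_a[c] * count_b[c] for c in categories) / n ** 2
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (observed - expected) / (1.0 - expected)

# Hypothetical toy ratings (1 = CAM-ICU positive, 0 = negative), NOT study data.
nurse_ratings = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
reference_ratings = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
toy_kappa = cohen_kappa(nurse_ratings, reference_ratings)
```

With a low prevalence of positive ratings, a single disagreement lowers κ considerably, which is one reason the confidence interval around an estimate from 34 patients is wide.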
Data analysis
There were no missing data for the predictors collected. The distribution of variables from each model development data set and our cohort were tabulated. Continuous variables are reported as mean and SD or median and interquartile range (IQR) as appropriate. The 95% confidence interval (95% CI) is reported for the incidence of delirium. The level of significance was set at P<0.05.

First, we used the original formulas of the included risk prediction models and applied them to our patients without adjustments to the slope and intercept (model 0), taking into account the type of coding (effect vs reference) used for categorical predictors, to calculate the predicted probabilities. As the intercept was not published in one paper,20 we wrote to the authors to provide this information. Then the ‘calibration-in-the-large’ method (model 1)23 was used to update the model by recalibrating the intercept to adjust for the difference in the incidence of delirium between the development data set and our patient population. Next, we applied the ‘logistic calibration’ method (model 2)23 by adjusting the intercept and the slope in our cohort data set. The calibration slope in our cohort study reflects the combined effect of overfitting on the development data set and the true differences in the effects of the predictors.24 Patients were divided into the following risk groups: <10% (very low), 10–20% (low), 20–40% (intermediate), 40–60% (high), and >60% (very high), using a similar classification to the one previously described.21

Discrimination (probability of correct classification for a pair of patients with and without delirium) was assessed by constructing an area under the receiver operating characteristic (AUROC) curve and estimating a c-statistic.24 We compared the discrimination performance of different risk prediction models using the method of DeLong and colleagues.25 Calibration (agreement between probability of delirium and observed delirium frequency) was assessed using a Hosmer–Lemeshow (HL) test with nine degrees of freedom and calibration plot with a GiViTI calibration belt. The calibration belt is a fitted polynomial logistic function curve between the logit transformation of the predicted probability and outcome with surrounding 80% CI (light grey area) and 95% CI (dark grey area).26 The calibration belt is more useful than the HL test as it highlights ranges of significant miscalibration.26 This type of calibration approach is rated as moderate in the calibration hierarchy recently proposed.27 We used Nagelkerke’s R² and the Brier score to estimate the overall performance of the updated models.

A decision curve analysis28 was used to compare the clinical usefulness of the risk prediction models. This involves choosing an appropriate threshold probability, defined in this setting as the level above which a patient (or physician) would choose prophylactic treatment for delirium.28 The net benefit is the difference between the expected benefit (number of patients with delirium and who will receive prophylaxis, i.e. true positives) and expected harm (number of patients without delirium who would be treated in error multiplied by a weighting factor based on a threshold probability, i.e. false-positive rate multiplied by ratio of threshold probability divided by one minus the threshold probability) associated with each considered strategy.28 Analyses were performed using STATA software (StataCorp, College Station, TX, USA). Calibration belts were plotted using R version 3.2.5 (R Foundation for Statistical Computing, Vienna, Austria).
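The two model-updating steps described in the data analysis (calibration-in-the-large, model 1, and logistic calibration, model 2) can be sketched as follows. This is an illustrative pure-Python Newton-Raphson fit under stated assumptions, not the authors' STATA code: `lp` is assumed to be the linear predictor (log-odds) from the original published model, the function names are hypothetical, and the data are simulated.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def calibration_in_the_large(lp, y, n_iter=50, tol=1e-10):
    """Model 1: estimate an intercept correction with the slope fixed at 1."""
    delta = 0.0
    for _ in range(n_iter):
        p = [sigmoid(delta + v) for v in lp]
        gradient = sum(yi - pi for yi, pi in zip(y, p))
        hessian = sum(pi * (1.0 - pi) for pi in p)
        step = gradient / hessian
        delta += step
        if abs(step) < tol:
            break
    return delta

def logistic_calibration(lp, y, n_iter=50, tol=1e-10):
    """Model 2: estimate both intercept a and calibration slope b."""
    a, b = 0.0, 1.0
    for _ in range(n_iter):
        p = [sigmoid(a + b * v) for v in lp]
        w = [pi * (1.0 - pi) for pi in p]
        g0 = sum(yi - pi for yi, pi in zip(y, p))
        g1 = sum((yi - pi) * v for yi, pi, v in zip(y, p, lp))
        h00 = sum(w)
        h01 = sum(wi * v for wi, v in zip(w, lp))
        h11 = sum(wi * v * v for wi, v in zip(w, lp))
        det = h00 * h11 - h01 * h01
        # Solve the 2x2 Newton system H * [da, db] = [g0, g1].
        da = (h11 * g0 - h01 * g1) / det
        db = (h00 * g1 - h01 * g0) / det
        a, b = a + da, b + db
        if max(abs(da), abs(db)) < tol:
            break
    return a, b

# Simulated data (hypothetical): the original model over-predicts and is
# over-fitted, so the "true" recalibration is intercept -0.5 and slope 0.7.
random.seed(0)
lp = [random.gauss(-1.0, 1.0) for _ in range(5000)]
y = [1 if random.random() < sigmoid(-0.5 + 0.7 * v) else 0 for v in lp]

delta = calibration_in_the_large(lp, y)
alpha, slope = logistic_calibration(lp, y)
observed = sum(y) / len(y)
mean_pred_model1 = sum(sigmoid(delta + v) for v in lp) / len(lp)
mean_pred_model2 = sum(sigmoid(alpha + slope * v) for v in lp) / len(lp)
```

After either update, the mean predicted risk matches the observed incidence in the validation sample; a fitted slope below 1 is the pattern usually read as overfitting of the original coefficients.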
Sample size
A sample of 71 patients in the delirium group (positive group) and 474 in the non-delirium group (negative group) was expected to achieve 80% power to detect a difference of 0.10 between an AUROC of 0.85 under the null hypothesis (ref. 21) and an AUROC of 0.75 under the alternative hypothesis (ref. 20), using a two-sided z-test at a significance level of 0.05. The data were a continuous risk probability response, with an incidence of delirium of one in every eight patients that was based on our general observation. Thus, the total sample size was 545. Assuming that 10% of patients would be lost to follow-up because no CAM-ICU assessment was recorded, we recruited 600 patients for the study. The sample size calculation was performed using PASS 11 software (NCSS, Kaysville, UT, USA).
Results
Systematic review
The flow chart for identifying internally and externally validated risk prediction models for ICU delirium after cardiac surgery is shown in Fig. 1. During the last ‘check’ search completed in 2016, we identified an internally validated early prediction model for ICU delirium (E-PRE-DELIRIC) consisting of nine predictors available at the time of ICU admission29 but were unable to externally validate it using our data as our prospective data collection, begun in 2012, did not include all variables required and was nearing completion. The missing variables included history of cognitive impairment, history of alcohol abuse, mean arterial blood pressure at the time of ICU admission, use of corticosteroids, respiratory failure, and blood urea nitrogen at the time of ICU admission.29 Characteristics of the three studies20–22 meeting selection criteria are shown in Table 1. One internally validated risk prediction model was developed specifically in cardiac surgical patients but had not undergone external validation.20 The other two risk prediction models were from the same model of 10 predictors (PRE-DELIRIC) and were internally validated21 and extensively validated temporally, geographically, and at different ICU centres.21 22 Overall, the quality of reporting for the three included studies was high. We graded the risk of outcome assessment bias in one study21 as moderate because it was unclear whether CAM-ICU assessments were assessed independently from the assessment of predictors. The discrimination was acceptable20–22 to excellent.21 There was good calibration in all risk prediction models using calibration plots21 22 and the HL test.20

Fig 1 Systematic review flowchart for identifying validated risk prediction models for delirium in the intensive care unit after cardiac surgery.
Total records (n=139): MEDLINE (n=44), EMBASE (n=90), other sources (n=5); duplicates removed (n=40).
Records for title and abstract screening (n=99); records excluded (n=74).
Records for full text screening (n=25); records excluded (n=22): no internal validation (n=17), other delirium screening instrument (n=3), no reference (non-delirious) group (n=1), no adequate information available in data set (n=1).
Risk prediction models included in analysis (n=3).
Validation cohort
Of the 684 consecutive patients undergoing cardiac surgery, 600 met the inclusion criteria (Fig. 2). For assessing the PRE-DELIRIC model, we excluded a further three patients who stayed in the ICU for <12 h, in keeping with the original inclusion criterion.21 Over half the patients (58%) had coronary artery bypass graft with or without valvular operations. The median (IQR) logistic EuroSCORE was 3.63 (2.08–8.46) in 579 (97%) patients. The mean (SD) APACHE II score was 14.0 (5.0). The median (IQR) duration of cardiopulmonary bypass time and mechanical ventilation were 105 min (86–147) and 9.1 h (5.7–16.8), respectively. The median (IQR) length of stay in ICU and hospital were 22.0 h (19.6–24.5) and 12 days (10–19), respectively. No patient was lost to follow-up (Fig. 2). All delirium episodes were detected by a positive CAM-ICU tool result. In our study, no patient was classified as having delirium solely because the patient had received a dose or doses of haloperidol (i.e. this condition was not encountered). Delirium occurred in 83 of 600 patients (13.8%, 95% CI: 11.2–16.9%), similar to the incidence reported in the study by Katznelson and colleagues20 (11.5%, P=0.17) but lower than the original and recalibrated PRE-DELIRIC models21 22 (29.8%, P<0.001 and 19.9%, P=0.001, respectively). Mixed delirium (n=33, 39.8%) was more frequent than hypoactive (n=26, 31.3%) and hyperactive delirium (n=24, 28.9%). The first episode of delirium usually occurred on the first day (n=49, 59.0%) and second day (n=18, 21.7%) after cardiac surgery. Of the 83 patients with delirium, pharmacological treatment was given to 40 (48.2%); haloperidol and dexmedetomidine (n=5), quetiapine and dexmedetomidine (n=1), and dexmedetomidine alone (n=34). Patients undergoing emergency cardiac surgery had a higher risk of delirium than those undergoing elective cardiac surgery (37.2 vs 9.9%; RR 3.75, 95% CI: 2.57–5.48). Re-admission to the ICU occurred in 17 (2.8%) patients. Seven patients (1.2%) died within 30 days after cardiac surgery. The distribution of risk factors for delirium in our cohort and in the development data sets20 22 are shown in Table 2. As a
further external validation update was applied to the development data set for PRE-DELIRIC21 in eight ICUs in six countries,22 we decided to apply this latest PRE-DELIRIC model to our cohort data set only for independent external validation. Therefore, this left two risk prediction models20 22 for external validation for this study. In our setting, the discrimination of the recalibrated PRE-DELIRIC model22 (0.75, 95% CI: 0.72–0.79) was higher than that of the Katznelson model20 (AUROC 0.62, 95% CI: 0.58–0.66) as shown in Fig. 3 (P<0.001). Logistic calibration of the Katznelson model20 to our cohort resulted in an acceptable calibration (HL P=0.04) with no significant over- or underprediction intervals (Supplementary Fig. 1A) over a range of predicted risk of delirium between 7.9 and 40.0%. Likewise, logistic calibration of the recalibrated PRE-DELIRIC risk model22 resulted in an acceptable calibration (HL P=0.99) with no significant over- or underprediction intervals in the calibration belt (Supplementary Fig. 1B) over a range of predicted risk of delirium between 0.5 and 88.5%. The overall performance of the logistic calibration of the PRE-DELIRIC risk model22 (R²=0.190, Brier score=0.104) was higher than that of the Katznelson model20 (R²=0.037, Brier score=0.117). A comparison of the regression coefficients for logistic calibrations of Katznelson and the PRE-DELIRIC models with previous published data sets20–22 is shown in Supplementary Table 1. The recalibrated PRE-DELIRIC risk model22 was more clinically useful than the Katznelson model20 in the decision curve analysis (Fig. 4).

Table 1 Description and quality assessment of validated prediction models for delirium in the intensive care unit after cardiac surgery. APACHE, Acute Physiology and Chronic Health Evaluation; AUROC, area under receiver operative characteristic curve; CABG, coronary artery bypass graft; CPB, cardiopulmonary bypass; HL, Hosmer–Lemeshow goodness-of-fit test; IABP, intra-aortic balloon pump; ICU, intensive care unit; RBC, red blood cell

| Authors (yr) | Population, incidence of delirium | Predictors in final model | Discrimination (c-statistic) | Calibration | Risk of biases |
| --- | --- | --- | --- | --- | --- |
| Katznelson and colleagues (2009)20 | Prospective cohort of 1059 patients undergoing cardiac surgery with CPB from April 2005 to June 2006 at one centre in Canada. Delirium occurred in 122 patients (11.5%) | RBC transfusion >5 units, perioperative IABP support, preoperative depression, preoperative creatinine >150 µmol litre−1, age ≥60 yr, combined CABG/valvular surgery, preoperative statin use | Development AUROC (0.77). Validation AUROC: internal (0.75) | Development HL P=0.34 | Selection (low), predictor assessment (low), outcome assessment (low), attrition (low), analysis (low) |
| van den Boogaard and colleagues (2012)21 | Prospective cohort of 3056 medical and surgical (includes cardiac surgery) patients admitted to five Dutch ICUs from February 2008 to September 2009. Delirium occurred in 911 patients (29.8%) | Age, APACHE II score, coma, admission category, infection, metabolic acidosis, morphine use, sedation, urea, urgent admission | Development AUROC (0.87, 95% CI: 0.85–0.89). Validation AUROC: internal (0.86), temporal (0.89, 95% CI: 0.86–0.92), external (0.85, 95% CI: 0.84–0.87) | Development intercept (0.06), slope (1.08). Validation: temporal intercept (0.22), slope (1.2). External intercept (0.29), slope (0.93) | Selection (low), predictor assessment (low), outcome assessment (moderate), attrition (low), analysis (low) |
| van den Boogaard and colleagues (2014)22 | Prospective cohort of 1824 medical and surgical (includes cardiac surgery) patients admitted to eight ICUs in six countries from October 2011 to June 2012. Delirium occurred in 363 patients (19.9%) | Age, APACHE II score, coma, admission category, infection, metabolic acidosis, morphine use, sedation, urea, urgent admission | Validation AUROC: external (0.76, 95% CI: 0.74–0.79) | Validation: external intercept (0.08), slope (1.09). HL P=0.045 | Selection (low), predictor assessment (low), outcome assessment (low), attrition (low), analysis (low) |
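The discrimination and overall performance measures reported in this validation (the c-statistic and the Brier score) can be illustrated with a minimal sketch. The function names and the five toy patients below are hypothetical and are not study data; the pairwise definition of the c-statistic is used directly, which is equivalent to the AUROC but too slow for large data sets.

```python
def c_statistic(pred, outcome):
    """Probability that a random patient with the event has a higher
    predicted risk than a random patient without it (ties count 0.5)."""
    cases = [p for p, y in zip(pred, outcome) if y == 1]
    controls = [p for p, y in zip(pred, outcome) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one patient with and one without the event")
    concordant = 0.0
    for p_case in cases:
        for p_control in controls:
            if p_case > p_control:
                concordant += 1.0
            elif p_case == p_control:
                concordant += 0.5
    return concordant / (len(cases) * len(controls))

def brier_score(pred, outcome):
    """Mean squared difference between predicted risk and observed outcome."""
    return sum((p - y) ** 2 for p, y in zip(pred, outcome)) / len(pred)

# Hypothetical toy example: predicted delirium risks and observed outcomes.
toy_pred = [0.05, 0.10, 0.20, 0.40, 0.60]
toy_outcome = [0, 0, 1, 0, 1]
toy_c = c_statistic(toy_pred, toy_outcome)
toy_brier = brier_score(toy_pred, toy_outcome)
```

A c-statistic of 0.5 corresponds to chance-level ranking and 1.0 to perfect ranking, whereas a lower Brier score indicates better overall accuracy.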
Discussion
This systematic review identified three internally and externally validated risk prediction models20–22 for ICU delirium after cardiac surgery that were suitable for independent external validation. Overall, the methodological quality of these risk prediction models20–22 was high. Two models with a different set of predictors were directly compared with one another, using an independent prospective cohort data set of 600 adults undergoing cardiac surgery, to test the transportability of the models. After logistic calibration of the Katznelson20 and recalibrated PRE-DELIRIC model,22 both models showed good calibration. The discrimination was acceptable in the recalibrated PRE-DELIRIC model22 and was higher than the fair discrimination performance found in the Katznelson model.20 The recalibrated PRE-DELIRIC model22 was more clinically useful than the Katznelson model20 because of the higher net benefits found over a wider range of threshold probabilities when applied to an independent data set.
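The net benefit that underlies a decision curve analysis can be sketched as below, following the definition given in the Methods (true positives minus false positives weighted by the odds of the threshold probability, per patient). The function names and toy data are hypothetical illustrations, not the study data or the authors' implementation.

```python
def net_benefit(pred, outcome, threshold):
    """Net benefit of treating patients whose predicted risk >= threshold."""
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold probability must lie strictly between 0 and 1")
    n = len(pred)
    true_pos = sum(1 for p, y in zip(pred, outcome) if p >= threshold and y == 1)
    false_pos = sum(1 for p, y in zip(pred, outcome) if p >= threshold and y == 0)
    weight = threshold / (1.0 - threshold)  # harm of one unnecessary treatment
    return true_pos / n - (false_pos / n) * weight

def net_benefit_treat_all(outcome, threshold):
    """Reference strategy: give prophylaxis to every patient."""
    prevalence = sum(outcome) / len(outcome)
    return prevalence - (1.0 - prevalence) * threshold / (1.0 - threshold)

# Hypothetical toy example (not study data).
toy_pred = [0.05, 0.10, 0.20, 0.40, 0.60]
toy_outcome = [0, 0, 1, 0, 1]
nb_model = net_benefit(toy_pred, toy_outcome, threshold=0.20)
nb_all = net_benefit_treat_all(toy_outcome, threshold=0.20)
```

Treating no one has a net benefit of zero by definition, so a model is useful at a given threshold only if its net benefit exceeds both zero and the treat-all value. Multiplying a net benefit by 1000 gives the "per 1000 patients" figures in which such results are commonly expressed.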
Fig 2 Independent external validation study flow chart.
Assessed for eligibility (n=684).
Excluded (total n=84): ineligible (n=60; redo or congenital cardiac surgery, n=28; no RASS or CAM-ICU during ICU stay, n=13; RASS −4 or −5 during ICU admission, n=5; non-cardiac surgery, n=5; auditory disorders, n=1; history of psychosis, n=1; mental incompetence, n=7); eligible but not recruited (n=24; cannot obtain consent, n=19; refused, n=5).
Total recruited (n=600).
Data available for analysis of the Katznelson prediction model (n=600).
After exclusion of patients with ICU stay <12 hours (n=3), data available for analysis of the PRE-DELIRIC prediction model (n=597).
This systematic review revealed 25 delirium risk prediction models in the cardiac surgical population, but only a few have undergone formal internal and external validation checks. These findings are consistent with the results from systematic reviews of new risk prediction studies where internal and external validations were performed in a third (36%)30 and a quarter (25–29%)30 31 of the time, respectively. There is a smaller probability (16%) of a new risk prediction model being externally validated by different authors within 5 yr after publication.31 Even within the validated models included in the review, we noted inadequate reporting of blinding of delirium outcome assessment in one.21 This finding is consistent with a low prevalence (11%) of blinded outcome evaluation in risk prediction models in the anaesthesia literature.32 Although the Katznelson model20 was designed for post-cardiac surgical patients, there was only fair discrimination. This might be because the original purpose of the study was to estimate the effect of statin administration on the risk of ICU delirium while adjusting for potential confounders. In contrast, the PRE-DELIRIC models21 22 (and the E-PRE-DELIRIC model)29 better captured the known predisposing and precipitating risk factors associated with delirium in general ICU patients, including those undergoing cardiac surgery. A previous study suggests that external validation of new risk prediction models in populations different from model development data sets yields lower discrimination performance.31 The median AUROC change in the discrimination performance in subsequent validations by different authors was significantly lower (0.05, P<0.001) in 14 of 17 instances examined.31 Our results are mixed compared with these findings. The lower AUROC found in the Katznelson model20 might be related to our homogeneous and less severe case-mix compared with the development data set, resulting in smaller net benefits in the decision curve analysis. Another possible reason for the fair discrimination performance was the smaller than expected regression coefficients in our cohort that might reflect overfitting in the development data set, because a large number of candidate predictors (n=17) were considered for modelling and no shrinkage of coefficients was performed. The validation case-mix appeared to be more homogeneous and less severe than the PRE-DELIRIC data sets.21 22 This might be because the validation cohort was restricted to cardiac surgical patients, in contrast to the development cohort that consisted of a mixed ICU cohort. The incidence of delirium in PRE-DELIRIC data sets21 22 was greater than that found in our cohort, a phenomenon that might partly be explained by haloperidol use as part of the definition of delirium. However, the number of patients included solely on this basis was not reported. Our mean APACHE II score was also significantly lower (P<0.001). Without the reported mean and SD of the linear predictors from the development data sets, formal statistical analysis of the heterogeneity and severity of case-mix differences between studies is not possible.33 The magnitude of the regression coefficients in our logistic calibration model appeared more comparable to the development data set21 than to the recalibrated PRE-DELIRIC data set.22 Of note, our calibration belt (Supplementary Fig. 1B) performed better throughout the whole range of predicted probabilities compared with the recalibrated PRE-DELIRIC model,22 which overestimated the risk of delirium in moderate- to high-risk patients. To our knowledge, this is one of few decision curve analyses with a direct comparison of two externally validated risk
Validating ICU delirium risk prediction models
|
397
Table 2 Characteristics of patients in external validation cohort (Prince of Wales, Hong Kong) and published validated risk prediction models for delirium in the intensive care unit. *Five hundred and sixty-nine patients underwent cardiopulmonary bypass. APACHE, Acute Physiology and Chronic Health Evaluation; CABG, coronary artery bypass graft; CPB, cardiopulmonary bypass; IABP, intra-aortic balloon pump; IQR, interquartile range; NR, not reported; RBC, red blood cell

| Characteristics | Prince of Wales cohort (n=600) | Katznelson20 (n=1059) | Original PRE-DELIRIC21 (n=3056) | Recalibrated PRE-DELIRIC22 (n=1824) |
| --- | --- | --- | --- | --- |
| Male sex (%) | 413 (68.8) | 752 (71.0) | 1937 (63.4) | 1040 (57.0) |
| APACHE II, median (IQR) | 13 (10–17) | NR | 15 (12–19) | NR |
| APACHE II, mean (SD) | 13.8 (5.1) | NR | NR | 19 (9) |
| CPB >90 min | 397 (69.8)* | 594 (56.1) | NR | NR |
| Delirium events (%) | 83 (13.8) | 122 (11.5) | 911 (29.8) | 363 (19.9) |
| Predictors in Katznelson model20 | | | | |
| Age ≥60 yr (%) | 343 (57.2) | 674 (63.6) | NR | NR |
| Intraoperative RBC >5 units (%) | 18 (3.0) | 163 (15.4) | NR | NR |
| Perioperative IABP support (%) | 32 (5.3) | 37 (3.5) | NR | NR |
| Preoperative depression (%) | 14 (2.3) | 40 (3.8) | NR | NR |
| Preoperative creatinine >150 µmol litre⁻¹ (%) | 50 (8.3) | 196 (18.5) | NR | NR |
| Combined CABG and valvular surgery (%) | 43 (7.2) | 176 (16.6) | NR | NR |
| Preoperative statin use (%) | 279 (46.5) | 676 (63.8) | NR | NR |
| Predictors in PRE-DELIRIC model21 22 | | | | |
| Surgery admission (%) | 597 (100) | 1059 (100) | 1857 (60.8) | 869 (47.6) |
| Mean (SD) age (yr) | 60 (11) | NR | 63 (15) | 60 (17) |
| APACHE II score (SD) | 14 (5) | NR | 16 (6) | 19 (9) |
| Coma: none | 553 (92.6) | NR | 2265 (74.1) | 1405 (77.0) |
| Coma: drug induced (%) | 37 (6.2) | NR | 608 (19.9) | 295 (16.2) |
| Coma: miscellaneous (%) | 0 | NR | 36 (1.2) | 21 (1.2) |
| Coma: combination (%) | 7 (1.2) | NR | 147 (4.8) | 103 (5.6) |
| Infection (%) | 84 (14.1) | NR | 556 (18.2) | 516 (28.3) |
| Metabolic acidosis (%) | 319 (53.4) | NR | 598 (19.6) | 525 (28.8) |
| Morphine use (%): none | 18 (3.0) | NR | 1512 (49.5) | 1333 (73.1) |
| Morphine use (%): 0.01–7.1 mg day⁻¹ | 282 (47.2) | NR | 262 (8.6) | 77 (4.2) |
| Morphine use (%): 7.2–18.6 mg day⁻¹ | 254 (42.5) | NR | 729 (23.9) | 115 (6.3) |
| Morphine use (%): >18.6 mg day⁻¹ | 43 (7.2) | NR | 553 (18.1) | 135 (7.4) |
| Sedation (%) | 559 (93.6) | NR | 1113 (36.4) | 774 (42.4) |
| Highest urea (mmol litre⁻¹) | 7.8 (3.9) | NR | 7.3 (3.4) | 9.5 (8.5) |
| Urgent admission (%) | 86 (14.4) | NR | 1579 (51.7) | 1147 (62.9) |
prediction models. At all threshold probabilities, the recalibrated PRE-DELIRIC model22 gave higher net benefits than the Katznelson model,20 suggesting that it is the preferable model to use. However, the overall performance and net benefits of the recalibrated PRE-DELIRIC model22 were modest. Even at a 40% cut-off for the recalibrated PRE-DELIRIC model (Fig. 3), the positive (6.67) and negative (0.85) likelihood ratios were fair. In practical terms, the net consequence of using the Katznelson model20 at threshold probabilities between 10 and 20% is equivalent to a strategy that identified 4–47 per 1000 patients for prophylactic pharmacological intervention without treating any unaffected patients (no overtreatment), compared with doing nothing (Fig. 4). Applying the recalibrated PRE-DELIRIC model22 at the same threshold probabilities of 10–20%, the corresponding net benefit would be to identify 30–67 per 1000 patients for prophylactic pharmacological intervention without overtreatment, compared with doing nothing. Consequently, the clinical usefulness of both models appears limited.

Although prediction models, including the ones we have assessed in this paper, remain an invaluable research tool for risk stratification or risk adjustment, the implication for future delirium research in cardiac surgical patients is that there is a need to consider model extensions or aggregation of multiple risk prediction models34 to provide a more accurate prediction. Model extension may require the inclusion of measures of frailty35 and specific illness severity scores, such as logistic EuroSCORE, as predictors for delirium to improve the model performance and clinical usefulness. Aggregation of multiple published risk prediction models is a promising new technique that can simultaneously update, identify, and estimate the best combination of published risk prediction models in a small validation cohort (<30 events) to yield a superior prediction model.34 Given the substantial number of events in our study, the aggregation of Katznelson20 and PRE-DELIRIC models21 22 29 would be unlikely to confer substantial benefit for improving the model performance over our present approach.

A limitation of the systematic review was the possibility of publication bias arising from restricting the language of publication to English and Chinese. Another limitation was the use of the CAM-ICU assessment tool for identifying delirium as it has
Fig 3 Comparison of Katznelson20 and recalibrated PRE-DELIRIC22 risk prediction models for area under the receiver operating characteristic (AUROC) curve with cut-offs at 10% (very low), 20% (low), 40% (moderate), and 60% (high). Axes: sensitivity against 1-specificity. Katznelson AUROC 0.62; recalibrated PRE-DELIRIC AUROC 0.75.

Conclusions
Despite a large number of published predictive models, there was a paucity of high-quality internally or externally validated risk prediction models for ICU delirium after cardiac surgery. The external validation approach used in this study was comprehensive and independent, in contrast to performance estimates from authors who subsequently carried out external validation of models they had developed themselves. Although the PRE-DELIRIC model21 was originally developed in a cohort of mixed ICU patients, the overall performance of the recalibrated PRE-DELIRIC22 was better than the Katznelson model20 in a defined cardiac surgical population. Although the assessed models remain sufficiently valid for research-based risk stratification, the discrimination and calibration are at best moderate and could be improved. Based on the decision curve analysis, the use of these models for clinical practice decisions appears premature. Improvements in model performance through model extension or aggregation of these prediction models in cardiac surgical patients would be necessary to justify their clinical use.
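For readers less familiar with decision curve analysis, the net benefit on which these conclusions rest can be sketched as follows. This is a minimal illustration assuming the standard definition (true-positive rate minus false-positive rate weighted by the odds of the threshold probability), not a reproduction of the analysis in this study. The function names and the patient counts are hypothetical.

```python
def net_benefit(true_pos, false_pos, n, threshold):
    """Net benefit of treating patients whose predicted risk is at or above
    `threshold`, relative to treating no one (standard definition)."""
    if not 0 < threshold < 1:
        raise ValueError("threshold must be strictly between 0 and 1")
    odds = threshold / (1 - threshold)  # weight given to a false positive
    return true_pos / n - (false_pos / n) * odds


def net_benefit_treat_all(prevalence, threshold):
    """Net benefit of the 'treat all' strategy at a given threshold."""
    odds = threshold / (1 - threshold)
    return prevalence - (1 - prevalence) * odds


def per_1000(nb):
    """Express net benefit as net true positives per 1000 patients."""
    return 1000 * nb


# Hypothetical counts, NOT taken from the study: 600 patients,
# 40 true positives and 60 false positives at a 20% threshold.
example_nb = net_benefit(true_pos=40, false_pos=60, n=600, threshold=0.20)
print(round(per_1000(example_nb), 1))

# 'Treat all' has zero net benefit when the threshold equals the prevalence.
print(abs(net_benefit_treat_all(0.138, 0.138)) < 1e-12)
```

Multiplying the net benefit by 1000 gives the "per 1000 patients" form, which can be read as the number of affected patients identified with no unaffected patient treated.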
Authors’ contributions
Study design: A.L. (systematic review); A.L., G.M.J., T.G., M.J.U. (validation cohort study)
Data collection: A.L., V.K.W.L. (systematic review); J.L.M., C.H.C. (validation cohort study)
Data analysis: A.L., G.M.J., T.G., M.J.U. (systematic review); A.L., J.L.M., G.M.J., T.G., M.J.U. (validation cohort study)
Data interpretation: A.L., G.M.J., T.G., M.J.U.
Manuscript preparation: A.L., J.L.M., G.M.J., C.H.C., V.K.W.L., T.G., M.J.U.
Fig 4 Comparison of decision curves for prediction of intensive care unit delirium after cardiac surgery. The threshold probability is the level above which a patient (or physician) would choose prophylactic treatment for delirium. Green dotted line assumes no patients have delirium (treat none). Black short-dashed line assumes all patients have delirium (treat all). Orange long-dashed line represents the expected benefit associated with the Katznelson model.20 Blue continuous line represents the expected benefit associated with the recalibrated PRE-DELIRIC model.22 Axes: net benefit against threshold probability.

Supplementary material
Supplementary material is available at British Journal of Anaesthesia online.

Acknowledgements
We thank the staff involved in CAM-ICU assessment training before study commencement.

Declaration of interest
A.L. and T.G. are members of the editorial board for Perioperative Medicine journal. A.L. is an editor for the Cochrane Anaesthesia, Critical and Emergency Care Review Group. G.M.J. is a member of the editorial board for Critical Care and Shock journals. J.L.M., C.H.C., V.K.W.L., and M.J.U. declare that they have no conflict of interest.
known limitations with detecting hypoactive or mixed delirium. Although the specificities for detecting hypoactive and mixed delirium are high (>92%) using CAM-ICU, the sensitivities are 31% (95% CI: 17–48%) and 53% (95% CI: 35–74%), respectively, in mixed ICU patients.36 Thus, our overall incidence of delirium may be an underestimate, but this is unlikely to affect the discrimination performance of the models.24
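The discrimination measure referred to above, the c-statistic, depends only on how the model ranks patients. The following minimal sketch computes it as the proportion of event and non-event pairs in which the patient with delirium has the higher predicted risk. It also illustrates that a logistic calibration with a positive slope leaves the c-statistic unchanged, because it preserves the ranking. The risks, outcomes, intercept, and slope are hypothetical values chosen for illustration and are not taken from the study.

```python
import math


def c_statistic(risks, outcomes):
    """Proportion of (event, non-event) pairs in which the event has the
    higher predicted risk; tied pairs count as one half."""
    events = [r for r, y in zip(risks, outcomes) if y == 1]
    non_events = [r for r, y in zip(risks, outcomes) if y == 0]
    if not events or not non_events:
        raise ValueError("need at least one event and one non-event")
    concordant = 0.0
    for e in events:
        for n in non_events:
            if e > n:
                concordant += 1.0
            elif e == n:
                concordant += 0.5
    return concordant / (len(events) * len(non_events))


def recalibrate(risk, intercept, slope):
    """Logistic calibration: logit(new risk) = intercept + slope * logit(old risk)."""
    linear_predictor = math.log(risk / (1 - risk))
    return 1 / (1 + math.exp(-(intercept + slope * linear_predictor)))


# Hypothetical predicted risks and observed outcomes (1 = delirium).
risks = [0.05, 0.10, 0.20, 0.30, 0.40, 0.60]
outcomes = [0, 0, 1, 0, 1, 1]

# Hypothetical calibration intercept and slope.
recalibrated = [recalibrate(r, intercept=-0.5, slope=0.7) for r in risks]

print(c_statistic(risks, outcomes))
print(c_statistic(recalibrated, outcomes))
```

Both calls return the same value because the calibration changes the predicted probabilities but not their order. Updating individual regression coefficients, by contrast, can change the ranking and therefore the c-statistic.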
Funding
Research Grants Council of the Hong Kong Special Administrative Region, China (project reference: CUHK469113); Chinese University of Hong Kong (Direct Grant 2041761).
References
1. Patel RP, Gambrell M, Speroff T, et al. Delirium and sedation in the intensive care unit: survey of behaviors and attitudes of 1384 healthcare professionals. Crit Care Med 2009; 37: 825–32
2. Lin Y, Chen J, Wang Z. Meta-analysis of factors which influence delirium following cardiac surgery. J Card Surg 2012; 27: 481–92
3. Salluh JI, Wang H, Schneider EB, et al. Outcome of delirium in critically ill patients: systematic review and meta-analysis. Br Med J 2015; 350: h2538
4. Saczynski JS, Marcantonio ER, Quach L, et al. Cognitive trajectories after postoperative delirium. N Engl J Med 2012; 367: 30–9
5. Gottesman RF, Grega MA, Bailey MM, et al. Delirium after coronary artery bypass graft surgery and late mortality. Ann Neurol 2010; 67: 338–44
6. Klein Klouwenberg PM, Zaal IJ, Spitoni C, et al. The attributable mortality of delirium in critically ill patients: prospective cohort study. Br Med J 2014; 349: g6652
7. Milbrandt EB, Deppen S, Harrison PL, et al. Costs associated with delirium in mechanically ventilated patients. Crit Care Med 2004; 32: 955–62
8. Steiner LA. Postoperative delirium. Part 1: pathophysiology and risk factors. Eur J Anaesthesiol 2011; 28: 628–36
9. Zaal IJ, Devlin JW, Hazelbag M, et al. Benzodiazepine-associated delirium in critically ill adults. Intensive Care Med 2015; 41: 2130–7
10. Zaal IJ, Devlin JW, Peelen LM, Slooter AJ. A systematic review of risk factors for delirium in the ICU. Crit Care Med 2015; 43: 40–7
11. Gosselt AN, Slooter AJ, Boere PR, Zaal IJ. Risk factors for delirium after on-pump cardiac surgery: a systematic review. Crit Care 2015; 19: 346
12. Labarère J, Renaud B, Fine MJ. How to derive and validate clinical prediction models for use in intensive care medicine. Intensive Care Med 2014; 40: 513–27
13. Ely EW, Margolin R, Francis J, et al. Evaluation of delirium in critically ill patients: validation of the Confusion Assessment Method for the Intensive Care Unit (CAM-ICU). Crit Care Med 2001; 29: 1370–9
14. Moons KG, de Groot JA, Bouwmeester W, et al. Critical appraisal and data extraction for systematic reviews of prediction modelling studies: the CHARMS checklist. PLoS Med 2014; 11: e1001744
15. Geersing GJ, Bouwmeester W, Zuithoff P, Spijker R, Leeflang M, Moons KG. Search filters for finding prognostic and diagnostic prediction studies in Medline to enhance systematic reviews. PLoS One 2012; 7: e32844
16. Neto AS, Nassar AP Jr, Cardoso SO, et al. Delirium screening in critically ill patients: a systematic review and meta-analysis. Crit Care Med 2012; 40: 1946–51
17. Smit HA, Pinart M, Antó JM, et al. Childhood asthma prediction models: a systematic review. Lancet Respir Med 2015; 3: 973–84
18. Collins GS, Reitsma JB, Altman DG, Moons KG. Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD): the TRIPOD Statement. Br J Surg 2015; 102: 148–58
19. Sessler CN, Gosnell MS, Grap MJ, et al. The Richmond Agitation–Sedation Scale: validity and reliability in adult intensive care unit patients. Am J Respir Crit Care Med 2002; 166: 1338–44
20. Katznelson R, Djaiani GN, Borger MA, et al. Preoperative use of statins is associated with reduced early delirium rates after cardiac surgery. Anesthesiology 2009; 110: 67–73
21. van den Boogaard M, Pickkers P, Slooter AJ, et al. Development and validation of PRE-DELIRIC (PREdiction of DELIRium in ICu patients) delirium prediction model for intensive care patients: observational multicentre study. Br Med J 2012; 344: e420
22. van den Boogaard M, Schoonhoven L, Maseda E, et al. Recalibration of the delirium prediction model for ICU patients (PRE-DELIRIC): a multinational observational study. Intensive Care Med 2014; 40: 361–9
23. Janssen KJ, Moons KG, Kalkman CJ, Grobbee DE, Vergouwe Y. Updating methods improved the performance of a clinical prediction model in new patients. J Clin Epidemiol 2008; 61: 76–86
24. Steyerberg EW. Clinical Prediction Models: a Practical Approach to Development, Validation, and Updating. New York: Springer, 2009
25. DeLong ER, DeLong DM, Clarke-Pearson DL. Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics 1988; 44: 837–45
26. Nattino G, Finazzi S, Bertolini G. A new test and graphical tool to assess the goodness of fit of logistic regression models. Stat Med 2016; 35: 709–20
27. Van Calster B, Nieboer D, Vergouwe Y, De Cock B, Pencina MJ, Steyerberg EW. A calibration hierarchy for risk models was defined: from utopia to empirical data. J Clin Epidemiol 2016; 74: 167–76
28. Vickers AJ, Elkin EB. Decision curve analysis: a novel method for evaluating prediction models. Med Decis Making 2006; 26: 565–74
29. Wassenaar A, van den Boogaard M, van Achterberg T, et al. Multinational development and validation of an early prediction model for delirium in ICU patients. Intensive Care Med 2015; 41: 1048–56
30. Bouwmeester W, Zuithoff NP, Mallett S, et al. Reporting and methods in clinical prediction research: a systematic review. PLoS Med 2012; 9: 1–12
31. Siontis GC, Tzoulaki I, Castaldi PJ, Ioannidis JP. External validation of new risk prediction models is infrequent and reveals worse prognostic discrimination. J Clin Epidemiol 2015; 68: 25–34
32. Guglielminotti J, Dechartres A, Mentré F, Montravers P, Longrois D, Laouénan C. Reporting and methodology of multivariable analyses in prognostic observational studies published in 4 anesthesiology journals. A methodological descriptive review. Anesth Analg 2015; 121: 1011–29
33. Debray TP, Vergouwe Y, Koffijberg H, Nieboer D, Steyerberg EW, Moons KG. A new framework to enhance the interpretation of external validation studies of clinical prediction models. J Clin Epidemiol 2015; 68: 279–89
34. Debray TP, Koffijberg H, Nieboer D, Vergouwe Y, Steyerberg EW, Moons KG. Meta-analysis and aggregation of multiple published prediction models. Stat Med 2014; 33: 2341–62
35. Jung P, Pereira MA, Hiebert B, et al. The impact of frailty on postoperative delirium in cardiac surgery patients. J Thorac Cardiovasc Surg 2015; 149: 869–75
36. van Eijk MM, van den Boogaard M, van Marum RJ, et al. Routine use of the confusion assessment method for the intensive care unit: a multicenter study. Am J Respir Crit Care Med 2011; 184: 340–4

Handling editor: Jonathan Hardman