Statistically Speaking
Explanatory Versus Predictive Modeling
Kristin L. Sainani, PhD
INTRODUCTION When building multivariate statistical models, researchers need to be clear as to whether their goals are explanatory or predictive. Explanatory research aims to identify risk (or protective) factors that are causally related to an outcome. Predictive research aims to find the combination of factors that best predicts a current diagnosis or future event. This distinction affects every aspect of model building and evaluation. Unfortunately, researchers often conflate the two, which leads to errors [1]. This article reviews the differences between explanatory and predictive modeling.
EXPLANATORY MODELING The aim of explanatory research is to establish causal relationships. For example, in the article by Yu et al [2], the researchers attempted to identify risk factors that would predispose amputees to falling during the postoperative period. A better understanding of these risk factors could help researchers design interventions to prevent falls. Explanatory model building is primarily concerned with identifying individual risk factors that are associated with the outcome as well as ruling out confounding (extraneous variables that are related to both the risk factor and outcome may create spurious associations between them). Explanatory modelers need to worry about chance findings, unmeasured confounding, and residual confounding.
CANDIDATE VARIABLES Testing too many candidate variables may lead to type I errors (a statistically significant finding that is due to chance). Researchers can limit the number of candidate variables by focusing on a few key hypotheses (eg, Do opioids increase the risk of falls? Do benzodiazepines increase the risk of falls?). Often, however, the goal is broader: to identify multiple risk factors for an outcome. In this case, researchers should select candidate variables a priori based on biologic plausibility, previous research, or clinical experience. For example, Yu et al [2] chose potential factors “on the basis of clinical relevance and experience,” including comorbidities, cognitive deficits, and medications that might increase the risk of falls.
VARIABLE SELECTION The process of selecting variables for the final multivariate model should be driven by the specific hypotheses being tested. Risk factors are included if they have a significant or near-significant relationship with the outcome; confounder variables are included if they change the relationship between a risk factor of interest and the outcome, regardless of statistical significance. Some variables, such as age and gender, may be included for face validity even if they are not significantly related to the outcome. Explanatory modelers should avoid automated variable selection procedures, for example, stepwise regression, because these optimize overall model fit with no regard to the roles of individual variables. Yu et al [2] included 7 clinically relevant variables in their multivariate logistic regression (a multivariate regression technique used when the outcome is binary [eg, fall/no fall]): etiology of amputation, level of amputation, side of amputation, presence of cognitive impairment, presence of chronic renal failure, use of opioid analgesics, and use of benzodiazepines.
They manually removed nonsignificant variables to arrive at a final model that included the type of etiology (dysvascular versus nondysvascular), level of amputation (transtibial versus nontranstibial), and side of amputation (right versus left). The remaining 4 risk factors were not independently associated with falling once the 3 amputation characteristics were taken into account.
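To illustrate this manual, hypothesis-driven approach, a minimal sketch in Python (using the statsmodels library, with a hypothetical data set and variable names; not the authors' actual code or data) might look like the following:

```python
# A sketch of hypothesis-driven explanatory modeling with manual variable
# selection. The data file and column names are hypothetical; this is not
# the code or data of Yu et al.
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("amputee_falls.csv")  # hypothetical data set, one row per patient

# Candidate risk factors chosen a priori (binary indicator columns assumed)
candidates = ["dysvascular_etiology", "transtibial_level", "right_side",
              "cognitive_impairment", "chronic_renal_failure",
              "opioid_use", "benzodiazepine_use"]

X = sm.add_constant(df[candidates])
full_model = sm.Logit(df["fall"], X).fit()
print(full_model.summary())  # inspect each coefficient and its P value

# The analyst, not an algorithm, decides which nonsignificant variables to
# drop, retaining confounders or face-validity variables where appropriate.
reduced = ["dysvascular_etiology", "transtibial_level", "right_side"]
final_model = sm.Logit(df["fall"], sm.add_constant(df[reduced])).fit()
print(final_model.summary())
```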
MODEL ASSESSMENT AND VALIDATION For explanatory models, researchers should focus on the individual β coefficients and P values for the risk factors of interest. For example, in the study by Yu et al [2], a transtibial level of amputation had an odds ratio of 2.127 (P < .05) for falls compared with a nontranstibial amputation. Measures of overall model performance, such as R2 values, are less important. Similarly, researchers who attempt to validate the findings need to confirm individual causal relationships rather than overall model performance. Authors and readers should consider the potential role of chance in the findings, particularly if a large number of risk factors were tested and if the resulting P values achieve only a moderate level of significance (.01 < P < .05). They should also consider whether the apparent relationships could be explained by unmeasured or residual confounding.
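Continuing the hypothetical sketch above, odds ratios and 95% confidence intervals for the individual risk factors can be obtained by exponentiating the coefficients and their confidence limits:

```python
# Continuing the hypothetical sketch above: convert the log-odds
# coefficients of "final_model" into odds ratios with 95% confidence intervals.
import numpy as np
import pandas as pd

conf = final_model.conf_int()                      # 95% CIs on the log-odds scale
summary = pd.concat([final_model.params, conf], axis=1)
summary.columns = ["log_odds", "ci_lower", "ci_upper"]

odds_ratios = np.exp(summary)                      # exponentiate to the odds-ratio scale
odds_ratios["p_value"] = final_model.pvalues
print(odds_ratios)
```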
PREDICTIVE MODELING Prediction models aim to accurately estimate the probability that a disease is present (diagnosis) or that a future event will occur (prognosis). For example, Bates et al [3] built a model to predict 1-year mortality of veterans with stroke. Knowing a patient’s mortality risk can help patients, physicians, and caregivers to better plan postdischarge care and priorities. In predictive modeling, overall predictive accuracy is paramount and the role of individual variables is less critical. Variables may be included in the final model even if they are not causally related to the outcome. For example, stroke patients discharged to an acute care facility are more likely to die, but this variable is a marker of poor health rather than a cause of death. Predictive modelers need to consider several aspects of model performance. They also need to worry about overfitting and generalizability. Overfit models are tuned to the idiosyncrasies of a particular sample and thus have high predictive accuracy for the sample but not for new observations. Because of the problem of overfitting, prediction models should always undergo validation.
CANDIDATE VARIABLES Predictive modelers typically start with a larger pool of candidate variables than explanatory modelers.
Bates et al [3] considered 6 types of candidate variables: demographics, type of stroke, comorbidities, procedures received during hospitalization or intensive care unit stays, length of stay in the hospital, and discharge location. When the pool of candidate predictors is large relative to the sample size, overfitting is likely. Thus, predictive modelers may screen out candidate variables or apply data reduction techniques, such as principal components analysis, before final model building. For example, Bates et al [3] first screened out candidate variables that appeared unrelated to death in bivariate analyses (P > .20), which left 38 variables.
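A minimal sketch of this kind of bivariate screening step (with a hypothetical data set and outcome name; not the authors' code) might look like the following:

```python
# Sketch of bivariate screening: fit one univariate logistic regression per
# candidate and keep variables with P < .20. The data set, outcome name, and
# the assumption of numeric candidate columns are all hypothetical.
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("stroke_cohort.csv")              # hypothetical data set
outcome = "died_1yr"                               # hypothetical binary outcome

screened = []
for var in df.columns.drop(outcome):
    fit = sm.Logit(df[outcome], sm.add_constant(df[[var]])).fit(disp=0)
    if fit.pvalues[var] < 0.20:                    # liberal screening threshold
        screened.append(var)

print(f"{len(screened)} candidate variables passed the P < .20 screen")
```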
VARIABLE SELECTION Predictive modelers often use automated selection procedures. For example, Bates et al [3] fit a logistic regression model by using automated backward selection (retaining variables with P < .05) to arrive at a final model with 17 variables. Automated selection procedures help researchers to find the combination of predictors that optimizes overall model fit. However, these methods cause considerable overfitting and thus should be used cautiously and only in conjunction with validation [4]. Newer automated selection procedures that incorporate shrinkage, for example, LASSO (“least absolute shrinkage and selection operator”), have considerable advantages over traditional methods [4,5]. The final prediction model may be translated into a clinical scoring rule. Bates et al [3] assigned scores for different risk factors based on the size of their β coefficients in the final logistic regression model. For example, patients younger than 60 years old receive 0 points, patients 60-69 years old receive 2 points, patients 70-79 years old receive 5 points, and patients 80 years old and older receive 8 points. Higher risk scores correspond to a higher probability of death.
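As one illustration of a shrinkage-based alternative to backward selection (not the procedure Bates et al actually used), a LASSO-penalized logistic regression could be fit as follows; the data set and column names are hypothetical:

```python
# Sketch: LASSO-penalized logistic regression as a shrinkage-based
# alternative to backward selection. Hypothetical data set and columns.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegressionCV
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("stroke_cohort.csv")              # hypothetical data set
y = df["died_1yr"]
predictors = df.drop(columns=["died_1yr"])
X = StandardScaler().fit_transform(predictors)     # standardize before penalizing

# The L1 penalty shrinks weak coefficients to exactly zero; the penalty
# strength is tuned by internal 10-fold cross-validation on the AUC.
lasso = LogisticRegressionCV(penalty="l1", solver="liblinear",
                             Cs=20, cv=10, scoring="roc_auc")
lasso.fit(X, y)

kept = predictors.columns[np.flatnonzero(lasso.coef_[0])]
print("Predictors retained by the LASSO:", list(kept))
```

Whichever selection method is used, the final coefficients can then be scaled and rounded to integer points to produce a clinical scoring rule of the kind Bates et al report.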
MODEL PERFORMANCE Predictive models should be assessed for discrimination, calibration, and goodness of fit. Unfortunately, many predictive modelers fail to report any metrics beyond discrimination [6]. Additional statistics (called reclassification statistics) are needed if the goal of an analysis is to compare 2 prediction models. Prediction models also should be evaluated for their clinical utility, although this may require further studies. A model is said to discriminate well if it systematically assigns higher predicted probabilities to those who have the outcome compared with those who do not. Discrimination is typically measured by using receiver operating characteristic (ROC) curves and the related C-statistic (equal to the area under the ROC curve). Bates et al [3] reported an area under the ROC curve of 0.785, which represents moderate discrimination. Calibration addresses the accuracy of the estimated probabilities. A model may discriminate well but still be poorly calibrated.
For example, a model that assigns predicted probabilities of death of 15% to all patients who lived and 16% to all patients who died would discriminate perfectly; however, these predicted risks are way off (because the observed risks for these groups are 0% and 100%). Calibration is assessed by comparing the observed and predicted event rates in subgroups of patients, typically with a Hosmer-Lemeshow test or calibration plot. In the study by Bates et al [3], the observed and predicted event rates were not significantly different according to a Hosmer-Lemeshow test. Goodness of fit assesses how well the model fits the data and captures aspects of both calibration and discrimination. These measures are automatically generated when performing multivariate regression in most statistical analysis packages and include root mean-squared error, R2, Akaike information criterion, and Bayesian information criterion. When comparing the performance of 2 prediction models, researchers should not rely on ROC analysis, which is insensitive to differences between the models [7]. Instead, they should report the net reclassification index or the integrated discrimination improvement, which measure the improvement in risk prediction in 1 model compared with another.
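A minimal sketch of computing discrimination and a simple calibration check for a model's test-set predictions (with synthetic stand-in data so the example runs end to end; not the authors' analysis) might look like the following:

```python
# Sketch: discrimination (C-statistic), overall accuracy of the predicted
# probabilities (Brier score), and a calibration check by decile of
# predicted risk. Synthetic stand-in data are used so the sketch runs.
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score, brier_score_loss

rng = np.random.default_rng(0)
y_prob = rng.uniform(0.01, 0.60, 5000)             # stand-in predicted probabilities
y_true = rng.binomial(1, y_prob)                   # stand-in observed outcomes

c_statistic = roc_auc_score(y_true, y_prob)        # area under the ROC curve
brier = brier_score_loss(y_true, y_prob)

# Compare observed and predicted event rates within deciles of predicted
# risk -- the same idea that underlies the Hosmer-Lemeshow test.
deciles = pd.qcut(y_prob, q=10, duplicates="drop")
calibration = (pd.DataFrame({"observed": y_true, "predicted": y_prob})
               .groupby(deciles, observed=True).mean())
print(calibration)
print(f"C-statistic = {c_statistic:.3f}, Brier score = {brier:.3f}")
```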
VALIDATION Model performance may be overestimated when models are fit and tested on the same data (due to overfitting). Thus, validation is essential. With external validation, researchers test the model on an entirely new data set. With internal validation, researchers partition the available data into “training” and “test” sets, such that models are fit and tested on different data. External validation is optimal, because it reveals both overfitting and a lack of generalizability to novel populations. However, external validation requires the researchers to perform a second study, which may be logistically infeasible. Internal validation uses the data on hand.
In a split-sample approach, researchers fit the model on one portion of their data and test it on the remainder. For example, Bates et al [3] fit the model on 60% of their data (n = 7539), holding out the remaining 40% (n = 5026) for validation. Discrimination, calibration, and goodness-of-fit metrics were similar for the training and validation sets, which suggests that their model did not suffer from overfitting. Split-sample validation requires a large sample. For smaller samples, researchers can use cross validation or bootstrap validation. These methods cleverly “recycle” observations such that each observation is used alternately in training and testing. For example, in 10-fold cross validation, the data are randomly divided into 10 equally sized subsets (folds). The model is fit by using nine-tenths of the data and is tested on the remaining tenth. This process is repeated 10 times, such that each observation serves as a test data point exactly once. Measures of model performance are averaged over these 10 iterations by using only the values calculated on the test data. Authors and readers should focus on the validation-corrected metrics, not the metrics calculated on the original data. They also should keep in mind that internally validated models will likely perform worse when applied to novel populations.
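A minimal sketch of 10-fold cross-validation for a logistic prediction model (hypothetical data set and outcome name; scikit-learn's cross_val_score handles the fold-by-fold fitting and testing) might look like the following:

```python
# Sketch: 10-fold cross-validation of a logistic regression prediction
# model, averaging the C-statistic over the 10 held-out folds.
# The data set and outcome name are hypothetical.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

df = pd.read_csv("stroke_cohort.csv")              # hypothetical data set
y = df["died_1yr"]
X = df.drop(columns=["died_1yr"])

folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
aucs = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                       cv=folds, scoring="roc_auc")

# Report the validation-corrected metric (the mean over the test folds),
# not the apparent AUC computed on the data used to fit the model.
print(f"Cross-validated C-statistic: {aucs.mean():.3f} (SD {aucs.std():.3f})")
```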
CONCLUSIONS The differences between explanatory modeling and predictive modeling are summarized in Table 1. Unfortunately, the distinction between these two often becomes blurred, which leads to confusion and errors. Researchers with explanatory goals get sidetracked when trying to optimize overall model metrics (such as R2 values or areas under the ROC curve) and neglect issues such as confounding. Researchers with predictive goals get sidetracked worrying about individual β coefficients and P values, and omit critical steps such as calibration or validation. Thus, researchers need to be clear as to the goals of their analysis. Readers also need to understand these differences because the “checklists” for evaluating explanatory and predictive models are distinct.

Table 1. Summary of explanatory versus predictive models

Goal
  Explanatory models: Establish causal relationships
  Predictive models: Predict current diagnoses or future outcomes

Threats to validity
  Explanatory models: Chance findings (type I errors); confounding
  Predictive models: Overfitting; lack of generalizability to new populations

Candidate variables
  Explanatory models: A limited set of prespecified risk factors and confounders
  Predictive models: A larger set of potential predictors; some predictors may not be causally related to the outcome

Variable selection
  Explanatory models: Hypothesis driven; should not use automated selection procedures
  Predictive models: Exploratory; may use automated selection procedures, but validation is essential and newer automated procedures that incorporate shrinkage are preferred

Measures of model performance
  Explanatory models: Size of β coefficients for individual risk factors; level of significance for individual risk factors
  Predictive models: Discrimination (eg, ROC analysis); calibration (eg, Hosmer-Lemeshow test); goodness of fit (eg, R2, AIC); reclassification (eg, net reclassification index); clinical utility

Validation
  Explanatory models: New studies are needed to confirm individual causal relationships
  Predictive models: Internal validation: split-sample validation; cross validation; bootstrap validation; external validation

ROC = receiver operating characteristic; AIC = Akaike information criterion.
REFERENCES
1. Shmueli G. To explain or to predict? Stat Sci 2010;25:289-310.
2. Yu JC, Lam K, Nettel-Aguirre A, Donald M, Dukelow S. Incidence and risk factors of falling in the postoperative lower limb amputee while on the surgical ward. PM R 2010;2:926-934.
3. Bates BE, Xie D, Kwong PL, Kurichi JE, Ripley DC, Stineman MG. One-year all-cause mortality after stroke: a prediction model. PM R 2014;6:473-483.
4. Sainani KL. Multivariate regression: the pitfalls of automated variable selection. PM R 2013;5:791-794.
5. Tibshirani R. Regression shrinkage and selection via the lasso. J R Stat Soc B 1996;58:267-288.
6. Collins GS, de Groot JA, Dutton S, et al. External validation of multivariable prediction models: a systematic review of methodological conduct and reporting. BMC Med Res Methodol 2014;14:40.
7. Cook NR. Use and misuse of the receiver operating characteristic curve in risk prediction. Circulation 2007;115:928-935.