Automated variable selection methods for logistic regression produced unstable models for predicting acute myocardial infarction mortality


Journal of Clinical Epidemiology 57 (2004) 1138–1146

Peter C. Austin a,b,c,*, Jack V. Tu a,b,c,d,e

a Institute for Clinical Evaluative Sciences, G1 06, 2075 Bayview Avenue, Toronto, Ontario M4N 3M5, Canada
b Department of Public Health Sciences, University of Toronto, McMurrich Bldg, 4th Floor, 12 Queen’s Park Crescent West, Toronto, Ontario, M5S 1A8 Canada
c Department of Health Policy, Management and Evaluation, University of Toronto, McMurrich Bldg, 2nd Floor, 12 Queen’s Park Crescent West, Toronto, Ontario, M5S 1A8 Canada
d Clinical Epidemiology and Health Care Research Program, Sunnybrook & Women’s College Health Sciences Centre, 2075 Bayview Ave, Toronto, Ontario, M4N 3M5 Canada
e Division of General Internal Medicine, Sunnybrook & Women’s College Health Sciences Centre, 2075 Bayview Ave, Toronto, Ontario, M4N 3M5 Canada

Accepted 14 April 2004

Abstract

Objectives: Automated variable selection methods are frequently used to determine the independent predictors of an outcome. The objective of this study was to determine the reproducibility of logistic regression models developed using automated variable selection methods.

Study Design and Setting: An initial set of 29 candidate variables were considered for predicting mortality after acute myocardial infarction (AMI). We drew 1,000 bootstrap samples from a dataset consisting of 4,911 patients admitted to hospital with an AMI. Using each bootstrap sample, logistic regression models predicting 30-day mortality were obtained using backward elimination, forward selection, and stepwise selection. The agreement between the different model selection methods and the agreement across the 1,000 bootstrap samples were compared.

Results: Using 1,000 bootstrap samples, backward elimination identified 940 unique models for predicting mortality. Similar results were obtained for forward and stepwise selection. Three variables were identified as independent predictors of mortality among all bootstrap samples. Over half the candidate prognostic variables were identified as independent predictors in less than half of the bootstrap samples.

Conclusion: Automated variable selection methods result in models that are unstable and not reproducible. The variables selected as independent predictors are sensitive to random fluctuations in the data.

© 2004 Elsevier Inc. All rights reserved.

Keywords: Regression models; Multivariate analysis; Variable selection; Logistic regression; Acute myocardial infarction; Epidemiology

1. Introduction

Investigators are frequently interested in determining the independent predictors of mortality after acute myocardial infarction (AMI). This may be done for several reasons. First, identifying the independent risk factors for mortality facilitates effective risk stratification, allowing physicians to optimize patient care. Second, such models allow for accurate risk adjustment, thus permitting valid comparisons of mortality across different hospitals, physicians, or regions. Third, such models can be used in the evaluation of new therapies and interventions in observational studies. However, there is no consensus in the literature as to which

* Corresponding author. Tel.: 416-480-6131; fax: 416-480-6048. E-mail address: [email protected] (P.C. Austin). 0895-4356/04/$ – see front matter © 2004 Elsevier Inc. All rights reserved. doi: 10.1016/j.jclinepi.2004.04.003

variables are the independent predictors of mortality after AMI. Several studies have developed statistical models to predict mortality after AMI. For example, Krumholz [1] compared the performance of and the variables contained in the Cardiovascular Cooperative Project Pilot model, the GUSTO-I model, the Medicare Mortality Predictor System model, an ICD-9 code model, and two models from the California Hospital Outcomes Project. The models differed in terms of performance and in the variables contained. No variables were common to all models. Several variables (e.g., age and location of infarct) appeared in the majority of models, whereas some predictors appeared in only two of the six models. There are several reasons for the differences in the variables identified as independent predictors of mortality. First, the patient populations differed between studies. Some studies were based upon patients enrolled in clinical

P.C. Austin, J.V. Tu / Journal of Clinical Epidemiology 57 (2004) 1138–1146

trials, whereas others included patients from the general AMI population. Second, the studies differed in terms of the variables collected a priori as potential predictors of mortality. Third, the studies differed in terms of the statistical methods used to model mortality. Whatever the reasons, different studies identified different variables as independent predictors of mortality after AMI.

Investigators developing models to predict mortality need to maintain a balance between including too many variables and model parsimony [2,3]. Omitting important prognostic factors results in a systematic mis-estimation of the regression coefficients and biased prediction, and including too many predictors results in the loss of precision in the estimation of the regression coefficients and the predictions of new responses [2]. Researchers frequently use automated variable selection methods, such as backward elimination or forward variable selection techniques, to identify independent predictors of mortality or to develop parsimonious regression models. Automated variable selection methods have been used in several studies examining the independent predictors of mortality after AMI [4–7].

The purposes of the current study were (1) to determine the degree to which random variability in a dataset can result in different variables being identified as independent predictors of mortality after an AMI (this allows one to assess the reproducibility or stability of models obtained using automated model selection methods) and (2) to compare the agreement between different automated model selection methods.

1.1. Model selection methods

Multiple automated variable selection methods have been developed. The three most commonly used methods are backward elimination, forward selection, and stepwise selection. Miller [8,9] and Hocking [10] provide comprehensive overviews of model selection methods. We briefly summarize these methods.
Backward elimination begins with a full model consisting of all candidate predictor variables. Variables are sequentially eliminated from the model until a prespecified stopping rule is satisfied. At a given step of the elimination process, the variable whose elimination would result in the smallest decrease in a summary measure is eliminated. Possible summary measures are deviance or R2. The most common stopping rule is that all variables that remain in the model are significant at a pre-specified significance level. Forward selection begins with the empty model. Variables are added sequentially to a model until a predefined stopping rule is satisfied. At a given step of the selection process, the variable whose addition would result in the greatest increase in the summary measure is added to the model. A typical stopping rule is that if any added variable would not be significant at a predefined significance level, then no further variables are added to the model.
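The two procedures just described can be sketched in a few lines. This is only an illustrative sketch, not the authors' software: the `p_values` callback is a hypothetical stand-in for fitting a logistic regression and returning each variable's P value, the toy P values are invented, and removing the variable with the largest P value is used here as a simple proxy for "smallest decrease in the summary measure".

```python
from typing import Callable, Dict, FrozenSet, Iterable, Set

# Hypothetical interface: given the variables in a model, return each one's P value.
PValueFn = Callable[[FrozenSet[str]], Dict[str, float]]


def backward_elimination(candidates: Iterable[str], p_values: PValueFn,
                         alpha: float = 0.05) -> FrozenSet[str]:
    """Start from the full model and drop the least significant variable each step."""
    model: Set[str] = set(candidates)
    while model:
        pvals = p_values(frozenset(model))
        worst = max(sorted(model), key=lambda v: pvals[v])
        if pvals[worst] < alpha:  # stopping rule: every remaining variable is significant
            break
        model.remove(worst)
    return frozenset(model)


def forward_selection(candidates: Iterable[str], p_values: PValueFn,
                      alpha: float = 0.05) -> FrozenSet[str]:
    """Start from the empty model and add the most significant candidate each step."""
    model: Set[str] = set()
    remaining = set(candidates)
    while remaining:
        # P value each remaining candidate would have if it were added now.
        trial = {v: p_values(frozenset(model | {v}))[v] for v in sorted(remaining)}
        best = min(trial, key=trial.get)
        if trial[best] >= alpha:  # stopping rule: no candidate would be significant
            break
        model.add(best)
        remaining.remove(best)
    return frozenset(model)


def toy_p_values(model: FrozenSet[str]) -> Dict[str, float]:
    """Invented P values; creatinine looks weaker once urea is in the model."""
    base = {"age": 0.001, "shock": 0.0001, "urea": 0.02, "sodium": 0.40}
    pvals = {v: base[v] for v in model if v in base}
    if "creatinine" in model:
        pvals["creatinine"] = 0.20 if "urea" in model else 0.01
    return pvals


CANDIDATES = ["age", "shock", "urea", "creatinine", "sodium"]
backward = backward_elimination(CANDIDATES, toy_p_values)
forward = forward_selection(CANDIDATES, toy_p_values)
print(sorted(backward))
print(sorted(forward))
```

In this toy setting the two procedures end with different models, because forward selection never revisits a variable once it has been added.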


Stepwise selection is a variation of forward selection. At each step of the variable selection process, after a variable has been added to the model, variables are allowed to be eliminated from the model. For instance, if the significance of a given predictor is above a specified threshold, it is eliminated from the model. The iterative process is ended when a pre-specified stopping rule is satisfied. Statisticians have several concerns about the use of automated variable selection methods: (1) It results in values of R2 that are biased high [11,12], (2) it results in estimated standard errors that are biased low [13], (3) the results are dependent upon the correlation between the predictor variables [14], and (4) the ordinary test statistics upon which these methods are based were intended for testing pre-specified hypotheses [13]. These results have been demonstrated in the context of linear regression estimated using ordinary least squares. The impact of automated model selection methods on logistic regression models needs to be more fully examined.
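The stepwise procedure described at the start of this subsection might be sketched as follows. As an assumption for illustration, model fitting is hidden behind a hypothetical `p_values` callback, the toy P values are invented, and a `max_steps` cap is added as a guard against add/remove cycles; none of these details come from the study itself.

```python
from typing import Callable, Dict, FrozenSet, Iterable, Set

# Hypothetical interface: given the variables in a model, return each one's P value.
PValueFn = Callable[[FrozenSet[str]], Dict[str, float]]


def stepwise_selection(candidates: Iterable[str], p_values: PValueFn,
                       alpha_enter: float = 0.05, alpha_remove: float = 0.05,
                       max_steps: int = 100) -> FrozenSet[str]:
    """Forward selection with an elimination pass after every addition."""
    pool = sorted(candidates)
    model: Set[str] = set()
    for _ in range(max_steps):  # cap guards against endless add/remove cycles
        # Forward step: add the most significant variable not yet in the model.
        trial = {v: p_values(frozenset(model | {v}))[v] for v in pool if v not in model}
        if not trial:
            break
        best = min(trial, key=trial.get)
        if trial[best] >= alpha_enter:
            break
        model.add(best)
        # Elimination step: drop variables whose P value is now above the threshold.
        while model:
            pvals = p_values(frozenset(model))
            worst = max(sorted(model), key=lambda v: pvals[v])
            if pvals[worst] <= alpha_remove:
                break
            model.remove(worst)
    return frozenset(model)


def toy_p_values(model: FrozenSet[str]) -> Dict[str, float]:
    """Invented P values; creatinine looks weaker once urea is in the model."""
    base = {"age": 0.001, "shock": 0.0001, "urea": 0.02, "sodium": 0.40}
    pvals = {v: base[v] for v in model if v in base}
    if "creatinine" in model:
        pvals["creatinine"] = 0.20 if "urea" in model else 0.01
    return pvals


stepwise = stepwise_selection(["age", "shock", "urea", "creatinine", "sodium"], toy_p_values)
print(sorted(stepwise))
```

Here creatinine enters early, then is eliminated once urea has been added, which is the behaviour that distinguishes stepwise from plain forward selection.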

2. Methods

2.1. Data sources

The Ontario Myocardial Infarction Database (OMID) is a population-based database of patients admitted with a most responsible diagnosis of AMI in the province of Ontario. It consists of patients discharged between 1 April 1992 and 31 March 2002. The OMID database was constructed by linking together Ontario’s major health care administrative databases. Details on the construction of the OMID database are provided elsewhere [15,16].

One of the limitations of the OMID database is the paucity of detailed clinical data. To address this limitation, detailed clinical data were collected on a random sample of 6,015 patients discharged from 57 Ontario hospitals between 1 April 1999 and 31 March 2001 by retrospective chart review to supplement the OMID administrative data. Data on patient history, cardiac risk factors, comorbid conditions and vascular history, vital signs, and laboratory tests were collected for this sample of patients.

2.2. Statistical methods

We chose a list of candidate variables that would be examined for their association with mortality within 30 days of AMI admission. These variables included demographic characteristics (age and gender), presenting signs and symptoms (cardiogenic shock and acute pulmonary edema), classical cardiac risk factors (diabetes, history of cerebrovascular accident or transient ischemic attack [CVA/TIA], history of hyperlipidemia, hypertension, family history of heart disease, and smoking history), comorbid conditions and vascular history (angina, asthma, coagulopathy, cancer, chronic liver disease, chronic congestive heart failure, dementia/Alzheimer’s, depression, hyperthyroid, peptic ulcer disease, peripheral arterial disease, renal disease, previous AMI, previous percutaneous transluminal coronary angioplasty [PTCA], previous coronary artery bypass graft [CABG] surgery, and aortic stenosis), vital signs on admission (systolic and diastolic blood pressure, heart rate, and respiratory rate), hematology laboratory test results (hemoglobin, white blood count, and international normalized ratio), and chemistry laboratory test results (sodium, potassium, glucose, urea, creatinine, total cholesterol, HDL and LDL cholesterol, and triglycerides).

Each of the above variables was examined for its univariate association with 30-day mortality. For categorical variables, a chi-squared test was used to determine the statistical significance of the association between the variable and mortality within 30 days of admission. Dichotomous risk factors were assumed to be absent unless their presence was explicitly documented in the patient’s medical record. Risk factors or conditions whose prevalence was <1% in the population of patients were excluded from further analyses. For continuous variables, a univariate logistic regression predicting 30-day mortality was fit to determine the statistical significance of the association between the variable and mortality. Continuous variables that were missing for more than 10% of the population were excluded from further analyses. Exclusion of patients with missing data on continuous variables resulted in a final sample size of 4,911 for the multivariate analyses.

Variables that were significantly associated with 30-day mortality with a significance level of P < .25 were selected for possible inclusion in multivariate logistic regression models to predict 30-day mortality. We chose P = .25 as the threshold for including variables in the multivariate model because this has been suggested elsewhere as an appropriate threshold [17]. In sensitivity analyses, we also explored the use of P = .10 and P = .05 as alternate thresholds.
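The screening rules above (prevalence of at least 1%, no more than 10% missing, univariate P < .25) can be applied directly to the values reported in Table 1. The sketch below is an illustration rather than the authors' code. It makes two assumptions: the pairing of values with variables follows the order in which they appear in Table 1, and "<.0001" is stored as 0.0001, an upper bound that does not affect any comparison here.

```python
# (variable, prevalence in %, univariate P value), transcribed from Table 1, part A.
CATEGORICAL = [
    ("Female sex", 35.3, 0.0001), ("Acute pulmonary edema", 5.7, 0.0001),
    ("Cardiogenic shock", 2.2, 0.0001), ("Diabetes", 24.9, 0.0001),
    ("Hypertension", 44.3, 0.6962), ("Smoking history", 34.2, 0.0001),
    ("CVA/TIA", 9.6, 0.0001), ("Hyperlipidemia", 31.0, 0.0001),
    ("Family history of heart disease", 30.3, 0.0001), ("Angina", 31.3, 0.0466),
    ("Cancer", 3.7, 0.0017), ("Dementia/Alzheimer's", 3.8, 0.0001),
    ("Peptic ulcer disease", 4.9, 0.6100), ("Previous MI", 21.9, 0.0001),
    ("Asthma", 4.8, 0.8962), ("Liver disease", 0.6, 0.2877),
    ("Depression", 6.7, 0.0001), ("Peripheral arterial disease", 6.8, 0.0001),
    ("Previous PTCA", 3.6, 0.0095), ("Coagulopathy", 0.3, 0.7169),
    ("Congestive heart failure (chronic)", 5.6, 0.0001), ("Hyperthyroid", 1.7, 0.5772),
    ("Renal disease (dialysis dependent)", 0.6, 0.1662), ("Previous CABG", 6.9, 0.3263),
    ("Stenosis (aortic)", 1.5, 0.0001),
]

# (variable, percent missing, univariate P value), transcribed from Table 1, part B.
CONTINUOUS = [
    ("Age", 0.0, 0.0001), ("Systolic blood pressure", 0.6, 0.0001),
    ("Diastolic blood pressure", 0.8, 0.0001), ("Heart rate", 0.9, 0.0001),
    ("Respiratory rate", 9.6, 0.0001), ("Hemoglobin", 1.9, 0.0001),
    ("White blood count", 1.9, 0.0001), ("International normalized ratio", 17.6, 0.0006),
    ("Sodium", 2.1, 0.0001), ("Potassium", 2.3, 0.0001), ("Glucose", 4.2, 0.0001),
    ("Urea", 7.1, 0.0001), ("Creatinine", 2.9, 0.0001),
    ("Total cholesterol", 43.0, 0.1108), ("HDL cholesterol", 52.8, 1.0000),
    ("LDL cholesterol", 55.1, 0.0001), ("Triglycerides", 43.9, 0.0236),
]


def screen(threshold: float) -> list:
    """Apply the prevalence, missingness, and univariate P-value rules."""
    keep = [name for name, prevalence, p in CATEGORICAL
            if prevalence >= 1.0 and p < threshold]
    keep += [name for name, missing, p in CONTINUOUS
             if missing <= 10.0 and p < threshold]
    return keep


candidates = screen(0.25)
print(len(candidates))
print(screen(0.10) == candidates, screen(0.05) == candidates)
```

Under these assumptions the screen returns the 29 candidate variables used in the study, and the candidate set does not change when the threshold is tightened to .10 or .05.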
Table 1 contains each variable identified above, its prevalence (for categorical variables), the proportion of patients with missing data (continuous variables), and the statistical significance of its univariate association with 30-day mortality.

We used bootstrap methods to examine the stability of models predicting 30-day mortality after AMI [18]. First, all observations with any missing data on the continuous predictors were eliminated. From this sample, we chose 1,000 bootstrap samples. A bootstrap sample is a sample of the same size as the original dataset chosen with replacement. Thus, a given subject in the original cohort may occur multiple times, only once, or not at all in a specific bootstrap sample. Once we chose a bootstrap sample, we used three different variable reduction methods to arrive at a final regression model for determining the variables that are significant independent predictors of 30-day mortality. First, we used backward elimination with a threshold of P = .05 for eliminating a variable from the model. Second, we used forward model selection with a threshold of P = .05 for selecting a variable for inclusion in the model. Third, we used stepwise model selection with thresholds of P = .05

Table 1
Univariate association between mortality and clinical characteristics

A. Categorical variables

Variable                                   Prevalence   30-day mortality rate (a)   P value
Female sex                                 35.3%        14.7%                       <.0001
Patient history: admission
  Acute pulmonary edema                    5.7%         25.9%                       <.0001
  Cardiogenic shock                        2.2%         67.4%                       <.0001
Cardiac risk factors
  Diabetes                                 24.9%        13.8%                       <.0001
  Hypertension                             44.3%        11.1%                       .6962
  Smoking history                          34.2%        7.1%                        <.0001
  CVA/TIA                                  9.6%         18.5%                       <.0001
  Hyperlipidemia                           31.0%        7.0%                        <.0001
  Family history of heart disease          30.3%        4.2%                        <.0001
Comorbid conditions and vascular history
  Angina                                   31.3%        12.1%                       .0466
  Cancer                                   3.7%         17.3%                       .0017
  Dementia/Alzheimer’s                     3.8%         35.7%                       <.0001
  Peptic ulcer disease                     4.9%         11.8%                       .6100
  Previous MI                              21.9%        14.7%                       <.0001
  Asthma                                   4.8%         10.7%                       .8962
  Liver disease                            0.6%         16.2%                       .2877
  Depression                               6.7%         16.8%                       <.0001
  Peripheral arterial disease              6.8%         16.9%                       <.0001
  Previous PTCA                            3.6%         5.5%                        .0095
  Coagulopathy                             0.3%         5.0%                        .7169
  Congestive heart failure (chronic)       5.6%         30.7%                       <.0001
  Hyperthyroid                             1.7%         12.6%                       .5772
  Renal disease (dialysis dependent)       0.6%         18.2%                       .1662
  Previous CABG                            6.9%         12.4%                       .3263
  Stenosis (aortic)                        1.5%         24.4%                       <.0001

B. Continuous variables

Variable                                   Percent missing   Odds ratio (b)   P value
Age                                        0.0%              1.080            <.0001
Vital signs on admission
  Systolic blood pressure                  0.6%              0.977            <.0001
  Diastolic blood pressure                 0.8%              0.967            <.0001
  Heart rate                               0.9%              1.013            <.0001
  Respiratory rate                         9.6%              1.079            <.0001
Laboratory tests (hematology)
  Hemoglobin                               1.9%              0.971            <.0001
  White blood count                        1.9%              1.071            <.0001
  International normalized ratio           17.6%             1.464            .0006
Laboratory tests (chemistry)
  Sodium                                   2.1%              0.951            <.0001
  Potassium                                2.3%              1.861            <.0001
  Glucose                                  4.2%              1.064            <.0001
  Urea                                     7.1%              1.125            <.0001
  Creatinine                               2.9%              1.007            <.0001
  Total cholesterol                        43.0%             0.627            .1108
  HDL cholesterol                          52.8%             1.000            1.0000
  LDL cholesterol                          55.1%             0.585            <.0001
  Triglycerides                            43.9%             0.834            .0236

(a) Overall 30-day mortality in the cohort was 10.9%.
(b) Odds ratio is relative change in the odds of 30-day mortality with a one-unit increase in the predictor variable.


for variable selection and for variable elimination. Thus, for a given bootstrap sample, we obtained three final models: one obtained using backward elimination, one obtained using forward variable selection, and one obtained using stepwise variable selection. For each model, we noted which variables had been selected and compared the results across the three variable selection methods.

We repeated this process using the 1,000 bootstrap samples. We determined the proportion of regression models in which each of the candidate variables was retained, and we determined the proportion of bootstrap samples in which there was agreement between the different model selection methods. Finally, we determined the distribution of the number of variables selected for the final model using each of the three model selection methods.

We repeated the above analysis using only variables that were significantly associated with 30-day mortality in a univariate analysis at the P < .10 level and again at the P < .05 level.

Finally, we repeated the primary analysis to examine the impact of using clinical judgment in association with automated variable selection methods. To do so, we made several clinical judgments. First, we identified three variables a priori that we felt to be the strongest predictors of AMI mortality. These variables were age, the presence of cardiogenic shock at presentation, and systolic blood pressure on admission. These three variables were forced into each regression model using backward, forward, and stepwise model selection. Second, two of the variables, creatinine and urea levels, are, to a certain degree, surrogates for one another and are usually correlated with each other. Thus, the decision was made to consider only creatinine levels as a candidate and to exclude urea from the regression models. Similarly, the decision was made to exclude diastolic blood pressure due to the inclusion of systolic blood pressure.
Finally, three of the variables (history of previous myocardial infarction [MI], previous PTCA, and previous CABG) are markers for the presence or severity of previous coronary artery disease. The decision was made to retain only one of these variables (history of previous MI) and to exclude the other two as potential independent predictors of mortality.
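The bootstrap procedure described in this section can be outlined as a short loop: resample with replacement, run each selection method on the same resample, then tally unique models, per-variable inclusion rates, model sizes, and pairwise agreement. The sketch below is only an outline under stated assumptions. The `Selector` callables are hypothetical placeholders for backward, forward, and stepwise selection, and the toy selectors and toy data are invented so that the example runs on its own.

```python
import random
from collections import Counter
from typing import Callable, Dict, FrozenSet, List, Sequence

Row = Dict[str, float]
# Hypothetical interface: a selection method maps a dataset to the chosen variables.
Selector = Callable[[Sequence[Row]], FrozenSet[str]]


def bootstrap_sample(data: Sequence[Row], rng: random.Random) -> List[Row]:
    """Draw len(data) rows with replacement, so rows may repeat or be absent."""
    return [data[rng.randrange(len(data))] for _ in range(len(data))]


def stability_summary(data: Sequence[Row], selectors: Dict[str, Selector],
                      n_boot: int = 1000, seed: int = 0):
    rng = random.Random(seed)
    chosen: Dict[str, List[FrozenSet[str]]] = {name: [] for name in selectors}
    for _ in range(n_boot):
        sample = bootstrap_sample(data, rng)  # every method sees the same resample
        for name, select in selectors.items():
            chosen[name].append(select(sample))
    summary = {}
    for name, models in chosen.items():
        counts = Counter(v for m in models for v in m)
        summary[name] = {
            "unique_models": len(set(models)),
            "inclusion_rate": {v: c / n_boot for v, c in counts.items()},
            "model_sizes": Counter(len(m) for m in models),
        }
    names = list(selectors)
    # Share of resamples in which two methods chose exactly the same variables.
    agreement = {
        (a, b): sum(x == y for x, y in zip(chosen[a], chosen[b])) / n_boot
        for i, a in enumerate(names) for b in names[i + 1:]
    }
    return summary, agreement


def make_toy_selector(cutoff: float) -> Selector:
    """Invented rule: always keep 'age'; keep 'marker' if its sample mean exceeds cutoff."""
    def select(sample: Sequence[Row]) -> FrozenSet[str]:
        mean_marker = sum(row["marker"] for row in sample) / len(sample)
        return frozenset({"age", "marker"}) if mean_marker > cutoff else frozenset({"age"})
    return select


data_rng = random.Random(1)
toy_data = [{"marker": data_rng.gauss(0.0, 1.0)} for _ in range(200)]
summary, agreement = stability_summary(
    toy_data,
    {"loose": make_toy_selector(0.0), "strict": make_toy_selector(0.1)},
    n_boot=200,
)
print(summary["loose"]["unique_models"], agreement[("loose", "strict")])
```

Passing the same resample to every method is what makes the pairwise agreement figures comparable across methods.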

3. Results

Backward model selection resulted in 940 unique regression models in the 1,000 bootstrap samples. Eight hundred eighty-nine models were chosen only once, 45 models were chosen twice, 3 models were chosen three times, and 3 models were chosen four times. No model was chosen more than four times in the 1,000 bootstrap samples using backward selection. Forward model selection resulted in 932 unique regression models in the 1,000 bootstrap samples. Eight hundred seventy-nine models were chosen only once, 41 models were chosen twice, 9 models were chosen three times, and 3 models were chosen four times. No model was chosen more than four times in the 1,000 bootstrap samples using forward selection. Stepwise model selection


resulted in 936 unique regression models in the 1,000 bootstrap samples. Eight hundred eighty-six models were chosen only once, 39 models were chosen twice, 8 models were chosen three times, and 3 models were chosen four times. No model was chosen more than four times in the 1,000 bootstrap samples using stepwise selection.

Over the 1,000 bootstrap samples, the variables selected by backward selection agreed with those selected by forward selection in 83.4% of the bootstrap samples. The model determined by backward selection agreed with that determined by stepwise selection in 90.1% of the bootstrap samples. Finally, the model selected by forward selection agreed with that determined by stepwise selection in 91.5% of the bootstrap samples.

The number of times that each variable was selected using each of the three variable selection techniques is depicted in Fig. 1. Age, systolic blood pressure, and shock at presentation were identified as independent predictors of mortality in all 1,000 models using each of the three variable selection methods. Glucose level was identified as an independent predictor of mortality in at least 99.5% of the bootstrap samples using each of the three methods. White blood count was similarly identified in at least 98.7% of the bootstrap samples using backward selection, forward selection, and stepwise selection. Urea was identified as an independent predictor of mortality in 91.0%, 94.8%, and 91.8% of the bootstrap samples using backward selection, forward selection, and stepwise selection, respectively. The remaining 23 variables were selected in <90% of the bootstrap samples using each model selection method. Eighteen of the 29 variables were identified as independent predictors of mortality in less than half of the bootstrap samples using backward selection.
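As a quick consistency check on the counts reported above, the number of models chosen once, twice, three times, and four times should add up to the number of unique models, and the same counts weighted by multiplicity should add up to the 1,000 bootstrap samples. The short script below only re-adds the figures given in the text; the dictionary layout is our own.

```python
# method -> {times a model was chosen: number of distinct models chosen that often}
multiplicity = {
    "backward": {1: 889, 2: 45, 3: 3, 4: 3},
    "forward": {1: 879, 2: 41, 3: 9, 4: 3},
    "stepwise": {1: 886, 2: 39, 3: 8, 4: 3},
}

totals = {}
for method, counts in multiplicity.items():
    unique_models = sum(counts.values())
    bootstrap_samples = sum(times * n for times, n in counts.items())
    totals[method] = (unique_models, bootstrap_samples)
    print(method, unique_models, bootstrap_samples)
```

The totals come to 940, 932, and 936 unique models for backward, forward, and stepwise selection, each accounting for exactly 1,000 bootstrap samples, in agreement with the text.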
Six variables (cancer, sodium, diastolic blood pressure, diabetes, smoking status, and history of previous MI) were selected in less than 10% of the bootstrap models using each variable selection method. Six variables were identified as independent predictors of mortality in fewer than 10% of the bootstrap samples using backward elimination (cancer, sodium, diastolic blood pressure, diabetes, smoking status, and history of previous MI). At least one of these six variables was identified as an independent predictor in 37.3% of the bootstrap models using backward elimination. Twelve variables were identified as independent predictors of mortality in <20% of the bootstrap samples using backward elimination (peripheral arterial disease, stenosis, angina, CVA/TIA, hyperlipidemia, and hemoglobin, in addition to the above six variables). However, in 77.8% of the bootstrap samples, at least one of these 12 variables was identified as an independent predictor of mortality using backward selection. Comparable results were obtained for forward and stepwise selection.

The number of variables selected in the 1,000 bootstrap replications using each variable selection method is depicted in Fig. 2. Each model selection method resulted in a final model with between 8 and 19 variables in the 1,000 bootstrap samples. Furthermore, the distribution of the


[Figure 1: bar chart. Vertical axis: number of times variable chosen (0 to 1,000). Horizontal axis: variable (age, systolic BP, shock, glucose, WBC, urea, heart rate, dementia, creatinine, depression, previous PTCA, CHF, respiratory rate, pulmonary edema, family history, female, peripheral arterial disease, potassium, stenosis, angina, CVA/TIA, hyperlipidemia, hemoglobin, cancer, sodium, diastolic BP, diabetes, smoker, previous AMI). Series: backward selection, forward selection, stepwise selection.]

Fig. 1. Number of times each variable was selected by automated variable selection methods.

number of variables in the resultant models is approximately normal. Using backward selection, the majority of the resultant models (77.1%) had between 11 and 14 variables that were identified as independent predictors of mortality. For forward and stepwise selection, 78.3% and 78.2% of the resultant models, respectively, had between 11 and 14 variables identified as independent predictors of mortality. Within each bootstrap sample, we determined the number of variables identified as independent predictors of mortality

[Figure 2: bar chart. Vertical axis: percent of bootstrap samples (%), 0 to 25. Horizontal axis: number of variables selected (8 to 19). Series: backward selection, forward selection, stepwise selection.]

Fig. 2. Number of variables in final model.


[Figure 3: bar chart. Vertical axis: percent of bootstrap samples (%), 0 to 100. Horizontal axis: difference in number of variables (-2 to 3). Series: backward vs forward, backward vs stepwise, forward vs stepwise.]

Fig. 3. Differences in number of variables identified as independent predictors.

using each of the three variable selection methods. We then determined the difference in the number of variables identified using each of the three selection methods. Results are reported in Fig. 3. In 85.6% of the bootstrap samples, backward selection and forward selection identified the same number of variables as being independent predictors of mortality. However, in 6.3% of the bootstrap samples, backward selection identified more independent predictors of mortality than did forward selection. In six bootstrap samples, backward selection identified three more independent predictors of mortality than did forward selection. In 91.9% of the bootstrap samples, backward selection and stepwise selection identified the same number of variables as being independent predictors of mortality. However, in 7.0% of the bootstrap samples, backward selection identified more independent predictors of mortality than did stepwise selection. In seven bootstrap samples, backward selection identified three more independent predictors of mortality than did stepwise selection. In 91.8% of the bootstrap samples, forward selection and stepwise selection identified the same number of variables as being independent predictors of mortality. However, in 8.2% of the bootstrap samples, forward selection identified more independent predictors of mortality than did stepwise selection. In seven bootstrap samples, forward selection identified two more independent predictors of mortality than did stepwise selection.

As a sensitivity analysis, we included as candidate variables for the variable selection methods only variables whose association with mortality in a univariate analysis was significant at the P = .10 or P = .05 level. This resulted in the same variables being considered as candidate variables. This was due to the fact that the risk factors whose prevalence was at

least 1% and that were significant at P = .25 were also significant at the .10 and .05 levels (Table 1).

When clinical judgment was used in association with automated model selection methods, results similar to those described above were obtained. Creatinine, glucose, and white blood count were identified as independent predictors of mortality in 99.9%, 99.9%, and 98.5% of bootstrap samples, respectively, using backward elimination. Sixteen of 26 variables were identified as significant in fewer than half of the bootstrap models using backward elimination. Despite using clinical judgment in variable selection, backward elimination identified 845 unique subsets of variables. No model was identified more than seven times in the 1,000 bootstrap samples.

As a final sensitivity analysis, we created a random variable for inclusion in the automated variable selection process. This is an adaptation of an approach suggested by Miller [9]. For each subject, we generated a random variable from a standard normal distribution. This value was generated to be independent of the outcome and all other variables. We then included this randomly generated noise variable in the backward, forward, and stepwise model selection processes. This variable was included in 16.3% of the models selected using backward elimination. It was selected in 15.9% and 15.8% of the models selected using forward and stepwise selection, respectively. Nineteen variables were selected more frequently than this randomly generated noise variable; however, 10 variables were selected less frequently than this noise variable (angina, hyperlipidemia, CVA/TIA, hemoglobin, cancer, diastolic blood pressure, diabetes, sodium, smoker, and previous AMI). Thus, a randomly generated variable was identified as an independent predictor


of mortality more often than 34.5% of clinical characteristics considered in the current study.
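The noise-variable check in the preceding paragraph can be illustrated in a few lines. This is a minimal sketch, not the authors' code: the seed is arbitrary, the helper `selected_less_often_than` and the frequencies passed to it are hypothetical, and only the cohort size, the 16.3% figure, and the 10-of-29 count come from the text.

```python
import random
from typing import Dict, List

N_SUBJECTS = 4911  # cohort size used for the multivariate analyses

# One standard normal draw per subject, generated without reference to the
# outcome or to any other variable (the seed is an arbitrary choice).
rng = random.Random(2004)
noise = [rng.gauss(0.0, 1.0) for _ in range(N_SUBJECTS)]


def selected_less_often_than(noise_rate: float, rates: Dict[str, float]) -> List[str]:
    """Hypothetical helper: candidates whose selection rate falls below the noise variable's."""
    return sorted(name for name, rate in rates.items() if rate < noise_rate)


# Invented selection rates, for illustration only.
example = selected_less_often_than(0.163, {"age": 1.0, "sodium": 0.05, "urea": 0.91})
print(example)

# Share of the 29 candidates selected less often than the noise variable (10 of 29).
share = round(100 * 10 / 29, 1)
print(share)
```

The final line reproduces the 34.5% quoted in the text from the reported counts of 10 and 29.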

4. Discussion

Using detailed clinical data on 4,911 AMI patients, we have highlighted several issues concerning the use of automated model selection methods. First, the variables identified as independent predictors of AMI mortality using automated variable selection methods are highly dependent on random fluctuations in the subjects included in the dataset. Using 1,000 bootstrap samples, backward selection identified 940 distinct subsets of variables as independent predictors of mortality. Furthermore, no regression model was chosen more than four times. Similar results were obtained for forward and stepwise model selection. Second, the majority of candidate variables were identified as independent predictors of mortality in only a minority of models. This is despite the fact that the candidate variables were all plausible predictors of AMI mortality. Three variables were identified as independent predictors of mortality in all bootstrap models; an additional three variables were identified as independent predictors of mortality in at least 90% of the models chosen using backward selection, and one further variable was identified as an independent predictor of mortality in at least 80% of the models. Third, for a given model selection method, there was substantial variability in the number of variables that were identified as independent predictors of mortality across the 1,000 bootstrap replications. Fourth, backward selection and forward selection methods agreed with one another on the independent predictors of mortality in 83.4% of the bootstrap samples. Thus, our findings explain, in part, why different studies of AMI mortality consistently identify different predictors of mortality.

The use of automated variable selection methods is controversial, with many statisticians having reservations about their use. There are multiple reasons for this.
Flack and Chang [11], using simulations, demonstrated that a large proportion of selected variables are truly independent of the outcome and that the resultant model’s R2 is upwardly biased. A similar study, also using simulation methods, determined that between 20% and 74% of variables selected are noise variables that are unrelated to the outcome [14]. Furthermore, the number of noise variables included increased as the number of candidate variables increased. Similarly, using simulations, Murtaugh [2] reported that the probability of correctly identifying variables was inversely proportional to the number of variables under consideration. Furthermore, coefficient estimates obtained using automated variable selection methods are biased away from zero, and P values associated with testing the statistical significance of each independent variable are biased low. Thus, the magnitude of the association between each selected variable and the outcome is larger than is warranted by the data, and the

statistical significance of the association is overstated. These results were demonstrated in the context of linear regression using ordinary least squares. We have demonstrated that similar concerns are valid for logistic regression models.

Henderson and Velleman [19] caution against the blind use of automated variable selection methods, arguing for the use of an interactive model-building approach. They argue that the data analyst must bring subject-specific knowledge to the model-building process. Furthermore, automated model-building methods can mask problems with multicollinearity, nonlinearity, and observations with high leverage.

There are at least three reasons for constructing regression models. The first is primarily epidemiologic in focus. In this setting, one is interested in assessing the association between an exposure variable and an outcome of interest. Such an assessment must take into account possible confounders and effect modifiers [20]. In such a setting, a structured approach to modeling can be used because the researcher has a primary hypothesis guiding the model-building process. Furthermore, the variables identified a priori as possible confounders or effect modifiers can be informed by the data analyst’s previous experience and knowledge of the research field. The second reason for constructing regression models is for predictive purposes. In cardiovascular research, one frequently wants to predict the likelihood of an adverse outcome such as mortality. In such instances, one is more concerned with predictive accuracy than with the variables that are entered in the model. Model development should incorporate methods that involve data splitting [21] or bootstrap assessments of predictive accuracy [18]. The third reason for constructing regression models is hypothesis generation or exploratory analysis. In this setting, one is interested in determining the independent predictors of an event.
Automated model selection methods are frequently used because the model-building process is not guided by a clear hypothesis that one is interested in testing. In clinical research, it is important to determine which variables are associated with a worse prognosis. This allows appropriate risk stratification and can allow clinicians to provide more appropriate medical care.

In the first setting described above, the use of automated variable selection methods is inappropriate and unnecessary because the researcher is guided by a clear hypothesis and should use a structured approach to assessing the effects of confounders and effect modifiers.

In the second setting described above, automated variable selection methods are not inappropriate if used in conjunction with cross-validation or data-splitting methods to determine predictive accuracy in an independent validation dataset. The resultant model is likely to contain important independent predictors of mortality and noise variables mistakenly identified as predictors of mortality. However, because the objective of the model is prediction and not hypothesis testing, this is not a major limitation. The use of automated variable selection methods in this context would be problematic if variables that were mistakenly identified as independent predictors of the outcome were expensive or difficult to obtain. In a related article, the authors [22] examine the use of backward variable elimination in conjunction with bootstrap methods to develop predictive models.

In the third setting described above, the use of automated variable selection methods is likely to result in the greatest problems. The resultant model will likely contain variables that are true predictors of the outcome and variables that have mistakenly been identified as predictors of the outcome. By interpreting the model in isolation, one cannot assess which variables fall into each of the two categories. To draw more robust conclusions, we suggest several possible strategies: (1) using the bootstrap methods described in the current study to assess the strength of the evidence that a given variable is an independent predictor of mortality, (2) comparing the final model with other models reported in the literature to assess the consistency of the findings, and (3) validating the model in an independent dataset.

Altman and Andersen [23] used bootstrap sampling to assess the stability of a Cox regression model fit to 216 patients enrolled in a clinical trial. Using 17 candidate variables, they demonstrated that stepwise selection identified between 4 and 10 variables as independent predictors of mortality. Furthermore, the frequency with which variables were included in the models ranged from a high of 100% for one variable to a low of 6% for another variable. Eleven variables were identified as significant predictors in less than half of the bootstrap models.

Despite theoretical concerns about the validity of automated model selection methods, such methods are frequently used in the clinical literature. Studies in the literature identify different variables as independent predictors of mortality after AMI.
Part of this is due to differences in the inclusion/exclusion criteria upon which the different cohorts were based. We have shown that an additional reason is that small degrees of random variation in one dataset can have a substantial influence on which variables, and how many variables, are identified as independent predictors of mortality. Thus, it is likely that no one regression model estimated on one dataset can conclusively identify the independent predictors of mortality. Such a model will likely include variables that truly are independently associated with mortality. However, such a model will also most likely include spurious variables that are not true independent predictors of mortality. Furthermore, our study demonstrated that by changing the method by which variables are selected, two investigators, working with the same data, may identify different regression models.

In conclusion, we make several recommendations. First, investigators need to be aware of the limitations of using automated variable selection methods. When using these methods, there is a strong likelihood that variables that are truly independent of the outcome will be identified as independent predictors of the outcome. Second, statistical modeling should not be separated from subject-matter expertise. Regression modeling should be informed by clinical knowledge and not be treated as a “black box.” Third, when automated variable selection methods are used in an exploratory analysis to determine which variables are associated with the outcome of interest, we recommend that bootstrap methods similar to those outlined in the current manuscript be used to determine the strength of the evidence that a given variable truly is an independent predictor of the outcome.
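
The bootstrap assessment of variable selection can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in the study. It relies on simulated data with two true predictors and six noise variables, forward selection based on likelihood ratio tests at an assumed significance level of 0.05, and 50 bootstrap samples. The study itself examined backward elimination, forward selection, and stepwise selection with 29 candidate variables and 1,000 bootstrap samples. All function and variable names (`fit_logistic`, `forward_select`, `bootstrap_selection`) are hypothetical.

```python
import math
import numpy as np


def fit_logistic(X, y, n_iter=25):
    """Fit a logistic regression by Newton-Raphson; return the maximized log-likelihood."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        w = p * (1.0 - p)
        hessian = X.T @ (X * w[:, None])
        step = np.linalg.solve(hessian, X.T @ (y - p))
        beta += step
        if np.max(np.abs(step)) < 1e-8:
            break
    p = np.clip(1.0 / (1.0 + np.exp(-X @ beta)), 1e-12, 1.0 - 1e-12)
    return float(np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))


def forward_select(X, y, alpha=0.05):
    """Forward selection: repeatedly add the candidate with the largest
    likelihood ratio statistic while its 1-df test has P < alpha."""
    n, k = X.shape
    selected = []
    design = np.ones((n, 1))  # intercept-only starting model
    ll_current = fit_logistic(design, y)
    while len(selected) < k:
        best_j, best_ll = None, ll_current
        for j in range(k):
            if j in selected:
                continue
            ll = fit_logistic(np.column_stack([design, X[:, j]]), y)
            if ll > best_ll:
                best_j, best_ll = j, ll
        if best_j is None:
            break
        lr_stat = max(2.0 * (best_ll - ll_current), 0.0)
        # Upper tail of a chi-square with 1 df: P(|Z| > sqrt(lr_stat)).
        p_value = math.erfc(math.sqrt(lr_stat / 2.0))
        if p_value >= alpha:
            break
        selected.append(best_j)
        design = np.column_stack([design, X[:, best_j]])
        ll_current = best_ll
    return frozenset(selected)


def bootstrap_selection(X, y, n_boot=50, alpha=0.05, seed=0):
    """Return each variable's selection frequency across bootstrap samples
    and the number of distinct models that were selected."""
    rng = np.random.default_rng(seed)
    n, k = X.shape
    counts = np.zeros(k)
    models = set()
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample patients with replacement
        chosen = forward_select(X[idx], y[idx], alpha)
        models.add(chosen)
        for j in chosen:
            counts[j] += 1
    return counts / n_boot, len(models)


# Simulated data (an assumption for illustration): variables 0 and 1 are
# true predictors; variables 2 to 7 are noise unrelated to the outcome.
rng = np.random.default_rng(2004)
n, k = 500, 8
X = rng.standard_normal((n, k))
true_beta = np.array([1.0, 0.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
prob = 1.0 / (1.0 + np.exp(-(-1.0 + X @ true_beta)))
y = (rng.random(n) < prob).astype(float)

freq, n_models = bootstrap_selection(X, y, n_boot=50)
for j, f in enumerate(freq):
    print(f"variable {j}: selected in {100 * f:.0f}% of bootstrap samples")
print(f"distinct models selected: {n_models} of 50")
```

In a sketch like this, the per-variable selection frequency plays the role of the strength-of-evidence measure described above, and the count of distinct models gives a rough indication of how unstable the selection procedure is on a given dataset.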

Acknowledgments

The Institute for Clinical Evaluative Sciences is supported in part by a grant from the Ontario Ministry of Health and Long-Term Care. The opinions, results and conclusions are those of the authors and no endorsement by the Ministry of Health and Long-Term Care or by the Institute for Clinical Evaluative Sciences is intended or should be inferred. Financial support for this study was provided in part by a grant from the Canadian Institutes of Health Research (CIHR) to the Canadian Cardiovascular Outcomes Research Team (CCORT). Dr. Austin is supported in part by a New Investigator Award from the CIHR. Dr. Tu is supported by a Canada Research Chair in Health Services Research.

References

[1] Krumholz HM, Chen J, Wang Y, Radford MJ, Chen YT, Marciniak TA. Comparing AMI mortality among hospitals in patients 65 years and older: evaluating methods of risk adjustment. Circulation 1999;99:2986–92.
[2] Murtaugh PA. Methods of variable selection in regression modeling. Commun Stat Simulation Computation 1998;27:711–34.
[3] Wears RL, Lewis RJ. Statistical models and Occam’s razor. Acad Emerg Med 1999;6:93–4.
[4] The Multicenter Postinfarction Research Group. Risk stratification and survival after myocardial infarction. N Engl J Med 1983;309:331–6.
[5] Suarez C, Herrera M, Vera A, Torrado E, Ferriz J, Arboleda JA. Prediction on admission of in-hospital mortality in patients older than 70 years with acute myocardial infarction. Chest 1995;108:83–8.
[6] Henning R, Wedel H, Wilhelmsen L. Mortality risk estimation by multivariate statistical analyses. Acta Med Scand 1975;586(Suppl):14–31.
[7] Dubois C, Pierard LA, Albert A, Smeets J, Demoulin J, Boland J, Kulbertus HE. Short-term risk stratification at admission based on simple clinical data in acute myocardial infarction. Am J Cardiol 1988;61:216–9.
[8] Miller AJ. Selection of subsets of regression variables. J R Stat Soc [Ser A] 1984;147:389–425.
[9] Miller A. Subset selection in regression. 2nd edition. Boca Raton (FL): Chapman & Hall/CRC; 2002.
[10] Hocking RR. The analysis and selection of variables in linear regression. Biometrics 1976;32:1–49.
[11] Flack VF, Chang PC. Frequency of selecting noise variables in subset regression analysis: a simulation study. Am Stat 1987;14:84–6.
[12] Copas JB, Long T. Estimating the residual variance in orthogonal regression with variable selection. Statistician 1991;40:51–9.
[13] Harrell FE Jr. Regression modelling strategies: with applications to linear models, logistic regression, and survival analyses. New York: Springer-Verlag; 2001.
[14] Derkson S, Keselman HJ. Backward, forward and stepwise automated subset selection algorithms: frequency of obtaining authentic and noise variables. Br J Math Stat Psychol 1992;45:265–82.
[15] Tu JV, Naylor CD, Austin P. Temporal changes in the outcomes of acute myocardial infarction in Ontario, 1992–1996. Can Med Assoc J 1999;161:1257–61.
[16] Tu JV, Austin P, Naylor CD, Iron K, Zhang H. Acute myocardial infarction outcomes in Ontario. In: Naylor CD, Slaughter PM, editors. Cardiovascular health services in Ontario: an ICES atlas. Toronto, Canada: Institute for Clinical Evaluative Sciences; 1999. p. 83–110.
[17] Hosmer DW, Lemeshow S. Applied logistic regression. New York: John Wiley & Sons; 1989.
[18] Efron B, Tibshirani R. An introduction to the bootstrap. London: Chapman & Hall; 1993.
[19] Henderson HV, Velleman RF. Building multiple regression models interactively. Biometrics 1981;37:391–411.
[20] Rothman KJ, Greenland S. Modern epidemiology. 2nd edition. Philadelphia: Lippincott Williams & Wilkins; 1998.
[21] Picard RR, Berk KN. Data splitting. Am Stat 1990;4:140–7.
[22] Austin PC, Tu JV. Bootstrap methods for developing predictive models. Am Stat 2004;58:131–7.
[23] Altman DG, Andersen PK. Bootstrap investigation of the stability of a Cox regression model. Stat Med 1989;8:771–83.