J Clin Epidemiol Vol. 43, No. 4, pp. 349-359, 1990. Printed in Great Britain. All rights reserved.
0895-4356/90 $3.00 + 0.00. Copyright © 1990 Pergamon Press plc
A COMPARISON OF MULTIVARIABLE MATHEMATICAL METHODS FOR PREDICTING SURVIVAL-II. STATISTICAL SELECTION OF PROGNOSTIC VARIABLES*

STEPHEN D. WALTER,1† ALVAN R. FEINSTEIN2‡ and CAROLYN K. WELLS2

1Department of Clinical Epidemiology and Biostatistics, McMaster University, Hamilton, Ontario, Canada and 2Yale University School of Medicine, New Haven, CT 06510, U.S.A.
(Received 30 August 1989)
Abstract-The introductory abstract for Parts I-III in this series appears at the beginning of Part I. The concluding summary appears at the end of Part III.
INTRODUCTION

Part I [1] in this series described the underlying rationale for comparing the statistical performance of alternative multivariable analytic procedures in identifying important prognostic variables for survival. A cohort of 1266 lung cancer patients was used for the analysis. In a prognostic stratification developed before the current research, the patients had been divided, according to various combinations of symptoms, functional severity, and TNM stage, into five biologic groups having a strong prognostic gradient in 6-month survival rates, ranging from 85% in Group A to 14% in Group E. Part I also described how a challenge set of 200 patients was removed from the 1266 patients, by stratified randomization, to have the same proportions in Groups A through E. From the remaining 1066 patients, seven "generating" samples, each containing 400
*Supported in part by Grant Number HS 04101 from the National Center for Health Services Research, OASH; from The Andrew W. Mellon Foundation; and The Council for Tobacco Research-U.S.A., Inc. as a Special Project No. 135.
†Professor, Department of Clinical Epidemiology and Biostatistics, McMaster University.
‡Reprint requests should be addressed to: Alvan R. Feinstein, M.D., Yale University School of Medicine, 333 Cedar Street, New Haven, CT 06510, U.S.A.
patients, were chosen to have different patterns of relative frequencies in the five prognostic strata. In the seven generating samples, the frequencies were arranged, with stratified random selection, to distribute the five strata into shapes that were uniform, unimodal symmetric, increasing exponential, decreasing exponential, U-shaped, bimodal, or proportional to the original distribution of the 1266 patients. The seven different generating samples were intended to test the ability of several competing statistical methods to select consistent sets of prognostic variables in a wide range of populations, for which the same biologic attributes had been arranged in different patterns of distribution. A statistical method that can achieve such consistency would offer predictions that are relatively robust, i.e. unaffected by the composition of the population in a single study, thus allowing the results of one study to be generalized more easily to other situations. The objectives were to examine whether different statistical methods, applied to different distributions of the same basic population, would identify the same set of significant prognostic variables and give them similar ranks of importance. The purpose of this report is to outline the essential features of each of the competing statistical methods, and to describe the main
results obtained when the different methods were used to identify prognostic variables in the various generating samples.

DESCRIPTION OF BASIC STATISTICAL METHODS
Multiple linear regression
Regression methods characterize a relationship between a dependent variable (sometimes referred to as a response variable or the outcome) and one or more independent variables. If k independent variables are to be included in the model, the multiple linear regression relationship may be denoted as follows:

Y = b0 + b1X1 + b2X2 + ... + bkXk + e.    (1)
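As a minimal numerical sketch, equation (1) can be fit by ordinary least squares; the data and coefficient values below are invented for illustration and are not from the lung cancer cohort:

```python
import numpy as np

# Hypothetical data: n patients, k = 2 predictors (the paper's own
# variables would be e.g. TNM stage, age).
rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 2))              # X1, X2
b_true = np.array([0.5, 1.2, -0.8])      # b0, b1, b2 (invented)
y = b_true[0] + X @ b_true[1:]           # noiseless Y for a clean check

# Fit equation (1) by least squares: prepend a column of 1s for b0.
A = np.column_stack([np.ones(n), X])
b_hat, *_ = np.linalg.lstsq(A, y, rcond=None)

print(np.round(b_hat, 3))                # recovers [ 0.5  1.2 -0.8]
```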
Here Y represents the dependent variable, X1, X2, ..., Xk are the independent (or predictor) variables, and e is a random error term representing the difference between the observed value of Y and its expected value under the model. The coefficients b1, b2, ..., bk represent the predicted change in Y per unit change in each independent variable, holding fixed the values of the other independent variables. Thus, each term in the model represents a partial explanation of the variation in Y, adjusted for the effects of other variables in the model. The term b0 is the expected value of Y when all the independent variables have the value zero. Although linear regression is usually applied to continuous dependent variables, the independent variables may be continuous and/or discrete. In the data used here, the dependent variable was the dichotomous outcome of alive or dead at 6 months. We recognize that this discrete dependent variable is not wholly appropriate for linear regression, mainly because of its violation of the usual normality assumption for e. It was used so that its results could be directly compared with those of the other methods.

Multiple logistic regression
This method uses one dependent variable, which is a logistic function of the probability of response, denoted by p. This form of regression is pertinent for a dichotomous dependent variable, such as alive/dead at 6 months. As before, the independent variables may be continuous and/or discrete. The general form of the regression is:
log[p/(1 - p)] = b0 + b1X1 + b2X2 + ... + bkXk + e.
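A small sketch of the logit transform and its inverse (the sigmoid); the coefficient values are illustrative only:

```python
import numpy as np

def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    return np.log(p / (1.0 - p))

def inv_logit(z):
    """Sigmoid: maps the linear predictor b0 + b1*X1 + ... back to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Sigmoid shape with one predictor: p near 0 for very negative X,
# near 1 for very positive X (b0, b1 invented for illustration).
b0, b1 = 0.0, 1.0
for x in (-5.0, 0.0, 5.0):
    print(round(inv_logit(b0 + b1 * x), 3))   # 0.007, 0.5, 0.993
```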
Log[p/(1 - p)] is known as the logit of the response probability, p. The form of this dose-response relationship is more easily visualized when there is only one independent variable. The equation then reduces to log[p/(1 - p)] = b0 + b1X + e, which has a familiar sigmoid curve, with low values of p (near 0) as X is increasingly negative and high values of p (near 1) as X is increasingly positive.

Cox proportional hazards method

This method, which is another regression technique, was developed specifically for the analysis of survival data. It allows for one or more predictor variables to be related to the risk of death, and uses the follow-up information from all patients, including those who are not followed until death. Individuals with only partial follow-up provide information on the risk of death in some time periods, even though their own death has not been observed. Data for such patients are called "censored"; censoring may occur because the data are analyzed before the death of all individuals in the study, or because some individuals are lost to follow-up or withdrawn from the study before death [2]. The Cox regression method involves a hazard function, which describes the instantaneous probability of death in a short interval of time, as a function of the independent variables. The hazard function may be expressed as follows:

h(t, X) = h0(t) exp(b1X1 + b2X2 + ... + bkXk).
The hazard is proportional to the probability that an individual who is alive at time t (and hence in the set of people "at risk" at time t) will die in the following short interval of time. This probability is potentially influenced by variables X1, X2, ..., Xk included in the hazard function; and the contribution of each variable is adjusted for the effects of the others included in the model. Variables with positive coefficients correspond to an increasing risk of death as X increases. Two different analyses were carried out with the Cox procedure, using different parts of the data. In the first analysis, only the follow-up data from the first year were used for each patient; patients still alive after 1 year were regarded as censored at that time. In the second analysis, the first 5 years of follow-up were used, with analogous censoring of the few patients still alive at 5 years.
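The proportional-hazards form can be sketched numerically. The coefficients, covariate values, and baseline hazard below are all invented; the point is only that the hazard ratio between two individuals does not depend on t:

```python
import numpy as np

def hazard(t, x, b, h0):
    """h(t, X) = h0(t) * exp(b1*X1 + ... + bk*Xk)."""
    return h0(t) * np.exp(np.dot(b, x))

# Illustrative coefficients and baseline hazard (both invented).
b = np.array([0.7, -0.3])
h0 = lambda t: 0.05 * (1 + t)

x_a = np.array([1.0, 0.0])   # hypothetical patient A
x_b = np.array([0.0, 1.0])   # hypothetical patient B

# Proportional hazards: h(t, A)/h(t, B) is the same at every t,
# here exp(0.7 - (-0.3)) = exp(1.0).
for t in (0.5, 2.0, 4.0):
    print(round(hazard(t, x_a, b, h0) / hazard(t, x_b, b, h0), 4))
```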
Discriminant analysis
In this method, one derives a linear combination of the independent variables X1, X2, ..., Xk which has large variation between individuals in different population subgroups, compared with the small variation between individuals in the same subgroups. In this study, two subgroups were defined by death or survival at 6 months. Therefore, the discriminant function here identifies the combination of independent variables which differs the most between the patients who survived or died before 6 months, relative to the variation within the two survival groups. The discriminant function may be expressed as:

f = b0 + b1X1 + b2X2 + ... + bkXk
and can be computed for any individual whose set of X values is known. Subsequently, each individual can be assigned into one or other of the subgroups according to whether his/her discriminant function value f exceeds a certain threshold value. If the independent variables are strongly related to survival, this assignment will constitute a relatively accurate classification; if the independent variables are only weakly related to survival, the discriminant function will make relatively frequent classification errors. Discriminant analysis has some algebraic similarities to multiple linear regression. In fact if only two groups are to be distinguished, and if the same set of k independent variables is used with both methods, there is a simple conversion between the discriminant and regression coefficients [3], so that the results of the two methods are identical.
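A minimal sketch of two-group linear discriminant analysis on invented data (the paper's groups would be dead/alive at 6 months); the coefficients come from the pooled within-group covariance, and classification uses a midpoint threshold assuming equal priors:

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical, well-separated subgroups with 2 predictors each.
n = 300
X0 = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(n, 2))   # group 0
X1 = rng.normal(loc=[3.0, 3.0], scale=1.0, size=(n, 2))   # group 1

m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
# Pooled within-group covariance.
S = (np.cov(X0, rowvar=False) + np.cov(X1, rowvar=False)) / 2.0

# Fisher's linear discriminant: b maximizes between-group variation
# relative to within-group variation.
b = np.linalg.solve(S, m1 - m0)
threshold = b @ (m0 + m1) / 2.0        # midpoint cut (equal priors assumed)

f0, f1 = X0 @ b, X1 @ b                # discriminant scores f for each group
accuracy = (np.mean(f0 < threshold) + np.mean(f1 >= threshold)) / 2.0
print(round(accuracy, 3))              # high, since the groups barely overlap
```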
COMPUTER PROCEDURES
SAS programs [4, 5] were used to calculate the statistical models for linear regression (PROC STEPWISE), logistic regression (PROC LOGIST), and Cox regression (PROC PHGLM). The discriminant function analyses were calculated using the BMDP7M program [6], since stepwise discriminant function analysis was unavailable in version 82.3 of SAS. All the independent variables were standardized to permit comparison of their coefficients. The stepwise selection process is described in the next section, and the algorithm for entering or deleting variables is described in the Appendix.
STEPWISE SELECTION OF INDEPENDENT VARIABLES
Because many potential independent variables might be included in the models, some form of systematic selection was required. A good selection method will choose a relatively small subset of meaningful independent variables while excluding variables with limited predictive value. The most commonly used selection algorithms are the “stepwise” procedures. In the forward selection stepwise procedure, one begins with no independent variables in the model, and successively adds variables to build up a progressively more complex model. Addition of new variables ceases when there is no significant improvement in the predictions of the whole model. In contrast, the backward elimination stepwise procedure begins with all candidate variables included in the model: one then eliminates variables in turn, continuing until further deletion would significantly reduce the predictive value of the model as a whole. Although the backward elimination method has the advantage of including every variable in at least one model (at the initial step), its main disadvantage is a relatively high computational expense, with the cost rising rapidly with the number of independent variables included in the model. A further difficulty is that the backward elimination method may not be able to include all candidate independent variables in the model, because of data insufficiency. As the number of candidate variables increases, the likelihood goes up that two or more of their partial effects will be confounded, so that it will be impossible to distinguish which of two variables (if either) is more meaningfully related to the outcome. Faced with confounded data, some computer programs may fail, or at best will make arbitrary decisions about the deletion of independent variables until a sufficiently small set of non-confounded variables remains. 
Because of the large number of candidate predictor variables in our data set, and bearing in mind the difficulties associated with the backward elimination method, we elected to use forward stepping. We also adopted, however, a frequently used modification which tests each stage of model development for the continued significance of all variables currently selected into the model. More details of our selection algorithm are given in the Appendix.
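A simplified sketch of forward stepping on synthetic data. The actual algorithm, described in the Appendix, enters variables by partial significance tests and re-tests the continued significance of variables already in the model; here, for brevity, a candidate enters only if it improves R-squared by at least a fixed amount:

```python
import numpy as np

def forward_stepwise(X, y, min_r2_gain=0.01):
    """Greedy forward selection for linear regression (simplified sketch)."""
    n, k = X.shape
    selected, remaining = [], list(range(k))
    tss = np.sum((y - y.mean()) ** 2)

    def sse(cols):
        # Residual sum of squares for an intercept-plus-cols model.
        A = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        return np.sum((y - A @ beta) ** 2)

    current = sse([])
    while remaining:
        # R^2 gain from adding each remaining candidate variable.
        gains = {j: (current - sse(selected + [j])) / tss for j in remaining}
        best = max(gains, key=gains.get)
        if gains[best] < min_r2_gain:
            break                      # no significant improvement: stop
        selected.append(best)
        remaining.remove(best)
        current = sse(selected)
    return selected

# Hypothetical data: y depends only on columns 0 and 2 of five candidates.
rng = np.random.default_rng(2)
X = rng.normal(size=(400, 5))
y = 2.0 * X[:, 0] - 1.5 * X[:, 2] + rng.normal(scale=0.5, size=400)
print(sorted(forward_stepwise(X, y)))   # the two informative columns
```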
DESCRIPTION OF VARIABLES
The data for each sample of 400 patients were expressed with survival as the dependent variable, and with 27 independent variables of which 12 were "original" and 15 were transformed variables. Of the 12 "original" variables, 10 had been binary coded as yes or no (for male, anemia, chronic cough, unresolved pneumonia, chronic dyspnea, chronic pulmonary disease, active tuberculosis, extraneous iatrotropy, pipe smoking, and cigar smoking). Age was coded in its dimensional values for years, and cigarette smoking in an ordinal rating scale from 0, 1, ..., 4. As described in Part I [1], the remaining 15 independent variables were created as transformations of the original coding expressions. Three ordinal variables (TNM stage, functional severity, and symptom pattern) were converted, respectively, into 2, 2, and 3 "dummy" variables that preserved the contrasts between adjacent categories in the ordinal scale. Three more variables, respectively called Symptom duration 1, 2, and 3, were expressed in duration of months, and were formed as interaction terms between duration of symptoms and the three symptom pattern variables. Finally, five binary-coded dummy variables were used to express different histologic types of cancer.

In the presentation of results for Tables 1 and 2, a variable that had been transformed (such as TNM stage) was regarded as "selected" if any of its component variables entered the mathematical model. In Tables 3-5, however, the citations refer to the 27 independent variables that were available after the transformations.
RESULTS
In the tabulated results, linear discriminant analysis and multiple linear regression have been regarded as only one method, because of their algebraic equivalence, noted earlier, which leads to identical selections of predictor variables. The results therefore compare four analyses: multiple linear regression, multiple logistic regression, Cox regression for survival censored at 1 year, and Cox regression for survival censored at 5 years.

Variation of sampling methods in different populations
Table 1, which is arranged according to the seven generating samples, shows the number of times each of the four methods selected the main independent variables in each generating population. Perfect agreement between methods within a generating sample is seen in cells with entry “0” or “4”, showing that none or all four of the methods selected a given variable. As noted previously, the transformed variables with more than two levels (TNM stage,
Table 1. Number of methods (out of 4) selecting cited variable in seven generating populations

[The full table body could not be recovered from the scanned source. Rows are the 17 candidate variables (TNM stage, functional severity, symptom stage, symptom duration, age, sex, unresolved pneumonia, chronic cough, chronic dyspnea, chronic pulmonary disease, active tuberculosis, anaemia, cigarette smoking, pipe smoking, cigar smoking, extraneous iatrotropy, and microscopic status); columns are the seven generating populations (proportional, decreasing exponential, bimodal, increasing exponential, unimodal symmetric, uniform, and U-shape); each cell shows how many of the four methods (0-4) selected that variable in that population.]

Overall distribution of number of methods selecting a variable, across all variables and populations:

Number of selections:   0    1    2    3    4
Frequency:             65   11   10    5   28
functional severity, symptom stage, symptom duration, and microscopic status) have been counted in Table 1 as "selected" if the dummy indicator variable associated with any of the levels was selected as significant. These variables were regarded as "not selected" if none of their component indicator variables were selected. This convenient way of summarizing the overall predictive value of categorical variables with several levels does not account, however, for differences that can (and do) exist with respect to the particular levels of indicator variable(s) selected as significant by the various methods. These differences will be discussed later.

The results of Table 1 show remarkably good overall concordance between the methods in their selection of variables. For the cells formed by 17 variables in seven populations, there were 119 "challenges" for each of the four methods. Taken over all variables and across all population distributions, all four methods agreed in 65 of the 119 cells (i.e. 54%) that a given variable should not be selected. Conversely, in 28 (24%) of the 119 cells, all four methods agreed on selecting the cited variable. Thus, in 78% of the challenges, there was complete agreement among the four selection methods. In another 16 (13%) of the challenges, only one of the methods disagreed with the other three.

Table 1 does not reveal any particular tendency for methods to agree or disagree according to the type of population being sampled. The disagreements, which occurred in 22% of instances, are distributed rather randomly across the seven populations. It is also apparent from Table 1 that the same predictor variables tend to be selected, regardless of which population is being sampled. For instance, TNM stage and functional severity are clearly both very important variables, as are the variables for symptom stage and microscopic status. On the other hand, several other manifestations did not appear to be prognostically important, for example, unresolved pneumonia, chronic cough, active tuberculosis, cigarette smoking, pipe smoking, and duration of symptoms. The greatest levels of disagreement occurred for the variables sex, chronic dyspnea, anaemia, cigar smoking, and extraneous iatrotropy.
Variation of sampling populations in different methods
In Table 2, which is a counterpart of Table 1, the results are shown for the consistency of a given multivariable method in selecting the same variables in different populations. This table shows the number of populations (out of the seven used) in which each method selected a given variable. In this instance, there were 17 variables for 4 methods (or 68 challenges) in each of the seven populations. Complete agreement occurred if the same variable was selected
Table 2. Number of generating populations (out of 7) in which variable was selected by each of four methods

[The full table body could not be recovered from the scanned source. Rows are the 17 candidate variables; columns are the four methods (linear, logistic, Cox 1-year, and Cox 5-years); each cell shows in how many of the seven generating populations (0-7) that method selected the variable. TNM stage and functional severity were selected by every method in all seven populations.]

Overall distribution of number of populations in which a variable was selected, across all variables and methods:

Number of selections:   0    1    2    3    4    5    6    7
Frequency:             24   13    8    7    2    3    1   10
by each method in all or in none of the seven populations. The pattern of agreement in Table 2 is not quite as clear cut as in Table 1, but 71% of the cells of this table showed at most one disagreement (i.e. a score of 0, 1, 6, or 7) among the seven populations. In 34 cases out of 68 (50%), there was complete agreement that a variable should or should not be selected. In general, as noted in Table 1, the same variables tend to be selected by the various methods, notably TNM status, functional severity, symptom stage, and microscopic status.

Variation in sequence of selection for variables
Table 3 shows the arrangement of the variables selected by each method in each population, according to the order in which each variable entered the analytic model in the stepwise process. To present this information, the variables TNM, symptom stage, functional severity, symptom duration, and microscopic status are now shown according to the dummy indicator variables that represent the contrasts between the various levels of these factors. Examination of Table 3 also shows reasonably good concordance of the first two choices between methods within the same population, and within a method across populations. For almost all combinations of methods and populations, one of the TNM status variables and one of the functional severity variables were the two variables first chosen in the sequence of selection. In one exception, which occurred in the Cox 1-year analysis of the unimodal symmetric population, symptom pattern was selected as the second variable, and the TNM variable was entered at the next (third) stage of model development. Differences in distribution of the population samples had an effect that is evident in the various assessments of the variable TNM stage 2. This variable was selected first for the linear and logistic analyses in all except the decreasing exponential population. Because the decreasing exponential generating sample contains only a small proportion of patients with severe disease, the analysis has limited power to show a contrast between the most advanced TNM stage patients and the rest, as represented by the variable TNM stage 2. On the other hand, TNM stage 1 was selected as the first variable in the decreasing exponential sample, thus suggesting that the general effect of TNM status is most clearly evident in that sample as a contrast between TNM stage 1 and the combined stages
2 and 3. In several of the Cox analyses, functional severity was chosen before TNM status, but both TNM stage 1 and TNM stage 2 were selected. After selection of the first two or three variables (referring to TNM status or functional severity), the sequence of chosen variables is much less regular. Table 3 shows a tendency for components of symptom stage, age, sex, or microscopic status to enter within the next few steps in some but not all instances. The pattern of entry of the other variables is highly irregular. These inconsistencies are not surprising, because a complex set of partial significance tests must be examined for each of the remaining variables after two or three variables have been selected into a model. Inconsistencies can be expected when the relative ranking of significance levels between a given set of variables depends on two or three other previously selected variables. For this reason, the inconsistencies in sequence of selection of these later, relatively minor, variables are not surprising.

Relative contribution of variables
The next set of analyses was done to compare the way in which the different models and samples led to different coefficients for the relative contribution or importance of each variable. Since the values of these coefficients are often used as “screening” procedures to denote the relative biologic importance of the individual variables, we wanted to determine whether the coefficients would be altered when different methods were applied to different patterns of distribution for the same set of biologic data. For this purpose, we selected a set of 13 independent variables and specified that all 13 variables should be included (i.e. “forced”) in each method for each population sample. The 13 variables, which were chosen to contain those that had frequently been significant in the previous analyses, were the following: TNM status (2 levels), functional severity (2 levels), symptom stage (2 levels), microscopic status (1 level), sex, age, symptom duration (2 levels), anaemia, and extraneous iatrotropy. The original three levels of TNM status and functional severity were each represented by their original two dummy indicator variables. Symptom duration was represented by two indicator variables, symptom duration 2 and symptom duration 3. Microscopic status was represented by anaplastic histology. To arrive at standardized coefficients that would be unaffected by the units
Table 3. Order of entry of significant variables in forward stepwise selection algorithm for various generating samples, in each of four methods. (The sequence of four entries in the cells of each column of each row shows the rank order in which each variable entered the model, respectively, for linear regression, logistic regression, Cox 1-year model, and Cox 5-year model.)

[The full table body could not be recovered from the scanned source. Rows are the component variables: TNM stage 1 and 2; severity 1 and 2; symptom stage 1, 2, and 3; symptom duration 1, 2, and 3; age (yr); sex; unresolved pneumonia; chronic cough; chronic dyspnea; chronic pulmonary disease; active tuberculosis; anaemia; cigarette smoking; pipe smoking; cigar smoking; extraneous iatrotropy; and the histologic indicators well diff. epidermoid, well diff. adenocarcinoma, small cell, anaplastic, and metastatic unspecified. Columns are the seven generating populations.]

"-" indicates that a variable was not selected into the model. Variables which entered the model at the final step but were then immediately removed are also shown as "-". Variables which entered the model but which were removed at a later step are shown by the step (rank) at which they entered. F indicates symptom stage variables that did not enter the model on their own. They were "forced" into the model, because the associated interaction variable (symptom duration) had individually entered the model at a previous step. In many instances, however, the symptom duration variable was not retained after the symptom stage variable was included.
of measurement for each variable, the data for all of the variables were entered into the models in a standardized format, i.e. in the form (X - X̄)/s, where X̄ and s represent the mean and standard deviation for each variable. Standardization was carried out separately for each sample of patients. Some of the 13 predictor variables might have been omitted without serious loss of significance or precision, but we wanted the final comparison of analytic methods to use the same set of independent variables, even though some of them may not have been significant within any particular method. The main advantage of using a common set of variables is that it facilitates the direct comparison of coefficients and significance levels between different methods and populations. A disadvantage is that the forced inclusion of some non-significant variables may reduce the apparent significance of other variables that are collinear. Furthermore, models that are over-fitted with unnecessary variables may also have reduced predictive value [7]. Another caveat in this process is that the use of more than one indicator variable for different levels of the same effect (such as TNM stage, functional severity, symptom stage and symptom duration) will tend to reduce the significance of the coefficients of the individual indicator variables, because of possible collinearity in the contrasts between various pairs of levels of these variables. For example, the coefficient of TNM stage 1 may be less significant when conditioned on the presence of TNM stage 2 in the model than if TNM stage 1 alone were included. This distinction may explain some discrepancies between the results of the "forced" analyses and those found earlier in the stepwise selections, where possibly only one of the indicator variables would have been included.
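The standardization step can be sketched as follows; the data are invented, and the paper standardized each variable separately within each 400-patient sample (the sketch uses the sample standard deviation):

```python
import numpy as np

# Illustrative predictors with very different units.
rng = np.random.default_rng(3)
age = rng.normal(60.0, 10.0, size=400)         # years
stage = rng.integers(0, 3, size=400).astype(float)

# Standardize each column: (X - mean) / sd, so coefficients become
# comparable across variables regardless of measurement units.
X = np.column_stack([age, stage])
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

print(np.round(Z.mean(axis=0), 6))             # both columns ~ 0
print(np.round(Z.std(axis=0, ddof=1), 6))      # both columns ~ 1
```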
Table 4 shows the 28 sets of results obtained when the four methods were applied to the seven population samples, with the 13 standardized variables “forced” into the models. Although all coefficients were expressed in a standardized form, the relative magnitude (and thus, the relative “importance”) of the coefficients would be difficult to compare across the 28 sets of results. In particular, the results of different statistical methods are difficult to appraise because of fundamental differences in the mathematical expressions that are assumed to represent the relationship between dependent
and independent variables. We therefore designated the standardized coefficient for TNM stage 2 as a “baseline” value in each analysis, and expressed all other standardized coefficients as a ratio in relation to the standardized coefficient for TNM stage 2. The TNM stage 2 coefficient was selected for this additional standardizing role because the coefficient itself was often statistically significant and because the TNM stages are more familiar and conventionally recognized than the alternative candidates (particularly functional severity 2). With this approach, the top row in Table 4 shows the actual values, in each situation, of the standardized coefficient for TNM stage 2. All the other rows in Table 4 show the ratio obtained when each standardized coefficient is divided by the standardized coefficient for TNM stage 2. For example, in Row 2 of the first column of results, the relative values of TNM stage 1 and severity 1 are 1.14 and 0.13. Their actual values, when multiplied by the 0.071 value for TNM stage 2, would be 0.081 and 0.009 respectively. Two other points should be noted about Table 4. First, each actual or relative coefficient is marked with an appropriate symbol, when pertinent, to denote the “statistical significance” of each coefficient’s p levels as being below 0.05, 0.01, or 0.001. Coefficients that lack such symbols were not statistically significant, i.e. p > 0.05. A separate column in Table 4 is marked “prognostic impact”. This column was marked + if the variable was expected, according to customary clinical experience and previous analyses [8], to have a favorable prognostic effect, and - if otherwise. The expected effect was favorable for 2 and unfavorable for 11 of the 13 variables under study. We used this tactic to indicate the expected effect because the signs of the coefficients found in the individual analyses would be positive or negative according to the type of mathematical model. 
For example, a favorable prognostic effect would be shown by a positive sign for the coefficient in the linear and logistic models, but by a negative sign in the Cox models. [If desired, the independent or dependent variables could be recoded to yield consistent signs for the coefficients in all models.] As shown in the individual cells of Table 4, all but 27 of the 364 coefficients had signs that agreed with the expected direction. The locations of the disagreements have been marked with a dot symbol. In all but two of these
[Table 4. Statistical significance and magnitude of standardized coefficients relative to TNM2 when the four methods were forced to handle the same 13 variables. For each of the seven generating samples (proportionate, decreasing exponential, increasing exponential, uniform, U-shaped, unimodal symmetric, and bimodal), the table lists the actual value of the TNM stage 2 coefficient and the relative values of the other 12 variables (TNM stage 1, severity 1, severity 2, symptoms 2, symptoms 3, anaplastic, male, age, symptom duration 2, symptom duration 3, anaemia, and extraneous iatrotropy), together with each variable's expected prognostic impact (+ or -). Lin = linear regression; Log = logistic regression; C-1 = Cox 1-year model; C-5 = Cox 5-year model. *p < 0.05; †p < 0.01; ‡p < 0.001. §Sign of coefficient differed from sign of main prognostic impact of this variable.]
Table 5. Range of variation across 28 models and population samples, for coefficients cited in Table 4

Variable                    Significant coefficients    All coefficients
TNM stage 1                 0.71-2.12                   0.04-2.12
Severity 1                  0.67-1.13                   0.07-1.13
Severity 2                  0.70-2.75                   0.31-2.75
Symptom stage 2             0.70-1.51                   0.00-1.51
Symptom stage 3             0.02*                       0.01-0.71
Anaplastic                  0.46-1.06                   0.18-1.06
Male                        0.57-1.17                   0.20-1.17
Age                         0.51-1.33                   0.03-1.33
Symptom duration 2          1.04*                       0.05-1.38
Symptom duration 3          0.56*                       0.00-1.03
Anaemia                     (none)                      0.07-0.79
Extraneous iatrotropy       0.40-0.68                   0.01-0.68

*Only one of the 28 coefficients was significant.
disagreements, however, the quantitative value of the coefficient was not statistically significant.
The results of Table 4 show some striking variations in coefficients for different "generating" samples within the same mathematical method. For example, the standardized coefficient for TNM stage 2 ranges from 0.206 to 0.302 (a difference of about 50%) in the Cox 1-year regression results for the seven population samples. The range of variation in the standardized coefficients for the other 12 variables, expressed in relation to TNM stage 2, can be considered in two ways: for only those coefficients that were statistically significant, and for all coefficients, regardless of significance and regardless of sign. The results, shown in Table 5, demonstrate that the significant coefficients for these variables can have quantitative ranges in which the upper numbers are two to four times the size of the lower numbers. The ranges become substantially greater when examined for all values of the coefficients, regardless of statistical significance.
The large ranges of variation demonstrate that the inherent attributes of a particular biologic system, which was identical for all seven of the generating samples under study, can receive substantially different quantitative assessments according to the selected mathematical method and the distributional pattern of the data. Some of the variability in the coefficients probably arises, as discussed earlier, from imprecisions that occur when models are fitted with a surfeit of variables, and from collinearities that may occur when a single variable is expanded into several dummy variables that represent multi-level coding. Additional variation in the coefficients will have arisen from the separate standardization of the data within each sample. Because a "separate" standardization would be the usual statistical practice for each data set, the concomitant variation in coefficients must be accepted as an inherent part of this method of analysis.
Despite the variation in the numerical values of the coefficients, inspection of Table 4 reveals that TNM status and functional severity have the largest coefficients in most of the analyses. Hence these variables would be identified as "most important" by a criterion based on coefficient magnitudes, as well as by the p-values and order of stepwise selection discussed earlier. The main point to be noted is that the different mathematical methods, despite the remarkably consistent performances in some of the situations noted earlier, can also produce inconsistent performances in the estimated coefficients. The inconsistencies seem particularly likely to occur when the models are used with a "forced" array of multiple variables and when the same biologic events are assembled in samples of data that have major variations in the basic distribution of the cogent phenomena.
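The relative-coefficient tactic used in Table 4 is simple arithmetic: divide every standardized coefficient by the baseline (TNM stage 2) coefficient, and recover actual values by multiplying the published ratios back by the baseline. A minimal sketch follows; the function name and dictionary layout are illustrative choices, not part of the original analysis, but the numbers reproduce the worked example from the text (TNM stage 2 = 0.071; relative values 1.14 and 0.13).

```python
# Sketch of the Table 4 normalization: each standardized coefficient is
# re-expressed as a ratio to the "baseline" coefficient for TNM stage 2.

def relative_coefficients(std_coefs, baseline="TNM stage 2"):
    """Divide every standardized coefficient by the baseline coefficient."""
    b = std_coefs[baseline]
    return {name: c / b for name, c in std_coefs.items()}

# Recovering actual values from published ratios, as in the worked example:
baseline_value = 0.071  # actual standardized coefficient for TNM stage 2
ratios = {"TNM stage 1": 1.14, "Severity 1": 0.13}
actual = {k: round(v * baseline_value, 3) for k, v in ratios.items()}
print(actual)  # {'TNM stage 1': 0.081, 'Severity 1': 0.009}
```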
DISCUSSION
The results do not indicate a striking superiority for any one of the four mathematical methods under examination, at least with respect to their selection of prognostically important variables. Similar sets of predictors were selected by the various methods within one population, and the agreement between the selections of a given method in different populations was quite good. Greater discrepancies were noted in the order of selection of variables, and in their estimated coefficients. Conclusions concerning all but the first one or two significant predictors must therefore be viewed cautiously. In particular, the magnitude of effects for variables included in "forced" sets for multivariable models may not be well estimated. If possible, these values should be checked using stepwise modifications of the fixed model, and/or by comparing the results of alternative methods of analysis. Concordance in such comparisons would provide some reassurance about the validity of the conclusions. Discordant results should prompt further consideration of the appropriateness of each method (including its assumptions), and of the underlying biologic connotation of the component predictor variables.

REFERENCES

1. Feinstein AR, Wells CK, Walter SD. A comparison of multivariable mathematical methods for predicting survival-I. Introduction, rationale, and general strategy. J Clin Epidemiol 1990; 43: 339-347.
2. Lawless JF. Statistical Models and Methods for Lifetime Data. New York: Wiley; 1982.
3. Kleinbaum DG, Kupper LL. Applied Regression Analysis and Other Multivariable Methods. North Scituate, Mass.: Duxbury Press; 1978.
4. SAS User's Guide: Statistics, 1982 edn. Cary, N.C.: SAS Institute; 1982.
5. SUGI Supplemental Library User's Guide, 1983 edn. Cary, N.C.: SAS Institute; 1983.
6. Dixon WJ, Ed. BMDP Statistical Software, 1983 Printing and Additions. Berkeley: University of California Press; 1983.
7. Weisberg S. Applied Linear Regression. New York: Wiley; 1980.
8. Feinstein AR, Wells CK. Lung cancer staging-a critical evaluation. In: Matthay R, Ed. Recent Advances in Lung Cancer. Clin Chest Med 1982; 3: 291-305.
APPENDIX

Details of Stepwise Algorithm

In view of the potential computational difficulties with the backward elimination method (especially because of the large number of candidate independent variables in our data set), we decided to use the forward selection stepwise procedure, with a modification that is customarily available in standard computer programs. The unmodified forward selection procedure progresses as follows. At the initial step, the full set of candidate independent variables is reviewed, and the variable that has the most significant relationship to the dependent outcome variable is selected for inclusion in the model, as long as it meets a certain minimum criterion level of significance. In all of our work, this criterion was that the p-value for the variable to be included should be no greater than 0.05. The most significant variable (if any) that satisfies this criterion is then put into the model, and a solution to the functional form of the equation is computed.
At the next stage, the remaining candidate variables are considered in turn, to see if they significantly improve the prediction of the dependent variable, conditional on the presence of the first variable already selected. Again, any variable to be included at this point must satisfy the significance probability criterion, but in this case the probability is conditional on the presence of the first variable in the model. The most significant of these second-step variables is then included in the model, and a new functional equation is computed. At the third and subsequent steps, the set of variables remaining at each point is evaluated, and the most significant is included as long as it meets the criterion of statistical significance. The algorithm ceases to select further variables when no new significant variables can be identified.
The modification to the forward selection algorithm was as follows. At each point in the selection procedure, the partial significance of each term included in the model (at any step) was reviewed. For continued inclusion in the model, a variable had not only to meet the initial test of significance at the point of its inclusion, but also to continue to meet a test of significance in the presence of the other variables selected at earlier and later steps. In our analysis, the criterion for a variable to enter and to remain in the model thereafter was that its partial probability value should be no greater than 0.05. If, at a given step, a variable failed to meet this test, it was eliminated from the model. Thus the modification consists effectively of backward elimination applied to the variables identified at successive steps in the forward selection procedure.
The modification was used to allow for the possibility that certain variables, although significant at their initial selection step, may become insignificant when further variables are included. The simplest case where this might happen is as follows: suppose that at step 1 variable A is included, at step 2 variable B is included, and at step 3 variable C is included. It is possible that, with the joint inclusion of variables B and C in the model, the further predictive value of variable A is eliminated; in other words, variables B and C together provide an adequate prediction of the outcome, without the requirement that A also be included. However, if only a single variable were to be included, it would be variable A, as the single most significant term. It is also possible (but rather unlikely in practice) that a variable may be included at an early step in the algorithm, excluded at a later step, but then re-enter the model at a subsequent point. In practice, patterns of inclusion followed by subsequent deletion were unusual in our analyses. The most usual pattern was that a sequence of variables would be identified for inclusion in the model, and the algorithm would then stop when all significant variables had been identified.
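The modified algorithm can be sketched in a few lines. In this illustrative sketch, `partial_p(var, selected)` stands in for whatever regression routine supplies partial p-values (the paper used SAS and BMDP procedures); the `TOY_P` table below is fabricated solely to reproduce the A/B/C scenario just described, in which A enters first but is displaced once B and C are both in the model.

```python
# Forward selection with a backward check, as described in the Appendix:
# a variable enters if its partial p-value is <= 0.05, and any previously
# selected variable whose partial p-value rises above 0.05 is removed.

P_ENTER = P_STAY = 0.05  # entry and retention criteria used in the paper

def stepwise_select(candidates, partial_p):
    selected = []
    remaining = list(candidates)
    while True:
        # Forward step: pick the most significant remaining variable.
        scored = [(partial_p(v, selected), v) for v in remaining]
        eligible = [(p, v) for p, v in scored if p <= P_ENTER]
        if not eligible:
            break  # no new significant variable can be identified
        _, best = min(eligible)
        selected.append(best)
        remaining.remove(best)
        # Backward check: re-test every selected variable in the presence
        # of the others, and drop any that is no longer significant.
        for v in selected[:]:
            others = [w for w in selected if w != v]
            if partial_p(v, others) > P_STAY:
                selected.remove(v)
                remaining.append(v)
    return selected

# Fabricated partial p-values reproducing the A/B/C example from the text.
TOY_P = {
    ("A", frozenset()): 0.001, ("B", frozenset()): 0.010, ("C", frozenset()): 0.040,
    ("B", frozenset({"A"})): 0.020, ("C", frozenset({"A"})): 0.030,
    ("A", frozenset({"B"})): 0.005, ("C", frozenset({"A", "B"})): 0.010,
    ("A", frozenset({"B", "C"})): 0.200,  # A loses significance given B and C
    ("B", frozenset({"C"})): 0.010, ("C", frozenset({"B"})): 0.020,
}

def toy_p(var, selected):
    return TOY_P[(var, frozenset(selected))]

print(stepwise_select(["A", "B", "C"], toy_p))  # ['B', 'C']
```

A is the single most significant variable on its own, yet the final model contains only B and C, exactly the pattern of inclusion followed by deletion that the modification is designed to handle.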