0895-4356/90 $3.00 + 0.00 Copyright © 1990 Pergamon Press plc
J Clin Epidemiol Vol. 43, No. 4, pp. 339-347, 1990. Printed in Great Britain. All rights reserved
A COMPARISON OF MULTIVARIABLE MATHEMATICAL METHODS FOR PREDICTING SURVIVAL-I. INTRODUCTION, RATIONALE, AND GENERAL STRATEGY*

ALVAN R. FEINSTEIN,1†‡§ CAROLYN K. WELLS1 and STEPHEN D. WALTER2

1Yale University School of Medicine, New Haven, CT 06510, U.S.A. and 2Department of Clinical Epidemiology and Biostatistics, McMaster University, Hamilton, Ontario, Canada

(Received 30 August 1989)

Abstract-This paper and the two following papers (Parts I-III) report an investigation of performance variability for four multivariable methods: discriminant function analysis, and linear, logistic, and Cox regression. Each method was examined for its performance in using the same independent variables to develop predictive models for survival of a large cohort of patients with lung cancer. The cogent biologic attributes of the patients had previously been divided into five ordinal stages having a strong prognostic gradient. With stratified random sampling, we prepared seven “generating” sets of data in which the five biologic stages were arranged in proportional, uniform, symmetrical unimodal, decreasing exponential, increasing exponential, U-shaped, or bimodal distributions. Each of the multivariable methods was applied to each of the seven generating distributions, and the results were tested in a separate “challenge” set, which had not been included in any of the generating sets. The research was intended not merely to compare the performance of the multivariable methods, but also to see how their performance would be affected by different statistical distributions of the same cogent biologic attributes. The results, which are presented in the second and third papers, were compared for selection of independent variables and coefficients, and for accuracy in fitting the generating sets and the challenge set.

Multivariable methods; Regression methods; Discriminant function analysis; Cox regression; Linear regression; Logistic regression; Predictive models

*Supported in part by Grant Number HS 04101 from the National Center for Health Services Research, OASH; from The Andrew W. Mellon Foundation; and from The Council for Tobacco Research-U.S.A., Inc. as Special Project No. 135.
†Professor of Medicine and Epidemiology; Director, Clinical Epidemiology Unit and The Robert Wood Johnson Clinical Scholars Program, Yale University School of Medicine, New Haven, Connecticut.
‡Senior Biostatistician, Cooperative Studies Program Coordinating Center, Veterans Administration Medical Center, West Haven, Connecticut.
§Reprint requests should be addressed to: Alvan R. Feinstein, M.D., Yale University School of Medicine, 333 Cedar Street, New Haven, CT 06510, U.S.A.

INTRODUCTION

When multiple variables are evaluated for their effects on a selected dependent variable, at least four procedures can be used for the statistical analysis: multiple linear (or curvilinear) regression, multiple logistic regression, discriminant function analysis, and the proportional hazards method that is often called Cox regression. Each of these procedures has its own theoretical advantages and disadvantages for the different phenomena that can be chosen as the “dependent” target variable. Linear regression is theoretically best when the target variable is expressed in dimensional (continuous) data, such as blood pressure; logistic regression is most appropriate with a binary variable, cited for an event such as alive or dead at a specified time interval; discriminant function analysis has frequently been used for binary targets but was originally planned to distinguish polytomous unranked nominal categories, such as the histologic type (adeno, epidermoid, small cell, etc.) of a particular cancer; and Cox regression is intended to appraise the entire “dynamic” pattern of a survival curve, rather than a “static” outcome, such as the duration or binary existence of survival.

Nevertheless, each of the four procedures has often been regarded as “robust” enough to be used in ways that differ from its underlying theoretical principles and justifications. Although not wholly proper according to mathematical principles, binary events have been the dependent variable in linear regression and Cox regression; and ordinal variables have been the target of logistic regression.

In this research, we wanted to compare the performance of the four multivariable analytic procedures in a manner analogous to studying “observer variability”. When challenged by data for the same biologic phenomena, would the four procedures produce the same results? Would the same independent variables be chosen as significant predictors? Would the selected predictors be given similar weights or ranks of “importance”? Would the results have similar accuracy when patients are classified in the “generating” groups used for developing the models, or in “challenge” groups that were not previously examined?
BACKGROUND AND GOALS
The questions just cited have seldom been explored with empirical data, and the existing explorations have pursued only a limited spectrum of challenges. In a non-exhaustive search of the literature, we found 13 published reports in which specific real data, rather than simulated data sets, were used to compare the performance of these multivariable procedures [1-13]. One report [9] examined three procedures (the discriminant function, logistic, and Cox methods), but in most instances the tested procedures were contrasted in pairs. In several studies, multiple logistic regression and discriminant function analysis were compared either with each other [1, 3-5], or with alternative non-parametric analytic procedures such as recursive partitioning [3, 4, 6, 10, 13]. In other single studies, the compared multivariable procedures were linear vs logistic [2], discriminant function vs Cox [8], and logistic vs Cox regression [12]. In one report, Cox’s method was compared when used alone and when amplified with alternative tactics for specifying predictor variables [7]. In another study [11], Cox regression was compared with recursive partitioning and with correspondence analysis. No previous study contained a simultaneous comparison of all four multivariable procedures.

Another limitation in previous studies has been the composition of the groups under analysis. A biologically important predictive factor should presumably have the same effect regardless of its prevalence in the group under analysis, but the effect might be altered or unrecognized if different magnitudes of prevalence lead to different statistical decisions during the data analysis. For this reason, the composition of the groups under study is an important consideration when analytic methods are compared. To determine whether the analytic interpretations are affected by statistical rather than biologic distinctions, the compared analytic methods should be tested not with a single group of data, but with several groups having different prevalences of the main biologic factors. The effect of the prevalence of various predictors was acknowledged in several previous reports [3, 4, 8], but prevalence has not hitherto been deliberately altered to allow a systematic study of its effects.

A third problem in previous research was the frequent absence of “confirmation” studies. The particular algebraic models fitted to a set of data called the generating, training, or development set have usually been checked by their performance when “re-substituted” in the same set of data [1-5, 7, 8, 10]. The models have seldom been “confirmed”, however, with tests in new or challenge sets of data for different groups of people who had not been included in the development set, but who were reasonable candidates for application of the developmental results [2-5, 7, 8, 10].
We therefore had three main goals in the current research:

(1) To compare the results obtained when four multivariable methods were applied to the same sets of data.
(2) To vary the composition of the data sets, so that the performance of the methods could be tested for biologic factors having different degrees of prevalence.
(3) To evaluate the performance of the multivariable methods when re-substituted in their development sets and when tested in a different challenge set.

The results of the research are presented in a trio of papers. This paper (Part I) describes the composition of variables and patients in the basic set of data used for the analyses. The paper also delineates the process by which the basic data set was converted to a “challenge set” and to seven “generating sets”. The generating sets contained different prevalences in the distributions of the main biologic factors. The second paper (Part II) presents a comparison of the results obtained when four multivariable methods were used to analyze data in the seven generating sets. The results are examined for the concordance with which the four methods, applied to the seven sets, selected “important” predictor variables and ranked their relative importance. The third paper (Part III) reports the predictive accuracy of the four multivariable methods when “re-substituted” in the seven generating sets, and when tested with the challenge set.

BASIC STRATEGY
The basic data consisted of information that had previously been analyzed and reported for a large cohort of patients with primary cancer of the lung [14]. The earlier work was aimed at demonstrating the important predictive impact of certain types of clinical data that are usually omitted from the exclusively morphologic information used in conventional medical appraisals of prognosis for patients with cancer [15]. During the earlier analysis, the clinical, demographic, morphologic and other predictor variables had been examined and combined with clustering methods that formed stratified categories (or “stages”), rather than the “additive scores” produced by standard mathematical models. The clustering strategies involved a mixture of both clinical judgment to form biologically coherent groups, and statistical isometry to join categories that had similar rates of survival [16-18]. The five composite strata that emerged from the analysis had been identified with the letters A-E.

The main biologic attributes used in forming these strata were the categories of three ordinal variables: TNM stage, symptom pattern, and functional severity. These categories had been arranged as five composite stages, labelled A, B, C, D, E, that were associated with a large prognostic gradient throughout the survival period [14]. (The distribution of the cohort and the 6-month survival results for these stages are shown later in Table 2.)

In the current research, we prepared a series of seven “generating sets” in which the biologic characteristics of the A-E strata would have different statistical distributions. These seven sets would be examined by each of the mathematical methods to form a model with the appropriate array of predictive variables and coefficients. We also put aside a separate “challenge set” containing a group of patients, not included in any of the generating sets, who could be used to test the predictive accuracy of each of the derived models. We could thus examine the performance of different mathematical models for the same biologic phenomena in groups having different distributions, and in a “new” challenge group.
METHODS
Basic population

The basic population under analysis was an “inception” cohort of 1266 patients with primary cancer of the lung. They represented a consecutive series of all patients with microscopic evidence of this disease, whose zero time occurred at the Yale-New Haven Hospital or West Haven VA Hospital during the interval 1 January 1953-31 December 1964. Zero time was demarcated as the date of the first antineoplastic therapy (thoracotomy, radiotherapy, chemotherapy, or metastatic surgery) directed at the lung cancer or its effects. In patients who received no antineoplastic therapy, zero time was the date of the decision not to treat, or the date of the latest appropriate diagnostic evidence that might have been followed by a therapeutic decision. Because of unusually complete methods for getting baseline state and follow-up data [18], information was available for zero-time status and for the complete 5-year post-zero interval on all but one of the 1266 patients. (The missing patient, who was last seen in moribund condition 6 weeks after onset of “palliative” radiotherapy for widespread metastases, was counted as though death had occurred at 1.6 months after zero time.)
Selection of outcome event
In previous analyses [14], the entire population was combined, regardless of therapy, which was not used as a predictor variable. The 5-year survival rate for the 1266 patients was 7%. Consequently, if survival at 5 years were the event to be predicted for individual patients, a 93% rate of accuracy could be achieved a priori simply by predicting death for everyone. To establish a more suitable predictive challenge, we chose survival at 6 months after zero time. This time interval is clinically pertinent for both prognostic estimations and therapeutic evaluations; and it offers a cogent opportunity to demonstrate the significance of prognostic variables, because the 6-month survival rate for the entire population was close to 50%. (The actual 6-month survival rate was 44%, with a median survival time of 4.6 months.)

Arrangement of mathematical models
For testing the discriminant function and logistic procedures, 6-month survival was chosen as the binary dependent variable, coded as present or absent. For multiple regression, survival duration would have been appropriate as a dependent variable, but could not be used because it was unknown for the cohort members who were still alive when last observed. Accordingly, the binary event of survival at 6 months was also used as the dependent variable in multiple linear regression.

Since the proportional hazards model uses a group’s entire survival experience over an interval of time, rather than simply survival at a particular time point, we established two arrangements for the Cox regressions. In both arrangements, the model was given the available data for the entire population, but in the first arrangement, which we called Cox-1, all patients who were still alive at the end of 1 year were regarded as censored by being withdrawn alive. In the second arrangement, called Cox-5, all patients who were still alive at the end of 5 years were similarly censored. The Cox-1 model thus derived its coefficients and predictive variables from data for a 1-year survival curve, and would tend to identify predictors that were most influential in short-term survival. In contrast, the Cox-5 model, working from a 5-year survival curve, would put relatively more emphasis on factors that influence longer-term survival. Some (or all) of these factors, of course, might be the same in both analyses.
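The outcome arrangements just described can be illustrated with a minimal Python sketch. It is not the authors' software: the names `CoxRecord`, `alive_at_6_months`, and `censor_at`, the three example patients, and the treatment of a patient surviving exactly 6.0 months as "alive" are all assumptions made for illustration.

```python
from typing import NamedTuple


class CoxRecord(NamedTuple):
    time_months: float  # follow-up time handed to the survival model
    event: int          # 1 = death observed, 0 = censored (withdrawn alive)


def alive_at_6_months(survival_months: float) -> int:
    """Binary outcome for the linear, logistic, and discriminant procedures.

    Assumption: a patient surviving exactly 6.0 months counts as alive.
    """
    return 1 if survival_months >= 6.0 else 0


def censor_at(observed_months: float, died: bool, horizon_months: float) -> CoxRecord:
    """Censor follow-up at a fixed horizon (12 months for Cox-1, 60 for Cox-5)."""
    if observed_months > horizon_months:
        # Still alive at the horizon: treated as withdrawn alive at that point.
        return CoxRecord(horizon_months, 0)
    return CoxRecord(observed_months, 1 if died else 0)


# Hypothetical patients: (months observed after zero time, death observed?)
patients = [(4.6, True), (30.0, True), (80.0, False)]

binary_outcome = [alive_at_6_months(m) for m, _ in patients]
cox_1 = [censor_at(m, d, 12.0) for m, d in patients]
cox_5 = [censor_at(m, d, 60.0) for m, d in patients]
```

In this sketch the second patient, who died at 30 months, is censored at 12 months under the Cox-1 arrangement but contributes an observed death under Cox-5, which is why the two arrangements can weight short-term and longer-term predictors differently.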
With these arrangements, the five mathematical models to be tested comprised the three conventional procedures (multiple linear regression, discriminant function analysis, and multiple logistic regression) and the two Cox approaches described as Cox-1 and Cox-5.

Specification of baseline predictor variables
In most analyses of survival for patients with cancer, the baseline predictor variables are selected exclusively for such morphologic attributes as the anatomic location, size, and spread of the cancer, histologic or cytologic types, and microscopic evidence of vascular or lymphatic invasion. In previous research for the population under study [14, 19, 20], special attention had been given to classifying and analyzing the prognostic effects of certain clinical information which, although constantly applied in the decisions of “clinical judgment”, is seldom formally identified, specified, and quantified. The data from patients’ medical records and other appropriate sources had been excerpted and categorized in a taxonomy [18] that included the variables used in conventional morphologic staging systems as well as the additional “new” clinical variables.

In the list that follows, the first 10 variables represent the special clinical phenomena that are usually omitted in conventional analyses of prognosis for cancer. The last 7 variables represent the customary morphologic and demographic attributes. The variables are first discussed according to the format used for our previous analyses of the lung cancer data. A subsequent section, titled Coding of variables, indicates the way in which the original 17 variables were either maintained intact or converted into additional “dummy variables” that formed the 27 independent variables used in the current analyses.

1. Symptom pattern. The patterns of subjective symptoms in the clinical spectrum of patients with lung cancer could be classified in the following ordinal array, originally coded as 1-4, which reflected an increasing prognostic severity of clinical manifestations [14, 19, 20]: asymptomatic, primary symptoms only, systemic symptoms, metastatic symptoms.

2. Functional severity. The variable was formed as a composite of two clinical variables that had previously been identified [14] as having prognostic importance. One of the variables, symptom severity, contained such attributes as severe dyspnea, substantial amounts of weight loss, and severe tumor effects. To be classified as having severe tumor effects, a patient had to have extensive or incapacitating clinical manifestations, such as coma, jaundice, or major functional impairment due to metastatic lesions, rather than morphologic evidence alone or relatively minor clinical manifestations such as mild hemiparesis or laboratory abnormalities in liver function. The second variable, co-morbid severity, referred to the presence of severe additional diseases, beyond the lung cancer itself, that would themselves be expected to have poor prognosis. Examples of severe co-morbid ailments are uncontrolled decompensation of heart or liver, repeated previous episodes of acute myocardial infarction, or recurrent episodes of overt stroke. The two types of severity had previously been formed into a composite ordinal variable, called functional severity, which had been coded as 0, 1, 2 to reflect its absence or two ascending levels of prognostic severity.

3. Chronometry. This variable indicated the pre-zero duration of symptoms attributable or possibly attributable to the lung cancer. For patients who had had appropriate symptoms, the pre-zero duration of the symptoms was coded directly in months and tenths of months. In earlier analyses [19, 20], we had found that patients with only primary symptoms had better survival rates if the symptoms had a long rather than short pre-zero duration. The results were consistent with the concept that slow-growing cancers often produce symptoms “slowly”, i.e. over an extended period of time.

4. Unresolved pneumonia. Although not a formal component of the A-E stages, this attribute was included here because we had previously noted (in unpublished bivariate analyses) that it was often associated with high survival.

5. Chronic cough.
6. Chronic dyspnea.
7. Chronic pulmonary disease.
8. Active tuberculosis.

These four variables, each cited as present or absent, were included in the analysis because they represent diagnostic co-morbidity. Since their clinical manifestations, particularly in patients with primary symptoms, cannot readily be distinguished from those caused by the lung cancer, the diagnostic attribution of symptoms is difficult in patients who have both lung cancer and one or more of these four manifestations. Although symptoms in such patients were listed in the appropriate category of the symptom pattern variable, we recognized that the symptoms might sometimes be due to the co-morbid condition, rather than the cancer. For example, hemoptysis might be caused by active tuberculosis or chronic pulmonary disease in a patient with a slow-growing asymptomatic lung cancer. The diagnostically co-morbid conditions were included so that their possible effects on both classification of symptoms and prognosis could be suitably accounted for in the analyses.

9. Extraneous iatrotropy. This variable referred to “lanthanic” patients [21] whose iatrotropic stimulus, i.e. the reason they had sought medical attention, was unrelated to a symptomatic manifestation of lung cancer. In such patients, the lung cancer was detected with a screening or case-finding procedure conducted as part of a “routine” diagnostic process. The iatrotropic stimulus was either a complaint unequivocally attributable to a co-morbid disease, rather than lung cancer, or a routine examination, performed for reasons such as new employment or insurance. Patients who had deliberately sought a “check-up” while ostensibly asymptomatic were not included in this category unless the check-up was done as part of a regularly scheduled program of periodic examinations.

10. Anemia. This variable, classified as present or absent according to appropriate criteria, was included because we had previously noted (in unpublished analyses) that it seemed to affect survival.

11. TNM stage. This is the classical morphologic variable used in conventional prognostic analyses. It was cited, according to standard criteria [22], in the three ordinal stages of I, II, and III.

12. Histologic type. This is another classical morphologic variable in prognosis. It was cited, in a nominal rating scale, according to the histologic type of the cancer.
Patients whose only microscopic evidence of cancer was in a Pap smear of sputum or other sites, rather than in histologic tissue, were listed as cytology only.

13. Age. The patient’s age was cited directly in years.

14. Sex. Sex was classified as male or female.

15. Cigarette smoking. The amount of customary cigarette smoking, before onset of symptoms referable to lung cancer, was classified in five ordinal categories as none, slight [...] (1/2 but < 1 pack/day), heavy (> 1 but < 2 packs/day), and extreme (> 2 packs/day).

16. Pipe smoking.
17. Cigar smoking.

These two variables were used to indicate the customary use of pipes or cigars as present or absent. Because patients in the era under study seldom used smokeless tobacco, it was not separately identified.

Coding of variables
For the impending analyses to be done by the mathematical methods, some of the variables were already cited in dimensional numbers (such as age) or in binary categories that could be coded as 0, 1 and easily managed in the mathematical process. This type of binary coding was used for unresolved pneumonia, chronic cough, chronic dyspnea, chronic pulmonary disease, active tuberculosis, extraneous iatrotropy, anemia, sex, pipe smoking and cigar smoking.
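Two coding ideas recur in the remainder of this section: ordinal variables are expanded into cumulative "threshold" dummy variables (Table 1), and symptom duration is entered as its product with those dummies. The following Python fragment is only a hedged sketch of that arithmetic; the function names `threshold_dummies` and `symptom_duration_vars` are hypothetical, and the use of `None` for an unknown duration is an assumption.

```python
from typing import List, Optional


def threshold_dummies(level: int, lowest: int, n_categories: int) -> List[int]:
    """Cumulative coding of an ordinal level.

    The k-th dummy is 1 when the level lies at least k steps above the
    lowest category, so adjacent categories differ in exactly one dummy.
    """
    return [1 if level - lowest >= k else 0 for k in range(1, n_categories)]


def symptom_duration_vars(duration_months: Optional[float],
                          stage_dummies: List[int]) -> List[float]:
    """Interaction of symptom duration with each symptom-stage dummy.

    Assumption: an unknown duration (asymptomatic patient) is passed as None
    and yields 0 in every interaction variable.
    """
    d = 0.0 if duration_months is None else duration_months
    return [d * s for s in stage_dummies]


# Symptom pattern was originally coded 1-4 (1 = asymptomatic, 4 = metastatic).
table_1 = {code: threshold_dummies(code, lowest=1, n_categories=4)
           for code in (1, 2, 3, 4)}

# Functional severity was coded 0, 1, 2 and therefore needs two dummies.
severity_2 = threshold_dummies(2, lowest=0, n_categories=3)

# Hypothetical patient with systemic symptoms lasting 2.5 months before zero time.
durations = symptom_duration_vars(2.5, table_1[3])
```

A design point worth noting is that this cumulative coding differs from ordinary one-category-at-a-time indicator coding: each dummy marks "this level or worse", which is what lets a selected dummy be read as a contrast between adjacent portions of the ordinal scale.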
For other variables, however, certain adjustments were needed to allow proper management of unknown data, or to avoid arbitrary ordinal ratings that might be analyzed as though they were dimensional numbers. These adjustments were managed as follows:

Symptom pattern variables. The four ordinal categories of symptom pattern were converted into three “dummy” binary variables, according to a coding scheme [23] that preserves the contrasts between adjacent categories of the ordinal scale. In the coding scheme, which is shown in Table 1, if the variable “symptom stage 1” is selected to enter a model, the main contrast is between any symptoms and asymptomatic; if “symptom stage 2” enters, the main contrast is for “systemic symptoms or worse” vs primary symptoms or none.

Table 1. Coding scheme for 3 dummy variables describing 4 categories of the ordinal variable, “symptom pattern”

                                   Dummy variable codes
Classification of          Symptom    Symptom    Symptom
symptom pattern            stage 1    stage 2    stage 3
Asymptomatic                  0          0          0
Primary symptoms only         1          0          0
Systemic symptoms             1          1          0
Metastatic symptoms           1          1          1

Functional severity variables. The three ordinal categories of functional severity were transformed into two dummy binary variables using a strategy of conversion similar to that of symptom pattern.

TNM stage variables. The three ordinal categories of TNM stage were also converted into two dummy binary variables, using the same conversion strategy.

Chronometry variables. Because patients who were asymptomatic would be listed as unknown for chronometric duration of symptoms, and because unknown data create problems in a multivariable analysis, chronometry was coded in three variables that were formed as the interaction of duration of symptoms and symptom stage. The three variables depended on the presence of any symptoms (symptom stage 1), any non-primary symptoms (symptom stage 2), or any metastatic symptoms (symptom stage 3). Each of the interaction variables would have the value of symptom duration for patients who were in the corresponding symptom stage, and a value of 0 for patients who were not in that stage. These variables were respectively labelled Symptom duration 1, Symptom duration 2, and Symptom duration 3. With this coding convention the asymptomatic patients would be denoted by a 0 in all three variables.

Histologic-type variables. The diverse nominal categories of histologic types were consolidated into five main groups, each of which then became a dummy variable binary-coded as present or absent. These five histologic groups were well differentiated epidermoid carcinoma, well differentiated adenocarcinoma, small cell carcinoma (which included “oat cell” cancer), anaplastic carcinoma (which included “poorly
differentiated” epidermoid or adenocarcinoma, all “large cell carcinomas”, and patients whose histology was called “anaplastic” without further specification), and metastatic carcinoma. The metastatic carcinoma designation was used because it had occasionally been applied by pathologists, without further specification, to tissues removed from certain metastatic sites. Patients whose microscopic evidence consisted of cytology only would be coded as 0 in all five of these dummy variables.

Cigarette smoking variable. Because the ordinal ratings of cigarette smoking were derived from a dimensional magnitude of packs per day, a special conversion to dummy variables did not seem necessary, and the original 5-category ordinal rating scale was allowed to remain intact, with the coded values of 0-4.

These activities in coding thus preserved 12 of the original variables, and added 15 dummy variables to replace 5 of the original variables. The additional dummy variables contained 3 for chronometry (“symptom duration”), 3 for symptom pattern, 2 for functional severity, 2 for TNM stage, 5 for histologic type, and 2 for other smoking.

The 27 coded variables were entered into the analyses as the independent predictor variables for each patient. For the linear, logistic, and discriminant function procedures, the outcome variable for each patient was coded as alive or dead at 6 months after zero time. For the Cox-1 and Cox-5 procedures, the outcome variables were coded as the actual survival times, with living patients being censored at either 1 or 5 years, as described earlier.

ORIGINAL RESULTS

In previous analyses [14] of data for the 1266 patients, a complex set of clinical and statistical judgments had been used to partition the original 17 variables into categories, and to combine those categories into the five composite stages or strata, designated as A-E. The biologic components of those strata, as noted earlier, came from categories of the symptom pattern, functional severity, and TNM stage variables. The remaining 14 basic variables were omitted from the A-E composite stages mainly for the sake of simplicity and also because they did not seem to add a major prognostic impact beyond the three selected variables. Nevertheless, to allow the mathematical models a chance to work with all the appropriate data, the 14 additional variables were included in the current analyses.

Table 2 shows the distribution of the entire cohort in the composite A-E stages, and the 6-month survival rates for each stage. The composition of the cohort showed that almost three-fifths of its members were in Stages D and E (31 and 26% respectively), with the remainder distributed in proportions of 18-12% in Stages A-C. The overall survival rate, 44% at 6 months, is partitioned into a large prognostic gradient, ranging from 85% survival in Stage A to 14% in Stage E.
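The percentages quoted for Table 2 follow directly from the stratum counts, and the arithmetic can be checked with a short Python sketch. The counts are those printed in Table 2; the pairing of the Stage B and Stage C rows (167 patients with 102 survivors, 152 patients with 75 survivors) is an assumption, chosen because it is the pairing that reproduces the printed rates of 61% and 49%. The helper name `pct` is hypothetical.

```python
# Stratum -> (number of patients, number of 6-month survivors).
# Assumption: the B and C rows are paired so that survivors / patients
# reproduces the printed survival rates.
strata = {
    "A": (222, 188),
    "B": (167, 102),
    "C": (152, 75),
    "D": (396, 141),
    "E": (329, 45),
}


def pct(numerator: int, denominator: int) -> int:
    """Percentage rounded to the nearest whole number."""
    return round(100 * numerator / denominator)


total_patients = sum(n for n, _ in strata.values())
total_survivors = sum(s for _, s in strata.values())

proportions = {k: pct(n, total_patients) for k, (n, _) in strata.items()}
survival_rates = {k: pct(s, n) for k, (n, s) in strata.items()}
overall_rate = pct(total_survivors, total_patients)

# Share of the cohort in Stages D and E ("almost three-fifths").
share_d_e = pct(strata["D"][0] + strata["E"][0], total_patients)
```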
SUBSEQUENT ACTIVITIES
Creation of challenge set
Before any analytic work was done with the data, 200 patients were randomly selected, by stratified sampling within the A-E stages, to be a “challenge set”, having the same proportionate distribution as the parent population of 1266 cases. The challenge set, which was put aside for testing subsequent predictions, had the following numbers and proportions of patients in each stratum: A, 35 (18%); B, 26 (13%); C, 24 (12%); D, 63 (31%); and E, 52 (26%).

Preparation of generating samples
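As a rough sketch of the sampling design in this part of the paper (a sequestered, stratified challenge set, followed by stratified generating samples that are each drawn afresh from the remaining patients), the Python fragment below uses only the standard library. The patient identifiers, the random seed, the helper name `stratified_sample`, and the stratum sizes assumed for B and C are illustrative assumptions; the challenge-set counts and the three sets of target counts are those reported in the text and in Table 3.

```python
import random
from typing import Dict, List

Pool = Dict[str, List[int]]


def stratified_sample(pool: Pool, counts: Dict[str, int], rng: random.Random) -> Pool:
    """Draw counts[s] patients, without replacement, from each stratum s."""
    return {s: rng.sample(pool[s], counts[s]) for s in counts}


rng = random.Random(1990)  # arbitrary seed, used only for reproducibility

# Hypothetical patient ids grouped by stratum (sizes assumed from Table 2).
sizes = {"A": 222, "B": 167, "C": 152, "D": 396, "E": 329}
parent: Pool = {}
next_id = 0
for stratum, n in sizes.items():
    parent[stratum] = list(range(next_id, next_id + n))
    next_id += n

# Challenge set: 200 patients with the same proportions as the parent cohort.
challenge_counts = {"A": 35, "B": 26, "C": 24, "D": 63, "E": 52}
challenge = stratified_sample(parent, challenge_counts, rng)

# The remaining 1066 patients form the pool for every generating sample.
remaining: Pool = {}
for stratum in parent:
    held_out = set(challenge[stratum])
    remaining[stratum] = [p for p in parent[stratum] if p not in held_out]

# Three of the seven target distributions (counts per stratum, total 400).
targets = {
    "proportional": {"A": 70, "B": 52, "C": 48, "D": 126, "E": 104},
    "uniform": {"A": 80, "B": 80, "C": 80, "D": 80, "E": 80},
    "decreasing exponential": {"A": 166, "B": 104, "C": 65, "D": 40, "E": 25},
}

# Each sample is drawn from the full remaining pool, so a patient may
# appear in more than one generating sample but never in the challenge set.
generating = {name: stratified_sample(remaining, counts, rng)
              for name, counts in targets.items()}
```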
After selection and sequestration of the challenge set, the remaining 1066 patients in the parent population were used to prepare 7 sets of generating samples, each containing 400 patients. These patients were chosen, by stratified random sampling, to produce groups having the following defined “shapes” in their distributions of the A-E strata: proportional, uniform, symmetric unimodal, bimodal, decreasing exponential, increasing exponential, and U-shaped. Each sample was “replaced” in the 1066 patients before selection of the next sample; thus, a particular patient might be a member of more than one generating sample.

Table 2. Six-month survival rates in composite strata of original cohort of 1266 patients

Composite stratum    No. of patients    Proportion of patients    No. of 6-month survivors    6-Month survival rate
(or “stage”)         in stratum         in stratum                in stratum                  in stratum
A                    222                18%                       188                         85%
B                    167                13%                       102                         61%
C                    152                12%                        75                         49%
D                    396                31%                       141                         36%
E                    329                26%                        45                         14%
Total                1266               100%                      551                         44%

Table 3. Distribution of composite strata in seven generating samples

Composite                              Symmetric    Decreasing     Increasing
stratum     Proportional    Uniform    unimodal     exponential    exponential    U-shape    Bimodal
A            70              80         55          166             25            106         67
B            52              80         83          104             40             71        100
C            48              80        124           65             65             46         67
D           126              80         83           40            104             71        100
E           104              80         55           25            166            106         66
Total       400             400        400          400            400            400        400

The different shapes of the generating samples were chosen to offer a wide and challenging spectrum of statistical distributions for the same biologic phenomena. Because the phenomena would retain the same biologic attributes, regardless of their statistical distributions, the different arrangements would serve as a test for the “biologic
robustness” of the mathematical models. A model that was too “data dependent” might not be able to recognize and distinguish the important biologic attributes when they appeared in different statistical distributions.

For the arrangements, the proportional sample of 400 patients was distributed across the A-E categories in the same pattern as the original population, and the uniform sample contained 80 members in each group. For the symmetric unimodal, exponential, U-shape, and bimodal distributions, suitable patterns were obtained by taking sample sizes in adjacent A-E stages to have the approximate ratios of 2:3 or 3:2. The numbers of patients in the strata of these seven generating sets are shown in Table 3.

Table 4. Six-month survival rates for composite strata in seven generating sets (survivors/patients, with rate in parentheses)

Generating set            A                B               C               D               E               Total
Proportional              58/70 (83%)      30/52 (58%)     21/48 (44%)     38/126 (30%)    13/104 (13%)    160/400 (40%)
Uniform                   62/80 (78%)      46/80 (58%)     39/80 (49%)     35/80 (44%)     10/80 (13%)     192/400 (48%)
Symmetric unimodal        40/55 (73%)      51/83 (61%)     60/124 (48%)    32/83 (39%)     8/55 (15%)      191/400 (48%)
Decreasing exponential    140/166 (84%)    58/104 (56%)    35/65 (54%)     18/40 (45%)     4/25 (16%)      255/400 (64%)
Increasing exponential    22/25 (88%)      27/40 (68%)     30/65 (46%)     35/104 (34%)    19/166 (11%)    133/400 (33%)
U-shape                   88/106 (83%)     45/71 (63%)     18/46 (39%)     22/71 (31%)     10/106 (9%)     183/400 (46%)
Bimodal                   56/67 (84%)      60/100 (60%)    35/67 (52%)     38/100 (38%)    10/66 (15%)     199/400 (50%)

Although the basic biologic attributes of the predictors remained unchanged, the overall 6-month survival rates in the seven generating sets varied with their distributional arrangements and randomly chosen constituents. Table 4 shows the 6-month survival rates for each stratum in each sample. As expected, the survival rates remained similar (within limits of random variation) for each of the five strata in each sample. The overall survival rates among the seven groups, however, ranged from a high of 64% to a low of 33% in the two exponentially distributed samples, which had the highest proportions of patients in the best or worst stages. In the four other samples, which all
had essentially symmetrical distributions (in uniform, unimodal, bimodal, or U shapes), the total 6-month survival rates ranged from 46 to 50%. The corresponding value in the proportionate sample was 40%.

Subsequent activities

We now had a sequestered challenge set of 200 patients, seven generating samples of 400 patients each, and five mathematical methods to be tested. In the operational procedures, all of the independent variables were standardized within each generating sample to permit a direct comparison of the partial regression coefficients. Parts II and III of this series describe the operations and results obtained when the multivariable analytic procedures were applied to the seven generating sets of data, and later, to the challenge set.

REFERENCES

1. Halperin M, Blackwelder WC, Verter JL. Estimation of the multivariate logistic risk function: A comparison of the discriminant function and maximum likelihood approaches. J Chron Dis 1971; 24: 125-158.
2. Coronary Drug Project Research Group. Factors influencing long-term prognosis after recovery from myocardial infarction. Three-year findings of the Coronary Drug Project. J Chron Dis 1974; 27: 267-285.
3. Dillman RO, Koziol JA. Statistical approach to immunosuppression classification using lymphocyte surface markers and functional assays. Cancer Res 1983; 43: 417-421.
4. Gilpin E, Olshen R, Henning H, Ross J. Risk prediction after myocardial infarction. Cardiology 1983; 70: 73-84.
5. Schmitz PIM, Habbema JDF, Hermans J. The performance of logistic discrimination of myocardial infarction data in comparison with some other discriminant analysis methods. Stat Med 1983; 2: 199-205.
6. Cook EF, Goldman L. Empiric comparisons of multivariate analytic techniques: advantage and disadvantage of recursive partitioning analysis. J Chron Dis 1984; 37: 721-731.
7. Harrell FE, Lee KL, Califf RM, Pryor DB, Rosati RA. Regression modelling strategies for improved prognostic predictions. Stat Med 1984; 3: 143-152.
8. Madsen EB, Gilpin E, Henning H. Short term prognosis in acute myocardial infarction: Evaluation of different predictor methods. Am Heart J 1984; 107: 1241-1251.
9. Brenn T, Arnesen E. Selecting risk factors: A comparison of discriminant analysis, logistic regression and Cox’s regression model using data from the Tromso heart study. Stat Med 1985; 4: 413-423.
10. Harrell FE, Lee KL, Matchar DB, Reichert TA. Regression models for prognostic prediction: advantages, problems, and suggested solutions. Cancer Treat Rep 1985; 69: 1071-1077.
11. Ciampi A, Thiffault J, Nakache J-P, Asselain B. Stratification by stepwise regression, correspondence analysis and recursive partition: A comparison of three methods of analysis for survival data with covariates. Comput Stat Data Anal 1986; 4: 185-204.
12. Peduzzi P, Holford T, Detre K, Chan Y-K. Comparison of the logistic and Cox regression models when outcome is determined in all patients after a fixed period of time. J Chron Dis 1987; 40: 761-767.
13. Cook EF, Goldman L. Asymmetric stratification. An outline for an efficient method for controlling confounding in cohort studies. Am J Epidemiol 1988; 127: 626-639.
14. Feinstein AR, Wells CK. Lung cancer staging: a critical evaluation. In: Matthay R, Ed. Recent advances in lung cancer. Clin Chest Med 1982; 3: 291-305.
15. Feinstein AR. On classifying cancers while treating patients. Arch Intern Med 1985; 145: 1789-1791.
16. Feinstein AR. Clinical biostatistics: XIV. The purposes of prognostic stratification; XV. The process of prognostic stratification (Part 1); XVI. The process of prognostic stratification (Part 2); XVII. Synchronous partition and bivariate evaluation in predictive stratification. Clin Pharmacol Ther 1972; 13: 285-297, 442-457, 609-624, 755-768.
17. Feinstein AR, Schimpff CR, Hull EW (with technical assistance of HL Bidwell). A reappraisal of staging and therapy for patients with cancer of the rectum. I. Development of two systems of staging; II. Patterns of presentation and outcome of treatment. Arch Intern Med 1975; 135: 1441-1462.
18. Feinstein AR, Pritchett JA, Schimpff CR. The epidemiology of cancer therapy: II. The clinical course: data, decisions, and temporal demarcations; III. The management of imperfect data; IV. The extraction of data from medical records. Arch Intern Med 1969; 123: 323-344, 448-461, 571-590.
19. Feinstein AR. Symptoms as an index of biologic behaviour and prognosis in human cancer. Nature 1966; 209: 241-245.
20. Feinstein AR. A new staging system for cancer and a reappraisal of “early” treatment and “cure” by radical surgery. N Engl J Med 1968; 279: 747-753.
21. Feinstein AR. Clinical Judgment. Baltimore: Williams & Wilkins Co.; 1967.
22. American Joint Committee for Cancer Staging and End Results Reporting. Clinical Staging System for Carcinoma of the Lung. Chicago: American Joint Committee on Cancer; 1973.
23. Walter SD, Feinstein AR, Wells CK. Coding ordinal independent variables in multiple regression analysis. Am J Epidemiol 1987; 125: 319-323.