Paul L. Canner
14
Further Aspects of Data Analysis
INTRODUCTION

In Chapter 13 the process of data and safety monitoring was discussed and statistical methods for monitoring the accruing data over the course of the trial for evidence of beneficial or adverse effects were described. Other statistical methods used in evaluating the results in the Data and Safety Monitoring Committee (DSMC) reports and in the published papers relating to both treatment effects and the natural history of coronary heart disease (CHD) are described and discussed in this chapter. For methods that are well-known or well-documented elsewhere, appropriate references are provided and the details are not repeated here. When this is not the case, more detailed descriptions of the methods are provided. In addition to the statistical methods, a most important aspect of data analysis relates to making certain that the data tabulated in the DSMC report or published in a scientific paper are indeed correct. A section in this chapter is devoted to a discussion of quality assurance procedures for data analyses.

STATISTICAL METHODS USED IN THE CDP
Comparison of Two Proportions

The usual test for the difference of two proportions was frequently used in DSMC reports and CDP publications. The well-known formula for this test is

    T = (p_1 - p_2) / [ p̄(1 - p̄)(1/n_1 + 1/n_2) ]^{1/2},    (1)

where p_1 and p_2 are the observed proportions of an event for the two treatment groups, n_1 and n_2 are the numbers of patients in the two treatment groups, and p̄ = (n_1 p_1 + n_2 p_2)/(n_1 + n_2). For the most part the CDP did not utilize either the Yates continuity corrected test for two proportions [1] or the exact
test for comparing two proportions of Fisher [2] and Irwin [3]. Due to the large numbers of patients and events in the CDP, the P values derived from the latter two tests generally differed little from those determined via the test of eq. (1). For the situation of small sample sizes or small numbers of events, the question of whether or not to use the continuity correction has been the subject of controversy among statisticians [4-6]. From a consideration of the various arguments presented, it would seem that the continuity corrected test of proportions should be used in order to obtain the best estimate of the P value, i.e., the probability of a result at least as extreme as the one observed. However, the uncorrected test as given in eq. (1) should be used if one must decide for or against a particular hypothesis at a preset Type I error rate.
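As an illustration, the statistic of eq. (1) takes only a few lines of code. The following sketch is not from the CDP programs; the function and variable names are illustrative only.

    from math import sqrt

    def two_proportion_z(d1: int, n1: int, d2: int, n2: int) -> float:
        """Uncorrected test for two proportions, eq. (1).

        d1, d2 -- numbers of events; n1, n2 -- numbers of patients.
        Returns T, approximately standard normal under the null hypothesis.
        """
        p1, p2 = d1 / n1, d2 / n2
        p_bar = (d1 + d2) / (n1 + n2)          # pooled proportion
        se = sqrt(p_bar * (1 - p_bar) * (1 / n1 + 1 / n2))
        return (p1 - p2) / se

    # Hypothetical example: 211 vs. 250 deaths in two groups of 1100 each
    print(round(two_proportion_z(211, 1100, 250, 1100), 2))   # -2.04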
Methods of Survival Analysis

A method adapted from a paper by Littell [7] was used to compute life table rates for various endpoints in the CDP. This is a parametric method that assumes that the survival curve within each time interval (usually 4 months in the CDP) is of an exponential form with a constant hazard rate (or force of mortality) within the interval (but not constant across intervals). Before displaying the formulas associated with this method, some notation and definitions are necessary. Suppose the probability of surviving the interval (t, t+1) given survival to time t is p_t. Then the cumulative probability of surviving the interval (0, t+1) is P_{t+1} = p_0 p_1 ... p_t. In order to estimate the p's and P's, several quantities must be defined. Let N_t denote the number of patients who have survived to time t; let D_t denote the number of deaths (or number of patients having the event of interest) occurring in the interval (t, t+1); and let W_t denote the number of patients due for withdrawal in (t, t+1). A patient is said to be "due for withdrawal" in the interval (t, t+1) if the number of time intervals between the patient's date of enrollment into the study and the present date of analysis is greater than t but no more than t+1. For the D_t patients who died in (t, t+1), u_j is defined as the time to death as a fraction of the length of the interval (t, t+1) for the jth deceased patient in that interval, 0 < u_j ≤ 1. For the N_t - D_t patients not dying in (t, t+1), v_i is defined as the time to withdrawal as a fraction of the length of the interval (t, t+1) for the ith patient surviving that interval, 0 < v_i ≤ 1. For those surviving patients not due for withdrawal in (t, t+1), v_i = 1. Defining

    Q_t = sum_{j=1}^{D_t} u_j + sum_{i=1}^{N_t - D_t} v_i,

p_t is estimated by p̂_t = exp(-D_t/Q_t). The asymptotic variance of p̂_t is estimated by

    Var p̂_t = p̂_t^2 D_t/Q_t^2.
Finally, the cumulative probability of surviving (0, t+1) is estimated by P̂_{t+1} = p̂_0 p̂_1 ... p̂_t, with variance estimated by

    Var P̂_{t+1} = P̂_{t+1}^2 sum_{i=0}^{t} (D_i/Q_i^2).
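A minimal sketch of these estimates, assuming the per-interval death counts D_t and exposure totals Q_t have already been tallied (names and structure are illustrative, not the CDP program):

    from math import exp

    def littell_life_table(D, Q):
        """Littell-style life table from per-interval deaths D[t] and
        exposure totals Q[t] (patient-intervals at risk).

        Returns (P_hat, var_P): cumulative survival estimates P_hat[t+1]
        and their estimated variances, per the formulas above.
        """
        P_hat, var_P = [], []
        cum_surv, cum_sum = 1.0, 0.0
        for d, q in zip(D, Q):
            cum_surv *= exp(-d / q)      # p_hat_t = exp(-D_t/Q_t)
            cum_sum += d / q**2          # running sum of D_i/Q_i^2
            P_hat.append(cum_surv)
            var_P.append(cum_surv**2 * cum_sum)
        return P_hat, var_P

    # Hypothetical example: three 4-month intervals
    P, V = littell_life_table(D=[10, 8, 6], Q=[480.0, 455.5, 430.0])
    print([round(p, 4) for p in P], [round(v, 6) for v in V])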
With the formulas just given, it is possible to compare two survival curves statistically (via confidence intervals or hypothesis testing) only on a point-by-point basis. These formulas do not allow the comparison of two or more entire survival curves. The Cox life table regression method [8] has been used in the CDP for the purpose of testing the difference between the entire survival curves for a drug group and the placebo group [9]. In order to compute the test statistic for the Cox procedure, it is necessary to know the "survival" times for all D_1 patients in treatment group 1 and all D_2 patients in treatment group 2 who had a specified event during the trial; D_1 + D_2 = D. Just prior to the occurrence of the ith event (both treatments combined), i = 1, ..., D, there were r_{i1} patients in treatment 1 and r_{i2} patients in treatment 2 still being followed and at risk of having the event; r_{i1} + r_{i2} = r_i. The test statistic, then, is

    T = [ D_1 - sum_{i=1}^{D} r_{i1}/r_i ] / [ sum_{i=1}^{D} r_{i1} r_{i2}/r_i^2 ]^{1/2}.
In the case of ties, i.e., more than one patient having an event during the same reporting interval (usually days), the test statistic takes the following form:

    T = [ D_1 - sum_{i=1}^{K} m_i r_{i1}/r_i ] / [ sum_{i=1}^{K} ( m_i(r_i - m_i)/(r_i - 1) ) (r_{i1}/r_i)(r_{i2}/r_i) ]^{1/2},
where K denotes the number of distinct event times and m_i denotes the number of events occurring during the ith time interval. The two formulas just given are for the special case of simply comparing two survival curves without respect to any other variables. The Cox method, in its general form, can also be used to compare two survival curves while adjusting for a number of concomitant variables. In the instances of its use in the CDP, the Z values for drug-placebo difference in survival curves computed using the Cox procedure were always within 0.25 units of the corresponding Z values computed using the ordinary test for two proportions [eq. (1)] based on all events through the end of the trial [9]. This close agreement was due to the fact that the survival curves all tended to be exponential, with very little crossing of one another except when the two curves were nearly identical. Since the survival curves for the major endpoints in the CDP generally tended to be approximately exponential with constant hazard over the entire follow-up period, the following test for comparing two exponential survival curves was sometimes employed as a quick manual method of obtaining Z
values that were very close to those given by the Cox method: D'_1 and D'_2 denote the total numbers of deaths over the entire follow-up period for treatment groups 1 and 2, respectively; Q'_1 and Q'_2 denote the total numbers of patient-intervals of exposure to risk of an event for the two treatment groups; and N'_1 and N'_2 denote the numbers of patients enrolled in treatments 1 and 2. Then the test statistic is [10]

    T = [ D'_1/Q'_1 - D'_2/Q'_2 ] / { [ (N'_1 + N'_2)/(Q'_1 + Q'_2) ] [ (D'_1 + D'_2)/(N'_1 N'_2) ] }^{1/2},    (2)

where T has approximately a standard normal distribution. Note that Q'_1 and Q'_2 were obtained by summing the Qs given for the individual intervals from the Littell life table output. Table 1, for purposes of comparison, gives the Cox Z value, the exponential Z value from eq. (2), and the binomial Z value from eq. (1) for the clofibrate-placebo and nicotinic acid-placebo comparisons for four endpoints. The Cox and binomial Z values have been published previously [9]. For the examples given in Table 1, the Cox and exponential Z values differ by a maximum of 0.04 units. Such agreement will not necessarily be found if the survival curves are highly nonexponential in shape.

Comment. Estimated survival curves can be obtained using the Cox procedure, as can an assessment of the overall difference of two curves. However, the Cox procedure imposes a sort of parallelism upon the survival curves; for example, in a situation where the curves are actually diverging initially, and then later converging and even crossing, the Cox-estimated curves may turn out to be essentially parallel or even nearly coincident. For this reason, it is recommended that the survival curves be estimated using a method such as Littell's [7] or that of Kaplan and Meier [11].
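Returning to eq. (2): the quick exponential test is easy to script as well as to compute by hand. This sketch assumes the totals have been read from the Littell life table output (names and the usage numbers are illustrative):

    from math import sqrt

    def exponential_survival_z(d1, q1, n1, d2, q2, n2):
        """Quick test comparing two exponential survival curves, eq. (2).

        d1, d2 -- total deaths; q1, q2 -- total patient-intervals of
        exposure to risk; n1, n2 -- numbers of patients enrolled.
        """
        num = d1 / q1 - d2 / q2      # difference of estimated hazard rates
        den = sqrt(((n1 + n2) / (q1 + q2)) * ((d1 + d2) / (n1 * n2)))
        return num / den

    print(round(exponential_survival_z(211, 16500.0, 1100,
                                       250, 16300.0, 1100), 2))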
Table 1  Z Values Computed Three Different Ways for Four Endpoints and Two Treatment Comparisons in the CDP

                                                       Z values
  Endpoint                                   Cox    Exponential  Binomial
  Total mortality
    Clofibrate-placebo                      -0.15      -0.14        0.04
    Nicotinic acid-placebo                  -0.56      -0.55       -0.67
  Coronary death
    Clofibrate-placebo                      -1.13      -1.12       -1.08
    Nicotinic acid-placebo                  -0.69      -0.66       -0.75
  Definite nonfatal MI
    Clofibrate-placebo                      -0.78      -0.78       -0.64
    Nicotinic acid-placebo                  -2.86      -2.85       -3.09
  Coronary death or definite nonfatal MI
    Clofibrate-placebo                      -1.24      -1.27       -1.27
    Nicotinic acid-placebo                  -2.56      -2.60       -2.77
Subgroup Analysis

In the monitoring of treatment effects in the CDP there was frequent interest in evaluating treatment differences within specific subgroups of study participants, e.g., subgroups of the dextrothyroxine and placebo groups [12]. One statistical test employed in this regard was the ordinary test of two proportions described previously [eq. (1)] for the findings in a particular subgroup, without regard to the findings in the complementary subgroup. However, it was usually more appropriate to do a test for heterogeneity of treatment effects among subgroups. A number of tests are available for this purpose. Five are presented here for application in two treatments and two subgroups. The first three involve a comparison between two subgroups of the absolute differences in proportions between the two treatment groups. For the first test,

    T_1 = [ (p_11 - p_12) - (p_21 - p_22) ] / [ sum_{i=1}^{2} sum_{j=1}^{2} p_ij(1 - p_ij)/n_ij ]^{1/2},
where n_ij denotes the number of patients and p_ij the observed proportion of events for the ith subgroup and the jth treatment, i, j = 1, 2. The second test is similar except that a single pooled estimate of the probability of an event is used in the denominator of the test statistic:

    T_2 = [ (p_11 - p_12) - (p_21 - p_22) ] / [ p̄(1 - p̄) sum_{i=1}^{2} sum_{j=1}^{2} n_ij^{-1} ]^{1/2},

where p̄ = sum_i sum_j n_ij p_ij / sum_i sum_j n_ij. The theoretical justification for this second test is somewhat tenuous since the null hypothesis does not stipulate that the four proportions are equal (which would justify a pooled estimate of the common proportion). However, this test has been given because, in contrast to the first test, it gives results very close to those of a regression method, now to be described. Consider the following multiple linear regression model:
    y = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_1 x_2,

where x_1 is a 0,1 variable for treatment and x_2 is a 0,1 variable for some baseline subgrouping characteristic. The regression coefficient, b_3, for "interaction" provides an estimate of the difference in treatment effect between the two subgroups. The estimated b_3 divided by its standard error yields a test statistic, approximately normally distributed, for evaluating the significance of this interaction between treatment and subgrouping characteristic. With this regression approach it is easy to evaluate the interaction effect in the presence of concomitant variables. Also, the subgrouping variable need not be dichotomous but may be continuous; in this case b_3 represents the difference in slopes between the best-fitting lines relating the endpoint to the baseline characteristic for the treatment group and for the placebo group.
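As a sketch of this third test (not the CDP's own program), the interaction coefficient and its t value can be obtained from any least squares routine; here the statsmodels formula interface is used, with illustrative variable names and toy data:

    import pandas as pd
    import statsmodels.formula.api as smf

    # One row per patient: a 0/1 endpoint y, a 0/1 treatment indicator,
    # and a 0/1 (or continuous) baseline subgrouping characteristic.
    df = pd.DataFrame({
        "y":         [0, 1, 0, 0, 1, 1, 0, 1],
        "treatment": [0, 0, 1, 1, 0, 0, 1, 1],
        "baseline":  [0, 1, 0, 1, 0, 1, 0, 1],
    })

    fit = smf.ols("y ~ treatment * baseline", data=df).fit()
    print(fit.params["treatment:baseline"])    # estimated b_3
    print(fit.tvalues["treatment:baseline"])   # b_3 / SE(b_3)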
The first three tests consider a comparison of absolute differences between treatment groups in two subgroups. The next two tests, by contrast, are designed to compare relative differences. Suppose p_11 = 0.20, p_12 = 0.40, p_21 = 0.02, and p_22 = 0.04. The first three tests would be likely to conclude a significant interaction or heterogeneity since p_11 - p_12 = -0.20 is much different from p_21 - p_22 = -0.02. However, the relative differences, p_11/p_12 and p_21/p_22, are both 0.50, and the two tests to be described will not conclude heterogeneity in this instance. The fourth test is known as Cochran's method [13,14], and deals with standardized differences, i.e., d_i = (p_i1 - p_i2)/p̄_i(1 - p̄_i), where p̄_i = (n_i1 p_i1 + n_i2 p_i2)/(n_i1 + n_i2). Thus,

    T_4 = (d_1 - d_2) / [ sum_{i=1}^{2} (1/n_i1 + 1/n_i2)/p̄_i(1 - p̄_i) ]^{1/2}.

Again, this test statistic has approximately a standard normal distribution. For more than two subgroups, the test statistic is given by Fleiss [14] as a chi-square statistic with degrees of freedom equal to the number of subgroups minus one. The preceding test has been criticized by Mantel et al. [15] as giving anomalous results in certain instances with very unequal sample size ratios among the two or more subgroups. Two of the examples provided in this critique had sample sizes of n_11 = 210, n_12 = 70, n_21 = 20, n_22 = 830 (giving sample size ratios of n_11/n_12 = 3.00 and n_21/n_22 = 0.02) and n_11 = 200, n_12 = 1800, n_21 = 1000, n_22 = 2000 (i.e., sample size ratios of 0.11 and 0.50). The criticism does not apply in clinical trial settings where the sample size ratios among the subgroups will always be nearly equivalent, within random variation, as long as the subgroups are defined by baseline rather than postrandomization characteristics. An alternative test that avoids the criticism raised by Mantel et al. and that gives results similar to the Cochran test in the clinical trial setting is based on the logarithm of the odds ratio [14,16]. The odds ratio is defined as o_i = p_i1(1 - p_i2)/[p_i2(1 - p_i1)]. Hence the fifth test statistic is
    T_5 = [ ln(o_1) - ln(o_2) ] / { sum_{i=1}^{2} sum_{j=1}^{2} [ n_ij p_ij(1 - p_ij) ]^{-1} }^{1/2}.
This test is essentially the same as a test for an interaction term in a multiple logistic regression model [17]. In making statistical evaluations of heterogeneity of treatment differences among subgroups with any of the above tests, the problem of multiple inference had to be considered in the CDP. No a priori hypotheses concerning such heterogeneity were established in the CDP. Thus any "interesting" findings emerging from such analyses for heterogeneity or interaction had to be considered in the light of the total number of such analyses carried out. This was a crucial matter in the deliberation in the CDP of possible early termination of dextrothyroxine in a subgroup of patients. There were several variables for which the dextrothyroxine effect on mortality was different among different
levels of the variable. In that particular instance, multiple regression--the third test described above--was used to evaluate the degree of heterogeneity for some 48 baseline variables. A computer simulation procedure was used to determine the appropriate critical value, taking into account the multiplicity of tests. Details of this procedure have been published elsewhere [12,18]. Based on analyses for heterogeneity carried out for the dextrothyroxine and placebo groups, it was concluded that dextrothyroxine did indeed have differential effects on mortality in different subgroups of the patients. However, there remained the problem of determining for exactly which subgroup(s) dextrothyroxine should be discontinued, and for which subgroup(s) the drug should be continued. Data received as of October 1, 1970 indicated two rather complicated subgroups defined by baseline risk group, history of angina pectoris, and ECG heart rate. In one subgroup dextrothyroxine had a substantially higher mortality than placebo (Z = 3.29) and in the complementary subgroup it had a substantially lower mortality than placebo (Z = -2.60). Since these subgroups were defined as a result of a trial and error process designed to maximize the degree of heterogeneity, it was decided to follow the patients for another 10 months to determine whether the heterogeneity would hold up. The outcome as of August 1, 1971 was that dextrothyroxine continued to do poorly in the first subgroup, but did not continue to do well in the second subgroup. Thus, since no subgroup could be identified for which dextrothyroxine gave consistent evidence of a beneficial effect, the drug was discontinued in all patients in late 1971.

In this discussion of subgroup analyses attention has been focused on subgroups defined by characteristics measured and observed at baseline. There was often an interest among some CDP investigators in carrying out mortality and morbidity analyses in subgroups defined by characteristics measured during the follow-up period, such as adherence to the study medication regimen or cholesterol/triglyceride response to the treatment. It was felt that dilution of treatment effect resulted from the inclusion of poor adherers or poor cholesterol/triglyceride responders in the mortality and morbidity analyses and that the fairest test of a drug's efficacy would be made by excluding poor adherers and poor responders. However, after carrying out many analyses of these data it was concluded that such analyses were invalid and misleading because it was not possible to define an appropriate subgroup of the placebo group against which the subgroup of good adherers/responders in a particular drug group could be compared. The detailed arguments and analyses have been published elsewhere [19].
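Returning to the five tests above, a minimal sketch of the first and fifth statistics for two subgroups follows (illustrative code, not the CDP software):

    from math import log, sqrt

    def t1_absolute(p, n):
        """T_1: compares absolute treatment differences across two
        subgroups. p[i][j], n[i][j] -- event proportion and sample size
        for subgroup i, treatment j (i, j = 0, 1)."""
        num = (p[0][0] - p[0][1]) - (p[1][0] - p[1][1])
        var = sum(p[i][j] * (1 - p[i][j]) / n[i][j]
                  for i in (0, 1) for j in (0, 1))
        return num / sqrt(var)

    def t5_log_odds(p, n):
        """T_5: compares log odds ratios across two subgroups [16]."""
        lor = [log(p[i][0] * (1 - p[i][1]) / (p[i][1] * (1 - p[i][0])))
               for i in (0, 1)]
        var = sum(1 / (n[i][j] * p[i][j] * (1 - p[i][j]))
                  for i in (0, 1) for j in (0, 1))
        return (lor[0] - lor[1]) / sqrt(var)

    # The example from the text: equal relative differences, very
    # unequal absolute differences (500 patients per cell assumed)
    p = [[0.20, 0.40], [0.02, 0.04]]
    n = [[500, 500], [500, 500]]
    print(round(t1_absolute(p, n), 2), round(t5_log_odds(p, n), 2))

With these numbers T_1 is about -5.9 while T_5 is about -0.6, matching the distinction drawn in the text between tests of absolute and relative differences.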
Multiple Regression Analysis

Introduction. Many books and papers have been written on the topic of multiple regression analysis and the reader is referred to these if necessary as background to this section [17,20-22]. In the CDP, this statistical technique was used principally for three purposes:

1. To adjust treatment effects with respect to mortality and morbidity for baseline covariates.
2. To identify in a stepwise fashion for natural history analyses those baseline variables most highly associated with subsequent mortality and morbidity.

3. To test for interaction between treatment and baseline variables with respect to mortality or morbidity endpoints.

The subsections below describe some features of the multiple regression analyses carried out in the CDP.
Types of variables used in regression equation. Three types of variables--quantitative, quantal, and qualitative--were used in regression equations. For quantitative variables such as body weight, serum cholesterol, and blood pressure, the measured value for each patient was ordinarily used. For quantal or binary variables codes 0 and 1 were ordinarily used to denote the two possible values or states; e.g., 0 = no history of cancer, 1 = history of cancer. Qualitative variables having several classes, such as occupational category, marital status, or race, were incorporated into the regression equation in one of two ways: In some cases a binary variable was constructed by combining various classes, such as married versus not married, or professional/technical versus other occupations. Alternatively, for a variable with k classes, k-1 "dummy" or indicator variables were defined, each taking the values 0 or 1. For current marital status, for instance, the following four variables were defined:

    x_1 = 1 if married, 0 if not;
    x_2 = 1 if divorced, 0 if not;
    x_3 = 1 if widowed, 0 if not;
    x_4 = 1 if separated, 0 if not.

For a given patient, at most one of these variables will have the value 1. All zeroes indicate that the patient is in the fifth class, namely, never married.
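This k-1 indicator coding is what most statistical libraries now produce automatically; a small illustrative sketch with pandas (column names are hypothetical):

    import pandas as pd

    status = pd.Series(
        ["married", "never married", "divorced", "widowed", "separated"],
        name="marital_status",
    )
    dummies = pd.get_dummies(status).astype(int)
    # The reference class "never married" is represented by all zeroes,
    # as in the text, so its column is dropped.
    dummies = dummies.drop(columns="never married")
    print(dummies)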
Linear versus logistic regression. When the dependent variable of a regression model is binary--as was most frequently the situation in the CDP--the logistic model [17] tends to provide a better fit to the data than the linear model. The linear regression model also has the disadvantage that it sometimes yields estimates of risk that are either negative or greater than 1 (or greater than 100% if risk is expressed in percent); such estimates of risk always fall between 0 and 1 with a logistic model. On the other hand, it is much more expensive to run a logistic regression than a linear regression since the former requires an iterative computing procedure. Stepwise variable selection techniques are also more difficult to carry out with logistic than with linear regression. The latter are two of the principal reasons multiple linear--rather than logistic--regression was ordinarily used in the CDP. Furthermore, it has been shown empirically that for analyses of the relationship of mortality to baseline characteristics in the CDP placebo group, multiple linear regression gives results similar to those obtained with multiple logistic regression analysis [22]. This is likely due to the fact that for the range of mortality risk within which most CDP patients fall the logistic curve is not much different from a straight line.
Variable selection. The forward selection (or "step-up") procedure was used to identify out of large sets of baseline variables those smaller sets of 5, 10, and 20 variables most highly associated with a particular endpoint. The reason for using this as opposed to backward elimination ("step-down"), stepwise regression (a combination of forward and backward selection), all possible regressions, or other variable selection procedures was based on the fact that software for a forward selection procedure was more readily available at the time. Comparisons of some of these selection procedures from a statistical standpoint have been carried out and reported by others [19,20].

Relationship between multiple correlation coefficient and partial t values. One purpose of stepwise regression analysis is to identify a subset of regressor variables that produces a maximal reduction in the variance of the dependent variable, s_y^2. The proportional reduction in s_y^2 due to the regression of y on various independent variables, x, is given by

    R_c^2 = 1 - s_{y.x}^2/s_y^2,

where s_{y.x}^2 is the residual variance of y after taking the regressor variables into account [23]. The quantity R_c is known as the multiple correlation coefficient corrected for degrees of freedom, and is related to the uncorrected multiple correlation coefficient, R, by the formula

    1 - R_c^2 = (1 - R^2)(n - 1)/(n - p - 1),

where p denotes the number of regressor variables in the equation. When doing a step-up regression analysis, R is found always to increase with each new variable added to the equation, but R_c increases for a while and then may level off and begin decreasing as the number of regressor variables becomes large. In other words, the point may be reached where adding new variables having very little independent correlation with the dependent variable will only add noise and thus increase rather than decrease the residual variance, s_{y.x}^2. It can be shown analytically that at the point where R_c begins to decrease, the partial F value (i.e., the square of the t value) for the variable just entered is approximately 1. Thus it might often be reasonable to stop entering variables once their t values become less than 1 in absolute value.
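As a quick numeric illustration of this stopping rule (made-up numbers, not CDP data):

    def corrected_r2(r2: float, n: int, p: int) -> float:
        """Squared multiple correlation corrected for degrees of
        freedom: 1 - R_c^2 = (1 - R^2)(n - 1)/(n - p - 1)."""
        return 1 - (1 - r2) * (n - 1) / (n - p - 1)

    # With n = 100 observations, adding a 10th regressor that lifts R^2
    # only from 0.300 to 0.302 makes the corrected value go down:
    print(corrected_r2(0.300, 100, 9))   # ~0.2300
    print(corrected_r2(0.302, 100, 10))  # ~0.2236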
Selection of concomitant variables for adjustment of treatment effects. A suggestion one hears occasionally concerning selection of adjusting variables is that only those variables distributed differentially (say, at the 5% level of significance) between the two treatment groups should be included in the adjustment. But it can be easily shown that variables that are only moderately maldistributed between the treatment groups (say, with a P value of 0.20 or 0.30) but that are strongly correlated with the endpoint can also have a substantial influence on the observed treatment effect. To make certain that all of the important baseline characteristics were included in the adjustment of treatment effects with respect to mortality and morbidity in the CDP, this adjustment was usually carried out using the complete set of 54 variables considered in the analysis of baseline comparability of the treatment groups [9,24]. This large number of variables might be regarded as overkill, which in a sense it is.
However, it appears that no harm is done by using the complete set. This has been demonstrated in two ways. First, by means of step-up regression, with the treatment variable forced into the equation, the baseline variables most highly associated with the endpoint were stepped into the equation; after about 20 steps the standard error and t value for the regression coefficient for treatment had converged to virtually the same values found with all 54 variables in the equation. Second, a special step-up regression routine was developed in which the treatment variable was first forced into the equation and the baseline variables were then stepped in according to the amount of change (in either direction) in the t value for treatment from the preceding step. In this way too after about 20 steps there was convergence to nearly the same values found with 54 variables. Instead of going through the stepwise routines for many different endpoints, for ease of presentation the treatment t values adjusted for all 54 variables were reported by the CDP [9,24].
Special stepwise procedures for adjusting variables. A special option was built into the CDP regression program for selecting variables in a stepwise fashion according to the amount of change produced in the t value of the main regressor variable forced into the equation. With this option one can select those baseline adjusting variables that either (1) maximize the adjusted t value of the main variable of interest, (2) minimize this adjusted t value, or (3) produce the maximum change in the t value in either direction (as described above). In this way some feel can be obtained for the degree of dependency of the adjusted t value upon the particular set of adjusting variables utilized. Thus, for example, in an analysis of the nicotinic acid-placebo difference with respect to total mortality carried out in 1973 (prior to the end of the study), the unadjusted t value was -0.33, the maximum adjusted t value (12 adjusting variables) was +0.63, and the minimum adjusted t value (16 adjusting variables) was -1.00. For a different endpoint, coronary death or definite nonfatal MI, the minimum and maximum adjusted t values were -3.22 and -2.06. Thus, it is possible to influence greatly the regression adjustment of the effect of a treatment or other variable simply by a careful choice of the adjusting variables. For this reason, it is strongly recommended that the set of adjusting variables for treatment effects be clearly established early in the study, before it is possible for their choice to be influenced by the data.
Computation of the adjusted intercept. When using multiple regression for the purpose of obtaining the relationship of a given variable (such as treatment) with an endpoint after adjusting for other variables, it was often desirable to obtain the adjusted intercept as well as the adjusted regression coefficient. For example, when the regressor variable of interest is treatment, the adjusted intercept is equivalent to the adjusted incidence of the event for the placebo group; adding the adjusted regression coefficient for treatment to this gives the adjusted incidence for the drug group. In several previous CDP publications such adjusted incidence rates have been published [9,24,25]. Computation of the adjusted intercept is now described. Suppose we have a regression equation,

    y = b_0 + b_1 x_1 + b_2 x_2 + ... + b_p x_p,
where x_1 is the variable of interest and x_2, ..., x_p are adjusting variables. The adjusted intercept is defined as b_0' = b_0 + Q, where Q = b_2 x̄_2 + ... + b_p x̄_p, that is, the p-1 adjusting variables are all evaluated at their means. The CDP regression programs included an option of specifying which variables were to be used for adjusting purposes, and of printing out the adjusted intercept. For other regression programs in which this option is not available, the adjusted intercept can be easily computed in the following way without having to compute Q directly: First, note that ȳ, the observed proportion of the endpoint, is equivalent to b_0 + b_1 x̄_1 + Q. (This equality applies only in the case of multiple linear regression, not multiple logistic regression.) Thus the adjusted intercept, b_0' = b_0 + Q, is seen to be equal to ȳ - b_1 x̄_1.
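A sketch of this shortcut with an ordinary least squares fit (made-up data; names are illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    n = 200
    x1 = rng.integers(0, 2, n)       # treatment indicator
    x2 = rng.normal(50, 10, n)       # an adjusting covariate
    y = (rng.random(n) < 0.15 + 0.05 * x1 + 0.002 * (x2 - 50)).astype(float)

    X = np.column_stack([np.ones(n), x1, x2])
    b = np.linalg.lstsq(X, y, rcond=None)[0]   # b0, b1, b2

    # Adjusted intercept without computing Q = b2 * mean(x2) directly:
    b0_adj = y.mean() - b[1] * x1.mean()
    # Adjusted placebo incidence and adjusted drug incidence:
    print(b0_adj, b0_adj + b[1])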
Missing data. Missing data in multiple regression analyses were usually handled by substituting the mean values computed for all patients in the analysis. The number of missing values generally was very small, representing about 0.3% of the total data. Occasionally, patients having one or more data items missing were completely excluded from the regression analyses. The results generally did not differ much from those obtained by the first approach. A third approach, of limited use in the CDP since it came to our attention late in the study, is the following: For each regressor variable having missing data, define a dummy variable, x = 0 if data present and x = 1 if data missing, and add these variables to the regression equation. The justification for this approach is that for the case of a single regressor variable with missing data, the estimated regression coefficient for the variable and the estimated intercept are identical to those obtained by excluding the missing data from the analysis. For the case of other regressor variables present in the equation, the properties of this procedure have been investigated only for a few specific examples and have not been worked out in general. The standard set of regressor variables in the CDP included several baseline ECG variables, and if any one of these was missing, all were missing for a given patient. Thus only one dummy variable had to be defined for the set of ECG variables. With this approach it makes little difference what values are supplied for the missing data in the regression analysis so long as the same value is used for all missing data for a given variable and so long as the values are not so extreme (such as a missing code of 999 for a 0,1 variable) as to create problems in computational accuracy. The inclusion of all such dummy variables for missing data in the regression equation may cause the corrected multiple correlation coefficient, R_c, to decrease, resulting in a poorer fit of the regression model than in the absence of any dummy variables for missing data. In this case it would be well to use a stepwise approach, beginning with the substitution of mean values for all missing observations, and stepping the dummy variables into the regression equation via a forward selection procedure until R_c ceases to increase. A comparison of the various procedures for handling missing data is given in Table 2, using data from the CDP placebo group for 40 baseline variables as regressor variables and total mortality as the dependent variable. For 10 of the 40 variables there were missing data for at least one of the 2789 patients. The t values for regression coefficients vary only slightly among the mean value and dummy variable approaches. However, they may become substantially different if the patients with any missing data are excluded from the analysis.
Table 2  t Values of Selected Regression Coefficients for Various Methods of Handling Missing Data, CDP Placebo Group

                                          Methods of handling missing data
                                     Excluding      Imputing   Dummy         Optimum
                                     patients with  mean       variables for subset of dummy
  Baseline variable                  missing data   value      missing data  variables^a
  ST depression                        5.19          5.48         5.51          5.52
  Ventricular premature beats          1.27          2.06         2.05          2.04
  Intermittent claudication            3.23          3.18         3.18          3.18
  Physical activity                    2.13          1.93         1.87          1.88
  Cigarette smoking                    1.18          0.77         0.82          0.79
  Corrected multiple
    correlation coefficient            0.33873       0.33896      0.33969       0.34181

^a Selected from the total set of ten dummy variables via step-up regression so as to maximize the corrected multiple correlation coefficient.
For the example in Table 2, R_c is smallest when patients are excluded and largest when the optimum subset of dummy variables is selected.

Conditional variables. A special variety of missing data results from the use of conditional variables. An excellent example of a conditional variable--although not from the CDP--is number of pregnancies. In studies involving both men and women the variable "number of pregnancies" has substantial missing data in the form of "not applicable" responses. It would not do to impute a mean number of pregnancies to the men, nor would it always be acceptable to exclude all men from the regression analysis. So by providing a dummy variable, x = 0 if data present and x = 1 if data missing (or equivalently, x = 0 if female and x = 1 if male), the association of number of pregnancies with the endpoint, in the presence of many other baseline variables, can be appropriately evaluated.
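A sketch of this dummy-variable device for one regressor with missing values (illustrative names and data):

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({
        "y":    [1, 0, 0, 1, 0, 1, 0, 0],
        "chol": [250, 210, np.nan, 280, 195, np.nan, 230, 205],
    })

    df["chol_missing"] = df["chol"].isna().astype(int)  # 1 if data missing
    df["chol"] = df["chol"].fillna(df["chol"].mean())   # any constant works

    # Regress y on the variable and its missing-data indicator together:
    X = np.column_stack([np.ones(len(df)), df["chol"], df["chol_missing"]])
    b = np.linalg.lstsq(X, df["y"].to_numpy(), rcond=None)[0]
    print(b)  # intercept, chol coefficient, missing-indicator coefficient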
Analyses by Clinic

Although most of the analyses of the CDP data were carried out on the totality of the data, some analyses of the endpoint data took into consideration the data from each of the 53 clinics. This was done to make sure that atypical outcomes in a small number of clinics were not swamping the findings in all the rest. The simplest of these analyses involved tabulating the proportion of events for each treatment group within each clinic, and then counting the number of clinics in which drug did better than placebo, the number in which drug was worse than placebo, and the number in which the two treatments had identical proportions of events. The classical sign test [26] provided a simple, although not particularly powerful, assessment of statistical significance of the treatment group difference with respect to the endpoint. The test statistic for the sign test is T = (p - 0.5)/(0.25/N)^{1/2}, where N is the number of clinics and p is the proportion of clinics for which the drug group event rate was higher than the placebo group event rate plus one half of the proportion of clinics for which the two event rates were identical.

For the clofibrate-placebo comparison with respect to 5-year total mortality, the test for two proportions gave a Z value of -0.60, i.e., the mortality was slightly lower in the clofibrate group than in the placebo group. However, in only 24 of the 53 clinics was the clofibrate mortality lower than that in the placebo group, whereas in 29 clinics the clofibrate group had a higher mortality than the placebo group. The sign test yielded a Z value of +0.69. The apparent discrepancy between these two Z values can be traced to the fact that of the eleven smallest clofibrate-placebo differences, nine disfavored clofibrate. In each instance one more death in the placebo group or one less death in the clofibrate group would have been enough to change the sign of the difference in favor of the clofibrate treatment. A statistically more powerful test than the sign test, viz., the Wilcoxon signed rank test [26], was also used to evaluate the drug-placebo differences within the 53 clinics. For the example given in the preceding paragraph, this test gave a Z value of -0.38, quite similar to the result based on the test for two proportions.
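A sketch of both clinic-level tests (hypothetical per-clinic event-rate differences; scipy's signed rank routine supplies the Wilcoxon statistic):

    from math import sqrt
    from scipy.stats import wilcoxon

    # Per-clinic event-rate differences, drug minus placebo (made up)
    diffs = [0.02, -0.01, 0.03, -0.04, 0.01, 0.00, -0.02, 0.05, -0.03, 0.02]

    n = len(diffs)
    p = (sum(d > 0 for d in diffs) + 0.5 * sum(d == 0 for d in diffs)) / n
    z_sign = (p - 0.5) / sqrt(0.25 / n)    # sign test statistic of the text
    print(round(z_sign, 2))

    # Wilcoxon signed rank test on the same differences (zeros dropped)
    stat, pval = wilcoxon([d for d in diffs if d != 0])
    print(stat, round(pval, 3))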
QUALITY ASSURANCE PROCEDURES FOR DATA ANALYSIS

Several procedures were followed in the CDP to assure a high level of quality of the data. Many of these are discussed in Chapters 12 (external quality control programs for the Central Laboratory, ECG Reading Center, etc.) and 6 (editing of data received at the Coordinating Center from the clinics and other centers). Quality assurance procedures do not end once the data are on the computerized data base. Many additional checking procedures must be followed to make certain that the data analyses have been carried out correctly and that the reports to the DSMC and the published papers present results accurately. Some of the procedures used in the CDP and other more recent studies to assure quality of data analyses and reports are described below.
Auditing of Analysis File Against Study Forms

In the CDP, as in many other clinical trials, a special data file for analysis was generated periodically from the master file. A procedure not used in the CDP, but followed to good advantage in other trials such as the Diabetic Retinopathy Study and the National Cooperative Gallstone Study, has been to select a sample of patient visits and compare, item by item, the responses given on the original study form with the corresponding data on the analysis file. This is helpful in identifying errors in the program that extracts the analysis file from the master file, as well as any systematic errors in updating the master file that may have previously escaped detection.
Generation of Patient Data Listings

One of the more useful programs written in the CDP generated for each patient a condensed, formatted printout of all baseline and follow-up values for most of the CDP variables. Included on this two-page printout were all cardiovascular events, biochemical, hematologic, and clinical measurements, summary of body systems, ECG findings, usage of nonstudy medications, prescription and adherence summaries for the study medication, reasons for reduced prescription, side effects, cause of death, survival or follow-up time, and other variables (Fig. 1). These printouts could be generated for all patients, patients in a particular treatment group, selected patients specified by ID number, or special subgroups of patients. These listings were used to help investigators ferret out confounding variables in cases of suspected treatment effects. Thus, for example, printouts were generated for all patients in the clofibrate and placebo groups ever having hematocrit below 36%. This was done for the purpose of investigating whether the apparent excess incidence of this event in the clofibrate group was correlated--temporally or otherwise--with other findings such as congestive heart failure, use of diuretics, depressed white blood count, or death, whether it involved a persistent versus an isolated occurrence, and whether it was reversed if the study drug was discontinued. In the process of using these printouts for such purposes, certain anomalies in the data that had slipped past the original computer edit were detected.
[Figure 1. Sample patient data listing used in the CDP.]
In some studies, such printouts have been sent back to the clinics, with omission of any information that might unblind the treatment group. For at least some of the patients the data on these printouts were compared to the data on the original records at the clinics and any discrepancies were brought to the attention of the Coordinating Center.
Generation of Point Frequency Distributions

As another preliminary step to data analysis, it is most helpful to obtain a point frequency distribution--i.e., a tabulation of the frequency of occurrence of every distinct value--for every variable on the analysis file. This will identify many types of anomalies in the data such as:

1. Illegal codes, such as a code 4 when only codes 1, 2, 3, and 9 have been defined.
2. Measurements given to more decimal places than provided by the measuring instrument.
3. Digit preferences, such as the tendency to read and record blood pressures to the nearest five or ten units.
4. Bimodality or other bizarre form of the distribution.
5. Outliers, i.e., extreme values distinctly separate from the rest of the distribution.
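Such tabulations take one line per variable with modern tools; a sketch (the file name and columns are hypothetical):

    import pandas as pd

    df = pd.read_csv("analysis_file.csv")   # hypothetical analysis file

    # Point frequency distribution for every variable: each distinct
    # value with its count, making illegal codes, digit preferences,
    # and outliers immediately visible.
    for col in df.columns:
        print(df[col].value_counts(dropna=False).sort_index(), "\n")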
Detection and Handling of Outliers

A useful method of detecting potential outlier observations has just been suggested in connection with point frequency distributions. Various statistical methods are available [27] for evaluating whether a given extreme value is indeed an outlier, i.e., not consonant with the rest of the data. Once an observation has been identified as an outlier, the first step is to go back to the original records and determine whether a recording or keying error was made. If such a value is verified as correct, then the question of whether or not to include the value in the data analysis depends upon the nature of each particular analysis. There is no reason to exclude the value if the analysis is a count of the number of patients having a value exceeding a given cut point. However, if means and standard deviations are being computed, and the outlier value is such that it could have an undue impact on the mean and standard deviation, then it should either be excluded or given a less extreme value (a procedure known by statisticians as Winsorization [27]) for purposes of the analysis.
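A minimal sketch of one classical form of Winsorization (the choice of k is the analyst's, not prescribed by the text):

    import numpy as np

    def winsorize(values, k=1):
        """Symmetric Winsorization: replace the k smallest values by the
        (k+1)th smallest and the k largest by the (k+1)th largest."""
        x = np.sort(np.asarray(values, dtype=float))
        x[:k] = x[k]
        x[-k:] = x[-k - 1]
        return x

    x = [4.1, 4.3, 4.2, 4.4, 4.0, 97.0]      # one wild value
    print(winsorize(x).mean(), np.mean(x))   # 4.25 vs. raw mean ~19.67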
Testing Analysis Programs

To make sure that a particular analysis program was picking up the correct variables from the analysis file or was correctly computing standard errors, test statistics, etc., such programs were often tested in the CDP by running them against a small subfile of 10 or 20 patients and independently producing the tabulations and statistical calculations manually from the original data.
Checking for Internal and External Consistency

The means, standard deviations, frequency distributions, and total numbers of observations included in a given report for the CDP DSMC were routinely checked for consistency with corresponding values given in previous reports. For the initial tabulations of distributions of baseline variables, the CDP findings were checked against those of comparable patient groups published for other studies. Within a report, the different tables, which may have resulted from a variety of analysis programs, were checked for consistency of denominators. A discrepancy of as little as one patient among the denominators in different tables triggered an extensive investigation into the cause, and sometimes proved to be the tip of the iceberg of a much larger problem.
Being Always Suspicious

The last point made above reflects the attitude that was instilled into the CDP statisticians and analysis programmers, namely, to be continually suspicious of the data. Any inconsistencies or deviations from expectation, no matter how slight, were thoroughly checked out. Considering the large number of different analysis programs--many with multiple versions--used in the CDP, the number of statisticians and programmers carrying out data analyses with these programs, and the fact that occasionally there was turnover of such personnel, there was ample opportunity for errors to be introduced through using the wrong version of a program, keying in wrong parameters for cut points, options, or variable codes, changing a definition in one program but not in others, and the like. An attitude of continual suspicion helped to counteract such problems.
Preparing Computer-Generated Reports

One potential source of error--i.e., transcription of data from computer output to handwritten table to typewritten report--was avoided to a large extent in the CDP by formatting the computer output in report-ready tables. For most pages in a DSMC report, only table and page numbers needed to be typed. The one-time extra programming effort in the formatting of tables that would appear routinely in DSMC reports was amply repaid by elimination of transcription errors and need for extensive typing and proofreading. However, it was generally not possible to generate tables by computer for manuscripts being prepared for publication. Careful proofreading was required in such instances.
SUMMARY OF METHODOLOGICAL PRINCIPLES AND LESSONS

1. The Cox life table regression method [8] is useful for testing the difference between two entire survival curves; however, other methods, such as the ones of Littell [7] or of Kaplan and Meier [11], are preferable for estimating the survival curves.

2. Among the several tests available for interaction between treatment group and baseline characteristic, the test based on the difference of logarithms of odds ratios was perhaps the most useful in the CDP. This test can be performed as part of a multiple logistic regression analysis with interaction terms. The corresponding test based on multiple linear regression analysis was also used in the CDP; however, in this author's judgment, the comparison of relative differences between treatment groups among different subgroups provided by the log odds ratio or logistic regression approach is generally more meaningful than the comparison of absolute differences provided by the linear regression approach.

3. Analyses of treatment differences in subgroups defined by characteristics measured during the follow-up period--such as adherence to the study medication regimen--should be interpreted with extreme caution or else avoided entirely.

4. Selection of baseline covariates for adjustment of treatment effects should be based on both the degree of maldistribution of a variable between the two treatment groups and the strength of association with the endpoint of interest. A useful algorithm employed by the CDP was to take a basic set of 54 variables and select in a stepwise fashion those variables producing the greatest amount of change (in either direction) in the treatment effect.

5. The use of dummy or indicator variables in designating missing data in multiple regression analyses is recommended.

6. It is most important to carry out a variety of quality assurance procedures in conjunction with data analyses to make certain that the proper data items have been extracted into the analysis file, the statistical analyses have been computed correctly, outlying observations have been detected and properly handled in the analyses, and results have been transcribed correctly from computer output to tables in reports and manuscripts.
REFERENCES
1. Yates F: Contingency tables involving small numbers and the χ² test. J R Statist Soc (Suppl) 1:217-235, 1934
2. Fisher RA: Statistical Methods for Research Workers, 5th ed. Edinburgh, Oliver and Boyd, 1934
3. Irwin JO: Tests of significance for differences between percentages based on small numbers. Metron 12:83-94, 1935
4. Grizzle JE: Continuity correction in the χ²-test for 2 × 2 tables. Am Statist 21:28-32, 1967
5. Mantel N, Greenhouse SW: What is the continuity correction? Am Statist 22:27-30, 1968
6. Conover WJ: Uses and abuses of the continuity correction. Biometrics 24:1028, 1968
7. Littell AS: Estimation of the T-year survival rate from follow-up studies over a limited period of time. Hum Biol 24:87-116, 1952
8. Cox DR: Regression models and life-tables. J R Statist Soc B 34:187-202, 1972
9. Coronary Drug Project Research Group: Clofibrate and niacin in coronary heart disease. JAMA 231:360-381, 1975
10. Canner PL: Monitoring treatment differences in long-term clinical trials. Biometrics 33:603-615, 1977
11. Kaplan E, Meier P: Nonparametric estimation from incomplete observations. J Am Statist Assoc 53:457-481, 1958
12. Coronary Drug Project Research Group: The Coronary Drug Project. Findings leading to further modifications of its protocol with respect to dextrothyroxine. JAMA 220:996-1008, 1972
13. Cochran WG: Some methods of strengthening the common χ² tests. Biometrics 10:417-451, 1954
14. Fleiss JL: Statistical Methods for Rates and Proportions. New York, Wiley, 1973
15. Mantel N, Brown C, Byar DP: Tests for homogeneity of effect in an epidemiologic investigation. Am J Epidem 106:125-129, 1977
16. Woolf B: On estimating the relation between blood group and disease. Ann Hum Gen 19:251-253, 1955
17. Walker SH, Duncan DB: Estimation of the probability of an event as a function of several independent variables. Biometrika 54:167-179, 1967
18. Coronary Drug Project Research Group: Practical aspects of decision making in clinical trials: The Coronary Drug Project as a case study. Controlled Clin Trials 1:363-376, 1981
19. Coronary Drug Project Research Group: Influence of adherence to treatment and response of cholesterol on mortality in the Coronary Drug Project. N Engl J Med 303:1038-1041, 1980
20. Draper NR, Smith H: Applied Regression Analysis. New York, Wiley, 1966
21. Freund RJ, Minton PD: Regression Methods: A Tool for Data Analysis. New York, Marcel Dekker, 1979
22. Coronary Drug Project Research Group: Factors influencing long-term prognosis after recovery from myocardial infarction--three-year findings of the Coronary Drug Project. J Chron Dis 27:267-285, 1974
23. Crocker DC: Some interpretations of the multiple correlation coefficient. Am Statist 26:31-33, 1972
24. Coronary Drug Project Research Group: Aspirin in coronary heart disease. J Chron Dis 29:625-642, 1976
25. Coronary Drug Project Research Group: The prognostic importance of plasma glucose levels and of the use of oral hypoglycemic drugs after myocardial infarction in men. Diabetes 26:453-465, 1977
26. Siegel S: Nonparametric Statistics for the Behavioral Sciences. New York, McGraw-Hill, 1956
27. Barnett V, Lewis T: Outliers in Statistical Data. New York, Wiley, 1978