C. R. Rao and R. Chakraborty, eds., Handbook of Statistics, Vol. 8 © Elsevier Science Publishers B.V. (1991) 125-144
J
Statistical Design and Analysis of Epidemiologic Studies: Some Directions of Current Research
Norman
Breslow
1. Introduction Epidemiology is the branch of public health science concerned with the occurrence of disease in populations and how that occurrence is related to genetic and environmental risk factors. Epidemiologic research influences many health promotion and disease prevention practices, such as the personal decision to quit smoking or the societal decision to regulate emissions of ionizing radiation. Statisticians have played a major role in the development of epidemiology as a science, and many of its fundamental concepts and methods rest heavily on those of statistics. This paper reviews several recent developments in statistical methodology that were either motivated by, or have major applications to, problems in epidemiology. It is divided into two sections, Design and Analysis.
2. Design Classical epidemiologic designs for studying disease/risk factor associations at the level of the individual are the cohort design and the case-control design. With the cohort design, groups of subjects defined often on the basis of a particular exposure are followed forward in time to ascertain disease specific incidence or mortality rates. A famous example is the cohort of survivors of the atom bomb blasts in Hiroshima and Nagasaki (Beebe, Ishida and Jablon, 1962). With the case-control design, cases of a particular disease are ascertained as they arise in a defined population or seek hospital treatment and their exposure histories, obtained retrospectively by interview or other means, are compared to those of control subjects who are sampled from the population 'at risk'. Statistical methods for the design and analysis of cohort and case-control studies are discussed at length by Breslow and Day (1980, 1987). Supported in part by USPHS Research Grant No. CA40644. This work is based on an invited address at the Symposium on the Future of Theory and Methods in Epidemiology, held during the 1987 meeting of the International EpidemiologicalAssociation in Helsinki. 125
126
N. Breslow
At the 1981 Symposium on Epidemiology in Occupational Health, Miettinen (1982, 1985) put forward two new design concepts for epidemiologic research that tended to blur the distinction between cohort and case-control studies and that were remarkably prophetic of future developments. First, as an alternative to the usual sample of disease-free controls, he proposed a sample drawn drom the study 'base' without regard to the disease outcome. Second, for situations where expensive data acquisition remains to be done after ascertainment of disease and exposure status, he suggested that subsamples be selected with regard to both characteristics. In the intervening years, considerable work has been done on these two design concepts and associated methods of data analysis, so that today they are ready for full exploitation. Both should see substantial use in future years. 2.1. A numerical example
Table 1 presents a ficticious numerical example that will help to fix ideas. Part A shows the exposure distribution of a cohort of 10200 subjects identified as to whether or not they develop the disease of interest during the study period. For concreteness, and also to emphasize that we have in mind a follow-up study of persons who are initially disease-free, we may think of asbestos insulation workers kept under surveillance for lung cancer for 10 years starting from the 20-th anniversary of their employment. The odds ratios indicate that moderately exposed workers have lung cancer rates about 3 times those of the lightly exposed, whereas for the heavily exposed this ratio is closer to 9. These ratios could be interpreted as incidence rate ratios if the entries in the row labeled 'non-cases' were exactly proportional to the 'person-years' of observation contributed by members of the cohort and likewise if the control sample in the case-control study was drawn using 'incidence density' sampling (Greenland and Thomas, 1982; Miettinen, 1976). However, this refinement is unnecessarily complicated for present purposes. Table I(B) shows the expected exposure distribution of controls, selected as a 10~o random sample of the non-cases, that are to be used with all available cases in a classical case-control study. Such designs are well known to produce unbiased estimates of the odds ratios that would be observed if the entire cohort were followed up (Cornfield, 1951). Multiple logistic regression analysis is used to estimate exposure odds ratios that are adjusted for covariables (Breslow and Day, 1980). Case-control sampling from cohort data files has been suggested as a means of reducing the computational burden of the analysis, or reducing the number of subjects for whom covariable information, for example on smoking, needs to be collected (Mantel, 1983). For a time-dependent analysis, a few controls are selected randomly from among those who are disease-free at the time each case is diagnosed. 'Risk sets' consisting of a case and its time matched controls are analyzed by conditional logistic regression so as to approximate the results that would be obtained from a proportional hazards regression analysis of data from the entire cohort (Breslow and Day, 1980; Liddel et al., 1977; Prentice and Pyke, 1979).
Statistical design and analysis of epidemiologic studies
127
Table 1 Design options in epidemiology a Exposure
LO
MED
HI
TOT
50 150 200 8.6
550 9650 10200
50 50 8.6
550 965
40 + 10 300 40 8.6
440 + 110 1930 2040
(A) Full cohort (base) Cases Non-cases Total Odds ratio
300 7700 8000 1.0
200 1800 2000 2.9 (B) Case-control sample
Cases Controls b Odds ratio b
300 770 1.0
200 180 2.9 (C) Case-cohort (case-base) sample
Cases b Controls b Sub-cohort b Odds ratio b
240 + 60 1540 1600 1.0
160 + 40 360 400 2.9
(D) 'Balanced' two-stage sample (selectivity in both series) Cases b Controls b Odds ratio b
50 150 1.0
50 150 1.0
50 150 1.0
150 450
a After Miettinen (1982). b Expected values.
2.2. The case-cohort design Table I(C) shows the expected exposure distributions for an alternative type of study design termed by Miettinen (1982) a 'case-base' design and by Prentice (1986) a 'case-cohort' design. Here, a subcohort selected without regard to the disease outcome is used to estimate the exposure distribution in the cohort and thus the denominators of the disease rates at each level of exposure. Returning to the example, among those in the 20~o subcohort sampled at random from the 10200 insulation workers we expect 40 to be heavily exposed to asbestos. Ten will develop lung cancer. Forty additional lung cancer cases will be ascertained from among the 160 heavily exposed workers who are not in the subcohort. Unbiased estimates of the exposure odds ratios result from a comparison of exposure levels for the non-cases in the subcohort with those of the cases in the entire cohort. Table I(D) illustrates a 'balanced two-stage' design where both disease and exposure data are used to select the subsamples. All the cases and all of the heavily exposed are selected in order to increase the information available about
128
N. Breslow
the risk of high exposure. Statistical analyses of data from this design clearly need to be adjusted for the biased sampling, however, in order to estimate the odds ratios of interest. We return to this design later.
2.3. Advantages of the case-cohort design The case-cohort design offers certain advantages over the case-control design as a technique for reduction of the number of subjects for whom covariable data are collected or processed. For time-matched analyses, a subcohort member is included in the risk set for all cases who are diagnosed while he is under observation. Thus, he may serve as a control for several cases. Consequently, the sizes of the risk sets are generally larger than for a time-matched case-control study (Breslow and Day, 1980), even though the total number of subjects are the same. The regression coefficients of interest are estimated with greater precision. However, the statistical analysis is more complicated due to the need to account for correlations between the individual risk set contributions to the estimating equations (Prentice, 1986). Furthermore, if it turns out that only a few subcohort members are still under observation at the later times when most cases of disease occur, the risk sets for such cases may be reduced in size with a consequent loss of efficiency. (See note added in proof.) Since no knowledge of their disease outcome is required, subcohort members may be selected at the outset of the study. When implemented in the context of a dietary intervention trial, for example, this means that biochemical assays of serum samples may be carried out at once for the subcohort in order to monitor the effectiveness of and compliance with the intervention. Similar assays are made of the stored serum for all cases as they arise and the resulting data are combined with those from the subcohort for statistical analysis. Clearly, one needs to be very sure that assay results for samples stored several years are equivalent to those conducted with fresh material; otherwise, the results may be biased. Table 2 illustrates a case-cohort design used to study a possible association between external radiation exposure and lung cancer in nuclear energy workers (Peterson et al., 1987). The investigators identified more than 5000 white male operations employees who had been monitored for external radiation for at least three years and who had terminated employment in 1965 or later, when questions about tobacco use became a routine part of the periodic medical exam. Eighty-six lung cancer deaths were ascertained through records of the Social Security Administration and state and national death registers. The subcohort consisted of 455 men, stratified by year of birth in rough proportion to the numbers of lung cancer deaths. Detailed time-dependent smoking histories were abstracted from the medical records. In spite of this effort, it turned out that adjustment for tobacco use had little effect on the patterns of lung cancer risk across radiation exposure groups. There was a slightly elevated lung cancer risk among the moderately exposed, but a non-significant negative trend overall.
129
Statistical design and analysis of epidemiologic studies
Table 2 Case-cohort study of lung cancer and radiation exposure in nuclear workersa Period of birth
Lung cancer deaths
Subcohort members Selected Eligible
All
Lung cancer
1900-1904 1905-1909 1910-1914 1915-1919 1920-1924 1925-1929 1930-1934
19 18 27 13 6 2 1
413 688 949 938 1012 922 523
95 95 135 70 30 10 10
6 3 4 0 0 0 0
1900-1934
86
5445
455
13
a Source: Peterson, Gilbert, Stevens and Buchanan (1987).
2.4. The two-stage design
One serious drawback to both case-cohort and case-control designs is that they ignore data on exposure and-disease for subjects not included in the subsamples selected for covariable ascertainment. In the preceding example, radiation dosimetries were available but not used for 4917 nuclear energy workers who did not die from lung cancer and who were not selected for the subcohort. Such designs also may be inefficient because they fail to use all the subjects who have experienced a rare exposure (see Table 1). One needs 'selectivity in both series' so that all 'rare' subjects, whether cases or exposed, are used in the analysis (Miettinen, 1982). White (1982) and Walker (1982) published separate accounts of how to combine the 'first stage' data on exposure and disease, i.e. data available from the entire cohort (Table I(A)), with the covariable data collected from a balanced second stage subsample (Table I(D)) so as to estimate covariable adjusted odds ratios relating disease and exposure. White also derived a variance for the adjusted odds ratio that correctly accounts for sampling errors in the first stage data. However, both papers were limited to the simple situation of sampling from a fourfold table (case-control vs. exposed/nonexposed) and to covariables having a small number of discrete values. Breslow and Cain (1988) recently developed the modifications of logistic regression needed to analyze data from samples that are stratified according to disease and any number of exposure levels, using model equations containing discrete and continuous exposure variables, covariables and interaction terms. Theirs is an extension of the work of econometricians Manski and McFadden (1981) on 'choice based', i.e. outcome selective, sampling for binary response analysis. Breslow and Cain show how to adjust output from standard logistic regression
130
N. Breslow
analysis of the second stage data so as to account both for the bi~ed sampling and for the extra data available from the first stage sample. For the situations considered by White and Walker, in which exposure is represented as a single binary variable, one simply multiplies the adjusted odds ratio from the logistic analysis by the reciprocal of the cross-product ratio of the sampling fractions to adjust for the bias. More precisely, let us denote by N~ and N~o the numbers of exposed and non-exposed cases in the cohort as a whole and by No~ and Noo the corresponding numbers of non cases. Suppose that the sizes of the subsamples drawn from each of these four categories are n~l , n~o, nol, and noo, and further that the regression coefficient for the binary exposure variable in the logistic regression model fitted to the subsampled data is ~. Then the adjusted estimate of the log odds ratio is ~aoj = ~ + log (N~lN°°nl°n°~ l .
\Nlo Nol nl I noo/ Similarly, if ~.2 is the standard variance estimate for /~ obtained from the program, then aaaj = &z ~_
E( (5 (1 lo)( o 2)] _
n
+
1
o
+
+
1
1
0
is the variance estimate for /~aa~ that accounts for the first stage data. When the exposure effects are modelled continuously, the adjustment involves more complicated matrix manipulations (Breslow and Cain, 1988). 2.5. Analyses of the Framingham data
Tables 3-6 illustrate the application of two stage logistic regression methodology to 18 years follow-up data from Framingham (Kannel and Gorton, 1974), the cohort study of a small Massachusetts town that first identified several important risk factors of coronary heart disease (CHD). A public use'tape kindly supplied by the National Center for Health Statistics contained data for 5052 individuals with no prior CHD diagnosis who had known values recorded at the initial examination for height, weight, systolic blood pressure, age and cigarette smoking. Their distribution by age and sex is shown in Table 3(A). For purposes of this illustrative exercise, age and sex were regarded as the 'exposure' whereas relative weight, blood pressure, etc. were regarded as the 'covariables'. Several different sampling schemes were used" to reduce the size of the dataset for efficient computer analysis. First, two samples of 673 controls were selected for comparison with the 673 cases of CHD, one a simple random sample and the
Statistical design and analysis of epidemiologic studies
131
Table 3 Sampling from the Framingham cohort Males Age:
30 -
40 -
Females 50 -
30 -
Total
40 -
50 -
75 870 945
154 687 841
673 4379 5052
136 75
11 154
673 673
19 22
37 25
149 149
25 25
25 25
149 149
(A) Entire cohort CHD No CHD Total
73 742 815
158 603 761
189 481 670
24 996 1020
(B) Control samples: all available cases Random Stratified
113 73
80 158
76 189
151 24
(C) Standard case-control sample Cases Controls
18 27
25 25
45 16
5 34
(D) Balanced two stage sample Cases Controls
25 25
25 25
25 25
24 24
other stratified by age and sex (Table 3(B)). A d d i t i o n a l simple r a n d o m samples o f 149 cases and 149 controls were d r a w n from the cases and non cases (Table 3(C)) to illustrate the p r o b l e m s with a 'small' case-control study. Finally, in hopes o f d e m o n s t r a t i n g the increased efficiency o f the b a l a n c e d design, two m o r e samples o f 149 cases and 149 controls were selected using stratification by age and sex (Table 3(D)). Logistic regression m o d e l s were fit to the entire cohort of 5052 men and w o m e n in o r d e r to establish a reference point for c o m p a r i s o n with results o b t a i n e d from the various case-control samples (Table 4). Except for the c o n s t a n t term and a binary indicator o f female sex, all variables were c o n t i n u o u s . M o s t were centered about m o d a l values and then scaled so that the regression coefficients would be o f c o m p a r a b l e magnitude. N o t e that the regression coefficient for age is r e d u c e d by inclusion o f the covariables, indicating that some o f the age effect is m e d i a t e d by the association o f age with high b l o o d pressure a n d other risk factors. M e a s u r e m e n t s o f serum cholesterol were missing for over one-third o f the initial cohort, and a binary indicator o f serum availability was a d d e d to the m o d e l shown in Table 4(C) to account for this. T h o s e for w h o m b l o o d s were taken h a d a lower C H D risk. T h e examples below use only the five variable m o d e l (Table 4(B)) so as to avoid the complication o f missing values. Results for the seven variable m o d e l were similar.
132
N. Breslow
Table 4 Logistic regression fits to the entire cohort a Variable
Coefficient
Standard error
(A) Two-variable model CONSTANT SEX ( F ) (AGE-50)/10
- 1.129 - 0.894 0.784
0.058 0.087 0.052
(B) Five-variable model CONSTANT SEX ( F ) (AGE-50)/10 (FRW- 100)/100 (SBP-142.6)/100 CIGS (PACK)
- 1.420 - 0.848 0.677 0.950 1.558 0.381
0.095 0.097 0.057 0.266 0.181 0.073
(C) Seven-variable model CONSTANT SEX ( F ) (AGE-50)/10 (FRW-100)/100 (SBP-142.6)/100 CIGS (PACK) SC AVAILABLE (SC-227.3)/100
- 1.217 0.906 0.651 0.930 1.543 0.381 - 0.330 0.634
-
0.095 0.098 0.057 0.270 0.185 0.073 0.090 0.122
a Key: FRW = Framingham relative weight, SBP = systolic blood pressure (mmHg), and SC = Serum cholesterol (mg/100 cc).
Table 5 presents logistic regression fits to d a t a for all 673 C H D cases and an equal n u m b e r o f controls. The ' u n a d j u s t e d ' entries are s t a n d a r d p r o g r a m output for the 1346 o b s e r v a t i o n s ; the ' a d j u s t e d ' coefficients a n d s t a n d a r d errors account for the extra d a t a shown in Table 3(A). N o a d j u s t m e n t is needed for the r a n d o m case-control sample so far as validity o f the relative risk estimates is concerned (Breslow and Day, 1980). By taking account o f the Table 3(A) data, however, we are able to correct the c o n s t a n t term and to m a k e the estimates o f age and sex effects more precise. N o t e that the adjusted s t a n d a r d errors for the 'exposure' variables are quite close to those for the entire cohort (Table 4(B)) and that the coefficient for sex, at least, is m o v e d substantially closer to its cohort value o f - 0 . 8 5 . The adjustment has negligible effect on the coefficients and s t a n d a r d errors for the covariables, information about which is exclusively contained in the case-control sample. ( H a d age and sex been analyzed as discrete variables with interactions, the a d j u s t m e n t would have no effect at all on the covariables (Breslow and Cain, 1988).) O n the other hand, a d j u s t m e n t is clearly needed for the stratified case-control sample (Table 5(B)) in o r d e r to account for the fact that
Statistical design and analysis of epidemiologic studies
133
Table 5 Logistic regression fits to the second stage data: all available cases and an equal number of controls (N = 1346)a Variable
Unadjusted Coefficient
Standard error
Adjusted Coefficient
Standard error
(A) Random control sample CONSTANT SEX AGE FRW SBP CIGS
-
0.51 1.00 0.69 1.25 1.49 0.36
0.12 0.14 0.08 0.38 0.27 0.11
- 1.39 - 0.91 0.73 1.26 1.50 0.36
0.09 0.11 0.07 0.38 0.27 0.11
- 1.46 - 0.79 0.65 0.81 1.57 0.42
0.08 0.11 0.07 0.36 0.25 0.10
(B) Stratified control sample CONSTANT SEX AGE FRW SBP CIGS
- 0.27 0.05 0.00 0.68 1.50 0.38
0.10 0.13 0.08 0.36 0.25 0.10
" See Table 4 for key. Variables are centered and scaled as shown there.
cases and controls were b a l a n c e d on age and sex. The change in the covariable coefficients is m o r e p r o n o u n c e d here also. Table 6 presents analogous results for the smaller case-control samples (Tables 3(C) a n d 3(D)). N o n e o f the 'covariable' effects are well estimated here and those for relative weight and smoking even fail to d e m o n s t r a t e statistical significance. The s t a n d a r d errors for age a n d sex are r e d u c e d substantially by the adjustment, but d o not n o w a p p r o a c h those o b t a i n e d from the full cohort analysis (Table 4(B)). Nevertheless, the age and sex effects are r e a s o n a b l y well estimated by the adjusted coefficients.
2.6. Relative efficiency A s o m e w h a t surprising feature of Table 6 is that the adjusted s t a n d a r d errors from analysis o f the b a l a n c e d sample are little if any smaller than those for the r a n d o m sample. The anticipated gains from stratification are not realized in this example because there are no 'rare' categories o f 'exposure'. The two sexes are equally represented a n d there is sufficient variation in age so that the linear age term is well determined even from the r a n d o m subsamples. Table 7 presents relative efficiencies, i.e. ratios o f asymptotic variances, for regression coefficients estimated from b a l a n c e d vs. u n b a l a n c e d samples for a simple m o d e l with one binary exposure, one binary covariable that m a y c o n f o u n d
134
N. Breslow
Table 6 Logistic regression fits to the second stage data: case-control samples (N = 298)a Variable
Unadjusted Coefficient
Standard error
Adjusted Coefficient
Standard error
(A) Random samples of cases and controls CONSTANT SEX AGE FRW SBP CIGS
-
0.58 0.69 0.70 0.36 2.11 0.14
0.25 0.28 0.16 0.80 0.72 0.23
- 1.18 - 0.95 0.70 0.33 2.15 0.19
0.18 0.16 0.09 0.81 0.71 0.23
(B) Balanced samples of cases and controls CONSTANT SEX AGE F R W
SBP CIGS
- 0.18 0.14 - 0.14 0.12 2.58 0.26 -
0.23 0.26 0.16 0.78 0.61 0.23
- 1.28 - 0.92 0.68 - 0.07 2.55 0.27
0.16 0.16 0.11 0.78 0.62 0.23
a See Table 4 for key. Variables are centered and scaled as shown there.
the disease/exposure association and their interaction (Breslow and Cain, 1988). It is a s s u m e d that only 5 ~o o f the population is exposed, that exposure carries a relative risk of 2, and that the first stage sample is so much larger than the second that only sampling errors from the latter need be accounted for. The table entries show that, when the exposure is rare, the b a l a n c e d design offers substantial advantages for estimation o f exposure effects in the presence o f a strong confounder, at the cost o f a slight reduction in efficiency for estimation o f the confounder effect. The b a l a n c e d design is much m o r e efficient for estimation of interaction effects. 2.7. Summary In summary, two new strategies for subsampling and analysis o f d a t a from epidemiologic studies have been developed that should see increasing use in future years. The c a s e - c o h o r t design is particularly well suited to prospective intervention trials where it is desired to m o n i t o r covariables on a sample while storing biological specimens or other materials from the r e m a i n d e r o f the cohort for later processing. The two-stage design is especially a p p r o p r i a t e for retrospective cohort studies where d a t a on primary exposures and disease o u t c o m e are already available and it remains to collect the covariable information.
Statistical design and analysis of epidemiologic studies
135
Table 7 Relative efficiencies of balanced vs. 'case-control' samplinga RR of confounderb
Odds ratio linking exposure & confounder
Relative efficiencyfor estimating Exposure e~ct
Con~under e~ct
Interaction
(A)'Common'disease:casesandcontrolsbalancedonexposure
0.2
0.2 1.0 5.0
1.4 4.4 3.6
0.7 1.0 1.5
4.4 3.5 3.3
1.0
0.2 1.0 5.0
0.7 1.0 1.1
0.7 1.0 1.1
5.8 4.1 3.9
5.0
0.2 1.0 5.0
1.8 3.7 1.4
0.8 1.0 0.8
6.5 4.1 3.4
(B)'Rare'disease:allavmlablecasesused
0.2
0.2 1.0 5.0
1.0 1.2 1.3
0.8 0.9 0.9
1.4 1.5 1.7
1.0
0.2 1.0 5.0
1.0 1.0 1.0
0.7 0.8 0.8
2.3 2.1 2.1
5.0
0.2 1.0 5.0
1.1 1.2 1.0
0.8 0.8 0.8
2.9 1.9 1.7
a Source: Breslow and Cain (1988).
b Relative risk (RR) for a confounder that affects 30% of the population.
3. Analysis The computer developments of the past decade have revolutionized the methods of data analysis used by professional statisticians. Powerful desktop workstations are n o w available at relatively low cost that facilitate real time graphical exploration of moderately large databases. F o r m a l inference protocols, where test statistics derived from regression coefficients are referred to tables of the chi-square distribution, are giving way to n o n p a r a m e t r i c estimation procedures in which statistical significance is evaluated by repeated computer sampling from the database itself. These developments are starting to impact data analysis in epidemiology a n d in other fields where statistical methods play an important role. Two examples will serve to illustrate current trends.
136
N. Breslow
3.1. The generalized additive model Hastle and Tibshirani (1986, 1987) developed a generalization of linear regression models of the form y=
O~-[- f l l X 1 q- fl2X2 + "'" ~- [~pXp "t-E,
in which the regression variables x with unknown coefficients fl are replaced by smooth functions s = s(x). Thus, the model equation is given as y = ~ + s,(xl)
+ s2(x2) + " ' " + s p ( x p ) + E.
This flees the statistician from the constraints of a linear dependence of y on each x. However, the contributions to the regression equation from different x's are still additive. Formerly, it was possible to relax the linearity assumption by replacing each linear term with a low degree polynomial in x or a parametric non-linear function. The advantage of the new methods is that no such specification of functional form is needed. The goal is to let the data 'speak for themselves'. Various 'smoothers' are available to estimate the s functions. Hastie and Tibshirani recommend a local linear smoother. If only a single x variable is involved, this means estimating s at x by the linear regression of y on x using only observations (Yi, xi) for which xi is close to x. Multiple regression variables are accounted for in the usual fashion by regressing the residuals involving those terms that are already in the model on each added x variable. The width of the neighborhoods are determined by a parameter known as the span. For a span of 0.1, the linear regression at x uses only 10~o of the data points, the 5~o with nearest xi on either side of x. A span of 1.0 corresponds to simple linear regression. By varying the span in the range from 0 to 1, one can control the degree of smoothness. Objective selection of the 'optimal' span for each x is possible using the criterion of 'cross-validation'. One minimizes the prediction error when each y observation is estimated from its associated x's, using a regression equation determined from the rest of the data (Stone, 1974). In practice, however, it is sometimes desirable to consider a variety of different spans and to pick one that appears to give a 'reasonable' picture of the relationship. There is the usual tradeoff between excess variance if the span is too small so that the estimated regression function is too wiggly, and excess bias if the span is too large so that the true curvature in the relationship is lost. Hastie and Tibshirani have extended this approach for use with generalized linear models (McCullagh and Nelder, 1983) fitted by maximum likelihood, of which the paradigm is logistic regression. Using their program, we fit models with varying spans to the 673 cases and 673 randomly sampled controls from Franlingham (Table 3(B)). Separate fits were made for males and females that included age, relative weight, blood pressure and cigarettes as regression variables (see Table 4(B)). Figures 1 and 2 contrast the 'smooths' estimated using spans of 0.1, 0.3 and 0.5 with the lines whose slopes were determined by the usual logistic
Statistical design and analysis of epidemiologic studies
137
SYSTOLIC BLOOD PRESSURE
AGE
J J
i
30 35 40 45 50 55 60 65
FRAMINGHAM RELATIVE WEIGHT
60
80
I
I
100
120
50
100
150
200
250
300
CIGARETTES PER DAY
I 140
160
0
10
20
30
40
50
60
Fig. 1. Smooth regression functions estimated by the generalized additive model for Framingham males with spans of 0.1 (lowest curve in each panel), 0.3 (middle curve) and 0.5 (upper curve).
regression approach. Both are normalized so that the estimated values, s(x) or fix, sum to zero over all subjects. For most variables, there is remarkably little evidence for a departure from linearity. A possible exception is the relation of C H D risk to cigarette consumption. There is a slight decline in risk from nonsmokers to very light smokers, especially for females, with a gradual rise in risk thereafter. For men, there is a suggestion that the C H D risk may stop rising after 1-2 packs/day. Substantial work remains to be done on statistical inference procedures in connection with these new methods. H o w are we to decide, for example, whether the simple linear model for cigarettes provides an adequate description of the data in comparison with the generalized additive fit with span = 0.17 Hastie and Tibshirani provide only a partial answer. Using the 'deviance' (twice the log likelihood ratio) as a measure of the discrepancy between the fitted model and the observed data, they note that this usually has a chi-square distribution with degrees of freedom equal to n - p, the number of subjects minus the number of
138
N. Breslow AGE
SYSTOLIC BLOOD PRESSURE
J °~"
r
25
35
45
55
65
FRAMINGHAM RELATIVE WEIGHT
50
I
i
100 150 200 250 300
C I G A R E T T E S PER D A Y
f
.....~
50
100
150
200
250
0
10
-
~
20
30
40
Fig. 2. Smooth regression functions estimated by the generalized additive model for Framingham females with spans of 0.1 (lowest curve in each panel), 0.3 (middle curve) and 0.5 (upper curve).
fitted parameters. For the generalized additive model, they estimate an 'effective number' of parameters from the trace of the 'projection' matrix at the final stage of iteration. Table 8 summarizes the results of an 'analysis of deviance' based on this concept. Both theoretical and empirical calculations suggest that the expected difference in deviances for two nested models is approximately equal to the difference in the corresponding values of p. However, the variability of the deviance difference is often greater than twice the p difference. One therefore should resist the temptation to refer such deviance differences to tables of chi-square. It usually will be necessary to resort to resampling methods (Efron and Tibshirani, 1986) in order to accurately assess the significance of the observed results. In fact, the greatest danger in these new methods lies in the temptation to overinterpret insignificant features of the fitted curves. While they offer great power and flexibility, they also pose a trap for the unwary.
Statistical design and analysis of epidemiologic studies
139
Table 8 Analysis of deviance: goodness-of-fitstatistics for the models shown in Figures 1 and 2 Span
0.1 0.3 0.5 1.0 (linear)
Males
Females
Deviance
pa
Deviance
p
770.5 812.5 824.0 833.1
47.1 17.3 11.8 5
702.2 732.8 744.7 755.2
47.1 17.0 11.5 5
p = effectivenumber of parameters used in fitting model.
3.2. Nonparametric estimation of the SMR Our second example of graphical analysis of epidemiologic data concerns a procedure for nonparametric estimation of the standardized mortality ratio (SMR) as a continuous function of age or time (Andersen et al., 1985; Breslow and Day, 1987; Breslow and Langholz, 1987). In occupational mortality studies, it is useful to calculate the SMR (defined as the ratio of the observed number of cases to the number expected from standard rates) according to time since initial employment, time since termination of employment, duration of employment or other time-varying factors. One hopes thereby to answer questions about the latent interval between exposure and disease, about the attenuation of risk following cessation of exposure, or even about the biological mechanisms involved in the disease process. Similarly, by examining changes in the SMR according to age, calendar time or other variables involved in its definition, one can examine whether the assumption of a constant risk ratio that implicitly underlies its use as a summary measure is well founded (Breslow and Day, 1985). As an example, Table 9 presents observed and expected numbers of respiratory cancer deaths, broken down according to several fixed and time-varying factors for a cohort of 8014 Montana smelter workers exposed to airborne arsenic (Breslow, 1985; Lee-Feldstein, 1983). Workers were first employed between 1885 and 1956. The follow-up period was 1938-1977 and a worker was entered on study if still employed in 1938 or upon the completion of one year of employment if hired thereafter. Expected rates were based on U.S. national mortality rates for white males, specific for age and calendar year of follow-up. Workers employed prior to 1925, when changes in the smelting process substantially reduced the concentrations of airborne arsenic, had respiratory mortality rates approximately 3.6 times expected. For those employed later the ratio was only 1.6. There were substantial variations in the SMR according to birthplace, duration of time since first employment, level of arsenic exposure and calendar year of follow-up (Breslow, 1985). Interpretation of these results, however, should take account of the strong correlations between the risk factors. Since follow-up did not commence until
140
N. Breslow
Table 9 Variations in respiratory cancer SMR's for Montana smelter workers a Factor
Level
Number of deaths Observed
Expected b
SMR
Year of employment
1885-1924 1925-1955
115 161
31.8 98.2
3.62 1.64
Birthplace
U.S. Foreign
198 80
110.0 21.0
1.80 3.81
Years since first employed
1-14 15-29 30 +
101 59 116
61.2 31.9 36.8
1.65 1.85 3.15
Arsenic exposure
Light Moderate Heavy
153 91 32
95.7 26.9 7.4
1.60 3.39 4.34
Year of follow-up
1938-1949 1950-1959 1960-1969 1970-1977
34 65 94 83
8.4 22.1 44.6 54.9
4.03 2.94 2.11 1.51
a Source: Breslow (1985). b From U.S. national rates for white males, accounting for age and calendar year of follow-up.
1938, for example, nearly everyone hired prior to 1925 contributed person-years o f observation only to the later categories o f years since initial employment. W h e n the change in the S M R according to time since hire is adjusted for the effect of period o f hire in a P o i s s o n regression analysis, it is no longer statistically significant (Breslow, 1985). Recent developments in techniques o f survival analysis facilitate the estimation of the S M R as a continuous function o f a continuous time variable, without the need for arbitrary grouping o f each factor into discrete intervals. Briefly, s u p p o s e that 2 ( 0 denotes the death rate for a r a n d o m l y selected c o h o r t m e m b e r at time t, and that 2*(0 denotes the expected rate b a s e d on his age and the c a l e n d a r year at t. The function we w a n t to estimate is the ratio O(t) = 2 ( 0 / 2 * ( 0 . F o r technical reasons, it is easiest first to estimate the cumulative S M R O ( t ) = ~ o O ( S ) d s using the formula
&t) = Z
4
',<' Ej where d; denotes the n u m b e r o f deaths that occur at tt and Yj(t;) denotes whether (Y = 1 ) o r not (Y = 0) the j - t h cohort m e m b e r is under observation at that time. Thus, O is analogous to the s t a n d a r d n o n p a r a m e t r i c estimator o f cumulative
Statistical design and analysis of epidemiologic studies
141
mortality, except that the denominator summands, usually the total numbers of deaths at ti, are replaced by the expected numbers based on standard rates. The graph of O(t) for the Montana cohort, with t equals time since initial employment, is shown in Figure 3. Since most interest is in the slope of the curve, which appears to increase markedly some 30-40 years since initial employment, this graph is difficult to interpret in quantitative terms. Using 'kernel' methods (Ramlau-Hansen, 1983; Yandell, 1983), it is possible to derive a smooth estimator O(t) from the cumulative. The degree of smoothness depends on a bandwidth b such that jumps in O at observed times of death ti in the interval (t - b, t + b) contribute to the estimation of 0 at t. Figure 4 shows two smoothed estimates
160 ~120 o rr
~
80
~ 4o ° 0
0
/ /
I
10
r
20
t
/
t
30 40 Years employed
I
50
60
I
70
Fig. 3. Cumulative respiratory cancer SMR by years since initial employment for Montana smelter workers (reproduced from Breslow and Langholz (1987)). 3.5i
.......... IoY~SI rb~S~di~dilhth
///~- ....
3.01
2.5 rr / //i
2.0
1.5
1.0
1~0
2JO
3'0 4JO Years employed
5'0
6~0
Fig. 4. Smooth estimates of the respiratory cancer SMR by years since initial employment for Montana smelter workers: comparison of five- and ten-year bandwidths (reproduced from Breslow and Langholz (1987)).
N. Breslow
142
derived from Figure 3, with bandwidths of 5 and 10 years. These confirm that the SMR rises to a peak of nearly 3.5 some 40 years or so from the date of first employment. Grouped data analysis demonstrated that the effect on the SMR of time since initial employment is largely due to confounding by period of employment (Breslow, 1985). A similar covariable adjustment of the nonparametric estimate may be developed from an extension of the basic model equation, viz.
2(tlz(t)) = O(t) ).*(t) exp {/~z(t)} , where z is a vector of (possibly time-dependent) covariables and/~ is a vector of regression coefficients. This has the same structure as Cox's (Cox, 1972) famous proportional hazards model, except that the unknown 'background' rates are expressed as the product 0(02*(0 of known standard rates times a nonparametric relative risk function (Andersen et al., 1985). By incorporating log2*(t) = Zo(t) into the regression equation as a covariable with known coefficient /~o = 1, standard programs may be used to obtain estimates of/~ and 0(t). Figure 5 presents a comparison of the unadjusted estimate of/~(t), 1 and an estimate adjusted via the proportional hazards model for period of employment, birthplace and number of years worked in areas of the smelter with potential for heavy or moderate arsenic exposure. The difference between the two is striking. Instead of providing evidence for a possible latent interval between onset of exposure and respiratory cancer effect, as might be suggested by Figure 4, the covariable adjusted analysis shows a constant or even slightly declining SMR. For cohort members born in the U.S., hired in 1925 or later and who were accumulat3.5
3.0
. . . . . . . . . Adjusted
/
/
G9
~
\
2.0 1.5
'\
_
,
/'-.
1.0 0.5
0
k. .-
"'--J
"". . . . - --j
"""-,
I
L
i
i
I
10
20
30 Years employed
40
50
"", "'-
I
60
Fig. 5. Respiratory cancer S M R by year since initial employment for M o n t a n a smelter workers with and without covariable adjustment (reproduced from Breslow and Langholz (1987)).
t Obtained using a slightly different smoothing technique than that described earlier.
Statistical design and analysis of epidemiologic studies
143
ing no heavy or m o d e r a t e arsenic exposure, there is little evidence that respiratory cancer rates were elevated over b a c k g r o u n d .
3.3. S u m m a r y
In summary, c o m p u t a t i o n a l l y intensive, graphical m e t h o d s o f d a t a analysis are now available that free the epidemiologist from the constraints i m p o s e d by linear models and by the grouping o f d a t a into arbitrary categories. M u c h w o r k remains to be done on a s s o c i a t e d statistical inference procedures, however, so that a p p a r ently interesting features o f the graphs that are due partly or wholly to sampling errors are not overinterpreted.
Note added in proof Recent w o r k by Langholz, B. and T h o m a s , D. C. (1990, N e s t e d case-control and c a s e - c o h o r t m e t h o d s of sampling from a cohort: a critical c o m p a r i s o n . A m . J. Epidemiol. 131, 169-176) suggests that the efficiency advantages of the casecohort design mentioned in Section 2.3 have been overstated. F o r studies with r a n d o m censoring and staggered entry, the c a s e - c o h o r t design m a y be substantially less efficient due to high correlations between the individual risk set contributions to the pseudo-likelihood analysis.
References Andersen, P. K., Borch-Johnsen, K., Deckert, T. et al. (1985). A Cox regression model for the relative mortality and its application to diabetes mellitus survival data. Biometrics 41, 921-932. Beebe, G. W., Ishida, M. and Jablon, S. (1962). Studies of the mortality of A-bomb survivors. I. Plan of study and mortality in the medical subsample (selection 1), 1950-1958. Radiation Research 16, 253-280. Breslow, N. E. (1985). Multivariate cohort analysis. Natl. Cancer Inst. Monograph 67, 149-56. Breslow, N. E. and Cain, K. C. (1988). Logistic regression for two stage case-control studies. Biometrics 44, 891-899. Breslow, N. E. and Day, N. E. (1980). Statistical Methods in Cancer Research I: The Analysis of Case-Control Studies. IARC Scientific Publication No. 32, International Agency for Research on Cancer, Lyon. Breslow, N. E. and Day, N. E. (1985). The standardized mortality ratio. In: P. K. Sen, ed., Biostatistics : Statistics in Biomedical, Public Health and Environmental Scienc., Springer-Verlag, New York, 55-74. Breslow, N. E. and Day, N. E. (1987). Statistical Methods in Cancer Research H: Design and Analysis of Cohort Studies. IARC Scientific Publication No. 82, International Agency for Research on Cancer, Lyon. Breslow, N. E. and Langholz, B. (1987). Nonparametric estimation of relative mortality functions. J. Chronic Dis. 40, 895-995. Cornfield, J. A. (1951). A method of estimating comparative rates from clinical data. J. Natl. Cancer Inst. 11, 1269-1275. Cox, D. R. (1972). Regression models and life tables (with discussion). J. R. Stat. Soc. B 34, 187-220.
144
N. Breslow
Efron, B. and Tibshirani, R. (1986). Bootstrap methods for standard errors, confidence intervals, and other measures of statistical accuracy. Stat. Sci. 1, 54-77. Greenland, S. and Thomas, D. C. (1982). On the need for the rare disease assumption in case-control studies. Am. J. Epidemiol. 116, 547-553. Hastie, T. and Tibshirani, R. (1986). Generalized additive models (with discussion). Stat. Sci. 1, 297-318. Hastie, T. and Tibshirani, R. (1987). Generalized additive models: some applications. Y. Am. Star. Assoc. 82, 371-386. Hsieh, D. A., Manski, C. F. and McFadden, D. (1985). Estimation of response probability from augmented retrospective observations. J. Am. Stat. Assoc. 80, 651-662. Kannel, W. B. and Gorton, T. (ed.) (1974). The Framingham Study. DHEW Publication No. (NIH) 74-559. US GPO, Washington, DC. Lee-Feldstein, A. (1983). Arsenic and respiratory cancer in humans: Follow-up of copper smelter employees in Montana. J. Natl. Cancer Inst. 70, 601-609. Liddel, F. D. K., McDonald, J. D. and Thomas, D. C. (1977). Methods for cohort analysis: Appraisal by application to asbestos mining (with discussion). J. R. Stat. Soc. A 140, 469-490. Manski, C. F. and McFadden, D. (1981). Alternative estimators and sample designs for discrete choice analysis. In: C. F. Manski and D. McFadden, eds., Structural Analysis of Discrete Data. The MIT Press, Cambridge, MA, 2-50. Mantel, N. (1983). Synthetic retrospective studies and related topics. Biometrics 29, 479-486. McCullagh, P. and Nelder, J. A. (1983). Generalized Linear Models. Chapman and Hall, London. Miettinen, O. (1976). Estimability and estimation in case-referent studies. Am. J. Epidemiol. 103, 226-235. Miettinen, O. (1982). Design options in epidemiologic research: an update. Scand. J. Work Environ. Health 8 (Suppl. 1), 7-14. Miettinen, O. (1985). Theoretical Epidemiology. Wiley, New York. Peterson, G. R., Gilbert, E. S., Stevens, R. G. and Buchanan, J. A. (1987). A case-cohort study of lung cancer among males based on smoking data obtained from industrial medical records. (Submitted for publication.) Prentice, R. L. (1986). A case-cohort design for epidemiologic cohort studies and disease prevention trials. Biometrika 73, 1-11. Prentice, R. L. and Pyke, R. L. (1979). Logistic disease incidence models and case-control studies. Biometrika 66, 403-411. Ramlau-Hansen, H. (1983). Smoothing counting process intensities by means of kernel functions. Ann. Star. 11, 453-466. Stone, M. (1974). Cross-validatory choice and assessment of statistical predictions (with discussion). J. R. Stat. Soc. B 36, 111-147. Walker, A. M. (1982). Anamorphic analysis: sampling and estimation for covariable effects when both exposure and disease are known. Biometrics 38, 1025-1032. White, J. E. (1982). A two stage design for the study of the relationship between a rare exposure and a rare disease. Am. J. Epidemiol. 115, 119-128. Yandell, B. S. (1983). Nonparametric inference for rates with censored survival data. Ann. Stat. 11, 1119-1135.