0021-968 I/87 53.00 + 0.00 Copyright © 1987 Pergamon Journals Ltd
J Chron Dis Vol. 40, Suppl. 2, pp. 89S-998, 1987
Printed in Great Britain. All rights reserved
NONPARAMETRIC ESTIMATION OF RELATIVE MORTALITY FUNCTIONS NORMAN BRESLOW!· and BRYAN LANGHOLZ 1,2 'Department of Biostatistics SC-32, University of Washington, Seattle, WA 98195, U.S.A. and 2Department of Biostatistics, German Cancer Research Center, Heidelberg, F.R.G.
Abstract-The standardized mortality ratio (SMR), which is the ratio of the number of deaths from a particular disease observed in a study cohort to the number expected from national vital statistics, has a long history of use in epidemiology. Separate SMRs computed according to discrete intervals of time since initial exposure, time on study, age or calendar year are used to examine effects of these variables on relative mortality. Applying some recent ideas of Andersen et at. (Biometrics 1985; 41: 921-932), we discuss refinements of this procedure that yield a nonparametric estimate of the SMR as a function of one time-varying factor while at the same time adjusting for the effects of other factors in a regression equation. The method is utilized to study relative lung cancer mortality as a function of years since initial employment in cohorts of 8014 Montana smelter workers and 679 Welsh nickel refinery workers. Suggestions are made for approximating the relative mortality function using only data for cases and matched controls sampled from the full cohort. Biometry
Cohort studies
Epidemiologic methods
INTRODUCTION A traditional approach to the epidemiologic analysis of cohort data is to compare the observed cause-specific mortality or morbidity rates in the study cohort to national or regional rates. The standardized mortality ratio (SMR), defined as the ratio of the observed number of cases or deaths to the number expected from standard rates corrected for age, sex, and other demographic variables, has for many years been used to measure occupational health risks [2]. The SMR is often most informative when calculated as a function of years since initial employment, time since termination of employment, duration of employment in high exposure areas
This work was presented at the April 1985 Symposium on Time-Related Factors in Cancer Epidemiology, National Institutes of Health, Bethesda, Maryland, U.S.A. This research was supported in part by Grants l-K07CA00723 and l-ROI-CA40644 from the United States Public Health Service and an award from the Alexander von Humboldt Foundation. ·All correspondence should be addressed to Dr Norman Breslow.
Standardized mortality ratio
Statistics
or other time-varying factors. By studying the evolution of relative risk in time, one hopes to answer questions about the latent interval between exposure and disease, about the attenuation of risk following cessation of exposure or perhaps even about the biological mechanisms involved in the disease process. This paper starts by reviewing some examples where temporal variations in the SMR have been used to make epidemiologic inferences. Statistical modeling of the SMR as a function of multiple time related factors allows for correction of confounding effects that are responsible for most of the SMR's well publicized drawbacks. Methods are described for the nonparametric estimation of the SMR as a continuous function of a continuous time variable, with or without adjustment for confounding by other variables. Results presented in the Appendix show that the approach may be implemented even in the context of the "casecontrol within a cohort" method of analysis [3]. It is intended as a descriptive tool for creating a graphic impression of temporal changes in relative risk.
89S
90S
NORMAN BRESLOW
and
SOME EXAMPLES
A few examples drawn from the recent literature will point up the ubiquity of SMR analyses and some of the major concerns that arise in their application. Some of these studies are used later to illustrate the statistical methods. Lung cancer in asbestos workers
Selikoff et al. [4] examined ratios of observed to expected lung cancer cases among asbestos and insulation workers according to time since initial exposure. The SMRs increased to a maximum of 6.08 at 30-34 years since initial exposure and then declined. However, the background rates for which the SMRs serve as multipliers continued to rise steeply, due to the concomitant aging of the cohort and the rapid increase in lung cancer rates with age. These data have been interpreted to suggest that a decline in the relative risk occurred about the time of retirement [5], possibly because of the gradual evacuation of asbestos fibers from the lungs [6]. Alternative explanations could include a differential elimination of heavy cigarette smokers from the cohort, the confounding of time since initial employment with date of employment (and hence with intensity of exposure -see below), or simply that the effects of accumulated exposure were not strictly multiplicative with background.
BRYAN LANGHOLZ
Cancer mortality among British doctors
Doll and Peto [8] studied the time-trends in cancer mortality among British male doctors in comparison to mortality rates for the general population of England and Wales. While the SMRs for lung cancer declined from about 75% to well under 50% during the 20 year study, those for all other cancer remained constant at about 75%. This finding most likely results from the fact that doctors reduced their smoking drastically in comparison to men of similar ages in the general population. For example, the average number of cigarettes consumed by 5086 doctors aged 35-39 in 1951 was 84% of the average in the general population; by 1971 the cigarette consumption for these same doctors was only 41% of the general population smoking rates. Lung cancer in Montana smelter workers
Brown and Chu [9] studied lung cancer mortality rates in a cohort of Montana smelter workers [10] in an attempt to determine whether early or late stages in a putative multistage carcinogenic process were affected by arsenic exposure. They compared the observed and expected numbers of lung cancer deaths according to age at hire, time since hire, time since termination and other variables. Roughly speaking, with all other factors held constant, one would expect the SMRs to increase with time since cessation of exposure and decline with age at exposure if only the first stage of the The healthy-worker effect in vinyl chloride disease process were affected by arsenic. They workers would decrease with time since exposure and Fox and Collier [7] illustrate nicely the selec- increase with age at start of exposure if only the tion bias known as the "healthy worker" effect. penultimate stage were affected [11, 12]. Neither For the first few years following employment in condition is fully satisfied for the Montana . the vinyl chloride industry, mortality rates data. However, in a multivariate matched caseamong workers who were presumably healthy control analysis limited to 100 lung cancer cases at the time of hire were only 37% of the rates among those who left employment at or later in the general population. There was a gradual than 55 years of age, Brown and Chu [12] found attenuation of this effect with increasing patterns in the adjusted risk ratios that were follow-up, such that after 15 years 230 deaths consistent with arsenic acting at a late (penultare observed vs 244.30 expected. However, imate) stage in the carcinogenic process. They there was also a continuing selection for health also note that study of the time relationships for among those who remain employed. The SMR excess as opposed to relative risk may yield of 230/244.30 = 94.2% among those employed more information for purposes of discriminat15 or more years earlier is decomposed into one ing early vs late stage activity. In the sequel we of 75/101.36 = 74.0% for those still employed use much the same data to illustrate how SMRs at that time vs 155/142.94 = 108.4% for those may be adjusted for confounding effects using a who are no longer employed. Several papers at multiplicative model. this conference are concerned with the problems that such a selection for health poses for dose- Lung cancer in Welsh nickel workers Follow-up in the study of Welsh nickel response analyses.
Estimation of Relative Mortality
918
Table I. Summary of person-years and observed and expected lung cancer deaths among Welsh nickel refinery workers" Lung cancer deaths Level
Person-years at risk
<20yr 20.0-27.4 27.5-34.9 35+
3089.0 7178.2 3594.6 1364.7
Year of first employment
<1910 1910-14 1915-19 1920-24
1950.1 2903.7 2293.8 8078.9
41 11 23 39 13 62
Exposure indext
0 0.5-4.0 4.5-8.0 8.5-12.0 12.5+
7736.4 4903.4 1716.8 60\.0 268.9
Time since first employment
<20 yr 20-29 30-39 40-49 50+
Variable Age at first employment
Total
SMR Obs.
Exp.
(O/E)
13
5.58 13.54 6.00 \.75 2.84 4.52 4.12 15.38
2.3 5.3 6.8 6.3 8.1 8.6 3.2 4.0
42 50 27 12 6
14.90 8.31 2.47 0.85 0.34
2.8 6.0 10.9 14.1 17.6
2577.7 4764.5 4332.2 2465.1 1086.9
6 35 55 31 10
10.9 11.1 7.3 3.4 1.6
15226.4
137
0.55 3.14 7.57 9.18 6.42 26.86
72
5.1
"Source: Kaldor et al. [16]. Slight numerical differences exist between the data analyzed here and those reported by Kaldor and colleagues, due to a different method of calculation of the person-years denominators. fNumber of years worked in selected areas of the refinery prior to start of follow-up.
refinery workers originally reported by Morgan [13] and Doll et al. [14] has now been extended from the mid 1930's through 1981 [15]. Table 1 presents observed numbers of lung cancer deaths and numbers expected based on age and calendar year specific rates for men in England and Wales. These data are essentially identical to those reported by Kaldor et al. [16], who examined them for consistency with multistage theory but were unable to draw any firm conclusions in this regard. The sharp decline in relative risk with time since initial employment may have resulted from one or more ofthe following. (i) exposure effectively ceased after 1925, so that time from initial employment was strongly related to time since cessation of exposure; (ii) heavy smokers may have been differentially eliminated from the population at risk as time wore on; (iii) the effect of accumulated exposure may not have been multiplicative on the background rates, but rather additive or intermediate between additive and multiplicative. ADJUSTMENTS FOR CONFOUNDING EFFECTS
As the last two examples make clear, the SMR may be affected simultaneously by several c.o. oIO/SUPP.
l-G
highly correlated age or time factors. Some sort of adjustment procedure is then essential in order to produce unconfounded estimates of the effect of each factor on risk. Poisson regression models fitted to grouped data consisting of observed deaths and person-years denominators that are cross-classified by all relevant factors offer a convenient methodologic approach [17,18]. Breslow et al. [19] have emphasized the multiplicative model under which the observed number d, of deaths in the ith cell of the crossclassification is assumed to be Poisson distributed with mean E (dJ = E j exp {xiiI}, where E, is the expected number of deaths as determined from general population rates, Xi is a vector of binary covariables representing the effects of confounding factors, and fJ is a vector of regression coefficients. Pierce and Preston [20] and Kaldor et 01. [16] consider models that incorporate both additive and multiplicative components. Using the multiplicative model, we [21] conducted an analysis of deviance of SMRs from the Montana smelter workers study using observed and expected numbers of deaths that were cross-classified according to the following factors: date of hire (1885-l924vs 1925-1955); age at hire « 24, 25-34, 35 + years); arsenic
92S
NORMAN
BRESLOW and
exposure (light vs moderate vs heavy); time since initial employment (1-14, 15-29, 30+ years); and time since last employment (still employed, 0-9, 10+ years). An apparent increase in the SMR with time since employment and a decline with age at employment were entirely explained by the confounding with date of employment. The risk of respiratory cancer evidently decreased substantially for persons hired after 1925, when modifications in the smelting process reduced airborne arsenic exposures [22]. Thus, as is true of many industrial cohort studies of this type, date of exposure was acting as a surrogate for exposure intensity. According to a model with main effects for period of employment and exposure category, the SMR for men hired after 1925 who did not work in moderate or heavy arsenic exposure areas was estimated to be 1.27. This SMR was approximately doubled for those hired earlier and doubled again for those who worked at least one year in moderate or heavy exposure areas. However, there was a suggestion (X~ = 4.8, p = 0.09) that the effects of date of hire and workplace did not combine multiplicatively: the relative risk associated with work in heavy arsenic exposure areas was greater than that for moderate exposure areas for persons employed before 1925, but the opposite may have been true for persons employed in 1925 or later. Variations in the SMR with exposure intensity or duration may be due to confounding with stratification factors used in calculation of the SMR, namely age and calendar year. Such confounding lies at the heart of the problem of non-comparability of indirectly standardized
BRYAN LANGHOLZ
mortality indices [23]. A possible remedy is to incorporate the stratum factors as covariables in the regression analysis. Table 2 contrasts two regression analyses of the SMR on date of employment and duration of moderate/heavy arsenic exposure, one with and one without calendar year as a covariable. When accounted for in the analysis, the sharp decline in the SMR with calendar year of follow-up markedly reduced the change in the SMR with date of employment. The coefficients of the exposure variables in the adjusted analysis were close to those obtained from a multiplicative model in which background rates are estimated internally by age and calendar year, without recourse to an external standard population. These coefficients are identical when age and calendar year effects are modeled as flexibly in the SMR analysis as they are in the internally controlled one. When the adjusted and unadjusted SMR analyses differ appreciably, as in Table 2, it means that the cohort to standard rate ratios for each exposure level are not constant, as assumed implicitly by the SMR comparison, but instead vary with age and/or calendar year. See Breslow and Day [24] for a full discussion of this point. ESTIMATING THE SMR AS A CONTINUOUS FUNCTION OF TIME
The dependence of the SMR on timedependent exposure factors was investigated in the preceding examples by calculating O/E ratios separately for a few discrete intervals of age or time. Arbitrary choices as to the number of such intervals and their endpoints may
Table 2. Estimated regression coefficients (log SMR ratios) and standard errors for respiratory cancer SMRs from the Montana cohort" Externally controlled analysis
Variable
Internally controlled analysis
Without calendar year effects
With calendar year effects
Estimation of age/year specific rates in 16 strata
0.708 ±0.127
0.526 ± 0.140
0.485 ± 0.150
0.772 ±0.158 0.564 ± 0.206 0.948 ±0.180
0.796 ± 0.159 0.581 ± 0.206 0.951 ±0.181
0.797 i 0.158 0.571 i 0.206 0.951 iO.181
Hired before 1925 Duration of heavy/ moderate arsenic exposure: 0-1 yr 1-4yr 5-14 yr 15+ yr Calendar year: 1938-49 1950-59 1960-69 1970-77 ·Source: Breslow and Day [24].
-0.117iO.215 -0.331 i 0.211 -0.612 i 0.222
Estimation of Relative Mortality
938
markedly affect the results. If the SMR is changing rapidly with time, important information about the rate of change may be lost by grouping. Hence there is a need for nonparametric estimation of the SMR as a continuous function of time.
This may be recognized as a variant of Cox's [27] proportional hazard model suggested by Andersen et at. [1] wherein 0 (I) plays the role of the unknown rate function (now a rate ratio) and log At (I) enters the model as a timedependent covariate with known regression coefficient. An estimate [28, 29] of the cumuThe proportional hazards model with standard lative rate ratio defined by the integral rates C8 (t) = (s) ds Let us denote by t a time variable such as years since initial employment or age that changes linearly with time-on-study and is con- is available from the formula sidered a basic time scale for analysis. Denote C(J (t) = L fJ; (4) by il;(t) the death rate for subject i at time t and 1,<;' I Yj(t;»).j(t/) suppose that
to
J
Ait) = 8).1(t).
(1)
Here A} (r) is the standard rate for the ith subject based on his age and the calendar year at time t, and 8 is an unknown constant that represents the multiplicative effect of cohort membership. Let YJt) equal 1 or 0 according as the ith subject is or is not on-study and hence "at risk" of death at time t, let OJ equal 1 or 0 according to whether or not the observation period is terminated by death from the cause of interest and let t; = sup {t: Y;(t) = I} be the time of termination. It follows that the maximum likelihood estimate of 8 is given by = OjE where 0 = L;b; is the observed number of deaths and
e
E
=~
f'O Y;(t)il7(t) dt
is the expected number based on external standard rates [25, 26]. If the model is relaxed so that the cohort to standard rate ratios are only assumed constant within discrete time intervals, il;(t) = Ojil1(t)
for
"j_l::;;;t::;;;"j, I ::;;;j
t
J _ /
E'i~1 0;= ~E J: Y[(s) 8 (s)A7(s) ds. Covariance adjustment of the SMR
Covariance adjustment for fixed and time dependent covariables Xj(t) is easily accomplished using the multiplicative model AICI) = e(t) exp{log A1 (t) + Xj(t) II} where now the unknown regression coefficients II are estimated by partial likelihood [31]. Extensions to more general relative risk models of the form A/Cl) = e(t) r {At (t); x/(t); II} are possible [32] although we have not implemented them here. This would include, for example, an additive structure such as A;(t)=O(t){).1(t)+x/(t)lI} for the effects of the covariates on the background rates, or an additive relative risk structure such as 0 (t)At(t){1 + xl(t)/l}. The estimate of the cumulative SMR adjusted for covariable effects becomes C(J (t) =
~
J,
(2)
L:
0;
',"1 L Yj(t;) Aj(t;) exp{xj(t;)P}
,
(5)
j
the same arguments show that the MLE of OJ is given by the ratio O)Ej , where OJ is the number of deaths observed in the jth time interval and Ej is the expected number
Ej=~
Some motivation for this formula is provided by the fact [30] that
1(l ) Y;( t ) d t .
where pis determined by partial likelihood. For the multiplicative model, these calculations are readily implemented using standard programs for survival analysis with time-dependent covariables. Smoothed estimates of the SMR
Most of the preceding examples illustrate this approach. A further relaxation of model (1) allows the multiplicative factor 0 to change continuously with time: (3)
The proposed estimates (4) and (5) of the cumulative SMR are step-functions with jumps at the observed times of death. The problem of estimating the SMR itself is thus entirely analogous to that of estimating a probability density or hazard function from an empirical
948
NORMAN BRESLOW
and
cumulative distribution or hazard function [33,34]. Recent work by Ramlau-Hansen [35] and Yandell [36] has validated kernel estimates of the form fl(t)
=~ DO KC ~s) dC8(s)
=!b I( K(t -b ti) dce (tJ,
(6)
where K (x) is a smooth positive kernel function integrating to one and b is a bandwidth that determines the degree of smoothness. In effect, the estimate e(t) of the SMR at t is a weighted average of nearby differentials dCe (t/). This approach is implemented in the sequel using the kernel function K(x)=O.75(I-x 2 ) for -I ~ x ~ 1, and K(x) = 0 elsewhere. Tanner and Wong [37] propose a refinement whereby the bandwidth is selected to depend on the spacing between observations, becoming narrower where deaths are more frequent. Confidence bands
Confidence boundaries for the smoothed estimate of the unadjusted SMR, i.e, that which arises by substituting equation (4) into (6), may be obtained by making minor changes in the bounds derived for the analogous estimate of the hazard function [35, 36]. One must assume that the number of subjects n increases and that the bandwidth b decreases in such a way that the product nb becomes large and that ot~er regularity assumptions are met. Then the .po~nt estimate fl (t) is asymptotically normally distributed about the true SMR (t) with standard error
e
SE{fl(t)}
I n
= [ b 2 i~1 K
2
s,
/2
(t - ti) J1 -b- [Y*(tiW '
(7)
where Y*(tJ =
BRYAN LANGHOLZ
A limitation of formula (8) is that it applies only to the estimate of the SMR at a single time point t. Yandell [36] has derived asymptotically valid simultaneous confidence bounds that apply to the entire interval tmin + b ~ t ~ t m• x - b, where t min and t m• x denote the minimum and maximum times ti at which terminations occur. The percentile zal2 that appears in equation (8) is replaced by a larger quantity that depends on both a and the bandwidth b, or more specifically, on the time interval over which simultaneous bounds are desired. Using the kernel defined above, for example, one would replace ZO.05 = 1.645 by a quantity at least as large as 2.721, depending upon b, in order to obtain 90% simultaneous intervals. Unfortunately, simulation studies indicate that this method is unnecessarily conservative when applied with bandwidths that are sufficiently broad so as to give reasonably smooth and unbiased estimates of the SMR. Similar bounds have been derived also for the smoothed version of the adjusted SMR that is given by substituting equation (5) into (6). In place of equation (7) one makes recourse to a more complicated variance estimate that takes account of the error in estimation of the regression coefficients fJ [38]. Two examples
Figure I shows the cumulative SMR for respiratory cancer among the Montana smelter workers by number of years since initial employment. Two smoothed estimates of the SMR, one with bandwidth b = 5 years and the other (smoother) one with b = 10 years, are shown in Fig. 2. They confirm the Brown and Chu [9] result that the crude SMR stays in the range 1.5-2.0 until 30-35 years from date of hire, and then increases rapidly to 3.0-3.5. However, covariance adjustment is needed
L. ~(ti)At(ti)
160
j
is the total standard risk at ti' Note that Y*(t i) also appears in the denominator of equation (4). Thus a confidence band for the SMR at time t is given by tl(t) ± Z al2 SE{tl(t)}, where za /2 is the appropriate percentile of the unit normal distribution (e.g, ZO.05 = 1.645 for 90% confidence). In practice, due to skewness in the distribution of fl(t), it is preferable to determine confidence bounds on the log scale via log tr(t) ± Z a/2 SE {fl (m/fl(t).
(8)
~
120
!a: ::;
, '" ;;;
~
80
40
o o
I
10
.
20
30
40 -50
Years employed
Fig. 1. Cumulative respiratory cancer SMR by years since initial employment for Montana smelter workers.
Estimation of Relative Mortality 3.5
3.5
- - 5 Yoar bandwidth
3 .0
95S
-------- .·10 Year band widt h
3.0
. - - . - - - · ·Adjusted
2.5
- - Unadjusted
2.5
cr
~ 20
2.0 1.5 1.5
1.0
10
20
30 Years emplo yed
40
50
0.5
60
Fig. 2. Smoothed estimates of the respiratory cancer SMR by years since initial employment for Montana smelter workers: comparison of 5 and 10 year bandwidths.
a
10
20
30 Years employed
40
50
60
Fig. 3. Respiratory cancer SMR by years since initial employment for Montana smelter workers with and without covariable adjustment.
in view of the strong correlation between A smoothed graph of the lung cancer SMR number of years employed and date of first for Welsh nickel refinery workers by years since initial employment is shown in Fig. 4. It employment. The covariables used to adjust the SMR confirms the marked decrease noted in Table 1. analysis are date of employment (pre or post This temporal pattern remains even after co1925), birthplace (U.S. vs foreign) and number variance adjustment for age at first employof years worked in areas of the smelter with ment, year of first employment and exposure potential for moderate or heavy arsenic index (Table 4). The adjusted SMR shown in exposure. Regression coefficients estimated by Fig. 4 corresponds to the baseline category used partial likelihood are shown in Table 3. These in the regression analysis; age under 20 when were inserted in formula (5) to yield the adjusted first employed, hired before 1910, and exposure SMR estimates shown in Fig . 3. The unadjusted index of zero. The fact that it is higher than estimate differs slightly from that shown in Fig. the unadjusted SMR is evidently due to a lower 2 since the time variable "years employed" was risk for men hired in 1910 or after, among rounded to integral years in order to reduce the whom most of the deaths and person-years computational burden of the partial likelihood observation occurred. analysis, and also since 12 lung cancer deaths that occurred at 80 or more years of age were DISCUSSION excluded. The adjusted SMR does not show an The graphical statistical techniques described increase at 30-35 years, but in fact declines slightly. Of course, there may well be residual in this paper are a useful tool for the descriptive, confounding by other time-dependent co- exploratory analysis of time relationships in variables. For example, since "moderate" or chronic disease epidemiology. As it turned "heavy" arsenic exposure may mean different out, both our illustrative examples involved things in different time periods, inadequate respiratory cancer mortality, utilized national account may have been taken of arsenic effects 12 by the two covariables used to represent them (Table 3). 10
------ •• Ad justed
8
- - Unad justed
Table 3. Regression coefficients in the multiplicative model used to adjust the SMR analysis for the Montana smelter workers Variable Hired before 1925 Foreign born Heavy arsenic" Moderate arsenic"
Regression coefficient ± SE 0.70±0.18 0.47 ±O.I4 0.041 ± 0.010 O.OI7±0.007
"Time dependent exposure variables lagged two years to account for "healthy worker" selection bias.
2
o'----_ _-'20
30
--'-
L...-. _ _- - '
50
80
Fig. 4. Smoothed estimate of the lung cancer SMR by years since initial employment for Welsh nickel refinery workers, with and without covariable adjustment.
96S
NORMAN BRESLOW
and
Table 4. Regression coefficients in the multiplicative model used to adjust the SMR analysis for the Welsh nickel refinery workers Variable Age at employment (yr) <20 20.0-27.4 27.5-34.9 35+ Year of employment 1902-1909 1910-1914 1915-1919 1920-1924
Regression coefficient ± SE
0.28 ±0.31 0.06 ± 0.34 -0.28 ± 0.44
0.04 ± 0.27 -0.51 ± 0.38 -0.33 ± 0.30
Exposure index'
o
0.5-8.0 4.5-8.0 8.5-12.0 12.5+
0.59 ± 0.21 1.05 ± 0.28 1.24 ± 0.37 1.34 ± 0.48
"Number of years worked in selected areas of the refinery prior to start of follow-up
vital statistics for the standard population, and considered years since initial employment as the time scale of interest. In other applications, one may wish to substitute incidence rates based on a parametric model of the disease process for the vital statistics and select as a time variable age, calendar year, or years since termination of employment. The graphical output provides a flexible, non parametric method of evaluating the goodness-of-fit of the assumed parametric model and may suggest appropriate modifications of that model when the rate ratios vary systematically with time. For example, the marked decrease in the SMR for nickel workers may be due to heterogeneity of the population such that those at highest risk, e.g. heavy smokers, are eliminated early on. Substantial improvement in fit results if each worker's mortality function is multiplied by a term representing his "frailty" and if it is assumed that such frailties have a random distribution in the population [39]. Proper application of the powerful graphical techniques illustrated here requires a full understanding of the potential hazards that arise in the use of the SMR as a comparative measure [23,40]. If the basic multiplicative model (3) under which the graphical estimate was derived fails to hold, and if there are strong correlations between age or calendar year and the time variable t according to which patterns in the SMR are being examined, misleading inferences may result. One solution to this problem is to create time-dependent covariables x(t) repre-
BRYAN LANOHOLZ
senting age and/or calendar year and to include them in the regression equation, thus adjusting for their effects on the SMR. This is tantamount to modelling the effects of age and year on the baseline rates in a parametric fashion, using the known standard rates in such a way as to reduce the number of parameters required for a good fit [24]. In the Montana smelter workers study, we found that the apparent changes in the SMR with time since employment were due almost entirely to confounding by date of employment. There may have been some slight residual confounding caused by a decline in the SMR with calendar year. One can also use the techniques of this paper as a descriptive tool to discover whether the SMR in fact does vary with age or calendar year. The fact that such graphical analysis may be implemented even with synthetic case-eontrol data (see Appendix) broadens the appeal of the "case-control within a cohort" methodology that is now frequently used to analyze cohort data. The approach is especially useful when additional covariate information needs to be collected from existing records or next-of-kin, since the additional data collection may then be limited to relatively small numbers of subjects. However, our results both here and elsewhere [19] demonstrate the value of increasing the control sample so that more than the usual 1-3 controls are obtained for each case. An alternative approach that may be particularly useful when planning the data collection for a prospective cohort study is Prentice's [41] case-cohort design. Here covariable information is collected and processed for all disease cases as they arise and for an unmatched control sample from the cohort as a whole. Since each control may contribute to several risk sets, the risk set sizes are larger and one need not resort to the bias reduction techniques we used in order to estimate cumulative (relative) mortality functions. On the other hand, the increased magnitude of the computational effort may rule out the sort of exploratory interactive analyses that we have in mind. REFERENCES 1. Andersen PK, Borch-Johnsen K, Deckert T, Green A, Hougaard P, Keiding N, Kreiner S. A Cox regression model for the relative mortality and its application to diabetes mellitus survival data. Biometrics 1985; 41: 921-932. 2. Logan WPD. Cancer mortality by occupation and social class 1851-1971. IARC Scientific Publications No. 36. Lyon: YARe; 1982.
Estimation of Relative Mortality 3. 4.
5. 6.
7. 8. 9. 10. II. 12.
13. 14. 15.
16. 17. 18. 19. 20.
21. 22. 23. 24.
Thomas DC. Addendum to a paper by Liddel FDK, McDonald JC and Thomas DC. J R Stat Soc A 1977; 483-485. Selikoff II, Hammond EC, Seidman H. Latency of asbestos disease among insulation workers in the United States and Canada. Cancer 1980; 46: 27362740. Walker AM, Loughlin JE, Friedlander ER, Rothman KJ, Dreyer NA. Projections of asbestos-related disease 1980---2009. J Occup Med 1983; 25: 409-423. Berry G, Gilson JC, Holmes S, Lowinsohn HC, Roach SA: Asbestosis: A study of dose-response relationship in an asbestos textile factory. Br J Ind Med 1979; 36: 98-112. Fox AJ, Collier PF. Low mortality rates in industrial cohort studies due to selection for work and survival in the industry. Br J Prev Soc Med 1976; 30: 225-230. Doll R, Peto R. Mortality in relation to smoking: 20 years' observations on male British doctors. Br Med J 1976; 2: 1525-1536. Brown CC, Chu KC. Implications of the multistage theory of carcinogenesis applied to occupational arsenic exposure. J Natl Cancer Inst 1983; 70: 455-463. Lee AM, Fraumeni JF. Arsenic and respiratory cancer in man: an occupational study. J Natl Cancer lost 1969; 42: 1045-1052. Day NE, Brown CC. Multistage models and primary prevention of cancer. J Natl Cancer Inst 1980; 64: 977-989. Brown CC, Chu KC: Approaches to epidemiologic analysis of prospective and retrospective studies: Example of lung cancer and exposure to arsenic. In: Prentice RL, Whittemore AS, eds. Environmental Epidemiology: Risk Assessment. Philadelphia: Society for Industrial and Applied Mathematics; 1982:94-106. Morgan LG. Some observations on the incidence of respiratory cancer in nickel workers. Br J Ind Med 1958; 15: 224-234. Doll R, Morgan LG, Speizer FG, Cancers of the lung and nasal sinuses in nickel workers. Br J Cancer 1970; 24: 623-632. Peto J, Cuckle H, Hennon C, 0011 R, Morgan LO: A further update of respiratory cancer mortality of nickel refinery workers in Clydach (Wales). In: Sunderman PFW Jr, ed, Nickel in the Human Environment. (IARC Scientific Publication No. 53). Lyon: IARC; 1983: 37-46. Kaldor J, Peto J, Doll R, Hennon C, Morgan L. Models for respiratory cancer in nickel workers. J Natl Cancer Inst 1986; 77: 841-848. Berry G. The analysis of mortality by the subject-years method. Biometrics 1983; 39: 173-184. Frome EL. The analysis of rates using Poisson regression models. BIometrics 1983; 39: 665-674. Breslow NE, Lubin JH, Marek P, Langholz B. Multiplicative models and cohort analysis. J Am Stat Assoe 1983; 78: 1-12. Pierce DA, Preston DL. Hazard function modelling for dose-response analysis of cancer incidence in the A-bomb survivor cohort. In: Prentice RL, Thompson OJ, Eds. Atomic Bomb Survivor Data: Utilization and Analysis. Philadelphia: Society for Industrial and Applied Mathematics; 1984: 51-66. Breslow N: Multivariate cohort analysis, Natl Cancer Inst Monogr 1985; 67: 149-56. Lee-Feldstein A. Arsenic and respiratory cancer in humans: follow-up of copper smelter employees in Montana. J Natl Cancer Inst 1983; 70: 601-610. Yule GU. On some points relating to vital statistics, more especially statistics of occupational mortality. J R Stat Soc 1934; 97: 1-84. Breslow NE, Day NE. The standardized mortality ratio. In: Sen PR, Ed. BIostatistics: Statistics In Blo-
25. 26. 27. 28. 29. 30. 31. 32. 33. 34. 35. 36. 37.
38.
39.
40. 41. 42. 43.
44. 45.
97S
medical, Public Health and Environmental Sciences. Amsterdam: Elsevier; 1985: 55-74. Kilpatrick SJ. Mortality comparisons in occupational groups. Appl Stat 1962; 17: 65-86. Breslow NE. Analysis of survival data under the proportional hazards model. Int Stat Rev 1975; 43: 55-68. Cox DR. Regression models and life tables (with discussion). J R Stat Soc B 1972; 34: 187-220. Breslow NE. Contribution to the discussion of paper by DR Cox. J R Stat Soc B 1972; 34: 216-217. Oakes D. Contribution to the discussion of paper by DR Cox. J R Stat Soc B 1972; 34: 208. Cox DR, Oakes D. Analysisof Survival Data. London: Chapman and Hall; 1984. Cox DR. Partial likelihood. Biometrika 1975; 62: 269-276. Prentice RL, SelfS. Asymptotic distribution theory for Cox-type regression models with general relative risk form. Ann Stat 1983; 11: 804-813. Watson GS, Leadbetter MR. Hazard analysis 1. Biometrika 1964; 51: 175-184. Watson OS, Leadbetter MR. Hazard analysis II. Sankhya 1964; 26A: 110-116. Ramlau-Hansen H. Smoothing counting process intensities by means of kernel functions. Ann Stat 1983; II: 453-466. Yandell BS. Nonparametric inference for rates with censored survival data. Ann Stat 1983; II: 1119-1135. Tanner MA, Wong WHo Data based nonparametric estimation of the hazard function with applications to model diagnostics and exploratory analysis. J Am Stat Assoc 1984; 79: 174-182. Andersen PK, Rasmussen NK.. Admissions to psychiatric hospitals among women giving birth and women having induced abortion. A statistical analysis of a counting process model. Research Report 82/6. Copenhagen: Statistical Research Unit; 1982. Clayton DG, Kaldor JM, Day NE. Heterogeneity models as an alternative to proportional hazards in cohort studies. Proe 45th Session of the International Statistical Institute 1985; 3.2: 1-16. Gilbert ES, Buchanan JA. An alternative approach to analyzing occupational mortality data. J Occup Med 1984; 26: 822-882. Prentice RL. A case-cohort design for epidemiologic cohort studies and disease prevention trials. Biometrika 1986; 73: 1-12. Mantel N. Synthetic retrospective studies and related topics. Biometrics 1973; 29: 479-486. Breslow NE, Day NE. Statistical Methods in Cancer Research I: The Analysis of Case-Control Studies (IARC Scientific Publication No. 32). Lyon: IARC; 1980. Quenouille M. Approximate tests of correlation in time series. J R Stat Soc B 1949; 11: 18-84. Efron B. The Jackknife, the Bootstrap and Other ResampUng Plans. Philadelphia: Society for Industrial and Applied Mathematics; 1982.
APPENDIX
Approximate Methods to Reduce the Computations The analyses presented in the paper involved the of proportional hazards regression models with dependent covariables to moderate or large samples. were 137 lung cancer deaths among the 679 Welsh
fitting timeThere nickel
988
NORMAN BRESLOW and BRYAN LANGHOLZ
workers and 276 such deaths among the 8014 Montana smelter workers. Since a separate risk set is formed for each distinct time at which a death occurs, and since the number of subjects in each risk set may total several hundred or more, the iterative computations are generally only feasible in batch mode. This tends to discourage the sort of interactive, exploratory data analysis preferred by many investigators. In this Appendix we discuss a method for approximating the partial likelihood analysis and the estimate of relative mortality so as to be able to analyze the data interactively. Sampling from the risk sets Mantel [42J proposed the "synthetic retrospective study" as a means of reducing the computations when fitting logistic regression models to large cohorts containing a relatively small number of disease cases. He suggested that one conduct the analysis on a reduced dataset consisting of all the cases and a sample of the non-cases. Thomas [3] extended this approach for use with the partial likelihood analysis of the proportional hazards model by sampling a small number of controls at random from the non-cases in each risk set. The matched sets of cases and controls so formed are then analyzed using conditional logistic regression [43]. The number of controls that one needs to draw for adequate efficiency relative to the full partial likelihood analysis depends on the magnitude of the relative risks being estimated and the percentage of the population at risk who are exposed. It is hardest to accurately estimate large relative risks associated with rare exposures [19]. So far as we know, no attempt has been made heretofore to estimate the baseline or relative mortality functions from such "synthetic" case-control data. Estimation of the relative mortality function from casecontrol data An examination of the formula used to estimate cumulative relative mortality as a continuous function of time suggests how it may be adapted to serve also with casecontrol samples. The key requirement is that one know the sampling fractions used to select controls within each risk set, and this is met for synthetic case-control studies. Let us denote by Jlj = jl;({J) the average of the estimated relative risk factors associated with the N, subjects in the ith risk set. Thus I Ili = - I rij' (AI)
small samples that we have in mind with n, in the range from 5 to 20. However, various steps may be taken to correct it. Taylor-series expansion Expanding the reciprocal of the sample mean f in a Taylor series about jl yields 1
I
1
-::=---(f-Il)+ r jl f.l2
2(f -
J.lf + ...
2fl3
and thus, to an approximation of order n -31 2,
E{~}=~ + Var(f). r
fl3
jl
This suggests replacing f; in equation (A3) with the bias corrected estimator
I
67
(A4)
il; =~- nj,f
where uf=(n;-I)-lLkij-f;)2 is the within risk set variance. The jackknife Another means of reducing the bias in small sample estimators is to use the jackknife technique of Quenouille [44]. Repeated estimates of the parameter of interest are made from subsamples where each of the observations is deleted in turn, and these are used to approximate the small sample bias [45]. For our problem the bias corrected estimate is I n, (n;- 1)2 ~ I (A5) - = - - - - - L. - - - .
n,
P;
fL;
j= I
n/; - rij
A disadvantage of this formula as opposed to equation (A4) is that it requires two passes through the data sampled from each risk set. Pooling information over risk sets For nearby values of Ii there should be little difference between the risk sets R I since there is limited opportunity for individuals to be entered onto or withdrawn from study. This suggests that one may reduce the sampling bias at the risk of a slight increase in systematic bias by pooling of controls from "nearby" risk sets. With b denoting a bandwidth used for pooling we define
(A6)
Nij,RI
where rg = ).7(t/)r {x/t l), P} = exp{log).J (t,) + x/tl)/I} denotes the adjusted baseline mortality rate for the jth member of risk set R;. The estimator (5) takes the form d, CU(t) = I - ', (A2)
Either the Taylor series or the jackknife approach to bias reduction may be used in conjunction with PI,b'
""",N;fl;
where d, denotes the number of cases in R I • Suppose now that we only have available a sample of nl controls drawn from the Ni-di non-cases in R,. We estimate the means It; from the sample by I f,=-
I
---------Jackknife
\.
Hi
nll_l
\
r y,
I ~
,,<,Nlrl
(A3)
as our initial approximation to equation (5). The main drawback of this approach is the fact that the reciprocal of a sample mean is a biased estimator of the reciprocal of the mean, and the bias can be severe for the
,...
..·"···.···,··,·,,Taylor
....
a:
A refinement would be to use NI-I {diS; + (N; - dl)fd in place of f l , where SI denotes the average mortality rates estimated for the d, cases. In practice this should make little difference unless the cases are a large fraction of those "at risk." Substituting f; for JlI in equation (A2) yields cdo(t) =
- - - A l l controls \
~ 2
.
\ "
o 10
20
30
40
50
60
Years employed
Fig. 5. Smoothed respiratory cancer SMR by years since initial employment for Montana smelter workers: jackknife (dashed) and Taylor series (dotted) estimates based on five controls per case.
Estimation of Relat ive Mortali ty 35 - - - All
3.0
controls
. . - -- - - - --Jackknile
···················Taylor 2.5
::.
'" 2.0 1.5
1.0
0
10
20
30
40
50
60
Yea rs employed
Fig. 6. Smoothed respiratory cancer SMR by years since initial employment for Montana smelter workers: jackknife (dashed) and Taylor series (dotted) based on 20 controls per case.
Preliminary results for the Montana cohort Figures 5 and 6 present estimates of the SMR for Montana smelter workers . without adjustment for covariab les, using the va rious techniques suggested in this section. The rather similar results obtained with the covariance
99S
adjusted analysis are not shown. All calculations were based on risk sets defined for each of the 57 integral years since data of hire at which one or more of the 276 lung cancer deaths occurred. The solid line in each figure represents the unadjus ted, smoothed SMR estimated from the entire set of cohort data. It corresponds to the unadjusted SMR shown in Fig. 3. Figure 5 contrasts this with the jackknife (lower curve) and Taylor series (upper curve) estimates based on five controls per case without any pooling of risk sets . Both estimates are clearly unaccept able. Increasing the number of cont rols to 20 per case (Fig. 6) subst ant ially improves the situation . The Tay lor series estimate is now virtually identical to that based on the full set of cohort data, and the jackknife version is only slightly lower. Even with 20 controls per case, however, the bias of the crude estimae (A3) based on the uncorrected reciprocals of the sample means is apparent (not shown) . For the analysis based on pooling of controls for risk sets that have "times" within 2! years of th e target, we used five controls per case so that the total number of controls entering into the averages equ ation (A6) is approximately 25. The results (not shown) are comparable to those obtained with 20 contro ls per case and no pooling, except that the Taylor series estimate underestimates the cumulative SMR in the upper range and the negat ive bias in the jackknife estimat e is a bit more pronounced. The best results are obta ined with pooling of nearby matched sets that contain 20 controls each (not shown).