Fitting epidemiologic follow-up studies with cumulative damage models

Fitting epidemiologic follow-up studies with cumulative damage models

Fitting Epidemiologic Damage Models NIKOLAUS Follow-up Studies with Cumulative BECKER Department of Epidemiology AND WERNER RITTGEN Department ...

960KB Sizes 0 Downloads 16 Views

Fitting Epidemiologic Damage Models NIKOLAUS

Follow-up

Studies with Cumulative

BECKER

Department of Epidemiology AND

WERNER

RITTGEN

Department of Mathematical Models German Cancer Research Center, Heidelberg

Germany

Received 18 July 1991 and 18 May 1993

ABSTRACT The use of an extended version of the cumulative damage model to identify and quantify cancer risk arising from a specified exposure is outlined. For this, a strategy to fit the model to individual data coming from epidemiologic follow-up studies is described. Two statistical problems are addressed: first, the regularity of the model has to be ascertained to allow the application of maximum-likelihood and likelihood-ratio methods for parameter estimation and testing. Second, a statistical test has to be found that permits testing goodness of fit in the setting of parameter estimation with individual data. As an example, these methods are applied to the data of a cohort study on mortality among stainless steel welders in the Federal Republic of Germany. The results show that the model fits the data well and confirms a carcinogenic effect of stainless steel welding among welders. Some distinguishing characteristics of the model, especially its prediction of a potentially decreasing relative risk despite ongoing carcinogenic exposure, are discussed.

INTRODUCTION The mathematical design of the extension of cumulative damage models (CD models) with regard to the consideration of cancer risk from specified exposures has been outlined qualitatively in previous papers [3, 251. In short, the idea is to model the always existing environmental background exposure by a first Poisson process and to overlay it by a second Poisson process that represents the additional environmental exposure under particular consideration. The phenomenon of latency period is taken into account by the assumption that the carcinogenic damages of this second process become effective only MATHEMATICAL

BIOSCIENCES

120:147-163

(1994)

OElsevier Science Inc., 1994 655 Avenue of the Americas, New York, NY 10010

147 0025-5564/94/$7.00

NIKOLAUS

138

BECKER

AND WERNER

RITl-GEN

after a certain time lag. For details, the reader is referred to the cited papers. The present investigation is focused on the practical application of this model. The methods to be used and the statistical problems to be solved are exemplified by fitting the model to the data of a study of German stainless steel welders [8, 91. This is a historical follow-up study with an internal control group that has shown a slightly increased risk for total cancer among welders in comparison to the general population as well as to the internal reference group. The practical question of interest is whether it is possible to reproduce this finding by fitting the CD model to these data. The statistical problems mentioned concern essentially the regularity of the model and testing goodness of fit. Regularity is a prerequisite when applying maximum-likelihood and likelihood-ratio methods for parameter estimation and testing [27, pp. 144, I.511 and must therefore be examined first. Testing goodness of fit arises as a problem in the given setting because the data are not grouped when the model is fitted. so classical x’ methods on grouped data cannot be applied for testing goodness of fit [ 11, p. 1901.

THE

DATA

The data to be fitted with the CD model stem from an occupational cancer study on stainless steel welding with chromium-nickel cxposurc set up in the years 1980- 1982 and pursued with a second follow-up to the year 1988. The window for inclusion into the study was the years 1950-1970, that is, welders who entered into exposure during these years were included in the study. Thus, 1213 welders with individualrelated records on entry into exposure, age at entry, exit from exposure, date of last information, and, for deceased persons, date and cause of death, were included. Further information concerning average daily amount of exposure is not taken into account in the present examination because no considerable differences were found between the exposure groups for this variable. For the purposes of this analysis, the previously listed items have been rounded to full years. This means that in the presentation of data according to certain time intervals (see below), small shifts of cancer cases between time groups compared to the already existing papers on this study may occur. A comparison group with the analogous individual-related records consists of 1688 turners who were not exposed to chromium-nickel fumes at their work place. The disease endpoint to be considered in the present invcstigation is death from cancer, which occurred in 48 welders and 62 turners. Further details of this study can be found in [8] and [9].

14’)

FITTING FOLLOW-UP STUDIES

THE

MODEL

The model to be fitted is the CD model with two exposure processes as introduced in Becker [3, 41, which has been extended for the consideration of latency periods by Rittgen and Becker [25]. Mathematically, the model is specified by the following assumptions: (1) (N,; t > 0) is a homogeneous Poisson process (“background exposure”) with intensity pI > O;(M,; t,, G t G t, > is a homogeneous Poisson process (“additional exposure”) with intensity llz 2 0; (N,; t > 0) and (A’,; t,, arising from the process A4, by positive i.i.d. random point displacements D;, i E N, with common distribution Wt). (3) (X,; i E IV) and ( x,; j E N) are families of nonnegative i.i.d. random variables (“damages”) with distributions F,(x) and F2( x) with finite expectations m, = E(X) and m2 = E( x) associated with the points of the Poisson processes N, and M,, respectively, generating the compound Poisson process

z, =x, + ... + XN, + /y, + *** +

,YG,

(“cumulative

damage”).

(4) C is a biological system (“tissue”) characterized by a random threshold Y (“host resistance”) with distribution G(x), and defines the indicator variable I( t >,

I(t)=

1,

z, < y,

0

z, >Y.

L

That is, Z(t) is 1 as long as the cumulative damage Z, is less than Y and is zero if Z, exceeds the threshold. (5) The failure time T of the biological system C (“clinical appearance of a malignant tumor” or “death from cancer”) is defined by T=inf[t;I(t)=O]. The process N, represents the background exposure to which each individual, either a member of the “unexposed” control group or one of the “exposed” cohort, is assumed to be exposed. Obviously, the parameters of this background exposure will be assessed from general population data or the control group as outlined below. The process M, represents the additional exposure actually under consideration, for example, exposure to chromium-nickel fumes among welders. The

NIKOLAUS

150

BECKER AND WERNER

RITTGEN

quantities t,, and t, denote in epidemiological terms “age at entry into exposure” and “age at exit from exposure,” respectively, for each individual. The distribution *IF(t) describes the delay between the exposure process and the effective biological activation of the damages being induced by this exposure (i.e., it models some kind of “latency”) and will be assumed to be lognormal with parameters u and s. The parameters of the second exposure process M,, F~,and m,, as well as those of the lognormal distribution q(t), u and s, will be assessed from the exposed cohort (see below). Assumptions l-5, together with the assumptions that the damages be deterministic and equal, m, > 0 and m2 > 0, and that the threshold distribution be double exponential with localization parameter c > 0 [2], result in the following survival function s(t):

For the case of continuous exposure, displaced Poisson process can be obtained

the intensity from pL? by t,,
&(t)=/+X*(f--,,), If the additional exposure ceases at time displaced process ,&(t) is given by

rate function

t,
h(t) can be obtained

h(l)

PARAMETER

= -

of the

(2)

t,, the intensity

rCL2(f)=~CLZX[~(t-t,,)-~(l-t,)], The hazard way as

j&(f)

from s(t)

&(t)

of the

(3) in the usual

dS( t)/dt

S(t)



ESTIMATION

The parameters to be assessed are p,,m,,c,p2,mz,u,s. To use all available information in the data set, the parameters will be estimated by maximum-likelihood methods applied to the individual records. This means that for each individual record, its contribution to the log-likelihood

FITTING FOLLOW-UP STUDIES

151

has to be calculated. E( is 1 if failure occurs at 1 = T and is 0 otherwise. A(T,t,,t,,O) is the hazard at failure time T depending upon entry into the exposure t,), exit from exposure t,, and the parameter vector 0 = ( p,, m,, c, p2,m2, u, s). The quantity w,, denotes “age at entry into observation” for this individual, which, in the case of the welders, is identical to the age of entry into exposure. The quantity w1 denotes “age at exit from observation,” which is usually different from age at exit from exposure but in the case of deceased persons is identical to the age at death. The hazard rate function h(t,t,,,t,,O) specifies the particular contribution of that individual to the global likelihood, that is, the specific exposure experienced by this person, expressed by age f. at onset and age t, at cessation of exposure. These individual contributions sum up to the global log-likelihood function I(O) for all n members of the cohort (see [28, p. 77; 11, p. 1851):

The maximization of f(O) was computed using optimization routines of the IMSL MATH/LIBRARY [ 181, which are based on quasi-Newton methods and a finite-difference gradient or a user-supplied gradient. The steps, in detail, were the following. (1) Fitting of the cancer mortality data of the internal control group of turners representing the “background exposure” to which welders and turners are assumed to be commonly exposed. For this step, the methods outlined by Becker and Rittgen 171 were applied to the data presented in Table 1. Only the age range 45-84 years was considered. The results are the parameter estimates fi,, &,, c^. (2) Keeping the parameters b,, Gz,, c^ (obtained in step 1) fixed, the model is fitted to the data of the exposed cohort to obtain the estimates fi2, &, t, and s^. This procedure was subdivided into three successive substeps to enable examination of the respective benefit of the incorporation of parameters. TEST

STATISTICS

Two test statistics goodness of fit.

are applied

to test hypotheses

and

to examine

The likelihood ratio test. In Appendix 1, it is shown that the CD model satisfies certain regularity conditions. Thus, consistency of ML estimators and asymptotic x2 distribution of the logarithm of the

NIKOLAUS

152

TABLE Cancer

Cases

and Person

Age Group

Years

BECKER

AND

WERNER

RITTGEN

1

ot’ Observation

among

Turners

Casts

by Age Groups

Person

Ycvs‘ 229 3.277 455 I 5.946 6,635 6,732 6,348 5,422 4,216 ?I,100 2,056

1.70I 750 353 I23 27 51.157

likelihood ratio is warranted. In particular, improvement of mode1 fit by incorporation of additional parameters can be tested by comparing the differences of the log likelihoods -21(O) with x,f, whereby v denotes the number of newly introduced parameters [15, p. 5201. Goodness of fit. As already mentioned, goodness-of-fit tests based on grouped data are not directly applicable. Corrections proposed by Schoenfeld [26] and Tsiatis [29] do not apply, because the CD mode1 does not belong to those special mode1 classes. However, a general method suggested by Heckman [17] may be used. The test is based on grouped data and an appropriate correction for parameter estimation with individual data and leads to a kind of (observed-expected) statistic of the form

G=C(O-E)C

‘(O--q7

with a variance-covariance matrix C being adjusted for parameter estimation. G is asymptotically $-distributed, with J being the number of cells. The procedure, which has apparently not been noticed in cancer epidemiology thus far, is described in Appendix 2. This method is not yet fully satisfying because it does not use the complete individual data for the test, but it appears to be without

FITTING FOLLOW-UP

STUDIES

153

alternative at the moment. Tests based on individual data (see, e.g., Chapter 6 of [13]) do not apply, because the present data are censored. The method suggested by Michael and Schucany 1211 cannot be used because the censoring mechanism considered is different from the present one. RESULTS The results of the fits following the consecutive steps of evaluation described above are presented in Table 2. The first step provided the parameters E_L,,~, of the background exposure process and c of the threshold distribution by fitting the model to the data of the internal control group of turners as presented in Table 1. The logic of a correctly specified control group is that it may be assumed to have exactly the same background exposure as the “exposed” group, so that the only difference between the two cohorts is just the additional exposure under consideration. Thus, the fitted parameters can be assumed to specify accurately the background exposure of the welders group as well as the turners group. Assuming that the welders have no increased risk beyond this background from their occupational exposure, the model should accurately fit the data without the additional exposure process. For testing this “null hypothesis” the data have been grouped according to duration of employment and observation time since onset of exposure (Table 3). Table 3 indicates that the model does not fit the data accurately. Nevertheless, the G statistic according to Appendix 2 yields G,, = 13.8 with 14 degrees of freedom, that is, P( x;~ > 13.8) = 0.5. The -21(O) value of this initial model fit was 628.5 for the welders (line 2 in Table 2).

TABLE 2 Log-Likelihood and Estimated Parameters for the Welders’ Cohort and the Control Group of Turners for the Different Steps of Fitting the Model Parameters

Line

Data

-2X1

1

Turners Welders Welders Welders Welders

628.5 621.0 614.6 614.6

2 3 4 5

Threshold c 13.52 13.52 13.52 13.52 13.52

Background Exposure

Additional Exposure

/Jl

MI

CL2

0.485 0.485 0.485 0.485 0.485

0.298 0.298 0.298 0.298 0.298

0.016 0.008 0.008

m2

1.0 2.26 2.26

Displacement Ll

s

2.53 2.53

0.01

NIKOLAUS

154

BECKER AND WERNER

RIT-TGEN

TABLE 3 Number of Cancer Cases Observed and Expected from the Model (a) Assuming No Carcinogenic Effect of the Additional Exposure f pz = nr, = 0) and tb) Assuming a Carcinogenic Effect of the Additional Exposure ( & = 0.008,$1: = 2.26, Li= 2.53) Duration of Exposure (yrl o-9

11lk19

20-29

0 E E: 0 E,, E, 0

(G,, = 13.8; G’, =X.4) Time Since First Exposure (yr) O-9

IO-19

20~29

30-39

2 2.2 2.2 ~

6 2.6 3.2 6 s.4 7.6

5 3.1 3.5 X 5.0 1.2 10

I

0

1.4 I.5 2 2.4 3.1) 5 4.0 6.1

0 0 0 0.05 0.06 0 0.2 0.2

~ ~

F20 El

5.4

9.7

40-49

-

3

I

E E:)

-

I.6 3.3

40-49

0

-

Total

4, E c;

0.6 0.9 0 0.5 O.‘j I I.2 2. I

30-39

0

6, E,

2 2.2 2.2

12 8.0 10.x

23 13.4 20.4

11 0.4 13.0

Total 14 9.2 10.4 16 12.X 17.x 1s 9.5 16. I 4 2. I 4.2 0 0.5 0.9 40 34.2 49.5

0 = observed cases; E,, = expected cases from model (a): E, = expcctcd cases from model (bl.

In the second step of model fit, the three parameters ~L?L,, & ,, and c^ were kept fixed, and the parameters for the additional exposure process and the displacement process pu,, m_,,u,s were fitted to the welders’ data. This was carried out successively to examine the benefit of the incorporation of further parameters into the model. At first the two parameters of the additional exposure process p1 and m, were incorporated. Table 2, line 3, shows that - 2 x log likelihood = 621 .O has improved significantly with P( ~22 > 7.5) = 0.02. As the next parameter, the displacement parameter u was incorporated, allowing a deterministic “latency period.” This also led to a significant improvement of fit (Table 2, line 4), demonstrating the relevance of consideration of latency in the model. The fitted value u = 2.53 indicates the mean displacement E(D) = e” = 12.6 years. In contrast, incorporation of the variance of latency in the model by parameter s does not provide any reduction of the log likelihood (Table 2, line 5). Thus, distributional

FITTING FOLLOW-UP STUDIES

155

shape of the displacement distribution cannot be assessed from such data. The grouped presentation of the observed cases and the expected numbers on the basis of the new fit in Table 4 shows a closer resemblance of expected to observed values. The G-statistic results are G, = 8.4, P( xf4 > 8.4) = 0.9. In detail, Table 3 shows that a total of 49 cancer cases were observed while the fit predicts 49.5. The diagonal provides the fit for continuous exposure, providing a rather close prediction for all cells, while the cells concerned with risk experience after cessation of exposure show model predictions that are too low. This is true mainly for the first decades after cessation of exposure (e.g., 3.2 expected cases vs. 6 observed 10 to < 20 years after onset of exposure, that is, 0 to < 10 years after cessation of exposure in the first row). A possible explanation of this effect is that persons who had left the company from which they were recruited for this study had in fact obtained other, similar employment and thus continued their exposure above the magnitude known in this study and used for the model fit. On the basis of this fit, Figure 1 illustrates the relative risk function in dependence on age for several ages at onset of exposure and common cessation of exposure at the age of 65 years. The figure shows that relative risk increases after a relatively short time of latency but decreases again despite ongoing exposure after some decades. Nevertheless, the underlying hazard rate functions strongly increase (Figure 2). Some remarks to this effect are given below. The fact that the distributional shape of the displacement could not be estimated results among the exposed in nonsmoothed breakpoints of the hazard and the relative risk after onset and cessation of exposure.

DISCUSSION

AND

CONCLUSIONS

The practical scope of cancer epidemiologic research is to identify risk factors for cancer and to quantify the relative risk of exposed people in order to prevent disease by eliminating such factors, if possible. Besides that, a theoretical question of cancer epidemiology as well as cancer research in general is whether and, if yes, to what extent epidemiologic observations in human populations give insight into details of the biological process of carcinogenesis. It is a widespread belief that it is possible and even recommended to understand epidemiologic observations in biologic terms, that is, as a direct result of a cellular staging mechanism 1241. Accordingly, continuous efforts are undertaken to imbed also the findings of epidemiologic follow-up studies such as occupational cancer studies (see, e.g., Brown

1%

NIKOLAUS

BECKER

AND

WERNER

RITTGEN

3.0

2.5

ONSET OF EXPOSUREAT THE AGE OF: 2.0 I- ,20YEARS is - -.-..30 YEARS LIZ

:I-____ * 0

20

40 60 AGE (YEARS)

FIG. I. Model-based relative risk functions onset of exposure with (a) continuous exposure

for two different and (b) cessation

80

100

time points l,, 01 of exposure at the

age of 65 years.

and Chu [12] and Dong et al. [14]) in the theoretical framework of the usual biologically based cancer models, that is, the multistage model (see, e.g., Whittemore and Keller [30]) or the Moolgavkar-Knudson model [231. However, various problems make this approach questionable. The pure staging mechanism of the multistage model is known to be an “oversimplification” [I& p. 1281. Moolgavkar [22] points out that it may result in a wrong understanding of dose-effect relationships. Kopp and Portier [19] show that the lack of any growth mechanism in the multistage model may lead to incorrect interpretations of the mechanisms of carcinogenesis. Thus, a refinement of the biologic model appeared to be required. However, another paper by the same authors [20] based on a refined model demonstrates that it is hard to distinguish between different models by fitting empirical data even from animal experiments, which are frequently more precise than epidemiologic data. An older objection raises the principal question of whether such special

FITTING FOLLOW-UP

0.10

157

STUDIES

r CONTINUOUSEXPOSURE AFTER CESSATIONOF EXPOSURE BACKGROUNDEXPOSURE

CESSATIONOF EXP

-ONSET OF EXPOSURE

.' I 80

I 100

AGE (YEARS) FIG. 2. Hazard rate functions for background exposure and background exposure plus additional exposure with onset at t,, = 20 years of age and (a) continuous exposure and (b) cessation of exposure at 65 years of age.

biologic models are useful at all, because they explain the curvature of age-incidence curves for cancer by some particular biologic mechanism, whereas various other diseases show the same curvature without having the same biologic mechanism of disease development in common with cancer 11, p. 91. This criticism may be accentuated by the fact that even physical or technical processes show this curvature (i.e., increasing hazard rate functions) so that the above question is strengthened: To what extent do epidemiologic data unequivocally show specific, identifiable biologic mechanisms? In view of such methodological problems, the present CD approach has started from the reverse question: If many diseases and even technical processes show similar curvatures of hazard rate functions, how many detailed biologic assumptions are actually required to understand the epidemiologic obsemations of cancer occurrence? The CD model interprets the macroscopic phenomena that are seen only by observation of human populations to be a kind of wearing out process:

158

NIKOLAUS

BECKER AND WERNER

RITTGEN

A series of carcinogenic events from a potentially carcinogenic environment damages a human host, or an organ from it for site-specific considerations, which can withstand these insults to a certain extent but fails (gets cancer) if a threshold of individual resistance is exceeded. Neither the number nor the nature of these damages, nor the biologic level (cell, tissue, immune system, etc.), are strictly determined. Some previous papers [2-41 have outlined this approach qualitatively and pointed out why it might be appropriate for theoretical epidemiologic purposes. However, for a profound examination of this first impression, a series of applications to epidemiologic data were required. Thus Becker and Rittgen [7] have shown that this approach suffices to fit well the cancer mortality data of descriptive epidemiology for some cancer sites. For other sites, suggestions for model refinements that remain within the framework of this approach are given to help provide acceptable fits. Here we have focused on the consideration of follow-up studies of analytical epidemiology with CD models. The first aim was to ascertain that the usual methods of ML estimation and LR testing can also be applied to the present nonlinear CD model. A further methodological problem was how to test goodness of model fit if individual data are used for model fitting instead of grouped data. The second aim was to give a practical example of an application of the model, in this case to an occupational cancer study among stainless steel welders. The results of this paper are that the usual methods of model fitting just presented are also applicable to CD models, that the G statistic is a suitable method to test goodness of fit if individual data are the basis of the fitting procedure, and that the model fits the data of the example reasonably well. With respect to the theoretical question raised above, it should be mentioned that in a subsequent paper [5] the model is applied to an epidemiologic study that has frequently been used to test the closeness of fit for various biologically based cancer models, and its results are compared to those of some other convenient models. This paper demonstrates that the CD approach is also able to reproduce these data and leads to a fit of data very close to that achieved by the other models. Thus, these and the present findings indicate that assumptions of the present type with rather parsimonious biologic interpretation are sufficient to give a limited explanation to epidemiologic observations in terms of age, duration and intensity of exposure, and latency time. This implies, conversely, that models set up to establish biologic interpretations beyond this level might principally be unable to be proved empirically. Thus, the main result and value of the present and analogous investigations are theoretical; they provide evidence that it is an overin-

FITTING FOLLOW-UP STUDIES

159

terpretation to see in epidemiologic data any support for a biologic cancer model specified, for example, on a cellular level. Conversely, for practical epidemiologic purposes it turns out that a model of the present type that is specified in epidemiology-related terms (age, background exposure, additional exposure, latency, cancer hazard) might be sufficient or even superior. This underlines the practical reason for getting acquainted with the empirical properties of this type of model. Finally, one epidemiologically interesting point will shortly be sketched, namely the decreasing relative risks despite ongoing exposure, visible in Figure 1. The decrease occurs despite the strongly increasing hazard rate functions for both the background exposure and the background exposure plus the continuous additional exposure (Figure 2). However, Figure 2 demonstrates that the additional exposure of the welders leads only in the first part of the curves to a faster increase than among the nonexposed, while in the later part the increase is not as fast as that of the nonexposed. This leads at first to a strongly increasing ratio, that is, relative risk, and then, however, to a decrease (Figure 1). Mathematically, this is a consequence of the general property of the hazard rate functions being limited from above by the intensities of the Poisson processes (see [4]). Epidemiologic literature shows various examples (see [5]) of decreasing relative risk in spite of ongoing exposure, and it appears worthwhile to examine whether it is a consequence of poor quality of data or an effect related to the mentioned property of relative risk.

APPENDIX

1

According to Borgan [lo], the following conditions are sufficient ensure regularity of a counting process model with censoring:

to

(1) The proportion of processes still “at risk” at time t are asymptotically (i.e., numbers of observations +m) deterministic uniformly in finite time intervals. (2) The individual hazard rates and their derivatives up to the third order with respect to the parameters exist and are continuous functions, and they are bounded in the parameter range for all finite time intervals. (3) The individual hazard rates are strictly positive in the parameter range for all finite time intervals. (4) The covariance matrix is positive definite. Conditions

(1) and (3) follow directly

from the properties

of the CD

NIKOLAUS

160

BECKER

AND WERNER

RITTGEN

model. Condition 4 can only be demonstrated numerically within the given parameter region and is satisfied as well. The hazard rates depend on seven parameters, giving seven derivatives of the first order, 28 of the second order, and 84 of the third order. However, they can be written in a form in which the different terms can be analyzed with the aid of recursion arguments. Using this technique, it can be shown that condition (2) holds if the parameter range is chosen in such a way that ~~ and m ? are nonnegative and all other parameters strictly positive. A detailed presentation can be obtained from the authors on request. APPENDIX

2

The G statistic of Hcckman [17] resembles the usual x2 statistic on grouped data with the only difference that (0 - E)“/E is replaced by a quadratic form in (0 - E) in which the coefficient matrix is the inverse of the covariance matrix adjusted for parameter estimation. In the following, the notation of the G statistic is introduced and modified with respect to the present epidemiologic application. The density f(y]x,O) = - dS(y]x,O)/cly [minus the derivative of the survival function of formula (1) with respect to y] relates outcome J which is defined here as “age at failure (death with cancer),” to the covariate vector x = (t,,, t ,), consisting of the variates t,, (“age at onset of the additional exposure”) and t, (“age at cessation of the additional exposure”), and the parameter vector 0 =( ~,,rn,,~,~.,,tn,,~r,s). 0 is estimated by maximum likelihood based on the n individual observa. tions of the given cohort providing the estimator 0. The covariancc matrix Cc., can be approximated by minus the inverse of the information matrix I,:,, which is approximately [27. p. 1541

;li,,Z= I

d logf(ylx,@)

a@,

d

i

logJ‘(yIx,@)

i,j= 1,....p.

JO,

I

For the test of goodness of fit in discrete cells, select J disjunct intervals C, = (c,, c, + ,>, j = 0,. . . , J - 1, in advance (in the present example, according to the matrix used in Table 3). The method proposed by Heckman is not restricted to the special subdivision into those intervals outlined in his paper but works for any selection as long as it is disjunctive and maintains positive probabilities for each cell. The indicator function

d,, =

1

i 0

if y, t C,, otherwise

i=

l,...,n,

FITTING FOLLOW-UP

STUDIES

161

rererring to the observed cancer cases provides, in contrast to Heckman, only C:=, dij G 1 for all 12 cohort members instead of strict equality as assumed by Heckman. The reason is that not all observed persons die with cancer during the observation time. The only consequence is that the G statistic has J degrees of freedom instead of the J - 1 resulting with Heckman. Because the distribution F(ylx, 0) = 1 - S(ylx, 0) can be approximated, when A is small, by

F(yIx,@)=l-exp

the expected number the second expression cell, that is, by

[

-lh(s,x,@)ds = 1

1

A(s,x,@)~s,

of observations in each cell can be calculated by in formula (5) and is restricted to the respective

(wb and w; are onset and end of observation period). The individual variance-covariance matrix CI’ = ( w,‘~>, i = 1,. . . , n, for the cells arises from the binomial distribution by

w~.,=E(d,,,lx,,O),[l-E(di,,Ix,,O)], Of.k

=

-E(di,,IXI,O)‘(dl,kIX,,0),

Finally,letA’=(a!,!,),i=l,..., the expectations with

a,!,

IZ, denote

dE(d,,jlx, 0) k

=

a@,



>...7 J 7

j=l j,k=l,...,J;

j#k.

the matrix of derivations

j=l >..., J;

of

k = l,...,p.

Thus fi’ is a J X J matrix, A’ is a J X p matrix, and matrix. With that, the corrected variance-covariance obtained by

C,

is a p x p

matrix

C is

162

NIKOLAUS

The G statistic G=

&

[d, - E(d;lx;,O)]C

I

G is asymptotically

RIT-TGEN

follows by

,c I

BECKER AND WERNER

‘+ g [d; - E(d,Ixi,O)]? t

x2 distributed

I

with J degrees

of freedom.

We greatly appreciate the helpful comments of the referees. REFERENCES 1

2 3 4 5 6 7 8

9

I0

11 12

13 14

15 16

P. Armitage and R. Doll, The age distribution of cancer and a multi-stage theory of carcinogenesis, Br. J. Cunccr 8: I- 12 (1954). N. Becker, Cumulative damage models in cancer epidemiology: Application to human incidence and mortality data, Arch. EmSiron. Heulth 44:260-266 (1989). N. Becker, cumulative damage models of exposures with limited duration and host factors, Arch. Enr~iron. Health. 44:331-336 (1989). N. Becker, reliability models in cancer epidemiology, Biomet. .I. 31:727-74X ( 1989). N. Becker, Cigarette smoking and lung cancer: A reconsideration of the British doctors’ data with cumulative damage models, Epidemiology (1993), in press. N. Becker and W. Rittgcn, Some mathematical properties of CD models regarding their application in cancer epidemiology, Biomet. J. 32:3~ 15 (1990). N. Becker and W. Rittgen, Fitting cancer mortality data with cumulative damage models, M&r. Biosci. 10X:57-73 (1992). N. Becker, J. Claude. and R. Frentzcl-Beymc, Cancer risk of arc welders exposed to fumes containing chromium and nickel, Stand. J. Work Enr’iron. Health 11:75-X2 (1985). N. Becker, J. Chang-Claude, and R. Frentzel-Beymc, Risk of cancer for arc welders in the Federal Republic of Germany: Results of a second follow-up (19X3-81, Br. J. Ind. Med. 4X:675-683 (1991). A. Borgan, Maximum likelihood estimation in parametric counting process models, with applications to censored failure time data, Stand. J. Stut. 11:I-16 (1984). N. E. Breslow and N. E. Day, Statistical Methods in Cancer Reseurch, Vol. 2, The Design and Atlalysis of Cohort Studies, IARC, Sci. Publ. No. X2, 1987. C. C. Brown and K. C. Chu. A new method for the analysis of cohort studies: Implications of the multistage theory of carcinogenesis applied to occupational arsenic exposure, Emiron. Health Perspect. 50:293-30X (1983). R. B. D’Agostino and M. A. Stephens, Goodness-of-Fit Techniques. Marcel Dckker, New York, 1988. M. H. Dong, C. K. Redmond, S. Mazumdar, and J. P. Costantino, A multistage approach to the cohort analysis of lifetime lung cancer risk among steelworkers exposed to coke oven emissions, Am. J. Epidemiol. 328:X6&873 (1988). E. J. Dudewicz and S. N. Mishra. Modern Mathematical Statistics, Wiley, New York, 198X. J. J. Cart, D. Krewski, P. N. Lee, R. E. Tarone, and J. Wahrendorf, Statisticcrl Methods in Cancer Research, Vol. 3, The Design and Analysis of Long-Term Animal Experiments, IARC. Lyon, 19X6.

FITTING FOLLOW-UP 17

18 19

20

21 22 23 24

25 26 21 28 29 30

STUDIES

163

J. Heckman, lhe x2 goodness of tit statistic for models with parameters estimated from microdata, Econometrica 52:1543-1547 (1984). IMSL, User’s Manual, IMSL MATH/LIBRARY, FORTRAN Subroutines for Mathematical Applications, Version 1.l, IMSL, Houston, 1989. A. Kopp and C. J. Portier, A note on approximating the cumulative distribution function of the time to tumor onset in multistage models, Biometrics 45:12591264 (1989). A. Kopp-Schneider and C. J. Portier, Distinguishing between models of carcinogenesis: The role of clonal expansion, Fundam. Appl. Toxicol. 17:601-613 (1991). J. R. Michael and W. R. Schucany, A new approach to testing goodness of fit for censored samples, Technometrics 21:435-441 (1979). S. H. Moolgavkar, Carcinogenesis modeling: From molecular biology to epidemiology, Ann. Ret,. Public Health 7:151-169 (1986). S. H. Moolgavkar and A. G. Knudson, Mutation and cancer: A model for human carcinogenesis, J Nutl. Cancer Inst. 66:103771052 (1981). S. H. Moolgavkar, N. E. Day, and R. G. Stevens, Two-stage model for carcinogenesis: Epidemiology of breast cancer in females, J. N&l. Cancer Inst. 65:559569 (1980). W. Rittgen and N. Becker, The consideration of latency periods in cumulative damage models, Biomet. J. 33:505-515 (1991). D. Schoenfeld, Chi-square goodness-of-fit tests for the proportional hazards regression model, Biomettika 67:145-153 (1980). R. J. Serfling, Approximation Theorems of Mathematical Statistics, Wiley, New York, 1980. D. L. Snyder, Random Point Processes, Wiley, New York, 1975. A. A. Tsiatis, A note on a goodness-of-fit test for the logistic regression model, Biometrika 67:250-251 (1980). A. S. Whittemore and J. B. Keller, Quantitative theories of carcinogenesis, SIAM Rel,. 20:1-30 (1978). J.