Residual analysis for censored duration data

Economics Letters 18 (1985) 35-38
North-Holland

Tony LANCASTER
University of Hull, Hull, HU6 7RX, UK

Andrew CHESHER
University of Bristol, Bristol, BS8 1HY, UK

Received 31 October 1984

In this note we show how residuals defined for right censored duration data, such as arise in, for example, labour market studies, feature in diagnostic statistics to detect omitted covariates and neglected heterogeneity.

1. Introduction

In the analysis of complete duration data residuals play an important role, as described by Lancaster (1983a,b), but in practice duration data are often subject to right censoring, in which case the formulae for residuals used by Lancaster are inapplicable. In this paper we describe the construction of residuals for right censored duration data and give examples of their application.

We consider models for the time, $T$, spent in some state, for example the time spent in unemployment by a member of a random sample from the flow into unemployment. We write the hazard function governing exit as $\theta(u, x, \beta) > 0$, where $u$ is time since entry and $x$ and $\beta$ are vectors of, respectively, covariates and parameters. The survivor function is

$$P(T > t) = \exp(-\varepsilon(t)), \qquad (1)$$

where $\varepsilon(t)$ is the integrated hazard function, $\int_0^t \theta(u, x, \beta)\,du$, whose arguments $x$ and $\beta$ we suppress. Because $\varepsilon(T)$ is an increasing function of $T$, $P(T > t)$ is equal to $P(\varepsilon(T) > \varepsilon(t))$, and since this probability is, by (1), just $\exp(-\varepsilon(t))$, it is evident that the $\varepsilon(T)$ are iid unit mean exponential variates. Replacing unknown parameters in $\varepsilon(t)$ by their ML estimators we obtain $\hat\varepsilon$, which is a generalised residual in the sense of Cox and Snell (1968). The use of generalised residuals in the analysis of uncensored duration data is described by Lancaster (1983a,b), who terms the $\varepsilon(T)$ 'generalised errors'.
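The claim that the integrated hazard evaluated at the duration is a unit exponential variate can be checked by simulation. The following is a minimal sketch, assuming a Weibull hazard with a single covariate; the functional form, the parameter values and the name `integrated_hazard` are illustrative assumptions and do not come from the note.

```python
import numpy as np

def integrated_hazard(t, x, beta0, beta1, alpha):
    """Integrated hazard of an assumed Weibull model:
    hazard = alpha * exp(beta0 + beta1 * x) * u**(alpha - 1)."""
    return np.exp(beta0 + beta1 * x) * t ** alpha

rng = np.random.default_rng(0)
n = 200_000
alpha, beta0, beta1 = 1.5, -0.5, 0.8      # illustrative values only

x = rng.normal(size=n)                    # hypothetical covariate
lam = np.exp(beta0 + beta1 * x)

# rng.weibull(alpha) has survivor function exp(-t**alpha); rescaling by
# lam**(-1/alpha) gives durations with survivor function exp(-lam * t**alpha).
t = rng.weibull(alpha, size=n) * lam ** (-1.0 / alpha)

# Generalised errors: integrated hazard at the realised duration,
# evaluated at the true parameters.
eps = integrated_hazard(t, x, beta0, beta1, alpha)

print("mean:", eps.mean())
print("variance:", eps.var())
print("correlation with x:", np.corrcoef(eps, x)[0, 1])
```

If the model is correctly specified, the printed mean and variance should both be close to one and the correlation with the covariate close to zero, whatever values are chosen for the parameters.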

0165-1765/85/$3.30 © 1985, Elsevier Science Publishers B.V. (North-Holland)

2. Residuals for censored duration data

When data are subject to right censoring, realisations of $T$ are not observed when $T$ exceeds a censoring time, $s$, and the generalised residuals $\hat\varepsilon$ cannot be calculated because $t$ is not always known. To proceed, let $y$ be a realisation of $Y = \min(T, s)$ and, following the procedure used by Chesher and Irish (1984) in the context of the normal linear model, and mentioned in the biometric duration literature [Lagakos (1979)], consider a residual $\hat e$ equal to $\hat\varepsilon$ for uncensored observations and, for censored observations, equal to the ML estimate of the conditional mean of the generalised error, $\varepsilon(T)$, given $T > s$. Since the generalised errors are unit mean exponential variates we have $E(\varepsilon(T) \mid T > s) = \varepsilon(s) + 1$ and, letting $C$ indicate censoring with $C = 1$ if $T > s$ and zero otherwise, we can express the residual corresponding to the observation $y$ as

$$\hat e = \hat e(y) = (1 - c)\,\hat\varepsilon(y) + c\,(\hat\varepsilon(y) + 1) = \hat\varepsilon(y) + c,$$

where $c$ is a realisation of $C$. Straightforward calculations give the moments of $e(Y)$ conditional on $s$ and $x$,

$$E(e(Y) \mid x, s) = 1,$$

which, we note, depends neither on $s$ nor on $x$, and

$$\operatorname{var}(e(Y) \mid x, s) = 1 - \exp(-\varepsilon(s)) = 1 - \pi, \qquad (2)$$

where $\pi = P(C = 1 \mid x, s)$ is the conditional probability of censoring. We note now that because $E(e(Y) \mid x, s)$ is constant, the unconditional variance of $e(Y)$ is

$$\operatorname{var}(e(Y)) = 1 - E(\pi), \qquad (3)$$

where expectation is with respect to $x$ and $s$. Though the $e(Y)$ are not exponential variates, they do have unit mean and they are uncorrelated with $x$ and $s$.
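The moment results (2) and (3) can be illustrated numerically. The sketch below assumes the same kind of Weibull model with one covariate and draws censoring times independently of the durations; the model, the parameter values and the distribution of the censoring times are all assumptions made for illustration. Residuals are evaluated at the true parameters, so they correspond to $e(Y)$ rather than to the estimated $\hat e$.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
alpha, beta0, beta1 = 1.5, -0.5, 0.8      # illustrative values only

x = rng.normal(size=n)                    # hypothetical covariate
lam = np.exp(beta0 + beta1 * x)
t = rng.weibull(alpha, size=n) * lam ** (-1.0 / alpha)   # latent durations

s = rng.uniform(0.5, 3.0, size=n)         # assumed censoring times
y = np.minimum(t, s)                      # observed duration Y = min(T, s)
c = (t > s).astype(float)                 # censoring indicator C

# Residual: integrated hazard at y, plus one for censored observations.
e = lam * y ** alpha + c

# Conditional probability of censoring, pi = exp(-integrated hazard at s).
pi = np.exp(-lam * s ** alpha)

print("mean of e:", e.mean())
print("variance of e:", e.var())
print("1 - E(pi):", 1.0 - pi.mean())
print("correlation of e with x:", np.corrcoef(e, x)[0, 1])
print("correlation of e with s:", np.corrcoef(e, s)[0, 1])
```

The sample variance should be close to one minus the average censoring probability, and noticeably below the value of one that would hold for uncensored exponential errors.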

3. Applications

When the censoring process is uninformative,¹ which means roughly that the fact that an observation is censored tells us nothing about $T$ except that it exceeds $s$, the log-likelihood function for $N$ independent realisations is

$$L(\beta) = \sum_{i=1}^{N} \{(1 - c_i)\log\theta(y_i, x_i, \beta) - \varepsilon_i(y_i)\}.$$

Generally $\theta$ will contain a multiplicative scale factor, in which case $\theta$ may be written as $\exp(\beta_0)\,\theta_0(u, x_i, \beta)$, and the mean score for $\beta_0$ is

$$N^{-1}\,\partial L/\partial\beta_0 = N^{-1}\sum_{i=1}^{N} \{(1 - c_i) - \varepsilon_i(y_i)\} = N^{-1}\sum_{i=1}^{N} \{1 - e_i(y_i)\}.$$

Since this mean score is zero at the ML estimates, the sample mean of the residuals $\hat e_i$ is one, and equal to the expectation of the $e_i(y_i)$'s.

¹ See Kalbfleisch and Prentice (1980, pp. 119-122) for a precise definition.
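The unit sample mean of the residuals at the ML estimates is easy to verify in the simplest case. The sketch below assumes a pure exponential model (constant hazard $\exp(\beta_0)$, no covariates), for which the ML estimate of the hazard under right censoring has the closed form "number of uncensored observations divided by total observed time". The simulated design is an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1_000

t = rng.exponential(scale=2.0, size=n)    # latent durations, true hazard 0.5
s = rng.uniform(1.0, 4.0, size=n)         # assumed censoring times
y = np.minimum(t, s)
c = (t > s).astype(float)

# ML estimate of exp(beta_0): setting the score sum{(1 - c_i) - rate * y_i}
# to zero gives rate_hat = (number uncensored) / (total observed time).
rate_hat = (n - c.sum()) / y.sum()

# Residuals: estimated integrated hazard at y, plus the censoring indicator.
e_hat = rate_hat * y + c

print("proportion censored:", c.mean())
print("sample mean of residuals:", e_hat.mean())
```

The sample mean equals one up to floating point error for any data set, because it is the first-order condition for the scale parameter restated in terms of the residuals.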


When the hazard function contains a factor $\exp(x_i'\beta_1)$ the mean score for $\beta_1$ is $N^{-1}\sum_{i=1}^{N}\{1 - e_i(y_i)\}x_i$, from which it follows that, at the ML estimates, the sample covariances of the residuals and covariates are zero. Evidently score tests of hypotheses specifying elements of $\beta_1$ as zero examine sample covariances of residuals and covariates, which suggests graphical procedures to detect omitted covariates involving plots of the $\hat e_i$ against candidate covariates.²

In practice a common and serious specification error when constructing models for duration data is the failure to allow for across individual variation in the hazard function, which may depend upon unobservable as well as observable covariates [see Chesher and Lancaster (1983)]. The Information Matrix (IM) test [White (1982)] provides a useful diagnostic to detect such misspecifications [Chesher (1984)] and, as we now show, this statistic is, for right censored data, a simple function of the residuals $\hat e_i$.

Suppose that it is the scale parameter $\beta_0$ that is possibly subject to across individual variation. Then the IM procedure compares to zero the sum of $N^{-1}\partial^2 L(\beta)/\partial\beta_0^2$ and the mean squared contributions to the score for $\beta_0$, both evaluated at the ML estimates. For the right censored duration model,

$$N^{-1}\,\partial^2 L(\beta)/\partial\beta_0^2 = -N^{-1}\sum_{i=1}^{N}\varepsilon_i(y_i),$$

which, at the ML estimates, is $-(1 - N_c/N)$, where $N_c$ is the total number of censored observations, while the mean of the squared score contributions is $N^{-1}\sum_{i=1}^{N}(\hat e_i - 1)^2$. So the IM test procedure examines whether the sample variance of the residuals is approximately equal to one minus the sample proportion of censored observations. Recall that eq. (3) tells us that the variance of the $e_i(Y_i)$ (estimated by the variance of the residuals) is equal to one minus the marginal probability of censoring, itself estimated by the proportion of observations censored. To detect variations in $\beta_0$ we examine whether (3) holds approximately in the sample.

With iid covariates the asymptotically $\chi^2_{(1)}$ statistic can be computed as $N$ times the uncentered $R^2$ from a least squares estimation in which the left-hand side variable is constant and the right-hand side variables are $(\hat e_i - 1)^2 - (\hat e_i - c_i)$ and the individual contributions to the scores for all the parameters of the model [see Chesher (1983), Lancaster (1984)].

For duration data subject to general schemes of censoring and grouping we can define residuals as above, using for censored observations the ML estimate of the expectation of $\varepsilon_i(T_i)$ conditional on available information concerning the location of $T_i$. To detect omitted covariates we then proceed just as before, examining the sample covariance of the residuals and the candidate covariates. The IM statistic to detect variation in the scale parameter is a function of these residuals and of 'second moment' residuals defined as $\hat\varepsilon_i^2$ for uncensored observations and, for censored observations, as the ML estimate of the expectation of $\varepsilon_i^2$ given available information concerning the location of $T_i$. For a similar result in the context of the normal linear model see Chesher and Irish (1984).
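The regression form of the IM statistic can be sketched for the simplest case. The code below assumes an exponential model with a single scale parameter fitted to right censored data, so the only score contribution is $1 - \hat e_i$. The gamma mixing distribution used to generate neglected heterogeneity, the common censoring time and the helper names `exponential_residuals` and `im_statistic` are illustrative assumptions, not part of the note.

```python
import numpy as np

def exponential_residuals(y, c):
    """Residuals from an exponential model fitted by ML to censored data."""
    rate_hat = (y.size - c.sum()) / y.sum()
    return rate_hat * y + c

def im_statistic(e, c):
    """N times the uncentred R^2 from regressing a constant on the IM
    indicator and the score contributions for the scale parameter."""
    d = (e - 1.0) ** 2 - (e - c)          # squared score plus second derivative
    g = 1.0 - e                           # score contributions for beta_0
    z = np.column_stack([d, g])
    ones = np.ones(e.size)
    coef, *_ = np.linalg.lstsq(z, ones, rcond=None)
    fitted = z @ coef
    # Uncentred R^2 is fitted'fitted / ones'ones, so N * R^2 = fitted'fitted.
    return float(fitted @ fitted)

rng = np.random.default_rng(3)
n, s = 20_000, 2.0                        # common censoring time (assumed)

# Homogeneous design: every individual has hazard one.
t_hom = rng.exponential(1.0, size=n)
# Heterogeneous design: individual hazards v drawn from a gamma
# distribution with mean one (an assumed mixing distribution).
v = rng.gamma(shape=2.0, scale=0.5, size=n)
t_het = rng.exponential(1.0, size=n) / v

results = {}
for name, t in [("homogeneous", t_hom), ("heterogeneous", t_het)]:
    y = np.minimum(t, s)
    c = (t > s).astype(float)
    e = exponential_residuals(y, c)
    results[name] = (e.var(), 1.0 - c.mean(), im_statistic(e, c))
    print(name, "residual variance:", results[name][0],
          "1 - Nc/N:", results[name][1], "IM statistic:", results[name][2])
```

In the homogeneous design the residual variance should be close to one minus the proportion censored and the statistic should behave like a draw from a chi-squared distribution with one degree of freedom. In this particular heterogeneous design the two quantities should separate and the statistic should be far larger. A model with covariates would add one column of score contributions per parameter to the regression.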

² Chesher and Irish (1984) discuss the interpretation of plots of residuals obtained from censored data.

References

Chesher, A.D., 1983, The information matrix test: Simplified calculation via a score test interpretation, Economics Letters 13, no. 1, 45-48.
Chesher, A.D., 1984, Testing for neglected heterogeneity, Econometrica 52, no. 4, 865-872.
Chesher, A.D. and M. Irish, 1984, Residuals and diagnostics for probit, Tobit and related models, Discussion paper no. 84/152 (University of Bristol, Bristol).
Chesher, A.D. and T. Lancaster, 1983, The estimation of models of labour market behaviour, Review of Economic Studies 50, 609-624.
Cox, D.R. and E.J. Snell, 1968, A general definition of residuals, Journal of the Royal Statistical Society B30, 248-275.
Kalbfleisch, J.D. and R.L. Prentice, 1980, The statistical analysis of failure time data (Wiley, New York).
Lagakos, S.W., 1979, General right censoring and its impact on the analysis of survival data, Biometrics 35, 106-137.
Lancaster, T., 1983a, Generalised residuals and heterogeneous duration models: The exponential case, Bulletin of Economic Research 35, 71-85.
Lancaster, T., 1983b, Generalised residuals and heterogeneous duration models with applications to the Weibull model, Hull Economic Research Paper 114, Dec. (University of Hull, Hull).
Lancaster, T., 1984, The covariance matrix of the information matrix test, Econometrica 52, no. 4, 1051-1054.
White, H., 1982, Maximum likelihood estimation of misspecified models, Econometrica 50, no. 1, 1-25.