Economics Letters 18 (1985) 35-38
North-Holland

RESIDUAL ANALYSIS FOR CENSORED DURATION DATA

Tony LANCASTER
University of Hull, Hull, HU6 7RX, UK

Andrew CHESHER
University of Bristol, Bristol, BS8 1HY, UK

Received 31 October 1984

In this note we show how residuals defined for right censored duration data, such as arise in, for example, labour market studies, feature in diagnostic statistics to detect omitted covariates and neglected heterogeneity.

1. Introduction

In the analysis of complete duration data residuals play an important role, as described by Lancaster (1983a,b), but in practice duration data are often subject to right censoring, in which case the formulae for residuals used by Lancaster are inapplicable. In this paper we describe the construction of residuals for right censored duration data and give examples of their application.

We consider models for the time, $T$, spent in some state, for example the time spent in unemployment by a member of a random sample from the flow into unemployment. We write the hazard function governing exit as $\theta(u, x, \beta) > 0$, where $u$ is time since entry and $x$ and $\beta$ are vectors of, respectively, covariates and parameters. The survivor function is

$P(T > t) = \exp(-\varepsilon(t)),$   (1)

where $\varepsilon(t)$ is the integrated hazard function, $\int_0^t \theta(u, x, \beta)\,du$, whose arguments $x$ and $\beta$ we suppress. Because $\varepsilon(T)$ is an increasing function of $T$, $P(T > t)$ is equal to $P(\varepsilon(T) > \varepsilon(t))$, and since this probability is, by (1), just $\exp(-\varepsilon(t))$, it is evident that the $\varepsilon(T)$ are iid unit mean exponential variates. Replacing unknown parameters in $\varepsilon(t)$ by their ML estimators we obtain $\hat\varepsilon$, which is a generalised residual in the sense of Cox and Snell (1968). The use of generalised residuals in the analysis of uncensored duration data is described by Lancaster (1983a,b), who terms the $\varepsilon(T)$ 'generalised errors'.
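
The claim that the integrated hazard evaluated at $T$ is a unit exponential variate can be checked by simulation. The sketch below is only an illustration under assumptions that are not in the note: a Weibull hazard $\alpha u^{\alpha-1}\exp(\beta_0 + \beta_1 x)$, a single standard normal covariate, and arbitrary parameter values. The function name `integrated_hazard` is hypothetical.

```python
import numpy as np

def integrated_hazard(t, x, alpha, beta0, beta1):
    """eps(t) for the assumed Weibull hazard alpha * u**(alpha - 1) * exp(beta0 + beta1 * x)."""
    return t ** alpha * np.exp(beta0 + beta1 * x)

rng = np.random.default_rng(0)
n = 200_000
alpha, beta0, beta1 = 1.5, -0.5, 0.8        # hypothetical parameter values
x = rng.normal(size=n)                      # hypothetical covariate

# Draw durations with survivor function exp(-t**alpha * exp(beta0 + beta1 * x)):
# rng.weibull(alpha) has survivor exp(-w**alpha), so rescale it observation by observation.
scale = np.exp(-(beta0 + beta1 * x) / alpha)
t = scale * rng.weibull(alpha, size=n)

# Generalised errors evaluated at the true parameters.
eps = integrated_hazard(t, x, alpha, beta0, beta1)
print("mean:", eps.mean(), "variance:", eps.var())
```

Both printed moments should be close to one, as expected for unit exponential variates, whatever the values of `alpha`, `beta0` and `beta1`.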

0165-1765/85/$3.30 © 1985, Elsevier Science Publishers B.V. (North-Holland)

2. Residuals for censored duration data

When data are subject to right censoring, realisations of $T$ are not observed when $T$ exceeds a censoring time, $s$, and the generalised residuals $\hat\varepsilon$ cannot be calculated because $t$ is not always known. To proceed, let $y$ be a realisation of $Y = \min(T, s)$ and, following the procedure used by Chesher and Irish (1984) in the context of the normal linear model, and mentioned in the biometric duration literature [Lagakos (1979)], consider a residual $e$ equal to $\hat\varepsilon$ for uncensored observations and, for censored observations, equal to the ML estimate of the conditional mean of the generalised error, $\varepsilon(T)$, given $T > s$. Since the generalised errors are unit mean exponential variates we have $E(\varepsilon(T) \mid T > s) = \varepsilon(s) + 1$ and, letting $C$ indicate censoring with $C = 1$ if $T > s$ and zero otherwise, we can express the residual corresponding to the observation $y$ as

$e = e(y) = (1 - c)\,\hat\varepsilon(y) + c\,(\hat\varepsilon(y) + 1) = \hat\varepsilon(y) + c,$

where $c$ is a realisation of $C$. Straightforward calculations give the moments of $e(Y)$ conditional on $s$ and $x$,

$E(e(Y) \mid x, s) = 1,$

which, we note, depends neither on $s$ nor on $x$, and

$\mathrm{var}(e(Y) \mid x, s) = 1 - \exp(-\varepsilon(s)) = 1 - \pi,$   (2)

where $\pi = P(C = 1 \mid x, s)$ is the conditional probability of censoring. We note now that because $E(e(Y) \mid x, s)$ is constant, the unconditional variance of $e(Y)$ is

$\mathrm{var}(e(Y)) = 1 - E(\pi),$   (3)

where expectation is with respect to $x$ and $s$. Though the $e(Y)$ are not exponential variates, they do have unit mean and they are uncorrelated with $x$ and $s$.
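
The moment results (2) and (3) can be illustrated numerically. The following sketch assumes, purely for illustration, a constant hazard $\exp(\beta_0 + \beta_1 x)$, censoring times drawn uniformly and independently of $T$, and true rather than estimated parameters. The helper `censored_residual` and all numerical values are hypothetical.

```python
import numpy as np

def censored_residual(eps_y, c):
    """e(y) = eps(y) + c: the integrated hazard at y, plus one if the observation is censored."""
    return eps_y + c

rng = np.random.default_rng(1)
n = 200_000
beta0, beta1 = -1.0, 0.5                    # hypothetical parameter values
x = rng.normal(size=n)
rate = np.exp(beta0 + beta1 * x)            # assumed constant hazard, so eps(t) = rate * t

t = rng.exponential(1.0 / rate)             # latent durations T
s = rng.uniform(0.5, 5.0, size=n)           # censoring times, independent of T
y = np.minimum(t, s)                        # observed Y = min(T, s)
c = (t > s).astype(float)                   # censoring indicator C

e = censored_residual(rate * y, c)

print("mean of e:            ", e.mean())
print("variance of e:        ", e.var())
print("1 - censored share:   ", 1.0 - c.mean())
print("corr(e, x), corr(e, s):", np.corrcoef(e, x)[0, 1], np.corrcoef(e, s)[0, 1])
```

The sample mean should be close to one, the sample variance close to one minus the censored share, and both correlations close to zero, in line with (2), (3) and the remark that $e(Y)$ is uncorrelated with $x$ and $s$.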

3. Applications

When the censoring process is uninformative,¹ which means roughly that the fact that an observation is censored tells us nothing about $T$ except that it exceeds $s$, the log-likelihood function for $N$ independent realisations is

$L(\beta) = \sum_{i=1}^{N} \{(1 - c_i) \log \theta(y_i, x_i, \beta) - \varepsilon_i(y_i)\}.$

Generally $\theta$ will contain a multiplicative scale factor, in which case $\theta$ may be written as $\exp(\beta_0)\,\theta_0(u, x_i, \beta)$, and the mean score for $\beta_0$ is

$N^{-1}\,\partial L/\partial \beta_0 = N^{-1} \sum_{i=1}^{N} \{(1 - c_i) - \varepsilon_i(y_i)\} = N^{-1} \sum_{i=1}^{N} \{1 - e_i(y_i)\}.$

Since this mean score is zero at the ML estimates, the sample mean of the residuals $\hat e_i$ is one, and equal to the expectation of the $e_i(y_i)$'s.

¹ See Kalbfleisch and Prentice (1980, pp. 119-122) for a precise definition.
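
The unit sample mean of the residuals can be seen exactly in the simplest case. The sketch below assumes a hazard consisting of the scale factor alone, $\theta = \exp(\beta_0)$, for which the ML estimate has the closed form $\exp(\hat\beta_0) = (N - N_c)/\sum_i y_i$, with $N_c$ the number of censored observations. The function names and the simulated data are hypothetical.

```python
import numpy as np

def fit_scale_only(y, c):
    """ML estimate of beta0 for the assumed hazard exp(beta0) under right censoring."""
    # The score sum((1 - c_i) - exp(beta0) * y_i) vanishes at this value.
    return np.log((1.0 - c).sum() / y.sum())

def residuals(y, c, beta0_hat):
    """Residuals e_i = estimated integrated hazard at y_i, plus c_i."""
    return np.exp(beta0_hat) * y + c

rng = np.random.default_rng(2)
n = 500
t = rng.exponential(2.0, size=n)            # hypothetical latent durations
s = rng.uniform(0.5, 4.0, size=n)           # hypothetical censoring times
y, c = np.minimum(t, s), (t > s).astype(float)

beta0_hat = fit_scale_only(y, c)
e_hat = residuals(y, c, beta0_hat)
print("sample mean of residuals:", e_hat.mean())
```

Up to floating point error the printed mean equals one, whatever the censoring pattern, because the mean score for $\beta_0$ is zero at the ML estimate.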

When the hazard function contains a factor $\exp(x_i'\beta)$ the mean score for $\beta$ is $N^{-1} \sum_{i=1}^{N} \{1 - e_i(y_i)\} x_i$, from which it follows that the sample covariances of the residuals and covariates are zero. Evidently score tests of hypotheses specifying elements of $\beta$ as zero examine sample covariances of residuals and covariates, which suggests graphical procedures to detect omitted covariates involving plots of the $\hat e_i$ against candidate covariates.²

In practice a common and serious specification error when constructing models for duration data is the failure to allow for across individual variation in the hazard function, which may depend upon unobservable as well as observable covariates [see Chesher and Lancaster (1983)]. The Information Matrix (IM) test [White (1982)] provides a useful diagnostic to detect such misspecifications [Chesher (1984)] and, as we now show, this statistic is, for right censored data, a simple function of the residuals $\hat e_i$.

Suppose that it is the scale parameter $\beta_0$ that is possibly subject to across individual variation. Then the IM procedure compares to zero the sum of $N^{-1}\,\partial^2 L(\beta)/\partial \beta_0^2$ and the mean squared contributions to the score for $\beta_0$, both evaluated at the ML estimates. For the right censored duration model,

$N^{-1}\,\partial^2 L(\beta)/\partial \beta_0^2 = -N^{-1} \sum_{i=1}^{N} \varepsilon_i(y_i),$

which, at the ML estimates, is $-(1 - N_c/N)$, where $N_c$ is the total number of censored observations, while the mean of the squared score contributions is $N^{-1} \sum_{i=1}^{N} (\hat e_i - 1)^2$. So the IM test procedure examines whether the sample variance of the residuals is approximately equal to one minus the sample proportion of censored observations. Recall that eq. (3) tells us that the variance of the $e_i(Y_i)$ (estimated by the variance of the residuals) is equal to one minus the marginal probability of censoring (estimated by one minus the proportion of observations censored). To detect variations in $\beta_0$ we examine whether (3) holds approximately in the sample.

With iid covariates the asymptotically $\chi^2_{(1)}$ statistic can be computed as $N$ times the uncentered $R^2$ from a least squares estimation in which the left-hand side variable is constant and the right-hand side variables are $(\hat e_i - 1)^2 - (\hat e_i - c_i)$ and the individual contributions to the scores for all the parameters of the model [see Chesher (1983), Lancaster (1984)].

For duration data subject to general schemes of censoring and grouping we can define residuals as above, using for censored observations the ML estimate of the expectation of $\varepsilon_i(T_i)$ conditional on available information concerning the location of $T_i$. To detect omitted covariates we then proceed just as before, examining the sample covariance of the residuals and the candidate covariates. The IM statistic to detect variation in the scale parameter is a function of these residuals and of 'second moment' residuals defined as $\hat\varepsilon_i^2$ for uncensored observations and, for censored observations, as the ML estimate of the expectation of $\varepsilon_i^2$ given available information concerning the location of $T_i$. For a similar result in the context of the normal linear model see Chesher and Irish (1984).
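
The $N R^2$ computation of the IM statistic can be sketched as follows. This is an illustration under assumptions that are not in the note: an exponential regression hazard $\exp(\beta_0 + \beta_1 x_i)$ fitted by a plain Newton-Raphson iteration, uniformly distributed censoring times, and gamma distributed multiplicative heterogeneity in the misspecified case. All function names and parameter values are hypothetical.

```python
import numpy as np

def fit_exponential(y, c, z, iters=50, tol=1e-10):
    """Newton-Raphson ML for the assumed hazard exp(z'beta); column 0 of z must be a constant."""
    beta = np.zeros(z.shape[1])
    beta[0] = np.log((1.0 - c).sum() / y.sum())     # scale-only ML estimate as starting value
    for _ in range(iters):
        eps = np.exp(z @ beta) * y                  # integrated hazards at y_i
        score = z.T @ (1.0 - c - eps)               # sum of (1 - c_i - eps_i) z_i
        hess = -(z * eps[:, None]).T @ z            # minus sum of eps_i z_i z_i'
        step = np.linalg.solve(hess, score)
        beta = beta - step
        if np.max(np.abs(step)) < tol:
            break
    return beta

def im_statistic(y, c, z):
    """N times the uncentered R^2 from regressing a constant on the IM indicator and the scores."""
    n = len(y)
    beta = fit_exponential(y, c, z)
    e = np.exp(z @ beta) * y + c                    # residuals e_i
    scores = (1.0 - e)[:, None] * z                 # individual score contributions
    d = (e - 1.0) ** 2 - (e - c)                    # (e_i - 1)^2 - (e_i - c_i)
    w = np.column_stack([d, scores])
    coef, *_ = np.linalg.lstsq(w, np.ones(n), rcond=None)
    fitted = w @ coef
    r2 = fitted @ fitted / n                        # uncentered R^2 (left-hand side is all ones)
    return n * r2

def simulate(n, rng, heterogeneity):
    """Hypothetical data: exponential durations, optionally with gamma(1, 1) heterogeneity."""
    x = rng.normal(size=n)
    v = rng.gamma(1.0, 1.0, size=n) if heterogeneity else np.ones(n)
    t = rng.exponential(1.0 / (v * np.exp(-0.5 + 0.5 * x)))
    s = rng.uniform(0.5, 4.0, size=n)
    return np.minimum(t, s), (t > s).astype(float), np.column_stack([np.ones(n), x])

rng = np.random.default_rng(3)
stat_hom = im_statistic(*simulate(5000, rng, heterogeneity=False))
stat_het = im_statistic(*simulate(5000, rng, heterogeneity=True))
print("no heterogeneity:   ", stat_hom)
print("gamma heterogeneity:", stat_het)
```

Each statistic would be compared with a $\chi^2_{(1)}$ critical value, for example 3.84 at the 5 per cent level. In this design the statistic computed from the heterogeneous sample should be much the larger of the two.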

² Chesher and Irish (1984) discuss the interpretation of plots of residuals obtained from censored data.

References

Chesher, A.D., 1983, The information matrix test, simplified calculation via a score test interpretation, Economics Letters 13, no. 1, 45-48.
Chesher, A.D., 1984, Testing for neglected heterogeneity, Econometrica 52, no. 4, 865-872.
Chesher, A.D. and M. Irish, 1984, Residuals and diagnostics for probit, Tobit and related models, Discussion paper no. 84/152 (University of Bristol, Bristol).
Chesher, A.D. and T. Lancaster, 1983, The estimation of models of labour market behaviour, Review of Economic Studies 50, 609-624.
Cox, D.R. and E.J. Snell, 1968, A general definition of residuals, Journal of the Royal Statistical Society B 30, 248-275.
Kalbfleisch, J.D. and R.L. Prentice, 1980, The statistical analysis of failure time data (Wiley, New York).
Lagakos, S.W., 1979, General right censoring and its impact on the analysis of survival data, Biometrics 35, 1066137.
Lancaster, T., 1983a, Generalised residuals and heterogeneous duration models: The exponential case, Bulletin of Economic Research 35, 71-85.
Lancaster, T., 1983b, Generalised residuals and heterogeneous duration models with applications to the Weibull model, Hull Economic Research Paper 114, Dec. (University of Hull, Hull).
Lancaster, T., 1984, The covariance matrix of the information matrix test, Econometrica 52, no. 4, 1051-1054.
White, H., 1982, Maximum likelihood estimation of misspecified models, Econometrica 50, no. 1, 1-25.