Statistics & Probability Letters 63 (2003) 287–293
Large sample results under biased sampling when covariables are present

Jacobo de Uña-Álvarez

Departamento de Estadística e Investigación Operativa, Facultad de Ciencias Económicas y Empresariales, Universidad de Vigo, Campus Universitario Lagoas-Marcosende, 36310 Vigo (Pontevedra), Spain

E-mail: [email protected]

Received December 2000; received in revised form January 2003
Abstract

In this work, we introduce and analyze new estimation procedures when a $(p+1)$-dimensional variable $(U,T)$ is sampled under selection bias models. Here, $T$ denotes a time of interest and $U=(u^1,\dots,u^p)$ is a vector of covariables. Applications include the estimation of (conditional) cumulative hazard and mean residual time functions, and inference about the regression curve $m(u)=E(\phi(T)\,|\,U=u)$.

Keywords: Covariables; Least squares; Regression; Selection bias; Weighted distributions
1. Introduction

Let $(U,T)$ be a random vector, where $T$ is a positive random variable (e.g. a lifetime, unemployment time, or failure time) and $U=(u^1,\dots,u^p)^t$ is a random vector of $p$ covariates. The basic problem considered in this work is that of the estimation of the joint distribution function of $(U,T)$, $F_{U,T}(x,y)=P(U\le x,\ T\le y)$, given a random sample of observations $(X_i,Y_i)$, $1\le i\le n$, not from $F_{U,T}$ but from its biased distribution
$$F_{X,Y}(x,y)=P(X\le x,\ Y\le y)=W^{-1}\int_{-\infty}^{x}\int_{-\infty}^{y}w(u,t)\,F_{U,T}(du,dt) \qquad (1.1)$$
with $W=\int w(u,t)\,F_{U,T}(du,dt)$ (assumed to exist). Here, $w$ denotes a known positive function. Implicitly, (1.1) specifies that the (relative) probability of sampling a pair $(u,t)$ is proportional to the size of $w(u,t)$. The case $w(u,t)=t$ defines the so-called length-biased scenario, in which the
larger the length of the time, the greater the probability of being observed. Note that the randomly sampled case is recovered by taking $w\equiv 1$ in (1.1). This and related models were investigated (among others) by Vardi (1982, 1985), Horváth (1985) and Jones (1991), who did not consider covariates or focused mainly on inference on the marginal distribution of $T$. More general theory and models with a partially unknown $w$ were presented in Gill et al. (1988), Gilbert et al. (1999), and Gilbert (2000).

Biased sampling problems (or weighted distributions) appear quite commonly when investigating duration times. This phenomenon is closely related to that of left-truncation. In a left-truncated situation, a pair $(U,T)$ is sampled if and only if $T_0\le T$, where $T_0$ is a random variable called the truncation time. In this case, the function $w$ in (1.1) equals $w(x,y)=P(T_0\le y\,|\,U=x,\ T=y)$. Each set of model assumptions on the vector $(U,T,T_0)$ defines a special selection bias situation. For example, when $T_0$ is assumed to be independent of $(U,T)$, the biasing function is just the truncation distribution. Other situations are possible.

In Section 2 we consider a Vardi-type estimate which includes the covariate vector $U$. More generally, an empirical version of the functional $S^{\varphi}=E(\varphi(U,T))=\int\varphi(u,t)\,F_{U,T}(du,dt)$ is investigated. As applications, estimators for covariances, correlation coefficients, and cumulative hazard and mean residual time functions are defined. Basic large sample results about consistency and asymptotic distributional laws are established. Furthermore, in Section 3 we introduce and analyze in detail a modified least-squares criterion for estimating the regression function $m(x)=E[\phi(T)\,|\,U=x]$ under a parametric assumption. In applications, $\phi$ denotes a suitable transformation of the time axis, such as the logarithm. As particular models, our analysis includes the so-called accelerated failure time models, widely used in practice.

As an alternative to linear and nonlinear (parametric) fits, estimation of the regression function in a nonparametric way is possible too. Kernel smoothing for $m(x)$ under biased sampling goes back to Ahmad (1995). This kind of estimate may be introduced as a special convolution of a kernel function and the empirical $\tilde F_{U,T}$ in Section 2. Recent contributions in this field are Alcalá et al. (2000), Cristóbal and Alcalá (2000), and Wu (2000).
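To make the bias in (1.1) concrete, the following simulation sketch (ours, not part of the paper; the exponential specification and all names are illustrative assumptions) draws a length-biased sample with $w(u,t)=t$ and shows that the naive average of the biased responses overshoots the population mean, which is precisely what the weighted estimators of Section 2 correct:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population: covariate U ~ Uniform(0, 1) and lifetime
# T ~ Exp(scale = 1 + U), so E(T) = 1.5.
n_pool, n = 200_000, 5_000
u = rng.uniform(0.0, 1.0, n_pool)
t = rng.exponential(1.0 + u, n_pool)

# Length-biased sampling: a pair (u, t) is selected with probability
# proportional to w(u, t) = t, cf. (1.1).
probs = t / t.sum()
idx = rng.choice(n_pool, size=n, replace=False, p=probs)
X, Y = u[idx], t[idx]

print("population mean of T  :", t.mean())   # close to 1.5
print("naive mean of biased Y:", Y.mean())   # systematically larger
```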
2. Estimation of functionals

Let $\bar\varphi$ be an arbitrary function defined on the $(p+1)$-dimensional euclidean space. Under (1.1), we clearly have
$$E(\bar\varphi(X,Y))=W^{-1}E(\bar\varphi(U,T)w(U,T)). \qquad (2.1)$$
In particular, for $\bar\varphi(u,t)=w(u,t)^{-1}$ we obtain $W=\{E(w(X,Y)^{-1})\}^{-1}$. So let us define the natural estimate for $W$:
$$\tilde W=\Bigl\{\frac{1}{n}\sum_{i=1}^{n}w(X_i,Y_i)^{-1}\Bigr\}^{-1}.$$
This is actually the harmonic average of the $w(X_i,Y_i)$'s. Of course, the SLLN gives $\tilde W\to W$ almost surely under $E(w(X,Y)^{-1})<\infty$. Now, for a given function $\varphi$, take $\bar\varphi(u,t)=\varphi(u,t)w(u,t)^{-1}$ in (2.1). We have $S^{\varphi}=E(\varphi(U,T))=WE(\varphi(X,Y)w(X,Y)^{-1})$. So a natural estimate for $S^{\varphi}$ is defined by the empirical functional
$$\tilde S^{\varphi}=\tilde W\,\frac{1}{n}\sum_{i=1}^{n}\varphi(X_i,Y_i)w(X_i,Y_i)^{-1}. \qquad (2.2)$$
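In code, both estimates are one-line weighted averages. A minimal sketch (our illustration, not code from the paper; `X` and `Y` are assumed NumPy arrays holding the biased sample and `w` the known biasing function) computes $\tilde W$ and $\tilde S^{\varphi}$; note that $\tilde S^{\varphi}$ reduces to a ratio of sample means, since the factors $1/n$ cancel:

```python
import numpy as np

def W_tilde(X, Y, w):
    """Harmonic-mean estimate of the normalizing constant W."""
    return 1.0 / np.mean(1.0 / w(X, Y))

def S_tilde(phi, X, Y, w):
    """Weighted empirical functional (2.2) estimating E(phi(U, T))."""
    inv_w = 1.0 / w(X, Y)
    return np.mean(phi(X, Y) * inv_w) / np.mean(inv_w)

# Length-biased case w(u, t) = t: the estimate of E(T) collapses to the
# harmonic mean of the observed Y's, i.e. S_tilde(...) == W_tilde(...).
w = lambda x, y: y
# est_mean_T = S_tilde(lambda x, y: y, X, Y, w)
```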
In particular, estimation of $F_{U,T}(u,t)$ is obtained when considering the indicator $\varphi(x,y)=I(x\le u,\ y\le t)$:
$$\tilde F_{U,T}(u,t)=\tilde W\,\frac{1}{n}\sum_{i=1}^{n}I(X_i\le u,\ Y_i\le t)w(X_i,Y_i)^{-1}. \qquad (2.3)$$
This is a Vardi-type estimate with covariates. Note that (2.2) also follows from integration with respect to the empirical measure associated to $\tilde F_{U,T}$, $\tilde S^{\varphi}=\int\varphi(u,t)\,\tilde F_{U,T}(du,dt)$.

There are a number of examples for $S^{\varphi}$ that arise in applications. For a fixed $j\in\{1,\dots,p\}$ define $\varphi_1(u,t)=u^j t$, $\varphi_2(u,t)=t$, $\varphi_3(u,t)=u^j$, $\varphi_4(u,t)=t^2$, $\varphi_5(u,t)=(u^j)^2$. Combination of the $\tilde S^{\varphi_i}$, $1\le i\le 5$, gives estimates for the covariance and correlation between the $j$th covariate and $T$. Conditional cumulative hazard and mean residual lifetime functions emerge as functions of special $S^{\varphi}$'s too. Hence, we establish our main (large sample) result for the empirical version of a general parameter
$$\theta=g\Bigl(\int\varphi_1\,dF_{U,T},\dots,\int\varphi_r\,dF_{U,T}\Bigr),\quad\text{namely}\quad \tilde\theta=g\Bigl(\int\varphi_1\,d\tilde F_{U,T},\dots,\int\varphi_r\,d\tilde F_{U,T}\Bigr),$$
where the functions $g,\varphi_1,\dots,\varphi_r$ are chosen by the researcher. Recall that, under (1.1), we have
$$\int\varphi_i(u,t)\,F_{U,T}(du,dt)=\frac{\int\varphi_i(u,t)w(u,t)^{-1}\,F_{X,Y}(du,dt)}{\int w(u,t)^{-1}\,F_{X,Y}(du,dt)},\qquad 1\le i\le r,$$
and then $\theta=(g\circ q)(v)\equiv h(v)$, where $q(x_1,\dots,x_r,x_{r+1})=(x_1/x_{r+1},\dots,x_r/x_{r+1})$ and
$$v=\Bigl(\int\varphi_1 w^{-1}\,dF_{X,Y},\dots,\int\varphi_r w^{-1}\,dF_{X,Y},\int w^{-1}\,dF_{X,Y}\Bigr).$$
Also, $\tilde\theta=h(\hat v)$, where
$$\hat v=\Bigl(\int\varphi_1 w^{-1}\,d\hat F_{X,Y},\dots,\int\varphi_r w^{-1}\,d\hat F_{X,Y},\int w^{-1}\,d\hat F_{X,Y}\Bigr),$$
$\hat F_{X,Y}$ being the empirical distribution function of the $(X_i,Y_i)$'s. The following result is then a consequence of the SLLN and the multivariate CLT (along with the delta method, see Serfling (1980, p. 122)).

Theorem 2.1. Assume that the functions $\tilde\varphi_i(u,t)=\varphi_i(u,t)w(u,t)^{-1}$, $1\le i\le r$, and $\tilde\varphi_{r+1}(u,t)=w(u,t)^{-1}$ satisfy $\int|\tilde\varphi_i|\,dF_{X,Y}<\infty$. Then, if $h$ is continuous at $v$, we have $\tilde\theta\to\theta$ with probability one. Assume, in addition, that the condition $\int\tilde\varphi_i^2\,dF_{X,Y}<\infty$ holds for $1\le i\le r+1$. Then, if $\nabla h$ (the gradient of $h$) exists and $\nabla h(v)\ne 0$, we have $n^{1/2}(\tilde\theta-\theta)\to N(0,\sigma^2)$ in law, where
$$\sigma^2=\nabla h(v)^t(\sigma_{ij})\nabla h(v)=\sum_{i=1}^{r+1}\sum_{j=1}^{r+1}\frac{\partial h}{\partial x_i}\frac{\partial h}{\partial x_j}\Big|_{x=v}\,\sigma_{ij},$$
the $\sigma_{ij}$'s being the covariances $\sigma_{ij}=\mathrm{Cov}(\tilde\varphi_i(X,Y),\tilde\varphi_j(X,Y))$, $1\le i,j\le r+1$. For inference in practice, replace each $\sigma_{ij}$ by the empirical covariance $\hat\sigma_{ij}$ of the $\tilde\varphi_i(X_k,Y_k)$ and $\tilde\varphi_j(X_k,Y_k)$, $1\le k\le n$. Conclude that $\hat\sigma^2=\nabla h(\hat v)^t(\hat\sigma_{ij})\nabla h(\hat v)\to\sigma^2$ holds with probability one.
As important applications of Theorem 2.1, one can obtain the large sample results corresponding to $\tilde F_{U,T}$. In this case, the limit variance turns out to be
$$\sigma^2=\sigma^2(x,y)=W\{E(I(U\le x,\ T\le y)w(U,T)^{-1})+E(w(U,T)^{-1})F_{U,T}(x,y)^2-2E(I(U\le x,\ T\le y)w(U,T)^{-1})F_{U,T}(x,y)\}.$$
Now set
$$\tilde\Lambda(t,\mathcal X)\equiv-\ln\frac{\int I(x\in\mathcal X,\ y>t)\,\tilde F_{U,T}(dx,dy)}{\int I(x\in\mathcal X)\,\tilde F_{U,T}(dx,dy)},$$
where $\mathcal X$ denotes a set of restrictions on the covariates. Under continuity, this $\tilde\Lambda(t,\mathcal X)$ can be regarded as an estimator for the conditional cumulative hazard
$$\Lambda(t,\mathcal X)=-\ln\frac{\int I(x\in\mathcal X,\ y>t)\,F_{U,T}(dx,dy)}{\int I(x\in\mathcal X)\,F_{U,T}(dx,dy)}.$$
Again, strong consistency and asymptotic normality for $\tilde\Lambda(t,\mathcal X)$ are a consequence of Theorem 2.1, now with limit variance
$$\sigma^2=\sigma^2(\mathcal X,t)=W\Bigl\{\frac{E(I(U\in\mathcal X,\ T>t)w(U,T)^{-1})}{E^2(I(U\in\mathcal X,\ T>t))}+\frac{E(I(U\in\mathcal X)w(U,T)^{-1})}{E^2(I(U\in\mathcal X))}-\frac{2E(I(U\in\mathcal X,\ T>t)w(U,T)^{-1})}{E(I(U\in\mathcal X,\ T>t))E(I(U\in\mathcal X))}\Bigr\}.$$
Similar results could be obtained for an estimator of the conditional mean residual lifetime function $r(t,\mathcal X)=E[T-t\,|\,T>t,\ U\in\mathcal X]$, namely
$$\tilde r(t,\mathcal X)=\frac{\int I(x\in\mathcal X,\ y>t)(y-t)\,\tilde F_{U,T}(dx,dy)}{\int I(x\in\mathcal X,\ y>t)\,\tilde F_{U,T}(dx,dy)}.$$
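Computationally, both $\tilde\Lambda(t,\mathcal X)$ and $\tilde r(t,\mathcal X)$ are ratios of weighted empirical sums in which the common factor $\tilde W/n$ cancels. A sketch under our own naming assumptions (`in_X` is a boolean mask encoding $\{X_i\in\mathcal X\}$ and `inv_w` holds the values $w(X_i,Y_i)^{-1}$; neither name comes from the paper):

```python
import numpy as np

def hazard_tilde(t, in_X, Y, inv_w):
    """Conditional cumulative hazard estimate Lambda~(t, X):
    minus log of the weighted conditional survival on the set X."""
    num = np.sum(in_X * (Y > t) * inv_w)   # integral of I(x in X, y > t) dF~
    den = np.sum(in_X * inv_w)             # integral of I(x in X) dF~
    return -np.log(num / den)

def mrl_tilde(t, in_X, Y, inv_w):
    """Conditional mean residual lifetime estimate r~(t, X)."""
    alive = in_X * (Y > t) * inv_w
    return np.sum(alive * (Y - t)) / np.sum(alive)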
3. Estimation of the regression curve

3.1. Linear regression under biased sampling

Suppose now that, for a given transformation $\phi$, one is interested in estimating $m(x)=E[\phi(T)\,|\,U=x]$ under a linear specification
$$H_0:\ m(U)=U^t\beta_0=\beta_{01}u^1+\cdots+\beta_{0p}u^p,$$
where $\beta_0=(\beta_{01},\dots,\beta_{0p})^t$ is the (unknown) $p$-dimensional "true parameter". This is equivalent to assuming the relation $\phi(T)=U^t\beta_0+\varepsilon$, where the error term satisfies $E[\varepsilon\,|\,U]=0$. The so-called accelerated failure time model can be regarded as a particular case of this situation, for the special time transformation $\phi(t)=\ln t$. In this framework, the interest is focused on the regression parameter vector $\beta_0$, since one will look for important "risk factors" among the $u^j$'s. Given the biased data $(X_1,Y_1),\dots,(X_n,Y_n)$, we propose to estimate $\beta_0$ by the minimizer $\beta_n$ of
$$\beta\mapsto S_n(\beta)=\sum_{i=1}^{n}(\phi(Y_i)-X_i^t\beta)^2 w(X_i,Y_i)^{-1}=(\phi(Y)-X\beta)^t\Omega(\phi(Y)-X\beta),$$
where $\phi(Y)=(\phi(Y_1),\dots,\phi(Y_n))^t$ is the vector of transformed responses, $\Omega=\mathrm{diag}(w(X_1,Y_1)^{-1},\dots,w(X_n,Y_n)^{-1})$ is a (random) weighting matrix, and
$$X=\begin{pmatrix}X_1^t\\ \vdots\\ X_n^t\end{pmatrix}=\begin{pmatrix}x_1^1&\cdots&x_1^p\\ \vdots& &\vdots\\ x_n^1&\cdots&x_n^p\end{pmatrix}$$
is the (random) design matrix. In this section we derive large sample properties of $\beta_n$. Explicitly, $\beta_n$ can be defined as $\beta_n=(X^t\Omega X)^{-1}X^t\Omega\,\phi(Y)=M_{1n}^{-1}M_{2n}\phi(Y)$, where $M_{2n}=n^{-1}X^t\Omega$ and $M_{1n}=M_{2n}X$. Note that, since $\Omega$ is random, ordinary theory for weighted least-squares estimation does not apply here. Now, by the SLLN,
$$M_{2n}\phi(Y)=\Bigl(\frac{1}{n}\sum_{k=1}^{n}x_k^i\,w(X_k,Y_k)^{-1}\phi(Y_k)\Bigr)_{1\le i\le p}\to E(Xw(X,Y)^{-1}\phi(Y))$$
and
$$M_{1n}=\Bigl(\frac{1}{n}\sum_{k=1}^{n}x_k^i x_k^j\,w(X_k,Y_k)^{-1}\Bigr)_{1\le i,j\le p}\to E(XX^t w(X,Y)^{-1})\equiv\Sigma_0$$
with probability one. On the other hand, (2.1) implies $\Sigma_0=W^{-1}E(UU^t)$ and $E(Xw(X,Y)^{-1}\phi(Y))=W^{-1}E(U\phi(T))$ ($=W^{-1}E(UU^t)\beta_0$ under $H_0$), and the following result clearly holds.

Theorem 3.1. Assume that the matrices $E(Xw(X,Y)^{-1}\phi(Y))$ and $\Sigma_0=E(XX^tw(X,Y)^{-1})$ exist, and that $\Sigma_0$ is nonsingular. Then, $\beta_n\to\beta_0$ under $H_0$ with probability one.

Write
$$\beta_n-\beta_0=M_{1n}^{-1}M_{2n}\phi(Y)-\beta_0=M_{1n}^{-1}(M_{2n}\phi(Y)-M_{1n}\beta_0).$$
Note that $M_{1n}^{-1}\to\Sigma_0^{-1}$ almost surely. On the other hand, we have
$$M_{2n}\phi(Y)-M_{1n}\beta_0=\Bigl(\frac{1}{n}\sum_{k=1}^{n}x_k^i\Bigl(\phi(Y_k)-\sum_{j=1}^{p}x_k^j\beta_{0j}\Bigr)w(X_k,Y_k)^{-1}\Bigr)_{1\le i\le p}=\Bigl(\frac{1}{n}\sum_{k=1}^{n}\varphi_i(X_k,Y_k)w(X_k,Y_k)^{-1}\Bigr)_{1\le i\le p},$$
where $\varphi_i(x,y)=\varphi_i(x^1,\dots,x^p,y)=x^i(\phi(y)-x^t\beta_0)$, $i=1,\dots,p$. Now, by (2.1), we obtain for $1\le i\le p$
$$E(\varphi_i(X,Y)w(X,Y)^{-1})=W^{-1}E(\varphi_i(U,T))=W^{-1}E(u^i(\phi(T)-U^t\beta_0))=0\quad\text{under }H_0.$$
Apply the multivariate CLT to get the following theorem.
Theorem 3.2. Assume the conditions in Theorem 3.1. Assume also that $E(\varphi_i^2(X,Y)w(X,Y)^{-2})<\infty$ holds for $1\le i\le p$. Then, $n^{1/2}(\beta_n-\beta_0)\to N_p(0,\Sigma_0^{-1}\Sigma\Sigma_0^{-1})$ in law under $H_0$, where $\Sigma=(\mathrm{Cov}(\varphi_i(X,Y)w(X,Y)^{-1},\varphi_j(X,Y)w(X,Y)^{-1}))_{1\le i,j\le p}$.

In practice, we need an empirical version of the limit variance-covariance matrix $\Sigma_0^{-1}\Sigma\Sigma_0^{-1}$. Substitute $M_{1n}$ for $\Sigma_0$ and $\Sigma_n=(\sigma_{ij}^n)_{1\le i,j\le p}$ for $\Sigma$, where
$$\sigma_{ij}^n=\frac{1}{n}\sum_{k=1}^{n}\varphi_i^n(X_k,Y_k)\varphi_j^n(X_k,Y_k)w(X_k,Y_k)^{-2},\qquad 1\le i,j\le p,$$
and $\varphi_i^n(x,y)=x^i(\phi(y)-x^t\beta_n)$, $1\le i\le p$. It is easily seen that $M_{1n}^{-1}\Sigma_n M_{1n}^{-1}$ consistently estimates $\Sigma_0^{-1}\Sigma\Sigma_0^{-1}$.
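As a computational companion (a sketch under our own naming assumptions, not code from the paper), $\beta_n$ and the sandwich estimate $M_{1n}^{-1}\Sigma_n M_{1n}^{-1}$ can be assembled directly, here with $\phi=\ln$ as in the accelerated failure time case:

```python
import numpy as np

def weighted_ls(X, Y, w_vals, phi=np.log):
    """beta_n minimizing S_n, plus the sandwich covariance estimate
    for n^{1/2}(beta_n - beta_0).

    X: (n, p) design matrix; Y: (n,) biased responses;
    w_vals: (n,) values w(X_i, Y_i) of the known biasing function.
    """
    n, p = X.shape
    inv_w = 1.0 / w_vals
    M1n = (X * inv_w[:, None]).T @ X / n          # -> Sigma_0
    M2n_phi = (X * inv_w[:, None]).T @ phi(Y) / n
    beta = np.linalg.solve(M1n, M2n_phi)          # beta_n

    resid = phi(Y) - X @ beta                     # phi(y) - x^t beta_n
    phi_n = X * resid[:, None]                    # rows: varphi^n(X_k, Y_k)
    Sigma_n = (phi_n * inv_w[:, None] ** 2).T @ phi_n / n
    M1_inv = np.linalg.inv(M1n)
    cov = M1_inv @ Sigma_n @ M1_inv               # estimates Sigma_0^{-1} Sigma Sigma_0^{-1}
    return beta, cov
```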
3.2. Nonlinear regression

The linearity assumption $m(U)=U^t\beta_0$ may be very restrictive for modelling the relation between the lifetime and the covariate vector. In this section we discuss a generalization of the results above to a nonlinear specification $H_0:\ m(U)=f(U;\beta_0)$, where $\beta_0$ is a $d$-dimensional parameter and now $p$ and $d$ may be different. Here, suitable least-squares estimation is introduced by minimizing
$$\beta\mapsto S_n(\beta)=\sum_{i=1}^{n}(\phi(Y_i)-f(X_i;\beta))^2 w(X_i,Y_i)^{-1}.$$
As in the linear case, large sample results for the general empiricals introduced in Section 2 are the core of the asymptotic behavior of the minimizer $\beta_n$ of $S_n(\beta)$. Details of the proofs are omitted. The arguments involved are similar to those in Stute (1999), who considered nonlinear regression methods with censored (although unbiased) responses. We will refer to the following regularity conditions:

(i) $E[\phi(T)^2]<\infty$;
(ii) the $\beta$-parameter space is compact;
(iii) $f(x;\cdot)$ is continuous for each $x$;
(iv) $f^2(x;\beta)\le M(x)$ for some integrable function $M$;
(v) $L(\beta,\beta_0)=E[(f(U;\beta)-f(U;\beta_0))^2]>0$ for $\beta\ne\beta_0$;
(vi) $f(x;\cdot)$ is twice continuously differentiable for each $x$;
(vii) $E(\varphi_j^2(X,Y)w(X,Y)^{-2})<\infty$ holds for $1\le j\le d$, where
$$\varphi_j(x,y)=(\phi(y)-f(x;\beta_0))\,\frac{\partial f(x;\beta)}{\partial\beta_j}\Big|_{\beta=\beta_0}.$$
For a linear $f(\cdot;\beta_0)$, the $\varphi_j$'s become the functions in Theorem 3.2. Also, in the linear case some of the regularity conditions above (such as (ii) and (iv)) are not needed.

Theorem 3.3. Assume (i)-(v). Then, $\beta_n\to\beta_0$ and $n^{-1}S_n(\beta_n)\to W^{-1}\mathrm{Var}(\varepsilon)$ under $H_0$ with probability one.
Theorem 3.4. Assume (i)-(vii) and that $\beta_n$ is an interior point of the $\beta$-parameter space. Then, $n^{1/2}(\beta_n-\beta_0)\to N_d(0,\Omega_0^{-1}\Omega\Omega_0^{-1})$ in law under $H_0$, where $\Omega=(\mathrm{Cov}(\varphi_i(X,Y)w(X,Y)^{-1},\varphi_j(X,Y)w(X,Y)^{-1}))_{1\le i,j\le d}$ and
$$\Omega_0=\Bigl(E\Bigl[\frac{\partial f(X;\beta)}{\partial\beta_i}\frac{\partial f(X;\beta)}{\partial\beta_j}\Big|_{\beta=\beta_0}\,w(X,Y)^{-1}\Bigr]\Bigr)_{1\le i,j\le d}.$$
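For completeness, a hedged sketch of the nonlinear fit (our illustration; the model function `f`, the starting value, and the use of scipy's general-purpose minimizer are all assumptions, since the paper prescribes no particular numerical method):

```python
import numpy as np
from scipy.optimize import minimize

def fit_nonlinear(X, Y, w_vals, f, beta_init, phi=np.log):
    """Minimize S_n(beta) = sum_i (phi(Y_i) - f(X_i; beta))^2 / w(X_i, Y_i)."""
    inv_w = 1.0 / w_vals

    def S_n(beta):
        resid = phi(Y) - f(X, beta)
        return np.sum(resid ** 2 * inv_w)

    res = minimize(S_n, beta_init, method="Nelder-Mead")
    # By Theorem 3.3, W_tilde * S_n(res.x) / n estimates Var(eps),
    # with W_tilde the harmonic-mean estimate of Section 2.
    return res.x

# Hypothetical specification f(x; beta) = exp(x @ beta):
# beta_n = fit_nonlinear(X, Y, w_vals, lambda x, b: np.exp(x @ b), np.zeros(p))
```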
Acknowledgements

Thanks to a referee for suggestions which have improved the presentation of the paper. Work supported by Grants PGIDIT02PXIA30003PR and BFM2002-03213.

References

Ahmad, I.A., 1995. On multivariate kernel estimation for samples from weighted distributions. Statist. Probab. Lett. 22, 121-129.
Alcalá, J.T., Cristóbal, J.A., Ojeda, J., 2000. Nonparametric regression estimators in biased sampling models. In: Núñez-Antón, V., Ferreira, E. (Eds.), Statistical Modelling. Universidad del País Vasco, Bilbao, pp. 131-136.
Cristóbal, J.A., Alcalá, J.T., 2000. Nonparametric regression estimators for length biased data. J. Statist. Plann. Inference 89, 145-168.
Gilbert, P.B., 2000. Large sample theory of maximum likelihood estimates in semiparametric biased sampling models. Ann. Statist. 28, 151-194.
Gilbert, P.B., Lele, S.R., Vardi, Y., 1999. Maximum likelihood estimation in semiparametric selection bias models with applications to AIDS vaccine trials. Biometrika 86, 27-43.
Gill, R.D., Vardi, Y., Wellner, J.A., 1988. Large sample theory of empirical distributions in biased sampling models. Ann. Statist. 16, 1069-1112.
Horváth, L., 1985. Estimation from a length-biased distribution. Statist. Decisions 3, 91-113.
Jones, M.C., 1991. Kernel density estimation for length-biased data. Biometrika 78, 511-519.
Serfling, R.J., 1980. Approximation Theorems of Mathematical Statistics. Wiley, New York.
Stute, W., 1999. Nonlinear censored regression. Statist. Sinica 9, 1089-1102.
Vardi, Y., 1982. Nonparametric estimation in the presence of length bias. Ann. Statist. 10, 616-620.
Vardi, Y., 1985. Empirical distributions in selection bias models (with discussion). Ann. Statist. 13, 178-205.
Wu, C.O., 2000. Local polynomial regression with selection biased data. Statist. Sinica 10, 789-817.