Journal of Statistical Planning and Inference 140 (2010) 3468–3475
Jackknife methods for left-truncated data

Pao-sheng Shen, Department of Statistics, Tunghai University, Taichung 40704, Taiwan
Article history: Received 1 June 2008; accepted 13 May 2010; available online 21 May 2010.

Keywords: Truncated data; Jackknife

Abstract. Let $U$ and $V$ be two independent positive random variables with continuous distribution functions $F$ and $G$, respectively. Under random truncation, both $U$ and $V$ are observable only when $U \ge V$. Let $\bar F_n$ be the nonparametric maximum likelihood estimate (Lynden-Bell, 1971) of $\bar F = 1 - F$. In this paper, it is shown that the jackknife variance estimate of $\bar F_n$ consistently estimates the limit variance. In a simulation study, we show that the jackknife works satisfactorily for moderate sample sizes. As an illustration, we also apply our method to a real set of AIDS data.
1. Introduction

Let $U$ and $V$ be two independent positive random variables with continuous distribution functions $F$ and $G$, respectively. Under random truncation, both $U$ and $V$ are observable only when $U \ge V$. Let $a_F$ and $b_F$ denote the left and right endpoints of $F$; define $a_G$ and $b_G$ similarly. For identifiability of $F$ and $G$, we assume that $a_G \le a_F$ and $b_G \le b_F$. Truncated data occur in astronomy (e.g., Lynden-Bell, 1971; Woodroofe, 1985), in epidemiology and biometry (e.g., Wang et al., 1986; Tsai et al., 1987), and in other fields such as economics. The following example describes a situation where truncation occurs.

Example (AIDS blood-transfusion data). Lagakos et al. (1988) report data on the infection times (denoted by $V$) and induction times (denoted by $T$) for 258 adults and 37 children who were infected by contaminated blood transfusions and developed AIDS by June 30, 1986. The data consist of the time in years, measured from April 1, 1978, at which adults were infected by the virus from a contaminated blood transfusion, and the waiting time to development of AIDS, measured from the date of infection. For the pediatric population, children were infected in utero or at birth, and the infection time is the number of years from April 1, 1978 to birth. The data were based on an eight-year observational window, so that individuals are observed only over the period (0, 8]. Let $U = 8 - T$. An individual is observed if and only if $T + V \le 8$, i.e., $V \le U$.

Let $(U_1, V_1), \ldots, (U_n, V_n)$ denote the truncated sample. Let $U_{(1)} < U_{(2)} < \cdots < U_{(n)}$ be the ordered values of the $U_i$'s and $V_{(i)}$ the concomitant of $U_{(i)}$ for $i = 1, \ldots, n$. The nonparametric maximum likelihood estimator (NPMLE) of $\bar F(x) = 1 - F(x)$,
$$\bar F_n(x) = \prod_{z \le x} (1 - d\Lambda_n(z)),$$
was derived by Lynden-Bell (1971), where $\Lambda_n(z) = \sum_{U_{(i)} \le z} 1/n_i$, $n_i = \sum_{j=1}^{n} I[V_{(j)} \le U_{(i)} \le U_{(j)}]$, and $I_A$ is the indicator function of the event $A$.
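To make the estimator concrete, the product-limit computation above can be sketched in a few lines of Python; the data pairs below are hypothetical and serve only to exercise the formula.

```python
def risk_set_size(pairs, u):
    # n_i = #{j : V_(j) <= U_(i) <= U_(j)}, evaluated at u = U_(i)
    return sum(1 for (U, V) in pairs if V <= u <= U)

def lynden_bell_survival(pairs, x):
    # F_bar_n(x) = prod over U_(i) <= x of (1 - 1/n_i); assumes no ties among the U's
    surv = 1.0
    for (U, V) in sorted(pairs):
        if U <= x:
            surv *= 1.0 - 1.0 / risk_set_size(pairs, U)
    return surv

# Hypothetical truncated sample (every pair satisfies U >= V):
sample = [(1.0, 0.5), (2.0, 1.0), (3.0, 0.5), (4.0, 2.0)]
print(lynden_bell_survival(sample, 2.5))  # (2/3)*(2/3) = 4/9
```

Here each $n_i$ counts every observation whose truncation time lies below $U_{(i)}$ and whose lifetime is at least $U_{(i)}$, exactly the risk set entering $\Lambda_n$.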
When the lower boundary of $U$ (denoted by $a_F$) is larger than that of $V$ (denoted by $a_G$), the consistency of $\bar F_n$ was proved by Woodroofe (1985). Before we give details, we briefly discuss several aspects of the jackknife in the "complete data" situation; for more details, see Efron and Tibshirani (1993). The delete-1 jackknife has been proposed to serve two purposes (cf. Quenouille, 1956; Tukey, 1958): first, to provide a methodology to reduce a possible bias of an estimator and, second, to yield an approximation for its variance. The jackknife incorporates the so-called pseudovalues, which result from applying the
statistic of interest to the subsamples of the observations with one observation deleted, one after another. For randomly censored data, Gaver and Miller (1983) demonstrated that the Kaplan–Meier survival estimator can be jackknifed to give conservative confidence limits for survival probabilities. Stute (1996) showed that the jackknife variance estimate of a Kaplan–Meier integral consistently estimates the limit variance. However, for left-truncated data, there is no evidence in the literature that the jackknife variance estimate of $\bar F_n$ works. In this paper the effect of jackknifing the estimator $\bar F_n$ is examined both by asymptotic analysis and by Monte Carlo simulation. In Section 2, it is shown that the jackknife variance estimate of $\bar F_n$ consistently estimates the limit variance. In Section 3, we report the results of some Monte Carlo investigations, comparing confidence limits for the survival probability obtained via the jackknife with those from other techniques. In Section 4, we apply the jackknife to transfusion-related acquired immune deficiency syndrome (AIDS) data. Section 5 gives some concluding remarks.

2. Consistency of the jackknife variance estimate

Let $\Lambda(x)$ denote the cumulative hazard function of $F$. The estimator $\bar F_n(x)$ is closely related to the sample cumulative hazard function $\Lambda_n(x) = \sum_{U_{(i)} \le x} 1/n_i$. Compared with the jackknife method based on censored data (see Stute and Wang, 1994; Stute, 1996), it is more complicated to apply jackknifing directly to $\bar F_n$, as will be seen in Lemma 2.6. Hence, we show the consistency of the jackknife variance estimate of $\bar F_n$ by starting with the jackknife estimator of $\Lambda_n(x)$. For $m = 1, \ldots, n$, let $\Lambda_{n,m}(x)$ denote the value of $\Lambda_n(x)$ when $(U_{(m)}, V_{(m)})$ is deleted from the sample, and let $\Lambda_{J(1)}(x) = n^{-1} \sum_{m=1}^{n} \Lambda_{n,m}(x)$. Quenouille's estimate of bias is $\widehat{\mathrm{bias}}(\Lambda_n(x)) = (n-1)[\Lambda_{J(1)}(x) - \Lambda_n(x)]$, and the delete-1 jackknife estimate of $\Lambda(x)$ is
$$\Lambda_n(x) - \widehat{\mathrm{bias}}(\Lambda_n(x)) = \Lambda_J(x) = n\Lambda_n(x) - (n-1)\Lambda_{J(1)}(x).$$

The following proposition derives the delete-1 jackknife estimator of $\Lambda_n(x)$.

Proposition 2.1. Given $n_k > 1$ for $k \le n-1$, the delete-1 jackknife estimator $\Lambda_J(x)$ is given by
$$\Lambda_J(x) = \begin{cases} \Lambda_n(x), & x < U_{(n)}, \\[2pt] \Lambda_n(x) + \dfrac{n-1}{n}, & x \ge U_{(n)}. \end{cases}$$

Proof. The proof is straightforward by induction on $k$, and hence omitted. $\square$

Hence, for $x \ge U_{(n)}$, the delete-1 jackknife estimator differs from the original estimator by a term $(n-1)/n$. For $k \le n-1$, the jackknife variance estimate of $\sqrt{n}\,\Lambda_n(U_{(k)})$ (see Tukey, 1958) is given by
$$nV_J(\Lambda_n(U_{(k)})) = (n-1) \sum_{m=1}^{n} [\Lambda_{n,m}(U_{(k)}) - \Lambda_{J(1)}(U_{(k)})]^2.$$
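Since $\Lambda_{n,m}$ is just $\Lambda_n$ recomputed with one pair removed, the jackknife variance can be evaluated by brute force. The following Python sketch (on hypothetical data) does exactly that.

```python
def cum_hazard(pairs, x):
    # Lambda_n(x) = sum over U_(i) <= x of 1/n_i, with n_i = #{j : V_j <= U_i <= U_j}
    total = 0.0
    for (U, V) in pairs:
        if U <= x:
            total += 1.0 / sum(1 for (Uj, Vj) in pairs if Vj <= U <= Uj)
    return total

def jackknife_variance(pairs, x):
    # V_J = ((n-1)/n) * sum_m [Lambda_{n,m}(x) - Lambda_{J(1)}(x)]^2
    n = len(pairs)
    loo = [cum_hazard(pairs[:m] + pairs[m + 1:], x) for m in range(n)]
    mean_loo = sum(loo) / n        # Lambda_{J(1)}(x)
    return (n - 1) / n * sum((v - mean_loo) ** 2 for v in loo)

sample = [(1.0, 0.5), (2.0, 1.0), (3.0, 0.5), (4.0, 2.0)]  # hypothetical
print(jackknife_variance(sample, 2.5))  # 5/24
```

Note that each deleted pair still belongs to its own risk set, so every $n_i$ in a leave-one-out sample is at least 1 and no division by zero can occur.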
Next, we will show the consistency of $nV_J(\Lambda_n(U_{(k)}))$ in Theorem 2.1. To prove Theorem 2.1, we need Lemmas 2.1–2.3.

Lemma 2.1. For $k \le n-1$, $nV_J(\Lambda_n(U_{(k)})) = A_n + B_n$, where
$$A_n = (n-1) \sum_{m=1}^{k} \frac{1}{n_m(n_m - 1)}, \qquad B_n = 2(n-1) \sum_{i=1}^{k-1} \frac{1}{n_i(n_i - 1)} \sum_{m=i+1}^{k} Y_{im},$$
with
$$Y_{im} = \frac{\sum_{j=m+1}^{n} \delta_{mj} (\delta_{ij} - \delta_{im})}{n_m(n_m - 1)}, \qquad \delta_{ij} = I[V_{(j)} \le U_{(i)} \le U_{(j)}].$$

Proof. The proof is tedious but straightforward, and hence omitted. $\square$
Before we formulate Lemmas 2.2 and 2.3, we define the following notation. Let $O(z_n)$ and $o(z_n)$ denote the limiting behavior of sequences of real numbers: $O(z_n)$ represents terms of the same order of magnitude as $z_n$ as $n \to \infty$, and $o(z_n)$ represents a term of smaller order than $z_n$. The notations $O_p$ and $o_p$ are defined analogously for sequences of random variables.

Lemma 2.2. As $n \to \infty$, $k/n \to p$, and $i/n \to p'$, $E[\sum_{m=i+1}^{k} Y_{im}^2] = o(1)$.

Proof. Given $m$, define the subset $R_m = \{(U_{(j)}, V_{(j)}) : V_{(j)} \le U_{(m)},\ j = m, \ldots, n\}$. The number of pairs in $R_m$ is then $n_m$. Let $(\tilde U_j, \tilde V_j)$, $j = 1, \ldots, n_m$, denote the pairs of $(U_{(j)}, V_{(j)})$ in $R_m$. Note that the pair $(U_{(m)}, V_{(m)})$ is in $R_m$ (i.e., $n_m \ge 1$). Without loss of generality, let $(U_{(m)}, V_{(m)}) = (\tilde U_{n_m}, \tilde V_{n_m})$; hence $U_{(m)} < \tilde U_j$ for $j = 1, \ldots, n_m - 1$. Let $\tilde V_{(1)} < \tilde V_{(2)} < \cdots < \tilde V_{(n_m)}$ be the ordered
ARTICLE IN PRESS 3470
P.-s. Shen / Journal of Statistical Planning and Inference 140 (2010) 3468–3475
values of $\tilde V_1, \ldots, \tilde V_{n_m}$. Note that
$$Y_{im} = \frac{\sum_{j=1}^{n_m - 1} (\tilde\delta_{ij} - \tilde\delta_{im})}{n_m(n_m - 1)},$$
where $\tilde\delta_{ij} = I[U_{(i)} > \tilde V_j]$ for $j = 1, \ldots, n_m - 1$ (and $\tilde\delta_{im} = I[U_{(i)} > V_{(m)}]$). It follows that, for $i = 1, \ldots, k-1$ and $m = i+1, \ldots, k$, conditioning on $n_m$, the conditional distribution of $Y_{im}$ is
$$P(Y_{im} = 0 \mid n_m) = P(U_{(i)} < \tilde V_{(1)} \text{ or } U_{(i)} > \tilde V_{(n_m)} \mid n_m);$$
and, for $q = 1, \ldots, n_m - 1$,
$$P\Big(Y_{im} = \frac{q}{n_m(n_m-1)} \,\Big|\, n_m\Big) = P(\tilde V_{(q)} < U_{(i)} < \tilde V_{(q+1)} \le V_{(m)} \mid n_m),$$
$$P\Big(Y_{im} = -\frac{n_m - q}{n_m(n_m-1)} \,\Big|\, n_m\Big) = P(V_{(m)} \le \tilde V_{(q)} < U_{(i)} < \tilde V_{(q+1)} \mid n_m).$$
Hence,
$$E[Y_{im}^2 \mid n_m] = \sum_{q=1}^{n_m-1} \frac{q^2}{n_m^2(n_m-1)^2}\, P(\tilde V_{(q)} < U_{(i)} < \tilde V_{(q+1)} \le V_{(m)} \mid n_m) + \sum_{q=1}^{n_m-1} \frac{(n_m-q)^2}{n_m^2(n_m-1)^2}\, P(V_{(m)} \le \tilde V_{(q)} < U_{(i)} < \tilde V_{(q+1)} \mid n_m)$$
$$\le \Big\{ \max_{q=1,\ldots,n_m-1} \frac{q^2}{n_m^2(n_m-1)^2} \Big\} P(\tilde V_{(1)} < U_{(i)} < \tilde V_{(n_m)} \mid n_m) + \Big\{ \max_{q=1,\ldots,n_m-1} \frac{(n_m-q)^2}{n_m^2(n_m-1)^2} \Big\} P(\tilde V_{(1)} < U_{(i)} < \tilde V_{(n_m)} \mid n_m) = O(n_m^{-2}).$$
Note that, as $n \to \infty$, $k/n \to p$, and $i/n \to p'$, $(n-1) \sum_{m=i+1}^{k} 1/(n_m(n_m-1))$ converges in probability to $\int_{x_{p'}}^{x_p} dF^*(z)/R^2(z)$ (see Wang et al., 1986), where $R(z) = P(V \le z \le U \mid U \ge V)$. Hence we have $\sum_{m=i+1}^{k} E[Y_{im}^2 \mid n_m] = o(1)$. This completes the proof of Lemma 2.2. $\square$

Lemma 2.3. As $n \to \infty$, $k/n \to p$, and $i/n \to p'$,
$$\sum_{m=i+1}^{k} \sum_{m'=m+1}^{k} E[Y_{im} Y_{im'} \mid n_m, n_{m'}] = o(1).$$
Proof. Note that, conditionally on $U_{(m)}$ and $\tilde U_j$ ($j = 1, \ldots, n_m - 1$), we have
$$P(\tilde V_1 \le \tilde V_2 \le \cdots \le \tilde V_{n_m} \mid U_{(m)}, \tilde U_1, \ldots, \tilde U_{n_m - 1}) = P(V_1 \le V_2 \le \cdots \le V_{n_m} \mid V_j \le \tilde U_j,\ V_j \le U_{(m)},\ j = 1, \ldots, n_m) = P(V_1 \le V_2 \le \cdots \le V_{n_m} \mid V_j \le U_{(m)},\ j = 1, \ldots, n_m),$$
where the $V_j$'s are i.i.d. with distribution function $P(V_j \le x) = G(x)$. Since this holds for any $U_{(m)}$ and $\tilde U_j$ ($j = 1, \ldots, n_m - 1$), it follows that, conditionally on $n_m$,
$$P(\tilde V_1 = \tilde V_{(i_1)}, \ldots, \tilde V_{n_m - 1} = \tilde V_{(i_{n_m - 1})},\ V_{(m)} = \tilde V_{(i_{n_m})} \mid n_m) = \frac{1}{n_m!} \qquad (2.1)$$
for all $n_m!$ permutations $(i_1, \ldots, i_{n_m - 1}, i_{n_m})$ of $(1, 2, \ldots, n_m)$. Using Eq. (2.1), Tsai (1990) proposed a statistic for testing the independence of $U$ and $V$.

Hence, for $q = 1, \ldots, n_m - 1$ and $m' = m + 1$, given $\delta_{mm'} = 1$ (i.e., $V_{(m')} \le U_{(m)}$, so that $n_{m'} = n_m - 1$), say $V_{(m')} = \tilde V_1$, the joint distribution of $Y_{im}$ and $Y_{im'}$ satisfies
$$P\Big(Y_{im} = \frac{q}{n_m(n_m-1)},\ Y_{im'} = -\frac{n_{m'} - q}{n_{m'}(n_{m'}-1)} \,\Big|\, n_m, \delta_{mm'} = 1\Big) = P(\tilde V_1 \le \tilde V_{(q)} < U_{(i)} < \tilde V_{(q+1)} \le V_{(m)} \mid n_m) = \frac{(n_m - q)q}{n_m(n_m-1)}\, P(\tilde V_{(q)} < U_{(i)} < \tilde V_{(q+1)} \mid V_{(m)} \ge \tilde V_{(q+1)}, \tilde V_1 \le \tilde V_{(q)}, n_m).$$
Similarly,
$$P\Big(Y_{im} = \frac{q}{n_m(n_m-1)},\ Y_{im'} = \frac{q}{n_{m'}(n_{m'}-1)} \,\Big|\, n_m, \delta_{mm'} = 1\Big) = \frac{(n_m - q)(n_m - 1 - q)}{n_m(n_m-1)}\, P(\tilde V_{(q)} < U_{(i)} < \tilde V_{(q+1)} \mid V_{(m)} \ge \tilde V_{(q+1)}, \tilde V_1 \ge \tilde V_{(q+1)}, n_m),$$
$$P\Big(Y_{im} = -\frac{n_m - q}{n_m(n_m-1)},\ Y_{im'} = -\frac{n_{m'} - q + 1}{n_{m'}(n_{m'}-1)} \,\Big|\, n_m, \delta_{mm'} = 1\Big) = \frac{q(q-1)}{n_m(n_m-1)}\, P(\tilde V_{(q)} < U_{(i)} < \tilde V_{(q+1)} \mid V_{(m)} \le \tilde V_{(q)}, \tilde V_1 \le \tilde V_{(q)}, n_m),$$
$$P\Big(Y_{im} = -\frac{n_m - q}{n_m(n_m-1)},\ Y_{im'} = \frac{q - 1}{n_{m'}(n_{m'}-1)} \,\Big|\, n_m, \delta_{mm'} = 1\Big) = \frac{q(n_m - q)}{n_m(n_m-1)}\, P(\tilde V_{(q)} < U_{(i)} < \tilde V_{(q+1)} \mid V_{(m)} \le \tilde V_{(q)}, \tilde V_1 \ge \tilde V_{(q+1)}, n_m).$$
From (2.1), we have
$$P(\tilde V_{(q)} < U_{(i)} < \tilde V_{(q+1)} \mid V_{(m)} \ge \tilde V_{(q+1)}, \tilde V_1 \le \tilde V_{(q)}, n_m) = P(\tilde V_{(q)} < U_{(i)} < \tilde V_{(q+1)} \mid V_{(m)} \ge \tilde V_{(q+1)}, \tilde V_1 \ge \tilde V_{(q+1)}, n_m) = P(\tilde V_{(q)} < U_{(i)} < \tilde V_{(q+1)} \mid V_{(m)} \le \tilde V_{(q)}, \tilde V_1 \le \tilde V_{(q)}, n_m) = P(\tilde V_{(q)} < U_{(i)} < \tilde V_{(q+1)} \mid V_{(m)} \le \tilde V_{(q)}, \tilde V_1 \ge \tilde V_{(q+1)}, n_m).$$
It follows that $E(Y_{im} Y_{im'} \mid \delta_{mm'} = 1, n_m) = o(n_m^{-2})$. Similarly, we can show that $E(Y_{im} Y_{im'} \mid \delta_{mm'} = 0, n_m) = o(n_m^{-2})$. Hence, for $m' = m + 1$, $E(Y_{im} Y_{im'} \mid n_m) = o(n_m^{-2})$; and for $m' > m + 1$ it can be shown similarly that $E(Y_{im} Y_{im'} \mid n_m, n_{m'}) = o(n_m^{-2})$. Since $(n-1) \sum_{m=i+1}^{k} 1/(n_m(n_m-1))$ converges in probability, we have, as $n \to \infty$, $k/n \to p$, and $i/n \to p'$, $\sum_{m=i+1}^{k} \sum_{m'=m+1}^{k} E(Y_{im} Y_{im'} \mid n_m, n_{m'}) = o(1)$. This completes the proof of Lemma 2.3. $\square$
We are now in a position to formulate Theorem 2.1.

Theorem 2.1. For $k \le n-1$, as $n \to \infty$ and $k/n \to p$, $nV_J(\Lambda_n(U_{(k)}))$ converges in probability to the limit variance of $\sqrt{n}[\Lambda_n(U_{(k)}) - \Lambda(x_p)]$, i.e., to $\int_0^{x_p} dF^*(z)/R^2(z)$, where $F^*(z) = P(U \le z \mid U \ge V)$ and $x_p = \inf\{x : F^*(x) \ge p\}$ denotes the population $p$th percentile.

Proof. By Lemma 2.1, we have $nV_J(\Lambda_n(U_{(k)})) = A_n + B_n$, where $A_n$ is the analogue of Greenwood's formula. As $n \to \infty$ and $k/n \to p$, $A_n$ converges in probability to $\int_0^{x_p} dF^*(z)/R^2(z)$ (Tsai et al., 1987). From Lemmas 2.2 and 2.3 we have $\sum_{i=1}^{k-1} \sum_{m=i+1}^{k} Y_{im} = o_p(1)$. Since $(n-1) \sum_i 1/[n_i(n_i - 1)]$ converges in probability, we have $B_n = o_p(1)$. This completes the proof of Theorem 2.1. $\square$

Next, we prove the consistency of the jackknife variance estimator of $\bar F_n(U_{(k)})$, denoted by $V_J(\bar F_n(U_{(k)}))$. Since $a_G \le a_F$ and $b_G \le b_F$, by the first part of Corollary 5 of Woodroofe (1985) we have, for $k = 1, \ldots, n-1$, $\lim_{n\to\infty} P(n_k = 1 \text{ for some } k \le n-1) = 0$, which implies that $\lim_{n\to\infty} P(\bar F_n(U_{(k)}) > 0 \text{ for all } k \le n-1) = 1$. Hence, without loss of generality, we consider the case $\bar F_n(U_{(k)}) > 0$ for $k = 1, \ldots, n-1$. Define $\bar H_n(U_{(k)}) = -\ln \bar F_n(U_{(k)})$. In order to study the estimate $\bar F_n(U_{(k)})$, expand the logarithm:
$$\bar H_n(U_{(k)}) = -\ln \bar F_n(U_{(k)}) = \Lambda_n(U_{(k)}) + \sum_{r=2}^{\infty} \frac{1}{r}\, Q_n^r(U_{(k)}), \qquad (2.2)$$
where $Q_n^r(U_{(k)}) = \sum_{i=1}^{k} 1/n_i^r$ for $r = 2, 3, \ldots$. To show the consistency of $V_J(\bar F_n(U_{(k)}))$, we need Lemmas 2.4–2.6.

Lemma 2.4. Let $Q_{J(1)}^r(U_{(k)}) = (1/n) \sum_{m=1}^{n} Q_{n,m}^r(U_{(k)})$, where $Q_{n,m}^r(U_{(k)})$ denotes the delete-1 estimator of $Q_n^r(U_{(k)})$ when $U_{(m)}$ ($m = 1, \ldots, n$) is deleted from the sample. Then, for $k = 1, \ldots, n-1$,
$$Q_{J(1)}^r(U_{(k)}) = Q_n^r(U_{(k)}) + \sum_{i=1}^{k} \frac{1}{n\, n_i^{r-1}} \Big[ \Big( \frac{n_i}{n_i - 1} \Big)^{r-1} - 1 \Big].$$

Proof. The proof is straightforward by induction on $k$, and hence omitted. $\square$
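The identity behind the expansion (2.2) is simply $-\ln(1 - 1/n_i) = \sum_{r \ge 1} (1/r)\, n_i^{-r}$ applied to each factor of the product-limit estimate. A quick numerical check, with arbitrary risk-set sizes, is:

```python
import math

n_sizes = [9, 7, 5, 3, 2]  # arbitrary risk-set sizes n_i >= 2

# Left side: H_bar_n = -ln F_bar_n = -sum_i ln(1 - 1/n_i)
lhs = -sum(math.log(1.0 - 1.0 / n) for n in n_sizes)

# Right side: Lambda_n + sum_{r>=2} (1/r) Q_n^r (series truncated at r = 199)
Lambda_n = sum(1.0 / n for n in n_sizes)
tail = sum((1.0 / r) * sum(n ** (-r) for n in n_sizes)
           for r in range(2, 200))

print(abs(lhs - (Lambda_n + tail)) < 1e-12)  # True
```

The truncation at $r = 199$ is harmless: the neglected remainder is bounded by a geometric tail of ratio $1/2$ at most, far below double precision.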
Lemma 2.5. Let $nV_J(\bar H_n(U_{(k)}))$ denote the delete-1 jackknife variance estimator of $\sqrt{n}\,\bar H_n(U_{(k)})$. Then $nV_J(\bar H_n(U_{(k)})) = nV_J(\Lambda_n(U_{(k)})) + O_p(n^{-1})$.

Proof. Let $R(k,r) = \sum_{i=1}^{k} (1/(n\, n_i^{r-1}))[(n_i/(n_i-1))^{r-1} - 1]$ and $\bar H_{J(1)}(U_{(k)}) = (1/n) \sum_{m=1}^{n} \bar H_{n,m}(U_{(k)})$, where $\bar H_{n,m}(U_{(k)})$ denotes the delete-1 estimator of $\bar H_n(U_{(k)})$ when $U_{(m)}$ ($m = 1, \ldots, n$) is deleted from the sample. From Lemma 2.4, for $k \le n-1$, the jackknife variance estimate of $\sqrt{n}\,\bar H_n(U_{(k)})$ is given by
$$nV_J(\bar H_n(U_{(k)})) = (n-1) \sum_{m=1}^{n} [\bar H_{n,m}(U_{(k)}) - \bar H_{J(1)}(U_{(k)})]^2 = (n-1) \sum_{m=1}^{n} \Big\{ \Lambda_{n,m}(U_{(k)}) - \Lambda_{J(1)}(U_{(k)}) + \sum_{r=2}^{\infty} \frac{1}{r} \big[ Q_{n,m}^r(U_{(k)}) - Q_{J(1)}^r(U_{(k)}) \big] \Big\}^2.$$
Since $\Lambda_{J(1)}(U_{(k)}) = \Lambda_n(U_{(k)})$ for $k \le n-1$ (Proposition 2.1) and $Q_{J(1)}^r(U_{(k)}) = Q_n^r(U_{(k)}) + R(k,r)$ (Lemma 2.4), splitting the sum at $m = k$ gives
$$nV_J(\bar H_n(U_{(k)})) = (n-1) \sum_{m=1}^{k} \Big\{ \sum_{i=1}^{m-1} \frac{\delta_{im}}{n_i(n_i - \delta_{im})} - \frac{1}{n_m} + \sum_{r=2}^{\infty} \frac{1}{r} \Big[ \sum_{i \le k,\, i \ne m} \frac{1}{(n_i - \delta_{im})^r} - \sum_{i=1}^{k} \frac{1}{n_i^r} - R(k,r) \Big] \Big\}^2 + (n-1) \sum_{m=k+1}^{n} \Big\{ \sum_{i=1}^{k} \frac{\delta_{im}}{n_i(n_i - \delta_{im})} + \sum_{r=2}^{\infty} \frac{1}{r} \Big[ \sum_{i=1}^{k} \frac{1}{(n_i - \delta_{im})^r} - \sum_{i=1}^{k} \frac{1}{n_i^r} - R(k,r) \Big] \Big\}^2.$$
Since
$$\sum_{r=2}^{\infty} R(k,r) = \frac{1}{n} \sum_{i=1}^{k} \frac{1}{(n_i - 1)(n_i - 2)} = O_p(n^{-2}), \qquad \sum_{r=2}^{\infty} \frac{1}{n_m^r} = \frac{1}{n_m(n_m - 1)} = O_p(n^{-2}),$$
and
$$\sum_{i=1}^{m-1} \sum_{r=2}^{\infty} \frac{n_i^r - (n_i - \delta_{im})^r}{n_i^r (n_i - \delta_{im})^r} = \sum_{i=1}^{m-1} \Big[ \frac{1}{(n_i - \delta_{im} - 1)(n_i - \delta_{im})} - \frac{1}{n_i(n_i - 1)} \Big] = O_p(n^{-2}),$$
it follows that $nV_J(\bar H_n(U_{(k)})) = nV_J(\Lambda_n(U_{(k)})) + O_p(n^{-1})$. This completes the proof of Lemma 2.5.
&
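The $O_p(n^{-2})$ rates in the proof reduce to elementary geometric-series identities, which can be checked numerically:

```python
# Identities behind the O_p(n^-2) terms in Lemma 2.5:
#   sum_{r>=2} n^{-r} = 1/(n(n-1))
#   sum_{r>=2} [(n-1)^{-(r-1)} - n^{-(r-1)}] = 1/(n-2) - 1/(n-1) = 1/((n-1)(n-2))
for n in (3, 5, 10, 50):
    geo = sum(n ** (-r) for r in range(2, 400))
    assert abs(geo - 1.0 / (n * (n - 1))) < 1e-12
    diff = sum((n - 1) ** -(r - 1) - n ** -(r - 1) for r in range(2, 400))
    assert abs(diff - 1.0 / ((n - 1) * (n - 2))) < 1e-12
print("identities verified")
```

Each risk-set size $n_i$ is of order $n$ under $k/n \to p < 1$, so every such term is $O_p(n^{-2})$ and the $r$-sums above contribute only $O_p(n^{-1})$ to the variance after squaring and summing over $m$.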
Lemma 2.6. Let $\bar F_{n,m}(x)$ denote the value of $\bar F_n(x)$ when $(U_{(m)}, V_{(m)})$ is deleted from the sample, and let $\bar F_{J(1)}(x) = \sum_{m=1}^{n} \bar F_{n,m}(x)/n$. For $k \le n-1$,
$$\bar F_{J(1)}(U_{(k)}) = \bar F_n(U_{(k)}) + \sum_{s=2}^{k} \Big[ \prod_{j=s+1}^{k} \frac{n_j - 1}{n_j} \Big] B(s),$$
where
$$B(s) = \frac{1}{n(n_s - 1)} \Big[ \prod_{i=1}^{s-1} \Big( 1 - \frac{1}{n_i - \delta_{is}} \Big) - \frac{1}{n_s} \sum_{m=s}^{n} \delta_{sm} \prod_{i=1}^{s-1} \Big( 1 - \frac{1}{n_i - \delta_{im}} \Big) \Big].$$

Proof. The proof is straightforward by induction on $k$, and hence omitted. $\square$

The jackknife variance estimate of $\bar F_n(x)$ is given by $V_J(\bar F_n(x)) = ((n-1)/n) \sum_{m=1}^{n} [\bar F_{n,m}(x) - \bar F_{J(1)}(x)]^2$. The following theorem shows the consistency of $nV_J(\bar F_n(U_{(k)}))$.
Theorem 2.2. For $k \le n-1$, as $n \to \infty$ and $k/n \to p$, $nV_J(\bar F_n(U_{(k)}))$ converges in probability to the limit variance of $\sqrt{n}[\bar F_n(U_{(k)}) - \bar F(x_p)]$, namely $[\bar F(x_p)]^2 \int_0^{x_p} dF^*(z)/R^2(z)$.

Proof. By Lemma 2.6, we have $\bar F_{J(1)}(U_{(k)}) = \bar F_n(U_{(k)}) + o_p(n^{-1})$. Thus
$$nV_J(\bar F_n(U_{(k)})) = (n-1) \sum_{m=1}^{n} [\bar F_{n,m}(U_{(k)}) - \bar F_{J(1)}(U_{(k)})]^2 = (n-1) \sum_{m=1}^{n} [\bar F_{n,m}(U_{(k)}) - \bar F_n(U_{(k)})]^2 + o_p(1).$$
Now note that $\exp[-\bar H_n(U_{(k)})] = \bar F_n(U_{(k)})$. Hence, from the mean value theorem, we have
$$\bar F_{n,m}(U_{(k)}) - \bar F_n(U_{(k)}) = -e^{-z_{n,m}^k} [\bar H_{n,m}(U_{(k)}) - \bar H_n(U_{(k)})] = -\bar F_n(U_{(k)}) [\bar H_{n,m}(U_{(k)}) - \bar H_n(U_{(k)})] + R_{n,m,k},$$
where $R_{n,m,k} = -[e^{-z_{n,m}^k} - \bar F_n(U_{(k)})][\bar H_{n,m}(U_{(k)}) - \bar H_n(U_{(k)})]$ and $z_{n,m}^k$ is a point on the line segment between $\bar H_{n,m}(U_{(k)})$ and $\bar H_n(U_{(k)})$. Thus we have $nV_J(\bar F_n(U_{(k)})) = C_{1n} + C_{2n} + 2C_{3n} + o_p(1)$, where, writing $\bar R_{n,k} = n^{-1} \sum_{m=1}^{n} R_{n,m,k}$,
$$C_{1n} = (n-1)[\bar F_n(U_{(k)})]^2 \sum_{m=1}^{n} [\bar H_{n,m}(U_{(k)}) - \bar H_n(U_{(k)})]^2, \qquad C_{2n} = (n-1) \sum_{m=1}^{n} (R_{n,m,k} - \bar R_{n,k})^2,$$
$$C_{3n} = -(n-1)\,\bar F_n(U_{(k)}) \sum_{m=1}^{n} (R_{n,m,k} - \bar R_{n,k}) [\bar H_{n,m}(U_{(k)}) - \bar H_n(U_{(k)})].$$
By the Cauchy–Schwarz inequality, $C_{3n}^2 \le C_{1n} C_{2n}$. Hence the consistency of $nV_J(\bar F_n(U_{(k)}))$ will follow if we can show that

(a) $C_{1n}$ converges in probability to $[\bar F(x_p)]^2 \int_0^{x_p} dF^*(z)/R^2(z)$ as $k/n \to p$, and
(b) $C_{2n} = o_p(1)$.

Since $\bar F_n(U_{(k)})$ converges almost surely to $\bar F(x_p)$ (see Wang and Jewell, 1985), (a) follows from Lemma 2.5. Next, $\Lambda_{n,m}(U_{(k)}) - \Lambda_n(U_{(k)})$ equals $\sum_{i=1}^{m-1} \delta_{im}/(n_i(n_i - \delta_{im})) - 1/n_m$ if $m \le k$ and $\sum_{i=1}^{k} \delta_{im}/(n_i(n_i - \delta_{im}))$ if $m > k$. Hence $\bar H_{n,m}(U_{(k)}) - \bar H_n(U_{(k)}) = \Lambda_{n,m}(U_{(k)}) - \Lambda_n(U_{(k)}) + O_p(n^{-1}) = O_p(n^{-1})$. Thus
$$\max_{1 \le m \le n-1} |z_{n,m}^k - \bar H_n(U_{(k)})| \le \max_{1 \le m \le n-1} |\bar H_{n,m}(U_{(k)}) - \bar H_n(U_{(k)})| = O_p(n^{-1}). \qquad (2.3)$$
By the continuity of the exponential function and (2.3), it follows that $\max_{1 \le m \le n-1} |e^{-z_{n,m}^k} - \bar F_n(U_{(k)})|$ converges in probability to zero. Since
$$C_{2n} \le (n-1) \sum_{m=1}^{n} R_{n,m,k}^2 \le \max_{1 \le m \le n-1} [e^{-z_{n,m}^k} - \bar F_n(U_{(k)})]^2 \, (n-1) \sum_{m=1}^{n} [\bar H_{n,m}(U_{(k)}) - \bar H_n(U_{(k)})]^2,$$
(b) follows from Lemma 2.5. This completes the proof of Theorem 2.2. $\square$
In the following section, the effect of jackknifing the product-limit estimate $\bar F_n(x)$ is examined in a Monte Carlo simulation.
3. Simulation results

In this section, simulations are conducted to investigate the performance of the jackknife procedure. It has been shown that $\sqrt{n}(\bar F_n(x) - \bar F(x))$ is asymptotically normal with mean 0 and variance $[\bar F(x)]^2 \int_0^{x} dF^*(z)/R^2(z)$ (Wang et al., 1986). Hence, an approximate $1 - 2\alpha$ confidence interval for $\bar F(x)$ can be constructed as $\bar F_J(x) \pm t_{\alpha, n-1} \sqrt{V_J(\bar F_n(x))}$, where $\bar F_J(x) = n \bar F_n(x) - (n-1) \bar F_{J(1)}(x)$ and $t_{\alpha, n-1}$ is the upper $\alpha$ percentile of a $t$ distribution with $n-1$ degrees of freedom.

An alternative to the jackknife method is the analogue of Greenwood's formula (Tsai et al., 1987), i.e., $V_G(\bar F_n(x)) = [\bar F_n(x)]^2 \sum_{U_{(i)} \le x} 1/(n_i(n_i - 1))$. An approximate $1 - 2\alpha$ confidence interval for $\bar F(x)$ can then be constructed as $\bar F_n(x) \pm z_\alpha \sqrt{V_G(\bar F_n(x))}$, where $z_\alpha$ is the upper $\alpha$ percentile of the standard normal distribution. It should be noted that the bootstrap, a resampling approach investigated by Wang (1991) and Gross and Lai (1996), is also applicable to left-truncated data; Gross and Lai (1996) give an asymptotic justification of the simple bootstrap for left-truncated and right-censored data. Let $V_B(\bar F_n(x))$ denote the simple bootstrap variance estimate of $\bar F_n(x)$. An approximate $1 - 2\alpha$ confidence interval for $\bar F(x)$ can be constructed as $\bar F_n(x) \pm t_{\alpha, n-1} \sqrt{V_B(\bar F_n(x))}$.

We study the performance of the three methods on simulated left-truncated data. The $U$'s are left-truncated exponential: $U \sim LE(0.5)$, i.e., $F(x) = 1 - e^{-(x - 0.5)}$ for $x \ge 0.5$. The $V$'s are left-truncated Weibull: $V \sim LW(a_g, \delta_g)$, i.e., $G(x) = 1 - e^{-(x - a_g)^{\delta_g}}$ for $x \ge a_g$, with parameters $a_g = 0.0, 0.5$ and $\delta_g = 0.25, 1.0, 4.0$. The significance level $\alpha$ is set at 0.025 (95% confidence intervals). The sample sizes are 25, 50, 100 and 200. Five thousand Monte Carlo trials were run, with 500 bootstrap replications per trial. Table 1 shows the empirical coverages (E.C.) of the confidence intervals at $x = 1.5$. Table 1 also shows the proportion of truncation ($p_t$), the empirical bias of $\bar F_n(1.5)$ (denoted by Bias), and the statistic $\hat B/\hat S$, where $\hat S$ denotes the observed empirical variance of $\bar F_n(1.5)$ and $\hat B$ is the difference between the average of $V_J(\bar F_n(1.5))$ (or $V_G(\bar F_n(1.5))$, $V_B(\bar F_n(1.5))$) and $\hat S$. Based on the empirical coverages and the statistic $\hat B/\hat S$ of Table 1, we conclude that:
Table 1
Empirical coverages and $\hat B/\hat S$ at $x = 1.5$.

  n    a_g   δ_g    Bias    p_t^a  | E.C.(GE)  E.C.(JK)  E.C.(BO) | B̂/Ŝ(GE)  B̂/Ŝ(JK)  B̂/Ŝ(BO)
  25   0.0   0.25   0.003   0.35   |  0.932     0.936     0.935   |  0.094     0.051     0.053
  25   0.0   1.00   0.004   0.31   |  0.935     0.949     0.933   |  0.014     0.044     0.065
  25   0.0   4.00   0.004   0.32   |  0.882     0.937     0.908   |  0.230     0.214     0.162
  25   0.5   0.25   0.000   0.42   |  0.923     0.927     0.956   |  0.102     0.052     0.057
  25   0.5   1.00   0.008   0.50   |  0.880     0.921     0.912   |  0.261     0.215     0.122
  25   0.5   4.00   0.140   0.58   |  0.570     0.694     0.643   |  0.612     0.300     0.280
  50   0.0   0.25   0.000   0.35   |  0.942     0.947     0.947   |  0.008     0.014     0.024
  50   0.0   1.00   0.001   0.31   |  0.934     0.942     0.943   |  0.066     0.046     0.036
  50   0.0   4.00   0.001   0.32   |  0.923     0.938     0.912   |  0.063     0.174     0.124
  50   0.5   0.25   0.001   0.42   |  0.930     0.937     0.935   |  0.050     0.017     0.051
  50   0.5   1.00   0.004   0.50   |  0.915     0.933     0.921   |  0.214     0.185     0.096
  50   0.5   4.00   0.116   0.58   |  0.577     0.698     0.685   |  0.614     0.390     0.150
  100  0.0   0.25   0.000   0.35   |  0.945     0.948     0.947   |  0.010     0.016     0.001
  100  0.0   1.00   0.000   0.31   |  0.944     0.947     0.946   |  0.032     0.057     0.041
  100  0.0   4.00   0.006   0.32   |  0.928     0.946     0.938   |  0.015     0.068     0.053
  100  0.5   0.25   0.001   0.42   |  0.932     0.950     0.954   |  0.054     0.024     0.027
  100  0.5   1.00   0.001   0.50   |  0.930     0.942     0.936   |  0.186     0.118     0.089
  100  0.5   4.00   0.105   0.58   |  0.638     0.717     0.702   |  0.531     0.144     0.157
  200  0.0   0.25   0.001   0.35   |  0.947     0.951     0.951   |  0.016     0.025     0.013
  200  0.0   1.00   0.000   0.31   |  0.951     0.951     0.952   |  0.013     0.015     0.027
  200  0.0   4.00   0.001   0.32   |  0.947     0.950     0.948   |  0.019     0.034     0.027
  200  0.5   0.25   0.000   0.42   |  0.947     0.952     0.952   |  0.001     0.156     0.149
  200  0.5   1.00   0.000   0.50   |  0.940     0.946     0.943   |  0.137     0.132     0.119
  200  0.5   4.00   0.001   0.58   |  0.930     0.942     0.931   |  0.201     0.149     0.176

^a The proportion of truncation.
(i) For small samples (n = 25, 50), the jackknife (JK) generally performs conservatively compared with Greenwood (GE) and the bootstrap (BO): JK tends to overestimate the variance of $\bar F_n(x)$, while GE and BO tend to underestimate it. Hence, the coverage of 95% confidence intervals based on normal theory is closer to nominal with JK than with GE and BO.

(ii) For sample size n = 100, both JK and BO slightly overestimate the variance of $\bar F_n(x)$, while GE underestimates it. The coverages of JK and BO are close to the nominal level except under severe truncation (i.e., $a_g = 0.5$ and $\delta_g = 4.00$), where the bias of $\bar F_n(1.5)$ is large and low coverage occurs.

(iii) For large samples (n = 200), all three methods perform fairly well, except that a larger sample size is required in the case of severe truncation for the coverage to be close to the nominal level.

In general, the results indicate that the jackknife procedure is a worthy competitor of the bootstrap and Greenwood's formula under the truncation model.
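To illustrate how such a coverage experiment can be coded, here is a self-contained Python sketch for the case $a_g = 0$, $\delta_g = 1$ at $x = 1.5$. It is a deliberate simplification of the study above (far fewer replications, and a normal rather than a $t$ quantile), not a reproduction of Table 1.

```python
import math
import random

def lb_survival(pairs, x):
    # Lynden-Bell product-limit estimate of P(U > x) from left-truncated pairs.
    surv = 1.0
    for (U, V) in pairs:
        if U <= x:
            surv *= 1.0 - 1.0 / sum(1 for (Uj, Vj) in pairs if Vj <= U <= Uj)
    return surv

def jackknife_ci(pairs, x, z=1.96):
    # Delete-1 jackknife variance of the product-limit estimate, with a
    # normal-quantile interval around the bias-corrected point estimate F_J.
    n = len(pairs)
    loo = [lb_survival(pairs[:m] + pairs[m + 1:], x) for m in range(n)]
    mean_loo = sum(loo) / n
    var_j = (n - 1) / n * sum((v - mean_loo) ** 2 for v in loo)
    centre = n * lb_survival(pairs, x) - (n - 1) * mean_loo
    half = z * math.sqrt(var_j)
    return centre - half, centre + half

def sample_truncated(n, rng):
    # U ~ 0.5 + Exp(1) (i.e., LE(0.5)), V ~ Exp(1) (a_g = 0, delta_g = 1);
    # keep only pairs with U >= V, the random-truncation mechanism.
    out = []
    while len(out) < n:
        U = 0.5 + rng.expovariate(1.0)
        V = rng.expovariate(1.0)
        if U >= V:
            out.append((U, V))
    return out

rng = random.Random(0)
true_surv = math.exp(-1.0)   # F_bar(1.5) = exp(-(1.5 - 0.5))
reps, hits = 200, 0
for _ in range(reps):
    lo, hi = jackknife_ci(sample_truncated(25, rng), 1.5)
    hits += lo <= true_surv <= hi
coverage = hits / reps
print(coverage)
```

With these settings the empirical coverage should land in the neighborhood of the nominal 0.95 level, consistent with the JK column of Table 1 for $n = 25$, $a_g = 0$, $\delta_g = 1$.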
4. Application to a real data set

To illustrate the jackknife method for left-truncated data, we consider the data on the infection times of 37 children and 258 adults with transfusion-related AIDS, described in Section 1. The data have been condensed by grouping dates of

Table 2
Estimates of the conditional distribution of the child induction time, $P(T \le x \mid T < 4.25) = P(U \ge 8 - x \mid U > 3.75)$.
  x      8−x    F̄_n    | GE: p_L   p_U   | JK: p_L   p_U   | BO: p_L   p_U
  0.50   7.50   0.040  |  0.000   0.096  |  0.000   0.098  |  0.000   0.135
  0.75   7.25   0.140  |  0.034   0.246  |  0.037   0.246  |  0.000   0.278
  1.00   7.00   0.260  |  0.112   0.407  |  0.117   0.406  |  0.105   0.414
  1.25   6.75   0.368  |  0.191   0.544  |  0.179   0.557  |  0.191   0.545
  1.50   6.50   0.414  |  0.227   0.600  |  0.221   0.608  |  0.225   0.602
  1.75   6.25   0.491  |  0.291   0.691  |  0.286   0.699  |  0.298   0.685
  2.25   5.75   0.573  |  0.363   0.783  |  0.351   0.796  |  0.380   0.767
  2.50   5.50   0.640  |  0.427   0.854  |  0.409   0.873  |  0.424   0.857
  2.75   5.25   0.712  |  0.498   0.925  |  0.489   0.936  |  0.498   0.926
  3.00   5.00   0.756  |  0.548   0.965  |  0.535   0.978  |  0.578   0.935
  3.25   4.75   0.857  |  0.673   1.000  |  0.603   1.000  |  0.678   1.000
  3.50   4.50   0.923  |  0.778   1.000  |  0.741   1.000  |  0.785   1.000
Table 3 Estimates of conditional distribution of adult induction time PðT r xjT o 6:75Þ ¼ PðU Z 8xjU 41:25Þ. x
0.25 0.50 0.75 1.00 1.25 1.50 1.75 2.25 2.50 2.75 3.00 3.25 3.50 4.00 4.25 4.50 4.75 5.00 5.25 5.50 5.75 6.00 6.25 6.50
8x
7.75 7.50 7.25 7.00 6.75 6.50 6.25 5.75 5.50 5.25 5.00 4.75 4.50 4.00 3.75 3.50 3.25 3.00 2.75 2.50 2.25 2.00 1.75 1.50
Fn
0.005 0.005 0.006 0.016 0.029 0.042 0.064 0.094 0.123 0.146 0.171 0.216 0.248 0.291 0.342 0.390 0.428 0.469 0.548 0.666 0.742 0.797 0.826 0.875
GE
JK
BO
pL
pU
pL
pU
pL
pU
0.001 0.001 0.001 0.007 0.014 0.022 0.035 0.054 0.072 0.088 0.104 0.134 0.156 0.187 0.224 0.259 0.288 0.320 0.385 0.486 0.558 0.615 0.648 0.713
0.009 0.009 0.011 0.026 0.043 0.062 0.093 0.135 0.173 0.204 0.237 0.298 0.340 0.396 0.461 0.522 0.568 0.618 0.712 0.846 0.926 0.979 1.000 1.000
0.001 0.000 0.002 0.010 0.015 0.024 0.038 0.057 0.075 0.090 0.106 0.135 0.158 0.188 0.223 0.258 0.288 0.319 0.377 0.482 0.553 0.603 0.639 0.703
0.009 0.009 0.011 0.025 0.042 0.061 0.092 0.132 0.171 0.203 0.236 0.297 0.339 0.396 0.462 0.523 0.569 0.619 0.720 0.851 0.932 0.990 1.000 1.000
0.000 0.000 0.000 0.002 0.009 0.015 0.032 0.046 0.068 0.083 0.096 0.127 0.154 0.189 0.227 0.261 0.300 0.324 0.379 0.498 0.573 0.635 0.672 0.741
0.010 0.016 0.017 0.031 0.048 0.067 0.097 0.143 0.177 0.209 0.245 0.305 0.342 0.394 0.458 0.520 0.556 0.613 0.718 0.834 0.911 0.959 0.981 1.000
infection and AIDS into three-month intervals beginning April 1, 1978. For the 37 children, the risk sets of the observations $U_{(1)} = 2.5$ and $U_{(2)} = 3.75$ are very small ($n_1 = 3$ and $n_2 = 5$, respectively). Thus, estimation was performed conditionally given that $T < 4.25$, i.e., $U > 3.75$. This conditioning was also used by Tsai et al. (1987) and Lai and Ying (1991); the choice of 3.75 yields risk-set sizes $n_i$ that are not too small. Given the choice of 3.75 and the condensed data, an obvious modification of $\bar F_n$ is $\bar F_n(x) = \prod_{3.75 < U_{(i)} \le x} (1 - d_i/n_i)$, and that of Greenwood's formula is $V_G(\bar F_n(x)) = [\bar F_n(x)]^2 \sum_{3.75 < U_{(i)} \le x} d_i/(n_i(n_i - d_i))$, where $d_i$ is the multiplicity of $U_{(i)}$. Similarly, the jackknife and bootstrap estimates are based on the modified $\bar F_n$. In Table 2, we display three types of 95% confidence intervals $[p_L, p_U]$ for the conditional distribution of the child induction time, $P(T \le x \mid T < 4.25) = P(U \ge 8 - x \mid U > 3.75)$.

For the 258 adults, the risk sets of the observations $U_{(1)} = 0.75$ and $U_{(2)} = 1.25$ are very small ($n_1 = 2$ and $n_2 = 8$, respectively). Thus, estimation was performed conditionally given that $T < 6.75$, i.e., $U > 1.25$. In Table 3, we display three types of 95% confidence intervals for the conditional distribution of the adult induction time, $P(T \le x \mid T < 6.75) = P(U \ge 8 - x \mid U > 1.25)$.

Table 2 indicates that when $x \ge 1.25$, the interval widths based on JK are larger than those based on GE and BO; when $x < 1.25$, the interval widths based on BO are the largest of the three methods. Table 3 indicates that the confidence intervals for GE and JK are almost equal. When $x \ge 4.0$, the interval widths based on BO are slightly smaller than those based on GE and JK; when $x < 4.0$, the situation is reversed.

5. Concluding remarks

In this paper, we have established the consistency of the jackknife variance estimator of the NPMLE of $\bar F$ when the data are left-truncated. In survival studies the unknown parameter of interest can frequently be defined as a functional $T(\bar F)$.
Statistical inferences about $T(\bar F)$ are usually made based on the statistic $T(\bar F_n)$. Assessing $T(\bar F_n)$ requires a consistent estimator, denoted by $V_J(T(\bar F_n))$, of its asymptotic variance. The consistency of the jackknife variance estimator is closely related to the smoothness of the functional $T$. In the "complete data" situation (i.e., when $\bar F_n$ is the empirical survival function), many results on the consistency of $V_J(T(\bar F_n))$ are established through the differentiability of $T$ (e.g., Parr, 1985; Sen, 1988; Shao, 1993). For left-truncated data, it remains to be investigated how smooth $T$ needs to be for the jackknife to work. Moreover, it is likely that the jackknife procedure can be extended to the NPMLE of $F$ for left-truncated and right-censored data (see Gross and Lai, 1996); this extension would, of course, further complicate the proof of consistency.

References

Efron, B., Tibshirani, R.J., 1993. An Introduction to the Bootstrap. Chapman and Hall, New York.
Gaver, D.P., Miller, R.G., 1983. Jackknifing the Kaplan–Meier survival estimator for censored data: simulation results and asymptotic analysis. Comm. Statist. Theory Methods 12, 1701–1718.
Gross, S.T., Lai, T.L., 1996. Bootstrap methods for truncated and censored data. Statist. Sinica 6, 509–530.
Lagakos, S.W., Barraj, L.M., De Gruttola, V., 1988. Nonparametric analysis of truncated survival data, with application to AIDS. Biometrika 75, 515–523.
Lai, T.L., Ying, Z., 1991. Estimating a distribution function with truncated and censored data. Ann. Statist. 19 (1), 417–442.
Lynden-Bell, D., 1971. A method of allowing for known observational selection in small samples applied to 3CR quasars. Monthly Notices Roy. Astronom. Soc. 155, 95–118.
Parr, W.C., 1985. Jackknifing differentiable statistical functionals. J. Roy. Statist. Soc. B 47, 56–66.
Quenouille, M.H., 1956. Notes on bias in estimation. Biometrika 43, 353–360.
Sen, P.K., 1988. Functional jackknifing: rationality and general asymptotics. Ann. Statist. 16, 450–469.
Shao, J., 1993. Differentiability of statistical functionals and consistency of the jackknife. Ann. Statist. 21, 61–75.
Stute, W., 1996. The jackknife estimate of variance of a Kaplan–Meier integral. Ann. Statist. 24, 2679–2704.
Stute, W., Wang, J.-L., 1994. The jackknife estimate of a Kaplan–Meier integral. Biometrika 81, 602–606.
Tsai, W.-Y., 1990. Testing the assumption of independence of truncation time and failure time. Biometrika 77, 169–177.
Tsai, W.-Y., Jewell, N.P., Wang, M.-C., 1987. A note on the product-limit estimator under right censoring and left truncation. Biometrika 74, 883–886.
Tukey, J., 1958. Bias and confidence in not quite large samples. Ann. Math. Statist. 29, 614.
Wang, M.-C., 1991. Nonparametric estimation from cross-sectional survival data. J. Amer. Statist. Assoc. 86, 130–143.
Wang, M.-C., Jewell, N.P., 1985. The product limit estimate of a distribution function under random truncation. Technical Report, Program in Biostatistics, University of California, Berkeley.
Wang, M.-C., Jewell, N.P., Tsai, W.-Y., 1986. Asymptotic properties of the product-limit estimate under random truncation. Ann. Statist. 14, 1597–1605.
Woodroofe, M., 1985. Estimating a distribution function with truncated data. Ann. Statist. 13, 163–177.