Computational Statistics and Data Analysis (
)
–
Contents lists available at ScienceDirect
Computational Statistics and Data Analysis journal homepage: www.elsevier.com/locate/csda
Nonparametric regression with doubly truncated data C. Moreira a,b,∗ , J. de Uña-Álvarez a , L. Meira-Machado b a
Statistical Inference, Decision, and OR group & Centro de Investigaciones Biomédicas (CINBIO), University of Vigo, Lagoas Marcosende, 36310 Vigo, Spain b
Centre of Mathematics and Department of Mathematics and Applications, University of Minho - Campus de Azurém, 4800-058 Guimarães, Portugal
article
info
Article history: Received 12 December 2013 Received in revised form 20 March 2014 Accepted 27 March 2014 Available online xxxx Keywords: Local polynomial regression Kernel smoothing Bandwidth selection Random truncation Biased data Mean squared error
abstract Nonparametric regression with a doubly truncated response is introduced. Local constant and local linear kernel-type estimators are proposed. Asymptotic expressions for the bias and the variance of the estimators are obtained, showing the deterioration provoked by the random truncation. To solve the crucial problem of bandwidth choice, two different bandwidth selectors based on plug-in and cross-validation ideas are introduced. The performance of both the estimators and the bandwidth selectors is investigated through simulations. A real data illustration is included. The main conclusion is that the introduced regression methods perform satisfactorily in the complicated scenario of random double truncation. © 2014 Elsevier B.V. All rights reserved.
1. Introduction Random truncation is a well-known phenomenon which may be present when observing time-to-event data. For example, recruitment of lifetimes through a cross-section induces left-truncation, so larger event times are observed with higher probability. Another example is found when analyzing data which correspond to events taking place before some specific date; in this case, the time-to-event is right-truncated and, therefore, small lifetimes are over-sampled. These two forms of truncation are one-sided, and relatively simple estimators exist. See e.g. Klein and Moeshberger (2003). Nonparametric estimation methods suitable for one-sided random truncation were developed in the last three decades, see for example Woodroofe (1985), Tsai et al. (1987) or Stute (1993) for the estimation of a cumulative distribution function, and for nonparametric regression, Gross and Lai (1996); Iglesias-Pérez and González-Manteiga (1999), Akritas and LaValley (2005) or Ould-Saïd and Lemdani (2006). In some applications, two-sided (rather than one-sided) random truncation appears. This occurs, for example, when the sample restricts to those individuals with event falling between two particular dates. This is the case of the sample provided by Moreira and de Uña-Álvarez (2010a), who reported data corresponding to children diagnosed of cancer between 1999 and 2003; in this case, the age at cancer diagnosis is doubly truncated, the truncation times have been determined by the two specific limiting dates of observation. The AIDS Blood Transfusion data in Kalbfleisch and Lawless (1989) is another example of such a situation. These data are restricted to those cases diagnosed of AIDS prior to January 1987. For this data
∗ Corresponding author at: Statistical Inference, Decision, and OR group & Centro de Investigaciones Biomédicas (CINBIO), University of Vigo, Lagoas Marcosende, 36310 Vigo, Spain. E-mail addresses:
[email protected],
[email protected] (C. Moreira),
[email protected] (J. de Uña-Álvarez),
[email protected] (L. Meira-Machado). http://dx.doi.org/10.1016/j.csda.2014.03.017 0167-9473/© 2014 Elsevier B.V. All rights reserved.
2
C. Moreira et al. / Computational Statistics and Data Analysis (
)
–
set, the induction times are doubly truncated because HIV was unknown before 1982, so any case of transfusion-related AIDS before this time would not have been properly classified. See Section 4 for more details. In these two examples, double truncation affects a retrospective time (time from onset to diagnosis). Therefore, right-censoring issues are not present. Under double truncation, the observational bias is not so evident as in the one-sided truncated setup. Generally speaking, one may say that, under double truncation, large and small inter-event times will be less probably observed. Unlike for onesided truncation, the nonparametric maximum-likelihood estimator (NPMLE) of the lifetime distribution has no explicit form under double truncation; this complicates the practice and the theoretical developments. We mention that censoring is a problem different from random truncation, because with censored data the researcher has at least some partial information on the censored lifetimes. Compared to the huge literature devoted to one-sided truncation, there are only few papers devoted to the random double truncation model. Efron and Petrosian (1999) introduced the NPMLE of a cumulative distribution function (df) under double truncation. The asymptotic properties of this NPMLE were further investigated by Shen (2010). Moreira and de UñaÁlvarez (2010b) introduced a semiparametric estimator of a doubly truncated df, while Moreira et al. (2010) presented an R package to compute the NPMLE and confidence bands. Methods for testing a quasi-independence assumption between the lifetime of interest and the truncation times were investigated by Martin and Betensky (2005). Despite the existence of these papers, random double truncation is a phenomenon which is still quite unknown nowadays. In some applications, the goal is the estimation of a smooth curve such as the density function, the hazard rate function, or the regression function. The estimation of these curves crucially depends on the selected bandwidth or smoothing parameter (Wand and Jones, 1995). To the best of our knowledge, the only paper dealing with smoothing methods under double truncation is Moreira and de Uña-Álvarez (2012), who considered kernel density estimation. In this paper we rather focus on nonparametric kernel regression. Let (X ∗ , Y ∗ ) be the two-dimensional variable of interest, where Y ∗ is the lifetime or the inter-event time of main interest, and X ∗ is a one-dimensional continuous covariate. Since Y ∗ may represent a transformation of the lifetime (such as the logarithm) in applications, or just a different type of response, we just assume that the support of Y ∗ is contained in the reals. The goal is the estimation of the regression function m(x) = E [Y ∗ |X ∗ = x]. Due to the presence of random double truncation, we are only able to observe (X ∗ , Y ∗ ) when U ∗ ≤ Y ∗ ≤ V ∗ , where (U ∗ , V ∗ ) are the truncation times; in that case, (U ∗ , V ∗ ) are also observed. On the contrary, when U ∗ ≤ Y ∗ ≤ V ∗ is violated, nothing is observed. As usual with random truncation, we assume that the truncation times are independent of (X ∗ , Y ∗ ). Let (U1 , V1 , X1 , Y1 ), . . . , (Un , Vn , Xn , Yn ) be the observed sample, these are iid data with the same distribution as (U ∗ , V ∗ , X ∗ , Y ∗ ) given U ∗ ≤ Y ∗ ≤ V ∗ , and let mT (x) = E [Y1 |X1 = x] be the observed regression function. In general, mT (x) and the target m(x) will differ; see e.g. Fig. 4, in which these two curves are estimated for the AIDS Blood Transfusion data. This is because of the truncating condition which introduces an observational bias. Similar features were reported in the context of length-biasing, in which the relative probability of sampling a given value of (X ∗ , Y ∗ ) is proportional to the length of Y ∗ , see e.g. Cristóbal and Alcalá (2000). In the doubly truncated setup, this relative probability of observing (X ∗ , Y ∗ ) = (x, y) is given by G(y) = P (U ∗ ≤ y ≤ V ∗ ), since (X ∗ , Y ∗ ) and (U ∗ , V ∗ ) are independent. This function G can be estimated from the data by maximum likelihood principles, see the iterative algorithm in Section 2. The rest of the paper is organized as follows. In Section 2 we introduce the relationship between the observed conditional distribution and that of interest. As it will be seen, by downweighting the (Xi , Yi )s with the largest values of Gn (Yi ) (where Gn is an estimator for G), we are able to obtain a consistent estimator of m(x). Weighted local polynomial type estimators are considered to this end. We give the asymptotic bias and variance of the weighted Nadaraya–Watson (i.e. local constant) estimator and the weighted local linear kernel estimator, and a confidence interval is introduced. We also propose two different methods to choose the bandwidth for these estimators in practice. In Section 3 we investigate the finite-sample performance of the estimators, and the bandwidth selectors through simulations. Section 4 illustrates all the proposed methods by considering AIDS Blood Transfusion data of Kalbfleisch and Lawless (1989). Finally, in Section 5 we report the main conclusions of our investigation. The technical proofs and details are deferred to the Appendix. 2. The estimators In this Section we introduce the proposed estimators. We also include the asymptotic results (Section 2.1) and the bandwidth selection algorithms (Section 2.2). Firstly we introduce the needed notations. Let F (.|x) be the conditional df of Y ∗ ∞ ∞ ∗ ∗ ∗ ∗ ∗ given X = x, so m(x) = −∞ tF (dt |x), and let α(x) = P (U ≤ Y ≤ V |X = x) = −∞ G(t )F (dt |x) be the conditional probability of no truncation. It is assumed that α(x) > 0. Let F ∗ (.|x) be the observable conditional df, that is F ∗ (y|x) = P (Y1 ≤ y|X1 = x). We have F (y|x) = α(x) ∗
−1
y
G(t )F (dt |x) −∞
for every y. This means that, for a fixed value of the covariate, the response Y ∗ is observed with a relative probability y proportional to G(Y ∗ ). Conversely, provided that G(t ) > 0 for all t, one may write F (y|x) = α(x) −∞ G(t )−1 F ∗ (dt |x),
∞
where α(x) = 1/α ∗ (x) with α ∗ (x) = −∞ G(t )−1 F ∗ (dt |x) = E G(Y1 )−1 |X1 = x . Therefore, the target m(x) is written as m(x) = m∗ (x)/α ∗ (x) where m∗ (x) = E Y1 G(Y1 )−1 |X1 = x .
C. Moreira et al. / Computational Statistics and Data Analysis (
)
–
3
Note that the functions m∗ (x) and α ∗ (x) are conditional means of observable variables (once G is replaced by a proper estimator) and, consequently, they can be estimated by standard methods as, e.g., Nadaraya–Watson. Hence, a consistent ∗ ˆ NW (x) = m ˆ ∗NW (x)/αˆ NW estimator of m(x) may be introduced as m (x), where n
Kh (x − Xi )Yi Gn (Yi )−1
i =1
∗NW (x) = m
n
Kh ( x − X i )
i =1
and n
αNW (x) = ∗
Kh (x − Xi )Gn (Yi )−1
i=1 n
Kh (x − Xi )
i=1
are the Nadaraya–Watson estimators of m∗ (x) and α∗ (x) respectively. In these expressions, Gn stands for a nonparametric estimator of the biasing function G, namely Gn (y) = u≤y≤v Tn (du, dv) where Tn is the NPMLE of the joint df of the truncation times U ∗ and V ∗ (Shen, 2010). Also, K and h are the kernel function and the bandwidth respectively, while Kh (.) = K (./h)/h is the re-scaled kernel; see e.g. Wand and Jones (1995) for more on kernel regression. As mentioned, for the computation of Tn (and hence of Gn ) an iterative algorithm is needed. To this end, we have followed the algorithm proposed by Shen (2010), see also Moreira et al. (2010). Explicitly, the steps are as follows:
ϕm(0) I (Ui ≤ Ym ≤ Vi ) where ϕm(0) = 1n , 1 ≤ m ≤ n, is the initial solution for F . −1 n (1) (1) 1 1 Step S1 Compute the associated solution for T : ψj = = nm=1 ψm(1) I (Um ≤ Yi ≤ (0) , 1 ≤ j ≤ n, and Ψi i=1 (0) (0)
Step S0 Compute Φi
=
n
m=1
Φi
Vm ). (1)
Step S2 Compute the improved solution for F : ϕj
=
n
i=1
1 (1)
Ψi
Φj
−1
1 (1) ,
Ψj
(1)
1 ≤ j ≤ n, and Φi
=
n
m=1
ϕm(1) I (Ui ≤ Ym ≤
Vi ) . Step S3 Repeat Steps S1 and S2 until a convergence criterion is reached. (k−1)
− ϕj(k) | ≤ 1e − 06. Then, the estimated Tn is constructed (k) (k) n from the k-th solution ψj , 1 ≤ j ≤ n, as Tn (u, v) = j=1 ψj I (Uj ≤ u, Vj ≤ v). Accordingly, Gn is computed as n (k) Gn (y) = u≤y≤v Tn (du, dv) = j=1 ψj I (Uj ≤ y ≤ Vj ). An alternative way of introducing a consistent estimator for m(x) is through weighted local least squares. Introduce the As convergence criterion, we have used max1≤j≤n |ϕj
criterion function n {Yi − β0 − · · · − βp (Xi − x)p }2 Kh (x − Xi )Gn (Yi )−1 . i =1
ˆ (p) (x) = Let ( β0 , . . . , βp ) be the minimizer of this criterion. Then, m β0 is an estimator for m(x). Under length-bias (G(y) ∝ y), this is the (weighted) local polynomial estimator introduced by Cristóbal and Alcalá (2000). For p = 0, we obtain NW (x) introduced above. For p = 1, we obtain the weighted local the Nadaraya–Watson (i.e. local constant) type estimator m LLK (x), which (in the ordinary setting with no truncation) has been recommended linear kernel regression estimator, say m in applications due to its smaller bias and boundary adaptability when compared to the local constant estimator. 2.1. Asymptotic performance
ˆ NW (x) and m ˆ LLK (x). Some further Theorem below gives an asymptotic expression for the bias and the variance of m notation is needed. We put f (x) for the density of the covariate X ∗ , and α = P (U ∗ ≤ Y ∗ ≤ V ∗ )(>0) for the (unconditional) probability of no truncation, and we introduce σ 2 (x) = E [(Y ∗ − m(X ∗ ))2 G(Y ∗ )−1 |X ∗ = x]. We also put µ2 (K ) = t 2 K (t )dt and R(K ) = K (t )2 dt. The following conditions are assumed: (C1) (C2) (C3) (C4)
The function m′′ is continuous and bounded in a neighborhood of x. The kernel function K is a density function symmetrical about zero, and with compact support. As n → ∞, h → 0 and nh → ∞. The conditional expectations E [(Y ∗ − m(X ∗ ))d G(Y ∗ )−1 |X ∗ = x], d = 0, 1, 2, are finite.
4
C. Moreira et al. / Computational Statistics and Data Analysis (
)
–
In the following, we put an ∼ bn whenever two sequences of reals satisfy an /bn → 1 as n → ∞.
NW (x) and m LLK (x) are given by Theorem 1. Assume (C1)–(C4). The asymptotic bias and variance of m Bias( mNW (x)) ∼ Bias( mLLK (x)) ∼
1 2 1 2
µ2 (K )h2 m′′ (x)f (x) + 2m′ (x)f ′ (x) /f (x) ≡ h2 BNW (x),
(1)
µ2 (K )h2 m′′ (x) ≡ h2 BLLK (x),
(2)
and Var ( mNW (x)) ∼ Var ( mLLK )(x) ∼ (nh)−1 R(K )ασ 2 (x)/f (x) ≡ (nh)−1 V (x).
(3)
Proof. See the Appendix. Theorem 1 shows that the asymptotic bias of the proposed estimators is the same as that corresponding to the iid case and, therefore, it is unaffected by the double truncation issue. However, truncation may influence the variance; the same happens under one-sided truncation (see e.g. Ould-Saïd and Lemdani (2006), or Liang et al. (2011)) and for length-biased data Cristóbal and Alcalá (2000). In the untruncated situation, the asymptotic variance of both the NW and LLK estimators is (nh)−1 R(K )τ 2 (x)/f (x) where τ 2 (x) = E [(Y ∗ − m(X ∗ ))2 |X ∗ = x]. This quantity τ 2 (x) may be greater or smaller than ασ 2 (x), so the estimation of m(x) could be performed with less variance under double truncation than under iid sampling at particular points x. This can be explained by the fact that, because of truncation, specific parts of the support of the variable of interest are over-sampled (whileothers not), thus introducing extra information in some areas. However, by Hölder’s inequality we have ασ 2 (x)dx ≥ τ 2 (x)dx, and hence the doubly truncated scenario is ‘more difficult’ (in the sense of having more variance in estimation) in average. A similar feature was found when estimating a density function under double truncation, see Moreira and de Uña-Álvarez (2012). We also point out that the expressions for the asymptotic bias and variance given in Theorem 1 are unaffected by estimation of G. As indicated in the Appendix, this is due to the rate of convergence of Gn , which is faster than (nh)1/2 . Theorem 1 also gives the information about the relative asymptotic performance of Nadaraya–Watson and local linear kernel smoothers. As usual when comparing both methods, it is seen that the NW involves an extra bias term (2m′ (x)f ′ (x)) which is missing in the LLK. In practice, this will result in a poorer relative performance. 2.2. Bandwidth choice As always with kernel methods, the choice of the bandwidth sequence h is very important for the accuracy of the proposed estimators. One possible criterion for choosing the bandwidth is to minimize the mean integrated squared error (MISE), ˆ (x) is defined as which for any estimator m MISE( m) = E
( m(x) − m(x))2 dx =
Bias( m(x))2 dx +
Var ( m(x))dx.
Theorem 1 suggests the following asymptotic approximation to the MISE of the NW and LLK estimators: MISE( mNW ) ∼ h4
BNW (x)2 dx + (nh)−1
V (x)dx
and MISE( mLLK ) ∼ h
4
BLLK (x) dx + (nh) 2
−1
V (x)dx
respectively. The bandwidth minimizing these two functions is given by hopt = n
−1/5
1/5 V (x)dx 4 B2 (x)dx
(4)
where V (x) is defined in (3) and B(x)is BNW (x) in (1) for the NW estimator and BLLK (x) in (2) for the LLK estimator. Practical usage of (4) requires estimation of V (x)dx and B2 (x)dx, which can be performed on the basis of formulas (1)–(3). To this end, we follow the direct plug-in (DPI) method in Härdle and Marron (1995), which makes use of polynomials and histograms for the preliminary estimation of the density and regression functions involved in B(x) and V (x). However, these estimators must be adapted to the doubly truncated scenario. In the Appendix we give the most relevant details behind this modified DPI bandwidth hDPI . Alternatively, one may use a cross-validation (CV) criterion to compute the bandwidth in a data-driven way. For any
2
ˆ (x) of m(x) depending on a bandwidth h, a cross-validation function is given by CV (h) = i=1 Yi − m ˆ −i (Xi ) , estimator m ˆ −i (x) is the leave-one-out version of m ˆ (x) computed by removing the ith datum from the initial sample (cfr. Härdle where m et al. (2004)). Both the DPI and the CV bandwidths are investigated in the simulations of the next section. n
C. Moreira et al. / Computational Statistics and Data Analysis (
)
–
5
Fig. 1. Top: biasing function G for Models 1–3 (from left to right). Bottom: observational bias behind Models 1–3 (from left to right): target (dashed line) and observed regression function (solid line). Case τ = 0.1.
3. Simulation study In this section, we illustrate the finite sample behavior of the proposed estimators, through a simulation study. We compare the global mean squared errors of the NW and LLK estimators, and we investigate the performance of the two bandwidth selectors introduced in Section 2. We also explore the performance of the proposed confidence interval. We generate the data according to three different patterns of truncation:
• Model 1: 1. Draw X ∗ from Unif (0, 1). 2. Given X ∗ = x, set Y ∗ = m(x) + ϵ with ϵ ∼ N (0, τ ) independent of X ∗ . 3. Draw independently U ∗ ∼ Unif (−0.125, 0.5) and V ∗ ∼ Unif (0.5, 1.5). 4. We accept the (U ∗ , V ∗ , X ∗ , Y ∗ ) satisfying the condition U ∗ ≤ Y ∗ ≤ V ∗ . Models 2 and 3 are similar to Model 1 but with a different Step 3 for the simulation of the truncating variables. Specifically, for Model 2 we take U ∗ ∼ 1.204Beta(3, 1)− 0.704 and V ∗ ∼ Unif (0.5, 3); while for Model 3 we take U ∗ ∼ Unif (−0.25, 0.5) and V ∗ = U ∗ + 1.25. Note that Model 3 reflects a truncation pattern similar to that of the AIDS data in Section 4, where the difference between the right and left truncation times is constant. Given the covariate Xi (i = 1, . . . , n), the response Yi is 2+sin(2π x) generated in Step 2 by considering as true regression function m(x) = E (Y |X = x) = . The whole procedure Step 3 1–Step 4 is repeated until a final sample with size n is obtained, with n = 50, 100, 250 and 500. We consider two different values for the standard deviation τ of ϵ in each model, τ = 0.01 and τ = 0.1. In Fig. 1 the observational bias behind the simulated models is depicted. For Model 1, the biasing function G(t ) is roughly symmetrical, taking smaller values as t goes to the tails. As a consequence, the observable regression function separates from the target for X around 1/4 and 3/4, corresponding to the maximum and minimum values of the sinus function. This departure is hardly noticed for τ = 0.01 (not shown), since in this case the response is roughly constant and therefore the double truncation does not induce any bias; however, the separation becomes very clear for τ = 0.1 (Fig. 1, bottom-left). On the other hand, for Models 2 and 3 the biasing function is asymmetrical (Fig. 1, top-center and top-right, respectively). In both cases the main problems are in the observation of small responses; this provokes the departure between the target and the observed regression function around 3/4 (Fig. 1, bottom-center and bottom-right). As for Model 1, the observational bias is almost negligible for Models 2 and 3 when τ = 0.01 (not shown). ˆ NW (x) and m ˆ LLK (x), both based on a Gaussian kernel, was assessed along M = 1000 Monte The global performance of m ˆh Carlo trials. To this end we used the global mean squared error (GMSE) as a measure of fit, which for any given estimator m
6
C. Moreira et al. / Computational Statistics and Data Analysis (
)
–
Table 1 Minimum GMSE’s (×103 ) of the NW and LLK estimators, and corresponding optimal bandwidths for Model 1. The median and the interquartile range (×103 ) of CV and DPI bandwidth selectors are also reported.
τ
n
hGMSE
GMSE
hCV
IQR(hCV )
IQR(hDPI )
hDPI
NW m 50 100 250 500
0.008 0.008 0.008 0.007 LLK m
0.063 0.044 0.023 0.013
0.017 0.012 0.009 0.007
9.000 4.000 2.000 1.000
0.022 0.020 0.018 0.016
4.394 3.123 1.845 1.132
50 100 250 500
0.026 0.022 0.017 0.015
0.031 0.018 0.009 0.005
0.027 0.022 0.018 0.016
6.250 4.000 2.000 1.000
0.024 0.022 0.018 0.016
1.959 1.118 0.514 0.320
50 100 250 500
0.047 0.040 0.032 0.027 LLK m
1.944 1.179 0.577 0.329
0.051 0.045 0.037 0.032
22.250 14.000 9.000 7.000
0.049 0.048 0.042 0.038
11.447 9.530 5.211 3.648
50 100 250 500
0.063 0.054 0.045 0.038
1.603 0.926 0.437 0.245
0.074 0.066 0.061 0.059
22.000 13.250 8.000 5.000
0.053 0.050 0.043 0.038
11.643 9.131 4.194 2.724
0.01
NW m
0.1
Table 2 Minimum GMSE’s (×103 ) of the NW and LLK estimators, and corresponding optimal bandwidths for Model 2. The median and the interquartile range (×103 ) of CV and DPI bandwidth selectors are also reported.
τ
n
hGMSE
GMSE
hCV
IQR(hCV )
hDPI
IQR(hDPI )
NW m 50 100 250 500
0.008 0.008 0.008 0.007 LLK m
0.064 0.045 0.023 0.013
0.018 0.012 0.008 0.007
8.000 4.000 3.000 1.000
0.022 0.020 0.018 0.016
4.155 2.787 1.709 1.132
50 100 250 500
0.026 0.022 0.018 0.015
0.032 0.019 0.009 0.005
0.026 0.022 0.018 0.016
6.000 4.000 2.000 1.000
0.024 0.022 0.018 0.016
1.874 0.965 0.533 0.320
50 100 250 500
0.046 0.040 0.032 0.027 LLK m
1.862 1.097 0.540 0.329
0.049 0.043 0.035 0.032
22.000 14.000 9.250 7.000
0.049 0.048 0.042 0.038
11.096 9.261 5.725 3.648
50 100 250 500
0.062 0.054 0.044 0.038
1.529 0.935 0.437 0.245
0.071 0.065 0.057 0.059
21.000 13.000 9.000 5.000
0.053 0.049 0.043 0.038
11.127 8.469 4.177 2.724
0.01
NW m
0.1
is defined as GMSE (h) =
M n 1
Mn l=1 k=1
ˆ h,l (Xk,l ) − m(Xk,l )]2 [m
ˆ h,l (x) is the estimator m ˆ h (x) based on the lth trial, and Xk,l is the kth covariate value in the lth trial. In Tables 1–4 we where m report the optimal bandwidth hGMSE (defined as the bandwidth leading to the smallest GMSE) and the minimum GMSE for the NW and LLK estimators and Models 1 to 3. As expected, the optimal bandwidths and the GMSE decrease as the sample size increases, and they increase with the noise (τ ). On the other hand, the error of the LLK estimator is always smaller than that of NW, so the local linear smoother will be preferred in practice. The median and the interquartile range (IQR) of the CV and DPI bandwidths hCV and hDPI along the Monte Carlo simulations are also provided in Tables 1–4. It is seen from the Tables that CV is more variable than DPI; this agrees with previous comparative studies on both bandwidth selectors. Regarding the bias, generally speaking, the CV bandwidth tends to oversmooth. On the other hand, the DPI method tends to choose a bandwidth smaller than optimal for LLK (some exceptions are
C. Moreira et al. / Computational Statistics and Data Analysis (
)
–
7
Table 3 Minimum GMSE’s (×103 ) of the NW and LLK estimators, and corresponding optimal bandwidths for Model 3. The median and the interquartile range (×103 ) of CV and DPI bandwidth selectors are also reported.
τ
n
hGMSE
GMSE
hCV
IQR(hCV )
hDPI
IQR(hDPI )
NW m 50 100 250 500
0.008 0.008 0.008 0.007 LLK m
0.064 0.034 0.023 0.013
0.018 0.010 0.009 0.007
8.000 3.000 2.250 1.000
0.022 0.019 0.018 0.016
4.054 2.408 1.764 1.151
50 100 250 500
0.025 0.019 0.017 0.015
0.032 0.013 0.009 0.005
0.026 0.020 0.018 0.015
6.250 3.000 3.000 2.000
0.024 0.020 0.018 0.016
1.736 0.783 0.569 0.310
50 100 250 500
0.048 0.037 0.033 0.027 LLK m
1.798 0.760 0.539 0.314
0.050 0.039 0.035 0.031
20.250 10.000 8.000 6.000
0.049 0.045 0.042 0.037
10.863 6.585 5.907 3.599
50 100 250 500
0.063 0.048 0.044 0.038
1.501 0.617 0.417 0.229
0.069 0.057 0.052 0.048
22.000 9.000 7.000 5.000
0.053 0.047 0.043 0.038
10.532 5.515 4.908 3.019
0.01
NW m
0.1
Table 4 Minimum GMSE’s (×103 ) of the NW and LLK estimators, and corresponding optimal bandwidths for Model 3, using the true G(.). The median and the interquartile range (×103 ) of CV and DPI bandwidth selectors are also reported.
τ
n
hGMSE
GMSE
hCV
IQR(hCV )
hDPI
IQR(hDPI )
GMSE-ratio
0.021 0.012
0.008 0.007
2.120 0.986
0.017 0.015
1.432 1.021
0.913 0.925
0.008 0.005
0.017 0.011
2.893 1.970
0.017 0.015
0.504 0.303
0.889 1.000
0.502 0.298
0.031 0.031
7.700 5.591
0.041 0.036
5.310 3.564
0.931 0.950
0.401 0.222
0.051 0.047
6.400 4.598
0.042 0.037
4.590 2.827
0.962 0.970
NW m 250 500 0.01
0.007 0.007
LLK m 250 500
0.018 0.011
NW m 250 500 0.1
0.031 0.026
LLK m 250 500
0.041 0.037
found for τ = 0.01), while it oversmooths for NW. Overall, CV is closer than DPI to hGMSE for NW but the opposite is true for LLK. In Figs. 2 and 3 the attained MSE’s (for each simulated sample) of the NW and LLK estimators when based on the CV and DPI bandwidths are described by using box-plots. These figures correspond to n = 250, other values of n reported similar results (not shown). The box-plots reveal the smaller error of the LLK estimator when compared to NW. It is also clear that the error of the NW estimator is smaller when based on CV bandwidth, but the situation is just the opposite for the LLK estimator, which performs better with DPI. This is in agreement with the bias of each bandwidth selector (discussed above). Therefore, one practical conclusion from our simulations is that one should use the LLK estimator rather than NW, and that one should take the DPI bandwidth for the computation of the LLK. In Figs. 2 and 3 reference lines for the minimum GMSE attained by each method is included (dashed lines), suggesting that the LLK estimator with DPI bandwidth performs well. An interesting issue to investigate is the relative importance of the information on the biasing function G in the construction of the regression estimator. To this end, in Table 4 we report for Model 3 and sample sizes n = 250, 500 the simulation results for the estimators which make use of the true function G. Results are directly comparable to those in Table 3. The last column in Table 4 reports the ratio between the GMSE of the estimator based on G (Table 4) and that based on Gn (Table 3), the last one being always greater for the considered sample sizes. Roughly, it is seen that information on G is relevant, since
8
C. Moreira et al. / Computational Statistics and Data Analysis (
)
–
Fig. 2. Box-plot of the M = 1000 mean squared errors obtained using the NW (left) and the LLK (right) estimators when using the CV or DPI bandwidths, for Models 1 to 3 (from top to bottom), with sample size n = 250 and τ = 0.01. Horizontal dashed line corresponds to the minimum GMSE.
there is a variance increase coming from the estimation of this function; but this has a minor impact in the performance of the estimator (in agreement with its smaller order of convergence) as the sample size increases. Similar findings in kernel density estimation were reported by Moreira and de Uña-Álvarez (2012). Regarding the bandwidth selectors, Table 4 suggests that the estimator based on the true G demands less smoothing than that using Gn , according to its smaller variance; the automatic bandwidth selectors present a smaller dispersion too.
C. Moreira et al. / Computational Statistics and Data Analysis (
)
–
9
Fig. 3. Box-plot of the M = 1000 mean squared errors obtained using the NW (left) and the LLK (right) estimators when using the CV or DPI bandwidths, for Models 1 to 3 (from top to bottom), with sample size n = 250 and τ = 0.1. Horizontal dashed line corresponds to the minimum GMSE.
4. AIDS incubation time analysis For illustration purposes, in this section we consider epidemiological data on transfusion-related Acquired Immune Deficiency Syndrome (AIDS). The AIDS Blood Transfusion Data are collected by the Centers for Disease Control (CDC),
10
C. Moreira et al. / Computational Statistics and Data Analysis (
)
–
Fig. 4. Left panel: NW estimator based on CV bandwidth (h = 14.4, dashed–dotted line) and DPI bandwidth (h = 6.12, dashed line). Right panel: LLK estimator based on CV bandwidth (h = 20, dashed–dotted line) and DPI bandwidth (h = 7.59, dashed line). Solid lines correspond to the standard NW (left) and LLK (right) estimators based on DPI bandwidth. AIDS Blood Transfusion data.
which is from a registry data base, a common source of medical data (see Bilker and Wang (1996), Kalbfleisch and Lawless (1989)). The variable of interest (X ∗ ) is the induction or incubation time, which is defined as the time elapsed from Human Immunodeficiency virus (HIV) infection to the clinical manifestation of full-blown AIDS. The CDC AIDS Blood Transfusion Data can be viewed as being doubly truncated. The data were retrospectively ascertained for all transfusion-associated AIDS cases in which the diagnosis of AIDS occurred prior to the end of the study, thus leading to right-truncation. Besides, because HIV was unknown prior to 1982, any cases of transfusion-related AIDS before this time would not have been properly classified and thus would have been missed. Thus, in addition to right-truncation, the observed data are also truncated from the left. See Bilker and Wang (1996, Section 5.2), for further discussions. The data include 494 cases reported to the CDC prior to January 1, 1987, and diagnosed prior to July 1, 1986. Of the 494 cases, 295 had consistent data, and the infection could be attributed to a single transfusion or short series of transfusions. Our analyses are restricted to this subset, which is entirely reported in Kalbfleisch and Lawless (1989), Table 1. Values of U ∗ were obtained by measuring the time from HIV infection to January 1, 1982; while V ∗ was defined as time from HIV infection to the end of study (July 1, 1986). Note that the difference between V ∗ and its respective U ∗ is always July 1, 1986–January 1, 1982 = 4.5 years. More specifically, our goal is to study the relationship between AIDS incubation time and age at infection. In Fig. 4 we depict the scatterplot of the CDC AIDS blood transfusion data (age at the infection vs. time of incubation) together with the regression function estimators. Fig. 4, left, gives the NW estimator computed from the CV and DPI automatic bandwidth selectors. For the CDC AIDS blood transfusion data these bandwidths gave the values 14.4 and 6.12 respectively. Fig. 4, right, gives the LLK estimator, for which the CV and DPI algorithms gave bandwidths 20 and 7.59 respectively. Overall, the four estimators coincide in that the mean incubation (or induction) time is an increasing–decreasing function of age at infection; this agrees with previous analysis reported for this data set, see e.g. Table 4 in Kalbfleisch and Lawless (1989). For comparison purposes, we include in Fig. 4 the ordinary Nadaraya–Watson and LLK estimators, which ignore the truncation problem. These estimators were computed by using the respective DPI bandwidth values indicated before. The first thing one sees is that these naive curves underestimate the regression function all along its domain. This is explained by the fact that large incubation times are observed with smaller probability; see Fig. 5, in which the function Gn (together with a 95% pointwise confidence band based on the simple bootstrap) for the AIDS Blood Transfusion data is depicted. Secondly, the ordinary estimators of m(x) suggest a flat (or even a slightly decreasing) shape of the regression function for 20 < x < 80, where the true m(x) is concave (according to the corrected estimators). An explanation for this is that intermediate ages are associated to the largest incubation times, which are under-sampled, and therefore the scatterplot gets empty at its topcentral part, where the pick of m(x) is located. Summarizing, it is very important to perform a correction for the double truncation issue in regression analysis. We only mention that the nonparametric estimator for G depicted in Fig. 5 could be replaced by a suitable parametric estimator, leading to an alternative estimator for m(x). Indeed, Fig. 5 suggests a cubic polynomial as a sensible candidate. The parametric model could be fitted by maximizing the conditional likelihood of the truncation times given the lifetimes, as investigated by Moreira and de Uña-Álvarez (2010b). Similarly as for kernel density estimation (Moreira and de UñaÁlvarez, 2012), this semiparametric approach would be asymptotically equivalent to the purely nonparametric approach considered in this paper; however, for finite sample sizes, the inclusion of parametric information on G could lead to some improvements in estimation (same references). Simulations reported in Table 4 support this in the particular setting of nonparametric regression.
C. Moreira et al. / Computational Statistics and Data Analysis (
)
–
11
Fig. 5. Biasing function for the CDC AIDS blood transfusion data.
5. Conclusions In this paper we have proposed two different nonparametric estimators for a regression function when the response is subject to random double truncation. The proposed estimators are proper adaptations of the Nadaraya–Watson and local linear kernel smoothers to the truncated setup. Without such an adaptation, the ordinary kernel smoothers would be systematically biased. Asymptotic expressions for the bias and the variance of the estimators have been obtained, and two different bandwidth selectors based on cross-validation and plug-in ideas have been introduced. The practical performance of the estimators and the bandwidth selectors has been investigated through simulations, and a real data analysis has been provided for further illustration. It has been demonstrated that both estimators and both bandwidths selectors perform well, approaching their targets as the sample size increases. Besides, when comparing the several estimators, the local linear kernel estimator based on the direct plug-in bandwidth seems to be the best one. Although the results have been established for a single covariate, similar results can be provided for the multivariate setting. However, as always with nonparametric regression, the practical performance of the estimators will become poorer as the dimension grows. To this regard, the application of semiparametric regression techniques (such as additive regression) to the randomly truncated scenario would be interesting. Another issue which is left for future research is the construction of confidence bands for the regression function. Acknowledgments This work was supported by research Grant MTM2011-23204 (FEDER support included) of the Spanish Ministerio de Ciencia e Innovación, by SFRH/BPD/68328/2010 Grant of Portuguese Fundação Ciência e Tecnologia and by FEDER Funds through ‘‘Programa Operacional Factores de Competitividade - COMPETE’’ and by Portuguese Funds through ‘‘FCT - Fundação para a Ciência e a Tecnologia’’, in the form of grant PEst-OE/MAT/UI0013/2014. Appendix. Technical proofs and details Proof to Theorem 1 In this section we prove the asymptotic expressions given in Theorem 2.1. Note that these expressions refer to the ˆ (p) (x), where the case p = 0 corresponds to the Nadaraya–Watson estimator (NW), local polynomial kernel estimator m and p = 1 corresponds to the local linear kernel estimator (LLK). Note that Theorem 1 is similar to Theorem 2.1 in Sköld (1999), where the biasing function here (Gn ) is random but independent of the covariate. Note also that conditions (C1)–(C4) imply those in Sköld (1999). As mentioned in that paper, the result for NW follows by a linearization and standard Taylorarguments (once Gn is replaced by the true biasing function G), while for LLK one may apply techniques as for no truncated data (e.g. Wand and Jones, 1995). For illustrative purposes, we give here a sketch of the proof for the NW estimator. Consider the Nadaraya–Watson estimator:
NW (x) = m
∗ (x) m α ∗ (x)
12
C. Moreira et al. / Computational Statistics and Data Analysis (
)
–
∗ ∗ (x) = m ∗NW (x) and (x) = m NW (x). We have: where m α ∗ (x) = αNW (x). Put for simplicity m
∗ (x) m
(x) − m(x) = m =
α ∗ (x) 1
α ∗ (x)
m∗ (x)
−
α ∗ (x)
∗ (x) − m∗ (x) + m∗ (x) m
1
α ∗ (x)
1
−
α ∗ (x) ∗ 1 ∗ 1 1 (x) − m∗ (x) + (x) − m∗ (x) = ∗ m − ∗ m ∗ α (x) α (x) α (x) ∗ m∗ (x) α (x) − α ∗ (x) − ∗ α (x)α ∗ (x) m∗ (x) ∗ 1 ∗ (x) − m∗ (x) − ∗ 2 = ∗ m α (x) − α ∗ (x) + R(x) α (x) α (x)
where R(x) =
1
α ∗ (x)
=
1
α ∗ (x)
− −
1
α ∗ (x)
∗ (x) − m∗ (x) + m
1
∗ (x) − m∗ (x) + m
α ∗ (x)
m∗ (x) ∗ α (x) − α ∗ (x) ∗ α (x) m∗ (x) ∗ α (x) − α ∗ (x) α ∗ (x)
1
α ∗ (x)
−
1
α ∗ (x)
.
(x) − m(x) is asymptotically equivalent to Using standard arguments one gets R(x) = oP ((nh)−1/2 ) and, therefore, m φ(x) =
1
α ∗ (x)
m∗ (x) ∗ α (x) − α ∗ (x) . ∗ 2 α (x)
∗ (x) − m∗ (x) − m
√
∗ (x) and Since Gn is a√ n−consistent estimator of G (Shen, 2010; Moreira and de Uña-Álvarez, 2012), and since both m ∗ α (x) are only nh−consistent estimators, one may replace Gn by G in the formulae to compute the asymptotic bias and ∗ (x) and variance in a simple way. This leads to the well-known asymptotics for the NW-type estimators m α ∗ (x), namely (cfr. Härdle et al. (2004)) ∗ (x) − m∗ (x) ∼ Em E α ∗ (x) − α ∗ (x) ∼
1 2 1 2
µ2 (K )h2
µ2 (K )h2
1 f ∗ (x) 1
f ∗ (x)
m∗′′ (x)f ∗ (x) + 2m∗′ (x)f ∗′ (x) ,
∗′′ α (x)f ∗ (x) + 2α ∗′ (x)f ∗′ (x)
(for the biases), and Var ( m∗ (x)) ∼ (nh)−1 R(K ) Var ( α ∗ (x)) ∼ (nh)−1 R(K )
Var [Y12 G(Y1 )−1 |X1 = x] f ∗ (x)
Var [G(Y1 )−1 |X1 = x] f ∗ (x)
∗ (x), Cov m α ∗ (x) ∼ (nh)−1 R(K )
,
,
E [Y1 G(Y1 )−2 |X1 = x] − m∗ (x)α ∗ (x) f ∗ (x)
(for the variances and covariance), where f (x) stands for the density of the observed covariate (X1 ). Therefore, we obtain as n → ∞ ∗
E [ φ(x)] ∼
1 2
µ2 (K )h2 B0 (x)
where B0 (x) =
1
1
α ∗ (x) f ∗ (x)
m∗′′ (x)f ∗ (x) + 2m∗′ (x)f ∗′ (x) −
m∗ (x)
1
α ∗ (x)2
f ∗ (x)
∗′′ α (x)f ∗ (x) + 2α ∗′ (x)f ∗′ (x)
and Var φ(x) =
Var ( m∗ (x))
α ∗ (x)2
∼ (nh) R(K ) −1
+
m∗ (x)2
∗ m∗ (x) ∗ (x), Var ( α ( x )) − 2 Cov m α ∗ (x) α ∗ (x)4 α ∗ (x)3
1 f ∗ (x)
Var (Y1 G(Y1 )−1 |X1 = x)
α ∗ (x)2
+
Var (G(Y1 )−1 |X1 = x)
α ∗ (x)4
m∗ (x) E (Y1 G(Y1 )−2 |X1 = x) − m∗ (x)α ∗ (x) ∗ 3 α (x)
− 2
m∗ (x)2
C. Moreira et al. / Computational Statistics and Data Analysis (
)
–
13
Y12 G(Y1 )−2 G(Y1 )−2 ∗ 2 1 E |X1 = x + E m (x) |X1 = x = (nh)−1 R(K ) ∗ ∗ 2 f (x) α ( x) α ∗ (x)4 Y1 G(Y1 )−2 ∗ − 2E m ( x )| X = x 1 α ∗ (x)3 1 E (Y1 − m(X1 ))2 G(Y1 )−2 |X1 = x . = (nh)−1 R(K ) ∗ ∗ 2 f (x)α (x)
Now, it is easily seen that the densities of X ∗ (f (x)) and X1 (f ∗ (x)) are linked through α −1 f (x) = α ∗ (x)f ∗ (x). Compute the second derivative of both sides of the equation to get
α ∗′′ (x)f ∗ (x) + 2α ∗′ (x)f ∗′ (x) = α −1 f ′′ (x) − α ∗ (x)f ∗′′ (x). Similarly, compute the second derivative of equation m∗ (x)f ∗ (x) = α −1 m(x)f (x) to get m∗′′ (x)f ∗ (x) + 2m∗′ (x)f ∗′ (x) = α −1 (m′′ (x)f (x) + 2m′ (x)f ′ (x)) + α −1 m(x)f ′′ (x) − m∗ (x)f ∗′′ (x). bias of the NW From these relationships we get B0 (x) = [m′′ (x)f (x) + 2m′ (x)f ′ (x)]/f (x) and the result on the asymptotic estimator is obtained. For the variance, just note ασ 2 (x)/f (x) = E (Y1 − m(X1 ))2 G(Y1 )−2 |X1 = x /f ∗ (x)α ∗ (x)2 to conclude. DPI bandwidth Histograms are constructed by first partitioning the design interval [a, b] into interval blocks Bj , j = 1, . . . , N. In this paper we have always used N = 3, as suggested by Härdle and Marron (1995). Let B denote a generic block Bj , and let r and l denote the right and left boundaries of this block. The proportion of Xi ’s falling in each interval reflects the height l l of the density near the center of the block. Let c = r + denote the block center and rb = r − denote the block radius. The 2 2 histogram density estimate (adapted to double truncation) is given by 1
f (c ) =
n
2
n
I (|c − Xi | ≤ rb )Gn (Yi )−1 .
Gn (Yi )−1 rb i=1
i=1
To estimate the derivative of f on B we use a simple differencing method. This requires two estimates of f so we split the block into left and right halves. Estimate the frequencies on each half of the block by n
nl = n
Gn (Yi )−1
I (l ≤ Xi ≤ c )
n
i=1
Gn (Yj )−1
j =1
and nr = n
n
I (c ≤ Xi ≤ r )
i =1
Gn (Yi )−1 n
.
Gn (Yj )−1
j =1
Forming a difference quotient based on histograms at the center of the two sub-blocks gives the derivative estimate
(nr − nl )/(nrb ) f ′ (c ) = . rb
These two estimates are combined into the score function
f′ f
(c ) and, together with estimates of m′ and m′′ , they are
used to construct an estimate of B (x)dx. A natural and straight forward method for estimating m, m′ and m′′ , is leastsquares polynomial regression. The simplest version of this is a parabola, and to avoid loss of flexibility (see Härdle and Marron (1995) for more details), we fit blockwise parabolas. To deal with the double truncation issue, the squares in the least-squares criterion are weighted by the 1/Gn (Yi )’s. Explicitly, by assuming m(x) = β1 + β2 x + β3 x2 locally, one sets ′ B(cj ) = 2 β3j + 2 2 β3j (cj ) + βˆ 2j ff (cj ), where cj is the center of the block Bj . The final estimate of B2 = B2 (x)dx is given
by B2 =
n
j =1
2
2rb B2 (cj ).
The estimation of V (x) in (4) requires an estimator for
B is given by
σ2 f
(c ) =
σ 2 (c ) , f (c )
σ 2 (x) f ( x)
. An estimate of this value at the center of a generic block
14
C. Moreira et al. / Computational Statistics and Data Analysis (
)
–
(Xi ))2 Gn (Yi )−2 , which leads to V (c ) = α σf2 (c )(2π 1/2 )−1 when K is the (Yi − m −1 Gaussian kernel and α = ni=1 Gn (Yi )−1 . The blockwise approach (based on local parabolic pilot fits) leads to the following estimate of V = V (x)dx: where σ 2 (c ) =
V =
n
1 −1 Xi ∈B Gn (Yi )
Xi ∈B
V (cj ). 2rb
j =1
Finally, the DPI bandwidth is computed as hDPI = ( V /(4 B2 n))(1/5) . References Akritas, M.G., LaValley, M.P., 2005. A generalized product-limit estimator for truncated data. J. Nonparametr. Stat. 17, 643–663. Bilker, W.B., Wang, M.-C., 1996. A semiparametric extension of the Mann–Whitney test for randomly truncated data. Biometrics 52, 10–20. Cristóbal, J., Alcalá, T., 2000. Nonparametric regression estimators for length biased data. J. Statist. Plann. Inference 89, 145–168. Efron, B., Petrosian, V., 1999. Nonparametric methods for doubly truncated data. J. Amer. Statist. Assoc. 94, 824–834. Gross, S.T., Lai, T.L., 1996. Nonparametric estimation and regression analysis with left-truncated and right-censored data. J. Amer. Statist. Assoc. 91, 1166–1180. Härdle, W., Marron, J., 1995. Fast and simple scatterplot smoothing. Comput. Statist. Data Anal. 20, 1–17. Härdle, W., Müller, M., Sperlich, S., Werwatz, A., 2004. Nonparametric and Semiparametric Models. Springer, Berlin. Iglesias-Pérez, M., González-Manteiga, W., 1999. Strong representation of a generalized product-limit estimator for truncated and censored data with some applications. J. Nonparametr. Stat. 10, 213–244. Kalbfleisch, J.D., Lawless, J.F., 1989. Inference based on retrospective ascertainment: an analysis of the data on transfusion-related aids. Amer. Stat. Assoc. 84, 360–372. Klein, J.P., Moeshberger, M.L., 2003. Survival Analysis. Techniques for Censored and Truncated Data. Springer, New York. Liang, H., de Uña-Álvarez, J., Iglesias-Pérez, M., 2011. Local polynomial estimation of a conditional mean function with dependent truncated data. Test 20, 653–677. Martin, E., Betensky, R.A., 2005. Testing quasi-independence of failure and truncation times via conditional Kendall’s tau. J. Amer. Statist. Assoc. 100, 484–492. Moreira, C., de Uña-Álvarez, J., 2010a. Bootstrappping the NPMLE for doubly truncated data. J. Nonparametr. Stat. 22, 567–583. Moreira, C., de Uña-Álvarez, J., 2010b. A semiparametric estimator of survival for doubly truncated data. Stat. Med. 29, 3147–3159. Moreira, C., de Uña-Álvarez, J., 2012. Kernel density estimation with doubly truncated data. Electron. J. Stat. 6, 501–521. Moreira, C., de Uña-Álvarez, J., Crujeiras, R., 2010. Dtda: an R package to analyze randomly truncated data. J. Stat. Softw. 37, 1–20. Ould-Saïd, E., Lemdani, M., 2006. Asymptotic properties of a nonparametric regression function estimator with randomly truncated data. Ann. Inst. Statist. Math. 58, 357–378. Shen, P., 2010. Nonparametric analysis of doubly truncated data. Ann. Inst. Statist. Math. 62, 835–853. Sköld, M., 1999. Kernel regression in the presence of size-bias. J. Nonparametr. Stat. 12, 41–51. Stute, W., 1993. Almost sure representations of the product-limit estimator for truncated data. Ann. Statist. 21, 146–156. Tsai, W., Jewell, N., Wang, M., 1987. A note on the product-limit estimator under right censoring and left truncation. Biometrika 74, 883–886. Wand, M.P., Jones, M.C., 1995. Kernel Smoothing. In: Monographs on Statistics and Applied Probability, vol. 60. Chapman and Hall Ltd, London. Woodroofe, M., 1985. Estimating a distribution function with truncated data. Ann. Statist. 13, 163–177.