Data-driven local bandwidth selection for additive models with missing data

Applied Mathematics and Computation 217 (2011) 10328–10342
R. Raya-Miranda* and M.D. Martínez-Miranda, Department of Statistics and O.R., Faculty of Sciences, University of Granada, Campus Fuentenueva, 18071 Granada, Spain

Keywords: Missing data; Imputation; Wild bootstrap; Smoothing parameter; Backfitting

Abstract: This paper deals with the nonparametric estimation of additive models in the presence of missing data in the response variable, specifically in the case of additive models estimated by the Backfitting algorithm with local polynomial smoothers [1]. Three estimators are presented: one based on the available data and two based on a complete sample obtained by imputation techniques. We also develop a data-driven local bandwidth selector based on a Wild Bootstrap approximation of the mean squared error of the estimators. The performance of the estimators and the local bootstrap bandwidth selection method is explored through simulation experiments. © 2011 Elsevier Inc. All rights reserved.

1. Introduction In practice, some observations in samples are often incomplete; consider, for example, nonresponse in sample surveys, in clinical trials, in studies with longitudinal and ecological data, etc. During the last three decades, a vast amount of work has been done in this area. Ibrahim and Molenberghs [2] present a complete review of missing-data methods in longitudinal studies. Rubin [3] distinguished three missing-data mechanisms: missing completely at random (MCAR), when the missingness depends neither on the variable with missing data nor on the observed data; missing at random (MAR), when the missingness can depend on the observed data but not on the missing data; and nonignorable nonresponse (NINR), when the dependence is on both missing and observed data. For independent observations, standard nonparametric regression estimation methods usually assume complete samples. Indeed, the simplest way to deal with missing data is to drop the incomplete observations. A more elaborate approach consists of substituting (imputing) the missing values with an estimate. Such estimated values typically arise as the mean value or from simple parametric regression (over the available variables). The problem with these parametric methods is that they have only weak theoretical support and suffer from bias problems. Other, more refined imputation methods are those of [4], which develop the EM algorithm, or those based on multiple imputation [5]. Recently, multiple imputation methods have been proposed based on nonparametric and semiparametric estimation, such as those put forward by Aerts et al. [6]. The paper [7] provides a simulation study evaluating the effect of some imputation methods under different causes of missingness. Several inferential problems have been considered in the presence of missing data: linear regression [8]; logistic regression [9]; log-linear models [10]; generalized linear models [11–13], etc.
However, nonparametric additive models with missing data have not received special attention yet. In multivariate regression, missing data can arise in the response and/or the covariates of the model. For the case where data are missing in the covariates, [14,15] considered the problem


* Corresponding author.

E-mail addresses: [email protected] (R. Raya-Miranda), [email protected] (M.D. Martínez-Miranda). 0096-3003/$ - see front matter © 2011 Elsevier Inc. All rights reserved. doi:10.1016/j.amc.2011.05.040


under a semiparametric or generalized linear model, while Nittner [16] estimated an additive model with missing values in the covariates. For the case of missing data in the response, notable contributions are the paper [13], for semiparametric models; the papers [17,18], for general multivariate nonparametric regression models; and [19], for robust nonparametric estimation. Chen and Ibrahim [20] estimated semiparametric models considering missing data in both the covariates and the response variable. For estimating the unconditional mean of the response, E[Y], under a nonparametric treatment, the following papers stand out: [21–24], which considered simple imputation techniques, and [6,18], which solved the problem using multiple imputation. Other problems have also been considered, such as density estimation with missing data, by Titterington and Mill [25] and Titterington and Sedransk [26]; the estimation of other functionals, such as the unconditional variance, Var(Y), or the distribution function of Y, by Cheng [23]; and the problem of choosing the best model for incomplete samples, by Hens et al. [27] and Ibrahim et al. [28]. In this paper we consider the nonparametric additive estimation of the regression function in the presence of missing data in the response variable. We propose three nonparametric estimators based on the Backfitting algorithm with local polynomial smoothers [29]. The first estimator uses only the complete observations, the second uses a simple imputation method to complete the sample, and the third uses a multiple imputation method to complete the sample. Other alternative methods have been proposed to estimate additive models, for example: the marginal integration method by Linton and Nielsen [30] and Tjøstheim and Auestad [31]; the smooth Backfitting by Mammen et al. [32]; and a two-step procedure by Horowitz et al. [33]. The estimators presented in this paper can easily be extended to these additive estimation methods.
We have chosen the Backfitting estimator because it is less affected by boundary effects or by correlation among the explanatory variables. For the first and second proposed estimators we deal with the problem of choosing the smoothing or bandwidth parameter from the available data (data-driven bandwidth selectors). This is a technical problem in nonparametric estimation and becomes crucial because of its direct repercussion on the theoretical and practical performance of the estimators. The first step towards a proper data-driven bandwidth selector is to define the optimality criterion. The easiest and most common criterion in nonparametric regression is the conditional Mean Integrated Squared Error (MISE). Bandwidth parameters chosen from such a global criterion are constant along the estimation interval. This feature can be inefficient when the target regression surface exhibits irregularities and also when the sample sizes are small. On the other hand, local bandwidth selectors based on a local error criterion such as the mean squared error (MSE) allow a nice adaptation to the available data and capture the real features of the underlying surface. Martínez-Miranda et al. [34] proposed a local bandwidth selector for Backfitting additive estimates based on the bootstrap method; in that paper the estimators are based on the ideal situation of complete datasets. Here we propose an extension of the method involving missing-data adjustments. The selector also extends previous work on multivariate local linear estimation by González-Manteiga et al. [35], who formulate the problem in the presence of missing data in the response variable. The literature on this topic is not very extensive, especially when nonparametric additive models are considered; indeed, there are not many works apart from those mentioned above. It is also worthwhile to cite the work of Aerts et al.
[6], who propose global bandwidths based on a cross-validation strategy for estimators based on multiple imputation. Recently, Boente et al. [19] defined a cross-validation selector, but for robust estimation with missing data. The remainder of this paper is structured as follows: in Section 2, we introduce the additive model estimation problem in the presence of missing data in the response variable and formulate three estimators: the Simplified Backfitting (SB), the Imputed Backfitting (IB) and the Multiple Imputed Backfitting (MIB) estimators. In Section 3, we develop a data-driven local bandwidth selector based on a Wild Bootstrap approximation of the mean squared error of the estimators. In Section 4, we evaluate the practical performance of the estimators and the proposed bandwidth selector through a simulation study; it is shown that the proposed estimators and the local bootstrap bandwidth selection strategy work very well indeed. Some conclusions are given in Section 5. 2. Simple and multiple imputation in additive models For i = 1, ..., n, it is assumed for one-dimensional response variables Y_1, ..., Y_n that

$$Y_i = m_0 + m_1(X_{1i}) + \cdots + m_d(X_{di}) + \varepsilon_i, \qquad (1)$$

where $m_1, \ldots, m_d$ are unknown functions satisfying $E[m_j(X_{ji})] = 0$, $m_0$ is an unknown constant, the $\varepsilon_i$ are error variables and $\mathbf{X}_i = (X_{1i}, \ldots, X_{di})$ are random variables in $\mathbb{R}^d$ ($i = 1, \ldots, n$; $j = 1, \ldots, d$). Throughout the paper we assume that $\mathbf{X}_1, \ldots, \mathbf{X}_n$ are independent and identically distributed (i.i.d.) and that $X_{ji}$ takes its values in a bounded interval $I_j$. Furthermore, the error variables $\varepsilon_1, \ldots, \varepsilon_n$ are assumed to be i.i.d. with mean zero and independent of $\mathbf{X}_1, \ldots, \mathbf{X}_n$. In the case where no observations are missing, a sample of i.i.d. vectors $\{(\mathbf{X}_i^t, Y_i)\}_{i=1}^n$ is available. In our case it may happen that $Y_i$ is not observed for some index $i$, or that it is not feasible to obtain enough $Y$ observations due to demographic or economic constraints. This pattern is common, for example, in the double-sampling scheme proposed by Neyman [36], where first a complete sample is obtained and then some additional covariate values are collected, since perhaps this is less expensive than obtaining more response values. In order to check whether an observation is


complete or not, a new variable $\delta$ is introduced into the model as an indicator of the missing observations. Thus $\delta_i = 1$ if $Y_i$ is observed and $\delta_i = 0$ if $Y_i$ is missing, for $i = 1, \ldots, n$. Following the patterns in the literature (see [37]), we need to establish whether the loss of an item of data is independent or not of the value of the observed data and/or the missing data. In this paper we suppose that the data are missing at random (MAR), i.e.

$$P[\delta = 1 \mid Y, \mathbf{X}] = P[\delta = 1 \mid \mathbf{X}] = p(\mathbf{X}). \qquad (2)$$
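As a concrete illustration, the MAR mechanism (2) can be simulated by letting the probability of observing the response depend on the covariates only, never on the response itself. The additive model, noise level and form of $p(\mathbf{X})$ below are illustrative assumptions, not the paper's settings:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Hypothetical additive model: m(x) = sin(pi*x1) + x2^2 (illustrative only).
X = rng.uniform(-1.0, 1.0, size=(n, 2))
Y = np.sin(np.pi * X[:, 0]) + X[:, 1] ** 2 + rng.normal(0.0, 0.3, size=n)

# MAR mechanism (2): P[delta = 1 | Y, X] = p(X) depends on X only, not on Y.
p = 1.0 / (1.0 + 0.8 * np.exp(-X[:, 0] ** 2))   # assumed form of p(X)
delta = (rng.uniform(size=n) < p).astype(int)    # delta_i = 1 iff Y_i observed

Y_obs = np.where(delta == 1, Y, np.nan)          # incomplete response vector
```

Under MCAR the probability would be a constant, and under NINR it would also involve `Y`; only the dependence structure of `p` changes.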

In practice, (2) may be justified by the nature of the experiment when it is legitimate to assume that the missingness of $Y$ mainly depends on $\mathbf{X}$. This kind of missing data has been considered in the context of nonparametric regression by González-Manteiga et al. [35]. If there are no missing data, the nonparametric estimates of the additive component functions $m_j$ can be obtained by solving the system of normal equations:

$$\begin{pmatrix} I & S_1 & \cdots & S_1 \\ S_2 & I & \cdots & S_2 \\ \vdots & \vdots & \ddots & \vdots \\ S_d & S_d & \cdots & I \end{pmatrix} \begin{pmatrix} m_1 \\ m_2 \\ \vdots \\ m_d \end{pmatrix} = \begin{pmatrix} S_1 \\ S_2 \\ \vdots \\ S_d \end{pmatrix} Y, \qquad (3)$$

where $S_j$ represents the $n \times n$ smoother matrix with respect to the $j$th covariate vector. The smoother matrices for local polynomial regression are $S_j = (s_{j,X_{j1}}, \ldots, s_{j,X_{jn}})^T$, where $s_{j,x_j}$ represents the equivalent kernel for the $j$th covariate at the point $x_j$:

$$s_{j,x_j} = e_1^T \left( X_{j,x_j}^T K_{x_j} X_{j,x_j} \right)^{-1} X_{j,x_j}^T K_{x_j},$$

with $e_i$ a vector with a one in the $i$th position and zeros elsewhere, the matrix $K_{x_j} = \mathrm{diag}\{K_{h_j}(X_{j1} - x_j), \ldots, K_{h_j}(X_{jn} - x_j)\}$ for some kernel function $K$ and bandwidth $h_j$,

$$X_{j,x_j} = \begin{pmatrix} 1 & (X_{j1} - x_j) & \cdots & (X_{j1} - x_j)^{p_j} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & (X_{jn} - x_j) & \cdots & (X_{jn} - x_j)^{p_j} \end{pmatrix}$$

and $p_j$ the degree of the local polynomial used for fitting $m_j$. The Backfitting algorithm [38] provides an iterative solution of (3). Opsomer [1] wrote the estimators directly as

$$\begin{pmatrix} \hat m_1 \\ \hat m_2 \\ \vdots \\ \hat m_d \end{pmatrix} = \begin{pmatrix} I & S_1 & \cdots & S_1 \\ S_2 & I & \cdots & S_2 \\ \vdots & \vdots & \ddots & \vdots \\ S_d & S_d & \cdots & I \end{pmatrix}^{-1} \begin{pmatrix} S_1 \\ S_2 \\ \vdots \\ S_d \end{pmatrix} Y \equiv M^{-1} C Y,$$

provided the inverse of $M$ exists. This expression shows that the additive model estimators are also linear smoothers, and it is possible to define the additive smoother matrix $W_j$ as

$$W_j = E_j M^{-1} C,$$

where $E_j$ is a partitioned matrix of dimension $n \times nd$ with an $n \times n$ identity matrix as the $j$th block and zeros elsewhere, so that $\hat m_j = W_j Y$, for $j = 1, \ldots, d$. If there are missing observations in the response variable, a very simple way to estimate the regression function is the Simplified Backfitting (SB), which uses only the complete observations, i.e. those with $\delta_i = 1$. Thus the SB can be obtained as

$$\hat m_{SB,j} = W_j^{\delta} Y, \qquad j = 1, \ldots, d,$$

where

$$W_j^{\delta} = E_j \left(M^{\delta}\right)^{-1} C^{\delta},$$

and the smoother matrices are $S_j^{\delta} = (s^{\delta}_{j,X_{j1}}, \ldots, s^{\delta}_{j,X_{jn}})^T$, where

$$s^{\delta}_{j,x_j} = e_1^T \left( X_{j,x_j}^T K^{\delta}_{x_j} X_{j,x_j} \right)^{-1} X_{j,x_j}^T K^{\delta}_{x_j},$$

with $K^{\delta}_{x_j} = \mathrm{diag}\{K_{h_j}(X_{j1} - x_j)\delta_1, \ldots, K_{h_j}(X_{jn} - x_j)\delta_n\}$. Another option is the Imputed Backfitting (IB), which is constructed in two stages. In the first stage, the SB is used to estimate the missing observations so as to complete the sample. In this way a completed sample, $\{(\mathbf{X}_i, \hat Y_i);\ i = 1, \ldots, n\}$, is


obtained, where $\hat Y_i = \delta_i Y_i + (1 - \delta_i)\hat m_{SB}(\mathbf{X}_i)$, with $\hat m_{SB}(\mathbf{X}_i) = \hat m_{SB,0} + \sum_{j=1}^{d} \hat m_{SB,j}(X_{ji})$ the SB estimate of the additive function $m$ evaluated at $\mathbf{X}_i$. Once the sample is completed, Backfitting is applied to the data $\{(\mathbf{X}_i, \hat Y_i);\ i = 1, \ldots, n\}$, where $(\hat Y_1, \ldots, \hat Y_n)^t$ is the imputed response vector. The multiple-imputation procedure consists of replacing each missing observation by several draws from a probability distribution, giving rise to several complete data sets, which are analyzed through standard statistical procedures, the results finally being combined into the final estimate. Using the ideas of Aerts et al. [6] as a basis, we shall consider the following regression function estimation method. In a first stage, each missing observation is imputed $s$ times, through conditional distribution function estimation; that is, for each $\mathbf{X}_i$ we obtain $s$ observations $Y^{+}_{i,t}$ $(t = 1, \ldots, s)$ from the distribution:

$$\hat F(y \mid \mathbf{X}_i) = \sum_{j=1}^{n} L_{\delta}(\mathbf{X}_j, \mathbf{X}_i)\, \mathbf{1}\{Y_j \le y\},$$

where the weights are similar to those of Nadaraya [39]:

$$L_{\delta}(\mathbf{X}_j, \mathbf{X}_i) = \frac{\delta_j L_G(\mathbf{X}_j - \mathbf{X}_i)}{\sum_{k=1}^{n} \delta_k L_G(\mathbf{X}_k - \mathbf{X}_i)}.$$
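This imputation step amounts to sampling, for each missing $Y_i$, $s$ values from the observed responses with the Nadaraya-type complete-case weights $L_\delta$. A rough one-covariate sketch, where the Gaussian weight function, bandwidth and toy data are assumptions for illustration:

```python
import numpy as np

def mi_draws(x, y, delta, h, s, rng):
    """Return s completed response vectors: missing Y_i are sampled from the
    weighted empirical distribution F_hat(y | X_i) built on complete cases."""
    L = lambda u: np.exp(-0.5 * u ** 2)          # assumed Gaussian weight L_G
    obs = delta == 1
    draws = np.tile(np.where(obs, y, 0.0), (s, 1))
    for i in np.where(~obs)[0]:
        w = L((x[obs] - x[i]) / h)               # delta_j * L_G(X_j - X_i), obs only
        draws[:, i] = rng.choice(y[obs], size=s, p=w / w.sum())
    return draws                                 # shape (s, n)

rng = np.random.default_rng(1)
n, s = 100, 5
x = rng.uniform(-1, 1, n)
y = np.sin(np.pi * x) + rng.normal(0, 0.2, n)
delta = (rng.uniform(size=n) < 0.6).astype(int)  # roughly 40% missing
completed = mi_draws(x, y, delta, h=0.2, s=s, rng=rng)
```

Each of the `s` rows of `completed` is then analyzed with the standard (complete-data) procedure, and the `s` results are averaged.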

After the imputations, for $t = 1, \ldots, s$, Backfitting is applied to the data $\{(\mathbf{X}_i, Y^{+}_{i,t});\ i = 1, \ldots, n\}$, where $(Y^{+}_{1,t}, \ldots, Y^{+}_{n,t})^t$ is the multiple imputed response vector. Then the Multiple Imputed Backfitting (MIB) estimator is

Fig. 1. Estimated surfaces for Model 1 and ρ = 0 with local bootstrap bandwidth. The top left panel shows the SB estimate, the top right the IB estimate, and the true simulated surface is plotted in the middle panel. The bottom left panel shows the difference between the true simulated surface and the SB estimate, and the bottom right the difference between the true simulated surface and the IB estimate.


$$\hat m_{MIB}(\mathbf{X}_i) = \frac{1}{s} \sum_{t=1}^{s} \hat m_{MIB,t}(\mathbf{X}_i).$$
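To make the estimators of Section 2 concrete, the following sketch implements backfitting with local linear smoothers together with the complete-case (SB) and simple-imputation (IB) variants for a toy $d = 2$ model. The data, bandwidths and Gaussian kernel are illustrative assumptions, not the paper's exact settings:

```python
import numpy as np

def local_linear_smoother(x, delta, h):
    """n x n local linear smoother matrix; K^delta gives zero weight to rows
    with missing Y, so delta = 1 everywhere recovers the usual S_j."""
    K = lambda u: np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)
    n = len(x)
    S = np.empty((n, n))
    for i in range(n):
        Xd = np.column_stack([np.ones(n), x - x[i]])          # design X_{j,x}
        w = K((x - x[i]) / h) / h * delta                      # K_h(X_j - x)*delta_j
        A = np.linalg.solve(Xd.T @ (w[:, None] * Xd), Xd.T * w)
        S[i] = A[0]                                            # equivalent kernel row
    return S

def backfit(S_list, Y, m0, n_iter=30):
    """Iterative solution of the normal equations (3), centring each component
    so that its sample mean is zero (identifiability E[m_j] = 0)."""
    m = [np.zeros(len(Y)) for _ in S_list]
    for _ in range(n_iter):
        for j, Sj in enumerate(S_list):
            partial = Y - m0 - sum(m[k] for k in range(len(m)) if k != j)
            m[j] = Sj @ partial
            m[j] -= m[j].mean()
    return m0, m

rng = np.random.default_rng(2)
n = 150
x1, x2 = rng.uniform(-1, 1, n), rng.uniform(-1, 1, n)
Y = np.sin(np.pi * x1) + x2 ** 2 + rng.normal(0, 0.2, n)
delta = (rng.uniform(size=n) < 0.6).astype(float)
Y0 = np.where(delta == 1, Y, 0.0)     # missing entries carry zero kernel weight

# SB: backfit with delta-weighted smoothers, using complete cases only.
S1d = local_linear_smoother(x1, delta, 0.3)
S2d = local_linear_smoother(x2, delta, 0.3)
m0, (m1, m2) = backfit([S1d, S2d], Y0, Y[delta == 1].mean())

# IB: impute missing responses with the SB fit, then refit with full weights.
Y_imp = np.where(delta == 1, Y, m0 + m1 + m2)
S1 = local_linear_smoother(x1, np.ones(n), 0.3)
S2 = local_linear_smoother(x2, np.ones(n), 0.3)
m0_ib, (m1_ib, m2_ib) = backfit([S1, S2], Y_imp, Y_imp.mean())
```

The MIB estimator simply repeats the second stage on each of the `s` multiply imputed samples and averages the resulting fits.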

3. Data-driven local bandwidth selection As discussed before, an important issue in any nonparametric estimation is the choice of an appropriate smoothing parameter. Two commonly used approaches are cross-validation and plug-in methods. Automatic selectors like cross-validation are attractive due to their simplicity and intuitive definition, but they suffer from slow convergence rates and high computational cost. By contrast, plug-in methodologies require theoretical expressions for the bias and the variance of the regression estimators, which are not always available in practice. The estimators from the Backfitting algorithm are very sensitive to the choice of bandwidth, and the behavior of the estimates for highly correlated designs is somewhat strange and hard to interpret. Sperlich et al. [40] showed that minimizing the

Fig. 2. Estimated surfaces for Model 2 and ρ = 0 with local bootstrap bandwidth. The top left panel shows the SB estimate, the top right the IB estimate, and the true simulated surface is plotted in the middle panel. The bottom left panel shows the difference between the true simulated surface and the SB estimate, and the bottom right the difference between the true simulated surface and the IB estimate.


cross-validation criterion with respect to the bandwidth vector $(h_1, h_2, \ldots, h_d)$ yields optimal bandwidths for $\hat m$ that are not necessarily optimal for the $\hat m_j$, $j = 1, \ldots, d$. Opsomer [1] developed a plug-in bandwidth estimator as the minimizer of the Mean Average Squared Error (MASE) for additive models fitted by local polynomial regression, in the case of independence between the covariates; Mammen and Park [41] introduced bandwidth selectors for smooth Backfitting based on penalized sums of squared residuals. Finally, Nielsen and Sperlich [42] developed a cross-validation method for the smooth Backfitting estimator, with a feasible implementation; they showed that this implementation indeed yields bandwidths that are as efficient as their counterparts in one-dimensional nonparametric regression. Another approach to the problem of bandwidth selection is based on the bootstrap method [34,35]. Martínez-Miranda et al. [34] proved the consistency of the bootstrap approximation for the bivariate case (when local linear smoothers are involved) and also the optimality of the bootstrap bandwidth selector. They also obtained an expression that can be evaluated in practice by computing exactly the expectation under the resampling distribution. Thus it is possible to implement the proposed local bootstrap bandwidth selector without resorting to Monte Carlo approximations. Therefore the use of a local bandwidth selector provides notable improvements in the estimation of surfaces (by achieving a better adaptation to the features present in the data) without the need for expensive, time-consuming Monte Carlo approximations. The resampling mechanism that we use here follows the same guidelines as the method proposed by these authors, but it has been adapted to the case of additive models with missing data.
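The wild bootstrap resampling can be sketched as follows. The perturbation law used here is Mammen's two-point distribution (mean 0, variance 1), a common choice which we assume for illustration, since this excerpt does not spell out the exact resampling law:

```python
import numpy as np

def wild_bootstrap(y, fitted, B, rng):
    """Generate B wild-bootstrap response vectors Y*_i = m_hat(X_i) + eps_i * V_i,
    with V_i i.i.d. from Mammen's two-point law (mean 0, variance 1)."""
    eps = y - fitted                                 # residuals from a pilot fit
    a = -(np.sqrt(5) - 1) / 2                        # = -0.618...
    b = (np.sqrt(5) + 1) / 2                         # =  1.618...
    p = (np.sqrt(5) + 1) / (2 * np.sqrt(5))          # P(V = a)
    V = rng.choice([a, b], size=(B, len(y)), p=[p, 1 - p])
    return fitted + eps * V
```

Each row is one bootstrap response vector; refitting the estimator on each row and averaging the squared deviations approximates the bootstrap expectation $E^*$.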
The method obtains the local optimal bandwidth parameter by minimizing the bootstrap estimator of the mean squared error. For the SB and IB estimators we have, respectively:

Fig. 3. Estimated surfaces for Model 3 and ρ = 0 with local bootstrap bandwidth. The top left panel shows the SB estimate, the top right the IB estimate, and the true simulated surface is plotted in the middle panel. The bottom left panel shows the difference between the true simulated surface and the SB estimate, and the bottom right the difference between the true simulated surface and the IB estimate.


$$\min_{h} MSE^{*}_{SB}(x; h) = \min_{h} E^{*}\left[ \hat m^{*}_{SB,h}(x) - \hat m_{SB,h_0}(x) \right]^2, \qquad (4)$$

$$\min_{h,g} MSE^{*}_{IB}(x; h, g) = \min_{h,g} E^{*}\left[ \hat m^{*}_{IB,h,g}(x) - \hat m_{IB,h_0,g_0}(x) \right]^2, \qquad (5)$$

where $h_0, g_0$ are pilot bandwidth vectors, $E^{*}$ is the expectation operator under the bootstrap resampling, and $\hat m^{*}$ is the regression function estimate computed from the bootstrap sample. The minimization of expressions (4) and (5) has to be made carefully; indeed, the results can be dramatically different depending on how the minimization is performed. Here we follow the discussion of Nielsen and Sperlich [42], who consider an algorithm that searches for the minimum in each component separately. The problem is that we cannot decompose the MSE by components, but we can minimize with respect to one particular component while fixing the others, thereby avoiding a minimization over a large multidimensional grid. More specifically, we have considered the following strategy to derive the bootstrap smoothing parameter: first, for some initial bandwidth $g_0 \in \mathbb{R}^d$, calculate $\hat m_{g_0}$. Then, direction by direction, look for the $h_j$ minimizing the MSE$^{*}$ value. Having arrived at $j = d$ and found the appropriate $h_j$, we update all estimates with the new bandwidths. The same procedure is carried out for the SB and IB estimators. The general scheme of the algorithm is as follows.

Step 1: take starting bandwidths $g_1^{(0)}, \ldots, g_d^{(0)}$; set $r = 0$.
Step 2: calculate $\hat m_j^{(r)}(x_j)$, $j = 1, \ldots, d$.
Step 3: for $j = 1, \ldots, d$,
(a) find the $h_j^{(r)}$ that minimizes MSE$^{*}(x; h)$, by a simple one-dimensional grid search;
(b) when the optimal $h_j^{(r)}$ has been found, update the components $\hat m_j^{(r)}(x_j)$; then set $j$ to $j + 1$.
Step 4: if MSE$^{*}(x; h)$ has improved, set $r$ to $r + 1$ and go to Step 3; otherwise stop.

Fig. 4. Estimated components of the additive Model 1 (top row: ρ = 0; bottom row: ρ = 0.3; left panels: m1 vs. x1, right panels: m2 vs. x2) with local bootstrap bandwidth. The dashed curves (- -) correspond to the SB estimate, the solid curves (—) to the IB estimate, and the dotted curves (···) are the true simulated functions.


Fig. 5. Estimated components of the additive Model 2 (top row: ρ = 0; bottom row: ρ = 0.4; left panels: m1 vs. x1, right panels: m2 vs. x2) with local bootstrap bandwidth. The dashed curves (- -) correspond to the SB estimate, the solid curves (—) to the IB estimate, and the dotted curves (···) are the true simulated functions.

Define the vector $\hat h_{opt} = h^{(k)}$, with $k$ the index of the last iteration. Then $\hat h_{opt}$ is the bandwidth that minimizes the bootstrap estimator of the mean squared error. In Step 1, a natural choice of starting values $g_j^{(0)}$, $j = 1, \ldots, d$, is to minimize a cross-validation criterion for an additive model.

4. Empirical study This section contains the results of a simulation study in which the estimators SB, IB and MIB proposed in Section 2 are compared. We also investigate the performance of the proposed bootstrap approximation of the conditional mean squared error and the derived local bootstrap bandwidth selector introduced in Section 3. To this end the estimators have been calculated using the proposed local bootstrap selector and then compared with those using a global selector based on the cross-validation method. 4.1. Simulated models and settings We have considered 100 samples $(\mathbf{X}_i, Y_i)$, $i = 1, \ldots, n$, with $n = 200$, from three additive models,

$$Y_i = m_1(X_{i1}) + m_2(X_{i2}) + \varepsilon_i,$$

with around 40% of missing data. The component functions, design distributions and missing-data mechanisms are specified as follows. Model 1, with

$$m_j(x_j) = 2\sin(\pi x_j), \quad j = 1, 2; \qquad p(x) = 1/\left(2.2 + \exp(-0.6 x_1^2)\right);$$

Model 2, with


Fig. 6. Estimated components of the additive Model 3 (top row: ρ = 0; bottom row: ρ = 0.5; left panels: m1 vs. x1, right panels: m2 vs. x2) with local bootstrap bandwidth. The dashed curves (- -) correspond to the SB estimate, the solid curves (—) to the IB estimate, and the dotted curves (···) are the true simulated functions.

$$m_1(x_1) = x_1^2, \qquad m_2(x_2) = -2\sin(0.5\pi x_2); \qquad p(x) = 1/\left(2 + 0.5\exp(-x_1^2)\right);$$

and finally, Model 3, with

$$m_1(x_1) = 1 - 6x_1 + 36x_1^2 - 53x_1^3 + 22x_1^5, \qquad m_2(x_2) = \sin(5\pi x_2); \qquad p(x) = 1/\left(1 + 0.65\exp(-x_1^2)\right).$$

The explanatory variables were generated as follows. In Model 1, we first drew variables $z_j$, $j = 1, 2$, from $N(0, 1)$ with correlation $\rho_{12} = \rho$. To facilitate the grid search, we projected the variables into the interval $(-1.25, 1.25)$ by the transformation $x_j := 2.5\tan^{-1}(z_j)/\pi$. Therefore $\rho$ is not the correlation between the $x_j$, but it is related to it. We concentrate on the estimation inside $[-1, 1]^2$. In Model 2, the covariates were generated from truncated bi-dimensional normal distributions $(X_1, X_2)^T \sim N(0, \Sigma_c)$, $c = 1, 2$, where the covariance matrices are given by



$$\Sigma_1 = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}, \qquad \Sigma_2 = \begin{pmatrix} 1 & 0.4 \\ 0.4 & 1 \end{pmatrix}.$$

The truncation was done so that the covariates have the compact support $[-1, 1]^2$. To be specific, a random vector generated from one of the bi-dimensional normal distributions was discarded if one of the covariates fell outside the interval $[-1, 1]$. Finally, in Model 3, the covariates were generated from a joint distribution with marginals $N(1/2, 1/9)$ and the desired correlation. Because the domain of the covariates is assumed to be bounded, we rejected all observations for which one of the covariates fell outside the interval $[0, 1]$. The residuals were generated from a normal distribution with mean zero and constant variance of 0, for Models 1 and 2, and 0.25, for Model 3. The local linear smoother involved in the Backfitting algorithm was calculated with the Gaussian kernel, $K(x) = (2\pi)^{-1/2}\exp(-x^2/2)$.
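The Model 1 design and missingness mechanism can be generated as follows; the error standard deviation is an illustrative assumption here, and the seed and sample size merely mirror the description above:

```python
import numpy as np

rng = np.random.default_rng(4)
n, rho = 200, 0.3

# Correlated N(0,1) pair, projected into (-1.25, 1.25): x_j := 2.5*atan(z_j)/pi.
z = rng.multivariate_normal([0.0, 0.0], [[1.0, rho], [rho, 1.0]], size=n)
x = 2.5 * np.arctan(z) / np.pi

# Model 1 response; the noise standard deviation is an illustrative assumption.
Y = 2 * np.sin(np.pi * x[:, 0]) + 2 * np.sin(np.pi * x[:, 1]) + rng.normal(0, 0.5, n)

# Missingness probability for Model 1: p(x) = 1 / (2.2 + exp(-0.6*x1^2)).
p = 1.0 / (2.2 + np.exp(-0.6 * x[:, 0] ** 2))
delta = (rng.uniform(size=n) < p).astype(int)
```

Models 2 and 3 follow the same pattern, with the truncated normal designs and the corresponding $m_j$ and $p$ substituted in.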


Fig. 7. ISE boxplots for Model 1 (case ρ = 0 on the top and case ρ = 0.3 on the bottom; estimators IB.Boot, IB.Opt, SB.Boot, SB.Opt). The IB and SB estimates with both the local optimal bandwidth and the proposed local bootstrap bandwidth are compared using the ISE measure.

Fig. 8. ISE boxplots for Model 2 (case ρ = 0 on the top and case ρ = 0.4 on the bottom; estimators IB.Boot, IB.Opt, SB.Boot, SB.Opt). The IB and SB estimates with both the local optimal bandwidth and the proposed local bootstrap bandwidth are compared using the ISE measure.

Each of the considered estimators (let us write $\hat m$) is evaluated and compared through the global error measure given by the approximated Integrated Squared Error (ISE). This measure is computed over the performed replications of the model and is given by


Fig. 9. ISE boxplots for Model 3 (case ρ = 0 on the top and case ρ = 0.5 on the bottom; estimators IB.Boot, IB.Opt, SB.Boot, SB.Opt). The IB and SB estimates with both the local optimal bandwidth and the proposed local bootstrap bandwidth are compared using the ISE measure.

$$ISE(\hat m) = \frac{1}{r}\sum_{j=1}^{r} \left( m(v_j) - \hat m(v_j) \right)^2,$$

where the $v_j$ form a grid of $21 \times 21$ bidimensional estimation points. An approximation to the MISE was obtained as the mean of the ISE over the 100 replications.

Table 1. Approximated mean integrated squared error (MISE) for the local bootstrap bandwidth for the SB and IB estimators (standard deviations in brackets).

|                     | Model 1, ρ = 0      | Model 1, ρ = 0.3    | Model 2, ρ = 0      | Model 2, ρ = 0.4    | Model 3, ρ = 0      | Model 3, ρ = 0.5    |
| SB                  | 0.353895 (0.092856) | 0.360574 (0.115311) | 0.174554 (0.070258) | 0.168112 (0.077478) | 0.159493 (0.064598) | 0.163621 (0.052223) |
| IB                  | 0.278069 (0.067206) | 0.300156 (0.075724) | 0.126046 (0.035916) | 0.105898 (0.038871) | 0.118656 (0.048404) | 0.129634 (0.041625) |
| EF_{SB,IB,Boot}     | 21.426              | 16.794              | 27.789              | 37.007              | 25.604              | 20.771              |
| % ISE_IB < ISE_SB   | 87                  | 72                  | 87                  | 95                  | 88                  | 80                  |


Table 2. Approximated MISE comparison between the local bootstrap bandwidth and the standard (global) CV bandwidth for Model 1 (cases ρ = 0 and ρ = 0.3). The first panel corresponds to the SB estimates; the results for the IB estimates are shown in the second panel. Standard deviations are shown in brackets.

|                       | Model 1, ρ = 0      | Model 1, ρ = 0.3    |
| SB_CV                 | 0.535402 (0.334541) | 0.492153 (0.298474) |
| SB_Boot               | 0.353895 (0.092856) | 0.360574 (0.115311) |
| EF_{CV,Boot,SB}       | 33.901              | 26.7353             |
| % ISE_Boot < ISE_CV   | 62                  | 56                  |
| IB_CV                 | 0.487348 (0.337810) | 0.443691 (0.297271) |
| IB_Boot               | 0.278069 (0.067206) | 0.300156 (0.075724) |
| EF_{CV,Boot,IB}       | 42.942              | 32.350              |
| % ISE_Boot < ISE_CV   | 61                  | 54                  |

Fig. 10. Estimated components of Model 3 (case ρ = 0 on the top and case ρ = 0.5 on the bottom; left panels: m1 vs. x1, right panels: m2 vs. x2) from the Simplified, Imputed and Multiple-imputation estimation methods, together with the true function, using a global cross-validation bandwidth.

Finally, the local bootstrap bandwidth selector requires the choice of a pilot bandwidth parameter, $g$; we have relied on the pilot bandwidths obtained by a global cross-validation selector. 4.2. Analysis of the results First consider the SB and IB estimators using the proposed local bootstrap bandwidth. The estimators have been calculated both using local bootstrap bandwidths (obtained by minimizing the bootstrap estimator of the mean squared error) and the theoretical (unfeasible) local optimal ones (calculated by minimizing the true mean squared error at each estimation point). Figs. 1–3 show, for each scenario, the estimates obtained (averaged over all replicates) by the SB and IB estimators, the


Fig. 11. ISE boxplots for Model 3 (case ρ = 0 on the top and case ρ = 0.5 on the bottom; estimators IB, MIB, SB). The IB, MIB and SB estimates with the global cross-validation bandwidth are compared using the ISE measure.

true surface ($m$) and the differences between the true surface and the estimates. Furthermore, Figs. 4–6 show the true curve and the estimates obtained by SB and IB for each component separately. From these plots we can appreciate that IB outperforms SB for all models, notably in the zones of greatest curvature; this holds both for the component estimates and for the resulting bidimensional estimated surfaces. The benefits of using IB become more evident in a numerical evaluation of the estimation error ISE defined above. Figs. 7–9 show the boxplots of the ISE values for each compared estimator in each of the three simulated models. The IB estimator provides estimates that perform considerably better than those of the SB estimator in all cases considered, even when the variables are correlated. Note that the results obtained with the proposed bootstrap selector and with the theoretical values are very close. Tables 1 and 2 summarize the results of the simulations. Table 1 gives the values of the MISE, the efficiency of the IB estimator with respect to the SB estimator, and the percentage of times that the ISE for IB is smaller than the ISE for SB when using the local bootstrap bandwidth. The results show that the IB estimator provides smaller MISE values than the SB estimator in all cases considered, as well as very high percentages of times that the ISE for IB is smaller than the ISE for SB. The reported efficiency of the IB estimator with respect to the SB estimator was computed as

$$\mathrm{EF}_{SB,IB} = \frac{\mathrm{MISE}_{SB} - \mathrm{MISE}_{IB}}{\mathrm{MISE}_{SB}} \times 100,$$

where $\mathrm{MISE}_{SB}$ denotes the MISE of the SB estimator and $\mathrm{MISE}_{IB}$ that of the IB estimator. In order to compare the local bootstrap bandwidth with a global one, Table 2 gives the MISE values for the SB and IB estimators when considering the local bandwidth obtained by the bootstrap method and the global one obtained by cross-validation, for Model 1 (the results for the other models are omitted because they provide similar information). It also includes the efficiency and the percentage of times that the ISE for the bootstrap bandwidth is smaller than the ISE for the cross-validation one. The efficiency of the local bandwidth with respect to the global one was computed as

$$\mathrm{EF}_{CV,Boot} = \frac{\mathrm{MISE}_{CV} - \mathrm{MISE}_{Boot}}{\mathrm{MISE}_{CV}} \times 100,$$

where $\mathrm{MISE}_{CV}$ denotes the MISE of the estimator using a global cross-validation bandwidth and $\mathrm{MISE}_{Boot}$ that of the estimator using a local bootstrap bandwidth. From these results we see that the local bandwidths outperform the global ones, achieving smaller ISE values. We have also compared the SB and IB estimators with the multiple imputation estimator (MIB). Little and Rubin pointed out that simple imputation tends to underestimate the standard error, and for this reason they recommended using multiple imputation. The estimates obtained using both simple and multiple imputation, as described in Section 2, are shown


in Figs. 10 and 11. To calculate the MIB estimator we took the number of multiple imputations, s, equal to 10 (other choices provide quite similar results). Some improvement can be appreciated when using multiple imputation, but it is not remarkable for the Backfitting estimation considered here. The results for the other models are similar and are not included for brevity. The proposed bootstrap bandwidth selector can also be calculated for the MIB estimator in a similar way as for the simple IB, although with a greater computational effort. Again, the corresponding analysis of such a bootstrap bandwidth using MIB does not provide any new insight, so we do not include it in the paper.

5. Concluding remarks

We have introduced three nonparametric procedures to estimate an additive model when there are missing observations in the response variable. For two of the estimators we have proposed a local bootstrap method for the automatic selection of the smoothing parameters. The empirical experiments reveal that the IB estimator performs better than the SB one; this is also reflected in the very high percentages of times that the ISE for IB is smaller than the ISE for SB. On the other hand, the performance of the IB and SB estimators is sensitive to the presence of correlation among the variables: the MISE of the estimators increases when the variables are correlated. It is also remarkable that the proposed local bootstrap bandwidth is simple and easy to calculate in practice, and does not require excessive computational time compared with other local bandwidths such as local cross-validation versions. The simple explicit expression of the estimators allows the bootstrap approximations to be computed exactly, avoiding expensive Monte Carlo simulations. The methods were implemented in MatLab 2007b, using built-in MatLab subroutines, on an AMD 1.84 GHz processor.
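For readers reproducing summary tables of this kind, the quantities reported above (MISE, the efficiency EF and the percentage of replicates where one estimator beats another) are easily obtained from the per-replicate ISE values. The following is a minimal Python sketch of that bookkeeping, not part of the original MatLab implementation; the function name `compare_estimators` and the simulated ISE values are our own illustrative assumptions, not the paper's results:

```python
import numpy as np

def compare_estimators(ise_a, ise_b):
    """Summarize a simulation comparison between two estimators.

    ise_a, ise_b: arrays with the ISE of each estimator on every
    simulated replicate.  Returns the MISE of each estimator, the
    efficiency of B with respect to A (in %), and the percentage of
    replicates where B attains a smaller ISE than A.
    """
    ise_a = np.asarray(ise_a, dtype=float)
    ise_b = np.asarray(ise_b, dtype=float)
    mise_a = ise_a.mean()  # MISE estimated as the average ISE over replicates
    mise_b = ise_b.mean()
    ef = (mise_a - mise_b) / mise_a * 100.0  # EF_{A,B} as defined in the text
    pct_b_wins = (ise_b < ise_a).mean() * 100.0
    return mise_a, mise_b, ef, pct_b_wins

# Illustrative (made-up) ISE values standing in for an SB and an IB estimator:
rng = np.random.default_rng(0)
ise_sb = rng.uniform(0.2, 0.7, size=500)
ise_ib = ise_sb * rng.uniform(0.5, 1.0, size=500)  # IB constructed to be smaller here
mise_sb, mise_ib, ef, pct = compare_estimators(ise_sb, ise_ib)
```

A positive `ef` indicates that the second estimator attains a smaller MISE, which is how the efficiencies in Tables 1 and 2 should be read.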
We have generated our code using the MatLab subroutines available on-line (www.stat.colostate.edu/jopsomer/) to compute the Backfitting estimates.

Acknowledgements

The authors thank Wenceslao González-Manteiga and Ana Pérez-González for helpful discussions. We also acknowledge very helpful comments from an anonymous reviewer and the editor. The second author gratefully acknowledges the financial support of the Spanish ‘‘Ministerio de Ciencia e Innovación’’ MTM2008-03010.

References

[1] J.D. Opsomer, Asymptotic properties of backfitting estimators, J. Multivariate Anal. 73 (2000) 166–179.
[2] J.G. Ibrahim, G. Molenberghs, Missing data methods in longitudinal studies: a review, Test 18 (2009) 1–43.
[3] D.B. Rubin, Inference and missing data, Biometrika 63 (1976) 581–592.
[4] A.P. Dempster, N.M. Laird, D.B. Rubin, Maximum likelihood estimation from incomplete data via the EM algorithm (with discussion), J. Roy. Statist. Soc. 39 (1977) 1–38.
[5] D.B. Rubin, Multiple Imputation for Nonresponse in Surveys, Wiley & Sons, 1987.
[6] M. Aerts, G. Claeskens, N. Hens, G. Molenberghs, Local multiple imputation, Biometrika 89 (2002) 375–388.
[7] J. Gámez-García, J. Palarea-Albaladejo, J.A. Martín-Fernández, Métodos de inferencia estadística con datos faltantes. Estudio de simulación sobre los efectos en las estimaciones, Estadística Española 48 (2006) 241–270.
[8] R.J.A. Little, Regression with missing X's: a review, J. Amer. Statist. Assoc. 87 (1992) 1227–1237.
[9] W. Vach, Logistic regression with missing values and covariates, Lecture Notes in Statistics, vol. 86, Springer-Verlag, Berlin, 1994.
[10] C. Fuchs, Maximum likelihood estimation and model selection in contingency tables with missing data, J. Amer. Statist. Assoc. 77 (1982) 270–278.
[11] J.G. Ibrahim, Incomplete data in generalized linear models, J. Amer. Statist. Assoc. Theor. Methods 85 (1990) 765–770.
[12] J.G. Ibrahim, S.R. Lipsitz, M.-H. Chen, Missing covariates in generalised linear models when the missing data mechanism is nonignorable, J. Roy. Statist. Soc. B 61 (1999) 173–190.
[13] J.G. Ibrahim, M.-H. Chen, S.R. Lipsitz, Missing responses in generalised linear mixed models when the missing data mechanism is nonignorable, Biometrika 88 (2001) 551–564.
[14] C.Y. Wang, S. Wang, L.-P. Zhao, S.-T. Ou, Weighted semiparametric estimation in regression analysis with missing covariate data, J. Amer. Statist. Assoc. 92 (1997) 512–525.
[15] C.Y. Wang, S. Wang, R.J. Carroll, R.G. Gutiérrez, Local linear regression for generalized linear models with missing data, Ann. Statist. 26 (1998) 1028–1050.
[16] T. Nittner, The additive model affected by missing completely at random in the covariate, Comput. Statist. 19 (2004) 261–282.
[17] C.K. Chu, P.E. Cheng, Nonparametric regression estimation with missing data, J. Statist. Plann. Infer. 48 (1995) 85–99.
[18] W. González-Manteiga, A. Pérez-González, Nonparametric mean estimation with missing data, Commun. Statist. Theor. Methods 33 (2004) 277–303.
[19] G. Boente, W. González-Manteiga, A. Pérez-González, Robust nonparametric estimation with missing data, J. Statist. Plann. Infer. 139 (2008) 571–592.
[20] Q. Chen, J.G. Ibrahim, Semiparametric models for missing covariate and response data in regression models, Biometrics 62 (2006) 177–184.
[21] P.E. Cheng, L.J. Wei, Nonparametric inference under ignorable missing data process and treatment assignment, Int. Statist. Symposium, Taipei, ROC, vol. 1, 1986, pp. 97–112.
[22] P.E. Cheng, Applications of kernel regression estimation: a survey, Commun. Statist. Theor. Methods 19 (1990) 4103–4134.
[23] P.E. Cheng, Nonparametric estimation of mean functionals with data missing at random, J. Amer. Statist. Assoc. 89 (1994) 81–87.
[24] S.F. Nielsen, Nonparametric conditional mean imputation, J. Statist. Plann. Infer. 99 (2001) 129–150.
[25] D.M. Titterington, G.M. Mill, Kernel-based density estimates from incomplete data, J. Roy. Statist. Soc. B 45 (1983) 258–266.
[26] D.M. Titterington, J. Sedransk, Imputation of missing values using density estimation, Statist. Prob. Lett. 8 (1989) 411–418.
[27] N. Hens, M. Aerts, G. Molenberghs, Model selection for incomplete and design-based samples, Statist. Medic. 25 (2006) 2502–2520.
[28] J.G. Ibrahim, H.T. Zhu, N.S. Tang, Model selection criteria for missing-data problems using the EM algorithm, J. Amer. Statist. Assoc. 103 (2008) 1648–1658.
[29] J.D. Opsomer, D. Ruppert, Fitting a bivariate additive model by local polynomial regression, Ann. Statist. 25 (1997) 186–211.
[30] O. Linton, J.P. Nielsen, A kernel method of estimating structured nonparametric regression based on marginal integration, Biometrika 82 (1995) 93–100.
[31] D. Tjøstheim, B.H. Auestad, Nonparametric identification of nonlinear time series: Projections, J. Amer. Statist. Assoc. 89 (1994) 1398–1409.


[32] E. Mammen, O. Linton, J.P. Nielsen, The existence and asymptotic properties of a Backfitting projection algorithm under weak conditions, Ann. Statist. 27 (1999) 1443–1490.
[33] J. Horowitz, J. Klemelä, E. Mammen, Optimal estimation in additive regression models, Bernoulli 12 (2006) 271–298.
[34] M.D. Martínez-Miranda, R. Raya-Miranda, W. González-Manteiga, A. González-Carmona, A bootstrap local bandwidth selector for additive models, J. Comput. Graph. Statist. 17 (2008) 38–55.
[35] W. González-Manteiga, M.D. Martínez-Miranda, A. Pérez-González, The choice of smoothing parameter in nonparametric regression through wild bootstrap, Comput. Statist. Data Anal. 47 (2004) 487–515.
[36] J. Neyman, Contribution to the theory of sampling human populations, J. Amer. Statist. Assoc. 33 (1938) 101–116.
[37] R.J.A. Little, D.B. Rubin, Statistical Analysis with Missing Data, Wiley & Sons, New York, 2002.
[38] A. Buja, T. Hastie, R. Tibshirani, Linear smoothers and additive models (with discussion), Ann. Statist. 17 (1989) 453–555.
[39] E.A. Nadaraya, On estimating regression, Theor. Probabil. Appl. 9 (1964) 141–142.
[40] S. Sperlich, O.B. Linton, W. Härdle, Integration and backfitting methods in additive models: finite sample properties and comparison, Test 8 (1999) 419–458.
[41] E. Mammen, B.U. Park, Bandwidth selection for smooth Backfitting in additive models, Ann. Statist. 33 (2005) 1260–1294.
[42] J.P. Nielsen, S. Sperlich, Smooth backfitting in practice, J. Roy. Statist. Soc. B 67 (2005) 43–61.