Pseudo-empirical likelihood estimation using nonparametric regression


Applied Mathematics Letters 22 (2009) 1021–1024


Applied Mathematics Letters journal homepage: www.elsevier.com/locate/aml

M. Rueda ∗, J.F. Muñoz

University of Granada, Spain

Article info

Article history: Received 28 November 2007; received in revised form 14 October 2008; accepted 6 January 2009.

Keywords: Sample surveys; Inclusion probabilities; Lagrange multiplier method; Regression estimator; Local polynomial regression

Abstract

Pseudo-empirical likelihood estimation of the population mean is considered. Nonparametric regression is proposed to provide the fitted values on which to calibrate, which addresses the common problem of model misspecification. Results derived from empirical studies show that the proposed estimator for the population mean can perform better than alternative estimators. © 2009 Elsevier Ltd. All rights reserved.

1. Introduction

The pseudo-empirical likelihood method is a recently developed technique for estimating the population mean or total (see [1–3]) and the distribution function (see [4]). This model-assisted technique relies on a specific model relating the auxiliary variables to the variable of interest, a linear relationship being the most commonly used. However, assuming such a relationship can be inappropriate or unverifiable, and model misspecification can lead to undesirable results. An alternative to the model-assisted methods is the nonparametric approach, which places no restrictions on the relationship between the auxiliary and study variables.

Nonparametric methods have gained acceptance in most areas of statistics, but they are less widespread in the survey context. Some references on this topic are [5–7]. Nonparametric estimators can be derived by using kernel regression or local polynomial regression. Local polynomial regression possesses many desirable theoretical properties, including design adaptation, consistency and asymptotic unbiasedness (see [8–11]). Moreover, [12] explores a wide range of application areas of local polynomial regression techniques.

In this work, we propose an estimator for the population mean of a variable of interest that combines the pseudo-empirical likelihood method with nonparametric regression techniques based on local polynomial regression. Empirical studies are also carried out to assess the performance of the proposed estimator in comparison to alternative estimators.

2. Proposed estimator

Assume that a sample s of size n is drawn from a finite population U of size N using a specific sampling design with first-order inclusion probabilities $\pi_i$. Let $d_i = 1/\pi_i$ denote the design weight of unit i. Let y be the variable of interest. The aim is to estimate the population mean of y, i.e. $\bar{Y} = N^{-1} \sum_{i=1}^{N} y_i$. Without loss of generality, we assume a single auxiliary variable denoted by x.

∗ Corresponding address: Departamento de Estadística e I.O., Facultad de Ciencias, Avda Fuentenueva, Universidad de Granada, 18071, Granada, Spain. Tel.: +34 958240494; fax: +34 958243267. E-mail address: [email protected] (M. Rueda). 0893-9659/$ – see front matter © 2009 Elsevier Ltd. All rights reserved. doi:10.1016/j.aml.2009.01.024


Fig. 1. Scatter plots for simulated populations with size N = 1000, based on the model $y_i = m(x_i) + \varepsilon_i$. The errors $\varepsilon_i$ are taken with standard deviation σ = 0.1, and x is uniformly distributed over [0, 1]. The functions $m_i = m_i(x)$ are given in (4).

The auxiliary variable x is incorporated by considering the superpopulation model

$$y_i = m(x_i) + \varepsilon_i, \qquad i = 1, \ldots, N, \tag{1}$$

where the $\varepsilon_i$ are independent and identically distributed with mean zero and variance $\sigma^2$. The unknown regression function m(·) is defined without loss of generality on the interval [0, 1] (see [8]). Since many desirable theoretical properties have been derived for it, we propose to use local polynomial regression to estimate m(·) in expression (1). However, other kernel methods can also be used to estimate m(·). [8] shows that the estimator of m(·) in the survey context is given by

$$\hat{m}_i = \hat{m}(x_i) = e_1^t \left( X_{x_i}^t W_{x_i} X_{x_i} \right)^{-1} X_{x_i}^t W_{x_i} y_s,$$

where $e_1^t = (1, 0)$, $y_s = [y_j]_{j \in s}$, $W_{x_i} = \mathrm{diag}\{ d_j K_h(x_j - x_i) \}_{j \in s}$, $X_{x_i} = [1, \; x_j - x_i]_{j \in s}$ and $K_h(u) = h^{-1} K(u/h)$, K(·) being a continuous kernel function and h the bandwidth parameter.

The proposed pseudo-empirical likelihood estimator for the population mean $\bar{Y}$ is given by

$$\bar{y}_{PROP} = \sum_{i \in s} \hat{p}_i y_i, \tag{2}$$

where the weights $\hat{p}_i$ maximize the pseudo-empirical likelihood function $\hat{l}(p) = \sum_{i \in s} d_i \log p_i$ subject to the conditions

$$\sum_{i \in s} p_i = 1 \quad (0 \le p_i \le 1) \tag{3}$$

and

$$\sum_{i \in s} p_i \hat{m}_i = \frac{1}{N} \sum_{i=1}^{N} \hat{m}_i = M.$$
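The two ingredients of the estimator, the design-weighted local linear fit and the constrained maximization in (2) and (3), can be illustrated with a short sketch. It is written in Python with the standard library only and is not the authors' R/Splus code. The function names (`epanechnikov`, `local_linear_fit`, `pel_weights`, `pel_mean`) are hypothetical, the kernel and default bandwidth are assumptions, and the weights use the Lagrange-multiplier form $\hat{p}_i = w_i/[1 + \lambda(\hat{m}_i - M)]$ taken from [1,4], with λ located by plain bisection as one possible root finder.

```python
import random


def epanechnikov(u):
    """Epanechnikov kernel K(u) = 0.75 * (1 - u^2) for |u| <= 1, else 0."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0


def local_linear_fit(x0, xs, ys, d, h):
    """Design-weighted local linear fit m_hat(x0).

    Closed form of e1^t (X^t W X)^{-1} X^t W y_s for the 2x2 case, with
    W = diag{d_j * K_h(x_j - x0)} and K_h(u) = K(u / h) / h.
    """
    s0 = s1 = s2 = t0 = t1 = 0.0
    for xj, yj, dj in zip(xs, ys, d):
        u = xj - x0
        w = dj * epanechnikov(u / h) / h
        s0 += w
        s1 += w * u
        s2 += w * u * u
        t0 += w * yj
        t1 += w * u * yj
    det = s0 * s2 - s1 * s1
    if det <= 0.0:
        raise ValueError("too few sample points within the bandwidth of x0")
    return (s2 * t0 - s1 * t1) / det


def pel_weights(d, m_s, M, n_iter=200):
    """Weights p_i = w_i / (1 + lam * (m_i - M)) with w_i = d_i / sum(d).

    lam solves sum_i w_i (m_i - M) / (1 + lam (m_i - M)) = 0. The left-hand
    side is strictly decreasing in lam on the interval where every
    denominator is positive, so bisection on that interval finds the root.
    """
    total = sum(d)
    w = [dj / total for dj in d]
    u = [mi - M for mi in m_s]
    u_min, u_max = min(u), max(u)
    if not (u_min < 0.0 < u_max):
        raise ValueError("M must lie strictly inside the range of fitted values")
    lo, hi = -1.0 / u_max, -1.0 / u_min  # 1 + lam * u_i > 0 inside (lo, hi)

    def g(lam):
        return sum(wi * ui / (1.0 + lam * ui) for wi, ui in zip(w, u))

    for _ in range(n_iter):
        mid = 0.5 * (lo + hi)
        if g(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    lam = 0.5 * (lo + hi)
    return [wi / (1.0 + lam * ui) for wi, ui in zip(w, u)]


def pel_mean(xs, ys, d, x_pop, h=0.25):
    """Sketch of the estimator in (2): sum of p_i * y_i over the sample."""
    m_s = [local_linear_fit(x0, xs, ys, d, h) for x0 in xs]
    M = sum(local_linear_fit(x0, xs, ys, d, h) for x0 in x_pop) / len(x_pop)
    p = pel_weights(d, m_s, M)
    return sum(pi * yi for pi, yi in zip(p, ys))


if __name__ == "__main__":
    random.seed(0)
    N, n = 1000, 100
    x_pop = [random.random() for _ in range(N)]
    y_pop = [2 + 2 * (x - 0.5) ** 2 + random.gauss(0.0, 0.1) for x in x_pop]
    s = random.sample(range(N), n)  # SRSWOR, so every d_i equals N / n
    est = pel_mean([x_pop[i] for i in s], [y_pop[i] for i in s], [N / n] * n, x_pop)
    print("estimate:", est, "true mean:", sum(y_pop) / N)
```

The sketch requires the auxiliary variable to be known for every population unit, because M averages the fitted values over the whole population. It also assumes that M lies strictly between the smallest and largest fitted value in the sample; otherwise no weights in (0, 1) can satisfy the calibration condition.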

[1,4] use the Lagrange multiplier method to find the solution to the maximization problem in the pseudo-empirical likelihood method. Following [1,4], the solution to our maximization problem is given by $\hat{p}_i = w_i / [1 + \lambda(\hat{m}_i - M)]$, for $i \in s$, where $w_i = d_i / \sum_{j \in s} d_j$. The Lagrange multiplier λ is the solution to $\sum_{i \in s} w_i (\hat{m}_i - M) / [1 + \lambda(\hat{m}_i - M)] = 0$. Eq. (3) is a natural condition that provides weights $\hat{p}_i$ with attractive properties, such as yielding genuine distribution functions. Note that this property is not shared by other known techniques, such as the calibration method proposed by Deville and Särndal [13].

3. Numerical comparisons

In this section, simulation studies are carried out to study the performance of the proposed estimator $\bar{y}_{PROP}$ given by (2). In terms of the Relative Bias (RB, as a percentage) and Relative Efficiency (RE, as a percentage), $\bar{y}_{PROP}$ is compared numerically with the following estimators: (i) the standard Horvitz–Thompson estimator $\bar{y}_{HT} = N^{-1} \sum_{i \in s} d_i y_i$, (ii) the local polynomial regression estimator $\bar{y}_{LPR}$ (see [8]), (iii) the pseudo-empirical likelihood estimator $\bar{y}_{PE}$ (see [1]) and (iv) the regression estimator $\bar{y}_{REG}$ (see [14]).

Simulation studies are based on several populations, which are briefly described as follows. First, we study the effect of several working models on the various estimators by using simulated populations with size N = 1000 based on the model (1). The errors $\varepsilon_i$ are taken with standard deviation σ = 0.1. The auxiliary variable x is uniformly distributed over [0, 1] and the functions $m_i = m_i(x)$ are given by

$$\begin{aligned}
m_1(x) &= 2 + 2(x - 0.5), \\
m_2(x) &= 2 + 2(x - 0.5)^2, \\
m_3(x) &= 2 + 2(x - 0.5) + \exp(-200(x - 0.5)^2), \\
m_4(x) &= 2 + 2(x - 0.5)\Delta(x \le 0.6) + 0.6\Delta(x \ge 0.6).
\end{aligned} \tag{4}$$
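The simulated populations described by (1) and (4) can be generated along the following lines. This is a sketch under stated assumptions rather than the authors' code: Δ(·) is taken to be the indicator function, $m_4$ is coded literally as printed in (4) (the printed form leaves open whether the indicator multiplies only the term 2(x − 0.5) or the whole expression 2 + 2(x − 0.5)), and the errors are assumed to be Gaussian, since the text only states their standard deviation. The helper name `simulate_population` is hypothetical.

```python
import math
import random


def m1(x):
    """Linear response."""
    return 2 + 2 * (x - 0.5)


def m2(x):
    """Quadratic response."""
    return 2 + 2 * (x - 0.5) ** 2


def m3(x):
    """Linear response with a narrow bump at x = 0.5."""
    return 2 + 2 * (x - 0.5) + math.exp(-200 * (x - 0.5) ** 2)


def m4(x):
    """Jump response, coded as printed; the comparisons act as 0/1 indicators."""
    return 2 + 2 * (x - 0.5) * (x <= 0.6) + 0.6 * (x >= 0.6)


def simulate_population(m, N=1000, sigma=0.1, rng=random):
    """Draw x ~ U[0, 1] and y = m(x) + eps, with eps assumed N(0, sigma^2)."""
    xs = [rng.random() for _ in range(N)]
    ys = [m(x) + rng.gauss(0.0, sigma) for x in xs]
    return xs, ys


if __name__ == "__main__":
    rng = random.Random(2009)
    for name, m in [("m1", m1), ("m2", m2), ("m3", m3), ("m4", m4)]:
        xs, ys = simulate_population(m, rng=rng)
        print(name, "population mean of y:", sum(ys) / len(ys))
```

Passing a seeded `random.Random` instance as `rng` makes a generated population reproducible across runs.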

Scatter plots of the simulated populations can be seen in Fig. 1. The simulated populations allow us to compare the performance of the various estimators under an assortment of correct and incorrect specifications of the mean function. The nonparametric estimators $\bar{y}_{PROP}$ and $\bar{y}_{LPR}$ assume a smooth mean function, so they only misspecify the mean function for the jump response ($m_4$), which has a point of discontinuity at x = 0.6. The parametric estimators $\bar{y}_{REG}$ and $\bar{y}_{PE}$ are expected to perform well under the linear response ($m_1$), since the assumed model is then correctly specified.

Moreover, a real application is based on the factories population (see [15–17]). For this population, the auxiliary variable is the number of workers and the study variable is the output of the factories. The population was replicated five times to reach a population size of N = 400 individuals.

For each simulation, B = 1000 samples were selected to compute the Relative Bias and the Relative Efficiency,

$$RB = 100 \times \frac{E[\hat{\bar{y}}] - \bar{Y}}{\bar{Y}}, \qquad RE = 100 \times \frac{MSE[\hat{\bar{y}}]}{MSE[\bar{y}_{HT}]},$$

where $E[\hat{\bar{y}}] = B^{-1} \sum_{b=1}^{B} \hat{\bar{y}}_b$ is the empirical mean and $MSE[\hat{\bar{y}}] = B^{-1} \sum_{b=1}^{B} (\hat{\bar{y}}_b - \bar{Y})^2$ is the empirical mean square error. Here $\hat{\bar{y}}_b$ denotes the value of a given estimator $\hat{\bar{y}}$ for the bth simulation run.

Samples were selected by simple random sampling without replacement (SRSWOR) and stratified random sampling (STWOR). Samples of sizes n = 50, n = 100 and n = 200 were considered for each population. Simulation studies were programmed in R/Splus. The R/Splus codes are available from the authors. Following [8], the nonparametric estimators ($\bar{y}_{PROP}$ and $\bar{y}_{LPR}$) were computed with bandwidth h = 0.25 and an Epanechnikov kernel. The choice of the bandwidth parameter h and the kernel function K(·) plays a key role in nonparametric regression estimation. However, this problem is beyond the scope of this work.

Table 1. RE (%) of estimators under the sampling designs SRSWOR and STWOR for simulated populations.

            SRSWOR                               STWOR
Model  n    yHT    yPROP  yLPR   yPE    yREG     yHT    yPROP  yLPR   yPE    yREG
m1     50   100.0  3.1    3.2    3.0    3.3      100.0  15.2   16.4   14.8   15.6
m1     100  100.0  2.9    2.9    2.9    2.9      100.0  11.3   15.3   11.0   11.9
m1     200  100.0  2.6    2.6    2.6    2.6      100.0  20.3   21.8   19.8   20.6
m2     50   100.0  32.3   31.8   101.6  105.8    100.0  45.4   46.7   102.1  103.7
m2     100  100.0  31.9   31.8   100.6  102.2    100.0  40.7   41.9   102.1  103.0
m2     200  100.0  28.9   29.6   100.1  101.1    100.0  41.8   42.5   100.8  102.5
m3     50   100.0  12.0   12.1   20.4   21.3     100.0  27.8   28.7   43.9   44.5
m3     100  100.0  11.6   11.6   20.0   20.4     100.0  29.6   32.1   48.9   49.3
m3     200  100.0  12.1   12.1   21.9   22.1     100.0  21.9   22.3   36.2   36.7
m4     50   100.0  4.7    4.7    8.6    8.8      100.0  22.6   24.1   27.8   28.7
m4     100  100.0  5.1    5.1    10.6   10.7     100.0  20.1   23.6   21.1   21.9
m4     200  100.0  4.3    4.4    8.8    8.8      100.0  19.2   19.5   21.2   21.4

Table 2. RE (%) of estimators under the sampling designs SRSWOR and STWOR for the factories population.

       SRSWOR                               STWOR
n      yHT    yPROP  yLPR   yPE    yREG     yHT    yPROP  yLPR   yPE    yREG
50     100.0  4.9    4.9    17.8   17.6     100.0  20.0   21.2   57.9   58.4
100    100.0  4.4    4.5    17.1   17.3     100.0  22.0   24.8   53.3   55.4
200    100.0  4.5    4.6    16.4   16.5     100.0  16.2   18.9   51.7   52.4

The empirical studies gave values of RB within a reasonable range: the absolute values of RB are all less than 1% and are thus not reported. Table 1 reports the RE of the estimators under SRSWOR and STWOR for the simulated populations. With the design SRSWOR and the model $m_1$, we observe that the proposed estimator performs similarly to the other estimators that take the auxiliary information into account. The nonparametric estimators $\bar{y}_{PROP}$ and $\bar{y}_{LPR}$ are, however, more efficient under the other mean functions. With the design STWOR, we see that the proposed estimator $\bar{y}_{PROP}$, which is based on the pseudo-empirical likelihood method, is more efficient than the local polynomial regression estimator $\bar{y}_{LPR}$. We similarly observe that $\bar{y}_{PE}$ is more efficient than $\bar{y}_{REG}$. This gain in efficiency of the pseudo-empirical likelihood estimators ($\bar{y}_{PROP}$ and $\bar{y}_{PE}$) over the regression estimators ($\bar{y}_{LPR}$ and $\bar{y}_{REG}$) is also discussed in Chen and Sitter [1], where it is shown that pseudo-empirical likelihood estimation is better than regression estimation under STWOR. We also observe that the parametric estimators $\bar{y}_{REG}$ and $\bar{y}_{PE}$ are less efficient than the customary estimator $\bar{y}_{HT}$ under the model $m_2$.
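The Monte Carlo measures used in this section can be computed from the B simulated values of each estimator. The sketch below, in standard-library Python, follows the definitions of RB and RE given in the text; the function names are hypothetical, and the small demonstration only draws the Horvitz–Thompson estimator under SRSWOR from an artificial population, so it does not reproduce any figure from Tables 1 and 2.

```python
import random


def relative_bias(estimates, true_mean):
    """RB = 100 * (E[est] - Ybar) / Ybar, with E the empirical mean over runs."""
    empirical_mean = sum(estimates) / len(estimates)
    return 100.0 * (empirical_mean - true_mean) / true_mean


def mse(estimates, true_mean):
    """Empirical mean square error: average of (est_b - Ybar)^2 over runs."""
    return sum((e - true_mean) ** 2 for e in estimates) / len(estimates)


def relative_efficiency(estimates, ht_estimates, true_mean):
    """RE = 100 * MSE[est] / MSE[HT]; values below 100 favour the estimator."""
    return 100.0 * mse(estimates, true_mean) / mse(ht_estimates, true_mean)


def horvitz_thompson_srswor(y_pop, n, rng):
    """Horvitz-Thompson mean N^{-1} * sum_{i in s} d_i y_i under SRSWOR (d_i = N/n)."""
    N = len(y_pop)
    sample = rng.sample(range(N), n)
    d = N / n
    return sum(d * y_pop[i] for i in sample) / N


if __name__ == "__main__":
    rng = random.Random(2009)
    # Artificial population, used only to exercise the functions.
    y_pop = [2 + rng.gauss(0.0, 0.5) for _ in range(1000)]
    true_mean = sum(y_pop) / len(y_pop)
    ht_runs = [horvitz_thompson_srswor(y_pop, 50, rng) for _ in range(1000)]
    print("RB of HT (%):", relative_bias(ht_runs, true_mean))
    print("RE of HT against itself (%):", relative_efficiency(ht_runs, ht_runs, true_mean))
```

Because RE is expressed relative to the Horvitz–Thompson estimator, that estimator always scores 100, which is why the corresponding columns of both tables are constant.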
We now study, in Table 2, the performance of the proposed estimator for the factories population. We observe that $\bar{y}_{PROP}$ and $\bar{y}_{LPR}$ perform equivalently under SRSWOR and both are better than the estimators $\bar{y}_{PE}$ and $\bar{y}_{REG}$. We also see that $\bar{y}_{PROP}$ is the most efficient estimator under STWOR. There may be two reasons for this. First, the sampling design has an impact on the estimators that favours the pseudo-empirical likelihood approach over the regression approach. Second, the nonparametric estimators are clearly more accurate than the parametric estimators because the linear relationship does not hold for the factories population.

In brief, we conclude that the proposed estimator $\bar{y}_{PROP}$ performs well in terms of relative bias and mean square error. $\bar{y}_{PROP}$ and $\bar{y}_{LPR}$ perform equivalently under SRSWOR. However, $\bar{y}_{PROP}$ can be more efficient than $\bar{y}_{LPR}$ under STWOR. The proposed estimator is also better than the parametric estimators when the relationship between the auxiliary and study variables is not linear.



Acknowledgments

This research was partially supported by Consejería de Innovación, Ciencia y Tecnología (grant no. SEJ565) and Ministerio de Educación y Ciencia (grant no. MTM2006-04809).

References

[1] J. Chen, R.R. Sitter, A pseudo empirical likelihood approach to the effective use of auxiliary information in complex surveys, Statist. Sinica 9 (1999) 385–406.
[2] M. Rueda, J.F. Muñoz, Y.G. Berger, A. Arcos, S. Martínez, Pseudo empirical likelihood method in the presence of missing data, Metrika 65 (2007) 349–367.
[3] C. Wu, R.R. Sitter, A model-calibration approach to using complete auxiliary information from survey data, J. Amer. Statist. Assoc. 96 (2001) 185–193.
[4] J. Chen, C. Wu, Estimation of distribution function and quantiles using the model-calibrated pseudo empirical likelihood method, Statist. Sinica 12 (2002) 1223–1239.
[5] A.H. Dorfman, P. Hall, Estimators of the finite population distribution function using nonparametric regression, Ann. Statist. 21 (1993) 1452–1475.
[6] J.D. Opsomer, C. Miller, Selecting the amount of smoothing in nonparametric regression estimation for complex surveys, Nonparametric Statist. 17 (2005) 593–611.
[7] M. Rueda, I. Sánchez-Borrego, A predictive estimator of the finite population mean using nonparametric regression, Comput. Statist. 24 (1) (2009) 1–14.
[8] F.J. Breidt, J.D. Opsomer, Local polynomial regression estimators in survey sampling, Ann. Statist. 28 (2000) 1026–1053.
[9] J. Fan, Design-adaptive nonparametric regression, J. Amer. Statist. Assoc. 87 (1992) 998–1004.
[10] J. Fan, Local linear regression smoothers and their minimax efficiencies, Ann. Statist. 21 (1993) 196–216.
[11] D. Ruppert, M.P. Wand, Multivariate locally weighted least squares regression, Ann. Statist. 22 (1994) 1346–1370.
[12] J. Fan, I. Gijbels, Local Polynomial Modeling and its Applications, Chapman and Hall, London, 1996.
[13] J.C. Deville, C.E. Särndal, Calibration estimators in survey sampling, J. Amer. Statist. Assoc. 87 (1992) 376–382.
[14] S. Singh, Advanced Sampling Theory with Applications: How Michael ‘‘Selected’’ Amy, Kluwer Academic Publisher, The Netherlands, 2003, pp. 1–1247.
[15] A.Y.C. Kuk, T.K. Mak, Median estimation in the presence of auxiliary information, J. R. Stat. Soc. B 51 (1989) 261–269.
[16] A.Y.C. Kuk, T.K. Mak, A functional approach to estimating finite population distribution functions, Theory Meth. 23 (1994) 883–896.
[17] M.N. Murthy, Sampling Theory and Method, Statistical Publishing Society, Calcutta, 1967.