Computational Statistics and Data Analysis 55 (2011) 838–851
Contents lists available at ScienceDirect
Computational Statistics and Data Analysis journal homepage: www.elsevier.com/locate/csda
Post-stratified calibration method for estimating quantiles S. Martínez, M. Rueda ∗ , A. Arcos, H. Martínez, I. Sánchez-Borrego University of Almería, Spain University of Granada, Spain
article
info
Article history: Received 4 March 2010 Received in revised form 6 July 2010 Accepted 6 July 2010 Available online 27 July 2010 Keywords: Post-stratified estimator Calibration Finite distribution function Auxiliary information Bahadur representation
abstract The estimation of quantiles in the presence of auxiliary information is discussed. Calibration and poststratification techniques provide simple and practical procedures for incorporating auxiliary information into the estimation of distribution functions, which can offer some useful gains in efficiency. The estimator proposed combines these techniques and possesses a number of desirable properties, including yielding a genuine distribution function, providing simplicity of computation and generalizing Silva and Skinner’s estimator. This proposed procedure is compared to alternative methods. On the basis of simulation studies, the proposed post-stratified calibration estimator presents a good level of performance and comprises a valid alternative to other estimators of the distribution function. © 2010 Elsevier B.V. All rights reserved.
1. Introduction The estimation of quantiles is a topic that has received considerable attention, for various reasons. Statisticians often find that they are dealing with variables, such as income, that have highly skewed distributions. In such situations, the median is regarded as a more appropriate measure of location than the mean, and thus governments in different countries obtain poverty measures based on quantiles. Assuming the presence of auxiliary information, various estimation procedures have been described (e.g., calibration Särndal, 2007; Rueda et al., 2006; Stearn and Singh, 2008, pseudo empirical likelihood methods, Chen and Wu, 2002; Rueda and Muñoz, 2009, nonparametric regression, Dorfman and Hall, 1993; Kuo, 1988) to obtain more efficient estimators for population means and totals, and efforts have been made to apply these general procedures directly to the quantile estimation. However, existing estimators have undesirable properties due to the specific nature of quantiles. There are two important ways in which auxiliary information can be used to estimate quantiles: firstly, we can develop indirect estimators of quantiles, in which it is assumed that the quantile of the auxiliary variable of the same order as that to be estimated is known, as is the case with position and stratification estimators (Kuk and Mak, 1989), and difference estimators (Rueda et al., 2003). Secondly, we can obtain monotonic estimators of the finite population distribution function and then take the inverse at the appropriate point of the required quantile. Most papers in which this approach is adopted (e.g. Chambers and Dunstan, 1986; Dorfman and Hall, 1993) assume a superpopulation model and suggest model-based estimators. Such methods encounter difficulties in the computation and the possible bias caused by the inadequacy of the superpopulation model assumed. Moreover, the above authors do not examine in great depth the properties of the quantile estimators derived. For example, quantile estimators are not generally studied via simulation studies. Rao et al. (1990) proposed a designbased difference estimator that may take values beyond the range [0, 1] and which is not always a monotone function
∗ Corresponding address: Departamento de Estadística e I.O., Facultad de Ciencias, Avda. Fuentenueva, Universidad de Granada, 18071 Granada, Spain. Tel.: +34 958240494; fax: +34 958243267. E-mail address:
[email protected] (M. Rueda). 0167-9473/$ – see front matter © 2010 Elsevier B.V. All rights reserved. doi:10.1016/j.csda.2010.07.006
S. Martínez et al. / Computational Statistics and Data Analysis 55 (2011) 838–851
839
of the variable of interest. Silva and Skinner (1995) carried out an exhaustive study of the properties of several estimators of the distribution function and pointed out important problems. Other important references in this topic are Chen and Wu (2002), Kuk and Mak (1989, 1994), Martínez et al. (2010), Rueda et al. (2003, 2007a), Rueda and Muñoz (2009) and Wang and Dorfman (1996). Option two plays a central role in this paper. The first step involved is to obtain new estimators of the distribution function that incorporate the auxiliary information. This information is used to define poststrata and also to construct the calibration equations in each poststratum. The proposed estimators possess a number of attractive features: they are asymptotic designunbiased, genuine distribution functions and calibration estimators. Moreover, Silva and Skinner’s estimator (Silva and Skinner, 1995) is a particular case of this class of estimators. Quantile estimation through direct inversion of the proposed estimated distribution function is discussed and its asymptotic properties are studied through Bahadur representations. An empirical study is then carried out to compare the precision of the proposed estimators with other well-known quantile estimators. 2. An estimation procedure 2.1. Motivation Consider a finite population U = {1, . . . , k, . . . , N }, which consists of N different elements. Let s = {1, . . . , n} be the set of n units included in a sample, selected according to a specified sampling design d, in which the inclusion probabilities πk and πkl are assumed to be strictly positive. For simplicity, we now assume that the sampling design is simple random sampling without replacement (SRSWOR). The extension to a general sampling scheme is presented in Section 6. Let yk be the value of the variable of interest y. The finite population distribution function of the study variable y is defined by 1 X ∆(t − yk ); Fy (t ) = N k∈U with
∆(t − yk ) =
0 1
if t < yk if t ≥ yk .
The corresponding finite population α -quantile of y is: Qy (α) = inf {t : Fy (t ) ≥ α}. A general procedure to estimate Qy (α) is formulated as follows: first obtain a monotonically nondecreasing estimator of the distribution function b Fy (t ), and then estimate the quantile by taking the inverse as
b Qy (α) = b Fy−1 (t ) = inf {t : b Fy (t ) ≥ α}. The usual Horvitz–Thompson estimator of the distribution function Fy (t ) is given by 1 X ∆(t − yk ) b ; FyH (t ) = N k∈s πk
b which takes the form b FyH (t ) = 1n k∈s ∆(t − yk ); under simple random sampling. Also, FyH (t ) is unbiased and a genuine distribution function under SRSWOR. The distribution function Fy (t ) can be estimated by the poststratified estimator b Fps (t ) defined by Silva and Skinner (1995). Let z be a known auxiliary variable, which is used as a stratifying variable. To define b Fps (t ) let the G poststratum partitioning U be denoted by U1 , . . . , UG where k ∈ Ug if z(g −1) < zk ≤ z(g ) , P
z(1) < z(2) < · · · < z(G−1) ,
z(0) = −∞,
z(G) = +∞;
and z(1) , z(2) , . . . , z(G−1) are specified values. Let s1 , s2 , . . . , sG be the corresponding partition of s so that sg = s ∩ Ug . Let Ng be the size of Ug and ng the size of sg , where n1 + n2 + · · · nG = n; then the poststratified estimator is
b Fps (t ) =
G X Ng X ∆(t − yk ) · . g =1
N
k∈sg
ng
(1)
Silva and Skinner showed that poststratification is a simple and practical procedure offering some useful gains in efficiency. Calibration was introduced by Deville and Särndal (1992) to estimate the population total, but this approach is applicable to the estimation of more complex parameters than a population total, such as population variances (Singh, 2001; Singh et al., 1999). However, the use of a calibration technique for estimation of the distribution function is not simple. Harms and Duchesne (2006) and Rueda et al. (2007b) use different ways to implement the calibration approach in the estimation of the distribution function. Both methods give nearly design unbiased estimation and compare favourably with earlier estimation methods for the distribution function, which are not based on calibration thinking but on the same auxiliary information (see Särndal, 2007).
840
S. Martínez et al. / Computational Statistics and Data Analysis 55 (2011) 838–851
We now propose to modify the Silva and Skinner estimator using a linear combination of the auxiliary variables as the stratification variable, and then applying the calibration technique in each poststratum. 2.2. Definition of an estimator for the distribution function We assume that with each unit k ∈ U there is a known vector xk = (xk1 , xk2 , . . . , xkJ ) of values of auxiliary variables, x1 , . . . , xJ . This information can be used to formulate an efficient estimator of the distribution function using calibration and poststratification techniques. First, we consider the pseudo-variable vk = b β 0 xk where
! −1 X
b β=
0
xk xk
·
X
k∈s
xk yk .
k∈s
This choice is based upon the linear superpopulation model: yk =
J X
βj xkj + ek ;
(2)
j =1
k = 1, . . . , N, j = 1, . . . , J and the random vector e = (e1 , . . . , eN ) are assumed to have zero mean and a positive definite covariance matrix which is diagonal. The computed value vk is the predicted value for yk under model (2). We then use this pseudo-variable, which combines the information of all auxiliary variables, as the stratification variable to construct the G poststratum denoted by Ug , g = 1, . . . , G, and sg = s ∩ Ug . The corresponding poststratified estimator is:
b Fps2 (t ) =
G X Ng X ∆(t − yk ) g =1
N
k∈sg
ng
;
(3)
(3) is analogous to (1) but the construction of poststrata is different. The poststratified estimator b Fps2 (t ) can be expressed as a linear estimator: G 1 XX b akg ∆(t − yk ); Fps2 (t ) =
N g =1 k∈s g
with akg = Ng /ng . In this form, we can obtain b Fps2 (t ) as a calibration estimator. For this purpose, the Horvitz–Thompson estimator 1 X ∆(t − yk ) b FyH (t ) = ; N k∈s πk can be expressed by: G 1 XX b FyH (t ) = dkg ∆(t − yk );
N g =1 k∈s g
(4)
with dkg = π1 = Nn . k The basic sampling design weights dkg are modified by new ones wkg that are as close as possible to dkg for the metric:
Φsg =
1 X (wkg − dkg )2 2 k∈s g
dkg
g = 1, 2, . . . , G;
(5)
subject to the following calibration equation 1 X Ng k∈s g
wkg = 1 g = 1, 2, . . . , G;
(6)
in each stratum. The solution of this problem gives the weights akg = Ng /ng . By replacing dkg with these new weights akg = Ng /ng in (4) we obtain the poststratified estimator b Fps2 (t ). In this form, we can see b Fps2 (t ) as a calibration estimator indicating poststrata membership. We now modify b FyH (t ) by including new benchmark constraints that include the auxiliary information given by x1 , . . . , xN . The proposed poststratification calibration estimator is thus obtained.
S. Martínez et al. / Computational Statistics and Data Analysis 55 (2011) 838–851
841
The idea is to derive new weights that modify as little as possible the original sampling weights (dkg ) that appear in the Horvitz–Thompson estimator (4) while respecting certain calibration equations in each poststratum. Hence, we choose Pg ordered values of the pseudo-variable v in each stratum Ug : g
g
g
t1 < t2 < · · · < tPg
g = 1, 2, . . . , G.
With these points, we consider the estimator G 1 XX b ωkg ∆(t − yk ); Fωps (t ) =
N g =1 k∈s g
where the weights ωkg are obtained with the minimization, in each stratum Ug , of the chi square function
Φsg =
1 X (ωkg − dkg )2 2 k∈s g
dkg
g = 1, 2, . . . , G;
(7)
subject to the following conditions 1 X Ng k∈s g
ωkg ∆(tjg − vk ) = Fvg (tjg )
g
where Fvg (tj ) =
1 Ng
P
j = 1, . . . , Pg g = 1, . . . , G;
(8)
g
k∈Ug
g
∆(tj − vk ) is the distribution function of the pseudo-variable v in the stratum Ug evaluated at tj .
Calibration equation (8) is imposed because it seems reasonable to expect that an estimator of the distribution function of y in the poststratum Ug should approach the distribution function closely as the pseudo variable v approaches y. The calibration conditions (8) can be expressed as 1 X Ng k∈s g
ωkg ∆(tg − vk ) = Fvg (tg ) g = 1, . . . , G;
(9)
with
0 g g ∆(tg − vk ) = ∆(t1 − vk ), . . . , ∆(tPg − vk ) g = 1, . . . , G; 0 g g Fvg (tg ) = Fvg (t1 ), . . . , Fvg (tPg ) g = 1, . . . , G. The resulting calibration weights follow by minimizing (7) subject to (9), and they are given by
ωkg = dkg +
dkg λg ∆(tg − vk ) Ng
g = 1, . . . , G;
(10)
g
where λg = Ng2 (Fvg (tg ) − b FVH (tg ))0 · (Tg )−1 g = 1, . . . , G; is the vector of Lagrange multipliers of dimension Pg and
0 g g g g g b FVH (t1 ), . . . , b g = 1, . . . , G; FVH (tg ) = b FVH (tPg ) P g g g where b FVH (tj ) = N1 k∈sg dkg ∆(tj − vk ) is the Horvitz–Thompson estimator in the stratum Ug of the pseudo-variable v . g Assuming that the inverse of Tg =
XN k∈sg
n
∆(tg − vk )∆(tg − vk )0 g = 1, . . . , G;
exists (the existence of Tg−1 is studied in Appendix A), the calibration weights, for all units k ∈ sg , are given by
0 −1 g ωkg = dkg + dkg Ng Fvg (tg ) − b FVH (tg ) · Tg · ∆(tg − vk ).
(11)
If Tg is non singular for all g, then the poststratified calibration estimator is G 1 XX b Fωps (t ) = ωkg ∆(t − yk )
N g =1 k∈s g
=
G X
h
0
g g Wg · b FYH (t ) + Fvg (tg ) − b FVH (tg )
i · Dg ;
(12)
g =1
P N g (tg − vk )∆(t − yk ), Wg = Ng and b FYH (t ) = N1 k∈sg dkg ∆(t − yk ). g Clearly, the estimator b Fωps (t ) coincides with b Fps2 (t ) if, for all stratum Ug , we only choose the point tg = maxk∈Ug vk , that is, the value Pg = 1 for all g ∈ 1, 2, . . . , G. Thus, the estimator b Fps2 (t ) is a particular case of b Fωps (t ). with Dg = (Tg )−1 ·
P
N k∈sg n ∆
842
S. Martínez et al. / Computational Statistics and Data Analysis 55 (2011) 838–851
3. Properties of the proposed estimator 3.1. b Fωps (t ) is a genuine distribution function Next, we study the condition that under the proposed estimator b Fωps (t ) is a genuine distribution function. The following conditions must then be verified: (i) (ii) (iii) (iv)
b Fωps (t ) is continuous on the right. b Fωps (t ) is monotone nondecreasing. limt →−∞ b Fωps (t ) = 0. limt →+∞ b Fωps (t ) = 1.
Following Rueda et al. (2007b), conditions (i)–(iii) are satisfied, but condition (iv), in general, is not. Condition (iv) is equivalent to adding, in every stratum Ug , the following condition 1 X Ng k∈s g
ωkg = 1 g = 1, 2, . . . , G.
(13)
To add condition (13) into the calibration conditions (8) we need, for all g = 1, . . . , G, to take a value tPg that is sufficiently g g large, that is, a value tPg that guarantees Fvg (tPg ) = 1. For example, we can take tPg = maxk∈Ug vk . 3.2. b Fωps (t ) is a calibration estimator For the construction, the weights are exactly calibrated to the distribution function of the v -variable in each stratum Ug . Calibration is a highly desirable property for survey weights (see Särndal, 2007) 3.3. Asymptotic behaviour of b Fωps (t ) We define the following vectors for each g = 1, . . . , G g zk
=
∆(tg − vk )
if k ∈ Ug if k 6∈ Ug ;
0Pg
g
where 0Pg denotes the zero vector of dimension Pg. Next, with the vectors zk , g = 1, 2, . . . , G; we define
Z0k = (z1k )0 , (z2k )0 , . . . , (zGk )0 . It is clear that the dimension of Zk is R =
PG
g =1
Pg.
Since the value of Zk is available for all k ∈ U, we consider the following calibration estimator 1 X b Fρ (t ) = ρk ∆(t − yk ); N
(14)
k∈s
where the calibration weights {ρk : k ∈ s} are obtained with the minimization of 1 X ρk −
Φρ =
N 2 n
N n
2 k∈s
;
(15)
subject to the condition 1 X N
ρk Zk = Z¯ ;
(16)
k∈s
where Z¯ is the population mean of variable Z. These calibration weights are
ρk =
N n
+
N n
λR Zk N
;
(17)
where λR = N 2 (Z¯ − Z¯ HT )0 ( k∈s Nn Zk Z0k )−1 has dimension R (Z¯ HT denotes the Horvitz–Thompson estimator of Z). Consequently, the calibration weights are
P
ρk =
N n
! −1 + N (Z¯ − Z¯ HT )
0
X k∈s
0
Zk Zk
· Zk ;
(18)
S. Martínez et al. / Computational Statistics and Data Analysis 55 (2011) 838–851
843
and
0 1 X b ρk ∆(t − yk ) = b FYH (t ) + Z¯ − Z¯ HT b Bs ; Fρ (t ) = N
(19)
k∈s
where
b Bs =
hX
Zk Z0k
i−1 X · Zk ∆(t − yk ).
k∈s
k∈s
Once the estimator b Fρ (t ) has been defined, our aim is to relate the estimator b Fρ (t ) to b Fωps (t ). Thus, we can study the asymptotic variance of b Fωps (t ) with that of b Fρ (t ). To do so, the estimator b Fρ (t ) is obtained with the minimization of (15), subject to (16). To achieve this, we have to minimize N 2 n
1 X ρk −
1 X 1 X 1 X + λR Z¯ − ωk Zk = Φρ + λ1R · z¯ 1 − ρk z1k + · · · + λGR · z¯ G − ρk zGk ;
N n
2 k∈s
N
N
k∈s
k∈s
N
k∈s
where λR = (λ , λ , . . . , λ ) and z¯ denotes the sample mean of z for each g = 1, . . . , G. If the unit k ∈ sg then 1 R
G R
2 R
g
g
(z1k )0 , . . . , (zGk )0 = (0P1 )0 , . . . , (0P (g −1) )0 , ∆(tg − vk )0 , (0P (g +1) )0 , . . . , (0PG )0 ;
Z0k =
therefore
λR · Zk = λgR · ∆(tk − vk ); thus, the calibrated weights for k ∈ sg are
ρk =
N
+
n
N n
λ R Zk N
=
N
+
n
N n
λgR ∆(tg − vk ) N
;
since the calibrated weights ρk satisfy condition 1 X N
ρk zgk = z¯ g ;
k∈s
that is 1 X N k∈s g
ρk ∆(tg − vk ) =
1 X N k∈U g
∆(tg − vk ).
This condition is equivalent to 1 X Ng k∈s g
ρk ∆(tg − vk ) =
1 X Ng k∈U g
∆(tg − vk );
and then 1 X
N
Ng k∈s g
n
+
N n
λgR ∆(tg − vk )
! ∆(tg − vk )0 =
N
1 X Ng k∈U g
∆(tg − vk )0 ;
and we obtain
λgR =
N Ng
· λg .
Consequently, λR = ( NN · λ1 , . . . , 1
ρk =
N n
+
N n
λgR ∆(tg − vk ) N
N NG
=
· λG ) and the calibrated weights, for k ∈ sg , are N n
+
N n
λg ∆(tg − vk ) Ng
g = 1, 2, . . . , G;
The calibration weights {ρk : k ∈ s} coincide with those given by (10) and therefore b Fρ (t ) = b Fωps (t ). b Finally, to study the asymptotic behaviour of Fρ (t ), we consider expression (19) and following Rueda et al. (2007b), we can establish the following theorem.
844
S. Martínez et al. / Computational Statistics and Data Analysis 55 (2011) 838–851
Theorem 1. The asymptotic behaviour of b Fρ (t ) is the same as that of the estimator
0 b ¯ − A¯ HT · B; Fa (t ) = b FYH (t ) + A
(20)
with
#−1
" X
B=
0
Ak Ak
·
X
k∈U
Ak ∆(t − yk );
k∈U
A0k = ((a1k )0 , . . . , (aGk )0 ) where g
ak =
∆(tg − µk ) 0Pg
if k ∈ Ug if k ∈ 6 Ug ;
µk = β · xk , β denotes the multiple regression coefficient between y and x. Consequently b Fρ (t ) is asymptotically normal and asymptotically design unbiased and its asymptotic variance is given by AV (b Fρ (t )) =
1−
n N
SE2 ;
n
(21)
where Ek = ∆(t − yk ) − A0k · B and SE2 =
1 N −1
P
U
(Ek − E )2 .
Because b Fωps (t ) = b Fρ (t ), the asymptotic variance of b Fωps (t ) is given by (21). Once the asymptotic behaviour of b Fωps (t ) has been determined, the next step is to obtain a new expression of Ek . Thus, it is easy to see that D1 D 2
#−1
" B=
X
Ak A0k
·
X
k∈U
k∈U
Ak ∆(t − yk ) = .. ; . DG
where
Dg =
X
g
g
ak (ak )0 ·
k∈Ug
X
g
ak ∆(t − yk ) g = 1, 2, . . . , G;
k∈Ug
If k ∈ Ug then
g
A0k = (0P1 )0 , . . . , (0P (g −1) )0 , ak , (0P (g +1) )0 , . . . , (0PG )0 . Finally, for k ∈ Ug g
Ek = ∆(t − yk ) − A0k · B = ∆(t − yk ) − (ak )0 · Dg
= ∆(t − yk ) − (∆(tg − µk )) · Dg . The asymptotic variance of b Fωps (t ) can be estimated by 0
b V (b Fωps (t )) =
1− n
n N
s2e ;
(22)
1 where ωkg is given by (11), s2e = n− 1
P
s
(ek − e)2 and for k ∈ Ug ,
ek = ∆(t − yk ) − (∆(tg − µk ))0 · b Dωg ; with
b Dωg =
X k∈sg
g kg ak
ω
g ak 0
( )·
X
ωkg agk ∆(t − yk ).
k∈sg
4. Definition of the quantile estimator If the proposed estimator of the distribution function, b Fωps (t ), has the properties of a distribution function, its inverse is well defined, and we can define a new calibrated estimator of Qy (α) as
b Qωps (α) = inf {t : b Fωps (t ) ≥ α}.
(23)
S. Martínez et al. / Computational Statistics and Data Analysis 55 (2011) 838–851
845
Fig. 1. Scatterplots for CANCER, SUGAR and MURTHY populations.
We now study the properties of the b Qωps (α) estimator. To do so, a linear approximation is needed, because b Qωps (α) is not a continuous function. The estimator b Qωps (α) can be expressed asymptotically as a linear function of the estimated distribution function evaluated at the quantile Qy (β) by the Bahadur representation (see Chambers and Dunstan, 1986; Rueda et al., 2007b):
b Qωps (α) − Qy (α) =
1 fy (Qy (α))
(α − b Fωps (Qy (α))) + O(n−1/2 );
(24)
where fy (·) denotes the derivative of the limiting value of Fy (·) as N −→ ∞. This implies that the sampling distribution of b Qωps (α) is closely related to that of b Fωps (Qy (α)). From (24) we obtain the asymptotic variance of b Qωps (α), to the first degree of approximation, as n 1− N 2 1 V (b Qωps (α)) = 2 SE ; fy (Qy (α)) n and its estimator is:
b V (b Qωps (α)) =
1 fy2 (Qy (α))
1− n
n N
s2e .
An approximate value of fy2 (Qy (α)) can be obtained by applying standard methods such as the kernel or the kth nearest neighbour methods, Silverman (1986). The variance estimator is stated in an explicit form, making direct calculation possible. 5. Simulation study The main properties of the estimator b Fωps (t ) are established in Section 3. The next step is to analyze the accuracy of this estimator via empirical studies. For this purpose, we consider three populations. The first population (called cancer) consists of N = 301 counties in North Carolina, South Carolina and Georgia with respect to the white female population in 1960; this population was studied by Royal and Cumberland (1981). The second population (called sugar) consists of sugar cane farms surveyed in Queensland, Australia in 1982; this was originally used by Chambers and Dunstan (1986), and later by Rao et al. (1990). In the cancer and sugar populations, a strong linear relationship exists between the study variable and the auxiliary variable (see Fig. 1). A third population was included to consider the effect of the departure from the assumption of a linear relation between the study variable and the auxiliary vector on these estimators. This population (called murthy) is a natural small population of 80 factories (Murthy, 1967) which has also been studied by Kuk (1993) and by Kuk and Mak (1989). This population should provide some indication of the robustness of the alternative estimators.
846
S. Martínez et al. / Computational Statistics and Data Analysis 55 (2011) 838–851
Table 1 Relative Bias (RE) in percentage, relative mean square error (RX ) and relative efficiency (RE) of the compared estimators for the cancer population. Q (0, 25)
Q (0, 50)
Q (0, 75)
RB
RX
RE
RB
RX
RE
RB
RX
RE
b F c (t ) b Fωps (t ) b FCD (t ) b Fps (t )
n = 25 0.159 0.150 0.615 0.144
0.198 0.192 0.634 0.203
2.831 2.919 0.884 2.761
0.145 0.140 0.824 0.162
0.180 0.173 0.827 0.212
2.325 2.419 0.506 1.974
0.210 0.160 0.918 0.166
0.276 0.206 0.918 0.266
1.921 2.574 0.578 1.993
b F c (t ) b Fωps (t ) b FCD (t ) b Fps (t )
n = 50 0.113 0.101 0.689 0.104
0.139 0.125 0.705 0.130
2.508 2.789 0.495 2.682
0.095 0.088 0.862 0.109
0.115 0.105 0.863 0.132
2.681 2.937 0.358 2.336
0.146 0.093 0.936 0.128
0.187 0.122 0.937 0.133
1.912 2.930 0.382 2.688
b F c (t ) b Fωps (t ) b FCD (t ) b Fps (t )
n = 125 0.067 0.057 0.768 0.075
0.087 0.074 0.769 0.096
2.069 2.432 0.234 1.875
0.059 0.056 0.894 0.079
0.067 0.062 0.894 0.093
2.556 2.762 0.192 1.841
0.069 0.042 0.951 0.093
0.086 0.053 0.951 0.084
2.479 4.023 0.224 2.538
In the first study, the precision obtained in estimating a finite population α -quantile is compared by inverting five finite population distribution function estimators: (i) the Horvitz–Thompson estimator b FyH (t ) (ii) the proposed b Fωps (t ) estimator (post-calibrated estimator), (iii) the calibration estimator b Fc (t ) proposed by Rueda et al. (2007b), (iv) the Chambers–Dunstan estimator b FCD (t ) (Chambers and Dunstan, 1986), and (v) the Silva and Skinner estimator b Fps (t ). We selected B = 1000 samples with different sample sizes, n, under SRSWOR. For each estimator, we calculated estimates of the finite population α -quantile for α = 0.25, 0.5 and 0.75. The performance of each estimator is measured by the relative bias (RB), the relative Mean Square Error (RX ) and the relative efficiency (RE), with RB =
B Qb (α) − Qy (α) 1Xb
B b=1
Qy (α)
,
RX =
MSE [b Q (α)] Qy (α)2
and RE =
MSE [b QyH (α)] MSE [b Q (α)]
;
where b indexes the bth simulation run, b Q (α) is an estimator for the α -quantile and MSE [b Q (α)] = B−1 b=1 [b Qb (α) − 2 b b Qy (α)] is the empirical Mean Square Error for Q (α); QyH (α) is the baseline estimator which is obtained by inverting b FyH (t ). The random generations, calculations and all the estimators were obtained using the R programme. The R codes are available from the authors. In the construction of the estimator b Fc (t ), we chose two points in calibration condition (8): the population median, t1 = Q (0.5), and t2 = maxk∈U vk . In a similar way, in the construction of the post-calibrated estimator b Fωps (t ), we chose the same points in the post-stratum: the median of the corresponding post-stratum, Uh , and maxk∈Uh vk . These choices were made in order to guarantee that we can consider the estimation of quantiles b Q (α) by inverting the distribution function
PB
1 estimators to give b Fw−ps (α), and for this, b Fwps (t ) must be a genuine distribution function. The strata were obtained by applying the Dalenius–Hodges method, using the auxiliary variable and in all populations, four poststrata were obtained. Following Silva and Skinner (1995), any poststrata with sg empty are pooled with adjacent poststrata until all sg are non-empty. Tables 1–3 show the simulated RE, RX and RB for each of the estimators at three quartiles. The results for the Cancer population show that the Chambers–Dunstan estimator presents the worst behaviour, while the other estimators perform well. The post-calibrated estimator is the most efficient one in all cases. In the Sugar population the behaviour of the Chambers–Dunstan estimator is different: this estimator is the most efficient when α = 0.25. b Fωps (t ), b Fps (t ) and b Fc (t ) are better than b FCD (t ) for α = 0.5 and α = 0.75.
In the third population, b Fωps (t ) is the most efficient estimator for α = 0.25, and b Fc (t ) is the best estimator for α = 0.5. b FCD (t ) has an irregular performance, being very efficient for α = 0.75 but presenting a small value of RE when α = 0.25. b Fωps (t ) is better than b Fps2 (t ) and b Fc (t ) in all cases. We see that there is considerable variation in the performance of the estimators, depending on the value of α . Accordingly, we performed a second simulation to show the average performance of the estimators. For each one, we calculated estimates of the finite population α -quantile for α = 0.1, 0.2, 0.25, 0.3, 0.4, 0.5, 0.6, 0.7, 0.75, 0.8 and 0.9. Then, the performance of all the estimators was compared in terms of Average Relative Bias (avrb) and Average Relative Efficiency (avrre) (average
S. Martínez et al. / Computational Statistics and Data Analysis 55 (2011) 838–851
847
Table 2 Relative Bias (RE) in percentage, relative mean square error (RX ) and relative efficiency (RE) of the compared estimators for the sugar population. Q (0, 25)
Q (0 , 5 )
Q (0, 75)
RB
RX
RE
RB
RX
RE
RB
RX
RE
b F c (t ) b Fωps (t ) b FCD (t ) b Fps (t )
n = 50 0.061 0.051 0.046 0.064
0.078 0.064 0.057 0.068
1.114 1.357 1.524 1.278
0.064 0.060 0.079 0.067
0.078 0.073 0.091 0.076
1.242 1.327 1.065 1.274
0.065 0.058 0.079 0.067
0.087 0.076 0.103 0.089
1.183 1.355 1.000 1.157
b F c (t ) b Fωps (t ) b FCD (t ) b Fps (t )
n = 75 0.047 0.038 0.036 0.049
0.059 0.047 0.045 0.053
1.156 1.451 1.515 1.286
0.049 0.044 0.066 0.059
0.060 0.053 0.074 0.056
1.266 1.433 1.026 1.356
0.051 0.043 0.059 0.054
0.066 0.055 0.071 0.059
1.225 1.470 1.139 1.371
b F c (t ) b Fωps (t ) b FCD (t ) b Fps (t )
n = 100 0.038 0.031 0.031 0.041
0.048 0.039 0.039 0.042
1.098 1.351 1.351 1.255
0.042 0.037 0.057 0.044
0.052 0.045 0.065 0.054
1.189 1.373 0.951 1.145
0.041 0.033 0.050 0.047
0.052 0.042 0.059 0.050
1.231 1.524 1.085 1.280
Table 3 Relative bias (RE) in percentage, relative mean square error (RX ) and relative efficiency (RE) of the compared estimators for the murthy population. Q (0, 25)
Q (0, 50)
Q (0, 75)
RB
RX
RE
RB
RX
RE
RB
RX
RE
b Fc (t ) b Fωps (t ) b FCD (t ) b Fps (t )
n = 25 0.036 0.032 0.071 0.032
0.048 0.047 0.096 0.052
1.650 1.686 0.825 1.523
0.032 0.036 0.040 0.041
0.048 0.053 0.050 0.059
2.023 1.832 1.942 1.645
0.039 0.023 0.022 0.028
0.055 0.043 0.028 0.048
1.493 1.910 2.933 1.711
b F c (t ) b Fωps (t ) b FCD (t ) b Fps (t )
n = 30 0.031 0.025 0.055 0.031
0.044 0.036 0.076 0.052
1.490 1.821 0.862 1.260
0.029 0.031 0.039 0.033
0.042 0.047 0.048 0.056
1.890 1.653 1.618 1.387
0.032 0.016 0.022 0.016
0.050 0.032 0.028 0.037
1.307 2.043 2.335 1.767
b F c (t ) b Fωps (t ) b FCD (t ) b Fps (t )
n = 35 0.028 0.017 0.041 0.023
0.040 0.027 0.061 0.044
1.300 1.927 0.853 1.182
0.022 0.023 0.041 0.025
0.033 0.034 0.048 0.052
2.064 2.003 1.419 1.310
0.025 0.013 0.019 0.012
0.040 0.027 0.025 0.034
1.376 2.039 2.202 1.619
overall α -values) with AVRB =
1 X
"
11 α
B 1Xb Qb (α) − Qy (α)
Qy (α)
B b=1
# =
1 X 11 α
RB(b Q (α))
(25)
and AVRRE =
1 X 11 α
"
MSE (b Q )(α) Qy2
(α)
# =
1 X 11 α
RX (b Q (α)).
(26)
Table 4 presents the avrb (in percentage) and avrre statistics for the various estimators. We see that there is no great variation in the performance of the estimators in terms of AVRB. The best overall performance is achieved by the postcalibrated estimator. The avrb of the Chambers Dunstan estimator is usually greater than that of the other estimators. In terms of AVRRE, a similar conclusion can be drawn, i.e. the proposed estimator is better than the calibrated, the Chambers–Dunstan and the Silva–Skinner estimators. We next compare the conditional biases. Figs. 2–4 show the estimated conditional biases for several estimators of Qy (α) with α = 0.1, 0.5 and 0.9. The conditional biases were estimated from 20 groups of 50 samples each, formed using the ordered values of means of population x ranks of the sample x values. From Figs. 2–4 we see that the post-calibrated estimator displays good conditional behaviour. The proposed quantile estimator presents a small conditional bias with little variation over the whole range of samples, whereas the Chambers–Dunstan estimator displays a poor conditional performance, especially for α = 0.1.
848
S. Martínez et al. / Computational Statistics and Data Analysis 55 (2011) 838–851
Table 4 Average relative bias, avrb (in %) and average relative mean squared error, avrre of the compared estimators. cancer
murthy
avrb
avrre
n = 25
b F c (t ) b Fωps (t ) b FCD (t ) b Fps (t )
0.504 0.469 0.586 0.472
avrb
0.268 0.238 0.265 0.271
0.106 0.097 0.119 0.094
n = 30
0.120 0.095 0.213 0.099
0.400 0.359 0.405 0.399
0.247 0.219 0.249 0.246
0.070 0.061 0.078 0.063
n = 35
0.064 0.047 0.078 0.062
0.288 0.250 0.299 0.320
0.370 0.362 0.406 0.371
n = 75
0.045 0.030 0.049 0.044
n = 125
avrre
n = 50
0.052 0.034 0.056 0.046
n = 50
b F c (t ) b Fωps (t ) b FCD (t ) b Fps (t )
avrre
n = 25
0.189 0.161 0.242 0.150
b F c (t ) b Fωps (t ) b FCD (t ) b Fps (t )
sugar
avrb
0.299 0.279 0.314 0.307
n = 100
0.039 0.026 0.044 0.039
0.232 0.200 0.234 0.238
0.055 0.045 0.058 0.048
0.263 0.241 0.266 0.301
0.15
Estimated bias
Estimated bias
0.08 0.13 0.11 0.09 0.07
0.07 0.06 0.05
0.05 0.04 30.5
33.0
35.5
38.0
40.5
43.0
45.5
48.0
30.5
33.0
35.5
0.15 Estimated bias
38.0
40.5
43.0
45.5
48.0
Group mean of x rank means
Group mean of x rank means
Silva-Skinner
Post-calibrated
0.10
Calibrated 0.05 Chambers-Dunstan 0.00 30.5
33.0
35.5
38.0
40.5
43.0
45.5
48.0
Group mean of x rank means
Fig. 2. Estimated bias for group mean of x rank means. Sample size n = 125. α = 0.1, 0.5 and 0.9. cancer population.
Table 5 Average relative bias, avrb (in percentage) and average relative mean squared error, avrre for some estimators. sugar avrb
murthy avrre
n = 75
b F c (t ) b Fωps (t ) b Fps (t )
0.045 0.041 0.058
avrb
cancer avrre
n = 30 0.240 0.232 0.304
0.037 0.032 0.044
avrb
avrre
n = 125 0.223 0.196 0.245
0.053 0.054 0.068
0.264 0.242 0.315
To investigate the effect of an increasing number of points in the calibration equation (8), we conducted another study. Now b Fc (t ) and b Fωps (t ) estimators were calculated with four points: the three quartiles and the maximum.
S. Martínez et al. / Computational Statistics and Data Analysis 55 (2011) 838–851
0.06
849
0.04
Estimated bias
0.05 0.03 0.04 0.03
0.02
0.02 0.01 0.01 0.00
0.00 220
240
260
280
300
320
220
340
240
260
280
300
320
340
Group mean of x rank means
Group mean of x rank means 0.08
Estimated bias
Silva-Skinner 0.06 Post-calibrated 0.04 Calibrated 0.02 Chambers-Dunstan 0.00 220
240
260
280
300
320
340
Group mean of x rank means
Fig. 3. Estimated bias for group mean of x rank means. Sample size n = 35. α = 0.1, 0.5 and 0.9. murthy population.
0.06
0.07
Estimated bias
Estimated bias
0.08
0.06 0.05 0.04
0.05
0.04
0.03
0.03 0.02 54
56
58
60
62
64
66
Group mean of x rank means
54
56
58
60
0.06
Estimated bias
62
64
66
Group mean of x rank means
Silva-Skinner
0.05 Post-calibrated 0.04 Calibrated 0.03 Chambers-Dunstan 0.02 54
56
58 60 62 64 Group mean of x rank means
66
Fig. 4. Estimated bias for group mean of x rank means. Sample size n = 100. α = 0.1, 0.5 and 0.9. sugar population.
Table 5 presents the avrb (in percentage) and avrre statistics for some sample sizes. The main conclusions derived in this new study are: - Estimators b Fc (t ) and b Fωps (t ) obtained with four points provide more precise estimates that the respective estimators obtained with 2 points. - The increase in accuracy of these calibrated estimators against the Silva and Skinner estimator is noticeable.
850
S. Martínez et al. / Computational Statistics and Data Analysis 55 (2011) 838–851
6. Generalization to a general sampling design If the samples are obtained from population U according to a measurable sampling design d, with design matrix
Π = (πkm ), it would appear logical to construct the distribution function estimator from its Horvitz–Thompson estimate, 1 X b dk ∆(t − yk ); FYH (t ) = N
k∈s
with dk = 1/πk being the basic design weights. b The Pestimator FYH (t ) is unbiased but is not a genuine distribution function (see Särndal, 2007). If we replace N with b N = d , we obtain the Hájek estimator of the distribution function: k k∈s
P b F0 (t ) =
k∈s
∆(t − yk )dk P ; dk
k∈s
which is a genuine distribution function. However this estimator can have a substantial bias. For t equal to the median and the third quartile, Chambers and Dunstan (1986) and various other authors have empirically demonstrated large biases. The proposed calibration estimator is obtained, in this case, by minimizing the chi square function:
Φsg =
1 X (ωkg − dkg )2 2 k∈s g
dkg
g = 1, 2, . . . , G;
(27)
subject to the same calibration equations and dkg = 1/πk for k ∈ sg ; g = 1, 2, . . . , G. The resulting calibration weights, for all units k ∈ sg , are given by:
0 g ωkg = dkg + dkg Ng Fvg (tg ) − b FVH (tg ) · (Tg )−1 · ∆(tg − vk );
(28)
g
where b FVH (t ) is the Horvitz–Thompson estimators with the sampling weights dkg for g = 1, . . . , G. 7. Conclusions Estimation of the finite population distribution function and quantiles in the presence of auxiliary information requires special treatment in terms of construction. The Chambers and Dunstan estimator can be very efficient when the model upon which it is based is appropriate. However, this estimator can perform poorly under model misspecification. Calibration (see Särndal, 2007) provides a simple and practical approach to incorporating auxiliary information into the estimation of distribution functions, which can offer some useful gains in efficiency. We propose an alternative estimator which combines calibration and poststratification techniques, and possesses a number of desirable properties: it yields a genuine distribution function, is computationally simple, and generalizes the poststratified estimator. In addition, the estimator is obtained by calibration, with all the advantages provided by this technique (see Särndal, 2007). Based on various simulation studies, the proposed post-calibration estimator presents a good performance and is a valid alternative to other quantile estimators. Acknowledgements The authors would like to thank the referees for their many helpful comments and suggestions. This work is partially supported by Ministerio de Educación y Ciencia (contract No. MTM2009-10055). Appendix. Existence of Tg−1 To study the existence of Tg−1 , we consider the v -values of sample units sg in the stratum Ug in increasing order
v(g1) ≤ v(g2) · · · ≤ v(gng −1) ≤ v(gng ) g = 1, 2, . . . , G. g
g
Following Rueda et al. (2007b), we assume that the value ti chosen in stratum Ug , is greater than the first ki sample values of g g g g variable v in sample sg , with ki > ki−1 for i = 2, . . . , Pg; k1 > 0 and kPg ≤ ng , then Tg−1 is not singular and is the following Pg × Pg symmetric matrix
Tg−1
(c11 )g (c21 )g 0 = 0 . .. 0
(c12 )g (c22 )g (c32 )g
0
0 0
.. .
(c23 ) (c33 )g ··· .. .
··· ··· .. .
0
0
0
0
g
··· ··· ··· ··· (cPg −1Pg −1 )g (cPgPg −1 )g
0 0 0 0
; g )
(cPg −1Pg (cPgPg )g
(29)
S. Martínez et al. / Computational Statistics and Data Analysis 55 (2011) 838–851
851
where the (cij )g are given by
(c11 )g =
1
1
+
g k1
N n
P k=1
g
−1
g
P g
g ki
N n
P g
P g
k=k1 +1
1
g
;
g
ki+1
1
(ai,i )g =
;
g k2
N n
P k=k1 +1
ci,i+1 =
−1
(c12 )g =
;
g k2
cPP =
N n
N n
1 g ki+1
P g
; N n
k=ki +1
.
g
kPg
P g k=kPg −1 +1
k=ki +1
k=ki−1 +1
+
N n
With the new expression of Tg−1 ; g = 1, 2, . . . , G, we can obtain an alternative expression of the calibration weights ωk . g
Thus, the calibration weights for the units j = 1, 2, . . . , k1 in the sample sg with g = 1, 2, . . . , G are
ω(gj) =
N n
+
g g g N FVH (t1 ) N Fvg (t1 ) − b n g g
k1
P k=1
g
j = 1, 2, . . . , k1 .
N n g
g
The weights for the sample units j = ki−1 + 1, . . . , ki with i = 2, . . . , Pg are
ω(gj) =
N n
−
g g N g g b N F ( t ) − F ( t ) g v i−1 VH i−1 n g
ki P g
+
N N n g
N n
g
g
g
g
.
N n
k=ki−1 +1
Finally for all j = kPg + 1, . . . , n the weights are ω(j) = Therefore, if the condition g
g
g
ki P
k=ki−1 +1
g
g
Fv (ti ) − b FVH (ti )
g
0 < k1 < k2 < · · · < kPg −1 < kPg ≤ n;
N . n
(30)
is satisfied for all g = 1, 2, . . . , G, the matrix Tg is not singular for all g = 1, 2, . . . , G and consequently the estimator b Fωps (t ) is well defined. If condition (30) is not satisfied for any g in {1, 2, . . . , G}, the corresponding matrix Tg is singular and the estimator b Fωps cannot be obtained. To solve this problem, for each stratum Ug where condition (30) is not satisfied, −1
we only consider condition (6) in the calibration process instead of conditions (8). With this modification, the estimator
b Fωps (t ) can be obtained in all cases. References
Chambers, R.L., Dunstan, R., 1986. Estimating distribution functions from survey data. Biometrika 73, 597–604. Chen, J., Wu, C., 2002. Estimation of distribution function and quantiles using the model-calibrated pseudo empirical likelihood method. Statist. Sinica 12, 1223–1129. Deville, J.C., Särndal, C.E., 1992. Calibration estimators in survey sampling. J. Amer. Statist. Assoc. 87, 376–382. Dorfman, A.H., Hall, P., 1993. Estimators of the finite population distribution function using nonparametric regression. Ann. Statist. 16 (3), 1452–1475. Harms, T., Duchesne, P., 2006. On calibration estimation for quantiles. Surv. Methodol. 32, 37–52. Kuk, A., 1993. A kernel method for estimating finite population distribution function using auxiliary information. Biometrika 80, 385–392. Kuk, A., Mak, T.K., 1989. Median estimation in the presence on auxiliary information. J. R. Stat. Soc. Ser. B 1 (2), 261–269. Kuk, A., Mak, T.K., 1994. A functional approach to estimating finite population distribution functions. Comm. Statist. Theory Methods 23 (3), 883–896. Kuo, L., 1988. Classical and prediction approaches to estimating distribution functions from survey data. In: Proc. Survey Research Methods Section. Amer. Statist. Assoc., Alexandria, VA, pp. 280–285. Martínez, S., Rueda, M., Arcos, A., Martínez, H., 2010. Optimum calibration points estimating distribution functions. J. Comput. Appl. Math. 233, 2265–2277. Murthy, M.N., 1967. Sampling Theory and Method. Statistical Publishing Society, Calcutta. Rao, J.N.K., Kovar, J.G., Mantel, H.J., 1990. On estimating distribution functions and quantiles from survey data using auxiliary information. Biometrika 77, 365–375. Royal, R.M., Cumberland, W.G., 1981. An empirical study of the ratio estimator and estimators of its variance. J. Amer. Statist. Assoc. 76, 66–77. Rueda, M., Arcos, A., Martínez-Miranda, M.D., 2003. Difference estimators of quantile in finite populations. Test 12, 101–116. Rueda, M., Arcos, A., Muñoz, J.F., Singh, S., 2007a. Quantile estimation in two-phase sampling. Computat. Statist. Data Anal. 51, 2559–2572. Rueda, M., Martínez, S., Martínez, H., Arcos, A., 2006. Mean estimation with calibration techniques in presence of missing data. Comput. Statist. Data Anal. 50, 3263–3277. Rueda, M., Martínez, S., Martínez, H., Arcos, A., 2007b. Estimation of the distribution function with calibration methods. J. Statist. Plann. Inference 137, 435–448. Rueda, M., Muñoz, J.F., 2009. New model-assisted estimators for the distribution function using the pseudo empirical likelihood method. Statist. Neerlandica 63 (2), 227–244. Särndal, C.E., 2007. The calibration approach in survey theory and practice. Surv. Methodol. 33 (2), 99–119. Silva, P.L.D., Skinner, C.J., 1995. Estimating distribution functions with auxiliary information using poststratification. J. Official Statist. 11, 277–294. Silverman, B.W., 1986. Density Estimation for Statistics and Data Analysis. Chapman & Hall, London. Singh, S., 2001. Generalized calibration approach for estimating variance in survey sampling. Ann. Inst. Statist. Math. 53, 404–417. Singh, S., Horn, S., Chowdhury, S., Yu, F., 1999. Calibration of the estimator of variance. N. Z. J. Stat. 40 (2), 199–212. Stearn, M., Singh, S., 2008. On the estimation of the general parameter. Comput. Statist. Data Anal. 52, 4253–4271. Wang, S., Dorfman, A.H., 1996. A new estimator for the finite population distribution function. Biometrika 83, 639–652.