Statistics and Probability Letters 83 (2013) 46–51
Prior value incorporated calibration estimator in stratified random sampling

Tetsuji Ohyama ∗
Department of Statistics, Faculty of Medicine, Oita University, Oita 879-5593, Japan

Article history: Received 21 January 2012; received in revised form 6 August 2012; accepted 27 August 2012; available online 5 September 2012

Abstract
In this paper, we discuss the calibration estimator for estimating population characteristics that uses prior values and an auxiliary variable. We assume an infinite population and an unknown underlying distribution for the objective variable. Simulation studies are conducted to investigate the efficiency of the proposed estimator. Finally, a numerical example is presented. © 2012 Elsevier B.V. All rights reserved.

Keywords: Population characteristics; Infinite population; U-statistics; Calibration; Auxiliary variable
1. Introduction

Stratification, in the framework of an infinite population, is represented by a decomposition of the population distribution, i.e., $F(y) = \sum_{i=1}^{L} w_i F_i(y)$, where $F$ is the unknown population cumulative distribution function (CDF) of the objective variable $y$, $F_i$ is the unknown CDF of the $i$th stratum, $L$ is the number of strata, and $w_i \ge 0$ $(i = 1, \ldots, L)$ are known constants such that $\sum_{i=1}^{L} w_i = 1$ (Isii and Taga, 1969; Yanagawa, 1975). Under stratified random sampling, population characteristics may be represented as follows:
$$\theta(F) = \sum_{i_1} \cdots \sum_{i_m} w_{i_1} \cdots w_{i_m} \int \cdots \int \phi(y_1, \ldots, y_m)\, dF_{i_1}(y_1) \cdots dF_{i_m}(y_m), \tag{1.1}$$
where φ is a symmetric function of y1 , y2 , . . . , ym . In this paper, we concentrate on m = 1 for simplicity. Then (1.1) reduces to
$$\theta = \sum_{i=1}^{L} w_i \theta_i,$$

where $\theta_i = \int \phi(y)\, dF_i(y)$ is the population characteristic of the $i$th stratum. Suppose that $Y_{i1}, \ldots, Y_{in_i}$ is a simple random sample of the objective variable from the $i$th stratum. Then we may construct estimators of $\theta(F)$ by combining Hoeffding's U-statistics as follows:
$$\hat\theta = \sum_{i=1}^{L} w_i \hat\theta_i, \tag{1.2}$$

where

$$\hat\theta_i = \frac{1}{n_i} \sum_{\alpha=1}^{n_i} \phi(Y_{i\alpha}). \tag{1.3}$$

∗ Tel.: +81 97 586 5602; fax: +81 97 586 5619. doi:10.1016/j.spl.2012.08.023
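As a minimal sketch of (1.2) and (1.3) for the case $m = 1$, the following Python snippet computes the stratum means of $\phi(Y_{i\alpha})$ and combines them with the known weights. The function name `stratified_estimate`, the sample values, and the weights are hypothetical and serve only as illustration; they do not come from the paper.

```python
import numpy as np

def stratified_estimate(samples, weights, phi=lambda y: y):
    """Return (theta_hat, theta_hat_i) following (1.2) and (1.3) with m = 1.

    samples : list of 1-D sequences, one per stratum (sizes may differ)
    weights : known stratum weights w_i, assumed to sum to one
    phi     : kernel applied to each observation (identity gives the mean)
    """
    theta_i = np.array([np.mean(phi(np.asarray(y, dtype=float))) for y in samples])
    theta_hat = float(np.dot(np.asarray(weights, dtype=float), theta_i))
    return theta_hat, theta_i

# Hypothetical data for three strata (illustration only).
samples = [[1.0, 2.0, 3.0], [4.0, 6.0], [10.0]]
weights = [0.5, 0.3, 0.2]
theta_hat, theta_i = stratified_estimate(samples, weights)
```

Passing a different `phi`, for example `lambda y: y**2`, would estimate the corresponding population characteristic in the same way.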
It is known that the U-statistic (1.3) is the minimum variance unbiased estimator of $\theta_i$ in a family of unknown continuous or discrete CDFs (Fraser, 1957), and (1.2) gives an unbiased estimator of $\theta(F)$.

In the context of sample surveys, it is often the case that we have some auxiliary variables related to the objective variable and prior knowledge about population characteristics of the objective variable. The calibration approach is commonly used to include auxiliary variables in order to increase the precision of the estimators of population characteristics. Deville and Särndal (1992) first presented calibration estimators in survey sampling. Singh (2003) considered two constraints when minimizing the chi-square distance function to construct a traditional linear regression estimator. Under the same constraints, Stearns and Singh (2008) considered the minimization of a model assisted chi-square distance function. Singh (2012) proposed calibration based on a displacement function that has the flexibility of taking a positive, negative, or zero value. In the case of stratified random sampling, Singh et al. (1998) introduced calibration estimation, and Kim et al. (2007) discussed various calibration approach ratio estimators.

On the other hand, Bayesian inference based on stratified random sampling has been developed to incorporate prior estimates; for example, see Gelman et al. (2003). Differing from the Bayesian approach, Ohyama et al. (2008) proposed prior value incorporated estimators (PVIEs) in an infinite population framework without any assumptions for prior values.

The purpose of this paper is to improve the precision of the PVIE by incorporating an auxiliary variable. Section 2 reviews the PVIE proposed by Ohyama et al. (2008). In Section 3, we propose a new estimator that includes prior values and an auxiliary variable for stratified random sampling. A simulation is conducted in Section 4 to evaluate the efficiency of the proposed estimator. Finally, we present a numerical example in Section 5.

2. Review of the prior value incorporated estimator

In this section, we briefly review the results of Ohyama et al. (2008). Let $\theta_{0i}$ be a prior value of $\theta_i$. Ohyama et al. (2008) considered the following estimator for $\theta$:
$$\hat\theta_B = \sum_{i=1}^{L} w_i \hat\theta_{Bi} = \sum_{i=1}^{L} w_i \left\{ (1 - B_i)\hat\theta_i + B_i \theta_{0i} \right\},$$
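where $B_i$ is some constant. Putting $\varepsilon_i = \theta_{0i} - \theta_i$, the mean and mean squared error (MSE) of $\hat\theta_B$ are given as follows:

$$E(\hat\theta_B) = \theta + \sum_{i=1}^{L} w_i B_i \varepsilon_i, \tag{2.1}$$

$$\mathrm{MSE}(\hat\theta_B) = \sum_{i=1}^{L} w_i^2 (1 - B_i)^2 V(\hat\theta_i) + \left( \sum_{i=1}^{L} w_i B_i \varepsilon_i \right)^2. \tag{2.2}$$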
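Minimizing (2.2), the optimum $B_i$ is obtained as follows:

$$B_i^{o} = 1 - \frac{c\,\varepsilon_i}{w_i V(\hat\theta_i)}, \qquad c = \left( \sum_{i=1}^{L} w_i \varepsilon_i \right) \left( 1 + \sum_{i=1}^{L} \frac{\varepsilon_i^2}{V(\hat\theta_i)} \right)^{-1}. \tag{2.3}$$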
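The minimum value of $\mathrm{MSE}(\hat\theta_B)$ is

$$\mathrm{MSE}(\hat\theta_{B^o}) = c \sum_{i=1}^{L} w_i \varepsilon_i = \left( \sum_{i=1}^{L} w_i \varepsilon_i \right)^2 \left( 1 + \sum_{i=1}^{L} \frac{\varepsilon_i^2}{V(\hat\theta_i)} \right)^{-1}.$$

Defining the relative efficiency of $\hat\theta_{B^o}$ relative to $\hat\theta$ by $\mathrm{RE}_B = V(\hat\theta)/\mathrm{MSE}(\hat\theta_{B^o})$, it follows that

$$\mathrm{RE}_B = \frac{V(\hat\theta)}{\varepsilon^2} \left( 1 + \sum_{i=1}^{L} \frac{\varepsilon_i^2}{V(\hat\theta_i)} \right) \ge 1 + \frac{\sum_{i=1}^{L} w_i^2 V(\hat\theta_i)}{\left( \sum_{i=1}^{L} w_i \varepsilon_i \right)^2},$$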
where $V(\hat\theta) = \sum_{i=1}^{L} w_i^2 V(\hat\theta_i)$ and $\varepsilon = \sum_{i=1}^{L} w_i \varepsilon_i$. This inequality means that $\hat\theta_{B^o}$ has smaller MSE than the unbiased estimator $\hat\theta$. However, $\hat\theta_{B^o}$ contains unknown parameters. Thus $\hat\theta_B^*$, in which the unknown parameters are replaced by the corresponding unbiased estimators, was proposed. A simulation study showed that the relative efficiency of $\hat\theta_B^*$ tends to be large as $CV_i/|\varepsilon_i|$ increases and $n_i$ decreases, where $CV_i$ is the coefficient of variation of the objective variable in the $i$th stratum.
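The optimum in (2.3) can be checked numerically. The sketch below evaluates (2.2) for an arbitrary vector $B$, computes $B_i^{o}$ and $c$ from (2.3), and returns $\mathrm{RE}_B$. It assumes that $V(\hat\theta_i)$ and $\varepsilon_i$ are known, which holds only in this idealized review setting; the function names and all input values are hypothetical.

```python
import numpy as np

def pvie_mse(B, w, var_theta, eps):
    """MSE of theta_hat_B for a given vector B, following (2.2)."""
    B, w, v, eps = (np.asarray(a, dtype=float) for a in (B, w, var_theta, eps))
    return np.sum(w**2 * (1.0 - B) ** 2 * v) + np.sum(w * B * eps) ** 2

def pvie_optimum(w, var_theta, eps):
    """Optimum B_i and c from (2.3), the minimum MSE, and RE_B."""
    w, v, eps = (np.asarray(a, dtype=float) for a in (w, var_theta, eps))
    eps_bar = np.sum(w * eps)                      # epsilon = sum_i w_i eps_i
    c = eps_bar / (1.0 + np.sum(eps**2 / v))
    B_opt = 1.0 - c * eps / (w * v)
    mse_min = c * eps_bar
    re_b = np.sum(w**2 * v) / mse_min              # V(theta_hat) / MSE(theta_hat_Bo)
    return B_opt, mse_min, re_b

# Hypothetical inputs (illustration only, not taken from the paper).
w = np.array([1 / 3, 1 / 3, 1 / 3])
var_theta = np.array([0.04, 0.09, 0.16])           # V(theta_hat_i)
eps = np.array([0.3, 0.3, 0.3])                    # theta_0i - theta_i
B_opt, mse_min, re_b = pvie_optimum(w, var_theta, eps)
lower_bound = 1.0 + np.sum(w**2 * var_theta) / np.sum(w * eps) ** 2
```

Evaluating `pvie_mse` at `B_opt` gives back `mse_min`, and any perturbation of `B_opt` yields a larger value, since (2.2) is a strictly convex quadratic in $B$ when every $V(\hat\theta_i) > 0$.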
3. Proposed estimator

3.1. Prior value incorporated calibration estimator

Suppose that $(Y_{i1}, X_{i1}), \ldots, (Y_{in_i}, X_{in_i})$ is a simple random sample from the $i$th stratum, where $Y_{i\alpha}$ is an objective variable and $X_{i\alpha}$ is an auxiliary variable $(\alpha = 1, \ldots, n_i)$. We define the following new estimator:
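$$\hat\theta_C = \sum_{i=1}^{L} \tilde w_i \tilde\theta_{Bi} \tag{3.1}$$

with calibration weights $\tilde w_i$, where

$$\tilde\theta_{Bi} = (1 - \tilde B_i)\hat\theta_i + \tilde B_i \theta_{0i}, \tag{3.2}$$

and $\tilde B_i$ is the optimum $B_i$ that uses $\tilde w_i$. We employ the method proposed by Singh (2003), that is, the minimization of the chi-square-type distance function

$$\Phi = \frac{1}{2} \sum_{i=1}^{L} \frac{(\tilde w_i - w_i)^2}{w_i q_i}, \tag{3.3}$$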
subject to the following two constraints:

$$\sum_{i=1}^{L} \tilde w_i \hat\mu_i = \mu, \tag{3.4}$$

and

$$\sum_{i=1}^{L} \tilde w_i = 1, \tag{3.5}$$

where the $q_i$'s are known positive weights unrelated to $\tilde w_i$, $\mu$ is a known constant, $\hat\mu_i$ is the unbiased estimator of $\mu_i$ given by

$$\hat\mu_i = \frac{1}{n_i} \sum_{\alpha=1}^{n_i} \varphi(X_{i\alpha}),$$

and $\varphi$ is a symmetric function of $X_{i1}, X_{i2}, \ldots, X_{in_i}$. Minimization of (3.3) subject to (3.4) and (3.5) leads to the following calibrated weights:
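$$\tilde w_i = w_i + \frac{w_i q_i \hat\mu_i \sum_i w_i q_i - w_i q_i \sum_i w_i q_i \hat\mu_i}{\sum_i w_i q_i \hat\mu_i^2 \sum_i w_i q_i - \left( \sum_i w_i q_i \hat\mu_i \right)^2} (\mu - \hat\mu). \tag{3.6}$$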
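The calibrated weights in (3.6) can be sketched in a few lines of Python. The snippet takes $\hat\mu = \sum_i w_i \hat\mu_i$ and assumes $\sum_i w_i = 1$, so that constraints (3.4) and (3.5) are met. The function name `calibrated_weights` and every numerical input are hypothetical and chosen only for illustration.

```python
import numpy as np

def calibrated_weights(w, mu_hat_i, mu, q=None):
    """Calibrated weights following (3.6).

    w        : design weights w_i (assumed to sum to one)
    mu_hat_i : stratum estimates of the auxiliary characteristic
    mu       : known population value of the auxiliary characteristic
    q        : known positive weights q_i (defaults to q_i = 1)
    """
    w = np.asarray(w, dtype=float)
    m = np.asarray(mu_hat_i, dtype=float)
    q = np.ones_like(w) if q is None else np.asarray(q, dtype=float)
    wq = w * q
    s0, s1, s2 = wq.sum(), (wq * m).sum(), (wq * m**2).sum()
    mu_hat = (w * m).sum()                          # combined estimate of mu
    return w + (wq * m * s0 - wq * s1) / (s2 * s0 - s1**2) * (mu - mu_hat)

# Hypothetical inputs (illustration only).
w = np.array([0.2, 0.5, 0.3])
mu_hat_i = np.array([20.4, 29.1, 41.0])
q = np.array([1.0, 2.0, 1.0])
w_tilde = calibrated_weights(w, mu_hat_i, mu=30.0, q=q)
```

With these inputs, `w_tilde.sum()` equals one and `w_tilde @ mu_hat_i` equals the supplied `mu` up to floating-point error, which is exactly what (3.5) and (3.4) require.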
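Substituting (3.6) into (3.1), we obtain the prior value incorporated estimator that includes an auxiliary variable, and it may also be represented as follows:

$$\hat\theta_C = \tilde\theta_B + \tilde\beta(\mu - \hat\mu), \tag{3.7}$$

where $\tilde\theta_B = \sum_i w_i \tilde\theta_{Bi}$, $\hat\mu = \sum_i w_i \hat\mu_i$, and

$$\tilde\beta = \frac{\sum_i w_i q_i \hat\mu_i \tilde\theta_{Bi} \sum_i w_i q_i - \sum_i w_i q_i \tilde\theta_{Bi} \sum_i w_i q_i \hat\mu_i}{\sum_i w_i q_i \hat\mu_i^2 \sum_i w_i q_i - \left( \sum_i w_i q_i \hat\mu_i \right)^2}. \tag{3.8}$$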
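Now we evaluate the mean squared error of $\hat\theta_C$ when $q_i = 1$. Since the value of $\tilde w_i$ should be close to that of $w_i$, we approximate $\tilde\theta_{Bi}$ by $\hat\theta_{Bi}$. Then we have

$$(\hat\theta_C - \theta)^2 \approx (\hat\theta_B - \theta)^2 + \beta^2 (\mu - \hat\mu)^2 - 2\beta (\hat\theta_B - \theta)(\hat\mu - \mu).$$

Since $E(\hat\mu) = \mu$, the expectation of the cross product $(\hat\theta_B - \theta)(\hat\mu - \mu)$ equals $\mathrm{Cov}(\hat\theta_B, \hat\mu)$. Taking expectation and using $\beta = \mathrm{Cov}(\hat\theta_B, \hat\mu)/V(\hat\mu)$, we obtain

$$\mathrm{MSE}(\hat\theta_C) \approx \mathrm{MSE}(\hat\theta_B) - \beta^2 V(\hat\mu).$$

Thus we may expect that $\hat\theta_C$ has smaller MSE than that of $\hat\theta_B$.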
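Next, $\tilde\beta$ is decomposed as follows:

$$\tilde\beta = \hat\beta_R + \tilde\beta_{B\varepsilon},$$

where

$$\hat\beta_R = \frac{\sum_i w_i \hat\mu_i \hat\theta_i - \sum_i w_i \hat\theta_i \sum_i w_i \hat\mu_i}{\sum_i w_i \hat\mu_i^2 - \left( \sum_i w_i \hat\mu_i \right)^2}$$

and

$$\tilde\beta_{B\varepsilon} = \frac{\sum_i w_i \tilde B_i \hat\varepsilon_i \hat\mu_i - \sum_i w_i \tilde B_i \hat\varepsilon_i \sum_i w_i \hat\mu_i}{\sum_i w_i \hat\mu_i^2 - \left( \sum_i w_i \hat\mu_i \right)^2}.$$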
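Thus $\hat\theta_C$ is represented as follows:

$$\hat\theta_C = \hat\theta + \hat\beta_R (\mu - \hat\mu) + \sum_{i=1}^{L} w_i \tilde B_i \hat\varepsilon_i + \tilde\beta_{B\varepsilon} (\mu - \hat\mu).$$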
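The decomposition rests on the fact that, for $q_i = 1$ and $\sum_i w_i = 1$, the slope in (3.8) is linear in $\tilde\theta_{Bi} = \hat\theta_i + \tilde B_i \hat\varepsilon_i$. The sketch below checks this numerically. The helper `weighted_slope` and all inputs, including the values used for $\tilde B_i$, are hypothetical; they are not the optimum values and not data from the paper.

```python
import numpy as np

def weighted_slope(w, x, z):
    """Slope of the form used in (3.8) with q_i = 1 and sum(w) = 1."""
    w, x, z = (np.asarray(a, dtype=float) for a in (w, x, z))
    num = np.sum(w * x * z) - np.sum(w * z) * np.sum(w * x)
    den = np.sum(w * x**2) - np.sum(w * x) ** 2
    return num / den

# Hypothetical inputs (illustration only).
w = np.array([0.3, 0.4, 0.3])
theta_hat_i = np.array([14.8, 20.3, 24.6])
mu_hat_i = np.array([19.5, 30.4, 39.2])
theta0_i = np.array([15.3, 20.3, 25.3])
B_tilde = np.array([0.4, 0.7, 0.2])                # arbitrary shrinkage constants
mu = 30.0

eps_hat = theta0_i - theta_hat_i
theta_tilde_B = theta_hat_i + B_tilde * eps_hat    # (3.2), rearranged
mu_hat = w @ mu_hat_i

beta_tilde = weighted_slope(w, mu_hat_i, theta_tilde_B)
beta_R = weighted_slope(w, mu_hat_i, theta_hat_i)
beta_Be = weighted_slope(w, mu_hat_i, B_tilde * eps_hat)

theta_C = w @ theta_tilde_B + beta_tilde * (mu - mu_hat)          # (3.7)
theta_C_decomposed = (w @ theta_hat_i + beta_R * (mu - mu_hat)
                      + w @ (B_tilde * eps_hat) + beta_Be * (mu - mu_hat))
```

Both `beta_tilde` and `theta_C` agree with their decomposed counterparts up to rounding, whatever values are supplied for `B_tilde`.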
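Assuming that the bias of the PVIE, $\sum_i w_i B_i \varepsilon_i$, is sufficiently close to zero, we obtain the following approximation:

$$(\hat\theta_C - \theta)^2 \approx (\hat\theta_R - \theta)^2 + \beta_{B\varepsilon}^2 (\mu - \hat\mu)^2 - 2\beta_{B\varepsilon} (\hat\theta_R - \theta)(\hat\mu - \mu),$$

where $\hat\theta_R = \hat\theta + \hat\beta_R (\mu - \hat\mu)$ is the usual combined regression estimator. Taking expectation and using $\beta_R = \mathrm{Cov}(\hat\theta_R, \hat\mu)/V(\hat\mu)$, we obtain

$$\mathrm{MSE}(\hat\theta_C) \approx \mathrm{MSE}(\hat\theta_R) + (\beta_{B\varepsilon}^2 - 2\beta_{B\varepsilon}\beta_R) V(\hat\mu).$$

Thus we may expect that $\hat\theta_C$ has smaller MSE than that of $\hat\theta_R$ if $\beta_{B\varepsilon}^2 - 2\beta_{B\varepsilon}\beta_R < 0$.

3.2. Proposed estimator

Recall that unknown parameters are involved in the optimum $B_i$ given by (2.3). These parameters are replaced by the corresponding unbiased estimators from the data. Thus we propose the following estimator $\hat\theta_C^*$ for $\theta$: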
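$$\hat\theta_C^* = \sum_{i=1}^{L} w_i \left\{ (1 - \tilde B_i^*)\hat\theta_i + \tilde B_i^* \theta_{0i} \right\} + \tilde\beta (\mu - \hat\mu),$$

where

$$\tilde B_i^* = 1 - \frac{n_i \tilde c\, \hat\varepsilon_i}{\tilde w_i \hat V[\phi(Y_{i1})]}, \qquad \tilde c = \left( \sum_{i=1}^{L} \tilde w_i \hat\varepsilon_i \right) \left( 1 + \sum_{i=1}^{L} \frac{n_i \hat\varepsilon_i^2}{\hat V[\phi(Y_{i1})]} \right)^{-1},$$

$$\hat\theta_i = \frac{1}{n_i} \sum_{\alpha=1}^{n_i} \phi(Y_{i\alpha}), \qquad \hat V[\phi(Y_{i1})] = S_i^2 = \frac{1}{n_i - 1} \sum_{\alpha=1}^{n_i} \{ \phi(Y_{i\alpha}) - \hat\theta_i \}^2, \qquad \hat\varepsilon_i = \theta_{0i} - \hat\theta_i,$$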
and $\tilde w_i$ and $\tilde\beta$ are given by (3.6) and (3.8), respectively.

4. Simulation

We conducted simulations to study the efficiency of the proposed estimator $\hat\theta_C^*$. We considered $\phi(y) = y$, $\varphi(x) = x$, $w_i = 1/L$, and $L = 3$. The sample size of each stratum was $n_i = 10, 30, 50$. Random observations of the objective variable $y_{i\alpha}$ $(\alpha = 1, \ldots, n_i)$ were generated by the equation $y_{i\alpha} = 5 + \beta x_{i\alpha} + e_{i\alpha}$, where $x_{i\alpha}$ is the auxiliary variable that follows the normal distribution $N(\mu_i, \tau_i^2)$, $e_{i\alpha} \sim N(0, \sigma_i^2)$, and $\beta = 0.5, 1, 2$. We set $(\mu_1, \mu_2, \mu_3) = (20, 30, 40)$, $CV_a = \tau_i/\mu_i = 0.2$, and $CV_o = \sigma_i/(5 + \beta\mu_i) = 0.1, 0.2, 0.3$. The prior value was given by $\theta_{0i} = 5 + \beta\mu_i + \varepsilon$ with $\varepsilon = 0.3, 1.3$.

We computed 10,000 values of the combined regression estimator $\hat\theta_R$, the prior value incorporated estimator $\hat\theta_B^*$, and the proposed estimator $\hat\theta_C^*$. We then obtained the values of the relative efficiencies $\mathrm{RE}_R = V(\hat\theta)/\mathrm{MSE}(\hat\theta_R)$, $\mathrm{RE}_B = V(\hat\theta)/\mathrm{MSE}(\hat\theta_B^*)$, and $\mathrm{RE}_C = V(\hat\theta)/\mathrm{MSE}(\hat\theta_C^*)$. Table 1 summarizes the results. This table shows that the $\mathrm{RE}_C$ value tends to be large as $\beta$ increases, $\varepsilon$ decreases, and the coefficient of variation of the objective variable $CV_o$ decreases.
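One cell of the simulation design can be sketched as follows. This is a rough reimplementation under stated assumptions, not the author's code: $V(\hat\theta)$ is replaced by the Monte Carlo MSE of $\hat\theta$, $\hat V[\phi(Y_{i1})]$ is taken to be the sample variance $S_i^2$, $q_i = 1$, and fewer replicates are used than the 10,000 in the text. The function name `simulate_re` and its arguments are hypothetical.

```python
import numpy as np

def simulate_re(n=30, beta=2.0, cv_o=0.1, eps=0.3, reps=2000, seed=0):
    """Monte Carlo relative efficiencies (RE_R, RE_B, RE_C) for one design cell."""
    rng = np.random.default_rng(seed)
    L = 3
    w = np.full(L, 1.0 / L)
    mu_i = np.array([20.0, 30.0, 40.0])
    tau = 0.2 * mu_i                                # CV_a = 0.2
    theta_i = 5.0 + beta * mu_i
    sigma = cv_o * theta_i                          # CV_o = sigma_i / (5 + beta mu_i)
    theta, mu = w @ theta_i, w @ mu_i
    theta0 = theta_i + eps                          # prior values

    x = rng.normal(mu_i[:, None], tau[:, None], size=(reps, L, n))
    y = 5.0 + beta * x + rng.normal(0.0, sigma[:, None], size=(reps, L, n))
    th = y.mean(axis=2)                             # theta_hat_i, shape (reps, L)
    s2 = y.var(axis=2, ddof=1)                      # S_i^2
    mh = x.mean(axis=2)                             # mu_hat_i
    e_hat = theta0 - th                             # eps_hat_i
    mu_hat = mh @ w
    den = (mh**2) @ w - mu_hat**2                   # denominator of (3.6), q_i = 1

    def shrunk(wt):
        """Stratum estimates (1 - B*) theta_hat_i + B* theta_0i for weights wt."""
        c = np.sum(wt * e_hat, axis=1) / (1.0 + np.sum(n * e_hat**2 / s2, axis=1))
        b = 1.0 - n * c[:, None] * e_hat / (wt * s2)
        return (1.0 - b) * th + b * theta0

    theta_hat = th @ w
    beta_r = ((mh * th) @ w - theta_hat * mu_hat) / den
    theta_r = theta_hat + beta_r * (mu - mu_hat)    # combined regression estimator
    theta_b = shrunk(w) @ w                         # PVIE with estimated B_i
    w_cal = w + w * (mh - mu_hat[:, None]) / den[:, None] * (mu - mu_hat)[:, None]
    theta_c = np.sum(w_cal * shrunk(w_cal), axis=1) # proposed estimator via (3.1)

    def mse(est):
        return np.mean((est - theta) ** 2)

    v = mse(theta_hat)                              # stands in for V(theta_hat)
    return v / mse(theta_r), v / mse(theta_b), v / mse(theta_c)

re_r, re_b, re_c = simulate_re()
```

Because of the assumptions above and Monte Carlo error, the returned values should only be expected to be of the same order as the corresponding entries of Table 1, not to match them digit for digit.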
Table 1
Simulation results.

CVo    ε     β      n = 10               n = 30               n = 50
                    RER    REB    REC    RER    REB    REC    RER    REB    REC
0.10   0.3   0.5    3.14   1.46   3.83   3.31   1.35   3.67   3.32   1.25   3.47
             1.0    3.84   1.51   4.56   3.97   1.50   4.69   4.01   1.46   4.57
             2.0    4.26   1.54   4.96   4.38   1.55   5.10   4.42   1.55   5.08
       1.3   0.5    3.19   0.97   2.69   3.24   0.89   2.60   3.28   0.91   2.72
             1.0    3.82   1.23   3.79   4.01   1.02   3.31   4.05   0.93   3.05
             2.0    4.19   1.43   4.61   4.27   1.31   4.23   4.43   1.20   4.08
0.20   0.3   0.5    1.55   1.49   2.17   1.58   1.45   2.17   1.57   1.38   2.07
             1.0    1.73   1.52   2.46   1.72   1.54   2.46   1.73   1.51   2.44
             2.0    1.84   1.53   2.65   1.82   1.57   2.65   1.83   1.57   2.64
       1.3   0.5    1.56   1.10   1.65   1.55   0.91   1.36   1.59   0.88   1.36
             1.0    1.72   1.33   2.15   1.71   1.12   1.79   1.67   1.02   1.62
             2.0    1.79   1.46   2.42   1.82   1.41   2.37   1.84   1.31   2.25
0.30   0.3   0.5    1.24   1.51   1.82   1.25   1.48   1.81   1.23   1.46   1.76
             1.0    1.31   1.52   1.93   1.31   1.55   1.96   1.31   1.55   1.96
             2.0    1.36   1.52   2.01   1.38   1.57   2.09   1.39   1.57   2.12
       1.3   0.5    1.23   1.21   1.47   1.23   1.00   1.21   1.25   0.91   1.13
             1.0    1.30   1.41   1.78   1.33   1.26   1.64   1.31   1.14   1.46
             2.0    1.35   1.49   1.96   1.37   1.47   1.93   1.38   1.40   1.86
Table 2
Population characteristics.

Stratum   Ni    wi      θi          µi          θ0i
1         15    0.319    8.899       9.270       8.903
2         22    0.468    9.640       9.849       9.640
3         10    0.213   10.958      10.968      10.956
          47    1.000   θ = 9.684   µ = 9.902
Table 3
Estimated values of characteristics.

Stratum   ni    Births                Deaths                B∗i      w̃i      B̃∗i
                θ̂i       Si2         µ̂i       Ui2
1          3     8.774    0.028       9.241    0.019       −0.314   0.287   −0.496
2          6     9.501    0.114       9.847    0.056        0.691   0.468    0.683
3          3    10.654    0.099      10.783    0.059       −0.067   0.245    0.049
          12    θ̂ = 9.505, V̂(θ̂) = 0.0077     µ̂ = 9.853, V̂(µ̂) = 0.0036
5. Numerical example

In this section, we apply the proposed estimator to the data concerning births and deaths of Japanese in 2010. The number of births was used as the objective variable and the number of deaths was used as the auxiliary variable. Both were log-transformed since their distributions were highly skewed. Japan has 47 prefectures, and we divided them into three strata based on the population of each prefecture using the method proposed by Dalenius and Hodges (1959). Prior values of the objective variable were given by the data in 2009. Population characteristics are summarized in Table 2.

We randomly selected prefectures from each stratum and estimated the population mean of the number of births using the proposed estimator. The sample size of each stratum was computed by Neyman allocation, with a total sample size of $n = 12$. The sample means $\hat\theta_i$ and $\hat\mu_i$ and the sample variances $S_i^2$ and $U_i^2$ of each variable are summarized in Table 3. This table also lists the estimated optimum $B_i$ for the PVIE, $B_i^*$; the calibrated weight $\tilde w_i$; and the estimated optimum $B_i$ that uses $\tilde w_i$, $\tilde B_i^*$. Other parameters were calculated as follows:

$$\hat c = 0.0249, \quad \tilde c = 0.0255, \quad \tilde\beta = 1.298, \quad \hat\beta_R = 1.051, \quad \tilde\beta_{B\varepsilon} = 0.247.$$
The mean of the log-transformed number of births was estimated to be 9.656 by the proposed estimator. The estimates by the combined regression estimator and the PVIE were 9.557 and 9.530, respectively. The approximate MSE of the proposed estimator was calculated to be $5.22 \times 10^{-4}$. The MSE values of the usual combined regression estimator and the PVIE were $3.75 \times 10^{-3}$ and $4.47 \times 10^{-3}$, respectively. Thus in this example, we may expect the proposed estimator to be the most efficient of the three estimators.
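As a rough consistency check, the calibrated weights of Table 3 can be recomputed from the published values with (3.6) and $q_i = 1$. The snippet below uses only the rounded figures printed in Tables 2 and 3, so agreement can be expected to about three decimal places at best. The choice $q_i = 1$ is an assumption, since the section does not state which $q_i$ was used, and the variable names are hypothetical.

```python
import numpy as np

w = np.array([0.319, 0.468, 0.213])             # w_i from Table 2
mu_hat_i = np.array([9.241, 9.847, 10.783])     # stratum means of log deaths, Table 3
mu = 9.902                                      # known population mean, Table 2
w_table = np.array([0.287, 0.468, 0.245])       # calibrated weights listed in Table 3

mu_hat = w @ mu_hat_i                           # combined estimate of mu
den = w @ mu_hat_i**2 - mu_hat**2               # denominator of (3.6) with q_i = 1
w_cal = w + w * (mu_hat_i - mu_hat) / den * (mu - mu_hat)

max_abs_diff = np.max(np.abs(w_cal - w_table))  # discrepancy due to rounded inputs
```

The recomputed `mu_hat` rounds to the 9.853 reported in Table 3, and `w_cal` satisfies (3.4) and (3.5) by construction.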
Acknowledgments

This work was supported by a Grant-in-Aid for Young Scientists (B) from the Ministry of Education, Culture, Sports, Science and Technology of Japan (No. 23700339). We would like to thank the Editor and referees for helpful comments and suggestions.

References

Dalenius, T., Hodges, J.L., 1959. Minimum variance stratification. Journal of the American Statistical Association 54, 88–101.
Deville, J.-C., Särndal, C.-E., 1992. Calibration estimators in survey sampling. Journal of the American Statistical Association 87, 376–382.
Fraser, D., 1957. Nonparametric Methods in Statistics. Wiley, New York.
Gelman, A., Carlin, J., Stern, H., Rubin, D., 2003. Bayesian Data Analysis, second ed. Chapman & Hall, CRC, London, Boca Raton, FL.
Isii, K., Taga, Y., 1969. On optimal stratifications for multivariate distributions. Scandinavian Actuarial Journal.
Kim, J., Sungur, E.A., Heo, T.Y., 2007. Calibration approach estimators in stratified sampling. Statistics & Probability Letters.
Ohyama, T., Doi, J.A., Yanagawa, T., 2008. Estimating population characteristics by incorporating prior values in stratified random sampling/ranked set sampling. Journal of Statistical Planning and Inference 138, 4021–4032.
Singh, S., 2003. Golden Jubilee Year-2003 of the linear regression estimator. Working Paper at St. Cloud State University.
Singh, S., 2012. On the calibration of design weights using a displacement function. Metrika 75, 85–107.
Singh, S., Horn, S., Yu, F., 1998. Estimation of variance of the general regression estimator: higher level calibration approach. Survey Methodology 24, 41–50.
Stearns, M., Singh, S., 2008. On the estimation of the general parameter. Computational Statistics and Data Analysis 52, 4253–4271.
Yanagawa, T., 1975. Stratified random sampling; gain in precision due to stratification in the case of proportion allocation. Annals of the Institute of Statistical Mathematics 27, 33–44.