A general class of linearly extrapolated variance estimators


Statistics and Probability Letters 98 (2015) 29–38


Qing Wang, Shiwen Chen
Department of Mathematics and Statistics, Williams College, Williamstown, MA, USA

Article history: Received 10 June 2014; Received in revised form 8 December 2014; Accepted 11 December 2014; Available online 18 December 2014.

Keywords: ANOVA decomposition; Jackknife; Hoeffding decomposition; Linear extrapolation; Variance estimation; U-statistic

Abstract

A general class of linearly extrapolated variance estimators was developed as an extension of the conventional leave-one-out jackknife variance estimator. In the context of U-statistic variance estimation, the proposed variance estimator is first-order unbiased. After showing the equivalence between the Hoeffding decomposition (Hoeffding, 1948) and the ANOVA decomposition (Efron and Stein, 1981), we study the bias property of the proposed variance estimator in comparison to the conventional jackknife method. Simulation studies indicate that the proposal has comparable performance to the jackknife method when assessing the variance of the sample variance in various distributions. An application to half-sampling cross-validation indicates that the proposal is more computationally efficient and shows better performance than its jackknife counterpart in the context of regression analysis.

1. Introduction

Variance measures the uncertainty of a random variable. Therefore, variance estimation is crucial in evaluating the performance of a point estimator or a statistical methodology. Nowadays, one of the most commonly used variance estimation techniques is the leave-one-out jackknife variance estimator (Quenouille, 1949; Tukey, 1958). Denote the parameter of interest as θ. Given an i.i.d. sample of size n, X_1, ..., X_n, the jackknife variance estimator for the statistic θ̂ = T(X_1, ..., X_n) is defined as

\hat V_J = \frac{n-1}{n} \sum_{i=1}^{n} \left( T_{n-1}^{-i} - \bar T_{n-1} \right)^2,    (1.1)

where T_{n-1}^{-i} = T(X_1, ..., X_{i-1}, X_{i+1}, ..., X_n) and \bar T_{n-1} = n^{-1} \sum_{i=1}^{n} T_{n-1}^{-i}. Efron and Stein (1981) consider the jackknife variance estimator as a linearly extrapolated estimator: one first constructs a variance estimator at subsample size n − 1 using \sum_{i=1}^{n} (T_{n-1}^{-i} - \bar T_{n-1})^2, and then extrapolates it from n − 1 to the original sample size n by multiplying by (n − 1)/n.
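For concreteness, a minimal Python sketch of (1.1) (numpy assumed; the function name jackknife_variance and the example statistic are illustrative):

import numpy as np

def jackknife_variance(x, stat):
    # leave-one-out jackknife variance estimate (1.1) for the statistic stat(x)
    x = np.asarray(x)
    n = len(x)
    # recompute the statistic on each delete-one subsample, T_{n-1}^{-i}
    loo = np.array([stat(np.delete(x, i)) for i in range(n)])
    return (n - 1) / n * np.sum((loo - loo.mean()) ** 2)

# example: jackknife variance estimate for the unbiased sample variance S^2
rng = np.random.default_rng(0)
x = rng.normal(size=30)
print(jackknife_variance(x, lambda s: s.var(ddof=1)))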

Following in the footsteps of Efron and Stein (1981), we consider an extension of the conventional jackknife methodology. The main contribution of this paper is the proposal of a general class of linearly extrapolated variance estimators and the investigation of its bias property. We also demonstrate an application of the proposed variance estimator in half-sampling cross-validation. The new methodology can be summarized as follows: we first devise a variance estimator at subsample size m (m < n) and then extrapolate it from m to n using linear approximation. In the context of U-statistic variance estimation, an unbiased variance estimator at size m can be obtained as long as the kernel size k ≤ m ≤ n/2 (see Section 2). Then, the bias in the linearly extrapolated variance estimator can be formally evaluated. We prove in Section 3 the equivalence between the Hoeffding decomposition (Hoeffding, 1948) and the ANOVA decomposition (Efron and Stein, 1981), which facilitates the comparison between the proposed variance estimator and the leave-one-out jackknife variance estimator. We show in Section 3 that both estimators are first-order unbiased, and their biases can be expressed explicitly. We demonstrate the performance of the proposal in comparison to the jackknife method in two simulation studies in Section 4. The proposed variance estimator shows comparable performance to the jackknife method in a simulation study that assesses the variance of the unbiased sample variance. The proposal seems to outperform the jackknife estimator with high computational efficiency in the context of half-sampling cross-validation. It can be seen in Section 4.2 that the flexibility of choosing the subsample size m (m ≤ n/2) in the proposed variance estimator leads to an efficient realization of the cross-validation algorithm. In the end, we conclude our paper with some final remarks and discussion.





2. Linearly extrapolated variance estimator

Given an i.i.d. sample of size n, a U-statistic (Hoeffding, 1948) is defined as

U_n = \binom{n}{k}^{-1} \sum_{1 \le i_1 < \cdots < i_k \le n} \phi(X_{i_1}, \ldots, X_{i_k}),    (2.1)

where φ(x_1, ..., x_k) is a symmetric kernel function with k components. It is an unbiased estimator for the parameter θ = E{φ(X_1, ..., X_k)}. As most unbiased point estimators can be written as a U-statistic, throughout this paper we focus on the problem of U-statistic variance estimation. However, the proposed variance estimator can be easily generalized to other statistics that do not have a U-statistic representation.

Hoeffding (1948) derives the closed-form expression of the variance of a U-statistic. However, calculating the exact variance is computationally expensive, especially when both n and k are large. Moreover, the asymptotic variance of a general U-statistic (see Theorem 7.1 in Hoeffding, 1948) is not necessarily reliable when the kernel size k is not negligible compared to the sample size n. In this paper, we propose a linearly extrapolated variance estimator that is easy to construct and is applicable as long as k ≤ n/2. In addition, the proposed variance estimator is first-order unbiased in the context of U-statistic variance estimation, which makes it a valuable competitor of the jackknife variance estimator.

Consider a U-statistic defined in (2.1). For any k ≤ m, let U_m be the U-statistic computed based on a subsample of size m, say X_m = (X_1, ..., X_m). Denote

U_m = U_m(X_1, \ldots, X_m) = \binom{m}{k}^{-1} \sum_{1 \le i_1 < \cdots < i_k \le m} \phi(X_{i_1}, \ldots, X_{i_k}).
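As an illustration of a complete U-statistic such as (2.1) or U_m above, a short Python sketch built from a user-supplied symmetric kernel (numpy assumed; with the kernel φ(x_1, x_2) = (x_1 − x_2)²/2 of size k = 2 it reproduces the unbiased sample variance):

import numpy as np
from itertools import combinations

def u_statistic(x, kernel, k):
    # complete U-statistic: average of the kernel over all size-k subsets
    x = np.asarray(x)
    vals = [kernel(*x[list(idx)]) for idx in combinations(range(len(x)), k)]
    return float(np.mean(vals))

# example: the unbiased sample variance as a U-statistic of kernel size k = 2
rng = np.random.default_rng(0)
x = rng.normal(size=12)
print(u_statistic(x, lambda a, b: (a - b) ** 2 / 2, k=2))
print(x.var(ddof=1))  # agrees with the U-statistic form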

Theorem 1. Let U_n be a U-statistic based on a symmetric kernel φ of size k, where k ≤ m ≤ n/2. Given an i.i.d. sample of size n, let S_m and S_m^* be mutually exclusive subsamples of size m. Define

\hat V_m = \left\{ \binom{n}{m} \binom{n-m}{m} \right\}^{-1} \sum_{(S_m, S_m^*) \subseteq X_n} \frac{\{ U_m(S_m) - U_m(S_m^*) \}^2}{2}.

Then, V̂_m is an unbiased estimator of Var(U_m). The linearly extrapolated variance estimator of U_n can be expressed as

\hat V_{ex} = \frac{m}{n} \hat V_m,    (2.2)

which is a first-order unbiased estimator for Var(U_n).

Proof. Because S_m and S_m^* are nonoverlapping data subsets of size m,

E(\hat V_m) = \frac{1}{2} E\left[ \{ U_m(S_m) - U_m(S_m^*) \}^2 \right] = E(U_m^2) - \{ E(U_m) \}^2 = Var(U_m).

The unbiasedness of V̂_m follows. To show the first-order unbiasedness of V̂_ex, notice that

E(\hat V_{ex}) = \frac{m}{n} E(\hat V_m) = \frac{m}{n} Var(U_m).

Based on the exact formula of a U-statistic variance (Hoeffding, 1948), we have

Var(U_m) = \sum_{j=1}^{k} \binom{k}{j}^2 \binom{m}{j}^{-1} \delta_j^2,

where δ_j^2 is the variance of the jth orthogonal term in the Hoeffding decomposition (Hoeffding, 1948; Lee, 1990). Hence

E(\hat V_{ex}) = \frac{m}{n} \sum_{j=1}^{k} \binom{k}{j}^2 \binom{m}{j}^{-1} \delta_j^2 = \frac{k^2}{n} \delta_1^2 + \frac{m}{n} \sum_{j=2}^{k} \binom{k}{j}^2 \binom{m}{j}^{-1} \delta_j^2.

Therefore, V̂_ex is a first-order unbiased estimator for Var(U_n). □


When the sample size n and subsample size m are both large, the exhaustive number of pairs of nonoverlapping subsets (S_m, S_m^*) may be enormous. In this case, one can randomly draw B nonoverlapping pairs of subsamples, (S_{m,b}, S_{m,b}^*) for 1 ≤ b ≤ B, and approximate V̂_ex in (2.2) as follows:

\hat V_{ex} \approx \frac{m}{n} B^{-1} \sum_{b=1}^{B} \frac{\{ U_m(S_{m,b}) - U_m(S_{m,b}^*) \}^2}{2}.    (2.3)
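A minimal Python sketch of the approximation (2.3), drawing B random nonoverlapping pairs of subsamples of size m (numpy assumed; u_stat is any function evaluating U_m on a subsample, and the function names are illustrative):

import numpy as np

def extrapolated_variance(x, u_stat, m, B=500, seed=None):
    # linearly extrapolated variance estimate, approximated as in (2.3)
    x = np.asarray(x)
    n = len(x)
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(B):
        idx = rng.permutation(n)
        s1, s2 = x[idx[:m]], x[idx[m:2 * m]]   # disjoint subsamples of size m
        total += (u_stat(s1) - u_stat(s2)) ** 2 / 2
    return (m / n) * total / B

# example: U_m = unbiased sample variance (kernel size k = 2), with m = n/2
rng = np.random.default_rng(1)
x = rng.normal(size=30)
print(extrapolated_variance(x, lambda s: s.var(ddof=1), m=15, B=500, seed=2))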

Remark 1. From the proof of Theorem 1, the linearly extrapolated variance estimator V̂_ex yields the smallest bias in the second-order term when m = n/2. In the case that m = n/2, V̂_ex in (2.3) can be easily realized by randomly splitting the sample of size n into disjoint half samples.

3. Comparison to the jackknife variance estimator

The conventional jackknife variance estimator (1.1) can be viewed as a linearly extrapolated variance estimator with subsample size m = n − 1 (Efron and Stein, 1981). In this section we compare the performance of our proposed variance estimator with that of the jackknife methodology in terms of their bias properties. As shown in the proof of Theorem 1, the first-order unbiasedness of V̂_ex is easily obtained based on the Hoeffding decomposition. In addition, Efron and Stein (1981) give the explicit bias expression of the jackknife variance estimator based on the ANOVA decomposition. In order to compare the bias property of the proposed variance estimator with that of the jackknife method, we need to establish a connection between the Hoeffding decomposition (Hoeffding, 1948) and the ANOVA decomposition (Efron and Stein, 1981).

Let h_j (1 ≤ j ≤ k) be the jth orthogonal term in the Hoeffding decomposition. If we denote φ_j(x_1, ..., x_j) = E{φ(X_1, ..., X_k) | X_1 = x_1, ..., X_j = x_j}, then h_1(x_1) = φ_1(x_1) − θ, and

h_j(x_1, \ldots, x_j) = \phi_j(x_1, \ldots, x_j) - \sum_{l=1}^{j-1} \sum_{(j,l)} h_l(x_{i_1}, \ldots, x_{i_l}) - \theta,

where \sum_{(j,l)} denotes the sum over all subsets {i_1, ..., i_l} of size l of {1, ..., j}. Let A_c (1 ≤ c ≤ n) be the cth orthogonal term in the ANOVA decomposition, where A_1(x_1) = E{U_n(X_1, ..., X_n) | X_1 = x_1} − θ, and

A_c(x_1, \ldots, x_c) = E\{ U_n(X_1, \ldots, X_n) \mid X_1 = x_1, \ldots, X_c = x_c \} - \sum_{j=1}^{c-1} \sum_{(c,j)} A_j(x_{i_1}, \ldots, x_{i_j}) - \theta.

Lemma 1. The relationship between h_j and A_c (1 ≤ j ≤ k; 1 ≤ c ≤ n) can be expressed as

A_c = \binom{n}{k}^{-1} \binom{n-c}{k-c} h_c    for 1 ≤ c ≤ k.

Moreover, for c > k we have Var(A_c) = 0.

Proof. Proof by induction. When c = 1, we have

A_1(x_1) = \binom{n}{k}^{-1} \sum_{1 \le i_1 < \cdots < i_k \le n} E\{ \phi(X_{i_1}, \ldots, X_{i_k}) \mid X_1 = x_1 \} - \theta
= \binom{n}{k}^{-1} \left\{ \binom{n-1}{k-1} \phi_1 + \binom{n-1}{k} \theta \right\} - \theta
= \binom{n}{k}^{-1} \binom{n-1}{k-1} h_1(x_1).

Suppose

A_j = \binom{n}{k}^{-1} \binom{n-j}{k-j} h_j    for 1 ≤ j ≤ c.

We want to show that

A_{c+1} = \binom{n}{k}^{-1} \binom{n-(c+1)}{k-(c+1)} h_{c+1}    for c < k.


Since

E(U_n \mid x_1, \ldots, x_{c+1}) = \binom{n}{k}^{-1} \left\{ \binom{n-(c+1)}{k-1} \sum_{(c+1,1)} \phi_1 + \cdots + \binom{n-(c+1)}{k-(c+1)} \sum_{(c+1,c+1)} \phi_{c+1} + \binom{n-(c+1)}{k} \theta \right\}
= \binom{n}{k}^{-1} \left\{ \sum_{j=1}^{c+1} \binom{n-(c+1)}{k-j} \sum_{(c+1,j)} \phi_j + \binom{n-(c+1)}{k} \theta \right\},

we have

A_{c+1} = E(U_n \mid x_1, \ldots, x_{c+1}) - \sum_{j=1}^{c} \sum_{(c+1,j)} A_j(x_{i_1}, \ldots, x_{i_j}) - \theta
= \binom{n}{k}^{-1} \left\{ \sum_{j=1}^{c+1} \binom{n-(c+1)}{k-j} \sum_{(c+1,j)} \phi_j + \binom{n-(c+1)}{k} \theta \right\} - \sum_{j=1}^{c} \sum_{(c+1,j)} \binom{n}{k}^{-1} \binom{n-j}{k-j} h_j - \theta
= \binom{n}{k}^{-1} \binom{n-(c+1)}{k-(c+1)} \phi_{c+1} + \binom{n}{k}^{-1} \sum_{j=0}^{c} \binom{n-(c+1)}{k-j} \sum_{(c+1,j)} \left\{ \sum_{l=1}^{j} \sum_{(j,l)} h_l + \theta \right\} - \binom{n}{k}^{-1} \sum_{j=1}^{c} \binom{n-j}{k-j} \sum_{(c+1,j)} h_j - \theta.

The coefficient of θ is

\binom{n}{k}^{-1} \left\{ \binom{n-(c+1)}{k} + \sum_{j=1}^{c} \binom{c+1}{j} \binom{n-(c+1)}{k-j} \right\} - 1 = - \binom{n}{k}^{-1} \binom{n-(c+1)}{k-(c+1)}.

The coefficient of h_{c+1-j} for 1 ≤ j ≤ c is

- \binom{n}{k}^{-1} \binom{n-(c+1)}{k-(c+1)}.

Therefore,

A_{c+1} = \binom{n}{k}^{-1} \binom{n-(c+1)}{k-(c+1)} \left\{ \phi_{c+1} - \sum_{j=1}^{c} \sum_{(c+1,j)} h_j - \theta \right\}
= \binom{n}{k}^{-1} \binom{n-(c+1)}{k-(c+1)} h_{c+1}    for 1 ≤ c + 1 ≤ k.

Moreover, based on the explicit expression of the U-statistic variance, it is easy to see that Var(A_c) = 0 for k < c ≤ n. □

According to Lemma 1,

\sigma_{A_c}^2 = \binom{n}{k}^{-2} \binom{n-c}{k-c}^2 \delta_c^2 = \binom{n}{c}^{-2} \binom{k}{c}^2 \delta_c^2    for 1 ≤ c ≤ k,

where σ_{A_c}^2 = Var(A_c) and δ_c^2 = Var(h_c). Theorem 1 in Efron and Stein (1981) expresses the expectation of the jackknife variance estimator in terms of the orthogonal terms in the ANOVA decomposition as follows:

E(\hat V_J) = n \sigma_{A_1}^2 + \frac{n^3}{(n-1)^2} \binom{n-2}{1} \sigma_{A_2}^2 + \frac{n^5}{(n-1)^4} \binom{n-2}{2} \sigma_{A_3}^2 + \cdots.

With the connection between h_j and A_j (1 ≤ j ≤ k), one can compare the exact bias of the proposed linearly extrapolated variance estimator with that of the jackknife variance estimator explicitly.

Theorem 2. The expectation of the conventional jackknife variance estimator (1.1) can be expressed as

E(\hat V_J) = \frac{k^2}{n} \delta_1^2 + \sum_{c=2}^{k} \frac{n^{2c-1}}{(n-1)^{2c-2}} \binom{n-2}{c-1} \binom{n}{c}^{-2} \binom{k}{c}^2 \delta_c^2.

Thus, it is first-order unbiased with positive bias in the second-order term.
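To make the two expansions concrete, the following Python sketch evaluates E(V̂_ex) from the proof of Theorem 1 and E(V̂_J) from Theorem 2 for the sample-variance kernel (k = 2); the inputs δ_1² = 0.5 and δ_2² = 1 are the Hoeffding components of this kernel under the standard normal, used here purely for illustration, and the function names are not from the paper:

from math import comb

def var_u(m, k, delta):
    # exact U-statistic variance: sum_j C(k,j)^2 C(m,j)^{-1} delta_j^2
    return sum(comb(k, j) ** 2 / comb(m, j) * delta[j - 1] for j in range(1, k + 1))

def mean_v_ex(n, m, k, delta):
    # E(V_ex) = (m/n) Var(U_m), as in the proof of Theorem 1
    return m / n * var_u(m, k, delta)

def mean_v_j(n, k, delta):
    # E(V_J) as given in Theorem 2
    total = comb(k, 1) ** 2 / n * delta[0]
    for c in range(2, k + 1):
        total += (n ** (2 * c - 1) / (n - 1) ** (2 * c - 2) * comb(n - 2, c - 1)
                  / comb(n, c) ** 2 * comb(k, c) ** 2 * delta[c - 1])
    return total

delta = [0.5, 1.0]   # delta_1^2, delta_2^2 for the sample-variance kernel, N(0, 1)
n = 10
print(var_u(n, 2, delta))              # true Var(S^2) at n = 10
print(mean_v_ex(n, n // 2, 2, delta))  # E(V_ex) with m = n/2
print(mean_v_j(n, 2, delta))           # E(V_J)

At n = 10 both expectations exceed the true variance 0.2222 only through the second-order term, in line with the first-order unbiasedness established above.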


Table 1
Comparison between V̂_ex and V̂_J. Standard Normal N(0, 1).

Sample size n               Jackknife   V̂_ex (m = n/2)   V̂_ex (m = n/4)
10 (Truth: 0.2222)   Mean   0.2349      0.2346            0.4797
                     SD     0.2612      0.2619            0.4642
                     MSE    0.0684      0.0687            0.2818
30 (Truth: 0.0690)   Mean   0.0692      0.0694            0.0811
                     SD     0.0452      0.0459            0.0515
                     MSE    0.0020      0.0021            0.0028
50 (Truth: 0.0408)   Mean   0.0407      0.0405            0.0445
                     SD     0.0209      0.0207            0.0224
                     MSE    0.0004      0.0004            0.0005
100 (Truth: 0.0202)  Mean   0.0203      0.0203            0.0208
                     SD     0.0075      0.0076            0.0078
                     MSE    0.0001      0.0001            0.0001

Remark 2. From Theorems 1 and 2, both the proposed variance estimator and the jackknife variance estimator are first-order unbiased. The jackknife estimator has slightly smaller bias in the δ_2^2 term. Yet because their second-order biases are both proportional to 1/n^2, the difference between these two variance estimators is negligible for relatively large sample size n.

Remark 3. It is possible to reduce the bias of the variance estimators by extrapolating from two subsample sizes, say n/2 and n/4. This will yield a second-order extrapolated variance estimator. As the second-order extrapolated variance estimator does not have a direct connection to the conventional jackknife variance estimator, we will investigate the second-order extrapolation approach in another paper.

4. Simulation study

To illustrate the practical performance of the proposed linearly extrapolated variance estimator in comparison to the jackknife variance estimator, we conduct the following two simulation studies. In the first example, we consider a simple but practical scenario where the goal is to evaluate the precision of the unbiased sample variance. In the second example, we consider the problem of estimating the variance of the Kullback–Leibler risk of a fitted regression model under the least squares criterion.

4.1. Assessing the precision of the unbiased sample variance

The simulation setting for the first study is as follows: R = 1000 independent samples of size n (n = 10, 30, 50, and 100) are randomly drawn from one of several distributions: the bell-shaped standard normal, the skewed Gamma(5, 2) (a Gamma distribution with shape parameter 5 and scale parameter 2), and the heavy-tailed t(10) (a t distribution with 10 degrees of freedom). We consider the parameter of interest θ to be the true variance of the distribution. We let θ̂ = S^2 be the unbiased sample variance, which has a U-statistic representation. Our goal is to estimate the variance of the sample variance S^2. For each given sample of size n, we realize the leave-one-out jackknife variance estimator and the proposed linearly extrapolated variance estimator with m = n/2 or m = n/4. The subsample size m = n/4 is rounded down to the nearest integer whenever n/4 is not a whole number. For the construction of the proposed variance estimator, we set B = 500 in Eq. (2.3). When the underlying population distribution is normal, the theoretical true variance of S^2 can be obtained based on a chi-square distribution. For other distributions, we approximate the truth based on 1,000,000 random samples drawn from the underlying distribution.

We report in Tables 1 and 2 the average of the variance estimates for S^2 (Mean), the standard deviation of the variance estimates (SD), and the mean squared error (MSE). In addition, we include three figures, Figs. 1–3, where each figure shows four sets of side-by-side boxplots displaying the sampling distributions of the different variance estimators at various sample sizes (n = 10, 30, 50, 100). Due to the large variation of the variance estimators at sample size n = 10, the first panel in each figure has a different vertical scale from the other panels. The horizontal dashed line in each panel marks the true value of the variance for a given sample size and distribution.
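A condensed, self-contained Python sketch of this study for the standard normal case at n = 30 (numpy assumed; the settings R = 1000 and B = 500 match those above, and the function names are illustrative):

import numpy as np

def v_ex(x, m, B, rng):
    # extrapolated estimate (2.3) of Var(S^2) with subsample size m
    n = len(x)
    total = 0.0
    for _ in range(B):
        idx = rng.permutation(n)
        d = x[idx[:m]].var(ddof=1) - x[idx[m:2 * m]].var(ddof=1)
        total += d ** 2 / 2
    return m / n * total / B

def v_jack(x):
    # leave-one-out jackknife estimate (1.1) of Var(S^2)
    n = len(x)
    loo = np.array([np.delete(x, i).var(ddof=1) for i in range(n)])
    return (n - 1) / n * np.sum((loo - loo.mean()) ** 2)

rng = np.random.default_rng(0)
n, R, B = 30, 1000, 500
truth = 2 / (n - 1)                      # Var(S^2) for N(0, 1) data
results = {"Jackknife": [], "V_ex (m = n/2)": []}
for _ in range(R):
    x = rng.normal(size=n)
    results["Jackknife"].append(v_jack(x))
    results["V_ex (m = n/2)"].append(v_ex(x, n // 2, B, rng))
for name, vals in results.items():
    vals = np.asarray(vals)
    print(name, vals.mean(), vals.std(), ((vals - truth) ** 2).mean())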
According to Tables 1 and 2, the proposed variance estimator with m = n/2 shows similar performance to the jackknife variance estimator across the three distributions under consideration. The bias of the linearly extrapolated variance estimator decreases as the sample size increases. Tables 1 and 2 also confirm that m = n/2 leads to the smallest possible bias in V̂_ex, as seen in Theorem 1.


Table 2
Comparison between V̂_ex and V̂_J.

Skewed Gamma(5, 2)
Sample size n               Jackknife   V̂_ex (m = n/2)   V̂_ex (m = n/4)
10 (Truth: 0.5345)   Mean   0.5290      0.5288            0.9429
                     SD     1.0983      1.0933            1.6048
                     MSE    1.2062      1.2084            2.7423
30 (Truth: 0.1704)   Mean   0.1612      0.1607            0.1831
                     SD     0.2193      0.2176            0.2415
                     MSE    0.0482      0.0474            0.0585
50 (Truth: 0.1014)   Mean   0.0935      0.0933            0.1007
                     SD     0.0965      0.0974            0.1034
                     MSE    0.0094      0.0096            0.0107
100 (Truth: 0.0503)  Mean   0.0490      0.0491            0.0497
                     SD     0.0382      0.0385            0.0388
                     MSE    0.0015      0.0015            0.0015

Heavy-tailed t(10)
Sample size n               Jackknife   V̂_ex (m = n/2)   V̂_ex (m = n/4)
10 (Truth: 0.5045)   Mean   0.7030      0.7036            1.1874
                     SD     4.7118      4.7740            6.4482
                     MSE    22.2402     22.8307           42.0459
30 (Truth: 0.1596)   Mean   0.1779      0.1775            0.2007
                     SD     0.5346      0.5337            0.5997
                     MSE    0.2861      0.2852            0.3614
50 (Truth: 0.0948)   Mean   0.1049      0.1048            0.1122
                     SD     0.1992      0.1952            0.2038
                     MSE    0.0398      0.0382            0.0418
100 (Truth: 0.0473)  Mean   0.0503      0.0502            0.0511
                     SD     0.0601      0.0604            0.0603
                     MSE    0.0036      0.0037            0.0037

Fig. 1. Side-by-side boxplots of the sampling distributions of different variance estimators when the samples are generated from the Standard Normal. The horizontal dashed line marks the true value of the variance for the given sample size.


Fig. 2. Side-by-side boxplots of the sampling distributions of different variance estimators when the samples are generated from Gamma(5,2). The horizontal dashed line marks the true value of the variance for the given sample size.

Fig. 3. Side-by-side boxplots of the sampling distributions of different variance estimators when the samples are generated from t(10). The horizontal dashed line marks the true value of the variance for the given sample size.

The simulation comparison indicates that in this example the difference in bias between V̂_ex and V̂_J is negligible once the sample size n exceeds 10. In addition, we have observed very similar or even smaller mean squared error values for V̂_ex with m = n/2. Figs. 1–3 indicate that each variance estimator is quite variable, with many outliers, at sample size n = 10. However, as the sample size increases to n = 30 the variability of the three variance estimators becomes quite similar. The extrapolated variance estimator with m = n/4 performs generally worse than the other two methods, especially for small sample size n. When n gets as large as 100, V̂_ex with m = n/4 shows performance similar to the case m = n/2. In practice, when the sample size n is large enough, one could consider using m = n/4 (or m < n/2). Using a smaller subsample size m helps reduce the computational cost of evaluating U_m in (2.3).


4.2. Assessing the variance of a cross-validation score

In this section we discuss a potential application of our proposed variance estimator in the context of cross-validation. Consider a parametric family of models M_β = {m_β} with parameter denoted as β. Let L(y | β̂) be a loss function that measures the closeness between the observed value y and its predicted value based on the fitted model m_β̂. The risk of using model m_β̂ can be written as

\theta = E\{ L(Y \mid \hat\beta(S_{\tilde n})) \},

where β̂(S_ñ) represents the parameter estimated based on a training sample S_ñ of size ñ. Let S_{ñ+1} be a subsample of size ñ + 1. Consider the symmetric kernel function

\phi(S_{\tilde n + 1}) = \frac{1}{\tilde n + 1} \sum_{i=1}^{\tilde n + 1} L(Y_i \mid \hat\beta(S_{\tilde n}^{-i})),    (4.1)

where S_{\tilde n}^{-i} \subset S_{\tilde n + 1} and S_{\tilde n}^{-i} is the data subset of size ñ excluding the ith observation in S_{ñ+1}. The U-statistic form estimate for the true risk θ is

U_n = \binom{n}{\tilde n + 1}^{-1} \sum_{(n, \tilde n + 1)} \phi(S_{\tilde n + 1}).

When the loss function L is the Kullback–Leibler distance, the defined U-statistic is akin to the generalized AIC model selection tool (Lindsay and Liu, 2007; Wang and Lindsay, 2014a). Notice that U_{ñ+1} = φ(S_{ñ+1}). Therefore, the linearly extrapolated variance estimator (2.3) with m = ñ + 1 ≤ n/2 can be realized based on B randomly drawn disjoint subsamples of size ñ + 1. In contrast, the jackknife variance estimator requires n repeated calculations of U_{n−1}, where the realization of each U_{n−1} involves computing a large number of kernel functions φ. In the following we demonstrate the computational advantage of the proposed variance estimator in comparison to the jackknife method in the context of regression analysis.

Consider a linear regression problem, where we regress the response variable Y on five predictor variables X_j (1 ≤ j ≤ 5). Assume the true relationship is

Y_i = 1 + 8 X_1 + 5 X_2 + 3 X_3 + 1 X_4 + 0.1 X_5 + \epsilon_i    (1 ≤ i ≤ n),

where the ε_i's are independent and normally distributed with mean 0 and standard deviation 0.1. Our goal here is to evaluate the cross-validated risk estimate for the least squares model fit based on the Kullback–Leibler distance. We consider the case where the subsample size in V̂_ex is m = n/2. Let S_{n/2} be a data subset of size n/2. Denote the symmetric kernel function of size n/2 based on the Kullback–Leibler distance as

\phi_{KL}(S_{n/2}) = \frac{1}{n/2} \sum_{(x_t, y_t) \in S_{n/2}} \log f_{\hat\beta(S_{n/2-1}^{-t})}(y_t \mid x_t).    (4.2)

Let

U_{KL,n} = \binom{n}{n/2}^{-1} \sum_{(n, n/2)} \phi_{KL}(S_{n/2}),    (4.3)

where x_t = (1, x_{t,1}, ..., x_{t,5})^T is the vector of x-variables for the tth observation, and β = (β_0, ..., β_5)^T is the vector of regression coefficients. Here, f_{β̂(S_{n/2−1}^{−t})} is the estimated density function for the response Y_t, where the model is fitted based on a subsample of size n/2 − 1 without the tth observation in S_{n/2}. With normal random errors the U risk estimate based on the Kullback–Leibler distance is proportional to the L2 risk estimate. This is because the natural logarithm of the normal probability density function is proportional to a squared distance, i.e.

\log f_{\hat\beta(S_{n/2-1}^{-t})}(y_t \mid x_t) \propto -\left( y_t - x_t \hat\beta(S_{n/2-1}^{-t}) \right)^2.

The complete U-statistic defined in (4.3) is expensive to realize. Instead, one can approximate U_{KL,n} in (4.3) using an incomplete U-statistic as follows. Consider B/2 randomly split half samples, denoted as (S_{n/2,b}, \bar S_{n/2,b}) (1 ≤ b ≤ B/2), where \bar S_{n/2,b} is the complementary set of S_{n/2,b}. Consider

U_{KL,n}^B = \frac{1}{B} \sum_{b=1}^{B/2} \left\{ \phi_{KL}(S_{n/2,b}) + \phi_{KL}(\bar S_{n/2,b}) \right\}.    (4.4)

In this case, U_{KL, n/2}(S_{n/2}) = φ_KL(S_{n/2}). Hence, the linearly extrapolated variance estimator V̂_ex (2.3) is easy to realize by random half sampling.
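A sketch of this half-sampling scheme in Python (numpy assumed; the function names are illustrative). As a simplification of the Kullback–Leibler kernel (4.2), the log-density is replaced here by the negative squared prediction error, which it is proportional to under normal errors with the error variance treated as fixed:

import numpy as np

def phi_half(idx, X, y):
    # half-sample kernel in the spirit of (4.2): average leave-one-out score,
    # with -(y_t - x_t beta_hat)^2 standing in for the log-density (normal errors)
    total = 0.0
    for t in idx:
        train = idx[idx != t]                       # S_{n/2}^{-t}
        beta, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
        total -= (y[t] - X[t] @ beta) ** 2
    return total / len(idx)

def half_sample_risk_and_variance(X, y, B=200, seed=0):
    # incomplete U risk estimate in the spirit of (4.4) and V_ex (2.3) with m = n/2
    n = len(y)
    m = n // 2
    rng = np.random.default_rng(seed)
    pairs = []
    for _ in range(B):
        idx = rng.permutation(n)
        pairs.append((phi_half(idx[:m], X, y), phi_half(idx[m:2 * m], X, y)))
    u_hat = float(np.mean(pairs))
    v_ex = (m / n) * float(np.mean([(a - b) ** 2 / 2 for a, b in pairs]))
    return u_hat, v_ex

# toy data using the coefficient values from the model above
# (no standardization of the predictors in this sketch)
rng = np.random.default_rng(1)
n = 60
X = np.column_stack([np.ones(n), rng.uniform(size=(n, 5))])
beta_true = np.array([1, 8, 5, 3, 1, 0.1])
y = X @ beta_true + rng.normal(scale=0.1, size=n)
print(half_sample_risk_and_variance(X, y))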


Table 3
Comparing performance between V̂_J and V̂_ex in cross-validation.

Estimator   Mean      SD        MSE            Time
Truth       0.00552   –         –              –
V̂_ex       0.00579   0.00228   7.78 × 10^−6   1.28 min
V̂_J        0.00891   0.00476   3.41 × 10^−5   82.40 min

In the following we demonstrate the computational efficiency of our proposed linearly extrapolated variance estimator with m = n/2 in comparison to the jackknife method. In this context, the leave-one-out jackknife method is realized based on delete-one samples of size n − 1. If we denote the delete-one U-statistic as U_{KL,n−1}, then

U_{KL,n-1} = \binom{n-1}{n/2}^{-1} \sum_{(n-1, n/2)} \phi_{KL}(S_{n/2}).

One can randomly draw B half samples out of the size-(n − 1) data subset and use an incomplete U-statistic to approximate U_{KL,n−1}. We denote it as U_{KL,n−1}^B.

To obtain the artificial data sets, we first generate the predictors X_j, each of size n, independently from Uniform(0, 1). Then, we standardize each predictor to have zero mean and unit standard deviation. Next, we simulate random errors of size n from Normal(0, 0.1) for R = 200 times, and obtain the values for the response variable Y based on the assumed true relationship. For each sample of size n, we perform cross-validation with training sample size ñ = n/2 − 1 based on the Kullback–Leibler distance to evaluate the ordinary least squares model fit. In this case, the kernel size of the U-statistic risk estimate is n/2, and split half sampling is applied to compute V̂_ex and U_{KL,n−1}^B. Although there might be more efficient ways to compute the incomplete U-statistic U_{KL,n−1}^B in this context, we stick to the construction through the symmetric kernel function φ_KL (4.2) that is used to compute V̂_ex with m = n/2. To make a fair comparison, we set B = 200 in both V̂_ex (2.3) and U_{KL,n−1}^B.

In the following table we summarize the simulated mean, standard deviation, and mean squared error for each variance estimator. The computational time shows the average time used in minutes to compute each variance estimate. The true variance Var(U_{KL,n}^B) is approximated based on 10,000 simulated data sets. From Table 3 it is clearly seen that the proposed linearly extrapolated variance estimator is much cheaper to compute than the leave-one-out jackknife variance estimator when split half sampling is implemented. In this example both variance estimators show some positive bias, but the linearly extrapolated variance estimator yields smaller bias, standard deviation, and mean squared error. Previous research, such as Hinkley (1977) and Wu (1986), points out that the conventional jackknife variance estimator may perform poorly, with large positive bias, in regression analysis due to the unbalanced structure of the delete-one estimates of the regression coefficients. Our result in Table 3 confirms this weakness of the conventional jackknife method. In comparison, the proposed linearly extrapolated variance estimator gives accurate estimation in the context of regression analysis.

5. Discussion

Previous research (Marron, 1987; Shao, 1993; Hall and Robinson, 2009; Wang and Lindsay, submitted for publication) shows that subsampling helps reduce the variability of a statistical estimator, for example in the context of kernel density estimation and regression analysis. The simulation studies in Section 4 indicate that this may be true in variance estimation as well. It would be interesting to show rigorously whether the proposed variance estimator is less variable than the jackknife variance estimator. This would lead to a discussion of the trade-off between bias and variance of the linearly extrapolated variance estimator. The investigation of the variance property requires further theoretical work, which may be an interesting future project.

Acknowledgments

We would like to thank the anonymous referee and the associate editor for providing valuable comments and suggestions that led to an improved version of the manuscript.

References

Efron, B., Stein, C., 1981. The jackknife estimate of variance. Ann. Statist. 9 (3), 586–596.
Hall, P., Robinson, A.P., 2009. Reducing variability of crossvalidation for smoothing-parameter choice. Biometrika 96 (1), 175–186.
Hinkley, D.V., 1977. Jackknifing in unbalanced situations. Technometrics 19, 285–292.
Hoeffding, W., 1948. A class of statistics with asymptotically normal distribution. Ann. Math. Stat. 19 (3), 293–325.
Lee, A.J., 1990. U-statistics: Theory and Practice. Marcel Dekker.
Lindsay, B.G., Liu, J., 2007. Model assessment tools for a model false world. Unpublished manuscript.
Marron, J.S., 1987. Partitioned cross-validation. Econometric Rev. 6, 271–283.
Quenouille, M.H., 1949. Notes on bias in estimation. Biometrika 43, 353–360.


Shao, J., 1993. Linear model selection by cross-validation. J. Amer. Statist. Assoc. 88 (422), 486–494.
Tukey, J.W., 1958. Bias and confidence in not-quite large samples (abstract). Ann. Math. Stat. 29, 614.
Wang, Q., Lindsay, B.G., 2014a. Variance estimation of a general U-statistic with application to cross-validation. Statist. Sinica 24 (3), 1117–1141.
Wang, Q., Lindsay, B.G., 2014b. Improving cross-validated bandwidth selection using subsampling-extrapolation techniques. Unpublished manuscript (submitted for publication).
Wu, C.F.J., 1986. Jackknife, bootstrap and other resampling methods in regression analysis. Ann. Statist. 14, 1261–1295.