
A Class of Model Averaging Estimators

Shangwei Zhao¹, Aman Ullah² and Xinyu Zhang³,⁴

¹ College of Science, Minzu University of China, Beijing, China
² Department of Economics, University of California, Riverside, CA, USA
³ Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing, China
⁴ ISEM, Capital University of Economics and Business, Beijing, China

SUMMARY: Model averaging aims at a trade-off between efficiency and bias. In this paper, a class of model averaging estimators, the g-class, is introduced, and its dominance condition over the ordinary least squares estimator is established. All theoretical findings are verified by simulations.

KEY WORDS: Finite Sample Size, Mean Squared Error, Model Averaging, Sufficient Condition.

JEL Classification codes: C13, C2.

1 INTRODUCTION

It is well known that estimation based on a “small” model can be more efficient than that based on a “large” model, but the former can lead to substantial biases. Model averaging aims at a trade-off between efficiency and bias. However, most of the existing literature on model averaging focuses on large sample properties. For example, asymptotic optimality is studied in Hansen (2007), Liang et al. (2011), Liu & Okui (2013) and Zhang et al. (2016b); asymptotic distributions are developed in Hjort & Claeskens (2003), Zhang & Liang (2011) and Liu (2015); and asymptotic risk is compared in Hansen (2014). Few contributions have been devoted to the finite sample properties of model averaging. Exceptions are Magnus et al. (2010) and Magnus et al. (2011), where prior information and Bayesian tools are utilized. In the current paper, we study model averaging from a frequentist perspective; a review of Bayesian model averaging can be found in Hoeting et al. (1999).

Recently, Zhang et al. (2016a) studied the dominance of the Mallows model averaging (MMA) estimator (Hansen, 2007) over the ordinary least squares (OLS) estimator in the finite sample situation and found that the dominance holds when the sample size and the number of regressors satisfy a certain condition. The current paper follows that work. Specifically, we provide a class of model averaging estimators, the g-class, by changing the fixed tuning parameter in the Mallows criterion (Hansen, 2007) to a flexible parameter g. We find that the dominance is related to the flexible parameter g. We further propose a modified MMA estimator, which dominates the OLS estimator without the condition needed by the MMA. Furthermore, all theoretical findings are verified by simulations.

The remainder of this paper is organized as follows. Section 2 introduces the model and preliminary results. Section 3 proposes the class of model averaging estimators and establishes its finite sample property. Section 4 verifies the finite sample property by simulations. Section 5 concludes the paper. Technical proofs are contained in the Appendix.

2 MODEL AND PRELIMINARY RESULTS

We follow the notation of Zhang et al. (2016a) as much as possible for readers' convenience. We consider the linear regression model
$$y = X\beta + e, \tag{2.1}$$
where $y = (y_1, \ldots, y_n)^T$ is a response vector, $X = (x_1, \ldots, x_n)^T$ is a covariate matrix, $e = (e_1, \ldots, e_n)^T \sim \mathrm{Normal}(0, \sigma^2 I_n)$ is an error vector, and $I_n$ is the $n \times n$ identity matrix. The matrix $X$ is assumed to have full column rank $p$. Here, we use the normality assumption, which is commonly used for studying finite sample properties, as in James & Stein (1961) and Stein (1981). Under the normality assumption, the OLS estimator of $\beta$ is
$$\widehat{\beta}_{\mathrm{OLS}} = (X^T X)^{-1} X^T y \sim \mathrm{Normal}\big(\beta, \sigma^2 (X^T X)^{-1}\big). \tag{2.2}$$
The estimator of $\sigma^2$ is $\widehat{\sigma}^2 = (n-p)^{-1}\|X\widehat{\beta}_{\mathrm{OLS}} - y\|^2$, where $\|\cdot\|$ is the Euclidean norm.
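To fix ideas, the estimators above can be computed as in the following minimal sketch (Python/NumPy, which is not part of the paper; the function name is ours).

```python
import numpy as np

def ols_fit(X, y):
    """OLS estimator (2.2) and the variance estimator sigma2_hat of model (2.1)."""
    n, p = X.shape
    beta_ols = np.linalg.solve(X.T @ X, X.T @ y)   # (X'X)^{-1} X'y
    resid = y - X @ beta_ols
    sigma2_hat = resid @ resid / (n - p)           # (n-p)^{-1} ||X beta_ols - y||^2
    return beta_ols, sigma2_hat
```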

We first partition the regressors in $X$ into $M$ groups and then combine $M$ nested sub-models of (2.1) as candidate models, where the $m$th candidate model includes the first $m$ groups of $X$, denoted by $X_m$. Here, $M$ is predetermined by a grouping method, which is further discussed in Section 5. Let $k_m$ be the size of the $m$th group and $v_m = \sum_{j=1}^{m} k_j$, so that $v_m$ is the number of columns of $X_m$. Let $\Pi_m = (I_{v_m}, 0_{v_m \times (p - v_m)})$ and $A_m = \Pi_m^T (X_m^T X_m)^{-1} \Pi_m (X^T X)$. Under the $m$th candidate model, the restricted OLS estimator of $\beta$ can be written as $\widehat{\beta}_m = A_m \widehat{\beta}_{\mathrm{OLS}}$. The model averaging estimator of $\beta$ is
$$\widehat{\beta}(w) = \sum_{m=1}^{M} w_m \widehat{\beta}_m,$$
where $w_m$ is the weight corresponding to the $m$th candidate model and $w = (w_1, \ldots, w_M)^T$, belonging to the weight set $\mathcal{H} = \{w \in [0,1]^M : \sum_{m=1}^{M} w_m = 1\}$. Let $v = (v_1, \ldots, v_M)^T$. Hansen (2007) proposed choosing weights by minimizing the Mallows criterion
$$C(w) = \|X\widehat{\beta}(w) - y\|^2 + 2\widehat{\sigma}^2 w^T v, \tag{2.3}$$
which is motivated by unbiasedly estimating the squared estimation risk $E\|X\widehat{\beta}(w) - X\beta\|^2$. Let $\widehat{w} = (\widehat{w}_1, \ldots, \widehat{w}_M)^T = \arg\min_{w \in \mathcal{H}} C(w)$, so the combined estimator $\widehat{\beta}(\widehat{w})$ is the Mallows model averaging (MMA) estimator of $\beta$. Zhang et al. (2016a) present the following result.

Preliminary result: If $(n-p-2)(n-p)^{-1} k_m \geq 4$ for all $m \geq 2$, then
$$E\|X\widehat{\beta}(\widehat{w}) - X\beta\|^2 < E\|X\widehat{\beta}_{\mathrm{OLS}} - X\beta\|^2.$$

This result provides an exact dominance condition for the MMA estimator over the OLS estimator in the MSE sense, namely $(n-p-2)(n-p)^{-1} k_m \geq 4$ for all $m \geq 2$, but this condition depends on $n$ and $p$. Next, we extend the Mallows criterion to a more general version, introduce a class of model averaging estimators, and identify a model averaging estimator whose dominance over the OLS estimator is not affected by $n$ and $p$.
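Before turning to the g-class, the construction of the nested candidate estimators of this section can be sketched as follows, assuming the columns of $X$ are already ordered by group and the group sizes $(k_1, \ldots, k_M)$ are given; function and variable names are our own.

```python
import numpy as np

def nested_candidates(X, y, group_sizes):
    """Restricted OLS estimators beta_m-hat = A_m beta_OLS-hat of the M nested
    candidate models, together with v = (v_1, ..., v_M)."""
    n, p = X.shape
    v = np.cumsum(group_sizes)                    # v_m = k_1 + ... + k_m, with v_M = p
    betas = np.zeros((len(v), p))
    for m, vm in enumerate(v):
        Xm = X[:, :vm]                            # first v_m columns of X
        betas[m, :vm] = np.linalg.solve(Xm.T @ Xm, Xm.T @ y)  # sub-model OLS, zero-padded
    return betas, v
```

Fitting the $m$th sub-model directly and padding the estimate with zeros is numerically equivalent to applying $A_m$ to the full OLS estimator, which is why the sketch avoids forming $A_m$ explicitly.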

3 A g-CLASS OF MODEL AVERAGING ESTIMATORS

In the Mallows criterion (2.3), the first term measures the model fit, while the second term measures the model complexity (through the vector $v$, which collects the numbers of regressors in the candidate models) and serves as a penalty, where the constant 2 can be viewed as a tuning parameter. To be more general, we propose the following weight choice criterion:
$$C(w, g) = \|X\widehat{\beta}(w) - y\|^2 + 2g\widehat{\sigma}^2 w^T v,$$
where the tuning parameter 2 is multiplied by a positive constant $g$. Obviously, $C(w, 1) = C(w)$.
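A minimal sketch of the weight choice under $C(w, g)$, minimizing over the weight set $\mathcal{H}$ with a general-purpose constrained optimizer (the paper does not prescribe a solver; a dedicated quadratic-programming routine could equally be used). Setting g = 1 recovers the MMA weights of Hansen (2007); `betas` and `v` are as in the sketch above.

```python
import numpy as np
from scipy.optimize import minimize

def choose_weights(X, y, betas, v, sigma2_hat, g=1.0):
    """Minimize C(w, g) = ||X beta(w) - y||^2 + 2 g sigma2_hat w'v over the simplex H."""
    M = len(v)
    fitted = X @ betas.T                           # n x M matrix with columns X beta_m-hat

    def criterion(w):
        resid = fitted @ w - y
        return resid @ resid + 2.0 * g * sigma2_hat * (w @ v)

    res = minimize(criterion, np.full(M, 1.0 / M), method="SLSQP",
                   bounds=[(0.0, 1.0)] * M,
                   constraints=[{"type": "eq", "fun": lambda w: w.sum() - 1.0}])
    return res.x                                   # estimated weight vector
```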

Let $\widehat{w}_g = \arg\min_{w \in \mathcal{H}} C(w, g)$ and let $\widehat{\beta}(\widehat{w}_g)$ be the g-class model averaging estimator, which is equal to $\widehat{\beta}(\widehat{w})$ (the MMA estimator) for $g = 1$, i.e., $\widehat{w}_1 = \widehat{w}$. Define
$$q(\widehat{\beta}_{\mathrm{OLS}}, \widehat{\sigma}^2) = I(m_J < M)\big\{2\widehat{\sigma}^2(v_M - v_{m_J}) - \|X\widehat{\beta}_M - X\widehat{\beta}_{m_J}\|^2\big\} + \widehat{\sigma}^4 \sum_{j=1}^{J-1} \frac{g\big[\{2 - g(n-p+2)(n-p)^{-1}\}(v_{m_{j+1}} - v_{m_j}) - 4\big](v_{m_{j+1}} - v_{m_j})}{\|X\widehat{\beta}_{m_{j+1}} - X\widehat{\beta}_{m_j}\|^2},$$
where $\{m_1, \ldots, m_J\}$ denotes the index set of candidate models with positive weights.

Theorem 1. $E\|X\widehat{\beta}(\widehat{w}_g) - X\beta\|^2 = \sigma^2 p - E\{q(\widehat{\beta}_{\mathrm{OLS}}, \widehat{\sigma}^2)\}$.
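For illustration only, $q(\widehat{\beta}_{\mathrm{OLS}}, \widehat{\sigma}^2)$ can be evaluated numerically from a fitted weight vector as in the sketch below; here the index set $\{m_1, \ldots, m_J\}$ is approximated by the models whose estimated weights exceed a small tolerance, which is our own device rather than part of the theorem, and `betas` and `v` are as in the earlier sketches.

```python
import numpy as np

def q_value(X, betas, v, weights, sigma2_hat, g, n, p, tol=1e-8):
    """Evaluate q(beta_OLS-hat, sigma2-hat) defined above for a given weight vector."""
    M = len(v)
    idx = np.flatnonzero(weights > tol)            # m_1 < ... < m_J (0-based indices)
    fitted = X @ betas.T                           # columns X beta_m-hat
    mJ = idx[-1]
    q = 0.0
    if mJ < M - 1:                                 # indicator I(m_J < M)
        diff = fitted[:, M - 1] - fitted[:, mJ]
        q += 2.0 * sigma2_hat * (v[-1] - v[mJ]) - diff @ diff
    for a, b in zip(idx[:-1], idx[1:]):            # consecutive positive-weight models
        d = v[b] - v[a]
        diff = fitted[:, b] - fitted[:, a]
        q += sigma2_hat ** 2 * g * ((2.0 - g * (n - p + 2) / (n - p)) * d - 4.0) * d / (diff @ diff)
    return q
```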

Theorem 1 provides a formula for the MSE of the g-class model averaging estimator. See Appendix A.1 for the proof of Theorem 1. From Theorem 1, we can obtain the following dominance result.

Corollary 1. If $k_m > 2$ and $0 < g \leq 2(n-p)(n-p+2)^{-1}(k_m - 2)k_m^{-1}$ for all $m \geq 2$, then $\widehat{\beta}(\widehat{w}_g)$ dominates $\widehat{\beta}_{\mathrm{OLS}}$.

Corollary 1 presents a sufficient condition on $g$ for the g-class model averaging estimator to outperform the OLS estimator. See Appendix A.2 for the proof of Corollary 1.

When $k_m \geq 4$ for all $m \geq 2$, it is seen that $2(n-p)(n-p+2)^{-1}(k_m - 2)k_m^{-1} = 2(n-p)(n-p+2)^{-1}(1 - 2k_m^{-1}) \geq (n-p)(n-p+2)^{-1}$, so from Corollary 1, we immediately get the following special case for $g$.

Corollary 2. When $g = (n-p)(n-p+2)^{-1}$, if $k_m \geq 4$ for all $m \geq 2$, then $\widehat{\beta}(\widehat{w}_g)$ dominates $\widehat{\beta}_{\mathrm{OLS}}$.

Motivated by Corollary 2, we propose a new model averaging method with weight vector $\widehat{w}_{g=(n-p)(n-p+2)^{-1}}$. This method dominates the OLS under the condition that $k_m \geq 4$ for all $m \geq 2$, which is free from the sample size and the number of regressors. Since it is a modified version of MMA, we term it mMMA.

Recently, Zhang et al. (2015) proposed choosing weights by minimizing the following Kullback-Leibler criterion:
$$KL(w) = \|X\widehat{\beta}(w) - y\|^2 + 2(n-p)(n-p-2)^{-1}\widehat{\sigma}^2 w^T v,$$
so
$$C\big(w, (n-p)(n-p-2)^{-1}\big) = KL(w). \tag{3.4}$$
Define $\widehat{w}_{KL} = \arg\min_{w \in \mathcal{H}} KL(w)$. This method is called Kullback-Leibler model averaging (KLMA). We assume $n - p - 2 > 0$. From (3.4), the KLMA estimator $\widehat{\beta}(\widehat{w}_{KL})$ is a member of the g-class of model averaging estimators with $g = (n-p)(n-p-2)^{-1}$. From Theorem 1, we can obtain the following result.

Corollary 3. $\widehat{\beta}(\widehat{w}_{KL})$ dominates $\widehat{\beta}_{\mathrm{OLS}}$ when $(n-p-6)(n-p-2)^{-1} k_m \geq 4$ for all $m \geq 2$.

See Appendix A.3 for the proof of Corollary 3. Since $n - p - 2 > n - p - 6$, the condition $(n-p-6)(n-p-2)^{-1} k_m \geq 4$ in Corollary 3 requires the lower bound of $k_m$ to be greater than 4, rather than equal to 4 as in the case of the mMMA estimator in Corollary 2.

We have observed above that the special cases $g = 1$, $g = (n-p)(n-p+2)^{-1}$, and $g = (n-p)(n-p-2)^{-1}$ yield the MMA, mMMA, and KLMA estimators, respectively. It would be an interesting topic to find the optimal value of $g$ in the g-class model averaging estimator, that is, the value for which the MSE is minimized. However, this is extremely challenging and beyond the scope of this paper.
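The three members of the g-class discussed above differ only in the value of g, so checking which of them satisfies the sufficient condition of Corollary 1 for a given design is straightforward; the helpers below are a sketch under our own naming.

```python
def g_values(n, p):
    """g for MMA, mMMA and KLMA (assuming n - p - 2 > 0 so the KLMA value is defined)."""
    return {"MMA": 1.0,
            "mMMA": (n - p) / (n - p + 2),
            "KLMA": (n - p) / (n - p - 2)}

def corollary1_holds(n, p, group_sizes, g):
    """Sufficient condition of Corollary 1: k_m > 2 and
    g <= 2(n-p)(n-p+2)^{-1}(k_m - 2)/k_m for all m >= 2."""
    ks = group_sizes[1:]                                   # groups with m >= 2
    bound = min(2 * (n - p) / (n - p + 2) * (k - 2) / k for k in ks)
    return all(k > 2 for k in ks) and 0 < g <= bound

# e.g. n = 30, p = 6, group sizes (1, 5):
#   corollary1_holds(30, 6, (1, 5), g_values(30, 6)["mMMA"])  # True
#   corollary1_holds(30, 6, (1, 5), g_values(30, 6)["KLMA"])  # True
```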

4 SIMULATIONS

In this section, we use simulations to verify the theoretical results of the previous section. Specifically, we expect the following findings:

Finding I. When $(n-p-2)(n-p)^{-1} k_m \geq 4$ and $(n-p-6)(n-p-2)^{-1} k_m \geq 4$ for all $m \geq 2$, the MMA and KLMA always yield smaller MSEs than the OLS.

Finding II. When $(n-p-2)(n-p)^{-1} k_m < 4$ for some $m \geq 2$, the MMA can perform worse than the OLS; when $(n-p-6)(n-p-2)^{-1} k_m < 4$ for some $m \geq 2$, the KLMA can perform worse than the OLS.

Finding III. When $k_m \geq 4$ for all $m \geq 2$, the mMMA always yields smaller MSEs than the OLS.

Findings I-II verify the preliminary result and Corollary 3. Finding III verifies Corollary 2.

The simulation setting is from Hansen (2014); that is,
$$y_i = \beta_0 + \sum_{j=1}^{p-1} \beta_j x_{ji} + e_i, \quad i = 1, \ldots, n,$$
with $e_i \sim \mathrm{Normal}(0, 1)$ $(i = 1, \ldots, n)$, $x_{ji} \sim \mathrm{Normal}(0, 1)$, $\beta_0 = 0$, $\beta_j = c j^{-\alpha}$ $(j = 1, \ldots, p-1)$, and $\alpha \in \{0, 1, 2, 3\}$. The coefficient $c$ is selected to vary the population $R^2 \equiv \mathrm{var}(\beta_0 + \sum_{j=1}^{p-1} \beta_j x_{ji})/\mathrm{var}(y_i)$ over $\{0.1, 0.2, \ldots, 0.9, 0.98\}$. We use the following configurations of $n$, $p$ and $k_m$:

I. $n = 12$, $p = 5$, $k_1 = 1$, $k_2 = 4$;
II. $n = 16$, $p = 9$, $k_1 = 1$, $k_2 = 4$, $k_3 = 4$;
III. $n = 30$, $p = 6$, $k_1 = 1$, $k_2 = 5$;
IV. $n = 35$, $p = 11$, $k_1 = 1$, $k_2 = 5$, $k_3 = 5$.

All MSEs in estimating $\beta = (\beta_0, \ldots, \beta_{p-1})^T$ are calculated using 10,000 replications. The MSEs of the model averaging methods are normalized by that of the OLS estimator, so an MSE below one indicates that the estimator has a smaller MSE than the OLS. Figures 1-4 show the MSEs for $\alpha = 0, 1, 2, 3$, respectively.

In Configurations III-IV, $(n-p-2)(n-p)^{-1}k_2 = 4.583 \geq 4$ and $(n-p-6)(n-p-2)^{-1}k_2 = 4.091 \geq 4$. It is seen from the bottom two panels of Figures 1-4 that the MMA and KLMA always lead to smaller MSEs than the OLS. This is Finding I. In Configurations I-II, $(n-p-2)(n-p)^{-1}k_2 = 3.667 < 4$ and $(n-p-6)(n-p-2)^{-1}k_2 = 3.273 < 4$. It is seen from the top two panels of Figures 1-4 that the MMA and KLMA sometimes lead to larger MSEs than the OLS. This is Finding II. Figures 1-4 show that the mMMA always leads to smaller MSEs than the OLS. This is Finding III.

In addition, we find that for all model averaging methods, the MSEs can be much lower than that of the OLS, especially when $R^2$ is small, i.e., when the residual variance is high. This finding is encouraging in view of the fact that $R^2$ is often small in many cross-sectional models.
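A sketch of the data-generating process used above. Under this design with independent standard normal regressors and errors, the population $R^2$ equals $c^2 S/(c^2 S + 1)$ with $S = \sum_{j=1}^{p-1} j^{-2\alpha}$, which is how $c$ is solved for a target $R^2$ below; the function name and the use of NumPy's default generator are our own.

```python
import numpy as np

def simulate_design(n, p, alpha, R2_target, rng):
    """One sample from y_i = beta_0 + sum_j beta_j x_ji + e_i with beta_0 = 0,
    beta_j = c j^{-alpha}, x_ji ~ N(0,1), e_i ~ N(0,1)."""
    j = np.arange(1, p, dtype=float)                     # j = 1, ..., p-1
    S = np.sum(j ** (-2.0 * alpha))
    c = np.sqrt(R2_target / ((1.0 - R2_target) * S))     # from R^2 = c^2 S / (c^2 S + 1)
    beta = np.concatenate(([0.0], c * j ** (-alpha)))    # (beta_0, ..., beta_{p-1})
    X = np.column_stack([np.ones(n), rng.standard_normal((n, p - 1))])
    y = X @ beta + rng.standard_normal(n)
    return X, y, beta

# e.g. Configuration I with alpha = 1 and R^2 = 0.5:
# X, y, beta = simulate_design(12, 5, 1, 0.5, np.random.default_rng(0))
```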

5 CONCLUDING REMARKS

In this paper, we introduce a g-class of model averaging estimators and obtain its exact dominance condition, which depends on the sample size and the number of regressors. It is shown that a member of this class has an exact dominance condition free from the sample size and the number of regressors. Our theoretical findings are all verified by simulation results. Second, we remark that the model averaging is based on a pre-supposed grouping structure. Existing grouping procedures, such as the octagonal shrinkage and clustering algorithm for regression (Bondell & Reich, 2008), may be utilized in applications. Lastly, our theory is confined to the context of nested models and is built under normally distributed errors. Extensions to non-nested models and more general error distributions warrant future study.

ACKNOWLEDGMENT

Zhao's research was supported by a grant from the Ministry of Education of China (Grant no. 17YJC910011) and the Foundation of Minzu University of China (Grant no. 2017MDYL21). Ullah's research was supported by the Academic Senate Grant at UCR. Zhang's research was supported by the National Natural Science Foundation of China (Grant nos. 71522004 and 11471324).

APPENDICES

A.1 Proof of Theorem 1

By using the same steps as (A.12) and (A.13) in Zhang et al. (2016a), we have that for any constant $c$ and any functions $f_1(\widehat{\beta}_{\mathrm{OLS}}), \ldots, f_{J-1}(\widehat{\beta}_{\mathrm{OLS}})$ of $\widehat{\beta}_{\mathrm{OLS}}$,
$$E\Big\{g\widehat{\sigma}^2 \sum_{j=1}^{J-1}(v_{m_{j+1}} - v_{m_j})\, c\, f_j(\widehat{\beta}_{\mathrm{OLS}})\Big\} = g\sigma^2 E\Big\{\sum_{j=1}^{J-1}(v_{m_{j+1}} - v_{m_j})\, c\, f_j(\widehat{\beta}_{\mathrm{OLS}})\Big\}$$
and
$$E\Big\{g^2\widehat{\sigma}^4 \sum_{j=1}^{J-1}(v_{m_{j+1}} - v_{m_j})\, c\, f_j(\widehat{\beta}_{\mathrm{OLS}})\Big\} = g^2\sigma^4\big\{1 + 2(n-p)^{-1}\big\} E\Big\{\sum_{j=1}^{J-1}(v_{m_{j+1}} - v_{m_j})\, c\, f_j(\widehat{\beta}_{\mathrm{OLS}})\Big\}.$$

In addition,
$$2g(v_{m_{j+1}} - v_{m_j})^2 - 4g(v_{m_{j+1}} - v_{m_j}) - g^2\big\{1 + 2(n-p)^{-1}\big\}(v_{m_{j+1}} - v_{m_j})^2 = g(v_{m_{j+1}} - v_{m_j})\big[\big\{2 - g(n-p+2)(n-p)^{-1}\big\}(v_{m_{j+1}} - v_{m_j}) - 4\big].$$
From the above formulas, the definition of $q(\widehat{\beta}_{\mathrm{OLS}}, \widehat{\sigma}^2)$, and the proof of Theorem 1 in Zhang et al. (2016a), we can obtain the result of Theorem 1.

A.2 Proof of Corollary 1

When $0 < g \leq 2(n-p)(n-p+2)^{-1}(k_m - 2)k_m^{-1}$ for all $m \geq 2$, we have
$$\big\{2 - g(n-p+2)(n-p)^{-1}\big\}(v_{m_{j+1}} - v_{m_j}) - 4 \geq \big\{2 - 2(k_m - 2)k_m^{-1}\big\}(v_{m_{j+1}} - v_{m_j}) - 4 = 4k_m^{-1}(v_{m_{j+1}} - v_{m_j}) - 4 \geq 0,$$
which, together with (A.15)-(A.17) in Zhang et al. (2016a), implies the result of Corollary 1.

A.3 Proof of Corollary 3

Recall that the condition on $g$ in Corollary 1 is $0 < g \leq 2(n-p)(n-p+2)^{-1}(k_m - 2)k_m^{-1}$. When $g = (n-p)(n-p-2)^{-1}$, by $g \leq 2(n-p)(n-p+2)^{-1}(k_m - 2)k_m^{-1}$ we have
$$(n-p)(n-p-2)^{-1} \leq 2(n-p)(n-p+2)^{-1}(k_m - 2)k_m^{-1},$$
which is equivalent to $(n-p-6)(n-p-2)^{-1}k_m \geq 4$ given $n - p - 2 > 0$. Hence, we can obtain Corollary 3 by using Corollary 1.

REFERENCES

Bondell, H. D. & Reich, B. J. (2008). Simultaneous regression shrinkage, variable selection, and supervised clustering of predictors with OSCAR. Biometrics 64, 115–123.

Hansen, B. E. (2007). Least squares model averaging. Econometrica 75, 1175–1189.

Hansen, B. E. (2014). Model averaging, asymptotic risk, and regressor groups. Quantitative Economics 5, 495–530.

Hjort, N. L. & Claeskens, G. (2003). Frequentist model average estimators. Journal of the American Statistical Association 98, 879–899.

Hoeting, J. A., Madigan, D., Raftery, A. E. & Volinsky, C. T. (1999). Bayesian model averaging: A tutorial. Statistical Science 14, 382–417.

James, W. & Stein, C. (1961). Estimation with quadratic loss. Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability 1, 361–380.

Liang, H., Zou, G., Wan, A. T. K. & Zhang, X. (2011). Optimal weight choice for frequentist model average estimators. Journal of the American Statistical Association 106, 1053–1066.

Liu, C.-A. (2015). Distribution theory of the least squares averaging estimator. Journal of Econometrics 186, 142–159.

Liu, Q. & Okui, R. (2013). Heteroskedasticity-robust Cp model averaging. Econometrics Journal 16, 462–473.

Magnus, J., Powell, O. & Prüfer, P. (2010). A comparison of two model averaging techniques with an application to growth empirics. Journal of Econometrics 154, 139–153.

Magnus, J., Wan, A. & Zhang, X. (2011). Weighted average least squares estimation with nonspherical disturbances and an application to the Hong Kong housing market. Computational Statistics & Data Analysis 55, 1331–1341.

Stein, C. (1981). Estimation of the mean of a multivariate normal distribution. Annals of Statistics 9, 1135–1151.

Zhang, X. & Liang, H. (2011). Focused information criterion and model averaging for generalized additive partial linear models. Annals of Statistics 39, 174–200.

Zhang, X., Ullah, A. & Zhao, S. (2016a). On the dominance of Mallows model averaging estimator over ordinary least squares estimator. Economics Letters 142, 69–73.

Zhang, X., Yu, D., Zou, G. & Liang, H. (2016b). Optimal model averaging estimation for generalized linear models and generalized linear mixed-effects models. Journal of the American Statistical Association 111, 1775–1790.

Zhang, X., Zou, G. & Carroll, R. (2015). Model averaging based on Kullback-Leibler distance. Statistica Sinica 25, 1583–1598.


[Figure 1: Simulation result: α = 0. Four panels, Case I (n = 12, p = 5, k_m = 4), Case II (n = 16, p = 9, k_m = 4), Case III (n = 30, p = 6, k_m = 5), Case IV (n = 35, p = 11, k_m = 5), each plotting MSE against R² for MMA, KLMA and mMMA.]

[Figure 2: Simulation result: α = 1. Four panels, Case I (n = 12, p = 5, k_m = 4), Case II (n = 16, p = 9, k_m = 4), Case III (n = 30, p = 6, k_m = 5), Case IV (n = 35, p = 11, k_m = 5), each plotting MSE against R² for MMA, KLMA and mMMA.]

[Figure 3: Simulation result: α = 2. Four panels, Case I (n = 12, p = 5, k_m = 4), Case II (n = 16, p = 9, k_m = 4), Case III (n = 30, p = 6, k_m = 5), Case IV (n = 35, p = 11, k_m = 5), each plotting MSE against R² for MMA, KLMA and mMMA.]

[Figure 4: Simulation result: α = 3. Four panels, Case I (n = 12, p = 5, k_m = 4), Case II (n = 16, p = 9, k_m = 4), Case III (n = 30, p = 6, k_m = 5), Case IV (n = 35, p = 11, k_m = 5), each plotting MSE against R² for MMA, KLMA and mMMA.]

Highlights

• We introduce a class of model averaging estimators, the g-class, which contains some existing model averaging estimators.
• We derive the dominance condition of the g-class model averaging estimator over the ordinary least squares estimator.
• We verify all theoretical findings by simulations.
