Corrected Mallows criterion for model averaging

Jun Liao, Guohua Zou

PII: S0167-9473(19)30257-9
DOI: https://doi.org/10.1016/j.csda.2019.106902
Reference: COMSTA 106902

To appear in: Computational Statistics and Data Analysis

Received date: 13 May 2019
Revised date: 8 September 2019
Accepted date: 10 December 2019

Please cite this article as: J. Liao and G. Zou, Corrected Mallows criterion for model averaging. Computational Statistics and Data Analysis (2019), doi: https://doi.org/10.1016/j.csda.2019.106902.

© 2019 Published by Elsevier B.V.


Corrected Mallows criterion for model averaging

Jun Liao
School of Statistics, Renmin University of China, Beijing 100872, China
School of Mathematical Sciences, Capital Normal University, Beijing 100048, China

Guohua Zou∗
School of Mathematical Sciences, Capital Normal University, Beijing 100048, China

Abstract

An important problem with the model averaging approach is the choice of weights. The Mallows criterion for choosing weights suggested by Hansen (Econometrica, 1175-1189, 75, 2007) is the first asymptotically optimal criterion and has been used widely. In the current paper, the authors propose a corrected Mallows model averaging (MMAc) method based on the F distribution for small sample sizes. MMAc exhibits the same asymptotic optimality as Mallows model averaging (MMA) in the sense of minimizing the squared errors. The consistency of the MMAc based weights for the optimal weights minimizing the MSE is also studied, and the convergence rate of the new empirical weights is derived. A similar property for MMA and Jackknife model averaging (JMA) by Hansen and Racine (Journal of Econometrics, 38-46, 167, 2012) is established as well. An extensive simulation study shows that MMAc often performs better than MMA and other commonly used model averaging methods, especially for small and moderate sample sizes. The results from the real data analysis also support the proposed method.

Keywords: Asymptotic optimality, Consistency, Mallows criterion, Model averaging, Weight choice

2010 MSC: 62F12, 62H12

∗ Corresponding author. Email address: [email protected] (Guohua Zou)

Preprint submitted to Computational Statistics and Data Analysis

September 8, 2019


1. Introduction

Model averaging, as an alternative to model selection, has received substantial attention in recent years. Model selection aims to choose the best candidate model as the true data generating process, while model averaging combines all candidate models by certain weights. Model averaging therefore avoids the risk of "putting all eggs in one basket". Since the method utilizes useful information from different models, it generally yields more accurate estimates and predictions than model selection. There are two types of model averaging methods: Bayesian model averaging (BMA) and Frequentist model averaging (FMA). For BMA, one first needs to specify prior probabilities for the candidate models and prior distributions for the parameters contained in the models; the model weights are then determined by the posterior probabilities, which can be employed to combine the posterior statistics. Examples of BMA include Madigan & Raftery (1994), Raftery et al. (1997), Fernández et al. (2001), Wright (2008), and Ley & Steel (2009), among others. See Hoeting et al. (1999) and Fragoso et al. (2018) for reviews of the BMA methods. For FMA, the model weights are often selected directly in data-driven ways; see, for instance, Buckland et al. (1997), Hjort & Claeskens (2003), Claeskens & Hjort (2008), and Claeskens (2016). In this paper, we focus on FMA approaches for linear regression models.

The most important problem with model averaging is the choice of weights. Since the optimal weights are usually not available in practice, seeking asymptotically optimal weights becomes a main target for statisticians. The Mallows model averaging (MMA) criterion proposed by Hansen (2007) is an unbiased estimator of the mean squared error (MSE) when the variance of the disturbance is known, and it is asymptotically optimal in the sense of minimizing the possible squared errors under the assumptions of a discrete weight set and nested candidate models (i.e., the mth model, 1 ≤ m ≤ M, uses the first km regressors in the ordered set with 0 < k1 < k2 < · · · < kM). In the more general setting allowing the weights to be continuous and the models to be non-nested (i.e., the regressors are not required to be ordered), Wan et al. (2010) further demonstrated such an asymptotic optimality for MMA. Hansen (2008) applied the MMA method to forecast combination and illustrated that the MMA criterion is an approximately unbiased estimator of the out-of-sample one-step-ahead mean squared forecast error. Hansen (2014) showed that, under some conditions, the asymptotic risk of MMA estimators is globally smaller than that of the unrestricted least-squares estimator in a nested model framework. In another direction, Hansen & Racine (2012) proposed the Jackknife model averaging (JMA) approach allowing for heteroscedasticity and established its asymptotic optimality. Zhang et al. (2013) extended such an asymptotic optimality of JMA to dependent data. Ando & Li (2014), based on the JMA method, suggested a two-step model averaging procedure for high-dimensional linear regression. In addition, Zhang et al. (2015) proposed a Kullback-Leibler distance based model averaging (KLMA) method and also showed its asymptotic optimality. Other examples of asymptotically optimal model averaging methods include Liu & Okui (2013), Cheng & Hansen (2015), Liu (2015), Gao et al. (2016b), Ando & Li (2017), Liao et al. (2019a), and Liao et al. (2019b).

The most well-known asymptotically optimal model averaging criterion is MMA, which consists of two terms. The first term, representing the goodness of fit, is the in-sample sum of squared residuals, and the second is a penalty for the weighted average number of variables across the candidate models. Although MMA is asymptotically optimal in large samples, its small sample performance should be improved in view of the simulation results of Hansen & Racine (2012) and Zhang et al. (2015). We expect that such an improvement for MMA can be attained if we can obtain a more appropriate penalty term for the small sample size case. In this paper, we achieve this goal by constructing an F distribution to approximate the distribution of the random variable related to the penalty term in MMA, on the belief that the F distribution can approximate the true small sample distribution well (Hamilton, 1994). Accordingly, a corrected MMA (MMAc) criterion is proposed. The new criterion retains the asymptotic optimality. Further, for the MMAc based weights, we derive the rate of convergence to the optimal weights minimizing the MSE. Similar conclusions are also provided for the MMA and JMA based weights, which have not been explored before, although such a large sample property does not give us any insight for selecting among the three model averaging methods. Simulation studies and empirical applications show that, for small and moderate sample sizes, the improvement of our MMAc over MMA and other commonly used model averaging methods is quite remarkable.

The remainder of this paper is organized as follows. Section 2 proposes MMAc and establishes its asymptotic theory. Section 3 conducts three simulation experiments. Section 4 applies the proposed method to a physical growth data set and a monthly container port throughput data set. Section 5 concludes. Proofs are contained in the Appendix.

2. The corrected Mallows model averaging criterion

Consider the following data generating process (DGP):
\[
y_i = \mu_i + e_i, \qquad i = 1, \ldots, n,
\tag{1}
\]
where $\mu_i = \sum_{j=1}^{\infty} x_{ij}\beta_j$ with $x_{ij}$ being the $j$th regressor and $\beta_j$ the associated coefficient, $e_i$ is the noise term with mean 0 and variance $\sigma^2$, and $\{e_i\}$ are mutually independent. Here, the DGP (1) is the same as that of Hansen (2007) and includes infinitely many regressors so that all of the candidate models are misspecified, which is realistic. Let $Y = (y_1, \ldots, y_n)'$, $\mu = (\mu_1, \ldots, \mu_n)'$, and $e = (e_1, \ldots, e_n)'$.

To approximate the true process (1), we consider $M$ candidate linear regression models. Under the $m$th ($1 \le m \le M$) candidate model including $k_m$ regressors, $\mu$ is estimated by $\hat\mu_m = P_m Y$, where $P_m = X_m(X_m'X_m)^{-1}X_m'$ with $X_m$ being an $n \times k_m$ regressor matrix. Let the weight vector $w \in \mathcal{W} = \{w \in [0,1]^M : \sum_{m=1}^{M} w_m = 1\}$; then the model averaging estimator of $\mu$ can be written as $\hat\mu(w) = \sum_{m=1}^{M} w_m \hat\mu_m = P(w)Y$, where $P(w) = \sum_{m=1}^{M} w_m P_m$.
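To make the weighting scheme concrete, the following is a minimal numerical sketch (ours, not part of the paper) of how the candidate-model fits $\hat\mu_m = P_m Y$ and the averaged estimator $\hat\mu(w) = \sum_m w_m \hat\mu_m$ can be formed for nested candidate models; the design matrix `X`, the sample size, and the weights below are illustrative assumptions.

```python
import numpy as np

def projection(Xm):
    """Hat matrix P_m = X_m (X_m' X_m)^{-1} X_m' of one candidate model."""
    return Xm @ np.linalg.solve(Xm.T @ Xm, Xm.T)

def averaged_fit(X, Y, w):
    """Model averaging estimator mu_hat(w) = sum_m w_m P_m Y for nested models,
    where the m-th candidate model uses the first m columns of X."""
    M = len(w)
    fits = np.column_stack([projection(X[:, :m + 1]) @ Y for m in range(M)])
    return fits @ np.asarray(w)

# toy illustration (all quantities below are assumptions, not taken from the paper)
rng = np.random.default_rng(0)
n, M = 50, 5
X = rng.normal(size=(n, M))
Y = X[:, :3] @ np.array([1.0, 2.0, 3.0]) + rng.normal(size=n)
w = np.full(M, 1.0 / M)           # equal weights on the M nested models
mu_hat = averaged_fit(X, Y, w)    # n-vector of averaged fitted values
```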


Define the loss function of $\hat\mu(w)$ as $L(w) = \|\mu - \hat\mu(w)\|^2$; then we have
\[
\begin{aligned}
L(w) &= \|Y - \hat\mu(w)\|^2 - 2e'(Y - \hat\mu(w)) + e'e \\
     &= \|Y - \hat\mu(w)\|^2 - 2e'(I - P(w))(\mu + e) + e'e \\
     &= \|Y - \hat\mu(w)\|^2 + 2e'P(w)e - 2e'(I - P(w))\mu - e'e.
\end{aligned}
\tag{2}
\]

Assuming that the $M$th candidate model is the largest one, the MMA criterion takes the form
\[
C^{\mathrm{MMA}}(w) = \|Y - \hat\mu(w)\|^2 + 2\hat\sigma_M^2 \sum_{m=1}^{M} w_m k_m
= w'\hat{e}'\hat{e}w + 2\hat\sigma_M^2\, w'\pi^{\mathrm{MMA}},
\tag{3}
\]
where $\hat{e} = (\hat{e}_1, \ldots, \hat{e}_M)$ with $\hat{e}_m = Y - \hat\mu_m$, $\hat\sigma_M^2 = \hat{e}_M'\hat{e}_M/(n - k_M)$, and $\pi^{\mathrm{MMA}} = (k_1, \ldots, k_M)'$.

Note that from (2), we have
\[
EL(w) = E\left\{\|Y - \hat\mu(w)\|^2 + 2\hat\sigma_M^2\,\frac{e'P(w)e}{\hat\sigma_M^2}\right\} - n\sigma^2
= E\left\{\|Y - \hat\mu(w)\|^2 + 2\hat\sigma_M^2 \sum_{m=1}^{M} w_m \frac{e'P_m e}{\hat\sigma_M^2}\right\} - n\sigma^2.
\tag{4}
\]

Comparing (3) and (4), we find that, ignoring a constant, MMA is obtained by replacing $e'P_m e/\hat\sigma_M^2$ simply by $k_m$, which is just an approximate mean of $e'P_m e/\hat\sigma_M^2$:
\[
E\left(\frac{e'P_m e}{\hat\sigma_M^2}\right) \approx E\left(\frac{e'P_m e}{\sigma^2}\right) = \mathrm{trace}(P_m) = k_m.
\tag{5}
\]

It can be seen that, essentially, this is to approximate the distribution of $e'P_m e/\hat\sigma_M^2$ by a $\chi^2$ distribution when the error term $e$ follows a normal distribution. In other words, when the penalty $k_m$ is used for model selection and averaging, we are actually approximating the true distribution of $e'P_m e/\hat\sigma_M^2$ by $\chi^2(k_m)$. However, in comparison with the $\chi^2$ distribution, the $F$ distribution may provide a better approximation to the true small-sample distribution of a random variable whose limiting distribution is $\chi^2$ (Hamilton, 1994). This is similar to the case where the normal distribution is often replaced by the $t$ distribution in small samples. For this reason, we attempt to construct an $F$ variable to approximate $e'P_m e/\hat\sigma_M^2$. In this regard, a natural choice is $e'P_m e/\tilde\sigma_m^2$, where $\tilde\sigma_m^2 = \|Y - \hat\mu_m\|^2/n = Y'(I_n - P_m)Y/n$. The denominator is set as $n$ so that $\tilde\sigma_m^2$ can be viewed as the maximum likelihood estimator of $\sigma^2$; such a setting has been successfully applied to derive the corrected AIC (AICc) criterion for model selection in Hurvich & Tsai (1989). Thus
\[
E\left(\frac{e'P_m e}{\hat\sigma_M^2}\right) \approx E\left(\frac{e'P_m e}{\tilde\sigma_m^2}\right).
\tag{6}
\]
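Before the simulation illustration below, note that the per-model variance estimates $\tilde\sigma_m^2 = \|Y - \hat\mu_m\|^2/n$ driving this correction are straightforward to compute; the following minimal sketch (ours, not from the paper) does so for nested candidate models, with the data-generating quantities being illustrative assumptions.

```python
import numpy as np

def sigma_tilde_sq(X, Y):
    """Per-model ML variance estimates: sigma_tilde_m^2 = ||Y - P_m Y||^2 / n
    for nested candidate models m = 1, ..., M (first m columns of X)."""
    n, M = X.shape
    out = np.empty(M)
    for m in range(1, M + 1):
        Xm = X[:, :m]
        resid = Y - Xm @ np.linalg.lstsq(Xm, Y, rcond=None)[0]
        out[m - 1] = resid @ resid / n
    return out

# illustrative data (assumed coefficients; sigma^2 = 1 as in Figure 1)
rng = np.random.default_rng(1)
n, M = 20, 8
X = rng.normal(size=(n, M))
Y = X @ (np.sqrt(2.0) * np.arange(1, M + 1) ** (-1.5)) + rng.normal(size=n)
print(sigma_tilde_sq(X, Y))   # one variance estimate per candidate model
```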

We further illustrate our motivation by a simulation study. Using the simulation setting of Hansen (2007) with sample size $n = 20$, the estimates of the error variance under the different candidate models are shown in Figure 1. It is seen from Figure 1 that the variation of $\hat\sigma_M^2$ is obvious, and such variation will be neglected if $\hat\sigma_M^2$ is replaced by $\sigma^2$ directly as in (5). On the other hand, when $\hat\sigma_M^2$ is replaced by $\tilde\sigma_m^2$ as in (6), the variation of $\hat\sigma_M^2$ can be partially taken into consideration. Importantly, such a substitution incorporates the estimation of the error variance under each candidate model, and so different penalty terms are expected to be obtained for different candidate models, beyond the simple penalty $k_m$, the number of regressors.

To derive the distribution of $e'P_m e/\tilde\sigma_m^2$, we consider a special case of $\mu = X\beta$ with $X$ being an $n \times k_0$ regressor matrix and $\beta$ a $k_0 \times 1$ true parameter vector, and assume that the true model is included in the family of candidate models and that $e$ is Gaussian. It is worth pointing out that these assumptions are only for convenience in the derivation of our criterion and are not necessary for either the asymptotic theory or the applications of our method. In fact, similar assumptions are made in much of the literature on the derivation of model selection criteria; see, for example, Hurvich & Tsai (1989), Hurvich et al. (1998), and Naik & Tsai (2001).

Now, it is readily seen that $n\tilde\sigma_m^2 \sim \sigma^2\chi^2(n - k_m)$. Also, $P_m e$ is independent of $Y - \hat\mu_m$, which means that $e'P_m e$ and $n\tilde\sigma_m^2$ are mutually independent.

Figure 1: Estimation results of error variance ($\sigma^2 = 1$). M1–M8 and M0 represent $\tilde\sigma_1^2$–$\tilde\sigma_8^2$ and $\hat\sigma_M^2$, respectively. The simulation setting is the same as that of Subsection 3.2 with $n = 20$, $M = 8$, $R^2 = 0.6$ and $\alpha = 1.0$.

Thus, together with $e'P_m e \sim \sigma^2\chi^2(k_m)$, we immediately have
\[
\frac{n - k_m}{k_m}\,\frac{e'P_m e}{n\tilde\sigma_m^2} \sim F(k_m,\, n - k_m),
\tag{7}
\]
and so
\[
E\left(\frac{e'P_m e}{\tilde\sigma_m^2}\right) = \frac{n k_m}{n - k_m}\cdot\frac{n - k_m}{n - k_m - 2} = \frac{n k_m}{n - k_m - 2}.
\tag{8}
\]
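As a quick numerical sanity check of (7) and (8) (our own sketch, not from the paper), one can simulate Gaussian errors for a fixed design and compare the Monte Carlo mean of $e'P_m e/\tilde\sigma_m^2$ with the exact value $nk_m/(n-k_m-2)$ and with the uncorrected value $k_m$; the design matrix and sizes below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
n, k_m, reps = 20, 4, 20000
X_m = rng.normal(size=(n, k_m))                      # fixed design of the m-th model
P_m = X_m @ np.linalg.solve(X_m.T @ X_m, X_m.T)      # projection onto span(X_m)

vals = np.empty(reps)
for r in range(reps):
    e = rng.normal(size=n)                           # true mean lies in the m-th model
    sigma_tilde_sq = e @ (np.eye(n) - P_m) @ e / n   # ML variance estimate
    vals[r] = e @ P_m @ e / sigma_tilde_sq

print(vals.mean())                  # Monte Carlo mean of e'P_m e / sigma_tilde_m^2
print(n * k_m / (n - k_m - 2))      # F-based penalty of (8): about 5.71
print(k_m)                          # chi-square-based penalty of (5): 4
```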

Combining (4), (6) and (8), and letting $\pi_m^{\mathrm{MMAc}} = nk_m/(n - k_m - 2)$, we propose the following corrected MMA (labeled MMAc) criterion:
\[
C^{\mathrm{MMAc}}(w) = w'\hat{e}'\hat{e}w + 2\hat\sigma_M^2\, w'\pi^{\mathrm{MMAc}},
\]
where $\pi^{\mathrm{MMAc}} = (\pi_1^{\mathrm{MMAc}}, \cdots, \pi_M^{\mathrm{MMAc}})'$. Then the resultant weights are given by
\[
\hat{w}^{\mathrm{MMAc}} = \mathop{\mathrm{argmin}}_{w \in \mathcal{W}} C^{\mathrm{MMAc}}(w).
\tag{9}
\]
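The weights in (9), and likewise the MMA weights, solve a quadratic program over the simplex $\mathcal{W}$. The following is a minimal sketch of one way to compute the criterion and obtain the weights numerically with `scipy.optimize.minimize`; it is our illustration rather than the authors' code, and the inputs `resid` (the $n \times M$ matrix $\hat e$), `k`, `sigma2_M`, and `n` are assumed to be available from the fitted candidate models.

```python
import numpy as np
from scipy.optimize import minimize

def mmac_weights(resid, k, sigma2_M, n, corrected=True):
    """Minimize w' e_hat' e_hat w + 2*sigma2_M * w'pi over the weight simplex.

    resid     : (n, M) matrix whose m-th column is Y - mu_hat_m
    k         : length-M array of candidate model sizes k_m
    corrected : True -> MMAc penalty n*k_m/(n - k_m - 2); False -> MMA penalty k_m
    """
    A = resid.T @ resid                                   # e_hat' e_hat
    k = np.asarray(k, dtype=float)
    pi = n * k / (n - k - 2.0) if corrected else k
    M = len(k)

    def crit(w):
        return w @ A @ w + 2.0 * sigma2_M * (w @ pi)

    cons = ({"type": "eq", "fun": lambda w: np.sum(w) - 1.0},)
    bounds = [(0.0, 1.0)] * M
    w0 = np.full(M, 1.0 / M)
    res = minimize(crit, w0, method="SLSQP", bounds=bounds, constraints=cons)
    return res.x

# usage sketch: w_mma  = mmac_weights(resid, k, sigma2_M, n, corrected=False)
#               w_mmac = mmac_weights(resid, k, sigma2_M, n, corrected=True)
```

Because the objective is a convex quadratic in $w$, any standard quadratic-programming solver could be substituted for the general-purpose SLSQP routine used in this sketch.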

It can be seen that, compared with MMA, MMAc imposes a heavier penalty on the number of variables in the candidate models. A criterion with a similar feature is the Kullback-Leibler distance based model averaging (KLMA) criterion proposed by Zhang et al. (2015). KLMA has the form
\[
C^{\mathrm{KLMA}}(w) = w'\hat{e}'\hat{e}w + 2\,\frac{n - k_M}{n - k_M - 2}\,\hat\sigma_M^2\, w'\pi^{\mathrm{MMA}},
\tag{10}
\]
and is numerically shown to perform better than MMA for small sample sizes. Note that the difference between KLMA and MMA is the quantity $(n - k_M)/(n - k_M - 2)$, while the difference between MMAc and MMA is the factor $n/(n - k_m - 2)$ in the penalty for the $m$th model. Although the modified penalties bear some similarity, the distinction is marked: $(n - k_M)/(n - k_M - 2)$ is the same for every candidate model, whereas $n/(n - k_m - 2)$ affects different candidate models differently and becomes larger as the dimension of the candidate model increases. In other words, the proposed MMAc method has the feature that the larger the candidate model, the heavier the extra penalty. Therefore, we expect that MMAc would bring more improvement over MMA than KLMA does.


In the following, we consider the large sample properties of our proposed method. All limiting processes discussed here are with respect to n → ∞. It is

of

150

well-known that MMA is asymptotically optimal in the sense of achieving the lowest possible squared errors (Hansen, 2007; Wan et al., 2010). Similarly, we

p ro

can establish the same asymptotic optimality for MMAc.

Define the risk function R(w) = E{L(w)} = Ekµ − µ ˆ (w)k2 , and let ξn = 155

inf R(w). In the following, C denotes a generic constant. We need the follow-

w∈W

ing regularity conditions for Theorem 1.

−2G Condition (C.1). E(e4G i ) ≤ C < ∞, and M ξn

PM

m=1 (R(wm0 ))

G

→ 0 for

Pr e-

some fixed integer 1 ≤ G < ∞, where wm0 is an M × 1 vector with the mth element being one and the others zeros. 160

Condition (C.2). µ0 µ/n = O(1).

2 Condition (C.3). kM /n ≤ C < ∞.

Conditions (C.1) - (C.3) are the conditions for Theorem 2 in Wan et al. (2010) and frequently used in literature (Liu & Okui, 2013;Zhang et al., 2015).

al

Theorem 1. Under Conditions (C.1) - (C.3), L(w ˆ MMAc )/ inf L(w) = 1 + op (1).

165

w∈W

urn

Proof. See the Appendix.

Theorem 1 shows that the squared loss of the model averaging estimator µ b(w ˆ MMAc ) with weights determined by MMAc is asymptotically equivalent to that of the infeasible best possible model averaging estimator. 170

Now, we investigate a more intuitive property of the MMAc based weights—

Jo

consistency. Similar theoretical results will be established as well for MMA and ii JMA. Recall that ebm = Y − µ ˆm , and let Pm be the ith diagonal element of Pm ,

Dm be the n × n diagonal matrix with the ith diagonal element equalling to

175

ii −1 (1 − Pm ) , eem = Dm ebm , and ee = (e e1 , · · · , eeM ), then the JMA criterion has the

form

C JMA (w) = w0 ee0 eew. 9

(11)

Journal Pre-proof

Denote the optimal weight w0 = argminw∈W R(w). Let λmin (B) and λmax (B) =

of

kBk be the minimum and maximum singular values of a general real matrix B, respectively. Denote Ω1 = (µ − µ ˆ1 , ..., µ − µ ˆM ), Λ1 = (ˆ µ1 , ..., µ ˆM ), Ω = Ω01 Ω1 , and Λ = Λ01 Λ1 . Theorem 2 below gives the rates of the MMA, MMAc and JMA

based weights tending to the infeasible optimal weight vector w0 . We need the

p ro

180

following regularity conditions for this theorem.

Condition (C.4). There are two positive constants κ1 and κ2 , such that 0 < κ1 < λmin (Λ/n) ≤ λmax (Λ/n) < κ2 < ∞, in probability tending to 1.

uniformly in m.

Pr e-

185

 0 √ 1/2 0 e/ nk = Op (kM ) Xm /n)−1 = O(1), and kXm Condition (C.5). λmax (Xm Condition (C.6). kM /n1−2δ = O(1) and M kM /(n2δ ξn ) = o(1), where δ is a positive constant.

ii , Condition (C.7). p∗ = O(kM /n), where p∗ = max{1≤m≤M } max{1≤i≤n} Pm 0 0 ii . Xm )−1 Xm is the ith diagonal element of Pm = Xm (Xm and Pm

Condition (C.8). λmax (Ω/n) = Op (1).

al

190

Condition (C.9). ξn /n ≤ C < ∞.

urn

Condition (C.4) is common and it requires that both the minimum and maximum singular values of Λ are bounded away from zero and infinity. Condition (C.5) is similar to Condition (C.1) of Zhang (2015), and is a high level condition 195

that can be proved by original conditions. The first part of Condition (C.6) is implied by Condition (C.3) for sufficient small δ. The second part of Condition

Jo

(C.6) implies that the number of candidate models M can increase with n but at a rate with restriction. For instance, when the candidate models are nested, if ξn is of order O(n1−α ) for some α ≥ 0, then M can be o(n

200

1−α 2 +δ

). Condition

(C.7) is common in the literature on model selection and averaging (Li, 1987; Zhang et al., 2013). Under Condition (C.8), the maximum singular value of Ω/n is assumed to be bounded in probability. Condition (C.9), which puts a 10

Journal Pre-proof

constraint on the increasing rate of ξn , is rather mild and commonly used in

205

of

literature (Hansen & Racine, 2012).

Theorem 2. If w0 is an interior point of W and Conditions (C.2), (C.4),

p ro

(C.5), and (C.6) are satisfied, then there exist local minimizers w ˆ MMA and w ˆ MMAc of C MMA (w) and C MMAc (w), respectively, such that   kw ˆ MMA − w0 k = Op ξn1/2 n−1/2+δ , and

Pr e-

  kw ˆ MMAc − w0 k = Op ξn1/2 n−1/2+δ ,

(12)

(13)

where δ is a positive constant given in Condition (C.6). 210

Further, if Conditions (C.7), (C.8), and (C.9) are satisfied, then there exists a local minimizer w ˆ JMA of C JMA (w) such that

  kw ˆ JMA − w0 k = Op ξn1/2 n−1/2+δ .

al

Proof. See the Appendix.

(14)

Remark 1. It is seen from Theorem 2 that the convergence rate is determined by the sample size n and the minimized squared risk ξn . Given the speed of n → ∞, the lower the speed of ξn → ∞, the faster the speed of w ˆ → w 0 (w ˆ can be

urn

215

w ˆ MMA , w ˆ MMAc , or w ˆ JMA ). For instance, when ξn ≤ λ < ∞ with λ indicating a  0 constant, kw−w ˆ k = Op n−1/2+δ . In addition, as long as ξn = o(n1−2δ ) which is similar to a case considered in Hansen & Racine (2012), the convergence of w ˆ is valid, i.e., kw ˆ − w0 k = op (1). Remark 2. To establish the asymptotic optimality like Theorem 1, we need to

Jo

220

assume that ξn → ∞, which is also a necessary condition used in Hansen (2007), Wan et al. (2010), and Hansen & Racine (2012). This condition is reasonable in the situation that all candidate models are misspecified. In Theorem 2, we do not require this condition, which strengthens theoretical basis for applications of

225

the optimal model averaging methods. 11

Journal Pre-proof

Theorems 1 and 2 indicate that the proposed MMAc method has the same

of

large sample properties as MMA and JMA. In the next two sections, we will study the finite sample behavior of MMAc by simulation trials and real data

230

p ro

analyses, respectively.

3. Simulations

In this section, we compare the finite sample performance of PMA, MMA, JMA, KLMA, and our proposed MMAc for different cases. Here PMA is the

235

Pr e-

predictive model averaging criterion proposed by Xie (2015), which takes the   PM n+k(w) form kY − µ ˆ (w)k2 n−k(w) with k(w) = m=1 wm km . 3.1. Simulation 1 for the case with the fixed number of covariates This simulation setting is used in Hurvich & Tsai (1989) and Zhang et al. (2015). The details can be given below. Let β = (1, 2, 3, 0, 0, 0, 0)0 , X = (X(1) , . . . , X(7) ) with X(j) ∼ Normal(0, In ) for j = 1, . . . , 7, µ = Xβ, and e ∼ Normal(0, σ 2 In ) and is independent of X. Consider seven nested candi240

date models with Xm = (X(1) , . . . , X(m) ) for m = 1, . . . , 7.

al

Define R2 = var(µi )/var(yi ) = 14/(14 + σ 2 ) and then vary σ 2 so that R2 varies on a grid between 0.1 and 0.9. To evaluate the model averaging estimators

urn

based on MMA, JMA, KLMA and MMAc, we compute the risk Ekµ − µ ˆ (w)k2 approximated by the average across 1000 simulation replications. The sample 245

size n is set to 20, 50, 75, and 100. We normalize the risk by dividing the risk of the infeasible optimal least squares estimator. The simulation results are summarized in Figure 2. It is seen that, for various R2 and n, the proposed method MMAc performs dramatically better than

Jo

the existing four model averaging methods: PMA, MMA, JMA and KLMA,

250

especially for small sample sizes. It is also clear that, as n increases, the differences among the five methods become small, and especially among MMA, JMA and KLMA.

12

Journal Pre-proof

0.2

0.3

0.4

0.5 R2

0.6

0.7

0.8

0.9

0.3

0.4

0.5 R2

0.3

0.4

0.5 R2

0.6

0.7

0.8

0.9

0.7

0.8

0.9

PMA JMA KLMA MMA MMAc

Pr e-

0.2

0.2

n=100

PMA JMA KLMA MMA MMAc

0.1

0.1

Risk 1.1 1.2 1.3 1.4 1.5 1.6 1.7

Risk 1.1 1.2 1.3 1.4 1.5 1.6 1.7

n=75

p ro

0.1

PMA JMA KLMA MMA MMAc

of

n=50

Risk 1.1 1.2 1.3 1.4 1.5 1.6 1.7

Risk 1.1 1.2 1.3 1.4 1.5 1.6 1.7

n=20 PMA JMA KLMA MMA MMAc

0.6

0.7

0.8

0.9

0.1

0.2

0.3

0.4

0.5 R2

0.6

Figure 2: Results for Simulation 1. (Panels: n = 20, 50, 75, 100; risk plotted against R²; methods: PMA, JMA, KLMA, MMA, MMAc.)

Figure 3: Results for Simulation 2 with α = 0.5. (Panels: n = 20, 50, 75, 100; risk plotted against R²; methods: PMA, JMA, KLMA, MMA, MMAc.)


Figure 4: Results for Simulation 2 with α = 1.0. (Panels: n = 20, 50, 75, 100; risk plotted against R²; methods: PMA, JMA, KLMA, MMA, MMAc.)

Figure 5: Results for Simulation 2 with α = 1.5. (Panels: n = 20, 50, 75, 100; risk plotted against R²; methods: PMA, JMA, KLMA, MMA, MMAc.)


3.2. Simulation 2 for the case with an increasing number of covariates

yi = µi + ei =

500 X

xij βj + ei ,

of

This simulation setting is the same as that in Hansen (2007). Let i = 1, ..., n,

p ro

j=1

where xij = 1 for j = 1, xij ∼ Normal(0, 1) for j > 1, and {xij } are mutually 255

independent; ei ∼ Normal(0, 1), and {ei } are mutually independent and also in√ dependent of {xij }; and βj = c 2αj −α−0.5 with α varying in {0.5, 1.0, 1.5} and

c being determined by R2 = c2 /(1 + c2 ) ∈ [0.1, 0.9]. Following Hansen (2007),

Pr e-

we assume that there are M , the largest integer no greater than 3n1/3 , nested candidate models with the mth model including the first m regressors. The sam260

ple size is set to be the same as that in Simulation 1, i.e., n ∈ {20, 50, 75, 100}. Also, the performance of the five model averaging methods is measured by the risk used in Simulation 1 and 1000 simulation replications are run. The results are displayed in Figures 3 - 5. It can be seen that, for small α (α = 0.5), MMAc is superior to PMA, MMA, JMA and KLMA when R2 is

265

small, but inferior to them when R2 is moderate and large (Figure 3). With the

al

increase of α, MMAc becomes better. In fact, when α is moderate (α = 1.0), MMAc is superior to the other four methods for large ranges of R2 (Figure 4); and when α is large (α = 1.5), MMAc almost dominates PMA, MMA, JMA

270

urn

and KLMA (Figure 5).

Recall that relative to MMA, MMAc imposes a heavier penalty on the number of variables in the candidate models and when the dimension of the candidate model increases, this penalty becomes large. Returning to this simulation, we observe that, when α = 0.5, the coefficients decline so slowly that they are not

Jo

absolutely summable. In this case, the true model would contain many impor275

tant variables and so it may not be appropriate to put a heavy penalty on the big candidate model; in other words, although a big model leads to the high variance, it is more possible for a small model to cause the larger bias. Hence it is not strange that MMAc can be outperformed by the other three methods for quite small α. However, it is not common for the coefficients not to be ab15

Journal Pre-proof

280

solutely summable in literature. For instance, for the autoregressive or moving

of

average processes, the absolute summability of the coefficients is a commonly required condition, and many important theories on the time series are established under such a condition. Therefore, MMAc deserves to be recommended

285

p ro

for applications.

3.3. Simulation 3 for the dependent data case

This simulation setting is used in Zhang et al. (2013) for studying the performance of JMA in dependent data cases. Assume the ARMA(1,1) process:

Pr e-

yt = φyt−1 + et + 0.5et−1 , where et ∼ Normal(0, σ 2 ), and φ takes the values such that R2 = (var(yt ) − var(et ))/var(yt ) varies between 0.1 and 0.9. Since

290

this ARMA(1,1) process has an AR(∞) representation, to approximate the ARMA(1,1) process, like Zhang et al. (2013), the following M + 1 (M is the largest integer no greater than 3n1/3 ) candidate models are used in this simulation: Pm yt = b + et and yt = b + j=1 yt−j θj for 1 ≤ m ≤ M , where b is the intercept,

and θj is the coefficient of yt−j . Let n ∈ {15, 25, 50, 75} and σ 2 ∈ {1, 4}. To ap295

praise the prediction abilities of the five model averaging methods, we compute

al

the out-of-sample one-step-ahead mean squared prediction error of the model averaging forecast yˆn+1 (w): E(yn+1 − yˆn+1 (w))2 , which is approximated by the average based on 5000 simulation trials. For convenient comparison, the risks

300

urn

reported in Figures 6-7 are divided by σ 2 . It is seen from Figures 6-7 that, regardless of σ 2 , for small sample size cases with n = 15 and 25, MMAc dominates its competitors, i.e., PMA, MMA, JMA and KLMA, for all levels of R2 . In these two cases, KLMA is superior to MMA although only a small improvement can be made, and is comparable with

Jo

JMA. In addition, JMA performs a little better than MMA when n = 15 and 305

similarly to MMA when n = 25, which is consistent with the findings in Zhang et al. (2013).

When the sample size increases to n = 50 or 75, the other four methods are

still dominated by MMAc, but the difference among the five model averaging methods becomes very minor. 16


0.9

Figure 6: Results for Simulation 3 with σ 2 = 1.

In addition, when the moving average coefficient in the ARMA(1,1) process takes other values such as 0.2 and 0.7, the simulation results (not reported here) still support the proposed method MMAc.

4. Empirical application

4.1. Empirical example 1: Physical growth study

The data set considered here is from the Berkeley Guidance Study (BGS) (Tuddenham & Snyder, 1954) and has been analyzed by Ye et al. (2018). The observations are measures of physical growth indexes for 66 registered newborn boys. It is of interest to study the relationship between age 18 height (HT18) and the other physical growth indexes. Hence the response variable is set as HT18, and, following Ye et al. (2018), the possible covariates include weights at ages two (WT2) and nine (WT9), heights at ages two (HT2) and nine (HT9), age nine leg circumference (LG9), and age 18 strength (ST18). We present the scatterplot matrix of the variables in Figure 8. From Figure 8, it
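As a small illustration of the candidate-model set used here (our sketch, not the authors' code), the 2⁶ = 64 candidate models correspond to all subsets of the six covariates, each fitted with an intercept:

```python
from itertools import combinations

covariates = ["WT2", "WT9", "HT2", "HT9", "LG9", "ST18"]
candidate_models = [list(s) for r in range(len(covariates) + 1)
                    for s in combinations(covariates, r)]
print(len(candidate_models))   # 64 candidate models, including the intercept-only model
```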


Figure 7: Results for Simulation 3 with σ 2 = 4.


Figure 8: Scatterplot matrix of the variables for the empirical example 1.


is seen that the correlation between HT18 and HT9 is very strong, which accords with the conclusions in the last paragraph of this subsection. Define

of

325

HT 18i as the ith observation of HT18 for i = 1, ..., n (the training sample size), and the similar definitions are used for the other variables. The largest model

p ro

is given by

HT 18i = β1 + β2 W T 2i + β3 W T 9i + β4 HT 2i + β5 HT 9i + β6 LG9i + β7 ST 18i + ei , where βj , j = 1, ..., 7, are the associated coefficients, and ei is the disturbance. Totally 26 (= 64) candidate models are considered. The scaled mean squared

Pr e-

prediction error (MSPE), based on the predictions for the test sample with size 66 − n (n ∈ {15, 25, 50, 60}), is given by

66−n 2 X  1 HT 18i − HTˆ 18i , 2 σ ˆM (66 − n) i=1 2 , given in (3), is the estimator of var(ei ) based on the full sample, where σ ˆM 330

and HTˆ 18i is the model averaging forecast of HT 18i . We replicate 1000 times to obtain the mean and variance of MSPEs, which are displayed in Table 1. It is seen from Table 1 that, among the four model averaging methods, M-

al

MAc performs the best for all of the cases, especially for the small and moderate sample sizes, in terms of mean and variance. We also observe that MMA and 335

KLMA have a very close performance, and both are often better than PMA and

urn

JMA.

Now define Models 1-5 as the candidate models with covariates {HT9}, {WT9, HT9}, {HT9, LG9}, {WT2, HT2, HT9, LG9}, and {WT2, HT2, HT9, LG9, ST18}, respectively. We record the weights obtained by the five model av340

eraging methods in each replication, and then calculate the mean of the weights

Jo

based on 1000 replications for each method. We list only the models to which the weights no small than 0.1 are assigned. The results are displayed in Table 2. We find from the table that all the listed models contain HT9, which implies that HT9 has the most important effect on HT18. This conclusion is coinci-

345

dent with that drawn by Ye et al. (2018). In addition, the simulation results in Ye et al. (2018) suggest that the top two important covariates for HT18 are 19

Journal Pre-proof

Table 1: Mean and variance of MSPEs for Example 1

MMA

MMAc

n = 15

1.797

1.741

1.641

1.700

1.571

n = 25

1.346

1.350

1.323

1.330

1.309

n = 50

1.179

1.178

1.175

1.176

1.171

n = 60

1.184

1.179

1.182

1.183

1.180

n = 15

0.867

0.622

0.510

0.636

0.362

n = 25

0.104

0.104

0.090

0.094

0.080

n = 50

0.114

0.115

0.113

0.113

0.113

n = 60

0.342

0.342

0.338

0.339

0.334

of

KLMA

Pr e-

Variance

JMA

p ro

Mean

PMA

HT9 and LG9. From our data analysis, we see that, in comparison with the other three model averaging methods, MMAc exactly puts more weights on the smallest model including HT9 and LG9 (i.e., Model 3) for most cases. This is 350

desirable since it is more possible that a larger model leads to a larger variance

al

for small sample scenarios.

4.2. Empirical example 2: Container port throughput data of Hong Kong The data are the container throughputs (unit: one hundred thousand twen-

355

urn

tyfoot equivalent unit (TEU)) of Hong Kong (HK) from January 2001 to December 2015 and have been studied by Gao et al. (2016a), which are presented in Figure 9. It is seen from Figure 9 that the data have the seasonality and trend. To this end, the data are preprocessed by twelfth log difference to remove the seasonality and then first difference to remove the trend, which implies that the

Jo

available sample size is n0 = 180 − 12 − 1 = 167. The candidate models are the

360

same as those used in Simulation 3, except that n is now the training sample size. Let n1 = n0 − (n + M + 1) be the size of the test sample, where M has been defined in Simulation 3. We do one-step ahead prediction for n1 times with window moving ahead by one step for each time. Let n ∈ {15, 25, 50, 75}. The mean and variance of the mean squared prediction errors (MSPEs) are reported

20

p ro

of

Journal Pre-proof

Table 2: Weighs estimated by model averaging methods

PMA

JMA

1

0.24

0.25

2

0.10

3 n = 25

1 2 3

n = 50

1 2

MMA

MMAc

0.44

0.35

0.52

0.10

0.10

0.12

0.12

0.12

0.20

0.25

0.31

0.27

0.37

0.15

0.10

0.15

0.15

0.15

0.16

0.11

0.17

0.17

0.19

0.10

0.20

0.13

0.12

0.14

0.16

0.15

0.17 0.33

0.14 0.25

0.17

0.29

0.28

4

0.12

0.13

0.10

0.11

5

0.11

0.10

0.19

0.18

0.19

1

urn

n = 60

0.12

al

3

KLMA

Pr e-

n = 15

Model

0.19

0.18

3

0.25

0.16

0.30

0.29

0.37

4

0.20

0.27

0.17

0.18

0.15

5

0.14

0.11

0.12

Jo

2

21

p ro

180 160 01

03

Pr e-

120

140

Throughputs

200

220

of

Journal Pre-proof

05

07

09

11

13

15

Year

Figure 9: Container port throughput data of Hong Kong.

365

in Table 3.

From Table 3, it is seen that, in terms of mean and variance, MMAc performs

al

the best among various model averaging methods for most cases, especially for small sample sizes. These observations are similar to those in simulations.

370

urn

5. Conclusions

In this article, we proposed a corrected Mallows model averaging method— MMAc. Similar to MMA, our MMAc is asymptotically optimal in the sense of minimizing squared errors in large sample sizes. Further, the rate of the MMAc based empirical weight converging to the optimal weight is derived. Similar

Jo

conclusions for MMA and JMA are also obtained. These properties provide

375

a strong support for the three methods under large samples, although could not be used to select them. Both simulation studies and empirical applications indicate that MMAc often performs better than PMA, MMA, JMA and KLMA for small and even moderate sample sizes.

22

Journal Pre-proof

Table 3: Mean and variance of MSPEs for Example 2 (×10−3 )

MMA

MMAc

n = 15

7.505

6.671

6.762

7.042

6.360

n = 25

6.501

6.240

6.307

6.353

6.153

n = 50

5.802

5.759

5.794

5.792

5.748

n = 75

4.317

4.271

4.348

4.339

4.469

n = 15

0.345

0.144

0.151

0.227

0.106

n = 25

0.101

0.093

0.091

0.093

0.085

n = 50

0.089

0.089

0.089

0.090

0.088

n = 75

0.041

0.040

0.042

0.041

0.045

of

KLMA

Pr e-

Variance

JMA

p ro

Mean

PMA

In our current paper, we estimate the variance σ 2 under each candidate mod380

el. In view of Zhao et al. (2016) who considered model averaging estimation on covariance matrix, it also makes sense to estimate the variance σ 2 by averaging its estimators under different models here. Zhang et al. (2016) derived the condition under which the MMA estimator dominates the ordinary least squares

385

al

estimator in the sense of MSE. Whether MMAc has the similar property is an interesting issue. On the other hand, the current paper focuses on the constant error variance case. It is of interest to extend the proposed method to the het-

urn

eroscedastic error case. To this end, an extension of Equation (7) to such a case is needed. In addition, although the dependent data case has been considered in our simulation study and real data analysis, the theoretical properties are 390

only for the i.i.d. case. Extending the theoretical results here to the dependent data case is also meaningful. These remain for our further researches.

Jo

Our idea of deriving weight choice criterion based on the F distribution could be generalized to some other cases. For example, Liao & Tsay (2016) extended MMA to multivariate time series models. It is interesting to design a

395

new criterion in the spirit of MMAc for this situation. Further, the derivation of the convergence rates for weights based on other criteria like the two-step model averaging for the high-dimensional linear models proposed by Ando & Li 23

Journal Pre-proof

(2014) is also possible, along the line of proving Theorem 2, and this warrants

400

of

our future research.

Inference after model averaging is also an important topic. Hjort & Claeskens (2003) provided a local misspecification framework to investigate the asymptotic

p ro

distributions of model averaging estimators; see also Claeskens & Hjort (2008). Under such a framework, Hansen (2014) and Liu (2015) concluded that the asymptotic distribution of MMA is a nonlinear function of normal distribution 405

plus a local parameter for the linear regression models. Recently, without the local misspecification assumption, Zhang & Liu (2019) showed that the asymp-

Pr e-

totic distribution of the MMA estimator is a nonlinear function of the normal distribution with mean zero for the linear regression models with finite parameters, and proposed a simulation-based confidence interval. Noting that the 410

difference of MMAc and MMA vanishes as n → ∞, it is easy to see that the MMAc estimator has the same limiting distribution as the MMA estimator. It is of interest to further consider the MMAc based interval estimation. This is left for future research.

415

al

Acknowledgments

The authors thank the editor, the associate editor and the two referees for

urn

their constructive comments and suggestions that have substantially improved the original manuscript. Zou’s work was partially supported by the National Natural Science Foundation of China (Grant nos. 11971323 and 11529101) and

Jo

the Ministry of Science and Technology of China (Grant no. 2016YFB0502301).

24

Journal Pre-proof

420

Appendix

of

A.1. Proof of Theorem 1 Observe that, under Condition (C.3),

w∈W

p ro

  sup R−1 (w) w0 π MMAc − w0 π MMA

# nk m ≤ sup R (w) − km wm n − km − 2 w∈W m=1  2  (kM + 2)kM −1 kM ≤ ξ =O = O(ξn−1 ) → 0, n − kM − 2 n nξn "

−1

M X

(A.1)

Pr e-

where the last line is ensured by the conditions of Theorem 1. From this and following the proof of Theorem 2 in Wan et al. (2010), we see that Theorem 1 425

holds. The details are omitted for saving space and available from the authors upon request. A.2. Proof of Theorem 2 1/2

ˆ = Denote n = ξn n−1/2+δ . To verify (12) of Theorem 2 for the case w w ˆ MMA , following Fan & Peng (2004) and Chen et al. (2018), it suffices to show that, there exists a constant C0 such that, for the M ×1 vector u = (u1 , ..., uM )0 ,   MMA 0 MMA 0 lim P inf C (w +  u) > C (w ) = 1, (A.2) n 0 kuk=C0 ,(w +n u)∈W

urn

n→∞

al

430

which means that there exists a minimizer w ˆ MMA in the bounded closed domain

Jo

{w0 + n u : kuk ≤ C0 , (w0 + n u) ∈ W} such that kw ˆ MMA − w0 k = Op (n ). We

25

Journal Pre-proof

first notice that

of

C MMA (w0 + n u) − C MMA (w0 )

2 = kY − µ ˆ (w0 + n u)k2 + 2b σM (w0 + n u)0 π MMA − kY − µ ˆ (w0 )k2 0

p ro

2 −2b σM w0 π MMA

 = kµ − µ ˆ (w0 + n u)k2 + 2e0 µ − P (w0 + n u)µ − P (w0 + n u)e  2 +2b σM (w0 + n u)0 π MMA − kµ − µ ˆ (w0 )k2 − 2e0 µ − P (w0 )µ − P (w0 )e 0

2 −2b σM w0 π MMA

Pr e-

= kµ − µ ˆ (w0 + n u)k2 − kµ − µ ˆ (w0 )k2 − 2e0 P (n u)µ − 2e0 P (n u)e 2 0 MMA +2n σ bM uπ 0

= 2n u0 Λu − 2n w0 Ω01 Λ1 u − 2e0 P (n u)µ − 2e0 P (n u)e 2 0 MMA +2n σ bM uπ .

(A.3)

In the following, we show that 2n u0 Λu is the leading term of (A.3). 435

Since λmin (Λ/n) > κ1 in probability tending to 1 under Condition (C.4), we have

al

2n u0 Λu > κ1 n2n kuk2 > 0,

(A.4)

in probability tending to 1.

urn

Noting that EkΩ1 w0 k2 = Ekµ − µ ˆ (w0 )k2 = ξn , we obtain kΩ1 w0 k = Op (ξn1/2 ).

(A.5) 1/2

From Condition (C.4), it is clear that kΛ1 k = λmax (Λ) = Op (n1/2 ). Thus, 0

Jo

|n w0 Ω01 Λ1 u|

440

≤ n kΛ1 kkΩ1 w0 kkuk = Op (n1/2 ξn1/2 n )kuk.

(A.6)

0

1/2

Recalling n = ξn n−1/2+δ , we see that n w0 Ω01 Λ1 u is asymptotically dominated by 2n u0 Λu. Now we are in a position to derive the stochastic orders of the remaining

26

Journal Pre-proof

terms of (A.3). From Condition (C.5), it is seen that



0 0 e0 Xm (Xm Xm )−1 Xm e  0 √ 0 λmax (Xm Xm /n)−1 kXm e/ nk2 = Op (kM )

of

e 0 Pm e =

max

m∈{1,...,M } 445

|e0 Pm µ|

≤ kµk

p ro

uniformly in m. Therefore, by Condition (C.2), max

1/2

m∈{1,...,M }

and then

(e0 Pm e)

Pr em=1

≤ n kuk M ≤ Op (n

Similarly, we have

1/2

max

m∈{1,...,M }

|e0 Pm µ|2

1/2 M 1/2 kM n ) kuk .

al

|e0 P (n u)e| = |n (e0 P1 e, · · · , e0 PM e) u| !1/2 M X 0 2 ≤ n kuk (e Pm e) m=1

urn

 ≤ n kuk M = Op (M

1/2

max

m∈{1,...,M }

kM n ) kuk .

1/2

= Op (n1/2 kM ),

|e0 P (n u)µ| = |n (e0 P1 µ, · · · , e0 PM µ) u| !1/2 M X 0 2 ≤ n kuk |e Pm µ| 

(A.7)

(e0 Pm e)2

1/2 (A.8)

1/2 (A.9)

In addition, Condition (C.2) and Ekek2 = nσ 2 = O(n)

(A.10)

kY k ≤ kµk + kek = Op (n1/2 ),

(A.11)

Jo

imply that

which, together with Condition (C.6), yields that 2 σ bM

= Y 0 (In − PM )Y /(n − kM ) ≤ λmax (In − PM )kY k2 /(n − kM ) = Op (1). 27

(A.12)

Journal Pre-proof

450

Thus, we have

of

2 0 MMA 2 |n σ bM uπ | = n σ bM |(k1 , · · · , kM )u|

p ro

2 ≤ n σ bM k(k1 , · · · , kM )0 kkuk !1/2 M X 2 2 = n σ bM km kuk m=1

= Op (M 1/2 kM n ) kuk .

(A.13)

So from (A.8), (A.9), (A.13), and Condition (C.6), it is seen that e0 P (n u)µ, 2 0 MMA e0 P (n u)e, and n σ bM uπ are all asymptotically dominated by 2n u0 Λu. Hence

Pr e-

(A.2) is true, and thus (12) of Theorem 2 with w ˆ=w ˆ MMA is proved.

For the case w ˆ = w ˆ MMAc , we note from Condition (C.6) that nkM /(n −

455

kM − 2) = O(kM ). This, together with (A.12), implies that   nkM M 1/2 2 0 MMAc n kuk = Op (M 1/2 kM n ) kuk . (A.14) |n σ bM uπ | = Op n − kM − 2

In view of the above process of showing (A.2), it is apparent that (13) of Theorem 2 is true for w ˆ=w ˆ MMAc .

It remains to verify (14) of Theorem 2. Let Γ = (b µ1 − µ e1 , · · · , µ bM − µ eM )

al

460

and recall Ω1 = (µ − µ b1 , · · · , µ − µ bM ). Then we can write kb µ(w) − µ e(w)k2 = 0

w0 Γ0 Γw and (b µ(w) − µ e(w)) (Y − µ b(w)) = w0 Γ0 e + w0 Γ0 Ω1 w. Thus, we decom-

pose C JMA (w) as

urn

C JMA (w)

=

=

=

kY − µ e(w)k2

0

kY − µ b(w)k2 + kb µ(w) − µ e(w)k2 + 2 (b µ(w) − µ e(w)) (Y − µ b(w)) kY − µ b(w)k2 + w0 Γ0 Γw + 2w0 Γ0 e + 2w0 Γ0 Ω1 w.

(A.15)

Jo

As a result,

=

C JMA (w0 + n u) − C JMA (w0 ) kY − µ b(w0 + n u)k2 − kY − µ b(w0 )k2 + 2n u0 Γ0 Γu 0

+2n w0 Γ0 Γu + 2n u0 Γ0 e + 22n u0 Γ0 Ω1 u + 2n u0 Γ0 Ω1 w0 0

+2n w0 Γ0 Ω1 u,

(A.16) 28

Journal Pre-proof

where kY −b µ(w0 +n u)k2 −kY −b µ(w0 )k2 is asymptotically dominated by 2n u0 Λu 465

of

in view of (A.3) and the process of proving (12). It is clear that 2n u0 Γ0 Γu ≥ 0

for any u. Thus, similar to the proof of (12), to verify (14), it is enough to show that the last five terms of (A.16) are asymptotically dominated by 2n u0 Λu(>

p ro

κ1 n2n kuk2 > 0).

Let Dm − I be the n × n diagonal matrix whose ith diagonal element is

ii ii Pm /(1 − Pm ). 470

Then from Li (1987), we obtain the expression µ em = [Dm (Pm −

I) + I]Y . So we have, uniformly in m ∈ {1, · · · , M }, =

k(Dm − I) (I − Pm ) Y k

Pr e-

kb µm − µ em k



kDm − IkkY kλmax (I − Pm )

=

Op (kM n−1/2 ),

(A.17)

where the last equality is because of (A.11), maxm∈{1,...,M } {λmax (Pm )} ≤ 1,

and maxm∈{1,...,M } kDm − Ik = O(kM n−1 ) (by Condition (C.7)). Further, it is seen from (A.17) that

1/2

kΓk ≤ (trace(Γ0 Γ))

al

M X

=

2

m=1

kb µm − µ em k

!1/2

urn

= Op (M 1/2 kM n−1/2 ).

(A.18)

Combining (A.10) and (A.18) leads to |n u0 Γ0 e| ≤ n kekkΓkkuk = Op (M 1/2 kM n )kuk.

Observe from Condition (C.8) that λ2max (Ω1 ) = λmax (Ω01 Ω1 ) = λmax (Ω) = Op (n),

(A.20)

|2n u0 Γ0 Ω1 u| ≤ 2n kΓkλmax (Ω1 ) kuk2 = Op (M 1/2 kM 2n )kuk2 .

(A.21)

Jo

475

(A.19)

and hence,

29

Journal Pre-proof

|n u0 Γ0 Ω1 w0 | ≤

n kΓkkΩ1 w0 kkuk

of

Using (A.5), we have

= Op (M 1/2 kM n−1/2 ξn1/2 n )kuk

p ro

= Op (M 1/2 kM 2n n−δ )kuk. In addition, utilizing (A.17) again, we obtain

2 M

X

0 2 0 kΓw k = wm (b µm − µ em )

m=1

which implies that

m=1

2

0 2 wm kb µm − µ em k = Op (kM n−1 ),

0

Pr e-



M X

|n w0 Γ0 Γu|

(A.22)

(A.23)

≤ n kΓw0 kkΓkkuk

2 = Op (M 1/2 kM n−1 n )kuk

= Op (M 1/2 kM n )kuk

480

by (A.18), and 0

by (A.20).

(A.25)

al

|n w0 Γ0 Ω1 u| ≤ n kΓw0 kλmax (Ω1 ) kuk = Op (kM n )kuk

(A.24)

urn

Finally, Conditions (C.6) and (C.9) ensure that M 1/2 kM n /(n2n )

and

=

−1 M 1/2 kM −1 = (M kM ξn−1 n−2δ )1/2 (kM n−1 )1/2 n n

=

o[(M kM ξn−1 n−2δ )1/2 ],

Jo

M 1/2 kM 2n /(n2n )

(A.26)

1/2

= M 1/2 kM ξn−1/2 n−δ (ξn n−1 )1/2 (kM n−1 n2δ )1/2

= O[(M kM ξn−1 n−2δ )1/2 ].

(A.27)

Therefore, from (A.19), (A.21), (A.22), (A.24), (A.25), and using (A.4), (A.26),

485

(A.27), and Condition (C.6), we see that the last five terms of (A.16) are all asymptotically dominated by 2n u0 Λu. This completes the proof of (14) of

Theorem 2. 30

Journal Pre-proof

References

490

of

Ando, T., & Li, K. C. (2014). A model-averaging approach for high-dimensional regression. Journal of the American Statistical Association, 109 , 254–265.

p ro

Ando, T., & Li, K. C. (2017). A weight-relaxed model averaging approach for high dimensional generalized linear models. Annals of Statistics, 45 , 2654– 2679.

Buckland, S. T., Burnham, K. P., & Augustin, N. H. (1997). Model selection: an integral part of inference. Biometrics, 53 , 603–618.

Pr e-

495

Chen, J., Li, D., Linton, O., & Lu, Z. (2018). Semiparametric ultra-high dimensional model averaging of nonlinear dynamic time series. Journal of the American Statistical Association, 113 , 919–932.

Cheng, X., & Hansen, B. E. (2015). Forecasting with factor-augmented regres500

sion: A frequentist model averaging approach. Journal of Econometrics, 186 , 280–293.

al

Claeskens, G. (2016). Statistical model choice. Annual Review of Statistics and Its Application, 3 , 233–256.

505

urn

Claeskens, G., & Hjort, N. L. (2008). Model Selection and Model Averaging. Cambridge: Cambridge University Press. Fan, J., & Peng, H. (2004). Nonconcave penalized likelihood with a diverging number of parameters. Annals of Statistics, 32 , 928–961. Fern´andez, C., Ley, E., & Steel, M. F. J. (2001). Benchmark priors for Bayesian

Jo

model averaging. Journal of Econometrics, 100 , 381–427.

510

Fragoso, T. M., Bertoli, W., & Louzada, F. (2018). Bayesian model averaging: a systematic review and conceptual classification. International Statistical Review , 86 , 1–28.

31

Journal Pre-proof

Gao, Y., Luo, M., & Zou, G. (2016a). Forecasting with model selection or model

515

metrica A: Transport Science, 12 , 366–384.

of

averaging: a case study for monthly container port throughput. Transport-

Gao, Y., Zhang, X., Wang, S., & Zou, G. (2016b). Model averaging based on

p ro

leave-subject-out cross-validation. Journal of Econometrics, 192 , 139–151. Hamilton, J. (1994). Time Series Analysis. Princeton: Princeton University Press.

Hansen, B. E. (2007). Least squares model averaging. Econometrica, 75 , 1175– 1189.

Pr e-

520

Hansen, B. E. (2008). Least squares forecast averaging. Journal of Econometrics, 146 , 342–350.

Hansen, B. E. (2014). Model averaging, asymptotic risk, and regressor groups. 525

Quantitative Economics, 5 , 495–530.

Hansen, B. E., & Racine, J. (2012). Jackknife model averaging. Journal of

al

Econometrics, 167 , 38–46. Hjort, N. L., & Claeskens, G. (2003). Frequentist model average estimators. Journal of the American Statistical Association, 98 , 879–899. Hoeting, J. A., Madigan, D., Raftery, A. E., & Volinsky, C. T. (1999). Bayesian

urn

530

model averaging: a tutorial. Statistical Science, 14 , 382–417. Hurvich, C. M., Simonoff, J. S., & Tsai, C. L. (1998). Smoothing parameter selection in nonparametric regression using an improved Akaike information

Jo

criterion. Journal of the Royal Statistical Society, Series B , 60 , 271–293. 535

Hurvich, C. M., & Tsai, C. L. (1989). Regression and time series model selection in small samples. Biometrika, 76 , 297–307.

Ley, E., & Steel, M. F. J. (2009). On the effect of prior assumptions in Bayesian model averaging with applications to growth regression. Journal of Applied Econometrics, 24 , 651–674. 32

Journal Pre-proof

540

Li, K. C. (1987). Asymptotic optimality for cp , cl , cross-validation and general-

of

ized cross-validation: Discrete index set. Annals of Statistics, 15 , 958–975. Liao, J., Zong, X., Zhang, X., & Zou, G. (2019a). Model averaging based on leave-subject-out cross-validation for vector autoregressions. Journal of

545

p ro

Econometrics, 209 , 35–60.

Liao, J., Zou, G., & Gao, Y. (2019b). Spatial Mallows model averaging for geostatistical data. Canadian Journal of Statistics, 47 , 336–351. Liao, J.-C., & Tsay, W.-J. (2016). Multivariate least squares forecasting aver-

Pr e-

aging by vector autoregressive models. Working paper.

Liu, C.-A. (2015). Distribution theory of the least squares averaging estimator. 550

Journal of Econometrics, 186 , 142–159.

Liu, Q., & Okui, R. (2013). Heteroskedasticity-robust cp model averaging. The Econometrics Journal , 16 , 463–472.

Madigan, D., & Raftery, A. E. (1994). Model selection and accounting for

555

al

model uncertainty in graphical models using Occam’s window. Journal of the American Statistical Association, 89 , 1535–1546.

urn

Naik, P. A., & Tsai, C. L. (2001). Single-index model selections. Biometrika, 88 , 821–832.

Raftery, A. E., Madigan, D., & Hoeting, J. A. (1997). Bayesian model averaging for linear regression models. Journal of the American Statistical Association, 560

92 , 179–191.

Jo

Tuddenham, R. D., & Snyder, M. M. (1954). Physical growth of california boys and girls from birth to eighteen years. Univ. of Calif. Publications in Child Development, 1 , 183–364.

Wan, A. T. K., Zhang, X., & Zou, G. (2010). Least squares model averaging by

565

Mallows criterion. Journal of Econometrics, 156 , 277–283.

33

Journal Pre-proof

Journal of Econometrics, 146 , 329–341.

of

Wright, J. H. (2008). Bayesian model averaging and exchange rate forecasts.

Xie, T. (2015). Prediction model averaging estimator. Economics Letters, 131 ,

570

p ro

5–8.

Ye, C., Yang, Y., & Yang, Y. (2018). Sparsity oriented importance learning for high-dimensional linear regression. Journal of the American Statistical Association, 113 , 1797–1812.

ters, 130 , 120–123. 575

Pr e-

Zhang, X. (2015). Consistency of model averaging estimators. Economics Let-

Zhang, X., & Liu, C. (2019). Inference after model averaging in linear regression models. Econometric Theory, 35 , 816–841.

Zhang, X., Ullah, A., & Zhao, S. (2016). On the dominance of mallows model averaging estimator over ordinary least squares estimator. Economics Letters, 142 , 69–73.

Zhang, X., Wan, A. T. K., & Zou, G. (2013). Model averaging by jackknife

al

580

criterion in models with dependent data. Journal of Econometrics, 174 , 82–

urn

94.

Zhang, X., Zou, G., & Carroll, R. (2015). Model averaging based on KullbackLeibler distance. Statistica Sinica, 25 , 1583–1598. 585

Zhao, S., Zhang, X., & Gao, Y. (2016). Model averaging with averaging covari-

Jo

ance matrix. Economics Letters, 145 , 214–217.

34