Journal Pre-proof Corrected Mallows criterion for model averaging Jun Liao, Guohua Zou
PII: DOI: Reference:
S0167-9473(19)30257-9 https://doi.org/10.1016/j.csda.2019.106902 COMSTA 106902
To appear in:
Computational Statistics and Data Analysis
Received date : 13 May 2019 Revised date : 8 September 2019 Accepted date : 10 December 2019 Please cite this article as: J. Liao and G. Zou, Corrected Mallows criterion for model averaging. Computational Statistics and Data Analysis (2019), doi: https://doi.org/10.1016/j.csda.2019.106902. This is a PDF file of an article that has undergone enhancements after acceptance, such as the addition of a cover page and metadata, and formatting for readability, but it is not yet the definitive version of record. This version will undergo additional copyediting, typesetting and review before it is published in its final form, but we are providing this version to give early visibility of the article. Please note that, during the production process, errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.
© 2019 Published by Elsevier B.V.
Journal Pre-proof *Manuscript Click here to view linked References
Jun Liao
of
Corrected Mallows criterion for model averaging School of Statistics, Renmin University of China, Beijing 100872, China
p ro
School of Mathematical Sciences, Capital Normal University, Beijing 100048, China
Guohua Zou∗
School of Mathematical Sciences, Capital Normal University, Beijing 100048, China
Pr e-
Abstract
An important problem with model averaging approach is the choice of weights. The Mallows criterion for choosing weights suggested by Hansen (Econometrica, 1175-1189, 75, 2007) is the first asymptotically optimal criterion, which has been used widely. In the current paper, the authors propose a corrected Mallows model averaging (MMAc) method based on F distribution in small sample sizes. MMAc exhibits the same asymptotic optimality as Mallows model averaging
al
(MMA) in the sense of minimizing the squared errors. The consistency of the MMAc based weights tending to the optimal weights minimizing MSE is also studied. The authors derive the convergence rate of the new empirical weights.
urn
Similar property for MMA and Jackknife model averaging (JMA) by Hansen and Racine (Journal of Econometrics, 38-46, 167, 2012) is established as well. An extensive simulation study shows that MMAc often performs better than MMA and other commonly used model averaging methods, especially for small and moderate sample size cases. The results from the real data analysis also
Jo
support the proposed method. Keywords: Asymptotic optimality, Consistency, Mallows criterion, Model averaging, Weight choice 2010 MSC: 62F12, 62H12 ∗ Corresponding
author Email address:
[email protected] (Guohua Zou)
Preprint submitted to Computational Statistics and Data Analysis
September 8, 2019
Journal Pre-proof
of
1. Introduction
Model averaging, as an alternative to model selection, has received substan-
p ro
tial attention in recent years. As we know, model selection aims to choose the best candidate model as the true data generating process, while model av5
eraging combines all candidate models by certain weights. Therefore, model averaging avoids the risk due to “putting all eggs in one basket”. Since the method utilizes the useful information from different models, it generally yields
Pr e-
the more accurate estimates and predictions than model selection. There are two types of model averaging methods: Bayesian model averaging (BMA) and 10
Frequentist model averaging (FMA). For BMA, it is needed to assume the prior probabilities for candidate models and prior distributions for parameters contained in the models first, and then the model weights are determined based on the posterior probabilities, which can be employed to combine the posterior statistics. Some examples on BMA include Madigan & Raftery (1994), Raftery
15
et al. (1997), Fern´andez et al. (2001), Wright (2008), Ley & Steel (2009), a-
al
mong others. See Hoeting et al. (1999) and Fragoso et al. (2018) for reviews on the BMA methods. For FMA, the model weights are often selected directly by data-driven ways; see, for instance, Buckland et al. (1997), Hjort & Claesken-
20
urn
s (2003), Claeskens & Hjort (2008), and Claeskens (2016). In this paper, we focus on the FMA approaches with linear regression models. The most important problem with model averaging is the choice of weights. Since the optimal weights are usually not available in practice, seeking asymptotically optimal weights becomes a main target for statisticians. The Mallows model averaging
Jo
(MMA) criterion proposed by Hansen (2007) is the unbiased estimator of the
25
mean squared error (MSE) when the variance of the disturbance is known and asymptotically optimal in the sense of minimizing the possible squared errors under the assumptions of discrete weight set and nested candidate models (i.e., the mth model, 1 ≤ m ≤ M , uses the first km regressors in the ordered set
with 0 < k1 < k2 < · · · < kM ). In the more general setting allowing weights 2
Journal Pre-proof
30
to be continuous and models non-nested (i.e., not require the regressors to be
of
ordered), Wan et al. (2010) further demonstrated such an asymptotic optimality for MMA. Hansen (2008) applied the MMA method to forecast combination issues and illustrated that the MMA criterion is an approximately unbiased esti-
35
p ro
mator of the out-of-sample one-step-ahead mean squared forecast error. Hansen (2014) showed that, under some conditions, the asymptotic risk of MMA estimators is globally smaller than the unrestricted least-squares estimator in a nested model framework. In another direction, Hansen & Racine (2012) proposed the Jackknife model averaging (JMA) approach allowing for heteroscedasticity and
40
Pr e-
established its asymptotic optimality. Zhang et al. (2013) extended such an asymptotic optimality of JMA to the dependent data cases. Ando & Li (2014), based on the JMA method, suggested a two-step model averaging procedure for the high-dimensional linear regression. In addition, Zhang et al. (2015) proposed a Kullback-Leibler distance based model averaging (KLMA) method and also showed its asymptotic optimality. Other examples on asymptotically optimal 45
model averaging methods include Liu & Okui (2013), Cheng & Hansen (2015),
et al. (2019b).
al
Liu (2015), Gao et al. (2016b), Ando & Li (2017), Liao et al. (2019a), and Liao
The most well-known asymptotically optimal model averaging criterion is MMA which consists of two terms. The first term, representing the goodness of fit, is the in-sample squared residuals, and the second is the penalty for the
urn
50
averaging number of variables in candidate models. Although MMA is asymptotically optimal for the large sample case, its small sample performance should be improved in view of the simulation results of Hansen & Racine (2012) and Zhang et al. (2015). We expect that such an improvement for MMA can be attained if we can obtain a more appropriate penalty term for the small sample
Jo
55
size case. In this paper, we achieve this goal by constructing an F distribution to approximate the distribution of the random variable related to the penalty term in MMA on the belief that F distribution can approximate the true small sample distribution well (Hamilton, 1994). Accordingly, a corrected MMA (M-
60
MAc) criterion is proposed. The new criterion remains asymptotic optimality. 3
Journal Pre-proof
Further, for the MMAc based weight, we derive its rate of convergence to the
of
optimal weight minimizing MSE. The similar conclusions are also provided for the MMA and JMA based weights, which have not been explored before, although such a large sample property does not give us any insight for selection among the three model averaging methods. Simulation studies and empirical
p ro
65
applications show that for the small and moderate sample size cases, the improvement of performance of our MMAc over MMA and other commonly used model averaging methods is quite remarkable.
The remainder of this paper is organized as follows. Section 2 proposes MMAc and establishes its asymptotic theory. Section 3 conducts three simulation
Pr e-
70
experiments. Section 4 applies the proposed method to a physical growth data set and a monthly container port throughput data set. Section 5 concludes. Proofs are contained in the Appendix.
2. The corrected Mallows model averaging criterion 75
Consider the following data generating process (DGP):
where µi =
P∞
i = 1, ..., n
al
yi = µi + ei ,
j=1
(1)
xij βj with xij being the jth regressor and βj the associated
urn
coefficient, ei is the noise term with mean 0 and variance σ 2 , and {ei } are mutually independent. Here, the DGP (1) is the same as that of Hansen (2007) and includes infinite regressors so that all of the candidate models are misspecified, 80
which is realistic. Let Y = (y1 , ..., yn )0 , µ = (µ1 , ..., µn )0 , and e = (e1 , ..., en )0 . To approximate the true process (1), we consider M candidate linear re-
Jo
gression models. Under the mth (1 ≤ m ≤ M ) candidate model including
0 0 km regressors, µ is estimated by µ ˆm = Pm Y , where Pm = Xm (Xm Xm )−1 Xm
85
with Xm being an n × km regressor matrix. Let the weight vector w ∈ W = PM {w ∈ [0, 1]M : m=1 wm = 1}, then the model averaging estimator of µ can be PM PM bm = P (w)Y , where P (w) = m=1 wm Pm . written as µ ˆ (w) = m=1 wm µ 4
Journal Pre-proof
Define the loss function of µ ˆ (w) as L(w) = kµ − µ ˆ (w)k2 , then we have = kY − µ ˆ (w)k2 − 2e0 (Y − µ ˆ (w)) + e0 e
of
L(w)
= kY − µ ˆ (w)k2 − 2e0 (I − P (w))(µ + e) + e0 e
p ro
= kY − µ ˆ (w)k2 + 2e0 P (w)e − 2e0 (I − P (w))µ − e0 e.
(2)
Assuming that the M th candidate model is the largest one, the MMA takes the form C MMA (w)
=
90
M X
wm km
m=1 0 MMA
2 w0 eb0 ebw + 2b σM wπ
Pr e-
=
2 kY − µ ˆ (w)k2 + 2b σM
,
(3)
2 = eb0M ebM /(n − kM ), and π MMA = where eb = (b e1 , ..., ebM ) with ebm = Y − µ ˆm , σ bM
(k1 , ..., kM )0 .
(4)
al
Note that from (2), we have 0 2 e P (w)e EL(w) = E kY − µ ˆ (w)k2 + 2b σM − nσ 2 2 σ bM " # M X e0 P m e 2 2 = E kY − µ ˆ (w)k + 2b σM wm 2 − nσ 2 . σ b M m=1
Comparing (3) and (4), we find that ignoring a constant, MMA is obtained 2 by replacing e0 Pm e/b σM only by km , which is just an approximate mean of 2 e0 Pm e/b σM :
urn
95
E
e 0 Pm e 2 σ bM
≈E
e0 Pm e σ2
= trace (Pm ) = km .
(5)
2 It can be seen that essentially, this is to approximate the distribution of e0 Pm e/b σM
by a χ2 distribution when the error term e follows a normal distribution. In
Jo
other words, when the penalty km is used for model selection and averaging, we 2 are actually approximating the true distribution of e0 Pm e/b σM by χ2 (km ). How-
100
ever, in comparison to χ2 distribution, F distribution may represent a better
approximation to the true small sample distribution of a random variable whose limit distribution is a χ2 distribution (Hamilton, 1994). This is similar to the case where the normal distribution is often replaced by the t distribution under 5
Journal Pre-proof
the small samples. For this reason, we attempt to construct an F variable to 2 2 approximate e0 Pm e/b σM . In this regard, a natural choice is e0 Pm e/e σm , where
of
105
2 σ em = kY − µ ˆm k2 /n = Y 0 (In − Pm )Y /n. The denominator is set as “n” so that
2 σ em can be viewed as the maximum likelihood estimator of σ 2 and such a setting
110
p ro
has been successfully applied to derive the corrected AIC (AICc) criterion for model selection in Hurvich & Tsai (1989). Thus 0 0 e Pm e e Pm e E ≈E . 2 2 σ bM σ em
(6)
We further illustrate our motivation by a simulation study. Using the sim-
Pr e-
ulation setting of Hansen (2007) with the sample size n = 20, the estimation results of error variance under different candidate models are shown in Figure 1. 2 is obvious, and such a variation It is seen from Figure 1 that the variation of σ bM 115
2 is replaced by σ 2 directly like (5). On the other hand, will be neglected if σ bM 2 2 2 can be partially taken like (6), the variation of σ bM is replaced by σ em when σ bM
into consideration. Importantly, such a substitution combines the estimation of error variance under each candidate model, and so different penalty terms are expected to be obtained for different candidate models apart from the simple
120
al
penalty km , the number of regressors. 2 To derive the distribution of e0 Pm e/e σm , we consider a special case of µ = Xβ
urn
with X being an n × k0 regressor matrix and β is a k0 × 1 true parameter vector, and assume that the true model is included in the family of candidate models and e is Gaussian. It is worth pointing out that these assumptions are only for convenience for the derivation of our criterion but not necessary for 125
both asymptotic theories and applications of our method. In fact, the similar assumptions are made in many literatures on the derivation of model selection
Jo
criteria. See, for example, Hurvich & Tsai (1989), Hurvich et al. (1998), and Naik & Tsai (2001). 2 Now, it is readily seen that ne σm ∼ σ 2 χ2 (n − km ). Also, Pm e is independent
130
2 of Y − µ ˆm , which means that e0 Pm e and ne σm are mutually independent. Thus,
6
2.5 2.0 1.5
M2
M3
M4
M5
urn
M1
al
0.0
0.5
1.0
Estimation of variance
3.0
Pr e-
3.5
p ro
of
Journal Pre-proof
M6
M7
M8
M0
Figure 1: Estimation results of error variance (σ 2 = 1). M1-M8 and M0 represent σ e12 –e σ82 and
2 , respectively. The simulation setting is the same as that of Subsection 3.2 with n = 20, σ bM
Jo
M = 8, R2 = 0.6 and α = 1.0.
7
Journal Pre-proof
n − km e0 P m e ∼ F (km , n − km ), 2 km ne σm and so,
e 0 Pm e 2 σ em
=
nkm n − km nkm = . n − km n − km − 2 n − km − 2
p ro
E
of
together with e0 Pm e ∼ σ 2 χ2 (km ), we immediately have (7)
(8)
MMAc Combining (4), (6) and (8), and letting πm = nkm /(n − km − 2), we propose
the following corrected MMA (labeled MMAc) criterion
Pr e-
2 C MMAc (w) = w0 eb0 ebw + 2b σM w0 π MMAc ,
MMAc 0 where π MMAc = (π1MMAc , · · · , πM ) . Then the resultant weights are given
by
w ˆ MMAc = argminw∈W C MMAc (w). 135
(9)
It can be seen that comparing to MMA, MMAc imposes a heavier penalty on the number of variables in the candidate models. A criterion with the
al
similar feature is the Kullback-Leibler distance based model averaging (KLMA) criterion proposed by Zhang et al. (2015). KLMA has the form
urn
C KLMA (w) = w0 eb0 ebw + 2
n − kM σ b2 w0 π MMA , n − kM − 2 M
(10)
and is numerically shown to perform better than MMA for small sample sizes. 140
Note that the difference between KLMA and MMA is the quantity while the difference between MMAc and MMA is the penalty
(n−kM ) (n−kM −2) ,
n (n−km −2)
for
the mth model. Although the modified penalties bear some similarity, their
Jo
distinguish is marked because but
145
n (n−km −2)
(n−kM ) (n−kM −2)
is the same for each candidate model,
has different effects on different candidate models, and when the
dimension of the candidate model increases, this quantity becomes large. In other words, the proposed MMAc method has the feature that the larger the candidate model, the heavier the extra penalty. Therefore, we expect that MMAc would bring more improvement over MMA than KLMA.
8
Journal Pre-proof
In the following, we consider the large sample properties of our proposed method. All limiting processes discussed here are with respect to n → ∞. It is
of
150
well-known that MMA is asymptotically optimal in the sense of achieving the lowest possible squared errors (Hansen, 2007; Wan et al., 2010). Similarly, we
p ro
can establish the same asymptotic optimality for MMAc.
Define the risk function R(w) = E{L(w)} = Ekµ − µ ˆ (w)k2 , and let ξn = 155
inf R(w). In the following, C denotes a generic constant. We need the follow-
w∈W
ing regularity conditions for Theorem 1.
−2G Condition (C.1). E(e4G i ) ≤ C < ∞, and M ξn
PM
m=1 (R(wm0 ))
G
→ 0 for
Pr e-
some fixed integer 1 ≤ G < ∞, where wm0 is an M × 1 vector with the mth element being one and the others zeros. 160
Condition (C.2). µ0 µ/n = O(1).
2 Condition (C.3). kM /n ≤ C < ∞.
Conditions (C.1) - (C.3) are the conditions for Theorem 2 in Wan et al. (2010) and frequently used in literature (Liu & Okui, 2013;Zhang et al., 2015).
al
Theorem 1. Under Conditions (C.1) - (C.3), L(w ˆ MMAc )/ inf L(w) = 1 + op (1).
165
w∈W
urn
Proof. See the Appendix.
Theorem 1 shows that the squared loss of the model averaging estimator µ b(w ˆ MMAc ) with weights determined by MMAc is asymptotically equivalent to that of the infeasible best possible model averaging estimator. 170
Now, we investigate a more intuitive property of the MMAc based weights—
Jo
consistency. Similar theoretical results will be established as well for MMA and ii JMA. Recall that ebm = Y − µ ˆm , and let Pm be the ith diagonal element of Pm ,
Dm be the n × n diagonal matrix with the ith diagonal element equalling to
175
ii −1 (1 − Pm ) , eem = Dm ebm , and ee = (e e1 , · · · , eeM ), then the JMA criterion has the
form
C JMA (w) = w0 ee0 eew. 9
(11)
Journal Pre-proof
Denote the optimal weight w0 = argminw∈W R(w). Let λmin (B) and λmax (B) =
of
kBk be the minimum and maximum singular values of a general real matrix B, respectively. Denote Ω1 = (µ − µ ˆ1 , ..., µ − µ ˆM ), Λ1 = (ˆ µ1 , ..., µ ˆM ), Ω = Ω01 Ω1 , and Λ = Λ01 Λ1 . Theorem 2 below gives the rates of the MMA, MMAc and JMA
based weights tending to the infeasible optimal weight vector w0 . We need the
p ro
180
following regularity conditions for this theorem.
Condition (C.4). There are two positive constants κ1 and κ2 , such that 0 < κ1 < λmin (Λ/n) ≤ λmax (Λ/n) < κ2 < ∞, in probability tending to 1.
uniformly in m.
Pr e-
185
0 √ 1/2 0 e/ nk = Op (kM ) Xm /n)−1 = O(1), and kXm Condition (C.5). λmax (Xm Condition (C.6). kM /n1−2δ = O(1) and M kM /(n2δ ξn ) = o(1), where δ is a positive constant.
ii , Condition (C.7). p∗ = O(kM /n), where p∗ = max{1≤m≤M } max{1≤i≤n} Pm 0 0 ii . Xm )−1 Xm is the ith diagonal element of Pm = Xm (Xm and Pm
Condition (C.8). λmax (Ω/n) = Op (1).
al
190
Condition (C.9). ξn /n ≤ C < ∞.
urn
Condition (C.4) is common and it requires that both the minimum and maximum singular values of Λ are bounded away from zero and infinity. Condition (C.5) is similar to Condition (C.1) of Zhang (2015), and is a high level condition 195
that can be proved by original conditions. The first part of Condition (C.6) is implied by Condition (C.3) for sufficient small δ. The second part of Condition
Jo
(C.6) implies that the number of candidate models M can increase with n but at a rate with restriction. For instance, when the candidate models are nested, if ξn is of order O(n1−α ) for some α ≥ 0, then M can be o(n
200
1−α 2 +δ
). Condition
(C.7) is common in the literature on model selection and averaging (Li, 1987; Zhang et al., 2013). Under Condition (C.8), the maximum singular value of Ω/n is assumed to be bounded in probability. Condition (C.9), which puts a 10
Journal Pre-proof
constraint on the increasing rate of ξn , is rather mild and commonly used in
205
of
literature (Hansen & Racine, 2012).
Theorem 2. If w0 is an interior point of W and Conditions (C.2), (C.4),
p ro
(C.5), and (C.6) are satisfied, then there exist local minimizers w ˆ MMA and w ˆ MMAc of C MMA (w) and C MMAc (w), respectively, such that kw ˆ MMA − w0 k = Op ξn1/2 n−1/2+δ , and
Pr e-
kw ˆ MMAc − w0 k = Op ξn1/2 n−1/2+δ ,
(12)
(13)
where δ is a positive constant given in Condition (C.6). 210
Further, if Conditions (C.7), (C.8), and (C.9) are satisfied, then there exists a local minimizer w ˆ JMA of C JMA (w) such that
kw ˆ JMA − w0 k = Op ξn1/2 n−1/2+δ .
al
Proof. See the Appendix.
(14)
Remark 1. It is seen from Theorem 2 that the convergence rate is determined by the sample size n and the minimized squared risk ξn . Given the speed of n → ∞, the lower the speed of ξn → ∞, the faster the speed of w ˆ → w 0 (w ˆ can be
urn
215
w ˆ MMA , w ˆ MMAc , or w ˆ JMA ). For instance, when ξn ≤ λ < ∞ with λ indicating a 0 constant, kw−w ˆ k = Op n−1/2+δ . In addition, as long as ξn = o(n1−2δ ) which is similar to a case considered in Hansen & Racine (2012), the convergence of w ˆ is valid, i.e., kw ˆ − w0 k = op (1). Remark 2. To establish the asymptotic optimality like Theorem 1, we need to
Jo
220
assume that ξn → ∞, which is also a necessary condition used in Hansen (2007), Wan et al. (2010), and Hansen & Racine (2012). This condition is reasonable in the situation that all candidate models are misspecified. In Theorem 2, we do not require this condition, which strengthens theoretical basis for applications of
225
the optimal model averaging methods. 11
Journal Pre-proof
Theorems 1 and 2 indicate that the proposed MMAc method has the same
of
large sample properties as MMA and JMA. In the next two sections, we will study the finite sample behavior of MMAc by simulation trials and real data
230
p ro
analyses, respectively.
3. Simulations
In this section, we compare the finite sample performance of PMA, MMA, JMA, KLMA, and our proposed MMAc for different cases. Here PMA is the
235
Pr e-
predictive model averaging criterion proposed by Xie (2015), which takes the PM n+k(w) form kY − µ ˆ (w)k2 n−k(w) with k(w) = m=1 wm km . 3.1. Simulation 1 for the case with the fixed number of covariates This simulation setting is used in Hurvich & Tsai (1989) and Zhang et al. (2015). The details can be given below. Let β = (1, 2, 3, 0, 0, 0, 0)0 , X = (X(1) , . . . , X(7) ) with X(j) ∼ Normal(0, In ) for j = 1, . . . , 7, µ = Xβ, and e ∼ Normal(0, σ 2 In ) and is independent of X. Consider seven nested candi240
date models with Xm = (X(1) , . . . , X(m) ) for m = 1, . . . , 7.
al
Define R2 = var(µi )/var(yi ) = 14/(14 + σ 2 ) and then vary σ 2 so that R2 varies on a grid between 0.1 and 0.9. To evaluate the model averaging estimators
urn
based on MMA, JMA, KLMA and MMAc, we compute the risk Ekµ − µ ˆ (w)k2 approximated by the average across 1000 simulation replications. The sample 245
size n is set to 20, 50, 75, and 100. We normalize the risk by dividing the risk of the infeasible optimal least squares estimator. The simulation results are summarized in Figure 2. It is seen that, for various R2 and n, the proposed method MMAc performs dramatically better than
Jo
the existing four model averaging methods: PMA, MMA, JMA and KLMA,
250
especially for small sample sizes. It is also clear that, as n increases, the differences among the five methods become small, and especially among MMA, JMA and KLMA.
12
Journal Pre-proof
0.2
0.3
0.4
0.5 R2
0.6
0.7
0.8
0.9
0.3
0.4
0.5 R2
0.3
0.4
0.5 R2
0.6
0.7
0.8
0.9
0.7
0.8
0.9
PMA JMA KLMA MMA MMAc
Pr e-
0.2
0.2
n=100
PMA JMA KLMA MMA MMAc
0.1
0.1
Risk 1.1 1.2 1.3 1.4 1.5 1.6 1.7
Risk 1.1 1.2 1.3 1.4 1.5 1.6 1.7
n=75
p ro
0.1
PMA JMA KLMA MMA MMAc
of
n=50
Risk 1.1 1.2 1.3 1.4 1.5 1.6 1.7
Risk 1.1 1.2 1.3 1.4 1.5 1.6 1.7
n=20 PMA JMA KLMA MMA MMAc
0.6
0.7
0.8
0.9
0.1
0.2
0.3
0.4
0.5 R2
0.6
Figure 2: Results for Simulation 1.
1.3
n=50
1.3
n=20
0.1
0.2
0.3
0.4
0.5 R2
0.6
1.0
Risk 1.1
1.2
PMA JMA KLMA MMA MMAc
0.9
0.9
urn
1.0
Risk 1.1
1.2
al
PMA JMA KLMA MMA MMAc
0.7
0.8
0.9
0.1
0.2
0.3
n=75
0.7
0.8
0.9
Risk 1.1
1.2
PMA JMA KLMA MMA MMAc
0.9
1.0
Jo
0.6
1.3
1.3 1.2 Risk 1.1 1.0
0.2
0.5 R2
n=100 PMA JMA KLMA MMA MMAc
0.9
0.1
0.4
0.3
0.4
0.5 R2
0.6
0.7
0.8
0.9
0.1
0.2
0.3
0.4
Figure 3: Results for Simulation 2 with α = 0.5.
13
0.5 R2
0.6
0.7
0.8
0.9
Journal Pre-proof
1.2 Risk 1.1
p ro
1.0
PMA JMA KLMA MMA MMAc
0.9
0.9
1.0
Risk 1.1
1.2
PMA JMA KLMA MMA MMAc
of
1.3
n=50
1.3
n=20
0.1
0.2
0.3
0.4
0.5 R2
0.6
0.7
0.8
0.9
n=75
0.2
0.3
0.4
0.5 R2
0.6
0.7
0.8
0.9
1.3
1.3
n=100
1.2
PMA JMA KLMA MMA MMAc
0.9
1.0
Pr e-
0.9
1.0
Risk 1.1
1.2
PMA JMA KLMA MMA MMAc
Risk 1.1
0.1
0.1
0.2
0.3
0.4
0.5 R2
0.6
0.7
0.8
0.9
0.1
0.2
0.3
0.4
0.5 R2
0.6
0.7
0.8
0.9
Figure 4: Results for Simulation 2 with α = 1.0.
1.4
n=50
1.4
n=20
0.1
0.2
0.3
0.4
0.5 R2
0.6
1.1
Risk 1.2
1.3
PMA JMA KLMA MMA MMAc
1.0
1.0
urn
1.1
Risk 1.2
1.3
al
PMA JMA KLMA MMA MMAc
0.7
0.8
0.9
0.1
0.2
0.3
n=75
0.7
0.8
0.9
Risk 1.2
1.3
PMA JMA KLMA MMA MMAc
1.0
1.1
Jo
0.6
1.4
1.4 1.3 Risk 1.2 1.1
0.2
0.5 R2
n=100 PMA JMA KLMA MMA MMAc
1.0
0.1
0.4
0.3
0.4
0.5 R2
0.6
0.7
0.8
0.9
0.1
0.2
0.3
0.4
Figure 5: Results for Simulation 2 with α = 1.5.
14
0.5 R2
0.6
0.7
0.8
0.9
Journal Pre-proof
3.2. Simulation 2 for the case with an increasing number of covariates
yi = µi + ei =
500 X
xij βj + ei ,
of
This simulation setting is the same as that in Hansen (2007). Let i = 1, ..., n,
p ro
j=1
where xij = 1 for j = 1, xij ∼ Normal(0, 1) for j > 1, and {xij } are mutually 255
independent; ei ∼ Normal(0, 1), and {ei } are mutually independent and also in√ dependent of {xij }; and βj = c 2αj −α−0.5 with α varying in {0.5, 1.0, 1.5} and
c being determined by R2 = c2 /(1 + c2 ) ∈ [0.1, 0.9]. Following Hansen (2007),
Pr e-
we assume that there are M , the largest integer no greater than 3n1/3 , nested candidate models with the mth model including the first m regressors. The sam260
ple size is set to be the same as that in Simulation 1, i.e., n ∈ {20, 50, 75, 100}. Also, the performance of the five model averaging methods is measured by the risk used in Simulation 1 and 1000 simulation replications are run. The results are displayed in Figures 3 - 5. It can be seen that, for small α (α = 0.5), MMAc is superior to PMA, MMA, JMA and KLMA when R2 is
265
small, but inferior to them when R2 is moderate and large (Figure 3). With the
al
increase of α, MMAc becomes better. In fact, when α is moderate (α = 1.0), MMAc is superior to the other four methods for large ranges of R2 (Figure 4); and when α is large (α = 1.5), MMAc almost dominates PMA, MMA, JMA
270
urn
and KLMA (Figure 5).
Recall that relative to MMA, MMAc imposes a heavier penalty on the number of variables in the candidate models and when the dimension of the candidate model increases, this penalty becomes large. Returning to this simulation, we observe that, when α = 0.5, the coefficients decline so slowly that they are not
Jo
absolutely summable. In this case, the true model would contain many impor275
tant variables and so it may not be appropriate to put a heavy penalty on the big candidate model; in other words, although a big model leads to the high variance, it is more possible for a small model to cause the larger bias. Hence it is not strange that MMAc can be outperformed by the other three methods for quite small α. However, it is not common for the coefficients not to be ab15
Journal Pre-proof
280
solutely summable in literature. For instance, for the autoregressive or moving
of
average processes, the absolute summability of the coefficients is a commonly required condition, and many important theories on the time series are established under such a condition. Therefore, MMAc deserves to be recommended
285
p ro
for applications.
3.3. Simulation 3 for the dependent data case
This simulation setting is used in Zhang et al. (2013) for studying the performance of JMA in dependent data cases. Assume the ARMA(1,1) process:
Pr e-
yt = φyt−1 + et + 0.5et−1 , where et ∼ Normal(0, σ 2 ), and φ takes the values such that R2 = (var(yt ) − var(et ))/var(yt ) varies between 0.1 and 0.9. Since
290
this ARMA(1,1) process has an AR(∞) representation, to approximate the ARMA(1,1) process, like Zhang et al. (2013), the following M + 1 (M is the largest integer no greater than 3n1/3 ) candidate models are used in this simulation: Pm yt = b + et and yt = b + j=1 yt−j θj for 1 ≤ m ≤ M , where b is the intercept,
and θj is the coefficient of yt−j . Let n ∈ {15, 25, 50, 75} and σ 2 ∈ {1, 4}. To ap295
praise the prediction abilities of the five model averaging methods, we compute
al
the out-of-sample one-step-ahead mean squared prediction error of the model averaging forecast yˆn+1 (w): E(yn+1 − yˆn+1 (w))2 , which is approximated by the average based on 5000 simulation trials. For convenient comparison, the risks
300
urn
reported in Figures 6-7 are divided by σ 2 . It is seen from Figures 6-7 that, regardless of σ 2 , for small sample size cases with n = 15 and 25, MMAc dominates its competitors, i.e., PMA, MMA, JMA and KLMA, for all levels of R2 . In these two cases, KLMA is superior to MMA although only a small improvement can be made, and is comparable with
Jo
JMA. In addition, JMA performs a little better than MMA when n = 15 and 305
similarly to MMA when n = 25, which is consistent with the findings in Zhang et al. (2013).
When the sample size increases to n = 50 or 75, the other four methods are
still dominated by MMAc, but the difference among the five model averaging methods becomes very minor. 16
Journal Pre-proof
of
1.3 Risk 1.2 0.1
0.2
0.3
0.4
0.5 R2
0.6
0.7
0.8
1.0
1.2
PMA JMA KLMA MMA MMAc
p ro
1.1
1.3
Risk 1.4
1.5
1.4
n=25
1.6
n=15
0.9
0.1
0.2
0.3
0.4
0.5 R2
0.6
0.7
PMA JMA KLMA MMA MMAc
0.8
0.9
n=75
1.3 1.2
Risk 1.1 1.0
Pr e-
1.0
Risk 1.1
1.2
1.3
n=50
0.1
0.2
0.3
0.4
0.5 R2
0.6
0.7
0.8
PMA JMA KLMA MMA MMAc
0.9
0.9
PMA JMA KLMA MMA MMAc
0.9
0.1
0.2
0.3
0.4
0.5 R2
0.6
0.7
0.8
0.9
Figure 6: Results for Simulation 3 with σ 2 = 1.
310
In addition, when the moving average coefficient in the ARMA(1,1) process takes other values such as 0.2 and 0.7, the simulation results (not reported here)
al
still support the proposed method MMAc.
urn
4. Empirical application
4.1. Empirical example 1: Physical growth study 315
The data set considered here is from Berkeley Guidance Study (BGS) (Tuddenham & Snyder, 1954) and has been analyzed by Ye et al. (2018). The observations are the measures of physical growth indexes for 66 registered newborn
Jo
boys. It is of interest to study the relationship between age 18 height (HT18) and other physical growth indexes. Hence the response variable is set as HT18,
320
and, following Ye et al. (2018), the possible covariates include weights at ages two (WT2) and nine (WT9), heights at ages two (HT2) and nine (HT9), age nine leg circumference (LG9), and age 18 strength (ST18) (Ye et al., 2018). We present the scatterplot matrix of the variables in Figure 8. From Figure 8, it
17
Journal Pre-proof
of
1.3 Risk 1.2 1.1
1.3
Risk 1.4
1.5
1.4
n=25
1.6
n=15
0.2
0.3
0.4
0.5 R2
0.6
0.7
0.8
p ro
0.1
1.0
1.2
PMA JMA KLMA MMA MMAc
0.9
0.1
0.2
0.3
0.4
0.5 R2
0.6
0.7
PMA JMA KLMA MMA MMAc
0.8
0.9
n=75
1.3 1.2
Risk 1.1 1.0
Pr e-
0.1
0.2
0.3
0.4
0.5 R2
0.6
0.7
0.8
PMA JMA KLMA MMA MMAc
0.9
0.9
PMA JMA KLMA MMA MMAc
0.9
0.1
0.2
0.3
0.4
0.5 R2
0.7
0.8
ST18
160 240
1.0
Risk 1.1
1.2
1.3
n=50
0.6
Figure 7: Results for Simulation 3 with σ 2 = 4.
10 14 18
18
20
50
25
35
160 180
HT18
14
al
10
95 85
HT2
50
urn
WT2
HT9
25 35
Jo 160
180
125 140
20
WT9
LG9
85
95
125
140
160
240
Figure 8: Scatterplot matrix of the variables for the empirical example 1.
18
0.9
Journal Pre-proof
is seen that the correlation between HT18 and HT9 is very strong, which is accordant with the conclusions in the last paragraph of this subsection. Define
of
325
HT 18i as the ith observation of HT18 for i = 1, ..., n (the training sample size), and the similar definitions are used for the other variables. The largest model
p ro
is given by
HT 18i = β1 + β2 W T 2i + β3 W T 9i + β4 HT 2i + β5 HT 9i + β6 LG9i + β7 ST 18i + ei , where βj , j = 1, ..., 7, are the associated coefficients, and ei is the disturbance. Totally 26 (= 64) candidate models are considered. The scaled mean squared
Pr e-
prediction error (MSPE), based on the predictions for the test sample with size 66 − n (n ∈ {15, 25, 50, 60}), is given by
66−n 2 X 1 HT 18i − HTˆ 18i , 2 σ ˆM (66 − n) i=1 2 , given in (3), is the estimator of var(ei ) based on the full sample, where σ ˆM 330
and HTˆ 18i is the model averaging forecast of HT 18i . We replicate 1000 times to obtain the mean and variance of MSPEs, which are displayed in Table 1. It is seen from Table 1 that, among the four model averaging methods, M-
al
MAc performs the best for all of the cases, especially for the small and moderate sample sizes, in terms of mean and variance. We also observe that MMA and 335
KLMA have a very close performance, and both are often better than PMA and
urn
JMA.
Now define Models 1-5 as the candidate models with covariates {HT9}, {WT9, HT9}, {HT9, LG9}, {WT2, HT2, HT9, LG9}, and {WT2, HT2, HT9, LG9, ST18}, respectively. We record the weights obtained by the five model av340
eraging methods in each replication, and then calculate the mean of the weights
Jo
based on 1000 replications for each method. We list only the models to which the weights no small than 0.1 are assigned. The results are displayed in Table 2. We find from the table that all the listed models contain HT9, which implies that HT9 has the most important effect on HT18. This conclusion is coinci-
345
dent with that drawn by Ye et al. (2018). In addition, the simulation results in Ye et al. (2018) suggest that the top two important covariates for HT18 are 19
Journal Pre-proof
Table 1: Mean and variance of MSPEs for Example 1
MMA
MMAc
n = 15
1.797
1.741
1.641
1.700
1.571
n = 25
1.346
1.350
1.323
1.330
1.309
n = 50
1.179
1.178
1.175
1.176
1.171
n = 60
1.184
1.179
1.182
1.183
1.180
n = 15
0.867
0.622
0.510
0.636
0.362
n = 25
0.104
0.104
0.090
0.094
0.080
n = 50
0.114
0.115
0.113
0.113
0.113
n = 60
0.342
0.342
0.338
0.339
0.334
of
KLMA
Pr e-
Variance
JMA
p ro
Mean
PMA
HT9 and LG9. From our data analysis, we see that, in comparison with the other three model averaging methods, MMAc exactly puts more weights on the smallest model including HT9 and LG9 (i.e., Model 3) for most cases. This is 350
desirable since it is more possible that a larger model leads to a larger variance
al
for small sample scenarios.
4.2. Empirical example 2: Container port throughput data of Hong Kong The data are the container throughputs (unit: one hundred thousand twen-
355
urn
tyfoot equivalent unit (TEU)) of Hong Kong (HK) from January 2001 to December 2015 and have been studied by Gao et al. (2016a), which are presented in Figure 9. It is seen from Figure 9 that the data have the seasonality and trend. To this end, the data are preprocessed by twelfth log difference to remove the seasonality and then first difference to remove the trend, which implies that the
Jo
available sample size is n0 = 180 − 12 − 1 = 167. The candidate models are the
360
same as those used in Simulation 3, except that n is now the training sample size. Let n1 = n0 − (n + M + 1) be the size of the test sample, where M has been defined in Simulation 3. We do one-step ahead prediction for n1 times with window moving ahead by one step for each time. Let n ∈ {15, 25, 50, 75}. The mean and variance of the mean squared prediction errors (MSPEs) are reported
20
p ro
of
Journal Pre-proof
Table 2: Weighs estimated by model averaging methods
PMA
JMA
1
0.24
0.25
2
0.10
3 n = 25
1 2 3
n = 50
1 2
MMA
MMAc
0.44
0.35
0.52
0.10
0.10
0.12
0.12
0.12
0.20
0.25
0.31
0.27
0.37
0.15
0.10
0.15
0.15
0.15
0.16
0.11
0.17
0.17
0.19
0.10
0.20
0.13
0.12
0.14
0.16
0.15
0.17 0.33
0.14 0.25
0.17
0.29
0.28
4
0.12
0.13
0.10
0.11
5
0.11
0.10
0.19
0.18
0.19
1
urn
n = 60
0.12
al
3
KLMA
Pr e-
n = 15
Model
0.19
0.18
3
0.25
0.16
0.30
0.29
0.37
4
0.20
0.27
0.17
0.18
0.15
5
0.14
0.11
0.12
Jo
2
21
p ro
180 160 01
03
Pr e-
120
140
Throughputs
200
220
of
Journal Pre-proof
05
07
09
11
13
15
Year
Figure 9: Container port throughput data of Hong Kong.
365
in Table 3.
From Table 3, it is seen that, in terms of mean and variance, MMAc performs
al
the best among various model averaging methods for most cases, especially for small sample sizes. These observations are similar to those in simulations.
370
urn
5. Conclusions
In this article, we proposed a corrected Mallows model averaging method— MMAc. Similar to MMA, our MMAc is asymptotically optimal in the sense of minimizing squared errors in large sample sizes. Further, the rate of the MMAc based empirical weight converging to the optimal weight is derived. Similar
Jo
conclusions for MMA and JMA are also obtained. These properties provide
375
a strong support for the three methods under large samples, although could not be used to select them. Both simulation studies and empirical applications indicate that MMAc often performs better than PMA, MMA, JMA and KLMA for small and even moderate sample sizes.
22
Journal Pre-proof
Table 3: Mean and variance of MSPEs for Example 2 (×10−3 )
MMA
MMAc
n = 15
7.505
6.671
6.762
7.042
6.360
n = 25
6.501
6.240
6.307
6.353
6.153
n = 50
5.802
5.759
5.794
5.792
5.748
n = 75
4.317
4.271
4.348
4.339
4.469
n = 15
0.345
0.144
0.151
0.227
0.106
n = 25
0.101
0.093
0.091
0.093
0.085
n = 50
0.089
0.089
0.089
0.090
0.088
n = 75
0.041
0.040
0.042
0.041
0.045
of
KLMA
Pr e-
Variance
JMA
p ro
Mean
PMA
In our current paper, we estimate the variance σ 2 under each candidate mod380
el. In view of Zhao et al. (2016) who considered model averaging estimation on covariance matrix, it also makes sense to estimate the variance σ 2 by averaging its estimators under different models here. Zhang et al. (2016) derived the condition under which the MMA estimator dominates the ordinary least squares
385
al
estimator in the sense of MSE. Whether MMAc has the similar property is an interesting issue. On the other hand, the current paper focuses on the constant error variance case. It is of interest to extend the proposed method to the het-
urn
eroscedastic error case. To this end, an extension of Equation (7) to such a case is needed. In addition, although the dependent data case has been considered in our simulation study and real data analysis, the theoretical properties are 390
only for the i.i.d. case. Extending the theoretical results here to the dependent data case is also meaningful. These remain for our further researches.
Jo
Our idea of deriving weight choice criterion based on the F distribution could be generalized to some other cases. For example, Liao & Tsay (2016) extended MMA to multivariate time series models. It is interesting to design a
395
new criterion in the spirit of MMAc for this situation. Further, the derivation of the convergence rates for weights based on other criteria like the two-step model averaging for the high-dimensional linear models proposed by Ando & Li 23
Journal Pre-proof
(2014) is also possible, along the line of proving Theorem 2, and this warrants
400
of
our future research.
Inference after model averaging is also an important topic. Hjort & Claeskens (2003) provided a local misspecification framework to investigate the asymptotic
p ro
distributions of model averaging estimators; see also Claeskens & Hjort (2008). Under such a framework, Hansen (2014) and Liu (2015) concluded that the asymptotic distribution of MMA is a nonlinear function of normal distribution 405
plus a local parameter for the linear regression models. Recently, without the local misspecification assumption, Zhang & Liu (2019) showed that the asymp-
Pr e-
totic distribution of the MMA estimator is a nonlinear function of the normal distribution with mean zero for the linear regression models with finite parameters, and proposed a simulation-based confidence interval. Noting that the 410
difference of MMAc and MMA vanishes as n → ∞, it is easy to see that the MMAc estimator has the same limiting distribution as the MMA estimator. It is of interest to further consider the MMAc based interval estimation. This is left for future research.
415
al
Acknowledgments
The authors thank the editor, the associate editor and the two referees for
urn
their constructive comments and suggestions that have substantially improved the original manuscript. Zou’s work was partially supported by the National Natural Science Foundation of China (Grant nos. 11971323 and 11529101) and
Jo
the Ministry of Science and Technology of China (Grant no. 2016YFB0502301).
24
Journal Pre-proof
420
Appendix
of
A.1. Proof of Theorem 1 Observe that, under Condition (C.3),
w∈W
p ro
sup R−1 (w) w0 π MMAc − w0 π MMA
# nk m ≤ sup R (w) − km wm n − km − 2 w∈W m=1 2 (kM + 2)kM −1 kM ≤ ξ =O = O(ξn−1 ) → 0, n − kM − 2 n nξn "
−1
M X
(A.1)
Pr e-
where the last line is ensured by the conditions of Theorem 1. From this and following the proof of Theorem 2 in Wan et al. (2010), we see that Theorem 1 425
holds. The details are omitted for saving space and available from the authors upon request. A.2. Proof of Theorem 2 1/2
ˆ = Denote n = ξn n−1/2+δ . To verify (12) of Theorem 2 for the case w w ˆ MMA , following Fan & Peng (2004) and Chen et al. (2018), it suffices to show that, there exists a constant C0 such that, for the M ×1 vector u = (u1 , ..., uM )0 , MMA 0 MMA 0 lim P inf C (w + u) > C (w ) = 1, (A.2) n 0 kuk=C0 ,(w +n u)∈W
urn
n→∞
al
430
which means that there exists a minimizer w ˆ MMA in the bounded closed domain
Jo
{w0 + n u : kuk ≤ C0 , (w0 + n u) ∈ W} such that kw ˆ MMA − w0 k = Op (n ). We
25
Journal Pre-proof
first notice that
of
C MMA (w0 + n u) − C MMA (w0 )
2 = kY − µ ˆ (w0 + n u)k2 + 2b σM (w0 + n u)0 π MMA − kY − µ ˆ (w0 )k2 0
p ro
2 −2b σM w0 π MMA
= kµ − µ ˆ (w0 + n u)k2 + 2e0 µ − P (w0 + n u)µ − P (w0 + n u)e 2 +2b σM (w0 + n u)0 π MMA − kµ − µ ˆ (w0 )k2 − 2e0 µ − P (w0 )µ − P (w0 )e 0
2 −2b σM w0 π MMA
Pr e-
= kµ − µ ˆ (w0 + n u)k2 − kµ − µ ˆ (w0 )k2 − 2e0 P (n u)µ − 2e0 P (n u)e 2 0 MMA +2n σ bM uπ 0
= 2n u0 Λu − 2n w0 Ω01 Λ1 u − 2e0 P (n u)µ − 2e0 P (n u)e 2 0 MMA +2n σ bM uπ .
(A.3)
In the following, we show that 2n u0 Λu is the leading term of (A.3). 435
Since λmin (Λ/n) > κ1 in probability tending to 1 under Condition (C.4), we have
al
2n u0 Λu > κ1 n2n kuk2 > 0,
(A.4)
in probability tending to 1.
urn
Noting that EkΩ1 w0 k2 = Ekµ − µ ˆ (w0 )k2 = ξn , we obtain kΩ1 w0 k = Op (ξn1/2 ).
(A.5) 1/2
From Condition (C.4), it is clear that kΛ1 k = λmax (Λ) = Op (n1/2 ). Thus, 0
Jo
|n w0 Ω01 Λ1 u|
440
≤ n kΛ1 kkΩ1 w0 kkuk = Op (n1/2 ξn1/2 n )kuk.
(A.6)
0
1/2
Recalling n = ξn n−1/2+δ , we see that n w0 Ω01 Λ1 u is asymptotically dominated by 2n u0 Λu. Now we are in a position to derive the stochastic orders of the remaining
26
Journal Pre-proof
terms of (A.3). From Condition (C.5), it is seen that
≤
0 0 e0 Xm (Xm Xm )−1 Xm e 0 √ 0 λmax (Xm Xm /n)−1 kXm e/ nk2 = Op (kM )
of
e 0 Pm e =
max
m∈{1,...,M } 445
|e0 Pm µ|
≤ kµk
p ro
uniformly in m. Therefore, by Condition (C.2), max
1/2
m∈{1,...,M }
and then
(e0 Pm e)
Pr em=1
≤ n kuk M ≤ Op (n
Similarly, we have
1/2
max
m∈{1,...,M }
|e0 Pm µ|2
1/2 M 1/2 kM n ) kuk .
al
|e0 P (n u)e| = |n (e0 P1 e, · · · , e0 PM e) u| !1/2 M X 0 2 ≤ n kuk (e Pm e) m=1
urn
≤ n kuk M = Op (M
1/2
max
m∈{1,...,M }
kM n ) kuk .
1/2
= Op (n1/2 kM ),
|e0 P (n u)µ| = |n (e0 P1 µ, · · · , e0 PM µ) u| !1/2 M X 0 2 ≤ n kuk |e Pm µ|
(A.7)
(e0 Pm e)2
1/2 (A.8)
1/2 (A.9)
In addition, Condition (C.2) and Ekek2 = nσ 2 = O(n)
(A.10)
kY k ≤ kµk + kek = Op (n1/2 ),
(A.11)
Jo
imply that
which, together with Condition (C.6), yields that 2 σ bM
= Y 0 (In − PM )Y /(n − kM ) ≤ λmax (In − PM )kY k2 /(n − kM ) = Op (1). 27
(A.12)
Journal Pre-proof
450
Thus, we have
of
2 0 MMA 2 |n σ bM uπ | = n σ bM |(k1 , · · · , kM )u|
p ro
2 ≤ n σ bM k(k1 , · · · , kM )0 kkuk !1/2 M X 2 2 = n σ bM km kuk m=1
= Op (M 1/2 kM n ) kuk .
(A.13)
So from (A.8), (A.9), (A.13), and Condition (C.6), it is seen that e0 P (n u)µ, 2 0 MMA e0 P (n u)e, and n σ bM uπ are all asymptotically dominated by 2n u0 Λu. Hence
Pr e-
(A.2) is true, and thus (12) of Theorem 2 with w ˆ=w ˆ MMA is proved.
For the case w ˆ = w ˆ MMAc , we note from Condition (C.6) that nkM /(n −
455
kM − 2) = O(kM ). This, together with (A.12), implies that nkM M 1/2 2 0 MMAc n kuk = Op (M 1/2 kM n ) kuk . (A.14) |n σ bM uπ | = Op n − kM − 2
In view of the above process of showing (A.2), it is apparent that (13) of Theorem 2 is true for w ˆ=w ˆ MMAc .
It remains to verify (14) of Theorem 2. Let Γ = (b µ1 − µ e1 , · · · , µ bM − µ eM )
al
460
and recall Ω1 = (µ − µ b1 , · · · , µ − µ bM ). Then we can write kb µ(w) − µ e(w)k2 = 0
w0 Γ0 Γw and (b µ(w) − µ e(w)) (Y − µ b(w)) = w0 Γ0 e + w0 Γ0 Ω1 w. Thus, we decom-
pose C JMA (w) as
urn
C JMA (w)
=
=
=
kY − µ e(w)k2
0
kY − µ b(w)k2 + kb µ(w) − µ e(w)k2 + 2 (b µ(w) − µ e(w)) (Y − µ b(w)) kY − µ b(w)k2 + w0 Γ0 Γw + 2w0 Γ0 e + 2w0 Γ0 Ω1 w.
(A.15)
Jo
As a result,
=
C JMA (w0 + n u) − C JMA (w0 ) kY − µ b(w0 + n u)k2 − kY − µ b(w0 )k2 + 2n u0 Γ0 Γu 0
+2n w0 Γ0 Γu + 2n u0 Γ0 e + 22n u0 Γ0 Ω1 u + 2n u0 Γ0 Ω1 w0 0
+2n w0 Γ0 Ω1 u,
(A.16) 28
Journal Pre-proof
where kY −b µ(w0 +n u)k2 −kY −b µ(w0 )k2 is asymptotically dominated by 2n u0 Λu 465
of
in view of (A.3) and the process of proving (12). It is clear that 2n u0 Γ0 Γu ≥ 0
for any u. Thus, similar to the proof of (12), to verify (14), it is enough to show that the last five terms of (A.16) are asymptotically dominated by 2n u0 Λu(>
p ro
κ1 n2n kuk2 > 0).
Let Dm − I be the n × n diagonal matrix whose ith diagonal element is
ii ii Pm /(1 − Pm ). 470
Then from Li (1987), we obtain the expression µ em = [Dm (Pm −
I) + I]Y . So we have, uniformly in m ∈ {1, · · · , M }, =
k(Dm − I) (I − Pm ) Y k
Pr e-
kb µm − µ em k
≤
kDm − IkkY kλmax (I − Pm )
=
Op (kM n−1/2 ),
(A.17)
where the last equality is because of (A.11), maxm∈{1,...,M } {λmax (Pm )} ≤ 1,
and maxm∈{1,...,M } kDm − Ik = O(kM n−1 ) (by Condition (C.7)). Further, it is seen from (A.17) that
1/2
kΓk ≤ (trace(Γ0 Γ))
al
M X
=
2
m=1
kb µm − µ em k
!1/2
urn
= Op (M 1/2 kM n−1/2 ).
(A.18)
Combining (A.10) and (A.18) leads to |n u0 Γ0 e| ≤ n kekkΓkkuk = Op (M 1/2 kM n )kuk.
Observe from Condition (C.8) that λ2max (Ω1 ) = λmax (Ω01 Ω1 ) = λmax (Ω) = Op (n),
(A.20)
|2n u0 Γ0 Ω1 u| ≤ 2n kΓkλmax (Ω1 ) kuk2 = Op (M 1/2 kM 2n )kuk2 .
(A.21)
Jo
475
(A.19)
and hence,
29
Journal Pre-proof
|n u0 Γ0 Ω1 w0 | ≤
n kΓkkΩ1 w0 kkuk
of
Using (A.5), we have
= Op (M 1/2 kM n−1/2 ξn1/2 n )kuk
p ro
= Op (M 1/2 kM 2n n−δ )kuk. In addition, utilizing (A.17) again, we obtain
2 M
X
0 2 0 kΓw k = wm (b µm − µ em )
m=1
which implies that
m=1
2
0 2 wm kb µm − µ em k = Op (kM n−1 ),
0
Pr e-
≤
M X
|n w0 Γ0 Γu|
(A.22)
(A.23)
≤ n kΓw0 kkΓkkuk
2 = Op (M 1/2 kM n−1 n )kuk
= Op (M 1/2 kM n )kuk
480
by (A.18), and 0
by (A.20).
(A.25)
al
|n w0 Γ0 Ω1 u| ≤ n kΓw0 kλmax (Ω1 ) kuk = Op (kM n )kuk
(A.24)
urn
Finally, Conditions (C.6) and (C.9) ensure that M 1/2 kM n /(n2n )
and
=
−1 M 1/2 kM −1 = (M kM ξn−1 n−2δ )1/2 (kM n−1 )1/2 n n
=
o[(M kM ξn−1 n−2δ )1/2 ],
Jo
M 1/2 kM 2n /(n2n )
(A.26)
1/2
= M 1/2 kM ξn−1/2 n−δ (ξn n−1 )1/2 (kM n−1 n2δ )1/2
= O[(M kM ξn−1 n−2δ )1/2 ].
(A.27)
Therefore, from (A.19), (A.21), (A.22), (A.24), (A.25), and using (A.4), (A.26),
485
(A.27), and Condition (C.6), we see that the last five terms of (A.16) are all asymptotically dominated by 2n u0 Λu. This completes the proof of (14) of
Theorem 2. 30
Journal Pre-proof
References
490
of
Ando, T., & Li, K. C. (2014). A model-averaging approach for high-dimensional regression. Journal of the American Statistical Association, 109 , 254–265.
p ro
Ando, T., & Li, K. C. (2017). A weight-relaxed model averaging approach for high dimensional generalized linear models. Annals of Statistics, 45 , 2654– 2679.
Buckland, S. T., Burnham, K. P., & Augustin, N. H. (1997). Model selection: an integral part of inference. Biometrics, 53 , 603–618.
Pr e-
495
Chen, J., Li, D., Linton, O., & Lu, Z. (2018). Semiparametric ultra-high dimensional model averaging of nonlinear dynamic time series. Journal of the American Statistical Association, 113 , 919–932.
Cheng, X., & Hansen, B. E. (2015). Forecasting with factor-augmented regres500
sion: A frequentist model averaging approach. Journal of Econometrics, 186 , 280–293.
al
Claeskens, G. (2016). Statistical model choice. Annual Review of Statistics and Its Application, 3 , 233–256.
505
urn
Claeskens, G., & Hjort, N. L. (2008). Model Selection and Model Averaging. Cambridge: Cambridge University Press. Fan, J., & Peng, H. (2004). Nonconcave penalized likelihood with a diverging number of parameters. Annals of Statistics, 32 , 928–961. Fern´andez, C., Ley, E., & Steel, M. F. J. (2001). Benchmark priors for Bayesian
Jo
model averaging. Journal of Econometrics, 100 , 381–427.
510
Fragoso, T. M., Bertoli, W., & Louzada, F. (2018). Bayesian model averaging: a systematic review and conceptual classification. International Statistical Review , 86 , 1–28.
31
Journal Pre-proof
Gao, Y., Luo, M., & Zou, G. (2016a). Forecasting with model selection or model
515
metrica A: Transport Science, 12 , 366–384.
of
averaging: a case study for monthly container port throughput. Transport-
Gao, Y., Zhang, X., Wang, S., & Zou, G. (2016b). Model averaging based on
p ro
leave-subject-out cross-validation. Journal of Econometrics, 192 , 139–151. Hamilton, J. (1994). Time Series Analysis. Princeton: Princeton University Press.
Hansen, B. E. (2007). Least squares model averaging. Econometrica, 75 , 1175– 1189.
Pr e-
520
Hansen, B. E. (2008). Least squares forecast averaging. Journal of Econometrics, 146 , 342–350.
Hansen, B. E. (2014). Model averaging, asymptotic risk, and regressor groups. 525
Quantitative Economics, 5 , 495–530.
Hansen, B. E., & Racine, J. (2012). Jackknife model averaging. Journal of
al
Econometrics, 167 , 38–46. Hjort, N. L., & Claeskens, G. (2003). Frequentist model average estimators. Journal of the American Statistical Association, 98 , 879–899. Hoeting, J. A., Madigan, D., Raftery, A. E., & Volinsky, C. T. (1999). Bayesian
urn
530
model averaging: a tutorial. Statistical Science, 14 , 382–417. Hurvich, C. M., Simonoff, J. S., & Tsai, C. L. (1998). Smoothing parameter selection in nonparametric regression using an improved Akaike information
Jo
criterion. Journal of the Royal Statistical Society, Series B , 60 , 271–293. 535
Hurvich, C. M., & Tsai, C. L. (1989). Regression and time series model selection in small samples. Biometrika, 76 , 297–307.
Ley, E., & Steel, M. F. J. (2009). On the effect of prior assumptions in Bayesian model averaging with applications to growth regression. Journal of Applied Econometrics, 24 , 651–674. 32
Journal Pre-proof
540
Li, K. C. (1987). Asymptotic optimality for cp , cl , cross-validation and general-
of
ized cross-validation: Discrete index set. Annals of Statistics, 15 , 958–975. Liao, J., Zong, X., Zhang, X., & Zou, G. (2019a). Model averaging based on leave-subject-out cross-validation for vector autoregressions. Journal of
545
p ro
Econometrics, 209 , 35–60.
Liao, J., Zou, G., & Gao, Y. (2019b). Spatial Mallows model averaging for geostatistical data. Canadian Journal of Statistics, 47 , 336–351. Liao, J.-C., & Tsay, W.-J. (2016). Multivariate least squares forecasting aver-
Pr e-
aging by vector autoregressive models. Working paper.
Liu, C.-A. (2015). Distribution theory of the least squares averaging estimator. 550
Journal of Econometrics, 186 , 142–159.
Liu, Q., & Okui, R. (2013). Heteroskedasticity-robust cp model averaging. The Econometrics Journal , 16 , 463–472.
Madigan, D., & Raftery, A. E. (1994). Model selection and accounting for
555
al
model uncertainty in graphical models using Occam’s window. Journal of the American Statistical Association, 89 , 1535–1546.
urn
Naik, P. A., & Tsai, C. L. (2001). Single-index model selections. Biometrika, 88 , 821–832.
Raftery, A. E., Madigan, D., & Hoeting, J. A. (1997). Bayesian model averaging for linear regression models. Journal of the American Statistical Association, 560
92 , 179–191.
Jo
Tuddenham, R. D., & Snyder, M. M. (1954). Physical growth of california boys and girls from birth to eighteen years. Univ. of Calif. Publications in Child Development, 1 , 183–364.
Wan, A. T. K., Zhang, X., & Zou, G. (2010). Least squares model averaging by
565
Mallows criterion. Journal of Econometrics, 156 , 277–283.
33
Journal Pre-proof
Journal of Econometrics, 146 , 329–341.
of
Wright, J. H. (2008). Bayesian model averaging and exchange rate forecasts.
Xie, T. (2015). Prediction model averaging estimator. Economics Letters, 131 ,
570
p ro
5–8.
Ye, C., Yang, Y., & Yang, Y. (2018). Sparsity oriented importance learning for high-dimensional linear regression. Journal of the American Statistical Association, 113 , 1797–1812.
ters, 130 , 120–123. 575
Pr e-
Zhang, X. (2015). Consistency of model averaging estimators. Economics Let-
Zhang, X., & Liu, C. (2019). Inference after model averaging in linear regression models. Econometric Theory, 35 , 816–841.
Zhang, X., Ullah, A., & Zhao, S. (2016). On the dominance of mallows model averaging estimator over ordinary least squares estimator. Economics Letters, 142 , 69–73.
Zhang, X., Wan, A. T. K., & Zou, G. (2013). Model averaging by jackknife
al
580
criterion in models with dependent data. Journal of Econometrics, 174 , 82–
urn
94.
Zhang, X., Zou, G., & Carroll, R. (2015). Model averaging based on KullbackLeibler distance. Statistica Sinica, 25 , 1583–1598. 585
Zhao, S., Zhang, X., & Gao, Y. (2016). Model averaging with averaging covari-
Jo
ance matrix. Economics Letters, 145 , 214–217.
34