ARTICLE IN PRESS
Journal of Econometrics 128 (2005) 99–136 www.elsevier.com/locate/econbase
VAR forecasting under misspecification

Frank Schorfheide

Department of Economics, University of Pennsylvania, McNeil Building, 3718 Locust Walk, Philadelphia, PA 19104-6297, USA

Received 1 March 2004
Available online 8 October 2004
Abstract

The paper considers multi-step forecasting of a stationary vector process under a quadratic loss function with a collection of finite-order vector autoregressions (VAR). Under severe misspecification it is preferable to use the multi-step loss function also for parameter estimation. We propose a modification to Shibata’s (Ann. Statist. 8 (1980) 147) final prediction error criterion to jointly choose the VAR lag order and one of two predictors: the maximum likelihood estimator plug-in predictor or the loss function estimator plug-in predictor. A Monte Carlo experiment illustrates the theoretical results and documents the empirical performance of the selection criterion. © 2004 Elsevier B.V. All rights reserved.

JEL classification: C11; C32; C53

Keywords: Forecasting; Loss function estimation; Model selection
Tel.: +215.898.8486; fax: +215.573.2057. E-mail address: [email protected] (F. Schorfheide).
0304-4076/$ - see front matter © 2004 Elsevier B.V. All rights reserved. doi:10.1016/j.jeconom.2004.08.009

1. Introduction

This paper considers the problem of forecasting a stationary vector process with a collection of finite-order vector autoregressions (VAR) several periods into the future. The forecasts are evaluated under a quadratic prediction error loss function.
Two important practical questions are whether the parameter estimates should depend on the forecast horizon and how to choose the lag order of the VAR. The existing literature provides a variety of answers to these questions. For instance, with regard to parameter estimation, Granger (1993) proposed that if one believes that a particular criterion should be used to evaluate forecasts (multi-step forecast errors in the context of this paper), then it should also be used at the estimation stage of the modelling process. This statement reflects serious concern about model misspecification. In the absence of misspecification the parameter estimates should be based on the likelihood function of the forecast model instead of the loss function of the forecaster. Maximum likelihood or quasi-maximum likelihood estimators (MLE) tend to have less sampling variability than loss function based estimators (LFE). In the presence of misspecification, however, LFEs are potentially preferable because the use of the relevant loss function can reduce the expected forecast error by favorably tilting the misspecification bias of the parameter estimates. As the sample size increases, the effect of parameter uncertainty on the prediction risk vanishes while the effect of the estimation bias prevails. For this reason, LFEs are frequently used in practice and have received considerable attention in the econometrics literature. In the context of multi-step ahead forecasting, LFEs are also called multi-step or direct estimators and have been studied by, among others, Bhansali (1997), Clements and Hendry (1998), Findley (1983), Ing (2003), and Weiss (1991). Estimators based on alternative loss functions have been analyzed by, for instance, Christoffersen and Diebold (1996, 1997), Skoros (2002), Tiao and Tsay (1994), Tsay (1993), and Weiss (1996).
As far as macroeconomic forecasting with autoregressive models is concerned, it is by no means clear that an LFE plug-in predictor is always the best choice. For instance, Marcellino et al. (2004) undertake a large-scale empirical comparison of MLE versus LFE plug-in predictors using data on more than 150 monthly macroeconomic time series. They find that MLE plug-in predictions tend to yield smaller forecast errors, in particular in high-order autoregressions and for long forecast horizons. For series measuring wages, prices, and money, on the other hand, LFE plug-in predictors improve upon MLE plug-in predictors in low-order autoregressions.

The degree of model misspecification is closely tied to the choice of lag order. As the dimensionality of the forecasting model increases, the misspecification, and hence the benefit from loss function estimation, potentially decreases. On the issue of order selection Diebold (1998, pp. 90–91) notes: “In practical forecasting we usually report and examine both Akaike information criterion (AIC) and Schwarz information criterion (SIC). Most often they select the same model. When they do not, and in spite of the theoretical asymptotic efficiency property of AIC, many authors recommend the use of the more parsimonious model selected by SIC, other things equal. This accords with the keep it sophisticatedly simple (KISS) principle and with results of studies comparing out-of-sample forecasting performance of models selected by various criteria.”

Schwarz’s (1978) SIC has the property that it chooses the “true” number of lags with probability tending to one in large samples, if the data have been generated
from a finite order VAR. Selection criteria with this property are called consistent. If, however, the data generating process is not contained in the set of candidate forecast models, the use of a consistent model selection rule such as SIC is potentially suboptimal. Selecting a model that is of higher dimension may lead to a reduction of the misspecification bias that outweighs the larger parameter uncertainty. Shibata (1980) proved a result of this nature in the context of forecasting an infinite-dimensional linear process with a collection of finite-order autoregressions. However, as pointed out for instance by Speed and Yu (1993), the battle between SIC and AIC depends crucially on the rate at which the misspecification bias decays as the model complexity increases. Hence, Diebold’s statement reflects the view that model misspecification in practice is often not sufficiently large to justify a departure from consistent model selection procedures.

The main contribution of this paper is to propose a modification of Shibata’s (1980) final prediction error criterion, which we will refer to as prediction criterion (PC). The PC can be used to jointly choose between maximum likelihood and loss function based estimation and to select the lag length of the forecasting model. In practice, PC could also be used to choose the estimator conditional on a lag length that has been determined with one of the well-known order selection criteria such as AIC or SIC. Under the assumption that the data are generated from a covariance stationary infinite-order vector moving average (VMA) process, we show that the proposed selection criterion provides an asymptotically unbiased estimate of the prediction risk.
It is typically assumed in the literature on prediction with autoregressive models that the data generating process (DGP) is fixed and the class of candidate forecasting models is increasing with sample size, e.g., Bhansali (1996), Ing and Wei (2003), Shibata (1980), and Speed and Yu (1993). Thus, the discrepancy between the best estimated forecasting model and the DGP vanishes asymptotically. We follow the opposite approach. We keep the class of forecasting models fixed and let the degree of misspecification asymptotically vanish by assuming that our sample is generated from a drifting infinite-dimensional process that approaches a $p$th order VAR at rate $T^{-1/2}$. We refer to this setup as local misspecification. The drift rate $T^{-1/2}$ balances the trade-off between estimation bias and variance. Along the asymptote the ranking of the forecast models and estimation procedures remains the same. In this setup the degree of misspecification is “too small” to be consistently estimable. Hence, PC provides only an asymptotically unbiased estimate of the final prediction risk but not a consistent estimate as in Shibata’s (1980) framework. We can show that in our framework consistent model selection criteria and MLE based predictors are preferable if the degree of (local) misspecification is small, whereas loss function based predictors and order selection criteria such as the proposed PC work better if the misspecification is large.

The paper is organized as follows. Section 2 introduces the notation, the data generating process, the forecasting models, and defines the plug-in predictors. In Section 3 the frequentist prediction risk of the two predictors is calculated conditional on the VAR lag length. Our modification of the final prediction error criterion is presented and analyzed in Section 4. Section 5 presents a small Monte
Carlo (MC) experiment to illustrate the theoretical results and the last section concludes. Mathematical proofs and derivations are collected in Appendix A. Appendix B describes the design of the MC experiment.
2. Notation and setup

This paper considers the problem of forecasting a stationary vector time series $h$ periods into the future with a VAR of order $p \le q$. The forecaster observes a realization of the $n$-dimensional process $\{y_t\}_{t=1}^{T}$ that can be used to estimate the parameters and the lag length of the VAR. The $h$-step ahead predictor is denoted by $\hat{y}_{T+h}$.

Assumption 1 (Forecasting models). The forecasts $\hat{y}_{T+h}$ are based on $p$th order VARs of the form
\[ y_t = \phi_1 y_{t-1} + \cdots + \phi_p y_{t-p} + \epsilon_t, \tag{1} \]
where $\phi_1, \ldots, \phi_p$ are $n \times n$ coefficient matrices and $0 \le p \le q$.

The presence of correctly modelled deterministic components does not affect our conclusions in a substantive manner. Hence, we will proceed as if they are absent to keep derivations as simple as possible. A quadratic loss function is used to assess the VAR forecasts. Let $\mathrm{tr}[\cdot]$ denote the trace operator.

Assumption 2 (Loss function). The forecasts are evaluated under the quadratic prediction error loss function
\[ L(y_{T+h}, \hat{y}_{T+h}) = \mathrm{tr}\big[ W (y_{T+h} - \hat{y}_{T+h})(y_{T+h} - \hat{y}_{T+h})' \big]. \]
$W$ is a symmetric and positive-definite weight matrix.

To obtain a concise notation that is consistent with VARs of different lag lengths we write the forecasting model (1) in $q$-dimensional companion form. Let $0_n$ denote an $n \times n$ matrix of zeros, and $I_n$ the $n \times n$ identity matrix. Define
\[
Y_t = \begin{bmatrix} y_t \\ y_{t-1} \\ \vdots \\ y_{t-q+1} \end{bmatrix},
\qquad
\Phi = \begin{bmatrix}
\phi_1 & \cdots & \phi_p & 0_n & \cdots & 0_n \\
I_n & 0_n & \cdots & \cdots & 0_n & 0_n \\
\vdots & \ddots & & & \ddots & \vdots \\
0_n & \cdots & \cdots & 0_n & I_n & 0_n
\end{bmatrix},
\]
where $\Phi$ is $nq \times nq$. Moreover, let
\[
R_p = \begin{bmatrix}
0_n & \cdots & 0_n \\
\vdots & & \vdots \\
0_n & \cdots & 0_n \\
I_n & \cdots & 0_n \\
\vdots & \ddots & \vdots \\
0_n & \cdots & I_n
\end{bmatrix},
\qquad
M = \begin{bmatrix} I_n \\ 0_n \\ \vdots \\ 0_n \end{bmatrix},
\qquad
E_t = \begin{bmatrix} \epsilon_t \\ 0_{n \times 1} \\ \vdots \\ 0_{n \times 1} \end{bmatrix}.
\]
$M$ is an $nq \times n$ matrix, and $E_t$ is $nq \times 1$. Thus,
\[ Y_t = \Phi Y_{t-1} + E_t \tag{2} \]
and $y_t = M' Y_t$. The VAR($p$) model in $q$-companion form satisfies the restriction $M' \Phi R_p = 0$, where $R_p$ is an $nq \times n(q-p)$ matrix. The extended companion form notation and the restriction function will be useful subsequently to compare forecasting models of different orders.

We will now derive the MLE and LFE plug-in predictors of $y_{T+h}$. Define $S_{T,kl} = \sum_{t=1}^{T} Y_{t-k} Y_{t-l}'$. Maximization of the Gaussian quasi-likelihood function associated with forecasting model (1) subject to the VAR($p$) lag length restriction $M' \Phi R_p = 0$ leads to the following estimator of the VAR coefficients:
\[ \hat{\Phi}_T(\mathrm{mle}, p) = S_{T,01} S_{T,11}^{-1} \big[ I_{nq} - R_p (R_p' S_{T,11}^{-1} R_p)^{-1} R_p' S_{T,11}^{-1} \big]. \tag{3} \]
The $h$-step ahead predictor is constructed from an estimate of $\Psi = \Phi^h$. We will use $\hat{\Psi}$ to denote point estimates of $\Phi^h$; $\hat{\Phi}^h$ signifies the $h$th power of the estimator $\hat{\Phi}$. To obtain the maximum-likelihood plug-in predictor for $y_{T+h}$ let $\hat{\Psi}_T(\mathrm{mle}, p) = \hat{\Phi}_T^h(\mathrm{mle}, p)$.

Predictor 1. The MLE plug-in predictor$^{1}$ is defined as
\[ \hat{y}_{T+h}(\mathrm{mle}, p) = M' \hat{\Psi}_T(\mathrm{mle}, p) Y_T. \]
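The construction of the MLE (iterated) plug-in predictor can be illustrated with a minimal NumPy sketch. It is not the paper's code and it simplifies in two labeled ways: the companion matrix has dimension $np$ rather than $nq$, so the zero restriction on lags beyond $p$ is imposed by simply leaving those lags out, and the first $p$ observations serve as initial conditions. The function names `fit_var_mle`, `companion`, and `iterated_forecast` are hypothetical.

```python
import numpy as np

def fit_var_mle(y, p):
    """Gaussian quasi-MLE of [phi_1 ... phi_p] by least squares; returns shape (n, n*p)."""
    T, n = y.shape
    targets = y[p:]                                    # y_t for t = p, ..., T-1 (0-based)
    # Column block k holds y_{t-1-k}, so the regressor row is (y_{t-1}, ..., y_{t-p}).
    regs = np.hstack([y[p - 1 - k: T - 1 - k] for k in range(p)])
    coef, *_ = np.linalg.lstsq(regs, targets, rcond=None)
    return coef.T

def companion(phi, n, p):
    """Stack [phi_1 ... phi_p] on top of the shift rows [I 0] (assumption: p-dimensional form)."""
    F = np.zeros((n * p, n * p))
    F[:n, :] = phi
    F[n:, :-n] = np.eye(n * (p - 1))
    return F

def iterated_forecast(y, phi, h):
    """h-step forecast: first n rows of F^h applied to Y_T = (y_T, y_{T-1}, ..., y_{T-p+1})."""
    n = y.shape[1]
    p = phi.shape[1] // n
    F = companion(phi, n, p)
    Y_T = y[::-1][:p].reshape(-1)
    return (np.linalg.matrix_power(F, h) @ Y_T)[:n]
```

A call such as `iterated_forecast(y, fit_var_mle(y, 2), 4)` would return a four-step-ahead forecast from a VAR(2) for a `(T, n)` array `y`.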
Loss function estimation procedures are based on the idea that in a large sample the observed frequencies of hypothetical prediction losses at time $t \le T$ are a reliable indicator for the frequentist risk associated with different parameter choices for the prediction function $M' \Psi Y_T$.$^{2}$ The loss function estimator of $\Psi$ subject to zero restrictions on higher order lags is
\[ \hat{\Psi}_T(\mathrm{lfe}, p) = S_{T,0h} S_{T,hh}^{-1} \big[ I_{nq} - R_p (R_p' S_{T,hh}^{-1} R_p)^{-1} R_p' S_{T,hh}^{-1} \big]. \tag{4} \]
Since the loss function is quadratic the estimator does not depend on the weight matrix $W$. It can be used to construct the following predictor of $y_{T+h}$.

Predictor 2. The LFE plug-in predictor is defined as
\[ \hat{y}_{T+h}(\mathrm{lfe}, p) = M' \hat{\Psi}_T(\mathrm{lfe}, p) Y_T. \]

$^{1}$ $\hat{y}_{T+h}(\mathrm{mle}, p)$ is sometimes called the “iterated” forecast as it entails first estimating a VAR and then iterating upon the VAR to obtain multi-step forecasts.

$^{2}$ If the average prediction loss does not converge to the frequentist risk, then loss function estimation is difficult to interpret. This paper focuses on models for which the convergence occurs.
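The LFE (direct) predictor amounts to regressing $y_t$ on the lagged block dated $t-h$. The following NumPy sketch illustrates this under stated assumptions: only the first $p$ lags are included as regressors (which imposes the zero restriction on higher lags), the first $h+p-1$ observations are used as initial conditions, and the names `fit_direct` and `direct_forecast` are hypothetical.

```python
import numpy as np

def fit_direct(y, p, h):
    """Least-squares regression of y_t on (y_{t-h}, ..., y_{t-h-p+1}); returns shape (n, n*p)."""
    T, n = y.shape
    first = h + p - 1                                  # first t with a full set of regressors
    targets = y[first:]
    # Column block k holds y_{t-h-k} for t = first, ..., T-1.
    regs = np.hstack([y[first - h - k: T - h - k] for k in range(p)])
    coef, *_ = np.linalg.lstsq(regs, targets, rcond=None)
    return coef.T

def direct_forecast(y, psi):
    """Apply the direct h-step coefficients to Y_T = (y_T, y_{T-1}, ..., y_{T-p+1})."""
    n = y.shape[1]
    p = psi.shape[1] // n
    Y_T = y[::-1][:p].reshape(-1)
    return psi @ Y_T
```

For $h = 1$ the regression coincides with the one-step least-squares fit, which matches the remark in the text that the Gaussian MLE is obtained as a special case.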
While Predictors 1 and 2 are widely used in practice, they are not the only forecasting procedures available. For instance, VAR forecasts are often generated with Bayesian techniques such as in Doan et al. (1984), Litterman (1986), Ingram and Whiteman (1994), Kadiyala and Karlson (1997), and Del Negro and Schorfheide (2004). Although the posterior mean forecasts from a Bayesian VAR are asymptotically equivalent to the MLE plug-in predictors considered in this paper, their small sample properties tend to be better than those of MLE plug-in predictors. One insight of the literature on Bayesian VARs is that carefully chosen informative prior distributions can reduce the sampling variability of the predictor substantially without introducing large and unfavorable biases. To capture these finite sample effects in the theoretical analysis is beyond the scope of this paper.

In practice a forecaster faces a sample of fixed size $T$. We assume that it has been generated from a covariance stationary DGP with a possibly infinite-dimensional VMA representation. The goal is to study the expected loss (risk) of Predictors 1 and 2 by considering an asymptotic approximation that is valid as $T \to \infty$. If the DGP and the lag length $p$ of the forecasting model are fixed then the variance of the parameter estimators decreases at rate $O(T^{-1})$ whereas the misspecification bias is of order $O(1)$. Hence, if the conditional mean of the VAR($p$) is misspecified the LFE predictor will eventually dominate the MLE predictor along this asymptote. If the VAR lag length is estimated subject to the restriction $p \le q$ for some fixed $q$, then the VAR($q$) will eventually outperform its competitors if the DGP is not nested in the VAR specifications under consideration. To obtain more insightful asymptotic results that capture the trade-offs in finite-sample prediction one has to let the degree of model misspecification shrink as $T \to \infty$. This can be achieved in two ways.
On the one hand, one can let the dimensionality of the forecast models increase with the sample size. This is the approach taken by, for instance, Bhansali (1996), Ing and Wei (2003), and Shibata (1980). On the other hand, one can keep the set of forecasting models fixed and let the DGP drift toward the class of forecasting models as $T$ tends to infinity. We follow the second approach. Data are generated from the drifting vector autoregressive moving average process
\[ y_t = M' \Phi Y_{t-1} + \epsilon_t + \frac{\alpha}{\sqrt{T}} \sum_{j=1}^{\infty} M' A_j M \epsilon_{t-j}. \tag{5} \]
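To make the drifting process concrete, here is a minimal simulation sketch. It departs from the setup in the text in ways that are assumptions of the sketch only: the autoregressive part is a VAR(1), the moving-average sum is truncated after a finite number of terms, the innovations are standard normal, and the moving-average coefficients are not rescaled to satisfy the paper's normalization. The function name `simulate_drifting_dgp` and its arguments are hypothetical.

```python
import numpy as np

def simulate_drifting_dgp(T, phi1, ma_coefs, alpha, rng, burn=200):
    """Simulate y_t = phi1 y_{t-1} + eps_t + (alpha / sqrt(T)) * sum_j A_j eps_{t-j}.

    ma_coefs is a list [A_1, ..., A_J] of n x n matrices (truncated MA part).
    Returns the last T observations of y and the matching innovations.
    """
    n = phi1.shape[0]
    J = len(ma_coefs)
    total = T + burn + J
    eps = rng.standard_normal((total, n))
    y = np.zeros((total, n))
    scale = alpha / np.sqrt(T)                 # misspecification shrinks at rate T^(-1/2)
    for t in range(J, total):
        ma = sum(ma_coefs[j - 1] @ eps[t - j] for j in range(1, J + 1))
        y[t] = phi1 @ y[t - 1] + eps[t] + scale * ma
    return y[-T:], eps[-T:]
```

With `alpha = 0` the simulated series is an exact VAR(1); larger values of `alpha` strengthen the omitted moving-average component for a given sample size.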
$\Phi$ is an $nq \times nq$ companion form matrix for a VAR($q$) process. For any fixed sample size the DGP is a potentially infinite-dimensional linear process. The drift rate $T^{-1/2}$ balances the trade-off between bias and efficiency of different estimators. If the misspecification vanishes at a rate slower than $T^{-1/2}$ (severe misspecification) then the asymptotic prediction error loss is solely determined by the estimation bias. If it decays at a rate faster than $T^{-1/2}$ (negligible misspecification) then only the variance of the estimators matters asymptotically. Along the $T \to \infty$ asymptote the ranking of predictors stays the same. The setup captures the notion that the VARs provide a fairly good, albeit not perfect, approximation of reality. We refer to the smallest $p$ such that $M' \Phi R_p = 0$ as the asymptotic lag order $p_*$ of the DGP. The coefficients of the drifting disturbance process are normalized as follows:
\[ E[\epsilon_t' \epsilon_t] = \sum_{j=1}^{\infty} E[\epsilon_t' M' A_j' M M' A_j M \epsilon_t]. \tag{6} \]
Thus, conditional on the covariance matrix $E[\epsilon_t \epsilon_t']$ the parameter $\alpha$ controls the size of the misspecification. Let $\|A\| = (\mathrm{tr}[AA'])^{1/2}$ and $\|A\|_\Omega = (\mathrm{tr}[\Omega(AA')])^{1/2}$ for symmetric and positive-definite weight matrices $\Omega$. We make the following assumptions about the DGP.

Assumption 3.
(i) The largest eigenvalue of $\Phi$ is less than one in absolute value.
(ii) The sequence of $nq \times nq$ matrices $\{A_j\}_{j=0}^{\infty}$ satisfies the following summability condition: $\sum_{j=0}^{\infty} j^2 \|A_j\| < \infty$.
(iii) $\{\epsilon_t\}$ is a sequence of independent, $n$-dimensional, mean zero random variates with $E[\epsilon_t \epsilon_t'] = \Sigma_\epsilon$.
(iv) The $\epsilon_t$’s are uniformly Lipschitz over all directions, that is, there exist $K > 0$, $\delta > 0$, and $\nu > 0$ such that for all $0 \le w - u \le \delta$,
\[ \sup_{\nu'\nu = 1} P\{ u < \nu' \epsilon_t < w \} \le K (w - u)^{\nu}. \]
(v) There exists an $\eta > 0$ such that $E[\|\epsilon_t' \epsilon_t\|^{3h+\eta}] < \infty$.

Assumptions 3(i) and (ii) guarantee that for any fixed $T$ the DGP is stationary. Assumptions 3(iii)–(v) ensure that the finite sample moments of Predictors 1 and 2 eventually exist (see Appendix A.2 for details). The Lipschitz condition (iv) is discussed in Findley and Wei (2002) and is needed to bound the moments of $S_{T,11}^{-1}$ and $S_{T,hh}^{-1}$ in (3) and (4). All elliptically distributed $\epsilon_t$’s with bounded density have the uniform Lipschitz property, including Gaussian and $t$-distributed random variables. In order to be able to carry out our theoretical analysis, we are essentially focussing on the class of stationary DGPs that have a Wold representation. In practice a forecaster might suspect that the DGP has a particular nonlinear structure, e.g., a regime-switching or threshold autoregressive structure. If pre-tests, e.g., Bierens (1990) or Corradi and Swanson (2002), strongly confirm the suspected
nonlinearities then it will be best to incorporate them directly into the forecasting model. Our setup restricts the misspecification of the forecasting models to omitted autoregressive and moving average terms. Practitioners, of course, face additional challenges that are not addressed in this paper, such as misspecified trend and seasonal components, parameter changes in the form of drifting coefficients or structural breaks, and poorly measured forecast origins. While in applications the estimation sample is typically the same as the forecasting sample, the theoretical literature makes a distinction between independent and same sample prediction.

Assumption 4. The process to be predicted $\{\tilde{y}_t\}_{t=1}^{T+h}$ is independent of the process $\{y_t\}_{t=1}^{T}$ that is used for parameter and lag length estimation but otherwise has the same probabilistic structure.

The assumption of independent sample prediction is made for mathematical convenience, e.g., Baillie (1979), Lewis and Reinsel (1985, 1988), Reinsel (1980), and Shibata (1980). Results for same sample prediction risks of forecasts from autoregressive models can be found, for instance, in Fuller and Hasza (1981), Ing and Wei (2003), and Ing (2003). These authors find that under stationary DGPs same and independent sample prediction risks are identical up to order $O(T^{-1})$. Since we do not attempt to derive terms of order $O(T^{-3/2})$ or higher in this paper we proceed under the assumption of independent sample prediction.

3. Multi-step forecasting with a VAR(p)

The derivation of the risk associated with Predictors 1 and 2 proceeds in three steps. First, we calculate pseudo-optimal parameter values for the misspecified VARs that minimize the risk of the prediction function $M' \Psi \tilde{Y}_T$ with respect to $\Psi$. Second, the limit distribution of $\hat{\Psi}_T(\mathrm{mle}, p)$ and $\hat{\Psi}_T(\mathrm{lfe}, p)$ is derived. Third, we obtain the asymptotic prediction risk from the limit distribution by showing that the sequence $\{\sqrt{T}(\hat{\Psi}_T(i, p) - \Phi^h)\}_{T \ge T_*}$, $i = \mathrm{mle}, \mathrm{lfe}$, is uniformly integrable. In the context of forecasting with correctly specified univariate autoregressive models uniform integrability was proved by Fuller and Hasza (1981). We verify the uniform integrability using moment bound theorems of Findley and Wei (1993, 2002).

The following additional notation is used subsequently to characterize the limit behavior of the estimators: $\mathrm{vec}(\cdot)$ stacks the columns and $\mathrm{vecr}(\cdot)$ the rows of a matrix. Let $A_0 = 0_{nq \times nq}$ and $A(L) = \sum_{j=0}^{\infty} A_j L^j$, where $L$ denotes the lag operator. Let $Z_t = A(L) M \epsilon_t$ and $\Sigma_{EE} = M \Sigma_\epsilon M'$. Moreover, we introduce asymptotic covariance matrices for the companion form processes $Y_t$ and $Z_t$:
\[ \Gamma_{YY,h} = \lim_{T \to \infty} E[Y_{T+h} Y_T'] = \sum_{j=0}^{\infty} \Phi^{j+h} \Sigma_{EE} \Phi^{j\prime}, \]
\[ \Gamma_{ZY,h} = \lim_{T \to \infty} E[Z_{T+h} Y_T'] = \sum_{j=0}^{\infty} A_{j+h} \Sigma_{EE} \Phi^{j\prime}. \]
As the sample size tends to infinity, the influence of the distortion $\alpha T^{-1/2} M' A(L) M \epsilon_t$ on the covariance of $Y_T$ vanishes.

3.1. Pseudo-optimal values for a VAR(p)

The optimal predictor of a future observation $\tilde{y}_{T+h}$ generated from the DGP is the conditional mean
\[ \hat{y}^{opt}_{T+h} = E_T[M' \tilde{Y}_{T+h}], \tag{7} \]
where the expectation is taken conditional on the (infinite) history of the process up to time $T$ and the parameters $\alpha$, $\Phi$, and $A(L)$. The expected loss of $\hat{y}^{opt}_{T+h}$ provides a lower bound for the frequentist risk of any estimator. We normalize the prediction risk $R(\hat{y}_{T+h})$ of a predictor $\hat{y}_{T+h}$ as follows:
\[ R(\hat{y}_{T+h}) = E[\|\tilde{y}_{T+h} - \hat{y}_{T+h}\|_W^2] - E[\|\tilde{y}_{T+h} - \hat{y}^{opt}_{T+h}\|_W^2] = E[\|\hat{y}_{T+h} - \hat{y}^{opt}_{T+h}\|_W^2] \ge 0. \tag{8} \]
The relative ranking of predictors is not affected by this normalization. Due to the moving-average component in the DGP the conditional mean $E_T[M' \tilde{Y}_{T+h}]$ is a function of past $\epsilon_t$’s in addition to $\tilde{Y}_T$. Hence, the predictor constructed from the forecasting model will in general be suboptimal. We will refer to the value of $\Psi$ that minimizes the prediction risk of $M' \Psi \tilde{Y}_T$ given a particular parameterization of the DGP as the pseudo-optimal value (POV).$^{3}$ Define
\[ Q_p = \Gamma_{YY,0}^{-1} - \Gamma_{YY,0}^{-1} R_p (R_p' \Gamma_{YY,0}^{-1} R_p)^{-1} R_p' \Gamma_{YY,0}^{-1}. \tag{9} \]

Lemma 1. Suppose $M' \Phi R_p = 0$. The pseudo-optimal value that minimizes the risk $R(M' \Psi \tilde{Y}_T)$ with respect to $\Psi$ subject to the restriction $M' \Psi R_p = 0$ (only the first $p$ lags enter the prediction function) satisfies
\[ M' \tilde{\Psi}_T(\mathrm{pov}, p) = M' \Phi^h + \alpha T^{-1/2} \mu(\mathrm{pov}, p) + \alpha O(T^{-1}), \]
where
\[ \mu(\mathrm{pov}, p) = M' \sum_{j=0}^{h-1} \Phi^j \Gamma_{ZY,h-j} Q_p. \]

We will show in the next section that the limit distribution of the loss function estimator is centered at the pseudo-optimal parameter $\tilde{\Psi}_T(\mathrm{pov}, p)$, whereas the limit distribution of the maximum likelihood estimator is not.

$^{3}$ The concept of pseudo-optimal values has been widely used in econometrics. Our definition is most closely related to the one used in the indirect inference literature, e.g., Gourieroux et al. (1993), and by Skoros (2002).
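3.2. Limit distribution

As an intermediate step in the calculation of the prediction risk the limit distribution for the two estimators is derived. According to the DGP, $\tilde{Y}_{T+h}$ is given by
\[ Y_{T+h} = \Phi^h Y_T + \sum_{j=0}^{h-1} \Phi^j \big( \alpha T^{-1/2} Z_{T+h-j} + E_{T+h-j} \big). \tag{10} \]
The loss function estimator can be expressed as
\[
\begin{aligned}
\hat{\Psi}_T(\mathrm{lfe}, p) ={}& \Phi^h - \Phi^h R_p (R_p' S_{T,hh}^{-1} R_p)^{-1} R_p' S_{T,hh}^{-1} \\
&+ \alpha T^{-1/2} \Bigg( \sum_{j=0}^{h-1} \sum_{t=1}^{T} \Phi^j Z_{t-j} Y_{t-h}' \Bigg) Q^{(h)}_{T,p} \\
&+ \Bigg( \sum_{j=0}^{h-1} \sum_{t=1}^{T} \Phi^j E_{t-j} Y_{t-h}' \Bigg) Q^{(h)}_{T,p},
\end{aligned}
\tag{11}
\]
where
\[ Q^{(h)}_{T,p} = S_{T,hh}^{-1} - S_{T,hh}^{-1} R_p (R_p' S_{T,hh}^{-1} R_p)^{-1} R_p' S_{T,hh}^{-1}. \]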
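If the number of lags $p$ of the estimated VAR is greater than or equal to the lag order $p_*$ of the DGP then $M' \Phi^h R_p = 0$ and the second term drops out. The third and fourth terms determine the bias and variance of the limit distribution, respectively. For $h = 1$ one obtains the Gaussian maximum likelihood estimator. The following theorem characterizes the joint limit distribution of the MLE and LFE for lag lengths $p$ and $q$. Notice that $Q_q = \Gamma_{YY,0}^{-1}$.

Theorem 1. Suppose the DGP satisfies Assumption 3 and the lag orders of the estimated VARs are greater than or equal to $p_*$; that is, $p_0, q \ge p_*$. Then
\[ M' \hat{\Psi}_T(i, p) = M' \Phi^h + T^{-1/2} [\alpha \mu(i, p) + \zeta_T(i, p)] + o_p(T^{-1/2}) \]
for $p = p_0, q$ and $i = \mathrm{mle}, \mathrm{lfe}$. Moreover,
\[ \{\zeta_T(i, p)\}_{p = p_0, q;\; i = \mathrm{mle}, \mathrm{lfe}} \Longrightarrow \{\zeta(i, p)\}_{p = p_0, q;\; i = \mathrm{mle}, \mathrm{lfe}}, \]
where$^{4}$
\[
\begin{bmatrix} \zeta(\mathrm{mle}, p_0) \\ \zeta(\mathrm{lfe}, p_0) \\ \zeta(\mathrm{mle}, q) \\ \zeta(\mathrm{lfe}, q) \end{bmatrix}
\sim N\left(
\begin{bmatrix} 0 \\ 0 \\ 0 \\ 0 \end{bmatrix},
\begin{bmatrix}
V(\mathrm{mle}, p_0) & & & \\
V(\mathrm{mle}, p_0) & V(\mathrm{lfe}, p_0) & & \\
V(\mathrm{mle}, p_0) & V(\mathrm{mle}, \mathrm{lfe}) & V(\mathrm{mle}, q) & \\
V(\mathrm{mle}, p_0) & V(\mathrm{lfe}, p_0) & V(\mathrm{mle}, q) & V(\mathrm{lfe}, q)
\end{bmatrix}
\right).
\]
Only the lower triangle of the symmetric covariance matrix is displayed.

$^{4}$ The notation is shorthand for $\mathrm{vecr}(\zeta) \sim N(0, V)$.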
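The matrices $\mu(i, p)$, $V(i, p)$, and $V(\mathrm{mle}, \mathrm{lfe})$ are given by
\[ \mu(\mathrm{mle}, p) = M' \sum_{j=0}^{h-1} \Phi^j \Gamma_{ZY,1} Q_p \Phi^{h-1-j}, \qquad \mu(\mathrm{lfe}, p) = \mu(\mathrm{pov}, p), \]
and
\[ V(\mathrm{mle}, p) = \sum_{i=0}^{h-1} \sum_{j=0}^{h-1} (M' \Phi^i \Sigma_{EE} \Phi^{j\prime} M) \otimes (\Phi^{h-1-i\,\prime} Q_p \Phi^{h-1-j}), \]
\[ V(\mathrm{lfe}, p) = \sum_{i=0}^{h-1} \sum_{j=0}^{h-1} (M' \Phi^i \Sigma_{EE} \Phi^{j\prime} M) \otimes (Q_p' \Gamma_{YY,j-i} Q_p), \]
\[ V(\mathrm{mle}, \mathrm{lfe}) = \sum_{i=0}^{h-1} \sum_{j=0}^{h-1} (M' \Phi^i \Sigma_{EE} \Phi^{j\prime} M) \otimes (\Phi^{h-1-i\,\prime} \Gamma_{YY,0}^{-1} \Gamma_{YY,h-1-j} Q_p). \]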
The covariance matrix $V(\mathrm{mle}, p)$ that appears in Theorem 1 is identical to the one derived by Baillie (1979) and Reinsel (1980). Theorem 1 implies that under the misspecification considered the asymptotic distribution of the LFE $\hat{\Psi}_T(\mathrm{lfe}, p)$ can be expressed as the sum of the limit distribution of the MLE, $\alpha \mu(\mathrm{mle}, p) + \zeta(\mathrm{mle}, p)$, and an uncorrelated random variable,
\[ \Delta_\Psi(\mathrm{lfe}, p; \mathrm{mle}, p) = \alpha [\mu(\mathrm{lfe}, p) - \mu(\mathrm{mle}, p)] + \zeta(\mathrm{lfe}, p) - \zeta(\mathrm{mle}, p), \]
with covariance matrix $V(\mathrm{lfe}, p) - V(\mathrm{mle}, p)$. Hence, the LFE has a larger covariance than the MLE: $V(\mathrm{lfe}, p) \ge V(\mathrm{mle}, p)$. Its limit distribution is centered at the pseudo-optimal value $M' \tilde{\Psi}_T(\mathrm{pov}, p)$. Unlike the variance differential $V(\mathrm{lfe}, p) - V(\mathrm{mle}, p)$, the discrepancy between the means of the two limit distributions is a function of $\alpha$. The larger the local misspecification, the bigger is the bias of the MLE relative to the pseudo-true value $M' \tilde{\Psi}_T(\mathrm{pov}, p)$. If the lag order of the estimated VAR is smaller than the asymptotic lag order $p_*$ of the DGP then the bias of MLE and LFE is of order $O(1)$ and will asymptotically dominate.

3.3. Prediction risk

Let $\hat{y}_{T+h}(i, p) = M' \hat{\Psi}_T(i, p) \tilde{Y}_T$ for $i = \mathrm{mle}, \mathrm{lfe}$. The next theorem characterizes the asymptotic prediction risk
\[ \bar{R}(\hat{y}_{T+h}(i, p)) = \lim_{T \to \infty} T R(\hat{y}_{T+h}(i, p)) \tag{12} \]
of the MLE and LFE plug-in predictors based on their limit distribution.
Theorem 2. Suppose that Assumptions 3 and 4 are satisfied and $p \ge p_*$. As the sample size $T \to \infty$, the prediction risk $T R(\hat{y}_{T+h}(i, p))$ converges to
\[ \bar{R}(\hat{y}_{T+h}(i, p)) = \alpha^2 \underbrace{\|\mu(i, p) - \mu(\mathrm{pov}, q)\|^2_{W \otimes \Gamma_{YY,0}}}_{R_B(i, p)} + \underbrace{\mathrm{tr}[(W \otimes \Gamma_{YY,0}) V(i, p)]}_{R_V(i, p)} + C, \tag{13} \]
where
\[ C = \alpha^2 E\left\{ \Bigg\| \sum_{j=0}^{h-1} M' \Phi^j E_T[Z_{T+h-j}] - \mu(\mathrm{pov}, q) \tilde{Y}_T \Bigg\|_W^2 \right\}. \]

If the lag length of the VAR is $p \ge p_*$ then the asymptotic risk is (up to a constant) determined by the bias term $\alpha^2 R_B(i, p)$ and the variance term $R_V(i, p)$ that appear in Theorem 2. If the degree of misspecification $\alpha$ of the forecasting model is small then it is preferable to use a predictor with little sampling variation. On the other hand, for large values of $\alpha$ a predictor with a small bias term will yield more desirable forecasts.

Corollary 1. Suppose that Assumptions 3 and 4 are satisfied and $p \ge p_*$. The bias component $R_B(i, p)$ and the variance component $R_V(i, p)$ of the asymptotic risk satisfy the following relationships:
(i) $R_B(\mathrm{lfe}, p) \le R_B(\mathrm{mle}, p)$;
(ii) $R_B(\mathrm{lfe}, p') \le R_B(\mathrm{lfe}, p)$ for $p' \ge p$;
(iii) $R_V(\mathrm{mle}, p) \le R_V(\mathrm{lfe}, p)$;
(iv) $R_V(i, p) \le R_V(i, p')$ for $p' \ge p$ and $i \in \{\mathrm{mle}, \mathrm{lfe}\}$.

Corollary 1 states that (i) for a given lag length the bias component of the LFE predictor is smaller than the bias component of the MLE predictor. On the other hand, the ranking of the variance components is reversed (iii). Increasing the lag length of the VAR lowers the bias of $\hat{y}_{T+h}(\mathrm{lfe}, p)$ (ii) and raises the variance of both MLE and LFE plug-in predictors (iv).

Suppose a forecaster considers a VAR($p$) as baseline model and is concerned about misspecification. There are two margins along which she could improve upon the MLE plug-in forecast. On the one hand, the forecaster could increase the lag length; on the other hand, she could switch from an MLE to an LFE predictor. For both $\hat{y}_{T+h}(\mathrm{mle}, p)$ and $\hat{y}_{T+h}(\mathrm{lfe}, p)$ there exists an optimal lag length that balances the trade-off between bias and variance, which we will denote by $p^{opt}_{\mathrm{mle}}$ and $p^{opt}_{\mathrm{lfe}}$. However, this optimal lag length need not be the same for the two predictors. If the misspecification $\alpha$ of the forecasting model is large, then the optimal lag length is greater than $p_*$. For a finite collection of forecasting models $1 \le p \le q$, one can make the misspecification $\alpha$ large enough such that it is optimal to use the LFE plug-in predictor from the VAR($q$). The next section will address the problem of a data-driven choice of predictor.
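The bias-variance trade-off in Theorem 2 and Corollary 1 can be made tangible with a small numerical sketch. Since the constant in the asymptotic risk is common to all predictors, comparing two predictors only requires the squared misspecification parameter times the bias component plus the variance component. The numbers passed to these functions in any example are purely hypothetical inputs, not values from the paper, and the function names are illustrative.

```python
import math

def asymptotic_risk(alpha, r_bias, r_var):
    """Asymptotic risk up to the common constant: alpha^2 * R_B + R_V."""
    return alpha ** 2 * r_bias + r_var

def lfe_preferred(alpha, rb_mle, rb_lfe, rv_mle, rv_lfe):
    """True if the LFE predictor has strictly smaller asymptotic risk than the MLE predictor."""
    return asymptotic_risk(alpha, rb_lfe, rv_lfe) < asymptotic_risk(alpha, rb_mle, rv_mle)

def crossover_alpha(rb_mle, rb_lfe, rv_mle, rv_lfe):
    """Misspecification level above which the LFE predictor dominates (inf if it never does)."""
    bias_gain = rb_mle - rb_lfe            # nonnegative by Corollary 1(i)
    variance_cost = rv_lfe - rv_mle        # nonnegative by Corollary 1(iii)
    if bias_gain <= 0:
        return math.inf
    return math.sqrt(max(variance_cost, 0.0) / bias_gain)
```

The crossover follows from solving $\alpha^2 R_B(\mathrm{lfe},p) + R_V(\mathrm{lfe},p) = \alpha^2 R_B(\mathrm{mle},p) + R_V(\mathrm{mle},p)$ for $\alpha$.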
4. Selecting a predictor

We will now derive an asymptotically unbiased estimator for the prediction risk of $\hat{y}_{T+h}(i, p)$. The in-sample mean squared $h$-step ahead forecast error matrix is given by
\[ \mathrm{MSE}(i, p) = M' \left[ \frac{1}{T} \sum_{t=1}^{T} (Y_t - \hat{\Psi}_T(i, p) Y_{t-h})(Y_t - \hat{\Psi}_T(i, p) Y_{t-h})' \right] M. \tag{14} \]
We normalize the forecast error by the MSE of the loss function predictor associated with the VAR($q$) and define the risk differential
\[ \Delta_{R,T}(i, p) = T \big( \mathrm{tr}[W\, \mathrm{MSE}(i, p)] - \mathrm{tr}[W\, \mathrm{MSE}(\mathrm{lfe}, q)] \big). \tag{15} \]
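The in-sample MSE and the risk differential can be computed along the following lines. This is a sketch under stated assumptions rather than the paper's implementation: all predictors are evaluated on a common sample that starts once $h+q-1$ initial observations are available, the scaling uses the number of evaluation observations in place of $T$, lags beyond $p$ are omitted from the regressors instead of being restricted via $R_p$, and every function name is hypothetical.

```python
import numpy as np

def design(y, p, h, start):
    """Targets y_t and regressors (y_{t-h}, ..., y_{t-h-p+1}) for t = start, ..., T-1."""
    T = y.shape[0]
    regs = np.hstack([y[start - h - k: T - h - k] for k in range(p)])
    return y[start:], regs

def psi_lfe(y, p, h, start):
    """Direct (loss function) estimate of the h-step coefficients, shape (n, n*p)."""
    targets, regs = design(y, p, h, start)
    coef, *_ = np.linalg.lstsq(regs, targets, rcond=None)
    return coef.T

def psi_mle(y, p, h):
    """Iterated estimate: first n rows of the h-th power of the estimated companion matrix."""
    n = y.shape[1]
    F = np.zeros((n * p, n * p))
    F[:n] = psi_lfe(y, p, 1, p)            # one-step least squares = Gaussian quasi-MLE
    F[n:, :-n] = np.eye(n * (p - 1))
    return np.linalg.matrix_power(F, h)[:n]

def weighted_mse(y, psi, p, h, start, W):
    """tr[W * MSE] of the in-sample h-step forecast errors."""
    targets, regs = design(y, p, h, start)
    resid = targets - regs @ psi.T
    return np.trace(W @ (resid.T @ resid) / len(resid))

def risk_differential(y, kind, p, q, h, W):
    """Scaled difference between tr[W MSE(kind, p)] and tr[W MSE(lfe, q)]."""
    start = h + q - 1                      # common evaluation sample for all lag orders
    psi = psi_mle(y, p, h) if kind == "mle" else psi_lfe(y, p, h, start)
    base = psi_lfe(y, q, h, start)
    n_eval = y.shape[0] - start
    return n_eval * (weighted_mse(y, psi, p, h, start, W)
                     - weighted_mse(y, base, q, h, start, W))
```

Because the direct estimator minimizes the in-sample $h$-step loss on this evaluation sample, the differential for the LFE predictor is never larger than that for the MLE predictor at the same lag length, which is why an uncorrected in-sample comparison always favors the LFE.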
It is well known that the in-sample risk differential is not a useful criterion for the lag order and the type of predictor. For instance, by construction of the LFE predictor it is always the case that $\Delta_{R,T}(\mathrm{lfe}, p) \le \Delta_{R,T}(\mathrm{mle}, p)$ even though the ranking of the actual risk could be reversed. In order to obtain a reasonable selection criterion, one has to adjust $\Delta_{R,T}(i, p)$ by a penalty for the use of additional lags and loss function estimation.

Using the asymptotic representation of $M' \hat{\Psi}(i, p)$ given in Theorem 1, the in-sample loss can be decomposed as follows:
\[
\begin{aligned}
T\, \mathrm{MSE}(i, p) ={}& \sum_{t=1}^{T} M' (Y_t - \Phi^h Y_{t-h})(Y_t - \Phi^h Y_{t-h})' M \\
&- T^{-1/2} \sum_{t=1}^{T} M' (Y_t Y_{t-h}' - \Phi^h Y_{t-h} Y_{t-h}') (\alpha \mu(i, p) + \zeta_T(i, p) + o_p(1))' \\
&- T^{-1/2} \sum_{t=1}^{T} (\alpha \mu(i, p) + \zeta_T(i, p) + o_p(1)) (Y_t Y_{t-h}' - \Phi^h Y_{t-h} Y_{t-h}')' M \\
&+ (\alpha \mu(i, p) + \zeta_T(i, p) + o_p(1)) \left( T^{-1} \sum_{t=1}^{T} Y_{t-h} Y_{t-h}' \right) (\alpha \mu(i, p) + \zeta_T(i, p) + o_p(1))'.
\end{aligned}
\tag{16}
\]
The first term in (16) does not depend on the choice of predictor and lag length and cancels in the calculation of the risk differential $\Delta_{R,T}(i, p)$. The remaining terms are of order $O_p(T^{-1})$ if $p \ge p_*$. The asymptotic behavior of the risk differential is summarized in the following theorem.

Theorem 3. Suppose that Assumptions 3 and 4 are satisfied and $p \ge p_*$.
(i) Then the in-sample forecast error loss differential has the following limit distribution:
\[
\begin{aligned}
\Delta_{R,T}(i, p) \Longrightarrow{}& \alpha^2\, \mathrm{tr}\{ W (\mu(i, p) - \mu(\mathrm{lfe}, q)) \Gamma_{YY,0} (\mu(i, p) - \mu(\mathrm{lfe}, q))' \} \\
&+ \mathrm{tr}\{ W (\zeta(i, p) - \zeta(\mathrm{lfe}, q)) \Gamma_{YY,0} (\zeta(i, p) - \zeta(\mathrm{lfe}, q))' \} \\
&+ 2\alpha\, \mathrm{tr}\{ W (\mu(i, p) - \mu(\mathrm{lfe}, q)) \Gamma_{YY,0} (\zeta(i, p) - \zeta(\mathrm{lfe}, q))' \}.
\end{aligned}
\]
(ii) The expected in-sample forecast error differential converges to
\[ E[\Delta_{R,T}(i, p)] \to \alpha^2 (R_B(i, p) - R_B(\mathrm{lfe}, q)) - (R_V(i, p) - R_V(\mathrm{lfe}, q)). \]
The limit random variables $\zeta(i, p)$ are defined in Theorem 1 and the asymptotic risk components $R_B(i, p)$ and $R_V(i, p)$ are given in Theorem 2.

Theorem 3 suggests correcting the MSE by twice the variance component of the asymptotic risk to obtain an asymptotically unbiased estimate of $\bar{R}(\hat{y}_{T+h}(i, p))$ that can be used as a selection criterion.

Definition 1. Define the $PC_T(i, p)$ criterion for the joint selection of VAR lag length and type of estimator as
\[ PC_T(i, p) = T\, \mathrm{tr}[W\, \mathrm{MSE}(i, p)] + 2 \hat{R}_V(i, p), \]
where $\hat{R}_V(i, p)$ is a consistent estimate of $R_V(i, p)$.

$R_V(i, p)$ can be consistently estimated, for instance, by replacing the matrices $\Phi$ and $\Sigma_{EE}$ that appear in the covariance expressions of Theorem 1 with quasi-maximum likelihood estimates from a VAR($q$). Moreover, population autocovariance matrices $\Gamma_{YY,h}$ can be replaced by their sample analogues. The proposed PC criterion is closely related to Amemiya’s (1980) prediction criterion, Mallows’ (1973) $C_p$, and Shibata’s (1980) final prediction error criterion.$^{5}$ The misspecification of the VAR forecasting models is too small for the prediction risk to be consistently estimable. Hence, there is always a non-zero probability of selecting a sub-optimal lag length and estimator asymptotically. According to Theorem 3 the differential $PC_T(i, p) - PC_T(\mathrm{lfe}, q) = O_p(1)$ if $p \ge p_*$. However, if the VAR lag length $p$ is less than the asymptotic lag length $p_*$ of the DGP then $PC_T(i, p) - PC_T(\mathrm{lfe}, q) = O_p(T)$ so that the probability of selecting a model with fewer than $p_*$ lags vanishes asymptotically.
5 For LFE plug-in predictors the risk differential simplifies to
$$R_V(\mathrm{lfe},p) - R_V(\mathrm{lfe},q) = -n^2(q-p)\sum_{j=0}^{h-1}\mathrm{tr}[W M'\Phi^j\Sigma_{\epsilon\epsilon}\Phi^{j\prime}M].$$

As pointed out above, the LFE predictor always attains a smaller in-sample loss than the MLE predictor, because it is explicitly obtained by minimizing the in-sample forecast error loss. The term $2\hat R_V(\iota,p)$ in the definition of the PC criterion can be viewed as a penalty for estimation inefficiency due to direct use of the prediction loss function and high dimensionality of the forecasting model. Model selection criteria that are asymptotically of the form

$$MSC(\iota,p) = T\,\mathrm{tr}[W\,\mathrm{MSE}(\iota,p)] + (n^2 p)\,c_T(\iota),$$

where $c_T(\iota) \to \infty$ and $T^{-1}c_T(\iota) \to 0$ as $T \to \infty$, will choose the asymptotic lag order $p_*$ with probability tending to one as the sample size increases, regardless of the degree of local misspecification $\alpha$. We will refer to these model selection criteria as consistent as they correctly choose the asymptotic lag length of the VAR in large samples. Bayesian model selection criteria and their asymptotic approximations such as SIC fall into this category as the implied penalty for dimensionality diverges at rate $c_T(\iota) = O(\ln T)$. Other approaches that lead to consistent model selection criteria include the prequential approach (Dawid, 1992), predictive least squares (Wei, 1992), and stochastic complexity (Rissanen, 1986).

For simplicity suppose that a forecaster considers only two predictors, namely, $\hat y_{T+h}(\iota,p)$ and $\hat y_{T+h}(\mathrm{lfe},q)$. Let $I\{x > a\}$ denote the indicator function that is one if $x > a$ and zero otherwise. Consider the PC selection-based predictor that is obtained by using $\hat y_{T+h}(\mathrm{lfe},q)$ instead of $\hat y_{T+h}(\iota,p)$ whenever $PC_T(\iota,p) > PC_T(\mathrm{lfe},q)$:

$$\hat y^{PC}_{T+h} = \left(M'\hat\Psi_T(\iota,p) + T^{-1/2}\Delta^{PC}_{\Psi,T}\right)\tilde Y_T, \tag{17}$$

where

$$\begin{aligned}
\Delta^{PC}_{\Psi,T} &= T^{1/2}M'\left(\hat\Psi_T(\mathrm{lfe},p) - \hat\Psi_T(\iota,p)\right) I\{PC_T(\iota,p) > PC_T(\mathrm{lfe},p)\} \\
&= M'\left[\alpha(\mu(\mathrm{lfe},p) - \mu(\iota,p)) + \zeta_T(\mathrm{lfe},p) - \zeta_T(\iota,p) + o_p(1)\right] I\{PC_T(\iota,p) > PC_T(\mathrm{lfe},p)\}.
\end{aligned}$$

Let $\Delta^{PC}_{\Psi}$ signify the limit distribution of $\Delta^{PC}_{\Psi,T}$. According to Theorem 3, $\Delta^{PC}_{\Psi}$ is a function of $\zeta(\iota,p) - \zeta(\mathrm{lfe},q)$, which is uncorrelated with $\zeta_T(\iota,p)$. Thus it can be shown that the asymptotic risk of the PC selection-based predictor relative to $R(\hat y_{T+h}(\iota,p))$ is given by

$$\begin{aligned}
R(\hat y^{PC}_{T+h}) - R(\hat y_{T+h}(\iota,p)) ={} & \left\|E[\Delta^{PC}_{\Psi}] - \alpha(\mu(\mathrm{lfe},q) - \mu(\iota,p))\right\|^2_{W\otimes\Gamma_{YY,0}} - \alpha^2 R_B(\iota,p) \\
& + E\left[\left\|\Delta^{PC}_{\Psi} - E[\Delta^{PC}_{\Psi}]\right\|^2_{W\otimes\Gamma_{YY,0}}\right].
\end{aligned} \tag{18}$$
The terms in the first line of (18) capture the bias differential between the selection-based predictor and $\hat y_{T+h}(\iota,p)$. Its sign depends on the magnitude of the misspecification. The term in the second line reflects the variance differential of the two predictors and is always positive. Suppose that the criterion is used to choose between the MLE and the LFE predictor conditional on a pre-selected lag length $p = q \ge p_*$. In the absence of misspecification, $\alpha = 0$, the risk of the PC-selection predictor is greater than $R(\hat y_{T+h}(\iota,p))$ according to (18). However, as the degree of misspecification $\alpha$ gets larger, the LFE predictor becomes more desirable and the probability that PC selects the LFE predictor increases. The bias differential in (18) becomes eventually negative
and will dominate the variance term. If $\alpha$ is sufficiently large, $R(\hat y^{PC}_{T+h}) - R(\hat y_{T+h}(\iota,p)) < 0$. Overall, the risk of the selection-based predictor typically lies between the risks of the MLE and LFE predictors, which we illustrate in the Monte Carlo study. The PC criterion can also be used to select the type of predictor and the lag length jointly. Suppose that $p = p_* < q$. If $\alpha = 0$ then the risk of the PC selection-based predictor is greater than $R(\hat y_{T+h}(\iota,p_*))$. Since the risk function is continuous in $\alpha$ the same ranking is obtained for small, non-zero values of $\alpha$. Thus, if the misspecification of the forecasting models is (or ex ante is believed to be) small, then the best lag length selection strategy in our framework is to use a consistent model selection criterion which will select the lag order $p_*$ asymptotically. This result provides a formal justification for Diebold's (1998) statement cited in the Introduction.6 As the misspecification increases the ranking of $\hat y^{PC}_{T+h}$ and $\hat y_{T+h}(\iota,p_*)$ reverses and the consistent selection criterion will choose a lag length that is too small. Eventually, it will become preferable to use PC, or a criterion with similar asymptotic properties such as AIC, for lag length selection and to choose a $p > p_*$. The use of the proposed PC selection criterion will guard against poor prediction if the misspecification is believed to be potentially severe. However, this protection comes at a price if the misspecification is small.
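The selection rule behind the PC selection-based predictor can be sketched in a few lines. The first function implements the two-predictor indicator rule described above; the second generalizes it to joint selection over a set of (estimator, lag length) pairs. Both function names and the dictionary layout are hypothetical and only serve this illustration.

```python
def pc_pairwise_forecast(pc_i, forecast_i, pc_lfe, forecast_lfe):
    """Two-predictor rule: use the LFE forecast iff PC(i, p) > PC(lfe, q)."""
    use_lfe = pc_i > pc_lfe          # indicator I{PC_T(i,p) > PC_T(lfe,q)}
    return forecast_lfe if use_lfe else forecast_i

def pc_select(candidates):
    """Joint selection: return the key and forecast with the smallest PC value.

    candidates maps (estimator, lag) -> (pc_value, forecast). Ties go to the
    entry that was inserted first.
    """
    best = min(candidates, key=lambda k: candidates[k][0])
    return best, candidates[best][1]

# Hypothetical PC values and forecasts, for illustration only.
example = {
    ("mle", 1): (12.4, 0.31),
    ("lfe", 1): (11.9, 0.27),
    ("mle", 4): (13.8, 0.35),
    ("lfe", 4): (14.2, 0.30),
}
chosen_key, chosen_forecast = pc_select(example)
```

A consistent criterion would differ only in the penalty attached to each candidate; the minimization step is the same.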
5. A small Monte Carlo study

A small Monte Carlo study is conducted to illustrate the large sample results obtained in the previous two sections and to examine the small sample properties of the PC-selection-based predictor. We consider a bivariate VAR with asymptotic lag order $p_* = 1$ as DGP. The eigenvalues of the coefficient matrix $\Phi$ are 0.9 and 0.5. Thus, the limit VAR is stationary, yet shocks will be fairly persistent. The contemporaneous correlation between the two innovations is 0.4. The distortion process is a vector moving average (VMA) of order 10. The precise specification of the DGP can be found in Appendix B. In the MC experiment we maintain the assumption that the process to be predicted is independent of the process that is used to estimate the VAR parameters. We calculate the standardized risk $T\,R(\hat y_{T+h})$ for each predictor based on 100,000 MC repetitions. The standardized risk can be compared to the asymptotic risk $R(\hat y_{T+h})$. Tables 1–4 present risk differentials $T[R(\hat y_{T+h}(\iota,p)) - R(\hat y_{T+h}(\mathrm{lfe},q))]$ for MLE and LFE plug-in predictors. We also report the expected value and standard deviation for the corresponding PC criterion. The risk differentials and PC statistics are divided by $T\,R(\hat y_{T+h}(\mathrm{lfe},q))$ and multiplied by 100 to reflect percentage changes relative to the LFE plug-in predictor based on a VAR($q$). Negative entries reflect improvements, i.e., risk reductions. Tables 1–4 have three objectives: compare the performance of the predictors $\hat y_{T+h}(\iota,p)$ as a function of misspecification, and

6 Diebold (1998) refers to the AIC criterion, which has similar asymptotic properties with respect to order selection as the PC criterion.
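Table 1. Relative risk, horizon $h = 2$

| $\alpha$ | $T$ | Statistic | $p=1$ MLE | $p=1$ LFE | $p=2$ MLE | $p=2$ LFE | $p=4$ MLE | $p=4$ LFE |
|---|---|---|---|---|---|---|---|---|
| 0 | 100 | Risk | 67.8 | 61.8 | 52.9 | 41.8 | 11.9 | 0.0 |
| 0 | 100 | E[PC] | 48.6 | 44.1 | 37.2 | 27.7 | 8.4 | 0.0 |
| 0 | 100 | $\sigma$[PC] | 38.2 | 36.8 | 32.6 | 28.5 | 9.0 | 0.0 |
| 0 | 500 | Risk | 69.1 | 63.5 | 54.8 | 42.7 | 12.5 | 0.0 |
| 0 | 500 | E[PC] | 64.9 | 59.5 | 51.1 | 39.2 | 11.6 | 0.0 |
| 0 | 500 | $\sigma$[PC] | 39.2 | 37.8 | 34.1 | 29.6 | 10.4 | 0.0 |
| 0 | 5000 | Risk | 69.5 | 63.9 | 54.9 | 42.2 | 12.1 | 0.0 |
| 0 | 5000 | E[PC] | 69.3 | 63.8 | 55.0 | 42.5 | 12.5 | 0.0 |
| 0 | 5000 | $\sigma$[PC] | 39.7 | 38.1 | 34.7 | 30.1 | 10.8 | 0.0 |
| 0 | $\infty$ | Risk | 69.8 | 64.3 | 55.4 | 42.9 | 12.5 | 0.0 |
| 2 | 100 | Risk | 51.3 | 56.6 | 41.5 | 36.4 | 9.9 | 0.0 |
| 2 | 100 | E[PC] | 32.5 | 35.8 | 23.0 | 23.0 | 6.0 | 0.0 |
| 2 | 100 | $\sigma$[PC] | 38.5 | 36.5 | 33.3 | 28.3 | 8.7 | 0.0 |
| 2 | 500 | Risk | 48.2 | 54.1 | 36.5 | 34.6 | 10.6 | 0.0 |
| 2 | 500 | E[PC] | 44.6 | 49.1 | 31.2 | 31.2 | 9.5 | 0.0 |
| 2 | 500 | $\sigma$[PC] | 40.5 | 38.5 | 36.8 | 30.5 | 10.5 | 0.0 |
| 2 | 5000 | Risk | 48.3 | 52.9 | 35.2 | 33.4 | 10.3 | 0.0 |
| 2 | 5000 | E[PC] | 48.3 | 52.4 | 34.5 | 33.4 | 10.5 | 0.0 |
| 2 | 5000 | $\sigma$[PC] | 41.5 | 39.3 | 37.9 | 31.6 | 11.0 | 0.0 |
| 2 | $\infty$ | Risk | 48.8 | 52.6 | 35.8 | 33.8 | 10.7 | 0.0 |

Note: The table reports relative prediction risks $T[R(\hat y_{T+h}(\iota,p)) - R(\hat y_{T+h}(\mathrm{lfe},q))]$. E[PC] and $\sigma$[PC] refer to expected value and standard deviation of the PC criterion. Risk differentials and PC statistics are divided by $T\,R(\hat y_{T+h}(\mathrm{lfe},q))$ and multiplied by 100 to reflect percentage changes relative to the VAR($q$) LFE plug-in predictor. Negative entries correspond to improvements (risk reductions).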
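The Monte Carlo design described at the start of this section can be sketched as follows. Only the eigenvalues of $\Phi$ (0.9 and 0.5), the innovation correlation (0.4), the VMA(10) distortion, and the $\alpha T^{-1/2}$ scaling are taken from the text. The eigenvectors, the unit innovation variances, the VMA weights, the burn-in length, and the function name are placeholders chosen for illustration and are not the specification of Appendix B.

```python
import numpy as np

def simulate_locally_misspecified_var(T, alpha, rng, burn=200):
    """Simulate y_t = Phi y_{t-1} + eps_t + alpha * T^{-1/2} * eta_t (sketch)."""
    P = np.array([[1.0, 1.0], [0.5, -1.0]])            # hypothetical eigenvectors
    Phi = P @ np.diag([0.9, 0.5]) @ np.linalg.inv(P)   # eigenvalues 0.9 and 0.5
    Sigma = np.array([[1.0, 0.4], [0.4, 1.0]])         # innovation correlation 0.4
    A = [0.8 ** j * np.eye(2) for j in range(1, 11)]   # hypothetical VMA(10) weights
    eps = rng.standard_normal((T + burn, 2)) @ np.linalg.cholesky(Sigma).T
    y = np.zeros((T + burn, 2))
    for t in range(10, T + burn):
        # Distortion process: moving average of the last ten innovations.
        eta = sum(A[j - 1] @ eps[t - j] for j in range(1, 11))
        y[t] = Phi @ y[t - 1] + eps[t] + alpha / np.sqrt(T) * eta
    return y[burn:], Phi

y0, Phi = simulate_locally_misspecified_var(200, 0.0, np.random.default_rng(0))
y5, _ = simulate_locally_misspecified_var(200, 5.0, np.random.default_rng(0))
```

With the same seed, `y0` and `y5` share the innovations and differ only through the distortion term, which makes it easy to see how $\alpha$ moves the data away from the limit VAR(1).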
forecast horizon, document the accuracy of the asymptotic approximation in Theorems 2 and 3, and illustrate the extent to which the PC criterion measures the risk of the various predictors. According to our theoretical analysis for $\alpha = 0$ the MLE plug-in predictor derived from a VAR(1) yields the lowest prediction risk. This result is confirmed in our Monte Carlo experiment. For $h = 2$ the use of $\hat y_{T+h}(\mathrm{mle},1)$ reduces the prediction risk by about 70 percent. The risk is increasing in the lag length $p$ and LFE plug-in predictors perform worse than MLE plug-in predictors. If $\alpha = 2$ the ranking of the predictors changes slightly. Now $\hat y_{T+h}(\mathrm{lfe},1)$ dominates $\hat y_{T+h}(\mathrm{mle},1)$. As in the $\alpha = 0$ case performance deteriorates when more lags are included in the VAR. For $\alpha = 5$ it becomes evident that the optimal lag length for MLE and LFE plug-in predictors can differ. A comparison among MLE predictors suggests that the optimal choice is $p^{opt}_{mle} = 4$, whereas $p^{opt}_{lfe} = 1$. Hence, switching from quasi maximum likelihood estimation to loss function estimation and increasing the dimensionality of the forecasting model can be viewed as substitutes. If $\alpha = 10$ the misspecification is so
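Table 2. Relative risk, horizon $h = 2$

| $\alpha$ | $T$ | Statistic | $p=1$ MLE | $p=1$ LFE | $p=2$ MLE | $p=2$ LFE | $p=4$ MLE | $p=4$ LFE |
|---|---|---|---|---|---|---|---|---|
| 5 | 100 | Risk | 8.0 | 34.0 | 19.2 | 20.9 | 0.3 | 0.0 |
| 5 | 100 | E[PC] | 6.4 | 16.2 | 5.2 | 12.9 | 2.8 | 0.0 |
| 5 | 100 | $\sigma$[PC] | 36.0 | 31.1 | 29.2 | 23.9 | 9.7 | 0.0 |
| 5 | 500 | Risk | 25.7 | 20.0 | 19.1 | 9.0 | 2.2 | 0.0 |
| 5 | 500 | E[PC] | 27.7 | 14.6 | 24.2 | 7.6 | 1.1 | 0.0 |
| 5 | 500 | $\sigma$[PC] | 40.8 | 36.7 | 39.5 | 29.9 | 10.8 | 0.0 |
| 5 | 5000 | Risk | 32.3 | 11.7 | 41.1 | 1.0 | 3.8 | 0.0 |
| 5 | 5000 | E[PC] | 32.2 | 10.8 | 42.6 | 1.0 | 3.9 | 0.0 |
| 5 | 5000 | $\sigma$[PC] | 43.9 | 40.3 | 44.6 | 33.7 | 11.2 | 0.0 |
| 5 | $\infty$ | Risk | 30.6 | 8.7 | 37.8 | 0.4 | 3.7 | 0.0 |
| 10 | 100 | Risk | 21.0 | 9.3 | 7.9 | 6.6 | 14.7 | 0.0 |
| 10 | 100 | E[PC] | 30.5 | 1.7 | 0.1 | 1.9 | 13.1 | 0.0 |
| 10 | 100 | $\sigma$[PC] | 30.9 | 23.4 | 20.0 | 17.0 | 11.0 | 0.0 |
| 10 | 500 | Risk | 91.1 | 15.1 | 42.4 | 11.4 | 15.1 | 0.0 |
| 10 | 500 | E[PC] | 91.9 | 18.7 | 45.1 | 11.2 | 15.5 | 0.0 |
| 10 | 500 | $\sigma$[PC] | 31.9 | 25.6 | 27.8 | 21.0 | 10.1 | 0.0 |
| 10 | 5000 | Risk | 147.3 | 43.1 | 139.7 | 40.6 | 6.7 | 0.0 |
| 10 | 5000 | E[PC] | 147.1 | 44.1 | 141.1 | 40.4 | 6.7 | 0.0 |
| 10 | 5000 | $\sigma$[PC] | 37.6 | 33.4 | 40.5 | 28.4 | 9.8 | 0.0 |
| 10 | $\infty$ | Risk | 147.2 | 55.9 | 146.1 | 50.7 | 6.6 | 0.0 |

Note: The table reports relative prediction risks $T[R(\hat y_{T+h}(\iota,p)) - R(\hat y_{T+h}(\mathrm{lfe},q))]$. E[PC] and $\sigma$[PC] refer to expected value and standard deviation of the PC criterion. Risk differentials and PC statistics are divided by $T\,R(\hat y_{T+h}(\mathrm{lfe},q))$ and multiplied by 100 to reflect percentage changes relative to the VAR($q$) LFE plug-in predictor. Negative entries correspond to improvements (risk reductions).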
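The normalization described in the table notes can be sketched as follows: a Monte Carlo risk estimate is the average weighted squared forecast error across repetitions, and each entry is the percentage difference relative to the benchmark VAR($q$) LFE plug-in predictor. The function names are hypothetical, and the common factor $T$ cancels in the ratio.

```python
import numpy as np

def mc_risk(errors, W):
    """Average of e' W e over Monte Carlo repetitions (one error vector per row)."""
    errors = np.asarray(errors, dtype=float)
    return float(np.mean(np.einsum("ri,ij,rj->r", errors, W, errors)))

def relative_risk_percent(risk, benchmark_risk):
    """100 * (R - R_benchmark) / R_benchmark; negative values are risk reductions."""
    return 100.0 * (risk - benchmark_risk) / benchmark_risk

# Tiny hand-made example: two repetitions of a bivariate forecast error.
example_risk = mc_risk([[1.0, 0.0], [0.0, 2.0]], np.eye(2))
example_entry = relative_risk_percent(0.3, 1.0)
```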
large that $\hat y_{T+h}(\mathrm{lfe},4)$ is the best predictor. The results for four-step ahead forecasting, reported in Tables 3 and 4, are qualitatively similar to $h = 2$. By and large the asymptotic risk calculations are accurate for $T = 5000$. If the misspecification is small then the asymptotic risk also provides a fairly precise approximation for $T = 100$ and $T = 500$. As $\alpha$ increases the quality of the approximations deteriorates.7 Tables 1–4 also report the expected value and the standard deviation of the (standardized as described above) PC differential $PC_T(\iota,p) - PC_T(\mathrm{lfe},q)$. The PC criterion was constructed to provide an asymptotically unbiased estimate of the

7 Under misspecification there are many different ways to construct asymptotic approximations. We simply use the design matrices of the DGP. More accurate results can be obtained as follows. Calculate the MA representation for the DGP, choose the $\tilde\Phi$ that minimizes the variance of the distortion process, derive the corresponding MA representation for the minimum-variance distortion process, and use $\tilde\Phi$ and $\tilde A_h$ to compute the asymptotic approximation.
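Table 3. Relative risk, horizon $h = 4$

| $\alpha$ | $T$ | Statistic | $p=1$ MLE | $p=1$ LFE | $p=2$ MLE | $p=2$ LFE | $p=4$ MLE | $p=4$ LFE |
|---|---|---|---|---|---|---|---|---|
| 0 | 100 | Risk | 68.3 | 52.8 | 58.0 | 36.0 | 30.6 | 0.0 |
| 0 | 100 | E[PC] | 49.0 | 37.1 | 41.2 | 23.4 | 21.7 | 0.0 |
| 0 | 100 | $\sigma$[PC] | 43.1 | 36.3 | 37.7 | 26.7 | 19.3 | 0.0 |
| 0 | 500 | Risk | 67.7 | 52.6 | 57.8 | 35.3 | 31.6 | 0.0 |
| 0 | 500 | E[PC] | 63.7 | 49.2 | 54.3 | 32.5 | 29.8 | 0.0 |
| 0 | 500 | $\sigma$[PC] | 45.1 | 37.9 | 40.2 | 27.9 | 22.2 | 0.0 |
| 0 | 5000 | Risk | 67.8 | 52.8 | 57.8 | 35.1 | 32.0 | 0.0 |
| 0 | 5000 | E[PC] | 67.3 | 52.2 | 57.5 | 34.8 | 31.7 | 0.0 |
| 0 | 5000 | $\sigma$[PC] | 45.7 | 38.3 | 40.9 | 28.2 | 23.1 | 0.0 |
| 0 | $\infty$ | Risk | 68.1 | 52.9 | 58.3 | 35.3 | 32.2 | 0.0 |
| 2 | 100 | Risk | 65.7 | 51.8 | 59.1 | 33.7 | 27.1 | 0.0 |
| 2 | 100 | E[PC] | 44.8 | 36.8 | 36.5 | 24.4 | 18.2 | 0.0 |
| 2 | 100 | $\sigma$[PC] | 42.8 | 35.3 | 35.7 | 25.0 | 18.0 | 0.0 |
| 2 | 500 | Risk | 60.0 | 48.3 | 54.3 | 32.0 | 27.6 | 0.0 |
| 2 | 500 | E[PC] | 56.3 | 46.2 | 47.9 | 30.7 | 26.2 | 0.0 |
| 2 | 500 | $\sigma$[PC] | 44.7 | 36.8 | 39.6 | 26.5 | 21.5 | 0.0 |
| 2 | 5000 | Risk | 58.6 | 47.5 | 51.9 | 31.7 | 28.4 | 0.0 |
| 2 | 5000 | E[PC] | 58.2 | 47.4 | 50.6 | 31.6 | 28.4 | 0.0 |
| 2 | 5000 | $\sigma$[PC] | 45.8 | 37.8 | 41.3 | 27.6 | 22.9 | 0.0 |
| 2 | $\infty$ | Risk | 58.3 | 47.2 | 51.3 | 31.8 | 29.0 | 0.0 |

Note: The table reports relative prediction risks $T[R(\hat y_{T+h}(\iota,p)) - R(\hat y_{T+h}(\mathrm{lfe},q))]$. E[PC] and $\sigma$[PC] refer to expected value and standard deviation of the PC criterion. Risk differentials and PC statistics are divided by $T\,R(\hat y_{T+h}(\mathrm{lfe},q))$ and multiplied by 100 to reflect percentage changes relative to the VAR($q$) LFE plug-in predictor. Negative entries correspond to improvements (risk reductions).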
prediction risk. Indeed, for $T \ge 500$ the expected value of the PC differential matches the actual risk differential accurately, while we find some small sample bias for $T = 100$. According to Theorem 3, the PC differential converges in distribution to a non-degenerate random variable. Indeed, the MC simulations indicate that the sampling variation of the selection criterion is substantial and that small risk differentials are difficult to detect. For instance, consider $h = 2$, $\alpha = 10$, and $T = 1000$. The risk differential between $\hat y_{T+h}(\mathrm{mle},1)$ and $\hat y_{T+h}(\mathrm{lfe},4)$ is 148.5 percent and the standard deviation of the PC differential is 37.7%. A differential of this magnitude in favor of the loss function estimator is detected by PC with high probability. However, the differential between $\hat y_{T+h}(\mathrm{mle},4)$ and $\hat y_{T+h}(\mathrm{lfe},4)$ is only about 7% compared to a standard deviation of approximately 10%. Thus, with fairly large probability the PC comparison will indicate that the MLE plug-in predictor is preferable even though the LFE predictor has the lower risk. We now turn to the performance of selection-based predictors. In addition to PC lag-length selection we also consider SIC and AIC to choose the number of lags of
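Table 4. Relative risk, horizon $h = 4$

| $\alpha$ | $T$ | Statistic | $p=1$ MLE | $p=1$ LFE | $p=2$ MLE | $p=2$ LFE | $p=4$ MLE | $p=4$ LFE |
|---|---|---|---|---|---|---|---|---|
| 5 | 100 | Risk | 49.7 | 40.8 | 43.5 | 24.8 | 16.2 | 0.0 |
| 5 | 100 | E[PC] | 30.7 | 30.5 | 24.9 | 21.5 | 9.6 | 0.0 |
| 5 | 100 | $\sigma$[PC] | 40.1 | 32.0 | 33.5 | 23.1 | 18.2 | 0.0 |
| 5 | 500 | Risk | 29.5 | 29.8 | 30.0 | 20.2 | 15.9 | 0.0 |
| 5 | 500 | E[PC] | 26.7 | 30.1 | 21.9 | 22.7 | 14.5 | 0.0 |
| 5 | 500 | $\sigma$[PC] | 42.8 | 34.2 | 39.6 | 24.7 | 20.5 | 0.0 |
| 5 | 5000 | Risk | 19.8 | 25.2 | 22.5 | 18.0 | 15.4 | 0.0 |
| 5 | 5000 | E[PC] | 19.4 | 25.6 | 20.0 | 18.7 | 15.6 | 0.0 |
| 5 | 5000 | $\sigma$[PC] | 45.3 | 35.7 | 43.1 | 25.6 | 22.3 | 0.0 |
| 5 | $\infty$ | Risk | 16.6 | 23.1 | 22.0 | 17.0 | 15.5 | 0.0 |
| 10 | 100 | Risk | 23.7 | 24.5 | 20.7 | 14.0 | 7.5 | 0.0 |
| 10 | 100 | E[PC] | 11.3 | 19.0 | 10.7 | 12.5 | 7.1 | 0.0 |
| 10 | 100 | $\sigma$[PC] | 31.4 | 23.8 | 27.1 | 17.4 | 20.4 | 0.0 |
| 10 | 500 | Risk | 15.8 | 0.4 | 10.7 | 0.1 | 10.8 | 0.0 |
| 10 | 500 | E[PC] | 17.9 | 2.1 | 15.4 | 4.7 | 11.2 | 0.0 |
| 10 | 500 | $\sigma$[PC] | 35.4 | 29.1 | 37.3 | 22.6 | 21.0 | 0.0 |
| 10 | 5000 | Risk | 52.1 | 16.9 | 37.6 | 6.8 | 7.4 | 0.0 |
| 10 | 5000 | E[PC] | 52.7 | 16.1 | 40.5 | 5.3 | 7.5 | 0.0 |
| 10 | 5000 | $\sigma$[PC] | 40.0 | 30.4 | 41.7 | 21.5 | 19.7 | 0.0 |
| 10 | $\infty$ | Risk | 62.7 | 22.8 | 33.8 | 11.0 | 10.1 | 0.0 |

Note: The table reports relative prediction risks $T[R(\hat y_{T+h}(\iota,p)) - R(\hat y_{T+h}(\mathrm{lfe},q))]$. E[PC] and $\sigma$[PC] refer to expected value and standard deviation of the PC criterion. Risk differentials and PC statistics are divided by $T\,R(\hat y_{T+h}(\mathrm{lfe},q))$ and multiplied by 100 to reflect percentage changes relative to the VAR($q$) LFE plug-in predictor. Negative entries correspond to improvements (risk reductions).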
the forecasting model. As in Tables 1–4 the entries in Tables 5 and 6 correspond to percentage risk differentials relative to $T\,R(\hat y_{T+h}(\mathrm{lfe},4))$. The row labels denote the criterion that has been used to determine the lag length. The column labels signify the predictor: MLE plug-in, LFE plug-in, or the predictor selected by PC. For instance, the (SIC,PC) entry refers to the predictor that is obtained by selecting the lag length with SIC and choosing between $\hat y_{T+h}(\mathrm{mle},\hat p_{SIC})$ and $\hat y_{T+h}(\mathrm{lfe},\hat p_{SIC})$ based on PC. The (PC,PC) entry corresponds to the joint selection of lag length and predictor with PC. While AIC and SIC are widely used in practice they are by no means the only available lag-length selection criteria. For instance, SIC was derived as a large-sample approximation of the marginal data density associated with a Bayesian VAR. Improved finite-sample performance could possibly be obtained by direct calculation of the marginal data density using results in Zellner (1971) for multivariate Gaussian regression models with natural conjugate priors or numerical approximations based on the output of a Gibbs sampler, as proposed by Chib (1995). Moreover, the
Table 5. Selection-based risk, forecast horizon $h = 2$

| $\alpha$ | Lag Sel | $T=100$ MLE | $T=100$ LFE | $T=100$ PC | $T=500$ MLE | $T=500$ LFE | $T=500$ PC |
|---|---|---|---|---|---|---|---|
| 0 | AIC | 60.4 | 55.0 | 59.0 | 62.6 | 57.5 | 61.1 |
| 0 | SIC | 67.8 | 61.8 | 65.3 | 69.1 | 63.5 | 66.7 |
| 0 | PC | 49.6 | 43.1 | 47.6 | 53.4 | 47.3 | 51.3 |
| 2 | AIC | 17.3 | 12.1 | 15.3 | 17.1 | 11.0 | 15.0 |
| 2 | SIC | 37.7 | 38.4 | 36.4 | 36.5 | 39.4 | 36.0 |
| 2 | PC | 31.0 | 35.6 | 33.1 | 31.1 | 36.6 | 34.2 |
| 5 | AIC | 0.9 | 2.7 | 0.3 | 2.0 | 0.0 | 0.6 |
| 5 | SIC | 9.0 | 13.7 | 10.8 | 3.0 | 0.4 | 2.7 |
| 5 | PC | 3.6 | 14.6 | 13.6 | 10.5 | 4.3 | 4.1 |
| 10 | AIC | 8.5 | 0.4 | 2.1 | 15.1 | 0.0 | 0.3 |
| 10 | SIC | 4.9 | 5.8 | 5.0 | 16.7 | 1.5 | 1.9 |
| 10 | PC | 1.1 | 0.3 | 0.2 | 19.0 | 6.9 | 7.0 |

Note: The table reports relative prediction risks $T[R(\hat y_{T+h}(\iota,p)) - R(\hat y_{T+h}(\mathrm{lfe},q))]$ for selection-based predictors. Risk differentials are divided by $T\,R(\hat y_{T+h}(\mathrm{lfe},q))$ and multiplied by 100 to reflect percentage changes relative to the VAR($q$) LFE plug-in predictor. Negative entries correspond to improvements (risk reductions).

Table 6. Selection-based risk, forecast horizon $h = 4$

| $\alpha$ | Lag Sel | $T=100$ MLE | $T=100$ LFE | $T=100$ PC | $T=500$ MLE | $T=500$ LFE | $T=500$ PC |
|---|---|---|---|---|---|---|---|
| 0 | AIC | 63.3 | 48.4 | 58.1 | 63.4 | 49.1 | 58.8 |
| 0 | SIC | 68.3 | 52.8 | 61.4 | 67.7 | 52.6 | 61.6 |
| 0 | PC | 53.2 | 35.0 | 48.2 | 54.2 | 36.3 | 49.4 |
| 2 | AIC | 34.9 | 13.0 | 28.9 | 34.8 | 11.7 | 29.1 |
| 2 | SIC | 55.0 | 36.2 | 47.6 | 53.5 | 36.2 | 46.9 |
| 2 | PC | 49.4 | 33.7 | 44.2 | 47.1 | 32.2 | 42.2 |
| 5 | AIC | 17.9 | 4.6 | 12.1 | 15.9 | 0.1 | 8.2 |
| 5 | SIC | 29.9 | 17.7 | 23.4 | 16.1 | 4.8 | 9.4 |
| 5 | PC | 33.7 | 24.0 | 27.7 | 19.4 | 14.9 | 14.6 |
| 10 | AIC | 1.7 | 5.6 | 2.6 | 10.8 | 0.0 | 4.7 |
| 10 | SIC | 18.5 | 13.5 | 13.7 | 11.9 | 0.3 | 5.5 |
| 10 | PC | 13.0 | 12.5 | 10.8 | 15.6 | 5.9 | 8.0 |

Note: The table reports relative prediction risks $T[R(\hat y_{T+h}(\iota,p)) - R(\hat y_{T+h}(\mathrm{lfe},q))]$ for selection-based predictors. Risk differentials are divided by $T\,R(\hat y_{T+h}(\mathrm{lfe},q))$ and multiplied by 100 to reflect percentage changes relative to the VAR($q$) LFE plug-in predictor. Negative entries correspond to improvements (risk reductions).
extensive literature on forecast combination suggests that it might be feasible to improve predictions by either classical, e.g., Bates and Granger (1969), or Bayesian forecast combination, e.g., Min and Zellner (1993). However, the exploration of these techniques is beyond the scope of this Monte Carlo study. In the absence of misspecification SIC lag length selection clearly dominates AIC or PC lag length selection. For $T = 100$ and $h = 2$ the improvement of $\hat y_{T+h}(\mathrm{mle},\hat p_{SIC})$ over $\hat y_{T+h}(\mathrm{lfe},4)$ is 67.8%, while $\hat y_{T+h}(\mathrm{lfe},\hat p_{SIC})$ leads to an improvement of 61.8%. PC can be used to choose among MLE and LFE predictor conditional on the selected lag length. The loss reduction of the resulting fully data-driven procedure is 65.3%. As argued in Section 4, the protection against poor performance under misspecification comes at a cost. The risk associated with the PC selection-based predictor lies typically between the risk of MLE and LFE predictor. However, recall that ex ante the forecaster does not know which of the two predictors is better. Selection frequencies are reported in Tables 7 and 8. For instance, the rows labelled MLE(AIC) and LFE(AIC) for $h = 2$, $\alpha = 0$, and $T = 100$ indicate that AIC selected $\hat p_{AIC} = 1$ in 87% of the MC repetitions. In 79% of the cases PC selected $\hat y_{T+h}(\mathrm{mle},1)$ and the frequency of choosing $\hat y_{T+h}(\mathrm{lfe},1)$ was 8%. Table 7 reflects the well-known result that AIC tends to select higher dimensional models than SIC. For $h = 2$, $\alpha = 0$, and $T = 100$ SIC always selects a VAR(1). In 14% of the repetitions PC chooses the LFE predictor, which explains why the performance of the fully data-driven predictor is worse than $\hat y_{T+h}(\mathrm{mle},\hat p_{SIC})$ but better than $\hat y_{T+h}(\mathrm{lfe},\hat p_{SIC})$. For $\alpha = 2$ it remains true that SIC provides the best lag length selection. However, PC lag length selection now dominates AIC selection. While AIC chooses $p > 2$ in more than 80% of the repetitions, PC keeps the dimension of the forecasting model small.
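The tendency of AIC to pick larger models than SIC can be illustrated with a short sketch. It uses the common textbook forms $\ln|\hat\Sigma(p)| + c_T\,n^2 p/T$ with $c_T = 2$ for AIC and $c_T = \ln T$ for SIC, which is an assumption about the exact variants and not necessarily what the GAUSS programs of the Monte Carlo study implement. All function names and the simulated data are hypothetical.

```python
import numpy as np

def var_residual_cov(y, p, pmax):
    """ML residual covariance of a VAR(p) without intercept, on t = pmax, ..., T-1."""
    Y = y[pmax:]
    X = np.hstack([y[pmax - k:len(y) - k] for k in range(1, p + 1)])
    B, *_ = np.linalg.lstsq(X, Y, rcond=None)
    U = Y - X @ B
    return U.T @ U / len(Y)

def select_lag(y, pmax, penalty):
    """Minimize ln|Sigma(p)| + penalty(T) * n^2 * p / T over p = 1, ..., pmax."""
    n = y.shape[1]
    T_eff = len(y) - pmax            # common effective sample for all lag orders
    crit = {p: np.log(np.linalg.det(var_residual_cov(y, p, pmax)))
               + penalty(T_eff) * p * n ** 2 / T_eff
            for p in range(1, pmax + 1)}
    return min(crit, key=crit.get)

def aic_penalty(T):
    return 2.0

def sic_penalty(T):
    return np.log(T)

# Hypothetical data: a bivariate VAR(1).
rng = np.random.default_rng(3)
Phi = np.array([[0.7, 0.1], [0.0, 0.5]])
y = np.zeros((400, 2))
for t in range(1, 400):
    y[t] = Phi @ y[t - 1] + rng.standard_normal(2)
p_aic = select_lag(y, 4, aic_penalty)
p_sic = select_lag(y, 4, sic_penalty)
```

Since both criteria share the fit term and differ only in the per-parameter penalty, the lag order chosen with the larger penalty ($\ln T > 2$ for $T \ge 8$) can never exceed the one chosen with the smaller penalty when both are evaluated on the same sample.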
The joint selection of lag length and predictor leads to the frequent choice of $\hat y_{T+h}(\mathrm{lfe},1)$, which according to Table 1 is the best among the competing predictors. As the degree of misspecification is raised to $\alpha = 5$, PC order selection by and large dominates both AIC and SIC order selection. Table 7 indicates that due to the increased misspecification AIC and SIC tend to select larger VARs more frequently. Conditional on having selected $p \ge 3$ the use of the LFE predictor is fairly rare. The good performance of the joint PC selection procedure is due to the fact that it corrects for the misspecification not by increasing the lag length but rather by estimating a low dimensional VAR with the loss function based estimator. The results for $\alpha = 10$ are mixed. If $T = 100$ the SIC order selection works best, whereas for $T = 500$ the AIC criterion, which selects a VAR(4) most of the time, leads to the smallest prediction risk. To summarize, the results on lag length selection are fairly sensitive to the degree of model misspecification, which is consistent with previous findings in the econometrics literature alluded to by Diebold (1998, pp. 90–91) and the theory provided in Sections 3 and 4 of this paper. If the misspecification is small then SIC works best. As the misspecification increases there are gains from jointly choosing lag length and predictor with the proposed PC criterion. The PC criterion can also be used to choose between MLE and LFE predictor conditional on a pre-selected lag length. The MC experiment documents that the risk of the PC-based predictions lies
Table 7 Selection frequencies, forecast horizon h ¼ 2 a
0
2
5
10
Pred
MLE(AIC) LFE(AIC) MLE(SIC) LFE(SIC) MLE(PC) LFE(PC) MLE(AIC) LFE(AIC) MLE(SIC) LFE(SIC) MLE(PC) LFE(PC) MLE(AIC) LFE(AIC) MLE(SIC) LFE(SIC) MLE(PC) LFE(PC) MLE(AIC) LFE(AIC) MLE(SIC) LFE(SIC) MLE(PC) LFE(PC)
T ¼ 100
T ¼ 500
p¼1
p¼2
p¼3
p¼4
p¼1
p¼2
p¼3
p¼4
79 8 86 14 67 5
8 1
3
1
8 1
2
1
13 1
7
5 2
81 8 87 13 72 5
12 1
5 1
3 1
46 3 12 2 9 1
28 7 1
13
48 2 5 1 8 1
31 5
23 10 30 38
16 1 37 15 9 3 3
29 31 3 3 4 11
2
24 35 7 14
34 2 24 11 9 3
30 25 6 4
63 35 29 16 10 12
33 3 47 44 20 13
22 1 3 2 6 8
5 36 1 11
2 98 2 86
8
58
1 53
2 28
6 3
1 1 25
22 14 27 47
57
14
37 20 7 3
11
20
5 2
Note: The table reports selection frequencies for estimators and lag lengths. In case of AIC and SIC the estimation method is chosen with PC.
between the risk of the simple MLE and LFE predictions. The use of the PC criterion can hence guard against choosing the worse of the two simple predictors, which is of course unknown ex ante, and in many cases leads to a performance that is almost as good as the better of the two predictors.
6. Conclusion

This paper considers the problem of multi-step ahead forecasting with potentially misspecified VAR($p$) models. A forecaster who considers a VAR($p$) yet is concerned about model misspecification has in principle two choices. First, increase the complexity of the forecast model, e.g., the lag length of a VAR. Second, estimate the parameters of the forecast model by minimizing the in-sample forecast error loss to
Table 8 Selection frequencies, forecast horizon h ¼ 4 a
0
2
5
10
Pred
T ¼ 100
T ¼ 500
p¼1
p¼2
p¼3
p¼4
p¼1
p¼2
p¼3
p¼4
MLE(AIC) LFE(AIC) MLE(SIC) LFE(SIC) MLE(PC) LFE(PC)
77 10 86 14 63 2
8 1
3
1
8 1
2
1
14
8 1
9 3
79 9 87 13 67 1
14
8
7 2
MLE(AIC) LFE(AIC) MLE(SIC) LFE(SIC) MLE(PC) LFE(PC)
1
15 1 43 8 17 1
44 5 12 2 6 1
30 5 1
13 1 49 8 18
46 4 5 1 7 1
32 4
3
50 13
47 14 4 1 5 5
2
45 14 18 4
31 5 28 7 4 1
45 10 8 2
79 18 37 8 6 5
17 6 4 2 5 2
19 22
2 1 29 33
25 11 52 39 14 6
6 6 3 5
35 65 31 57 2 23
MLE(AIC) LFE(AIC) MLE(SIC) LFE(SIC) MLE(PC) LFE(PC) MLE(AIC) LFE(AIC) MLE(SIC) LFE(SIC) MLE(PC) LFE(PC)
29 5 61 3
8 3
2 8
31 5 62 2
42 15
6 34
18 4
5 22
7 3
Note: The table reports selection frequencies for estimators and lag lengths. In case of AIC and SIC the estimation method is chosen with PC.
induce a misspecification bias of the parameter estimates that lowers the prediction risk. Of course, these two choices are not mutually exclusive. Authors who are generally skeptical about the correct specification of econometric models tend to recommend loss function estimates. However, this recommendation is only reasonable if the specification of the forecast model is strongly rejected by the data or prior beliefs. In the case of macroeconomic forecasting with VARs it is not at all clear whether loss function estimation is preferable to likelihood based estimation. Hence, we propose a criterion that can be used to jointly select the lag length of the VAR and the parameter estimation procedure. The criterion can be viewed as a modification of Shibata’s (1980) final prediction error criterion. The criterion can also be used to choose between MLE and LFE predictor conditional on a lag length selected with a conventional criterion such as AIC or SIC.
Asymptotic results can be conveniently obtained by assuming that the infinite-dimensional data generating model approaches the candidate VARs at rate $T^{-1/2}$. Under local misspecification of the forecast model the proposed PC criterion provides an asymptotically unbiased estimate of the prediction risk differentials. While the PC selection-based predictor in most cases does not attain the risk of the best among the simple predictors (which is of course unknown ex ante), it clearly tends to outperform the worst predictors. The paper focuses on multi-step VAR forecasting under quadratic loss functions when the data are generated from a covariance-stationary infinite-dimensional VMA process. Extensions to time-varying and non-stationary data generating processes, other forms of misspecification, asymmetric loss functions, and volatility forecasting are left for future research.
Acknowledgements

I wish to thank Jean Boivin, Frank Diebold, Yuichi Kitamura, Peter Phillips, Chris Sims, Arnold Zellner, three anonymous referees, the associate editor, and seminar participants at the Federal Reserve Bank of Atlanta, the University of Virginia and the 2003 North American Winter Meeting of the Econometric Society for valuable comments and discussions. Financial support from the University of Pennsylvania Research Foundation is gratefully acknowledged. GAUSS programs that implement the Monte Carlo experiment are available at http://www.econ.upenn.edu/schorf.
Appendix A. Derivation and proofs

A.1. Main results

Throughout the appendix, we use the operator $E_T[\cdot]$ to denote expectations conditional on $T$ and its infinite past. We omit the $p$ subscript of the matrices $R$ and $Q$. Recall that

$$Q = \Gamma_{YY,0}^{-1} - \Gamma_{YY,0}^{-1}R\,(R'\Gamma_{YY,0}^{-1}R)^{-1}R'\Gamma_{YY,0}^{-1}. \tag{A.1}$$
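The algebra of $Q$ in (A.1) is easy to check numerically. The sketch below assumes that $R$ is the selection matrix that picks the coefficients on lags $p+1,\ldots,q$ (zeros stacked on an identity) and uses a random positive definite matrix as a stand-in for $\Gamma_{YY,0}$. Under these assumptions $Q$ collapses to the inverse of the leading block of $\Gamma_{YY,0}$ padded with zeros, and $Q'\Gamma_{YY,0}Q = Q$, in line with parts (ii) and (iii) of Lemma 2.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, q = 2, 1, 3
k, k1 = n * q, n * p

# Random symmetric positive definite stand-in for Gamma_{YY,0}.
A = rng.standard_normal((k, k))
Gamma = A @ A.T + k * np.eye(k)

# Assumed form of R: selects the trailing n(q - p) coordinates.
R = np.vstack([np.zeros((k1, k - k1)), np.eye(k - k1)])

Gamma_inv = np.linalg.inv(Gamma)
Q = Gamma_inv - Gamma_inv @ R @ np.linalg.inv(R.T @ Gamma_inv @ R) @ R.T @ Gamma_inv

# Block form: inverse of the leading (n p) x (n p) block, zeros elsewhere.
Q_block = np.zeros((k, k))
Q_block[:k1, :k1] = np.linalg.inv(Gamma[:k1, :k1])
```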
Proof of Lemma 1. The time $T$ conditional expectation of $y_{T+h}$ is

$$E_T[M'Y_{T+h}] = M'\Phi^h Y_T + \alpha T^{-1/2}\sum_{i=0}^{h-1} M'\Phi^i E_T[\eta_{T+h-i}]. \tag{A.2}$$

The minimization of $R(M'\Psi Y_T \mid \alpha, \Phi, A(L))$ subject to $M'\Psi R = 0$ is equivalent to the constrained minimization of

$$E\left[\|M'\Psi Y_T - M'E_T[Y_{T+h}]\|^2_W\right].$$
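Let $\Lambda$ be the matrix of Lagrange multipliers for the constraint $M'\Psi R = 0$. The first-order condition for $\tilde\Psi_T(\mathrm{pov},p)$ is of the form

$$0 = M'\Phi^h E[Y_T Y_T'] + \alpha T^{-1/2} M'\sum_{i=0}^{h-1} E\left[\Phi^i E_T[\eta_{T+h-i}]Y_T'\right] - M'\tilde\Psi_T(\mathrm{pov},p)E[Y_T Y_T'] - \Lambda R'. \tag{A.3}$$

Since

$$E[Y_T Y_T'] = \Gamma_{YY,0} + \alpha\,O(T^{-1/2}) \tag{A.4}$$

and

$$\begin{aligned}
E\left[\Phi^i E_T[\eta_{T+h-i}]Y_T'\right] &= E\left[\Phi^i\left(\sum_{j=h-i}^{\infty} A_j M\epsilon_{T+h-i-j}\right)\left(\sum_{j=0}^{\infty}\Phi^j M\epsilon_{T-j}\right)'\right] \\
&= \Phi^i\,\Gamma_{\eta Y,h-i}\,\Gamma_{YY,0}^{-1} + \alpha\,O(T^{-1/2}),
\end{aligned} \tag{A.5}$$

the statement of the lemma follows. □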
Lemma 2. (i) If $M'\Phi R_p = 0$ then the first $np$ rows of $\Phi^h R$ are equal to zero. (ii) $Q'\Gamma_{YY,0}Q = Q$. (iii) Partition

$$\Gamma_{YY,0} = \begin{pmatrix}\Gamma_{11} & \Gamma_{12}\\ \Gamma_{21} & \Gamma_{22}\end{pmatrix},$$

such that $\Gamma_{11}$ corresponds to the autocovariance matrix of the first $p$ lags of $y_t$. Then

$$Q = \begin{bmatrix}\Gamma_{11}^{-1} & 0\\ 0 & 0\end{bmatrix}.$$

(iv) $Q\Phi^h R = 0$.

Proof of Lemma 2. (i) By recursion. If the first $np$ rows of an $nq \times n(q-p)$ matrix $A$ are zero, then the first $np$ rows of the matrix $A_0 = \Phi A$ are also zero. (ii) By direct matrix multiplication. (iii) By the formula for the inverse of a partitioned matrix and direct matrix multiplication. (iv) Follows from (i) and (iii). □

Proof of Theorem 1. The LFE is given by

$$\begin{aligned}
T^{1/2}M'(\hat\Psi_T(\mathrm{lfe},p) - \Phi^h) ={} & \alpha\sum_{j=0}^{h-1} M'\Phi^j\left(T^{-1}\sum_{t=1}^{T}\eta_{t-j}Y_{t-h}'\right)(TQ_{T,p}) \\
& + \sum_{j=0}^{h-1} M'\Phi^j\left(T^{-1/2}\sum_{t=1}^{T}\epsilon_{t-j}Y_{t-h}'\right)(TQ_{T,p}).
\end{aligned} \tag{A.6}$$
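To analyze the MLE predictor we use a first-order Taylor expansion of $\hat\Phi^h$ around $\hat\Phi = \Phi$:

$$M'(\hat\Phi^h - \Phi^h) = M'\sum_{j=0}^{h-1}\Phi^j(\hat\Phi - \Phi)\Phi^{h-1-j} + R(M'(\hat\Phi - \Phi)), \tag{A.7}$$

where $R(\cdot)$ is the remainder term. Hence,

$$\begin{aligned}
T^{1/2}M'(\hat\Psi_T(\mathrm{mle},p) - \Phi^h) ={} & \alpha\sum_{j=0}^{h-1} M'\Phi^j\left(T^{-1}\sum_{t=1}^{T}\eta_t Y_{t-1}'\right)TQ_{T,p}\,\Phi^{h-1-j} \\
& + \sum_{j=0}^{h-1} M'\Phi^j\left(T^{-1/2}\sum_{t=1}^{T}\epsilon_t Y_{t-1}'\right)TQ_{T,p}\,\Phi^{h-1-j} \\
& + T^{1/2}R(M'(\hat\Phi_T(\mathrm{mle},p) - \Phi)).
\end{aligned} \tag{A.8}$$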
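The first-order expansion of the matrix power used in (A.7) can be checked numerically. In the sketch below the matrices are arbitrary examples and the function names are hypothetical. The remainder should shrink quadratically when the perturbation $\hat\Phi - \Phi$ is scaled down, which is what makes it negligible after multiplication by $T^{1/2}$ when the estimation error is of order $T^{-1/2}$.

```python
import numpy as np

def first_order_power_term(F, D, h):
    """sum_{j=0}^{h-1} F^j D F^{h-1-j}: directional derivative of F -> F^h along D."""
    mp = np.linalg.matrix_power
    return sum(mp(F, j) @ D @ mp(F, h - 1 - j) for j in range(h))

rng = np.random.default_rng(2)
F = np.array([[0.7, 0.1], [0.2, 0.5]])     # example coefficient matrix
D = rng.standard_normal((2, 2))            # example perturbation direction
h = 4

def remainder_norm(scale):
    """Norm of (F + scale*D)^h - F^h minus the first-order term."""
    F_hat = F + scale * D
    exact = np.linalg.matrix_power(F_hat, h) - np.linalg.matrix_power(F, h)
    return np.linalg.norm(exact - first_order_power_term(F, scale * D, h))

ratio = remainder_norm(1e-3) / remainder_norm(1e-2)   # close to 0.01 if quadratic
```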
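Now define the following random matrices:

$$\begin{aligned}
\zeta_T(\mathrm{lfe},p) ={} & \sum_{j=0}^{h-1}(M'\Phi^j)\left(T^{-1/2}\sum_{t=1}^{T}\epsilon_{t-j}Y_{t-h}'\right)(TQ_{T,p}) \\
& + \alpha\left[M'\sum_{j=0}^{h-1}\Phi^j\left(T^{-1}\sum_{t=1}^{T}\eta_{t-j}Y_{t-h}'\right)(TQ_{T,p}) - \mu(\mathrm{lfe},p)\right],
\end{aligned} \tag{A.9}$$

$$\begin{aligned}
\zeta_T(\mathrm{mle},p) ={} & \sum_{j=0}^{h-1}(M'\Phi^j)\left(T^{-1/2}\sum_{t=1}^{T}\epsilon_t Y_{t-1}'\right)(TQ_{T,p}\,\Phi^{h-1-j}) \\
& + \alpha\left[M'\sum_{j=0}^{h-1}\Phi^j\left(T^{-1}\sum_{t=1}^{T}\eta_t Y_{t-1}'\right)TQ_{T,p}\,\Phi^{h-1-j} - \mu(\mathrm{mle},p)\right] \\
& + T^{1/2}R(M'(\hat\Phi_T(\mathrm{mle},p) - \Phi)),
\end{aligned} \tag{A.10}$$

such that the prediction functions can be expressed as

$$T^{1/2}M'(\hat\Psi_T(\iota,p) - \Phi^h) = \alpha\mu(\iota,p) + \zeta_T(\iota,p). \tag{A.11}$$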
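Using a weak law of large numbers for linear vector processes, e.g., Phillips and Solo (1992), and the definitions of $\mu(\mathrm{lfe},p) = \mu(\mathrm{pov},p)$ and $\mu(\mathrm{mle},p)$ we deduce that

$$M'\sum_{j=0}^{h-1}\Phi^j\left(T^{-1}\sum_{t=1}^{T}\eta_{t-j}Y_{t-h}'\right)(TQ_{T,p}) - \mu(\mathrm{lfe},p) \overset{p}{\longrightarrow} 0, \tag{A.12}$$

$$M'\sum_{j=0}^{h-1}\Phi^j\left(T^{-1}\sum_{t=1}^{T}\eta_t Y_{t-1}'\right)TQ_{T,p}\,\Phi^{h-1-j} - \mu(\mathrm{mle},p) \overset{p}{\longrightarrow} 0. \tag{A.13}$$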
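Since $\mathrm{vecr}(ABC) = (A \otimes C')\,\mathrm{vec}(B')$, $Q_p$ is symmetric, $R(M'(\hat\Phi_T(\mathrm{mle},p) - \Phi)) = o_p(1)$, and (A.12), (A.13), we can write

$$\mathrm{vecr}(\zeta_T(\mathrm{lfe},p)) = \sum_{j=0}^{h-1}(M'\Phi^j \otimes Q_p)\,\mathrm{vec}\left[T^{-1/2}\sum_{t=1}^{T}Y_{t-h+j}\epsilon_t'\right] + o_p(1), \tag{A.14}$$

$$\mathrm{vecr}(\zeta_T(\mathrm{mle},p)) = \sum_{j=0}^{h-1}(M'\Phi^j \otimes \Phi^{h-1-j\,\prime}Q_p)\,\mathrm{vec}\left[T^{-1/2}\sum_{t=1}^{T}Y_{t-1}\epsilon_t'\right] + o_p(1). \tag{A.15}$$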
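Based on Assumptions 3 the terms

$$\mathrm{vec}\left[T^{-1/2}\sum_{t=1}^{T}Y_{t-h+j}\epsilon_t'\right], \quad j = 0,\ldots,h-1,$$

jointly satisfy a central limit theorem for vector martingale difference sequences. The covariance matrix for the limit random variables $\zeta(\iota,p)$ is obtained by straightforward yet tedious calculations using the results from Lemma 2.

Proof of Theorem 2. According to Eqs. (A.2) and (A.11) the difference between the conditional expectation of $y_{T+h}$ (omitting the tilde) and the predictor $\hat y_{T+h}(\iota,p)$ is given by

$$\begin{aligned}
E_T[M'Y_{T+h}] - \hat y_{T+h}(\iota,p) ={} & \alpha T^{-1/2}\left(\sum_{j=0}^{h-1}M'\Phi^j E_T[\eta_{T+h-j}] - \mu(\mathrm{lfe},q)Y_T\right) \\
& + \alpha T^{-1/2}[\mu(\mathrm{lfe},q) - \mu(\iota,p)]Y_T - T^{-1/2}\zeta_T(\iota,p)Y_T.
\end{aligned}$$

The normalized prediction risk can be expressed as follows:

$$\begin{aligned}
& T\,E\left[\mathrm{tr}\{W(E_T[M'Y_{T+h}] - \hat y_{T+h}(\iota,p))(E_T[M'Y_{T+h}] - \hat y_{T+h}(\iota,p))'\}\right] \\
&\quad = \alpha^2\,\mathrm{tr}\{W(\mu(\mathrm{lfe},q) - \mu(\iota,p))\Gamma_{YY,0}(\mu(\mathrm{lfe},q) - \mu(\iota,p))'\} \\
&\qquad + \alpha^2\,\mathrm{tr}\left\{W E\left[\left(\sum_{j=0}^{h-1}M'\Phi^j E_T[\eta_{T+h-j}] - \mu(\mathrm{lfe},q)Y_T\right)\left(\sum_{j=0}^{h-1}M'\Phi^j E_T[\eta_{T+h-j}] - \mu(\mathrm{lfe},q)Y_T\right)'\right]\right\} \\
&\qquad + \mathrm{tr}\{W E[\zeta_T(\iota,p)\Gamma_{yy,0}\zeta_T(\iota,p)']\}
\end{aligned} \tag{A.16}$$

$$\begin{aligned}
&\qquad - 2\alpha\,\mathrm{tr}\left\{W E[\zeta_T(\iota,p)]\,E\left[Y_T\left(\sum_{j=0}^{h-1}M'\Phi^j E_T[\eta_{T+h-j}] - \mu(\iota,p)Y_T\right)'\right]\right\} \\
&\qquad + 2\alpha^2\,\mathrm{tr}\left\{W\left(E\left[\sum_{j=0}^{h-1}M'\Phi^j E_T[\eta_{T+h-j}]Y_T'\right] - \mu(\mathrm{lfe},q)\Gamma_{YY,0}\right)(\mu(\mathrm{lfe},q) - \mu(\iota,p))'\right\}.
\end{aligned} \tag{A.17}$$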
Since $\mathrm{tr}[WABA'] = \mathrm{vecr}(A)'(W \otimes B)\,\mathrm{vecr}(A)$ and $\mathrm{tr}[AB] = \mathrm{tr}[BA]$ we can rewrite the third term of (A.17) as $\mathrm{tr}\{W E[\zeta_T(\iota,p)\Gamma_{yy,0}\zeta_T(\iota,p)']\} = \mathrm{tr}\{(W \otimes \Gamma_{yy,0})E[\zeta_T(\iota,p)\zeta_T(\iota,p)']\}$. According to Theorem 6 the sequence $\|\zeta_T(\iota,p)\|^2$ is uniformly integrable. Hence, we can deduce (see Billingsley, 1968) that

$$\mathrm{tr}\{(W \otimes \Gamma_{yy,0})E[\zeta_T(\iota,p)\zeta_T(\iota,p)']\} \to \mathrm{tr}\{(W \otimes \Gamma_{yy,0})V(\iota,p)\}. \tag{A.18}$$

The fourth term of (A.17) converges to zero because uniform integrability of $\|\zeta_T(\iota,p)\|^2$ implies that

$$E[\zeta_T(\iota,p)] \to 0. \tag{A.19}$$

Since

$$E\left[\sum_{j=0}^{h-1}M'\Phi^j E_T[\eta_{T+h-j}]Y_T'\right] = M'\sum_{j=0}^{h-1}\Phi^j\Gamma_{\eta Y,h-j} = \mu(\mathrm{lfe},q)\Gamma_{YY,0} \tag{A.20}$$

the fifth term in (A.17) is equal to zero. Hence, the statement of the theorem follows. □

Proof of Corollary 1. (i) and (ii) follow directly from Theorem 2. To show (iii), write $\zeta(\mathrm{lfe},p) = \zeta(\mathrm{mle},p) + [\zeta(\mathrm{lfe},p) - \zeta(\mathrm{mle},p)]$. Based on the covariance matrix given in Theorem 1 it can be easily verified that $\zeta(\mathrm{mle},p)$ and $[\zeta(\mathrm{lfe},p) - \zeta(\mathrm{mle},p)]$ are uncorrelated. Hence, $V(\mathrm{lfe},p) \ge V(\mathrm{mle},p)$ in the matrix sense. (iv) Same argument as (iii). □

Proof of Theorem 3. (i) Recall that

$$Q_{T,q} = \left(\sum y_{t-h}y_{t-h}'\right)^{-1}.$$
The in-sample mean squared h-step ahead forecast error can be rewritten as follows:

$$
\begin{aligned}
T\,\mathrm{MSE}(i,p) &= M' \left[ \sum_{t=1}^{T} (Y_t - \hat{\Psi}_T(i,p) Y_{t-h})(Y_t - \hat{\Psi}_T(i,p) Y_{t-h})' \right] M \\
&= \left\{ \sum_{t=1}^{T} M'(Y_t - \Phi^h Y_{t-h})(Y_t - \Phi^h Y_{t-h})' M \right\} \\
&\quad - \left\{ T^{-1/2} \sum_{t=1}^{T} M'(Y_t Y_{t-h}' - \Phi^h Y_{t-h} Y_{t-h}')(\alpha\mu(i,p) + \zeta_T(i,p))' \right\} \\
&\quad - \left\{ T^{-1/2} \sum_{t=1}^{T} (\alpha\mu(i,p) + \zeta_T(i,p))(Y_t Y_{t-h}' - \Phi^h Y_{t-h} Y_{t-h}')' M \right\} \\
&\quad + \left\{ (\alpha\mu(i,p) + \zeta_T(i,p)) (T^{-1} Q_{T,q}^{-1}) (\alpha\mu(i,p) + \zeta_T(i,p))' \right\}.
\end{aligned} \tag{A.21}
$$
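The first term is common across $i, p$ and cancels by taking differences. From the definition of $\zeta_T(\mathrm{lfe},p)$ in (A.9) it follows that

$$
\begin{aligned}
T^{-1/2} \sum_{t=1}^{T} M'(Y_t Y_{t-h}' - \Phi^h Y_{t-h} Y_{t-h}')
&= \alpha M' \sum_{j=0}^{h-1} \Phi^j \left( T^{-1} \sum_{t=1}^{T} Z_{t-j} Y_{t-h}' \right) + \sum_{j=0}^{h-1} M'\Phi^j \left( T^{-1/2} \sum_{t=1}^{T} E_{t-j} Y_{t-h}' \right) \\
&= [\zeta_T(\mathrm{lfe},q) + \alpha\mu(\mathrm{lfe},q)] (T Q_{T,q})^{-1}.
\end{aligned} \tag{A.22}
$$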
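Therefore,

$$
\begin{aligned}
T\,\mathrm{tr}[W\,\mathrm{MSE}(i,p)] &= \mathrm{tr}\left\{ W \sum_{t=1}^{T} M'(Y_t - \Phi^h Y_{t-h})(Y_t - \Phi^h Y_{t-h})' M \right\} \\
&\quad - 2\,\mathrm{tr}\left\{ W (\alpha\mu(i,p) + \zeta_T(i,p)) \left( \frac{1}{T} \sum y_{t-h} y_{t-h}' \right) (\alpha\mu(\mathrm{lfe},q) + \zeta_T(\mathrm{lfe},q))' \right\} \\
&\quad + \mathrm{tr}\left\{ W (\alpha\mu(i,p) + \zeta_T(i,p)) \left( \frac{1}{T} \sum y_{t-h} y_{t-h}' \right) (\alpha\mu(i,p) + \zeta_T(i,p))' \right\}.
\end{aligned} \tag{A.23}
$$

Since

$$\frac{1}{T} \sum y_{t-h} y_{t-h}' \stackrel{p}{\longrightarrow} \Gamma_{yy,0}$$

the in-sample loss differential is

$$
\begin{aligned}
T(\mathrm{tr}[W\,\mathrm{MSE}(i,p)] - \mathrm{tr}[W\,\mathrm{MSE}(\mathrm{lfe},q)])
&= \alpha^2\,\mathrm{tr}\{W (\mu(i,p) - \mu(\mathrm{lfe},q)) \Gamma_{YY,0} (\mu(i,p) - \mu(\mathrm{lfe},q))'\} \\
&\quad + \mathrm{tr}\{W (\zeta_T(i,p) - \zeta_T(\mathrm{lfe},q)) \Gamma_{YY,0} (\zeta_T(i,p) - \zeta_T(\mathrm{lfe},q))'\} \\
&\quad + 2\alpha\,\mathrm{tr}\{W (\mu(i,p) - \mu(\mathrm{lfe},q)) \Gamma_{YY,0} (\zeta_T(i,p) - \zeta_T(\mathrm{lfe},q))'\} + o_p(1).
\end{aligned} \tag{A.24}
$$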
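The step from (A.23) to (A.24) is a completion of the square in $\alpha\mu + \zeta$. A minimal numerical sketch follows, assuming symmetric $W$ and $\Gamma$ and using randomly generated placeholders for $\mu(i,p)$, $\mu(\mathrm{lfe},q)$, $\zeta_T(i,p)$, and $\zeta_T(\mathrm{lfe},q)$; none of the numbers come from the paper.

```python
import numpy as np

rng = np.random.default_rng(3)
n, m, alpha = 2, 4, 1.5

# Symmetric stand-ins for the weight matrix W and the covariance Gamma.
S = rng.normal(size=(n, n))
W = S @ S.T
G = rng.normal(size=(m, m))
Gamma = G @ G.T

# Placeholder bias and sampling-error matrices for predictor i and for lfe.
mu_i, mu_l = rng.normal(size=(2, n, m))
zeta_i, zeta_l = rng.normal(size=(2, n, m))

def form(X, Z=None):
    """Weighted bilinear form tr{W X Gamma Z'}; Z defaults to X."""
    Z = X if Z is None else Z
    return np.trace(W @ X @ Gamma @ Z.T)

a_i = alpha * mu_i + zeta_i
a_l = alpha * mu_l + zeta_l

# Difference of the (i,p) and (lfe,q) versions of the last two terms of (A.23).
differential = (form(a_i) - 2 * form(a_i, a_l)) - (form(a_l) - 2 * form(a_l, a_l))

# Three-term expansion as in (A.24).
d_mu, d_zeta = mu_i - mu_l, zeta_i - zeta_l
expansion = alpha**2 * form(d_mu) + form(d_zeta) + 2 * alpha * form(d_mu, d_zeta)
```

The first term of the expansion is the deterministic bias component, the second is the quadratic form in the sampling errors, and the third is the cross term that is linear in the sampling errors.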
Statement (i) now follows from Theorem 1 and the Continuous Mapping Theorem. (ii) The first term of the limit distribution (A.24) is constant. The second term has a gamma-distribution and we can deduce from Theorem 1 that its mean is given by

$$E[\mathrm{tr}\{W (\zeta(i,p) - \zeta(\mathrm{lfe},q)) \Gamma_{YY,0} (\zeta(i,p) - \zeta(\mathrm{lfe},q))'\}] = \mathrm{tr}\{(W \otimes \Gamma_{YY,0})(V(\mathrm{lfe},q) - V(i,p))\}. \tag{A.25}$$

The third term of (A.24) is a Gaussian random variable with mean zero. Thus the right-hand side of (A.25) is the expected value of the limit random variable in (A.24). Uniform integrability of the in-sample loss differential follows from Theorem 7.

A.2. Moment bounds and uniform integrability

Throughout this section we will use $\langle x \rangle_l$ to denote the norm $(E\|x\|^l)^{1/l}$. We begin with two moment bound theorems.

Theorem 4. Suppose $y_t$ satisfies Assumption 3. Then for each real moment order $\tau \ge 1$ there exists a positive integer $T(q,\tau)$ such that

$$\sup_{T \ge T(q,\tau)} E\left[ \lambda_{\min}^{-\tau} \left\{ \frac{1}{T} \sum Y_{t-h} Y_{t-h}' \right\} \right] < \infty.$$

Proof of Theorem 4. Obtained by a slight modification of the proof of Theorem 4.1 in Findley and Wei (2002) to account for the drift in the moving average representation of $y_t$.

Define the scalar processes

$$u_t = \sum_{j=0}^{\infty} c_j' \epsilon_{t-j}, \tag{A.26}$$

$$w_t = \sum_{j=0}^{\infty} d_j' \epsilon_{t-j}, \tag{A.27}$$

where $c_j$ and $d_j$ are $n \times 1$. Moreover, let $\gamma_{uu,j}$ and $\gamma_{ww,j}$ denote the jth order autocovariances of $u$ and $w$, respectively.

Theorem 5. Suppose that (i) $\epsilon_t$ in Eqs. (A.26) and (A.27) satisfies Assumption 3. (ii) Moreover, $\sum_{j=0}^{\infty} j^2 |c_j|^2 < \infty$ and $\sum_{j=0}^{\infty} j^2 |d_j|^2 < \infty$. Then there exists a constant $K_p$ such that

$$\left\langle \frac{1}{\sqrt{T}} \sum (u_t w_{t-j} - \gamma_{uw,j}) \right\rangle_{2p} \le K_p \left( \sum_{j=-\infty}^{\infty} \gamma_{uu,j}^2 \right)^{1/4} \left( \sum_{j=-\infty}^{\infty} \gamma_{ww,j}^2 \right)^{1/4} < \infty.$$
Proof of Theorem 5. See proof of "First Moment Bound Theorem" in Findley and Wei (1993) and Eqs. (2.9) and (2.10) in Findley and Wei (1993). The summability conditions (ii) imply that $\sum_{j=-\infty}^{\infty} \gamma_{uu,j}^2 < \infty$ and $\sum_{j=-\infty}^{\infty} \gamma_{ww,j}^2 < \infty$ (see Phillips and Solo, 1992, Lemma 3.5). □

Theorem 6. Suppose that Assumption 3 is satisfied. Then there exists a $\delta > 0$ and $T(h,p)$ such that

$$\sup_{T > T(h,p)} E\|\zeta_T(i,p)\|^{3+\delta} < \infty, \qquad i = \mathrm{mle}, \mathrm{lfe}.$$

Thus the sequence $\|\zeta_T\|^3$ is uniformly integrable.

Proof of Theorem 6. The theorem is proved for a VAR(1) forecasting model. The extension to higher dimensional VARs is straightforward. The proof consists of two parts. In Part (a) we bound the norms $\langle \zeta_T(\mathrm{lfe},1) \rangle_{3+\delta}$ and $\langle \zeta_T(\mathrm{mle},1) \rangle_{3+\delta}$ with moments of

$$\frac{1}{\sqrt{T}} \sum E_{t-j} Y_{t-h}', \qquad \left\{ \frac{1}{T} \sum Y_{t-h} Y_{t-h}' \right\}^{-1}, \qquad \frac{1}{T} \sum Z_{t-j} Y_{t-h}' - \Gamma_{ZY,h-j}. \tag{A.28}$$

Part (b) applies the moment bound Theorems 4 and 5 to (A.28).

Part (a): We begin with the loss-function estimator. Recall that

$$
\begin{aligned}
\zeta_T(\mathrm{lfe},1) &= \sum_{j=0}^{h-1} \Phi^j \left\{ \frac{1}{\sqrt{T}} \sum E_{t-j} Y_{t-h}' \right\} \left\{ \frac{1}{T} \sum Y_{t-h} Y_{t-h}' \right\}^{-1} \\
&\quad + \alpha \left[ \sum_{j=0}^{h-1} \Phi^j \left\{ \frac{1}{T} \sum Z_{t-j} Y_{t-h}' \right\} \left\{ \frac{1}{T} \sum Y_{t-h} Y_{t-h}' \right\}^{-1} - \mu(\mathrm{lfe},1) \right].
\end{aligned} \tag{A.29}
$$
According to Minkowski's inequality

$$
\begin{aligned}
\langle \zeta_T(\mathrm{lfe},1) \rangle_{3+\delta} &\le \sum_{j=0}^{h-1} \|\Phi^j\| \underbrace{\left\langle \left\{ \frac{1}{\sqrt{T}} \sum E_{t-j} Y_{t-h}' \right\} \left\{ \frac{1}{T} \sum Y_{t-h} Y_{t-h}' \right\}^{-1} \right\rangle_{3+\delta}}_{(i)} \\
&\quad + \alpha \sum_{j=0}^{h-1} \|\Phi^j\| \underbrace{\left\langle \left\{ \frac{1}{T} \sum Z_{t-j} Y_{t-h}' \right\} \left\{ \frac{1}{T} \sum Y_{t-h} Y_{t-h}' \right\}^{-1} - \Gamma_{ZY,h-j} \Gamma_{YY}^{-1} \right\rangle_{3+\delta}}_{(ii)}.
\end{aligned} \tag{A.30}
$$
The first term in (A.30) can be bounded with Ho¨lder’s inequality: ", 1 X E tj Y 0th hðiÞi3þd p pffiffiffiffi T q1 ð3þdÞ 3ðq1 1Þ=q1 *( + ) 1 1 X 5 Y th Y 0th ; T
ðA:31Þ
q1 ð3þdÞ=ðq1 1Þ
where q1 41: The second term can be bounded as follows: *( )( X )1 + 1 X 1 0 0 Z tj Y th GZY ;hj Y th Y th hðiiÞi3þd p T T 3þd , ( X )1 Y th Y 0th G1 GZY ;hj YY T 3þd ", 1 X Z tj Y 0th GZY ;hj p pffiffiffiffi T q2 ð3þdÞ 3ðq2 1Þ=q2 *( )1 + 1 X 5 Y th Y 0th T ðq2 ð3þdÞÞ=ðq2 1Þ (, X ) 1 Y th Y 0th kGZY ;hj k þ kG1 k : YY T 3þd Thus, a sufficient condition for the uniform integrability of zT ðlfe; 1Þ is , 1 X 0 p ffiffiffiffi E tj Y th o1; sup T tXT ðh;pÞ ð3þdÞq1 , sup tXT ðh;pÞ
1 X Z tj Y 0th GZY ;hj o1; T ð3þdÞq2
*( sup tXT ðh;pÞ
)1 1 X 0 Y th Y th T
ðA:32Þ
(A.33)
(A.34)
+ o1 ð3þdÞ minfq
(A.35)
q1 q ; 2 g 1 1 q2 1
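for some $q_1, q_2 > 1$ and $T(h,p)$.

The mle-plug-in predictor is a function of

$$
\begin{aligned}
\zeta_T(\mathrm{mle},1) &= \sum_{j=0}^{h-1} \Phi^j \left\{ \frac{1}{\sqrt{T}} \sum E_t Y_{t-1}' \right\} \left\{ \frac{1}{T} \sum Y_{t-1} Y_{t-1}' \right\}^{-1} \Phi^{h-1-j} \\
&\quad + \alpha \left[ \sum_{j=0}^{h-1} \Phi^j \left\{ \frac{1}{T} \sum Z_t Y_{t-1}' \right\} \left\{ \frac{1}{T} \sum Y_{t-1} Y_{t-1}' \right\}^{-1} \Phi^{h-1-j} - \mu(\mathrm{mle},1) \right] \\
&\quad + \sqrt{T}\, R(\hat{\Phi}_T(\mathrm{mle},1) - \Phi).
\end{aligned} \tag{A.36}
$$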
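According to Minkowski's inequality

$$
\begin{aligned}
\langle \zeta_T(\mathrm{mle},1) \rangle_{3+\delta} &\le (h-1) \|\Phi^{h-1}\| \left\langle \left\{ \frac{1}{\sqrt{T}} \sum E_t Y_{t-1}' \right\} \left\{ \frac{1}{T} \sum Y_{t-1} Y_{t-1}' \right\}^{-1} \right\rangle_{3+\delta} \\
&\quad + \alpha (h-1) \|\Phi^{h-1}\| \left\langle \left\{ \frac{1}{T} \sum Z_t Y_{t-1}' \right\} \left\{ \frac{1}{T} \sum Y_{t-1} Y_{t-1}' \right\}^{-1} - \Gamma_{ZY,h-j} \Gamma_{YY}^{-1} \right\rangle_{3+\delta} \\
&\quad + \sqrt{T}\, \langle R(\hat{\Phi}_T(\mathrm{mle},1) - \Phi) \rangle_{3+\delta}.
\end{aligned} \tag{A.37}
$$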
Since the first two terms of (A.37) are equivalent to the terms that arise in an h = 1-step ahead LFE predictor, we now focus on the remainder term. From an h-order Taylor series expansion of $\hat{\Phi}^h$ around $\Phi$, deduce

$$\|R(\hat{\Phi} - \Phi)\| \le C_h(\Phi) \sum_{j=2}^{h} \|\hat{\Phi} - \Phi\|^j \tag{A.38}$$

for some constant $C_h(\Phi)$ that depends on the forecast horizon $h$ and the matrix $\Phi$. Thus,

$$\langle \sqrt{T}\, R(\hat{\Phi}_T(\mathrm{mle},1) - \Phi) \rangle_{3+\delta} \le |C_h(\Phi)| \sum_{j=2}^{h} T^{(-j+1)/2} \left\langle \|\sqrt{T}(\hat{\Phi}_T(\mathrm{mle},1) - \Phi)\|^j \right\rangle_{3+\delta}. \tag{A.39}$$
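The remainder bound (A.38) can be made concrete for a matrix power. The sketch below is an illustration, not the paper's construction: it takes the remainder to be $(\Phi + D)^h - \Phi^h$ minus the first-order term $\sum_{j=0}^{h-1} \Phi^j D \Phi^{h-1-j}$, and it uses the explicit binomial bound $\sum_{j=2}^{h} \binom{h}{j} \|\Phi\|^{h-j} \|D\|^j$ as one admissible choice of constants. The matrices `Phi` and `D` are hypothetical.

```python
import math
import numpy as np

def mpow(A, k):
    """Integer matrix power A^k."""
    return np.linalg.matrix_power(A, k)

def linear_term(Phi, D, h):
    """First-order term of (Phi + D)^h around Phi."""
    return sum(mpow(Phi, j) @ D @ mpow(Phi, h - 1 - j) for j in range(h))

def remainder(Phi, D, h):
    """Everything in (Phi + D)^h that is at least quadratic in D."""
    return mpow(Phi + D, h) - mpow(Phi, h) - linear_term(Phi, D, h)

def binomial_bound(Phi, D, h):
    """Bound on the spectral norm of the remainder via submultiplicativity."""
    p, d = np.linalg.norm(Phi, 2), np.linalg.norm(D, 2)
    return sum(math.comb(h, j) * p ** (h - j) * d ** j for j in range(2, h + 1))

Phi = np.array([[0.5, 0.1],
                [0.0, 0.3]])
D = np.array([[0.20, -0.10],
              [0.05, 0.10]])
h = 4

# Shrink the perturbation and track the remainder norm and its bound.
scales = (1e-1, 1e-2, 1e-3)
rem_norms = [np.linalg.norm(remainder(Phi, s * D, h), 2) for s in scales]
bounds = [binomial_bound(Phi, s * D, h) for s in scales]
```

Shrinking the perturbation by a factor of ten reduces the remainder norm by roughly a factor of one hundred, which is the quadratic behaviour that makes $\sqrt{T}\,R(\hat{\Phi}_T - \Phi)$ asymptotically negligible when $\hat{\Phi}_T - \Phi$ is of order $T^{-1/2}$.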
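Thus, a sufficient condition for the uniform integrability of $\zeta_T(\mathrm{mle},1)$ is

$$\sup_{T \ge T(h,p)} \left\langle \frac{1}{\sqrt{T}} \sum E_t Y_{t-1}' \right\rangle_{(3+\delta) q_3 h} < \infty, \tag{A.40}$$

$$\sup_{T \ge T(h,p)} \left\langle \frac{1}{T} \sum Z_t Y_{t-1}' - \Gamma_{ZY,1} \right\rangle_{(3+\delta) q_4 h} < \infty, \tag{A.41}$$

$$\sup_{T \ge T(h,p)} \left\langle \left\{ \frac{1}{T} \sum Y_{t-1} Y_{t-1}' \right\}^{-1} \right\rangle_{(3+\delta) h \min\left\{ \frac{q_3}{q_3 - 1},\, \frac{q_4}{q_4 - 1} \right\}} < \infty \tag{A.42}$$

for some $q_3, q_4 > 1$ and $T$.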
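Part (b): Notice that $Y_t$ can be decomposed as follows:

$$Y_t = (I - \Phi L)^{-1} E_t + \frac{\alpha}{\sqrt{T}} (I - \Phi L)^{-1} A(L) E_t = \tilde{Y}_t + \frac{\alpha}{\sqrt{T}} \tilde{Z}_t. \tag{A.43}$$

Thus, for instance,

$$\left\langle \frac{1}{\sqrt{T}} \sum E_t Y_{t-1}' \right\rangle_{(3+\delta) q_3 h} \le \left\langle \frac{1}{\sqrt{T}} \sum E_t \tilde{Y}_{t-1}' \right\rangle_{(3+\delta) q_3 h} + \frac{\alpha}{\sqrt{T}} \left\langle \frac{1}{\sqrt{T}} \sum E_t \tilde{Z}_{t-1}' \right\rangle_{(3+\delta) q_3 h}.$$

We can now apply Theorem 5 to bound (A.33), (A.34), (A.40), and (A.41). Theorem 4 is used to establish bounds for (A.35) and (A.42). This amounts to choosing appropriate constants $\delta > 0$; $q_1, q_2, q_3, q_4 > 1$; and $T$ based on h, p, and Z in Assumption 3(v).

Theorem 7. Suppose that Assumption 3 is satisfied. Then there exists a $\delta > 0$ and $T(h,p)$ such that

$$\sup_{T > T(h,p)} E\|T(\mathrm{tr}[W\,\mathrm{MSE}(i,p)] - \mathrm{tr}[W\,\mathrm{MSE}(\mathrm{lfe},q)])\|^{1+\delta} < \infty, \qquad i = \mathrm{mle}, \mathrm{lfe}.$$

Thus the sequence $\|T(\mathrm{tr}[W\,\mathrm{MSE}(i,p)] - \mathrm{tr}[W\,\mathrm{MSE}(\mathrm{lfe},q)])\|$ is uniformly integrable.

Proof of Theorem 7. To prove the Theorem it is sufficient to consider terms of the form

$$\left\langle \mathrm{tr}\left\{ \zeta_T(i,p) \left\{ \frac{1}{T} \sum Y_{t-h} Y_{t-h}' \right\} \zeta_T(\mathrm{lfe},q)' \right\} \right\rangle_{1+\delta} = \langle \zeta_T(i,p) \rangle_{(1+\delta)/r_2}^{r_2}\, \langle \zeta_T(\mathrm{lfe},q) \rangle_{(1+\delta)/r_3}^{r_3}\, \left\langle \left\{ \frac{1}{T} \sum Y_{t-h} Y_{t-h}' \right\} \right\rangle_{(1+\delta)/r_1}^{r_1} \tag{A.44}$$

for some $r_i > 0$ and $r_1 + r_2 + r_3 = 1$. Choose $r_i = 1/3$ and apply Theorems 5 and 6 to bound the right-hand side of (A.44).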
Appendix B. Design of Monte Carlo experiment The data generating model was parameterized as follows. Asymptotically the DGP approaches a VAR(1) with coefficients ! ! ! cosðp=4Þ cosð5p=6Þ 0:9 0 cosðp=4Þ cosð5p=6Þ 1 F¼ sinðp=4Þ cosð5p=6Þ 0 0:5 sinðp=4Þ cosð5p=6Þ
and S ¼
1
2 0:4
2 0:4
4
! :
The distortion process follows a VMA(10) process with coefficient matrices A1 =
! 0:87 0:69 ; 1:37 0:03
! 0:11 0:20 A4 ¼ ; 0:10 0:12
A7 ¼
A10 ¼
0:08 0:15 0:17 0:13
! ;
0:15 0:03 0:24
0:01
0:05 0:85
! 0:81 ; 0:14
A3 ¼
0:24 A5 ¼ 0:19
! 0:17 ; 0:33
0:24 A6 ¼ 0:15
A2 ¼
A8 ¼
0:01
0:05
0:14
0:06
! ;
A9 ¼
! 0:27 ; 0:10
0:30 0:30
! 0:18 ; 0:29
0:50
0:12
0:21
0:03
! ;
! :
The weight matrix is of the form W¼
cosðp=4Þ sinðp=4Þ sinðp=4Þ
cosðp=4Þ
!
0:6
0
0
0:5
!
cosðp=4Þ sinðp=4Þ sinðp=4Þ
cosðp=4Þ
!0 :
References

Amemiya, T., 1980. Selection of regressors. International Economic Review 21, 331–354.
Baillie, R.T., 1979. Asymptotic prediction mean squared error for vector autoregressive models. Biometrika 66, 675–678.
Bates, J.M., Granger, C.W.J., 1969. The combination of forecasts. Operations Research Quarterly 20, 451–468.
Bhansali, R.J., 1996. Asymptotically efficient autoregressive model selection for multistep prediction. Annals of the Institute of Statistical Mathematics 48, 577–602.
Bhansali, R.J., 1997. Direct autoregressive predictors for multistep prediction: order selection and performance relative to plug-in predictors. Statistica Sinica 7, 425–449.
Bierens, H.B., 1990. A conditional moment test of functional form. Econometrica 58, 1443–1458.
Billingsley, P., 1968. Convergence of Probability Measures. Wiley, New York.
Chib, S., 1995. Marginal likelihood from the Gibbs output. Journal of the American Statistical Association 90 (432), 1313–1321.
Christoffersen, P.F., 1997. Optimal prediction under asymmetric loss. Econometric Theory 13, 808–817.
Christoffersen, P.F., Diebold, F.X., 1996. Further results on forecasting and model selection under asymmetric loss. Journal of Applied Econometrics 11, 561–571.
Clements, M.P., Hendry, D.F., 1998. Forecasting Economic Time Series. Cambridge University Press, Cambridge.
Corradi, V., Swanson, N.R., 2002. A consistent test for out-of-sample, nonlinear predictive ability. Journal of Econometrics, 353–381.
Dawid, A.P., 1992. Prequential analysis, stochastic complexity and Bayesian inference. In: Bernardo, J.M., Berger, J.O., Dawid, A.P., Smith, A.F.M. (Eds.), Bayesian Statistics, vol. 4, pp. 109–125.
Del Negro, M., Schorfheide, F., 2004. Priors from general equilibrium models for VARs. International Economic Review 45 (2), 643–673.
Diebold, F.X., 1998. Elements of Forecasting. South Western Publishing.
Doan, T., Litterman, R., Sims, C., 1984. Forecasting and conditional projections using realistic prior distributions. Econometric Reviews 3, 1–100.
Findley, D.F., 1983. On the use of multiple models for multi-period forecasting. American Statistical Association: Proceedings of Business and Economic Statistics, pp. 528–531.
Findley, D.F., Wei, C.Z., 1993. Moment bounds for deriving time series CLT's and model selection procedures. Statistica Sinica 3, 453–480.
Findley, D.F., Wei, C.Z., 2002. AIC, overfitting principles, and the boundedness of moments of inverse matrices for vector autoregressions and related processes. Journal of Multivariate Analysis 83, 415–450.
Fuller, W.A., Hasza, D.P., 1981. Properties of predictors for autoregressive time series. Journal of the American Statistical Association 76, 155–161.
Gourieroux, C., Monfort, A., Renault, E., 1993. Indirect inference. Journal of Applied Econometrics 8, S85–118.
Granger, C.W.J., 1993. On the limitations of comparing mean squared forecast errors: comment. Journal of Forecasting 12, 651–652.
Ing, C.K., 2003. Multistep prediction in autoregressive processes. Econometric Theory 19, 254–279.
Ing, C.K., Wei, C.Z., 2003. On same-realization prediction in an infinite-order autoregressive process. Journal of Multivariate Analysis 85, 130–155.
Ingram, B.F., Whiteman, C.H., 1994. Supplanting the Minnesota prior—Forecasting macroeconomic time series using real business cycle model priors. Journal of Monetary Economics 34, 497–510.
Kadiyala, K.R., Karlsson, S., 1997. Numerical methods for estimation and inference in Bayesian VAR models. Journal of Applied Econometrics 12, 99–132.
Lewis, R.A., Reinsel, G.C., 1985. Prediction of multivariate time series by autoregressive model fitting. Journal of Multivariate Analysis 16, 393–411.
Lewis, R.A., Reinsel, G.C., 1988. Prediction error of multivariate time series with mis-specified models. Journal of Time Series Analysis 9, 43–57.
Litterman, R.B., 1986. Forecasting with Bayesian vector autoregressions—five years of experience. Journal of Business and Economic Statistics 4, 25–37.
Mallows, C.L., 1973. Some comments on $C_p$. Technometrics 12, 661–676.
Marcellino, M., Stock, J.H., Watson, M.W., 2004. A comparison of direct and iterated multistep AR methods for forecasting macroeconomic time series. Manuscript, Department of Economics, Princeton University.
Min, C.K., Zellner, A., 1993. Bayesian and non-Bayesian methods for combining models and forecasts with applications to forecasting international growth rates. Journal of Econometrics 56, 89–118.
Phillips, P.C.B., Solo, V., 1992. Asymptotic theory for linear processes. Annals of Statistics 20, 971–1001.
Reinsel, G.C., 1980. Asymptotic properties of prediction errors for the multivariate autoregressive model using estimated parameters. Journal of the Royal Statistical Society 42, 328–333.
Rissanen, J., 1986. Stochastic complexity and modelling. Annals of Statistics 14, 1080–1100.
Schwarz, G., 1978. Estimating the dimension of a model. Annals of Statistics 6, 461–464.
Shibata, R., 1980. Asymptotically efficient selection of the order of the model for estimating parameters of a linear process. Annals of Statistics 8, 147–164.
Skoros, S., 2002. A decision-based approach to econometric modelling. Manuscript, Santa Fe Institute.
Speed, T.P., Yu, B., 1993. Model selection and prediction: normal regression. Annals of the Institute of Statistical Mathematics 45, 35–54.
Tiao, G.C., Tsay, R.S., 1994. Some advances in non-linear and adaptive modelling in time-series. Journal of Forecasting 13, 109–131.
Tsay, R.S., 1993. Comment: adaptive forecasting. Journal of Business and Economic Statistics 11, 140–142.
Wei, C.Z., 1992. On predictive least squares principles. Annals of Statistics 20, 1–42.
Weiss, A., 1991. Multi-step estimation and forecasting in dynamic models. Journal of Econometrics 48, 135–149.
Weiss, A., 1996. Estimating time series models using the relevant cost function. Journal of Applied Econometrics 11, 539–560.
Zellner, A., 1971. Introduction to Bayesian Inference in Econometrics. Wiley, New York.