E. J. Hannan, P. R. Krishnaiah, M. M. Rao, eds., Handbook of Statistics, Vol. 5 © Elsevier Science Publishers B.V. (1985) 179-187
6
Various Model Selection Techniques in Time Series Analysis

Ritei Shibata
1. Introduction

This chapter aims to give a short review of various model selection techniques which have been developed in the context of time series analysis. Our main concern is with moving-average (MA), autoregressive (AR) or autoregressive moving-average (ARMA) models. A related problem is called model checking, which aims at checking the adequacy of a model. Since the test statistics employed in model checking can also be used for constructing a model selection procedure, we first look at such statistics. We should, however, note that model selection is not a simple combination of model checkings. The aim of model selection is not only checking the adequacy of a model but also (a) obtaining a good predictor, or (b) describing a system or identifying a system. We consider a univariate ARMA(p, q) model,

$$\phi(B) z_t = \theta(B) \varepsilon_t, \qquad (1.1)$$

where $\phi(B) = 1 + \phi_1 B + \phi_2 B^2 + \cdots + \phi_p B^p$ and $\theta(B) = 1 + \theta_1 B + \theta_2 B^2 + \cdots + \theta_q B^q$
are transfer functions in the backward shift operator B. For simplicity, we assume that $\{\varepsilon_t\}$ is a sequence of independent identically normally distributed random variables with mean 0 and variance $\sigma^2$. The sequence of autocovariances is denoted by $\{\gamma_l\}$, and that of the sample autocovariances, based on n observations $z_1, \ldots, z_n$, is denoted by $\{\hat{C}_l\}$. The estimated noise sequence is then

$$\hat{\varepsilon}_t = \hat{\theta}(B)^{-1} \hat{\phi}(B) z_t,$$

where $\hat{\theta}(B)$ and $\hat{\phi}(B)$ are the maximum likelihood or quasi-maximum likelihood estimates of $\theta(B)$ and $\phi(B)$, respectively. In this chapter, we do not go
further into mathematical details. We assume any commonly used regularity conditions. In the next section, we will see some test statistics which are specific to the type of model, MA, AR, or ARMA. In Section 3, some other statistics will be seen which are not specific to the type of model. Section 4 discusses how to construct a selection procedure based on such test statistics.
2. Statistics specific for each model
2.1. Moving-average models

A commonly used test statistic for MA(q) is based on the sample autocovariances [7]. The quadratic form of the sample autocovariances,

$$T = n \sum_{l,m=q+1}^{h} \hat{C}_l \hat{C}_m \hat{\sigma}^{lm},$$

is asymptotically distributed as $\chi^2_{h-q}(\psi)$ with noncentrality

$$\psi = n \sum_{l,m=q+1}^{h} \gamma_l \gamma_m \sigma^{lm},$$

where $\sigma^{lm}$ and $\hat{\sigma}^{lm}$ are the (l, m) elements of the inverse of the autocovariance matrix and of the sample autocovariance matrix, respectively. Therefore, by T we can check q-dependence, a specific property of MA(q), but we may fail in checking the linearity of the process. A remarkable fact is that this statistic is not equivalent to the maximum log likelihood in any sense.
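As a rough illustration (not from the original text), the following sketch computes T from the sample autocovariances. The indexing of $\hat{\sigma}^{lm}$ is one plausible reading of the definition above, taking the (q+1, ..., h) block of the inverse of the h-by-h Toeplitz sample autocovariance matrix; the function names are ours.

```python
import numpy as np

def sample_autocov(z, maxlag):
    """Sample autocovariances C_0, ..., C_maxlag (divisor n)."""
    z = np.asarray(z, dtype=float)
    n = len(z)
    zc = z - z.mean()
    return np.array([np.dot(zc[:n - l], zc[l:]) / n for l in range(maxlag + 1)])

def ma_q_statistic(z, q, h):
    """Quadratic-form test statistic for MA(q) based on sample
    autocovariances at lags q+1, ..., h; asymptotically chi-square
    with h - q degrees of freedom under the null hypothesis."""
    n = len(z)
    C = sample_autocov(z, h)
    # h x h Toeplitz matrix [C_{|l-m|}], l, m = 1, ..., h
    Gamma_hat = np.array([[C[abs(l - m)] for m in range(h)] for l in range(h)])
    sigma_inv = np.linalg.inv(Gamma_hat)
    c = C[q + 1: h + 1]        # (C_{q+1}, ..., C_h)
    S = sigma_inv[q:, q:]      # elements sigma^{lm}, l, m = q+1, ..., h (our reading)
    return n * c @ S @ c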
2.2. Autoregressive models

For testing AR(p), the sequence of partial autocorrelations $\{\phi_m\}$, which are zero for m > p, plays an important role. The sum of squared sample partial autocorrelations $\{\hat{\phi}_m\}$,

$$T = n \sum_{l=1}^{h} \hat{\phi}^2_{p+l},$$

is asymptotically distributed as $\chi^2_h(\psi)$ with noncentrality

$$\psi = n \sum_{l=1}^{h} \phi^2_{p+l}.$$

Therefore, by T we can test the null hypothesis of $\{z_t\}$ being an AR(p). The commonly used definition of the sample partial autocorrelation $\hat{\phi}_m$ is the last coordinate of the solution $\hat{\phi}(m)' = (\hat{\phi}_1(m), \ldots, \hat{\phi}_m(m))$ of the m-th order Yule-Walker equation.
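For concreteness, here is a minimal sketch (ours, not from the text) of the Durbin-Levinson recursion, which yields the sample partial autocorrelations as the last Yule-Walker coordinates, together with the statistic T above. The recursion uses the sign convention $z_t = \sum_j a_j z_{t-j} + \varepsilon_t$; the squared statistic is unaffected by the sign.

```python
import numpy as np

def partial_autocorrelations(z, maxlag):
    """Sample partial autocorrelations at lags 1, ..., maxlag via the
    Durbin-Levinson recursion, i.e. the last coordinate of the solution
    of the m-th order Yule-Walker equation for m = 1, ..., maxlag."""
    z = np.asarray(z, dtype=float)
    n = len(z)
    zc = z - z.mean()
    r = np.array([np.dot(zc[:n - l], zc[l:]) / n for l in range(maxlag + 1)])
    r = r / r[0]                         # autocorrelations
    pacf = np.zeros(maxlag)
    phi = np.zeros(maxlag + 1)           # phi[1..m]: current AR(m) coefficients
    v = 1.0                              # relative prediction error variance
    for m in range(1, maxlag + 1):
        k = (r[m] - np.dot(phi[1:m], r[m - 1:0:-1])) / v   # reflection coefficient
        phi_new = phi.copy()
        phi_new[m] = k
        phi_new[1:m] = phi[1:m] - k * phi[m - 1:0:-1]      # update lower coefficients
        phi = phi_new
        v *= (1.0 - k * k)
        pacf[m - 1] = k
    return pacf

def ar_p_statistic(z, p, h):
    """T = n * (sum of squared sample partial autocorrelations at lags
    p+1, ..., p+h); asymptotically chi-square with h df under AR(p)."""
    pacf = partial_autocorrelations(z, p + h)
    return len(z) * np.sum(pacf[p:p + h] ** 2)
```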
Historically, the following statistic was proposed earlier by Quenouille [19]:

$$T = \frac{n}{\hat{\sigma}^4(p)} \sum_{l=1}^{h} \hat{h}^2_{p+l},$$

where

$$\hat{h}_l = \frac{1}{n} \sum_{t=p+1}^{n} \Big( \sum_{i=0}^{p} \hat{\phi}_i(p) z_{t-i} \Big) \Big( \sum_{j=0}^{p} \hat{\phi}_j(p) z_{t-l+j} \Big)$$

and

$$\hat{\sigma}^2(p) = \frac{1}{n-p} \sum_{t=p+1}^{n} \Big( \sum_{l=0}^{p} \hat{\phi}_l(p) z_{t-l} \Big)^2,$$

with $z_t = 0$ for $t \le 0$. The above $\hat{h}_l$ can be thought of as an approximation to the covariance

$$\tilde{h}_l = \frac{1}{n} \sum_{t=p+1}^{n} \varepsilon_t \bar{\varepsilon}_{t-l}$$

between the noise $\varepsilon_t$ and its backward representation $\bar{\varepsilon}_t = \sum_{j \ge 0} \phi_j z_{t+j}$. Hence, $\hat{h}_l / \hat{\sigma}^2(p)$ might be more natural than $\hat{\phi}_{p+l}$ as an estimate of the partial autocorrelation. It is well known that the above two statistics are asymptotically equivalent to each other [2]. These statistics are also asymptotically equivalent to the maximum log likelihood.
2.3. Autoregressive moving-average models

A specific property of the autoregressive moving-average model is that it is non-identifiable when overfitted. Since an ARMA model $\eta(B)\phi(B) z_t = \eta(B)\theta(B) \varepsilon_t$ has the same covariance structure as that of (1.1), the transfer functions $\theta(B)$ and $\phi(B)$ are not uniquely determined by the autocovariances of $\{z_t\}$. The generalized partial autocorrelation $\phi_k(j)$ is defined as the last coordinate of the solution $\phi(j)$ of the equation

$$A(j, k)\, \phi(j) = -\gamma(j),$$

where

$$A(j, k) = \begin{pmatrix} \gamma_j & \gamma_{j-1} & \cdots & \gamma_{j-k+1} \\ \gamma_{j+1} & \gamma_j & \cdots & \gamma_{j-k+2} \\ \vdots & & & \vdots \\ \gamma_{j+k-1} & \gamma_{j+k-2} & \cdots & \gamma_j \end{pmatrix}$$

and $\gamma(j)' = (\gamma_{j+1}, \ldots, \gamma_{j+k})$.
Because of the non-identifiability when overfitted, the equations which characterize ARMA(p, q),

$$\gamma_l = -\sum_{j=1}^{p} \phi_j \gamma_{l-j} \qquad \text{for } l > q,$$

imply that the matrix $A(j, k)$ is nonsingular if and only if $j \le q$ or $k \le p$, and that

$$\phi_k(j) = \begin{cases} 0 & \text{if } j = q,\; k > p, \\ \text{undefined} & \text{if } j > q,\; k > p, \\ \text{nonzero in general} & \text{otherwise.} \end{cases}$$

If j = 0, $\phi_k(j)$ reduces to an ordinary partial autocorrelation $\phi_k$. Making use of this property, we can find the orders p and q as follows. In the estimated $\hat{\phi}$ array,

$$\Phi = [\hat{\phi}_k(j)],$$

find the coordinates (q + 1, p + 1) which specify the North-West corner of the largest South-East sub-array all of whose elements are unstable but which has zeros on its North edge. Based on a similar idea, the use of the $\Delta$ array,

$$\Delta = [\,|A(j, k)|\,],$$

is proposed by Beguin, Gourieroux and Monfort [4]. Equivalent procedures are proposed by Chow [8], Graupe, Krause and Moore [10], and Woodside [29]. Since

$$|A(j, k)| \begin{cases} = 0 & \text{if } j > q \text{ and } k > p, \\ \ne 0 & \text{otherwise,} \end{cases}$$

it is enough to find, in the estimated $\Delta$ array, the coordinates (q + 1, p + 1) which specify the North-West corner of the largest South-East sub-array all of whose elements are zeros. We can also construct a test statistic by considering the determinant of the above South-East sub-array [4]. More complicated statistics, called the S array and the R array, are proposed by Gray, Kelley and McIntire [11]. For example, the (j, k) element of the S array is defined as
$$S_k(j) = \{(-1)^k S^{k+1,1}\}^{-1},$$

where $S^{l,m}$ is the (l, m) element of the inverse of

$$S = \begin{pmatrix} \gamma_j & \gamma_{j+1} & \cdots & \gamma_{j+k} \\ \gamma_{j+1} & \gamma_{j+2} & \cdots & \gamma_{j+k+1} \\ \vdots & & & \vdots \\ \gamma_{j+k} & \gamma_{j+k+1} & \cdots & \gamma_{j+2k} \end{pmatrix}.$$
The S array has the following properties:

$$S_p(j) = \begin{cases} (-1)^p \Big(1 + \sum_{l=1}^{p} \phi_l\Big) & \text{for all } j > q - p, \\[4pt] (-1)^{p+1} \Big(1 + \sum_{l=1}^{p} \phi_l\Big) \phi_p & \text{for all } j < -q - p, \end{cases}$$

$$S_k(-q - p - 1) = \pm\infty \qquad \text{for any } k > p,$$

$$S_k(j) = \text{undefined for } k > p, \text{ if } j < -q - p \text{ or } j > q - p.$$
Then, as in the $\Delta$ array, we can find the coordinates (q + 1, p + 1) in the S array. It is known that some elements of the S array coincide with the partial autocorrelations (see Woodward and Gray [30]). An advantage of a selection procedure based on such generalized partial autocorrelations is that we can avoid unstable estimation of ARMA parameters when overfitted. It is, of course, at the risk of underfitting. In other words, such a procedure might be good for aim (b) in Section 1.
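A minimal sketch of the corner-method $\Delta$ array, under the reconstruction of A(j, k) given above (indexing conventions and function name are ours; in practice the near-zero South-East block with corner (q + 1, p + 1) is judged from the estimated determinants):

```python
import numpy as np

def corner_table(z, max_j, max_k):
    """Delta array of determinants |A(j, k)| computed from sample
    autocovariances; under ARMA(p, q) the theoretical determinant
    vanishes for j > q and k > p, so the estimated array should show
    a South-East block of near-zeros."""
    z = np.asarray(z, dtype=float)
    n = len(z)
    zc = z - z.mean()
    maxlag = max_j + max_k               # largest lag needed in A(j, k)
    C = np.array([np.dot(zc[:n - l], zc[l:]) / n for l in range(maxlag + 1)])
    gamma = lambda l: C[abs(l)]          # C_{-l} = C_l
    table = np.zeros((max_j, max_k))
    for j in range(1, max_j + 1):
        for k in range(1, max_k + 1):
            # (l, m) element of A(j, k) is gamma_{j + l - m}, l, m = 1, ..., k
            A = np.array([[gamma(j + l - m) for m in range(1, k + 1)]
                          for l in range(1, k + 1)])
            table[j - 1, k - 1] = np.linalg.det(A)
    return table
```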
3. Other statistics not specific to the type of model

3.1. Based on likelihood

The dominant term of the maximum log likelihood is the one-step-ahead prediction error,

$$\hat{\sigma}^2_\varepsilon = \frac{1}{n} \sum_{t=1}^{n} \hat{\varepsilon}_t^2. \qquad (3.1)$$
However, non-identifiability of the ARMA model when overfitted causes a problem. If we want to estimate both transfer functions $\theta(B)$ and $\phi(B)$, these estimates are not only inconsistent but also unstable. For aim (a) in Section 1, such inconsistency does not, however, cause much of a problem, since the maximum likelihood estimate of the transfer function $k(B) = \theta(B)^{-1}\phi(B)$
is not far from the true one, even when overfitted [14]. Another way might be to use the Lagrangian multiplier test statistic, as demonstrated in Poskitt and Tremayne [18]. By modifying the Fisher information matrix, we can avoid the problem of singularity, but to do this we have to fix an alternative a priori. Therefore, such a statistic is not suitable for model selection.

3.2. Portmanteau test statistic
The portmanteau test statistic, or Q statistic, is the sum of squares of the serial correlations $r_l(\hat{\varepsilon}_t)$ of the residual sequence $\{\hat{\varepsilon}_t\}$,

$$T = n \sum_{l=1}^{h} r_l^2(\hat{\varepsilon}_t).$$

It is shown by Box and Pierce [6] that the above T is asymptotically distributed as $\chi^2_{h-p-q}$ under the null hypothesis. To accelerate the speed of convergence to the asymptotic distribution, Ljung and Box [16] proposed a correction,

$$T = n(n+2) \sum_{l=1}^{h} (n - l)^{-1} r_l^2(\hat{\varepsilon}_t). \qquad (3.2)$$
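A minimal sketch of (3.2) applied to a given residual sequence (function name ours):

```python
import numpy as np

def ljung_box(resid, h, p, q):
    """Ljung-Box corrected portmanteau statistic (3.2) on a residual
    sequence; under the null hypothesis it is asymptotically chi-square
    with h - p - q degrees of freedom."""
    e = np.asarray(resid, dtype=float) - np.mean(resid)
    n = len(e)
    denom = np.dot(e, e)
    # serial correlations r_1, ..., r_h of the residuals
    r = np.array([np.dot(e[:n - l], e[l:]) / denom for l in range(1, h + 1)])
    T = n * (n + 2) * np.sum(r ** 2 / (n - np.arange(1, h + 1)))
    return T, h - p - q    # statistic and its degrees of freedom
```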
A detailed analysis of the distribution under the null hypothesis and under alternatives can be found in [16]. The above statistic is the most natural for checking the uncorrelatedness of residuals, but, if our main concern is only in obtaining a good predictor in the sense of mean squared error, it might be checking too many things. In spite of the correction in (3.2), convergence is not so fast, since the statistic consists of fourth moments of the original process. A comparison with the Lagrangian multiplier test can be found in Godfrey [9].

3.3. Cross-validation
This kind of statistic was proposed by Stone [27] in the context of multiple regression. A formal extension yields the statistic

$$T = \sum_{t=1}^{n} \{z_t - \hat{z}_t(t-)\}^2,$$

where $\hat{z}_t(t-)$ is an interpolation of $z_t$, estimated from the observations except $z_t$. It generally requires laborious calculation. Not much is known about the behavior of this statistic, but it has a tendency toward overfitting, particularly when outliers exist.
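As an illustration only (not the author's procedure), here is a sketch of leave-one-out cross-validation for an AR(p) model written as a regression on lagged values; the prediction from the model refitted without case t stands in for the interpolation $\hat{z}_t(t-)$ of the text.

```python
import numpy as np

def ar_loo_cv(z, p):
    """Leave-one-out cross-validation score for an AR(p) model, treating
    the fit as a linear regression of z_t on its p lags. Each case is
    dropped in turn and the model is refitted; a hat-matrix identity
    could avoid the refits, but the direct loop keeps the idea plain."""
    z = np.asarray(z, dtype=float)
    n = len(z)
    # design matrix of lagged values: row t holds z_{p+t-1}, ..., z_t
    X = np.column_stack([z[p - j - 1: n - j - 1] for j in range(p)])
    y = z[p:]
    score = 0.0
    for t in range(len(y)):
        mask = np.arange(len(y)) != t                       # drop the t-th case
        beta, *_ = np.linalg.lstsq(X[mask], y[mask], rcond=None)
        score += (y[t] - X[t] @ beta) ** 2                  # prediction error at t
    return score
```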
4. Model selection

We can construct a selection procedure by using one of the test statistics introduced in the previous sections. However, it is not so good an idea to repeat such testing for various p and q. If we do so, we first have to choose the many significance levels required, and the resulting power is a complicated function of the levels, as well as of the order of the testings. It is hard to get good control even of the overall type I error. As an alternative, we can consider the use of homogeneous testing, as in Krishnaiah [15]. By such testing, we can control the type I error well, but it still requires a lot of computation. A better principle might be to find a model which balances overfitting risks and underfitting risks. A typical way of realizing such balancing behavior is to select the p and q which minimize
$$C(p, q) = T + \alpha(p + q). \qquad (4.1)$$
Here T is one of the test statistics introduced in the previous sections. The second term in (4.1) can be considered a penalty term for the complexity of the model, and a term compensating for the random fluctuation of T. Since the expectation of T is p + q when T is distributed as $\chi^2$ with p + q degrees of freedom, it is better to choose $\alpha$ greater than 1, so as to ensure a positive penalty for an increase of the degrees of freedom. In AIC [1], BIC [20], or $\phi$ [12], which are called criterion procedures, all criteria are of the form (4.1) with T = -2 log(maximum likelihood). For such criteria, a lot of discussion has taken place. The most controversial point is how to choose $\alpha$, which is 2 in AIC, log n in BIC, and c loglog n for some c > 2 in $\phi$. The choice of $\alpha$ depends on the aim of the selection. If our main concern is prediction, $\alpha$ should be chosen so as to yield smaller prediction error. If it is to identify a system in a stable way, consistency is more important than the size of the prediction error. In Shibata [21], these two aspects of model selection are demonstrated for the case of AIC, in the context of nested AR model fitting. It is shown that selection by the minimum AIC procedure has a tendency toward overfitting and is not consistent, but the increase of the prediction error is not large, only of the order O(1/n) uniformly in $\phi_1, \ldots, \phi_p$. Similar discussions are given for general $\alpha$ by Bhansali and Downham [5] and Atkinson [3]. Their conclusions agree on the point that $\alpha$ should be greater than 2 even if the prediction error is our main concern. An answer to the question of optimality is given by Shibata [22] from the viewpoint of prediction error. He showed that the choice $\alpha = 2$ is asymptotically optimal, under the assumption that the underlying process does not degenerate to a finite-order AR process. This result, namely 'asymptotic efficiency of the selection with $\alpha = 2$', is also applied to an autoregressive spectral estimate [24].
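A minimal sketch of criterion minimization (4.1) for nested AR models, with T taken as -2 log(maximum Gaussian likelihood), which is n log $\hat{\sigma}^2(p)$ up to an additive constant (our simplification; ARMA fitting would require iterative likelihood maximization):

```python
import numpy as np

def select_ar_order(z, max_p, alpha=2.0):
    """Select the AR order by minimizing C(p) = n log sigma_hat^2(p) + alpha * p,
    a criterion of the form (4.1). alpha = 2 gives AIC, alpha = log(n) gives
    BIC, and alpha = c * loglog(n) with c > 2 gives the Hannan-Quinn phi."""
    z = np.asarray(z, dtype=float)
    n = len(z)
    best = (np.inf, 0)
    for p in range(max_p + 1):
        if p == 0:
            resid = z - z.mean()
        else:
            # least-squares fit of z_t on its p lags (conditional likelihood)
            X = np.column_stack([z[p - j - 1: n - j - 1] for j in range(p)])
            y = z[p:]
            beta, *_ = np.linalg.lstsq(X, y, rcond=None)
            resid = y - X @ beta
        sigma2 = np.mean(resid ** 2)
        C = n * np.log(sigma2) + alpha * p
        best = min(best, (C, p))
    return best[1]    # the selected order
```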
Taniguchi [28] showed that Shibata's result holds true also for ARMA models. However, for the case of small samples, the above asymptotic theory does not work so well [23]. Recently, Shibata [25] showed that the approximate minimax regret choice of $\alpha$ is 2.8. The regret measures how much the prediction error increases when a selection procedure is applied, compared with the error when the true model is known. A further generalization of the AIC can be found in [26]. If we want to avoid overfitting in any case, $\alpha$ should be chosen greater than 2 loglog n but of smaller order than n. This is the result of Hannan and Quinn [12]. The term 2 loglog n follows from the fact that the range of the random fluctuation of T is at most 2 loglog n, by the law of the iterated logarithm. It is interesting to note that the choice $\alpha$ = log n in BIC, which is derived from the Bayesian viewpoint, satisfies the above condition.

Hannan and Rissanen [13] proposed a practical way of selecting the orders p and q of an ARMA model by using one of the above consistent criteria. Assuming p = q, find the m which minimizes C(m, m) in (4.1); then m is asymptotically equal to max(p_0, q_0) of the true orders p_0 and q_0. Next, assuming p = m or q = m, find the p and q which minimize C(p, q); then we can find p_0 and q_0 consistently. A remaining problem in practice is how to choose P and Q, which specify the largest orders of p and q. This is equivalent to the problem of how to choose the 'h' of the statistics in Section 2. This problem has not been analyzed well, but an analysis by Shibata [26] gives a rough guideline: we can choose any large P and Q, as long as the tail probability $P(F_{m+2,\,n-P-Q} > \alpha m/(m+2))$ is close enough to $P(\chi^2_{m+2} > \alpha m)$ for $m = 1, 2, 3, \ldots, n - P - Q$.

As a final remark, we note that if $\alpha$ is chosen bounded, then the actual penalty is seriously affected by small changes of T, as well as by changes of initial conditions. We should choose $\alpha$ so as to compensate well for any such changes.

References

[1] Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. In: B. N. Petrov and F. Csáki, eds., Second International Symposium on Information Theory, 267-281. Akadémiai Kiadó, Budapest.
[2] Anderson, T. W. (1971). The Statistical Analysis of Time Series. Wiley, New York.
[3] Atkinson, A. C. (1980). A note on the generalized information criterion for choice of a model. Biometrika 67, 413-418.
[4] Beguin, J.-M., Gourieroux, C. and Monfort, A. (1980). Identification of a mixed autoregressive-moving average process: the corner method. In: O. D. Anderson, ed., Time Series, 423-435. North-Holland, Amsterdam.
[5] Bhansali, R. J. and Downham, D. Y. (1977). Some properties of the order of an autoregressive model selected by a generalization of Akaike's EPF criterion. Biometrika 64, 547-551.
[6] Box, G. E. P. and Pierce, D. A. (1970). Distribution of residual autocorrelations in autoregressive-integrated moving average time series models. J. Amer. Statist. Assoc. 65, 1509-1526.
[7] Box, G. E. P. and Jenkins, G. M. (1976). Time Series Analysis: Forecasting and Control. Holden-Day, San Francisco.
[8] Chow, J. C. (1972). On estimating the orders of an autoregressive moving-average process with uncertain observations. IEEE Trans. Automat. Control AC-17, 707-709.
[9] Godfrey, L. G. (1979). Testing the adequacy of a time series model. Biometrika 66, 67-72.
[10] Graupe, D., Krause, D. J. and Moore, J. B. (1975). Identification of autoregressive-moving average parameters of time series. IEEE Trans. Automat. Control AC-20, 104-107.
[11] Gray, H. L., Kelley, G. D. and McIntire, D. D. (1978). A new approach to ARMA modeling. Comm. Statist. B7, 1-115.
[12] Hannan, E. J. and Quinn, B. G. (1979). The determination of the order of an autoregression. J. Roy. Statist. Soc. Ser. B 41, 190-195.
[13] Hannan, E. J. and Rissanen, J. (1982). Recursive estimation of mixed autoregressive-moving average order. Biometrika 69, 81-94.
[14] Hannan, E. J. (1982). Fitting multivariate ARMA models. In: G. Kallianpur, P. R. Krishnaiah, J. K. Ghosh, eds., Statistics and Probability: Essays in Honor of C. R. Rao, 307-316. North-Holland, Amsterdam.
[15] Krishnaiah, P. R. (1982). Selection of variables under univariate regression models. In: P. R. Krishnaiah, ed., Handbook of Statistics, Vol. 2. North-Holland, Amsterdam.
[16] Ljung, G. M. and Box, G. E. P. (1978). On a measure of lack of fit in time series models. Biometrika 65, 297-303.
[17] Milhøj, A. (1981). A test of fit in time series models. Biometrika 68, 177-187.
[18] Poskitt, D. S. and Tremayne, A. R. (1981). An approach to testing linear time series models. Ann. Statist. 9, 974-986.
[19] Quenouille, M. H. (1947). A large-sample test for the goodness of fit of autoregressive schemes. J. Roy. Statist. Soc. Ser. B 11, 123-129.
[20] Schwarz, G. (1978). Estimating the dimension of a model. Ann. Statist. 6, 461-464.
[21] Shibata, R. (1976). Selection of the order of an autoregressive model by Akaike's information criterion. Biometrika 63, 117-126.
[22] Shibata, R. (1980). Asymptotically efficient selection of the order of the model for estimating parameters of a linear process. Ann. Statist. 8, 147-164.
[23] Shibata, R. (1980). Selection of the number of regression parameters in small sample cases. In: Statistical Climatology, 137-148. Elsevier, Amsterdam.
[24] Shibata, R. (1981). An optimal autoregressive spectral estimate. Ann. Statist. 9, 300-306.
[25] Shibata, R. (1983). A theoretical view of the use of AIC. In: O. D. Anderson, ed., Time Series Analysis: Theory and Practice, Vol. 4, 237-244. Elsevier, Amsterdam.
[26] Shibata, R. (1984). Approximate efficiency of a selection procedure for the number of regression variables. Biometrika 71, 43-49.
[27] Stone, M. (1974). Cross-validatory choice and assessment of statistical predictions. J. Roy. Statist. Soc. Ser. B 36, 111-133.
[28] Taniguchi, M. (1980). On selection of the order of the spectral density model for a stationary process. Ann. Inst. Statist. Math. 32A, 401-419.
[29] Woodside, C. M. (1971). Estimation of the order of linear systems. Automatica 7, 727-733.
[30] Woodward, W. A. and Gray, H. L. (1981). On the relationship between the S array and the Box-Jenkins method of ARMA model identification. J. Amer. Statist. Assoc. 76, 579-587.
Model selection techniques in time series analysis
187
[9] Godfrey, L. G. (1979). Testing the adequacy of a time series model. Biometrika 66, 67-72. [10] Graupe, D., Krause, D. J. and Moore, J. B. (1975). Identification of autoregressive-moving average parameters of time series. I E E E Trans. Automat. Control AC-20, 104--107. [ll] Gray, H. L. Kelley, G. D. and McIntire, D. D. (1978). A new approach to ARMA modeling. Comm. Statist. B7, 1-115. [12] Hannah, E. J. and Quinn, B. G. (1979). The determination of the order of an autoregression. J. Roy. Statist. Soc. Ser. B 41, 190-195. [13] Hannan, E. J. and Rissanen, J. (1982). Recursive estimation of mixed autoregressive-moving average order. Biometrika 69, 81-94. [14] Hannan, E. J. (1982). Fitting multivariate A R M A models. In: G. Kallianpur, P. R. Krishnaiah, J. K. Ghosh, eds., Statistics and Probability: Essays in Honor ofC. R. Rao, 307-316. North-Holland, Amsterdam. [15] Krishnaiah, P. R. (1982). Selection of variables under univariate regression models. In: P. R. Krishnaiah, ed., Handbook of Statistics--II. North-Holland, Amsterdam. [16] Ljung, C. M. and Box, G. E. P. (1978). On a measure of lack of fit in time series models. Biometrika 65, 297-303. [17] Milh0j, A. (1981). A test of fit in time series models. Biometrika 68, 177-18% [18] Poskitt, D. S. and Tremayne, A. R. (1981). An approach to testing linear time series models. Ann. Statist. 9, 974--986. [19] Quenouille, M. H. (1947). A large-sample test for the goodness of fit of autoregressive schemes. J. Roy. Statist. Soc. Ser. B 11, 123-129. [20] Schwarz, C. (1978). Estimating the dimension of a model. Ann. Statist. 6, 461-464. [21] Shibata, R. (1976). Selection of the order of an autorcgressive model by Akaike's information criterion. Biometrika 63, 117-126. [22] Shibata, R. (1980). Asymptotically efficient selection of the order of the model for estimating parameters of a linear process. Ann. Statist. 8, 147-164. [23] Shibata, R. (1980). Selection of the number of regression parameters in small sample cases. In: Statistical Climatology, 137-148. Elsevier, Amsterdam. [24] Shibata, R. (1981). An optimal autoregressive spectral estimate. Ann. Statist. 9, 300-306. [25] Shibata, R. (1983). A theoretical view of the use of AIC. In: O. D. Anderson, ed., Time Series Analysis: Theory and Practice, Vol. 4, 237-244. Elsevier, Amsterdam [26] Shibata, R. (1984). Approximate efficiency of a selection procedure for the number of regression variables. Biometrika 71, 43-49. [27] Stone, M. (1974). Cross-validatory choice and assessment of statistical predictions. Z Roy~ Statist. Soc. 2, 111-133. [28] Taniguchi, M. (1980). On selection of the order of the spectral density model for a stationary process. Ann. Inst. Statist. Math. 32A, 401--419. [29] Woodside, C. M. (1971)o Estimation of the order of linear systems. Automatica 7, '727-733. [30] Woodward, W. A. and Gray, H. L. (1981). On the relationship between the S array and the Box-Jenkins method of A R M A model identification. J. Amer. Statist. Assoc. 76, 579-587.