Statistics & Probability Letters 64 (2003) 293 – 303
Maximized log-likelihood updating and model selection
Beong Soo So
Department of Statistics, Ewha University, Seoul 120-750, South Korea
Received February 2003
Abstract
On the basis of simple algebraic identities, we derive a new updating formula for the maximized log-likelihoods. This formula enables us to decompose the predictive minimum description length (PMDL) criterion of Rissanen (J. Roy. Statist. Soc. Ser. B 49 (1987) 1) into the maximized log-likelihood and a penalty term representing model complexity. Statistical implications for model selection problems are discussed. © 2003 Elsevier B.V. All rights reserved.
Keywords: Information criteria; Maximized log-likelihood; Updating formula
1. Introduction
Since the pioneering work of Akaike (1973), various model selection criteria have been proposed in the statistical literature. They are essentially equivalent to the following. If $k$ indexes the model, choose $k$ to maximize the information criterion
$$L_n(\hat\theta_k; k) - c_n(k),$$
where $L_n(\hat\theta_k; k)$ is the maximized log-likelihood function, $\hat\theta_k$ is the maximum likelihood estimator (MLE) of the parameter $\theta \in R^{p_k}$ in the model, $c_n(k)$ is a nonnegative random variable that measures the complexity of the chosen model and $n$ is the sample size. The complexity $c_n(k)$ is usually proportional to the number of parameters $p_k$ of the model and is a speci
Recently, Rissanen (1987) proposed an information criterion that selects the model which maximizes
$$\mathrm{PMDL} = \sum_{t=1}^{n} l_t(\hat\theta_{t-1}),$$
where $l_t(\theta)$ is the conditional log-likelihood function of the $t$th observation $y_t$ given the previous observations $(y_1, \ldots, y_{t-1})$, and $\hat\theta_t$ is the MLE of $\theta$ based on the data $(y_1, \ldots, y_t)$. As a special case of the PMDL criterion, Hannan et al. (1989) and Wei (1992) investigated asymptotic properties of the predictive least-squares (PLS) criterion
$$\mathrm{PLS} = \sum_{t=1}^{n} (y_t - x_t'\hat\theta_{t-1})^2$$
for the linear regression model $y_t = x_t'\theta + u_t$, $t = 1, \ldots, n$, where $x_t, \theta \in R^k$ are column vectors of regressor variables and parameters, $x'$ is the transpose of $x$, $\hat\theta_t$ is the least squares estimator of $\theta$ based on the data $\{(y_1, x_1), \ldots, (y_t, x_t)\}$ and $u_t$ is a sequence of independent and identically distributed (i.i.d.) errors with zero mean and
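Lemma 1. Let $\theta$ be a parameter in the parameter space $\Theta$. Let $\{l_t(\theta)\}_{t=0}^{n}$ be a sequence of functions defined on $\Theta$. If we let
$$S_t(\theta) = \sum_{s=0}^{t} l_s(\theta)$$
and let $\hat\theta_t = \arg\max_{\theta} S_t(\theta)$, that is, $S_t(\hat\theta_t) \ge S_t(\theta)$ for any $\theta \in \Theta$, $t = 0, \ldots, n$, then we have, for any $m \le n$,
$$S_n(\hat\theta_n) - S_m(\hat\theta_m) = \sum_{t=m+1}^{n} l_t(\hat\theta_{t-1}) + \sum_{t=m+1}^{n} [S_t(\hat\theta_t) - S_t(\hat\theta_{t-1})], \tag{1}$$
$$S_n(\hat\theta_n) - S_m(\hat\theta_m) = \sum_{t=m+1}^{n} l_t(\hat\theta_t) + \sum_{t=m+1}^{n} [S_{t-1}(\hat\theta_t) - S_{t-1}(\hat\theta_{t-1})], \tag{2}$$
$$\sum_{t=1}^{n} l_t(\hat\theta_{t-1}) = \sum_{t=1}^{n} l_t(\hat\theta_n) - \Delta_n^{+}, \tag{3}$$
$$\sum_{t=1}^{n} l_t(\hat\theta_t) = \sum_{t=1}^{n} l_t(\hat\theta_n) + \Delta_n^{-}, \tag{4}$$
where
$$\Delta_n^{+} = \sum_{t=1}^{n} [S_t(\hat\theta_t) - S_t(\hat\theta_{t-1})] + [S_0(\hat\theta_0) - S_0(\hat\theta_n)]$$
and
$$\Delta_n^{-} = \sum_{t=1}^{n} [S_{t-1}(\hat\theta_{t-1}) - S_{t-1}(\hat\theta_t)] + [S_0(\hat\theta_n) - S_0(\hat\theta_0)].$$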
One immediate consequence of the lemma is the following set of inequalities.

Corollary 1. Under the same conditions as in Lemma 1, we have
$$\sum_{t=1}^{n} l_t(\hat\theta_{t-1}) \le \sum_{t=1}^{n} l_t(\hat\theta_n).$$
Furthermore, if $S_0(\hat\theta_n) - S_0(\hat\theta_0) = 0$, then
$$\sum_{t=1}^{n} l_t(\hat\theta_t) \ge \sum_{t=1}^{n} l_t(\hat\theta_n). \tag{5}$$
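The identities of Lemma 1 and the first inequality of Corollary 1 can be checked numerically. The following is a minimal sketch under assumptions that are not taken from the paper: a toy model with $l_0(\theta) = -\theta^2/2$ (playing the role of a log prior) and $l_t(\theta) = -(y_t - \theta)^2/2$ for $t \ge 1$, for which the maximizer of $S_t$ is $(y_1 + \cdots + y_t)/(t+1)$. The function name `check_lemma1` and all variable names are hypothetical.

```python
import random


def check_lemma1(n=50, m=10, seed=0):
    """Evaluate both sides of identities (1)-(4) for a toy Gaussian model."""
    rng = random.Random(seed)
    y = [rng.gauss(0.5, 1.0) for _ in range(n)]  # y[t - 1] holds y_t

    # Assumed toy model: l_0 is a quadratic log prior, l_t a Gaussian log-density
    # (additive constants dropped).
    def l(t, theta):
        if t == 0:
            return -0.5 * theta ** 2
        return -0.5 * (y[t - 1] - theta) ** 2

    def S(t, theta):
        return sum(l(s, theta) for s in range(t + 1))

    # Closed-form maximizer of S_t for this toy model.
    hat = [sum(y[:t]) / (t + 1) for t in range(n + 1)]

    steps = range(m + 1, n + 1)
    lhs = S(n, hat[n]) - S(m, hat[m])
    rhs1 = sum(l(t, hat[t - 1]) + S(t, hat[t]) - S(t, hat[t - 1]) for t in steps)
    rhs2 = sum(l(t, hat[t]) + S(t - 1, hat[t]) - S(t - 1, hat[t - 1]) for t in steps)

    full = range(1, n + 1)
    pred = sum(l(t, hat[t - 1]) for t in full)    # left side of (3)
    fitted = sum(l(t, hat[t]) for t in full)      # left side of (4)
    final = sum(l(t, hat[n]) for t in full)       # sum of l_t at the final estimate
    d_plus = (sum(S(t, hat[t]) - S(t, hat[t - 1]) for t in full)
              + S(0, hat[0]) - S(0, hat[n]))
    d_minus = (sum(S(t - 1, hat[t - 1]) - S(t - 1, hat[t]) for t in full)
               + S(0, hat[n]) - S(0, hat[0]))
    return {"lhs": lhs, "rhs1": rhs1, "rhs2": rhs2, "pred": pred,
            "fitted": fitted, "final": final, "d_plus": d_plus, "d_minus": d_minus}


if __name__ == "__main__":
    r = check_lemma1()
    print(abs(r["lhs"] - r["rhs1"]), abs(r["lhs"] - r["rhs2"]))
    print(r["pred"], r["final"] - r["d_plus"])
```

Because $S_0$ is not constant in this toy model, only the first inequality of Corollary 1 is guaranteed here; the second one requires $S_0(\hat\theta_n) = S_0(\hat\theta_0)$.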
Remark 1. In statistical applications, we usually have
$$l_t(\theta) = \log f(y_t \mid x_t; \theta), \quad t = 1, \ldots, n, \tag{6}$$
which is the conditional log-likelihood of $y_t$ given $x_t$. If $S_0(\theta) = 0$, then $\hat\theta_t$ is the MLE of $\theta$ based on $\{(y_1, x_1), \ldots, (y_t, x_t)\}$. If $S_0(\theta) = \log \pi(\theta)$ for a prior probability density function (p.d.f.) $\pi(\theta)$ of $\theta$, then $\hat\theta_t$ is the posterior mode of $\theta$ based on $\{(y_1, x_1), \ldots, (y_t, x_t)\}$.

Remark 2. For a standard linear model,
$$y_t = x_t'\theta + u_t, \quad t = 1, \ldots, n,$$
with $u_t \sim N(0, \sigma^2)$, we get $l_t(\theta) = -(y_t - x_t'\theta)^2/2\sigma^2$ where $x_t, \theta \in R^k$. Updating formulae (1) and (2) reduce to two well-known formulas for updating the residual sum of squares (SSE),
$$\mathrm{SSE}_n(\hat\theta_n) - \mathrm{SSE}_m(\hat\theta_m) = \sum_{t=m+1}^{n} (y_t - x_t'\hat\theta_{t-1})^2 (1 - h_t) = \sum_{t=m+1}^{n} (y_t - x_t'\hat\theta_t)^2 (1 + h_t^{*}),$$
respectively, where $h_t = x_t'(\sum_{s=1}^{t} x_s x_s')^{-1} x_t$ is the leverage of $x_t$ computed from the first $t$ observations, and $h_t^{*} = x_t'(\sum_{s=1}^{t-1} x_s x_s')^{-1} x_t$ is computed from the first $t-1$ observations, $t = 1, \ldots, n$. Lemma 1 not only justi
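The two SSE updating formulas for the linear model can be verified numerically. The sketch below is an illustration only, assuming a simple regression with an intercept and one slope ($k = 2$) fitted by ordinary least squares in pure Python; the helper names `ols2`, `leverage`, `sse` and `sse_updates` are hypothetical. The one-step prediction errors are weighted by one minus the leverage computed from the first $t$ observations, and the current residuals by one plus the leverage computed from the first $t-1$ observations.

```python
import random


def ols2(xs, ys):
    """Least squares fit of y = b0 + b1 * x; returns (b0, b1) and inverse of X'X."""
    n = len(xs)
    sx = sum(xs)
    sxx = sum(x * x for x in xs)
    sy = sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    det = n * sxx - sx * sx
    inv = [[sxx / det, -sx / det], [-sx / det, n / det]]
    b0 = inv[0][0] * sy + inv[0][1] * sxy
    b1 = inv[1][0] * sy + inv[1][1] * sxy
    return (b0, b1), inv


def leverage(inv, x):
    """Quadratic form (1, x) inv (1, x)' for the regressor vector (1, x)."""
    return inv[0][0] + 2.0 * inv[0][1] * x + inv[1][1] * x * x


def sse(beta, xs, ys):
    return sum((y - beta[0] - beta[1] * x) ** 2 for x, y in zip(xs, ys))


def sse_updates(xs, ys, m):
    """Return SSE_n - SSE_m computed directly and via the two updating sums."""
    n = len(xs)
    direct = sse(ols2(xs, ys)[0], xs, ys) - sse(ols2(xs[:m], ys[:m])[0], xs[:m], ys[:m])
    via_pred = 0.0
    via_resid = 0.0
    for t in range(m + 1, n + 1):
        b_prev, inv_prev = ols2(xs[:t - 1], ys[:t - 1])  # fit on first t-1 points
        b_cur, inv_cur = ols2(xs[:t], ys[:t])            # fit on first t points
        x, y = xs[t - 1], ys[t - 1]
        e = y - b_prev[0] - b_prev[1] * x                # one-step prediction error
        r = y - b_cur[0] - b_cur[1] * x                  # current residual
        via_pred += e * e * (1.0 - leverage(inv_cur, x))
        via_resid += r * r * (1.0 + leverage(inv_prev, x))
    return direct, via_pred, via_resid


if __name__ == "__main__":
    rng = random.Random(3)
    xs = [rng.gauss(0.0, 1.0) for _ in range(40)]
    ys = [1.0 + 0.5 * x + rng.gauss(0.0, 1.0) for x in xs]
    print(sse_updates(xs, ys, m=5))
```

The starting index `m` must be at least the number of regressors so that the first fit is well defined.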
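$$\sum_{t=1}^{n} l_t(\hat\theta_{t-1}) = \sum_{t=1}^{n} l_t(\hat\theta_n) - \Delta_n^{+},$$
where
$$\Delta_n^{+} = \sum_{t=0}^{n} [S_t(\hat\theta_t) - S_t(\hat\theta_{t-1})]$$
$$= \sum_{t=0}^{n} (\hat\theta_{t-1} - \hat\theta_t)' J_t^{**} (\hat\theta_{t-1} - \hat\theta_t)/2 \tag{7}$$
$$= \sum_{t=0}^{n} \nabla l_t(\hat\theta_{t-1})' (J_t^{*-1} J_t^{**} J_t^{*-1}) \nabla l_t(\hat\theta_{t-1})/2, \tag{8}$$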
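$\hat\theta_{-1} = \hat\theta_n$, $\nabla l_t(\theta) = (\partial l_t(\theta)/\partial\theta_i)_{i=1}^{k}$, $t = 0, \ldots, n$, are column vectors in $R^k$,
$$J_t^{*} = \int_0^1 J_t(u\hat\theta_{t-1} + (1-u)\hat\theta_t)\, du,$$
$$J_t^{**} = \int_0^1 J_t(u\hat\theta_{t-1} + (1-u)\hat\theta_t)\, 2(1-u)\, du,$$
and $J_t(\theta)$ is the Hessian matrix $-\nabla^2 S_t(\theta) = [-\partial^2 S_t/\partial\theta_i \partial\theta_j]$ of $-S_t(\theta)$.

Theorem 1, which suggests a simple approximation to the penalty term, gives a motivation for introducing a new information criterion as a good approximation of the PMDL.

Definition. We de
$$\sum_{t=1}^{n} \nabla l_t(\hat\theta_{t-1})' J_{t-1}^{-1}(\hat\theta_{t-1}) \nabla l_t(\hat\theta_{t-1})/2.$$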
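The quadratic-form representation (8) and the approximating penalty displayed above can be compared on a model where everything is available in closed form. The sketch below assumes, purely for illustration, a scalar toy model with $S_t(\theta) = -\theta^2/2 - \sum_{s \le t} (y_s - \theta)^2/2$, so that the Hessian of $-S_t$ is the constant $t + 1$ and $J_t^{*} = J_t^{**} = J_t$. The function name `theorem1_penalty` is hypothetical.

```python
import random


def theorem1_penalty(n=200, seed=1):
    """Compare the exact penalty, its quadratic form (8), and the J_{t-1} version."""
    rng = random.Random(seed)
    y = [rng.gauss(0.3, 1.0) for _ in range(n)]  # y[t - 1] holds y_t

    # Maximizer of S_t in the assumed toy model.
    hat = [sum(y[:t]) / (t + 1) for t in range(n + 1)]

    def S(t, theta):
        return -0.5 * theta ** 2 - 0.5 * sum((v - theta) ** 2 for v in y[:t])

    def grad_l(t, theta):
        # Gradient of l_t; l_0 is the quadratic term -theta^2 / 2.
        return -theta if t == 0 else y[t - 1] - theta

    def J(t):
        # Hessian of -S_t, constant in theta for this model.
        return t + 1.0

    def prev(t):
        # Convention from the text: the estimate indexed -1 is the final estimate.
        return hat[n] if t == 0 else hat[t - 1]

    direct = sum(S(t, hat[t]) - S(t, prev(t)) for t in range(n + 1))
    quad = sum(grad_l(t, prev(t)) ** 2 / J(t) / 2.0 for t in range(n + 1))
    approx = sum(grad_l(t, hat[t - 1]) ** 2 / J(t - 1) / 2.0 for t in range(1, n + 1))
    return direct, quad, approx


if __name__ == "__main__":
    direct, quad, approx = theorem1_penalty()
    print(direct, quad, approx)
```

In this quadratic model `direct` and `quad` agree up to rounding, while `approx` differs from them only through the replacement of $J_t$ by $J_{t-1}$ and the omission of the $t = 0$ term.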
Neither PMDL nor SIC is permutation invariant; their values depend strongly on the order of the data, which is natural for time-series data. For cross-sectional data, however, we need a criterion that is permutation invariant. The following theorem provides some asymptotic results which will be useful not only in constructing permutation invariant criteria but also in identifying the close relation of PMDL and SIC with existing information criteria. In the following, the notation a.s. stands for almost sure convergence as $n \to \infty$.

Theorem 2. Let $\{(y_t, x_t)\}_{t=0}^{n}$ be a sequence of i.i.d. random vectors, let $V_t(\theta) = \nabla l_t \nabla l_t'(\theta)$, let $V(\theta) = E[V_t(\theta)]$ be continuous, and let $M_n = \sum_{t=1}^{n} \operatorname{tr}\{J_{t-1}^{-1}(\hat\theta_{t-1})[V_t(\hat\theta_{t-1}) - V(\hat\theta_{t-1})]\}$ for $n \ge 1$. If we assume that $\hat\theta_n = \theta^{*} + o(1)$ a.s. as $n \to \infty$ and
$$\sup_n E|M_n| < \infty, \tag{9}$$
$$J_n(\hat\theta_n)/n = J + o(1) \quad \text{a.s.}, \tag{10}$$
for some positive definite matrix $J$, then
$$\sum_{t=1}^{n} \operatorname{tr}[J_{t-1}^{-1}(\hat\theta_{t-1}) V_t(\hat\theta_{t-1})] = [\operatorname{tr}(J^{-1}V) + o(1)] \log n \quad \text{a.s.}, \tag{11}$$
where $\operatorname{tr}(A)$ is the trace of a square matrix $A = [a_{ij}]$ and $V = V(\theta^{*})$.

One immediate consequence of the above result is the exact condition for the asymptotic equivalence of PMDL (or SIC) and BIC. For example, when the model is correctly speci
(12)
which in turn implies the asymptotic equivalence of PMDL, SIC and BIC. We also note that, by the results of Nishi (1988), consistency of the PMDL criterion and SIC follows immediately from this asymptotic equivalence result.

Remark 3. (SIC for cross-sectional data). A further implication of Theorem 2 is that it suggests a natural modi
(13)
where $I_n(\theta) = \sum_{t=1}^{n} \nabla l_t \nabla l_t'(\theta)$ and $I_n(\hat\theta_n)/n$ is a consistent estimator of $V(\theta^{*})$. Note that SIC$^{*}$ is not only permutation invariant but also invariant under reparametrization of the model parameter $\theta$. The data-dependent penalty term in (13) also appears as a natural correction of the bias of the maximized log-likelihood in the estimation of the expected log-likelihood of a
a:s:
a:s:
Furthermore, if the model is correctly speci
a:s:
by (12) which implies the asymptotic equivalence of SIC, BIC and FIC. On the other hand, when the model is not correctly speci
$f(\cdot)$; i.e.
$$\hat\theta_n = \arg\max_{\theta} \sum_{t=1}^{n} \log f(y_t - x_t'\theta).$$
As is noted by Konishi and Kitagawa (1996), we have
$$\operatorname{tr}[J_n^{-1}(\hat\theta_n) I_n(\hat\theta_n)] = (E(\psi^2)/E(\psi')) \cdot k + o(1) \quad \text{a.s.},$$
where $\psi(u) = -f'(u)/f(u)$. Thus, SIC is not equivalent to BIC or FIC even asymptotically unless we happen to have $E(\psi^2)/E(\psi') = 1$ with $E(\psi(u)) = 0$. For the normal p.d.f. $f(u) = (2\pi\sigma^2)^{-1/2} \exp(-u^2/2\sigma^2)$, this condition can be expressed as $\int u^2 g(u)\, du / \int u^2 f(u)\, du = 1$ with $\int u\, g(u)\, du = 0$. For the Laplacian p.d.f. $f(u) = (2\lambda)^{-1} \exp(-|u|/\lambda)$, it reduces to $[2g(0)]^{-1}/[2f(0)]^{-1} = 1$ with $\int \operatorname{sign}(u)\, g(u)\, du = 0$. Thus the problem of the asymptotic equivalence of BIC and SIC is reduced to that of the identity of the relevant location and scale parameters of $g(u)$ and $f(u)$. In order to investigate the
$1 \le i, j \le 2$, $1 \le k \le n$
for the array of i.i.d. errors uijk with zero mean and
Criterion   Penalty term
AIC         $k$
BIC         $(k/2)\log N$
TIC         $k + (\hat\kappa_N - 3)/2$
SIC         $(1/2)(k + (\hat\kappa_N - 3)/2)\log N$
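The four penalty terms listed above are simple functions of the number of parameters $k$, the total sample size $N$ and a data-dependent quantity. The sketch below is an illustration under a stated assumption: the hatted quantity with subscript $N$ in the TIC and SIC penalties is read here as a sample kurtosis of the residuals, which is consistent with the subtraction of 3 but is not confirmed by the surviving text. The function names `sample_kurtosis` and `table1_penalties` are hypothetical.

```python
import math


def sample_kurtosis(residuals):
    """Moment-based sample kurtosis m4 / m2^2 (assumed reading of the hatted term)."""
    n = len(residuals)
    mean = sum(residuals) / n
    m2 = sum((r - mean) ** 2 for r in residuals) / n
    m4 = sum((r - mean) ** 4 for r in residuals) / n
    return m4 / (m2 * m2)


def table1_penalties(k, N, kappa_hat):
    """Penalty terms to be subtracted from the maximized log-likelihood."""
    adjusted = k + (kappa_hat - 3.0) / 2.0  # kurtosis-adjusted parameter count
    return {
        "AIC": float(k),
        "BIC": 0.5 * k * math.log(N),
        "TIC": adjusted,
        "SIC": 0.5 * adjusted * math.log(N),
    }


if __name__ == "__main__":
    # With kurtosis equal to 3 (the normal value) TIC matches AIC and SIC matches BIC.
    print(table1_penalties(k=3, N=100, kappa_hat=3.0))
    # A heavier-tailed residual distribution inflates the TIC and SIC penalties.
    print(table1_penalties(k=3, N=100, kappa_hat=5.0))
```

Under this reading, a kurtosis below 3 (short tails) lowers the SIC penalty relative to BIC and a kurtosis above 3 (heavy tails) raises it.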
We illustrate the statistical performances of the four model selection criteria, the AIC, the BIC, the TIC, and the SIC by simulated ANOVA models. Speci
Table 2
Empirical distributions (%) of the order estimates based on AIC, BIC, TIC, and SIC for the model $y_{ijk} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + u_{ijk}$ with $(\mu, \alpha_1, \beta_1, (\alpha\beta)_{11}) = (0, 0.3, 0, 0)$

              n1                  n2                  n3
     k    AC  BC  TC  SC      AC  BC  TC  SC      AC  BC  TC  SC
D1   1     9  44  10  46       1  14   1  14       0   0   0   0
     2     1   1   1   1       0   0   0   0       0   0   0   0
     3    68  52  67  51      77  84  78  84      78  98  78  98
     4    13   2  14   2      14   2  13   2      14   2  14   2
     5     9   1   8   0       8   0   8   0       8   0   8   0
D2   1    10  46  10  43       1  13   1  12       0   0   0   0
     2     1   1   1   0       0   0   0   0       0   0   0   0
     3    68  51  70  55      78  85  77  86      78  98  78  98
     4    13   2  12   2      13   2  14   2      13   2  13   1
     5     8   0   7   0       8   0   8   0       9   0   9   0
D3   1     9  37  11  43       2  13   3  24       0   2   1  10
     2     1   0   3   2       0   0   1   1       0   0   0   1
     3    69  60  61  50      76  85  71  72      78  97  75  87
     4    13   3  14   4      14   2  14   3      13   1  14   2
     5     8   0  11   1       8   0  11   0       9   0   9   0

Note: D1, D2, D3 are the $N(0, 1)$, $U(-a, a)$ with $a^2 = 3$, and $t_3$ distributions, respectively; $(n_1, n_2, n_3) = (25, 50, 100)$; k is the number of parameters; AC means AIC, etc.; number of replications = 10 000.
Table 3
Empirical distributions (%) of the order estimates based on AIC, BIC, TIC, and SIC for the model $y_{ijk} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + u_{ijk}$ with $(\mu, \alpha_1, \beta_1, (\alpha\beta)_{11}) = (0.3, 0.3, 0.3, 0)$

              n1                  n2                  n3
     k    AC  BC  TC  SC      AC  BC  TC  SC      AC  BC  TC  SC
D1   1     0   3   0   3       0   0   0   0       0   0   0   0
     2     0   6   1   5       0   3   0   3       0   0   0   0
     3     5  15   5  15       0   3   0   3       0   0   0   0
     4    79  73  78  74      83  95  84  95      84  99  84  99
     5    16   3  16   3      17   2  16   2      16   1  16   1
D2   1     0   3   0   2       0   0   0   0       0   0   0   0
     2     1   6   0   4       0   3   0   2       0   0   0   0
     3     4  15   4  13       0   3   0   2       0   0   0   0
     4    79  73  79  78      84  95  84  96      84  98  84  99
     5    16   3  16   3      16   2  16   2      16   2  16   1
D3   1     0   5   2  12       0   0   1   6       0   0   0   4
     2     1   5   3   9       0   1   1   4       0   0   0   2
     3     4  10   7  14       1   3   2   8       0   0   1   3
     4    78  77  70  61      83  94  78  79      84  98  80  89
     5    17   3  18   4      16   2  18   3      16   2  19   2

Note: D1, D2, D3 are the $N(0, 1)$, $U(-a, a)$ with $a^2 = 3$, and $t_3$ distributions, respectively; $(n_1, n_2, n_3) = (25, 50, 100)$; k is the number of parameters; AC means AIC, etc.; number of replications = 10 000.
Tables 2 and 3 show that, although SIC and BIC perform similarly for the normal distribution D1, SIC performs better than BIC for the short-tailed distribution D2, especially in small samples. However, for the heavy-tailed distribution D3, it is outperformed by BIC. In summary, SIC, as an automatic order estimator, behaves very similarly to BIC except for models with heavy-tailed error distributions, and it has the desirable consistency property for large samples; as a trade-off, however, it tends, together with BIC, to suffer from a downward bias in small samples.

Acknowledgements
This work was supported by Korea Research Foundation Grant (KRF-2000-015-DP0058).

Appendix A.
Proof of Lemma 1. First we note that, for any $t \ge 1$,
$$[S_t(\hat\theta_t) - S_{t-1}(\hat\theta_{t-1})] = [S_t(\hat\theta_t) - S_t(\hat\theta_{t-1})] + l_t(\hat\theta_{t-1}) = l_t(\hat\theta_t) + [S_{t-1}(\hat\theta_t) - S_{t-1}(\hat\theta_{t-1})],$$
by the de
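$$\sum_{t=m+1}^{n} [S_t(\hat\theta_t) - S_{t-1}(\hat\theta_{t-1})] = \sum_{t=m+1}^{n} [S_t(\hat\theta_t) - S_t(\hat\theta_{t-1})] + \sum_{t=m+1}^{n} l_t(\hat\theta_{t-1})$$
$$= \sum_{t=m+1}^{n} l_t(\hat\theta_t) + \sum_{t=m+1}^{n} [S_{t-1}(\hat\theta_t) - S_{t-1}(\hat\theta_{t-1})].$$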
This together with the identity $\sum_{t=m+1}^{n} [S_t(\hat\theta_t) - S_{t-1}(\hat\theta_{t-1})] = S_n(\hat\theta_n) - S_m(\hat\theta_m)$ completes the proofs of (1) and (2). Now (3) and (4) follow immediately from simple rearrangements of (1) and (2) for $m = 0$.

Proof of Corollary 1. These follow immediately from identities (3) and (4) of Lemma 1 since $\Delta_n^{+}, \Delta_n^{-} \ge 0$ by the de
$$\sum_{t=1}^{n} \operatorname{tr}[J_{t-1}^{-1}(\hat\theta_{t-1}) V_t(\hat\theta_{t-1})] = A_n + M_n,$$
where $A_n = \sum_{t=1}^{n} \operatorname{tr}[J_{t-1}^{-1}(\hat\theta_{t-1}) V(\hat\theta_{t-1})]$. Condition (9) implies $M_n = M + o(1)$ a.s. for some random variable $M$ with $E|M| < \infty$ by the martingale convergence theorem (Hall and Heyde, 1980, p. 17). Condition (10) and the continuity of $V(\theta)$ imply that $a_n = \operatorname{tr}[(J_{n-1}(\hat\theta_{n-1})/n)^{-1} V(\hat\theta_{n-1})] = \operatorname{tr}(J^{-1}V) + o(1)$ a.s., which in turn implies $A_n / \sum_{t=1}^{n} w_t = \sum_{t=1}^{n} w_t a_t / \sum_{t=1}^{n} w_t = \operatorname{tr}(J^{-1}V) + o(1)$ a.s. for $w_t = 1/t$. Since $\sum_{t=1}^{n} w_t = \log n + c + o(1)$ for the Euler constant $c = 0.5772\ldots$, the proof is complete.

References
Akaike, H., 1973. Information theory and an extension of the maximum likelihood principle. In: Petrov, B.N., Csaki, F. (Eds.), International Symposium on Information Theory, Budapest, pp. 267–281.
Hall, P., Heyde, C., 1980. Martingale Limit Theory and Its Applications. Academic Press, New York.
Hannan, E.J., McDougall, A.J., Poskitt, D.S., 1989. Recursive estimation of autoregressions. J. Roy. Statist. Soc. Ser. B 51, 217–233.
Kavalieris, L., 1989. The estimation of order of an autoregression using recursive residuals and cross validation. J. Time Series Anal. 10, 271–281.
Konishi, S., Kitagawa, G., 1996. Generalized information criteria in model selection. Biometrika 83 (4), 875–890.
Nishi, R., 1988. Maximum likelihood principle and model selection when the true model is unspeci