Model order estimation via penalizing adaptively the likelihood (PAL)


Signal Processing 93 (2013) 2865–2871


Fast communication

Petre Stoica (a), Prabhu Babu (b,*)

(a) Department of Information Technology, Uppsala University, Uppsala, SE 75105, Sweden
(b) Department of Electronic and Computer Engineering, HKUST, Hong Kong

doi: 10.1016/j.sigpro.2013.03.014

* Corresponding author. Tel.: +46 852 69250196; fax: +46 852 23581485. E-mail address: [email protected] (P. Babu).

This work was supported in part by the Swedish Research Council (VR) and the European Research Council (ERC).


Article history: Received 30 October 2012; received in revised form 8 February 2013; accepted 11 March 2013; available online 1 April 2013.

Abstract

This paper introduces a novel rule for model order estimation based on penalizing adaptively the likelihood (PAL). The penalty term of PAL, which is data adaptive (as the name suggests), has several unique features: it is "small" (e.g. comparable to the AIC penalty) for model orders, let us say n, less than or equal to the true order, denoted by n0, and it is "large" (e.g. of the same order as the BIC penalty) for n > n0; furthermore this is true not only as the data sample length increases (which is the case most often considered in the literature) but also as the signal-to-noise ratio (SNR) increases (the harder case for AIC, BIC and the like); and this "oracle-like" behavior of PAL's penalty is achieved without any knowledge about n0. The paper presents a number of simulation examples to show that PAL has an excellent performance also in non-asymptotic regimes and compares this performance with that of AIC and BIC.

Keywords: Model order selection; Penalized likelihood criteria; Bayesian information criterion (BIC); Akaike information criterion (AIC)

1. Introduction and problem formulation

Let y ∈ R^N denote the data vector, and let p_n(y; θ_n) be its probability density function (pdf) under the hypothesis that y was generated by a model with parameter vector equal to θ_n ∈ R^n (when θ_n is viewed as the vector of unknown parameters, p_n(y; θ_n) is called the likelihood function); hereafter n denotes the dimension of θ_n, which is also called the order of the model corresponding to θ_n. Denote the set of models under consideration by {M_n}_{n=1}^{ñ} and assume that these models are nested, i.e. M_{n1} ⊂ M_{n2} for n1 < n2, which is the case in many order selection problems. In theory we can assume that y was indeed generated by a model in this class, let us say M_{n0} with order n = n0 < ñ and parameter vector θ_{n0}. In practice, however, this is rarely true (as the data generating mechanism is usually too complicated to be captured by the considered model set), a situation which we will symbolically designate in the present framework by the notation n0 > ñ. The problem that we will address in this paper is the estimation of n0 from y.

The literature on estimating n0 is comparable in breadth and depth with that on estimating θ_{n0}; see e.g. [1-7] for some relatively recent reviews of and contributions to the former literature. Of the many rules that have been proposed for order selection over the years, two have stood the test of time: the Akaike information criterion (AIC) [8] and the Bayesian information criterion (BIC) [9,10]. Both of these rules have a penalized-likelihood form, namely

min_n [ −2 ln p_n(y; θ̂_n) + γ n ]    (1)

but with different penalty coefficients:

γ = 2 for AIC    (2)

γ = ln N for BIC    (3)
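To make (1)-(3) concrete, the following Python/NumPy sketch evaluates AIC and BIC over a family of nested Gaussian linear models; it uses the fact, which appears later as Eq. (14), that −2 ln p_n(y; θ̂_n) equals N ln σ̂_n² up to an additive constant. The function name and the list-of-design-matrices interface are illustrative choices of ours, not part of the paper.

import numpy as np

def aic_bic_select(y, design_matrices):
    """Evaluate the penalized likelihoods (1) with gamma = 2 (AIC) and gamma = ln N (BIC).

    design_matrices is a list of N x m design matrices of the nested candidate
    models (m = 1, 2, ...); the model order is n = m + 1 (m regression
    coefficients plus the noise variance).
    """
    y = np.asarray(y, dtype=float)
    N = len(y)
    aic, bic = [], []
    for X in design_matrices:
        n = X.shape[1] + 1                                  # model order
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)
        sigma2_hat = np.mean((y - X @ coef) ** 2)           # ML noise-variance estimate
        fit = N * np.log(sigma2_hat)                        # -2 ln p_n(y; theta_hat), up to a constant
        aic.append(fit + 2 * n)                             # Eq. (2)
        bic.append(fit + np.log(N) * n)                     # Eq. (3)
    return int(np.argmin(aic)), int(np.argmin(bic))         # indices of the selected models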


In (1) θ̂_n is the maximum likelihood estimate of θ_n. The negative log-likelihood term in (1) decreases with increasing n whereas the penalty term increases. Ideally the latter term should have the following behavior: (i) for n ≤ n0 the penalty should be small enough so that the function in (1) decreases with increasing n; and (ii) for n > n0 the penalty term should be sufficiently large that the function in (1) increases with increasing n.

In the large-sample case the penalties of AIC and BIC, despite the fact that they are data-independent, are reasonably good at achieving the above desired behavior. More concretely:

• If n0 < ñ then BIC's probability of correct order selection approaches one as N → ∞. For AIC, however, this is not true because property (ii) is not satisfied; consequently AIC has a non-zero probability of overestimating n0 even for N → ∞.

• If n0 > ñ then AIC possesses a min–max optimality property (briefly stated, this means that the average likelihood corresponding to the model selected by AIC increases, in the worst-data case, at the largest possible rate as N increases; the "average likelihood" is calculated over samples different from the one used for estimation, and the "worst case" is defined with respect to the parameters of the model that generated the data, see e.g. [11,3] for details on this aspect), whereas BIC does not have such a property.

Next, assume that the data vector y consists of a signal part and a noise term. In the high-SNR case (i.e. when the ratio of the powers of the signal and noise terms is much larger than one), on the other hand, AIC and BIC appear to have less satisfactory properties. In particular, BIC no longer has the consistency property described above (see e.g. [4]) and it is also questionable whether AIC's min–max optimality (also called minimax-rate optimality) still holds.

There have been several attempts in the literature to combine AIC and BIC into a single rule that is both consistent and min–max optimal in large samples (see e.g. [3,12] and references therein). However, it was shown in [3] that, under certain conditions, this is not possible: a rule that is consistent cannot also be min–max optimal, and vice versa (though see below for a remark on this aspect). Another line of research has tried to address the unsatisfactory behavior of the criteria in the small-N and high-SNR case. In particular, it has been shown in [5] that the performance of BIC in the said case can be enhanced by appropriately modifying the criterion; however, doing so essentially requires information about the SNR in order to know whether the large-N or the high-SNR version of BIC should be used.

The main goal of this paper is to introduce the PAL rule (see Section 2), which possesses properties (i) and (ii) above in both the large-N case and the high-SNR one. More specifically, in the large-sample case (for details on the high-SNR case see the next section):

(i) if n ≤ n0 then the penalty of PAL is O(n) like AIC's; and (ii) if n > n0 then PAL's penalty is O(n log N) like BIC's

(by a slight abuse of notation we use O(n) and O(n log N) in lieu of O(1) and O(log N) to indicate the fact that the penalty is proportional to n; note that the symbol O(·) denotes either the order in probability or the deterministic order, depending on the context).

To achieve the above "oracle-like" properties PAL uses a data-adaptive penalty. These properties endow PAL with some unique features. Concretely, in the case of n0 < ñ (see Section 3 for numerical illustrations of these facts):

• The rule performs well in either the large-N or the high-SNR case, without the need for any intervention by the user (such as switching between different penalties, as essentially required for other rules, see e.g. [5]).

• PAL's probability of under-fitting is similar to AIC's and that of over-fitting is comparable to BIC's. As a result PAL's probability of correct order selection can be expected to be larger than BIC's and certainly larger than AIC's.

In the other case of n0 > ñ (such as when the data generating mechanism cannot be described by a finite-order model belonging to the considered set), PAL's penalty is O(n) like AIC's and yet PAL's behavior is typically closer to BIC's. The reason for this appears to be a factor of ln(ñ) present in PAL's penalty (see the next section) and the need for choosing a relatively large ñ in such under-modeling cases, which may make PAL's penalty sensibly larger than AIC's (in spite of both having the same magnitude order). Interestingly, in numerical work not reported here (for lack of space) we have observed that without the said factor the PAL rule behaved similarly to (or better than) BIC whenever n0 < ñ and like AIC when n0 > ñ; in other words such a slightly modified version of PAL appears to achieve both consistency and min–max optimality, which means that the result in [3] about the impossibility of achieving both these properties simultaneously might not apply to it.

Intuitively the higher-order models selected by AIC can be a better choice, in the case of n0 > ñ, for tasks such as prediction, control, spectral analysis and so forth than the smaller-order models selected by BIC or PAL. However, the min–max optimality of AIC that holds whenever n0 > ñ is a rather conservative property: it guarantees that in the worst-data case the average likelihood associated with the model selected by AIC increases at the largest possible rate as N increases; but in many other cases the more parsimonious models selected by PAL or BIC can have larger such likelihood values and hence can be more desirable (we will illustrate these facts numerically in Section 3). Furthermore, quite often in applications the quality of an estimated model is not measured directly via its associated average likelihood value but via other metrics (such as the model's multi-step prediction performance), and in such situations AIC is not guaranteed to outperform BIC or PAL even in the hardest data cases


(a fact that we have observed in numerical simulations that, once again, are not included here for brevity).

2. The PAL rule

Consider the following two generalized likelihood ratios:

r_n = 2 ln[ p_{n−1}(y; θ̂_{n−1}) / p_1(y; θ̂_1) ]    (4)

ρ_n = 2 ln[ p_ñ(y; θ̂_ñ) / p_{n−1}(y; θ̂_{n−1}) ]    (5)

for n = 2, …, ñ. Assume that the data vector is Gaussian distributed with a pdf that has the following quite general form (for a generic model M_n):

p_n(y; θ_n) = exp(−(1/(2σ_n²)) ‖y − h_n(α_n)‖²) / (2πσ_n²)^{N/2}    (6)

where h_n(·): R^{n−1} → R^N is a given function (while (6) is not a necessary condition for the analysis in this section we impose it for the sake of conciseness). Note that for (6):

θ_n = [α_n^T  σ_n²]^T    (7)

and M_1 is the "model" with h_n(α_n) = 0 (for n = 1). We assume that y is not pure white noise, that is

M_{n0} ≠ M_1    (8)

and therefore will use M_1 only as a "reference model" (i.e. we will pick the model from the set {M_n}_{n=2}^{ñ}).

First consider the large-N case. Under (8) and some additional regularity conditions we have (for N ≫ 1):

r_n = 0,      n = 2
r_n = O(N),   n = 3, …, ñ    (9)

and

ρ_n = O(N),                           n = 2, …, n0
ρ_n = O(1) (with high probability),   n = n0 + 1, …, ñ    (10)

In the second row in (10) we assumed that n0 < ñ; if that is not the case this row should of course be omitted. Note that this row of (10) follows from the fact that, asymptotically in N and for n > n0,

ρ_n ∼ χ²(ñ − n + 1)    (11)

where χ²(k) denotes the chi-square distribution with k degrees of freedom (see e.g. [13,14]). To motivate the second row of (9) and the first one of (10) consider the pdf in (6) for which

α̂_n = argmin_{α_n} ‖y − h_n(α_n)‖²    (12)

σ̂_n² = ‖y − h_n(α̂_n)‖² / N    (13)

and

−2 ln p_n(y; θ̂_n) = N ln σ̂_n² + const.    (14)

r_n = N ln(σ̂_1² / σ̂_{n−1}²)    (15)

ρ_n = N ln(σ̂_{n−1}² / σ̂_ñ²)    (16)

Under (8) the difference (σ̂_1² − σ̂_{n−1}²) is statistically significant for any n > 2 and hence the second row of (9) holds; similarly, so does the first row of (10) for n ≤ n0.

It follows from the above discussion that the ratio

ln(r_n + 1) / ln(ρ_n + 1)    (17)

has the following properties:

• It is an increasing function of n (for nested models) which equals zero at n = 2 (indeed ln(r_2 + 1) = 0) and is strictly positive for n > 2 (in view of (11) we can have ρ_n < 1 with a non-zero probability; however, ln(ρ_n + 1) > 0 due to the added one).

• It is O(1) for n = 2, …, n0 and O(ln N) for n = n0 + 1, …, ñ (see (9) and (10)).

We can make use of these properties of (17) in the following way to build a penalty term and the associated PAL order selection rule that has the "oracle-like" features (i) and (ii) (see the previous section):

PAL(n) = −2 ln p_n(y; θ̂_n) + n ln(ñ) · ln(r_n + 1) / ln(ρ_n + 1),   n = 2, …, ñ    (18)

Note that the penalty in (18) is invariant to data scaling, as it should be, because r_n and ρ_n are so. As N increases this penalty approaches n ln(ñ) for n = 3, …, n0. For the values of ñ encountered in most applications this is comparable with AIC's penalty: indeed, as an example, ln(ñ) ∈ [1.6, 3] for ñ ∈ [5, 20]. For n = n0 + 1, …, ñ, on the other hand, the penalty term in (18) is approximately equal to

[ln(ñ) / ln(ρ_n + 1)] n ln N    (19)

In (19), ρ_n increases with increasing ñ; in particular, the mean of the (asymptotic) distribution of ρ_n is (ñ − n + 1) (see (11)). The role of ln(ñ) in (19) is to compensate for this increase of ln(ρ_n + 1) and bring (19) reasonably close to n ln N, which is the penalty of BIC. This observation concludes the analysis showing that, in large samples, the penalty of PAL has the desired features (i) and (ii).

Next consider the high-SNR case. In such a case we have (see (15) and (16), and make use of the fact that σ̂_n² is on the order of the signal power for n < n0 and of the noise power for n ≥ n0):

r_n = O(1),          n = 2, …, n0
r_n = O(ln(SNR)),    n = n0 + 1, …, ñ    (20)

and

ρ_n = O(ln(SNR)),    n = 2, …, n0
ρ_n = O(1),          n = n0 + 1, …, ñ    (21)

(compare these equations with (9) and (10)). It follows from (20) and (21) that the penalty term of PAL has the "oracle-like" features (i) and (ii), described in Section 1, also in the present case. More concretely, as SNR increases, the penalty of PAL approaches zero for n ≤ n0 and is proportional to n ln ln(SNR) for n > n0.

Finally note that, in the large-sample case, if the data generating mechanism does not belong to the considered model set (i.e. symbolically, n0 > ñ) then ρ_n is likely to be large (see (10)) for any n ≤ ñ. Consequently PAL might be expected to behave like AIC in such a situation. However, if there is a model in the set {M_n}_{n=2}^{ñ} that fits the data reasonably well PAL will rather behave more like BIC (see the next section for a numerical illustration of this fact; and also the discussion in Section 1 about the case n0 > ñ).
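A minimal Python/NumPy sketch of the rule in the Gaussian setting of Eqs. (6)-(18): each candidate model M_n, n = 1, …, ñ, is summarized by its ML residual variance σ̂_n² from (13), with the reference model M_1 (h = 0) included. The function and argument names are ours, and the small floor placed on ln(ρ_n + 1) is only a numerical safeguard, not part of the rule.

import numpy as np

def pal_select(N, residual_variances):
    """Return the order n in {2, ..., n_tilde} minimizing PAL(n) of Eq. (18).

    residual_variances[k] is sigma2_hat of model M_(k+1), k = 0, ..., n_tilde - 1,
    with the reference model M_1 (h = 0) in position 0, i.e.
    residual_variances[0] = ||y||^2 / N.
    """
    s2 = np.asarray(residual_variances, dtype=float)
    n_tilde = len(s2)
    orders = np.arange(2, n_tilde + 1)
    r = N * np.log(s2[0] / s2[orders - 2])                              # Eq. (15): uses sigma2_hat of M_(n-1)
    rho = N * np.log(s2[orders - 2] / s2[-1])                           # Eq. (16)
    ratio = np.log(r + 1.0) / np.maximum(np.log(rho + 1.0), 1e-12)      # Eq. (17), guarded against division by zero
    pal = N * np.log(s2[orders - 1]) + orders * np.log(n_tilde) * ratio # Eqs. (14) and (18)
    return int(orders[np.argmin(pal)])

The additive constant in (14) is dropped since it is common to all orders, and the same list of residual variances can be reused to evaluate AIC and BIC for comparison.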

3. Numerical examples

The performance metric in each of the following examples is determined from 5000 Monte Carlo simulations with different noise realizations.

3.1. Linear regression (n0 < ñ)

We build a matrix A ∈ R^{N×N} whose elements are independent Gaussian random variables with zero mean and unit variance. Then we compute nine candidate regressors as

x_k = A e_k,   k = 1, …, 9    (22)

where {e_k} are independent Gaussian random vectors with zero mean and covariance matrix equal to I (the covariance matrix of x_k is thus equal to AA^T). Finally we generate the data vector as

y = 3.35 x_1 − 1.23 x_2 + 0.9 x_3 + 2.12 x_4 + e   (hence n0 = 5)    (23)

where e is a Gaussian noise vector (independent of {x_k}) with zero mean and covariance matrix equal to σ²I (the coefficients of {x_k} in (23) were drawn randomly from a Gaussian distribution with zero mean and variance equal to one). We will vary σ² to change the SNR (which is proportional to 1/σ²); for simplicity we use 1/σ² as an SNR measure. The set of models to be tested is defined as follows: M_2 corresponds to x_1, M_3 to {x_1, x_2}, …, and M_10 to {x_1, …, x_9} (hence ñ = 10).

The probabilities of correct order selection, corresponding to AIC, BIC and PAL, are shown in Fig. 1 as functions of 1/σ² and for N = 15 as well as N = 50. As can be seen from this figure PAL outperforms BIC and AIC significantly, particularly so for the smaller number of data samples.
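The following Python/NumPy sketch generates one Monte Carlo draw of this regression example under our reading of Eqs. (22)-(23) and collects the residual variances of the nested models M_1, …, M_10, which is all that the AIC/BIC/PAL computations sketched earlier consume; the seed and variable names are ours.

import numpy as np

rng = np.random.default_rng(0)
N, inv_sigma2 = 15, 100.0                            # e.g. 1/sigma^2 = 100, as in Fig. 1(a)

A = rng.standard_normal((N, N))
X = A @ rng.standard_normal((N, 9))                  # columns x_1, ..., x_9 of Eq. (22)

# Eq. (23): only the first four regressors enter the data (hence n0 = 5)
y = X[:, :4] @ np.array([3.35, -1.23, 0.9, 2.12]) + rng.standard_normal(N) / np.sqrt(inv_sigma2)

residual_variances = [np.mean(y ** 2)]               # reference model M_1 (h = 0)
for m in range(1, 10):                               # M_2 = {x_1}, ..., M_10 = {x_1, ..., x_9}
    coef, *_ = np.linalg.lstsq(X[:, :m], y, rcond=None)
    residual_variances.append(np.mean((y - X[:, :m] @ coef) ** 2))
# residual_variances can now be passed to the pal_select / aic_bic_select sketches above.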

3.2. Time series (n0 < ñ)

We use the following autoregression to generate the elements of the data vector y:

y_k + a_1 y_{k−1} + a_2 y_{k−2} + a_3 y_{k−3} + a_4 y_{k−4} = ϵ_k,   k = 1, 2, …, N   (hence n0 = 5)    (24)

where

a_1 = −2.760,  a_2 = 3.809,  a_3 = −2.654,  a_4 = 0.924    (25)

and {ϵ_k} is a Gaussian white noise with zero mean and unit variance. The set of models that will be fitted to the data consists of autoregressions of orders equal to 1, 2, …, 9 (hence ñ = 10). The probabilities of correct order selection, for the same three rules as in the previous example, are shown in Fig. 2 as functions of N. Once again, PAL visibly outperforms the other two rules.
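As a sketch of how the nested autoregressive models map to the residual variances consumed by the three rules, the Python function below fits AR models of increasing order by conditional least squares; this is a standard approximation to the exact Gaussian ML fit assumed in Section 2 (the variances differ by end effects), and the function name is ours.

import numpy as np

def ar_residual_variances(y, max_ar_order):
    """Return [sigma2_hat(M_1), ..., sigma2_hat(M_(max_ar_order + 1))].

    M_1 is the reference model (no regression), and M_(m+1) is the AR(m) model
    with coefficients estimated by regressing y_k on (y_{k-1}, ..., y_{k-m}).
    """
    y = np.asarray(y, dtype=float)
    out = [np.mean(y ** 2)]                              # reference model M_1
    for m in range(1, max_ar_order + 1):
        Y = np.column_stack([y[m - j - 1:len(y) - j - 1] for j in range(m)])
        target = y[m:]
        a, *_ = np.linalg.lstsq(Y, target, rcond=None)
        out.append(np.mean((target - Y @ a) ** 2))       # conditional residual variance
    return out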

3.3. Time series, cont'd (n0 > ñ)

The model set of this example is still composed of autoregressions, but this time of orders from 1 to 29 (hence ñ = 30). The data is generated by the following moving average (hence n0 is infinite here):

y_k = ϵ_k + c_1 ϵ_{k−1} + c_2 ϵ_{k−2},   k = 1, 2, …, N    (26)

where {ϵ_k} is a white Gaussian noise, as in (24), and

1 + c_1 z + c_2 z² = (1 − αz)(1 − α*z)    (27)

Fig. 1. Linear regression: probability of correct order selection vs 1/σ² for (a) N = 15 and (b) N = 50.


We will consider two values for α, namely

α = 0.98 e^{i 0.8π}    (28)

and

α = 0.5 e^{i 0.8π}    (29)

Note that the smaller |α| the closer is the data generating mechanism to the considered model set. For a generic model M_n in the set under test, with coefficients {â_k}_{k=1}^{n−1}, the negative average log-likelihood is given (to within an additive constant) by (N/2) ln(LL), where

LL = E[ϵ_k + b_1 ϵ_{k−1} + ⋯ + b_{n+1} ϵ_{k−n−1}]² = 1 + Σ_{k=1}^{n+1} b_k²    (30)

In the above equation E denotes the expectation operation (over samples independent of that used for parameter estimation), and

1 + b_1 z + ⋯ + b_{n+1} z^{n+1} = (1 + â_1 z + ⋯ + â_{n−1} z^{n−1})(1 + c_1 z + c_2 z²)    (31)

Note that LL is lower bounded by one (a value which LL cannot exactly attain for n < ∞) and also that LL is likely to decrease towards 1 as N increases; hence the inverse of (30), viz. 1/LL, can be expected to have a behavior somewhat similar to that of the probability of correct order selection considered in the previous examples. Therefore here we will use 1/LL as the performance metric.

In Fig. 3 we show 1/LL, as a function of N, for the models selected by AIC, BIC and PAL. For the value of α in (28) (which corresponds to a data generating mechanism that cannot be well approximated by the models in the considered set), AIC slightly outperforms the other two rules as N increases; however, for the α in (29), which has a smaller magnitude, AIC is outperformed by BIC and PAL.
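A small Python/NumPy sketch of the metric (30)-(31): the b_k are obtained by convolving the estimated AR polynomial with the true MA polynomial, whose coefficients follow from Eq. (27) as c_1 = −2 Re(α) and c_2 = |α|². The function name is ours.

import numpy as np

def inverse_LL_ma(a_hat, c1, c2):
    """Return 1/LL of Eqs. (30)-(31) for an estimated AR model applied to the MA(2) data (26).

    a_hat holds (a_hat_1, ..., a_hat_{n-1}); the b_k of Eq. (31) are the coefficients of
    (1 + a_hat_1 z + ... + a_hat_{n-1} z^{n-1}) (1 + c1 z + c2 z^2).
    """
    a_poly = np.concatenate(([1.0], np.asarray(a_hat, dtype=float)))
    b = np.convolve(a_poly, np.array([1.0, c1, c2]))     # coefficients 1, b_1, ..., b_{n+1}
    return 1.0 / (1.0 + np.sum(b[1:] ** 2))              # Eq. (30)

alpha = 0.98 * np.exp(1j * 0.8 * np.pi)                  # Eq. (28)
c1, c2 = -2.0 * alpha.real, abs(alpha) ** 2              # from Eq. (27)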

Fig. 2. Time series (n0 < ñ): probability of correct order selection vs N.

3.4. Time series, cont'd (n0 < ñ but M_{n0} ≈ M_{n0−1})

The fourth-order autoregression in (24) and (25) clearly cannot be well approximated by a lower-order autoregression. The question is what happens if this is possible, i.e. the true model M_{n0} lies close to a lower-order model set such as M_{n0−1}, a situation which we designate by the notation M_{n0} ≈ M_{n0−1}. In this final example we will consider data generated by the same fourth-order autoregression as in (24) but with the following two different sets of coefficients:

a_1 = −2.83,  a_2 = 2.723,  a_3 = −0.9328,  a_4 = 0.0402    (32)

a_1 = −2.98,  a_2 = 3.14,  a_3 = −1.3204,  a_4 = 0.1607    (33)

Like in the previous example based on (24), the set of models that will be fitted to the data generated by using either (32) or (33) is composed of autoregressions of orders 1, 2, …, 9 (hence ñ = 10). While n0 = 5 for both (32) and (33), these two autoregressions have a rather small coefficient a_4 and therefore are close to M_4 (with (32) being apparently closer), which means that in their case n0 ≈ 4.

Fig. 3. Time series (n0 > ñ): 1/LL vs N for the α in (a) Eq. (28) and (b) Eq. (29).

Consequently, a sound order-selection rule is expected to estimate n0 as 4 more often than not (particularly so for (32)).

Fig. 4 shows the average orders selected by AIC, BIC and PAL, for different values of N, along with bars of length equal to two standard deviations (centered on the average orders). As can be seen from this figure, the PAL-selected models are consistently the most parsimonious ones (with an average order close to n = 4) and those selected by AIC are the least parsimonious (with an average order larger than n = 6).

The behavior of PAL in Fig. 4 (i.e. frequently picking n = 4) suggests a potential advantage over the other two rules (which often choose a "too large" order). To confirm that this behavior is indeed advantageous we will compare the models selected by the three rules with one another using an LL metric similar to the one in (30)-(31), namely

LL = 1 + Σ_{k=1}^{100} b_k²    (34)

where (for a model M_n with autoregressive coefficients {â_k}_{k=1}^{n−1})

1 + b_1 z + ⋯ + b_{100} z^{100} ≈ (1 + â_1 z + ⋯ + â_{n−1} z^{n−1}) / (1 + a_1 z + ⋯ + a_4 z^4)    (35)

(the truncation of the infinite series on the left-hand side of (35) to the first 100 terms was found to be satisfactory in all cases considered). Fig. 5 shows 1/LL, as a function of N, for the two data instances in (32) and (33). As can be seen from the figure the more parsimonious PAL models are not worse (actually they are slightly better) than the more complex models selected by AIC and BIC.
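A Python/NumPy sketch of the metric (34)-(35): the b_k are computed by long division of the estimated AR polynomial by the true one, truncated to 100 terms as in (35). The function name is ours.

import numpy as np

def inverse_LL_ar(a_hat, a_true, n_terms=100):
    """Return 1/LL of Eq. (34), with b_k from the truncated expansion in Eq. (35).

    a_hat = (a_hat_1, ..., a_hat_{n-1}) are estimated AR coefficients and
    a_true = (a_1, ..., a_4) are the coefficients in (32) or (33).
    """
    num = np.concatenate(([1.0], np.asarray(a_hat, dtype=float)))
    den = np.concatenate(([1.0], np.asarray(a_true, dtype=float)))
    b = np.zeros(n_terms + 1)
    for k in range(n_terms + 1):                          # solve (1 + a_true(z)) b(z) = 1 + a_hat(z) term by term
        rhs = num[k] if k < len(num) else 0.0
        acc = sum(den[j] * b[k - j] for j in range(1, min(k, len(den) - 1) + 1))
        b[k] = rhs - acc
    return 1.0 / (1.0 + np.sum(b[1:] ** 2))               # Eq. (34); b[0] = 1 by construction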

4. Concluding remarks

Consider first the case of n0 ≤ ñ (i.e. the data generating mechanism belongs to or, more practically, is close to the model set). In such a case the probability of correct order selection is usually a natural performance metric (though see the last example in the previous section for some comments on this aspect). With respect to this metric PAL was shown to perform better than BIC and much better than AIC as N increases. Furthermore, if N takes on a relatively small value then PAL was shown to significantly outperform both BIC and AIC as the SNR increases.

In the other case, symbolically designated as n0 > ñ (i.e. the mechanism generating the data either has the same form as the models in the set under consideration but with a larger (possibly infinite) order, or it has a completely different form), the AIC-selected model can have a larger average likelihood value than that of the models selected by PAL or BIC. However, the difference in performance between these selection rules, as measured by the said likelihood values, is noticeable only in the worst (or close to worst) data cases. In many other cases, in which the model set under test can provide a meaningful description of the data, this difference is small or even changes sign (i.e. AIC is outperformed by BIC or PAL). Furthermore, metrics other than the above-mentioned likelihood values are often of interest in applications (such as the multi-step prediction performance of the estimated model) and with respect to them AIC is not guaranteed to outperform BIC or PAL even in the worst-data case.

All in all it appears that the new PAL rule introduced in this paper can give AIC and BIC a "run for their money". In effect PAL was clearly preferable to the latter rules in all the simulation tests we have performed to date. Of course, in order to claim the superiority of PAL over the other rules with a high level of confidence, more testing and comparisons are needed. In addition, a better theoretical understanding of the properties of PAL as well as a deeper (first principles-based) motivation of the rule would be desirable. Such a theoretical analysis of PAL could lead to an even better performance, for instance by finely tuning the penalty term of the rule, which is something that we have not attempted to do in this short paper.

Fig. 4. Time series (n0 < ñ but M_{n0} ≈ M_{n0−1}): average orders (± one std.) vs N for the autoregressive coefficients in (a) Eq. (32) and (b) Eq. (33).


Fig. 5. Time series (n0 < ñ but M_{n0} ≈ M_{n0−1}): 1/LL vs N for the autoregressive coefficients in (a) Eq. (32) and (b) Eq. (33).

References

[1] P. Stoica, Y. Selen, Model-order selection: a review of information criterion rules, IEEE Signal Processing Magazine 21 (4) (2004) 36–47.
[2] C. Rao, Y. Wu, On model selection, in: S. Konishi, R. Mukerjee (Eds.), Model Selection, IMS, Beachwood, OH, 2001.
[3] Y. Yang, Can the strengths of AIC and BIC be shared? A conflict between model identification and regression estimation, Biometrika 92 (4) (2005) 937–950.
[4] Q. Ding, S. Kay, Inconsistency of the MDL: on the performance of model order selection criteria with increasing signal-to-noise ratio, IEEE Transactions on Signal Processing 59 (5) (2011) 1959–1969.
[5] P. Stoica, P. Babu, On the proper forms of BIC for model order selection, IEEE Transactions on Signal Processing 60 (9) (2012) 4956–4961.
[6] M. Karimi, Order selection criteria for vector autoregressive models, Signal Processing 91 (4) (2011) 955–969.
[7] T. Wu, P. Chen, Y. Yan, The weighted average information criterion for multivariate regression model selection, Signal Processing 92 (1) (2012) 49–55.
[8] H. Akaike, A new look at the statistical model identification, IEEE Transactions on Automatic Control 19 (6) (1974) 716–723.
[9] G. Schwarz, Estimating the dimension of a model, The Annals of Statistics 6 (2) (1978) 461–464.
[10] J. Rissanen, Modeling by shortest data description, Automatica 14 (5) (1978) 465–471.
[11] R. Shibata, Asymptotically efficient selection of the order of the model for estimating parameters of a linear process, The Annals of Statistics 8 (1) (1980) 147–164.
[12] X. Shen, J. Ye, Adaptive model selection, Journal of the American Statistical Association 97 (457) (2002) 210–221.
[13] S. Wilks, Sample criteria for testing equality of means, equality of variances, and equality of covariances in a normal multivariate distribution, The Annals of Mathematical Statistics (1946) 257–281.
[14] E. Lehmann, J. Romano, Testing Statistical Hypotheses, Springer-Verlag, New York, 2005.