
16th IFAC Symposium on System Identification The International Federation of Automatic Control Brussels, Belgium. July 11-13, 2012

Pitfalls of the parametric approaches exploiting cross-validation for model order selection ⋆

Gianluigi Pillonetto ∗   Giuseppe De Nicolao ∗∗

∗ Department of Information Engineering, University of Padova, Padova, Italy (e-mail: [email protected])
∗∗ Department of Computer and Systems Science, Pavia, Italy (e-mail: [email protected])

Abstract: Prediction error methods (PEM) are often used to identify a dynamic system starting from input-output samples. In particular, in the classical parametric scenario, models of different order are identified from data and compared using the cross-validation (CV) paradigm, where measurements are split into a training and a validation data set. However, some inefficiencies related to this popular approach to system identification have recently been pointed out. This paper provides some insights into the reasons for such pitfalls, clarifying why PEM equipped with CV may lead to estimators with large variance and poor predictive capability on new data.

Keywords: prediction error methods; output error models; model complexity; ill-conditioning

⋆ This research has been partially supported by the PRIN Project “Sviluppo di nuovi metodi e algoritmi per l’identificazione, la stima Bayesiana e il controllo adattativo e distribuito”, by the Progetto di Ateneo CPDA090135/09 funded by the University of Padova, by the European Community’s Seventh Framework Programme [FP7-ICT-223866-FeedNetBack] and [FP7/2007-2013] under grant agreement n. 257462, HYCON2 Network of Excellence.

1. INTRODUCTION

The most used approach to the identification of linear discrete-time models is the Prediction Error Method (PEM), see [Ljung, 1999, Soderstrom and Stoica, 1989]. This technique is typically used in a parametric scenario where finite-dimensional hypothesis spaces of different order, such as ARX, ARMAX or Laguerre models, are postulated, and model-order selection becomes a key ingredient of the identification process. Classically, this step is performed by cross validation (CV), where data are split into a training and a validation set [Hastie et al., 2008]. Recent research has however pointed out some inefficiencies related to this classical approach to system identification, see [Pillonetto and De Nicolao, 2011, Chen et al., 2011b]. In particular, PEM+CV may overestimate the model order, so that its performance may be unsatisfactory, far from that predicted by standard (i.e. without model selection) statistical theory.

Incidentally, the pitfalls of parametric approaches have also motivated the recent contributions on nonparametric regularized methods for system identification initiated in [Pillonetto and De Nicolao, 2010] and continued in [Pillonetto et al., 2011, Chen et al., 2011b, Pillonetto and De Nicolao, 2011, Chen et al., 2011a]. In particular, in the approach described in [Pillonetto and De Nicolao, 2010] the impulse response is modeled as the realization of a zero-mean Gaussian process [Rasmussen and Williams, 2006]. Prior information is thus introduced in the identification process just by assigning a particular covariance, the so-called stable spline kernel of [Pillonetto and De Nicolao, 2010], which includes information on system stability. This kernel depends on very few hyperparameters, determined from data using marginal likelihood maximization. Such a procedure is interpretable as the counterpart of model order selection in the classical parametric PEM paradigm, but it has been proved to be more robust than CV and AIC-type criteria [Akaike, 1974]. It introduces the right amount of regularization in the estimation process, effectively guarding against the possible ill-conditioning affecting the identification problem.

In contrast with this approach, in this paper we show that PEM+CV may not have a satisfactory control on model complexity. In particular, we provide some new mathematical insights that clarify why CV may lead to estimators having a large variance, as shown experimentally in [Chen et al., 2011b, Pillonetto and De Nicolao, 2011, Chen et al., 2011a]. The crux of our analysis is the use of a nonparametric embedding, i.e. the introduction of a large hypothesis space including all the usual models adopted for system identification. Under the hypothesis of a null impulse response, it is shown that the estimate selected by CV within this wide set of impulse responses can often be ill-conditioned, also representing a dangerous “domain of attraction” for the estimates obtainable in the smaller hypothesis spaces. Thus, PEM+CV may be induced to spend too many degrees of freedom to fit the training and the validation data. This leads to poor predictive capability on new data, also when they are generated by an input just slightly different from that associated with the identification data. This phenomenon is also described by deriving closed-form results for the case of FIR models and a system input given by a Dirac delta or a step.


The paper is organized as follows. Section 2 reports the problem statement. In Section 3, we introduce the large hypothesis space M defining the nonparametric embedding, as well as other model classes commonly used for system identification. In Section 4, the performance of CV is discussed using M as hypothesis space. In Section 5, the consequences of these findings for the use of classical models for system identification are elucidated, also treating the case of FIR models with input given by a Dirac delta or a step. Conclusions end the paper, while the Appendix gathers the proof of the main result.

2. STATEMENT OF THE PROBLEM

For t ∈ Z, we use {y_t} to indicate noisy output data coming from a discrete-time linear dynamic system fed with a known input {u_t}. The measurement model is
$$ y_t = \sum_{k=1}^{\ell} f_k u_{t-k} + \omega_t \qquad (1) $$
where f = \{f_t\}_{t=1}^{\ell} is the unknown impulse response while {ω_t} is white Gaussian noise. We assume that the output measurements are split into a training and a validation data set contained in the m-dimensional column vectors T and V, respectively. To keep the exposition as simple as possible, it is also assumed that the same deterministic input generates T and V, so that we can write
$$ T = U f + e, \qquad V = U f + v \qquad (2) $$
where the white Gaussian noises e and v are independent with variance σ², the column vector f ∈ R^ℓ contains the coefficients of the impulse response, while U ∈ R^{m×ℓ} is a known matrix whose entries are defined by the system input. For the sake of simplicity we assume that m ≥ ℓ and that rank(U) = ℓ. We also assume that the coefficients of the true impulse response are equal to zero, so that T and V are equal to the noises e and v, respectively. Then, our aim is to investigate how robust CV is in controlling model complexity under this Null Impulse Response Scenario (NIRS).

3. COMPETITIVE MODELS FOR SYSTEM IDENTIFICATION

3.1 A large set of competitive models determined by the training data T

Hereafter, all vectors are column vectors and 0_ℓ denotes the ℓ-dimensional vector whose components are all null. Let α denote a positive scalar belonging to the closed interval [$\underline{\alpha}$, $\bar{\alpha}$] determined by T as follows:
$$ \underline{\alpha} = \|T - U (U^\top U)^{-1} U^\top T\|^2 \qquad (3) $$
$$ \bar{\alpha} = T^\top T \qquad (4) $$
with ‖·‖ indicating the classical Euclidean norm. Then, we use R_α to denote the set of all the impulse responses able to describe the training set T with the same residue α according to a quadratic fit criterion. In mathematical terms, we have
$$ R_\alpha = \left\{ f \in \mathbb{R}^{\ell} \ \text{s.t.} \ \|T - U f\|^2 = \alpha \right\} \qquad (5) $$
In particular:
• α = $\underline{\alpha}$ is associated with a set containing only one element, given by the least squares estimate, i.e. $R_{\underline{\alpha}}$ = {(U^⊤U)^{-1}U^⊤T};
• α = $\bar{\alpha}$ corresponds to the set of all the impulse responses having the same ability as 0_ℓ in fitting the training set, since 0_ℓ ∈ $R_{\bar{\alpha}}$.
Notice that $R_{\bar{\alpha}}$ contains the simplest impulse response, with all null components, but may well contain also much more complicated and oscillating impulse responses.

Exploiting the definitions above, we define the hypothesis space M as follows
$$ M = \left\{ f_\alpha \ \text{s.t.} \ f_\alpha \in R_\alpha, \ \alpha \in [\underline{\alpha}, \bar{\alpha}] \right\} \qquad (6) $$

3.2 Classical subsets M̆ of M used for system identification

In system identification one usually selects a small subset M̆ ⊂ M of candidate impulse responses, whose elements are not parametrized by α but rather by the model order, hereafter denoted by the integer p, with possibly p = 0. Once the data T are given, each model order is associated with a value of α and a subset of R_α. In addition, the larger p, the smaller α. Three well known examples of M̆-type sets are reviewed below.

Example 1. (LAGUERRE AND FIR MODELS). Let the ℓ-dimensional column vectors $\{\phi_i^\beta\}_{i=1}^{p}$ be defined by the Laguerre basis functions. More precisely, each $\phi_i^\beta$ contains the first ℓ components of the inverse of the rational transfer function given, in the zeta-domain, by
$$ \frac{z}{z - \beta} \, \frac{(1 - \beta z)^{i-1}}{(z - \beta)^{i-1}}, \qquad \beta \in (-1, 1) $$
Let p be the model order belonging to {0, 1, . . . , ℓ}. For p > 0, define $\Phi_p^\beta$ ∈ R^{ℓ×p} and $Q_p^\beta$ ∈ R^{m×p} as follows
$$ \Phi_p^\beta := [\phi_1^\beta \ \phi_2^\beta \ \ldots \ \phi_p^\beta], \qquad Q_p^\beta := U \Phi_p^\beta $$
Then, once T is known, the competitive models are in the set
$$ M_{LAG} = \left\{ \Phi_p^\beta \big((Q_p^\beta)^\top Q_p^\beta\big)^{-1} (Q_p^\beta)^\top T, \quad p \in \{0, 1, \ldots, \ell\}, \ \beta \in (-1, 1) \right\} \qquad (7) $$
where $\Phi_0^\beta \big((Q_0^\beta)^\top Q_0^\beta\big)^{-1} (Q_0^\beta)^\top T := 0_\ell$. Notice that, once T is given, each pair of values (p, β) is associated with one impulse response in R_α where
$$ \alpha(p, \beta) = \|T - Q_p^\beta \big((Q_p^\beta)^\top Q_p^\beta\big)^{-1} (Q_p^\beta)^\top T\|^2 $$
In addition, one has
$$ \alpha(0, 0) = \bar{\alpha}, \qquad \alpha(\ell, 0) = \underline{\alpha} $$
Finally, notice that, for β = 0, M_LAG reduces to the class of FIR models, denoted by M_FIR in the rest of the paper.

Example 2. (RATIONAL TRANSFER FUNCTIONS). p now belongs to the discrete set {0, 1, . . . , p̄}. It establishes the order of the rational transfer function that describes the impulse response in the zeta domain and is defined by
$$ F_p(z) = \begin{cases} 0 & \text{if } p = 0 \\ a_0 & \text{if } p = 1 \\[4pt] \dfrac{a_0 + \sum_{i=1}^{p-1} a_i z^i}{1 + \sum_{i=1}^{p-1} b_i z^i} & \text{if } p = 2, \ldots, \bar{p} \end{cases} \qquad (8) $$


where $\{a_i\}_{i=0}^{p-1}$ and $\{b_i\}_{i=1}^{p-1}$ are real numbers. Let S_p denote the space of all the vectors f ∈ R^ℓ containing the first ℓ components of the inverse of F_p(z). Then, once T is given, each p is associated with only one value of α and with possibly more than one impulse response contained in R_α. The competitive models are in the set
$$ M_{RTF} = \left\{ \arg\min_{f \in S_p} \|T - U f\|^2, \quad p \in \{0, 1, \ldots, \bar{p}\} \right\} $$
Given p, the corresponding α is
$$ \alpha(p) = \min_{f \in S_p} \|T - U f\|^2 $$
As in the previous cases, M_RTF contains the null impulse response. Notice that α(p) may be equal to $\underline{\alpha}$, provided that p is sufficiently large.
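To make Example 1 concrete, here is a minimal sketch (not from the paper; the step input, the sizes m = 50 and ℓ = 20, and the unit noise variance are arbitrary choices) that builds the NIRS training data T = e, computes the FIR projection estimates (β = 0) and reports the fitting level α(p, 0), which decreases from $\bar{\alpha}$ = T^⊤T at p = 0 to $\underline{\alpha}$ at p = ℓ:

```python
import numpy as np

rng = np.random.default_rng(0)
m, ell = 50, 20
U = np.tril(np.ones((m, m)))[:, :ell]    # step input: first ell columns of a lower-triangular matrix of ones
e = rng.standard_normal(m)               # NIRS: the true impulse response is 0_l, so T = e
T = e

def fir_fit(p):
    """Projection estimate of Example 1 with beta = 0 (FIR model of order p)."""
    f = np.zeros(ell)
    if p > 0:
        Qp = U[:, :p]                    # Q_p^0 = U Phi_p^0 = first p columns of U
        f[:p] = np.linalg.lstsq(Qp, T, rcond=None)[0]
    return f

for p in (0, 1, 5, 10, ell):
    alpha_p = np.sum((T - U @ fir_fit(p)) ** 2)
    print(f"p = {p:2d}   alpha(p, 0) = {alpha_p:10.4f}")
# endpoints: alpha(0, 0) = T'T = alpha_bar, alpha(ell, 0) = least squares residue = alpha_low
```

The same construction with β ≠ 0 would simply replace the canonical basis by the Laguerre vectors $\phi_i^\beta$.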

3.3 Concluding remarks of the section

We have presented different model classes defined by the training set T. The first type has been denoted by M and contains subsets of impulse responses parametrized by α, a scalar providing a quadratic fitting measure. Notice that the training set T defines M by establishing the range of α. The second type has been denoted by M̆ and consists of a small subset of M containing impulse responses depending on the model order p. This type of model class includes the three key hypothesis spaces M_FIR, M_LAG and M_RTF.

4. MODEL SELECTION VIA CV ADOPTING M AS HYPOTHESIS SPACE

4.1 Model selection via CV using M as hypothesis space

For our future developments, it is useful to introduce some new definitions. First, let
$$ K_p := (U_p^\top U_p)^{-1} U_p^\top, \qquad K := (U^\top U)^{-1} U^\top $$
(with U_p denoting the matrix formed by the first p columns of U), and let also P(x) denote the second-order polynomial defined as follows
$$ P(x) := x^2 (e^\top K e + \alpha - e^\top e) + 2x (e^\top K e + \alpha - e^\top e) + (2e - v)^\top K v + \alpha - e^\top e \qquad (9) $$
It is easy to assess that, for α in ($\underline{\alpha}$, $\bar{\alpha}$], both roots of P(x) are always real. Then, we use λ : ($\underline{\alpha}$, $\bar{\alpha}$] → R to indicate the function that maps α into the maximum root of P(x). Simple computations show that this function is monotonically decreasing as α varies in ($\underline{\alpha}$, $\bar{\alpha}$], and one also has
$$ \lim_{\alpha \to \underline{\alpha}^+} \lambda(\alpha) = +\infty \qquad (10) $$
$$ \lambda(\bar{\alpha}) = -1 + \sqrt{\frac{(e - v)^\top K (e - v)}{e^\top K e}} \qquad (11) $$
We are now in a position to characterize the most predictive impulse responses (with respect to the validation data set) as a function of α, as well as the optimal α returned by CV when M is adopted as hypothesis space. This is described in the next proposition.

Proposition 3. Let M be the hypothesis space under the NIRS. Then, the most predictive vectors of V, as a function of α, belong to M_O ⊂ M where
$$ M_O = \left\{ \hat{f}(\alpha), \ \alpha \in [\underline{\alpha}, \bar{\alpha}] \right\} \qquad (12) $$
and
$$ \hat{f}(\alpha) := \arg\min_{f \in R_\alpha} \|V - U f\|^2 = \begin{cases} K e & \text{if } \alpha = \underline{\alpha} \\[6pt] K \, \dfrac{v + \lambda(\alpha) e}{1 + \lambda(\alpha)} & \text{if } \alpha \in (\underline{\alpha}, \bar{\alpha}] \end{cases} \qquad (13) $$
In addition, according to the CV paradigm, the optimal impulse response is f̂($\hat{\alpha}$) where
$$ \hat{\alpha} = \arg\min_{\alpha \in [\underline{\alpha}, \bar{\alpha}]} \underline{CVO}(\alpha) \qquad (14) $$
with $\underline{CVO}(\alpha)$ defined by
$$ \underline{CVO}(\alpha) = \begin{cases} v^\top v + e^\top K e - 2 e^\top K v & \text{if } \alpha = \underline{\alpha} \\[6pt] v^\top v + \dfrac{\lambda^2(\alpha) e^\top K e - 2 \lambda^2(\alpha) e^\top K v - (2\lambda(\alpha) + 1) v^\top K v}{(1 + \lambda(\alpha))^2} & \text{if } \alpha \in (\underline{\alpha}, \bar{\alpha}] \end{cases} \qquad (15) $$

Hence, when V becomes available, the most predictive vectors, as a function of α, belong to a particular subset M_O of M whose elements all belong to the range of K. In particular, they admit the structure of least squares solutions with data given by a combination of T and V. Hence, remarkably, if U is an ill-conditioned matrix, e.g. induced by a low-pass input or also by the realization of white noise, M_O is completely composed of ill-conditioned solutions. Within M_O, the optimal solution is then characterized by the $\hat{\alpha}$ given by (14). In fact, as shown in the Appendix, for each value of α the function (15) provides the quadratic prediction error on V obtained by the most predictive vector in R_α.

Now, recalling that E[e^⊤Ke] = E[v^⊤Kv] and E[e^⊤Kv] = 0, one has
$$ E[\underline{CVO}] = E[v^\top v] + E[e^\top K e] \, \frac{\lambda^2 - 2\lambda - 1}{\lambda^2 + 2\lambda + 1} \qquad (16) $$
where the dependence of λ on α is omitted to simplify the notation. From the above expression, it is easy to obtain that
$$ E[\underline{CVO}] \leq E[v^\top v], \qquad 0 \leq \lambda \leq 1 + \sqrt{2} $$
Since E[v^⊤v] is the expected prediction error on V obtained by 0_ℓ, this suggests that there exists a non-negligible number of vectors, possibly similar to the least squares solution KT, able to predict V better than the true impulse response under the NIRS. In fact, when λ = 1 + √2, f̂ depends on a convex combination of e and v where e is likely to be dominant over v. This phenomenon will be formally described in Proposition 5 for the case of FIR models and inputs given by a Dirac delta or a step. Notice also that the vectors able to predict the validation set better than 0_ℓ, as a function of α (with α ≤ e^⊤e), are exactly those satisfying the following two conditions
$$ \begin{cases} f^\top U^\top U f - 2 f^\top U^\top e = \alpha - e^\top e \\ f^\top U^\top U f - 2 f^\top U^\top v \leq 0 \end{cases} \qquad (17) $$
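To illustrate Proposition 3 numerically, the following sketch (not from the paper; the step input, the sizes m = 50 and ℓ = 20, and the unit noise variance are arbitrary choices) traces the curve $\underline{CVO}$ by sweeping the multiplier ζ that parametrizes the estimates in (13) (formula (35) of the Appendix) and shows that, under the NIRS, some vectors in M typically predict V better than the true (null) impulse response:

```python
import numpy as np

rng = np.random.default_rng(1)
m, ell = 50, 20
U = np.tril(np.ones((m, m)))[:, :ell]    # step input
e = rng.standard_normal(m)               # training noise: T = e under the NIRS
v = rng.standard_normal(m)               # validation noise: V = v

Uplus = np.linalg.pinv(U)                # (U^T U)^{-1} U^T

# Sweep the multiplier zeta > -1: formula (35) then traces the most predictive
# vectors f_hat(alpha) of Proposition 3 as the fitting level alpha varies.
zetas = np.linspace(-0.5, 50.0, 5000)
F = np.stack([Uplus @ (v + z * e) / (1.0 + z) for z in zetas])
alphas = np.sum((e - F @ U.T) ** 2, axis=1)     # fitting level on T
errors = np.sum((v - F @ U.T) ** 2, axis=1)     # prediction error on V

inside_M = alphas <= e @ e                      # keep only alpha <= alpha_bar
best = np.argmin(np.where(inside_M, errors, np.inf))
print("error of the true (null) impulse response:", v @ v)
print("best error along the curve               :", errors[best], "at alpha =", alphas[best])
```

On most realizations the best error along the curve is visibly smaller than v^⊤v, in agreement with the discussion following (16).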


4.2 Characterization of the least predictive vectors in M

Now, let η : ($\underline{\alpha}$, $\bar{\alpha}$] → R indicate the function mapping α into the smallest root of P(x). One can easily assess that η(·) is monotonically increasing and that
$$ \lim_{\alpha \to \underline{\alpha}^+} \eta(\alpha) = -\infty, \qquad \eta(\bar{\alpha}) = -1 - \sqrt{\frac{(e - v)^\top K (e - v)}{e^\top K e}} \qquad (18) $$
The following proposition is the “dual” of Proposition 3: it provides the vector with the worst predictive capability on V. It turns out that this characterization is obtained just by replacing λ(·) with the function η(·) defined above.

Proposition 4. Under the NIRS, let M be the hypothesis space. Then, we have
$$ \tilde{f}(\alpha) := \arg\max_{f \in R_\alpha} \|V - U f\|^2 = \begin{cases} K e & \text{if } \alpha = \underline{\alpha} \\[6pt] K \, \dfrac{v + \eta(\alpha) e}{1 + \eta(\alpha)} & \text{if } \alpha \in (\underline{\alpha}, \bar{\alpha}] \end{cases} \qquad (19) $$
In addition, the impulse response with the worst prediction capability on V is f̃($\tilde{\alpha}$) where
$$ \tilde{\alpha} = \arg\max_{\alpha \in [\underline{\alpha}, \bar{\alpha}]} \overline{CVO}(\alpha) \qquad (20) $$
with $\overline{CVO}(\alpha)$ defined by
$$ \overline{CVO}(\alpha) = \begin{cases} v^\top v + e^\top K e - 2 e^\top K v & \text{if } \alpha = \underline{\alpha} \\[6pt] v^\top v + \dfrac{\eta^2(\alpha) e^\top K e - 2 \eta^2(\alpha) e^\top K v - (2\eta(\alpha) + 1) v^\top K v}{(1 + \eta(\alpha))^2} & \text{if } \alpha \in (\underline{\alpha}, \bar{\alpha}] \end{cases} \qquad (21) $$
In analogy with the previous case, in the Appendix it is shown that the function (21) returns the quadratic prediction error on V obtained by the least predictive vector in R_α. Hence, $\underline{CVO}$ and $\overline{CVO}$ provide lower and upper bounds on the prediction error on V obtainable by the vectors in M, as a function of α.

Fig. 1. Realization of $\underline{CVO}$ (solid line), the curve associated with the vectors in the set M_O defined in (12), and $\overline{CVO}$ (dotted line), obtained with m = 50 and using a step as input. The two curves denote the best and the worst achievable prediction error on the validation data set as a function of the fitting level α. The circles indicate the prediction errors obtained using M_FIR as hypothesis space. The circle inside the square visible on the right is the value of the prediction error obtained setting p = 0, i.e. using the true impulse response given by 0_ℓ under the NIRS. As p increases, α decreases and the points move to the left. The optimal order selected by CV is p = 9.

5. MODEL SELECTION VIA CV ADOPTING M̆-TYPE HYPOTHESIS SPACES

Fig. 1 plots a particular realization of $\underline{CVO}$ (solid line) and $\overline{CVO}$ (dotted line); see the caption of Fig. 1 for details. Recall that the values assumed by the lower function $\underline{CVO}$ are the minimum prediction errors on V achievable for different values of α. Each point on this curve is in one-to-one correspondence with one vector f̂(α) defined by (13). The upper curve $\overline{CVO}$ instead provides the worst obtainable error and is made of points associated with the vectors f̃(α) defined by (19). All the other impulse responses in M lead to a prediction error falling in the region lower- and upper-bounded by these two curves.

Let us consider M̆-type hypothesis spaces, whose elements are parametrized by the model order p. Adopting these spaces, the prediction errors on V in general are not points belonging to $\underline{CVO}$. In fact, since M̆ is a small subset of M, they depend on few values of α, defined by the values that p may assume and by the particular realizations of T and V. Hence, the prediction errors consist of a finite number of points contained in the region bounded by $\underline{CVO}$ and $\overline{CVO}$. An example is given by the circles in Fig. 1. In particular, since p = 0 is associated with the null impulse response 0_ℓ and the fitting level α = $\bar{\alpha}$, the coordinates of the first point are ($\bar{\alpha}$, v^⊤v). Then, as p increases, each new point stays to the left of the previous one. The coordinates of the last point are ($\underline{\alpha}$, v^⊤v + e^⊤Ke − 2e^⊤Kv) when, e.g., p = ℓ and FIR models are used.
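The picture just described can be reproduced qualitatively with a few lines of code; the sketch below (not from the paper; it assumes a step input, m = 50, ℓ = 20, unit noise variance, and uses matplotlib for the plot) traces the two bounding curves by sweeping the multiplier of the Appendix on both sides of −1 and overlays the prediction errors of the FIR fits of Example 1, in the spirit of Fig. 1:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
m, ell = 50, 20
U = np.tril(np.ones((m, m)))[:, :ell]    # step input
e, v = rng.standard_normal(m), rng.standard_normal(m)
Uplus = np.linalg.pinv(U)
alpha_bar = e @ e

def curve(zetas):
    """(alpha, prediction error on V) along the stationary points (35)."""
    F = np.stack([Uplus @ (v + z * e) / (1.0 + z) for z in zetas])
    return np.sum((e - F @ U.T) ** 2, axis=1), np.sum((v - F @ U.T) ** 2, axis=1)

a_lo, e_lo = curve(np.linspace(0.0, 60.0, 3000))     # zeta > -1: lower curve (Proposition 3)
a_up, e_up = curve(np.linspace(-60.0, -1.5, 3000))   # zeta < -1: upper curve (Proposition 4)

# FIR prediction errors: projection fit of order p (Example 1 with beta = 0)
pts = []
for p in range(ell + 1):
    f = np.zeros(ell)
    if p > 0:
        f[:p] = np.linalg.lstsq(U[:, :p], e, rcond=None)[0]
    pts.append((np.sum((e - U @ f) ** 2), np.sum((v - U @ f) ** 2)))
pts = np.array(pts)

keep_lo, keep_up = a_lo <= alpha_bar, a_up <= alpha_bar
plt.plot(a_lo[keep_lo], e_lo[keep_lo], "-", label="lower bound")
plt.plot(a_up[keep_up], e_up[keep_up], ":", label="upper bound")
plt.plot(pts[:, 0], pts[:, 1], "o", label="FIR orders p = 0..20")
plt.xlabel("fitting level alpha"); plt.ylabel("prediction error on V")
plt.legend(); plt.show()
```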

5.1 Pitfalls of CV using M_FIR as hypothesis space

When the hypothesis space is M_FIR, the prediction errors are the values assumed by the following objective
$$ CVO_{FIR}(p) = \begin{cases} v^\top v & \text{if } p = 0 \\ \|v - U_p (U_p^\top U_p)^{-1} U_p^\top e\|^2 & \text{if } p = 1, \ldots, \ell \end{cases} \qquad (22) $$
which consists of a stochastic process on {0, 1, 2, . . . , ℓ}. The most suitable value for p selected by CV is then given by
$$ \hat{p} = \arg\min_{p \in \{0, 1, \ldots, \ell\}} CVO_{FIR}(p) \qquad (23) $$

To obtain insights about the performance of CV, and possibly closed-form expressions characterizing it, we need to fix a form for the system input. For this purpose, we will analyze two significant cases, with the input being either a Dirac delta or a step. When m = ℓ, this corresponds to setting U equal to U1 or U2, respectively, where U1 and U2 are the matrices displayed below.


Fig. 3. Histogram of the 1000 estimates of β returned by cross validation when the size of the training and validation data sets is m = 50, the input is a step and M_LAG is adopted as hypothesis space.

Fig. 2. Empirical cumulative distribution of model order estimates adopting M_FIR (top) or M_RTF (bottom).

$$ U_1 = \begin{bmatrix} 1 & 0 & 0 & \ldots & 0 \\ 0 & 1 & 0 & \ldots & 0 \\ 0 & 0 & 1 & \ldots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \ldots & 1 \end{bmatrix}, \qquad U_2 = \begin{bmatrix} 1 & 0 & 0 & \ldots & 0 \\ 1 & 1 & 0 & \ldots & 0 \\ 1 & 1 & 1 & \ldots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & 1 & 1 & \ldots & 1 \end{bmatrix} $$

Notice that while the Dirac delta defines a problem where the impulse response is directly measured, the step case makes the identification ill-conditioned. The statistics of $CVO_{FIR}$ coming from these two choices are derived in the following proposition, whose proof is omitted due to space constraints.

Proposition 5. Consider the NIRS and let $\{i_p\}_{p \in \mathbb{Z}^+}$ denote the increments of $CVO_{FIR}$, i.e.
$$ i_p = CVO_{FIR}(p+1) - CVO_{FIR}(p), \qquad p = 0, 1, 2, \ldots \qquad (24) $$
If the input is a Dirac delta (i.e. U = U_1), we have
$$ i_p = e_p (e_p - 2 v_p), \qquad p = 0, 1, 2, \ldots \qquad (25) $$
so that
$$ \mathbb{P}[i_p < 0] \approx 0.35, \qquad p = 0, 1, 2, \ldots \qquad (26) $$
If the system input is instead a step (i.e. U = U_2), define
$$ \bar{e} = \frac{\sum_{i=1}^{m} e_i}{m}, \qquad \bar{v} = \frac{\sum_{i=1}^{m} v_i}{m} \qquad (27) $$
Then, as m gets large, each i_p tends to assume the same distribution. In particular, for p = 0 one has
$$ i_0 = m \, \bar{e} (\bar{e} - 2 \bar{v}) \qquad (28) $$
while, if m ≫ p, one has
$$ i_p \approx e_p (e_p - 2 v_p), \qquad p = 1, 2, \ldots \qquad (29) $$
so that it still holds that
$$ \mathbb{P}[i_p < 0] \approx 0.35, \qquad p = 0, 1, 2, \ldots \qquad (30) $$

A realization of the process characterized by the above proposition with input given by a step is displayed as circles in Fig. 1.

Recall that, under the NIRS, one would like the minima of the realizations of $CVO_{FIR}$ to be frequently located at p = 0. But the above proposition shows that, when the input is a Dirac delta, the objective becomes a particular random walk with independent and identically distributed increments. When the input is a step, the same result is obtained for m ≫ p, with all the increments still independent, just excluding the first one. This, combined with (30), indicates that the probability for the objective to improve as p increases is not negligible. This result also formalizes the fact, suggested by the particular realization in Fig. 1, that the number of impulse responses satisfying the system (17) is considerable when α is close to $\bar{\alpha}$ (i.e. when m ≫ p adopting FIR models), and decreases not so quickly as α decreases (i.e. as the model order p increases). Notice also that, if one adopted M_LAG in place of M_FIR, CV would select the right model complexity even less frequently, since M_FIR ⊂ M_LAG.

To further corroborate these theoretical findings, we have performed a Monte Carlo study of 10000 runs under the NIRS, setting m = 100 and storing at each run the value of p̂ defined in (23), when the input is a step and the maximum allowed value ℓ for p is 20. Fig. 2 (top panel) plots the empirical cumulative distribution of the 10000 values of p̂. Notice that the probability of the event p̂ = 0 is less than 0.5, while that of the event p̂ ≥ 7 is around 0.1. Thus, in a significant number of cases CV leads to an ill-conditioned estimate of f whose spectrum is located at high frequencies, where the input power is low. The peculiarity of this estimate is that it is able to describe well both T and V but is likely to have a very poor predictive capability on data generated by an input different from the step.
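The ≈ 0.35 value in Proposition 5, and the tendency of the resulting random walk to place its minimum away from p = 0, can be checked with a quick Monte Carlo experiment; a minimal sketch (not from the paper; it assumes unit-variance Gaussian noise and a Dirac-delta input, so that the increments are exactly e_p(e_p − 2v_p)):

```python
import numpy as np

rng = np.random.default_rng(3)
N, ell = 200000, 20

e = rng.standard_normal((N, ell))
v = rng.standard_normal((N, ell))

inc = e * (e - 2 * v)                       # increments i_p of CVO_FIR (Dirac-delta input)
print("P[i_p < 0] ~=", np.mean(inc < 0))    # about 0.35

# CVO_FIR(p) is a random walk started at v'v; how often is its minimum NOT at p = 0?
walk = np.cumsum(inc, axis=1)               # CVO_FIR(p) - CVO_FIR(0), p = 1..ell
p_hat = np.where(walk.min(axis=1) < 0, walk.argmin(axis=1) + 1, 0)
print("P[p_hat > 0]  ~=", np.mean(p_hat > 0))
print("P[p_hat >= 7] ~=", np.mean(p_hat >= 7))
```

The probabilities printed for p̂ refer to the Dirac-delta input and need not coincide with the step-input figures reported above.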


5.2 Pitfalls of CV using M̆-type hypothesis spaces

The results reported in Proposition 5 can be extended to more general inputs, at least by resorting to numerical studies, and represent just a particular consequence of the more general findings described in the previous section. The solutions returned by CV are “attracted” by the vectors associated with the curve $\underline{CVO}$, which all belong to the range of K. Thus, if U is ill-conditioned, the risk for this model order selection procedure to return an inaccurate impulse response estimate, with many oscillations, is not negligible. This also explains why, when e.g. the input is low-pass and the hypothesis space is M_RTF or M_LAG, PEM+CV tends to introduce many zeros and poles at high frequencies. In fact, this permits mimicking the behavior of the operator K, leading to unreliable model orders.

As an example, Fig. 2 (bottom panel) plots the same kind of distribution of p̂ when M_FIR is replaced by M_RTF, i.e. FIR models are replaced by rational transfer functions estimated by the oe.m Matlab function. In this case, p represents the order of the two polynomials entering the transfer function description, whose “optimal” value is still determined by CV. It is apparent that wrong-order models are chosen even more frequently. For instance, the probability of the event p̂ ≥ 6 is around 0.5. To further illustrate this phenomenon, we have also repeated the same kind of simulation adopting M_LAG as hypothesis space and storing the value of β chosen by CV after each of the 1000 runs. The histogram of the 1000 estimates of β displayed in Fig. 3 confirms that this procedure does not have an acceptable control on the model complexity: the true impulse response is 0_ℓ, but transfer functions with a high-frequency dominant pole are frequently selected.

6. CONCLUSIONS

The performance of parametric approaches equipped with CV for model order selection has been investigated under the assumption of a null impulse response, with training and validation data generated using the same input. While the latter assumption will be relaxed in future work, we believe that the analysis reported here already provides some useful insights about the pitfalls of PEM+CV. In particular, it points out the importance of introducing regularization in the estimation process to circumvent the ill-conditioning the classical approach to system identification can be subject to. This also motivates further research on regularized kernel-based methods along the line initiated in [Pillonetto and De Nicolao, 2010].

APPENDIX

Proofs of Propositions 3 and 4: In view of the definition of R_α, under the NIRS, Problem (13) is equivalent to finding
$$ \arg\min_{f} \|v - U f\|^2 \quad \text{s.t.} \quad \|e - U f\|^2 = \alpha \qquad (31) $$
The associated Lagrangian is
$$ L(\zeta, f) = f^\top U^\top U f - 2 f^\top U^\top v + v^\top v + \zeta \left( f^\top U^\top U f - 2 f^\top U^\top e + e^\top e - \alpha \right) \qquad (32) $$
and one also has
$$ \nabla_f L(\zeta, f) = 2 U^\top U f - 2 U^\top v + 2 \zeta U^\top U f - 2 \zeta U^\top e \qquad (33) $$
$$ \nabla_f^2 L(\zeta, f) = 2 (1 + \zeta) U^\top U \qquad (34) $$
For the gradient (33) to be zero, f must be equal to
$$ f = \begin{cases} \dfrac{(U^\top U)^{-1} U^\top (v + \zeta e)}{1 + \zeta} & \text{if } \underline{\alpha} < \alpha \leq \bar{\alpha} \\[6pt] (U^\top U)^{-1} U^\top e & \text{if } \alpha = \underline{\alpha} \end{cases} \qquad (35) $$
Hence, for suitable values of ζ the above expression provides the solution of Problems (13) and (19). In particular, (34) shows that a necessary condition for f to be a minimum is ζ ≥ −1, whereas the solution of Problem (19) requires ζ ≤ −1. Now, using the constraint ‖e − Uf‖² = α with f replaced by (35), we obtain that the value of ζ that fully defines the solutions of Problems (13) and (19) must satisfy
$$ \| e (1 + \zeta) - U (U^\top U)^{-1} U^\top (v + \zeta e) \|^2 = \alpha (1 + \zeta)^2 \qquad (36) $$
Then, simple computations reveal that the solutions of the above equation are the two roots of the polynomial (9), with the special case α = $\underline{\alpha}$ associated with the least squares solution (U^⊤U)^{-1}U^⊤e. Furthermore, the conditions on ζ derived above imply that the solution of Problem (13) is obtained by choosing the maximum of these two roots, i.e. setting ζ(α) to λ(α), whereas Problem (19) is solved by choosing the smallest root, i.e. setting ζ(α) to η(α). Now, considering the objective ‖v − Uf‖² with f replaced by (35), we obtain that the prediction error on the validation data set, as a function of ζ(α), is
$$ \left\| v - U (U^\top U)^{-1} U^\top \frac{v + \zeta(\alpha) e}{1 + \zeta(\alpha)} \right\|^2 \qquad (37) $$
which, after simple computations, is equal to
$$ \begin{cases} v^\top v + e^\top K e - 2 e^\top K v & \text{if } \alpha = \underline{\alpha} \\[6pt] v^\top v + \dfrac{\zeta^2(\alpha) e^\top K e - 2 \zeta^2(\alpha) e^\top K v - (2\zeta(\alpha) + 1) v^\top K v}{(1 + \zeta(\alpha))^2} & \text{if } \alpha \in (\underline{\alpha}, \bar{\alpha}] \end{cases} \qquad (38) $$
The structure of the above objective shows that (15) and (21) provide the prediction error on the validation data set, as a function of α, when the most and the least predictive vector in R_α is selected, respectively. This concludes the proof.
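As a numerical sanity check on the algebra leading to (38) — not part of the original paper — the sketch below compares the direct evaluation of (37) with the closed form; the quadratic forms e^⊤Ke, e^⊤Kv, v^⊤Kv are here evaluated with the output-space projection U(U^⊤U)^{-1}U^⊤, which is the matrix that appears when the norm in (37) is expanded (an assumption about the intended notation). The two values agree to machine precision for any ζ ≠ −1:

```python
import numpy as np

rng = np.random.default_rng(4)
m, ell = 50, 20
U = np.tril(np.ones((m, m)))[:, :ell]        # step input (arbitrary choice)
e, v = rng.standard_normal(m), rng.standard_normal(m)

P = U @ np.linalg.pinv(U)                    # projection U (U^T U)^{-1} U^T  (notation assumption)
eKe, eKv, vKv = e @ P @ e, e @ P @ v, v @ P @ v

for zeta in (-3.0, 0.7, 5.0):                # values on both sides of -1
    f = np.linalg.pinv(U) @ (v + zeta * e) / (1.0 + zeta)     # stationary point (35)
    alpha = np.sum((e - U @ f) ** 2)                          # fitting level, cf. (36)
    lhs = np.sum((v - U @ f) ** 2)                            # direct prediction error (37)
    rhs = v @ v + (zeta**2 * eKe - 2 * zeta**2 * eKv - (2 * zeta + 1) * vKv) / (1 + zeta) ** 2
    print(f"zeta = {zeta:5.1f}  alpha = {alpha:9.2f}  (37) = {lhs:9.3f}  (38) = {rhs:9.3f}")
```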

REFERENCES

H. Akaike. A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19:716–723, 1974.
T. Chen, H. Ohlsson, G.C. Goodwin, and L. Ljung. Kernel selection in linear system identification – part II: A classical perspective. In Proceedings of CDC-ECC, 2011a.
T. Chen, H. Ohlsson, and L. Ljung. On the estimation of transfer functions, regularization and Gaussian processes – revisited. In Proceedings of the IFAC World Congress 2011, Milano, 2011b.
T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer, 2008.
L. Ljung. System Identification – Theory for the User. Prentice Hall, 1999.
G. Pillonetto and G. De Nicolao. A new kernel-based approach for linear system identification. Automatica, 46:81–93, 2010.
G. Pillonetto and G. De Nicolao. Kernel selection in linear system identification – part I: A Gaussian process perspective. In Proceedings of CDC-ECC, 2011.
G. Pillonetto, A. Chiuso, and G. De Nicolao. Prediction error identification of linear systems: A nonparametric Gaussian regression approach. Automatica, 47:291–305, 2011.
C.E. Rasmussen and C.K.I. Williams. Gaussian Processes for Machine Learning. The MIT Press, 2006.
T. Soderstrom and P. Stoica. System Identification. Prentice Hall, 1989.
