Copyright © IFAC 12th Triennial World Congress, Sydney, Australia, 1993
ON THE USE OF REGULARIZATION IN SYSTEM IDENTIFICATION

J. Sjöberg, T. McKelvey and L. Ljung

Department of Electrical Engineering, Linköping University, S-581 83 Linköping, Sweden
Abstract. Regularization is a standard statistical technique to deal with ill-conditioned parameter estimation problems. We discuss in this contribution what possibilities and advantages regularization offers in system identification. In the first place regularization reduces the variance error of a model, but at the same time it introduces a bias. The familiar trade-off between bias and variance error in the choice of model order/structure can therefore be discussed in terms of the regularization parameter. We also show how the well-known problem of parametrizing multivariable systems can be dealt with using overparametrization plus regularization. A characteristic feature of letting the parametrization/model structure/model order question be settled by regularization is that it is an easy and "automatic" way of finding the important parameters and a good parametrization. No statistical penalty is paid for the overparametrization, but there is a penalty of higher computational burden.

Keywords. Regularization, identification, parametrization, overparametrization, model selection.
1 INTRODUCTION
A typical parameter estimation problem is the following one. Let two observed sequences of vectors, y(t) and φ(t), t = 1, ..., N, be related by

y(t) = g(θ, φ(t))    (1)

where we seek to estimate the value of the parameter vector θ. This is typically done by minimizing a criterion of fit

V_N(θ) = (1/N) Σ_{t=1}^{N} ε²(t, θ)    (2)

with respect to θ. Here ε(t, θ) is the misfit

ε(t, θ) = y(t) - g(θ, φ(t))    (3)

There is a vast literature on how to minimize (2) and about the properties of the estimate

θ̂_N = arg min_θ V_N(θ)    (4)

under varying assumptions about {y(t), φ(t)} and g.

We shall be particularly interested in regularization. By this term is meant that the criterion V_N(θ) is modified to

W_N^δ(θ) = V_N(θ) + δ · r(θ)    (5)

where we will assume most of the time that the extra term is of the form

r(θ) = |θ - θ#|²    (6)

The regularization term r(θ) will have the effect that the resulting estimate, which we will denote by

θ̂_N^δ = arg min_θ W_N^δ(θ)    (7)

will be a compromise between θ being close to a value that gives a small error ε(t, θ) and being close to the value θ#.

A distinct feature of regularization should be mentioned right away: the regularization term influences different elements of the parameter vector θ in quite different ways. Parameters that are very important for obtaining a good fit are not affected by r(θ) very much, while parameters that have minor influence on ε(t, θ) will be much more strongly pulled towards θ#.

If the minimization is performed using a second order algorithm of Gauss-Newton type, the regularization of the form (6) will also make the numerics well conditioned independently of the quality of the observations.
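To make the criterion modification (5)-(6) concrete, here is a minimal numerical sketch (ours, not part of the paper) of regularized least squares for a linear regression; the data, the numpy-based implementation and all names are our own choices. Note how the term δI also repairs the conditioning of the normal equations when regressors are redundant.

```python
import numpy as np

# Sketch (ours): regularized least squares for a linear regression
# y(t) = theta' phi(t) + e(t), i.e. the criterion
# W_N(theta) = (1/N) sum |y - Phi theta|^2 + delta*|theta - theta_prior|^2 of (5)-(6).

def fit_regularized(Phi, y, delta, theta_prior):
    """Minimize (1/N)|y - Phi theta|^2 + delta*|theta - theta_prior|^2."""
    N, d = Phi.shape
    Q = Phi.T @ Phi / N                    # sample version of Q = E phi phi'
    f = Phi.T @ y / N
    # Normal equations of the regularized criterion; the delta*I term also
    # keeps them well conditioned when Q is (nearly) singular.
    return np.linalg.solve(Q + delta * np.eye(d), f + delta * theta_prior)

# Example with a redundant regressor (second column repeats the first),
# so Q is singular and the unregularized problem is ill-conditioned.
rng = np.random.default_rng(0)
N = 500
u = rng.standard_normal(N)
w = rng.standard_normal(N)
Phi = np.column_stack([u, u, w])
y = 1.0 * u + 0.5 * w + 0.1 * rng.standard_normal(N)

theta_hat = fit_regularized(Phi, y, delta=1e-3, theta_prior=np.zeros(3))
print(theta_hat)   # first two entries ~0.5 each: the redundant direction is split evenly
```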
2 A PIECE OF ANALYSIS
We shall in this section bring out the benefits of regularization. We shall perform the analysis under the assumption that the model structure is a linear regression, g(θ, φ(t)) = θᵀφ(t), so that the criterion (2) becomes quadratic in θ. This is not much of a restriction, because conventional asymptotic analysis of the properties of θ̂_N is anyway local, and based on a local quadratic expansion of the criterion. In (Sjöberg and Ljung, 1992) the results below are derived more explicitly.

Assume that the observed data actually can be described by

y(t) = θ₀ᵀ φ(t) + e(t)    (8)

for a realization of a white noise sequence {e(t)}, and let λ₀ = E e²(t). Let θ# = θ₀. (In practice we do not know θ₀ and have to use a nominal guess θ#. We will return to this issue later.) If we let the quality of the model be evaluated by

V̄(θ) = E (y(t) - θᵀφ(t))²    (9)

and introduce

V̄_N^δ = E V̄(θ̂_N^δ)    (10)

Using traditional calculations, (Ljung, 1987), it can be shown that

V̄_N^δ ≈ λ₀ (1 + (1/N) tr[Q² (Q + δI)⁻²])    (11)

where

Q = E φ(t) φᵀ(t)    (12)

Since all the matrices in (11) can be diagonalized simultaneously we find that

V̄_N^δ ≈ λ₀ (1 + J/N)    (13)

J = Σ_{i=1}^{d} σᵢ² / (σᵢ + δ)²,    σᵢ : eigenvalues of Q    (14)

where d = dim θ. First, we notice that without regularization, i.e. for δ = 0, we obtain the well-known, general result that

V̄_N = λ₀ (1 + d/N)    (15)

See, e.g., (Ljung, 1987), p. 418. Second, if the spread of the eigenvalues is rather substantial, so that σᵢ is often either significantly larger or significantly smaller than δ, then we can interpret J as follows:

J ≈ d#(δ) = the number of eigenvalues of Q that are larger than δ    (16)

Here we see the important benefit of regularization: "unimportant" parameters (in this context "parameters" should be interpreted as directions in the parameter space of θ) no longer have a negative influence on the model performance, and if more parameters are included there is no statistical penalty as long as the corresponding eigenvalues are smaller than δ.

We must however also address the problem that in practice we do not know θ₀ but must use some nominal guess θ#. Then the estimate θ̂_N^δ will not converge to the true value θ₀ but to some other value θ* as N → ∞. It can be shown that the following relation holds:

V̄(θ*) - V̄(θ₀) ≤ (δ/4) |θ₀ - θ#|²    (17)

What all this means is the following. Suppose that we do have some knowledge about the order of magnitude of θ₀ (so that |θ₀ - θ#| is not exceedingly large). Suppose also that there are d - d#(δ) parameters that contribute with eigenvalues of Q that are less than δ. Then introducing regularization with parameter δ decreases the variance part of the model error (13) by λ₀ (d - d#(δ))/N, at the same time as the increase of V̄(θ*) due to bias is less than (δ/4) |θ₀ - θ#|². Regularization thus leads to a better mean square error of the model if

(δ/4) |θ₀ - θ#|² < λ₀ (d - d#(δ))/N    (18)

This expression gives a precise indication of when regularization is good and what levels of δ are reasonable. Note that Q = E φ(t)φᵀ(t) is a matrix of which we have a good estimate from data. This means that d#(δ) is known with good approximation. With some measure of the size of |θ₀ - θ#|, the best choice of regularization parameter δ* is then

δ* = arg min_δ {(18)}    (19)

This means that we can, with help of δ, decide how important a parameter should be in order to be included in the model, and this makes the choice of δ a way to choose the model order. In practice, δ has to be decided upon by cross-validation. See (Gu, 1990) or (Golub et al., 1979).

Let us conclude this section by mentioning how much the estimate θ̂_N^δ differs from θ̂_N. After some derivations it can be shown that

θ̂_N^δ = (I - M_δ) θ̂_N + M_δ θ#    (20)

M_δ = δ (δI + Q)⁻¹    (21)

so the regularized estimate is a weighted mean between the unregularized one and the nominal value θ#.
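A small numerical check (our construction, under the linear-regression assumptions of this section) of the quantities above: the variance factor J in (14), its step-like approximation d#(δ) in (16), and the weighted-mean form (20)-(21).

```python
import numpy as np

# Sketch (ours) of Section 2 for a linear regression: J in (14),
# d#(delta) in (16), and the weighted-mean relation (20)-(21).

rng = np.random.default_rng(1)
N, d, delta = 1000, 10, 0.05
# Regressors giving widely spread eigenvalues of Q = E phi phi'
scales = np.logspace(1, -2, d)
Phi = rng.standard_normal((N, d)) * scales
theta0 = rng.standard_normal(d)
y = Phi @ theta0 + 0.5 * rng.standard_normal(N)

Q = Phi.T @ Phi / N
f = Phi.T @ y / N
sigma = np.linalg.eigvalsh(Q)                     # eigenvalues of Q

J = np.sum(sigma**2 / (sigma + delta)**2)         # (14)
d_sharp = np.sum(sigma > delta)                   # (16): eigenvalues above delta
print(J, d_sharp)                                 # close when the spread is wide

# Regularized and unregularized estimates; verify the relation (20)-(21)
theta_prior = np.zeros(d)
theta_hat = np.linalg.solve(Q, f)
theta_hat_reg = np.linalg.solve(Q + delta * np.eye(d), f + delta * theta_prior)
M = delta * np.linalg.inv(delta * np.eye(d) + Q)  # (21)
print(np.allclose(theta_hat_reg, (np.eye(d) - M) @ theta_hat + M @ theta_prior))
```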
3 IMPLICIT REGULARIZATION DUE TO UNFINISHED SEARCH

The unregularized estimate θ̂_N, defined by (4), is typically computed by iterative search, and if this search is stopped before the true minimum is reached, then this is the same as regularization towards the initial value of the parameters. In this section we will just give the basic results; the derivation can be found in (Sjöberg and Ljung, 1992), or in (Wahba, 1987). The minimization algorithm is typically of the following kind:

θ̂_N^(k+1) = θ̂_N^(k) - μ H V_N'(θ̂_N^(k))    (22)

with an initial value θ̂_N^(0) = θ#. The step size μ is determined by some search along the indicated line. H is a matrix that may modify the search direction from the negative gradient one, -V_N', to some other one. It is natural to view the initial value θ# as a prior guess of θ₀, as in the case of explicit regularization. After a finite number of steps the parameter estimate can be written

θ̂_N^(k) = (I - M_k) θ̂_N + M_k θ#    (23)

where M_k = (I - μHQ)^k and Q was defined by (12). Compare with (20), (21)! The point now is that, except in the true Newton case H = Q⁻¹, the matrices M_k and M_δ behave similarly. The amount of regularization that is inflicted depends on the number of iterations k (a large δ corresponds to a small k).

Hence, "unfinished" search for the minimum of a criterion function has the same effect as regularization towards the initial value θ#. Only schemes that allow exact convergence in a finite number of steps do not show this feature. This gives us a powerful way to perform regularization. If explicit regularization is performed, one has to search over δ for the one which gives best performance on the validation data, according to the last section. This means that many minimizations have to be done. With regularization by terminated search, the minimization only has to be done once. The criterion of fit is computed on the validation data after each iteration, and the search is stopped when the criterion has reached its minimum. In this framework the identification data is used to calculate the steps in the parameter space, and the validation data is used to stop the search. This technique has been used for some years for image restoration, see e.g. (Fleming, 1990), and also in connection with neural networks, see (Sjöberg and Ljung, 1992).

4 SOME USES OF REGULARIZATION IN SYSTEM IDENTIFICATION

4.1 The basic use of regularization

It is apparent from the previous analysis that a prime use of regularization is to allow more parameters in the model structure than actually necessary, and to let the identification procedure "itself" select the important ones. Those "not selected" will then not have an adverse influence on the variance error, due to the regularization.

This simple procedure has to be qualified, though. It is important to distinguish between two kinds of "unnecessary parameters". Consider first the two extreme cases

y(t) = (b₁ + b₂) u(t)    (24)

and

y(t) = b₁ u(t) + b₂ u(t-1)    (25)

where {u(t)} is white noise and the true value of b₂ = 0. In both cases the parameter b₂ is "unnecessary". In (24) regularization will automatically account for b₁ + b₂ as the "interesting" parameter combination and select individual values of b₁ and b₂ so that |b| is minimized (obtained for b₁ = b₂). There will then be no statistical penalty for using two parameters rather than one (but there is a computational burden).

In the second case, (25), b₂ is still not "necessary", but it does have a substantial influence on the fit; it just so happens that the true parameter has the numerical value 0. For this case, regularization does not offer any help. On the other hand, b₂ is well identifiable, and its "unnecessity" will be recognized by the fact that the estimate is small compared to the estimated standard deviation.

Now, real estimation problems show these two features in rather concealed forms. For example, if a dynamical system is given by

y(t) = u(t)    (26)

and we use a model structure of the form

y(t) = ((1 + b q⁻¹) / (1 + f q⁻¹)) u(t)    (27)

(q⁻¹ is the delay operator), then the estimates of b and f show both of the features expressed in (24) and (25).
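Returning for a moment to Section 3: the following sketch (ours, not from the paper) verifies the terminated-search relation (23) numerically for a quadratic criterion with H = I. The constant factor in the squared-error gradient is absorbed into the step size μ, so M_k = (I - μQ)^k in the notation used here; all data and names are our own choices.

```python
import numpy as np

# Sketch (ours): k gradient steps started at theta_prior reproduce the form (23),
# theta^(k) = (I - Mk) theta_hat + Mk theta_prior, with Mk = (I - mu*Q)^k (H = I).

rng = np.random.default_rng(2)
N, d = 400, 6
Phi = rng.standard_normal((N, d)) * np.logspace(0, -2, d)
theta0 = rng.standard_normal(d)
y = Phi @ theta0 + 0.3 * rng.standard_normal(N)

Q = Phi.T @ Phi / N
f = Phi.T @ y / N
theta_hat = np.linalg.solve(Q, f)          # unregularized least-squares estimate
theta_prior = np.zeros(d)

mu = 0.5 / np.linalg.eigvalsh(Q).max()     # fixed step size, H = I
theta = theta_prior.copy()
k = 20
for _ in range(k):
    grad = Q @ theta - f                   # gradient of the quadratic criterion, up to a factor
    theta = theta - mu * grad

Mk = np.linalg.matrix_power(np.eye(d) - mu * Q, k)
print(np.allclose(theta, (np.eye(d) - Mk) @ theta_hat + Mk @ theta_prior))
```

Directions of Q with small eigenvalues leave M_k close to the identity, so the corresponding parameters stay near θ#, just as with M_δ in (21).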
4.2 Parametrization of State-Space Models

Parametrization of multivariable state-space models is a well known and notoriously difficult problem in system identification, see e.g. (Kailath, 1980), (Luenberger, 1967) or (van Overbeek and Ljung, 1982).

The root of the problem is that there is not one single, smooth canonical parametrization of a multi-output system. Instead one has to work with a large number of different possible parametrizations, corresponding to different values of the observability indices (which of course are unknown).

A radical approach to this problem is simply to fill all the state-space matrices with parameters. This will mean that there are too many parameters involved. The resulting model structure is then clearly not identifiable, since the uniqueness of the parameters is lost. A true system will thus be a surface in the parameter space instead of a single point. But this is a situation of the kind (24), which is readily handled by regularization. Apart from the increased numerical burden, this approach eliminates all structural issues for the identification of multivariable systems. This type of overparametrization does not offer any extra dynamics compared to the canonical forms, since the two model structures still have the same number of states. Thus we do not suffer from the usual problem of overfit as with, e.g., overparametrized ARX models.

This type of overparametrization has a connection with the recent subspace identification methods, e.g. (Overschee and Moor, 1992), since the subspace methods also obtain models in a non-canonical form.

To further present these ideas a small example will be presented in the sequel.

Example. In (Maciejowski, 1989), Appendix A.2, a turbo generator model with two inputs, two outputs and six states is presented. We used the continuous time model to generate an estimation data set and a validation set using random binary (-1, +1) signals as inputs. The sample time was set to 0.05. The estimation data set and the validation set were 500 and 300 samples long, respectively, with the output sequences corrupted by zero mean Gaussian noise with variance 0.0025 to make the system of output-error type.

The estimation was then performed on the estimation data set by minimizing

W_N^δ(θ) = (1/N) Σ_{t=1}^{N} |y(t) - ŷ(t)|² + δ r(θ)

with the predictor ŷ(t) given by

x(t+1) = A(θ) x(t) + B(θ) u(t),    x(0) = 0
ŷ(t)   = C(θ) x(t) + D(θ) u(t)

This gives us a total of 64 parameters to estimate, which can be compared with the 40 parameters for any canonical form. The parameter δ was set to 10⁻⁵, which makes the bias negligible but still makes the numerics well conditioned. To assess the quality of the estimated model we evaluate

V̄ = (1/N) Σ_{t=1}^{N} |y(t) - ŷ(t)|²

using the independent validation set.

To compare the proposed parametrization of the model with the conventional canonical parametrizations, we identified the five different possible models corresponding to the five sets of observability indices. The results are given in Table 1, which clearly shows that the overparametrized model (OM) is even slightly better than the best canonical model (CM). □

Table 1: V̄ evaluated for the overparametrized model and the 5 canonical models.

  Model   OM       CM {1,5}   CM {2,4}   CM {3,3}   CM {4,2}   CM {5,1}
  V̄       0.0056   0.0070     0.0205     0.0195     0.0195     0.0064

The proposed overparametrized state-space model structure also performs well compared to canonical forms when identification is performed on data which has an ill-conditioned canonical representation, see (McKelvey, 1993). The overparametrization thus not only allows us to model the unknown system but also gives us the possibility to obtain a well conditioned representation. Using the canonical forms we do not have this extra degree of freedom to also find a well conditioned model.

In (Overschee and Moor, 1992) an identification of a glass oven using real data is given as an example. Using the same data and our overparametrized model structure, the performance of the identified model on independent data is quite comparable with the results of (Overschee and Moor, 1992).
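For illustration only, a much smaller sketch (ours) of the same idea: a fully parameterized state-space predictor and the regularized output-error criterion, fitted with a generic least-squares routine. The SISO dimensions, the data, and the use of scipy are our own choices and do not reproduce the turbo-generator example.

```python
import numpy as np
from scipy.optimize import least_squares

# Sketch (ours): all entries of (A, B, C, D) are free parameters, and the
# regularized output-error criterion W_N(theta) = (1/N) sum |y - yhat|^2 + delta*|theta|^2
# is minimized; sizes are deliberately tiny (2 states, 1 input, 1 output).

nx, nu, ny, N, delta = 2, 1, 1, 200, 1e-5

def unpack(theta):
    A = theta[:nx*nx].reshape(nx, nx)
    B = theta[nx*nx:nx*nx + nx*nu].reshape(nx, nu)
    C = theta[nx*nx + nx*nu:nx*nx + nx*nu + ny*nx].reshape(ny, nx)
    D = theta[-ny*nu:].reshape(ny, nu)
    return A, B, C, D

def simulate(theta, u):
    A, B, C, D = unpack(theta)
    x = np.zeros(nx)
    yhat = np.zeros((len(u), ny))
    for t in range(len(u)):
        yhat[t] = C @ x + D @ u[t]
        x = A @ x + B @ u[t]
    return yhat

def residuals(theta, u, y):
    # Prediction errors plus the regularization term, stacked so that the
    # sum of squares equals N * W_N(theta).
    e = (y - simulate(theta, u)).ravel()
    return np.concatenate([e, np.sqrt(delta * len(u)) * theta])

# Data from a known stable 2-state system, with a little output noise
rng = np.random.default_rng(3)
u = rng.choice([-1.0, 1.0], size=(N, nu))
A0 = np.array([[0.7, 0.2], [0.0, 0.5]]); B0 = np.array([[1.0], [0.5]])
C0 = np.array([[1.0, -1.0]]); D0 = np.array([[0.0]])
theta_true = np.concatenate([A0.ravel(), B0.ravel(), C0.ravel(), D0.ravel()])
y = simulate(theta_true, u) + 0.05 * rng.standard_normal((N, ny))

theta0 = 0.1 * rng.standard_normal(theta_true.size)   # 9 free parameters in this tiny case
fit = least_squares(residuals, theta0, args=(u, y))
print(np.mean((y - simulate(fit.x, u))**2))            # fit of the overparameterized model
```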
4.3 Model Structure Selection
The model selection problem in system identification is characterized by testing very many different structures until a suitable one is found. One reason for the necessity to try many structures is the mentioned intricate identifiability question, as illustrated by (26)-(27). An interesting alternative is to use just one high order model together with regularization. The result will then not be confused by non-identifiable parameter combinations, and the necessity of identifiable ones can be examined directly by comparison with the estimated standard deviation.
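A minimal sketch (ours) of this alternative: one deliberately high-order FIR model is fitted with regularization, and each parameter estimate is then compared with a rough estimate of its standard deviation. The FIR structure, the threshold and all numbers are our own assumptions, not prescribed by the paper.

```python
import numpy as np

# Sketch (ours): fit a single high-order FIR model with regularization and
# flag parameters whose estimates are small compared to their standard deviation.

rng = np.random.default_rng(4)
N, d, delta = 400, 15, 1e-2                      # deliberately too many lags
u = rng.standard_normal(N + d)
y = 1.0*u[d:] + 0.5*u[d-1:-1] + 0.1*rng.standard_normal(N)   # true system uses 2 lags

Phi = np.column_stack([u[d-k:N+d-k] for k in range(d)])      # lagged inputs u(t-k)
Q = Phi.T @ Phi / N
theta = np.linalg.solve(Q + delta*np.eye(d), Phi.T @ y / N)

# Rough covariance of the regularized estimate (noise variance from residuals)
lam = np.mean((y - Phi @ theta)**2)
R = np.linalg.inv(Q + delta*np.eye(d))
cov = lam / N * R @ Q @ R
std = np.sqrt(np.diag(cov))

for k in range(d):
    flag = "needed" if abs(theta[k]) > 2*std[k] else "negligible"
    print(f"b{k}: {theta[k]: .3f} +/- {std[k]:.3f}  {flag}")
```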
4.4 Nonlinear Black-box Models Need Many Parameters
In general, fitting functions to data is an ill-posed problem, and to any finite data set infinitely many functions can be found which fit the data. Most of these functions would, however, perform rather poorly when applied to validation data. This motivates some kind of restrictions on the estimated function, e.g., that it should belong to a certain set of functions. That is what we are doing when we try to fit a linear model to the data. Putting restrictions on the estimated function is a way to incorporate prior knowledge in the estimation procedure (e.g., the system is known to be linear).

If, however, no prior knowledge is available, (15) still gives us some guidance: whatever model we try to fit to the data, we have to limit the number of parameters. Still, we want the model to be as flexible as possible. This can be achieved with the help of regularization in the following way. A flexible model may be proposed with many parameters, and then, with the regularization, the "useful" parameters are selected automatically. In this framework, a good model is characterized by an ability to approximate many different functions with only a subset of its parameters. The regularization takes care of the parameters not belonging to this subset, i.e., the "unnecessary" parameters. Artificial neural networks apparently have this ability to approximate a wide variety of functions, and in each case only use some of their parameters. See (Sjöberg and Ljung, 1992). The following example will show how it works.
Example. Data were generated by

y(t) = (u(t) + y(t-2)) / (1 + y²(t-1) + u²(t-1))    (28)

with {u(t)} chosen as white Gaussian noise with σ_u = 1. The output data {y(t)} were corrupted by white Gaussian noise with σ = 0.5.
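A minimal sketch (ours) of this example: data are simulated from (28) and a one-hidden-layer network of the kind described below (10 units; we take σ = tanh for concreteness) is fitted by plain gradient descent, monitoring the fit on validation data in the spirit of Section 3. The implementation details and all names are our own and not those used in the paper.

```python
import numpy as np

# Sketch (ours): simulate (28), fit a one-hidden-layer network with 10 tanh units
# and regression vector [y(t-1) y(t-2) u(t) u(t-1) 1], tracking the validation fit.

rng = np.random.default_rng(5)

def generate(N):
    u = rng.standard_normal(N)
    y = np.zeros(N)
    for t in range(2, N):
        y[t] = (u[t] + y[t-2]) / (1 + y[t-1]**2 + u[t-1]**2)
    y = y + 0.5 * rng.standard_normal(N)        # output noise as in the example
    phi = np.column_stack([y[1:-1], y[:-2], u[2:], u[1:-1], np.ones(N - 2)])
    return phi, y[2:]

phi_e, y_e = generate(202)      # ~200 estimation data
phi_v, y_v = generate(302)      # ~300 validation data

nh = 10
A = 0.1 * rng.standard_normal((nh, 5)); c = 0.1 * rng.standard_normal(nh); c0 = 0.0

def predict(phi, A, c, c0):
    return np.tanh(phi @ A.T) @ c + c0

mu, best = 0.01, np.inf
for k in range(2000):
    h = np.tanh(phi_e @ A.T)                    # hidden-unit outputs
    e = y_e - (h @ c + c0)                      # prediction errors
    # full-batch gradient steps on the mean squared error
    gc = -2 * h.T @ e / len(e)
    gc0 = -2 * e.mean()
    gA = -2 * ((e[:, None] * (1 - h**2)) * c).T @ phi_e / len(e)
    A, c, c0 = A - mu*gA, c - mu*gc, c0 - mu*gc0
    v = np.mean((y_v - predict(phi_v, A, c, c0))**2)
    if v < best:
        best, saved = v, (A.copy(), c.copy(), c0)
    # (a real run would stop once v stops decreasing, as described in Section 3)

A, c, c0 = saved                                # parameters at the early-stopping point
print("validation fit at early stopping:", best)
```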
We now try to fit a neural network with one hidden layer to the data. The network can be expressed in the following way:

g(φ(t)) = Σ_{j=1}^{10} C_j σ(φ(t)ᵀ A_j) + C_0    (29)

where we have 10 hidden units in the hidden layer, and σ(x) is the nonlinear activation function of the neurons in the hidden layer. C_j and A_j are the parameters of the model. The regression vector is φ(t) = [y(t-1) y(t-2) u(t) u(t-1) 1]ᵀ. This model has 61 parameters, which must be considered as fairly high when we try to identify them with 200 data. In Figure 1 a) we see how the measures of fit develop during the minimization.

Fig. 1: a) The criterion of fit; solid line: measured on the identification data; dashed line: measured on the validation data. b) The singular values of the Hessian.

According to the theory in the last section the regularization decreases in each iteration, and from the measure of fit on the validation data we see that the optimal parameter values are reached after 7 iterations. This means that the amount of regularization is optimal after 7 iterations, and, in case explicit regularization had been used, these 7 iterations correspond to an optimal choice of δ. The singular values of the Hessian for the optimal parameter values are shown in Figure 1 b). They are spread between 10³ and 10⁻³, and the smallest ones correspond to "unnecessary" parameters. Because of the wide spread of the singular values, there is a quite sharp border between the parameters contributing to the model and the ones "switched off" by the regularization. See (14) and (16). □

This clearly shows that flexible models lead to ill-conditioned Hessians which, however, are easily handled by the use of regularization. Further, by using implicit regularization we only have to perform one minimization.

5 CONCLUSIONS

We have in this contribution examined the consequences of regularization. We have pointed to several problems in system identification where regularization should be important and useful. We feel that regularization is underutilized in the applications.
6 REFERENCES

Fleming, H. (1990). Equivalence of regularization and truncated iteration in the solution of ill-posed image reconstruction problems. Linear Algebra and its Applications, 130:133-150.

Golub, G., Heath, M., and Wahba, G. (1979). Generalized cross-validation as a method for choosing a good ridge parameter. Technometrics, 21(2):215-223.

Gu, C. (1990). Adaptive spline smoothing in non-Gaussian regression models. Journal of the American Statistical Association, 85(411):801-807.

Jennrich, R. (1969). Asymptotic properties of non-linear least squares estimators. Ann. Math. Statist., 40(2):633-643.

Kailath, T. (1980). Linear Systems. Prentice-Hall, Englewood Cliffs, New Jersey.

Ljung, L. (1987). System Identification: Theory for the User. Prentice-Hall, Englewood Cliffs, New Jersey.

Luenberger, D. (1967). Canonical forms for linear multivariable systems. IEEE Trans. Automatic Control, AC-12:290.

Maciejowski, J. (1989). Multivariable Feedback Design. Addison-Wesley, Wokingham, England.

McKelvey, T. (1993). System identification using over-parametrized state-space models. Technical Report LiTH-ISY-1454, Department of Electrical Engineering, Linköping University, Linköping, Sweden.

Overschee, P. V. and Moor, B. D. (1992). Two subspace algorithms for the identification of combined deterministic-stochastic systems. In Proc. 31st CDC, Vol. 1, pages 511-516.

Sjöberg, J. and Ljung, L. (1992). Overtraining, regularization, and searching for minimum in neural networks. In Proc. IFAC Symposium on Adaptive Systems in Control and Signal Processing, Grenoble, France.

van Overbeek, A. and Ljung, L. (1982). On-line structure selection for multivariable state space models. Automatica, 18(5):529-543.

Wahba, G. (1987). Three topics in ill-posed problems. In Engl, H. and Groetsch, C., editors, Inverse and Ill-posed Problems. Academic Press.