Copyright © IFAC Identification and System Parameter Estimation, York, UK, 1985

RESTRICTED EXPONENTIAL FORGETTING IN REAL-TIME IDENTIFICATION

R. Kulhavý

Institute of Information Theory and Automation, Czechoslovak Academy of Sciences, Prague, Czechoslovakia
Abstract. Identification of time-varying stochastic systems is examined from the Bayesian viewpoint. The problem how to update the posterior parameter density without an explicit model of parameter variations, solely on the basis of a set of alternative parameter densities, is formulated as a statistical decision problem. Its solution considerably generalizes the well-known technique of exponential forgetting. An effective way is suggested how to respect the natural requirement of forgetting only that piece of information which is being updated by the currently observed data. The restricted exponential forgetting, elaborated for the special case of the linear normal regression model, formally results in an adaptive version of the recently proposed directional forgetting. Properties of the estimation algorithm are illustrated by an example.

Keywords. Adaptive control; Bayes methods; data handling; decision theory; identification; parameter estimation; time-varying systems.

INTRODUCTION
Standard procedures for parameter estimation are, as a rule, based on the assumption that the estimated parameters are constant. A number of heuristic techniques have been developed to handle the practical cases of time-varying parameters. Most of them achieve the adaptivity by, loosely speaking, forgetting obsolete information. The directional forgetting (see Kulhavý and Kárný (1984)), which starts from the Bayesian interpretation of the well-known exponential forgetting, belongs to this class too. The parameter tracking has been considerably improved by this technique. However, it turns out that the original concept of the forgetting requires some revisions concerning the questions: (i) what piece of information should be forgotten and (ii) how to forget it. In the present paper the former question is analysed by considering the relation between prior and posterior (before and after incorporating information contained in the most recent data) probability densities by which the uncertainty of parameters is characterized. The latter question is formulated and solved as a statistical decision problem. For the practically important case of the linear normal regression model the presented solution leads to a simple modification of recursive least squares with adaptive directional forgetting. The resulting algorithm is very robust with respect to poor excitation of the system and partly adaptive with respect to different rates of parameter variations.

BAYESIAN IDENTIFICATION OF TIME-VARYING SYSTEMS

A stochastic system is considered on which a finite data set is measured at discrete time instants t = 1, 2, .... The directly manipulated input is denoted u(t); the rest of data forms the (m_y-dimensional) output y(t). The dependence of the system output on previous data is specified by the family of suitably parametrized conditional probability densities

p(y(t)|t-1; u(t), k(t)),   t = 1, 2, ...   (1)

Depending on the context, (1) will mean the density of the random variable y(t) or its value at the point y(t), conditioned on u(t), k(t) and data measured up to and including the time t-1. The unknown parameters k(t) ∈ K are taken as a multivariate random variable. Their prior uncertainty is described by the conditional (on all relevant information) probability density p(k(1)|0) defined with respect to a suitable measure K (technical assumptions needed for a rigorous treatment will be omitted below for simplicity; see e.g. Loève (1960) for details). Provided the input generator employs no other information about k(t) than measured data (see the natural condition of control in Peterka (1981)), the measurement updating of the parameter density p(k(t)|t-1) → p(k(t)|t) is described by the Bayes theorem

p(k(t)|t) ∝ p(y(t)|t-1; u(t), k(t)) p(k(t)|t-1)   (2)

where ∝ stands for proportionality. In order to complete the recursion, the time updating p(k(t)|t) → p(k(t+1)|t) has to be defined. The standard approach using a model of parameter variations is rarely applicable because of computational complexity, or lack of a proper model. For this reason, the possibility to specify the time updating of the parameter density in a direct "operational" sense is of interest.
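To make the recursion (2) concrete, a minimal numerical sketch follows; the discretized parameter grid, the scalar observation model y(t) = k u(t) + e(t) with e(t) ~ N(0, 1), and all variable names are illustrative assumptions, not part of the paper's general family (1).

```python
import numpy as np

# Discretization of the parameter space K (illustrative assumption).
k_grid = np.linspace(-3.0, 3.0, 601)
p_k = np.ones_like(k_grid) / k_grid.size        # flat prior p(k(1)|0)

def measurement_update(p_k, y, u):
    """Bayes theorem (2): posterior density proportional to
    p(y(t)|t-1; u(t), k(t)) * p(k(t)|t-1).
    Assumed observation model: y(t) = k u(t) + e(t), e(t) ~ N(0, 1)."""
    likelihood = np.exp(-0.5 * (y - k_grid * u) ** 2)
    posterior = likelihood * p_k
    return posterior / posterior.sum()          # normalization realizes the proportionality

# One step of the recursion with made-up data.
p_k = measurement_update(p_k, y=1.2, u=1.0)
```

The time updating p(k(t)|t) → p(k(t+1)|t), which the rest of the paper is about, is deliberately left unspecified here.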
DECISION-THEORETIC DERIVATION OF FORGETTING

Missing information about the model of parameter variations can be compensated by another piece of knowledge. We shall assume that at each time instant t a set of alternative parameter densities

{ p_i(k(t+1)|t), i = 1, 2, ..., N }   (3)

corresponding to a priori chosen models of the time updating p_i(k(t)|t) → p_i(k(t+1)|t) is available. The problem how to update the posterior parameter density p(k(t)|t) on the basis of the set of alternative parameter densities (3), so that the parameter density p(k(t+1)|t) may well reflect the time changes of parameters, can be and will be treated as a statistical decision problem (H, D, l, a) (see Savage (1954)).

Statistical Decision Problem

The parameter space H includes those parameter densities which come into account as a quantitative description of parameter uncertainty. Although the problem is nonparametric in principle, only the alternative densities from (3) together with the posterior density p_0(k(t+1)|t) = p(k(t)|t) are assumed to form H, because of feasibility. The decision space D contains all probability densities with respect to K. The loss functional defined on H × D, measuring the "unlikeness" or "distance" of parameter densities from H and D, is introduced by means of the quantity known as the I-divergence, Kullback's information or generalized entropy (see e.g. Csiszár (1967))

I(p_i(k(t+1)|t), p(k(t+1))) = ∫_K p(k(t+1)) ln [ p(k(t+1)) / p_i(k(t+1)|t) ] dK(k(t+1))   (4)

The definition (4) is motivated by the attempt to find a simply realizable solution for a rather rich family of probability densities. The probability distribution on the space H is supposed to be given, described by the vector a(t+1|t) = [a_0(t+1|t), ..., a_N(t+1|t)]^T of probabilities of all elements of H. We take the Bayesian solution minimizing the expected value of the loss functional (4) as the optimum solution of the statistical decision problem. Thus the updated parameter density is defined by

p(k(t+1)|t) = arg min_{p(k(t+1)) ∈ D} Σ_{i=0}^{N} a_i(t+1|t) I(p_i(k(t+1)|t), p(k(t+1)))   (5)

Generalized Exponential Forgetting

Solving (5) by the method of Lagrange multipliers we obtain

p(k(t+1)|t) ∝ Π_{i=0}^{N} [ p_i(k(t+1)|t) ]^{a_i(t+1|t)}   (6)

if the densities p_j(k(t+1)|t) ∈ H for all j such that a_j(t+1|t) > 0 are not orthogonal (singular), i.e. their product is not zero almost everywhere (a.e. in the sequel). In the opposite case the problem has no single solution and the decision is necessary to specify on the basis of other requirements.

Following the usual terminology, the mapping p(k(t)|t) → p(k(t+1)|t) defined by (6) is called the (generalized) exponential forgetting. A connection with the standard exponential forgetting in the Bayesian interpretation (cf. Peterka (1981))

p(k(t+1)|t) ∝ [ p_0(k(t+1)|t) ]^{φ(t+1|t)}   (7)

is apparent: take N = 1 and p_1(k(t+1)|t) ∝ 1.

It should be emphasized that, in comparison with (7), the solution (6) is invariant with respect to any transformation of parameters. Moreover, the generalized forgetting admits alternative densities other than noninformative ones to be considered and in this way enables us to respect both prior information and information contained in data.
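A small numerical illustration of the operation (6) may help. The sketch below (same illustrative grid convention as in the previous sketch; all names are assumptions) mixes a discretized posterior with a single flat alternative; with N = 1 and a_0 = 0.9 the result is exactly the posterior raised to the power 0.9 and renormalized, i.e. the standard exponential forgetting (7). A discrete analogue of the I-divergence (4) is included.

```python
import numpy as np

def i_divergence(p_i, p):
    """Discrete analogue of the I-divergence (4):
    I(p_i, p) = sum over the grid of p * ln(p / p_i)."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / p_i[mask])))

def generalized_forgetting(densities, a):
    """Eq. (6): updated density proportional to prod_i densities[i]**a[i];
    densities[0] plays the role of the posterior p_0(k(t+1)|t) = p(k(t)|t).
    Computed in the log domain for numerical stability."""
    log_p = sum(a_i * np.log(d + 1e-300) for a_i, d in zip(a, densities))
    p = np.exp(log_p - log_p.max())
    return p / p.sum()

grid = np.linspace(-3.0, 3.0, 601)
posterior = np.exp(-0.5 * ((grid - 0.8) / 0.2) ** 2)
posterior /= posterior.sum()
flat = np.ones_like(grid) / grid.size            # noninformative alternative
updated = generalized_forgetting([posterior, flat], a=[0.9, 0.1])
loss = i_divergence(posterior, updated)          # the i = 0 term of the criterion (5)
# 'updated' is flatter than 'posterior': part of the information is forgotten.
```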
Construction of alternative parameter densities. As the density (1) (identical for all alternatives) and the models of the time updating p_i(k(t)|t) → p_i(k(t+1)|t), i = 1, 2, ..., N, are assumed to be a priori chosen, the alternative parameter densities can be updated recursively in the way sketched above. The selection of the models of time updating is limited by two rather contradictory requirements. Firstly, the demands on the computation of the alternative densities and of the forgetting should be adequate to our possibilities. Hence, a fixed density p_1(k(t+1)|t) = p_1(k(t)|t-1), corresponding to the worst case of parameter changes, must often suffice. Secondly, the set (3) should cover, in the sense of the operation (6), the expected parameter changes as accurately as possible. Therefore the best known models of the assumed parameter evolution, giving data-dependent densities p_i(k(t+1)|t), are generally needed.

Determination of probabilities of alternatives. So far two feasible ways how to determine a(t+1|t) are known. First, the probabilities a_i(t+1|t) may be stated a priori, expressing the subjective belief in particular alternatives. Second, the optimal probability vector a(t+1|t) can be defined as least favourable among all possible probability vectors a(t+1) in the sense

p(k(t+1)|t) = arg max_{a(t+1)} min_{p(k(t+1)) ∈ D} Σ_{i=0}^{N} a_i(t+1) [ I(p_i(k(t+1)|t), p(k(t+1))) - λ_i(t+1|t) ]   (8)

where the λ_i(t+1|t) denote the values of losses which are felt as equally important for the particular elements of H. By this modification of the loss functional we can partly respect available information about the "successfulness" of particular alternatives in describing the real parameter evolution. As the density p(k(t)|t) represents a good approximation of the best posterior parameter density, the equally important losses can be estimated by its discrepancy (I-divergence) with respect to the prior parameter densities

λ_0(t+1|t) = (1+ρ) I(p(k(t)|t-1), p(k(t)|t))
λ_i(t+1|t) = I(p_i(k(t)|t-1), p(k(t)|t)),   i = 1, 2, ..., N   (9)
The asymmetry for i = 0 follows from the fact that only the detected (not expected) discrepancy is measured by I(p(k(t)|t-1), p(k(t)|t)). For this reason we increase the loss by the heuristic factor ρ > 0. It can be shown that this factor represents the maximal expected ratio of the amount of "forgotten" information I(p_0(k(t+1)|t), p(k(t+1)|t)) to the amount of information contained in the latest data I(p(k(t)|t-1), p(k(t)|t)) in the steady state.

RESTRICTED FORGETTING

The characteristic feature of the exponential forgetting is that the whole information described by the posterior parameter density is modified for a_0(t+1|t) < 1. However, are we really justified to suspect all this information? If the model of parameter variations is not available, only information contained in data, modifying through the model (1) our prior information, is relevant for the answer. We suggest to apply the forgetting solely to that piece of accumulated information which has been modified by the latest data. This heuristic idea can be formalized in the following way.

Let us consider a measurable mapping T: K → K'. Then the density p'(k') with respect to a suitable measure K' can be determined for an arbitrary density p(k) from the equality

∫_B p'(k') dK'(k') = ∫_{T^{-1}(B)} p(k) dK(k)

which holds for each measurable set B. We say that the mapping T is sufficient with respect to the pair of densities {p(k), p̄(k)} if nonnegative measurable functions g, ḡ, h exist such that

p(k) = g(T(k)) h(k)
p̄(k) = ḡ(T(k)) h(k)

hold a.e.

Let us select a mapping sufficient with respect to {p(k(t)|t-1), p(k(t)|t)} and {p_i(k(t)|t-1), p_i(k(t)|t)}, i = 1, 2, ..., N. Loosely speaking, such a mapping condenses the effect of the latest data in the sense that the parameters k' = T(k) make it possible to distinguish prior and posterior densities as accurately as the original parameters k do. If we accept the cautious policy "if data say nothing, keep initial information", then that piece of information which is removed using the sufficient mapping must be restored again after the forgetting. More formally, the mapping must be sufficient also with respect to {p_0(k(t+1)|t), p(k(t+1)|t)}.

The restricted exponential forgetting can be developed in this way. The application of (6) to the densities of a suitably defined sufficient mapping gives

p'(k'(t+1)|t) ∝ Π_{i=0}^{N} [ p'_i(k'(t+1)|t) ]^{a'_i(t+1|t)}   (10)

where the probability vector a'(t+1|t) can be determined in the same way as above. The requirement of sufficiency results (after the substitution k'(t+1) = T(k(t+1))) in

p(k(t+1)|t) ∝ [ p'(T(k(t+1))|t) / p'_0(T(k(t+1))|t) ] p_0(k(t+1)|t)   (11)

for p'_0(T(k(t+1))|t) > 0, if p'_0(k'(t+1)|t) = 0 implies p'(k'(t+1)|t) = 0 a.e. In the opposite case the extension of p'(k'(t+1)|t) onto p(k(t+1)|t) is not single and must be specified by additional requirements.

In order to remove all the common piece of information, we should select the minimal sufficient mapping of parameters, i.e. such a sufficient mapping which comprises each other sufficient mapping. It can be proved by using the results of Csiszár (1967) that it is enough to take the following function of the unknown parameters

T(k(t)) = p(y(t)|t-1; u(t), k(t))   for p(k(t)|t-1) Σ_{i=1}^{N} p_i(k(t)|t-1) > 0
T(k(t)) = -1                        else   (12)

However, computational results can make us choose some more "detailed" mapping.

APPLICATION TO LINEAR NORMAL REGRESSION MODEL

It is well known that parameter estimation in the case of the linear normal regression model results in the recursive least squares method. An effective modification of this method, due to the application of the restricted exponential forgetting, is described in this section.

Model of System

For the linear normal multivariate regression model the density (1) takes the Gaussian form

p(y(t)|t-1; u(t), Ω(t), P(t)) = (2π)^{-m_y/2} |Ω(t)|^{1/2} exp{ -(1/2) (y(t) - P^T(t) z(t))^T Ω(t) (y(t) - P^T(t) z(t)) }   (13)

where the m_z-dimensional vector z(t) is a known function of u(t) and previous data up to the time t-1; the precision matrix Ω(t) and the matrix of regression coefficients P(t) are the unknown parameters k(t). The prior density of the parameters is supposed to be in the conjugate Gauss-Wishart form

p(Ω(1), P(1)|0) ∝ |Ω(1)|^{(ν(1|0)+m_z-m_y-1)/2} exp{ -(1/2) tr( Ω(1) Λ(1|0) ) } exp{ -(1/2) tr[ Ω(1) (P(1) - P̂(1|0))^T C^{-1}(1|0) (P(1) - P̂(1|0)) ] }   (14)

A scalar ν(1|0) > m_y - 1 and matrices P̂(1|0), C(1|0) > 0, Λ(1|0) > 0 form the sufficient statistic fully specifying the density.
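To fix ideas before the forgetting is specialized, the following sketch shows the statistic (P̂, C, Λ, ν) in code together with its pure Bayesian measurement updating, i.e. ordinary recursive least squares with no forgetting yet. A scalar output (m_y = 1) is assumed and all names are illustrative.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class GaussWishart:
    """Sufficient statistic of the density (14) for scalar output (m_y = 1):
    P_hat with shape (m_z,), C positive definite with shape (m_z, m_z),
    Lam > 0 and nu > 0."""
    P_hat: np.ndarray
    C: np.ndarray
    Lam: float
    nu: float

def rls_update(s: GaussWishart, z: np.ndarray, y: float) -> GaussWishart:
    """Measurement updating (2) expressed on the statistic: ordinary
    recursive least squares, i.e. the special case of the forgetting
    recursions derived below with phi = 1."""
    e = y - s.P_hat @ z                  # prediction error
    zeta = z @ s.C @ z                   # auxiliary scalar, cf. (18)
    gain = s.C @ z / (1.0 + zeta)
    return GaussWishart(
        P_hat=s.P_hat + gain * e,
        C=s.C - np.outer(gain, z @ s.C),
        Lam=s.Lam + e * e / (1.0 + zeta),
        nu=s.nu + 1.0,
    )
```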
Specialization of Restricted Forgetting

An application of the restricted exponential forgetting to the above model is suggested under the following special assumptions.

a) One fixed alternative density of parameters is considered as the limit case of the parameter density in the form (14)
for C^{-1}(t+1|t) → 0, Λ(t+1|t) → 0, ν(t+1|t) → 0, i.e.

p_1(Ω(t+1), P(t+1)|t) ∝ |Ω(t+1)|^{(m_z - m_y - 1)/2},   t = 1, 2, ...   (15)

b) The sufficient (probably the minimal sufficient) mapping of parameters saving the Gauss-Wishart form of the parameter density through estimation (16) is selected.

c) The probability vector a'(t+1|t) is determined as least favourable in the sense of Eqs. (8), (9) (with the densities p(k) replaced by p'(k')).

Results

Let us introduce the following notation: φ(t+1|t) stands for a'_0(t+1|t),

ê(t|t-1) = y(t) - P̂^T(t|t-1) z(t)   (17)

denotes the prediction error,

ζ(t|t-1) = z^T(t) C(t|t-1) z(t)   (18)

η(t|t-1) = ê^T(t|t-1) Λ^{-1}(t|t-1) ê(t|t-1)   (19)

are the auxiliary scalars, and

ε(t) = φ(t+1|t) - (1 - φ(t+1|t)) ζ^{-1}(t|t-1)   (20)

is the weighting factor. The following results can be proved after a cumbersome but straightforward computation.

1. The Gauss-Wishart form of the prior density of parameters reproduces through estimation.

2. The sufficient statistic (P̂, C, Λ, ν) evolves recursively as follows:
for ζ(t|t-1) > 0:

P̂(t+1|t) = P̂(t|t-1) + C(t|t-1) z(t) ê^T(t|t-1) / (1 + ζ(t|t-1))   (21a)

C(t+1|t) = C(t|t-1) - C(t|t-1) z(t) z^T(t) C(t|t-1) / (ε^{-1}(t) + ζ(t|t-1))   (22a)

for ζ(t|t-1) = 0:

P̂(t+1|t) = P̂(t|t-1)   (21b)

C(t+1|t) = C(t|t-1)   (22b)

and independently of ζ(t|t-1):

Λ(t+1|t) = φ(t+1|t) [ Λ(t|t-1) + ê(t|t-1) ê^T(t|t-1) / (1 + ζ(t|t-1)) ]   (23)

ν(t+1|t) = φ(t+1|t) [ ν(t|t-1) + 1 ]   (24)

3. The exact form of the equations whose solution gives φ(t+1|t) is too complicated to be presented here. The value of φ(t+1|t) can be computed approximately as

φ^{-1}(t+1|t) = 1 + (1+ρ) [ m_y ln(1+ζ) + ( (ν+1)η / (1+ζ+η) - m_y ) ζ/(1+ζ) ](t|t-1)   (25)

or even, for slowly varying parameters, as

φ^{-1}(t+1|t) = 1 + [ (1+ρ) / (1+ζ^{-1}) ] [ (ν+1)η / (1+ζ+η) - m_y ](t|t-1)   for ζ(t|t-1) > 0   (26)

where the time notation (t|t-1) relates to all statistics in the square brackets.

Discussion

a) The major difference of the new algorithm in comparison with the exponentially forgotten least squares consists in Eq. (22a) (see the geometrical interpretation in Kulhavý and Kárný (1984); cf. practically the same relation in Hägglund (1983)). High numerical reliability of estimation under insufficiently informative data, caused by linear feedback, unsuitable parametrization, input saturation, rare changes of variables etc., is achieved in this way.

b) Eq. (24) for the scalar ν represents a slight modification with respect to the directional forgetting, following from the more precise "definition" of the forgetting.

c) The substantial innovation lies in the recursive computation of the forgetting factor φ(t+1|t) according to Eq. (25) or Eq. (26). To illustrate the contribution of the suggested procedure, a comparison with the results of Fortescue and others (1981) may be useful. Their idea of keeping a constant desired amount of information results (translated to our scheme) in

φ^{-1}(t+1|t) = 1 + (1/N_0) [ (ν+1)η / (1+ζ+η) - m_y ](t|t-1)   (27)

where the factor N_0 ("asymptotic memory length") is to be a priori chosen. Notice that Eq. (26) is formally similar to Eq. (27) with N_0 replaced by (1 + ζ^{-1}(t|t-1)) / (1 + ρ). Thus in our case the "memory length" depends on the uncertainty of the regression coefficients P (the uncertainty of Ω is neglected in the presented approximations of φ(t+1|t)). This implies a higher degree of adaptivity of the algorithm (e.g. a roughly exponential increase of φ(t+1|t) in the initial phase of estimation as well as in re-tuning).
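Collected into code, Eqs. (17)-(24) together with the approximation (25) give a compact recursive routine. The sketch below extends the plain RLS sketch shown earlier, assumes a scalar output (m_y = 1) and tacitly assumes ε(t) > 0; the default value of ρ and all names are illustrative, so this is a sketch of the technique, not a verbatim transcription of the paper's algorithm.

```python
import numpy as np

def ref_step(P_hat, C, Lam, nu, z, y, rho=0.2):
    """One step of recursive least squares with restricted exponential
    forgetting: Eqs. (17)-(24) with the approximation (25), scalar
    output (m_y = 1); eps > 0 is tacitly assumed."""
    e = y - P_hat @ z                            # prediction error (17)
    zeta = z @ C @ z                             # (18)
    eta = e * e / Lam                            # (19) for scalar output
    # Adaptive forgetting factor, approximation (25) with m_y = 1.
    phi = 1.0 / (
        1.0 + (1.0 + rho) * (
            np.log(1.0 + zeta)
            + ((nu + 1.0) * eta / (1.0 + zeta + eta) - 1.0) * zeta / (1.0 + zeta)
        )
    )
    if zeta > 0:
        eps = phi - (1.0 - phi) / zeta           # weighting factor (20)
        P_hat = P_hat + C @ z * e / (1.0 + zeta)               # (21a)
        C = C - np.outer(C @ z, z @ C) / (1.0 / eps + zeta)    # (22a)
    # For zeta == 0, P_hat and C stay unchanged: (21b), (22b).
    Lam = phi * (Lam + e * e / (1.0 + zeta))                   # (23)
    nu = phi * (nu + 1.0)                                      # (24)
    return P_hat, C, Lam, nu, phi
```

For ζ(t|t-1) = 0, i.e. data carrying no information about the regression coefficients, P̂ and C are left intact and only Λ and ν are discounted; this is the practical content of the restriction of forgetting in this model.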
ILLUSTRATIVE EXAMPLE - TRACKING OF VARIABLE OUTPUT LEVEL

A single input single output system was simulated by the regression model (13) with

z^T(t) = [ u(t), y(t-1), u(t-1), ..., y(t-3), u(t-3), v(t) ]

P^T = [ b_0, a_1, b_1, ..., a_3, b_3, 1 ]

where the parameters a_i, b_i corresponded to the discretized transfer function

1 / (1 + Ts)^3,   T = 0.8 T_d

T_d being the sampling period. The disturbance v(t), giving a variable output level, was modelled as a random walk with N(0, 0.0025)-distributed increments. The system was controlled to the zero setpoint by the simple LQ self-tuning controller (see Böhm and others (1984)). Thus, the regression model with the reduced structure

z^T(t) = [ u(t), y(t-1), u(t-1), 1 ]

P^T = [ b_0, a_1, b_1, c ]

was supposed. The penalty ω = 0.01 on the input increments was chosen. It should be emphasized that the disturbance v(t) was not measured.
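A hypothetical re-creation of the estimator loop of this example, using ref_step from the sketch above, might look as follows. The closed control loop is replaced by random excitation and the system coefficients are made up, so only the structure of the experiment is mirrored, not its numbers.

```python
import numpy as np

rng = np.random.default_rng(0)
# Reduced structure z(t) = [u(t), y(t-1), u(t-1), 1].
P_hat, C, Lam, nu = np.zeros(4), np.eye(4), 1.0, 1.0  # P(1|0) = 0, C(1|0) = I, ...
y_prev = u_prev = 0.0
v = 0.0                                   # unmeasured random-walk output level
for t in range(500):
    u = rng.normal()                      # stand-in for the LQ self-tuner
    v += rng.normal(scale=0.05)           # N(0, 0.0025)-distributed increments
    y = 0.5 * u + 0.9 * y_prev + 0.3 * u_prev + v   # made-up coefficients
    z = np.array([u, y_prev, u_prev, 1.0])
    P_hat, C, Lam, nu, phi = ref_step(P_hat, C, Lam, nu, z, y, rho=0.2)
    y_prev, u_prev = y, u
# The absolute term P_hat[-1] tracks the drifting level v(t).
```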
Simulation runs with two versions of the suggested algorithm, using the approximations (25) and (26), as well as with the standard exponentially forgotten least squares using the time-variable factor (27), were realized. Estimation started from P̂(1|0) = 0, C(1|0) = I, Λ(1|0) = 1, ν(1|0) = 1. In comparison with the standard exponential forgetting, the following most important observations were made using the new algorithm (some of them are demonstrated in Fig. 1):

- the average value of the forgetting factor and, consequently, the quality of control were in a wide range (ρ ≤ 5) much less sensitive to the value of the heuristic factor;

- the quality of control achievable by a suitable choice of the heuristic factor was slightly better for the approximation (25) and comparable for the approximation (26);

- the small values of φ(t+1|t) at the first steps of estimation accelerated the initial convergence of parameter estimates;

- the variances of all the estimated regression coefficients, with the exception of the absolute term c, were significantly smaller.

CONCLUDING REMARKS

Several interesting results have been obtained by the careful formulation and solution of the problem of forgetting obsolete information. The generalized exponential forgetting admits prior information about parameter variations to be respected and other ways of suppressing information than a simple flattening of the posterior parameter density to be applied. Thus e.g. the tracking of faster varying parameters is possible using suitable alternative models of the time updating. The suggested procedure of determining the probabilities of particular alternatives (parameter densities) enables a wide range of possible rates of parameter variations to be covered by means of a single new parameter. The intuitive requirement to apply the forgetting only to information modified by the latest data is formulated more generally in comparison with Kulhavý and Kárný (1984). Notice that the restriction of forgetting may have a sense even for a single parameter. Using the minimal sufficient mapping (12), the optimal restriction of forgetting can be easily specified for an arbitrary model of system which defines the density (1).

Applying the restricted exponential forgetting to the linear normal regression model, we have derived a simple but effective modification of the recursive least squares, nearly the same as for the directional forgetting. The main innovation of the algorithm consists in the adaptive adjustment of the forgetting factor. The computation of φ(t+1|t) is designed so that the expected amount of forgotten information in the steady state is proportional, by the a priori chosen factor ρ, to the expected amount of new information contained in data. A rather rough choice of ρ seems to be sufficient because of the little sensitivity of estimation to it; in other words, a rather broad range of different rates of parameter variations is covered by a single value of ρ. It should be emphasized that the noninformative alternative density limits the applicability of the algorithm to estimation of relatively slowly varying parameters. The tracking of faster varying parameters is possible, but only at the cost of a considerable increase of the uncertainty of the estimated parameters.

REFERENCES

Böhm, J., A. Halousková, M. Kárný, and V. Peterka (1984). Simple LQ self-tuning controllers. Proc. 9th IFAC Congress, Budapest.

Csiszár, I. (1967). Information-type measures of difference of probability distributions and indirect observations. Studia Scientiarum Mathematicarum Hungarica, 2, 299-318.

Fortescue, T.R., L.S. Kershenbaum, and B.E. Ydstie (1981). Implementation of self-tuning regulators with variable forgetting factors. Automatica, 17, 831-835.

Hägglund, T. (1983). The problem of forgetting old data in recursive estimation. Proc. IFAC Workshop on Adaptive Systems in Control and Signal Processing, San Francisco.

Kulhavý, R., and M. Kárný (1984). Tracking of slowly varying parameters by directional forgetting. Proc. 9th IFAC Congress, Budapest.

Loève, M. (1960). Probability Theory. Van Nostrand, Princeton.

Peterka, V. (1981). Bayesian approach to system identification. In P. Eykhoff (Ed.), Trends and Progress in System Identification. Pergamon Press, Oxford. Chap. 8, pp. 239-304.

Savage, L.J. (1954). The Foundations of Statistics. Wiley, New York.
[Fig. 1 here. The panels show the cumulative loss Σ_{t=1}^{500} y²(t), the forgetting factor φ(t+1|t) and the disturbance v(t) for REF(25), REF(26) (ρ = 0.2) and EF(27).]

Fig. 1. Restricted exponential forgetting (REF) with variable forgetting factor in adaptive control. Sensitivity of both approximations (25) and (26), as well as of standard exponential forgetting (EF) using Eq. (27) (the technique of Fortescue and others (1981)), with respect to different values of the heuristic factor ρ or 1/N_0 is compared.