Copyright © IFAC Identification and System Parameter Estimation 1982, Washington D.C., USA

MARTINGALE THEORY IN BAYESIAN LEAST SQUARES ESTIMATION

J. Sternby* and H. Rootzen**

*Kockumation AB, Box 1044, S-21210 Malmö, Sweden
**Institut for Matematisk Statistik, Københavns Universitet, Denmark

Abstract. In a previous paper, Sternby (1977), the convenience of using martingale theory in the analysis of Bayesian Least Squares estimation was demonstrated. However, certain restrictions had to be imposed on either the feedback structure or on the initial values for the estimation. In the present paper these restrictions are removed, and necessary and sufficient conditions for strong consistency are given for the Gaussian white noise case with no assumptions on closed loop stability or on the feedback structure. In the open loop case the poles are shown to be consistently estimated a.e., and in the closed loop case certain choices of control law are shown to assure consistency. Finally adaptive control laws are treated, and implicit self-tuning regulators are shown to converge to the desired control laws.

Keywords. Identification; least squares; Bayes methods; parameter estimation; adaptive control.

INTRODUCTION

The simplicity of the Least Squares (LS) method has made it widely used for the purpose of parameter identification in dynamical systems. As compared to ordinary regression analysis, some additional problems then arise concerning the asymptotic properties of the estimates. Natural questions to ask are: What happens if there is feedback in the system? What happens if the closed loop system is unstable (which could happen temporarily with adaptive control laws)? What happens if the noise is not white? Some of these questions have been answered in recent years, but the complete solution is not yet found. One difficult problem is e.g. the case with (possibly adaptive) feedback and no stability assumptions for the closed loop system.

In a previous paper, Sternby (1977), the convenience of using Martingale theory for consistency proofs was demonstrated. Using a Bayesian approach, consistency was shown under a certain condition without any explicit assumptions on stability. However, for technical reasons, restrictions had to be imposed on either the feedback structure or the initial values for the estimator. In this paper these restrictions will be removed, and the necessary and sufficient condition for consistency w.p.1 is given for linear discrete time systems with white, Gaussian equation noise.

Furthermore, the consistency condition is interpreted in terms of conditions on the feedback, and some different ways are indicated to guarantee consistency. Adaptive control laws are discussed in the final section, and self-tuning regulators are taken as examples of feedback that allow the true control law to be obtained asymptotically.

THE LEAST SQUARES IDENTIFICATION METHOD

Let the true system be described by

y(t) = φ(t)^T x + e(t)   (1)

with φ(t) a column vector of variables known at time t, i.e. φ(t) is a function of previous values of inputs and outputs. The sequence {e(t)} is some unknown disturbance, and x is a random n-dimensional vector of parameters to be estimated. Normally, x is considered a constant but unknown vector to be estimated. By taking x as a random vector we thus consider a whole class of systems with the structure (1), and picking a certain member of that class is the first part of the identification experiment.
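To make the structure of (1) concrete, here is a minimal simulation sketch (ours, not from the paper; the second-order ARX system and all numerical values are assumed purely for illustration). The regressor φ(t) stacks past outputs and inputs, so it is known at time t:

import numpy as np

rng = np.random.default_rng(0)

# Assumed ARX(2,1) instance of model (1): y(t) = phi(t)^T x + e(t),
# with phi(t) = [-y(t-1), -y(t-2), u(t-1)]^T known at time t.
x_true = np.array([-1.5, 0.7, 1.0])   # illustrative parameters (a1, a2, b1)
T = 200
y = np.zeros(T)
u = rng.normal(size=T)                # open loop: white noise input
e = rng.normal(scale=0.1, size=T)     # Gaussian equation noise {e(t)}

for t in range(2, T):
    phi = np.array([-y[t-1], -y[t-2], u[t-1]])   # past data only
    y[t] = phi @ x_true + e[t]

Closed loop operation would only change how u(t) is generated, e.g. u(t) = f(y(t), ..., y(0)); the structure (1) is unchanged.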


The underlying probability space then supports both the parameter x and the noise sequence {e(t)}, and will be thought of as a product of a space supporting x and a space supporting {e(t)}. Primarily we will be interested in a.e. convergence of estimates, x̂_t → x̂_∞ say, and "a.e." then means almost everywhere with respect to the joint distribution of x and {e(t)}. By standard arguments, using the Fubini theorem, this convergence is equivalent to the existence of a set A ⊂ R^n with P(x ∈ A) = 1, and such that for each x ∈ A, x̂_t converges a.e. with respect to the marginal distribution of {e(t)} given the value of x. Thus, in this context the concept of convergence a.e. means "for almost every realization of almost every system". In particular, nothing can be said about a certain system with given true parameters. For a further discussion on this point, see Sternby (1977). From a pedagogical point of view, this approach has its advantages. To readers familiar with Martingale theory, the results follow from known theorems in a straightforward manner.

The ordinary LS estimate ζ_t of x is obtained by minimizing with respect to ζ

Σ_{k≤t} [y(k) - φ(k)^T ζ]²   (2)

In this paper, however, the parameter vector x is considered a random vector, and the LS estimate x̂_t of x is defined to minimize

E[(x - x̂)^T (x - x̂) | F_t]   (2a)

where F_t is the σ-algebra generated by the a priori information and all measurements up to time t. It is well known (see e.g. Kalman or Jazwinski) that under general circumstances x̂_t = E[x | F_t]. This fact will be used in the sequel, and in later sections conditions will be given for the two LS-estimates x̂_t and ζ_t to be equal.

The single-input, single-output (SISO) and the multivariable cases can be treated simultaneously, as was done in Sternby (1977), but in order to emphasize the basic ideas this paper is written out only for the SISO case.
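To illustrate the two definitions (a sketch under an assumed Gaussian prior; in general the conditional mean has no closed form, as discussed below), both estimates can be written explicitly, and they differ only in the prior term:

import numpy as np

def ls_estimates(Phi, y, zeta0, S0):
    """Phi: t-by-n matrix of regressors phi(k)^T, y: t outputs.
    Returns the ordinary LS estimate minimizing (2) and the conditional
    mean E[x | F_t] for an assumed prior x ~ N(zeta0, R*S0) with noise
    variance R. Note that R cancels in the conditional mean."""
    zeta = np.linalg.lstsq(Phi, y, rcond=None)[0]   # minimizer of (2)
    S0inv = np.linalg.inv(S0)
    St = np.linalg.inv(S0inv + Phi.T @ Phi)         # P_t = R * S_t
    xhat = St @ (S0inv @ zeta0 + Phi.T @ y)         # minimizer of (2a)
    return zeta, xhat

As Phi.T @ Phi grows, the prior term S0inv is dominated and the two estimates approach each other; this is made precise in the Gaussian case below.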

The same notations will be used here as in Sternby (1977):

F_∞   the smallest σ-algebra containing F_t for every t;
1_B : Ω → R   the indicator function for the set B (1_B(ω) = 1 if ω ∈ B, else 1_B(ω) = 0);
N(S)   null space of S.

GENERAL RESULTS

The following theorem was proven in Sternby (1977).

Theorem 1: Suppose that the distribution of the true parameters x has finite second moments. Then x̂_t and P_t converge a.e. □

Results concerning convergence of Bayes estimators have also been considered in the statistical literature, see e.g. Schwartz (1965) and the references therein. The limits are denoted by x̂_∞ and P_∞. It is thus clear from the beginning that the LS-estimate minimizing (2a) always converges. In the next theorem, which is a slight generalization from Sternby (1977), the limit is examined.

Theorem 2: Let a ∈ F_∞ be a random vector with |a| = 1. Then with the assumptions of Theorem 1

a^T x̂_∞ = a^T x   a.e. on {ω | P_∞ a = 0}.

If a is a constant with P(P_∞ a = 0) = 1, then a^T x̂_t → a^T x in L² as well. □

Proof: The proof follows the same route as for Theorem 2 of Sternby (1977). Let M[a] = {ω | P_∞ a = 0}. Since M[a] ∈ F_∞,

E[1_{M[a]} | F_t] → 1_{M[a]}   a.e.

according to Levy's zero-or-one law (Chung, 1968, p 313). Also a ∈ F_∞ implies E[a | F_t] → a a.e. Then a.e.

V_t ≜ E[1_{M[a]} | F_t]² · E[a | F_t]^T P_t E[a | F_t] → 0

with V_t ≥ 0 and

V_t ≤ |E[a | F_t]|² · trace(P_t) ≤ const · E[x^T x | F_t]   a.e.

But the right member is a conditional mean, and thus uniformly integrable by Martingale theory (Chung 1968, theorem 9.4.3). Then V_t is uniformly integrable and converges a.e., and must also converge in L¹ to the same zero limit (Chung 1968, theorem 4.5.4). Moving the other conditional expectations inside P_t we get

E[V_t] = E{E[(E(1_{M[a]} | F_t) · E(a | F_t)^T (x - x̂_t))² | F_t]} = E[(E(1_{M[a]} | F_t) · E(a | F_t)^T (x - x̂_t))²] → 0

so that

E(1_{M[a]} | F_t) · E(a | F_t)^T (x - x̂_t) → 0   in L²

and the last part of the theorem is proven. But all three factors converge a.e. (the last one by theorem 1), and the limit must be the same as in L². Then also

1_{M[a]} · a^T (x - x̂_∞) = 0   a.e. □

For systems picked randomly as our model prescribes, this theorem answers the consistency question as long as the conditional mean is used as our estimate. But also in a practical situation, when we are faced with a particular system, this theorem should be encouraging. If there is no reason to suspect that the true system belongs to a null set (e.g. through overparameterization), then the estimate can be assumed to converge to the true values if the variance tends to zero. It is thus desirable to use the conditional mean as an estimate. But unfortunately this is in general impossible to calculate.
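For instance, with a non-Gaussian prior the posterior mean can only be approximated numerically. A one-dimensional sketch (ours; the uniform prior and all numbers are assumed for illustration):

import numpy as np

rng = np.random.default_rng(3)
x_true = 0.7
phi = rng.normal(size=50)                      # scalar regressors
y = phi * x_true + 0.5 * rng.normal(size=50)   # data from model (1)

# Brute-force posterior on a grid for a uniform prior on [-2, 2]:
grid = np.linspace(-2.0, 2.0, 4001)
resid = y[:, None] - phi[:, None] * grid[None, :]
ll = -0.5 * (resid ** 2).sum(0) / 0.25         # log-likelihood, noise var 0.25
w = np.exp(ll - ll.max())
w /= w.sum()
print((grid * w).sum())                        # approximates E[x | F_t]

Such grid (or Monte Carlo) approximations scale poorly with the dimension of x, which is one reason the Gaussian case below is of special interest.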

MAIN RESULTS - THE GAUSSIAN CASE

In the Gaussian case, theorems 1 and 2 for the conditional mean x̂_t carry over to the estimate ζ_t minimizing (2). This estimate can be calculated recursively according to

ζ_t = ζ_{t-1} + K_t [y(t) - φ(t)^T ζ_{t-1}]   (3)

K_t = S_{t-1} φ(t) [1 + φ(t)^T S_{t-1} φ(t)]^{-1}   (4)

S_t = S_{t-1} - S_{t-1} φ(t) [1 + φ(t)^T S_{t-1} φ(t)]^{-1} φ(t)^T S_{t-1}   (5)

with some suitable initial values. Here it is obvious that S_t will converge, being decreasing and bounded from below. The recursion (3)-(5) can be started as soon as (2) has a unique minimum, i.e. as soon as Σ φ(k)φ(k)^T has become invertible. The initial values could be taken as e.g. the result of minimizing (2). On the other hand it was shown in Åström, Wittenmark (1971) that for the Gaussian case, the conditional mean x̂_t and the conditional variance P_t will also satisfy equations like (3)-(5). Then P_t is scaled with the noise variance, so that x̂_t = ζ_t and P_t = R·S_t, where R is the noise variance. This result requires the initial values to be chosen as the a priori mean and variance of the parameters. Thus the conditional mean and the estimate minimizing (2) will obey the same recursive equations, but with possibly different initial values.
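In code, one step of the recursion (3)-(5) reads as follows (a direct transcription; the variable names are ours):

import numpy as np

def rls_step(zeta, S, phi, y):
    """One step of (3)-(5) for new data phi(t), y(t)."""
    Sphi = S @ phi
    denom = 1.0 + phi @ Sphi                  # scalar 1 + phi^T S phi
    K = Sphi / denom                          # gain, eq. (4)
    zeta_new = zeta + K * (y - phi @ zeta)    # estimate update, eq. (3)
    S_new = S - np.outer(Sphi, Sphi) / denom  # eq. (5)
    return zeta_new, S_new

Note that (5) changes S_t by rank-one downdates only, so S_t is decreasing in the matrix sense, which is the convergence argument used above.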

The following theorem shows that these initial conditions do not affect consistency, and conditions are given for ζ_t to be strongly consistent.

Theorem 3: Let the output of a system be generated by (1), with {e(t)} a sequence of i.i.d. zero mean Gaussian variables. Let the true parameter vector x be independent of {e(t)} and have a probability distribution that is absolutely continuous with respect to the Gaussian distribution. Let a ∈ F_∞ with |a| = 1. Then for some ζ_∞

ζ_t → ζ_∞   a.e.

and

a^T ζ_∞ = a^T x   a.e. on {ω | S_∞ a = 0}. □

Proof: Denote the initial values used for ζ_t and S_t by ζ_0 and S_0. First suppose that the true parameter vector x is Gaussian with mean ζ_0 and covariance R·S_0, where R is the noise variance.

Then according to Åström, Wittenmark (1971) the conditional distribution of x given F_t is Gaussian with mean x̂_t and covariance P_t, where x̂_t and P_t/R satisfy the equations (3)-(5) with initial conditions x̂_0 = ζ_0 and P_0 = R·S_0. Thus x̂_t = ζ_t and P_t = R·S_t, and Theorem 1 then gives ζ_t → ζ_∞ a.e. But

{ω | P_∞ a = 0} = {ω | S_∞ a = 0}

and from Theorem 2

a^T ζ_∞ = a^T x   a.e. on {ω | S_∞ a = 0}.


This proves the theorem for Gaussian parameters. However, the convergence or non-convergence of these sequences is not affected by our assumptions on the distribution of x. Only the actual value of the random vector x will matter. We have thus proven the result for a collection of x-values (and noise sequences) having probability one in the Gaussian measure. It remains to prove that this same collection has measure one also in the correct measure. Note that the set {ω | S_∞ a = 0} remains the same (defined through S_t, not P_t). This last step then follows directly from the assumed absolute continuity of the true measure w.r.t. the Gaussian measure. To see this, consider the two null sets

{ω | ζ_t does not converge}   and   {ω | S_∞ a = 0, a^T ζ_∞ ≠ a^T x}. □
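A small simulation sketch of the content of Theorem 3 (ours; all numbers assumed): two runs of (3)-(5) on the same realization, started from very different initial values (ζ_0, S_0), approach the same limit, which equals the true x when S_t → 0:

import numpy as np

rng = np.random.default_rng(1)
n, T = 3, 5000
x = rng.normal(size=n)                        # true parameters, drawn once
states = [[np.zeros(n), np.eye(n)],           # one choice of (zeta0, S0)
          [10 * np.ones(n), 100 * np.eye(n)]] # a deliberately poor one

for t in range(T):
    phi = rng.normal(size=n)                  # exciting regressor, so S_t -> 0
    y = phi @ x + 0.1 * rng.normal()
    for st in states:
        zeta, S = st
        Sphi = S @ phi
        denom = 1.0 + phi @ Sphi
        st[0] = zeta + Sphi * (y - phi @ zeta) / denom   # (3)-(4)
        st[1] = S - np.outer(Sphi, Sphi) / denom         # (5)

print(np.abs(states[0][0] - states[1][0]).max())   # ~0: initial values wash out
print(np.abs(states[0][0] - x).max())              # ~0: strong consistency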

Remark 1: Note that no assumptions are made in Theorem 3 on stability or feedback. Consistency is only coupled to the condition S_∞ = 0: S_∞ = 0 a.e. implies ζ_∞ = x a.e. Theorem 3 thus replaces and supersedes Theorem 3 and corollaries 1 and 2 of Sternby (1977). The condition S_∞ = 0 will be further examined below.

Remark 2: Using the same technique with absolute continuity of measures, the counterpart to Theorem 5 of Sternby (1977) can also be shown, so that the sets where S_∞ = 0 and ζ_∞ = x can differ only by a null set.

Remark 3: It would be desirable to be able to drop the Gaussian assumption on the noise sequence as well. But this is more involved, since it requires the absolute continuity of a product measure of infinitely many components of absolutely continuous measures.

Remark 4: In Sternby (1977) also L² convergence was shown. This part cannot be extended in the same way, since the distribution of the estimates is involved, and not only null sets. In fact, the second moments need not even exist.

Remark 5: In general it is not possible to say anything about any specific system, since the notion a.e. always excludes parameter sets with zero probability. However, if φ(t) does not depend on the parameters, then consistency a.e. in our product measure implies consistency a.e. for every parameter vector x. We thus have a non-Bayesian consistency result, e.g. for ordinary regression analysis, where φ(t) does not contain old inputs or outputs of a dynamical system. On the other hand, for this case stronger results can be shown by other methods, see e.g. Solo (1981).

The most straightforward application of Theorem 3 is to cases where the whole S_t-matrix tends to zero. Then all components of the estimate will be consistent. The next theorem gives a condition for this to happen.

Theorem 4: With notations as in Theorem 3 and S_0 > 0,

{ω | S_t → 0} = {ω | Σ_s [a^T φ(s)]² divergent for every constant column vector a ≠ 0}. □

Proof: The corresponding proof for P_t in Sternby (1977) goes through without changes, since only the structure of equation (5) is used. □

Using Theorems 3 and 4 we see that there are only two possibilities:

I)  S_t → 0, and then ζ_t → x; or

II) S_t does not tend to 0, and then a^T φ(t) → 0 for some a ≠ 0.

The second case imposes certain restrictions on the control law. For the analysis the following lemma is needed.

Lemma 1: Let {e(t)} be a sequence of independent, zero mean normal random variables with a variance R(t) bounded by 0 < E₁ ≤ R(t) ≤ E₂ for all t. Let {v(t)} be another sequence with e(t) independent of v(k) for every k ≤ t. Then

Σ_t [v(t) + e(t)]²   diverges a.e. □

This result is intuitively clear, and the proof (using e.g. the extended Borel-Cantelli lemma, see Breiman (1968)) is omitted, being a standard exercise in probability theory. Using the lemma, the a in case II above can be further specified. If u(k) is the latest input present in a^T φ, then all components of a corresponding to y(t) for t > k must be zero. This is because y(t) always contains e(t), and Σ [a^T φ(t)]² would otherwise diverge according to the lemma. It is thus clear that either the parameters converge to their true values or the control law will converge to a linear one satisfying a^T φ(t) = 0.

As a special case, if φ(t) does not contain any inputs (pure AR model or all b-parameters known), then II) cannot happen, and consistency is shown.
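The condition of Theorem 4 can be monitored numerically: Σ_s [a^T φ(s)]² diverges for every constant a ≠ 0 exactly when the smallest eigenvalue of Σ_s φ(s)φ(s)^T grows without bound. A sketch (ours):

import numpy as np

def excitation_lambda_min(phis):
    """Smallest eigenvalue of sum_s phi(s)phi(s)^T over the regressors
    seen so far. By Theorem 4, S_t -> 0 requires this to grow without
    bound, i.e. sum_s [a^T phi(s)]^2 divergent for every constant a != 0."""
    M = sum(np.outer(p, p) for p in phis)
    return np.linalg.eigvalsh(M)[0]

Under a fixed linear feedback, e.g. u(t) = k·y(t) with φ(t) = [-y(t-1), -y(t-2), u(t-1)]^T, the constant direction a = [k, 0, 1]^T gives a^T φ(t) ≡ 0, this eigenvalue stays bounded, and we are in case II above.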

For closed loop systems there are several ways to guarantee consistency by avoiding a^T φ → 0:

- use a nonlinear control law;
- use a time-varying control law, e.g. an everlasting shift between two constant and linear control laws;
- use a more complex linear control law than can be described by a^T φ = 0.

These results agree with those of Gustavsson and coworkers (1977) for more general identification schemes. For open loop systems there can be no terms in a^T φ = 0 involving outputs, and the corresponding components of a will be zero. In the next section this is shown to imply that all estimates corresponding to outputs will be consistent. As discussed in Sternby (1977), a condition similar to, but weaker than, persistently exciting input is required to assure consistency also for the input parameters.
ADAPTIVE CONTROL LAWS

In order to handle general control laws, where possibly a^T φ(t) → 0, we need more detailed information on which parameters will converge. The following lemma, concerning the null space of S_∞, is then useful.

Lemma 2: For any realization,

a ⊥ N(S_∞)  ⇒  Σ_{k≤t} [a^T φ(k)]² ≤ a^T S_t^{-1} a < ∞   for every t. □

Proof: The same technique is used as in the appendix of Sternby (1977), proving a similar lemma. Denote the eigenvalues of S_∞ by λ_j. Then (S_t^{-1})_{jj} ≤ 1/λ_j for every λ_j ≠ 0. This is shown as in Sternby (1977) for one element at a time, by setting all eigenvalues to zero except λ_j. But a has no components in N(S_∞), so

a^T S_t^{-1} a ≤ Σ_{λ_j ≠ 0} (1/λ_j) · a^T a < ∞   for every t. □

The next theorem shows what will happen with the LS method for general control laws. Notice that no stability condition is required, and that no control law is specified.

Theorem 5: Under the assumptions of Theorem 3, Σ_t [(ζ_∞ - x)^T φ(t)]² is convergent a.e. □

Proof: It follows from theorem 3 that if a ∈ N(S_∞) a.e. then a^T (ζ_∞ - x) = 0 a.e. Thus ζ_∞ - x ⊥ N(S_∞) a.e. The theorem then follows from lemma 2. □

This result was also shown by Ljung (1974) in a non-Bayesian version, but under the assumption of closed loop stability.

Corollary: With the assumptions of Theorem 3, any adaptive control law based on Least Squares estimation, where the estimates enter the control law only as ζ_t^T φ(t) = 0, converges to the desired control law, i.e. the one obtained with the true parameters x. □

Several self-tuning regulators fall within the category defined in the corollary. The common minimum variance type implicit self-tuners (regulator parameters directly estimated) are of this type, whether the leading b-parameter is estimated or not. Also the implicit pole placement self-tuner fits into this framework, and thus e.g. always gives a stable closed loop system. For a more general adaptive control law, Theorem 3 shows that the parameters will almost always converge. Theorem 5 and Lemma 1 show that, unless ζ_∞ = x, the final control law can be written (ζ_∞ - x)^T φ(t) = 0. This expression for the control law can be compared to its original definition, and it might then be possible in some cases to show convergence of the control law to the desired one. It is however also easy to give examples when this is not possible.
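As a stylized illustration of the corollary (ours; the first-order plant and all constants are assumed), an implicit minimum variance self-tuner chooses u(t) so that the predicted output ζ_t^T φ(t) is zero. The parameter estimates themselves need not be consistent, since φ(t) becomes degenerate in closed loop, but the implied feedback gain converges to the minimum variance gain a/b:

import numpy as np

rng = np.random.default_rng(2)
a, b = 0.9, 1.0                  # assumed plant: y(t+1) = a*y(t) + b*u(t) + e(t+1)
T = 2000
y = np.zeros(T + 1)
zeta = np.array([0.5, 1.0])      # running estimates of (a, b)
S = 100.0 * np.eye(2)

for t in range(T):
    # control from zeta_t^T phi(t) = 0 with phi(t) = [y(t), u(t)]^T:
    u = -zeta[0] * y[t] / max(zeta[1], 0.1)    # crude guard against b-hat near 0
    y[t + 1] = a * y[t] + b * u + 0.3 * rng.normal()
    phi = np.array([y[t], u])
    Sphi = S @ phi
    denom = 1.0 + phi @ Sphi
    zeta = zeta + Sphi * (y[t + 1] - phi @ zeta) / denom   # (3)-(4)
    S = S - np.outer(Sphi, Sphi) / denom                   # (5)

print(zeta[0] / zeta[1], a / b)  # implied gain approaches the desired one

This matches case II of the discussion above: (ζ_∞ - x)^T φ(t) → 0 forces the estimated gain towards a/b along the closed loop regressor directions, even though ζ_∞ ≠ x in general.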

CONCLUSION

It has been shown that in the Gaussian white noise case, consistency for the LS estimate follows if the variance tends to zero. This requires no assumptions on stability or on the control law. Furthermore, a condition is given for the variance to tend to zero, which imposes certain restrictions on the control law. Finally, a general condition is given for the possible convergence points of the estimates. This last condition can in some cases be used to show convergence of adaptive control laws to the desired ones.


The assumption of Gaussian noise is necessary with our method of analysis. However, the results are likely to hold even without this assumption. It is possible that some kind of central limit theorem could be employed to remove it. Using Theorem 5, and maybe extensions thereof, it is possible to examine the asymptotic properties of adaptive control laws. It might be difficult to get general results, but specific algorithms are easily analysed. Implicit versions of self-tuning regulators have been discussed in this paper. In the corresponding explicit self-tuners the parameters of the control law are calculated from a linear transformation of the estimates, and should be possible to handle with Theorem 5. The examination of other adaptive control laws with this tool is left as a topic for future work.

ACKNOWLEDGEMENT

This research was done with partial support for the second author by the Air Force Office of Scientific Research under grant AFOSR-75-2796.

REFERENCES

Åström, K. J., and B. Wittenmark (1971). Problems of identification and control. J. Math. Anal. Appl. 34, 90.

Breiman, L. (1968). Probability. Addison-Wesley.

Chung, K. L. (1968). A Course in Probability Theory. Harcourt, Brace & World Inc., New York.

Gustavsson, I., and coworkers (1977). Identification of processes in closed loop - identifiability and accuracy aspects. Automatica 13, 59.

Ljung, L., and B. Wittenmark (1974). Asymptotic Properties of Self-Tuning Regulators. Report 7404, Dept. of Automatic Control, Lund Institute of Technology, Sweden.

Schwartz, L. (1965). On Bayes Procedures. Z. Wahrscheinlichkeitstheorie 4, 10.

Solo, V. (1981). Strong consistency of Least Squares Estimators in regression with correlated disturbances. Ann. Statist. 9, 689.

Sternby, J. (1977). On Consistency for the Method of Least Squares Using Martingale Theory. IEEE Trans. Aut. Control 22, 346.