INVERSE STATE AND DECORRELATED STATE STOCHASTIC APPROXIMATION

R. Kumar and J. B. Moore

Department of Electrical Engineering, The University of Newcastle, New South Wales 2308, Australia
Abstract. Stochastic approximation algorithms for parameter identification are derived by a sequential optimization and weighted averaging procedure with an instructive geometric interpretation. Known algorithms including standard least squares and suboptimal versions requiring less computational effort are thereby derived. More significantly, novel schemes emerge from the theory which, in the cases studied to date and reported here, converge much more rapidly than their nearest rivals amongst the class of known simple schemes. The novel algorithms are distinguished from the known ones by a different step size selection, and/or by working with a transformed state variable whose components are relatively less correlated, and/or by replacing the state vector in a crucial part of the calculations by its componentwise pseudo-inverse. The convergence rate of the novel schemes, termed inverse state algorithms, is close to that of the more sophisticated optimal least squares recursions. For the case of extended least squares and recursive maximum likelihood schemes, the inverse state stochastic recursion performs, in loose terms, as well as the more sophisticated schemes in the literature. An asymptotic convergence analysis for the algorithms is a minor extension of known theory.

Keywords. Stochastic systems, identification, parameter estimation, fast stochastic approximation algorithms, decorrelating state transformations.

INTRODUCTION

Stochastic approximation (SA) algorithms [1-6] are used for parameter estimation in communication and control applications. A restricted class of these recursions is very simple to implement and will be termed here simple stochastic approximation (SSA) recursions. It is usually assumed that, since the convergence rates of the SSA schemes in the literature are in general poor when compared to those of the asymptotically efficient, but more sophisticated, optimal recursive least squares (RLS) schemes [5, 6], the penalty of simplicity is poor performance. The ideal is of course a simple recursion with good performance. The only justification for introducing novel classes of SSA recursions, as in this paper, is that in a wide range of environments they perform significantly nearer the ideal than the SSA schemes so far studied.

The stochastic approximation recursions in the literature [4] have the following general structure. Denoting θ as the unknown parameter vector and θ̂_k as its estimate at time instant k, consider the recursions
θ̂_k = θ̂_{k-1} + γ_k Q(x_k, z̃_k)        (1)
where z̃_k = z_k - x_k'θ̂_{k-1} is the one step ahead prediction error for the scalar measurement sequence z_k, and x_k is a measurable sequence (a state n-vector) usually consisting of past outputs of the signal generating system (and inputs if these are measurable). To assure convergence of θ̂_k to θ, the step size γ_k is set as some decreasing scalar sequence with the property Σ_k γ_k = ∞, Σ_k γ_k² < ∞, as achieved for example by taking γ_k = 1/k. In practice γ_k need not go to zero, which permits tracking of slowly varying parameters.
The optimal RLS scheme for a signal model z_k = x_k'θ° + v_k, with x_k measurable and v_k a zero mean noise sequence, has γ_k Q(x_k, z̃_k) = P_k x_k z̃_k, where P_k = (Σ_{i=1}^k x_i x_i')^{-1} is assumed to exist [5]. For the case when x_k is not a signal model state but is an estimate of the state of a signal model such as z_k = x_k'θ + v_k, x_{k+1} = F x_k + G w_k driven by some stochastic input w_k, extended least squares (ELS) ideas [7, see RML1] still take γ_k Q(x_k, z̃_k) = P_k x_k z̃_k even though this is no longer optimal. Recursive maximum likelihood (RML) schemes [7, RML2] also take γ_k Q(x_k, z̃_k) = P_k x_k z̃_k, where now x_k is not viewed as a state estimate but rather as the parameter sensitivity of a one step ahead prediction error.
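For concreteness, the following is a minimal sketch (ours, not the paper's code; names and the initialization are illustrative) of one RLS update that the SSA schemes below approximate; the P_k propagation is the standard rank-one (matrix inversion lemma) form.

```python
import numpy as np

def rls_step(theta, P, x, z):
    """One recursive least squares update: theta_k = theta_{k-1} + P_k x_k err_k,
    with P_k = (sum_i x_i x_i')^{-1} propagated by the matrix inversion lemma.
    Computational effort is O(n^2) per step."""
    err = z - x @ theta                          # one step ahead prediction error
    Px = P @ x
    P = P - np.outer(Px, Px) / (1.0 + x @ Px)    # P_k from P_{k-1}
    return theta + P @ x * err, P

# In practice P is initialized as a large multiple of the identity matrix.
```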
For the class of SSA recursions with wide application in the literature [1-6], the expensive calculation of P_k x_k in the RLS, ELS and RML schemes is replaced by the simple calculation γ_k x_k, where γ_k is a scalar with the properties above. In this paper we propose a new class of SSA algorithms in which γ_k Q(x_k, z̃_k) of (1) is γ_k x_k^# z̃_k (effort of order n), where # denotes a componentwise pseudo-inverse*. The algorithm will be referred to as the inverse state stochastic approximation (IS) recursion. Notice that since x_k^# = Q_k x_k, where Q_k is a diagonal matrix with nonnegative elements {[x_k^{(i)}]^{-2}}, the IS recursion can be viewed as replacing P_k x_k in the RLS, ELS and RML schemes by γ_k Q_k x_k for the specific Q_k ≥ 0 above.
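As a concrete sketch (ours, not the authors' code; function names and the tolerance value are illustrative), one IS update consists of the componentwise pseudo-inverse of the footnote below, followed by an O(n) correction:

```python
import numpy as np

def componentwise_pinv(x, eps=1e-6):
    """x^#: reciprocal of each component, set to zero where |x^(i)| <= eps
    so that the implied Q_k = diag{[x^(i)]^{-2}} stays bounded."""
    xp = np.zeros_like(x, dtype=float)
    big = np.abs(x) > eps
    xp[big] = 1.0 / x[big]
    return xp

def is_step(theta, x, z, gamma, eps=1e-6):
    """One inverse state (IS) update: theta_k = theta_{k-1} + gamma_k x_k^# err_k; effort O(n)."""
    err = z - x @ theta
    return theta + gamma * componentwise_pinv(x, eps) * err
```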
Also proposed in this paper is a selection of the step size γ_k which in effect exponentially forgets the data for an initial transient period, together with a rationale for this. The idea of working with an exponential or other weighting factor in a least squares index [8], or of selecting γ_k > 1/k for an initial transient period [7], is certainly not new. However, for SSA schemes, all of which to our knowledge can be improved significantly by exploiting the above ideas, there does not appear to be a good reference on the subject. For example, to our knowledge it has not been stressed that giving equal weighting to all data, as in optimal least squares, should not be attempted in a suboptimal scheme such as an SSA scheme, since the processing of the earlier measurements is inherently less effective than the processing of the most recent measurements. Thus, since an SSA scheme is of necessity suboptimal, it is better to let it approximate an exponentially weighted least squares scheme for an initial transient period than an equally weighted asymptotically efficient least squares scheme.

For SSA schemes in which the components of the state vector are correlated, a modification is proposed which is of more general interest than for the IS scheme alone, but for the IS scheme the improvement is substantial. The modification works with a transformed state vector Lx_k for a selection of L such that the components of Lx_k are less correlated than with L = I, and preferably uncorrelated. We term the resulting algorithms Decorrelated State (DS) algorithms. Such a scheme is distinct from, but has similarities to, that of [9, 10], where the noisy state observations x_k are also preprocessed before use in the standard RLS (or SSA) schemes.
* That is, denoting x^{(i)} as the i-th element of x, the vector x^# has its i-th element equal to [x^{(i)}]^{-1} for x^{(i)} ≠ 0 and zero otherwise. In practice, to ensure that Q_k is bounded above, [x_k^{(i)}]^{-1} is set to zero for |x_k^{(i)}| ≤ ε for some small ε > 0.
In contrast to the preprocessing here, which is memoryless, in [9, 10] the states are prefiltered in a dynamic filter to achieve what are termed instrumental variables. These variables are designed to be highly correlated with the signal model states x_k but uncorrelated with the noise corrupting the measurements of these states. There could be advantages in applying the DS algorithm in conjunction with instrumental variables.

The algorithms we propose, along with other schemes in the literature, can be viewed as achieving a parameter estimate θ̂_k as a weighted average of preliminary estimates θ̄_1, θ̄_2, ..., θ̄_k, where θ̄_k is closest in a (weighted) least squares sense to the previous estimate θ̂_{k-1} yet forces the residual (z_k - x_k'θ̄_k) to zero. An alternative interpretation is that a suitably defined "centre of mass" of the hyperplanes (z_i - x_i'θ) = 0 for i = 1, 2, ..., k is calculated recursively in the θ-space for each k. A study of the desired properties of the weighted averaging filter which determines the "centre of mass" of the hyperplanes leads to a very natural selection of γ_k as

γ_k^{-1} = 1 + K γ_{k-1}^{-1},   initialized by some γ_1 > 0,

where K is the feedback parameter of the weighted averaging filter. To ensure stability of the filter, and considering the case when the feedback parameter is permitted to be time varying, its value, denoted K_k, is restricted as K_k ≤ 1. To achieve exponential data weighting in an initial transient period N, K_k is set at some value less than unity for k ≤ N. To ensure convergence of the parameter estimates via the constraints Σγ_k = ∞, Σγ_k² < ∞, one approach is to set K_k to unity for k ≥ N.
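As an illustration only (the window length N, the constant K and γ_1 are arbitrary values here), the step-size recursion γ_k^{-1} = 1 + K γ_{k-1}^{-1} can be generated as follows; K < 1 gives exponential forgetting over the transient and K = 1 thereafter restores Σγ_k = ∞, Σγ_k² < ∞.

```python
def step_sizes(total, N=200, K=0.99, gamma1=1.0):
    """gamma_k from 1/gamma_k = 1 + K/gamma_{k-1}, gamma_1 given:
    K < 1 for k <= N (exponential forgetting), K = 1 afterwards."""
    gammas, inv = [], 1.0 / gamma1
    for k in range(1, total + 1):
        gammas.append(1.0 / inv)
        inv = 1.0 + (K if k <= N else 1.0) * inv   # recursion for the next step size
    return gammas
```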
The hyperplane interpretation of the algorithm suggested to us specific selections of the weighting matrix of the sequential optimization procedure, and also of the transformation L, such that the hyperplanes at successive time instants are more likely to be orthogonal, thereby increasing the speed of convergence. However, in hindsight, a theory based on maximizing an estimate of the reduction in the norm of the parameter error gives the most direct explanation as to why the algorithms of this paper work much better than the SSA schemes in the literature, being near optimal within this class of schemes.

"CENTRE OF MASS" OF HYPERPLANES IN THE PARAMETER SPACE

Consider the standard problem of estimating the parameter vector θ° ∈ R^n given the scalar measurement sequence {z_1, z_2, ...} generated by
z_k = x_k'θ° + v_k        (2)

where x_k = [x_k^{(1)}, ..., x_k^{(n)}]' is a measurable n-vector which is time varying. The noise sequence {v_k} is a sample sequence from a noise process with E[v_k | F_{k-1}] = 0 and E[v_k² | F_{k-1}] ≤ σ² < ∞ for all k, where F_{k-1} is the σ-algebra generated by v_1, v_2, ..., v_{k-1}. Let us denote an estimate of θ° at time k as θ̂_k and for convenience take θ̂_0 = 0. For the case x_k ≠ 0, let us also denote the hyperplane (z_k - x_k'θ) = 0 in the θ space by H_k. Recursive schemes for achieving estimates θ̂_k from measurements z_1, z_2, ..., z_k are here given a geometric interpretation in the θ parameter space in terms of calculating a "centre of mass" of the hyperplanes H_1, H_2, ..., H_k.
The following lemma summarizes straightforward results concerning projections from a point outside a hyperplane H = {θ : z = x'θ, x ≠ 0} in the θ space onto the hyperplane, so as to minimize a distance measure ||θ̂ - θ_H||²_B over all θ_H ∈ H, for some B = B' > 0. The minimizing θ_H is denoted θ̂_H and is termed the projection of θ̂ onto H. Notice that, without loss of generality, since x ≠ 0, B may be chosen (by scaling) so that x'B^{-1}x = 1. For the case when B is singular, the measure ||θ̂ - θ_H||²_B is no longer a distance measure, but the projection has meaning for us if we require that B is restricted so that BB^#x = x (i.e. x ∈ R[B]) and x'B^#x ≠ 0 (say x'B^#x = 1), where B^# denotes the Moore-Penrose pseudo-inverse. Examples of such B are (i) B = xx', (ii) B a diagonal matrix with i-th element [x^{(i)}]², where x^{(i)} is the i-th element of x, which may be zero.
Lemma 2.1. With B such that BB^#x = x and x'B^#x = 1, then in the notation above, the projection of θ̂ onto H is

θ̂_H = θ̂ + B^#x(z - x'θ̂)        (3)

with the "distance" measure now

||θ̂ - θ̂_H||²_B = (z - x'θ̂)²        (4)

and for all θ_H ∈ H the orthogonality condition

(θ̂_H - θ̂)'B(θ_H - θ̂_H) = 0        (5)

holds.

Proof. One method of proof is to view the projection exercise as a constrained minimization problem with a Hamiltonian

H = ||θ_H - θ̂||²_B + λ(z - x'θ_H).

Setting ∂H/∂θ_H = 2B(θ_H - θ̂) - λx to zero, premultiplying by x'B^#, noting that z = x'θ_H and BB^#x = x, and re-organizing leads to a solution (nonunique if B is singular)

θ_H - θ̂ = B^#x(z - θ̂'x)/(x'B^#x)

from which all the results of the lemma follow.

Orthogonal projection. For the case B = I, (5) tells us that (θ̂ - θ̂_H) is orthogonal to H. For B ≠ I and B > 0, a co-ordinate basis change will achieve the same orthogonality property.

Centre of mass of hyperplanes. For the signal model (2) and some weighting matrices B_k restricted so that

B_k B_k^# x_k = x_k,   x_k'B_k^#x_k = 1        (6)

the weighted "centre of mass" of the hyperplanes {H_1, H_2, ..., H_m} is defined as the θ̂ which minimizes the weighted "moment of inertia"

Σ_{k=1}^m ||θ̂ - θ̂_{H_k}||²_{B̄_k}        (7)

and is denoted θ̂_m^{CM}. The weighting matrices B̄_k = B̄_k' ≥ 0 are not necessarily the same as B_k, and their selection is discussed later. We can conceive of the hyperplanes H_k being replaced in R^n by a unit mass at the projection of θ̂ onto H_k, denoted θ̂_{H_k}, this projection being obtained using the B_k. Then θ̂_m^{CM} is the centre of mass of these unit masses weighted according to the B̄_k. Since the points {θ̂_{H_1}, θ̂_{H_2}, ..., θ̂_{H_m}} are a function of θ̂, we see that in a trial and error search for the "centre of mass", the configuration of unit masses changes with each trial selection of θ̂. As a consequence the "centre of mass" of a set of hyperplanes cannot in general be determined in a simple way from the set of hyperplanes. Moreover, it is not possible in general to calculate θ̂_m^{CM} recursively in terms of θ̂_{m-1}^{CM} and H_m.

Least squares estimate. From (4) it is immediate that when B̄_k = B_k, the "moment of inertia" index above is equivalent to the least squares index Σ_{k=1}^m (z_k - x_k'θ̂)², and thus that the hyperplane "centre of mass" θ̂_m^{CM} is nothing other than the least squares estimate, denoted here θ̂_m^{LS}; that is, for the case when B̄_k = B_k, θ̂_m^{CM} = θ̂_m^{LS}. Now recall [11] that the least squares parameter estimate θ̂_m^{LS}, i.e. the hyperplane centre of mass θ̂_m^{CM}, converges almost surely to the parameter θ° as m → ∞ under the following conditions.

(i) Noise conditions. E[v_k | F_{k-1}] = 0,  E[v_k² | F_{k-1}] ≤ σ².

(ii) Persistently exciting conditions.

P_m = (Σ_{k=1}^m x_k x_k')^{-1} → 0  as  m → ∞.

(iii) Boundedness conditions. The condition number of P_m is bounded above. (This condition may not be necessary, but numerical problems arise otherwise.)
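Before turning to recursive approximations, a small numerical sketch (ours, with made-up values) of the projection (3) of Lemma 2.1 and of the two admissible choices of B listed before the lemma; note that choice (ii) is scaled here by the number m of nonzero components so that x'B^#x = 1 — this is the constant later absorbed into γ_k.

```python
import numpy as np

def project(theta, x, z, B_pinv):
    """Eq. (3): projection of theta onto H = {t : z = x't} in the ||.||_B sense,
    assuming B B^# x = x and x'B^# x = 1."""
    return theta + B_pinv @ x * (z - x @ theta)

x = np.array([2.0, 0.0, -1.0])
# choice (i):  B = x x', with B^# its Moore-Penrose pseudo-inverse
B1_pinv = np.linalg.pinv(np.outer(x, x))
# choice (ii): B = diag{m [x^(i)]^2}, B^# = diag{1/(m [x^(i)]^2)} on the nonzero components
m = np.count_nonzero(x)
d = np.zeros_like(x)
nz = x != 0.0
d[nz] = 1.0 / (m * x[nz] ** 2)
B2_pinv = np.diag(d)

theta_hat = np.zeros(3)
print(project(theta_hat, x, 4.0, B1_pinv))   # both results lie on the hyperplane x'theta = 4
print(project(theta_hat, x, 4.0, B2_pinv))
```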
One disadvantage of calculating the least squares estimate is that in the simple
algorithms, at least, the computational effort is of the order of n². Fast algorithms which reduce the effort to the order of n do exist, but these require longer programs involving more memory than is acceptable for many applications [12]. We now explore a sequential strategy which approximates the hyperplane "centre of mass" at any time instant m, converging to the true "centre of mass" as m → ∞. In this strategy we no longer have B̄_k = B_k. Of particular interest are the schemes of order n in computational effort.

Sequential centre of mass. As we have seen, it is not in general possible to give simple recursive algorithms to update the "centre of mass" of the hyperplanes {H_1, H_2, ..., H_m}. To formulate a simple recursive algorithm for achieving an estimate of this "centre of mass", denoted the sequential centre of mass θ̄_m, it makes sense to replace, for all k ≤ m, the hyperplanes H_k by a unit mass at the projection of θ̄_{k-1} onto H_k, denoted θ̄_{H_k}. Weighting matrices B_k are used for this projection. Thus, in minimizing the index (7) we replace θ̂_{H_k}, which depends on θ̂, by an estimate θ̄_{H_k} which does not depend on θ̂. That is, we select θ̂ to minimize the sequential moment of inertia

Σ_{k=1}^m ||θ̂ - θ̄_{H_k}||²_{B̄_k}

to achieve θ̄_m, termed the sequential centre of mass of the hyperplanes {H_1, H_2, ..., H_m}, which of course is sequence dependent. From (3) and the above definitions we have the following recursive algorithm for calculating the sequential centre of mass of {H_1, H_2, ..., H_m}.

Algorithm.

θ̄_{H_k} = θ̄_{k-1} + B_k^#x_k(z_k - x_k'θ̄_{k-1})        (8)

P_k = (Σ_{i=1}^k B̄_i)^{-1}        (9)

θ̄_k = θ̄_{k-1} + P_kB̄_k(θ̄_{H_k} - θ̄_{k-1})        (10)

Interpretation. At time k we have an estimate θ̄_{k-1} and a new measurement z_k, from which we calculate a preliminary new estimate θ̄_{H_k} which is nearest in some sense to θ̄_{k-1} but which matches the new data. We now back off from matching the new data exactly, so as to reduce the effect of noise, and form a weighted average between the preliminary estimate θ̄_{H_k}, which matches the new data, and the former estimate θ̄_{k-1}.

Specialization to least squares algorithms. Observe that with B̄_k = x_kx_k' and no additional restriction on B_k, then since θ̄_{H_k}'x_k = z_k, the sequential moment of inertia Σ_{k=1}^m ||θ̂ - θ̄_{H_k}||²_{x_kx_k'} about θ̂ is the least squares index. With this B̄_k the sequencing in {H_1, H_2, ..., H_m} is irrelevant, since the least squares index is not dependent on any sequencing in {z_k, x_k}. Some of the results above can be conveniently summarized in a lemma.

Lemma 2.2. Suppose there is given a sequence {z_k, x_k} arising from the signal model (2). Then the least squares parameter estimate θ̂_m^{LS} can be viewed as the "centre of mass" of the hyperplanes {H_1, H_2, ..., H_m}, denoted θ̂_m^{CM} and defined above, with weighting matrices B_k satisfying (6) and B̄_k = B_k. Alternatively, with B_k satisfying (6) and B̄_k = x_kx_k', then θ̂_m^{LS} = θ̄_m, the sequential centre of mass defined above.

Computational effort constraints. To ensure that the computational effort of the above algorithm (8)-(10) is of order n (the state vector dimension), rather than of order n² as in least squares estimation, certain restrictions on B̄_k, B_k must be imposed. Here we consider classes of selections of B̄_k, B_k achieving the simplest possible convergent algorithm. More specific selections are given in later sections. In particular,

(i) B̄_k = b_kI for some nonnegative scalar b_k independent of x_k, and thus P_kB̄_k = γ_kI for some nonnegative scalar γ_k independent of x_k. This is certainly simpler than setting B̄_k = x_k'x_kI, for example, which requires n multiplications.

(ii) B_k is taken as a diagonal matrix satisfying (6).

With the above simplifications the recursive schemes (8)-(10) reduce to the class of SSA schemes

θ̂_k = θ̂_{k-1} + γ_kB_k^#x_k(z_k - x_k'θ̂_{k-1})        (11)

where B_k satisfies (ii) above and γ_k is some fixed scalar sequence independent of x_k.

In some applications, by working with less drastic simplifications on B_k, B̄_k, there could well be a better trade off between algorithm complexity and performance. For example, the following class of selections of B̄_k, B_k has proved useful in the simulations of a later section:

(i) B̄_k is a fixed diagonal matrix sequence of nonnegative elements independent of x_k. Thus P_kB̄_k = diag{γ_k^{(1)}, γ_k^{(2)}, ..., γ_k^{(n)}} = Γ_k ≥ 0, as in [13].

(ii) B_k^# is taken as a matrix L'D_kL satisfying (6), where D_k is a diagonal matrix and L is a lower triangular matrix with elements in the set {0, -1, +1}.

Thus, with the restrictions (i) and (ii), the scheme (8)-(10) becomes

θ̂_k = θ̂_{k-1} + Γ_kL'D_kLx_k(z_k - x_k'θ̂_{k-1})        (12)
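As an illustration (ours, not taken from the paper; names and the tolerance are assumptions), one update of the order-n scheme (12) with Γ_k = γ_kI and B_k^# = L'D_kL, using a first-difference L of the kind proposed later for the DS algorithm and the IS-type choice D_kLx_k = (Lx_k)^# (the normalizing factor m being absorbed into γ_k):

```python
import numpy as np

def difference_L(n):
    """Lower triangular L with unit diagonal and -1 on the subdiagonal:
    (Lx)^(1) = x^(1), (Lx)^(i) = x^(i) - x^(i-1) for i >= 2."""
    L = np.eye(n)
    for i in range(1, n):
        L[i, i - 1] = -1.0
    return L

def ssa_step(theta, x, z, gamma, L, eps=1e-6):
    """theta_k = theta_{k-1} + gamma_k L' D_k (L x_k) err_k with
    D_k (L x_k) = (L x_k)^#  (componentwise pseudo-inverse, small components zeroed)."""
    err = z - x @ theta
    Lx = L @ x
    Lx_pinv = np.zeros_like(Lx)
    big = np.abs(Lx) > eps
    Lx_pinv[big] = 1.0 / Lx[big]
    return theta + gamma * (L.T @ Lx_pinv) * err
```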
Exponential data weighting. Since the algorithm (8)-(10) is, in general, sequence dependent, with earlier estimates θ̄_i (and θ̄_{H_i}) not as good as later estimates, it makes sense to give a greater weighting to the most recent of such estimates. To achieve this, the B̄_k can be selected so that B̄_k increases as k increases. If we consider the four possibilities for B̄_k, viz. B̄_k = I, kI, k²I, K^{-k}I for some constant K ≤ 1, then (9) and (10) give the step sizes as

P_kB̄_k = γ_kI = k^{-1}I,  2(k+1)^{-1}I,  6k(k+1)^{-1}(2k+1)^{-1}I,  (1 + K + K² + ... + K^{k-1})^{-1}I = [(1-K)/(1-K^k)]I

denoted equal, linear, quadratic and exponential weighting respectively. Simulation studies reported in a later section indicate that exponential weighting for an initial transient period is preferable. However, to achieve consistent estimation, we require that Σγ_k = ∞, Σγ_k² < ∞, which means that when exponential weighting is employed for the initial transient with K < 1, then K should be set as K = 1 after such a period.

Strong convergence. So far, a novel interpretation of what is happening for the class of SSA algorithms (11), (12) has been developed, but there is nothing startling about this class of (SSA) schemes. In fact, almost sure convergence conditions for such schemes are given from known techniques, as in [11, 12, 13], to be that, with 0 < σ_0I ≤ L'D_kL ≤ σ_1I for some σ_0, σ_1 and all k,

Σ_{k=1}^∞ min{γ_k^{(1)}, γ_k^{(2)}, ..., γ_k^{(n)}} = ∞,   Σ_{k=1}^∞ ||Γ_k||² < ∞.

The more important issue is the selection of D_k, L and Γ_k so as to achieve a near optimal convergence rate, and the development of some rationale for the selection of these parameters. Let us recall that to date no straightforward optimal method exists for the selection of such parameters.
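A quick numerical check (ours, with arbitrary values of k and K) that the step sizes listed above are simply γ_k = b_k / (b_1 + ... + b_k) for the scalar weights B̄_k = b_kI:

```python
import numpy as np

def gammas_from_weights(b):
    """Step sizes gamma_k = b_k / (b_1 + ... + b_k) induced by B_bar_k = b_k I."""
    b = np.asarray(b, dtype=float)
    return b / np.cumsum(b)

k = np.arange(1, 6, dtype=float)
K = 0.9
print(gammas_from_weights(np.ones_like(k)))   # 1/k                 (equal)
print(gammas_from_weights(k))                 # 2/(k+1)             (linear)
print(gammas_from_weights(k**2))              # 6k/((k+1)(2k+1))    (quadratic)
print(gammas_from_weights(K**(-k)))           # (1-K)/(1-K^k)       (exponential)
```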
Convergence rate and orthogonal hyperplanes. In the remainder of this section we seek insights which may be of use in the selection of Γ_k, L and D_k. In the following section a more direct "optimization" procedure is employed. First observe that with B_k = b_kI for scalar b_k, if the n-vector x_k is such that the hyperplane H_k is orthogonal to the (n-1) previous hyperplanes H_{k-1}, H_{k-2}, ..., H_{k-n+1}, then the convergence rate will be, in the absence of any knowledge of θ°, superior to that for random non-orthogonal selections. This observation is most readily grasped in the two dimensional noise free case with weighting given only to the most recent measurement. Figure 2.1 illustrates this, where H_k = H_{k-2} for all k. With exponential, equal or quadratic weighting, the superiority of the orthogonal hyperplanes is also apparent, as is the fact that exponential weighting gives more rapid convergence than quadratic or equal weighting. In the case when noise
is present the picture is more clouded, but, as is well known, the convergence rate of SSA schemes during an initial period (the crucial period for us) is relatively unaffected by the presence of noise. The above simple picture leads us to conjecture that, in finding the sequential centre of mass of hyperplanes in the parameter space, the more orthogonal the hyperplanes are (after taking into account the weighting), the more rapid the convergence rate to the true "centre of mass". Such a conjecture has its parallel in state optimization methods, where conjugate gradient techniques and variable metric techniques work with orthogonal directions in a transformed space [17].
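The conjecture can be checked numerically; the following toy sketch (ours, with arbitrary numbers) runs the noise-free orthogonal projection of Lemma 2.1 (B = I) alternately onto two orthogonal and onto two nearly parallel hyperplanes in R², and reports the remaining parameter error:

```python
import numpy as np

def final_error(x_pair, theta_true, steps=20):
    """Alternating noise-free projections onto hyperplanes z_k = x_k'theta."""
    theta = np.zeros(2)
    for k in range(steps):
        x = x_pair[k % 2]
        z = x @ theta_true
        theta = theta + x / (x @ x) * (z - x @ theta)   # orthogonal projection onto H_k
    return np.linalg.norm(theta - theta_true)

theta_true = np.array([1.0, -2.0])
orthogonal = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]    # H_k orthogonal to H_{k-1}
oblique    = [np.array([1.0, 0.1]), np.array([1.0, -0.1])]   # nearly parallel hyperplanes
print(final_error(orthogonal, theta_true), final_error(oblique, theta_true))
```

The orthogonal pair converges exactly in two steps, while the nearly parallel pair still carries a substantial error after twenty.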
Selection of L. With Γ_k = γ_kI for scalar γ_k, and L a nonsingular matrix with unity diagonal elements (without loss of generality) and, for computational simplicity, its other elements in the set {-1, 0, +1}, then (12) can be rewritten with L'D_kL = B_k^# satisfying (6) as

(L')^{-1}θ̂_k = (L')^{-1}θ̂_{k-1} + γ_kD_k(Lx_k)z̃_k.

It is consistent with our conjecture in the preceding paragraph to select L so that the hyperplanes H_k in the (L')^{-1}θ space are orthogonal to the hyperplanes H_{k-1}, ..., H_{k-n+1}. As a step towards achieving this, it might be thought that L could be selected to minimize E[x_k'L'Lx_{k-i}] = tr{L E[x_kx_{k-i}']L'} for i = 1, 2, ..., n-1, subject to the earlier constraints on L. However, usually E[x_kx_{k-i}'] is not known a priori; rather we have qualitative information, such as that x_k' = [z_{k-1} ... z_{k-n}], where z_k is the sampling of a continuous time process with z_k highly correlated with z_{k-1}. For this case Lx_k = [z_{k-1} (z_{k-2}-z_{k-1}) ... (z_{k-n}-z_{k-n+1})]' is such that the constraints on L are satisfied and E[x_k'L'Lx_k] << E[x_k'x_k]. Higher order differencing could improve the orthogonality. We explore such a selection of L in the simulations of a later section.

Selection of D_k. For the case when L = I (for simplicity), then since D_k is constrained to be a diagonal matrix with positive elements d_k^{(i)} such that x_k'D_kx_k = 1 and D_kD_k^#x_k = x_k, there are only a limited number of readily calculated candidates. Here we focus on two such candidates, viz. d_k^{(i)} = (x_k'x_k)^{-1} and d_k^{(i)} = m^{-1}[x_k^{(i)}]^{-2}, m being the number of nonzero components of x_k. To demonstrate the superiority of the latter selection, let us work with a two dimensional noise free example in which again H_k = H_{k-2} for all k, as in Figure 2.1(b). Let us also assume that |x_k^{(i)}| = a_i is time invariant. Then in the space [a_1θ^{(1)}, a_2θ^{(2)}] the hyperplanes H_odd and H_even are orthogonal, and from our earlier conjecture we expect improved convergence with the latter candidate. Of course in higher dimensions and with a_i not time invariant the
situation is not so clear. In order to get more precise results than the guidelines based on the geometric considerations of this section, we obtain a more precise theory using a minimization technique in the next section.

MINIMIZING THE ESTIMATION ERROR OF EACH STEP

In the following, the selection of B_k^# = L'D_kL to yield the inverse state (IS) and decorrelated state (DS) versions of the simple stochastic approximation (SSA) algorithms is made so as to maximize the decrease in the estimation error norm ||θ̂_k - θ°|| given θ̂_{k-1}, x_k, z̃_k, γ_k. Thus we attempt to select L'D_kL, subject to the normalization and computational simplicity constraints so far imposed, so as to maximize the reduction (||θ̂_{k-1} - θ°|| - ||θ̂_k - θ°||). Since θ° is of course not known, the best we can achieve is to maximize this expression with θ° replaced by its best available estimate θ̂_{k-1}, viz. maximize ||θ̂_{k-1} - θ̂_k||. Marginally more conveniently, we seek a solution to the following virtually equivalent optimization problem.
Nonconvex optimization problem. With the notation of earlier sections, and given x_k, γ_k and z̃_k, select the time invariant lower triangular matrix L and the diagonal time varying matrix D_k to maximize

||(L')^{-1}(θ̂_{k-1} - θ̂_k)||² = γ_k²||D_kLx_kz̃_k||²

subject to the following further constraints on L and D_k.

(i) The constraints (6), via x_k'L'D_kLx_k = 1, (L'D_kL)(L'D_kL)^#x_k = x_k.

(ii) 0 < σ_0I ≤ L'D_kL ≤ σ_1I for some σ_0, σ_1 and all k, to ensure convergence of the SSA algorithm. To ensure that σ_0 is greater than zero, it is sufficient that ||x_k|| be bounded from above, since L is nonsingular. The upper bound σ_1 is finite since the components of the matrix D_k are bounded by ε^{-2}.

(iii) The computational simplicity constraint that D_k is diagonal and L is a lower triangular (nonsingular) matrix with unity diagonal elements and otherwise elements in the set {+1, -1, 0}.
Optimal selection of D_k. For a fixed L, forming the Hamiltonian, the stationary point of the objective function is obtained at D_k = (x_k'L'Lx_k)^{-1}I. However, this yields a minimum of the objective function rather than a maximum. Representing D_k with diagonal elements d_k^{(i)} for i = 1, 2, ..., n, the constrained maximum, not necessarily unique, lies on the boundary of the feasibility region, as illustrated in the two dimensional sketch of Figure 3.1. Its value is d_k^{(i)} = δ for i ≠ j and d_k^{(j)} = [(Lx_k)^{(j)}]^{-2} - O(δ), where O(δ) is a term of the order of δ and j is selected so that [(Lx_k)^{(j)}]^# ≥ [(Lx_k)^{(i)}]^# for all i, with the pseudo-inverse denoted #, defined as in section 1. The "optimal" SSA algorithm is now, to a close approximation for δ small,
θ̂_k = θ̂_{k-1} + δγ_kL'[ (Lx_k)^{(1)}, ..., (Lx_k)^{(j-1)}, δ^{-1}[(Lx_k)^{(j)}]^#, (Lx_k)^{(j+1)}, ..., (Lx_k)^{(n)} ]'z̃_k        (13)

where [(Lx_k)^{(j)}]^# ≥ [(Lx_k)^{(i)}]^# for all i.
The near optimal Inverse State (IS) Algorithm. Rather than (13), we work with a near "optimal" solution, namely the IS algorithm

θ̂_k = θ̂_{k-1} + γ_kx_k^#z̃_k        (14)

where a constant m is absorbed in γ_k or just ignored, m being the number of nonzero components of x_k. The advantages of the inverse state scheme, apart from a certain aesthetic appeal, are that it is less sensitive to small changes in the elements of x_k when these, or a subset of these, are approximately equal in magnitude, and also that the selection of ε in defining x_k^# is not critical, as it may be in the "optimal" scheme (13).

Selection of L to achieve the Decorrelated State (DS) Algorithm. For a fixed γ_k (perhaps chosen as in the previous section) and D_k selected as for the IS scheme (to keep the theory simple), our task is to select L to maximize the index ||D_kLx_kz̃_k||² = ||(Lx_k)^#z̃_k||², or more simply to maximize the index ||(Lx_k)^#||², or equivalently to minimize the index ||Lx_k||². Since x_k is stochastic, it makes sense to select L to minimize E[x_k'L'Lx_k], or equivalently to select L so as to reduce the correlation of the components of Lx_k with each other, subject of course to the simplicity constraints above and the knowledge available on x_k. Put another way, we seek a lower triangular matrix L with elements in the set {0, -1, 1} to minimize tr[LPL'], where P = E[x_kx_k']. Frequently, as in the case of ARMA identification, P is not known, being a function of the unknown parameter θ°. However, since a fixed L minimizes the index tr[LPL'] for a range of P, an approximation to P could serve just as well. It is preferable to avoid a tedious calculation of L or a periodic update of L, as illustrated in the following important class of problems.
Consider the case x_k = [z_{k-1}, ..., z_{k-n}]', where z_k is highly correlated with z_{k-1}, as when z_k is a sampling at a rate in excess of the Nyquist frequency of a continuous time process. Denoting R_i = E[z_kz_{k-i}], the matrix P = E[x_kx_k'] has entries R_{i-j}. Then selecting

Lx_k = [z_{k-1} (z_{k-2}-z_{k-1}) ... (z_{k-n}-z_{k-n+1})]'

one obtains LPL' = E[(Lx_k)(Lx_k)'], a matrix whose first row is [R_0 (R_1-R_0) ... (R_{n-1}-R_{n-2})] and whose remaining entries are of the form (2R_0-R_1-R_{-1}), ..., (2R_{n-2}-R_{n-1}-R_{n-3}), and thus we can compare the indices

tr P = nR_0;   tr LPL' = tr{P} + (n-1)[R_0 - R_1 - R_{-1}].

When the components of x_k are highly correlated, as when R_1 = R_{-1} is close to R_0, then tr LPL' is close to zero and there is clearly a large percentage reduction in the index. There is also a significant improvement in the convergence rate, as illustrated by simulation results in the next sections. Higher order differencing could give further decorrelation of the elements of Lx_k for some applications, and thus further improvement in the index, but at some computational expense.

Selection of γ_k. The theory of this section tells us to select γ_k as large as possible subject, of course, to the convergence constraints Σγ_k = ∞, Σγ_k² < ∞. This confirms the approach of using exponential data weighting for an initial transient period before switching to γ_k = 1/k as in equal weighting.

CONCLUSION

The inverse state stochastic algorithm with exponential data weighting and decorrelated state elements is a simple stochastic approximation procedure in that its computational effort grows but linearly with the signal model state dimension. In contrast to other such simple schemes, its performance does not appear to be significantly inferior to that of the more sophisticated schemes whose computational effort grows as the square of the state dimension. In particular, when the signal model states are measurable, its convergence rate is close (a factor of one to five) to that for the optimal recursive least squares algorithm, which represents a factor of 10 to 100 or more improvement on the best simple schemes in the literature. The simulations suggest that the convergence of the algorithm is relatively insensitive to the signal to noise ratio encountered. A typical application of the algorithm for the measurable state case is to the identification of an MA channel model; simulation results for this typical digital communication application are presented. As another example to demonstrate the strength of the algorithm compared to recursive least squares, the performance of the IS algorithm when applied to the identification of ARMA models has been evaluated.
In situations where the state of the system is not measurable but an estimate for it can be derived, the IS algorithm fits in with the extended least squares and recursive maximum likelihood approaches, achieving much simpler versions of these algorithms. In the simulations reported here the IS approach performs, in loose terms, as well as these more sophisticated algorithms in the literature, but requires much less computational effort. Certainly it gives far superior performance to the other simple stochastic approximation algorithms in the literature, which are well known to give extremely poor performance for these applications.

REFERENCES

[1] Saridis, G.N., "Stochastic approximation methods for identification and control: a survey", IEEE Trans. on Automatic Control, Vol. AC-19, No. 6, December 1974.
[2] Albert, A. and Gardner, L., Stochastic Approximation and Nonlinear Regression, Cambridge, Mass.: MIT Press, 1967.
[3] Isermann, R., Baur, U., Bamberger, W., Kneppo, P. and Siebert, H., "Comparison of six on-line identification and parameter estimation methods", Automatica, Vol. 10, pp. 81-103.
[4] Ljung, L., "Convergence of recursive stochastic algorithms", Report 7403, Division of Automatic Control, Lund Institute of Technology, February 1974.
[5] Mendel, J., Discrete Techniques of Parameter Estimation, New York: Marcel Dekker, 1973.
[6] Wasan, M.T., Stochastic Approximation, Cambridge, England: Cambridge University Press, 1969.
[7] Soderstrom, T., Ljung, L. and Gustavsson, I., "A comparative study of recursive identification methods", Report 7427, Division of Automatic Control, Lund Institute of Technology, December 1974.
[8] Anderson, B.D.O. and Moore, J.B., Optimal Filtering, New Jersey: Prentice-Hall, 1979.
[9] Young, P.C., "An instrumental variable method for real-time identification of a noisy process", Automatica, Vol. 6, pp. 271-287, 1970.
[10] Young, P.C., "Some observations on instrumental variable methods of time-series analysis", Int. J. Control, Vol. 23, No. 5, pp. 593-612, 1976.
[11] Moore, J.B., "On strong consistency of least squares identification algorithms", Automatica, to appear.
[12] Ljung, L. and Falconer, D., "Fast calculation of gain matrices for recursive estimation schemes", Report LiTH-ISY-I-0141, University of Linkoping, 1977.
[13] Polyak, B.T. and Tsypkin, Ya. Z., "Pseudogradient adaptation and training algorithms", Avtomatika i Telemekhanika, No. 3, pp. 45-68, March 1973.
[14] Dvoretzky, A., "On stochastic approximation", Proc. 3rd Berkeley Symp. Math. Stat. and Probability, Vol. I, pp. 39-55, 1956.
[15] Gladyshev, E., "On stochastic approximation", Theory of Prob. and Appl., Vol. 10, No. 2, pp. 275-278, 1965.
[16] Evans, R.J., Optimal Signal Processing with Constraints, Ph.D. Thesis, Department of Electrical Engineering, University of Newcastle, October 1975.
[17] Luenberger, D.G., Introduction to Linear and Nonlinear Programming, Reading, Mass.: Addison-Wesley, 1973.
[18] Mark, J.W., "A note on modified Kalman filter for channel equalization", Proc. Letters, Proc. of the IEEE, pp. 481-483, April 1973.
Figure 3.1: Feasibility region and constrained maximum (two dimensional sketch in the (d_k^{(1)}, d_k^{(2)}) plane; the stationary point ((x_k'L'Lx_k)^{-1}, (x_k'L'Lx_k)^{-1}) on the constraint curve x_k'L'D_kLx_k = 1 is the constrained minimum, while the constrained maximum lies on the boundary of the feasibility region).
APPENDIX

Two sections of the paper containing simulation results have been omitted because of space limitations. However, sample simulation results for the case of a sixth order moving average model and a second order autoregressive moving average model with coloured noise can be discerned from Figures A1 and A2 respectively. It should be noted that the results shown in Figure A2 for all the stochastic approximation methods, i.e. SA3 and SA4, include the L-transformation and the specific form of exponential weighting discussed in the paper, so as to significantly improve their performance over the more conventional schemes.
Figure A1: Performance improvement for the IS scheme (moving average model). Ensemble statistics over 25 simulation runs, SNR = 4; parameter error (per cent of initial error) versus number of iterations.
Figure 2.1: Faster convergence with orthogonal hyperplanes (two dimensional sketches in the θ space; in (b) the hyperplanes H_odd and H_even are orthogonal).

Figure A2: Simultaneous state and parameter estimation results (second order ARMA model). Ensemble statistics over 25 simulation runs, noise variance 0.5; parameter error versus number of iterations, with the IS scheme shown for K = 0.999 and K = 1.