Copyright © IFAC System Identification, Copenhagen, Denmark, 1994

SECOND LAYER LEAST SQUARES FOR RECURSIVE NON-LINEAR ESTIMATION

M. KÁRNÝ, I. NAGY, A. HALOUSKOVÁ

ÚTIA AV ČR, e-mail: school@utia.cas.cz, POBox 18, 182 08 Prague 8, Czech Republic

Abstract. A Bayesian interpretation is given to a recently designed recursive approximation of Bayesian estimation. The procedure is viewed as an ordinary Bayesian estimation of a linear Gaussian system, not at the original parameter level but at a "second", posterior log-likelihood, level. The interpretation makes it possible to generalize the procedure and opens the possibility of exploiting the rich state of the art (choice of priors, structure estimation, adaptivity, martingale analysis etc.).

Key Words. Approximate recursive Bayesian non-linear estimation; least squares
1. INTRODUCTION

Signal processing, adaptive control, diagnostics etc. rely heavily upon recursive estimation. In spite of substantial progress in the field, efficient non-asymptotic design procedures are lacking, with the exception of a narrow model range admitting a finite-dimensional sufficient statistic. General approaches such as prediction error methods (Peterka, 1981) evaluate point estimates and guarantee just asymptotic properties.

To the best of our knowledge, there exists a single approach (Kulhavý et al., 1992) which respects both the finite-sample and the complete-description properties of the Bayesian set-up (Peterka, 1981). It rests on an exact recursive evolution of equivalence classes of likelihood functions specified by sub-sufficient statistics. An alternative viewpoint on a specific subclass within this framework was presented in (Kárný and Nagy, 1993). Essentially, the inherent recursivity of the linear-in-free-parameters least squares (LS) estimator applied to a linear system has been exploited.

Here, a Bayesian interpretation is given to this procedure. It is viewed as an ordinary Bayesian estimation of a linear-in-parameters Gaussian system, not at the original parameter level but at a "second" (posterior likelihood) level. The interpretation enlightens the criterion used; allows the procedure to be generalized; makes the approach coherent with the Bayesian set-up; and opens the possibility of exploiting the state of the art (choice of priors, structure estimation, adaptivity, martingale analysis, ...). Due to space limitations, no proofs are presented in this paper: the key ones can be found in (Kárný and Nagy, 1993).

2. PRELIMINARIES

2.1. System Model

The data sequence d(t), t = 1, 2, ..., is measured on the system to be modelled. The directly manipulated part of d(t) (possibly empty) is called the system input u(t); the rest is the system output y(t), i.e. d(t) = (u(t), y(t)). The data observed up to the time instant t are denoted by σ(t) = (d(1), ..., d(t)).

The relation of y(t) to σ(t-1), u(t) is modelled by a parametrized system model p(y(t)|σ(t-1), u(t), Θ), where p(a|b) denotes a probability density function (pdf) of a conditioned on b, and Θ is a real m_Θ-dimensional parameter of the system that is supposed to be unknown to the input generator, i.e. (Peterka, 1981)

    p(u(t)|σ(t-1), Θ) = p(u(t)|σ(t-1)).                            (1)

For each Θ, the parametrized model is assumed to be determined by the m_φ-dimensional regressor φ(·),

    p(y(t)|σ(t-1), u(t), Θ) = p(y(t)|φ(t), Θ),                     (2)

with the regressor specified recursively by a known function ψ(·,·):

    φ(t) = ψ(φ(t-1), d(t)).                                        (3)

Hereafter, the output (regressand) and the regressor are often joined into the data vector x'(t) = [y'(t), φ'(t)] (' denotes transposition).

The evaluation of p(y(t)|φ(t), Θ) is assumed to be feasible with the available computer (for all x(t) and Θ).

For later use, the parametrized model is factorized into the product of a data-dependent factor M(·,·) and a data-independent factor s(·), both non-negative:

    p(y|φ, Θ) = M(x, Θ) s(Θ).                                      (4)
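For concreteness, a minimal sketch of the regressor recursion (3) follows (in Python; an illustration added to this text, not part of the original paper). The ARX-type structure, the orders NA, NB and the function psi are all assumptions.

    import numpy as np

    NA, NB = 2, 1                                    # illustrative model orders

    def psi(phi, d):
        """Shift-and-insert regressor update phi(t) = psi(phi(t-1), d(t))."""
        u, y = d                                     # d(t) = (u(t), y(t))
        y_part = np.concatenate([[y], phi[:NA - 1]])          # delayed outputs
        u_part = np.concatenate([[u], phi[NA:NA + NB - 1]])   # delayed inputs
        return np.concatenate([y_part, u_part])

    phi = np.zeros(NA + NB)                          # phi(0)
    phi = psi(phi, (1.0, 0.3))                       # one step of (3)
    x = np.concatenate([[0.3], phi])                 # data vector x'(t) = [y'(t), phi'(t)]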
2.2. Bayesian Estimation

Bayesian estimation uses the observed data for updating the prior pdf p(Θ) - assumed to be computer-tractable as well - to the posterior pdf p(Θ|σ(t)). For (1), (2), (3), the updating is described by the Bayes rule

    p(Θ|σ(t)) ∝ p(y(t)|φ(t), Θ) p(Θ|σ(t-1)),  p(Θ|σ(0)) = p(Θ),   (5)

where ∝ is the proportionality sign.

The proposition below says that this rule can be transformed into a linear recursion for the sensitivity functions

    l(Θ|σ(t)) = ∂/∂Θ ln[p(Θ|σ(t))],  t = 0, 1, ...                 (6)

(the gradient is supposed to exist). The recursion is driven by the increment

    Δ(x(t), Θ) = ∂/∂Θ ln p(y(t)|φ(t), Θ)
               = ∂/∂Θ ln M(x(t), Θ) + ∂/∂Θ ln s(Θ)
               = μ(x(t), Θ) + ζ(Θ),  t = 1, 2, ...                 (7)

where x(t) is the measured data vector.

Proposition 1. [Image of the Bayes rule] Let the sensitivity functions be well defined on a Θ-independent support. Then the Bayes rule transforms into the linear recursion

    l(Θ|σ(t)) = l(Θ|σ(t-1)) + Δ(x(t), Θ) = l(Θ|σ(t-1)) + μ(x(t), Θ) + ζ(Θ).
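To see Proposition 1 at work, the sketch below (an added illustration; the scalar Laplace-noise model, the grid and all constants are assumptions, not the paper's choices) propagates l(Θ|σ(t)) on a grid of Θ-values. It also exposes the difficulty treated next: the whole function, not a statistic of fixed dimension, has to be stored.

    import numpy as np

    theta_grid = np.linspace(-2.0, 2.0, 401)       # grid over the Theta-space
    l = np.zeros_like(theta_grid)                  # l(Theta|sigma(0)) for a flat prior

    def increment(y, phi, theta):
        """Delta(x, Theta) for y = Theta*phi + e, e ~ Laplace(0, 1):
        ln p(y|phi, Theta) = -|y - Theta*phi| + const."""
        return np.sign(y - theta * phi) * phi

    rng = np.random.default_rng(0)
    phi = 0.0
    for t in range(1, 101):
        phi = 0.9 * phi + rng.normal()             # illustrative regressor evolution
        y = 0.5 * phi + rng.laplace()              # data from a "true" Theta = 0.5
        l += increment(y, phi, theta_grid)         # Proposition 1: linear recursion
    # the zero crossing of l + d/dTheta ln p(Theta) marks the posterior mode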
3. ADDRESSED PROBLEM

The recursive estimation of parametrized models without a finite-dimensional sufficient statistic is inspected. This means that the dimension of the sufficient statistic is unbounded whenever the number of data is unbounded. Thus, the true Bayesian estimate p(Θ|σ(t)) is not numerically tractable.

4. SOLUTION

In the addressed problem, the posterior pdf is incompletely known. Within the Bayesian paradigm, it should be described in terms of probability theory, and the Bayes rule should be used as the basic learning principle.

To apply this simple idea, a (hyper)parametrized model has to be specified which relates the observable quantities to the incompletely known one. Moreover, an appropriate prior pdf has to be specified. This is done in the text below.

4.1. Probabilistic Model of l(Θ|σ(t))

The posterior pdf is an infinite-dimensional object, so that a stochastic-process apparatus should be employed. The related technicalities are skipped here. Only the finite-dimensional distributions of the sensitivity-function values Y^{ji}(t) = l^j(Θ^{ji}|σ(t)), i = 1, ..., m_{Yj} < ∞, j = 1, ..., m_Θ, are inspected (with an arbitrary selection of distinct grid points Θ^{ji}).

The notation Y^j(t) = [Y^{j1}(t), ..., Y^{jm_{Yj}}(t)] and h^j = the upper bound on the lengths of the (multivariate) intervals (Θ^{j(i-1)}, Θ^{ji}] will be used hereafter.

Even in the discretized Θ-space, the array Y(t) is incompletely known whenever the total number of grid nodes, Σ_{j=1..m_Θ} m_{Yj}, is too large. Thus, the probabilistic description should be used within the (subjective) Bayesian set-up. The following assumption will be adopted:

Key assumption. The uncertainty on Y^j(t) can be reduced to the uncertainty on unknown real low-dimensional (m_{Θj}) vectors Θ^j(t) which determine the conditional expectation of Y^j(t) as a linear combination of the columns of known (m_{Yj}, m_{Θj})-matrices L^j:

    E[Y^j(t)|σ(t)] = L^j Θ^j(t).

The array Θ(t) = (Θ^1(t), ..., Θ^{m_Θ}(t)) may depend on σ(t). The conditional diagonal covariance Q^j/h^j, with the non-zero entries of Q^j equal to inversions of the prior pdf on the grid, i.e. Q^j_{ii} = 1/p(Θ^{ji}), i = 1, ..., m_{Yj}, is assumed. The remaining conditional moments are defined by assuming conditional normality of Y^j(t) and conditional independence of Y^j(t) and Y^k(t), j ≠ k.

In other words, the uncertainty about the unknown array Y(t) = (Y^1(t), ..., Y^{m_Θ}(t)) is supposed to be described by the pdf

    p(Y(t)|σ(t), Θ(t)) = p(Y(t)|Θ(t)) = Π_{j=1..m_Θ} N_{Y^j(t)}(L^j Θ^j(t), Q^j/h^j)   (8)

where N_a(ā, q) is the normal pdf of a given by the expected value ā and covariance q.

Remarks.
1. It should be stressed that just the lack of a sufficient statistic of a finite, bounded dimension is addressed. The computer feasibility required for the model and the selected prior pdf has to be extended to the matrices L^j, too.
2. The use of the inverted prior pdf as a measure of the dispersion is intuitively plausible: the more a parameter value is a priori expected, the smaller the uncertainty on the sensitivity function should be.

Prior pdf on Θ(t). The parameters Θ^j(t), j = 1, 2, ..., m_Θ, will be assumed a priori independent. Under the key assumption, they remain independent a posteriori; hence the single-parameter case is sufficient and the superscripts j are dropped. A self-reproducing prior p(Θ(t)) on Θ(t) is selected, which is known to be Gaussian,

    p(Θ(t)) = N_{Θ(t)}(Θ̂(t|σ(0)), P(t|σ(0))/h),                   (9)

specified by the prior expectation Θ̂(t|σ(0)) and covariance P(t|σ(0))/h > 0. It should be stressed that both prior moments may depend on t and σ(0) only.

Proposition 2. [Posterior pdf on Θ(t)] For the model (8) and the prior pdf (9), the posterior pdf is

    p(Θ(t)|σ(t)) = N_{Θ(t)}(Θ̂(t|σ(t)), P(t|σ(t))/h)

with

    Θ̂(t|σ(t))     = P(t|σ(t)) V(t|σ(t)),
    P^{-1}(t|σ(t)) = P^{-1}(t|σ(0)) + L'Q^{-1}L h,
    V(t|σ(t))      = L'Q^{-1}Y(t) h + V(t|σ(0)),
    V(t|σ(0))      = P^{-1}(t|σ(0)) Θ̂(t|σ(0)).

A formal limiting process (h → 0) leads to the continuous Θ-space version of Proposition 2, which will be used hereafter.

Proposition 2a. [Posterior pdf on Θ(t)] Let the (m_Y, m_Θ)-matrix L be a discretized version of a sufficiently smooth m_Θ-dimensional vector function L'(Θ). Then the continuous-space limit of the posterior pdf on Θ(t) is Gaussian with the expected value

    E[l(Θ|σ(t))|σ(t)] = L'(Θ) Θ̂(t|σ(t))

and with

    Θ̂(t|σ(t))     = P(t|σ(t)) V(t|σ(t)),
    P^{-1}(t|σ(t)) = P^{-1}(t|σ(0)) + ∫ L(Θ)L'(Θ) p(Θ) dΘ,
    V(t|σ(t))      = ∫ L(Θ) l(Θ|σ(t)) p(Θ) dΘ + V(t|σ(0)),
    V(t|σ(0))      = P^{-1}(t|σ(0)) Θ̂(t|σ(0)).

Inherent recursivity is the most important property of this estimate:

Proposition 3. [Recursion for the posterior pdf] The statistics determining the posterior pdf p(Θ(t)|σ(t)) can be evaluated recursively:

    Θ̂(t|σ(t))     = P(t|σ(t)) V(t|σ(t)),
    P^{-1}(t|σ(t)) = P^{-1}(t-1|σ(t-1)) + P^{-1}(t|σ(0)) - P^{-1}(t-1|σ(0)),
    V(t|σ(t))      = V(t-1|σ(t-1)) + ∫ L(Θ) Δ(x(t), Θ) p(Θ) dΘ + V(t|σ(0)) - V(t-1|σ(0)).
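A numerical reading of Proposition 3 is sketched below (an added illustration under assumptions: scalar Θ, the Laplace toy model of the previous sketch, Monte Carlo averages over prior samples in place of the Θ-integrals, and time-invariant prior moments, for which the precision update is data-independent).

    import numpy as np

    rng = np.random.default_rng(1)
    prior_s = rng.normal(0.0, 1.0, 5000)             # samples of Theta ~ p(Theta)

    def mu(y, phi, theta):                           # data-dependent part of (7)
        return np.sign(y - theta * phi) * phi

    nodes = [(-1.0, -1.0), (0.0, 1.0), (1.0, 1.0)]   # assumed fixed nodes x_i = (y_i, phi_i)
    L_s = np.stack([mu(y, p, prior_s) for (y, p) in nodes])   # L(Theta) at the samples

    G = (L_s @ L_s.T) / prior_s.size                 # int L(Theta) L'(Theta) p(Theta) dTheta
    P_inv = np.eye(len(nodes)) + G                   # P^-1(t|sigma(t)), constant in t here
    V = np.zeros(len(nodes))

    def step(V, y, phi):
        """One step of Proposition 3; returns updated V and Theta-hat(t|sigma(t))."""
        V = V + (L_s @ mu(y, phi, prior_s)) / prior_s.size    # + int L Delta p dTheta
        return V, np.linalg.solve(P_inv, V)

    V, theta_hat = step(V, 0.4, 1.0)                 # update with a measured x(t)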
4.2. On the (Hyper)Parametrization

The quality of the results depends heavily on the quality of the model chosen. The following proposition, a version of an observation made in (Kulhavý et al., 1992), motivates the choice of the expected value in the (hyper)parametrized model:

Proposition 4. [l(Θ|σ(t)) in terms of p(x|σ(t))] Let p(x|σ(t)) denote the formal sample pdf on the x-space. Then, under the conditions of Proposition 1,

    l(Θ|σ(t)) = t [∫ μ(x, Θ) p(x|σ(t)) dx + ζ(t, Θ)]               (10)

where ζ(t, Θ) = ζ(Θ) + (∂/∂Θ) ln[p(Θ)]/t.

The formula (10) suggests a suitable form of the expectation. The integration of μ(x, Θ) weighted by p(x|σ(t)) is the trouble-making part of the evaluation of l(Θ|·). For growing t, the pdf p(x|σ(t)) can be expected to be close to some quite complex pdf p(x|σ(∞)). Integrals of the discussed type are mostly evaluated as a weighted sum of integrand values taken at a finite selection of arguments x_i(σ(t)), i = 1, 2, ..., m̄. Thus,

    Σ_{i=1..m̄} μ(x_i(σ(t)), Θ) Θ_i(σ(t))                           (11)

is a suitable candidate for the expected value of l(·|·). If the nodes are fixed a priori at x_i(σ(t)) = x_i, then the i-th entry of L(·) equals μ(x_i, Θ) and the weights Θ_i(·) become the only free parameters.

The prior pdf p(Θ) and the parametrized model p(y|φ, Θ) define the prior predictor

    p(y|φ) = ∫ p(y|φ, Θ) p(Θ) dΘ.                                  (12)

Thus, for a given input generator, it is possible to generate probable data vectors x in order to select the nodes. With the recommended choice of L, the integrations needed when evaluating V(t|σ(t)) and P^{-1}(t|σ(t)), cf. Proposition 3, reduce to evaluations of

    C(x, x̄) = ∫ μ(x, Θ) μ'(x̄, Θ) p(Θ) dΘ                          (13)

for nodes and measured data x, x̄ in the data space, and of

    f(x) = ∫ μ(x, Θ) ζ(Θ) p(Θ) dΘ.                                 (14)
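The node selection via (12) and the integrals (13), (14) admit a direct Monte Carlo treatment, sketched below for the earlier Laplace toy model (an added illustration with assumed constants; the paper instead reduces C analytically for its chosen prior).

    import numpy as np

    rng = np.random.default_rng(3)
    prior_s = rng.normal(0.0, 1.0, 5000)             # Theta ~ p(Theta)

    def mu(y, phi, theta):                           # data-dependent part of (7)
        return np.sign(y - theta * phi) * phi

    # nodes drawn via the prior predictor (12): sample Theta, then x = (y, phi)
    thetas = rng.choice(prior_s, size=3)
    phis = rng.normal(size=3)
    nodes = [(th * ph + rng.laplace(), ph) for th, ph in zip(thetas, phis)]

    def C(x, x_bar):
        """Monte Carlo version of (13)."""
        (y, p), (yb, pb) = x, x_bar
        return np.mean(mu(y, p, prior_s) * mu(yb, pb, prior_s))

    C_grid = np.array([[C(a, b) for b in nodes] for a in nodes])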
4.3. On the (Hyper)Prior Pdf

Proposition 4 indicates that at least some (hyper)parameters should grow proportionally to time. The approximate nature of the estimation implies that the prior uncertainty cannot fall below some level. This reasoning motivates the following options defining the (hyper)prior pdf (9): the prior moments are chosen so that

    P^{-1}(t|σ(t)) = K(t) P̄^{-1},  V(t|σ(0)) = K(t) V̄,
    V̄ = P̄^{-1} Θ̂(0|σ(0)),  K(t) = K(t-1) + 1,  K(0) > 0,          (15)

with

    P̄ = (P^{-1} + ∫ L(Θ)L'(Θ) p(Θ) dΘ)^{-1},  P > 0.              (16)

With these normalized statistics, in particular

    P(t|σ(t)) = P̄/K(t),

the recursion given in Proposition 3 can be viewed as a stochastic approximation (SA):

Proposition 5. [SA form of p(Θ(t)|σ(t))] Under (15), the statistics determining the posterior pdf p(Θ(t)|σ(t)) can be evaluated recursively as follows:

    Θ̂(t|σ(t)) = Θ̂(t-1|σ(t-1))
               + (1/K(t)) [P̄ (∫ L(Θ) Δ(x(t), Θ) p(Θ) dΘ + V̄) - Θ̂(t-1|σ(t-1))].   (17)

This form shows that the vast amount of theoretical results available for SA can be applied directly. A superior finite-time behaviour is expected, as the prediction error reflects global model properties.

Adaptivity of the estimator can be reached by preventing K(t) from growing to infinity, e.g. by the classical forgetting K(t) = λK(t-1) + 1, 0 < λ < 1.
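One SA step (17), including the classical forgetting just mentioned, can be written compactly as follows (an added sketch; P̄, V̄ and the vector of integrals c_t follow the earlier illustrations and are assumptions).

    import numpy as np

    def sa_step(theta, K, c_t, P_bar, V_bar, lam=1.0):
        """One step of (17); lam < 1 realizes the forgetting K(t) = lam*K(t-1) + 1."""
        K = lam * K + 1.0                            # K(t)
        e = P_bar @ (c_t + V_bar) - theta            # prediction-error-like correction
        return theta + e / K, K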
5. ALGORITHMIC SUMMARY

1. Select a class of parametrized system models p(y(t)|σ(t-1), u(t), Θ): positive on a Θ-independent support; with a finite-dimensional, recursively feasible regressor (2), (3); and with the partial derivatives (∂/∂Θ) p(y|z, Θ) for x' = [y, z'].
2. Decompose the increment Δ(·,·) (7) of the sensitivity function l(·|·) (6) into the data-dependent and data-independent parts (4): Δ(x, Θ) = μ(x, Θ) + ζ(Θ).
3. Select the prior pdf p(Θ) (5) which reflects the belief in various Θ-values and guarantees the finiteness of C(x, x̄) (13) and f(x) (14) for any x, x̄ from the x-space.
4. Select "representative" fixed points x_1, ..., x_m̄ in the x-space. The selection should be based on the prior prediction (12); it defines L_i(Θ) = μ(x_i, Θ).
5. Initialize the statistics K(0) > 0, V̄ and P > 0 characterizing the prior pdf on the (hyper)parametric space (15), (17); set t = 0.
6. Evaluate

    P̄ = (P^{-1} + ∫ L(Θ)L'(Θ) p(Θ) dΘ)^{-1},

i.e. pre-compute P(t|σ(t)), see (16), (17). This step finishes the off-line phase; the rest concerns the on-line phase.
7. Increase formally the time t, measure the new data x(t) and evaluate the vector c(x(t)) with the i-th entry equal to C(x_i, x(t)).
8. Update the (hyper)parameter estimate (17):

    Θ̂(t|σ(t)) = Θ̂(t-1|σ(t-1)) + e(t)/K(t),
    e(t) = P̄ (c(x(t)) + V̄) - Θ̂(t-1|σ(t-1)).

9. Exploit the results according to the intended application. For instance, determine the approximate posterior pdf

    p(Θ|σ(t)) ∝ p(Θ) s^t(Θ) Π_{i=1..m̄} [M(x_i, Θ)]^{K(t) Θ̂_i(t|σ(t))},

i.e. transform the gained estimate K(t) L'(Θ) Θ̂(t|σ(t)) ≈ l(Θ|σ(t)) back to the corresponding pdf.
10. Return to step 7.
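The off-line/on-line split of the summary may be condensed as follows (an added sketch reusing mu, the nodes and the Monte Carlo prior samples of the earlier illustrations; data_stream and every concrete choice are assumptions).

    import numpy as np

    rng = np.random.default_rng(2)
    prior_s = rng.normal(0.0, 1.0, 5000)             # steps 1-3: model, mu, prior (assumed)

    def mu(y, phi, theta):
        return np.sign(y - theta * phi) * phi

    nodes = [(-1.0, -1.0), (0.0, 1.0), (1.0, 1.0)]   # step 4: representative points
    L_s = np.stack([mu(y, p, prior_s) for (y, p) in nodes])

    K, V_bar = 1.0, np.zeros(len(nodes))             # step 5: K(0) > 0, flat upper prior
    P_bar = np.linalg.inv(np.eye(len(nodes)) +
                          (L_s @ L_s.T) / prior_s.size)   # step 6: (16), off-line phase

    theta_hat = np.zeros(len(nodes))
    data_stream = [(0.4, 1.0), (-0.2, 0.5)]          # placeholder measurements x(t)
    for (y, phi) in data_stream:                     # step 7: measure x(t)
        c_t = (L_s @ mu(y, phi, prior_s)) / prior_s.size  # c(x(t)), entries C(x_i, x(t))
        K += 1.0                                     # K(t) = K(t-1) + 1
        e = P_bar @ (c_t + V_bar) - theta_hat        # step 8: update (17)
        theta_hat = theta_hat + e / K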
6. EXAMPLE

The example illustrates the algorithm on a practically important problem. The explanations follow the steps of the above algorithm.

1. A single-output system is assumed to be described by the ARX model

    y(t) = K'φ(t) + ε(t)

where K are unknown regression coefficients (a real m_φ-vector). The m_φ-dimensional real regression vector φ(t) is generated according to (3); the noise ε(t) is white non-Gaussian noise with the pdf

    p(ε(t)|σ(t-1), u(t), Θ) = f(ε(t)/√r, r).

The function f(·, r) is a zero-mean pdf parametrized by an r > 0 characterizing the pdf width. The unknown vector Θ = (K', r)' parametrizes the system model

    p(y(t)|σ(t-1), u(t), Θ) = f(X(K, r, x(t)), r)                  (18)

where the normalized innovation X is

    X(x(t), Θ) = X(x(t), K, r) = (y(t) - K'φ(t))/√r.

The noise pdf is assumed to have
a1: the support of f(·, r) independent of r;
a2: the finite logarithmic derivatives

    G(ξ, r) = f'_ξ(ξ, r)/f(ξ, r),  H(ξ, r) = f'_r(ξ, r)/f(ξ, r).

Then it is straightforward to verify:

Proposition 6. [Sensitivity functions of the non-Gaussian ARX model] Supposing (1) and a1, a2, the assumptions of Proposition 1 are met. With X ≡ X(x(t), K, r), the increment (7) is

    Δ(x, Θ) = [ -r^{-1/2} G(X, r) φ              ]
              [ -(X/(2r)) G(X, r) + H(X, r)      ].
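As a concrete instance of Proposition 6 (an added illustration; the paper leaves f(·, r) general), take a Student-t shape with fixed degrees of freedom, for which G is explicit; the toy choice H = 0 ignores the explicit r-dependence of the normalization and is an assumption.

    import numpy as np

    NU = 5.0                                         # assumed degrees of freedom

    def G(xi, r):                                    # d ln f / d xi for a t_NU shape
        return -(NU + 1.0) * xi / (NU + xi**2)

    def H(xi, r):                                    # d ln f / d r, toy choice
        return 0.0

    def delta(y, phi, K_coef, r):
        """Increment of Proposition 6 for Theta = (K', r)'."""
        X = (y - K_coef @ phi) / np.sqrt(r)          # normalized innovation
        dK = -G(X, r) * phi / np.sqrt(r)             # gradient w.r.t. K
        dr = -X / (2.0 * r) * G(X, r) + H(X, r)      # gradient w.r.t. r
        return np.concatenate([dK, [dr]])

    d = delta(0.7, np.array([1.0, -0.5]), np.array([0.4, 0.1]), r=1.0)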
2. The trivial decomposition (4) with s(Θ) = 1 is assumed here, i.e. μ(x, Θ) = Δ(x, Θ).
3. The Gauss-inverse-Wishart prior pdf p(Θ) = p(K, r) is selected because of its tight connection with RLS, its flexibility and its well-known properties.
4. L(K, r) becomes a functional array because the multi-parameter case is dealt with:

    L_i^j(K, r) = μ_j(x^i, K, r)

where x^i is an appropriate grid in the [y, φ']'-space, j = 1, ..., m_Θ = m_φ + 1, i = 1, ..., m̄.
5. Generally accessible prior information has been built into p(Θ) and into the selection of the x-grid; thus it is reasonable to take a flat prior at the upper level, i.e. to select V̄ = 0. The value P should result from an analysis of the achievable accuracy of the approximation.
6-8. The evaluation of P̄ needs the computation of the integral C(x, x̄) (13) for x, x̄ belonging to the selected grid. Updating the (hyper)parameter estimate requires the evaluation of c(x(t)), i.e. of C(x(t), x^i): the integral (13) has to be evaluated on-line. For this reason, it is of extreme importance that the analytical evaluation of C(x, x̄) reduces, for the chosen prior p(Θ), the need for numerical integration to a three-dimensional space only (irrespective of m_φ) (Kárný and Nagy, 1993).
9. The proposed estimate p(Θ|σ(t)) is a weighted geometric mean of the models at the pre-selected data vectors. The upper-level estimation provides the respective weights (a toy reconstruction is sketched below).
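The following sketch of step 9 rebuilds the approximate posterior as a weighted geometric mean of node models on a Θ-grid (an added illustration for a scalar Θ; the nodes, the weights and the Laplace model are placeholders consistent with the earlier sketches, not results of the paper).

    import numpy as np

    theta_grid = np.linspace(-2.0, 2.0, 401)
    log_p = -0.5 * theta_grid**2                     # ln p(Theta), assumed N(0, 1) prior

    def log_M(y, phi, theta):                        # ln M(x, Theta) with s(Theta) = 1
        return -np.abs(y - theta * phi)              # Laplace toy model

    nodes = [(-1.0, -1.0), (0.0, 1.0), (1.0, 1.0)]   # grid in the x-space (assumed)
    K, theta_hat = 50.0, np.array([0.1, 0.3, 0.2])   # placeholder upper-level results

    for (y, phi), w in zip(nodes, K * theta_hat):    # [M(x_i, Theta)]^(K(t) Theta_i)
        log_p += w * log_M(y, phi, theta_grid)

    post = np.exp(log_p - log_p.max())
    post /= np.trapz(post, theta_grid)               # approximate posterior on the grid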
7. CONCLUSIONS

The paper presents a Bayesian view of a formerly designed (Kárný and Nagy, 1993), recursively feasible approximation of non-Gaussian/non-linear estimation. This interpretation makes it possible to generalize the procedure a bit and, first of all, to employ the full apparatus of Bayesian estimation (a rational choice of priors, estimation of structural parameters, a unified view of adaptivity etc.).

A lot has been done in the addressed area, but Kulhavý's equivalence approach (Kulhavý et al., 1992) seems to be the most advanced systematic treatment of recursive approximate estimation with guaranteed finite-time properties. This paper can be viewed as a special version of that theory with a different interpretation of the derived results. In this way, the universal approximation properties of Gaussian sums can be exploited in addition to the mentioned Bayesian calculus (they should help in describing the attainable precision given by P̄). Also, the rich results related to stochastic approximations can be employed. Better finite-time properties are conjectured for the proposed procedure, as the point estimation at the upper level becomes a global approximation at the basic estimation level (this property makes it radically different from approaches like extended Kalman filtering).

The theoretical potential released by this paper is rich, but almost no steps are made here in this respect. The practical contribution is more immediate: by mixing analytical and numerical integration, the Bayesian estimation of an ARX model with a general noise can be made practically solvable. The same mixture of analytical and numerical integration can be applied to "neuron"-type models: it can be found that the result applies to any scalar product of data and unknown coefficients entering a known non-linearity while corrupted by white Gaussian noise.
8. REFERENCES

Kárný, M. and I. Nagy (1993). Recursive least squares approximation of Bayesian non-Gaussian/non-linear estimation. In: Mutual Impact of Computing Power and Control Theory (M. Kárný and K. Warwick, Eds.). pp. 124-134. Plenum Press, New York, London.

Kulhavý, R., I. Nagy and J. Spousta (1992). Towards real-time implementation of Bayesian parameter estimation. In: Preprints of the 4th IFAC Symposium Adaptive Control and Signal Processing ACASP'92. Vol. 2. pp. 125-247. Academic Press, Grenoble.

Peterka, V. (1981). Bayesian system identification. In: Trends and Progress in System Identification (P. Eykhoff, Ed.). pp. 239-304. Pergamon Press, Oxford.
Acknowledgment. This work has been partially supported by GA ČR, Grant No. 102/93/0228, and by GA AV ČR, Grants No. 275109 and 275110.