Automatica, Vol. 31, No. 5, pp. 755-758, 1995
Pergamon / Elsevier Science Ltd. Printed in Great Britain
0005-1098/95 $9.50 + 0.00
0005-1098(94)00163-4

Brief Paper
Analysis of Linear Methods for Robust Identification in $\ell_1$*

J. R. PARTINGTON† and P. M. MÄKILÄ‡

Key Words: Identification; least-squares; robustness; linear algorithms; error bounds; pseudorandomness.

Abstract: We consider worst-case analysis of system identification by means of linear algorithms such as least squares. We provide estimates for worst-case and average errors, showing that worst-case robust convergence cannot occur in the $\ell_1$ identification problem. The case of periodic inputs is also analysed. Finally, a pseudorandomness assumption is introduced that allows more powerful convergence results in a deterministic framework.
1. Introduction

Consider the time-domain identification experiment

$$y(k) = \sum_{j=0}^{k} h(j)\,u(k-j) + \eta(k), \qquad (1)$$

where $h \in \ell_1$ is an unknown impulse response, i.e.

$$\sum_{k=0}^{\infty} |h(k)| < \infty, \qquad (2)$$

$u$ is a bounded input that one can choose at will, and $\eta$ is a noise sequence. This set-up has been considered for worst-case analysis by many authors (e.g. Mäkilä, 1991; Tse et al., 1991; Jacobson et al., 1992; Partington and Mäkilä, 1992, 1995; Akçay and Khargonekar, 1993; Akçay and Hjalmarsson, 1994; Partington, 1994a) in order to obtain 'hard' error bounds for the purposes of robust control. The basic identification framework that we shall use is that one is given a finite set of corrupted output measurements $y(0), \ldots, y(m-1)$ from which to construct a model $h_m$: often this will be a finite impulse response (FIR) model of order $n$ (depending on $m$). The model $h_m$ should be close to $h$ in the $\ell_1$ norm when $m$, the number of measurements, is large and $\|\eta\|_\infty$, the noise level, is small, that is,

$$\lim_{\varepsilon \to 0}\ \limsup_{m \to \infty}\ \sup_{\|\eta\|_\infty \le \varepsilon} \|h - h_m\|_1 = 0, \qquad (3)$$

a property commonly described as robust convergence. (One may also require uniform convergence over various compact subsets in which the unknown system is assumed to lie.) Algorithms considered in the literature have included those using complexity theory and interpolation, those using linear programming and, simplest of all, the least-squares method. Although it is known that the least-squares method satisfies (3) when $h$ is restricted to FIR models of a fixed maximum order (Akçay and Khargonekar, 1993), it is known to diverge as the order of the model increases (Akçay and Hjalmarsson, 1994), at least when the input is a pseudorandom binary sequence (that is, $u(k) = \pm 1$ for each $k$, randomly and independently chosen). Other inputs have been considered by Milanese (1995).

In Section 2 we analyse the worst-case error of an arbitrary linear identification algorithm and show that robust convergence is not possible: informally, the errors can grow as $c\sqrt{n}$, whatever the input. Therefore for this problem formulation nonlinear identification methods are required, even though they require large amounts of data (Kacewicz and Milanese, 1992; Poolla and Tikku, 1994; Dahleh et al., 1993). A similar result is known to hold for the $H_\infty$ identification problem with frequency response measurements (Partington, 1992). For comparison, we also analyse the average-case errors and show that they too will be large if the number of data points is small. In Section 3 we analyse the worst-case performance of the least-squares method for periodic inputs. A rate-of-divergence result is obtained, and it is shown that a standard statistical error estimate fails completely in a worst-case sense. In Section 4 we consider a notion of pseudorandomness that can be expressed in deterministic terms and show how such an assumption can give more powerful convergence results.
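As a concrete illustration of the experiment (1) and the error criterion (3), here is a small Python sketch (ours, purely illustrative: the geometrically decaying $h$, the PRBS-type input and all dimensions are assumptions, not taken from the paper), which simulates the data and fits an FIR model by least squares.

```python
import numpy as np

rng = np.random.default_rng(0)

m, n = 200, 20                  # measurements and FIR model order (our choices)
h = 0.5 ** np.arange(m)         # a true impulse response in l1 (geometric decay)
u = rng.choice([-1.0, 1.0], size=m)       # PRBS-type input, ||u||_inf <= 1
eta = 0.01 * rng.uniform(-1, 1, size=m)   # bounded noise, ||eta||_inf <= 0.01

# Equation (1): y(k) = sum_{j=0}^{k} h(j) u(k-j) + eta(k)
y = np.array([np.dot(h[:k + 1], u[k::-1]) for k in range(m)]) + eta

# FIR least squares: minimize ||y - U g||_2 with U[k, j] = u(k - j).
U = np.array([[u[k - j] if k >= j else 0.0 for j in range(n)] for k in range(m)])
g, *_ = np.linalg.lstsq(U, y, rcond=None)

# l1 model error, counting the truncated tail of h as part of the error.
err = np.sum(np.abs(g - h[:n])) + np.sum(np.abs(h[n:]))
print(f"l1 error of the FIR least-squares model: {err:.4f}")
```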
* Received 6 September 1993; revised 5 April 1994; received in final form 27 September 1994. The original version of this paper was presented at the 10th IFAC Symposium on System Identification, which was held in Copenhagen, Denmark during 4-6 July 1994. The Published Proceedings of this IFAC Meeting may be ordered from: Elsevier Science Limited, The Boulevard, Langford Lane, Kidlington, Oxford OX5 1GB, U.K. This paper was recommended for publication in revised form by Associate Editor Roberto Tempo under the direction of Editor R. F. Curtain. Corresponding author Dr J. R. Partington. Tel. +44 113 233 5123; Fax +44 113 242 9925; E-mail [email protected].
† School of Mathematics, University of Leeds, Leeds LS2 9JT, U.K.
‡ Division of Automatic Control, Luleå University of Technology, S-971 87 Luleå, Sweden.

2. Divergence results

In this section we consider the behaviour of linear identification algorithms, and formulate the identification set-up in terms of linear operators as follows. Let the operators

$$S: \ell_1 \to \ell_\infty^m, \qquad T: \ell_\infty^m \to \ell_1 \qquad (4)$$
be defined by

$$(Sh)(j) = \sum_{k=0}^{j} h(k)\,u(j-k), \qquad (5)$$

where $u$ is (as usual) a bounded input, say $\|u\|_\infty \le 1$, and $h_m = Ty$ is the approximate model obtained by a linear procedure (e.g. least squares) from $m$ data points. Let $e_0, e_1, \ldots$ be the usual basis in $\ell_1$, with

$$e_k(j) = \begin{cases} 1 & \text{if } j = k, \\ 0 & \text{otherwise}. \end{cases} \qquad (6)$$

We can quantify the worst-case identification error in terms of the performance of $T$ at the vectors $Se_k$.
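In matrix terms, $S$ is the $m \times n$ convolution matrix of the input, and for the least-squares algorithm $T$ can be taken as its pseudoinverse. The sketch below (ours, with hypothetical dimensions) confirms that in the absence of noise $TSe_k = e_k$, i.e. $\lambda_k \approx 0$ in the notation of Theorem 2.1 below.

```python
import numpy as np

def conv_matrix(u, m, n):
    """m x n matrix of the map h -> (Sh)(j) = sum_k h(k) u(j-k), as in (5)."""
    return np.array([[u[j - k] if j >= k else 0.0 for k in range(n)]
                     for j in range(m)])

rng = np.random.default_rng(1)
m, n = 60, 15                       # hypothetical dimensions
u = rng.choice([-1.0, 1.0], size=m)
S = conv_matrix(u, m, n)
T = np.linalg.pinv(S)               # least squares as a linear operator

# With no noise, least squares recovers FIR systems exactly: TS = I_n,
# i.e. lambda_k = ||TS e_k - e_k|| is at machine precision.
print(f"max_k |TS e_k - e_k|: {np.max(np.abs(T @ S - np.eye(n))):.2e}")
```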
Note that one requires $\|TSe_k - e_k\|_1$ to be small if the algorithm is to identify finite-impulse-response systems satisfactorily in the absence of noise. The following result is of a similar nature to the divergence result of Partington (1992) for $H_\infty$ identification; however, here we are concerned with a time-domain experiment rather than a frequency-domain experiment, so the previous result does not apply.

Theorem 2.1. Let $S$ and $T$ be linear operators as above, and suppose that $\|TS(e_k) - e_k\|_1 \le \lambda_k$ for $k = 0, \ldots, n-1$. Then

$$\|T\|^2 \ \ge\ \frac{1}{2} \sum_{k=0}^{n-1} (1 - \lambda_k)^2. \qquad (7)$$

Proof. Consider the dual operator $T^*: \ell_\infty \to \ell_1^m$, defined by $\langle T^*x, y\rangle = \langle x, Ty\rangle$ for $x \in \ell_\infty$ and $y \in \ell_\infty^m$. Then, as is well known, $\|T\| = \|T^*\|$. Let $r_k(t) = \operatorname{sign}[\sin(2^k \pi t)]$ denote the usual Rademacher functions. Since $\|\sum_{k=0}^{n-1} r_k(t)\,e_k\|_\infty = 1$ for all $t$,

$$\|T^*\| \ \ge\ \int_0^1 \Big\| \sum_{k=0}^{n-1} r_k(t)\,T^*e_k \Big\|_1 \, dt, \qquad (8)$$

which, by Khintchine's inequality (Lindenstrauss and Tzafriri, 1979, p. 73), is at least

$$\frac{1}{\sqrt 2} \Big( \sum_{k=0}^{n-1} \|T^*e_k\|_1^2 \Big)^{1/2}. \qquad (9)$$

(That is, $\ell_1$ is a space of cotype 2, which means that norms of Rademacher averages cannot be too small, and a vector-valued Khintchine inequality holds there.) This implies that the operator norm of $T^*$, and hence of $T$, is big. For, since $\|Se_k\|_\infty \le 1$,

$$\|T^*e_k\|_1 \ \ge\ |\langle T^*e_k, Se_k\rangle| = |\langle e_k, TSe_k\rangle| \ \ge\ 1 - \lambda_k. \qquad (10)$$

Hence

$$\|T\|^2 \ \ge\ \frac{1}{2} \sum_{k=0}^{n-1} (1 - \lambda_k)^2. \qquad (11)$$

This completes the proof. □

Thus if $\|TSe_k - e_k\|_1 \le \tfrac12$ for $k = 0, \ldots, n-1$ (for example), then it follows that there is a noise sequence $(\eta(k))_{k=0}^{m-1}$ such that

$$\|T\eta\|_1 \ \ge\ \tfrac{1}{4}\sqrt{2n}\,\|\eta\|_\infty, \qquad (12)$$

so that a noise level of $\varepsilon$ produces a worst-case model error of order $\sqrt n\,\varepsilon$:

$$\sup_{\|\eta\|_\infty \le \varepsilon} \|h - h_m\|_1 \ \ge\ \tfrac{1}{4}\sqrt{2n}\,\varepsilon \quad \text{for } h \in \ell_1, \qquad (13)$$

and the robust convergence condition (3) cannot hold.
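A quick, non-rigorous numerical check of this divergence (our own sketch; the choice $m = 2n$ and the sampling scheme are assumptions): since $s \mapsto \|Ts\|_1$ is convex, the norm $\|T\|_{\ell_\infty^m \to \ell_1}$ is attained at a sign vector $s \in \{\pm 1\}^m$, so sampling random signs gives a lower bound, and the printed ratio against $\sqrt n$ should stay bounded away from zero as $n$ grows. (Exact maximization over all $2^m$ sign vectors would be exponentially expensive, hence the sampling.)

```python
import numpy as np

rng = np.random.default_rng(2)

def linf_to_l1_lower_bound(T, trials=2000):
    """Lower-bound ||T||_{inf->1} = max over sign vectors s of ||T s||_1,
    by random sampling of the vertices s in {-1, +1}^m."""
    best = 0.0
    for _ in range(trials):
        s = rng.choice([-1.0, 1.0], size=T.shape[1])
        best = max(best, np.sum(np.abs(T @ s)))
    return best

for n in [5, 10, 20, 40]:
    m = 2 * n
    u = rng.choice([-1.0, 1.0], size=m)
    S = np.array([[u[j - k] if j >= k else 0.0 for k in range(n)]
                  for j in range(m)])
    T = np.linalg.pinv(S)
    # The ratio below should stay bounded away from zero as n grows.
    print(n, linf_to_l1_lower_bound(T) / np.sqrt(n))
```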
Note that $\sqrt n$ is an achievable order of magnitude for such an algorithm, since the linear $\ell_1$ methods given by Partington (1994b) do give $\ell_1$ errors of order $\sqrt n$ with $m = 2n$.

It is also of interest to consider the average-case errors since, although these are less useful for the purposes of robust control, they do put the worst-case errors into some perspective. In the following result the effect of integrating over $t$ and using the Rademacher functions is that averages are taken over all noise sequences $\pm 1$ with equal weight; the multiple integrals below give the effect of averaging over all noise sequences uniformly and independently distributed in the interval $[-1, 1]$. We do not wish to impose a stochastic structure on our noise, but it is illuminating to compare the worst-case situation with what could be regarded as the 'typical' situation.

Theorem 2.2. Under the hypotheses above,

$$\int_0^1 \Big\| T\Big( \sum_{k=0}^{m-1} r_k(t)\,e_k \Big) \Big\|_1 \, dt \ \ge\ \frac{1}{\sqrt{2m}} \sum_{k=0}^{n-1} (1 - \lambda_k), \qquad (14)$$

and

$$2^{-m} \int_{-1}^{1} \cdots \int_{-1}^{1} \|T(x_0, \ldots, x_{m-1})\|_1 \, dx_0 \cdots dx_{m-1} \ \ge\ \frac{c}{\sqrt m} \sum_{k=0}^{n-1} (1 - \lambda_k), \qquad (15)$$

where $c > 0$ is an absolute constant.

Proof. Let us write $T^*e_k = (a_{k,0}, \ldots, a_{k,m-1})$. We then see that the left-hand side of (14) equals

$$\int_0^1 \sum_{k} \Big| \sum_{j=0}^{m-1} a_{k,j}\, r_j(t) \Big| \, dt, \qquad (16)$$

which, by Khintchine's inequality (Lindenstrauss and Tzafriri, 1979), is at least

$$\frac{1}{\sqrt 2} \sum_{k} \Big( \sum_{j=0}^{m-1} a_{k,j}^2 \Big)^{1/2}. \qquad (17)$$

By the Cauchy-Schwarz inequality,

$$\Big( \sum_{j=0}^{m-1} a_{k,j}^2 \Big)^{1/2} \ \ge\ \frac{1}{\sqrt m} \sum_{j=0}^{m-1} |a_{k,j}| \ =\ \frac{1}{\sqrt m}\, \|T^*e_k\|_1. \qquad (18)$$

But we know that

$$\|T^*e_k\|_1 \ \ge\ 1 - \lambda_k \quad \text{for } k = 0, \ldots, n-1, \qquad (19)$$

so the average norm of $T$ over all $\pm 1$ sequences is of order $n/\sqrt m$. To obtain (15), we perform the same calculation to obtain a lower bound of

$$c \sum_{k} \Big( \sum_{j=0}^{m-1} a_{k,j}^2 \Big)^{1/2}, \qquad (20)$$

which is

$$\ge\ \frac{c}{\sqrt m} \sum_{k=0}^{n-1} (1 - \lambda_k), \qquad (21)$$

again giving order $n/\sqrt m$ as the amount by which errors due to 'average' noise are expanded into errors in the approximate model. This completes the proof. □
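The average-case statement (14) can be probed by Monte Carlo (our sketch, with $m = 2n$ as an assumption): the mean $\ell_1$ amplification of random $\pm 1$ noise by the least-squares operator should be comparable with $n/\sqrt m$.

```python
import numpy as np

rng = np.random.default_rng(3)
n, m = 20, 40                        # our choices, m = 2n
u = rng.choice([-1.0, 1.0], size=m)
S = np.array([[u[j - k] if j >= k else 0.0 for k in range(n)]
              for j in range(m)])
T = np.linalg.pinv(S)                # least-squares operator

# Mean l1 amplification of random +-1 noise, the quantity bounded in (14):
avg = np.mean([np.sum(np.abs(T @ rng.choice([-1.0, 1.0], size=m)))
               for _ in range(5000)])
print(f"average ||T eta||_1 = {avg:.3f};  n/sqrt(m) = {n / np.sqrt(m):.3f}")
```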
This is again achieved with suitable inputs of length $n$ by Partington (1994b). It follows by an elementary application of Chebyshev's inequality that (under the assumption of uniform distribution of noise in an interval $[-\varepsilon, \varepsilon]$) the probability of modelling errors arising comparable to those of the worst case would be of order at most $(n/\sqrt m)/\sqrt n = \sqrt{n/m}$, but this is certainly significant for small values of $m$. We shall see later some ways of improving on the worst-case error bounds by making non-stochastic assumptions on the noise.

3. Periodic inputs

In this section we shall analyse the worst-case $\ell_1$ model error behaviour of the least-squares method for periodic inputs. Note that a widely used input test signal is the periodic PRBS signal (Söderström and Stoica, 1989). Let the input $u$ be a periodic signal with period $n$. Then the response $h * u$ (where $h \in \ell_1$ and $*$ denotes convolution) is also periodic with period $n$. Define the $n \times n$ matrix
$$A = [a_{ij}]_{n \times n} = \begin{pmatrix}
u(0) & u(n-1) & u(n-2) & \cdots & u(1) \\
u(1) & u(0) & u(n-1) & \cdots & u(2) \\
u(2) & u(1) & u(0) & \cdots & u(3) \\
\vdots & \vdots & & \ddots & \vdots \\
u(n-1) & u(n-2) & u(n-3) & \cdots & u(0)
\end{pmatrix}. \qquad (22)$$
$A$ is a circulant matrix (Horn and Johnson, 1988). Thus the maximum column and row sum matrix norms of $A$ are equal, and are given by

$$\|A\|_1 = \max_{1 \le j \le n} \sum_{i=1}^{n} |a_{ij}| = \sum_{k=0}^{n-1} |u(k)|. \qquad (23)$$

Let the input period be such that $A^{-1}$ exists. This implies that $\sum_{k=0}^{n-1} u(k) \ne 0$, i.e. the mean value of $u$ must be non-zero. Note that $A^{-1}$ is a circulant matrix too.

Lemma 3.1. Let $\|u\|_\infty \le 1$ and let $A^{-1}$ exist. Then

$$\|A^{-1}\|_1 \ \ge\ 1. \qquad (24)$$
Proof. Since $A$ is a circulant, we can write $A = \sum_{k=0}^{n-1} u(k) D^k$, where

$$D = \begin{pmatrix}
0 & 0 & \cdots & 0 & 1 \\
1 & 0 & \cdots & 0 & 0 \\
0 & 1 & \cdots & 0 & 0 \\
\vdots & & \ddots & & \vdots \\
0 & 0 & \cdots & 1 & 0
\end{pmatrix} \qquad (25)$$

and $D^0 = D^n = I$ (the identity matrix). Also, since $A^{-1}$ is circulant, we can write $A^{-1} = \sum_{k=0}^{n-1} b_k D^k$ for some numbers $b_k$, $k = 0, 1, \ldots, n-1$. It follows that $b_0 u(0) + \sum_{k=1}^{n-1} b_k u(n-k) = 1$, so that, by the triangle inequality and since $|u(k)| \le 1$, $\|A^{-1}\|_1 = \sum_{k=0}^{n-1} |b_k| \ge 1$. The lower bound 1 is sharp, and in fact is attained by $A = I$. This completes the proof. □

Define $a = \{a(k)\}_{k=0}^{n-1}$ by $a(k) = \sum_{j=0}^{\infty} h(j)\,u(k-j)$. Furthermore, for a sequence $x = \{x(k)\}$ of real numbers, we use the notation $x^{(j)} = \{x((j-1)n + i)\}_{i=0}^{n-1}$. With this notation, (1) can be written as

$$y^{(j)} = a + \eta^{(j)}, \qquad j = 1, 2, \ldots. \qquad (26)$$

Let $M$ output periods be available. The least-squares estimate $\hat a$ of $a$ is then given by the average

$$\hat a = \frac{1}{M} \sum_{j=1}^{M} y^{(j)}. \qquad (27)$$
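The following sketch (ours; the dimensions, the positive random input and the decaying test system are all hypothetical choices) builds the circulant matrix (22) from one input period, checks the bound (24) of Lemma 3.1 numerically, and forms the least-squares estimate (27) by averaging the observed output periods.

```python
import numpy as np

def circulant(u):
    """The n x n matrix of (22): A[i, j] = u((i - j) mod n)."""
    n = len(u)
    return np.array([[u[(i - j) % n] for j in range(n)] for i in range(n)])

rng = np.random.default_rng(4)
n, M, eps = 8, 50, 0.1
u = rng.uniform(0.05, 1.0, size=n)   # ||u||_inf <= 1, non-zero mean
A = circulant(u)
Ainv = np.linalg.inv(A)

# Lemma 3.1: the maximum absolute column sum of A^{-1} is at least 1.
print("||A^-1||_1 =", np.abs(Ainv).sum(axis=0).max())

h = 0.5 ** np.arange(n)              # an FIR system, h(j) = 0 for j >= n
a = A @ h                            # noise-free output period
y = a + eps * rng.uniform(-1, 1, size=(M, n))   # M periods: y^(j) = a + eta^(j)
a_hat = y.mean(axis=0)               # least-squares estimate (27)
print("||h_hat - h||_1 =", np.sum(np.abs(Ainv @ a_hat - h)))
```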
Note that $a = A \sum_{j=1}^{\infty} h^{(j)}$. Let $\hat h \in \ell_1$ denote any least-squares model having $\hat a$ as its model output period, i.e. $\hat a = A \sum_{j=1}^{\infty} \hat h^{(j)}$. Now

$$\|\hat h - h\|_1 \ \ge\ \Big\| \sum_{j=1}^{\infty} (\hat h^{(j)} - h^{(j)}) \Big\|_1 = \|A^{-1}(\hat a - a)\|_1. \qquad (28)$$

Take for example $u(0) = +1$, $u(k) = 0$, $k = 1, 2, \ldots, n-1$ as a period of $u$, i.e. $u$ is the periodic unit impulse input. Then $A = I$, so that, by (28),

$$\sup_{\|\eta\|_\infty \le \varepsilon} \|\hat h - h\|_1 \ \ge\ n\varepsilon. \qquad (29)$$

Consider for the moment FIR systems $h$ satisfying $h(j) = 0$ for $j \ge n$. Let the model have the same FIR structure. Then for the periodic impulse input

$$\sup_{\|\eta\|_\infty \le \varepsilon} \|\hat h - h\|_1 \ =\ n\varepsilon. \qquad (30)$$

We get also the following.

Proposition 3.1. Let $n > 1$ be odd. Let $|u(k)| = 1$ for all $k$, and let $|\sum_{k=0}^{n-1} u(k)| = 1$. Let there be one level change/return to the original level per input period. Let $h(j) = \hat h(j) = 0$ for $j \ge n$. Then

$$\sup_{\|\eta\|_\infty \le \varepsilon} \|\hat h - h\|_1 \ =\ n\varepsilon. \qquad (31)$$

Proof. Denote $n = 2q - 1$, where $q = 2, 3, 4, \ldots$. Without loss of generality, we can take the first row of $A$ to be $[1\ 1\ \cdots\ 1\ {-1}\ {-1}\ \cdots\ {-1}]$, with $q$ consecutive 1s. After some straightforward algebraic manipulations, we find that the first column of $A^{-1}$, $[b_0\ b_1\ \cdots\ b_{n-1}]^{\mathrm T}$, is given by $b_0 = b_{q-1} = \tfrac12$, and $b_i = 0$ otherwise. Thus $\|A^{-1}\|_1 = 1$. Using the obtained explicit expression for $A^{-1}$ ($A^{-1}$ is a circulant matrix) in $\hat h - h = A^{-1}(\hat a - a)$ gives the desired result. □

Now take $u(0) = u(2) = u(3) = \cdots = u(n-1) = 1/(n-1)$ and $u(1) = -(n-2)/(n-1)$ (note that $\|u\|_\infty \le 1$). Then

$$\sup_{\|\eta\|_\infty \le \varepsilon} \|\hat h - h\|_1 \ =\ n(n-1)\varepsilon, \qquad (32)$$

since the first row of $A^{-1}$ is $[1\ 0\ 1\ 1\ \cdots\ 1]$.

Note that if the true system $h \in \ell_1$ is such that $h(k) = 0$ for $k \ge n$, then periodic inputs with period $n$ are not optimal for the least-squares estimation of $h$. In fact, there exist periodic inputs with period length a small multiple of $n$ (= model order) that give worst-case error $c\sqrt n\,\varepsilon$ (cf. Section 2). The point of the above examples is to show that there are statistically plausible designs that result in a much worse performance. Milanese (1995) has determined the worst-case performance of the least-squares method for certain other input designs.

Consider again general model parametrizations. Let $m = Mn$. Denote the average of the absolute values of the noise by

$$\bar\eta = \frac{1}{m} \sum_{k=0}^{m-1} |\eta(k)|. \qquad (33)$$

Theorem 3.1. Let $u$ be any periodic input with period $n$ satisfying $\max_k |u(k)| \le C$ for some $C > 0$. Then for the least-squares estimate $\hat h$,

$$\sup_{\|\eta\|_\infty \le \varepsilon} \|\hat h - h\|_1 \ \ge\ \frac{\varepsilon}{C}, \qquad (34)$$

however many output periods $M$ are available.

Proof. From the definition of the $\|\cdot\|_1$ matrix norm, it follows that

$$\|\hat h - h\|_1 \ \ge\ \frac{\|\hat a - a\|_1}{\|A\|_1}. \qquad (35)$$

The formula (23) for circulant matrices then gives the result, since $\|A\|_1 = \sum_{k=0}^{n-1} |u(k)| \le nC$, while the constant noise sequence $\eta(k) = \varepsilon$ for all $k$ survives the averaging (27) intact and gives $\|\hat a - a\|_1 = n\varepsilon$. □

Note that this worst-case noise has average magnitude $\bar\eta = \varepsilon$, however many output periods are used; thus a standard statistical error estimate, which predicts an error decaying like $1/\sqrt M$, fails completely in a worst-case sense.
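Both input designs above can be checked numerically; the sketch below (ours, with $q = 4$, i.e. $n = 7$, and $\varepsilon = 0.1$ as arbitrary choices) verifies that $\|A^{-1}\|_1 = 1$ for the design of Proposition 3.1 and that the second design yields worst-case error $n(n-1)\varepsilon$.

```python
import numpy as np

def circulant(u):
    n = len(u)
    return np.array([[u[(i - j) % n] for j in range(n)] for i in range(n)])

# Proposition 3.1 design: n = 2q - 1, |u(k)| = 1, one level change per period.
q = 4
n = 2 * q - 1
u1 = np.array([1.0] * q + [-1.0] * (q - 1))
print("||A^-1||_1 =", np.abs(np.linalg.inv(circulant(u1))).sum(axis=0).max())

# Second design: worst-case error n(n-1) eps.
eps = 0.1
u2 = np.full(n, 1.0 / (n - 1))
u2[1] = -(n - 2.0) / (n - 1)
Ainv2 = np.linalg.inv(circulant(u2))
# A^{-1} has nonnegative entries for this design (cf. the text), so the
# worst case over ||a_hat - a||_inf <= eps is attained at the constant vector.
print("worst-case error =", np.sum(np.abs(Ainv2 @ (eps * np.ones(n)))),
      "  n(n-1) eps =", n * (n - 1) * eps)
```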
4. Non-stochastic assumptions on noise

It is sometimes said that the worst-case experimental error/noise is something that by definition cannot occur, if it allows dependence between the input and the error. This criticism misses, however, the following central point. We are always using a model/uncertainty model class of restricted explanatory power (for example, in the case of the analysis of this paper, a certain class of stable linear time-invariant systems). Therefore part of the experimental error, as our mathematical model sees it, can very well be due to unmodelled dynamics (in the case of the present work due to, for example, a small time-varying and/or nonlinear perturbation). However, we do not make the mistake of claiming that randomness properties of the noise/experiment are not important: on the contrary, such properties are most important in applications, and have been utilized widely in probabilistic system identification theory (Ljung, 1987; Söderström and Stoica, 1989).

We should therefore like to derive identification error results for random noise in a worst-case and non-probabilistic setting. This can be accomplished by including 'pseudorandomness' conditions on the noise. Here is one possibility, which can also be interpreted in terms of the cross-correlation between the noise and the input $u$. We begin by using a Euclidean ($\ell_2 \to \ell_2$) setting, only later discussing its application to the ($\ell_\infty \to \ell_1$) setting.

Let $T$ be any linear operator from $\ell_2^m$ to $\ell_2^n$, where $m \ge n$ (and in practical situations we expect $m$ to be much larger than $n$), and suppose that the operator norm (greatest singular value) of $T$ is $\sigma$. Then we can write $\ell_2^m = X \oplus Y$, where $Y$ is the kernel of $T$, $X = Y^\perp$, $\dim X \le n$, and $\dim Y \ge m - n$. Let $P$ be the orthogonal projection from $\ell_2^m$ onto $X$. A 'typical' vector $x$ in $\ell_2^m$ will have $\|Px\|$ approximately equal to $\sqrt{n/m}\,\|x\|$ (at most), and hence

$$\|Tx\|_2 = \|T(Px)\|_2 \ \le\ \sigma\sqrt{n/m}\,\|x\|_2. \qquad (36)$$

This is certainly the case on average when $x$ is chosen from an isotropic probability distribution (that is, one such that the probability density function depends only on $\|x\|$), but it is also a useful general non-stochastic assumption to make, because it can be tested for the purposes of model (in)validation.
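The $\sqrt{n/m}$ phenomenon behind (36) is easy to observe numerically: for a random rank-$n$ operator and isotropic (here Gaussian) test vectors, the ratio $\|Px\|/\|x\|$ concentrates near $\sqrt{n/m}$. A sketch with hypothetical dimensions:

```python
import numpy as np

rng = np.random.default_rng(5)
m, n = 400, 20                       # hypothetical dimensions
T = rng.standard_normal((n, m))      # an arbitrary rank-n operator R^m -> R^n
P = np.linalg.pinv(T) @ T            # orthogonal projector onto (Ker T)^perp

x = rng.standard_normal((m, 2000))   # isotropic (Gaussian) test vectors
ratios = np.linalg.norm(P @ x, axis=0) / np.linalg.norm(x, axis=0)
print(f"mean ||Px||/||x|| = {ratios.mean():.4f};  sqrt(n/m) = {np.sqrt(n / m):.4f}")
```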
We now assume that $u(k) = 0$ for $k < 0$; if in fact the input is nonzero for $-M \le k \le -1$, then the contribution to $y(k)$ from the terms $u(k-j)$, $j > k$, is bounded by $\|u\|_\infty \sum_{j>k} |h(j)|$, which for $h \in \ell_1$ can be regarded as a small extra noise component.

To see an application of the notion of pseudorandom noise to the identification of FIR models, let us consider the least-squares procedure, using an input $u$ of length $N = m + 1 - n$ such that

$$\Big| \sum_{j=0}^{N-1} u(j)\,z^j \Big| \ \ge\ \alpha\sqrt N \quad \text{whenever } |z| = 1, \qquad (37)$$

in order to identify an unknown impulse response $h$ of length $n$ from $m$ measurements of $y = h * u + \eta$. It follows from Theorem 2.2 of Partington and Mäkilä (1995) and Partington (1994b) that the $\ell_2 \to \ell_2$ norm of the least-squares operator is at most $2/(\alpha\sqrt N)$; that is, that the $\ell_2$ error in $h$ is at most $2/(\alpha\sqrt N)$ times the $\ell_2$ norm of $\eta$. Using this, we can now state the following result.

Theorem 4.1. Let

$$y(k) = \sum_{j=0}^{k} h(j)\,u(k-j) + \eta(k) \quad \text{for } k = 0, \ldots, m-1, \qquad (38)$$
where $u = (u(0), \ldots, u(N-1), 0, 0, \ldots)$ satisfies (37) with $N = m + 1 - n$. Let $Ty = g = (g(0), \ldots, g(n-1))$ minimize the least-squares error $\|y - g * u\|_2$. Then
$$\|g - h\|_1 \ \le\ \frac{2}{\alpha} \sqrt{\frac{nm}{m-n+1}}\ \|\eta\|_\infty. \qquad (39)$$

Furthermore, if the noise $\eta$ has the property that

$$\|P\eta\|_2 \ \le\ C\sqrt{n/m}\ \|\eta\|_2, \qquad (40)$$

where $P$ is the orthogonal projection onto the subspace $(\operatorname{Ker} T)^\perp$ (call this $C$-pseudorandomness), then

$$\|g - h\|_1 \ \le\ \frac{2Cn}{\alpha\sqrt{m(m-n+1)}}\ \|\eta\|_2 \ \le\ \frac{2Cn}{\alpha\sqrt{m-n+1}}\ \|\eta\|_\infty. \qquad (41)$$
Proof. This follows immediately from the preceding discussion, noting that for a vector $x \in \mathbb{R}^k$ we always have

$$\|x\|_1 \le \sqrt k\,\|x\|_2, \qquad \|x\|_2 \le \sqrt k\,\|x\|_\infty, \qquad (42)$$
together with the observation that $T$ is a linear operator and $T\eta = T(P\eta)$. □

That is, if the noise satisfies a $C$-pseudorandomness condition, then the least-squares procedure gives $\ell_1$ errors of order $n/\sqrt m$ times the $\ell_\infty$ norm of $\eta$. Note that $C$-pseudorandomness changes dramatically the worst-case time complexity of identification. It is known that for the simple bounded-error noise model the worst-case time complexity of identification of FIR systems with $n$ parameters is exponential in $n$ (Kacewicz and Milanese, 1992; Poolla and Tikku, 1994; Dahleh et al., 1993): for example, to reach a worst-case model error of $2\varepsilon$ ($\|\eta\|_\infty \le \varepsilon$), it is required to make approximately $2^n$ measurements. If the noise in addition satisfies (40) and the input is chosen as in Theorem 4.1, it is seen that to get $\|g - h\|_1 \le K\varepsilon$ just takes at most $m = F(K)n^2$ measurements, where $F(K) = [2C/(K\alpha)]^2$ and $K > 0$. The worst-case time complexity is then not worse than quadratic in $n$.
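To illustrate Theorem 4.1 and the complexity discussion numerically, here is a sketch under our own assumptions: the input is a random $\pm 1$ sequence, for which the constant $\alpha$ in (37) is merely estimated on a frequency grid (and may be small, unlike specially designed sequences), and the values of $C$ and $K$ are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(6)
n, m = 10, 200
N = m + 1 - n
u = rng.choice([-1.0, 1.0], size=N)

# Estimate alpha in (37) by evaluating the input polynomial on a grid of
# the unit circle (a grid estimate only; it may slightly overestimate alpha).
alpha = np.abs(np.fft.fft(u, 8192)).min() / np.sqrt(N)

# Least-squares identification from the m measurements (38).
U = np.array([[u[k - j] if 0 <= k - j < N else 0.0 for j in range(n)]
              for k in range(m)])
h = 0.5 ** np.arange(n)
eta = 0.05 * rng.uniform(-1, 1, size=m)
g, *_ = np.linalg.lstsq(U, U @ h + eta, rcond=None)

bound = (2 / alpha) * np.sqrt(n * m / (m - n + 1)) * np.max(np.abs(eta))
print(f"alpha = {alpha:.3f}; l1 error = {np.sum(np.abs(g - h)):.4f}, bound (39) = {bound:.4f}")

# Sample-complexity comparison: m = F(K) n^2 under C-pseudorandom noise,
# versus roughly 2^n for the plain bounded-noise model (illustrative C, K).
C, K = 1.0, 2.0
F = (2 * C / (K * alpha)) ** 2
for nn in [10, 20, 40]:
    print(nn, f"m = F(K) n^2 = {int(F * nn ** 2)}  vs  2^n = {2 ** nn}")
```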
5. Conclusions

It has been shown that the use of the least-squares algorithm (and indeed any linear algorithm) for $\ell_1$ identification can lead to large modelling errors in the worst case, so that the robust convergence condition fails to hold. For periodic inputs the exact rate of divergence has been determined in some cases. In addition, a non-stochastic way of formulating a pseudorandomness assumption has been devised, without relying on any notion of stationarity or quasi-stationarity. This gives an improvement on non-stochastic bounds for identification errors, and could be combined with a model (in)validation step. It would be of considerable interest to develop a non-stochastic theory of noise in which the noise models are ordered according to the resulting worst-case time complexity of identification.
References

Akçay, H. and H. Hjalmarsson (1994). The least-squares identification of FIR systems subject to worst-case noise. Syst. Control Lett., 23, 329-338.
Akçay, H. and P. P. Khargonekar (1993). The least squares algorithm, parametric system identification, and bounded noise. Automatica, 29, 1535-1540.
Dahleh, M. A., T. Theodosopoulos and J. N. Tsitsiklis (1993). The sample complexity of worst-case identification of F.I.R. linear systems. Syst. Control Lett., 20, 157-166.
Horn, R. A. and C. R. Johnson (1988). Matrix Analysis. Cambridge University Press, Cambridge.
Jacobson, C. A., C. N. Nett and J. R. Partington (1992). Worst case system identification in $\ell_1$: optimal algorithms and error bounds. Syst. Control Lett., 19, 419-424.
Kacewicz, B. and M. Milanese (1992). On the optimal experiment design in the worst-case $\ell_1$ system identification. In Proc. 31st IEEE Conf. on Decision and Control, Tucson, AZ, pp. 296-300.
Lindenstrauss, J. and L. Tzafriri (1979). Classical Banach Spaces II. Springer, Berlin.
Ljung, L. (1987). System Identification. Prentice-Hall, Englewood Cliffs, NJ.
Mäkilä, P. M. (1991). Robust identification and Galois sequences. Int. J. Control, 54, 1189-1200.
Milanese, M. (1995). Properties of least squares estimates in set membership identification. Automatica, 31, 327-332.
Partington, J. R. (1992). Robust identification in $H_\infty$. J. Math. Anal. Applics, 166, 428-441.
Partington, J. R. (1994a). Interpolation in normed spaces from the values of linear functionals. Bull. Lond. Math. Soc., 26, 165-170.
Partington, J. R. (1994b). Worst case identification in $\ell_2$: linear and nonlinear algorithms. Syst. Control Lett., 22, 93-98.
Partington, J. R. and P. M. Mäkilä (1992). Worst-case identification from closed-loop time series. In Proc. 1992 American Control Conf., Chicago, IL, pp. 301-306.
Partington, J. R. and P. M. Mäkilä (1995). Worst case analysis of the least squares method and related identification methods. Syst. Control Lett., 24, 193-200.
Poolla, K. and A. Tikku (1994). On the time complexity of worst-case system identification. IEEE Trans. Autom. Control, AC-39, 944-950.
Söderström, T. and P. Stoica (1989). System Identification. Prentice-Hall, Englewood Cliffs, NJ.
Tse, D. N. C., M. A. Dahleh and J. N. Tsitsiklis (1991). Optimal asymptotic identification under bounded disturbances. In Proc. 30th IEEE Conf. on Decision and Control, Brighton, UK, pp. 623-628.