Robust maximum-likelihood estimation of multivariable dynamic systems


Automatica 41 (2005) 1667 – 1682 www.elsevier.com/locate/automatica

Robust maximum-likelihood estimation of multivariable dynamic systems☆

Stuart Gibson^a, Brett Ninness^b,∗

^a Lehman Brothers, London, UK
^b School of Electrical Engineering and Computer Science, The University of Newcastle, Australia

Received 22 March 2004; received in revised form 20 March 2005; accepted 16 May 2005

Abstract

This paper examines the problem of estimating linear time-invariant state-space system models. In particular, it addresses the parametrization and numerical robustness concerns that arise in the multivariable case. These difficulties are well recognised in the literature, resulting (for example) in extensive study of subspace-based techniques, as well as recent interest in 'data driven' local co-ordinate approaches to gradient search solutions. The paper here proposes a different strategy that employs the expectation–maximisation (EM) technique. The consequence is an algorithm that is iterative, with associated likelihood values that are locally convergent to stationary points of the (Gaussian) likelihood function. Furthermore, theoretical and empirical evidence presented here establishes additional attractive properties such as numerical robustness, avoidance of difficult parametrization choices, the ability to naturally and easily estimate non-zero initial conditions, and moderate computational cost. Moreover, since the methods here are maximum-likelihood based, they have associated known and asymptotically optimal statistical properties. © 2005 Elsevier Ltd. All rights reserved.

Keywords: Parameter estimation; System identification

1. Introduction

A fundamental and widely-applicable approach to the problem of obtaining parametric models from observed data involves adopting a statistical framework and then selecting, as the estimated model, the one that maximises the likelihood of the observed data. Schemes guided by this principle are known as maximum likelihood (ML) methods and, due to the fact that they have been studied for almost a century, they benefit from a very large and sophisticated body of supporting theory (Ljung, 1999; Söderström & Stoica, 1989). This

☆ This work was supported by the Australian Research Council. This work was not presented at any IFAC meeting. This paper was recommended for publication in revised form by Associate Editor Michel Verhaegen under the direction of Editor Torsten Söderström. ∗ Corresponding author. Fax: +61 2 49 69 93. E-mail addresses: [email protected] (S. Gibson), [email protected] (B. Ninness).

0005-1098/$ - see front matter © 2005 Elsevier Ltd. All rights reserved. doi:10.1016/j.automatica.2005.05.008

theoretical underpinning allows, for example, important practical issues such as error analysis and performance trade-offs to be addressed. Moreover, it provides a rationale for using such methods since it is generally (but not universally) true that ML estimators are asymptotically optimal, in that they asymptotically (in observed data length) achieve the Cramér–Rao lower bound (Ljung, 1999; Söderström & Stoica, 1989).

However, despite their theoretical advantages, the practical deployment of ML methods is not always straightforward. This is largely due to the non-convex optimisation problems that are often implied. Since these cannot be solved in closed form, they are typically attacked via a gradient-based-search strategy such as Newton's method or one of its derivatives (Dennis & Schnabel, 1983). The ultimate success of such approaches depends on the associated cost function's curvature with respect to the model parameters. This, in turn, is dependent on the chosen model parametrization, and the selection of this can be difficult, particularly


in the multivariable case, where the cost contours resulting from common canonical state-space parametrizations imply poor conditioning (McKelvey et al., 2004). Fully parametrized state-space models (i.e. those in which all elements of all matrices are unconstrained) provide an obvious alternative to canonical structures, since they provide a very general, compact and simple framework for representing finite-dimensional multivariable systems. Furthermore, it has recently been established that fully parametrized models can be coupled with ML and prediction error criteria via algorithms that identify, at each iteration of a gradient-based optimisation, a search subspace of minimal dimension (Bergboer et al., 2002; Ljung, 2004; McKelvey et al., 2004).

Inspired by these issues, this paper explores a different approach to the problem of finding ML estimates of fully parametrized state-space models from multivariable observations. More specifically, the work here employs the expectation–maximisation (EM) algorithm as a means of computing ML estimates. The EM algorithm enjoys wide popularity in a broad variety of fields of applied statistics. For example, areas as disparate as signal processing and dairy science routinely use the method (Borran & Aazhang, 2002; Shumway & Stoffer, 2000). However, despite this acceptance and success in other fields, it could be argued that in systems and control settings, the EM algorithm is not as well understood and utilised as it may deserve to be. This paper seeks to make the contribution of establishing, via both theoretical and empirical evidence, that EM algorithm-based techniques are a highly competitive alternative for solving multivariable control-relevant ML estimation problems. It is important to note that there have been previous works using the EM algorithm in control-related problems.
In (Isaksson, 1993) the problem of single-input, single-output (SISO) ARX model estimation on the basis of censored data sets was considered, and this was further addressed in (Goodwin & Feuer, 1999). Furthermore, the work (Shumway, 1982), appearing in the statistics literature, considered the control-relevant issue of time series estimation via an EM algorithm-based solution, with respect to a particular subclass of ARMA time series models. By way of contrast, this paper deals with a different set of estimation problems by including the possibility of exogenous inputs, by progressing beyond polynomial SISO ARX structures to MIMO state-space models, and by allowing full ARMA noise model estimation. Furthermore, an essential development of this paper is the derivation of a numerically robust implementation. This is of key importance since the poor numerical properties of any naïve EM implementation limit the feasible state and input–output dimensions of any estimated system to such a degree as to severely curtail practical utility. On the other hand, with the robust EM implementation derived here, the method proves to be exceptionally dependable, and capable of handling quite high dimensions of state, input number and output number.

Additionally, this paper makes further contributions by providing a self-contained introduction to the EM approach and its underlying principles, as well as then profiling, both theoretically and empirically, the performance of the specific EM-based techniques developed here for linear time-invariant (LTI) MIMO estimation.

2. The expectation–maximisation (EM) algorithm

This introductory section is intended as a concise overview of the fundamentals of the EM algorithm. The first essential point is that the EM algorithm is designed to compute the ML estimate $\hat{\theta}_{ML}$ of a parameter vector $\theta$ on the basis of an observed data set $Y$, for which the likelihood of this data is written as $p_\theta(Y)$. That is, the EM method is an algorithm to find

$$\hat{\theta}_{ML} \in \{\theta \in \Theta : p_\theta(Y) \ge p_\beta(Y) \;\; \forall \beta \in \Theta\}. \tag{1}$$

Here $p_\theta(\cdot)$ is a probability density function that is parametrized by a vector $\theta \in \mathbf{R}^d$, while $\Theta \subset \mathbf{R}^d$ is a compact subset of candidate parameter vectors from which $\hat{\theta}_{ML}$ is to be chosen. As implied by Eq. (1), the ML estimate $\hat{\theta}_{ML}$ need not be unique. This point will be addressed further in the sequel.

This formalism is well known with regard to control-relevant system identification problems. The traditional approach is to recognise (1) as a particular case of a general class of optimisation problems where the cost is smooth enough for a gradient-based-search algorithm to be used (Ljung, 1999; Söderström & Stoica, 1989). However, instead of exploiting any smoothness of $p_\theta(Y)$, the EM algorithm takes a different approach by utilising a more fundamental characteristic of the cost that arises by virtue of it being a probability density function; viz.

$$\int_{\mathbf{R}^N} p_\theta(Y)\,\mathrm{d}Y = 1 \quad \forall \theta. \tag{2}$$

Engaging this mechanism involves the postulate of a so-called 'complete data set' that contains not only $Y$, which is what was actually observed, but also another data set $X$, which one might wish were available, but in fact is not. This component $X$ is usually termed the 'missing data' and its composition is left completely up to the user. Its choice is usually the key design step in the use of the EM algorithm. The approach taken here is to select the missing data so that an associated likelihood maximisation problem can be simply and robustly solved in closed form. With this in mind, the principles underlying the EM algorithm depend first on employing the elementary definition of conditional probability $p_\theta(Y)p_\theta(X|Y) = p_\theta(X,Y)$ and hence,

$$\log p_\theta(Y) = \log p_\theta(X,Y) - \log p_\theta(X|Y). \tag{3}$$

This provides an explicit link between a 'wished for' log-likelihood function $\log p_\theta(X,Y)$ that depends on unavailable observations $X$, and the log-likelihood $\log p_\theta(Y)$ which is actually available, and for which a maximiser is sought. Using this association, the essential idea underlying the EM algorithm is to approximate $\log p_\theta(X,Y)$ by a function $\mathcal{Q}(\theta,\theta')$, which is a projection of $\log p_\theta(X,Y)$, defined by a current estimate $\theta'$ of the maximiser of $p_\theta(Y)$, onto a space defined by available observations. That is,

$$\log p_\theta(X,Y) \approx \mathcal{Q}(\theta,\theta') \triangleq \mathbf{E}_{\theta'}\{\log p_\theta(X,Y)\,|\,Y\}. \tag{4}$$

Here, $\mathbf{E}_{\theta'}\{\cdot\}$ is the expectation operator with respect to an underlying probability density function defined by an assumption of the system parameters being $\theta'$. This invites the obvious question as to how maximisation (or approximate maximisation) of $p_\theta(X,Y)$, depending on unavailable data $X$, relates to the prima facie more practical issue of maximising $L(\theta) \triangleq \log p_\theta(Y)$. To study this, note that $L(\theta)$ may be written in terms of $\mathcal{Q}(\theta,\theta')$ by employing (3) as follows:

$$L(\theta) \triangleq \log p_\theta(Y) = \mathbf{E}_{\theta'}\{\log p_\theta(Y)\,|\,Y\} \tag{5}$$

$$= \mathbf{E}_{\theta'}\{\log p_\theta(X,Y)\,|\,Y\} - \mathbf{E}_{\theta'}\{\log p_\theta(X|Y)\,|\,Y\} = \mathcal{Q}(\theta,\theta') - \mathcal{V}(\theta,\theta'), \tag{6}$$

where $\mathcal{Q}(\theta,\theta')$ was defined above in (4) and

$$\mathcal{V}(\theta,\theta') \triangleq \mathbf{E}_{\theta'}\{\log p_\theta(X|Y)\,|\,Y\}. \tag{7}$$


The equality (5) follows since the conditioning on $Y$ fixes all the random components that the ensemble average implied by the expectation operator is taken over. Furthermore, it is straightforward (see the following proof of Theorem 2) that $\mathcal{V}(\theta',\theta') - \mathcal{V}(\theta,\theta')$ is the Kullback–Leibler distance between the two density functions $p_{\theta'}(X|Y)$ and $p_\theta(X|Y)$. Hence, via (2), $\mathcal{V}(\theta',\theta') - \mathcal{V}(\theta,\theta') \ge 0$ for all $\theta, \theta'$. Therefore, the decomposition (6) establishes that

$$\mathcal{Q}(\theta,\theta') > \mathcal{Q}(\theta',\theta') \Rightarrow L(\theta) > L(\theta'). \tag{8}$$

That is, any new $\theta$ which increases $\mathcal{Q}(\theta,\theta')$ above its previous value $\mathcal{Q}(\theta',\theta')$ must also increase the likelihood function $L(\theta)$. This principle then leads to the following definition for one iteration of the EM algorithm, starting from an estimate $\hat{\theta}_k$ of the $\hat{\theta}_{ML}$ defined by (1) and updating to a (better) one $\hat{\theta}_{k+1}$ according to

(1) E-Step: Calculate:
$$\mathcal{Q}(\theta, \hat{\theta}_k). \tag{9}$$

(2) M-Step: Compute:
$$\hat{\theta}_{k+1} = \arg\max_{\theta \in \Theta} \mathcal{Q}(\theta, \hat{\theta}_k). \tag{10}$$

One invocation of Eqs. (9) and (10) is rarely enough to obtain a satisfactory approximation to $\hat{\theta}_{ML}$. Therefore, an EM algorithm is usually composed of multiple iterations.

3. Application of the EM algorithm to LTI parameter estimation

The estimation problem considered in this paper is that of obtaining a ML parameter estimate of an nth-order system described by the state-space equations

$$\begin{bmatrix} x_{t+1} \\ y_t \end{bmatrix} = \begin{bmatrix} A & B \\ C & D \end{bmatrix}\begin{bmatrix} x_t \\ u_t \end{bmatrix} + \begin{bmatrix} w_t \\ v_t \end{bmatrix}. \tag{11}$$

This is subject to the distribution of the initial state $x_1$ being Gaussian with unknown mean $\mu$ and positive definite covariance matrix $P_1$. That is

$$x_1 \sim \mathcal{N}(\mu, P_1). \tag{12}$$

The estimation is performed on the basis of observing a sequence of input–output data samples $(U_N, Y_N)$, where

$$U_N \triangleq \{u_1, \ldots, u_N\}, \qquad Y_N \triangleq \{y_1, \ldots, y_N\}, \tag{13}$$

with $y_t \in \mathbf{R}^p$, $u_t \in \mathbf{R}^m$. In Eq. (11), the vector sequence $\{x_t \in \mathbf{R}^n\}$ represents the evolution of the system's state, while $\{w_t\}$ and $\{v_t\}$ model random disturbances. It is assumed that the latter two sequences can be described by temporally independent random variables with a positive-definite joint covariance matrix and the following Gaussian distribution:

$$\begin{bmatrix} w_t \\ v_t \end{bmatrix} \sim \mathcal{N}\left(\begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} Q & S \\ S^T & R \end{bmatrix}\right). \tag{14}$$

Estimating the parameters of a system described by Eqs. (11), (12) and (14) amounts to estimating the elements of the constant matrices $A, B, C, D, Q, R, S, P_1$ and the vector $\mu$. For convenience, these quantities shall be collected into a vector $\theta$ as follows:

$$\theta \triangleq [\beta^T, \delta^T]^T, \qquad \beta \triangleq [\mathrm{vec}\{\Gamma\}^T, \mu^T]^T, \qquad \delta \triangleq [\mathrm{vec}\{\Pi\}^T, \mathrm{vec}\{P_1\}^T]^T. \tag{15}$$

Here

$$\Gamma \triangleq \begin{bmatrix} A & B \\ C & D \end{bmatrix} \tag{16}$$

and

$$\Pi \triangleq \begin{bmatrix} Q & S \\ S^T & R \end{bmatrix}, \tag{17}$$

and the $\mathrm{vec}\{\cdot\}$ operator creates a vector from a matrix by stacking its columns on top of one another. With this definition of the parameter vector $\theta$ in mind, the log-likelihood function $L(\theta)$ for the system described above is (neglecting constant terms) (Ljung, 1999)

$$L(\theta) = -\frac{1}{2}\sum_{t=1}^{N} \log\det(CP_{t|t-1}C^T + R) - \frac{1}{2}\sum_{t=1}^{N} \varepsilon_t^T(\theta)\left[CP_{t|t-1}C^T + R\right]^{-1}\varepsilon_t(\theta), \tag{18}$$

where (with $Y_0 \triangleq 0$)

$$\varepsilon_t(\theta) \triangleq y_t - \hat{y}_{t|t-1}(\theta), \qquad \hat{y}_{t|t-1}(\theta) \triangleq \mathbf{E}_\theta\{y_t\,|\,Y_{t-1}\} \tag{19}$$
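To make Eqs. (18) and (19) concrete, the following sketch evaluates $L(\theta)$ by running the Kalman predictor over the data. This is hypothetical illustration code, not from the paper: it assumes numpy is available and, for simplicity, takes the noise cross-covariance $S = 0$; the function name and argument layout are invented for this example.

```python
import numpy as np

def loglik(A, B, C, D, Q, R, mu, P1, U, Y):
    """Innovations-form Gaussian log-likelihood (18), assuming S = 0.

    U, Y are arrays of shape (N, m) and (N, p) holding the input and
    output samples row-wise."""
    x, P = mu.copy(), P1.copy()        # predicted state mean/covariance
    L = 0.0
    for u, y in zip(U, Y):
        # One-step-ahead output prediction and innovation, Eq. (19)
        eps = y - (C @ x + D @ u)
        Lam = C @ P @ C.T + R          # innovation covariance C P C^T + R
        sign, logdet = np.linalg.slogdet(Lam)
        L -= 0.5 * (logdet + eps @ np.linalg.solve(Lam, eps))
        # Measurement update followed by time update (Kalman predictor)
        K = P @ C.T @ np.linalg.inv(Lam)
        x = x + K @ eps
        P = P - K @ C @ P
        x = A @ x + B @ u
        P = A @ P @ A.T + Q
    return L
```

For a one-dimensional system with zero data and $CP_{1|0}C^T + R = 2$ at every step, each term of (18) contributes $-\tfrac{1}{2}\log 2$, which gives a quick sanity check of the implementation.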


with the latter being the mean square optimal one-step-ahead prediction of the system output and

$$P_{t|t-1} \triangleq \mathbf{E}_\theta\{(x_t - \hat{x}_{t|t-1})(x_t - \hat{x}_{t|t-1})^T\} \tag{20}$$

is the covariance matrix associated with the state estimate $\hat{x}_{t|t-1} \triangleq \mathbf{E}_\theta\{x_t\,|\,Y_{t-1}\}$. Both of these quantities may be calculated by the well-known Kalman predictor, a version of which will be presented in Lemma 3.2.

Now, an essential observation is that if, in addition to the measurements $Y_N$, the state sequence

$$X_{N+1} \triangleq \{x_1, x_2, \ldots, x_{N+1}\} \tag{21}$$

were available, then it would be possible to extract an estimate of $\Gamma$ from Eq. (11) using simple linear regression techniques. Since knowledge of $X_{N+1}$ would so radically simplify the estimation problem, it is taken here as the EM algorithm's missing data, which in the previous Section 2 was simply labelled as $X$. According to the definition of the EM algorithm (see Eqs. (9) and (10)), the first step is to determine the function $\mathcal{Q}(\theta,\theta')$ defined by (4). This is achieved via the following result.

Lemma 3.1. Consider the model structure (11), (12), (14). If the missing data is $X \triangleq X_{N+1}$ (defined by Eq. (21)) and the observed data is $Y \triangleq Y_N$ (see (13)), then for all $\theta'$ such that $\Pi', P_1' > 0$ the function $\mathcal{Q}(\theta,\theta')$ defined in (4) is given by

$$-2\mathcal{Q}(\theta,\theta') = \log\det P_1 + \mathrm{Tr}\{P_1^{-1}\mathbf{E}_{\theta'}\{(x_1-\mu)(x_1-\mu)^T\,|\,Y_N\}\} + N\log\det\Pi + N\,\mathrm{Tr}\{\Pi^{-1}[\Phi - \Psi\Gamma^T - \Gamma\Psi^T + \Gamma\Sigma\Gamma^T]\}, \tag{22}$$

where

$$\zeta_t^T \triangleq [x_{t+1}^T, y_t^T], \qquad z_t^T \triangleq [x_t^T, u_t^T], \tag{23}$$

$$\Phi \triangleq \frac{1}{N}\sum_{t=1}^{N}\mathbf{E}_{\theta'}\{\zeta_t\zeta_t^T\,|\,Y_N\}, \tag{24}$$

$$\Psi \triangleq \frac{1}{N}\sum_{t=1}^{N}\mathbf{E}_{\theta'}\{\zeta_t z_t^T\,|\,Y_N\}, \qquad \Sigma \triangleq \frac{1}{N}\sum_{t=1}^{N}\mathbf{E}_{\theta'}\{z_t z_t^T\,|\,Y_N\}. \tag{25}$$

Proof. The derivation here draws on that presented in (Shumway, 1982), wherein a simpler time series modelling situation was considered. Those ideas are extended here to allow for exogenous inputs together with full ARMA modelling (that is, $S \neq 0$) of the measurement noise component, while not requiring (as in (Shumway, 1982)) non-minimal state dimension. To begin, repeated application of Bayes' rule, and use of Markov properties implied by (11), yields

$$p_\theta(Y_N, X_{N+1}) = p_\theta(x_1)\prod_{t=1}^{N} p_\theta(x_{t+1}, y_t\,|\,x_t). \tag{26}$$

Furthermore, straightforwardly from Eqs. (11), (12), (14) and (17),

$$p_\theta(x_1) \sim \mathcal{N}(\mu, P_1), \qquad p_\theta\left(\begin{bmatrix} x_{t+1} \\ y_t \end{bmatrix}\,\Big|\,x_t\right) \sim \mathcal{N}(\Gamma z_t, \Pi). \tag{27}$$

Therefore, using the relationships (27) and excluding terms that are independent of the quantities to be estimated, Eq. (26) may be expressed as

$$-2\log p_\theta(Y_N, X_{N+1}\,|\,U_N) = \log\det P_1 + (x_1-\mu)^T P_1^{-1}(x_1-\mu) + N\log\det\Pi + \sum_{t=1}^{N}(\zeta_t - \Gamma z_t)^T\Pi^{-1}(\zeta_t - \Gamma z_t). \tag{28}$$

Applying the conditional expectation operator $\mathbf{E}_{\theta'}\{\cdot\,|\,Y_N\}$ to both sides of Eq. (28) yields (22). □

Note that, according to the definitions (23) and (25), all quantities making up $\mathcal{Q}(\theta,\theta')$ may be evaluated from the elements

$$\hat{x}_{t|N} \triangleq \mathbf{E}_{\theta'}\{x_t\,|\,Y_N\}, \qquad \mathbf{E}_{\theta'}\{x_t x_t^T\,|\,Y_N\} \tag{29}$$

and

$$\mathbf{E}_{\theta'}\{x_t x_{t-1}^T\,|\,Y_N\}, \tag{30}$$

which, according to the model structure of interest (11) and noise assumptions (14), may be computed using a standard Kalman smoother, except for the component (30), which requires a non-standard augmentation. This is made explicit in the following lemma.

Lemma 3.2. Let the parameter vector $\theta'$ be composed of the elements of $A, B, C, D, Q, R, S, P_1$ and $\mu$ (according to (15)), which define the system (11), (12), (14). Then

$$\mathbf{E}_{\theta'}\{y_t x_t^T\,|\,Y_N\} = y_t\hat{x}_{t|N}^T, \tag{31}$$

$$\mathbf{E}_{\theta'}\{x_t x_t^T\,|\,Y_N\} = \hat{x}_{t|N}\hat{x}_{t|N}^T + P_{t|N}, \tag{32}$$

$$\mathbf{E}_{\theta'}\{x_t x_{t-1}^T\,|\,Y_N\} = \hat{x}_{t|N}\hat{x}_{t-1|N}^T + M_{t|N}, \tag{33}$$

where $\hat{x}_{t|N}$, $P_{t|N}$, and $M_{t|N}$ are calculated via the reverse-time recursions

$$J_t = P_{t|t}\bar{A}^T P_{t+1|t}^{-1}, \tag{34}$$

$$\hat{x}_{t|N} = \hat{x}_{t|t} + J_t(\hat{x}_{t+1|N} - \bar{A}\hat{x}_{t|t} - \bar{B}u_t - SR^{-1}y_t), \tag{35}$$

$$P_{t|N} = P_{t|t} + J_t(P_{t+1|N} - P_{t+1|t})J_t^T \tag{36}$$

for $t = N, \ldots, 1$, and

$$M_{t|N} = P_{t|t}J_{t-1}^T + J_t(M_{t+1|N} - \bar{A}P_{t|t})J_{t-1}^T \tag{37}$$

for $t = N, \ldots, 2$, where the matrices $\bar{A}$, $\bar{B}$, $\bar{Q}$ are defined as

$$\bar{A} \triangleq A - SR^{-1}C, \qquad \bar{B} \triangleq B - SR^{-1}D, \qquad \bar{Q} \triangleq Q - SR^{-1}S^T. \tag{38}$$

The quantities $\hat{x}_{t|t}$, $P_{t|t}$, $P_{t|t-1}$ required by Eqs. (35)–(37) may themselves be pre-computed using the Kalman Filter equations


$$P_{t|t-1} = \bar{A}P_{t-1|t-1}\bar{A}^T + \bar{Q}, \tag{39}$$

$$K_t = P_{t|t-1}C^T\left(CP_{t|t-1}C^T + R\right)^{-1}, \tag{40}$$

$$P_{t|t} = P_{t|t-1} - K_t C P_{t|t-1}, \tag{41}$$

$$\hat{x}_{t|t-1} = \bar{A}\hat{x}_{t-1|t-1} + \bar{B}u_{t-1} + SR^{-1}y_{t-1}, \tag{42}$$

$$\hat{x}_{t|t} = \hat{x}_{t|t-1} + K_t(y_t - C\hat{x}_{t|t-1} - Du_t) \tag{43}$$

with $t = 1, \ldots, N$. The recursion (37) is initialised with

$$M_{N|N} = (I - K_N C)\bar{A}P_{N-1|N-1}. \tag{44}$$

Proof. Eqs. (35) and (36) are the well-known Rauch–Tung–Striebel recursions (Jazwinski, 1970) for fixed interval Kalman smoothing of the system (11), (12), (14), once transformed to have uncorrelated state and measurement noise as follows:

$$x_{t+1} = \bar{A}x_t + \bar{B}u_t + SR^{-1}y_t + \bar{w}_t = (A - SR^{-1}C)x_t + (B - SR^{-1}D)u_t + SR^{-1}y_t + \bar{w}_t, \qquad y_t = Cx_t + Du_t + v_t, \tag{45}$$

where now

$$\begin{bmatrix} \bar{w}_t \\ v_t \end{bmatrix} \sim \mathcal{N}\left(\begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} Q - SR^{-1}S^T & 0 \\ 0 & R \end{bmatrix}\right).$$

These depend on the Kalman filtered quantities $\hat{x}_{t|t}$, $P_{t|t}$, $P_{t|t-1}$, which again are well known (Jazwinski, 1970) as being computable via Eqs. (39)–(43), where $\{y_t\}$ is now considered as an input as well as a measured output. The expressions (37) and (44) are established in Property P4.3 of (Shumway & Stoffer, 2000). □

With the computation of $\mathcal{Q}(\theta,\theta')$ established as being straightforward, attention now turns, according to Eq. (10), to its maximisation. This is also straightforward (by design), as established in the following Lemma 3.3, whose formulation depends on the following parameter space specifications.

Parameter Space Assumptions 3.1. Recalling the decomposition $\theta = [\beta^T, \delta^T]^T$ defined in (15), the set of candidate parameter vectors $\Theta$ is taken as

$$\Theta = \Theta_1 \times \Theta_2, \qquad \beta \in \Theta_1, \qquad \delta \in \Theta_2,$$

where $\Theta_1$ is a closed hypercube in $\mathbf{R}^{\ell}$, $\ell = n^2 + nm + np + pm + n$, and $\Theta_2$ is a closed and bounded subset of $\mathbf{R}^{v}$, $v = n^2 + (n+p)^2$, for which all $\delta \in \Theta_2$ imply symmetric positive definite $\Pi$, $P_1$.

Lemma 3.3. Let $\Sigma$ given by (25) satisfy $\Sigma > 0$, and be used to define $\hat{\Gamma}$, $\hat{\beta}$ according to ($\Psi$ is also defined in Eq. (25))

$$\hat{\Gamma} = \begin{bmatrix} \hat{A} & \hat{B} \\ \hat{C} & \hat{D} \end{bmatrix} = \Psi\Sigma^{-1}, \qquad \hat{\beta} \triangleq [\mathrm{vec}\{\hat{\Gamma}\}^T, \hat{\mu}^T]^T, \qquad \hat{\mu} \triangleq \hat{x}_{1|N}. \tag{46}$$

If $\theta'$ defined by the Parameter Space Assumptions 3.1 above is such that $\hat{\beta}$ lies within $\Theta_1$, then for any fixed $\delta \in \Theta_2$, the point (46) is the unique maximiser

$$\hat{\beta} = \arg\max_{\beta \in \Theta_1} \mathcal{Q}\left([\beta^T, \delta^T]^T, \theta'\right). \tag{47}$$

Furthermore, define $\hat{\delta}$ according to ($\Phi$ is defined in (24))

$$\hat{\delta} \triangleq [\mathrm{vec}\{\hat{\Pi}\}^T, \mathrm{vec}\{\hat{P}_1\}^T]^T, \qquad \hat{P}_1 \triangleq P_{1|N}, \tag{48}$$

$$\hat{\Pi} \triangleq \begin{bmatrix} \hat{Q} & \hat{S} \\ \hat{S}^T & \hat{R} \end{bmatrix} = \Phi - \Psi\Sigma^{-1}\Psi^T. \tag{49}$$

Then $\hat{\delta}$ is a stationary point of $\mathcal{Q}([\hat{\beta}^T, \delta^T]^T, \theta')$ with respect to $\delta$, and $\hat{\Pi}$ and $\hat{P}_1$ defined above are non-negative by construction. Finally, if $\hat{\Pi}$ and $\hat{P}_1$ given by (48) and (49) are positive definite, if the input sequence $\{u_t\}$ is such that

$$\sum_{t=1}^{N} u_t u_t^T > 0, \tag{50}$$

if $\beta'$ implies a controllable and observable system, and if $\hat{\delta} \in \Theta_2$, then

$$\hat{\delta} = \arg\max_{\delta \in \Theta_2} \mathcal{Q}\left([\hat{\beta}^T, \delta^T]^T, \theta'\right). \tag{51}$$

In combination with (46), this provides a unique global maximiser of $\mathcal{Q}(\theta,\theta')$ over $\theta \in \Theta$.

Proof. Consider first the maximisation with respect to $\beta$, and note that the terms of the formulation of $\mathcal{Q}(\theta,\theta')$ given in (22) that depend upon $\Gamma$, $\mu$ may be written as

$$-\mathrm{Tr}\{P_1^{-1}\mathbf{E}_{\theta'}\{(x_1-\mu)(x_1-\mu)^T\,|\,Y_N\}\} = -\mathrm{Tr}\{P_1^{-1}[(\hat{x}_{1|N}-\mu)(\hat{x}_{1|N}-\mu)^T + P_{1|N}]\}$$

and

$$-\mathrm{Tr}\{\Pi^{-1}[\Phi - \Psi\Gamma^T - \Gamma\Psi^T + \Gamma\Sigma\Gamma^T]\} = -\mathrm{Tr}\{\Pi^{-1}[(\Gamma - \Psi\Sigma^{-1})\Sigma(\Gamma - \Psi\Sigma^{-1})^T + \Phi - \Psi\Sigma^{-1}\Psi^T]\}.$$

These terms are clearly (globally) maximised with respect to the elements of $\beta$ by the choices in Eq. (46).


Turning now to the maximisation with respect to $\delta$, application of Lemma C.1 and the chain rule indicates that at any point where $\Pi > 0$

$$\frac{\mathrm{d}}{\mathrm{d}\Pi}\left[\log\det\Pi + \mathrm{Tr}\{\Pi^{-1}(\Phi - \Psi\Sigma^{-1}\Psi^T)\}\right] = \Pi^{-1} - \Pi^{-1}(\Phi - \Psi\Sigma^{-1}\Psi^T)\Pi^{-1}.$$

This is clearly zero for the choice of $\hat{\Pi}$ in (49), which is therefore a stationary point of $\mathcal{Q}(\theta,\theta')$ with respect to a parametrization of symmetric positive definite $\Pi$. Via a similar argument, the expression (48) is a stationary point of $\mathcal{Q}(\theta,\theta')$ with respect to symmetric positive definite $P_1$. Furthermore, Lemma C.2 establishes that, under the indicated extra assumptions, the Hessian of $\mathcal{Q}(\theta,\theta')$ is negative definite at this unique stationary point, and hence it is a global maximiser of $\mathcal{Q}(\theta,\theta')$. Finally, with the definition $\varphi_t \triangleq [\zeta_t^T, z_t^T]^T$, and $\zeta_t$, $z_t$ being given by (23), then

$$\begin{bmatrix} \Phi & \Psi \\ \Psi^T & \Sigma \end{bmatrix} = \frac{1}{N}\sum_{t=1}^{N}\mathbf{E}_{\theta'}\{\varphi_t\varphi_t^T\,|\,Y_N\} \ge 0. \tag{52}$$

Therefore, the stationary point $\hat{\Pi}$ given by (49) is the Schur complement (Söderström & Stoica, 1989) of $\Sigma$ in the matrix (52) and hence, by construction, is non-negative. Noting that the maximiser $\hat{\Gamma}$ in (46) does not depend on $\Pi$, and hence that (under the assumptions) the expressions (46), (49) simultaneously maximize $\mathcal{Q}([\beta^T, \delta^T]^T, \theta')$ with respect to $\theta^T = [\beta^T, \delta^T]^T$, then completes the proof. □

Note that an essential feature of the maximisation steps (46) and (49) is that they do not impose or require any particular parametrization of the system matrices to be estimated. As in the case of subspace-based methods, and data-driven local co-ordinate gradient-search methods, this imbues the associated EM algorithm with an enhanced degree of robustness by avoiding well-known pitfalls associated with MIMO parametrization choices. Furthermore, it allows the EM-based approach proposed here to smoothly mesh with the aforementioned alternatives, particularly as a means for providing initialisation of the algorithm via a subspace-based approach.

At the same time, this lack of imposed parametrization is achieved by a full (and hence over-) parametrization, which raises the issue of whether there is a variance penalty for employing a model structure with a non-minimal parametrization. This issue has been addressed in (Pintelon et al., 1996), where it was established that, in fact, for quantities that are unaltered by parametrization choice (such as the frequency response function) the variance of an estimate is the same for minimally and non-minimally parametrized model structures. Furthermore, note that according to Lemma 3.3, an estimate $\hat{\mu} = \hat{x}_{1|N}$ of any non-zero initial state conditions arises very easily and naturally, implying that the algorithm can provide dynamic system estimates on the basis of non-steady-state operating data.

4. Robust algorithm implementation

When collected, the results in Lemmas 3.1–3.3 deliver the 'naïve' EM-based algorithm for estimating the parameters of the system description (11), (12), (14).

EM Algorithm 4.1 (naïve implementation).

(1) Let $k = 0$ and initialise estimates at $\hat{\theta}_0 = [\mathrm{vec}\{\Gamma\}^T, \mu^T, \mathrm{vec}\{\Pi\}^T, \mathrm{vec}\{P_1\}^T]^T$;
(2) Using the system specification $\hat{\theta}_k$, perform the Kalman Filter recursions (39)–(43) followed by the Kalman Smoother (type) recursions (34)–(37) in order to compute $\Phi$, $\Psi$ and $\Sigma$ defined in Eqs. (23)–(25);
(3) Maximise $\mathcal{Q}(\theta, \hat{\theta}_k)$ with respect to $\theta$ by choosing $\hat{A}, \hat{B}, \hat{C}, \hat{D}, \hat{Q}, \hat{R}, \hat{S}, \hat{P}_1$ and $\hat{\mu}$ according to Eqs. (46), (48) and (49) to obtain a new estimate $\hat{\theta}_{k+1}$;
(4) If the associated likelihood sequence $\{L(\hat{\theta}_k)\}$ has converged, then terminate; otherwise increment $k$ and return to step 2.

While the above provides a formal algorithm specification, the algorithm is termed 'naïve' since the question of robust and efficient implementation requires further consideration. In particular, the experience of the authors is that any approach involving the above formal algorithm specification leads to unsatisfactory results for all but rather trivial state and input/output dimension. In particular, it is essential that the estimated covariance matrix composed of $Q$, $R$ and $S$ (see Eq. (14)) is symmetric and positive semi-definite at all iterations, despite the limitation of finite precision arithmetic. On the other hand, if a considered approach is taken to dealing with the limitations of finite precision computation, then Algorithm 4.1 can be implemented in a very robust fashion, as will now be detailed.
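One iteration of the naïve EM Algorithm 4.1 can be sketched as follows. This is hypothetical illustration code, not the authors' implementation: it assumes numpy, takes $S = 0$ for brevity (so $\bar{A} = A$, $\bar{B} = B$), and forms the lag-one term via the standard RTS identity $\mathrm{cov}(x_{t+1}, x_t \mid Y_N) = P_{t+1|N}J_t^T$ rather than the recursion (37).

```python
import numpy as np

def em_step(A, B, C, D, Q, R, mu, P1, U, Y):
    """One naive EM iteration (cf. Algorithm 4.1), assuming S = 0.

    E-step: Kalman filter + RTS smoother accumulate Phi, Psi, Sigma of
    Eqs. (23)-(25).  M-step: Gamma = Psi Sigma^{-1} and
    Pi = Phi - Psi Sigma^{-1} Psi^T, cf. Eqs. (46), (49).
    Returns (Gamma, Pi, mu_new, P1_new)."""
    N, n = len(Y), A.shape[0]
    # ---- forward Kalman filter, Eqs. (39)-(43) with S = 0 ----
    xf, Pf, Pp = [], [], []
    x, P = mu, P1
    for u, y in zip(U, Y):
        Pp.append(P)                               # P_{t|t-1}
        Lam = C @ P @ C.T + R
        K = P @ C.T @ np.linalg.inv(Lam)
        x = x + K @ (y - C @ x - D @ u)            # x_{t|t}
        P = P - K @ C @ P                          # P_{t|t}
        xf.append(x); Pf.append(P)
        x = A @ x + B @ u                          # time update
        P = A @ P @ A.T + Q
    Pp.append(P)                                   # P_{N+1|N}
    # ---- backward RTS smoother, Eqs. (34)-(36), accumulating (23)-(25) ----
    xs, Ps = x, P                                  # smoothed moments at t = N+1
    p, m = Y.shape[1], U.shape[1]
    Phi = np.zeros((n + p, n + p))
    Psi = np.zeros((n + p, n + m))
    Sig = np.zeros((n + m, n + m))
    for i in range(N - 1, -1, -1):
        J = Pf[i] @ A.T @ np.linalg.inv(Pp[i + 1])
        xs_new = xf[i] + J @ (xs - A @ xf[i] - B @ U[i])
        Ps_new = Pf[i] + J @ (Ps - Pp[i + 1]) @ J.T
        cross = Ps @ J.T                           # cov(x_{t+1}, x_t | Y_N)
        z = np.concatenate([xs_new, U[i]])         # E{z_t | Y_N}
        zeta = np.concatenate([xs, Y[i]])          # E{zeta_t | Y_N}
        Phi += np.outer(zeta, zeta); Phi[:n, :n] += Ps
        Psi += np.outer(zeta, z);    Psi[:n, :n] += cross
        Sig += np.outer(z, z);       Sig[:n, :n] += Ps_new
        xs, Ps = xs_new, Ps_new
    Phi /= N; Psi /= N; Sig /= N
    # ---- M-step, Eqs. (46), (48), (49) ----
    Gamma = np.linalg.solve(Sig.T, Psi.T).T        # Psi Sigma^{-1}
    Pi = Phi - Gamma @ Psi.T
    return Gamma, Pi, xs, Ps                       # xs = x_{1|N}, Ps = P_{1|N}
```

This naïve form reproduces the algebra of the lemmas directly; as the text notes, it is exactly the subtraction in the last two lines that can destroy symmetry and non-negativity of $\hat{\Pi}$ in finite precision, which motivates the robust implementation of Section 4.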

4.1. The robust E-step

A major part of the E-step of EM Algorithm 4.1 consists of computing the covariance matrices $\{P_{t|N}\}$ using a set of Kalman smoothing recursions. The latter are used to construct the matrices $\Phi$, $\Psi$ and $\Sigma$ (see Eqs. (23) and (25)). It is possible to ensure that these matrices are positive semi-definite and symmetric, as required, by employing a 'square-root' filtering strategy. Such a scheme uses recursions that propagate the matrix square roots of $P_{t|t-1}$, $P_{t|t}$ and $P_{t|N}$ rather than the full matrices themselves. There are several approaches to square-root filtering, but the method chosen here is based on the techniques proposed in (Kailath et al., 2000) and employs the numerically robust and efficient QR factorisation (Golub & Loan, 1989), which, for an arbitrary matrix $M$, decomposes it as $M = QR$, where $Q$ is a unitary matrix and $R$ an upper-triangular matrix.

Consider first the recursion (36), which propagates $P_{t|N}$ backward in time $t$. We seek a recursion which instead propagates a square root $P_{t|N}^{1/2}$ for which $P_{t|N} = P_{t|N}^{T/2}P_{t|N}^{1/2}$ satisfies (36). For this purpose, the following QR factorisation is useful:

$$\begin{bmatrix} P_{t|t}^{T/2}\bar{A}^T & P_{t|t}^{T/2} \\ \bar{Q}^{T/2} & 0 \\ 0 & P_{t+1|N}^{T/2}J_t^T \end{bmatrix} = QR = Q\begin{bmatrix} R_{11} & R_{12} \\ 0 & R_{22} \\ 0 & 0 \end{bmatrix}. \tag{53}$$

Here, $R$ is partitioned conformally to the left-hand side of (53). Exploiting the unitary nature of $Q$, and multiplying this equation on the left by its transpose, provides

$$\begin{bmatrix} R_{11}^T R_{11} & R_{11}^T R_{12} \\ R_{12}^T R_{11} & R_{12}^T R_{12} + R_{22}^T R_{22} \end{bmatrix} = \begin{bmatrix} P_{t+1|t} & \bar{A}P_{t|t} \\ P_{t|t}\bar{A}^T & P_{t|t} + J_t P_{t+1|N}J_t^T \end{bmatrix}.$$

Equating the lower right submatrices then indicates that

$$R_{22}^T R_{22} = P_{t|t} + J_t P_{t+1|N}J_t^T - R_{12}^T R_{11}(R_{11}^T R_{11})^{-1}R_{11}^T R_{12} = P_{t|t} + J_t P_{t+1|N}J_t^T - P_{t|t}\bar{A}^T(P_{t+1|t})^{-1}\bar{A}P_{t|t} = P_{t|t} + J_t(P_{t+1|N} - P_{t+1|t})J_t^T. \tag{54}$$

Therefore, Eqs. (36) and (54) imply that

$$R_{22}^T R_{22} = P_{t|N} \tag{55}$$

satisfies (36), so that $P_{t|N}^{1/2} = R_{22}$.

Turning now to the forward time recursions (39)–(41) for the Kalman filtered state covariance $P_{t|t}$, a recursion is sought which propagates a square root $P_{t|t}^{1/2}$. Again, a QR factorisation is useful, this time of the form

$$\begin{bmatrix} R^{T/2} & 0 \\ P_{t|t-1}^{T/2}C^T & P_{t|t-1}^{T/2} \end{bmatrix} = QR = Q\begin{bmatrix} R_{11} & R_{12} \\ 0 & R_{22} \end{bmatrix}. \tag{56}$$

A similar procedure to the one used above reveals that $R_{22}^T R_{22} = P_{t|t}$ and therefore that $P_{t|t}^{1/2} = R_{22}$ satisfies (41). Finally, the factorisation (56) requires the matrix square root $P_{t|t-1}^{1/2}$, and for this purpose the factorisation

$$\begin{bmatrix} P_{t-1|t-1}^{T/2}\bar{A}^T \\ \bar{Q}^{T/2} \end{bmatrix} = \tilde{Q}\begin{bmatrix} \tilde{R}_1 \\ 0 \end{bmatrix} \tag{57}$$

can be performed so that, by (39),

$$\tilde{R}_1^T\tilde{R}_1 = \bar{A}P_{t-1|t-1}\bar{A}^T + \bar{Q} = P_{t|t-1}.$$

Thus $P_{t|t-1}^{1/2} = \tilde{R}_1$.

4.2. The robust M-step

Note that a key aspect of the preceding robust implementation of the E-step is that the square-root terms can be used to form matrices $\Phi$, $\Psi$ and $\Sigma$ (see Eqs. (23)–(25)) that are guaranteed to be non-negative and symmetric, even in the presence of finite precision computation. Since $\Sigma$ is square, standard pivoting and Gaussian elimination (as, for example, implemented by the Matlab / operator) provides a numerically robust means for computing (re-)estimates of $A$, $B$, $C$ and $D$ according to (46).

The computation of the $Q$, $R$ and $S$ matrix updates according to Eq. (49) requires more careful thought. For example, due to the subtractions involved in (49), non-negativity could easily be lost via any naïve implementation. Furthermore, it is again important to guarantee symmetry. An ad hoc method for addressing this issue would involve implementing (49) directly, and then forcing symmetry of $\hat{\Pi}$ by averaging it with its transpose. This, however, only ensures symmetry, and not non-negativity. In consideration of this, the paper here develops a numerically robust means for computing (49) by employing the following Cholesky factorisation

$$\begin{bmatrix} \Sigma & \Psi^T \\ \Psi & \Phi \end{bmatrix} = \begin{bmatrix} X_{11} & X_{12} \\ 0 & X_{22} \end{bmatrix}^T\begin{bmatrix} X_{11} & X_{12} \\ 0 & X_{22} \end{bmatrix} \tag{58}$$

$$= \begin{bmatrix} X_{11}^T X_{11} & X_{11}^T X_{12} \\ X_{12}^T X_{11} & X_{12}^T X_{12} + X_{22}^T X_{22} \end{bmatrix}, \tag{59}$$

where all matrices are partitioned conformally to the left-hand side. Equating the lower right sub-matrices yields (we assume $\Sigma > 0$ and hence $X_{11} > 0$)

$$\Phi = X_{12}^T X_{12} + X_{22}^T X_{22} = X_{12}^T X_{11}(X_{11}^T X_{11})^{-1}X_{11}^T X_{12} + X_{22}^T X_{22}.$$

This implies that $\hat{\Pi}$ may be expressed in terms of Cholesky factors as

$$\hat{\Pi} = \Phi - \Psi\Sigma^{-1}\Psi^T = \Phi - X_{12}^T X_{11}(X_{11}^T X_{11})^{-1}X_{11}^T X_{12} = X_{22}^T X_{22}.$$

That is, $\hat{Q}$, $\hat{R}$ and $\hat{S}$ may be calculated at each iteration using the formula $\hat{\Pi} = X_{22}^T X_{22}$. This approach simultaneously guarantees both the symmetry and non-negative definiteness of the result. In relation to this, in the interests of maximum robustness, it is important to employ a Cholesky factorisation method that, unlike (for example) that native to Matlab (Mathworks, 2004), can cope with rank deficient matrices. For this purpose, the method presented in (Golub & Loan, 1989) (as Algorithm 4.2.4) is appropriate.
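The contrast between the direct subtraction in (49) and the Cholesky-based update can be illustrated as follows. This is a hypothetical sketch assuming numpy; note that `numpy.linalg.cholesky` (like Matlab's native `chol`) requires a strictly positive definite argument, so a rank-tolerant Cholesky such as Algorithm 4.2.4 of Golub & Van Loan would be substituted in the rank-deficient case.

```python
import numpy as np

def pi_naive(Phi, Psi, Sig):
    """Direct evaluation of (49); the subtraction may destroy
    symmetry/non-negativity in finite precision."""
    return Phi - Psi @ np.linalg.solve(Sig, Psi.T)

def pi_robust(Phi, Psi, Sig):
    """Pi = X22^T X22 from the Cholesky factorisation (58) of the
    block matrix [[Sigma, Psi^T], [Psi, Phi]]; the result is symmetric
    and non-negative by construction."""
    d = Sig.shape[0]
    M = np.block([[Sig, Psi.T], [Psi, Phi]])
    X = np.linalg.cholesky(M).T        # upper-triangular factor: M = X^T X
    X22 = X[d:, d:]
    return X22.T @ X22
```

On well-conditioned data the two agree; the benefit of the second form is that its output is a Gram matrix $X_{22}^T X_{22}$, so symmetry and non-negativity cannot be lost to rounding.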

5. Algorithm properties

With this definition of a numerically robust version of the EM-based algorithm in hand, the paper now proceeds to investigate some of its convergence properties via a series of theoretical and empirical studies.

5.1. Uniqueness of iterations

Minimally-parametrized model structures (those for which each input–output system behaviour corresponds to


only one point in parameter space) have traditionally been favoured in system identification. By way of contrast, the model structure used by the EM algorithm is clearly not minimally-parametrized, since any similarity transformation of the state-space matrices in Eq. (11) leads to another model with the same input–output behaviour and thus the same likelihood value. It is therefore natural to question whether $\{\hat{\theta}_k\}$, the sequence of estimates generated by EM Algorithm 4.1, are well defined and, if so, what their convergence properties are. We begin with the issue of well-posedness.

Theorem 1. Suppose that $\hat{\theta}_k$ parametrizes a controllable and observable system with $\Pi, P_1 > 0$, and that for the given data length $N$, the input sequence $\{u_t\}$ satisfies

$$\sum_{t=1}^{N} u_t u_t^T > 0. \tag{60}$$

Then $\Sigma$, defined by Eq. (25), is positive definite and $\hat{\theta}_{k+1}$ is uniquely defined.

Proof. See Appendix A. □
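Condition (60) is easy to test directly from the recorded input samples. The following is a hypothetical sketch (assuming numpy; the function name is invented for illustration):

```python
import numpy as np

def persistently_exciting(U, tol=1e-12):
    """Check condition (60): sum_t u_t u_t^T > 0, for input samples
    stacked row-wise in U (shape N x m)."""
    G = U.T @ U                        # equals sum_t u_t u_t^T
    return bool(np.linalg.eigvalsh(G).min() > tol)
```

A generic random input passes the test, while a zero input, or one confined to a proper subspace of R^m, fails it.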

Clearly, condition (60) (which has been previously employed in (50)) can be considered a 'persistence of excitation' requirement which forces the input sequence to be informative.

5.2. Limit points of the algorithm

With the well-posedness of the algorithm iterates $\{\hat{\theta}_k\}$ established, the paper now turns to the question of convergence. In relation to this, there exist some general results on the convergence of EM algorithm iterates, which are scattered among a number of papers in the statistical literature (see, for example, Boyles, 1983; Dempster et al., 1977; Meng and Rubin, 1994; Wu, 1983). In what follows, we draw on these results in a manner tailored to the particular case of Algorithm 4.1. To begin, we establish the important property that Algorithm 4.1 generates a sequence of estimate iterates $\{\hat{\theta}_k\}$ for which the associated sequence of likelihoods $\{L(\hat{\theta}_k)\}$ is non-decreasing.

Theorem 2. Let $\hat{\theta}_{k+1}$ be generated from $\hat{\theta}_k$ by an iteration of EM Algorithm 4.1. Then

$$L(\hat{\theta}_{k+1}) \ge L(\hat{\theta}_k) \quad \forall k = 0, 1, 2, \ldots, \tag{61}$$

with equality holding if and only if both

$$\mathcal{Q}(\hat{\theta}_{k+1}, \hat{\theta}_k) = \mathcal{Q}(\hat{\theta}_k, \hat{\theta}_k) \quad \text{and} \quad p_{\hat{\theta}_{k+1}}(X|Y_N) = p_{\hat{\theta}_k}(X|Y_N) \tag{62}$$

for almost all (with respect to Lebesgue measure) $X$.

Proof. Using the decomposition (6),

$$L(\hat{\theta}_{k+1}) - L(\hat{\theta}_k) = [\mathcal{Q}(\hat{\theta}_{k+1}, \hat{\theta}_k) - \mathcal{Q}(\hat{\theta}_k, \hat{\theta}_k)] + [\mathcal{V}(\hat{\theta}_k, \hat{\theta}_k) - \mathcal{V}(\hat{\theta}_{k+1}, \hat{\theta}_k)] \tag{63}$$

$$\ge \mathcal{V}(\hat{\theta}_k, \hat{\theta}_k) - \mathcal{V}(\hat{\theta}_{k+1}, \hat{\theta}_k), \tag{64}$$

where (64) follows from (63) upon using the assumption that $\hat{\theta}_{k+1}$ is generated by an EM algorithm iteration. Via the definition of $\mathcal{V}(\cdot, \hat{\theta}_k)$,

$$\mathcal{V}(\hat{\theta}_k, \hat{\theta}_k) - \mathcal{V}(\hat{\theta}_{k+1}, \hat{\theta}_k) = \int \log\left(\frac{p_{\hat{\theta}_k}(X|Y_N)}{p_{\hat{\theta}_{k+1}}(X|Y_N)}\right)p_{\hat{\theta}_k}(X|Y_N)\,\mathrm{d}X. \tag{65}$$

Application of Lemma C.3 then completes the proof. □

Note that while this establishes that Algorithm 4.1 will never lead to a new estimate  k+1 with a likelihood lower than a preceding one  k , it also establishes that in the more usual case when the maximisation step yields Q( k+1 ,  k ) > Q( k ,  k ), then in fact the underlying likek+1 ) > L( lihood is strictly increased; i.e. L( k ). Furthermore, note that this fundamental property of the algorithm is a direct consequence of the essential likelihood function property (2). It may be significant that this, and hence the likelihood monotonicity of the EM algorithm, holds even when the likelihood function is discontinuous, and hence even in situations for which gradient-based-search methods are inapplicable. Furthermore, note that Theorem 2 also establishes that in terms of gauging convergence, a termination condition based on thresholding the relative decrease of Q( k+1 ,  k ) is equivalent to the same test involving the underlying likelihood L( k ). To continue the analysis, notice that the likelihood L() given by (18) is continuous on the compact parameter space  and hence bounded. Therefore, since by Theorem 2 the sequence {L( k )} is monotonic, it is clear that it must converge. The values to which is does converge are established by the following lemma. Theorem 3. Let { k } ⊂  be a sequence of estimates generated by EM Algorithm 4.1. Then a limit point ' of { k } k )} conis a stationary point of L() and the sequence {L( verges monotonically to L(' ). Proof. Denote, for arbitrary ∈  the sets:

Y {(, L()) :  ∈ , L()  L( )}, Z {(, L()) :  ∈ },

with equality if and only if both

Q( k+1 ,  k ) = Q( k ,  k ), pk+1 (X|Y ) = pk (X|Y )

Proof. Eq. (6) yields

X {(, L()) :  ∈ , L()  L( )}. (62)

As already noted, L() is continuous (and differentiable) on , and it is a fundamental property of continuous functions that the sets Y and Z are closed. It follows that any finite intersections and unions of them are also closed.


In particular, X_θ̄ = Y_θ̄ ∩ Z_θ̄ is a closed subset of Euclidean space, and hence compact. Furthermore, according to Lemmas 3.1 and 3.2, Q(θ, θ̄) is continuous over Θ in both arguments. This implies that all the conditions of Theorem 2 of Wu (1983) are satisfied, and application of that result completes the proof. □

Note that although this result establishes that the sequence of likelihoods {L(θ̂_k)} is convergent to a value L(θ′) associated with a (local) stationary point θ′ of L(θ), this does not immediately imply that the parameter estimates {θ̂_k} themselves converge to that local stationary point θ′. In particular, since a non-minimal parametrisation is involved, it is natural to be concerned that the algorithm could lead to a ‘wandering’ of parameter estimates that are parameter-space different, but system-wise equivalent, and hence all implying the same likelihood L(θ′). In fact, this is not the case, since as soon as the sequence {θ̂_k} arrives at a stationary point θ′ of L(θ), all further iterations of the algorithm remain at this same parameter-space stationary point.

Corollary 5.1. Let θ̂_k ∈ Θ parametrise a controllable and observable system. Suppose that the input sequence U_N satisfies condition (60). Then

L(θ̂_{k+1}) ≥ L(θ̂_k)  (66)

with equality if and only if θ̂_{k+1} = θ̂_k.

Proof. According to Theorem 2, we need only show that θ̂_{k+1} = θ̂_k if and only if both

Q(θ̂_{k+1}, θ̂_k) = Q(θ̂_k, θ̂_k)  and  p_{θ̂_{k+1}}(X|Y) = p_{θ̂_k}(X|Y)

for almost all X. Clearly, the ‘only if’ part is trivial. To address the ‘if’ component, note that if Q(θ̂_{k+1}, θ̂_k) = Q(θ̂_k, θ̂_k), then according to the definition of the EM algorithm, both θ̂_{k+1} and θ̂_k must be maximisers of the function Q(·, θ̂_k). On the other hand, Theorem 1 demonstrates that Q(·, θ̂_k) has only one maximiser. Thus θ̂_{k+1} = θ̂_k if L(θ̂_{k+1}) = L(θ̂_k). □

5.3. Rate of convergence of parameter estimates

The previous results establish the convergence properties of EM Algorithm 4.1 about a stationary point θ′ of the likelihood. This section now turns to understanding the convergence rate of Algorithm 4.1 and the factors affecting it. This is a more difficult question to address. To gain insight, an analysis local to a stationary point θ′ of L(θ) provides the following approximate expression characterising the evolution of the ‘estimation error’ θ̂_k − θ′.

Theorem 4. Suppose that θ̂_k parametrizes a controllable and observable system with Π, P₁ > 0 and that θ′ is a stationary point of the likelihood function L(·). Then

θ̂_{k+1} − θ′ = [I − I_XY⁻¹(θ̂_k) I_Y(θ̂_k)](θ̂_k − θ′) + o(‖θ̂_{k+1} − θ̂_k‖²) + o(‖θ̂_k − θ′‖²),  (67)

where

I_Y(θ̂_k) ≜ −(∂²/∂θ∂θᵀ) log p_θ(Y|U) |_{θ=θ̂_k}  (68)

and

I_XY(θ̂_k) ≜ −E_{θ̂_k}{ (∂²/∂θ∂θᵀ) log p_θ(X, Y) | Y } |_{θ=θ̂_{k+1}} = −(∂²/∂θ∂θᵀ) Q(θ, θ̂_k) |_{θ=θ̂_{k+1}}.  (69)

The latter is guaranteed to be invertible provided input condition (60) is satisfied.

Proof. See Appendix B. □
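Before examining the rate expression numerically, the monotone-likelihood guarantee of Theorem 2 (and the convergence of {L(θ̂_k)} established by Theorem 3) can be observed directly on a toy model. The sketch below runs EM on a scalar special case — estimating only the state transition a and process noise variance q, with the output map, measurement noise and initial-state statistics held fixed — and records the innovations-form log-likelihood at each iterate. All function and variable names are ours; this illustrates the property, not the paper's multivariable implementation.

```python
import numpy as np

def kalman_smoother(y, a, c, q, r, p1):
    # E-step for the scalar model x_{t+1} = a x_t + w, y_t = c x_t + v
    N = len(y)
    xp, Pp = 0.0, p1
    xf, Pf = np.zeros(N), np.zeros(N)
    ll = 0.0
    for t in range(N):
        S = c * c * Pp + r                       # innovation variance
        e = y[t] - c * xp                        # innovation
        ll += -0.5 * (np.log(2 * np.pi * S) + e * e / S)
        K = Pp * c / S
        xf[t] = xp + K * e
        Pf[t] = (1 - K * c) * Pp
        xp, Pp = a * xf[t], a * a * Pf[t] + q    # time update
    # Rauch-Tung-Striebel smoother (backward pass)
    xs, Ps, J = xf.copy(), Pf.copy(), np.zeros(N - 1)
    for t in range(N - 2, -1, -1):
        Ppn = a * a * Pf[t] + q
        J[t] = Pf[t] * a / Ppn
        xs[t] = xf[t] + J[t] * (xs[t + 1] - a * xf[t])
        Ps[t] = Pf[t] + J[t] ** 2 * (Ps[t + 1] - Ppn)
    return xs, Ps, J, ll

def em_step(y, a, c, q, r, p1):
    # one EM iteration for (a, q), with c, r, p1 held fixed
    xs, Ps, J, ll = kalman_smoother(y, a, c, q, r, p1)
    Sxx = np.sum(xs[:-1] ** 2 + Ps[:-1])
    Sx1x = np.sum(xs[1:] * xs[:-1] + J * Ps[1:])  # E{x_{t+1} x_t | Y}
    S11 = np.sum(xs[1:] ** 2 + Ps[1:])
    a_new = Sx1x / Sxx
    q_new = (S11 - a_new * Sx1x) / (len(y) - 1)
    return a_new, q_new, ll

rng = np.random.default_rng(1)
a0, c0, q0, r0, p1 = 0.9, 1.0, 0.1, 0.1, 1.0
x, y = 0.0, []
for _ in range(400):                             # simulate the true system
    y.append(c0 * x + np.sqrt(r0) * rng.standard_normal())
    x = a0 * x + np.sqrt(q0) * rng.standard_normal()
y = np.array(y)
a, q, lls = 0.2, 1.0, []                         # deliberately poor start
for _ in range(30):
    a, q, ll = em_step(y, a, c0, q, r0, p1)
    lls.append(ll)
```

On any realisation, the recorded sequence `lls` is non-decreasing to numerical precision, exactly as Theorem 2 predicts.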



This establishes that locally around a stationary point θ′, the EM algorithm proposed in this paper exhibits an estimation error θ̂_k − θ′ that approximately evolves according to the autonomous system given by (67) with the diminishing-order terms o(·) neglected. In particular, the theorem suggests that convergence locally around θ′ will be rapid if the matrices I_XY(θ̂_k) and I_Y(θ̂_k) are approximately equal. In relation to this, note that according to (6) and the fact that V(θ̄, θ̄) − V(θ, θ̄) is a Kullback–Leibler distance, then L(θ) = Q(θ, θ) and hence Q(θ, θ) mimics the value of the likelihood L(θ). Theorem 4 now further establishes that if Q(θ, θ) is also a good approximation to the second-order properties of the likelihood function, that is if I_XY(θ̂_k) ≈ I_Y(θ̂_k), then local convergence will be rapid. Moreover, at least locally about θ′, the rate of parameter convergence will be governed by the smallest eigenvalue of I_XY⁻¹(θ′)I_Y(θ′). Therefore, in the interests of studying this relationship between I_XY and I_Y, note that as a consequence of (6),

I_Y(θ′) = I_XY(θ′) − I_X(θ′),

where

I_X(θ′) ≜ −E_{θ′}{ (∂²/∂θ∂θᵀ) log p_θ(X|Y) | Y } |_{θ=θ′}  (70)

is the ‘missing data’ X information matrix. All three information matrices I_X, I_Y and I_XY are non-negative by construction and hence, regardless of the nature of the missing data X,

I_Y(θ′) ≤ I_XY(θ′).  (71)

Intuitively then, the more information the (unknown) missing data implies, the slower the convergence.
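This intuition can be illustrated with randomly generated information matrices satisfying relation (70): the larger the missing-data information I_X, the closer the eigenvalues of the rate matrix in (67) are to one, and the slower the contraction. A brief sketch (the helper name and scales are our own):

```python
import numpy as np

rng = np.random.default_rng(2)

def random_spd(n, scale=1.0):
    # illustrative helper: random symmetric positive definite matrix
    m = rng.standard_normal((n, n))
    return scale * (m @ m.T + 1e-3 * np.eye(n))

I_Y = random_spd(4)                # observed-data information
rates = []
for scale in (0.1, 10.0):          # little vs much 'missing data' information
    I_X = random_spd(4, scale)
    I_XY = I_Y + I_X               # relation (70): I_Y = I_XY - I_X
    M = np.eye(4) - np.linalg.solve(I_XY, I_Y)   # rate matrix of (67)
    rates.append(max(abs(np.linalg.eigvals(M))))
```

With the small I_X the spectral radius of the rate matrix is near zero (fast local convergence); with the large I_X it approaches one.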


Furthermore, since I_XY(θ′) and I_Y(θ′) are both non-negative by construction, an immediate consequence of (71) is that the eigenvalues of I − I_XY⁻¹(θ′)I_Y(θ′) must be real and lie in the interval [0, 1], since (assuming I_Y(θ′) > 0)

λ_i{I − I_XY⁻¹(θ′)I_Y(θ′)} = 1 − λ_i{ I_Y^{T/2}(θ′) I_XY⁻¹(θ′) I_Y^{1/2}(θ′) } ∈ [0, 1].  (72)

Here λ_i{M} denotes the ith eigenvalue of a matrix M, and I_Y^{1/2}(θ′) is a positive definite matrix such that I_Y(θ′) = I_Y^{1/2}(θ′) I_Y^{T/2}(θ′). This implies that locally around θ′ the EM algorithm can be expected to display an exponential convergence rate.

6. Computational load

The final performance issue to be addressed is that of the computational load of the proposed EM Algorithm 4.1. This will be addressed by comparing the load relative to the main alternative for ML estimation of MIMO systems, which is that provided by recent methods of gradient search via data-driven local co-ordinates (DDLC) (Bergboer et al., 2002; Ljung, 2004; McKelvey et al., 2004). In addressing computational overhead, there are two key issues. First, there is the requirement per iteration, which we will measure here in terms of necessary floating point operations (FLOPs). Second, it is important to consider the number of iterations. Unfortunately, it is impossible to be precise with regard to the latter. The experience of the authors is that (as will be demonstrated in the next section) the number of iterations required by the EM methods proposed here is typically approximately equal to the number required by the alternative DDLC method. In consideration of this, the paper will profile computational requirements via FLOP load per iteration, and for this purpose a detailed audit of the FLOP count for the key stages of the robust implementation of EM Algorithm 4.1 is provided in Table 1 (recall that m is the number of inputs, p is the number of outputs, n is the model state dimension, and N is the data length). The FLOP counts provided are for the computation of all N quantities, such as P^{1/2}_{t|t}, where necessary.

Table 1
FLOP count for each of the key steps of the robust implementation of EM Algorithm 4.1

Element            Eqs.        Number of FLOPs required
{P^{1/2}_{t|t−1}}  (57)        (10/3) n³N
{P^{1/2}_{t|t}}    (56)        (32/3) n³N
{K_t}              (40)        (7pn + (1/3)p² + (5/2)p + 2n²) pN
{x̂_{t|t}}          (42), (43)  (2n² + O(n·max(m, p))) N
{P^{1/2}_{t|N}}    (53)        (56/3) n³N
{J_t}              (34)        (4n + 2) n²N
{x̂_{t|N}}          (35)        (3 + 2m + 4n + 2p) nN
{M_{t|N}}          (37)        (8n + 2) n²N
Φ, Ψ, Σ            (23), (25)  (9n² + O(n·max(m, p))) N
Γ                  (46)        (m + n)² (2 + (7/3)n + (1/3)m + 2p)
Π̄                  (49)        (1/3)(2n + m + p)³ + (n + p)³

Therefore, assuming the typical case of the state dimension n being equal to or larger than the input/output dimensions m, p, the total FLOP count per iteration required by a robust implementation of EM Algorithm 4.1 is O(n³N). Turning now to a DDLC approach, we refer the reader to (Bergboer et al., 2002; Ljung, 2004; McKelvey et al., 2004) for the details of the method. The essential point is that at each iteration, since the methods involved are gradient-search based, the gradient itself needs to be computed for each of the components parametrizing the n(m + 2p) + mp dimensional set of minimal representations. This involves running an n-state filter, with FLOP count O(n²N) (there is no sparsity in any of the matrices involved), for each of these n(m + 2p) + mp parameters, which leads to a FLOP count of O(max(p, m)n³N). For low numbers of inputs and outputs, these FLOP counts of O(n³N) and O(max(p, m)n³N) for (respectively) EM Algorithm 4.1 and a DDLC gradient-search approach are roughly equivalent, and establish Algorithm 4.1 as competitive from a computational load point of view. However, for larger numbers of inputs and outputs, there can be an appreciable difference in the FLOP count, due to the DDLC count being that of Algorithm 4.1 (which is O(n³N)) scaled by max(p, m).

7. Empirical study

Having derived a robust implementation of EM Algorithm 4.1 and provided a theoretical analysis of its properties, this section provides a brief empirical study of its performance. To begin, we compare the results obtained by Algorithm 4.1 to those provided by the main alternatives, which are subspace-based techniques and DDLC gradient-search approaches. The comparison method used here is that employed in (Ljung, 2003) for a similar profiling purpose, wherein a Monte-Carlo approach is used. To be more specific, our first empirical study involves the simulation and estimation of 250 different, stable, two-input, two-output, 5th-order systems in innovations form that were generated randomly using the Matlab drss() command. For each of these 250 estimation experiments, a new length N = 500 i.i.d. input realisation u_t ∼ N(0, I) and a new i.i.d. innovations realisation e_t ∼ N(0, 0.04I) were generated. Three estimates were formed for each of these 250 data sets. Firstly, a (CVA-weighted) subspace estimate (Ljung, 1999) was found using the Matlab command n4sid() with the options Focus and N4Weight set to ‘Prediction’

[Figure 1 appears here: two scatter plots of prediction error cost, (a) EM-based method versus CVA-weighted subspace method, and (b) EM-based method versus gradient-based method.]

Fig. 1. Simulation 1—Monte-Carlo study results. The left-hand figure shows a comparison of the CVA weighted subspace and EM algorithm ML estimates. The right-hand figure shows a comparison of DDLC parametrised gradient-search versus EM algorithm ML estimates.

and ‘CVA’, respectively. Secondly, a DDLC-based gradient search, initialised at the subspace estimate, was used to find an ML estimate as proposed in McKelvey et al. (2004) and implemented via the Matlab pem() function (with the algorithm’s Tolerance set to 0, and Focus to Prediction). Finally, a further ML estimate was found via EM Algorithm 4.1. In the latter case, the algorithm was initialised with the subspace-based estimate and with the arbitrarily selected noise covariance estimates of Q = I and R = 0.2I. Both iterative algorithms were forced to run for 100 iterations. For each of these three estimation methods, and for each of the 250 estimation experiments, the mean prediction error cost associated with the estimated model,

E_N(θ̂_k) ≜ (1/(pN)) Σ_{t=1}^{N} (y_t − ŷ_{t|t−1}(θ̂_k))ᵀ (y_t − ŷ_{t|t−1}(θ̂_k)),  (73)

was computed, where the term ŷ_{t|t−1}(θ̂_k) above is the steady-state mean-square optimal one-step-ahead predictor defined by θ̂_k. The performance of Algorithm 4.1, as measured by E_N(θ̂_k), relative to subspace and DDLC gradient search is then illustrated in Fig. 1. There, each star represents one data set and two estimation experiments, with the co-ordinates of each star determined by the value of E_N(θ̂_k) for each of the two estimation methods. A star on the indicated diagonal line represents equal cost E_N(θ̂_k) for both methods, while a star below the line represents lower cost (and hence superiority in a prediction error sense) for the method profiled via the vertical axis. With this in mind, Fig. 1(a) illustrates that following an initial subspace estimation step by a further EM algorithm refinement will usually lead to a final estimate that is superior in a mean-square prediction error sense. Of course, under Gaussian conditions, it will also always lead to an estimate that is superior in an ML sense, as the preceding theoretical analysis has established. Fig. 1(b) illustrates a further point. Namely, in terms of choosing between either a DDLC co-ordinate gradient search or the EM-based Algorithm 4.1 as a means of refining an initial subspace-based estimate, Algorithm 4.1 is quite competitive; there were many realisations where the EM method led to substantially lower cost than DDLC, and only a few where EM led to higher cost, in which cases the relative difference was smaller (than in the reverse case). Of course, no general conclusion can be drawn on the basis of specific, even Monte-Carlo based, simulation examples. However, the authors have noted that, over a very much wider range of problems than profiled here, the EM-based search appears to be particularly robust in avoiding capture in local minima, but at the expense of best-case convergence speed. To provide an illustration of this latter trade-off, consider the case of a particular 8th-order, two-input, two-output fixed multivariable system [A, B, C, D] which is specified as

[A, B, C, D] ≜ ZOH{ [ 1/((s+1)(s+0.1))     3/((s+2)(s+0.5)) ;
                      1/((s+0.7)(s+0.3))   1/((s+0.4)(s+0.8)) ] }

and for which N = 1000 data points are simulated with i.i.d. input u_t ∼ N(0, I) and i.i.d. measurement noise e_t ∼ N(0, 0.125I). This simulation with different input

[Figure 2 appears here: two semi-log plots of mean prediction error cost versus iteration number (0–50), showing gradient-based (average), gradient-based (best) and EM (average) trajectories.]

Fig. 2. Simulation 2—Sample average prediction-error cost E_N(θ̂_k) evolution for DDLC parametrized gradient search via Matlab 7.0’s pem() function (dashed line) versus EM Algorithm 4.1 (solid line). Also shown is the best-case performance of the pem() function (dot-dashed lines) and the global minimum value of the expected likelihood E{L(θ)} (dotted line). (a) shows the case of Gaussian measurement noise and (b) shows the case of uniformly distributed measurement noise.

and noise realisations was repeated until 100 ‘successful’ DDLC co-ordinate gradient-based-search estimates were found from an initialisation of

G(q, θ₀) ≜ [ 0.1/(q² − q + 0.25)        0.1/(q² − 1.40q + 0.49) ;
             0.1/(q² − 1.20q + 0.36)    0.1/(q² − 0.80q + 0.16) ].

Here, ‘successful’ was taken to mean that the final prediction error variance was within 30% of the measurement noise variance, and for each data set an ML estimate was also found by EM Algorithm 4.1. The sample-average trajectory of diminishing prediction error cost E_N(θ̂_k) over these 100 successful runs is shown in Fig. 2 as the solid line. The left-hand plot shows the case of Gaussian measurement noise, while the right-hand plot illustrates uniformly distributed measurement noise. Note that in both cases, the best-case DDLC parametrised gradient-search performance over all the simulation runs was superior to the average behaviour of the EM algorithm. However, there is a factor that balances this best-case performance. Due to high variability in the DDLC gradient-search performance over the simulation runs, the average performance, after ‘non-successful’ runs were censored from the averaging, is somewhat inferior to the average EM algorithm convergence rate, for which there were no non-successful runs requiring culling. Indeed, due to the observed low variability in EM algorithm behaviour, its average performance is representative of its worst case as well. While again this simulation provides only one specific robustness example, the experience of the authors is that it appears to be a more general principle. Namely, in comparison to DDLC-based gradient search, the EM algorithm presented in this paper is particularly reliable in its ability to avoid becoming locked in local minima, but perhaps at the cost of a slower convergence rate. A further robustness aspect is illustrated in the right-hand plot of Fig. 2, wherein the case of uniformly distributed measurement noise is presented. This is a situation where the assumptions underlying the ML criterion employed in this paper are violated, but nevertheless the robust EM algorithm presented here still converges reliably to an estimate that is the global minimiser of the prediction error. Note that in this case, no non-convergent trajectories were culled in computing the average EM behaviour, but the 9% of the DDLC gradient-search runs which did not converge were censored before the average trajectory illustrated in Fig. 2(b) was computed. As before, the best-case performance of the gradient-based scheme was significantly faster than the average for the EM algorithm.
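The prediction error cost (73) used throughout these comparisons is cheap to evaluate once a model is available. Below is a minimal numpy sketch (function and variable names are our own) of the one-step-ahead predictor implied by an innovations-form model, together with the resulting cost; as a sanity check, when the data are generated by the model itself the cost approaches the innovations variance (0.04, matching the e_t ∼ N(0, 0.04I) of Simulation 1).

```python
import numpy as np

def en_cost(A, B, C, D, K, y, u):
    """Mean prediction error cost (73) for the innovations-form model
    x_{t+1} = A x_t + B u_t + K e_t,  y_t = C x_t + D u_t + e_t."""
    N, p = y.shape
    xh = np.zeros(A.shape[0])
    cost = 0.0
    for t in range(N):
        yh = C @ xh + D @ u[t]             # one-step-ahead prediction
        e = y[t] - yh                      # prediction error
        cost += e @ e
        xh = A @ xh + B @ u[t] + K @ e     # predictor state update
    return cost / (p * N)

rng = np.random.default_rng(4)
n, m, p, N = 3, 2, 2, 2000
A = np.diag([0.5, 0.3, -0.2]); B = rng.standard_normal((n, m))
C = rng.standard_normal((p, n)); D = np.zeros((p, m))
K = 0.1 * rng.standard_normal((n, p))
x = np.zeros(n); u = rng.standard_normal((N, m)); y = np.zeros((N, p))
for t in range(N):
    e = 0.2 * rng.standard_normal(p)       # innovations with variance 0.04
    y[t] = C @ x + D @ u[t] + e
    x = A @ x + B @ u[t] + K @ e
cost = en_cost(A, B, C, D, K, y, u)
```

Because the predictor is started from the same (zero) initial state used to generate the data, the prediction errors coincide with the simulated innovations, so `cost` equals their sample variance.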

8. Conclusions

The contribution of this paper is to derive a numerically robust and algorithmically reliable method for the estimation of possibly high-dimension LTI MIMO systems. The key principle underlying the methods proposed here is the expectation–maximisation technique. Attractive features of the associated algorithm derived and studied here include numerical robustness, the ability to deal with non-smooth likelihood surfaces, ease of initial condition estimation, moderate computational cost which scales well with the number of model states and input–output dimension, and robustness to local-minima attraction. Balancing this, the convergence rates of the EM methods derived here have been observed to be generally slower than the best-case performance of the main alternative method of DDLC parametrized gradient search, although the average behaviour of the two methods has been observed by the authors to be equivalent. For readers wishing to evaluate the performance of the EM algorithm methods discussed in this paper for themselves, a Matlab-based software package that implements them is freely (for non-commercial use) available from the corresponding author on request.

Appendix A. Proof of Theorem 1

Proof. The following definitions

A ≜ (1/N) Σ_{t=1}^{N} x̂_{t|N} x̂ᵀ_{t|N},   B ≜ (1/N) Σ_{t=1}^{N} x̂_{t|N} uᵀ_t,
C ≜ (1/N) Σ_{t=1}^{N} u_t uᵀ_t,   D ≜ (1/N) Σ_{t=1}^{N} P_{t|N}

permit the expression

(1/N) Σ_{t=1}^{N} E_{θ̂_k}{ [x_t; u_t][xᵀ_t, uᵀ_t] | Y_N, U_N } = [A + D, B; Bᵀ, C].

Furthermore,

[A, B; Bᵀ, C] ≥ 0

and therefore

[A − BC⁻¹Bᵀ, 0; 0, C] = [I, −BC⁻¹; 0, I] [A, B; Bᵀ, C] [I, 0; −C⁻¹Bᵀ, I] ≥ 0,

so that, by virtue of the fact that C > 0 (which is condition (60)), A − BC⁻¹Bᵀ ≥ 0. Since the model is controllable and observable and Π is positive definite, Lemma C.4 proves that D > 0. Therefore,

[A + D − BC⁻¹Bᵀ, 0; 0, C] > 0

and thus

[A + D, B; Bᵀ, C] = [I, BC⁻¹; 0, I] [A + D − BC⁻¹Bᵀ, 0; 0, C] [I, 0; C⁻¹Bᵀ, I] > 0.  □

Appendix B. Proof of Theorem 4

Proof. Employing Fisher’s identity,

∂L(θ̂_k)/∂θ = ∂Q(θ, θ̂_k)/∂θ |_{θ=θ̂_k}.  (B.1)

Also, by Taylor expansion about θ̂_{k+1} (at which point ∂Q(θ, θ̂_k)/∂θ vanishes),

∂Q(θ, θ̂_k)/∂θ |_{θ=θ̂_k} = [ ∂²Q(θ, θ̂_k)/∂θ∂θᵀ |_{θ=θ̂_{k+1}} ] (θ̂_k − θ̂_{k+1}) + o(‖θ̂_{k+1} − θ̂_k‖²).  (B.2)

A linear Taylor expansion of ∂L(θ)/∂θ about θ̂_k and evaluated at θ = θ′ (note ∂L(θ′)/∂θ = 0) provides

[ ∂²L(θ̂_k)/∂θ∂θᵀ ] (θ̂_k − θ′) = ∂L(θ̂_k)/∂θ + o(‖θ′ − θ̂_k‖²),

and combining Eqs. (B.1) and (B.2) then delivers

[ ∂²L(θ̂_k)/∂θ∂θᵀ ] (θ̂_k − θ′) = [ ∂²Q(θ, θ̂_k)/∂θ∂θᵀ |_{θ=θ̂_{k+1}} ] (θ̂_k − θ̂_{k+1}) + o(‖θ̂_{k+1} − θ̂_k‖²) + o(‖θ′ − θ̂_k‖²).  (B.3)

Since Lemma C.2 proves that −∂²Q(θ, θ̂_k)/∂θ∂θᵀ |_{θ=θ̂_{k+1}} > 0, the result follows directly from Eq. (B.3). □

Appendix C. Technical lemmata

Lemma C.1. Suppose M, N ∈ R^{n×n} and M is invertible. Then

∂/∂M log det M = M⁻ᵀ,
∂/∂M Tr{M⁻¹N} = −M⁻ᵀ Nᵀ M⁻ᵀ,
∂/∂M Tr{MN} = Nᵀ,
d vec{M⁻¹} / d vec{M}ᵀ = −M⁻ᵀ ⊗ M⁻¹.

Proof. See Magnus and Neudecker (1988). □
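The identities of Lemma C.1 are easy to confirm numerically by comparing the closed forms against finite-difference and first-order perturbation approximations; a sketch (all names and step sizes are our own choices):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 4
M = rng.standard_normal((n, n)) + n * np.eye(n)   # well-conditioned, invertible
Nm = rng.standard_normal((n, n))
h = 1e-6
Minv = np.linalg.inv(M)

# d/dM log det M = M^{-T}, checked entrywise by central differences
grad_fd = np.zeros((n, n))
tr_fd = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        E = np.zeros((n, n)); E[i, j] = h
        grad_fd[i, j] = (np.log(np.linalg.det(M + E))
                         - np.log(np.linalg.det(M - E))) / (2 * h)
        # d/dM Tr{M^{-1} N} = -M^{-T} N^T M^{-T}, same scheme
        tr_fd[i, j] = (np.trace(np.linalg.inv(M + E) @ Nm)
                       - np.trace(np.linalg.inv(M - E) @ Nm)) / (2 * h)
grad_analytic = Minv.T
tr_analytic = -Minv.T @ Nm.T @ Minv.T

# d vec{M^{-1}} / d vec{M}^T = -(M^{-T} kron M^{-1}): perturbation check
dM = h * rng.standard_normal((n, n))
lhs = (np.linalg.inv(M + dM) - Minv).reshape(-1, order="F")
rhs = -np.kron(Minv.T, Minv) @ dM.reshape(-1, order="F")
```

Note that `order="F"` implements the column-stacking vec{·} operator assumed by the Kronecker-product identity.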



Lemma C.2. Let the parameter vector θ̂_k parametrize a controllable and observable system with Π, P₁ > 0. Assume that the input condition (60) is satisfied. Then

−∂²Q(θ, θ̂_k)/∂θ∂θᵀ |_{θ=θ̂_{k+1}}
  = diag[ NΣ ⊗ Π̄⁻¹,  P⁻¹_{1|N},  (N/2) Π̄⁻¹ ⊗ Π̄⁻¹,  (1/2) P⁻¹_{1|N} ⊗ P⁻¹_{1|N} ] > 0,  (C.1)

where Π̄ and Q(θ, θ̂_k) are defined by Eqs. (15) and (22), the matrix Γ is defined by Eq. (46), and Φ, Ψ and Σ are defined in Eqs. (23) and (25) with θ = θ̂_k.

Proof. Using Lemma 3.1, Lemma C.1 and the product rule,

∂Q(θ, θ̂_k)/∂Γ = −(N/2) ∂/∂Γ Tr{ Π̄⁻¹ [Φ − ΓΨᵀ − ΨΓᵀ + ΓΣΓᵀ] } = N Π̄⁻¹(Ψ − ΓΣ).  (C.2)

Therefore, using standard Kronecker product identities (Magnus and Neudecker, 1988),

∂Q(θ, θ̂_k)/∂vec{Γ} = N vec{Π̄⁻¹Ψ} − N(Σ ⊗ Π̄⁻¹) vec{Γ} = N[(Ψ − ΓΣ)ᵀ ⊗ I] vec{Π̄⁻¹},  (C.3)

and hence, since the partial derivative of (C.3) with respect to the elements of μ and P₁ will be zero,

∂²Q(θ, θ̂_k)/∂vec{Γ}∂vec{Γ}ᵀ |_{θ=θ̂_{k+1}} = −N Σ ⊗ Π̄⁻¹,

while Eqs. (46) and (C.3) yield

∂²Q(θ, θ̂_k)/∂vec{Γ}∂vec{Π̄}ᵀ |_{θ=θ̂_{k+1}} = N[(Ψ − ΓΣ)ᵀ ⊗ I] (∂vec{Π̄⁻¹}/∂vec{Π̄}ᵀ) |_{θ=θ̂_{k+1}} = 0.

We now turn to calculating the partial derivatives of Q(θ, θ̂_k) with respect to the parameters of Π̄, via Lemma C.1 and the product rule:

∂²Q(θ, θ̂_k)/∂vec{Π̄}∂vec{Π̄}ᵀ
  = (N/2) Π̄⁻¹ ⊗ Π̄⁻¹
  − (N/2) (([Φ − ΓΨᵀ − ΨΓᵀ + ΓΣΓᵀ] Π̄⁻¹)ᵀ ⊗ I)(Π̄⁻¹ ⊗ Π̄⁻¹)
  − (N/2) (I ⊗ (Π̄⁻¹ [Φ − ΓΨᵀ − ΨΓᵀ + ΓΣΓᵀ]))(Π̄⁻¹ ⊗ Π̄⁻¹).

However, according to (46),

Φ − ΓΨᵀ − ΨΓᵀ + ΓΣΓᵀ = Φ − ΨΣ⁻¹Ψᵀ = Π̄,

so that

∂²Q(θ, θ̂_k)/∂vec{Π̄}∂vec{Π̄}ᵀ |_{θ=θ̂_{k+1}} = −(N/2) Π̄⁻¹ ⊗ Π̄⁻¹.
The elements of ∂²Q(θ, θ̂_k)/∂θ∂θᵀ |_{θ=θ̂_{k+1}} associated with the matrix elements of P₁ and μ are computed in a similar manner. This completes the proof on noting that, under the assumptions of the lemma, the further Lemma C.4 establishes that P_{1|N} is also positive definite. □

Lemma C.3. Let f and g be non-negative and integrable functions with respect to a measure μ, and let S be the region in which f > 0. If ∫_S (f − g) dμ ≥ 0, then

∫_S f log(f/g) dμ ≥ 0,  (C.4)

with equality if and only if f = g almost everywhere.

Proof. The following is based on a partial argument presented in Rao (1965). First, using the well-known fact that log x ≤ x − 1,

∫_S f log(f/g) dμ = −∫_S f log(g/f) dμ ≥ −∫_S f (g/f − 1) dμ = ∫_S (f − g) dμ ≥ 0,

which establishes the first part of the lemma. To prove the second part, note that the sufficiency of f = g a.e. for equality is trivial. The necessity can be established by noting that log x = (x − 1) − (x − 1)²/(2ξ²) for some ξ between 1 and x, and therefore

∫_S f log(g/f) dμ = ∫_S (g − f) dμ − ∫_S (f/(2ξ²)) (g/f − 1)² dμ.  (C.5)

By assumption, the first term on the right-hand side of (C.5) is less than or equal to zero. Similarly, the second term can be no greater than zero. Since the whole expression is equal to zero, it is necessary to choose f and g so that both terms are uniformly zero. Noting that f > 0 and that (g/f − 1)² ≥ 0, this implies that

∫_S (f/(2ξ²)) (g/f − 1)² dμ = 0

only when f = g. Since this also makes the first term of (C.5) equal to zero, the proof is complete. □
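A discrete analogue of Lemma C.3 (sums in place of integrals) makes the inequality and its equality case easy to check numerically; the data below are our own illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(6)
# f > 0 on S; scale g so that sum(f - g) >= 0, as the lemma requires
f = rng.random(10) + 0.1
g = rng.random(10) + 0.1
g *= 0.9 * f.sum() / g.sum()
lhs = np.sum(f * np.log(f / g))      # discrete left-hand side of (C.4)
lhs_eq = np.sum(f * np.log(f / f))   # equality case f = g gives exactly zero
```

Any choice of positive `f` and `g` satisfying the sum condition yields a non-negative `lhs`, and `lhs` vanishes only when `f` and `g` coincide.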


Lemma C.4. Consider the system (11), (14). Let the pair (A, Q^{1/2}) (defined by Eq. (38)) be controllable, the pair (C, A) be observable, and the matrices Π and P₁ positive definite. Then there exist bounded constants β₁, β₂ > 0 such that

β₁I ≤ P_{t|N} ≤ β₂I  ∀t ≥ 1.  (C.6)

Proof. According to Lemmas 7.2 and 7.3 of Jazwinski (1970), there exist positive bounded constants α₁ and α₂ such that

α₁I ≤ P_{t|t} ≤ α₂I  (C.7)

for all t ≥ 1. We shall now prove that P_{t|N} is bounded below. Via Eqs. (40), (41) and (C.7), for any x with ‖x‖ = 1,

xᵀ P_{t|t−1} x = xᵀ P_{t|t} x + xᵀ P_{t|t−1} Cᵀ (C P_{t|t−1} Cᵀ + R)⁻¹ C P_{t|t−1} x ≥ xᵀ P_{t|t} x ≥ α₁  (C.8)

for all t ≥ 1. Using the matrix inversion lemma and (34), (39), (36) yields

P_{t|N} = P_{t|t} + J_t (P_{t+1|N} − P_{t+1|t}) J_tᵀ
        = (P_{t|t}⁻¹ + Aᵀ Q⁻¹ A)⁻¹ + P_{t|t} Aᵀ P_{t+1|t}⁻¹ P_{t+1|N} P_{t+1|t}⁻¹ A P_{t|t}.  (C.9)

Since A is bounded and Q > 0, it follows from Eq. (C.7) that the first term on the right-hand side of Eq. (C.9) is positive definite. The second term is positive semi-definite by construction so that, for an arbitrary vector x satisfying ‖x‖ = 1,

xᵀ P_{t|N} x ≥ xᵀ (P_{t|t}⁻¹ + Aᵀ Q⁻¹ A)⁻¹ x ≥ α₃ > 0  ∀t ≥ 1.

To establish that P_{t|N} is bounded above, notice that if P_{t+1|N} ≤ P_{t+1|t}, then via (36) and (C.8),

P_{t|N} = P_{t|t} − J_t (P_{t+1|t} − P_{t+1|N}) J_tᵀ ≤ P_{t|t} ≤ P_{t|t−1}.

The proof is completed on noting that, via Eq. (C.8), P_{N|N} ≤ P_{N|N−1}. □

References

Bergboer, N. H., Verdult, V., & Verhaegen, M. H. G. (2002). An efficient implementation of maximum likelihood identification of LTI state-space models by local gradient search. In Proceedings of the 41st IEEE CDC, Las Vegas, 2002.
Borran, M. J., & Aazhang, B. (2002). EM-based multiuser detection in fast fading multipath environments. In Proceedings of EURASIP (Vol. 8) (pp. 787–796).
Boyles, R. A. (1983). On the convergence of the EM algorithm. Journal of the Royal Statistical Society, Series B, 45(1), 47–50.
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39, 1–38.


Dennis, J. E., & Schnabel, R. B. (1983). Numerical methods for unconstrained optimization and nonlinear equations. Englewood Cliffs, NJ: Prentice-Hall.
Golub, G. H., & Van Loan, C. F. (1989). Matrix computations. Baltimore, MD: Johns Hopkins University Press.
Goodwin, G. C., & Feuer, A. (1999). Estimation with missing data. Mathematical and Computer Modelling of Dynamical Systems, 5(3), 220–244.
Isaksson, A. (1993). Identification of ARX models subject to missing data. IEEE Transactions on Automatic Control, 38(5), 813–819.
Jazwinski, A. H. (1970). Stochastic processes and filtering theory. New York: Academic Press.
Kailath, T., Sayed, A. H., & Hassibi, B. (2000). Linear estimation. Upper Saddle River, NJ: Prentice-Hall.
Ljung, L. (1999). System identification: Theory for the user (2nd ed.). New Jersey: Prentice-Hall.
Ljung, L. (2003). Aspects and experiences of user choices in subspace identification methods. In Proceedings of the 13th IFAC symposium on system identification (pp. 1802–1807). Rotterdam, The Netherlands.
Ljung, L. (2004). MATLAB system identification toolbox users guide, Version 6. The MathWorks.
Magnus, J. R., & Neudecker, H. (1988). Matrix differential calculus with applications in statistics and econometrics. Chichester, UK: Wiley.
The MathWorks (2004). MATLAB users guide, Version 7. The MathWorks.
McKelvey, T., Helmersson, A., & Ribarits, T. (2004). Data driven local co-ordinates for multivariable linear systems with application to system identification. Automatica, 40(9), 1629–1635.
Meng, X.-L., & Rubin, D. B. (1994). On the global and componentwise rates of convergence of the EM algorithm. Linear Algebra and its Applications, 199, 413–425.
Pintelon, R., Schoukens, J., McKelvey, T., & Rolain, Y. (1996). Minimum variance bounds for overparameterized models. IEEE Transactions on Automatic Control, 41(5), 719–720.
Rao, C. R. (1965). Linear statistical inference and its applications. New York: Wiley.
Shumway, R. H. (1982). An approach to time series smoothing and forecasting using the EM algorithm. Journal of Time Series Analysis, 3(4), 253–264.
Shumway, R. H., & Stoffer, D. S. (2000). Time series analysis and its applications. Berlin: Springer.
Söderström, T., & Stoica, P. (1989). System identification. New York: Prentice-Hall.
Wu, C. F. (1983). On the convergence properties of the EM algorithm. The Annals of Statistics, 11(1), 95–103.

Stuart Gibson received his B.Sc. degree in 1993, his B.Eng. degree in Computer Engineering in 1996, and his Ph.D. degree in Electrical Engineering in 2004, all conferred by the University of Newcastle, Australia. Between 1995 and 1997 he worked as a physicist in the M.R.I. Unit of John Hunter Hospital, Newcastle, before joining the Centre for Integrated Dynamics and Control (CIDAC) in 1997, where he stayed until 2000. He is currently working in the financial industry.

Brett Ninness was born in 1963 in Singleton, Australia, and received his B.E., M.E. and Ph.D. degrees in Electrical Engineering from the University of Newcastle, Australia in 1986, 1991 and 1994, respectively. He has been with the School of Electrical Engineering and Computer Science at the University of Newcastle since 1993, where he is currently an Associate Professor. Together with Håkan Hjalmarsson, he is jointly


organising the 14th IFAC Symposium on System Identification to be held in Newcastle in 2006. He is a member of the editorial boards of Automatica, and IEE Control Theory and Applications, and has published more than 100 articles in the areas of system identification, stochastic signal processing and control. Details of his research and teaching activities are available at sysid.newcastle.edu.au/brett.