Automatica 41 (2005) 359 – 376 www.elsevier.com/locate/automatica
Asymptotic properties of subspace estimators☆

Dietmar Bauer*

Institute for Mathematical Methods in Economics, TU Wien, Argentinierstr. 8, A-1040 Wien, Austria

Received 27 January 2004; received in revised form 22 July 2004; accepted 12 November 2004. Available online 18 January 2005.

☆ This paper was not presented at any IFAC meeting. This paper was recommended for publication in revised form by Guest Editors Torsten Söderström, Paul van den Hof, Bo Wahlberg and Siep Weiland.
* Fax: +43 1 58801 11999. E-mail address: [email protected] (D. Bauer).

doi:10.1016/j.automatica.2004.11.012. © 2004 Elsevier Ltd. All rights reserved.

Abstract

Since the proposal of subspace methods in the 1980s and early 1990s substantial efforts have been made in the analysis of the statistical properties of the algorithms. This paper surveys the literature on the asymptotic properties of particular subspace methods used for linear, dynamic, time invariant, discrete time systems. The goals of this paper are threefold: First, this survey tries to present the most relevant results on the asymptotic properties of estimators obtained using subspace methods. Secondly, the main methods and tools that have been used in the derivation of these results are presented in order to make the literature more accessible. Thirdly, the main unsolved questions and rewarding research topics are identified, some of which can be attacked using the toolbox discussed in the paper.

Keywords: Linear systems; Discrete time systems; Subspace methods; Asymptotic properties

1. Introduction

Since the seminal survey of Viberg (1995) on subspace methods for the identification of linear dynamic models the literature on subspace methods has grown substantially. The development can be very coarsely divided into four directions of research (without any particular ordering): (1) extensions of the basic underlying idea to different model classes; (2) development of algorithms for online applicability and monitoring purposes; (3) application of the methods; (4) analysis of the asymptotic properties of the estimators.

Each of these topics would warrant a separate survey paper. In this paper I will focus on the fourth topic. In particular the scope is constrained to the case of linear, dynamical, time invariant, discrete time systems since this to the best of my knowledge is the only class of models where the asymptotic properties of the algorithms have been investigated to a certain extent.

In order to survey asymptotic properties of subspace methods it is necessary to make the definition of the methods under investigation more concrete. The term subspace methods unfortunately is used for a wide variety of quite different algorithms. All subspace methods are formulated in the state space framework for modelling the dependence of some measured output process (y_t)_{t∈Z}, where y_t ∈ R^s, on an observed input process (u_t)_{t∈Z}, where u_t ∈ R^m, and lagged values of the inputs and outputs. A common feature of all model classes for which subspace methods have been developed is that the vector of stacked outputs y_{t+j}, j = 0, 1, …, f − 1, can be additively decomposed into a linear function of the n-dimensional state x_t, a possibly nonlinear function of u_{t+j}, j = 0, …, f − 1, and a noise component which according to the models is uncorrelated with the remaining two terms. This last term equals the prediction errors of a prediction of y_{t+j}, j ≥ 0, based on (u_t)_{t∈Z} and y_s, s < t, and the sum of the first two terms equals the mean square prediction. The term 'subspace' then is motivated by the fact that for fs > n the matrix describing the linear mapping relating the n-dimensional state to the vector
of predictions for the outputs (y_{t+j})_{j=0,…,f−1} (conditional on known inputs) is of dimension fs × n, defining a subspace of R^{fs} via its column space. An alternative motivation for the term subspace methods also exists: A particular class of algorithms tries to estimate the state in a first step. Interpreting least-squares predictions as projections of random variables in an appropriately defined Hilbert space, the projections of (y_{t+j})_{j=0,…,f−1} given (u_{t+j})_{j=0,…,f−1} onto the whole past of inputs and outputs span an n-dimensional subspace, providing another interpretation for the origin of the term (for details cf. Picci & Katayama, 1996, and the references therein).

The ways in which this dimensionality reduction to the n-dimensional subspace is exploited differ fundamentally between the proposed algorithms, however, suggesting the consideration of two classes of procedures, see below.¹ This is true even for the class of linear dynamical models analyzed in this paper. In Bauer (2003) a detailed description of many of the most prominent algorithms proposed in the literature for this model class is given. The differences between the various proposed algorithms in many instances are minor details which do not require additional tools for the analysis of the asymptotic properties. Hence this survey does not attempt to obtain results for all different variants; we will rather restrict attention to the main algorithms described in detail below.

¹ This point of view does not seem to be widely accepted in the community.

The plan of this paper is as follows: In the next section the stochastic properties of the considered model class are discussed. Section 3 discusses in detail the subspace algorithms that are dealt with in this paper. Subsequently consistency of the various estimators is investigated in Section 4 and questions of asymptotic bias are discussed. Section 5 deals with asymptotic normality and examines expressions for the asymptotic variance under the assumption of correctly specified order of the system. Order estimation techniques are described in Section 6. Topics related to closed-loop operation conditions are briefly discussed in Section 7. Finally Section 8 concludes the paper. The general style of presentation is intended to be accessible for postgraduate students with some background in statistics. This implies that especially the discussion of the model set is lengthier than otherwise required.

Throughout the paper the following notation will be used: ‖X‖ denotes the two norm of the matrix X or the Euclidean norm of the vector X, respectively. By → we denote convergence in probability if not stated explicitly otherwise. A.s. will abbreviate almost sure. For a sequence of random matrices F_T the notation F_T = o(g_T) means that ‖F_T‖/g_T → 0 a.s., and F_T = O(g_T) means that there exists a constant M < ∞ such that lim sup_{T→∞} ‖F_T‖/g_T ≤ M a.s.; o_P(g_T) and O_P(g_T) denote the corresponding in-probability versions. X′ will denote the transpose of a matrix or a vector and X > 0 means that the symmetric matrix X is
positive definite. An eigenvalue of maximum modulus of a matrix A will be denoted as λ_max(A).
2. State space systems

Subspace algorithms have been proposed for a number of different model classes. However, in this paper only the leading case of linear, time invariant, discrete time systems of the form (for t ∈ Z)

$$x_{t+1} = A x_t + B u_t + K \varepsilon_t, \qquad y_t = C x_t + D u_t + \varepsilon_t \qquad (1)$$

where the output data y_t ∈ R^s and the input data u_t ∈ R^m are observed for t = 1, …, T will be dealt with. Here (x_t)_{t∈Z} denotes the state sequence and (ε_t)_{t∈Z} the unobserved (weak) white noise. The noise ε_t and the input u_s are assumed to be independent for all t, s ∈ Z, i.e. (with the exception of Section 7) only open-loop operation conditions will be considered. A ∈ R^{n×n}, B ∈ R^{n×m}, C ∈ R^{s×n}, D ∈ R^{s×m} and K ∈ R^{n×s} are the matrices to be estimated. Additionally the innovation variance Σ = Eε_tε_t′ needs to be estimated. Throughout the paper the system will be assumed to be stable, i.e. |λ_max(A)| < 1 will be assumed. Furthermore, the strict minimum-phase condition |λ_max(A − KC)| < 1 will be imposed. The following material provides a crash course in linear state space modelling. A more extensive account of the subject can be found in Chapters 1 and 2 of Hannan and Deistler (1988) or in Chapter 4 of Ljung (1999).

Subspace methods inherently use black box modelling, i.e. no specific knowledge about the system matrices (A, B, C, D, K) is exploited. Let S_n denote the set of all quintuples of system matrices (A, B, C, D, K) where the state is n dimensional and where |λ_max(A)| < 1 and |λ_max(A − KC)| < 1 hold. The system equations (1) imply that

$$y_t = Cx_t + Du_t + \varepsilon_t = C(Ax_{t-1} + Bu_{t-1} + K\varepsilon_{t-1}) + Du_t + \varepsilon_t$$
$$= CA^{t-1}x_1 + Du_t + \varepsilon_t + \sum_{j=1}^{t-1} CA^{j-1}[Bu_{t-j} + K\varepsilon_{t-j}]$$
$$= CA^{t-1}x_1 + \sum_{j=0}^{t-1} L(j)u_{t-j} + \sum_{j=0}^{t-1} K(j)\varepsilon_{t-j}, \qquad (2)$$

where L(j) ∈ R^{s×m} and K(j) ∈ R^{s×s} are the so-called impulse response coefficients. This shows that apart from initial effects (i.e. CA^{t−1}x_1) the state space system is only a convenient way to represent the impulse response sequence.

Two state space systems (A, B, C, D, K) ∈ S_n and (Ã, B̃, C̃, D̃, K̃) ∈ S_ñ are called observationally equivalent if D = D̃ and CA^{j−1}[B, K] = C̃Ã^{j−1}[B̃, K̃], j ∈ N. An equivalent way to describe observational equivalence
is via the pair of transfer functions (l(z), k(z)) where for z ∈ C, |z| ≤ 1,

$$l(z) = \sum_{j=0}^{\infty} L(j)z^j = D + zC(I_n - zA)^{-1}B, \qquad k(z) = \sum_{j=0}^{\infty} K(j)z^j = I_s + zC(I_n - zA)^{-1}K.$$
These power series converge absolutely on the closed unit disc due to the stability assumption. l(z) and k(z) are rational functions of z ∈ C. Let U denote the set of all pairs of rational transfer functions (l(z), k(z)) such that k(0) = I_s, having no poles inside the closed unit disc and where k^{-1}(z) has no poles inside the closed unit disc. Define the mapping π : ⋃_{n≥0} S_n → U : (A, B, C, D, K) ↦ (l(z), k(z)) attaching pairs of transfer functions to state space systems. Then two systems are observationally equivalent iff π(A, B, C, D, K) = π(Ã, B̃, C̃, D̃, K̃). Note that observational equivalence does not imply that two systems have the same state dimension. A system is called minimal if there exists no observationally equivalent system having lower state dimension. For minimal systems the dimension of the state is called the order of the system. Let the subset of U which corresponds to systems of order n be denoted as M_n. Then U = ⋃_{n≥0} M_n. The geometrical structure of M_n is investigated in detail in Chapter 2 of Hannan and Deistler (1988). For (l(z), k(z)) ∈ M_n the set of minimal observationally equivalent systems π^{-1}(l(z), k(z)) ∩ S_n can be described using the set {T : T ∈ R^{n×n}, det T ≠ 0} of nonsingular matrices corresponding to a change of the state space basis: Two minimal systems (A, B, C, D, K) and (Ã, B̃, C̃, D̃, K̃) are observationally equivalent iff there exists a nonsingular matrix T such that

$$A = T\tilde A T^{-1}, \quad B = T\tilde B, \quad C = \tilde C T^{-1}, \quad D = \tilde D, \quad K = T\tilde K.$$

The concept of observational equivalence leads to a partitioning of S_n into equivalence classes. A description of the set of all equivalence classes is provided by canonical forms, which select for each pair (l(z), k(z)) one representative of the corresponding equivalence class. Hence canonical forms are defined as mappings φ : U → ⋃_{n≥0} S_n. One example of a canonical form are the echelon canonical forms (cf. e.g. Hannan & Deistler, 1988, Section 2.5.). Parameterizations of subsets M ⊂ U, i.e. bijective mappings ψ : M → Θ ⊂ R^p, are usually defined based on canonical forms. In order for a parameterization to be useful for the statistical analysis it has to possess certain properties (see Hannan & Deistler, 1988, Remark 4, p. 128, and Chapter 2, for a discussion). The two most relevant of these properties are continuity on M, implying that convergence in the set of transfer functions in M implies convergence of parameter vectors,²

² The topology in the parameter space is Euclidean and in U the so-called pointwise topology is used. See Hannan and Deistler (1988, p. 54) for a definition.
and differentiability of the corresponding mapping θ ↦ φ(ψ^{-1}(θ)) = (A(θ), B(θ), C(θ), D(θ), K(θ)) attaching system representations in the image of the canonical form to the parameter vectors, in order to allow for gradient techniques to be used in the optimization of criterion functions (such as the likelihood or the one step ahead prediction error).

It is known that for multivariate output (y_t)_{t∈Z} the set M_n cannot be parameterized continuously using only one mapping. However, for each pair (l(z), k(z)) ∈ M_n there exists a set M_{n,i}, indexed using a multi-index i and containing an open neighborhood of this pair, which is open and dense in M_n and which can be parameterized continuously. One construction of such a parameterization is based on overlapping forms as discussed in great detail in Hannan and Deistler (1988, Section 2.6.). Henceforth ψ_i : M_{n,i} → Θ_{n,i} ⊂ R^{2ns+nm+sm} will be used to denote the overlapping form corresponding to the multi-index i, and Θ_{n,i} denotes the corresponding parameter set which is an open subset of R^{2ns+nm+sm}. On M_{n,i} the mapping ψ_i is continuous.

Summing up the facts stated so far, we consider three different levels of system representations: pairs of rational transfer functions (l(z), k(z)) ∈ M_n, quintuples of system matrices (A, B, C, D, K) ∈ S_n and parameter vectors θ ∈ Θ_{n,i}. Each of these representations has its merits and we will frequently change the considered representation. The links between the representations have been described above. Despite the fact that knowledge of the properties of canonical and overlapping forms and parameterizations is not needed for the application of subspace algorithms, it is nevertheless a necessary requisite for the understanding of the properties of the estimates provided by the algorithms.

A note of caution is in order: The equivalence concept above does not include the effects of the initial state x_1. Typically subspace methods ignore the presence of initial effects, which is justified by asymptotic arguments in many cases. It should be kept in mind that although the initial state does not influence any of the asymptotic results to be presented, it has an effect on the finite sample properties of the algorithms and hence might be problematic in situations where a careful modelling of initial conditions is crucial.
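To fix ideas numerically, the following minimal sketch (ours, not from the paper; all matrix values and function names are illustrative assumptions) simulates the state space system (1) with numpy, verifies the stability and strict minimum-phase conditions, and computes the impulse response coefficients of (2):

```python
import numpy as np

# Illustrative system of order n = 2 with s = 1 output and m = 1 input;
# the numerical values are assumptions chosen for the example.
A = np.array([[0.7, 0.2], [0.0, 0.5]])
B = np.array([[1.0], [0.5]])
C = np.array([[1.0, 0.0]])
D = np.array([[0.0]])
K = np.array([[0.4], [0.1]])

# Stability and strict minimum-phase conditions of Section 2.
assert np.max(np.abs(np.linalg.eigvals(A))) < 1          # |lambda_max(A)| < 1
assert np.max(np.abs(np.linalg.eigvals(A - K @ C))) < 1  # |lambda_max(A - KC)| < 1

def simulate(A, B, C, D, K, u, eps, x1=None):
    """Generate y_1, ..., y_T from the state space system (1)."""
    T, n = u.shape[0], A.shape[0]
    x = np.zeros(n) if x1 is None else x1
    y = np.zeros((T, C.shape[0]))
    for t in range(T):
        y[t] = C @ x + D @ u[t] + eps[t]
        x = A @ x + B @ u[t] + K @ eps[t]
    return y

# Impulse response coefficients appearing in (2):
# L(0) = D, L(j) = C A^{j-1} B and K(0) = I_s, K(j) = C A^{j-1} K for j >= 1.
L_coef = [D] + [C @ np.linalg.matrix_power(A, j - 1) @ B for j in range(1, 11)]
```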
3. Subspace algorithms

The ultimate goal of the identification is to obtain estimates (l̂(z), k̂(z)) of a pair of transfer functions (l(z), k(z)) ∈ U and to quantify the confidence in the estimates. For the latter also estimates of the innovation variance Σ are needed. In parametric identification methods this is done by specifying the order n of the system to be estimated (see Section 6 for a discussion) and parameterizing M_n, e.g. by means of ψ_i. Given the order n and the multi-index i, subspace methods can be seen as a mapping attaching a parameter vector in Θ_{n,i} to the input/output
data.³ In the following a brief description of the various algorithms investigated in this paper will be given. For a detailed discussion of the algorithms see, e.g., Bauer (2003) and the references therein.

³ Strictly speaking this is not correct since M_n contains only stable and strictly minimum-phase systems. Both assumptions are not necessarily satisfied for the subspace estimates, see Section 4 for a discussion, but trivially can be imposed by arbitrarily assigning an estimate in case these assumptions are not fulfilled by the original estimate.

In order to understand the motivation for the algorithms let the time t for the moment be called 'the present'. Inserting ε_t = y_t − Cx_t − Du_t into the system equations, it follows easily that

$$x_t = (A - KC)^p x_{t-p} + K_p Z^-_{t,p} = K_{t-1} Z^-_{t,t-1} + \bar A^{t-1} x_1, \qquad (3)$$

where for Ā = A − KC, B̄ = B − KD we have K_p = [[K, B̄], Ā[K, B̄], Ā²[K, B̄], …, Ā^{p−1}[K, B̄]] and Z^-_{t,p} = [z′_{t−1}, …, z′_{t−p}]′ for z_t = [y′_t, u′_t]′. Therefore, the state is a linear function of past observations and a prior value of the state. For p ≥ n minimality implies that K_p is of full row rank.

Since (ε_t)_{t∈Z} is assumed to be a white noise process, the best linear mean square prediction of y_{t+j} given x_t, z_{t−1}, z_{t−2}, … and u_{t+j}, u_{t+j−1}, …, u_t according to Eq. (2) is equal to CA^j x_t + Σ_{i=0}^{j} L(i)u_{t+j−i} and hence is a linear function of the state x_t and the future of the input. This property of the state, to summarize the information contained in the past that is relevant for the prediction of the future of the output, lies at the core of the subspace methods. Due to this property the state space has been called a 'Markovian splitting subspace' (cf. e.g., Picci & Katayama, 1996).

It is necessary to introduce some notation: For f ≥ n + 1 let Y^+_{t,f} = [y′_t, y′_{t+1}, …, y′_{t+f−1}]′ and let E^+_{t,f} be generated analogously to Y^+_{t,f} using ε_t instead of y_t. Let O_f = [C′, A′C′, …, (A^{f−1})′C′]′ denote the truncated observability matrix and let E_f denote the matrix whose first block row is equal to [I, 0, …, 0] and whose ith block row equals [CA^{i−2}K, …, CK, I, 0, …, 0] for i > 1. Finally U^+_{t,f} denotes the vector of future inputs built analogously to Y^+_{t,f} using u_t in place of y_t, and U_f is the matrix whose first block row is equal to [D, 0, …, 0] and whose ith block row equals [CA^{i−2}B, …, CB, D, 0, …, 0] for i > 1. Then writing the prediction equations (2) jointly for y_t, y_{t+1}, …, y_{t+f−1} and using the recursive representation of x_t given in (3) results in the vector equation for t ∈ Z:

$$Y^+_{t,f} = O_f K_p Z^-_{t,p} + O_f \bar A^p x_{t-p} + U_f U^+_{t,f} + E_f E^+_{t,f}. \qquad (4)$$

This equation lies at the core of the subspace idea. It provides an additive decomposition of Y^+_{t,f} into four terms: O_f K_p Z^-_{t,p} and U_f U^+_{t,f} are products of system dependent matrices with observed variables; note that the matrix O_f K_p will be of low rank n for f, p sufficiently large. The remaining two terms are O_f(A − KC)^p x_{t−p}, which lies in the range space of O_f and will be small for large p due to the strict minimum-phase assumption, and finally E_f E^+_{t,f}, which is uncorrelated with all the remaining terms since (ε_t)_{t∈Z} is a white noise process.

Assume for the moment that x_t, u_t and y_t are stationary random variables with finite second moments. Then (4) suggests the vector equation (t = p + 1, …, T − f + 1)

$$Y^+_{t,f} = \beta_z Z^-_{t,p} + \beta_u U^+_{t,f} + N^\perp_t, \qquad (5)$$

where N^⊥_t denotes the residual vector. Here, using P_t = [(Z^-_{t,p})′, (U^+_{t,f})′]′, we have

$$[\beta_z, \beta_u] = E Y^+_{t,f} P_t' (E P_t P_t')^{-1} = [O_f K_p, U_f] + O_f \bar A^p (E x_{t-p} P_t')(E P_t P_t')^{-1}. \qquad (6)$$

Therefore β_z ≠ O_f K_p in general. This introduces a bias which tends to zero for p → ∞ due to the strict minimum-phase assumption. Also note that the bias term lies in the space spanned by the columns of O_f. Therefore β_z = O_f N_p for some matrix N_p ∈ R^{n×p(s+m)}. Similarly β_u ≠ U_f, where again the bias is in the space spanned by the columns of O_f and vanishes for p → ∞.

Note that the presentation did not mention the term 'data Hankel matrices' which can often be found in the literature. These result if the vector equation (5) for t = p + 1, …, T − f + 1 is seen as the columns of a matrix equation. The somewhat complicated notation coming with the calculation of least-squares fits based on the matrix equation has, in my point of view, led to much confusion. In this paper, I will always adhere to the regression interpretation presented above.

These observations build the basis for the subspace algorithms, whose common outline can be decomposed into three main steps:

(1) Given the integers f and p estimate β_z, β_u by regressing Y^+_{t,f} onto Z^-_{t,p} and U^+_{t,f}, resulting in estimates β̂_z and β̂_u.
(2) The estimate β̂_z will typically be of full rank, whereas O_f K_p is of rank n. Hence a rank n approximation Ô_f K̂_p of β̂_z is obtained, resulting in estimates Ô_f and K̂_p.
(3) Based on the estimates Ô_f, K̂_p and β̂_u, estimates of the system matrices are obtained.

The rank n approximation is obtained using the singular value decomposition (SVD, for short)

$$\hat W_f^+ \hat\beta_z \hat W_p^- = \hat U \hat\Sigma \hat V' = \hat U_n \hat\Sigma_n \hat V_n' + \hat R_n,$$

where Û_n ∈ R^{fs×n} denotes the matrix whose columns are the left singular vectors of Ŵ_f^+ β̂_z Ŵ_p^- corresponding to the n dominating singular values σ̂_1 ≥ σ̂_2 ≥ ⋯ ≥ σ̂_n, Σ̂_n = diag(σ̂_1, …, σ̂_n) and V̂_n ∈ R^{p(s+m)×n} denotes the matrix of the corresponding right singular vectors. R̂_n denotes the approximation error arising from using a system order n. Here Ŵ_f^+ = (Ŵ_f^+)′ > 0 and Ŵ_p^- = (Ŵ_p^-)′ > 0 are weighting matrices to be chosen by the user (more details will be discussed below). Then Ô_f = (Ŵ_f^+)^{-1} Û_n Σ̂_n^{1/2} and K̂_p = Σ̂_n^{1/2} V̂_n′ (Ŵ_p^-)^{-1}.
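To make the common outline concrete, the following numpy sketch (ours; the helper names, the identity default weightings and the plain least-squares implementation are assumptions, not notation from the paper) implements the first two steps for given f, p and order n:

```python
import numpy as np

def stack_future(w, t, f):
    """[w_t', ..., w_{t+f-1}']' as one long vector (0-based index t)."""
    return np.concatenate(w[t:t + f])

def stack_past(z, t, p):
    """Z^-_{t,p} = [z_{t-1}', ..., z_{t-p}']' (0-based index t)."""
    return np.concatenate([z[t - j] for j in range(1, p + 1)])

def subspace_first_two_steps(y, u, f, p, n, Wf=None, Wp=None):
    T = y.shape[0]
    z = np.hstack([y, u])
    ts = range(p, T - f + 1)  # t = p+1, ..., T-f+1 in the 1-based notation
    Yf = np.column_stack([stack_future(y, t, f) for t in ts])
    Uf = np.column_stack([stack_future(u, t, f) for t in ts])
    Zp = np.column_stack([stack_past(z, t, p) for t in ts])
    # Step 1: least-squares fit of (5), Y+ = beta_z Z- + beta_u U+ + residual.
    P = np.vstack([Zp, Uf])
    beta = Yf @ P.T @ np.linalg.inv(P @ P.T)
    beta_z, beta_u = beta[:, :Zp.shape[0]], beta[:, Zp.shape[0]:]
    # Step 2: weighted SVD of beta_z, truncated at rank n.
    Wf = np.eye(Yf.shape[0]) if Wf is None else Wf
    Wp = np.eye(Zp.shape[0]) if Wp is None else Wp
    U, s, Vt = np.linalg.svd(Wf @ beta_z @ Wp)
    Un, Sn_half, Vn = U[:, :n], np.diag(np.sqrt(s[:n])), Vt[:n].T
    Of_hat = np.linalg.inv(Wf) @ Un @ Sn_half    # estimate of O_f
    Kp_hat = Sn_half @ Vn.T @ np.linalg.inv(Wp)  # estimate of K_p
    return Of_hat, Kp_hat, beta_u
```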
MOESP and CCA fit into this framework, whereas N4SID uses a slightly different third step while using the same first two steps. A detailed discussion of various proposed variants of this general scheme can be found in Bauer (2003). A discussion of N4SID can be found in Van Overschee and DeMoor (1994), see also below.

In the literature subspace methods have been classified as either belonging to the MOESP type of procedure or as using the state approach similar to the CCA method proposed by Larimore (1983). The way these two approaches exploit the estimates obtained in the second step of the above outline is fundamentally different and will be described next. Afterwards it will be discussed how N4SID fits into the framework.

3.1. The MOESP approach

Consider the MOESP class of procedures first: This class of algorithms has been proposed in a series of papers (cf. Verhaegen & DeWilde, 1993; Verhaegen, 1994; and the references therein). The discussion centers on the algorithm PO-MOESP, which will be referred to simply as MOESP in line with recent literature. This class of algorithms estimates the system matrices using the structure of the matrices O_f and U_f and hence can be seen as finite sample versions of realization procedures: Define O̲_f = [0, I_{(f−1)s}]O_f. Then obviously O̲_f = O_{f−1}A. Letting Ō̂_f = [I_{(f−1)s}, 0]Ô_f and defining Ô̲_f = [0, I_{(f−1)s}]Ô_f analogously to O̲_f, this equation can be used in order to obtain an estimate Â of A as the least squares solution (i.e. of minimal Frobenius norm)

$$\hat A = \arg\min_{A \in \mathbb{R}^{n \times n}} \|\hat{\underline O}_f - \hat{\bar O}_f A\|_{Fr}.$$

The estimate Ĉ is defined as the first block row of Ô_f. This procedure is usually termed the 'shift invariance approach'.

Given these two estimates, the structure of β_u = U_f + O_f(A − KC)^p x_u, where x_u is a bias term due to x_{t−p} (see the discussion below (5)), can be used to construct estimates of B and D. Note that U_f is linear in B and D, being a matrix whose entries are the L(j). Let O_f^⊥ ∈ R^{fs×(fs−n)} denote a full column rank matrix such that (O_f^⊥)′O_f = 0. Further let U_f = L(A, B, C, D) define the mapping L. Then (O_f^⊥)′β_u = (O_f^⊥)′U_f = (O_f^⊥)′L(A, B, C, D). Replacing true quantities with estimates, B and D can be estimated using (weighted) least-squares fitting on the vectorized equations:

$$[\hat B, \hat D] = \arg\min_{B, D} \|\mathrm{vec}((\hat O_f^\perp)'\{\hat\beta_u - L(\hat A, B, \hat C, D)\})\|_{\hat W},$$
where Ŵ > 0 is a positive definite, possibly data-dependent weighting matrix. Here the estimate Ô_f^⊥ is based on an estimate of O_f; two possible choices are Ô_f and [Ĉ′, Â′Ĉ′, …, ((Â′)^{f−1})Ĉ′]′. Typically the weighting Ŵ will be equal to the identity matrix and hence the approximation is in the Frobenius norm. This leads to the original version of MOESP. Alternative weights have been proposed in various papers without strong motivation. In Chiuso and Picci (2003) in some sense optimal weights are provided.

The original MOESP algorithm focuses on the estimation of the transfer function l(z) and does not deliver estimates for k(z). Estimates for K can be obtained from the solution of Riccati equations (if these exist) using estimates of the residual variances in the system equations. This has been proposed in one variant of N4SID (see Van Overschee & DeMoor, 1994). Alternative methods for estimating K can be found in Knudsen (2001). We will not discuss these estimates.

Another algorithm in the MOESP family, namely PI-MOESP,⁴ follows the same outline. The difference is that instead of using Z^-_{t,p} as a regressor to obtain the estimate β̂_z, PI-MOESP uses U^-_{t,p} = [u′_{t−1}, u′_{t−2}, …, u′_{t−p}]′, i.e. only past inputs are included and past outputs are omitted. Consequently, this algorithm uses a state space model where K = 0 and (ε_t)_{t∈Z} is not white noise but noise due to unmodelled dynamics. Therefore, contrary to the state space model used in this paper, PI-MOESP does not assume that the dynamics due to the noise and due to the observed input are identical. This might be advantageous in some situations as documented by Chiuso and Picci (2004d). The analysis of PI-MOESP follows grosso modo the same lines as the analysis of MOESP and hence I will not go into details in this respect.

⁴ Here PI stands for past inputs. The previously used PO is the acronym for past outputs.
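As an illustration of the shift invariance approach, the following sketch (ours; the function name is an assumption) recovers Â and Ĉ from an estimate Ô_f of the truncated observability matrix; the weighted least-squares fit for B̂ and D̂ described above is omitted for brevity:

```python
import numpy as np

def moesp_AC(Of_hat, s):
    """Shift invariance: solve O_f(lower) = O_f(upper) A in least squares,
    where 'upper' are the first (f-1)s rows and 'lower' the last (f-1)s rows."""
    O_upper = Of_hat[:-s, :]
    O_lower = Of_hat[s:, :]
    A_hat, *_ = np.linalg.lstsq(O_upper, O_lower, rcond=None)
    C_hat = Of_hat[:s, :]   # first block row of O_f
    return A_hat, C_hat
```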
3.2. The state approach

The second class of procedures has been called the state approach or the Larimore approach, since the main proponent in this class is the procedure introduced as CVA by Larimore (1983) and subsequently called CCA by Peternell, Scherrer, and Deistler (1996). In this approach an estimate of the state is obtained as x̂_t = K̂_p Z^-_{t,p} for t = p + 1, …, T + 1 and zero else. Additionally an estimate x̃_{t+1} of x_{t+1} is utilized. Larimore (1983) used the shifted state x̃_{t+1} = x̂_{t+1}. Other choices have been proposed, see below. For two vector sequences (a_t)_{t∈N} and (b_t)_{t∈N} let ⟨a_t, b_t⟩ = T^{-1}Σ_{t=1}^T a_t b_t′, using the same symbol for the vectors and the processes, where it is understood that unavailable quantities are replaced by zeros. Then the system matrix estimates (Â, B̂, Ĉ, D̂, K̂) are obtained from least-squares fitting as

$$\begin{bmatrix} \hat C & \hat D \\ \hat A & \hat B \end{bmatrix} = \left\langle \begin{bmatrix} y_t \\ \tilde x_{t+1} \end{bmatrix}, \begin{bmatrix} \hat x_t \\ u_t \end{bmatrix} \right\rangle \left\langle \begin{bmatrix} \hat x_t \\ u_t \end{bmatrix}, \begin{bmatrix} \hat x_t \\ u_t \end{bmatrix} \right\rangle^{-1}.$$

Using ε̂_t = y_t − Ĉx̂_t − D̂u_t for t = p + 1, …, T − f + 1 and zero else, define K̂ = ⟨x̃_{t+1}, ε̂_t⟩⟨ε̂_t, ε̂_t⟩^{-1}. Note that using a different estimate of the state x̂_t = TK̂_p Z^-_{t,p} for any nonsingular matrix T does not change the estimated pair of transfer functions (l̂(z), k̂(z)) = π(Â, B̂, Ĉ, D̂, K̂).

In the literature there seems to be a profound confusion about naming conventions: The labels MOESP, CCA and N4SID are sometimes used to denote particular choices for the weighting matrices Ŵ_f^+ and Ŵ_p^- and sometimes in order to denote the two different classes of methods. In this paper we will use the following convention: The two classes of methods will be called MOESP type of approach (or simply MOESP), irrespective of the choice of the weighting matrices, and state approach, respectively. The name CCA will be reserved for the particular state approach using the weighting matrices

$$\hat W_f^+ = (\langle Y^+_{t,f}, Y^+_{t,f}\rangle - \langle Y^+_{t,f}, U^+_{t,f}\rangle\langle U^+_{t,f}, U^+_{t,f}\rangle^{-1}\langle U^+_{t,f}, Y^+_{t,f}\rangle)^{-1/2},$$
$$\hat W_p^- = (\langle Z^-_{t,p}, Z^-_{t,p}\rangle - \langle Z^-_{t,p}, U^+_{t,f}\rangle\langle U^+_{t,f}, U^+_{t,f}\rangle^{-1}\langle U^+_{t,f}, Z^-_{t,p}\rangle)^{1/2}. \qquad (7)$$

Here X^{1/2} denotes the uniquely defined symmetric square root of a positive definite matrix X.

3.3. The N4SID algorithm

In Van Overschee and DeMoor (1994) two versions of N4SID were given. Their Algorithm 2 is a state approach where instead of x̃_{t+1} = x̂_{t+1} the estimate

$$\tilde x_{t+1} = ([I_{(f-1)s}, 0]\hat O_f)^{\ddagger}\, \hat\beta_{z,+} Z^-_{t+1,p+1}$$

is used, where β_{z,+} is estimated via the regression

$$Y^+_{t+1,f-1} = \beta_{z,+} Z^-_{t+1,p+1} + \beta_{u,+} U^+_{t+1,f-1} + N^\perp_{t,+}.$$

Here ([I_{(f−1)s}, 0]Ô_f)^‡ denotes a generalized inverse of [I_{(f−1)s}, 0]Ô_f. Typically the Moore–Penrose pseudoinverse is used. To the best of my knowledge no results on the asymptotic properties of this algorithm have been published in the literature. However, the proofs given for the state approach are straightforward to apply to this algorithm. We will not go into details.

Algorithm 1 of Van Overschee and DeMoor (1994) also can be seen as a state approach with two differences: First the estimate x̃_{t+1} given above is used, and second it includes the term U^+_{t,f} in the regression for the estimation of the system matrices rather than u_t. Hence we obtain the equations

$$\begin{bmatrix} \hat C & \hat D_f \\ \hat A & \hat B_f \end{bmatrix} = \left\langle \begin{bmatrix} y_t \\ \tilde x_{t+1} \end{bmatrix}, \begin{bmatrix} \hat x_t \\ U^+_{t,f} \end{bmatrix} \right\rangle \left\langle \begin{bmatrix} \hat x_t \\ U^+_{t,f} \end{bmatrix}, \begin{bmatrix} \hat x_t \\ U^+_{t,f} \end{bmatrix} \right\rangle^{-1}.$$

Then estimates (B̂, D̂) are obtained from noting that the coefficient blocks B_f and D_f multiplying U^+_{t,f} are linear in (B, D). This is utilized in a least-squares fit analogous to the estimation of (B, D) in the MOESP class of algorithms. Therefore both versions of N4SID will be considered as variants of the state approach in the following.

Chiuso and Picci (2003) propose an algorithm that is similar to the N4SID algorithms: In the first step the state approach is used in order to obtain estimates (Â, Ĉ). Here the state is estimated similar to N4SID from two different regressions for x̂_t and x̃_{t+1}. A slight difference to N4SID is that the regression for x̃_{t+1} uses Y^+_{t+1,f} instead of Y^+_{t+1,f−1}. A second difference is the choice of the CCA weighting scheme (7) for the SVD. In Chiuso and Picci (2004c) it is shown that the so obtained estimates (Â, Ĉ), for a specific choice of the generalized inverse Ô_f^‡ in the estimation of x̂_t and x̃_{t+1}, are numerically identical to the estimates obtained in MOESP, providing a link between the MOESP type of procedures and the state approach. The second step of their approach follows the MOESP type of procedures, however.
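A sketch of the third step of the state approach with x̃_{t+1} = x̂_{t+1} (ours; the array layout, with data stored as dimension × number of samples, and the names are assumptions) reads:

```python
import numpy as np

def state_approach_matrices(xh, xh_next, y, u):
    """Least-squares fit of [C D; A B] and of K^ from the residuals,
    given a state estimate xh[:, t] and its shifted version xh_next[:, t]."""
    X = np.vstack([xh, u])        # regressor  [x^_t', u_t']'
    Y = np.vstack([y, xh_next])   # regressand [y_t', x~_{t+1}']'
    theta = Y @ X.T @ np.linalg.inv(X @ X.T)
    s, n = y.shape[0], xh.shape[0]
    C_hat, D_hat = theta[:s, :n], theta[:s, n:]
    A_hat, B_hat = theta[s:, :n], theta[s:, n:]
    eps_hat = y - C_hat @ xh - D_hat @ u
    K_hat = (xh_next @ eps_hat.T) @ np.linalg.inv(eps_hat @ eps_hat.T)
    return A_hat, B_hat, C_hat, D_hat, K_hat
```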
4. Convergence of the parameter estimates

In the last section the algorithms have been described and motivated based on assumptions on the data generating process and the stochastic nature of the input and the noise process. Abstracting from the motivation, we obtain a numerical procedure attaching to any given input/output data set y_t, u_t, t = 1, …, T, for a set of design parameters f, p, Ŵ_f^+ and Ŵ_p^- and for given system order n, an estimate (l̂(z), k̂(z)) (under regularity conditions ensuring the uniqueness of the solutions to the least-squares problems). Given the description of the algorithm, the question of the properties of the resulting estimators arises. Since the finite sample properties are hard if not impossible to derive, one typically resorts to asymptotic arguments. In order for the asymptotics to work some assumptions on the stochastic structure of the involved processes need to be made. Throughout the paper we will use the following assumption:

Assumption 1. The noise process (ε_t)_{t∈Z} is a strictly stationary and ergodic martingale difference sequence with respect to the sigma algebra F_t = σ{ε_t, ε_{t−1}, …} fulfilling the following properties:

$$E\{\varepsilon_t \mid F_{t-1}\} = 0, \qquad E\{\varepsilon_t\varepsilon_t' \mid F_{t-1}\} = E\{\varepsilon_t\varepsilon_t'\},$$
$$E\{\varepsilon_{t,a}\varepsilon_{t,b}\varepsilon_{t,c} \mid F_{t-1}\} = \omega_{a,b,c}, \qquad E\varepsilon_{t,a}^4 < \infty,$$

where Eε_tε_t′ = Σ > 0. Here ε_{t,a} denotes the ath component of the vector ε_t. Further ω_{a,b,c} is a constant not depending on t.
The process (y_t)_{t∈N} is generated by a minimal state space system (A_0, B_0, C_0, D_0, K_0) ∈ S_{n_0} using the input process (u_t)_{t∈N}, which is assumed to be a stochastic process with finite second moments and independent of the noise ε_r, r ∈ Z. Furthermore, x_1 is assumed to be a random variable with finite fourth moment independent of ε_t, t ≥ 0.

The assumptions on the noise are fulfilled, e.g., for independent identically distributed sequences with zero mean, nonsingular variance and finite fourth moment. Note that we did not impose any restriction on the input sequence except for finite second moments and independence of the noise implying open-loop operation conditions.

In the following we will be interested in a number of properties of the estimators of the system matrices (Â, B̂, Ĉ, D̂, K̂) obtained using subspace methods with the choice K̂_p = Σ̂_n^{1/2}V̂_n′(Ŵ_p^-)^{-1}. In some cases it turns out to be more convenient to specify a multi-index i and to consider the estimators θ̂_i = ψ_i((Â, B̂, Ĉ, D̂, K̂)) ∈ Θ_{n,i}, if this is well defined (see footnote 2). Throughout the paper we will assume that i is chosen such that θ_{0,i} = ψ_i((A_0, B_0, C_0, D_0, K_0)) is an interior point of Θ_{n,i}, implying that θ̂_i is well defined for T large enough whenever it is discussed in the paper (as follows from a.s. consistency of (l̂(z), k̂(z)), see below). The main properties we are interested in are the following:

(1) Consistency: A parameter estimator is called weakly consistent if θ̂_i → θ_{0,i} in probability. Analogously the estimator (Â, B̂, Ĉ, D̂, K̂) of the state space system is called consistent, if there exists a nonsingular deterministic matrix 𝒯 not depending on the sample size T such that Â − 𝒯A_0𝒯^{-1}, B̂ − 𝒯B_0, Ĉ − C_0𝒯^{-1}, D̂ − D_0, K̂ − 𝒯K_0 all converge to zero in probability.
(2) Misspecification analysis: If the order used in the estimation is smaller than the true order then the estimators cannot be consistent since the model used does not contain the true system. Nevertheless it is of interest to know whether the estimators converge to some limit or diverge for large sample sizes.
(3) Fulfillment of assumptions: Assumption 1 includes the stability of the true system. The estimators should have the same property. A similar argument can be made for the minimum-phase condition. Additionally the system might be known to have no direct feedthrough term, i.e. D_0 = 0.
(4) Asymptotic normality: The asymptotic distribution of the parameter vector estimators in all cases discussed in this paper will be seen to be normal. Of main interest in this respect is the comparison of the asymptotic variance⁵ for the various algorithms.
5 In this paper the term asymptotic variance is used for the variance of the limiting Gaussian variable.
In this section the first three points are dealt with. First the general tools used to derive the limiting results are presented. Then two subsections are devoted to results for the MOESP class of algorithms and the state approach, respectively.

The outline of the algorithm given above included three steps: a least-squares fit, a singular value decomposition and the estimation of the system matrices based on Ô_f, K̂_p and β̂_u. The properties of the least-squares fit are well understood. The third step is specific to the two classes of algorithms and hence will be dealt with in the respective subsections below; in any case the third step is a least-squares fit. The second step really is the core of the subspace idea. It involves a SVD of the matrix

$$\hat W_f^+ \hat\beta_z \hat W_p^- = \hat U \hat\Sigma \hat V' \to W_f^+ \beta_z W_p^- = U_n \Sigma_n V_n'.$$

The assumptions which imply the stated convergence will be discussed later on. SVDs are not unique even when one restricts the singular values to be ordered decreasing in size. For each singular vector u_i also −u_i is a singular vector corresponding to the same singular value. If there are repeated singular values the nonuniqueness is even worse, being related to the choice of the basis in the space spanned by a number of singular vectors. In order to achieve uniqueness some further restrictions have to be introduced. The columns of Û_n (and U_n, respectively) are restricted to be of unit length. Fixing one entry in each column of Û_n (and U_n, respectively) to be positive leads to a unique SVD in the case where there are no repeated singular values in Σ_n (cf. the discussion in Bauer, Deistler, & Scherrer, 1999, p. 1245). In the case of repeated singular values in Σ_n restrictions achieving uniqueness can also be found but are much harder to describe.

For the analysis of the asymptotic properties of the various algorithms the SVD step is the main complication. Most results explicitly or implicitly use the following theorem in this respect, which is contained, e.g., in Chatelin (1983, see in particular Proposition 3.26). Note that there are close links between the SVD of a matrix X = USV′ and the eigenvalue decomposition of X′X = VS²V′:

Theorem 1. Let T_T denote a sequence of compact self-adjoint operators converging in operator norm to the compact operator T_0. Let 𝒱_{0,j} denote the eigenspaces corresponding to the eigenvalues λ_{0,j} of T_0, where λ_{0,1} > λ_{0,2} > ⋯ are assumed to be ordered decreasing in size.

(i) Then for each eigenvalue λ_{0,j} of multiplicity m_{0,j} there exist m_{0,j} eigenvalues λ_{T,j,i}, i = 1, …, m_{0,j}, of T_T converging to λ_{0,j}. The span 𝒱_{T,j} of the corresponding eigenspaces converges in the gap metric (see, e.g., Chatelin, 1983, p. 54, for a definition) to 𝒱_{0,j}.
(ii) Furthermore, there exist matrices V_{T,j} whose columns span the spaces 𝒱_{T,j} and V_{0,j} whose columns span the spaces 𝒱_{0,j}, normalized according to V_{T,j}′V_{T,j} = I, V_{0,j}′V_{0,j} = I, such that

$$\Lambda_{T,j} - \lambda_{0,j} I = V_{0,j}'(T_T - T_0)V_{0,j} + o(\|T_T - T_0\|),$$
$$V_{T,j} - V_{0,j} = (\lambda_{0,j} I - T_0)^{\dagger}(T_T - T_0)V_{0,j} + o(\|T_T - T_0\|),$$

where the equation T_T V_{T,j} = V_{T,j} Λ_{T,j} defines Λ_{T,j} and X^† denotes the Moore–Penrose pseudoinverse of X.

Note that (ii) only states the existence of a particular SVD having the differentiability properties. This does not necessarily correspond to the actual numerical implementation. In the case of m_{0,j} = 1, j = 1, …, n, the unit norm vectors spanning the eigenspaces are unique up to a sign. Taking the vectors V_{T,j} such that the sign of one particular entry in each V_{T,j} and V_{0,j} is fixed to be positive, where the corresponding entry in V_{0,j} is nonzero, leads to a realizable representation having the differentiability property.

Note that we are interested in O_f (and K_p) rather than U_n (or V_n). Since O_f (and K_p vice versa) for minimal systems is of full rank for f (and p) large enough, it follows that there exists a set of rows (columns) which forms a nonsingular matrix. Here we will always assume that this set of rows of O_f (columns of K_p) is formed by the first n rows (columns), which is true for a generic subset M̃_{n,i} ⊂ M_n, see Hannan and Deistler (1988, Section 2.6.) and analogous arguments for K_p, respectively. Let the corresponding matrix be denoted as [O_f]_n ∈ R^{n×n} ([K_p]_n ∈ R^{n×n}, resp.). On M̃_{n,i} the restriction [O_f]_n = I_n ([K_p]_n = I_n, resp.) in connection with minimality uniquely defines one state space system corresponding to a pair of transfer functions, hence defining a canonical form on M̃_{n,i}. For the case [O_f]_n = I_n this leads to echelon forms (see Hannan & Deistler, 1988, Section 2.5. for details). In the following we will use the symbols O_f and K_p for the true system represented in the echelon form. Similarly Õ_f and K̃_p will be used for the canonical form corresponding to [K̃_p]_n = I_n. If different system representations are used in the definition of O_f and K_p this will be remarked in the text. Using this notation Bauer, Deistler, and Scherrer (2000) stated that⁶ (the expression for K̂_p − K̃_p holds for p → ∞ at a sufficiently fast rate, see Theorem 7 below, the expression for Ô_f − O_f for fixed f, p ≥ n)

$$(\hat K_p - \tilde K_p) = \tilde O_f^{\ddagger}(\hat\beta_z - \beta_z)(I - \tilde S_p\tilde K_p) + o(\|\hat\beta_z - \beta_z\|T^{-\varepsilon}),$$
$$(\hat O_f - O_f)' = N_p^{\ddagger}(\hat\beta_z - \beta_z)'(I - S_f O_f') + o(\|\hat\beta_z - \beta_z\|T^{-\varepsilon}), \qquad (8)$$

where S_f = [I_n, 0]′ ∈ R^{fs×n}, S̃_p = [I_n, 0]′ ∈ R^{p(s+m)×n} and

$$\tilde O_f^{\ddagger} = (\tilde O_f' W_+ \tilde O_f)^{-1}\tilde O_f' W_+, \qquad N_p^{\ddagger} = (N_p W_- N_p')^{-1}N_p W_-$$

denote generalized left inverses of Õ_f (N_p′, respectively). Here β_z = O_f N_p, W_+ = (W_f^+)′W_f^+ and W_- = (W_p^-)′W_p^-. This holds if weighting matrices such that ‖Ŵ_p^- − W_p^-‖ = o(T^{-ε}) and ‖Ŵ_f^+ − W_f^+‖ = o(T^{-ε}) for some ε > 0 and deterministic matrices W_p^- and W_f^+ are used.

⁶ To be precise, only the linearization of K̂_p − K̃_p in the case m = 0 is contained in the reference. The general result and the result for Ô_f − O_f follow analogously.

Recently, Jansson (2000) and Chiuso and Picci (2003) noticed that to some extent the linearization of the SVD is not necessary for the asymptotic analysis. This is achieved by representing the true system using a state basis depending on the estimates rather than using a fixed representation: To this end let Ô_f denote an estimate of O_f such that ‖Ô_f − O_f‖ = o(T^{-δ}) for some δ > 0. Let K̂_p be obtained as Ô_f^‡β̂_z (as before Ô_f^‡ denotes a generalized left inverse) and let Ǩ_p = Ô_f^‡β_z = (Ô_f^‡Õ_f)Ñ_p, where β_z = Õ_fÑ_p. Then the matrix (Ô_f^‡Õ_f) determines the state representation used for estimation and hence the coordinate system for the state is random rather than deterministic. Obviously one obtains

$$\hat K_p - \check K_p = \hat O_f^{\ddagger}(\hat\beta_z - \beta_z) = O_f^{\ddagger}(\hat\beta_z - \beta_z) + o(\|\hat\beta_z - \beta_z\|T^{-\delta}) \qquad (9)$$

for ‖Ô_f^‡ − O_f^‡‖ = o(T^{-δ}), which will hold in many cases. Note the similarities between (8) and (9): The linearizations only differ in the term (I − S̃_pK̃_p) in (8) accounting for the fixed basis of the state space.

Theorem 1, (8) and (9) all can be and have been used to derive the asymptotic properties of subspace estimators. The expressions based on Theorem 1 are the least transparent ones. Eq. (8) seems to provide the most valuable information, both in terms of general applicability to all (l(z), k(z)) ∈ M_n and in terms of complexity of the expressions. Hence the presentation in this paper centers on these expressions. The linearization given in (9) is slightly simpler. One has to keep in mind, however, that ‖Ô_f^‡ − O_f^‡‖ = o(T^{-δ}) or similar conditions have to be used, which need to be justified. If Ô_f is based on estimates obtained from the SVD step then essentially Theorem 1 has to be used to justify the convergence of Ô_f^‡.

From (8) it is readily seen that consistent estimation of Ô_f or K̂_p, respectively, hinges essentially on consistent estimation of β_z in the first step. These facts are central in the development of asymptotic properties.
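The content of Theorem 1(ii) is easy to check numerically. The following sketch (ours) perturbs a symmetric matrix with a simple largest eigenvalue and verifies that the first-order term predicts the eigenvector perturbation up to an error of order ‖T_T − T_0‖²:

```python
import numpy as np

rng = np.random.default_rng(1)
k = 6
M = rng.standard_normal((k, k))
T0 = M @ M.T + np.diag(np.arange(k))  # symmetric, generically simple eigenvalues
E = rng.standard_normal((k, k)); E = (E + E.T) / 2

def top_eigvec(X):
    w, V = np.linalg.eigh(X)       # eigenvalues in ascending order
    v = V[:, -1]                   # eigenvector of the largest eigenvalue
    return v if v[0] >= 0 else -v  # fix the sign as discussed in the text

lam0 = np.linalg.eigh(T0)[0][-1]
v0 = top_eigvec(T0)
for eps in [1e-2, 1e-3, 1e-4]:
    vT = top_eigvec(T0 + eps * E)
    first_order = np.linalg.pinv(lam0 * np.eye(k) - T0) @ (eps * E) @ v0
    print(eps, np.linalg.norm(vT - v0 - first_order))  # shrinks like eps**2
```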
4.1. MOESP type of algorithms

In the MOESP class of methods usually the integers f and p are kept fixed and finite. This will be assumed throughout this section.

Assumption 2. The deterministic input process (u_t)_{t∈N} is assumed to be pseudo stationary in the sense that T^{-1}Σ_{t=1}^T u_t converges and there exists a pseudo-covariance sequence γ_u(j) such that sup_{|j|≤f+p} ‖⟨u_t, u_{t−j}⟩ − γ_u(j)‖ → 0. Furthermore, (u_t)_{t∈N} is assumed to be persistently exciting of order f + p in the sense that there exists an integer T_0 such that ⟨U^+_{t,f+p}, U^+_{t,f+p}⟩ > cI for some 0 < c and T ≥ T_0.

In the following we will use expectation to denote the limiting pseudo covariance when applied to the input sequence and statistical expectation when applied to the noise process.

4.1.1. Consistency

Under Assumption 1 it follows that [β̂_z, β̂_u] → [β_z, β_u] = [O_fK_p, U_f] + O_fĀ^p Ex_{t−p}P_t′(EP_tP_t′)^{-1} (see (6)), where the invertibility of EP_tP_t′ is ensured by the persistence of excitation condition. Hence the estimates of O_fK_p and U_f are not consistent in general due to the presence of the term involving x_{t−p}. Nevertheless, the properties of the limits are as desired: The range space of β_z is contained in the range space of O_f, and β_u is identical to U_f plus a bias in the direction of the range space of O_f. Based on these facts the following theorem summarizes the results related to consistency issues:

Theorem 2. Let the output process (y_t)_{t∈N} be generated by a minimal state space system (A_0, B_0, C_0, D_0, K_0) ∈ S_{n_0} such that Assumption 1 holds and let the input process (u_t)_{t∈N} fulfill Assumption 2. For a given multi-index i such that θ_{0,i} = ψ_i((A_0, B_0, C_0, D_0, K_0)) is in the interior of Θ_{n_0,i}, let the parameter vector θ̂_i using the true order of the system be estimated using a MOESP type of algorithm with f ≥ n_0 + 1 and p ≥ n_0 chosen independent of the sample size. Assume that the weightings in the SVD are chosen such that either Ŵ_f^+ = W_f^+ > 0 or Ŵ_f^+ is chosen according to the CCA weighting in (7), and that Ŵ_p^- = ⟨Z^-_{t,p}, Z^-_{t,p}⟩^{1/2} or the CCA choice is used. Assume that the weighting Ŵ in the least-squares fit to obtain the estimates B̂ and D̂ is chosen such that Ŵ − W → 0, W > 0.

(i) Then a sufficient condition for l̂(z) → l(z) is that

$$\operatorname{rank} E\begin{bmatrix} x_t \\ U^+_{t,f} \end{bmatrix}\begin{bmatrix} Z^-_{t,p} \\ U^+_{t,f} \end{bmatrix}' = n_0 + fm. \qquad (10)$$

If the singular values of W_f^+β_zW_p^- are distinct then also the estimates (Â, B̂, Ĉ, D̂) are consistent for appropriate choice of the orientation of the columns of Û_n.
(ii) The subset M_{n_0}(u_t, f, p, Σ) ⊂ M_{n_0} such that the consistency condition (10) holds is open and dense in M_{n_0} for all fixed f ≥ n_0 + 1, p ≥ n_0.
(iii) Let r ≤ n_0 denote the degree of the A-annihilator (i.e. the lowest degree of all polynomials q(z) such that q(A) = 0) and let o ≤ n_0 denote the observability index (i.e. the smallest integer such that O_o has rank n_0). Then for f ≥ o + 1, p ≥ 3r and (u_t)_{t∈N} being persistently exciting of order 4r + o the consistency condition (10) holds.

The consistency result and the genericity of the set M_{n_0}(u_t, f, p, Σ) can be found in Lemma 5 and Theorem 8 of Bauer and Jansson (2000); the sufficient conditions on the integers f, p have been derived in Chui (1997, Theorem 5.18). Jansson and Wahlberg (1998) provide an example where the consistency condition does not hold, implying M_{n_0}(u_t, f, p, Σ) ≠ M_{n_0} in some situations even if f ≥ n_0 + 1, p ≥ n_0.

For the PI-MOESP method Z^-_{t,p} is replaced by U^-_{t,p} in the consistency condition. Consequently the condition automatically holds in this case (cf. Jansson & Wahlberg, 1998).
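The consistency condition (10) can be inspected numerically for a given input design. The following simulation sketch (ours; the system, the white noise input and the sample size are assumptions) estimates the matrix in (10) by sample averages; when (10) holds, all n_0 + fm singular values stay bounded away from zero as T grows, whereas a failing condition shows up as a vanishing singular value:

```python
import numpy as np

rng = np.random.default_rng(2)
A = np.array([[0.6, 0.1], [0.0, 0.4]]); B = np.array([[1.0], [0.3]])
C = np.array([[1.0, 0.0]]); D = np.array([[0.1]]); K = np.array([[0.5], [0.2]])
n0, s, m, f, p, T = 2, 1, 1, 3, 2, 20000

u = rng.standard_normal((T, m))   # white noise input, persistently exciting
eps = rng.standard_normal((T, s))
x = np.zeros((T, n0)); y = np.zeros((T, s))
for t in range(T - 1):
    y[t] = C @ x[t] + D @ u[t] + eps[t]
    x[t + 1] = A @ x[t] + B @ u[t] + K @ eps[t]

z = np.hstack([y, u])
rows, cols = [], []
for t in range(p, T - f):
    Uf = np.concatenate(u[t:t + f])                           # U+_{t,f}
    Zp = np.concatenate([z[t - j] for j in range(1, p + 1)])  # Z-_{t,p}
    rows.append(np.concatenate([x[t], Uf]))
    cols.append(np.concatenate([Zp, Uf]))
R = np.array(rows).T @ np.array(cols) / len(rows)
print(np.linalg.svd(R, compute_uv=False))  # n0 + f*m = 5 values, none near zero
```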
4.1.2. Misspecification analysis

To the best of my knowledge there are no results available in the literature for the case of misspecified order. However, a number of results are simple to obtain using the previously developed tools. For instance, it can be shown that under the assumptions of Theorem 2, if the calculation of Ô_f^⊥ is based on [Ĉ′, Â′Ĉ′, …, ((Â′)^{f−1})Ĉ′]′ and the order ñ > n_0 is used for the estimation, l̂(z) → l(z) remains to hold. Since this result does not seem to be of much relevance the proof is omitted and can be obtained from the author on request. If the calculation of Ô_f^⊥ is based on Ô_f it seems likely that consistency also holds; however, to the best of my knowledge no proof of this statement exists.

Also the more interesting case of specifying the order for estimation lower than the true order has not, to the best of my knowledge, been investigated analytically in the literature. Conditions for convergence of the estimated matrix Ô_f in this case follow from using the tools discussed above, and since the second stage of estimation consists of a least-squares fit this step is also simple to analyze. However, more interesting than proving convergence of the estimators are the properties of the limiting low-order approximation. In this respect there seem to be no results available and many questions remain open. Simulations in Van Overschee (1995) indicate that some choices of Ŵ_f^+ lead to approximation properties which coincide with intuition: There the weightings are linked to bandpass filters, and the achieved biases in simulation studies show a decrease in the frequency range which is stressed by the filters. The theoretical justification of this intuition remains an open problem as of now.
4.1.3. Imposing assumptions

Besides consistency, estimators are typically required to possess properties which are known to hold for the true process. For the MOESP type of methods a number of results in this respect have been achieved:

Theorem 3. Let the assumptions of Theorem 2 hold and let the estimation be performed using the true order n_0. If D_0 = 0 and D = 0 is imposed in the least-squares fit used to obtain estimates of B and D, the consistency results given in Theorem 2 hold unchanged. The estimator

$$\tilde A = \hat O_f^{\ddagger}\begin{bmatrix} \hat{\underline O}_f \\ 0 \end{bmatrix},$$

where Ô_f^‡ = (Ô_f′Ô_f)^{-1}Ô_f′, fulfills |λ_max(Ã)| ≤ 1. This estimator introduces an asymptotic bias. In particular, if Ô_f → O_f then Ã converges to

$$(O_f'O_f)^{-1}\left(\sum_{j=0}^{f-2}(A^j)'C'CA^j\right)A = A - (O_f'O_f)^{-1}(A^{f-1})'C'CA^f.$$

This asymptotic bias decreases with increasing f.

The first claim is obvious. The estimator Ã has been proposed by Chui and Maciejowski (1996), who also prove that |λ_max(Ã)| ≤ 1. The last fact is obvious from the expression given. In some situations one might be willing to accept the bias in return for the guaranteed stability: One issue in this respect is estimation of the stochastic subsystem (A, C, K), which is based on the solution to a Riccati equation (see, e.g., Van Overschee & DeMoor, 1994). The solution to this equation only makes sense if the estimates correspond to a positive real transfer function. Otherwise no estimate of the Kalman gain can be obtained. Peternell (1995) contains a large number of simulations showing that the choice of Ŵ_f^+ crucially influences the percentage of cases such that the Riccati equations cannot be solved. Further evidence and theoretical analysis is contained in Dahlen, Lindquist, and Mari (1998). The minimum-phase assumption is automatically satisfied for estimators obtained via the solution to the Riccati equation, if the equation has a solution.
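A sketch of the stability-enforcing estimator of Theorem 3 (ours; the function name is an assumption): the down-shifted observability matrix is padded with a zero block before the least-squares fit, which guarantees |λ_max(Ã)| ≤ 1 at the price of the bias given above:

```python
import numpy as np

def stable_A(Of_hat, s):
    """A~ = O_f^(double dagger) [O_f shifted up by one block row; 0]."""
    fs, n = Of_hat.shape
    target = np.vstack([Of_hat[s:, :], np.zeros((s, n))])
    A_tilde, *_ = np.linalg.lstsq(Of_hat, target, rcond=None)
    return A_tilde
```

With the exact O_f of a stable pair (A, C) one can verify numerically that this reproduces the limit expression above and that the eigenvalues stay within the closed unit disc.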
4.2. The state approach

In this section we turn to the state approach. Again the discussed topics include consistency in the case of correctly specified order, misspecification analysis and fulfillment of assumptions. For most of this section we will use the following assumption:

Assumption 3. The input process (u_t)_{t∈Z} is generated according to a state space system of form (1) driven by an unobserved white noise process (η_t)_{t∈Z} fulfilling Assumption 1, where for the generation of u_t no additional observed input is present and η_t is independent of ε_r for all t, r ∈ Z.

Hence the input is assumed to be an ARMA process. For the analysis of the general state approach it has proven necessary to consider the integer p to tend to infinity as a function of the sample size. This necessitates the usage of a number of results with respect to the growing dimension of the regression vectors. The most important results in this respect are: The eigenvalues of ⟨Z^-_{t,p}, Z^-_{t,p}⟩ are a.s. (uniformly in p = O((log T)^a)) bounded away from zero and also bounded from above (cf. Hannan & Deistler, 1988, Theorem 6.6.10). Furthermore, Theorem 6.6.11 of Hannan and Deistler (1988) states that ‖⟨Z^-_{t,p}, Z^-_{t,p}⟩^{-1}‖_∞ < c, ‖(EZ^-_{t,p}(Z^-_{t,p})′)^{-1}‖_∞ < c a.s. uniformly in p = O((log T)^a). Another useful fact in the present situation is (z_t = [y_t′, u_t′]′)

$$\max_{0 \le j \le H_T}\|\langle\varepsilon_t, z_{t-j}\rangle\| = O(Q_T), \qquad \sup_{j \in \mathbb Z}\|\langle z_t, z_{t-j}\rangle - E z_t z_{t-j}'\| = o(1)$$

according to Theorems 5.3.1 and 7.4.1 of Hannan and Deistler (1988), where Q_T = √(log log T/T) and H_T = (log T)^a for some 0 < a < ∞. This set of assumptions on the input is overly strong and much weaker assumptions (including deterministic sequences) are sufficient for the results in this section. However, since the results documented in the literature use this set of assumptions we refrain from discussing weaker assumptions. In almost all results for the state approach the condition p → ∞ will be used. Hence a condition like ‖⟨Z^-_{t,p}, Z^-_{t,p}⟩^{-1}‖_2 < c uniformly in the sample size (including divergence of p at some rate) will be needed. Consequently typical deterministic signals such as the constant or sums of sinusoids are not permitted, being only of finite order of persistence.

4.2.1. Consistency

Using these tools it is not difficult to establish consistency:

Theorem 4. Let the output process (y_t)_{t∈N} be generated by a minimal state space system (A_0, B_0, C_0, D_0, K_0) ∈ S_{n_0} such that Assumption 1 holds and let the input process (u_t)_{t∈Z} fulfill Assumption 3. For a given multi-index i such that θ_{0,i} = ψ_i((A_0, B_0, C_0, D_0, K_0)) is in the interior of Θ_{n_0,i}, let the parameter vector θ̂_i using the true order of the system be estimated using a state approach algorithm with x̃_{t+1} = x̂_{t+1}, where f ≥ n_0 is fixed and p → ∞, p/(log T)^a → 0 a.s. for some integer a < ∞. Assume that the weightings in the SVD are chosen such that either Ŵ_f^+ = W_f^+ > 0 for a nonsingular matrix W_f^+ or the CCA weighting according to
(7) is chosen. Ŵ_p^- is assumed to be chosen according to (7). Then θ̂_i is well defined a.s. for T large enough and moreover θ̂_i → θ_{0,i}. On a generic subset M^+_{n_0}(u_t, f, Σ) ⊂ M_{n_0}, characterized by the fact that for (l(z), k(z)) ∈ M^+_{n_0}(u_t, f, Σ) the singular values of W_f^+O_fK_∞W_∞^- are distinct, also the system matrix estimators are consistent (for appropriate choice of the orientation of the columns of Û_n). The same result applies for f = p → ∞ a.s. subject to the upper bound p = O((log T)^a) using the weights according to (7).

Under the conditions of Theorem 2, in particular under the assumption that p ≥ n_0 is fixed independent of the sample size, the estimate l̂(z) obtained using Algorithm 1 of N4SID is consistent, i.e. l̂(z) → l(z).

A proof of the consistency statement for the parameter vector for the case f = p → ∞ can be found in Peternell et al. (1996); the case of fixed f follows easily from the arguments given there. Consistency of the parameter matrices for (l(z), k(z)) ∈ M^+_{n_0}(u_t, f, Σ) is shown in Bauer (1998, Section 4.3). Consistency of N4SID under the conditions of consistency in the MOESP setting is implicit in the results of Section 2 of Chiuso and Picci (2004b). This distinction seems to be of interest since, in particular in control applications, the chosen inputs might not be persistent of infinite order. The ability of the N4SID algorithm to deliver consistent estimates for finite p is therefore seen to be important.

4.2.2. Misspecification analysis

As for the MOESP type of algorithms there appears to be not much literature available on this subject. To the best of my knowledge the only exception is the discussion in Section 5.3.2 of Bauer (1998), which is also contained in Bauer, Deistler, and Scherrer (1998). In the following we will only deal with the case x̃_{t+1} = x̂_{t+1}. In that case Theorem 1 clarifies the convergence properties of K̂_p = Σ̂_n^{1/2}V̂_n′(Ŵ_p^-)^{-1}. Once x̂_t is estimated the rest of the algorithm consists of least-squares fitting and therefore the results given in the next theorem are straightforward to derive:

Theorem 5. Let the input process (u_t)_{t∈Z} and output process (y_t)_{t∈N} be as in Theorem 4. Let (A_0, B_0, C_0, D_0, K_0) ∈ S_{n_0} denote the representation of the true system according to O_f = (W_f^+)^{-1}U_{n_0}Σ_{n_0}^{1/2}, where W_f^+O_fK_∞W_∞^- = U_{n_0}Σ_{n_0}V_{n_0}′. Then consider the subspace estimators (l̂(z), k̂(z)) based on x̂_t(ñ) = Σ̂_ñ^{1/2}V̂_ñ′(Ŵ_p^-)^{-1}Z^-_{t,p} for Ŵ_p^- chosen according to (7) and Ŵ_f^+ chosen according to the restrictions of Theorem 4, where f ≥ n_0 fixed independent of the sample size and p → ∞, p ≤ (log T)^a a.s. for some a < ∞ are used.

(i) If ñ ≥ n_0 then (l̂(z), k̂(z)) → (l(z), k(z)), i.e. the pair of transfer functions is estimated consistently.
(ii) If ñ < n_0 then for σ_ñ > σ_{ñ+1} it follows that (l̂(z), k̂(z)) → (l_ñ(z), k_ñ(z)) for some limiting pair (l_ñ(z), k_ñ(z)) which depends on f and W_f^+. In the case of no inputs k_ñ(z) corresponds to (A_ñ, C_ñ, K_ñ) where

$$A_{\tilde n} = A_{1,1}, \qquad C_{\tilde n} = C_1, \qquad K_{\tilde n} = (A_{1,2}C_2' + K_1\Sigma)(\Sigma + C_2C_2')^{-1}.$$

Here (A_{1,1}, C_1, K_1) denotes the subsystem of (A_0, C_0, K_0) obtained by simply truncating the matrices, A_{1,2} ∈ R^{ñ×(n_0−ñ)} denotes the (1,2) block of A_0 and C_2 ∈ R^{s×(n_0−ñ)} the trailing submatrix of C_0 such that C_0 = [C_1, C_2]. In the case f = p → ∞ the same conclusions hold for the CCA choice (7) of weighting matrices.

The proof of (ii) can be found in Bauer (1998, Section 5.3.2); the proof of (i) is given in Bauer (1998, p. 105). For the case of observed inputs the limiting expressions could be given as well but are less transparent. Even though the expressions given in the theorem are very explicit, up to now no satisfactory interpretation has been given. A discussion of the no input case can be found in Bauer (1998, p. 123): There the special choice of weighting W_f^+ introduced by Van Overschee (1995) is investigated. Links to frequency weighted balancing are found. However, the analysis is not conclusive and the links are found to be weak. Unfortunately, as in the MOESP case, this topic is heavily neglected in the literature.

4.2.3. Imposing assumptions

With respect to the fulfillment of assumptions it is obvious that the assumption of no direct feedthrough term can be imposed in the state approach without changing any of the results presented so far. This simply amounts to the omission of u_t in the least-squares fit in the observation equation. With respect to the fulfillment of the stability and minimum-phase assumptions, results have been obtained only in the case of no observed inputs: For the choice x̃_{t+1} = x̂_{t+1} Chui and Maciejowski (1996) show that the initial conditions in the regression for the estimation of A can be chosen such that |λ_max(Â)| ≤ 1. Contrary to the MOESP case this does not introduce an asymptotic bias since p → ∞. In this case of no observed inputs a special choice of x̃_{t+1} can be given guaranteeing |λ_max(Â − K̂Ĉ)| ≤ 1 (cf. Bauer, 1998, Corollary 5.4.1). Another approach to guarantee the fulfillment of the minimum-phase assumption is the transformation to innovation form of the obtained transfer function under the assumption of a stable estimate Â (cf. e.g. Hannan & Deistler, 1988, Theorem 1.3.3).
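The limiting expressions of Theorem 5(ii) are straightforward to evaluate. The following sketch (ours; it presumes that (A_0, C_0, K_0) is already expressed in the particular state basis fixed in Theorem 5) computes the limiting low-order system for the no-input case:

```python
import numpy as np

def cca_truncation(A0, C0, K0, Sigma, nt):
    """Limit (A_nt, C_nt, K_nt) of Theorem 5(ii) for order nt < n0, no inputs."""
    A11, A12 = A0[:nt, :nt], A0[:nt, nt:]
    C1, C2 = C0[:, :nt], C0[:, nt:]
    K1 = K0[:nt, :]
    K_nt = (A12 @ C2.T + K1 @ Sigma) @ np.linalg.inv(Sigma + C2 @ C2.T)
    return A11, C1, K_nt
```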
5. Asymptotic distribution

In the last section the convergence properties of the estimators have been investigated. The bottom line was that for correctly specified order, for the various methods, assumptions could be found which lead to consistent estimation of
l(z) or even the pair of transfer functions $(l(z), k(z)) \in M_{n_0}$. Having established consistency, interest turns to the asymptotic distribution in order to quantify the confidence in the estimates. This question will again be addressed first for the MOESP type of methods and afterwards for the state approach.

5.1. MOESP type of methods

The asymptotic distribution for the MOESP type of methods has been established in Bauer and Jansson (2000) and extended to more general cases in Jansson (2000). The derivation of the asymptotic distribution uses standard techniques: The first step of the algorithms consists of a least-squares fit, and the asymptotic properties of least-squares estimators are well known. All subsequent calculations use only the estimates obtained in this first step and hence can be viewed as a nonlinear mapping of the least-squares estimators of the first step. Linearization of this mapping with respect to the entries of the least-squares estimators at the true quantities $\beta_z$ and $\beta_u$, using in particular the expressions given in (8) or (9), then directly leads to the asymptotic distribution as given in the next theorem:

Theorem 6. Let the assumptions of Theorem 2 hold and assume that the input satisfies Assumption 2. Assume that the true system $(l(z), k(z)) \in M_{n_0}(u_t, f, p, \cdot)$ is such that the consistency condition (10) of Theorem 2 holds.

(i) Then $\sqrt T(\hat\theta_i - \theta_{i,0})$ converges in distribution to a normal random variable with mean zero and some variance V, say, which in general depends on f, p, $W_f^+$ and $W_p^-$.

(ii) The asymptotic variance of the estimated system poles does not depend on the choice of the weighting $\hat W_f^+$.
(iii) If the estimate $\hat{\mathcal O}_f^\perp$ is based on $(\hat A, \hat C)$ then V does not depend on $\hat W_f^+$.
(i) is proven in Jansson (2000). The fact that the distribution of the estimated poles does not depend on $W_f^+$ has been stated in Jansson (1997). In Jansson (2000) explicit expressions for the asymptotic variance of the system matrix estimates are given, which show that the asymptotic variance of the estimates $\hat A$, $\hat C$ corresponding to a random state basis does not depend on $\hat W_f^+$. Chiuso and Picci (2003) provide relatively simple expressions for the asymptotic variance of the estimated system matrices using their robust N4SID algorithm based on (9). Unfortunately, these expressions have not yet led to a better understanding of the effects of the user choices f, p and $W_p^-$ on the asymptotic variance.
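The linearization technique underlying Theorem 6 is essentially the delta method. The following generic numpy sketch illustrates the idea, using a numerical Jacobian in place of the analytic linearizations (8) and (9); the function and its interface are illustrative assumptions and not taken from the cited papers.

```python
import numpy as np

def delta_method_variance(g, beta_hat, V_beta, eps=1e-6):
    """Asymptotic variance of g(beta_hat) by linearization (sketch).

    g        : mapping from the (vectorized) first-step least-squares
               estimate to the parameters of interest.
    beta_hat : consistent estimate of the regression coefficients.
    V_beta   : asymptotic covariance of sqrt(T) (beta_hat - beta_0).
    Returns J V_beta J', where J is the Jacobian of g at beta_hat.
    """
    b = np.asarray(beta_hat, dtype=float).ravel()
    g0 = np.asarray(g(b), dtype=float).ravel()
    J = np.zeros((g0.size, b.size))
    for j in range(b.size):          # forward differences, column by column
        bp = b.copy()
        bp[j] += eps
        J[:, j] = (np.asarray(g(bp), dtype=float).ravel() - g0) / eps
    return J @ V_beta @ J.T
```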
The fact that with the special choice of $\hat{\mathcal O}_f^\perp$ the asymptotic distribution does not depend on $W_f^+$ is to the best of my knowledge new, although it is immediate from the fact that in this case the estimation of B and D is based on $(\hat A, \hat C)$ and $\hat\beta_u$, where the influence of the weights is limited to $(\hat A, \hat C)$.
If $\hat{\mathcal O}_f^\perp = \hat U_2'(\hat W_f^+)^{-1}$, where $\hat U = [\hat U_n, \hat U_2]$, then the accuracy
of the estimated parameter vector depends on $\hat W_f^+$, as documented in the calculations contained in Bauer and Jansson (2000).

Much research effort in the area has been devoted to gaining knowledge on the influence of the integers f and p as well as the weighting matrix $\hat W_p^-$ on the asymptotic variance. Gustafsson (1999) contains a couple of facts in this respect; Jansson (1997) and Bauer and Jansson (2000) contain simulation results and some numerical calculations of the asymptotic variance. Chiuso and Picci (2003) contains a choice of $\hat W$ that is optimal in a certain sense for the estimation of B and D. The commonly used weighting matrices $\hat W_f^+$ and $\hat W_p^-$ for fixed integers f and p do not seem to lead to estimators which achieve the Cramér–Rao lower bound and hence are suboptimal. No recommendations with respect to the choice of f and p have been derived from expressions of the asymptotic accuracy up to now. This is an open problem.

5.2. The state approach

Most work on the asymptotic distribution of estimates obtained using the state approach refers to $\tilde x_{t+1} = \hat x_{t+1}$, although the asymptotics for different choices can be obtained using the same tools. The assumptions introduced for consistency are in fact strong enough to obtain the asymptotic distribution of the state approach estimates:

Theorem 7. Let the assumptions of Theorem 4 hold and assume that the state approach is used with the true order $n_0$, $f \geq n_0$ finite and not depending on the sample size, $p \geq -d\log T/(2\log\rho_0)$ with $d > 1$ and $0 < \rho_0 = |\lambda_{\max}(A_0 - K_0C_0)|$, $p \leq (\log T)^a$ for some $a < \infty$ (p chosen deterministically), and $\hat W_p^-$ chosen according to (7), to obtain the estimator $\hat\theta_i$. Then $\sqrt T(\hat\theta_i - \theta_{i,0})$ converges in distribution to a normal random variable with mean zero and variance V, say, depending on f and $W_f^+$. Using a different weighting matrix $\hat W_p^-$ such that there exists a deterministic matrix $W_p^-$ with $\|\hat W_p^- - W_p^-\|_2 = O(T^{-\varepsilon})$ for some $\varepsilon > 0$, $\|(W_p^-)^{-1}\|_2 < \infty$, $\|(\hat W_p^-)^{-1}\|_2 < \infty$ a.s. uniformly in $p = O((\log T)^a)$, does not change the asymptotic variance V.

A central limit theorem for the estimates of the system matrices using the state approach has been derived in Bauer et al. (1999, Theorem 1) for the case of no inputs and in Bauer (1998, Theorem 4.3.1) for the case with additional inputs, where both results hold only on the generic subset $M_{n_0}^+(u_t, f, \cdot)$ defined in Theorem 4.4. The result stated above is easily obtained from the proofs contained therein by replacing the linearization of the SVD based on the expressions in Theorem 1 with Eq. (8) (in the case that $\mathcal K_p\tilde S_p$
is nonsingular; otherwise similar expressions using a different selector matrix $\tilde S_p$ are straightforward to obtain). The independence of the asymptotic variance of the choice of $\hat W_p^-$ has been discussed in Bauer et al. (2000) for the case of no observed inputs and is a straightforward consequence of (8).

In a slightly different scenario, Chiuso and Picci (2004b) show that under the consistency assumption of Theorem 4.2 the estimators obtained using N4SID Algorithm 1 with finite $f \geq n_0 + 1$ and $p \geq n_0$ are asymptotically normal, where also the asymptotic variance for the estimates $(\hat A, \hat C)$ is given. The main difference to the results given above is that the assumptions on the input in this case do not include infinite persistence. They also provide some evidence that the weighting used in robust N4SID results in superior estimation accuracy compared to the original N4SID choice.

Also for the state approach the asymptotic analysis has focused on gathering knowledge on the dependence of the asymptotic variance on the weighting matrices and on the choice of the integers f and p. It has been noted in the theorem, and is apparent from (8), that $\hat W_p^-$ (subject to regularity conditions such as nonsingularity) does not influence the asymptotic accuracy. Also $p \to \infty$ at a certain rate is used in the theorem, where the rate depends on the true system and ensures that $\rho_0^pT^{1/2} \to 0$. Effectively this implies that the asymptotics do not rely on the fact that only finitely many lags are used to reconstruct the state rather than the whole past. It has been noted in Peternell et al. (1996) and Deistler, Peternell, and Scherrer (1995) that if $\hat p$ is chosen according to the order estimate in the autoregression

$$y_t = Du_t + \sum_{j=1}^{p}\Phi_j z_{t-j} + \varepsilon_t$$

obtained by minimizing the BIC criterion over $0 \leq p \leq (\log T)^a$, then $-2\hat p\log\rho_0/\log T \to 1$ a.s. holds.⁷ In the following $\rho_0 > 0$ will be assumed throughout. Therefore the integer part of $d\hat p$, $d > 1$, fulfills the conditions of Theorem 7 a.s. Hence this choice is recommended; a sketch of the corresponding lag selection is given below. It should be noted, however, that asymptotic theory does not provide any advice on the choice of the constant $d > 1$.

⁷ This holds except in the case where $y_t$ follows an ARX(p, q) process. In that case $\hat p$ tends to $\max(p, q)$.
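The following numpy sketch illustrates this recommendation for the case of observed inputs ($z_t = [y_t', u_t']'$). The function name, the exact BIC variant and the default constant d are illustrative assumptions; the upper bound p_max should respect the $(\log T)^a$ rate used above.

```python
import numpy as np

def choose_p_bic(y, u, p_max, d=1.5):
    """Select the past horizon p via BIC on a long autoregression (sketch).

    Fits y_t = D u_t + sum_{j=1}^p Phi_j z_{t-j} + e_t for p = 0..p_max,
    picks p_hat by BIC and returns the integer part of d * p_hat.
    y, u : T x s and T x m data arrays; d > 1 is the user constant.
    """
    T, s = y.shape
    z = np.hstack([y, u])                  # z_t = [y_t', u_t']'
    bic = []
    for p in range(p_max + 1):
        # regressors: current input and p lags of z
        X = np.hstack([u[p:]] + [z[p - j:T - j] for j in range(1, p + 1)])
        Y = y[p:]
        coef, *_ = np.linalg.lstsq(X, Y, rcond=None)
        resid = Y - X @ coef
        omega = resid.T @ resid / resid.shape[0]
        _, logdet = np.linalg.slogdet(omega)
        k = X.shape[1] * s                 # number of estimated parameters
        bic.append(resid.shape[0] * logdet + k * np.log(resid.shape[0]))
    p_hat = int(np.argmin(bic))
    return int(d * p_hat)
```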
With respect to the choice of f and $\hat W_f^+$ in the general case the situation is similar to the MOESP type of procedures: The linearizations in (8) can be used to derive tractable expressions for the asymptotic variance which are simpler than the original expressions implicitly contained in Bauer (1998), but still complex enough to prohibit a comparison of the various proposed weighting schemes and the different choices of f. The same is true for the expressions provided in Chiuso and Picci (2004b). However, for the state approach using $p \to \infty$ in the case of no inputs, the analysis of the effects of f and $\hat W_f^+$ on the asymptotic variance of the parameter estimators for correctly specified order is nearly complete:

Theorem 8. Let the assumptions of Theorem 7 hold and assume that there are no inputs present. Let the true transfer function $k_0(z) \in M_{n_0}$ be such that $\mathcal K_p[I_{n_0}, 0]'$ is nonsingular.

(i) Then the asymptotic variance of $\sqrt T(\hat\theta_i - \theta_{i,0})$ under the assumption of correctly specified order, using a fixed $f \geq n_0$, $p \geq -d\log T/(2\log\rho_0)$, $d > 1$ (p deterministic), $\hat W_p^- = \langle Z_{t,p}^-, Z_{t,p}^-\rangle^{1/2}$ and an arbitrary weighting $\hat W_f^+$ such that $\hat W_f^+ - W_f^+ = o(T^{-\varepsilon})$, $W_f^+ > 0$ for some $\varepsilon > 0$, is of the form

$$M_1\Omega M_1' + M_2\left[\Gamma_z \otimes \{\mathcal O_f^\ddagger\mathcal E_f(I\otimes\Omega)\mathcal E_f'(\mathcal O_f^\ddagger)'\}\right]M_2'. \qquad (11)$$

Here $\Gamma_z = \mathbb E Z_{t,\infty}^-(Z_{t,\infty}^-)'$ and $\mathcal O_f^\ddagger = (\mathcal O_f'\mathbf W^+\mathcal O_f)^{-1}\mathcal O_f'\mathbf W^+$ for $\mathbf W^+ = (W_f^+)'W_f^+$. Further, $M_1$ and $M_2$ are deterministic matrices depending on the true system $k_0(z)$ but not on f and $W_f^+$.

(ii) Expression (11) is minimized by the CCA choice corresponding to the weighting $W_f^+ = [\mathbb E Y_{t,f}^+(Y_{t,f}^+)']^{-1/2}$ for any fixed f. This minimal expression decreases monotonically as $f \to \infty$.

(iii) Under the above assumptions let the integers $f = p$ be chosen according to the restrictions $p \geq -d\log T/(2\log\rho_0)$, $p = O((\log T)^a)$ for some $d > 1$, $a < \infty$, p chosen deterministically. Then the estimates of the system obtained using CCA under the assumption of correctly specified order are asymptotically equivalent to estimates $\tilde\theta_i$ obtained from minimizing the one-step-ahead prediction error, in the sense that $\sqrt T(\hat\theta_i - \tilde\theta_i) \to 0$ in probability. Therefore, in the case of i.i.d. Gaussian innovations $(\varepsilon_t)_{t\in\mathbb Z}$, the CCA estimates are efficient.
The expression for the asymptotic variance is given in Theorem 1 of Bauer and Ljung (2002); the optimization thereof is dealt with in Corollary 2 of that paper. The expression for the asymptotic variance is obviously based on (8) and continues to hold in the case of white noise input $(u_t)_{t\in\mathbb Z}$. The asymptotic equivalence of CCA with estimators obtained by minimizing the one-step-ahead prediction error is proved in Bauer (2000). It is unfortunate that the proof of the asymptotic equivalence given in Bauer (2000) does not explain why CCA has this optimality property. A partial explanation is contained in two papers (Dahlen & Scherrer, 2004; Scherrer, 2002): The first shows that in the case of no observed input, CCA with $f = p \to \infty$ can be interpreted as performing a special model reduction step on a preliminary autoregressive model with lag length p. The model reduction is related to the expressions for the limiting lower order models given in Theorem 4.5. Scherrer (2002) derives optimality of this model reduction technique in terms of the approximation of the asymptotic likelihood. Finally, it is known that estimators of state space models based on preliminary AR models may attain the Cramér–Rao bound (see Wahlberg, 1989). These observations unfortunately do not fully explain the underlying reasons for the optimality of CCA, which might lead the way to similar results also in the case of colored noise.

Therefore in this case of no inputs the knowledge of the asymptotic properties is quite complete: If there is a true system and the order is correctly specified, then CCA with f and p tending to infinity at a rate which can be estimated from the data leads to optimal accuracy within the class of subspace methods. If the white noise $(\varepsilon_t)_{t\in\mathbb Z}$ is i.i.d. Gaussian then even the Cramér–Rao lower bound is attained and CCA achieves optimal accuracy within the class of all (asymptotically) unbiased estimates. For non-Gaussian innovations CCA is asymptotically equivalent to prediction error methods, which are the standard choice in this framework. If the order is specified too high then the state approach still delivers consistent estimates. In the case of underestimation of the order a bias results, expressions for which can be found above. A schematic implementation of the CCA procedure is sketched below.
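For concreteness, the following numpy sketch outlines the CCA flavour of the state approach in the case of no observed inputs, using the conventions as reconstructed above (weights (7), state estimate from the SVD, least-squares fits for the system matrices). It is a minimal illustration under the assumption of positive definite sample covariances, not a reference implementation: practical algorithms use QR factorizations and more careful conditioning.

```python
import numpy as np

def cca_subspace(y, n, f, p):
    """CCA state-approach subspace estimator, no observed inputs (sketch).

    y : T x s output data; n : order (n <= min(f*s, p*s));
    f, p : future/past horizons. Returns (A, C, K) in the SVD state basis.
    """
    T, s = y.shape
    N = T - f - p + 1                       # number of usable time points
    # stacked past Z_t^- (most recent lag first) and future Y_t^+
    Zm = np.vstack([y[p - j:p - j + N].T for j in range(1, p + 1)])
    Yp = np.vstack([y[p + j:p + j + N].T for j in range(f)])
    Gp = Zm @ Zm.T / N                      # <Z-, Z->
    Gf = Yp @ Yp.T / N                      # <Y+, Y+>
    beta = (Yp @ Zm.T / N) @ np.linalg.inv(Gp)
    def msqrt(M, power):                    # symmetric matrix power via eigh
        w, v = np.linalg.eigh(M)            # assumes M positive definite
        return (v * w**power) @ v.T
    Wf = msqrt(Gf, -0.5)                    # W_f^+ = <Y+,Y+>^{-1/2}, cf. (7)
    Wp = msqrt(Gp, 0.5)                     # W_p^- = <Z-,Z->^{1/2}, cf. (7)
    U, sv, Vt = np.linalg.svd(Wf @ beta @ Wp)   # sv also usable for IC(n)
    # state estimate x_t = V_n' (W_p^-)^{-1} Z_t^-
    x = Vt[:n] @ np.linalg.inv(Wp) @ Zm
    yt = y[p:p + N].T                       # y_t aligned with x_t
    C = yt @ x.T @ np.linalg.inv(x @ x.T)   # least squares y_t = C x_t + e_t
    eps = yt - C @ x
    x1, x0, e0 = x[:, 1:], x[:, :-1], eps[:, :-1]
    R = np.vstack([x0, e0])
    AK = x1 @ R.T @ np.linalg.inv(R @ R.T)  # x_{t+1} = A x_t + K eps_t
    return AK[:, :n], C, AK[:, n:]
```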
6. Estimating the order

All results cited above refer to the estimators obtained using some specified order. Here the term specified is used to indicate that the order is supplied externally, i.e. not data driven. Typically the order is not known in advance and its choice is part of the identification process. For subspace methods three different approaches have been discussed in the literature:

• Estimating the order based on the information contained in the estimated singular values.
• Obtaining the order from sequential testing of the rank of the matrix $\hat\beta_z$ or closely related matrices.
• Using estimates of the innovation variance in analogy to the information criteria.

These approaches will be discussed in the following. All these procedures use only in-sample information. Alternatively, of course, cross-validation techniques can be used for model selection. In this paper, somewhat arbitrarily, the discussion is limited to the in-sample based procedures.
6.1. Estimating the order based on singular values

In the second step of subspace methods the typically full rank matrix $\hat\beta_z$ is approximated by the rank n matrix $\hat{\mathcal O}_f\hat{\mathcal K}_p$, where the approximation is derived using the SVD

$$\hat W_f^+\hat\beta_z\hat W_p^- = \hat U_n\hat\Lambda_n\hat V_n' + \hat R_n.$$

Let as before $\hat\sigma_1 \geq \hat\sigma_2 \geq \cdots$ denote the estimated singular values. Then, for $n_0$ denoting the true order, it follows under suitable assumptions on the data generating process and the weighting matrices (to be discussed below) that the singular values converge to their true values, i.e. $\hat\sigma_i \to \sigma_i > 0$, $i = 1, \ldots, n_0$, $\hat\sigma_i \to 0$, $i > n_0$. Therefore $\|\hat R_{n_0}\|_2 \to 0$ follows. This fact is used in the criterion

$$IC(n) = \|\hat R_n\|^2 + \frac{d(n)C_T}{T}, \qquad n = 0, \ldots, \min(fs, (s+m)p),$$

where $\|\cdot\|: \mathbb R^{fs\times(s+m)p} \to [0, \infty]$ denotes a nonlinear mapping and $d(n) = 2sn + sm + nm$ denotes the dimension of the parameter vector for a system of order n. $C_T > 0$ penalizes big models. Suggestions for $\|\cdot\|$ found in the literature are ($M = \min(fs, (s+m)p)$):

• $\|\hat R_n\|_{Fr}^2 = \sum_{j=n+1}^{M}\hat\sigma_j^2$, suggested in Peternell (1995).⁸
• $\|\hat R_n\|_2^2 = \hat\sigma_{n+1}^2$, suggested in Bauer (2001).
• $\|\hat R_n\|^2 = -\sum_{j=n+1}^{M}\log(1-\hat\sigma_j^2)$, suggested by Camba-Mendez and Kapetanios (2001) in the case of CCA weightings implying $\hat\sigma_1 < 1$.

⁸ In fact, Peternell (1995) uses a different expression for d(n) which, however, is asymptotically equivalent.

Asymptotic results for the estimation of the order have only been proven under rather strong assumptions on the inputs:

Assumption 4. The input process $(u_t)_{t\in\mathbb Z}$ is of the form $u_t = \sum_{j=1}^{h}c_je^{i\omega_jt} + c_{h+1}v_t$, where $v_t = \sum_{j=0}^{\infty}K_v(j)\nu_{t-j}$, where $(\nu_t)_{t\in\mathbb Z}$ is an ergodic martingale difference sequence fulfilling the noise assumptions stated in Assumption 1 with $\nu_j$ independent of $\varepsilon_r$, $r \in \mathbb Z$, and where $\|K_v(j)\| \leq K_v\rho_v^j$ for some $0 < K_v < \infty$, $0 < \rho_v < 1$. Furthermore, $f_u(\omega) = \left(\sum_{j=0}^{\infty}K_v(j)e^{i\omega j}\right)\left(\sum_{j=0}^{\infty}K_v(j)e^{i\omega j}\right)^*$ is assumed to fulfill $0 < \underline fI \leq f_u(\omega) \leq \bar fI < \infty$ for $-\pi < \omega \leq \pi$. The random variables $c_j \in \mathbb R^m$ have zero mean and finite variances and are such that the corresponding process $(u_t)_{t\in\mathbb N}$ is real valued. The variable $c_{h+1} \in \mathbb R$ is a constant. If $c_{h+1} = 0$ the process $(u_t)_{t\in\mathbb Z}$ is assumed to be persistently exciting of order $f + p$, i.e. $\langle U_{t,f+p}^+, U_{t,f+p}^+\rangle > cI$ is assumed to hold a.s. for T large enough.

These assumptions are stronger than Assumption 2. The following result clarifies the properties of the order estimation using IC(n) for any of the above-mentioned choices of $\|\cdot\|$:

Theorem 9. Let the process $(y_t)_{t\in\mathbb N}$ fulfill the assumptions of Theorem 2, where the input process $(u_t)_{t\in\mathbb Z}$ satisfies Assumption 4. The true order of the system is denoted by $n_0$ and for fixed $f \geq n_0 + 1$ and $p \geq n_0$ independent of the sample size the true system $(A_0, B_0, C_0, D_0, K_0) \in M_{n_0}(u_t, f, p, \cdot)$. The weightings $\hat W_f^+$ and $\hat W_p^-$ are assumed to be chosen such that $\hat W_f^+ - W_f^+ = O(Q_T)$, $\hat W_p^- - W_p^- = O(Q_T)$ for nonsingular matrices $W_f^+$ and $W_p^-$.
Let $\hat n$ denote the minimal integer minimizing IC(n) over $0 \leq n \leq \min(fs, (s+m)p)$. In this case the conditions $C_T > 0$, $C_T/T \to 0$, $C_T/(fp\log\log T) \to \infty$ are sufficient for the consistency of $\hat n$ for either the Frobenius norm or the two norm, i.e. $\hat n \to n_0$ a.s. holds. For the CCA choice of the weightings $\hat W_f^+$ and $\hat W_p^-$ according to (7) the same holds for the function $\|\hat R_n\|^2 = -\sum_{j=n+1}^{M}\log(1-\hat\sigma_j^2)$.

If $f \geq n_0 + 1$ is fixed and $p \to \infty$, $p \geq -d\log T/(2\log\rho_0)$, $p \leq (\log T)^a$ for some $a < \infty$, or if $f = p$, $p \geq -d\log T/(2\log\rho_0)$, $p \leq (\log T)^a$ and the CCA weights (7) are used, then the same condition on the penalty leads to consistent estimates of the order for $(l(z), k(z)) \in M_{n_0}$ and input $(u_t)_{t\in\mathbb Z}$ satisfying Assumption 3.

The proof for the two norm and the Frobenius norm with the CCA weighting matrices can be found in Bauer (2001); the result for the choice of Camba-Mendez and Kapetanios (2001) follows in a straightforward fashion using the arguments contained therein, observing that $\log(1+x) = x + o(|x|)$ for small x. Also the generalization of the weighting matrices under the restrictions of the theorem is straightforward. If $(l(z), k(z)) \in M_{n_0} - M_{n_0}(u_t, f, p, \cdot)$ it follows that $\sigma_{n_0} = 0$. In this situation IC(n) cannot provide consistent estimates of the order and $\hat n < n_0$ a.s. asymptotically (cf. Bauer, 2001). Hence in the situation where the consistency condition is violated, this cannot be detected on the basis of the estimated singular values.

The theorem is not satisfactory for a number of reasons: First and most importantly, only sufficient conditions are given in terms of a lower bound on the penalty factor. Simulations in Bauer (2001) showed that even a penalty at this lower bound might lead to a high risk of underestimating the order. Secondly, the theorem states consistency for three different choices of $\|\cdot\|^2$ but does not allow for a comparison amongst them. Camba-Mendez and Kapetanios (2001) argue that their criterion amounts to the likelihood ratio of a reduced rank regression. However, the asymptotic distribution of the likelihood ratio might be nonstandard since the residuals in the corresponding regression equation are nonwhite. Hence typical reasoning on the basis of the conventional $\chi^2$ limiting distribution might not be valid. Bauer (2001) invokes simulation examples to demonstrate that the Frobenius norm based criterion is more sensitive to the choice of f and p than the two norm based criterion. This has not been supported by any analytical analysis. A sketch of the singular value based order selection is given below.
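The following numpy sketch implements the IC(n) selection, assuming the singular values of $\hat W_f^+\hat\beta_z\hat W_p^-$ have already been computed (e.g. within the CCA sketch at the end of Section 5). The concrete value passed as C_T is the user's choice, subject to the rate conditions of Theorem 9; the function name and interface are illustrative.

```python
import numpy as np

def estimate_order(sv, T, f, p, s, m, C_T, norm="two"):
    """Order selection by minimizing IC(n) over 0..M (sketch).

    sv  : estimated singular values, in descending order.
    C_T : penalty satisfying C_T/T -> 0, C_T/(f*p*loglog T) -> infinity.
    norm: 'two', 'frobenius' or 'cca' (the latter requires sv < 1,
          i.e. CCA weightings, where sv are canonical correlations).
    """
    M = min(f * s, (s + m) * p)
    sv = np.asarray(sv)[:M]
    ics = []
    for n in range(M + 1):
        tail = sv[n:]
        if norm == "two":
            r = tail[0]**2 if tail.size else 0.0
        elif norm == "frobenius":
            r = np.sum(tail**2)
        else:
            r = -np.sum(np.log(1.0 - tail**2))
        d_n = 2 * s * n + s * m + n * m     # parameter count d(n)
        ics.append(r + d_n * C_T / T)
    return int(np.argmin(ics))              # minimal minimizing integer
```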
6.2. Sequential testing approaches

An alternative concept for the specification of the order is to use a sequential testing approach. Again the linearization equations of Theorem 4.1 can be of value. Sorelius (1999) derived a sequential testing approach based on $\hat H_{f,f} = \langle Y_{t,f}^+, Z_{t,f}^-\rangle$ in the case of no inputs, which for s = 1 can be described as follows (a schematic implementation is sketched at the end of this subsection):

(1) Choose f = 1 and test for nonsingularity of $\hat H_{f,f}$.
(2) If the null of singularity is rejected, set f = f + 1 and go to step (3). Else go to step (4).
(3) Test for nonsingularity of the matrix $\hat H_{f,f}$. Go to step (2).
(4) The order is estimated as $\hat n = f - 1$. Stop.

The asymptotic distribution of the tests can be derived using the expressions given in Theorem 4.1. In Sorelius (1999) also an extension to the multivariate case is presented, two competing tests for step (3) are discussed, and asymptotic distributions of all test statistics are derived. Camba-Mendez and Kapetanios (2001) contains a very similar sequential testing procedure. Peternell (1995) contains a discussion of the asymptotic distribution of similar expressions.

Again the knowledge about the properties of the testing approaches is sparse. Note that both the estimation procedures discussed in the last subsection and the sequential testing approach of this section are based on the estimated singular values. While the estimation approach uses only a very crude argument based on the rate of convergence of the estimate $\hat\beta_z$, the tests are based on a detailed evaluation of the asymptotic distribution of the smallest singular values (or related statistics, respectively). On the other hand, the tests necessarily depend on the underlying assumption of a true system, whereas the estimation procedures might also provide a reasonable fit in situations where the true data generating process can only be approximated. Therefore a detailed comparison of these two approaches seems to be valuable. In the literature such a comparison has not even been attempted.
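A schematic implementation of the sequential scheme follows. The actual test statistics and critical values are derived in Sorelius (1999) and are abstracted here into a user-supplied, hypothetical function `T_stat_test`; everything else about the interface is likewise an illustrative assumption.

```python
import numpy as np

def sequential_order_test(y, T_stat_test, f_max=20):
    """Sequential rank-testing scheme of Sorelius (1999), s = 1 (sketch).

    T_stat_test : hypothetical user-supplied test for singularity of
                  H_{f,f}; returns True if the null of singularity is
                  rejected (statistics derived in Sorelius, 1999).
    """
    y = np.asarray(y).ravel()
    for f in range(1, f_max + 1):
        # H_{f,f} = <Y+_{t,f}, Z-_{t,f}>: sample covariance of stacked
        # future and stacked past outputs
        N = y.size - 2 * f + 1
        Yp = np.vstack([y[f + j:f + j + N] for j in range(f)])
        Zm = np.vstack([y[f - j:f - j + N] for j in range(1, f + 1)])
        H = Yp @ Zm.T / N
        if not T_stat_test(H, N):      # null of singularity not rejected
            return f - 1               # estimated order n_hat = f - 1
    return f_max                       # no decision within f_max steps
```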
6.3. Using the estimated innovations covariance

The third approach to specifying the order of the system lies in using the estimated innovation variance. This approach has only been investigated for the state approach where $p \to \infty$ is used. We will here confine our interest to the case of no inputs. Note that in the state approach an estimate of the innovations covariance matrix is immediately obtained as

$$\hat\Omega(\tilde n) = \langle\hat\varepsilon_t(\tilde n), \hat\varepsilon_t(\tilde n)\rangle = \langle y_t - \hat C(\tilde n)\hat x_t(\tilde n),\; y_t - \hat C(\tilde n)\hat x_t(\tilde n)\rangle,$$

where $\hat x_t(\tilde n) = \hat V_{\tilde n}'(\hat W_p^-)^{-1}Z_{t,p}^-$, $\hat\varepsilon_t(\tilde n) = y_t - \hat C(\tilde n)\hat x_t(\tilde n)$ and $\hat C(\tilde n) = \langle y_t, \hat x_t(\tilde n)\rangle\langle\hat x_t(\tilde n), \hat x_t(\tilde n)\rangle^{-1}$. Now it has already been used in Theorem 5 that for $\sigma_{\tilde n} > \sigma_{\tilde n+1}$,

$$\hat\Omega(\tilde n) \to \begin{cases}\Omega, & \tilde n \geq n_0,\\ \Omega + C_2(\tilde n)\Sigma C_2(\tilde n)', & \tilde n < n_0,\end{cases}$$

where $n_0$ denotes the order of the true system. Here $C = [C_1(\tilde n), C_2(\tilde n)]$, $C_1(\tilde n) \in \mathbb R^{s\times\tilde n}$, $C_2(\tilde n) \in \mathbb R^{s\times(n_0-\tilde n)}$ corresponds to the system representation according to the SVD of $W_f^+\mathcal O_f\mathcal K_\infty W_\infty^-$, i.e. according to the state
$x_t = V_{n_0}'(W_\infty^-)^{-1}Z_{t,\infty}^-$. Note that this estimate $\hat\Omega(\tilde n)$ is different from $\tilde\Omega(\tilde n) = \langle\varepsilon_t(\hat\theta_{i,\tilde n}), \varepsilon_t(\hat\theta_{i,\tilde n})\rangle$, where $\varepsilon_t(\hat\theta_{i,\tilde n})$ denotes the innovation estimate obtained from the Kalman filter using the subspace parameter estimate $\hat\theta_{i,\tilde n}$ obtained for the estimation order $\tilde n$. The difference between these two estimates is that in the estimation of $\hat\varepsilon_t(\tilde n) = y_t - \hat C(\tilde n)\hat x_t(\tilde n)$ the state is estimated using the first $\tilde n$ rows of $\hat{\mathcal K}_p \in \mathbb R^{n_0\times ps}$. Note that the first $\tilde n$ rows of $\mathcal K_\infty = V_{n_0}'(W_\infty^-)^{-1}$ may not correspond to an $\tilde n$th order system. Therefore, asymptotically $\hat\Omega(\tilde n)$ represents the residual error variance of explaining $y_t$ using only the first $\tilde n$ components of the state while neglecting the dynamics in the state; $\tilde\Omega(\tilde n)$, on the other hand, takes the dynamics of the state into account. The difference is seen most drastically in a state space system (A, C, K) where $C = [I_s, 0]$ and $x_t$ denotes the corresponding state. In this situation adding the last $n_0 - s$ state components to the equation $y_t = C(s)x_t(s) + \varepsilon_t$ does not contribute to the explanation of the output $y_t$. However, if the system is minimal, the last $n_0 - s$ coordinates of $x_t$ contribute to the prediction of $x_{t+1}$ and hence are essential for the state dynamics. $\tilde\Omega(n)$ takes the state dynamics into account while $\hat\Omega(n)$ measures only the immediate impact of the components of $x_t$ on $y_t$. Using these two estimates one can define two criterion functions in analogy to IC(n) for $0 \leq n \leq \min(fs, ps)$:

$$IVC(n) = \log\det\hat\Omega(n) + \frac{d(n)C_T}{T}, \qquad \widetilde{IVC}(n) = \log\det\tilde\Omega(n) + \frac{d(n)C_T}{T},$$

which should not be confused with the information criteria of the AIC or BIC type (cf. e.g. Akaike, 1976), which use the estimate $\check\Omega(n) = \langle\varepsilon_t(\tilde\theta_{i,n}), \varepsilon_t(\tilde\theta_{i,n})\rangle$ in place of $\tilde\Omega(n)$, where $\tilde\theta_{i,n}$ denotes the prediction error estimate of the parameter vector and $\varepsilon_t(\tilde\theta_{i,n})$ the corresponding estimated prediction errors. The asymptotic properties of these procedures are given in the next theorem:

Theorem 10. Let the process $(y_t)_{t\in\mathbb N}$ fulfill the assumptions of Theorem 7 and assume that there is no input $(u_t)_{t\in\mathbb Z}$ present. Let the true order of the system be denoted as $n_0$. Let $f \geq n_0$ be fixed, not depending on T, and let $p \geq -d\log T/(2\log\rho_0)$, $d > 1$, $p \leq (\log T)^a$ for some $a < \infty$. Let $\hat W_f^+$ be such that there exists a nonsingular matrix $W_f^+$ with $\hat W_f^+ - W_f^+ = O(Q_T)$ and let $\hat W_p^-$ be chosen according to (7). Assume that $\sigma_{n_0-1} > \sigma_{n_0}$. Then for $C_T/(p\log\log T) \to \infty$, $C_T/T \to 0$, the order estimate obtained by minimizing IVC(n) over $0 \leq n \leq \min(fs, ps)$ is consistent if $C_2(n_0-1) \neq 0$; else the order is underestimated asymptotically a.s.

The proof can be found in Bauer (2001, Theorem 4). The main message from the theorem is that using IVC(n)
has some disadvantages: Even though the set on which $C_2(n_0-1) = 0$ holds might be small (in some cases only the complement of a generic subset of $M_{n_0}$; for details see Bauer, 1998, p. 103), order estimation will also be difficult in situations where $C_2(n_0-1)$ is small, as documented in the simulations contained in Bauer (2001). Therefore the usage of IVC(n) is not recommended. On the contrary, the properties of $\widetilde{IVC}(n)$ seem to be preferable. However, this criterion has not been analyzed in the literature so far.

7. Closed-loop identification

It is somewhat surprising that, although subspace methods have been investigated primarily in the automatic control community, the literature on their use in the closed-loop situation is very sparse. A system is considered to be in closed loop if the input $u_t$ is, amongst other variables, also influenced by past outputs $y_{t-1}, y_{t-2}, \ldots$. Considering the first step of the algorithms, i.e. the least-squares fit in the equation

$$Y_{t,f}^+ = \beta_zZ_{t,p}^- + \beta_uU_{t,f}^+ + (\mathcal O_f(A-KC)^px_{t-p} + \mathcal E_fE_{t,f}^+),$$

one notices that, apart from the problems due to the presence of $x_{t-p}$, in the closed-loop case the term due to the future of the inputs $(u_t)_{t\in\mathbb Z}$ is correlated with the future of the noise contained in $E_{t,f}^+$; a small simulation illustrating this correlation is sketched below. Of course this problem does not occur for joint identification considering $(z_t)_{t\in\mathbb Z}$ as the output (cf. Verhaegen, 1993).

In the literature there are a number of propositions to deal with this problem: Chou and Verhaegen (1997) present an algorithm which is similar to the subspace methods presented above, using an instrumental variables approach in the regression of the first step. Chou and Verhaegen (1999) propose to move the term $\beta_uU_{t,f}^+$ to the left-hand side using a preliminary estimate of $\beta_u$ obtained, e.g., by the previously mentioned algorithm. Then $\beta_z$ can be estimated consistently and the state approach can be used to obtain estimates of the system matrices. Ljung and McKelvey (1996) and Qin and Ljung (2003) suggest recursive methods in order to obtain consistent estimators of $\beta_z$ and $\beta_u$. The analysis of these procedures is only emerging. The geometrical framework based on stochastic realization theory appears to be extremely helpful in this respect: Chiuso and Picci (2004a) provide some results on consistency, respectively asymptotic bias, of these estimators, and Chiuso (2004) provides the first preliminary asymptotic normality results. The literature in this field is far from conclusive and there are many open questions left for research.
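The correlation problem can be made visible with a few lines of simulation. In the following sketch all system and feedback parameters are illustrative choices; the point is only that the feedback law makes $u_t$ depend on the current innovation $\varepsilon_t$.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 100_000
a, b, k = 0.7, 1.0, 0.5            # plant and feedback gains (illustrative)
y = np.zeros(T)
u = np.zeros(T)
eps = rng.standard_normal(T)       # innovations
r = rng.standard_normal(T)         # reference / excitation signal
for t in range(1, T):
    y[t] = a * y[t - 1] + b * u[t - 1] + eps[t]
    u[t] = -k * y[t] + r[t]        # feedback law closes the loop
# u_t depends on y_t and hence on eps_t: the inputs are correlated with
# the noise, violating the orthogonality used in the first-step fit.
print(np.corrcoef(u[1:], eps[1:])[0, 1])   # clearly nonzero (about -0.4)
```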
8. Conclusions

The analysis of the asymptotic properties of subspace algorithms has been pursued for approximately a decade now. A number of questions have been clarified: Consistency conditions have been found and asymptotic normality has been established for virtually all variants of subspace methods for linear dynamic systems under open-loop operating conditions. Expressions for the asymptotic variances have been found and investigated. Order estimation techniques have been proposed. The limiting behaviour in some situations of misspecification has been derived.

Nevertheless, in a number of areas the knowledge gained is still sparse. Traces of open questions can be found throughout the paper. The most pressing ones as far as asymptotic theory is concerned are, in my opinion:

• Comparison of the asymptotic accuracy of the various proposed methods in the case of colored inputs and correctly specified order. This includes the development of recommendations, based on the asymptotic variance, on how to choose f, p, $\hat W_f^+$ and $\hat W_p^-$.
• Misspecification analysis. Again the effects of the user choices are the major puzzle to be solved.
• Analysis of algorithms for closed-loop operating conditions.
• Order estimation: The approaches developed so far can only be seen as preliminary since they lack a profound motivation. Also a comparison between the various approaches is missing.

Another area, which is somewhat outside the scope of this survey, lies in the analysis of algorithms for different model classes such as bilinear models. The main techniques discussed in this paper, such as the linearization of the SVD, also apply to other model classes. Hence it seems that the properties of many different algorithms can be analyzed using these techniques. In particular, consistency and asymptotic normality could be proved along the lines outlined here.

The last decade has seen tremendous growth in the number of proposed algorithms. The next decade faces the challenge of trying to understand the merits of the various algorithms in order to transform the basic idea of subspace algorithms from an interesting descriptive tool into a sound, well understood statistical technique for model selection and estimation.
Acknowledgements

The author wishes to thank Manfred Deistler and Wolfgang Scherrer for their continued support during the last decade. A big part of the results achieved in the area is due to their foresight and technical skills. I also want to thank Thomas Ribarits, Brett Ninness, Lennart Ljung and Tomas McKelvey for many fruitful discussions on the subject of this paper. The comments of three reviewers led to a significant improvement of the paper, for which I am grateful. Finally, the author acknowledges financial support by the European Commission under the project EU TMR 'SI', by the Austrian
FWF under project P14438-INF and the Max Kade foundation financing a postdoc position at the Cowles Foundation for Research in Economics, Yale University, New Haven, USA, whose hospitality is gratefully acknowledged.

References

Akaike, H. (1976). Canonical correlation analysis of time series and the use of an information criterion. In R. Mehra, & D. Lainiotis (Eds.), System identification: Advances and case studies (pp. 27–96). New York: Academic Press.

Bauer, D. (1998). Some asymptotic theory for the estimation of linear systems using maximum likelihood methods or subspace algorithms. Ph.D. thesis, TU Wien, Austria.

Bauer, D. (2000). Comparing the CCA subspace method to pseudo maximum likelihood methods in the case of no exogenous inputs. Technical report, TU Wien, Institute for Econometrics, Operations Research and System Theory.

Bauer, D. (2001). Order estimation for subspace methods. Automatica, 37, 1561–1573.

Bauer, D. (2003). Subspace methods. In Proceedings of the IFAC symposium on system identification, SYSID'03, Rotterdam, The Netherlands.

Bauer, D., Deistler, M., & Scherrer, W. (1998). Asymptotic distributions of subspace estimates under misspecification of the order. In Proceedings of the 1998 MTNS, Padua, Italy.

Bauer, D., Deistler, M., & Scherrer, W. (1999). Consistency and asymptotic normality of some subspace algorithms for systems without observed inputs. Automatica, 35, 1243–1254.

Bauer, D., Deistler, M., & Scherrer, W. (2000). On the impact of weighting matrices in subspace algorithms. In Proceedings of the IFAC conference SYSID'00, Santa Barbara, California.

Bauer, D., & Jansson, M. (2000). Analysis of the asymptotic properties of the MOESP type of subspace algorithms. Automatica, 36(4), 497–509.

Bauer, D., & Ljung, L. (2002). Some facts about the choice of the weighting matrices in Larimore type of subspace algorithms. Automatica, 38, 763–773.

Camba-Mendez, G., & Kapetanios, G. (2001). Testing the rank of the Hankel covariance matrix: A statistical approach. IEEE Transactions on Automatic Control, 46, 331–336.

Chatelin, F. (1983). Spectral approximation of linear operators. New York: Academic Press.

Chiuso, A. (2004). Asymptotic variance of closed-loop subspace identification methods. In Proceedings of the 43rd CDC, Bahamas.

Chiuso, A., & Picci, G. (2003). The asymptotic variance of subspace estimates. Journal of Econometrics, 118, 257–291.

Chiuso, A., & Picci, G. (2004a). Consistency analysis of some closed-loop subspace identification methods. Automatica, 41(3), this issue.

Chiuso, A., & Picci, G. (2004b). Numerical conditioning and asymptotic variance of subspace estimates. Automatica, 40, 677–683.

Chiuso, A., & Picci, G. (2004c). On the ill-conditioning of subspace identification with inputs. Automatica, 40, 575–589.

Chiuso, A., & Picci, G. (2004d). Subspace identification by data orthogonalization and model decoupling. Automatica, 40(10), 1689–1703.

Chou, C. T., & Verhaegen, M. (1997). Subspace algorithms for the identification of multivariable dynamic errors-in-variables models. Automatica, 33, 1857–1869.

Chou, C. T., & Verhaegen, M. (1999). Closed-loop identification using canonical correlation analysis. In Proceedings of the ECC'99 conference, Karlsruhe, Germany.

Chui, N. (1997). Subspace methods and informative experiments for system identification. Ph.D. thesis, Cambridge University, Cambridge.

Chui, N. L. C., & Maciejowski, J. M. (1996). Realization of stable models with subspace methods. Automatica, 32(11), 1587–1595.
Dahlen, A., Lindquist, A., & Mari, J. (1998). Experimental evidence showing that stochastic subspace identification methods may fail. Systems & Control Letters, 34, 303–312.

Dahlen, A., & Scherrer, W. (2004). The relation of the CCA subspace method to a balanced reduction of an autoregressive model. Journal of Econometrics, 118, 293–312.

Deistler, M., Peternell, K., & Scherrer, W. (1995). Consistency and relative efficiency of subspace methods. Automatica, 31, 1865–1875.

Gustafsson, T. (1999). Subspace methods for system identification and signal processing. Ph.D. thesis, Chalmers University, Gothenburg, Sweden.

Hannan, E. J., & Deistler, M. (1988). The statistical theory of linear systems. New York: Wiley.

Jansson, M. (1997). On subspace methods in system identification and sensor array signal processing. Ph.D. thesis, KTH, Stockholm.

Jansson, M. (2000). Asymptotic variance analysis of subspace identification methods. In Proceedings of the SYSID'2000 conference, Santa Barbara, California.

Jansson, M., & Wahlberg, B. (1998). On consistency of subspace methods for system identification. Automatica, 34(12), 1507–1519.

Knudsen, T. (2001). Consistency analysis of subspace identification methods based on a linear regression approach. Automatica, 37, 81–89.

Larimore, W. E. (1983). System identification, reduced order filters and modeling via canonical variate analysis. In H. S. Rao, & P. Dorato (Eds.), Proceedings of the 1983 American control conference, Vol. 2 (pp. 445–451). Piscataway, NJ.

Ljung, L. (1999). System identification: Theory for the user (2nd ed.). Englewood Cliffs, NJ: Prentice-Hall.

Ljung, L., & McKelvey, T. (1996). A least squares interpretation of subspace methods for system identification. In Proceedings of the CDC'96, Kobe, Japan.

Peternell, K. (1995). Identification of linear dynamic systems by subspace and realization-based algorithms. Ph.D. thesis, TU Wien.

Peternell, K., Scherrer, W., & Deistler, M. (1996). Statistical analysis of novel subspace identification methods. Signal Processing, 52, 161–177.

Picci, G., & Katayama, T. (1996). Stochastic realization with exogenous inputs and 'subspace methods' identification. Signal Processing, 52, 145–160.

Qin, S., & Ljung, L. (2003). Closed-loop subspace identification with innovation estimation. In Proceedings of the 13th IFAC symposium on system identification, Rotterdam, The Netherlands.

Scherrer, W. (2002). Local optimality of minimum phase balanced truncation. In Proceedings of the 15th IFAC world congress, Barcelona, Spain.

Sorelius, J. (1999). Subspace-based parameter estimation problems in signal processing. Ph.D. thesis, Uppsala University, Sweden.

Van Overschee, P. (1995). Theory, implementation and practice of subspace identification algorithms. Ph.D. thesis, Katholieke Universiteit Leuven, Belgium.

Van Overschee, P., & De Moor, B. (1994). N4SID: Subspace algorithms for the identification of combined deterministic-stochastic systems. Automatica, 30(1), 75–93.

Verhaegen, M. (1993). Application of a subspace model identification technique to identify LTI systems operating in closed loop. Automatica, 29(4), 1027–1040.

Verhaegen, M. (1994). Identification of the deterministic part of MIMO state space models given in innovations form from input-output data. Automatica, 30(1), 61–74.

Verhaegen, M., & DeWilde, P. (1993). Subspace model identification. Part 3: Analysis of the ordinary output-error state-space model identification algorithm. International Journal of Control, 58(3), 555–586.

Viberg, M. (1995). Subspace-based methods for the identification of linear time-invariant systems. Automatica, 31(12), 1835–1851.

Wahlberg, B. (1989). Estimation of autoregressive moving-average models via high-order autoregressive approximations. Journal of Time Series Analysis, 10, 283–299.

Dietmar Bauer was born in St. Pölten, Austria, in 1972. He received his master's and Ph.D. degrees in Applied Mathematics from the TU Wien in 1995 and 1998, respectively. Since 1995 he has been with the Institute for Econometrics, Operations Research and System Theory (now the Department of Mathematical Methods in Economics) of the TU Wien, except for three postdoc positions held at the University of Newcastle, Australia, Linköping University, Sweden, and the Cowles Foundation, Yale University, USA. His research interests include time series analysis, in particular subspace algorithms for linear dynamical systems in the stationary case and for unit root processes.