Digital Signal Processing 11, 252–268 (2001), doi:10.1006/dspr.2001.0385, available online at http://www.idealibrary.com
Estimating Optimal Weights for Instrumental Variable Methods¹

Petre Stoica* and Magnus Jansson†

*Department of Systems and Control, Information Technology, Uppsala University, Uppsala, Sweden; †Department of Signals, Sensors and Systems, Royal Institute of Technology (KTH), SE-100 44 Stockholm, Sweden
E-mail: [email protected]

Stoica, P., and Jansson, M., Estimating Optimal Weights for Instrumental Variable Methods, Digital Signal Processing 11 (2001) 252–268.

The accuracy of the instrumental variable approach to parameter estimation (also called correlation analysis) can be significantly improved by optimally weighting the estimating equations. However, the optimal weight in general depends on unknown quantities and hence must itself be estimated before its use becomes possible. The optimal weighting matrix encountered in a typical parameter estimation problem is equal to the DC power of a certain vector sequence, and usually it has to be consistently estimated in a nonparametric manner. Consistent nonparametric estimation of power (at any frequency, zero or not) is not an easy task, and typically it cannot be achieved by using "natural" estimators (such as the periodogram). Surprisingly, it turns out that in the case considered in this paper there is a natural consistent estimator of the DC power matrix that represents the desired optimal weight. This result should promote the use of optimally weighted parameter estimation methods of instrumental variable or related types; it may also have some interesting consequences for nonparametric spectral analysis which, however, are not explored herein. © 2001 Academic Press

Key Words: parameter estimation; weighted instrumental variable methods; optimal weighting; nonparametric estimation of the optimal weighting.
¹ This work was supported in part by the Swedish Foundation for Strategic Research (SSF), the Swedish Foundation for International Cooperation in Research and Higher Education (STINT), and the Foundation BLANCEFLOR Boncompagni-Ludovisi, née Bildt.

1051-2004/01 $35.00 Copyright © 2001 by Academic Press. All rights of reproduction in any form reserved.

1. PROBLEM STATEMENT AND PRELIMINARIES

Let x(t, θ) ∈ R^{m×1} be a stationary sequence (for t = 1, 2, ...) that depends on the parameter vector θ (with dim(θ) ≤ m), which we want to estimate, and let θ₀ denote the true and unknown value of θ. Assume that the sequence x(t, θ₀)
is known to have mean equal to zero:

E[x(t, θ₀)] = 0,   (1)
where E stands for the statistical expectation operator. This property can be used to estimate the unknown parameters by minimizing, with respect to θ, the function

μᵀ(θ) W⁻¹ μ(θ),   (2)

where W ∈ R^{m×m} is a positive definite weighting matrix, and

μ(θ) = (1/√N) Σ_{t=1}^{N} x(t, θ).   (3)
Henceforth N denotes the number of data samples available. Most commonly, x(t, θ) has the form

x(t, θ) = z(t) ε(t, θ),   (4)
where z(t) ∈ R^{m×1} is a so-called vector of instrumental variables (IV) and ε(t, θ) is the residual of the data model corresponding to θ (see, e.g., [1–12]). As seen from (1), the main property of the IV vector z(t) is that it is uncorrelated with ε(t, θ₀), which explains the name correlation analysis that is also sometimes used for this approach to parameter estimation [1]. For obvious reasons it is also called the generalized method of moments (see, e.g., [4–6]). To keep the paper concise we omit any detailed discussion of the instrumental variable approach to parameter estimation, nor do we discuss the potential advantages of this approach over least squares or maximum likelihood methods. The reader less familiar with the IV approach can learn its basics from [3] in the econometrics literature or [2, 12] in the engineering literature.

Let θ̂ denote the parameter estimate obtained by minimizing the criterion function in (2). The accuracy of θ̂ depends heavily on the weighting matrix W used in (2). When ε(t, θ) is linear in θ (such as in the case of linear regression models), the optimal weight, which leads to the θ̂ with the smallest covariance matrix in the class of unbiased estimates, is given by the Aitken–Markov theory of best linear unbiased estimation (BLUE) [1, 2]:

W = E[μ(θ₀) μᵀ(θ₀)].   (5)
In the more difficult cases in which ε(t, θ) is a nonlinear function of θ, an extension of the Aitken–Markov theory presented in [2] states that the optimal weight which yields an asymptotically (in N) best consistent (ABC) estimate θ̂ is still given by (5), or by any root-N consistent estimate thereof (see also [4, 5]). Furthermore, the covariance matrix of the asymptotic distribution of the parameter estimates obtained from (2) and (5) is well known to decrease monotonically with increasing m (see, e.g., [2, 5, 12]). From now on we will use the symbol W to denote the optimal weighting matrix in (5).
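To make the criterion (2)–(3) concrete, the following sketch evaluates μ(θ) and the weighted IV criterion on a toy scalar model. The model, the instrument choice, and all variable names here are our own illustration, not part of the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 2000
u = rng.standard_normal(N)          # scalar input, used here as its own instrument
theta0 = 0.5                        # "true" parameter (arbitrary toy choice)
y = theta0 * u + rng.standard_normal(N)

def mu(theta, z, resid):
    """Sample moment vector, Eq. (3): mu(theta) = N^{-1/2} sum_t z(t) eps(t, theta)."""
    e = resid(theta)
    return z.T @ e / np.sqrt(len(e))

def iv_criterion(theta, z, resid, W):
    """Weighted IV criterion, Eq. (2): mu(theta)^T W^{-1} mu(theta)."""
    m = np.atleast_1d(mu(theta, z, resid))
    return float(m @ np.linalg.solve(W, m))

z = u.reshape(-1, 1)                # m = 1 instrument
resid = lambda th: y - th * u       # eps(t, theta) for the toy model
J = [iv_criterion(th, z, resid, np.eye(1)) for th in (0.0, 0.5, 1.0)]
```

Since E[z(t) ε(t, θ₀)] = 0 as in (1), the criterion is smallest near θ₀. Note that in this exactly identified case (dim(z) = dim(θ)) any positive definite W yields the same minimizer; the choice of W matters in the overdetermined case m > dim(θ) discussed in the paper.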
Digital Signal Processing Vol. 11, No. 3, July 2001
As W in (5) is usually unknown, the problem of estimating it is the main step of an ABC (or BLUE) parameter estimator (see, e.g., [2–6, 13, 14], where parameter estimation problems having exactly the form of (2), (5) have been considered). Presenting a method for accurately estimating W, along with a proof of its consistency, forms the main goal of this paper. Before stating our goals in more exact terms, let us present a specific system identification problem that can be cast into the previous general framework.

EXAMPLE 1. Consider a linear dynamic system modeled by the difference equation

y(t) + a₁ y(t − 1) + ... + aₙ y(t − n) = b₁ u(t − 1) + ... + bₙ u(t − n) + ε(t, θ),   (6)
where y(t) ∈ R is the system's output, u(t) ∈ R is the input signal, ε(t, θ) ∈ R is the residual term, and θ = [a₁, ..., aₙ, b₁, ..., bₙ]ᵀ is the unknown parameter vector. We assume that there exists a true data-generating mechanism (the "system") having the form of (6), whose parameter vector θ₀ is to be estimated. We also assume that ε(t, θ₀) and the input u(s) are uncorrelated with one another for any values of s and t (which basically amounts to assuming that there is no feedback from y(t) to u(t)). Under these conditions we can build the IV vector z(t) from u(t − 1), ..., u(t − m), with m ≥ 2n:

z(t) = [u(t − 1) ... u(t − m)]ᵀ,   (7)
define x(t, θ) as in (4), and estimate the unknown parameters by minimizing (2). The resulting parameter estimator is the so-called weighted overdetermined IV method [2, 12].

As already stated, our goal in this paper is to obtain a consistent estimate Ŵ of W. More exactly, we want a Ŵ such that the mean square error (MSE) of any of its elements is asymptotically on the order of N⁻¹. We will work under the following assumptions:

Assumption A1. The vector x(t, θ₀) has the form (4), where {z(t)} and {ε(t, θ₀)} are given (for t = 1, ..., N). Also, z(s) and ε(t, θ₀) are Gaussian distributed and independent of each other for any values of s and t.

To simplify the notation we will omit the dependence of ε(t, θ₀) and x(t, θ₀) on θ₀ and will therefore simply write x(t) for x(t, θ₀), etc.

Remark 1. The assumption made above that {ε(t)} is available bears comment. Typically we only dispose of an estimate ε(t, θ̃) of ε(t), where θ̃ is an initial consistent estimate of θ₀ (obtained, for example, by minimizing (2) with the weighting matrix equal to I). However, this fact does not pose any serious problems to our attempt to obtain a consistent estimate of W. Indeed, let us say that we found a desired root-N consistent estimate of W by assuming that ε(t) was available. Let us denote this estimate by Ŵ(θ₀) to indicate the fact that it depends on the knowledge of ε(t, θ₀). If we use this estimator of W with ε(t, θ̃)
in lieu of ε(t, θ₀), we obtain Ŵ(θ̃). A simple Taylor series expansion shows that, asymptotically in N (see, e.g., [2, 14]),

Ŵ(θ̃) = Ŵ(θ₀) + O(θ̃ − θ₀),   (8)

and hence that, under weak regularity conditions, Ŵ(θ̃) is still a root-N consistent estimate of W, like Ŵ(θ₀), provided that θ̃ is a root-N consistent estimate of θ₀. Since the last condition can be satisfied in many cases, it follows that assuming ε(t, θ₀) to be available for the estimation of W is not a restrictive requirement. The same would evidently be true if {z(t)} were a function of θ₀ and we assumed it was available.

Assumption A2. Let

ρ(k) = E[ε(t) ε(t − k)].   (9)
Also, let z(t) and z̃(t) denote two generic (scalar) elements of the vector z(t) (possibly z(t) = z̃(t)), and let

r(k) = E[z(t) z̃(t − k)].   (10)
The covariance sequences {ρ(k)} and {r(k)} decay exponentially to zero as k increases. In other words, there is a positive number λ strictly less than one, 0 < λ < 1, such that

|ρ(k)| ≤ α λ^{|k|},   (11)
|r(k)| ≤ α λ^{|k|},   (12)

where α is a positive finite-valued constant.
Remark 2. A2 is a weak assumption satisfied by many stationary sequences, such as strictly stable ARMA signals.

To conclude this section, it only remains to remark that the limitation of the discussion to a scalar sequence {ε(t)} is made only to keep the notation and explanation simple. The main results and conclusions presented in what follows apply directly to parameter estimation problems in which {ε(t, θ)} is a vector sequence (see, e.g., [13] for such a problem). In that case x(t, θ) in (4) is redefined as

x(t, θ) = z(t) εᵀ(t, θ),   (13)

and μ(θ) is given by

μ(θ) = (1/√N) Σ_{t=1}^{N} vec(x(t, θ)) = (1/√N) Σ_{t=1}^{N} [ε(t, θ) ⊗ z(t)],   (14)

where ⊗ denotes the Kronecker product.
2. ESTIMATION OF W—SOME DIFFICULTIES

The following discussion is included mainly for the benefit of those readers less familiar with spectral analysis and related issues. A straightforward calculation yields

W = (1/N) Σ_{t=1}^{N} Σ_{s=1}^{N} E[x(t) xᵀ(s)] = (1/N) Σ_{t=1}^{N} Σ_{s=1}^{N} R(t − s) = (1/N) Σ_{k=−N}^{N} (N − |k|) R(k),   (15)

where

R(k) = E[x(t) xᵀ(t − k)].   (16)
Under A2 it follows, from a calculation similar to that in Eqs. (39)–(42) in the Appendix, that the second term in

W = Σ_{k=−N}^{N} R(k) − (1/N) Σ_{k=−N}^{N} |k| R(k)   (17)

tends to zero as N → ∞, and also that the limit of the first term is finite. Hence,

lim_{N→∞} W = Σ_{k=−∞}^{∞} R(k),   (18)
which is recognized as the power spectral density matrix of {x(t)} at frequency ω = 0 (that is, the DC power). Consequently, our problem is basically that of consistently estimating the DC power in a nonparametric manner (no parametric model can usually be postulated for x(t)). This is known not to be an easy problem, a fact which will become apparent as we proceed in this section. Indeed, we will show shortly that two natural estimators of W yield very inaccurate (in particular, inconsistent) estimates of the optimal weight.

An obvious way to estimate W is to replace R(k) in (15) by a consistent estimate R̂(k):

Ŵ = (1/N) Σ_{k=−N}^{N} (N − |k|) R̂(k).   (19)

The problem is how to obtain {R̂(k)} above. In the rest of this section we discuss the two most natural estimates of {R(k)}, which will be shown to fail to provide a consistent Ŵ. In the next section we present a third estimate of {R(k)} whose use leads to a root-N consistent Ŵ.

The so-called unbiased estimate of R(k) is given by

R̂(k) = (1/(N − k)) Σ_{t=k+1}^{N} x(t) xᵀ(t − k),   R̂(−k) = R̂ᵀ(k),   (20)
for k = 0, 1, ..., N − 1. A simple calculation shows that we can rewrite (20) as

R̂(k) = (1/(N − |k|)) Σ_{p=1}^{N} Σ_{s=1}^{N} x(p) xᵀ(s) δ_{k,p−s}   for any |k| < N,   (21)

where δ_{i,j} is the Kronecker delta (δ_{i,j} = 1 for i = j, and δ_{i,j} = 0 for i ≠ j). Inserting (21) in (19) gives

Ŵ = (1/N) Σ_{k=−N}^{N} Σ_{p=1}^{N} Σ_{s=1}^{N} x(p) xᵀ(s) δ_{k,p−s} = (1/N) Σ_{p=1}^{N} Σ_{s=1}^{N} x(p) xᵀ(s) [Σ_{k=−N}^{N} δ_{k,p−s}] = μ(θ₀) μᵀ(θ₀),   (22)

(the bracketed inner sum equals 1 for every p, s), where μ(·) was defined in (3). Hence, the Ŵ corresponding to (20) estimates E[μ(θ₀) μᵀ(θ₀)] by μ(θ₀) μᵀ(θ₀). This estimate is recognized as the multivariate extension of the periodogram (at frequency ω = 0). It is known to have very poor statistical properties. In particular, Ŵ in (22) has rank equal to one and is an inconsistent estimate of W [15, 16].

Next we consider the so-called biased estimate of R(k),
R̂(k) = (1/N) Σ_{t=k+1}^{N} x(t) xᵀ(t − k),   R̂(−k) = R̂ᵀ(k),   (23)
for k = 0, 1, ..., N − 1. Rewriting (23) as in (21) and inserting the resulting expression for R̂(k) in (19) gives

Ŵ = (1/N²) Σ_{k=−N}^{N} (N − |k|) Σ_{p=1}^{N} Σ_{s=1}^{N} x(p) xᵀ(s) δ_{k,p−s}
  = (1/N²) Σ_{p=1}^{N} Σ_{s=1}^{N} (N − |p − s|) x(p) xᵀ(s)
  = (1/N²) [x(1) ... x(N)] C [xᵀ(1); ...; xᵀ(N)] ≜ (1/N²) Xᵀ C X,   (24)

where C is the symmetric N × N Toeplitz matrix with entries C_{p,s} = N − |p − s| (its first row is N, N − 1, ..., 1) and X ≜ [x(1) ... x(N)]ᵀ. The matrix C defined above is the covariance matrix of a moving average of order N with all coefficients equal to one and unit-power driving white noise. Hence, C is positive definite for any finite value of N. Furthermore, under weak conditions the matrix X in (24) has full column rank equal to m, with probability one (wp 1), which means that the Ŵ above is strictly positive definite
(wp 1). The small difference between the R̂'s in (20) and (23) has therefore led to a significant difference between the corresponding Ŵ's: from the rank-one Ŵ corresponding to R̂(k) in (20) to the strictly positive definite Ŵ associated with R̂(k) in (23). Unfortunately, the Ŵ in (24) is yet another inconsistent estimate of W (as follows, e.g., from [15]).

The significant sensitivity of Ŵ to the estimate R̂(k) used in (19) is explained by the unboundedness, as N increases, of the summation operator in that equation, which implies that small errors in {R̂(k)} may have a nonnegligible effect on Ŵ. In the next section we present another natural estimate R̂(k) for use in (19) and prove that the so-obtained Ŵ is consistent.
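The algebraic facts just noted are easy to reproduce numerically. The sketch below (our own illustration, with white Gaussian data standing in for x(t)) forms the periodogram-type estimate of (22), which has rank one, and the XᵀCX/N² estimate of (24), which is strictly positive definite:

```python
import numpy as np

rng = np.random.default_rng(1)
N, m = 200, 4
X = rng.standard_normal((N, m))     # row t is x(t)^T (white data as a stand-in)

# Unbiased-R(k) route, Eq. (22): W_hat = mu mu^T, with mu = N^{-1/2} sum_t x(t)
mu = X.sum(axis=0) / np.sqrt(N)
W_per = np.outer(mu, mu)            # the multivariate periodogram at omega = 0

# Biased-R(k) route, Eq. (24): W_hat = X^T C X / N^2, with C[p, s] = N - |p - s|
C = N - np.abs(np.subtract.outer(np.arange(N), np.arange(N)))
W_bia = X.T @ C @ X / N**2

rank_per = np.linalg.matrix_rank(W_per)      # equals 1, for any N
lam_min = np.linalg.eigvalsh(W_bia).min()    # strictly positive (wp 1)
```

Neither estimate converges to W as N grows, as argued above; the code only illustrates the rank-one versus positive definite structure of the two.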
3. ESTIMATION OF W—THE PROPOSED SOLUTION

Windowing the sample covariance sequence in (19) can be thought of as a possibility to obtain a consistent estimate of W [6, 17–21]. However, using a windowed estimate of W is hardly feasible without information about the correlation length of x(t) (i.e., the lag k̃ beyond which R(k) ≈ 0 for k > k̃). Such information is usually difficult to obtain, and even with it at hand a windowed estimate of W may fail to satisfy the positive semidefiniteness requirement or, otherwise, be highly biased. More exactly, our point is that the finite-sample properties of a windowed estimate of W are usually poor, despite the fact that such an estimate can be made consistent by appropriately choosing the window and its span, and this fact leads to poor accuracy of the corresponding parameter estimates. Even relatively sophisticated methods for choosing the window and its span (such as those in [6, 18, 20, 21]) may suffer from this problem. To partly support these claims we can cite the numerical evidence in [14], where attempts to estimate an optimal weight of the form of W in (15) using a rectangular window of user-chosen length have had only rather limited success (see also [3]). Hence the windowing techniques of classical spectral analysis, even when complemented with a span selector tailored to the problem at hand, can hardly provide the desired simple root-N consistent estimate of W, and we will not consider them.

Instead, let us observe the structure of x(t) in (4), which (along with A1) implies that

R(k) = E[z(t) zᵀ(t − k)] E[ε(t) ε(t − k)] = Γ(k) ρ(k),   (25)

where ρ(k) is as defined in (9), and

Γ(k) = E[z(t) zᵀ(t − k)].   (26)

Since the sequences {z(t)} and {ε(t)} are available (for t = 1, ..., N), we can estimate R(k) in (25) as

R̂(k) = Γ̂(k) ρ̂(k),   (27)
with

Γ̂(k) = (1/N) Σ_{t=k+1}^{N} z(t) zᵀ(t − k),   Γ̂(−k) = Γ̂ᵀ(k),   (28)
for k = 0, 1, ..., N − 1, and ρ̂(k) similarly defined. The estimate of W corresponding to (27) is given by

Ŵ = (1/N) Σ_{k=−N}^{N} (N − |k|) Γ̂(k) ρ̂(k).   (29)
The first important property of the Ŵ above is that it is guaranteed to be a positive semidefinite matrix. To see this, note that the sequence

β(k) = N − |k| for |k| < N,   β(k) = 0 otherwise,   (30)

is positive definite (as it can be viewed as the covariance sequence of the moving average whose covariance matrix is given by the positive definite C in (24)). The sequences {Γ̂(k)} and {ρ̂(k)}, appended with zeros for |k| > N, are also positive semidefinite [15, 16]. Hence Ŵ in (29) is equal to the discrete Fourier transform (DFT), at ω = 0, of the product of three positive semidefinite sequences:

Ŵ = (1/N) Σ_{k=−∞}^{∞} β(k) ρ̂(k) Γ̂(k).   (31)
Making use of the facts that the DFT of the product of two sequences is equal to the convolution of their DFTs, that the DFTs of {β(k)} and {ρ̂(k)} are positive, and that the DFT of {Γ̂(k)} is a positive semidefinite matrix at any frequency ω, it follows readily that Ŵ in (31) is a positive semidefinite matrix.

The second important property of Ŵ in (29) is the main result of this paper. Under the assumptions made (see A1 and A2 in Section 1), the matrix Ŵ in (29) is a root-N consistent estimate of W. In other words, the mean square error (MSE) of any element of Ŵ tends to zero as 1/N as N increases. The proof of this important property of Ŵ is presented in the Appendix. While this result may seem surprising at first, it nevertheless has an intuitive explanation: {ρ̂(k)} in (29), which in view of A2 is essentially a decaying sequence, acts as an implicit window on {Γ̂(k)}, and vice versa! More exactly, the comparatively large errors in Γ̂(k) that occur for many large lags k are weighted out by the multiplication by ρ̂(k) (and vice versa).

Despite the simplicity of the above observation, it was overlooked in the previous literature by both us [2, 14] and other researchers (see, e.g., [3, 17–21]), where more intricate and statistically less accurate methods than (29) were proposed for estimating W. The intricacy of the latter methods stems from the fact that their user faces a number of choices (such as selecting the window and its span) that are typically hard to make, whereas no such choice is needed in our method. Regarding the claimed statistical inefficiency of the existing methods,
note that their (asymptotic) convergence rates are slower than the N^{−1/2} rate of our method and, more importantly, that their finite-sample performance may well be badly affected by a poor choice of the user's parameters. A quantitative comparison of our method for estimating W with the previous ones would definitely be of interest, and we hope that there will be readers of this journal interested in performing it. In the present paper we focus on studying the performance of only the Ŵ estimate proposed herein.
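For concreteness, the estimator (27)–(29) can be sketched as follows (a direct O(N²m²) loop; the function and variable names are ours, and Γ̂, ρ̂ use the biased 1/N normalization of (23) and (28)):

```python
import numpy as np

def w_hat(z, eps):
    """Proposed weight estimate, Eq. (29): (1/N) sum_k (N - |k|) Gamma_hat(k) rho_hat(k).

    z: (N, m) array whose row t is z(t)^T; eps: length-N residual sequence.
    """
    N, m = z.shape
    W = np.zeros((m, m))
    for k in range(-(N - 1), N):
        a = abs(k)
        G = z[a:].T @ z[:N - a] / N              # Gamma_hat(|k|), Eq. (28)
        if k < 0:
            G = G.T                              # Gamma_hat(-k) = Gamma_hat(k)^T
        r = float(eps[a:] @ eps[:N - a]) / N     # rho_hat(|k|), cf. Eq. (23)
        W += (N - a) * r * G
    return W / N

# white z(t) and eps(t): then W = Gamma(0) rho(0) = I, so w_hat should be near I
rng = np.random.default_rng(2)
N, m = 800, 3
W = w_hat(rng.standard_normal((N, m)), rng.standard_normal(N))
```

Note that no window or span has to be chosen: as argued above, ρ̂(k) implicitly windows Γ̂(k) and vice versa, and the result inherits positive semidefiniteness via (31).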
4. NUMERICAL EXAMPLES

To illustrate the theoretical results quantitatively we consider a simple case of the general estimation problem outlined in the example in Section 1. We let

y(t) + a₁ y(t − 1) + a₂ y(t − 2) = b₁ u(t − 1) + ε(t)   (32)

be the data-generating mechanism, with a₁ = −1.4, a₂ = 0.8, and b₁ = 1. Furthermore, we let

ε(t) − c ε(t − 1) = e(t),   (33)
u(t) − d u(t − 1) = v(t),   (34)

where e(t) and v(t) are independent zero-mean white noise processes with unit variances. To estimate the parameter vector θ = [a₁, a₂, b₁]ᵀ corresponding to (32) we use the weighted overdetermined IV method outlined in Example 1, with z(t) given in (7), where we take m = 10.

Before proceeding with the estimation of θ we will numerically study the accuracy of the weighting matrix estimate Ŵ in (29). Here we assume that the residual sequence {ε(t)} is given along with {z(t)} for t = 1, ..., N. (Later we will study the situation when we also need to estimate ε(t).) Two main cases are considered:

(a) Both AR poles in (33) and (34) are well inside the unit circle, which leads to a λ in A2 close to zero.
(b) At least one AR pole is close to the unit circle, which gives a λ close to 1.

In the two plots that follow we show the normalized root mean square error (RMSE) performance criterion

[Σ_{i=1}^{m} Σ_{j=1}^{m} MSE(Ŵ_{i,j})]^{1/2} / ‖W‖_F,   (35)
where ‖·‖_F denotes the Frobenius norm. The MSEs above are estimated from 500 trials for each of the different sample sizes N considered. In Fig. 1 we let c = 0.1 and d = 0.3 to study case (a) above. We see that the RMSE of Ŵ decays as N^{−1/2} when N increases, which agrees well with what was shown theoretically in the Appendix. Also included in Fig. 1 is the corresponding RMSE graph when Ŵ is estimated under the assumption that we know Γ(k) (and hence we only
FIG. 1. The graphs show the normalized RMSEs of the elements of Ŵ for different sample sizes N. Here, c = 0.1 and d = 0.3. The curve indicated by the symbol "o" corresponds to the "normal" operating case where Ŵ is estimated as in (29). The curve marked with "x" is for the probing signal case when {Γ(k)} is known.
estimate ρ(k) in (25)). This situation occurs when probing signals are used for system parameter estimation. As expected, this additional information improves the estimation accuracy of W.

Next we consider case (b) above and let c = 0.99 and d = 0.9. The corresponding results are shown in Fig. 2. As expected, we need a larger value of N to obtain an accurate estimate of W in case (b) than in case (a). Recalling the intuitive explanation, given at the end of Section 3, of why Ŵ in (29) is a consistent estimate of W, it is clear that in case (b) the implicit windowing by ρ̂(k) (or Γ̂(k)) is not quite effective in reducing the MSE of Ŵ, since {ρ̂(k)} (or {Γ̂(k)}) tends rather slowly to zero when c (or d) is very close to 1.

Finally, we study the accuracy of the optimally weighted IV estimate of θ and how it is affected by the estimation errors in Ŵ. Here we use c = 0.8 and d = 0.9 to obtain a fairly difficult scenario (cf. the above discussion). In Fig. 3, the RMSE (estimated from 500 trials) of the estimate of a₁ is plotted versus N for the following three weighting matrices: I, W in (5), and Ŵ in (29) (hence we also estimate Γ(k) in this case, the input-generating mechanism being assumed unknown). The RMSE graphs for the other estimated parameters (a₂ and b₁) are similar. The plot in Fig. 3 illustrates the usefulness of the (optimally) weighted version of the IV method. We also see that the difference in accuracy between the case when the true weight W is known and the case when it is replaced by the consistent estimate Ŵ in (29) is quite small and vanishes as N increases, as predicted by the theory. Note that here we first had to estimate the residuals ε(t) before we could obtain Ŵ. This was accomplished by
FIG. 2. Similar to Fig. 1 but for c = 0.99 and d = 0.9.
FIG. 3. The graphs show the RMSE values of the estimates of a₁ obtained by using different weighting matrices in the criterion in (2): "x" – I, "+" – W, "o" – Ŵ.
using the (initial) estimates of a1 , a2 , and b1 obtained from (2) with W = I and then computing the corresponding residuals in (32).
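The simulation setup of this section can be sketched as follows. This is our own re-implementation of (32)–(34) together with the W = I first-step IV estimate; the function names and initialization details are our assumptions:

```python
import numpy as np

def simulate(N, c=0.8, d=0.9, a1=-1.4, a2=0.8, b1=1.0, seed=0):
    """Generate y, u from the system (32) with AR(1) noise (33) and AR(1) input (34)."""
    rng = np.random.default_rng(seed)
    e, v = rng.standard_normal(N), rng.standard_normal(N)
    u, eps, y = np.zeros(N), np.zeros(N), np.zeros(N)
    for t in range(N):
        u[t] = d * u[t - 1] + v[t] if t > 0 else v[t]
        eps[t] = c * eps[t - 1] + e[t] if t > 0 else e[t]
        # transient: start the output recursion at t = 2
        y[t] = (-a1 * y[t - 1] - a2 * y[t - 2] + b1 * u[t - 1] + eps[t]) if t >= 2 else eps[t]
    return y, u

def iv_estimate(y, u, m=10):
    """Overdetermined IV estimate of theta = [a1, a2, b1] from (2) with W = I.

    eps(t, theta) = y(t) - phi(t)^T theta, phi(t) = [-y(t-1), -y(t-2), u(t-1)],
    and z(t) = [u(t-1), ..., u(t-m)]^T as in (7); minimizing ||Z^T (y - Phi theta)||^2
    is a linear least-squares problem in theta.
    """
    N, t0 = len(y), max(m, 2)
    Z = np.column_stack([u[t0 - j:N - j] for j in range(1, m + 1)])       # rows z(t)^T
    Phi = np.column_stack([-y[t0 - 1:N - 1], -y[t0 - 2:N - 2], u[t0 - 1:N - 1]])
    theta, *_ = np.linalg.lstsq(Z.T @ Phi, Z.T @ y[t0:], rcond=None)
    return theta

theta_hat = iv_estimate(*simulate(5000))   # should be near [-1.4, 0.8, 1.0]
```

This W = I first step yields the initial θ̃ of Remark 1; its residuals then give ε(t, θ̃), from which Ŵ in (29) is computed and the criterion (2) is re-minimized.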
5. CONCLUDING REMARKS
The instrumental variable or correlation analysis approach to parameter estimation, which can be seen as a special case of the method of moments, has been frequently used in applications such as system identification, time series analysis, and econometrics. The estimation accuracy of this approach may be improved significantly by suitably weighting the corresponding estimating equations. While this fact has been known for many years, a clear-cut method for estimating the optimal weight was apparently lacking. In the previous literature the optimal weighting matrix was estimated either (i) parametrically [2, 17, 22] or (ii) nonparametrically [3, 14, 18–21]. Approach (i) requires the postulation and estimation of a model at least for the noise (or residual) sequence ε(t). The need for estimating nuisance parameters may significantly increase the computational burden of this approach and may also make the overall procedure more prone to mismodeling. Approach (ii) is more robust than (i) (as, unlike (i), it does not make, and hence does not depend on, any strong modeling assumptions), but it may suffer from poor accuracy. Indeed, without information about the correlation lengths of the sequences z(t) and ε(t), approach (ii) may give a rather inaccurate estimate of the optimal weight, a nonpositive definite one, or both.

The approach proposed in this paper for estimating the optimal weight is nonparametric, and yet it yields quite accurate estimates of W. In fact, as was shown in the previous section, the parameter estimates obtained by using the true and unknown optimal weight W and, respectively, the estimated weight Ŵ proposed herein have very similar accuracies for a wide range of practical values of the sample length.
The availability of an accurate nonparametric estimate of the optimal weighting matrix is expected to promote the use of optimal instrumental variable or correlation analysis-based methods for parameter estimation in signals and systems. Our results may also have some relevance for nonparametric spectral analysis (as the optimal weight that we showed how to estimate consistently in a nonparametric manner is equal to the DC power matrix of a certain stationary vector sequence), but the significance of this observation, if any, is yet to be clarified.

Finally, we note that the root-N consistency of the estimated weighting matrix Ŵ can be proved under more general assumptions than those used in the previous sections; see the recent paper [23]. The cited work, of which we became aware after submitting the present paper, contains methods and results very similar to ours, but more general and naturally (much) more involved.
APPENDIX

I. The Consistency Proof

The generic element of Ŵ in (29) is given by

ŵ = (1/N) Σ_{k=−N}^{N} (N − |k|) r̂(k) ρ̂(k),   (36)

where r̂(k) is a generic element of Γ̂(k),

r̂(k) = (1/N) Σ_{t=k+1}^{N} z(t) z̃(t − k)   (37)
(hereafter z(t) and z̃(t) are two elements of z(t); see Assumption A2). We want to show that MSE(ŵ) ≜ E(ŵ − w)² = O(N⁻¹) for sufficiently large values of N. Here w is the element of W that corresponds to ŵ. First we note that, in view of A1,

w − E(ŵ) = (1/N) Σ_{k=−N}^{N} (N − |k|) {r(k) ρ(k) − E[r̂(k)] E[ρ̂(k)]}
         = (1/N³) Σ_{k=−N}^{N} (N − |k|) (2N − |k|) |k| r(k) ρ(k).   (38)
Hence, making use of A2,

|E(ŵ) − w| ≤ (α²/N³) Σ_{k=−N}^{N} (N − |k|)(2N − |k|)|k| λ^{2|k|}
           ≤ (2α²/N) Σ_{k=−N}^{N} |k| λ^{2|k|} ≤ (4α²/N) Σ_{k=0}^{∞} k γ^k,   (39)

where γ = λ² < 1. Now,

f(γ) ≜ Σ_{k=0}^{∞} γ^k = 1/(1 − γ)   (40)

and

f′(γ) = 1/(1 − γ)² = Σ_{k=0}^{∞} k γ^{k−1} = (1/γ) Σ_{k=0}^{∞} k γ^k,   (41)

which implies that

Σ_{k=0}^{∞} k γ^k = γ/(1 − γ)² < ∞.   (42)

It follows from (39) and (42) that

bias(ŵ) ≜ E(ŵ) − w = O(N⁻¹),   (43)

which shows that ŵ is asymptotically unbiased.
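As a quick numeric sanity check of (40)–(42), the closed form Σ_{k≥0} k γ^k = γ/(1 − γ)² can be compared with a truncated sum (γ = 0.25 is an arbitrary test value; the geometric tail makes the truncation error negligible):

```python
import numpy as np

gamma = 0.25                          # any 0 < gamma < 1
k = np.arange(0, 200)                 # 200 terms; tail k * gamma^k is negligible here
lhs = float(np.sum(k * gamma**k))     # truncated sum_{k >= 0} k gamma^k
rhs = gamma / (1.0 - gamma) ** 2      # closed form, Eq. (42)
```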
A corollary of (43) is that

MSE(ŵ) = var(ŵ) + bias²(ŵ) = var(ŵ) + O(N⁻²).   (44)

Hence to prove the stated result we must show that

var(ŵ) ≜ E(ŵ²) − [E(ŵ)]² = O(N⁻¹).   (45)

To that end, note that

E(ŵ²) = (1/N²) Σ_{k=−N}^{N} Σ_{p=−N}^{N} (N − |k|)(N − |p|) E[r̂(k) r̂(p)] E[ρ̂(k) ρ̂(p)].   (46)
Using Assumption A1 once again, we obtain²

E[r̂(k) r̂(p)] = (1/N²) Σ_{t=k+1}^{N} Σ_{s=p+1}^{N} E[z(t) z̃(t − k) z(s) z̃(s − p)]
 = (1/N²) Σ_{t=k+1}^{N} Σ_{s=p+1}^{N} [r(k) r(p) + r̄(t − s) r̃(t − s + p − k) + r(t − s + p) r(s − t + k)]
 = (1/N²)(N − |k|)(N − |p|) r(k) r(p) + (1/N²) Σ_{j=−N}^{N} (N − |j|)[r̄(j) r̃(j + p − k) + r(j + p) r(−j + k)]
 ≜ T₁ + Δ₁,   (47)

where {r̄(k)} and {r̃(k)} denote the covariance sequences of z(t) and z̃(t), respectively. Similarly,

E[ρ̂(k) ρ̂(p)] = (1/N²)(N − |k|)(N − |p|) ρ(k) ρ(p) + (1/N²) Σ_{j=−N}^{N} (N − |j|)[ρ(j) ρ(j + p − k) + ρ(j + p) ρ(−j + k)]
 ≜ T₂ + Δ₂.   (48)
Next, we note that

[E(ŵ)]² = (1/N²) Σ_{k=−N}^{N} Σ_{p=−N}^{N} (N − |k|)(N − |p|) E[r̂(k)] E[ρ̂(k)] E[r̂(p)] E[ρ̂(p)]
        = (1/N²) Σ_{k=−N}^{N} Σ_{p=−N}^{N} (N − |k|)(N − |p|) T₁ T₂.   (49)

² The limits for j in (47) are not quite correct. However, we can extend them as in (47) when later we upper-bound the sum (see, e.g., (51)).
Combining (45)–(49) leads to the following expression for var(ŵ):

var(ŵ) = (1/N²) Σ_{k=−N}^{N} Σ_{p=−N}^{N} (N − |k|)(N − |p|)(T₁Δ₂ + T₂Δ₁ + Δ₁Δ₂)
       ≜ L₁ + L₂ + L₃.   (50)
We analyze the terms L₁, L₂, L₃ one by one. Making use of A2, we note that

|L₁| ≤ (1/N⁶) Σ_k Σ_p Σ_j (N − |k|)²(N − |p|)²(N − |j|) |r(k)||r(p)| [|ρ(j)||ρ(j + p − k)| + |ρ(j + p)||ρ(−j + k)|]
 ≤ (α⁴/N⁶) Σ_k Σ_p Σ_j (N − |k|)²(N − |p|)²(N − |j|) λ^{|k|} λ^{|p|} [λ^{|j|} λ^{|j+p−k|} + λ^{|j+p|} λ^{|j−k|}]
 ≤ (α⁴/N) Σ_k Σ_p Σ_j λ^{|k|} λ^{|p|} [λ^{|j|} + λ^{(1/2)|j+p|} λ^{(1/2)|j−k|}]
 ≤ (α⁴/N) Σ_k Σ_p Σ_j λ^{|k|} λ^{|p|} [λ^{|j|} + λ^{|j| − (1/2)|k| − (1/2)|p|}]
 ≤ (2α⁴/N) Σ_k Σ_p Σ_j λ^{|j|} λ^{(1/2)|k|} λ^{(1/2)|p|}
 ≤ (16α⁴/N) (Σ_{j=0}^{∞} λ^j)(Σ_{k=0}^{∞} λ^{k/2})² = O(N⁻¹),   (51)

where all triple sums run over k, p, j = −N, ..., N. To obtain some of the upper bounds above we used the easily verified facts that |j + p| ≥ |j| − |p| and that x ≤ y ⇔ λ^x ≥ λ^y. By a calculation similar to (51) we can show that

|L₂| = O(N⁻¹).   (52)
Finally,

|L₃| ≤ (1/N⁶) Σ_k Σ_p Σ_i Σ_j (N − |k|)(N − |p|)(N − |i|)(N − |j|) [|r̄(j)||r̃(j + p − k)| + |r(j + p)||r(−j + k)|] [|ρ(i)||ρ(i + p − k)| + |ρ(i + p)||ρ(−i + k)|]
 ≤ (α⁴/N²) Σ_k Σ_p Σ_i Σ_j [λ^{|j|} λ^{|j+p−k|} + λ^{|j+p|} λ^{|−j+k|}] [λ^{|i|} λ^{|i+p−k|} + λ^{|i+p|} λ^{|i−k|}]
 = (α⁴/N²) Σ_k Σ_p Σ_i Σ_j [λ^{|j|+|j+p−k|+|i|+|i+p−k|} + λ^{|j|+|j+p−k|+|i+p|+|i−k|} + λ^{|j+p|+|j−k|+|i|+|i+p−k|} + λ^{|j+p|+|j−k|+|i+p|+|i−k|}],   (53)

where all quadruple sums run over k, p, i, j = −N, ..., N. The terms in the above equation can be shown to be O(N⁻¹), as we prove below for the first of them (the proofs for the other terms are similar). Substituting p = −j + k + s (i.e., s = j + p − k) and bounding λ^{|i+p−k|} ≤ 1,

(α⁴/N²) Σ_k Σ_p Σ_i Σ_j λ^{|j|} λ^{|i|} λ^{|j+p−k|} λ^{|i+p−k|}
 ≤ (α⁴/N²) Σ_{k=−N}^{N} Σ_{i=−N}^{N} Σ_{j=−N}^{N} Σ_{s=j−k−N}^{j−k+N} λ^{|i|} λ^{|j|} λ^{|s|}
 ≤ (α⁴/N²) 2N Σ_{i=−N}^{N} Σ_{j=−N}^{N} Σ_{s=−3N}^{3N} λ^{|i|+|j|+|s|}
 ≤ (16α⁴/N) (Σ_{i=0}^{∞} λ^i)³ = O(N⁻¹).   (54)
Combining (50)–(54) yields the desired result (45), and hence the proof is concluded.
REFERENCES

1. Ljung, L., System Identification: Theory for the User. Prentice Hall International, Englewood Cliffs, NJ, 1987 [2nd ed., 1999].
2. Söderström, T., and Stoica, P., System Identification. Prentice Hall International, Hemel Hempstead, UK, 1989.
3. White, H., Asymptotic Theory for Econometricians. Academic Press, Orlando, 1984 [especially Chap. VI].
4. Hansen, L. P., Large sample properties of generalized method of moments estimators. Econometrica 50 (1982), 1029–1054.
5. Hansen, L. P., A method for calculating bounds on the asymptotic variance-covariance matrices of generalized method of moments estimators. J. Econometrics 30 (1985), 203–228.
6. West, K. D., and Wilcox, D. W., A comparison of alternative instrumental variables estimators of a dynamic linear model. J. Business Econom. Statist. 14 (1996), 281–293.
7. Stoica, P., and Söderström, T., Optimal instrumental variable methods for identification of multivariable linear systems. Automatica 19 (1983), 425–429.
8. Stoica, P., and Söderström, T., Optimal instrumental variable estimation and approximate implementations. IEEE Trans. Automat. Control AC-28 (1983), 757–772.
9. Stoica, P., Söderström, T., and Friedlander, B., Optimal instrumental variable estimates of the AR parameters of an ARMA process. IEEE Trans. Automat. Control AC-30 (1985), 1066–1074.
10. Stoica, P., Friedlander, B., and Söderström, T., Optimal instrumental variable multistep algorithms for estimation of the AR parameters of an ARMA process. Int. J. Control 45 (1987), 2083–2107.
11. Stoica, P., Sorelius, J., Cedervall, M., and Söderström, T., Errors-in-variables modeling: An instrumental variable approach. In Recent Advances in Total Least Squares and Errors-in-Variables Modeling (Van Huffel, S., Ed.). SIAM, Philadelphia, 1997, pp. 329–340.
12. Söderström, T., and Stoica, P., Instrumental Variable Methods for System Identification. LNCIS Series. Springer-Verlag, Berlin/New York, 1983.
13. Stoica, P., and Jansson, M., A transfer function approach to MIMO system identification. In IEEE Conference on Decision and Control, Phoenix, December 1999, pp. 2400–2405.
14. Stoica, P., Cedervall, M., and Eriksson, A., Combined instrumental variable and subspace fitting approach to parameter estimation of noisy input–output systems. IEEE Trans. Signal Process. 43, No. 10 (1995), 2386–2397.
15. Priestley, M. B., Spectral Analysis and Time Series. Academic Press, New York, 1981.
16. Stoica, P. and Moses, R., Introduction to Spectral Analysis. Prentice Hall, Upper Saddle River, NJ, 1997.
17. West, K. D., Another heteroskedasticity and autocorrelation consistent covariance matrix estimator. J. Economet. 76 (1997), 171–191.
18. Newey, W. K. and West, K. D., Automatic lag selection in covariance matrix estimation. Rev. Econom. Stud. 61 (1994), 631–654.
19. Newey, W. K. and West, K. D., A simple, positive semi-definite, heteroskedasticity and autocorrelation consistent covariance matrix. Econometrica 55 (1987), 703–708.
20. Andrews, D. W. K., Heteroskedasticity and autocorrelation consistent covariance matrix estimation. Econometrica 59 (1991), 817–858.
21. Andrews, D. W. K. and Monahan, J. C., An improved heteroskedasticity and autocorrelation consistent covariance matrix estimator. Econometrica 60 (1992), 953–966.
22. den Haan, W. J. and Levin, A., Inferences from parametric and non-parametric covariance matrix estimation procedures. Technical Working Paper 195, National Bureau of Economic Research, Washington, DC, 1996.
23. Robinson, P. M., Inference-without-smoothing in the presence of nonparametric autocorrelation. Econometrica 66, No. 5 (1998), 1163–1182.
PETRE STOICA is a Professor of System Modeling in the Department of Systems and Control at Uppsala University (UU) in Sweden. Previously he was a Professor of Signal Processing at the Bucharest Polytechnic Institute (BPI). He received a D.Sc. degree in Automatic Control from the BPI in 1979 and an Honorary Doctorate in Science from UU in 1993. He has held longer visiting positions with the Eindhoven University of Technology, Chalmers University of Technology (where he held a Jubilee Visiting Professorship), Uppsala University, the University of Florida, and Stanford University. His main scientific interests are in the areas of system identification, time series analysis and prediction, statistical signal and array processing, spectral analysis, wireless communications, and radar signal processing. He has published seven books, eight book chapters, and some 400 papers in archival journals and conference records on these topics. The most recent book he coauthored is Introduction to Spectral Analysis (Prentice-Hall, 1997), and he recently coedited two books on Signal Processing Advances in Wireless Communications (Prentice-Hall, in press). He has received a number of awards, including the 1989 ASSP Society Senior Award (as corecipient, for a paper on statistical aspects of array signal processing) and the 1996 Technical Achievement Award of the IEEE Signal Processing Society for fundamental contributions to statistical signal processing with applications in time series analysis, system identification, and array processing. In 1998 he received a Senior Individual Grant Award from the Swedish Foundation for Strategic Research. He is also corecipient of a 1999 IEEE Signal Processing Society Best Paper Award for a paper on parameter and rank estimation of reduced-rank regressions, a recipient of a 2000 IEEE Third Millennium Medal, and corecipient of the 2000 IEEE W.R.G. Baker Paper Prize Award for a work on maximum likelihood methods for radar.
He is on the editorial boards of five journals in the field: Signal Processing; Journal of Forecasting; Circuits, Systems and Signal Processing; Multidimensional Systems and Signal Processing; and Digital Signal Processing: A Review Journal. He has been a guest coeditor for several special issues on system identification, signal processing, spectral analysis, and radar for some of the above journals and for the IEE Proceedings, as well as a member of the international program committees of many topical conferences. From 1981 to 1986 he was a director of the International Time Series Analysis and Forecasting Society, and he has been a member of the IFAC Committee on Modeling, Identification, and Signal Processing since 1994. He is also an honorary member of the Romanian Academy, a fellow of the Royal Statistical Society, and a fellow of the IEEE.

MAGNUS JANSSON was born in Enköping, Sweden, in 1968. He received the Master of Science, Technical Licentiate, and Ph.D. degrees in electrical engineering from the Royal Institute of Technology (KTH), Stockholm, Sweden, in 1992, 1995, and 1997, respectively. Beginning in September 1998, he spent one year at the Department of Electrical and Computer Engineering, University of Minnesota, USA. Dr. Jansson is currently a Research Associate at the Department of Signals, Sensors, and Systems, Royal Institute of Technology. His research interests include statistical signal processing, time series analysis, and system identification.