CHAPTER 2

Linear Least Squares and Normal Theory

2.1. INTRODUCTION

In this chapter we shall study the relatively simple linear least squares approach to identification. The basic notion of least squares is that parameter estimates are chosen so that the model output agrees with the data as closely as possible, as measured by the sum of the squares of the differences between them. The least squares technique was originally developed independently by Gauss and Legendre in the early 19th century. The concept of least squares analysis is explained by Gauss in Theoria Motus Corporum Coelestium [67] (Theory of the Motion of Heavenly Bodies):

    If the astronomical observations and other quantities on which the computation of orbits is based were absolutely correct, the elements also, whether deduced from three or four observations, would be strictly accurate (so far indeed as the motion is supposed to take place exactly according to the laws of Kepler) and, therefore, if other observations were used, they might be confirmed but not corrected. But since all our measurements and observations are nothing more than approximations to the truth, the same must be true of all calculations
    resting upon them, and the highest aim of all computations concerning concrete phenomena must be to approximate, as nearly as practicable, to the truth. But this can be accomplished in no other way than by a suitable combination of more observations than the number absolutely requisite for the determination of the unknown quantities.

Gauss goes on to say that

    ... the most probable value of the unknown quantities will be that in which the sum of the squares of the differences between the actually observed and the computed values multiplied by numbers that measure the degree of precision is a minimum.

The linear least squares problem may be motivated as follows. Suppose we have a stochastic process {Y_t, t ∈ (1, ..., N)} for which the mean value is a linear function of a parameter vector θ; i.e.,

E[Y_t] = x_t^T θ     (2.1.1)

where x_t is a known vector for t = 1, ..., N. Our aim is to obtain a good estimate (good in the least squares sense) of θ from a realization of the stochastic process {Y_t, t ∈ (1, ..., N)}. We shall investigate various properties of the least squares estimator. Also of interest are the distributions of the estimator and of functions of the residuals (differences between observed data and model output) when the process {Y_t, t ∈ (1, ..., N)} is Gaussian. We shall illustrate by means of examples how these distributions may be used to test hypotheses regarding the model and to obtain confidence intervals for the parameters.
2.2. THE LEAST SQUARES SOLUTION
We begin by using an intuitive approach. We seek the value of θ which minimizes the sum of the squares of the deviations between Y_t and the mean predicted by the model, namely x_t^T θ; i.e., we attempt to minimize

S = Σ_{t=1}^{N} (Y_t - x_t^T θ)²     (2.2.1)

Equation (2.2.1) can be written in vector form

S = (Y - Xθ)^T (Y - Xθ)     (2.2.2)

where

Y = (Y_1, ..., Y_N)^T,    X is the N × p matrix whose tth row is x_t^T     (2.2.3)
Differentiating Eq. (2.2.2) with respect to θ shows that the value θ̂ which minimizes S satisfies the equation (often called the normal equation)

(X^T X) θ̂ = X^T Y     (2.2.4)

If X^T X is invertible, then there is a unique solution which can be expressed as

θ̂ = [X^T X]^{-1} X^T Y     (2.2.5)

The estimator in Eq. (2.2.5) is often called the least squares estimator since it minimizes the sum of squares given in (2.2.1). For the case where X^T X is singular, the normal equation does not have a unique solution and there is a family of least squares estimates which may be determined in any particular case by the usual methods for solving linear equations [84, 86].
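As a brief aside (this numerical illustration is not part of the original text), the following NumPy sketch shows the non-uniqueness in the singular case: adding any null-space direction of X to a least squares solution leaves the sum of squares unchanged, and np.linalg.lstsq returns the minimum-norm member of the family.

import numpy as np

rng = np.random.default_rng(0)
N = 20
u = rng.normal(size=N)
# Third column is the sum of the first two, so X^T X is singular.
X = np.column_stack([u, np.roll(u, 1), u + np.roll(u, 1)])
Y = 2.0 * u + rng.normal(scale=0.1, size=N)

theta_min_norm, _, rank, _ = np.linalg.lstsq(X, Y, rcond=None)
print("rank of X:", rank)                    # 2 < 3, so X^T X is singular

z = np.array([1.0, 1.0, -1.0])               # X z = 0, a null-space direction
theta_other = theta_min_norm + 0.7 * z

S = lambda th: np.sum((Y - X @ th) ** 2)
print(S(theta_min_norm), S(theta_other))     # identical sums of squares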
Example 2.2.1 Consider the following (moving average) model of a stochastic process

E[Y_t] = Σ_{k=0}^{p} h_k u_{t-k},    t = 1, ..., N

where u_t, t = 1 - p, ..., N, are known and h_k, k = 0, ..., p, are unknown parameters. Thus, using the notation already introduced,

x_t^T = (u_t, u_{t-1}, ..., u_{t-p})

and

X^T X = Σ_{t=1}^{N} [ u_t²          u_t u_{t-1}      ···  u_t u_{t-p}
                      u_{t-1} u_t   u_{t-1}²         ···  u_{t-1} u_{t-p}
                        ⋮                                    ⋮
                      u_{t-p} u_t   u_{t-p} u_{t-1}  ···  u_{t-p}²        ]

The least squares estimate is obtained from Eq. (2.2.5).  ∇
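A minimal Python/NumPy sketch of Example 2.2.1 follows (the order p, the input u, and the coefficient values are invented purely for illustration): it builds the regressor matrix row by row and evaluates the least squares estimate from the normal equations (2.2.4)/(2.2.5).

import numpy as np

rng = np.random.default_rng(1)
p, N = 2, 200
h_true = np.array([0.5, 1.0, -0.3])          # h_0, h_1, h_2 (assumed values)
sigma = 0.1

u = rng.normal(size=N + p)                   # u[k] stands for u_{k - p + 1}, k = 0, ..., N + p - 1

def x_row(t):
    # x_t^T = (u_t, u_{t-1}, ..., u_{t-p}); u_t sits at array index t + p - 1
    return u[t + p - 1::-1][:p + 1]

X = np.array([x_row(t) for t in range(1, N + 1)])
Y = X @ h_true + sigma * rng.normal(size=N)  # a realization with E[Y_t] = x_t^T h

h_hat = np.linalg.solve(X.T @ X, X.T @ Y)    # least squares estimate, Eq. (2.2.5)
print(h_hat)                                 # close to h_true at this noise level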
The least squares problem can be given a geometric interpretation by looking at Eq. (2.2.2). We observe that, as θ ranges over all parameter values, Ŷ = Xθ sweeps out a subspace of R^N spanned by the columns of X. The quantity S in Eq. (2.2.2) thus represents the square of the distance from a point Ŷ in this subspace to the given vector Y. For any θ̂ satisfying (2.2.4), the error vector Y - Ŷ, where Ŷ = Xθ̂, is perpendicular (or normal) to the subspace spanned by the columns of X. This follows from (2.2.4) since X^T Xθ̂ = X^T Y, i.e.,

X^T (Y - Ŷ) = X^T (Y - Xθ̂) = 0     (2.2.6)
(In fact, from (2.2.6), Y - Ŷ is in the nullspace of X^T, i.e., Y - Ŷ is in the orthogonal complement of the range space of X [39].) This property explains the use of the term normal equation for (2.2.4).

Example 2.2.2 Consider

E[Y_t] = (3 - t) θ,    t = 1, 2

Thus, X = (2  1)^T. Say we observe

Y_1 = 4,    Y_2 = 1

For this example, the geometric interpretation of the least squares solution is depicted in Fig. 2.2.1.
[Figure 2.2.1. Geometric interpretation of least squares: the vector Y of observations, its projection Ŷ onto the range space of X, and the normal error Y - Ŷ.]
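As a quick arithmetic check of this geometry (the computation below is an added illustration, not part of the original example), Eq. (2.2.5) gives, for the data of Example 2.2.2,

θ̂ = (X^T X)^{-1} X^T Y = (2² + 1²)^{-1}(2·4 + 1·1) = 9/5 = 1.8

Ŷ = Xθ̂ = (3.6, 1.8)^T,    Y - Ŷ = (0.4, -0.8)^T

X^T (Y - Ŷ) = 2(0.4) + 1(-0.8) = 0

so the residual is indeed normal to the range space of X, in agreement with (2.2.6).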
In subsequent sections we investigate further properties of the least squares and related estimators.
2.3. BEST LINEAR UNBIASED ESTIMATORS

A best linear unbiased estimator (BLUE) is defined as the estimator that has minimum variance among the class of all linear unbiased estimators (see Definition 1.3.4). For our current problem, we thus require θ̂ to be a linear function of Y, i.e.,

θ̂ = LY     (2.3.1)

for some p × N matrix L. We restrict attention to stochastic processes having mean given by (2.1.1) with θ the true parameter value, and having covariance given by

E[(Y_t - x_t^T θ)(Y_s - x_s^T θ)] = σ²   if t = s     (2.3.2)
                                  = 0    otherwise    (2.3.3)

where E denotes expectation over the distribution of Y given θ. Combining (2.3.2) and (2.3.3),

E[(Y - Xθ)(Y - Xθ)^T] = σ² I     (2.3.4)

Properties of the least squares estimator under the above assumptions are discussed in the following theorem:
Theorem 2.3.1 Consider a second-order stochastic process Y having mean Xθ and covariance σ²I. Then the least squares estimator of θ given by

θ̂ = [X^T X]^{-1} X^T Y     (2.3.5)

has the following properties:

(i) It is a linear function of the data
(ii) E_Y[θ̂] = θ (i.e., θ̂ is unbiased; see Definition 1.3.1)     (2.3.6)
(iii) E_Y[(θ̂ - θ)(θ̂ - θ)^T] = (X^T X)^{-1} σ²     (2.3.7)
(iv) E_Y[(θ̂ - θ)(θ̂ - θ)^T] ≤ E_Y[(θ̃ - θ)(θ̃ - θ)^T]     (2.3.8)

where θ̃ is any other linear unbiased estimator (i.e., θ̂ is BLUE).

Proof (i) From (2.3.5),

θ̂ = LY     (2.3.9)

where L = (X^T X)^{-1} X^T.

(ii)

E_Y[θ̂] = E_Y[(X^T X)^{-1} X^T Y] = (X^T X)^{-1} X^T E_Y[Y] = (X^T X)^{-1} X^T Xθ = θ

(iii)

θ̂ - θ = (X^T X)^{-1} X^T Y - θ = (X^T X)^{-1} X^T (Y - Xθ)

E_Y[(θ̂ - θ)(θ̂ - θ)^T] = (X^T X)^{-1} X^T E_Y[(Y - Xθ)(Y - Xθ)^T] X (X^T X)^{-1}
                       = (X^T X)^{-1} X^T I X (X^T X)^{-1} σ² = (X^T X)^{-1} σ²

(iv) Consider any linear estimator θ̃ where θ̃ = CY. If θ̃ is an unbiased estimator, then
E_Y[θ̃] = θ, i.e.,

E_Y[CY] = CXθ = θ     (2.3.10)

For (2.3.10) to be true for any θ, we require

CX = I     (2.3.11)

The covariance of θ̃ is given by

E_Y[(θ̃ - θ)(θ̃ - θ)^T] = E_Y[(CY - θ)(CY - θ)^T]
  = E_Y[(CY - CXθ + CXθ - θ)(CY - CXθ + CXθ - θ)^T]
  = E_Y[(CY - CXθ)(CY - CXθ)^T]     (using (2.3.11))
  = C E_Y[(Y - Xθ)(Y - Xθ)^T] C^T = C C^T σ²     (2.3.12)

We now define D in terms of C as

D = C - (X^T X)^{-1} X^T     (2.3.13)

Now we form D D^T as

D D^T = [C - (X^T X)^{-1} X^T][C - (X^T X)^{-1} X^T]^T
      = C C^T - (X^T X)^{-1} X^T C^T - C X (X^T X)^{-1} + (X^T X)^{-1}
      = C C^T - (X^T X)^{-1} ≥ 0     (using (2.3.11))     (2.3.14)

Thus

C C^T σ² ≥ (X^T X)^{-1} σ²

This establishes part (iv) of the theorem.  ∇
Corollary 2.3.1 Consider the stochastic process Y having mean Xθ and covariance Σ (where Σ is a known positive definite matrix). Then the BLUE for θ is

θ̂ = (X^T Σ^{-1} X)^{-1} X^T Σ^{-1} Y     (2.3.15)

Proof We can write Σ in the form P P^T where P is nonsingular. Hence, defining

Y' = P^{-1} Y     (2.3.16)

then

E[Y'] = E[P^{-1} Y] = P^{-1} X θ     (2.3.17)
We can thus write

E[Y'] = X' θ,    X' = P^{-1} X     (2.3.18)

Also,

E[(Y' - X'θ)(Y' - X'θ)^T] = E[P^{-1}(Y - Xθ)(Y - Xθ)^T P^{-T}] = P^{-1} Σ P^{-T} = P^{-1} P P^T P^{-T} = I     (2.3.19)

Hence, from Theorem 2.3.1, the BLUE is

θ̂ = (X'^T X')^{-1} X'^T Y' = (X^T P^{-T} P^{-1} X)^{-1} X^T P^{-T} P^{-1} Y
  = (X^T Σ^{-1} X)^{-1} X^T Σ^{-1} Y     (2.3.20)

This establishes the Corollary.  ∇
Note The BLUE described in Corollary 2.3.1 is also the estimator which minimizes the following weighted sum of squares:

S = (Y - Xθ)^T Σ^{-1} (Y - Xθ)     (2.3.21)

We can thus give Gauss's interpretation to Σ^{-1} as expressing the relative precision of our measurements (see Section 2.1). The covariance of θ̂ described in Eq. (2.3.20) is readily shown to be

E[(θ̂ - θ)(θ̂ - θ)^T] = (X^T Σ^{-1} X)^{-1}     (2.3.22)
A possible interpretation of Corollary 2.3.1 is that, if Σ is known (at least up to a scalar multiplier), then by means of a suitable linear transformation of the data (i.e., Y' = P^{-1} Y), the problem may be reduced to the ordinary least squares situation of Theorem 2.3.1. The transformation P^{-1} is sometimes called a prewhitening filter; a small numerical sketch of this idea is given below. The case where Σ is unknown is also amenable to solution, but we defer discussion of the most general problem until a later chapter (cf. Sections 3.3 and 5.3). Here we shall concentrate on the case where Σ takes the special form σ²D with σ² unknown but D known. The known matrix D can be eliminated by use of a prewhitening filter, and thus we assume D = I without loss of generality.
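The following sketch (invented data, assuming NumPy is available) illustrates the prewhitening interpretation: Σ is factored as P P^T by a Cholesky factorization, the data are transformed by P^{-1}, and ordinary least squares on the transformed problem is checked against the direct formula (2.3.20).

import numpy as np

rng = np.random.default_rng(2)
N, p = 50, 2
X = np.column_stack([np.ones(N), np.arange(N, dtype=float)])
theta_true = np.array([1.0, 0.2])

# An assumed known covariance: correlated, non-uniform noise.
A = rng.normal(size=(N, N))
Sigma = 0.1 * (A @ A.T) + np.eye(N)
Y = X @ theta_true + rng.multivariate_normal(np.zeros(N), Sigma)

P = np.linalg.cholesky(Sigma)              # Sigma = P P^T
X_w = np.linalg.solve(P, X)                # prewhitened regressors, P^{-1} X
Y_w = np.linalg.solve(P, Y)                # prewhitened data, P^{-1} Y
theta_blue = np.linalg.solve(X_w.T @ X_w, X_w.T @ Y_w)

Si = np.linalg.inv(Sigma)                  # direct evaluation of (2.3.20)
theta_direct = np.linalg.solve(X.T @ Si @ X, X.T @ Si @ Y)
print(np.allclose(theta_blue, theta_direct))   # True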
2.4. UNBIASED ESTIMATION OF BLUE COVARIANCE

It was shown in Theorem 2.3.1 that the covariance of θ̂ is given by

cov θ̂ = (X^T X)^{-1} σ²     (2.4.1)

Of course, σ² is a parameter of the distribution of Y and hence will also need to be estimated in general. A suitable unbiased estimator of σ² is described in the following theorem.
Theorem 2.4.1 An unbiased estimator of σ² is

V = S(θ̂)/(N - p)     (2.4.2)

where N is the number of observations, p the number of parameters, and S(θ̂) the sum of the squares of the residuals:

S(θ̂) = (Y - Xθ̂)^T (Y - Xθ̂)     (2.4.3)

Proof Substituting (2.3.5) into (2.4.3) yields

S(θ̂) = (Y - X(X^T X)^{-1} X^T Y)^T (Y - X(X^T X)^{-1} X^T Y)
     = Y^T (I_N - X(X^T X)^{-1} X^T)(I_N - X(X^T X)^{-1} X^T) Y
     = Y^T (I_N - X(X^T X)^{-1} X^T) Y
     = trace[(I_N - X(X^T X)^{-1} X^T) Y Y^T]

Therefore,

E[S(θ̂)] = trace{(I_N - X(X^T X)^{-1} X^T)[σ² I_N + X θ θ^T X^T]}
        = σ² trace(I_N - X(X^T X)^{-1} X^T)
        = σ² trace I_N - σ² trace X(X^T X)^{-1} X^T
        = σ² trace I_N - σ² trace (X^T X)^{-1} X^T X     (since trace AB = trace BA)
        = σ² trace I_N - σ² trace I_p = σ²(N - p)     (2.4.4)

i.e., E[V] = σ² as required.  ∇
2.5. NORMAL THEORY

The theory developed above made no requirements on the form of the distribution of Y save that the mean be Xθ and the covariance σ²I. We now make the further assumption that Y is distributed normally, i.e., Y ~ N(Xθ, σ²I) (see Section A.2). We now have the following stronger result on the optimality of the least squares estimator.

Theorem 2.5.1 Consider a normal stochastic process having mean Xθ and covariance σ²I. The minimum variance unbiased estimator (MVUE) (see Definition 1.3.3) of θ is the least squares estimator

θ̂ = (X^T X)^{-1} X^T Y     (2.5.1)
Proof From Eq. (1.3.5), Fisher's information matrix is

M_β = E_{Y|β}{ [∂ log p(Y|β)/∂β] [∂ log p(Y|β)/∂β]^T }     (2.5.2)
where

β = (θ^T, σ²)^T     (2.5.3)

Now, using the assumed normal distribution for the data, we have

p(Y|β) = (2πσ²)^{-N/2} exp{ -(1/2σ²)(Y - Xθ)^T (Y - Xθ) }     (2.5.4)

∂ log p(Y|β)/∂θ = (1/σ²)(Y - Xθ)^T X = v_1^T     (2.5.5)

∂ log p(Y|β)/∂σ² = -(N/2σ²) + (1/2σ⁴)(Y - Xθ)^T (Y - Xθ) = v_2     (2.5.6)

Hence

M_β = E_{Y|β} [ v_1 v_1^T    v_1 v_2
                v_2 v_1^T    v_2²    ]     (2.5.7)

E_{Y|β}(v_2²) = N/2σ⁴     (2.5.8)

Now (since the first and third central moments of a normal distribution are zero)

E_{Y|β}(v_1 v_2) = 0     (2.5.9)

E_{Y|β}(v_1 v_1^T) = E_{Y|β}[(1/σ⁴) X^T (Y - Xθ)(Y - Xθ)^T X] = (1/σ⁴) X^T [σ² I] X = (1/σ²) X^T X     (2.5.10)

Hence, from the Cramer-Rao inequality (Theorem 1.3.1), the covariance of any unbiased estimator g(Y) of β satisfies the inequality

cov g(Y) ≥ M_β^{-1} = [ (X^T X)^{-1} σ²    0
                        0                  2σ⁴/N ]     (2.5.11)

However, it was shown in Theorem 2.3.1 that

cov θ̂ = (X^T X)^{-1} σ²     (2.5.12)

where θ̂ is the least squares estimator of θ. Thus θ̂ achieves the Cramer-Rao lower bound. Since θ̂ is also unbiased (Theorem 2.3.1), it follows that θ̂ is a MVUE of θ.  ∇
Theorem 2.5.2 The estimator V given in Eq. (2.4.2) is a MVUE for σ².

Proof See Zacks [27, Chapter 3].  ∇
We now establish some properties of the estimator θ̂ and of the residual vector R = (Y - Xθ̂):
Theorem 2.5.3 Under the same conditions as Theorem 2.5.1, the estimator θ̂ = (X^T X)^{-1} X^T Y is normally distributed with mean θ and covariance (X^T X)^{-1} σ².

Proof θ̂ is a linear function of Y and thus has a normal distribution since Y has a normal distribution. We have previously established the mean and covariance, and thus the theorem is proved.  ∇

The distribution of the estimator θ̂_i is depicted diagrammatically in Fig. 2.5.1. Note that the results are clustered around the true parameter value θ_i. The standard deviation of θ̂_i is (σ² C_ii)^{1/2} where C_ii is the ith diagonal element of [X^T X]^{-1}.
[Figure 2.5.1. Normal distribution for θ̂_i with mean θ_i and variance σ²C_ii; approximately 68% of results fall within one standard deviation of θ_i.]
Theorem 2.5.4 Under the same conditions as Theorem 2.5.1, the residual vector R = Y - Xθ̂ is normally distributed and is statistically independent of θ̂.

Proof The normality of R follows from the normality of Y and θ̂. We now consider the following covariance:

E_Y[(θ̂ - θ)R^T] = E_Y[((X^T X)^{-1} X^T Y - θ)(Y - X(X^T X)^{-1} X^T Y)^T]
  = E_Y[(X^T X)^{-1} X^T Y (Y - X(X^T X)^{-1} X^T Y)^T]
  = E_Y[(X^T X)^{-1} X^T Y Y^T (I - X(X^T X)^{-1} X^T)]
  = (X^T X)^{-1} X^T [σ² I + X θ θ^T X^T][I - X(X^T X)^{-1} X^T]
  = σ² (X^T X)^{-1} X^T [I - X(X^T X)^{-1} X^T]
  = 0     (2.5.13)

Therefore the independence of θ̂ and R follows from the joint normality of θ̂ and R.  ∇
Theorem 2.5.5 The normalized sum of squares of residuals S(θ̂)/σ² = R^T R/σ² has a χ²(N - p) distribution.

Proof

R^T R/σ² = (1/σ²)[(Y - Xθ̂)^T (Y - Xθ̂)]
         = (1/σ²)[(Y - X(X^T X)^{-1} X^T Y)^T (Y - X(X^T X)^{-1} X^T Y)]
         = (1/σ²) Y^T (I_N - X(X^T X)^{-1} X^T) Y
         = (1/σ²)(Y - Xθ)^T (I_N - X(X^T X)^{-1} X^T)(Y - Xθ)     (2.5.14)

Now we define

Z = (1/σ)(Y - Xθ)     (2.5.15)
A = I_N - X(X^T X)^{-1} X^T     (2.5.16)

Note that A is idempotent (A² = A) and Z is normally distributed, Z ~ N(0, I). Hence, from Corollary A.6.2, R^T R/σ² has a χ²(r) distribution where

r = trace A = trace(I_N - X(X^T X)^{-1} X^T) = N - p     (2.5.17)

This completes the proof of the theorem.  ∇
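The result can be checked by simulation. The following Monte Carlo sketch (invented data, assuming NumPy) draws repeated realizations and confirms that R^T R/σ² has sample mean close to N - p and sample variance close to 2(N - p), as a χ²(N - p) variable should; this also illustrates the unbiasedness of S(θ̂)/(N - p) from Theorem 2.4.1.

import numpy as np

rng = np.random.default_rng(3)
N, p, sigma = 30, 3, 0.5
X = rng.normal(size=(N, p))
theta = np.array([1.0, -2.0, 0.5])
M = np.eye(N) - X @ np.linalg.solve(X.T @ X, X.T)   # I_N - X(X^T X)^{-1} X^T

samples = []
for _ in range(20000):
    Y = X @ theta + sigma * rng.normal(size=N)
    R = M @ Y                                       # residual vector Y - X theta_hat
    samples.append(R @ R / sigma**2)

s = np.array(samples)
print(s.mean(), N - p)           # both close to 27
print(s.var(), 2 * (N - p))      # both close to 54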
The general shape of the distribution of the normalized sum of squares of residuals is shown in Fig. 2.5.2. (Note that for N - p large the χ²(N - p) distribution can be approximated by a normal distribution with mean N - p and variance 2(N - p); see Sections 16.3-16.6 of reference [15].) Notice that the values of R^T R/(N - p) cluster around σ² for N - p large. In Theorem 2.5.3, we showed that θ̂ ~ N(θ, (X^T X)^{-1} σ²). However, in practice we shall only have an estimate of σ², i.e., Eq. (2.4.2). We therefore establish the following result.
[Figure 2.5.2. Approximate distribution of the normalized sum of squares of residuals R^T R/σ²(N - p) for N = N_1 and N = N_2 with N_2 > N_1.]
Theorem 2.5.6 The distribution of (θ̂_i - θ_i)/[C_ii S(θ̂)/(N - p)]^{1/2} is the Student t distribution on (N - p) degrees of freedom, where θ̂_i and θ_i are the ith components of θ̂ and θ, respectively, C_ii is the ith diagonal element of (X^T X)^{-1}, and

S(θ̂) = (Y - Xθ̂)^T (Y - Xθ̂)

Proof From Theorem 2.5.3, (θ̂ - θ) ~ N(0, (X^T X)^{-1} σ²). Hence the marginal distribution for the ith component is

θ̂_i - θ_i ~ N(0, C_ii σ²)

or

v_i = (θ̂_i - θ_i)/(σ √C_ii) ~ N(0, 1)

Also,

z = (1/σ²) S(θ̂) ~ χ²(N - p)

from Theorem 2.5.5. Theorem 2.5.4 implies that v_i and z are independent. Hence from Result A.5.1 it follows that

v_i/[z/(N - p)]^{1/2} ~ t(N - p)

i.e.,

(θ̂_i - θ_i)/[C_ii S(θ̂)/(N - p)]^{1/2} ~ t(N - p)

This completes the proof of the theorem.  ∇

The above result can be used to construct a 100(1 - α)% confidence interval for θ_i since

Pr{ -t_α < (θ̂_i - θ_i)/[C_ii S(θ̂)/(N - p)]^{1/2} < t_α } = 1 - α

where t_α is the critical value of the t distribution on (N - p) degrees of freedom (its upper 100(1 - α/2) percentile). The 100(1 - α)% confidence interval for θ_i is

θ̂_i - t_α[C_ii S(θ̂)/(N - p)]^{1/2} < θ_i < θ̂_i + t_α[C_ii S(θ̂)/(N - p)]^{1/2}  ∇
As might be expected from Figs. 2.5.1 and 2.5.2, for (N - p) large the t(N - p) distribution can be approximated by an N(0, 1) distribution. Thus Theorem 2.5.6 reduces to Theorem 2.5.3 with σ² replaced by S(θ̂)/(N - p) for large N.

Example 2.5.1 A process {y_t} is known to be either constant or to increase linearly with time. A suitable model for {y_t} is

y_t = h_1 + h_2 t + ε_t,    t = 0, ..., N

where {ε_t} is an i.i.d. Gaussian process with zero mean and variance σ².
With N = 11, the following results were obtained for the minimum variance unbiased estimates of (h_1, h_2, σ²):

ĥ_1 = 0.53,    ĥ_2 = 0.05,    σ̂² = 0.132

We shall use the t distribution to obtain a 95% confidence interval for h_2. From Theorem 2.5.6, (ĥ_2 - h_2)/[C_22 S(θ̂)/10]^{1/2} has a Student t distribution on 10 degrees of freedom, where

θ̂ = (ĥ_1, ĥ_2)^T
S(θ̂) = (N - p) σ̂² = 1.32
C_22 = (2, 2)th element of (X^T X)^{-1}
     = (2, 2)th element of { Σ_t [ 1  t
                                   t  t² ] }^{-1} = 1/110

From tables of the t distribution, the upper 97.5 percentile on 10 degrees of freedom is 2.228. Application of Theorem 2.5.6 leads to the following 95% confidence interval:

0.05 - 2.228 [(1/110)(1.32/10)]^{1/2} < h_2 < 0.05 + 2.228 [(1/110)(1.32/10)]^{1/2}

i.e., -0.027 < h_2 < 0.127.  ∇

We now turn to consider the situation where there may be linear relationships between the parameters. This may occur, for example, when too many parameters have been included in the parameter vector. We wish to establish a test for the presence of these constraints. The essential results are contained in the following theorem.
Theorem 2.5.7 Consider the normal linear model with the side constraint Lθ = C, where L is an s × p matrix and is assumed to have rank s. Then the minimum variance unbiased estimator θ̄ subject to the side constraint Lθ̄ = C is given by

θ̄ = θ̂ - (X^T X)^{-1} L^T (L(X^T X)^{-1} L^T)^{-1} (Lθ̂ - C)

where θ̂ is the unconstrained least squares estimate of θ. We also define the following quantities:

S(θ̂) = (Y - Xθ̂)^T (Y - Xθ̂)     (2.5.18)
S(θ̄) = (Y - Xθ̄)^T (Y - Xθ̄)     (2.5.19)
S_H = S(θ̄) - S(θ̂)     (2.5.20)
Then, under the null hypothesis that Lθ = C, we have

(i) (1/σ²) S_H ~ χ²(s)
(ii) S_H and S(θ̂) are independent
(iii) [S_H/s] / [S(θ̂)/(N - p)] ~ F(s, N - p)

Proof Consider

J = S(θ̄) - S(θ̂)     (2.5.21)

where θ̂ = (X^T X)^{-1} X^T Y. Hence

J = (Y - Xθ̄)^T (Y - Xθ̄) - (Y - Xθ̂)^T (Y - Xθ̂)
  = (Y - Xθ̄)^T (Y - Xθ̄) - Y^T Y + θ̂^T X^T X θ̂
  = -2θ̄^T X^T Y + θ̄^T X^T X θ̄ + θ̂^T X^T X θ̂
  = (θ̄ - θ̂)^T X^T X (θ̄ - θ̂)     (2.5.22)

We now introduce the constraint via a Lagrange multiplier to form

J_c = (θ̄ - θ̂)^T X^T X (θ̄ - θ̂) + 2λ^T (Lθ̄ - C)

Differentiating with respect to θ̄ leads to

L^T λ = X^T X (θ̂ - θ̄)

or, defining A = X^T X,

L^T λ = A(θ̂ - θ̄)     (2.5.23)

Multiplying on the left by LA^{-1} gives

LA^{-1}L^T λ = L(θ̂ - θ̄) = Lθ̂ - C    (since Lθ̄ = C)

Also, LA^{-1}L^T is positive definite since L has rank s. So

λ = (LA^{-1}L^T)^{-1} (Lθ̂ - C)     (2.5.24)

From (2.5.23),

θ̂ - θ̄ = A^{-1} L^T λ

Using (2.5.24),

θ̄ = θ̂ - A^{-1} L^T (LA^{-1}L^T)^{-1} (Lθ̂ - C)     (2.5.25)

thus establishing the first part of the theorem. Substituting into (2.5.22) gives

J = (Lθ̂ - C)^T (LA^{-1}L^T)^{-1} LA^{-1}L^T (LA^{-1}L^T)^{-1} (Lθ̂ - C)
  = (Lθ̂ - C)^T (LA^{-1}L^T)^{-1} (Lθ̂ - C) = S_H     (2.5.26)

If the null hypothesis is true, then C = Lθ where θ is the true parameter vector.
Therefore

S_H = (θ̂ - θ)^T L^T (LA^{-1}L^T)^{-1} LA^{-1}L^T (LA^{-1}L^T)^{-1} L(θ̂ - θ)
    = (θ̂ - θ)^T L^T (LA^{-1}L^T)^{-1} L(θ̂ - θ)

But

θ̂ - θ = (X^T X)^{-1} X^T (Y - Xθ) = A^{-1} X^T (Y - Xθ)

so

S_H = (Y - Xθ)^T X A^{-1} L^T (LA^{-1}L^T)^{-1} LA^{-1} X^T (Y - Xθ) = (Y - Xθ)^T A_H (Y - Xθ)     (2.5.27)

where

A_H = X A^{-1} L^T (LA^{-1}L^T)^{-1} LA^{-1} X^T

Clearly A_H is idempotent (i.e., A_H² = A_H). Now, from (2.5.14),

S(θ̂)/σ² = (1/σ)(Y - Xθ)^T (I_N - X(X^T X)^{-1} X^T)(Y - Xθ)(1/σ)

From (2.5.27),

S_H/σ² = (1/σ)(Y - Xθ)^T A_H (Y - Xθ)(1/σ)

Now we form

A_1 = X(X^T X)^{-1} X^T - A_H

Clearly

I_N = [I_N - X(X^T X)^{-1} X^T] + A_H + A_1

Also, rank A_H = s and rank A_1 = p - s, since A_1 is also idempotent. Therefore

rank[I_N - X(X^T X)^{-1} X^T] + rank A_H + rank A_1 = (N - p) + s + (p - s) = N

Furthermore,

(1/σ)(Y - Xθ) ~ N(0, I)

Hence, from the Fisher-Cochrane theorem (Theorem A.6.1), S(θ̂)/σ² and S_H/σ² are independent χ² random variables with (N - p) and s degrees of freedom, respectively.
This completes the proof of parts (i) and (ii) of the theorem. Part (iii) now follows immediately from Result A.4.1.  ∇

The general shape of the F distribution is shown in Fig. 2.5.3.
[Figure 2.5.3. General form of the F distribution. Note: c decreases with increasing N.]
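Before the worked example, here is a small numerical sketch (invented data, assuming NumPy) of the constrained estimator and of S_H from Theorem 2.5.7. It checks that the constraint Lθ̄ = C is satisfied and that S(θ̄) - S(θ̂) equals the quadratic form in (2.5.26).

import numpy as np

rng = np.random.default_rng(4)
N, p = 40, 3
X = rng.normal(size=(N, p))
Y = X @ np.array([0.8, 0.1, -0.4]) + 0.2 * rng.normal(size=N)

L = np.array([[0.0, 1.0, 0.0]])          # s = 1 constraint: second parameter equals 0
C = np.array([0.0])

A = X.T @ X
theta_hat = np.linalg.solve(A, X.T @ Y)                  # unconstrained estimate
G = L @ np.linalg.solve(A, L.T)                          # L A^{-1} L^T
lam = np.linalg.solve(G, L @ theta_hat - C)
theta_bar = theta_hat - np.linalg.solve(A, L.T @ lam)    # Eq. (2.5.25)

S = lambda th: np.sum((Y - X @ th) ** 2)
S_H = S(theta_bar) - S(theta_hat)
quad = float((L @ theta_hat - C) @ np.linalg.solve(G, L @ theta_hat - C))
print(L @ theta_bar)        # ~[0.], constraint satisfied
print(S_H, quad)            # equal, cf. (2.5.26)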
We illustrate the application of Theorem 2.5.7 in the following example:

Example 2.5.2 With the same data as in Example 2.5.1, the following results were obtained for the MVUE of h_1 and σ² when h_2 was constrained to be zero:

ĥ_1 = 0.54,    σ̂² = 0.204

We shall use the F distribution to test the hypothesis that h_2 = 0 at the 5% risk level. From Theorem 2.5.7, the following variable has an F(1, 10) distribution under the null hypothesis (h_2 = 0):

f = S_H/(S(θ̂)/10)

where

S(θ̂) = 1.32,    S_H = 2.04 - 1.32 = 0.72

Therefore

f = 0.72/0.132 = 5.45

Now, from tables, F_{0.95}(1, 10) = 4.96. Since f > 4.96, we conclude that there is good evidence to reject the null hypothesis at the 5% risk level (cf. the confidence interval found for h_2 in Example 2.5.1).  ∇

Example 2.5.2 is a form of decision problem in which a choice is desired between two competing model structures. We have adopted a classical
approach by fixing the probability that the null hypothesis is rejected when it is actually true. Classical methods for testing statistical hypotheses are treated in many texts, for example [69].
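As a check on the arithmetic of Examples 2.5.1 and 2.5.2, the following sketch (assuming NumPy and SciPy are available) reproduces the confidence interval and the F statistic; the raw data are not given in the text, so only the summary quantities quoted in the examples are used.

import numpy as np
from scipy import stats

dof = 10                 # degrees of freedom used in the examples
C22 = 1.0 / 110.0        # (2,2) element of (X^T X)^{-1}
S_hat = 1.32             # S(theta_hat) = 10 * 0.132
h2_hat = 0.05

# Example 2.5.1: 95% confidence interval for h_2 (Theorem 2.5.6)
t_crit = stats.t.ppf(0.975, dof)                   # ~2.228
half_width = t_crit * np.sqrt(C22 * S_hat / dof)
print(h2_hat - half_width, h2_hat + half_width)    # ~(-0.027, 0.127)

# Example 2.5.2: F test of h_2 = 0 (Theorem 2.5.7)
S_H = 2.04 - 1.32
f = S_H / (S_hat / dof)                            # ~5.45
print(f, stats.f.ppf(0.95, 1, dof))                # 5.45 > 4.96, so reject h_2 = 0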
2.6. NUMERICAL ASPECTS

The least squares estimates have been shown to be solutions of the normal equations (2.2.4), i.e.,

(X^T X) θ̂ = X^T Y     (2.6.1)

Equation (2.6.1) could be solved by any method for solving sets of linear equations, e.g., Gaussian elimination [84]. Unfortunately, the normal equations are often ill-conditioned, especially when the number of parameters is large. In these circumstances special numerical procedures are required. We briefly describe one numerically robust algorithm which exploits the structure of the least squares problem [86]. Consider a Householder transformation matrix defined as follows:

Q = I - 2uu^T     (2.6.2)
where u is any unit length vector.

Lemma 2.6.1 The transformation matrix Q defined in (2.6.2) has the following properties:

(i) Q is symmetric, i.e.,

Q^T = Q     (2.6.3)

(ii) Q is orthonormal, i.e.,

Q^T Q = I     (2.6.4)

Proof Left as an exercise.  ∇
Lemma 2.6.2 The product of Householder transformations is orthonormal, i.e., if Ψ = Q_p Q_{p-1} ··· Q_1 where Q_1, ..., Q_p are Householder transformations, then

Ψ^T Ψ = I     (2.6.5)

Proof Left as an exercise.  ∇
Lemma 2.6.3 Given any vector x, there exists a Householder transformation Q such that

Q x = α     (2.6.6)

where

α = (λ, 0, ..., 0)^T     (2.6.7)

Proof Using (2.6.3) and (2.6.4) in (2.6.6) gives

x = Q(λ, 0, ..., 0)^T     (2.6.8)

Using (2.6.2),

[ x_1 ]   [ (1 - 2u_1²)λ ]
[ x_2 ] = [ -2u_1 u_2 λ  ]
[  ⋮  ]   [      ⋮       ]
[ x_N ]   [ -2u_1 u_N λ  ]     (2.6.9)

Thus

u_1 = [½(1 - x_1/λ)]^{1/2}     (2.6.10)

u_i = -x_i/(2u_1 λ),    i = 2, ..., N     (2.6.11)

Finally, from (2.6.6),

λ² = α^T α = x^T Q^T Q x = x^T x     (2.6.12)

i.e.,

λ = ±(x^T x)^{1/2}     (2.6.13)
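A direct transcription of this construction (a sketch, assuming NumPy; the test vector is invented) builds u via (2.6.10)-(2.6.11) and verifies that Q = I - 2uu^T maps x onto (λ, 0, ..., 0)^T. The sign of λ is taken opposite to that of x_1, a common convention that avoids cancellation in (2.6.10).

import numpy as np

def householder(x):
    x = np.asarray(x, dtype=float)
    lam = -np.sign(x[0]) * np.sqrt(x @ x)      # lambda = +/-(x^T x)^{1/2}, Eq. (2.6.13)
    u = np.empty_like(x)
    u[0] = np.sqrt(0.5 * (1.0 - x[0] / lam))   # Eq. (2.6.10)
    u[1:] = -x[1:] / (2.0 * u[0] * lam)        # Eq. (2.6.11)
    Q = np.eye(len(x)) - 2.0 * np.outer(u, u)  # Eq. (2.6.2)
    return Q, lam

x = np.array([3.0, 1.0, -2.0])
Q, lam = householder(x)
print(np.round(Q @ x, 10))                                    # (lam, 0, 0)
print(np.allclose(Q.T @ Q, np.eye(3)), np.allclose(Q, Q.T))   # orthonormal and symmetric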
Lemma 2.6.4 Given any N × p matrix X (N > p), there exists an orthonormal matrix Ψ such that

Ψ X = [ R ]
      [ 0 ]     (2.6.14)

where R is a p × p upper triangular matrix and Ψ is the product of p Householder transformations.
Proof From Lemma 2.6.3 it is clear that a transformation Q_1 can be found such that

Q_1 X = [ λ_1  x'_12  ···  x'_1p ]
        [ 0    x'_22  ···  x'_2p ]
        [ ⋮     ⋮            ⋮   ]
        [ 0    x'_N2  ···  x'_Np ]     (2.6.15)

Similarly, if Q_2 = I - 2u'u'^T is defined with the first element of u' equal to zero, then Q_2 can be chosen so that

Q_2 Q_1 X = [ λ_1  x'_12   x'_13   ···  x'_1p  ]
            [ 0    λ_2     x''_23  ···  x''_2p ]
            [ 0    0       x''_33  ···  x''_3p ]
            [ ⋮    ⋮        ⋮             ⋮    ]
            [ 0    0       x''_N3  ···  x''_Np ]     (2.6.16)

Continuing this process yields Eq. (2.6.14) with

Ψ = Q_p Q_{p-1} ··· Q_1     (2.6.17)

which is orthonormal from Lemma 2.6.2.  ∇
Theorem 2.6.1 Consider the least squares problem with sum of squares

S(θ) = (Y - Xθ)^T (Y - Xθ)     (2.6.18)

The least squares estimate satisfies

R θ̂ = γ_1     (2.6.19)

where R is upper triangular and is given by

[ R ]
[ 0 ] = Ψ X     (2.6.20)

where Ψ is given by (2.6.17), and the p-vector γ_1 in Eq. (2.6.19) is given by

[ γ_1 ]
[ γ_2 ] = Ψ Y     (2.6.21)

The sum of squares of residuals is given by

S(θ̂) = γ_2^T γ_2     (2.6.22)
Proof

S(θ) = (Y - Xθ)^T (Y - Xθ)     (2.6.23)

Using Lemma 2.6.2,

S(θ) = (Y - Xθ)^T Ψ^T Ψ (Y - Xθ)     (2.6.24)

Using Lemma 2.6.4, the matrix Ψ can be found such that

Ψ X = [ R ]
      [ 0 ]     (2.6.25)

where R is upper triangular. Let

Ψ Y = [ γ_1 ]
      [ γ_2 ]     (2.6.26)

where γ_1 is a p-vector and γ_2 is an (N - p)-vector. Substituting (2.6.25) and (2.6.26) into (2.6.24) gives

S(θ) = (γ_1 - Rθ)^T (γ_1 - Rθ) + γ_2^T γ_2     (2.6.27)

Since R is square, S(θ) is minimized by θ̂ satisfying

R θ̂ = γ_1

and the sum of squares of residuals is S(θ̂) = γ_2^T γ_2.  ∇

Equation (2.6.19) can be readily solved by back-substitution since R is triangular. The algorithm outlined above for determining θ̂ has superior numerical properties compared with naive algorithms based on direct solution of the normal equations (2.6.1) [86].
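The following sketch (invented data, assuming NumPy) mirrors Theorem 2.6.1, using a library QR factorization in place of an explicit product of Householder transformations (NumPy's QR routine is itself Householder-based). With Ψ = Q^T, the vector ΨY is split into γ_1 and γ_2, Rθ̂ = γ_1 is solved by back-substitution, and γ_2^T γ_2 gives the residual sum of squares.

import numpy as np

rng = np.random.default_rng(5)
N, p = 100, 3
X = rng.normal(size=(N, p))
Y = X @ np.array([1.0, -0.5, 0.25]) + 0.3 * rng.normal(size=N)

Q, R = np.linalg.qr(X, mode="complete")     # X = Q [R; 0], Q is N x N orthonormal
gamma = Q.T @ Y                             # Psi Y, with Psi = Q^T
gamma1, gamma2 = gamma[:p], gamma[p:]

theta_hat = np.zeros(p)                     # back-substitution for R theta_hat = gamma_1
for i in range(p - 1, -1, -1):
    theta_hat[i] = (gamma1[i] - R[i, i + 1:p] @ theta_hat[i + 1:]) / R[i, i]

print(np.allclose(theta_hat, np.linalg.solve(X.T @ X, X.T @ Y)))        # matches (2.2.5)
print(np.isclose(gamma2 @ gamma2, np.sum((Y - X @ theta_hat) ** 2)))    # S(theta_hat), Eq. (2.6.22)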
2.7. CONCLUSIONS

This chapter has considered the problem of estimation when the stochastic process has a mean value which is expressible as a known linear function of a parameter vector, i.e., E{Y} = Xθ where X is known. The results are fundamental to any treatment of identification and we shall often have reason to refer back to this chapter. In the next chapter we consider more general estimation problems via the maximum likelihood approach.
PROBLEMS

2.1. In an agricultural experiment, the crop yields (y) were obtained from a sequence of experiments with different amounts of fertilizer per acre (u).

u:  1    2    3    4    5    6    7    8    9    10   11   12
y:  1.1  0.9  1.1  1.4  1.3  1.1  1.2  1.6  2.1  2.0  1.6  1.7
(i) Assuming that the 12 experiments are independent (i.e., the errors are uncorrelated), fit a model to the data of the following form:

y_k = θ_0 + θ_1 u_k + ε_k

where {ε_k} is assumed to be an i.i.d. sequence of normal random variables, each with zero mean and variance σ².
(ii) Using the results of part (i), construct 95% confidence intervals for the parameters θ_0 and θ_1 (use the t distribution).
(iii) It is hypothesized that the fertilizer has no effect on the yield per acre (i.e., it is hypothesized that θ_1 is zero). Test, at the 5% risk level, whether the data give evidence against the hypothesis (use the F distribution).

2.2. Use the least squares technique to fit a quadratic function of the form

E[Y_t] = a t² + b t + c

to the data given below.

t:   1    2    3    4    5     6    7    8    9    10
y_t: 9.6  4.1  1.3  0.4  0.05  0.1  0.7  1.8  3.8  9.0
2.3. Consider a linear continuous-time dynamic system having transfer function

H(s) = B(s)/A(s)    (s is the Laplace transform variable)

where

B(s) = b_0 + b_1 s + ··· + b_{n-1} s^{n-1}
A(s) = 1 + a_1 s + ··· + a_n s^n

By applying sinewaves of frequencies ω_1, ω_2, ..., ω_N to the system, we are able to measure H(jω_i), i = 1, ..., N. Of course, there will be unavoidable errors in these measurements, so we denote the results by Ĥ(jω_i), i = 1, ..., N. Since H(jω) = B(jω)/A(jω), we have

A(jω)H(jω) = B(jω)

We therefore propose that the coefficients in A and B be estimated by minimizing the sum of squares of the moduli of the differences:

J = Σ_{i=1}^{N} e_i* e_i

where e_i = (A(jω_i)Ĥ(jω_i) - B(jω_i)) and * denotes complex conjugate transpose.
PROBLEMS
(a) Show that J can be expressed as J = (Y - XO)*(Y - XO)
where
x=
[-~W:~(~Wd' ... , (-~Wd:~(~Wd, 1'~Wl'
,
-jwNH(jwN),···, (-jW N) H(jw N), 1,jw N, Y=
(~Wt):~!!],
, (jWN)
[~(~Wd] H(jWN)
(b) Show that the value of 0 minimizing J is
e= [Re(X* X)] -
1
Re[x* Y]
where Re denotes real part. (The algorithm described in this question is often called Levy's method.) 2.4. Is the estimator found in Problem 2.3b a linear function of the data HUw!), ... , HUwN)? Comment on the applicability of the theory of Section 2.3 to this problem. 2.5. Establish Eq. (2.3.22). 2.6. Let X be a k-variate normal random variable with mean Ji and covariance I:. Show that the characteristic function for X is
E {X X ... X } _ 1 on
"I
'2
I_
'. t-
0
where it, i2 , ... , in are integers and
EX{XIXmXnXp}
= I:mnI:/ p + I:/mI: np + I:/nI:mp
where I: i j denotes the ijth element of E, Hint: Use the answer to Problem 2.7. 2.9. Let Y denote an N-variate normal random variable having mean XO and covariance matrix (721 where (72 is a scalar and I is an N x N identity matrix. The regression matrix X is known but 0 and (72 are unknown parameters (the dimension of 0 is p and hence X is N x p).
44
2.
LINEAR LEAST SQUARES AND NORMAL THEORY
Determine the information matrix M fJ for the parameter vector
(P is a (p + I)-vector) Hence show that the lower bounds on the covariance of any unbiased estimator of and a 2 are given, respectively, by
e
and Hint: See Theorem 2.5.1. 2.10. We know that the estimator V = S/(N - p) given in Eq. (2.4.2) is an unbiased estimator for a 2 • Show that the variance of V is given by var(V)
= 2a4/(N - p)
Comment: V clearly does not achieve the Cramer-Rao lower bound (cf. problem 2.9) but V is MVUE (see Theorem 2.5.2). 2.11. Show that S(iJ)/a 2 , where S(n) is defined in Eq. (2.5.19), has a X2(N - P + s) distribution. 2.12. Show that the estimators and
e
V = [l/(N - p)](Y - XIW(Y - XO)
are sufficient statistics for and a 2 if Y is assumed to have Gaussian distribution with mean xe and covariance a 2 I .