Chapter 3
Probabilities and Statistical Estimation

This chapter summarizes the mathematical fundamentals of probabilities and statistical estimation. Since the facts established here are directly connected with the analysis in the subsequent chapters, brief derivations are given for most propositions, but those which require lengthy and subtle mathematical arguments are stated without proofs.
3.1 Probability Distributions

3.1.1 Mean, variance, and covariance

Let x be a scalar random variable, and p(x) its probability density defined for real x. The expectation E[x] and the variance V[x] of x are defined by

E[x] = \int_{-\infty}^{\infty} x p(x) dx, \quad V[x] = \int_{-\infty}^{\infty} (x - E[x])^2 p(x) dx.   (3.1)
By definition, the variance V[x] is nonnegative. Two random variables x and y are independent of each other if their joint probability density p(x, y) has the form p(x, y) = p_x(x) p_y(y). The covariance (or correlation) of x and y is defined by

V[x, y] = E[(x - E[x])(y - E[y])].   (3.2)
Random variables x and y are said to be uncorrelated if V[x, y] = 0. Independent random variables are always uncorrelated, but the converse does not necessarily hold. Let x be an n-vector random variable, and p(x) its probability density defined in the entire n-dimensional space R^n. The expectation E[x] and the variance-covariance matrix (or simply covariance matrix) V[x] of x are defined by

E[x] = \int_{R^n} x p(x) dx, \quad V[x] = \int_{R^n} (x - E[x])(x - E[x])^T p(x) dx.   (3.3)

The covariance matrix V[x] is always positive semi-definite, since

(a, V[x]a) = E[(x - E[x], a)^2] \ge 0   (3.4)
for an arbitrary n-vector a. The covariance matrix V[x] is diagonal if and only if the components of x are uncorrelated to each other. The variance-covariance tensor (or simply covariance tensor) V[X] of an mn-matrix random variable X is an mnmn-tensor defined by

V[X] = E[(X - E[X]) \otimes (X - E[X])].   (3.5)

This is also positive semi-definite, since

(A, V[X]A) = E[(X - E[X]; A)^2] \ge 0   (3.6)

for an arbitrary mn-matrix A. The quantities E[x^2], E[\|x\|^2], and E[\|X\|^2] are called the mean squares of x, x, and X, respectively. Their square roots are called the root mean squares of the respective random variables. Let X_1, ..., X_N be independent samples (they may be scalars, vectors, or matrices) from a distribution that has mean E[X] and variance or covariance matrix/tensor V[X]. Their average

\bar{X} = \frac{1}{N} \sum_{\alpha=1}^{N} X_\alpha   (3.7)

has the following expectation and variance or covariance matrix/tensor:

E[\bar{X}] = E[X], \quad V[\bar{X}] = \frac{1}{N} V[X].   (3.8)
This implies that for a large number N

\bar{X} = E[X] + O(1/\sqrt{N}),   (3.9)

and hence \bar{X} ~ E[X] in the asymptotic limit N \to \infty, where the symbol "~" indicates that the statistical behavior is similar. This fact is known as the law of large numbers^1. If x and y are random variables, their direct sum x \oplus y is also a random variable. Its expectation is
E[x \oplus y] = E[x] \oplus E[y].   (3.10)
The covariance matrix of x \oplus y has the form

V[x \oplus y] = \begin{pmatrix} V[x] & V[x, y] \\ V[y, x] & V[y] \end{pmatrix},   (3.11)

where

V[x, y] = E[(x - E[x])(y - E[y])^T],
^1 The law of large numbers can be stated in many different ways; the precise meaning of the symbol "~" differs in each case (we omit the details).
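Eqs. (3.7)-(3.9) are easy to check numerically. The sketch below (NumPy assumed; the exponential distribution is an arbitrary choice) repeats the N-sample average many times and compares its empirical mean and variance with E[X] and V[X]/N:

```python
import numpy as np

rng = np.random.default_rng(0)

# Samples X_alpha from an exponential distribution: E[X] = 2, V[X] = 4.
mean, var = 2.0, 4.0
reps = 1000                       # repetitions of the averaging experiment

for N in (10, 100, 10000):
    # 'reps' independent realizations of the average (1/N) sum X_alpha.
    averages = rng.exponential(scale=mean, size=(reps, N)).mean(axis=1)
    # eq. (3.8): E[Xbar] = E[X] and V[Xbar] = V[X]/N.
    print(N, averages.mean().round(3), averages.var().round(6))
```

As N grows, the empirical variance of the average shrinks as V[X]/N, in accordance with the law of large numbers.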
V[y, x] = E[(y - E[y])(x - E[x])^T] = V[x, y]^T.   (3.12)
If x and y are independent, then V[x, y] = 0 and V[y, x] = 0, and hence the covariance matrix of x \oplus y has the form

V[x \oplus y] = V[x] \oplus V[y].   (3.13)
If x is an n-vector random variable and A is an mn-matrix, then y = Ax is an m-vector random variable; its expectation and covariance matrix are

E[y] = A E[x], \quad V[y] = A V[x] A^T.   (3.14)
Let y = y(x) be an m-vector function of an n-vector x. If x is a random variable, then y is also a random variable. If we write \bar{x} = E[x] and x = \bar{x} + \Delta x, the deviation \Delta x is a random variable of mean 0. If we write \bar{y} = E[y] and y = \bar{y} + \Delta y, we obtain to a first approximation

\Delta y = \frac{\partial y}{\partial x}\Big|_{\bar{x}} \Delta x,   (3.15)

where \partial y/\partial x|_{\bar{x}} is the mn-matrix whose (ij) element is \partial y_i/\partial x_j evaluated at \bar{x}. To a first approximation, the covariance matrix of y can be written as

V[y] = \frac{\partial y}{\partial x}\Big|_{\bar{x}} V[x] \frac{\partial y}{\partial x}\Big|_{\bar{x}}^T.   (3.16)
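First-order covariance propagation as in eq. (3.16) can be sketched as follows (NumPy assumed; the Cartesian-to-polar map and all numbers are hypothetical). The Jacobian is evaluated at the mean, and the result is compared with a Monte Carlo estimate:

```python
import numpy as np

# Hypothetical nonlinear map y(x): Cartesian (x1, x2) -> (r, theta).
def jacobian(x):                      # dy/dx, the (ij) element dy_i/dx_j
    r2 = x[0]**2 + x[1]**2
    r = np.sqrt(r2)
    return np.array([[ x[0]/r,  x[1]/r ],
                     [-x[1]/r2, x[0]/r2]])

x_bar = np.array([3.0, 4.0])          # mean of x
V_x = np.diag([1e-4, 1e-4])           # small errors: first order is accurate

J = jacobian(x_bar)
V_y = J @ V_x @ J.T                   # eq. (3.16)

# Monte Carlo check of the first-order formula.
rng = np.random.default_rng(1)
xs = rng.multivariate_normal(x_bar, V_x, size=100000)
ys = np.stack([np.hypot(xs[:, 0], xs[:, 1]),      # r for each sample
               np.arctan2(xs[:, 1], xs[:, 0])],   # theta for each sample
              axis=1)
V_mc = np.cov(ys.T)
print(V_y)
print(V_mc)                           # close to V_y
```

For small V[x] the linearization is accurate and the two matrices agree closely; for larger errors the neglected higher-order terms become visible.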
Let x be an n-vector random variable, and let n = N[x]. If we write \bar{x} = E[x], x = \bar{x} + \Delta x, \bar{n} = E[n], and n = \bar{n} + \Delta n, we obtain to a first approximation

\Delta n = \frac{1}{\|\bar{x}\|} P_{\bar{n}} \Delta x, \quad \bar{n} = N[\bar{x}],   (3.17)

where P_{\bar{n}} is the projection matrix along \bar{n}. To a first approximation, the covariance matrix of n can be written as

V[n] = \frac{1}{\|\bar{x}\|^2} P_{\bar{n}} V[x] P_{\bar{n}}.   (3.18)

3.1.2 Geometry of probability distributions
Let x be an n-vector random variable, and let \bar{x} = E[x]. The deviation \Delta x = x - \bar{x}, often called the "error", is an n-vector random variable of mean 0. We can write

x = \bar{x} + \Delta x, \quad V[x] = E[\Delta x \Delta x^T].   (3.19)

The mean square E[\|\Delta x\|^2] of the error \Delta x is given by the trace of the covariance matrix V[x]:

E[\|\Delta x\|^2] = tr V[x].   (3.20)
The spectral decomposition of the covariance matrix V[x] has the following form (see eq. (2.62)):

V[x] = \sum_{i=1}^{n} \sigma_i^2 u_i u_i^T, \quad \sigma_1 \ge \cdots \ge \sigma_n \ge 0.   (3.21)
The vector u_1 indicates the orientation in which the error in x is most likely to occur, and \sigma_1^2 is the variance in that orientation. In fact, for an arbitrary unit vector u

V[(u, x)] = E[(u, \Delta x)^2] = (u, E[\Delta x \Delta x^T] u) = (u, V[x]u),   (3.22)

which is maximized by u_1; the maximum value is \sigma_1^2 (see eqs. (2.86)). We can also see that for each i the eigenvalue \sigma_i^2 indicates the variance of the error \Delta x in orientation u_i. Since {u_i} is an orthonormal system, the error \Delta x can be expressed in the form

\Delta x = \sum_{i=1}^{n} \Delta x_i u_i, \quad \Delta x_i = (\Delta x, u_i).   (3.23)

It follows from eq. (3.21) that the distribution in each orientation is uncorrelated:

E[\Delta x_i \Delta x_j] = (u_i, V[x]u_j) = \sigma_i^2 \delta_{ij}.   (3.24)
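Eqs. (3.21)-(3.24) translate directly into an eigendecomposition (NumPy assumed; the covariance matrix below is a made-up example):

```python
import numpy as np

V = np.array([[4.0, 1.5],
              [1.5, 1.0]])            # a covariance matrix V[x]

# np.linalg.eigh returns eigenvalues in ascending order; reverse them so
# that sigma_1^2 >= sigma_2^2 as in eq. (3.21).
sigma2, U = np.linalg.eigh(V)
sigma2, U = sigma2[::-1], U[:, ::-1]

u1 = U[:, 0]                          # orientation of the most likely error
print("variances:", sigma2)

# eq. (3.22): the variance along u1 is (u1, V u1) = sigma_1^2, the maximum.
print(u1 @ V @ u1, sigma2[0])

# eq. (3.24): the components along the u_i are uncorrelated:
print(U.T @ V @ U)                    # diagonal with entries sigma_i^2

# eq. (3.20): tr V[x] equals the sum of the sigma_i^2.
print(np.trace(V), sigma2.sum())
```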
If the distribution of the error \Delta x is isotropic, i.e., its occurrence is equally likely in every orientation, we have \sigma_1^2 = \cdots = \sigma_n^2. Eq. (2.63) implies that the covariance matrix V[x] has the form

V[x] = \frac{\epsilon^2}{n} I, \quad \epsilon^2 = E[\|\Delta x\|^2].   (3.25)
Let {u_1, ..., u_n} be an orthonormal basis of R^n. If the distribution of \Delta x is restricted to the r-dimensional subspace S = {u_1, ..., u_r}_L \subset R^n, the covariance matrix V[x] is singular:

V[x] = \sum_{i=1}^{r} \sigma_i^2 u_i u_i^T.   (3.26)
The null space of V[x] is the orthogonal complement N = {u_{r+1}, ..., u_n}_L \subset R^n of S. If the distribution is isotropic in S, the covariance matrix V[x] has the following form (see eqs. (2.50) and (2.51)):

V[x] = \frac{\epsilon^2}{r} P_S, \quad \epsilon^2 = E[\|\Delta x\|^2].   (3.27)
Fig. 3.1. The tangent space T_x(S) and the normal space N_x(S) to S at x in three dimensions.
3.2 Manifolds and Local Distributions

3.2.1 Manifolds and tangent spaces
A (differential) manifold S in an n-dimensional space R^n is a subset of R^n such that for each point x \in S there exists a diffeomorphism (a smooth differentiable mapping) between a neighborhood of x in S and an open subset of R^m for some m, which is called the dimension of the manifold S. A manifold is a generalization of a smooth curve in two dimensions and a smooth surface in three dimensions (the exact definition of a manifold is omitted). An n-dimensional space R^n is itself an n-dimensional manifold. If an m'-dimensional manifold S' is a subset of an m-dimensional manifold S, the manifold S' is said to be a submanifold of S of codimension m - m'. In particular, an m-dimensional manifold in R^n is a submanifold of R^n of codimension n - m. Manifolds are often defined by equations^2. For example, the (n-1)-dimensional unit sphere in n dimensions centered at the coordinate origin, denoted by S^{n-1}, is defined by \|x\|^2 = 1. An equation f(x) = 0 is nonsingular if it defines a manifold of codimension 1, and is singular otherwise. For example, the equation \|x\|^2 = 0 is singular^3, because the manifold it defines is the coordinate origin O, which is zero-dimensional (i.e., of codimension n). Let x be a point in an m-dimensional manifold S \subset R^n. The set of all infinitesimally small vectors \Delta x \in R^n such that x + \Delta x \in S forms an m-dimensional linear space to a first approximation. This linear space is denoted by T_x(S) and called the tangent space^4 to S at x (Fig. 3.1).
A tangent space is a generalization of a tangent line to a curve in two and three dimensions and a tangent plane to a surface in three dimensions. The orthogonal complement

^2 A manifold is also called an algebraic variety if it can be defined by polynomial equations F^{(k)}(x) = 0, k = 1, ..., L.
^3 In this book, we always consider real spaces; in a complex space, the equation \|x\|^2 = 0 defines an (n-1)-dimensional imaginary surface.
^4 The collection of all T_x(S), x \in S, is called the tangent bundle of S.
Fig. 3.2. (a) Two surfaces intersecting transversally. (b)-(d) Two surfaces meeting non-transversally.

of T_x(S) \subset R^n, denoted by N_x(S), is an (n - m)-dimensional linear space and is called the normal space to S. If f(x) = 0 is a nonsingular equation that defines a manifold of codimension 1, it has an (n-1)-dimensional tangent space T_x(S) = {\nabla f}_L^\perp and a one-dimensional normal space N_x(S) = {\nabla f}_L, where \nabla f = (\partial f/\partial x_1, ..., \partial f/\partial x_n)^T. If an m-dimensional manifold S and an m'-dimensional manifold S' meet in R^n and if m + m' > n, their intersection S \cap S' is in general an (m + m' - n)-dimensional manifold. Equivalently put, if a manifold S of codimension l and a manifold S' of codimension l' meet in R^n and if l + l' < n, their intersection S \cap S' is in general a manifold of codimension l + l'. If the following condition is satisfied in addition, the manifolds S and S' are said to intersect transversally:
T_x(S \cap S') = T_x(S) \cap T_x(S').   (3.28)
For example, two surfaces in three dimensions intersect transversally if they cut each other along a curve (Fig. 3.2a), but they do not if they overlap (Fig. 3.2b) or touch each other along a curve (Fig. 3.2c) or at a single point (Fig. 3.2d). If two manifolds intersect transversally, their intersection is structurally stable in the sense that its dimension (or codimension) is preserved if the two manifolds are infinitesimally perturbed in an arbitrary manner.
Fig. 3.3. A local distribution.
3.2.2 Local distributions
Let x be an n-vector random variable whose distribution is restricted to an m-dimensional manifold S \subset R^n. In general, the mathematical treatment is very difficult if a random variable is constrained to be in a manifold. For example, if x is a unit 3-vector, its probability density p(x) is defined over the unit sphere S^2, but its expectation E[x] = \int_{S^2} x p(x) dx is generally inside the sphere. In this book, whenever we consider a probability distribution of a random variable x constrained to be in a manifold S, we assume that x has a local distribution in the sense that the distribution is sufficiently localized around some point \bar{x} \in S and hence to a first approximation the domain of the distribution can be identified with the tangent space T_{\bar{x}}(S) at \bar{x} (Fig. 3.3). We choose the point \bar{x} in such a way that

P_S^{\bar{x}} E[x - \bar{x}] = 0,   (3.29)

and identify the covariance matrix of x with

V[x] = P_S^{\bar{x}} E[(x - \bar{x})(x - \bar{x})^T] P_S^{\bar{x}},   (3.30)

where P_S^{\bar{x}} is the projection matrix onto the tangent space T_{\bar{x}}(S) at \bar{x}. We often call \bar{x} simply "the true value of x". Eq. (3.30) implies that the range and the null space of V[x] coincide with the tangent space T_{\bar{x}}(S) and the normal space N_{\bar{x}}(S), respectively. For brevity, we call the null space of the covariance matrix V[x] of x simply "the null space of x".
3.2.3 Covariance matrix of a 3-D rotation
Consider a local distribution of three-dimensional rotations^5 around \bar{R}. Namely, we regard a three-dimensional rotation matrix R as a random variable perturbed from \bar{R} by a small amount \Delta R in the form R = \bar{R} + \Delta R. Since R and \bar{R} are both rotations, the transformation from \bar{R} to R is also

^5 A three-dimensional rotation can also be represented by a 4-vector q, called a quaternion, constrained to be on the 3-dimensional unit sphere S^3 in four dimensions. Hence, the distribution of three-dimensional rotations can also be thought of as defined over S^3.
a rotation around some axis by a small angle. Since a small rotation has the form given by eq. (2.57), we can write R = (I + \Delta\Omega l \times I + O(\Delta\Omega^2)) \bar{R}, or

R = \bar{R} + \Delta\Omega l \times \bar{R} + O(\Delta\Omega^2).   (3.31)

Hence, to a first approximation

\Delta R = \Delta\Omega l \times \bar{R}.   (3.32)

We define the covariance matrix of rotation R by

V[R] = E[(\Delta\Omega l)(\Delta\Omega l)^T].   (3.33)

The unit eigenvector of V[R] for the largest eigenvalue indicates the axis around which perturbation is most likely to occur. The corresponding eigenvalue indicates the mean square of the angle of rotation around that axis. The mean square of the angle of perturbed rotation in total is

E[\Delta\Omega^2] = tr V[R].   (3.34)

In particular, if the perturbation is equally likely to occur for every axis orientation, the covariance matrix V[R] has the form

V[R] = \frac{\epsilon^2}{3} I, \quad \epsilon^2 = E[\Delta\Omega^2].   (3.35)
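For an isotropic perturbation, eq. (3.35) can be checked by simulation (NumPy assumed; the angle distribution is a made-up choice). Rotation vectors \Delta\Omega l are sampled with uniformly random axes, and their covariance is compared with (\epsilon^2/3) I:

```python
import numpy as np

rng = np.random.default_rng(2)
N = 100000

# Random unit axes l, uniformly distributed over the sphere.
axes = rng.normal(size=(N, 3))
axes /= np.linalg.norm(axes, axis=1, keepdims=True)

# Small rotation angles dOmega (radians).
angles = rng.normal(scale=0.01, size=N)
dOmega = angles[:, None] * axes       # rotation vectors dOmega * l

# eq. (3.33): V[R] = E[(dOmega l)(dOmega l)^T]  (the mean is 0 here).
V_R = dOmega.T @ dOmega / N

# eq. (3.34): E[dOmega^2] = tr V[R]  (exact here, since ||l|| = 1).
eps2 = np.mean(angles**2)
print(np.trace(V_R), eps2)

# eq. (3.35): an isotropic perturbation gives V[R] = (eps2 / 3) I.
print(V_R)
print(eps2 / 3 * np.eye(3))
```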
3.3 Gaussian Distributions and \chi^2 Distributions

3.3.1 Gaussian distributions
The most fundamental probability distribution of an n-vector random variable is the multidimensional Gaussian distribution (or normal distribution). We say that an n-vector x is a Gaussian random variable if it has a multidimensional Gaussian distribution. If it has mean m and covariance matrix \Sigma of full rank, the probability density has the form

p(x) = \frac{1}{\sqrt{(2\pi)^n |\Sigma|}} e^{-(x - m, \Sigma^{-1}(x - m))/2},   (3.36)

which defines a distribution over the entire n-dimensional space R^n. It is easy to confirm that

E[x] = \int_{R^n} x p(x) dx = m, \quad V[x] = \int_{R^n} (x - m)(x - m)^T p(x) dx = \Sigma.   (3.37)
Fig. 3.4. Contours along which the probability density is constant.

The probability density is constant on the surface in R^n defined by

(x - m, \Sigma^{-1}(x - m)) = c   (3.38)
for a positive constant c (Fig. 3.4). Suppose x is decomposed into the form x = x_1 \oplus x_2, and let m = m_1 \oplus m_2 be the corresponding decomposition of the mean m. If x_1 and x_2 are uncorrelated to each other, the covariance matrix \Sigma is decomposed into the form \Sigma = \Sigma_1 \oplus \Sigma_2, where \Sigma_1 and \Sigma_2 are the covariance matrices of x_1 and x_2, respectively. Then, eq. (3.36) takes the form

p(x) = \frac{1}{\sqrt{(2\pi)^{n_1} |\Sigma_1|}} e^{-(x_1 - m_1, \Sigma_1^{-1}(x_1 - m_1))/2} \times \frac{1}{\sqrt{(2\pi)^{n_2} |\Sigma_2|}} e^{-(x_2 - m_2, \Sigma_2^{-1}(x_2 - m_2))/2}.   (3.39)

This means that uncorrelated Gaussian random variables are always independent of each other. In one dimension, the probability density reduces to
p(x) = \frac{1}{\sqrt{2\pi}\sigma} e^{-(x - m)^2/2\sigma^2},   (3.40)

where \sigma^2 (> 0) is the variance of x (Fig. 3.5). The value \sigma is called the standard deviation. Let us call the pair

{m + \sigma, m - \sigma}   (3.41)

the standard deviation pair, and the interval

[m - \sigma, m + \sigma]   (3.42)

the standard confidence interval. The probability that x falls into the standard confidence interval is about 68.27%. If m = 0 and \sigma = 1, the distribution is called the standard Gaussian (or normal) distribution.
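The 68.27% figure for the standard confidence interval (3.42) can be reproduced with SciPy (the mean and standard deviation below are arbitrary):

```python
from scipy.stats import norm

m, sigma = 5.0, 2.0

# Probability that x falls into [m - sigma, m + sigma], eq. (3.42).
p = norm.cdf(m + sigma, loc=m, scale=sigma) - norm.cdf(m - sigma, loc=m, scale=sigma)
print(round(p, 4))   # 0.6827

# The result does not depend on m or sigma:
print(round(norm.cdf(1) - norm.cdf(-1), 4))   # 0.6827
```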
Fig. 3.5. One-dimensional Gaussian distribution. Gaussian distribution of mean m and covariance matrix E of rank r (< n) is defined as follows. Consider the case m - 0 first. Let Tt
-- Z
6ri2Uilti'
0"1 ~ ' ' "
-~ ~7r > O'r-t-1 :
"''-"
f i n ( = 0),
(3.43)
i--1
be the spectral decomposition of \Sigma. From the discussion in Section 3.1.2, we see that x cannot deviate in the orientations u_{r+1}, ..., u_n. Since {u_i} is an orthonormal basis of R^n, the vector x is expressed in the form

x = \sum_{i=1}^{n} x_i u_i, \quad x_i = (x, u_i).   (3.44)
It follows that the distribution is limited to the range R_\Sigma = {u_1, ..., u_r}_L of the covariance matrix \Sigma. The components x_1, ..., x_r have an r-dimensional Gaussian distribution with density

p(x_1, ..., x_r) = \frac{1}{\sqrt{(2\pi)^r \prod_{i=1}^{r} \sigma_i^2}} e^{-\sum_{i=1}^{r} x_i^2/2\sigma_i^2},   (3.45)

which can be rewritten in the form

p(x) = \frac{1}{\sqrt{(2\pi)^r |\Sigma|_+}} e^{-(x, \Sigma^- x)/2},   (3.46)

where |\Sigma|_+ = \prod_{i=1}^{r} \sigma_i^2 is the product of all positive eigenvalues of \Sigma. Eq. (3.46) defines a probability distribution only in the r-dimensional subspace R_\Sigma. Hence,

V[x] = \int_{R_\Sigma} x x^T p(x) dx = \Sigma.   (3.47)

The Gaussian distribution for m \ne 0 is defined by replacing x by x - m.
If the covariance matrix \Sigma has the spectral decomposition in the form of eq. (3.43), u_1 indicates the orientation of the most likely deviation (see Fig. 3.4); \sigma_1 is the standard deviation in that orientation. Hence, the probability that -\sigma_1 < (u_1, x - m) < \sigma_1 is about 68.27%. Let us call

{m + \sigma_1 u_1, m - \sigma_1 u_1}   (3.48)

the primary deviation pair, which indicates in which orientation the deviation is most likely to occur. The Gaussian distribution plays a special role in statistics for many reasons, of which the most important is the fact that if X_1, ..., X_N are independent samples from a distribution of mean zero and variance/covariance matrix V[X], the average \bar{X} = \sum_{\alpha=1}^{N} X_\alpha / N is asymptotically a Gaussian random variable of mean zero and variance/covariance matrix V[X]/N under a mild regularity condition. This fact is known as the central limit theorem^6. Other important properties of the Gaussian distribution include the following:

• If x is an n-vector Gaussian random variable of mean m and covariance matrix \Sigma, the m-vector y = Ax for an arbitrary mn-matrix A is also a Gaussian random variable of mean Am and covariance matrix A\Sigma A^T (see eqs. (3.14)).

• Each component of a vector Gaussian random variable x is independent and has the standard Gaussian distribution if and only if m = 0 and \Sigma = I.

• If each component of x is independent and has the standard Gaussian distribution, each component of the vector y = Ax is independent and has the standard Gaussian distribution if and only if A A^T = I.

Since the Gaussian distribution is defined over the entire n-dimensional space R^n or its linear subspace, the probability tails away infinitely. However, we can define a Gaussian distribution over an arbitrary manifold S \subset R^n if the distribution is sufficiently localized around one point \bar{x} \in S and hence the domain of the distribution can be identified with the tangent space T_{\bar{x}}(S) (see Section 3.2.2). Namely, the distribution can be regarded as locally Gaussian if it has a probability density of the form
p(x) = C e^{-(x - \bar{x}, \Sigma^-(x - \bar{x}))/2},   (3.49)
where C is the normalization constant. The mean \bar{x} and the covariance matrix \Sigma are assumed to satisfy the following relations:

\int_S p(x) dx = 1, \quad P_S^{\bar{x}} \int_S (x - \bar{x}) p(x) dx = 0, \quad P_S^{\bar{x}} \int_S (x - \bar{x})(x - \bar{x})^T p(x) dx \, P_S^{\bar{x}} = \Sigma.   (3.50)

Here, P_S^{\bar{x}} is the projection matrix onto the tangent space T_{\bar{x}}(S) at \bar{x}.

^6 We omit the exact statement of the theorem and the proof.
3.3.2 Moment generating functions and moments
The moment generating function of a scalar random variable x is defined by^7

\phi(\theta) = E[e^{\theta x}] = \sum_{k=0}^{\infty} \frac{E[x^k]}{k!} \theta^k.   (3.51)

If x is a Gaussian random variable of mean 0 and variance \sigma^2, its moment generating function has the following form:

\phi(\theta) = e^{\sigma^2 \theta^2/2} = \sum_{k=0}^{\infty} \frac{\sigma^{2k}}{2^k k!} \theta^{2k}.   (3.52)
k=0 Comparing this with eq. (3.51) term by term, we obtain the kth moment E[x k] in the following form:
E[xkl -
k!ak
k - 0, 2, 4,6,..., k - 1,3,5,7, ....
2k/2(k/2)! , 0,
(3.53)
The moment generating function of a vector random variable x is defined by^8

\phi(\theta) = E[e^{(x, \theta)}] = \sum_{k=0}^{\infty} \frac{1}{k!} E[(x, \theta)^k],   (3.54)

where the argument \theta is also an n-vector. If x is a Gaussian random variable of mean 0 and covariance matrix \Sigma, its moment generating function has the following form:

\phi(\theta) = e^{(\theta, \Sigma\theta)/2} = \sum_{k=0}^{\infty} \frac{1}{2^k k!} (\theta, \Sigma\theta)^k.   (3.55)
k=O Comparing eqs. (3.54) and (3.55), we obtain the expressions for the (multidimensional) moments E[xi~xi2"" xi.]. For example,
E[xi] - 0,
E[xixj] - ~ij,
E[xixjxkxl] -- ~ij~kl + ~ik~jl + ~il~jk,
E[xixjxk] -- O, E[xixjxkXlXm] = 0.
(3.56)
^7 The function \phi(i\theta) is called the characteristic function (i is the imaginary unit). It is simply the Fourier transform of the probability density p(x) of x.
^8 The function \phi(i\theta) is also called the characteristic function. It is the multidimensional Fourier transform of the probability density p(x) of x.
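The fourth-order moment formula in eq. (3.56) can be spot-checked by Monte Carlo (NumPy assumed; the covariance matrix and indices are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)

S = np.array([[2.0, 0.5],
              [0.5, 1.0]])            # covariance matrix Sigma
x = rng.multivariate_normal(np.zeros(2), S, size=400000)

i, j, k, l = 0, 1, 0, 1
mc = np.mean(x[:, i] * x[:, j] * x[:, k] * x[:, l])

# eq. (3.56): E[x_i x_j x_k x_l] = S_ij S_kl + S_ik S_jl + S_il S_jk.
exact = S[i, j]*S[k, l] + S[i, k]*S[j, l] + S[i, l]*S[j, k]
print(mc, exact)                      # exact value here is 2.5

# Odd-order moments vanish:
print(np.mean(x[:, 0] * x[:, 1] * x[:, 1]))   # near 0
```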
Fig. 3.6. \chi^2 distribution with r degrees of freedom.
3.3.3 \chi^2 distributions
If x_1, ..., x_r are independent samples from the standard Gaussian distribution, the distribution of

R = x_1^2 + \cdots + x_r^2   (3.57)

is called the \chi^2 distribution with r degrees of freedom. We call a random variable which has the \chi^2 distribution simply a \chi^2 variable. Its probability density is defined over [0, \infty) in the form

p_r(R) = \frac{1}{2^{r/2}\Gamma(r/2)} R^{r/2-1} e^{-R/2},   (3.58)

where \Gamma(n) = \int_0^\infty t^{n-1} e^{-t} dt is the Gamma function^9 (Fig. 3.6). The mean and the variance of this distribution are

E[R] = r, \quad V[R] = 2r.   (3.59)
The density p_r(R) takes its maximum at R = r - 2. The important facts concerning the \chi^2 distribution include the following:

• If R_1, ..., R_N are independent \chi^2 variables with r_1, ..., r_N degrees of freedom, respectively, the sum

R = R_1 + \cdots + R_N   (3.60)

is a \chi^2 variable with r_1 + \cdots + r_N degrees of freedom.

• If x is a Gaussian random variable of mean 0 and covariance matrix \Sigma of rank r, the quadratic form

R = (x, \Sigma^- x)   (3.61)

^9 \Gamma(n + 1) = n! and \Gamma(n + 1/2) = (2n)!\sqrt{\pi}/2^{2n}n! for nonnegative integers n.
is a \chi^2 variable with r degrees of freedom.

• The probability that a Gaussian random variable of mean 0 and covariance matrix \Sigma of rank r satisfies

(x, \Sigma^- x) \le 1   (3.62)

is \int_0^1 p_r(R) dR.^{10}

• If x_\alpha, \alpha = 1, ..., N, are independent Gaussian random variables, each having mean 0 and covariance matrix \Sigma_\alpha of rank r_\alpha, the sum

R = \sum_{\alpha=1}^{N} (x_\alpha, \Sigma_\alpha^- x_\alpha)   (3.63)

is a \chi^2 variable with \sum_{\alpha=1}^{N} r_\alpha degrees of freedom.

• Let an n-vector x and an m-vector y be Gaussian random variables of mean 0, and let \Sigma_x and \Sigma_y be their respective covariance matrices. Let n and r (\le n) be the ranks of \Sigma_x and \Sigma_y, respectively. If there exists an mn-matrix A such that y = Ax, the difference

R = (x, \Sigma_x^- x) - (y, \Sigma_y^- y)   (3.64)

is a \chi^2 variable with n - r degrees of freedom (Cochran's theorem^{11}).
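These properties are easy to verify with SciPy's chi2 distribution; the probabilities below reproduce the values listed in footnote 10:

```python
from scipy.stats import chi2

# eq. (3.59): mean r and variance 2r.
for r in (1, 2, 5, 10):
    print(r, chi2.mean(r), chi2.var(r))

# Probability that (x, Sigma^- x) <= 1 for rank r, i.e. P(R <= 1):
probs = [round(chi2.cdf(1.0, r), 4) for r in range(1, 6)]
print(probs)   # [0.6827, 0.3935, 0.1987, 0.0902, 0.0374]
```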
3.3.4 Mahalanobis distance and \chi^2 test
Let an n-vector x be a Gaussian random variable of mean 0 and covariance matrix \Sigma. If \Sigma is of full rank, we can define a norm^{12} of x by

\|x\|_\Sigma = \sqrt{(x, \Sigma^{-1} x)}.   (3.65)

Equidistant points from the origin in this norm have equal probability densities, and the probability density at x becomes smaller as \|x\|_\Sigma becomes larger. The value \|x\|_\Sigma is called the Mahalanobis distance of x from the origin. If x is randomly chosen, \|x\|_\Sigma^2 is a \chi^2 variable with n degrees of freedom. If \Sigma has rank r (< n), we can define a pseudo-norm^{13}

\|x\|_\Sigma = \sqrt{(x, \Sigma^- x)},   (3.66)

^{10} To four decimal digits, this equals 0.6827, 0.3935, 0.1987, 0.0902, 0.0374 for r = 1, 2, 3, 4, 5, respectively.
^{11} To be exact, this is a special case of Cochran's theorem.
^{12} For any positive definite symmetric matrix \Sigma, eq. (3.65) defines a norm in the strict mathematical sense described in Footnote 4 in Section 2.1.1.
^{13} This is not a norm in the strict mathematical sense because the triangle inequality (2.9) does not hold; see eq. (3.67).
Fig. 3.7. \chi^2 test with significance level a%.

which is also called the Mahalanobis distance. Since \|x\|_\Sigma = 0 for x \in N_\Sigma, eq. (3.66) defines a distance in the usual sense only in the range R_\Sigma of \Sigma; for x_1 \in R_\Sigma and x_2 \in N_\Sigma, we have
\|x_1 + x_2\|_\Sigma = \|x_1\|_\Sigma.   (3.67)

If x is randomly chosen, \|x\|_\Sigma^2 is a \chi^2 variable with r degrees of freedom.
The \chi^2 distribution provides a simple means to test hypotheses. In many problems, we can define a random variable R in the form of eq. (3.57), where each x_i may not have zero mean. The expectation of R becomes the smallest when all x_i have zero means. Suppose all x_i have zero means if and only if some condition is satisfied. This condition is regarded as a hypothesis and can be tested by observing a sampled value of R: the hypothesis is rejected if it is large to an inadmissible degree. An exact procedure is as follows. Let R be a sample from a \chi^2 distribution with r degrees of freedom under the hypothesis. The hypothesis is rejected with significance level a% (or with confidence level (100 - a)%) if it falls into the rejection region (\chi^2_{r,a}, \infty) and is regarded as acceptable^{14} otherwise (Fig. 3.7). The threshold value \chi^2_{r,a} is called the a% significance value of \chi^2 with r degrees of freedom and is defined in such a way that^{15}

\int_{\chi^2_{r,a}}^{\infty} p_r(R) dR = \frac{a}{100}.   (3.68)

Thus, the hypothesis is rejected with significance level a% if

R > \chi^2_{r,a}.   (3.69)

This procedure is called the \chi^2 test and frequently appears in many practical problems, in particular when least-squares optimization based on the Mahalanobis distance is used, since the residual of optimization is usually a \chi^2 variable if the noise is Gaussian (see the next section).

^{14} Note that we do not say that the hypothesis is accepted. Being acceptable means that there exists no evidence strong enough to reject it.
^{15} If r is large, say r > 30, the approximation \chi^2_{r,a} \approx (N_a + \sqrt{2r - 1})^2/2 holds, where the number N_a is defined in such a way that a standard Gaussian random variable falls in the interval (N_a, \infty) with probability a/100.
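The test procedure can be sketched with SciPy (the degrees of freedom, level, and sampled R below are made-up numbers); `chi2.isf` gives the a% significance value of eq. (3.68):

```python
from scipy.stats import chi2

r = 4          # degrees of freedom under the hypothesis
a = 5.0        # significance level in percent

# Threshold chi2_{r,a} with P(R > threshold) = a/100, eq. (3.68).
threshold = chi2.isf(a / 100.0, r)
print(round(threshold, 3))            # 9.488

# eq. (3.69): reject the hypothesis if the sampled R exceeds the threshold.
R = 11.2
print("rejected" if R > threshold else "acceptable")   # rejected
```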
Fig. 3.8. Modified \chi^2 distribution with r degrees of freedom.
If R is a \chi^2 variable with r degrees of freedom, the distribution of

s = \frac{R}{r}   (3.70)

is sometimes called the modified \chi^2 distribution^{16} with r degrees of freedom. The probability density of s is given by r p_r(rs), where p_r(R) is the \chi^2 probability density given by eq. (3.58) (Fig. 3.8). Eq. (3.59) implies that its expectation and variance are

E[s] = 1, \quad V[s] = \frac{2}{r}.   (3.71)

In terms of the modified \chi^2 variable s, the \chi^2 test given by eq. (3.69) can be rewritten as

s > \frac{\chi^2_{r,a}}{r}.   (3.72)

The \chi^2 test usually takes this form when the magnitude of the noise is estimated and compared with its presumed value, as will be shown in later chapters.
3.4 Statistical Estimation for Gaussian Models

3.4.1 Maximum likelihood estimation
Let x be an n-vector, and A an mn-matrix. Let \epsilon be an m-vector Gaussian random variable of mean 0 and covariance matrix \Sigma of full rank. Then

y = Ax + \epsilon   (3.73)

is an m-vector Gaussian random variable with mean Ax and covariance matrix \Sigma. Hence, the probability density of y is given by

p(y) = \frac{1}{\sqrt{(2\pi)^m |\Sigma|}} e^{-(y - Ax, \Sigma^{-1}(y - Ax))/2}.   (3.74)

^{16} This terminology is not widely used because the only difference from the \chi^2 distribution is scaling. However, this distribution plays an essential role in the problems we study in this book, as we will see later.
Consider the problem of estimating the parameter x from a sampled value y. Namely, we want to find a function \hat{x}(y) that gives an estimate of x for a given y. Such a function is called an estimator. Evidently, any value x' such that Ax' = 0 can be added to x. In order to remove this indeterminacy, we assume that x is constrained to be in N_A^\perp, the orthogonal complement of the null space of the matrix A. Maximum likelihood estimation seeks the value x that maximizes the probability density p(y), which is called the likelihood when viewed as a function of the observed value y. The problem reduces to minimizing the Mahalanobis distance \|y - Ax\|_\Sigma, i.e.,

J[x] = (y - Ax, \Sigma^{-1}(y - Ax)) \to \min   (3.75)

under the constraint x \in N_A^\perp. The solution, which is called the maximum likelihood estimator, is obtained in the following form (see eqs. (2.136) and (2.137)):

\hat{x} = (A^T \Sigma^{-1} A)^- A^T \Sigma^{-1} y.   (3.76)

Its expectation and covariance matrix are

E[\hat{x}] = x, \quad V[\hat{x}] = (A^T \Sigma^{-1} A)^-.   (3.77)

An estimator is unbiased if its expectation coincides with the true value. The first of eqs. (3.77) implies that the maximum likelihood estimator \hat{x} is unbiased. The residual J[\hat{x}] of the function J[x] given by eq. (3.75) can be written as follows (see eq. (2.138)):

J[\hat{x}] = (y, \Sigma^{-1} y) - (\hat{x}, A^T \Sigma^{-1} A \hat{x}).   (3.78)
This is a ~2 variable with n - m ~ degrees of freedom, where m ~ = r a n k A (see eq. (3.64)). If each component of e distributes independently and isotropically with the same root mean square e, the covariance matrix of E has the form V[e] = e2I (see eqs. (3.25)). Hence, eq. (3.75) reduces to the least-squares optimization I l Y - Axll 2 -4 min,
(3.79)
and the maximum likelihood estimator 5: is given as follows (see eq. (2.134)):
= A-y.
(3.S0)
Fig. 3.9. The point \hat{x} \in S that minimizes J[\bar{x}] is the tangent point of the equilikelihood surface to S.
3.4.2 Optimization with linear constraints
Let x be an n-vector Gaussian random variable with an unknown mean \bar{x} and a known covariance matrix \Sigma of full rank. Suppose the mean \bar{x} satisfies a linear constraint A\bar{x} = b. Consider the problem of estimating \bar{x} from a sampled value x. Since the probability density of x has the form

p(x) = \frac{1}{\sqrt{(2\pi)^n |\Sigma|}} e^{-(x - \bar{x}, \Sigma^{-1}(x - \bar{x}))/2},   (3.81)

the maximum likelihood estimator for \bar{x} is obtained by the minimization

J[\bar{x}] = (x - \bar{x}, \Sigma^{-1}(x - \bar{x})) \to \min   (3.82)

under the constraint A\bar{x} = b. In geometric terms, this problem can be interpreted as follows. The constraint A\bar{x} = b defines an affine subspace^{17} S in R^n. The minimization (3.82) means that \hat{x} is a point in S that has the shortest Mahalanobis distance \|x - \hat{x}\|_\Sigma from the sampled position x. Since the equilikelihood surface (the set of all \bar{x} for which J[\bar{x}] = constant) is an "ellipsoid" in R^n, the point \hat{x} that minimizes J[\bar{x}] in S is the "tangent point" of this ellipsoid to S; all other points in S should be outside that ellipsoid (Fig. 3.9). If we let \Delta x = x - \bar{x}, eq. (3.82) reduces to the minimization

(\Delta x, \Sigma^{-1} \Delta x) \to \min   (3.83)

under the constraint

A \Delta x = Ax - b.   (3.84)
If this constraint is satisfiable, the maximum likelihood estimator \hat{x} (= x - \Delta x) is obtained as follows (see eq. (2.140)):

\hat{x} = x - \Sigma A^T (A \Sigma A^T)^- (Ax - b).   (3.85)

^{17} An affine subspace of R^n is a subset of R^n obtained by translating a linear subspace in R^n.
Its expectation and covariance matrix are

E[x̂] = x̄,   V[x̂] = Σ − ΣAᵀ(AΣAᵀ)⁻AΣ.   (3.86)

Hence, the maximum likelihood estimator x̂ is unbiased. The residual J[x̂] of the function J[x̄] given by eq. (3.82) can be written as

J[x̂] = (Ax − b, (AΣAᵀ)⁻(Ax − b)).   (3.87)

This is a χ² variable with m′ degrees of freedom, where m′ = rank(AΣAᵀ) (= rank A) (see eq. (3.61)).
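Eqs. (3.85)-(3.87) can be checked numerically. A minimal sketch with NumPy (the constraint A, b, the covariance Σ, and the sample x are made up for the example):

```python
import numpy as np

# Illustrative setting: estimate the mean of a 3-vector Gaussian subject
# to one linear constraint A x-bar = b.
A = np.array([[1.0, 1.0, 0.0]])       # constraint: x1 + x2 = 1
b = np.array([1.0])
Sigma = np.diag([1.0, 2.0, 0.5])      # known covariance of full rank
x = np.array([0.3, 0.9, -0.2])        # sampled value

# Eq. (3.85): x-hat = x - Sigma A^T (A Sigma A^T)^- (A x - b).
M = np.linalg.pinv(A @ Sigma @ A.T)   # generalized inverse (A Sigma A^T)^-
x_hat = x - Sigma @ A.T @ M @ (A @ x - b)

# Eq. (3.86): covariance matrix of the estimator.
V_hat = Sigma - Sigma @ A.T @ M @ A @ Sigma

# Eq. (3.87): residual J[x-hat].
r = A @ x - b
J_res = float(r @ M @ r)
```

The residual computed from eq. (3.87) agrees with the value of J[x̄] at x̄ = x̂, and the estimator satisfies the constraint exactly.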
3.4.3 Maximum a posteriori probability estimation
Let x be an n-vector, and A an mn-matrix. Let y be an m-vector Gaussian random variable of mean ȳ and covariance matrix Σ_y of full rank. Consider the following linear model:

z = Ax + y.   (3.88)

We want to estimate the parameter x from a sampled value of the m-vector z. Suppose the parameter x has an a priori distribution with mean x̄ and covariance matrix Σ_x of full rank. The a priori probability density (or prior) of x is

p(x) = (1/√((2π)ⁿ|Σ_x|)) e^{−(x−x̄, Σ_x⁻¹(x−x̄))/2}.   (3.89)

The probability density of y is

p(y) = (1/√((2π)ᵐ|Σ_y|)) e^{−(y−ȳ, Σ_y⁻¹(y−ȳ))/2}.   (3.90)

For a particular value of x, the m-vector z defined by eq. (3.88) is a Gaussian random variable of mean Ax + ȳ and covariance matrix Σ_y. Hence, the conditional probability density p(z|x) of z conditioned on x is

p(z|x) = (1/√((2π)ᵐ|Σ_y|)) e^{−(z−Ax−ȳ, Σ_y⁻¹(z−Ax−ȳ))/2}.   (3.91)
The marginal probability density p(z) of z is defined by

p(z) = ∫ p(z|x)p(x)dx,   (3.92)

which is computed indirectly as follows. From eq. (3.88), the expectation and the covariance matrix of z are given by

E[z] = Ax̄ + ȳ,   V[z] = AΣ_xAᵀ + Σ_y.   (3.93)
Hence, the marginal probability density p(z) should have the form

p(z) = (1/√((2π)ᵐ|AΣ_xAᵀ + Σ_y|)) e^{−(z−Ax̄−ȳ, (AΣ_xAᵀ+Σ_y)⁻¹(z−Ax̄−ȳ))/2}.   (3.94)

The a posteriori probability density (or posterior) p(x|z) of x conditioned on z is defined by

p(x|z) = p(z|x)p(x)/p(z),   (3.95)

which is known as the Bayes formula. Maximum a posteriori probability estimation (often called Bayes estimation) seeks the value of x that maximizes the a posteriori probability density p(x|z). Maximum likelihood estimation is a special case obtained by setting p(x) = constant. If eqs. (3.89), (3.91), and (3.94) are substituted into eq. (3.95), the a posteriori probability density is obtained in the form

p(x|z) = √(|Σ_x⁻¹ + AᵀΣ_y⁻¹A|/(2π)ⁿ) e^{−(x−x̂, (Σ_x⁻¹+AᵀΣ_y⁻¹A)(x−x̂))/2},   (3.96)

where the vector x̂ is defined as follows (see the matrix inversion formula (2.22)):

x̂ = x̄ + (Σ_x⁻¹ + AᵀΣ_y⁻¹A)⁻¹AᵀΣ_y⁻¹(z − (Ax̄ + ȳ))
  = x̄ + Σ_xAᵀ(AΣ_xAᵀ + Σ_y)⁻¹(z − (Ax̄ + ȳ)).   (3.97)
Thus, the a posteriori probability density p(x|z) defines a Gaussian distribution of x. If we write its mean and covariance matrix as E[x|z] and V[x|z], respectively, we have

V[x|z] = (Σ_x⁻¹ + AᵀΣ_y⁻¹A)⁻¹,   (3.98)

E[x|z] = x̄ + V[x|z]AᵀΣ_y⁻¹(z − (Ax̄ + ȳ)).   (3.99)

Evidently, p(x|z) is maximized by x = x̂ (= E[x|z]) given by eq. (3.97). We also see that E[x̂] = x̄, i.e., the maximum a posteriori probability estimator x̂ is unbiased. The uncertainty of x that still remains after z is observed is described by the a posteriori covariance matrix V[x|z] given by eq. (3.98).

The marginal probability density p(z) in the denominator of the Bayes formula (3.95) does not involve x. Hence, maximizing p(x|z) is equivalent to maximizing p(z|x)p(x), which in turn is equivalent to maximizing log p(x) + log p(z|x). If eqs. (3.89) and (3.91) are substituted into this, the problem can be written in the following form:

J[x] = (x − x̄, Σ_x⁻¹(x − x̄)) + (z − Ax − ȳ, Σ_y⁻¹(z − Ax − ȳ)) → min.   (3.100)
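The two expressions for x̂ in eq. (3.97), related by the matrix inversion formula, can be verified numerically. A minimal sketch (the dimensions, matrices, and sampled value below are made up for the example):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative linear model z = A x + y (eq. (3.88)).
n, m = 3, 4
A = rng.standard_normal((m, n))
x_bar = rng.standard_normal(n)           # prior mean
y_bar = rng.standard_normal(m)           # noise mean
Sigma_x = np.eye(n) * 0.5                # prior covariance (full rank)
Sigma_y = np.eye(m) * 2.0                # noise covariance (full rank)
z = rng.standard_normal(m)               # observed value

innov = z - (A @ x_bar + y_bar)

# First form of eq. (3.97); V_post is also V[x|z] of eq. (3.98).
V_post = np.linalg.inv(np.linalg.inv(Sigma_x)
                       + A.T @ np.linalg.inv(Sigma_y) @ A)
x_hat1 = x_bar + V_post @ A.T @ np.linalg.inv(Sigma_y) @ innov

# Second form of eq. (3.97), via the matrix inversion formula.
x_hat2 = x_bar + Sigma_x @ A.T @ np.linalg.inv(
    A @ Sigma_x @ A.T + Sigma_y) @ innov
```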
Fig. 3.10. State transition of a linear dynamical system.

Hence, the problem is viewed as minimizing the square sum of the Mahalanobis distances:

J[x] = ||x − x̄||²_{Σ_x} + ||z − Ax − ȳ||²_{Σ_y} → min.   (3.101)

The residual J[x̂] can be written as

J[x̂] = (z − Ax̄ − ȳ, (AΣ_xAᵀ + Σ_y)⁻¹(z − Ax̄ − ȳ)).   (3.102)

This is a χ² variable with m degrees of freedom. The marginal probability density p(z) given by eq. (3.94) has the form

p(z) = constant × e^{−J[x̂]/2}.   (3.103)

3.4.4 Kalman filter
The Kalman filter is an iterative linear update procedure for maximum a posteriori probability estimation when the parameters to be estimated change with time in the form of a linear dynamical system. Let the n-vector x_α be the state vector at time α, and the L-vector y_α the observation vector at time α. The process of state transition and observation is described by

x_α = A_{α−1}x_{α−1} + B_{α−1}v_{α−1},   (3.104)

y_α = C_αx_α + w_α,   (3.105)

where A_{α−1}, B_{α−1}, and C_α are constant matrices (Fig. 3.10). Vectors v_α and w_α are assumed to be independent Gaussian random variables, and their expectations E[v_α] and E[w_α] and covariance matrices V[v_α] and V[w_α] are assumed to be known. Furthermore, the covariance matrix V[w_α] is assumed to be of full rank. With this setting, the Kalman filter computes the estimator x̂_α of the state vector x_α at time α and its covariance matrix V[x̂_α] by iterating maximum a posteriori probability estimation. The update rule is derived as follows.
Fig. 3.11. Kalman filter.

Assume that x_{α−1} is a Gaussian random variable of mean x̂_{α−1} and covariance matrix V[x̂_{α−1}]. Since eq. (3.104) is linear in x_{α−1} and v_{α−1}, the state vector x_α at time α is also a Gaussian random variable. Let x̄_α and V[x̄_α] be, respectively, the mean and the covariance matrix of that distribution. They are computed from eqs. (3.88) and (3.93) as follows (read z, A, x, and y in eqs. (3.93) as x_α, A_{α−1}, x_{α−1}, and B_{α−1}v_{α−1}, respectively):

x̄_α = A_{α−1}x̂_{α−1} + B_{α−1}E[v_{α−1}],   (3.106)

V[x̄_α] = A_{α−1}V[x̂_{α−1}]A_{α−1}ᵀ + B_{α−1}V[v_{α−1}]B_{α−1}ᵀ.   (3.107)
If the value y_α determined by eq. (3.105) is observed, the a posteriori probability distribution of x_α is also Gaussian. Let x̂_α and V[x̂_α] be, respectively, the mean and the covariance matrix of that distribution. They are computed as follows (read z, A, x, and y in eqs. (3.98) and (3.99) as y_α, C_α, x_α, and w_α, respectively):

x̂_α = x̄_α + V[x̂_α]C_αᵀV[w_α]⁻¹(y_α − (C_αx̄_α + E[w_α])),   (3.108)

V[x̂_α] = (V[x̄_α]⁻¹ + C_αᵀV[w_α]⁻¹C_α)⁻¹.   (3.109)

Eqs. (3.106)–(3.109) define the Kalman filter for computing x̂_α and V[x̂_α] from x̂_{α−1} and V[x̂_{α−1}]. Eq. (3.108) can also be written in the following form:

x̂_α = x̄_α + K_α(y_α − ȳ_α),   (3.110)

K_α = V[x̂_α]C_αᵀV[w_α]⁻¹,   ȳ_α = C_αx̄_α + E[w_α].   (3.111)
Since x̄_α and ȳ_α are the maximum likelihood estimators of x_α and y_α before the actual value y_α is observed, eq. (3.110) can be viewed as correcting the predicted value x̄_α by feeding back the difference between the actual observation y_α and its estimator ȳ_α (Fig. 3.11). In this sense, the matrix K_α is often referred to as the Kalman gain. The difference y_α − ȳ_α is independent of y_{α−1}, y_{α−2}, ..., and has mean 0; it is called the innovation of y_α.
In the above formulation, the Kalman filter is derived as maximum a posteriori probability estimation on the assumption that all the variables are Gaussian. However, the same Kalman filter can be obtained without assuming Gaussian distributions: if we adopt the criterion of minimum mean square estimation, we can obtain eqs. (3.106)–(3.109) by orthogonal projection¹⁸ of the state vector onto the affine subspace defined by the observation vector (we omit the details).
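Eqs. (3.106)–(3.109) translate directly into code. A minimal sketch for a toy system (the system matrices, noise levels, and initial values below are made up for the example): a slowly drifting 2-vector state of which only the first component is observed.

```python
import numpy as np

def kalman_step(x_hat, V_hat, A, B, C, Ev, Vv, Ew, Vw, y):
    """One update of eqs. (3.106)-(3.109): predict, then correct by y."""
    x_bar = A @ x_hat + B @ Ev                            # eq. (3.106)
    V_bar = A @ V_hat @ A.T + B @ Vv @ B.T                # eq. (3.107)
    V_new = np.linalg.inv(np.linalg.inv(V_bar)
                          + C.T @ np.linalg.inv(Vw) @ C)  # eq. (3.109)
    x_new = x_bar + V_new @ C.T @ np.linalg.inv(Vw) @ (
        y - (C @ x_bar + Ew))                             # eq. (3.108)
    return x_new, V_new

rng = np.random.default_rng(1)
A = np.eye(2); B = np.eye(2); C = np.array([[1.0, 0.0]])
Ev, Vv = np.zeros(2), 0.01 * np.eye(2)
Ew, Vw = np.zeros(1), np.array([[1.0]])

x_true = np.array([1.0, -1.0])
x_hat, V_hat = np.zeros(2), 10.0 * np.eye(2)   # vague initial estimate
for _ in range(50):
    x_true = A @ x_true + rng.multivariate_normal(Ev, Vv)
    y = C @ x_true + rng.multivariate_normal(Ew, Vw)
    x_hat, V_hat = kalman_step(x_hat, V_hat, A, B, C, Ev, Vv, Ew, Vw, y)
```

After the iterations, the variance of the observed component of the state has shrunk well below its initial value, while the unobserved component retains its uncertainty, as expected from eq. (3.109).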
3.5 General Statistical Estimation

3.5.1 Score and Fisher information matrix
We now study the problem of statistical estimation in the general case where the distribution of the data or noise is not necessarily Gaussian and the statistical model is not necessarily linear. In abstract terms, the problem is stated as follows. Let x be a random variable that has a probability density p(x; θ) parameterized by θ. The problem is estimating the parameter θ by observing random samples from that distribution. We assume that x is an n-vector constrained to be in an n′-dimensional manifold X ⊂ ℛⁿ and the parameter θ is an m-vector constrained to be in an m′-dimensional manifold S ⊂ ℛᵐ. The probability density p(x; θ) is assumed to be continuous and continuously differentiable with respect to both x and θ an arbitrary number of times. We also assume that p(x; θ) > 0 for all x ∈ X. Furthermore, differentiation ∇_θ(·) (= (∂(·)/∂θ₁, ..., ∂(·)/∂θ_m)ᵀ) with respect to θ and integration ∫ dx with respect to x are assumed to be interchangeable for any expression of p(x; θ) (as long as the integration exists).

The probability density p(x; θ) is not defined for θ ∉ S. For the convenience of analysis, however, we extend it to θ ∉ S in such a way that p(x; θ + Δθ) = p(x; θ) + O(Δθ)² for all θ ∈ S and Δθ ∈ T_θ(S)^⊥. Here, T_θ(S) is the tangent space to the manifold S at θ (see Section 3.2.1); O(Δθ)² denotes terms of degree 2 or higher in the components of Δθ. In intuitive terms, the probability density is "constant in the normal direction" to S in ℛᵐ. This assumption implies the identity

∇_θ p(x; θ) ∈ T_θ(S).   (3.112)

Define an m-vector random variable l by

l = ∇_θ log p(x; θ).   (3.113)
This vector is called the score of x with respect to the parameter θ. Since

¹⁸Orthogonality is defined here in the statistical sense of having no correlation.
p(x; θ) is a probability density, the normalization condition

∫_X p(x; θ)dx = 1   (3.114)
holds for any θ ∈ S. It follows that if θ is perturbed into θ + Δθ in such a way that θ + Δθ ∈ S, the first variation of the left-hand side of eq. (3.114) must be 0. The constraint θ + Δθ ∈ S requires Δθ ∈ T_θ(S) to a first approximation. If we use the logarithmic differentiation formula

∇_θ p(x; θ) = p(x; θ)∇_θ log p(x; θ),   (3.115)

the first variation of the left-hand side of eq. (3.114) is

∫_X p(x; θ)(∇_θ log p(x; θ), Δθ)dx = (E[l], Δθ),   (3.116)
where we have invoked the assumption that differentiation and integration are interchangeable. Since eq. (3.116) must vanish for an arbitrary Δθ ∈ T_θ(S), we conclude that E[l] ∈ T_θ(S)^⊥. However, eq. (3.112) implies that E[l] ∈ T_θ(S). It follows that the score l is a random variable of mean 0:

E[l] = 0.   (3.117)

The Fisher information matrix is defined by

J = E[llᵀ].   (3.118)
Taking the expectation of the identity

∂² log p/∂θᵢ∂θⱼ = −(∂ log p/∂θᵢ)(∂ log p/∂θⱼ) + (1/p)(∂²p/∂θᵢ∂θⱼ),   (3.119)

and noting that differentiation and integration are interchangeable, we obtain

∫_X (∂² log p/∂θᵢ∂θⱼ) p dx = −∫_X lᵢlⱼ p dx + (∂²/∂θᵢ∂θⱼ) ∫_X p dx = −E[lᵢlⱼ].   (3.120)

Hence, if an mm-matrix L is defined by

L = −∇²_θ log p(x; θ),   (3.121)

where ∇²_θ(·) denotes the matrix whose (ij) element is ∂²(·)/∂θᵢ∂θⱼ, the Fisher information matrix is expressed in the following form:

J = E[L].   (3.122)
Since l ∈ T_θ(S), we have

P_θ^S J = J P_θ^S = J,   (3.123)

where P_θ^S is the projection matrix onto T_θ(S). Hence, the rank of J is at most m′. We say that the distribution p(x; θ) is regular if J has rank m′, which means that l can take all orientations in T_θ(S) if x is appropriately chosen. In this book, we consider only regular distributions. Since the range of J coincides with T_θ(S) for a regular distribution, the following identity holds:

J J⁻ = J⁻ J = P_θ^S.   (3.124)

If the distribution is locally Gaussian and has the probability density given by eq. (3.49), the score l and the Fisher information matrix J have the following form:

l = Σ⁻(x − x̄),   L = J = Σ⁻.   (3.125)
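For a full-rank Gaussian, the relations E[l] = 0, E[llᵀ] = E[L] = Σ⁻¹ (eqs. (3.117), (3.118), (3.122), and (3.125)) are easy to verify by Monte Carlo. A minimal sketch (the covariance Σ and mean θ below are made up for the example):

```python
import numpy as np

rng = np.random.default_rng(0)

# Gaussian p(x; theta) with unknown mean theta and known covariance Sigma:
# score l = Sigma^{-1}(x - theta), and L = -grad^2 log p = Sigma^{-1},
# which here is constant in x (cf. eq. (3.125) in the full-rank case).
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
Sigma_inv = np.linalg.inv(Sigma)
theta = np.array([1.0, -2.0])

xs = rng.multivariate_normal(theta, Sigma, size=200_000)
scores = (xs - theta) @ Sigma_inv.T       # one score vector per sample

mean_score = scores.mean(axis=0)          # should be ~ 0 (eq. (3.117))
J_emp = scores.T @ scores / len(scores)   # ~ E[l l^T] (eq. (3.118))
```

The empirical second moment J_emp approaches the Fisher information matrix Σ⁻¹ as the number of samples grows.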
3.5.2 Unbiased estimator and Cramer-Rao lower bound
Let x ∈ X be a sample from a distribution which has a probability density p(x; θ) parameterized by θ ∈ S. Let θ̂(x) be an estimator of θ, i.e., a function of x that returns an estimate of θ. The estimator θ̂(x) is assumed to satisfy the constraint on θ: θ̂(x) ∈ S for any x ∈ X. The estimator θ̂(x) is unbiased if¹⁹

P_θ^S ∫_X (θ̂(x) − θ)p(x; θ)dx = 0,   (3.126)

where P_θ^S is the projection matrix onto T_θ(S) (see eq. (3.29)). The covariance matrix of the estimator θ̂(x) is defined as follows (see eq. (3.30)):

V[θ̂] = P_θ^S ∫_X (θ̂(x) − θ)(θ̂(x) − θ)ᵀ p(x; θ)dx P_θ^S.   (3.127)
Since eq. (3.126) holds identically for any θ ∈ S, the first variation of the left-hand side of eq. (3.126) must be 0 if θ is perturbed into θ + Δθ for Δθ ∈ T_θ(S). If we use the logarithmic differentiation formula (3.115) and interchange differentiation and integration, the first variation of the left-hand side of eq. (3.126) can be written as

= −Δθ + E[P_θ^S θ̂(x) lᵀ]Δθ,   (3.128)

¹⁹It may seem that unbiasedness could be defined by E[θ̂] = θ, but E[θ̂] may be outside S if S is "curved" (see Section 3.2.2).
where we have employed the approximation P_{θ+Δθ}^S E[θ̂ − θ] ≈ 0. The exact expression for P_{θ+Δθ}^S E[θ̂ − θ] involves the second fundamental form of the manifold S at θ. Roughly speaking, it has the order of σ²Δθ/R², where σ is the "standard deviation" of θ̂ in S and R is the "radius of curvature" of S (we omit the exact analysis). Here, we simply assume that the manifold S is sufficiently "flat" in the domain within which the distribution of θ̂ is localized. Throughout this book, this is always assumed whenever we talk about local distributions.

Eq. (3.128) must vanish for an arbitrary Δθ ∈ T_θ(S). Since E[P_θ^S θ lᵀ] = P_θ^S θ E[l]ᵀ = O, this condition is written as

E[P_θ^S (θ̂ − θ) lᵀ]Δθ = Δθ,   Δθ ∈ T_θ(S).   (3.129)

If we let Δθ = P_θ^S Δε, then Δθ ∈ T_θ(S) for an arbitrary Δε ∈ ℛᵐ. Hence, the identity

E[P_θ^S (θ̂ − θ)(P_θ^S l)ᵀ]Δε = P_θ^S Δε   (3.130)

must hold identically for Δε ∈ ℛᵐ. Since l ∈ T_θ(S) and hence P_θ^S l = l, eq. (3.130) implies

E[P_θ^S (θ̂ − θ) lᵀ] = P_θ^S.   (3.131)
Combining this with eqs. (3.127) and (3.118), we obtain the following relationship for the stacked vector of P_θ^S(θ̂ − θ) and l:

E[ ( P_θ^S(θ̂ − θ) )
   ( l            ) ( (P_θ^S(θ̂ − θ))ᵀ   lᵀ ) ]  =  ( V[θ̂]    P_θ^S )
                                                   ( P_θ^S   J     ).   (3.132)

Since the left-hand side is positive semi-definite, the following matrix is also positive semi-definite (see Section 2.2.3):

( P_θ^S   −J⁻ ) ( V[θ̂]    P_θ^S ) ( P_θ^S )
                ( P_θ^S   J     ) ( −J⁻  )  =  V[θ̂] − P_θ^S J⁻ − J⁻ P_θ^S + J⁻ J J⁻  =  V[θ̂] − J⁻.   (3.133)

Here, we have assumed that the distribution is regular and hence eqs. (3.123) and (3.124) hold. Since J⁻ is positive semi-definite, the positive semi-definiteness of eq. (3.133) implies

V[θ̂] ≽ J⁻,   (3.134)

where A ≽ B means A − B is positive semi-definite. Eq. (3.134) is called the Cramer-Rao inequality and gives a lower bound, called the Cramer-Rao lower bound, on the covariance matrix V[θ̂] of an arbitrary unbiased estimator θ̂. If this bound is attained, the estimator θ̂ is said to be efficient.
Suppose N independent samples x₁, ..., x_N are observed from a distribution whose probability density is p(x; θ). Let θ̂(x₁, ..., x_N) be an estimator of θ. If we consider the direct sum

x̃ = x₁ ⊕ ··· ⊕ x_N,   (3.135)

the independence of x₁, ..., x_N implies that x̃ has the probability density

p̃(x̃; θ) = p(x₁; θ) ··· p(x_N; θ).   (3.136)
Since the estimator θ̂(x₁, ..., x_N) can be viewed as an estimator θ̂(x̃) of θ, the argument described earlier can be applied. The score of x̃ is

l̃ = ∇_θ log p̃(x̃; θ) = ∇_θ Σ_{α=1}^N log p(x_α; θ) = Σ_{α=1}^N l_α,   (3.137)

where l_α is the score of x_α. Since {x_α} are independent, the Fisher information matrix of x̃ is

J̃ = E[(Σ_{α=1}^N l_α)(Σ_{β=1}^N l_β)ᵀ] = Σ_{α,β=1}^N E[l_α l_βᵀ] = NJ,   (3.138)

where J is the Fisher information matrix for p(x; θ). Consequently, the Cramer-Rao lower bound is given in the following form:

V[θ̂] ≽ (1/N) J⁻.   (3.139)
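Eq. (3.139) can be illustrated numerically: for the mean of a scalar Gaussian with known variance σ², the Fisher information of one sample is J = 1/σ², and the sample mean, being unbiased, attains the bound σ²/N. A minimal sketch (the true mean, σ, N, and the number of trials are made up for the example):

```python
import numpy as np

rng = np.random.default_rng(0)

sigma, N, trials = 2.0, 50, 20_000
J = 1.0 / sigma**2              # Fisher information of one sample
cr_bound = 1.0 / (N * J)        # Cramer-Rao bound (eq. (3.139)) = sigma^2 / N

# The sample mean is an unbiased estimator of the true mean 3.0.
samples = rng.normal(3.0, sigma, size=(trials, N))
estimates = samples.mean(axis=1)
var_emp = estimates.var()       # empirical variance over many trials
```

The empirical variance of the estimator matches the bound σ²/N, confirming that the sample mean is efficient in this model.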
In particular, if θ is a scalar, we have

V[θ̂] ≥ 1/(N E[(∂ log p/∂θ)²]) = −1/(N E[∂² log p/∂θ²]).   (3.140)

3.6 Maximum Likelihood Estimation

3.6.1 Maximum likelihood estimator and the exponential family
Given a sample x ∈ X from a distribution which has a probability density p(x; θ) parameterized by θ ∈ S, the maximum likelihood estimator θ̂ is the value of θ that maximizes the likelihood p(x; θ), i.e., the probability density viewed as a function of θ after the sampled value is substituted. Hence, the maximum likelihood estimator θ̂ is the solution of the minimization

J = −2 log p(x; θ) → min.   (3.141)

The probability density p(x; θ) is assumed to have the properties described in the preceding section.
In order to distinguish variables from their true values, we regard θ as a variable and write its true value as θ̄. With the expectation that the maximum likelihood estimator θ̂ is close to θ̄, we write

θ̂ = θ̄ + Δθ.   (3.142)

The constraint θ̂ ∈ S requires Δθ ∈ T_θ̄(S) to a first approximation. Substituting eq. (3.142) into the function J in eq. (3.141) and expanding it in the neighborhood of θ̄, we obtain

J = −2 log p(x; θ̄) − 2(l̄, Δθ) + (Δθ, L̄Δθ) + O(Δθ)³,   (3.143)

where l is the score defined by eq. (3.113) and L is the matrix defined by eq. (3.121); the bar indicates that the value is evaluated at θ̄. The assumption (3.112) implies that

P_θ̄^S L̄ = L̄ P_θ̄^S = L̄,   (3.144)

where P_θ̄^S is the projection matrix onto T_θ̄(S). It follows that the rank of L̄ is at most the dimension m′ of the tangent space T_θ̄(S). Here, we assume that the rank of L̄ is exactly m′ so as to guarantee the unique existence of the value of θ ∈ S that minimizes J. Then, the range of L̄ coincides with T_θ̄(S). If the term O(Δθ)³ in eq. (3.143) is ignored, the function J is minimized under the constraint Δθ ∈ T_θ̄(S) by the following value (see eq. (2.137)):

Δθ = L̄⁻ l̄.   (3.145)

It follows that the maximum likelihood estimator is given by θ̂ = θ̄ + Δθ. From eqs. (3.117) and (3.144), we see that P_θ̄^S E[θ̂ − θ̄] = 0, which means that the maximum likelihood estimator is unbiased in the first order²⁰.

A probability distribution which has a probability density p(x; θ) parameterized by θ is said to belong to the exponential family if p(x; θ) can be expressed in terms of a vector function f(x) and scalar functions C(θ) and g(x) in the form²¹

p(x; θ) = C(θ) exp[(f(x), θ) + g(x)].   (3.146)

Many probability distributions that appear in practical applications have probability densities in this form, and the Gaussian distribution is a typical example²². For a distribution of the exponential family, the matrix L

²⁰The proviso "in the first order" means that the result is obtained by ignoring higher order terms in Δθ.

²¹If the exponent on the right-hand side of eq. (3.146) has the form (f(x), h(θ)) + g(x) for some vector function h(·), the distribution is said to belong to the curved exponential family. If h(θ) is taken as a new parameter, i.e., if η = h(θ) can be solved for θ in the form θ = t(η), the distribution belongs to the exponential family with parameter η.

²²Besides the Gaussian distribution, the exponential family includes such distributions as the Poisson distribution, the binomial distribution, the gamma distribution, the beta distribution, and the Poisson-gamma distribution.
defined by eq. (3.121) does not depend on x and hence is equal to the Fisher information matrix J (see eq. (3.122)). From eq. (3.145), we see that the covariance matrix V[θ̂] of the maximum likelihood estimator θ̂ is given as follows (see eqs. (3.127) and (3.144)):

V[θ̂] = L̄⁻ E[l̄ l̄ᵀ] L̄⁻ = J⁻ J J⁻ = J⁻.   (3.147)

This means that the Cramer-Rao lower bound is attained (see eq. (3.134)). Thus, the maximum likelihood estimator is efficient in the first order if the distribution belongs to the exponential family.
3.6.2 Asymptotic behavior
If N independent samples x₁, ..., x_N are observed from a distribution whose probability density is p(x; θ), the maximum likelihood estimator θ̂ is the value of θ that maximizes the likelihood p(x₁; θ) ··· p(x_N; θ). Namely, θ̂ is the solution of

J = −2 Σ_{α=1}^N log p(x_α; θ) → min.   (3.148)

In this case, eq. (3.145) is replaced by

Δθ = (Σ_{α=1}^N L̄_α)⁻ (Σ_{α=1}^N l̄_α),   (3.149)

where l̄_α and L̄_α are, respectively, the score l and the matrix L for x_α evaluated at θ̄. Hence, the covariance matrix V[θ̂] of the maximum likelihood estimator θ̂ = θ̄ + Δθ is written in the following form:
V[θ̂] = E[(Σ_{α=1}^N L̄_α)⁻ (Σ_{β=1}^N l̄_β)(Σ_{γ=1}^N l̄_γ)ᵀ (Σ_{δ=1}^N L̄_δ)⁻].   (3.150)

The matrices L̄₁, ..., L̄_N are random variables that belong to the same distribution, and their common expectation is the Fisher information matrix J. Hence, the following law of large numbers holds for a sufficiently large N (see eq. (3.9)):

(1/N) Σ_{α=1}^N L̄_α ≈ J.   (3.151)

From this and the independence of each x_α, we can write eq. (3.150) as follows (see eqs. (2.81)):

V[θ̂] ≈ (1/N²) Σ_{α,β=1}^N J⁻ E[l̄_α l̄_βᵀ] J⁻ = (1/N²) · N · J⁻ J J⁻ = (1/N) J⁻.   (3.152)
This means that the Cramer-Rao lower bound is attained in the asymptotic limit (see eq. (3.139)). Thus, the maximum likelihood estimator is asymptotically efficient²³. We also observe the following (we omit the details):

• An estimator θ̂ of θ is said to be consistent if θ̂ → θ̄ as N → ∞. Since eq. (3.152) implies that V[θ̂] ~ O(1/N), the maximum likelihood estimator is consistent.

• Since l̄₁, ..., l̄_N are independent random variables of mean 0 (see eq. (3.117)) and have the same distribution, the central limit theorem (see Section 3.3.1) states that Σ_{α=1}^N l̄_α/N is asymptotically Gaussian. It follows from eq. (3.149) that the maximum likelihood estimator is asymptotically a Gaussian random variable of mean θ̄ and covariance matrix J⁻/N.
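Consistency and asymptotic efficiency are easy to observe in simulation. For the exponential density p(x; θ) = θe^{−θx} (a member of the exponential family), the maximum likelihood estimator is θ̂ = 1/x̄ and the Fisher information is J = 1/θ², so V[θ̂] ≈ θ²/N. A minimal sketch (the true θ, N, and the number of trials are made up for the example):

```python
import numpy as np

rng = np.random.default_rng(0)

theta_true, N, trials = 1.5, 400, 20_000
# Score: d/dtheta log(theta e^{-theta x}) = 1/theta - x,
# so J = E[(1/theta - x)^2] = 1/theta^2.
J = 1.0 / theta_true**2

samples = rng.exponential(1.0 / theta_true, size=(trials, N))
theta_hats = 1.0 / samples.mean(axis=1)    # MLE for each trial

bias = theta_hats.mean() - theta_true      # O(1/N): vanishes asymptotically
var_emp = theta_hats.var()
cr_bound = 1.0 / (N * J)                   # = theta^2 / N (cf. eq. (3.152))
```

For this N, the empirical variance is already within a few percent of the asymptotic bound θ²/N, and the bias is of order 1/N.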
3.7 Akaike Information Criterion

3.7.1 Model selection
Suppose N data x₁, ..., x_N are observed. A statistical test is a procedure for judging if they can be regarded as independent random samples from a particular distribution with probability density p(x); the χ² test is a typical example (see Section 3.3.4). If the data are known to be independent random samples from a distribution with probability density p(x; θ) parameterized by θ, the procedure for determining the parameter θ that best explains the data is called statistical estimation; maximum likelihood estimation is a typical example (see Section 3.6.1). But how can we guess a parameterized probability density p(x; θ)? In other words, how can we judge whether the data can be regarded as independent random samples from a distribution with probability density p₁(x; θ), or with probability density p₂(x; θ), or with other probability densities? A parameterized probability density p(x; θ) is called a (statistical) model of the distribution. In order to select the best model, we need a criterion that measures the "goodness" of a particular model.

If we adopt a particular model p(x; θ) and apply maximum likelihood estimation, the parameter θ is determined by maximizing Π_{α=1}^N p(x_α; θ) or, equivalently, minimizing −2 Σ_{α=1}^N log p(x_α; θ). Let θ̂ be the resulting maximum likelihood estimator of θ. Let us call

J[{x_α}, θ̂] = −2 Σ_{α=1}^N log p(x_α; θ̂)   (3.153)
²³The effect of the neglected higher order terms in Δθ converges to 0 as N → ∞ under mild regularity conditions. Hence, the proviso "in the first order" can be dropped. We omit the details.
simply the residual. A good model is expected to have a large likelihood Π_{α=1}^N p(x_α; θ̂), and thereby a small residual J[{x_α}, θ̂]. Hence, the residual appears to be a good criterion for model selection. However, since θ̂ is determined so as to minimize the residual for the current data {x_α}, the residual can be made arbitrarily small, say, by assuming that x can take only the N values x₁, ..., x_N. Such an artificial model may explain the current data very well but may be unable to predict the data to be observed in the future. This observation leads to the following idea. Let x₁*, ..., x_N* be independent random samples to be observed in the future; they are assumed to have the same distribution as the current data x₁, ..., x_N. For a good model, the residual

J[{x_α*}, θ̂] = −2 Σ_{α=1}^N log p(x_α*; θ̂)   (3.154)

for the future data {x_α*} should be small. Since the future data {x_α*} and the maximum likelihood estimator θ̂, which is a function of the current data {x_α}, are both random variables, the above residual is also a random variable. In order to define a definitive value for the model, we take the expectation and consider

I = E*[E[J[{x_α*}, θ̂]]],   (3.155)

where E*[·] and E[·] denote expectation with respect to the future data {x_α*} and the current data {x_α}, respectively. We call I simply the expected residual and regard a model as better if its expected residual I is smaller.
3.7.2 Asymptotic expression for the expected residual
As in Sections 3.5 and 3.6, we assume that the data x₁, ..., x_N are n-vectors sampled from an n′-dimensional manifold X ⊂ ℛⁿ. The model parameter θ is assumed to be an m-vector constrained to be in an m′-dimensional manifold S ⊂ ℛᵐ. Hence, the model p(x; θ) has m′ degrees of freedom. We also assume that the model p(x; θ) contains the true probability density. Suppose the true model is p(x; θ̄), and let θ̂ be the maximum likelihood estimator of θ. Writing θ̂ = θ̄ + Δθ̂ and expanding log p(x_α*; θ̂) in the neighborhood of θ̄, we obtain

log p(x_α*; θ̂) = log p(x_α*; θ̄) + (l_α*, Δθ̂) − (1/2)(Δθ̂, L_α*Δθ̂) + O(Δθ̂)³,   (3.156)

where the score l_α* and the matrix L_α* are defined as follows (see eqs. (3.113) and (3.121)):

l_α* = ∇_θ log p(x_α*; θ̄),   L_α* = −∇²_θ log p(x_α*; θ̄).   (3.157)
Ignoring higher order terms and noting that Δθ̂ does not depend on {x_α*}, we obtain from eqs. (3.155) and (3.156)

I = Σ_{α=1}^N (−2E*[log p(x_α*; θ̄)] − 2(E*[l_α*], E[Δθ̂]) + E[(Δθ̂, E*[L_α*]Δθ̂)]).   (3.158)

Since {x_α*} and {x_α} have the same distribution, we have

E*[log p(x_α*; θ̄)] = E[log p(x_α; θ̄)].   (3.159)

Recall that

E*[l_α*] = 0,   E*[L_α*] = J,   (3.160)

where J is the Fisher information matrix (see eqs. (3.117), (3.118), and (3.122)). It follows that eq. (3.158) can be rewritten as

I = −2E[Σ_{α=1}^N log p(x_α; θ̄)] + E[N(Δθ̂, JΔθ̂)].   (3.161)
Expanding log p(x_α; θ̄) in the neighborhood of θ̂, we obtain

log p(x_α; θ̄) = log p(x_α; θ̂) − (l̂_α, Δθ̂) − (1/2)(Δθ̂, L̂_αΔθ̂) + O(Δθ̂)³,   (3.162)

where l̂_α and L̂_α are defined by

l̂_α = ∇_θ log p(x_α; θ̂),   L̂_α = −∇²_θ log p(x_α; θ̂).   (3.163)

If we put

L̄_α = −∇²_θ log p(x_α; θ̄),   (3.164)

we have

L̂_α = L̄_α + O(Δθ̂).   (3.165)
Substituting this into eq. (3.162) and summing over α = 1, ..., N, we obtain

Σ_{α=1}^N log p(x_α; θ̄) = Σ_{α=1}^N log p(x_α; θ̂) − (Σ_{α=1}^N l̂_α, Δθ̂) − (1/2)(Δθ̂, (Σ_{α=1}^N L̄_α)Δθ̂) + O(Δθ̂)³.   (3.166)

Since the maximum likelihood estimator θ̂ maximizes Σ_{α=1}^N log p(x_α; θ), we see that

(Σ_{α=1}^N l̂_α, Δθ̂) = (∇_θ Σ_{α=1}^N log p(x_α; θ̂), Δθ̂) = 0   (3.167)
for any admissible variation Δθ̂ ∈ T_θ̂(S). If we recall that eq. (3.151) holds for N ≈ ∞ (the law of large numbers) and ignore higher order terms in Δθ̂ in eq. (3.166), we have

Σ_{α=1}^N log p(x_α; θ̄) ≈ Σ_{α=1}^N log p(x_α; θ̂) − (N/2)(Δθ̂, JΔθ̂)   (3.168)

for N ≈ ∞. Substituting this into eq. (3.161), we obtain

I ≈ −2E[Σ_{α=1}^N log p(x_α; θ̂)] + 2N E[(Δθ̂, JΔθ̂)].   (3.169)
As shown in Section 3.6.2, the deviation Δθ̂ of the maximum likelihood estimator is asymptotically a Gaussian random variable of mean 0 and covariance matrix J⁻/N (the central limit theorem). We assume that the distribution is regular, so the Fisher information matrix J has rank m′ (see Section 3.5.1). It follows that (Δθ̂, (J⁻/N)⁻Δθ̂) = N(Δθ̂, JΔθ̂) is asymptotically a χ² variable of m′ degrees of freedom (see eq. (3.61)). Hence, we have

E[N(Δθ̂, JΔθ̂)] ≈ m′   (3.170)
for N ≈ ∞. Thus, eq. (3.169) can be expressed in the form

I ≈ E[J[{x_α}, θ̂]] + 2m′.   (3.171)

This means that if we define

AIC = J[{x_α}, θ̂] + 2m′,   (3.172)
this is an unbiased estimator of the expected residual I for N ≈ ∞. This estimator is called the Akaike information criterion, or AIC for short, and can be used as a measure of the goodness of the model; the predictive capacity of the model is expected to be large if its AIC is small. According to this criterion, a good model should not only have a small residual but at the same time a small degree of freedom; otherwise, one could define a model that fits the current data arbitrarily well by increasing the number of model parameters.
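Eq. (3.172) can be applied, for example, to choose between two nested Gaussian models: one with the mean fixed at 0 (m′ = 0) and one with the mean estimated by maximum likelihood (m′ = 1). A minimal sketch (the data-generating mean, σ, and sample size are made up for the example):

```python
import numpy as np

rng = np.random.default_rng(0)

def neg2_loglik(x, mu, sigma=1.0):
    """Residual J: -2 sum of log p for a Gaussian N(mu, sigma^2) model."""
    return float(np.sum((x - mu) ** 2 / sigma**2
                        + np.log(2 * np.pi * sigma**2)))

def aic(x, mu_hat, m_prime):
    # Eq. (3.172): AIC = J[{x_alpha}, theta-hat] + 2 m'.
    return neg2_loglik(x, mu_hat) + 2 * m_prime

# Data actually drawn with mean 2.0: the one-parameter model should win,
# since its smaller residual more than pays for the +2 penalty.
x = rng.normal(2.0, 1.0, size=100)
aic_fixed = aic(x, 0.0, m_prime=0)        # model "mu = 0", no free parameter
aic_free = aic(x, x.mean(), m_prime=1)    # mu estimated by maximum likelihood
```

When the true mean is far from 0, the drop in residual (here about N·x̄²) dwarfs the penalty 2m′, so AIC correctly selects the richer model; when the true mean is near 0, the penalty usually tips the choice to the simpler one.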