An identity for the Fisher information and Mahalanobis distance

Journal of Statistical Planning and Inference 138 (2008) 3950–3959

Abram Kagan a,∗, Bing Li b

a Department of Mathematics, University of Maryland, College Park, MD 20742, USA
b Department of Statistics, Pennsylvania State University, University Park, PA 16802, USA

ABSTRACT

Consider a mixture problem consisting of k classes. Suppose we observe an s-dimensional random vector X whose distribution is specified by the relations P(X ∈ A | Y = i) = P_i(A), where Y is an unobserved class identifier defined on {1, . . . , k} having distribution P(Y = i) = p_i. Assuming that the distributions P_i have a common covariance matrix, elegant identities are presented that connect the matrix of Fisher information in Y on the parameters p_1, . . . , p_k, the matrix of linear information in X, and the Mahalanobis distances between the pairs of P's. Since the parameters are not free, the information matrices are singular and the technique of generalized inverses is used. A matrix extension of the Mahalanobis distance and its invariant forms are introduced; these are of interest in their own right. In terms of parameter estimation, the results provide an upper bound, independent of the parameter, for the loss of accuracy incurred by estimating p_1, . . . , p_k from a sample of X's, as compared with the ideal estimator based on a random sample of Y's.

Article history: Received 16 January 2006; received in revised form 29 November 2007; accepted 19 February 2008; available online 28 March 2008.

Keywords: Categorical random variables; Mixture models; Moore–Penrose inverse

1. Introduction

In this paper we establish a fundamental identity that connects the Fisher information and the Mahalanobis distance in the context of mixture models. Mixture models are widely used in many contemporary statistical applications such as clustering, classification, and neural networks. The monographs by Lindsay (1995) and McLachlan and Basford (1988) explored both the theoretical and applied sides of this field.

Suppose that a categorical random variable Y has a distribution

P(Y = i) = p_i,  i = 1, . . . , k,

with π = (p_1, . . . , p_k)^T as its parameter. Here Y is not observed; what is observed is an s-dimensional random vector X related to Y via (known) conditional distributions

P_i(A) = P(X ∈ A | Y = i),  i = 1, . . . , k,  A ∈ B,     (1)

where B is the Borel σ-algebra in R^s. We call the random vector X a contamination of Y. This setup results in a mixture model for X whose distribution has the form

P(A) = P(X ∈ A) = p_1 P_1(A) + · · · + p_k P_k(A)

with the vector of mixing probabilities π as the model parameter.

Research supported in part by National Science Foundation Grants DMS-0405681 and DMS-0704621.
∗ Corresponding author. E-mail address: [email protected] (A. Kagan).



In Section 2, a simple and elegant identity is obtained for binary Y, in which case the vector π reduces to a scalar due to the relation p_1 + p_2 = 1. This identity leads to an upper bound for the loss of accuracy when estimating p from observations on X, as compared with the "ideal" estimator using hypothetical observations on Y. The upper bound is universal, in the sense that it is independent of the parameter. In Section 3, the identity is extended to general categorical Y. For this purpose we introduce the concept of the Mahalanobis distance matrix, which consists of the pairwise Mahalanobis distances between the P_i and P_j defined in (1). In Section 4 we derive an invariant form of the identity, which is independent of the parameterization of the free components of π. In Section 5 we obtain a canonical form of the invariant identity using Moore–Penrose inverses of singular matrices (see Rao and Mitra, 1971). In Section 6, we establish an inequality that provides a lower bound for the asymptotic variances of estimators of π derived from arbitrary linear estimating equations.

2. Binary Y

The main idea is best illustrated in the simplest case of a binary Y, with

P(Y = 1) = 1 − P(Y = 0) = p.     (2)

The Fisher information on p contained in Y is I_Y(p) = 1/[p(1 − p)]. Suppose, however, that Y is unobservable but an s-dimensional random vector X is observed that is related to Y by

P{X ∈ A | Y = i} = P_i(A),  i = 0, 1,

where A ∈ B, the Borel σ-algebra of sets in R^s. The distributions P_0 and P_1 may be interpreted as clusters associated with Y, and Y as the unobservable identifier of the clusters. The marginal distribution of X is

P{X ∈ A; p} = p P_1(A) + (1 − p) P_0(A),  A ∈ B.

The measure P(·; p) is absolutely continuous with respect to μ = P_0 + P_1. Denote its density with respect to μ by f(x; p) = p f_1(x) + (1 − p) f_0(x), where f_i = dP_i/dμ, i = 0, 1. The Fisher information on p contained in X is

I_X(p) = ∫ {f_1(x) − f_0(x)}^2 / {p f_1(x) + (1 − p) f_0(x)} dμ(x).     (3)

It is easy to see that I_X(p) ≤ I_Y(p), with equality holding if and only if μ(f_0 f_1 > 0) = 0, that is, if and only if P_0 and P_1 are mutually singular. To see this, write the integrand on the right-hand side of (3) as F(x), and partition the support of μ into

{x : f_0 f_1 > 0} ∪ {x : f_0 > 0, f_1 = 0 or f_0 = 0, f_1 > 0} ≡ A ∪ B,

where the argument x in f_0(x) and f_1(x) is suppressed. Then, on B, F(x) = f_1/p + f_0/(1 − p), while on A

F(x) < (f_1^2 + f_0^2)/{p f_1 + (1 − p) f_0} ≤ f_1/p + f_0/(1 − p).

As a result, since ∫ f_i dμ = 1 implies I_Y(p) = ∫ {f_1/p + f_0/(1 − p)} dμ,

I_X(p) − I_Y(p) = ∫ {F − f_1/p − f_0/(1 − p)} dμ = ∫_A {F − f_1/p − f_0/(1 − p)} dμ.

Because the integrand on the right-hand side is negative on A, the integral on the right is no greater than 0, and it equals 0 if and only if μ(A) = 0.

Let (X_1, . . . , X_n) be a sample from the population f(x; p). Then the likelihood equation

Σ_{j=1}^n {f_1(X_j) − f_0(X_j)} / {p f_1(X_j) + (1 − p) f_0(X_j)} = 0

has, under standard regularity conditions, a solution p̃_n such that

√n(p̃_n − p) → N{0, 1/I_X(p)} in distribution as n → ∞.
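As a quick illustration of (3), the Fisher information I_X(p) can be evaluated by numerical quadrature for any specific pair of clusters. The sketch below is not part of the paper; it assumes NumPy/SciPy and takes P_0 = N(0, 1), P_1 = N(d, 1) with an arbitrarily chosen separation d, and it compares I_X(p) with I_Y(p) = 1/[p(1 − p)]:

```python
# Numerical check (a sketch, not from the paper): for two unit-variance normal
# clusters P0 = N(0,1) and P1 = N(d,1), compute I_X(p) from (3) by quadrature
# and compare it with I_Y(p) = 1/(p(1-p)). The separation d is arbitrary.
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

def I_X(p, d=2.0):
    """Fisher information (3) for the mixture p*N(d,1) + (1-p)*N(0,1)."""
    def integrand(x):
        f1, f0 = norm.pdf(x, loc=d), norm.pdf(x, loc=0.0)
        return (f1 - f0) ** 2 / (p * f1 + (1 - p) * f0)
    val, _ = quad(integrand, -np.inf, np.inf)
    return val

p = 0.3
I_Y = 1.0 / (p * (1 - p))
print(I_X(p), "<=", I_Y)   # I_X(p) <= I_Y(p); equality only for mutually singular clusters
```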


On the other side, the maximum likelihood estimator based on an iid sample Y_1, . . . , Y_n from distribution (2), which is p̂_n = Ȳ, satisfies

√n(p̂_n − p) → N{0, p(1 − p)} in distribution as n → ∞.

Thus the loss of accuracy of p̃_n, as compared with p̂_n, is

1/I_X(p) − 1/I_Y(p) = 1/I_X(p) − p(1 − p),

which does depend on p.

Frequently, however, statisticians prefer to base inference only on the first two moments of X rather than on the whole likelihood, so that the inference drawn is robust against model misspecification. In this case the linear Fisher information becomes relevant. Suppose that X = (X^(1), . . . , X^(s))^T is an s-dimensional random vector whose distribution depends on a scalar parameter θ ∈ Θ, where Θ is an open set. Recall that the linear Fisher information Î_X(θ) is determined by the mean and covariance matrix of X:

E_θ(X) ≡ μ(θ),  var_θ(X) = E_θ{[X − μ(θ)][X − μ(θ)]^T} ≡ V(θ).

We will assume that V(θ) is positive definite. As usual, the superscript T denotes the transpose. Suppose for a moment that X has a density f(x; θ) with respect to a measure ν such that the score J(x; θ) = ∂ log f(x; θ)/∂θ exists and

E_θ(J^2) = ∫ (∂ log f/∂θ)^2 f dν < ∞.

Let

Ĵ = a_0(θ) + a_1(θ)X^(1) + · · · + a_s(θ)X^(s) = Σ_{r=1}^s c_r(θ)(X^(r) − μ_r(θ)) = c^T(θ)(X − μ(θ))     (4)

be the least squares approximation of J by linear combinations of 1, X^(1), . . . , X^(s) (in (4) we used the fact that E_θ(J) = 0, so that Ĵ is a linear combination of the centered X^(r), r = 1, . . . , s). Since E_θ(Ĵ) = 0 and E_θ(X^T Ĵ) = E_θ(X^T J) = (μ̇(θ))^T, the vector of the derivatives of μ(θ), one sees that c^T(θ) = (μ̇(θ))^T V^{-1}(θ) and

Ĵ(X; θ) = (μ̇(θ))^T V^{-1}(θ)(X − μ(θ)).

The last relation does not involve the density function f(x; θ) and may be taken as a linear (in X) estimating function for θ. As with the classical Fisher information, the linear information on θ in X is, by definition,

Î_X(θ) = E_θ|Ĵ(X; θ)|^2 = (μ̇(θ))^T (V(θ))^{-1} μ̇(θ).     (5)

As seen from (5), Î_X(θ) depends only on the mean vector and covariance matrix of X.

Turn now to the Mahalanobis distance D(P_0, P_1) between two probability distributions P_0, P_1 on the Borel σ-algebra in R^s with mean vectors and covariance matrices

μ_i = ∫ x dP_i(x),  V_i = ∫ (x − μ_i)(x − μ_i)^T dP_i(x),  i = 0, 1.     (6)

We assume that V_0 = V_1 = V. The Mahalanobis distance is defined by

D(P_0, P_1) = (μ_1 − μ_0)^T V^{-1} (μ_1 − μ_0).

See, for example, Rao (1973, p. 542). The following theorem connects three characteristics: the Fisher information on p in a binary response Y, I_Y(p) = 1/[p(1 − p)], the linear information on p in X, Î_X(p), and the Mahalanobis distance D(P_0, P_1).

Theorem 1. Suppose that the components of X have finite second moments under P_0 and P_1 with V_0 = V_1. Then

1/Î_X(p) = 1/I_Y(p) + 1/D(P_0, P_1).     (7)

Proof. Using the notation in (6) and the assumption V_0 = V_1 = V, we compute the marginal mean and covariance matrix of X to be

E(X) = μ_0 + p(μ_1 − μ_0),  var(X) = V + p(1 − p)(μ_1 − μ_0)(μ_1 − μ_0)^T.     (8)


Write Δ = μ_1 − μ_0 and λ = p(1 − p). Then, from (5) and (8), one has

Î_X(p) = Δ^T (V + λΔΔ^T)^{-1} Δ = Δ^T V^{-1/2} (I_s + λV^{-1/2}ΔΔ^T V^{-1/2})^{-1} V^{-1/2} Δ,

where I_s is the s × s identity matrix. To invert the matrix I_s + λV^{-1/2}ΔΔ^T V^{-1/2}, set

(I_s + λV^{-1/2}ΔΔ^T V^{-1/2})^{-1} = I_s + γV^{-1/2}ΔΔ^T V^{-1/2}

and find γ by multiplying both sides of the relation by I_s + λV^{-1/2}ΔΔ^T V^{-1/2}. Simple calculations give

γ = −λ/(1 + λΔ^T V^{-1}Δ) = −1/[I_Y(p) + D(P_0, P_1)].     (9)

Thus one has

Î_X(p) = Δ^T V^{-1/2} (I_s + γV^{-1/2}ΔΔ^T V^{-1/2}) V^{-1/2} Δ
       = Δ^T V^{-1}Δ + γ(Δ^T V^{-1}Δ)^2 = I_Y(p) D(P_0, P_1)/[I_Y(p) + D(P_0, P_1)].     (10)

Taking the reciprocals of both sides of (10) gives (7).  □
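Identity (7) is purely a statement about means and covariances, so it can be checked directly with linear algebra. The following sketch is illustrative only; the means, the common covariance V and p are arbitrary test values, and NumPy is assumed:

```python
# A small numerical check of identity (7) (a sketch; the two cluster means,
# the shared covariance V, and p below are arbitrary choices, not from the paper).
import numpy as np

rng = np.random.default_rng(0)
s, p = 3, 0.3
mu0, mu1 = rng.normal(size=s), rng.normal(size=s)
A = rng.normal(size=(s, s))
V = A @ A.T + s * np.eye(s)                      # common covariance, positive definite

delta = mu1 - mu0
D = delta @ np.linalg.solve(V, delta)            # Mahalanobis distance D(P0, P1)
I_Y = 1.0 / (p * (1 - p))                        # Fisher information in Y

# Linear information (5): mu(p) = mu0 + p*delta, mu_dot = delta,
# V(p) = V + p(1-p) delta delta^T, cf. (8)
Vp = V + p * (1 - p) * np.outer(delta, delta)
I_X_lin = delta @ np.linalg.solve(Vp, delta)

print(1.0 / I_X_lin, 1.0 / I_Y + 1.0 / D)        # the two sides of (7) coincide
```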

If (X_1, . . . , X_n) is a sample from the population f(x; p), the linear estimating equation

Σ_{i=1}^n Ĵ(X_i; p) = 0

becomes

(μ_1 − μ_0)^T V^{-1} Σ_{i=1}^n {X_i − μ_0 − p(μ_1 − μ_0)} = 0.     (11)

Eq. (11) has a unique solution p̃ that is linear in X̄,

p̃ = (μ_1 − μ_0)^T V^{-1} (X̄ − μ_0) / [(μ_1 − μ_0)^T V^{-1} (μ_1 − μ_0)].
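For concreteness, the closed-form solution of (11) is easy to evaluate on simulated data. The sketch below is not from the paper; it assumes NumPy and treats the cluster means μ_0, μ_1 and the common covariance V as known, with arbitrary test values:

```python
# Monte Carlo sketch of the linear estimator p_tilde from Eq. (11) for two
# Gaussian clusters with a shared covariance (all constants are arbitrary).
import numpy as np

rng = np.random.default_rng(5)
s, p_true, n = 2, 0.35, 5000
mu0, mu1 = np.zeros(s), np.array([2.0, 1.0])
V = np.array([[1.0, 0.3], [0.3, 1.0]])

Y = rng.random(n) < p_true                                  # latent class labels
X = np.where(Y[:, None], mu1, mu0) + rng.multivariate_normal(np.zeros(s), V, size=n)

delta = mu1 - mu0
w = np.linalg.solve(V, delta)
p_tilde = w @ (X.mean(axis=0) - mu0) / (w @ delta)          # closed form below Eq. (11)
print(p_tilde)                                              # close to p_true for large n
```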

It can be shown that √n(p̃ − p) → N(0, 1/Î_X(p)) in distribution as n → ∞. Thus the asymptotic loss in accuracy of p̃ compared with p̂, the maximum likelihood estimator of p based on (Y_1, . . . , Y_n), is

1/Î_X(p) − 1/I_Y(p) = 1/D(P_0, P_1).

What is interesting is that the loss depends only on the Mahalanobis distance between P_0 and P_1, and not on p. Since I_X(p) ≥ Î_X(p), identity (7) immediately implies the inequality

1/I_X(p) − 1/I_Y(p) ≤ 1/D(P_0, P_1).

Similarly, if one uses more informative estimating equations for p than a linear estimating function in X (e.g., a polynomial of degree > 1), then the corresponding information will exceed Î_X(p), and 1/D(P_0, P_1) will remain an upper bound for the asymptotic loss in accuracy of estimating p by these estimating equations versus the maximum likelihood estimator from observations of Y. Thus, Theorem 1 provides a universal upper bound (independent of p) on the asymptotic loss in accuracy for a large class of estimators generated by estimating functions in X. It is worth noticing that in many interesting setups the distributions P_0 and P_1 differ by a shift, in which case their covariance matrices are the same.

3. General categorical Y

Now let Y take k different values, say {1, . . . , k}, with probabilities

P(Y = i) = p_i,  i = 1, . . . , k.


Due to the relation p_1 + · · · + p_k = 1, there are only k − 1 free parameters, say p_1, . . . , p_{k−1}, and we set p = (p_1, . . . , p_{k−1})^T. The (k − 1) × (k − 1) matrix of Fisher information on p in Y is

I_Y(p) = {diag(p)}^{-1} + 1_{k−1} 1_{k−1}^T / p_k,     (12)

where 1_{k−1} is the (k − 1)-dimensional vector with all its entries equal to 1, and diag(p) is the diagonal matrix with diagonal entries p_1, . . . , p_{k−1}. Let now X be a random vector in R^s with s ≥ k − 1 whose conditional distribution given Y = i is

P(X ∈ A | Y = i) = P_i(A),  i = 1, . . . , k,

so that the marginal distribution of X is a mixture

P_p = p_1 P_1 + · · · + p_{k−1} P_{k−1} + (1 − p_1 − · · · − p_{k−1}) P_k

with p = (p_1, . . . , p_{k−1})^T as a parameter. Let P = {P_1, . . . , P_k} be the set of probability measures on the Borel σ-algebra of R^s with mean vectors μ_1, . . . , μ_k and common covariance matrix V, assumed nonsingular. Fix j ∈ {1, . . . , k} and set

Δ_l = μ_l − μ_j,  Λ_j = (Δ_1, . . . , Δ_{j−1}, Δ_{j+1}, . . . , Δ_k).

Note that Δ_l is an s-dimensional vector and Λ_j an s × (k − 1) matrix.

Definition 1. We call the (k − 1) × (k − 1) matrix

M_j(P) = Λ_j^T V^{-1} Λ_j     (13)

a Mahalanobis distance matrix of P.

The concept of a Mahalanobis distance matrix involves a fixed j. Direct calculations show that for any i, j there exists a nonsingular (k − 1) × (k − 1) matrix C_{ij} such that M_i(P) = C_{ij}^T M_j(P) C_{ij}. Thus the Mahalanobis matrices constructed from the same P are congruent. The Mahalanobis distance matrix M_j(P) is a measure of the scattering of the elements of P with respect to each other.

Turn now to the matrix of linear Fisher information on p in X. Set

μ(p) = E_p(X),  V(p) = var_p(X),

and denote by μ̇(p) the s × (k − 1) Jacobian matrix of μ(p), i.e., the jth column of μ̇(p) is ∂μ(p)/∂p_j, j = 1, . . . , k − 1. We assume μ̇(p) to have full column rank. Similarly to (5), the linear Fisher information on p in X is in this case the (k − 1) × (k − 1) matrix

Î_X(p) = μ̇(p)^T V^{-1}(p) μ̇(p).

The next theorem states a relation between I_Y(p), Î_X(p), and the Mahalanobis distance matrix M_k(P) defined in (13). Note that the special role of M_k(P) is due to the choice of p_1, . . . , p_{k−1} as free parameters.

Theorem 2. The following relation holds:

Î_X^{-1}(p) = I_Y^{-1}(p) + M_k^{-1}(P).     (14)

Proof. To simplify notation, set Λ_k = Λ. By direct computation, the marginal mean and covariance matrix of X are

μ(p) = μ_k + Λp,  V(p) = V + Λ(diag(p) − pp^T)Λ^T,     (15)

which mimic their special forms in (8). Using the relations diag(p)1_{k−1} = p and 1_{k−1}^T p = 1 − p_k, it is easy to verify that the matrix diag(p) − pp^T in (15) is the inverse of I_Y(p) defined in (12). Hence

Î_X(p) = Λ^T (V + Λ I_Y^{-1}(p) Λ^T)^{-1} Λ = Λ^T V^{-1/2} (I_s + V^{-1/2} Λ I_Y^{-1}(p) Λ^T V^{-1/2})^{-1} V^{-1/2} Λ.     (16)


Following the argument that led to (9), we find

(I_s + V^{-1/2} Λ I_Y^{-1}(p) Λ^T V^{-1/2})^{-1} = I_s − V^{-1/2} Λ [I_Y(p) + Λ^T V^{-1} Λ]^{-1} Λ^T V^{-1/2}.

Substituting this into (16), we obtain

Î_X(p) = Λ^T V^{-1}Λ − Λ^T V^{-1}Λ (I_Y(p) + Λ^T V^{-1}Λ)^{-1} Λ^T V^{-1}Λ
       = Λ^T V^{-1}Λ − Λ^T V^{-1}Λ (I_Y(p) + Λ^T V^{-1}Λ)^{-1} (Λ^T V^{-1}Λ + I_Y(p) − I_Y(p))
       = Λ^T V^{-1}Λ (I_Y(p) + Λ^T V^{-1}Λ)^{-1} I_Y(p).

Inverting the matrices on both sides of the equation gives (14).  □
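Identity (14) can be verified numerically from (12), (13) and (15). The sketch below is illustrative only; k, the means, V and p are arbitrary test values, and NumPy is assumed:

```python
# Numerical check of identity (14) for k = 3 classes (a sketch; the means,
# the common covariance V, and the probabilities are arbitrary test values).
import numpy as np

rng = np.random.default_rng(1)
k, s = 3, 4
mus = rng.normal(size=(k, s))                      # mu_1, ..., mu_k
A = rng.normal(size=(s, s)); V = A @ A.T + s * np.eye(s)
p = np.array([0.2, 0.5]); pk = 1 - p.sum()         # free parameters p_1, p_2

I_Y = np.diag(1 / p) + np.ones((k - 1, k - 1)) / pk          # (12)
Lam = (mus[:k - 1] - mus[k - 1]).T                           # s x (k-1), columns mu_i - mu_k
M_k = Lam.T @ np.linalg.solve(V, Lam)                        # (13), Mahalanobis distance matrix

Vp = V + Lam @ (np.diag(p) - np.outer(p, p)) @ Lam.T         # (15)
I_X_hat = Lam.T @ np.linalg.solve(Vp, Lam)                   # linear information, mu_dot = Lam

lhs = np.linalg.inv(I_X_hat)
rhs = np.linalg.inv(I_Y) + np.linalg.inv(M_k)
print(np.allclose(lhs, rhs))                                 # True: identity (14)
```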

4. Invariant form of the main identity

Identity (14) depends on the chosen parameterization. For example, if, instead of (p_1, . . . , p_{k−1}), one chooses (p_2, . . . , p_k) as free parameters, then M_k(P) in (14) needs to be replaced by M_1(P). In this section identity (14) will be modified so that it is invariant with respect to the parameterization. The idea is to work directly with the asymptotic variance of the maximum likelihood estimators rather than with the information.

Set π = (p_1, . . . , p_k)^T = (p^T, p_k)^T. Let Y_1, . . . , Y_n be n independent observations on Y. The maximum likelihood estimator of π is given by

p̂_i = n^{-1} #{Y_j : Y_j = i},  i = 1, . . . , k.

By the central limit theorem,

√n(π̂ − π) → N(0, Σ_Y(π)) in distribution,  where Σ_Y(π) = diag(π) − ππ^T.

An advantage of Σ_Y(π) over I_Y^{-1}(p) is that the former is a function of π and does not depend on the choice of free parameters. To find an invariant version of Î_X^{-1}(p), we again begin with the asymptotic variance matrix. Let p̃ = (p̃_1, . . . , p̃_{k−1})^T be the solution of the system of linear estimating equations

n^{-1} Σ_{i=1}^n (μ̇(p))^T V^{-1}(p)(X_i − μ(p)) = 0.

Then

√n(p̃ − p) → N(0, Î_X^{-1}(p)) in distribution.

Let 0_{k−1} be the (k − 1)-dimensional vector whose entries are all equal to 0. Then

π = (0_{k−1}^T, 1)^T + (I_{k−1}, −1_{k−1})^T p,

and π̃ is defined from p̃ in the same way. On setting H = (μ_1, . . . , μ_k) and C_0 = (I_{k−1}, −1_{k−1})^T, we easily see that

μ(p) = Hπ,  μ̇(p) = HC_0,  V(p) = V + HC_0(diag(p) − pp^T)C_0^T H^T = V + H(diag(π) − ππ^T)H^T,

and that, as n → ∞,

√n(π̃ − π) → N(0, Σ_X(π, C_0)) in distribution,

where

Σ_X(π, C_0) = C_0 {C_0^T H^T [V + H(diag(π) − ππ^T)H^T]^{-1} H C_0}^{-1} C_0^T.     (17)

The asymptotic covariance matrix in (17) depends only on π and not on the choice of free parameters. The above matrix C_0 is a column contrast matrix, i.e., each of its columns adds up to zero. The next proposition shows that the covariance matrix in (17) remains the same if C_0 is replaced with any k × (k − 1) column contrast matrix of full column rank.

Proposition 1. Let C be any k × (k − 1) full rank column contrast matrix. Then

Σ_X(π, C) = Σ_X(π, C_0).     (18)

Proof. Any full column rank k × (k − 1) column contrast matrix C can be written as

C = (A; −1_{k−1}^T A) = C_0 A,

with the top (k − 1) × (k − 1) block equal to A and the last row equal to −1_{k−1}^T A, for some (k − 1) × (k − 1) matrix A of full rank. Substituting CA^{-1} for C_0 in (17) leads to (18).  □
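Proposition 1 is also easy to confirm numerically: formula (17) gives the same matrix for any full-rank column contrast matrix C. The following sketch is illustrative only; H, V and π are arbitrary test values, and NumPy is assumed:

```python
# Sketch of Proposition 1: the asymptotic covariance (17) is unchanged when C_0
# is replaced by any full-rank k x (k-1) column contrast matrix C. All inputs
# (means, V, pi) are arbitrary test values, not taken from the paper.
import numpy as np

rng = np.random.default_rng(2)
k, s = 4, 5
H = rng.normal(size=(s, k))                         # columns mu_1, ..., mu_k
A = rng.normal(size=(s, s)); V = A @ A.T + s * np.eye(s)
pi = rng.dirichlet(np.ones(k))

def sigma_X(C):
    Vp = V + H @ (np.diag(pi) - np.outer(pi, pi)) @ H.T
    inner = C.T @ H.T @ np.linalg.solve(Vp, H @ C)
    return C @ np.linalg.inv(inner) @ C.T           # formula (17) with C in place of C_0

C0 = np.vstack([np.eye(k - 1), -np.ones((1, k - 1))])
B = rng.normal(size=(k - 1, k - 1))                 # nonsingular with probability 1
C = C0 @ B                                          # another full-rank contrast matrix
print(np.allclose(sigma_X(C0), sigma_X(C)))         # True
```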


Let us now see what (18) means. Suppose that instead of (p_1, . . . , p_{k−1}), p^T = (p_2, . . . , p_k) are chosen as free parameters. Then μ̇(p) = HC_1 with C_1 = (−1_{k−1}, I_{k−1})^T, a column contrast matrix of full column rank. The asymptotic variance of √n(π̃ − π) will take the form Σ_X(π, C_1), which, by Proposition 1, is the same as Σ_X(π, C_0). Since Σ_X(π, C) does not depend on the choice of free parameters, we shall simply write

Σ_X(π, C) ≡ Σ_X(π).

The same arguments work for I_Y(p). By computation it is easy to verify that

I_Y(p) = [diag(p)]^{-1} + 1_{k−1} 1_{k−1}^T / p_k = C_0^T [diag(π)]^{-1} C_0.

Repeating the arguments that led to Σ_X(π) results in

Σ_Y(π) = C_0 [C_0^T (diag(π))^{-1} C_0]^{-1} C_0^T.

As with Σ_X(π), this matrix also remains unchanged if one replaces C_0 with any k × (k − 1) column contrast matrix C of full column rank.

Now we are going to define an invariant version of the Mahalanobis matrices for the set of probability measures P = {P_1, . . . , P_k} with mean vectors μ_1, . . . , μ_k and a common, nonsingular covariance matrix V. As in the invariant modification of Î_X and I_Y, it is easier to work with the inverse of a Mahalanobis distance matrix. Note that M_k(P) can be re-expressed as M_k(P) = C_0^T H^T V^{-1} H C_0, which is not invariant. We now define an invariant form of its inverse, which we call the Mahalanobis concentration matrix.

Definition 2. Let C be a k × (k − 1) column contrast matrix of full column rank. The Mahalanobis concentration matrix for P = {P_1, . . . , P_k} is the k × k matrix

K(P) = K = C(C^T H^T V^{-1} H C)^{-1} C^T.

Similarly to Σ_X(π) and Σ_Y(π), one can see that K is constant on the set of all k × (k − 1) contrast matrices of full column rank. Since K is singular, we define the invariant Mahalanobis distance matrix as a generalized inverse of the Mahalanobis concentration matrix. The following definition adopts a convenient version of such an inverse.

Definition 3. The invariant Mahalanobis distance matrix for P = {P_1, . . . , P_k} is

M(P) = M = (H^T V^{-1} H) C (C^T H^T V^{-1} H C)^{-1} C^T (H^T V^{-1} H).

It is easy to check that this matrix, too, is the same for all k × (k − 1) contrast matrices C of full column rank and that the following identities hold:

MKM = M,  KMK = K.

In other words, K = M^− and M = K^−, where A^− denotes a generalized inverse of A (see, e.g., Rao and Mitra, 1971). We are now ready to state an invariant version of Theorem 2.

Theorem 3. The following relation holds:

Σ_X(π) = Σ_Y(π) + K.     (19)

Proof. On multiplying (14) by C_0 from the left and by C_0^T from the right and taking into account that

C_0 Î_X^{-1}(p) C_0^T = Σ_X(π),  C_0 I_Y^{-1}(p) C_0^T = Σ_Y(π),

one gets

Σ_X(π) = Σ_Y(π) + C_0 M_k^{-1}(P) C_0^T.

Furthermore,

C_0 M_k^{-1}(P) C_0^T = C_0 (Λ^T V^{-1} Λ)^{-1} C_0^T = C_0 (C_0^T H^T V^{-1} H C_0)^{-1} C_0^T = K,

which completes the proof.  □
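The objects in Definitions 2 and 3 and identity (19) can be checked in a few lines. The sketch below is illustrative only; H, V and π are arbitrary test values, and NumPy is assumed:

```python
# Sketch checking Definitions 2-3 and identity (19) numerically; H, V, pi are
# arbitrary test values (not from the paper), with k = 4 clusters in R^5.
import numpy as np

rng = np.random.default_rng(3)
k, s = 4, 5
H = rng.normal(size=(s, k))
A = rng.normal(size=(s, s)); V = A @ A.T + s * np.eye(s)
pi = rng.dirichlet(np.ones(k))

C0 = np.vstack([np.eye(k - 1), -np.ones((1, k - 1))])        # a contrast matrix
G = H.T @ np.linalg.solve(V, H)                              # H^T V^{-1} H
K = C0 @ np.linalg.inv(C0.T @ G @ C0) @ C0.T                 # Mahalanobis concentration matrix
M = G @ K @ G                                                # invariant Mahalanobis distance matrix

print(np.allclose(M @ K @ M, M), np.allclose(K @ M @ K, K))  # generalized-inverse relations

# Identity (19): Sigma_X(pi) = Sigma_Y(pi) + K
Vp = V + H @ (np.diag(pi) - np.outer(pi, pi)) @ H.T
Sigma_X = C0 @ np.linalg.inv(C0.T @ H.T @ np.linalg.solve(Vp, H @ C0)) @ C0.T
Sigma_Y = np.diag(pi) - np.outer(pi, pi)                     # CLT covariance of the multinomial MLE
print(np.allclose(Sigma_X, Sigma_Y + K))                     # True
```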


5. Canonical form of the main identity

Although the matrices in Theorem 3 are invariant with respect to C, they still rely on C for their explicit expressions. In this section we derive a completely coordinate-free representation of identity (19). We call it the canonical form because it is the most natural among the possible representations. The canonical representation is based on a special form of generalized inverse called the Moore–Penrose generalized inverse (see Rao and Rao, 2000, Chapter 8).

Let Ω be a symmetric matrix with spectral decomposition Ω = ΓΛΓ^T, where the columns of Γ are the eigenvectors of Ω and Λ is a diagonal matrix whose diagonal elements are the eigenvalues of Ω arranged in the same order as the columns of Γ. Denote by Λ† the matrix obtained from Λ by replacing all its nonzero elements with their reciprocals. The Moore–Penrose generalized inverse of Ω is by definition

Ω† = ΓΛ†Γ^T.

The Moore–Penrose inverse is unique for any symmetric Ω. Alternatively, Ω† is the unique matrix that satisfies the relations

ΩΩ†Ω = Ω,  Ω†ΩΩ† = Ω†,  Ω†Ω = (Ω†Ω)^T,  ΩΩ† = (ΩΩ†)^T,     (20)

which are sometimes taken as the definition of the Moore–Penrose generalized inverse. Note in passing that (Ω†)† = Ω.

All three matrices in (19), Σ_X(π), Σ_Y(π) and M^− = K, can be written as C(C^T A C)^{-1} C^T, where C is a k × (k − 1) column contrast matrix of full column rank and A a symmetric matrix. The next lemma gives the general form of the Moore–Penrose inverse for such matrices. The subspace of all contrasts (i.e., vectors whose components sum to zero) is the orthogonal complement of the (one-dimensional) subspace generated by the vector 1_k. If P is the projection matrix onto the subspace of contrasts, then P = I_k − Q, where Q is the projection matrix onto the vector 1_k, i.e., Q = 1_k(1_k^T 1_k)^{-1} 1_k^T = 1_k 1_k^T / k.

Lemma 1. If C is a k × (k − 1) column contrast matrix of full column rank and A a full-rank symmetric matrix, then

[C(C^T A C)^{-1} C^T]† = PAP.

Proof. Because C is of full column rank, P is the projection onto the column space of C. Therefore P = C(C^T C)^{-1} C^T. It is then easy to verify that the matrices C(C^T A C)^{-1} C^T and PAP satisfy the relations in (20).  □

Now set

μ̄ = k^{-1} Σ_{i=1}^k μ_i,  H_0 = (μ_1 − μ̄, . . . , μ_k − μ̄),  τ = (1/π_1, . . . , 1/π_k)^T.

The following corollary is an immediate consequence of Lemma 1.

Corollary 1. The Moore–Penrose inverses of Σ_Y(π), Σ_X(π) and K are

Σ_Y†(π) = diag(τ) − τ1_k^T/k − 1_kτ^T/k + (τ^T 1_k / k^2) 1_k 1_k^T,
Σ_X†(π) = H_0^T (V + H(diag(π) − ππ^T)H^T)^{-1} H_0,
K† = H_0^T V^{-1} H_0.

We write these inverses as I_Y(π), Î_X(π) and M, respectively. The matrix M in Corollary 1 is a very natural extension of the Mahalanobis distance: it summarizes all the distances (standardized by V) of the μ_i's from their center μ̄. Note that this matrix is invariant with respect to the parameterization. Similarly, I_Y(π) and Î_X(π) are natural versions of the information that do not require any specification of the free parameters. Now we can rewrite the main identity explicitly in terms of the Moore–Penrose inverses.

Theorem 4. The following identity holds:

Î_X†(π) = I_Y†(π) + M†.
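Lemma 1, Corollary 1 and Theorem 4 can likewise be confirmed numerically with the Moore–Penrose inverse. The sketch below is illustrative only; H, V and π are arbitrary test values, NumPy is assumed, and numpy.linalg.pinv plays the role of the Moore–Penrose inverse:

```python
# Sketch of Lemma 1, Corollary 1 and Theorem 4 (arbitrary test values for H, V, pi).
import numpy as np

rng = np.random.default_rng(6)
k, s = 4, 5
H = rng.normal(size=(s, k))
A = rng.normal(size=(s, s)); V = A @ A.T + s * np.eye(s)
pi = rng.dirichlet(np.ones(k))

P = np.eye(k) - np.ones((k, k)) / k                 # projection onto the contrast subspace
H0 = H @ P                                          # columns mu_i - mu_bar
Vp = V + H @ (np.diag(pi) - np.outer(pi, pi)) @ H.T

I_Y = P @ np.diag(1 / pi) @ P                       # Sigma_Y^dagger(pi), by Lemma 1
I_X = H0.T @ np.linalg.solve(Vp, H0)                # Sigma_X^dagger(pi)
M   = H0.T @ np.linalg.solve(V, H0)                 # K^dagger

# Theorem 4: I_X^dagger = I_Y^dagger + M^dagger (Moore-Penrose inverses)
lhs = np.linalg.pinv(I_X)
rhs = np.linalg.pinv(I_Y) + np.linalg.pinv(M)
print(np.allclose(lhs, rhs))                        # True
```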


6. Relation to estimating equations

Let X_1, . . . , X_n be independent observations of X. To begin with, choose p = (p_1, . . . , p_{k−1})^T as free parameters, and estimate p by solving the linear estimating equation

Σ_{i=1}^n G(p)(X_i − μ(p)) = 0,     (21)

where G(p) is a (k − 1) × s matrix. This is a special case of the estimating equations introduced in Godambe (1960). Under some regularity conditions (which include differentiability of G(p) and μ(p); that is why free parameters are needed) there is a solution p̃_G of (21) such that √n(p̃_G − p) → N(0, Σ_X(p; G)) in distribution as n → ∞. Here

Σ_X(p; G) = [G(p)μ̇(p)]^{-1} G(p)V(p)G^T(p) [G(p)μ̇(p)]^{-T},

where A^{-T} stands for (A^T)^{-1} = (A^{-1})^T. The inverse of Σ_X(p; G) is the information matrix associated with the estimating function G(p)(X − μ(p)); we denote it by Î_X(p; G). Using the multivariate Cauchy–Schwarz inequality, one can show that for any G,

Î_X(p; G) ≤ Î_X(p; μ̇^T(p)V^{-1}(p)) = μ̇^T(p)V^{-1}(p)μ̇(p) = Î_X(p).

To see this, abbreviate G(p) by G, μ̇(p) by μ̇ and V(p) by V, and note that

var{[(Gμ̇)^{-1}G − (μ̇^T V^{-1}μ̇)^{-1}μ̇^T V^{-1}]X} = (Gμ̇)^{-1}GVG^T(Gμ̇)^{-T} − (μ̇^T V^{-1}μ̇)^{-1}.

Because the left-hand side is positive semi-definite, we have

(Gμ̇)^{-1}GVG^T(Gμ̇)^{-T} ≥ (μ̇^T V^{-1}μ̇)^{-1},  or  Î_X^{-1}(p; G) ≥ Î_X^{-1}(p),     (22)

where, as usual, A ≥ B means that the matrix A − B is positive semi-definite. By (14),

Î_X^{-1}(p; G) ≥ I_Y^{-1}(p) + M_k^{-1}(P).

The last inequality connects the Mahalanobis matrices with estimating equations.

Turn now to the invariant form. Let

π̃_G = (0_{k−1}^T, 1)^T + C_0 p̃_G

be the estimator of π obtained from the estimating equation (21). The arguments used in Section 4 show that π̃_G is the same for all choices of k − 1 free parameters from p_1, . . . , p_k. The asymptotic covariance matrix of √n(π̃_G − π) is

Σ_X(π; G) = C_0 (G(p)HC_0)^{-1}(G(p)V(p)G^T(p))(G(p)HC_0)^{-T} C_0^T
          = C (G(p)HC)^{-1}(G(p)V(p)G^T(p))(G(p)HC)^{-T} C^T,

where C is any k × (k − 1) contrast matrix of full column rank. This is a manifestation of the fact that any choice of free parameters leads, via the estimating equation (21), to the same estimator of π. Multiplying both sides of (22) by C_0 from the left and by C_0^T from the right gives

Σ_X(π; G) = C_0 Î_X^{-1}(p; G) C_0^T ≥ C_0 Î_X^{-1}(p) C_0^T = Σ_X(π).

Taking the Moore–Penrose inverse twice gives

[C_0 Î_X^{-1}(p; G) C_0^T]†† ≥ [C_0 Î_X^{-1}(p) C_0^T]†† = Î_X†(π).

Defining the (invariant) information Î_X(π; G) on π contained in the estimating function G(p)(X − μ(p)) as the Moore–Penrose inverse of Σ_X(π; G), the arguments of this section prove the following theorem.
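Inequality (22), on which Theorem 5 rests, states that an arbitrary linear estimating function G(p)(X − μ(p)) cannot beat the optimal linear one. The sketch below is illustrative only; k, the means, V, p and G are arbitrary test values, and NumPy is assumed:

```python
# Sketch of inequality (22): the information from an arbitrary linear estimating
# function G(p)(X - mu(p)) never exceeds the linear information I_X_hat(p).
# All inputs below (k, s, means, V, p, G) are arbitrary test values.
import numpy as np

rng = np.random.default_rng(4)
k, s = 3, 4
mus = rng.normal(size=(k, s))
A = rng.normal(size=(s, s)); V = A @ A.T + s * np.eye(s)
p = np.array([0.3, 0.4])

Lam = (mus[:k - 1] - mus[k - 1]).T                           # mu_dot(p), s x (k-1)
Vp = V + Lam @ (np.diag(p) - np.outer(p, p)) @ Lam.T         # V(p)

G = rng.normal(size=(k - 1, s))                              # an arbitrary (k-1) x s matrix
GL = G @ Lam
Sigma_G = np.linalg.inv(GL) @ G @ Vp @ G.T @ np.linalg.inv(GL).T   # Sigma_X(p; G)
Sigma_opt = np.linalg.inv(Lam.T @ np.linalg.solve(Vp, Lam))        # I_X_hat(p)^{-1}

# (22): Sigma_G - Sigma_opt is positive semi-definite
eigvals = np.linalg.eigvalsh(Sigma_G - Sigma_opt)
print(eigvals.min() >= -1e-9)                                # True
```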


Theorem 5. For any (k − 1) × s matrix G(p),

Î_X†(π; G) ≥ I_Y†(π) + M†.

If one interprets the Mahalanobis distance matrix as an information measure, the inequality becomes similar in form to Stam's (1959) classical inequality for the Fisher information on a location parameter: for independent X, Y,

I_{X+Y}^{-1} ≥ I_X^{-1} + I_Y^{-1}.

Theorem 5 provides a universal (i.e., independent of the parameter) lower bound for the loss in accuracy in estimating the mixture parameters by estimators generated by linear estimating functions in X versus the maximum likelihood estimator from observations of Y. The bound is in terms of the Mahalanobis (concentration) matrix for the cluster distributions, in complete agreement with intuition: the better the clusters are separated, the more accurately we expect to estimate the proportion of each cluster, regardless of the proportions themselves.

Acknowledgments

The authors are thankful to two referees and an Associate Editor for many comments, all of which were used in the revision.

References

Godambe, V.P., 1960. An optimum property of regular maximum likelihood estimation. Ann. Math. Statist. 31, 1208–1211.
Lindsay, B.G., 1995. Mixture Models: Theory, Geometry and Applications. Institute of Mathematical Statistics, Hayward.
McLachlan, G.J., Basford, K.E., 1988. Mixture Models: Inference and Applications to Clustering. Marcel Dekker, New York.
Rao, C.R., 1973. Linear Statistical Inference and its Applications, second ed. Wiley, New York.
Rao, C.R., Mitra, S.K., 1971. Generalized Inverse of Matrices and its Applications. Wiley, New York.
Rao, C.R., Rao, M.B., 2000. Matrix Algebra and its Applications to Statistics and Econometrics. World Scientific Publishing, Singapore.
Stam, A.J., 1959. Some inequalities satisfied by the quantities of information of Fisher and Shannon. Inform. Control 2, 101–112.