Estimation, principal components and hamiltonian systems

Estimation, principal components and hamiltonian systems

Systems & Control North-Holland Letters 6 (1985) July 1985 103-108 Estimation, principal components and Hamiltonian systems Anthony BLOCH * Div...

476KB Sizes 4 Downloads 67 Views

Systems & Control North-Holland

Letters

6 (1985)

July 1985

103-108

Estimation, principal components and Hamiltonian systems Anthony

BLOCH *

Division of Applied MA 02138, USA

Received Revised

Sciences,

Harvard

University,

Cambridge,

POI-

8 January 1985 15 April 1985

In this paper we show how symplectic geometry can be used to analyse a fundamental problem in estimation theory - the fitting of lines and planes to a data set in n dimensions. Symplectic methods yield information not only on the critical point structure of the Maximum Likelihood Estimation (MLE) function (see Byrnes and Willems [7]), but also on the asymmetries in the distribution of the data points in n-space.

Abstracf:

Total least squares estimation, Linear system identification, Principal components, Hamiltonian systems, Symplectic geometry, Convexity, Complete integrability, Morse the-

Keywords:

ory.

1. Introduction The statistical problem to be considered here is that of finding a d-plane of best fit to a data set in n real or complex dimensions. Now, in contrast to the standard least squares problem where we assume we have a number of input variables which are known exactly and a number of output variables which we can measure with some error, we assume here that both inputs and outputs are measured with error. Indeed, in this case we need not distinguish between inputs and outputs and we may simply write our data points as Xi=2i+&i,

Zi = true value,

where we assume (for convenience) that the components of ei are identically distributed random variables. Then the least squares estimate (and, in the case of normal errors, the maximum likelihood estimate) of a d-plane fitted to the data is given by minimizing the total perpendicular distance of the points from the plane. * Research

supported

0167-6911/85/$3.30

in part by NSF Grant 0 1985, Elsevier

This problem, known as the Total Least Squares (TLS) problem, has received a certain amount of attention in the literature of late (see [12]) though the original concept goes back to Karl Pearson

No. ECS-81-21428.

Science

Publishers

We remark that the analysis here corresponds to the identification problem for a linear system without dynamics - for example, a network of resistors. Further, in the spirit of the framework for identification proposed by Willems (see [26]), it does not postulate an a priori input/output structure. As Willems points out, such a generalization of the standard input/output model is often desirable - in electrical network problems or econometric analysis for example, where causality is frequently unclear. It should be noted that in the case d = 1 the TLS line corresponds to the first principal component in a Principal Component Analysis (see [18]) of the data - with a slight shift in statistical perception: in this case the xi should be regarded as samples from a multivariate elliptical (usually normal) distribution. Now, given the above statistical problem, natural questions to ask are: (a) What is the structure of the minimum of the TLS estimation function? Is it unique? Is it connected? (b) What is the general critical point structure of the estimation function? (c) What is the principal component structure of the data? Remarkably, the methods of symplectic geometry yield information on all these (interrelated) questions. The application of geometrical methods to estimation theory should be regarded as having been pioneered by Fisher [ll], and was later developed by Hunurbar [14], Rao [22], and more recently Efron [lo] and Amari 121. The approach in these cases is to employ the methods of Riemannian geometry based on the Fisher information metric. The application of symplectic geometry here depends on two basic developments in the subject

B.V. (North-Holland)

103

Volume

6. Number

SYSTEMS

2

& CONTROL

that have occurred in the past ten years or so: the theory of convexity associated with the moment map (see Atiyah [4], Kostant [16], Duistermaat and Heckeman [8]), and the theory of completely integrable Hamiltonian systems (see Adler and Van Moerbeke [l], Moser [17] and Mumford [19]). Both these developments have roots which go back much earlier (see [24], [15]). The basic point is that symplectic geometry can be used to prove mathematical results outside the realm of classical (or quantum) mechanics. The applicability of symplectic methods to the problem at hand was first noticed by Byrnes and Willems (see [7], to which the reader should refer).

July 1985

LETTERS

Now let h(n) be the Hermitian matrices of dimension n, u(n) the Lie algebra of the unitary group U(n), and G&d, n) the complex Grassmannian of d-planes in n-space. Note that Q, C E h(n). Observe now that P may be regarded as representing a point in Gc (n - d, n), that is, an n - d plane in n-space. Furthermore, P E h(n) may be identified with an element of u(n) via the mapping P*iP. Such a P may then be regarded as an element of an adjoint orbit of U(n), of rank n - d matrices. (Recall that the adjoint action of U(n) on POE u(n) maps P,, to the set of matrices UP,U-‘, UE U(n).)

Thus we observe, rather remarkably, that H is the restriction of a linear functional to Gc( d, n) viewed as an adjoint orbit of U(n).

Section 2

We will now set up the estimation problem. Note. All our calculations will be over Q=” unless

otherwise specified. For reduction to the real case, see [7]. Let ej, j= l,..., for Q=” and let xi=CXijej,

i=l,...,

n, be an orthonormal

basis

p,

Section 3

Let us now consider the critical manifold structure of H. Diagonalizing C we find H(P)

= i

i

be p points. Each Aij is measured with observation error eij: hij=Xij+Eij where we take eij to be N(O,l), independent. Let Q denote the orthogonal projection of Q=” onto a d-plane. Let P = I - Q. Then the total perpendicular distance of the p points onto the d-plane is given by H(P)=TrCP=TrC(I-Q) (=TrC-H,(Q)whereH,(Q)=TrCQ) where C has entries ckj = CiXijXik (C = h*X where X is the data matrix, entries hi,). The TLS estimate of P is then given by minimizing H(P). By the Gauss-Markov theorem this is also the MLE estimate. Then Hmin

=

ClEij12

where eij = Aij - h, = Aij - ( Qxi) j. 104

JXi12pi

i-l

where ]hi] are the singular values of A, and pi are the diagonal entries of P. Moreover, the set of all pi is precisely equal to the convex set C(d,

n)=

(P,,

P~,-..,P,,)ER”:

i

O
ipi=n-d.

1

i-l

I

The proof of this assertion can be found in Byrnes and Willems [7] and follows from the classical Schur-Horn theorem [ 131. Thus, finding the critical manifolds of H corresponds to a linear programming problem. Take the case first of generic C. Then it is a purely combinatorial observation to see that there are (a) critical points of H. In the case d = 1 there are n critical points and one can see that these correspond to the components of a principal component analysis of the data - the critical values of Q pick out the various principal components.

Volume

6, Number

2

SYSTEMS

& CONTROL

In the case of non-generic C (the occurrence of multiple eigenvalues), the critical set contains critical manifolds which are, in general, cross products of complex Grassmannians and hence connected. For a precise formula for the manifold of local minima of H, see Byrnes and Willems [7]. The structure of the critical manifolds may be derived via a combinatorial analysis of the convex set discussed above, but in fact connectivity of the critical manifolds and indeed convexity have their origin in a much deeper theorem in symplectic geometry (see following section). Moreover it can be shown using symplectic methods that H is a perfect Morse function, that is, its Morse series equals its Poincark polynomial. We do not discuss this result here, but refer again to Bymes and Willems [7] where a proof may be found.

July 1985

LETTERS

thus yields both the convexity and connectivity results of the previous section. Moreover, H(P) = Tr CP is itself contained in a maximal toral subalgebra of u(n), which thus defines a torus action on M = G(d, n). Hence H is almost periodic and is one of n - 1 independent functions in involution on M (n - 1 rather than n because of the trace condition on P). We shall identify such a set of functions explicitly below. Now Liouville’s theorem [3] states that if there are m independent integrals in involution on a symplectic manifold of dimension 2m, the Hamiltonian flow arising from any integral is completely integrable (solvable by quadratures). Hence in the case d = 1 where M = G(l, n) = Q=P”-’ our Hamiltonian H is completely integrable.

Section 5 Section 4

Let us now see how the symplectic geometry arises in our problem and analyse the Hamiltonian system in the case d = 1 using the theory of completely integrable Hamiltonian systems. We recall that the co-adjoint orbits of a Lie group G have a well defined symplectic structure, the Kostant-Kirilov structure. In the case of a compact group, however, adjoint and co-adjoint orbits may be identified. Thus our TLS function H(P) = Tr CP may be regarded as a Hamiltonian on an adjoint orbit of u(n). Now we have the following theorem of Atiyah [41: Theorem. Let M be a compact connected symplectic manifold and suppose we have a symplectic action of the n-torus on M. Then if the first Betti number of M vanishes, this leads to n functions A,. . . , f, on M that are independent and Poisson commute.

The map f: M + IR” given by the fi (the ‘moment map’) satisfies (a) all non-empty fibres f i(c) are connected; (b) the image f(M) is convex and is the convex hull of the critical values of f. Taking M = G(d, n) to be the adjoint orbit of matrices P above and fi to be the diagonal elements pi of P which lie in a maximal toral subalgebra (a maximal commutative subalgebra) of u(n)

As mentioned in the introduction, over the past few years there have been some remarkable developments in the theory of completely integrable Hamiltonian systems. Given a system that is completely integrable, the problem is to find a technique for carrying out the explicit integration. The general procedure that has been developed involves describing the motion of the system via a Lax pair with parameter. The flow then linearizes on the real part of the Jacobi variety of an algebraic curve associated with the system. (See [l].) A variant of this procedure can be used to integrate our system in the case d = 1 and to provide information on the principal component structure of the data (question (c) of Section 1). Let us write H,,(Q) = Tr CQ (which, as noted earlier, differs by a constant from H(P) = Tr CP). Then we have: Proposition 1. For H,(Q) = Tr CQ the Hamiltonian equations on an adjoint orbit of U(n) are given by the Lax equation

b=[Q,

Cl

and by the Lux equation with parameter (see [1,17])

(Q+Kj

=[Q+E,

Cl.

Proof. The proof follows from the fact that H is a left invariant function on the tangent bundle 105

SYSTEMS & CONTROL

Volume 6, Number 2

TU(n) of U(n). For more details see [5,21]. The association of a Lax pair with parameter to the system is the key to finding a related curve on which the flow linearizes, thus enabling us to solve the problem explicitly, as in Proposition 3 below (see [5,61). Proposition 2. A set of n - 1 independent integrals in inuolution for HO(Q) = Tr CQ is (for generic C)

Tr CQ, Tr C2Q,. . . ,Tr C”-IQ. Proof. Consider the coefficients of [ in the functions Tr(Q+[C)“,

k=2 ,..., n.

These yield integrals for the flow Q = [Q, C] and include the n - 1 independent integrals listed. Also conserved by the flow are Tr(CQ)k. We examine the statistical meaning of these integrals below. Finally:

LETTERS

July 1985

We see that each matrix entry qij of Q oscillates with frequency equal to the modulus of the difference ]hjl2 - ]h,]’ between the squares of the singular values of the data matrix h. In terms of principal component analysis these frequencies are simply the differences between the variances of the principal components, Hence, for large differences between the variances we get high frequency oscillation and for small differences low frequency oscillation. In the limit ]A,.] = ]hj] Vi, j we get no flow at all. Hence the Hamiltonian flow is a measure of sphericity (see [18]), a measure of how close to equality are the eigenvalues of h*h = C. Regarding the data points xi as samples from a normal distribution, this indicates how close the contours of equal probability density are to being spheres. It is also instructive to consider the behaviour of Q as a projection matrix as time evolves. An intuitive picture may be formed by considering the two dimensional case, say

Proposition 3. For Q = ( qij) and C diagonalized,

C=diag(]h,]2,...,]A,]2), the flow on an adjoint orbit of U(n) given by the Hamiltonian H,,(Q) = Tr CQ is 4ijtt)=“ij

+

eXP((lh/12-I~i12)f

bij)

where. ~~~and biBiiare constants. Proof. The details of the proof may be found in [5] and [6]. The key is to write Q as Q

=zxz,

z- (z l,...Z,)Ec”,

zi=si+iti,

Clzi12 = ls

and to regard Q as a complex rank one perturbation of the matrix C. Then the results of Moser [16] on real rank two perturbations may be used, and the Hamiltonian flow may be shown to be equivalent to that of the Hamiltonian H=+~IXi12(s,?+

ti’)

with symplectic structure Ci ds, A dt,. Section 6

Let us now analyse the statistical information mirrored.by the Hamiltonian flow. 106

A little thought then shows that under the Hamiltonian flow the line onto which Q projects the data is rotated through a ‘level set’ of the function Ho(Q).

Let us examine now the statistical meaning of the integrals associated with the flow. First we consider the integrals Tr CkQ. We may view Q as a projection operator on Hilbert space. Then Tr CkQ is the expected value in the ‘quantum mechanical’ sense of the operator Ck. Thus Tr CkQ may be viewed as a k-th moment of the operator C. Also conserved under the flow are Tr( CQ)k. A statistical meaning may be associated with these integrals as follows: Suppose x is a multivariate normal vector with zero mean. Then we can show that, up to a constant, the integrals Tr(CQ)k are estimates for the cumulants of the distribution of the quadratic form Z’Qx. Finally, we remark that there is a natural correspondence between the level sets of the Hamiltonian and the contours of equal probability density in data space: Suppose, as before, that the points xi are samples from a normal distribution; that the projec-

Volume tion

6, Number matrix

SYSTEMS

2

& CONTROL

z=(z

,,...,

z,,)EC”,

z,=.si+iti;

and that C has been diagonalized: C=diag(lh,]‘,...,

A,,]‘).

Then a level set of the Hamiltonian (see Proposition 3) is given (in the ‘estimator’ space, with co-ordinates zi) by k IXi121Zi12= h,

some h ,

i-l

and the surfaces of constant probability data space are given by -=k,

July 1985

function may be regarded as a Hamiltonian for an almost periodic system (a system arising from a torus action) will lead to a linear programming problem in the ‘momentum’ variables. Further, standard statistical quantities such as statistical moments could provide candidates for such variables.

is

Q=zxT,

LETTERS

density in

somek

Acknowledgement I would like to thank Professor C.I. Bymes for all his advice and encouragement during the course of this work.

References [l]

(setting the exponent in the probability density function equal to a constant), where IAil is an estimate of the variance of the i-th principal component and wi is the (complex) coordinate along the i th principal axis. For h = k, these ellipsoids (2n - 1 dimensional real manifolds) are precisely dual (hyper) surfaces to each other. (The dual of an equation (in projective space) describing a surface is equal to the equation describing its tangent plane at every point on the surface.) This duality is a reflection of the fundamental duality induced between the estimator (P or Q) space and the data space (and indeed between the TLS problem and principal component problem) by the bilinear form Tr CP. For further information in this regard see [6], [9] and 1231. 7. Conclusion The analysis here, aside from its theoretical interest, has specific implications for the analysis of the type of statistical problem at hand. Firstly, in the solution of the MLE problem it is of vital importance to know that the set of minima is connected. Secondly, it is remarkable that this estimation problem in the L2 (rather than L’) norm can be reduced to a linear programming problem. Moreover, it is a consequence of the arguments here that any estimation problem where the MLE

M. Adler and P. Van Moerbeke, Completely integrable systems, Euclidean Lie algebras and curves, Adu. in Math. 38 (3) (1980) 267-317. [2] S. Amari, Differential geometry of statistics, Technical Report METR 84-1. Mathematical Engineering Section, University of Tokyo. Mathematical Methods of Classical Mechanics 131 V.I. Arnold, (Springer, Berlin-New York, 1978). Convexity and commuting Hamiltonians, [41 M.F. Atiyah, Bull. London Math. Sot. 14 (1982) 1-15. system 151A.M. Bloch, A completely integrable Hamiltonian associated with line fitting in complex vector spaces, Bull. Amer. Math. Sot., to appear. WI A.M. Bloch, Ph.D. Thesis, Harvard University (in preparation). Least squares estimation, 171 C.I. Bymcs and J.C. Willems, linear programming and momentum, Invent. Math., submitted. and G.J. Heckeman, On the variation in PI J.J. Duistermaat the cohomology in the symplectic form of the reduced phase space, Invent. Math. 69 (1982) 259-268. 191J. Durbin and M.G. Kendall, The geometry of estimation, Biometrika 38 (1951) 150-158. 1101B. Efron, Describing the curvature of a statistical problem, Ann. Statist. 3 (6) (1975) 1189-1242. Proc. Cam1111R.A. Fisher, Theory of statistical estimation, bridge Phil. Sot. 122 (1925) 700-725. WI G.H. Golub and C.F. Van Loan, An analysis of the total least squares problem, SIAM J. Numer. Anal. 17 (6) (1980) 883-893. stochastic matrices and the diagonal of a 1131 A. Horn, Doubly rotation matrix, Amer. J. Math. 76 (1954) 620-630. On a property of distributions admitting iI41 V.S. Hurxurbar, sufficient statistics, Biometrika 36 (1949) 71-74. Vorlesungen lfber Dynamik, in: Gessam1151 C.G.J. Jacobi, melte Werke, Supplement band, Berlin (1884). WI B. Kostant, On convexity, the Weyl group and the Iwasawa decomposition, Ann. Sci. &ole Norm. .Sup. 6 (1973) 413-455. 107

Volume [17]

[18] [19] [20] [21]

108

6, Number

2

SYSTEMS

& CONTROL

J. Moser, The geometry of quadrics and spectral theory, Proceedings of the Chern Symposium 1979 (Springer, Berlin-New York, 1980). R.S. Muirhead. Aspects of Multivariate Statistical Theory (Wiley, New York, 1982). D. Mumford, Tata L-ecfures on Thefa II (Birkhauser, Basel-Boston, 1984). K. Pearson, On lines and planes of closest fit to points in space, Phil. Mag. 2 (1901) 559-572. T. Raitu, The motion of the free n-dimensional rigid body, Indiana Univ. Math. J. 29 (1980) 609-629.

LETTERS [22]

[23] [24]

[25] [26)

July 1985

B.R. Rao, A formula for the curvature of the likelihood surface of a sample drawn from a distribution admitting sufficient statistics, Biometrika 47 (1960) 203-207. G.D. Salmon, Analytic Geometry of Three Dimensions (Chelsea. New York, 1927). I. Schur, ijber eine Klasse von Mittelbildungen mit Anwendungen auf der Determinantentheorie, Sifzungsberichte Berliner Math. Gesellschaft 22 (1923) 9-20. S.R. Searle, Linear Models (Wiley, New York, 1977). J.C. Willems, From time series to linear systems, Part I, Preprint Mathematisch Instituut, Groningen.