Neural Networks, Vol. 8, No. 1, pp. 11-23, 1995
CONTRIBUTED ARTICLE
Lyapunov Functions for Convergence of Principal Component Algorithms

MARK D. PLUMBLEY
King's College London
(Received 25 May 1993; accepted 16 November 1993)

Abstract--Recent theoretical analyses of a class of unsupervised Hebbian principal component algorithms have identified its local stability conditions. The only locally stable solution for the subspace P extracted by the network is the principal component subspace P*. In this paper we use the Lyapunov function approach to discover the global stability characteristics of this class of algorithms. The subspace projection error, least mean squared projection error, and mutual information I are all Lyapunov functions for convergence to the principal subspace, although the various domains of convergence indicated by these Lyapunov functions leave some of P-space uncovered. A modification to I yields a principal subspace information Lyapunov function I' with a domain of convergence that covers almost all of P-space. This shows that this class of algorithms converges to the principal subspace from almost everywhere.
Keywords--Neural networks, Unsupervised learning, Principal component analysis, Information theory, Hebbian algorithms, Lyapunov functions, Oja rule.

1. INTRODUCTION
Principal Component Analysis (PCA) is a popular statistical tool for linearly reducing the dimensionality of a set of measurements while retaining as much information as possible about the original measurements. It appears under various names, including the Karhunen-Loève transform in signal processing, the Hotelling transform in image processing, and Factor Analysis in the statistical literature (e.g., Watanabe, 1985). Suppose we have a linear transform from an N-dimensional zero-mean input vector x = [x_1, ..., x_N] to an M-dimensional output vector y = [y_1, ..., y_M], and y is related to x by the expression

y = Wx    (1)

where W is an M × N matrix, with M < N. Principal component analysis sets the M successive rows of W to the M largest eigenvectors of the input covariance matrix Σ_x = E(xx^T). Thus, y_1 represents the component of x in the direction of the largest eigenvector of Σ_x (the principal component), y_2 the component in the direction of the second largest, and so on. Principal component analysis is an optimal linear transform in the sense that it minimizes the least mean squared reconstruction error of the input x from the output y (e.g., Hornik & Kuan, 1992). It also maximizes the transmitted information from a Gaussian input x to the output y, given uncorrelated equal-variance Gaussian additive noise on the input (Linsker, 1988; Plumbley & Fallside, 1988). In fact, PCA is sufficient, but not necessary, to find this optimum. Provided the rows of W span the same subspace as that spanned by the largest M eigenvectors of Σ_x, the principal subspace for a given M, then the mean squared reconstruction error will be minimized and the transmitted information will be maximized (Plumbley & Fallside, 1988). We shall refer to this as principal subspace analysis.

Over the last decade or so, a number of neural network algorithms have been suggested to enable a one-layer linear neural network (Figure 1) to perform principal subspace analysis. Most of these PCA algorithms are based on the Oja (1982) principal component finder neuron (e.g., Oja, 1989), and update their weights at the nth time step by a small change

ΔW_n = η_n(y_n x_n^T - K_n W_n)    (2)

Acknowledgements: The author would like to thank John Taylor and Guido Bugmann for many useful discussions on this and related work, and the comments of an anonymous referee that helped to improve this paper considerably. During part of this work the author was supported by a Temporary Lectureship from the Academic Initiative of the University of London. Requests for reprints should be sent to the author at Department of Computer Science, King's College London, Strand, London WC2R 2LS, UK.
where the matrix K_n varies from one algorithm to another, and is normally a function of W_n and x_n. Particular examples of these algorithms include the Williams (1985) Symmetric Error Correction algorithm, which uses

K_n = y_n y_n^T,    (3)

the Oja and Karhunen (1985) Symmetric Gradient Ascent algorithm, which uses

K_n = diag(y_n y_n^T) + 2 LT^+(y_n y_n^T),    (4)

and the Sanger (1989) Generalized Hebbian Algorithm, which uses

K_n = LT(y_n y_n^T).    (5)
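As a concrete illustration of how this family of update rules can be simulated, the following minimal sketch (added here for illustration, not from the original text; dimensions, learning rate, and data model are arbitrary choices) implements the stochastic update (2) with the three choices of K_n in eqns (3)-(5); diag(·), LT^+(·), and LT(·) are the matrix operators defined in the next paragraph.

import numpy as np

def K_matrix(y, variant):
    """Weight-decay matrix K_n for one sample, per eqns (3)-(5)."""
    yyT = np.outer(y, y)
    if variant == "SEC":      # Williams (1985), eqn (3)
        return yyT
    if variant == "SGA":      # Oja & Karhunen (1985), eqn (4)
        return np.diag(np.diag(yyT)) + 2.0 * np.tril(yyT, k=-1)
    if variant == "GHA":      # Sanger (1989), eqn (5)
        return np.tril(yyT)
    raise ValueError(variant)

def hebbian_step(W, x, eta, variant="SEC"):
    """One update of eqn (2): W <- W + eta * (y x^T - K_n W)."""
    y = W @ x
    return W + eta * (np.outer(y, x) - K_matrix(y, variant) @ W)

# Example: N = 4 inputs, M = 2 outputs, correlated zero-mean Gaussian data.
rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))
W = 0.1 * rng.normal(size=(2, 4))
for _ in range(20000):
    x = A @ rng.normal(size=4)        # input with covariance A A^T
    W = hebbian_step(W, x, eta=1e-3, variant="SEC")

For a small learning rate, the rows of W should approximately come to span the principal subspace of the input covariance, as the analysis below makes precise.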
In the previous expressions, diag(·) is the matrix function that sets entries to zero that are not on the diagonal, LT^+(·) sets entries to zero that are not strictly below the diagonal, and LT(·) sets entries to zero that are not on or below the diagonal.

Several authors have analyzed the behaviour of these and related algorithms (Hornik & Kuan, 1992; Oja, 1992; Oja & Karhunen, 1985; Sanger, 1989), based on the stability or instability of various equilibrium points for the ordinary differential equation (o.d.e.) equivalent of the algorithms. Typically, depending on the particular algorithm chosen, these show that the equilibrium point corresponding to PCA is the only possible asymptotically stable solution for W. However, these analyses only give local stability information, and it has so far proved problematical to identify the domain of attraction for the identified stable solution.

One popular method for identifying a domain of attraction for a system such as this is the Lyapunov function approach (Cook, 1986). If an energy-like function can be found that monotonically decreases (or monotonically increases) over time, towards the stable point from anywhere within a given region, that region is a domain of attraction for the stable point. In this article, we identify a number of Lyapunov functions for this class of PCA algorithms, with their associated domains of attraction. These include a) the subspace projection error (Section 4.1), b) the least mean square reconstruction error (Section 4.2), and
FIGURE 1. Linear network with input x, weight matrix W, and output y = Wx.
c) the mutual information (Section 4.3). All of these leave a significant section of W-space outside of their respective domains of attraction. Finally, a modification to the latter mutual information function yields what we term d) the principal subspace information (Section 5), which has an associated domain of attraction that covers almost all values of W. This is the main result of this article. It allows us to show that the class of PCA algorithms considered here converges to the principal subspace from almost everywhere, and consequently that a random initial value of W will almost surely converge to the principal subspace. Most of the proofs will be relegated to the Appendix.

2. THE ORDINARY DIFFERENTIAL EQUATION APPROACH

Rather than attempting a direct analysis of the difference equation, eqn (2), following other authors (e.g., Hornik & Kuan, 1992) we consider instead the behaviour of the equivalent o.d.e. Informally, if the update factor η_n is small enough to make only small changes to W_n at each step, we can consider the mean weight update ΔW̄ = ⟨ΔW_n⟩, assuming that the weight matrix W_n remains approximately constant (≈ W) over the period during which we are taking the mean. If η_n is also approximately constant we get

ΔW̄ = ⟨η_n⟩(⟨W_n x_n x_n^T⟩ - ⟨K_n W_n⟩) ≈ η(W⟨x_n x_n^T⟩ - ⟨K_n⟩W) ≈ η(WΣ_x - KW)    (6)
where Σ_x = E(xx^T) is the covariance matrix of x (making the assumption that x is zero mean), and K = E(K_n) is the expected value of K_n. In what follows, we assume that the covariance matrix Σ_x is nonsingular with distinct eigenvalues λ_1 > λ_2 > ... > λ_N. More formally, under certain conditions (e.g., Hornik & Kuan, 1992) it can be shown that the path of W_n in algorithm (2) will approximately follow the path of W in the o.d.e.

dW/dt = WΣ_x - KW    (7)
and in particular W_n will converge with probability 1 (i.e., almost surely) to the roots of the o.d.e. (7). We remarked in the Introduction that only principal subspace analysis, and not full-blown principal component analysis, is necessary to optimize either the least mean squared reconstruction error or the transmitted information. This leads us to consider the behaviour of the subspace spanned by the rows of W, rather than W itself. Oja (1992) showed that this subspace can be represented by the orthogonal projection matrix P that projects vectors into the given subspace by left (or right)
multiplication. If WW^T has full rank, then this orthogonal projection is defined by

P = W^T(WW^T)^{-1}W.    (8)

It is immediately evident that P as defined satisfies the two conditions for an orthogonal projection, namely

P^2 = P    (9)

P^T = P.    (10)
If only the first of these conditions were satisfied, P would be a projection, but not an orthogonal projection. Note that if P is an orthogonal projection, then (I_N - P), where I_N is the N × N identity matrix, is also an orthogonal projection. A way to visualize the situation is to imagine P as an operator that projects input vectors into the given subspace. Thus, the vector x̂ = Px is the component of x that lies entirely within the subspace we are interested in. Note that after P has been applied once to yield a vector within our subspace, another application of P will have no further effect: thus, Px̂ = P(Px) = Px = x̂ because P^2 = P. Before we proceed, we show that we can choose K in eqn (7) so that WW^T has full rank for all t, so that P can continue to be formed as in eqn (8).
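As a quick numerical check of definitions (8)-(10), the sketch below (an illustration added here, with arbitrary dimensions) forms P from a random full-rank W and verifies that it is idempotent and symmetric, and that the rows of W lie in the subspace onto which P projects.

import numpy as np

def orthogonal_projection(W):
    """Eqn (8): P = W^T (W W^T)^{-1} W, assuming W W^T has full rank."""
    return W.T @ np.linalg.inv(W @ W.T) @ W

rng = np.random.default_rng(1)
W = rng.normal(size=(2, 4))          # M = 2, N = 4
P = orthogonal_projection(W)

assert np.allclose(P @ P, P)         # eqn (9):  P^2 = P
assert np.allclose(P.T, P)           # eqn (10): P^T = P
assert np.allclose(P @ W.T, W.T)     # rows of W lie in the projected subspace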
THEOREM 2.1. Suppose that K in eqn (7) satisfies

K + K^T = 2WΣ_xW^T    (11)

and that WW^T is finite and nonsingular at t = 0. Then WW^T remains finite and nonsingular for all t > 0, and asymptotically converges to WW^T = I_M.

Proof. See Appendix.

Note that eqn (11) is satisfied for the Williams (1985) SEC algorithm (3) and the Oja and Karhunen (1985) SGA algorithm (4), as well as the original single-output Oja (1982) algorithm, of which these two are slightly different generalizations. Therefore, WW^T will remain finite and nonsingular for these algorithms. However, consider the simple Hebbian algorithm

dW/dt = WΣ_x    (12)

with no weight decay term. This is of the form given in eqn (7) with K = 0. There is nothing to prevent the rows of W from becoming infinitely large in the direction of the single principal eigenvector of Σ_x as the algorithm progresses. As t → ∞, the weight matrix W becomes degenerate, and WW^T becomes singular. Other algorithms that do not satisfy eqn (11) have to be approached individually. For example, using eqn (7) with K = WW^T produces the novel algorithm

dW/dt = WΣ_x - WW^TW    (13)

for which WΣ_x^{-1}W^T remains finite and nonsingular if so initially, implying the same about WW^T. The proof is similar to that for the previous case, and shows that the selected output components have the square of the input variance components at convergence. This algorithm will be covered in more detail in a later paper. Another example is Sanger's (1989) Generalized Hebbian Algorithm (5). Although this again does not satisfy eqn (11), Sanger's direct proof of convergence does show that WW^T does not become degenerate. For the remainder of this article, we shall assume that K in eqn (7) is such that WW^T does retain full rank for all t, to avoid any difficulties with P.

The o.d.e. (7) defines an o.d.e. for the behaviour of the subspace P itself (Oja, 1992), which can be written as

dP/dt = (I_N - P)Σ_xP + PΣ_x(I_N - P),    (14)

which is dependent on P only, and not on the precise form of W. The conditions for stationarity and stability of P are given by the following two theorems (Williams, 1985; Oja, 1992).

THEOREM 2.2. P in eqn (14) is stationary [is a fixed point of eqn (14)] iff it is the subspace spanned by some M eigenvectors of Σ_x.

Proof. [See also Oja (1992) or Williams (1985).] If dP/dt in eqn (14) is zero, multiplying on the right by P we find that (I_N - P)Σ_xP = 0, so Σ_xP = PΣ_xP = PΣ_x. The converse is clearly also true [i.e., that P is stationary if (I_N - P)Σ_xP = 0], so we have that dP/dt = 0 iff Σ_xP = PΣ_x, that is, Σ_x and P commute. Now, Σ_x and P commute iff they share the same eigenvectors (Strang, 1976), that is, iff P is the subspace spanned by some M of the eigenvectors of Σ_x. (Note that P has M eigenvectors with eigenvalue 1, and N - M with eigenvalue 0.) Therefore, P is stationary iff it is the subspace spanned by some M of the eigenvectors of Σ_x. •

THEOREM 2.3. P in eqn (14) is stable only when it is the principal subspace P*, that is, the subspace spanned by the first M eigenvectors of Σ_x. Any other fixed point is unstable.

Proof. See Oja (1992) or Williams (1985).

This tells us that the principal subspace P* is the unique stable point for P, but it only tells us about the local behaviour around each of the stationary points. We cannot tell the domain of attraction for the stable solution, or whether it is possible to get caught by a saddle point on the way to the stable point from any particular initial value of P. To get a more global view, we now proceed to identify a number of Lyapunov functions with associated domains of convergence. Any initial P within a domain of convergence so identified will asymptotically converge to the stable point.
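To see Theorems 2.2 and 2.3 at work numerically, the following sketch (added for illustration; the dimensions, covariance, and step size are arbitrary choices) integrates the o.d.e. (14) with a simple Euler scheme from a generic starting subspace and checks that P approaches the principal subspace projection P*.

import numpy as np

rng = np.random.default_rng(2)
N, M = 4, 2
Sigma = np.diag([4.0, 3.0, 2.0, 1.0])          # Σ_x with distinct eigenvalues

W = rng.normal(size=(M, N))
P = W.T @ np.linalg.inv(W @ W.T) @ W           # initial projection, eqn (8)

E = np.eye(N)
P_star = E[:, :M] @ E[:, :M].T                 # projection onto the first M eigenvectors

dt = 0.01
for _ in range(5000):
    # Euler step of eqn (14): dP/dt = (I - P) Σ_x P + P Σ_x (I - P)
    P = P + dt * ((np.eye(N) - P) @ Sigma @ P + P @ Sigma @ (np.eye(N) - P))

print(np.linalg.norm(P - P_star))              # should be close to 0 from a generic start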
3. ENERGY FUNCTIONS AND LYAPUNOV FUNCTIONS
3.1. Energy Functions

Energy functions, functions that decrease over time, have already been used in conjunction with PCA algorithms to visualize the behaviour of such algorithms. Baldi and Hornik (1989), for example, showed that the mean squared error of a linear auto-association network (trained with BackProp) has only one local minimum, in addition to a number of saddle points. The BackProp algorithm itself operates by performing a steepest-descent search in this energy function. For the o.d.e. (7) it is possible to identify a number of different energy functions. For our purposes, an energy function is a function S(W) for which its time derivative is nonpositive, that is, dS/dt ≤ 0, when dW/dt is given by the o.d.e. (7). These typically arise from error measures that decrease as an algorithm progresses. Of course, we could instead use an increasing function of t as an energy function, by simply taking the negative of the function. These functions sometimes arise as signal or output measures that increase as an algorithm progresses. We shall see that transmitted information is an increasing energy function for the o.d.e.s (7) and (14).
3.2. Lyapunov Functions

A Lyapunov function is like an energy function that can be used to estimate a domain of attraction for a stable point. In many cases, energy functions can be used as Lyapunov functions, although the estimated domain of attraction they define may only be a subset of the true domain of attraction. There is no general method available to construct a Lyapunov function that will identify the complete domain of convergence, so it is often a matter of trial-and-error to identify a good one. The most general definition of a Lyapunov function is that it should be a function L that strictly decreases monotonically from any point a within a particular region D (the domain of attraction), except at a single point a* within D, where it is stationary. If such an L can be found, then a* is asymptotically stable (Cook, 1986). In what follows, it is convenient for us to use the following formulation.

THEOREM 3.1. (Lyapunov) Suppose that we have a function L(a), defining a region D consisting of all points a such that L(a) < c for some constant c, with the boundary of D given by the points where L(a) = c, such that
L1. dL(a)/dt < 0 for all a ≠ a* in D, and
L2. dL(a*)/dt = 0.
Then the equilibrium point a = a* is asymptotically stable, with a domain of attraction D.
So, if we have an energy function with several equilibrium points a*, a^[2], a^[3], ..., we choose c to be the minimum value of L(a) over all points (apart from a*) where dL(a)/dt = 0. In other words, our domain of attraction will be all the points with energy below that of the second lowest stationary point or points, with the lowest stationary point being the stable point. We can now consider a number of energy functions for our Hebbian PCA algorithms from this point of view, to see what domains of attraction are identified. These functions include the subspace projection error; least mean square reconstruction error; and mutual information. For the purposes of this article we shall call these nonmaximal Lyapunov functions, because their domains of attraction are not, in general, the largest that can be identified. The largest domain of attraction identified here is given by our final principal subspace information measure: its domain of attraction covers almost all of P-space.

4. NONMAXIMAL LYAPUNOV FUNCTIONS
4.1. Subspace Projection Error

The first function we shall consider is the mean squared error S_P between the input vector x and its projection x̂ = Px into the subspace defined by P. As we have already noted in the Introduction, this is the subspace spanned by the rows of the weight matrix W.

LEMMA 4.1. Suppose that P varies according to the o.d.e. (14). Then the subspace projection error

S_P = E(|x - x̂|^2) = E(|x - Px|^2)    (15)

is a nonincreasing function of t, and is stationary iff P is the subspace spanned by some M eigenvectors of Σ_x.

Proof. See Appendix.

The function S_P is thus an energy function with stationary points corresponding to the stationary points of P (see Theorem 2.2).

THEOREM 4.2. Let S_{P*} be the value of S_P when P is the principal subspace P* of Σ_x, and let λ_i be the ith largest eigenvalue of Σ_x. Then S_P is a Lyapunov function for convergence of P to P*, with domain of attraction

D_P = {P | S_P < S_{P*} + λ_M - λ_{M+1}}.    (16)

Consequently, the principal subspace P* is asymptotically stable with domain of convergence D_P.

Proof. See Appendix.

For a network with two inputs (N = 2) and a single output (M = 1), this domain of attraction D_P covers the whole of P-space except the point P^[2] where P is the subspace corresponding to the second eigenvector (i.e., the minor component) of the 2 × 2 input covariance matrix. This point P^[2] is the only stationary point for P apart from the optimal solution P* itself.
For the more general case with N > 2, a significant proportion of P-space is outside the domain of attraction D_P. For example, significant regions near to the third-lowest stationary point P^[3] (and higher stationary points) will be outside D_P, but simulations suggest that initial values of P within these regions will also converge to P*. This suggests that D_P is not the true, or maximal, domain of attraction for P*. Theorem 4.2 tells us that P converges to P* from any initial point within D_P, but not what happens to points outside of D_P.
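To make the domain condition (16) concrete, the following sketch (illustrative only; it reuses the Euler integration of eqn (14) shown earlier) evaluates S_P = Tr[(I_N - P)Σ_x], an expression established in Appendix A.3, along a trajectory, and reports whether S_P decreases and whether the starting point lay inside D_P.

import numpy as np

def S_P(P, Sigma):
    """Subspace projection error, eqn (15); equals Tr[(I - P) Σ_x] (Appendix A.3)."""
    return np.trace((np.eye(len(Sigma)) - P) @ Sigma)

rng = np.random.default_rng(3)
N, M = 4, 2
Sigma = np.diag([4.0, 3.0, 2.0, 1.0])
lam = np.diag(Sigma)

W = rng.normal(size=(M, N))
P = W.T @ np.linalg.inv(W @ W.T) @ W

S_star = lam[M:].sum()                         # S_{P*}: sum of the N - M smallest eigenvalues
inside_D_P = S_P(P, Sigma) < S_star + lam[M - 1] - lam[M]   # condition (16)

values = []
dt = 0.01
for _ in range(2000):
    P = P + dt * ((np.eye(N) - P) @ Sigma @ P + P @ Sigma @ (np.eye(N) - P))
    values.append(S_P(P, Sigma))

print("started inside D_P:", inside_D_P)
print("S_P nonincreasing:", all(a >= b - 1e-8 for a, b in zip(values, values[1:])))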
4.2. Least Mean Square Reconstruction Error

The error measure S_P used in Section 4.1 measures the mean squared error in the projection x̂ = Px from x. Because Px = W^T(WW^T)^{-1}Wx, we can construct the projected vector x̂ as x̂ = W^T(WW^T)^{-1}y where y is the output from the network given in eqn (1). However, this is not the best reconstruction of x given y, in the minimum mean squared error sense. It is well known that the least mean squared reconstruction of x given the network output y = Wx is given by the vector

x̂ = Σ_xW^T(WΣ_xW^T)^{-1}y = Qx    (17)

where Q is given by the expression

Q = Σ_xW^T(WΣ_xW^T)^{-1}W.    (18)

Now, we see that Q^2 = Q, so Q is a projection. However, Q^T ≠ Q in general, so Q is not an orthogonal projection. The projection Q is related in a number of ways to P. From their definitions, for example, we can immediately write down the equations:

PQ = P    (19)

QP = Q    (20)

QΣ_x = Σ_xQ^T    (21)

that suggest P and Q are closely related. In fact, for a given covariance matrix Σ_x there is a one-to-one mapping from Q to P, as the following lemma demonstrates. In particular, it is not necessary to know W to determine Q from P and Σ_x.

LEMMA 4.3. P is a function of Q only, and Q is a function of P and Σ_x only. Furthermore, P and Q are equal iff they are the subspace spanned by some M eigenvectors of Σ_x.

Proof. See Appendix.

Consider then, as our next tentative Lyapunov function, the least mean squared reconstruction error

S_Q = E(|x - x̂|^2) = E(|x - Qx|^2)    (22)

that, for any given Σ_x, is a function of P only. Therefore, we can view it as a function of P instead of Q if we wish.

LEMMA 4.4. The least mean square reconstruction error S_Q is a nonincreasing function of t, and is stationary iff Q is the subspace spanned by some M eigenvectors of Σ_x.

Proof. See Appendix.

THEOREM 4.5. S_Q is a Lyapunov function for convergence of P to P*, with domain of attraction

D_Q = {P | S_Q < S_{P*} + λ_M - λ_{M+1}} ⊇ D_P.    (23)

Consequently, the principal subspace P* is asymptotically stable with domain of convergence D_Q.

Proof. See Appendix.

Note that the boundary for D_Q is expressed in terms of S_{P*} for simpler comparison with D_P. We have therefore identified a Lyapunov function with a larger domain of attraction D_Q than D_P. However, as for D_P, D_Q still leaves a significant proportion of P-space uncovered in general. There is also a dual projection O of Q, given by

O = W^T(WΣ_x^{-1}W^T)^{-1}WΣ_x^{-1}    (24)

that has similar properties to Q. However, because the energy function S_O = Tr[(I_N - O)Σ_x] is the greatest mean squared reconstruction error (under certain constraints), the domain of attraction D_O identified by S_O is smaller than D_P, and hence smaller than D_Q.

4.3. Mutual Information

For a multivariate Gaussian input signal x corrupted by uncorrelated equal-variance Gaussian noise, PCA or principal subspace analysis maximizes mutual information (MI) between the input and the output of the network in Figure 1 (Linsker, 1988; Plumbley & Fallside, 1988). In a previous article, the author used MI as a Lyapunov function for a decorrelating network with noise on the output instead of the input (Plumbley, 1993). In this article, we take a similar approach for these PCA algorithms and attempt to use MI in a similar way. However, as we shall see, the situation in this case is not quite so straightforward, and the MI function has to be modified before yielding the final result here. Suppose that the input signal, with covariance matrix Σ_x, contains noise with covariance matrix Σ_ν = σ_ν^2 I_N. Then the total output and output noise have respective covariance matrices WΣ_xW^T and σ_ν^2 WW^T. The mutual information I is given by (Linsker, 1988; Plumbley & Fallside, 1988)

I = ½[log det(WΣ_xW^T) - log det(σ_ν^2 WW^T)].    (25)
Although this has a very different form to the previous Lyapunov functions considered, it can be used as a Lyapunov function in a very similar way.
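For completeness, here is a small sketch (added for illustration, with σ_ν = 1 and the same toy covariance as in the earlier sketches) that evaluates the mutual information (25) for a given W using log-determinants, and checks that it is unchanged by an invertible remixing of the outputs, in line with Lemma 4.7 below.

import numpy as np

def mutual_information(W, Sigma, sigma_nu=1.0):
    """Eqn (25): I = 0.5 [log det(W Σ_x W^T) - log det(σ_ν^2 W W^T)]."""
    _, logdet_signal = np.linalg.slogdet(W @ Sigma @ W.T)
    _, logdet_noise = np.linalg.slogdet(sigma_nu**2 * W @ W.T)
    return 0.5 * (logdet_signal - logdet_noise)

Sigma = np.diag([4.0, 3.0, 2.0, 1.0])
rng = np.random.default_rng(4)
W = rng.normal(size=(2, 4))

# I depends on W only through the subspace P: remixing the outputs by an
# invertible 2 x 2 matrix A leaves the value unchanged.
A = rng.normal(size=(2, 2))
print(mutual_information(W, Sigma), mutual_information(A @ W, Sigma))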
LEMMA 4.6. The mutual information I in eqn (25) is a nondecreasing function of t, and is stationary iff P is the subspace spanned by some M eigenvectors of Σ_x.
Proof. See Appendix.

Thus, in a similar way that both S_P and S_Q decrease monotonically over time, the mutual information I increases monotonically over time, except when P is the subspace spanned by some M eigenvectors of Σ_x. Consequently, I is an energy function for W, albeit an increasing rather than a decreasing energy function. Furthermore, the following lemma shows that I can also be viewed as an energy function for P.
LEMMA 4.7. For a given covariance matrix Σ_x, the mutual information I in eqn (25) is a function of P only.

Proof. See Appendix.

We mentioned above that I is maximized at the principal subspace solution P = P*. This means that we can propose I (or, strictly speaking, -I) as a Lyapunov function.

THEOREM 4.8. Let I_{P*} be the value of I at the point P = P*. Then I is a Lyapunov function for convergence of P to P*, with domain of attraction

D_I = {P | I > I_{P*} - ½ log(λ_M / λ_{M+1})}.    (26)

Consequently, the principal subspace P* is asymptotically stable with domain of attraction D_I.

Proof. See Appendix.

For a single output, that is, M = 1, it is easy to show that this reduces to the same domain of attraction as that found for S_P above, because both S_P and I are monotonic in the output variance. Consequently, for a single output M = 1, D_I = D_P, and D_I ⊂ D_Q, so D_I is no better than D_Q for a single-output network at least. This information function still does not identify a domain of attraction covering as much of P-space as we would like. However, the proof of Lemma 4.6 uses the fact that WΣ_xW^T is nonsingular. We can see from eqn (25) that if WΣ_xW^T is singular (but WW^T is nonsingular), then det(WΣ_xW^T) = 0 so I → -∞. This suggests a modification that can be made to the information function I, which will finally yield a domain of attraction covering virtually all of P-space.

5. PRINCIPAL SUBSPACE INFORMATION

It is well known that if the network (1) has precisely zero weight component in the direction of a particular eigenvector of Σ_x, then there will be no tendency for a Hebbian algorithm to increase that weight component away from zero. This is easy to see for a system y = wx with a single scalar weight w, because the weight update algorithm Δw = yx = wx^2 will be zero if the weight w is zero.

Let us tentatively suppose that this will generalize to P in the following way. If a nonzero eigenvector e of P can be found that is entirely perpendicular to the principal subspace P*, that is, e^TP*e = 0, then it may be difficult or impossible for P to converge to P*. This suggests that WP*W^T should be nonsingular for convergence, because W can be decomposed into the product of an M × N matrix formed from the M nonzero eigenvectors of P on the right, and an M × M scaling and rotation matrix on the left. Let us therefore construct an artificial principal subspace information function I' as follows

I' = ½[log det(WP*W^T) - log det(WW^T)]    (27)

that is the mutual information function in eqn (25) with the covariance matrix Σ_x replaced by the principal subspace P*, and σ_ν^2 set to 1. In a sense, this is a weighted information function, so that information about any input component in the principal subspace is equal, but information about any input component that is not in the principal subspace is ignored. As for I, this can be shown to be a function of P (and P*) only. Let us now explore I' as a possible Lyapunov function.

LEMMA 5.1. Suppose that WP*W^T is initially nonsingular. Then the principal subspace information I' is a nondecreasing function of t, and is stationary iff P = P*.

Proof. See Appendix.

We note immediately that Lemma 5.1 differs in two important ways from the equivalent lemmas for the previous Lyapunov functions considered. Firstly, it is only valid when WP*W^T is initially nonsingular. Secondly, with this condition, I' is only stationary when P is the principal subspace P*. What has happened is that the condition for WP*W^T to be nonsingular has excluded the other stationary conditions for P, which were still present in the other energy functions S_P, S_Q, S_O, and I. This allows us to state the main theorem of this article.

THEOREM 5.2. The principal subspace information I' is a Lyapunov function for convergence of P to P*, with domain of attraction

D_{I'} = {P | WP*W^T nonsingular}.    (28)

Consequently, the principal subspace P* is asymptotically stable with domain of convergence D_{I'}.

Proof. See Appendix.

The region where P is outside of D_{I'}, for which WP*W^T is singular, has dimensionality 1 less than the whole of P-space. Consequently, D_{I'} covers almost everywhere in P-space. Thus, if an initial P is chosen at random, it will be in the domain of attraction D_{I'} with probability 1. Finally, we note that if P is initially within D_{I'}, it will only be stationary when P = P*, that is, when it
has converged to the final solution. Because our previous Lyapunov functions S_P, S_Q, S_O, and I are stationary iff P is stationary, these will also be stationary only at convergence. Thus, in particular the subspace projection error S_P and least mean squared reconstruction error S_Q strictly decrease monotonically over time until convergence to P = P*, and similarly the mutual information I strictly increases monotonically over time until convergence. Consequently, S_P, S_Q, and I are also Lyapunov functions with domain of convergence D_{I'} (according to the general definition) because they vary monotonically from any point within D_{I'} and are stationary only at convergence.¹ The increase in information I over time was noted by Földiák (1989) for an algorithm related to (although not one of) the class considered here.

¹ This is perhaps a little academic because it would not have been possible to state this without first using the Lyapunov function I'.
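The following sketch (illustrative, not from the paper) evaluates the principal subspace information (27) and the domain condition (28) for two weight matrices: a random W, which lies in D_{I'} with probability 1, and a W deliberately given a row orthogonal to the principal subspace, for which WP*W^T is singular and I' is -∞.

import numpy as np

def principal_subspace_information(W, P_star):
    """Eqn (27): I' = 0.5 [log det(W P* W^T) - log det(W W^T)]."""
    sign, logdet_top = np.linalg.slogdet(W @ P_star @ W.T)
    if sign <= 0:                      # W P* W^T singular: outside D_I' (eqn (28))
        return -np.inf
    _, logdet_bottom = np.linalg.slogdet(W @ W.T)
    return 0.5 * (logdet_top - logdet_bottom)

N, M = 4, 2
E = np.eye(N)
P_star = E[:, :M] @ E[:, :M].T         # principal subspace projection (first M eigenvectors)

rng = np.random.default_rng(5)
W_random = rng.normal(size=(M, N))     # generic start: inside D_I' with probability 1
W_bad = np.vstack([E[0], E[2]])        # second row lies entirely outside the principal subspace

print(principal_subspace_information(W_random, P_star))   # finite
print(principal_subspace_information(W_bad, P_star))      # -inf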
6. AN ILLUSTRATIVE EXAMPLE

As we have already mentioned, the 2-input case (N = 2) leads to a relatively trivial situation where even the domain of attraction D_P is sufficient to cover virtually the whole of P-space. Also, with just a single output (M = 1), the domains of attraction D_P and D_I are identical. We therefore choose to illustrate the domains of convergence identified by these Lyapunov functions for a 3-input 2-output network (N = 3, M = 2) with an ordered diagonal input covariance matrix Σ_x with eigenvalues λ_1 = 3, λ_2 = 2, and λ_3 = 1. In this system, each orthogonal projection P is a two-dimensional subspace of 3-space, which we can represent by the unit vector normal to P (and the opposite unit vector). Thus, P-space is represented by the surface of a half-sphere. Figure 2a-d shows the domains of attraction D_P, D_Q, D_I, and D_{I'}, respectively. Note that a view of only half of the sphere is sufficient because the reverse side of the sphere is identical but opposite to the front side.

FIGURE 2. Domains of attraction (a) D_P, (b) D_Q, (c) D_I, and (d) D_{I'} for a 3-input 2-output network with input component eigenvalues λ_1, λ_2, and λ_3 of value 3, 2, and 1, respectively. The eigenvectors of the input covariance matrix are indicated by e_1, e_2, and e_3. Note that D_Q is larger than D_P, D_I and D_Q are identical, and D_{I'} covers the whole of P-space except the equator of the sphere, where det(WP*W^T) = 0.

In this special case where the number of outputs M is one less than the number of inputs N, several simplifications can be made. Any orthogonal projection P of rank M can be written P = U_P U_P^T where U_P is an N × M matrix such that U_P^T U_P = I_M. Similarly, for the complementary projection P̄ = (I_N - P) we can write P̄ = u_P̄ u_P̄^T where u_P̄ is a unit-length vector such that U_P U_P^T + u_P̄ u_P̄^T = I_N. [In the general case, U_P̄ will be an N × (N - M) matrix such that U_P̄^T U_P̄ = I_{N-M}.] This vector u_P̄ is the normal vector used to plot the surfaces in Figure 2a-d. A number of the expressions for the energy functions become simplified, based on the vector u_P̄. For example, it is easy to show that

S_P = Tr(u_P̄^T Σ_x u_P̄) = u_P̄^T Σ_x u_P̄,    (29)

with the first equality holding for any M. Similarly, we can verify that

Q̄ = (I_N - Q) = u_P̄(u_P̄^T Σ_x^{-1} u_P̄)^{-1} u_P̄^T Σ_x^{-1}    (30)

and that S_Q simplifies to

S_Q = Tr[(u_P̄^T Σ_x^{-1} u_P̄)^{-1}] = 1 / (u_P̄^T Σ_x^{-1} u_P̄).    (31)

Also, for this case, the boundaries of the domains of attraction have the following property, which is evident from the figures.

THEOREM 6.1. Suppose that N = 3 and M = 2, with u_P̄ the normal vector to P. Then in u_P̄-space, the domains of attraction D_P and D_Q are bounded by planar rings with normal vectors

b_P = [(λ_1 - λ_2)^{1/2}  0  ±(λ_2 - λ_3)^{1/2}]^T

b_Q = [(1/λ_2 - 1/λ_1)^{1/2}  0  ±(1/λ_3 - 1/λ_2)^{1/2}]^T.

Proof. See Appendix.

Comparing the angles θ_P and θ_Q of b_P and b_Q to the vertical (e_3), we find that

cos θ_Q = (λ_1/λ_2)^{1/2} cos θ_P,    (32)

which confirms that b_Q is closer to e_3 than b_P is, so the boundaries of D_Q are more horizontal than those of D_P. It also appears from the diagrams that D_I and D_Q are identical, even though they were plotted from different functions. This is confirmed by the following slightly more general theorem.

THEOREM 6.2. Suppose that M = N - 1. Then there is a one-to-one relationship between I and S_Q, and consequently D_I = D_Q.

Proof. See Appendix.

We mentioned at the beginning of this section that for a single output M = 1, D_I and D_P are identical, so D_P = D_I ⊂ D_Q, and from Theorem 6.2, for M = N - 1 we now have D_P ⊂ D_I = D_Q. Preliminary work suggests that D_P ⊂ D_I may hold more generally for M > 1, which leads us to conjecture that for any 1 ≤ M ≤ N

D_P ⊆ D_I ⊆ D_Q    (33)

holds, allowing for set equality in the relations. Strict subsets may be possible for M = 2, N = 4.
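A small numerical sketch of this example (added here for illustration) represents P by its unit normal u_P̄ and checks the simplified expressions (29) and (31) against the defining traces of S_P and S_Q.

import numpy as np

lam = np.array([3.0, 2.0, 1.0])
Sigma = np.diag(lam)                    # Σ_x for the 3-input, 2-output example
Sigma_inv = np.linalg.inv(Sigma)

rng = np.random.default_rng(6)
u = rng.normal(size=3)
u /= np.linalg.norm(u)                  # unit normal u_P̄ representing the 2-D subspace P

P = np.eye(3) - np.outer(u, u)          # P = I_N - u_P̄ u_P̄^T
Q_bar = np.outer(u, u @ Sigma_inv) / (u @ Sigma_inv @ u)   # eqn (30): Q̄ = I_N - Q

S_P = np.trace((np.eye(3) - P) @ Sigma)
S_Q = np.trace(Q_bar @ Sigma)

assert np.isclose(S_P, u @ Sigma @ u)               # eqn (29)
assert np.isclose(S_Q, 1.0 / (u @ Sigma_inv @ u))   # eqn (31)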
7. CONCLUSIONS
We have considered the global convergence properties of a class of Oja-like Hebbian algorithms that either
perform principal component analysis or find the principal subspace of an input covariance matrix. The algorithms, which are of the form given in o.d.e. (7), are known to be locally stable only at the principal subspace (or principal components) solution. Initially, we showed that many of this class of algorithms, such as the original Oja (1982) PCA neuron, the Williams (1985) SEC algorithm, and the Oja and Karhunen (1985) SGA algorithm, ensure that the weight matrix W remains finite and full rank if it is so initially, and converges to an orthonormal set such that WW^T converges to I_M. We considered the behaviour of the orthogonal projection P into the subspace spanned by the rows of the weight matrix W. We identified a number of Lyapunov functions for convergence of P to the principal subspace
P*, namely the subspace projection error S_P, the least mean square reconstruction error S_Q, the greatest mean square reconstruction error S_O, and the mutual information I. However, in general, each of the domains of attraction D_Q ⊃ D_P ⊃ D_O and D_I for these Lyapunov functions excludes a significant proportion of P-space. By modifying the mutual information I to give what we term the principal subspace information I', we get a domain of attraction D_{I'}, which covers all of P-space except a lower-dimensional subset of P-space for which WP*W^T is singular. We conjecture that this is the maximal domain of attraction for P to P*. Therefore, if W is adapted according to the o.d.e. (7) such that WW^T remains finite and nonsingular, then P will asymptotically converge to the principal subspace P = P* from almost everywhere.
REFERENCES

Baldi, P., & Hornik, K. (1989). Neural networks and principal component analysis: Learning from examples without local minima. Neural Networks, 2, 53-58.
Cook, P. A. (1986). Nonlinear dynamical systems. Englewood Cliffs, NJ: Prentice Hall.
Földiák, P. (1989). Adaptive network for optimal linear feature extraction. Proceedings of the International Joint Conference on Neural Networks, IJCNN-89 (pp. 401-405). Washington, DC.
Hornik, K., & Kuan, C.-M. (1992). Convergence analysis of local feature extraction algorithms. Neural Networks, 5, 229-240.
Linsker, R. (1988). Self-organization in a perceptual network. IEEE Computer, 21(3), 105-117.
Oja, E. (1982). A simplified neuron model as a principal component analyser. Journal of Mathematical Biology, 15, 267-273.
Oja, E. (1989). Neural networks, principal components, and subspaces. International Journal of Neural Systems, 1(1), 61-68.
Oja, E. (1992). Principal components, minor components, and linear neural networks. Neural Networks, 5, 927-935.
Oja, E., & Karhunen, J. (1985). On stochastic approximation of the eigenvectors and eigenvalues of the expectation of a random matrix. Journal of Mathematical Analysis and Applications, 106, 69-84.
Plumbley, M. D. (1991). On information theory and unsupervised neural networks (Tech. Rep. CUED/F-INFENG/TR.78). Cambridge, UK: Cambridge University Engineering Department.
Plumbley, M. D. (1993). Efficient information transfer and anti-Hebbian neural networks. Neural Networks, 6, 823-833.
Plumbley, M. D., & Fallside, F. (1988). An information-theoretic approach to unsupervised connectionist models. In D. Touretzky, G. Hinton, & T. Sejnowski (Eds.), Proceedings of the 1988 Connectionist Models Summer School (pp. 239-245). San Mateo, CA: Morgan Kaufmann.
Sanger, T. D. (1989). Optimal unsupervised learning in a single-layer feedforward neural network. Neural Networks, 2, 459-473.
Strang, G. (1976). Linear algebra and its applications. New York: Academic Press.
Watanabe, S. (1985). Pattern recognition: Human and mechanical. New York: John Wiley & Sons.
Williams, R. J. (1985). Feature discovery through error-correction learning (ICS Rep. 8501). San Diego, CA: University of California.
NOMENCLATURE

N, M: number of inputs, outputs
x, y: input, output vector
x_i, y_i: ith input, output component
W: M × N weight matrix
Σ_x: covariance matrix of x [= E(xx^T)]
E(·): expectation
x_n, y_n, W_n, ...: x, y, W, ... at time step n
x^T, A^T: transpose of vector, matrix
K: weight decay matrix in Hebbian algorithms
η: update factor
diag(·): set off-diagonal matrix entries to zero
LT(·) [LT^+(·)]: make matrix (strictly) lower triangular
⟨·⟩: mean (in informal analysis)
t: continuous time
P: orthogonal projection [= W^T(WW^T)^{-1}W]
I_N, I_M: N-, M-dimensional identity matrix
x̂: orthogonal projection of x (= Px)
S, S(W): energy function
L, L(a): Lyapunov function of a
D: Lyapunov domain of attraction
c: upper limit for Lyapunov function defining D
a*: attractor for a
a^[2], a^[3], ...: equilibrium points for a
S_P (S_{P*}): subspace projection error (at P = P*)
|v|: length of vector v
Tr(A): trace of matrix A (= Σ_i a_ii)
λ_i: ith largest eigenvalue of Σ_x
P*: point of convergence for P (principal subspace of Σ_x)
S_P^[2]: second-lowest equilibrium value of S_P
D_P: domain of attraction defined by Lyapunov function S_P
||A||_F: Frobenius norm of A [= (Σ_ij a_ij^2)^{1/2}]
x̂ (Section 4.2): least mean square projection of x (= Qx)
Q: least mean square projection [= Σ_xW^T(WΣ_xW^T)^{-1}W]
S_Q: least mean square projection error
O: dual projection to Q (greatest MSE projection)
S_O: projection error due to O
I: mutual information
Σ_ν, σ_ν^2: noise covariance matrix, variance
det(·): determinant of matrix
I_{P*}: value of I at principal subspace P = P*
D_I: domain of attraction defined by Lyapunov function I
x, y, w, Δw: scalar input, output, weight, weight update
U*: N × M matrix such that U*(U*)^T = P*, (U*)^TU* = I_M
W = V_W G_W^{1/2} U_W^T: singular value decomposition (SVD) of W
I': principal subspace information
D_{I'}: domain of attraction defined by Lyapunov function I'
J_1, J_2: cost functions for convergence of WW^T
U_P: N × M matrix such that U_P U_P^T = P, U_P^T U_P = I_M
P̄, Q̄: complement projections (I_N - P), (I_N - Q)
e (e_i): eigenvector (ith eigenvector)
u_P̄: unit-length vector, if M = N - 1, such that u_P̄ u_P̄^T = P̄
b_P, b̄_P: boundary rings for S_P when N = 3, M = 2
b_Q, b̄_Q: boundary rings for S_Q when N = 3, M = 2
θ_P, θ_Q: angle of b_P, b_Q to the vertical
I_TOT, Ī: total, lost information such that I = I_TOT - Ī
Σ: some covariance matrix
R_P: rotation matrix (U_W = U_P R_P, R_P R_P^T = I_M)

APPENDIX: PROOFS OF THEOREMS
A.1. Preliminaries

The proofs in this appendix use a number of standard results concerning matrix expressions: for details see any standard linear algebra text, such as Strang (1976). Before proceeding with the proofs, we remind the reader of some of the results for real matrices we use here. Suppose that A = [a_ij] is an n × n square matrix. Then we say A is positive definite, written A > 0, if vAv^T > 0 for any nonzero n-vector v. This holds iff all its eigenvalues are positive. If A is positive definite, then it is certainly nonsingular because all of its eigenvalues are nonzero. If BB^T is nonsingular, then it is positive definite: it cannot have negative eigenvalues. If A is positive definite, and B is an m × n matrix such that BB^T is also positive definite, then BAB^T is positive definite. Two matrices A and B commute, that is, AB = BA, iff they have the same eigenvectors. If one (or both) of A or B has repeated eigenvalues [such as the projection P, which has M eigenvalues with value 1 and (N - M) with value 0], then it is sufficient that a set can be found that is identical to the set of eigenvectors of the other matrix. The trace of A, written Tr(A), is the sum of its diagonal entries, Σ_i a_ii, and satisfies Tr(A) = Tr(A^T). For any m × n matrix B and n × m matrix C, we have Tr(BC) = Tr(CB). The trace of a matrix is the sum of its eigenvalues, much as the determinant of a matrix det(A) is the product of its eigenvalues. Suppose now that A is positive definite. Then Tr(BAB^T) ≥ 0 for any m × n matrix B, with equality iff B is the m × n zero matrix. In particular, Tr(A) > 0, and Tr(BB^T) ≥ 0 with the same equality condition for the latter inequality. Trace is a linear map, so d/dt[Tr(A)] = Tr[d/dt(A)].

If A and B are two n × n square matrices, then det(AB) = det(A)det(B). The matrix exponentiation function is written

e^A = I + A/1! + A^2/2! + ...    (A1)

with log(B) = A defined if B = e^A is positive definite. We have e^{-A} = (e^A)^{-1}, and det e^A = e^{Tr A}, so log det B = Tr log B if B is positive definite. Derivatives of functions of matrices are perhaps not so familiar. However, it is very easy to verify the identity

Tr[d/dt(e^A)] = Tr(e^A dA/dt)    (A2)

simply by differentiating the expansion (A1) and rearranging terms within the trace to move the dA/dt terms to the end. In a similar way, we can show

Tr(dA/dt) = Tr(B^{-1} dB/dt)    (A3)

if B = e^A, so

d/dt(log det B) = d/dt(Tr log B) = Tr(B^{-1} dB/dt)    (A4)

if B is positive definite. Equation (A4) is particularly important for our treatment of mutual information functions.

A.2. Proof of Theorem 2.1

Because WW^T is initially finite and nonsingular, both WW^T and (WW^T)^{-1} are initially at a bounded distance from I_M. Consider the following squared Frobenius norm cost functions:

J_1 = ||I_M - WW^T||_F^2 = Tr[(I_M - WW^T)^2]    (A5)

and

J_2 = ||I_M - (WW^T)^{-1}||_F^2 = Tr{[I_M - (WW^T)^{-1}]^2}.    (A6)

Taking the time derivative of J_1 and substituting for eqns (7) and (11) when appropriate, we find

dJ_1/dt = -2 Tr{(I_M - WW^T)[W(dW^T/dt) + (dW/dt)W^T]}
        = -2 Tr[(I_M - WW^T)(2WΣ_xW^T - (KWW^T + WW^TK^T))]
        = -4 Tr[(I_M - WW^T)WΣ_xW^T(I_M - WW^T)]
        ≤ 0.    (A7)

Similarly, the time derivative of J_2 is given by

dJ_2/dt = -2 Tr([I_M - (WW^T)^{-1}]{-(WW^T)^{-1}[W(dW^T/dt) + (dW/dt)W^T](WW^T)^{-1}})
        = 2 Tr{[I_M - (WW^T)^{-1}](WW^T)^{-1}[2WΣ_xW^T - (KWW^T + WW^TK^T)](WW^T)^{-1}}
        = -4 Tr{[I_M - (WW^T)^{-1}](WW^T)^{-1}WΣ_xW^T[I_M - (WW^T)^{-1}]}
        = -4 Tr{W^T(WW^T)^{-1}[I_M - (WW^T)^{-1}]WΣ_xW^T[I_M - (WW^T)^{-1}](WW^T)^{-1}W}
        ≤ 0.    (A8)

Thus, the norms of both (I_M - WW^T) and [I_M - (WW^T)^{-1}] are bounded for all t by their initial values, so both WW^T and (WW^T)^{-1} remain finite, so WW^T remains finite and nonsingular for all t > 0. Finally, because Σ_x and WW^T are nonsingular for all t, WΣ_xW^T is also nonsingular for all t. Consequently, equality in eqn (A7) holds iff WW^T = I_M. Therefore, J_1 is a Lyapunov function for the convergence of WW^T to the point WW^T = I_M, with the domain of attraction consisting of all nonsingular WW^T.

A.3. Proof of Lemma 4.1

We have

S_P = E(|(I_N - P)x|^2)    (A9)
    = E{Tr[(I_N - P)xx^T(I_N - P)]}    (A10)
    = Tr[(I_N - P)Σ_x]    (A11)

using the identities Tr(AB) = Tr(BA) and (I_N - P)^2 = (I_N - P). Differentiating with respect to t and using the identity Tr(A) = Tr(A^T) we get
dS_P/dt = -2 Tr[(I_N - P)Σ_xPΣ_x]    (A12)
        = -2 Tr[(I_N - P)Σ_xPPΣ_x(I_N - P)]    (A13)
        ≤ 0    (A14)

so S_P is a nonincreasing² function of t. Now, equality in eqn (A14) holds iff (I_N - P)Σ_xP = 0, which is true iff PΣ_x = Σ_xP. So S_P is stationary iff P is the subspace spanned by some M eigenvectors of Σ_x. •

² With a little more manipulation it can be shown (Plumbley, 1991) that dS_P/dt = -Tr[(dP/dt)(dP/dt)^T], so the o.d.e. (14) represents a steepest descent search in P-space for a minimum of S_P.

A.4. Proof of Theorem 4.2

When P is the subspace spanned by some M of the eigenvectors of Σ_x, we have

S_P = Tr(Σ_x) - Σ_{j=1}^{M} λ_{i_j}    (A15)

where λ_{i_1} > ... > λ_{i_M} are the M eigenvalues of Σ_x selected by P. Thus, S_P is minimized when the M principal eigenvalues are selected by P, giving a minimal value of

S_{P*} = min_P S_P = Σ_{i=M+1}^{N} λ_i    (A16)

that is the sum of the N - M smallest eigenvalues of Σ_x. At the next lowest energy point S_P^[2] = min_P {S_P | P ≠ P*}, the second best projection P^[2] selects the subspace corresponding to the largest M - 1 eigenvalues and the (M + 1)st eigenvalue, so

S_P^[2] = λ_M + Σ_{i=M+2}^{N} λ_i = S_{P*} + λ_M - λ_{M+1}.    (A17)

If we therefore use c = S_{P*} + λ_M - λ_{M+1} in Theorem 3.1, the conditions for S_P to be a Lyapunov function with domain of attraction D_P are satisfied. •

A.5. Proof of Lemma 4.3

We proceed by a continuous induction method. We demonstrate (a) that a one-to-one mapping exists for a particular value of P, and (b) that any infinitesimal change dP (resp. dQ) in the variable P is a function of itself, Q (resp. P and Σ_x), and dQ (resp. dP) only.

Consider (a) first. Suppose that P is a subspace spanning some M eigenvectors of Σ_x, so that PΣ_x = Σ_xP. Using this with eqns (19), (20), and (21) as appropriate, we get

Q = QP = QPΣ_xΣ_x^{-1} = QΣ_xPΣ_x^{-1} = Σ_xQ^TPΣ_x^{-1} = Σ_xPΣ_x^{-1} = P    (A18)

so P and Q have a one-to-one relation (equality) for a particular value of P and Q. Conversely, suppose that Q = Q^T as would be the case in eqn (A18) above (because P = P^T). Then from eqns (19) and (20) we have

Q^TP = QP = Q    (A19)

so again P = Q. Furthermore, if P = Q (= Q^T), then

PΣ_x = QΣ_x = Σ_xQ^T = Σ_xP,    (A20)

which proves the final part of the lemma.

Now consider (b). Differentiating eqn (19) and rearranging we get

P dQ = dP(I_N - Q),    (A21)

which, postmultiplied by (I_N - P), gives us

P dQ(I_N - P) = dP(I_N - Q)(I_N - P) = dP(I_N - P)    (A22)

by substituting in eqn (20). Also, differentiating eqn (9) we get

0 = dP - P dP - dP P,    (A23)

which, adding dP to both sides and substituting in eqn (A22), gives

dP = dP(I_N - P) + (I_N - P)dP = P dQ(I_N - P) + (I_N - P)dQ^TP,    (A24)

which defines any infinitesimal change dP in P in terms of P and dQ only. For the converse direction, differentiating eqn (19) and rearranging gives us

P dQ = dP(I_N - Q),    (A25)

which, together with eqn (20), allows us to write

Q dQ = QP dQ = Q dP(I_N - Q).    (A26)

Also, differentiating eqn (21) for fixed Σ_x gives us

dQ Σ_x = Σ_x dQ^T    (A27)

so

dQ Q = dQ QΣ_xΣ_x^{-1} = Σ_x(Q dQ)^TΣ_x^{-1} = Σ_x(I_N - Q^T)dP Q^TΣ_x^{-1}.    (A28)

Finally, differentiating the relation Q = Q^2, and substituting in eqns (A26) and (A28), we can now write

dQ = dQ Q + Q dQ = Σ_x(I_N - Q^T)dP Q^TΣ_x^{-1} + Q dP(I_N - Q),    (A29)

which shows that any infinitesimal change dQ to Q is a function of Q, dP, and Σ_x only. •

A.6. Proof of Lemma 4.4

As for P, the o.d.e. (7) defines the behaviour of Q to be governed by the o.d.e.

dQ/dt = (I_N - Q)Σ_xQ + QΣ_x(I_N - Q),    (A30)

which is remarkable for being of identical form to the o.d.e. (14) for P. Using the identities Σ_xQ^T = QΣ_x and (I_N - Q)^2 = (I_N - Q), we can write

S_Q = Tr[(I_N - Q)Σ_x(I_N - Q^T)] = Tr[(I_N - Q)Σ_x]    (A31)
that, taking the time derivative and substituting in eqn (A30), gives us

dS_Q/dt = -Tr[(dQ/dt)Σ_x]
        = -2 Tr[Σ_x^{1/2}(I_N - Q)^TQΣ_xQ^T(I_N - Q)Σ_x^{1/2}]    (A32)
        ≤ 0

with equality iff (I_N - Q)^TQ = 0. But (I_N - Q)^TQ = 0 implies that Q = Q^TQ = Q^T, and also conversely because Q^2 = Q. Therefore, S_Q is stationary iff Q = Q^T is an orthogonal projection. If Q = Q^T then Q = QP = Q^TP = P, and conversely if Q = P then Q^T = P = Q, so Q = Q^T iff P = Q. From Lemma 4.3 this holds iff they are the subspace spanned by some M of the eigenvectors of Σ_x. •

A.7. Proof of Theorem 4.5

From Lemma 4.3, for any given Σ_x there is a one-to-one relationship between P and Q, so S_Q can be considered to be a function of P only for any given Σ_x. Also, from Lemma 4.4, whenever S_Q is stationary, we have P = Q, so S_Q = S_P. In particular, we have

min_P S_Q = min_P S_P = S_{P*}    (A33)

and

min_{P ≠ P*} S_Q = S_{P*} + λ_M - λ_{M+1}.    (A34)

If we therefore use c = S_{P*} + λ_M - λ_{M+1} in Theorem 3.1, the conditions for S_Q to be a Lyapunov function (for P) with domain of attraction D_Q are satisfied. Finally, note that because S_Q is the least mean squared reconstruction error, S_Q ≤ S_P, so

S_P < S_{P*} + λ_M - λ_{M+1}    (A35)

implies

S_Q < S_{P*} + λ_M - λ_{M+1}    (A36)

so D_Q ⊇ D_P. •

A.8. Proof of Lemma 4.6

Because Σ_x and WW^T are nonsingular for all t, WΣ_xW^T is also nonsingular for all t. Differentiating eqn (25) with respect to t and using the identity eqn (A4) gives us

dI/dt = Tr[(WΣ_xW^T)^{-1}WΣ_x(dW^T/dt) - (σ_ν^2WW^T)^{-1}σ_ν^2W(dW^T/dt)]
      = Tr[(WΣ_xW^T)^{-1}WΣ_x(I_N - P)(dW^T/dt)]
      = Tr[(WΣ_xW^T)^{-1}WΣ_x(dP/dt)W^T]    (A37)

using the identity (dP/dt)W^T = (I_N - P)(dW^T/dt). Substituting in eqn (14) we get

dI/dt = Tr[(WΣ_xW^T)^{-1}WΣ_x(I_N - P)Σ_xW^T]
      = Tr[(I_N - P)Σ_xW^T(WΣ_xW^T)^{-1}WΣ_x(I_N - P)]
      ≥ 0    (A38)

with equality iff (I_N - P)Σ_xW^T = 0, that is, iff (I_N - P)Σ_xP = 0, which holds iff P is the subspace spanned by some M eigenvectors of Σ_x. •

A.9. Proof of Lemma 4.7

Let us write a singular value decomposition (SVD) of W,

W = V_W G_W^{1/2} U_W^T    (A39)

where V_W is an M × M matrix such that V_WV_W^T = V_W^TV_W = I_M, G_W^{1/2} is an M × M diagonal matrix, and U_W is an N × M matrix such that U_W^TU_W = I_M. Also, because P has rank M we can decompose P into the product

P = U_P U_P^T    (A40)

where U_P is an N × M matrix such that U_P^TU_P = I_M (because P^2 = P). However, from the definition of P, we can also write

P = U_W G_W^{1/2}V_W^T(V_W G_W^{1/2}U_W^TU_W G_W^{1/2}V_W^T)^{-1}V_W G_W^{1/2}U_W^T = U_WU_W^T    (A41)

so for any U_P chosen in eqn (A40) we have

U_W = U_PR_P    (A42)

for some M × M matrix R_P such that R_PR_P^T = I_M. (Note that we cannot simply state that U_P and U_W are equal.) Now expanding I in terms of our SVD of W, we get

I = ½[log det(V_W G_W^{1/2}U_W^TΣ_xU_W G_W^{1/2}V_W^T) - log det(σ_ν^2 V_W G_W^{1/2}U_W^TU_W G_W^{1/2}V_W^T)]
  = ½[log det(G_W^{1/2}U_W^TΣ_xU_W G_W^{1/2}) - log det(σ_ν^2 G_W)]
  = ½[log det(G_W) + log det(U_W^TΣ_xU_W) - log σ_ν^{2M} - log det(G_W)]
  = ½[log det(R_P^TU_P^TΣ_xU_PR_P) - log σ_ν^{2M}]
  = ½[log det(U_P^TΣ_xU_P) - log σ_ν^{2M}]    (A43)

that is a function of P only, given Σ_x and σ_ν^2. •

A.10. Proof of Theorem 4.8

The mutual information at the stationary points is given by

I = ½ Σ_{j=1}^{M} log(λ_{i_j}/σ_ν^2)    (A44)

where λ_{i_1} > ... > λ_{i_M} are the eigenvalues of Σ_x selected by P as before, and will have maximum value

I_{P*} = max_P I = ½ Σ_{i=1}^{M} log(λ_i/σ_ν^2)    (A45)

when P* is the subspace spanned by the M principal eigenvectors of Σ_x. Thus, if I is to be used as a Lyapunov function, the identified domain of attraction will be the set D_I of points P such that

I > I^[2] = ½ Σ_{i=1}^{M-1} log(λ_i/σ_ν^2) + ½ log(λ_{M+1}/σ_ν^2) = I_{P*} - ½ log(λ_M/λ_{M+1}). •    (A46)
A.11. Proof of Lemma 5.1

Let us tentatively suppose that WP*W^T is nonsingular for all t. The N × N orthogonal projection P* is of rank M, so it can be expressed as P* = U*(U*)^T where U* is an N × M matrix. Therefore, WP*W^T = (WU*)(WU*)^T where (WU*) is an M × M nonsingular matrix. Differentiating eqn (27) with respect to t as in the proof of Theorem 4.8, we get

dI'/dt = Tr[(WP*W^T)^{-1}WP*(dP/dt)W^T].    (A47)

Substituting eqn (14) into eqn (A47) and noting that P*Σ_x = (P*)^2Σ_x = P*Σ_xP* because P* is a projection that shares its eigenvectors with Σ_x, we get

dI'/dt = Tr[(WP*W^T)^{-1}WP*(I_N - P)Σ_xW^T]
       = Tr[(WP*W^T)^{-1}WP*Σ_xP*W^T] - Tr[(WP*W^T)^{-1}WP*W^T(WW^T)^{-1}WΣ_xW^T]
       = Tr{[(U*)^TW^T]^{-1}(WU*)^{-1}(WU*)(U*)^TΣ_xU*[(U*)^TW^T]} - Tr(PΣ_x)
       = Tr(P*Σ_x) - Tr(PΣ_x)
       = Tr[(I_N - P)Σ_x] - Tr[(I_N - P*)Σ_x]
       = S_P - S_{P*}
       ≥ 0    (A48)

with equality iff P = P*, because S_{P*} is the unique minimum of S_P. It remains to prove that WP*W^T is nonsingular for all t. From the proof of Theorem 2.1, we know that the norm of [I_M - (WW^T)^{-1}] for all t is bounded above by its initial value. Therefore, the smallest eigenvalue of WW^T is bounded below, so det(WW^T) is bounded below for all t. Now, WP*W^T is initially nonsingular, so log det(WP*W^T) is initially finite, giving a finite initial value for I'. The term WP*W^T cannot become singular without log det(WP*W^T) first going below the lower bound fixed by det(WW^T) and I' (which is nondecreasing as long as WP*W^T is nonsingular). Therefore, WP*W^T is forced to remain nonsingular for all t. •

A.12. Proof of Theorem 5.2

I' has no secondary stationary points that concern us, so we only need ensure that WP*W^T is initially nonsingular to find our domain of attraction. This condition is equivalent to the condition for I' to be finite, that is, I' > -∞, so the domain of attraction is D_{I'} = {P | I' > -∞}. However, it is probably clearer to write this more simply in terms of the nonsingularity of WP*W^T, as D_{I'} = {P | WP*W^T is nonsingular}. •

A.13. Proof of Theorem 6.1

Considering first the boundaries of D_P, suppose without loss of generality that the input covariance is diagonal, equal to

Σ_x = diag(λ_1, λ_2, λ_3).    (A49)

The boundary condition that S_P = λ_2 is equivalent to

0 = u_P̄^T diag(λ_1 - λ_2, 0, -(λ_2 - λ_3)) u_P̄    (A50)
  = (u_P̄ · b_P)(u_P̄ · b̄_P)    (A51)

where

b_P = [(λ_1 - λ_2)^{1/2}  0  (λ_2 - λ_3)^{1/2}]^T    (A52)

b̄_P = [(λ_1 - λ_2)^{1/2}  0  -(λ_2 - λ_3)^{1/2}]^T    (A53)

by superposition of the antisymmetric forms of the expression. Therefore, S_P = λ_2 iff u_P̄ · b_P = 0 or u_P̄ · b̄_P = 0, giving the two planes we are looking for. In terms of the projection P itself, on the boundary of D_P one of either b_P or b̄_P must be in the plane of P. A similar argument gives a related result for D_Q, with

b_Q = [(1/λ_2 - 1/λ_1)^{1/2}  0  (1/λ_3 - 1/λ_2)^{1/2}]^T    (A54)

b̄_Q = [(1/λ_2 - 1/λ_1)^{1/2}  0  -(1/λ_3 - 1/λ_2)^{1/2}]^T    (A55)

as the two vectors normal to the boundary ring planes. •

A.14. Proof of Theorem 6.2

First, let us hypothesize that, for any finite invertible input covariance matrix Σ, we have

I_TOT(Σ) = I(Σ) + Ī(Σ)    (A56)

where

2 I_TOT(Σ) = log det(Σ)
2 I(Σ) = log det(U_P^TΣU_P)
2 Ī(Σ) = log S_Q = -log(u_P̄^TΣ^{-1}u_P̄).

The proof proceeds by continuous induction: that is, we proceed to show that eqn (A56) is satisfied for a particular value of Σ, and that the derivatives of both sides with respect to an infinitesimal offset dΣ of Σ are equal. Consider first Σ = I_N. Clearly I_TOT(I_N) = log det I_N = 0 and I(I_N) + Ī(I_N) = ½ log det(U_P^TU_P) - ½ log(u_P̄^Tu_P̄) = 0, so eqn (A56) is satisfied at Σ = I_N. Now differentiating with respect to Σ, and using the identity (A4), for the derivative of I_TOT we get

2 d[I_TOT(Σ)] = Tr(Σ^{-1}dΣ)    (A57)

and for I we get

2 d[I(Σ)] = Tr[(U_P^TΣU_P)^{-1}U_P^T dΣ U_P] = Tr(Σ^{-1}Q dΣ)    (A58)

and for Ī we get

2 d[Ī(Σ)] = -Tr[(u_P̄^TΣ^{-1}u_P̄)^{-1}u_P̄^T(-Σ^{-1}dΣΣ^{-1})u_P̄]
          = Tr[Σ^{-1}u_P̄(u_P̄^TΣ^{-1}u_P̄)^{-1}u_P̄^TΣ^{-1}dΣ] = Tr(Σ^{-1}Q̄ dΣ).    (A59)

Combining eqns (A58) and (A59) we get

2 d[I(Σ)] + 2 d[Ī(Σ)] = Tr[Σ^{-1}(Q + Q̄)dΣ] = Tr(Σ^{-1}dΣ) = 2 d[I_TOT(Σ)]    (A60)

so the hypothesis is proved. Because I_TOT is not a function of P, and we have I = I_TOT - Ī = I_TOT - ½ log S_Q, there is a one-to-one relation between I and S_Q, so D_I = D_Q. •