Neural Networks 11 (1998) 385–390
Neural networks letter
A unified algorithm for principal and minor components extraction

Tianping Chen^a,*, Shun-ichi Amari^b, Qin Lin^a

^a Department of Mathematics, Fudan University, Shanghai, China
^b RIKEN Brain Science Institute, Wako-shi, Saitama, Japan

Received 15 October 1997; accepted 10 January 1998
Abstract

Principal component and minor component extraction provide powerful techniques in many information-processing fields. However, with conventional algorithms, minor component extraction is much more difficult than principal component extraction. A unified algorithm which can be used to extract both principal and minor component eigenvectors is proposed. This 'unified' algorithm can extract the true principal components (eigenvectors) and, altered simply in sign, it can also serve as a true minor component extractor. This is of practical significance in neural network implementation. It is shown how the present algorithms are related to Oja's principal subspace algorithm, Xu's algorithm and the Brockett flow. It is also shown that the algorithms are based on the natural gradient ascent/descent method (a potential flow in a Riemannian space). © 1998 Elsevier Science Ltd. All rights reserved.

Keywords: Principal component extraction; Minor component extraction; Singular value decomposition; Dynamical system; Natural gradient
* Requests for reprints should be sent to Professor Tian-Ping Chen (fax: +8621-65648084; email: [email protected]). He is supported by the National Science Foundation of China. This work was completed when he visited the RIKEN Frontier Research Program, Hirosawa 2-1, Wako-shi, Saitama 351-01, Japan.

1. Introduction

In many information-processing systems, it is necessary to extract the main features inherent in complex, high-dimensional input data streams. Such 'dimensionality reduction' helps us eliminate information redundancy and allows further information processing and transmission through limited channels. One of the most general-purpose feature extraction techniques is principal component analysis (PCA). It can be shown that the kth principal component (PC) direction lies along the eigenvector corresponding to the kth largest eigenvalue of the input covariance matrix. On the other hand, the subspace spanned by the eigenvectors corresponding to the smallest eigenvalues is the minor subspace, which, in signal processing, usually represents the noise subspace. Minor component analysis (MCA), the counterpart of PCA, is also an important topic to investigate.

The single-neuron case (p = 1) was considered first by Amari (1977), where the gradient descent algorithm was used under the constraint |w|^2 = 1, w being the weight vector of the neuron. Oja (1982) gave an extensive study of the
PCA neuron. Several subspace extraction algorithms have so far been proposed (see Oja and Karhunen, 1985; Oja, 1992; Sanger, 1989; Xu, 1993; Chen, 1997). The differential equation corresponding to a typical principal subspace extraction algorithm (see Oja and Karhunen, 1985; Oja, 1992) can be formulated as

\dot{M}(t) = C M(t) - M(t) M^T(t) C M(t)   (1)
where M ∈ R^{n×p} (p ≤ n), C is the covariance matrix E[xx^T] of the signal x, and \dot{M} denotes (d/dt)M. This equation has been extensively studied by means of Lyapunov functions and the Riccati equation. In a very recent paper (Chen et al., 1997), it is proved that a necessary and sufficient condition for the system to extract the principal subspace is that the initial weight M(0) has a full-rank projection onto the principal subspace (see also Helmke and Moore, 1993). In this case, when t → ∞, M(t) converges to a matrix whose column vectors lie in the subspace spanned by the first p eigenvectors. Moreover, a sharp convergence rate is also given in Chen et al. (1997). However, the algorithm extracts only the principal subspace: information about the individual principal eigenvalues and eigenvectors disappears. Xu (1993) extended the subspace method of Eq. (1). His algorithm can be written as

\dot{Q}(t) = C Q D - Q D Q^T C Q   (2)
where D is a positive diagonal matrix, D = diag(d_1^2, ..., d_p^2) with d_1^2 > d_2^2 > ... > d_p^2 > 0, and Q is an element of a Stiefel manifold, that is, an n × p matrix whose columns are mutually orthogonal unit vectors. This algorithm can extract the true eigenvectors, and Q converges to the matrix whose columns are the p principal eigenvectors of C with unit norm. When we put M(t) = Q(t)D^{1/2}, Eq. (2) is rewritten as

\dot{M} = C M D - M M^T C M.   (2′)

It reduces to Oja's subspace method when D is equal to the identity matrix. Xu's algorithm is robust against outliers (Eguchi, personal communication). There are many research projects concerning PCA, and some succeed in extracting the true principal eigenvectors (Sanger, 1989; Oja, 1992; Xu, 1993). The minor component analyzer (MCA) was first proposed by Xu et al. (1992) in the single-neuron case. It was then generalized in Oja (1992) and Wang and Karhunen (1996).

From a completely different point of view, Brockett (1989) proposed an algorithm for diagonalizing a positive-definite matrix, described by the dynamical equation of isospectral flows. All the eigenvalues and eigenvectors are extracted by this method. The Brockett flow is obtained from a potential function as the natural or Riemannian gradient flow in the space of all orthogonal matrices. Xu (1993) elucidated the relation between his algorithm and the Brockett flow. The Brockett flow is related to the Lax form of completely integrable systems. Its relation to information geometry (Amari, 1985, 1997a) has also been suggested (Nakamura, 1993; Fujiwara and Amari, 1995).

It is desirable to design algorithms which have the following properties:

1. The algorithm can be executed in parallel.
2. The limiting solution gives the individual eigenvectors, not only the subspace.
3. The algorithm can extract principal components as well as minor components in a simple way.
4. The algorithm is robust.

In this paper, we propose the following algorithms
\dot{M}(t) = C M(t) M^T(t) M(t) - M(t) M^T(t) C M(t)   (3)

\dot{M}(t) = -{C M(t) M^T(t) M(t) - M(t) M^T(t) C M(t)}   (4)

The first one (Eq. (3)) can extract the principal components, while the second one (Eq. (4)) extracts the minor components. The related on-line learning algorithms are

\dot{m}_a(t) = ±{ Σ_b (m_a^T m_b)(m_b^T x) x - Σ_b (m_a^T x)(m_b^T x) m_b }   (5)

where m_a(t) is the ath column vector of M(t), representing the connection weights of the ath unit in the corresponding neural network. The learning algorithm is non-local. It is interesting that the only difference between the two algorithms Eq. (3) and Eq. (4) is the sign of the right-hand side. This simplicity is what we desired. In the rest of the paper, we shall show that algorithm Eq. (3) can be used to extract the principal eigenvectors and algorithm Eq. (4) can be used to extract the minor eigenvectors. Our algorithms are closely related to Xu's algorithm (Xu, 1993) and the Brockett flow (Brockett, 1989). We show that our algorithms are the natural or Riemannian gradient flow (Amari, 1997b) on a Stiefel manifold. We elucidate the relation between our algorithm and Xu's. The Brockett flow is obtained as a special case, and moreover the subspace algorithm of Oja and Karhunen (1985) is a degenerate special case. We finally explain why our method can extract the minor components as well as the principal components by merely reversing the sign, whereas the conventional algorithms cannot.

2. Integral invariants of the algorithms

We first show that

P(t) = M^T(t) M(t)   (6)

remains invariant in the dynamical process. Let m_a(t), a = 1, ..., p, be the column vectors of M(t). Then the components P_{ab} = m_a^T m_b are the inner products of the m's, which shows that the magnitudes and mutual angles of the m's are invariant. In other words, the p vectors m_a rotate together without changing their configuration.

Theorem 1. P(t) is invariant under the dynamical evolutions Eq. (3) and Eq. (4).

Proof.

(d/dt) M^T(t) M(t) = \dot{M}^T(t) M(t) + M^T(t) \dot{M}(t).   (7)

By substituting Eq. (3) or Eq. (4) into the above equation, we immediately have

(d/dt) {M^T(t) M(t)} = 0.   (8)

We next show other invariants. Let the singular value decomposition of M(t) be

M(t) = Q(t) D^{1/2}(t) W(t)   (9)

where Q(t) is an n × p matrix whose p columns are mutually orthogonal with unit norm, D is a positive diagonal matrix with diagonal entries d_1^2 ≥ d_2^2 ≥ ... ≥ d_p^2 ≥ 0, and W is a p × p orthogonal matrix. Then D^{1/2}(t) and W(t) are kept invariant in the dynamical processes Eq. (3) and Eq. (4). It is easy to show that

Q^T(t) Q(t) = I_p   (10)

W^T(t) W(t) = W(t) W^T(t) = I_p   (11)

where I_p is the p × p unit matrix.
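To make the preceding discussion concrete, the following short numerical sketch (ours, not part of the original paper; the matrix sizes, random seed and solver tolerances are arbitrary choices) integrates the flows (3) and (4) for a generic covariance matrix C, checks that P(t) = M^T(t)M(t) stays constant as stated in Theorem 1, and verifies that the left singular vectors of the converged M align with the principal (respectively minor) eigenvectors of C.

```python
import numpy as np
from scipy.integrate import solve_ivp

rng = np.random.default_rng(0)
n, p = 6, 2

# A symmetric positive-definite "covariance" matrix C with well-separated eigenvalues.
A = rng.standard_normal((n, n))
C = A @ A.T + np.diag(np.arange(1.0, n + 1.0))
eigval, E = np.linalg.eigh(C)                  # ascending eigenvalues, eigenvectors in columns

def unified_rhs(sign):
    # sign = +1 gives Eq. (3) (principal flow); sign = -1 gives Eq. (4) (minor flow).
    def rhs(t, m):
        M = m.reshape(n, p)
        return (sign * (C @ M @ (M.T @ M) - M @ (M.T @ (C @ M)))).ravel()
    return rhs

M0 = 0.5 * rng.standard_normal((n, p))
P0 = M0.T @ M0                                 # the invariant of Theorem 1

def run(sign, T=200.0):
    sol = solve_ivp(unified_rhs(sign), (0.0, T), M0.ravel(), rtol=1e-9, atol=1e-12)
    return sol.y[:, -1].reshape(n, p)

M_pca, M_mca = run(+1), run(-1)

# Theorem 1: M^T M should stay (numerically) constant along the flow.
print("max drift of M^T M:", np.abs(M_pca.T @ M_pca - P0).max())

# The left singular vectors of the converged M are the extracted eigenvectors.
U_pca = np.linalg.svd(M_pca, full_matrices=False)[0]
U_mca = np.linalg.svd(M_mca, full_matrices=False)[0]
# Each row below should contain a single entry close to 1 (alignment up to sign and order).
print("|U^T e|, principal flow:\n", np.round(np.abs(U_pca.T @ E[:, ::-1][:, :p]), 3))
print("|U^T e|, minor flow:\n", np.round(np.abs(U_mca.T @ E[:, :p]), 3))
```

The two runs differ only in a single sign, which is exactly the property claimed for the unified algorithm.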
Theorem 2. D(t) and W(t) are invariants of Eq. (3) and Eq. (4).

Proof. The invariant M^T(t)M(t) can be written as

M^T(t) M(t) = W^T(t) D(t) W(t).   (12)
Since this is a diagonalization of M^T M, its eigenvalues and (generalized) eigenvectors are uniquely determined and are the invariants D and W. In particular, if the singular value decomposition of the initial M(0) has distinct singular values, d_1^2 > d_2^2 > ... > d_p^2, then D(t) and W are uniquely determined. (It is possible to prove that they are invariants even when some of the d_i are equal.)

The above theorem shows that only Q(t) changes in the dynamical processes Eq. (3) and Eq. (4). We can rewrite Eq. (3) and Eq. (4) in terms of Q(t) by using M(t)M^T(t) = Q(t) D Q^T(t) and M(t) = Q(t) D^{1/2} W. The result is

\dot{Q}(t) = ± { C Q(t) D - Q(t) D Q^T(t) C Q(t) }.   (13)

The positive version is the same as Xu's algorithm (Xu, 1993) in terms of Q, and is the same as the Brockett flow when p = n.

We choose an initial M(0) such that W = I_p. A simple choice of M(0) is

M(0) = \begin{pmatrix} \mathrm{diag}(d_1, \ldots, d_p) \\ 0 \end{pmatrix}   (14)

that is, the n × p matrix whose upper p × p block is diag(d_1, ..., d_p) and whose remaining n - p rows are zero. Then, the invariant is M^T(t)M(t) = D.

It is interesting to note that Xu's algorithm (Eq. (2)) is derived by substituting Eq. (14) into Eq. (3). When p = n, this is equivalent to the Brockett algorithm, and when D = I_p the Oja subspace method (Eq. (1)) is derived.

In the manifold R^{n×p} of all n × p matrices, the dynamical systems Eq. (3) and Eq. (4) have invariant submanifolds

S_D = { M | M^T M = D }   (15)

such that their trajectories are included in S_D when M(0) is in S_D. Xu's algorithm (2) and our Eq. (3) are exactly the same on S_D. However, the entire dynamical behaviors of Eq. (3) and Eq. (4) differ from those of Eq. (2) and its sign-negated version on the entire manifold R^{n×p}. What differs is the dynamical stability of the submanifolds S_D, as will be proved later. This is the reason why Eq. (4) extracts the minor components whereas Eq. (2) does not when its sign is reversed.

3. Derivation of the algorithms by natural gradient

It is known that the Brockett flow is the gradient flow on the manifold SO(n) of all special orthogonal matrices (Brockett, 1991). We show this in our case (see also Xu, 1993). Let us put H = M M^T = Q D Q^T, and introduce f(M) = tr{HC} as the cost function to be maximized or minimized. We enlarge the n × p matrix Q to an n × n orthogonal matrix Q* by supplementing the missing columns, and also enlarge D to an n × n diagonal matrix D* by supplementing n - p zero entries on the diagonal. Then

tr{HC} = tr{Q D Q^T C} = tr{Q* D* Q*^T C}.   (16)

Taking the differential of f corresponding to an infinitesimal change of Q* to Q* + dQ*, we have

df = tr{dQ* Q*^T H C} + tr{C H Q* dQ*^T} = 2 tr{dX H C} = 2 tr{C H dX^T}   (17)

where dX = dQ* Q*^T. The differentials dX form a non-holonomic basis in the tangent space of the orthogonal matrices SO(n) at Q* (see Amari et al., 1997a, b). Taking Q* Q*^T = I_n into account, we have dX = -dX^T and

df = tr{(H C - C H) dX} = tr{[H, C] dX}   (18)

where [A, B] = AB - BA is the Lie bracket. Therefore, the gradient dynamics in terms of the non-holonomic quantity ΔX is given by

ΔX = [C, H].   (19)

This is rewritten as

ΔQ* = [C, H] Q*.   (20)
In terms of the differential equation, we have the Brockett flow,

\dot{Q}*(t) = [C, H(t)] Q*(t)   (21)

which is the gradient flow in the Riemannian space SO(n) with the Killing metric, that is, with the inner product ⟨dX, dX⟩ = tr(dX dX^T). Our problem is to extract p principal components (p < n). Because H = Q* D* Q*^T = Q D Q^T, by selecting the first p columns of Q* in Eq. (21) we obtain the dynamical equation in closed form in Q(t),

\dot{Q} = [C, H(t)] Q(t).   (22)

This is equivalent to

\dot{M} = [C, H(t)] M   (23)

which is the natural gradient ascent algorithm Eq. (3) maximizing the cost function f(M). If we reverse the sign of the algorithm, we obtain algorithm Eq. (4), which minimizes the cost function f(M).
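As an illustrative check of this section (again our own sketch with arbitrarily chosen sizes and weights, not the authors' code), the p = n case of Eq. (21) can be integrated directly: the flow should keep Q orthogonal and drive Q^T C Q to a diagonal matrix whose entries are the eigenvalues of C, ordered according to D.

```python
import numpy as np
from scipy.integrate import solve_ivp
from scipy.linalg import qr

rng = np.random.default_rng(1)
n = 5
lam_true = np.array([9.0, 6.0, 4.0, 2.5, 1.0])
E = qr(rng.standard_normal((n, n)))[0]
C = E @ np.diag(lam_true) @ E.T                 # symmetric matrix to be diagonalized
D = np.diag(np.arange(n, 0, -1.0))              # distinct weights d_1 > d_2 > ... > d_n > 0

def brockett(t, q):                             # Eq. (21): dQ/dt = [C, H] Q, with H = Q D Q^T
    Q = q.reshape(n, n)
    H = Q @ D @ Q.T
    return ((C @ H - H @ C) @ Q).ravel()

Q0 = qr(rng.standard_normal((n, n)))[0]         # random orthogonal initial condition
Qf = solve_ivp(brockett, (0.0, 30.0), Q0.ravel(),
               rtol=1e-9, atol=1e-12).y[:, -1].reshape(n, n)

print("orthogonality drift |Q^T Q - I|:", np.abs(Qf.T @ Qf - np.eye(n)).max())
L = Qf.T @ C @ Qf
print("largest off-diagonal of Q^T C Q:", np.abs(L - np.diag(np.diag(L))).max())
print("diagonal of Q^T C Q (ordered by D):", np.round(np.diag(L), 4))
print("true eigenvalues:", lam_true)
```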
4. Convergence analysis

We determine the equilibrium points of Eq. (3) and Eq. (4) on the manifold S_D and their stability. This part is common to Xu's algorithm. We first define the Stiefel manifold.

Proposition. The set

St(p, n) := { Q ∈ R^{n×p} | Q^T Q = I_p }

is the compact Stiefel manifold, of dimension pn - ½ p(p + 1). The tangent space T_Q at Q ∈ St(p, n) consists of the following set of n × p matrices:

T_Q = { y ∈ R^{n×p} | y^T Q + Q^T y = 0 }.

The dynamics Eqs. (3) and (4) are rewritten on S_D as

\dot{Q} = C Q D - Q D Q^T C Q   (24)

and

\dot{Q} = Q D Q^T C Q - C Q D   (25)

respectively. We assume here that C has distinct eigenvalues. The conclusions hold even when C has repeated eigenvalues, except that we then extract the eigensubspaces corresponding to the multiple eigenvalues; the proof is a little more complicated and will not be given in this paper.

Theorem 3. Let π be a permutation of the index set {1, ..., n}, and let e_i be the ith eigenvector of C. If Q(0) ∈ St(p, n) and both D and C have distinct eigenvalues, then:

1. Both Eq. (24) and Eq. (25) have n!/(n - p)! · 2^p equilibrium points Q_∞ = (±e_{π(1)}, ..., ±e_{π(p)}).
2. From any initial point, the solution of Eq. (24) or Eq. (25) converges to one of the equilibrium points.
3. System Eq. (24) has 2^p stable equilibrium points, (±e_1, ..., ±e_p), and from almost all initial points its solution converges to one of them. System Eq. (25) has 2^p stable equilibrium points, (±e_n, ±e_{n-1}, ..., ±e_{n-p+1}), and from almost all initial points its solution converges to one of them.

The proof is given in Appendix A.
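The following small experiment (a sketch under our own arbitrary choices of n, p, C and D, not from the paper) illustrates Theorem 3 for the flow (24): any matrix of the form (±e_{π(1)}, ..., ±e_{π(p)}) is an equilibrium, but a generic small perturbation of a non-principal equilibrium is driven to (±e_1, ..., ±e_p).

```python
import numpy as np
from scipy.integrate import solve_ivp

rng = np.random.default_rng(2)
n, p = 5, 2
lam = np.array([9.0, 6.0, 4.0, 2.5, 1.0])             # eigenvalues of C, descending
E = np.linalg.qr(rng.standard_normal((n, n)))[0]       # eigenvectors of C
C = E @ np.diag(lam) @ E.T
D = np.diag([2.0, 1.0])                                # d_1^2 > d_2^2 > 0

def flow24(t, q):                                      # Eq. (24): dQ/dt = CQD - Q D Q^T C Q
    Q = q.reshape(n, p)
    return (C @ Q @ D - Q @ D @ (Q.T @ C @ Q)).ravel()

Q_principal = E[:, [0, 1]]        # (e_1, e_2): predicted stable equilibrium
Q_other     = E[:, [2, 0]]        # (e_3, e_1): also an equilibrium, but an unstable one

print("|RHS| at (e_1, e_2):", np.linalg.norm(flow24(0.0, Q_principal.ravel())))
print("|RHS| at (e_3, e_1):", np.linalg.norm(flow24(0.0, Q_other.ravel())))

# A generic tiny perturbation of the non-principal equilibrium (re-orthonormalized so the
# initial point lies on St(p, n)) escapes and ends up at (+/- e_1, +/- e_2).
Q0 = np.linalg.qr(Q_other + 1e-3 * rng.standard_normal((n, p)))[0]
Qf = solve_ivp(flow24, (0.0, 60.0), Q0.ravel(),
               rtol=1e-9, atol=1e-12).y[:, -1].reshape(n, p)
print("|Qf^T (e_1, e_2)| (should be close to the identity):\n",
      np.round(np.abs(Qf.T @ E[:, [0, 1]]), 3))
```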
5. Stability of invariant submanifolds S_D

We have shown that our algorithms Eqs. (3) and (4) are derived as the natural gradient flows of f(M). The minor components are extracted in the same way as the principal components. We now show the reason why Xu's algorithm fails to extract the minor components while ours can.

Theorem 4. The submanifold S_D is a stable submanifold of Eq. (3) and Eq. (4) in the sense of Lyapunov.

Proof. Since S_D is an invariant submanifold for any D, when M(t) is perturbed to M(t) + dM(t), which belongs to S_{D+dD}, it stays in that submanifold and never diverges.

We now study what happens in Xu's algorithm (2′) when numerical perturbations occur, caused by rounding-off errors or otherwise.

Theorem 5. When D ≥ 0.5 I_p, M_∞ = Q_∞ D^{1/2} is stable under Xu's algorithm (2′), but is unstable when the sign of (2′) is reversed.

Proof. Let us put

M(t) = Q(t) D^{1/2}   (26)

and consider a small perturbation

M + dM = Q(t) { D^{1/2} + εZ }   (27)

where Z is a diagonal matrix and ε > 0 is small. We consider the stability at M_∞ = Q_∞ D^{1/2}. The variational equation of dM is

(d/dt) dM = δ( C M D - M D M^T C M )   (28)

where the variation δ is given by Eq. (27). We then have

(d/dt) dM = ε( C Q Z D - Q Z D M^T C M - M D Z Q^T C M - M D M^T C Q Z ).   (29)

Taking account of

Q_∞^T C Q_∞ = Λ   (30)

M_∞^T C M_∞ = Λ D   (31)

at M_∞ = Q_∞ D^{1/2}, the variational equation becomes

\dot{Z} = Λ D (I_p - 2D) Z - D^{3/2} Z Λ D^{1/2}.   (32)

The first term Λ D (I_p - 2D) on the right-hand side is negative when I_p < 2D, and the second term also acts negatively on Z. Hence, M_∞ is stable under perturbations. However, when the sign of (2′) is reversed, the variational equation is

\dot{Z} = -Λ D (I_p - 2D) Z + D^{3/2} Z Λ D^{1/2}   (33)

which diverges to infinity. Therefore, the solution M_∞ corresponding to the minor components is unstable, although it is stable when the orbit is restricted to S_D.

Remark I. If we control M(t) = Q(t) D^{1/2} such that the D-part is kept fixed, or Q(t) remains in St(p, n) exactly, the reversed Xu algorithm can extract the minor components as well.

Remark II. In our algorithm, perturbations in the W-part remain neutrally stable. This implies that, even when W(0) = I_p, a small perturbation dW remains in the solution,

M_∞ = Q_∞ D^{1/2} (I_p + dW).   (34)
The principal or minor subspace is invariant under this perturbation but the eigenvectors suffer from small errors.
This can be avoided by cancelling dW such that M^T(t)M(t) is diagonalized at each step.
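A rough numerical illustration of Theorem 5 follows (our sketch; the finite-difference Jacobian and all constants are our own choices, and Eq. (2′) is taken in the form \dot{M} = CMD - MM^T CM): at the minor solution M_∞ = Q_∞ D^{1/2}, the sign-reversed Xu flow has a mode with positive growth rate, whereas the unified minor flow (4) has none.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 5, 2
lam = np.array([9.0, 6.0, 4.0, 2.5, 1.0])
E = np.linalg.qr(rng.standard_normal((n, n)))[0]
C = E @ np.diag(lam) @ E.T
D = np.diag([2.0, 1.0])                     # satisfies D >= 0.5 I_p

M_inf = E[:, [4, 3]] @ np.sqrt(D)           # minor solution Q_inf D^{1/2}

def reversed_xu(M):                         # sign-reversed (2'): dM/dt = M M^T C M - C M D
    return M @ (M.T @ C @ M) - C @ M @ D

def minor_flow(M):                          # Eq. (4): dM/dt = M M^T C M - C M M^T M
    return M @ (M.T @ C @ M) - C @ M @ (M.T @ M)

def max_real_eig(field, M, h=1e-6):
    # Largest real part of the Jacobian of `field` at M, by forward finite differences.
    m0, f0 = M.ravel(), field(M).ravel()
    J = np.empty((m0.size, m0.size))
    for k in range(m0.size):
        m = m0.copy()
        m[k] += h
        J[:, k] = (field(m.reshape(n, p)).ravel() - f0) / h
    return np.linalg.eigvals(J).real.max()

print("reversed Xu, max Re(eig):", round(max_real_eig(reversed_xu, M_inf), 3))  # > 0: unstable
print("unified (4), max Re(eig):", round(max_real_eig(minor_flow, M_inf), 3))   # ~ 0: no unstable mode
```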
6. Conclusions

In this paper, we proposed a unified algorithm capable of both PCA and MCA by simply switching the sign of the same learning rule. The advantage of the unified algorithm is that the PCA version and the MCA version have identical structure. The algorithms are shown to be stable and to converge to the desired state from almost all initial conditions. Moreover, the algorithm extracts the principal/minor eigenvectors in the sense that the left singular vectors of the resultant weight matrix are exactly the principal/minor eigenvectors. We show the reason why the conventional algorithms fail to extract the minor components when the sign is reversed. Numerical experiments show that the algorithms are effective and that the theoretical analysis is valid.

Appendix A. Proof of Theorem 3

In order to prove the theorem, let us put

U = [e_1 ... e_n]   (A1)

where

U^T C U = Λ.   (A2)

We rewrite all the quantities in the basis of the e_i's. The quantities are transformed as

\tilde{M} = U^T M   (A3)

\tilde{Q} = U^T Q   (A4)

\tilde{C} = Λ   (A5)

\tilde{e}_i = U^T e_i.   (A6)

For the sake of brevity, we omit the tilde. Eq. (13) is now

\dot{Q} = ± { Λ Q D - Q D Q^T Λ Q }   (A7)

and e_i = [0 ... 1 ... 0]^T, where the 1 appears at the ith position.

1. It is easy to verify that all Q_∞ of the form given above are equilibrium points. On the other hand, at the equilibrium points we have

Λ Q_∞ D = Q_∞ D Q_∞^T Λ Q_∞.

By multiplying by Q_∞^T from the left, we get Q_∞^T Λ Q_∞ D = D Q_∞^T Λ Q_∞, which, under the assumption that D has distinct eigenvalues, indicates that Q_∞^T Λ Q_∞ is diagonal. This occurs only when Q_∞ takes the form (±e_{π(1)}, ..., ±e_{π(p)}), since Λ is assumed to have distinct eigenvalues. The sign '±' in the expression is determined by the initial state, and is immaterial because e_i and -e_i can be regarded as the same direction.

2. To prove the convergence of the solution, as in Brockett (1991), we introduce the cost function f(M). We have shown that

(d/dt) tr(HΛ) = tr([H, [H, Λ]] Λ) = tr((HΛ - ΛH)(HΛ - ΛH)^T) = ‖HΛ - ΛH‖_F^2 ≥ 0.   (A8)

So as Q evolves along Eq. (24), f(t) := tr(HΛ) increases monotonically; f(t) and f′(t) are clearly bounded from above (since they are combinations of Q, Λ and D, which are all bounded), so f(t) has a limit and (d/dt)f(t) goes to zero. Since Λ has distinct eigenvalues, (d/dt)f(t) = ‖HΛ - ΛH‖_F^2 vanishes if and only if H = Q D Q^T is diagonal, which occurs only when Q is one of the equilibrium points Q_∞ described in part 1. For different permutations π, the Q_∞ are isolated, so Q approaches one of them; this proves the convergence of the solution. As for the convergence of Eq. (25), the proof is almost the same, except that f(t) is monotonically decreasing and bounded from below in that case.

3. To prove this part, we linearize the system at each equilibrium point. The linearization of Eq. (24) around Q_∞ is given by

\dot{y} = Λ y D - y D Q_∞^T Λ Q_∞ - Q_∞ D y^T Λ Q_∞ - Q_∞ D Q_∞^T Λ y   (A9)

where y ∈ T_{Q_∞}, i.e. y^T Q_∞ + Q_∞^T y = 0. For Q_∞ = (±e_{π(1)}, ..., ±e_{π(p)}), this is equivalent to y_{π(1)...π(p)}^T + y_{π(1)...π(p)} = 0. This means that y_{π(1)...π(p)} is skew-symmetric, where y_{π(1)...π(p)} is the p × p submatrix of y composed of the π(i)th rows of y, i = 1, ..., p. So the tangent space T_{Q_∞} can be parameterized by the np - ½ p(p + 1) coordinates

y_{ij}:  i ∉ {π(1), ..., π(p)}, j = 1, ..., p,  and  i ∈ {π(1), ..., π(p)}, j = π^{-1}(i), ..., p.

With this set of coordinates, the linearized system Eq. (A9) can be rewritten as the decoupled set of differential equations

\dot{y}_{ij} = λ_i d_j y_{ij} - d_j λ_{π(j)} y_{ij} = d_j (λ_i - λ_{π(j)}) y_{ij},   i ∉ {π(1), ..., π(p)}, j = 1, ..., p,

\dot{y}_{ij} = λ_i d_j y_{ij} - d_j λ_{π(j)} y_{ij} - d_{π^{-1}(i)} λ_{π(j)} y_{π(j)π^{-1}(i)} - d_{π^{-1}(i)} λ_i y_{ij}
             = (λ_i d_j - d_j λ_{π(j)} + d_{π^{-1}(i)} λ_{π(j)} - d_{π^{-1}(i)} λ_i) y_{ij}
             = (d_j - d_{π^{-1}(i)})(λ_i - λ_{π(j)}) y_{ij},   i ∈ {π(1), ..., π(p)}, j = π^{-1}(i), ..., p.

In the last equality we used the skew-symmetry of y_{π(1)...π(p)}, i.e. y_{π(j)π^{-1}(i)} = -y_{ij}. So the eigenvalues of the linearized equation at Q_∞ = (±e_{π(1)}, ..., ±e_{π(p)}) take
the form

d_j (λ_i - λ_{π(j)}),   i ∉ {π(1), ..., π(p)}, j = 1, ..., p;
(d_j - d_{π^{-1}(i)})(λ_i - λ_{π(j)}),   i ∈ {π(1), ..., π(p)}, j = π^{-1}(i), ..., p,   (A10)

which are all non-zero, and are all negative if and only if π is the identity, i.e. π(i) = i, i = 1, ..., p. Therefore, by the stable manifold theorem of dynamical systems, we conclude that the domain of attraction of the 2^p asymptotically stable equilibrium points Q_∞ = (±e_1, ..., ±e_p) is an open and dense subset, which fills 'almost all' of St(p, n). The stable manifold theorem also shows that the convergence is exponential. For system Eq. (25), there are similar conclusions. This completes the proof of the theorem.
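As a quick sanity check of the Lyapunov argument in part 2 (an illustrative sketch with arbitrary sizes; C here plays the role of Λ, since the identity is basis independent), one can verify numerically that f = tr(HC) is non-decreasing along the flow (24) and that its time derivative equals ‖HC - CH‖_F^2.

```python
import numpy as np
from scipy.integrate import solve_ivp

rng = np.random.default_rng(4)
n, p = 6, 3
A = rng.standard_normal((n, n))
C = A @ A.T + np.diag(np.arange(1.0, n + 1.0))        # plays the role of Lambda
D = np.diag([3.0, 2.0, 1.0])

def flow24(t, q):                                      # Eq. (24)
    Q = q.reshape(n, p)
    return (C @ Q @ D - Q @ D @ (Q.T @ C @ Q)).ravel()

def f_of(q):                                           # f = tr(HC), H = Q D Q^T
    Q = q.reshape(n, p)
    return np.trace(Q @ D @ Q.T @ C)

Q0 = np.linalg.qr(rng.standard_normal((n, p)))[0]
ts = np.linspace(0.0, 20.0, 200)
sol = solve_ivp(flow24, (0.0, 20.0), Q0.ravel(), t_eval=ts, rtol=1e-9, atol=1e-12)
fs = np.array([f_of(y) for y in sol.y.T])
print("f(t) non-decreasing along (24):", bool(np.all(np.diff(fs) >= -1e-6)))

# The identity d/dt tr(HC) = ||HC - CH||_F^2, checked at the initial point.
Qdot0 = flow24(0.0, Q0.ravel()).reshape(n, p)
H0 = Q0 @ D @ Q0.T
Hdot0 = Qdot0 @ D @ Q0.T + Q0 @ D @ Qdot0.T
lhs = np.trace(Hdot0 @ C)
rhs = np.linalg.norm(H0 @ C - C @ H0, "fro") ** 2
print("d/dt tr(HC):", round(lhs, 6), "  ||[H,C]||_F^2:", round(rhs, 6))
```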
References

Amari, S. (1977). Neural theory of association and concept-formation. Biological Cybernetics, 26, 175–185.
Amari, S. (1985). Differential-Geometrical Methods in Statistics. Springer Lecture Notes in Statistics, 28. Springer, New York.
Amari, S. (1997a). Information geometry. In Nencka, H., Bourguignon, J.-P. (Eds.), Geometry and Nature. Contemporary Mathematics, 203. American Mathematical Society, Rhode Island, pp. 81–95.
Amari, S. (1997b). Natural gradient works efficiently. Neural Computation, 10, 251–276.
Amari, S., Chen, T.-P., & Cichocki, A. (1997). Stability analysis of learning algorithms for blind source separation. Neural Networks, 10, 1345–1351.
Amari, S., Chen, T.-P., & Cichocki, A. (1998). Nonholonomic orthogonal constraints in adaptive blind source separation. Neural Computation, submitted.
Brockett, R. W. (1989). Least square matching problems. Linear Algebra and its Applications, 122–124, 761–777.
Brockett, R. W. (1991). Dynamical systems that sort lists, diagonalize matrices, and solve linear programming problems. Linear Algebra and its Applications, 146, 79–91.
Chen, T.-P., Hua, Y., & Yan, W. Y. (1998). Global convergence of Oja's subspace algorithm for principal component extraction. IEEE Transactions on Neural Networks, 9, 58–67.
Chen, T.-P. (1997). Modified Oja's algorithms for principal subspace and minor subspace extraction. Neural Processing Letters, 5, 105–110.
Fujiwara, A., & Amari, S. (1995). Gradient systems in view of information geometry. Physica D, 80, 317–327.
Helmke, U., & Moore, J. B. (1993). Optimization and Dynamical Systems. Springer, New York.
Nakamura, Y. (1993). Completely integrable gradient systems on the manifold of Gaussian and multinomial distributions. Japan Journal of Industrial and Applied Mathematics, 10, 179–189.
Oja, E. (1982). A simplified neuron model as a principal component analyzer. Journal of Mathematical Biology, 15, 267–273.
Oja, E., & Karhunen, J. (1985). On stochastic approximation of the eigenvectors and eigenvalues of the expectation of a random matrix. Journal of Mathematical Analysis and Applications, 106, 69–84.
Oja, E. (1992). Principal components, minor components, and linear neural networks. Neural Networks, 5, 927–935.
Sanger, T. D. (1989). Optimal unsupervised learning in a single-layer linear feedforward network. Neural Networks, 2, 459–473.
Xu, L., Oja, E., & Suen, C. (1992). Modified Hebbian learning for curve and surface fitting. Neural Networks, 5, 441–457.
Xu, L. (1993). Least mean square error reconstruction principle for self-organizing neural nets. Neural Networks, 6, 627–648.
Wang, L., & Karhunen, J. (1996). A unified neural bigradient algorithm for robust PCA and MCA. International Journal of Neural Systems, 7, 53–67.