Neurocomputing 47 (2002) 35–57
www.elsevier.com/locate/neucom
ε-insensitive Hebbian learning

Colin Fyfe, Donald MacDonald
Applied Computational Intelligence Research Unit, The University of Paisley, Paisley, UK
E-mail addresses: [email protected] (C. Fyfe), [email protected] (D. MacDonald)
Abstract

One of the major paradigms for unsupervised learning in artificial neural networks is Hebbian learning. The standard implementations of Hebbian learning are optimal under the assumption of i.i.d. Gaussian noise in a data set. We derive ε-insensitive Hebbian learning based on minimising the least absolute error in a compressed data set and show that the learning rule is equivalent to the principal component analysis (PCA) networks' learning rules under a variety of conditions. We then show that the extension of the ε-insensitive rules to nonlinear PCA (NLPCA) gives learning rules which find the independent components of a data set. Finally, we show that ε-insensitive minor components analysis is extremely effective at noise removal. © 2002 Elsevier Science B.V. All rights reserved.

Keywords: Hebbian learning; Noise; Independent components
1. Introduction

The basis of many unsupervised learning rules is Hebbian learning, which is so-called after Hebb [13] who conjectured "When an axon of cell A is near enough to excite a cell B and repeatedly or persistently takes part in firing it, some growth process or metabolic change takes place in one or both cells such that A's efficiency, as one of the cells firing B, is increased". This paper investigates a novel implementation of Hebbian learning. Neural nets which use Hebbian learning are characterised by making the activation of a unit depend on the sum of the weighted activations which feed into the unit. They use a learning rule for these weights which depends on the strength of
the simultaneous activation of the sending and receiving neuron. These conditions are usually written as
$$y_i = \sum_j w_{ij} x_j \qquad (1)$$
and
$$\Delta w_{ij} = \eta x_j y_i, \qquad (2)$$
the latter being the Hebbian learning mechanism. Here $y_i$ is the output from neuron $i$, $x_j$ is the $j$th input, and $w_{ij}$ is the weight from $x_j$ to $y_i$. $\eta$ is known as the learning rate and is usually a small scalar which may change with time. We see that the learning mechanism says that if $x_j$ and $y_i$ fire simultaneously, then the weight of the connection between them will be strengthened in proportion to their strengths of firing. Substituting Eq. (1) into Eq. (2), we can write the Hebb learning rule as
$$\Delta w_{ij} = \eta x_j \sum_k w_{ik} x_k = \eta \sum_k w_{ik} x_k x_j. \qquad (3)$$
It is this last equation which gives Hebbian learning its ability to identify the correlations in a data set. Now it is well known that the simple Hebbian rule above is unstable in that it causes the weights to increase without bound. Thus, weight change rules which have an inbuilt decay term have been developed. The most significant of these have been those developed by Oja and colleagues [19–21], which have been shown not only to cause the weights to converge but to converge so that the network is performing a principal component analysis (PCA) of the data set. It is well known that PCA is the best linear compression of a data set in that it minimises the mean square error between the compressed data and the original data set. An alternative definition is that PCA provides the linear basis of the data set that captures most variance. This basis is formed from the eigenvectors of the covariance matrix of the data set in order of largest eigenvalues. In Section 2, we describe a negative feedback implementation of Oja's rules and derive a new ε-insensitive form of Hebbian learning. In Section 3, we experiment with the linear version of the rule and show that it performs an approximation to principal component analysis and that it is more robust in the presence of shot noise; we also illustrate the effectiveness of the method in an anti-Hebbian rule and in a topology preserving network. Section 4 extends the network to nonlinear activation functions and we show its effectiveness as a method of finding the independent sources in data sets composed of mixtures of these sources. Finally, we show that ε-insensitive minor components analysis is extremely effective at noise removal.
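To make the instability claim concrete, the following is a minimal sketch (our own illustration, not from the paper) of the simple Hebbian rule (1)–(2) applied to samples from a correlated two-dimensional Gaussian; the covariance matrix, learning rate and sample count are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Correlated zero-mean two-dimensional Gaussian data (illustrative covariance).
C = np.array([[3.0, 1.0],
              [1.0, 1.0]])
X = rng.multivariate_normal(np.zeros(2), C, size=5000)

w = rng.normal(scale=0.1, size=2)     # weights of a single output neuron
eta = 0.001                           # learning rate

for t, x in enumerate(X, start=1):
    y = w @ x                         # Eq. (1): weighted sum of the inputs
    w += eta * y * x                  # Eq. (2): simple Hebbian update
    if t % 1000 == 0:
        print(t, np.linalg.norm(w))   # the norm grows without bound

print(w / np.linalg.norm(w))          # the direction aligns with the leading eigenvector of C
```

The weight direction identifies the dominant correlation in the data while its length diverges, which is the behaviour that motivates the decaying rules of Oja and the negative feedback network of the next section.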
2. The negative feedback network and cost functions

Fig. 1 shows the network which we have shown to perform a PCA [8]: the data is fed forward from the input neurons (the x-values) to the output neurons. Here the weighted summation of the activations is performed and this is fed back via the same weights and used in the simple Hebbian learning procedure. Consider a network with N-dimensional input data and having M output neurons. Then the ith output neuron's activation is given by
$$y_i = \mathrm{act}_i = \sum_{j=1}^{N} w_{ij} x_j \qquad (4)$$
with the same notation as before. This firing is fed back through the same weights as inhibition to give
$$x_j(t+1) \leftarrow x_j(t) - \sum_{k=1}^{M} w_{kj} y_k, \qquad (5)$$
where we have used $(t)$ and $(t+1)$ to differentiate between activation at times $t$ and $t+1$. Now simple Hebbian learning between input and output neurons gives
$$\Delta w_{ij} = \eta_t y_i x_j(t+1) = \eta_t y_i \left( x_j(t) - \sum_{l=1}^{M} w_{lj} y_l \right), \qquad (6)$$
where $\eta_t$ is the learning rate at time $t$. This network has previously been shown in [24,25,9] to perform a principal component analysis (PCA). This network actually only finds the subspace spanned by the principal components; we can find the actual principal components by introducing some asymmetry into the network [9].
Fig. 1. The negative feedback network. Activation transfer is fed forward, summed, and returned as inhibition. Adaptation is performed by simple Hebbian learning.
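As a concrete illustration of Eqs. (4)–(6), here is a minimal sketch (our own, not code from the paper) of the negative feedback subspace network; the input dimension, number of outputs, variances, learning-rate schedule and iteration count are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)

N, M = 5, 3                                 # inputs, output neurons
stds = np.array([0.1, 0.2, 0.4, 0.8, 1.6])  # zero-mean Gaussians, variance increasing with index
W = rng.normal(scale=0.1, size=(M, N))

n_iter = 5000
for t in range(n_iter):
    eta = 0.01 * (1 - t / n_iter)           # learning rate annealed to 0
    x = rng.normal(size=N) * stds
    y = W @ x                               # Eq. (4): feedforward activation
    e = x - W.T @ y                         # Eq. (5): residual after feedback
    W += eta * np.outer(y, e)               # Eq. (6): simple Hebbian learning on the residual

print(np.round(W, 3))  # rows approximately span the principal subspace of the data
```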
We have previously [11] introduced nonlinearity after the feedback operation using
$$\Delta w_{ij} = \eta_t f(y_i) x_j(t+1) = \eta_t f(y_i) \left( x_j(t) - \sum_{l=1}^{M} w_{lj} y_l \right)$$
and related the resulting network to exploratory projection pursuit. We will in Section 4 be more interested in the addition of nonlinearity before the feedback,
$$y_i = f(\mathrm{act}_i) = f\left( \sum_{j=1}^{N} w_{ij} x_j \right) \qquad (7)$$
followed by (5) and (6). Karhunen and Joutsensalo [15] have shown that learning rule (6) may be derived as an approximation to gradient descent on the mean squared magnitude of the residuals (5). We can use the residuals (5) after feedback to define a general cost function associated with this network as
$$J = f_1(\mathbf{e}) = f_1(\mathbf{x} - W\mathbf{y}), \qquad (8)$$
where in the above $f_1 = \|\cdot\|^2$, the (squared) Euclidean norm. It is well known (e.g. [23,1]) that, with this choice of $f_1(\cdot)$, the cost function is minimised with respect to any set of samples from the data set on the assumption of i.i.d. Gaussian noise on the samples. It can be shown that, in general (e.g. [23]), the minimisation of J is equivalent to minimising the negative log probability of the error or residual, e, which may be thought of as the noise in the data set. Thus, if we know the probability density function of the residuals, we may use this knowledge to determine the optimal cost function. We will in Section 4 discuss data sets whose inherent noise is rather more kurtotic than Gaussian (Table 6 gives an example). An approximation to these density functions is the (one-dimensional) function
$$p(e) = \frac{1}{2+\varepsilon} \exp(-|e|_\varepsilon), \qquad (9)$$
where
$$|e|_\varepsilon = \begin{cases} 0 & \forall |e| < \varepsilon, \\ |e - \varepsilon| & \text{otherwise} \end{cases} \qquad (10)$$
with $\varepsilon$ a small scalar $\geq 0$. Using this model of the noise, the optimal $f_1(\cdot)$ function is the ε-insensitive cost function
$$f_1(e) = |e|_\varepsilon. \qquad (11)$$
Therefore, when we use this function in the (nonlinear) negative feedback network we get the learning rule
$$\Delta W \propto -\frac{\partial J}{\partial W} = -\frac{\partial f_1(\mathbf{e})}{\partial \mathbf{e}} \frac{\partial \mathbf{e}}{\partial W},$$
which gives us the learning rules
$$\Delta W_{ij} = \begin{cases} 0 & \text{if } \left| x_j - \sum_k W_{kj} y_k \right| < \varepsilon, \\ \eta\, y_i\, \mathrm{sign}\!\left( x_j - \sum_k W_{kj} y_k \right) = \eta\, y_i\, \mathrm{sign}(e) & \text{otherwise,} \end{cases} \qquad (12)$$
since $\partial |e|_\varepsilon / \partial e = \mathrm{sign}(e)$ outside the insensitive zone and 0 inside it.
We see that this is a simplification of the usual Hebb rule, using only the sign of the residual rather than the residual itself in the learning rule. We will find that in the linear case (Section 3) allowing ε to be zero gives generally as accurate results as nonzero values, but the special data sets in Section 4 require nonzero ε because of the nature of their innate noise.

2.1. Is this a Hebbian rule?

The immediate question to be answered is "does this learning rule qualify as a Hebbian learning rule, given that the term has a specific connotation in the artificial neural networks literature?". We may consider e.g. covariance learning [17] to be a different form of Hebbian learning but at least it still has the familiar product of inputs and outputs as a central learning term. The first answer to the question is to compare what Hebb wrote with the equations above. We see that Hebb is quite open about whether the pre-synaptic or the post-synaptic neuron would change (or both) and how the mechanism would work. Indeed it appears that Hebb considered it unlikely that his conjecture could ever be verified (or falsified) since it was so indefinite [7]. Secondly, it is not intended that this method ousts traditional Hebbian learning but that it coexists as a second form of Hebbian learning. This is biologically plausible: "Just as there are many ways of implementing a Hebbian learning algorithm theoretically, nature may have more than one way of designing a Hebbian synapse" [2]. Indeed the suggestion that long term potentiation "seems to involve a heterogeneous family of synaptic changes with dissociable time courses" seems to favour the coexistence of multiple Hebbian learning mechanisms which learn at different speeds. We now demonstrate that this new simplified Hebbian rule performs an approximation to principal component analysis in the linear case and finds independent components when we have nonlinear activation functions.

3. ε-insensitive Hebbian learning

In this section we will use the linear neural network
$$y_i = \mathrm{act}_i = \sum_{j=1}^{N} w_{ij} x_j,$$
$$x_j(t+1) \leftarrow x_j(t) - \sum_{k=1}^{M} w_{kj} y_k,$$
$$\Delta w_{ij} = \begin{cases} 0 & \text{if } |x_j(t+1)| < \varepsilon, \\ \eta\, y_i\, \mathrm{sign}(x_j(t+1)) = \eta\, y_i\, \mathrm{sign}(e) & \text{otherwise} \end{cases}$$
and show that the network converges to the same values to which the more common PCA rules [20,22] converge.

3.1. Principal component analysis

To demonstrate PCA, we use the ε-insensitive rules on artificial data. When we use the basic learning rules (6) on Gaussian data, we find an approximation to a PCA being performed. The weights shown in Table 1 are from an experiment in which the input data was chosen from zero mean Gaussians in which the first input has the smallest variance, the second the next smallest and so on. Therefore, the first principal component direction is a vector with zeros everywhere except in the last position, which will be a 1, so identifying the filter which minimises the mean square error. In our experiment, we have three outputs and five inputs; the weight vectors have converged to an orthonormal basis of the principal subspace spanned by the first three principal components: all weights to the inputs with least variance are an order of magnitude smaller than those to the three inputs with most variance. This experiment used 5000 presentations of samples from the data set, ε = 0.1, and the learning rate was initially 0.01 and was annealed to 0 during these 5000 iterations.

Table 1
The subspace spanned by the first three principal components is captured after only 5000 iterations. ε = 0.1

 2.0749643e-003    6.1149565e-002   -3.9888122e-001    3.6858096e-001   -8.3912233e-001
 2.8296234e-002    7.7505112e-003    8.5505936e-001    4.7678573e-001   -1.9534208e-001
 5.0981820e-003   -2.5554618e-002    3.2719504e-001   -7.9844131e-001   -5.0683308e-001

Oja's subspace rule may be transformed into a PCA rule by using deflationary techniques [22]:
$$x_j(t+1) \leftarrow x_j(t) - w_{kj} y_k \quad \forall j,$$
$$\Delta w_{kj} = \eta_t y_k x_j(t+1) \quad \forall j \qquad (13)$$
for k = 1, 2, .... Thus, the feedforward rule is as before but feedback and learning occur for each output neuron in turn. Similarly, we may find the actual principal components by using a deflationary rule with ε-insensitive Hebbian learning:
$$x_j(t+1) \leftarrow x_j(t) - w_{kj} y_k \quad \forall j,$$
$$\Delta w_{kj} = \eta_t y_k\, \mathrm{sign}(x_j(t+1)) \quad \forall j \qquad (14)$$
for k = 1, 2, ....
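The deflationary ε-insensitive rule (14) can be sketched as follows (our own illustration, not the authors' code); the data distribution, ε and learning-rate schedule imitate the kind of experiment reported in the tables below, but the exact values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)

N, M = 5, 3
stds = np.array([0.1, 0.2, 0.4, 0.8, 1.6])   # variance increases with input index
W = rng.normal(scale=0.1, size=(M, N))
eps = 0.1

n_iter = 5000
for t in range(n_iter):
    eta = 0.1 * (1 - t / n_iter)
    x = rng.normal(size=N) * stds
    y = W @ x                                # feedforward as before
    e = x.copy()
    for k in range(M):                       # deflation: feedback and learning per output in turn
        e -= W[k] * y[k]
        update = np.where(np.abs(e) < eps, 0.0, np.sign(e))
        W[k] += eta * y[k] * update          # Eq. (14): epsilon-insensitive Hebbian step

print(np.round(W, 3))  # rows approximate the individual principal components
```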
Table 2
The actual principal components are captured after only 5000 iterations. ε = 0.5

-1.7574163e-002    3.3526187e-002    2.6317708e-002    4.9533961e-002    1.0068893e+000
-1.7960927e-002   -2.0583177e-002    2.1191622e-002   -1.0037621e+000    6.8503537e-003
-4.1927978e-003    3.6645443e-002   -9.9189108e-001   -3.5904231e-002    6.6094374e-002
Table 3
The actual principal components are almost found after only 1000 iterations. ε = 0

-5.1776166e-003    4.5376865e-002   -2.3840385e-003    1.2658529e-002    1.0019231e+000
-3.6847661e-002    1.0566274e-002   -1.3372074e-001    9.9144360e-001    1.6136625e-003
-1.1309247e-002   -8.1029262e-003    9.9036232e-001    1.4140185e-001   -1.4256455e-003
Table 2 shows the results when five dimensional data of the same type as before was used as input data, the learning rate was 0.1 decreasing to 0, and ε was 0.1. These results were taken after only 5000 iterations. The convergence is very fast: a typical set of results from the same data is shown in Table 3, where the simulation was run over only 1000 presentations of the data (ε = 0 in this case). We may expect that, since the learning rule is insensitive to the magnitude of the input vectors x, the rule is less sensitive to outliers than the usual rule based on mean square error. Fig. 2 shows that the PCA properties of the deflationary ε-insensitive network (Eq. (14)) are unaffected by the addition of noise from a uniform distribution in [−10, 10] to the last input in 30% of the presentations of the input data (Table 4). In comparison, the bottom figure shows that the standard Sanger network (Eq. (13)) responds to this noise, as one would expect (Table 5). We note that this need not be a good thing; however, in the context of real biological neurons we may wish each individual neuron to ignore high magnitude shot noise and so the ε-insensitive rule may be optimal. Finally, the insertion of a differentiated term for each output neuron in the calculation of the residual (as in [21]) also causes convergence to the actual principal components but was found to be two orders of magnitude slower than the deflationary technique described above.

3.2. Astronomical data

As an illustration of ε-insensitive Hebbian learning on real data we use astronomical data sets.

3.2.1. The Cetin data set

We first use a data set derived by Cetin [3] which consists of 16 sets of spectra in which each set contains 32 × 32 spectra. The projections of this data set on the first two principal components are shown in Fig. 3 and its projections onto the first
Fig. 2. The top figure shows the convergence of an ε-insensitive network to the third PC. The bottom figure shows the convergence of the standard Generalised Hebbian Algorithm from Sanger. Both convergences were troubled by high amplitude shot noise.

Table 4
The 30% outliers are ignored by the ε-insensitive rule

 2.5339642e-002   -3.9976042e-002    4.7171348e-002    3.3109733e-002    9.9560379e-001
 1.5587732e-002    1.4625496e-002    9.7532991e-003   -1.0013772e+000    3.2922597e-002
-4.8078854e-002   -6.4830431e-002    9.9738211e-001    2.1815735e-002   -4.5000940e-002
Table 5
The standard Sanger rule finds the noise irresistible

 6.5436571e-002   -4.2781770e-002   -3.5680891e-002    5.9717521e-002   -9.9664319e-001
 6.9542470e-002   -1.4830770e-002    1.7929844e-002   -9.9627047e-001   -7.1398709e-002
-9.5947374e-001   -1.7583321e-003   -2.1653710e-001   -9.4726964e-002   -1.1765086e-001
Fig. 3. Projection of the Cetin data set on the first two principal components found by Sanger's Generalised Hebbian Algorithm.
two filters found by the ε-insensitive Hebbian rule are shown in Fig. 4. Interestingly, we see the clusters are very much more compact and rather more isolated from one another in the latter case.¹ We conjecture that some clusters contain outliers which pull the principal component projections away from the optimal (with respect to viewing the clusters) directions, whereas the ε-insensitive rule does not give so much weight to these outliers.

¹ This result has been repeated a number of times with different initial conditions, learning rates, etc., and is rather close to the projections obtained when using the technique of exploratory projection pursuit.
Fig. 4. Projection of the Cetin data set on the first two filters found by the ε-insensitive PCA method.
3.2.2. The Lunar Crater volcanic field remote sensing data

The data used in this section is from a visible-near infra-red spectrum (0.4–2.5 μm), 224 band, 30 m/band AVIRIS image of the Lunar Crater volcanic field (LCVF), Nevada. LCVF is one of NASA's test sites in which many comprehensive studies have been conducted [5]. Fig. 5 is a natural color composite of the LCVF used in [18], with 23 cover classes marked by their respective class labels. The AVIRIS (airborne visible infra-red imaging spectrometer) image is 614 × 512 pixels covering an area of 10 × 12 km². The material includes volcanic cinder cones (class A) and weathered derivatives of them, including ferric oxide rich soils (L,M,W), basalt flows of various ages (F,G,I), a dry lake divided into two halves consisting of sandy (D) and clayey (E) composition; a small rhyolitic outcrop (B); vegetation in the lower
Fig. 5. The actual true materials in the LCVF data set.
left corner (J), and along washes (C). There is also alluvial material (H), dry (N,O,P,U) and wet (Q,R,S,T) playa outwash with sediments of various clay content, and other sediments (V) in depressions in the mountain slopes. Basalt cobblestones are strewn around the playa (K). A long NW-SE trending scarp (G) in the lower left corner borders the vegetated area. Merenyi has classified the data using a hybrid self-organising map and back propagation network. It is with the results from this network that we compare the results from PCA and ε-insensitive PCA. Fig. 6 shows the Merenyi results. We cannot expect PCA, a linear technique, to cluster the spectra as accurately as the SOM, which is a nonlinear technique, or the back propagation network, which has the advantage of being supervised, but PCA and ε-insensitive PCA are actually able to identify features missed by Merenyi's investigation [18]. For the purposes of classification the outputs of the PCA networks were binned and then compared with the previous study's results [18]. Each bin was assigned the class from the classified image of the most common spectra found in that bin. This results in the outputs from the PCA networks having the same classes as the image in Fig. 7. New classes were created when the output of the network identified two or more different bins to represent the same class. This occurs when the cinder cones (A) are identified by the PCA network to be made of three different, but similar, materials. PCA of the LCVF image shows quite different results: the network is incapable of identifying the dry playa outwash (N) in the bottom right of the image and
Fig. 6. The result of the classification produced by Merenyi [18] using hybrid Self Organising Map and backpropagation networks.
Fig. 7. PCA on LCVF image taken by AVIRIS sensor. The network was trained for 10 000 000 epochs and a learning rate of 0.001 was used.
actually identifies it as basalt flow (I). Similar misclassification occurs with the alluvial material (H) and weathered ferric oxide soils (L) in the top centre of the image, which are incorrectly identified as basalt flow (I). An interesting feature
Fig. 8. ε-insensitive PCA on LCVF image taken by AVIRIS sensor. The network was trained for 10 000 000 epochs with a learning rate of 0.001 and ε equal to 0.2.
highlighted by PCA is the three different bins used by the PCA network to uniquely identify cinder cones (A). This could reveal an oversimplification of the classes used in [18]. ε-insensitive PCA of the LCVF image shows a much better identification of classes than PCA (Fig. 8). The ε-insensitive PCA network has been able to identify the dry playa outwash (N,O) in the bottom right of the image more clearly than PCA. The alluvial material (H) and ferric oxide rich soils (L) in the area at the top centre of the image are identified much more clearly. Also the area of the ferric oxide rich soils (W, L) and basalt flow (I) at the top left is much more clearly defined, although both networks miss the ferric oxide rich soils (M), PCA identifying most of the area as basalt flow (I). Finally, ε-insensitive PCA still identifies the three bins as cinder cones (A), and shows the transition from one to the other and finally to ferric oxide rich soil (L). Overall, ε-insensitive PCA gives a better representation of the data than PCA, which has large areas throughout the image identified as basalt flow (I) instead of classes (C, H, M, N, O, V, W), and although ε-insensitive PCA incorrectly identifies some classes, PCA makes more misidentifications across more classes and over a larger area.

3.3. Anti-Hebbian learning

Now the ε-insensitive rule was derived in the context of the minimisation of a specific function of the residual. It is perhaps of interest to enquire whether similar
rules may be used in other forms of Hebbian learning. We investigate this using anti-Hebbian learning. Földiák [6] has suggested a neural net model which has anti-Hebbian connections between the output neurons. The equations which define its dynamical behaviour are
$$y_i = x_i + \sum_{j=1}^{N} w_{ij} y_j.$$
In matrix terms, we have $\mathbf{y} = \mathbf{x} + W\mathbf{y}$ and so $\mathbf{y} = (I - W)^{-1}\mathbf{x}$. He shows that, with the familiar anti-Hebbian rule,
$$\Delta w_{ij} = -\eta y_i y_j \quad \text{for } i \neq j,$$
the outputs y are decorrelated. Now the matrix W must be symmetric and have nonzero entries only off the diagonal, i.e. if we consider only a two input, two output net,
$$W = \begin{pmatrix} 0 & w \\ w & 0 \end{pmatrix}. \qquad (15)$$
However, the ε-insensitive anti-Hebbian rule is nonsymmetrical and so, if $w_{ij}$ is the weight from $y_i$ to $y_j$, we have
$$\Delta w_{ij} = \begin{cases} -\eta y_j\, \mathrm{sign}(y_i) & \text{if } |y_i| > \varepsilon, \\ 0 & \text{otherwise.} \end{cases} \qquad (16)$$
To test the method, we generated two-dimensional input vectors, x, where each element is drawn independently from N(0,1) and then added another independently drawn sample from N(0,1) to both elements. This gives a data set with sample covariance matrix (10000 samples) of
$$\begin{pmatrix} 1.9747 & 0.9948 \\ 0.9948 & 1.9806 \end{pmatrix}. \qquad (17)$$
The covariance matrix of the outputs y (over the 10000 samples) from the network trained using the ε-insensitive learning rule is
$$\begin{pmatrix} 1.8881 & -0.0795 \\ -0.0795 & 1.1798 \end{pmatrix}. \qquad (18)$$
We see that the outputs are almost decorrelated. It is interesting to note that
• the asymmetrical learning rules have resulted in nonequal variance on the outputs,
• but the covariance (off-diagonal) terms are equal, as one would expect.
It is our finding that the outputs are always decorrelated but the final values on the diagonals (the variances) are impossible to predict and seem to depend on the actual values seen in training, the initial conditions, etc.
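A minimal sketch of the ε-insensitive anti-Hebbian rule (16) on data generated as described above (our own illustration; for simplicity it uses the feedforward decorrelating form y = (I + W)x mentioned in the following paragraph, and ε, the learning rate and the sample count are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(3)

# Two-dimensional data: independent N(0,1) elements plus a common N(0,1) sample added to both.
n = 10000
X = rng.normal(size=(n, 2)) + rng.normal(size=(n, 1))

W = np.zeros((2, 2))        # anti-Hebbian weights; the diagonal stays at zero
eta, eps = 0.01, 0.1

for x in X:
    y = x + W @ x           # feedforward decorrelating form y = (I + W)x
    for i in range(2):
        for j in range(2):
            if i != j and abs(y[j]) > eps:
                # Eq. (16): asymmetric epsilon-insensitive anti-Hebbian update
                W[i, j] -= eta * y[i] * np.sign(y[j])

Y = X + X @ W.T             # outputs over the whole sample
print(np.round(np.cov(Y, rowvar=False), 4))   # the off-diagonal terms shrink towards zero
```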
A feedforward decorrelating network, $\mathbf{y} = (I + W)\mathbf{x}$, may also be created with the ε-insensitive anti-Hebbian rule, with similar results.

3.4. Topology preserving maps

We have also used the negative feedback network to create topology preserving maps [10] with somewhat different properties from Kohonen's SOM [16]. The feedforward stage is as before but now we invoke a competition between the output neurons and the neuron with greatest activation wins. The winning neuron, the pth, is deemed to be maximally firing (= 1) and all other output neurons are suppressed (= 0). Its firing is then fed back through the same weights to the input neurons as inhibition,
$$x_j(t+1) \leftarrow x_j(t) - w_{pj} \cdot 1 \quad \text{for all } j, \qquad (19)$$
where p is the winning neuron. Now the winning neuron excites those neurons close to it, i.e. we have a neighbourhood function $\Lambda(p, j)$ which satisfies $\Lambda(p, j) \le \Lambda(p, k)$ for all $j, k$ such that $\|p - j\| \ge \|p - k\|$, where $\|\cdot\|$ is the Euclidean norm. Typically, we use a Gaussian whose radius is decreased during the course of the
Fig. 9. A comparison of the Kohonen network and the scale invariant network on artificial data drawn uniformly from the square {(x, y): −1 ≤ x ≤ 1, −1 ≤ y ≤ 1}.
simulation. Then simple Hebbian learning gives
$$\Delta w_{ij} = \eta_t \Lambda(p, i)\, x_j(t+1) \qquad (20)$$
$$= \eta_t \Lambda(p, i)\,(x_j(t) - w_{pj}). \qquad (21)$$
The somewhat different properties of this mapping from the Kohonen mapping may be seen in Fig. 9: the Kohonen SOM spreads out evenly across the data set while, with the scale invariant mapping, each output neuron captures a slice of the inputs of approximately equal angle. We may now report that if we use the ε-insensitive learning rule
$$\Delta w_{ij} = \begin{cases} 0 & \text{if } |x_j(t+1)| < \varepsilon, \\ \eta_t \Lambda(p, i)\, \mathrm{sign}(x_j(t+1)) & \text{otherwise,} \end{cases}$$
convergence to the same type of mapping is also achieved. As before, the resultant mapping is found more quickly and is more robust against shot noise.
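A minimal sketch (our own, with arbitrary parameter choices) of the scale invariant map trained with the ε-insensitive rule above; a line of output neurons with a Gaussian neighbourhood Λ is assumed, and the neighbourhood radius and learning rate are annealed during training.

```python
import numpy as np

rng = np.random.default_rng(4)

M = 10                                    # output neurons arranged in a line
W = rng.normal(scale=0.1, size=(M, 2))
eps = 0.1
n_iter = 20000

for t in range(n_iter):
    x = rng.uniform(-1, 1, size=2)                        # data from the square of Fig. 9
    p = int(np.argmax(W @ x))                             # winner: greatest activation
    e = x - W[p]                                          # Eq. (19): residual after feedback of the winner
    sigma = 3.0 * (1 - t / n_iter) + 0.5                  # shrinking neighbourhood radius
    eta = 0.1 * (1 - t / n_iter)
    Lam = np.exp(-((np.arange(M) - p) ** 2) / (2 * sigma ** 2))   # Gaussian neighbourhood Lambda(p, i)
    update = np.where(np.abs(e) < eps, 0.0, np.sign(e))   # epsilon-insensitive residual
    W += eta * Lam[:, None] * update[None, :]

# The weight vectors point in different directions; each captures a slice of the data by angle.
print(np.round(np.degrees(np.arctan2(W[:, 1], W[:, 0])), 1))
```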
4. Finding independent components

In this section we introduce a nonlinear activation function:
$$y_i = f(\mathrm{act}_i) = f\left( \sum_{j=1}^{N} w_{ij} x_j \right),$$
$$x_j(t+1) \leftarrow x_j(t) - \sum_{k=1}^{M} w_{kj} y_k,$$
$$\Delta w_{ij} = \begin{cases} 0 & \text{if } |x_j(t+1)| < \varepsilon, \\ \eta\, y_i\, \mathrm{sign}(x_j(t+1)) = \eta\, y_i\, \mathrm{sign}(e) & \text{otherwise,} \end{cases} \qquad (22)$$
where f(·) is a nonlinear activation function. The ε-insensitive learning rule is effectively quantising the inputs, a nonlinear operation. However, it has been pointed out that the ordinary Hebb rule in a linear network is biologically implausible since the effect of two small inputs is the same as that of one large input. Now the ε-insensitive Hebb rule removes this from those inputs which give the smallest residual (where the residual is less than ε) but nevertheless the effect is retained for moderate sized inputs. A nonlinearity in the learning rule removes this. In addition, we will use two data sets where the noise is inherently nonGaussian. We will demonstrate that the nonlinear ε-insensitive Hebbian rule finds the independent components of data comprising mixtures of sources.

4.1. Bars data

A major problem still with the analysis of visual scenes is the identification of the independent components which go into making up the complete visual scene. The benchmark problem in this area, proposed by Földiák [6], consists of identifying the individual bars from a set of input data which contains a mixture of bars
Fig. 10. Samples of input data presented to the network.

Table 6
Approximate probabilities that the given number of pixels will be firing in a given bar. Probabilities for 4–7 pixels (not shown) are less than 0.05

Number of pixels    0      1      2      3      8
Probability         0.3    0.34   0.17   0.05   0.125
(Fig. 10 shows examples). The data set consists of an 8 × 8 square grid containing a random mixture of horizontal and vertical bars. The input value $x_i = 1$ if a square is black and is 0 otherwise, and the black bars are chosen randomly such that each of the 16 possible bars is chosen with a fixed probability of 1/8, independent of the others. Note that there are (nearly) $2^{16}$ possible patterns in the data set, compared with the $2^{64}$ patterns which could be represented on the 8 × 8 grid. Now we wish to represent this data set, with no loss of information, but to do so in a way which reveals the underlying causes of the data set. At one extreme, we might have a totally distributed representation in which each of the $2^{16}$ possible patterns is identified when a particular output neuron fires, i.e. we have almost $2^{16}$ output neurons with only one firing at any one time. The opposite extreme is the compact coder such as the PCA coder which gives us a global compression of the data. We aim for a middle way: the bars are the underlying causes of the input data and so we wish the output neurons to identify precisely one of the horizontal or vertical bars. Thus, the ensemble of output neurons will, acting together, be able to recreate the input pattern. Now noise is inherent in this data set, caused by the presence of other bars. Consider a single horizontal bar composed of eight pixels. If it is not present and a vertical bar is present, the pixel in the horizontal bar corresponding to the vertical bar will be blackened. Thus, there is a nonzero probability (Table 6) that some activation will be passed forward through the weights to the neuron responding to the horizontal bar even when the horizontal bar is not present. We see from Table 6 that there is a significant probability that one or two pixels from a bar will be activated even when the bar is not present. This indeed was the problem which motivated ε-insensitive Hebbian learning: the probability density function
$$p(e) = \frac{1}{2+\varepsilon} \exp(-|e|_\varepsilon) \qquad (23)$$
is an approximation to the above densities.
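The statistics in Table 6 can be checked with a short simulation (our own sketch, not from the paper): each of the 16 possible bars is drawn independently with probability 1/8 and we count how many pixels of one particular horizontal bar are black.

```python
import numpy as np

rng = np.random.default_rng(5)

size, p = 8, 1.0 / 8.0
n_samples = 200000
counts = np.zeros(size + 1)

for _ in range(n_samples):
    horiz = rng.random(size) < p      # which horizontal bars are present
    vert = rng.random(size) < p       # which vertical bars are present
    # Pixels firing in the positions of horizontal bar 0: all eight if that bar
    # is present, otherwise one for every vertical bar crossing it.
    k = size if horiz[0] else int(vert.sum())
    counts[k] += 1

print(np.round(counts / n_samples, 3))   # empirical counterpart of Table 6
```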
Fig. 11. Each row represents the weights into one output neuron. The eight horizontal and eight vertical lines have been found.
We have previously shown [4] that imposing a constraint of nonnegativity on the weights (or outputs) of the negative feedback network with standard Hebbian learning changes the PCA network to a principal factor analysis network capable of identifying the individual bars (something the PCA network cannot do). The rectification on the outputs is a special case of the nonlinear activation passing rule (7); we have similar results to those reported herein when we use continuous functions which approximate the rectification.

4.2. Extracting the independent causes

Now we consider the ε-insensitive learning rule with the nonnegative weight constraint: whenever the rule has the effect of making a particular weight negative we set that weight to zero. This does not necessarily constrain its future growth. We have also used a network which rectifies the outputs, i.e. in (22), f(t) = t if t > 0, f(t) = 0 for t ≤ 0, with similar results. The weights in a trained network on an 8 × 8 grid (i.e. 16 bars) are shown in Fig. 11: each row of the diagram shows the learned weight vector into one output neuron; horizontal lines are found as eight contiguous nonzero weights, vertical lines as eight nonzero weights each separated from its neighbours by 7 zero-magnitude weights. The learning rate was 0.01, decreasing to 0 over the course of the simulation, which took 50 000 iterations. ε must be greater than 0 and best results are obtained with ε between 0.1 and 0.5.
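As an illustration of the procedure just described, the following is a sketch (ours, not the authors' code) of the nonlinear ε-insensitive rule (22) with rectified outputs and the nonnegative weight constraint on the bars data; the parameter values are arbitrary choices within the ranges mentioned in the text.

```python
import numpy as np

rng = np.random.default_rng(6)

size, M = 8, 16
N = size * size
W = np.abs(rng.normal(scale=0.05, size=(M, N)))
eps = 0.2
n_iter = 50000

def random_bars():
    """One training pattern: each of the 16 bars present with probability 1/8."""
    grid = np.zeros((size, size))
    for i in range(size):
        if rng.random() < 1 / 8:
            grid[i, :] = 1.0          # horizontal bar i
        if rng.random() < 1 / 8:
            grid[:, i] = 1.0          # vertical bar i
    return grid.ravel()

for t in range(n_iter):
    eta = 0.01 * (1 - t / n_iter)
    x = random_bars()
    y = np.maximum(W @ x, 0.0)        # rectified outputs, f(u) = u for u > 0, else 0
    e = x - W.T @ y                   # residual after feedback
    update = np.where(np.abs(e) < eps, 0.0, np.sign(e))
    W += eta * np.outer(y, update)    # Eq. (22): nonlinear epsilon-insensitive rule
    np.maximum(W, 0.0, out=W)         # nonnegative weight constraint

# After training, each row of W should resemble a single horizontal or vertical bar.
print(np.round(W[0].reshape(size, size), 2))
```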
4.3. Extraction of sinusoids

A slightly different model of mixture data has been widely used in recent years and the technique for finding the independent factors has been called independent component analysis (ICA) because it is seen as an extension of PCA. The problem may be set up as follows: let there be N independent nonGaussian signals ($s_1, s_2, \ldots, s_N$) which are mixed using a (square) mixing matrix A to get N vectors, $x_i$, each of which is an unknown mixture of the independent signals,
$$\mathbf{x} = A\mathbf{s}. \qquad (24)$$
Then the aim is to retrieve the original input signals when the only information presented to the network is the unknown mixture of the signals. The weights in the network will be W such that
$$\mathbf{y} = W\mathbf{x}, \qquad (25)$$
where the elements of y are the elements of the original signal up to a permutation and scaling. The nonlinear extension of Oja's subspace algorithm [15,12]
$$\Delta w_{ij} = \eta \left( x_j f(y_i) - f(y_i) \sum_k w_{kj} f(y_k) \right) \qquad (26)$$
has been used to separate signals into their significant subsignals. It is our finding that any nonzero value of ε gives similar separation to the standard nonlinear PCA rule when separating one sinusoid from a mixture, but that ε = 0 fails to completely remove one sinusoid from the other two. Perhaps the most frequent use of ICA is in the extraction of independent voices from a mixture of voices. Now it has often been noted that individual voices are kurtotic [14] and it has been shown that the nonlinear PCA network with nonlinearities matching the signal's statistics can readily extract individual voices from a mixture. The ε-insensitive rule performs similarly; however, in the next section we report on a slightly different method for extraction of a voice in specific circumstances.

5. Noise reduction

One of the most interesting applications of ε-insensitive PCA is in the finding of the smaller magnitude components. Xu et al. [26] have shown that by changing the sign in the PCA learning rule we may find the minor components of a data set, those eigenvectors which have the smallest associated eigenvalues. We use this technique to extract a signal from a mixture of signals; we use three speech signals and mix them so that the power of the weakest signal constitutes a small fraction of the total power in the mixes. Table 7 gives an example of the relative power of each of the signals in each of the three mixtures.
Table 7
The relative proportion of each signal in each of the three mixes. Signal 3 is very much the weakest yet is readily recovered

            Signal 1    Signal 2    Signal 3
Mixture 1   86712       17539       1
Mixture 2   8468        11225       1
Mixture 3   66389       56827       1
We emphasise that we are using a linear network (and hence utilising the second order statistics of the data set) to extract the low power voice. The network's learning rules are
$$y_i = \mathrm{act}_i = \sum_{j=1}^{N} w_{ij} x_j,$$
$$x_j(t+1) \leftarrow x_j(t) - \sum_{k=1}^{M} w_{kj} y_k,$$
$$\Delta w_{ij} = \begin{cases} 0 & \text{if } |x_j(t+1)| < \varepsilon, \\ -\eta\, y_i\, \mathrm{sign}(x_j(t+1)) & \text{otherwise.} \end{cases}$$
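A minimal sketch of ε-insensitive minor components analysis with these rules (our own illustration; the synthetic sources, mixing matrix and annealing schedule are placeholders rather than the speech data used in the paper, and the sign-based update is not guaranteed to be numerically well behaved for all settings):

```python
import numpy as np

rng = np.random.default_rng(7)

# Three synthetic zero-mean sources; the third has far less power than the others.
n = 50000
t = np.arange(n)
S = np.vstack([np.sin(2 * np.pi * 0.010 * t),
               np.sign(np.sin(2 * np.pi * 0.003 * t)),
               1e-3 * rng.laplace(size=n)])
A = rng.normal(size=(3, 3))            # arbitrary mixing matrix
X = A @ S                              # the three observed mixtures

w = rng.normal(scale=0.1, size=3)      # single output for the minor component
eps = 0.0

for i, x in enumerate(X.T):
    eta = 0.1 * (1 - i / n)            # learning rate annealed to 0
    y = w @ x                          # feedforward activation
    e = x - w * y                      # residual after feedback
    update = np.where(np.abs(e) < eps, 0.0, np.sign(e))
    w -= eta * y * update              # minor components: sign of the Hebbian step reversed

recovered = w @ X                      # projection onto the (approximately) minor direction
print(np.round(w, 3))
```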
The mixture is such that humans cannot hear the third voice at all in any of the three mixtures and yet the MCA method reliably finds that signal with least power. The original third signal and its recovered form are shown in Figs. 12 and 13, where we use ε = 0 and learning rate 0.1. We note that a limitation of this method is that the signal to be recovered must be swamped by the noise: if there is a component of the noise which is lower power than the signal and also more kurtotic than the mixture, this will be
Fig. 12. The original signal.
Fig. 13. The recovered signal (bottom) using ε-insensitive minor component analysis.
recovered. However, there is one situation (and perhaps a frequently met one given the symmetry of our ears and surroundings) in which MCA may be useful: if the mixing matrix is ill-conditioned. Consider the mixing matrix
$$A = \begin{pmatrix} 0.8 & 0.4 \\ 0.9 & 0.4 \end{pmatrix}$$
whose determinant is −0.04. Its eigenvalues are 1.2325 and −0.0325. The major principal component will be the term which is constant across the two signals and so the smaller component will be the residual when this has been removed. Two voice signals were mixed using this matrix and the one recovered by the MCA method is shown in Fig. 14, which should be compared with the original in Fig. 12.

6. Conclusion

We have derived a slightly different form of Hebbian learning which we have shown capable of performing PCA type learning in a linear network. However, more importantly, in situations in which the noise is more kurtotic we have shown that the network readily identifies the independent components of the signal. Using this rule, we have recovered the underlying causes of a visual scene when
Fig. 14. The recovered signal from ε-insensitive MCA when the mixing matrix is ill-conditioned.
the mixtures are ORed versions of the causes; we have recovered the individual components of audio signals when these are linearly mixed; and we have extracted a single signal from an extremely noisy mixture using the method of minor components. Future work will concentrate on analysing the form of noise to be expected in real visual situations and creating cost functions which are optimal for these situations.

References

[1] C. Bishop, Neural Networks for Pattern Recognition, Clarendon Press, Oxford, 1995.
[2] T.H. Brown, S. Chatterji, Hebbian synaptic plasticity, in: The Handbook of Brain Theory and Neural Networks, MIT Press, Cambridge, MA, 1995, pp. 454–459.
[3] H. Cetin, D.W. Levandowski, Interactive classification and mapping of multi-dimensional remotely sensed data using n-dimensional probability density functions (npdf), Photogramm. Eng. Remote Sensing 57 (12) (1991) 1579–1587.
[4] D. Charles, C. Fyfe, Modelling multiple cause structure using rectification constraints, Network: Comput. Neural Systems 9 (1998) 167–182.
[5] M. Dale-Bannister, E.A. Guinness, S.H. Slavney, T.C. Stein, R.E. Arvidson, Archiving and distribution of geologic remote sensing field experiment data, Trans. Am. Geophysical Union 72 (17) (1991) 176–186.
[6] P. Földiák, Models of sensory coding, Ph.D. Thesis, University of Cambridge, 1992.
[7] Y. Frégnac, Hebbian synaptic plasticity: comparative and developmental aspects, in: M.A. Arbib (Ed.), The Handbook of Brain Theory and Neural Networks, MIT Press, Cambridge, MA, 1995.
[8] C. Fyfe, PCA properties of interneurons, in: From Neurobiology to Real World Computing, ICANN 93, 1993, pp. 183–188.
[9] C. Fyfe, Introducing asymmetry into interneuron learning, Neural Computation 7 (6) (1995) 1167–1181.
[10] C. Fyfe, A scale invariant feature map, Network: Comput. Neural Systems 7 (1996) 269–275.
[11] C. Fyfe, R. Baddeley, Non-linear data structure extraction using simple Hebbian networks, Biol. Cybernet. 72 (6) (1995) 533–541.
[12] M. Girolami, C. Fyfe, Kurtosis extrema and identification of independent components: a neural network approach, in: ICASSP-97, IEEE Conference on Audio, Speech and Signal Processing, 1997.
[13] D.O. Hebb, The Organisation of Behaviour, Wiley, New York, 1949.
[14] C. Jutten, J. Herault, Blind separation of sources, part 1: An adaptive algorithm based on neuromimetic architecture, Signal Processing 24 (1991) 1–10.
[15] J. Karhunen, J. Joutsensalo, Representation and separation of signals using nonlinear PCA type learning, Neural Networks 7 (1) (1994) 113–127.
[16] T. Kohonen, Self-Organising Maps, Springer, Berlin, 1995.
[17] R. Linsker, From basic network principles to neural architecture, Proceedings of the National Academy of Sciences, 1986.
[18] E. Merenyi, Quo Vadis computational intelligence: new trends and approaches in computational intelligence, Studies in Fuzziness and Soft Computing, Vol. 54, Physica Verlag, Heidelberg, 2000.
[19] E. Oja, A simplified neuron model as a principal component analyser, J. Math. Biol. 16 (1982) 267–273.
[20] E. Oja, Neural networks, principal components and subspaces, Int. J. Neural Systems 1 (1989) 61–68.
[21] E. Oja, H. Ogawa, J. Wangviwattana, PCA in fully parallel neural networks, in: I. Aleksander, J. Taylor (Eds.), Artificial Neural Networks, Vol. 2, North-Holland, Amsterdam, 1992.
[22] T.D. Sanger, Analysis of the two-dimensional receptive fields learned by the generalized Hebbian algorithm in response to random input, Biol. Cybernet. 63 (1990) 221–228.
[23] A.J. Smola, B. Schölkopf, A tutorial on support vector regression, Technical Report NC2-TR-1998-030, NeuroCOLT2 Technical Report Series, October 1998.
[24] R.J. Williams, Feature discovery through error-correcting learning, Technical Report 8501, Institute for Cognitive Science, University of California, San Diego, 1985.
[25] L. Xu, Least mean square error reconstruction principle for self-organizing neural-nets, Neural Networks 6 (5) (1993) 627–648.
[26] L. Xu, E. Oja, C.Y. Suen, Modified Hebbian learning for curve and surface fitting, Neural Networks 5 (1992) 441–457.