
Neurocomputing 10 (1996) 71-81

Self-organizing feature map with a momentum term

Masafumi Hagiwara

Department of Electrical Engineering, Faculty of Science and Technology, Keio University, 3-14-1 Hiyoshi, Kohoku-ku, Yokohama 223, Japan

Received 21 February 1994; accepted 4 August 1994

Abstract

The objectives of this paper are to derive a momentum term in Kohonen's self-organizing feature map algorithm theoretically and to show the effectiveness of the term by computer simulations. We derive the self-organizing feature map algorithm having the momentum term through the following assumptions: (1) The cost function is E^n = Σ_μ α^{n-μ} E_μ, where E_μ is the modified Lyapunov function originally proposed by Ritter and Schulten at the μth learning time and α is the momentum coefficient. (2) The latest weights are assumed in calculating the cost function E^n. According to our simulations, the momentum term in the self-organizing feature map can considerably contribute to the acceleration of convergence.

Keywords: Self-organizing feature map; Momentum term

1. Introduction

Neural networks can be classified into two groups with respect to learning: one is supervised learning and the other is unsupervised learning. The backpropagation algorithm is one of the most popular learning paradigms for supervised learning [1]. Many applications of neural networks use the backpropagation algorithm, and various modifications of the algorithm have been reported [2-4]. On the other hand, Kohonen's self-organizing feature map [5] is one of the well-known unsupervised learning algorithms. It has also been studied extensively [5-10] and applied to many fields such as sensory mappings, combinatorial optimization and robotics. Kohonen's self-organizing feature map has the following features:


(1) It takes a computational shortcut to achieve the effect of lateral interactions.
(2) It is useful for clustering or categorizing.
(3) It can extract the hidden features of multi-dimensional input patterns and their probability.
(4) It is topology-preserving for the network; similar inputs are mapped onto neighboring neurons in the network.
(5) It has a neurophysiological background [8].

Generally, the learning of neural networks is slow. For example, Kohonen's self-organizing feature map requires thousands of learning steps to converge. To increase the speed, Lo and Bavarian investigated the selection of the neighborhood interaction function in the map [9]. This approach, however, requires a lot of additional computation for the interaction of neurons in the map, so merit (1) would be lost. In the case of backpropagation, the momentum term can contribute to acceleration of the learning [1], and recently Hagiwara has given a theoretical derivation of the term [11]. Hagiwara has also derived momentum terms in Boltzmann machine learning and in mean field theory learning [12]. On the other hand, for unsupervised competitive learning, the use of a momentum to improve the convergence performance has been suggested ([13], p. 222). However, a theoretical derivation of the term has been lacking. In this paper, we theoretically derive the momentum term in the self-organizing feature map to accelerate its convergence [15]. In Section 2, we briefly review Kohonen's self-organizing feature map algorithm. In Section 3, we derive the momentum term in the self-organizing feature map algorithm. Then, in Section 4, computer simulation results are reported to show the effectiveness of the momentum term.

2. Review of Kohonen's self-organizing feature map algorithm

In this section, we briefly review the conventional Kohonen self-organizing feature map algorithm. The learning rule can be written as a step-by-step procedure for computer simulation [9,14]:

Step 1. Initialization: Select the size and structure of the network. Initialize the weights w_ij with small random values.

Step 2. Present an input x^μ.

Step 3. Compute the distance between x^μ and all neurons, i.e. the matching scores d_i:
$$d_i = \left\| x^{\mu} - w_i^{\mu} \right\|, \qquad (1)$$
where ||·|| is the Euclidean norm, x_j^μ is the input to the jth neuron in the input layer at the μth learning time, and w_ij^μ is the weight from the jth neuron in the input layer to the ith neuron in the map at the μth learning time.

Step 4. Select the i*th neuron closest to x^μ, i.e. the minimum distance d_i*:
$$d_{i^{*}} = \min_{i} d_i. \qquad (2)$$

Step 5. Update the weights to the i*th neuron and its neighbors:
$$w_{ij}^{\mu+1} = w_{ij}^{\mu} + \eta\,\Lambda^{\mu}(i, i^{*})\,(x_{j}^{\mu} - w_{ij}^{\mu}), \qquad (3)$$
$$w_{ij}^{\mu+1} = w_{ij}^{\mu} \quad \text{otherwise}. \qquad (4)$$

Fig. 1. Illustration of assumption (2): (a) with assumption (2) (the latest weights w_ij^n at the nth learning time are assumed); (b) without assumption (2). (Each panel plots the weights against the learning times 1, 2, ..., μ, ..., n.)


where η is the learning constant and
$$\Lambda^{\mu}(i, i^{*}) = \exp\!\left( -\frac{\| r_{i^{*}} - r_i \|^{2}}{\sigma^{2}(\mu)} \right), \qquad (5)$$
where r_{i*} and r_i are the position vectors of the winning neuron and of the neighborhood neurons, respectively, and σ(μ) is a width parameter that is gradually decreased ([13], p. 237).

Step 6. Repeat by going to Step 2.
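To make the procedure above concrete, the following minimal NumPy sketch walks through Steps 1-6 for a 2-input, 10 × 10 map. It is an illustration only: the function name train_som, the fixed neighborhood width sigma, and the random-seed handling are assumptions rather than details taken from the paper.

```python
import numpy as np

def train_som(inputs, map_shape=(10, 10), eta=0.01, sigma=1.0, seed=0):
    """Conventional Kohonen SOM, Steps 1-6 (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    n_neurons = map_shape[0] * map_shape[1]
    dim = inputs.shape[1]
    # Step 1: initialize the weights with small random values around 0.5
    weights = rng.uniform(0.45, 0.55, size=(n_neurons, dim))
    # Position vectors r_i of the neurons on the 2-D map grid
    grid = np.array([(a, b) for a in range(map_shape[0])
                     for b in range(map_shape[1])], dtype=float)

    for x in inputs:                               # Step 2: present x^mu
        d = np.linalg.norm(x - weights, axis=1)    # Step 3: matching scores d_i, Eq. (1)
        winner = np.argmin(d)                      # Step 4: winning neuron i*, Eq. (2)
        # Step 5: neighborhood function Lambda^mu(i, i*) of Eq. (5), then Eq. (3)
        lam = np.exp(-np.sum((grid - grid[winner]) ** 2, axis=1) / sigma ** 2)
        weights += eta * lam[:, None] * (x - weights)
    return weights                                 # Step 6: loop over the data
```

In the paper the width σ(μ) is decreased gradually (Eq. (11) in Section 4); a fixed sigma is used here only to keep the sketch short.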

3. Theoretical derivation of a momentum term in Kohonen's self-organizing feature map algorithm

In this section, we derive the momentum term in Kohonen's self-organizing feature map algorithm. First, the cost function is defined as
$$E^{n} = \sum_{\mu} \alpha^{n-\mu} E_{\mu}, \qquad (6)$$
where E_μ is the modified Lyapunov function originally proposed by Ritter and Schulten [7] at the μth learning time,
$$E_{\mu} = \tfrac{1}{2} \sum_{i} \sum_{j} \Lambda^{\mu}(i, i^{*})\,(x_{j}^{\mu} - w_{ij}^{n})^{2}, \qquad (7)$$
and α is the momentum coefficient (0 < α < 1). It should be noted that, similar to the derivation of the momentum terms in the backpropagation, Boltzmann machine learning and mean field theory learning algorithms [11,12], the latest weights w_ij^n at the nth learning time are assumed in Eq. (7).

Fig. 2. Error in the map as a function of iteration (η = 0.0001, pattern-by-pattern learning). Curves: α = 0 (conventional) and α = 0.9. Axes: data presentation [times] versus error in the map.


This assumption is illustrated in Fig. 1. In addition, we should mention that the learning time index n means the latest learning time, not the final learning time in the training. Then we calculate the learning rule based on the commonly used gradient method ([1], pp. 322-328):
$$\Delta w_{ij}^{n} \propto -\frac{\partial E^{n}}{\partial w_{ij}^{n}}. \qquad (8)$$
Using Eqs. (6)-(8),
$$\frac{\partial E^{n}}{\partial w_{ij}^{n}} = -\sum_{\mu} \alpha^{n-\mu} \Lambda^{\mu}(i, i^{*})\,(x_{j}^{\mu} - w_{ij}^{n}). \qquad (9)$$

Therefore,
$$\begin{aligned}
\Delta w_{ij}^{n} &= \eta\,\Lambda^{n}(i, i^{*})(x_{j}^{n} - w_{ij}^{n}) + \eta \sum_{\mu=1}^{n-1} \alpha^{n-\mu} \Lambda^{\mu}(i, i^{*})(x_{j}^{\mu} - w_{ij}^{n}) \\
&= \eta\,\Lambda^{n}(i, i^{*})(x_{j}^{n} - w_{ij}^{n}) + \eta\alpha \sum_{\mu=1}^{n-1} \alpha^{(n-1)-\mu} \Lambda^{\mu}(i, i^{*})(x_{j}^{\mu} - w_{ij}^{n}) \\
&= \eta\,\Lambda^{n}(i, i^{*})(x_{j}^{n} - w_{ij}^{n}) + \alpha\,\Delta w_{ij}^{n-1}. \qquad (10)
\end{aligned}$$

The last term in Eq. (10) is the momentum term in the self-organizing feature map.

Fig. 3. Error in the map as a function of iteration (η = 0.001, pattern-by-pattern learning). Axes: data presentation [times] versus error in the map.


Although many readers may be aware of the importance of assumption (2) (that the latest weights w_ij^n are assumed in the cost function E^n), we should emphasize that it is very hard to derive the momentum term without this assumption. In our approach, assumption (2) is indispensable, as shown in the Appendix.
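As a hedged sketch of how the derived rule of Eq. (10) could be coded, the step below adds the momentum contribution αΔw^{n-1} to the conventional update. The argument prev_delta (holding Δw^{n-1}) and the fixed sigma are illustrative assumptions; the remaining names follow the sketch in Section 2.

```python
import numpy as np

def som_momentum_step(x, weights, grid, prev_delta,
                      eta=0.001, alpha=0.9, sigma=1.0):
    """One learning step of Eq. (10): conventional term plus alpha * Delta_w^(n-1)."""
    d = np.linalg.norm(x - weights, axis=1)       # matching scores d_i
    winner = np.argmin(d)                         # winning neuron i*
    lam = np.exp(-np.sum((grid - grid[winner]) ** 2, axis=1) / sigma ** 2)
    delta = eta * lam[:, None] * (x - weights) + alpha * prev_delta
    return weights + delta, delta                 # keep delta as Delta_w^n for the next step
```

Setting alpha=0 recovers the conventional update of Eq. (3), so the extra cost of the momentum term is essentially the storage of the previous weight change.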

4. Computer simulation

We have carried out many computer simulations to examine the effect of the momentum term. Some of them are shown in this section.

4.1. Simulation conditions

In our simulations, the following conditions were used.
(1) The mapping was from two inputs to a 10 × 10 neural sheet: a 2-100 network was used.
(2) The initial value of each weight was between 0.45 and 0.55.
(3) The range of each element of the input vector was between 0.0 and 1.0.
(4) We used the following width function in Eq. (5):
$$\sigma(\mu) = \sigma_i \left( \frac{\sigma_f}{\sigma_i} \right)^{\mu / \mu_{\max}}, \qquad (11)$$
where σ_i = 5.0, σ_f = 0.5, and μ_max = 10 000.
(5) We defined the sum of the errors of each neuron weight vector compared with the ideal square grid position as the error measure,
$$\text{error} = \sum_{i=1}^{100} \sum_{j=1}^{2} (w_{ij} - A_{ij})^{2}, \qquad (12)$$

Fig. 4. Error in the map as a function of iteration (η = 0.01, pattern-by-pattern learning). Axes: data presentation [times] versus error in the map.


Fig. 5. Formation of the weight space as a function of iteration (η = 0.001): (a) with the momentum term (α = 0.9); (b) without the momentum term.


where 100 in the first summation is the number of neurons in the map layer, 2 in the second summation is the number of neurons in the input layer, and (A_{i1}, A_{i2}) are the ideal square grid coordinates for each neuron weight vector.
(6) We used the following two kinds of weight learning methods:
(i) Pattern-by-pattern learning: all weights are updated every time a pattern is presented (simulation results are shown in Section 4.2).
(ii) Batch mode learning: weight changes are accumulated every time a pattern is presented, and the weights are updated every N_ave presentations (simulation results are shown in Section 4.3).

4.2. Simulation results by pattern-by-pattern learning

Figs. 2, 3 and 4 show the error plotted as a function of data presentation for η = 0.0001, η = 0.001, and η = 0.01, respectively. Each plot is the value averaged over 10 trials using different initial weights. In each figure, the plot indicated by momentum coefficient α = 0 corresponds to the result of the conventional algorithm. From these figures, we can see that the momentum term contributes to the acceleration of convergence. Furthermore, in the case of the conventional map (α = 0), the error saturates at a large value when η is small, whereas the error in the map using the momentum term decreases to a much lower value.

Fig. 5 shows plots of six states of the weight space. Since the input vectors are drawn from a uniform distribution over the unit square, we expect the weight space of the network to self-organize into a perfect square grid. It is clear that part (a), with the momentum term, forms the square grid faster than part (b).

4.3. Simulation results by batch mode learning

Figs. 6 and 7 show the error plotted as a function of data presentation, where N_ave is 100 and η_n (the learning constant normalized by N_ave) is 0.1 and 0.5, respectively.

Fig. 6. Error in the map as a function of iteration (η_n = 0.1, batch mode learning). Axes: data presentation [times] versus error in the map.


Fig. 7. Error in the map as a function of iteration (η_n = 0.5, batch mode learning). Axes: data presentation [times] versus error in the map.

Each plot is the value averaged over 10 trials using different initial weights. From these figures, we can see that the momentum term contributes to the acceleration of convergence even when the momentum coefficient is large, provided batch mode learning is used.
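A possible sketch of the batch-mode variant of condition (6) is given below: raw weight changes are accumulated for n_ave presentations and then applied together with the momentum term. The normalization by n_ave, the placement of the momentum term at the batch update, and the helper names are assumptions made for illustration, not details specified in the paper.

```python
import numpy as np

def som_momentum_batch(inputs, weights, grid, eta_n=0.1, alpha=0.9,
                       sigma=1.0, n_ave=100):
    """Batch mode learning: weights are updated every n_ave presentations."""
    prev_delta = np.zeros_like(weights)
    accum = np.zeros_like(weights)
    for t, x in enumerate(inputs, start=1):
        d = np.linalg.norm(x - weights, axis=1)
        winner = np.argmin(d)
        lam = np.exp(-np.sum((grid - grid[winner]) ** 2, axis=1) / sigma ** 2)
        accum += lam[:, None] * (x - weights)      # accumulate the weight changes
        if t % n_ave == 0:                         # update every n_ave presentations
            delta = (eta_n / n_ave) * accum + alpha * prev_delta
            weights = weights + delta
            prev_delta = delta
            accum = np.zeros_like(weights)
    return weights
```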

5. Conclusions

We have theoretically derived a momentum term in Kohonen's self-organizing feature map algorithm through the following assumptions: (1) The cost function is E^n = Σ_μ α^{n-μ} E_μ, where E_μ is the modified Lyapunov function originally proposed by Ritter and Schulten at the μth learning time and α is the momentum coefficient. (2) The latest weights are assumed in the cost function E^n. The latter assumption is especially important: we have shown that it is very hard to derive the momentum term without it. In addition, we have confirmed the effectiveness of the term by computer simulations. Since Kohonen's self-organizing feature map algorithm has the many features mentioned in the Introduction, it has large potential. As the use of the momentum term is very easy, it should contribute to the development of the algorithm both theoretically and practically.

Acknowledgment

The author would like to thank the reviewers and Professor T. Kohonen for their valuable suggestions which have helped to improve the quality of this paper. The author also would like to thank Professor M. Nakagawa and Professor D.E. Rumelhart.


Appendix

We show in this appendix that, based on our approach, we cannot derive the momentum term without assumption (2) (that the latest weights are assumed in calculating the cost function E^n). In this case, Eq. (7) becomes
$$E_{\mu} = \tfrac{1}{2} \sum_{i} \sum_{j} \Lambda^{\mu}(i, i^{*})\,(x_{j}^{\mu} - w_{ij}^{\mu})^{2}. \qquad (A\text{-}1)$$
Then Eq. (9) becomes
$$\frac{\partial E^{n}}{\partial w_{ij}^{n}} = -\Lambda^{n}(i, i^{*})\,(x_{j}^{n} - w_{ij}^{n}). \qquad (A\text{-}2)$$
Therefore,
$$\Delta w_{ij}^{n} = \eta\,\Lambda^{n}(i, i^{*})\,(x_{j}^{n} - w_{ij}^{n}). \qquad (A\text{-}3)$$
The momentum term has disappeared: it has been shown that we cannot derive the term without the assumption.

References
[1] D.E. Rumelhart, J.L. McClelland and the PDP Research Group, Parallel Distributed Processing, Vol. 1 (MIT Press, 1986).
[2] T.P. Vogl, J.K. Mangis, A.K. Rigler, W.T. Zink and D.L. Alkon, Accelerating the convergence of the back-propagation method, Biol. Cybernet. 59 (1988) 257-263.
[3] M. Hagiwara, Novel back propagation algorithm for reduction of hidden units and acceleration of convergence using artificial selection, Int. Joint Conf. on Neural Networks I (1990) 625-630.
[4] M.K. Weir, A method for self-determination of adaptive learning rates in back propagation, Neural Networks 4 (1991) 371-379.
[5] T. Kohonen, Self-organization and Associative Memory (Springer, Berlin, 1989).
[6] H. Ritter and K. Schulten, Convergence properties of Kohonen's topology conserving maps: Fluctuations, stability, and dimension selection, Biol. Cybernet. 60 (1988) 59-71.
[7] H. Ritter and K. Schulten, Kohonen's self-organizing maps: Exploring their computational capabilities, Int. Conf. on Neural Networks I (1988) 109-116.
[8] H. Ritter and K. Schulten, Neural Computation and Self-organizing Maps (Addison-Wesley, 1992).
[9] Z.-P. Lo and B. Bavarian, On the rate of convergence in topology preserving neural networks, Biol. Cybernet. 65 (1991) 55-63.
[10] P. Demartines and F. Blayo, Kohonen self-organizing maps: Is the normalization necessary?, Complex Syst. 6 (1992) 105-123.
[11] M. Hagiwara, Theoretical derivation of momentum term in back-propagation, Int. Joint Conf. on Neural Networks I (1992) 682-686.
[12] M. Hagiwara, Acceleration for both Boltzmann machine learning and mean field theory learning, Int. Joint Conf. on Neural Networks I (1992) 687-692.
[13] J. Hertz, A. Krogh and R.G. Palmer, Introduction to the Theory of Neural Computation (Addison-Wesley, 1991).


[14] R.P. Lippmann, Introduction to computing with neural nets, IEEE ASSP Mag. (1987) 4-22.
[15] M. Hagiwara, Self-organizing feature map with a momentum term, Int. Joint Conf. on Neural Networks I (1993) 467-470.

Masafumi Hagiwara is an Assistant Professor at Keio University. He was born in Yokohama, Japan on October 29, 1959. He received the B.E., M.E., and Ph.D. degrees in Electrical Engineering from Keio University, Yokohama, Japan, in 1982, 1984 and 1987, respectively. In 1987, he became a research associate of Keio University; since 1990 he has been an Assistant Professor. From 1991 to 1993, he was a visiting scholar at Stanford University. He received the Niwa Memorial Award, the Shinohara Memorial Young Engineer Award and the IEEE Consumer Electronics Society Chester Sall Award, in 1986, 1987 and 1990, respectively. His research interests are neural networks, fuzzy systems, and genetic algorithms. Dr. Hagiwara is a member of the IEICE, IEEE, JNNS and INNS.