
Neural Networks, Vol. 8, No. 5, pp. 717-727, 1995. Copyright © 1995 Elsevier Science Ltd. Printed in the USA. All rights reserved. 0893-6080/95 $9.50 + .00


CONTRIBUTED ARTICLE

Statistical Analysis of Self-organization

FILIP M. MULIER AND VLADIMIR S. CHERKASSKY
University of Minnesota

(Received 2 November 1993; revised and accepted 19 January 1995)

Abstract--The choice of the specific form of the neighborhood function and the learning rate in the Kohonen model of the self-organizing map has been empirical, since the model is very difficult to analyze. We present a new statistically motivated approach to determine the contribution of each data presentation during training on the final position of the units of the trained map. Experimental results show that applying the commonly used learning rates to a finite training set results in unit locations overly influenced by the later presentations (i.e., the last 20% of training samples). Better learning rate schedules and neighborhood functions are then determined which allow more uniform contributions of the training data on the unit locations. These improved rates are shown to be a suitable generalization of the standard rates given by stochastic approximation theory for a self-organizing map of units.

Keywords---Self-organization, Learning rate, Stochastic approximation.

1. INTRODUCTION

1.1. Stochastic Approximation and Learning

Iterative (flow-through) computation methods are commonly used in artificial neural networks and have several advantages over batch methods. They tend to be very simple computationally, albeit at the expense of recycling (multiple presentations of) the training data. Another advantage is that models can be generated in real time as the data become available. Most importantly, iterative methods allow iterative model refinement, which can potentially provide additional modeling flexibility. In particular, additional local flexibility can be achieved naturally in a flow-through algorithm by allowing "unequal" treatment of different data points, depending on the current estimate of the model. The major weakness of many flow-through methods is a heuristic choice of the learning rate schedules. Flow-through methods process training data one point at a time, and the training set is usually repeatedly presented to the network (or recycled) many times. Hence, the final model of the trained network may be sensitive to the order of presentation of the training data. In some cases, the framework of stochastic approximation can be used to develop good learning rate schedules and asymptotic performance bounds. For example, feedforward network models can use standard learning rate schedules prescribed by stochastic approximation (Ljung, 1977) for parameter (weight) updating (White, 1989), i.e.:

\sum_{k=0}^{\infty} \beta(k) = \infty, \qquad \sum_{k=0}^{\infty} \beta^2(k) < \infty.    (1)
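For concreteness, a short numeric check (our own illustration, not from the paper) that the familiar harmonic schedule β(k) = 1/(k+1) satisfies both conditions in eqn (1): its partial sums grow without bound while the sums of squares converge.

```python
import numpy as np

# Illustration only: the harmonic schedule beta(k) = 1/(k+1), k = 0, 1, 2, ...
# satisfies eqn (1): the partial sums diverge (grow like ln K),
# while the sums of squares converge (to pi^2/6).
for K in (10**3, 10**5, 10**7):
    k = np.arange(K)
    beta = 1.0 / (k + 1)
    print(f"K={K:>8}: sum beta = {beta.sum():8.3f} (keeps growing), "
          f"sum beta^2 = {(beta**2).sum():.5f} (approaches {np.pi**2/6:.5f})")
```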

Even though the classical theory of stochastic approximation assumes an infinite sequence of training samples, the same conditions on the learning rate generally hold true for finite samples with recycling, as commonly used in network learning (White, 1991).

The fundamental idea of self-organizing feature maps was originally introduced by Malsburg (1973) and Grossberg (1976) to explain the formation of neural topological maps. Based on this work, Kohonen (1982) proposed the model known as Self-Organizing Maps (SOM), which he has successfully used in a number of pattern recognition and engineering applications. In this paper we consider the choice of the learning rate schedule for Kohonen's Self-Organizing Maps (SOM). The original SOM model (Kohonen, 1982) does not prescribe a specific form for the learning rate and neighborhood function schedules, so many heuristic schedules have been used (Kohonen, 1990; Ritter, Martinetz, & Schulten, 1992). Straightforward generalization of the classical rates given by stochastic approximation is not an easy task, because the concept of the shrinking neighborhood in the SOM model makes the analysis difficult. The approach adopted in this paper employs a statistical kernel interpretation of self-organization. For a given empirical learning rate schedule, this interpretation makes it possible to analyze (computationally) the contribution of a given training sample on the final location of the trained map units. Such analysis indicates potential pitfalls of the commonly used exponentially decreasing learning rate (Kohonen, 1990; Ritter, Martinetz, & Schulten, 1992; Cherkassky & Lari-Najafi, 1991). Furthermore, we provide an analytic expression for improved learning rates, which ensures that every training sample has an "equal" contribution on the final location of the trained map, regardless of the order of presentations. These improved rates are shown to be a suitable generalization of the standard rates given by stochastic approximation theory for a self-organizing map of units.

Requests for reprints should be sent to Dr. V. S. Cherkassky, Department of Electrical Engineering, University of Minnesota, Minneapolis, MN 55455, USA.

2. ANALYSIS OF UNIT UPDATE EQUATIONS

The self-organizing map (SOM) (Kohonen, 1990) is a neural network model which is capable of projecting high dimensional data onto a low dimensional array. The projection is done adaptively and preserves characteristic features of the input data. The map usually takes the form of an array of units or computational elements. A competitive learning process is applied to the units, so that each unit becomes a selective decoder of certain input patterns. The process is somewhat similar to vector quantization, where a set of codebook vectors, one for each unit, approximates the distribution of the input signal. Because the units lie on an ordered map, these reference vectors have an ordering, and so distance relations are preserved in the self-organization process. The Kohonen self-organizing algorithm uses the repeated application of a two-step learning rule to order the units in the feature space. First, for each training sample presented, the "winning" or closest unit is found. Second, this unit along with neighboring units is updated. In the most common form of self-organization (applied to density estimation), the distance measure used in the first step is the Euclidean distance in the sample space between the data point and a unit:

i(k) = \arg\min_j \left[ \sum_{n=1}^{N} (x_n(k) - w_{n,j}(k))^2 \right]^{1/2}    (2)

where N is the dimensionality of the data, x_n(k) is the nth coordinate of the data sample vector X(k) = [x_1(k), x_2(k), ..., x_N(k)] presented at iteration k, and w_{n,j}(k) is the nth coordinate of the location of unit j at time k. By modifying this distance measure, it is possible to use self-organization for nonparametric regression or function estimation (Cherkassky & Lari-Najafi, 1991). In this case, the elements of the vector X(k) do not all have the same meaning, since for regression one element is assumed to be a function of the other elements. Assume that the following functional form holds:

x_N(k) = f(x_1(k), ..., x_{N-1}(k)) + error.    (3)

In order to preserve the functionality of the mapping, the winning unit is found using the distance measured in the space of the independent (predictor) variables:

i(k) = \arg\min_j \left[ \sum_{n=1}^{N-1} (x_n(k) - w_{n,j}(k))^2 \right]^{1/2}    (4)

This algorithm, called constrained topological mapping (CTM) (Cherkassky & Lari-Najafi, 1991), can then be used for regression problems. After the winning unit is found, the units in the map are updated. This step is performed in the same way for both algorithms (i.e., standard SOM for density estimation and CTM for regression) and can be written in the following form:

W_j(k) = W_j(k-1) + \beta(k) C(i(k), j, k) [X(k) - W_j(k-1)],    (5)

where W_j(k) is the vector (of weights) denoting the location of unit j at time k (k = 1, 2, ..., k_max) and X(k) is the sample (vector) presented at time k. The learning rate \beta is a function which decreases with increasing iteration k. The neighborhood function C depends on the unit index j, on the time k, and on i(k), the winning unit index at time k. C is often taken to be a Gaussian function of the index j with mean at i(k) and standard deviation decreasing with k. Since we are not concentrating on these functions, and to simplify notation, the two can be combined:

a_j(i(k), k) = \beta(k) C(i(k), j, k)    (6)

with 0 < a_j < 1. Then the unit update equation can be written as

W_j(k) = [1 - a_j(i(k), k)] W_j(k-1) + a_j(i(k), k) X(k)    (7)


given W_j(0) as an initial condition. This is the parametric equation of a line segment with endpoints at W_j(k-1) and X(k) and parameter a_j. The equation directly shows how the parameter a_j affects the motion of unit j: depending on the value of a_j, the unit is moved from the endpoint W_j(k-1) to a point on this line segment.
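As an illustration of eqns (2)-(7), here is a minimal NumPy sketch of one training iteration (winner search followed by the neighborhood update). It is our own sketch under assumed conventions (units indexed 0 to J-1 on a 1-D chain, a Gaussian neighborhood as described above), not the authors' code; names such as `som_step` are ours.

```python
import numpy as np

def som_step(W, x, beta, width, ctm=False):
    """One self-organization step, eqns (2)-(7).

    W     : (J, N) array of unit locations W_j(k-1)
    x     : (N,) data vector X(k)
    beta  : learning rate beta(k)
    width : neighborhood width at step k
    ctm   : if True, find the winner using only the first N-1
            (predictor) coordinates, as in eqn (4) for CTM.
    """
    J = W.shape[0]
    coords = slice(0, W.shape[1] - 1) if ctm else slice(None)    # eqn (2) vs eqn (4)
    i_win = np.argmin(np.sum((W[:, coords] - x[coords]) ** 2, axis=1))
    j = np.arange(J)
    C = np.exp(-((j - i_win) ** 2) / width ** 2)                  # Gaussian neighborhood
    a = beta * C                                                  # a_j(i(k), k), eqn (6)
    W_new = (1.0 - a)[:, None] * W + a[:, None] * x               # eqn (7)
    return W_new, i_win, a

# Tiny usage example: 10 units initially ordered on [0, 1], one 1-D sample
rng = np.random.default_rng(0)
W = np.linspace(0.0, 1.0, 10)[:, None]
W, i_win, a = som_step(W, rng.uniform(size=1), beta=0.5, width=3.0)
print("winner:", i_win)
```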

This iterative update equation can be made noniterative. First, solving for W_j(1), the result is

W_j(1) = [1 - a_j(i(1), 1)] W_j(0) + a_j(i(1), 1) X(1).    (8)

This result can be used to determine W_j(2) as well:

W_j(2) = [1 - a_j(i(2), 2)][1 - a_j(i(1), 1)] W_j(0) + [1 - a_j(i(2), 2)] a_j(i(1), 1) X(1) + a_j(i(2), 2) X(2).    (9)

This cannot be solved exactly without being given a particular data set, since one has to know i(1) and i(2), the winning units at times k = 1 and k = 2. Assuming that the list of winning units is known at times r = 1, 2, ..., k, the unit positions at time k can be written as

W_j(k) = \prod_{r=1}^{k} [1 - a_j(i(r), r)] \, W_j(0) + \prod_{r=2}^{k} [1 - a_j(i(r), r)] \, a_j(i(1), 1) X(1) + \cdots + \prod_{r=n+1}^{k} [1 - a_j(i(r), r)] \, a_j(i(n), n) X(n) + \cdots + a_j(i(k), k) X(k).    (10)

From this equation one can see that the locations of the units at time k depend on a weighted sum of the initial position of the units and of every data point presented to the algorithm:

W_j(k) = d_j(k) W_j(0) + \sum_{n=1}^{k} d_j(k, n) X(n).    (11)

The weights of this weighted sum quantify the contribution of the initial conditions and of each presentation of the data on the position of the map at a later stage of self-organization. The coefficient d_j(k) describes the contribution of the initial position of unit j on the position of unit j at time k. The coefficients d_j(k, n) describe the contribution of the presentation at time n on the position of unit j at time k. Since these coefficients depend on knowing the winning unit at each presentation step, the equations are evaluated numerically as the self-organization proceeds. Each coefficient in eqn (10) is a product series; with each iteration of self-organization a new term enters the product for each coefficient. For computational convenience, the coefficients can be stored as a matrix with the row index corresponding to the iteration step n and the column index corresponding to the unit index j. If unit j required updating at time n, a_j(i(n), n) is entered at the corresponding element of the matrix. All the other entries above this entry (rows 1, ..., n-1) in the same column are then multiplied by 1 - a_j(i(n), n). This step is repeated for each unit updated at iteration step n, as sketched in the code below.
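The following is a minimal sketch of this bookkeeping for a 1-D map, with the coefficient matrix D holding d_j(k, n) and the vector d0 holding d_j(k). The neighborhood width convention and the random presentation order are our assumptions, and the final reconstruction check of eqn (11) is our own sanity test rather than part of the paper.

```python
import numpy as np

def run_som_with_contributions(data, J, beta, width, seed=0):
    """Run a 1-D self-organizing map on `data` (shape (n_samples, N)) for
    k_max = len(beta) presentations, tracking the coefficients of eqn (11):
    D[n, j] = d_j(k_max, n+1) and d0[j] = d_j(k_max)."""
    rng = np.random.default_rng(seed)
    k_max, N = len(beta), data.shape[1]
    W = np.linspace(data.min(), data.max(), J)[:, None] * np.ones(N)  # ordered start
    D = np.zeros((k_max, J))
    d0 = np.ones(J)
    order = rng.integers(0, len(data), size=k_max)   # presentations drawn with replacement
    for n in range(k_max):
        x = data[order[n]]
        i_win = np.argmin(np.sum((W - x) ** 2, axis=1))                      # eqn (2)
        a = beta[n] * np.exp(-((np.arange(J) - i_win) ** 2) / width[n] ** 2)  # eqn (6)
        W = (1.0 - a)[:, None] * W + a[:, None] * x                           # eqn (7)
        D[:n, :] *= (1.0 - a)   # scale earlier rows of every column by 1 - a_j
        d0 *= (1.0 - a)         # same for the initial-condition coefficient
        D[n, :] = a             # enter a_j(i(n), n) in row n
    return W, D, d0, order

# Sanity check of eqn (11): W(k_max) = d_j(k) W_j(0) + sum_n d_j(k, n) X(n)
if __name__ == "__main__":
    rng = np.random.default_rng(1)
    data = rng.uniform(0.0, 1.0, size=(100, 1))
    k_max, J = 1000, 10
    k = np.arange(1, k_max + 1)
    beta = 1.0 * (0.05 / 1.0) ** (k / k_max)   # exponential schedule (see eqn (12) below)
    width = beta * J                            # assumed neighborhood width, S0 = J
    W, D, d0, order = run_som_with_contributions(data, J, beta, width)
    W0 = np.linspace(data.min(), data.max(), J)[:, None] * np.ones(data.shape[1])
    W_rebuilt = d0[:, None] * W0 + D.T @ data[order]
    print("max reconstruction error:", np.abs(W - W_rebuilt).max())
```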

3. EMPIRICAL ANALYSIS OF SELF-ORGANIZATION

Applying eqn (11) to particular data sets gives some insight into the self-organization process and can be used to answer two questions. First, how does the value of a training sample affect its contribution to the final location of each unit? Second, how does the order of presentation affect a training sample's contribution to the final location of each unit? The results depend on the specific learning rate and neighborhood function used during self-organization. Our analysis concentrates on the commonly used (Kohonen, 1990; Cherkassky & Lari-Najafi, 1991; Ritter, Martinetz, & Schulten, 1992) exponentially decreasing learning rate \beta and the Gaussian neighborhood function C given by

\beta(k) = \beta_{initial} \left( \frac{\beta_{final}}{\beta_{initial}} \right)^{k / k_{max}}    (12)

C(i, j, k) = \exp\left( \frac{-\| i - j \|^2}{(\beta(k) S_0)^2} \right).    (13)

In the above equations k_max is the total number of iterations and S_0 is the number of units per dimension in the map.
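For reference, eqns (12) and (13) as small Python functions; a sketch under the stated conventions (k runs from 1 to k_max, unit indices are integers of a 1-D map), not code from the paper.

```python
import numpy as np

def beta_exponential(k, k_max, beta_initial=1.0, beta_final=0.05):
    """Exponentially decreasing learning rate, eqn (12)."""
    return beta_initial * (beta_final / beta_initial) ** (k / k_max)

def neighborhood(i_win, j, k, k_max, S0, beta_initial=1.0, beta_final=0.05):
    """Gaussian neighborhood function, eqn (13); its width shrinks with beta(k)."""
    b = beta_exponential(k, k_max, beta_initial, beta_final)
    return np.exp(-np.abs(i_win - j) ** 2 / (b * S0) ** 2)

# Example: neighborhood profile around winning unit 5 of a 10-unit map, early vs late
j = np.arange(10)
print(np.round(neighborhood(5, j, k=1,   k_max=1000, S0=10), 3))   # wide early on
print(np.round(neighborhood(5, j, k=900, k_max=1000, S0=10), 3))   # narrow near the end
```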

The initial and final learning rates in all of the experiments are \beta_{initial} = 1.0 and \beta_{final} = 0.05. Note that the annealing schedule (the neighborhood decrease rate) is the same as the learning rate schedule for the experiments in this section. Also, the units of a linear map were initially ordered and placed at equally spaced distances on the [0, 1] interval before training.

First, an experiment was done to determine what these coefficients look like. A training set of 100 samples was generated by taking x from a uniform distribution on [0, 1]. For each of the 1000 iterations, a sample was randomly selected from the training set (with replacement) and presented to a linear map with 10 units. Figure 1a is a plot showing the contribution of each data point on the 10 units. Each curve is given by the coefficient value of a data point versus the x value of that data point; there is one curve for each of the 10 units. From the figure, each unit is affected only by data that is nearby. Also, there is very little overlap between the kernel-shaped functions. This means that each data point only affects one unit in this example. In experiments where there are more units than data points, the peaks do overlap, indicating that overfitting does not occur. Looking at the height of the peaks in Figure 1a, one seems to stand out. The histogram of the data (Figure 1b) also shows that there are fewer data points in this area.

FIGURE 1. Plots showing the contribution of the data on the final unit location and the histogram of the data. (1a) The 10 functions (one for each unit) show the contribution of the data samples on the final unit location. (1b) The histogram of the data.

To see more clearly what happens when the distribution of the data is not uniform, a similar experiment was done with a different data set. In this case, the data had a nonuniform distribution, with the histogram given in Figure 2a. In this scenario, the width of the peak as well as its height increases in areas with less data (Figure 2b).

FIGURE 2. Effect of varying data density on kernel width and height. (2a) The histogram of the data. (2b) The 10 functions (one for each unit) show the contribution of the data samples on the final unit location.

It may be useful to interpret the shapes of these peaks in the context of standard statistical kernel and nearest-neighbor smoothing. One must remember, however, that in the statistical smoothing case, the kernel shapes are exactly specified and their location is at the point of estimation.

In our case, the position and the shape of the peaks were determined by the self-organizing algorithm. For statistical kernel smoothing, the width of the kernel in the sample space (the bandwidth) is fixed; the effective height depends on the number of samples under its width. For nearest-neighbor smoothing, the number of samples under the kernel (the span) is fixed; if the kernel of the nearest-neighbor smoother is plotted in the sample space, its width will vary with the density of points (Altman, 1992). The peaks generated by the self-organizing process seem to react to data density in both manners, changing in width and height as the density of the samples changes.

From the plots of these peak functions it is hard to determine their exact shapes. There is one coefficient for each data sample, and in the uniform distribution case with 10 units, roughly 1/10 of the samples form the support of each peak. This corresponds to about 10 points per peak. In order to determine more accurately what these functions look like, a similar experiment was done with a larger training set. This time there were 1000 samples and only 5 units. In this case a peak would roughly contain 200 points, so the plots should have higher resolution. Figure 3 shows the peak function under these conditions for a unit located in the center of the sample range of this experiment.

FIGURE 3. Contribution of data samples on the final location of the center unit in a five unit map.

In spite of the increased resolution, there is more variance in the shape. Data points which have nearly the same value of x, and therefore should be treated in the same way by the algorithm, have vastly different contributions on the unit location. There are also many data points within the width of the peak which have no contribution on the location of the unit. These effects indicate that there is some other factor related to a particular data presentation, besides the sample value, affecting its contribution on the value of a unit. The only other factor is a sample's order in the training set. If the contribution is plotted versus this order (Figure 4), one sees immediately that the data ordering has a major influence on the contribution of a data point.

In the previous experiments, the training set was presented cyclically to the algorithm without randomization, and the contribution of a sample in the training set was seen to depend on the ordering of the training set. More commonly, the presentations are randomly selected from the training set pool with replacement, so the ordering of the training set has no effect on the order of presentation. The results of experiments with this type of random presentation are shown next, where similar conclusions can be drawn when looking at the effect of presentation order rather than the effect of training set ordering.


FIGURE 4. Contribution of data samples versus sample index for a centrally located weight.


FIGURE 5. The effect of randomizing the training data. (5a) Contribution of data samples on the final location of units 1, 3, and 5. (5b) Contribution of data samples on the whole map. (5c) Histogram of contribution values.

Consider the following experiment: 1000 samples were chosen from the uniform distribution on the range [0, 1] to form the training set. These were presented to the self-organization network with 5 units. At each presentation, a sample was chosen at random from the training set. The algorithm was allowed to iterate 100,000 times. Figure 5a is a plot of the peak functions for units 1, 3, and 5, showing the contribution on unit position of each sample in the training set versus the value of the training sample. Notice that they are still very jagged, indicating that presentation order has some effect on the contribution. Figure 5b is a plot of the sum of the contributions over all 5 units versus the order of the training data; in other words, each point on this graph indicates a training sample's contribution on the total map versus its order in the training set. As should be expected, no trend based on training set order can be seen, since the data was presented randomly. One should notice, however, that a high percentage of the contributions are near zero. This is made clearer by looking at the histogram of these contribution values (Figure 5c). In this case about 75% of the data has little or no contribution on the final unit positions. Another way of looking at it: about 25% of the data contributes more than it should. With 1000 samples in the training set and 5 units, each sample should have a contribution of 5/1000 for equal treatment of each sample. Looking at Figures 5b and 5c, one sees roughly 25% of the samples contributing "too much". These plots indicate that even with randomization, some training data may be underutilized.
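Using the coefficient matrix D from the tracking sketch in Section 2, the per-sample totals plotted in Figures 5b and 5c can be summarized in a few lines. This is a sketch with the equal-treatment target J/n_samples (5/1000 here) computed for comparison; helper names such as `per_sample_contribution` are ours.

```python
import numpy as np

def per_sample_contribution(D, order, n_samples):
    """Sum the contributions d_j(k_max, n) over units and over all presentations
    of each distinct training sample (presentations are drawn with replacement)."""
    per_presentation = D.sum(axis=1)            # shape (k_max,): contribution of presentation n
    totals = np.zeros(n_samples)
    np.add.at(totals, order, per_presentation)  # aggregate by training-sample index
    return totals

# Example with D (k_max x J) and `order` (k_max,) taken from the earlier tracking sketch:
# totals = per_sample_contribution(D, order, n_samples=1000)
# target = D.shape[1] / 1000                    # equal treatment: J / n_samples = 5/1000
# print("fraction of samples contributing < 20% of target:",
#       np.mean(totals < 0.2 * target))
```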

The effect of the presentation order can be seen directly in the following experiment of self-organization with a 5 unit linear map. The training set consisted of 100 samples with a uniform distribution on the range [0, 1]. The training set was randomly sampled 1000 times. Figure 6 shows the total contribution to all units versus the presentation number. The graph shows that most of the contribution of the data on final unit position occurs during the last two hundred presentations.

FIGURE 6. Contribution of a presentation on the final map position versus the iteration number.

4. EFFECT OF PRESENTATION ORDER IN THE SIMPLIFIED CASE OF ONE UNIT

If there is only one unit in the map, the self-organization reduces to a moving average process

W(k) = [1 - \beta(k)] W(k-1) + \beta(k) X(k)    (14)

and eqn (10) can be written in the form

W(k) = \prod_{r=1}^{k} [1 - \beta(r)] W(0) + \prod_{r=2}^{k} [1 - \beta(r)] \beta(1) X(1) + \cdots + \prod_{r=n+1}^{k} [1 - \beta(r)] \beta(n) X(n) + \cdots + \beta(k) X(k).    (15)

Note that the coefficients in eqn (15) are independent of the particular training data used. If a constant learning rate is used, then this equation corresponds to the exponentially weighted moving average. If \beta(k) = 1/k is used, then each presentation is equally weighted and contributes equally to the final unit position. Figure 7 shows the contributions versus presentation order for these two learning rates as well as the exponentially decreasing learning rate commonly used for self-organization,

\beta(k) = \beta_{initial} \left( \frac{\beta_{final}}{\beta_{initial}} \right)^{k / k_{max}}, \quad \beta_{initial} = 1.0, \ \beta_{final} = 0.05.    (16)
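A short sketch of the one-unit contribution weights of eqn (15), for the three schedules compared in Figure 7 (constant rate, running average β(k) = 1/k, and the exponential rate of eqn (16)); this is our own illustration of the formula, not the code used to produce the figure.

```python
import numpy as np

def one_unit_contributions(beta):
    """Contribution of presentation n on the final one-unit position, eqn (15):
    d(n) = beta(n) * prod_{r=n+1..k_max} (1 - beta(r))."""
    survive = np.append(np.cumprod((1.0 - beta)[::-1])[::-1][1:], 1.0)
    return beta * survive

k_max = 100
k = np.arange(1, k_max + 1)
schedules = {
    "constant 0.05":   np.full(k_max, 0.05),
    "running avg 1/k": 1.0 / k,
    "exponential":     1.0 * (0.05 / 1.0) ** (k / k_max),   # eqn (16)
}
for name, beta in schedules.items():
    d = one_unit_contributions(beta)
    print(f"{name:16s} first={d[0]:.4f} last={d[-1]:.4f} sum={d.sum():.3f}")
# The running average gives equal weights 1/k_max; the exponential rate
# concentrates the contribution on the last presentations (cf. Figure 7).
```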

"

"

~

"

"

(a)

0 0.I

.~0.08 "~-~0.06 0 o0.04

,, -"

0.02 0

.....

k~=20 1 1 ,'• 0

0.1

012

013

"'"" k ~ = lO0 ..."°'"'k'" =1000~' 014

0.5

0.6

0.7

0.8

0.9

normalized presentation k/k max 0.1

0.04

,~/

=0.08 (b)

////

= - ~ 0.03

.~ 0.06

fl(k) - (.05)

,-< j.--:

"~ 0.04 ,V~(k

"~ 0.02

) --.05-

o

0.01 0

(14)

............

0

20

40

60

presentation k

80

O o 0.02 0

0

- ~-''. 10 20

........ 30 40

.... 50

7"~ 60

presentation

~"¢~'C ..... T..,., 70 80 90 100

100

FIGURE ?. Contribution on final unit location versus presentation order for some commonly used learning rates.

FIGURE 8. Effect of the exponential learning rate parameters on the contribution curve. (Sa) Various contribution curves for different choices km.x. (The horizontal axis is normalized.) (Sb) Contribution curves for several values of fltJn.I.

Analysis of Self-organization

723 ( ~ r,,~\ kl~. 3

/3~a = 1.0,/3n~ = .05

2

(16)

1

•..~'...'.~-..~:::.;-~.,~¢:?..=.~.; .'~.:;', ~:'='.: ...:." :' ~.. • .,.-

The common learning rate, eqn (16), shows some curious effects depending on the choice of k_max, the maximum number of presentations (Figure 8a), or on the choice of \beta_{final} (Figure 8b). The curves are such that the sum of the contributions for k = 1, ..., k_max must equal 1. For each presentation to have equal contribution, each must have contribution 1/k_max. Fixing \beta_{final} forces the final presentation to have a contribution value equal to \beta_{final}. If this value is larger than 1/k_max, then some presentations are necessarily under-weighted; if the value is smaller than 1/k_max, then some presentations will be over-weighted. Notice that making \beta_{final} = 1/k_max = 0.01 does not guarantee equal contribution for the exponential learning rate (Figure 8b). By modifying the parameters of the learning rate, one can influence which presentations contribute most to the final unit location. These results indicate that the choice of learning rate should be based on the desired contributions as a function of presentation order. With a finite training set, equal contribution is usually desired, but for nonstationary problems, "forgetting" past presentations may be good; in this case it would be useful to have some control over the time window used. If the learning rate is not carefully chosen, the results of self-organization may not be optimal.

5. CONSEQUENCE OF AD HOC CHOICE OF THE LEARNING RATE

This phenomenon of dependence of unit location on the order of presentation can cause problems in certain situations. The unequal treatment of data within a fixed-size training set is a statistical flaw of the self-organization process caused by a poor choice of learning rate schedule. Many of the presentations are wasted because they do not affect the final unit positions. Under-utilization of the data is a serious problem, especially in situations with a large number of samples with high noise. In these cases, the accuracy of the estimation method depends on its ability to smooth out the effects of the errors using all the samples. Under-utilization of the training data also increases the variance of the estimates. If self-organization using a poorly chosen learning rate is repeated a number of times for the same training data, the sequence of maps generated will have a high variability, since each is based on a possibly different small portion of the training data. An example of this is given in the following experiment with the CTM algorithm and 10 units.

z"

. . .

..%

...,....

".

." .

"..."



. . - . . -. -%

....

.

:.



.

.

..

".:

,

;..'~.'.

~.... . . . .

-~.~

; ¢-.:,*~.~o.,..~...~.~'~..~...,~. ,......,,. ;, :~..,..... : .... .g.~....~.. :. •&,,Z.-.~:-:-"'.¢"? :" ::"...~...~..:~c:.~.~.~..t.~,...,.,.;: .;.~ : . , ~ . . . . . . . . : ~ - . . ~ . . . - ~ , . :,.:..~

-, ~. :-." ....~-:,-~. "... ;a: .-. •. . , " . : . : ..'. ". • ." . . . . .

,.' ...,.~v., .; . : ' ' "

~ ." :~

"."

". "

• ...:

" ~ , ' i : . ~ ~ ~7.'. - . ~ " ;:~.:;;.<...'.~... " . ' . . . : . "-.: ~.~'. ~ ' , . ~ . . ~ . . . ~ . . ; . . . ~

• ~ , ' , - . - ' ) ' - . : s : ¢ _ .. " ~- : " ' . .,:. - ~ , ~ N , ~ , . . . : ..~<:.....h

".," :,..'C".:'.~".',".'.,P~..':.'~'~:;~.>.A.~.,,.~.".., .."

." "'" " . . -

?:r,'~::;.'~,:

-:'Y-:."":'~:"I,.'~:~.:.,L""..::~7.-:~;..~...~,. • .:.'L',

-2 -3 -4

0.2

0.4

0.6

0.8

X

FIGURE 9. Training set of 10,000 samples.

A training set was generated by drawing 10,000 samples from a uniform distribution for the independent variable. The dependent variable was the sine function with noise. The noise was Gaussian with a standard deviation of 1.0 (Figure 9). This data was used to train the map by presenting each sample once to the map and using the learning rate and neighborhood given in eqns (12) and (13). Then the data was used to generate two more data sets, training set A and training set B. Set A was created by taking the last 200 samples of the original training set; set B was created by taking the first 200 samples of the original training set. Each of these training sets was used to train CTM maps.
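A sketch of this data setup (our own reconstruction of the described experiment; the frequency of the sine function and the sampling details are not given in the text, so those choices below are assumptions):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000
x = rng.uniform(0.0, 1.0, size=n)                          # independent variable
y = np.sin(2 * np.pi * x) + rng.normal(0.0, 1.0, size=n)   # sine target (assumed period) + Gaussian noise, sd = 1.0
full_set = np.column_stack([x, y])                         # 10,000-sample training set (Figure 9)

set_A = full_set[-200:]    # last 200 samples of the original training set
set_B = full_set[:200]     # first 200 samples of the original training set
# Each map sees 10,000 presentations: the full set once, or A/B recycled 50 times.
print(full_set.shape, set_A.shape, set_B.shape)
```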

FIGURE 10. Comparison of CTM maps trained with 10,000 samples and 200 samples. (10a) Plot of the function, the map trained with 10,000 samples (labeled *), and the map trained with the last 200 samples (labeled +). (10b) Plot of the function, the map trained with 10,000 samples (labeled *), and the map trained with the first 200 samples (labeled +).

In these two cases, the samples were recycled 50 times, so that for all three maps 10,000 presentations were used for training, using the same learning rate and neighborhood function. Now we wish to compare the three different maps that were generated (Figure 10). Notice how similar the maps generated using the original 10,000-sample data set and training set A (the last 200 samples) are (Figure 10a); the locations of the units follow very closely. As an experimental control, the map trained on the original data can be compared to the map trained on set B (the first 200 samples). Notice that in this case the unit locations vary quite a bit more (Figure 10b). This shows that the map resulting from the last 200 samples of the training set is just as good as the map generated by 10,000 samples. In this case only 2% of the data had a real effect on the model.

Previous experiments (Cherkassky & Lari-Najafi, 1991) and numerous references on SOM (Kohonen, 1990; Ritter, Martinetz, & Schulten, 1992) using the exponential learning rate schedule have not shown evidence of these effects because they only manifest themselves in certain situations. If there exists a statistical difference between, say, the first half of the training data and the last half in terms of ordering, the final map unit locations will more closely model the last half of the data. In most past experiments there was no statistical difference between the first half of the data and the last half, so a model based on only the last half would apply just as well to the rest of the data. In this case it would be hard to determine, looking just at the final unit locations, whether the model was based on the whole data set or just the last half. However, the problems of under-utilization of data (i.e., high model variance) still occur undetected. Problems of under-utilization of data can occur even if random sampling of the training set is done (as in most applications using SOM). The solution to this problem lies in the careful choice of learning rate schedule.

When self-organization is applied to a random nonstationary process which generates an unlimited amount of training data, equal contribution of each presentation may not be a desired goal. In this case, one would like to control the time window within which samples have a significant effect. The choice of learning rate is still very important, because a bad choice will result in unit locations which depend too much on the most recent presentations. The algorithm "forgets" past presentations too quickly when new presentations occur. This will lead to density approximations dependent on a small number of samples and, as such, prone to high variance.

6. IMPROVED LEARNING RATE AND NEIGHBORHOOD FUNCTION

The optimal choice of learning rate is critical for good performance of self-organization algorithms when training on a finite data set.

A major difficulty in determining this learning rate is the interaction of the neighborhood function and the complexity of updating multiple units. Here we develop a learning rate schedule which is optimal in terms of allowing equal contributions of all presentations on the final map position. This condition was devised so that the choice of neighborhood function and its rate of decrease (annealing rate) does not affect the analysis, although the results of self-organization will depend greatly on the choice of neighborhood. This condition is also based on the assumption that the data being presented have been generated by a stationary process. A numerical search for a learning rate schedule which meets this criterion can then be done. Since the search is performed algorithmically, the results may depend on the particular conditions of the run, such as the distribution of the training data and the number of units. A sufficient number of experiments is required in order to determine the effects of these parameters.

We will define the optimal learning rate schedule for self-organization as the one for which each presentation contributes equally to the final position of the map. This condition can be written as

\sum_{j=1}^{J} d_j(k_{max}, n) = \frac{J}{k_{max}} \quad for n = 1, ..., k_{max},    (17)

where J is the total number of units. Notice that this enforces an equal contribution over all the units in summation; it does not enforce equal contribution for each individual unit. We would also like the initial condition (the initial unit locations) to have no contribution on the final map,

\sum_{j=1}^{J} d_j(k_{max}) \approx 0.    (18)

In order to make the learning rate derivation independent of the choice of neighborhood function and annealing rate, the neighborhood function is assumed to be normalized:

\sum_{j=1}^{J} C(i, j, n) = 1 \quad for n = 1, ..., k_{max} and i = 1, ..., J.    (19)

The learning rate schedule is a function \beta(n) defined at the discrete points n = 1, 2, ..., k_{max} and can be related to the contribution via eqns (6) and (10). We would like to solve for \beta(n) at each discrete point subject to eqns (17), (18), and (19). There are two times, n = 1 and n = k_{max}, for which \beta(1) and \beta(k_{max}) can be determined analytically.
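Given a tracked coefficient matrix D (as in the sketch of Section 2), the equal-contribution condition of eqn (17) and the normalization of eqn (19) are easy to check numerically; a small sketch with our own helper names:

```python
import numpy as np

def contribution_deviation(D):
    """How far a run is from the equal-contribution condition of eqn (17).

    D : (k_max, J) matrix with D[n, j] = d_j(k_max, n+1).
    Returns the per-presentation totals and their relative deviation from J/k_max."""
    k_max, J = D.shape
    totals = D.sum(axis=1)      # sum over units, one value per presentation n
    target = J / k_max          # equal contribution, eqn (17)
    return totals, (totals - target) / target

def neighborhood_is_normalized(C_matrix, tol=1e-8):
    """Check eqn (19): each row of C (one row per winning unit i) sums to 1 over j."""
    return np.allclose(C_matrix.sum(axis=1), 1.0, atol=tol)

# Usage (with D from the earlier coefficient-tracking sketch):
# totals, rel_dev = contribution_deviation(D)
# print("worst relative deviation from J/k_max:", np.abs(rel_dev).max())
```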

When n = k_{max}, eqn (17) becomes

\sum_{j=1}^{J} d_j(k_{max}, k_{max}) = \frac{J}{k_{max}};    (20)

substituting for d using eqn (10) and eqn (6) gives

\beta(k_{max}) \sum_{j=1}^{J} C(i(k_{max}), j, k_{max}) = \frac{J}{k_{max}}.    (21)

This can be simplified by applying eqn (19) to give \beta(k_{max}) = J / k_{max}. \beta(1) can be determined by looking at the unit update equation for k = 1,

W_j(1) = [1 - a_j(i(1), 1)] W_j(0) + a_j(i(1), 1) X(1).    (22)

As stated earlier, in order to minimize the effect of the initial unit positions on the final map position, a_j(i(1), 1) should be as close to one as possible, but never over one, for all j. This requires \beta(1) = 1 for any arbitrary neighborhood function conforming to eqn (19). So far, two points of the optimal learning rate schedule have been determined. These two points are independent of the shape of the neighborhood function (subject to eqn (19)), the training set size, the number of units, and the topology of the map. Determining further values of the learning rate requires numerical computation because of the complicated form of the coefficients d_j(k_{max}, n). The search for the remaining values of the learning rate \beta(k) will be done in reverse, i.e., for k = k_{max} - 1, k_{max} - 2, ..., 3, 2. Applying eqn (10) to eqn (17) gives

\sum_{j=1}^{J} \left( \prod_{r=k+1}^{k_{max}} [1 - a_j(i(r), r)] \cdot a_j(i(k), k) \right) = \frac{J}{k_{max}}.    (23)

The learning rate \beta(k) can be factored out of the left-hand side via eqn (6),

\beta(k) \sum_{j=1}^{J} \left( \prod_{r=k+1}^{k_{max}} [1 - \beta(r) C(i(r), j, r)] \cdot C(i(k), j, k) \right) = \frac{J}{k_{max}}.    (24)

This equation can be evaluated numerically for a given realization of the self-organizing process. Notice that the learning rate at time k depends on the learning rates found for times k + 1, k + 2, ..., k_{max}, so the search must be done in reverse order. Since it is done algorithmically as self-organization proceeds, a starting set of values for \beta(k) is required for all k = 1, 2, ..., k_{max}. The following algorithm can be used to perform the search.

Initialization: \beta(k) for k = 1, 2, ..., k_{max} is initialized using a standard learning rate schedule, such as the exponential.
k = k_{max}: \beta(k_{max}) = J / k_{max}
for k = k_{max} - 1, k_{max} - 2, ..., 2:
    perform complete self-organization using the rates \beta(k), k = 1, ..., k_{max}, found so far to determine \sum_{j=1}^{J} d_j(k_{max}, k);
    calculate the new \beta(k) for time k using
        \beta_{new}(k) = \frac{J \, \beta(k)}{k_{max} \sum_{j=1}^{J} d_j(k_{max}, k)}    (25)
end for
k = 1: \beta(1) = 1.

Notice that the learning rate update rule, eqn (25), is just a modification of eqn (24). By complete self-organization we mean starting from the initial conditions and performing k_{max} unit update steps. The learning rate found via this algorithm may depend on the initial learning rate chosen for self-organization. It also depends on the particular training samples presented, and so is actually a series of random variables. This requires that the algorithm be repeated a number of times to refine the learning rate estimate; in this case, eqn (25) should be modified to average over a number of searches. Figure 11 shows a learning rate determined by repeating the algorithm 20 times. The algorithm can also be modified to search for other learning rate schedules which enforce any desired contribution profile. This may be useful for applying self-organization to nonstationary processes where the most recently presented data should have the most influence.
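A compact sketch of this backward search. It is our reading of the algorithm (reusing the ideas of the coefficient-tracking sketch from Section 2, inlined here so the block is self-contained); the 1-D data, the fixed exponential annealing, the fresh random presentation order per run, and the single (non-averaged) search are assumptions.

```python
import numpy as np

def contributions_for_schedule(data, J, beta, width, order):
    """Run 1-D self-organization with the given beta schedule and return
    sum_j d_j(k_max, n) for every presentation n (cf. eqns (10), (17))."""
    k_max = len(beta)
    W = np.linspace(data.min(), data.max(), J)[:, None]
    D = np.zeros((k_max, J))
    for n in range(k_max):
        x = data[order[n]]
        i_win = np.argmin(np.sum((W - x) ** 2, axis=1))
        C = np.exp(-((np.arange(J) - i_win) ** 2) / width[n] ** 2)
        C /= C.sum()                      # normalized neighborhood, eqn (19)
        a = beta[n] * C                   # eqn (6)
        W = (1 - a)[:, None] * W + a[:, None] * x
        D[:n] *= (1 - a)
        D[n] = a
    return D.sum(axis=1)

def backward_search(data, J, k_max, n_sweeps=1, seed=0):
    """Search for the learning rate schedule satisfying eqn (17), using eqn (25)."""
    rng = np.random.default_rng(seed)
    k = np.arange(1, k_max + 1)
    beta = 1.0 * (0.05 / 1.0) ** (k / k_max)      # initialization: exponential schedule
    width = beta * J                               # annealing kept at the exponential schedule (assumed)
    beta[-1] = J / k_max                           # analytic endpoint, eqn (21)
    for _ in range(n_sweeps):
        for idx in range(k_max - 2, 0, -1):        # k = k_max - 1, ..., 2
            order = rng.integers(0, len(data), size=k_max)
            totals = contributions_for_schedule(data, J, beta, width, order)
            beta[idx] = J * beta[idx] / (k_max * totals[idx])   # eqn (25)
            beta[idx] = min(beta[idx], 1.0)        # keep a_j <= 1 (safeguard, ours)
    beta[0] = 1.0                                  # analytic endpoint, beta(1) = 1
    return beta

# Example (small sizes so it runs quickly; the paper uses larger k_max):
data = np.random.default_rng(1).uniform(0, 1, size=(100, 1))
beta = backward_search(data, J=5, k_max=50)
print(np.round(1.0 / beta, 2))   # the inverse is roughly linear (cf. Figures 12, 13)
```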

FIGURE 11. Learning rate schedules found for a self-organizing map with 10 units.

FIGURE 12. Inverse of the learning rate schedule.

If one takes the inverse of this learning rate, the linear functional form is clearly shown (Figure 12), and this form does not depend on the number of units or the dimensionality of the map (Figure 13). Note that for the curves in Figures 12 and 13, the endpoints were determined analytically and the rest of the curve was determined experimentally. The curves can be described by a linear equation based on the locations of the endpoints:

\frac{1}{\beta(k)} = m (k - 1) + 1, \qquad m = \frac{k_{max}/J - 1}{k_{max} - 1},    (26)

where J is the total number of units and k_max is the total number of presentations. In the case of a single unit (J = 1), the equation becomes \beta(k) = 1/k, which is the running average schedule and conforms to the well-known conditions on learning rates used in stochastic approximation methods, eqn (1). When k_max is large, the rate becomes

\beta(k) = \frac{1}{J^{-1}(k - 1) + 1},    (27)

which is similar to the schedule commonly used for k-means clustering.
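The improved schedule of eqn (26) in code form, with a quick check of the two analytic endpoints (β(1) = 1 and β(k_max) = J/k_max) and of the J = 1 special case; a small illustration in our own notation.

```python
import numpy as np

def beta_improved(k, J, k_max):
    """Improved learning rate of eqn (26): the inverse rate is linear in k."""
    m = (k_max / J - 1.0) / (k_max - 1.0)
    return 1.0 / (m * (k - 1) + 1.0)

k_max, J = 500, 10
k = np.arange(1, k_max + 1)
beta = beta_improved(k, J, k_max)
print(beta[0], beta[-1], J / k_max)                        # beta(1) = 1, beta(k_max) = J/k_max
print(np.allclose(beta_improved(k, 1, k_max), 1.0 / k))    # J = 1 gives the running average 1/k
```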

Note that k-means can be seen as a specific case of self-organization where the neighborhood consists of only one unit and each unit has its own independent learning rate, which is decreased when that unit is updated. The self-organization algorithm has a global learning rate because several units are updated with each iteration. If one assumes that each unit is updated equiprobably during self-organization, then the two learning rates are identical. The running average schedule for k-means clustering has been proven to converge to a local minimum (MacQueen, 1967). An improved learning rate for learning vector quantization (LVQ1) has been proposed (Kohonen et al., 1992) which is based on the same condition of optimality (equal contribution of presentations on final unit locations) as the rate for SOM proposed here. Since the units in the LVQ1 algorithm do not have a topological neighborhood, the learning rate can be derived analytically in a form similar to the common running average schedule used for k-means clustering. As shown in this paper, a similar learning rate schedule can be obtained for SOM. Because of the similarities between the k-means, LVQ1, and SOM algorithms, the learning rates based on the equal contribution condition for each algorithm all have a similar basic functional form.

The following experiment shows the effectiveness of the new learning rate schedule. The purpose of this test is to determine whether the order of presentation of the training samples has an effect on the final unit positions when using both the common learning rate, eqn (12), and the new learning rate, eqn (26). A training set was created which contained 2 clusters of data (Figure 14). Cluster one on the lower left contains 4800 samples and cluster two on the upper right contains only 200 samples. These data were used to train two identical Kohonen (linear or 1-dimensional) maps of 25 units (map A and map B), each using the commonly used learning rate, eqn (12). For map A, the samples were presented randomly. Map B was trained by first presenting the data falling in cluster one followed by the data falling in cluster two. Figure 14 shows that the results of self-organization differ in both of these cases due to presentation order.

FIGURE 13. Inverse of the learning rate found for 1-D and 2-D maps with various numbers of units.

FIGURE 14. The order of presentation has an effect on the final locations of the units with the commonly used learning rate (12).

FIGURE 15. The order of presentation has little effect on final unit positions with the newly developed learning rate (26).

Map B provides a poor estimate of the density of the training data, since 20% of the units fall in cluster two, which contains only 4% of the data. The experiment was repeated using the same parameters, except that the new learning rate schedule, eqn (26), was used (the annealing rate used was again eqn (12)). Again, one map (map C) was trained by selecting samples randomly and the other map (map D) was trained by selecting samples first from cluster one and then from cluster two. Figure 15 shows that the results of self-organization using the new learning rate show almost no dependence on the order of presentation.

REFERENCES

Altman, N. S. (1992). An introduction to kernel and nearest-neighbor nonparametric regression. American Statistician, 46, 175-185.
Cherkassky, V., & Lari-Najafi, H. (1991). Constrained topological mapping for nonparametric regression analysis. Neural Networks, 4, 27-40.
Grossberg, S. (1976). Adaptive pattern classification and universal recoding, I: Parallel development and coding of neural feature detectors. Biological Cybernetics, 23, 121-134.
Kohonen, T. (1982). Clustering, taxonomy, and topological maps of patterns. In Proceedings 6th International Conference on Pattern Recognition (pp. 114-128). Munich.
Kohonen, T. (1990). The self-organizing map. Proceedings of the IEEE, 78, 1464-1479.
Kohonen, T., Kangas, J., Laaksonen, J., & Torkkola, K. (1992). LVQ_PAK: A program package for the correct application of Learning Vector Quantization algorithms. Proceedings International Joint Conference on Neural Networks (pp. 725-730). Baltimore: IEEE.
Ljung, L. (1977). Analysis of recursive stochastic algorithms. IEEE Transactions on Automatic Control, 22, 551-575.
MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. Proceedings 5th Berkeley Symposium on Mathematics, Statistics and Probability, 3, 281.
Malsburg, C. (1973). Self-organization of orientation sensitive cells in the striate cortex. Kybernetik, 14, 85-100.
Ritter, H., Martinetz, T., & Schulten, K. (1992). Neural computation and self-organizing maps: An introduction. Reading, MA: Addison-Wesley.
White, H. (1989). Learning in artificial neural networks: A statistical perspective. Neural Computation, 1, 425-464.
White, H. (1991). Parametric statistical estimation with artificial neural networks. Technical Report, Department of Economics, UC San Diego.

NOMENCLATURE

N: dimensionality of the training data
k: iteration (discrete time) step, k = 1, 2, ..., k_max
X(k): training data vector presented at iteration step k
x_n(k): nth coordinate of the data vector X(k)
J: total number of units
j: unit index, j = 1, 2, ..., J
W_j(k): vector denoting the location of unit j at time k
w_{n,j}(k): nth coordinate of the location of unit j at time k
i(k): winning unit at time k
\beta(k): learning rate function
C(i(k), j, k): neighborhood function
d_j(k): contribution of the initial position of unit j on the position of unit j at time k
d_j(k, n): contribution of the presentation at time n on the position of unit j at time k