Neurocomputing 74 (2011) 1815–1822
Contents lists available at ScienceDirect
Neurocomputing journal homepage: www.elsevier.com/locate/neucom
A filter based neuron model for adaptive incremental learning of self-organizing maps Mauro Tucci n, Marco Raugi Department of Electric Systems and Automation, University of Pisa, Pisa, Italy
a r t i c l e i n f o
abstract
Available online 7 April 2011
In this work a learning algorithm is proposed for the formation of topology preserving maps. In the proposed algorithm the weights are updated incrementally using a higher-order difference equation, which implements a low-pass digital filter. It is shown that by suitably choosing the filter the learning process can adaptively follow a specific dynamic. Numerical results, for time-varying and static distributions, show the potential of the proposed method for unsupervised learning. & 2011 Elsevier B.V. All rights reserved.
Keywords: Self-organizing maps Incremental learning Digital filters Time-varying environment
1. Introduction The self-organizing map (SOM), with its variants, is a popular neural network algorithm in the unsupervised learning category [1]. The topology preserving mapping from a high dimensional input space to a low dimensional output grid formed by the SOM finds a very wide range of applications [2]. After the training, the distribution of the SOM prototypes (model vectors) resembles vector quantization of the input data with the important additional feature of topology preservation between the input space and the output space. The incremental learning algorithm of the basic SOM follows a competitive learning paradigm, where matching of the prototypes with the input is increased, at each step, by moving all the model vectors in the direction of a randomly selected input sample. The strength of the update step is controlled by a scalar kernel function that has a central role in the learning process, and a large part of the literature is focused on theoretical and practical aspects of different implementations of this kernel function [3–6]. The common feature of all these SOM variants is that the structure of the prototypes update rule is maintained almost unchanged. Then, the focus is on the kernel function, which controls only the strength of the update step, whereas the direction is always radial with respect to the input sample. This behavior has theoretical support only for static kernel width, a condition reached in the final convergence phase of the learning [1]. However during the first phase, when the width of the kernel is shrinking, and the global topology of the input data is learned, a smoother change in the update direction would be a desired feature. To obtain this feature, in this paper a novel SOM model is proposed where the learning is accomplished by means of a higherorder linear difference equation, which implements a predefined
tracking filter. In the proposed model each cell contains a set of memory vectors in order to take trace of the past positions of the model vectors and of the inputs. The update direction is defined by the dynamic of a properly defined filter, which guides the movement of the model vectors as a combination of the past vectors. In that way, the proposed model gives a general framework that is able to define different training strategies, and includes the basic SOM as a particular case. The proposed approach is effective for the analysis of both static and dynamic input distributions. In fact in the proposed method the filter that guides the prototypes learning can be adapted to a changing environment using a suitably designed tracking filter. Many SOM variants have been proposed to process time variant patterns, for instance the time adaptive SOM [9], the predictive SOM [10], and other works [6,4]. In the SOM variant proposed in [9] for tracking moving patterns, neurons that are not able to follow the migrating pattern are deleted, and new neurons are continuously created to replace them. The SOM proposed in [10] is particularly suited for tracking the cluster centers of a migrating digital signal constellation, defined by quadrature amplitude modulation, and it uses one node for each symbol cluster. The SOM variants [6,4] are focused on automatic parameter calculation; however they also reveal a certain degree of map elasticity and capacity to adapt to changes in the input distribution. Many numerical experiments show that the proposed method uses a filter dynamic to define the learning equation, revealing enhanced performances for the analysis of dynamic distributions with respect to the above mentioned methods.
2. Model description 2.1. Classic SOM algorithm and proposed model motivation
n
Corresponding author. Tel.: þ39 0502217348; fax: þ 39 0502217333. E-mail addresses:
[email protected] (M. Tucci),
[email protected] (M. Raugi). 0925-2312/$ - see front matter & 2011 Elsevier B.V. All rights reserved. doi:10.1016/j.neucom.2010.08.028
The basic SOM consists of an ordered grid of a finite number of cells i¼1yD located in a fixed output array, where a distance metric d(c,i) between two cells c and i is defined. Each unit is
1816
M. Tucci, M. Raugi / Neurocomputing 74 (2011) 1815–1822
associated to a model vector (or weight) qi A Rn , which lies in the same space of the input pattern rAD CRn, where D is the dataset to be analyzed: qi ðt þ1Þ ¼ qi ðtÞ þ hci ðtÞðrðtÞqi ðtÞÞ,
i ¼ 1. . .D
ð1Þ
The update step for the cell i is proportional to a time dependent smoothing kernel function hci(t), where c is the winner node defined by the winner takes all (WTA) function: c ¼ argminf:rðtÞqi ðtÞ:g
ð2Þ
i
A widely applied kernel is written in terms of the Gaussian function centered over the winner node: hci ðtÞ ¼ mðtÞexpðdðc,iÞ2 =2s2 ðtÞÞ
ð3Þ
where m(t) is the learning factor and s(t) is the width of the neighborhood. Both functions m(t) and s(t) are defined as decreasing functions of time [1]. Usually the training ends after a predefined number of presentations. After the training the model vectors form a vector quantization of the input distribution, and their spatial arrangement preserves the fixed order defined by the output grid. Hence, the topology preserving map between the input space and the output grid is defined by associating an input sample r to the cell c (corresponding to the nearest centroid qc), using the WTA function in Eq. (2). In general the SOM is a batch algorithm that iterates over data buffer several times and it cannot be applied for online processes, where incremental learning is needed. A basic principle of adaptive incremental learning in neural networks is to develop new neuron models, which are able to adapt to new information without forgetting or overwriting already learned relationships [11]. The main drawback of SOM update rule in Eq. (1) is that the new weight depends only on the present input sample and on the present weight position, and there is no direct dependence on past input and weight values. Then, it is not easy to control the trade-off between learning new data and maintaining previous knowledge. In order to solve this conflict and to introduce more degrees of freedom in the adaptive learning process we propose to modify the neuron model by adding a local ‘‘memory’’ of past events as additional weights. The main weight of the neuron, which defines the center of the Voronoi region of the cell, is then updated as a linear combination of the memory weights. This linear combination is defined using the digital filter theory. Additionally, the self-organization behavior is achieved by updating only the memory of the winner cell and its neighbors. In the following we will show that the proposed modification of the SOM allows the online application of the algorithm for changing environments, and it also brings some benefits for processing static distributions.
difference equation that implements the dynamic of a linear system. Using the Z-transform formalism, the filter implemented in Eq. (4) can be defined by means of its transfer function as GðzÞ ¼
b1 zN1 þ . . . þbN N z a1 zN1 a2 zN2 . . .aN
ð5Þ
In our model, the scalar and strictly proper process G(z) of order N is implemented element-wise in Eq. (4) by means of the linear difference equation of order N described by the filter coefficients [a1,y,aN]T [b1,b2,y,bN]T. The filter G(z) in Eq. (5) guides the training of the SOM centroids; hence it defines the dynamic of the SOM learning process. For this reason in the following we call this filter as ‘‘learning filter’’. Guidelines to define the filter parameters will be described in the following sections. At each time step t¼0,1,2,y, a new input sample r(t) is presented to the map and the winner cell is selected by the WTA function (2). This means to select the cell with the nearest centroid qc(t), which is defined by the actual state of the memory vectors of the cell c as in Eq. (4). Then, the memory of the cells needs to be updated. In order to perform a straightforward onestep time shift of the discrete-time filter variables, the memory vectors should be updated as r~ i1 ðt þ 1Þ ¼ rðtÞ r~ ik ðt þ 1Þ ¼ r ik1 ðtÞ,
k ¼ 2. . .N
q~ i1 ðt þ 1Þ ¼ qi ðtÞ q~ ik ðt þ1Þ ¼ qik1 ðtÞ,
k ¼ 2. . .N
ð6Þ
However, to include the neighborhood collaboration in our model, only the memory of the winner cell and its neighboring cells should be mainly updated, while the memory of the cells that are far from the winner should not be updated. This is accomplished by considering the following update expressions instead of Eq. (6): r ik ðt þ 1Þ ¼ r ik ðtÞ þ hci ðtÞðr~ ik ðt þ 1Þrik ðtÞÞ qik ðt þ1Þ ¼ qik ðtÞ þhci ðtÞðq~ ik ðt þ 1Þqik ðtÞÞ
ð7Þ
where k¼1yN, and the Gaussian neighborhood function hci(t) is calculated as in Eq. (3). The model of the processing unit is shown in Fig. 1. The neighbor collaboration is then accomplished by means of an adaptive update of the ‘‘memory’’ of the neurons. Following the update expressions (7), neurons that do not recognize the input pattern will not update their memory, while the memory of the neurons that recognize the pattern is updated together with the memory of its topological neighbors. As a
2.2. Proposed neuron model description In order to add a memory to the neuron model, we associate to each element i¼1yD of the map grid two sets of N vectors ½r i1 ðtÞ,. . .r iN ðtÞ and ½qi1 ðtÞ,. . .,qiN ðtÞ, with qik ðtÞ,r ik ðtÞ A Rn , which represent the ‘‘local’’ memory of each cell. Vectors qik ðtÞ, k¼1yN, represent the last N values of the centroid qi(t), while vectors r ik ðtÞ, k¼1yN, take trace of the last N values of the input r(t) to the cell i (i.e. the inputs that were able to update the cell state as shown in the following). The centroid position of each cell is defined as the following linear combination of the memory vectors: qi ðtÞ ¼
N X
bk r ik ðtÞ þ ak qik ðtÞ
ð4Þ
k¼1
To properly define the coefficients of the linear combination [a1, y,aN]T [b1,b2,y,bN]T, it is necessary to employ a stable, discrete-time digital filter. In fact Eq. (4) represents a linear
Fig. 1. Implementation scheme of the proposed processing unit. Dashed lines represent the memory update relations in Eq. (7), while solid lines represent the output relation in Eq. (4).
M. Tucci, M. Raugi / Neurocomputing 74 (2011) 1815–1822
consequence, during the training the local memory of each cell i, which is formed by r ik ðtÞ qik ðtÞ k ¼1yN, will remember the values of input and centroid position related to the events that directly involve the cell, i.e. the cell is a winner or it belongs to the active neighborhood of the winner unit. 2.3. Model features: update direction, ordering, convergence, and stability It is known that in order to have well behaved self-organization activities, the model vectors of the winner and its neighbors have to be updated in order to increase the matching with the input pattern. This is usually carried out by increasing the similarity measure used to define the winner unit [1]. In our scheme the model vectors qi(t) are a linear combination of the memory vectors as shown in Eq. (4), and the update of the memory vectors is determined by Eq. (7). Substituting Eq. (7) in Eq. (4) we have the following implicit update expression for the model vectors: qi ðt þ1Þ ¼ qi ðtÞ þ hci ðtÞðq~ i ðt þ 1Þqi ðtÞÞ
ð8Þ
where q~ i ðt þ 1Þ ¼
N X
bk r~ ik ðt þ 1Þ þ ak q~ ik ðt þ 1Þ
k¼1
¼ b1 rðtÞ þ a1 qi ðtÞ þ
N X
bk r ik1 ðtÞ þak qik1 ðtÞ
ð9Þ
k¼2
From Eq. (8) it is easy to note that the model vector qi(t) moves in the direction of the vector q~ i ðt þ1Þ, which represents the next value of the learning filter output in the case of a complete shift of the variables as shown in Eq. (9). The model vector update Eq. (8) of the proposed model is similar to the classic SOM update rule (1). Difference between the two updates is that in our case the target vector is the filter output of each cell, while in the SOM it is the random input sample r(t). The target vector q~ i ðt þ 1Þ is different for each cell, hence the update direction of the model vectors is not globally radial with respect to the input sample r(t). The self-organization of the model vectors occurs if the adaptation of the winner and its neighbors improves the matching with the input sample, and this is also the basic principle of Hebbian learning [1]. In the proposed model, the model vectors move towards the output of the learning filter q~ i ðt þ 1Þ. Then, if the filter behaves as a ‘‘follower’’ of the input (i.e., it tracks constant inputs with zero error), it automatically increases the similarity of the model vectors with the input. This requirement is satisfied if the learning filter G(z) is a stable low-pass filter with unitary static gain [8]; hence if the following condition is fulfilled N X k¼1
bk þ
N X
ak ¼ 1
ð10Þ
1817
Eq. (8) implies that the model vectors converge as the kernel function hci(t) tends to zero [1]. 2.4. Network training procedure The training procedure of the proposed model is based on the following steps: 1) Given the vectorial input data DARn, we first create the output grid array of the SOM composed by D cells, and choose the neighborhood arrangement, defining the distance function d(c,i) between two cells for c, i ¼1yD. 2) We design the linear difference equation of order N, which drives the learning process, and suitably define the coefficients [a1,y,aN]T, [b1,b2,y,bN]T of numerator and denominator of the learning filter G(z). Guidelines to determine the filter coefficients, and their influence on the performance of the model are described in following sections. 3) Initial values for the memory vectors can be selected using the SOM random initialization [1]. The so-called linear initialization, which uses the principal axes of the input distribution, can give some benefit as in classic SOM [1]. Both random and linear initialization methods are well known and widely used methods to initialize the SOM and to produce the initial distribution of the model vectors qiinit . In our method the memory vectors are initialized as r ik ð0Þ ¼ qiinit and qik ð0Þ ¼ qiinit for i¼1yD and k¼1yN. 4) At each time step t ¼0,1,2,y, each cell i¼1yD calculates the value of the model vector qi(t) from the actual memory vectors r ik ðtÞ,qi ðtÞ, k¼1yN as in Eq. (4). Note that from Eqs. (4) and (10) we have qi ð0Þ ¼ qiinit . 5) We select a random input sample r(t)AD from the input distribution and find the winner unit c as c ¼ argminf:rðtÞqi ðtÞ:g,
i ¼ 1. . .D
i
6) The memory of the cells is updated as r ik ðt þ 1Þ ¼ r ik ðtÞ þ hci ðtÞðr~ ik ðt þ 1Þrik ðtÞÞ qik ðt þ1Þ ¼ qik ðtÞ þhci ðtÞðq~ ik ðt þ 1Þqik ðtÞÞ where k¼1yN and hci ðtÞ ¼ mðtÞexpððdðc,iÞ2 =2s2 ðtÞÞÞ 7) Return to step 4 or stop the training if a predefined number of presentations has been reached. As a result of the proposed model a new self-organizing algorithm is defined, which shows self-organization activities of the model vectors that can be suitably designed when the leaning filter G(z) is carefully determined. Moreover, in the proposed model the function hci(t) can be similar to other self-organizing paradigms as vector quantization or neural gas [7]. Thus, the proposed procedure includes also these situations as particular cases.
k¼1
If a learning filter with these characteristics is used in the proposed algorithm, and if a continuously repeated input sample r(t)¼r is considered, then the error r qi(t), for i¼1yD, will reduce to zero for t-N. In this sense the similarity of the model vectors with the input sample is increased when the model vectors move as described in Eqs. (8) and (9) and the learning filter fulfill condition (10). As a result our method shows safe ordering dynamics of the model vectors during the self-organization process. In the following we will show that the classic SOM algorithm is obtained by our method considering a learning filter that satisfies condition (10). Regarding the stability of the algorithm, Eqs. (8) and (9) show that the stability of the learning filter assures the local stability of the proposed model. Concerning the convergence of the model,
2.5. Guidelines to design the learning filter In this section we describe general principles for a useful design of the learning filter. A necessary requirement for every kind of learning filter (5) of the proposed algorithm is that it must be a stable low-pass filter that meets condition (10), which also means that the filter must be of Type 1 [8]. The design of the learning filter can be carried out in the continuous time domain defining the filter dynamics as a transfer function G(s) (where s represents the generalized frequency of the Laplace transform), and performing a discretization to obtain the learning filter G(z) using a sampling time of Ts ¼1 seconds . It is useful to represent G(s) as a closed loop system to design the continuous time low-pass filter. In fact a low-pass filter of type
1818
M. Tucci, M. Raugi / Neurocomputing 74 (2011) 1815–1822
1 can always be represented as a closed loop system that has one integrator in the open loop. The simplest case is represented by the following closed loop transfer function: G1 ðsÞ ¼
k s þk
ð11Þ
where k is the open loop proportional gain. Different learning filters can be obtained using different methods to transform the transfer function (11) to discrete time, or by considering more complex closed loop schemes. For instance a ‘‘controller’’ K(s) can be used in the place of the proportional gain k. The proposed neuron model is intrinsically formed as a ‘‘follower’’ of the input pattern r(t). Thus the proposed technique is also effective for a real-time tracking of time-varying input patterns. The tracking can be suitably adapted to desired dynamic properties (velocity, position, etc), depending on the adopted G(z) design. 2.5.1. Static distributions: Butterworth learning filter design We first describe the learning filter that gives the classic SOM. In fact the SOM update equation is obtained from Eqs. (4) and (7) considering the following first order learning filter: G1 ðzÞ ¼
a
ð12Þ
z þ a1
where aA(0,1) is a global ‘‘learning constant’’, which determines the bandwidth of the learning filter, and acts as a global attenuation constant of the learning factor m(t). In fact, considering the learning filter (12), Eq. (8) becomes qi ðt þ1Þ ¼ qi ðtÞ þ ahci ðtÞðrðtÞqi ðtÞÞ which is exactly the SOM learning rule. Hence the basic SOM algorithm is included in our framework and it can be obtained using the first order learning filter G1(z), which is also a Butterworth filter of order 1. In fact the learning filter G1(z) of the classic SOM in Eq. (12) can be derived from the continuous time Butterworth filter G1(s) in Eq. (11) by applying the method of impulse invariance for the conversion to discrete-time. Note that the learning filter G1(z), which produces the SOM algorithm, meets condition (10). To analyze static distributions, we have selected higher-order discrete low-pass Butterworth filters [8] as learning filters, due to their well known ‘‘optimal’’ characteristics of gain and bandwidth reported in circuit analysis. Consequently, the only free parameters of the learning filter are the order N of the filter and the normalized cut-off frequency o0 ¼ oBW/p (where oBW is the 3 db band of the continuous time filter) [8]. Every couple of values of o0A(0,1) and N defines a continuous time Butterworth filter. The discrete time filter parameters [a1,y,aN]T and [b1,b2,y,bN]T of the learning filter can be obtained by converting to discrete time using the method of impulse invariance and a sampling interval Ts ¼1 [8]. The method of impulse invariance is preferred because it produces strictly proper discrete time filters, and this procedure is used to obtain the Butterworth learning filters used in the following sections. For the analysis of static distribution we have considered the time dependent kernel function hci(t) expressed in Eq. (3), as in the classic SOM. The following popular decreasing functions for the learning parameters s(t) and m(t) have been used: 8 > sðtÞ ¼ smin þ ðsmax smin Þð1ðtÞ=T1 Þ2 t o T1 > > > < sðtÞ ¼ s t Z T1 min ð13Þ > mðtÞ ¼ mmin þ ðmmax mmin Þð1ðtÞ=T2 Þ2 t o T2 > > > : mðtÞ ¼ m T=t tZT min
2
where T1 and T2 are, respectively, the decreasing times of the kernel width and of the learning rate.
2.5.2. Dynamic distributions: alpha–beta learning filter design The proposed method can also be used for time-varying distributions. In the case of a moving input pattern the dynamic of the learning filter can follow the dynamic of the input, and the map adapts to the changing environment. To avoid the effects caused by the decrease to zero of the learning rate (that prevents adjustments of the map to input variations), we have considered a time invariant kernel function hci(t) ¼hci that does not decrease to zero. Butterworth filters are not particularly suitable to follow moving targets. For the analysis of migrating patterns we have considered the alpha–beta filter [12], a predictive low-pass filter traditionally applied to track moving objects. The alpha–beta filter is a second order, steady state, discrete-time Kalman filter designed to track a moving object, with a constant process noise covariance s2w , and a measured noise covariance s2v [12]. The input of the filter is a noisy measure of the position of the moving object r(t), and the output of the filter is the predicted position of the moving object q(t). In its commonly used state-space form, the filter uses a ‘‘velocity variable’’ v(t) to simulate the object movement and to calculate the output position. The classic equations of the filter (that also share the structure of the Kalman observer), are the following: ( qðt þ 1Þ ¼ qðtÞ þ vðtÞTs þ a½rðtÞðqðtÞ þ vðtÞTs Þ ð14Þ vðt þ 1Þ ¼ vðtÞ þ Tbs ½rðtÞðqðtÞ þvðtÞTs Þ where [a,b] represents the gains of the Kalman observer. The discrete time transfer function Ga,b(z) between the measured input observation r(t) and the predicted output position q(t), for Ts ¼1s is directly calculated from Eq. (14) as Gab ðzÞ ¼
az þ ba z2 þðb þ a2Þz þ ð1aÞ
ð15Þ
Note that the filter in Eq. (15) is of Type 1, hence it fulfills the condition (10).p The ffiffiffiffiffiffi following relation always holds for a Kalman observer: a ¼ 2bðb=2Þ [12]. This condition, together with stability constrains, implies that aA(0,1) and bA(0,2). The stochastic Kalman design finds optimal [a, b] parameters for known noise covariances s2w and s2v . If the model of the noise is not known a deterministic approach can be used to derive [a, b] from the desired bandwidth o0 of the filter in Eq. (15). For a given value of o0 the values of the parameters [a, b] are uniquely defined, and the alpha–beta learning filter in Eq. (15) is completely defined [12]. In this case the alpha–beta filter will not be optimal in the Kalman sense, but it will retain the good tracking capabilities reported in aerospace and control systems.
3. Numerical results Firstly we consider various examples of static input distributions. In order to evaluate the performance of the proposed algorithm, we will analyze the results obtained by training different maps with some Butterworth learning filters. The performance of the proposed SOM for static patterns is compared with the basic SOM. Successively, we consider a time-varying input distribution. In this experiment the alpha–beta learning filter is used and the performance of the proposed method is compared with other SOM variants that deal with changing environments. Two indexes are used to assess the quality of the obtained maps: the quantization error and the empirical distortion measure. The quantization error is calculated as QEðDÞ ¼
1 X :rqc : :D: r A D
M. Tucci, M. Raugi / Neurocomputing 74 (2011) 1815–1822
1819
where :D: denote the cardinality of the input dataset. The quantization error indicates how close the model vectors are fitting the data. The following empirical distortion measure DM(D,s)is used: DMðD, sÞ ¼
D 1 XX wci ðsÞ:rqi : :D: r A D i ¼ 1
where wci ðsÞ ¼ expððdðc,iÞ2 =2s2 ÞÞ. Unlike the quantization error the distortion measure DM(D,s) considers the SOM topology and we have DE(D,s)-QE(D) whens-0. It is known that the classic SOM minimizes DM(D,s) approximately, whereas the effective minimization of DM(D,s) requires heavy calculations [1]. 3.1. Analysis of static distributions 3.1.1. Gaussian distribution In this experiment we consider that the input belongs to a Gaussian clusterDdg A Rd centered in the origin of the axes and with unitary variance, and the dimensionality of the Gaussian input is d ¼10. A 20 20 grid with hexagonal neighborhood is considered and model vectors have been initialized using random initialization. We trained a set of maps with our method by adopting different Butterworth learning filters (BWLF), with bandwidth o0A[0.1,0.7] and order N ¼1,2,3,4. The total number of training steps is T¼ 100,000. Each training procedure is repeated 30 times with different random initializations in order to avoid possible topological defects. Then, we calculate the averaged quantization error QEðD10 g Þ and distortion measure DMðD10 g ,1Þ of the trained maps. In general, the values of the performance indexes do not differ from their mean values more than 2%, and they approximate the results obtained with the SOM linear initialization. Hence the reported differences in performance can be considered statistically significant when both random and linear SOM initialization methods are used. The use of a BWLF with order N ¼1 gives the learning filter in (12), and in this case we have the original SOM algorithm, as previously shown. Using the BWLF of order N ¼1 we consider the classic SOM algorithm with kernel function ahci(t) and where the global attenuation a directly depends on the bandwidth o0A[0.1,0.7]. We are mainly interested in assessing if filter orders N 41 bring some benefit to the measured quality of the trained maps with respect to the classic SOM (corresponding to N ¼1). 10 Figs. 2 and 3 show the QEðD10 g Þ and DMðDg ,1Þ results obtained with the proposed method. It can be observed that QE improves as the filter bandwidth o0 increases, while it worsens as the filter order increases. Hence better QE is obtained for high filter bandwidth and low filter order. The opposite situation can be noticed observing the DM curves. The trade-off between QE and DM is a common problem that arises in the formation of SOMs.
Fig. 2. Quantization error as a function of the filter order and normalized bandwidth for a Gaussian distribution in R10 and Butterworth learning filters. Order 1 represents the classic SOM algorithm.
Fig. 3. Empirical distortion measure as a function of the filter order and normalized bandwidth for a Gaussian distribution in R10 and Butterworth learning filters. Order 1 represents the classic SOM algorithm.
Fig. 4. Pareto front analysis of the various designs in the QE–DM plane. Order 1 represents the classic SOM algorithm. To increase the DM index it is necessary to increase the filter order.
Increasing the filter order for fixed filter bandwidth produces maps with improved distortion measure but affects the quantization error, while increasing the filter bandwidth produces opposite trends. Furthermore, it is noticeable that every BW filter with N 41 has a range of bandwidth where both QE and DM perform better than the classic SOM (represented by N ¼1). This is better shown by calculating the Pareto front in the QE and DM planes, as shown in Fig. 4. The figure shows some curves for different filter orders, and the Pareto front is highlighted. The curve for N ¼1 represents the classic SOM performance. It can be observed that in order to minimize DM and to keep the best QE (i.e. to stay in the Pareto front) it is necessary to increase the BW filter order. In topology preserving SOMs the minimization of DM is commonly considered more important than the minimization of QE. In this sense we can state that the proposed model for N 41 gives better self-organization performances than the basic SOM, because all the points of the Pareto front with N 41 give a better overall performance than the basic SOM in terms of DM. It can be noticed that the presented results, related to a 20 20 lattice size, possess general validity independently from the grid dimension, and represent general trends observed in several numerical experiments carried out by the authors.
3.1.2. Clustered distribution In the second experiment we consider that the input distribution (depicted in Fig. 5) is formed by three clusters in R2, where the cluster 1 (at the left) has a uniform probability density function and the clusters 2 and 3 (at the top and at the right, respectively) have a Gaussian probability density function. A classic 20 20 SOM with hexagonal neighborhood is considered by taking a BWLF with N ¼1 and o0 ¼0.8. For different random
1820
M. Tucci, M. Raugi / Neurocomputing 74 (2011) 1815–1822
Fig. 5. Model vectors of a map with topological defect trained with classic SOM algorithm.
Fig. 7. Model vectors of a map trained with a fourth order Butterworth learning filter.
Table 1 Characteristics of real world datasets. Dataset
# Samples
# Features
Wisconsin breast cancer Abalone Adult Kdd Cup 1999
569 4177 32,561 494,020
30 8 14 41
the proposed method we can obtain the desired feature by suitably selecting the filter order and cut-off frequency (as shown in the previous example), with no need to operate on hci(t). Using a Butterworth filter of order N ¼4, and cut-off frequency o0 ¼0.5, without changing the annealing schemes for the kernel function, the map in Fig. 7 is obtained.
Fig. 6. Model vectors of a map trained with a second order Butterworth learning filter.
initializations some topological defects (as the one shown in Fig. 5) can appear in the final map. We have carried out a number of simulations in order to assess the statistical presence of this kind of defects in the trained maps for the given distribution, both for the classic SOM and for our method using a second order Butterworth learning filter with cut-off frequency o0 ¼0.8. Both algorithms use the same annealing schemes for the kernel function hci(t). It comes out that the presence of topological defects with the second order BWLF reduces of 45%. In the trained maps an undesired curvature is present in the grid distribution of the model vectors at cluster 1, as shown in Fig. 6. This is due to the fact that the two principal axes of the input distribution have different lengths, while the map has a 20 20 square dimension. In this case a rectangular map may behave better, but we keep the 20 20 square grid to evidence the below described characteristic. In the classic SOM the learning parameters that define hci(t) have to be tuned in order to improve the map quality, but this is not a trivial task. On the other hand, in
3.1.3. Real world problems In this section we illustrate the application of our model to a number of real world problems. All the analyzed datasets are from the online UCI depository [13]. Table 1 shows the features of the considered problems. Table 2 shows the quantization error and distortion measure obtained using the former datasets to train a standard SOM, a SOM with second order BW filter and a SOM with fourth order Butterworth filter. The map grid dimension is 16 16 in all cases, the training duration is 40,000 presentations for the first two problems, 100,000 presentations for the Adult dataset, and 1e6 samples for the Kdd Cup 1999 data. The training is repeated several times with different random initializations, and Table 2 shows the statistically significant averaged values. The general trends observed in previous experiments are here confirmed: as the filter order increases, quantization error worsen and distortion measure improves. If a low quantization error is the goal then the SOM with BW filter of order 1 reaches the best values, whereas if reducing the map distortion is the main issue a higher-order filter, with fixed bandwidth, gives better DM values for slightly increased QE values. The last row of Table 2 reports the average performance gains (over the four considered problems) of the two higher-order methods with respect to the first order case, which represents the standard SOM.
M. Tucci, M. Raugi / Neurocomputing 74 (2011) 1815–1822
1821
Table 2 Performance indexes of the trained maps for various real world datasets. Dataset
Wisconsin breast cancer Abalone Adult Kdd Cup 1999 Average gain (%)
SOM
BW-SOM
BW-SOM
N¼ 1, o0 ¼ 0.7
N ¼2, o0 ¼ 0.7
N ¼4, o0 ¼ 0.7
QE
DM
QE
DM
QE
DM
45.2 0.0721 219 10.9 –
860 1.11 1830 217 –
47.8 0.0780 235 11.2 6.0
801 0.97 1690 185 þ10.47
55.4 0.081 248 12.8 16.40
754 0.89 1560 142 þ 20.36
Fig. 8. Quantization error as a function of the standard deviation sRW of the random walk step vector.
3.2. Analysis of time-varying distributions 3.2.1. Moving pattern with fixed structure We consider a square pattern in R2 with unitary uniform distribution, which moves in a two-dimensional plane with a random velocity vector. We use the filter Gab(z) in Eq. (15) as learning filter of the proposed SOM to follow the moving square pattern. We use a 10 10 grid with 100 nodes. The desired behavior is that the grid of the model vectors should be able to follow the square pattern without losing the topologic ordering. A linear initialization is used so that the centroids start from an initial ordered state inside the square pattern. At each iteration (i.e. each new presentation) a random vector forming a Normal distribution in R2 with zero mean and variance s2RW , is summed to the square pattern. Then, the input pattern moves in the plane following a random walk [14] without changing its shape and orientation. A new point inside the shifted pattern is selected as input to the SOM. The performance obtained with the alpha–beta filter is compared with the classic SOM (learning filter of order 1) for filter bandwidth o0A[0.1,0.7]. Other methods used for comparison are the parameter-less SOM [4], TASOM [9], and the predictive SOM [10]. The duration of the learning is T¼20,000 presentations, and it is repeated for different values of sRW. The comparison measures QE and DM are calculated at each iteration step, using the actual pattern position. The averaged values of QE and DM during the whole training are shown in Figs. 8 and 9, respectively, with respect to sRW. The different lines of the classic SOM and the proposed alpha–beta filter are related to different values of the bandwidth o0A[0.1,0.7]. The best performances are obtained with the proposed alpha–beta learning filter, which also shows values of QE and DM almost independent of the bandwidth o0. The classic SOM shows higher values of QE and DM for the same sRW and it also exhibits a stronger dependence on
Fig. 9. Empirical distortion measure as a function of the standard deviation sRW of the random walk step vector.
the bandwidth, revealing similar trends as in the static experiments: for increasing bandwidth QE improves and DM worsens. From these results we can observe that the selection of the bandwidth of the learning filter Gab(z) is not a critical point in the design of the proposed alpha–beta filter to follow moving patterns, even in the presence of an uncertain time variation of the environment. On the other hand the classic SOM with fixed kernel requires the knowledge of the time variation of the input to choose a suitable bandwidth. Among the other compared methods the best performing is TASOM, while parameter-less SOM and predictive SOM show lower performances. From this experiment the proposed learning filter approach for SOM learning reveals good capabilities to adaptively learn in a changing environment.
3.2.2. Moving pattern with structure change A second time-varying distribution experiment where the structure of the data also changes has been analyzed. The changing environment consists of an initial spherical cluster generated by a Gaussian distribution in R3 with zero mean and unitary variance. The standard deviation of the Gaussian cluster grows smoothly and linearly in time up to 4 times its initial value, and this change takes 10,000 points presentations. Then, the cluster splits into two clusters that, maintaining the final variance value, move with constant velocity following a straight line path in opposite directions. The duration of this phase is additional 10,000 points presentations. The final positions of the cluster centers are (20,0,0) and ( 20,0,0). The proposed method that uses the alpha–beta filter and the other methods previously introduced are trained using this changing distribution. The map grid dimension is 20 20, and the online training stops after
1822
M. Tucci, M. Raugi / Neurocomputing 74 (2011) 1815–1822
Table 3 Performance indexes of the trained maps for moving pattern with structure change. Method
QE
DM
Static SOM at final state Alpha–beta filter, o0 ¼ 0.7 Alpha–beta filter, o0 ¼ 0.1 TASOM Fixed kernel SOM, N ¼ 1, o0 ¼0.7 Parameter-less SOM Predictive SOM
2.05 2.30 2.45 2.58 2.64 2.70 3.10
20.8 22.4 24.2 26.9 25.5 27.7 29.3
20,000 presentations, which represent one full presentation of the moving distribution. The maps are initialized using linear initialization, and we consider the QE and DM values of the maps at the final instant. Hence we have considered only the state of the maps at the end of the training. For comparison we also train a map using only the ‘‘static’’ distribution that represents the final state of the time-varying pattern, which consists of two well separated Gaussians. This map is trained with 40,000 presentations using a first order BW filter with bandwidth 0.7, and it considers shrinking values of the annealing schemes m(t) and s(t) as in the case of stationary distributions. Table 3 shows the values of the performance indicators of the different used methods. The alpha–beta filter method is the one that better approximate the performance of the static map trained directly with the final distribution. In this experiment, as in the previous one, the good tracking capabilities of the filter reveal to be mostly independent form the filter bandwidth. Hence a filter bandwidth of o0 ¼0.7 is in general a good choice for the alpha–beta learning filter in the presence of changing environments. The performances of the other methods confirm the results obtained in the previous experiment.
4. Computational aspects Considering the numerical complexity of the proposed model, it is important to note that the more demanding operations are the comparisons needed to find the winner unit. In the proposed model the search of the winner is accomplished as in classical SOM, so that, if no shortcut winner search is used, the numerical complexity is o(D2)where D is the number of map units. The proposed model increases the number of updating operations per unit: when the classic SOM needs 1 scalar-vector multiplications at each update step, the proposed method needs 4N þ1 scalarvector multiplications, so that the increase in the computational load is linear with the filter order N. With regard to the memory requirements the proposed model needs 2N times the memory needed by the classic SOM. The increased requirements of computational resources are acceptable when the order of the filter is kept below N ¼5. A filter order equal to 5 or higher leads to a memory requirement, which is more than 10 times higher of the memory used by standard SOM. If the number D of cells is high, i.e. in the order of thousands, the memory required using higher-order filters may become prohibitive. For this reason we focus the attention on low order filters, and we find that the benefits of the proposed model are noticeable for N o5. 5. Conclusions The presented work shows the feasibility of implementing the SOM learning with a higher-order difference equation, which takes into account the data of the previous steps of the update
process. The incremental update rule proposed is based on a discrete difference implementation of a suitably designed lowpass digital filter. The proposed model is a framework that allows the flexibility to design different learning strategies for the incremental training of various algorithms related to vector quantization, in particular the topology preserving methods such as SOMs. It is shown that this approach allows a simple and flexible control of the SOM formation. This gives better performances indexes with respect to a classical SOM at a cost of a slightly higher computational cost. Many experiments show that the proposed technique is also well suited for tracking of the time-varying input patterns. References [1] T. Kohonen, Self-Organizing Maps, third edition, Springer, Berlin, 2001. [2] T. Kohonen, E. Oja, O. Simula, A. Visa, J. Kangas, Engineering applications of the self-organizing map, Proc. IEEE 84 (1996) 1358–1384. [3] Y.M. Cheung, L.T. Law, Rival-model penalized self-organizing map, IEEE Trans. Neural Networks 18 (2007) 289–295. [4] E. Berglund, J. Sitte, The parameterless self-organizing map algorithm, IEEE Trans. Neural Networks 17 (2006) 305–316. [5] S. Cho, J. Seok, Self-organizing map with time-invariant learning rate and its exponential stability analysis, Neurocomputing 19 (1998) 1–11. [6] K. Haese, Self-organizing feature maps with self-adjusting learning parameters, IEEE Trans. Neural Networks 9 (1998) 1270–1278. [7] Berkovich Martinetz, K. Schulten, Neural-gas network for vector quantization and its application to time-series prediction, IEEE Trans. Neural Networks 44 (1993) 558–569. [8] A. Antoniou, Digital Filters: Analysis, Design, and Applications, McGraw-Hill, New York, 1993. [9] H. Shah-Hosseini, R. Safabakhsh, TASOM: a new time adaptive self-organizing map, IEEE Trans. Syst. Man Cybern. B, Cybern. 33 (2003) 271–282. [10] A. Hirose, T. Nagashima, Predictive self-organizing map for vector quantization of migratory signals and its application to mobile communications, IEEE Trans. Neural Networks 14 (2003) 1532–1540. [11] E. Lughofer, Extensions of vector quantization for incremental clustering, Pattern Recognition 41 (2008) 995–1011. [12] J.H. Painter, D. Kerstetter, S. Jowers, Reconciling steady-state Kalman and alpha–beta filter design, IEEE Trans. Aerosp. Electron. Syst. 26 (1990) 986–991. [13] P.M. Murphy, D.W. Aha, UCI Repository of Machine Learning Databases [Machine-Readable Data Repository], Department of Information and Computer Science, University of California, Irvine, 1992. [14] M. Kac, Random walk and the theory of Brownian motion, Am. Math. Monthly 54 (1947) 369–391.
Mauro Tucci received the Ph.D. degree in applied electromagnetism from the University of Pisa, Pisa, Italy, in 2008. Currently, he is a postdoctoral research fellow of electrical engineering at the Department of Electrical Systems and Automation, University of Pisa. His research activity is in machine learning, data analysis, and system identification, with applications in electromagnetism, non-destructive testing, and power-line communications.
Marco Raugi received the Ph.D. degree in electrical engineering from the University of Pisa, Pisa, Italy, in 1990. Currently, he is a fulltime professor of electrical engineering at the University of Pisa. He is an author of more than 100 papers in international journals and conferences. His research activity is in numerical electromagnetics with main applications in nondestructive testing, electromagnetic compatibility, communications, and computational intelligence. Prof. Raugi was the General Chairman of the international conferences Progress In Electromagnetic Research Symposium in 2004 and the IEEE International Symposium on Power Line Communications and its Applications in 2007. He is a recipient of the IEEE-Industry Application Society 2002 Melcher Prize Paper Award.