Neural Networks 13 (2000) 719–729
www.elsevier.com/locate/neunet

Contributed article

On-line learning in RBF neural networks: a stochastic approach

M. Marinaro a,b,c, S. Scarpetta b,c,*

a International Institute for Advanced Scientific Studies "E. R. Caianiello", Via Pellegrino 19, Vietri sul Mare (Sa), Italy
b Dipartimento di Scienze Fisiche "E.R. Caianiello", Università di Salerno, Via S. Allende, I-84081 Baronissi (Sa), Italy
c INFM, Unità di Salerno, Salerno, Italy
Received 13 August 1999; accepted 22 May 2000

Abstract

The on-line learning of Radial Basis Function neural networks (RBFNs) is analyzed. Our approach makes use of a master equation that describes the dynamics of the weight space probability density. An approximate solution of the master equation is obtained in the limit of a small learning rate. In this limit, the on-line learning dynamics is analyzed and it is shown that, since fluctuations are small, the dynamics can be well described in terms of the evolution of the mean. This allows us to analyze the learning process of RBFNs in which the number of hidden nodes K is larger than the typically small number of input nodes N. The work represents a complementary analysis of on-line RBFNs with respect to previous works (Phys. Rev. E 56 (1997a) 907; Neur. Comput. 9 (1997) 1601), in which RBFNs with N ≫ K were analyzed. The generalization error equation and the equations of motion of the weights are derived for generic RBF architectures, and numerically integrated in specific cases. The analytical results are then confirmed by numerical simulations. Unlike the case of large N > K, we find that the dynamics in the case N < K is not affected by the problems of symmetric phases and subsequent symmetry breaking. © 2000 Elsevier Science Ltd. All rights reserved.

Keywords: On-line learning; Radial basis function neural networks; Statistical mechanics; Generalization error

1. Introduction

Learning from examples in a layered neural network is an optimization problem based on the minimization of a learning error. In batch learning the error is defined as an additive error over a finite set of examples (the training set). In the on-line learning scenario the synaptic weights are updated sequentially, according to the error computed on the last selected example. The analysis of Radial Basis Function Networks (RBFNs) in batch scenarios, i.e. with repeated presentation of a finite training set, has been addressed by various authors, who focused on the generalization error and dealt with the quenched averages by making different assumptions and approximations (Holden & Niranjan, 1997; Freeman & Saad, 1995). The on-line scenario can present some advantages with respect to batch learning when the number of available examples is large, both in terms of storage and of computational time, and can also allow for temporal changes in the task being learned. Moreover, the averages that account for the disorder introduced by the random selection of an example at each time step can be calculated directly, without having to deal with quenched averages, which can present significant difficulties in networks with hidden nodes.

* Corresponding author. Dipartimento di Scienze Fisiche "E.R. Caianiello", Università di Salerno, Via S. Allende, I-84081 Baronissi (Sa), Italy. Tel.: +39-081-575-5939; fax: +39-089-965237. E-mail address: [email protected] (S. Scarpetta).

The analysis of on-line learning in neural networks has typically been approached by means of Statistical Mechanics techniques in one of two ways (see Saad (1998) for a recent review). The first approach (Biehl & Caticha, 1999; Saad & Solla, 1995a,b), usually called the statistical physics or large-network approach, makes use of the thermodynamic limit (obtained by assuming an infinite number, N, of input nodes) and analyzes quantities that are self-averaging in this limit. The second approach (Heskes & Kappen, 1993; Heskes, 1994), usually called the stochastic or small-network approach, looks at finite systems and makes use of ensemble averages in order to describe the quantities of interest, whose probability distribution is governed by a master equation (ME). The first approach has proved to be very powerful in analyzing large multilayer perceptron (MLP) networks (Biehl, Riegler, & Wohler, 1996; Riegler & Biehl, 1995; Saad & Solla, 1995a,b). When the thermodynamic limit K, N → ∞ is introduced, the generalization error is expressed in terms of a finite set of mean-field observables whose dynamics is governed by a closed set of deterministic differential equations. Gradient

0893-6080/00/$ - see front matter © 2000 Elsevier Science Ltd. All rights reserved. PII: S0893-6080(00)00052-6

descent algorithms, natural gradient algorithms, optimal learning rates, etc. have been studied in this framework, for both the asymptotic and the symmetric phases of MLPs. The application of this approach to on-line RBFNs has recently been analyzed in Freeman and Saad (1997a,b). This approach presents some difficulties for RBFNs: in fact, unlike in MLPs, the thermodynamic limit is totally meaningless for RBFNs. Since the basis functions are localized, the N → ∞ limit implies that the basis functions respond only in a vanishingly small zone of the input space, and, as noted in Freeman and Saad (1997a,b), there is no obvious reasonable rescaling of the basis functions which makes the thermodynamic limit sound. Freeman and Saad (1997a,b) try to overcome this problem, resulting in a sensible analysis of RBFNs, which is limited, however, to the case of large N > K. To analyze the most common situation for RBF architectures, K > N, different approaches have to be used. In this paper we adopt the "small nets" or stochastic method for analyzing the learning dynamics of RBFNs in an on-line scenario. This approach allows the description of on-line learning for an arbitrary number of hidden and input nodes. In the stochastic approach (Heskes & Kappen, 1991, 1993; Heskes, 1994; Leen, Schottky & Saad, 1998), the learning process is viewed as a stochastic process governed by a continuous-time ME. Since the ME cannot be solved exactly, approximation methods are developed. Typically, as in this paper, a small-fluctuation expansion provides a description of the dynamics in terms of suitably scaled fluctuations around a deterministic flow. Using a small-fluctuation Ansatz for a small learning rate, a set of ordinary differential equations for the expected values of the net parameters and their variances is derived.
Numerical simulations of RBF networks confirm that the fluctuations around the mean value are irrelevant at small learning rate (as we expect for regions of the energy landscape with positive curvature), and that a description in terms of mean values is possible for any value of N. The performance of the network during the learning process is then measured by estimating the generalization error. The study of the generalization error curves shows that a peculiar characteristic of on-line learning in typically sized RBFNs is the absence of the plateau corresponding to the symmetric phases. This is a consequence of the localized shape of the basis functions, which prevents the lack of differentiation among the hidden nodes that, on the contrary, is always present in the on-line learning of MLPs.

2. The framework

The RBF architecture consists of a feed-forward two-layer network in which the transfer function of each hidden node is radially symmetric in the input space. We will focus our attention on Gaussian basis functions, which are the most commonly used and have many useful analytical properties.

The network implements a mapping f: ξ → y from an N-dimensional input space, ξ ∈ S_input, to a one-dimensional output space, y ∈ S_out. The network has N input nodes, one linear output node and a hidden layer with an arbitrary number, K, of hidden nodes, whose basis functions we denote as Ψ_i(ξ), i = 1…K. The output node computes a linear combination of the outputs of the hidden nodes, parameterized by the weights u between the hidden and the output layers. Let the basis function of each hidden node be

Ψ_i(ξ; x_i, γ_i) = e^{−γ_i ‖ξ − x_i‖²}    (1)

where ‖·‖ is the Euclidean distance, and γ_i is the parameter that controls the spread of the function around the center x_i ∈ S_input. The function computed by an RBF network with K hidden nodes is:

net(ξ) = Σ_{i=1}^{K} u_i e^{−γ_i ‖ξ − x_i‖²}    (2)
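As a concrete illustration of Eqs. (1) and (2), the forward pass of such a network can be sketched in a few lines of code (a minimal sketch of ours; the function and variable names are not from the paper):

```python
import math

def psi(xi, center, gamma):
    """Gaussian basis function of Eq. (1): exp(-gamma * ||xi - center||^2)."""
    sq_dist = sum((a - c) ** 2 for a, c in zip(xi, center))
    return math.exp(-gamma * sq_dist)

def net(xi, u, centers, gammas):
    """RBF output of Eq. (2): linear combination of K Gaussian hidden nodes."""
    return sum(ui * psi(xi, ci, gi) for ui, ci, gi in zip(u, centers, gammas))

# Example: K = 2 hidden nodes in an N = 2 input space.
u = [1.0, -0.5]
centers = [[0.0, 0.0], [1.0, 1.0]]
gammas = [2.0, 2.0]
print(net([0.0, 0.0], u, centers, gammas))  # 1 - 0.5*exp(-4), since psi_1 = 1 here
```

Each hidden node responds appreciably only within a neighbourhood of size ~1/√γ_i of its own center; this localization property is what the analysis below relies on.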

The values u_i, x_i and γ_i (i = 1…K) are the adaptable parameters of the network. Typically a two-stage training procedure is used in RBF nets (Bishop, 1995). In the first stage, the parameters governing the basis functions (x_i and γ_i) are determined using unsupervised techniques, i.e. methods which use only the input data and not the target data (like Gaussian mixture models or clustering algorithms). The second stage of training then involves the determination of the second-layer weights u by fast linear supervised methods. Although fast to train, this approach generally results in suboptimal networks: indeed, the setting of the basis function parameters x_i and γ_i using density estimation methods takes no account of the target labels associated with the data (Bishop, 1995). To obtain optimal performance we should include the target data in the training procedure. The alternative approach is to adapt all the net parameters u_i, x_i and γ_i by a supervised training procedure, such as gradient descent. This represents a non-linear optimization problem, which will typically be computationally intensive. We will investigate this second approach. The net learns on-line, from a sequence of training examples {(ξ^μ, f_t^μ), μ = 1…}. The components of the training input vectors ξ^μ are uncorrelated Gaussian random variables with zero mean and spread parameter σ, i.e. their probability distribution is

P_σ(ξ) = e^{−σξ²} / (π/σ)^{N/2}    (3)
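With this convention each component of ξ has variance 1/(2σ). A sampler for Eq. (3) can be sketched as follows (our own illustrative code, not from the paper):

```python
import random

def sample_input(N, sigma, rng):
    """Draw xi from P_sigma of Eq. (3): N i.i.d. zero-mean Gaussian
    components, each with variance 1/(2*sigma)."""
    std = (1.0 / (2.0 * sigma)) ** 0.5
    return [rng.gauss(0.0, std) for _ in range(N)]

rng = random.Random(0)
sigma, N = 2.0, 2
samples = [sample_input(N, sigma, rng) for _ in range(100000)]
var = sum(s[0] ** 2 for s in samples) / len(samples)
print(var)  # empirical variance of one component, close to 1/(2*sigma) = 0.25
```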

Without loss of generality we can suppose that the training data are produced by an unknown teacher network with an RBF architecture and an arbitrary number, M, of hidden nodes (the so-called student-teacher scenario). Although the framework enables us to consider a wide range of cases, we will consider here the case where the number, K, of student basis functions equals the number, M, of teacher basis functions. The function computed by the teacher is given by:

f_t(ξ) = Σ_{j=1}^{K} v_j Ψ_j(ξ; x_j^0, β_j)    (4)

The learning dynamics is driven by the on-line gradient descent algorithm, and the parameters are adjusted according to the gradient of the error computed on the last example. Therefore, at time μ the net parameters are updated with the rule:

x_i^{μ+1} = x_i^μ − η ∂e^μ/∂x_i
γ_i^{μ+1} = γ_i^μ − η ∂e^μ/∂γ_i    (5)
u_i^{μ+1} = u_i^μ − η ∂e^μ/∂u_i

where η is the learning rate, and e^μ is the error made by the student with parameters {γ_i, x_i, u_i}, i = 1…K, on the input ξ^μ:

e(ξ^μ; {γ_i, x_i, u_i}) = [net(ξ^μ) − f_t(ξ^μ)]² = [ Σ_{i=1}^{K} u_i e^{−γ_i(ξ^μ−x_i)²} − Σ_{j=1}^{K} v_j e^{−β_j(ξ^μ−x_j^0)²} ]²

We write Eqs. (5) in a more compact form by introducing the vector of parameters J = {u_i, γ_i, x_i}, i = 1…K. The update rule (5) becomes:

J^{μ+1} = J^μ + η F_J(J^μ, ξ^μ)    (6)

where F_J = {F_u, F_γ, F_x} is minus the gradient of the error:

F_{γ_i} = −∂e/∂γ_i,  F_{u_i} = −∂e/∂u_i,  F_{x_i} = −∂e/∂x_i    (7)

The training error is the average, over all training examples, of the quadratic deviation e(ξ; {γ_i, x_i, u_i}). The generalization error is defined by

e_g(J) ≡ e_g({γ_i, x_i, u_i}) = ⟨e(ξ; {γ_i, x_i, u_i})⟩_ξ

where the average is over the distribution P_σ(ξ) of the input vectors. In the following paragraphs we will assume an arbitrary finite input dimension, N, and an arbitrary number of hidden nodes, K, focusing on the most interesting case K > N. The thermodynamic limit K ∝ N → ∞, which is used to analyze MLPs, cannot be applied to RBFNs, because K is generally greater than N and, besides, the N → ∞ limit would imply the unrealistic situation in which each basis function becomes more and more peaked, covering only a vanishingly small zone of the input space. Therefore, in order to obtain the differential equations of motion of the RBFN, we use the stochastic approach and go through the following two steps:

1. In the first step we go from the discrete-time process to a continuous-time description, for any value of N.
2. In the second step, by introducing a small learning rate limit, we write a closed set of differential equations for the mean values of the net parameters and their fluctuations.

2.1. From discrete to continuous-time master equation

We start from the discrete-time learning process defined in Eq. (6). Eq. (6) defines a stochastic process, since at each step the vector ξ^μ is chosen randomly from the training set. Let us introduce the probability density p_μ(J) of finding the net parameters in the state J at the discrete iteration step μ. In terms of this microscopic probability density the process (6) can be written as:

p_{μ+1}(J) = ∫ dJ′ W[J, J′] p_μ(J′)    (8)

with the transition probability densities:

W[J′, J] = ⟨δ(J′ − J − η F[J, ξ])⟩_ξ    (9)

Bedeaux, Lakatos-Lindberg, and Shuler (1971) showed that a continuous-time description can be obtained from a discrete master equation, like Eq. (8), through the assignment of a random value Δt (with a characteristic time τ) to the time interval between two successive iteration steps. Following this approach we obtain from Eq. (8) a ME for the probability P(J, t) of finding the net in the state J at time t:

∂P(J, t)/∂t = (1/τ) ∫ dJ′ {W[J, J′] − δ(J − J′)} P(J′, t)    (10)

This result is exact, and it is valid independently of τ, η, and N. The parameter τ fixes the time scale; we will choose τ = 1, i.e. the average time between two learning steps is our unit of time. Note that the whole procedure introduces an uncertainty in the time axis t, which is the cost to pay for not having taken the limit N → ∞. We will denote the distribution of the states J at time t by J(t). The expected value of an arbitrary function Ω(J) at time t is given by the ensemble average:

⟨Ω(J)⟩_{J(t)} = ∫ dJ P(J, t) Ω(J)    (11)

The symbol ⟨·⟩_ξ indicates the average over the distribution of examples P(ξ), while ⟨·⟩_{J(t)} indicates the average over the distribution P(J, t).

2.2. Stochastic process for small learning rate

Using the master Eq. (10) and the definition (11), one obtains exact evolution equations for the mean value of the net parameters or of their products. Namely

d/dt ⟨J_i⟩_{J(t)} = ∫ dJ J_i ∂P(J, t)/∂t    (12a)

d/dt ⟨J_i J_j⟩_{J(t)} = ∫ dJ (J_i J_j) ∂P(J, t)/∂t    (12b)

Keeping in mind Eqs. (9), (12a) and (12b), and denoting by F̄(J) the average of the learning rule over the distribution of examples, F̄(J) = ⟨F[J, ξ]⟩_ξ, one obtains:

(1/η) d⟨J_i⟩_{J(t)}/dt = ⟨F̄_i(J)⟩_{J(t)}    (13a)

(1/η) d⟨J_i J_j⟩_{J(t)}/dt = ⟨F̄_i(J) J_j⟩_{J(t)} + ⟨J_i F̄_j(J)⟩_{J(t)} + η ⟨D_ij(J)⟩_{J(t)}    (13b)

where D_ij(J) = ⟨F_i[J, ξ] F_j[J, ξ]⟩_ξ. Here F̄(J) and D(J) are the drift vector and the diffusion matrix, respectively, of the stochastic process under consideration. In a similar way one can write the evolution equations for the average value of the fluctuations of J around the mean value ⟨J⟩_{J(t)}:

(1/η) dΣ_ij/dt = ⟨F̄_i(J)(J_j − ⟨J_j⟩_{J(t)})⟩_{J(t)} + ⟨(J_i − ⟨J_i⟩_{J(t)}) F̄_j(J)⟩_{J(t)} + η ⟨D_ij(J)⟩_{J(t)}    (13c)

where Σ is the covariance matrix Σ_ij(t) = ⟨(J_i − ⟨J_i⟩_{J(t)})(J_j − ⟨J_j⟩_{J(t)})⟩_{J(t)}. The exact evolution equations for higher-order cumulants can be derived in the same way. Eqs. (13a)-(13c) hold for every value of η and N, but unfortunately they are, in general, unsolvable. Therefore some approximations are introduced in order to obtain approximate solutions of Eqs. (13a)-(13c). It is well known that a formal solution of the ME can be written in terms of the Kramers-Moyal expansion (van Kampen, 1981) (for simplicity we consider the one-dimensional case here)

∂P(J, t)/∂t = Σ_{n=1}^{∞} ((−1)^n / n!) (∂/∂J)^n {a_n(J) P(J, t)}    (14)

where the a_n(J) are the so-called jump moments, defined by:

a_n(J) ≡ ∫ dJ′ (J − J′)^n W(J|J′) = η^n ⟨F^n(J, ξ)⟩_ξ ≡ η^n ā_n(J)    (15)

where all the ā_n(J) are of order 1 and are independent of the parameter η. When the scaling assumption is valid, i.e. when the terms of Eq. (14) are proportional to powers of some parameter which can be taken to the zero limit, it is possible to truncate the series (14), obtaining an approximate solution (usually the series is truncated at the second term, leading to a Fokker-Planck equation). Unfortunately, in our case all the terms of Eq. (14) are of the same order, and it is unjustifiable to break off the Kramers-Moyal series after a finite number of terms. Thus, to solve our problem, following Heskes and Kappen (1993) and Heskes (1994), we use a sort of Van Kampen system-size expansion: starting from the full Kramers-Moyal expansion for P(J, t), we make the small-fluctuation Ansatz

J = φ(t) + √η z    (16)

(fluctuations of order √η around the deterministic part φ(t)). In the limit of a small learning rate (η → 0) we obtain a Fokker-Planck equation for the fluctuation distribution p(z, t) ≡ P(φ(t) + √η z, t):

(1/η) ∂p(z, t)/∂t = −ā′_1(φ(t)) (∂/∂z){z p(z, t)} + (1/2) ā_2(φ(t)) (∂²/∂z²) p(z, t)    (17)

(where the prime denotes differentiation with respect to the argument) and a deterministic equation for φ(t):

dφ(t)/dt = η ā_1(φ(t))    (18)

From the Fokker-Planck Eq. (17) we can calculate the dynamics of ⟨z⟩ and of the size of the fluctuations ⟨z²⟩_{J(t)} = (1/η) ⟨(J − φ(t))²⟩_{J(t)}:

(1/η) ∂⟨z⟩_{J(t)}/∂t = ā′_1(φ(t)) ⟨z⟩_{J(t)}    (19a)

(1/η) ∂⟨z²⟩_{J(t)}/∂t = 2 ā′_1(φ(t)) ⟨z²⟩_{J(t)} + ā_2(φ(t))    (19b)

If we choose the initial conditions such that the fluctuations are zero at t = 0, then the deterministic trajectory φ(t) coincides with the mean value ⟨J⟩_{J(t)}, and ⟨z²⟩ gives the fluctuations around the mean value of J: Σ = ⟨(J − ⟨J⟩_{J(t)})²⟩_{J(t)} = η ⟨z²⟩_{J(t)}. In conclusion we have:

(1/η) ∂⟨J⟩_{J(t)}/∂t = F̄(⟨J⟩_{J(t)})    (20)

(1/η) ∂Σ_{J(t)}/∂t = 2 F̄′(⟨J⟩_{J(t)}) Σ_{J(t)} + η D(⟨J⟩_{J(t)})    (21)

where D(J) = ⟨F(J, ξ) F(J, ξ)⟩_ξ. The result is consistent with the Ansatz if and only if

ā′_1(φ(t)) < 0    (22)

This condition assures that z remains small as t increases (see Eqs. (19a) and (19b)). Therefore the small-fluctuation approximation is valid in the attraction regions (ā′_1(φ(t)) < 0), where the energy surface has positive curvature. Outside these attraction regions the approximation is valid only on relatively short time scales (O(1/η)). When the energy surface presents multiple minima, the approximation used provides a description around each of them, but not a global description. The approximate equations for small learning rate, Eqs. (20) and (21), generalize straightforwardly to the case in which the net parameter is a vector with many components rather than a scalar. Applying this procedure to RBF networks, the dynamical equations for the mean value


Fig. 1. The dotted and dot-dashed lines show the results of two simulation runs in which we train the γ and u parameters with on-line gradient descent (η = 0.1) in a K = 9 RBF network learning a noiseless task, in a bidimensional input space (N = 2), with the same initial conditions. We compare the simulation results with the theoretical prediction of the expected value, given by the solid line. (a) shows the evolution of the student spread parameters γ, and (b) the student hidden-output weights u. Task parameters are σ = 2, β = {2.1, 2.5, 2.7, 3.3, 3.6, 3.7, 4.0, 4.2, 4.3}, all v_i = 1, and the student initial conditions are γ_i = 5, u_i = 0.0, for each i = 1, …, K.

of J, the mean value of J^T J, and the expectation value of the fluctuations Σ_ij are obtained:

(1/η) d⟨J_i⟩_{J(t)}/dt = F̄_i(⟨J⟩_{J(t)})    (23a)

(1/η) dΣ_ij/dt = Σ_k [ G_ik(⟨J⟩_{J(t)}) Σ_kj + Σ_ki G_jk(⟨J⟩_{J(t)}) ] + η D_ij(⟨J⟩_{J(t)})    (23b)

(1/η) ∂⟨J_i J_j⟩_{J(t)}/∂t = F̄_i(⟨J⟩_{J(t)}) ⟨J_j⟩_{J(t)} + ⟨J_i⟩_{J(t)} F̄_j(⟨J⟩_{J(t)}) + η D_ij(⟨J⟩_{J(t)})    (23c)

where J is the set {u_i, γ_i, x_i} of student parameters, F_J(J, ξ) is defined in Eq. (7), and

F̄(J) = ⟨F_J[J, ξ]⟩_ξ,  D_ij = ⟨F_i[J, ξ] F_j[J, ξ]⟩_ξ,  G_ik(J) = ∂F̄_i(J)/∂J_k

These are the equations that we were looking for. The averages can be carried out analytically, and the resulting differential equations can be solved numerically (see the next section).

3. Learning dynamics and comparison with simulations

Eqs. (23a)-(23c) hold in the general case in which J is the set of student parameters of the RBF net

γ, u, x_i, i = 1…K, and the teacher parameters are β, v, x_i^0, i = 1…M. The averages require the evaluation of Gaussian integrals, which can be carried out analytically. The resulting differential equations can then be solved numerically. Therefore, they provide a tool for analyzing the learning process of a general RBF network under the condition of a small learning rate. Although the framework enables us to consider a wide range of cases, we will analyze here the following two cases: (A) the centers x_i of the student basis functions are fixed and set equal to the values of the centers x_j^0 of the teacher basis functions, while γ and u are adaptive parameters; (B) the centers are adaptive parameters, but the variances of the basis functions are fixed and set equal to the teacher values.
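In both cases the underlying stochastic dynamics is the single-example update of Eqs. (5)-(7). For a Gaussian student it can be sketched as follows (our own illustrative code, with the gradients of e = (net(ξ) − f_t(ξ))² written out analytically):

```python
import math

def psi(xi, c, g):
    return math.exp(-g * sum((a - b) ** 2 for a, b in zip(xi, c)))

def net(xi, u, x, g):
    return sum(u[i] * psi(xi, x[i], g[i]) for i in range(len(u)))

def online_step(u, x, g, xi, target, eta):
    """One update of Eq. (5): gradient descent on e = (net(xi) - target)^2
    for all adaptable parameters {u_i, gamma_i, x_i}."""
    delta = net(xi, u, x, g) - target            # net(xi^mu) - f_t(xi^mu)
    K, N = len(u), len(xi)
    u_new, g_new, x_new = [], [], []
    for i in range(K):
        p = psi(xi, x[i], g[i])
        sq = sum((xi[d] - x[i][d]) ** 2 for d in range(N))
        u_new.append(u[i] - eta * 2.0 * delta * p)                 # -eta de/du_i
        g_new.append(g[i] + eta * 2.0 * delta * u[i] * sq * p)     # -eta de/dgamma_i
        x_new.append([x[i][d] - eta * 4.0 * delta * u[i] * g[i]
                      * (xi[d] - x[i][d]) * p for d in range(N)])  # -eta de/dx_i
    return u_new, x_new, g_new

u, x, g = [0.5, 0.2], [[0.0, 0.0], [1.0, 0.0]], [1.0, 1.0]
xi_mu, target = [0.3, 0.2], 1.0
before = (net(xi_mu, u, x, g) - target) ** 2
u, x, g = online_step(u, x, g, xi_mu, target, eta=0.05)
after = (net(xi_mu, u, x, g) - target) ** 2
print(before, after)  # for a small learning rate the error on this example decreases
```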

3.1. Learning the γ and u parameters

In the first case we investigate the dynamics of the student net parameters J = {γ, u}, while x_i = x_i^0, β and v are assumed to be fixed. From Eq. (23a) one obtains (Scarpetta, 1998) the following equations, which describe the dynamics of the parameter mean values:

∂⟨u_i⟩_{J(t)}/∂t = −2η [ Σ_k u_k H(γ_k, γ_i, x_k, x_i) − Σ_k v_k H(γ_i, β_k, x_i, x_k^0) ]    (24a)
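The Gaussian average H(A, B, x_i, x_j) = ⟨e^{−A(ξ−x_i)²} e^{−B(ξ−x_j)²}⟩_ξ appearing here has a closed form obtained by completing the square. The paper collects its explicit expressions in Appendix B (not reproduced here), so the formula below is our own derivation for N = 1, checked against direct numerical quadrature:

```python
import math

def H_closed(A, B, xi, xj, sigma):
    """Closed form of <exp(-A(x-xi)^2) exp(-B(x-xj)^2)> under P_sigma, N = 1
    (our own derivation; the paper's expressions are in its Appendix B)."""
    S = sigma + A + B
    num = A * B * (xi - xj) ** 2 + sigma * (A * xi ** 2 + B * xj ** 2)
    return (sigma / S) ** 0.5 * math.exp(-num / S)

def H_quad(A, B, xi, xj, sigma, lo=-10.0, hi=10.0, n=200001):
    """Direct trapezoidal quadrature of the same Gaussian average."""
    h = (hi - lo) / (n - 1)
    total = 0.0
    for k in range(n):
        x = lo + k * h
        w = 0.5 if k in (0, n - 1) else 1.0
        p = (sigma / math.pi) ** 0.5 * math.exp(-sigma * x * x)
        total += w * p * math.exp(-A * (x - xi) ** 2 - B * (x - xj) ** 2)
    return total * h

print(H_closed(2.0, 3.0, 0.5, -0.2, 2.0), H_quad(2.0, 3.0, 0.5, -0.2, 2.0))
```

For general N the average factorizes over components, so the N-dimensional H is a product of such one-dimensional factors.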

∂⟨γ_i⟩_{J(t)}/∂t = −2η u_i [ Σ_k u_k M(γ_k, γ_i, x_k, x_i) − Σ_k v_k M(β_k, γ_i, x_k^0, x_i) ]    (24b)

Explicit expressions for the integrals

H(A, B, x_i, x_j) = ⟨e^{−A(ξ−x_i)²} e^{−B(ξ−x_j)²}⟩_ξ    (25)

and

M(A, B, x_i, x_j) = ∂H(A, B, x_i, x_j)/∂B    (26)

are given in Eqs. (B4) and (B5) in Appendix B.

To demonstrate the validity of the theoretical results, and to show that the fluctuations introduced by on-line learning are negligible when η is small, we compare the evolution of the system found by numerically solving the differential Eqs. (24a) and (24b) for the mean values with different empirical results found by training an RBF network via on-line gradient descent. The empirical values of {γ, u} are recorded during training. The components of the training input vectors ξ^μ are sequentially drawn from a Gaussian distribution with mean 0 and variance 1/(2σ), according to the assumption used to derive the differential equations. The centers of the teacher basis functions x_i^0 are randomly distributed in the input space. We compare the theoretical prediction for the mean values of the net parameters with the empirical results of single-run simulations, without averaging over many realizations of the simulated network. Fig. 1 shows the evolution of the net weights γ_i and u_i in two runs. In each run we train the RBF network, starting with the same initial weights, via on-line gradient descent on a different set of examples generated from the teacher network. The results are compared with the average evolution of the system (solid lines) found by numerically solving the differential equations. All the single-run simulations stay around the mean value prediction, with only small fluctuations, in agreement with our assumptions. In this example the input space dimension is N = 2, less than the number of hidden nodes K = 9.

To measure the performance of the student RBF network on the given task, we compute the generalization error during the evolution. The expected value (11) of the generalization error at time t is

e_g ≡ ⟨e_g(J)⟩_{J(t)} = ∫ dJ P(J, t) e_g(J) = ⟨⟨e(ξ, J)⟩_ξ⟩_{J(t)}    (27)

In the small-fluctuations limit, i.e. for a small learning rate and under condition (22), we can expand ⟨e_g(J)⟩_{J(t)} around the expected value of J and truncate the expansion to first order, e_g ≃ e_g(⟨J⟩_{J(t)}). Therefore we only need to compute the average over the input distribution, which is a multivariate Gaussian integral and is analytically tractable. We have obtained an expression for the generalization error (see Appendix A) in terms of the teacher parameters and the mean values of the student parameters. Note that the teacher parameters β_i, v_i, x_i^0 are characteristic of the task to be learned and remain fixed during training, while the weights u_i, γ_i, x_i, denoted by J, are the adaptable parameters of the student network, whose mean values evolve during training according to Eq. (23a), which in the case under consideration becomes Eqs. (24a) and (24b). In the simulations, the generalization error is empirically estimated via an average of the error over an 800-point test set. Fig. 2 shows the theoretical prediction of the generalization error corresponding to the parameter evolution shown in Fig. 1, in comparison with four single-run simulation results.

Fig. 2. The generalization error corresponding to the learning dynamics of Fig. 1 is shown in: (a) log-log scale; (b) linear scale, and in the inset in a log-lin scale. The dotted and dot-dashed lines show the generalization error evolution in four single-run simulations. The empirical results are compared with the theoretical prediction of the expected generalization error (solid line).

The correspondence between theory and simulations
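The empirical estimate of e_g used in the simulations (an average of the squared student-teacher deviation over a test set drawn from P_σ) can be sketched as follows (our own illustrative code; names and the toy task are ours):

```python
import math, random

def rbf(xi, u, centers, gammas):
    return sum(ui * math.exp(-g * sum((a - c) ** 2 for a, c in zip(xi, ctr)))
               for ui, ctr, g in zip(u, centers, gammas))

def empirical_eg(student, teacher, sigma, N, n_test=800, seed=0):
    """Monte Carlo estimate of the generalization error of Eq. (27):
    mean squared student-teacher deviation over n_test inputs from P_sigma."""
    rng = random.Random(seed)
    std = (1.0 / (2.0 * sigma)) ** 0.5
    total = 0.0
    for _ in range(n_test):
        xi = [rng.gauss(0.0, std) for _ in range(N)]
        total += (rbf(xi, *student) - rbf(xi, *teacher)) ** 2
    return total / n_test

teacher = ([1.0], [[0.0, 0.0]], [2.0])   # (v, x0, beta): toy K = M = 1 task
student = ([0.0], [[0.0, 0.0]], [2.0])   # untrained student, u = 0
print(empirical_eg(student, teacher, sigma=2.0, N=2))  # positive
print(empirical_eg(teacher, teacher, sigma=2.0, N=2))  # 0.0 for a perfect student
```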


Fig. 3. Time evolution of the γ_i parameters (a) and of the weights u_i (b) in a two-node RBF network learning a realizable task in an input space with N = 5 (η = 0.1). The dotted lines show a single-run simulation result, the solid line shows the theoretical result. Task parameters are σ = 3.3, β = {3.4, 2.7}, all v_i = 1, and the student initial conditions are γ_i = 2, u_i = 0.0, for each i = 1, 2.

seems good, not only in the asymptotic decay but also in the transient phase. We see two different relaxation times in the generalization error: a fast decay towards a non-optimal solution, followed by another exponential decay (see the inset of Fig. 2), with a larger characteristic time, towards the optimal solution. In the region between the two regimes, the simulations show slightly larger fluctuations around the theoretical prediction of the mean value. In these regions, to have a more complete description, one has to also consider Eq. (23b). The presence of larger fluctuations in the transition zones is easily understood if we remember that, in the small learning-rate limit, the small-fluctuations Ansatz is valid only in the regions of positive curvature, and on relatively short time scales (O(1/η)) outside of these basins of attraction. However, in all the cases that we have analyzed, the fluctuations are never such as to invalidate our assumptions, if the learning rate is small enough. For the sake of completeness, we show in Fig. 3 an example of dynamical evolution in the different situation with N > K (N = 5 and K = 2). The dynamical evolution of γ and u resulting from a single simulation is shown together with the theoretical predictions for ⟨γ⟩_{J(t)} and ⟨u⟩_{J(t)}. In this case as well, the single-run simulations make small fluctuations around the mean value. Fig. 4 shows the corresponding generalization error during the learning process. A behavior analogous to Fig. 2 is observed, with a two-time exponential decay in the generalization error and good agreement between the predictions and the empirical results.
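The interplay between the deterministic flow (20) and the fluctuation Eq. (21) can be illustrated with a toy scalar parameter and a linear drift F(J) = −J, so that F′ = −1 < 0 and condition (22) holds. This is our own illustrative choice, not a quantity from the paper:

```python
# Euler integration of Eqs. (20)-(21) for a scalar parameter with
# drift F(J) = -J and constant diffusion D(J) = D0 (toy example).
eta, D0, dt = 0.05, 1.0, 0.01
J_mean, Sigma = 2.0, 0.0
for _ in range(20000):
    F, Fp = -J_mean, -1.0                              # drift and its derivative
    J_mean += dt * eta * F                             # Eq. (20)
    Sigma += dt * eta * (2.0 * Fp * Sigma + eta * D0)  # Eq. (21)
print(J_mean, Sigma)  # J_mean decays to ~0; Sigma settles at eta*D0/2 = 0.025
```

The stationary fluctuation level is O(η), consistent with the √η scaling of the Ansatz (16).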


Fig. 4. Linear plot (a) and log-log plot (b) of the generalization error in the same learning problem of Fig. 3. The dotted line shows a single-run simulation result; the solid line is the theoretical result.


Fig. 5. The centers of the hidden basis functions and the hidden-output parameters of a K = 5 RBF network in an N = 2 input space are trained with the on-line gradient descent rule with η = 0.1. The generalization error is shown on a linear scale (a), and on a lin-log scale (b). Dotted lines show three single-run simulation results, and the solid line is the theoretical prediction. Task parameters are σ = 3, β = γ = 2, all v_i = 1, the student initial conditions are u_i = 0.0, and both the teacher x_i^0 and the initial student vectors x_i are randomly chosen in the input space from a Gaussian distribution with u = σ.

3.2. Learning the u and x_i parameters

Case B has been analyzed previously in Freeman and Saad (1997a,b), using an extension of the approach introduced for studying MLP networks. In order to compare our results with the ones found in Freeman and Saad (1997a,b), we introduce the same set of adaptive parameters used there, i.e. the quantities Q_ij = x_i^T x_j, R_ij = x_i^T x_j^0, and u, assuming γ_i = β_i = β fixed. From Eqs. (23a)-(23c) we obtain the following set of equations, which describe the dynamics of the system:

∂⟨u_i⟩_{J(t)}/∂t = −2η [ Σ_k u_k H(β, β, x_k, x_i) − Σ_k v_k H(β, β, x_i, x_k^0) ]    (28a)

∂⟨Q_ij⟩_{J(t)}/∂t = −4ηβ (σ/(2β+σ))^{N/2} [ Σ_l u_i u_l L(x_i, x_l, x_j) − Σ_k v_k u_i L(x_i, x_k^0, x_j) ]
  − 4ηβ (σ/(2β+σ))^{N/2} [ Σ_l u_j u_l L(x_j, x_l, x_i) − Σ_k v_k u_j L(x_j, x_k^0, x_i) ]
  + 16η²β² (σ/(4β+σ))^{N/2} [ Σ_{kk′} v_k u_i v_{k′} u_j S(x_i, x_j, x_k^0, x_{k′}^0) + Σ_{ll′} u_l u_i u_{l′} u_j S(x_i, x_j, x_l, x_{l′}) − 2 Σ_{lk} u_l u_i v_k u_j S(x_i, x_j, x_l, x_k^0) ]    (28b)

∂⟨R_ij⟩_{J(t)}/∂t = −4ηβ (σ/(2β+σ))^{N/2} [ Σ_l u_i u_l L(x_i, x_l, x_j^0) − Σ_k v_k u_i L(x_i, x_k^0, x_j^0) ]    (28c)

Here we use the notation T_ij = x_i^{0T} x_j^0, and

L(x_1, x_2, x_3) = ⟨(ξ − x_1)^T x_3 e^{−β(ξ−x_1)²} e^{−β(ξ−x_2)²}⟩_ξ    (29)

S(x_1, x_2, x_3, x_4) = ⟨(ξ − x_1)^T (ξ − x_2) e^{−β(ξ−x_1)²} e^{−β(ξ−x_2)²} e^{−β(ξ−x_3)²} e^{−β(ξ−x_4)²}⟩_ξ    (30)

The explicit expressions of the functions S and L are given in Appendix B. Eqs. (28a)-(28c) are equivalent to the difference equations derived in Freeman and Saad (1997a,b) with the statistical physics approach. It is worth noting that the equations depend explicitly on the input dimension N and, contrary to the equations of motion for MLPs, they lose any significance in the limit N → ∞. The system evolutions described below are obtained by integrating the differential equations with η = 0.1 from a random initialization of the student network parameters: the vectors x_i are chosen randomly in the input space, according to a zero-mean Gaussian distribution, and u_i = 0 for all i. The centers of the K teacher basis functions x_i^0 are random vectors whose components are distributed in the input domain as Gaussian variables of zero mean and variance 1/(2u), and all v_i = 1. Note that the difference equations of Freeman and Saad (1997a,b) are studied by making assumptions on the form of the teacher T and on the initialization of
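The overlaps used as order parameters in this section can be computed directly from the center vectors; a minimal sketch (our own code, with illustrative example values):

```python
def overlaps(x, x0):
    """Order parameters of Sec. 3.2: Q_ij = x_i . x_j (student-student),
    R_ij = x_i . x0_j (student-teacher), T_ij = x0_i . x0_j (teacher-teacher)."""
    def dot(a, b):
        return sum(p * q for p, q in zip(a, b))
    Q = [[dot(a, b) for b in x] for a in x]
    R = [[dot(a, b) for b in x0] for a in x]
    T = [[dot(a, b) for b in x0] for a in x0]
    return Q, R, T

x = [[1.0, 0.0], [0.0, 2.0]]    # student centers (illustrative values)
x0 = [[1.0, 1.0], [0.0, 1.0]]   # teacher centers
Q, R, T = overlaps(x, x0)
print(Q[0][0], R[1][0], T[0][1])  # 1.0 2.0 1.0
```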


Fig. 6. Centers of hidden basis functions and hidden-output parameters of a K ˆ 2 RBF network in a N ˆ 5 input space are trained with on-line gradient descent with h ˆ 0:1: The dotted line shows a single run simulation result, the solid line is the theoretical result. Task parameters are s ˆ 2; b ˆ g ˆ 3:3; all v i ˆ 1; student initial conditions are ui ˆ 0:0; and both teacher x0i and initial student vectors xi are randomly chosen in the input space from a Gaussian distribution with u ˆ s: Linear plot (a) and log±lin plot (inset) of the generalization error. (b) shows the student hidden-output weights u evolution.

weights which are sensible only in the case of large N with N > K (for example, there are no K vectors $x_i^0$ in an N-dimensional space such that $T_{ij} = x_i^{0T} x_j^0 = \delta_{ij}$ if K > N); here we relax these assumptions, since we are considering generic RBF architectures with primary interest in the N < K case. To analyze Eqs. (28a), (28b) and (28c) it is important to distinguish between two cases:

1. N very large, and K ≪ N;
2. K ≥ N and finite,

which present different behaviors of the generalization error.

In case 1, as N becomes large, in order to cover the input space we have to use Gaussian functions whose variance is large with respect to the relative distances of the vectors $x_i^0$; as a consequence, a sort of delocalization is introduced and the behavior of the RBFN becomes similar to the one typical of MLPs (Biehl et al., 1996; Riegler & Biehl, 1995; Saad, 1998; Saad & Solla, 1995a,b). In case 2, on the contrary, the basis functions are quite well localized and the behavior of the RBFN is similar to that observed in Section 3.1. Here the generalization error does not show any plateau region, and a description in terms of symmetric phases and a symmetry-breaking cascade is no longer appropriate. Indeed, here the localization prevents the lack of differentiation among hidden nodes that is the origin of the plateau.

Fig. 7. Evolution of the overlap parameters $Q_{ij} = x_i^T x_j$ and $R_{in} = x_i^T x_n^0$, shown in (a) and (b) respectively, in the same learning scenario as the previous figure. The dot-dashed line shows a single-run simulation result; the solid line is the theoretical result. The dotted lines mark the target values of the overlap parameters.

Fig. 5 shows the generalization error for an RBF network with five hidden nodes learning a realizable task in an N = 2 dimensional input space. The symmetric phase is not present, even given the high symmetry of the task (all $v_i = 1$) and of the initial conditions (all $u_i = 0$). More single-simulation results, with the same initial conditions, are compared with the predicted mean-value behavior. Also in this case we observe that the simulation results stay around the predicted mean-value behavior, with small fluctuations that do not invalidate the theoretical construction. The case N > K is illustrated in Fig. 6, which shows the generalization error for an RBF network with K = 2 hidden nodes learning a realizable task in an N = 5 dimensional input space. A symmetric phase appears, which manifests itself through a small plateau in the generalization error and the initial tendency of the Q and R parameters to converge towards a unique value in Fig. 7, $\alpha = 50$–$500$. After a while, the small fluctuations (asymmetries) introduced by the random initialization of the centers $x_i$ are enhanced, the symmetric phase is escaped, and the convergence phase begins.
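The single-run curves in Figs. 5–7 come from stochastic simulations of on-line gradient descent. A minimal sketch of such a single-run simulation, using plain gradient descent on the instantaneous squared error, is given below; the sizes and parameter values are illustrative stand-ins for the N < K regime, not the exact settings of the figures:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes for the N < K regime (not the exact settings of Figs. 5-7)
N, K = 2, 3
s, b, eta = 2.0, 1.0, 0.1

# Teacher: fixed centers x0 and hidden-to-output weights v_i = 1
x0 = rng.normal(0.0, np.sqrt(1.0 / (2 * s)), size=(K, N))
v = np.ones(K)

# Student initialization as in the text: u_i = 0, centers drawn at random
x = rng.normal(0.0, np.sqrt(1.0 / (2 * s)), size=(K, N))
u = np.zeros(K)

def output(w, centers, xi):
    """RBF network output: sum_i w_i exp(-b ||xi - x_i||^2)."""
    return w @ np.exp(-b * np.sum((centers - xi) ** 2, axis=1))

errors = []
for step in range(20000):
    xi = rng.normal(0.0, np.sqrt(1.0 / (2 * s)), size=N)  # P_s(xi) ~ exp(-s xi^2)
    phi = np.exp(-b * np.sum((x - xi) ** 2, axis=1))      # student basis activations
    delta = u @ phi - output(v, x0, xi)                   # instantaneous error
    # On-line gradient descent on delta^2 with respect to u_i and the centers x_i
    u_old = u.copy()
    u = u - eta * 2 * delta * phi
    x = x + eta * 4 * b * delta * (u_old * phi)[:, None] * (x - xi)
    errors.append(delta ** 2)
```

With $u_i$ initialized at zero, the recorded error starts at the scale of the teacher's output variance and decays as the student adapts; averaging many such runs should track the mean behavior described by Eqs. (28a), (28b) and (28c).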

4. Conclusions

The main focus of this work is the analysis of on-line learning in RBFNs in the realistic case N < K. We adopt the stochastic approach which, contrary to the statistical-physics approach, allows us to describe, at least locally, the on-line learning of RBFNs in terms of the generalization error and the average dynamics of the main moments, without taking the large-N limit. The generalization error behavior is analyzed by numerical integration of the equations in both the N > K and N < K architectures, for specific cases and small learning rate. Simulations are then performed which confirm the theoretical predictions for all phases of learning. From our analysis comes the result that the symmetric phases in RBF networks are an artifact, being a consequence of the unnatural choice N > K. In practical cases, RBF network architectures are chosen such that K > N, with N small (curse of dimensionality); therefore the symmetric plateau will not be a problem for practitioners using RBFNs.

Appendix A. Generalization error

The generalization error can be computed in the small-fluctuations limit by expanding the expectation $\langle \epsilon_g(J) \rangle_{J(t)}$ around the expected value of $J$. Breaking off the expansion at first order, $\epsilon_g = \epsilon_g(\langle J \rangle_{J(t)})$, and computing the average analytically over the input space, one obtains:

$$\begin{aligned}
\epsilon_g ={}& -2 \sum_{ij} \langle u_i \rangle_{J(t)} v_j \, H\big(\langle \gamma_i \rangle_{J(t)}, b_j, \langle x_i \rangle_{J(t)}, x_j^0\big) \\
& + \sum_{ij} \langle u_i \rangle_{J(t)} \langle u_j \rangle_{J(t)} \, H\big(\langle \gamma_i \rangle_{J(t)}, \langle \gamma_j \rangle_{J(t)}, \langle x_i \rangle_{J(t)}, \langle x_j \rangle_{J(t)}\big) \\
& + \sum_{ij} v_i v_j \, H\big(b_i, b_j, x_i^0, x_j^0\big)
\end{aligned} \qquad \mathrm{(A1)}$$

where $H(A, B, x_i, x_j)$ is the result of the Gaussian integral (25), whose expression is given in Appendix B.

Appendix B. Gaussian integrals in the dynamical equations

The integrals encountered in deriving the dynamical equations for the learning process in RBF networks are all Gaussian integrals that turn out to be easy to compute. All the integrals are averages over the N-dimensional Gaussian distribution $P_s(\xi) = e^{-s\xi^2}/(\pi/s)^{N/2}$, and they can be carried out starting from the following:

$$\begin{aligned}
I(\{A_i, x_i,\ i = 1, \ldots, n\}) &= \int \exp\left( -\sum_i^n A_i (\xi - x_i)^2 \right) \frac{e^{-s\xi^2}}{(\pi/s)^{N/2}} \, d^N\xi \\
&= \left( \frac{s}{\sum_i^n A_i + s} \right)^{N/2} \exp\left( \frac{\big( \sum_i^n A_i x_i \big)^2}{\sum_i^n A_i + s} - \sum_i^n A_i x_i^2 \right)
\end{aligned} \qquad \mathrm{(B1)}$$

and using the identities:

$$\int (\xi - x_1)^2 \exp\left( -\sum_i^n A_i (\xi - x_i)^2 \right) \frac{e^{-s\xi^2}}{(\pi/s)^{N/2}} \, d^N\xi = -\frac{\partial}{\partial A_1} I(\{A_i, x_i,\ i = 1, \ldots, n\}) \qquad \mathrm{(B2)}$$

and

$$\int (\xi - x_1)^T (\xi - x_2) \, e^{-s(\xi - x_3)^2} \, d^N\xi = (\pi/s)^{N/2} \left[ \frac{N}{2s} + (x_3 - x_1)^T (x_3 - x_2) \right] \qquad \mathrm{(B3)}$$
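Identity (B3) has a simple probabilistic reading: the weight $e^{-s(\xi - x_3)^2}/(\pi/s)^{N/2}$ is the density of a Gaussian with mean $x_3$ and variance $1/(2s)$ per component, so the normalized left-hand side is just an expectation. A quick Monte Carlo sanity check (the dimension, vectors and sample count below are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(3)
N, s = 3, 2.0
x1, x2, x3 = rng.normal(0.0, 1.0, size=(3, N))

# Left-hand side of (B3), normalized by (pi/s)^(N/2): expectation of
# (xi - x1)^T (xi - x2) under a Gaussian with mean x3, variance 1/(2s) per component
xi = x3 + rng.normal(0.0, np.sqrt(1.0 / (2 * s)), size=(500000, N))
lhs = np.mean(((xi - x1) * (xi - x2)).sum(axis=-1))

# Right-hand side of (B3): N/(2s) + (x3 - x1)^T (x3 - x2)
rhs = N / (2 * s) + (x3 - x1) @ (x3 - x2)
```

The two values should agree to within Monte Carlo noise.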


From Eq. (B1) with n = 2 we obtain the integral, encountered in Eqs. (24a), (28a) and (A1), denoted by H:

$$H(A, B, x_i, x_j) \equiv \left\langle e^{-A(\xi - x_i)^2} e^{-B(\xi - x_j)^2} \right\rangle_\xi = \left( \frac{s}{s + A + B} \right)^{N/2} \exp\left( \frac{A^2 x_i^2 + B^2 x_j^2 + 2AB \, x_i^T x_j}{s + A + B} - A x_i^2 - B x_j^2 \right) \qquad \mathrm{(B4)}$$

and from Eq. (B2) one can easily derive the integral denoted by M in Eq. (24b):

$$M(A, B, x_i, x_j) = -\frac{\partial H(A, B, x_i, x_j)}{\partial B} = H(A, B, x_i, x_j) \left[ \frac{N}{2(A + B + s)} + \left( \frac{A x_i + B x_j}{A + B + s} - x_j \right)^2 \right] \qquad \mathrm{(B5)}$$

The expressions denoted by L and S, encountered in Eqs. (28b) and (28c), are straightforwardly obtained from Eqs. (B2) and (B4):

$$L(x_1, x_2, x_3) \equiv \int \left[ (\xi - x_1)^T x_3 \right] e^{-b(\xi - x_1)^2} e^{-b(\xi - x_2)^2} \frac{e^{-s\xi^2}}{(\pi/s)^{N/2}} \, d^N\xi = H(b, b, x_1, x_2) \left( -\frac{b + s}{2b + s} x_1 + \frac{b}{2b + s} x_2 \right)^T x_3 \qquad \mathrm{(B6)}$$

and

$$\begin{aligned}
S(x_1, x_2, x_3, x_4) \equiv{}& \int (\xi - x_1)^T (\xi - x_2) \, e^{-b(\xi - x_1)^2} e^{-b(\xi - x_2)^2} e^{-b(\xi - x_3)^2} e^{-b(\xi - x_4)^2} \frac{e^{-s\xi^2}}{(\pi/s)^{N/2}} \, d^N\xi \\
={}& I(b, b, b, b, x_1, x_2, x_3, x_4) \left[ \frac{N}{2(4b + s)} + \frac{1}{(4b + s)^2} \big( -(3b + s) x_1 + b(x_2 + x_3 + x_4) \big)^T \big( -(3b + s) x_2 + b(x_1 + x_3 + x_4) \big) \right]
\end{aligned} \qquad \mathrm{(B7)}$$

References

Bedeaux, D., Lakatos-Lindberg, K., & Shuler, K. (1971). Journal of Mathematical Physics, 12, 2116.
Biehl, M., & Caticha, N. (1999). Statistical mechanics of on-line learning and generalization. In M. A. Arbib (Ed.), The handbook of brain theory and neural networks. Cambridge, MA: MIT Press (in preparation).
Biehl, M., Riegler, P., & Wohler, C. (1996). Journal of Physics A, 29, 4767.
Bishop, C. (1995). Neural networks for pattern recognition. Oxford: Oxford University Press.
Freeman, J. A., & Saad, D. (1995). Neural Computation, 7, 1000.
Freeman, J., & Saad, D. (1997a). Physical Review E, 56, 907.
Freeman, J., & Saad, D. (1997b). Neural Computation, 9, 1601.
Heskes, T. (1994). Journal of Physics A, 27, 5145.
Heskes, T., & Kappen, B. (1993). In J. Taylor (Ed.), Mathematical foundations of neural networks (p. 199). Amsterdam.
Heskes, T., & Kappen, B. (1991). Physical Review A, 44, 2718.
Holden, S. B., & Niranjan, M. (1997). Neural Computation, 9, 441.
Leen, T. K., Schottky, B., & Saad, D. (1998). In Jordan, Kearns & Solla (Eds.), Advances in neural information processing systems (Vol. 10, p. 301). Cambridge, MA: MIT Press.
Riegler, P., & Biehl, M. (1995). Journal of Physics A, 28, L507.
Saad, D. (Ed.) (1998). On-line learning in neural networks. Cambridge: Cambridge University Press.
Saad, D., & Solla, S. (1995a). Physical Review Letters, 74, 4337.
Saad, D., & Solla, S. (1995b). Physical Review E, 52, 4225.
Scarpetta, S. (1998). PhD thesis, University of Salerno, Italy.
van Kampen, N. (1981). Stochastic processes in physics and chemistry. Amsterdam: North-Holland.