Online stability of backpropagation–decorrelation recurrent learning


ARTICLE IN PRESS

Neurocomputing 69 (2006) 642–650 www.elsevier.com/locate/neucom

Jochen J. Steil

Neuroinformatics Group, Faculty of Technology, Bielefeld University, Germany

Available online 25 January 2006

Abstract

We provide a stability analysis based on nonlinear feedback theory for the recently introduced backpropagation–decorrelation (BPDC) recurrent learning algorithm, which adapts only the output weights of a possibly large network and can therefore learn with $O(N)$ complexity. Using a small gain criterion, we derive a simple sufficient stability inequality. The condition can be monitored online to ensure that the recurrent network is stable and can in principle be applied to any network that adapts only its output weights. Based on these results, BPDC learning is further enhanced with an efficient online rescaling algorithm that stabilizes the network while it adapts. In simulations we find that this mechanism improves learning in the provably stable domain. As a byproduct we show that BPDC is highly competitive on standard data sets, including the recently introduced CATS benchmark data [CATS data. URL: http://www.cis.hut.fi/lendasse/competition/competition.html].

© 2006 Elsevier B.V. All rights reserved.

Keywords: Recurrent online learning; Online stability; Time series prediction; Nonlinear systems; Small gain theorem

Tel.: +49 521 106 6066. URL: http://www.jsteil.de. 0925-2312/$ - see front matter © 2006 Elsevier B.V. All rights reserved. doi:10.1016/j.neucom.2005.12.012

1. Introduction

While recurrent neural networks have matured into a fundamental tool for trajectory learning, time-series prediction, and other time-dependent tasks, major difficulties still hinder their more widespread application. These are the well-known high numerical complexity of training algorithms and the difficulty of assuring stability, which is often crucial, in particular for adaptive control applications (see also the review [5]). Most of the efficient existing algorithms rely on backpropagation through time to compute error gradients and additionally require proper adjustment of learning rates and time constants. To advance towards a simple online recurrent learning technique, which could attract an even wider audience to recurrent networks, we introduced in [16] the backpropagation–decorrelation rule (BPDC), which combines three principles: (i) one-step backpropagation of errors; (ii) use of the temporal memory in the network dynamics, which is adapted based on decorrelation of the activations; and (iii) employment of a reservoir of non-adaptive inner neurons to reduce complexity. The output weights then implement a linear readout function, while the output neurons at the same time provide full feedback into the reservoir. BPDC learning is therefore of linear $O(N)$ complexity, and [16] has already shown that BPDC performs well on a number of standard tasks.

The BPDC rule is rooted in a combination of recent ideas: the error function is differentiated with respect to the states in order to obtain a “virtual teacher” target, with respect to which the weight changes are computed [1,13]. This is the starting point from which we derive and justify BPDC in Section 2. Further, computing with fixed recurrent reservoirs has been introduced under the notions “echo state network” [6] and “liquid state machine” [8,9]. The reservoir in these approaches stores information about the temporal behavior of the inputs, which allows a linear readout function to be learned effectively.

Besides efficiency, stability is a major issue in recurrent learning. In particular, real applications in control or robotics require provable stability as a necessary prerequisite to prevent fatal damage or failure of systems. However, stability properties may not be preserved by standard recurrent learning algorithms [10]. To solve this problem, a number of approaches try to incorporate results

from stability theory directly into the learning process. For input–output mappings, [18] uses results from robust stability theory for closed-loop systems to derive constraints in the form of linear matrix inequalities (LMI). These impose and maintain local stability at the origin (see also [19] and the references therein). A less direct approach is taken in [15], where global and local LMI conditions are given to determine a stable region in which arbitrary weight changes can be tolerated. Learning can then be restricted to that region. However, both methods increase the complexity and the numerical burden of the weight optimization process and cannot be used online. Ref. [7] enforces stability of equilibria to store patterns by incorporating additional terms in the error functional and the incremental weight update formulas, which are derived from matrix conditions for global stability.

Fig. 1. Top: BPDC trains output weights between a dynamic reservoir and the output neuron, which gives feedback to the reservoir (diagram labels: inputs $u$, bias, fixed dynamic reservoir, trainable weights, output $y$, feedback). Bottom: network composed of the linear feedforward operator $G$ and the nonlinear feedback $\Phi$ (see text).

In this contribution, we solve the stability problem for BPDC learning of one output by giving a formal technique to analyze the stability of the “fixed dynamic reservoir + output neuron + feedback” network configuration shown in Fig. 1. It is based on the small gain theorem from nonlinear feedback theory and leads to a simple stability inequality that can be monitored online while learning. Based on this inequality, a stabilizing online rescaling of the weight matrix is proposed, which enforces stability and at the same time optimizes learning under stability constraints. Differing from the previous work, we also use a bias neuron in the

network, which decreases the variance arising from the random initialization of the reservoir. We demonstrate this technique and give simulations for the recently introduced CATS benchmark, the Santa Fe laser data, and a 10th-order dynamical system to shed some light on the complex trade-offs between stability, learning, and network configuration.

2. The BPDC learning rule

We consider fully connected recurrent networks

$$\mathbf{x}(k+\Delta t) = (1-\Delta t)\,\mathbf{x}(k) + \Delta t\,W f(\mathbf{x}(k)) + \Delta t\,W_u \mathbf{u}(k), \tag{1}$$

where $x_i$, $i = 1,\dots,N$, are the states, $W \in \mathbb{R}^{N\times N}$ is the weight matrix, $W_u$ is the input weight matrix, and $k = \hat{k}\Delta t$, $\hat{k} \in \mathbb{N}_+$, is a discretized time variable. (With a small abuse of notation we also interpret time arguments and indices $(k+1)$ as $(k+1)\Delta t$.) For small $\Delta t$ we obtain an approximation of the continuous-time dynamics $d\mathbf{x}/dt = -\mathbf{x} + W f(\mathbf{x})$, and for $\Delta t = 1$ the standard discrete dynamics. We assume that $f$ is a standard sigmoidal differentiable activation function with $f' \le 1$ that is applied componentwise to the vector $\mathbf{x}$. We further assume that $W$ is initialized with small random values in a weight initialization interval $[-a,a]$ (which can be adaptively rescaled as shown in Section 3.1). Denote by $O \subseteq \{1,\dots,N\}$ the set of indices $s$ of the $N_O$ output neurons


(i.e. $x_s$ is an output $\Leftrightarrow s \in O$), and let $O = \{1\}$ for a single output neuron, such that $x_1$ is the output of the structured network shown in Fig. 1. Note that we also use a fixed bias neuron to provide a constant input to all neurons in the network. If all but the output weights are fixed, we can regard the inner neurons as a dynamical reservoir that is triggered by the input signal and provides a dynamical memory. The output layer linearly combines these states to read out the desired output. In [16], the backpropagation–decorrelation rule

$$\Delta w_{ij}^{\mathrm{BPDC}}(k+1) = \frac{\eta}{\Delta t}\,\frac{f(x_j(k))}{\sum_s f(x_s(k))^2 + \varepsilon}\,\gamma_i(k+1), \tag{2}$$

where

$$\gamma_i(k+1) = \sum_{s\in O}\bigl((1-\Delta t)\,\delta_{is} + \Delta t\, w_{is} f'(x_s(k))\bigr)\,e_s(k) - e_i(k+1), \tag{3}$$

has been introduced. Here $\eta$ is the learning rate, $\varepsilon$ a regularization constant ($\varepsilon = 0.002$ throughout), and $e_s(k)$ are the non-zero error components for $s \in O$ at time $k$: $e_s(k) = x_s(k) - y_s(k)$ with respect to the teaching signal $y_s(k)$.

2.1. Constraint optimization and virtual teacher forcing

To show the rationale behind the apparent simplicity of BPDC, we give a short derivation of the learning rule by regarding recurrent learning as the constraint optimization problem

$$\text{minimize } E \text{ subject to } \mathbf{g} \equiv 0, \tag{4}$$

where the error function $E$ is given by the standard quadratic error with respect to the target output $y$ for $K$ time steps,

$$E = \frac{1}{2}\sum_{\hat{k}=1}^{K}\sum_{s\in O}\bigl[x_s(\hat{k}\Delta t) - y_s(\hat{k}\Delta t)\bigr]^2, \tag{5}$$

and the constraint equations are obtained from the original recurrent network dynamics (1) for $k = 0,\dots,K-1$ as

$$\mathbf{g}(k+1) \equiv -\mathbf{x}(k+1) + (1-\Delta t)\,\mathbf{x}(k) + \Delta t\,W f(\mathbf{x}(k)) = 0. \tag{6}$$

(For ease of notation we treat inputs and bias as lumped into certain reservoir states, so that we do not have to treat them separately.)

To minimize (4), we follow a new approach introduced by Atiya and Parlos [1]. The idea is to use the constraint equation to compute weight changes that approach a virtual target state, which is obtained by differentiating the error $E$ with respect to the state (instead of the weights, as in the usual gradient methods like RTRL or BPTT). To get a compact notation, we collect the relevant quantities in vectors ($\mathbf{w}_i^T$ are the rows of $W$):

$$\mathbf{x} \equiv (\mathbf{x}^T(1),\dots,\mathbf{x}^T(K))^T, \quad \mathbf{g} \equiv (\mathbf{g}^T(1),\dots,\mathbf{g}^T(K))^T, \quad \mathbf{w} \equiv (\mathbf{w}_1^T,\dots,\mathbf{w}_N^T)^T.$$
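As a rough illustration of rule (2)–(3) for the single-output case $O = \{1\}$, the following sketch performs one network step according to (1) and one BPDC update of the output neuron's incoming weights. It is only a minimal sketch under stated assumptions: the activation is taken to be `tanh`, the output neuron is stored at index 0, inputs and bias are lumped into the state vector, and the names `iterate` and `bpdc_update` are hypothetical, not taken from [16].

```python
import math


def iterate(W, x, drive=None, dt=1.0):
    """One step of the dynamics (1); `drive` stands in for W_u u(k) (assumed zero if omitted)."""
    n = len(x)
    fx = [math.tanh(v) for v in x]  # assumption: f = tanh
    drive = drive if drive is not None else [0.0] * n
    return [(1.0 - dt) * x[i]
            + dt * sum(W[i][j] * fx[j] for j in range(n))
            + dt * drive[i]
            for i in range(n)]


def bpdc_update(W, x_old, e_old, e_new, eta=0.03, eps=0.002, dt=1.0, out=0):
    """Apply (2)-(3) in place to the row of W that feeds the output neuron `out`.

    e_old = x_out(k) - y(k) and e_new = x_out(k+1) - y(k+1) are the output errors.
    """
    fx = [math.tanh(v) for v in x_old]
    dfx_out = 1.0 - fx[out] ** 2  # tanh'(x) = 1 - tanh(x)^2
    # gamma_i(k+1) from (3) with i = out and O = {out}
    gamma = ((1.0 - dt) + dt * W[out][out] * dfx_out) * e_old - e_new
    norm = sum(v * v for v in fx) + eps  # regularized squared activity
    for j in range(len(x_old)):
        W[out][j] += (eta / dt) * fx[j] / norm * gamma
    return gamma
```

Only the output row is touched, which is what keeps the cost per step linear in the number of neurons; all other rows of `W` play the role of the fixed reservoir.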

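We obtain targets

$$\Delta\mathbf{x} = -\left(\frac{\partial E}{\partial \mathbf{x}}\right)^T = -(\mathbf{e}^T(1),\dots,\mathbf{e}^T(K))^T, \quad\text{where}\quad e_s(k) = \begin{cases} x_s(k) - y_s(k), & s \in O, \\ 0, & s \notin O, \end{cases}$$

and compute weight updates $\Delta\mathbf{w}$ that drive the network towards $\mathbf{x} + \eta\Delta\mathbf{x}$ by using the constraint (6):

$$\frac{\partial \mathbf{g}}{\partial \mathbf{w}}\,\Delta\mathbf{w} \approx -\eta\,\frac{\partial \mathbf{g}}{\partial \mathbf{x}}\,\Delta\mathbf{x}. \tag{7}$$

We call this technique virtual teacher forcing because the targeted teacher states $\mathbf{x} + \Delta\mathbf{x}$ enforce the weight changes although they are never actually fed into the network. We refer to the approach of solving Eq. (7) using a pseudoinverse of $\partial\mathbf{g}/\partial\mathbf{w}$ as Atiya–Parlos recurrent learning (APRL). It yields the training rule

$$\Delta\mathbf{w}^{\mathrm{APbatch}} = -\eta\left[\left(\frac{\partial \mathbf{g}}{\partial \mathbf{w}}\right)^T \frac{\partial \mathbf{g}}{\partial \mathbf{w}}\right]^{-1}\left(\frac{\partial \mathbf{g}}{\partial \mathbf{w}}\right)^T \frac{\partial \mathbf{g}}{\partial \mathbf{x}}\,\Delta\mathbf{x}. \tag{8}$$

It is worth noting that this update direction $\Delta\mathbf{w}$ does not follow the conventional gradient direction used, for instance, by real-time recurrent learning [1,13]. It is straightforward to derive the respective online algorithm [1,12,13]. Denote the vector of neural activities at time step $k$ by $\mathbf{f}_k = (f(x_1(k)),\dots,f(x_N(k)))^T$; then, from (6),

$$\frac{\partial \mathbf{g}(k+1)}{\partial \mathbf{w}} = \Delta t\begin{pmatrix} [\mathbf{f}_k]^T & & & \\ & [\mathbf{f}_k]^T & & \\ & & \ddots & \\ & & & [\mathbf{f}_k]^T \end{pmatrix} \in \mathbb{R}^{N\times N^2}.$$

Further,

$$\frac{\partial \mathbf{g}}{\partial \mathbf{w}} = \begin{pmatrix} \dfrac{\partial \mathbf{g}(1)}{\partial \mathbf{w}} \\ \vdots \\ \dfrac{\partial \mathbf{g}(K)}{\partial \mathbf{w}} \end{pmatrix}, \qquad \left(\frac{\partial \mathbf{g}}{\partial \mathbf{w}}\right)^T = \left(\frac{\partial \mathbf{g}(1)}{\partial \mathbf{w}}^T, \dots, \frac{\partial \mathbf{g}(K)}{\partial \mathbf{w}}^T\right),$$

and it is easy to see that

$$\left[\left(\frac{\partial \mathbf{g}}{\partial \mathbf{w}}\right)^T \frac{\partial \mathbf{g}}{\partial \mathbf{w}}\right]^{-1} = \frac{1}{\Delta t^2}\,\mathrm{diag}\left\{\left(\sum_{k=0}^{K-1}\mathbf{f}_k(\mathbf{f}_k)^T\right)^{-1}\right\} = \frac{1}{\Delta t^2}\,\mathrm{diag}\{C_{K-1}^{-1}\},$$

where $C_{K-1}$ is the auto-correlation matrix of the network activities. On the other hand we have $(\partial\mathbf{g}/\partial\mathbf{x})\,\Delta\mathbf{x} = -\boldsymbol{\gamma}$, where

$$\boldsymbol{\gamma} = (\boldsymbol{\gamma}(1)^T,\dots,\boldsymbol{\gamma}(K)^T)^T, \qquad \boldsymbol{\gamma}(k) = (\gamma_1(k),\dots,\gamma_N(k))^T,$$

and $\gamma_i(k)$ is defined as in (3). Then by recursively computing the pseudo-inverse in (8) and applying only the increment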

at every time step, we get

$$\begin{aligned}
\Delta w_{ij}(k+1) &= \Delta w_{ij}^{\mathrm{APbatch}}(k+1) - \Delta w_{ij}^{\mathrm{APbatch}}(k) \\
&= \frac{\eta}{\Delta t}\sum_{r=0}^{k}\bigl[C_k^{-1}\mathbf{f}_r\bigr]_j\,\gamma_i(r+1) - \frac{\eta}{\Delta t}\sum_{r=0}^{k-1}\bigl[C_{k-1}^{-1}\mathbf{f}_r\bigr]_j\,\gamma_i(r+1) \\
&= \frac{\eta}{\Delta t}\bigl[C_k^{-1}\mathbf{f}_k\bigr]_j\,\gamma_i(k+1) + \frac{\eta}{\Delta t}\sum_{r=0}^{k-1}\bigl[(C_k^{-1} - C_{k-1}^{-1})\,\mathbf{f}_r\bigr]_j\,\gamma_i(r+1) \\
&= \frac{\eta}{\Delta t}\bigl[C_k^{-1}\mathbf{f}_k\bigr]_j\,\gamma_i(k+1) + \bigl[(C_k^{-1}C_{k-1} - I)\,\Delta\mathbf{w}_i^{\mathrm{APbatch}}(k)\bigr]_j,
\end{aligned} \tag{9}$$

where $\Delta\mathbf{w}_i^{\mathrm{APbatch}}(k)$ collects the accumulated Atiya–Parlos learning update of the weights $w_{ij}$, $j = 1,\dots,N$, until time step $k$ from (8). The BPDC rule in (10) is similar to (9) and differs by the following modifications: (i) learning in BPDC is restricted to the output weights, motivated by the observation that in AP recurrent learning the hidden neurons update slowly and in a highly coupled way [13]; (ii) the momentum term introducing the old update $\Delta\mathbf{w}_i^{\mathrm{APbatch}}(k)$ (second term in (9)) is omitted; (iii) the full autocorrelation matrix $C_k$ in the first term is replaced by the instantaneous autocorrelation matrix $C(k) = \varepsilon I + \mathbf{f}_k\mathbf{f}_k^T$, and the small rank adjustment matrix inversion lemma is used:

$$\begin{aligned}
C(k)^{-1}\mathbf{f}_k &= \left[\frac{1}{\varepsilon}I - \frac{\bigl[\tfrac{1}{\varepsilon}I\,\mathbf{f}_k\bigr]\bigl[\tfrac{1}{\varepsilon}I\,\mathbf{f}_k\bigr]^T}{1 + \mathbf{f}_k^T\,\tfrac{1}{\varepsilon}I\,\mathbf{f}_k}\right]\mathbf{f}_k = \frac{1}{\varepsilon}\mathbf{f}_k - \frac{1}{\varepsilon^2}\,\frac{\mathbf{f}_k\,\mathbf{f}_k^T\mathbf{f}_k}{1 + \tfrac{1}{\varepsilon}\|\mathbf{f}_k\|^2} = \frac{\mathbf{f}_k}{\sum_s f(x_s(k))^2 + \varepsilon} \\
\Rightarrow\ \Delta w_{ij}^{\mathrm{BPDC}}(k+1) &= \frac{\eta}{\Delta t}\bigl[C(k)^{-1}\mathbf{f}_k\bigr]_j\,\gamma_i(k+1).
\end{aligned} \tag{10}$$

From this derivation the BPDC rule is interpreted as an effective mixture of reservoir computing, error minimization, and decorrelation mechanisms.

3. The operator framework

Using the standard notation for nonlinear feedback systems [14,21], the network (1) is composed of a linear feedforward and a nonlinear feedback operator $\Phi$:

$$\dot{\mathbf{x}} = -\mathbf{x} + \mathbf{e}, \qquad \mathbf{e} = W_u\mathbf{u} + W f(\mathbf{y}), \qquad \mathbf{y} = \mathbf{x}.$$

(We give the framework only for continuous time; an analogous derivation is possible for discrete time using the z-transform and sequence spaces, see [21], Chapter 6.) The Laplace transformation of the linear part yields the forward operator $G_I(s) = (I + sI)^{-1}$, while the activation function $f$ defines the feedback operator $\Phi$, see Fig. 1. $\Phi$ does not have to be stated explicitly in the frequency domain because it will be approximated by its gain, which is defined by the maximum slope of $f$. Denote this network interpretation as $((G_I, W), \Phi)$ for the input–output equation

$$\mathbf{y} = G_I(W_u\mathbf{u} + W\Phi(\mathbf{y})) = G_I W_u\mathbf{u} + G\Phi(\mathbf{y}), \qquad (G = G_I W).$$

The network acts as a nonlinear feedback system implementing a loop operator $H \doteq ((G_I, W), \Phi)$, which transforms $L_2^n$ signals $W_u\mathbf{u}$ into $L_2$ output signals $\mathbf{y}$ (strictly speaking, this is true only after assuring stability of the system, see [14,21]). In the setting of the small gain theorem, this system is called input–output stable if there exists a bound $\gamma(H)$ on the norm of the output signal such that $\|\mathbf{y}\|_2 \le \gamma(H)\|\mathbf{u}\|_2$ (and in this case the origin is globally exponentially stable for the respective unforced dynamics (1) with $\mathbf{u} \equiv 0$). Application of the small gain theorem yields that $H$ is input–output stable if the operator gains $\gamma(G_I)$, $\gamma(G) = \gamma(G_I W)$ and $\gamma(\Phi)$ are finite and the small gain condition

$$\gamma(G)\,\gamma(\Phi) < 1 \tag{11}$$

holds. The small gain condition yields the loop gain estimate for $\mathbf{u}' = W_u\mathbf{u}$

$$\gamma(H) \le \frac{\gamma(G_I)}{1 - \gamma(G)\gamma(\Phi)}, \quad\text{where}\quad \gamma(H) = \sup_{\mathbf{u}'}\frac{\|H(\mathbf{u}')\|_2}{\|\mathbf{u}'\|_2} = \|H\|_2 \tag{12}$$

is the gain induced by the $L_2^p$, $p = n, 1$, norms for the operator $H: L_2^n \to L_2$. Note that $\gamma(G_I) = \gamma(\Phi) = 1$ by definition.

To derive the stability condition, now decompose the network into reservoir and output neuron subsystems: let $\mathbf{x}_r = (x_2,\dots,x_n)^T$ and let $W_r^r$ be the submatrix connecting only these inner neurons. Then $((G_I, W_r^r), \Phi)$ denotes the reservoir subsystem, while $((G_I, w_{11}), \Phi)$ yields the one-dimensional output subsystem. The dimensions of $G_I$ and $\Phi$ have to be adjusted, respectively, but their gains remain equal to one. In Fig. 2 the network is shown as a composition of the subsystems connected by the original feedback weights. For the composite system to be stable, the subsystems must be stable. Because $\gamma(G_I) = \gamma(\Phi) = 1$ and thus $\gamma(G) = \gamma(G_I W) = \|W\|$, we obtain for the subsystems the inequalities $\|W_r^r\| < 1$ and $|w_{11}| < 1$ and, as proved in the Appendix,

$$\|\mathbf{w}^o_{x_r}\| < (1 - \|W_r^r\|)(1 - |w_{11}|)\,\|\mathbf{w}^{x_o}_r\|^{-1} \tag{13}$$

for the overall network. Here $\mathbf{w}^o_{x_r}$ is the vector of trainable weights and $\mathbf{w}^{x_o}_r$ the vector of feedback weights from the output to the reservoir. The condition (13) can easily be monitored online, because the matrix and vector norms on the right-hand side can be precomputed at initialization time, and while learning only the norm of the output weight vector $\|\mathbf{w}^o_{x_r}\|$ has to be updated. Fig. 2 (bottom) shows the averaged right-hand side (stability margin) of (13) for different network sizes and initialization intervals. To statically assure stability of a network by initialization, the interval parameter $a$ must be chosen to generate networks considerably above the zero baseline for the stability margin.
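Condition (13), and the rescaling factor that Section 3.1 derives from it, can be checked numerically with a few lines of code. The sketch below is illustrative only: it takes the norms $\|\mathbf{w}^o_{x_r}\|$, $\|W_r^r\|$, $|w_{11}|$ and $\|\mathbf{w}^{x_o}_r\|$ as plain numbers (how they are computed, and which matrix norm is used, is left open here), and the function names are hypothetical. The factor is obtained as the positive root of the quadratic that results when all four norms in (13) are scaled by the same $\lambda$.

```python
import math


def stability_margin(norm_wrr, w11, norm_fb):
    """Right-hand side of (13): (1 - ||W_r^r||)(1 - |w_11|) / ||w_r^{x_o}||."""
    return (1.0 - norm_wrr) * (1.0 - abs(w11)) / norm_fb


def is_provably_stable(norm_out, norm_wrr, w11, norm_fb):
    """Sufficient test: both subsystem gains below one and (13) satisfied."""
    return (norm_wrr < 1.0 and abs(w11) < 1.0
            and norm_out < stability_margin(norm_wrr, w11, norm_fb))


def rescale_factor(norm_out, norm_wrr, w11, norm_fb):
    """Positive root of lambda^2 * a * b = (1 - lambda * c) * (1 - lambda * d)."""
    a, b, c, d = norm_out, norm_fb, norm_wrr, abs(w11)
    q = a * b - c * d  # coefficient of lambda^2
    if abs(q) < 1e-15:  # degenerate case: the equation is linear in lambda
        return math.inf if c + d == 0.0 else 1.0 / (c + d)
    disc = (c - d) ** 2 + 4.0 * a * b  # equals (c + d)^2 + 4 q
    return (math.sqrt(disc) - (c + d)) / (2.0 * q)


# Example with made-up norms: the network violates (13) until it is rescaled.
lam = rescale_factor(2.0, 0.5, 0.1, 0.5) - 0.02  # slightly smaller factor
scaled = [lam * v for v in (2.0, 0.5, 0.1, 0.5)]
```

Because only scalar norms enter, the check costs a handful of floating-point operations per learning step once the reservoir norms have been precomputed.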

Fig. 2. Top: system composed of inner reservoir subsystem $(G_I, W_r^r)$, output subsystem $(G_I, w_{11})$ and feedback. Bottom: stability margin vs. weight initialization interval $[-a,a]$ ($a$ from 0.05 to 0.2) for network sizes 50, 80, 100, 120 and 150 (averaged over 50 network initializations).

3.1. Online stability and rescaling

Condition (13) can also be used to enforce network stability during learning. The idea is to rescale the full weight matrix by a factor $\lambda$ such that the resulting $W^+ := \lambda W$ obeys the stability constraint. Given a current $\|\mathbf{w}^o_{x_r}\|$ that does not fulfil (13), solve

$$\lambda\|\mathbf{w}^o_{x_r}\| = \frac{(1 - \lambda\|W_r^r\|)(1 - \lambda|w_{11}|)}{\lambda\|\mathbf{w}^{x_o}_r\|} \tag{14}$$

for $\lambda$, so that the scaled matrix $\lambda W$ obeys the stability condition exactly. This equation is quadratic in $\lambda$ and has the closed-form positive solution

$$\lambda = -\frac{1}{2}\,\frac{\|W_r^r\| + |w_{11}| - \sqrt{\|W_r^r\|^2 - 2|w_{11}|\|W_r^r\| + |w_{11}|^2 + 4\|\mathbf{w}^o_{x_r}\|\|\mathbf{w}^{x_o}_r\|}}{\|\mathbf{w}^o_{x_r}\|\|\mathbf{w}^{x_o}_r\| - |w_{11}|\|W_r^r\|}, \tag{15}$$

so that it can also be evaluated online at no significant extra computational burden. We finally get the stabilized online learning algorithm:

(0) initialize the network with weights in $[-a,a]$, compute $\|W_r^r\|$, $\|\mathbf{w}^{x_o}_r\|$, $|w_{11}|$;
(1) check stability condition (13) for $\|\mathbf{w}^o_{x_r}\|$;
(2) if (13) does not hold, compute the scaling factor $\lambda$ according to (15), rescale the weight matrix $W^+ := \lambda W$, rescale $\|W_r^r\|^+ := \lambda\|W_r^r\|$, $\|\mathbf{w}^{x_o}_r\|^+ := \lambda\|\mathbf{w}^{x_o}_r\|$, $|w_{11}|^+ := \lambda|w_{11}|$, and replace the norms in (13) by the updated $\|\cdot\|^+$ values;
(3) iterate the network;
(4) apply the learning step $\Delta w^{\mathrm{BPDC}}$;
(5) go to (1) (unless a stopping criterion is met).

Rescaling gives up the idea of a fixed reservoir in the sense that the reservoir is now globally scaled. It can be interpreted as adjusting the time scale on which the recurrent network activations operate against the linear damping factor ($\dot{\mathbf{x}} = -\mathbf{x} + \lambda W f \Leftrightarrow \dot{\mathbf{x}}/\lambda = -\mathbf{x}/\lambda + W f$), or, alternatively, as rendering the system more linear because $\lambda f$ flattens the nonlinearity, reduces its maximal slope and extends the range of $\mathbf{x}$ in which the network behaves almost linearly. This does not change the flow of information as determined by the weights and therefore can be expected not to interfere heavily with the decorrelation learning mechanism. To minimize the frequency of rescaling and its disturbance of the learning, it is useful in practice to use a slightly smaller factor; $\lambda - 0.02$ is used in the experiments below. In the experiments we observe only very small performance losses after rescaling, which supports this argument. Note that scaling can make the weights smaller, but it could equally be used to scale them upward if the stability margin indicates that the network drifts away from the stability border towards a too stable configuration. In practice, however, we have never observed the latter case because the output weights always tend to increase, at least on average.

4. Simulation results

Below we give simulation results for three standard and hard time series prediction benchmarks, for which


reference results are available in the literature. The results illustrate the impact of the stability constraints on performance and in particular the capability of the online rescaling algorithm to obtain optimized results in the provably stable region. Note that in all experiments a configuration of 50/20 (nodes/inputs) refers to a network of 71 neurons: 49 reservoir neurons with fixed full connectivity, 20 input reservoir neurons and one bias neuron with fixed forward connections, and one output neuron. The output neuron has 70 trainable input weights as well as 70 fixed output connections to all other neurons, including the input/bias neurons, and one trainable self-connection. If not otherwise stated, the number of inputs $s$ gives the length of a time window of previous data $y(k-1),\dots,y(k-s)$ used as network input to predict $y(k)$.

4.1. CATS benchmark

Table 1 shows results for the recently introduced CATS benchmark [4], provided for a time-series competition held at IJCNN 2004. The data consist of 5000 points, with 100 points missing at positions 981–1000, 1981–2000, …. The error criterion is the MSQE over all 100 missing points, where a main difficulty is the trailing last 20 data points 4981–5000, because only one-sided information is available there.

As the BPDC algorithm is essentially an online method, it is useful to provide it with estimated targets also for the unknown data. The following strategy is adopted: first the last 4096 data points are fast Fourier transformed (fft), where the first four gaps are filled with suitable random noise around a linear interpolation. This gives a good estimate in the first four gaps (MSQE 390). For the last 20 points we use the prediction from the network states as input for the Fourier transform, such that with increasing accuracy of the network the Fourier transform in turn becomes more accurate as well (in all gaps). Before each epoch the fft is computed, the upper 75% of the frequencies are pruned, and the result is back-transformed to serve as the teaching signal for the network.
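The pruning step of this strategy can be sketched as a simple low-pass filter. The following is only an illustration under stated assumptions: it uses a naive $O(n^2)$ discrete Fourier transform from the standard library instead of an fft, it interprets "the upper 75% of the frequencies" as all bins above one quarter of the highest representable frequency, and the function name `lowpass_teacher` is hypothetical. The gap filling and the substitution of network predictions described above are not shown.

```python
import cmath
import math


def lowpass_teacher(signal, keep=0.25):
    """Return `signal` with all but the lowest `keep` fraction of frequencies removed."""
    n = len(signal)
    # forward transform (naive DFT; an fft would be used for 4096 points)
    spectrum = [sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
                for k in range(n)]
    cutoff = keep * (n // 2)  # highest retained frequency index
    for k in range(n):
        if min(k, n - k) > cutoff:  # bins k and n - k carry the same frequency
            spectrum[k] = 0.0
    # inverse transform; the symmetric pruning keeps the result real
    return [(sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n).real
            for t in range(n)]


# Example: a slow sine plus a fast sine; only the slow component survives.
n = 64
slow = [math.sin(2 * math.pi * 2 * t / n) for t in range(n)]
noisy = [slow[t] + 0.5 * math.sin(2 * math.pi * 20 * t / n) for t in range(n)]
teacher = lowpass_teacher(noisy)
```

Pruning the bins symmetrically around the Nyquist index is what guarantees a real-valued teaching signal after the back-transform.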

Table 1. Average errors over 100 networks for the CATS benchmark for different network sizes/inputs (50/25 to 150/50) and different initialization ranges. Entries are Avg/Stddev/Best network.

Network   a = 0.02         a = 0.1–0.055     a = 0.2           Rescaling
50/25     495/9.93/478     485/15.61/454     485/25.67/441     429/23.06/386
80/30     499/7.50/481     499/17.14/468     502/35.41/443     465/22.02/429
100/20    522/2.03/517     513/12.41/488     497/33.90/432     460/20.04/417
100/30    480/7.78/464     479/13.99/451     488/31.62/426     423/13.00/386
120/40    480/7.17/467     498/25.78/448     489/37.79/412     460/9.90/442
150/50    488/33.27/447    485/30.05/439     480/76.69/367     421/11.82/402

The middle column uses initialization close to the border of the stability range, a = 0.1, 0.08, 0.07, 0.07, 0.065, 0.055 (for increasing size of networks); the last column gives results for dynamically stabilized networks.

Table 1 shows results for networks ($\eta = 0.03$, $\varepsilon = 0.002$) of different sizes, taking a different number of immediate past values as inputs, in three initialization conditions: provably stable ($a = 0.02$); at the edge of the stability region obtained from Fig. 2 (bottom); and not provably stable ($a = 0.2$) with the presented method. Because small gain stability conditions are always only sufficient and are known to be conservative, the networks may also be stable in the last case, and we did not encounter stability problems in practice. Most impressive are the results for the online rescaling algorithm, which dramatically improves the results for the smaller networks (while maintaining stability). On average, for static initialization all but the 100/20 networks are ranked third with respect to the results of the time series competition at IJCNN, behind ([11], 408) and ([3], 441); the rescaling algorithm ranks second or third as well. The results show the robustness of BPDC to the initialization, which is obviously of crucial importance because the reservoir is not adapted. On the other hand, the stddev increases with the initialization range because the reservoir naturally provides more dynamics with larger weights. This variance is caused mainly by the recurrent estimation of the last 20 points, and the results below show that for more standard tasks it is not problematic at all.

4.2. Laser data

In Fig. 3, results are given for the laser dataset A and networks of size 50/15 and 100/15, where the length of the time window is chosen based on [20] to ensure that most relevant information is contained in this time horizon. Data are scaled by 1/100 and shifted by 0.7 afterwards to move them into a suitable region for the network. Network parameters are learning rate 0.03, $\varepsilon = 0.002$, and bias 0.2. The networks are trained over 15 epochs on the first 1000 points and tested on the next 1000 points of dataset A; results are averaged over 50 networks.

The results reveal several important observations. First, all standard deviations are very small, such that despite the random character of the reservoir initialization the variance in the results can be neglected. In control experiments we find that this important regularizing property is mainly introduced by the bias neuron, which also improves performance and therefore should always be provided. Second, it can clearly be seen that the best results are obtained for weight initializations that are outside the small gain stability region. The oscillating character of the laser data suggests that here the network can profit from larger amplitudes in the network dynamics. The results also show that there is an optimum for the initialization, which, however, can be found only empirically by running tests for many networks. This optimum may still yield stable networks because the small gain criterion is sufficient but not necessary. Though there are sophisticated methods, mostly relying on linear matrix inequalities (e.g. [2,15]), to investigate recurrent network stability, there are

no sufficiently efficient methods available to analyze stability online and while adapting, except for the method presented here. Finally, if stability is required, as is often the case in control applications, the network may be initialized sufficiently far inside the small gain region according to the results from Fig. 2. In this case we get the performance marked with squares in Fig. 3, hoping that the network does not leave the stable region, or the rescaling algorithm may be used, as marked with circles (almost coinciding with the squares). The latter optimizes performance in the provably stable region at no significant extra cost and is thus preferable over smaller initializations.

Fig. 3. Train and test errors (avg(NMSQE)) as a function of the initialization interval $[-a,a]$ for the laser data and the tenth order data, for networks of size 50 and 100, averaged over 50 networks (with stddev < 0.03 throughout). The corresponding numbers for stabilized networks (statically stable, stable by rescaling) are marked, see text for discussion.

4.3. Tenth order data

The following problem in discrete time has also been considered in [1]:

$$y(k+1) = 0.3\,y(k) + 0.05\,y(k)\left[\sum_{i=0}^{9} y(k-i)\right] + 1.5\,u(k-9)\,u(k) + 0.1.$$
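For readers who want to reproduce the benchmark series, the recursion above can be generated as follows. This is a minimal sketch with assumptions that are not part of the paper: the first ten outputs are initialized to zero, the inputs come from Python's standard pseudo-random generator with an arbitrary seed, and the function name `tenth_order_series` is hypothetical. The sketch does not check whether the generated series stays bounded, and it does not apply the scaling and shifting used for the experiments.

```python
import random


def tenth_order_series(length, seed=0, u_max=1.0):
    """Generate inputs u(k) in [0, u_max] and outputs y(k) of the tenth-order system."""
    rng = random.Random(seed)
    u = [rng.uniform(0.0, u_max) for _ in range(length)]
    y = [0.0] * length  # assumption: zero initial conditions for y(0), ..., y(9)
    for k in range(9, length - 1):
        window = sum(y[k - i] for i in range(10))  # sum_{i=0}^{9} y(k - i)
        y[k + 1] = 0.3 * y[k] + 0.05 * y[k] * window + 1.5 * u[k - 9] * u[k] + 0.1
    return u, y
```

A network input at time $k$ would then consist of the pair $u(k)$, $u(k-9)$, with $y(k+1)$ as the target.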


We supply random inputs $u(k)$, $u(k-9)$, uniformly drawn from $[0,1]$, to predict the next output $y(k+1)$. In Fig. 3 results are given for networks of size 50/2 and 100/2, learning rate 0.03, $\varepsilon = 0.002$, and bias 1. Data are scaled by 4 and shifted by 2 to move them into a suitable region for the network. The networks are trained over 15 epochs on the first 1000 points and tested on the next 1000 points; results are averaged over 50 networks. As for the laser data, the variances are very small, such that a very stable performance is reached. The results for this quite differently structured data are very similar to those for the laser data (note the different scale on the y-axis). They reveal that if stability is required, then the rescaling method achieves the maximum performance as well, which again coincides with that of the best statically stable initialized networks. On the other hand, both experiments indicate that rescaling does not disturb the performance and may safely be used to perform stability control online. If stability is not required and computational resources allow a systematic examination, then the best performance is achieved by a larger initialization, and the comments about its stability made for the laser data apply equally here.

The sharp error rise in the 100-unit nets, as opposed to the 50-unit nets, seems to be due to the fact that the larger network operates further from the small gain stability region for smaller initialization intervals, compare Fig. 2. Control experiments for larger intervals reveal a similar but slower error rise also for the smaller network. We suppose that the reason is the potentially richer dynamics in the larger network. When operating near or even beyond the stability border, the probability of unwanted reinforcement of spurious activity increases with the size of the network, which then implements more feedback paths. This is harder to control by the learning dynamics and can quickly lead to larger errors.

5. Conclusion

In this contribution we have presented a method to prove and monitor stability for large networks where only the output layer is adapted. Though in principle the stability method is independent of the learning method used, we use it to assess stability for the BPDC algorithm, which is a new and highly efficient $O(N)$ learning paradigm for such output weights. The stability condition is also used to enhance BPDC with an online rescaling mechanism, which optimizes results in the stable domain and solves the problem of how to initialize the fixed reservoir weights to assure stability. The addition of a bias neuron in comparison to the results in [16,17] also decreases the variance arising from the random initialization to an extent which makes it irrelevant for practical application. Our results show that if provable stability is required, as is usual in control, then the online rescaling provides optimized results. If stability

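constraints can be relaxed, then larger weight initialization intervals may be considered as well and can lead to even better performance. However, also in this case the rescaling mechanism can be used to keep the network within a certain stability margin and to prevent the weights from growing uncontrolled. The encouraging results on the benchmark data, which have proven to be hard tasks for time-series prediction, show that a reasonable compromise between stability, efficient online learning, and accuracy can be achieved with BPDC learning.

Acknowledgements

I would like to thank the reviewers for their high interest in the topic and valuable comments, which very much helped to improve the paper.

Appendix

Consider the composite loop in operator notation, where $x_o$ denotes the output neuron, $w_{oo}$ its self-connection ($w_{11}$ in (13)), $\mathbf{w}_r^o$ the trainable output weights ($\mathbf{w}^o_{x_r}$ in (13)), and $\mathbf{w}_r^{x_o}$ the feedback weights. With $e_o(s) = \mathbf{w}_u^o\mathbf{u}(s) + w_{oo}\,\varphi(x_o(s)) + \mathbf{w}_r^o\,\Phi_r(\mathbf{x}_r(s))$, taking norms gives

$$\begin{aligned}
\|x_o(s)\| &= \|G_I^o(s)\,e_o(s)\| \\
&\le \|\mathbf{w}_u^o\|\|\mathbf{u}(s)\| + |w_{oo}|\|x_o(s)\| + \|\mathbf{w}_r^o\|\|\mathbf{x}_r(s)\| \\
&\le \|\mathbf{w}_u^o\|\|\mathbf{u}(s)\| + |w_{oo}|\|x_o(s)\| + \|\mathbf{w}_r^o\|\,\frac{1}{1 - \|W_r^r\|}\,\|\mathbf{u}_r\| \\
&= \|\mathbf{w}_u^o\|\|\mathbf{u}(s)\| + |w_{oo}|\|x_o(s)\| + \frac{\|\mathbf{w}_r^o\|}{1 - \|W_r^r\|}\,\|W_u^r\mathbf{u}(s) + \mathbf{w}_r^{x_o}x_o(s)\| \\
&\le \|\mathbf{w}_u^o\|\|\mathbf{u}(s)\| + |w_{oo}|\|x_o(s)\| + \frac{\|\mathbf{w}_r^o\|}{1 - \|W_r^r\|}\bigl(\|W_u^r\|\|\mathbf{u}(s)\| + \|\mathbf{w}_r^{x_o}\|\|x_o(s)\|\bigr),
\end{aligned}$$

where we used that $\gamma(G_I) = \|G_I\| = \gamma(\Phi) = \|\Phi\| = 1$ and the loop gain estimate from the small gain theorem for the reservoir subsystem to replace the output of the reservoir $\|\mathbf{x}_r(s)\|$ by its scaled input $\|\mathbf{u}_r\|/(1 - \|W_r^r\|)$, $\mathbf{u}_r = W_u^r\mathbf{u} + \mathbf{w}_r^{x_o}x_o$. Solving for $\|x_o\|$ yields

$$\|x_o(s)\| \le \left(1 - |w_{oo}| - \frac{\|\mathbf{w}_r^o\|}{1 - \|W_r^r\|}\,\|\mathbf{w}_r^{x_o}\|\right)^{-1}\left(\|\mathbf{w}_u^o\| + \frac{\|\mathbf{w}_r^o\|}{1 - \|W_r^r\|}\,\|W_u^r\|\right)\|\mathbf{u}(s)\|.$$

Stability requires the inverted factor on the right-hand side to be larger than zero, and we can solve for the vector of the adapted output weights $\mathbf{w}_r^o$:

$$1 - |w_{oo}| - \frac{\|\mathbf{w}_r^o\|}{1 - \|W_r^r\|}\,\|\mathbf{w}_r^{x_o}\| > 0 \quad\Leftrightarrow\quad \|\mathbf{w}_r^o\| < \frac{(1 - |w_{oo}|)(1 - \|W_r^r\|)}{\|\mathbf{w}_r^{x_o}\|}.$$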


References

[1] A.B. Atiya, A.G. Parlos, New results on recurrent network training: unifying the algorithms and accelerating convergence, IEEE Trans. Neural Networks 11 (9) (2000) 697–709.
[2] N. Barabanov, D.V. Prokhorov, Stability analysis of discrete-time recurrent neural networks, IEEE Trans. Neural Networks 13 (2) (2002) 292–303.
[3] X. Cai, N. Zhang, G. Venayagamoorthy, D. Wunsch, Time series prediction with recurrent neural networks using a hybrid pso-ea algorithm, in: IJCNN, 2004, pp. 1647–1653.
[4] CATS data, URL: http://www.cis.hut.fi/lendasse/competition/competition.html.
[5] B. Hammer, J.J. Steil, Tutorial: perspectives on learning with recurrent neural networks, in: Proceedings of the ESANN, 2002, pp. 357–368.
[6] H. Jaeger, Adaptive nonlinear system identification with echo state networks, in: NIPS, 2002, pp. 593–600.
[7] L. Jin, M.M. Gupta, Stable dynamic backpropagation learning in recurrent neural networks, IEEE Trans. Neural Networks 10 (6) (1999) 1321–1333.
[8] W. Maass, T. Natschläger, H. Markram, Real-time computing without stable states: a new framework for neural computation based on perturbations, Neural Comput. 14 (11) (2002) 2531–2560.
[9] T. Natschläger, W. Maass, H. Markram, The "liquid computer": a novel strategy for real-time computing on time series, TELEMATIK 8 (1) (2002) 39–43.
[10] B.A. Pearlmutter, Gradient calculations for dynamic recurrent neural networks: a survey, IEEE Trans. Neural Networks 6 (5) (1995) 1212–1228.
[11] S. Sarkka, A. Vehtari, J. Lampinen, Time series prediction by Kalman smoother with cross validated noise density, in: Proceedings of the IJCNN, 2004, pp. 1653–1658.
[12] U.D. Schiller, J.J. Steil, On the weight dynamics of recurrent learning, in: Proceedings of the ESANN, 2003, pp. 73–78.
[13] U.D. Schiller, J.J. Steil, Analyzing the weight dynamics of recurrent learning algorithms, Neurocomputing 63C (2005) 5–23.
[14] J.J. Steil, Input–Output Stability of Recurrent Neural Networks, Cuvillier Verlag, Göttingen, 1999 (Also: Ph.D. Dissertation, Faculty of Technology, Bielefeld University).

[15] J.J. Steil, Local stability of recurrent networks with time-varying weights and inputs, Neurocomputing 48 (1–4) (2002) 39–51.
[16] J.J. Steil, Backpropagation–decorrelation: recurrent learning with O(N) complexity, in: Proceedings of the IJCNN, vol. 1, 2004, pp. 843–848.
[17] J.J. Steil, Stability of backpropagation–decorrelation efficient O(N) recurrent learning, in: Proceedings of the ESANN, d-facto publications, Brügge, 2005, pp. 43–48.
[18] J.A.K. Suykens, B.D. Moor, J. Vandewalle, Robust local stability of multilayer recurrent neural networks, IEEE Trans. Neural Networks (2000) 222–229.
[19] J.A.K. Suykens, J. Vandewalle, B.D. Moor, NLq theory: checking and imposing stability of recurrent neural networks for nonlinear modeling, IEEE Trans. Signal Process. (1997) 2682–2691.
[20] J. Tikka, J. Hollmén, A. Lendasse, Input selection for long-term prediction of time series, in: Proceedings of the Eighth International Work-Conference on Artificial Neural Networks (IWANN), Lecture Notes in Computer Science, vol. 3512, Springer, Berlin, 2005, pp. 1002–1009.
[21] M. Vidyasagar, Nonlinear Systems Analysis, second ed., Prentice-Hall, Englewood Cliffs, NJ, 1993.

Jochen J. Steil received the diploma in mathematics from the University of Bielefeld, Germany, in 1993. Since then he has been a member of the Neuroinformatics Group at the University of Bielefeld, interrupted by one year at the St. Petersburg Electrotechnical University, Russia, supported by a German Academic Exchange Foundation (DAAD) grant. In 1999, he received the Ph.D. degree with a dissertation on "Input–Output Stability of Recurrent Neural Networks". Since 2002 he has been appointed tenured senior research and teaching staff (Akad. Oberrat). J.J. Steil is a staff member of the special research unit 360 "Situated Artificial Communicators" and the Graduate Program "Task Oriented Communication" and heads projects on robot learning and intelligent systems. His main research interests are analysis, stability, and control of recurrent dynamics and learning, as well as the development of learning architectures for complex cognitive robots suited for multimodal human–machine communication, interaction, and instruction of grasping.