Neural Networks, Vol. 4, pp. 615-618, 1991
0893-6080/91 $3.00 + .00 Copyright © 1991 Pergamon Press plc
Printed in the USA. All rights reserved.
ORIGINAL CONTRIBUTION
Back Propagation With Expected Source Values

TARIQ SAMAD

Honeywell SSDC, Minneapolis, MN
(Received 19 March 1990; revised and accepted 23 April 1991)
Abstract--The back propagation learning rule converges significantly faster if expected values of source units are used for updating weights. The expected value of a unit can be approximated as the sum of the output of the unit and its error term. Results from numerous simulations demonstrate the comparative advantage of the new rule.
Keywords--Neural networks, Supervised learning, Back propagation.

Acknowledgement: The author is grateful to the anonymous referees for their numerous helpful comments. Requests for reprints should be sent to Tariq Samad, Honeywell SSDC, 3660 Technology Drive, Minneapolis, MN 55418.

1. INTRODUCTION

A major reason for the resurgence of interest in neural networks is the discovery of learning algorithms for multilayer networks. There are now a number of such algorithms in the literature. The most popular of these is the back propagation learning rule (Werbos, 1974; Rumelhart, Hinton, & Williams, 1985). There are, however, some problems with back propagation. In particular, it is slow, a limitation that bodes poorly for real-world applications. This paper describes an intuitive and straightforward modification of back propagation that results in significant performance improvement. Unlike some other modifications, and unlike more sophisticated numerical optimization techniques such as conjugate gradient algorithms, the extension proposed here is extremely simple, and the computational overhead incurred per iteration is minimal. Thus, convergence results with the new rule can be compared directly to those of the original. Section 2 motivates and describes the modification, Section 3 presents several simulation results, and Section 4 presents some concluding remarks.

2. EXPECTED SOURCE VALUES

Back propagation is a gradient-descent learning rule that minimizes the sum squared error over the output layer of a feed-forward neural network. Weights w_ij from a unit i to a unit j are updated according to the following formula:

\Delta w_{ij} = \eta \, o_i \, \delta_j \qquad (1)

where o_i is the output of the "source" unit i and δ_j is an associated error term for unit j, defined as follows:

\delta_j = \begin{cases} o_j' (t_j - o_j) & \text{for output units } j \\ o_j' \sum_k w_{jk} \, \delta_k & \text{for hidden units } j \end{cases}

See Rumelhart et al. (1985) for the derivation. δ_j can be considered the error in unit j scaled by the derivative of j evaluated at its current value. This is true by construction when j is an output unit, and by analogy for hidden j's. The variation of this rule described below hinges on this analogy, for which further justification can be provided. Since eqn (1) is a gradient descent rule, it results in the squared error for the output layer being reduced (provided that η is sufficiently small). If we ignore the effect of the derivatives in the error terms δ_j, then the error terms in the output layer will be reduced. Furthermore, δ_j is then linearly related to the error terms in the output layer, and the magnitude of δ_j will also be reduced. In other words, eqn (1) attempts to drive δ_j towards zero. Factoring in the derivative terms does not qualitatively affect this conclusion. Indeed, since unit values start out near 0.5, where derivatives are at their maximum value, and evolve slowly (for small η) towards their final learned values, the relationship between δ_j and the output error is maintained in general. The decrease in δ_j applies only for the current training input, say x, and it is in effect only as long as other network parameters remain unchanged.
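For concreteness, eqn (1) and the error-term definitions can be written out directly. The following is a minimal NumPy sketch for a single-hidden-layer network of logistic units (bias terms omitted); the function and variable names are illustrative, not from the paper.

```python
import numpy as np

def sigmoid(z):
    """Logistic activation, o = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(x, t, W_ih, W_ho, eta):
    """One per-pattern update of eqn (1). W_ih (hidden x input) and
    W_ho (output x hidden) are updated in place."""
    # Forward pass.
    o_h = sigmoid(W_ih @ x)                  # hidden unit outputs
    o_o = sigmoid(W_ho @ o_h)                # output unit outputs
    # Error terms; for logistic units the derivative o' is o(1 - o).
    delta_o = o_o * (1.0 - o_o) * (t - o_o)            # output units
    delta_h = o_h * (1.0 - o_h) * (W_ho.T @ delta_o)   # hidden units
    # Eqn (1): delta_w_ij = eta * o_i * delta_j, in outer-product form.
    W_ho += eta * np.outer(delta_o, o_h)
    W_ih += eta * np.outer(delta_h, x)
```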
FIGURE 1. E^{t-} is the error surface as a function of the weight w_ij prior to learning iteration t, in response to some training input (all weights except w_ij are assumed at their time t- values). w_ij^{t-} is the value of the weight w_ij at time t-. The original back propagation learning rule performs gradient descent on the surface E^{t-}. The dashed curve depicts the error surface with respect to w_ij after training iteration t (all other weights are at their updated values). The new rule uses local information at time t- to approximate (crudely) the dashed curve.
In particular, the effectiveness of the weight update will be reduced if, as learning proceeds, the output of unit i in response to x deviates from the o_i value used for the weight update. If i is an input unit, x will always lead to this o_i value for unit i. If i is a hidden unit, however, there are weights w_hi that are also subject to modification. These modifications will affect the output of unit i in response to the input x. Thus, weight changes dictated by eqn (1) could turn out subsequently to have been inappropriate. Figure 1 attempts a schematic depiction of this reasoning. The error surface for a particular weight w_ij changes as other weights are modified over an iteration. We can exploit the analogy between δ_i and the error attributed to unit i to compensate for future changes to weights in earlier layers. There is no accurate way of effecting this compensation, but there is a crude yet reasonable way: instead of using the current value of the source unit i for the update, we can use the "expected value" for unit i. The expected value can be formulated simply as the sum of the current value and the error term. This purposeful mismatch of activations and error terms leads to the following modified back propagation learning rule:

\Delta w_{ij} = \eta \, (o_i + \beta \, \delta_i) \, \delta_j \qquad (2)

where β is a constant. Eqn (2) reverts to the original rule when β equals 0.0. β can usually be set to 1.0, and superior results over eqn (1) are consistently obtained. In many cases, however, higher values of β can further accelerate learning. As with η, further increases in β beyond some problem-specific value result in oscillations and nonconvergence.
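As a sketch of how eqn (2) changes the implementation (reusing the illustrative names from the earlier fragment), only the source activations of weights leaving hidden units are modified; input units carry no error term, so updates to the first layer of weights are unchanged:

```python
def expected_source_step(x, t, W_ih, W_ho, eta, beta):
    """One per-pattern update of eqn (2); beta = 0.0 recovers eqn (1)."""
    o_h = sigmoid(W_ih @ x)
    o_o = sigmoid(W_ho @ o_h)
    delta_o = o_o * (1.0 - o_o) * (t - o_o)
    delta_h = o_h * (1.0 - o_h) * (W_ho.T @ delta_o)
    # Eqn (2): the hidden source value o_i is replaced by its
    # "expected value" o_i + beta * delta_i.
    W_ho += eta * np.outer(delta_o, o_h + beta * delta_h)
    # Source units here are input units (no error term): same as eqn (1).
    W_ih += eta * np.outer(delta_h, x)
```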
TABLE 1
Simulation results for various problems. For each problem, a number of experiments were performed with different (coarsely sampled) values of η and β. Each experiment consisted of 10 runs for all problems except the 64-6-64 encoder/decoder, for which 3 runs were used. The best results for β = 0 (the original back propagation rule) and β ≠ 0 are shown. Iteration limits were 5,000 for XOR and 10,000 for parity. The averages are over converged runs. No momentum term was used, weights were initialized randomly in (-1.0, +1.0), and an output unit value was interpreted as a 1 (0) if it was above 0.75 (below 0.25).

Net Structure and Problem             η     β    #Converged   Ave. Convergence
-------------------------------------------------------------------------------
2-2-1, XOR                           0.8   0.0      8/10            605
                                     1.0   5.0     10/10            318
                                     2.0   5.0      8/10            107
3-4-1, Parity                        0.3   0.0     10/10           1476
                                     2.0   0.0      9/10            522
                                     0.8   1.0     10/10            650
                                     2.0   2.0      8/10            178
4-2-4, Encoder/decoder               3.0   0.0     10/10            144
                                     4.0   8.0     10/10             50
8-3-8, Encoder/decoder               5.0   0.0     10/10            215
                                     4.0   2.0     10/10            165
16-4-16, Encoder/decoder             4.0   0.0     10/10            242
                                     3.0   1.0     10/10            206
64-6-64, Encoder/decoder             2.0   0.0      3/3             351
                                     2.0   2.0      3/3             274
10-10-10, 10 Random associations     1.0   0.0     10/10            176
                                     1.0   4.0     10/10             91
5-5-5-5, 5 Random associations       1.0   0.0     10/10            526
                                     1.0   3.0     10/10            175
Several relatively less crude expected value approximations are possible--the simplest being bounding o_i + βδ_i in (0.0, 1.0)--but these are not pursued here in the interest of keeping the modification as simple as possible. Note that eqn (2) is identical to the original rule when i is an input unit: there is no error term associated with input units in the usual application of back propagation. This is not true of some variations, such as the one-layer auto-associative memory described in Samad and Harper (1987), whose learning rule bears some similarities to eqn (2).

3. SIMULATION RESULTS
The heuristic nature of the argument furnished above renders empirical validation particularly important. Extensive performance comparisons between the new and original rules have been conducted. Many different problems, network structures, initial weight settings, and termination criteria have been investigated. In almost every case, the new rule has converged significantly faster than the original. Table 1 lists some comparative results from a number of experiments with several common Boolean problems. Network structures are described in the common n_i-n_h-n_o notation, where n_i, n_h, and n_o represent the number of units in the input, hidden, and output layers, respectively. All connections are between adjacent layers; there is full connectivity within this constraint. An analogous notation is used for networks with more hidden layers. Each experiment consisted of 10 trials for each listed value of η and β (for the 64-6-64 encoder/decoder problem, each experiment consisted of 3 trials).
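As an illustration of the experimental setup (a hypothetical sketch, not code from the paper), an n-h-n encoder/decoder task maps each one-hot pattern to itself, and Table 1's output criterion treats a unit value above 0.75 as a 1 and below 0.25 as a 0:

```python
import numpy as np

def encoder_decoder_task(n):
    """Training set for an n-h-n encoder/decoder: identity over one-hots."""
    patterns = np.eye(n)
    return patterns, patterns          # inputs and targets coincide

def pattern_learned(output, target, hi=0.75, lo=0.25):
    """Table 1's interpretation: > 0.75 counts as 1, < 0.25 counts as 0."""
    return np.all(np.where(target == 1.0, output > hi, output < lo))
```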
FIGURE 2. η-dependency curves (epochs to learn vs. η) comparing learning rates for the original back propagation rule (dashed line) and the variation described in this paper (solid line, β fixed at 1.0). The curves are generated by training a network for a specific application over a range of η values. The curves indicate the range of η values for which learning is relatively rapid, the best learning rate that can reliably be obtained, and the sensitivity to the randomly generated initial weights. No momentum term was used, weights were initialized randomly in (-0.1, +0.1), and an output unit value was interpreted as 1 (0) if it was above 0.85 (below 0.15). A 2-2-1 network was used to solve the XOR problem.
FIGURE 3. Learning curves (rms error vs. epochs) for a function approximation problem (see text). The new rule converged in 837 epochs; the original rule required 3,721.
The same initial weight sets were used for each of the experiments with a given problem/structure. A large number of experiments were performed for each learning rule/problem combination, with different (coarsely sampled) values of η (for the original rule) and of η and β (for the new rule). Only the best results for each case are shown. The results in Table 1 are based on relatively high initial weights, large η's, and a liberal termination criterion. These parameter settings result in faster convergence and thus allowed more experiments to be conducted. Numerous simulations have also been conducted with more common parameter settings: initial weights in (-0.1, +0.1), η = 0.3 (in some cases a range of η values was considered), and thresholds of 0.85 and 0.15. β was set to 1.0 in these simulations. Figure 2 shows η-dependency curves (Samad, 1987) for the XOR problem with a 2-2-1 network under these conditions. As can be seen, learning is faster with expected source value updates over the entire useful range of η. With a 2-5-5-1 network, the new rule reliably learned the XOR in two to three thousand epochs. The original rule required around fifty thousand epochs, and that for a relatively narrow range of η values. In the one experiment performed with a 2-4-4-4-1 network, the original rule did not learn in 1,000,000 epochs, whereas the new rule learned in 4,000. Figure 3 shows learning curves for the original and new back propagation rules for a function approximation problem. One hundred random samples of the following (contrived, but somewhat complex) function were used as the training set:

y = f(x_1, x_2, x_3) = g\!\left(x_1 + (x_2 - 0.5)^2 - 3 \sin(\pi x_3)\right), \qquad g(z) = \frac{1}{1 + e^{-z}}
Ranges for the input variables were restricted to the (0, 1) interval. The network structure was 3-10-1, initial weights were randomly generated in (-0.3, +0.3), and the convergence criterion was 0.01 rms error. A value of 0.3 was used for η.
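A sketch of how such a training set can be generated, assuming the form of f and g given above (the sampling code and names are illustrative):

```python
import numpy as np

def g(z):
    """Logistic squashing function g(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def f(x1, x2, x3):
    """The contrived target function of the text (assumed form)."""
    return g(x1 + (x2 - 0.5) ** 2 - 3.0 * np.sin(np.pi * x3))

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(100, 3))   # 100 random samples in (0, 1)
y = f(X[:, 0], X[:, 1], X[:, 2])           # training targets
```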
Convergence was attained in 837 epochs for the expected source value variation (β = 1.0) and in 3,721 epochs for the original rule. The fastest convergence achieved with the original rule for this problem was 1,371 epochs, with an η of 2.0. No attempt was made to optimize the new rule: only the η = 0.3, β = 1.0 experiment was conducted.

4. SOME CONCLUDING REMARKS

There are several other variations and extensions of back propagation that have been described in the last few years (e.g., Fahlman, 1988; Jacobs, 1988; Stornetta & Huberman, 1987), all of which result in some improvement in learning rate over the original rule. A major advantage of the rule described in this paper is its simplicity. Unlike several other extensions, the new rule does not require significantly more computation per iteration, significantly more memory, or dynamic changes to parameter values. With some modifications, particularly conjugate gradient or line search methods, each learning iteration is substantially more complex than for the original rule. Learning rate improvements expressed in terms of iterations or epochs are misleading in such cases. It should also be noted that conjugate gradient and line search techniques are restricted to "batch" updates. Another feature of the new rule is that its computational requirements are as localized as those of the original. The only extra quantity that needs to be computed for a weight update is the error term for the source unit, which is required by the original rule as well, although only for updates to weights into the unit. Implementations of back propagation differ in when error terms for hidden units are computed. One option is to compute all hidden unit error terms immediately after the forward pass; a second is to compute each hidden unit error term on demand (e.g., in a one-hidden-layer network the error terms for the hidden layer are computed after the weights to the output layer are modified). The first option is closer to true gradient descent in the entire weight space, but in practice it is frequently slower to converge. In this case, the only computational cost associated with the new rule is the negligible one of an extra addition and multiplication for each weight from a hidden unit. In the second case, since the error term for a hidden unit i is now used for adjusting weights both into and out of the unit i, and since δ_i is a function of the weights from i, it has to be computed twice in every learning iteration: once to compute the weight adjustments for efferent weights, and again (since it has now changed) to compute the weight adjustments for afferent weights. In general, the error term for every hidden unit will need to be computed twice. An execution time penalty per epoch of 10 to 15% has been observed. Also, the recursive implementation now incurs a small memory penalty: error terms for two layers must be stored simultaneously. The simulations reported in the previous section employed the on-demand error computation strategy. Significant improvements in learning rate have been obtained with the first option, the up-front computation of hidden errors, as well. For example, for the 64-6-64 encoder/decoder problem, an average learning rate of 348 epochs was observed with the original rule and an average of 252 epochs with the new one. For the function approximation problem, the respective learning rates were 4,266 and 834. All conditions were identical to those used for the on-demand error computation runs.
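The two orderings can be contrasted in a short sketch (illustrative Python with hypothetical names, building on the earlier fragments): in the up-front ordering, δ_h is computed once and reused; in the on-demand ordering, the new rule must recompute it after the output-layer weights change.

```python
def up_front_step(x, t, W_ih, W_ho, eta, beta):
    """Option 1: all error terms computed right after the forward pass."""
    o_h = sigmoid(W_ih @ x)
    o_o = sigmoid(W_ho @ o_h)
    delta_o = o_o * (1.0 - o_o) * (t - o_o)
    delta_h = o_h * (1.0 - o_h) * (W_ho.T @ delta_o)   # computed once
    W_ho += eta * np.outer(delta_o, o_h + beta * delta_h)
    W_ih += eta * np.outer(delta_h, x)

def on_demand_step(x, t, W_ih, W_ho, eta, beta):
    """Option 2: output-layer weights are updated first, so the new rule
    must compute the hidden error term twice."""
    o_h = sigmoid(W_ih @ x)
    o_o = sigmoid(W_ho @ o_h)
    delta_o = o_o * (1.0 - o_o) * (t - o_o)
    # Once for the efferent (hidden-to-output) weight adjustments...
    delta_h = o_h * (1.0 - o_h) * (W_ho.T @ delta_o)
    W_ho += eta * np.outer(delta_o, o_h + beta * delta_h)
    # ...and again for the afferent adjustments, since W_ho has changed.
    delta_h = o_h * (1.0 - o_h) * (W_ho.T @ delta_o)
    W_ih += eta * np.outer(delta_h, x)
```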
REFERENCES

Fahlman, S. E. (1988). Faster-learning variations on back-propagation: An empirical study. In D. Touretzky, G. E. Hinton, & T. J. Sejnowski (Eds.), Proceedings of the 1988 Connectionist Models Summer School (pp. 38-51). San Mateo, CA: Morgan Kaufmann.

Jacobs, R. A. (1988). Increased rates of convergence through learning rate adaptation. Neural Networks, 1(4), 295-307.

Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1985). Learning internal representations by error propagation (ICS Report 8506). La Jolla, CA: Institute for Cognitive Science, University of California, San Diego.

Samad, T. (1987). Refining and redefining the backpropagation learning rule for connectionist networks. In Proceedings of the 1987 IEEE Systems, Man and Cybernetics Annual Conference, Alexandria, VA (pp. 958-963). New York: IEEE Press.

Samad, T., & Harper, P. (1987). Associative memory storage using a variant of the generalized delta rule. In Proceedings of the First IEEE International Conference on Neural Networks, III, 173-184.

Stornetta, W. S., & Huberman, B. A. (1987). An improved three-layer, back propagation algorithm. In M. Caudill & C. Butler (Eds.), Proceedings of the First IEEE International Conference on Neural Networks, II, 637-644.

Werbos, P. (1974). Beyond regression: New tools for prediction and analysis in the behavioral sciences. Ph.D. thesis, Harvard University, Committee on Applied Mathematics.