Neurocomputing 57 (2004) 477 – 484
www.elsevier.com/locate/neucom
Letters
A modified error function for the backpropagation algorithm

X.G. Wang, Z. Tang∗, H. Tamura, M. Ishii
Faculty of Engineering, Toyama University, Toyama 930-8555, Japan
Received 16 October 2003; accepted 15 December 2003
Abstract

We have noted that the local minima problem in the backpropagation algorithm is usually caused by update disharmony between weights connected to the hidden layer and the output layer. To solve this problem, we propose a modified error function. It can harmonize the update of weights connected to the hidden layer and those connected to the output layer by adding one term to the conventional error function. It can thus avoid the local minima problem caused by such disharmony. Simulations on a benchmark problem and a real classification task have been performed to test the validity of the modified error function.
© 2003 Elsevier B.V. All rights reserved.

Keywords: Backpropagation; Learning; Local minima; Modified error function
1. Introduction

The backpropagation algorithm [12] has been widely recognized as an effective method for training feed-forward neural networks. But this gradient descent-based method often suffers from a sub-optimal solution, or local minima problem. Many approaches [1,11,14] have been proposed to address it. These methods try to modify the learning model and overcome the tendency to sink into local minima by adding a random factor to the model. However, random perturbations of the search direction and various kinds of stochastic adjustments to the current set of weights are not effective at enabling a network to escape from local minima within a reasonable number of iterations [15].

∗ Corresponding author. E-mail address: [email protected] (Z. Tang).
0925-2312/$ - see front matter © 2003 Elsevier B.V. All rights reserved. doi:10.1016/j.neucom.2003.12.006
We have noted that many local minima difficulties are closely related to neuron saturation in the hidden layer. Once such saturation occurs, neurons in the hidden layer lose their sensitivity to input signals, and the propagation of information is blocked severely [7]. In some cases, the network can no longer learn [5]. The same phenomenon was also observed and discussed by Andreas Hadjiprocopis [7], Christian Goerick [5], Simon Haykin [8] and Wessels et al. [16]. In this paper, we explain the neuron saturation in the hidden layer as update disharmony between weights connected to the hidden layer and the output layer. Then we propose a modified error function for the backpropagation algorithm in order to avoid the local minima problem that occurs due to this phenomenon. Finally, simulation results are presented to substantiate the validity of the modified error function. Since a neural network with a single nonlinear hidden layer is capable of forming an arbitrarily close approximation of any continuous nonlinear mapping [3,4,9,10], our discussion will be limited to such networks.

2. The modified error function

The backpropagation algorithm is a gradient descent procedure used to minimize an objective function (error function) E [17]. The most popularly used error function is the sum-of-squares, given by

$E = \frac{1}{2}\sum_{p=1}^{P}\sum_{j=1}^{J}(t_{pj} - o_{pj})^2,$    (1)
where $P$ is the number of training patterns, $t_{pj}$ is the target value (desired output) of the $j$th component of the outputs for pattern $p$, $o_{pj}$ is the output of the $j$th neuron of the actual output pattern produced by the presentation of input pattern $p$, and $J$ is the number of neurons in the output layer. To minimize the error function $E$, the backpropagation algorithm uses the following delta rule:

$\Delta w_{ji} = -\eta \frac{\partial E}{\partial w_{ji}},$    (2)

where $w_{ji}$ is the weight connecting neurons $i$ and $j$, and $\eta$ is the learning rate. From the above equations, we can see that the error function contains only one term related to the neuron outputs in the output layer. Weights are updated iteratively to make the neurons in the output layer approximate their desired values. However, the conventional error function does not consider the neuron behavior of the hidden layer and what value should be produced in the hidden layer. Usually, the activation function of a neuron is given by the sigmoid function

$f(x) = \frac{1}{1 + e^{-x}}.$    (3)

There are two extreme areas, called the saturated portions of the sigmoidal curve. Once the activity level of a neuron approaches the two extreme areas (the outputs of neurons are near the constant 1 or 0), $f'(x)$ becomes very small. Since a weight change is determined by the error signal and the sigmoidal derivative,
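The saturation effect described here is easy to see numerically. A minimal sketch of the sigmoid of Eq. (3) and its derivative (the function names are ours, not from the paper):

```python
import numpy as np

def sigmoid(x):
    """Sigmoid activation of Eq. (3): f(x) = 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_deriv(x):
    """f'(x) = f(x) * (1 - f(x)), which vanishes as f(x) approaches 0 or 1."""
    s = sigmoid(x)
    return s * (1.0 - s)

# At the center of the curve the derivative is at its maximum of 0.25;
# in the saturated regions it is nearly zero, so weight changes stall.
print(sigmoid_deriv(0.0))    # 0.25
print(sigmoid_deriv(10.0))   # roughly 4.5e-05: almost no gradient
```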
the weight change will become extremely small. If weights connected to the hidden layer and the output layer are updated so inharmoniously that all the hidden neurons' outputs are driven rapidly into the extreme areas before the output neurons start to approximate the desired signals, no weights connected to the hidden layer will be modified even though the actual outputs in the output layer are far from the desired outputs. Each hidden neuron then feeds a constant to the output layer: the error signal cannot be propagated backwards, and the input signal cannot be passed forwards. The hidden layer becomes meaningless, and the local minima problem occurs.

To overcome such a problem, the neuron outputs in the output layer and those in the hidden layer should be considered together during the iterative update procedure. Motivated by this, we add to the conventional error function one term embodying the neuron outputs in the hidden layer. In this way, weights connected to the hidden layer and the output layer can be modified harmoniously. The modified error function is given by

$E_{new} = \frac{1}{2}\sum_{p=1}^{P}\sum_{j=1}^{J}(t_{pj} - o_{pj})^2 + \frac{1}{2}\sum_{p=1}^{P}\sum_{j=1}^{J}(t_{pj} - o_{pj})^2 \sum_{j=1}^{H}(y_{pj} - 0.5)^2 = E_A + E_B.$    (4)
We can see that the new error function consists of two terms:

$E_A = \frac{1}{2}\sum_{p=1}^{P}\sum_{j=1}^{J}(t_{pj} - o_{pj})^2$

is the conventional error function, and

$E_B = \frac{1}{2}\sum_{p=1}^{P}\sum_{j=1}^{J}(t_{pj} - o_{pj})^2 \sum_{j=1}^{H}(y_{pj} - 0.5)^2$
is the added term; $y_{pj}$ is the output of the $j$th neuron in the hidden layer, 0.5 is the average value of the activation function (3), and $H$ is the number of neurons in the hidden layer. In this equation, $\sum_{j=1}^{H}(y_{pj} - 0.5)^2$ can be defined as the degree of saturation in the hidden layer for pattern $p$. The added term keeps the degree of saturation of the hidden layer small while $E_A$ is large (the output layer does not yet approximate the desired signals). As the output layer approximates the desired signals, the effect of the term $E_B$ diminishes and eventually becomes zero. Using the above error function as the objective function, the update rule of weight $w_{ji}$ can be rewritten as

$\Delta w_{ji} = -\eta_A \frac{\partial E_A}{\partial w_{ji}} - \eta_B \frac{\partial E_B}{\partial w_{ji}},$    (5)

where $\eta_A$ and $\eta_B$ are the learning rates for $E_A$ and $E_B$, respectively.
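As a sketch of how Eq. (4) could be evaluated over a batch of patterns (the function name and the array layout are our assumptions, not from the paper):

```python
import numpy as np

def modified_error(t, o, y):
    """E_new = E_A + E_B of Eq. (4).

    t, o: (P, J) arrays of target and actual outputs of the output layer.
    y:    (P, H) array of hidden-layer outputs.
    """
    sq_err = (t - o) ** 2                      # (t_pj - o_pj)^2, shape (P, J)
    E_A = 0.5 * sq_err.sum()                   # conventional term, Eq. (1)
    saturation = ((y - 0.5) ** 2).sum(axis=1)  # degree of saturation per pattern
    E_B = 0.5 * (sq_err.sum(axis=1) * saturation).sum()
    return E_A + E_B

# When the hidden outputs sit at the sigmoid's average value 0.5, the
# saturation term vanishes and E_new reduces to the conventional E_A.
t = np.array([[1.0], [0.0]])
o = np.array([[0.8], [0.3]])
y_mid = np.full((2, 2), 0.5)
print(modified_error(t, o, y_mid))   # E_B is zero here, so this equals E_A
```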
For pattern $p$, the derivative $\partial E_A^p/\partial w_{ji}$ can be computed in the same way as for the conventional error function (Eq. (1)). $\partial E_B^p/\partial w_{ji}$ is easily obtained as follows. For weights connected to the output layer,

$\frac{\partial E_B^p}{\partial w_{ji}} = \frac{\partial E_A^p}{\partial w_{ji}} \sum_{j=1}^{H}(y_{pj} - 0.5)^2.$    (6)

For weights connected to the hidden layer,

$\frac{\partial E_B^p}{\partial w_{ji}} = \frac{\partial E_A^p}{\partial w_{ji}} \sum_{j=1}^{H}(y_{pj} - 0.5)^2 + \left[\sum_{j=1}^{J}(t_{pj} - o_{pj})^2\right](y_{pj} - 0.5)\frac{\partial y_{pj}}{\partial w_{ji}}.$    (7)
Because the network used here has a single hidden layer, the input variables connect directly to the hidden-layer neurons through weights, and $\partial y_{pj}/\partial w_{ji}$ is given by

$\frac{\partial y_{pj}}{\partial w_{ji}} = f'(\cdot)\,x_{pi},$    (8)
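These derivatives can be checked numerically. The sketch below builds a tiny single-hidden-layer network (the sizes, variable names and random seed are arbitrary choices of ours), computes the gradients of $E_B$ from Eqs. (6)-(8), and verifies them against central finite differences:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    """Sigmoid activation, Eq. (3)."""
    return 1.0 / (1.0 + np.exp(-x))

# A tiny 2-2-1 network and one training pattern (sizes are arbitrary).
x = np.array([1.0, 0.5])        # input pattern x_pi
t = np.array([1.0])             # target t_pj
V = rng.normal(size=(2, 2))     # weights into the hidden layer
W = rng.normal(size=(1, 2))     # weights into the output layer

def forward():
    y = f(V @ x)                # hidden outputs y_pj
    o = f(W @ y)                # actual outputs o_pj
    return y, o

def E_B():
    y, o = forward()
    return 0.5 * np.sum((t - o) ** 2) * np.sum((y - 0.5) ** 2)

# Analytic gradients of E_B from Eqs. (6)-(8).
y, o = forward()
sq = np.sum((t - o) ** 2)       # sum over j of (t_pj - o_pj)^2
sat = np.sum((y - 0.5) ** 2)    # degree of saturation of the hidden layer
delta_o = (t - o) * o * (1 - o)
dEA_dW = -np.outer(delta_o, y)  # conventional gradient, output weights
gW = sat * dEA_dW               # Eq. (6)
delta_h = (W.T @ delta_o) * y * (1 - y)
dEA_dV = -np.outer(delta_h, x)  # conventional gradient, hidden weights
gV = sat * dEA_dV + np.outer(sq * (y - 0.5) * y * (1 - y), x)  # Eqs. (7), (8)

# Central finite-difference check of both gradients.
eps = 1e-6
for grad, M in ((gW, W), (gV, V)):
    for idx in np.ndindex(M.shape):
        M[idx] += eps
        e_plus = E_B()
        M[idx] -= 2 * eps
        e_minus = E_B()
        M[idx] += eps
        numeric = (e_plus - e_minus) / (2 * eps)
        assert abs(grad[idx] - numeric) < 1e-6
print("Eqs. (6)-(8) match finite differences")
```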
where $x_{pi}$ is the $i$th input for pattern $p$.

3. Simulations

In order to verify the effectiveness of the modified error function, we applied it to the backpropagation algorithm (denoted by 'BP+Two-terms'). A modified XOR problem and a real classification task were then used for simulations. For comparison, we also ran the backpropagation algorithm (denoted by 'BP') and a global search technique, the simulated annealing method [11] (denoted by 'SA'), with the conventional error function. In our simulations, the update rule of the backpropagation was its modified form, the backpropagation with momentum algorithm [12]. The momentum parameter was set to 0.9. The simulated annealing procedure was based on the work by Owen et al. [11], with the same parameters as in that work. For all methods, weights and thresholds were initialized randomly from (−1.0, 1.0).

Two aspects of training performance, 'success rate' and 'training speed', were assessed for each algorithm. A training run was deemed successful if the network correctly classified all patterns in the training set within a tolerance of 0.1 for each target element. The average CPU time of successful training runs was then used to evaluate each algorithm. For all trials, the upper limit on epochs was set to 50,000.

3.1. The modified XOR problem

The modified XOR problem differs from the classical XOR problem in that one more pattern is included (inputs = (0.5, 0.5), teacher signal = 1.0), such that a unique global minimum exists. Furthermore, several local minima exist simultaneously
Fig. 1. Comparison of learning processes between BP+Two-terms and BP: (a) conventional error function ($E_A$) vs. epochs and (b) degree of saturation in the hidden layer over all patterns vs. epochs.
in this problem [6]. We used the same 2-2-1 neural network (two inputs, two hidden neurons and one output neuron) as in [6] to solve this problem. To show how the BP+Two-terms can avoid the local minima, we compared a typical learning process of BP+Two-terms with that of BP for 1,000 epochs. The run was performed with $\eta_A = 1.0$, $\eta_B = 1.0$ for the BP+Two-terms and $\eta = 1.0$ for the BP, starting from the same weights selected randomly from (−1.0, 1.0). Fig. 1(a) depicts the conventional error function ($E_A$) vs. epochs for both methods. We can see that the training process of the backpropagation algorithm stopped because of the local minimum problem, while the BP+Two-terms method avoided the local minimum and trained the network successfully within 125 epochs. In Fig. 1(b), the degree of saturation in the hidden layer over all patterns is also plotted as a function of epochs for both methods. While the BP encountered the local minimum and $E_A$
Table 1
Experimental results for the modified XOR problem

Methods                                         Success rate (%)    Average CPU time (ms)
BP ($\eta = 0.1$)                               71                  41.8
BP ($\eta = 1.0$)                               58                  5.3
SA                                              86                  5717.1
BP+Two-terms ($\eta_A = 1.0$, $\eta_B = 1.0$)   100                 19.8
remained unchanged, the degree of saturation in the hidden layer increased continually until it reached 2.0. For the BP+Two-terms, the degree of saturation in the hidden layer was neutralized effectively by the modified error function.

Table 1 shows the experimental results of the three methods based on 100 runs of this problem. For the BP, different learning rates, $\eta = 1.0$ and 0.1, were used. The table shows that the backpropagation algorithm could obtain successful solutions for almost every run using the modified error function as the objective function, while many failures in convergence to the global solution occurred for both the backpropagation algorithm and the simulated annealing method using the conventional error function. Although the average CPU time of BP+Two-terms was a bit more than that of BP ($\eta = 1.0$), it was much less than those of the SA and BP ($\eta = 0.1$). These results indicate that the proposed method can efficiently avoid the local minima for this example.

3.2. Ionosphere data

Furthermore, we applied our method to a realistic 'real-world' problem: classification of radar returns from the ionosphere. The data set was created by Johns Hopkins University and obtained from the UCI machine learning databases [2]. The usage of this database is described by Sigillito et al. [13]. The radar data was collected by a system in Goose Bay, Labrador. Each pattern is represented by 34 attributes and belongs to one of two classes, 'good' or 'bad', describing free electrons in the ionosphere. Good radar returns, represented by '0' in the experiment, are those showing evidence of some type of structure in the ionosphere; bad returns, represented by '1', are those that do not. The training data we used were the first 200 instances, which were carefully split almost 50% good and 50% bad. We employed a 34-3-1 neural network model to solve this problem, and again performed 100 trials each for the backpropagation algorithm, the simulated annealing method and the proposed method.
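The evaluation protocol used in both experiments (a run succeeds when every target element is matched within the 0.1 tolerance; CPU time is averaged over successful runs only) can be sketched as follows. All names here are ours, and `train_once` is a stand-in for any of the three training methods:

```python
import time

def is_success(outputs, targets, tol=0.1):
    """A training run succeeds if every target element is matched within tol."""
    return all(abs(o - t) <= tol for o, t in zip(outputs, targets))

def evaluate(train_once, n_trials=100):
    """Return success rate (%) and average CPU time (ms) over successful runs."""
    n_success, times_ms = 0, []
    for _ in range(n_trials):
        start = time.process_time()
        outputs, targets = train_once()
        elapsed_ms = 1000.0 * (time.process_time() - start)
        if is_success(outputs, targets):
            n_success += 1
            times_ms.append(elapsed_ms)
    rate = 100.0 * n_success / n_trials
    avg_ms = sum(times_ms) / len(times_ms) if times_ms else float("nan")
    return rate, avg_ms

# Toy stand-in: a "training run" whose outputs always hit the targets.
rate, avg_ms = evaluate(lambda: ([0.05, 0.95], [0.0, 1.0]), n_trials=10)
print(rate)   # 100.0
```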
The simulation results are shown in Table 2. The results demonstrate that, through the modified error function, the backpropagation algorithm was greatly enhanced in network training. Additionally, the average CPU time of BP+Two-terms was much less than that of the BP with both $\eta = 1.0$ and 0.1, which implies that our method might also accelerate learning by eliminating standstill states during learning. The results of the simulated annealing method were obtained by performing a 24 h computer simulation.
Table 2
Experimental results for ionosphere data

Methods                                         Success rate (%)    Average CPU time (ms)
BP ($\eta = 0.1$)                               46                  123860.2
BP ($\eta = 1.0$)                               40                  10182.5
SA                                              0                   —
BP+Two-terms ($\eta_A = 1.0$, $\eta_B = 1.0$)   80                  5113.6
4. Conclusions

In this paper, we proposed a modified error function with two terms for the backpropagation algorithm. This modified error function harmonizes the update of weights connected to the hidden layer and those connected to the output layer, so that local minima problems due to such disharmony can be avoided. Moreover, the modified error function does not require much additional computation and does not change the network topology. Simulations performed on a benchmark problem and a real classification task both demonstrated that the performance of the backpropagation algorithm was greatly improved by using the modified error function. In addition, the result may not be limited to the activation function of (3): for other activation functions, the average value of $f(x)$ can be used in $E_B$ in place of 0.5. More analysis on large problems and some detailed discussion of parameter settings are still required.

References

[1] S. Bengio, Y. Bengio, J. Cloutier, Use of genetic programming for the search of a new learning rule for neural networks, Proceedings of the First IEEE World Congress on Computational Intelligence and Evolutionary Computation, Orlando, FL, 27-29 June 1994, pp. 324-327.
[2] C.L. Blake, C.J. Merz, UCI Repository of machine learning databases, http://www.ics.uci.edu/∼mlearn/MLRepository.html, University of California, Department of Information and Computer Science, Irvine, CA, 1998.
[3] G. Cybenko, Approximation by superposition of a sigmoid function, Math. Control Signals Syst. 2 (1989) 303-314.
[4] K. Funahashi, On the approximate realization of continuous mapping by neural networks, Neural Networks 2 (3) (1989) 183-192.
[5] C. Goerick, W.V. Seelen, On unlearnable problems or a model for premature saturation in backpropagation learning, in: Proceedings of the European Symposium on Artificial Neural Networks '96, Brugge, Belgium, 24-26 April 1996, pp. 13-18.
[6] M. Gori, A. Tesi, On the problem of local minima in backpropagation, IEEE Trans. Pattern Anal. Mach. Intell. 14 (1) (1992) 76-86.
[7] A. Hadjiprocopis, Feed forward neural network entities, Ph.D. Thesis, Department of Computer Science, City University, London, UK, 2000.
[8] S. Haykin, Neural Networks, a Comprehensive Foundation, Macmillan, New York, 1994.
[9] K. Hornik, M. Stinchcombe, H. White, Multilayer feedforward networks are universal approximators, Neural Networks 2 (5) (1989) 359-366.
[10] B. Irie, S. Miyake, Capabilities of three-layered perceptrons, Proceedings of the IEEE Conference on Neural Networks, Vol. I, San Diego, CA, 24-27 July 1988, pp. 641-648.
[11] C.B. Owen, A.M. Abunawass, Application of simulated annealing to the backpropagation model improves convergence, Proceedings of the SPIE Conference on the Science of Artificial Neural Networks, Vol. II, 1966, Orlando, FL, 13-16 April 1993, pp. 269-276.
[12] D.E. Rumelhart, G.E. Hinton, R.J. Williams, Learning internal representations by back-propagating errors, in: D.E. Rumelhart, J.L. McClelland, the PDP Research Group (Eds.), Parallel Distributed Processing, Vol. 1, MIT Press, Cambridge, MA, 1986, pp. 318-362.
[13] V.G. Sigillito, S.P. Wing, L.V. Hutton, K.B. Baker, Classification of radar returns from the ionosphere using neural networks, Johns Hopkins APL Technical Digest, Vol. 10, Johns Hopkins University, Baltimore, MD, 1989, pp. 262-266.
[14] A. Von Lehmen, E.G. Paek, P.F. Liao, A. Marrakchi, J.S. Patel, Factors influencing learning by backpropagation, Proceedings of the IEEE International Conference on Neural Networks, Vol. I, San Diego, CA, 1988, pp. 335-341.
[15] C. Wang, J.C. Principe, Training neural networks with additive noise in the desired signal, IEEE Trans. Neural Networks 10 (6) (1999) 1511-1517.
[16] L.F.A. Wessels, E. Barnard, E. van Rooyen, The physical correlates of local minima, in: Proceedings of the International Neural Network Conference, Paris, July 1990, p. 985.
[17] Y.H. Zweiri, J.F. Whidborne, L.D. Seneviratne, A three-term backpropagation algorithm, Neurocomputing 50 (2003) 305-318.