Chaos, Solitons and Fractals 40 (2009) 69–77 www.elsevier.com/locate/chaos
The performance of the backpropagation algorithm with varying slope of the activation function

Yanping Bai a,b,*, Haixia Zhang a, Yilong Hao a

a National Key Laboratory of Micro/Nano Fabrication Technology, Institute of Microelectronics, Peking University, 100871, China
b Department of Applied Mathematics, North University of China, No. 3 Xueyuan Road, TaiYuan, ShanXi 030051, China

Accepted 10 July 2007
Abstract

Some adaptations to the basic BP algorithm are proposed in order to provide an efficient method for non-linear data learning and prediction. In this paper, an adapted BP algorithm with a varying slope of the activation function and different learning rates is put forward. The experimental results indicate that this algorithm achieves very good training performance. We also test the prediction performance of our adapted BP algorithm on 16 instances and compare the results with those of the BP algorithm with gradient descent momentum and an adaptive learning rate. The adapted algorithm gives the better test performance on all 16 instances (100%), which suggests that it produces a smoothed reconstruction that generalizes better to new function values than the BP algorithm improved with momentum.
© 2009 Published by Elsevier Ltd.
* Corresponding author. Address: National Key Laboratory of Micro/Nano Fabrication Technology, Institute of Microelectronics, Peking University, 100871, China. E-mail addresses: [email protected] (Y. Bai), [email protected] (Y. Hao).

0960-0779/$ - see front matter © 2009 Published by Elsevier Ltd. doi:10.1016/j.chaos.2007.07.033

1. Introduction

Feedforward neural networks (FNN) are widely used to solve complex problems in pattern classification, system modeling and identification, non-linear signal processing, and the analysis of non-linear multivariate data. One of the characteristics of the FNN is its learning (or training) ability. After training, the network gives correct answers not only for the learned examples but also for models similar to them, showing strong associative and reasoning abilities that suit large, nonlinear, and complex classification and function-approximation problems.

The classical method for training an FNN is the backpropagation (BP) algorithm [1], which is based on the gradient descent optimization technique. Despite the general success of this algorithm, it may converge to a local minimum of the mean squared-error objective function, and it requires a large number of learning iterations to adjust the weights of the FNN. Many attempts have been made to speed up the error BP algorithm. The most well-known algorithms of this type are the conjugate gradient training algorithm [2] and the Levenberg–Marquardt (LM) training algorithm [3]. The computational complexity of the conjugate gradient algorithm depends heavily on the line search method. The LM algorithm is faster than plain gradient descent and rarely gets stuck in a local minimum, but it requires too much
memory and computational time. Other methods have been presented; commonly known variants such as momentum [4], variable learning rate [5,10], or stochastic algorithms [6,7] lead only to a slight improvement. In [8], the authors present an efficient method for learning an FNN that combines unsupervised training for the hidden neurons and supervised training for the output neurons. More precisely, when an input is presented to the network, the updating rule drags the weight vectors of the winner and its topological neighbors according to unsupervised training, and the weights leaving the winner and its neighbors are adjusted by the gradient descent method.

In this paper, we discuss the performance of the BP algorithm from a different viewpoint. In common practice, the functions to be learned are specified not by algorithms but by a table or training set T consisting of n argument-value pairs. We are given a d-dimensional argument x and an associated target value t, and t is approximated by a network output y. The function to be constructed is fitted to T = {(x_i, t_i) : i = 1, ..., n}. In most applications the training set T is considered noisy, and our goal is not to reproduce it exactly but rather to construct a network function g(x; w) that produces a smoothed reconstruction generalizing (learning) well to new function values. We therefore improve the basic BP algorithm by varying the slope of the activation function and using different training rates, and we test the performance of the resulting algorithm on sixteen training sets whose data come from real-life problems. The test results indicate that our adapted BP algorithm produces a smoothed reconstruction that generalizes better to new function values than the BP algorithm improved with momentum and an adaptive learning rate.
2. Description of the BP algorithm

2.1. Basic BP algorithm

A BP network with one hidden layer can approximate an arbitrary nonlinear function defined on a compact set of R^n with arbitrary precision [4]. The BP algorithm is a supervised training algorithm whose procedure is divided into two parts: a forward propagation of information and a backward propagation (BP) of error. The network's training procedure is described below.

Let the node numbers of the input and hidden layers be N and M, respectively. In this paper, the node number of the output layer is fixed at 1. Let the input example vectors be ξ^l = (ξ_1^l, ξ_2^l, ..., ξ_N^l) (1 ≤ l ≤ P). Denote by w_{ij} (1 ≤ i ≤ N, 1 ≤ j ≤ M) the weight connecting the ith input node and the jth hidden node, and by W_j (1 ≤ j ≤ M) the weight connecting the jth hidden node and the output node. g(x) and f(x) are the activation functions of the hidden layer and the output layer, respectively. When training example ξ^l is input to the network, the input and output values of the jth hidden node are denoted x_j^l and y_j^l (1 ≤ j ≤ M, 1 ≤ l ≤ P), respectively, while the input and output values of the output unit are denoted H^l and O^l (1 ≤ l ≤ P), respectively. In symbols we have

x_j^l = Σ_{i=1}^{N} w_{ij} ξ_i^l,   (2.1)

y_j^l = g(x_j^l),   (2.2)

H^l = Σ_{j=1}^{M} W_j y_j^l,   (2.3)

O^l = f(H^l).   (2.4)
Let the desired output corresponding to the input example ξ^l be ζ^l. (According to the type of the output layer's activation function, ζ^l is chosen in (0, 1) in this paper.) Then the squared error for this step of training is

E^l = (1/2)(ζ^l − O^l)².   (2.5)

The overall squared error after all examples are used is

E = (1/2) Σ_{l=1}^{P} (ζ^l − O^l)².   (2.6)
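The forward pass (2.1)–(2.4) and the per-pattern error (2.5) can be sketched in a few lines of NumPy. The paper's own experiments use MATLAB; this Python sketch and its function names are ours, for illustration only:

```python
import numpy as np

def logistic(u, a=1.0):
    """Logistic activation with slope parameter a, Eq. (3.4)."""
    return 1.0 / (1.0 + np.exp(-a * u))

def forward(xi, w, W, a=1.0):
    """Forward pass of Eqs. (2.1)-(2.4) for one input pattern xi.

    xi : (N,) input vector xi^l
    w  : (N, M) input-to-hidden weights w_ij
    W  : (M,) hidden-to-output weights W_j
    Returns the hidden outputs y_j^l and the scalar network output O^l.
    """
    x = xi @ w           # (2.1): hidden-node inputs x_j^l
    y = logistic(x)      # (2.2): hidden outputs, slope 1 for g
    H = y @ W            # (2.3): output-node input H^l
    O = logistic(H, a)   # (2.4): network output, slope a for f
    return y, O

def squared_error(zeta, O):
    """Per-pattern squared error, Eq. (2.5)."""
    return 0.5 * (zeta - O) ** 2
```

With all weights zero the hidden outputs are 0.5 and the network output is 0.5 regardless of the slope, which provides a simple sanity check.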
Let W denote the vector containing all the weights. The purpose of the BP algorithm is to choose W so as to minimize the error function by, say, the gradient descent method. The general iteration formula is

W(t+1) = W(t) + ΔW(t),   (2.7a)

where

ΔW = −η ∂E/∂W |_{W=W(t)}   (2.7b)

is the weight increment at time t, and the positive constant η is the training rate, 0 < η < 1. A popular variation of the standard gradient method (2.7) is the so-called online gradient method (OGM for short), in which the weight values are modified as soon as a training example is input to the network. We then have

ΔW_j = −η_1 ∂E/∂W_j = η_1 (ζ^l − O^l) f′(H^l) y_j^l.   (2.8)

By the chain rule and (2.1)–(2.5), we have

Δw_{ij} = −η_2 ∂E/∂w_{ij} = η_2 (ζ^l − O^l) f′(H^l) W_j g′(x_j^l) ξ_i^l,   (2.9)
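A single OGM step applying (2.8) and (2.9), with the two learning rates kept separate, might look as follows. This is a NumPy sketch of the update rules, not the authors' code; note that both increments are computed from the current weights before either is applied:

```python
import numpy as np

def ogm_step(xi, zeta, w, W, eta1, eta2, a=1.0):
    """One online-gradient update of Eqs. (2.8)-(2.9) for one pattern.

    w : (N, M) input-to-hidden weights, W : (M,) hidden-to-output weights;
    both are updated in place.  Returns the pattern error (2.5).
    """
    # forward pass, Eqs. (2.1)-(2.4)
    x = xi @ w
    y = 1.0 / (1.0 + np.exp(-x))        # g has slope 1
    H = y @ W
    O = 1.0 / (1.0 + np.exp(-a * H))    # f has slope a

    err = zeta - O
    fprime = a * O * (1.0 - O)          # f'(H^l), from Eq. (3.7)
    gprime = y * (1.0 - y)              # g'(x_j^l)

    dW = eta1 * err * fprime * y                         # (2.8)
    dw = eta2 * err * fprime * np.outer(xi, W * gprime)  # (2.9)
    W += dW
    w += dw
    return 0.5 * err ** 2
```

Repeating the step on a single pattern drives the output toward the target, so the per-pattern error shrinks.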
where η_1 and η_2 are the training rates of the synaptic weights from the hidden layer to the output layer and from the input layer to the hidden layer, respectively. The training examples {ξ^l} are usually supplied to the network in stochastic order. Compared with the standard gradient method, OGM is more likely to escape from a local minimum of the error function, and it requires less memory. Therefore, OGM is widely used in neural network training [9].

2.2. BP algorithm with gradient descent momentum and an adaptive learning rate

The BP algorithm with momentum updating is one of the most popular modifications to the standard BP algorithm presented in Section 2.1. The idea is to update the weights in a direction that is a linear combination of the current gradient of the instantaneous error surface and the one obtained in the previous step of the training. In practical application, a momentum term is added to Formula (2.7a) to accelerate convergence, resulting in

ΔW = −η ∂E/∂W |_{W=W(t)} + α[W(t) − W(t−1)],   (2.10)

where the positive constant α is a momentum factor, 0 < α < 1. In the BP algorithm with momentum updating, a batch-updating approach accumulates the weight corrections over one entire epoch before actually performing the update. Batch updating with a variable learning rate is a simple heuristic strategy to increase the convergence speed of the BP algorithm with momentum updating. The idea is to increase the magnitude of the learning rate if the error function has decreased toward the goal; conversely, if the error function has increased, the learning rate is decreased. If the error function has increased by less than a previously defined percentage, the learning rate remains unchanged. Applying the variable learning rate to the BP algorithm with momentum updating can speed convergence when the error function is smooth and slowly decreasing. However, the algorithm can easily be trapped in a local minimum of the error surface. To avoid this, the learning rate is not allowed to fall below a certain value [11].
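The momentum update (2.10) together with this variable-learning-rate heuristic can be sketched as below. The increase/decrease factors and the tolerance are illustrative defaults of our own choosing; the paper prescribes only the qualitative rules and the lower bound on the rate [11]:

```python
def adaptive_momentum_step(grad, W, prev_dW, eta, alpha, err, prev_err,
                           lr_inc=1.05, lr_dec=0.7, max_inc=1.04,
                           eta_min=1e-6):
    """One batch update in the spirit of Eq. (2.10) with a variable rate.

    grad is dE/dW accumulated over the epoch; alpha is the momentum
    factor.  Returns the new weights, the step taken, and the new rate.
    """
    if err < prev_err:                      # error fell: speed up
        eta *= lr_inc
    elif err > prev_err * max_inc:          # error rose too much: slow down,
        eta = max(eta * lr_dec, eta_min)    # but never below eta_min [11]
    # otherwise the learning rate is left unchanged

    dW = -eta * grad + alpha * prev_dW      # Eq. (2.10)
    return W + dW, dW, eta
```

For example, with eta = 0.1, a falling error multiplies the rate by 1.05 while a sharply rising error multiplies it by 0.7.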
3. The training performance of the BP algorithm with varying slope of the activation function and different learning rates

The activation function can be a linear or non-linear function, and there are many different types. Here, we present three of the most common types of activation functions. The first type is the linear function, which is continuous valued:

y = f(u) = u.   (3.1)

The second type is a hard limiter. This is a binary (or bipolar) function that hard-limits its input to either 0 or 1 for the binary type, and to −1 or 1 for the bipolar type:

y = f(u) = 0 if u < 0, 1 if u ≥ 0,   (3.2)

and

y = f(u) = −1 if u < 0, 0 if u = 0, 1 if u > 0.   (3.3)

The third type is the logistic function (or the hyperbolic tangent function, which is a simple rescaling of the logistic), shown as formula (3.4) (or formula (3.5)):

y = f(u) = 1 / (1 + e^{−au}),   (3.4)

y = f(u) = tanh(au) = (e^{au} − e^{−au}) / (e^{au} + e^{−au}) = (1 − e^{−2au}) / (1 + e^{−2au}),   (3.5)

where a is the slope parameter of the sigmoid function. The logistic function is a bounded, monotonically increasing, analytic function. Its elementary properties include

lim_{u→−∞} f(u) = 0,  lim_{u→+∞} f(u) = 1,   (3.6)

f′(u) = a f(u)(1 − f(u)),  f″(u) = a² f(u)(1 − f(u))(1 − 2f(u)),   (3.7)

|f′(u)| ≤ a/4,   (3.8)

f(u) ≈ 1/2 + (a/4)u for small u.   (3.9)
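Properties (3.7)–(3.9) are easy to check numerically. The sketch below compares the closed-form derivative with a central finite difference, verifies the bound a/4 of (3.8), and measures the error of the linear approximation (3.9) near u = 0:

```python
import numpy as np

a = 2.0
f = lambda u: 1.0 / (1.0 + np.exp(-a * u))   # logistic, Eq. (3.4)

u = np.linspace(-0.5, 0.5, 101)
h = 1e-6

# (3.7): f'(u) = a f(u)(1 - f(u)), checked against a central difference
fprime = a * f(u) * (1 - f(u))
assert np.allclose(fprime, (f(u + h) - f(u - h)) / (2 * h), atol=1e-6)

# (3.8): the derivative is bounded by a/4, attained at u = 0
assert fprime.max() <= a / 4 + 1e-12

# (3.9): near u = 0 the logistic stays close to 1/2 + (a/4) u
assert np.max(np.abs(f(u) - (0.5 + (a / 4) * u))) < 0.02
```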
For small values of u, the logistic function is nearly linear. The logistic activation functions for three different values of the slope parameter are shown in Fig. 1, and the percentage errors incurred by the linear approximation to the logistic function (for the same three a values) are shown in Fig. 2. From Fig. 2 we can see that, for moderate values of u, the logistic function with a small a value is more nearly linear than the one with a large a value.

From the weight updates (2.7)–(2.9) we observe that the rate of update is proportional to the derivative of the nonlinear activation function. During training, the output of the linear combiner may fall into the saturation region of the activation function. In that region the derivative of the activation function is very small, so the rate of learning becomes extremely slow; it may take many iterations before the output of the linear combiner moves out of the saturation region. A straightforward way to prevent this saturation is to enlarge the non-saturated part of the activation function by decreasing its slope. However, decreasing the slope makes the network behave more like a linear network, which diminishes the advantage of having a multilayer network, because any number of layers with linear activation functions can be replaced by a single layer. Hence, there is an optimum value of the activation-function slope that balances the speed of network training against its mapping capability. Finding this balance is the aim of our adapted BP algorithm, which is summarized in the following.
Fig. 1. Logistic activation functions for three different values of the slope parameter.
Fig. 2. Percentage errors in approximating logistic function by linear function.
4. Our adapted BP algorithm

Step 1. Initialize the weights in the network (the same initial weights are used in all our tests).
Step 2. Select the activation functions of the hidden layer and the output layer as the logistic functions g(x) = 1/(1 + e^{−x}) and f(x) = 1/(1 + e^{−ax}), respectively (we use a two-layer network in this paper), where a is the slope parameter of the activation function f(x).
Step 3. Present an input pattern from the set of training input/output pairs in random order, and calculate the network's response.
Step 4. Compare the desired network response with the actual output, and compute the squared error according to Eq. (2.5).
Step 5. Update the weights of the network according to Formulas (2.8) and (2.9), where the training rates of the synaptic weights are different for the output layer and the hidden layer, namely η_1 and η_2, respectively.
Step 6. Stop if the network has converged or the maximum number of iterations is reached; otherwise go to Step 3.

We have three parameters: two training rates η_1 and η_2, and one slope parameter a of the output activation function. They are selected as follows, each trial performing 500 iterations. First, we fix η_1 and η_2 and record the sum squared-error for different values of a; we choose the a for which convergence is faster and the sum squared-error is small. Second, with the selected a and fixed η_1, we record the sum squared-error for different η_2 and confirm the η_2 with the minimum error. Last, with the selected a and η_2, we check different η_1 and finally determine the η_1 with the minimum sum squared-error. After selecting the three parameters, we run the algorithm for a maximum of 2000 iterations with a specified target error. In this paper, the network consists of one input node, four hidden units, and one output node.
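Steps 1–6 can be put together as a training loop. The sketch below is a NumPy illustration under our own assumptions about weight initialization and about the series being scaled into (0, 1); it is not the authors' MATLAB implementation:

```python
import numpy as np

def train_adapted_bp(t, targets, M=4, a=2.0, eta1=0.5, eta2=0.05,
                     max_epochs=2000, goal=0.005, seed=0):
    """Adapted BP, Steps 1-6, for a 1-input / M-hidden / 1-output net.

    t, targets : time indices and series values (targets in (0, 1));
    a, eta1, eta2 are the three tunable parameters of Section 4.
    Returns the weights and the last epoch's sum squared-error.
    """
    rng = np.random.default_rng(seed)
    w = 0.5 * rng.normal(size=(1, M))           # Step 1: initialize weights
    W = 0.5 * rng.normal(size=M)
    g = lambda u: 1.0 / (1.0 + np.exp(-u))      # Step 2: hidden activation
    f = lambda u: 1.0 / (1.0 + np.exp(-a * u))  # Step 2: output, slope a

    for _ in range(max_epochs):
        sse = 0.0
        for l in rng.permutation(len(t)):       # Step 3: random pattern order
            xi = np.array([t[l]])
            y = g(xi @ w)
            O = f(y @ W)
            err = targets[l] - O                # Step 4: error, Eq. (2.5)
            sse += 0.5 * err ** 2
            fp = a * O * (1 - O)
            dW = eta1 * err * fp * y                              # (2.8)
            dw = eta2 * err * fp * np.outer(xi, W * y * (1 - y))  # (2.9)
            W += dW                             # Step 5: apply updates
            w += dw
        if sse < goal:                          # Step 6: stopping test
            break
    return w, W, sse
```

Training for many epochs should leave the sum squared-error below where a single epoch left it.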
We test the performance of the BP algorithm with varying slope of the activation function and different learning rates on 16 instances of time series. The value of the input node is the time index of the series, time = 0, 1, 2, .... The target values of the output node are the data of the time series. Our adapted BP algorithm was employed to train the 16 instances. For each instance we obtained the sum squared-errors for three different slope parameters, all with the same training rates (the middle slope parameter gives the minimum sum squared-error). The experimental results are shown in Table 1, and partial simulation results are plotted in Fig. 3. From Table 1 and Fig. 3 we see that the training performance obtained with the same training rates but different slopes differs greatly for most instances, while it is similar for only a few instances. We conclude that the basic BP algorithm can achieve very good training performance by adjusting two different training rates and one slope parameter.
5. Prediction performance of our adapted BP algorithm and the BP algorithm with momentum

In this section, we test the prediction performance of our adapted BP algorithm with varying slope of the activation function and different training rates. We compare the test results with those of the BP algorithm with gradient
Table 1
The experimental results of 16 sets for the BP algorithm with varying slope of the activation function

Instance | Number of data | Training rates          | Slope parameter (a) | Sum squared-error
1        | 65             | η_1 = 0.5,  η_2 = 0.05  | 3 / 4 / 5           | 1.6006 / 0.0562 / 0.1470
2        | 65             | η_1 = 0.5,  η_2 = 0.1   | 4 / 5 / 6           | 0.1197 / 0.1084 / 0.1143
3        | 65             | η_1 = 0.5,  η_2 = 0.05  | 4 / 5 / 6           | 0.0030 / 0.0028 / 0.0051
4        | 65             | η_1 = 0.1,  η_2 = 0.5   | 0.5 / 1 / 1.5       | 0.0042 / 0.0041 / 0.0049
5        | 65             | η_1 = 0.5,  η_2 = 0.05  | 2 / 3 / 4           | 1.9002 / 0.5143 / 23.1957
6        | 65             | η_1 = 0.5,  η_2 = 0.01  | 2 / 3 / 4           | 1.7868 / 0.1431 / 19.9188
7        | 40             | η_1 = 0.1,  η_2 = 0.5   | 1 / 2 / 3           | 0.0148 / 0.0119 / 0.0125
8        | 40             | η_1 = 0.5,  η_2 = 0.05  | 2 / 3 / 4           | 0.4826 / 0.0474 / 0.0559
9        | 40             | η_1 = 0.17, η_2 = 0.05  | 1 / 2 / 3           | 1.3462 / 0.7558 / 1.1129
10       | 40             | η_1 = 0.17, η_2 = 0.002 | 0.5 / 1 / 2         | 0.0429 / 0.0050 / 0.6188
11       | 40             | η_1 = 0.5,  η_2 = 0.012 | 2 / 3 / 4           | 0.9160 / 0.0279 / 2.3028
12       | 40             | η_1 = 0.25, η_2 = 0.5   | 0.5 / 0.95 / 1.5    | 5.6555e-004 / 3.2507e-004 / 7.6633e-004
13       | 40             | η_1 = 0.25, η_2 = 0.025 | 1 / 2 / 3           | 1.6791 / 1.3651 / 1.4455
14       | 40             | η_1 = 0.17, η_2 = 0.056 | 1 / 2 / 3           | 3.3927 / 0.9299 / 2.0429
15       | 40             | η_1 = 0.5,  η_2 = 0.033 | 2 / 3 / 4           | 1.7911 / 0.0205 / 4.1184
16       | 40             | η_1 = 1.5,  η_2 = 0.5   | 4 / 5 / 6           | 0.0089 / 0.0086 / 0.0089
descent momentum and an adaptive learning rate. In our adapted BP algorithm, the two training rates η_1 and η_2 were taken from Table 1, and for each instance the slope parameter a was taken as the middle value, which gives the minimum sum squared-error in Table 1. The maximum number of iterations and the target value of the sum squared-error were set as follows: net.trainParam.epochs = 2000; net.trainParam.goal = 0.005.
Fig. 3. Partial simulation results of the instances by the BP algorithm with varying slope of activation function.
TRAINGDX is a MATLAB network training function that updates weight and bias values according to gradient descent momentum and an adaptive learning rate. The function TRAINGDX was employed to train the sixteen instances. For the BP algorithm with momentum we select a network with five input nodes, four hidden units, and one output unit, in which the current value is predicted from the previous five values; the five input values are therefore the current value delayed by 1–5 time steps, and the target value of the output unit is the current value. The activation functions of the hidden layer and the output layer are the hyperbolic tangent sigmoid transfer function TANSIG and the linear transfer function PURELIN. The parameters of the function TRAINGDX are selected as follows:

net.performFcn = 'sse';
net.trainParam.epochs = 2000;
net.trainParam.lr = 0.01;
net.trainParam.lr_inc = 1.05;
net.trainParam.lr_dec = 0.7;
net.trainParam.goal = 0.005;
Table 2
Simulation results of prediction for our adapted BP algorithm and the BP algorithm with momentum updating and an adaptive learning rate on 16 instances

Instance | Number of data | Adapted BP (SSE, training / test) | Momentum BP (SSE, training / test)
1        | 65             | 0.0607 / 1.0945e-004              | 0.0179 / 0.0104
2        | 65             | 0.1061 / 0.0023                   | 0.0026 / 0.0450
3        | 65             | 0.0038 / 5.9083e-005              | 0.0044 / 3.7632e-004
4        | 65             | 0.0043 / 2.3597e-004              | 5.2604e-004 / 8.4070e-004
5        | 65             | 0.3592 / 0.3344                   | 0.0590 / 2.5838
6        | 65             | 0.7756 / 0.0437                   | 0.0169 / 0.9504
7        | 40             | 0.0103 / 0.0039                   | 0.0094 / 0.0554
8        | 40             | 0.0541 / 0.0015                   | 0.0264 / 0.0097
9        | 40             | 0.8784 / 0.0081                   | 0.3284 / 2.2324
10       | 40             | 0.0054 / 4.8038e-004              | 0.0074 / 0.5368
11       | 40             | 0.0263 / 0.0046                   | 0.0064 / 0.2137
12       | 40             | 4.7998e-004 / 1.0114e-004         | 4.2961e-004 / 2.5851e-004
13       | 40             | 1.4882 / 0.0557                   | 0.7205 / 0.5695
14       | 40             | 0.3011 / 0.0959                   | 0.0983 / 0.7239
15       | 40             | 0.0157 / 0.0048                   | 0.0039 / 0.5754
16       | 40             | 0.0107 / 1.9815e-004              | 0.0090 / 0.0019
net.trainParam.mc = 0.9;
net.trainParam.min_grad = 1e-10.

We used the first 52 data points as training examples and the last 13 as test examples for instances 1–6, and the first 30 data points as training examples and the last 10 as test examples for instances 7–16. The simulation results are shown in Table 2. Furthermore, Table 2 indicates that the mean sum squared-errors of the test examples for our adapted BP algorithm and for the BP algorithm with momentum are 0.0348 and 0.5319, respectively. Our adapted BP algorithm gives the better test performance on all 16 instances (100%).
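The two reported means follow directly from the test-example columns of Table 2 and can be checked with a few lines:

```python
import numpy as np

# Test-example sum squared-errors from Table 2
adapted = np.array([1.0945e-4, 0.0023, 5.9083e-5, 2.3597e-4, 0.3344,
                    0.0437, 0.0039, 0.0015, 0.0081, 4.8038e-4, 0.0046,
                    1.0114e-4, 0.0557, 0.0959, 0.0048, 1.9815e-4])
momentum = np.array([0.0104, 0.0450, 3.7632e-4, 8.4070e-4, 2.5838,
                     0.9504, 0.0554, 0.0097, 2.2324, 0.5368, 0.2137,
                     2.5851e-4, 0.5695, 0.7239, 0.5754, 0.0019])

print(round(float(adapted.mean()), 4))    # 0.0348, as quoted in the text
print(round(float(momentum.mean()), 4))   # 0.5319
print(bool((adapted < momentum).all()))   # True: better on all 16 instances
```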
6. Conclusions

Some adaptations to the basic BP algorithm were proposed in order to provide an efficient method for non-linear data learning and prediction. Our adapted BP algorithm with varying slope of the activation function and different learning rates was employed to train 16 instances. The experimental results indicated that the training performance obtained with the same training rates but different slope parameters differs greatly for most instances, while it is similar for only a few instances. We conclude that the basic BP algorithm can achieve very good training performance by adjusting two different training rates and one slope parameter.

We also tested the prediction performance of our adapted BP algorithm on the sixteen instances and compared the results with those of the BP algorithm with gradient descent momentum and an adaptive learning rate. The results indicated that the mean sum squared-errors of the test examples for our adapted BP algorithm and for the BP algorithm with momentum are 0.0348 and 0.5319, respectively, and our adapted algorithm gives the better test performance on all 16 instances (100%). The contribution is that the basic BP algorithm can achieve very good training and test performance by adjusting two different learning rates and one slope parameter of the output activation function. The tests on the sixteen sets indicate that our adapted BP algorithm produces a smoothed reconstruction that generalizes better to new function values than the BP algorithm improved with momentum and an adaptive learning rate.
Acknowledgement

This research was supported by the National Postdoctoral Science Foundation of China (No. 20060400032).
References

[1] Rumelhart DE, Hinton GE, Williams RJ. Learning internal representations by error propagation. In: Parallel distributed processing: explorations in the microstructures of cognition, vol. 1. Cambridge (MA): MIT Press; 1986. p. 318–62.
[2] Møller MF. A scaled conjugate gradient algorithm for fast supervised learning. Neural Networks 1993;6:525–33.
[3] Hagan MT, Menhaj MB. Training feedforward networks with the Marquardt algorithm. IEEE Trans Neural Networks 1994;5:989–93.
[4] Rojas R. Neural networks: a systematic introduction. Berlin: Springer-Verlag; 1996.
[5] Jacobs RA. Increased rates of convergence through learning rate adaptation. Neural Networks 1988;1:295–307.
[6] Najim K, Chtourou M, Thibault J. Neural network synthesis using learning automata. J Systems Eng 1992;2(4):192–7.
[7] Poznyak AS, Najim K, Chtourou M. Use of recursive stochastic algorithm for neural network synthesis. Appl Math Modelling 1993;17:444–8.
[8] Ben Nasr M, Chtourou M. A hybrid training algorithm for feedforward neural networks. Neural Process Lett, published online 29 September 2006. doi:10.1007/s11063-006-9013-x.
[9] Haykin S. Neural networks: a comprehensive foundation. Beijing: Tsinghua University Press and Prentice-Hall; 2001.
[10] Bai Y, Jin Z. Prediction of SARS epidemic by BP neural networks with online prediction strategy. Chaos, Solitons & Fractals 2005;26(2):559–69.
[11] Vogl TP, Mangis JK, Zigler AK, Zink WT, Alkon DL. Accelerating the convergence of the backpropagation method. Biol Cybernet 1988;59:257–63.