RESEARCH NOTES Chinese Journal of Chemical Engineering, 20(6) 1219—1224 (2012)
Fast Learning in Spiking Neural Networks by Learning Rate Adaptation*
FANG Huijuan (方慧娟)**, LUO Jiliang (罗继亮) and WANG Fei (王飞)
College of Information Science and Engineering, Huaqiao University, Xiamen 361021, China

Abstract  To accelerate supervised learning by the SpikeProp algorithm with the temporal coding paradigm in spiking neural networks (SNNs), three learning rate adaptation methods (heuristic rule, delta-delta rule, and delta-bar-delta rule), which are used to speed up training in artificial neural networks, are applied to develop training algorithms for feedforward SNNs. The performance of these algorithms is investigated in four experiments: the classical XOR (exclusive OR) problem, the Iris dataset, fault diagnosis in the Tennessee Eastman process, and Poisson trains of discrete spikes. The results demonstrate that all three learning rate adaptation methods speed up the convergence of the SNN compared with the original SpikeProp algorithm. Furthermore, if the adaptive learning rate is used in combination with a momentum term, the two modifications balance each other in a beneficial way to achieve rapid and steady convergence. Among the three learning rate adaptation methods, the delta-bar-delta rule performs best: the delta-bar-delta method with momentum has the fastest convergence, the most stable training process, and the highest learning accuracy. The proposed algorithms are simple and efficient, and consequently valuable for practical applications of SNNs.

Keywords  spiking neural networks, learning algorithm, learning rate adaptation, Tennessee Eastman process

Received 2012-06-10, accepted 2012-07-31.
* Supported by the National Natural Science Foundation of China (60904018, 61203040), the Natural Science Foundation of Fujian Province of China (2009J05147, 2011J01352), the Foundation for Distinguished Young Scholars of Higher Education of Fujian Province of China (JA10004), and the Science Research Foundation of Huaqiao University (09BS617).
** To whom correspondence should be addressed. E-mail: [email protected]
1 INTRODUCTION
In the past two decades, artificial neural networks (ANNs) have been widely used to model or control many industrial processes [1, 2]. The third generation of neural networks, spiking neural networks (SNNs), has received increasing attention in recent years [3]. This kind of network differs from traditional ANNs in that spiking neurons propagate information through the timing of individual spikes rather than through the rate of spikes. Some studies have shown that temporal coding is more biologically plausible than rate coding [4], and SNNs can be used to analyze spike trains directly without losing temporal information [5]. Since their behavior is closer to that of biological neurons than that of sigmoidal units, SNNs with temporal coding have, in theory, higher computational power than ANNs with sigmoidal activation functions [6]. Furthermore, networks that communicate through discrete spikes instead of analog values are more suitable for hardware implementation [7]. The SNN model is therefore a very promising alternative to the sigmoidal network model.

The learning method is one of the most important problems for practical applications of temporally encoded spiking neural networks. A supervised learning rule called SpikeProp [8] was derived from error backpropagation (BP) by assuming that the internal state of a neuron increases linearly in a sufficiently small region around the instant of firing. It has since been proven that SpikeProp is mathematically correct even without this linearity assumption [9]. Subsequently, some researchers tried to improve
the original SpikeProp, for example by providing additional learning rules [10], adding a momentum term [11], and developing QuickProp and resilient propagation (RProp) algorithms [12]. The performance of gradient-descent-based learning algorithms is very sensitive to the proper setting of the learning rate [11-13], and modifications of the BP algorithm that adjust the learning rate during training have proven practically valuable in ANNs. However, in the aforementioned extensions of the SpikeProp algorithm, the learning rate is either constant or only simplistically adaptive. A dynamic self-adaptation (DS) method for the learning rate has been proposed [14], but the algorithm is computationally expensive and has a lower success rate of convergence than SpikeProp.

In this paper, three methods (heuristic rule, delta-delta rule, and delta-bar-delta rule) are applied to adapt the learning rate during weight training in the original SpikeProp algorithm. A momentum term is also used to help move out of local minima on the error surface by taking previous movements on this surface into account. We perform four experiments, on the classical exclusive OR (XOR) problem, the Iris dataset classification problem, fault diagnosis in the Tennessee Eastman (TE) process, and decoding information from Poisson spike trains, to demonstrate that the proposed algorithms are simple, efficient, and easy to use in combination with other SNN learning algorithms.
2 LEARNING RATE ADAPTATION METHODS
The general feedforward SNN architecture with multiple delayed synaptic terminals is shown in Fig. 1. Input nodes form a set H, hidden neurons a set I, and output neurons a set J. The behavior of each spiking neuron is modeled by a simple version of the spike response model (SRM) [15]. The state of a neuron in the SRM is described by its membrane potential

    x_j(t) = \sum_{i \in \Gamma_j} \sum_k w_{ij}^k \, \varepsilon(t - t_i - d^k)                (1)

Here neuron j in layer J, having a set Γ_j of immediate predecessors ("pre-synaptic neurons"), receives a set of spikes with firing times t_i, i ∈ Γ_j. w_{ij}^k is the weight of synapse k of the connection from neuron i to j, and d^k is the delay of synapse k. ε is the standard post-synaptic potential (PSP), described by ε(t) = (t/τ) e^{1−t/τ} for t > 0 (and 0 otherwise), where τ is the decay time constant. When the state variable x_j(t) crosses the threshold θ, the neuron fires a pulse, the so-called action potential or spike, described by its firing time t_j. After emitting a spike, the neuron's membrane potential drops to its resting potential and a refractory period follows. The original SpikeProp is developed for SNNs in which each neuron fires only once, so the refractory period need not be considered.

Figure 1 Feedforward spiking neural network with multiple delayed synaptic terminals for each connection
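As a concrete reading of Eq. (1), the membrane potential and the resulting single firing time can be sketched as follows. This is a minimal Python illustration written for this note, assuming the parameter values of Section 3; the function names and the simple threshold search are ours, not the authors' implementation.

```python
import numpy as np

def psp(t, tau=7.0):
    """Standard PSP eps(t) = (t/tau)*exp(1 - t/tau) for t > 0, else 0."""
    return np.where(t > 0, (t / tau) * np.exp(1.0 - t / tau), 0.0)

def membrane_potential(t, pre_spike_times, weights, delays, tau=7.0):
    """
    Eq. (1): x_j(t) = sum_i sum_k w_ij^k * eps(t - t_i - d^k).
    pre_spike_times: 1-D array of firing times t_i of the pre-synaptic neurons
    weights:         array of w_ij^k, shape (n_pre, n_delays)
    delays:          1-D array of synaptic delays d^k, shape (n_delays,)
    """
    # time elapsed since each delayed spike arrived: t - t_i - d^k
    dt = t - pre_spike_times[:, None] - delays[None, :]
    return np.sum(weights * psp(dt, tau))

def first_spike_time(pre_spike_times, weights, delays, theta=50.0,
                     t_max=50.0, dt_step=0.01):
    """Earliest t at which x_j(t) crosses the threshold theta (np.inf if it never fires)."""
    for t in np.arange(0.0, t_max, dt_step):
        if membrane_potential(t, pre_spike_times, weights, delays) >= theta:
            return t
    return np.inf
```

With τ = 7 and θ = 50 as used in Section 3, `first_spike_time` returns the single firing time t_j on which SpikeProp operates.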
SpikeProp, derived in [8], uses the same weight update method as error backpropagation:

    \Delta w_{ij}^k = -\eta \frac{\partial E}{\partial w_{ij}^k}                (2)

Here η is the learning rate, which is held constant throughout training [8]. The performance of the steepest descent algorithm can be improved if the learning rate is allowed to change during training. In addition, to improve convergence and escape local minima, momentum is used in combination with the adaptive learning rate:

    w_{ij}^k(n+1) = w_{ij}^k(n) - \eta_{ij}^k(n+1) \frac{\partial E(n)}{\partial w_{ij}^k(n)} + \alpha \Delta w_{ij}^k(n-1)                (3)

where α is the momentum coefficient. The adaptive learning rate η_{ij}^k is indexed by n + 1 rather than n simply to indicate that it is updated before w_{ij}^k. The adaptive learning rate, which is made responsive to the complexity of the error surface, should attempt to keep the learning step size as large as possible while keeping learning stable. In the following, three methods are used to adjust the learning rate adaptively.

2.1 Heuristic rule

The heuristic rule can be written as

    \eta(n) = \begin{cases} a \, \eta(n-1) & \text{if } E[w(n)] < E[w(n-1)] \\ b \, \eta(n-1) & \text{if } E[w(n)] > k \, E[w(n-1)] \\ \eta(n-1) & \text{otherwise} \end{cases}                (4)

where a, b and k are parameters [16]. The heuristic rule updates the learning rate in every learning step, but within one step the learning rate is the same for all weights in the network. The following two learning rate adaptation rules assign an individual learning rate to each synaptic weight in order to deal with the complex error surface.

2.2 Delta-delta rule

As with the delta-delta rule in ANNs [17], we take the partial derivative of the error function in the nth iteration with respect to η_{ij}^k(n):

    \frac{\partial E(n)}{\partial \eta_{ij}^k(n)} = \frac{\partial E(n)}{\partial t_j^a(n)} \frac{\partial t_j^a(n)}{\partial x_j\left(t_j^a(n)\right)} \frac{\partial x_j\left(t_j^a(n)\right)}{\partial \eta_{ij}^k(n)}                (5)

and obtain the final update rule:

    \Delta \eta_{ij}^k(n) = -\gamma \frac{\partial E(n)}{\partial \eta_{ij}^k(n)} = \gamma \frac{\partial E(n)}{\partial w_{ij}^k(n)} \frac{\partial E(n-1)}{\partial w_{ij}^k(n-1)}                (6)
where γ is a positive adjustable step size for the learning rate. The delta-delta rule has some potential problems. If in two successive iterations the partial derivatives with respect to some weight have the same sign but small magnitudes, the corresponding learning rate receives only a small increment; if the two successive partial derivatives have opposite signs and large magnitudes, the corresponding learning rate receives a large decrement. It is difficult to choose a proper step γ for both cases. The following delta-bar-delta rule overcomes these difficulties.

2.3 Delta-bar-delta rule
The delta-bar-delta rule is defined by the following functions [17]:

    \Delta \eta_{ij}^k(n) = \begin{cases} a & \text{if } S_{ij}^k(n-1) \, D_{ij}^k(n) > 0 \\ -b \, \eta_{ij}^k(n) & \text{if } S_{ij}^k(n-1) \, D_{ij}^k(n) < 0 \\ 0 & \text{otherwise} \end{cases}                (7)

    D_{ij}^k(n) = \frac{\partial E(n)}{\partial w_{ij}^k(n)}                (8)

    S_{ij}^k(n) = (1 - \xi) D_{ij}^k(n) + \xi S_{ij}^k(n-1)                (9)

where a, b and ξ are parameters, with typical values 10^{-4} ≤ a ≤ 0.1, 0.1 ≤ b ≤ 0.5, and 0.1 ≤ ξ ≤ 0.7. In the delta-bar-delta rule, learning rates increase linearly and decrease exponentially. The linear increase prevents the learning rate from growing too quickly, while the exponential decrease reduces it rapidly yet keeps it positive.
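To make the three rules concrete, the sketch below outlines how Eqs. (3), (4), (6) and (7)-(9) might be implemented. It is a minimal Python illustration written for this note; the function names, the array-based bookkeeping, and the default parameter values (taken from Section 3) are ours, not the authors' code.

```python
import numpy as np

def heuristic_rule(eta, err, err_prev, a=1.05, b=0.8, k=1.04):
    """Eq. (4): one global learning rate, scaled up/down from the change in total error."""
    if err < err_prev:
        return a * eta
    if err > k * err_prev:
        return b * eta
    return eta

def delta_delta_rule(eta, grad, grad_prev, gamma=0.1):
    """Eq. (6): per-weight learning rates, incremented by the product of successive gradients."""
    return eta + gamma * grad * grad_prev

def delta_bar_delta_rule(eta, s_prev, grad, a=0.1, b=0.2, xi=0.3):
    """Eqs. (7)-(9): linear increase / exponential decrease driven by the sign agreement
    between the current gradient D(n) and the exponential average S(n-1)."""
    s = (1.0 - xi) * grad + xi * s_prev                                 # Eq. (9)
    d_eta = np.where(s_prev * grad > 0, a,
                     np.where(s_prev * grad < 0, -b * eta, 0.0))        # Eq. (7)
    return eta + d_eta, s

def weight_update(w, eta, grad, dw_prev, alpha=0.9):
    """Eq. (3): gradient step with per-weight learning rate plus a momentum term."""
    dw = -eta * grad + alpha * dw_prev
    return w + dw, dw
```

Except for the heuristic rule's single global rate, all quantities are arrays of the same shape as the weight tensor; `grad` stands for the gradient ∂E(n)/∂w_ij^k(n) delivered by SpikeProp, and the adapted learning rates feed directly into `weight_update`.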
3 EXPERIMENTS AND RESULTS
The network architecture adopted in this paper is a fully connected feedforward network with multiple delays per connection, as shown in Fig. 1. Following the method used in [18], the weight of each connection is initialized to a random number between 1 and 10. During training, only positive weights are allowed, in order to distinguish excitatory and inhibitory neurons. The parameters of the spiking neuron model are: synaptic time constant τ = 7 and membrane threshold θ = 50. The simulation time ranges from 0 to 50 ms with a time step of 0.01 ms. The maximum number of training epochs is set to 500. The initial learning rates of the three learning rate adaptation methods introduced in Section 2 are all set to 1. The parameters of the three learning rules are selected within the typical ranges used for ANNs [16, 17] by trial and error: in the heuristic rule, a = 1.05, b = 0.8, k = 1.04; in the delta-delta rule, γ = 0.1; in the delta-bar-delta rule, a = 0.1, b = 0.2, ξ = 0.3.
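Collected in one place, the settings listed above can be written as a small configuration block. This is a hedged Python sketch; the dictionary layout and key names are our own and do not come from the original code.

```python
# Common settings used in the experiments of Section 3; values taken from the text above.
SNN_CONFIG = {
    "tau": 7.0,                        # synaptic decay time constant of the PSP
    "theta": 50.0,                     # membrane firing threshold
    "t_max_ms": 50.0,                  # simulation window 0-50 ms
    "dt_ms": 0.01,                     # time step (later experiments use 0.1 ms, see Sections 3.2-3.4)
    "max_epochs": 500,                 # 200 for Iris, 100 for the TE and Poisson experiments
    "init_eta": 1.0,                   # initial learning rate for all three adaptation rules
    "weight_init": (1.0, 10.0),        # weights drawn between 1 and 10, kept positive during training
    "heuristic": {"a": 1.05, "b": 0.8, "k": 1.04},
    "delta_delta": {"gamma": 0.1},
    "delta_bar_delta": {"a": 0.1, "b": 0.2, "xi": 0.3},
}
```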
3.1 XOR problem

We replicate the XOR experiment with an encoding scheme and network architecture similar to those used for the original SpikeProp algorithm [8]. The three learning rate adaptation methods (heuristic, delta-delta, and delta-bar-delta), with and without momentum, are used separately to train the SNN to learn the XOR pattern. The results are presented in Table 1; each entry is the average number of epochs over 100 different simulations. Convergence is defined by a mean squared error (MSE) of 1.0 ms. The momentum coefficient α is set to 1.5 for the delta-bar-delta method and 0.9 for the other methods.
Table 1 Average number of iterations (±SD) for XOR

Algorithm          Without momentum    With momentum
heuristic          75±44               46±16
delta-delta        116±37              38±16
delta-bar-delta    66±17               18±4
SpikeProp          128±51              43±40
As shown in Table 1, with the three learning rate adaptation methods combined with momentum, the speed of convergence is increased by 73% on average compared with the original SpikeProp. Simulations with momentum are on average 60% faster than those without momentum. The momentum term increases the effective learning rate but may have a destabilizing effect: the standard deviation (SD) of SpikeProp with momentum shows significant dispersion from the average compared with SpikeProp without momentum, whereas the learning rate adaptation methods with momentum are more stable than those without momentum. These observations indicate that learning rate adaptation and momentum balance each other in a beneficial way when dealing with the complex error surface. Among all the modifications of SpikeProp, the delta-bar-delta method with momentum has the fastest and most stable convergence. If the momentum coefficient α is set to 0.9 in the delta-bar-delta method, the average number of iterations is 33; because its learning rate adaptation is the most effective, the momentum coefficient can be raised to 1.5, giving convergence within 18 iterations. By contrast, if the momentum coefficient of the heuristic or delta-delta method is greater than 1, training tends to take longer or even fails to converge.
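The epoch counts reported in Table 1 correspond to a driver loop of roughly the following shape. This is a Python sketch under our own assumptions; `train_one_epoch` stands for one pass of whichever SpikeProp variant is being tested and is not specified in the paper.

```python
import numpy as np

def epochs_to_converge(train_one_epoch, mse_target=1.0, max_epochs=500):
    """Train until the spike-time MSE falls below mse_target; return the epoch count."""
    for epoch in range(1, max_epochs + 1):
        mse = train_one_epoch()          # one pass over the four XOR patterns
        if mse <= mse_target:
            return epoch
    return max_epochs                    # counted at the cap if training does not converge

def summarize(run_factory, n_runs=100):
    """Average number of epochs (+/- SD) over independently initialized simulations."""
    counts = [epochs_to_converge(run_factory()) for _ in range(n_runs)]
    return float(np.mean(counts)), float(np.std(counts))
```

Here `run_factory` is assumed to return a fresh `train_one_epoch` closure for each randomly initialized network.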
3.2 Iris dataset
The Iris dataset consists of three classes, two of which are not linearly separable, providing a good practical problem for our algorithms. The dataset contains 150 samples with 4 continuous input variables. As in [8], we employ 12 neurons with Gaussian receptive fields to encode each of the 4 input variables. Ten hidden neurons (including one inhibitory neuron) and three output neurons are used. The output classification is encoded according to a winner-take-all paradigm in which the neuron coding for the respective class is designated an early firing time (12 ms) and all other neurons a considerably later one (16 ms). A classification is correct if the neuron that fires earliest is the neuron required to fire first. Ten different sets of initial weights are tested, and for each set the dataset is randomly partitioned into a training set (50%) and a test set (50%) ten times, giving 100 simulations in total. Other parameters are the same as in the XOR experiment except that the time step is 0.1 ms and the maximum number of training epochs is set to 200.

Figure 2 shows the classification accuracy averaged over the 100 groups as a function of the number of training epochs. The performance of the algorithms with momentum is better than that of those without momentum. The delta-bar-delta rule with momentum has the fastest convergence, the most stable training process, and the highest accuracy among all eight algorithms. Its training accuracy reaches 100% after 85 training epochs, whereas the SpikeProp algorithm reaches only 98.5% after 200 epochs. Table 2 shows the best accuracy over the entire 200 epochs and the number of training iterations needed to reach 95% accuracy.
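A minimal sketch of the population coding and winner-take-all decoding described above is given below. It is written in Python for this note and assumes evenly spaced Gaussian receptive fields over each variable's range, which is one common reading of the scheme in [8]; the width factor and the output time window are illustrative choices of ours.

```python
import numpy as np

def gaussian_receptive_fields(x, x_min, x_max, n_neurons=12, beta=1.5, t_max=10.0):
    """
    Encode one continuous input variable as n_neurons spike times (ms).
    Each encoding neuron has a Gaussian tuning curve; a strong response maps to an
    early spike and a weak response to a late spike.
    """
    centers = np.linspace(x_min, x_max, n_neurons)
    sigma = (x_max - x_min) / (beta * (n_neurons - 1))        # tuning-curve width (our choice)
    response = np.exp(-0.5 * ((x - centers) / sigma) ** 2)    # in (0, 1]
    return t_max * (1.0 - response)                           # response 1 -> 0 ms, response 0 -> t_max

def winner_take_all(output_spike_times):
    """Decode the class as the index of the output neuron that fires earliest."""
    return int(np.argmin(output_spike_times))
```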
Figure 2 Comparison of Iris dataset classification accuracy of the original SpikeProp algorithm and the seven modifications versus the number of training epochs for the training set (a) and the test set (b). Curves: SpikeProp; heuristic; delta-delta; delta-bar-delta; SpikeProp with momentum; heuristic with momentum; delta-delta with momentum; delta-bar-delta with momentum
Table 2 Results of the Iris dataset

                              Iterations to reach 95% accuracy        Best accuracy
Algorithm                     Training set      Test set              Training set/%    Test set/%
with momentum
  heuristic                   41                48                    99.97             95.95
  delta-delta                 66                75                    99.89             95.87
  delta-bar-delta             24                35                    100               95.76
  SpikeProp                   51                59                    99.89             95.84
without momentum
  heuristic                   81                112                   98.91             95.95
  delta-delta                 83                165                   99.31             94.93
  delta-bar-delta             41                59                    99.95             95.63
  SpikeProp                   87                144                   98.51             95.65
3.3 TE process
The TE process created by the Eastman Chemical Co. has been widely used as a benchmark chemical process for evaluating fault diagnosis methods [19]. The simulation data are generated with MATLAB [20]. Faults 4, 9, and 11 of the TE process are chosen for identification; for each fault the simulation dataset contains 40 observations, and each observation contains 41 measured variables. The 12 manipulated variables of the TE process are held constant. Only measured variables 9, 18, and 20 are selected as features for recognizing the three faults [21]. The parameters of the SNN in this experiment are the same as for the Iris dataset except that the maximum number of training epochs is set to 100.
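Assembling the diagnosis dataset described above amounts to stacking the simulated observations and keeping only the three selected measured variables. The hedged Python sketch below illustrates this; the array names and the 0-based indexing convention are ours, and the actual layout of the TE data from [20] may differ.

```python
import numpy as np

def build_te_dataset(fault_runs, selected_vars=(9, 18, 20)):
    """
    fault_runs: dict mapping fault label (4, 9, 11) to an array of shape
                (n_observations, 41) of measured variables from the TE simulator.
    Returns a feature matrix X of shape (n_samples, 3) and a label vector y.
    """
    cols = [v - 1 for v in selected_vars]     # measured variables 9, 18, 20 (1-based in the text)
    X = np.vstack([obs[:, cols] for obs in fault_runs.values()])
    y = np.concatenate([np.full(obs.shape[0], label)
                        for label, obs in fault_runs.items()])
    return X, y
```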
Figure 3 shows the fault diagnosis accuracy averaged over the 100 groups as a function of the number of training epochs. After 21 training epochs, the training accuracy of the delta-bar-delta method with momentum reaches 90.23%, whereas that of SpikeProp is 63.72%; the corresponding test accuracies of the two algorithms are 84.87% and 61.82%, respectively. Within 100 training epochs, the best training and test accuracies of the delta-bar-delta method with momentum are 95.08% and 87.42%, compared with 90.95% and 85.95% for SpikeProp. These results show that the modified SpikeProp algorithms speed up the convergence of the SNN and improve the accuracy of fault diagnosis, and the delta-bar-delta rule with momentum again has the best performance among all eight algorithms.
3.4 Poisson spike trains
In this experiment, we investigate the capability of the SNN to decode temporal information from spike trains. Two spike-train templates are produced by Poisson processes [22] with a frequency of 100 Hz and a duration of 30 ms.
Figure 3 Comparison of fault identification accuracy of the original SpikeProp algorithm and the seven modifications versus the number of training epochs for the training set (a) and the test set (b). Curves: SpikeProp; heuristic; delta-delta; delta-bar-delta; SpikeProp with momentum; heuristic with momentum; delta-delta with momentum; delta-bar-delta with momentum
Figure 4 Two noisy patterns of Poisson spike trains for each class: (a) class 1; (b) class 2
Figure 5 Comparison of Poisson spike train classification accuracy of the original SpikeProp algorithm and the seven modifications versus the number of training epochs for the training set (a) and the test set (b). Curves: SpikeProp; heuristic; delta-delta; delta-bar-delta; SpikeProp with momentum; heuristic with momentum; delta-delta with momentum; delta-bar-delta with momentum
From each of these two templates, 50 noisy patterns are created by randomly shifting each spike by an amount drawn from a normal distribution with an SD of 2 ms (see Fig. 4). The resulting set of 100 patterns is then split in half to produce a training set and a test set. The output is coded in the time to first spike: for one class the output neuron is designated to fire an early spike at 31 ms, and for the other a later spike at 36 ms. Every connection between spiking neurons consists of 36 synapses with delays from 1 to 36. The time step is 0.1 ms and the maximum number of training epochs is set to 100; other parameters are the same as in the XOR experiment. The spike rates of the two classes are both 100 Hz, so a traditional ANN cannot solve this classification problem, whereas the SNN can recognize the difference in the timing of the individual spikes.
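The template generation and jittering described above can be sketched as follows. This is a Python illustration assuming a homogeneous Poisson process discretized in 0.1 ms bins; the function names, the binning choice, and the clipping to the 30 ms window are ours.

```python
import numpy as np

rng = np.random.default_rng()

def poisson_template(rate_hz=100.0, duration_ms=30.0, dt_ms=0.1):
    """Spike times (ms) of a homogeneous Poisson process: each bin fires with prob. rate*dt."""
    p = rate_hz * dt_ms * 1e-3                    # firing probability per bin
    bins = np.arange(0.0, duration_ms, dt_ms)
    return bins[rng.random(bins.size) < p]

def jitter(template, sd_ms=2.0, duration_ms=30.0):
    """Create one noisy pattern by shifting every spike by Gaussian noise (SD = 2 ms)."""
    shifted = template + rng.normal(0.0, sd_ms, size=template.size)
    return np.sort(np.clip(shifted, 0.0, duration_ms))   # clip to the window (our choice)

# 50 noisy patterns per class, generated from two fixed templates
templates = [poisson_template(), poisson_template()]
patterns = {cls: [jitter(tpl) for _ in range(50)] for cls, tpl in enumerate(templates)}
```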
Fig. 5 shows the classification accuracy averaged over the 100 groups as a function of the number of training epochs. The training accuracy of the delta-bar-delta method with momentum reaches 98.46% after 20 training epochs, whereas that of the SpikeProp algorithm reaches 98.0% only after 54 epochs. The test accuracy of the delta-bar-delta method with momentum reaches 99.06% after 29 training epochs, whereas that of the SpikeProp algorithm reaches 98.32% only after 59 epochs. The delta-bar-delta method with momentum again has the best performance among all eight algorithms. However, the worst method is not SpikeProp but delta-delta. The network in this experiment has so many synapses that the initial MSE is about 200 ms; the delta-delta rule therefore increases the learning rate rapidly, and once the error of the network drops steeply, the learning rate is reduced only slightly. Consequently the learning rate remains so large that the performance of the network is rather poor. Fig. 6 shows how the learning rates of the three learning rate adaptation methods, with and without momentum, evolve during training; for the delta-delta and delta-bar-delta methods the plotted learning rate is the average over all weights. The figure indicates that the learning rate of delta-bar-delta is kept appropriately around 1, whereas the learning rate of delta-delta exceeds 5 from the 13th training epoch onward. In addition, the momentum term smooths the oscillations caused by the large learning rate of the delta-delta algorithm.
Figure 6 The adaptation processes of the learning rates. Curves: heuristic; delta-delta; delta-bar-delta; heuristic with momentum; delta-delta with momentum; delta-bar-delta with momentum

REFERENCES

1 Lazzús, J.A., "Prediction of flash point temperature of organic compounds using a hybrid method of group contribution + neural network + particle swarm optimization", Chin. J. Chem. Eng., 18 (5), 817-823 (2010).
2 Zou, Z., Yu, D., Feng, W., Yu, L., Guo, N., "An intelligent neural networks system for adaptive learning and prediction of a bioreactor benchmark process", Chin. J. Chem. Eng., 16 (1), 62-66 (2008).
3 Maass, W., "Networks of spiking neurons: The third generation of neural network models", Neural Networks, 10 (9), 1659-1671 (1997).
4 Thorpe, S., Fize, D., Marlot, C., "Speed of processing in the human visual system", Nature, 381 (6582), 520-522 (1996).
5 Fang, H., Wang, Y., He, J., "Spiking neural networks for cortical neuronal spike train decoding", Neural Computation, 22 (4), 1060-1085 (2010).
6 Maass, W., "Noisy spiking neurons with temporal coding have more computational power than sigmoidal neurons", In: Advances in Neural Information Processing Systems, MIT Press, Cambridge, USA, 9, 211-217 (1997).
7 Maass, W., "Lower bounds for the computational power of networks of spiking neurons", Neural Computation, 8 (1), 1-40 (1996).
8 Bohte, S.M., Kok, J.N., La Poutré, H., "Error-backpropagation in temporally encoded networks of spiking neurons", Neurocomputing, 48, 17-37 (2002).
9 Yang, J., Yang, W., Wu, W., "A remark on the error-backpropagation learning algorithm for spiking neural networks", Applied Mathematics Letters, 25 (8), 1118-1120 (2012).
10 Schrauwen, B., van Campenhout, J., "Extending SpikeProp", In: Proceedings of the International Joint Conference on Neural Networks, IEEE, Piscataway, USA, 471-475 (2004).
11 Xin, J., Embrechts, M., "Supervised learning with spiking neural networks", In: Proceedings of the International Joint Conference on Neural Networks, IEEE, Piscataway, USA, 1772-1777 (2001).
12 McKennoch, S., Liu, D., Bushnell, L.G., "Fast modifications of the SpikeProp algorithm", In: Proceedings of the International Joint Conference on Neural Networks, IEEE, Piscataway, USA, 3970-3977 (2006).
13 Ghosh-Dastidar, S., Adeli, H., "Improved spiking neural networks for EEG classification and epilepsy and seizure detection", Integr. Comput.-Aid. E., 14 (3), 187-212 (2007).
14 Delshad, E., Moallem, P., Monadjemi, S.A.H., "Spiking neural network learning algorithms: Using learning rates adaptation of gradient and momentum steps", In: 5th International Symposium on Telecommunications (IST), IEEE, 944-949 (2010).
15 Gerstner, W., Kistler, W., Spiking Neuron Models, Cambridge University Press, England (2002).
16 Vogl, T.P., Mangis, J.K., Rigler, A.K., Zink, W.T., Alkon, D.L., "Accelerating the convergence of the back-propagation method", Biol. Cybern., 59 (4), 257-263 (1988).
17 Jacobs, R.A., "Increased rates of convergence through learning rate adaptation", Neural Networks, 1, 295-307 (1988).
18 Moore, S.C., "Back-propagation in spiking neural networks", Master Thesis, University of Bath, UK (2002).
19 Downs, J.J., Vogel, E.F., "A plant-wide industrial process control problem", Comput. Chem. Eng., 17 (3), 245-255 (1993).
20 "Tennessee Eastman Problem for MATLAB", Control Systems Engineering Laboratory, Arizona State University, 1998 [2012-06-08], http://csel.asu.edu/downloads/Software/TEmatlab.zip.
21 Lu, N., Yu, X., "Fault diagnosis in TE process based on feature selection via second order mutual information", CIESC Journal, 60 (9), 2252-2258 (2009).
22 Heeger, D., "Poisson model of spike generation", New York University, 2000 [2012-06-08], http://www.cns.nyu.edu/~david/ftp/handouts/poisson.pdf.