Neurocomputing 35 (2000) 113–122
Modi"ed constructive backpropagation for regression Mikko Lehtokangas* Digital and Computer Systems Laboratory, Tampere University of Technology, P.O.Box 553, FIN-33101 Tampere, Finland Received 13 May 1999; accepted 13 April 2000
Abstract

Choosing a network size is a difficult problem in neural network modelling. Many recent studies have addressed this problem with constructive or destructive methods that add or delete connections, neurons, or layers. In this work, we consider the constructive approach, which is in many cases computationally very efficient. In particular, we address the construction of feedforward networks by the use of modified constructive backpropagation. The proposed modification is shown to be important especially in regression-type problems such as time-series modelling. Indeed, according to our time-series prediction experiments, the proposed modified method is competitive in terms of modelling performance and training time compared to the well-known cascade-correlation method. © 2000 Elsevier Science B.V. All rights reserved.

Keywords: Constructive backpropagation; Structure learning; Feedforward neural networks
1. Introduction

The size and topology of a neural network directly affect its generalization and training time, which are two important factors in neural network modelling. Larger than necessary networks tend to overfit, resulting in poor generalization, while too small a network will have difficulty learning the training samples [2]. In general, a large network requires more computation than a smaller one. As a result, many recent studies have treated the topology of the network as a trainable parameter and allowed the network to adjust its structure according to the problem at hand (see the reviews [3,9]).
One popular approach to structure learning is the constructive one [3]. There, training begins with a minimal structure (no hidden units), and more connections, neurons and layers are then added to the network according to some predefined rule. One of the most well-known constructive neural network techniques is cascade-correlation (CC) learning [1]. Inspired by CC, we have recently proposed and studied a similar technique called constructive backpropagation (CBP) [4–6]. We have shown that CBP is computationally just as efficient as the CC algorithm. Further, CBP has the same constructive benefits as CC, but in addition it benefits from a simpler implementation and the ability to utilize stochastic optimization routines. We have also demonstrated that CBP can be a better approach for reducing the output error than CC learning. In [5,6] we have also extended the CBP method with two new features. Firstly, we have extended CBP to allow the addition of multiple new units to a network simultaneously. Secondly, CBP was extended for adaptive structure learning, which includes both addition and deletion of units. The CC method does not support these features. In our recent studies, we have noticed that in some regression problems, such as time-series modelling, constructive backpropagation has difficulty providing competitive performance in terms of modelling error [6]. Here we propose a simple but effective modification to the CBP method to solve this problem. The presented time-series modelling experiments demonstrate that the modified constructive backpropagation can significantly improve the modelling performance.
2. Modified constructive backpropagation

Constructive backpropagation [4–6] is based on the same incremental error-reduction idea as cascade-correlation [1]. That is, each new neuron is trained to learn the error left by the previously added neurons. Therefore, in constructive backpropagation the error function to be minimized for the ith neuron can be written as
G\ SSE " d ! v h !v h " (eG !v h ), (1) G IJ HI HJ GI GJ IJ GI GJ JI H JI where d is the desired output in kth output unit for lth training pattern, v is the IJ HI connection from jth hidden neuron to kth output unit (v are the bias weights), h is I HJ the output of jth hidden neuron for lth training pattern (for the bias weights h "1 J always), and eG is the residual error in kth output unit for lth training pattern that is IJ left from the previously added neurons (i.e. it is the desired output for the new ith neuron). Note that in the new, ith, neuron perspective the previous neurons are "xed just like in the cascade-correlation scheme. In other words, we are only training the weights connected to the new unit (both input and output connections) and hence the error need to be backpropagated only through one hidden layer always. In [4,6] we have shown that this scheme is computationally just as e$cient as the CC learning. Note that the inputs to the new hidden unit can include both the original network inputs but also outputs from the previously trained hidden units. Therefore, with CBP
learning one is able to construct exactly and efficiently the same structures as with CC learning. Further, in the extended scheme of [5,6] it has been shown that after the constructive phase we can actually continue simultaneous adaptation of all the units in a hierarchical manner using Eq. (1). This makes it possible to continue the structure learning not only by increasing the number of units but also by decreasing it. In addition, we have shown in [5,6] how multiple units can be added to a network simultaneously to increase the modelling performance. These features of the extended CBP are not supported by the CC scheme. Even though in [4] we have demonstrated that CBP can yield better performance than CC learning, we have noticed that in some regression problems (such as time-series modelling) CBP has difficulty giving competitive performance [6]. We have found that the reason for this is that the residual error $e^{i}_{kl}$ in Eq. (1) may contain some bias. In such a case it is difficult for a new neuron to learn to reduce the residual error further. A simple solution is to use separate bias terms in each neuron addition phase. In this case, the cost function to be minimized becomes
G\ SSE " d ! (v h #v )!(v h #v ) IJ HI HJ HI GI GJ GI G H JI " (eG !v h !v ), (2) IJ GI GJ GI JI where d , v , h , and eG have the same meaning as in the above. In addition, v is the IJ HI HJ IJ HI bias term of jth neuron for kth output unit. This simple modi"cation enables jth neuron to cope with the disturbing bias in the residual error. Of course, the usage of separate biases increases the computational load slightly, but as the experiments in the next section show this is not very signi"cant. Note that Eq. (2) is for network construction where only one hidden unit is added to a network at a time. However, following [5,6] the modi"ed CBP can also be extended to add multiple units simultaneously to a network as well as continue structure adaptation after network construction phase. Thus, the modi"ed CBP is a very #exible constructive approach.
3. Time-series modelling experiments

In this section the performance of the proposed modified CBP method is empirically investigated. A comparison with the basic CBP and CC learning is also presented. Because it was reported in [8], in the context of CC learning, that in many problems placing hidden units in one hidden layer yields equal or better results than cascading them, we also chose to construct the neurons in a single layer with all the tested methods (CBP, modified CBP and CC). With this selection we also avoid the problems of large fan-in to hidden units and irregular network structure [7]. The usage of candidate units as proposed in [1] was also studied. This means that in each unit-addition phase we train several candidate units starting from different initializations. Then the best-trained candidate is added to the network and the rest are deleted. With all the tested methods we either used no candidates or five
candidates. In addition, with CBP and modified CBP we added either one unit or two units to the network simultaneously. The addition of multiple units simultaneously to a network is described in [5,6]. One should note that if we add all the units to the network simultaneously, we are in fact performing the usual backpropagation type of training of a fixed-size network. Hence, the usual backpropagation type of training can be regarded as a special case of constructive backpropagation. For the reader's convenience we also performed a comparison with the usual backpropagation type of training.

The first benchmark problem considered is the Henon map time-series problem [11]. The second one is the laser intensity fluctuation time-series problem [12]. Both time series were scaled to have zero mean and unit variance. In all the experiments the optimization procedure used for training was the RPROP algorithm [10]. Due to the random initialization of weights, each simulation was repeated 10 times. The results of the repetitions are presented with plots where the horizontal line in the middle is located at the average of the repetitions. The whiskers represent the maximum and minimum of the repetitions, respectively. The performance measure used was the mean square error (MSE).

3.1. Henon map series

The chaotic Henon map time series was generated from the equation [11]

$$
x_{t+1} \;=\; 1 + a\,x_{t}^{2} + b\,x_{t-1},
\qquad (3)
$$

where $a=-1.4$, $b=0.3$, and the initial conditions were $x_{0}=1.0$ and $x_{-1}=0.4$. The first 100 data points were used for training and the next 100 points acted as an independent test set. We used two previous data points to predict the next one. The number of hidden units varied from 1 to 10. The results for this problem are depicted in Figs. 1–3.
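As a concrete illustration of this data setup, the following sketch generates the Henon map series of Eq. (3) and arranges it into input-output patterns with two lags. It reflects our reading of the text only: the helper names and the choice to compute the scaling statistics over the whole series are our assumptions, not details reported for the original experiments.

```python
import numpy as np

def henon_series(n, a=-1.4, b=0.3, x0=1.0, xm1=0.4):
    """Generate n points of the Henon map x_{t+1} = 1 + a*x_t^2 + b*x_{t-1} (Eq. (3))."""
    x = [xm1, x0]
    for _ in range(n):
        x.append(1.0 + a * x[-1] ** 2 + b * x[-2])
    return np.array(x[2:])                 # drop the two initial conditions

def lag_patterns(series, n_lags):
    """Use n_lags previous points as inputs to predict the next point."""
    X = np.column_stack([series[i:len(series) - n_lags + i] for i in range(n_lags)])
    y = series[n_lags:]
    return X, y

s = henon_series(202)
s = (s - s.mean()) / s.std()               # zero mean, unit variance (assumed over the whole series)
X, y = lag_patterns(s, n_lags=2)           # two previous points predict the next one
X_train, y_train = X[:100], y[:100]        # first 100 patterns for training
X_test, y_test = X[100:200], y[100:200]    # next 100 patterns as the independent test set
```

The laser intensity setup of Section 3.2 is analogous, with six lags and 1000 points each for training and testing.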
Fig. 1. Modelling performance as a function of the number of hidden units for the independent test set of the Henon map time series: (a) CC with no candidate units, and (b) CC with five candidate units. The upper and lower horizontal lines represent the maximum and minimum errors of the 10 repetitions. The smaller horizontal line in the middle represents the average error.
Fig. 2. Modelling performance as a function of the number of hidden units for the independent test set of the Henon map time series: (a) CBP with no candidate units, (b) CBP with five candidate units, (c) modified CBP with no candidate units, and (d) modified CBP with five candidate units. One unit was added at a time. The upper and lower horizontal lines represent the maximum and minimum errors of the 10 repetitions. The smaller horizontal line in the middle represents the average error.
The results of the basic CBP with one unit addition are slightly worse than those of CC learning. However, with two unit addition even the basic CBP is much better than CC learning. The modified CBP gives the best results, especially in the case of two unit addition. However, with two unit addition there seem to be some difficulties in learning, since the deviations between the best and worst runs are considerably large. A good weight initialization procedure (other than purely random) might be helpful here, but as can be seen the usage of candidate units is also useful despite the considerable increase in computational load (with a parallel machine this should not be so big a problem [1]). Considering computational effort, Table 1 lists the average cpu times required for each method. In Table 1, CBP1 and CBP2 refer to the basic CBP with one and two unit addition, respectively. In addition, mCBP1 and mCBP2 refer to the modified CBP with one and two unit addition, respectively. Finally, the term 'cands' refers to the usage of candidate units. As can be seen, all the CBP versions are somewhat faster to train than cascade-correlation. As a result, in this problem the CBP methods are better than CC learning both computationally and in terms of prediction error. Even though the modified CBP required slightly more computation
Fig. 3. Modelling performance as a function of the number of hidden units for the independent test set of the Henon map time series: (a) CBP with no candidate units, (b) CBP with five candidate units, (c) modified CBP with no candidate units, and (d) modified CBP with five candidate units. Two units were added simultaneously at a time. The upper and lower horizontal lines represent the maximum and minimum errors of the 10 repetitions. The smaller horizontal line in the middle represents the average error.
Table 1
Average cpu times required for training in the Henon map problem

Cpu time (s)   CC     CBP1   mCBP1   CBP2   mCBP2
No cands       5.4    2.6    3.2     4.1    3.7
Five cands     25.6   13.2   15.8    20.5   18.4
compared with the basic CBP, it gave much better results in terms of prediction error. For further comparison we also trained a fixed-size MLP with 10 hidden units using RPROP training (hence this corresponds to the usual backpropagation type of training) [10]. In this case an average MSE of 0.0119 was obtained for the test set. The respective training took 10.1 s of cpu time. In comparison, the best modified
Fig. 4. Modelling performance as a function of the number of hidden units for the independent test set of the laser intensity time series: (a) CC with no candidate units, and (b) CC with five candidate units. The upper and lower horizontal lines represent the maximum and minimum errors of the 10 repetitions. The smaller horizontal line in the middle represents the average error.
CBP scheme gave an average MSE of 0.0026, and the respective training took 18.4 s of cpu time. Obviously, the CBP scheme required slightly more computation. However, one should keep in mind that the computation in the CBP scheme includes the structure selection, while the training of a fixed-size MLP does not. Hence, in practice, incorporating some sort of structure selection into the usual backpropagation type of learning scheme can increase the training time substantially.

3.2. Laser intensity series

The second time series was measured in a physics laboratory experiment [12]. It represents the intensity fluctuations of a far-infrared laser. The first 1000 data points were used for training and the next 1000 for testing. Here we used six previous data points to predict the next one. The number of hidden units was varied from 1 to 10. The results for this problem are depicted in Figs. 4–6. Also here the results of the basic CBP with one unit addition are slightly worse than those of CC learning. However, with two unit addition the basic CBP is much better than CC learning. With one unit addition the modified CBP again gives the best results. However, with two unit addition there is not much difference between the basic CBP and the modified CBP. This demonstrates that in some cases the proposed modification may not be so important. Finally, with this problem the usage of candidates was also slightly beneficial in improving the results. Considering computational effort, Table 2 lists the average cpu times required for each method. The notations are the same as in the previous example. Also in this case all the CBP versions are faster to train than cascade-correlation. Therefore, also in this problem the CBP methods are better than CC learning both computationally and in terms of prediction error. Surprisingly, in this case the modified
Fig. 5. Modelling performance as a function of the number of hidden units for the independent test set of the laser intensity time series: (a) CBP with no candidate units, (b) CBP with five candidate units, (c) modified CBP with no candidate units, and (d) modified CBP with five candidate units. One unit was added at a time. The upper and lower horizontal lines represent the maximum and minimum errors of the 10 repetitions. The smaller horizontal line in the middle represents the average error.
CBP required slightly less computation than the basic CBP. This is due to the smaller number of training epochs that were required. For further comparison, we again trained a fixed-size MLP with 10 hidden units using RPROP training. In this case an average MSE of 0.0240 was obtained for the test set. The respective training took 137.9 s of cpu time. In comparison, the best modified CBP scheme gave an average MSE of 0.0274, and the respective training took 291.5 s of cpu time. Hence, also in this case the modified CBP scheme is competitive when we take into account the fact that it includes the structure selection feature.
4. Conclusions

Constructive training of feedforward networks by the use of constructive backpropagation was considered. First, we briefly described the concept of constructive
Fig. 6. Modelling performance as a function of the number of hidden units for the independent test set of the laser intensity time series: (a) CBP with no candidate units, (b) CBP with five candidate units, (c) modified CBP with no candidate units, and (d) modified CBP with five candidate units. Two units were added simultaneously at a time. The upper and lower horizontal lines represent the maximum and minimum errors of the 10 repetitions. The smaller horizontal line in the middle represents the average error.
Table 2
Average cpu times required for training in the laser intensity problem

Cpu time (s)   CC      CBP1    mCBP1   CBP2    mCBP2
No cands       110.3   68.1    61.2    69.7    58.5
Five cands     533.7   339.5   304.8   347.7   291.5
backpropagation, and the problem we have encountered with the basic version. Then we proposed a simple solution to the problem and introduced the modified constructive backpropagation method. The performance of the proposed modified version was then investigated with time-series modelling experiments. In the experiments, the modified constructive backpropagation was found to yield significantly better modelling performance than the basic version. On the other hand, no
signi"cant increase in computational load was observed. We also found that especially the modi"ed constructive backpropagation (also the basic version) can outperform the well known cascade-correlation both in terms of modelling performance and computational cost. We conclude that the proposed modi"ed constructive backpropagation can be a competitive constructive method for training feedforward neural networks. Acknowledgements The author wishes to thank the reviewers for their valuable comments. This work has been supported by the Academy of Finland. References [1] S. Fahlman, C. Lebiere, The cascade-correlation learning architecture, in: D. Touretzky (Ed.), Advances in Neural Information Processing Systems, Vol. 2, Morgan Kaufman, San Mateo, CA, 1990, pp. 524}532. [2] S. Geman, E. Bienenstock, R. Doursat, Neural networks and the bias/variance dilemma, Neural Comput. 4 (1992) 1}58. [3] T. Kwok, D. Yeung, Constructive algorithms for structure learning in feedforward neural networks for regression problems, IEEE Trans. Neural Networks 8 (3) (1997) 630}645. [4] M. Lehtokangas, Learning with constructive backpropagation, Proceedings of Fifth International Conference on Soft Computing and Information/Intelligent Systems, Vol. 1, 1998, pp. 183}186. [5] M. Lehtokangas, Extended constructive backpropagation for time series modelling, Proceedings of Fifth International Conference on Soft Computing and Information/Intelligent Systems, Vol. 2, 1998, pp. 793}796. [6] M. Lehtokangas, Modelling with constructive backpropagation, Neural Networks 12 (4}5) (1999) 707}716. [7] D. Phatak, I. Koren, Connectivity and performance tradeo!s in the cascade correlation learning architecture, IEEE Trans. Neural Networks 5 (6) (1994) 930}935. [8] L. Prechelt, Investigation of the CasCor family of learning algorithms, Neural Networks 10 (5) (1997) 885}896. [9] R. Reed, Pruning algorithms * a survey, IEEE Trans. Neural Networks 4 (5) (1993) 740}747. [10] M. Riedmiller, H. Braun, A direct adaptive method for faster backpropagation learning: the RPROP algorithm, Proceedings of IEEE International Conference on Neural Networks, 1993, pp. 586}591. [11] J. Thompson, H. Stewart, Nonlinear Dynamics and Chaos: Geometrical Methods for Engineers and Scientists, Wiley, New York, 1986. [12] A. Weigend, N. Gershenfeld (Eds.), Time Series Prediction: Forecasting the Future and Understanding the Past, Addison-Wesley, Reading, MA, 1994. Mikko Lehtokangas studied analog and digital electronics, and applied mathematics in the Department of Electrical Engineering at Tampere University of Technology where he received the M.Sc., Lic. Tech. and Dr. Tech. degrees in 1993, 1994 and 1995, respectively. Currently, he is a Junior Fellow of Academy of Finland in Signal Processing Laboratory at Tampere University of Technology. His main research interests are nonlinear adaptive architectures and algorithms, and their application.