Artificial Intelligence in Engineering 13 (1999) 301–307 www.elsevier.com/locate/aieng
Fast training algorithm for feedforward neural networks: application to crowd estimation at underground stations

T.W.S. Chow*, J.Y.-F. Yam, S.-Y. Cho

City University of Hong Kong, 83 Tat Chee Avenue, Kowloon, Hong Kong

Received 20 September 1998; received in revised form 12 February 1999; accepted 21 March 1999

* Corresponding author. E-mail address: [email protected] (T.W.S. Chow).
Abstract

A hybrid fast training algorithm for feedforward networks is proposed. In this algorithm, the weights connecting the last hidden and output layers are first evaluated by the least-squares algorithm, whereas the weights between the input and hidden layers are evaluated using a modified gradient descent algorithm. The effectiveness of the proposed algorithm is demonstrated by applying it to the sunspot and Mackey–Glass time-series prediction problems. The results show that the proposed algorithm can greatly reduce the number of flops required to train the networks. The proposed algorithm is also applied to crowd estimation at underground stations, and very promising results are obtained. © 1999 Elsevier Science Ltd. All rights reserved.

Keywords: Fast training algorithm; Feedforward neural networks; Time-series prediction; Crowd estimation
1. Introduction

Multilayer feedforward neural networks (MFNNs) have been used successfully to model nonlinear systems and in pattern recognition [1,2]. To develop successful neural network applications, thorough investigations of different network topologies are required. Therefore, a lot of time is spent on training and testing neural networks with different topologies, especially when the conventional backpropagation training algorithm is used. Although many fast training algorithms have been proposed to reduce the time taken to train neural networks, the convergence speed of these algorithms may not be fast enough when they are applied to large-scale systems. Moreover, the memory requirement and computational complexity of some fast training algorithms increase quadratically with the number of weights. As a result, most large-scale systems can only be modelled on sophisticated computing systems. In this paper, we describe the development of a hybrid fast training algorithm and its application to crowd estimation at underground stations. The computational complexity per iteration and the memory requirement of the proposed algorithm are smaller than those of well-known algorithms such as the conjugate gradient and Levenberg–Marquardt (LM) training algorithms [3,4]. The convergence performance of the proposed algorithm is demonstrated by
applying the algorithm to some benchmark problems. The results show that the convergence speed of the networks trained by the proposed algorithm is faster than that of the networks trained by the other investigated fast algorithms. The proposed algorithm is also applied to crowd estimation at underground stations; the estimation is carried out by extracting a set of significant features from the images, including the image edges, the crowd object density and the background density. Promising results are obtained in terms of estimation accuracy and real-time response capability, allowing the operators to be alerted automatically. In Section 2.1, a summary of other fast training algorithms for MFNNs is given. Section 2.2 presents the proposed training algorithm, which is based on the pure linear least-squares method and a modified gradient descent algorithm. The efficiency of the training algorithm is demonstrated by applying it to the sunspot and Mackey–Glass time-series prediction problems. In Section 3, the description and results of crowd estimation for an underground station using the proposed algorithm are given.
2. Fast training algorithm for FNNs

2.1. Review of training algorithms

A lot of methods have been proposed to increase the rate of convergence, including the use of different cost functions and the application of different optimization techniques.
Table 1
The performances of different algorithms on the Mackey–Glass problem

Method     Mflops to achieve RMSE = 0.001    Training error    Test error
Proposed   1.102                             0.000534          0.000601
LSB        7.751                             0.000784          0.000769
LM         499.7                             0.000980          0.00102
ABP        > 5000                            Not available     Not available
BP         > 5000                            Not available     Not available
The most well-known algorithms of this type are the conjugate gradient training algorithm and the Levenberg–Marquardt (LM) training algorithm [3,4]. The computational complexity of the conjugate gradient algorithm is increased dramatically by the exact line search used in the algorithm. The LM algorithm is an extension of the Gauss–Newton optimization technique. It is more powerful than gradient descent, but its memory requirement and computational complexity increase quadratically with the number of weights, which makes the method impractical for large-scale problems. Recently, new training algorithms based on linear least squares have been proposed, such as the least-squares based (LSB) and the optimization layer by layer (OLL) learning algorithms [5–7]. In the pure linear least-squares based training algorithm, each layer of a neural network is decomposed into a linear and a nonlinear part. The linear part is determined by the least-squares method. The remaining error is propagated back into the preceding layer of the neural network through the inverse of the activation function and a transformation matrix. These methods have the advantage that the network error can be minimized to a very small value after a few iterations [6,7]. Moreover, the computational complexity of this kind of algorithm is only 2–3 times that of the conventional training algorithm. However, the training process usually stalls after ten iterations; the network error cannot be further reduced with additional training. The OLL learning algorithm is also based on an optimization of the multilayer perceptron layer by layer using the least-squares method and constrained optimization [5]. However, the activation function is approximated by a first-order Taylor series, and the computational complexity of the OLL algorithm is about 8–9 times that of the backpropagation algorithm.

2.2. Hybrid training algorithm

In this paper, a hybrid training algorithm is developed to solve the stalling problem experienced by the pure linear least-squares based training algorithm. In the proposed algorithm, the weights between the last hidden and output layers are evaluated by the linear least-squares method. The weights between the input and hidden layers are then evaluated by a modified gradient based training algorithm. This method eliminates the difficulty of determining the appropriate
transformation matrices in the pure linear least-squares based algorithm, and thus the stalling problem is avoided. The storage requirement and computational complexity of the proposed algorithm are slightly smaller than those of the pure linear least-squares based algorithm and much smaller than those of the OLL and LM algorithms.

A multilayer neural network with L fully interconnected layers is considered. Layer l consists of n_l + 1 neurons (l = 1, 2, …, L-1), the last neuron being a bias node with a constant output of 1.0. If there are P patterns for network training, all given inputs can be represented by a matrix A^1 with P rows and n_1 + 1 columns. All elements of the last column of the matrix A^1 are constant 1.0. As the output layer does not have a bias node, the network output can be represented by a matrix A^L with P rows and n_L columns. Similarly, the target can be represented by a matrix T^L with P rows and n_L columns. The weights between the neurons in layers l and l+1 form a matrix W^l with entries w^l_{i,j} (i = 1, …, n_l + 1, j = 1, …, n_{l+1}). Entry w^l_{i,j} connects neuron i of layer l with neuron j of layer l+1. The outputs of all hidden layers and of the output layer are obtained by propagating the training patterns through the network. Let us define the matrix

O^l = A^l W^l    (1)

The entries of A^{l+1} for all layers (i.e. l = 1, 2, …, L-1) are evaluated as follows:

a^{l+1}_{i,j} = f(o^l_{i,j}), \quad i = 1, …, P \ \text{and} \ j = 1, …, n_{l+1}    (2)
where f(x) is the activation function. The activation function for the hidden neurons is the conventional sigmoidal function with range between 0 and 1. A linear function is used for the output neurons. Learning is achieved by adjusting the weights such that A^L is as close as possible, or equal, to T^L, so that the mean squared error E is minimized, where E is defined as

E = \frac{1}{2P} \sum_{p=1}^{P} \sum_{j=1}^{n_L} \left( a^L_{p,j} - t^L_{p,j} \right)^2    (3)
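The original implementation was written in Matlab (Section 2.2.1); purely as an illustration, the forward pass and error of Eqs. (1)–(3) might be sketched in NumPy as follows (the function and variable names are ours, not the authors'):

```python
import numpy as np

def forward_pass(A1, weights):
    """Propagate P training patterns through the network (Eqs. (1)-(2)).

    A1      : (P, n1+1) input matrix whose last column is the constant bias 1.0
    weights : list [W^1, ..., W^{L-1}]; W^l has shape (n_l+1, n_{l+1})
    Returns the activations of every layer, [A^1, ..., A^L].
    """
    activations = [A1]
    A = A1
    for l, W in enumerate(weights):
        O = A @ W                                   # Eq. (1): O^l = A^l W^l
        if l < len(weights) - 1:                    # hidden layers: sigmoid in (0, 1)
            A = 1.0 / (1.0 + np.exp(-O))            # Eq. (2)
            A = np.hstack([A, np.ones((A.shape[0], 1))])  # append the bias column
        else:                                       # output layer: linear activation
            A = O
        activations.append(A)
    return activations

def mean_squared_error(AL, TL):
    """Eq. (3): E = (1 / 2P) * sum of squared output errors."""
    P = AL.shape[0]
    return np.sum((AL - TL) ** 2) / (2.0 * P)
```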
In this learning algorithm, the weights between the last hidden layer and the output layer are evaluated by a pure least-squares algorithm; the weights between the input and the first hidden layers, and the weights between hidden layers, are evaluated by a modified gradient descent algorithm. The problem of determining W^{L-1} optimally can be formulated as follows:

\min_{W^{L-1}} \; \left\| A^{L-1} W^{L-1} - T^L \right\|^2    (4)

This linear least-squares problem is solved by using QR factorization together with Householder transformations [8]. After the optimal weights W^{L-1} are found, the new network outputs A^L are evaluated. To determine the appropriate weight changes in the preceding layers, the remaining error is backpropagated to the preceding layer of the neural network.
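As a rough sketch of this step (again illustrative NumPy rather than the authors' Matlab code), Eq. (4) can be solved with a Householder-based QR factorization:

```python
import numpy as np

def solve_output_weights(A_last_hidden, T):
    """Solve Eq. (4): min || A^{L-1} W^{L-1} - T^L ||^2 for W^{L-1}.

    A_last_hidden : (P, n_{L-1}+1) activations of the last hidden layer (with bias column)
    T             : (P, n_L) target matrix
    """
    # NumPy's QR factorization uses Householder reflections internally,
    # matching the solution strategy described in the text [8].
    Q, R = np.linalg.qr(A_last_hidden)    # A = QR, Q orthonormal, R upper triangular
    W = np.linalg.solve(R, Q.T @ T)       # R W = Q^T T  =>  least-squares solution
    # np.linalg.lstsq(A_last_hidden, T, rcond=None)[0] is an equivalent,
    # more robust one-liner for rank-deficient cases.
    return W
```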
After the gradient information is obtained, the appropriate learning rate and momentum coefficient for each layer are determined in accordance with the correlation between the negative error gradient and the previous weight update of that layer [9]. The correlation coefficient between the negative gradient and the last weight update for layer l is given by

r^l(t) = \frac{\sum_i \sum_j \left( -\nabla E^l_{i,j}(t) - \overline{-\nabla E^l}(t) \right) \left( \Delta w^l_{i,j}(t-1) - \overline{\Delta w^l}(t-1) \right)}{\left[ \sum_i \sum_j \left( -\nabla E^l_{i,j}(t) - \overline{-\nabla E^l}(t) \right)^2 \right]^{1/2} \left[ \sum_i \sum_j \left( \Delta w^l_{i,j}(t-1) - \overline{\Delta w^l}(t-1) \right)^2 \right]^{1/2}}    (5)

where t indexes the presentation number, -\nabla E^l_{i,j}(t) is the negative error gradient with respect to w^l_{i,j} in layer l and \Delta w^l_{i,j}(t-1) is the previous change of weight w^l_{i,j}. \overline{-\nabla E^l}(t) and \overline{\Delta w^l}(t) are the mean values of the negative error gradients and of the weight changes in layer l, respectively. From this correlation coefficient, three different conditions can be identified.
1. When the correlation coefficient is close to one, there is almost no change in the direction of local error minimization and the weight change is likely to be moving on a plateau. The learning rate can be increased to improve the convergence rate.
2. When the correlation coefficient is close to minus one, there is an abrupt change in the direction of local error minimization and the weight change is likely to be moving along the wall of a ravine. The learning rate should then be reduced to prevent oscillation across the two sides of the ravine.
3. When there is no correlation between the negative gradient and the previous weight change, the learning rate should be kept constant.

According to these three conditions, the following heuristic algorithm is proposed to adjust the learning rate:

\eta^l(t) = \eta^l(t-1) \left( 1 + \tfrac{1}{2} r^l(t) \right)    (6)
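A minimal sketch of Eqs. (5) and (6) for one layer, assuming the negative gradients and previous weight changes are available as matrices (the zero-denominator guard is our own addition, not part of the paper):

```python
import numpy as np

def update_learning_rate(neg_grad, prev_dw, eta_prev):
    """Adapt the layer learning rate from the correlation of Eqs. (5)-(6).

    neg_grad : matrix of negative error gradients -dE/dw of layer l at step t
    prev_dw  : matrix of the previous weight changes of layer l (step t-1)
    eta_prev : learning rate of layer l at step t-1
    """
    g = neg_grad - neg_grad.mean()          # centre the negative gradients
    d = prev_dw - prev_dw.mean()            # centre the previous weight changes
    denom = np.sqrt((g ** 2).sum()) * np.sqrt((d ** 2).sum())
    r = (g * d).sum() / denom if denom > 0 else 0.0   # Eq. (5): Pearson correlation
    return eta_prev * (1.0 + 0.5 * r)                 # Eq. (6)
```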
We notice that the learning rate can increase or decrease rapidly when successive values of the correlation coefficient keep the same sign. This feature enables the appropriate learning rate to be found in a few iterations, and thus reduces the output error rapidly. Furthermore, the algorithm does not greatly increase the storage requirement, as a second-order method does, nor does it need to calculate second-order derivatives.

The convergence rate is not optimized with a fixed momentum coefficient. The momentum term has an accelerating effect only when -\nabla E^l_{i,j} and \Delta w^l_{i,j} have the same direction. With a fixed momentum coefficient, the momentum term may override the negative gradient term when -\nabla E^l_{i,j} and \Delta w^l_{i,j} are in opposite directions. The momentum coefficient \alpha^l(t) at the tth iteration is determined as follows:

\alpha^l(t) = \lambda^l(t) \, \eta^l(t) \, \frac{\left\| -\nabla E^l(t) \right\|^2}{\left\| \Delta w^l(t-1) \right\|^2}    (7)

where

\lambda^l(t) =
\begin{cases}
\eta^l(t), & \eta^l(t) < 1 \\
1, & \eta^l(t) \ge 1
\end{cases}    (8)
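Eqs. (7) and (8) can be sketched in the same illustrative style (the guard against a zero previous weight change is our own addition):

```python
import numpy as np

def momentum_coefficient(neg_grad, prev_dw, eta):
    """Momentum coefficient of Eqs. (7)-(8) for one layer at iteration t.

    neg_grad : negative error gradient matrix -dE/dw of the layer
    prev_dw  : previous weight-change matrix of the layer
    eta      : current learning rate eta^l(t) of the layer
    """
    lam = eta if eta < 1.0 else 1.0                      # Eq. (8)
    num = np.sum(neg_grad ** 2)                          # || -grad E^l(t) ||^2
    den = np.sum(prev_dw ** 2)                           # || dw^l(t-1) ||^2
    return lam * eta * num / den if den > 0 else 0.0     # Eq. (7)
```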
As \lambda^l(t) never exceeds 1, the momentum term will not override the negative gradient term. After evaluating \eta^l(t) and \alpha^l(t) for layers l = L-2, …, 1, the new weights are determined. After that, the new network output and error are evaluated, and one epoch is said to be completed.

As a fast increase in the learning rate may drive the neurons into their saturation region in some neural network problems, the following strategy is used to improve the stability of the algorithm. If the present root mean squared error (RMSE) is greater than the previous one by more than 0.1, the last weight change is canceled and \eta^l(t) is reduced by half. This strategy gives a small preference to learning rate reduction and enhances the robustness of the training process.

After the first epoch is completed, the last layer of weights is kept unchanged for a further N - 1 epochs, where N is the number of weights between the input layer and the last hidden layer. In the following N - 1 epochs, only the weights between the input and the first hidden layers, and the weights between hidden layers, are updated using the modified gradient based algorithm. After N epochs, the process is repeated as in the first N epochs until the network error reaches the specified error level.
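Putting the pieces together, one training run of the hybrid algorithm might look as follows. This is only an outline built on the sketches above, not the authors' code: backprop_hidden_gradients is an assumed helper that returns the error gradients dE/dW^l for the hidden layers, and the initial learning rate is an arbitrary choice.

```python
import numpy as np

def train_hybrid(A1, T, weights, n_epochs, target_rmse, N, eta0=0.1):
    """Illustrative outline of the hybrid training cycle described above.

    Every N epochs the output-layer weights are refreshed by least squares
    (Eq. (4)); in between, only the hidden-layer weights are updated with the
    adaptive learning rate (Eq. (6)) and momentum (Eqs. (7)-(8)).
    """
    etas = [eta0] * (len(weights) - 1)               # one learning rate per hidden layer
    prev_dw = [np.zeros_like(W) for W in weights[:-1]]
    prev_rmse = np.inf
    for epoch in range(n_epochs):
        acts = forward_pass(A1, weights)
        if epoch % N == 0:                           # least-squares step, Eq. (4)
            weights[-1] = solve_output_weights(acts[-2], T)
            acts = forward_pass(A1, weights)
        # assumed helper (not shown): list of dE/dW^l for the hidden layers
        grads = backprop_hidden_gradients(acts, weights, T)
        for l, g in enumerate(grads):                # modified gradient descent step
            etas[l] = update_learning_rate(-g, prev_dw[l], etas[l])
            alpha = momentum_coefficient(-g, prev_dw[l], etas[l])
            dw = etas[l] * (-g) + alpha * prev_dw[l]
            weights[l] = weights[l] + dw
            prev_dw[l] = dw
        rmse = np.sqrt(np.mean((forward_pass(A1, weights)[-1] - T) ** 2))
        if rmse > prev_rmse + 0.1:                   # stability guard described above
            for l in range(len(grads)):              # cancel the last weight change
                weights[l] = weights[l] - prev_dw[l]
                etas[l] = 0.5 * etas[l]              # and halve the learning rate
        prev_rmse = rmse
        if rmse <= target_rmse:
            break
    return weights
```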
2.2.1. Validation of the proposed algorithm

In this section, we compare the convergence performance of the proposed training algorithm with those of the BP, LSB and Levenberg–Marquardt (LM) algorithms. In order to demonstrate the advantage of using the least-squares method to evaluate the output weights in the proposed algorithm, our results are also compared with those of an algorithm that uses the adaptive learning rate and momentum (Eqs. (5)–(8)) to evaluate the weights of all layers, which we call the adaptive backpropagation (ABP) algorithm. We applied these algorithms to the following benchmark problems:

1. learning the Mackey–Glass time series;
2. sunspot series prediction.
Fig. 1. The learning curves of the different algorithms (Proposed, LSB, ABP, BP, LM) on the Mackey–Glass problem: RMS error versus number of Mflops.
All the networks are started with random weights between -1 and 1. In each problem, 180 simulations were performed using different initial weights, learning rates and momenta. All algorithms were written as Matlab scripts and function files. The BP and LM algorithms are those provided in the Matlab Neural Network Toolbox.

The first problem is to learn the Mackey–Glass (MG) series [10]. A discrete-time representation of the time series is as follows:

x(k+1) - x(k) = \frac{0.2\, x(k-\tau)}{1 + x^{10}(k-\tau)} - 0.1\, x(k), \quad \tau = 17    (9)
Two networks were used to predict the value of the MG series at a future time x(k+1) from the most recent four consecutive data points of the MG series, i.e. the sequence x(k-3), x(k-2), x(k-1), x(k) is used to predict x(k+1). Three hundred patterns were generated for training and 200 patterns were generated for testing the performance of the trained network. The number of hidden neurons is chosen to be 30, so that the network architectures are 4–30–1. The termination criterion is chosen to be RMSE = 0.001.
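For reference, the training data can be generated directly from Eq. (9); the initial condition and warm-up length below are our own illustrative choices, as the paper does not state them:

```python
import numpy as np

def mackey_glass_series(n_points, tau=17, x0=1.2, warmup=500):
    """Generate the discrete-time Mackey-Glass series of Eq. (9).

    x0 and warmup are illustrative choices; the paper does not specify them.
    """
    x = np.full(warmup + n_points, x0)
    for k in range(tau, len(x) - 1):
        x[k + 1] = x[k] + 0.2 * x[k - tau] / (1.0 + x[k - tau] ** 10) - 0.1 * x[k]
    return x[warmup:]

def make_patterns(series, n_inputs=4):
    """Build (x(k-3), x(k-2), x(k-1), x(k)) -> x(k+1) input/target pairs."""
    X = np.array([series[k - n_inputs + 1 : k + 1]
                  for k in range(n_inputs - 1, len(series) - 1)])
    y = series[n_inputs:]
    return X, y.reshape(-1, 1)

series = mackey_glass_series(600)
X, y = make_patterns(series)
X_train, y_train = X[:300], y[:300]        # 300 training patterns
X_test, y_test = X[300:500], y[300:500]    # 200 test patterns
```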
The results for the Mackey–Glass time series are shown in Table 1 and Fig. 1. The proposed algorithm requires 1.102 Mflops to achieve RMSE = 0.001, whereas 7.751 and 499.7 Mflops are required by the LSB and LM algorithms, respectively. The total number of flops used by the proposed algorithm to train the network is only 0.221% of that used by the LM algorithm; the improvement is spectacular. The computational complexity of the employed algorithms in this problem scales as Proposed:ABP:LSB:LM:BP = 3.521:1.0047:5.454:71.89:1.

The second problem is to predict the sunspot activities [11]. The network architecture is chosen to be 12–8–1. The data are divided into three groups. The first group contains the data from 1700 to 1919 and has 208 patterns. The second and third groups contain the data from 1920 to 1955 and from 1956 to 1979, respectively. The first group is used to train the neural networks; the other two groups are used as test sets. The termination criterion is RMSE = 0.06. The results for the sunspot series prediction are shown in Table 2 and Fig. 2. Only 5.987 Mflops are required for the proposed algorithm to train the network to the termination criterion. The number of flops used by the proposed algorithm is only 16.0% of that used by the LM algorithm. In the sunspot problem, the network trained by the LSB algorithm cannot achieve the termination criterion and gets stuck at an average RMSE of 0.0842. The computational complexity of the algorithms in this problem scales as Proposed:ABP:LSB:LM:BP = 1.517:1.0082:2.939:54.77:1.
Table 2
The performances of different algorithms on the prediction of sunspots

Method     Mflops to achieve RMSE = 0.06    Training error    Test 1 error    Test 2 error
Proposed   5.987                            0.0590            0.0680          0.151
LSB        Fail                             Not available     Not available   Not available
LM         37.37                            0.0574            0.0720          0.153
ABP        91.71                            0.0600            0.0660          0.138
BP         1680                             0.06              0.0665          0.139
Fig. 2. The learning curves of the different algorithms (Proposed, LSB, ABP, BP, LM) on the prediction of sunspots: RMS error versus number of Mflops.
3. Crowd estimation at underground stations

Fig. 3. An example of the original image from the camera at the underground station in Hong Kong.

Fig. 4. The images of the passengers' 'edges' extracted by the edge detector.
Recent efforts in crowd estimation at underground stations, airports or stadia are currently addressed in the research field of automatic surveillance systems [12–14]. Most current visual surveillance systems use a set of cameras to provide human operators with visual information. Decision making is rather difficult because the scenes from the different cameras and the different areas to be controlled are presented to the operator sequentially on a set of monitors in the control room. In this section, the simulation results of the neural network based crowd estimation using the proposed algorithm are presented. The system validation was performed by estimating crowd densities from sequences of images acquired at an underground station in Hong Kong.

The estimation process is articulated in three main steps. Firstly, sequences of images are captured from a closed circuit television system and sampled at a frequency of about 1 Hz. Secondly, feature extraction is required to map the visual information coming from the images into a feature space. Three indices, namely the image edge, crowd object and background densities, are fed to the neural network, trained by the proposed algorithm, to decide whether the platform is overcrowded or not. The network topology is chosen as 3 input, 15 hidden and 1 output neurons, and the training processes were terminated when the root mean squared (RMS) error reached 0.1. Fig. 3 depicts the original image in a particular time frame at an underground station. The length of the edges of the crowd objects is evaluated by an edge detector, which counts the number of pixels of the passengers' 'edges'; the resulting image is shown in Fig. 4. The crowd object density is represented by the number of image pixels corresponding to the crowd, which should be separated from those of the background objects. The separation is based on masking and filtering out the background information, and the resulting image is shown in Fig. 5. Similarly, the background density can be evaluated by the above processes using a different threshold condition.

Fig. 5. An image of the crowd object density obtained by image masking and filtering.
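The paper does not give implementation details of this feature extraction; the following sketch merely illustrates the three indices described above, using OpenCV and NumPy with thresholds and function names that are our own assumptions:

```python
import cv2
import numpy as np

def crowd_features(frame, background, edge_lo=100, edge_hi=200, diff_thresh=30):
    """Illustrative extraction of the three indices described above.

    frame      : greyscale platform image
    background : greyscale reference image of the empty platform
    Threshold values are illustrative; the paper does not specify them.
    """
    n_pixels = frame.size
    # Edge density: fraction of pixels marked by an edge detector (cf. Fig. 4).
    edges = cv2.Canny(frame, edge_lo, edge_hi)
    edge_density = np.count_nonzero(edges) / n_pixels
    # Crowd object density: pixels that differ from the background after
    # masking and filtering out the background information (cf. Fig. 5).
    diff = cv2.absdiff(frame, background)
    crowd_mask = cv2.medianBlur((diff > diff_thresh).astype(np.uint8), 5)
    crowd_density = np.count_nonzero(crowd_mask) / n_pixels
    # Background density: the same difference image under a different threshold.
    background_density = np.count_nonzero(diff <= diff_thresh) / n_pixels
    return np.array([edge_density, crowd_density, background_density])
```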
The estimation performance obtained by the proposed algorithm is compared with the real crowd density on a particular sequence recorded during the rush hour. The result is shown in Fig. 6. On average, over 90% accuracy in estimating the crowd density is obtained, and this promising result is able to meet the end-users' accuracy and robustness requirements.

Fig. 6. Estimated results of the neural network trained by the proposed algorithm (solid line) versus the real crowd densities (dashed line) over the sequence of samples.

4. Conclusions
A hybrid fast training algorithm for feedforward networks is proposed. In this algorithm, the weights connecting the last hidden and output layers are first evaluated by the least-squares algorithm, whereas the weights between the input and hidden layers are evaluated using a modified gradient descent algorithm. The performance of the proposed algorithm is demonstrated by applying it to the prediction of the Mackey–Glass and sunspot time series. The results show that the number of flops required for convergence is much smaller than that required by the other well-known algorithms. The proposed algorithm is also applied to crowd estimation at underground stations, and very promising results are obtained.
References

[1] Chow WS, Leung CT. Nonlinear autoregressive integrated neural network model for short-term load forecasting. IEE Proc Gener Transm Distrib 1996;143(5):500–506.
[2] Bishop CM. Neural networks for pattern recognition. New York: Oxford University Press, 1995.
[3] Barnard E. Optimization for training neural nets. IEEE Trans Neural Networks 1992;3(2):232–240.
[4] Hagan MT, Menhaj MB. Training feedforward networks with the Marquardt algorithm. IEEE Trans Neural Networks 1994;5(6):989–993.
[5] Ergezinger S, Thomsen E. An accelerated learning algorithm for multilayer perceptrons: optimization layer by layer. IEEE Trans Neural Networks 1995;6(1):31–42.
[6] Biegler-König F, Bärmann F. A learning algorithm for multilayered neural networks based on linear least squares problems. Neural Networks 1993;6:127–131.
[7] Yam YF, Chow WS. Accelerated training algorithm for feedforward neural networks based on least-squares method. Neural Process Lett 1995;2(4):20–25.
[8] Golub GH, Van Loan CF. Matrix computations. 2nd ed. Baltimore, MD: The Johns Hopkins University Press, 1989.
[9] Yam YF, Chow WS. Extended backpropagation algorithm. Electron Lett 1993;29(19):1701–1702.
[10] Mackey MC, Glass L. Oscillations and chaos in physiological control systems. Science 1977;197:287–289.
[11] Fröhlinghaus T, Weichert A, Ruján P. Hierarchical neural networks for time-series analysis and control. Network 1994;5:101–116.
[12] Davies AC, Yin JH, Velastin SA. Crowd monitoring using image processing. Electron Commun Engng J 1995;February:37–47.
[13] Regazzoni CS, Tesei A, Murino V. A real-time vision system for crowding monitoring. In: Proceedings of the IECON'93, 1993, pp. 1860–1864.
[14] Regazzoni CS, Tesei A. Distributed data fusion for real-time crowding estimation. Signal Process 1996;53:47–63.