Pattern Recognition 33 (2000) 2075–2081
Investigating the prediction capabilities of the simple recurrent neural network on real temporal sequences

Lalit Gupta*, Mark McAvoy
Department of Electrical Engineering, College of Engineering, Southern Illinois University, Carbondale, IL 62901, USA
Neuro-Imaging Laboratory, Washington University, St. Louis, MO 63110, USA

Received 30 March 1999; accepted 25 August 1999
Abstract

The goal of this paper is to evaluate the prediction capabilities of the simple recurrent neural network (SRNN). The main focus is on the prediction of non-orthogonal vector components of real temporal sequences. A prediction problem is formulated in which the input is a component of a real sequence and the output is a prediction of the next component of the sequence. A method is developed to train a single SRNN to predict the components of sequences belonging to multiple classes. The selection of a distinguishing initial context vector for each class is proposed to improve the prediction performance of the SRNN. A systematic method to re-train the SRNN with noisy exemplars is developed to improve the prediction generalization of the network. Through the methods developed in the paper, it is demonstrated that: (a) a single SRNN can be trained to predict, contextually, the components of real temporal sequences belonging to different classes, (b) the prediction error of the SRNN can be decreased by using a distinguishing initial context vector for each class, and (c) the prediction generalization of the SRNN can be increased significantly by re-training the network with noisy exemplars. © 2000 Pattern Recognition Society. Published by Elsevier Science Ltd. All rights reserved.

Keywords: Recurrent neural network; Prediction; Initial context vector; Network retraining
1. Introduction

The focus of this paper is to evaluate the prediction capabilities of the simple recurrent neural network (SRNN) on real-valued temporal sequences. Neural networks with different architectures have been proposed to process temporal sequences. Examples of such networks include the modular [1,2], fully recurrent [3–6], and partially recurrent [7–9] networks. The SRNN is an example of a partially recurrent neural network derived from the multilayer perceptron. In this network, hidden unit outputs delayed by one time unit are fed back as context units using fixed unity weights [8,9]. The network is trained using the backpropagation training rule. Numerous studies have focused on demonstrating the effectiveness of the SRNN in predicting the orthonormal vector components of noise-free binary temporal sequences [7–12]. However, systematic studies have not been conducted to investigate the prediction capabilities of the SRNN on long noise-free and noisy real temporal sequences belonging to different classes. Therefore, the specific goals of this paper are to investigate the following: (a) Can a single SRNN be trained to accurately predict the non-orthogonal vector components of long multi-class real sequences which may have an unequal number of components? (b) Can an SRNN accurately predict the components of such real sequences if the network is presented with noisy inputs? (c) How can the prediction generalization of the SRNN be improved?

* Corresponding author. Tel.: +1-618-536-2364; fax: +1-618-453-7972.
2. Simple recurrent neural network architecture

The specific architecture of the SRNN used in this study is shown in Fig. 1. The network consists of two layers of neurons: the output layer and the context layer.
Fig. 1. The SRNN architecture.

The input to the network consists of a concatenation of the externally applied input X(n) and the context input H(n−1), which is the prior output of the context layer. The network is fully inter-connected between the output and context layers and between the context layer and the concatenated inputs. If W^{HY}, W^{XH}, and W^{HH} represent the inter-connection matrices between the units in the output and context layer, the context layer and the external input, and the context layer and the context input, respectively, then the output of the ith unit in the output layer is given by

y_i(n) = 1 / (1 + exp[−W_i^{HY} H(n)]),   i = 1, 2, ..., N,

and the output of the jth unit in the context layer is given by

h_j(n) = 1 / (1 + exp{−[W_j^{HH} H(n−1) + W_j^{XH} X(n)]}),   j = 1, 2, ..., H,

where W_i^{HY} is the ith row of the inter-connection matrix W^{HY}, W_j^{HH} is the jth row of W^{HH}, and W_j^{XH} is the jth row of W^{XH}.
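As an illustration of the recursion defined by these equations, the following is a minimal sketch of one SRNN time step and of a pass over a sequence of components, assuming NumPy; the function and variable names (srnn_step, srnn_forward, W_xh, W_hh, W_hy, phi) are illustrative and not taken from the paper.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def srnn_step(x, h_prev, W_xh, W_hh, W_hy):
    """One SRNN time step.

    x      : (N_in,)    external input X(n)
    h_prev : (H,)       previous context H(n-1)
    W_xh   : (H, N_in)  external input -> context layer
    W_hh   : (H, H)     previous context -> context layer
    W_hy   : (N_out, H) context layer -> output layer
    Returns (y, h): the output Y(n) and the new context H(n).
    """
    h = sigmoid(W_hh @ h_prev + W_xh @ x)   # h_j(n) equation above
    y = sigmoid(W_hy @ h)                   # y_i(n) equation above
    return y, h

def srnn_forward(components, phi, W_xh, W_hh, W_hy):
    """Run the SRNN over a sequence of components, starting from the
    initial context vector phi, and collect the successive outputs."""
    h = np.asarray(phi, dtype=float).copy()
    outputs = []
    for x in components:
        y, h = srnn_step(np.asarray(x, dtype=float), h, W_xh, W_hh, W_hy)
        outputs.append(y)
    return np.array(outputs), h
```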
3. The multi-class prediction problem

The prediction problem is formulated as a task in which the input is a component of a real temporal sequence and the output is a prediction of the next component of the sequence. This particular formulation is selected because it is similar to the orthonormal binary vector prediction problem most frequently used in SRNN studies related to grammatical inference and learning [8–12]. The multi-class problem involves predicting, in the order of occurrence, the components of sequences from different classes. For a C-class problem, let the sequence from class ω_i, i = 1, 2, ..., C, be represented by

G_i = {g_{ik}(j)},   k = 1, 2, ..., K_i,   j = 1, 2, ..., J,

where g_{ik} is the kth component in G_i, K_i is the number of components in G_i, and J is the number of samples in each component. Given these notations, the prediction task is to train a single SRNN to predict component g_{ik} given the input component g_{i(k−1)} for k = 2, 3, ..., K_i and i = 1, 2, ..., C.
4. Network training

The exact method developed for training a single SRNN for the multi-class prediction problem is as follows. The network is trained component by component with concatenated sequences of the form

[Φ, G_1; Φ, G_2; ...; Φ, G_C; Φ, G_1; Φ, G_2; ...; Φ, G_C; Φ, G_1; ...; Φ, G_C; ...],

where Φ is the vector of values initially stored in the context units. The vector Φ is also used to separate the sequences; that is, the context vector is reset to Φ when the next sequence is presented during training. Convergence in training is tested for each input–output pair of components by comparing the squared difference between the component g_{ik} and its prediction ĝ_{ik} with a convergence threshold τ (hereafter, predicted components will be represented using the symbol ^). Therefore, training terminates only when [g_{ik} − ĝ_{ik}]² ≤ τ for all k and i.

The prediction error of a test sequence T_i = {t_{ik}(j)} from class ω_i is given by

E_i = (1/K_i) Σ_k [g_{ik} − t̂_{ik}]²,   (1)

where the sum runs over the predicted components of the sequence.
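To show how a trained network is scored, the sketch below assumes a per-step forward function with the interface of the srnn_step shown earlier and computes the prediction error of Eq. (1) for one test sequence. The per-sample averaging inside each component is an assumption, since Eq. (1) does not spell out that normalization, and the helper names are illustrative.

```python
import numpy as np

def predict_components(step, components, phi):
    """Predict components 2..K of a test sequence.

    step       : callable(x, h) -> (y, h), one time step of a trained SRNN
    components : (K, J) array of the ordered components t_{i1}, ..., t_{iK}
    phi        : (H,) initial context vector used for this class
    """
    h = np.asarray(phi, dtype=float).copy()
    preds = []
    for x in np.asarray(components, dtype=float)[:-1]:   # last component has no successor
        y, h = step(x, h)
        preds.append(y)
    return np.array(preds)        # predictions of t_{i2}, ..., t_{iK}

def prediction_error(clean_components, preds):
    """Eq. (1): average squared difference between the noise-free components
    g_{i2}, ..., g_{iK} and the network predictions; samples within each
    component are also averaged (an assumption about the normalization)."""
    targets = np.asarray(clean_components, dtype=float)[1:]
    return float(np.mean((targets - preds) ** 2))

def converged(squared_errors, tau=0.005):
    """Training termination test: every per-component squared error <= tau."""
    return max(squared_errors) <= tau
```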
This training method can also be used to train the network given a set of training exemplars G_i^q = {g_{ik}^q(j)}, q = 1, 2, ..., Q_i, for each class i, where Q_i is the number of exemplars in the ith class. For this case, given an input g_{i(k−1)}^q, the network is trained to output the noise-free component g_{ik} (if it is known) or the centroid ḡ_{ik} computed from the exemplar set. The
network is trained with concatenated sequences of the form

[Φ, G_1^1; Φ, G_1^2; ...; Φ, G_1^{Q_1}; Φ, G_2^1; ...; Φ, G_2^{Q_2}; ...; Φ, G_C^1; ...; Φ, G_C^{Q_C}].

The prediction error of a test sequence is computed using Eq. (1) with g_{ik} replaced by ḡ_{ik} if the network is trained to output the centroid.
5. Multi-class sequences used for prediction evaluations

In order to evaluate the prediction capabilities of the SRNN, temporal sequences with non-orthogonal real components were derived from the four simulated targets shown in Fig. 2 using the localized contour sequence representation (LCS) [13].
Fig. 2. Simulated targets.
In the LCS, each pixel on the boundary is represented by the perpendicular Euclidean distance between the boundary pixel and the chord connecting the end-points of an odd-numbered window centered on the boundary pixel. The LCSs of the four targets shown in Fig. 2, normalized to take values between 0 and 1, are shown in Fig. 3. To aid visualization, the sequences are displayed as continuous signals. For the experiments conducted in this paper, the components were selected as the 20-point segments between the dashed vertical lines in the figures. The sequence derived from target F_i is denoted by G_i, i = 1, 2, 3, 4. The temporal properties of these four sequences are demonstrated by using a representation of the form

G_1: {g, g, g, g, g, g, g, a, b},
G_2: {b, g, g, b, g, g, g, g, a, b},
G_3: {b, b, c, g, g, g, g, g, c, g, g, g, g, b},
G_4: {c, g, g, g, d, g, g, g, g, g, d, g, g, g, b, g}.

Dissimilar components are denoted by g (with their class and position subscripts g_{ik}) and similar components are denoted by a, b, .... Observe that the resulting sequences have an unequal number of components and also have similar intra- and inter-class components. For example, the component b occurs in all four sequences and also occurs thrice in sequences G_2 and G_3.
Fig. 3. Localized contour sequences of the targets shown in Fig. 2.
Additionally, the occurrences of adjacent pairs of components of the form (a, b) and (a, j) show the need for the prediction of b and j to be contextual. That is, the SRNN must be capable of producing different output components for a given input component.
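Returning to the LCS construction described at the start of this section, the following is a minimal sketch of how the representation can be computed from an ordered closed boundary, assuming NumPy; the function name, window size, and normalization step are illustrative rather than the exact implementation of [13].

```python
import numpy as np

def localized_contour_sequence(boundary, window=21):
    """Localized contour sequence (LCS) of a closed boundary.

    boundary : (N, 2) array of (x, y) boundary pixel coordinates, in order
    window   : odd window size; the chord joins the two end-points of the
               window centered on each boundary pixel
    Returns an (N,) array of perpendicular distances from each boundary
    pixel to its chord.
    """
    assert window % 2 == 1, "window size must be odd"
    pts = np.asarray(boundary, dtype=float)
    n = len(pts)
    half = window // 2
    lcs = np.empty(n)
    for i in range(n):
        p = pts[i]
        a = pts[(i - half) % n]           # window end-points (closed contour)
        b = pts[(i + half) % n]
        chord = b - a
        # perpendicular distance from p to the line through a and b
        num = abs(chord[0] * (a[1] - p[1]) - chord[1] * (a[0] - p[0]))
        den = np.hypot(chord[0], chord[1])
        lcs[i] = num / den if den > 0 else 0.0
    return lcs

# Normalization to [0, 1], as used for the sequences shown in Fig. 3:
# lcs = lcs / lcs.max()
```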
A noisy sequence G̃_i was generated from G_i according to

G̃_i = {g̃_{ik}(j)} = {g_{ik}(j) + n_{ik}(j)},

where n_{ik}(j) is a sample drawn from a Gaussian density function with zero mean and variance σ_i². The variance σ_i² was specified to generate a noisy sequence from class ω_i with a specified mean-square signal-to-noise ratio given by

SNR_i (dB) = 10 log_10 { [1/(J K_i)] Σ_{k=1}^{K_i} Σ_{j=1}^{J} [g_{ik}(j)]² / σ_i² }.

Examples of noisy sequences G̃_i are shown in Fig. 4.
Fig. 4. Examples of noisy localized contour sequences: (a) SNR = 4 dB; (b) SNR = 0 dB.
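The noise model above fixes σ_i² from the desired SNR. A minimal sketch of generating a noisy sequence at a specified mean-square SNR is given below, assuming NumPy; the function name and interface are illustrative.

```python
import numpy as np

def add_noise_at_snr(sequence, snr_db, rng=None):
    """Add zero-mean Gaussian noise to a sequence at a specified
    mean-square signal-to-noise ratio (Section 5).

    sequence : (K, J) array of K components with J samples each
    snr_db   : desired SNR in dB, where SNR = 10 log10( mean(g^2) / sigma^2 )
    """
    rng = np.random.default_rng() if rng is None else rng
    g = np.asarray(sequence, dtype=float)
    mean_square_signal = np.mean(g ** 2)
    sigma2 = mean_square_signal / (10.0 ** (snr_db / 10.0))   # solve the SNR equation for sigma^2
    noise = rng.normal(0.0, np.sqrt(sigma2), size=g.shape)
    return g + noise

# Example: 100 noisy test sequences for one class at SNR = 4 dB
# noisy_set = [add_noise_at_snr(G_i, 4.0) for _ in range(100)]
```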
6. Evaluating the prediction capabilities of the SRNN

Given that each component consists of 20 samples, the dimension of the input vector and the number of output units in the SRNN were selected to be 20. The number of context units was varied systematically in a preliminary set of experiments to determine the minimum number of context units for the prediction experiments. Several networks were trained and tested using noise-free and noisy sequences. For the range of SNRs considered, it was determined that a minimum dimension of 40 was required for convergence given a convergence threshold τ = 0.005. Therefore, the network dimension was 20 input units, 40 context units, 40 hidden neuron units, and 20 output neuron units.

6.1. Selection of Φ

The standard rule used is to assign an initial value equal to 0.5 to the elements of Φ [8–12]. It must be emphasized that no justification is provided for selecting this initial value. It is presumed that 0.5 is selected because the context units can only take values in the interval (0, 1) and 0.5 is the mid-point of this interval. A careful analysis of the results using Φ = 0.5 during testing showed that the network prediction of the initial (first few) components of each test sequence was poor. This result is not unexpected because the same context is used to predict the initial components of different sequences. That is, not enough distinguishing context is built into the network for the accurate prediction of the initial components of different sequences using Φ = 0.5. This adverse effect increases when the dimension of the context vector, relative to the input vector, is increased because the context vector then has a greater influence on the prediction [14]. Instead of using Φ = 0.5, a different Φ_i can be used for each class ω_i to aid the network in building distinguishing context. If the dimension of the context vector is H, it is proposed that Φ_i is selected to be the first H elements of the sequence G_i. In this approach, the SRNN is trained using the vector Φ_i as the initial context vector as well as the separator between sequences during training. Additionally, the vector Φ_i is used as the initial context vector for testing sequences belonging to class ω_i.
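A minimal sketch of the proposed selection of Φ_i follows, assuming NumPy and the (K, J) sequence layout used earlier; the function name is illustrative.

```python
import numpy as np

def class_initial_context(sequence, n_context=40):
    """Proposed class-specific initial context vector Phi_i: the first H
    samples of the class sequence G_i, where H is the number of context
    units (40 in this study)."""
    flat = np.asarray(sequence, dtype=float).reshape(-1)   # samples in temporal order
    return flat[:n_context].copy()

# With 20-sample components and H = 40, Phi_i consists of the first two
# components of G_i.
```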
6.2. Prediction evaluation

In order to obtain accurate estimates of the prediction error, 10 independent networks with randomly initialized weights were trained. For each class, 100 noisy sequences were generated using a fixed SNR and tested on each network. For a given SNR, the prediction error was estimated by averaging the prediction errors of the 400 noisy sequences on each network and then averaging the results across the 10 networks.

In the first set of experiments, the networks were fully trained with the components of the noise-free sequences using Φ = 0.5 and Φ = Φ_i. Given that the SRNNs were successfully trained to predict the components of the four noise-free sequences, experiments were designed to evaluate the prediction generalization when the networks were presented with noisy inputs. The ordered components of a noisy test sequence from a known class were presented to the network input. The ordered output of the network was compared with the ordered components of the noise-free sequence of the same class as the input. The prediction error for each sequence tested was computed using Eq. (1). The averaged prediction errors are shown in row 1 of Tables 1 and 2, where the symbol * is used to denote the noise-free cases. It must be noted that each result presented in the tables, other than the noise-free test cases, is an average computed across [(100 sequences/class) × (4 classes) × (10 networks)] = 4000 tests. For the noise-free test cases, the results were averaged across [(1 sequence/class) × (4 classes) × (10 networks)] = 40 tests.

The next set of experiments was aimed at improving the prediction generalization of the networks. The approach developed in this paper is similar to the re-training method used to improve the generalization capabilities of the multilayer perceptron [15]. The SRNN was re-trained systematically with gradually increasing levels of noise in the training set. The noise-free trained network was re-trained with a low-noise (SNR = 10 dB) exemplar training set and was retested using the original set of noisy test sequences. These results are shown in row 2 of Tables 1 and 2. This network was then re-trained with a training exemplar set with a higher noise level (SNR = 8 dB) and retested on the same set of noisy test sequences; the results are shown in row 3 of Tables 1 and 2. Results for additional re-training using SNR = 6 dB and retesting are shown in row 4 of Tables 1 and 2.
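The overall protocol of this subsection can be summarized as a loop over the re-training noise schedule. The sketch below is only an outline of that loop under assumed, hypothetical interfaces (retrain and evaluate callables supplied by the experimenter); it is not the paper's implementation.

```python
import numpy as np

SNR_SCHEDULE = [10.0, 8.0, 6.0]   # re-training noise levels used in the paper (dB)

def retrain_and_evaluate(network, retrain, evaluate, test_snrs):
    """Systematic re-training with gradually increasing noise (Section 6.2).

    network   : an SRNN already trained on the noise-free sequences
    retrain   : callable(network, snr_db) -> network re-trained with noisy
                exemplars generated at snr_db (hypothetical interface)
    evaluate  : callable(network, test_snr) -> average prediction error over
                the noisy test set at test_snr (hypothetical interface)
    test_snrs : list of test SNRs in dB (None may denote the noise-free case)

    Returns one row of prediction errors per training stage, in the layout
    of Tables 1 and 2.
    """
    rows = [[evaluate(network, s) for s in test_snrs]]       # noise-free training row
    for snr in SNR_SCHEDULE:
        network = retrain(network, snr)                      # re-train with noisier exemplars
        rows.append([evaluate(network, s) for s in test_snrs])
    return np.array(rows)
```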
7. Discussion of results

Clearly, the prediction errors presented in Tables 1 and 2 are a function of the temporal sequences used in the experiments. That is, different results will be obtained with different sets of temporal sequences. However, because the same test data are used to derive each of the results in Tables 1 and 2, meaningful comparisons can be made to demonstrate the improvement in performance obtained with the proposed methods.
Table 1
Prediction error using Φ = 0.5

Training SNR (dB)   Test SNR (dB)
                    *          10         8          6          4          2          0
*                   0.042140   0.053582   0.063006   0.075587   0.091142   0.107768   0.123843
10                  0.010102   0.012832   0.014909   0.017422   0.021491   0.027527   0.036212
8                   0.010929   0.013034   0.014440   0.015822   0.018288   0.022174   0.027702
6                   0.014766   0.016141   0.017384   0.018292   0.020121   0.023130   0.027056

Table 2
Prediction error using Φ = Φ_i

Training SNR (dB)   Test SNR (dB)
                    *          10         8          6          4          2          0
*                   0.018264   0.037477   0.049159   0.064072   0.080389   0.098454   0.115292
10                  0.002260   0.004369   0.005951   0.008077   0.011595   0.016941   0.024173
8                   0.000725   0.001583   0.002217   0.003122   0.004793   0.007724   0.012332
6                   0.000396   0.000721   0.000997   0.001412   0.002152   0.003692   0.006502

(The symbol * denotes the noise-free case.)
For a given test SNR, the prediction improvement factor e (dB) = 20 log_10 [E_0 / E] is defined to measure the improvement in prediction obtained through network re-training and through the use of Φ_i. E_0 is the prediction error of the network using the standard initial context vector Φ = 0.5 and the standard method of training the network without additional re-training (i.e., the results in row 1 of Table 1). The prediction improvement factors are plotted in Fig. 5. The dashed curves correspond to the results for the initial context vector Φ = 0.5 and the solid curves correspond to the results for Φ = Φ_i. These results show that the prediction of the network can be improved using the proposed initial context vector Φ = Φ_i, and that prediction can be improved quite significantly through the use of the proposed systematic re-training method. It should also be noted that, unlike the prediction results using Φ = Φ_i, the results using Φ = 0.5 are not consistent. That is, the prediction error does not always decrease when the network is re-trained with increasing noise levels.

Fig. 5. Comparison of the results in Tables 1 and 2.
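For completeness, the improvement factor defined above is a one-line computation; the sketch below is a minimal illustration, with the example call drawing its inputs from the reconstructed tables rather than from any value reported in Fig. 5.

```python
import numpy as np

def improvement_factor_db(e_baseline, e):
    """Prediction improvement factor e (dB) = 20 log10(E0 / E), where E0 is the
    error of the baseline network (Phi = 0.5, no re-training) and E is the error
    of the configuration being compared."""
    return 20.0 * np.log10(e_baseline / e)

# Example: baseline from row 1 of Table 1 versus row 4 of Table 2
# (noise-free test column):
# improvement_factor_db(0.042140, 0.000396)
```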
8. Conclusions
The goal of this paper was to evaluate the prediction capabilities of the SRNN. Because SRNNs have been shown to be quite effective in predicting the orthonormal vector components of noise-free binary sequences, the main focus was on the prediction of non-orthogonal vector components of real sequences. A prediction problem, similar to the one used for the SRNN in grammatical inference studies, was formulated in which the input is a component of a real temporal sequence and the output is a prediction of the next component of the sequence.
Through the methods developed in the paper, it was demonstrated that: (a) A single SRNN can be fully trained to predict the components of the sequences belonging to different classes. The network was trained using pairs of components and was never presented with entire sequences during training. Even though the SRNN was trained on multi-class sequences with similar intra- and inter-class components, it was capable of predicting the components of each sequence contextually. (b) The prediction error of the SRNN can be decreased by using a distinguishing initial context vector for each class. This result is especially significant because the selection of the initial context vector for improving the prediction performance of the SRNN has not been investigated. (c) The prediction generalization of the SRNN can be increased by re-training the network with noisy exemplars. The improvement in prediction was quite significant when the re-training method was combined with the initial context vector Φ = Φ_i. It could, therefore, be concluded that the SRNN is not only effective in predicting the components of binary temporal sequences but is also capable of robustly predicting the components of real temporal sequences.
References

[1] A. Khotanzad, R. Afkhami-Rohani, D. Maratukulam, ANNSTLF - Artificial neural network short-term load forecaster - Generation Three, IEEE Trans. Power Systems 13 (4) (1998).
[2] A. Kehagias, V. Petridis, Predictive modular neural networks for time series classification, Neural Networks 10 (1997) 31–49.
[3] P.J. Werbos, Backpropagation through time: what it does and how it does it, Proc. IEEE 78 (1990) 1550–1560.
[4] L.B. Almeida, Backpropagation in perceptrons with feedback, in: R. Eckmiller, C. von der Malsburg (Eds.), Neural Computers, Springer, Berlin, 1988, pp. 199–208.
[5] B.A. Pearlmutter, Learning state space trajectories in recurrent neural networks, Neural Comput. 1 (2) (1989) 263–269.
[6] R.J. Williams, D. Zipser, A learning algorithm for continually running fully recurrent neural networks, Neural Comput. 1 (2) (1989) 270–280.
[7] M.I. Jordan, Supervised learning systems with excess degrees of freedom, Massachusetts Institute of Technology, COINS Technical Report 88-27, May 1988.
[8] J.L. Elman, Finding structure in time, Cognitive Sci. 14 (1990) 179–211.
[9] J.L. Elman, Distributed representations, simple recurrent networks, and grammatical inference, Machine Learning 7 (2/3) (1991) 19–121.
[10] R.F. Port, Representation and recognition of temporal patterns, Connection Sci. 1–2 (1990) 15–76.
[11] A. Cleeremans, D. Servan-Schreiber, J.D. McClelland, Finite state automata and simple recurrent networks, Neural Comput. 1 (3) (1989) 372–381.
[12] J. Ghosh, V. Karamcheti, Sequence learning with recurrent networks: analysis of internal representations, SPIE Vol. 1710, Science of Artificial Neural Networks, 1992, pp. 449–460.
[13] L. Gupta, T. Sortrakul, A. Charles, P. Kisatsky, Robust automatic target recognition using a localized boundary representation, Pattern Recognition 28 (10) (1995) 1587–1598.
[14] R. Blake, Analysis of sequence prediction in recurrent neural networks, M.S. Thesis, Department of Electrical Engineering, Southern Illinois University, Carbondale, 1996.
[15] L. Gupta, M.R. Sayeh, R. Tammana, A neural network approach to robust shape classification, Pattern Recognition 23 (6) (1990) 563–568.
About the Author - LALIT GUPTA received the B.E. (Hons.) degree in electrical engineering from the Birla Institute of Technology and Science, Pilani, India in 1976, the M.S. degree in digital systems from Brunel University, Middlesex, England in 1981, and the Ph.D. degree in electrical engineering from Southern Methodist University, Dallas, Texas in 1986. Since 1986, he has been with the Department of Electrical Engineering, Southern Illinois University at Carbondale, where he is currently an Associate Professor. His research interests include neural networks, computer vision, pattern recognition, and digital signal processing. Dr. Gupta serves as an associate editor for Pattern Recognition and is a member of the Pattern Recognition Society, the Institute of Electrical and Electronics Engineers, and the International Neural Network Society.

About the Author - MARK MCAVOY received the B.S. degree in electrical engineering from the University of Illinois at Urbana-Champaign in 1991, the M.S. degree in electrical engineering from Southern Illinois University, Carbondale, in 1993, and the Ph.D. degree in engineering science (electrical engineering) from Southern Illinois University, Carbondale in 1998. He is currently at the Neuro-Imaging Laboratory, Washington University, St. Louis. His research interests include neural networks, pattern recognition, digital signal processing, and electrophysiology.