A new boosting algorithm for improved time-series forecasting with recurrent neural networks


Information Fusion 9 (2008) 41–55 www.elsevier.com/locate/inffus

Mohammad Assaad, Romuald Boné, Hubert Cardot

Université François Rabelais de Tours, Laboratoire d'Informatique, 64 Avenue Jean Portalis, 37200 Tours, France

Received 2 December 2005; received in revised form 27 October 2006; accepted 30 October 2006. Available online 14 December 2006.

Abstract

Ensemble methods for classification and regression have attracted a great deal of attention in recent years. They have been shown, both theoretically and empirically, to perform substantially better than single models on a wide range of tasks. We have adapted an ensemble method to the problem of predicting future values of time series, using recurrent neural networks (RNNs) as base learners. The improvement is made by combining a large number of RNNs, each of which is generated by training on a different set of examples. This algorithm is based on the boosting algorithm, in which difficult points of the time series are concentrated on during the learning process; however, unlike the original algorithm, we introduce a new parameter for tuning the boosting influence on available examples. We test our boosting algorithm for RNNs on single-step-ahead and multi-step-ahead prediction problems. The results are then compared to other regression methods, including those of different local approaches. The overall results obtained through our ensemble method are more accurate than those obtained through the standard method, backpropagation through time, on these datasets, and remain significantly better even when long-range dependencies play an important role.

© 2006 Elsevier B.V. All rights reserved.

Keywords: Learning algorithm; Boosting; Recurrent neural networks; Time series forecasting; Multi-step-ahead prediction

1. Introduction The reliable prediction of future values of real-valued time series has many important applications ranging from ecological modeling to dynamic systems control, finance and marketing. Modeling the system that generated the series is often the first step in the setting up of a forecasting system. It can provide an estimation of future values based on past


values. It is possible, at least numerically, to solve the equations of a mathematical model composed of a set of deterministic equations whose initial conditions are known, so that the evolution of the system can be determined. Generally, however, the characteristics of the phenomenon which generates the series are unknown. The information available for the prediction is limited to the past values of the series, and the relations which describe its evolution must be deduced from these values, in the form of approximations of the functional relations between past and future values. The most frequently adopted approach to estimating single-step-ahead (SS) future values $\hat{x}(t+1)$ consists in using a function $f$ which takes the recent history of the time series as input: $\hat{x}(t+1) = f(x(t), x(t-1), x(t-2), \ldots)$, where $x(t)$, for $0 \le t \le l$, is the time series data that can be used for building a model. In multi-step-ahead (MS) prediction,


given $\{x(t), x(t-1), x(t-2), \ldots\}$, a reasonable estimate $\hat{x}(t+h)$ of $x(t+h)$ has to be looked for, $h$ being the number of steps ahead.

To build such models, an appropriate function $f$ is needed. Given their universal approximation properties, multilayer perceptrons (MLPs [3]) are often successful in modeling nonlinear functions $f$. In this case, a fixed number $p$ of past values is fed into the input layer of the MLP and the output is trained to predict a future value of the time series. Using a time window of a fixed size has proven to be limiting in many applications: if the time window is too narrow, important information may be left out, while if the window is too wide, useless inputs may cause distracting noise. Ideally, for a given problem, the size of the time window should be adapted to the context. This can be done by using recurrent neural networks (RNNs) [3,4], trained by a gradient-based algorithm such as backpropagation through time (BPTT).

The results obtained can be improved by several means. One way is to develop a more appropriate training algorithm based on a priori information obtained from knowledge of the application field (see for example [5]). It is also possible to adopt general methods designed to improve the results of various models. One of these is known as 'boosting', introduced in [6]. The possibly small improvement that a 'weak' model can obtain compared to a random estimate is substantially amplified by the boosting algorithm through the sequential construction of several such models that concentrate progressively on the difficult examples of the original learning set. In this paper we focus on the definition of a boosting algorithm used to improve the prediction performance of RNNs. A new parameter will be introduced, allowing regulation of the boosting effect.

A common problem with time series forecasting models is the low accuracy of long-term forecasts. The estimated value of a variable may be reasonably reliable in the short term, but for longer-term forecasts the estimate is likely to become less accurate. Yet while reliable MS time series prediction has many important applications and is often the intended outcome, the published literature usually considers SS time series prediction. The main reason for this is the increased difficulty of problems requiring MS prediction, and the fact that the results obtained by simple extensions of techniques developed for SS prediction are often disappointing. Moreover, while many different techniques perform rather similarly on SS prediction problems, significant differences show up when extensions of these techniques are applied to MS problems.

In this paper we will quickly go over the modelling approaches to MS prediction; then existing work concerning the use of neural networks for MS prediction will be

presented. In Section 3, the ensemble methods will be reviewed before presenting the generic boosting algorithm, and in Section 4 the related work on regression will be presented. Next, a definition of RNNs as well as the associated BPTT learning algorithm will be provided. The new boosting algorithm is described in Section 6. Finally, in Section 7, we will look at the results of experiments obtained on three different benchmark tests for SS prediction problems, as well as those obtained on two benchmark tests for MS prediction problems, all of which show an overall improvement in performance.

2. Modelling approaches for time series

The most common approach to dealing with a prediction problem can be traced back to [7] and consists in using a fixed number $M$ of past values (a fixed-length time window sliding over the time series) when building the prediction:

$$\mathbf{x}(t) = [x(t), x(t-s), \ldots, x(t-(M-1)s)], \qquad (1)$$
$$\hat{x}(t+s) = f(\mathbf{x}(t)). \qquad (2)$$

Most of the current work on SS prediction relies on a result in [8] showing that under several assumptions (among them the absence of noise) it is possible to obtain a perfect estimate of $x(t+s)$ according to Eqs. (1) and (2) if $M \ge 2d + 1$, where $d$ is the dimension of the stationary attractor generating the time series. In this approach, all the memory of the past required for performing a prediction is preserved in the sliding time window. An alternative solution is to keep a short time window (usually $M = 1$) and enable the model to develop a memory of the past by itself. This memory is expected to represent the past information that is actually needed for performing the task more accurately. Time series prediction with RNNs usually corresponds to such a solution. Memory of the past – of variable length, see, e.g., [9,10] – is maintained in the internal state of the model, $s(t)$, of finite dimension $d$ at time $t$, which evolves (for $M = 1$) according to:

$$s(t+s) = g(s(t), x(t)), \qquad (3)$$

where $g$ is a mapping function assumed to be continuous and differentiable, and the time variable $t$ can be either continuous or discrete. Assuming that the system is noise free, the observed output is related to the internal dynamics of the system by the output function $h$:

$$\hat{x}(t+s) = h(s(t)), \qquad (4)$$

where $\hat{x}(t+s)$ is the estimate of $x(t+s)$ and the function $h: \mathbb{R}^d \to \mathbb{R}$ is called the measurement function [11], generally assumed to be differentiable. In many applications the system output $x(t)$ will be an analog, continuous signal.
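To make Eqs. (1) and (2) concrete, the sketch below (ours, not code from the paper; the function name and layout are illustrative) builds the fixed-length time-window inputs $\mathbf{x}(t)$ and the targets $x(t+s)$ from a series:

```python
import numpy as np

def make_windows(series, M, s=1):
    """Inputs [x(t), x(t-s), ..., x(t-(M-1)s)] and targets x(t+s), Eqs. (1)-(2)."""
    X, y = [], []
    for t in range((M - 1) * s, len(series) - s):
        X.append([series[t - i * s] for i in range(M)])  # the sliding window x(t)
        y.append(series[t + s])                          # the value to predict
    return np.array(X), np.array(y)
```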


2.1. Local and global approaches

Among the many methods developed for time series prediction, a distinction can usually be made between local and global approaches.

In a local approach, a piecewise approximation is built for the behavior of the dynamical system generating the target time series. First, a synthetic state space is constructed by using the time-window technique and the output sequence. Then, some vector quantization technique performs a segmentation of this state space. Finally, a simple (usually linear) local model is developed for every such segment. Local approaches rely on the assumption that the relevant dimension of the synthetic state space is much lower than the dimension of the time window and that there are very few different behaviors of the system (a few segments are enough). If this is not the case, the number of parameters explodes and the number of samples corresponding to each segment becomes insufficient for a reliable training of the corresponding local model. The local approach was successful in many MS prediction problems [12–15], probably as a consequence of the increased robustness provided by the quantization of the state space and the simplicity of the local models employed. The vector quantization is sometimes performed by neural networks having a constrained topology, such as the self-organizing maps [12,16], or a free topology, such as the neural-gas networks [17] or the dynamic cell structures [13]. Some authors also mention the difficulties encountered by the local approach on a few other problems [12,14].

A global approach attempts to build a single (complex) model for the entire range of behaviors of the underlying dynamical system. In order to be able to model an entire range of behaviors, powerful models have to be used. Such models prove rather difficult to train and can easily diverge from the required behavior. Given their universal approximation properties, neural networks are good candidate models for global approaches. Among the many neural network architectures employed for time series prediction we can mention MLPs with a time window in the input, MLPs with finite impulse response (FIR) connections both from the input to the hidden layer and from the hidden layer to the output [18], recurrent networks obtained by providing MLPs with infinite impulse response (IIR) connections [19] or with feedback from the output [20], recurrent networks with FIR connections [5,21,22] and recurrent networks with both internal loops and feedback from the output [23]. Many authors included time-delayed connections that provide an explicit memory of the past at various places. These additions appear to be (implicitly and sometimes explicitly) justified by the fact that they promote learning of longer-range dependencies in the data, which is supposed to be generally helpful for performing predictions, and in particular for MS prediction problems. The relationship between learning longer-range dependencies and


performance in MS prediction was experimentally confirmed in [24].

2.2. SS and MS prediction

Irrespective of the nature of the models employed, there are several methods for dealing with MS prediction problems. The first and most common one consists in building (training) a predictor for the SS problem and using it recursively for the associated MS problem: the estimates provided by the model for the next time step are fed back into the input of the model until the desired prediction horizon is reached (a schematic sketch is given at the end of this subsection). This simple method is usually penalized by the accumulation of errors during successive time steps; the model quickly diverges from the desired behavior. The second and better method consists in training the predictor on the SS problem and, at the same time, making use of the propagation of penalties across time steps in order to punish the predictor for accumulating errors. When the models are MLPs or RNNs, such a procedure is directly inspired by the BPTT algorithm, which is an efficient method for performing gradient descent on the cumulated error. In the third method, called the direct method, the predictor is no longer concerned with an SS problem and is directly trained on the MS problem. Some experimental evidence and supportive theoretical considerations [25] tend to show that the direct method always performs better than the first method we mentioned and at least as well as the second.

Improved results were obtained by using recurrent networks and either training them with a progressively increasing prediction horizon [26] or including time-delayed connections from the output of the network to its input [23]. A somewhat related method was suggested in [27] and consists in chaining several networks: for a given time horizon, a first network learns to predict at t + 1, then a second network is trained to predict at t + 2 by using the prediction provided by the first network as a supplementary input, and so on, until the desired time horizon is reached. This method was also found to provide better predictions than the first one. A similar approach is used in [28], where the prediction horizons are powers of 2.

Recently, two significantly different forecasting methods were put forward. In [29], a huge recurrent neural network (called the "reservoir") is randomly generated so that its units present a rich set of behaviors; then, by training the output weights, these behaviors are combined in order to obtain the desired prediction. In [30], evolution is used to discover good RNN hidden node weights, and methods such as linear regression or quadratic programming are applied to compute optimal linear mappings from hidden state to output.

While the papers concerning MS prediction we cite here are only part of the recent literature on the subject, the methods we mention characterize the existing approaches quite well. Nevertheless, it is particularly difficult to identify a typology of the MS prediction problems and to evaluate which methods are most appropriate for a specific kind of problem and why. In our experiments, we perform MS prediction following the direct method, with a boosted global model.
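As an illustration of the first (recursive) method, here is a minimal sketch (ours, not code from the paper; `model` is a hypothetical single-step predictor) that feeds each estimate back into the time window until the horizon is reached; the accumulation of errors comes precisely from reusing these estimates as inputs:

```python
def recursive_forecast(model, window, h):
    """Apply a trained single-step predictor recursively for h steps ahead."""
    window = list(window)                  # most recent value last
    preds = []
    for _ in range(h):
        x_next = model(window)             # single-step estimate
        preds.append(x_next)
        window = window[1:] + [x_next]     # slide the window over the estimate
    return preds
```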


3. Combining multiple learning methods

The combination of models (classifiers or regressors) is an effective way to improve their performance [31]. The goal of combining models is to obtain a more precise estimate than that provided by a single model. Several effective methods for improving the performance of a simple algorithm by combining several models have been put forward. These combined methods can be classified into three different groups: the voting methods, which include the bagging algorithm [32] and the boosting methods [6,33]; the stacking method (or stacked generalization) [34,35]; and the cascading method (or cascade generalization) [36]. Each will be defined briefly; the boosting algorithm, however, will be discussed in detail, as it represents the fundamental interest of this article.

The two voting methods, bagging and boosting, operate by applying a base learning algorithm and invoking it many times with different training sets. In bagging [32], each training set is constructed by forming a bootstrap replicate of the original training set: given a training set D of q examples, a new training set D′ is constructed by drawing q examples uniformly (with replacement) from D. The AdaBoost algorithm [33] maintains a set of weights over the original training set D and adjusts these weights after each classifier has been learned by the base learning algorithm. These adjustments increase the weight of examples that are misclassified by the base learning algorithm and decrease the weight of examples that are correctly classified. AdaBoost uses these weights to construct a new training set D′ to be given to the base learning algorithm. In boosting by sampling, examples are drawn with replacement from D with probability proportional to their weights. The second method, boosting by weighting, can be used with base learning algorithms that accept a weighted training set directly; with such algorithms, the entire training set D (with associated weights) is given to the base learning algorithm. Both methods have been shown to be very effective [37].

Stacking is a way of combining multiple models that have been learned for a classification task [36]. The first step is to collect the output of each model into a new set of data. For each example in the original training set, this data set contains every model's prediction of that example's class, along with the true classification. The new data are treated as the data for another learning problem, and in the second step a learning algorithm is employed to solve this problem. In Wolpert's terminology, the original data and the models constructed from it in the first step are referred to as the level 0 data and the level 0 models, respectively, while the

set of cross-validation data and the second-stage learning algorithm are referred to as the level 1 data and the level 1 generalizer. Breiman [34] demonstrated the success of stacking in the setting of ordinary regression. The level 0 models are regression trees of different sizes or linear regressions using different numbers of variables. Instead of selecting the single model that works best as judged by (for example) cross-validation, Breiman used the output values of the different level 0 regressors for each member of the training set to form the level 1 data, and then used least-squares linear regression, under the constraint that all regression coefficients be non-negative, as the level 1 generalizer. The non-negativity constraint turned out to be crucial to guarantee that the predictive accuracy would be better than that achieved by selecting the single best predictor.

The basic idea of the cascading algorithm [36], like boosting, is to sequentially run the set of classifiers, at each step performing an extension of the original data set by adding new attributes. The new attributes are derived from the probability class distribution given by a base classifier. The constructive step extends the representational language for the high-level classifiers. Cascading produces a single but structured model for the data that combines the model class representations of the base classifiers.

4. Boosting methods

Boosting methods belong to the family of model aggregation methods (see for example [38,39]). These methods relax the constraints related to the selection of a single model and only require base models whose results are at least slightly better than random guessing. The goal is to obtain an aggregate model whose error is smaller than the errors of the individual models. The basic idea is to increase diversity so as to maximize the coverage of the data space: each model covers a different area of the space, and the aggregate model ensures a more complete coverage.

Boosting algorithms can be classified into two families. The first family, introduced in [6] for classification problems, exploits hierarchies of three classifiers, trained on progressively more difficult parts of the available data; their decisions are combined by a majority vote. Subsequent representatives of this family concern either classification or regression [38,40]. A second family of boosting algorithms was introduced in [41] and is based on the iterative update of a distribution which corresponds to the probability, for each example from the initial training set, of being selected for further training. The generic algorithm can be summarized as in Table 1. Most of the recent work on boosting algorithms belongs to this second family and follows the introduction of the powerful AdaBoost algorithm in [1,33]. Many different algorithms can be obtained from this generic algorithm by making specific choices at every stage.

Table 1
The generic boosting algorithm
1. Set the initial distribution on the training set
2. Iterate until the stopping criterion is reached
   (a) develop a model from the data selected from the training set according to the current distribution
   (b) update the distribution on the training set
   (c) assess the stopping criterion
3. Combine the models
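A schematic rendering of the loop of Table 1 (our sketch, not code from the paper; `fit_model`, `update_distribution`, `should_stop` and `combine` are placeholders for the choices discussed below):

```python
def generic_boosting(train_set, fit_model, update_distribution, should_stop, combine):
    Q = len(train_set)
    D = [1.0 / Q] * Q                                 # 1. initial (uniform) distribution
    models = []
    while True:                                       # 2. iterate
        model = fit_model(train_set, D)               # 2(a) model from current distribution
        D = update_distribution(D, model, train_set)  # 2(b) favour the difficult examples
        models.append(model)
        if should_stop(models, train_set):            # 2(c) assess the stopping criterion
            break
    return combine(models)                            # 3. combine the models
```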

The initial distribution on the training set is usually uniform, but prior knowledge concerning the problem may suggest different choices. When the distribution is updated at each iteration, training examples which were found difficult by previous models are favored, so that subsequent models concentrate on the most difficult parts of the training data (the data producing the largest errors). For several boosting algorithms, it was shown that the expected error of the combination of models converges to 0 as the number of models becomes infinite. This result relies on a strong hypothesis: step 2(a) can always return a model that performs (at least) slightly better than random guessing. In practice, since new learners must deal with progressively more difficult training examples, they end up performing worse than random guessing after a limited number of steps; this is the usual stopping criterion used by the algorithm.

The generic boosting algorithm does not specify how the current distribution on the training set should be taken into account. In existing algorithms, the model is trained on a subset of the training set obtained by sampling with replacement according to the current distribution. As we shall see later, this solution is not always appropriate. Various methods were used for step 2(a) of the algorithm: linear models, decision trees, splines, MLPs, etc. The choice of a cost function and of the method for combining the models mainly depends on the type of problem: classification or regression. Since we are concerned here with time series prediction, we now turn to the use of boosting methods for regression.

The first approach [2,42] is rather empirical and adapts the AdaBoost algorithm to three cost functions that are specific to regression problems. The algorithm described in [38] belongs to the first family of boosting algorithms and uses a threshold to label a model's answer as right or wrong. These two methods make use of the median as a way to combine the outputs of the models. Recently, a new approach to regressor boosting as residual-fitting was developed [43–48]: instead of being trained on a different sample of the same training set, as in previous boosting algorithms, a regressor is trained on a new training set having different target values (e.g. the residual error). Another approach [49] uses the boosting algorithm to minimize a global exponential squared cost, which is the


product of the costs of the various models. The cost function is the only change with respect to algorithms like AdaBoost.

Before presenting our algorithm, it is important to mention the few existing applications of boosting to time series modeling. In [40], a boosting method belonging to the first family of boosting algorithms is successfully applied to the classification of phonemes; the models used are RNNs, and the authors are fully aware of the implications of the internal memory of the RNNs for the boosting algorithm. A similar type of boosting algorithm is used in [38] for the prediction of a benchmark time series, but with MLPs as models. A residual-fitting boosting approach using decision trees and projection pursuit as regressors is applied in [50] to the estimation of volatility in financial time series. In all these cases, a significant amount of data was available for learning. To our knowledge, no previous work has applied boosting with RNNs to regression problems, and in particular to time series. In the following section, the RNN and the adapted BPTT algorithm will be presented, while in Section 6 we will discuss the boosting algorithm adapted to the regression problem.

5. Recurrent neural networks

RNNs are characterized by the presence of cycles in the graph of interconnections and are able to model temporal dependencies of unspecified duration between the inputs and the associated desired outputs by using an internal memory. Unlike in an MLP, the passage of information from one neuron to another through a connection is not instantaneous (it takes one time step), and the presence of loops makes it possible to retain the influence of the information for a variable, theoretically infinite, time period. This does not require keeping a time window; the memory is coded by the recurrent connections and the outputs of the neurons themselves. The network learns to carry out three complementary tasks: the selection of useful inputs, their retention in coded form, and their use in the calculation of its outputs. Throughout the training, the network itself reaches a compromise between the resolution (the precision) and the depth of its memory in order to carry out the required task. RNNs have shown a higher modeling capacity than MLPs [51,52], obtaining more accurate results with, in many cases, fewer parameters.

In the majority of cases, training is achieved with the help of the off-line BPTT algorithm. The feedforward backpropagation algorithm cannot be directly transferred to RNNs, because the backpropagation pass presupposes that the connections between the neurons induce a cycle-free ordering. Considering a time series of length l, the central idea of the BPTT algorithm is to unfold


the original recurrent network (Fig. 1) in time so as to obtain a feedforward network with l layers (Fig. 2), which in turn makes it possible to apply learning by backpropagation of the gradient of the error through time. BPTT unfolds the network in time by stacking identical copies of the RNN, duplicating connections within the network to obtain connections between subsequent copies. The weights between successive layers must remain identical in order to correspond to a single weight of the original recurrent network. In practice, this amounts to cumulating the changes of the weights for all the copies of a particular connection and adding the sum of the changes to all these copies after each learning iteration.

Let us consider the application of BPTT to the training of a recurrent network between times $t_1$ and $t_l$. Let $f_i$ be the transfer function of neuron $i$, $s_i(t)$ its output at time $t$, and $w_{ij}$ its connection coming from neuron $j$. A value provided to neuron $i$ at time $t$ from outside is denoted $x_i(t)$. The algorithm assumes an evolution of the neurons of the recurrent network given by the following equations:

$$s_i(t) = f_i(\mathrm{net}_i(t-1)) + x_i(t), \quad i = 1, \ldots, N, \qquad (5)$$

$$\mathrm{net}_i(t-1) = \sum_{j \in \mathrm{Prev}(i)} w_{ij}(t-1)\, s_j(t-1). \qquad (6)$$

Fig. 1. A recurrent neural network.

Fig. 2. The RNN of Fig. 1 unfolded in time.

The set Prev(i) contains, for each neuron $i$, the indices of the incoming neurons: $\mathrm{Prev}(i) = \{j \mid \exists\, w_{ij}\}$. Likewise, the neurons following neuron $i$ are defined by $\mathrm{Foll}(i) = \{j \mid \exists\, w_{ji}\}$. The variation of a weight over the whole sequence is calculated as the sum of the variations of this weight on each element of the sequence. Denoting by $T(s)$ the set of neurons which have a desired output $d_p(s)$ at time $s$, the mean quadratic error $E(t_1, t_l)$ of the recurrent neural network between times $t_1$ and $t_l$ is defined as:

$$E(t_1, t_l) = \frac{1}{2} \sum_{t=t_1}^{t_l} \sum_{p \in T(t)} (d_p(t) - s_p(t))^2. \qquad (7)$$

Let us calculate the variation of the weight minimizing the total error [3] by a gradient descent, where $\eta$ is the learning step:

$$\Delta w_{ij}(t_1, t_l - 1) = -\eta\, \frac{\partial E(t_1, t_l)}{\partial w_{ij}} = -\eta \sum_{s=t_1}^{t_l - 1} \frac{\partial E(t_1, t_l)}{\partial w_{ij}(s)}, \qquad (8)$$

with

$$\frac{\partial E(t_1, t_l)}{\partial w_{ij}(s)} = \frac{\partial E(t_1, t_l)}{\partial \mathrm{net}_i(s)} \cdot \frac{\partial \mathrm{net}_i(s)}{\partial w_{ij}(s)}, \qquad (9)$$

where $w_{ij}(s)$ is the duplicate of the weight $w_{ij}$ of the original recurrent network for time $t = s$. The equations of the BPTT algorithm are finally obtained:

• for the output layer
$$\frac{\partial E(t_1, t_l)}{\partial \mathrm{net}_i(s)} = \begin{cases} [s_i(t_l) - d_i(t_l)]\, f_i'(\mathrm{net}_i(s)) & \text{if } i \in T(s+1) \\ 0 & \text{otherwise} \end{cases} \qquad (10)$$

• for the hidden layer
$$\frac{\partial E(t_1, t_l)}{\partial \mathrm{net}_i(s)} = \begin{cases} \left[ s_i(s+1) - d_i(s+1) + \displaystyle\sum_{j \in \mathrm{Foll}(i)} w_{ji}(s+1)\, \frac{\partial E(t_1, t_l)}{\partial \mathrm{net}_j(s+1)} \right] f_i'(\mathrm{net}_i(s)) & \text{if } i \in T(s+1) \\ \displaystyle\sum_{j \in \mathrm{Foll}(i)} w_{ji}(s+1)\, \frac{\partial E(t_1, t_l)}{\partial \mathrm{net}_j(s+1)}\, f_i'(\mathrm{net}_i(s)) & \text{otherwise} \end{cases} \qquad (11)$$
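As a concrete illustration of Eqs. (5)–(11), here is a minimal NumPy sketch (ours, not the authors' implementation) of the forward unfolding and the cumulated gradient for a fully recurrent network; it assumes the external input enters neuron 0, the prediction is read from one output neuron, and tanh stands in for the symmetric sigmoid:

```python
import numpy as np

def bptt_gradients(W, x, d, out=-1):
    """Gradient of the quadratic error (7), cumulated over the unfolded
    copies as in Eq. (8). W[i, j] is the weight w_ij from neuron j to i."""
    N, l = W.shape[0], len(x)
    f = np.tanh
    df = lambda a: 1.0 - np.tanh(a) ** 2
    s, net = np.zeros((l, N)), np.zeros((l, N))
    for t in range(1, l):                    # forward pass, Eqs. (5)-(6)
        net[t] = W @ s[t - 1]
        s[t] = f(net[t])
        s[t, 0] += x[t]                      # external input enters neuron 0
    grad, delta = np.zeros_like(W), np.zeros(N)
    for t in range(l - 1, 0, -1):            # backward pass, Eqs. (10)-(11)
        err = np.zeros(N)
        err[out] = s[t, out] - d[t]          # error injected where a target exists
        delta = (err + W.T @ delta) * df(net[t])
        grad += np.outer(delta, s[t - 1])    # cumulate the copies' contributions
    return grad                              # weight update: -eta * grad
```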

6. Boosting recurrent neural networks

Boosting is a general method for improving the accuracy of any given learning algorithm. It produces a final solution by combining rough and moderately inaccurate decisions offered by different classifiers, each of which is at least slightly better than random guessing. In boosting, the training set used for each classifier is produced (weighted) based on the


performance of the earlier classifier(s) in the series; therefore, samples incorrectly classified by previous classifiers in the series are emphasized more than correctly classified ones.

AdaBoost solved many of the practical difficulties of the earlier boosting algorithms. The algorithm first receives a training set $\{(x_q, y_q);\ q = 1, \ldots, Q\}$ as input, where $x_q$ is an example in the space $X$ and $y_q$ is a class label in the set $Y$ associated with $x_q$. AdaBoost repeatedly calls a weak (base) learning algorithm in a series of rounds $n = 1, \ldots, N$. In each round $n$, the algorithm maintains a distribution, or set of weights, over the training set; the weight of training example $q$ during round $n$ is denoted $D_n(q)$. The weak learner can use $D_n(q)$ on the training examples; alternatively, the training inputs can be sampled according to $D_n(q)$, and these resampled inputs can be used to train the weak learner. Initially, all the weights are set equally, and the weights of incorrectly classified examples are increased at each repetition so that the learner is forced to focus on the harder examples in the training set. The weak learner has to compute a hypothesis $h_n: X \to Y$ with respect to $D_n(q)$. Once the hypothesis $h_n$ is calculated, AdaBoost uses a parameter $a_n = (1 - \epsilon_n)/\epsilon_n$, which is a measure of confidence in the predictor; $\epsilon_n$ is the pseudo-loss calculated from a loss function $L_n(q)$. The distribution is then updated as $D_{n+1}(q) = D_n(q)\, a_n^{(1 - L_n(q))}/Z_n$, where $Z_n$ is a normalization constant chosen such that $D_{n+1}$ is a distribution. In response, the weights of examples misclassified by $h_n$ are increased and the weights of correctly classified examples are decreased; hence the algorithm is forced to focus on the 'hard' examples. Finally, the weak hypotheses are combined into a single final hypothesis,

$$h_f(x) = \inf\left\{ y \in Y : \sum_{n:\, h_n(x) \le y} \log a_n \ge \frac{1}{2} \sum_{n=1}^{N} \log a_n \right\}, \qquad (12)$$

which is a weighted median of the $N$ hypotheses.

Modifications of the AdaBoost algorithm have been proposed to suit time series prediction problems. Generally, the use of boosting in prediction is harder than in classification: classifiers map the inputs to one of several classes, while predictors calculate the expected values that are associated with the given inputs, and the outputs are neither correct nor incorrect (in contrast to classification problems). Rather, the performance of a predictor is measured by the prediction error. This leads to the proposed modifications of the AdaBoost algorithm so that it is applicable to real-valued prediction tasks.

The boosting algorithm used should comply with the restrictions imposed by the general context of application. In our case, it must be able to work well when a limited amount of data is available, and it must accept RNNs as regressors. We followed the generic algorithm of [1,2] and had to decide which loss function to use for the regressors, how to update the distribution on the training set and how to combine the resulting regressors.
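For reference, Eq. (12) can be transcribed directly; this helper (ours, illustrative) is also the combination rule used by our algorithm below, with weights $\log a_n$:

```python
import numpy as np

def weighted_median(values, weights):
    """Smallest value v such that the values <= v carry at least half
    of the total weight (Eq. 12 with weights log a_n)."""
    order = np.argsort(values)
    v = np.asarray(values, dtype=float)[order]
    w = np.asarray(weights, dtype=float)[order]
    cum = np.cumsum(w)
    return v[np.searchsorted(cum, 0.5 * cum[-1])]
```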

Fig. 3. The boosting algorithm proposed for regression with RNNs.

Our updates are based on the suggestion in [2,42], but we apply a linear transformation to the weights before using them (see the definition of $D_{n+1}(q)$ in Fig. 3) in order to prevent the RNNs from simply ignoring the easier examples. Then, instead of sampling with replacement according to the updated distribution, we prefer to weight the computed error for each example (thus using all the data points) at the output of the RNN by the distribution value corresponding to that example. We have two stopping criteria in our algorithm: the maximum number of steps $N$, and the condition $\epsilon_n < 0.5$ (the algorithm stops when a regressor performs worse than random guessing).

The decision to use all the data of the training set for the development of each regressor makes it possible to observe


the condition according to which the algorithm must be able to treat time series made up of a limited number of data points. Our idea is that the modifications of the weights $D_n(q)$ make it possible to give more importance to the difficult examples during the training of the following regressor. The parameter $k$ regulates how strongly this is taken into account: when $k = 0$, the difficult examples have the same weights as the others. To combine the regressors we use the weighted median [2,42], which is less sensitive to outliers than the weighted mean. The proposed boosting algorithm can then be described as in Fig. 3. For stage (2a), Eqs. (10) and (11) of the BPTT algorithm become:

• for the output layer
$$\frac{\partial E(t_1, t_l)}{\partial \mathrm{net}_i(s)} = \begin{cases} [s_i(t_l) - d_i(t_l)]\, D_n(s+1)\, f_i'(\mathrm{net}_i(s)) & \text{if } i \in T(s+1) \\ 0 & \text{otherwise} \end{cases} \qquad (13)$$

• for the hidden layer
$$\frac{\partial E(t_1, t_l)}{\partial \mathrm{net}_i(s)} = \begin{cases} \left[ (s_i(s+1) - d_i(s+1))\, D_n(s+1) + \displaystyle\sum_{j \in \mathrm{Foll}(i)} w_{ji}(s+1)\, \frac{\partial E(t_1, t_l)}{\partial \mathrm{net}_j(s+1)} \right] f_i'(\mathrm{net}_i(s)) & \text{if } i \in T(s+1) \\ \displaystyle\sum_{j \in \mathrm{Foll}(i)} w_{ji}(s+1)\, \frac{\partial E(t_1, t_l)}{\partial \mathrm{net}_j(s+1)}\, f_i'(\mathrm{net}_i(s)) & \text{otherwise} \end{cases} \qquad (14)$$
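Since Fig. 3 itself is not reproduced here, the sketch below reflects our reading of the algorithm from the surrounding text and is not the authors' code: the loss functions and the $\beta$ update follow Drucker's AdaBoost.R2 [2], the distribution enters BPTT through the weighted errors of Eqs. (13) and (14), and the linear transformation $D_{n+1}(q) = (1 + k\, p_{n+1}(q))/(Q + k)$ is an assumption consistent with the limit behaviours stated later in Section 7 (uniform weights for $k = 0$, $D_{n+1}(q) \approx p_{n+1}(q)$ for $k \gg 0$). `train_rnn` is a hypothetical stand-in for weighted BPTT training, and `weighted_median` is the helper sketched above:

```python
import numpy as np

def boost_rnns(train_set, train_rnn, loss="squared", k=20, N=50):
    """train_set: list of (input sequence, target) pairs.
    train_rnn(train_set, D): returns an RNN trained with errors weighted by D."""
    Q = len(train_set)
    D = np.full(Q, 1.0 / Q)                   # initial uniform distribution
    models, log_alphas = [], []
    for n in range(N):
        model = train_rnn(train_set, D)       # stage (2a): weighted BPTT
        err = np.array([abs(model(x) - y) for x, y in train_set])
        L = err / max(err.max(), 1e-12)       # linear loss, as in [2,42]
        if loss == "squared":
            L = L ** 2
        elif loss == "exponential":
            L = 1.0 - np.exp(-L)
        eps = float(D @ L)                    # pseudo-loss of regressor n
        if eps >= 0.5:                        # stop: worse than random guessing
            break
        models.append(model)
        log_alphas.append(np.log((1.0 - eps) / eps))  # log a_n for the median
        beta = eps / (1.0 - eps)
        p = D * beta ** (1.0 - L)             # AdaBoost.R2-style reweighting [2]
        p /= p.sum()
        D = (1.0 + k * p) / (Q + k)           # assumed linear transformation in k
    return lambda x: weighted_median([m(x) for m in models], log_alphas)
```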

7. Experimental results

The boosting algorithm described above was used with the BPTT learning algorithm on SS and MS time series forecasting problems. We added several new results to a previous study [53] on the SS problem, and also added a new study on the MS problem. The first set of experiments was carried out in order to explore the performance of the proposed algorithm and to study the influence of the parameter $k$ on its behaviour. The algorithm is used on the sunspots time series and on two Mackey-Glass time series (MG17 and MG30). The average results and standard deviations are provided, all of which were determined from five trial runs for each configuration (linear, squared or exponential loss function; value of the parameter $k$). The extensions of the range of values of $k$, which were due to the noticeably different average results, will be discussed.

The error criterion used was the normalized mean squared error (NMSE), which is the ratio between the mean squared error and the variance of the time series. It is defined, for a time series $x(t)$, $t = t_1, \ldots, t_l$, by

$$\mathrm{NMSE} = \frac{\sum_{t=t_1}^{t_l} (x(t) - \hat{x}(t))^2}{\sum_{t=t_1}^{t_l} (x(t) - \bar{x})^2} = \frac{\sum_{t=t_1}^{t_l} (x(t) - \hat{x}(t))^2}{l\, \sigma^2}, \qquad (15)$$

where $\hat{x}(t)$ is the prediction given by the RNN and $\bar{x}$, $\sigma^2$ are the mean value and variance estimated from the available data. A value of NMSE = 1 is achieved by predicting the unconditional mean of a time series. The normalized root mean squared error (NRMSE) used for some of the results in the literature is the square root of the NMSE.
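In code, Eq. (15) amounts to (our transcription):

```python
import numpy as np

def nmse(x, x_hat):
    """Normalized mean squared error of Eq. (15); predicting the mean gives 1."""
    x, x_hat = np.asarray(x, dtype=float), np.asarray(x_hat, dtype=float)
    return np.sum((x - x_hat) ** 2) / np.sum((x - x.mean()) ** 2)
```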

The architectures used had a single input neuron, a single linear output neuron, a bias unit and a fully recurrent hidden layer composed of neurons with a symmetric sigmoid activation function. For the sunspots dataset we tested RNNs having 12 neurons in the hidden layer, and for the Mackey-Glass datasets 7 neurons. These numbers of neurons correspond to the best results obtained by BPTT for SS forecasting (we tested architectures varying from 1 to 15 hidden neurons). We set the maximal number $N$ of RNNs to 50 for each experiment, and for each RNN the weights were randomly initialized in $[-0.3, 0.3]$. We then compared the results given by our algorithm to other results in the literature.

7.1. Sunspots time series

The sunspots dataset (Fig. 4) is a natural dataset that contains the yearly number of dark spots on the sun from 1700 to 1979. The time series has a pseudo-period of 10–11 years. The task for the various models developed is to perform single-step-ahead predictions. It is common practice [54] to use the data from 1700 to 1920 as a training set and to assess the performance of the model on two sets, 1921–1955 (test 1) and 1956–1979 (test 2). Test 2 is considered to be more difficult because it has a larger variance.

In Table 2, we can see that boosting with the weighted median provides better results than boosting with the weighted mean. We also notice that the values of the average error for the three loss functions decrease from k = 5 to k = 20 and increase until k = 150 (see Fig. 5). In Table 3, to obtain the mean number of weak learners, we take the number of weak learners in each of the five trials for each set of parameters and then take the average; the minimum and the maximum number of weak learners are obtained in the same way.

Fig. 4. The sunspots time series.

Table 2
NMSE (×10³) of the weighted mean and of the weighted median for several values of k on the (test 1 + test 2) set, for five trials

              k = 5   10     20     50     100    150
Lin.  Mean    190     190    194    182    186    258
      Med.    183     177    189    181    182    185
Squ.  Mean    193     191    182    259    555    497
      Med.    181     180    174    183    307    365
Exp.  Mean    189     187    179    315    587    713
      Med.    191     186    178    195    705    1122

Fig. 5. NMSE of the weighted median for different values of k on the (test 1 + test 2) set for sunspots.

Table 3
Number of weak learners (mean, minimum and maximum) for several values of k, for 5 trials on the (test 1 + test 2) set

              k = 5   10     20     50     100    150
Lin.  Mean    9.8     10     8.4    9.8    8.4    10.2
      Min.    9       8      6      8      8      7
      Max.    11      12     10     12     9      18
Squ.  Mean    6.4     6.8    5.4    10.4   15     13.8
      Min.    5       7      5      8      13     13
      Max.    8       8      6      17     19     16
Exp.  Mean    26.6    26     32     34.2   50     50
      Min.    22      25     28     23     50     50
      Max.    33      27     36     50     50     50

We notice that the number of weak learners increases notably with the exponential function and with large values of k. The number of weak learners in our experiments depends on the loss function used: our algorithm develops around 9 weak learners with the linear and squared loss functions, and 36 weak learners with the exponential function. Fifty is the maximal number of networks that we allowed for.

Table 4 shows the best NMSE obtained by various models on the two test sets of sunspots. To obtain the best results of our boosting algorithm, we chose the normalised best results from the five trials for each set of parameters

Table 4
Best results (NMSE × 10³)

Model                    Test 1   Test 2
TAR                      97       280
MLP                      86       350
IIR MLP                  97       436
RNN/BPTT                 84       300
DRNN1                    91       273
DRNN2                    93       246
EBPTT100                 78       227
CBPTT                    92       251
Boosting (squared, 5)    78       250
Boosting (linear, 10)    80       270

(loss function, k). The best results reported in this table were obtained with (squared loss function, k = 5) and (linear loss function, k = 10). The threshold autoregressive (TAR) model in [55] employs a threshold to switch between two AR models. The MLP in [56] has a time window of size 12 in the input layer. The dynamical RNNs (DRNNs) are RNNs having FIR connections [57]. We show here the best results obtained in [58] on each of the two test sets; mean values were not available. DRNN1 has 2 hidden neurons, fully connected by FIR connections of order 5. DRNN2 has 5 hidden neurons, fully connected by FIR connections of order 2. The author found the order of these connections after several trials. For both the EBPTT and CBPTT algorithms [5], which work with time-delayed connections, the authors set the maximal value for the delays of the new connections to 20 and the maximal number of new connections to four. The number of BPTT iterations performed for each candidate connection during the exploratory stage of EBPTT was set to 20 in one experiment and to 100 in another. We can notice the improvement these constructive algorithms bring upon plain BPTT.

To obtain the best mean results in Table 5, we take the normalised mean results of the five trials for each set of parameters, and then choose the best ones. The best mean results reported in this table were obtained with (squared loss function, k = 20) and (linear loss function, k = 10). These tables also show that the best results obtained with our boosting algorithm are close to the best results reported in the literature. Table 6 gives the standard deviations of the NMSE obtained in our experiments; this information was not

Table 5
Best mean results (NMSE × 10³)

Model                    Test 1   Test 2
RNN/BPTT                 102      371
EBPTT20                  96       320
EBPTT100                 92       308
CBPTT                    94       281
Boosting (squared, 20)   90       296
Boosting (linear, 10)    82       314


Table 6
Standard deviations × 10³

Model                    Test 1   Test 2
RNN/BPTT                 2.3      10
EBPTT                    0.9      4.6
CBPTT                    6.4      33.8
Boosting (squared, 5)    9.4      34.1
Boosting (linear, 10)    1.7      25.5

Table 7
Sunspots time series: mean NMSE on the cumulated test 1 + test 2 set as a function of the prediction horizon

Steps ahead h    1      2      3      4      5      6
BPTT             0.24   0.88   1.14   1.22   1.01   1.02
CBPTT            0.17   0.69   0.99   1.17   0.99   1.01
EBPTT            0.19   0.53   0.79   0.80   0.88   0.84
Linear 10        0.18   0.43   0.54   0.67   0.74   0.73
Squared 5        0.18   0.43   0.56   0.64   0.73   0.65
Squared 20       0.17   0.40   0.54   0.73   0.69   0.68
Exponential 20   0.18   0.42   0.67   0.76   0.77   0.74

The last four lines correspond to parameters (linear, 10), (squared, 5), (squared, 20) and (exponential, 20), respectively.

available for all the models in the literature. We generally find that the standard deviation is about 10% of the average.

For the MS prediction task, we kept the RNNs that had 12 neurons in the hidden layer; CBPTT and EBPTT obtained their best performances with 2-hidden-neuron architectures in [24]. We tested four sets of parameters (loss function, k) corresponding to the best results obtained by our algorithm on SS prediction: (linear, 10), (squared, 5), (squared, 20) and (exponential, 20). Due to the lack of published MS results in the literature, few comparisons could be performed for this dataset. Table 7 shows the mean NMSE for prediction horizons up to 6. For our algorithm, the associated standard deviation is less than one fifth of the mean, whatever the loss function used. Fig. 6 displays the overall aspect of the results.

Fig. 6. Sunspots time series: mean NMSE on the cumulated test set as a function of the prediction horizon. Due to their poor results, the BPTT and CBPTT algorithms are not represented.

Fig. 7. Sunspots time series: mean NMSE on the test sets ((a) test 1, (b) test 2) as a function of the prediction horizon.

All the tested algorithms perform better than standard BPTT and degrade quickly as the prediction horizon increases. Boosted architectures give the best results. If the results of test 1 are compared to those of test 2 (Fig. 7), it can be observed that the deterioration is mainly due to test 2. It is commonly accepted that the behaviour in test 2 cannot be explained (by some longer-range phenomenon) given the available history. The short-range information available in SS prediction enables the network to assess the rate of change in the number of sunspots; such information is missing in MS prediction. For the sunspots series, our algorithm develops around 9 weak learners with the linear and squared loss functions, and 30 weak learners with the exponential function, as for the SS problem. The mean number of networks remains practically constant as the horizon increases.

7.2. Mackey-Glass time series

The Mackey-Glass benchmarks [59] are well known for the evaluation of SS and MS prediction methods. The time series are generated by the following nonlinear differential equation:

$$\frac{dx}{dt} = -0.1\, x(t) + \frac{0.2\, x(t-h)}{1 + x^{10}(t-h)}. \qquad (16)$$
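The series can be generated, for instance, by Euler integration of Eq. (16); in the sketch below (ours, not the authors' protocol) the step dt and the constant initialisation are common but assumed choices:

```python
import numpy as np

def mackey_glass(h=17, dt=0.1, n_samples=600, sampling=6.0, x0=0.9):
    """Euler integration of Eq. (16), then sampling with a period of 6."""
    delay = int(h / dt)
    steps = int(n_samples * sampling / dt)
    x = np.full(steps + delay, x0)            # x(t) = x0 for 0 <= t <= h
    for i in range(delay, steps + delay - 1):
        dx = -0.1 * x[i] + 0.2 * x[i - delay] / (1.0 + x[i - delay] ** 10)
        x[i + 1] = x[i] + dt * dx
    stride = int(sampling / dt)
    return x[delay::stride][:n_samples]
```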

The behaviour is chaotic for h > 16.8. The results in the literature usually concern h = 17 (known as MG17, Fig. 8) and h = 30 (MG30, Fig. 9). According to common practice [60], the data generated with x(t) = 0.9 for 0 ≤ t ≤ h are sampled with a period of 6; the first 500 values of this series are used for the learning set and the next 100 values for the test set.

Fig. 8. The Mackey-Glass time series for h = 17.

Fig. 9. The Mackey-Glass time series for h = 30.

In Table 8 and Fig. 10, we notice that the best average results on MG17 are always obtained with high values of k for the three loss functions used. Boosting with the weighted median provides even better results than boosting with the weighted mean, especially for large values of k. The results obtained on MG30 (Table 9 and Fig. 11) are similar to the MG17 results: the average error decreases from k = 20 to k = 200 and increases until k = 300, except with the linear loss function, which increases after k = 300 (not shown here).

Table 8
NMSE (×10³) of the weighted mean and of the weighted median for several values of k on the test set of MG17, for 5 trials

              k = 5   10     20     100    150    200
Lin.  Mean    0.31    0.25   0.33   0.33   4.08   8.04
      Med.    0.27    0.18   0.25   0.19   0.17   0.17
Squ.  Mean    0.38    0.29   0.22   1.70   5.96   12.3
      Med.    0.30    0.25   0.19   0.16   0.17   0.18
Exp.  Mean    0.29    0.25   0.25   1.44   16.1   22.8
      Med.    0.21    0.20   0.20   0.17   0.19   0.19

Fig. 10. NMSE (×10³) of the weighted median for different values of k for MG17.

Table 9
NMSE (×10³) of the weighted mean and of the weighted median for several values of k on the test set of MG30, for 5 trials

              k = 5   10     20     150    200    300
Lin.  Mean    0.67    0.72   0.72   0.51   0.67   1.17
      Med.    0.69    0.71   0.76   0.50   0.48   0.45
Squ.  Mean    0.69    0.74   0.60   0.70   1.14   57.2
      Med.    0.64    0.73   0.60   0.48   0.45   0.47
Exp.  Mean    0.62    0.62   0.59   0.57   1.65   75.9
      Med.    0.62    0.59   0.55   0.47   0.47   0.60

Fig. 11. NMSE (×10³) of the weighted median for different values of k for MG30.

For the two MG series, we obtained (Tables 10 and 11) around 26 networks with the linear function, 37 with the squared function and between 46 and 50 with the exponential function. Fifty is the maximal number of networks that we allowed for. Using the exponential loss tends to result in the inclusion of more weak learners in the boosted model, since the exponential loss emphasizes the harder cases more than the other loss functions do.

Table 10
Number of weak learners (mean, minimum and maximum) for several values of k, for 5 trials on the test set of MG17

              k = 5   10     20     100    150    200
Lin.  Mean    16.6    19.8   19.4   31.2   35.2   44.8
      Min.    12      13     14     16     13     35
      Max.    25      26     25     44     50     50
Squ.  Mean    15.8    16.2   31.4   50     50     50
      Min.    10      10     20     50     50     50
      Max.    23      24     50     50     50     50
Exp.  Mean    48.2    44.8   50     50     50     50
      Min.    41      38     50     50     50     50
      Max.    50      50     50     50     50     50

Table 11
Number of weak learners (mean, minimum and maximum) for several values of k, for 5 trials on the test set of MG30

              k = 5   10     20     150    200    300
Lin.  Mean    15.2    12.6   14.6   24     36.6   44.6
      Min.    11      11     9      17     26     23
      Max.    21      15     21     33     50     50
Squ.  Mean    17.6    15.2   20.2   50     50     50
      Min.    11      9      15     50     50     50
      Max.    28      27     25     50     50     50
Exp.  Mean    38      42.4   45     50     50     50
      Min.    32      28     39     50     50     50
      Max.    44      50     50     50     50     50

Tables 12 and 13 show the best results and the best mean results, respectively, obtained by various models applied to the two test sets. The linear, polynomial, RBF and MLP models are mentioned in [59]. The FIR MLP put forward in [19] has 15 neurons in the hidden layer; FIR connections of order 8 are used between the inputs and the hidden neurons, while the order of the connections between the hidden neurons and the output is 2. The feedforward network used in [61] consists of a single input neuron, 20 hidden neurons and one output neuron; a delay is associated with every connection in the network, and the value of the delay is modified by a learning algorithm inspired by backpropagation.

Table 12
Best results (NMSE × 10³)

Model                     MG17    MG30
Linear                    269     324
Polynomial                11.2    39.8
RBF                       10.7    25.1
MLP                       10      31.6
FIR MLP                   4.9     16.2
TDFFN                     0.8     –
DRNN                      4.7     7.6
LLWNN                     0.25    –
RNN/BPTT                  0.23    0.89
EBPTT20                   0.13    0.05
CBPTT                     0.14    0.73
Boosting (linear, 150)    0.13    0.45
Boosting (squared, 100)   0.15    0.41

In [54], a hybrid evolutionary algorithm controlled by 7 parameters produces an RNN having 2 hidden neurons with sinusoidal transfer functions and several time-delayed connections. The DRNNs [58] have FIR connections of order 4 between the input and the hidden layer, FIR connections of order 2 between hidden neurons, and simple connections to the output neuron. The LLWNN [62] is a local linear wavelet neural network in which the connection weights between the hidden layer and the output layer are replaced by a local linear model; the corresponding learning algorithm requires the user to tune 8 parameters. For both EBPTT and CBPTT, the authors set the maximal value for the delays of the new connections to 34 and the maximal number of new connections to 10. The number of BPTT steps performed for each candidate connection during the exploratory stage of EBPTT was always set to 20. For the two algorithms EBPTT and CBPTT [5], the mean results reported in Tables 12 and 13 were obtained with RNNs with 10 time-delayed connections; the best results and best mean results were obtained with RNNs having 7 and 6 hidden neurons, respectively.

Table 12 shows the best NMSE obtained by various models on the test sets of MG17 and MG30. The best results reported in this table were obtained with (linear loss function, k = 150) for MG17 and with (squared loss function, k = 100) for MG30. This table also shows that the best results obtained with our boosting algorithm are close to the best results reported in the literature; only the EBPTT algorithm surpasses our results. On average (Tables 13 and 14), we obtain lower NMSE and standard deviations than BPTT and the two constructive algorithms.

To better understand why the performance depends on k and why the behavior differs between the datasets, note that when $k = 0$, $D_{n+1}(q) = 1/Q$, and when $k \gg 0$, $D_{n+1}(q) \approx p_{n+1}(q)$. The relations used to calculate the values of $p_{n+1}(q)$ are those of [2,42] for updating the distribution, and these values can be very close to 0 for the easier examples. If $k \gg 0$, the RNNs are in reality trained only on the examples having big errors (a fact which is reinforced when the squared cost function is used). Very few such examples exist and, what is more, they are rather isolated for the sunspots dataset; as a result the RNNs learn poorly on this dataset. For low k, the examples have almost equal weights and boosting brings little improvement. We are currently

Table 13
Best mean results (NMSE × 10³)

Model                     MG17    MG30
RNN/BPTT                  4.4     13
EBPTT                     0.62    1.84
CBPTT                     1.6     2.5
Boosting (squared, 100)   0.16    0.45
Boosting (squared, 200)   0.18    0.45

Table 14
Standard deviations × 10³

Model                     MG17    MG30
RNN/BPTT                  3.8     41
EBPTT                     0.074   0.84
CBPTT                     2.0     1.6
Boosting (squared, 100)   0.016   0.028
Boosting (squared, 200)   0.025   0.014


trying to identify a simple method for adjusting k according to the distribution of the errors of the first learner.

For MS prediction, we have only performed tests on MG17, since there are no results in the literature concerning MG30 MS forecasting. Table 15 and Fig. 12 show the strong improvements obtained over the original BPTT and over the constructive algorithms. Comparisons with other published results concerning MS prediction can only be performed for a horizon of 14; the results presented here are inferior to those of the local methods put forward in [12,13] or of the methods in [62,63], but the RNNs trained by our algorithm employed significantly fewer data points for training (500, compared to the 3000 or 10 000 of the usual benchmark [19,59]). However, the use of a huge number of points for learning the MG17 artificial time series, generated without noise, is doubtful because it can lead to models with poor generalization to noisy data.

The number of weak learners in our MS experiments on MG17 depends heavily on the loss function used. For the linear one, our algorithm constructs between 30 and 40 networks on average, except for horizons h = 4, 5, 6, for which we obtain 50 networks. For the squared loss function, the average number is slightly higher, with 50 networks for h = 4, 5, 7, 11, 12, 13. For the exponential one, the maximum number of networks (50) is generated for all horizons.

Table 15
MG17 time series: mean NMSE (×10³) on the cumulated test set as a function of the prediction horizon

Steps ahead h    BPTT   CBPTT   EBPTT   Lin.    Squ.    Exp.
1                22     13      1       0.17    0.16    0.17
2                179    124     101     0.24    0.28    0.25
3                145    124     16      0.57    0.57    0.52
4                8      7       4       0.57    0.54    0.52
5                266    253     181     0.98    1.26    1.27
6                321    321     232     2.11    15.25   4.66
12               167    156     158     6.72    8.66    7.57
14               337    194     196     15.21   21.32   15.60

The last three columns correspond, respectively, to parameters (linear, 150), (squared, 100) and (exponential, 100).


Fig. 12. MG17 time series: mean NMSE on the cumulated test set as a function of the prediction horizon.


8. Conclusion and future work

We adapted boosting to the problem of learning time dependencies in sequential data for predicting future values, adding a new parameter for tuning the boosting influence and using recurrent neural networks as 'weak' regressors. The experimental results we obtained show that the boosting algorithm actually improves performance compared with the use of a single RNN. We first compared our results on SS prediction with the results obtained by other combination methods; the results obtained with our boosting algorithm are close to the best results reported in the literature. We then assessed this algorithm on multi-step-ahead prediction. The experimental results obtained on two benchmark problems show that boosting recurrent neural networks greatly improves MS forecasting. The boosting effect proved to be less effective for sunspots MS forecasts because some short-term dependencies are essential for the prediction of some parts of the data. The fact that the results are better for the Mackey-Glass dataset can be explained by noting that long-range dependencies play a more important role for MG17 than for sunspots.

We find that local approaches are better than our method if enough data is available. If not, however, our algorithm increases the prediction performance of the global approach relying on RNNs. Further experiments should determine whether local approaches are too powerful for noisy real datasets, and should allow us to make comparisons with our method. Future work may include the modification of our boosting algorithm for dealing with RNNs with time-delayed connections, and the addition of our k parameter for tuning the boosting influence in AdaBoost itself.

References

[1] Y. Freund, R.E. Schapire, A decision-theoretic generalization of on-line learning and an application to boosting, Journal of Computer and System Sciences 55 (1) (1997) 119–139.
[2] H. Drucker, Boosting using neural nets, in: A. Sharkey (Ed.), Combining Artificial Neural Nets: Ensemble and Modular Learning, Springer, 1999.
[3] D.E. Rumelhart, G.E. Hinton, R.J. Williams, Learning internal representations by error propagation, in: D.E. Rumelhart, J. McClelland (Eds.), Parallel Distributed Processing: Explorations in the Microstructure of Cognition, MIT Press, Cambridge, MA, 1986.
[4] R.J. Williams, D. Zipser, A learning algorithm for continually running fully recurrent neural networks, Neural Computation 1 (1989) 270–280.
[5] R. Boné, M. Crucianu, J.-P. Asselin de Beauville, Learning long-term dependencies by the selective addition of time-delayed connections to recurrent neural networks, Neurocomputing 48 (1–4) (2002) 251–266.
[6] R.E. Schapire, The strength of weak learnability, Machine Learning 5 (1990) 197–227.
[7] G.U. Yule, On a method of investigating periodicity in disturbed series with special reference to Wolfer's sunspot numbers, Philosophical Transactions of the Royal Society of London Series A 226 (1927) 267–298.

References

[1] Y. Freund, R.E. Schapire, A decision-theoretic generalization of on-line learning and an application to boosting, Journal of Computer and System Sciences 55 (1) (1997) 119–139.
[2] H. Drucker, Boosting Using Neural Nets, in: A. Sharkey (Ed.), Combining Artificial Neural Nets: Ensemble and Modular Learning, Springer, 1999.
[3] D.E. Rumelhart, G.E. Hinton, R.J. Williams, Learning Internal Representations by Error Propagation, in: D.E. Rumelhart, J. McClelland (Eds.), Parallel Distributed Processing: Explorations in the Microstructure of Cognition, MIT Press, Cambridge, MA, 1986.
[4] R.J. Williams, D. Zipser, A learning algorithm for continually running fully recurrent neural networks, Neural Computation 1 (1989) 270–280.
[5] R. Boné, M. Crucianu, J.-P. Asselin de Beauville, Learning long-term dependencies by the selective addition of time-delayed connections to recurrent neural networks, Neurocomputing 48 (1–4) (2002) 251–266.
[6] R.E. Schapire, The strength of weak learnability, Machine Learning 5 (1990) 197–227.
[7] G.U. Yule, On a method of investigating periodicity in disturbed series with special reference to Wolfer's sunspot numbers, Philosophical Transactions of the Royal Society of London Series A 226 (1927) 267–298.
[8] F. Takens, Detecting Strange Attractors in Turbulence, in: D.A. Rand, L.S. Young (Eds.), Dynamical Systems and Turbulence, Lecture Notes in Mathematics, vol. 898, Springer-Verlag, Berlin, 1980.
[9] A. Aussem, Sufficient conditions for error backflow convergence in dynamical recurrent neural networks, Neural Computation 14 (8) (2002) 1907–1927.
[10] B. Hammer, P. Tino, Recurrent neural networks with small weights implement definite memory machines, Neural Computation 15 (8) (2003) 1897–1926.
[11] M. Casdagli, S. Eubank, J.D. Farmer, J. Gibson, State space reconstruction in the presence of noise, Physica D 51 (1991) 52–98.
[12] J. Vesanto, Using the SOM and Local Models in Time-Series Prediction, in: Proceedings of the Workshop on Self-Organizing Maps, Espoo, Finland, June 1997, pp. 209–214.
[13] L. Chudy, I. Farkas, Prediction of chaotic time-series using dynamic cell structures and local linear models, Neural Network World 8 (5) (1998) 481–489.
[14] F. Gers, D. Eck, J. Schmidhuber, Applying LSTM to Time Series Predictable Through Time-Window Approaches, in: Proceedings of the International Conference on Artificial Neural Networks, Vienna, Austria, 2001, pp. 669–675.
[15] N.G. Pavlidis, D.K. Tasoulis, M.N. Vrahatis, Time Series Forecasting Methodology for Multiple-Step-Ahead Prediction, in: Proceedings of the International Conference on Computational Intelligence, Calgary, Alberta, Canada, 2005, pp. 456–461.
[16] J. Walter, H. Ritter, K.J. Schulten, Non-linear Prediction with Self-organizing Feature Maps, in: Proceedings of the International Joint Conference on Neural Networks, San Diego, USA, 1990, pp. 589–594.
[17] T. Martinez, S.G. Berkovich, K. Schulten, Neural-gas network for vector quantization and its application to time-series prediction, IEEE Transactions on Neural Networks 4 (4) (1993) 558–569.
[18] A.D. Back, A.C. Tsoi, Stabilization Properties of Multilayer Feedforward Networks with Time-Delay Synapses, in: I. Aleksander, J. Taylor (Eds.), Artificial Neural Networks, Elsevier Science Publishers, 1992.
[19] E.A. Wan, Time Series Prediction by Using a Connectionist Network with Internal Delay Lines, in: A.S. Weigend, N.A. Gershenfeld (Eds.), Time Series Prediction: Forecasting the Future and Understanding the Past, Addison-Wesley, 1994.
[20] T. Czernichow, A. Piras, K. Imhof, P. Caire, Y. Jaccard, B. Dorizzi, A. Germont, Short term electrical load forecasting with artificial neural networks, Engineering Intelligent Systems 4 (2) (1996) 85–99.
[21] T. Lin, B.G. Horne, P. Tino, C.L. Giles, Learning long-term dependencies in NARX recurrent neural networks, IEEE Transactions on Neural Networks 7 (6) (1996) 1329–1338.
[22] S. El Hihi, Y. Bengio, Hierarchical Recurrent Neural Networks for Long-Term Dependencies, in: M. Mozer, D.S. Touretzky, M. Perrone (Eds.), Advances in Neural Information Processing Systems, MIT Press, Cambridge, MA, 1996, pp. 493–499.
[23] A.G. Parlos, O.T. Rais, A.F. Atiya, Multi-step-ahead prediction using dynamic recurrent neural networks, Neural Networks 13 (7) (2000) 765–786.
[24] R. Boné, M. Crucianu, An Evaluation of Constructive Algorithms for Recurrent Networks on Multi-Step-Ahead Prediction, in: Proceedings of the International Conference on Neural Information Processing, Singapore, 2002, pp. 547–551.
[25] A.F. Atiya, S.M. El-Shoura, S.I. Shaheen, M.S. El-Sherif, A comparison between neural network forecasting techniques – Case study: River flow forecasting, IEEE Transactions on Neural Networks 10 (2) (1999) 402–409.
[26] J.A.K. Suykens, J. Vandewalle, Learning a simple recurrent neural state space model to behave like Chua's double scroll, IEEE Transactions on Circuits and Systems-I 42 (1995) 499–502.

[27] M. Duhoux, J. Suykens, B. De Moor, J. Vandewalle, Improved long-term temperature prediction by chaining of neural networks, International Journal of Neural Systems 11 (1) (2001) 1–10.
[28] H.N. Nguyen, C.W. Chan, Multiple neural networks for a long term time series forecast, Neural Computing and Applications 13 (2004) 90–98.
[29] H. Jaeger, H. Haas, Harnessing nonlinearity: Predicting chaotic systems and saving energy in wireless communication, Science 304 (2004) 78–80.
[30] J. Schmidhuber, D. Wierstra, F.J. Gomez, Evolino: Hybrid Neuroevolution/Optimal Linear Search for Sequence Learning, in: Proceedings of the 19th International Joint Conference on Artificial Intelligence, Edinburgh, Scotland, 2005, pp. 853–858.
[31] G. Brown, J. Wyatt, R. Harris, X. Yao, Diversity creation methods: A survey and categorisation, Information Fusion 6 (1) (2005) 5–20 (special issue on diversity in multiple classifier systems).
[32] L. Breiman, Bagging predictors, Machine Learning 24 (1996) 123–140.
[33] Y. Freund, R.E. Schapire, Experiments with a New Boosting Algorithm, in: Proceedings of the Thirteenth International Conference on Machine Learning, 1996, pp. 148–156.
[34] L. Breiman, Stacked regressions, Machine Learning 24 (1996) 49–64.
[35] D.H. Wolpert, Stacked generalization, Neural Networks 5 (1992) 241–259.
[36] J. Gama, Combining Classifiers by Constructive Induction, in: Proceedings of the 10th European Conference on Machine Learning, 1998, pp. 178–189.
[37] J.R. Quinlan, Bagging, Boosting and C4.5, in: Proceedings of the Thirteenth National Conference on Artificial Intelligence, Cambridge, MA, 1996, pp. 725–730.
[38] R. Avnimelech, N. Intrator, Boosting regression estimators, Neural Computation 11 (2) (1999) 491–513.
[39] T. Hastie, R. Tibshirani, J. Friedman, The Elements of Statistical Learning – Data Mining, Inference and Prediction, Springer Series in Statistics, 2001.
[40] G.D. Cook, A.J. Robinson, Boosting the Performance of Connectionist Large Vocabulary Speech Recognition, in: Proceedings of the International Conference on Spoken Language Processing, Philadelphia, PA, 1996, pp. 1305–1308.
[41] Y. Freund, Boosting a Weak Learning Algorithm by Majority, in: Proceedings of the Workshop on Computational Learning Theory, 1990, pp. 202–216.
[42] H. Drucker, Improving Regressors using Boosting Techniques, in: Proceedings of the Fourteenth International Conference on Machine Learning, 1997, pp. 107–115.
[43] L. Mason, J. Baxter, P.L. Bartlett, M. Frean, Functional Gradient Techniques for Combining Hypotheses, in: A.J. Smola, P.L. Bartlett, B. Schölkopf, D. Schuurmans (Eds.), Advances in Large Margin Classifiers, MIT Press, Cambridge, MA, 1999.
[44] G. Ridgeway, D. Madigan, T. Richardson, Boosting Methodology for Regression Problems, in: Artificial Intelligence and Statistics, 1999, pp. 152–161.
[45] G. Rätsch, M. Warmuth, S. Mika, T. Onoda, S. Lemm, K.-R. Müller, Barrier Boosting, in: Proceedings of COLT, San Francisco, June 2000, pp. 170–179.
[46] G. Karakoulas, J. Shawe-Taylor, Towards a Strategy for Boosting Regressors, in: A.J. Smola, P.L. Bartlett, B. Schölkopf, D. Schuurmans (Eds.), Advances in Large Margin Classifiers, MIT Press, Cambridge, MA, 1999.
[47] N. Duffy, D. Helmbold, Boosting methods for regression, Machine Learning 47 (2002) 153–200.
[48] P. Bühlmann, B. Yu, Boosting with the L2-loss: Regression and classification, Journal of the American Statistical Association 98 (2003) 324–340.
[49] R.S. Zemel, T. Pitassi, A Gradient-Based Boosting Algorithm for Regression Problems, in: Advances in Neural Information Processing Systems 13, Cambridge, MA, USA, 2001, pp. 696–702.

[50] F. Audrino, P. Bühlmann, Volatility estimation with functional gradient descent for very high-dimensional financial time series, Journal of Computational Finance 6 (3) (2003) 1–26.
[51] S. Santini, A. Del Bimbo, Recurrent neural networks can be trained to be maximum a posteriori probability classifiers, Neural Networks 8 (1) (1995) 25–29.
[52] D.R. Seidl, R.D. Lorenz, A Structure by which a Recurrent Neural Network Can Approximate a Nonlinear Dynamic System, in: Proceedings of the International Joint Conference on Neural Networks, Seattle, USA, 1991, pp. 709–714.
[53] M. Assaad, R. Boné, H. Cardot, Study of the Behavior of a New Boosting Algorithm for Recurrent Neural Network, in: Proceedings of the International Conference on Artificial Neural Networks, Warsaw, Poland, 2005, pp. 169–174.
[54] J.R. McDonnell, D. Waagen, Evolving recurrent perceptrons for time series modeling, IEEE Transactions on Neural Networks 5 (1) (1994) 24–38.
[55] H. Tong, K.S. Lim, Threshold autoregression, limit cycles and cyclical data, Journal of the Royal Statistical Society B 42 (1980) 245–292.
[56] A.S. Weigend, B.A. Huberman, D.E. Rumelhart, Predicting the Future: A Connectionist Approach, International Journal of Neural Systems 1 (3) (1990) 193–209.


[57] A. Aussem, Dynamical recurrent neural networks: Towards prediction and modelling of dynamical systems, Neurocomputing 28 (1999) 207–232.
[58] A. Aussem, Nonlinear Modeling of Chaotic Processes with Dynamical Recurrent Neural Networks, in: Neural Networks and Their Applications, Marseille, France, 1998, pp. 425–433.
[59] M. Casdagli, Nonlinear prediction of chaotic time series, Physica D 35 (1989) 335–356.
[60] A. Back, E.A. Wan, S. Lawrence, A.C. Tsoi, A Unifying View of Some Training Algorithms for Multilayer Perceptrons with FIR Filter Synapses, in: Neural Networks for Signal Processing IV, Ermioni, Greece, 1994, pp. 146–154.
[61] R.J. Duro, J. Santos Reyes, Discrete-time backpropagation for training synaptic delay-based artificial neural networks, IEEE Transactions on Neural Networks 10 (4) (1999) 779–789.
[62] Y. Chen, B. Yang, J. Dong, Time-series prediction using a local linear wavelet neural network, Neurocomputing 69 (2006) 449–465.
[63] H. Jaeger, The "Echo State" Approach to Analyzing and Training Recurrent Neural Networks, Technical Report GMD Report 148, German National Research Center for Information Technology, Germany, 2001.