Energy Conversion and Management 49 (2008) 1156–1166
Bayesian neural network approach to short time load forecasting

Philippe Lauret a,*, Eric Fock a, Rija N. Randrianarivony b, Jean-François Manicom-Ramsamy a
a Laboratoire de Physique du Bâtiment et des Systèmes, Université de La Réunion, BP 7151, 15 Avenue René Cassin, 97715 Saint-Denis, France
b IME – Institut pour la maîtrise de l'énergie, University of Antananarivo, BP 566, Madagascar

Received 23 January 2007; accepted 9 September 2007. Available online 23 October 2007.

* Corresponding author. Tel.: +262 93 81 27; fax: +262 93 86 65. E-mail address: [email protected] (P. Lauret).
doi:10.1016/j.enconman.2007.09.009
Abstract

Short term load forecasting (STLF) is an essential tool for efficient power system planning and operation. We propose in this paper the use of Bayesian techniques in order to design an optimal neural network based model for electric load forecasting. The Bayesian approach to modelling offers significant advantages over classical neural network (NN) learning methods. Among others, one can cite the automatic tuning of regularization coefficients, the selection of the most important input variables, the derivation of an uncertainty interval on the model output and the possibility to perform a comparison of different models and, therefore, select the optimal model. The proposed approach is applied to real load data.
© 2007 Elsevier Ltd. All rights reserved.

Keywords: Load modelling; Short term load forecasting; Neural networks; Bayesian inference; Model selection
1. Introduction

Short term load forecasting (STLF) is essential for planning the day to day operation of an electric power system [1]. Accurate forecasts of the system load on an hour by hour basis from one day to a week ahead help the system operator to accomplish a variety of tasks like economic scheduling of generating capacity, scheduling of fuel purchases, etc. In particular, forecasting the peak demand is important as the generation capacity of an electric utility must meet this requirement. As this forecasting leads to increased security operation conditions and economic cost savings, numerous techniques have been used to improve STLF [2]. Among these techniques, the use of neural networks (NNs) is particularly predominant in the load forecasting field [2,3]. Indeed, the availability of historical load data in the utility databases and the fact that NNs are data driven approaches capable of performing a non-linear mapping between sets of input and output variables make
this modelling tool very attractive. However, as stated by Refs. [2,4], NNs are such flexible models that the task of designing a NN for a particular application is far from easy. This statement stems from the fact that NNs are able to approximate any continuous function to an arbitrary accuracy, provided the number of hidden neurons is sufficient [5]. However, this ability has a downside: such a close approximation can also become an approximation to the noise. As a consequence, the model yields solutions that generalize poorly when new data are presented. In the NN community, this problem is called overfitting and may come about because the NN model is too complex (i.e. it possesses too many parameters). In fact, it is necessary to match the complexity of the NN to the problem being solved. The complexity determines the generalization capability (measured by the generalization or test error) of the model, since a NN that is either too simple or too complex will give poor predictions. (As an illustration, consider a polynomial model whose complexity is controlled by the number of coefficients: a too low-order polynomial will be unable to capture the underlying trends in the data, whilst a too high-order polynomial will model the noise on the data.) There are mainly two
approaches to controlling NN complexity, namely architecture selection and regularization techniques. Architecture selection controls the complexity by varying the number of NN parameters (called weights and biases). One of the simplest ways involves the use of networks with a single hidden layer, in which the number of free parameters is controlled by adjusting the number of hidden units. Other approaches consist in growing or pruning the network structure during the training process. The approach taken by the pruning methods is to start with a relatively large network and gradually remove either connections or complete hidden units [6–8]. The technique of regularization encourages smoother network mappings by favouring small values for the NN parameters. Indeed, it has been shown [6] that small values for the weights decrease the tendency of the model to overfit. One of the simplest forms of regularizer is called weight decay, and a regularization coefficient (also called the weight decay term) controls the degree of regularization.

However, each of these techniques requires the tuning of a control parameter (i.e. regularization coefficient, number of hidden units, pruning parameter) in order to maximize the generalization performance of the NN. Classically, this control parameter is set with so called cross-validation (CV) techniques (see the illustrative sketch below). Indeed, CV provides an estimate of the generalization error and, therefore, offers a way to select the best architecture or the optimal regularization coefficient. Unfortunately, CV presents several disadvantages. First, CV needs a separate data set (and hence fewer data for the training set), named the validation set, in order to evaluate the variation of the generalization error (called here the validation error) as the number of hidden neurons or the value of the regularization coefficient is changed. The optimal number of hidden nodes or the optimal value of the regularization coefficient corresponds to the minimum validation error. Secondly, because intrinsic noise exists in real datasets and because of the limited amount of data, one has to repeat the CV experiment multiple times using different divisions of the data into training and validation sets. This leads to well known CV techniques like k-fold cross-validation or the leave-one-out method [6]. Consequently, CV may become computationally demanding and tedious, and regarding, for instance, the regularization technique, only a small range of weight decay coefficients is usually tested.

Another critical issue is the determination of the relevant input variables. Indeed, too many input variables, of which some are irrelevant to the estimation of the output, could hamper the model. This is particularly true in the case of limited data sets, where random correlations between the output and the input variables can be modelled. Again, unfortunately, classical NN methods are unable to treat this non-trivial task in a satisfactory way.
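To make the cross-validation procedure discussed above concrete, the following sketch illustrates a k-fold selection of the weight decay coefficient. It is only an illustration under stated assumptions: a ridge-regularized linear model (which has a closed form fit) stands in for the NN, and all function and variable names are ours, not the paper's.

```python
import numpy as np

def kfold_cv_weight_decay(X, t, decays, k=5, seed=0):
    """Hypothetical k-fold cross-validation for choosing a weight decay
    coefficient.  A ridge-regularized linear model stands in for the NN;
    with an NN, the closed-form fit below would be replaced by training."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(t))
    folds = np.array_split(idx, k)
    val_error = []
    for mu in decays:
        errs = []
        for i in range(k):
            val = folds[i]
            train = np.hstack([folds[j] for j in range(k) if j != i])
            # fit: minimize ||X w - t||^2 + mu ||w||^2 on the training folds
            Xtr, ttr = X[train], t[train]
            w = np.linalg.solve(Xtr.T @ Xtr + mu * np.eye(X.shape[1]), Xtr.T @ ttr)
            e = X[val] @ w - t[val]
            errs.append(np.mean(e ** 2))       # validation MSE for this fold
        val_error.append(np.mean(errs))
    return decays[int(np.argmin(val_error))]   # decay with lowest validation error
```

Each candidate weight decay value requires k separate fits, which is what makes the procedure computationally demanding when the candidate range is large.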
Most researchers in the realm of STLF have emphasized the need to design the NN model properly but deplored the lack of a consistent method that allows deriving the optimal NN model [2,4,9]. As a consequence, Hippert [2] stated that some researchers are sceptical and believe that no systematic evidence exists that NNs outperform standard forecasting methods (such as those based on time series). Furthermore, Alves da Silva [10] stated that one should not produce a forecast of any kind without an idea of its reliability, and point predictions are meaningless when the time series is noisy.

In this paper, we argue that, in order to obtain a good model for electric load forecasting, emphasis has to be put on the design of the NN. In other words, conventional NN learning methods must be improved. For this purpose, we propose a probabilistic interpretation of NN learning by using Bayesian techniques. MacKay [11] originally developed Bayesian methods for NNs. The Bayesian approach to modelling offers significant advantages over the classical NN learning process. Among others, one can cite the automatic tuning of the regularization coefficient using all the available data and the selection of the most important input variables through a specific technique called automatic relevance determination (ARD). In addition, the reliability of the forecast is taken into account, as the method computes an error bar on the model output. As we emphasize NN modelling, it is important (in our view) to search for the optimal NN structure (i.e. the optimal number of hidden nodes). Again, we will see that the Bayesian method offers a means to select the optimal NN model (by performing a model comparison). In this study, the Bayesian approach to neural learning is applied to real load data. The data were provided by EDF, the French electricity utility.

2. Model description and context of study

In order to assess the feasibility of the proposed approach, we designed a NN whose goal is to forecast the next day's load at the same hour. Actually, this model constitutes an hourly module that determines the non-linear relationship between each hour's load and the past load and weather readings for the same hour. This hourly module is part of a global forecaster (that yields the complete load profile for the next day) obtained by combining the 24 hourly modules. A bibliographic survey helped us to retain a priori the input variables given in Table 1. The set of inputs typically contains exogenous variables (weather related variables), indicators such as the week end or holiday flag and past values of the load. Obviously, there is a strong correlation between the load demand and the weather related variables. According to Ref. [2], load series are known to be non-linear functions of the exogenous variables. For instance, a U shaped scatter plot between load demand and temperature has been reported by Ref. [9]. The data were collected from a micro-region of the south of Reunion Island (21.06°S, 55.36°E) in 2001. The database contains up to 1074 hourly records.
Table 1
Inputs of the NN model

Input | Symbol   | Description
x1    | wf_{d-1} | Week end or holiday flag for day d-1
x2    | h        | Hour
x3    | Wd_{d-1} | Yesterday's actual wind direction at this hour
x4    | Ws_{d-1} | Yesterday's actual wind speed at this hour
x5    | T_{d-1}  | Yesterday's actual temperature at this hour
x6    | G_{d-1}  | Yesterday's actual global solar irradiance at this hour
x7    | L_{d-1}  | Yesterday's actual load at this hour
x8    | wf_d     | Week end or holiday flag for day d
x9    | Wd_d     | Actual wind direction at this hour
x10   | Ws_d     | Actual wind speed at this hour
x11   | T_d      | Actual temperature at this hour
x12   | G_d      | Actual global solar irradiance at this hour
x13   | L_d      | Actual load at this hour
x14   | T_{d+1}  | Temperature forecast for the next day at this hour
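As a concrete illustration of how the 14 inputs of Table 1 could be assembled for one hourly module, a minimal sketch is given below. The array names and the [day, hour] indexing convention are assumptions made for the illustration only; they are not taken from the paper.

```python
import numpy as np

def hourly_input_vector(wf, Wd, Ws, T, G, L, T_forecast, d, h):
    """Assemble the 14 inputs of Table 1 for day d and hour h.
    wf is a hypothetical per-day week-end/holiday flag; the other arrays
    are hypothetical 2-D histories indexed as [day, hour]."""
    return np.array([
        wf[d - 1], h,                                   # x1, x2
        Wd[d - 1, h], Ws[d - 1, h], T[d - 1, h],        # x3..x5 (yesterday's weather)
        G[d - 1, h], L[d - 1, h],                       # x6, x7
        wf[d],                                          # x8
        Wd[d, h], Ws[d, h], T[d, h], G[d, h], L[d, h],  # x9..x13 (today's readings)
        T_forecast[d + 1, h],                           # x14 (temperature forecast)
    ])
```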
3. Neural network approach to STLF

We chose a NN to model the electric load. Indeed, in the case of NN based models, and as opposed, for instance, to polynomial regression techniques, no explicit knowledge of the functional form between the inputs and the output is needed. The most popular form of NN is the so called multilayer perceptron (MLP). The MLP structure consists of an input layer, one or several hidden layers and an output layer. The input layer gathers the model's input vector x while the output layer yields the model's output vector y. In our case, the input vector x is given by the hourly values of the variables x1 to x14 given in Table 1, and the output vector consists of only one output y, which is the corresponding forecast of the next day's load at the same hour. Fig. 1 represents a one hidden layer MLP. The hidden layer is characterized by several non-linear units (or neurons). The non-linear function (also called the activation function) is usually the hyperbolic tangent function f(x) = tanh(x) = (e^{x} - e^{-x})/(e^{x} + e^{-x}). Therefore, a NN with d inputs, h hidden neurons and a single linear output unit defines a non-linear parameterized mapping from an input x to an output y given by the following relationship:

y = y(x; w) = \sum_{j=0}^{h} w_j \, f\!\left( \sum_{i=0}^{d} w_{ji} x_i \right)    (1)

Fig. 1. Sketch of a MLP with d inputs and h hidden units; in our case, d = 14 (see Table 1). The output y is the next day's load at the same hour.
The parameters of the NN model are given by the so called weights and biases that connect the layers (notice that, in Eq. (1), the biases are denoted by the subscripts i = 0 and j = 0 and are not represented in Fig. 1). The NN parameters, denoted by the parameter vector w, govern the non-linear mapping. The NN parameters w are estimated during a phase called the training or learning phase. During this phase, the NN is trained using a dataset (called the training set) of N input and output examples, pairs of the form D = {x_i, t_i}_{i=1}^{N}. The vector x contains samples of each of the 14 input variables described in Table 1. The variable t, also called the target variable, is the corresponding measurement of the electric load. The training phase consists in adjusting w so as to minimize an error function E_D, which is usually the sum of squares error between the experimental or actual output t_i and the network output y_i = y(x_i; w):

E_D(w) = \frac{1}{2} \sum_{i=1}^{N} \{y_i - t_i\}^2 = \frac{1}{2} \sum_{i=1}^{N} e_i^2    (2)
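A minimal numerical sketch of the mapping of Eq. (1) and of the error function of Eq. (2) follows. The shapes and parameter names are illustrative assumptions; the biases are written explicitly rather than through the index-0 convention of Eq. (1).

```python
import numpy as np

def mlp_forward(x, W1, b1, w2, b2):
    """Eq. (1): single-hidden-layer MLP with tanh hidden units and a linear
    output.  W1 is (h, d), b1 is (h,), w2 is (h,), b2 is a scalar."""
    hidden = np.tanh(W1 @ x + b1)      # non-linear hidden activations
    return w2 @ hidden + b2            # linear output unit

def sum_of_squares_error(params, X, t):
    """Eq. (2): E_D(w) = 1/2 * sum_i (y_i - t_i)^2 over the training set."""
    W1, b1, w2, b2 = params
    y = np.array([mlp_forward(x, W1, b1, w2, b2) for x in X])
    return 0.5 * np.sum((y - t) ** 2)
```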
The second phase, called the generalization phase, consists of evaluating the ability of the NN to generalize, that is to say, to give correct outputs when it is confronted with examples that were not seen during the training phase. Notice that these examples are part of a data set called the test set. In the NN community, the performance measure (also called the generalization error) is usually given by the mean squared error, MSE = \frac{1}{N} \sum_{i=1}^{N} e_i^2, or the root mean squared error, RMSE = \sqrt{MSE}. In the electricity supply industry, the mean absolute percentage error, MAPE = \frac{100}{N} \sum_{i=1}^{N} |e_i|/t_i, is usually reported.

A good generalization (i.e. good predictions for new inputs) is obtained through control of the NN complexity by using techniques such as pruning or regularization [6,8,11]. Nonetheless, it must be noted that, contrary to architecture selection methods like pruning, regularization does not explicitly delete the irrelevant weights (or connections, see Fig. 1) but drives to zero (or nearly zero) the parameters that do not participate in the relationship. In order to better introduce the Bayesian optimization of regularization coefficients, we describe below the principles of the regularization technique. As mentioned above, the technique of regularization encourages smoother network mappings by adding a penalty term (also known as a weight decay term) to the preceding objective function (see Eq. (2)), resulting in the new objective function:
S(w) = E_D(w) + \mu E_W(w)    (3)

The additional term E_W(w) = \frac{1}{2} \sum_{i=1}^{m} w_i^2 (where m is the total number of parameters) penalizes large values of the weights, which are known to be responsible for excessive curvature in the model [6]. The parameter μ, called the regularization coefficient, controls the degree of regularization. If μ is too large, then the model cannot fit the data well. Conversely, if μ is too small, then overfitting occurs. Thus, there exists an optimal value of μ that gives the best trade off between overfitting and underfitting. Rigorously, this trade off is called the bias–variance trade off; the interested reader should refer to Refs. [6,11]. Classically, the optimal value of the weight decay coefficient μ must be found through cross-validation techniques and, because of the drawbacks of cross-validation, only a limited pre-specified range of regularization coefficients can be tested in a discrete manner. In order to overcome these limits, we propose a probabilistic interpretation of NN learning that allows the NN complexity to be controlled automatically.

4. Bayesian neural network approach to STLF

4.1. Principle of Bayesian NN learning: a probabilistic approach to NN learning

In this paper, the principles of Bayesian reasoning [12–15] are outlined and applied to the estimation of the NN parameters. The remainder of this section summarizes the Bayesian approach as stated by Ref. [11]. It will be shown that the overfitting problem can be solved by using a Bayesian approach to control the model complexity.

The Bayesian approach considers a probability density function (pdf) over the weight space. This pdf represents the degrees of belief attached to the different values of the weight vector. This pdf is set initially to some prior distribution and converted into a posterior distribution once the data have been observed, through the use of Bayes' theorem [14]. So, instead of the single 'best' set of weights computed by the classical approach of maximum likelihood (through minimization of an error function), Bayesian methods yield a complete distribution for the NN parameters. This posterior distribution can then be used, for instance, to infer predictions of the network for new values of the input variables.

4.1.1. The prior

As we have a priori little idea of what the weight values should be, the prior is chosen as a rather broad distribution. This can be done by expressing the prior pdf as a Gaussian distribution with a large variance:

p(w|\alpha) = \frac{1}{Z_W(\alpha)} \exp(-\alpha E_W)    (4)
where α represents the inverse of the variance on the set of weights and biases and Z_W(α) represents the normalization
constant of the pdf. In the Bayesian framework, α is called a hyper-parameter as it controls the distribution of other parameters. The choice of Gaussian distributions simplifies the analysis and allows one to proceed analytically. Further, the choice of a Gaussian prior for the weights leads to a probabilistic interpretation of the preceding regularizer E_W (see Eq. (3)). Indeed, this regularizer can be interpreted as minus the logarithm of the prior probability distribution over the parameters. It is important to note that the previous distribution of weights is defined for a given value of α. Hence, for the moment, we shall assume that its value is known; this assumption will be relaxed in a later paragraph. As we chose a Gaussian prior, the normalization factor Z_W(α) is given by

Z_W(\alpha) = \int \exp(-\alpha E_W) \, dw = \left( \frac{2\pi}{\alpha} \right)^{m/2}    (5)

We recall that m is the total number of NN parameters.

4.1.2. The likelihood function or the noise model

The derivation of the likelihood function is linked to the definition of the noise model. Given a training dataset D of N examples of the form D = {x_i, t_i}_{i=1}^{N}, the goal of NN learning is to find a relationship R between x_i and t_i. Since there are uncertainties in this relation, as well as noise or phenomena that are not taken into account, this relation becomes

t_i = R(x_i) + \epsilon_i    (6)
where the noise ε_i is an expression of the various uncertainties. In our case, we want to approximate R(x) by y(x; w), i.e. a non-linear regression model given by a MLP. Hence, in the following, we assume that the ith target variable t_i (or measurement) is given by some deterministic function of the input vector x with added independent Gaussian noise. Further, if we assume that the errors have a normal distribution with zero mean and variance σ² = 1/β, the distribution of the noise is given by

p(\epsilon_i|\beta) = \sqrt{\frac{\beta}{2\pi}} \exp\!\left( -\frac{\beta}{2} \epsilon_i^2 \right)    (7)

Assuming independent noise, the joint probability of a set of N noise values can be written as

p(\epsilon|\beta) = p(\epsilon_1, \ldots, \epsilon_N|\beta) = \prod_{i=1}^{N} p(\epsilon_i|\beta) = \left( \frac{\beta}{2\pi} \right)^{N/2} \exp\!\left( -\frac{\beta}{2} \sum_{i=1}^{N} \epsilon_i^2 \right)    (8)
One can then take the difference between the data and the model, ε_i = t_i − y_i, and substitute it into Eq. (8) to obtain the likelihood function:

p(D|w, \beta) = \frac{1}{Z_D(\beta)} \exp(-\beta E_D)    (9)
β is also called a hyper-parameter and, as stated above, represents the inverse of the variance of the noise on the target output. Again, we shall assume (for the moment) that its value is known. The Gaussian noise model allows the normalization factor Z_D(β) to be computed analytically:

Z_D(\beta) = \left( \frac{2\pi}{\beta} \right)^{N/2}    (10)

Again, the choice of a Gaussian likelihood function leads to a probabilistic interpretation of the error function E_D. Indeed, the standard error function can be interpreted as minus the log likelihood of the noise model given by Eq. (9).

4.1.3. The posterior distribution

Once the prior pdf and the likelihood function have been defined, it is possible to infer the posterior pdf through the use of Bayes' theorem [14]:

p(w|\alpha, \beta, D) = \frac{p(D|w, \beta) \, p(w|\alpha)}{p(D|\alpha, \beta)}    (11)

where we have made use of the fact that the likelihood and the prior do not depend, respectively, on α and β. Using Eqs. (4) and (9), we finally obtain the posterior distribution

p(w|\alpha, \beta, D) = \frac{1}{Z_S(\alpha, \beta)} \exp[-S(w)]    (12)

with S(w) = βE_D + αE_W. The normalization constant is given by

Z_S(\alpha, \beta) = \int \exp[-S(w)] \, dw    (13)

It is important to note that, in a multi-parameter context such as NN learning (m is usually large), the evaluation of Z_S(α, β) cannot be performed analytically. So, in order to make this integral analytically tractable, MacKay [16] proposed a Gaussian approximation to the posterior distribution of the weights (see below). The optimal values of the NN weights correspond to the maximum of the posterior pdf. Since the normalizing factor is independent of the parameters, this is equivalent to minimizing the quantity S(w), i.e. the negative logarithm of Eq. (12):

S(w) = \frac{\beta}{2} \sum_{i=1}^{N} \{y(x_i; w) - t_i\}^2 + \frac{\alpha}{2} \sum_{i=1}^{m} w_i^2    (14)

As noted by Bishop [6], apart from an overall multiplicative factor, this is the usual sum of squares error function with a weight decay regularization term that depends only on the ratio μ = α/β. As one can see, the Bayesian approach automatically and naturally leads to a regularized function to minimize. So, it seems that the interpretation of S(w) as a log probability adds little new. However, it will be shown below that the probabilistic interpretation provides new tools (and particularly Bayes' theorem) capable of inferring from the data the optimum values of α and β, i.e. the ones that yield the best tuning of the model complexity. Furthermore, as probabilistic modelling handles uncertainty in a natural way, it will be possible to assign error bars to the model's predictions.

4.2. A Gaussian approximation of the posterior distribution

Nonetheless, the above inferences (i.e. Bayesian optimization of the control parameters and evaluation of error bars on the NN predictions) require integrations over weight space. As the normalization factor of Eq. (13) cannot be evaluated analytically, a Gaussian approximation of the posterior distribution is needed. This approximation is obtained through a second-order Taylor expansion of S(w) around its minimum value, i.e. around the most probable set of weights w_MP (see Ref. [6] for details). Under the Gaussian approximation, the normalization factor Z_S(α, β) can now be evaluated analytically, thereby giving

Z_S(\alpha, \beta) = (2\pi)^{m/2} |A|^{-1/2} \exp[-S(w_{MP})]    (15)

where |A| is the determinant of the Hessian matrix of the total (regularized) error function S(w).

4.3. Bayesian optimization of the control parameters: the evidence framework

As stated above, the hyper-parameters α and β control the complexity of the NN model. Thus far, we have supposed that their values are known but, for most applications, one has little idea of suitable values for the hyper-parameters [6]. One way of dealing with this problem is to include the noise variance and the regularization coefficient in the list of parameters; the parameter vector θ is now θ = {w, α, β}. However, this still does not give the optimal values of the hyper-parameters. Fortunately, the rules of probability theory included in the Bayesian framework offer a response through the concept of marginalization. The latter implies that one has to integrate the hyper-parameters out of any predictions, i.e.

p(w|D) = \int p(w, \alpha, \beta|D) \, d\alpha \, d\beta = \int p(w|\alpha, \beta, D) \, p(\alpha, \beta|D) \, d\alpha \, d\beta    (16)

Although this marginalization can be done analytically, MacKay [16] argued that, in practice, it is better to find the optimal values of the hyper-parameters, denoted by α_MP and β_MP, and to perform the remaining calculations with the hyper-parameters set to these optimal values. MacKay [11] called this approach the evidence framework. Again, the use of Bayes' theorem allows us to infer the optimal values of the hyper-parameters from the data:
p(\alpha, \beta|D) = \frac{p(D|\alpha, \beta) \, p(\alpha, \beta)}{p(D)}    (17)

where p(α, β) is the prior over the hyper-parameters, called a hyper-prior, and p(D|α, β) is the likelihood term, called the evidence for α and β. As we have no prior idea about the values of the hyper-parameters, p(α, β) is taken as a uniform distribution. Further, since the normalization factor p(D) is independent of α and β, maximizing the posterior p(α, β|D) amounts to maximizing the evidence term p(D|α, β). By using the Gaussian approximation of the posterior of the weights, it can be shown [6] that the evidence term can be written in the following form:

p(D|\alpha, \beta) = \frac{Z_S(\alpha, \beta)}{Z_D(\beta) \, Z_W(\alpha)} = \frac{(2\pi)^{m/2} |A|^{-1/2} \exp[-S(w_{MP})]}{\left( \frac{2\pi}{\beta} \right)^{N/2} \left( \frac{2\pi}{\alpha} \right)^{m/2}}    (18)
The optimal values α_MP and β_MP correspond to the maximum of the evidence for α and β. As we saw above, this maximization is conveniently carried out by taking the natural logarithm of Eq. (18), thereby giving

\log p(D|\alpha, \beta) = -\alpha E_W^{MP} - \beta E_D^{MP} - \frac{1}{2} \log|A| + \frac{m}{2} \log \alpha + \frac{N}{2} \log \beta - \frac{N}{2} \log(2\pi)    (19)

The optimal values of the hyper-parameters are then obtained by differentiating the log evidence (Eq. (19)) with respect to these hyper-parameters, hence giving (details of the calculation can be found in Ref. [6]):

\alpha_{MP} = \frac{\gamma}{2 E_W^{MP}}, \qquad \beta_{MP} = \frac{N - \gamma}{2 E_D^{MP}}    (20)

with E_W^{MP} = E_W(w_{MP}) and E_D^{MP} = E_D(w_{MP}). The quantity γ is the number of well determined parameters, given by

\gamma = \sum_{k=1}^{m} \frac{\lambda_k}{\lambda_k + \alpha}    (21)

where the λ_k are the eigenvalues of the Hessian matrix of the un-regularized error E_D, i.e. H = β∇∇E_D.

As seen (and contrary to the CV methods), no validation set is involved in the computation of the optimal values. So, all available data can be used both for model fitting and for control parameter optimization. Furthermore, the method allows searching in high dimensional control parameter spaces, as it is possible to compute the gradient of the evidence with respect to the control parameters. In other words, contrary to the classical regularization technique, a large number of regularization coefficients can be evaluated. Further, unlike the CV techniques, which make use of a noisy performance measure such as the validation error, Bayesian optimization methods use the evidence, which is not a noisy function of the parameters [11].

4.4. Numerical implementation of the evidence framework

As we need to find the optimal values of α and β as well as the optimum weight vector w_MP, an iterative process is used to tune the network. This procedure is described below (a numerical sketch is given after the list):

1. Choose initial (small) values for α and β. Initialize the weights in the network using values drawn from the prior distribution. At iteration i, given the current estimates of the weights w^i and the hyper-parameters α^i and β^i, we can compute a current estimate of S^i(w) (see Eq. (14)).
2. Find the weights w_MP^{i+1} that minimize S^i(w) by using a standard non-linear training algorithm like Gauss–Newton or scaled conjugate gradient. Given w_MP^{i+1}, compute E_W^{i+1} and E_D^{i+1}.
3. Compute new values of α^{i+1} and β^{i+1} using the three successive sub-steps (a), (b) and (c):
   (a) γ^{i+1} = \sum_{k=1}^{m} λ_k / (λ_k + α^i), where λ_k is the kth eigenvalue of the Hessian matrix of the un-regularized error, i.e. β^i ∇∇E_D^{i+1}
   (b) α^{i+1} = γ^{i+1} / (2 E_W^{i+1})
   (c) β^{i+1} = (N − γ^{i+1}) / (2 E_D^{i+1})
4. Return to step 2 and use the parameters w^{i+1}, α^{i+1} and β^{i+1} to compute new values of the parameters.

Regarding the initial value of α, it is good practice to start with a relatively small value so that the model has sufficient flexibility to fit the data. Convergence of the procedure is obtained when the regularized error (see Eq. (14)) is equal to half the number of data points; indeed, the theory states that S(w) = N/2 when α = α_MP and β = β_MP.
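The sketch below illustrates this iterative re-estimation under a simplifying assumption: a model that is linear in its parameters stands in for the MLP, so that the minimizer of S(w) in step 2 and the Hessian of E_D are available in closed form. With the MLP of the paper, step 2 would rely on a non-linear optimizer and the Hessian would be evaluated at w_MP. All names are ours.

```python
import numpy as np

def evidence_framework(Phi, t, alpha=1e-3, beta=1.0, n_iter=50):
    """Iterative re-estimation of alpha and beta (Section 4.4) for a model
    that is linear in its parameters, y = Phi @ w (a stand-in for the MLP)."""
    N, m = Phi.shape
    H_D = Phi.T @ Phi                        # Hessian of E_D for the linear stand-in
    for _ in range(n_iter):
        # step 2: w_MP minimizes S(w) = beta*E_D + alpha*E_W (closed form here)
        A = beta * H_D + alpha * np.eye(m)   # Hessian of the regularized error S(w)
        w_mp = beta * np.linalg.solve(A, Phi.T @ t)
        E_D = 0.5 * np.sum((Phi @ w_mp - t) ** 2)
        E_W = 0.5 * np.sum(w_mp ** 2)
        # step 3(a): gamma = sum_k lambda_k / (lambda_k + alpha),
        # lambda_k being the eigenvalues of beta * (Hessian of E_D)
        lam = np.linalg.eigvalsh(beta * H_D)
        gamma = np.sum(lam / (lam + alpha))
        # steps 3(b) and 3(c): Eq. (20)
        alpha = gamma / (2.0 * E_W)
        beta = (N - gamma) / (2.0 * E_D)
    return w_mp, alpha, beta, gamma
```

Whatever the underlying model, the updates are those of steps 3(a)–3(c), and convergence can be monitored by checking that S(w) approaches N/2.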
4. Iterate to step 2 and use parameters wi+1, ai+1 and bi+1 to compute new values for the parameters. Regarding the initial values of a, it is good practice to start with relatively small values so that the model has sufficient flexibility to fit the data. The convergence of the procedure is obtained when the regularized error (see Eq. (14)) is equal to half the number of data points. Indeed, the theory states that SðwÞ ¼ N2 when a = aMP and b = bMP. 4.5. Selection of the input variables: the Bayesian technique of automatic relevance determination (ARD) In this section, we take a step further. Thus far, it has been assumed that there is a single class of weights that is controlled by a single hyper-parameter a. However, it is often desirable to divide the weights into several groups g (for instance, one group for the input layer weights, one for second layer weights, one for input layer biases and one for second layer biases). Again, these groups are controlled through a Gaussian prior by a hyper-parameter ag. As an illustration of this possibility, the technique of ARD consists in assigning a separate regularization coefficient to each input. More precisely, all weights related to the same input are controlled by the same hyper-parameter ag. As seen above, this hyper-parameter is associated with a Gaussian prior with zero mean and variance 1/ag.
At the end of the training session, the weights with a large α_g (or, conversely, a small variance 1/α_g) are close to zero. Consequently, one can state that the corresponding input is irrelevant and, therefore, can be eliminated.
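A minimal sketch of the ARD idea is given below. Again, a model that is linear in its inputs (one weight per input) stands in for the grouped-prior MLP described above; the per-input update used here is a MacKay-style re-estimation for this simplified case, and the noise hyper-parameter β is kept fixed for brevity. All names are ours.

```python
import numpy as np

def ard_relevance(Phi, t, beta=1.0, n_iter=100):
    """ARD sketch for a linear-in-the-inputs stand-in model: each weight w_g
    has its own hyper-parameter alpha_g; after convergence, 1/alpha_g plays
    the role of the per-input variance plotted in Fig. 5 (a large alpha_g,
    i.e. a small variance, flags an irrelevant input)."""
    N, m = Phi.shape
    alpha = np.full(m, 1e-3)                     # one hyper-parameter per input
    for _ in range(n_iter):
        A = beta * Phi.T @ Phi + np.diag(alpha)  # Hessian of S(w)
        A_inv = np.linalg.inv(A)
        w_mp = beta * A_inv @ Phi.T @ t          # most probable weights
        gamma = 1.0 - alpha * np.diag(A_inv)     # well-determined measure per weight
        alpha = gamma / (w_mp ** 2 + 1e-12)      # per-input re-estimation
    return 1.0 / alpha                           # per-input relevance (prior variance)
```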
4.6. Determination of error bars on the output

The probabilistic interpretation of NN learning offers a second important advantage. Indeed, it is possible to incorporate uncertainty about the parameters into the predictions. Regarding the STLF problem, in such a context of decision making, it is obviously of primary importance that the tool produces a measure of the reliability of the forecast. Instead of the single 'best' estimate predictions offered by the classical approach, the Bayesian technique allows the derivation of an error bar on the model output. Indeed, in the Bayesian formalism, the (posterior) distribution of weights gives rise to a distribution of network outputs. Under some approximations, and using the Gaussian approximation of the posterior of the weights, it can be shown [6] that the distribution of outputs is also Gaussian and has the following form:

p(t|x, D) = \frac{1}{\sqrt{2\pi\sigma_t^2}} \exp\!\left( -\frac{(t - y_{MP})^2}{2\sigma_t^2} \right)    (22)

This output distribution has a mean given by y_MP = y(x, w_MP) (the model response for the optimal estimate w_MP) and a variance given by

\sigma_t^2 = \sigma^2 + g^T A^{-1} g    (23)

where g = \nabla_w y|_{w_{MP}} is the gradient of the model output evaluated at the best estimate. One can interpret the standard deviation of the output distribution as an error bar on the mean value y_MP. This error bar has two contributions, one arising from the intrinsic noise on the data and one arising from the width of the posterior distribution, which is related to the uncertainties on the NN weights. Notice, however, that uncertainties on the hyper-parameters and on the inputs are not taken into account in Eq. (23).

4.7. Bayesian model selection

As shown above, Bayesian methods are used for parameter estimation problems, but the Bayesian approach can also be used for model selection problems [15,17]. Indeed, the Bayesian formalism (by again using Bayes' theorem) can compute and attach a probability to a model. This probability (also called the evidence of the model) can be used as a measure to select the best model. In other words, the model that exhibits the greatest evidence is the most probable, given the data. In the NN context, NNs with different numbers of hidden nodes can be compared and ranked according to their evidence.

To illustrate these statements, let us consider a set of models M_i (which may correspond, in our case, to a set of NNs with various numbers of hidden nodes). From Bayes' theorem, we can compute the posterior probabilities of the different models M_i, given the data D:

p(M_i|D) = \frac{p(D|M_i) \, p(M_i)}{p(D)}    (24)

where p(M_i) is the prior probability assigned to model M_i, p(D) is the normalization constant and the quantity p(D|M_i) is called the evidence for M_i. As we have no reason to assign different priors to different models, the models are then compared and ranked according to their evidence. By using some assumptions (and particularly the Gaussian approximation of the posterior weight distribution), Bishop [6] gives an expression for the log of the evidence of a NN having h hidden units:

\log p(D|M_i) = -\alpha_{MP} E_W^{MP} - \beta_{MP} E_D^{MP} - \frac{1}{2} \log|A| + \frac{m}{2} \log \alpha_{MP} + \frac{N}{2} \log \beta_{MP} + \log(h!) + 2 \log h + \frac{1}{2} \log\frac{2}{\gamma} + \frac{1}{2} \log\frac{2}{N - \gamma}    (25)

In Section 5.4, this quantity will be used to rank the various NN models. Indeed, the log evidence of NNs with increasing numbers of hidden nodes will be estimated. The maximum of the log evidence will correspond to the most probable NN model (or optimal NN structure).

5. Results and discussion

The STLF models related to the Bayesian methods (i.e. the evidence framework as well as the ARD technique) were implemented by using a specific MATLAB toolbox called Netlab [18]. We chose contiguous ranges of data for training and testing, i.e. 693 hourly values were used for the NN training. The rest of the data were used for testing (381 samples). The performance of the models was assessed by reporting their MAPE and RMSE errors.

5.1. Illustration of overfitting with a classical NN modelling approach (Model 1)

In order to illustrate the overfitting problem and to highlight the benefits of the Bayesian approach, we deliberately chose to model the electric load with a (relatively large) classical NN of 32 hidden units (named model 1). We recall that one has no prior idea about the correct number of hidden units unless one undertakes a cross-validation procedure. Table 2 lists the performance of the models obtained on the training and test sets, and Fig. 2 shows the forecasted hourly load vs. the measured one. As one can see, the fit is very good on the training set, but the performance degrades on the test set. This type of result is a sign of overfitting. Here, the model is too complex, and one has to employ techniques that decrease this complexity.
Table 2
Performance of the different models

No. | Model                                      | Training set RMSE (MW) | Training set MAPE (%) | Test set RMSE (MW) | Test set MAPE (%)
1   | Classical NN (14 inputs – 32 hidden units) | 1.88                   | 0.62                  | 30.30              | 10.28
2   | Bayesian NN (14 inputs – 32 hidden units)  | 4.28                   | 1.29                  | 14.28              | 4.84
3   | Bayesian NN (10 inputs – 32 hidden units)  | 6.65                   | 1.93                  | 13.40              | 4.25
4   | Bayesian NN (10 inputs – 4 hidden units)   | 9.98                   | 3.17                  | 11.36              | 3.64
Fig. 2. Classical NN model 1: (a) training set and (b) test set.
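For reference, the two scores reported in Table 2 can be computed as in the following minimal sketch, where y and t stand for arrays of forecast and measured hourly loads; the function names are ours.

```python
import numpy as np

def rmse(y, t):
    """Root mean squared error (MW), as reported in Table 2."""
    return np.sqrt(np.mean((y - t) ** 2))

def mape(y, t):
    """Mean absolute percentage error (%), as reported in Table 2."""
    return 100.0 * np.mean(np.abs(y - t) / t)
```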
In the next section, we show that this overfitting can be eliminated by using a Bayesian approach for controlling the complexity of the NN.

5.2. Bayesian NN modelling approach (Model 2)

In this experiment, we kept the same NN structure as above, but we used the Bayesian approach to forecast the electric load. We named this model model 2.
Fig. 3 and Table 2 (line 2) clearly show the improvement brought by the Bayesian approach. Indeed, the performance on the test set has been enhanced: the Bayesian NN modelling approach diminished the RMSE and MAPE errors by 52% on the test set. Fig. 4 illustrates the convergence of the evidence framework. The optimal regularization coefficient (μ_MP = α_MP/β_MP) has a value of 0.04.
Fig. 3. Bayesian NN model 2: (a) training set and (b) test set.
Fig. 4. Convergence of hyper-parameters in evidence framework.
We recall that, unlike classical NN techniques such as cross-validation, this optimal value is obtained on-line, i.e. simultaneously with the optimization of the NN weights (which yields the optimal weight vector w_MP), using all the training data at hand.
5.3. Results of the ARD technique (Model 3)

As mentioned above, Bayesian NN learning offers the possibility (through the ARD technique) to determine the relevance of the input units. Notice that the indicators, such as the week end or holiday flags and the hour, were not included in this analysis. Fig. 5 shows the inverse of the values of the hyper-parameters, namely the variance of the Gaussian associated with each input variable. As one can see, a group of three variables (temperature, global solar radiation and electric load) clearly stands out for each day. One may also notice that the relevance of this group is more pronounced at day d. Not surprisingly, the electric load appears to be the most relevant input variable in each group. The temperature forecast for the next day also seems to be an important variable. As shown in Fig. 5, the computed relevance of the global solar radiation is relatively high. A possible explanation is that solar radiation is, to some extent, a proxy for other important variables, such as, for instance, temperature.

Fig. 5. ARD technique: relevance of the input variables.
We assessed the performance of a NN having as inputs (in addition to the hour and the week end flag indicator) the relevant variables (i.e. T_{d-1}, G_{d-1}, L_{d-1}, T_d, G_d, L_d, T_{d+1}) given by the ARD method. Table 2 (line 3) lists the results.
We see that the reduction of the input space has (very) slightly enhanced the performance of model 2. As mentioned above, from a theoretical point of view, when using the Bayesian approach one need not search for the optimal number of hidden nodes. However, in practice, it would seem that an optimal NN structure can improve the performance of the model. Section 5.4 illustrates the ranking of different NN structures through the evaluation of the evidence (see Section 4.7). Nonetheless, the most important characteristic of the Bayesian ARD technique is that it is possible to include a large number of input variables without fear of overfitting: if an input variable is found to be irrelevant, then all the fan-out connections related to this input are automatically set to zero (or near zero).
5.4. Bayesian NN model selection (Model 4)

Fig. 6 displays the log of the evidence of NN models with numbers of hidden nodes ranging from 1 to 20. As can be seen, the optimal NN structure (i.e. the most probable NN model) corresponds to a NN having four hidden nodes. Table 2 (line 4) gives the performance of this new model. In this case, the improvement brought by this pruned NN is more pronounced, as the RMSE and MAPE errors are reduced by 15% on the test set.

Fig. 6. Log evidence of different NN models.

5.5. Error bars on the output

The next stage is to make predictions with error bars. Fig. 7 plots the mean prediction (i.e. the NN output y_MP) together with error bars at ± one standard deviation (σ_t) from this prediction. In the context of decision making (management and planning of the means of production) to which the operator is subjected, an indication of the reliability associated with the model's prediction is obviously of great interest.
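A minimal sketch of the error bar computation of Eq. (23) is given below; obtaining the output gradient g and the Hessian A for the trained MLP is not shown, and the names are ours.

```python
import numpy as np

def predictive_error_bar(g, A, beta_mp):
    """Eq. (23): sigma_t^2 = 1/beta_MP + g^T A^{-1} g, where g is the gradient
    of the network output with respect to the weights at w_MP and A is the
    Hessian of S(w).  Returns the one-standard-deviation error bar on y_MP."""
    var = 1.0 / beta_mp + g @ np.linalg.solve(A, g)
    return np.sqrt(var)
```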
Fig. 7. NN mean prediction (solid line) with error bars at 1 standard deviation (dotted square lines).
6. Conclusions

In this work, we have proposed a new methodology based on a Bayesian approach to STLF. It has been shown that, unlike traditional NN techniques such as cross-validation, the method is able to deal quite efficiently with model complexity (and, therefore, with the problem of overfitting) through the use of the evidence framework and model selection. It also provides a means to select the most important input variables of the model. The Bayesian method allows the computation of error bars on the model output. The latter take into account, in a natural way, two contributions: one arising from the intrinsic noise in the data and one arising from the uncertainties in the parameter values, given by the width of the posterior pdf.

Future work will be devoted to the design of more robust models. In other words, the derivation of better noise models, in particular ones that take into account uncertainties in the model's inputs, could greatly improve the modelling. In our view, the STLF modelling community can greatly benefit from the Bayesian framework, as Bayesian methods provide an explicit handling of uncertainty in the modelling. So, it is worth using modelling tools that are theoretically more involved (though not, in practice, more difficult to use) if they can improve the quality of the forecasting models.

Acknowledgement
The authors are grateful to EDF (Électricité de France) for providing the databases.
References

[1] Senjyu TH, Takara K, Funabashi T. One-hour-ahead load forecasting using neural network. IEEE Trans Power Syst 2002;17(1):113–8.
[2] Hippert HS, Pedreira CE, Souza RC. Neural networks for short-term load forecasting: a review and evaluation. IEEE Trans Power Syst 2001;16(1):45–55.
[3] Mandal P, Senjyu T, Funabashi T. Neural networks approach to forecast several hour ahead electricity prices and loads in deregulated market. Energy Convers Manage 2006;47:2128–42.
[4] Zhang G, Patuwo BE, Hu MY. Forecasting with neural networks. Int J Forecast 1998;14:35–62.
[5] Hornik K, Stinchcombe M, White H. Multilayer feedforward networks are universal approximators. Neural Networks 1989;2(5):359–66.
[6] Bishop CM. Neural networks for pattern recognition. Oxford: Oxford University Press; 1995.
[7] Reed R. Pruning algorithms – a survey. IEEE Trans Neural Networks 1996;4(5):740–7.
[8] Lauret P, Fock E, Mara TA. A node pruning algorithm based on the Fourier amplitude sensitivity test method. IEEE Trans Neural Networks 2006;17(2):273–93.
[9] Yuan JL, Fine TL. Neural-network design for small training sets of high dimension. IEEE Trans Neural Networks 1998;9(2):267–80.
[10] Alves da Silva AP, Moulin LS. Confidence intervals for neural network based short-term load forecasting. IEEE Trans Power Syst 2000;15(4):1191–6.
[11] MacKay DJC. Information theory, inference, and learning algorithms. Cambridge: Cambridge University Press; 2003.
[12] Malakoff DM. Bayes offer 'new' way to make sense of numbers. Science 1999;286:1460–4.
[13] Cox RT. Probability, frequency and reasonable expectation. Am J Phys 1946;14:1–13.
[14] Sivia DS. Data analysis: a Bayesian tutorial. Oxford: Oxford University Press; 1996.
[15] Jaynes ET. Probability theory – the logic of science. Cambridge: Cambridge University Press; 2003.
[16] MacKay DJC. A practical Bayesian framework for back-propagation networks. Neural Comput 1992;4(3):448–72.
[17] Bretthorst GL. Bayesian model selection: examples relevant to NMR. In: Skilling P, editor. Maximum entropy and Bayesian methods. Dordrecht, The Netherlands: Kluwer Academic Publishers; 1990. p. 377–88.
[18] Nabney IT. NETLAB: algorithms for pattern recognition. London: Springer; 2002.