Electrical Power and Energy Systems 55 (2014) 144–154
A sparse heteroscedastic model for the probabilistic load forecasting in energy-intensive enterprises

Peng Kou*, Feng Gao
Systems Engineering Institute, SKLMS, Xi'an Jiaotong University, Xi'an 710049, China
Article info

Article history: Received 9 January 2013; received in revised form 1 September 2013; accepted 3 September 2013.
Keywords: Probabilistic forecasting; Smart grid; Gaussian process; Heteroscedasticity; Sparsification
Abstract

Energy-intensive enterprises (EIEs) account for a significant part of the total electricity consumption in most industrial countries. In the smart grid environment, electric load forecasting in EIEs plays a critical role in the secure and economical operation of both the main grid and the EIEs' micro-grids. However, the accuracy of such forecasting is highly variable due to the strong stochastic nature of the load in EIEs. In this circumstance, probabilistic forecasts are essential for quantifying the uncertainties associated with the load, and are thus highly valuable for assessing the risk of relying on the forecasts and for optimizing the energy systems within EIEs. This paper focuses on day-ahead probabilistic load forecasting in EIEs; a novel sparse heteroscedastic forecasting model based on the Gaussian process is developed. With the proposed model, we can provide predictive distributions that capture the heteroscedasticity of the load in EIEs. Since the high computational complexity of the Gaussian process hinders its practical application to large-scale problems such as load forecasting, the proposed model employs the ℓ1/2 regularizer to reduce its computational complexity, thereby enhancing its practical applicability. Simulation on real-world data validates the effectiveness of the proposed model. The data used in the simulation were obtained from the real operation of an EIE in China.

© 2013 Elsevier Ltd. All rights reserved.
1. Introduction

In many industrial countries, the electrical energy consumption of energy-intensive enterprises (EIEs) constitutes a significant part of the country's total energy use. These EIEs include steel plants, alumina plants, petrochemical plants, cement plants, etc. In many cases, an EIE has its own self-generating power plant, thus forming a micro-grid. This micro-grid is connected to the main grid through substations or feeders, as in Fig. 1. Since the electric load in EIEs is affected by the start-up and shut-down of high-power-consuming production units, e.g., the electric furnaces in a steel plant, the load in EIEs is highly volatile and sharply fluctuating. Because the micro-grids in EIEs are connected to the main grid, these uncertain burst loads pose several challenges to both utilities and EIEs, concerning stability, power quality, and especially power dispatching [1–3]. Specifically, from the viewpoint of the utilities, the uncertain burst loads from EIEs may have an adverse impact on power quality, e.g., large fluctuations in voltage and frequency. From the viewpoint of the EIEs, the burst load reduces the operational efficiency of their own self-generating power plants, so EIEs have to purchase additional burst-load following capability, spinning reserves, and other ancillary services from the main

* Corresponding author. Tel.: +86 29 82667679; fax: +86 29 82667856. E-mail address:
[email protected] (P. Kou).

0142-0615/$ - see front matter © 2013 Elsevier Ltd. All rights reserved. http://dx.doi.org/10.1016/j.ijepes.2013.09.002
grid, thus increasing their energy costs [2]. In the smart grid environment, optimized scheduling among the self-generating power plant, shiftable loads, energy storage, and the utility power supply has great potential to address these challenges [3]. Because of the high uncertainties in EIEs' energy consumption, a stochastic scheduling model is more appropriate than a deterministic one for an EIE's energy system, since the former introduces caution, flexibility, and robustness into the solution. Such a stochastic scheduling model leads to the requirement of probabilistic load forecasting in EIEs. In addition to point prediction values, probabilistic load forecasting also provides the predictive distributions of the future load, thus quantifying the uncertainties associated with the EIEs' load. This information on uncertainties is the stepping stone for a stochastic scheduling model. Furthermore, probabilistic forecasts can also be used to improve dynamic demand response, investigate the power flow [4,5], and evaluate system reliability [6]. Various techniques have been proposed for electric load forecasting. However, almost all of them focus on the main grid or metropolitan power grids [7–14]. As stated above, the load characteristics of EIEs are significantly different from those of the main grid, so the existing load forecasting methods in the literature for main grids are not appropriate for load forecasting in EIEs. To the best of our knowledge, [15,16] are the only previous works that address load forecasting in EIEs. In [15], a gradient-boosting
Fig. 1. Diagram illustrating a micro-grid in EIE (utility power supply from the main grid via transformers and buses; the micro-grid contains the self-generating power plant and loads including an EAF, motor loads, heating loads, and static loads).
ensemble learning algorithm for non-stationary time series is established and applied to load forecasting in a large-scale steel plant. In [16], a template-based technique, along with template scaling and equivalence algorithms, is proposed to solve the EIEs' load modeling problem and is applied to an oil refinery facility. Although these two methods show promising results, they only provide point forecasts, not probabilistic forecasts. From a statistical point of view, probabilistic load forecasting for EIEs is a probabilistic regression problem. The Gaussian process (GP) is a powerful tool for probabilistic regression [17]. Since the GP is a fully probabilistic model, it can give predictive distributions, i.e., probabilistic forecasts, rather than merely point predictions. The standard GP simply assumes that the noise in the data is Gaussian and that its level is uniform across all data points. However, this assumption is unreasonable for some real applications such as load forecasting in EIEs. In the following sections, we will show that although the uncertainty associated with the load in EIEs can be quantified by Gaussian distributions, the level of such uncertainty cannot be assumed to be uniform, due to the start-up and shut-down of high-power-consuming production units. Consequently, the variance of the load series in EIEs can be time-varying, i.e., the load series in EIEs is a heteroscedastic time series. In such a circumstance, the standard GP may misestimate the variances of this time series, thus giving poor predictive distributions. Furthermore, since the misestimation of the variances makes the GP oversmooth or undersmooth, the point prediction performance is also affected. The heteroscedastic Gaussian process (HGP) is an extension of the standard GP; it models the uncertainty level using a second GP in addition to the GP governing the noise-free output [18].
This way, HGP handles non-uniform, input-dependent uncertainty levels, thereby capturing the local volatility of the load in EIEs. Therefore, HGP is suitable for probabilistic load forecasting in EIEs. A major limitation of HGP is its high computational complexity. One HGP consists of two standard GP base models, each of which costs O(N^3) for training and O(N^2) for prediction, where N is the number of training points. This high computational complexity severely limits its scalability to large problems. When training a load forecasting model for EIEs, there is in general a large amount of data available from the supervisory control and data acquisition (SCADA) system. In this case, the high computational complexity of HGP makes its training and real-time forecasting intractable, thus hindering its practical application. Therefore, from a practical viewpoint, we have to reduce the computational complexity of HGP before applying it to probabilistic load forecasting in EIEs. To address the above issues, this paper presents a new approach for probabilistic load forecasting in EIEs. To effectively handle
the heteroscedasticity in EIEs' load series and large-scale forecasting problems, the proposed method is developed as a sparse heteroscedastic model, referred to as SHGP. In SHGP, to deal with the heteroscedasticity, a heteroscedastic GP model is employed to model and predict the variances. To handle large-scale problems, SHGP reduces the computational complexity by sparsifying its base models. The newly introduced ℓ1/2 regularizer [19] is employed for this sparsification. Compared to popular regularizers such as ℓ1 [20], the ℓ1/2 regularizer generally gives sparser solutions. Benefiting from this sparse nature of ℓ1/2 regularization, SHGP has a significantly lower computational complexity than HGP. Considering the scheduling horizon of the self-generating power plants in EIEs, we focus on day-ahead forecasting, i.e., a 24 h forecasting horizon. The remainder of this paper is organized as follows. A detailed description of the proposed SHGP model is presented in Section 2. Section 3 analyzes the load characteristics of EIEs and shows how to select the input features for model training. Section 4 contains the description of the numerical experiments and the discussion of the results. The data used in the experiments are obtained from the real operation of a steel plant in China. Section 5 concludes the paper.

2. Sparse HGP based on ℓ1/2 regularization

Because of its heteroscedastic property, HGP is suitable for probabilistic load forecasting in EIEs. However, when applying HGP to a practical load forecasting system, an important issue must be addressed: the high computational complexity of standard HGP makes training and real-time forecasting intractable for large problems. To address this issue, we present a sparse HGP model, referred to as SHGP, which reduces the computational complexity. In this section, we show how to establish this SHGP model using ℓ1/2 regularization.
Before introducing the proposed model, we give a brief review of HGP in Section 2.1. In Section 2.2, we present the proposed SHGP model.

2.1. A brief review of HGP

Probabilistic load forecasting in EIEs is a probabilistic regression problem, which can be formulated as: at time stamp t, given the forecasting horizon h and a set of input features x_{t+h|t}, forecast the future load distribution p(y_{t+h}|x_{t+h|t}) at time stamp t + h. As mentioned earlier, this paper focuses on the day-ahead forecast, i.e., we only perform single-step-ahead forecasting. Therefore, the forecasting model can be expressed as
p(y_{t+h} | x_{t+h|t}) = g(x_{t+h|t}),

where g denotes the probabilistic forecasting model. The input features form the D-dimensional input vector x, while the actual electric load measurements form the corresponding real-valued target y. HGP is a heteroscedastic regression model that takes into account input-dependent noise. Given a dataset D = (X, y) consisting of N input vectors X = {x_n}_{n=1}^N and corresponding targets y = {y_n}_{n=1}^N, in HGP we assume that the relationship between the input vector and the target is given by

y_n = f(x_n) + ε_n,  (1)

where ε_n ~ N(0, σ_n²) is the input-dependent noise, which models the time-varying variance and hence the heteroscedasticity, and f is the latent function. By placing a Gaussian process prior on f and assuming a noise rate function σ_n² = r(x_n), the predictive distribution p(y*|x*, X, y) at a testing point x* is a Gaussian distribution, which is given by
p(y*|x*, X, y) = N(μ*, σ*²)  (2)

with

μ* = k_*^T (K + R)^{-1} y,
σ*² = k_** + r(x*) − k_*^T (K + R)^{-1} k_*.

The probability density function (PDF) in (2) is the direct output of the HGP, where K is the kernel matrix with elements K_nm = k(x_n, x_m), k being the kernel function parameterized by the hyperparameters θ; k_** = k(x*, x*), k_* = [k(x_1, x*), k(x_2, x*), ..., k(x_N, x*)]^T, and R = diag([r(x_1), r(x_2), ..., r(x_N)]^T) is the diagonal noise matrix. In many cases, it is useful to indicate the shape and range of the predictive distribution by giving prediction intervals, which are more convenient than the PDF for visual illustration and performance evaluation. Since the predictive distribution in (2) is Gaussian, we can easily obtain the α-level prediction interval
[L̂_{1−α}(y*), Û_{1−α}(y*)] = [μ* − z_{(1−α)/2} σ*, μ* + z_{(1−α)/2} σ*].  (3)

This prediction interval gives a range of values within which the actual target is expected to lie with a certain probability, namely the nominal coverage rate (1 − α). Here α ∈ [0, 1] is the confidence level and z_{(1−α)/2} is the corresponding critical value of the standard normal distribution. One standard HGP model consists of two base models, each of which is a homoscedastic GP. The first base model G1 models the noise-free target, while the second one, G2, models the logarithms of the noise rates, that is, v(x) = log(r(x)). This v-process is governed by a different kernel function k_v parameterized by the hyperparameters θ_v. The input locations of the training data points {v_n}_{n=1}^N for the v-process coincide with those of the y-process. This way, HGP adapts to the local noise and captures the local volatility. Since the noise rates are now independent latent variables in the combined HGP model, the predictive distribution at a testing point x* becomes
p(y*|x*, X, y) = ∫∫ p(y*|x*, X, y, v*, v) p(v*, v|x*, X, y) dv* dv.  (4)
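Before continuing with the integral in (4), the homoscedastic predictive equations (2) and (3) can be sketched numerically. The following is a minimal illustration with a squared-exponential kernel; the data, hyperparameters, and noise rates are made-up values for the example, not the paper's settings:

```python
import numpy as np
from scipy.stats import norm

def se_kernel(A, B, ell=1.0, sf=1.0):
    # Squared-exponential kernel k(x, x') = sf^2 exp(-||x - x'||^2 / (2 ell^2))
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return sf ** 2 * np.exp(-0.5 * d2 / ell ** 2)

rng = np.random.default_rng(0)
X = rng.uniform(0, 5, size=(20, 1))                     # training inputs
y = np.sin(X).ravel() + 0.1 * rng.standard_normal(20)   # noisy targets
R = np.diag(np.full(20, 0.01))                          # diagonal matrix of noise rates r(x_n)

x_star = np.array([[2.5]])
K = se_kernel(X, X)
k_star = se_kernel(X, x_star).ravel()                   # k_*
k_ss = se_kernel(x_star, x_star)[0, 0]                  # k_**
Ainv = np.linalg.inv(K + R)                             # (K + R)^{-1}

mu_star = k_star @ Ainv @ y                             # predictive mean, eq. (2)
var_star = k_ss + 0.01 - k_star @ Ainv @ k_star         # predictive variance; r(x_*) = 0.01 assumed
sigma_star = np.sqrt(var_star)

alpha = 0.05
z = norm.ppf(1 - alpha / 2)                             # critical value z_{(1-alpha)/2}
interval = (mu_star - z * sigma_star, mu_star + z * sigma_star)  # eq. (3), 95% nominal coverage
```

With twenty training points on a sine curve, the predictive mean tracks the underlying function and the interval width reflects both the kernel uncertainty and the assumed noise rate.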
The second term on the right-hand side of (4) makes this integral difficult to handle, so [18] suggests approximating the predictive distribution by
p(y*|x*, X, y) ≈ p(y*|x*, X, y, v̂, v̂*) = N(μ̂*, σ̂*²)
= N(k_*^T (K + R)^{-1} y, k_** + r(x*) − k_*^T (K + R)^{-1} k_*),  (5)
where v̂ and v̂* = log(r̂(x*)) are estimated by G2. Model (5) is the practical HGP model; we denote it as G. The principle of HGP is illustrated in Fig. 2. The training procedure of HGP can be summarized as follows. On the original dataset D = (X, y), estimate the first base model G1 for predicting y from x. Given G1, estimate the empirical noise levels of the training data, i.e., v_n = log(var[y_n, G1(x_n, D)]), forming a new dataset D' = (X, v). On D', estimate the second base model G2.
Fig. 2. Principle of HGP.
Then estimate the combined HGP G by (5). If not converged, set G1 = G and return to the second step. The above training procedure is an EM-style iteration; it is deemed to have converged when the change in the likelihood falls below some threshold. For more details on HGP, we refer the reader to [18].

2.2. SHGP: sparse HGP based on ℓ1/2 regularization

When training HGP with the above EM-style iteration, both G1 and G2 are standard GPs in the first round of iteration. In the second and subsequent rounds, G1 = G is an HGP, while G2 is still a standard GP. G1 requires O(N^3) for training and O(N^2) for prediction, while G2 requires O(N^3) for training and O(N) for prediction (for G2, we do not need to predict the variance). These costs lead to the high computational complexity of HGP, which can be reduced by sparsifying both G1 and G2 via ℓ1/2 regularization, leading to the proposed SHGP model. Since G1 = G after the first round of iteration, and the method for sparsifying G can be easily generalized to G2 and the first G1, we mainly focus on the sparsification of G. The principle of SHGP is illustrated in Fig. 3.

2.2.1. ℓ1/2 regularization

In this subsection, we give a brief review of ℓ1/2 regularization; for more details, we refer the reader to [19]. Regularization is a popular method to control model complexity. Most commonly, the regularization approach considers the linear regression model with the sum-of-squares loss, that is
min_w { ||y − Xw||_2^2 + λ||w||_q }.  (6)

Here the first term ||y − Xw||_2^2 is the loss of the regression model, w = [w_1, w_2, ..., w_D]^T is the vector of regression coefficients to be inferred, and λ is the regularization parameter. The regularizer normally takes the form of the q-norm of w, that is, ||w||_q = (Σ_{d=1}^D |w_d|^q)^{1/q}. In (6), different values of q correspond to different regularizers. When q = 0, we get the ℓ0 regularizer, which yields the sparsest solutions but faces a combinatorial optimization problem. The case q = 1, i.e., the ℓ1 regularizer, is also known as the lasso [20]. The lasso is less sparse than ℓ0 regularization, but it leads to a convex optimization problem that is easy to solve. Because of this, the lasso has gained popularity and has been accepted as a very useful tool for sparsification. Nevertheless, in many cases the lasso does not yield sufficiently sparse solutions, and it may yield inconsistent selections or extra bias [21,22]. Therefore, to find sparser solutions than the lasso, we adopt the newly introduced ℓ1/2 regularizer in this work. Numerous previous works have revealed that the ℓ1/2 regularizer assuredly generates sparser solutions than the lasso [19,23,24]. Fig. 4 gives an illustration of the different regularizers. In this figure, the shaded regions denote the constraint regions defined by the regularizers, and the contours denote the unregularized loss, i.e., the first term in (6).
Fig. 3. Principle of SHGP.
Furthermore, in each round of iteration, we can dynamically update the regularization parameter λ according to

λ^{(s+1)} = min( λ^{(s)}, (√96/9) ||X||_2^2 |[B_κ(w^{(s)})]_{M+1}|^{3/2} ),  (10)

where [B_κ(w^{(s)})]_M denotes the Mth largest component of B_κ(w^{(s)}) in magnitude. This way, the half algorithm can yield the so-called M-sparsity solution, i.e., the final solution w satisfies ||w||_0 = M. In this sense, the sparsity level of the solution can be directly controlled by the parameter M.

2.2.2. Use ℓ1/2 regularization to sparsify G

We now derive the sparse HGP model G using ℓ1/2 regularization. Since G is a heteroscedastic GP, the relationship between the input vector and the target is given by (9) below.

Fig. 4. Plot of the contours of the unregularized loss (blue) along with the constraint region (green shaded region) for the (a) ℓ0, (b) ℓ1/2, (c) ℓ1, and (d) ℓ2 regularizers. The optimum solution is denoted by w*. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
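As a quick illustration of the update rule (10), the sketch below computes the capped λ for a given iterate. The vector, the sparsity level M, and the norm value are made-up inputs, and the formula follows our reading of (10), so treat it as illustrative:

```python
import numpy as np

def lambda_update(lam, B_w, M, X_norm2):
    # Rule (10): cap lambda using the (M+1)-th largest magnitude of B_kappa(w^(s)),
    # so that at most M components survive the subsequent half-thresholding step.
    mags = np.sort(np.abs(B_w))[::-1]                        # magnitudes in descending order
    cap = (np.sqrt(96.0) / 9.0) * X_norm2 * mags[M] ** 1.5   # mags[M] is the (M+1)-th largest
    return min(lam, cap)

new_lam = lambda_update(lam=10.0, B_w=np.array([5.0, -4.0, 3.0, 2.0, -1.0]),
                        M=2, X_norm2=1.0)
```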
The intersection point between the contours and the constraint region represents the solution of problem (6). When the intersection point lies on a coordinate axis, a sparse solution is obtained. From Fig. 4(a), it is seen that when q = 0, a sparse solution is certain to be obtained. When q = 1, the lasso solution is the first location at which the contours touch the shaded region, and this will often occur at a corner of the region. Since every corner lies on a coordinate axis, it corresponds to a zero coefficient. When q = 2, there is no corner for the contours to hit, so zero components rarely appear and sparsity is not achieved, as in Fig. 4(d). Comparing Fig. 4(b) and (c), it is apparent that the solution of ℓ1/2 regularization occurs at a corner with higher probability than that of ℓ1. This indicates that the ℓ1/2 regularizer may give sparser solutions than the lasso. Although the ℓ1/2 regularizer generally provides a sparser solution than ℓ1, it leads to a non-convex and non-smooth optimization problem, which is hard to solve. To address this, Xu et al. proposed an effective iterative half thresholding algorithm for the ℓ1/2 regularization problem in [19]; in this work, we refer to it as the half algorithm. This algorithm considers the following optimization problem

min_w { ||y − Xw||_2^2 + λ||w||_{1/2}^{1/2} }.  (7)
Based on a fixed-point iteration scheme, the half algorithm converges to a sparse solution of (7) via the iteration

w^{(s+1)} = H_{λ,κ}(B_κ(w^{(s)})),  (8)
where s labels the iteration step. The operator H_{λ,κ} is defined as

H_{λ,κ}(b) = [h_{λ,κ}(b_1), h_{λ,κ}(b_2), ..., h_{λ,κ}(b_D)]^T, for all b ∈ R^D,

with

h_{λ,κ}(b_d) = (2/3) b_d (1 + cos((2π/3) − (2/3) φ_{λκ}(b_d))), if |b_d| > (54^{1/3}/4) (λκ)^{2/3}, and h_{λ,κ}(b_d) = 0 otherwise,

where φ_{λκ}(b_d) is given by

φ_{λκ}(b_d) = arccos((λκ/8) (|b_d|/3)^{−3/2}).

The operator B_κ is defined as

B_κ(w) = w + κ X^T (y − Xw),

where κ = ||X||_2^{−2}.
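The operators above assemble into iteration (8). Below is a self-contained toy run on a synthetic sparse linear regression; the data, the fixed λ, the step size, and the iteration count are our own illustrative choices (the paper instead updates λ dynamically by (10)):

```python
import numpy as np

def half_threshold(b, lam, kappa):
    # Elementwise half-thresholding operator H_{lam,kappa}: zero below the threshold,
    # otherwise the cosine-form shrinkage h_{lam,kappa}.
    out = np.zeros_like(b)
    thresh = (54.0 ** (1.0 / 3.0) / 4.0) * (lam * kappa) ** (2.0 / 3.0)
    big = np.abs(b) > thresh
    phi = np.arccos((lam * kappa / 8.0) * (np.abs(b[big]) / 3.0) ** (-1.5))
    out[big] = (2.0 / 3.0) * b[big] * (1.0 + np.cos(2.0 * np.pi / 3.0 - (2.0 / 3.0) * phi))
    return out

rng = np.random.default_rng(2)
X = rng.standard_normal((100, 20))
w_true = np.zeros(20)
w_true[[3, 7]] = [2.0, -1.5]                    # truly sparse coefficient vector
y = X @ w_true + 0.05 * rng.standard_normal(100)

kappa = 0.99 / np.linalg.norm(X, 2) ** 2        # step size kappa < ||X||_2^{-2}
lam = 2.0
w = np.zeros(20)
for _ in range(200):                            # iteration (8): w <- H(B(w))
    B = w + kappa * X.T @ (y - X @ w)           # operator B_kappa
    w = half_threshold(B, lam, kappa)
```

On this toy problem the iteration recovers the two nonzero coefficients and drives the rest to exactly zero, which is the sparsity behavior the section relies on.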
y_n = f(x_n) + ε_n,  (9)

where ε_n ~ N(0, exp(v_n)) is the input-dependent noise. We assume that f = [f(x_1), f(x_2), ..., f(x_N)]^T behaves according to a Gaussian process, that is, p(f|x_1, x_2, ..., x_N) = N(0, K). According to [25], if we represent f in the equivalent parametric form f = Ku, with prior u ~ N(0, K^{-1}), then the relationship between the input vector and the target can be described by the generalized linear regression function
y = Ku + ε,

where u is a vector of regression coefficients and ε ~ N(0, R). To obtain the optimal value of u, we use the maximum a posteriori (MAP) method. The posterior of u is proportional to

p(u|X, y) ∝ p(y|X, u) p(u).  (11)
Since y = Ku + ε and ε ~ N(0, R), the first term on the right-hand side of (11) can be rewritten as p(y|X, u) = N(Ku, R). In addition, since f ~ N(0, K), we have u ~ N(0, K^{-1}). Therefore, (11) can be rewritten as
p(u|X, y) ∝ exp(−(1/2)(y − Ku)^T R^{-1} (y − Ku)) · exp(−(1/2) u^T K u).  (12)

From (12) we can get the negative log-posterior over u, that is,
−log p(u|X, y) ∝ (1/2)(y − Ku)^T R^{-1} (y − Ku) + (1/2) u^T K u
∝ y^T R^{-1} y − 2 y^T R^{-1} K u + u^T K^T R^{-1} K u + u^T K u.  (13)

This way, the MAP estimate of G is equivalent to minimizing the negative logarithm of the posterior of u, that is,
u* = argmin_u { −y^T R^{-1} K u + (1/2) u^T (K + K^T R^{-1} K) u }.  (14)
In this sense, (14) can be used as a loss function for training G. It is easy to see that the optimal solution of (14) is u* = (K + R)^{-1} y. If there is a sparse ũ whose posterior probability is close to that of the optimal u*, then ũ will be a reasonable approximation of u*. In other words, we can use a sparse solution of (14) to approximate u*. Since most elements of ũ are zeros, the G based on it will be a sparse model. Here we use ℓ1/2 regularization to obtain the sparse ũ. Adding an ℓ1/2 regularizer to (14), the regularized loss function of G can be written as
ũ = argmin_u { −y^T R^{-1} K u + (1/2) u^T (K + K^T R^{-1} K) u + λ||u||_{1/2}^{1/2} }.  (15)
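The closed-form minimizer of the unregularized loss (14), u* = (K + R)^{-1} y, can be checked numerically; the kernel, noise levels, and targets below are made-up example values:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.uniform(0, 5, size=(15, 1))
K = np.exp(-0.5 * (X - X.T) ** 2)              # SE kernel matrix with unit hyperparameters
R = np.diag(0.1 + 0.2 * rng.random(15))        # diagonal heteroscedastic noise matrix
y = rng.standard_normal(15)
Ri = np.linalg.inv(R)

def loss(u):
    # Negative log-posterior (14): -y^T R^{-1} K u + 0.5 u^T (K + K^T R^{-1} K) u
    return -y @ Ri @ K @ u + 0.5 * u @ (K + K.T @ Ri @ K) @ u

u_star = np.linalg.solve(K + R, y)             # claimed closed-form minimizer u* = (K + R)^{-1} y
# first-order optimality: the gradient of (14) should vanish at u*
grad = -K.T @ Ri @ y + (K + K.T @ Ri @ K) @ u_star
```

Algebraically, (K + K^T R^{-1} K)(K + R)^{-1} = K R^{-1}(R + K)(K + R)^{-1} = K R^{-1}, so the gradient cancels exactly at u*.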
However, the original half algorithm in [19] only deals with the least-squares linear regression loss, as in (7). Therefore, we have to extend the algorithm to the loss function (15). Analyzing the original half algorithm in detail, we see that apart from the operator B_κ, no other operation depends on the form of the loss. In the original half algorithm, the definition of B_κ is derived from the first-order optimality condition of the loss in (7), that is,

B_κ(w) = w − (κ/2) ∂||y − Xw||_2^2 / ∂w = w + κ X^T (y − Xw).  (16)
Similarly, just as (16) is derived from (7), we can define a new operator B_κ for (15) as
B_κ(u) = u − (κ/2) ∂(−y^T R^{-1} K u + u^T (K + K^T R^{-1} K) u / 2) / ∂u
= u + (κ/2) (K^T R^{-1} y − (K + K^T R^{-1} K) u).  (17)

The log marginal likelihood on the active set, used for hyperparameter optimization in Section 2.2.4, is

ln p(y|θ) = −(1/2) ln |K̃ + σ_f² I| − (1/2) y^T (K̃ + σ_f² I)^{-1} y + const,  (21)

where K̃ denotes the kernel matrix evaluated on the active set. One of the problems in this approach is the dependence of the active set on θ; this dependence makes the criterion (21) and its gradients fluctuate as the active points change. This problem can be handled by repeating two alternating steps: (i) fix θ and select the active points by (18), and (ii) fix the active points and infer θ. The stopping criterion can be defined as the convergence of ũ or a maximum number of loops.
With this new operator B_κ, the half algorithm can solve (15) directly. The convergence proof of the original half algorithm does not depend on the form of the loss, so convergence still holds with the new operator B_κ. Using B_κ to solve (15), an iterative update of u is given by
u^{(s+1)} = H_{λ,κ}( u^{(s)} + (κ/2)(K^T R^{-1} y − (K + K^T R^{-1} K) u^{(s)}) ).  (18)
Eventually, this iteration converges to a sparse solution ũ. By updating λ^{(s)} according to (10), we can directly control the sparsity level of ũ; that is, if we set the sparsity level to M, we will get ||ũ||_0 = M. In this way, we directly tune the sparsity level of G. We refer to the training points corresponding to the nonzero elements of ũ as the active set. With this sparse G, the predictive distribution at a testing point x* is
p(y*|x*, ũ) = N( k̃_*^T ũ, k_** + r(x*) − k̃_*^T (K̃ + R̃)^{-1} k̃_* ),  (19)
where k̃_* is the column vector of kernel functions between the M active points and the test point x*, and K̃ and R̃ are the kernel matrix and the diagonal noise matrix evaluated on the active set, respectively; r(x*) is predicted by G2. In practice, we can use the Cholesky with Side Information (CSI) method to further reduce the computational complexity [26]. That is, we approximate the kernel matrix K by K ≈ Q^T Q, where Q is a g × N matrix with approximation rank g ≪ N. By this means, the iteration (18) becomes
u^{(s+1)} = H_{λ,κ}( u^{(s)} + (κ/2)(Q^T Q R^{-1} y − (Q^T Q + Q^T Q R^{-1} Q^T Q) u^{(s)}) ).

Similar to (9), the parameter κ can be taken as κ = ||Q^T Q||_2^{−2}.

2.2.3. Sparsification for G2 and the first G1

Both G2 and the first G1 are standard GPs. The method for sparsifying them is similar to that for G; the only difference is that the matrix R is replaced with the scalar σ_f². In this case, the iteration rule (18) becomes

u^{(s+1)} = H_{λ,κ}( u^{(s)} + (κ/2)(K^T y − (σ_f² K + K^T K) u^{(s)}) ).  (20)

The other operations in the sparsification are the same as in Section 2.2.2, so the detailed derivation is omitted for reasons of space.

2.2.4. Hyperparameters optimization

Seeger et al. [27] give a detailed analysis of various issues associated with gradient-based hyperparameter adjustment on an active set. Since the same ideas hold for our model, we do not discuss this topic in detail here. The sparse G1 is conditional on the hyperparameters θ. The optimal value of θ can be inferred by minimizing the negative of the log marginal likelihood (21) w.r.t. θ using gradient-based techniques.
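Returning to the sparse predictive distribution (19), here is a toy sketch of prediction from an active set. The active set is chosen at random purely to keep the example short (in SHGP it comes from the nonzero elements of ũ), and ũ is taken as the dense solution on that subset; all values are illustrative:

```python
import numpy as np

def se_kernel(A, B):
    # Squared-exponential kernel with unit hyperparameters, so k_** = 1
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2)

rng = np.random.default_rng(4)
X = rng.uniform(0, 5, size=(200, 1))
y = np.sin(X).ravel() + 0.1 * rng.standard_normal(200)
r = np.full(200, 0.01)                              # noise rates r(x_n), from G2 in the full model

active = rng.choice(200, size=40, replace=False)    # stand-in for the l1/2-selected active set
Xa, ya, ra = X[active], y[active], r[active]

Kt = se_kernel(Xa, Xa)                              # K-tilde, kernel matrix on the active set
Rt = np.diag(ra)                                    # R-tilde, noise matrix on the active set
u_tilde = np.linalg.solve(Kt + Rt, ya)              # coefficients restricted to the active set

x_star = np.array([[2.5]])
k_t = se_kernel(Xa, x_star).ravel()                 # k-tilde_*
mu_star = k_t @ u_tilde                             # predictive mean k-tilde_*^T u_tilde
var_star = 1.0 + 0.01 - k_t @ np.linalg.solve(Kt + Rt, k_t)   # predictive variance, eq. (19)
```

Prediction now costs operations on M = 40 active points instead of N = 200 training points, which is the source of SHGP's speedup.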
2.2.5. Computational complexity of SHGP

The above sparsification method leads to the proposed SHGP model. With SHGP, if we set the sparsity level to M and the CSI rank to g, then for G1 (G) predicting at a test point costs O(M²); for G2, predicting at a test point costs O(M), since it only predicts the mean value. During training, the first iteration has complexity O(N²g) in time and O(Ng) in space, due to calculating Q^T Q R^{-1} y, (Q^T Q + Q^T Q R^{-1} Q^T Q), Q^T Q y, and (σ_f² Q^T Q + Q^T Q Q^T Q). In the following iterations, if these terms are stored, the time complexity reduces to O(M²), due to the matrix–vector multiplications (σ_f² Q^T Q + Q^T Q Q^T Q) u^{(s)} and (Q^T Q + Q^T Q R^{-1} Q^T Q) u^{(s)}. Benefiting from the very sparse nature of ℓ1/2 regularization, we can set the sparsity level to M ≪ N, so the computational complexity of SHGP is much lower than that of HGP. In practice, this yields a fast algorithm, as supported by the results in Section 4.2.

3. Data analysis and feature selection

In this section, we analyze some characteristics of the load data in an EIE and select the input features. Specifically, in order to investigate whether a heteroscedastic Gaussian process model is suitable for modeling the load in an EIE, we are particularly interested in checking whether the EIE load series is heteroscedastic or homoscedastic, and whether the uncertainties associated with this load can be quantified by Gaussian distributions.

3.1. Data description and analysis

The data used in this research were generated at a large steel plant in China. The total electric load of this plant is about 1000 MW, making it a typical energy-intensive enterprise. The load data were collected by a SCADA system installed in the plant. The raw data are sampled at a high frequency of 1 min; in this work, they are averaged and stored at 5-min intervals, i.e., the sampling interval is T = 5 min. We refer to this as the 5-min averaged data. The data used in this work were collected over a period of five months.
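The 1-min to 5-min averaging described above is straightforward; a minimal sketch (with made-up load values in MW):

```python
import numpy as np

def average_to_5min(load_1min):
    # Average consecutive non-overlapping 5-sample windows of a 1-min series
    n = len(load_1min) - len(load_1min) % 5      # drop a trailing partial window, if any
    return load_1min[:n].reshape(-1, 5).mean(axis=1)

one_min = np.array([900., 910., 905., 895., 890.,
                    1000., 1010., 990., 1005., 995.])
five_min = average_to_5min(one_min)              # -> [900., 1000.]
```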
First, we plot the autocorrelation function (ACF) and the partial autocorrelation function (PACF) of the sampled load series. The lags for both the ACF and the PACF are set to 2016, i.e., seven days of data. The results are shown in Fig. 5. From Fig. 5(a), we can see that the ACF does not dampen out, so the load series is likely non-stationary; presumably this is caused by its heteroscedastic property. In addition, it is clear that there is a pseudo-daily pattern in this EIE's load. This is due to the fact that, over a certain span of time, the production plan in a plant is similar from day to day. The PACF graph shows a cutoff after lag 18, or possibly lag 20, as in Fig. 5(b). Second, we investigate the heteroscedasticity of the load in the EIE. Before the mathematical examination, we first give a visual illustration.
Fig. 5. (a) ACF and (b) PACF graphs for the electric load series in a typical EIE. Solid line denotes the 95% confidence interval.
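The sample ACF used in Fig. 5 can be computed directly; the synthetic load below (a daily cycle of 288 five-minute samples plus noise) is our own stand-in for the plant data:

```python
import numpy as np

def acf(x, max_lag):
    # Sample autocorrelation function up to max_lag
    x = np.asarray(x, float) - np.mean(x)
    denom = np.dot(x, x)
    return np.array([1.0] + [np.dot(x[:-k], x[k:]) / denom for k in range(1, max_lag + 1)])

rng = np.random.default_rng(5)
t = np.arange(2016)                               # one week of 5-min samples
load = 900 + 50 * np.sin(2 * np.pi * t / 288) + 10 * rng.standard_normal(2016)
rho = acf(load, 600)
# the daily (288-sample) cycle appears as a strong positive peak near lag 288
```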
Fig. 6. Illustration of the heteroscedastic property of the load in a typical EIE. Empirical distributions of the load (MW) under (a) operating condition I, (b) operating condition II, and (c) operating condition III.
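The empirical PDFs of Fig. 6 are just normalized histograms with 2 MW bins; a sketch on synthetic data (the two operating conditions below are made-up Gaussian stand-ins):

```python
import numpy as np

def empirical_pdf(load, bin_width=2.0):
    # Empirical PDF of a load series (MW) using fixed-width bins, as in Fig. 6
    edges = np.arange(load.min(), load.max() + bin_width, bin_width)
    density, edges = np.histogram(load, bins=edges, density=True)
    centers = 0.5 * (edges[:-1] + edges[1:])
    return centers, density

rng = np.random.default_rng(6)
cond1 = 900 + 15 * rng.standard_normal(5000)     # quiet operating condition: narrow spread
cond2 = 950 + 60 * rng.standard_normal(5000)     # a large unit running: much wider spread
c1, d1 = empirical_pdf(cond1)
c2, d2 = empirical_pdf(cond2)
```

The wider distribution has a visibly lower, broader density peak, which is the qualitative pattern Fig. 6 reports across operating conditions.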
We plot the empirical probability density functions of the load data under different operating conditions, which correspond to different start-up and shut-down statuses of several high-power-consuming production units. Fig. 6 gives three examples; in this figure the load data are represented through empirical PDFs using 2 MW bins. It is seen that the variance, i.e., the uncertainty of the load in the EIE, is clearly affected by the operating condition. Comparing Fig. 6(a) with Fig. 6(c) shows that when a high-power-consuming production unit is started up, the variance of the load increases markedly. This observation indicates that the variance of the load series is time-varying, i.e., the load series is heteroscedastic. To verify this visual observation, we employ the Breusch–Pagan test [28] and the White test [29] to examine the heteroscedasticity of the load series mathematically. Specifically, we split the whole load series into ten equally sized subsets and perform the two tests on each subset individually.
Both the Breusch–Pagan and the White test examine the null hypothesis that the series is homoscedastic. In these two tests, the p-value is the probability of the test statistic being at least as extreme as the one observed, given that the null hypothesis is true. A small p-value is an indication that the null hypothesis is false. Therefore, if the p-value is sufficiently small, i.e., below the chosen significance level, we reject the null hypothesis of homoscedasticity and accept the alternative hypothesis that the series is heteroscedastic. The test results are shown in Table 1. In this table, the logical output H = 1 indicates that we reject the null hypothesis of homoscedasticity at the 5% significance level, and vice versa. The results in this table confirm our previous assumption: significant heteroscedasticity exists in the load series of this EIE. In this case, a homoscedastic probabilistic forecasting model may misestimate the uncertainties associated with the load series. Finally, we give a rough test of whether the uncertainties associated with the load in this EIE can be quantified by Gaussian distributions. To do so, we split the whole dataset into ten equally sized subsets and run a standard GP on each subset individually. Subsequently, on each subset, we find a rough estimate of the uncertainty level based on the obtained GP. Specifically, consider a sample ỹ_n from the predictive distribution induced by the obtained GP at x_n. Viewing ỹ_n and y_n as two independent observations of the same noise-free, unknown target, the quantity (y_n − ỹ_n)²/2 is a coarse estimate of the uncertainty level at x_n. Furthermore, this estimate can be improved by taking the expectation with respect to the predictive distribution [18], i.e., e_n ≈ N_s^{-1} Σ_{j=1}^{N_s} (y_n − ỹ_{n,j})²/2, where N_s is the sample size and ỹ_{n,j}, j = 1, 2, ..., N_s, are samples from the predictive distribution induced by the obtained GP at x_n.
For a sufficiently large number of samples, e.g., $N_s = 100$, this yields a rough estimate of the uncertainty levels. This way, for each data subset, we obtain a noise series $e = [e_1, e_2, \ldots, e_N]$. We perform the Lilliefors test [30] on each series. The test result is shown in Table 2. In this table,
Table 1 Results of heteroscedastic test at the significance level of 5%.

Data subset    Breusch–Pagan test      White test
               H      p                H      p
#1             0      0.0047           0      0.0037
#2             0      0.0057           0      0.0321
#3             0      0.0017           0      3.27e-5
#4             0      0.0022           0      9.97e-6
#5             0      0.0015           0      0.0219
#6             0      0.0127           0      0.0090
#7             0      0.0064           0      0.0035
#8             1      0.0520           0      0.0101
#9             0      0.0385           0      0.0425
#10            0      0.0150           0      0.0438
Select the combination with the lowest testing error, add the corresponding feature to the feature subset, and eliminate it from the candidate set. Proceed to the next iteration.
Table 2 Results of Lilliefors test at the significance level of 5%.

Data subset    H    p
#1             0    0.119
#2             0    0.168
#3             1    0.013
#4             0    0.221
#5             0    0.401
#6             0    0.419
#7             0    0.500
#8             0    0.500
#9             0    0.077
#10            0    0.076
the logical output H = 0 indicates that the variance in the data can be quantified by a Gaussian distribution at the default significance level of 5%, and vice versa. The output p indicates the confidence of this test; its range is [0, 0.5], and small values of p cast doubt on the validity of the Gaussian assumption. The results in this table suggest that, in general, the uncertainties in this EIE's load can be described by Gaussian distributions. From the above analysis, we can say that the load series in this EIE is heteroscedastic. Furthermore, these heteroscedastic uncertainties can be quantified by Gaussian distributions. Therefore, it is suitable to model and predict this EIE load series with a heteroscedastic Gaussian process model.

3.2. Feature selection

One of the most important tasks in building a forecasting model is the selection of the input features. The electric load in an EIE is mainly a function of its production condition, so a natural choice for the input feature set is the detailed production plans and production schedules of each production unit. However, in most enterprises, these data are either unavailable or have a low time resolution. In most cases, the time resolution of the production plan is monthly or even seasonal, which is too coarse for day-ahead forecasting. Under these circumstances, we propose to build the probabilistic load forecasting model for the EIE from a time series point of view. A time series is a set of sequential observations that are usually spaced at equal time intervals. A time series-based forecasting model employs the past observations as the input features, that is
$$p(y_{t+h} \mid \mathbf{x}_{t+h|t}) = g(\mathbf{x}_{t+h|t}) = g([y_t, y_{t-T}, \ldots, y_{t-LT}]^{\mathrm{T}}), \quad (22)$$
where T is the sampling interval (in this work, T = 5 min) and L indicates the number of past observations being considered. In this case, feature selection amounts to extracting the most informative subset from the historical observations {y_t, y_{t-T}, . . . , y_{t-LT}} and using it to form the feature vector. The first step of the feature selection is to determine the number of candidate historical observations, i.e., the value of L in (22). Referring back to the ACF graph in Fig. 5, we see that the previous five days of data are correlated with and significantly affect the next possible value of the load, and that there is a pseudo-daily pattern in the load series. This can be regarded as prior knowledge for the feature selection. Accordingly, we set the value of L to 1440, i.e., five previous days of data. Once the value of L is determined, the second step is to select the most informative subset from {y_t, y_{t-T}, . . . , y_{t-LT}}. To do this, we employ a sequential forward greedy search. In this approach, we begin with an empty feature subset, then iterate the following steps: For every candidate feature, incorporate it into the feature subset separately, so that each candidate feature corresponds to a possible combination. For each of the resulting combinations, build a forecasting model on the training set and compute the corresponding predictive error on the testing set.
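The greedy search above can be sketched as follows; `fit_predict` is a stand-in for training the forecasting model (SHGP in this paper) and returning predictions on the testing inputs, and is an assumption of this sketch:

```python
import numpy as np

def mape(y_true, y_pred):
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100.0

def forward_greedy_selection(X_cand, y, fit_predict, max_feats=12):
    """Sequential forward selection of lagged observations, minimizing the
    testing MAPE. X_cand holds the L candidate lags as columns."""
    n_train = int(0.6 * len(y))                  # 60/40 split as in the text
    selected, candidates = [], list(range(X_cand.shape[1]))
    while candidates and len(selected) < max_feats:
        scores = {}
        for f in candidates:                     # try each remaining feature
            cols = selected + [f]
            pred = fit_predict(X_cand[:n_train, cols], y[:n_train],
                               X_cand[n_train:, cols])
            scores[f] = mape(y[n_train:], pred)
        best = min(scores, key=scores.get)       # lowest testing error wins
        selected.append(best)                    # add to the feature subset
        candidates.remove(best)                  # eliminate from candidates
    return selected
```

Any regression routine can be plugged in as `fit_predict`, e.g., ordinary least squares for a quick sanity check.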
In detail, for all data points, we use 60% as the training set and the remainder as the testing set. The proposed SHGP is employed as the forecasting model, and the mean absolute percentage error (MAPE) is adopted as the criterion for the testing error. The feature selection procedure is illustrated in Fig. 7. It is seen that the testing error decreases rapidly when the number of input features is smaller than 10 and then levels off. Conversely, when the number of features is greater than 20, further increasing the number of selected features does not improve the predictive performance but instead risks overfitting. Based on this result, and considering the pseudo-daily pattern shown in Fig. 5(a), we select 12 historical observations as the input features. Nine of them are load measurements of the previous three days around the forecasting time. The other three, i.e., y_{t-297T}, y_{t-298T}, and y_{t-299T}, reflect the fact that the operating period of the electric arc furnace (EAF) is about 45–55 min, and the EAF is the most power consuming production unit in a steel plant. All selected features are summarized in Table 3.

4. Numerical experiments

4.1. Description of experiments

In this section, the effectiveness of the proposed SHGP model is evaluated on a real-world dataset. A large steel plant in China is chosen for the case study. The 5-min averaged electric load data of this plant, as introduced in Section 3.1, are used for the experiment. This dataset was collected from October 1st, 2008 to December 31st, 2008. In order to compare with other probabilistic prediction models, we also run the standard GP and the splines quantile regression (SQR) [31,32] with the same experimental setup. In addition, since the proposed SHGP also provides point predictions, we apply some state-of-the-art point forecasting models, e.g., support vector regression (SVR) [33] and backpropagation neural networks (BPNN), to the same dataset and compare their performance.
For SHGP and standard GP, we adopt the automatic relevance determination (ARD) function as their kernel functions [34], which is given by
$$k(\mathbf{x}, \mathbf{x}') = h_0 \exp\left\{-\frac{1}{2}\sum_{d=1}^{D} h_d (x_d - x'_d)^2\right\}, \quad (23)$$
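A minimal sketch of this ARD kernel (the variable names `theta0` and `theta` for the hyperparameters are ours):

```python
import numpy as np

def ard_kernel(X1, X2, theta0, theta):
    """ARD kernel of Eq. (23): k(x, x') = theta0 * exp(-0.5 * sum_d
    theta_d * (x_d - x'_d)**2)."""
    diff2 = (X1[:, None, :] - X2[None, :, :]) ** 2      # pairwise squared diffs
    return theta0 * np.exp(-0.5 * (diff2 * theta).sum(axis=-1))
```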
Fig. 7. Predictive performance as a function of the number of input features.
4.2. Trade-off between sparseness and performance
Table 3 Selected input features.

Feature      Description
y_{t-288T}   Observation at the forecasting time, one day ago
y_{t-289T}   Observation 5 min before the forecasting time, one day ago
y_{t-290T}   Observation 10 min before the forecasting time, one day ago
y_{t-297T}   Observation 45 min before the forecasting time, one day ago
y_{t-298T}   Observation 50 min before the forecasting time, one day ago
y_{t-299T}   Observation 55 min before the forecasting time, one day ago
y_{t-575T}   Observation 5 min after the forecasting time, two days ago
y_{t-576T}   Observation at the forecasting time, two days ago
y_{t-577T}   Observation 5 min before the forecasting time, two days ago
y_{t-863T}   Observation 5 min after the forecasting time, three days ago
y_{t-864T}   Observation at the forecasting time, three days ago
y_{t-865T}   Observation 5 min before the forecasting time, three days ago
where h_0, h_1, . . . , h_D are the hyperparameters to be estimated. For SQR, we set the degrees of freedom to six. For SVR, we employ the radial basis function as the kernel. For BPNN, we adopt a two-layer feed-forward network structure, set the number of hidden nodes to 12, and use the sigmoid transfer function. The average negative log predictive density (NLPD) [35], the reliability evaluation (RE) criterion, and the sharpness evaluation (SE) criterion [36] are used to evaluate the quality of the probabilistic prediction results. The NLPD evaluates the amount of probability that the model assigns to the target; it penalizes predictions that are either under- or over-confident, that is
$$\mathrm{NLPD} = -\frac{1}{N}\sum_{n=1}^{N} \log\left(p_{\mathrm{model}}(y_n \mid x_n)\right), \quad (24)$$
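For Gaussian predictive distributions N(mu_n, sigma_n^2), the NLPD can be computed directly; a minimal sketch:

```python
import numpy as np
from scipy.stats import norm

def nlpd(y, mu, sigma):
    """Average negative log predictive density, Eq. (24), for Gaussian
    predictive distributions N(mu_n, sigma_n^2)."""
    return -np.mean(norm.logpdf(y, loc=mu, scale=sigma))
```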
where N is the number of testing points. RE evaluates the probabilistic correctness of the predictive distributions, which is given by
$$\hat{a}_{(1-\alpha)} = \left(\frac{n_{(1-\alpha)}}{N} - (1-\alpha)\right) \times 100\%, \quad (25)$$
where $n_{(1-\alpha)}$ is the number of times that the actual targets lie within the $\alpha$-level prediction interval, which is given by (3). A requirement for probabilistic prediction is that the nominal proportions of quantile forecasts match the observed probabilities. In other words, in an infinite series of probabilistic forecasts, the observed proportion $n_{(1-\alpha)}/N$ should equal the pre-assigned probability $(1-\alpha)$ exactly; the difference between them is the bias of the probabilistic forecasting method. The RE criterion evaluates this bias, that is, the smaller the $|\hat{a}_{(1-\alpha)}|$, the better. The SE criterion evaluates the sharpness of the predictive distribution, measured by the mean width of the prediction intervals. This criterion expresses the ability of the model to concentrate the uncertainty information of the load prediction. It has the following form
In this subsection, we evaluate the effectiveness of the sparsification strategy in SHGP. We split the whole dataset into five subsets, and investigate the relation among the sparse level M, the computational time, and the prediction performance on each subset. For each subset, we use 85% of the data points, i.e., about 2900 examples, as the training set and the remainder as the testing set. Since each training set contains about 2900 examples, we vary the sparse level M ∈ {20, 40, 60, 80, 100, 120, 150, 200, 250, 300}. The CSI rank g is set to 300. As a benchmark, the standard HGP model is also run on each data subset. Finally, we average the resulting point prediction accuracies and computational times over all data subsets, and plot them as functions of the sparse level. Here the computational time includes both the training time spent on the training set and the prediction time spent on the testing set. The results of this experiment are shown in Figs. 8 and 9. In Fig. 8, we plot the averaged predictive accuracy as a function of M, together with the accuracy of a standard HGP. From this figure, it can be seen that the MAPE of SHGP decreases rapidly when M < 150 and then levels off. Compared to the standard HGP, SHGP achieves similar accuracy when M > 150. Fig. 9 shows the computational time of SHGP as a function of M, together with the computational time of the standard HGP. It is seen that the computational time of SHGP is much less than that of a standard HGP. These results indicate that, compared to the standard HGP, SHGP achieves similar accuracy at a much lower computational cost, thus demonstrating the effectiveness of our sparsification strategy.

4.3. Point prediction results

The proposed model also provides point predictions, so we assess their accuracy in this subsection. For comparison, we also test the standard GP, SQR, and some popular point forecasting models, e.g., SVR and BPNN.
As in Section 4.2, we split all available data into five subsets, and evaluate the proposed model on each of them. Table 4 shows the accuracies of the point forecasting results. It is seen that the MAPE of SHGP is about 4%. The accuracy of the standard GP is slightly worse than that of SHGP. This confirms that misestimating the noise level makes the GP oversmooth or undersmooth, thereby degrading its point prediction performance. The accuracy of SQR is clearly lower. The point
$$\hat{\delta}_{(1-\alpha)} = \frac{1}{N}\sum_{n=1}^{N}\left[\hat{U}_{1-\alpha}(x_n) - \hat{L}_{1-\alpha}(x_n)\right], \quad (26)$$

where $\hat{U}_{1-\alpha}$ and $\hat{L}_{1-\alpha}$ are the upper and lower bounds of the $\alpha$-level prediction interval, respectively, as in (12). What is desired is narrower prediction intervals, i.e., the smaller the SE, the better. The mean absolute percentage error (MAPE) is adopted to evaluate the point predictive accuracy, that is

$$\mathrm{MAPE} = \frac{1}{N}\sum_{n=1}^{N}\left|\frac{y_n - \hat{y}_n}{y_n}\right| \times 100\%, \quad (27)$$

where $\hat{y}_n$ denotes the point prediction result.

Fig. 8. Predictive accuracy of SHGP as a function of the sparse level M, together with the results of a standard HGP.
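Given Gaussian predictive moments, the RE and SE criteria can be evaluated empirically as follows (a sketch; central prediction intervals are assumed, as in (3)):

```python
import numpy as np
from scipy.stats import norm

def reliability_sharpness(y, mu, sigma, alpha=0.1):
    """RE (Eq. (25)) and SE (Eq. (26)) of the (1 - alpha)-level central
    Gaussian prediction intervals."""
    z = norm.ppf(1.0 - alpha / 2.0)          # half-width in std. deviations
    lo, hi = mu - z * sigma, mu + z * sigma
    coverage = np.mean((y >= lo) & (y <= hi))
    re = (coverage - (1.0 - alpha)) * 100.0  # deviation from nominal rate, %
    se = np.mean(hi - lo)                    # mean interval width
    return re, se
```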
Fig. 9. Computational time of SHGP as a function of the sparse level M, together with the results of a standard HGP.

Fig. 11. Comparison of the probabilistic load forecasting results given by SHGP and standard GP (actual observations with the 90% predictive intervals given by GP and SHGP).
Table 4 MAPE of the point forecasting results (%).

Model   #1     #2     #3     #4     #5
SHGP    4.03   3.99   3.91   4.60   4.16
GP      4.12   4.02   4.10   4.88   4.15
SQR     5.07   4.82   4.89   5.73   4.92
SVR     4.05   3.79   3.77   4.66   4.34
BPNN    4.05   4.02   3.94   4.97   4.49
forecasting models also provide satisfactory performance; both SVR and BPNN achieve accuracy similar to that of SHGP.
4.4. Probabilistic prediction results
Fig. 12. RE diagram of SHGP, standard GP and SQR. For each model, the RE is averaged over five data subsets, and plotted as a function of the nominal coverage rate.
The primary objective of SHGP is to provide probabilistic load forecasts. In this subsection, we evaluate the performance of SHGP from this point of view. Considering the trade-off between predictive accuracy and computational time shown in Figs. 8 and 9, we set the sparse level M to 200. The standard GP and SQR are also tested for comparison. As in Section 4.2, we split all available data into five subsets, and evaluate the proposed model on each of them. The results of this experiment are shown in Figs. 10–12 and Tables 5–7.
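The shaded regions in Fig. 10 correspond to central prediction intervals at several nominal coverage rates, derived from the Gaussian predictive moments; a minimal sketch (cf. (3)):

```python
import numpy as np
from scipy.stats import norm

def prediction_intervals(mu, sigma, rates=(0.1, 0.3, 0.5, 0.7, 0.9)):
    """Central (rate)-level prediction intervals from Gaussian predictive
    distributions N(mu_n, sigma_n^2)."""
    intervals = {}
    for r in rates:
        z = norm.ppf(0.5 + r / 2.0)      # wider interval for a higher rate
        intervals[r] = (mu - z * sigma, mu + z * sigma)
    return intervals
```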
Table 5 NLPD of the predictive distributions.

Model   #1     #2     #3     #4     #5
SHGP    0.49   0.82   0.74   0.89   0.48
GP      0.75   1.12   1.02   1.09   0.73
SQR     0.54   0.89   0.72   0.93   0.45
Fig. 10. Day-ahead prediction results of SHGP, together with the actual load measurements. The shaded regions denote the predictive intervals with different nominal coverage rates.
Table 6 Reliability evaluation of the predictive intervals.

Results of SHGP (%)
Nominal coverage rate   #1      #2      #3      #4      #5
10%                     -1.27   -1.23   -1.25   -0.83   -1.01
20%                     -1.06   -1.15   -1.52   -1.18   -1.4
30%                     -1.5    -1.42   -2.25   -1.77   -1.5
40%                     -2.37   -1.05   -2.03   -2.05   -1.72
50%                     -2.99   -2.31   -2.05   -2.32   -2.31
60%                     -3.58   -2.91   -2.96   -1.93   -2.38
70%                     -2.57   -3.93   -3.81   -3.72   -3.63
80%                     0.20    4.20    4.06    1.32    3.81
90%                     1.35    4.25    4.17    4.60    3.91

Results of standard GP (%)
Nominal coverage rate   #1      #2      #3      #4      #5
10%                     1.61    1.82    1.12    2.50    2.41
20%                     2.16    1.74    2.36    2.28    0.09
30%                     2.87    2.75    2.94    2.88    0.81
40%                     3.02    3.18    3.24    3.05    2.62
50%                     3.66    3.93    4.71    4.88    3.83
60%                     4.78    4.54    5.86    5.21    5.01
70%                     5.05    5.29    6.23    5.31    5.68
80%                     5.89    5.52    5.13    5.78    5.27
90%                     5.61    5.05    5.63    5.35    5.24

Results of SQR (%)
Nominal coverage rate   #1      #2      #3      #4      #5
10%                     -2.58   -2.70   -2.50   -2.33   -2.12
20%                     -3.82   -3.25   -3.51   -2.67   -2.23
30%                     -3.88   -5.20   -4.90   -5.62   -5.36
40%                     -4.24   -5.58   -4.96   -5.83   -5.44
50%                     -5.01   -5.95   -4.99   -5.40   -6.23
60%                     -5.03   -7.20   -5.69   -7.97   -7.52
70%                     -4.35   -6.32   -5.07   -6.58   -6.39
80%                     -4.17   -5.87   -5.09   -5.88   -5.49
90%                     -3.64   -4.20   -5.34   -5.01   -5.08
Table 7 Sharpness evaluation of the prediction intervals (MW).

Model   #1       #2      #3       #4       #5
SHGP    97.32    89.30   111.72   88.36    95.49
GP      103.60   99.94   120.04   102.12   109.59
SQR     85.29    74.79   101.23   69.45    84.45
As an example, Fig. 10 shows the prediction results of SHGP on one of the data subsets. In this figure, the prediction intervals and point predictions are plotted together with the actual load measurements. Due to limited space, this figure does not show the results for the whole 24 h horizon, but only for the first 72 time stamps. To visually illustrate the probabilistic forecasting results, we plot the prediction intervals rather than the predictive distribution; the prediction intervals are calculated from the predictive distribution using (3). From this figure, we can see that the 90%-level prediction interval contains almost all actual load measurements. In addition, the point predictions closely follow the actual measured values and capture the local volatility. Fig. 11 shows a comparison of SHGP and the standard GP. In this figure, the 90%-level prediction intervals provided by SHGP and the standard GP are plotted together with the actual load measurements. One can see that the prediction interval given by SHGP captures the heteroscedasticity of the EIE's load series. In contrast, the standard GP generally overestimates the uncertainty level. This is caused by its uniform assumption on the noise level: with this homoscedastic assumption, the high fluctuations in some periods of the load series lead to an overall overestimation of the noise level. Table 5 presents the NLPD of the probabilistic forecasting results. As one can see, in most cases, SHGP provides a more reliable estimate of the predictive distribution. The main reason is that SHGP captures the heteroscedasticity associated with the load series. Compared to SHGP, the performance of
standard GP is relatively poor. This indicates that the standard GP cannot effectively quantify the uncertainties in the EIE's load series, mainly due to its homoscedastic nature. The performance of SQR is slightly worse than that of SHGP. Table 6 summarizes the reliability evaluation (RE) of the obtained prediction intervals. The perfect RE corresponds to the ideal case in which the observed proportions of the quantile predictions inside the testing set match the nominal probabilities. In this situation, the RE $\hat{a}_{(1-\alpha)}$ would be zero. In other words, the smaller the deviation from the expected nominal coverage rate, i.e., the smaller the $|\hat{a}_{(1-\alpha)}|$, the better. From this table, it can be seen that SHGP achieves the lowest deviations. For all nominal coverage rates, the deviations between the perfect values and the SHGP results are in general less than 5%. The standard GP gives competitive results for nominal coverage rates smaller than 30%. However, when the nominal coverage rate is greater than 30%, the performance of the standard GP degrades clearly, with deviations from the perfect values greater than 5%. This result indicates that, as the nominal coverage rate increases, the GP generally overestimates the variance of the load series, as in Fig. 11. The RE results of SQR are all negative and their magnitudes are generally too large, especially for the 50–80% nominal coverage rates. The deviations between the perfect values and the SQR results are greater than 5% for the 50–80% nominal coverage rates, and sometimes greater than 7% for the 60% nominal coverage rate. This result suggests that the prediction intervals given by SQR are generally too narrow, and hence underestimate the uncertainties associated with the load series. The above results demonstrate that SHGP is superior in handling the heteroscedasticity of the load series in EIEs. We further assess the RE results in Fig. 12.
In this figure, we first average the RE results over the different data subsets, then express the averaged RE as a function of the nominal coverage rate. The curves depict the deviation from the perfect value, as in (25). It is seen that the deviations between the perfect values and the SHGP results are contained within a 5% envelope. When the nominal coverage rate is below 30%, the RE of SHGP decreases with the nominal coverage rate. Conversely, when the nominal coverage rate exceeds 60%, the RE of SHGP increases. Furthermore, the RE of SHGP is negative for the 10–70% nominal coverage rates, and positive for the 80–90% nominal coverage rates. This indicates that SHGP slightly underestimates the uncertainties at smaller nominal coverage rates, and overestimates them at larger ones. The RE results of the standard GP are all positive, and generally increase with the nominal coverage rate. This suggests that the overall uncertainties are overestimated. In contrast, the RE results of SQR are all negative, which reveals that the uncertainties are underestimated. Table 7 presents the sharpness evaluation (SE) of the prediction intervals. According to this criterion, what is desired is prediction intervals with a smaller width, i.e., the smaller the SE, the better. From this table, one can see that the SQR model achieves the smallest SE, thus yielding the narrowest prediction intervals. The standard GP yields the widest prediction intervals, and the results of SHGP lie between SQR and the standard GP. It should be noted that there is a trade-off between RE and SE, so improving the RE will generally degrade the SE [37,38]. Furthermore, since the correctness of the predictive distribution is more important than its concentration, priority should be given to the RE.
5. Conclusion

In this paper, a sparse heteroscedastic model is established to provide probabilistic electric load forecasts in EIEs. With this model, we can give heteroscedastic Gaussian predictive distributions, together with point predictions. The ℓ1/2 regularizer is
employed to reduce the computational complexity, thus enhancing the practical applicability of the proposed model. The proposed model is validated against several existing models. The results on the real-world EIE load data demonstrate that the predictive distributions given by the proposed model capture the heteroscedasticity of the load series, thus leading to better reliability. Meanwhile, the point prediction accuracies are also satisfactory. Comparison with the standard HGP shows that the proposed model achieves comparable performance at a significantly reduced computational complexity. In summary, the proposed model is a competitive alternative to the existing models. Furthermore, it may serve as a stepping stone for stochastic energy scheduling in EIEs' micro-grids. This work also opens up several directions for future research. First, we may consider an online learning scheme, which deals with streaming data and adapts to concept drift in the load series. Second, with the development of demand response, EIEs may become electricity market participants, so investigating the relation between probabilistic load forecasting results and electricity prices is also an interesting direction for future work, as in [39]. Another important future direction is to incorporate the correlation between loads. In the present study, we only consider the total load of an individual EIE. However, this overall load can actually be decomposed into several components, such as heating load, motor load, and base load. From the perspective of a scheduler, individual forecasts for these loads could provide more comprehensive information about the operating condition of an EIE's micro-grid. In this case, the correlation between the aforementioned loads cannot be ignored [40,41].
We plan to address this issue in future work by extending the presented model to deal with multiple targets, i.e., to forecast these loads simultaneously while taking the correlation between them into account. The techniques introduced in [42] may open the possibility of achieving this, that is, modeling the correlation between loads using the latent function interpretation of the convolution transform, and then constructing a multiple-output Gaussian process through the convolution formalism. On the other hand, such a multi-target prediction model may also be used to simultaneously forecast the loads of several EIEs, which is meaningful for a distribution system with many industrial customers.

Acknowledgements

The authors would like to thank Dr. Kristian Kersting for providing the code for HGP. The anonymous reviewers are acknowledged for their comments and suggestions. The authors gratefully acknowledge the financial support received from the Key Program of the National Natural Science Foundation of China (Grant No. 6073602), the National Natural Science Foundation of China (Grant No. 60974101), and the Foundation for Innovative Research Groups of the National Natural Science Foundation of China (Grant No. 61221063).

References

[1] Babu C, Ashok S. Peak load management in electrolytic process industries. IEEE Trans Power Syst 2008;23(2):399–405.
[2] Gao Y, Gao F, Zhai Q, Guan X. Self-balancing dynamic scheduling of electrical energy for energy-intensive enterprises. Int J Syst Sci 2012;20(1).
[3] Ashok S, Banerjee R. Load-management applications for the industrial sector. Appl Energy 2000;66(2):105–11.
[4] Ruiz-Rodriguez J, Hernandez C, Jurado F. Probabilistic load flow for photovoltaic distributed generation using the Cornish–Fisher expansion. Electr Power Syst Res 2012;89:129–38.
[5] Ruiz-Rodriguez J, Hernandez C, Jurado F. Probabilistic load flow for radial distribution networks with photovoltaic generators. IET Renew Power Gener 2012;6(2):110–21.
[6] Billinton R, Huang D.
Effects of load forecast uncertainty on bulk electric system reliability evaluation. IEEE Trans Power Syst 2008;23(2):418–25.
[7] Zhang Y, Zhou Q, Sun C, Lei S, Liu Y, Song Y. RBF neural network and ANFIS-based short-term load forecasting approach in real-time price environment. IEEE Trans Power Syst 2008;23(3):853–8.
[8] Vilar M, Cao R, Aneiros G. Forecasting next-day electricity demand and price using nonparametric functional methods. Int J Electr Power Energy Syst 2012;39(1):48–55.
[9] Sousa C, Neves P, Jorge M. Assessing the relevance of load profiling information in electrical load forecasting based on neural network models. Int J Electr Power Energy Syst 2012;40(1):85–93.
[10] Hooshmand A, Amooshahi H, Parastegari M. A hybrid intelligent algorithm based short-term load forecasting approach. Int J Electr Power Energy Syst 2013;45(1):313–24.
[11] Fan S, Chen L. Short-term load forecasting based on an adaptive hybrid method. IEEE Trans Power Syst 2006;21(1):392–401.
[12] Chen T, Wang Y. Long-term load forecasting by a collaborative fuzzy-neural approach. Int J Electr Power Energy Syst 2012;43(1):454–64.
[13] Bunnoon P, Chalermyanont K, Limsakul C. Multi-substation control central load area forecasting by using HP-filter and double neural networks (HPDNNs). Int J Electr Power Energy Syst 2013;44(1):561–70.
[14] Guan C, Luh P, Michel L, Wang Y, Friedland P. Very short-term load forecasting: wavelet neural networks with data pre-filtering. IEEE Trans Power Syst 2013;28(1):30–41.
[15] Gao Y, Pan J, Ji G, Gao F. A time-series modeling method based on the boosting gradient-descent theory. Sci China Technol Sci 2011;54(5):1325–37.
[16] Liang X, Xu W, Chung C, Freitas W, Kun X. Dynamic load models for industrial facilities. IEEE Trans Power Syst 2012;27(1):69–80.
[17] Rasmussen C, Williams C. Gaussian processes for machine learning. Cambridge (MA): MIT Press; 2006.
[18] Kersting K, Plagemann C, Pfaff P, Wolfram B. Most likely heteroscedastic Gaussian process regression. In: Proc. of international conference on machine learning. Corvallis, OR, US; July 2007.
[19] Xu Z, Chang X, Xu F, Zhang H. L1/2 regularization: a thresholding representation theory and a fast solver. IEEE Trans Neural Networks Learn Syst 2012;23(7):1013–27.
[20] Tibshirani R.
Regression shrinkage and selection via the Lasso. J Roy Stat Soc Ser B 1996;58:267–88.
[21] Zou H. The adaptive lasso and its oracle properties. J Am Stat Assoc 2006;101(476):1418–29.
[22] Meinshausen N, Yu B. Lasso-type recovery of sparse representations for high-dimensional data. Ann Stat 2009;37(1):246–70.
[23] Rakotomamonjy A, Flamary R, Gasso G, Canu S. Lp–Lq penalty for sparse linear and sparse multiple kernel multitask learning. IEEE Trans Neural Networks 2011;22(1):1307–20.
[24] Krishnan D, Fergus R. Fast image deconvolution using hyper-Laplacian priors. In: Advances in neural information processing systems. Cambridge (MA): MIT Press; 2009.
[25] Smola AJ, Bartlett P. Sparse greedy Gaussian process regression. In: Advances in neural information processing systems. Cambridge (MA): MIT Press; 2000.
[26] Bach F, Jordan M. Predictive low-rank decomposition for kernel methods. In: Proc. of international conference on machine learning. Bonn, Germany; July 2005.
[27] Seeger M, Williams C, Lawrence N. Fast forward selection to speed up sparse Gaussian process regression. In: Proc. of workshop on AI and statistics. Key West, Florida, US; January 2003.
[28] Breusch T, Pagan A. Simple test for heteroscedasticity and random coefficient variation. Econometrica 1979;47(5):1287–94.
[29] White H. A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity. Econometrica 1980;48(4):817–38.
[30] Lilliefors H. On the Kolmogorov–Smirnov test for normality with mean and variance unknown. J Am Stat Assoc 1967;62(318):399–402.
[31] Nielsen H, Madsen H, Nielsen T. Using quantile regression to extend an existing wind power forecasting system with probabilistic forecasts. Wind Energy 2006;9(1):95–108.
[32] Koenker R, Hallock K. Quantile regression. J Econom Perspect 2001;15(4):143–56.
[33] Vapnik V. Statistical learning theory. New York (NY): Wiley-Interscience; 1998.
[34] Neal R. Bayesian learning for neural networks.
Lecture notes in statistics, vol. 118. New York (NY, USA): Springer; 1996.
[35] Good I. Rational decisions. J Roy Stat Soc Ser B 1952;14(1):107–14.
[36] Pinson P, Nielsen H, Moller J, Madsen H, Kariniotakis G. Nonparametric probabilistic forecasts of wind power: required properties and evaluation. Wind Energy 2007;10(6):497–516.
[37] Juban J, Fugon L, Kariniotakis G. Probabilistic short-term wind power forecasting based on kernel density estimators. In: Proc. of European wind energy conference. Milan, Italy; May 2007.
[38] Pinson P, Kariniotakis G. Conditional prediction intervals of wind power generation. IEEE Trans Power Syst 2010;25(4):1845–56.
[39] Bo R, Li F. Probabilistic LMP forecasting considering load uncertainty. IEEE Trans Power Syst 2009;24(3):1279–89.
[40] Billinton R, Huang D. Incorporating wind power in generating capacity reliability evaluation using different models. IEEE Trans Power Syst 2011;26(4):2509–17.
[41] Papaefthymiou G, Schavemaker H, Sluis L, Kling L, Kurowicka D, Cooke M. Integration of stochastic generation in power systems. Int J Electr Power Energy Syst 2006;28(9):655–67.
[42] Alvarez A, Lawrence D. Computationally efficient convolved multiple output Gaussian processes. J Mach Learn Res 2011;12(5):1459–500.