Short term power load prediction with knowledge transfer

Information Systems (2015). Contents lists available at ScienceDirect. Journal homepage: www.elsevier.com/locate/infosys

Yulai Zhang*, Guiming Luo — School of Software, Tsinghua University, Tsinghua National Laboratory for Information Science and Technology, Beijing 100084, China


Keywords: Transfer learning; Gaussian process; Power load prediction

Abstract: A novel transfer learning method is proposed in this paper to solve power load forecast problems in the smart grid. Prediction errors of the target tasks can be greatly reduced by utilizing the knowledge transferred from the source tasks. In this work, a source task selection algorithm is developed and a transfer learning model based on the Gaussian process is constructed. Compared with previous works, negative knowledge transfers are avoided, and the prediction accuracy is therefore greatly improved. In addition, a fast inference algorithm is developed to accelerate the prediction steps. The results of experiments with real world data are illustrated. © 2015 Elsevier Ltd. All rights reserved.

1. Introduction

Power load forecast is a crucial issue in the management of smart grids. Accurate forecasts can greatly cut down the operational cost of power systems [1]. The scheduling and rescheduling of the generation plan and the maintenance plan both rely heavily on the short term power load forecast. The results of the short term power load forecast are also important to other basic functions such as interchange evaluation and security assessment in the smart grid [2].

In early research, classic statistical models such as the AR (auto-regressive) model [3,4], the ARMAX (auto-regressive moving average with external input) model [5], the BJ (Box–Jenkins) model [6,7] and the SS (state-space) model with Kalman filter [8,9] were used for short term power load forecast. In the past decade, along with the fast development of artificial intelligence and machine learning techniques, mainstream AI based methods and models, such as ANN (artificial neural networks) [10,11], SVM (support vector machine) [12,13], evolutionary algorithms [14] and the GP (Gaussian process) model [15], have been

Corresponding author. E-mail addresses: [email protected] (Y. Zhang), [email protected] (G. Luo).

adopted to solve this problem. Many hybrid models and methods have also been proposed in the last few years. A neural network model with harmony search is developed in [16]. PSO (particle swarm optimization) is combined with SVR (support vector regression) in [17] and with an ANN in [18].

From the latest developments mentioned above, we can see that parameter optimization is a critical issue for the improvement of the prediction results. Parameter optimization is a time consuming task for many models and is unlikely to be performed at every prediction step, particularly when the time interval for prediction is very short. Most of the above models are parametric, except the Gaussian process. One advantage of the GP model is that, as a non-parametric model, the hyper-parameter optimization step of the Gaussian process is equivalent to the model structure selection step of parametric models such as ANN and SVM, and the prediction step of the GP model is equivalent to both the parameter optimization step and the prediction step of the parametric models [19]. So the latent parameters of the GP model are adapted to the new data automatically. This is the first reason for us to consider the GP model in our work.

The power load values can be affected by many hidden variables of both natural environments (such as wind speed and sunlight) and human activities (such as holidays

http://dx.doi.org/10.1016/j.is.2015.01.005 — 0306-4379/© 2015 Elsevier Ltd. All rights reserved.

Please cite this article as: Y. Zhang, G. Luo, Short term power load prediction with knowledge transfer, Information Systems (2015), http://dx.doi.org/10.1016/j.is.2015.01.005


and emergency events). Most of these variables are difficult to obtain and some of them are hard to quantify. As a result, most of the existing works use the history load values as the elements of the feature vectors to obtain the predictions. There are still some works that try to use weather conditions to forecast the power load, such as [20,21]. However, for the specific problem of short term load forecast, when weather information is concerned, the short term weather forecast itself is a nontrivial task with great uncertainty. On the other hand, the hidden variables always have similar values for neighboring cities located in the same region. Therefore, in the task of power load forecast, the prediction accuracy can be increased significantly by utilizing the knowledge and information from the data of these nearby cities.

Transfer learning [22] has been a research hotspot in recent years. In a transfer learning process, the performance of the target task is improved by using the knowledge transferred from the selected source tasks as well as that of the target task itself. Transfer learning methods have been successfully used in problems such as natural language processing [23] and wireless localization [24]. In these problems, common knowledge and information such as language grammars and wireless communication environments are shared among the tasks. So it is reasonable to believe that transfer learning methods can also give promising results in the field of power load prediction.

The first problem in developing a transfer learning method is how to find the source tasks for a given target task. In the multi-task learning method proposed in [25], tasks are clustered manually. Therefore, negative knowledge transfers [26], which are destructive to the prediction accuracy, are observed due to incorrect human choices.
Another important reason for negative transfer is that knowledge transfer between tasks may not be symmetric: even if the knowledge of one task improves the performance of another task, the improvement in the opposite direction cannot be guaranteed. In order to solve these problems, a transfer learning method with automatic source task selection is developed for the power load prediction problem in this paper. Transfer learning is also called asymmetric multi-task learning in some of the literature. The proposed method chooses different source tasks for different target tasks to improve the predictions.

The second problem in this research is the computational time. The concept of the covariance function in the Gaussian process (GP) model can describe the relationships between data points from different tasks easily. This is the second, and the most important, reason for us to use the GP model in our work. But one of the deficiencies of the Gaussian process model is that the computational time of its inference algorithm is O(n³). Moreover, a larger number of data points will be involved due to the knowledge transfer, so the amount of computation increases significantly. A fast inference algorithm based on the proposed transfer learning GP model is developed in this paper to accelerate the inference calculation. The time complexity is reduced to O(n²).

In the following sections, the basic concepts of the Gaussian process model and the covariance function for

power load sequences will be introduced in Section 2. The transfer learning method for power load prediction will be presented in Section 3. Finally, the experiments will be given in Section 4.

2. Preliminaries

2.1. Gaussian process model for prediction

For the prediction problem with data set {(x_t, y_t) | t = 1, …, n}, a regression model can be written as

y_t = f(x_t) + e_t    (1)

In Eq. (1), y_t is the scalar output of the process y at time t and x_t is the corresponding feature vector of length d. e_t is independent white Gaussian noise with mean μ_e = 0 and variance σ_n². For time series prediction problems, x_t can be chosen as x_t = [y_{t−1}, y_{t−2}, …, y_{t−d}]^T, where d is called the dimension of the feature space.

Let k(·,·) denote the covariance function of the Gaussian process model, with k(x_i, x_j) = k(x_j, x_i). The explicit forms of the covariance function will be discussed in the next subsection. Let y_t be zero mean for simplicity. The joint distribution of the series y = [y_1, y_2, …, y_n]^T can be written as

y ~ N(0, K(x, x) + σ_n² I)    (2)

where x = [x_1, x_2, …, x_n]^T and the elements of the Gram matrix K are K_ij = k(x_i, x_j). The identity matrix I in (2) can be substituted by an auto-covariance matrix if non-white Gaussian noise is concerned; see [27] for more details.

The predictor of the output sequence at time n+1 is denoted by ŷ_{n+1}, and the joint distribution of the predictor of y_{n+1} and the history data y can be written as

[ y ; ŷ_{n+1} ] ~ N( 0, [ K(x, x) + σ_n² I ,  k(x_{n+1}, x) ;
                          k^T(x_{n+1}, x) ,  k(x_{n+1}, x_{n+1}) ] )    (3)

where the covariance vector k is defined as

k(x_{n+1}, x) = [k(x_{n+1}, x_1), k(x_{n+1}, x_2), …, k(x_{n+1}, x_n)]^T

The expectation and the variance of the prediction can be obtained from the properties of the joint Gaussian distribution:

ȳ_{n+1} = E(ŷ_{n+1}) = k^T(x_{n+1}, x) (K(x, x) + σ_n² I)^{−1} y    (4)

Var(ŷ_{n+1}) = k(x_{n+1}, x_{n+1}) − k^T(x_{n+1}, x) (K(x, x) + σ_n² I)^{−1} k(x_{n+1}, x)    (5)

In the GP model, the power load prediction ŷ_{n+1} is treated as a random variable which is correlated with the previous load values, and the magnitude of the correlations is decided by the covariance function k(·,·). In this work, we focus on one-step-ahead prediction problems. For k-steps-ahead problems with k > 1, one solution is to write the feature vector as x_t = [y_{t−k}, …, y_{t−d−k+1}]^T. Another solution is to use the predicted values in the feature vector x_t; the uncertainty propagation problem [28] should be considered when predicted values are used.
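Eqs. (4) and (5) can be sketched directly in NumPy. The snippet below is an illustrative toy implementation on synthetic sine-wave data, not the paper's experimental code; it uses the plain linear covariance k(x_p, x_q) = x_p^T x_q of Section 2.2 and builds autoregressive features x_t = [y_{t−1}, …, y_{t−d}]:

```python
import numpy as np

def gp_predict(X, y, x_new, noise_var):
    """One-step-ahead GP prediction, Eqs. (4)-(5).

    X: (n, d) matrix of feature vectors, y: (n,) outputs,
    x_new: (d,) new feature vector, noise_var: sigma_n^2.
    Uses the linear covariance k(xp, xq) = xp^T xq.
    """
    K = X @ X.T                       # Gram matrix, K_ij = k(x_i, x_j)
    k = X @ x_new                     # covariance vector k(x_{n+1}, x)
    A = K + noise_var * np.eye(len(y))
    alpha = np.linalg.solve(A, y)     # (K + sigma_n^2 I)^{-1} y
    mean = k @ alpha                  # Eq. (4)
    var = x_new @ x_new - k @ np.linalg.solve(A, k)  # Eq. (5)
    return mean, var

# Toy autoregressive features: x_t = [y_{t-1}, ..., y_{t-d}]
rng = np.random.default_rng(0)
d, n = 3, 50
series = np.sin(0.3 * np.arange(n + d)) + 0.01 * rng.standard_normal(n + d)
X = np.array([series[t:t + d][::-1] for t in range(n)])
y = series[d:d + n]
mean, var = gp_predict(X, y, series[n:n + d][::-1], 0.01)
```

Note that the variance of Eq. (5) comes for free from the same linear solve, which is what makes the GP predictor attractive for confidence estimation in Section 4.2.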


2.2. Covariance functions and learning the hyperparameters


The power load values go up and down with the changes of the seasons and years, so the non-stationary properties of the power load sequences are significant [13]. Linear trend covariance functions are commonly used in Gaussian process models for non-stationary time series [29]:


k_l(x_p, x_q) = x_p^T x_q

Note that x_p and x_q are column vectors. Sometimes a weighting matrix is used to give different weights to different factors in the feature vectors:

k_la(x_p, x_q) = x_p^T L x_q    (6)

In Eq. (6), L is a d × d diagonal matrix. Intuitively, for normalized data the value of (6) achieves its maximum when x_p and x_q are identical, provided the feature vector is long enough. This indicates that the same history values probably lead to the same prediction values. The diagonal elements of L, denoted l_1, …, l_d, are the hyper-parameters of the Gaussian process model and can be learned by the automatic relevance determination (ARD) process from the given data set. Note that if the ith feature has a large value of l_i, it has a stronger influence on the outputs of the Gaussian process model.

In order to learn the optimal values of the hyper-parameters in the covariance function, iterative methods, such as conjugate gradient, are used to solve the optimization problem constructed by maximum likelihood estimation (MLE). The log likelihood function for the GP model can be written as

log p(y | x, θ) = −(1/2) y^T (K(x, x) + σ_n² I)^{−1} y − (1/2) log |K(x, x) + σ_n² I| − (n/2) log 2π    (7)

where θ = [l_1, …, l_d]^T is the hyper-parameter vector, which can be obtained by maximizing the log likelihood function:

θ̂ = arg max_θ (log p(y | x, θ))    (8)

Note that for noisy models, the variance of the additive noise σ_n² can also be estimated by the MLE procedure.

Feature selection is a critical step in almost all prediction problems. The power load data from the same time of the previous days or the same time of the previous weeks are often used as key features [30]. This can be done manually or automatically. Since we use the history load values as the elements of x, the optimization procedure of Eqs. (7) and (8) is equivalent to an automatic feature selection step, which decides the most related time points for the current load forecast from the training data x and y.

3. Transfer learning method based on GP model

In transfer learning problems, the input and output data observations of the source tasks and the target task are connected by latent variables, as shown in Fig. 1. Therefore, extra knowledge and information can be transferred from the source tasks to the target task. From

Fig. 1. Target task, source task and negative transfer in transfer learning problems.

another point of view, the data from the target task and the source tasks may share the same underlying noises, so a larger number of data observations from the related tasks can eliminate the effects of these noises and improve the modeling accuracy. But if unrelated or weakly related tasks are selected as source tasks, useless information will be introduced as noise. As depicted in Fig. 1, such negative knowledge transfers have destructive effects on the modeling and prediction of the target task.

3.1. Source task selection

In order to avoid negative knowledge transfers, source task selection is an important issue in transfer learning. Consider a group of tasks T = {T_k | k = 1, …, M}. If the jth task T_j is the current target task, the aim of the source task selection step is to select the source tasks from the remaining M − 1 tasks. The number of possible choices is 2^{M−1}, including the choice of a null set. The search space increases exponentially with the number of tasks M, and the amount of computation is extremely large. Fortunately, it is unnecessary to evaluate all the possible combinations in many cases.

The short term power load values are sampled at the same clock times every day in all the cities, which makes it convenient to evaluate the data correlation. In time series prediction problems, the relationship between two sequences can be easily quantified by the correlation coefficient. More information is shared between tasks with a larger correlation coefficient. Intuitively, the curves of the data sequences from two different cities with a higher correlation coefficient are more similar to each other. Thus, the source tasks can be selected in order of the amount of shared information. As a result, the search space can be reduced from 2^{M−1} to M.
Besides, the same data set can be used for the source task selection step and the hyper-parameter learning step; that is to say, no extra data observations are required. For the task group {T_1, T_2, …, T_M}, let {(x_t^{(j)}, y_t^{(j)}), t = 1, …, N} be the original training data set of task T_j, j = 1, …, M. The elements of the covariance coefficient matrix can be


calculated as

M_cov(i, j) = ⟨y^{(i)}, y^{(j)}⟩ / (‖y^{(i)}‖₂ ‖y^{(j)}‖₂)    (9)

where y^{(j)} is the learning data vector for the jth task:

y^{(j)} = [y_N^{(j)}, y_{N−1}^{(j)}, …, y_1^{(j)}]^T    (10)

And the entire data set from all tasks can be written as y = [y^{(1)}, …, y^{(M)}]. Note that in our application power load values are always positive; otherwise |M_cov(i, j)| should be used instead.

For the jth column of M_cov sorted in descending order, let v_j be its original index vector, which means M_cov(v_j(k_1), j) > M_cov(v_j(k_2), j) if k_1 < k_2. The elements of the vector v_j can be formally described as

v_j(k) = h  for  k = Σ_{i=1}^{M} I(M_cov(h, j) ≤ M_cov(i, j))    (11)

where I(·) is the indicator function with I(true) = 1 and I(false) = 0. Note that M_cov(i, i) = 1 is the maximum value in a covariance matrix. The first element of the index vector is always v_j(1) = j, since the most related task of T_j is always itself. Correspondingly, v_j(M) is the index of the least related task for the target task T_j.

In practice, some of the source tasks have positive but very small contributions to the target task, while the computational time of the prediction step increases with the number of source tasks. In order to balance this tradeoff between accuracy and efficiency, the AIC (Akaike information criterion) [31] is used in this work. For the target task T_j, the negative log likelihood function with the task number penalty can be written as

L(x^{(j)}(m), y^{(j)}, θ, m) = −log p(y^{(j)} | x^{(j)}(m), θ) + λm    (12)

λ is a positive number decided by the user. A larger choice of λ leads to a smaller number of source tasks and vice versa. m is the number of source tasks selected for knowledge transfer; it has different values for different target tasks. Note that nothing is transferred if m = 1. The superscripts of m and θ are omitted for clarity since no confusion arises in our context. The feature matrix x^{(j)}(m) in (12) can be written as

x^{(j)}(m) = [ x_N^{(v_j(1))}      x_N^{(v_j(2))}      …  x_N^{(v_j(m))}
               x_{N−1}^{(v_j(1))}  x_{N−1}^{(v_j(2))}  …  x_{N−1}^{(v_j(m))}
               ⋮                   ⋮                   ⋱  ⋮
               x_1^{(v_j(1))}      x_1^{(v_j(2))}      …  x_1^{(v_j(m))} ]    (13)

(13) is an N × md matrix, where the vector x_i^{(k)} = [y_{i−1}^{(k)}, y_{i−2}^{(k)}, …, y_{i−d}^{(k)}] in this problem. Note that the absolute values of the original power load data vary widely between cities: the power load values of the biggest city can be four or five times larger than those of the smallest city in our work. The elements of the feature vectors from different tasks should therefore be normalized to make them comparable in the task selection step.

The source tasks for the target task T_j are chosen as T_s = {T_{v_j(k)} | k = 1, …, m}. The values of m are not the same for different target tasks. m for target task T_j can be obtained by

[m, θ] = arg min_{m, θ} L(x^{(j)}(m), y^{(j)}, θ, m)  s.t.  1 ≤ m ≤ M, m integer    (14)
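The penalized criterion (12) and the search over m in (14) can be sketched as follows. This is a simplified illustration on synthetic data: the weight matrix is fixed to L = I rather than being optimized jointly with m, and the candidate tasks are assumed to be already ranked by correlation.

```python
import numpy as np

def neg_log_marginal_likelihood(X, y, noise_var):
    """-log p(y | x, theta) for a zero-mean GP with linear kernel,
    Eq. (7), under the simplifying assumption L = I."""
    n = len(y)
    A = X @ X.T + noise_var * np.eye(n)
    _, logdet = np.linalg.slogdet(A)
    return 0.5 * (y @ np.linalg.solve(A, y) + logdet + n * np.log(2 * np.pi))

def select_m(feature_blocks, y, lam=1.0, noise_var=0.01):
    """Minimise L(m) = -log p(y | x(m)) + lam*m, cf. Eqs. (12) and (14).

    feature_blocks[k] is the (n, d) feature block of the task ranked
    k-th by correlation; block 0 belongs to the target task itself."""
    scores = []
    for m in range(1, len(feature_blocks) + 1):
        X = np.hstack(feature_blocks[:m])   # x(m): an n x (m*d) matrix, Eq. (13)
        scores.append(neg_log_marginal_likelihood(X, y, noise_var) + lam * m)
    return int(np.argmin(scores)) + 1       # m is 1-based

# Synthetic example: a helpful related task and an unrelated one.
rng = np.random.default_rng(1)
n, d = 40, 2
target_feats = rng.standard_normal((n, d))
y = target_feats @ np.array([1.0, -0.5]) + 0.05 * rng.standard_normal(n)
blocks = [target_feats,
          target_feats + 0.1 * rng.standard_normal((n, d)),  # related task
          rng.standard_normal((n, d))]                       # unrelated task
m = select_m(blocks, y)
```

Because the candidates are pre-ranked, only M values of m need to be scored, which is what reduces the search from 2^{M−1} combinations to a linear scan.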

Note that the value of θ can also be obtained simultaneously by solving (14). But in online prediction, the original data values without normalization are more commonly used, whereas the data values are normalized in the source task selection step. Furthermore, the penalty term may also bias the estimation. So the hyper-parameters will be re-estimated, as discussed in Section 3.2. However, these effects on the final results are not significant in our experiments.

The source task selection algorithm for transfer learning based on the Gaussian process model is summarized in Table 1. Note that if all the tasks are to be performed, the entire covariance matrix can be calculated once to halve the computation, since M_cov is symmetric; the remaining steps should be done for every target task.

3.2. Gaussian process model with knowledge transfer

For the target task T_j, m − 1 source tasks T_{v_j(2)}, …, T_{v_j(m)} are selected by the task selection step in the last subsection. Note that T_{v_j(1)} is always the target task itself. The data matrix from all the selected tasks for the prediction step at time t can be presented as

Y_t^{(j)} = [ y_{t−1}^{(v_j(1))}  y_{t−1}^{(v_j(2))}  …  y_{t−1}^{(v_j(m))}
              y_{t−2}^{(v_j(1))}  y_{t−2}^{(v_j(2))}  …  y_{t−2}^{(v_j(m))}
              ⋮                   ⋮                   ⋱  ⋮
              y_{t−n}^{(v_j(1))}  y_{t−n}^{(v_j(2))}  …  y_{t−n}^{(v_j(m))} ]

And let y_t^{(j)} be the first column of Y_t^{(j)}, which is the output data of the target task T_j:

y_t^{(j)} = [y_{t−1}^{(v_j(1))}, …, y_{t−n}^{(v_j(1))}]^T    (15)

Table 1. Source task selection algorithm of transfer learning for Gaussian process.

Algorithm 1: Source task selection algorithm.
Input: learning data matrix y, target task T_j, λ
Output: source task set T_s
1. Calculate the jth column of the correlation matrix M_cov from the given data set y by M_cov(i, j) = ⟨y^{(i)}, y^{(j)}⟩ / (‖y^{(i)}‖₂ ‖y^{(j)}‖₂), where i = 1, …, M and y^{(i)} is the ith column of y.
2. Sort the jth column of the correlation matrix and get the index vector v_j for task T_j.
3. Solve the optimization problem (14) to obtain m for task T_j.
4. Select source tasks for task T_j according to T_s(T_j) = {T_{v_j(k)} | k = 1, …, m}.
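Steps 1, 2 and 4 of Algorithm 1 map directly onto array operations. A minimal NumPy sketch on synthetic data follows; step 3 (obtaining m from (14)) is abstracted into a caller-supplied m, since it needs the GP likelihood machinery of Section 3.2:

```python
import numpy as np

def select_source_tasks(Y, j, m):
    """Algorithm 1, steps 1, 2 and 4: rank tasks for target task j.

    Y: (N, M) matrix whose columns are the load sequences y^{(i)};
    j: target task index (0-based); m: number of tasks to keep,
    assumed given here (the paper obtains it from Eq. (14)).
    Returns the indices of the selected tasks, target first.
    """
    # Step 1: jth column of the correlation matrix, Eq. (9).
    norms = np.linalg.norm(Y, axis=0)
    mcov = (Y.T @ Y[:, j]) / (norms * norms[j])
    # Step 2: index vector v_j, descending correlation, Eq. (11).
    v_j = np.argsort(-mcov)
    # The most related task is always the target itself (mcov[j] = 1).
    # Step 4: keep the m most related tasks.
    return v_j[:m]

# Toy check: task 0 should be paired with its near-proportional copy.
rng = np.random.default_rng(2)
base = np.abs(rng.standard_normal(100)) + 1.0   # load values are positive
Y = np.column_stack([base,
                     2.0 * base + 0.05 * rng.standard_normal(100),
                     np.abs(rng.standard_normal(100)) + 1.0])
sel = select_source_tasks(Y, 0, 2)
```

Note that Eq. (9) is a cosine similarity on raw (positive) load vectors rather than a mean-centered correlation, so the code normalizes by the vector norms only, as the paper does.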


The data records in the same row are the power load values at the same time, and those in the same column are the power load values at the same place. The corresponding input matrix can be written as

x_t^{(j)} = [ x_{t−1}^{(v_j(1))}  x_{t−1}^{(v_j(2))}  …  x_{t−1}^{(v_j(m))}
              x_{t−2}^{(v_j(1))}  x_{t−2}^{(v_j(2))}  …  x_{t−2}^{(v_j(m))}
              ⋮                   ⋮                   ⋱  ⋮
              x_{t−n}^{(v_j(1))}  x_{t−n}^{(v_j(2))}  …  x_{t−n}^{(v_j(m))} ]    (16)

x_t^{(j)} is an n × dm matrix, where

x_{t−i}^{(v_j(k))} = [y_{t−i−1}^{(v_j(k))}, y_{t−i−2}^{(v_j(k))}, …, y_{t−i−d}^{(v_j(k))}]^T,  i = 1, 2, …, n,  k = 1, …, m

Note that the prediction value at time t is calculated from data points of length n, with n ≪ N. N is the length of the entire time series: before the prediction step, N is the length of the training data for task selection and hyper-parameter learning, and during the prediction step the value of N increases as new observation points arrive. n is the number of reference points used in the prediction step of the transfer learning GP model; it can be taken as the length of a window on the whole history data sequence. Since the Gaussian process is a non-parametric model, hyper-parameter estimation is equivalent to model structure selection in parametric models, so the optimization for hyper-parameter estimation has to be performed only once, before the prediction steps of each task.

The covariance function with knowledge transfer can be written as the sum of the target task's covariance function and the selected source tasks' covariance functions:

k(x_p, x_q; θ^{(j)}) = Σ_{i=v_j(1)}^{v_j(m)} (x_p^{(i)})^T L_i(θ^{(j)}) x_q^{(i)}    (17)

where x_p and x_q are rows of (16). L_i, i = v_j(1), …, v_j(m), are diagonal matrices which describe the magnitude of the contribution from the power load data of task v_j(i). The hyper-parameter vector for task T_j can be written as

θ^{(j)} = [θ_{j1}^T, θ_{j2}^T, …, θ_{jm}^T]^T

where θ_{ji} is a d × 1 vector; thus L_k(θ_j) is a d × d diagonal matrix for task j and θ_j is an md × 1 vector. The values of the hyper-parameters can be obtained by the optimization of (7), which is rewritten in (18) as

θ^{(j)} = arg max_{θ^{(j)}} log p(y^{(j)} | x^{(j)}, θ^{(j)})
        = arg max_{θ^{(j)}} [ −(1/2) (y^{(j)})^T (K(x^{(j)}, x^{(j)}) + σ_n² I)^{−1} y^{(j)} − (1/2) log |K(x^{(j)}, x^{(j)}) + σ_n² I| − (n/2) log 2π ]    (18)

Note that the data x^{(j)} and y^{(j)} used in (18) are defined in (13) and (10) correspondingly in Section 3.1.

The predictor of the new step of the target task T_j and the corresponding variance can be obtained in the same way as (4) and (5). They are rewritten in the following equations:

ȳ_t^{(j)} = E(ŷ_t^{(j)}) = k^T(x_t^{(j)}, x^{(j)}) (K(x^{(j)}, x^{(j)}) + σ_n² I)^{−1} y_t^{(j)}    (19)

Var(ŷ_t^{(j)}) = k(x_t^{(j)}, x_t^{(j)}) − k^T(x_t^{(j)}, x^{(j)}) (K(x^{(j)}, x^{(j)}) + σ_n² I)^{−1} k(x_t^{(j)}, x^{(j)})    (20)

Eq. (18) is a natural extension of (8): it automatically assigns a weight to each of the feature elements in the matrix (16), which are selected from either the target task or the source tasks.

3.3. Fast inference algorithm with knowledge transfer

In short term power load forecast, the calculation time of the prediction is limited by the sample interval. Moreover, in transfer learning methods the number of data points involved in the calculation is multiplied. So computational time complexity is an important issue in our work. The standard inference method for the Gaussian process model has time complexity O(n³), which increases quickly with the number of data points. The purpose of this subsection is to reduce the time complexity of the inference step. Here, the fast algorithm developed in our conference paper [25] is rephrased for the transfer learning model of Section 3.2. Since we focus on the prediction step of a particular target task T_j at a particular time t, the superscripts which indicate the target task and the subscripts which indicate the time step are omitted for simplicity.

The major computational cost in (19) and (20) comes from the matrix inversion operation. By the matrix inversion lemma

(A + BCD^T)^{−1} = A^{−1} − A^{−1} B (C^{−1} + D^T A^{−1} B)^{−1} D^T A^{−1}

the inverse operation on the n × n matrix in (19) and (20) can be reduced to an inverse operation on a smaller matrix of order dm × dm, which can be written as

(K(x, x) + σ_n² I)^{−1} = (x L x^T + σ_n² I)^{−1} = (1/σ_n²) I − (1/σ_n²) x (σ_n² L^{−1} + x^T x)^{−1} x^T    (21)

L in (21) is a dm × dm block diagonal matrix composed of the diagonal matrices L_{v_j(1)}, L_{v_j(2)}, …, L_{v_j(m)} in (17):

L = diag(L_{v_j(1)}, L_{v_j(2)}, …, L_{v_j(m)})

Remember that x and y are obtained by (16) and (15) correspondingly, and the diagonal elements of L_{v_j(k)} are obtained by (18). Using Eq. (21), the computational time of the matrix inverse is reduced from O(n³) to O(n² + (dm)³). Note that d·m is the number of hyper-parameters in the GP model. For most statistical learning problems the number of data points is much larger than the number of parameters, so dm ≪ n, which means the computational time can be effectively reduced by this method.

The second problem is that when the number of data points n is large, the elements of the matrix K(x, x) + σ_n² I also become large, and its inverse will encounter serious round-off problems. Intermediate

Please cite this article as: Y. Zhang, G. Luo, Short term power load prediction with knowledge transfer, Information Systems (2015), http://dx.doi.org/10.1016/j.is.2015.01.005i

Y. Zhang, G. Luo / Information Systems ] (]]]]) ]]]–]]]

6

variables will therefore be calculated instead of the inverse matrices:

a_y = (1/σ_n²) { y − x (σ_n² L^{−1} + x^T x)^{−1} x^T y }    (22)

is the intermediate variable for calculating the new output, and

a_v = (1/σ_n²) { k − x (σ_n² L^{−1} + x^T x)^{−1} x^T k }    (23)

is the intermediate variable for the corresponding variance. The matrix inversion operations in (22) and (23) can be obtained as the solutions of linear matrix equations by the Cholesky decomposition. For a positive definite matrix A

V = cholesky(A)    (24)

where A = V V^T and the triangular matrix V is the result of the Cholesky decomposition. For a k × k matrix A, the complexity of the Cholesky decomposition in (24) is k³/6. For a column vector b, A^{−1}b is equivalent to the solution of the linear matrix equation Ax = b:

A^{−1} b = V^{−T} V^{−1} b    (25)

And the time complexity of (25) is k²/2. Let

A = σ_n² L^{−1} + x^T x    (26)

b = x^T y,  b′ = x^T k    (27)

The matrix inverse problems in (22) and (23) can then be solved with (dm)³/6 + (dm)²/2 operations. Note that the inverse operation for the diagonal matrix L has time complexity o(n). We rewrite a_y and a_v as

a_y = (1/σ_n²) { y − x K_y }    (28)

a_v = (1/σ_n²) { k − x K_v }    (29)

where K_y = A^{−1} b and K_v = A^{−1} b′ are calculated by (24)–(27). The fast inference algorithm is summarized as Algorithm 2 in Table 2. Remember that all the superscripts (j) on x, y, x_{t+1} and ŷ_{t+1}, which identify the index of the target task T_j, are omitted for clarity.

Table 2. Fast inference algorithm of transfer learning for Gaussian process.

Algorithm 2: Fast inference algorithm for the TLGP model.
Input: y, x, L, x_{t+1}, σ_n²
Output: ȳ_{t+1}, var(ŷ_{t+1})
1: A = σ_n² L^{−1} + x^T x
2: b = x^T y
3: V = chol(A)
4: K_y = V^{−T} V^{−1} b
5: a_y = (1/σ_n²) (y − x K_y)
6: k = x L x_{t+1}^T
7: b′ = x^T k
8: K_v = V^{−T} V^{−1} b′
9: a_v = (1/σ_n²) (k − x K_v)
10: ȳ_{t+1} = k^T a_y
11: var(ŷ_{t+1}) = x_{t+1} L x_{t+1}^T − k^T a_v

4. Experiments

4.1. Data illustration and error evaluation

The power load data of 12 nearby cities located in the Jiangxi province of China are collected in this work. The sample time interval is 15 min, and our target is to predict the power load value of the next 15 min. The relatedness of the power loads of the neighboring cities is illustrated in Figs. 2 and 3. In Fig. 2 we plot part of the original data from four cities; in Fig. 3 the data sequences are resampled every 24 h to show the long-term trends. The similarities shown in these figures indicate that transfer learning can be a promising method on these data sets.

Since the absolute values of the power load data differ from one city to another, we use the normalized mean square error (NMSE) to evaluate the prediction results in this paper:

NMSE = ‖ŷ(t) − y(t)‖₂² / ‖y(t) − mean(y(t))‖₂²    (30)

where ŷ(t) is the predicted value of y(t).

4.2. Accuracy improvements due to knowledge transfer

In the experiments of this subsection, accuracy improvements due to knowledge transfer are presented and the effects of negative knowledge transfers are also demonstrated. There are 16,000 data points in each of the 12 power load data sequences of tasks T_1, T_2, …, T_12. The first 1000 points of each data sequence are used for source task selection and hyper-parameter learning. The remaining 15,000 data points are used to validate the prediction results. The other parameters in this set of experiments are d = 10, λ = 1, n = 1000 and M = 12. City T_3 is the target task, and the source tasks are chosen among the remaining 11 tasks.

The prediction values and true values of the power load are depicted in the top picture of Fig. 4, and the variances of the corresponding predictors are depicted in the bottom picture. The variance of the predictor is calculated from the joint Gaussian distribution and can be taken as the confidence of the prediction. The variance of the predictor is different from the prediction error, which is calculated from the difference between the true value and the predicted value using (30). However, predictors with large variances always indicate large prediction errors. In both pictures of Fig. 4, the blue lines are the prediction and variance curves without knowledge transfer; the green lines use all the tasks as source tasks, so negative transfers can be observed clearly in the green line of the bottom picture. The red lines are those of the proposed method, which has the lowest variances and avoids the negative knowledge transfers.

The prediction errors with different numbers of source tasks are depicted in Fig. 5, measured by average NMSE. For the target task T_3, Algorithm 1 in Table 1 gives the source task group {T_1, T_5, T_8} with m = 4. It can be seen


Fig. 2. Electric power load data of 4 nearby cities (sample interval: 15 min).

Fig. 3. Electric power load data of 4 nearby cities (sample interval: 24 h).

It can be seen from Fig. 5 that the choice made by the source selection procedure proposed in Section 3.1 gives a better result than the other choices: it improves the prediction errors through knowledge transfer while avoiding negative transfer.

4.3. Method comparison

In this subsection, the prediction errors and time costs of different power load prediction methods are compared, using the same data and parameter settings as the experiments of the previous subsection. In Table 3, the method of this paper is named TLGP. The MTGP (multi-task Gaussian process) method is that of [32]. The standard GP method refers to that in [19] with linear trend covariance functions. The AR (auto-regressive) model is estimated by the recursive least squares (RLS) algorithm in [33], and the orders of the AR models equal the parameter n in the Gaussian process model. The results of the PSO-SVR method in [17] and of the ANN based method in [18] are also listed.
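As an aside on the AR baseline, a recursive least squares update of the kind implemented in the toolbox cited as [33] can be sketched as follows. This is an illustrative sketch, not the paper's exact estimator: the order d, forgetting factor lam, initial covariance and the toy AR(2) data are all assumptions made here for demonstration.

```python
import numpy as np

def rls_ar(y, d=2, lam=1.0):
    """Estimate AR(d) coefficients recursively with forgetting factor lam.

    Illustrative RLS sketch; parameter names and initialization are
    assumptions, not taken from the paper.
    """
    theta = np.zeros(d)            # coefficient estimate
    P = 1e4 * np.eye(d)            # large initial covariance
    for t in range(d, len(y)):
        phi = y[t - d:t][::-1]     # regressor: [y_{t-1}, ..., y_{t-d}]
        k = P @ phi / (lam + phi @ P @ phi)          # gain vector
        theta = theta + k * (y[t] - phi @ theta)     # innovation update
        P = (P - np.outer(k, phi @ P)) / lam         # covariance update
    return theta

# Toy stable AR(2) process: y_t = 1.5 y_{t-1} - 0.7 y_{t-2} + e_t
rng = np.random.default_rng(0)
y = np.zeros(2000)
for t in range(2, 2000):
    y[t] = 1.5 * y[t - 1] - 0.7 * y[t - 2] + 0.01 * rng.standard_normal()
theta = rls_ar(y, d=2)   # converges toward [1.5, -0.7]
```

With lam = 1 the recursion reproduces ordinary least squares; a forgetting factor slightly below 1 would track slowly drifting load dynamics.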

As can be concluded from Table 3, the prediction accuracies are improved by the transferred knowledge, and the average computational time of the inference steps is also reduced. In addition, compared with the multi-task model in [32], the number of hyper-parameters in the proposed model is reduced by the flattened model structure and the smaller number of source tasks; the time cost of hyper-parameter estimation is therefore also reduced. In the experiments of Table 3, the hyper-parameter learning time of the proposed TLGP model is 18 s, while that of the MTGP model in [32] is 210 s. The accuracies of the proposed GP based method are also higher than those of both the PSO-SVR method in [17] and the ANN based method in [18], because the latent parameters of the non-parametric GP model are optimized in every prediction step. This is also the reason for the high computational time of the standard GP method; however, by using the algorithm in Section 3.3, the running time of the proposed method is reduced to be comparable to that of the PSO-SVR method in [17] and better than that of the ANN based method in [18].
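As a reference for what the standard GP row in Table 3 involves, the sketch below computes the GP predictive mean and variance in the textbook form of [19]: each prediction requires solving a linear system of order n, which is the cost the fast inference algorithm of Section 3.3 reduces. The RBF kernel, noise level and toy data here are illustrative assumptions, not the paper's exact linear-trend covariance.

```python
import numpy as np

def rbf(a, b, length=1.0):
    """Squared-exponential covariance between two 1-D input arrays (assumed kernel)."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / length) ** 2)

def gp_predict(x_tr, y_tr, x_te, noise=0.1):
    """Predictive mean and variance from the joint Gaussian distribution."""
    K = rbf(x_tr, x_tr) + noise ** 2 * np.eye(len(x_tr))   # (K + sigma^2 I)
    Ks = rbf(x_tr, x_te)
    # mean: k*^T (K + sigma^2 I)^{-1} y  -- one order-n solve per prediction
    mean = Ks.T @ np.linalg.solve(K, y_tr)
    # variance: k(x*,x*) - k*^T (K + sigma^2 I)^{-1} k*
    var = np.diag(rbf(x_te, x_te) - Ks.T @ np.linalg.solve(K, Ks))
    return mean, var

x = np.linspace(0.0, 6.0, 50)
mean, var = gp_predict(x, np.sin(x), np.array([1.5, 3.0]))
```

The variance returned here is the per-point predictive confidence plotted in the bottom panel of Fig. 4; the order-n solve inside `gp_predict` is what dominates the per-step running time of the standard GP in Table 3.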



Table 4
Comparison of the NMSE of prediction with and without transferred information from source tasks.

City name     NMSE (without transferred knowledge)   NMSE (with knowledge from selected tasks)   Improvement (%)
Nanchang      0.0103                                 0.0076                                      26.2
Jiujiang      0.0154                                 0.0113                                      26.6
Jingdezhen    0.0082                                 0.0073                                      11.0
Yichun        0.0154                                 0.0113                                      26.6
Shangrao      0.0523                                 0.0506                                      3.3
Pingxiang     0.0930                                 0.0844                                      9.2
Gandongbei    0.0479                                 0.0464                                      3.1
Fuzhou        0.0086                                 0.0063                                      26.7
Ganxi         0.1165                                 0.0986                                      15.4
Ji'an         0.0277                                 0.0221                                      20.2
Ganzhou       0.0174                                 0.0148                                      14.9
Yingtan       0.2040                                 0.1819                                      10.8

Fig. 4. Top panel: predicted and true values of the power load data (black line: true values; blue line: predictions from the data of the target task only; green line: predictions from the data of all tasks; red line: predictions from the data of the auto-selected tasks). Bottom panel: variances of the predictions (blue line: target task only; green line: all tasks; red line: auto-selected tasks).

Table 5
Number of source tasks for each given target task.

City name     Num. of source tasks (m−1)
Nanchang      7
Jiujiang      8
Jingdezhen    3
Yichun        8
Shangrao      2
Pingxiang     3
Gandongbei    9
Fuzhou        3
Ganxi         3
Ji'an         7
Ganzhou       11
Yingtan       5

Fig. 5. Prediction errors with different numbers of source tasks (y-axis: NMSE, ×10−3; x-axis: m).

Table 3
Comparison of prediction errors and time costs of different methods.

Method                       Prediction error   Running time per step
TLGP                         7.28×10−3          2.656×10−3
MTGP method in [32]          7.61×10−3          6.507×10−1
Standard GP                  8.19×10−3          5.801×10−1
Standard GP (with Alg. 2)    8.19×10−3          2.054×10−3
AR (with selected tasks)     2.35×10−2          0.987×10−4
AR (single task)             2.71×10−2          0.744×10−4
PSO-SVR in [17]              9.04×10−3          2.423×10−3
ANN based method in [18]     1.10×10−2          4.140×10−2

4.4. Experiments on a larger number of data sets

In this subsection, the proposed method is run M times to perform power load prediction for every city in the entire data set. The improvement percentages are listed in Table 4.
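The improvement percentages in Table 4 follow directly from the two NMSE columns, as (NMSE_without − NMSE_with) / NMSE_without × 100. As a quick check, three rows copied from Table 4 reproduce the reported figures:

```python
# Improvement percentage used in Table 4:
#   improvement = (NMSE_without - NMSE_with) / NMSE_without * 100
# The three city rows below are copied from Table 4.
nmse = {
    "Nanchang":   (0.0103, 0.0076),
    "Jiujiang":   (0.0154, 0.0113),
    "Jingdezhen": (0.0082, 0.0073),
}
improvement = {
    city: round(100.0 * (wo, w)[0] and 100.0 * (wo - w) / wo, 1)
    for city, (wo, w) in nmse.items()
} if False else {
    city: round(100.0 * (wo - w) / wo, 1) for city, (wo, w) in nmse.items()
}
print(improvement)  # → {'Nanchang': 26.2, 'Jiujiang': 26.6, 'Jingdezhen': 11.0}
```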

Fig. 6. An example of negative transfer (blue line: variance of the prediction from the data of the target task only; red line: variance of the prediction from the data of the target task and one negative source task).

The numbers of source tasks for each target task are listed in Table 5, from which we can see that only one prediction task selects all the other tasks as source tasks; the average number of source tasks is 5.75. Negative transfer problems thus widely exist in practice, and the source task selection algorithm of this paper works well for the power load time series prediction problem. The asymmetry of transfer learning can also be observed in this experiment: for example, task T3 (Jingdezhen) is one of the source tasks of task T11 (Ganzhou), but T11 is an unselected negative task for T3.



4.5. An example of negative transfer

At the end of the experimental part, an example of negative transfer is given to show what happens if the most unrelated task is selected as the source task. The simulation here is similar to that in Section 4.2. In the experiment of Fig. 6, task T4 is the target task, and according to the result of Algorithm 1, task T6 is its most unrelated task. The blue line is the variance of the prediction from the data of the target task T4 only, and the red line is the variance of the prediction from the data of T4 together with the negative source task T6. It can clearly be seen that the contribution of T6 to T4 is negative.

5. Conclusion

In this work, a transfer learning model based on Gaussian process is proposed to solve the short term power load prediction problem, together with a source task selection algorithm and a fast inference algorithm for this model. The contributions of this work are threefold. First, the prediction accuracies are improved by using knowledge transferred from nearby cities. Second, negative knowledge transfers are avoided by correct source task selection. Third, the time complexity of the prediction inferences is reduced by converting the matrix inverse operations to matrices of smaller order. Both the better accuracy and the faster computation of the power load forecast are of great importance to the operation of smart grids.

Acknowledgement

This work was supported by the Funds NSFC61171121.

References

[1] D.K. Ranaweera, G. Karady, R. Farmer, Economic impact analysis of load forecasting, IEEE Trans. Power Syst. 12 (3) (1997) 1388–1392.
[2] N. Amjady, Short-term hourly load forecasting using time-series modeling with peak load estimation capability, IEEE Trans. Power Syst. 16 (4) (2001) 798–805.
[3] G. Mbamalu, M. El-Hawary, Load forecasting via suboptimal seasonal autoregressive models and iteratively reweighted least-squares estimation, IEEE Trans. Power Syst. 8 (1) (1993) 343–348.
[4] S. Huang, Short-term load forecasting using threshold autoregressive models, IEE Proc.—Gener. Transm. Distrib. 144 (5) (1997) 477–481.
[5] H. Yang, C. Huang, A new short-term load forecasting approach using self-organizing fuzzy ARMAX models, IEEE Trans. Power Syst. 13 (1998) 217–225.
[6] F. Meslier, New advances in short-term load forecasting using Box and Jenkins approach, IEEE Trans. Power Appar. Syst. 97 (4) (1978) 1004.
[7] M. Hagan, S. Behr, The time-series approach to short-term load forecasting, IEEE Trans. Power Syst. 2 (3) (1987) 785–791.
[8] D. Infield, D. Hill, Optimal smoothing for trend removal in short term electricity demand forecasting, IEEE Trans. Power Syst. 13 (3) (1998) 1115–1120.
[9] S. Sargunaraj, D. Gupta, S. Devi, Short-term load forecasting for demand side management, IEE Proc.—Gener. Transm. Distrib. 144 (1) (1997) 68–74.
[10] H. Malki, N. Karayiannis, M. Balasubramanian, Short-term electric power load forecasting using feedforward neural networks, Expert Syst. 21 (3) (2004) 157–167.
[11] T. Yalcinoz, U. Eminoglu, Short term and medium term power distribution load forecasting by neural networks, Energy Convers. Manag. 46 (9–10) (2005) 1393–1405.
[12] F. Pan, H. Cheng, J. Yang, C. Zhang, Z. Pan, Power system short-term load forecasting based on support vector machines, Power Syst. Technol. 28 (2004) 39–42.
[13] Y. Zhang, Z. Yan, G. Luo, A new recursive kernel regression algorithm and its application in ultra-short time power load forecasting, in: Proceedings of the IFAC 18th World Congress, 2011, pp. 12177–12182.
[14] S.S.S. Hosseini, A.H. Gandomi, Short-term load forecasting of power systems by gene expression programming, Neural Comput. Appl. 21 (2, SI) (2012) 377–389.
[15] N. Chen, Z. Qian, X. Meng, I.T. Nabney, Short-term wind power forecasting using Gaussian processes, in: Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence, 2013, pp. 2790–2796.
[16] N. Amjady, F. Keynia, A new neural network approach to short term load forecasting of electrical power systems, Energies 4 (3) (2011) 488–503.
[17] P. Duan, K. Xie, T. Guo, X. Huang, Short-term load forecasting for electric power systems using the PSO-SVR and FCM clustering techniques, Energies 4 (1) (2011) 173–184.
[18] H. Quan, D. Srinivasan, A. Khosravi, Short-term load and wind power forecasting using neural network-based prediction intervals, IEEE Trans. Neural Netw. Learn. Syst. 25 (2) (2014) 303–315.
[19] C.E. Rasmussen, Gaussian processes in machine learning, in: Advanced Lectures on Machine Learning, Lecture Notes in Computer Science, vol. 3176, Springer, Berlin, Heidelberg, 2004, pp. 63–71.
[20] J. Fan, J. McDonald, A real-time implementation of short-term load forecasting for distribution power systems, IEEE Trans. Power Syst. 9 (2) (1994) 988–994.
[21] S. Ruzic, A. Vukovic, N. Nikolic, Weather sensitive method for short term load forecasting in electric power utility of Serbia, IEEE Trans. Power Syst. 18 (4) (2003) 1581–1586.
[22] S.J. Pan, Q. Yang, A survey on transfer learning, IEEE Trans. Knowl. Data Eng. 22 (10) (2010) 1345–1359.
[23] R. Collobert, J. Weston, A unified architecture for natural language processing: deep neural networks with multitask learning, in: Proceedings of the 25th International Conference on Machine Learning, ACM, Helsinki, Finland, 2008, pp. 160–167.
[24] S.J. Pan, V.W. Zheng, Q. Yang, D.H. Hu, Transfer learning for WiFi-based indoor localization, in: AAAI 2008 Workshop on Transfer Learning for Complex Tasks, 2008.
[25] Y. Zhang, G. Luo, F. Pu, Power load forecasting based on multi-task Gaussian process, in: Proceedings of the IFAC 19th World Congress, 2014, pp. 3651–3656.
[26] M.T. Rosenstein, Z. Marx, L. Kaelbling, T. Dietterich, To transfer or not to transfer, in: NIPS 2005 Workshop on Inductive Transfer: 10 Years Later, 2005.
[27] R. Murray-Smith, A. Girard, Gaussian process priors with ARMA noise models, in: Irish Signals and Systems Conference, Maynooth, 2001, pp. 147–152.
[28] A. Girard, J.Q. Candela, R. Murray-Smith, C.E. Rasmussen, Gaussian process priors with uncertain inputs—application to multiple-step ahead time series forecasting, in: Advances in Neural Information Processing Systems, 2003, pp. 529–536.
[29] S. Brahim-Belhouari, A. Bermak, Gaussian process for nonstationary time series prediction, Comput. Stat. Data Anal. 47 (4) (2004) 705–712.
[30] C. Li, X. Li, R. Zhao, J. Li, X. Liu, A novel algorithm of selecting similar days for short-term power load forecasting, Autom. Electric Power Syst. 32 (2008) 69–73.
[31] K.P. Burnham, D.R. Anderson, Multimodel inference: understanding AIC and BIC in model selection, Sociol. Methods Res. 33 (2) (2004) 261–304.
[32] E. Bonilla, K.M. Chai, C. Williams, Multi-task Gaussian process prediction, in: Advances in Neural Information Processing Systems, 2008, pp. 153–160.
[33] L. Ljung, System Identification Toolbox for Use with MATLAB, The MathWorks, Inc., Natick, MA, USA, 2007.
