Neurocomputing 397 (2020) 393–403
https://doi.org/10.1016/j.neucom.2019.08.108
Wind speed forecasting using deep neural network with feature selection

Xiangjie Liu a, Hao Zhang a, Xiaobing Kong a,∗, Kwang Y. Lee b

a The State Key Laboratory of Alternate Electrical Power System with Renewable Energy Sources, North China Electric Power University, Beijing 102206, PR China
b The Department of Electrical and Computer Engineering, Baylor University, Waco, TX 76798-7356, USA
∗ Corresponding author. E-mail address: [email protected] (X. Kong).


Article history: Received 15 May 2019; Revised 12 August 2019; Accepted 21 August 2019; Available online 1 April 2020
Keywords: Wind speed forecasting; Deep neural network; Mutual information; Stacked auto-encoder; Denoising; Long short-term memory network

Abstract

With the rapid growth of wind power penetration into modern power grids, wind speed forecasting (WSF) becomes an increasingly important task in the planning and operation of electric power and energy systems. However, WSF is quite challenging due to its highly varying and complex features. In this paper, a novel hybrid deep neural network forecasting method is constructed. A feature selection method based on mutual information is developed for the WSF problem. With the real-time big data from the wind farm running log, the deep neural network model for WSF is established using a stacked denoising auto-encoder and a long short-term memory network. The effectiveness of the deep neural network is evaluated by 10-minutes-ahead WSF. Compared with the traditional multi-layer perceptron network, the conventional long short-term memory network and the stacked auto-encoder, the resulting deep neural network significantly improves the forecasting accuracy. © 2020 Elsevier B.V. All rights reserved.

1. Introduction

The demand for electricity is growing rapidly as a result of social, economic and industrial development. Nowadays, due to environmental concerns, wind energy has become one of the fastest developing energy resources worldwide. The overall capacity of wind turbines installed worldwide reached 600 GW by the end of 2018, which covers 6% of the global electricity demand [1]. Meanwhile, the increasing wind power penetration into the power grids can seriously affect the safe operation of the power system and the power quality, due to the intermittence and randomness of wind power. As the produced wind power depends on wind speed, the design and implementation of an efficient wind speed forecasting (WSF) method can therefore improve the stability and safety of a power system with high penetration of wind generation. However, it is tough to forecast the wind speed precisely, due to its nonlinear and non-stationary characteristics, as well as its highly complex interactions with various meteorological factors.

WSF methods can be roughly categorized into two groups: physical methods and statistical methods. Numerical weather prediction is a typical physical method, which forecasts the future weather using a series of thermodynamics and fluid dynamics




equations [2]. However, the physical models usually require abundant background knowledge, e.g., aerodynamics and fluid mechanics, so they may not be suitable for real-time WSF in electric power systems [3]. In the statistical group, time series models are trained on the historical data, e.g., the persistence method [4], the auto-regression model [5], the autoregressive moving average (ARMA) model [6], the Markov chain model [7], the Kalman filter [8], intelligent methods [9–11], and particularly the artificial neural network (ANN) [12–14]. ANN offers an effective framework for WSF based on its ability to learn complex nonlinear functional mappings. In [12], the radial basis function neural network is adopted, showing a high quality of prediction. Paper [13] designs a back propagation (BP) neural network based on particle swarm optimization, which achieves much better forecasting performance than the ARMA model. Paper [14] proposes a reduced support vector machine model incorporating a feature selection technique.

Power system management has undergone great changes during the past decade, with much more wind power participating in the automatic generation control of the power grid [15], leading to high demands on wind power control system design. Generally, accurate WSF is the precondition for constituting advanced control strategies in modern power systems, e.g., model predictive control [16,17]. The big data generated in a wind farm is generally characterized by massiveness, multi-source heterogeneity and high dimension. Traditional ANNs with shallow architectures have


low efficiency in digging out and extracting effective information from big data, so they often suffer from uncontrolled convergence speed and local optima. Meanwhile, it becomes more difficult to optimize the parameters of an ANN as the number of hidden layers and the training sample size increase.

Deep neural network (DNN) is an effective method for dealing with the big data modeling problem [18–20]. DNN is a generative model in which the lower layers represent the low-level features of the inputs while the upper layers extract the high-level features that explain the input samples. Compared with shallow architectures, DNN is able to extract the compact, hierarchical and inherent abstract features in the original data through its deep architecture, and thus can be used for establishing accurate mapping relationships among various patterns. In practice, DNN has been successfully applied to fault diagnosis [21], electricity load forecasting [22], wind power prediction [23], soft sensing modeling [24], etc. As one of the commonly used DNN frameworks, the stacked auto-encoder (SAE) is constructed by stacking several shallow auto-encoders (AEs), which learn features by first encoding the input data and then reconstructing it. SAE has been successfully applied to WSF [25] by using the historical wind speed data to train the network, where its distinguished feature representation ability has been verified.

In the WSF problem, apart from the historical wind speed, many other meteorological factors, e.g., the temperature, the air pressure and the wind direction, can also affect the future wind speed. On the one hand, it is necessary to take as many factors as possible into account to improve the accuracy of WSF. On the other hand, taking even one more factor into account increases the computational burden of the network. Furthermore, irrelevant and redundant factors may obscure the role of important factors. Therefore, it is quite important to excavate and extract effective features from these massive factors, to obtain a trade-off between the computational burden and the network accuracy. Mutual information (MI) [26] is an effective feature selection method for DNN, due to its ability to measure any type of statistical dependency between variables. By using MI for feature selection in WSF, the joint relevance and redundancy of the other features can be taken into account.

Notice that the wind speed is not only nonlinear, but also correlated in time. This motivates us to use the information contained in the historical data for WSF. The long short-term memory (LSTM) network [27,28] is a typical DNN that is suitable for making predictions based on time series data. By maintaining a memory cell in the hidden layer, the LSTM network can pass information from the previous state to the current state. This architecture can keep all the available input information up to the current time. With this property, the LSTM network has been widely used in early fault detection [29], solar irradiance prediction [30], and traffic forecasting [31].

The main purpose of this paper is to present a novel DNN WSF model based on the SAE and the LSTM network that exploits the statistical relationships among the historical data. First, the most appropriate input variables of this model are determined using an MI-based feature selection method. Then, an SAE with an appropriate structure is used to capture the intrinsic features contained in the original data.
To improve the robustness of feature learning, denoising coding [32] is also embedded into the SAE by randomly masking noise on its input. Finally, the LSTM network is adopted to generate the final output of the model.

The remainder of the paper is organized as follows. Section 2 describes the WSF problem. Section 3 presents the feature selection method based on MI. The SDAE-LSTM network for WSF is given in Sections 4 and 5. The WSF simulation results using the real-time big data are shown in Section 6. Finally, a conclusion is drawn in Section 7.

Fig. 1. The wind data pattern selected for training.

2. WSF problem description

In this work, the wind speed time series under consideration is denoted as $V = \{v(1), v(2), \cdots, v(t-1), v(t), v(t+1), \cdots\}$, where $v(t)$ is the wind speed at time $t$. To forecast the wind speed $v(t+1)$ using the historical wind speed data, an explicit or implicit function must be found as follows:

$$v(t+1) = f\left(v(t), v(t-1), \cdots, v(t-N+1); \theta\right) \tag{1}$$

where $N$ is the time lag and $\theta$ is the parameter vector determining the WSF model. To estimate the error between the forecast wind speed and the real wind speed, the loss function is defined as:

$$\min_{\theta} \sum_{i=1}^{n} \left( v(t+i) - \hat{v}(t+i) \right)^2 \tag{2}$$

where $\hat{v}(t+i)$ is the forecast wind speed at sampling instant $t+i$. To obtain the wind speed forecasting model (1), it is necessary to find the optimal parameters $\theta$ by minimizing the loss function.

In the modeling process, all the data used for WSF come from a wind farm located in Neimenggu, Northwest China. The wind power system collects data from hundreds of channels via data acquisition modules; however, only the historical wind speed and meteorological data are used for WSF. The dataset was carefully selected from the huge amount of data logged at the wind farm, during a period in which the wind speed varies frequently. A set of 35,000 continuous patterns with 10-min sampling, spanning January 1st to September 4th, 2015, is selected for training, as shown in Fig. 1. The maximum and minimum wind speeds are 24.23 m/s and 0 m/s, respectively, with an average wind speed of 5.40 m/s. Clearly, this data pattern covers an extremely large variation range in wind speed, so that it can cover the whole range of working conditions of the wind farm. Another 4,000 patterns from four different seasons (1,000 patterns for each season) in 2014 are used as the validation sets, as shown in Fig. 2. In Fig. 2, S1, S2, S3, S4 denote the validation sets from April 12th–20th in spring, July 6th–14th in summer, October 15th–23rd in autumn and January 15th–23rd in winter, respectively. Due to the different climatic features of the different seasons, these data sets have remarkably different fluctuation characteristics, and can therefore comprehensively and systematically evaluate the effectiveness and practicability of the proposed model.
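To make the formulation above concrete, the following is a minimal NumPy sketch (not the authors' code) of how the lagged input windows of Eq. (1) and the one-step-ahead targets can be built from a logged series; the random series is a stand-in for the wind farm data.

```python
import numpy as np

def make_windows(v, n_lags):
    """Build (X, y) pairs per Eq. (1): each input row is
    [v(t-N+1), ..., v(t)] and the target is v(t+1)."""
    X, y = [], []
    for t in range(n_lags - 1, len(v) - 1):
        X.append(v[t - n_lags + 1 : t + 1])
        y.append(v[t + 1])
    return np.array(X), np.array(y)

# Stand-in for the 35,000-pattern, 10-min-sampled training series
v = np.random.rand(35000) * 24.23
X, y = make_windows(v, n_lags=30)   # time lag N = 30, as in Section 3
print(X.shape, y.shape)             # (34970, 30) (34970,)
```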


Fig. 2. The wind data pattern selected for validating.

3. Mutual information for feature selection

Besides the historical wind speed, the topography and meteorology are the main factors affecting the wind speed, especially the temperature, the wind direction and the air pressure. Taking all the factors into account, including the uncorrelated and redundant ones, may mask the role of important factors and hugely increase the amount of data processing, leading to a complex network structure. Therefore, it is necessary to excavate and extract effective features from the massive related factors before establishing an accurate and reliable DNN model for WSF.

Due to its advantage of measuring any type of statistical dependency between variables, the MI method has been widely used in meteorological time series modeling [33], speech recognition [34], etc., demonstrating its excellent ability in feature selection. As a typical multi-variable nonlinear time series, the distribution of wind speed is obviously different from those of the temperature, the wind direction and the air pressure. Thus, it is appropriate to adopt the MI method to select the effective features for WSF.

To forecast the wind speed $v(t+1)$, the influencing factors, such as the temperature $T(t)$, the air pressure $P(t)$ and the wind direction $D(t)$, must be taken into account. Due to the continuity of wind speed, the wind speed at time instant $t+1$ also depends on the previous wind speed values $v(t-i+1)$ $(i = 1, \cdots, N)$, where $N$ is set to 30. Thus, the previous wind speeds $v(t-i+1)$, the temperature $T(t)$, the air pressure $P(t)$ and the wind direction $D(t)$ are taken into account in forecasting the wind speed $v(t+1)$. The samples for feature selection are shown in Fig. 1 and Fig. 3, where the sample size $n$ is set to 35,000. The feature matrix is defined as:

$$F = \begin{bmatrix} F_1 & F_2 & F_3 & F_4 & \cdots & F_{33} \end{bmatrix} = \begin{bmatrix} T(t)_1 & P(t)_1 & D(t)_1 & v(t)_1 & \cdots & v(t-29)_1 \\ \vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\ T(t)_n & P(t)_n & D(t)_n & v(t)_n & \cdots & v(t-29)_n \end{bmatrix} \tag{3}$$

where $F_k$ $(k = 1, \cdots, 33)$ corresponds to the $k$-th feature. The MI between the future wind speed $V(t+1) = [v(t+1)_1 \cdots v(t+1)_n]^T$ and each feature vector can be defined as:

$$I(F_k; V(t+1)) = \iint p_{joint}(F_k(i), v(t+1)_j) \log \frac{p_{joint}(F_k(i), v(t+1)_j)}{p(F_k(i))\, p(v(t+1)_j)} \, dF_k(i)\, dv(t+1)_j \tag{4}$$

Fig. 3. The chosen data for feature selection.

where $p$ is the probability density of a single variable, and $p_{joint}$ is the joint probability density between the two variables. The value of MI is equal to zero if the two vectors are mutually independent. Due to the complexity of the definition in Eq. (4), the MI can be calculated using the following equations:

$$I(F_k; V(t+1)) = H(F_k) + H(V(t+1)) - H(F_k, V(t+1)) \tag{5}$$

where $H(F_k)$ and $H(V(t+1))$ are the entropies of the vectors $F_k$ and $V(t+1)$, respectively, and $H(F_k, V(t+1))$ is the joint entropy between the two vectors; they are defined as:

$$H(F_k) = -\int p(F_k(i)) \log p(F_k(i)) \, dF_k(i) \tag{6}$$

$$H(V(t+1)) = -\int p(v(t+1)_j) \log p(v(t+1)_j) \, dv(t+1)_j \tag{7}$$

$$H(F_k, V(t+1)) = -\iint p_{joint}(F_k(i), v(t+1)_j) \log p_{joint}(F_k(i), v(t+1)_j) \, dF_k(i)\, dv(t+1)_j \tag{8}$$

Since it is difficult to calculate the entropies directly from the definitions in Eqs. (6)–(8), MI estimation methods are widely used in practical applications. Compared with the kernel method [35] and the k-nearest neighbor method [36], the histogram method [34] is adopted here to estimate MI, owing to its simplicity and computational efficiency.

From Fig. 4, it is obvious that the MI value decreases as the time lag in the historical wind speed increases. The MI value between the forecast wind speed $V(t+1)$ and the historical wind speed $V(t-i+1)$ is small enough to be ignored when $i > 18$. Similarly, the wind direction is also omitted. Thus, the features are selected as $x = \{v(t-17), \cdots, v(t-1), v(t), T(t), P(t)\}$ for WSF.
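As one illustration of the histogram approach, below is a hedged NumPy sketch that evaluates Eq. (5) from a 2-D histogram; the bin count of 32 is an assumption, not a value reported in the paper.

```python
import numpy as np

def mutual_information(x, y, bins=32):
    """Histogram estimate of I(X;Y) = H(X) + H(Y) - H(X,Y), Eq. (5)."""
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = pxy / pxy.sum()                        # joint probability table
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)    # marginals

    def entropy(p):
        p = p[p > 0]                             # ignore empty bins
        return -np.sum(p * np.log(p))

    return entropy(px) + entropy(py) - entropy(pxy.ravel())

# Rank each candidate column of the feature matrix F against v(t+1),
# mirroring Fig. 4 (the arrays F and v_next are assumed available):
# scores = [mutual_information(F[:, k], v_next) for k in range(F.shape[1])]
```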


Fig. 4. Mutual information value of different variables.

Fig. 5. The architecture of SAE.

Fig. 6. Manifold learning perspective.

Remark 1. By using MI feature selection, the impact of the different factors that contribute to the WSF model can be evaluated quantitatively, so as to optimally determine the input vector of the DNN.

4. Stacked denoising auto-encoder (SDAE)

The SAE is a combination of several AEs and a regression layer, as shown in Fig. 5. The AE is a single-hidden-layer ANN consisting of two parts: an encoder and a decoder. The input layer and the hidden layer form the encoder, which compresses the high-dimensional input data into a low-dimensional representation. Meanwhile, the hidden layer and the output layer form the decoder, which reconstructs the input data from the corresponding hidden representation. In the SAE model, the $k$-th hidden layer is denoted as $h^k$, and the AE associated with $h^{k-1}$ and $h^k$ $(k = 1, 2, \cdots, l)$ is denoted as AE$_k$.

4.1. Denoising auto-encoder

Noise widely exists in sampled data, because of the accuracy error of the measurement sensors and external electromagnetic interference. The pseudo-information carried by such noise can affect the accuracy of WSF to a large extent. Furthermore, it is hard to deal with the noise manually because of the large number of samples to be trained in practice. Thus, it is necessary to adopt a new mechanism to achieve feature representations with better robustness. In [32], Vincent et al. proposed a modified version of the AE named the denoising auto-encoder (DAE). By adding partial corruption to the input pattern and then reconstructing the original input from the corrupted version, the obtained feature representation can be made more robust.

The learning process of the DAE can be explained from the manifold learning perspective, as shown in Fig. 6. The DAE can be seen as a way to define and learn a manifold. In the training process shown in Fig. 6, the original data $x$ concentrate near a nonlinear manifold. First, $x$ is corrupted to obtain the noisy version $\tilde{x}$, which is more likely to appear outside and farther away from the manifold. In the denoising training, the model attempts to map the corrupted $\tilde{x}$ back to the uncorrupted $x$. In the wind farm, the data collected from the actual site contain a certain degree of noise. Successful denoising training means that the model can map the noisy data back to their uncorrupted clean version, i.e., within a small area near the manifold.

The structure of DAE$_k$ is shown in Fig. 7. The partially damaged version $\tilde{h}^{k-1}$ can be obtained through the stochastic mapping shown in Eq. (9):



$$\tilde{h}^{k-1} \sim q_D(\tilde{h}^{k-1} \mid h^{k-1}) \tag{9}$$


Fig. 7. Denoising auto-encoder.

where $q_D(\cdot)$ is a stochastic process that randomly sets a fixed fraction of the elements of $h^{k-1}$ to zero, while the others remain unchanged. After obtaining $\tilde{h}^{k-1}$, the DAE operates just like the basic AE: the hidden representation $h^k$ is obtained through the encoder, Eq. (10), and then mapped back to a reconstructed vector $z^k$ by the decoder, Eq. (11):

$$h^k = f(W_1^k \tilde{h}^{k-1} + b_1^k) \tag{10}$$

$$z^k = g(W_2^k h^k + b_2^k) \tag{11}$$

$$f(x) = g(x) = 1/(1 + e^{-x}) \tag{12}$$

where $W_1^k$, $b_1^k$ are the weight matrix and bias term of the encoder, $W_2^k$, $b_2^k$ are the weight matrix and bias term of the decoder, and $\theta^k = \{W_1^k, b_1^k, W_2^k, b_2^k\}$ is the parameter set of DAE$_k$. The parameter set $\theta^k$ can be optimized by minimizing the reconstruction error:

$$J_{DAE}(\theta^k) = \frac{1}{m} \sum_{i=1}^{m} L(h_i^{k-1}, z_i^k) \tag{13}$$

where $m$ is the sample size and $L$ represents the loss function, which is often chosen as the mean square error (MSE) shown in Eq. (14):

$$L_{DAE}(h_i^{k-1}, z_i^k) = \frac{1}{2} \left\| h_i^{k-1} - z_i^k \right\|^2 \tag{14}$$

Remark 2. By corrupting the input pattern and reconstructing the original input from the corrupted version, the obtained SDAE is less affected by the background noise in the original wind speed data, so that the accuracy of the WSF model can be improved.

4.2. Training of SDAE

An SDAE can be built by stacking several DAEs. The training of the SDAE includes two steps: an unsupervised layer-wise pre-training step and a supervised fine-tuning step, as shown in Fig. 8. In Fig. 8(a), with the original training data, the DAE at the bottom layer is first trained by minimizing the reconstruction error in (13) using the gradient descent method. Then, the generated hidden representations are used as the input for training the higher-level DAE, and so on. After the layer-wise pre-training, all the obtained hidden layers are

Fig. 8. (a) Unsupervised layer-wise pre-training for SDAE. (b) Supervised finetuning for SDAE.

Fig. 9. Structure of LSTM network.

stacked, and the regression layer is added on top of the SDAE to generate the final output, as shown in Fig. 8(b). The parameters of the whole SDAE network can then be fine-tuned in a supervised way using the gradient descent method. The pseudocode for training the SDAE model is given in Algorithm 1.

5. SDAE-LSTM network for WSF

As mentioned in Section 3, MI is used to select the major features as $x = \{v(t-17), \cdots, v(t-1), v(t), T(t), P(t)\}$; i.e., the previous wind speeds $v(t-i+1)$, $i > 18$, are not taken as inputs to the DNN due to their small MI values. The purpose of MI selection is that it performs a trade-off between the computational burden and the network accuracy. However, these earlier wind speed data can still be taken into account by utilizing a recurrent neural network [37,38]. As a kind of deep recurrent neural network, the LSTM network [27] can pass information from the previous state to the current state through the connections between hidden neurons, as shown in Fig. 9. Thus, this architecture can utilize all the available input information up to the current time. By incorporating the LSTM network into the model, the previous wind speeds $v(t-i+1)$, $i > 18$, can be taken into account, while the computational burden of the network does not increase.


Fig. 11. Structure of SDAE-LSTM model.

Fig. 10. Structure of LSTM cell.

Therefore, this paper proposes a WSF model based on the SDAE and the LSTM network. In this model, the SDAE with an appropriate structure is adopted to capture the intrinsic features contained in the original wind speed data, and then the LSTM network is used to generate the final output along the time axis.

5.1. Long short-term memory network

The architecture of the LSTM network is shown in Fig. 9, which mainly includes three layers: one input layer $h^l$, one hidden layer $h^{l+1}$ and one output layer $y$. The LSTM cell is the core of the LSTM network. With the LSTM cell shown in Fig. 10, information can be removed from or added to the cell state by the forget gate, input gate and output gate. In the LSTM cell, the input, the hidden layer state and the cell state at time $t$ are denoted as $h_t^l$, $h_t^{l+1}$ and $c_t$, respectively. The updating rule of the LSTM cell can be summarized as:

$$f_t = \sigma(W_{xf} h_t^l + W_{hf} h_{t-1}^{l+1} + b_f) \tag{15}$$

$$i_t = \sigma(W_{xi} h_t^l + W_{hi} h_{t-1}^{l+1} + b_i) \tag{16}$$

$$c_t = f_t \times c_{t-1} + i_t \times \tanh(W_{xc} h_t^l + W_{hc} h_{t-1}^{l+1} + b_c) \tag{17}$$

$$o_t = \sigma(W_{xo} h_t^l + W_{ho} h_{t-1}^{l+1} + b_o) \tag{18}$$

$$h_t^{l+1} = o_t \times \tanh(c_t) \tag{19}$$

where $f_t$, $i_t$ and $o_t$ are the forget gate, input gate and output gate, respectively; $W_{xf}$, $W_{hf}$, $W_{xi}$, $W_{hi}$, $W_{xc}$, $W_{hc}$, $W_{xo}$ and $W_{ho}$ are the weight matrices; and $b_f$, $b_i$, $b_c$ and $b_o$ are the bias vectors. $\sigma(\cdot)$ represents the following function:

$$\sigma(x) = 1/(1 + e^{-x}) \tag{20}$$

After obtaining the hidden layer, a fully connected layer performs a linear transformation on the hidden state $h_t^{l+1}$ to obtain the output of the LSTM network:

$$y_t = W_{yh} h_t^{l+1} + b_y \tag{21}$$

where $W_{yh}$ and $b_y$ are the weights and bias of the output layer.

Remark 3. The LSTM network is conditioned on the SDAE-based features of the original wind speed data. Therefore, the hybrid SDAE-LSTM model can effectively capture the dynamic characteristics of sequential wind speed data with long-range patterns.
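A compact NumPy sketch of one cell update per Eqs. (15)–(21) follows; the dictionary `p` of weights and biases is a hypothetical container, and row-vector shapes are assumed.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))                 # sigma(.) of Eq. (20)

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM cell update, Eqs. (15)-(19), plus the output layer Eq. (21)."""
    f = sigmoid(x_t @ p["Wxf"] + h_prev @ p["Whf"] + p["bf"])   # forget gate
    i = sigmoid(x_t @ p["Wxi"] + h_prev @ p["Whi"] + p["bi"])   # input gate
    c = f * c_prev + i * np.tanh(x_t @ p["Wxc"] + h_prev @ p["Whc"] + p["bc"])
    o = sigmoid(x_t @ p["Wxo"] + h_prev @ p["Who"] + p["bo"])   # output gate
    h = o * np.tanh(c)                                          # Eq. (19)
    y = h @ p["Wyh"] + p["by"]                                  # Eq. (21)
    return h, c, y
```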

5.2. Model structure and its training algorithm

The structure of the 20-input, 1-output SDAE-LSTM model is shown in Fig. 11. The inputs are $X = \{v(t-17), \cdots, v(t-1), v(t), T(t), P(t)\}$, and the output is the forecast wind speed. The model consists of a feature extraction part and a forecasting part. The feature extraction part, constructed with the SDAE, obtains the intrinsic features from the input data via training, while the forecasting part, built with the LSTM network, is responsible for outputting the expected wind speed along the time axis.

The training of the SDAE-LSTM model consists of two phases. In the first phase, DAE$_1$ is trained with the input data to obtain its hidden representations, which are then used as the input to train the next DAE; in this way, multiple DAEs are stacked hierarchically. In the second phase, the parameters of the SDAE part are initialized with the pre-trained DAEs, while the initial parameters of the LSTM part are set to small random values; the whole SDAE-LSTM network is then fine-tuned using the gradient descent algorithm. The pseudocode for training the SDAE-LSTM model is given in Algorithm 2.

6. Experimental results and comparative analysis

6.1. Experimental settings

With the input of the model decided by the MI method, the optimal structure of the SAE, i.e., the number of AEs and the number of hidden units in each AE, needs to be determined. Experiments are repeated with the number of AEs ranging from 1 to 10 and the number of units in the hidden layers drawn from $\varphi = [50, 45, \cdots, 25, 20]$. The optimal model structure is selected from the different configurations according to the root mean squared error (RMSE):

$$\mathrm{RMSE} = \sqrt{\frac{1}{K} \sum_{k=1}^{K} \left( v(k+1) - \hat{v}(k+1) \right)^2} \tag{22}$$

where $K$ is the pattern size. The relationship between the number of AEs and the RMSE of the learning network is shown in Fig. 12. The network is unable to generalize well when the number of AEs is too small, because of the insufficient number of tunable parameters in the model. The performance of the network gradually improves as the number of AEs increases, especially when the number reaches 7. However, when the number of AEs increases further, the performance begins to degrade, as more AEs lead to more complex structures that are prone to overfitting. Moreover, the vanishing gradient problem also has a negative impact on the fine-tuning of the DNN as the number of AEs increases. Therefore, the number of AEs is set to 7. In sketch form, this structure search is a simple loop, as shown below.
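In the following sketch, train_and_validate is a hypothetical stand-in for the full train-and-evaluate cycle, not a function from the paper; it returns a placeholder value here only so the loop runs.

```python
import numpy as np

rng = np.random.default_rng(1)

def train_and_validate(num_aes, units):
    """Hypothetical helper: build the network with `num_aes` stacked DAEs,
    pre-train, fine-tune, and return the validation RMSE. The placeholder
    return value stands in for that whole cycle."""
    return rng.random()

phi = [50, 45, 40, 35, 30, 25, 20]     # candidate hidden-layer widths

best_depth, best_rmse = None, np.inf
for depth in range(1, 11):             # number of AEs, 1 to 10
    rmse = train_and_validate(num_aes=depth, units=phi)
    if rmse < best_rmse:
        best_depth, best_rmse = depth, rmse
# Per Fig. 12, the paper's search bottoms out at 7 AEs.
```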


Table 1. Result of the four models of WSF using four different validation samples.

Data set          Index      MLP       SAE       LSTM      SDAE-LSTM
Training sample   MAPE (%)   14.8372   9.3846    8.9924    5.6111
                  MAE        0.7183    0.4385    0.4525    0.2356
                  RMSE       0.9824    0.5864    0.5294    0.3387
S1                MAPE (%)   17.0891   12.8172   11.9045   7.1323
                  MAE        0.7909    0.5854    0.5482    0.3066
                  RMSE       1.0540    0.7127    0.6849    0.3880
S2                MAPE (%)   18.1620   11.4243   11.0359   7.1796
                  MAE        0.7758    0.4936    0.4859    0.2762
                  RMSE       1.1312    0.7090    0.7149    0.3914
S3                MAPE (%)   13.9420   10.0631   9.7301    7.8646
                  MAE        0.7432    0.5372    0.5108    0.3911
                  RMSE       1.1113    0.6985    0.6576    0.5460
S4                MAPE (%)   20.4112   12.7063   12.2974   10.6581
                  MAE        0.6868    0.4281    0.4205    0.2956
                  RMSE       0.9262    0.5982    0.5302    0.3863

Fig. 12. The relationship between number of AEs and RMSE value.

6.2. Forecasting results

In this section, different cases are studied to validate the effectiveness of the proposed SDAE-LSTM model by comparing it with other models. To evaluate the performance of the WSF models, two further error measures are defined: the mean absolute error (MAE) and the mean absolute percentage error (MAPE):

$$\mathrm{MAE} = \frac{1}{K} \sum_{k=1}^{K} \left| v(k+1) - \hat{v}(k+1) \right| \tag{23}$$

$$\mathrm{MAPE} = \frac{1}{K} \sum_{k=1}^{K} \left| \frac{v(k+1) - \hat{v}(k+1)}{v(k+1)} \right| \times 100\% \tag{24}$$

Fig. 13. RMSE of SDAE with different denoising level.
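For reference, Eqs. (22)–(24) translate directly into the short NumPy functions below (a sketch; `v` and `v_hat` are aligned arrays of real and forecast speeds).

```python
import numpy as np

def rmse(v, v_hat):
    return np.sqrt(np.mean((v - v_hat) ** 2))          # Eq. (22)

def mae(v, v_hat):
    return np.mean(np.abs(v - v_hat))                  # Eq. (23)

def mape(v, v_hat):
    return np.mean(np.abs((v - v_hat) / v)) * 100.0    # Eq. (24)
```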

For the corruption noise in the SDAE, the level of masking noise is chosen from [0, 0.1, 0.2, 0.3, 0.4, 0.5]. The relationship between the level of masking noise and the RMSE is shown in Fig. 13. When the corruption level increases gradually from zero, the performance of the network is enhanced, since its robustness is improved. However, too heavy a corruption degrades the input data quality, leading to a drop in performance. From the test, the level of corruption noise is chosen to be 0.3. Finally, the SDAE-LSTM model is established using the optimal SDAE structure determined above, with the regression layer of the SDAE model replaced by the LSTM network.

After the SDAE-LSTM model was trained on the set of 35,000 patterns, the datasets from four different seasons, S1, S2, S3 and S4, are used to verify the model. Fig. 14 shows the modeling results on the 35,000 training patterns. The validation results are shown in Figs. 15–18. The RMSE, MAE and MAPE indices achieved by the SDAE-LSTM model for all four seasons are listed in Table 1. Even when adopting four totally different sets of operating data from the four seasons, the SDAE-LSTM model for WSF is still able to achieve good performance. From both the training and the validation, it is clearly seen that the SDAE-LSTM network can forecast the wind speed accurately over a wide range of changing conditions.

Fig. 14. Real and forecasted wind speed series of training sample using SDAE-LSTM network.


Fig. 15. Real and forecasted wind speed series of S1 using (a) SDAE-LSTM network; (b) SAE; (c) LSTM network; (d) MLP network.

Fig. 16. Real and forecasted wind speed series of S2 using (a) SDAE-LSTM network; (b) SAE; (c) LSTM network; (d) MLP network.

Algorithm 1. Main steps to train the SDAE.

Input: $X$ is the training data for the network; $l$ is the number of DAEs; $\theta^k = \{W_1^k, b_1^k, W_2^k, b_2^k\}$ is the parameter set of DAE$_k$.

Step 1: Unsupervised layer-wise pre-training. Set $h^0 = X$.
For $k = 1$ to $l$ do
  Initialize $\theta^k = \{W_1^k, b_1^k, W_2^k, b_2^k\}$ with small random values.
  For each training iteration:
    (1) Get the partially damaged version $\tilde{h}^{k-1}$ using Eq. (9).
    (2) Compute the hidden activation $h^k$ using Eq. (10).
    (3) Compute the reconstructed output $z^k$ using Eq. (11).
    (4) Conduct the gradient descent method to update $\theta^k$, so as to minimize Eq. (13).
    (5) Iteration ends when the stopping criterion is satisfied.
End

Step 2: Supervised fine-tuning.
Initialize the regression layer parameters $\{W^{l+1}, b_1^{l+1}\}$ with small random values.
For each training iteration:
  (1) Compute the output of the SDAE via the stacked encoders (Eq. (10)) and the regression layer.
  (2) Conduct the gradient descent method to update $\theta^k$ $(k = 1, 2, \cdots, l)$ and $\{W^{l+1}, b_1^{l+1}\}$.
  (3) Iteration ends when the stopping criterion is satisfied.

Output: the updated parameters $\theta^k$ $(k = 1, 2, \cdots, l)$ and $\{W^{l+1}, b_1^{l+1}\}$.
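A hedged PyTorch sketch of the greedy pre-training step may clarify the procedure; the layer widths, learning rate, epoch count and stand-in training tensor are illustrative assumptions rather than the paper's settings.

```python
import torch
import torch.nn as nn

def pretrain_dae(data, in_dim, hid_dim, corruption=0.3, epochs=50, lr=0.01):
    """Train one DAE: corrupt (Eq. 9), encode/decode (Eqs. 10-11),
    minimize the reconstruction error (Eqs. 13-14)."""
    enc, dec = nn.Linear(in_dim, hid_dim), nn.Linear(hid_dim, in_dim)
    opt = torch.optim.SGD(list(enc.parameters()) + list(dec.parameters()), lr=lr)
    for _ in range(epochs):
        mask = (torch.rand_like(data) > corruption).float()  # masking noise
        z = torch.sigmoid(dec(torch.sigmoid(enc(data * mask))))
        loss = 0.5 * ((data - z) ** 2).sum(dim=1).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return enc

# Step 1: stack DAEs, each trained on the previous layer's clean codes.
X_train = torch.rand(1024, 20)      # stand-in for the normalized inputs
dims = [20, 50, 45, 40]             # illustrative widths, not the paper's
encoders, h = [], X_train
for d_in, d_out in zip(dims[:-1], dims[1:]):
    enc = pretrain_dae(h, d_in, d_out)
    encoders.append(enc)
    with torch.no_grad():
        h = torch.sigmoid(enc(h))   # codes feed the next DAE
```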


Fig. 17. Real and forecasted wind speed series of S3 using (a) SDAE-LSTM network; (b) SAE; (c) LSTM network; (d) MLP network.

Fig. 18. Real and forecasted wind speed series of S4 using (a) SDAE-LSTM network; (b) SAE; (c) LSTM network; (d) MLP network.

The SDAE-LSTM model is then compared with an SAE, an LSTM network and a general multi-layer perceptron (MLP) network [39], under the same input/output data. The SAE is built using the optimal structure determined in Section 6.1, and is trained using unsupervised layer-wise pre-training and supervised fine-tuning. The structures of the LSTM network and the MLP network used in this experiment are both chosen as 20–28–1, and the networks are trained using the gradient descent algorithm. The RMSE, MAE and MAPE indices achieved by all four models in all four cases are listed in Table 1, and the modeling results are shown in Figs. 15–18.

From Table 1, it can be seen that the MAPE, MAE and RMSE indices obtained by the proposed SDAE-LSTM network are the lowest among the four networks in all four cases. Figs. 15–18 present the forecast and actual wind speeds of the four models in all four cases. From these four figures, it is clear that the results of the proposed SDAE-LSTM network almost overlap the real values, indicating the highest prediction accuracy. Therefore, the proposed SDAE-LSTM network possesses better forecasting capability than the other methods.

The reasons can be summarized in three aspects. First, by incorporating the SAE into the proposed model, higher-level features can be abstracted from the lower-level features of the wind speed data; these automatically learned patterns are more informative and more appropriate for forecasting. Second, the DAE intentionally adds noise to the training data and trains the AE with these corrupted data; through the denoising training process, the DAE learns to recover the noise-free version of the training data, which enhances the robustness of the network. Third, the LSTM network is able to discover temporal dependencies in high-dimensional sequential wind speed data, which allows the network to make full use of the historical wind speed data. The performance of the MLP network is the worst, as it is a shallow architecture that often suffers from uncontrolled convergence speed and local optima, especially when the training sample size grows large.


Algorithm 2. Main steps to train the SDAE-LSTM network.

Input: $X$ is the training data for the network; $l$ is the number of DAEs; $\theta^k = \{W_1^k, b_1^k, W_2^k, b_2^k\}$ is the parameter set of DAE$_k$; $\alpha = \{W_{xf}, W_{hf}, W_{xi}, W_{hi}, W_{xc}, W_{hc}, W_{xo}, W_{ho}, W_{yh}\}$ is the weight set of the LSTM network; $\beta = \{b_f, b_i, b_c, b_o, b_y\}$ is the bias set of the LSTM network.

Step 1: Select the data $X$ from the actual site of the wind farm, and normalize it before network training.

Step 2: Unsupervised layer-wise pre-training of the SDAE. Set $h^0 = X$.
For $k = 1$ to $l$ do
  Initialize $\theta^k = \{W_1^k, b_1^k, W_2^k, b_2^k\}$ with small random values.
  For each training iteration:
    (1) Get the partially damaged version $\tilde{h}^{k-1}$ using Eq. (9).
    (2) Compute the hidden activation $h^k$ using Eq. (10).
    (3) Compute the reconstructed output $z^k$ using Eq. (11).
    (4) Conduct the gradient descent method to update $\theta^k$, so as to minimize Eq. (13).
    (5) Iteration ends when the stopping criterion is satisfied.
End

Step 3: Supervised fine-tuning of the SDAE-LSTM network.
Initialize the SDAE part with the pre-trained $\theta^k$ $(k = 1, 2, \cdots, l)$.
Initialize $\alpha$ and $\beta$ in the LSTM part with small random values.
For each training iteration:
  (1) Compute the output $y_t$ using Eq. (10) and Eqs. (15)–(21).
  (2) Conduct the gradient descent method to update $\theta^k$ $(k = 1, 2, \cdots, l)$, $\alpha$ and $\beta$.
  (3) Iteration ends when the stopping criterion is satisfied.

Output: the updated parameters $\theta^k$ $(k = 1, 2, \cdots, l)$, $\alpha$ and $\beta$.
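Correspondingly, a hedged PyTorch sketch of the hybrid model and its supervised fine-tuning (Step 3) is given below; the sequence layout, the hidden width of 28 and the commented training snippet are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

class SDAELSTM(nn.Module):
    """Pre-trained SDAE encoders feed an LSTM whose last hidden state
    is mapped linearly to the forecast, as in Eq. (21)."""
    def __init__(self, encoders, lstm_hidden=28):    # 28 is an assumed width
        super().__init__()
        self.encoders = nn.ModuleList(encoders)      # from the pre-training step
        self.lstm = nn.LSTM(encoders[-1].out_features, lstm_hidden,
                            batch_first=True)
        self.out = nn.Linear(lstm_hidden, 1)

    def forward(self, x):                            # x: (batch, seq_len, 20)
        h = x
        for enc in self.encoders:
            h = torch.sigmoid(enc(h))                # Eq. (10); no masking at test
        seq, _ = self.lstm(h)
        return self.out(seq[:, -1, :])               # forecast of v(t+1)

# Fine-tune the whole network with gradient descent on the squared error:
# model = SDAELSTM(encoders)
# opt = torch.optim.SGD(model.parameters(), lr=0.01)
# loss = nn.functional.mse_loss(model(x_batch), v_next)
# opt.zero_grad(); loss.backward(); opt.step()
```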

7. Conclusion

Accurate WSF plays a vital role in the stability and security of power system operation, while the intermittent and unstable nature of wind speed makes WSF a quite difficult task. In this paper, a hybrid deep architecture with feature selection is proposed for WSF. In this model, a feature selection framework based on MI is first developed to determine the most suitable inputs for the forecasting model. Then, the SDAE with an appropriate structure is used to capture the intrinsic features contained in the original data, and the LSTM network is adopted to generate the final output. To evaluate the effectiveness of the proposed model, four datasets from different seasons are used for model validation. A comparative analysis of the SDAE-LSTM network against the MLP network, the SAE and the LSTM network is presented based on the RMSE, MAPE and MAE indices. The experimental results indicate that the proposed model outperforms the other three forecasting models. The SDAE-LSTM network is therefore reliable and effective for WSF in modern power system management.

Declaration of Competing Interest

We have no conflicts of interest to declare.

Acknowledgements

This work was supported by the National Natural Science Foundation of China (Grant Nos. 61833011, 61603134, 61673171) and the Fundamental Research Funds for the Central Universities (Grant Nos. 2017MS033, 2017ZZD004, 2018QN049).

References

[1] World Wind Energy Association, Wind power capacity worldwide reaches 600 GW, 53.9 GW added in 2018, 2019 [accessed 12 May 2019].
[2] A.M. Foley, P.G. Leahy, A. Marvuglia, E.J. McKeogh, Current methods and advances in forecasting of wind power generation, Renew. Energy 37 (1) (2012) 1–8.
[3] X.W. Mi, H. Liu, Y.F. Li, Wind speed prediction model using singular spectrum analysis, empirical mode decomposition and convolutional support vector machine, Energy Convers. Manage. 180 (2019) 196–205.

[4] H. Bludszuweit, J.A. Dominguez-Navarro, A. Llombart, Statistical analysis of wind power forecast error, IEEE Trans. Power Syst. 23 (3) (2008) 983–991.
[5] M. Gan, H. Peng, X.Y. Peng, I. Garba, A locally linear RBF network-based state-dependent AR model for nonlinear time series modeling, Inform. Sci. 180 (22) (2010) 4370–4383.
[6] E. Erdem, J. Shi, ARMA based approaches for forecasting the tuple of wind speed and direction, Appl. Energy 88 (4) (2011) 1405–1414.
[7] A. Shamshad, M.A. Bawadi, W.M.A.W. Hussin, T.A. Majid, S.A.M. Sanusi, First and second order Markov chain models for synthetic generation of wind speed time series, Energy 30 (5) (2005) 693–708.
[8] M. Poncela, P. Poncela, J.R. Perán, Automatic tuning of Kalman filters by maximum likelihood methods for wind energy forecasting, Appl. Energy 108 (2013) 349–362.
[9] C. Peng, S.D. Ma, X.P. Xie, Observer-based non-PDC control for networked T-S fuzzy systems with an event-triggered communication, IEEE Trans. Cybern. 47 (8) (2017) 2279–2287.
[10] Q.M. Cong, W. Yu, Integrated soft sensor with wavelet neural network and adaptive weighted fusion for water quality estimation in wastewater treatment process, Measurement 124 (2018) 436–446.
[11] L.J. Zhao, D.H. Wang, T.Y. Chai, Estimation of effluent quality using PLS-based extreme learning machines, Neural Comput. Appl. 22 (2013) 509–519.
[12] C. Zhang, H.K. Wei, L.P. Xie, Y. Shen, K.J. Zhang, Direct interval forecasting of wind speed using radial basis function neural networks in a multi-objective optimization framework, Neurocomputing 205 (2016) 53–63.
[13] C. Ren, N. An, J.Z. Wang, L. Li, B. Hu, D. Shang, Optimal parameters selection for BP neural network based on particle swarm optimization: a case study of wind speed forecasting, Knowl. Based Syst. 56 (2014) 226–239.
[14] X.B. Kong, X.J. Liu, R.F. Shi, K.Y. Lee, Wind speed prediction using reduced support vector machines with feature selection, Neurocomputing 169 (2015) 449–456.
[15] C. Peng, H.T. Sun, M.J. Yang, Y.L. Wang, A survey on security communication and control for smart grids under malicious cyber attacks, IEEE Trans. Syst. Man Cybern. Syst. (2019), doi:10.1109/TSMC.2018.2884952.
[16] X.J. Liu, X.B. Kong, Nonlinear model predictive control for DFIG-based wind power generation, IEEE Trans. Autom. Sci. Eng. 11 (4) (2014) 1046–1055.
[17] J.H. Cui, S. Liu, J.F. Liu, X.J. Liu, A comparative study of MPC and economic MPC of wind energy conversion systems, Energies 11 (11) (2018) 1–23.
[18] W. Yu, M. Pacheco, Impact of random weights on nonlinear system identification using convolutional neural networks, Inform. Sci. 477 (2019) 1–14.
[19] D.H. Wang, C.H. Cui, Stochastic configuration networks ensemble for large-scale data analytics, Inform. Sci. 417 (2017) 55–71.
[20] W.B. Liu, Z.D. Wang, X.H. Liu, N.Y. Zeng, Y.R. Liu, F.E. Alsaadi, A survey of deep neural network architectures and their applications, Neurocomputing 234 (2017) 11–26.
[21] D.F. Guo, M.Y. Zhong, H.Q. Ji, Y. Liu, R. Yang, A hybrid feature model and deep learning based fault diagnosis for unmanned aerial vehicle sensors, Neurocomputing 319 (2018) 155–163.
[22] A. Dedinec, S. Filiposka, A. Dedinec, L. Kocarev, Deep belief network based electricity load forecasting: an analysis of the Macedonian case, Energy 115 (2016) 1688–1700.

[23] R.G. Yu, Z.Q. Liu, X.W. Li, W.H. Lu, D.G. Ma, M. Yu, J.R. Wang, B. Li, Scene learning: deep convolutional networks for wind power prediction by embedding turbines into grid space, Appl. Energy 238 (2019) 249–257.
[24] C. Shang, F. Yang, D.X. Huang, W.X. Lyu, Data-driven soft sensor development based on deep learning technique, J. Process Control 24 (3) (2014) 223–233.
[25] M. Khodayar, O. Kaynak, M.E. Khodayar, Rough deep neural architecture for short-term wind speed forecasting, IEEE Trans. Ind. Inform. 13 (6) (2017) 2770–2779.
[26] T. Trappenberg, J. Ouyang, A. Back, Input variable selection: mutual information and linear mixing measures, IEEE Trans. Knowl. Data Eng. 18 (1) (2006) 37–46.
[27] S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural Comput. 9 (8) (1997) 1735–1780.
[28] J. Gonzalez, W. Yu, Non-linear system modeling using LSTM neural networks, Proc. Int. Feder. Autom. Control 51 (13) (2018) 485–489.
[29] W.N. Lu, Y.P. Li, Y. Cheng, D.S. Meng, B. Liang, P. Zhou, Early fault detection approach with deep architectures, IEEE Trans. Instrum. Meas. 67 (7) (2018) 1679–1689.
[30] X.Y. Qing, Y.G. Niu, Hourly day-ahead solar irradiance prediction using weather forecasts by LSTM, Appl. Energy 148 (2018) 461–468.
[31] Y. Tian, K. Zhang, J.Y. Li, X.X. Lin, B.L. Yang, LSTM-based traffic flow prediction with missing data, Neurocomputing 318 (2018) 297–305.
[32] P. Vincent, H. Larochelle, Y. Bengio, P.A. Manzagol, Extracting and composing robust features with denoising autoencoders, in: Proceedings of the 25th International Conference on Machine Learning, 2008.
[33] M. Han, W.J. Ren, X.X. Liu, Joint mutual information-based input variable selection for multivariate time series modeling, Eng. Appl. Artif. Intel. 37 (2015) 250–257.
[34] A. Hacine-Gharbi, P. Ravier, R. Harba, T. Mohamadi, Low bias histogram-based estimation of mutual information for feature selection, Pattern Recogn. Lett. 33 (10) (2012) 1302–1308.
[35] N. Bi, J. Tan, J.H. Lai, C.Y. Suen, High-dimensional supervised feature selection via optimized kernel mutual information, Expert. Syst. Appl. 108 (2018) 81–95.
[36] M. Han, W.J. Ren, Global mutual information-based feature selection approach using single-objective and multi-objective optimization, Neurocomputing 168 (2015) 47–54.
[37] X.M. Zhang, Q.L. Han, X.H. Ge, D.R. Ding, An overview of recent developments in Lyapunov–Krasovskii functionals and stability criteria for recurrent neural networks with time-varying delays, Neurocomputing 313 (2018) 392–401.
[38] X.M. Zhang, Q.L. Han, J. Wang, Admissible delay upper bounds for global asymptotic stability of neural networks with time-varying delays, IEEE Trans. Neural Netw. Learn. Syst. 29 (11) (2018) 5319–5329.
[39] X.J. Liu, X.B. Kong, G.L. Hou, J.H. Wang, Modeling of a 1000 MW power plant ultra super-critical boiler system using fuzzy-neural network methods, Energy Convers. Manage. 65 (2013) 518–527.

Xiangjie Liu received the Ph.D. degree in Automatic Control from the Research Center of Automation, Northeastern University, Shenyang, China, in 1997. He subsequently held a postdoctoral position with the China Electric Power Research Institute (CEPRI), Beijing, China, until 1999, and acted as an Associate Professor in CEPRI. He was a Research Associate with the University of Hong Kong, and a Professor with the National University of Mexico. He is now a Professor at North China Electric Power University, Beijing, China.
His current research areas include model predictive control with its application in industrial processes. He is a member of technical committee on both control theory and process control, Chinese Association of Automation. He is an editor of Acta Automatica Sinica, an editor of Control and Decision, and an editor of Chinese Journal of Electric Power Automation Equipment. He has been in Program for New Century Excellent Talents in University since 2006.


Hao Zhang received the B.S. degree in Automation in 2015 in Department of Automation, North China Electric Power University, Baoding, China. He is currently a Ph.D. candidate in School of Control and Computer Engineering, North China Electric Power University, Beijing, China. His research interests include deep learning, data-driven modeling and optimal control of power plants.

Xiaobing Kong received the B.S. degree in Measurement Technology and Instrument in 2008, the Master degree in Automatic Control in 2011, and the Ph.D. degree in Automatic Control in 2014, all in Department of Automation, North China Electric Power University, Beijing, China. She currently works in School of Control and Computer Engineering, North China Electric Power University. Her research interests include modeling, optimization and nonlinear model predictive control of power plants.

Kwang Y. Lee received the B.S. degree in Electrical Engineering in 1964 from Seoul National University, M.S. degree in Electrical Engineering in 1968 from North Dakota State University, and Ph.D. degree in Systems Science in 1971 from Michigan State University. He was elected as a Fellow of IEEE in January 2001 for his contributions to the development and implementation of intelligent system techniques for power plants and power systems control and as a Life Fellow of IEEE since January 2008. He has been working in the area of power plants and power systems control for over thirty years at Michigan State, Oregon State, University of Houston, the Pennsylvania State University, and the Baylor University, where he is Professor and Chairman of the Department of Electrical and Computer Engineering. His research interests include control, operation, and planning of energy systems; computational intelligence, intelligent control and their applications to energy and environmental systems, and modeling, simulation and control of renewable and distributed energy sources.