Short-term wind speed forecasting framework based on stacked denoising auto-encoders with rough ANN


Sustainable Energy Technologies and Assessments 38 (2020) 100601

Contents lists available at ScienceDirect

Sustainable Energy Technologies and Assessments journal homepage: www.elsevier.com/locate/seta

Original article


Hamidreza Jahangir (a), Masoud Aliakbar Golkar (a), Falah Alhameli (b,c), Abdelkader Mazouz (d), Ali Ahmadian (e,⁎), Ali Elkamel (f,b)

a Faculty of Electrical Engineering, K. N. Toosi University of Technology, Tehran, Iran
b College of Engineering, Khalifa University of Science and Technology, The Petroleum Institute, Abu Dhabi, United Arab Emirates
c Abu Dhabi National Oil Company, PO Box 898, Abu Dhabi, United Arab Emirates
d Department of Management and Management Information Systems, Al Ain University of Science & Technology, Abu Dhabi, United Arab Emirates
e Department of Electrical Engineering, University of Bonab, Bonab, Iran
f College of Engineering, University of Waterloo, Waterloo, Canada

ARTICLE INFO

Keywords: Deep learning; Stacked denoising auto-encoders; Rough structure; Neural networks; Wind speed forecast

ABSTRACT

This paper proposes a multi-modal short-term wind speed prediction framework based on Artificial Neural Networks (ANNs). Because wind speed is stochastic and highly uncertain, precise wind speed forecasting is necessary for utility managers in modern power systems with high penetration of wind power. To make the predictions sufficiently precise, the proposed multi-modal method combines a denoising module and a prediction module. In the denoising module, the optimal configuration of the Stacked Denoising Auto-encoders (SDAEs) is first determined by a Genetic Algorithm (GA), and the SDAEs are then applied to pre-train on the data and reduce the input data noise. Since different data sources have different noise characteristics, the optimal structure of the denoising module is determined independently for each kind of input data. In the forecasting module, a Sinusoidal Rough-Neural Network (SR-NN) predicts the wind speed. The denoising and forecasting modules are then stacked together to form the holistic deep network. The Deep Learning (DL) approach is used to extract more robust features from the input data in the denoising process. Rough neurons are used to handle the highly intermittent behavior of wind speed; each rough neuron consists of a pair of conventional neurons, known as the upper bound and lower bound neurons. Since wind speed behaves periodically and nonlinearly, an ANN based on the sinusoidal activation function is more accurate than one based on the typical sigmoid activation function. Weather data from Ahar, Iran, a site with high wind power potential, are considered. Modern power systems also need very short-term forecasts; therefore, wind speed is forecast at both one-hour and 10-minute intervals.
Input data are selected by Grey Correlation Analysis (GCA); according to the GCA result, environmental data such as humidity and temperature are considered in addition to wind speed data at various heights. To evaluate the efficiency of the proposed method, the simulation results are compared, in different scenarios, with other ANN structures and with benchmark methods such as the Support Vector Machine (SVM) and the Autoregressive Moving Average (ARMA) model. The comparison examines the impact of DL in detail by contrasting the performance of deep and shallow network structures.

⁎ Corresponding author. E-mail address: [email protected] (A. Ahmadian).

https://doi.org/10.1016/j.seta.2019.100601
Received 26 June 2019; Received in revised form 25 November 2019; Accepted 3 December 2019
2213-1388/ © 2019 Published by Elsevier Ltd.

Nomenclature

A. Indices
i: Index of visible layer sample
g: Index of output layer sample
j: Index of hidden layer sample
k: Index of iteration number
S: Index of layer number

B. Parameters
M: Penalty factor
n_0: Total number of input data components
N_L: Total number of layers
N_N^S: Total number of neurons in layer S
p: Total number of visible layer components
q: Total number of hidden layer components
[illegible]: Momentum coefficient
[illegible]: Training coefficient

C. Variables
b: Bias vector of coder layer
b': Bias vector of decoder layer
b_i: Bias for sample i in visible layer
b_j: Bias for sample j in hidden layer
b_U^S: Upper bound bias vector for layer S
b_L^S: Lower bound bias vector for layer S
b_LR: Bias vector of Logistic Regression layer
E: Total sum square error vector
E(k): Total sum square error in iteration k
E_FT: Fine-tuning error
E_RBM: Energy function of RBM
f: Activation function
F: Fitness function of GA for SDAE optimal construction
H: Hidden layer vector
H_j: Hidden layer data for sample j
L: Loss function
N: Noising data vector
O^S: Output vector of layer S
O_j^S: Output of neuron j in layer S
O_U: Output vector of upper bound neuron
O_L: Output vector of lower bound neuron
O_U^S: Output vector of upper bound neuron for layer S
O_L^S: Output vector of lower bound neuron for layer S
V: Visible layer vector
V_i: Visible layer data for sample i
W_U: Weight vector of upper bound neurons
W_L: Weight vector of lower bound neurons
W_j^S(k): Weight of neuron j in layer S in iteration k
W_U^S: Weight vector of upper bound neurons in layer S
W_L^S: Weight vector of lower bound neurons in layer S
W_ij: Weight between sample i of visible layer and sample j of hidden layer
W_LR: Weight vector of Logistic Regression layer
X: Total input data vector
X_i: Input data sample i
X': Reconstructed data vector
Y_g: Output for sample g
Y_g [mark illegible]: Desired output for sample g
Y_mean: The mean value of desired output vector in calculation period
[illegible]: Coefficient of upper bound neuron
[illegible]: Coefficient of lower bound neuron
[illegible]^S: Coefficient of upper bound neuron in layer S
[illegible]^S: Coefficient of lower bound neuron in layer S
[illegible]_j^S: Coefficient of upper bound neuron j in layer S
[illegible]_j^S: Coefficient of lower bound neuron j in layer S
[illegible]: Weight variations for fine-tuning
[illegible]_max: Maximum acceptable error
[illegible]: Training variable
[illegible](k): Training variable in iteration k
[illegible]: Random noise operator
[illegible]: Weight vector between input layer and hidden layer
[illegible]: Weight vector between hidden layer and reform layer

D. Abbreviations
ANN: Artificial Neural Network
AE: Auto Encoder
ARMA: Autoregressive Moving Average
BP: Back Propagation
CRBM: Continuous Restricted Boltzmann Machine
DL: Deep Learning
DNN: Deep Neural Network
DAE: Denoising Auto Encoder
E-NN: ELMAN Neural Network
GA: Genetic Algorithm
GCA: Grey Correlation Analysis
MAE: Mean Absolute Error
MAPE: Mean Absolute Percentage Error
MLP: Multi-Layer Perceptron
MBGD: Mini-Batch Gradient Descent
RBF: Radial Basis Function
RES: Renewable Energy Source
RBM: Restricted Boltzmann Machine
RE-NN: Rough ELMAN Neural Network
R-NN: Rough Neural Network
R-RBF: Rough-Radial Basis Function
RMSE: Root Mean Square Error
SDAE: Stacked Denoising Auto-Encoder
SR-NN: Sinusoidal Rough-Neural Network
SVM: Support Vector Machine

Introduction

Background and motivation

Renewable Energy Sources (RESs) are among the main foundations of human energy supply, and their share in the energy supply of various communities is growing. Among all RESs, wind energy has grown dramatically and now supplies energy to many modern cities and societies. Wind energy can adequately address the pollution concerns of modern societies; it has high reliability and can be regarded as a source of sustainable energy production. As the penetration of wind energy sources increases, various challenges have arisen in power systems. Wind speed behavior should be forecast precisely to reach accurate solutions for optimal operation and planning problems in different parts of the power system, such as the distribution and transmission sections [1]. Accordingly, the most crucial factor in the design of wind farms is the accuracy of wind speed prediction, which leads to their optimal design [2]. Increasing the accuracy of wind power prediction effectively reduces the risk of using wind energy in power systems, and this contributes to increasing the penetration of RESs.

Nowadays, we are confronted with massive data, commonly known as "big data". Using high-dimensional data requires modern computing tools suited to this type of data, and DL methods are one such tool. In recent studies, DL structures have been implemented by many researchers in different applications [3]. A deep structure has a high ability to extract the main features from input data, and it is recommended for the analysis of different phenomena.

Literature survey

Recent studies on wind speed prediction can be divided into two main classes: statistical and machine learning methodologies. Statistical methodologies require an assumption about the data before the model is formed, and this initial assumption plays a vital role in the resulting model. In machine learning methods, by contrast, feature extraction is applied to the existing data on the behavior of the studied system, so the data can speak for themselves without an initial assumption. In effect, statistical methods are rule-based, whereas machine learning methods learn from data without explicitly programmed instructions.

Several researchers have investigated statistical methods to predict wind speed. The general form of the statistical method is the Auto-Regressive Integrated Moving Average (ARIMA) model, and hybrid approaches have been employed to overcome its deficiencies. Reference [4] proposed a Fractional-Auto Regressive Integrated Moving Average (FARIMA) model to forecast wind speeds on day-ahead and two-day-ahead horizons. In [5], the authors combined Empirical Mode Decomposition (EMD) with an improved Recursive Autoregressive Integrated Moving Average (RARIMA) model; according to their results, RARIMA is more accurate than the conventional ARIMA method. In [6], a combination of ANN and ARIMA was applied, which improved the precision of the predicted results.

Machine learning methods have recently attracted many researchers to the prediction of various phenomena, and many articles have been published on these techniques [7]. In recent years, the Support Vector Machine (SVM), a powerful data-driven machine learning method, has also been used in short-term forecasting with acceptable outcomes [8,9]. SVM is a highly principled machine learning method that contains a single hidden layer, and its training process is usually based on the "kernel trick" concept. SVM can be regarded as a benchmark machine learning method for short-term wind speed and wind power forecasting over different time intervals [10]. Another machine learning method is the ANN, which has a high ability to extract features from input data. The articles on wind speed prediction with ANNs can be examined under seven headings:

1) Considering the uncertainty with ANNs: References [11,12] use conventional ANNs (shallow networks with few hidden layers) to analyze the uncertainty of wind speed behavior. This concept can also be explored in other ways, such as Fuzzy [13] and Rough [14] approaches. Considering mean values instead of the exact behavior of wind speed with its uncertainty leads to imprecise results. In applications such as planning power systems with high penetration of RESs, this issue is crucial and must be considered [15].

2) Deep learning (DL), i.e., increasing the number of hidden layers: Most articles on this topic use shallow ANNs [16-19], which extract fewer features from the input data and therefore give less precise results. Shallow ANNs are easy to train, but for a phenomenon such as wind speed, which has complex nonlinear behavior, the accuracy of the predictions decreases. It can be improved by using Deep Neural Networks (DNNs) [20,21]. In [22], DL was used to predict wind speed, but noise elimination of the input data was not considered. Furthermore, DNNs have been implemented in fault detection tasks, which illustrates the robustness of these approaches [23].

3) Input data noise: Given that wind speed is predicted from different environmental data and the noise potential in these data is not low, noise reduction techniques such as SDAE are essential for decreasing input data noise. In [24], SDAEs are applied to load forecasting; this method has also recently been used in other fields such as speech recognition [25] and image processing tasks [26]. Using SDAE significantly increases the accuracy of the results.

4) Training DNNs: According to [27], when a deep structure is used, it is difficult to determine the optimal weights of nonlinear Auto-encoders (AEs). For this purpose, a pre-training step performed by the Restricted Boltzmann Machine (RBM) method has been advised, in which the RBMs define the initial weight values of the different layers. This step is not taken into account in some articles on this topic [28]. RBMs can increase the accuracy of wind speed prediction by improving the quality of feature extraction from the input data [29].

5) Activation function: Many papers use the sigmoid activation function; however, given that wind speed has periodic behavior and sharp variations at peak points, using the sinusoidal activation function improves the accuracy of the predicted results [30].

6) Multi-modal framework: In [31], one of the most recent articles in this field, wind speed is predicted with Rough SDAEs, and the denoising and forecasting stages are carried out simultaneously. In such a structure, the local error derivatives become smaller and smaller toward the elementary layers; as a result, the weights of the initial layers change very little and these layers are not trained well. For the same reason, using the Rough structure in the SDAE section is not recommended, because the Rough coefficients further reduce the local error derivative. These issues are resolved by the multi-modal framework applied in this paper.

7) Combination of ANNs and other methods: A combination of ANNs and a metaheuristic optimization algorithm is applied in [32], where a multi-layered feed-forward ANN was optimized by the genetic algorithm (GA) and trained by the back-propagation (BP) learning process. Other research works combine ANNs with probabilistic, weather forecasting, and wavelet packet schemes. The authors of [33] used ANNs with probabilistic functions to predict wind speed; in that work, the ANN predicted the mean and variance of the probability distribution functions. In [34], a hybrid system that combines different weather forecasting models and ANNs was proposed for wind speed prediction. References [29,35] combined ANNs and wavelet functions to predict wind speed over a short time span, and these hybrid approaches improved the accuracy. Following these works, in this study the GA is applied to determine the optimal structure of the SDAE, which is itself a combination of an ANN and a metaheuristic optimization algorithm.

According to the reviewed research, the ANN is a promising approach that can handle large volumes of data with high precision and performs acceptably in time series forecasting tasks. Accordingly, this study implements a deep ANN-based method, a robust tool in data engineering tasks, to handle input data noise and uncertainty in short-term wind speed forecasting.

Our contribution

The main goal of this paper is to develop a new, applicable scheme for forecasting wind speed. The literature review shows that recent wind speed forecasting studies are oriented toward applying DL concepts and eliminating input data noise. There are several articles in this area but, to the best of our knowledge, they have some deficiencies, and the knowledge gap can be categorized as follows:

• In none of the papers that use the DL approach or a conventional ANN to eliminate the input data noise is the optimal structure of the denoising network purposefully determined; the final structure is chosen by examining probable configurations [31,36,37].
• In many articles in this area, the denoising and prediction sections are integrated, so the elementary layers of the network are not trained well [31]; this is due to the random structure of the denoising section. Moreover, some papers have not employed a denoising technique in forecasting methods based on conventional ANNs [11,17,22,33] or DL approaches [20,28,38].
• Despite the high ability of Rough Neural Networks to predict phenomena with high uncertainty [3,30,39], particularly wind speed with its complex intermittent behavior, Rough networks are rarely employed.

Accordingly, to find the optimal configuration of the denoising module independently of the prediction section, this paper presents a multi-modal scheme for predicting wind speed. In this technique, the genetic algorithm (GA) is employed to find the optimal structure of the deep denoising module. Based on our pilot experience, and because of the sheer number of training parameters in a deep network, determining the optimal structure of the denoising module for different kinds of input data (wind speed at different altitudes and environmental data) jointly with the forecasting module is not recommended. To find the optimal structure of the denoising module, the optimization method generates configurations with different numbers of hidden layers and of neurons in each hidden layer; however, the training parameters of deep networks, such as the learning rate of the different layers, the dropout probability, and the L2 regularization coefficient, are very sensitive to the network structure. For some configurations, if the optimization is carried out holistically, the deep network may fail to converge and the optimization method cannot follow the variations carefully.

The general structure of the proposed method is as follows. First, the input data are selected based on the Grey Correlation Analysis (GCA) result, and the different input data are pre-processed individually and independently by SDAEs to find the optimal construction of the denoising module. Then, the SDAE outputs are put together and applied as inputs to the SR-NN to build the holistic deep network. Finally, the deep network is trained and is ready for short-term wind speed forecasting. In this method, R-NNs with the sinusoidal activation function are used along with the SDAE, and the initial SDAE weights of the different hidden layers are determined by RBMs. Through DL, more high-level robust features are extracted from low-level input data; the SR-NN accounts for uncertainty, and the SDAEs reduce noise. To verify the robustness of the proposed method, the forecasting results are also compared with benchmark methods in this field such as SVM and ARMA. The main contributions of this paper can be categorized as follows:

- A new DL-based framework based on denoising and forecasting topologies is applied to extract the main features of the input data in the denoising module.
- The optimal configuration of the SDAE is determined using a metaheuristic optimization algorithm.
- The input data structure is selected using the GCA method.
- The prediction module is based on Rough Artificial Neural Networks.

Paper structure

The rest of the article is organized as follows. Section 2 presents an overview of the main concepts applied in this paper, such as DL, SR-NNs, SDAE, and RBM. Section 3 describes the proposed method in detail. Section 4 presents the simulation results in different scenarios and verifies the proposed scheme. Finally, Section 5 gives the overall conclusions about the proposed method and the simulation results.

An overview of basic theories

DL strategy

DL is generally based on an ANN with multiple hidden layers. This section explains why DL is applied in the proposed scheme for wind speed prediction. In deep ANNs, the main features of the input data are examined in different layers, whereas in shallow ANNs all the features are examined by a single layer. An example from image analysis clarifies how DL works: each layer explores different features. The elementary layers only look at the darkness and brightness of the image, the subsequent layers recognize lines and shapes, and finally the image is completed by combining all the layers. In shallow ANNs, all the image features are studied by just one layer. In our work, because wind speed has complex stochastic behavior and depends on different factors, high-dimensional input data are considered and the ANN faces "big data"; in this context, DL is very effective. As the number of hidden layers increases, the ANN moves from low-level to high-level feature extraction of the input data. DL has proven very successful in a variety of applications such as price forecasting [35], load forecasting [24,41], and solar radiation modeling [42]. In this study, a DL structure is applied to predict wind speed on a short time scale.

Stacked denoising auto-encoders

AEs are simple ANNs that map the input data to output data with the lowest possible difference; that is, AEs rebuild their inputs through encoding and decoding processes. AEs were first introduced by Hinton et al. [43] in the 1980s to solve the unsupervised back-propagation problem. Much later, in 2006, AEs were revisited by Hinton [27] to present a deep structure in ANNs, in which AEs were stacked together to form the Stacked Auto-encoder (SAE). The main task of the SAE is to extract high-level features from the input data. In the Denoising Auto-encoder (DAE), noise is added to the input data before the encoding and decoding process, so the AE must recover the original input from its corrupted version. This increases the feature extraction ability of AEs and improves their robustness. The pre-training technique in this study is founded on the SDAE.

The AE receives the input $X \in [0, 1]^p$ of length p and produces the output $H \in [0, 1]^q$ of length q. This is done in the coding unit as follows:

$$H = f(\omega X + b) \tag{1}$$

In the decoding segment, the output of the hidden layer is taken as input, and the reconstructed version of the input, $X' \in [0, 1]^p$, is created with the same dimension as the input:

$$X' = f(\omega' H + b') \tag{2}$$

Here f is the activation function, $\omega$ and $\omega'$ are the weight vectors between the input and hidden layers and between the hidden and reform layers, and b and b' are the bias vectors of the coder and decoder layers. The aim is to minimize the difference between the input and the reconstructed data. The overall structure of the AE is shown in Fig. 1. Training is unsupervised: the target is the input data itself. The loss function is defined as follows:

$$L(X, X') = \lVert X - X' \rVert^2 \tag{3}$$

Fig. 1. The overall structure of AE (input layer X, hidden layer H, reform layer X', loss function L(X, X')).

As shown in Fig. 2(a), the Denoising Auto-encoder (DAE) has one more layer than the basic AE. This layer applies noise to the input data, after which the encoding and decoding process is the same as in the AE. Noise is added by randomly selecting some of the neurons and setting their output values to zero, while the others are left unaffected [31,44]. Under these conditions, the DAE must work with noisy, disturbed inputs, which makes it more robust than the AE. In summary, Fig. 1 presents a typical AE without a noising layer, whose main duty is to rebuild the input data in the output layer: the AE transfers the input data to a new space (the hidden layer) and rebuilds it. In the DAE of Fig. 2, a new layer (the noising layer) is added, which corrupts the input data before they are transferred to the new space; the noisy data are then transferred to the new space and, finally, the input data are reconstructed in the output layer from the noisy data. In this way, the DAE can rebuild data from noisy input and performs robustly in practical applications with a high noise rate. More robust features are obtained from the input data if DAEs are linked sequentially to form an SDAE, as shown in Fig. 2(b).

To prevent overfitting and improve SDAE performance, as Hinton stated in [27], Restricted Boltzmann Machines (RBMs) are used to establish the initial weights, because RBMs have high capabilities in data mining procedures [45]. Deep AEs are very sensitive; to avoid locally optimal results and improve the convergence of training in these networks, a pre-processing method should be employed to find the initial values of the weights, and training should be done with a low learning rate. When the initial weights are chosen randomly, unsuitable values may cause the training error to increase in every training epoch, and the convergence of the training procedure may be lost [27]. In this paper, the initial weights of every DAE are established independently by an RBM. As shown in Fig. 3, an RBM consists of visible and hidden layers. Unlike in the Boltzmann Machine (BM), there is no connection between neurons within the same layer. RBM training is based on minimizing the energy function, which is defined as follows:

$$E_{RBM}(V, H) = -\sum_{i=1}^{p} b_i V_i - \sum_{j=1}^{q} b_j H_j - \sum_{i=1}^{p} \sum_{j=1}^{q} V_i H_j W_{ij} \tag{4}$$
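The relations in Eqs. (1) to (4) can be sketched numerically in a few lines. The following is a minimal pure-Python illustration rather than the authors' implementation: the sigmoid is an assumed stand-in for the generic activation f, the noising layer is modelled as the random zeroing described in the text, and all function names, sizes, and toy values (`dae_forward`, `rbm_energy`, `drop_prob`, p = 4, q = 2) are hypothetical.

```python
import math
import random


def sigmoid(a):
    """Logistic activation, an assumed choice for f in Eqs. (1) and (2)."""
    return 1.0 / (1.0 + math.exp(-a))


def affine(weights, vec, bias):
    """Return weights @ vec + bias for a list-of-rows weight matrix."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]


def mask_noise(x, drop_prob, rng):
    """Noising layer: zero each component with probability drop_prob."""
    return [0.0 if rng.random() < drop_prob else xi for xi in x]


def dae_forward(x, w_enc, b_enc, w_dec, b_dec, drop_prob, rng):
    """Corrupt x, encode it (Eq. 1), decode it (Eq. 2); return (H, X')."""
    noisy = mask_noise(x, drop_prob, rng)
    hidden = [sigmoid(a) for a in affine(w_enc, noisy, b_enc)]
    recon = [sigmoid(a) for a in affine(w_dec, hidden, b_dec)]
    return hidden, recon


def reconstruction_loss(x, recon):
    """Squared error of Eq. (3), measured against the clean input."""
    return sum((xi - ri) ** 2 for xi, ri in zip(x, recon))


def rbm_energy(v, h, w, b_vis, b_hid):
    """RBM energy of Eq. (4); w[i][j] links visible i to hidden j."""
    interaction = sum(v[i] * h[j] * w[i][j]
                      for i in range(len(v)) for j in range(len(h)))
    return (-sum(bi * vi for bi, vi in zip(b_vis, v))
            - sum(bj * hj for bj, hj in zip(b_hid, h))
            - interaction)


rng = random.Random(0)
p, q = 4, 2  # toy visible and hidden sizes
x = [0.2, 0.9, 0.5, 0.7]
w_enc = [[rng.uniform(-0.5, 0.5) for _ in range(p)] for _ in range(q)]
w_dec = [[rng.uniform(-0.5, 0.5) for _ in range(q)] for _ in range(p)]
hidden, recon = dae_forward(x, w_enc, [0.0] * q, w_dec, [0.0] * p, 0.25, rng)
loss = reconstruction_loss(x, recon)
energy = rbm_energy([1.0, 0.0], [1.0], [[0.5], [0.3]], [0.1, 0.2], [0.4])
```

Note that the loss compares the reconstruction with the clean input, not with the corrupted one; this is what forces the DAE to learn features that survive the noise.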

In a continuous restricted Boltzmann machine (CRBM), visible layer is contemplated based on Gaussian transformation, and for hidden layer, linear transformation is constructed. Finding the initial weight values is not essential in shallow AE, but in deep networks, it is usually recommended. Further details about RBMs are given in [46]. In RBM, unlike BM, there is no internal connection in the layers and there are

Reform layer

X

'

Hidden layer

H

Noising layer

N

Input layer

X

Loss function L( X , X )

(a) DAE 1

DAE 2

DAE N

'

'

'

X

H

N

X

X

H

N

X

X

H

N

X

Fig. 2. (a) The overall structure of DAE, (b): The overall structure of SDAE with N DAEs. 5

Reform layer

Hidden layer

Noising layer

Input layer

Reform layer

Hidden layer

Noising layer

Input layer

Reform layer

Hidden layer

Noising layer

Input layer

(b)

Sustainable Energy Technologies and Assessments 38 (2020) 100601

H. Jahangir, et al.

where

W ij

Vi

zi =

Hj

1 1 + exp( WLR H + bLR )

(6)

ANN-based on the rough structure Rough Neuron can be contemplated as a pair of neurons called upper bound and lower bound neurons. This section answers the main question that why Rough structure is employed in our proposed technique for wind speed prediction. In order to get better outcomes from the ANNs, the ANN configuration should be designated according to the performance of the understudied phenomenon. In the exploration of real systems, due to the limitation of measuring instrument error and environmental noise in input data, there is always uncertainty in the final results. Various methods have been introduced to investigate the uncertainty such as fuzzy and Rough approaches which were successfully applied to the field of wind speed forecasting [14]. In many studies, Rough NNs (R-NNs) were appropriated to eliminate the environmental noise. R-NNs are among the best practices in this topic; the RNNs have made impressive outcomes in various cases such as traffic volume prediction, reduction of image noise, and medical diagnostic support system [30]. Nowadays, R-NNs have fascinated many researchers in the study of the uncertainties. In this paper, an ANN with a Rough arrangement has been exploited.

Fig. 3. The overall structure of RBM.

only connections between different layers that present better performance than BM in the feed-forward network. In the construction shown in Fig. 2, the output of each DAE is the input for the next DAE. In the training process of SDAE, every DAE is trained independently. Indeed, before the fine-tuning procedure, every AE has its input and target data (as we know, in AEs, the input data and target data are considered the same), and the training procedure can be done in an independent manner; in this way, they are trained from first to last, respectively, because the input data of the last AE is not available at the first step. This procedure has a significant effect on the training procedure of the deep AEs [1]. After that, when all DAEs have been trained, the fine-tuning process will start. In this section, the decoding part of every DAE is eliminated and a logistic regression layer is added at top of the network for the finetuning procedure as shown in Fig. 4. Afterwards, all of the layers are trained as multi-layer perceptron with a mini-batch gradient descent algorithm [38], the loss function for fine-tuning is defined as the crossentropy function [24]. The learning rates for DAE and fine-tuning process are defined as 0.05 and 0.001, respectively. In fact, in the pretraining tasks (training the DAEs in an individual manner), large learning rate (0.05) is implemented to search the entire solution space, and in the final training stage (fine-tuning procedure), to tune or adjust the performance of the pre-trained network, the small learning rate (0.001) should be employed. It should be noticed that the training procedure of the denoising and forecasting modules (fine-tuning procedure) are done simultaneously. We only separate them to find the optimal configuration of the denoising module (number of hidden layers and neurons in each layer), and all the forecasting results are obtained from the holistic network. 
The separated training task is only the pre-training procedure in this study.

$$L(X, \hat{X}) = -\sum_{i=1}^{n_0} \left[ X_i \log(z_i) + (1 - X_i) \log(1 - z_i) \right] \tag{5}$$

Fig. 4. Fine-tuning process diagram (labelled elements: input layer X, noising layer N, hidden layer H, logistic regression layer W_LR as the fine-tuning layer, and loss function L(X, X̂)).

Proposed methodology

In this paper, the proposed method is built on denoising and forecasting tasks. In the denoising task, DAEs are implemented, and the forecasting task is designed based on an ANN with rough neurons. To find the optimal structure of the deep denoising module, the genetic algorithm is implemented. In this section, the different parts of the proposed methodology are explained in detail.

Stacked denoising auto-encoders optimal construction

The optimal number of AEs and their neurons is determined by GA. In this method, the cost function is the total error in the fine-tuning process, and GA determines the SDAE structure by minimizing this error, as shown in Fig. 5. In addition, the weight variation values are included as a penalty factor in the objective function; this prevents overfitting and avoids an excessive number of AEs. When the number of AEs increases excessively, the weight variations shrink and the weights are not trained adequately. This effect is most evident in the first layers, and the penalty factor is applied based on our experience in ANN training. It should be mentioned that in this study, to find the optimal

structure of the denoising module, the module is handled individually: it is built from denoising auto-encoders, whose target data is the same as the input data and is therefore available at every stage without the forecasting part. The optimal structure of the denoising module can thus be found without considering the forecasting part.

Fig. 5. GA for designing optimal SDAE construction. Flowchart boxes: input raw data; add noise to input data; determine the initial GA population from different SDAE configurations; run RBM to find the optimal initial weights of every DAE for the different populations; train every DAE individually; build the SDAE by combining the DAEs; add a logistic regression layer; start the fine-tuning process of the SDAEs based on the cross-entropy function; calculate the GA fitness function (F) for the whole population; check F against the maximum acceptable error and save the structures that satisfy the error criterion; otherwise apply crossover and mutation, determine a new population and increment the iteration counter until the maximum iteration is exceeded; test the selected SDAEs; pass the noise-reduced data of the best SDAE configurations to the prediction section; train the rough ANN; find the best SDAE configuration.
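The search loop of Fig. 5 can be illustrated with a small, self-contained sketch over integer chromosomes, where each gene is the neuron count of one hidden layer. Everything here is an assumption made for illustration: the toy fitness (distance to an arbitrary layout) stands in for the fine-tuning error, truncation selection and single-point crossover are assumed operators, and the population size and iteration count are far smaller than those reported in the paper.

```python
import random

NEURON_CHOICES = range(1, 101)  # candidate neurons per hidden layer: {1, ..., 100}
TARGET = [90, 85, 80, 75]       # arbitrary layout used only by the toy fitness


def toy_fitness(chrom):
    """Hypothetical stand-in for the GA objective F (lower is better)."""
    return sum(abs(g - t) for g, t in zip(chrom, TARGET))


def crossover(p1, p2, rng):
    """Single-point crossover of two equal-length chromosomes."""
    cut = rng.randrange(1, len(p1))
    return p1[:cut] + p2[cut:]


def mutate(chrom, rng):
    """Replace one randomly chosen gene by a new value from the allowed set."""
    child = list(chrom)
    child[rng.randrange(len(child))] = rng.choice(NEURON_CHOICES)
    return child


def run_ga(n_layers=4, pop_size=40, iterations=100, p_cross=0.5, p_mut=0.5, seed=0):
    rng = random.Random(seed)
    pop = [[rng.choice(NEURON_CHOICES) for _ in range(n_layers)]
           for _ in range(pop_size)]
    for _ in range(iterations):
        pop.sort(key=toy_fitness)
        parents = pop[: pop_size // 2]  # truncation selection keeps the best half
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            child = crossover(a, b, rng) if rng.random() < p_cross else list(a)
            if rng.random() < p_mut:
                child = mutate(child, rng)
            children.append(child)
        pop = parents + children
    return min(pop, key=toy_fitness)


best = run_ga()
print(best, toy_fitness(best))
```

Because the best half of each generation is carried over unchanged, the best fitness never gets worse from one iteration to the next. In the paper's setting, each fitness evaluation would instead require pre-training and fine-tuning an SDAE.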

$$\text{Min. } F = E_{FT} + M \times \frac{1}{PEN} \tag{7}$$

$$PEN = \sum_{s=1}^{N_L} \sum_{j=1}^{NN_S} \left[ W_j^S(k) - W_j^S(k-1) \right] \tag{8}$$

In the GA process, if the number of neurons in each layer is chosen without restriction, the search space of the optimization problem grows. Since each evaluation of the objective function requires calculating the ANN training error, finding the optimal structure is time-consuming, and defining a large-scale solution space is not recommended. In our method, a chromosome has $N_L$ components and represents a construction vector for the SDAE. To limit the structure of the SDAE, all initial values of the number of hidden layer neurons are selected from the set {1, 2, 3, …, 100}. The solution space is bounded to this set and is explored using the selection, crossover, and mutation operators. The solution space in this paper is therefore discrete, a setting in which GA performs robustly [48]. Furthermore, GA performs well (in convergence speed and accuracy) in finding the global optimal solution thanks to its crossover and mutation operators [49]. For these reasons, GA has been employed for the optimization task in this study. Given the proposed objective function and the GA, the optimal SDAE structure is obtained. As shown in Fig. 5, the SDAE configurations that satisfy the error criterion are selected first, and the final configuration is then the one with the lowest error value in the prediction section. The GA parameters are selected based on our pilot experience as well as trial and error. To search the solution space carefully, the initial population size is set to 1000, the maximum number of iterations is set to 500, and the maximum acceptable error ($\varepsilon_{\max}$) in the fine-tuning process is set to 0.05. Furthermore, the crossover and mutation probabilities are both set to 50%. The crossover and mutation operators help GA avoid falling into local optima [50].

Fig. 6. Rough neuron structure.
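To make the role of the penalty term in Eq. (7) concrete, the following sketch evaluates the objective for two hypothetical training states. It is an illustration under stated assumptions: the value of M, the use of absolute weight differences (so that PEN stays positive) and the small constant guarding against division by zero are all choices made here, not values given in the paper.

```python
import numpy as np


def weight_variation(weights_now, weights_prev):
    """PEN-like term: total weight change between two successive iterations.

    Assumption: absolute differences are summed so the term cannot cancel out.
    """
    return float(sum(np.sum(np.abs(w - w_old))
                     for w, w_old in zip(weights_now, weights_prev)))


def objective(e_ft, weights_now, weights_prev, m=1e-3, eps=1e-12):
    """F = E_FT + M / PEN, with m and eps chosen purely for illustration."""
    pen = weight_variation(weights_now, weights_prev)
    return e_ft + m / (pen + eps)


w_prev = [np.zeros((2, 2))]
w_active = [np.ones((2, 2))]         # weights still changing noticeably
w_stalled = [np.full((2, 2), 1e-4)]  # weights barely changing

f_active = objective(0.04, w_active, w_prev)
f_stalled = objective(0.04, w_stalled, w_prev)
print(f_active, f_stalled)
```

With the same fine-tuning error, the configuration whose weights have almost stopped changing receives the larger objective value, which is how the penalty discourages overly deep stacks whose early layers are not trained adequately.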

Sinusoidal rough artificial neural networks

According to [30], SR-NNs have given satisfactory results in identifying systems with periodic behavior. In this paper, the SR-NN is applied because of the periodic behavior of wind speed. Unlike the sigmoid function, the sinusoidal function does not saturate [51]. In the sinusoidal activation function used in this paper, the frequency and the initial phase are defined as training parameters. The rough neuron configuration is shown in Fig. 6 and the related formulas are described below. A rough neuron can be considered as a pair of neurons called the upper bound and the lower bound neurons. In rough neurons, when $x$ is a feature variable, $\underline{x}$ and $\overline{x}$ denote the lower bound and upper bound variables, respectively. R-NNs are effective tools for removing environmental noise and handling the uncertainty of input data, and are among the best practices for this purpose. The feed-forward equations of the proposed R-NN in Fig. 7 are defined as follows:

$$O_U^1 = \mathrm{Max}\left(f(W_U^1 X + b_U^1),\ f(W_L^1 X + b_L^1)\right) \tag{9}$$

$$O_L^1 = \mathrm{Min}\left(f(W_U^1 X + b_U^1),\ f(W_L^1 X + b_L^1)\right) \tag{10}$$

$$O^1 = \alpha^1 O_U^1 + \beta^1 O_L^1 \tag{11}$$

$$O_U^2 = \mathrm{Max}\left(f(W_U^2 O^1 + b_U^2),\ f(W_L^2 O^1 + b_L^2)\right) \tag{12}$$

$$O_L^2 = \mathrm{Min}\left(f(W_U^2 O^1 + b_U^2),\ f(W_L^2 O^1 + b_L^2)\right) \tag{13}$$

$$O^2 = \alpha^2 O_U^2 + \beta^2 O_L^2 \tag{14}$$

Back propagation equations are described as follows [3]:

$$\theta(k) = \theta(k-1) - \eta \frac{\partial E}{\partial \theta(k-1)} + \mu\, \Delta\theta(k-1) \tag{15}$$

The momentum factor $\mu$ is defined between 0 and 1, and $\theta$ represents every parameter that can be trained, such as weights, biases and rough neuron parameters. According to (14), if we consider $f(W_U^2 O^1 + b_U^2) > f(W_L^2 O^1 + b_L^2)$, then the training route of the different parameters can be described as follows:

$$\frac{\partial E}{\partial W_U^2} = \frac{\partial E}{\partial O^2} \frac{\partial O^2}{\partial O_U^2} \frac{\partial O_U^2}{\partial W_U^2} \tag{16}$$

$$\frac{\partial E}{\partial W_L^2} = \frac{\partial E}{\partial O^2} \frac{\partial O^2}{\partial O_L^2} \frac{\partial O_L^2}{\partial W_L^2} \tag{17}$$

$$\frac{\partial E}{\partial W_U^1} = \frac{\partial E}{\partial O^2} \frac{\partial O^2}{\partial O_U^2} \frac{\partial O_U^2}{\partial O^1} \frac{\partial O^1}{\partial O_U^1} \frac{\partial O_U^1}{\partial W_U^1} + \frac{\partial E}{\partial O^2} \frac{\partial O^2}{\partial O_L^2} \frac{\partial O_L^2}{\partial O^1} \frac{\partial O^1}{\partial O_U^1} \frac{\partial O_U^1}{\partial W_U^1} \tag{18}$$

$$\frac{\partial E}{\partial W_L^1} = \frac{\partial E}{\partial O^2} \frac{\partial O^2}{\partial O_U^2} \frac{\partial O_U^2}{\partial O^1} \frac{\partial O^1}{\partial O_L^1} \frac{\partial O_L^1}{\partial W_L^1} + \frac{\partial E}{\partial O^2} \frac{\partial O^2}{\partial O_L^2} \frac{\partial O_L^2}{\partial O^1} \frac{\partial O^1}{\partial O_L^1} \frac{\partial O_L^1}{\partial W_L^1} \tag{19}$$

$$\frac{\partial E}{\partial \alpha^2} = \frac{\partial E}{\partial O^2} \frac{\partial O^2}{\partial \alpha^2} \tag{20}$$

$$\frac{\partial E}{\partial \beta^2} = \frac{\partial E}{\partial O^2} \frac{\partial O^2}{\partial \beta^2} \tag{21}$$

$$\frac{\partial E}{\partial \alpha^1} = \frac{\partial E}{\partial O^2} \frac{\partial O^2}{\partial O_U^2} \frac{\partial O_U^2}{\partial O^1} \frac{\partial O^1}{\partial \alpha^1} + \frac{\partial E}{\partial O^2} \frac{\partial O^2}{\partial O_L^2} \frac{\partial O_L^2}{\partial O^1} \frac{\partial O^1}{\partial \alpha^1} \tag{22}$$

$$\frac{\partial E}{\partial \beta^1} = \frac{\partial E}{\partial O^2} \frac{\partial O^2}{\partial O_U^2} \frac{\partial O_U^2}{\partial O^1} \frac{\partial O^1}{\partial \beta^1} + \frac{\partial E}{\partial O^2} \frac{\partial O^2}{\partial O_L^2} \frac{\partial O_L^2}{\partial O^1} \frac{\partial O^1}{\partial \beta^1} \tag{23}$$

Fig. 7. Proposed Rough NN.

Multi-modal framework

We introduce a wind speed forecasting method based on a multi-modal formation. The framework consists of two modules, denoising and forecasting, and its overall structure is shown in Fig. 8.

Denoising module

In this part, the various input data are denoised completely independently of each other with SDAEs. For this purpose, the module is made up of 5 different subnets, each defined for one type of variable. Three of them handle the wind speed data at different heights, and the other two handle the environmental data, namely humidity and temperature. In this paper, the data of the last 12 or 24 hours are considered for each parameter. Because different data sources have different noise bases, the denoising process is done independently for each input. The result is a vector with 60 or 120 components, based on the most recent 12 or 24 hours of data (5 × 12 or 5 × 24), which serves as the input of the R-NN. One advantage of an independent denoising section is that the optimal SDAE arrangement can be determined. In this approach, the number of AEs or hidden layers ($N_L$) in the SDAE and the number of neurons in each layer ($NN_S$) are chosen by the heuristic optimization algorithm. In [31], because the noise reduction and prediction processes coincide, determining the optimal value of these parameters by optimization methods is time-consuming and impractical. In the proposed method, because the denoising section is independent, the optimal arrangement of the SDAE is determined by the GA.

Fig. 8. The overall structure of the multi-modal method. (Denoising module: five SDAEs, fed respectively with the wind speed and its maximum and minimum values at 40 m, 30 m and 20 m, the temperature values, and the humidity values, each for the last 12 or 24 hours. Their outputs form the denoised data vector passed to the SR-NN forecasting module, which outputs the predicted wind speed. Together the two modules form the holistic deep structure.)
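As a concrete illustration of the rough-layer feed-forward pass in Eqs. (9) to (14), the sketch below stacks two rough layers on a 60-component input. The activation form sin(freq * x + phase), the hidden width of 8, the random weights and the values alpha = beta = 0.5 are assumptions for demonstration only; in the paper these quantities are trained.

```python
import numpy as np


def sinusoid(x, freq=1.0, phase=0.0):
    """Sinusoidal activation; freq and phase would be trainable in an SR-NN."""
    return np.sin(freq * x + phase)


def rough_layer(x, w_u, b_u, w_l, b_l, alpha, beta, act=sinusoid):
    """Return (output, upper, lower) of one rough layer."""
    a = act(w_u @ x + b_u)      # response of the upper-bound weights
    b = act(w_l @ x + b_l)      # response of the lower-bound weights
    upper = np.maximum(a, b)    # upper-bound neuron output
    lower = np.minimum(a, b)    # lower-bound neuron output
    return alpha * upper + beta * lower, upper, lower


rng = np.random.default_rng(0)
x = rng.normal(size=60)         # e.g. a 5 x 12 denoised input vector
h = 8                           # hypothetical hidden width

o1, up1, lo1 = rough_layer(
    x, rng.normal(size=(h, 60)), rng.normal(size=h),
    rng.normal(size=(h, 60)), rng.normal(size=h), alpha=0.5, beta=0.5)
o2, up2, lo2 = rough_layer(
    o1, rng.normal(size=(1, h)), rng.normal(size=1),
    rng.normal(size=(1, h)), rng.normal(size=1), alpha=0.5, beta=0.5)
print(o2)
```

The element-wise maximum and minimum guarantee that the upper-bound output is never below the lower-bound output, whichever set of weights produced the larger response.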

Forecasting module

In this part, the forecasting task starts after the denoising part. The denoised data are the inputs of the SR-NN. As shown in Fig. 7, the network used in this section has two rough layers. The input data is used to train the SR-NN, and the upper and lower bound weights of both layers are trained together with the rough neuron coefficients (α and β). Since the main purpose of this paper is to employ the DL method in the denoising section and the rough structure in the neurons of the prediction section, the prediction part is composed of only two rough layers, and the deep structure is used only in the denoising section.

Final training process

After the optimization of the SDAE structures, the denoising and forecasting modules are stacked together to implement the final forecasting network with a DL concept. The denoising module (five SDAEs with the optimal number of AEs and neurons) and the forecasting module (a two-layer SR-NN) are trained simultaneously in the final training process. As shown in Fig. 8, the denoising and forecasting modules are first separated to determine the optimal structure of the SDAEs, and are then connected to each other for the final training. In the final training stage, all training parameters of the denoising module, which have been trained individually, serve as initial values, and all parameters of the denoising and forecasting modules are trained simultaneously. Deep networks are powerful tools for highly nonlinear forecasting problems, but when the number of network parameters is large, they suffer from deficiencies such as instability and overfitting [52]. For this reason, three tools commonly used in DL are employed to train the holistic deep network: Mini-Batch Gradient Descent (MBGD), dropout, and L2 regularization. A brief description of these algorithms is given in Appendix A.

Error calculation criteria

In this paper, different error criteria are adopted to compare the accuracy of the investigated ANNs: the Mean Absolute Error (MAE), the Mean Absolute Percentage Error (MAPE) and the Root Mean Square Error (RMSE). These criteria are very common in the ANN literature [36]. Equations (24–26) show the MAE, MAPE and RMSE formulas, respectively.
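The three error criteria can be sketched in a few lines of NumPy. This is an illustrative implementation under assumptions: the argument names are hypothetical, and the MAPE is normalised by the mean of the actual values and multiplied by 100 to give a percentage, which is one reading of the formula rather than a confirmed definition.

```python
import numpy as np


def mae(y, y_hat):
    """Mean absolute error."""
    return float(np.mean(np.abs(y - y_hat)))


def mape(y, y_hat):
    """Mean absolute percentage error, normalised by the mean actual value (assumed)."""
    return float(100.0 * np.mean(np.abs(y - y_hat) / np.mean(y)))


def rmse(y, y_hat):
    """Root mean square error."""
    return float(np.sqrt(np.mean((y - y_hat) ** 2)))


y = np.array([10.0, 12.0, 14.0])      # actual wind speed (m/s), made-up values
y_hat = np.array([11.0, 12.0, 13.0])  # forecast wind speed (m/s), made-up values
print(mae(y, y_hat), mape(y, y_hat), rmse(y, y_hat))
```

RMSE weights large deviations more heavily than MAE, which is why the two can rank methods differently on days with sharp spikes.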

$$MAE = \frac{1}{n_0} \sum_{g=1}^{n_0} \left| Y_g - \hat{Y}_g \right| \tag{24}$$

$$MAPE = \frac{1}{n_0} \sum_{g=1}^{n_0} \frac{\left| Y_g - \hat{Y}_g \right|}{Y_{mean}} \tag{25}$$

$$RMSE = \sqrt{\frac{1}{n_0} \sum_{g=1}^{n_0} \left| Y_g - \hat{Y}_g \right|^2} \tag{26}$$

To make the comparison as fair as possible, each method is run for a sufficient number of training iterations (100), and the best result of each ANN, i.e. the one with the lowest error, is chosen as the final answer.

Fig. 9. Grey correlation grades of input data.

Numerical study

Data description

The proposed methodology has been applied to predict wind speed for Ahar city in Iran [53]. In the three-year study period, the wind speed is significant and has an average of 14.8 m/s. The input data are the wind speed [m/s] and its minimum and maximum values at 40, 30 and 20 m altitude, the temperature [°C] and the humidity [%]. The input data structure is selected according to the GCA, which is a strong tool in "Big Data" problems. A brief description of the GCA method is given in Appendix B. The grey correlation grades of the different parameters are shown in Fig. 9. As shown there, the selected input data have a high correlation with the target data (wind speed at 40 m altitude). The GCA result shows that the wind speed at 30 m altitude has a higher correlation with the target data than the wind speed at 20 m altitude and than environmental data such as humidity and temperature, which are also effective parameters in short-term wind speed forecasting. GCA also illustrates that the wind speed profile starts to vary at 30 m before the wind speed at 40 m changes. This is related to the environmental conditions of the case study and may differ in other case studies. For this reason, it is suggested to employ correlation techniques in the data engineering task to find the most related parameters, because each case study has its own situation.

Furthermore, the GCA is used only to make sure that related parameters (with a correlation grade above 0.6 [54]) are employed as input data. The selected parameters are treated in the same manner in the denoising and forecasting tasks, and the wind speed at 30 m is not given a larger weight or coefficient than the other selected parameters during training. The GCA is thus the filter in this study, and its result supports the effectiveness of the selected parameters. This procedure has been employed in other data engineering tasks [55]. The selected values over a period of three years are used as inputs to the ANNs, and the wind speed at 40 m altitude for the fourth year is predicted and compared with the real values. The wind speed is predicted over periods of a day and a week. In this way, the wind speed can be predicted for every day of the fourth year using the historical data of the previous three years. The numerical study is done using MATLAB 7.5 on a PC with an Intel Core i5 2.6 GHz CPU and 8 GB of RAM. The proposed DL-based approach is trained with the Adam optimizer, a robust and well-known method for training deep networks [56]. The training task is defined based on the stochastic gradient descent method, which can significantly reduce the variance of the parameter updates and improve the convergence of the training procedure [57]. To predict the wind speed per day or week, data of the three previous years has been used. For ANN training, data from January 2012 to January 2014 with 1-hour intervals are utilized. To predict wind speed in a one-hour period, two scenarios based on different input data sets are defined. In the first scenario, the input is a 60-dimensional (5×12) vector containing the different parameters over the last 12 h.

For example, to predict the wind speed at 5 pm, the values of the different parameters over the 12 h preceding 5 pm are defined as the input data. Over the 3-year period, the input consists of 26,280 (3 × 365 × 24) samples, each with a dimension of 60. The previous 12 data points were selected by trial and error, which is a common approach in data engineering tasks, and they gave the best results in comparison with other input data dimensions. It should be noted that the input data sequence can differ between case studies. In the second scenario, the input is a 120-dimensional (5×24) vector containing the different parameters over the last 24 h. In this study, 10% of the data is used for network testing and 20% for validation. These values could be obtained by optimization algorithms, but to simplify the proposed method and focus on the main subject of the paper, they are set according to our experience over numerous runs of the ANN training program. Validation is used to prevent overfitting, which is essential for ANN training based on the dropout method [58].
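The construction of the 60-dimensional input from the last 12 hourly values of five variables can be sketched as a sliding window. The sketch makes several assumptions that are not stated in the paper: column 0 is taken to be the target series (wind speed at 40 m), the values of each variable are stored contiguously in the window, the split is chronological, and the 70% training share is simply what remains after the reported 20% validation and 10% test shares.

```python
import numpy as np


def make_windows(series, lag):
    """Build lagged inputs from a (T, n_features) array.

    Returns X with shape (T - lag, n_features * lag) and the next-step
    target y taken from column 0 (assumed to be the 40 m wind speed).
    """
    t = series.shape[0]
    x = np.stack([series[i:i + lag].T.reshape(-1) for i in range(t - lag)])
    y = series[lag:, 0]
    return x, y


def split(x, y, val_frac=0.2, test_frac=0.1):
    """Chronological train/validation/test split (an assumed protocol)."""
    n = len(x)
    n_test = int(n * test_frac)
    n_val = int(n * val_frac)
    n_train = n - n_val - n_test
    return ((x[:n_train], y[:n_train]),
            (x[n_train:n_train + n_val], y[n_train:n_train + n_val]),
            (x[n_train + n_val:], y[n_train + n_val:]))


# Synthetic stand-in data: 30 hourly records of 5 variables.
series = np.arange(150, dtype=float).reshape(30, 5)
x, y = make_windows(series, lag=12)
train, val, test = split(x, y)
print(x.shape, y.shape, len(train[0]), len(val[0]), len(test[0]))
```

Setting `lag=24` yields the 120-dimensional input of the second scenario with the same code.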

Simulation results

The main purpose of this paper is to investigate the performance of DL in the denoising section. To this end, two different scenarios are examined, as previously stated. The DL method extracts the main features of the input data; therefore, it is suggested for input data with a high dimension. For this purpose, two data sets with different dimensions are considered to evaluate the proposed technique. In this paper, various ANN methods, namely the Multilayer Perceptron (MLP), Elman Neural Network (ENN), Rough Elman Neural Network (R-ENN), Radial Basis Function Neural Network (RBFNN), R-NN and SR-NN, have been used to predict short-term wind speed. In all networks other than the SR-NNs, the sigmoid activation function has been used. As stated in various references [38,59,60], and given that ANNs have a more advanced structure than statistical techniques such as ARMA, the ANN methods predict uncertain phenomena more accurately. To validate the performance of the proposed deep method, the simulation results are compared with other ANNs and with SVM and ARMA, which are known as benchmark methods in machine learning and statistical techniques [10]. A brief description of the ARMA and SVM methods is given in Appendix C.

First scenario

As stated in the data description section, in the first scenario the input is a 60-dimensional (5×12) vector containing the different parameters over the last 12 h. As shown in Fig. 10, the optimal SDAE configuration found by GA has 4 intermediate DAE layers with 90, 85, 80 and 75 neurons, respectively. As the GA has discovered, the number of neurons first increases from the input layer to the first DAE and then decreases. The error criteria of the different methods for predicting wind speed on May 4, 2015 are shown in Table 1. Fig. 11 and Fig. 12 show the actual wind speed and the prediction results of the various methods on May 4, 2015.

Fig. 10. Optimal SDAE configuration in Scenario 1 (input layer: 60 neurons; DAE 1 to DAE 4: 90, 85, 80 and 75 neurons; output layer: 60 neurons).

Table 1. Error criteria for different methods for predicting wind speed on May 4, 2015, at 40 m altitude.

| Denoising module | Prediction module | MAE (m/s) | MAPE (%) | RMSE (m/s) |
|---|---|---|---|---|
| None | MLP | 0.8959 | 12.8356 | 1.0879 |
| DAE | MLP | 0.8267 | 11.3306 | 0.9468 |
| SDAE | MLP | 0.8092 | 11.0781 | 0.9599 |
| None | RBF-NN | 0.78993 | 10.5571 | 0.9614 |
| DAE | RBF-NN | 0.75687 | 10.1265 | 0.9440 |
| SDAE | RBF-NN | 0.71154 | 9.4780 | 0.8863 |
| None | R-NN | 0.68828 | 9.1928 | 0.8566 |
| DAE | R-NN | 0.65755 | 9.0835 | 0.7993 |
| SDAE | R-NN | 0.6267 | 8.6735 | 0.7794 |
| None | ENN | 0.8688 | 11.3581 | 0.9883 |
| DAE | ENN | 0.8113 | 10.6080 | 0.9743 |
| SDAE | ENN | 0.7874 | 10.2555 | 0.9675 |
| None | R-ENN | 0.7429 | 10.3872 | 0.9163 |
| DAE | R-ENN | 0.7239 | 10.0982 | 0.9851 |
| SDAE | R-ENN | 0.6945 | 9.1451 | 0.8437 |
| None | SR-NN | 0.70891 | 9.78493 | 0.79047 |
| DAE | SR-NN | 0.57069 | 7.55772 | 0.74592 |
| SDAE | SR-NN | 0.50221 | 6.76337 | 0.63071 |

As shown in

Table 1, the use of the denoising module reduces the error in all methods, which is due to the noise in the environmental data. Because the input data has different noise sources, the denoising module is necessary for high-accuracy wind speed prediction. This module has a significant effect on the prediction results of the various networks studied in this paper. In the SR-NNs, the MAE, MAPE and RMSE have been reduced from 0.70 (m/s), 9.78 (%) and 0.79 (m/s) to 0.50 (m/s), 6.76 (%) and 0.63 (m/s), respectively, by using the denoising modules. As expected, SDAEs, with more hidden layers, can extract more features from the input data and give more precise results than a single DAE. The use of the rough structure in the prediction module also increases the accuracy of the results. To illustrate the difference between Fig. 11 and Fig. 12, which present the R-NNs and SR-NNs, respectively, the peak points can be compared. For instance, a spike point occurs at 22 o'clock, and the SR-NNs (in Fig. 12) outperform the R-NNs (in Fig. 11). This clarifies the effectiveness of the proposed method. For a deeper analysis, the spike points during a week, shown in Fig. 13, can be considered. The proposed method, which is built on rough neurons with a sinusoidal activation function and the SDAE, handles the sharp spike points acceptably (Fig. 13). In the R-NNs, the MAE, MAPE and RMSE have decreased from 0.68 (m/s), 9.19 (%) and 0.85 (m/s) to 0.62 (m/s), 8.67 (%) and 0.77 (m/s), respectively, by applying rough neurons and rough connection weights. The main difference between R-NNs and conventional multi-layer perceptron networks lies in the construction of the neurons and connection weights. In R-NNs, to account for uncertainty, interval weights and rough neurons are applied instead of definite weights and definite neurons; this makes the ANN resistant to uncertainty. Additionally, because the sinusoidal function changes faster than the sigmoid function, it performs better in predicting wind speed. Accordingly, the SR-NN performs well at peak points with sudden variations. In the R-NNs with DAE, the MAE, MAPE and RMSE have been reduced from 0.65 (m/s), 9.08 (%) and 0.79 (m/s) to 0.57 (m/s), 7.55 (%) and 0.74 (m/s), respectively, by employing the sinusoidal activation function. In the ENNs, internal loops are included in the ANN configuration, which improves network learning and the depth of network memory. Therefore, ENNs perform better than MLP-NNs in all modes; in addition, the Rough Elman NN (R-ENN) performs better than the ENN due to the rough structure of its neurons. These results are shown in Table 1. Fig. 13 shows the actual wind speed and the prediction results of the various methods between December 1 and December 7, 2015. As shown in Fig. 13, the SDAE performs better at peak points, which underlines the importance of the SDAE module.

Fig. 11. Wind speed prediction results with R-NNs on May 4, 2015 in Scenario 1.

Fig. 12. Wind speed prediction results with SR-NNs on May 4, 2015 in Scenario 1.

Second scenario

In the second scenario, the input is a 120-dimensional (5×24) vector containing the different parameters over the last 24 h. In this case, the input data has a greater dimension, and comparing the results of this section with those of the previous section, which has a lower input dimension, reveals the effect of the DL strategy on the feature extraction of the input data. As shown in Fig. 14, the optimal SDAE configuration found by GA in this scenario has 6 hidden DAE layers with 175, 160, 150, 165, 145 and 130 neurons, respectively. Table 2 shows the error values for the different ANN structures. As shown in Fig. 14, with the increase in the dimension of the input data, the number of DAE layers in the denoising section has increased, which is due to the extraction of more features from the input data. In this case, the optimal SDAE structure determined by the GA has 2 more hidden layers than in scenario 1. As in the previous section, Fig. 15 and Fig. 16 show the actual wind speed and the prediction results without a denoising module, with DAE, and with SDAE, for the R-NN and SR-NN prediction modules, respectively, on May 4, 2015, and Fig. 17 shows the actual wind speed and the prediction results between December 1 and

December 7, 2015. To demonstrate the difference between Fig. 15 and Fig. 16, which present the R-NNs and SR-NNs, respectively, the spike points can be compared. For instance, a spike point occurred at 8 o'clock, and the SR-NNs (in Fig. 16) outperformed the R-NNs (in Fig. 15). As shown in Fig. 17, the proposed method handles the spike points more precisely than the other approaches, which illustrates its robustness.

Fig. 13. Wind speed prediction results with SR-NNs between December 1 and December 7, 2015, in Scenario 1.

Fig. 14. Optimal SDAE configuration in Scenario 2 (input layer: 120 neurons; DAE 1 to DAE 6: 175, 160, 150, 165, 145 and 130 neurons; output layer: 120 neurons).

Fig. 15. Wind speed prediction results with R-NNs on May 4, 2015 in Scenario 2.

In the previous scenario, the effects of denoising and of different ANN structures on wind speed prediction were investigated. In this scenario, the goal is to study the effect of DL, which is very influential for input data with a high dimension. Here, more hidden layers (DAEs) were used in the noise elimination section because of the increased dimension of the input data. The growth in the dimension of the input data provides more features for the ANN to predict wind speed behavior. However, as the dimension of the input data increases, the network needs a deeper structure so that it can extract more features from the data. If the dimension of the input data increases and the depth of the network does not, feature extraction becomes more difficult and the network cannot be trained well. In other words, as the input dimension grows, the network structure should be extended with more layers in order to obtain better results. According to Table 2, the error values of the different ANN structures have decreased significantly in this scenario compared with the previous one, and the accuracy of the prediction results has improved. Applying input data with a higher dimension together with a deeper structure in the denoising section improves the accuracy of the results. In this scenario, the input data dimension has doubled from 60 to 120, which has resulted in extra features being extracted. Comparing the prediction results of the two scenarios for the R-NN with SDAE, the MAE, MAPE and RMSE have been reduced from 0.6267 (m/s), 8.6735 (%) and 0.7794 (m/s) in scenario 1 to 0.4194 (m/s), 6.0719 (%) and 0.4808 (m/s) in scenario 2, respectively, by employing the DL method in the denoising module. According to the simulation results of the two scenarios, the use of high-dimensional input data with a deep denoising network significantly increases the accuracy of the prediction results. In addition, in the prediction module, among the different ANN structures, the rough structure is more accurate for predicting wind speed, and, therefore,

Table 2 Error criteria for different methods for predicting wind speed on May 4, 2015, at 40 m altitudes in Scenario 2. Denoising Module

Prediction Module

MAE (m/s)

MAPE (%)

RMSE (m/s)

Denoising Module

Prediction Module

MAE (m/s)

MAPE (%)

RMSE (m/s)

None DAE SDAE None DAE SDAE None DAE SDAE

MLP MLP MLP RBF-NN RBF-NN RBF-NN R-NN R-NN R-NN

0.8276 0.8025 0.7482 0.7684 0.7229 0.6541 0.7397 0.5352 0.4194

10.8554 10.4762 9.7829 10.0703 9.3695 8.2274 9.1383 7.2979 6.0719

1.0055 0.9303 0.9012 0.9178 0.8596 0.6990 0.9135 0.6828 0.4808

None DAE SDAE None DAE SDAE None DAE SDAE

ENN ENN ENN R-ENN R-ENN R-ENN SR-NN SR-NN SR-NN

0.7960 0.7751 0.6541 0.7582 0.6898 0.5743 0.6316 0.4936 0.3562

10.7456 10.2873 9.2310 9.9875 8.3218 7.6437 9.78493 6.9523 5.9369

0.9764 0.9352 0.7873 0.9099 0.7348 0.5827 0.7214 0.6319 0.4267

12

Sustainable Energy Technologies and Assessments 38 (2020) 100601

H. Jahangir, et al.

performance is the second-best. SVM method has good performance in low spike points (about 13 m/s); however, in high spike points (about 20 m/s), it does not present accurate results. ARMA method, in low and high spike points, has low accuracy and just forecasts the general trend of wind data. The proposed method forecasts the spike points accurately according to the deep denoising module with optimal structure and rough neurons. To validate the performance of the proposed method using SVM and ARMA methods, the regression profile and R-squared criteria for an hour and 10-minute intervals are presented in Fig. 20 and Fig. 21, respectively. R-squared is a well-known performance metric in forecasting problems [61,62]. The R-squared value shows the relationship between forecasted and actual wind speed data. The higher R-squared value presents more accurate forecasting results. The results show that by implementing the proposed method, the R-squared value of the forecasted wind speed in an hour time span has been enhanced about 0.0274 and 0.1872, in comparison with SVM and ARMA, respectively. However, in 10-minute time span, R-squared is increased about 0.3205 and 0.7504 by employing the proposed method. This suggests that the proposed method is more effective in smaller intervals. For a closer look at this subject, different error criteria and computational time for forecasting results with 10-minute and one-hour intervals in a week are presented in Table 3. The learning curves of the proposed method (SDAE with SR-NN) in different time intervals are presented in Appendix D. Comparison of the results of Tables 2 and 3 shows that in the lower prediction time interval, the accuracy of the forecasted results is better. This is due to the fact that in an hour, the wind speed variation is higher than 10-minute intervals. Indeed, by increasing the time interval from ultra-short-term to an hour, the forecasting error will increase. 
According to the forecasting results and computational time of various implemented approaches in this study, which are presented in Table 3, the proposed method (SDAE with SR-NN) outperforms another benchmarking approaches; however, this method needs a little more computational time in comparison with other methods (about 6 s). This time complexity is lower than one minute and is acceptable in shortterm wind speed forecasting task. The limitations and abilities of the proposed method are discussed as follows: The proposed method is a data engineering task which is structured based on different historical data of the case study. Indeed, in databased approaches, historical data of the under-study phenomena has a vital rule in the accuracy and precision of the forecasting results. The proposed method can handle the input data noise and uncertainty with SDAEs and rough neurons, respectively; however, same as other data engineering tasks, the proposed method has high dependency on the historical data, and if we do not have a rich data set, we will have

Fig. 16. Wind speed prediction results with SR-NNs on May 4.2015 in Scenario 2.

using Rough neurons is recommended. Furthermore, shallow-based noise reduction module has less effect on the results than deep modes. If noise cancellation modules are applied, it is better to use deep modules with more input data. In fact, today, due to the large amount of data associated with various phenomena, it is better to use a deep structure to extract high-quality property. Indeed, to predict those phenomena like wind speeds, it is essential to employ the high dimension input data; this technique is very effective in the accuracy of the results. In fact, in predicting a phenomenon such as wind speed that has high dimension input data with different noise sources, using deep noise reduction module based on deep structure is essential. Discussion Numerical results in scenarios 1 and 2 show the robustness of the proposed methodology. The proposed method has better performance than other ANN methods at different time intervals. According to the forecasting results of the studied scenarios, the effectiveness of the denoising task, rough neurons, and sinusoidal activation function in eliminating the input data noise, handling the uncertainty of the input data, and improving the performance of the ANN in high intermittent profiles, respectively, is illustrated. In addition, to validate the performance of the proposed method, the forecasting results with an hour and 10-minute intervals, as stated in scenario 2, are compared with SVM and ARMA methods. Simulation results for an hour and 10-minute intervals in a week are shown in Figs. 18 and 19, with 168 and 1008 steps, respectively. As shown in these figures, the proposed method (SDAE with SR-NN) has better performance in spike points in comparison with SVM and ARMA in both time intervals, while, SVM

Fig. 17. Wind speed prediction results with SR-NNs between December 1 and December 7, 2015, in Scenario 2.


Fig. 18. Wind speed prediction results with proposed method, SVM and ARMA with an hour time interval between December 1 and December 7, 2015.

results with low precision. The main ability of the proposed method is handling the input data noise and uncertainty by implementing the SDAE and rough neurons, respectively. In data-based tasks, the input data noise and uncertainty have significant effects on the final results, and they should be handled carefully. In this paper, to make the denoising task robust, the optimal configuration of the SDAEs is determined with GA, which is a robust optimization tool. Furthermore, in this study, various input data sources are selected, and the most relevant parameters are chosen by GCA.

Given that the input data is taken from different sources and has different noise foundations, the genetic algorithm has been used to determine the optimal SDAE structure for each source individually, which improves the performance of the denoising module. In general, the use of DL makes it possible to extract higher-level features of the input data. Then, the data is entered into the prediction module, and the final training procedure is carried out on the holistic deep structure network, which is made by stacking the denoising and forecasting modules together. In agreement with the simulation results, increasing the dimension of the input data and using a deep structure in the noise elimination module increase the precision of the prediction outcomes. In the prediction module, different ANNs have been employed in various scenarios and their results have been fully compared. In keeping with the simulation results in different modes, the use of SDAE in the denoising section and Sinusoidal Rough Neural Networks (SR-NNs) in the prediction section has improved the accuracy of the forecasting results. Meanwhile, considering the stochastic behavior of the wind speed and the sharp changes in peak values, the use of the sinusoidal activation function with rough neurons increases the precision of the results, as indicated in this study. As shown by numerical results, the proposed method has better

Conclusions

In this paper, a multi-modal short-term wind speed forecasting framework based on Stacked Denoising Auto-encoders (SDAE) with Rough Neural Network (R-NN) is introduced. The input data structure is determined based on the Grey Correlation Analysis (GCA) method. According to the obtained correlation results, wind speed at different altitudes and environmental data such as humidity and temperature have a high influence on the wind speed profile.

Fig. 19. Wind speed prediction results with the proposed method, SVM and ARMA with 10-minute time interval between December 1 and December 7, 2015.


(a): ARMA with 0.78519 R-squared

(b): SVM with 0.94925 R-squared

(c): SDAE with SR-NN with 0.97236 R-squared

Fig. 20. Regression of wind speed prediction results with proposed method, SVM and ARMA with an hour time interval between December 1 and December 7, 2015.

(a): ARMA with 0.90709 R-squared

(b): SVM with 0.95008 R-squared

(c): SDAE with SR-NN with 0.98213 R-squared

Fig. 21. Regression of wind speed prediction results with the proposed method, SVM and ARMA with 10-minutes time interval between December 1 and December 7, 2015.

phenomena, this issue is of great importance. Using more features in network training makes it possible to predict the wind speed behavior more accurately and even to improve the accuracy of the prediction results at specific points, such as spike points. In addition, it should be noted that shallow denoising networks are not suitable for extracting the main features of high-dimensional data; as the amount of input data increases, deep networks are strongly suggested. They are advisable for predicting different phenomena that have high-dimensional input.

Table 3 Different error criteria and computational time for wind speed prediction results between December 1 and December 7, 2015.

| Time interval | Prediction module | MAE (m/s) | MAPE (%) | RMSE (m/s) | Computational time (s) |
| --- | --- | --- | --- | --- | --- |
| 10-minute | SVM | 0.5060 | 7.96636 | 0.7137 | 29.8 |
| 10-minute | ARMA | 1.0057 | 17.8335 | 1.3080 | 23.6 |
| 10-minute | SDAE with SR-NN | 0.2525 | 3.97624 | 0.3344 | 35.8 |
| 1 h | SVM | 1.6898 | 11.3992 | 1.8918 | 21.1 |
| 1 h | ARMA | 2.7727 | 27.6561 | 3.5133 | 17.9 |
| 1 h | SDAE with SR-NN | 0.3562 | 5.9369 | 0.4267 | 28.6 |
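The three error criteria reported in Table 3 can be illustrated with a minimal sketch. It assumes the standard textbook definitions of MAE, MAPE and RMSE (the exact formulas used by the authors are not restated in this section), and the sample values below are hypothetical, not data from the study.

```python
import math

def mae(actual, predicted):
    """Mean absolute error, in the units of the data (m/s for wind speed)."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def mape(actual, predicted):
    """Mean absolute percentage error in percent; assumes no actual value is zero."""
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root mean squared error, in the units of the data."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

# Hypothetical wind speeds (m/s) and forecasts, for illustration only.
actual = [5.0, 6.0, 8.0, 10.0]
predicted = [5.5, 5.5, 8.0, 9.0]
scores = (mae(actual, predicted), mape(actual, predicted), rmse(actual, predicted))
```

Because RMSE squares the errors before averaging, it penalizes large misses at spike points more heavily than MAE does, which is why both are usually reported together.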

CRediT authorship contribution statement Hamidreza Jahangir: Conceptualization. Masoud Aliakbar Golkar: Conceptualization. Falah Alhameli: Data curation. Abdelkader Mazouz: Data curation. Ali Ahmadian: Conceptualization. Ali Elkamel: Conceptualization.

performance than the conventional ANN-based method, Support Vector Machine (SVM) and Autoregressive Moving Average (ARMA) at both one-hour and 10-minute intervals. As shown in this paper, to predict different phenomena such as wind speed, the use of high-dimensional inputs with deep-structured neural networks makes it possible to extract more features from the input data for network training. Today, due to the increase in data and creation of massive databases in various

Declaration of Competing Interest The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.


Appendix A

• Mini-Batch Gradient Descent (MBGD) In the MBGD algorithm, instead of using the whole batch of data, only one part of the data (a mini-batch) is employed for training in each iteration. The main aim of MBGD is to mitigate network instability and reduce the variance of the parameter updates, which results in more stable convergence [63]. Applying the mini-batch technique makes the gradient descent algorithm faster and more efficient. It should be noted that, in contrast with batch gradient descent, where the cost function (forecasting error) decays smoothly, MBGD has a slightly noisy cost function that still trends downwards. This noise arises because some mini-batches are harder than others, which causes fluctuations in the cost function [52].
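The mini-batch update described above can be sketched on a toy problem. This is only an illustration under stated assumptions: a one-variable linear model, a mean squared error cost, and hypothetical values for the batch size, learning rate and data; it is not the training configuration of the proposed network.

```python
import random

def minibatch_gradient_descent(xs, ys, batch_size=4, lr=0.05, epochs=200, seed=0):
    """Fit y ~ w*x + b by mini-batch gradient descent on the mean squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    history = []  # cost on the full data set, recorded after every epoch
    for _ in range(epochs):
        rng.shuffle(indices)  # new random mini-batches in every epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient of the MSE computed on this mini-batch only.
            grad_w = sum(2 * (w * xs[i] + b - ys[i]) * xs[i] for i in batch) / len(batch)
            grad_b = sum(2 * (w * xs[i] + b - ys[i]) for i in batch) / len(batch)
            w -= lr * grad_w
            b -= lr * grad_b
        history.append(sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs))
    return w, b, history

# Hypothetical noiseless data generated from y = 3x + 1.
xs = [i / 10 for i in range(20)]
ys = [3.0 * x + 1.0 for x in xs]
w, b, history = minibatch_gradient_descent(xs, ys)
```

Each update uses only `batch_size` samples, so the parameters move several times per pass over the data; with noisy real data the recorded cost would fluctuate from epoch to epoch while still trending downwards.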

• Dropout Dropout is a procedure for improving neural network performance by avoiding overfitting. In the dropout algorithm, a number of neurons are stochastically dropped during network training in order to avoid the co-adaptation of feature detectors. The discarded neurons can be selected with various probabilistic methods. In this way, only the remaining neurons are updated in each training iteration, which makes them more stable in different situations. In effect, the dropout procedure converts the fully connected hidden layer into a partially connected layer, which decreases the network's dependency on particular neurons in the hidden layers. Note that the dropout technique is applied only during the training process; in the forecasting process, all of the neurons are employed [64–66].
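A minimal sketch of the behavior described above follows. It assumes the common "inverted dropout" variant, in which the kept activations are rescaled during training so that nothing needs to change at forecasting time; the source does not state which variant or drop probability was used, so `drop_prob` and the rescaling are assumptions.

```python
import random

def dropout(activations, drop_prob, training, rng):
    """Inverted dropout: zero each unit with probability drop_prob during training."""
    if not training or drop_prob == 0.0:
        return list(activations)       # forecasting: every neuron is employed
    keep_prob = 1.0 - drop_prob        # assumes drop_prob < 1
    out = []
    for a in activations:
        if rng.random() < keep_prob:
            out.append(a / keep_prob)  # rescale so the expected activation is unchanged
        else:
            out.append(0.0)            # dropped neuron contributes nothing this iteration
    return out

rng = random.Random(0)
hidden = [1.0] * 1000                  # hypothetical hidden-layer activations
train_out = dropout(hidden, drop_prob=0.5, training=True, rng=rng)
test_out = dropout(hidden, drop_prob=0.5, training=False, rng=rng)
```

A fresh random mask is drawn on every call, so a different partially connected layer is trained in each iteration.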

• L2 regularization The main aim of implementing the L2 regularization technique in a deep network is to prevent overfitting [67]. L2 regularization is a specific way of regularizing a cost function (forecasting error) by taking the training terms (weights) into account. In this technique, a penalty factor based on the sum of the squared values of the weights is added to the total forecasting error function as follows:

$$L = L_0 + \frac{\lambda}{2 n_0} \sum_{g=1}^{n_0} W(g)^2 \qquad (1)$$

where $L$ is the total loss function, $L_0$ is the forecasting error, $\lambda$ is the coefficient of L2 regularization and $n_0$ is considered as the total time sample. As stated by Hinton, L2 regularization keeps the weights small unless they have big error derivatives and helps to stop the network from fitting the sampling error [58].

Appendix B

• Grey Correlation Analysis (GCA) method GCA is a tool for finding the correlation grade between different input data parameters and the target data. GCA reveals the sensitivities of the influencing factors. In GCA, an input with a higher grey correlation grade has more influence on the target data. With this method, the important parameters for forecasting the target data are selected. Matrix D, which includes all the input data, is defined as follows [54]:

$$D = \begin{bmatrix} \varphi_1(1) & \cdots & \varphi_z(1) & \cdots & \varphi_n(1) \\ \vdots & & \vdots & & \vdots \\ \varphi_1(m) & \cdots & \varphi_z(m) & \cdots & \varphi_n(m) \end{bmatrix} \qquad (2)$$

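where $\varphi_n(m)$ is the n-th input data at time sample m. Indeed, the rows define the time sample and the columns define the input data index. To make it clearer, the input data matrix can be written as:

$$D = \begin{bmatrix} \varphi^{1} \\ \vdots \\ \varphi^{m} \end{bmatrix} \qquad (3)$$

$$\varphi^{t} = [\varphi_1(t), \varphi_2(t), \ldots, \varphi_z(t), \ldots, \varphi_n(t)], \quad t \in [1, m] \qquad (4)$$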

where z and t are the input data index and the time index, respectively. To apply GCA, the input data can be normalized as follows:

$$\varphi'_z(t) = \frac{\varphi_z(t) - \min \varphi_z(t)}{\max \varphi_z(t) - \min \varphi_z(t)} \qquad (5)$$

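After that, the grey coefficient is determined as:

$$\gamma(\varphi_O(t), \varphi'_z(t)) = \frac{\Delta_{\min} + \rho\,\Delta_{\max}}{\Delta_{oz}(t) + \rho\,\Delta_{\max}}, \quad \rho \in (0, 1) \qquad (6)$$

$$\Delta_{oz}(t) = |\varphi_O(t) - \varphi'_z(t)| \qquad (7)$$

$$\Delta_{\max} = \max |\varphi_O(t) - \varphi'_z(t)|, \quad z = 1, \ldots, n \qquad (8)$$

$$\Delta_{\min} = \min |\varphi_O(t) - \varphi'_z(t)|, \quad z = 1, \ldots, n \qquad (9)$$

where $\varphi_O(t)$ is the target data, $\varphi'_z(t)$ is the normalized z-th input data, and $\rho$ is a distinguishing factor with a value of 0.5 [68,69]. Finally, the grey correlation grade for different input data is calculated by:

$$Z(\varphi_O(t), \varphi'_z(t)) = \frac{1}{m} \sum_{t=1}^{m} \gamma(\varphi_O(t), \varphi'_z(t)) \qquad (10)$$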
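The GCA procedure of this appendix (min-max normalization, grey coefficient, grey correlation grade) can be sketched as follows. The function names and sample series are hypothetical, the target series is also min-max normalized here (an assumption; the source does not say whether the target is normalized), and every series is assumed to be non-constant so that the normalization is defined.

```python
def min_max(series):
    """Min-max normalization of one series to [0, 1]; assumes the series is not constant."""
    lo, hi = min(series), max(series)
    return [(v - lo) / (hi - lo) for v in series]

def grey_correlation_grades(target, inputs, rho=0.5):
    """Return one grey correlation grade per input series (higher = more influential)."""
    ref = min_max(target)                      # assumption: target is normalized as well
    normed = [min_max(col) for col in inputs]
    # Absolute differences between the target and every input at every time sample.
    deltas = [[abs(r - v) for r, v in zip(ref, col)] for col in normed]
    d_max = max(max(row) for row in deltas)    # global maximum difference
    d_min = min(min(row) for row in deltas)    # global minimum difference
    if d_max == 0.0:                           # all inputs coincide with the target
        return [1.0 for _ in inputs]
    grades = []
    for row in deltas:
        coeffs = [(d_min + rho * d_max) / (d + rho * d_max) for d in row]
        grades.append(sum(coeffs) / len(coeffs))   # average over the m time samples
    return grades

# Hypothetical data: the first input follows the target exactly, the second does not.
target = [2.0, 4.0, 6.0, 8.0]
inputs = [[1.0, 2.0, 3.0, 4.0], [8.0, 3.0, 5.0, 1.0]]
grades = grey_correlation_grades(target, inputs)
```

Inputs would then be ranked by grade and the highest-ranked ones retained as forecasting features; the selection threshold is a design choice that this sketch does not fix.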

Appendix C

• ARMA The Auto-Regressive Moving Average (ARMA) model is a common approach for time-series modeling and forecasting. p and q denote the orders of the autoregressive and the moving average parts, respectively. The model has a set of parameters $\{a, b_1, \ldots, b_p, c_1, \ldots, c_q\}$. An ARMA(p, q) model is defined as follows [10]:

$$Y(t) = a + \sum_{l=1}^{p} b_l\, Y(t-l) + \sum_{l=1}^{q} c_l\, \varepsilon(t-l) + \varepsilon(t) \qquad (11)$$

where $\varepsilon$ is white noise with a mean value equal to zero. As stated by [10,70], in short-term wind speed forecasting, low-order ARMA models give better results than high-order models.
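To make the ARMA(p, q) recursion concrete, the sketch below computes one-step-ahead forecasts from given coefficients. It assumes the coefficients are already known (no parameter estimation is shown), that the unknown future shock is replaced by its zero mean, and that the initial shocks are zero; all numbers are hypothetical.

```python
def arma_one_step(history, residuals, a, b, c):
    """One-step-ahead ARMA(p, q) forecast; the future shock is set to its mean, zero."""
    # b[l] multiplies Y(t - l - 1) and c[l] multiplies eps(t - l - 1) (0-based lists).
    ar = sum(b[l] * history[-1 - l] for l in range(len(b)))
    ma = sum(c[l] * residuals[-1 - l] for l in range(len(c)))
    return a + ar + ma

def rolling_forecast(series, a, b, c):
    """Forecast each point from its past, updating the shocks with the observed errors."""
    residuals = [0.0] * len(c)            # assumption: initial shocks are zero
    predictions = []
    for t in range(len(b), len(series)):
        pred = arma_one_step(series[:t], residuals, a, b, c)
        predictions.append(pred)
        residuals.append(series[t] - pred)  # realized shock at time t
    return predictions

# Hypothetical ARMA(2, 1) coefficients and wind speeds (m/s).
a, b, c = 0.5, [0.6, 0.2], [0.3]
single = arma_one_step([4.0, 5.0], [0.1], a, b, c)
rolled = rolling_forecast([4.0, 5.0, 4.33], a, b, c)
```

With p = 2 and q = 1 each forecast needs only the last two observations and the last shock, which is one practical reason low-order models are convenient for short-term forecasting.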

• SVM Support Vector Machines (SVMs) are based on statistical learning theory and have acceptable performance in time series forecasting. SVM differs from more traditional time series forecasting methodologies in the sense that there is no predefined “model” and the data drives the prediction [71]. It has been considered as a benchmarking method in short-term wind forecasting approaches [9,10,72]. To perform non-linear regression, SVM maps the input data space to a higher-dimensional feature space with a kernel function. The prediction function for non-linear regression is considered as follows [54,71]:

$$f(x) = (\omega \cdot \phi(x)) + b \qquad (12)$$

where $x$ is the input data vector, $b$ is the bias vector, $\phi$ is the mapping function and $\omega$ is the weight vector. The goal is to find the optimal set of weight and bias parameters by minimizing the total loss function. The final goal is to minimize the regularized risk, which is considered as follows:

$$\min: \ \frac{\lambda}{2} \|\omega\|^2 + \frac{1}{N} \sum_{i=1}^{N} L(x(i), y(i), f(x(i), \omega)) \qquad (13)$$

where $\lambda$ is the scale factor, $\|\omega\|$ is the Euclidean norm of the weights, N is the total number of samples, L is the loss function, y is the observed data and i is the index of a data sample. The loss function is considered as the $\varepsilon$-insensitive loss function:

$$L(x(i), y(i), f(x(i), \omega)) = \begin{cases} |y(i) - f(x(i), \omega)| - \varepsilon & \text{if } |y(i) - f(x(i), \omega)| \geq \varepsilon \\ 0 & \text{otherwise} \end{cases} \qquad (14)$$

where $\varepsilon$ is called the “tube size” and refers to the precision of the considered approximation. To find the optimal value of the parameters, Lagrange multipliers are introduced and the dual optimization problem is solved:

$$\max: \ -\frac{1}{2} \sum_{i,j=1}^{N} (\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*) \langle x(i), x(j) \rangle + \sum_{i=1}^{N} y(i)(\alpha_i - \alpha_i^*) \qquad (15)$$

$$\text{subject to:} \ \sum_{i=1}^{N} (\alpha_i - \alpha_i^*) = 0, \quad \alpha_i, \alpha_i^* \in [0, 1] \qquad (16)$$

where $\alpha_i$ and $\alpha_i^*$ are positive variables. Instead of mapping functions, a kernel function is specified that defines the inner product in the feature space. This is known as the “kernel trick” in the machine learning literature [73]. After finding the optimal values of the training parameters, the forecasting function can be defined as follows:

$$f(x) = \sum_{i=1}^{N} (\alpha_i - \alpha_i^*)\, k(x, x(i)) + b \qquad (17)$$

$$k(x, x(i)) = \langle \phi(x), \phi(x(i)) \rangle \qquad (18)$$
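The tube-shaped loss and the kernel form of the forecasting function can be sketched as below. The RBF kernel, its `gamma` parameter, and the support vectors and dual coefficients are all assumptions chosen for illustration; the source does not state which kernel was used, and solving the dual problem itself is not shown.

```python
import math

def eps_insensitive_loss(y, f, eps):
    """Zero inside the tube of half-width eps, linear in the error outside it."""
    return max(0.0, abs(y - f) - eps)

def rbf_kernel(x, z, gamma=1.0):
    """Radial basis function kernel (an assumed choice of kernel)."""
    return math.exp(-gamma * sum((xi - zi) ** 2 for xi, zi in zip(x, z)))

def svr_predict(x, support_vectors, dual_coefs, b, kernel=rbf_kernel):
    """Kernel expansion of the forecast; dual_coefs[i] stands for (alpha_i - alpha_i^*)."""
    return sum(d * kernel(x, sv) for d, sv in zip(dual_coefs, support_vectors)) + b

# Hypothetical trained model with two support vectors.
inside = eps_insensitive_loss(5.0, 5.05, eps=0.1)    # error 0.05 lies inside the tube
outside = eps_insensitive_loss(5.0, 5.5, eps=0.1)    # error 0.5 lies outside the tube
forecast = svr_predict([0.0], [[0.0], [1.0]], [0.5, -0.5], b=2.0)
```

Only samples whose dual coefficient is non-zero contribute to the sum, so the forecast depends on the support vectors rather than on the full training set.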


Appendix D

The learning curves of the fine-tuning procedure of the proposed method (SDAE with SR-NN) for one-hour and 10-minute intervals are presented in Fig. 1A and Fig. 2A, respectively.

Fig. 1A. Learning curve of the proposed method (SDAE with SR-NN) for 1-hour interval.

Fig. 2A. Learning curve of the proposed method (SDAE with SR-NN) for 10-minute interval.

Appendix E. Supplementary data Supplementary data to this article can be found online at https://doi.org/10.1016/j.seta.2019.100601.

References

[1] Pearre NS, Swan LG. Statistical approach for improved wind speed forecasting for wind power production. Sustain Energy Technol Assessments 2018. https://doi.org/10.1016/j.seta.2018.04.010.
[2] Dong Q, Sun Y, Li P. A novel forecasting model based on a hybrid processing strategy and an optimized local linear fuzzy neural network to make wind power forecasting: a case study of wind farms in China. Renew Energy 2017. https://doi.org/10.1016/j.renene.2016.10.030.
[3] Jahangir H, Tayarani H, Ahmadian A, Golkar MA, Miret J, Tayarani M, et al. Charging demand of plug-in electric vehicles: forecasting travel behavior based on a novel rough artificial neural network approach. J Clean Prod 2019;229:1029–44.
[4] Kavasseri RG, Seetharaman K. Day-ahead wind speed forecasting using f-ARIMA models. Renew Energy 2009;34:1388–93.
[5] Liu H, Tian H, Li Y. An EMD-recursive ARIMA method to predict wind speed for railway strong wind warning system. J Wind Eng Ind Aerodyn 2015;141:27–38.
[6] Liu H, Tian H, Li Y. Comparison of two new ARIMA-ANN and ARIMA-Kalman hybrid methods for wind speed prediction. Appl Energy 2012;98:415–24.
[7] Khodayar M, Wang J. Spatio-temporal graph deep neural network for short-term wind speed forecasting. IEEE Trans Sustain Energy 2018;10:670–81.
[8] Dahhani O, El-Jouni A, Boumhidi I. Assessment and control of wind turbine by support vector machines. Sustain Energy Technol Assessments 2018. https://doi.org/10.1016/j.seta.2018.04.006.
[9] Santamaría-Bonfil G, Reyes-Ballesteros A, Gershenson C. Wind speed forecasting for wind farms: a method based on support vector regression. Renew Energy 2016. https://doi.org/10.1016/j.renene.2015.07.004.
[10] Ezzat AA, Jun M, Ding Y. Spatio-temporal asymmetry of local wind fields and its impact on short-term wind forecasting. IEEE Trans Sustain Energy 2018.
[11] Ak R, Li YF, Vitelli V, Zio E. Adequacy assessment of a wind-integrated system using neural network-based interval predictions of wind power generation and load. Int J Electr Power Energy Syst 2018;95:213–26. https://doi.org/10.1016/j.ijepes.2017.08.012.
[12] Olaofe ZO. A 5-day wind speed & power forecasts using a layer recurrent neural network (LRNN). Sustain Energy Technol Assessments 2014. https://doi.org/10.1016/j.seta.2013.12.001.
[13] Petković D, Pavlović NT, Ćojbašić Ž. Wind farm efficiency by adaptive neuro-fuzzy strategy. Int J Electr Power Energy Syst 2016;81:215–21. https://doi.org/10.1016/j.ijepes.2016.02.020.
[14] Lingras P. Rough neural networks. Proc. 6th Int. Conf. Inf. Process. Manag. Uncertain. Knowledge-based Syst., 1996, p. 1445–50.
[15] Khosravi A, Koury RNN, Machado L, Pabon JJG. Prediction of wind speed and wind direction using artificial neural network, support vector regression and adaptive neuro-fuzzy inference system. Sustain Energy Technol Assessments 2018. https://doi.org/10.1016/j.seta.2018.01.001.
[16] Naik J, Dash S, Dash PK, Bisoi R. Short term wind power forecasting using hybrid variational mode decomposition and multi-kernel regularized pseudo inverse neural network. Renew Energy 2018. https://doi.org/10.1016/j.renene.2017.10.111.
[17] Lawan SM, Abidin WAWZ, Masri T, Chai WY, Baharun A. Wind power generation via ground wind station and topographical feedforward neural network (T-FFNN) model for small-scale applications. J Clean Prod 2017;143:1246–59. https://doi.org/10.1016/j.jclepro.2016.11.157.
[18] Nguyen VN, Jenssen R, Roverso D. Automatic autonomous vision-based power line inspection: a review of current status and the potential role of deep learning. Int J Electr Power Energy Syst 2018;99:107–20. https://doi.org/10.1016/j.ijepes.2017.12.016.
[19] Omrani H, Alizadeh A, Emrouznejad A. Finding the optimal combination of power plants alternatives: a multi response Taguchi-neural network using TOPSIS and fuzzy best-worst method. J Clean Prod 2018.
[20] Hu Q, Zhang R, Zhou Y. Transfer learning for short-term wind speed prediction with deep neural networks. Renew Energy 2016;85:83–95. https://doi.org/10.1016/j.renene.2015.06.034.
[21] Luo X, Sun J, Wang L, Wang W, Zhao W, Wu J, et al. Short-term wind speed forecasting via stacked extreme learning machine with generalized correntropy. IEEE Trans Ind Informatics 2018;14:4963–71.
[22] Zhi Wang H, Qiang Li G, Bing Wang G, Chun Peng J, Jiang H, Tao Liu Y. Deep learning based ensemble approach for probabilistic wind power forecasting. Appl Energy 2017. https://doi.org/10.1016/j.apenergy.2016.11.111.
[23] Wang L, Zhang Z, Long H, Xu J, Liu R. Wind turbine gearbox failure identification with deep neural networks. IEEE Trans Ind Informatics 2016;13:1360–8.
[24] Tong C, Li J, Lang C, Kong F, Niu J, Rodrigues JJPC. An efficient deep model for day-ahead electricity load forecasting with stacked denoising auto-encoders. J Parallel Distrib Comput 2017. https://doi.org/10.1016/j.jpdc.2017.06.007.
[25] Grozdić ĐT, Jovičić ST, Subotić M. Whispered speech recognition using deep denoising autoencoder. Eng Appl Artif Intell 2017;59:15–22. https://doi.org/10.1016/j.engappai.2016.12.012.
[26] Lee D, Choi S, Kim HJ. Performance evaluation of image denoising developed using convolutional denoising autoencoders in chest radiography. Nucl Instruments Methods Phys Res Sect A Accel Spectrometers, Detect Assoc Equip 2018;884:97–104. https://doi.org/10.1016/j.nima.2017.12.050.
[27] Hinton GE, Salakhutdinov RR. Reducing the dimensionality of data with neural networks. Science (80-) 2006;313:504–7.
[28] Feng C, Cui M, Hodge BM, Zhang J. A data-driven multi-model methodology with deep feature selection for short-term wind forecasting. Appl Energy 2017;190:1245–57. https://doi.org/10.1016/j.apenergy.2017.01.043.
[29] Wang HZ, Wang GB, Li GQ, Peng JC, Liu YT. Deep belief network based deterministic and probabilistic wind speed forecasting approach. Appl Energy 2016;182:80–93. https://doi.org/10.1016/j.apenergy.2016.08.108.
[30] Ahmadi G, Teshnehlab M. Designing and implementation of stable sinusoidal rough-neural identifier. IEEE Trans Neural Networks Learn Syst 2017;28:1774–86.
[31] Khodayar M, Kaynak O, Khodayar ME. Rough deep neural architecture for short-term wind speed forecasting. IEEE Trans Ind Informatics 2017;13:2770–9.
[32] Kassa Y, Zhang JH, Zheng DH, Wei D. A GA-BP hybrid algorithm based ANN model for wind power prediction. Smart Energy Grid Eng. (SEGE), 2016 IEEE, IEEE; 2016, p. 158–63.
[33] Men Z, Yee E, Lien F-S, Wen D, Chen Y. Short-term wind speed and power forecasting using an ensemble of mixture density neural networks. Renew Energy 2016;87:203–11. https://doi.org/10.1016/j.renene.2015.10.014.
[34] Salcedo-Sanz S, Ángel MPB, Ortiz-García EG, Portilla-Figueras A, Prieto L, Paredes D. Hybridizing the fifth generation mesoscale model with artificial neural networks for short-term wind speed prediction. Renew Energy 2009;34:1451–7. https://doi.org/10.1016/j.renene.2008.10.017.
[35] Wang D, Luo H, Grunder O, Lin Y. Multi-step ahead wind speed forecasting using an improved wavelet neural network combining variational mode decomposition and phase space reconstruction. Renew Energy 2017;113:1345–58. https://doi.org/10.1016/j.renene.2017.06.095.
[36] Yu C, Li Y, Zhang M. Comparative study on three new hybrid models using Elman Neural Network and Empirical Mode Decomposition based technologies improved by Singular Spectrum Analysis for hour-ahead wind speed forecasting. Energy Convers Manage 2017;147:75–85. https://doi.org/10.1016/j.enconman.2017.05.008.
[37] Yan J, Zhang H, Liu Y, Han S, Li L, Lu Z. Forecasting the high penetration of wind power on multiple scales using multi-to-multi mapping. IEEE Trans Power Syst 2018;33:3276–84.
[38] Qureshi AS, Khan A, Zameer A, Usman A. Wind power prediction using deep neural network based meta regression and transfer learning. Appl Soft Comput J 2017;58:742–55. https://doi.org/10.1016/j.asoc.2017.05.031.
[39] Jahangir H, Tayarani H, Baghali S, Ahmadian A, Elkamel A, Aliakbar Golkar M, et al. A Novel Electricity Price Forecasting Approach Based on Dimension Reduction Strategy and Rough Artificial Neural Networks. IEEE Trans Ind Informatics 2019:1–1. doi: 10.1109/TII.2019.2933009.
[41] Rahman A, Srikumar V, Smith AD. Predicting electricity consumption for commercial and residential buildings using deep recurrent neural networks. Appl Energy 2018. https://doi.org/10.1016/j.apenergy.2017.12.051.
[42] Hussain S, AlAlili A. A hybrid solar radiation modeling approach using wavelet multiresolution analysis and artificial neural networks. Appl Energy 2017. https://doi.org/10.1016/j.apenergy.2017.09.100.
[43] Rumelhart DE, Hinton GE, Williams RJ. Learning internal representations by error propagation. California Univ San Diego La Jolla Inst for Cognitive Science; 1985.
[44] Vincent P, Larochelle H, Bengio Y, Manzagol P-A. Extracting and composing robust features with denoising autoencoders. Proc. 25th Int. Conf. Mach. Learn., ACM; 2008, p. 1096–103.
[45] de Harrington P de B. Feature expansion by a continuous restricted Boltzmann machine for near-infrared spectrometric calibration. Anal Chim Acta 2018;1010:20–8. https://doi.org/10.1016/j.aca.2018.01.026.
[46] Salakhutdinov R, Hinton G. Deep Boltzmann Machines. AISTATS, vol. 1, 2009, p. 448–55. doi: 10.1109/CVPRW.2009.5206577.
[48] Al-Madi N, Faris H, Mirjalili S. Binary multi-verse optimization algorithm for global optimization and discrete problems. Int J Mach Learn Cybern 2019:1–21.
[49] Srinivas M, Patnaik LM. Adaptive probabilities of crossover and mutation in genetic algorithms. IEEE Trans Syst Man Cybern 1994;24:656–67.
[50] Pandey HM, Chaudhary A, Mehrotra D. A comparative review of approaches to prevent premature convergence in GA. Appl Soft Comput 2014;24:1047–77.
[51] Godfrey LB, Gashler MS. Neural decomposition of time-series data for effective generalization. IEEE Trans Neural Networks Learn Syst 2017;29:2973–85.
[52] Hinton G, Srivastava N, Swersky K. Neural networks for machine learning lecture 6a overview of mini-batch gradient descent. Cited On 2012:14.
[53] Renewable Energy and Energy Efficiency Organization 2018. http://www.satba.gov.ir/en/home (accessed February 21, 2018).
[54] Wang K, Xu C, Zhang Y, Guo S, Zomaya A. Robust Big Data Analytics for Electricity Price Forecasting in the Smart Grid. IEEE Trans Big Data 2017:1–1. doi: 10.1109/TBDATA.2017.2723563.
[55] Tsai M-S, Hsu F-Y. Application of grey correlation analysis in evolutionary programming for distribution system feeder reconfiguration. IEEE Trans Power Syst 2009;25:1126–33.
[56] Kingma DP, Ba J. Adam: A Method for Stochastic Optimization. 2015 ICLR. ArXiv Prepr ArXiv14126980 2015.
[57] Yu D, Ebadi AG, Jermsittiparsert K, Jabarullah NH, Vasiljeva MV, Nojavan S. Risk-constrained Stochastic Optimization of a Concentrating Solar Power Plant. IEEE Trans Sustain Energy 2019.
[58] Srivastava N, Hinton G, Krizhevsky A, Sutskever I, Salakhutdinov R. Dropout: a simple way to prevent neural networks from overfitting. J Mach Learn Res 2014. https://doi.org/10.1214/12-AOS1000.
[59] Paliwal M, Kumar UA. Neural networks and statistical techniques: a review of applications. Expert Syst Appl 2009;36:2–17.
[60] Heinermann J, Kramer O. Machine learning ensembles for wind power prediction. Renew Energy 2016;89:671–9. https://doi.org/10.1016/j.renene.2015.11.073.
[61] Kuik O, Branger F, Quirion P. Competitive advantage in the renewable energy industry: evidence from a gravity model. Renew Energy 2019. https://doi.org/10.1016/j.renene.2018.07.046.
[62] Yuan J, Farnham C, Azuma C, Emura K. Predictive artificial neural network models to forecast the seasonal hourly electricity consumption for a University Campus. Sustain Cities Soc 2018. https://doi.org/10.1016/j.scs.2018.06.019.
[63] Qian Q, Jin R, Yi J, Zhang L, Zhu S. Efficient distance metric learning by adaptive sampling and mini-batch stochastic gradient descent (SGD). Mach Learn 2015;99:353–72.
[64] Baldi P, Sadowski P. The dropout learning algorithm. Artif Intell 2014;210:78–122.
[65] Hinton GE, Srivastava N, Krizhevsky A, Sutskever I, Salakhutdinov RR. Improving neural networks by preventing co-adaptation of feature detectors. ArXiv Prepr ArXiv12070580 2012.
[66] Zhang J, Zhu Y, Zhang X, Ye M, Yang J. Developing a Long Short-Term Memory (LSTM) based model for predicting water table depth in agricultural areas. J Hydrol 2018. https://doi.org/10.1016/j.jhydrol.2018.04.065.
[67] Gupta R, Gupta S, Ojha M, Singh KP. Regularized Artificial Neural Network for Financial Data. Soft Comput. Probl. Solving, Springer; 2019, p. 745–55.
[68] Deng J. Introduction to Grey System. J Grey Syst 1989. https://doi.org/10.1016/j.ijthermalsci.2017.10.007.
[69] Liu S, Lin Y. Introduction to grey systems theory. Underst Complex Syst 2010. https://doi.org/10.1007/978-3-642-16158-2_1.
[70] Pourhabib A, Huang JZ, Ding Y. Short-term wind speed forecast using measurements from multiple turbines in a wind farm. Technometrics 2016;58:138–47.
[71] Sapankevych NI, Sankar R. Time series prediction using support vector machines: a survey. IEEE Comput Intell Mag 2009:4.
[72] Yang L, He M, Zhang J, Vittal V. Support-vector-machine-enhanced Markov model for short-term wind power forecast. IEEE Trans Sustain Energy 2015;6:791–9. https://doi.org/10.1109/TSTE.2015.2406814.
[73] Friedman J, Hastie T, Tibshirani R. The elements of statistical learning. vol. 1. Springer series in statistics New York, NY, USA; 2001.