Multi-task learning for the prediction of wind power ramp events with deep neural networks


Neural Networks 123 (2020) 401–411

Contents lists available at ScienceDirect

Neural Networks journal homepage: www.elsevier.com/locate/neunet


M. Dorado-Moreno a,∗, N. Navarin b,c, P.A. Gutiérrez a, L. Prieto e, A. Sperduti c, S. Salcedo-Sanz d, C. Hervás-Martínez a

a Department of Computer Science and Numerical Analysis, University of Cordoba, Córdoba, Spain
b Department of Computer Science, University of Nottingham, Nottingham, United Kingdom
c Department of Mathematics, University of Padova, Padova, Italy
d Department of Signal Processing and Communications, University of Alcalá, Alcalá de Henares, Spain
e Department of Energy Resource, Iberdrola, Madrid, Spain

Article info

Article history:
Received 23 November 2018
Received in revised form 27 October 2019
Accepted 20 December 2019
Available online 7 January 2020

Keywords:
Wind power ramp events
Multi-task learning
Multi-output
Deep neural networks
Renewable energies

Abstract

In Machine Learning, the most common way to address a given problem is to optimize an error measure by training a single model to solve the desired task. However, sometimes it is possible to exploit latent information from other related tasks to improve the performance of the main one, resulting in a learning paradigm known as Multi-Task Learning (MTL). In this context, the high computational capacity of deep neural networks (DNNs) can be combined with the improved generalization performance of MTL by designing independent output layers for every task and including a shared representation for them. In this paper we exploit this framework on a problem related to the prediction of Wind Power Ramp Events (WPREs) in wind farms. Wind energy is one of the fastest growing industries in the world, with potential global spreading and deep penetration in developed and developing countries. One of the main issues with the majority of renewable energy resources is their intrinsic intermittency, which makes it difficult to increase the penetration of these technologies into the energetic mix. Here, we focus on the specific problem of WPRE prediction: these events deeply affect wind speed and power prediction, and they are also related to turbine damage. Specifically, we exploit the fact that WPREs are spatially related events, in such a way that predicting the occurrence of WPREs in different wind farms can be taken as related tasks, even when the wind farms are far away from each other. We propose a DNN-MTL architecture that receives inputs from all the wind farms at the same time and predicts WPREs simultaneously at each of the farm locations. The architecture includes shared layers that learn a common representation of the information from all the wind farms, as well as specification layers that refine the representation to match the specific characteristics of each location.
Finally, we modify the Adam optimization algorithm to deal with imbalanced data, adding costs that are updated dynamically depending on the worst-classified class. We compare the proposal against a baseline approach based on building three different independent models (one for each wind farm considered), and against a state-of-the-art reservoir computing approach. The DNN-MTL proposal achieves very good performance in WPRE prediction, obtaining a good balance among all the classes included in the problem (negative ramp, no ramp and positive ramp).
© 2020 Elsevier Ltd. All rights reserved.

1. Introduction

∗ Corresponding author. E-mail addresses: [email protected] (M. Dorado-Moreno), [email protected] (N. Navarin), [email protected] (P.A. Gutiérrez), [email protected] (A. Sperduti), [email protected] (S. Salcedo-Sanz), [email protected] (C. Hervás-Martínez). https://doi.org/10.1016/j.neunet.2019.12.017 0893-6080/© 2020 Elsevier Ltd. All rights reserved.

Currently, there are many real-life problems which can be approached using machine learning techniques; perhaps the most common tasks are clustering and supervised regression and classification. When tackling related problems of the same task, each problem is usually solved individually, selecting a specific model for the task, training it and evaluating the performance of the results. However, in the presence of multiple related problems,


multi-task learning (MTL) can be used to improve generalization performance by exploiting the information found in the data collected for all the tasks. As soon as a model optimizes more than one loss function, it can be considered MTL. MTL has great applicability in real-world problems, but it needs specifically adapted methods. In order to use MTL, auxiliary tasks related to the main problem to solve must be carefully selected, since considering unrelated tasks in an MTL model can result in ‘‘negative transfer’’, i.e. a decrease of the performance with respect to an independently trained model. Caruana (1998) defines two tasks as similar if they can use the same features to make a decision. In this sense, Baxter (2000) states that related tasks need to share a common optimal hypothesis class, i.e. they need to have the same inductive bias. Finally, Xue, Liao, Carin, and Krishnapuram (2007) show that two tasks are similar if their classification boundaries are close enough. However, a deeper theoretical analysis of MTL and its applicability is still at an early stage. In the context of artificial neural networks (ANNs), MTL is usually accomplished by designing independent output layers for every task and including a shared representation for all of them. Originally, MTL ANNs received little attention, since the computational resources of the time limited their applicability: training an ANN with several hidden layers and multiple outputs used to be extremely difficult and inefficient. Recent deep learning technologies have revived MTL, thanks to new computational training techniques and the use of GPUs (graphics processing units). Examples of areas where it has been applied are computer vision (Zhang, Ghanem, Liu, & Ahuja, 2012; Zhang & Zhang, 2014) and natural language processing (NLP) (Collobert & Weston, 2008; Liu, Qiu, & Huang, 2016).
These techniques are able to combine the computational capacity of deep neural networks (DNNs) with the generalization performance achieved with MTL. In this work, we focus on a task related to wind energy, one of the fastest growing industries in the world, with potential global spreading and deep penetration in developed and developing countries. As an example, note that 15.6 GW of additional wind power capacity were installed in the European Union in 2017. It is also the second largest form of power generation in Europe, quickly approaching gas-based power plants. In the current situation of environmental concern and climate change, due to the massive use of fossil fuels, renewable energy resources in general – and wind energy in particular – play a key role in the migration away from fossil fuel energy sources and towards a new, greener and more sustainable economy. One of the main issues with the majority of renewable energy resources is their intrinsic intermittency, which makes it difficult to increase the penetration of these technologies into the energetic mix. Wind farms are facilities where wind turbines transform the kinetic energy of the wind into mechanical power, which a generator converts into electricity. However, there are problematic situations in which there are sudden, sustained changes in wind speed due to meteorological events such as passing fronts. These situations, known as wind power ramp events (WPREs), may lead to serious damage to wind turbines and tend to reduce the wind energy produced in the facility. Nowadays, the most successful way to deal with WPREs is to predict them accurately, in order to decide whether the turbines should be turned off. The prediction problem is usually approached in its binary version (non-ramp versus ramp events). However, there are two possible cases of WPREs. The first case, denoted as ‘negative ramp’ (NR), causes the source of energy to be too weak to produce cost-effective renewable energy. The second case covers those situations where the source of energy is so powerful that it can damage the wind farm, and it is denoted as ‘positive ramp’ (PR). ‘Non-ramp’ (NoR) periods are those when none of these extreme events occur,

i.e., when the wind farm is able to work normally and produce profitable renewable energy. Many different techniques have been previously applied to binary WPRE prediction: supervised regression or classification techniques (Taylor, 2017; Wang et al., 2017), recurrent neural networks (Dorado-Moreno, Cornejo-Bueno, Gutiérrez, Prieto, Hervás-Martínez et al., 2017; Dorado-Moreno, Cornejo-Bueno, Gutiérrez, Prieto, Salcedo-Sanz et al., 2017; Dorado-Moreno et al., 2016), meta-heuristics (Salcedo-Sanz, Pastor-Sánchez, Del Ser, Prieto, & Geem, 2015) and methods based on analyzing the statistical distributions of wind speed and power (Bossavy, Girard, & Kariniotakis, 2010; Cui, Feng, Wang, & Zhang, 2017; Cui et al., 2015). Other models focus on meteorological aspects of WPREs (Jahn, Takle, & Gallus, 2017; Xiong, Zha, Qin, Ouyang, & Xia, 2016). Finally, Díaz-Vico, Torres-Barrán, Omari, and Dorronsoro (2017) propose the use of DNNs not for WPRE prediction, but for wind energy prediction, a closely related task. In general, most of these papers propose a single model to predict the WPREs at each wind farm. That means one different model design – hyper-parameter validation, input vectors, training phase, etc. – per wind farm. In Spain alone, more than 25 provinces have wind farms, and most of them have more than 15 wind farms already in operation. Thus, trying to predict WPREs in all of them may result in too many models to handle at the same time. In this work, we propose a single general DNN model that receives inputs from all the wind farms at the same time and predicts WPREs for each of them. Our proposal is based on the fact that including information from other wind farms (in addition to the one where we want to predict WPREs) improves the prediction results, because these types of events are spatially related at synoptic scale.
Put more simply, a WPRE happening at a wind farm can be correlated with WPREs at other spatial locations (even at a large spatial scale), e.g. as a result of a weather front. Specifically, this paper considers data from three wind farms distributed across Spain, completing this information with measurements obtained from the ERA-Interim reanalysis project (Dee, Uppala, Simmons, Berrisford, & Poli, 2011) as predictive variables.

The rest of the paper is organized as follows: Section 2 reviews the state of the art in the main topics approached in this paper (MTL and WPRE prediction) and also describes the database used in this research. Section 3 introduces and motivates the use of an MTL DNN method instead of standard individual models and formally defines the proposed model. In Section 4, WPREs and their definition are described in more detail. The experimental design is introduced in Section 5, where the data management and the evaluation metrics are fully described, as well as the corresponding configuration. Sections 6 and 7 describe the main results obtained and the conclusions of the paper, respectively.

2. Related works

This section introduces the related works of the main areas under study in this paper: (1) deep multi-task learning; (2) wind power ramp events.

2.1. Deep multi-task learning

The central idea of Multi-Task Learning (MTL) is to share what is learned for a task with other tasks, trained in parallel (Caruana, 1998). MTL is an inductive transfer method, i.e. it provides a stronger inductive bias compared to Single-Task Learning (STL), which comes from the signals of the other tasks. MTL has been extensively studied in the literature. Besides having been shown to be


beneficial in many real-world problems, the reasons why it works have also been studied (Caruana, 1998). Recently, Maurer, Pontil, and Romera-Paredes (2016) have shown that better generalization bounds can be provided for MTL compared to STL. Intuitively, the classic approach of having a separate model for each task implicitly assumes independence among the different tasks. In MTL, instead, we do not start from that assumption. If the tasks are independent, MTL will (in principle) find out that there is no relation among them, and the MTL network will not perform better than single networks. On the contrary, if the tasks are not independent, the MTL approach will exhibit improved performance compared to STL, because it relies on fewer incorrect assumptions. In our case, we have spatio-temporal correlation between the different inputs. MTL has been shown to achieve better results than STL in specific problems due to the following intuitive reasons:

• Statistical data amplification: by having different tasks with independent noise that share an input feature in MTL, learning the influence of that feature becomes more accurate, as the network averages that feature through the different noise processes.

• Attribute selection: considering multiple tasks being learned by a single network can be beneficial in the process of selecting relevant features. In fact, only the features that are relevant for multiple tasks will be selected.

• Eavesdropping: when a feature is useful for one task but not for the others, STL usually discards this feature. Although this can be acceptable during the training process, the generalization accuracy may deteriorate. MTL is able to detect this situation and considers features that work generally well for all the considered tasks.

• Representation bias: neural networks, in general, are not guaranteed to converge to a global minimum. In fact, the solution space contains many local minima, and usually DNNs do not converge to the best solution. In MTL, the local minimum will be selected only in the intersection of the solution spaces of the related tasks. This restricts the hypothesis space, inducing additional bias, which generally results in reduced over-fitting.

These advantages have increased the number of studies in which MTL is applied. For example, Evgeniou and Pontil (2004) present an approach to MTL based on the minimization of regularization functionals similar to the ones that are successfully used in STL. Specifically, they compare a regularized multi-task Support Vector Machine (SVM) with a regularized single-task SVM on both benchmark and real data, showing that MTL outperforms STL. Some other works focus on feature learning. For example, in Argyriou, Evgeniou, and Pontil (2008) it is shown that the number of features learned by an MTL methodology should be limited due to the difficulties found in selecting the best features for learning multiple tasks.
The authors present a method for learning sparse representations shared across multiple tasks, based on a non-convex regularizer. They prove that their proposal, which is a mixture of supervised and unsupervised techniques, is equivalent to solving a convex optimization problem, converging to an optimal solution. Liu, Ji, and Ye (2009) also propose an approach based on both MTL feature learning and regularization. This work proposes a framework that accelerates the computation by reformulating the ℓ2,1-norm regularization as two equivalent optimization problems. After performing experiments with several datasets, the authors conclude that the proposed formulation allows both norms to be computed easily in MTL. There are other approaches following the same underlying idea as MTL; for example, in Zhou, Cichocki, Zhang, and Mandic (2016) the authors use multiblock data analysis to extract the


common and individual features of the multiblock data, and the objective is similar to that of MTL, i.e. to exploit features shared across different tasks to improve the performance of a model. Recently, after the arrival of Deep Learning (DL) methodologies, MTL techniques have been recovered and applied to computer vision and NLP problems. In computer vision, Zhang et al. (2012) proposed a framework to track objects in a particle filter: the representation of each particle is learned, considering each representation as a single task but learning all of them together. In Zhang and Zhang (2014), a deep convolutional neural network was applied to a face detection problem, simultaneously learning the face and non-face decision functions. In the field of NLP, a single convolutional neural network (Collobert & Weston, 2008) was trained using a mixture of MTL and unsupervised learning to tackle different related tasks, such as jointly predicting the semantic roles or the likelihood of a sentence (where one of them is not supervised). A more recent work (Liu et al., 2016) proposes a recurrent neural network instead of the well-known convolutional network to jointly learn across multiple related NLP tasks, using three different configurations of task-specific and shared layers. To the best of our knowledge, MTL has not been applied to the prediction of meteorological events at different geographical locations. The closest work is Zhu, Chen, Zhu, Duan, and Liu (2018), where the authors designed a deep convolutional neural network to predict wind speed taking into account the spatio-temporal correlation of the data from different turbines in a single wind farm. They use a single output, so this cannot be considered an MTL approach, although their inputs are built using information from many different locations and times within a single wind farm.
Wind speed is spatio-temporally correlated, but only at short distances (not at a synoptic scale); thus, our approach for WPRE prediction is not directly comparable with theirs, but the intuition of using correlated meteorological information to improve the prediction is closely related.

2.2. Wind power ramp events

A WPRE, as mentioned before, is a strong increase (positive ramp) or decrease (negative ramp) of wind speed that prevents the wind farm from working properly or can even damage the turbines. Usually, a simplified binary version of the problem is considered in the literature (Gallego-Castillo, Cuerva-Tejero, & López-García, 2015): both positive and negative ramps are merged into a single ramp class, and the second class represents the normal state, resulting in a binary classification problem. The simplest way to define ramps is to consider two measurements of wind speed (wind power, in fact) at a predefined time interval; depending on the magnitude of their difference, the ramp can be detected. In the literature, the most common way to decide whether a ramp event has occurred is to define a ramp function S(t) for every instant t, measuring the maximum difference in power generated during a fixed time interval ∆ before t:

S(t) = max_{i ∈ [t−∆, t]} (P_i) − min_{i ∈ [t−∆, t]} (P_i),    (1)
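Eq. (1) can be sketched directly in code, together with the three-class labelling (NR, NoR, PR) introduced in Section 1. The threshold value, the sign-disambiguation rule and the toy power series below are illustrative assumptions, not values from the paper:

```python
import numpy as np

def ramp_function(P, t, delta):
    """S(t): max minus min generated power over the window [t - delta, t] (Eq. (1))."""
    window = P[t - delta:t + 1]
    return window.max() - window.min()

def label_wpre(P, t, delta, threshold):
    """Three-class labelling: negative ramp (NR), no ramp (NoR), positive ramp (PR).

    The sign is decided by whether power rose or fell across the window;
    this disambiguation rule is an assumption for illustration.
    """
    if ramp_function(P, t, delta) <= threshold:
        return "NoR"
    window = P[t - delta:t + 1]
    return "PR" if np.argmax(window) > np.argmin(window) else "NR"

# Toy hourly power series (MW): a flat stretch followed by a sharp rise.
P = np.array([10.0, 10.5, 10.2, 10.4, 25.0, 26.0])
print(label_wpre(P, t=5, delta=3, threshold=10.0))  # sharp increase -> "PR"
print(label_wpre(P, t=3, delta=3, threshold=10.0))  # flat stretch   -> "NoR"
```

In practice the threshold is farm-dependent, since it is applied to power obtained through each farm's power curve.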

In this expression, P_i is the power generated at time i. P_i is obtained from wind speed using a function known as the power curve, which can be defined for each wind farm. If the value of S(t) is higher than a predefined threshold, a WPRE may have happened; if that threshold is not surpassed, there has been a normal state and the wind farm can function properly. The problem of ramp prediction has been approached by the combination of meta-heuristics and machine learning techniques (Salcedo-Sanz et al., 2015), considering an extreme learning machine neural network whose feature selection is carried


Table 1
First source of information: variables recorded by wind farm sensors.

Variable name | Description    | Unit
ws            | wind speed     | m/s
wd            | wind direction | degrees

out by a coral reefs optimization algorithm. The results obtained for the binary version of the problem are promising. Statistical analysis of the distribution of different variables associated with WPREs has been applied in Taylor (2017), where the authors propose an autoregressive logit model to predict WPREs, jointly modeling the probability of ramp events at more than one wind farm by using a multinomial logit formulation. In the field of artificial neural networks, one of the approaches for WPRE detection is the use of recurrent neural networks, specifically echo-state networks (ESNs). Dorado-Moreno, Cornejo-Bueno, Gutiérrez, Prieto, Hervás-Martínez et al. (2017) propose different RC architectures whose input variables are a mixture of real data obtained from the wind farm sensors and reanalysis data obtained from the ERA-Interim reanalysis project. A binary problem instead of the multiclass one is approached using a logistic regression along with the RC architecture. In Dorado-Moreno et al. (2016), the logistic regression is replaced by a weighted support vector machine (SVM), in order to tackle the multi-class problem. On the other hand, in Díaz-Vico et al. (2017), a DNN is applied to solve this problem as a regression problem. A DNN architecture with two convolutional layers followed by two fully connected layers and a linear readout is used in order to obtain the desired output. A cross-validation over the hyper-parameters of the DNN is considered, and an ensemble of these models is finally trained. They compare their results with a Support Vector Regressor and a LeNet model, showing that the best results are obtained by their DNN ensemble proposal.

2.3. Database

Here, we introduce the two sources of data considered in our study, presenting their relation and the matching process between them. The first information source corresponds to wind data obtained every hour from three wind farms distributed across Spain, as shown in Fig. 1. The second source of information, used to obtain the predictive variables, is the ERA-Interim reanalysis project, widely used in the literature (Cannon, Brayshaw, Methven, Coker, & Lenaghan, 2015; Dorado-Moreno, Cornejo-Bueno, Gutiérrez, Prieto, Hervás-Martínez et al., 2017; Gallego-Castillo, García-Bustamante, Cuerva-Tejero, & Navarro, 2015), which provides meteorological data every 6 h worldwide.

The first source of information is obtained using sensors placed in the three wind farms. Recall that WPREs are mainly related to synoptic fronts, which means that their occurrence can be related over large spatial scales. The wind speed and wind direction input variables included in Eq. (5) are defined as:

ws_f(t) = (w_f^s(t − 1), w_f^s(t − 2), . . . , w_f^s(t − T)),
wd_f(t) = (w_f^d(t − 1), w_f^d(t − 2), . . . , w_f^d(t − T)),

where T is the size of the input window used for prediction. In words, the input at time t is composed of historical data coming from all three wind farms over a time window of size T (Ouyang, Zha, Qin, & Kusiak, 2016). The units and descriptions of these two variables are included in Table 1. These measurements are also used to obtain the class labels, as explained in Section 2.2: S(t) is calculated by transforming the wind speed at time t (w^s(t)) into power P_t using the corresponding

Table 2
Second source of information: meteorological variables selected from the ERA-Interim reanalysis project.

Variable name | Description                              | Unit
z0            | surface temperature                      | K
z1            | surface pressure                         | Pa
z2            | zonal wind component (u) at 10 m         | m/s
z3            | meridional wind component (v) at 10 m    | m/s
z4            | temperature at 500 hPa                   | K
z5            | zonal wind component (u) at 500 hPa      | m/s
z6            | meridional wind component (v) at 500 hPa | m/s
z7            | vertical wind component (w) at 500 hPa   | m/s
z8            | temperature at 850 hPa                   | K
z9            | zonal wind component (u) at 850 hPa      | m/s
z10           | meridional wind component (v) at 850 hPa | m/s
z11           | vertical wind component (w) at 850 hPa   | m/s

curve. The main problem with this source of information is that sensors can stop working for different reasons, producing missing data, which can break the temporal structure of the time series.

The second source of information is a database that can be downloaded from the ERA-Interim reanalysis project (Dee et al., 2011). It is maintained by the European Centre for Medium-Range Weather Forecasts and includes a large set of meteorological variables, worldwide, every 0.125 degrees (latitude and longitude) and every 6 h, from 1979. We specifically selected the variables listed in Table 2. These variables provide important additional information about the pressure at surface level and about the wind and temperature at different heights (pressure levels), which can help in the considered prediction problem. The vector including the variables of Table 2 is denoted as z_f(t) (which was also used in Eq. (5)). z_f(t) includes information from the instant at which we want to predict the WPRE, as the physical models used in reanalysis data are able to obtain very reliable estimations of the variables considered (Dee et al., 2011). In this way, z_f(t) is defined as:

z_f(t) = (z_f^0(t), z_f^1(t), . . . , z_f^11(t)).

3. MTL deep neural network proposal

We introduce the proposed neural network model by making explicit the assumptions we make on the conditional probability of the output given the input. In fact, when suitable activation functions are used for the output layer of a NN, it is possible to interpret the output of a NN as a probability, or as a number proportional to a probability. The network architecture corresponds to a specific factorization of such a probability, a factorization that depends on the assumptions made on the dependencies among the observed variables. Let {(x_i(t), y_i(t)), i = 1, . . . , N_t} be the available data for N_t different tasks over time t > 0. If we are interested in predicting y_i(t) from the past, we can define a general predictive model that computes

P(y_1(t), . . . , y_{N_t}(t) | x_1(t − 1), . . . , x_1(0), . . . , x_{N_t}(t − 1), . . . , x_{N_t}(0)).    (2)

A common assumption, which is especially true for the setting considered in this paper, is that the phenomenon being modeled does not present long-time dependencies. We can thus limit the dependency over time to a fixed time frame ∆t (e.g. Markovian models fix ∆t = 1), i.e.:

P(y_1(t), . . . , y_{N_t}(t) | x_1(t − 1), . . . , x_1(t − ∆t), . . . , x_{N_t}(t − 1), . . . , x_{N_t}(t − ∆t)).    (3)
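Under the fixed time frame of Eq. (3), the input for each farm at time t reduces to its last T lagged observations, exactly as in the window definitions of Section 2.3. A minimal sketch of building such a lagged window (the toy series and sizes are illustrative assumptions):

```python
import numpy as np

def lagged_window(series, t, T):
    """Return (series[t-1], series[t-2], ..., series[t-T]), i.e. the T most
    recent past values, as used for ws_f(t) and wd_f(t)."""
    return np.array([series[t - k] for k in range(1, T + 1)])

ws = np.arange(10.0)              # toy hourly wind-speed series for one farm
x = lagged_window(ws, t=6, T=3)
print(x)                          # most recent lag first: [5. 4. 3.]
```

Concatenating the wind-speed window, the wind-direction window and the 12 reanalysis variables then yields the per-farm input of dimension 2T + 12.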


Fig. 1. Location of the three wind farms and the reanalysis nodes considered in the study.

In the commonly adopted STL, an additional assumption is made: independence among tasks, i.e. in STL Eq. (3) is assumed to be equivalent to:

P(y_1(t) | x_1(t − 1), . . . , x_1(t − ∆t)) × · · · × P(y_{N_f}(t) | x_{N_f}(t − 1), . . . , x_{N_f}(t − ∆t)).    (4)

When using a single hidden-layer neural network to estimate the probabilities of each factor, in order to make explicit the latent variables that correspond to the hidden units, it is common to exploit the chain rule of probability, as follows. Let h_i(t) be the hidden-state representation at time t of the network trained for the ith task. We can then define the output of the network as (for the sake of notation, in the following we do not show on the left side of the equations the conditioning of the output variables w.r.t. the input):

P(y_1(t), . . . , y_{N_f}(t)) = P(y_1(t) | h_1(t)) P(h_1(t) | x_1(t − 1), . . . , x_1(t − ∆t)) × · · · × P(y_{N_f}(t) | h_{N_f}(t)) P(h_{N_f}(t) | x_{N_f}(t − 1), . . . , x_{N_f}(t − ∆t)).

It is not difficult to extend this definition to the case of l hidden layers, although it is usually preferred to show explicitly only the last hidden layer, by considering the intermediate hidden layers as internal variables used to parameterize the corresponding probabilistic factor. In many cases this independence assumption among tasks does not hold; clearly, it does not hold in our case. In fact, for the task of WPRE prediction, the spatial relationships among farms can create dependencies between the past state of one farm and the current state of another one that is geographically close. Thus, we want to avoid the STL assumption. We can then define a neural network model that models the probability in Eq. (3) using two latent spaces: one that is shared among the different tasks (whose representation is given by the hidden state h^C(t)), and one that is specialized for the specific tasks (i.e. h^S_1(t), . . . , h^S_{N_f}(t)), thus capturing in the shared latent space the spatial correlations among farms, and giving each farm the possibility to express its specificity (i.e. local geographic conditions) through its own specialized latent space. Following this line, our model becomes:

P(y_1(t), . . . , y_{N_f}(t)) = P(y_1(t) | h^S_1(t)) P(h^S_1(t) | h^C(t)) × · · · × P(y_{N_f}(t) | h^S_{N_f}(t)) P(h^S_{N_f}(t) | h^C(t)) × P(h^C(t) | x_1(t − 1), . . . , x_1(t − ∆t), . . . , x_{N_f}(t − 1), . . . , x_{N_f}(t − ∆t)).

The above factorization makes explicit the main proposal of this work, i.e. to take advantage of MTL in order to improve the predictive performance over the single tasks. MTL improves generalization by learning the different tasks in parallel, while using shared hidden layers (in the case of ANNs). In this way, what is learned for one task is useful to learn the other related tasks. With the proposed technique, we are able to predict the WPREs in all the wind farms with a single network, receiving inputs from all of them. Thus, there are two main benefits in our proposed approach: first, having a single model to predict WPREs at different wind farms is faster, from a computational point of view, than using individual networks; second, the predictive performance for the single tasks is improved, because the multi-task model considers at the same time the information provided by all the wind farms, leveraging the fact that meteorological processes are correlated through time and space. This fact is well known, and it has been exploited and studied before. For example, in Jiang, Zhuang, Huang, Wang, and Fu (2013) the spatio-temporal relationship of wind speed in offshore wind farms is analyzed. It is true that in offshore wind farms (over the sea) this spatio-temporal relationship among different wind farms is stronger than in inland installations, mainly because the spatial relationship among different zones is very strong over the sea. However, it also appears inland, as pointed out in Santos-Alamillos, Thomaidis, Quesada-Ruiz, Ruiz-Arias, and Pozo-Vázquez (2016) for the case of Spain. In Chidean, Caamaño, Ramiro-Bargueño, Casanova-Mateo, and Salcedo-Sanz (2018), another analysis of these relationships for the Iberian Peninsula is carried out with a clustering approach.
That work shows very large zones with similar wind speed characteristics (cluster zones), even at synoptic scale, as previously reported in Santos-Alamillos et al. (2016). The spatio-temporal relationship of wind speed over large areas has also been exploited for wind speed forecasting, as shown in Lucheroni, Boland, and Ragno (2019) and Sun, Feng, and Zhang (2019), which exploit this spatio-temporal relation at different spatial scales. Thus, the


motivation to apply MTL is that some meteorological events, such as WPREs, are related in a large spatial scale, so that their prediction can be based on shared information. The architecture of the deep neural network model is shown in Fig. 2, and it consists of two well-differentiated parts: the first one is composed of NC shared layers whose objective is to extract the most important information from the input vector. The second one is composed of groups of NS specification layers which are focused on the extraction of specific information for each single task (i.e. WPRE prediction in each wind farm). The special case where NC = 0 will produce one individual and independent model per each wind farm. With the purpose that the shared layers are able to extract the common information, a bottleneck is forced in the model, which compresses all the inputs (from the three wind farms), keeping only the relevant information. We define a bottleneck in a neural network as a layer with less neurons than the layer below it or above it. The amount of information that can be encoded in the bottleneck layer is necessarily lower compared to the previous layers. During training, each layer is encouraged to retain information useful for the task. The bottleneck forces the network to perform a compression, that may be useful for denoising the signal coming from lower layers, as well as for retaining just the subset of the available information that is related to the problem at hand. To do so, we set the minimum number of shared layers to NC = 2 (except for the special case when NC = 0, i.e. our baseline model). The number of neurons of the different layers is defined by the parameter N. Based on it, the layer sizes of the first and second shared layers are fixed to 2N and N /2, respectively, forcing the already mentioned bottleneck. Such bottleneck layers are commonly adopted, e.g. in the Inception network (Szegedy et al., 2015). 
Another interpretation of the bottleneck layer is a non-linear PCA (Scholz, Fraunholz, & Selbig, 2008) (the first shared layer is necessary to make it non-linear), with a denoising effect. The layer after the bottleneck (our third shared layer) can be interpreted as a decoding layer, which is useful to ''decompress'' the information after the bottleneck. The rest of the layers are configured with N neurons, as shown in Fig. 2. We start by defining how the different inputs are processed: X(t) = {x_1(t), x_2(t), ..., x_{N_f}(t)}, where N_f is the number of wind farms and X(t) ∈ R^{N_f×(2T+12)} is the global input of the model at time t (T is the window size, defined later in Section 5). In this way, each input pattern includes the information from three different wind farms. Moreover, for each farm, the input information is defined as:

x_f(t) = (ws_f(t), wd_f(t), z_f(t)),   (5)

where f is the index of the farm, f ∈ {1, 2, ..., N_f}, ws_f(t) ∈ R^T and wd_f(t) ∈ R^T are vectors including past wind speed and direction values measured by the sensors, respectively, and z_f(t) ∈ R^12 denotes a vector of reanalysis variables obtained from the ERA-Interim reanalysis project. These variables will be further described in Section 5. The model output at time t includes the predictions for the three considered wind farms: Y(t) = {y_1(t), y_2(t), ..., y_{N_f}(t)}. For the sake of simplicity, from now on, the time index ''(t)'' will be removed from the equations. Let h^C_1 ∈ R^{2N}, h^C_2 ∈ R^{N/2} and h^C_c ∈ R^N, for c ∈ {3, ..., N_C}, be the vectors with the outputs of the common or shared hidden neurons, and let h^S_{f,s} ∈ R^N be the vector with the outputs of the hidden neurons of the s-th specification layer for wind farm f. A feed-forward multi-task model defines a mapping y = f(x; θ) and learns the value of the parameters θ that result in the best approximation function. In our case:

θ = (W^O_f, W^S_{f,s}, b^S_{f,s}, W^C_c, b^C_c, W^C_1, b^C_1),   (6)

where W^O_f ∈ R^{N×3} are the output layer weights, s ∈ {1, ..., N_S} and c ∈ {2, ..., N_C}. For each wind farm f, the WPRE prediction will be computed using the following equation:

y_f = σ(W^O_f h^S_{f,N_S} + b^O_f),   (7)

where σ is the softmax function, b^O_f is the output bias, and the output of a specification layer (h^S_{f,s}) is computed as follows: h^S_{f,s} = Φ(W^S_{f,s} h^S_{f,s−1} + b^S_{f,s}), for s ∈ {1, ..., N_S}, where W^S_{f,s} ∈ R^{N×N} includes the weights between the s-th layer and the previous one (s − 1), b^S_{f,s} ∈ R^N includes the corresponding biases, and Φ is the sigmoid transfer function, Φ(x) = 1/(1 + e^{−x}). The base of this recursion is defined by connecting the first specification layer with the last shared one: h^S_{f,0} = h^C_{N_C}. The case N_S = 0 would define a model with no specification layers, meaning that the last shared layer would be connected directly to the different outputs. The shared recursion is defined as: h^C_c = Φ(W^C_c h^C_{c−1} + b^C_c), for c ∈ {2, ..., N_C}, with h^C_1 = Φ(W^C_1 h^C_0 + b^C_1), where W^C_c ∈ R^{N×N} is the matrix of weights between the c-th shared layer and the previous one (c − 1), and b^C_c ∈ R^N is the corresponding bias vector, except for the case c = 2, where W^C_2 ∈ R^{2N×N/2} and b^C_2 ∈ R^{N/2}, in order to force the bottleneck in the network. Finally, W^C_1 ∈ R^{N_f(2T+12)×2N} is the matrix of weights from the input layer to the first hidden layer and b^C_1 ∈ R^{2N} is the corresponding bias vector. The first shared layer is connected to the input layer as follows: h^C_0 ≡ x(t). The special case N_C = 0 means that there are no shared layers, so that h^S_{f,0} ≡ x_f(t), f ∈ {1, 2, ..., N_f}; that is, three individual and independent prediction models are obtained, one for each wind farm.

3.1. Training algorithm

In order to optimize the MTL deep neural network, we select the adaptive moment estimation (Adam) algorithm (Kingma & Ba, 2014), due to its competitive results in the literature. The Adam algorithm does not require any extra hyper-parameter tuning apart from the usual learning rate (α) and batch size (B). To adopt the MTL paradigm, we perform three back-propagation iterations for every batch, i.e. one per output. In this way, the links in the specification layers are updated depending on their corresponding output error, while the links in the shared layers are updated by the back-propagation of every output error, resulting in a final update equal to the sum of the individual updates. On the other hand, in order to deal with the high imbalance degree of the data, we have modified the original algorithm. The proposed modification adds weights to the cost function, which are adapted dynamically considering the worst classified class after each epoch. The weight representing each class, W^j, will be initialized to 1, and it will be updated in every

Fig. 2. Proposed Multi-Task Learning Deep Neural Network, where the inputs include measurements from all the considered wind farms, and one value per farm is obtained in the output layer.

epoch by multiplying the weight of the worst classified class by an update factor (W_u > 0). After that, the weights are re-normalized:

W'^j = (W^j N) / (Σ_{i=1}^{N_f} W^i N^i),  j ∈ {1, 2, ..., N_f},

where N is the number of patterns in the dataset and N^i is the number of patterns belonging to class i. In this way, the sum of the products of each weight by the number of patterns of its corresponding class equals the total number of patterns in the database: N = W^1 N^1 + W^2 N^2 + ··· + W^{N_f} N^{N_f}. We selected this dynamic cost instead of re-sampling techniques because the application of re-sampling in this context would be problematic: given that we are facing three different three-class classification tasks simultaneously, we would have had to evaluate all the different combinations of classes (N^2 combinations) and over-sample those with the lowest representation. Finally, we included dropout (Srivastava, Hinton, Krizhevsky, Sutskever, & Salakhutdinov, 2014) to avoid over-fitting the data with the deep model. This adds the last hyper-parameter to configure for the algorithm, the dropout rate (D).

4. WPREs definition

Predicting whether a wind ramp in a farm is positive or negative is extremely important, because the procedures that have to be activated for these two events are completely different. For this reason, in this paper, we consider positive and negative ramps as two different classes, which, together with the normal class, results in a 3-class classification problem. To define wind ramps,

the simplest idea is to consider two measurements of wind speed (or wind power) at a predefined time interval and, depending on how large the difference is, decide whether a ramp has occurred in that interval or not. However, different wind farms (with different types of turbines installed) will react differently to the same wind speed (i.e. they will have different power curves). For this reason, a common approach in the literature is to measure directly the instantaneous power P_t produced in a wind farm at time t. More formally, we define a ramp function (S(t)) to measure the maximum difference in the instantaneous powers generated over a fixed time interval (see Eq. (1)). In our study, ∆ is set to six hours due to the temporal resolution of the reanalysis data. Then, in order to detect whether S(t) is a candidate for being a positive or a negative ramp, we have to find out which of the two measurements (max and min) occurred first. To do that, we define two indexes: i_min = argmin_{i∈[t−∆,t]} P_i and i_max = argmax_{i∈[t−∆,t]} P_i. These two indexes are combined with a threshold (S_0) to decide whether the difference is large enough to be a ramp. Many proposals about how to define this threshold are presented in the literature (Gallego-Castillo, Cuerva-Tejero et al., 2015). Here, we use the most common one, which sets the threshold at 50% of the wind farm rated power, i.e. half of the maximum energy the wind farm can produce. Thus, in our setting, if in one of the considered six-hour intervals there is an increase (or decrease) of more than 50% of the maximum energy the wind farm can produce, then a WPRE occurred. We can then define a WPRE with the following function:

y_f(t) = C_NR,   if S(t) ≥ S_0 and i_max < i_min,
         C_NoR,  if S(t) < S_0,
         C_PR,   if S(t) ≥ S_0 and i_max > i_min,

where NR stands for negative ramp, NoR stands for no ramp and PR for positive ramp, and C_NR, C_NoR and C_PR are the class labels assigned to the events NR, NoR and PR, respectively.

5. Experimental design

In this section, the characteristics of the dataset are described in detail, as well as its building process. Furthermore, we explain how the proposed model is tuned and how the experiments were designed.

5.1. Data matching

Matching both sources of information involves two main problems. The first one is that, since the reanalysis data is computed every 0.125 degrees in latitude and longitude, it is very unlikely that a wind farm is located exactly at one of those geographical locations. To solve this, we decided to select the four nearest points surrounding the wind farm, computing a weighted average of each variable which takes into consideration the distance from each reanalysis node to the wind farm. In this way, the nearest node is given more importance. First, the (great-circle) distance from each reanalysis node to the wind farm is calculated as follows:

d(p^w, p^r_j) = arccos(sin(lat^w) · sin(lat^r_j) + cos(lat^w) · cos(lat^r_j) · cos(lon^w − lon^r_j)),

where p^w = (lat^w, lon^w) is the wind farm geographical position, p^r_j = (lat^r_j, lon^r_j) is the location of the j-th reanalysis node, and lat and lon are the latitude and longitude of the points, respectively. Once the distance from each of the four reanalysis nodes (the four black points surrounding the wind farm in Fig. 1) to the wind farm is calculated, these distances are inverted and normalized, considering that the shorter the distance, the larger the weight that reanalysis node should have. Thus, fixed a wind farm p^w, we compute the weight of the j-th reanalysis node as follows:

w_j = (1 / d(p^w, p^r_j)) / (Σ_{i=1}^{4} 1 / d(p^w, p^r_i)),  j = 1, ..., 4,

so that Σ_{j=1}^{4} w_j = 1.

After calculating these weights, they are applied to obtain a weighted average of each of the 12 variables in Table 2. The second problem we had to face is that, in the two databases, the information is recorded with a different temporal resolution: variables from the wind farms are recorded hourly, while reanalysis data is available every 6 h. There are two main options to solve this problem. The first one is to replicate the reanalysis information (6 h resolution) every hour, so that every record in the wind farm can be matched with reanalysis information. The second option is to reduce the sampling rate of the wind farm sensors by a factor of 6, considering a sampling time of 6 h for that database as well. This means that only 4 records are available every 24 h. We selected this second option for two main reasons:

1. It avoids the correlation that would be introduced in the dataset if the reanalysis data were hourly replicated.
2. It gives us more flexibility in dealing with missing data. If, in the selected sample, some sensor data are missing, we can recover them using the five preceding measurements without introducing any correlation in the produced dataset (see Section 5.2).

After reducing the information from the wind farm sensors, we only have to match every record with its corresponding reanalysis data.
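As a sketch, this node-weighting step can be implemented as follows; the farm and grid-node coordinates are hypothetical, and the distance is the standard spherical law of cosines used above:

```python
import math

def great_circle(lat1, lon1, lat2, lon2):
    """Central angle between two points (coordinates in radians)."""
    return math.acos(math.sin(lat1) * math.sin(lat2) +
                     math.cos(lat1) * math.cos(lat2) * math.cos(lon1 - lon2))

def node_weights(farm, nodes):
    """Normalized inverse-distance weights of the reanalysis nodes around a farm."""
    inv = [1.0 / great_circle(*farm, *node) for node in nodes]
    total = sum(inv)
    return [w / total for w in inv]

# Hypothetical farm position and the surrounding 0.125-degree reanalysis grid nodes
farm = (math.radians(40.10), math.radians(-3.60))
nodes = [(math.radians(la), math.radians(lo))
         for la in (40.0, 40.125) for lo in (-3.625, -3.5)]

weights = node_weights(farm, nodes)
print(weights)  # the weights sum to 1; the nearest node gets the largest weight
```

The weighted average of each reanalysis variable is then simply `sum(w * v for w, v in zip(weights, values))` over the four node values.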

5.2. Missing measurements in wind farms

It is well known that sensors are prone to errors, which leads to the possibility of missing values among the measured variables. In this case, the sensors placed at the wind farms collect wind speed and direction, and we observed 3%, 10% and 7% of missing values in farms A, B and C, respectively. As we are dealing with time series, deleting these records is not an option, as it would damage the temporal structure of the data. In order to impute their values, we decided to use the reanalysis data, as it is reliable data computed using physical models. We applied a regression technique to recover missing wind speed and direction values, considering as independent variables the set of 48 predictors for each wind farm (12 predictors per reanalysis node). Specifically, we used a gradient boosted regression tree ensemble (Elith & Leathwick, 2008), as this is one of the best performing methodologies for regression tasks. The non-missing measurements were used to train the regressor, and the missing data were imputed using the obtained model.

5.3. Experiments configuration

In Section 3, the architecture of the model and the training algorithm have been defined using some hyper-parameters, namely NC (number of shared layers), NS (number of specification layers), T (window size) and N (number of hidden neurons). With these 4 hyper-parameters, the structure of the deep neural network is fully defined. Moreover, the training algorithm needs a specific value for α (learning rate), B (batch size), Wu (imbalance weight update factor) and D (dropout probability). To fit all these hyper-parameters, we have used a training set containing the first 7 years of data, a validation set composed of the 8th year and a test set with the last year.
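A minimal sketch of this chronological hold-out split (the concrete year range and record layout are hypothetical, as the paper does not state them here):

```python
def chronological_split(records, years=range(2007, 2016)):
    """Split time-ordered (year, features, label) records:
    first 7 years -> train, 8th year -> validation, last year -> test."""
    years = list(years)
    train = [r for r in records if r[0] in years[:7]]
    valid = [r for r in records if r[0] == years[7]]
    test = [r for r in records if r[0] == years[8]]
    return train, valid, test

# Usage with one dummy record per year
records = [(year, "features", "label") for year in range(2007, 2016)]
train, valid, test = chronological_split(records)
print(len(train), len(valid), len(test))  # 7 1 1
```

Splitting by whole years (rather than randomly) preserves the temporal structure and keeps the test year strictly after the data used for fitting and model selection.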
We have created a grid with the following values: NC ∈ {0, 2, 3, 4}, NS ∈ {1, 2} (and 3 for the NC = 0 case), T ∈ {1, 2, 3}, N ∈ {32, 64, 128}, α ∈ {0.0001, 0.0005, 0.001}, B ∈ {200, 400, 600}, Wu ∈ {1.01, 1.05, 1.1} and D ∈ {0.4, 0.6, 0.8}. We ran the experiments with all the different combinations and chose the configuration which obtained the best results on the validation set. After that, we merged the validation set into the training set and retrained the model with the best selected hyper-parameter configuration in order to obtain the test results. In this way, we are able to select the best model (according to its validation performance) while exploiting the full training set. We followed the same experimental design for the special case NC = 0, which results in three single models.

6. Results

6.1. Evaluation metrics

Given that we are facing an imbalanced problem, the evaluation of the classifiers should consider both the global accuracy and the sensitivity per class. For this reason, we have included a total of five different evaluation metrics:

• The accuracy is defined by:

Acc = (1/N) Σ_{i=1}^{N} I(y*_i = y_i),

where I(·) is the indicator function, y_i is the target category for pattern x_i, y*_i is the category predicted by the model, and N is the number of examples in the dataset. As a ratio, Acc varies between 0 and 1, and it summarizes the global performance in the classification task. Its main disadvantage is that it is biased towards the majority class.

Table 3
Best validation results according to the number of shared layers (NC), number of specification layers (NS) and number of hidden units (N).

NC   NS   N     Mean GMS
0    1    32    0.6832
0    2    32    0.6903
0    3    32    0.7056
0    1    64    0.6834
0    2    64    0.5999
0    3    64    0.6841
0    1    128   0.6879
0    2    128   0.6011
0    3    128   0.6166
2    1    32    0.8764
2    2    32    0.9365
2    1    64    0.9572
2    2    64    0.9638
2    1    128   0.9632
2    2    128   0.9720
3    1    32    0.6978
3    2    32    0.6958
3    1    64    0.6604
3    2    64    0.5700
3    1    128   0.5504
3    2    128   0.5358
4    1    32    0.6907
4    2    32    0.6402
4    1    64    0.6552
4    2    64    0.5331
4    1    128   0.5619
4    2    128   0.5252

Table 4
Validation parameters for the best models selected, plus the statically weighted one.

Model     T   N    α       B     Wu     Dr
C0S3      2   32   0.001   200   1.02   0.8
C2S2      1   64   0.001   200   1.05   0.8
C3S1      2   32   0.001   200   1.05   0.8
C3S1(W)   1   32   0.001   400   1.01   –
C4S1      2   32   0.001   400   1.1    0.8

• In order to consider all the classes of the problem, the sensitivity of each class (S_NR, S_NoR and S_PR) can be obtained to evaluate the ratio of correctly classified patterns for each kind of event:

S_NR = CC_NR / N_NR,  S_NoR = CC_NoR / N_NoR,  S_PR = CC_PR / N_PR,

where CC_NR, CC_NoR and CC_PR are the numbers of correctly classified NR, NoR and PR events, respectively, and N_NR, N_NoR and N_PR (N_NR + N_NoR + N_PR = N) are the total numbers of NR, NoR and PR events, respectively.

• As a summary of these measures, one can use the geometric mean of the sensitivities (GMS) of each class:

GMS = (S_NR · S_NoR · S_PR)^{1/3}.

This measure is suitable for imbalanced datasets because, being based on a product, it is very sensitive to low individual sensitivities. In this way, classifiers which ignore at least one of the classes (resulting in a sensitivity of 0) can be easily recognized by GMS = 0, even though they may reach very high global accuracy values.

6.2. Model selection

In this section, we show the results of the hyper-parameter validation. In order to limit the amount of information reported, we present the best results for the most influential validation parameters (see Table 3), which are the number of layers, the number of hidden units and the window size. After that, we select the best value for each number of layers and show its full parameter configuration. The results obtained with models that do not have a common layer structure, that is, models which only predict WPREs for a single wind farm, were selected according to the best mean GMS (over the three wind farms).
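The GMS summary described above can be computed from a confusion matrix as follows (the matrix values below are purely illustrative, not results from the paper):

```python
from math import prod

def sensitivities(confusion):
    """Per-class sensitivity (recall) from a confusion matrix confusion[true][pred]."""
    return [row[k] / sum(row) for k, row in enumerate(confusion)]

def gms(confusion):
    """Geometric mean of the per-class sensitivities."""
    s = sensitivities(confusion)
    return prod(s) ** (1.0 / len(s))

# Illustrative 3-class confusion matrix (rows: true NR, NoR, PR)
cm = [[60, 30, 10],
      [50, 800, 50],
      [10, 35, 55]]
print([round(s, 3) for s in sensitivities(cm)], round(gms(cm), 3))
```

Note how the dominant NoR class inflates accuracy but not GMS: any class with sensitivity 0 drives the product, and hence the GMS, to 0.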

Table 5
Test results obtained for farm A. The best result is shown in bold face and the second best in italics.

Model      GMS      Acc      S_NR     S_NoR    S_PR
Reservoir  0.6813   0.6452   0.6865   0.6366   0.7238
C0S3       0.6832   0.6941   0.6865   0.6967   0.6666
C2S2       0.4167   0.8245   0.1940   0.8904   0.4190
C3S1       0.6898   0.7270   0.6268   0.7332   0.7142
C3S1(W)    0.4797   0.8286   0.2835   0.8888   0.4380
C4S1       0.6923   0.7133   0.6567   0.7169   0.7047

Observing the results in Table 3, deeper specification structures do not provide any increase in the performance of the multi-output model, in contrast to single-output models, where increasing the number of specification layers leads to better results. Although increasing the number of hidden units should help to obtain better results for the problem, in this case, even with the highest dropout rate, the models led to over-fitting (this happens for both types of models, multi- and single-output). We selected the best architecture for each value of NC in Table 3 to perform a deeper study. The model with no shared layers and three specification layers (three independent models) is denoted as C0S3; C2S2 is composed of two shared layers and two specification layers, while C3S1 represents three shared layers and one specification layer. Moreover, C3S1 is also tested using the statically weighted version of the algorithm (with weights set according to the imbalance ratio of each class and no posterior updates) instead of our dynamic weight proposal; this model is denoted as C3S1(W). Finally, C4S1 is composed of four shared layers and one specification layer. In Table 4, the full parameter configuration of these architectures is shown. From Table 4, we can confirm that the best window size is T = 2 instants of time previous to the one to be predicted. This short window size suggests that a full recurrent neural network is unnecessary for this problem. Also, in order to avoid over-fitting, a high dropout rate is needed (80%), and a larger batch size does not necessarily lead to better results. Finally, we can conclude that the more shared layers, the higher the update rate of the class weights needs to be.

6.3. Test results

In this section, we show and discuss our results, which are reported in Tables 5–7. We compare the results of our proposed models against the reservoir model proposed in Dorado-Moreno, Cornejo-Bueno, Gutiérrez, Prieto, Salcedo-Sanz et al. (2017), where a reservoir computing approach combined with an oversampling algorithm was proposed to solve this problem using the same raw dataset, and against the three independent models (C0S3). Recall that, for the individually trained models (NC = 0), the hyper-parameter validation was done considering the average GMS over the three wind farms. In Dorado-Moreno, Cornejo-Bueno, Gutiérrez, Prieto, Salcedo-Sanz et al. (2017), it was shown that state-of-the-art models are not able to deal with the high imbalance ratio of this problem, obtaining models which classify all the patterns in the majority class


Table 6
Test results obtained for farm B. The best result is shown in bold face and the second best in italics.

Model      GMS      Acc      S_NR     S_NoR    S_PR
Reservoir  0.6489   0.6705   0.5989   0.6770   0.6747
C0S3       0.6716   0.7201   0.6495   0.7356   0.6341
C2S2       0.4283   0.8060   0.3333   0.9064   0.2601
C3S1       0.6845   0.7146   0.6410   0.7241   0.6910
C3S1(W)    0.0      0.8307   0.0      0.9901   0.0406
C4S1       0.6891   0.7373   0.6153   0.7520   0.7073

Table 7
Test results obtained for farm C. The best result is shown in bold face and the second best in italics.

Model      GMS      Acc      S_NR     S_NoR    S_PR
Reservoir  0.6291   0.6835   0.6206   0.6910   0.5806
C0S3       0.6583   0.7386   0.6379   0.7496   0.5967
C2S2       0.3638   0.8567   0.1724   0.9118   0.3064
C3S1       0.6828   0.7277   0.6724   0.7339   0.6451
C3S1(W)    0.3933   0.8793   0.2241   0.9350   0.2903
C4S1       0.6409   0.7448   0.5517   0.7585   0.6290

(NoR). The reservoir model of Dorado-Moreno, Cornejo-Bueno, Gutiérrez, Prieto, Salcedo-Sanz et al. (2017) achieves competitive results, as it is able to predict all three classes correctly to some extent, obtaining non-trivial classifiers, but its GMS results do not reach those obtained by the models designed for this paper. On the other hand, C3S1(W) is not able to achieve competitive results (GMS < 0.5) in any of the wind farms, as statically weighting the algorithm is not enough to train the model correctly. When comparing the C3S1 model with the other alternatives, we can say that it is the most robust to changes, as it is the only one able to keep the GMS value over 0.68 in all three wind farms. Also, observing the sensitivities of each class, all of them are higher than 0.62 (balanced results for all the classes), without focusing only on the majority class. The C2S2 results are far better than those obtained by C3S1(W), but not good enough to be competitive, as the corresponding GMS value does not reach 0.6 in any of the wind farms. Checking the results in Table 3, C2S2 achieves the highest validation GMS, which indicates that the model is highly overtrained. The C4S1 model also obtains good results for the test set, but its performance drops for farm C. Finally, the independent models (C0S3) are not robust to data changes, obtaining among the best results in validation and lower ones in generalization for all the wind farms, which means that they are prone to over-fitting.

7. Conclusions

This paper proposes and evaluates a multi-task deep neural network architecture for predicting Wind Power Ramp Events (WPREs) in three different classes (negative ramps, non-ramp events and positive ramps). The main idea of this work is to build a global model which considers all the available information regarding wind speed and direction in all of the wind farms and then predicts future WPREs in all of the wind farms at the same time.
We also propose a modification of the Adam optimization algorithm for imbalanced data. It is a cost-based modification which dynamically modifies the cost of misclassifying each pattern depending on the worst classified class. This allows the model to obtain good results without applying over-sampling or under-sampling, which are hard to apply in a multi-task problem and can also affect the temporal structure of time series. The benefit of the proposed multi-task model is that it obtains more robust results than the single-output ones, as it is able to handle more input data, making better decisions for

predicting WPREs after the training stage. Also, although it is a more complex model, there is no need to independently validate the hyper-parameters for each wind farm. It is important to note that, as shown in the results section, deeper models do not lead to better results in generalization due to over-fitting. A direct extension of this work could be based on considering a larger number of wind farms and constructing clusters of them depending on their geographical position. The proposed model could then be used to perform predictions for each whole cluster. In this way, we would be able to achieve a high quality prediction of WPREs across the whole Spanish geography using a small number of deep models.

Acknowledgments

This work has been subsidized by the projects with references TIN2017-85887-C2-1-P, TIN2017-85887-C2-2-P and TIN2017-90567-REDT of the Spanish Ministry of Economy and Competitiveness (MINECO) and FEDER funds. Manuel Dorado-Moreno's research has been subsidized by the FPU Predoctoral Program (Spanish Ministry of Education and Science), grant reference FPU15/00647. The authors acknowledge NVIDIA Corporation for the grant of computational resources through the GPU Grant Program.

References

Argyriou, A., Evgeniou, T., & Pontil, M. (2008). Convex multi-task feature learning. Machine Learning, 73(3), 243–272.
Baxter, J. (2000). A model of inductive bias learning. Journal of Artificial Intelligence Research, 12, 149–198.
Bossavy, A., Girard, R., & Kariniotakis, G. (2010). Forecasting ramps of wind power production with numerical weather prediction ensembles. In Proc. European wind energy conference & exhibition (pp. 20–23).
Cannon, D. J., Brayshaw, D. J., Methven, J., Coker, P. J., & Lenaghan, D. (2015). Using reanalysis data to quantify extreme wind power generation statistics: A 33 year case study in Great Britain. Renewable Energy, 75, 767–778.
Caruana, R. (1998). Multitask learning. Autonomous Agents and Multi-Agent Systems, 21(1), 95–133.
Chidean, M. I., Caamaño, A. J., Ramiro-Bargueño, J., Casanova-Mateo, C., & Salcedo-Sanz, S. (2018). Spatio-temporal analysis of wind resource in the Iberian Peninsula with data-coupled clustering. Renewable & Sustainable Energy Reviews, 81, 2684–2694.
Collobert, R., & Weston, J. (2008). A unified architecture for natural language processing: Deep neural networks with multitask learning. In Proceedings of the 25th international conference on machine learning (pp. 160–167).
Cui, M., Feng, C., Wang, Z., & Zhang, J. (2017). Statistical representation of wind power ramps using a generalized Gaussian mixture model. IEEE Transactions on Sustainable Energy, 9, 261–272.
Cui, M., Ke, D., Sun, Y., Gan, D., Zhang, J., & Hodge, B. M. (2015). Wind power ramp event forecasting using a stochastic scenario generation method. IEEE Transactions on Sustainable Energy, 6, 422–433.
Dee, D. P., Uppala, S. M., Simmons, A. J., Berrisford, P., & Poli, P. (2011). The ERA-Interim reanalysis: Configuration and performance of the data assimilation system. Quarterly Journal of the Royal Meteorological Society, 137, 553–597.
Díaz-Vico, D., Torres-Barrán, A., Omari, A., & Dorronsoro, J. R. (2017). Deep neural networks for wind and solar energy prediction. Neural Processing Letters, 46, 829–844.
Dorado-Moreno, M., Cornejo-Bueno, L., Gutiérrez, P. A., Prieto, L., Hervás-Martínez, C., & Salcedo-Sanz, S. (2017). Robust estimation of wind power ramp events with reservoir computing. Renewable Energy, 111, 428–437.
Dorado-Moreno, M., Cornejo-Bueno, L., Gutiérrez, P. A., Prieto, L., Salcedo-Sanz, S., & Hervás-Martínez, C. (2017). Combining reservoir computing and over-sampling for ordinal wind power ramp prediction. In Lecture Notes in Computer Science: vol. 10305, International work-conference on artificial neural networks (pp. 708–719).
Dorado-Moreno, M., Durán-Rosal, A. M., Guijo-Rubio, D., Gutiérrez, P. A., Prieto, L., Salcedo-Sanz, S., & Hervás-Martínez, C. (2016). Multiclass prediction of wind power ramp events combining reservoir computing and support vector machines. In Lecture Notes in Computer Science: vol. 9868, Conference of the Spanish Association for Artificial Intelligence (pp. 300–309).
Elith, J., & Leathwick, S. J. (2008). A working guide to boosted regression trees. Journal of Animal Ecology, 77(4), 802–813.

Evgeniou, T., & Pontil, M. (2004). Regularized multi-task learning. In Proceedings of the tenth ACM SIGKDD international conference on knowledge discovery and data mining (pp. 109–117).
Gallego-Castillo, C., Cuerva-Tejero, A., & López-García, O. (2015). A review on the recent history of wind power ramp forecasting. Renewable and Sustainable Energy Reviews, 52, 1148–1157.
Gallego-Castillo, C., García-Bustamante, E., Cuerva-Tejero, A., & Navarro, J. (2015). Identifying wind power ramp causes from multivariate datasets: A methodological proposal and its application to reanalysis data. IET Renewable Power Generation, 9(8), 867–875.
Jahn, D. E., Takle, E. S., & Gallus, W. A. (2017). Improving wind-ramp forecasts in the stable boundary layer. Boundary-Layer Meteorology, 163, 423–446.
Jiang, D., Zhuang, D., Huang, Y., Wang, J., & Fu, J. (2013). Evaluating the spatio-temporal variation of China's offshore wind resources based on remotely sensed wind field data. Renewable & Sustainable Energy Reviews, 24, 142–148.
Kingma, D., & Ba, J. (2014). Adam: A method for stochastic optimization. In International conference on learning representations.
Liu, J., Ji, S., & Ye, J. (2009). Multi-task feature learning via efficient l2,1-norm minimization. In Proceedings of the 25th conference on uncertainty in artificial intelligence (pp. 339–348).
Liu, P., Qiu, X., & Huang, X. (2016). Recurrent neural network for text classification with multi-task learning. In Computing Research Repository.
Lucheroni, C., Boland, J., & Ragno, C. (2019). Scenario generation and probabilistic forecasting analysis of spatio-temporal wind speed series with multivariate autoregressive volatility models. Applied Energy, 239, 1226–1241.
Maurer, A., Pontil, M., & Romera-Paredes, B. (2016). The benefit of multitask representation learning. Journal of Machine Learning Research (JMLR), 17, 1–32.
Ouyang, T., Zha, X., Qin, L., & Kusiak, A. (2016). Optimisation of time window size for wind power ramps prediction. IET Renewable Power Generation, 11, 1270–1277.
Salcedo-Sanz, S., Pastor-Sánchez, A., Del Ser, J., Prieto, L., & Geem, Z. W. (2015). A coral reefs optimization algorithm with harmony search operators for accurate wind speed prediction. Renewable Energy, 75, 93–101.
Santos-Alamillos, F. J., Thomaidis, N. S., Quesada-Ruiz, S., Ruiz-Arias, J. A., & Pozo-Vázquez, D. (2016). Do current wind farms in Spain take maximum advantage of spatiotemporal balancing of the wind resource? Renewable Energy, 96, 574–582.

Scholz, M., Fraunholz, M., & Selbig, J. (2008). Nonlinear principal component analysis: Neural network models and applications. Lecture Notes in Computational Science and Engineering, 58, 44–67.
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research (JMLR), 15, 1929–1958.
Sun, M., Feng, C., & Zhang, J. (2019). Conditional aggregated probabilistic wind power forecasting based on spatio-temporal correlation. Applied Energy, 256, 113842.
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., & Rabinovich, A. (2015). Going deeper with convolutions. In IEEE conference on computer vision and pattern recognition. IEEE, ISBN: 978-1-4673-6964-0.
Taylor, J. W. (2017). Probabilistic forecasting of wind power ramp events using autoregressive logit models. European Journal of Operational Research, 259, 703–712.
Wang, L., Kisi, O., Zounemat-Kermani, M., Ariel-Salazar, G., Zhu, Z., & Gong, W. (2017). Solar radiation prediction using different techniques: Model evaluation and comparison. Renewable & Sustainable Energy Reviews, 61, 384–397.
Xiong, Y., Zha, X., Qin, L., Ouyang, T., & Xia, T. (2016). Research on wind power ramp events prediction based on strongly convective weather classification. IET Renewable Power Generation, 11, 1278–1285.
Xue, Y., Liao, X., Carin, L., & Krishnapuram, B. (2007). Multi-task learning for classification with Dirichlet process priors. Journal of Machine Learning Research (JMLR), 8, 35–63.
Zhang, T., Ghanem, B., Liu, S., & Ahuja, N. (2012). Robust visual tracking via multi-task sparse learning. In IEEE conference on computer vision and pattern recognition (pp. 2042–2049).
Zhang, C., & Zhang, Z. (2014). Improving multiview face detection with multi-task deep convolutional neural networks. In 2014 IEEE winter conference on applications of computer vision (pp. 1036–1041).
Zhou, G., Cichocki, A., Zhang, Y., & Mandic, D. P. (2016). Group component analysis for multiblock data: Common and individual feature extraction. IEEE Transactions on Neural Networks and Learning Systems, 27(11), 2426–2439. http://dx.doi.org/10.1109/TNNLS.2015.2487364.
Zhu, Q., Chen, J., Zhu, L., Duan, X., & Liu, Y. (2018). Wind speed prediction with spatio-temporal correlation: A deep learning approach. Energies, 11(4), 705–723.