Renewable Energy 123 (2018) 513–525
A database infrastructure to implement real-time solar and wind power generation intra-hour forecasts

Hugo T.C. Pedro (a), Edwin Lim (b), Carlos F.M. Coimbra (a,*)

(a) Department of Mechanical and Aerospace Engineering, Jacobs School of Engineering, Center of Excellence in Renewable Resource Integration and Center for Energy Research, University of California San Diego, La Jolla, CA 92093, USA
(b) Temporal Dynamics, Martinsville, NJ 08836, USA

* Corresponding author. E-mail address: [email protected] (C.F.M. Coimbra).
Article history: Received 9 October 2017; Received in revised form 24 January 2018; Accepted 9 February 2018; Available online 15 February 2018.

Keywords: Renewable generation forecast; Real-time implementation; Nearest neighbors forecast.

Abstract

This paper presents a simple forecasting database infrastructure implemented using the open-source database management system MySQL. This proposal aims to advance the myriad of solar and wind forecast models present in the literature into a production stage. The paper gives all the relevant details necessary to implement a MySQL infrastructure that collects the raw data, filters unrealistic values, classifies the data, and produces forecasts automatically and without the assistance of any other computational tools. The performance of this methodology is demonstrated by creating intra-hour power output forecasts for a 1 MW photovoltaic installation in Southern California and a 10 MW wind power plant in Central California. Several machine learning forecast models are implemented (persistence, auto-regressive and nearest neighbors) and tested. Both point forecasts and prediction intervals are generated with this methodology. Quantitative and qualitative analyses of solar and wind power forecasts were performed for an extended testing period (4 years and 6 years, respectively). Results show an acceptable and robust performance for the proposed forecasts. © 2018 Elsevier Ltd. All rights reserved.
1. Introduction

Methods and algorithms for assessing renewable resources and for forecasting wind, solar irradiance and renewable power generation have been the topic of many research papers in recent years. This interest is motivated by the increasing presence of renewable energy in our lives, from roof-top installations in our homes to large wind and solar farms that supply the electrical grids, and by the challenges posed by the inherent variability of this energy source [1–3]. Forecasting models, both deterministic and stochastic in nature, and qualitative and quantitative performance metrics have been amply discussed in the recent literature [4–9]. Recent works in renewable generation forecasting follow one or more of the following trends: (i) exploring new exogenous predictors relevant to the forecast; (ii) preprocessing the data into predictors more amenable to the forecasting tools; (iii) creating probabilistic forecasts instead of the classic point forecasts; (iv) expanding the
forecasting to large spatial domains; (v) creating models dedicated to forecasting extreme events such as power ramps; and (vi) exploring new metrics to better characterize forecasting performance [9,10]. Despite these advances, there is, to the best of our knowledge, no literature that addresses, in a practical way, how the models are implemented in a real case scenario. Motivated by this fact, we present here a simple strategy to deploy solar and wind forecasting tools in real time. The proposed workflow can be readily implemented on any computing platform capable of running the open-source database management system (DBMS) MySQL. As shown below, all the data storage, data validation and forecasting tools can be implemented in this simple infrastructure.

The use of a MySQL database for local forecasting conveys a number of advantages. MySQL is robust, fast, has an open-source license, and is available on all the common platforms. As such, MySQL has gained widespread use in diverse applications and is today one of the most popular DBMSs [11]. MySQL can be set up with ease. With a little learning investment, it is easy to perform basic store and retrieve functions. For intermediate and advanced usage, the official documentation is comprehensive and freely available [12] (accessed June 2016). Accessing the data in MySQL is vastly
simplified by the availability of connectors and APIs (application programming interfaces) for various third-party languages such as C, Java, Python, Perl, etc. Being natively network oriented, MySQL adds an important dimension to data accessibility: different users and tools can access the same data set from different computers simultaneously. The native text-based client and the graphical interface (MySQL Workbench) provide convenient methods of manually querying the data during software development and troubleshooting in general. MySQL also offers an extensive set of functions and operators that can be explored to change the DBMS from a mere data hosting infrastructure into much more. For example, MySQL stored procedures are used in Zach et al. [13] to improve building monitoring and in Ref. [14] to enable complex queries to a bioinformatics database. The purpose of this paper is to demonstrate how one can take advantage of these functionalities to create a forecasting methodology completely implemented in MySQL. In the approach described here, all the operations are performed within MySQL, as illustrated in Fig. 1. This is a departure from the more traditional approach where the database simply stores the measured data and the forecasts are produced by attendant jobs (e.g. cron jobs or Windows scheduled tasks) implemented in another programming language [15,16].

In the remainder of this paper we explain how to set up this database infrastructure for two real case scenarios. In the first case, the power output (PO) from a 1 MW solar canopy is predicted from the PO and global horizontal irradiance (GHI) measurements. In the second case, we consider telemetry (PO, wind speed and direction, temperature, pressure and density) from a 10 MW wind power plant and predict wind PO. We implement simple forecast techniques to create real-time point forecasts as well as prediction intervals for intra-hour horizons (15–60 min).
2. Methodology

In this section we describe the key steps necessary to create the database tables and the procedures used to compute the forecasts. As mentioned above, several data streams are available in this case: power output and environmental data. This setup corresponds to a very common scenario for medium-sized solar and wind power generation, where local telemetry data are used as predictors for the forecasting models. The methodology description focuses on the first case under consideration, the forecasting of PO from a 1 MW solar canopy. At the end of the section we explain the few modifications necessary to apply this database infrastructure to the forecasting of PO for the 10 MW wind power plant.

2.1. MySQL tables

All the data used in this work (sensor data, computed forecasts, etc.) are stored in MySQL tables. A table is a structure where the data are collected; tables contain records (rows) and fields (columns). In this case most tables follow a very simple structure in which the first column contains the date and time (timestamp) that uniquely identifies the data. To improve search efficiency, tables should be indexed (keyed), and the natural choice here is to use the timestamp as the index. All timestamps are in Coordinated Universal Time (UTC) to avoid any problems related to daylight saving time. Indices are kept in memory, which enables the database to process queries much faster (e.g. retrieving sensor data within a given time period). Using indices requires more memory but reduces query times substantially. Furthermore, if the indexed quantity is unique, the index can be declared as the primary key of the table, in which case MySQL enforces that there are no duplicates of the indexed quantity.

The tables used here can be categorized into two main types depending on how they are indexed. The first category comprises tables that store data indexed by the timestamp (primary key). These are the tables that contain sensor data and data directly derived from them, such as normalized data or classification data, as explained below. Such tables contain one column that stores the timestamp followed by as many columns as necessary to store the indexed variables. For example, the table that stores the raw PO and GHI data can be created with the following MySQL code:
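The original listing is not reproduced here; a minimal sketch of such a statement, in which the measurement column names (_po, _ghi) and their types are assumptions, is:

CREATE TABLE dataRaw (
    _timestamp DATETIME NOT NULL,   # data acquisition time (UTC)
    _po        FLOAT,               # measured power output
    _ghi       FLOAT,               # measured global horizontal irradiance
    PRIMARY KEY (_timestamp)        # the timestamp is the primary key
);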
To the second category belong the tables used to store forecasted data. Data in these tables are associated with two timestamps: the reference time (_reftime), which records the time at which the forecast is computed, and the valid time (_valtime), which is the time at which the forecast applies. These tables have a compound primary key made up by combining the reference time and the valid time.
Fig. 1. (a) In this approach the database is used to store measured data and the forecast. External attendant jobs are responsible for computing the forecast. (b) In the approach described in this work all computations are performed inside the MySQL database. No external processes are required. (c) Flow chart for the processes that create the forecasts.
This means that there can be duplicates in valid times, as long as the duplicates have different reference times. Other columns always present in these tables include the forecasted value and the forecast error. There are several tables in this category, one for each one of the different forecast models implemented. In this work we implement three models: a persistence model (Pers), an auto-regressive model (AR) and a nearest neighbor model (kNN). The statement to create the table to store the nearest neighbor forecast is:
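The original statement is not reproduced here; the sketch below illustrates the structure described in the text (the exact column names and which columns carry the secondary indices are assumptions):

CREATE TABLE forecastNN (
    _reftime  DATETIME NOT NULL,    # time at which the forecast is issued
    _valtime  DATETIME NOT NULL,    # time at which the forecast applies
    _po       FLOAT,                # forecasted power output
    _po_min   FLOAT,                # lower bound of the prediction interval
    _po_max   FLOAT,                # upper bound of the prediction interval
    _error    FLOAT,                # forecast error, filled in once the measurement arrives
    PRIMARY KEY (_reftime, _valtime),
    INDEX idx_reftime (_reftime),
    INDEX idx_valtime (_valtime),
    INDEX idx_error   (_error)
);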
Note that in addition to the primary key on the combined _reftime and _valtime, three other indices are also declared to increase the data retrieval speed. The primary key here mainly enforces the uniqueness of the (_reftime, _valtime) pair, whereas most of the actual queries are performed on the other indices. For example, the queries to update the forecast error with the latest PO measurements are done based on _valtime, as explained in the following section. If the additional indices are not declared, the performance will suffer. Another aspect of this particular table is that it also contains columns for the minimum and maximum values that define the prediction interval (PI) for PO (explained below).

All the main tables used in this work can be created using variations of the two statements above. In total we use nine different tables: five contain sensor and sensor-derived data, three store the forecasted values, and one is a temporary table used to speed up the nearest neighbors forecast computation. Table 1 lists and describes all these tables in more detail. In the following sections we show how all these tables are populated automatically once data is inserted into the dataRaw table.

2.2. Data filtering and data normalization

In this scenario we assume that measurements are fed unfiltered from the telemetry hardware sensors into the database. Thus, the data must be filtered to discard missing and invalid measurements, e.g., physically impossible values such as negative irradiance and power generation measurements. In this case such measurements are tagged with the value 9999.9, in accordance with the convention used in data from the SURFRAD network [17] (accessed June 2016). Instead of relying on external tools to filter the data, MySQL supports triggers (procedures that are executed automatically in response to events in a table) that can be used to perform this task. The triggering event in this case is the insertion of new data into the dataRaw table. The triggered procedure has access to the data that has just been inserted into dataRaw. It operates on these data and inserts the resulting filtered data into the dataFiltered
table. In this case, the procedure discards flagged values and night-time values (θ_z > 85°):
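The trigger itself is not reproduced here; the abridged sketch below illustrates its structure. Column names, the exact filtering conditions, and the OUT-parameter signature of calcZenith are assumptions:

DELIMITER //
CREATE TRIGGER filterData AFTER INSERT ON dataRaw
FOR EACH ROW
BEGIN
    # global session variables shared by the subsequent procedures (values from Table 2)
    SET @lat_deg = 32.959, @lon_deg = 117.190, @utc2loc = 8,
        @Zlim = 85.0, @NNghb = 50, @MAXPO = 1050;

    # zenith angle for the new timestamp (calcZenith is described in Appendix A.1)
    CALL calcZenith(NEW._timestamp, @z_deg, @cosz);

    # keep only day-time, physically possible measurements
    IF (@z_deg <= @Zlim AND NEW._po <> 9999.9 AND NEW._ghi <> 9999.9
        AND NEW._po >= 0 AND NEW._ghi >= 0) THEN
        INSERT INTO dataFiltered (_timestamp, _po, _ghi)
            VALUES (NEW._timestamp, NEW._po, NEW._ghi);
        # normalization by the cosine of the zenith angle, Eq. (1)
        INSERT INTO dataNormalized (_timestamp, _npo, _nghi)
            VALUES (NEW._timestamp, NEW._po / @cosz, NEW._ghi / @cosz);
        # update the error columns of the forecast tables with the new measurement
        UPDATE forecastPer SET _error = _po - NEW._po WHERE _valtime = NEW._timestamp;
        UPDATE forecastAR  SET _error = _po - NEW._po WHERE _valtime = NEW._timestamp;
        UPDATE forecastNN  SET _error = _po - NEW._po WHERE _valtime = NEW._timestamp;
    END IF;
END//
DELIMITER ;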
The key statement in this code is AFTER INSERT ON dataRaw, which indicates that the procedure is triggered after data is inserted into the table dataRaw (MySQL supports other trigger events such as BEFORE INSERT or AFTER DELETE). The data inserted in the triggering event is accessed by prepending "NEW." to the respective column name. Also, note that in the snippets presented here, lines starting with # are comments and words prefixed by the underscore character "_" refer to columns in the different tables. In the same triggered procedure we also normalize the raw data values and insert them into the respective dataNormalized table. Normalization can be accomplished with accurate clear-sky models [18], most of which are not easily implemented in MySQL. A simpler approach is to normalize PO and GHI with the cosine of the zenith angle, cos θ_z:
\cos\theta_z = \cos\phi \, \cos\delta \, \cos\omega + \sin\phi \, \sin\delta    (1)

where φ is the latitude, ω is the solar hour angle and δ is the declination angle. These quantities can be computed easily for any time and geographical location [19] and can be implemented as a MySQL procedure or function (see Appendix A.1). This normalization
Table 1. Description of the MySQL tables used in this work.

dataRaw (Data): Stores the raw PO and exogenous data. The primary key is the data acquisition timestamp.
dataFiltered (Data): Stores the filtered raw data. The filtering removes night-time values (in the first test case) and unrealistic values. The primary key is the data acquisition timestamp inherited from the dataRaw table.
dataNormalized (Data): Stores the normalized filtered data. The primary key is the data acquisition timestamp inherited from the dataRaw table.
dataClass (Data): Stores several features used to classify the data. These features are used in the nearest neighbor forecast. The primary key is the timestamp inherited from the dataNormalized table.
dataTarget (Data): Stores the target data for the different forecast horizons corresponding to the data in dataClass. The primary key is the timestamp inherited from the dataNormalized table.
forecastPer (Forecast): Stores the persistence forecast. The primary key combines the _reftime and _valtime timestamps.
forecastAR (Forecast): Stores the auto-regressive forecast. The primary key combines the _reftime and _valtime timestamps.
forecastNN (Forecast): Stores the nearest neighbors forecast. The primary key combines the _reftime and _valtime timestamps.
tempTableNN (Temporary): A temporary table that stores the timestamps of the nearest neighbors. The contents of this table are deleted once the forecast is computed.
accounts for much of the daily and seasonal variation in the PO, so that the models' task is to predict the effect of the atmosphere on the target variable.

A third task implemented in this procedure is the update of the forecast data tables with the forecast error. That is accomplished with the UPDATE queries in the code above. With this implementation the forecast performance is updated in real time as new measured data become available.

The above trigger is a very simple way of implementing data filtering, data normalization, and the update of the forecast error directly in the database. It also assures the integrity of the information, that is, it guarantees that there is a one-to-one correspondence between the indices in the tables necessary for the forecast. Finally, since this is the first procedure to be triggered, it also defines several global session variables (variables that start with @) that are shared by all subsequent procedures. This avoids having hard-coded values dispersed throughout the MySQL code, which would make it difficult to modify the code for other locations, for instance. The global variables, their values and their descriptions are listed in Table 2.
2.3. Classification features

Another type of data derived directly from the measured data is stored in the tables dataClass and dataTarget. The first table contains features such as backward averages, standard deviations, variability, etc., that are useful for classifying the data for machine learning tools such as nearest neighbors and artificial neural networks [20,21]. In the cases demonstrated here, PO and exogenous data are classified by computing the average and standard deviation of the normalized variables for different backward windows (30, 60 and 120 min), the lagged values from 15 to 60 min, and the Sun's zenith angle. All these values are stored in the table dataClass, which is automatically populated by a procedure triggered when new rows are inserted into dataNormalized. The second table, dataTarget, stores the normalized PO data for the different horizons k and forecast issuing times n. The data in these two tables form the pair \vec{f}_n → (x_{n+1}, x_{n+2}, x_{n+3}, x_{n+4}), where \vec{f}_n denotes the classification features at time instance n. At that time the values of x_{n+k} are not known, thus the indexed rows in dataTarget are initialized with NULL and later updated as the measured x_{n+k} values become available. The snippet below provides an abbreviated listing of the code that performs this task:
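The original abbreviated listing is not reproduced here; the sketch below shows the structure of such a trigger, keeping only the 30 min backward window, the 15 min lag, and two of the target horizons (column names are assumptions):

DELIMITER //
CREATE TRIGGER classifyData AFTER INSERT ON dataNormalized
FOR EACH ROW
BEGIN
    # classification features: backward average/standard deviation, lagged value, zenith angle
    INSERT INTO dataClass (_timestamp, _z, _npo_avg30, _npo_std30, _npo_lag15)
    SELECT NEW._timestamp, @z_deg, AVG(_npo), STD(_npo),
           (SELECT _npo FROM dataNormalized
             WHERE _timestamp = NEW._timestamp - INTERVAL 15 MINUTE)
      FROM dataNormalized
     WHERE _timestamp >  NEW._timestamp - INTERVAL 30 MINUTE
       AND _timestamp <= NEW._timestamp;

    # initialize the target row for this issuing time; the x_{n+k} are not known yet
    INSERT INTO dataTarget (_timestamp, _npo15, _npo30, _npo45, _npo60)
        VALUES (NEW._timestamp, NULL, NULL, NULL, NULL);

    # update previous target rows for which the new value is the measured x_{n+k}
    UPDATE dataTarget SET _npo15 = NEW._npo
     WHERE _timestamp = NEW._timestamp - INTERVAL 15 MINUTE;
    UPDATE dataTarget SET _npo30 = NEW._npo
     WHERE _timestamp = NEW._timestamp - INTERVAL 30 MINUTE;
    # ... analogous UPDATE statements for the 45 and 60 min horizons
END//
DELIMITER ;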
Table 2. Global variables used in the MySQL procedures.

@lat_deg (value 32.959): Latitude of the target location in degrees.
@lon_deg (value 117.190): Longitude of the target location in degrees.
@utc2loc (value 8): Difference in hours between UTC and local standard time.
@Zlim (value 85.0): Cut-off θ_z for the day-time/night-time differentiation.
@NNghb (value 50): Number of nearest neighbors in the kNN forecast.
@MAXPO (value 1050): Maximum power output.
Other features can easily be incorporated into this triggered procedure, although too many classification features may lead to a deterioration of the forecast performance. In that case, an optimization procedure similar to the one implemented in Ref. [20] may be used to find the set of features that minimizes the forecast error.
2.4. Forecast models

Once the dataClass and dataTarget tables are populated with new values, a third procedure is triggered that automatically produces the different forecasts. In this work we consider three forecasting models: two auto-regressive models and a nearest neighbor model.

2.4.1. Auto-regressive forecast

The simplest forecasts implemented here are the auto-regressive models that follow the general form proposed in Ref. [22]:

\hat{x}_{n+k} = \left[ a_0 + \frac{a_1 x_n}{\cos\theta_z(n)} + \cdots + \frac{a_m x_{n-m+1}}{\cos\theta_z(n-m+1)} \right] \cos\theta_z(n+k)    (2)

where \hat{x}_{n+k} denotes the forecasted PO for time n+k, where k = {1, 2, 3, 4} corresponds to the four forecast horizons 15, 30, 45, and 60 min given that the data are in 15 min intervals. The variable m indicates the number of lagged PO measurements used in the forecast. Two variations of Eq. (2) are implemented in this work: the persistence forecast (Pers), for which m = 1, a_0 = 0, a_1 = 1, and an auto-regressive (AR) forecast with m = 5 and a_0, ..., a_5 that minimize the error function:

\min_{a_0, \ldots, a_5} \left\| x_{n+k} - \hat{x}_{n+k} \right\|^2    (3)

The AR model depends on external tools to solve the minimization problem. However, we include this model in this work to demonstrate how this type of model can be implemented in MySQL once the free parameters are known, and also as a benchmark forecast that performs better than persistence. In this case the parameters a_0, ..., a_5 were determined using least squares implemented in Matlab for the four different forecast horizons using the first 12 months of data available. The value m = 5 was determined using a grid search over m ∈ {1, ..., 10}. The MySQL procedure that computes the Pers and AR forecasts is listed in Appendix A.2. That code includes the free parameters a_0, ..., a_5 determined for each of the four forecast horizons. Any model that can be expressed in algebraic form can be implemented with this methodology by modifying this procedure.

2.4.2. Nearest neighbors forecast

One type of forecast model that is especially well suited to MySQL is the nearest neighbor forecast. This results from the fact that one of the main functions of MySQL is to search data within the tables based on some condition; in this case, the condition is the similarity between the classification features for the new data and the classification features already present in the dataClass table. This forecast is computed in three steps:

1. identification of the list of nearest neighbors based on the classification features;
2. retrieval of the PO target data corresponding to those neighbors from the dataTarget table;
3. aggregation of those data.

Identifying the nearest neighbors can be done with a simple MySQL query:
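The original query is not reproduced here; a minimal sketch, using three hypothetical feature columns (_npo_avg30, _npo_std30, _npo_lag15) and with the number of neighbors written literally (it is @NNghb in the full code), is:

INSERT INTO tempTableNN (_timestamp)
SELECT _timestamp FROM dataClass WHERE ABS(NEW._z - _z) <= 10 AND _timestamp < NEW._timestamp
ORDER BY SQRT(POW(NEW._npo_avg30 - _npo_avg30, 2) + POW(NEW._npo_std30 - _npo_std30, 2) + POW(NEW._npo_lag15 - _npo_lag15, 2))
ASC LIMIT 50;   # list neighbors by ascending distance and keep the first 50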
The third line in this snippet computes the Euclidean distance between the features for the new data (identified by the "NEW." prefix) and the same features already in the dataClass table. This line can be modified to include any combination of the different classification features available, or to use a different distance metric. The last line in the snippet indicates that all neighbors are listed in ascending order and limits the result to the first NN items. In this work we set the number of nearest neighbors NN to 50, following a previous study on the optimization of nearest neighbor forecast models [20]. The first line in this code indicates that the retrieved timestamps that index the nearest neighbors are stored in the temporary table tempTableNN. This speeds up the process by avoiding re-querying the table, since the nearest neighbors are the same for all four forecast horizons considered. Different combinations of features could be specified for each forecast horizon, although that was not tested in this work. We also limit the search in terms of solar geometry through the condition ABS(NEW._z - _z) <= 10. This illustrates how one of the classification features is used as a query condition instead of being considered in the distance calculation. Finally, the condition _timestamp < reftime guarantees that the current values are excluded from the list of neighbors, given that their target values are not yet known. Once the list of timestamps t = {m_i, i = 1, 2, ..., N} is known, we select the target data according to the forecast horizon k, resulting in the vector x_k = {x_{m_i+k}, i = 1, 2, ..., N}. The elements of this vector are then aggregated to determine the forecast value:
\hat{x}_{n+k} = \langle \mathbf{x}_k \rangle \cos\theta_z(n+k) = \left( \frac{1}{N} \sum_{i=1}^{N} x_{m_i+k} \right) \cos\theta_z(n+k)    (4)
In this implementation it is also possible to obtain a prediction interval (PI) for the forecasts. The lower and upper bounds for the PI can be estimated as:
PI_{n+k} = \left\{ \min(\mathbf{x}_k),\ \max(\mathbf{x}_k) \right\} \cos\theta_z(n+k)    (5)
The forecasted value and the PI bounds can be obtained with the following MySQL code (for the 15 min horizon):
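The corresponding statements are not reproduced here; a minimal sketch is given below, assuming these statements live inside the nnForecast stored procedure, with nPO_nn, nPO_nn_min, nPO_nn_max, PO_nn, PO_nn_min, PO_nn_max and coszen (holding cos θ_z(n+k)) declared locally:

# aggregate the normalized target PO of the nearest neighbors (15 min horizon)
SELECT AVG(_npo15), MIN(_npo15), MAX(_npo15)
  INTO nPO_nn, nPO_nn_min, nPO_nn_max
  FROM dataTarget
 WHERE _timestamp IN (SELECT _timestamp FROM tempTableNN);
# de-normalize, Eqs. (4) and (5), and cap the values at the maximum power output
SET PO_nn     = LEAST(@MAXPO, nPO_nn     * coszen),
    PO_nn_min = LEAST(@MAXPO, nPO_nn_min * coszen),
    PO_nn_max = LEAST(@MAXPO, nPO_nn_max * coszen);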
This code computes the average, minimum and maximum values of the target PO values indexed by the nearest neighbors' timestamps (stored in the tempTableNN table) and assigns them to the variables nPO_nn, nPO_nn_min, and nPO_nn_max. The non-normalized predicted values are computed in the last line of the snippet, where we also limit the values to the maximum power output. The complete procedure to compute the forecast and the PI is listed in Appendix A.3. Once these values are determined they are inserted into the table forecastNN.

The full implementation in MySQL for this case is about 400 lines long with comments. Table 3 lists and describes all the triggers and stored procedures necessary for the work described here, and Fig. 2 depicts the entity-relationship diagram using "Crow's foot" notation. The code snippets provided above and in the appendix give all the relevant information, but the interested reader can obtain the full code in the supplementary material.

3. Forecasting validation

This database infrastructure was implemented with MySQL Community Server [23] (accessed June 2016) version 5.5 on an Ubuntu machine with 2.5 GHz Intel Xeon processors. The results were post-processed with Matlab version R2017a. We applied this methodology to two cases:

1. Forecasting the power output from a 1 MW non-tracking PV solar canopy installation at the Canyon Crest Academy (CCA) in San Diego, USA;
2. Forecasting the power output from a 10 MW wind power plant in central California.

For the first case, the monitoring system for the solar panels provides 15 min averaged data for PO and GHI every 15 min; PO and GHI were collected from 2011-02-03 to 2014-12-31. In the second case, data were retrieved from the Wind Integration National Dataset (WIND) Toolkit [24] for site 56345, with latitude and longitude 37.75°, −121.68°. This dataset contains modeled rather than measured data. Comparisons between the modeled and measured data in King et al. [25] and Draxl et al. [26] showed that the PO data model the power output of wind resources appropriately and are representative of production by modern wind power plants. In this case the data are provided in 5 min intervals from 2007 to 2012. In order to maintain a data resolution equal to the one in the first case, the data were averaged into 15 min bins.

The MySQL implementation for the second case differs from the first case in a few aspects:

- The tables that store data and data-derived values are expanded to accommodate the different set of exogenous variables (wind speed and direction, temperature, etc.).
- Given that there is no clear-sky normalization equivalent for the case of wind power, we normalize the PO by the wind power plant nominal value of 10 MW. Wind speed and temperature are normalized by their maximum values for 2007. Wind direction is normalized by 360° and pressure and density by the respective sea-level values for a standard atmosphere.
- The target data from which the forecasts are created are not the normalized PO at the desired horizon but the step changes x_n − x_{n−k}. This was chosen to illustrate an alternative way to define the target data for the kNN forecast.
- There is no data filtering based on solar geometry.
- The persistence model in this case is the naïve persistence \hat{x}_{n+k} = x_n, no auto-regressive model is implemented, and the distance formula used to define the nearest neighbors is changed to reflect the different set of exogenous variables.

The complete MySQL code for this case can also be found in the Supplementary material. Once both MySQL databases were initialized, data were inserted, row by row, into the dataRaw table, automatically triggering the different procedures that populate all the other tables. The execution time increases with the size of the tables (mainly due to the selection of the nearest neighbors); however, even with six years of data in the second case it takes approximately 1 s to produce the forecast for a new data entry.

3.1. PV generation forecasting

As indicated above, the forecast error is updated automatically as new data are inserted into the dataRaw table. We tracked the performance of the different forecast models as a function of the size of the database. For this analysis we used common quantitative metrics for the point forecast, such as the root mean square error
\mathrm{RMSE}^f = \sqrt{ \frac{1}{M} \sum_{n=1}^{M} \left( x_{n+k} - \hat{x}^f_{n+k} \right)^2 }    (6)
where the superscript f denotes the forecast model (f ∈ {Pers, AR, kNN}). The forecast improvement relative to the reference persistence model is also computed [39]:

\mathrm{skill}^f = \left( 1 - \frac{\mathrm{RMSE}^f}{\mathrm{RMSE}^{\mathrm{Per}}} \right) \times 100    (7)
These metrics were computed daily in two ways: using all the data available up to that day, and using the preceding three months of data. The former reflects the overall performance of the models, whereas the latter allows the study of seasonal variations in the models' performance. Fig. 3 plots the global (dashed lines) and seasonal (solid lines) evolution of the RMSE for the three forecast models and four forecast horizons as a function of time, which is proportional to the database size. Fig. 4 shows the forecast improvement evolution for the AR and nearest neighbors forecasts with respect to the persistence model. Initially the global or cumulative values oscillate substantially due to the small data set. Once there is between one and one and a half years of data in the database the values stabilize, although small variations can be observed. The seasonal metrics show much more variability, with lower errors and higher skill in summer periods, and the reverse for winter. These plots show that, with sufficient data, the nearest neighbors model outperforms the other models for all forecast horizons.

In order to further analyze the performance of the kNN model we also recorded, for each time instance, the average and minimum distances (normalized by the maximum power output) for the 50 neighbors. The shaded band in Fig. 4 shows the 3-month rolling average for these two values. This seasonal depiction of the nearest-
Table 3. Triggers and stored procedures.

filterData (Trigger): Triggers when new data is inserted into the dataRaw table. Filters the new data by removing physically impossible measurements, normalizes the data, and updates the error columns in the forecast tables.
classifyData (Trigger): Triggers when new data is inserted into the dataNormalized table. Computes several statistical features using the most recent normalized data and inserts those values into the dataClass table. Initializes the rows of the dataTarget table and updates previous rows with the new PO measurements.
createForecasts (Trigger): Triggers when new data is inserted into the dataClass table. Computes the AR and kNN forecasts based on the latest data classification values and inserts them into the respective tables.
calcZenith (Procedure): Receives a DATETIME value and computes the corresponding zenith angle θ_z.
nnForecast (Procedure): Receives the date and time, cos θ_z, the forecast horizon, and the number of neighbors to consider in the forecast. Computes the forecast and inserts it into the table forecastNN.
arForecast (Procedure): Receives the date and time, cos θ_z, the forecast horizon, and the latest lagged values of PO. Computes the AR and persistence forecasts and inserts them into the tables forecastAR and forecastPer, respectively.
neighbors distance helps to explain the large swings in the seasonal kNN forecasting skill. Fig. 4 shows that, in summer periods, the minimum and average distances are substantially lower than in winter. Lower average and minimum distances indicate that closer matches were found during the nearest-neighbors identification which in turn benefits the forecasting performance. The seasonal disparity also indicates that, most likely, the best set of features to determine the nearest neighbors differs with the season. This could be explored by subjecting the kNN model to some optimization procedure that segregates between seasons (not explored in this work). Nevertheless, despite some indications that the kNN model could be further improved, these results show that the proposed methodology can be used to successfully implement a nearest neighbor algorithm that significantly outperforms simpler autoregressive models. In terms of how this model compares against the state of the art in short-term PV generation forecast, we resort
to the recent comprehensive review by Antonanzas et al. [27]. To do so, we use the results compiled in Fig. 9 of that paper, to which we overlay the forecasting skills reported here (Fig. 5). As remarked above, this paper focuses more on providing a general forecasting infrastructure than on the forecasting accuracy. Nevertheless, this comparison clearly shows that our results compare well with results from competing models. In general, methods with better skill are based on more complex machine learning tools and/or include additional predictors in the model. Two of the methods that clearly show better skill than what was presented here are identified explicitly in Fig. 5. In the case of the work by Soubdhan et al. [28], the higher forecasting skill was obtained using a Kalman filtering approach with a twofold parameter tuning procedure and exogenous solar and weather data. In the case of Lorenz et al. [29], the authors used several exogenous variables (satellite data and numerical weather predictions) and support vector regression to improve the forecast.
Fig. 2. Entity-relationship diagram for the proposed database infrastructure. Relationships are indicated using "Crow's foot" notation.
Fig. 3. Variation of the RMSE for the forecast models with time. Increasing time corresponds to increasing size of the database as new values are added sequentially. The different colors indicate the different forecast models. The dashed lines and the solid lines correspond to cumulative and seasonal metrics, respectively. (For interpretation of the references to color in this figure legend, the reader is referred to the Web version of this article.)

In the previous analysis we studied the point forecast accuracy of the proposed model. Another important feature of the nearest neighbor forecast implemented here is the ability to produce prediction intervals (Eq. (5)). Two quantitative performance parameters are often used to test the quality of the PIs: the PI coverage probability (PICP) [30] and the PI normalized average width (PINAW) [31]. The PICP measures the probability that an actual PO value is within the prediction interval and is defined as:

\mathrm{PICP} = \frac{1}{N} \sum_{i=1}^{N} \varepsilon_i    (8)

where N is the number of forecast instances and ε_i is a boolean variable given by

\varepsilon_i = \begin{cases} 1 & \text{if } x_i \in [L_i, U_i] \\ 0 & \text{if } x_i \notin [L_i, U_i] \end{cases}    (9)

where L_i and U_i denote the lower and upper bounds of the prediction interval, respectively. Often this value is compared against a nominal coverage used to create the PIs. In this case that is not done, given that the PIs result directly from the nearest neighbors, for which no nominal value is preset. The second metric measures the PI width relative to the maximum range observed in the forecasted variable:

\mathrm{PINAW} = \frac{1}{N \, PO_{\max}} \sum_{i=1}^{N} \left( U_i - L_i \right)    (10)

where PO_max is set to 1050 kW. Good PIs should have large PICP and low PINAW. If PINAW is large and approaches the maximum possible range, the PIs provide little information, as it is trivial to say that a future PO value will be within its extreme values. Fig. 6 shows these two metrics for the PIs computed using Eq. (5) for the four forecast horizons. The figures show features similar to the previous ones for the cumulative error metrics, with strong variations at the beginning followed by a quick convergence. The seasonal curves show larger variations, with lower PINAW and
Fig. 5. Comparison against recent works in short term PV generation forecast. The figure is adapted from Fig. 9 in Antonanzas et al. [27]. The sources from where these values were obtained are listed in the legend in Fig. 9 in Antonanzas et al. [27]. The black lines and symbols correspond to the results obtained with the methodology proposed here.
Fig. 4. Variation of the improvement of the AR and kNN forecast with respect to the persistence forecast as a function of time. The dashed lines and the solid lines correspond to cumulative and seasonal metrics, respectively. The band indicates normalized minimum (band's lower boundary) and average (band's upper boundary) distance for the 50 nearest neighbors in the kNN model for each forecast instance.
Fig. 6. Performance of the nearest neighbor forecasted PIs, measured by the metrics PINAW (left) and PICP (right), as a function of time. The different colors indicate the forecast horizon. The dashed lines and the solid lines correspond to cumulative and seasonal metrics, respectively. (For interpretation of the references to color in this figure legend, the reader is referred to the Web version of this article.)
lower PICP in winter periods, and higher values in summer. The figures show that the cumulative PICP converges to ≈95% regardless of the forecast horizon, whereas for PINAW the convergence value varies much more. For instance, the cumulative PINAW converges to ≈25% and ≈35% for the 15 and 60 min forecasts, respectively. The evolution of the global PINAW with time shows a slight but consistent decrease as more data is inserted into the table, an effect not so evident in the evolution of the point forecast metrics. This shows that increasing the pool of nearest neighbor candidates helps to define the PIs better (narrower PIs) without decreasing the PICP (PICP remains constant as PINAW decreases). This is a desirable characteristic of these forecasts, although, as Fig. 3 shows, the point forecast error does not improve in the long term.
In order to better observe the effect of an increasing pool of candidates available to the nearest neighbors model, we selected several days from the entire dataset for three levels of PO variability: low, medium and high. The days were selected by matching days with similar daily characteristics: RMSE of the persistence model, PO average, PO standard deviation, and PO line-length. In this analysis we use the 30 min forecast results. The same analysis for other horizons yields a similar outcome. Fig. 7 shows the measured PO for the selected days, the daily RMSE values for the persistence and kNN models, the forecasting skill, and the PI bounds as well as the daily PINAW value for the kNN model. The plots are ordered from top to bottom from low to high PO variability. The figure illustrates the reduction in the width of the PI with
Fig. 7. Daily RMSE and forecast skill for the 30 min forecast horizon for several days with similar daily PO profiles. The circles on the time line at the top of the plots indicate the selected days. The measured PO for each day is shown below the time line (solid black) together with the PI bounds calculated with the kNN model. The daily PINAW is indicated by the height of the shaded area behind the daily PO profile. The scale to the right of the profiles is used to quantify the daily PINAW. The scatter plots show the daily RMSE for the three models and the gray dashes indicate the skill for the kNN forecast. The skill is measured using the gray scale to the right of the scatter plot. The three subplots show days of different PO variability: (a) low, (b) medium, and (c) high.
Fig. 10. Same as Fig. 6 but for the wind power plant PO forecast.
Fig. 8. Same as Fig. 3 but for the wind power plant PO forecast. No AR model was implemented in this case.
the increase of the database size. That effect is especially noticeable in the case of the low variability days. The same effect is also noticeable in the medium variability days, but to a lesser extent, and not noticeable in the high variability days. For the low and medium variability days we observe that the daily RMSE for the persistence model is very stable, as it does not depend on the database length. In the case of the kNN model, also as expected, the highest RMSE values and lowest skills are observed for days near the beginning of the time line (due to the small pool of candidates). In the case of the high variability days, Fig. 7(c) shows that, although the predictions by the kNN model are always better than persistence, the error reduction is much smaller than in the other two cases.

Fig. 9. Same as Fig. 4 but for the wind power plant PO forecast. No AR model was implemented in this case.

3.2. Wind generation forecasting
The performance analysis for the second test case follows exactly the same procedure as in the previous case. Figs. 8 and 9 show the cumulative and seasonal RMSE and forecast skill, respectively, for the wind power forecasts. The global skill values at the end of the experiments vary between ≈5% and 8%. These values are substantially lower than the ones obtained in the previous case. This is expected given the higher variability of wind power relative to PV power. The seasonal skills plotted in Fig. 9 show values above 15% and 20% in the summer, in contrast to values below 5% in winter, indicating a strong seasonality in the kNN forecasting performance. It is well documented [32–34] that for very short forecasting horizons such as the ones targeted in this work the naïve persistence model is very competitive and hard to beat. Table 4 compares the results obtained with the proposed methodology against metrics provided in recent works on short-term wind power forecasting. As in the case of the PV generation forecast presented above, we can see that our results are comparable to the state of the art.

The PI performance metrics plotted in Fig. 10 show a very similar behavior to the previous case. Again, the cumulative PICP converges to ≈95% and the PINAW values show a consistent decrease with increasing data availability. A strong seasonality is also observed in this plot. The tightening of the PIs is illustrated in Fig. 11 for 10 days with similar wind power profiles. It is possible to observe a slight but consistent reduction in the width of the PIs as the size of the database increases. Cases that depart from this trend, such as days 7
Table 4. Comparison against recent works in short-term wind power forecasting with similar data resolution and forecast horizon. Some values are approximate given that they are only reported in figures. The nRMSE metric is defined as the RMSE (Eq. (6)) divided by the power plant nominal capacity and multiplied by 100.

Dowell and Pinson (2016) [35]: resolution 5 min, horizon 5 min, sparse vector autoregressive models, nRMSE 3.95%.
Dong et al. (2017) [36] (a): resolution 15 min, horizon 15–120 min, hybrid (wavelet decomposition + linear fuzzy neural network), nRMSE 2.66–14.78%.
Cavalcante et al. (2017) [37]: resolution 1 h, horizon 1 h, vector autoregression with LASSO, nRMSE ≈11%.
Yan et al. (2016) [38]: resolution 1 h, horizon 1–12 h, hybrid (temporally local moving window + Gaussian process), nRMSE ≈7–≈24%.
Present work: resolution 15 min, horizon 15–60 min, kNN implemented in a MySQL infrastructure, nRMSE 4.59–10.97%.

(a) Results from Experiment II, which are comparable to the forecast configuration presented here.
Fig. 11. Same as Fig. 7 but for days characterized by medium variability wind power. The upper and lower PI bounds correspond to the 30 min forecast. The inverted triangles over the x-axis indicate negative forecasting skill.
and 8, also show a poorer point forecast performance (negative skill in those cases). Nevertheless, the analysis of the point forecasts and PI bounds for this case shows a good overall performance relative to the persistence forecast, although in this case the seasonal effects are much stronger than in the previous case.
4. Conclusion

Here we demonstrated a MySQL database infrastructure to acquire, filter, classify, and forecast solar and wind power generation data. From this work we can conclude that MySQL can be used for much more than just storing data for solar forecast models. Any forecast model that can be expressed in algebraic form can be implemented in this manner, as demonstrated with the auto-regressive models used to predict solar power generation. By adapting the data classification and forecasting procedures presented here, it is possible to deploy many of the forecast models present in the literature in real time. This can be very relevant for the production of stand-alone forecasting units that provide local monitoring and forecasting for renewable generation. For instance, in the case of solar generation, a simple hardware configuration consisting of a portable computing unit (e.g. BeagleBone or Raspberry Pi) and a photodiode sensor could host the proposed methodology. The forecasts produced in this work may not be able to compete with more sophisticated approaches (sky imagers, cloud tracking, etc.) but they are robust backup approaches.

The nearest neighbor algorithm is especially well suited to this database infrastructure given that one of the key functions of MySQL is to query data based on some condition (in this case the similarity of data classification features). It is easy to implement and performs much better than simple auto-regressive models. This model also shows a very desirable property: its forecasting performance improves (mostly the prediction intervals) as more data becomes available. This improvement is achieved automatically (no model training is necessary) by virtue of the increase in the number of candidates from which the nearest neighbors are selected and the forecast produced.

The error analysis and prediction interval quality assessment revealed a strong seasonality in both test cases (stronger in the second). The lower performance in some seasons may be addressed, perhaps, by using different classification features when querying for the nearest neighbors in different seasons. This observation, and the fact that no classification feature selection, no optimization of the number of neighbors, and no optimization of the aggregation of the neighbors when computing the forecast were implemented, suggests that better forecasting performance is possible with this database infrastructure.

Acknowledgments

The authors gratefully acknowledge funding from the California Energy Commission PIER PON-13-303 program, which is managed by Dr. Silvia Palma-Rojas.

Appendix A. MySQL procedures

In this appendix we include some MySQL code snippets that illustrate some of the operations used in this work. As mentioned above, the complete code is available in the Supplementary material.
Appendix A.1. Procedure to compute the solar zenith angle

The zenith angle θ_z for a given timestamp (in UTC) and geographical location is easily computed using the following MySQL procedure (global variables start with @ and are listed in Table 2):
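The published procedure is not reproduced here; the simplified sketch below uses Cooper's declination formula and neglects the equation-of-time and longitude corrections (the actual implementation follows [19] and may differ; the OUT-parameter signature is an assumption):

DELIMITER //
CREATE PROCEDURE calcZenith(IN ts DATETIME, OUT z_deg DOUBLE, OUT cosz DOUBLE)
BEGIN
    DECLARE doy      INT;      # day of the year
    DECLARE decl     DOUBLE;   # solar declination [rad]
    DECLARE hour_ang DOUBLE;   # hour angle [rad]
    DECLARE lat      DOUBLE;   # latitude [rad]
    DECLARE solhour  DOUBLE;   # local solar hour [h]

    SET doy      = DAYOFYEAR(ts - INTERVAL @utc2loc HOUR);
    SET decl     = RADIANS(23.45 * SIN(RADIANS(360.0 * (284 + doy) / 365.0)));
    SET solhour  = HOUR(ts - INTERVAL @utc2loc HOUR)
                   + MINUTE(ts - INTERVAL @utc2loc HOUR) / 60.0;
    SET hour_ang = RADIANS(15.0 * (solhour - 12.0));
    SET lat      = RADIANS(@lat_deg);

    SET cosz  = COS(lat) * COS(decl) * COS(hour_ang) + SIN(lat) * SIN(decl);   # Eq. (1)
    SET z_deg = DEGREES(ACOS(cosz));
END//
DELIMITER ;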
Appendix A.2. Procedure to compute the AR forecast The persistence and AR models are implemented using a simple MySQL procedure that determines the predicted values and inserts them into the respective forecastPer and forecastAR tables:
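The published procedure (which contains the fitted a_0, ..., a_5 for each horizon) is not reproduced here; the sketch below shows the structure for a single horizon, with placeholder coefficients and an assumed parameter list (npo0, ..., npo4 are the normalized lagged PO values):

DELIMITER //
CREATE PROCEDURE arForecast(IN reftime DATETIME, IN k INT, IN cosz_val DOUBLE,
                            IN npo0 DOUBLE, IN npo1 DOUBLE, IN npo2 DOUBLE,
                            IN npo3 DOUBLE, IN npo4 DOUBLE)
BEGIN
    DECLARE valtime DATETIME;
    DECLARE po_per  DOUBLE;
    DECLARE po_ar   DOUBLE;

    SET valtime = reftime + INTERVAL (15 * k) MINUTE;

    # persistence: m = 1, a0 = 0, a1 = 1 in Eq. (2)
    SET po_per = npo0 * cosz_val;

    # auto-regressive model with m = 5; the coefficients below are placeholders, not the fitted values
    SET po_ar = (0.0 + 0.9 * npo0 + 0.05 * npo1 + 0.03 * npo2
                     + 0.01 * npo3 + 0.01 * npo4) * cosz_val;

    INSERT INTO forecastPer (_reftime, _valtime, _po) VALUES (reftime, valtime, po_per);
    INSERT INTO forecastAR  (_reftime, _valtime, _po) VALUES (reftime, valtime, po_ar);
END//
DELIMITER ;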
Appendix A.3. Procedure to compute the kNN forecast

The key procedure in this work computes the kNN forecast from the list of nearest neighbors. This procedure retrieves the target data depending on the forecasting horizon and aggregates those values using the arithmetic mean operator AVG(). The list of nearest neighbors is retrieved from the table tempTableNN, which is populated according to the code snippet provided in Section 2.4.2:
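The complete procedure is not reproduced here; an abridged sketch for the 15 min horizon (the real procedure selects the target column according to the horizon argument; the parameter list and column names are assumptions) is:

DELIMITER //
CREATE PROCEDURE nnForecast(IN reftime DATETIME, IN k INT, IN cosz_val DOUBLE)
BEGIN
    DECLARE valtime    DATETIME;
    DECLARE nPO_nn     DOUBLE;
    DECLARE nPO_nn_min DOUBLE;
    DECLARE nPO_nn_max DOUBLE;

    SET valtime = reftime + INTERVAL (15 * k) MINUTE;

    # aggregate the normalized target PO of the nearest neighbors with AVG(), MIN() and MAX()
    SELECT AVG(_npo15), MIN(_npo15), MAX(_npo15)
      INTO nPO_nn, nPO_nn_min, nPO_nn_max
      FROM dataTarget
     WHERE _timestamp IN (SELECT _timestamp FROM tempTableNN);

    # de-normalize, cap at the maximum power output and store the point forecast and PI bounds
    INSERT INTO forecastNN (_reftime, _valtime, _po, _po_min, _po_max)
    VALUES (reftime, valtime,
            LEAST(@MAXPO, nPO_nn     * cosz_val),
            LEAST(@MAXPO, nPO_nn_min * cosz_val),
            LEAST(@MAXPO, nPO_nn_max * cosz_val));
END//
DELIMITER ;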
Appendix B. Supplementary data

Supplementary data related to this article can be found at https://doi.org/10.1016/j.renene.2018.02.043.

References

[1] C.L. Archer, H.P. Simão, W. Kempton, W.B. Powell, M.J. Dvorak, The challenge of integrating offshore wind power in the U.S. electric grid. Part I: wind forecast error, Renew. Energy 103 (2017) 346–360.
[2] D. Halamay, T. Brekken, A. Simmons, S. McArthur, Reserve requirement impacts of large-scale integration of wind, solar, and ocean wave power generation, IEEE Trans. Sustain. Energy 2 (2011) 321–328.
[3] D. Neves, M.C. Brito, C.A. Silva, Impact of solar and wind forecast uncertainties on demand response of isolated microgrids, Renew. Energy 87 (2016) 1003–1015.
[4] A.M. Foley, P.G. Leahy, A. Marvuglia, E.J. McKeogh, Current methods and advances in forecasting of wind power generation, Renew. Energy 37 (2012) 1–8.
[5] R.H. Inman, H.T.C. Pedro, C.F.M. Coimbra, Solar forecasting methods for renewable energy integration, Prog. Energy Combust. Sci. 39 (2013) 535–576.
[6] J. Kleissl, Solar Energy Forecasting and Resource Assessment, Academic Press, 2013.
[7] G. Reikard, S.E. Haupt, T. Jensen, Forecasting ground-level irradiance over short horizons: time series, meteorological, and time-varying parameter models, Renew. Energy 112 (2017) 474–485.
[8] C. Voyant, G. Notton, S. Kalogirou, M.L. Nivet, C. Paoli, F. Motte, A. Fouilloy, Machine learning methods for solar radiation forecasting: a review, Renew. Energy 105 (2017) 569–582.
[9] H.T.C. Pedro, R.H. Inman, C.F.M. Coimbra, Mathematical methods for optimized solar forecasting, in: G. Kariniotakis (Ed.), Renewable Energy Forecasting, Woodhead Publishing Series in Energy, Woodhead Publishing, 2017, pp. 111–152.
[10] M.Q. Raza, M. Nadarajah, C. Ekanayake, On recent advances in PV output power forecast, Sol. Energy 136 (2016) 125–144.
[11] A. Barua, S.W. Thomas, A.E. Hassan, What are developers talking about? An analysis of topics and trends in Stack Overflow, Empir. Software Eng. 19 (2012) 619–654.
[12] MySQL 5.5, MySQL 5.5 Reference Manual, 2016. http://dev.mysql.com/doc/refman/5.5/en/. (Accessed June 2016).
[13] R. Zach, M. Schuss, R. Bruer, A. Mahdavi, Improving building monitoring using a data preprocessing storage engine based on MySQL, in: eWork and eBusiness in Architecture, Engineering and Construction, CRC Press, 2012, pp. 151–157.
[14] G. Anders, S.D. Mackowiak, M. Jens, J. Maaskola, A. Kuntzagk, N. Rajewsky, M. Landthaler, C. Dieterich, doRiNA: a database of RNA interactions in post-transcriptional regulation, Nucleic Acids Res. 40 (2012) 180–186.
[15] Y. Chu, M. Li, H.T.C. Pedro, C.F.M. Coimbra, Real-time prediction intervals for intra-hour DNI forecasts, Renew. Energy 83 (2015) 234–244.
[16] Y. Chu, H.T.C. Pedro, M. Li, C.F.M. Coimbra, Real-time forecasting of solar irradiance ramps with smart image processing, Sol. Energy 114 (2015) 91–104.
[17] SURFRAD, SURFRAD Data README, 2016. ftp://aftp.cmdl.noaa.gov/data/radiation/surfrad/Boulder_CO/README. (Accessed June 2016).
[18] V. Badescu, C.A. Gueymard, S. Cheval, C. Oprea, M. Baciu, A. Dumitrescu, F. Iacobescu, I. Milos, C. Rada, Accuracy analysis for fifty-four clear-sky solar radiation models using routine hourly global irradiance measurements in Romania, Renew. Energy 55 (2013) 85–103.
[19] V. Badescu, Modeling Solar Radiation at the Earth's Surface, Springer, 2014.
[20] H.T.C. Pedro, C.F.M. Coimbra, Nearest-neighbor methodology for prediction of intra-hour global horizontal and direct normal irradiances, Renew. Energy 80 (2015) 770–782.
[21] H.T.C. Pedro, C.F.M. Coimbra, Short-term irradiance forecastability for various solar micro-climates, Sol. Energy 122 (2015) 587–602.
[22] S.A. Fatemi, A. Kuh, Solar radiation forecasting using zenith angle, in: 2013 IEEE Global Conference on Signal and Information Processing, 2013, pp. 523–526.
[23] MySQL, MySQL Community Server, 2016. https://dev.mysql.com/downloads/mysql/. (Accessed June 2016).
[24] NREL, Wind Integration National Dataset Toolkit, 2017. https://www.nrel.gov/grid/wind-toolkit.html.
[25] J. King, A. Clifton, B.M. Hodge, Validation of Power Output for the WIND Toolkit, Technical Report, National Renewable Energy Laboratory, 2014.
[26] C. Draxl, A. Clifton, B.M. Hodge, J. McCaa, The wind integration national dataset (WIND) Toolkit, Appl. Energy 151 (2015) 355–366.
[27] J. Antonanzas, N. Osorio, R. Escobar, R. Urraca, F.J. Martinez-de Pison, F. Antonanzas-Torres, Review of photovoltaic power forecasting, Sol. Energy 136 (2016) 78–111.
[28] T. Soubdhan, J. Ndong, H. Ould-Baba, M.T. Do, A robust forecasting framework based on the Kalman filtering approach with a twofold parameter tuning procedure: application to solar and photovoltaic prediction, Sol. Energy 131 (2016) 246–259.
[29] E. Lorenz, J. Kühnert, B. Wolff, A. Hammer, O. Kramer, D. Heinemann, PV power predictions on different spatial and temporal scales integrating PV measurements, satellite data and numerical weather predictions, in: Proceedings of the 29th European Photovoltaic Solar Energy Conference and Exhibition (EU PVSEC), 2014, pp. 22–26.
[30] A. Khosravi, S. Nahavandi, D. Creighton, Construction of optimal prediction intervals for load forecasting problems, IEEE Trans. Power Syst. 25 (2010) 1496–1503.
[31] A. Khosravi, S. Nahavandi, D. Creighton, Prediction interval construction and optimization for adaptive neurofuzzy inference systems, IEEE Trans. Fuzzy Syst. 19 (2011) 983–988.
[32] R. Blonbou, Very short-term wind power forecasting with neural networks and adaptive Bayesian learning, Renew. Energy 36 (2011) 1118–1124.
[33] H. Madsen, P. Pinson, G. Kariniotakis, H.A. Nielsen, T.S. Nielsen, Standardizing the performance evaluation of short-term wind power prediction models, Wind Eng. 29 (2005) 475–489.
[34] M. Milligan, M. Schwartz, Y.H. Wan, Statistical Wind Power Forecasting Models: Results for US Wind Farms, Technical Report, National Renewable Energy Laboratory (NREL), Golden, CO, 2003.
[35] J. Dowell, P. Pinson, Very-short-term probabilistic wind power forecasts by sparse vector autoregression, IEEE Trans. Smart Grid 7 (2016) 763–770.
[36] Q. Dong, Y. Sun, P. Li, A novel forecasting model based on a hybrid processing strategy and an optimized local linear fuzzy neural network to make wind power forecasting: a case study of wind farms in China, Renew. Energy 102 (2017) 241–257.
[37] L. Cavalcante, R.J. Bessa, M. Reis, J. Browell, LASSO vector autoregression structures for very short-term wind power forecasting, Wind Energy 20 (2017) 657–675.
[38] J. Yan, K. Li, E.W. Bai, J. Deng, A.M. Foley, Hybrid probabilistic wind power forecasting using temporally local Gaussian process, IEEE Trans. Sustain. Energy 7 (2016) 87–95.
[39] R. Marquez, C.F.M. Coimbra, Proposed metric for evaluating solar forecasting models, ASME J. Sol. Energy Eng. 135 (2013) 011016-1–011016-9.