On the determination of coherent solar microclimates for utility planning and operations


Solar Energy 102 (2014) 173–188, www.elsevier.com/locate/solener

Athanassios Zagouras, Rich H. Inman, Carlos F.M. Coimbra*

Department of Mechanical and Aerospace Engineering, Jacobs School of Engineering, Center for Energy Research and Center for Renewable Resource Integration, University of California San Diego, La Jolla, CA 92093, USA

Received 2 July 2013; received in revised form 10 January 2014; accepted 18 January 2014; available online 11 February 2014

Communicated by: Associate Editor Christian A. Gueymard

Abstract

This work presents a cluster analysis for the determination of coherent zones of Global Horizontal Irradiance (GHI) for a utility-scale territory in California serviced by San Diego Gas & Electric. Knowledge of these coherent zones, or clusters, would allow utilities and power plants to realize cost savings through regional planning and operation activities, such as the mitigation of solar power variability through the intelligent placement of solar farms and the optimal placement of radiometric stations. In order to determine such clusters, two years of gridded satellite data were used to describe the evolution of GHI over a portion of Southern California. Step changes of the average daily clear-sky index at each location are used to characterize the fluctuation of GHI. The k-means clustering algorithm is applied in conjunction with a stable initialization method to diminish its dependency on random initial conditions. Two validity indices are then used to assess the quality of the cluster partitions as well as the appropriate number of clusters. The clustering algorithm determined an optimal number of 14 coherent spatial clusters of similar GHI variability as the most appropriate segmentation of the service territory map. In addition, 14 cluster centers are selected whose radiometric observations may serve as a proxy for the rest of their cluster. A correlation analysis within and between the proposed clusters, based on both single-point ground-based and satellite-derived measurements, confirms the coherence of the resulting clustering. This method could easily be applied to any other utility-scale region and is not specific to GHI data, which shows promise for the application of such clustering methods to load data and/or other renewable resources such as wind.

© 2014 Elsevier Ltd. All rights reserved.

Keywords: Clustering validation; Data clustering; Utility planning and operations; Coherent solar microclimates

1. Introduction

The heightened awareness surrounding climate change on a global scale underlines the significance of the technological and economic issues associated with increased levels of renewable penetration into the power grid. In particular, solar power generation has recently seen strong increases in market share and corresponding growth in grid penetration

* Corresponding author. Tel.: +1 858 534 4285; fax: +1 858 534 7599.

E-mail address: [email protected] (C.F.M. Coimbra).
http://dx.doi.org/10.1016/j.solener.2014.01.021
0038-092X/© 2014 Elsevier Ltd. All rights reserved.

rates, which has led to issues concerning the variability of the solar resource at ground level and the associated ramp rates in solar power production. These issues arise from the coupling of cloud dynamics and resource availability, which is radically unlike traditional power generation technologies (such as fossil and nuclear power) that were designed to run in stable output modes; as a result, the majority of power grid variability has historically originated from demand fluctuations (Energy, 2010). Consequently, utilities must develop new mitigating measures to offset weather-dependent solar power fluctuations in order to effectively


integrate increased levels of solar penetration into the grid while maintaining grid reliability (Inman et al., 2013; Energy, 2010; Venkataraman et al., 2010; Rodriguez, 2010; Boyle, 2008). Recently, a number of methods have been proposed for mitigating such variability effects through the improvement of irradiance forecasting techniques (see, e.g., Marquez and Coimbra, 2011; Marquez et al., 2013; Chow et al., 2011). Forecasting allows for the development of tools aimed at lowering operational costs through improved grid regulation, load-following production, power scheduling, unit commitment, automatic generation control, and a diminished need to dispatch ancillary fossil fuel generators (Inman et al., 2013). That being said, ground-based irradiance sensing networks are typically required in order to perform validation and verification activities on the forecasts themselves. Unfortunately, accurate ground-based pyranometer observations have traditionally been confined to scattered weather monitoring stations and research institutions, due to the capital required to procure, maintain, and quality-check instruments and data. Therefore, knowledge concerning the optimal number and location of ground-based monitoring stations required to effectively monitor a service area is of great value to system operators and utilities, as these are typically not known a priori.

Several studies have attempted to optimize the spatial distribution of weather monitoring stations. Amorim et al. (2012) proposed a monitoring network to observe the average air temperature based on the optimization of the geostatistical uncertainty of estimation. Vose and Menne (2004) described an iterative process for reducing the density of an existing network of stations and for estimating the appropriate number of stations required for a U.S. climate observation system.
Schmalwieser and Schauberger (2001) proposed an objective method for selecting sites for the Austrian UV monitoring network based on de-correlation distances. Kabalci (2011) developed an intelligent analysis tool based on an agglomerative hierarchical clustering approach to provide a mapping of solar radiation parameters of the Central Anatolian Region of Turkey. In addition, methods for producing physical mappings of regions that exhibit similar climatological attributes have also been proposed in the literature. Diabate et al. (2004) performed a clustering analysis of the monthly means of the daily clearness indexes of 62 stations and defined 20 coherent solar radiation regions in Africa. Similarly, Zagouras et al. (2013) proposed a novel method for the optimal placement of radiometric stations over the country of Greece, where a cluster analysis of data representing the biannual fluctuation of cloud modification factors concluded that 22 ground-based instruments could efficiently track the surface solar irradiance over the observed area.

In this work we attempt to provide a detailed understanding of solar power variability over a utility-scale region in order to aid in the planning and deployment of

the most cost-effective radiometric network to facilitate solar forecasting activities. For a given service territory, defining the number and location of areas with similar/dissimilar variability would allow utilities to strategically site solar power farms as well as subsidize areas of desired private solar production growth. In this way, a cluster analysis would aid planning and operations for future solar growth through the spatial averaging of production fluctuations. Additionally, understanding the variability associated with different regions would enable system operators to improve decisions on unit dispatch by increasing the confidence of unit commitment operations, predicting intra-hour dispatch, and reducing automatic generation control (AGC) errors. To this end, we aim to determine the optimal number and spatial distribution of regions of coherent global irradiance based on a cluster analysis. To achieve this, a clustering scheme that identifies regions with coherent structures of solar irradiance is proposed and applied to a utility-scale territory in California serviced by San Diego Gas & Electric (Fig. 1). An efficient feature extraction technique is applied to satellite-derived solar data, yielding a set of vectors representing the average daily clear-sky index over the gridded domain of interest. In addition, data available from an existing radiometric network operated by the California Irrigation Management Information System (CIMIS) are used to validate the clustering results obtained from the satellite-based irradiance data.

2. Approach

In order to obtain different partitions for a varying number of clusters, we employ the k-means clustering algorithm (MacQueen, 1967) in this study. A deterministic initialization method for the seed centers is utilized to address the fundamental drawback of k-means, which is its solution instability.
Additionally, the determination of the number of clusters represents a critical step in this process, since no known partition of the data is available initially. Clustering validation is performed by the computation of two internal validity indices in order to investigate the number of clusters that best captures the cohesion and separation of the clustering partition with respect to the parameterization of the variability distribution problem. The appropriate number of clusters is estimated by a simple but efficient graphical method known as the L-method (Salvador and Chan, 2005). Results show convergence to a narrow range of cluster numbers for the two validity indices. We then produce a representative mapping of coherent irradiance clusters for the region of interest, and determine the optimal range of numbers and respective locations for placing solar monitoring installations.

This work is organized as follows: Section 3 describes the dataset, the preprocessing techniques, and the proposed clustering feature; Section 4 details the k-means clustering algorithm and its initialization scheme, the two clustering validity indices, and the

Fig. 1. Map showing the state of California, the area of interest employed in this study, and the San Diego Gas & Electric service area. The inset shows an example of a single frame of irradiance data as obtained from the Solar Anywhere service; the color scale gives GHI in W m⁻² (0–900).

L-method for the determination of the final number of clusters; numerical experiments are described in Section 5; results are presented in Section 6; and Section 7 highlights the conclusions of this study.

3. Data and preprocessing

3.1. Satellite data

In order to investigate coherent clusters of similar broadband Global Horizontal Irradiance (GHI) while maintaining a uniform spatial discretization over a utility-scale area of interest, irradiance data derived from satellite images are employed for the cluster analysis in this work. In particular, we use GHI data from the Solar Anywhere (2012) Enhanced Resolution dataset for 2009 and 2010. This dataset consists of GHI derived from the semi-empirical SUNY model, which extracts global and direct irradiances from the visible channel of geostationary weather satellites (Perez et al., 2002). The spatial and temporal resolutions of the dataset are 0.01° × 0.01° (approximately 1 km × 1 km) and 30 min, respectively. The spatial domain of interest covers the landmass of Southern California between 32°N–34°N and 116°W–119°W. This utility-scale domain includes the San Diego Gas & Electric service area, which supplies power to 1.4 million business and residential accounts in a 4100 square-mile service area spanning two counties and 25 communities, shown in Fig. 1 for reference.

3.2. Clear-sky model

Clear Sky Models (CSMs), which estimate ground-level irradiance in the absence of clouds, are employed in several areas of solar engineering, including persistence forecasts, establishing forecasting skill metrics, normalizing satellite data, and defining clear-sky indexes (Inman et al., 2013). While a number of broadband GHI CSMs exist in the literature, it has been shown that it is not the model itself, but

rather the quality of the input parameters (most notably turbidity) that has the highest influence on GHI CSM accuracy (Ineichen, 2006; Reno et al., 2012). Rather than employ a multi-parameter CSM, which can require up to eight atmospheric inputs, in this study we employ the bulk-parameter CSM developed by Ineichen and Perez (2002), which requires only the Linke turbidity coefficient as an input. This CSM is based on an air-mass-independent formulation of the Linke turbidity, which has the advantages of being independent of zenith angle and of matching the original Linke turbidity factor at an air mass of 2. In addition, it is well known that the local Linke turbidity coefficient varies in space and time with fluctuating concentrations of aerosol particles suspended in the atmosphere (Eltbaakh et al., 2012). To address this, the CSM references monthly Linke turbidity maps of the world, which were developed by Remund et al. (2003) using a combination of ground measurements and satellite data. While it is true that the maps of Linke turbidity employed in this study possess relatively low spatial resolution and exhibit relatively large uncertainties (especially over complex terrain), which lead to unavoidable errors, it is important to note that our study investigates the step changes of the average daily clear-sky index at each pixel in order to create coherent clusters. In other words, the vectors which are being compared are temporal vectors at a fixed location (see Section 3.4). To this end, the uncertainties in Linke turbidity at neighboring locations do not influence the nature of the vector elements at a given location. In addition, any bias introduced through uncertainties in the Linke turbidity at a specific location is removed through the examination of step changes of the variability index during the construction of the temporal vectors themselves.

3.3. Clear-sky index

In order to remove variability associated with deterministic diurnal/seasonal solar cycles, in this work we normalize the GHI time series with respect to the CSM described in the previous section. The normalized GHI, or clear-sky index K_c as it is more commonly known, is defined as:

K_c(t) = G_h(t) / G_hc(t)    (1)
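As a concrete illustration of Eq. (1) and of the daily trapezoidal averaging of Section 3.4, the sketch below computes the clear-sky index from modeled GHI and clear-sky GHI series and then the daily average clear-sky index q. The function names and the division guard near sunrise/sunset are our own assumptions, not part of the original implementation.

```python
import numpy as np

def clear_sky_index(ghi, ghi_clear, eps=1e-6):
    """Eq. (1): K_c(t) = G_h(t) / G_hc(t).
    Guard against division by ~0 when the clear-sky GHI vanishes (night)."""
    return np.where(ghi_clear > eps, ghi / np.maximum(ghi_clear, eps), 0.0)

def daily_average_clear_sky_index(kc_day):
    """Trapezoidal average of K_c over the c+1 daylight frames of one day:
    q = (1/(2c)) * [K_c(0) + 2*sum_{i=1}^{c-1} K_c(i) + K_c(c)]."""
    kc = np.asarray(kc_day, dtype=float)
    c = len(kc) - 1  # number of intervals between the c+1 frames
    return (kc[0] + 2.0 * kc[1:-1].sum() + kc[-1]) / (2.0 * c)
```

A perfectly clear day (K_c identically 1) yields q = 1 regardless of day length, which is the unit-independence property the text highlights.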

where G_h(t) is the modeled GHI at time t, G_hc(t) is the modeled clear-sky GHI at time t, and the dimensionless quantity K_c tends to vary between 0 and 1. It is worth emphasizing, however, that in practice values of K_c may exceed unity during cloud enhancement events and/or when the modeled clear-sky GHI is negatively biased (Tapakis and Charalambides, 2014).

3.4. Variability of the clearness

As a first step towards coherent GHI clustering, daily variability is considered in this work. To this end, it is desirable to compile daily (rather than intra-hourly) parameters at each pixel to be employed in the cluster analysis. In order to accomplish this, K_c frames corresponding to the same day are averaged at each location to yield vectors of average daily clearness. Note that prior to the normalization, integration of GHI values over a day is equivalent to the total daily energy per pixel area (J m⁻²) and speaks to the pixel's relative potential for energy production. After normalization, however, the integral represents the Daily Average Clear-Sky Index q for a given pixel, which is a dimensionless parameter that tends to vary between 0 and 1. The extracted feature q for a daily time course of K_c defined on a closed interval [0, T_c] is approximated by the trapezoidal rule:

q = (1/T_c) ∫_0^{T_c} K_c(t) dt ≈ (1/(2c)) [ K_c(0) + 2 Σ_{i=1}^{c−1} K_c(i) + K_c(c) ]    (2)

where T_c is the total daylight time (s) of the c frames for each day, which varies with the day/season. This definition of q possesses several benefits: it is independent of units, it eliminates inconsistencies in the length of days,

and is void of deterministic fluctuations, resulting in a time series which is dimensionless, bounded, stationary, and completely stochastic.

Rather than use the temporal vectors of q in the cluster analysis directly, a final post-processing step is applied which aims to investigate the variability of clearness at each pixel. To this end, we define a new variability measure of the Daily Average Clear-Sky Index as:

Variability(i) = Δq_i = | [ q_i(2) − q_i(1), …, q_i(M) − q_i(M−1) ] |,  i = 1, …, N    (3)

where M is the number of days in the dataset and i indexes the pixels. We expect low values of Δq for consecutive similar cloud conditions and high values for a sequence of alternating clear and cloudy days. With this preprocessing technique the original dataset is transformed such that each element of a vector represents the consecutive daily absolute step changes of q for one location in the area of interest. Fig. 2 depicts the described feature extraction stage, where the N land-cover pixels are used to construct N temporal vectors that represent the course of Δq at each pixel's location. At this point the dimensionality of the dataset should be reduced before the cluster analysis can begin, as described in the following subsection.

Fig. 2. A sequence of consecutive K_c frames over Southern California. The lower blue segment is not a territory of the United States and is not included in the experiments. An example of how the concatenated pixels over a specific location are used to calculate the daily q value is illustrated (left). All the temporal vectors (blue lines) that represent the M − 1 daily differences of q (red dots) for each of the N pixel locations are then stacked to create the Δq feature dataset (middle). A dimensionality reduction method (un-centered PCA/SVD) then provides the clustering input dataset (right). (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

3.5. Dimensionality reduction

The high dimensionality of the temporal vectors requires a dimensionality reduction in order to lower the computational complexity and restrict the conventional noise effects in the data. For this purpose we employ the most widely used linear dimensionality reduction method, Principal Component Analysis (PCA) (Jolliffe, 1986). Ding and He (2004) showed that dimensionality reduction via PCA is directly related to unsupervised clustering, since the continuous solutions of the discrete cluster membership indicators in k-means clustering are in fact the principal eigenvectors given by the PCA. Although centering of the data matrix at the origin (i.e., column-wise mean subtraction) is standard before applying classical PCA, many researchers debate this requirement and investigate the differences between centered and uncentered PCA (Cadima and Jolliffe, 2009), especially in studies regarding neurobiological data (Spetsieris and Eidelberg, 2011; Gwadry et al., 2011) and physical signals (Noy-Meir, 1973; Miranda et al., 2008). Subtraction of the mean from each column is equivalent to a translation of the original high-dimensional axes to the mean of the coordinates. Thus, the PCA relates the variability of the differences from the means about the shifted origin. Additionally, centering deprives the original data of their physical interpretation by representing unsigned physical data with negative values. On the contrary, un-centered PCA provides a low-dimensional coordinate system through the Principal Components (PCs), which is explained by the variance about the center of gravity of the dataset (Cadima and Jolliffe, 2009). Considering the advantages resulting from the un-centered scheme, such as (i) the untransformed representation of the data structure in the feature space in terms of discontinuous data (Noy-Meir, 1973) and (ii) the purposeful understanding of the time dependencies, we choose not to apply column-wise mean subtraction prior to the PCA. In this study, the PCA via Singular Value Decomposition (SVD) projects the initial high-dimensional un-centered data onto the best low-dimensional linear approximation in such a manner that 99% of the initial variance of the data is preserved.
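A minimal sketch of the feature-extraction and reduction steps above (Eq. (3) followed by un-centered PCA via SVD). The function names, the 99% retention logic, and the matrix orientation (pixels as rows, days as columns) are our assumptions, not the authors' implementation.

```python
import numpy as np

def delta_q_features(q):
    """Eq. (3): q has shape (N_pixels, M_days); returns the (N, M-1) matrix
    of absolute day-to-day step changes of the daily average clear-sky index."""
    return np.abs(np.diff(q, axis=1))

def uncentered_pca(X, keep_variance=0.99):
    """Un-centered PCA via SVD: no column-mean subtraction, so the retained
    components describe variance about the origin and preserve the sign of
    the physical data.  Keeps the smallest rank explaining >= keep_variance."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    explained = np.cumsum(s ** 2) / np.sum(s ** 2)
    r = int(np.searchsorted(explained, keep_variance)) + 1
    return X @ Vt[:r].T  # (N_pixels, r) input vectors for the cluster analysis
```

Because the squared singular values partition the total (un-centered) variance, truncating at the 99% cumulative level gives the best low-rank linear approximation mentioned in the text.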


3.6. CIMIS dataset

Validation of the proposed clustering via Solar Anywhere is performed through a comparison to available ground-based radiometric stations in the region of interest. The purpose of this analysis is to demonstrate that data collected by instruments located at ground level within each cluster's area are consistent with the clustering of the satellite-derived time series. Ground data from locations that belong to remote clusters are expected to behave differently; consequently, the consistent topology of the proposed clustering is substantiated. For this study, hourly ground measurements from meteorological stations of the California Irrigation Management Information System (CIMIS) were used. This independent dataset has been used in other comparisons (Nottrott and Kleissl, 2010). We include only the 13 CIMIS stations that are located within the examined region of interest in this analysis (see Table 3). Each weather station measures GHI using a LI200S Li-Cor photovoltaic pyranometer, whose accuracy is within ±5% under natural sunlight conditions. The sensors record irradiance every minute, and hourly/daily averages are calculated. A preprocessing stage, applied to both the ground and satellite-derived GHI time series, retains only the time stamps within 3 h of solar noon for each day in order to exclude gratuitous fluctuations. To match the time resolution of the Solar Anywhere dataset, we downscale the ground measurement sampling rate from 1 h to 30 min by duplicating each hourly value.

Table 2
Appropriate number of clusters according to each validity index. The upper part shows the results for fixed threshold levels over a range of numbers of clusters k. The bottom part shows the final number of clusters among the partitions that correspond to the minimum and the median SSE for each k, regardless of the threshold level.

Threshold distance (% of the maximum distance) | CH | SIL
3.33  | 16 | 20
5.33  | 16 | 20
7.33  | 16 | 20
9.33  | 16 | 18
11.33 | 16 | 14
13.33 | 16 | 14
15.33 | 18 | 14
17.33 | 20 | 14
19.33 | 14 | 12
21.33 | 16 | 12
23.33 | 16 | 14
25.33 | 16 | 12
27.33 | 14 | 10
29.33 | 14 | 8
31.33 | 18 | 8

Regardless of threshold distance | CH | SIL
Minimum SSE | 16 | 10
Median SSE  | 14 | 14

4. Methods

4.1. Clustering algorithm

The k-means clustering algorithm is one of the most widely used and simple methods of unsupervised learning

Table 1
The main statistics of the SSE (×10⁴). Each column shows the minimum (min), the mean, and the standard deviation (std) of the SSE for k clusters, as obtained from all the tested threshold levels of the initialization method.

k    | 4    | 6    | 8    | 10   | 12   | 14   | 16   | 18   | 20   | 30   | 40   | 50   | 60   | 70   | 80
Min  | 7.63 | 6.28 | 5.72 | 5.32 | 4.99 | 4.64 | 4.38 | 4.15 | 3.96 | 3.37 | 2.98 | 2.72 | 2.49 | 2.34 | 2.20
Mean | 7.73 | 6.36 | 5.77 | 5.34 | 5.05 | 4.74 | 4.46 | 4.25 | 4.06 | 3.40 | 3.02 | 2.75 | 2.55 | 2.40 | 2.25
Std  | 0.14 | 0.12 | 0.06 | 0.02 | 0.03 | 0.05 | 0.06 | 0.09 | 0.09 | 0.02 | 0.04 | 0.03 | 0.04 | 0.03 | 0.03


Table 3
Locations of the CIMIS measurement sites within the domain of interest for the period 2009–2010.

Station ID | Latitude (°N) | Longitude (°W) | Station name   | County
62  | 33.49 | 117.22 | Temecula       | Riverside
75  | 33.69 | 117.72 | Irvine         | Orange
118 | 33.84 | 116.48 | Cathedral City | Riverside
136 | 33.52 | 116.15 | Oasis          | Riverside
147 | 32.63 | 116.94 | Otay Lake      | San Diego
150 | 32.89 | 117.14 | Miramar        | San Diego
153 | 33.08 | 116.98 | Escondido SPV  | San Diego
173 | 32.90 | 117.25 | Torrey Pines   | San Diego
174 | 33.80 | 118.09 | Long Beach     | Los Angeles
179 | 33.66 | 117.09 | Winchester     | Riverside
184 | 32.73 | 117.14 | San Diego II   | San Diego
200 | 33.75 | 116.26 | Indio 2        | Riverside
208 | 33.68 | 116.27 | La Quinta II   | Riverside

(Jain, 2010; Wu et al., 2008). The reasons for its popularity lie primarily in its scalability and simplicity. On the other hand, the algorithm also suffers from a number of limitations. Primarily, it treats the underlying structure of the data as hyper-spherical, owing to the typical selection of the Euclidean distance as the clustering criterion. For this reason, k-means partitions may be fallacious for datasets composed of non-hyper-spherical structures. In addition, the requirement to define the number of clusters k a priori can also be considered a major handicap. The default iterative refinement algorithm of k-means (Lloyd, 1982) uniformly chooses k random points as the initial centers of the desired k clusters, and each point of the dataset is assigned to its closest center. Subsequently, the positions of the k centers are iteratively optimized by minimizing the distance criterion between the points of a cluster and its center. The algorithm stops either after a predefined number of iterations or when a convergence threshold of a criterion function is reached. Hence, the effectiveness of k-means depends on how close the initial centers are to the final partition: different seed centers generate divergent final clustering solutions, and the risk of convergence to local minima of the distance criterion is high.

4.2. Initialization method

In order to achieve a stable solution, several heuristic algorithms have been proposed. Celebi et al. (2013) presented a comprehensive survey along with a comparative study of k-means initialization methods. These methods are mainly distinguished by their time complexity and their deterministic or non-deterministic heuristic approach to selecting the initial centers. In this work, we present a deterministic initialization scheme that provides stable seed centers with respect to a structural parameter, as described

below. A pseudocode for the presented initialization scheme is given in Appendix A.

The first step is to initialize a set of candidate initial centers from the entire dataset based on the reverse nearest neighbor (RNN) search proposed by Xu et al. (2009). According to this scheme, the nearest neighbor (NN) of each point is calculated using the Euclidean distance, and the points are then sorted in descending order according to how many times each one was selected as another point's NN, i.e., according to the number of RNNs. This concept is based on the assumption that a point assigned as a NN multiple times has a dense distribution of points around it, and thus it is reasonable to consider it a cluster center. Based on this method it is likely that close neighboring centers occur, especially in regions of high density. To avoid extremely close candidate centers, a threshold distance is applied, upon which the first m points are selected as the set of candidate centers S. An empirical rule defines the threshold value as a "reasonable portion" of the maximum distance that occurs in the dataset (i.e., not finer than the spatial resolution, and not greater than the order of magnitude of the distance between the farthest points). After the determination of the initial set of potential centroids S(m), the algorithm extracts the final cluster centers. If we let the number of candidate centroids be m, the remaining N − m points are assigned to them according to the minimum Euclidean distance, thus creating clusters of data. We select the n clusters with the maximum numbers of members as the initial centroids. This approach is based on the maxmin algorithm (Theodoridis and Koutroumbas, 2009). It is expected that this approach provides more reasonable clusters than the first rough selection, thus yielding more robust and dense clusters.
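The two stages described above (RNN-ranked, distance-thresholded seeding followed by standard Lloyd iterations) can be sketched as follows. This is our reconstruction under stated assumptions (brute-force distance matrix, m = 3k candidates, Euclidean distance), not the authors' implementation, whose pseudocode is given in Appendix A.

```python
import numpy as np

def rnn_seed_centers(X, k, threshold):
    """Deterministic seeding after Xu et al. (2009): rank points by their
    reverse-nearest-neighbor (RNN) count, keep up to m = 3k top-ranked points
    at least `threshold` apart, then retain the k candidates that attract the
    most points as the seed centers."""
    n = len(X)
    d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))
    d_nn = d + np.diag(np.full(n, np.inf))
    nn = d_nn.argmin(axis=1)                    # nearest neighbor of each point
    rnn_counts = np.bincount(nn, minlength=n)   # times chosen as someone's NN
    candidates = []
    for i in np.argsort(-rnn_counts):           # descending RNN count
        if all(d[i, j] >= threshold for j in candidates):
            candidates.append(i)
        if len(candidates) == 3 * k:
            break
    cand = np.array(candidates)
    sizes = np.bincount(d[:, cand].argmin(axis=1), minlength=len(cand))
    return X[cand[np.argsort(-sizes)[:k]]]

def lloyd_kmeans(X, centers, max_iter=100, tol=1e-9):
    """Standard Lloyd refinement starting from the deterministic seed centers."""
    centers = np.array(centers, dtype=float)
    for _ in range(max_iter):
        labels = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(len(centers))])
        shift = np.linalg.norm(new_centers - centers)
        centers = new_centers
        if shift < tol:
            break
    labels = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
    sse = ((X - centers[labels]) ** 2).sum()
    return labels, centers, sse
```

For the dataset of this study, X would hold the PCA-reduced Δq vectors of Section 3.5; the appropriate number of clusters k is then chosen with the validity indices of Section 4.3.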
In this study the number m of first-stage candidate centers is empirically chosen as three times the desired number of initial cluster centers. An example of how the initial cluster centers are placed on a 3D projection of the dataset is illustrated in Fig. 3.

4.3. Internal clustering validity indices

The criteria used to estimate how well a proposed clustering fits the structure underlying the partitioned data are called cluster validity indices. When no correct or known partition is available, clustering validation is achieved by estimating internal measures of the data such as the compactness and the separation of the clusters; these criteria are known as internal cluster validity indices. Milligan and Cooper (1985) compared 30 validity indices available at that time, and this work remains an important reference in cluster analysis. Recently, Arbelaitz et al. (2013) published an extensive comparative study of popular cluster validity indices in different experimental configurations and suggested guidelines for selecting the most suitable index for a particular setting. Typically these indices are based on the


Fig. 3. Three-dimensional plots of the dataset (black dots) used in the experiments in the Principal Component (PC) axes. Each black dot represents the 3-component feature vector at a spatial location (pixel) of the service area, lying in the subspace spanned by the first three PCs. On the left, the first m = 3k centers selected in the first step of the initialization method are shown as green squares. On the right, the selection of k centers that are spaced at least a predefined distance apart, which are finally used as the cluster centers. Different initial seeds are acquired by applying different threshold distances and numbers of centers k. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

computation of the intra-cluster cohesion and the inter-cluster separation. The indices then estimate the quality of the partitions in terms of different variations of ratio-type and summation-type factors. These factors are commonly related to the geometrical or statistical properties of the clusters, the similarity or dissimilarity between the data, the number of partitioned data points, and/or the number of clusters. Indeed, a good clustering is equivalent to small quantifiable distances among the member points of a cluster and, at the same time, a high distinction from the points of other clusters. Most of these methods tend to consider the clusters as hyper-spherical shapes, using the cluster centers as a benchmark for the measurement of compactness and separation. The performance of the internal validity indices is related to factors such as the type of dataset and the clustering algorithm used (Arbelaitz et al., 2013), and their validation properties depend on aspects such as monotonicity, the noise and the density of the data, the impact of subclusters, and unequal cluster sizes (Liu et al., 2010). Taking into account the properties of the most frequently cited internal validity measures, the Caliński and Harabasz (1974) and the Silhouette (Rousseeuw, 1987) indices are employed as the most suitable for this study. The following subsections provide outlines of these validity indices.

4.3.1. Caliński–Harabasz

A commonly used validity index is provided by the Caliński–Harabasz (CH) method. The index is based on the cluster centers and on the overall data centroid to estimate separation and compactness. The distance from all the members of a cluster to the respective cluster centroid x̄_k defines the cohesion of the clusters. The separation is calculated as the distance between each cluster centroid and the centroid x̄ of all the data. The ratio-type relationship is described by the following equation:

CH = [ Tr(B) / Tr(W) ] · [ (N − n) / (n − 1) ]    (4)

where TrðÞ; B, and W denote the trace, “between cluster scatter” matrices, and “within cluster scatter” matrices respectively. They are defined as: X Þðxk  x  ÞT B¼ Nk ðxk  x ð5Þ ck 2c1 ;c2 ;...;cn

where the N k is the number of points grouped to cluster ck , and X X Þðx  x  ÞT W¼ ðx  x ð6Þ ck 2c1 ;c2 ;...;cn x2ck

Thus, maximization of this index indicates the highest quality partitioning. The term (N - n)/(n - 1) serves as a normalizing factor that regularizes the CH score with respect to the increase of the number of clusters n.

4.3.2. Silhouette

The Silhouette index (SIL) provides a score for each point based on the relation of the point to all the points in its respective cluster compared to points in other clusters. Unlike other methods, the SIL is a normalized summation-type index where the compactness is estimated by the average distance a_{x_i} between each point x_i of a cluster c_k and every other point x_j in the cluster. The separation is measured by the minimum value b_{x_i} of the average distances between that point and the points belonging to each of the other clusters \{c_1, c_2, \ldots, c_n\} \setminus c_k. The difference of these terms, normalized by their maximum value, results in a Silhouette value of the ith point that ranges from -1 to 1. The overall Silhouette index for the clustering of N points of a dataset is given by:

SIL = \frac{1}{N} \sum_{c_k \in \{c_1, c_2, \ldots, c_n\}} \sum_{x_i \in c_k} \frac{b_{x_i} - a_{x_i}}{\max(b_{x_i}, a_{x_i})}    (7)

A Silhouette value which approaches unity signifies that the ith point belongs to a coherent cluster (small a_{x_i}) and is located at a maximal distance from its nearest cluster (large b_{x_i}), resulting in tightly clustered data in well-separated groups.

4.4. L-method

Estimating the appropriate number of clusters is one of the most ambiguous steps in cluster analysis. Increasing the number of desired clusters in a partitioning problem usually implies a monotonic increase or decrease of the validity index values, due to the inherent tendency to provide better clustering quality as the number of clusters grows. The higher the number of clusters, the more degrees of freedom are available for the system to assign the data points to many small and compact groups. However, this assumption is inconsistent with a clear determination of the correct number of clusters. On the other hand, a global extremum of the validity index would be ambiguous, as this conflicts with the above consideration about monotonicity. Particularly for the clustering of large datasets with many parameters and an unknown number of clusters, the abundance of local minima renders the correctness of each solution subjective. Therefore, we seek the knee point of a validity index curve, which corresponds to the number of clusters after which no significant change in the value of the considered index occurs. A heuristic procedure to estimate the number of clusters via the gap statistic was proposed by Tibshirani et al. (2001), whereas Zhao et al. (2008) present a graphical knee point detection method based on Bayesian Information Criterion (BIC) curves in model-based clustering. In this work we adopt an efficient knee point detection method, called the L-method. The L-method is used to locate the crucial point on the evaluation graph of any validity index.
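As a concrete illustration of Sections 4.3.1 and 4.3.2, both validity indices can be evaluated on a toy one-dimensional partition. The sketch below is ours, not the authors' implementation (which was written in MATLAB and operates on high-dimensional feature vectors); all function names and data are invented for illustration.

```python
# Toy computation of the CH index (Eq. (4)) and the overall Silhouette
# (Eq. (7)) for a 1-D partition. Data and names are invented for illustration.
from statistics import mean

def ch_index(clusters):
    """Calinski-Harabasz: Tr(B)/Tr(W) * (N - n)/(n - 1), for 1-D data."""
    all_pts = [x for c in clusters for x in c]
    N, n = len(all_pts), len(clusters)
    grand = mean(all_pts)                       # overall data centroid
    tr_b = sum(len(c) * (mean(c) - grand) ** 2 for c in clusters)
    tr_w = sum((x - mean(c)) ** 2 for c in clusters for x in c)
    return (tr_b / tr_w) * (N - n) / (n - 1)

def silhouette(clusters):
    """Average Silhouette value over all N points (1-D, Euclidean)."""
    total, N = 0.0, sum(len(c) for c in clusters)
    for k, c in enumerate(clusters):
        for x in c:
            # a: mean distance from x to the other members of its own cluster
            a = mean(abs(x - y) for y in c if y is not x) if len(c) > 1 else 0.0
            # b: smallest mean distance from x to any other cluster
            b = min(mean(abs(x - y) for y in other)
                    for j, other in enumerate(clusters) if j != k)
            total += (b - a) / max(a, b)
    return total / N

clusters = [[0.1, 0.2, 0.3], [5.0, 5.1, 5.2], [9.8, 10.0, 10.2]]
print(ch_index(clusters) > 1000)   # True: tight, well-separated partition
print(silhouette(clusters) > 0.9)  # True: each point far from its nearest other cluster
```

A large CH value, or a Silhouette approaching unity, flags a good partition, consistent with the ratio-type and summation-type forms given above.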
The main advantage of the L-method is that it does not require the execution of the clustering algorithm itself. It performs a rapid standalone procedure on the curve of an already computed validity index graph. Unlike the model-based methods, the L-method detects the boundary between the pair of straight lines that best fit either side of the curve. For example, the ideal shape of the curve forms an 'L', which implies a sharp change of the considered index values followed by a uniform segment. In this case, two lines are fit to the approximately linear left and right parts of the curve, and the point of discontinuity of the two lines indicates the crucial point. It should be noted that no significant change of the validity index prevails over the curve segment following the knee point, indicating that the clusters are no longer discrete and they do not contribute to an appropriate

partition. Regarding less ideal curves, where a monotonically smooth decrease occurs, the point of discontinuity of the pair of lines that best fits the underlying shape of the curve locates the point after which the curve continues more smoothly than at any other point. The L-method can be implemented by defining an evaluation graph where the values of a validity index are on the y-axis and the number of clusters on the x-axis. By selecting iterative sequences of points left and right of every possible point that can be considered a knee point, we create all the possible pairs of fitted lines on either side. The first left sequence of points, L, must necessarily comprise the first 2 points of the curve, whereas the right part, R, contains the remaining points, and so forth. This method covers every possible pair of lines. According to the least squares method, a first-degree polynomial approximates the given points of every line segment linearly. The L-method determines the appropriate pair of lines that best fit the monotonicity of the curve by minimizing the total root-mean-square error RMSE_T, calculated as:

RMSE_T = \frac{c - 1}{b - 1} \, RMSE_L + \frac{b - c}{b - 1} \, RMSE_R    (8)

where c corresponds to the vertical projection on the x-axis of the point of discontinuity of the left and right lines, RMSE_L and RMSE_R are their respective root-mean-square errors, and b is defined as the maximum number of clusters. The crucial point \vartheta \in [3, b - 2] is defined as:

\vartheta = \operatorname*{argmin}_{c} \, RMSE_T    (9)

and indicates the appropriate number of clusters.

5. Experimental analysis

In this section we describe the experiments performed to classify coherent clusters and determine the number of clusters that optimally characterizes the data partition under the current parameterization. Each location's K_c time series is transformed into the row-oriented data matrix of the variability feature Δq, based on Eq. (3). The ith row of the data matrix represents the Δq feature values over all the M days (i.e. columns) above the pixel location indicated by the row index. It should be noted that each element of a feature vector is calculated from the absolute daily step changes of the clearness feature q over a specific pixel, see Section 3.4. Therefore, the initial dimensionality is equivalent to the corresponding time period of this dataset in days. The Δq feature space is then reduced to a lower dimensional space by performing un-centered PCA via SVD, after which the presented stable initialization scheme for cluster centers is carried out on the generated feature space. As previously mentioned, the initialization method requires a pre-defined parameter: the closest allowable distance between the candidate centroids. The initial centers obtained by the presented method tend to vary


as a function of the threshold distance. For the current dataset it was experimentally found that threshold distances beyond one third of the maximum length scale in the domain cause inadequate selection of initial seed points. In addition, the spatial resolution of the satellite data, combined with the scale of the spatial domain in this study, results in a reasonable upper limit of 80 clusters. Consequently, the number of clusters in this study is allowed to vary between 4 and 80. The number of clusters is identical to the number of initial centroids applied in each run. The block diagram of the proposed methodology is displayed in Fig. 4. Several partitions have been built for each number of clusters with respect to the various threshold distances. The default criterion function that k-means uses to estimate the cohesion of a cluster is the sum of squared error (SSE) (Celebi et al., 2013). The minimization of the SSE determines the convergence of k-means to the desired number of clusters. Table 1 shows relevant statistics, including the minimum, mean, and standard deviation of the SSE, for various numbers of clusters resulting from different threshold values. Although the various partitions were built to cover all the possible threshold scenarios, no statistically significant differences occur among the runs of each k, which demonstrates the solution stability of k-means for the current seed-center initialization. Furthermore, for a given initialization set it is not required to execute k-means multiple times, since the initialization method is deterministic and produces an identical partition. The two validity indices, CH and SIL, were computed for each single partition and the L-method employed to identify the reasonable number of clusters for the overall experimentation. For more detail


see Fig. 5, which illustrates the process that leads to the final number of clusters. Finally, a correlation analysis is conducted between the satellite-derived and ground-measured GHI time series. Ground-based time series from three clusters are selected for comparison: the San Diego region, the Southern Los Angeles Basin, and the region situated within the Imperial and Coachella Valleys to the east, see Fig. 6. These three areas are chosen since they contain a considerable number of proximal CIMIS monitoring stations as well as being located in distinct clusters. Equally important for this selection is the fact that these locations encompass substantially diverse levels of variability, as defined in the next section. All the routines and experiments of this study were implemented in MATLAB® version 7.14 running on a 3.4 GHz 64-bit processor.

6. Results and discussion

The results of the proposed methodology are presented in two ways. First, the coherently optimal number of clusters, as obtained from each of the internal validity indices as a function of threshold distance, is presented in Table 2. Here, the L-method is implemented over the evaluation graphs that plot the values of a considered index versus the number of clusters, for each threshold individually, and then locates the respective knee point, see Fig. 7. Table 2 also shows that the CH and SIL indices have small variations in terms of the derived number of clusters, which indicates the robustness of their performance. However, as mentioned before, the mean SSE for each k (rather than an explicit threshold level) is considered the most statistically meaningful for the case of deterministic initialization

Fig. 4. The block diagram of the proposed methodology.
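The knee-point search of Section 4.4 amounts to a least-squares line fit on each side of every candidate split, weighted as in Eq. (8). The following minimal sketch is ours (the 'L'-shaped validity curve and all names are invented; the study's implementation was in MATLAB):

```python
# Minimal sketch of the L-method (Eq. (8)): fit a line to the left and right
# parts of an evaluation curve at every candidate split and keep the split
# with the smallest length-weighted total RMSE.
import math

def fit_rmse(xs, ys):
    """Least-squares first-degree fit; returns the RMSE of the fit."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    resid = [(y - (my + slope * (x - mx))) ** 2 for x, y in zip(xs, ys)]
    return math.sqrt(sum(resid) / n)

def l_method(xs, ys):
    """Return the x-value of the knee point minimizing Eq. (8)."""
    b = len(xs)
    best_c, best_rmse = None, float("inf")
    for c in range(2, b - 1):           # left segment has at least 2 points
        rmse_l = fit_rmse(xs[:c], ys[:c])
        rmse_r = fit_rmse(xs[c:], ys[c:])
        total = (c - 1) / (b - 1) * rmse_l + (b - c) / (b - 1) * rmse_r
        if total < best_rmse:
            best_c, best_rmse = c, total
    return xs[best_c - 1]               # knee: last point of the left segment

# An 'L'-shaped curve: steep improvement up to k = 6, nearly flat afterwards.
ks = list(range(2, 21))
vals = [10.0 - 2.0 * k if k <= 6 else -2.0 - 0.05 * k for k in ks]
print(l_method(ks, vals))  # → 6
```

The two weights (c − 1)/(b − 1) and (b − c)/(b − 1) simply favor the split whose line segments jointly explain the most points, exactly the trade-off described in Section 4.4.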

Fig. 5. The k-means clustering algorithm provides column-based clustering partitions of N cluster labels for each predefined clustering parameterization, i.e. threshold distance of the initial seeds and number of clusters k. The validity indices (VIs), CH and SIL, are calculated correspondingly, and for each k we select the VI value from the partition that corresponds to the minimum or the median SSE (blue dots) among different threshold distances (red strip row). The L-method is applied on the evaluation graph of the selected VI values with respect to k and determines the appropriate number of clusters ϑ. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)


Fig. 6. Distribution of CIMIS ground-based measurement stations and cluster centers locations used in this study.

methods (Celebi et al., 2013). Despite that, in order to maintain the method's robustness to extremes, the median is selected rather than the mean. In addition, since the statistics regarding the dispersion of the SSEs (see Table 1) indicate that the reliability between the partitions remains high, we also give the results regarding the minimum SSE for each k. Fig. 7 shows the implementation of the L-method on the evaluation graphs for each cluster validity index resulting from both the minimum and the median SSE clusterings. To avoid the spurious detection of a knee point, data points to the left of the 'worst' value at the beginning of an evaluation graph have been ignored, as suggested by Salvador and Chan (2004). As seen in Fig. 7, a proper cutoff value has been identified as the knee point for all the validity indices. The range of appropriate numbers of clusters varies between 10 and 16, with 14 being the median value. Consequently, a regional partition of the domain of interest is generated, see Fig. 8a. In order to emphasize not only the separation of discrete clusters but also to quantify the variability between the clusters, we define a metric that describes the average variability Δq of the mean time series among the members of each cluster. As an example, we use the partition derived from the median SSE result of the CH index, where the L-method indicates 14 clusters. Here, the average Δq time series from all the pixels of each of the 14 clusters are calculated. The probability density function (PDF) and the cumulative density function (CDF) are then computed for each cluster. The area ε above the curve of the cumulative distribution function (1 − A_CDF) typically determines the average extent to which the feature q will tend to change from one day to the next. The different colors in Fig. 8a indicate the average cluster variability, as defined by the area ε, in increasing order. The smaller the ε, the lower the average daily variability of the cluster, i.e. the more "stable" the cluster is.
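Under our reading of this metric (the area above the empirical CDF of the absolute daily step changes), a small sketch clarifies why a smaller area implies a more stable cluster: for a non-negative variable the area above its CDF equals the sample mean. The series and names below are invented for illustration:

```python
# Toy sketch of the area-above-CDF variability metric applied to the
# absolute daily step changes |dq| of a clearness series q. For a
# non-negative variable this area equals the sample mean of |dq|, so a
# smaller value means a lower average day-to-day change of q.
def epsilon(q_series):
    dq = [abs(b - a) for a, b in zip(q_series, q_series[1:])]
    dq.sort()
    n = len(dq)
    # Area above the step-function CDF: (1 - i/n) times the width of each step
    area, prev = 0.0, 0.0
    for i, v in enumerate(dq):
        area += (1.0 - i / n) * (v - prev)
        prev = v
    return area

stable = [0.95, 0.94, 0.96, 0.95, 0.94]     # calm, desert-like clear days
variable = [0.9, 0.3, 0.8, 0.2, 0.95]       # volatile, marine-layer-like days
print(epsilon(stable) < epsilon(variable))  # True: calm series has the smaller area
```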

Aiming at a physical interpretation of the proposed partition into 14 clusters, the ε areas of the cumulative distribution function graphs for each initial single q vector are also computed, and their overall value range is divided, in accordance with these specific clustering findings, into 14 equally spaced intervals. The coloring of the area in Fig. 8b is based on the interval to which each pixel belongs. Regions with different variability within the domain of interest can be readily ascertained in both figures. However, the proposed clustering in Fig. 8a has a physically meaningful content, as it defines not only the exact borders of clusters with overall similar variability in time but also provides the most appropriate number of those groups. It is clear that while the proposed clusters are directly related to the physical variance distribution, they also provide order in terms of well-separated and neatly arranged regions. It should also be noted that the proposed method separates regions with similar average variability; however, regions with similar average variability must also be sufficiently close in order to be classified into the same cluster. For example, the final clustering presented in Fig. 8a does not necessarily group values of similar ε, shown in Fig. 8b, unless they lie sufficiently close together. Another interesting feature of the experimental results is the appearance of three different clusters prevalent along the southern California coastal region. These clusters indicate the influence of marine layer clouds, a distinct local weather phenomenon that causes frequent overcast conditions in the coastal areas of the region during the summer months (Jamaly et al., 2013; Mass and Albright, 1989). The marine layer effects are clearly revealed in Fig. 8a, rather than the weak imprint based on the average variability in Fig. 8b. The coastal clusters coincide with the areas of higher and lower marine layer dynamics as observed from satellite images. The variability of marine layer clouds


Fig. 7. Determination of the appropriate number of clusters by the implementation of the L-method over the evaluation graphs of (a) CH (min SSE), (b) CH (median SSE), (c) SIL (min SSE) and (d) SIL (median SSE) cluster validity indices. Each row from left to right shows the diagram of the values of the considered index (y-axis) versus the number of clusters (x-axis), the plot of all possible pairs of fit lines (red and green), the RMSE curve with respect to each candidate knee point (the minimum RMSE, i.e. the knee point, is marked with a green square) and, at the right, the best fit lines where the point of discontinuity defines the knee point. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

is especially pronounced in the San Diego, California area. Therefore, the clustering of the coastal areas into these clusters of variability Δq clearly reflects the presence of this phenomenon and can be seen as another reference for the validity of this study. Additionally, it is noticeable that this clustering scheme precisely captures the dry desert basin around the Salton Sea, California (33.5°N, 116°W) and assigns it to a discrete cluster of similar solar variability. It is important to note that the assignment of an ε value to each of the clusters refers only to a post-clustering step attempting to interpret how close the clusters are in terms of average per-cluster daily variability. Notwithstanding, it should not be confused with the clustering criterion of k-means,

which is the depicted temporal attribute of the input time series. The reproduction of the marine layer dynamics and the stable desert basin within the clustering results suggests that elevation and aridity explain, at least in part, the final partitioning of the clusters. However, this is to be expected, as elevation and aridity affect the local microclimates in the region of interest. In addition, our finding that Southern California possesses substantial and uneven variability in GHI is in agreement with the work of Gueymard and Wilcox (2011). The 14 final k-means centers are placed inside each of the clusters in such a way that the mean value of the cluster converges towards the cluster center, providing the optimal


Fig. 8. (a) Segmentation map of the Southern California region into 14 clusters of coherent GHI variability Δq. Each cluster corresponds to terrestrial sites and is indicated by a different color. The proposed centers of each cluster are shown (white circled symbols) in their interior and can be considered as sites for potential solar installations. Similar color hues define the magnitude of the variance of the average Δq time series from all the pixels of each of the 14 clusters, according to the ε quantity. (b) The biennial distribution of the variance ε for each location (pixel). Each color hue corresponds to one of the 14 equally spaced intervals of the ε value range. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Fig. 9. Segmentation maps of the Southern California area into clusters of similar Dq variability as derived by the validity index (a) CH (min SSE), (b) CH (median SSE), (c) SIL (min SSE) and (d) SIL (median SSE).


[Fig. 10 graphic: "Correlation Map, Satellite vs Ground"; the axes list the CIMIS comparison sites Long Beach, Los Angeles Basin South, San Diego 1–3, and Coachella Valley 1–4, with the correlation color scale spanning 0 to 1.0.]
Fig. 10. Cross-correlation between the CIMIS ground stations (columns) and the remote sensing data (rows) derived from the clustering map (Fig. 7a). The top-right color of each coordinate corresponds to a correlation between the proposed cluster center and the CIMIS data, while the lower-left color corresponds to a correlation between 100 randomly selected pixels in the proposed cluster and the CIMIS data. A red value indicates strong correlation (>0.8) between the GHI time series of a given cluster and the CIMIS ground stations located within the considered cluster, whereas a yellow value (~0.6) refers to weaker correlations. It is apparent that high positive correlations (red bands) are clearly observed only between locations within the same cluster region (Fig. 6). (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

Fig. 11. (a) Cross-correlation analysis between the Δq time series in order to evaluate the intra/inter-cluster relationships of the proposed clustering. Each column and row of the correlation map is numbered so as to correspond to the cluster labels in (b), according to decreasing order of the variability measure ε. A high intra-cluster correlation is an indication of excellent coherence within the clusters, just as a low inter-cluster correlation indicates adequate separation between the clusters. It can be seen that the average intra-cluster correlation remains high (>0.9), while the inter-cluster relationship is diminished.

intra-cluster membership. These centroids constitute a network of representative locations for the coherently optimal placement of ground instruments to observe the current territory. Additionally, these results are consistent with the main attribute of a cluster center, which is an optimized position within the cluster. The main characteristic of these clusters is that they constitute well-separated subareas with similar Δq variability over the span of the 2 years used in this study. To highlight the consistency of the clustering results among all the instances of the proposed range of clusters, the respective clustering schemes are shown in Fig. 9. As CIMIS ground data are available in the region of interest (see Fig. 6), a correlation map between the satellite-derived GHI time series (rows) and the ground

truth (columns) is graphically illustrated in Fig. 10. The color at each pixel of this map corresponds to the correlation coefficient and reflects the relationship between the satellite data and the CIMIS stations. The correlation analysis is performed in two ways. Pearson correlation coefficients were calculated by choosing a CIMIS location and correlating the corresponding GHI time series at the site with, first, the time series at the cluster centers and, second, the time series at 100 randomly selected locations in each cluster. These results are shown as the top-right and bottom-left of each coordinate in the correlation graph in Fig. 10, respectively. Results for both methods are similar, with a slight increase in correlation for the cluster centers with respect to the randomly selected locations, further validating the proposed clustering. Additionally,


correlation along the coastal region, San Diego and Los Angeles, is distinguishable and justified, as these areas experience similar marine layer effects. More importantly, we also perform a cross-correlation analysis between the time series of the feature used in this work (i.e. the Δq time series) in order to evaluate the intra/inter-cluster relationships of the proposed clustering. A high intra-cluster correlation is an indication of excellent coherence within the clusters. Similarly, a low inter-cluster correlation indicates adequate separation between the clusters. In Fig. 11a, the average correlation coefficients among all the gridded locations belonging to a cluster are shown on the diagonal. The average correlation coefficients between all the possible combinations of inter-cluster pairs are depicted on the off-diagonal segments. Each of the clusters is numbered in order of decreasing variability ε, as shown in Fig. 11b. It is apparent that the average intra-cluster correlation remains high (>0.9), while the inter-cluster relationship is diminished. Knowledge of the spatial boundaries of correlated/uncorrelated clusters would grant a utility's planning and operations insight into locations of optimal production potential coupled with uncorrelated variability, allowing for optimal spatial averaging of resource fluctuations.
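The intra/inter-cluster check reduces to averaging Pearson coefficients within a cluster and across cluster pairs. The following toy sketch is ours, with synthetic series (all names and data invented for illustration):

```python
# Illustrative intra/inter-cluster Pearson correlation check: members of a
# coherent cluster should correlate highly with each other and weakly with
# members of other clusters. The two "clusters" of dq series are synthetic.
from statistics import mean, pstdev

def pearson(x, y):
    mx, my, sx, sy = mean(x), mean(y), pstdev(x), pstdev(y)
    return mean((a - mx) * (b - my) for a, b in zip(x, y)) / (sx * sy)

base_a = [0.1, 0.5, 0.2, 0.7, 0.3, 0.6]   # shared signal, cluster A
base_b = [0.6, 0.1, 0.7, 0.2, 0.8, 0.1]   # shared signal, cluster B
cluster_a = [[v + 0.01 * i for v in base_a] for i in range(3)]
cluster_b = [[v + 0.01 * i for v in base_b] for i in range(3)]

intra = mean(pearson(cluster_a[i], cluster_a[j])
             for i in range(3) for j in range(i + 1, 3))
inter = mean(pearson(a, b) for a in cluster_a for b in cluster_b)
print(intra > 0.9)     # True: high cohesion within the cluster
print(inter < intra)   # True: much weaker relationship between clusters
```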

7. Conclusions

This work presents a methodology for estimating the number of coherent clusters of average daily clearness from satellite-derived irradiance observations over a utility-scale region. A comparison with available ground measurements was performed in order to validate the remote sensing results. The first conclusion is that the use of a stable initialization scheme of seed centers, in conjunction with mean data centering via PCA, renders k-means a robust approach for the clustering of satellite-derived average daily clearness. Both the CH and SIL internal validity indices indicate that 14 coherent irradiance clusters are present in the area of interest used in this study. Secondly, the centroids of these 14 coherent regions are proposed as the optimal locations for the placement of ground-based sensors in order to effectively capture the daily variability over the utility-scale region. This sensor network could then be employed by utilities and system operators in order to reduce operational costs via forecasting activities. Additionally, knowledge of areas with similar/dissimilar average daily variability would allow utilities and system operators to strategically site both centralized and distributed solar generation through improved planning and subsidies, effectively reducing variability through coherently optimal spatial averaging. Finally, this approach is limited only by the availability of gridded data and computational resources. Therefore, similar types of clustering could be performed on both larger and smaller spatial domains as well as finer and coarser temporal scales. In addition, there is no need to employ average daily clearness as the clustering feature. Future work will include the definition of various clustering features (such as average daily variability) in order to investigate the relationship between clustering indices, as well as the evaluation of the forecast effectiveness of National Oceanic and Atmospheric Administration (NOAA) numerical weather prediction models (High Resolution Rapid Refresh) and forecast engines based on ground data in the SDG&E service area.

Acknowledgments

The authors appreciate funding from the California Solar Initiative (CSI) Research, Development, Demonstration, and Deployment (RD&D) Program Grant #3, and from the National Science Foundation (NSF) EECS-EPAS Award No. 1201986, which is managed by Dr. Paul Werbos. Partial support from the DOE SUNRISE project DE-EE0006330, which is technically managed by Dr. Venkat Banunarayanan, is also gratefully acknowledged.

Appendix A.

Algorithm 1. Pseudocode for the cluster centers initialization algorithm

Find n initial centers IC(n)
Step 1: Compute the NN of the N data points (dataset D(N))
  Count the RNNs of each point i ∈ N  ▷ number of points whose NN is the query point i
  Sort the N data points in descending order according to the number of their RNNs; assign to E
  while (j = 0; j < m; j++) do  ▷ n: the desired number of centroids; m = 3·n
    Select the point j from E and assign it to the set of potential centroids S(j)
    Calculate the Euclidean distance matrix of S(j)
    if dist(centroid j, centroids S \ {j}) < threshold distance then
      discard centroid j and continue
    else
      update S(j) and continue
Step 2: Compute the Euclidean distances between the m candidate centroids and the N − m remaining points [dataset D(N−m)]
  Assign each of the N − m points to its nearest centroid → m clusters of points
  Sort the clusters in descending order according to the number of their point members
  return the first n clusters (i.e., their centroids) as the set of initial centers IC(n)
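A hedged Python rendering of Algorithm 1 may clarify the flow; this is our own sketch of the RNN-based seed selection (the dataset, threshold, and tie-breaking details are invented, and the original implementation was in MATLAB):

```python
# Sketch of the stable seed initialization: rank points by reverse-nearest-
# neighbor (RNN) count, greedily keep candidates at least `threshold` apart,
# then return the n candidates that attract the most remaining points.
import math

def dist(p, q):
    return math.dist(p, q)

def initial_centers(data, n, threshold):
    m = 3 * n                                   # candidate pool size, m = 3n
    # Step 1a: nearest neighbor of every point, then RNN counts
    nn = [min((j for j in range(len(data)) if j != i),
              key=lambda j: dist(data[i], data[j])) for i in range(len(data))]
    rnn = [nn.count(i) for i in range(len(data))]
    order = sorted(range(len(data)), key=lambda i: -rnn[i])
    # Step 1b: keep up to m candidates separated by at least `threshold`
    cand = []
    for i in order:
        if all(dist(data[i], data[j]) >= threshold for j in cand):
            cand.append(i)
        if len(cand) == m:
            break
    # Step 2: assign remaining points to candidates; keep the n most populous
    counts = {c: 0 for c in cand}
    for i in range(len(data)):
        if i not in counts:
            counts[min(cand, key=lambda c: dist(data[i], data[c]))] += 1
    best = sorted(cand, key=lambda c: -counts[c])[:n]
    return [data[c] for c in best]

data = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),    # dense group near the origin
        (5.0, 5.0), (5.1, 5.0), (5.0, 5.1),    # dense group near (5, 5)
        (9.0, 0.0)]                             # isolated point
print(initial_centers(data, n=2, threshold=1.0))  # → [(0.0, 0.0), (5.0, 5.0)]
```

Because the selection is deterministic for a given threshold, repeated k-means runs from these seeds reproduce the same partition, which is the property exploited in Section 5.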


References

Amorim, A.M., Gonçalves, A.B., Nunes, L.M., Sousa, A.J., 2012. Optimizing the location of weather monitoring stations using estimation uncertainty. Int. J. Clim. 32 (6), 941–952.
Arbelaitz, O., Gurrutxaga, I., Muguerza, J., Pérez, J.M., Perona, I., 2013. An extensive comparative study of cluster validity indices. Pattern Recognit. 46 (1), 243–256.
Boyle, G., 2008. Renewable electricity and the grid: the challenge of variability. Earthscan.
Cadima, J., Jolliffe, I., 2009. On relationships between uncentred and column-centred principal component analysis. Pak. J. Statist. 25 (4), 473–503.
Caliński, T., Harabasz, J., 1974. A dendrite method for cluster analysis. Commun. Statist. Theory Methods 3 (1), 1–27.
Celebi, M.E., Kingravi, H.A., Vela, P.A., 2013. A comparative study of efficient initialization methods for the k-means clustering algorithm. Expert Syst. Appl. 40 (1), 200–210.
Chow, C.W., Urquhart, B., Lave, M., Dominguez, A., Kleissl, J., Shields, J., Washom, B., 2011. Intra-hour forecasting with a total sky imager at the UC San Diego solar energy testbed. Sol. Energy 85 (11), 2881–2893.
Diabate, L., Blanc, P., Wald, L., 2004. Solar radiation climate in Africa. Sol. Energy 76 (6), 733–744.
Ding, C., He, X., 2004. K-means clustering via principal component analysis. In: Proceedings of the Twenty-first International Conference on Machine Learning. ICML '04. ACM, New York, NY, USA, pp. 225–232.
Eltbaakh, Y.A., Ruslan, M., Alghoul, M., Othman, M., Sopian, K., 2012. Issues concerning atmospheric turbidity indices. Renew. Sustain. Energy Rev. 16 (8), 6285–6294.
GE Energy, 2010. Western Wind and Solar Integration Study. Tech. Rep. NREL/SR-550-47434, National Renewable Energy Laboratory.
Gueymard, C.A., Wilcox, S.M., 2011. Assessment of spatial and temporal variability in the US solar resource from radiometric measurements and predictions from models using ground-based or satellite data. Sol. Energy 85 (5), 1068–1084.
Gwadry, F., Berenstein, C.A., Horn, J.V., Braun, A., 2011. Implementation and Application of Principal Component Analysis on Functional Neuroimaging Data. Tech. Rep., Institute for Systems Research Technical Reports.
Ineichen, P., 2006. Comparison of eight clear sky broadband models against 16 independent data banks. Sol. Energy 80 (4), 468–478.
Ineichen, P., Perez, R., 2002. A new airmass independent formulation for the Linke turbidity coefficient. Sol. Energy 73 (3), 151–157.
Inman, R.H., Pedro, H.T.C., Coimbra, C.F.M., 2013. Solar forecasting methods for renewable energy integration. Prog. Energy Combust. Sci. 39, 535–576.
Jain, A.K., 2010. Data clustering: 50 years beyond k-means. Pattern Recognit. Lett. 31 (8), 651–666.
Jamaly, M., Bosch, J., Kleissl, J., 2013. Aggregate ramp rates of distributed photovoltaic systems in San Diego County. IEEE Trans. Sustain. Energy 4 (2), 519–526.
Jolliffe, I.T., 1986. Principal Component Analysis, vol. 487. Springer-Verlag, New York.
Kabalci, E., 2011. Development of a feasibility prediction tool for solar power plant installation analyses. Appl. Energy 88 (11), 4078–4086.
Liu, Y., Li, Z., Xiong, H., Gao, X., Wu, J., 2010. Understanding of internal clustering validation measures. In: Data Mining (ICDM), 2010 IEEE 10th International Conference on. IEEE, pp. 911–916.
Lloyd, S., 1982. Least squares quantization in PCM. IEEE Trans. Inform. Theory 28 (2), 129–137.
MacQueen, J., 1967. Some methods for classification and analysis of multivariate observations. In: Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, vol. 1. California, USA, pp. 281–297.


Marquez, R., Coimbra, C.F.M., 2011. Forecasting of global and direct solar irradiance using stochastic learning methods, ground experiments and the NWS database. Sol. Energy 85 (5), 746–756.
Marquez, R., Pedro, H.T.C., Coimbra, C.F.M., 2013. Hybrid solar forecasting method uses satellite imaging and ground telemetry as inputs to ANNs. Sol. Energy 92, 176–188.
Mass, C., Albright, M., 1989. Origin of the Catalina Eddy. Mon. Weather Rev. 117 (11), 2406–2436.
Milligan, G.W., Cooper, M.C., 1985. An examination of procedures for determining the number of clusters in a data set. Psychometrika 50 (2), 159–179.
Miranda, A.A., Le Borgne, Y.-A., Bontempi, G., 2008. New routes from minimal approximation error to principal components. Neural Process. Lett. 27 (3), 197–207.
Nottrott, A., Kleissl, J., 2010. Validation of the NSRDB–SUNY global horizontal irradiance in California. Sol. Energy 84 (10), 1816–1827.
Noy-Meir, I., 1973. Data transformations in ecological ordination: I. Some advantages of non-centering. J. Ecol. 61 (2), 329–341.
Perez, R., Ineichen, P., Moore, K., Kmiecik, M., Chain, C., George, R., Vignola, F., 2002. A new operational model for satellite-derived irradiances: description and validation. Sol. Energy 73 (5), 307–317.
Remund, J., Wald, L., Lefèvre, M., Ranchin, T., 2003. Worldwide Linke turbidity information. In: Proceedings of ISES Solar World Congress 2003, Göteborg, Sweden, pp. 1–13.
Reno, M.J., Hansen, C.W., Stein, J.S., 2012. Global Horizontal Irradiance Clear Sky Models: Implementation and Analysis. Tech. Rep. SAND12-2389, Sandia National Laboratories, Albuquerque, New Mexico and Livermore, California.
Rodriguez, G., 2010. A utility perspective of the role of energy storage in the smart grid. In: Power and Energy Society General Meeting, 2010 IEEE, pp. 1–2.
Rousseeuw, P.J., 1987. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. 20, 53–65.
Salvador, S., Chan, P., 2004. Determining the number of clusters/segments in hierarchical clustering/segmentation algorithms. In: Tools with Artificial Intelligence, 2004. ICTAI 2004. 16th IEEE International Conference on. IEEE, pp. 576–584.
Salvador, S., Chan, P., 2005. Learning states and rules for detecting anomalies in time series. Appl. Intell. 23 (3), 241–255.
Schmalwieser, A.W., Schauberger, G., 2001. A monitoring network for erythemally-effective solar ultraviolet radiation in Austria: determination of the measuring sites and visualisation of the spatial distribution. Theor. Appl. Climatol. 69 (3–4), 221–229.
Solar Anywhere, 2012. SolarAnywhere data, Clean Power Research 2012. URL
Spetsieris, P.G., Eidelberg, D., 2011. Scaled subprofile modeling of resting state imaging data in Parkinson's disease: methodological issues. Neuroimage 54 (4), 2899–2914.
Tapakis, R., Charalambides, A.G., 2014. Enhanced values of global irradiance due to the presence of clouds in Eastern Mediterranean. Renew. Energy 62, 459–467.
Theodoridis, S., Koutroumbas, K., 2009. Pattern Recognition, fourth ed. Academic Press, p. 638.
Tibshirani, R., Walther, G., Hastie, T., 2001. Estimating the number of clusters in a data set via the gap statistic. J. Roy. Statist. Soc.: Series B (Statist. Methodol.) 63 (2), 411–423.
Venkataraman, S., Jordan, G., Piwko, R., Freeman, L., Helman, U., Loutan, C., Rosenblum, G., Rothleder, M., Xie, J., Zhou, H., et al., 2010. Integration of Renewable Resources: Operational Requirements and Generation Fleet Capability at 20% RPS. Tech. Rep., California ISO.
Vose, R.S., Menne, M.J., 2004. A method to determine station density requirements for climate observing networks. J. Climate 17 (15), 2961–2971.
Wu, X., Kumar, V., Ross Quinlan, J., Ghosh, J., Yang, Q., Motoda, H., McLachlan, G., Ng, A., Liu, B., Yu, P., Zhou, Z.-H., Steinbach, M.,

188

A. Zagouras et al. / Solar Energy 102 (2014) 173–188

Hand, Steinberg, D., 2008. Top 10 algorithms in data mining. Knowl. Inform. Syst. 14 (1), 1–37. Xu, J., Xu, B., Zhang, W., Zhang, W., Hou, J., 2009. Stable initialization scheme for K-means clustering. Wuhan Univ. J. Nat. Sci. 14 (1), 24– 28. Zagouras, A., Kazantzidis, A., Nikitidou, E., Argiriou, A., 2013. Determination of measuring sites for solar irradiance, based on cluster analysis of satellite-derived cloud estimations. Sol. Energy 97, 1–11.

Zhao, Q., Xu, M., Fränti, P., 2008. Knee point detection on Bayesian information criterion. In: Tools with Artificial Intelligence, 2008. ICTAI'08. 20th IEEE International Conference on, vol. 2. IEEE, pp. 431–438.