Regionalization of watersheds by finite mixture models



Journal of Hydrology 583 (2020) 124620

Contents lists available at ScienceDirect

Journal of Hydrology journal homepage: www.elsevier.com/locate/jhydrol

Research papers


Ali Ahani, S. Saeid Mousavi Nadoushani, Ali Moridi

Department of Water Resources Management, Faculty of Civil, Water and Environmental Engineering, Shahid Beheshti University, Tehran, Iran

ARTICLE INFO

Keywords: Clustering; Gaussian mixture models; Homogeneity; Regionalization; Regional flood frequency analysis

ABSTRACT

Cluster analysis methods are among the most widely used methods for regionalization of watersheds in regional flood frequency analysis. Several cluster analysis algorithms have been used in regional frequency analysis studies in recent decades, most of which have been data-driven algorithms that identify clusters based on distances or dissimilarities between data points. Finite mixture models are statistical models that are able to identify clusters based on frequency distributions. In this study, the performance of finite mixture models for regionalization of watersheds was evaluated by applying Gaussian mixture models, the best-known type of finite mixture models, to the watersheds of Karun-e-bozorg in the southwest of Iran. The performance of this method was evaluated according to the homogeneity of regions and the accuracy of flood quantile estimates. In addition, the results of the proposed method were compared with the results provided by a well-known and efficient data-driven hybrid clustering algorithm. According to the results, the Gaussian mixture models clearly outperformed the hybrid clustering algorithm in identifying homogeneous regions. While one of the Gaussian mixture models assigned all the watersheds to homogeneous regions for five different choices of the number of regions, the hybrid clustering algorithm assigned all the watersheds to homogeneous regions for only one choice of the number of regions. Also, the flood estimates based on the regions identified by the Gaussian mixture model and by the hybrid clustering algorithm were compared with the at-site estimates. As evidenced by the results, the flood estimates for the regions identified by the Gaussian mixture model deviate less from the at-site estimates in terms of the error measures. In general, the results show that finite mixture models can be considered an efficient option for regionalization of watersheds in regional flood frequency analysis studies.

1. Introduction

Floods are one type of hydrological extreme event that affects several aspects of human life. Understanding the behavior and statistical attributes of floods can play an important role in improving flood management plans. Flood frequency analysis (FFA) methods are statistical methods for estimating the magnitude of a flood corresponding to a specific return period. If the flood data record of a watershed is long enough, flood frequency analysis can be implemented for the watershed based on its own flood data. This type of analysis is called at-site flood frequency analysis. However, when a watershed is ungauged, it is not possible to estimate its flood quantiles by at-site flood frequency analysis. In addition, if the flood data record of a watershed is very short, the flood estimates provided by at-site methods may not be reliable. In such cases, regional flood frequency analysis (RFFA) can be an appropriate alternative. Most RFFA methods are based on the index-flood method developed by Dalrymple (1960). Hosking and Wallis (1993, 1997) developed an RFFA algorithm based on L-moments, which has been used in several studies for regional frequency analysis of hydrologic extreme events (e.g., Ramachandra Rao and Srinivas, 2006; Ramachandra Rao and Srinivas, 2006; Srinivas et al., 2008; Viglione et al., 2007; Sadri and Burn, 2011; Farsadnia et al., 2014; Basu and Srinivas, 2014; Asong et al., 2015; Ahani and Mousavi Nadoushani, 2016).

As stated in several RFFA studies, RFFA usually includes two main steps: regionalization and flood estimation. In the regionalization step, the aim is to form or identify groups of watersheds named homogeneous regions. A homogeneous region is a group of watersheds that have similar flood generation mechanisms. The flood estimation step aims at fitting a regional frequency distribution to each homogeneous region and estimating flood quantiles for each watershed of the region. In regionalization, the similarity of the flood generation mechanisms is often assessed based on the similarity of the feature vectors of the watersheds. The feature vectors include features that have an effect

Corresponding author. E-mail addresses: [email protected] (A. Ahani), [email protected] (S.S. Mousavi Nadoushani), [email protected] (A. Moridi).

https://doi.org/10.1016/j.jhydrol.2020.124620 Received 24 October 2019; Received in revised form 24 December 2019; Accepted 22 January 2020 Available online 03 February 2020 0022-1694/ © 2020 Elsevier B.V. All rights reserved.


on the flood generation mechanism (e.g., physiographical attributes, geological characteristics, climatological and meteorological attributes, land use and vegetation type) or some flood-related features (e.g., flood statistics). In this article, the first group is called watershed features and the second group is called flood-related features. Since the feature vectors consist of several features and may not include geographical location attributes, the watersheds belonging to a region are not necessarily geographically contiguous. The use of flood-related features can usually increase the homogeneity of regions because of their direct relation to the flood data of the watersheds. This is an important advantage of using flood-related features to form the feature vectors. On the other hand, flood-related features cannot be calculated for ungauged watersheds, so when the objective of RFFA is to estimate the flood quantiles for one or more ungauged watersheds, flood-related features cannot be used to form the feature vectors. In such cases, the watershed features can be used to regionalize even the ungauged watersheds.

There are two major approaches to forming the regions: forming fixed regions and forming a region of influence or hydrological neighborhood. Methods based on cluster analysis often follow the fixed-region approach, whereas the region of influence (ROI) method (Burn, 1990) and the hydrological neighborhood methods based on canonical correlation analysis (CCA) adopt the region-of-influence approach (e.g., Cavadias, 1990; Cavadias et al., 2001; Ouarda et al., 2001). Cluster analysis methods are multivariate statistical analysis methods for grouping data points based on their similarities and differences.
Several types and techniques of cluster analysis have been used for regionalization of watersheds in RFFA studies because of their ability to deal with multivariate data (e.g., Acreman and Sinclair, 1986; Bhaskar and O’Connor, 1989; Burn, 1989; Hall and Minns, 1999; Jingyi and Hall, 2004; Ramachandra Rao and Srinivas, 2006; Ramachandra Rao and Srinivas, 2006; Srinivas et al., 2008; Sadri and Burn, 2011; Basu and Srinivas, 2014; Asong et al., 2015; Ahani and Mousavi Nadoushani, 2016; Abdi et al., 2017). There are several categorizations of clustering methods and algorithms. For example, clustering algorithms can be divided into two main groups, hierarchical and partitional algorithms, each of which may have advantages and disadvantages for different applications. Hierarchical algorithms often do not need initialization, such as determination of the initial cluster centers, but they cannot move a data point from one cluster to another at the same level. On the other hand, some partitional algorithms require the initial cluster centers to be determined, but they are able to move data points between clusters in successive iterations to minimize the objective function of clustering. Ward’s algorithm (Ward, 1963) is one of the widely used hierarchical algorithms in regionalization studies, whereas among the partitional algorithms, K-means (Hartigan and Wong, 1979) has been applied in several regionalization studies. In another categorization, clustering can be implemented as hard or fuzzy. While in hard clustering each data point can belong to only one cluster, in fuzzy clustering each data point can belong to more than one cluster simultaneously. The most important fuzzy clustering algorithm used in regionalization studies is the fuzzy c-means (FCM) algorithm (Bezdek, 1981), whereas most of the aforementioned clustering algorithms can be categorized as hard clustering algorithms.
In addition to the clustering algorithms mentioned above, there are some techniques, such as self-organizing feature maps (SOFM), that may not be considered clustering algorithms but can be used as a tool for clustering (Ramachandra Rao and Srinivas, 2008). Kohonen’s SOFM (Kohonen, 1982) has been the most widely used type of SOFM in studies related to regionalization of watersheds and RFFA. Furthermore, some hybrid algorithms have been developed and used in regionalization studies to exploit the capabilities of different clustering algorithms and improve their performance. Some studies that used cluster analysis methods to regionalize watersheds are

Table 1. Some important studies on the application of cluster analysis methods to regionalize watersheds.

Clustering technique

Studies

Hierarchical algorithms

Mosley (1981), Acreman and Sinclair (1986) Lin and Chen (2006) Wiltshire (1986), Bhaskar and O’Connor (1989) Burn (1989), Burn and Goel (2000) Lin and Chen (2006) Hall and Minns (1999), Jingyi and Hall (2004) Lin and Chen (2006) Ramachandra Rao and Srinivas (2006) Basu and Srinivas (2014), Basu and Srinivas (2015) Hall et al. (2002), Jingyi and Hall (2004) Prinzio et al. (2011); Toth (2013) Ramachandra Rao and Srinivas (2006) Srinivas et al. (2008), Farsadnia et al. (2014) Ahani and Mousavi Nadoushani (2016)

Partitional algorithms

Fuzzy clustering

Self-organizing feature maps Hybrid algorithms

presented in Table 1. Since the homogeneity measures are based on the regional flood statistics, it is logical to use the flood statistics of the watersheds to measure the similarities and differences between the watersheds when forming the regions. However, when flood estimation for ungauged watersheds is one of the objectives of the RFFA, the lack of flood data in the ungauged watersheds is an obstacle to assigning these sites to the regions. Therefore, in several studies, the regionalization of watersheds is implemented based only on watershed features such as physiographic characteristics, meteorological features, geological attributes, geographical location characteristics, soil type, and land use descriptors (e.g., Acreman and Sinclair, 1986; Nathan and McMahon, 1990; Hall and Minns, 1999; Burn and Goel, 2000; Hall et al., 2002; Jingyi and Hall, 2004; Ramachandra Rao and Srinivas, 2006; Ramachandra Rao and Srinivas, 2006; Srinivas et al., 2008; Farsadnia et al., 2014; Ahani and Mousavi Nadoushani, 2016).

Cluster analysis methods can be categorized into data-driven and model-based clustering methods. In most data-driven clustering methods, the regions are identified based on the similarities and differences between the data points, which are measured by distance measures (e.g., Euclidean, Manhattan, Minkowski, Chebyshev, and Mahalanobis distances). On the other hand, in model-based clustering methods, the data points that belong to a cluster are those to which a probability density function can be fitted appropriately. Although different clustering methods have been used for regionalization of watersheds in several studies, most of the methods used have been data-driven, and model-based methods have not attracted much attention. Therefore, in the current study, the application of model-based clustering for regionalization of watersheds in RFFA was examined by focusing on finite mixture models (FMM).
For this purpose, a well-known family of FMM named Gaussian mixture models (GMM) was applied to regionalize the watersheds of the Karun-e-bozorg basin in the southwest of Iran for implementing RFFA. The results were evaluated and compared with the results provided by a well-known and efficient data-driven clustering method. It should be noted that, in the current study, FMM and their well-known family, GMM, were not used as flood frequency distributions; they were used as a statistical clustering method for regionalization of watersheds. The method was used to cluster watersheds based on two different types of feature vectors. The first type of feature vector included the watershed features and the second type consisted of the flood-related features (not the flood data records).


Fig. 1. Geographical location of the basin Karun-e-bozorg and the selected gauging stations.

2. Materials and methods

2.1. Study area and datasets

To examine the proposed regionalization method and evaluate its performance, the Karun-e-bozorg basin, located in the southwest of Iran (Fig. 1), was chosen as the case study. The total area of the basin is about 67,257 km². Karun and Dez are the two main rivers of the basin; they join upstream of the city of Ahvaz to form the Karun-e-bozorg river. Over the basin, the longitude varies between 48.05°E and 52.00°E and the latitude between 29.10°N and 34.12°N (decimal degrees). For applying the regionalization methods, 42 gauging stations and their corresponding watersheds were selected over the basin. The streamflow at the selected stations was unregulated, and the flood record length at each station was at least 10 years. Annual peak flow records were used as the flood data to implement RFFA. The record length ranged from 10 to 45 years, with a mean of 27 years.

2.2. Regionalization of watersheds by the clustering algorithm WAKM

Ward’s hierarchical clustering algorithm (Ward, 1963) and the partitional clustering algorithm K-means (Hartigan and Wong, 1979) are two clustering algorithms widely used for regionalization of watersheds in RFFA studies (e.g., Hosking and Wallis, 1997; Jingyi and Hall, 2004; Lin and Chen, 2006; Ramachandra Rao and Srinivas, 2006; Kingston et al., 2011; Satyanarayana and Srinivas, 2011; Kar et al., 2012; Ssegane et al., 2012; Ilorme and Griffis, 2013; Farsadnia et al., 2014; Smith et al., 2015; Abdi et al., 2017). The WAKM clustering algorithm was first used by Ramachandra Rao and Srinivas (2006) for regionalization of watersheds of the state of Indiana in the U.S. and provided more suitable results than three hierarchical algorithms (single linkage, complete linkage, and Ward) and the partitional algorithm K-means. WAKM, which combines the Ward and K-means algorithms, also provided better results than two other hybrid algorithms, namely the combinations of single linkage and of complete linkage with K-means. In WAKM, the initial clusters are identified by Ward’s algorithm for the desired number of clusters. Then, the centers of the initial clusters are used as the initial cluster centers of K-means for final clustering and identification of the regions. In this study, WAKM was used for regionalization of watersheds and its results were compared with the results provided by the finite mixture models (FMM). For more details about the Ward, K-means, and WAKM algorithms, see Ramachandra Rao and Srinivas (2006, 2008).

2.3. Regionalization of watersheds by finite mixture models

The basic assumption in model-based clustering by finite mixture models (FMM) is that a population consists of a number of subpopulations, each of which has a probability density function. The mixture of the probability density functions of the subpopulations forms a finite mixture model for the population (Raftery and Dean, 2006). When the objective is to cluster the data, each subpopulation can be considered a cluster. The subpopulations are often modeled by members of the same parametric family of probability density functions. The general form of a finite mixture model with C subpopulations or clusters is presented in Eq. (1):

$$f(x) = \sum_{c=1}^{C} \pi_c f_c(x \mid \phi_c) \tag{1}$$

where x is the multivariate feature vector, $\pi_c$ denotes the ratio of the size of subpopulation c to the population size, $f_c$ represents the probability density function of subpopulation c, and $\phi_c$ is the parameter vector of subpopulation c (Raftery and Dean, 2006). By using the Bayes rule, the mixture model can be used to assign the data points to the clusters. The data point x is assigned to the cluster h if the posterior probability of its assignment to subpopulation h is greater than the posterior probabilities related to all other clusters, i.e.


$$\tau_h(x) > \tau_c(x), \qquad c = 1, \dots, C, \; c \neq h,$$

where

$$\tau_h(x) = \frac{\pi_h f_h(x)}{\sum_{c=1}^{C} \pi_c f_c(x)}$$

represents the posterior probability of its assignment to the subpopulation h. Since the denominator is the same for all the posterior probabilities, the Bayes rule can be simplified as follows: x is assigned to the cluster h if $h = \arg\max_c \pi_c f_c(x)$. The Bayes rule can be estimated by substituting the unknown parameters with their estimated values (Raftery and Dean, 2006).

Gaussian mixture models (GMM) are the best-known and most widely used family of finite mixture models. In GMM, it is supposed that all the subpopulations have Gaussian distributions. In this study, Gaussian mixture models were applied for regionalization of watersheds. The Expectation-Maximization (EM) method (Dempster et al., 1977) was used to fit the GMM to the data. For more about GMM and EM algorithms, see Bishop (2006) and Everitt et al. (2011). It should be noted that in situations like regionalization of watersheds, FMMs or GMMs are used as clustering methods. Clustering is an unsupervised learning problem, so we are not dealing with predefined and specific groups or classes of data points; rather, we are trying to recognize unknown patterns existing in the data set by using clustering methods. The Gaussianity assumption in GMM does not concern some specified subpopulation in the data set; it is an assumption made to start solving the clustering problem and to search for the best solutions that are consistent with it. Before the clustering problem is solved by GMM through the EM algorithm, the Gaussianity assumption cannot be examined because there is no specified subpopulation or cluster. EM algorithms are used to provide such solutions (clusters) in the form of FMM or GMM. In fact, the EM algorithm is used as a particular type of maximum likelihood method to estimate the parameters of the multivariate Gaussian distributions of the subpopulations (clusters) and to provide the best choice among the possible solutions of the clustering problem.

Also, it is worth noting that a GMM can be considered a universal function approximator, so it can be applied regardless of whether the data are Gaussian distributed or not. This means that, whatever the original distribution of the data set, the GMM is expected to approach the true distribution when a sufficiently large number of mixture components is allowed. In the current study, the R package ‘mclust’ (5.4.2) was used to perform the computations related to GMM. There are ten options of EM algorithms for clustering of multivariate data, all of which were examined. For each number of clusters or regions (c), the option that provided the best results in terms of the homogeneity of regions was selected for regionalization of watersheds by GMM for RFFA.

The objective of the regionalization of watersheds is to identify regions that are homogeneous and to whose data a probability density function can be fitted appropriately. Since in model-based clustering the clusters are identified based on their probability density functions, the use of FMM is expected to result in more homogeneous regions. It is reasonable to form the regionalization feature vectors from features that represent or describe the probability density function of the data of the watersheds. In the present study, two types of feature vectors were formed to address this issue. The first type included some geographical, physiographical, and meteorological attributes of the watersheds. The first-type feature vectors are called watershed feature vectors and are denoted by $v_w$ hereafter in this article. The watershed features are longitude, latitude, elevation above sea level, drainage area, main stream length, main stream slope, and Gravelius coefficient. The descriptive statistics of the watershed features are presented in Table 2.

Table 2. Descriptive statistics of watershed features.
| Feature | Range | Mean | Standard deviation |
| --- | --- | --- | --- |
| Longitude (dd) | 48.25–51.73 | 49.69 | 1.03 |
| Latitude (dd) | 30.82–33.92 | 32.47 | 0.87 |
| Elevation (m a.s.l.) | 20–2430 | 1332.16 | 799.49 |
| Drainage area (km²) | 35.00–25033.18 | 3261.95 | 6158.98 |
| Main stream length (km) | 7.80–344.22 | 73.05 | 78.02 |
| Main stream slope (%) | 0.20–29 | 2.50 | 4.77 |
| Gravelius coefficient | 1.14–1.86 | 1.46 | 0.22 |

A logarithmic transform was applied to the watershed features to reduce their asymmetry and skewness. In addition, to neutralize the effects of differences in dimension and variance, all the watershed features were standardized by Eq. (2). In Eq. (2), $y_{ij}$ and $x_{ij}$ represent the value and the standardized value of the j-th feature in the i-th watershed, respectively. Also, $\bar{y}_j$ and $S_j$ denote the mean value and standard deviation of the j-th feature over all the watersheds.

$$x_{ij} = \frac{y_{ij} - \bar{y}_j}{S_j} \tag{2}$$
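As a rough illustration of the preprocessing in Eq. (2) and of the assignment rule $h = \arg\max_c \pi_c f_c(x)$, the following Python sketch log-transforms and standardizes a small feature table and then assigns each point to a mixture component. It is not the mclust workflow used in this study: the function names, the sample feature values, and the mixture parameters are hypothetical, the components are restricted to diagonal covariances for brevity, and the use of the sample standard deviation (n − 1 denominator) is an assumption, since the text does not state which form was used.

```python
import math
from statistics import mean, stdev


def standardize_log_features(rows):
    """Log-transform each feature, then apply Eq. (2) column by column."""
    logged = [[math.log(v) for v in row] for row in rows]
    columns = list(zip(*logged))
    col_means = [mean(col) for col in columns]
    col_sds = [stdev(col) for col in columns]  # assumption: sample SD (n - 1)
    return [[(v - m) / s for v, m, s in zip(row, col_means, col_sds)]
            for row in logged]


def log_gaussian_diag(x, mu, var):
    """Log-density of a Gaussian with diagonal covariance (simplifying assumption)."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mu, var))


def assign_cluster(x, weights, means, variances):
    """Return the index c that maximizes pi_c * f_c(x), computed in log space."""
    scores = [math.log(w) + log_gaussian_diag(x, mu, var)
              for w, mu, var in zip(weights, means, variances)]
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical drainage area (km2) and main stream length (km) of four watersheds.
features = [[35.0, 7.8], [420.0, 41.0], [3100.0, 95.0], [25000.0, 344.0]]
z = standardize_log_features(features)

# Hypothetical two-component mixture in the standardized feature space.
weights = [0.5, 0.5]
means = [[-1.0, -1.0], [1.0, 1.0]]
variances = [[1.0, 1.0], [1.0, 1.0]]
labels = [assign_cluster(row, weights, means, variances) for row in z]
```

In practice the mixture parameters would be estimated by EM rather than fixed by hand; the sketch only shows how standardized feature vectors are mapped to clusters once the parameters are available.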

The first, second, and third quartiles of the flood data in each watershed were used to form the second type of regionalization feature vectors, which are named flood-related feature vectors hereafter in this article and are denoted by $v_f$. It is worth noting that the assumption referred to as ‘the basic assumption in model-based clustering by finite mixture models’ in this study concerns the clustering of these two types of feature vectors, not the flood data records and their frequency distributions. Both GMM and WAKM were applied to both types of feature vectors for regionalization of the watersheds, in order to evaluate the performance of the algorithms and the effects of using different feature vectors.

2.4. Discordancy assessment, homogeneity evaluation, and flood quantile estimation

To identify the sites or watersheds that are inappropriate for RFFA, Hosking and Wallis (1993) developed the discordancy measure D based on the L-moment ratios. For each site or watershed in a group of N watersheds, the discordancy measure D is defined as Eq. (3):

$$D = \frac{1}{3} N (u_i - \bar{u})^T A^{-1} (u_i - \bar{u}) \tag{3}$$

where $A = \sum_{i=1}^{N} (u_i - \bar{u})(u_i - \bar{u})^T$, $u_i = [t^{(i)} \; t_3^{(i)} \; t_4^{(i)}]^T$, and $\bar{u} = N^{-1} \sum_{i=1}^{N} u_i$ (Hosking and Wallis, 1997). For the i-th watershed, $t^{(i)}$, $t_3^{(i)}$, and $t_4^{(i)}$ denote the sample L-moment ratios L-CV, L-skewness, and L-kurtosis, respectively, which are defined as Eqs. (4)–(6):

$$t = \frac{l_2}{l_1} \tag{4}$$

$$t_3 = \frac{l_3}{l_2} \tag{5}$$

$$t_4 = \frac{l_4}{l_2} \tag{6}$$

where $l_1$, $l_2$, $l_3$, and $l_4$ represent the first, second, third, and fourth sample L-moments. For more details about the calculation of L-moments, see Hosking and Wallis (1997). The flood data collected from the 42 gauging stations were evaluated by the discordancy measure D. According to Hosking and Wallis (1997), when the number of watersheds is greater than 14, a watershed is considered discordant if D > 3. Since the value of D was not greater than 3 for any watershed, no watershed was considered discordant.

In RFFA, homogeneity measures or tests are used to evaluate the homogeneity of regions. Hosking and Wallis (1993) proposed the three heterogeneity measures $H_1$, $H_2$, and $H_3$ for the homogeneity assessment of regions. The heterogeneity measures are calculated by fitting a four-parameter kappa distribution to the regional average L-moment ratios of the region and performing a large number $N_{sim}$ of Monte Carlo simulation experiments ($N_{sim} = 1000$ in the current study). The heterogeneity measures H have been used in several studies (e.g., Hosking and Wallis, 1997; Ramachandra Rao and Srinivas, 2006; Ramachandra Rao and Srinivas, 2006; Viglione et al., 2007; Srinivas et al., 2008; Sadri and Burn, 2011; Asong et al., 2015; Ahani and Mousavi Nadoushani, 2016; Abdi et al., 2017; Ahani et al., 2018), and more details about their calculation can be found in Hosking and Wallis (1997). A region is regarded as ‘acceptably homogeneous’ if H < 1, ‘possibly heterogeneous’ if 1 ≤ H < 2, and ‘definitely heterogeneous’ if H ≥ 2 (Hosking and Wallis, 1997). In the current study, a region is considered homogeneous if $H_1 < 1$, $H_2 < 1$, and $H_3 < 1$ simultaneously.

Although the homogeneity of individual regions can be evaluated by the measures H, these measures are of limited use when the objective is to compare different regionalizations in which the number of regions, the number of homogeneous regions, the sizes of homogeneous and heterogeneous regions, and the assignment of the watersheds to the regions differ from each other. A criterion is therefore required that can compare the homogeneity provided by different regionalizations. In this study, a criterion named homogeneity ratio (HR) is defined as Eq. (7):

$$HR = \frac{n_{hom}}{n} \tag{7}$$
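As a small illustration of Eq. (7), the sketch below computes HR in Python for a hypothetical regionalization, where $n_{hom}$ is the number of watersheds lying in homogeneous regions and n is the total number of watersheds. The region sizes and heterogeneity values are invented for illustration only, and the function name `homogeneity_ratio` is not taken from the study; a region is counted as homogeneous when $H_1$, $H_2$, and $H_3$ are all below 1, as adopted in this study.

```python
def homogeneity_ratio(regions):
    """Compute HR = n_hom / n for a list of regions.

    Each region is a dict with 'size' (number of watersheds) and
    'H' (the tuple of heterogeneity measures H1, H2, H3).
    """
    n = sum(r["size"] for r in regions)
    # A region is homogeneous only if all three measures are below 1.
    n_hom = sum(r["size"] for r in regions if all(h < 1 for h in r["H"]))
    return n_hom / n


# Hypothetical regionalization of 42 watersheds into three regions.
regions = [
    {"size": 15, "H": (0.4, -0.2, 0.1)},
    {"size": 17, "H": (1.3, 0.6, 0.2)},   # H1 >= 1, so not homogeneous
    {"size": 10, "H": (0.9, 0.8, -0.5)},
]
hr = homogeneity_ratio(regions)  # 25 of the 42 watersheds are in homogeneous regions
```

Because HR is weighted by the number of watersheds rather than by the number of regions, a large heterogeneous region lowers HR more than a small one.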

$$MRE = \frac{1}{n} \sum_{i=1}^{n} \frac{Q_i^{(AS)} - Q_i^{(R)}}{Q_i^{(AS)}} \tag{8}$$

$$MARE = \frac{1}{n} \sum_{i=1}^{n} \frac{\left| Q_i^{(AS)} - Q_i^{(R)} \right|}{Q_i^{(AS)}} \tag{9}$$

$$RMSRE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( \frac{Q_i^{(AS)} - Q_i^{(R)}}{Q_i^{(AS)}} \right)^2} \tag{10}$$

where $Q_i^{(AS)}$ denotes the at-site flood quantile estimate for the i-th watershed, $Q_i^{(R)}$ represents the flood quantile estimate provided by RFFA for the i-th watershed, and n is the total number of watersheds.
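The three error measures of Eqs. (8)–(10) can be sketched in Python as follows. This is only an illustrative implementation under stated assumptions: the function name `relative_error_measures` and the quantile values are hypothetical and are not results of this study.

```python
import math


def relative_error_measures(at_site, regional):
    """Return (MRE, MARE, RMSRE) following Eqs. (8)-(10)."""
    # Relative deviation of the regional estimate from the at-site estimate.
    rel = [(q_as - q_r) / q_as for q_as, q_r in zip(at_site, regional)]
    n = len(rel)
    mre = sum(rel) / n
    mare = sum(abs(e) for e in rel) / n
    rmsre = math.sqrt(sum(e * e for e in rel) / n)
    return mre, mare, rmsre


# Hypothetical flood quantiles (m3/s) for one return period in four watersheds.
q_at_site = [100.0, 200.0, 400.0, 500.0]
q_regional = [110.0, 180.0, 400.0, 450.0]
mre, mare, rmsre = relative_error_measures(q_at_site, q_regional)
```

Since positive and negative deviations cancel in MRE, it indicates the direction of any systematic bias, whereas MARE and RMSRE measure the magnitude of the deviations.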

3. Results and discussion After the discordancy assessment, the regionalization of watersheds was implemented by WAKM and GMM. Considering the number of watersheds and the lengths of flood data records, the regionalization was implemented for the number of regions c = 2, 3, …, 8. Based on the 5T rule, a rule of thumb can be considered to determine the maximum number of regions. Assuming that the flood data record length in all the watersheds is the same, there must be at least 5 watersheds with the flood data record length equal to T in each region to make it possible to provide reliable flood estimates corresponding to the return period T. Since there were 42 watersheds in the selected case study, the maximum number of regions was determined as C = 8 ([42/5] = 8). The regionalization was implemented by both WAKM and GMM based on both the feature vectors vw and vf . The value of homogeneity ration HR was calculated for each regionalization. The HR values can be observed in Fig. 2 in which WAKMw and WAKMf denote the regionalizations by WAKM based on the feature vectors vw and vf , respectively and GMMw and GMMf denote the regionalizations by GMM based on the feature vectors vw and vf , respectively. As seen in Fig. 2, among the four regionalization options, the regionalization by WAKM based on the feature vectors vw provided the lowest homogeneity in terms of HR in most cases. The lowest value of HR provided by WAKMw is equal to 0 for c = 2 and the greatest value is equal to 0.64 for c = 8. If the regionalization based on vw is implemented by the best GMM option, the results of regionalization are improved considerably. For all the number of regions c, the HR values provided by GMMw are greater than or equal to the HR values provided by WAKMw . For regionalizations implemented by GMMw , HR varies between 0.26 and 0.86 and its maximum value is related to c = 6. 
When the regions were identified based on the feature vectors vf , the homogeneity was improved for most values of c in comparison with the regionalizations based on vw . According to the results, the regionalization by WAKM based on vf provides HR values that higher than those provided by WAKMw and GMMw . The important point is that WAKMf reaches HR = 1 for c = 7 that means providing the complete homogeneity or assigning all the watersheds to the homogeneous regions. This is the desirable regionalization condition for RFFA. In the next step, the feature vectors vf were applied for regionalization of watersheds by GMM. As evidenced by Fig. 2, for all the values of c, the HR values provided by GMMf are superior to those provided by the three other regionalization options. Only for c = 7 the results provided by WAKMf and GMMf are the same in terms of HR . Satisfying the complete homogeneity (HR = 1) for c ≥ 4 proves the excellent performance of GMMf compared to the three other options. In these regionalization states, all the watersheds were assigned to the homogeneous regions, which is an appropriate condition for RFFA. In general, according to the results presented in Fig. 2, when vf is for regionalization of watersheds, the performances of both method WAKM and GMM in assigning the watersheds to the homogeneous regions are

(7)

where nhom represents the number of watersheds assigned to the homogeneous regions and n denotes the total number of the regionalized watersheds. It is clear that HR varies in the range [0, 1] and when nhom increases and HR tends to1, the homogeneity provided by the regionalization is more desirable. If HR = 1, all the watersheds belong to the homogeneous regions or in other words, all the regions are homogeneous. This state is called ‘complete homogeneity’ hereafter in this article. After regionalization, a regional frequency distribution must be fitted to each region. In this study, the goodness-of-fit measure Z suggested by Hosking and Wallis (1997) was used to determine the regional frequency distributions. For each region, five three-parameter distributions namely generalized normal (GNO), generalized logistic (GLO), generalized pareto (GPA), generalized extreme value (GEV), and Pearson type III (PE3) were examined to find the appropriate regional frequency distribution. The parameter estimation for the mentioned distributions was performed by L-moments method (Hosking and Wallis, 1997). After that, the mean value of flood data in each watershed was calculated as index-flood to change the regional distribution into the specific frequency distribution of each watershed. The flood quantiles were estimated for to the return periods T = 5, 10, 25, 50, 100, 250, 500 years which are the typical return periods to design the open channels, flood control facilities, transportation infrastructures, and hydraulic structures such as urban drainage networks, levees, highway culverts, and bridges. However, it should be noted that according to the 5T rule (Reed et al., 1999), if the total number of flood data in a region is equal to 5T , for each watershed of the region, the greatest return period for which the reliable flood quantile can be estimated is T. 
So, even if all the watersheds in the selected case study belonged to one region, the maximum reliable return period would be T = 227 years. In fact, the return period T = 500 years is considered only to ensure that there is no considerable change in the performances of the different methods in flood estimation for longer return periods in comparison with shorter ones. To evaluate the effects of the different regionalizations on the flood estimation, the flood quantile estimates provided by the RFFA methods were compared with the at-site flood quantile estimates. For each regionalization, three relative error measures, namely mean relative error (MRE), mean absolute relative error (MARE), and root mean square relative error (RMSRE), were calculated for each return period as in Eqs. (5), (6), and (7):


Fig. 2. The HR value for the regionalizations implemented by WAKM and GMM based on the feature vectors vw and vf for c = 2, 3, …, 8.

improved noticeably. In fact, the HR values related to WAKMf are greater than the values related to WAKMw, and the HR values related to GMMf are greater than the values related to GMMw. In addition, in regionalization based on both types of feature vectors, vw and vf, GMM provides better results in terms of the number of watersheds assigned to homogeneous regions. In other words, the HR values calculated for GMMw are greater than the values calculated for WAKMw, and the HR values calculated for GMMf are greater than the values calculated for WAKMf. The sizes of the regions identified by WAKMf and GMMf (in terms of station-years) are shown in Figs. 3 and 4 for the optimal number of regions for each of them. The optimal number of regions for each regionalization option is the lowest number of regions for which complete homogeneity (HR = 1) can be reached. So, the optimal number of regions for WAKMf and GMMf is equal to 7 and 4, respectively. Reaching complete homogeneity by GMMf for c = 4 can be considered an advantage in comparison with WAKMf, which satisfies

Fig. 4. Sizes of the regions identified by GMMf (station-years) for c = 4.

the complete homogeneity for c = 7, because a lower number of regions roughly means the formation of larger regions (in terms of station-years), and when a region size increases, it becomes possible to provide flood estimates corresponding to larger return periods. As shown in Figs. 3 and 4, the average size of the regions identified by WAKMf is equal to 163 station-years, whereas the average size of the regions identified by GMMf is equal to 285 station-years. According to the 5T rule (Reed et al., 1999), to provide reliable flood quantiles related to the return period T for the watersheds belonging to a region, the total number of flood data recorded in all the watersheds of the region must be at least 5T. Therefore, regarding the average sizes of the regions identified by WAKMf and GMMf for their optimal number of regions, on average the greatest return period for reliable flood estimation based on the regions provided by WAKMf is T = 33 years (163/5 ≈ 33) and for reliable flood estimation based on the regions identified by GMMf is T = 57 years (285/5 = 57). However, the variability of the region size for each regionalization option is considerable and so the 5T

Fig. 3. Sizes of the regions identified by WAKMf (station-years) for c = 7.


Fig. 5. Values of the heterogeneity measures H for the four regionalization options for c = 4.

According to the results presented in Fig. 5, GMMf is the only regionalization option that provides complete homogeneity (HR = 1) for c = 4. The results provided by WAKMf are also acceptable in terms of the homogeneity of regions. Therefore, the regions identified by GMMf and WAKMf were used in the next step of RFFA, which is the flood quantile estimation. The error measures calculated for the flood estimates provided by WAKMf and GMMf are presented in Fig. 6. For T = 25 and 50 years, the values of MRE for the regionalization by WAKMf are lower than the values related to the regionalization by GMMf, but for T = 5, 10, and 100 years, the MRE values calculated based on the regions provided by GMMf are closer to zero than those calculated based on the regions provided by WAKMf (Fig. 6(a)). According to the values of MARE, for T = 5, 10, 25, 50, and 100 years, the deviations from the at-site estimates of the flood quantiles estimated based on the regionalization implemented by GMMf are lower than those of the estimates related to the regions provided by WAKMf (Fig. 6(b)). In addition, the results related to RMSRE are fully consistent with the results related to MARE and show the better performance of RFFA based on the regionalization implemented by GMMf. Therefore, as evidenced by the results shown in Fig. 6, applying GMMf for regionalization of watersheds results in flood estimates closer to the at-site estimates in comparison with the use of WAKMf, though the differences are not large. It should be mentioned that the increasing trend in the difference between the error measures of the estimates based on WAKMf and GMMf continues when the return period increases to T = 250 and 500 years. However, the results related to T = 250 and 500 years are not presented in Fig. 6 because, according to the 5T rule and the region sizes, it is not possible to provide reliable flood quantile estimates for these return periods.
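The three error measures compared above can be sketched as follows. This is an illustration under an assumption: the relative error of each watershed is taken as (regional estimate minus at-site estimate) divided by the at-site estimate, which is the usual form of MRE, MARE, and RMSRE and may differ in detail from the definitions used in the study. The function names and the sample quantiles are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def relative_errors(regional: Sequence[float], at_site: Sequence[float]) -> List[float]:
    """Relative error of each regional quantile with respect to its at-site value."""
    if len(regional) != len(at_site) or len(regional) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return [(r - a) / a for r, a in zip(regional, at_site)]


def error_measures(regional: Sequence[float], at_site: Sequence[float]) -> Dict[str, float]:
    """MRE, MARE and RMSRE for one return period (assumed standard definitions)."""
    errors = relative_errors(regional, at_site)
    n = len(errors)
    return {
        "MRE": sum(errors) / n,                             # signed bias
        "MARE": sum(abs(e) for e in errors) / n,            # mean magnitude
        "RMSRE": math.sqrt(sum(e * e for e in errors) / n)  # penalizes large errors
    }


if __name__ == "__main__":
    # Hypothetical flood quantiles (m^3/s) for three watersheds at one return period.
    regional_estimates = [110.0, 90.0, 210.0]
    at_site_estimates = [100.0, 100.0, 200.0]
    print(error_measures(regional_estimates, at_site_estimates))
```

MRE can be close to zero even when individual errors are large, because positive and negative errors cancel; MARE and RMSRE do not have this property, which is why the three measures are read together.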

rule must be applied to each region separately. Considering the regions identified by WAKMf and GMMf, the largest region identified by WAKMf contains 392 station-years of flood data, while the size of the largest region identified by GMMf is equal to 649 station-years. So, based on the 5T rule, the greatest return period for reliable flood estimation for the 15 watersheds belonging to the largest region identified by WAKMf is about 78 years. On the other hand, the greatest return period for reliable flood estimation for the 24 watersheds belonging to the largest region identified by GMMf is about 130 years. Since c = 4 was the lowest number of regions for which GMMf provided complete homogeneity (HR = 1), RFFA was implemented for c = 4 in the study area. Fig. 5 shows the values of the heterogeneity measures H for the four regions identified by each of the four regionalization options. Among the four regions identified by WAKMw, regions 2 and 4 are clearly heterogeneous, and region 1 is also identified as heterogeneous according to the measure H3, though its heterogeneity is not pronounced. Only region 3 satisfies the homogeneity conditions completely (Fig. 5(a)). The three heterogeneous regions include 31 watersheds, and 11 watersheds belong to the homogeneous region. For c = 4, regionalization by WAKMw results in HR = 0.26 (Fig. 2). WAKMf provides three homogeneous regions and one heterogeneous region. A small heterogeneity can be identified only based on the measures H2 and H3 for region 4 (Fig. 5(b)). Only 5 watersheds belong to the heterogeneous region and 37 watersheds belong to the three homogeneous regions. For c = 4, WAKMf provides HR = 0.88 (Fig. 2). Applying GMMw results in identifying two homogeneous and two heterogeneous regions (Fig. 5(c)). Regions 1 and 2, which are homogeneous, contain 18 watersheds, while regions 3 and 4, which are heterogeneous, include 24 watersheds. For c = 4, HR = 0.42 for the regionalization implemented by GMMw (Fig. 2). Use of GMMf to identify four regions results in four homogeneous regions (Fig. 5(d)). All the regions are homogeneous based on the three measures H1, H2, and H3. So, all 42 watersheds belong to homogeneous regions and HR = 1.
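The way HR follows from the per-region heterogeneity measures can be sketched as below, taking HR as nhom divided by n, as the definitions of nhom and n imply. Two assumptions are made for illustration: a region counts as homogeneous only when all three measures H1, H2, and H3 fall below a threshold, and that threshold is 1, following the convention of Hosking and Wallis (1997). The individual region sizes and H values are hypothetical; only the totals (37 of 42 watersheds in homogeneous regions, as reported for WAKMf at c = 4) come from the text.

```python
from typing import Iterable, Tuple

# (number of watersheds in the region, (H1, H2, H3))
Region = Tuple[int, Tuple[float, float, float]]


def is_homogeneous(h_values: Iterable[float], threshold: float = 1.0) -> bool:
    """A region is treated as homogeneous when every H measure is below the threshold."""
    return all(h < threshold for h in h_values)


def homogeneity_ratio(regions: Iterable[Region], threshold: float = 1.0) -> float:
    """HR = n_hom / n, the share of watersheds assigned to homogeneous regions."""
    regions = list(regions)
    n = sum(size for size, _ in regions)
    if n == 0:
        raise ValueError("at least one watershed is required")
    n_hom = sum(size for size, h in regions if is_homogeneous(h, threshold))
    return n_hom / n


if __name__ == "__main__":
    # Hypothetical split of 42 watersheds into four regions; the last region
    # fails on H2 and H3, so its 5 watersheds do not count toward n_hom.
    regions = [
        (12, (0.3, -0.2, 0.5)),
        (11, (0.7, 0.1, -0.4)),
        (14, (-0.5, 0.6, 0.2)),
        (5, (0.8, 1.2, 1.4)),
    ]
    print(round(homogeneity_ratio(regions), 2))  # 37 / 42
```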

Fig. 6. Values of the error measures for the flood quantile estimation by WAKMf and GMMf for c = 4.

4. Conclusions

In the current study, finite mixture models (FMM) were used to regionalize the watersheds for RFFA. The main assumption in using FMM is that a population consists of several subpopulations, to each of which a specific probability density function can be fitted. An FMM is a mixture of probability density functions that can be applied to model the population, and it can be used as a statistical method for clustering of data. In this study, Gaussian mixture models (GMM), which are the most widely used family of FMM, were utilized for regionalization of the watersheds of the Karun-e-bozorg basin in the southwest of Iran. In addition to GMM, the regionalization was implemented by the hybrid clustering algorithm WAKM (Ramachandra Rao and Srinivas, 2006). The regionalization was implemented by both GMM and WAKM based on two types of feature vectors, vw and vf, the first of which consists of the watershed features and the second of the flood statistics. The evaluation of results based on the heterogeneity measures H and the homogeneity ratio HR shows that the regionalization option GMMf, which denotes regionalization by GMM based on the feature vectors vf, provides the greatest values of HR for all the numbers of regions c = 2, 3, 4, 5, 6, 7, 8, which means assigning the largest number of watersheds to homogeneous regions. In addition, only GMMf provides complete homogeneity (HR = 1) for 4 ≤ c ≤ 8. Among the other regionalization options, only WAKMf, which represents regionalization by WAKM based on the feature vectors vf, provides complete homogeneity, and only for c = 7. In general, the results indicate that both WAKM and GMM provide better results in the regionalization based on vf in comparison with the regionalization based on vw. This means that the use of the flood statistics may improve the results of the regionalization in terms of the homogeneity ratio. However, the use of the feature vectors vf is limited for ungauged watersheds, since the flood statistics require flood records. Also, according to the results, in regionalization based on both types of feature vectors, vw and vf, GMM provides more appropriate results in terms of HR in comparison with WAKM. Furthermore, the comparison of the flood estimates based on the regions identified by WAKMf and GMMf shows that for c = 4, the flood estimates related to the use of GMMf are closer to the at-site estimates than the flood estimates related to the use of WAKMf. This means that GMM can be considered a useful method for regionalization of watersheds for RFFA. The use of non-Gaussian mixture models such as Dirichlet mixture models, which do not have some limitations of GMM (e.g., the assumption of symmetry in all the subpopulations or clusters), can be pursued as a complement to this study. Also, the simultaneous use of FMM in regionalization of watersheds, fitting the regional frequency distribution, and flood estimation can be studied in future research.

CRediT authorship contribution statement

Ali Ahani: Conceptualization, Methodology, Software, Validation, Formal analysis, Resources, Data curation, Writing - original draft, Writing - review & editing, Visualization. S. Saeid Mousavi Nadoushani: Conceptualization, Methodology, Data curation, Writing - review & editing, Supervision, Project administration. Ali Moridi: Conceptualization, Methodology, Writing - review & editing, Supervision.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

Abdi, A., Hassanzadeh, Y., Talatahari, S., Fakheri-Fard, A., Mirabbasi, R., 2017. Regional drought frequency analysis using L-moments and adjusted charged system search. J. Hydroinf. 19 (3), 426–442.
Acreman, M.C., Sinclair, C.D., 1986. Classification of drainage basins according to their physical characteristics; an application for flood frequency analysis in Scotland. J. Hydrol. 84 (3–4), 365–380.
Ahani, A., Mousavi Nadoushani, S.S., 2016. Assessment of some combinations of hard and fuzzy clustering techniques for regionalisation of catchments in Sefidroud basin. J. Hydroinf. 18 (6), 1033–1054.
Ahani, A., Mousavi Nadoushani, S.S., Moridi, A., 2018. A feature weighting and selection method for improving the homogeneity of regions in regionalization of watersheds. Hydrol. Process. 32 (13), 2048–2095.
Asong, Z.E., Khaliq, M.N., Wheater, H.S., 2015. Regionalization of precipitation characteristics in the Canadian Prairie Provinces using large-scale atmospheric covariates and geophysical attributes. Stoch. Env. Res. Risk Assess. 29 (3), 875–892.
Basu, B., Srinivas, V.V., 2014. Regional flood frequency analysis using kernel-based fuzzy clustering approach. Water Resour. Res. 50 (4), 3295–3316.
Basu, B., Srinivas, V.V., 2015. Analytical approach to quantile estimation in regional frequency analysis based on fuzzy framework. J. Hydrol. 524, 30–43.
Bezdek, J.C., 1981. Pattern Recognition with Fuzzy Objective Function Algorithms. Advanced Applications in Pattern Recognition. Plenum Press, New York.
Bhaskar, N.R., O’Connor, C.A., 1989. Comparison of method of residuals and cluster analysis for flood regionalization. J. Water Resour. Plann. Manage. 115 (6), 793–808.
Bishop, C.M., 2006. Pattern Recognition and Machine Learning. Information Science and Statistics. Springer, New York.
Burn, D., Goel, N.K., 2000. The formation of groups for regional flood frequency analysis. Hydrol. Sci. J. 45 (1), 97–112.
Burn, D.H., 1989. Cluster analysis as applied to regional flood frequency. J. Water Resour. Plann. Manage. 115 (5), 567–582.
Burn, D.H., 1990. Evaluation of regional flood frequency analysis with a region of influence approach. Water Resour. Res. 26 (10), 2257–2265.
Cavadias, G.S., 1990. The canonical correlation approach to regional flood estimation. In: Regionalization in Hydrology (Proceedings of the Ljubljana Symposium). IAHS, pp. 171–178.
Cavadias, G.S., Ouarda, T.B.M.J., Bobée, B., Girard, C., 2001. A canonical correlation approach to the determination of homogeneous regions for regional flood estimation of ungauged basins. Hydrol. Sci. J. 46 (4), 499–512.
Dalrymple, T., 1960. Flood frequency analyses. Report.
Dempster, A.P., Laird, N.M., Rubin, D.B., 1977. Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. Series B (Methodological) 39 (1), 1–38.
Di Prinzio, M., Castellarin, A., Toth, E., 2011. Data-driven catchment classification: application to the PUB problem. Hydrol. Earth Syst. Sci. 15 (6), 1921–1935.
Everitt, B.S., Landau, S., Leese, M., Stahl, D., 2011. Cluster Analysis. Wiley Series in Probability and Statistics. John Wiley & Sons Ltd.
Farsadnia, F., Rostami Kamrood, M., Moghaddam Nia, A., Modarres, R., Bray, M.T., Han, D., Sadatinejad, J., 2014. Identification of homogeneous regions for regionalization of watersheds by two-level self-organizing feature maps. J. Hydrol. 509, 387–397.
Hall, M.J., Minns, A.W., 1999. The classification of hydrologically homogeneous regions. Hydrol. Sci. J. 44 (5), 693–704.
Hall, M.J., Minns, A.W., Ashrafuzzaman, A.K.M., 2002. The application of data mining techniques for the regionalisation of hydrological variables. Hydrol. Earth Syst. Sci. 6 (4), 685–694.
Hartigan, J.A., Wong, M.A., 1979. Algorithm AS 136: A k-means clustering algorithm. J. Roy. Stat. Soc. 28 (1), 100–108.
Hosking, J.R.M., Wallis, J.R., 1993. Some statistics useful in regional frequency analysis. Water Resour. Res. 29 (2), 271–281.
Hosking, J.R.M., Wallis, J.R., 1997. Regional Frequency Analysis – An Approach Based on L-Moments. Cambridge University Press, New York.
Ilorme, F., Griffis, V.W., 2013. A novel procedure for delineation of hydrologically homogeneous regions and the classification of ungauged sites for design flood estimation. J. Hydrol. 492, 151–162.
Jingyi, Z., Hall, M.J., 2004. Regional flood frequency analysis for the Gan-Ming River basin in China. J. Hydrol. 296 (1–4), 98–117.
Kar, A.K., Goel, N.K., Lohani, A.K., Roy, G.P., 2012. Application of clustering techniques using prioritized variables in regional flood frequency analysis-case study of Mahanadi basin. J. Hydrol. Eng. 17 (1), 213–223.
Kingston, D.G., Hannah, D.M., Lawler, D.M., McGregor, G.R., 2011. Regional classification, variability, and trends of northern North Atlantic river flow. Hydrol. Process. 25 (7), 1021–1033.
Kohonen, T., 1982. Self-organized formation of topologically correct feature maps. Biol. Cybern. 43 (1), 59–69.
Lin, G.-F., Chen, L.-H., 2006. Identification of homogeneous regions for regional frequency analysis using the self-organizing map. J. Hydrol. 324 (1–4), 1–9.
Mosley, M.P., 1981. Delimitation of New Zealand hydrologic regions. J. Hydrol. 49, 173–192.
Nathan, R.J., McMahon, T.A., 1990. Identification of homogeneous regions for the purposes of regionalisation. J. Hydrol. 121, 217–238.
Ouarda, T.B.M.J., Girard, C., Cavadias, G.S., Bobée, B., 2001. Regional flood frequency estimation with canonical correlation analysis. J. Hydrol. 254 (1–4), 157–173.
Raftery, A.E., Dean, N., 2006. Variable selection for model-based clustering. J. Am. Stat. Assoc. 101 (473), 168–178.
Ramachandra Rao, A., Srinivas, V.V., 2006. Regionalization of watersheds by fuzzy cluster analysis. J. Hydrol. 318 (1–4), 57–79.
Ramachandra Rao, A., Srinivas, V.V., 2006. Regionalization of watersheds by hybrid-cluster analysis. J. Hydrol. 318 (1–4), 37–56.
Ramachandra Rao, A., Srinivas, V.V., 2008. Regionalization of Watersheds – An Approach Based on Cluster Analysis. In: Water Science and Technology Library, vol. 58. Springer.
Reed, D.W., Jakob, D., Robson, A.J., Faulkner, D.S., Stewart, E.J., 1999. Regional frequency analysis: a new vocabulary. In: Hydrological Extremes: Understanding, Predicting, Mitigating (Proceedings of IUGG 99 Symposium, Birmingham, July 19). IAHS, pp. 237–243.
Sadri, S., Burn, D.H., 2011. A fuzzy c-means approach for regionalization using a bivariate homogeneity and discordancy approach. J. Hydrol. 401 (3–4), 231–239.
Satyanarayana, P., Srinivas, V.V., 2011. Regionalization of precipitation in data sparse areas using large scale atmospheric variables – a fuzzy clustering approach. J. Hydrol. 405 (3–4), 462–473.
Smith, A., Sampson, C., Bates, P., 2015. Regional flood frequency analysis at the global scale. Water Resour. Res. 51 (1), 539–553.
Srinivas, V.V., Tripathi, S., Ramachandra Rao, A., Govindaraju, R.S., 2008. Regional flood frequency analysis by combining self-organizing feature map and fuzzy clustering. J. Hydrol. 348 (1–2), 148–166.
Ssegane, H., Tollner, E.W., Mohamoud, Y.M., Rasmussen, T.C., Dowd, J.F., 2012. Advances in variable selection methods II: Effect of variable selection method on classification of hydrologically similar watersheds in three Mid-Atlantic ecoregions. J. Hydrol. 438–439, 26–38.
Toth, E., 2013. Catchment classification based on characterisation of streamflow and precipitation time series. Hydrol. Earth Syst. Sci. 17 (3), 1149–1159.
Viglione, A., Laio, F., Claps, P., 2007. A comparison of homogeneity tests for regional frequency analysis. Water Resour. Res. 43 (3).
Ward Jr., J.H., 1963. Hierarchical grouping to optimize an objective function. J. Am. Stat. Assoc. 58, 236–244.
Wiltshire, S.E., 1986. Regional flood frequency analysis II: Multivariate classification of drainage basins in Britain. Hydrol. Sci. J. 31 (3), 335–346.