Journal of Hydrology 545 (2017) 251–262
Flood frequency analysis using multi-objective optimization based interval estimation approach
K.S. Kasiviswanathan, Jianxun He*, Joo-Hwa Tay
Department of Civil Engineering, Schulich School of Engineering, University of Calgary, 2500 University Drive NW, Calgary, AB T2N 1N4, Canada
Article info
Article history: Received 23 December 2015; Received in revised form 29 September 2016; Accepted 12 December 2016; Available online 23 December 2016. This manuscript was handled by A. Bardossy, Editor-in-Chief, with the assistance of Felix Frances, Associate Editor.
Keywords: Flood frequency analysis; Ensemble simulation; Multi-objective optimization; Prediction interval; Uncertainty
Abstract
Flood frequency analysis (FFA) is a necessary tool for water resources management and water infrastructure design. Owing to the variability in sample representation, distribution selection, and distribution parameter estimation, flood quantile estimation is subject to various levels of uncertainty, which is neither negligible nor avoidable. Hence, alternatives to the conventional approach of FFA are desired for quantifying the uncertainty, such as in the form of a prediction interval. The primary focus of the paper was to develop a novel approach to quantify and optimize the prediction interval in FFA resulting from the non-stationarity of the data set, which is reflected in the estimated distribution parameters. This paper proposed the combination of a multi-objective optimization approach and an ensemble simulation technique to determine the optimal perturbations of the distribution parameters for constructing the prediction interval of flood quantiles in FFA. To demonstrate the proposed approach, annual maximum daily flow data collected from two gauge stations on the Bow River, Alberta, Canada, were used. The results suggest that the proposed method can successfully capture the uncertainty in quantile estimates qualitatively using the prediction interval, as the number of observations falling within the constructed prediction interval is approximately maximized while the prediction interval is minimized.
1. Introduction

Flooding has always been one of the leading causes of damage among all natural disasters. During the past decades, annual losses from floods have reached tens of billions of US dollars, and thousands of people were killed each year (Hirabayashi et al., 2013). The major factors causing floods include natural drivers (e.g., extreme rainfall) as well as anthropogenic activities such as deforestation and rapid urbanisation, both of which would increase flood frequency and severity. While the occurrence of hydrologic extreme events cannot be avoided, proper structural and non-structural measures can effectively reduce the risk of economic losses, environmental damages, and social vulnerability. Hydraulic structures, which are commonly designed using a risk-based approach, might fail as the risk level is not always constant over the life span of the structures (Prakash et al., 2014). However, the risk level and its corresponding flood quantile have often been treated as constants in engineering design. Therefore, hydraulic structures designed based on the stationarity concept would need to be progressively improved and/or replaced to maintain a required risk level;
whereas the replacement and improvement of hydraulic structures can be cost-prohibitive on many occasions. On the other hand, more reliable guidelines for the design, operation, and management of the structures can be formulated by updating and enhancing non-structural measures (e.g., flood analysis and modeling) as the understanding of the physical phenomena and computational techniques progresses. Flood frequency analysis (FFA) is one of the non-structural measures and a standard practice in designing water and flood control infrastructure. FFA provides flood quantile estimates, which are used as the basis for designing water infrastructure. In practice, flow data (often annual maximum daily flow) are fitted using a theoretical probability distribution function, which is usually selected from a set of several candidate distributions. The commonly used probability distributions in FFA include the lognormal, Gamma, Pearson III, generalized extreme value (GEV), and generalized Pareto (GP) distributions. The selection of a particular distribution is largely dependent on the data set(s) used and is thus site- or region-specific. To estimate the distribution parameters, different methods such as the method of moments (MOM), maximum likelihood estimator (MLE), probability weighted moments (PWM), and the linear moment (L-moment) method have been applied. The accuracy of these methods is assessed based on how closely the theoretical
distribution matches the observations (Strupczewski et al., 2002). Among these methods, the L-moment method appears to be more reliable, as it is less sensitive to sampling variability and measurement error than the other methods (Hosking, 1990). Please note that each method, however, has its own merits and demerits in terms of performance and the procedure for estimating the distribution parameters. In addition to the conventional estimators, a variety of other approaches have been applied in FFA; a few worth mentioning here are the neuro-fuzzy inference system (Shu and Ouarda, 2008; Basu and Srinivas, 2015), artificial neural networks (Shu and Burn, 2004; Aziz et al., 2014), and stochastic simulation (Arnaud et al., 2014). Similar to many hydrologic models and analyses, FFA has often been performed without considering the uncertainty inherent in the analysis. A few researchers, however, have attempted to quantify the uncertainty in FFA. For instance, Wood and Rodríguez-Iturbe (1975) applied a Bayesian approach, in which normal and lognormal distributions were considered, to assess uncertainty. However, the assumption of a normal or lognormal distribution is often not valid for FFA. Tang (1980) developed a procedure to account for model structure uncertainty by coupling different probability distributions in FFA. Reis and Stedinger (2005) employed Bayes' theorem to enhance quantile estimation and quantify distribution parameter uncertainty. It has been reported that Monte Carlo based simulation approaches produce, through a large number of simulations, a unique distribution that is generally not biased toward other types of distribution (Reis and Stedinger, 2005; Lee and Kim, 2008; Halbert et al., 2016). Most recently, a few studies have shown that bootstrap based sampling methods provide reliable estimates of prediction intervals in hydrologic frequency analysis (Tung et al., 2014; Hu et al., 2015). To summarize, most of the methods mentioned above for quantifying uncertainty in FFA are computationally challenging because they often require probabilistic and statistical information on the parameters of the selected probability distribution; such information is, however, hard, if not impossible, to justify. In the literature, the approach of constructing an ensemble of quantile estimates has often been applied to quantify the uncertainty. Based upon the uncertainty source(s) of interest, the quantification of uncertainty can be carried out using the ensemble simulation approach by varying the data window (e.g., random sampling from the data set) (Chowdhury and Stedinger, 1991; Salas et al., 2013; Obeysekera and Salas, 2014), the distribution parameters (Reis and Stedinger, 2005), the selected distribution type (Tang, 1980), as well as the method for estimating the distribution parameters. There is no doubt that both natural and anthropogenic factors, such as changes/variability in land use, water consumption, and global climate, alter hydrological processes and in turn hydrological variables (e.g., flow). To enhance the reliability and resilience of water resources systems, water management agencies need to work towards improved methods for better incorporating the uncertainty introduced by these changes. On the other hand, a wide prediction interval, which reflects a high level of uncertainty, produced by a modeling/analysis approach would jeopardize its application in the real world.
Therefore, approaches that can provide an optimal (tight) prediction interval are always desirable. For FFA, in contrast to the existing methods for quantifying uncertainty, this paper proposed a novel approach to quantify the optimal prediction interval of quantile estimates. This was carried out using the combination of an ensemble simulation (with Latin hypercube sampling (LHS) used for sampling) and a multi-objective optimization solved by a search algorithm, the Genetic Algorithm (GA). The advantage of using the proposed method for FFA lies in that it can define the optimal variability of the distribution parameters based on the data used in the analysis, so as to ensure the reliability/precision of the proposed method in estimating
flood quantiles and in quantifying their corresponding prediction intervals. The paper is organized as follows. Following the introduction, the study area and the data sets used are described. The next section illustrates the challenges of the conventional FFA and introduces the proposed methodology in detail. Subsequently, the analysis results obtained using the proposed methodology are presented and discussed. The conclusions drawn from the paper are presented at the end.
2. Study area and data description

The Bow River originates in the Canadian Rockies and flows from west to east through three geographic regions: the mountains, the foothills, and the prairies. The climate of the Bow River Basin is typical of southern Alberta and is characterized by long, cold winters and short summers. The temperature in general varies between 12 °C and 18 °C from winter to summer. The annual precipitation in the upper Bow River ranges from 500 to 700 mm, of which half falls as snow. In the Calgary region, which is situated in the lower part of the Bow River Basin, the annual precipitation is slightly less (approximately 412 mm); approximately 78% of the precipitation in Calgary is rainfall and the remainder falls as snow. The Bow River is fed by various water sources, including surface runoff from late spring to early summer, snowmelt from early spring to early summer, and groundwater recharge throughout the year, which makes the Bow River a complex system. All of these challenge the conventional FFA, as high flows can result from different mechanisms (snowmelt, rainfall, and their combination), in addition to the potential impacts of climate variability and/or climate change on flows. Several flow gauge stations on the Bow River are operated by the Water Survey of Canada of Environment Canada. The City of Calgary is situated approximately in the middle of the river and is the most populated community within the river basin. Most recently, in June 2013, southern Alberta experienced a record-breaking flood in Alberta's history, which caused billions of dollars in losses and demonstrated the detrimental effects of floods. In the City of Calgary, the flood caused an infrastructure loss of $445 million, as estimated by the flood risk recovery task force. Flow in the Bow River is consistently low from winter until the subsequent early spring. Flow normally starts to increase in May, when the melt of the mountain snowpack starts feeding the river. The annual maximum flow often occurs around late June due to water contributions from both rainfall and mountain snowmelt, although it can also be observed in other months, including May and from July to September. In the 2013 flood, rainfall-runoff was considered the primary mechanism causing the extremely high peak flow, as over 200 mm of rain fell in less than two days in many parts of the river basin, particularly west and southwest of Calgary. This paper used the data collected from two flow gauge stations on the Bow River: the Bow River at Banff (Banff station, 05BB001) and the Bow River at Calgary (Calgary station, 05BH004), located in the upper and middle reaches of the river, respectively (Fig. 1). From the river origin to Banff, there are no hydraulic structures regulating flows, whereas several dams are located between Banff and Calgary and on the tributaries flowing into the Bow River. However, the dams upstream of Calgary do not have a significant impact on the peak flows, considering the sizes of their reservoirs. The rationale behind the selection of these two gauge stations is that both stations have long historical records (from 1909 to 2013 at Banff station and from 1912 to 2013 at Calgary station) and are representative of the upper and middle reaches of the river, respectively.
Fig. 1. Study area (the extent of the figure is from the origin of the Bow River to just downstream of the Calgary city limit).
Most importantly, the application of the proposed methodology can be well demonstrated using the data collected from these two gauge stations, as illustrated in the following sections.

3. Methodology

3.1. Challenges in the conventional FFA

The conventional FFA is carried out under the assumptions that (i) the observations are independent and identically distributed and (ii) the data set is stationary; the latter is, however, often violated because of changes or variations in many influencing factors such as climate and land cover (Khaliq et al., 2006; Gilroy and McCuen, 2012). The violation of the stationarity assumption can often be seen, especially in long-term data sets, for example at the two gauge stations selected in this paper. The time series of the annual maximum daily flow, along with trend lines, are displayed in Fig. 2 for Banff and Calgary stations, respectively. Fig. 2 illustrates downward trends in the two time series. Significant downward trends were also detected at both stations using the Mann-Kendall test at the 5% significance level. Most recently, Gado and Nguyen (2016) proposed an approach, L-moments for non-stationarity (LM-NS), to conduct FFA at sites where the stationarity assumption is not valid. In the LM-NS approach, a non-stationary data set is decomposed into a trend component, which is linked to time, and a second component without trend. However, as illustrated in Fig. 2, when the time window is moved to after 1960, an insignificant downward trend was identified at Banff station, while an insignificant upward trend was found at Calgary station.
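The trend screening mentioned above is straightforward to reproduce. The following is a minimal sketch of the Mann-Kendall test (without tie or autocorrelation corrections); it illustrates the standard test rather than the authors' exact implementation, and `annual_max_flow` is a hypothetical array of annual maximum daily flows.

```python
import numpy as np
from scipy.stats import norm

def mann_kendall(x, alpha=0.05):
    """Two-sided Mann-Kendall trend test (no tie/autocorrelation correction)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    # S statistic: sum of the signs of all pairwise forward differences
    s = sum(np.sign(x[j] - x[i]) for i in range(n - 1) for j in range(i + 1, n))
    # Variance of S under the null hypothesis of no trend (ties ignored)
    var_s = n * (n - 1) * (2 * n + 5) / 18.0
    # Standard normal test statistic with continuity correction
    if s > 0:
        z = (s - 1) / np.sqrt(var_s)
    elif s < 0:
        z = (s + 1) / np.sqrt(var_s)
    else:
        z = 0.0
    p = 2.0 * (1.0 - norm.cdf(abs(z)))
    return z, p, p < alpha

# z, p, significant = mann_kendall(annual_max_flow)
```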
These two sample data sets demonstrate the dilemma of selecting an appropriate time window for FFA and the challenge of describing the non-stationarity as a function of time, due to the existence of non-monotonic trends in the data sets. Alternatively, the effect of non-stationarity might need to be quantified in a way that assesses the uncertainty in the form of prediction intervals of quantile estimates. The quantile estimates, along with their corresponding prediction intervals, can then be applied for decision making considering changes/variations in the statistical characteristics of the data sets in FFA.

3.2. Proposed methodology

As discussed above, the conventional FFA has been challenged because the assumption of stationarity of data sets may not be satisfied on many occasions. In addition, including extremely high value(s) in a data set could lead to large variations in the estimated quantiles, especially the high quantiles. However, a long-term data set is always desirable for FFA, since a short-term data set is often not capable of sufficiently representing the statistical characteristics of a population. To overcome these drawbacks, the paper proposed a new methodology for using long-term data sets in FFA. The flood quantiles with their prediction intervals were produced using the ensemble simulation, and the prediction intervals were optimized using GA. The flowchart in Fig. 3 describes the methodology proposed in the paper. As illustrated in the flowchart, the proposed methodology estimates the flood quantiles and their corresponding prediction intervals in two stages: Stage 1 and Stage 2. Stage 1, which is similar to the conventional FFA, selects the distribution type and estimates the distribution parameters using the L-moment method. Please note that the proposed method is not limited to the L-moment method; other methods, such as MOM, MLE,
Fig. 2. Time series of the annual maximum daily flow at Banff and Calgary stations, respectively, with trend lines.
and PWM, are applicable as well. In this paper, four commonly used probability distribution functions, the Gumbel distribution (GUM), GEV, generalized logistic distribution (GL), and GP, are considered for selecting the appropriate distribution. Please be aware that any other distributions can also be included in the distribution selection. Details on the L-moment method and the distribution parameter estimation for each distribution can be found in Hosking (1990). In this stage, the distribution that fits the observations most closely is selected from the candidate distributions based on the standard error (SE), calculated using the observations and their corresponding estimated quantiles. The SE is calculated by
$$SE = \sqrt{\frac{\sum_{T=1}^{n}\left(g_T - h_T\right)^2}{n - q}} \qquad (1)$$
where $g_T$ denotes the observation and $h_T$ represents the estimated quantile corresponding to the frequency/return period (T) of the observation; n and q are the sample size of the data set and the number of distribution parameters, respectively.

In Stage 2, the quantification of the prediction interval of the estimated quantiles is carried out using the combination of the ensemble simulation and the multi-objective optimization. As discussed previously, the primary objective of this paper was to propose a methodology that provides quantile estimates and their optimal prediction intervals for water resources management and water infrastructure design. It has been well acknowledged that FFA is heavily dependent on the data availability and the statistical characteristics of the data set used. This paper, therefore, focused on the uncertainty originating from the distribution parameters, as the distribution parameters are sensitive to the statistical characteristics of the data set used. In this stage, the ranges of the GA decision variables (here the perturbations of the distribution parameters) as well as the initial population and generation settings of the GA are first initialized. The ensemble of the distribution parameters is obtained by randomly applying the possible perturbations to the distribution parameter estimates obtained from Stage 1. Given the ranges of the distribution parameters, the sampling of the distribution parameters within their ranges is accomplished using LHS, which stratifies the probability distribution into equal intervals to generate the ensemble of distribution parameter combinations.
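As a rough illustration of this sampling step, the sketch below draws a Latin hypercube sample of GEV parameter sets within a symmetric perturbation band around the Stage 1 estimates. It assumes SciPy's `scipy.stats.qmc` module (SciPy ≥ 1.7); the ±30% band and the example parameter values are only placeholders, since in the proposed method the perturbation bounds are the decision variables optimized by the GA.

```python
import numpy as np
from scipy.stats import qmc

def lhs_parameter_ensemble(location, scale, shape, pert=0.30, m=100, seed=1):
    """Latin hypercube sample of GEV parameter sets within +/- pert of the
    Stage 1 point estimates (location, scale, shape)."""
    point = np.array([location, scale, shape], dtype=float)
    bound_a, bound_b = point * (1.0 - pert), point * (1.0 + pert)
    # Order the bounds so they remain valid if an estimate is negative
    lo, hi = np.minimum(bound_a, bound_b), np.maximum(bound_a, bound_b)
    sampler = qmc.LatinHypercube(d=3, seed=seed)   # requires SciPy >= 1.7
    u = sampler.random(n=m)                        # m points in [0, 1)^3
    return qmc.scale(u, lo, hi)                    # columns: location, scale, shape

# Example with rounded Stage 1 estimates for Calgary (Scenario I, Table 2):
# ensemble = lhs_parameter_ensemble(location=283.5, scale=99.3, shape=0.18)
```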
Considering a three-parameter distribution, such as the GEV, which is characterized by the location parameter (ξ), scale parameter (α), and shape parameter (κ), the ensemble simulation discussed above can be mathematically represented by Eq. (2), which produces prediction intervals of quantiles given the perturbations of the distribution parameters estimated in Stage 1. Please note that the mathematical representation can differ from Eq. (2) when a different distribution is applied. To quantify the inherent uncertainty of the estimated quantiles in the ensemble simulation, two uncertainty indices, the percentage of coverage (POC) and the average width (AW) of the prediction interval, which are defined by Eqs. (3) and (4), are applied. The POC and the AW have been commonly used in the literature (e.g., Alvisi and Franchini, 2011; Kasiviswanathan et al., 2013). It has been well acknowledged that a wide prediction interval would compromise the reliability/precision of a model/analysis and thus hinder its application in decision-making practices. Therefore, in the multi-objective optimization, the problem of deriving the optimal prediction interval of quantile estimates is formulated as minimizing the AW and maximizing the POC. Please note that both the AW and the POC are functions of the upper and lower bounds of the estimated quantiles.
$$\left[\hat{Q}_T^U, \hat{Q}_T^L\right] = f\left\{T,\ \left[\kappa^-, \kappa^+ \mid \kappa^- \le \kappa \le \kappa^+\right],\ \left[\xi^-, \xi^+ \mid \xi^- \le \xi \le \xi^+\right],\ \left[\alpha^-, \alpha^+ \mid \alpha^- \le \alpha \le \alpha^+\right]\right\} \qquad (2)$$

$$POC = \left(\frac{1}{m}\sum_{T=1}^{m} c_T\right) \times 100 \qquad (3)$$

$$AW = \frac{1}{m}\sum_{T=1}^{m}\left[\hat{Q}_T^U - \hat{Q}_T^L\right] \qquad (4)$$

where $\hat{Q}_T^U$ and $\hat{Q}_T^L$ are the upper and lower bounds of the estimated flow with return period T, respectively; m is the total number of ensemble quantiles used for constructing the prediction interval; and $c_T = 1$ if the observed value falls within the prediction interval $\left[\hat{Q}_T^U, \hat{Q}_T^L\right]$, otherwise $c_T = 0$.
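Eqs. (3) and (4) translate directly into code. The sketch below assumes the upper and lower bounds have already been extracted from the ensemble of simulated quantiles (e.g., as the envelope of the m ensemble members evaluated at the return periods of the observations).

```python
import numpy as np

def poc_and_aw(obs, q_lower, q_upper):
    """Uncertainty indices of Eqs. (3) and (4): percentage of coverage (POC)
    and average width (AW), evaluated at the return periods of the observations."""
    obs = np.asarray(obs, dtype=float)
    q_lower = np.asarray(q_lower, dtype=float)
    q_upper = np.asarray(q_upper, dtype=float)
    covered = (obs >= q_lower) & (obs <= q_upper)   # c_T = 1 when the observation lies inside
    poc = 100.0 * covered.mean()                    # Eq. (3)
    aw = np.mean(q_upper - q_lower)                 # Eq. (4)
    return poc, aw
```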
In the multi-objective optimization formulated above, there typically does not exist a feasible solution that maximizes the POC and simultaneously minimizes the AW, as the POC and the AW conflict with each other. For instance, an increase in the POC can be achieved only by increasing the AW. Therefore, in the optimization, Pareto optimal solutions, for which one objective function cannot be further improved without degrading the other
Fig. 3. The flowchart of the proposed methodology.
objective function(s), are produced. The Pareto optimal solutions are then used to determine the optimal range of the estimated parameters and subsequently the estimated quantiles for a given value of one objective function. Consequently, the proposed methodology can yield the optimal prediction intervals of quantile estimates by minimizing the AW given an approximately maximized POC, and vice versa. The multi-objective optimization is solved using a GA-based search algorithm. The GA as an optimization tool was developed based on the concept of the biological evolutionary process (Goldberg, 1989). The GA has the
advantage of identifying the global optimal solution, whereas conventional search algorithms such as the quasi-Newton method often lead to local optima (Jain et al., 2005). The GA has often been adopted for solving water resources problems, especially those involving non-linearity. In the GA, the solution set is represented by a population of strings, each of which comprises a number of blocks representing the individual decision variables of the problem. The strings are evolved to produce new strings through selection, crossover, and mutation. Each new string is produced from a pair of parent strings selected from the previous pool according to their fitness (the objective function evaluated using the components of the string). In each step (each perturbation), the ensemble simulations are evaluated using the POC and AW to generate the next generation of decision variables (here the perturbations of the distribution parameters).
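The paper does not reproduce the GA in full; as a simplified stand-in for its selection step, the sketch below shows only the non-domination screening that yields the Pareto front from a set of candidate perturbation levels whose POC and AW have already been evaluated with Eqs. (3) and (4).

```python
import numpy as np

def pareto_front(poc, aw):
    """Indices of non-dominated candidates when POC is maximized and AW minimized."""
    poc, aw = np.asarray(poc, dtype=float), np.asarray(aw, dtype=float)
    front = []
    for i in range(len(poc)):
        # Candidate i is dominated if some candidate is at least as good in both
        # objectives and strictly better in at least one of them
        dominated = np.any(
            (poc >= poc[i]) & (aw <= aw[i]) & ((poc > poc[i]) | (aw < aw[i]))
        )
        if not dominated:
            front.append(i)
    return front

# Each candidate is a perturbation level applied to the distribution parameters,
# evaluated beforehand with the POC/AW functions of Eqs. (3) and (4):
# front = pareto_front(poc_values, aw_values)
```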
4. Results and discussion

4.1. Annual maximum daily flow

The two time series of annual maximum daily flow collected at Banff and Calgary stations, shown in Fig. 2, are used to illustrate the proposed methodology. The annual maximum flows observed in 2013 at Banff and Calgary stations are 439 and 1750 m3/s, respectively. The descriptive statistics of the two data sets are presented in Table 1 for two scenarios: one excluding the 2013 flow (Scenario I) and the other including the 2013 flow (Scenario II). The annual maximum daily flows observed in 2013 at both Banff and Calgary stations are identified as outliers based on an interval spanning the mean plus/minus three times the standard deviation. At Calgary station, the statistical characteristics, especially the standard deviation, skewness, and kurtosis, of Scenario II deviate significantly from those calculated for Scenario I, whereas similarly significant changes in the statistical characteristics were not observed at Banff station. The prominent changes in the statistical characteristics between Scenarios I and II at Calgary station are expected to lead to large variations in the quantile estimates as well as their corresponding prediction intervals. Thus, this paper applied the proposed methodology to the two data sets, for Scenario I and Scenario II, respectively, at Banff and Calgary stations, to demonstrate the efficiency of the method.

4.2. Determination of distribution and distribution parameters

Fig. 4 shows the results when all the candidate distributions are fitted to the observations in each scenario at Calgary and Banff stations, respectively. In Stage 1, the empirical plotting position equation, the Cunnane formula (Cunnane, 1978), was used for plotting the frequency curve of the observations. It is evident that most candidate distributions perform equivalently in fitting the observations in both Scenarios I and II, except that GUM deviates largely from the observations in both scenarios. The results also demonstrate the difficulty of fitting the tail region of the frequency curve, in particular at Calgary station. Among the distributions considered in this paper, GEV, GP, and GL are three-parameter distributions, which describe the data distribution using location, scale, and shape parameters, while GUM is a two-parameter distribution using location and scale parameters to describe the data set statistically. The parameters of the distributions estimated using the L-moment method, along with their differences between the two scenarios (in terms of percentage), are presented in Table 2 for Calgary and Banff stations, respectively. As demonstrated in Table 2, the estimated location and scale parameters do not show obvious variations between Scenarios I and II compared to the estimated shape parameter. This implies that the inclusion of the extremely high value in 2013 does not alter the scale and location parameters
significantly, whereas it leads to a large difference in the estimated shape parameter. This result is expected because the inclusion of extremely high flow(s) can largely alter the statistics of the tail region of a frequency curve, which often has sparsely distributed data points and consequently is sensitive to the addition/exclusion of extreme data point(s). Thus, the addition of the extremely high flow in 2013 largely affects the shape parameter, and in turn the estimation of high quantiles. The selection of a particular distribution for further analysis was conducted based on the calculated SE, which is also reported in Table 2. All three-parameter distributions outperform the two-parameter GUM. Among GEV, GL, and GP, it appears that both GEV and GL, in general, produce smaller SE than GP does. At Calgary, the GL distribution yields the minimum SE, whereas the GEV appears to perform best at Banff. Since GEV and GL, in general, appear to perform equivalently, the GEV distribution was selected for further analysis in Stage 2. The scatter plots of the observations and the calculated flows, whose return periods correspond to those of the observations, are displayed in Fig. 5 for Scenarios I and II at Calgary and Banff stations, respectively. It is evident that, at Calgary station, the calculated flows underestimate the observations in the tail region of the frequency curves. By closely observing the results for the two scenarios at Calgary station, the addition of the 2013 data obviously elevated the estimated high flows. The results imply that the fitting of the sparse tail region would introduce large uncertainty into the distribution parameters and subsequently the quantile estimates. This explains the difficulty of fitting the tail of a frequency curve and thus of extrapolating to high quantiles. Besides, the result reinforces the necessity of quantifying a prediction interval that embraces the observations, which provides critical and useful information for water resources planning and management. However, the effect of the addition of the 2013 data at Banff station is not as significant as that at Calgary station. It appears that GEV can fit the observations fairly well in both scenarios at Banff station.
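For illustration, the SE of Eq. (1) that underlies this selection can be sketched as follows for a fitted GEV. The sketch uses Cunnane plotting positions for the empirical probabilities and SciPy's `genextreme` quantile function; note that SciPy's shape-parameter sign convention may differ from the κ values reported in Table 2, and that the paper estimates the parameters by L-moments rather than the MLE stand-in mentioned in the closing comment.

```python
import numpy as np
from scipy.stats import genextreme

def standard_error_gev(obs, c, loc, scale):
    """SE of Eq. (1) for a fitted GEV, using Cunnane plotting positions for the
    empirical non-exceedance probabilities of the sorted observations."""
    x = np.sort(np.asarray(obs, dtype=float))            # ascending order
    n = len(x)
    p = (np.arange(1, n + 1) - 0.4) / (n + 0.2)          # Cunnane plotting position
    fitted = genextreme.ppf(p, c, loc=loc, scale=scale)  # quantiles at the same probabilities
    q = 3                                                # number of GEV parameters
    return np.sqrt(np.sum((x - fitted) ** 2) / (n - q))

# The paper estimates (loc, scale, shape) by L-moments; as an MLE stand-in one
# could use: c, loc, scale = genextreme.fit(obs)
```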
4.3. Multi-objective optimization of prediction interval of quantile estimates

Through experimental runs, the range of perturbation was determined as ±30% for each distribution parameter estimated in Stage 1, as a higher level of perturbation did not lead to a larger variation in the prediction interval. The number of ensemble members for constructing the prediction interval in the ensemble simulation also needs to be decided. The determination of the number of ensemble members is often subjective and dependent on the convergence of the solution. It is, in general, determined based on the quality of the final prediction, which should closely match the observations (Houtekamer and Derome, 1995). Therefore, an ensemble size of 100 was determined through experimental runs in this paper. In addition, a population size of 30 and a stopping criterion of 1000 generations were selected for the GA. Fig. 6 shows the optimal Pareto fronts obtained in Stage 2 for both scenarios at Calgary and Banff stations, respectively. In the figure, each point represents an optimal POC corresponding to an AW or an optimal AW corresponding to a POC.
Table 1. Summary of the descriptive statistics of the data sets collected at Calgary and Banff stations for Scenarios I and II, respectively (Std denotes the standard deviation).

Station   Scenario          Mean (m3/s)   Std (±) (m3/s)   Skewness   Kurtosis
Calgary   I (1912–2012)     347.30        183.15           1.66       6.03
Calgary   II (1912–2013)    350.73        233.75           2.83       13.61
Banff     I (1909–2012)     209.97        56.26            0.60       0.16
Banff     II (1909–2013)    213.85        60.57            0.88       0.94
Fig. 4. Frequency curves derived using various distributions for each scenario at Calgary and Banff stations, respectively.
Table 2. The estimated distribution parameters of all candidate distributions (bold face in the original indicates the large percentage changes in the parameters of the three-parameter distributions). Scenario I: Calgary 1912–2012, Banff 1909–2012; Scenario II: Calgary 1912–2013, Banff 1909–2013.

Calgary station
Distribution   Shape (I / II / %change)   Scale (I / II / %change)      Location (I / II / %change)      SE, m3/s (I / II)
GEV            0.18 / 0.28 / 55.56        99.29 / 98.52 / 0.78          283.46 / 281.22 / 0.79           30.98 / 47.93
GL             0.29 / 0.36 / 24.14        72.48 / 76.29 / 5.26          323.66 / 321.99 / 0.52           28.75 / 43.83
GP             0.10 / 0.07 / 170.00       194.03 / 172.67 / 11.01       186.02 / 190.53 / 2.42           45.37 / 66.16
GUM            – / – / –                  120.35 / 138.33 / 14.94       292.15 / 295.94 / 1.30           100.97 / 94.75

Banff station
Distribution   Shape (I / II / %change)   Scale (I / II / %change)      Location (I / II / %change)      SE, m3/s (I / II)
GEV            0.07 / 0.02 / 71.43        48.98 / 49.45 / 0.96          186.40 / 186.32 / 0.04           5.89 / 5.04
GL             0.13 / 0.16 / 23.08        31.20 / 32.29 / 3.49          204.94 / 205.29 / 0.17           9.98 / 7.69
GP             0.54 / 0.46 / 14.81        125.90 / 120.59 / 4.22        130.11 / 131.17 / 0.81           4.89 / 8.46
GUM            – / – / –                  46.27 / 48.51 / 4.84          184.97 / 185.85 / 0.48           111.61 / 99.82

In general, an increase of the AW leads to an increase of the POC and vice versa. Thus, the two objective functions used in the optimization, namely maximizing the POC and minimizing the AW, conflict with each other. As a result, the optimal solution is a trade-off between these two uncertainty indices. Although there is no general rule for determining an optimal POC when quantifying a prediction interval,
a 95% POC has been commonly used in the literature (e.g., Zhang et al., 2009; Alvisi and Franchini, 2011). By observing the Pareto fronts, it appears that all of them converge at approximately 95% POC and above; that is, a large increase in the AW does not lead to an obvious increase of the POC above 95%. Therefore, a 95% POC was selected herein as the optimal solution of the multi-objective optimization.
Fig. 5. The scatter plots of the observed and estimated flows for Scenarios I and II at Calgary and Banff stations, respectively.
Fig. 6. The Pareto fronts for Scenarios I and II at Calgary and Banff stations, respectively.
This POC was then used to determine its corresponding AW and the optimal ranges of the distribution parameters at Calgary and Banff stations in each scenario, respectively. In addition, each optimal Pareto solution corresponds to a set of distribution parameters (in ranges) that are determined to minimize the prediction intervals of the quantile estimates. Therefore, the ranges of the distribution parameters also vary with the POC or AW.
Fig. 7 illustrates the ranges of the three distribution parameters versus several selected values of the POC (5%, 45%, 75%, 80%, 95%, and 100%) for Scenario II at Calgary station as an example. This figure shows that a higher POC is achieved at the expense of increasing the perturbation applied to all distribution parameters. Note that the same result was also observed in all other cases studied in the paper. The results suggest that a larger perturbation, which implies a
Fig. 7. Ranges of the shape, scale and location parameters corresponding to several selected values of POC.
higher level of uncertainty, should be applied to the distribution parameters to ensure that the prediction intervals contain more observations, namely to achieve a higher POC. This also indicates that the uncertainty in quantile estimates introduced by the variability/non-stationarity of the observations, in a statistical sense, can be captured in terms of the prediction interval. At Calgary station, there is an obvious shift in the Pareto fronts from Scenario I to Scenario II. Given a POC, its corresponding AW is wider in Scenario II than in Scenario I. This result suggests an increased prediction interval, and thus an increased level of uncertainty, when the extremely high 2013 data point is added to the data set. However, at Banff station, there is no significant shift in the Pareto fronts between the two scenarios. Given a POC, the corresponding AWs are similar in Scenarios I and II, which implies that the inclusion of the 2013 data at Banff station does not significantly alter the level of uncertainty. From the statistical analysis/modeling point of view, a large sample data set is always desirable, as such a data set is believed to represent the population well. However, these results suggest that the inclusion of one or more observations does not always increase the reliability (namely, reduce the uncertainty) of FFA. At Calgary station, the addition of the 2013 annual maximum flow resulted in significant changes in the statistical characteristics, especially the standard deviation and the skewness (Table 1), of the data set. Thus, the inclusion of such a data point(s) compromises the reliability and increases the uncertainty of FFA, since the added data point might not belong to the population that the data set represents. This means that the annual
maximum daily flow in 2013 (at Calgary station) might be generated by a mechanism differing from the mechanism governing most flows in the data set. The annual maximum flows generated by different mechanisms might be represented by different distributions (a different distribution family or the same family with different distribution parameters). Under such circumstances, when data point(s) in a data set are suspected to come from different populations, the use of more complex distributions, such as the two-component extreme value distribution (Rossi et al., 1984) and mixtures of distributions (Evin et al., 2011; Shaw and Riha, 2011), might be alternative options. However, in most cases, there are no sufficiently long records to estimate the parameters of complex distributions. On the other hand, extremely high flows and their corresponding frequencies and risks are of great interest in water resources management. As discussed above, the use of either the conventional FFA or a more complex approach would introduce a large degree of uncertainty into the analysis. As a result, the quantification of the prediction interval of quantile estimates in FFA might be a feasible and practical measure.

4.4. Distribution parameter uncertainty

In the proposed methodology, the perturbations applied to the distribution parameters are optimized, and subsequently the distribution parameters sampled from their optimal ranges are used to quantify the prediction interval of the quantile estimates. Fig. 8 illustrates the box plots of the optimal ranges of the three GEV
Fig. 8. The optimal ranges of the GEV distribution parameters for Scenarios I and II at Calgary and Banff stations.
Table 3. The coefficients of variation of the three distribution parameters for Scenarios I and II at Calgary and Banff stations.

Station   Scenario   Shape   Scale   Location
Calgary   I          0.11    0.03    0.02
Calgary   II         0.11    0.05    0.04
Banff     I          0.15    0.02    0.02
Banff     II         0.17    0.03    0.03
Fig. 9. Prediction interval of quantile estimates constructed using the proposed method along with the observations and mean quantile estimates by the random sampling approach for Scenarios I and II at Calgary and Banff stations, respectively.
distribution parameters for Scenarios I and II at Calgary and Banff stations. As shown in these plots, both the medians and the ranges of the location and scale parameters are approximately equivalent in both scenarios at the two stations, whereas different results were observed for the shape parameter. At Calgary station, not only was a large shift in the median of the shape parameter found between Scenario I and Scenario II, but an increase in its range was also detected. At Banff station, although a notable shift in the median of the shape parameter was seen from Scenario I to Scenario II, a decrease in its range was found. In addition, the magnitude of the perturbation range of the distribution parameters was assessed using the coefficient of variation, and the calculated coefficients of variation are presented in Table 3. The coefficients of variation are consistently higher for the shape parameter than for the location and scale parameters. This suggests that the estimated shape parameter contains more uncertainty than the location and scale parameters do.
This result is also consistent with the fact that the shape parameter, which is estimated using higher moments, is difficult to estimate with precision (Northrop, 2004; Coles, 2001). These results suggest that the addition of data points in FFA can lead to changes in the distribution parameters and introduce (i.e., increase or reduce) uncertainty in the estimated distribution parameters, especially in the shape parameter. The effect of the added data point(s) depends on the degree to which their addition affects the statistical characteristics of the data set, as discussed previously. Therefore, in order to enhance the reliability of FFA, the quantification of the uncertainty in the quantile estimates is indispensable. It should be noted that, unlike the existing approaches, the proposed methodology uses stochastic perturbation to assess the ranges of the distribution parameters and then of the quantile estimates, which reflect their uncertainty level. Consequently, one can expect that the prediction interval of the quantiles assessed through the ensemble simulation can
cover the observations well while constraining the uncertainty level of the analysis through minimizing the AW, thereby improving the reliability of the analysis.

4.5. Prediction interval of quantile estimates

Fig. 9 shows the prediction intervals of quantile estimates assessed from 100 ensemble simulations, each of which calculated the quantiles using the distribution parameters sampled within their optimal ranges using LHS, for each scenario at Calgary and Banff stations, respectively. In order to validate the reliability of the proposed approach, the prediction intervals were compared to those yielded by a random sampling approach. In the random sampling approach, 50% of the data points were randomly sampled from the data set 100 times, and the prediction intervals and mean quantile estimates were then quantified from the results produced by the conventional FFA. The mean quantile estimates produced by the random sampling approach are also shown in Fig. 9. As illustrated in Fig. 9, most observations fall within the prediction intervals constructed by the proposed approach, except for a few data points at both stations. The mean quantile estimates of the random sampling approach also fall within the prediction intervals generated by the proposed approach. In addition, the prediction intervals of quantile estimates produced by the proposed approach are narrower than those of the random sampling approach for Scenarios I and II at both Calgary and Banff stations (results not shown). All these results indicate that the proposed approach is effective in reducing the uncertainty level of quantile estimates and thus is a reliable approach.
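The random sampling baseline described above can be sketched as follows. The paper does not state how the interval was extracted from the 100 repetitions, so the 2.5th/97.5th percentile bounds and the MLE-based GEV fit below are assumptions made purely for illustration.

```python
import numpy as np
from scipy.stats import genextreme

def random_sampling_baseline(obs, return_periods, n_rep=100, frac=0.5, seed=1):
    """Repeatedly fit the distribution to a random 50% subsample and summarize
    the spread of the resulting quantile estimates."""
    rng = np.random.default_rng(seed)
    obs = np.asarray(obs, dtype=float)
    k = int(frac * len(obs))
    p = 1.0 - 1.0 / np.asarray(return_periods, dtype=float)   # non-exceedance probabilities
    quantiles = np.empty((n_rep, len(return_periods)))
    for r in range(n_rep):
        sample = rng.choice(obs, size=k, replace=False)        # random 50% of the record
        c, loc, scale = genextreme.fit(sample)                 # MLE stand-in for L-moments
        quantiles[r] = genextreme.ppf(p, c, loc=loc, scale=scale)
    mean_q = quantiles.mean(axis=0)
    lower, upper = np.percentile(quantiles, [2.5, 97.5], axis=0)  # assumed interval rule
    return mean_q, lower, upper
```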
The observations falling outside the optimal prediction intervals might be ascribed to measurement errors or rare events, in which the flow might be primarily governed by mechanisms different from those of the majority of events. Note that such observations can be embraced by the prediction intervals by increasing their widths, which would, however, reduce the reliability/precision. From the application point of view, accurate and precise modeling results are always desired, and a wide prediction interval would hinder the application of the results in water resources management. The box plots of the 100-member ensembles of quantile estimates for three high return periods, 50-year, 100-year, and 200-year, are shown in Fig. 10 as examples. The calculated standard deviations of the 100-member ensembles of quantile estimates are also displayed in Fig. 10. As demonstrated in both Figs. 9 and 10, there are no large differences in the magnitude of the prediction intervals between Scenario I and Scenario II at Banff station, as the variations in the estimated quantiles are approximately equivalent in these two scenarios for each return period. However, small shifts in the medians of the quantile estimates from Scenario I to Scenario II were observed. Different from Banff station, both a shift in the medians and an increase in the standard deviations of the quantile estimates for each return period were found between the two scenarios at Calgary station (Fig. 10). These results suggest that the addition of the 2013 flow at Calgary station introduces more uncertainty into the quantile estimates than at Banff station. This can be explained by the findings for the estimated and optimized shape parameter (Fig. 8), which largely controls the tail region of the frequency curve. At Calgary station, the AWs of the estimated quantiles corresponding to a 95% POC are 40.23 m3/s and 63.32 m3/s for Scenarios I and II, respectively; at Banff station, the AWs are 17.77 m3/s and 18.01 m3/s for Scenarios I and II, respectively.
Fig. 10. Prediction intervals of quantile estimates of 50-yr, 100-yr, and 200-yr return periods for Scenarios I and II at Calgary and Banff stations, respectively. The values next to the boxes are the standard deviations of the quantile estimates from the 100-member ensemble.
These quantitative results further confirm the consequence of including the 2013 flow data: an increase in the uncertainty level of quantile estimates at Calgary station and no significant change in the uncertainty level of quantile estimates at Banff station.

5. Conclusions

In many cases, the violation of the assumptions behind the conventional FFA, such as the stationarity of data sets, appears inevitable, especially in long-term data sets. However, a long-term data set is often required to produce high quantile estimates. This dilemma challenges the application of the conventional FFA, which does not assess the uncertainty of quantile estimates. Therefore, this paper proposed a new methodology, which combines the ensemble simulation and the multi-objective optimization, aiming to optimize the prediction interval of quantile estimates in FFA. The optimization was formulated using two objective functions, minimizing the AW and maximizing the POC of quantile estimates, which mutually constrain each other. The proposed method was illustrated using long-term (over 100 years) annual maximum daily flows collected at Calgary and Banff stations on the Bow River, Alberta, Canada. The results obtained in the paper demonstrate that the proposed methodology can: (1) qualitatively capture the uncertainty of quantile estimates (i.e., without assigning any probability), as the majority of observations are located within the constructed prediction interval; and (2) capture the change in the prediction interval, which reflects the change in uncertainty level, due to the addition of data point(s) that appear to be statistically different from the other data points. The paper focused on the uncertainty originating from the distribution parameters rather than considering other possible uncertainty sources, such as the selection of the distribution. It is also worth noting that the prediction interval quantified using the proposed approach does not represent a true level of uncertainty due to the absence of a quantitative linkage between the prediction interval and probability. The prediction interval of quantile estimates might be linked to a probability by establishing the association between the uncertainty of the perturbation applied to the distribution parameters and the prediction interval. Thus, the extension of the proposed methodology to include other sources of uncertainty, to quantitatively link the prediction interval of quantile estimates with probability, and subsequently to compare this approach with classical and/or Bayesian statistical approaches is recommended for future study.

Acknowledgement

The authors would like to thank the University of Calgary (Eyes High Program) for the financial support of this study. The authors would also like to thank the anonymous reviewers and associate editor for their insightful comments.

References

Alvisi, S., Franchini, M., 2011. Fuzzy neural networks for water level and discharge forecasting with uncertainty. Environ. Model. Softw. 26, 523–537. http://dx.doi.org/10.1016/j.envsoft.2010.10.016.
Arnaud, P., Cantet, P., Aubert, Y., 2014. Relevance of an at-site flood frequency analysis method for extreme events based on stochastic simulation of hourly rainfall. Hydrol. Sci. J. http://dx.doi.org/10.1080/02626667.2014.965174.
Aziz, K., Rahman, A., Fang, G., Shrestha, S., 2014. Application of artificial neural networks in regional flood frequency analysis: a case study for Australia. Stoch. Environ. Res. Risk Assess. 28, 541–554. http://dx.doi.org/10.1007/s00477-013-0771-5.
Basu, B., Srinivas, V.V., 2015. Analytical approach to quantile estimation in regional frequency analysis based on fuzzy framework. J. Hydrol. 524, 30–43. http://dx.doi.org/10.1016/j.jhydrol.2015.02.026.
Chowdhury, J.U., Stedinger, J.R., 1991. Confidence interval for design flood with estimated skew coefficient. J. Hydraul. Eng. 117 (7), 811–831.
Coles, S., 2001. An Introduction to Statistical Modeling of Extreme Values. Springer-Verlag, London.
Cunnane, C., 1978. Unbiased plotting position – a review. J. Hydrol. 37, 205–222. http://dx.doi.org/10.1016/0022-1694(78)90017-3.
Evin, G., Merleau, J., Perreault, L., 2011. Two-component mixture of normal, Gamma, and Gumbel distribution for hydrological applications. Water Resour. Res. 47, W08525. http://dx.doi.org/10.1029/2010WR010266.
Gado, T.A., Nguyen, V.-T.-V., 2016. An at-site flood estimation method in the context of nonstationarity I. A simulation study. J. Hydrol. 535, 710–721. http://dx.doi.org/10.1016/j.jhydrol.2015.12.063.
Gilroy, K.L., McCuen, R.H., 2012. A non-stationary flood frequency analysis method to adjust for future climate change and urbanization. J. Hydrol. 414, 40–48.
Goldberg, D., 1989. Genetic Algorithms in Search, Optimization and Machine Learning. Addison-Wesley, Reading, MA.
Halbert, H., Nguyen, C.C., Payrastre, O., Gaume, E., 2016. Reducing uncertainty in flood frequency analyses: a comparison of local and regional approaches involving information on extreme historical floods. J. Hydrol. http://dx.doi.org/10.1016/j.jhydrol.2016.01.017.
Hirabayashi, Y., Mahendran, R., Koirala, S., Konoshima, L., Yamazaki, D., Watanabe, S., Kim, H., Kanae, S., 2013. Global flood risk under climate change. Nat. Publ. Gr. 3, 816–821. http://dx.doi.org/10.1038/nclimate1911.
Hosking, J.R.M., 1990. L-moments: analysis and estimation of distributions using linear combinations of order statistics. J. R. Stat. Soc. Ser. B 52, 105–124.
Houtekamer, P.L., Derome, J., 1995. Methods for ensemble prediction. Mon. Wea. Rev. 123, 2181–2196.
Hu, Y.-M., Liang, Z.-M., Liu, Y.-W., Zeng, X.-F., Wang, D., 2015. Uncertainty assessment of estimation of hydrological design values. Stoch. Environ. Res. Risk Assess. 29, 501–511. http://dx.doi.org/10.1007/s00477-014-0979-z.
Jain, A., Srinivasulu, S., Bhattacharjya, R.K., 2005. Determination of an optimal unit pulse response function using real-coded genetic algorithm. J. Hydrol. 303, 199–241.
Kasiviswanathan, K.S., Cibin, R., Sudheer, K.P., Chaubey, I., 2013. Constructing prediction interval for artificial neural network rainfall runoff models based on ensemble simulations. J. Hydrol. 499, 275–288. http://dx.doi.org/10.1016/j.jhydrol.2013.06.043.
Khaliq, M.N., Ouarda, T.B.M.J., Ondo, J.C., Gachon, P., Bobée, B., 2006. Frequency analysis of a sequence of dependent and/or non-stationary hydrometeorological observations: a review. J. Hydrol. 329 (3), 534–552.
Lee, K.S., Kim, S.U., 2008. Identification of uncertainty in low flow frequency analysis using Bayesian MCMC method. Hydrol. Process. 22 (12), 1949–1964. http://dx.doi.org/10.1002/hyp.6778.
Northrop, P.J., 2004. Likelihood-based approaches to flood frequency estimation. J. Hydrol. 292 (1–4), 96–113. http://dx.doi.org/10.1016/j.jhydrol.2003.12.031.
Obeysekera, J., Salas, J., 2014. Quantifying the uncertainty of design floods under nonstationary conditions. J. Hydrol. Eng. 19 (7), 1438–1446.
Prakash, O., Sudheer, K.P., Srinivasan, K., 2014. Improved higher lead time river flow forecasts using sequential neural network with error updating. J. Hydrol. Hydromech. 62, 60–74. http://dx.doi.org/10.2478/johh-2014-0010.
Reis, D.S., Stedinger, J.R., 2005. Bayesian MCMC flood frequency analysis with historical information. J. Hydrol. 313, 97–116. http://dx.doi.org/10.1016/j.jhydrol.2005.02.028.
Rossi, F., Fiorentino, M., Versace, P., 1984. Two-component extreme value distribution for flood frequency analysis. Water Resour. Res. 20, 847–856.
Salas, J.D., Heo, J.H., Lee, D.J., Burlando, P., 2013. Quantifying the uncertainty of return period and risk in hydrologic design. J. Hydrol. Eng. 18 (5), 518–526. http://dx.doi.org/10.1061/(ASCE)HE.1943-5584.0000613.
Shaw, S.B., Riha, S.J., 2011. Assessing possible changes in flood frequency due to climate change in mid-sized watersheds in New York State, USA. Hydrol. Process. 25, 2542–2550.
Shu, C., Burn, D.H., 2004. Artificial neural network ensembles and their application in pooled flood frequency analysis. Water Resour. Res. 40 (1–10). http://dx.doi.org/10.1029/2003WR002816.
Shu, C., Ouarda, T.B.M.J., 2008. Regional flood frequency analysis at ungauged sites using the adaptive neuro-fuzzy inference system. J. Hydrol. 349, 31–43. http://dx.doi.org/10.1016/j.jhydrol.2007.10.050.
Strupczewski, W.G., Singh, V.P., Weglarczyk, S., 2002. Asymptotic bias of estimation methods caused by the assumption of false probability distribution. J. Hydrol. 258, 122–148.
Tang, W.H., 1980. Bayesian frequency analysis. J. Hydraul. Div. 106 (7), 1203–1218.
Tung, Y.K., Wong, C.-L., 2014. Assessment of design rainfall uncertainty for hydrologic engineering applications in Hong Kong. Stoch. Environ. Res. Risk Assess. 28, 583–589.
Wood, E.F., Rodríguez-Iturbe, I., 1975. A Bayesian approach to analyzing uncertainty among flood frequency models. Water Resour. Res. 11, 839–843. http://dx.doi.org/10.1029/WR011i006p00839.
Zhang, X., Liang, F., Srinivasan, R., Van Liew, M., 2009. Estimating uncertainty of streamflow simulation using Bayesian neural networks. Water Resour. Res. 45, 1–16. http://dx.doi.org/10.1029/2008WR007030.