Evaluation and calibration of operational hydrological ensemble forecasts in Sweden



Journal of Hydrology (2008) 350, 14–24

available at www.sciencedirect.com

journal homepage: www.elsevier.com/locate/jhydrol

Jonas Olsson *, Göran Lindström

Hydrological Research Unit, Swedish Meteorological and Hydrological Institute, 601 76 Norrköping, Sweden

Received 4 July 2007; received in revised form 2 November 2007; accepted 9 November 2007

KEYWORDS Ensemble; Probability; Forecasts; Flood warning; HBV model; Sweden

Summary

Daily operational hydrological 9-day ensemble forecasts during 18 months in 45 catchments were evaluated in probabilistic terms. The forecasts were generated by using ECMWF meteorological ensemble forecasts as input to the HBV model, set up and calibrated for each catchment. Two kinds of reference discharges were used in the evaluation, "perfect forecasts" and actual discharge observations. A percentile-based evaluation indicated that the ensemble spread is underestimated, with a degree that decreases with increasing lead time. The share of this error related to hydrological model uncertainty was found to be similar in magnitude to the share related to underdispersivity in the ECMWF meteorological forecasts. A threshold-based evaluation indicated that the probability of exceeding a high discharge threshold is generally overestimated in the ensemble forecasts, with a degree that increases with probability level. In this case the contribution to the error from the meteorological forecasts is larger than the contribution from the hydrological model. A simple calibration method to adjust the ensemble spread by bias correction of ensemble percentiles was formulated and tested in five catchments. The method substantially improved the ensemble spread in all tested catchments, and the adjustment parameters were found to be reasonably well estimated as simple functions of the mean catchment discharge. © 2007 Elsevier B.V. All rights reserved.

* Corresponding author. Tel.: +46 (0) 11 4958322; fax: +46 (0) 11 4958250. E-mail addresses: [email protected] (J. Olsson), Goran. [email protected] (G. Lindström).

0022-1694/$ - see front matter © 2007 Elsevier B.V. All rights reserved. doi:10.1016/j.jhydrol.2007.11.010

Introduction

Within meteorology, medium-range (3–10 days) ensemble forecasting is an established way to acknowledge the uncertainty in the initial atmospheric conditions and to generate probabilistic forecasts. Operational ensemble forecasting is performed at several meteorological institutes and services, e.g. the European Centre for Medium-range Weather Forecasts, ECMWF (e.g. Molteni et al., 1996; Buizza et al., 2005). The forecasts are obtained by perturbing the initial state to produce a number of possible realisations, from which an atmospheric model is run to generate an ensemble of members (e.g. Persson, 2001). The main gain from ensemble forecasts is a measure of the forecast uncertainty. This measure may be of a qualitative nature, essentially with a small spread among the ensemble members indicating a high certainty, and vice versa. This is known as spread-skill relationships, which have been widely evaluated for meteorological ensemble forecasts (e.g. Scherrer et al., 2004). Further, the ensemble may be used to produce quantitative probabilistic forecasts. A precise estimation of the probabilities, however, requires that the forecasts are well calibrated, i.e. that they accurately describe the variability of the predictand. This is not always the case. For instance, the ECMWF forecasts tend to be somewhat underdispersive, i.e. the variability is underestimated (e.g. Buizza, 1997; Persson, 2001; Buizza et al., 2005).

An attractive prospect is to use the meteorological ensemble forecasts as input to a hydrological model, thereby producing medium-range hydrological (discharge) ensemble forecasts. This was attempted shortly after the meteorological ensemble forecasts became available, and to date a fair amount of scientific evaluation of hydrological ensemble forecasts has been performed. De Roo et al. (2003) applied ensemble forecasts as a part of their European flood forecasting system (EFFS), which includes the LISFLOOD rainfall–runoff model, and showed some case studies. Roulin and Vannitsem (2005) evaluated hydrological ensemble forecasts in two Belgian catchments and found a considerably higher accuracy than with forecasts based on climatological precipitation. The EFFS system was evaluated for two cases of flooding (rivers Meuse and Odra) by Gouweleeuw et al. (2005), who concluded that proper evaluation requires data from a long time period where different discharge levels are represented. Werner et al. (2005) evaluated the EFFS system for a case of flooding in river Rhine 1995 and proposed a new type of presentation that combines today's ensemble forecast with previous forecasts. Roulin (2006) investigated how economic considerations can be combined with ensemble forecasts to improve decision-making in flooding situations.

Feeding a hydrological model with meteorological ensemble forecasts is a way to handle the input uncertainty. There is, however, another major source of uncertainty involved, that of the hydrological model, and an important issue concerns how to accommodate this component as well. Pappenberger et al. (2005) tackled this issue by feeding the meteorological ensemble through an ensemble of LISFLOOD models with different but equally plausible parameter sets, obtained using the GLUE concept (Beven and Binley, 1992). The methodology was demonstrated for the Meuse case mentioned above. A method for statistical post-processing of hydrological ensemble forecasts that accounts for the model uncertainty was developed and tested by Seo et al. (2006). In this method, each original ensemble member is perturbed stochastically with an amount that reflects the model uncertainty. Verification of the method in five subcatchments of the river Juanita showed that the post-processed ensemble was accurate and unbiased both in the mean and in the probabilistic sense.

Virtually all published evaluations of hydrological ensemble forecasts have been limited to flooding case studies and/or single catchments. This is partly due to the fact that, despite the recent scientific advancements, there are still only a few institutes where hydrological ensemble forecasts are issued on an operational basis. At the Swedish Meteorological and Hydrological Institute (SMHI), medium-range hydrological ensemble forecasting has been operational in around 50 catchments since July 2004. The forecasts are stored for subsequent analysis and a deterministic evaluation was recently performed by Johnell et al. (2007). The main aim of this study is to perform a probabilistic evaluation of the first 18 months of daily forecasts from this system. This is to our knowledge the largest data set for which medium-range hydrological ensemble forecasts have been evaluated to date. A further objective is to develop and test a methodology for adjusting the derived exceedance probabilities.

Operational system and analysed data

The hydrological forecasting system at the SMHI is based on the HBV-96 model (e.g. Bergström, 1976; Lindström et al., 1997), which is a widely used catchment-scale conceptual hydrological model. Water volumes in different compartments are determined according to the general water balance equation

$$P - E - Q = \frac{d}{dt}\left[S_{SP} + S_{SM} + S_{GW} + S_{SW}\right] \qquad (1)$$

where P denotes precipitation, E evapotranspiration, Q discharge, and S storage in various compartments: snow pack (SP), soil moisture (SM), groundwater (GW) and surface water (SW). The main input data are observations of precipitation and temperature (T), which are normally interpolated to catchment values. The model contains subroutines for estimation of snow melt and accumulation, evapotranspiration, soil moisture and generated runoff. A simple routing scheme connects runoff from different sub-catchments. The model is semi-distributed and a catchment may be divided into altitude and vegetation zones. The model has a number of free parameters that need to be calibrated, which may be done by an automatic procedure (Lindström, 1997).

Daily operational forecasting is performed for a number (currently around 50) of so-called indicator catchments that are rather evenly distributed throughout Sweden. The catchments have been selected to represent different geographical regions as well as catchment sizes (from 8 to 6110 km²; mean size 647 km²), although the main focus is on small and medium-sized catchments which respond rapidly to changes in the meteorological forcing. Forecasts are updated autoregressively, i.e. the model error for a certain day in the forecast is estimated as a function of the model error at the start of the forecast (e.g. Lundberg, 1982).

In 2004, the hydrological forecasting system was complemented with a routine for ensemble forecasting. In this routine, meteorological ensemble forecasts from the European Centre for Medium-range Weather Forecasts (ECMWF) (Persson, 2001) are used to provide the P and T input for the indicator catchments. In this case no interpolation is made, but the P and T values from the nearest cell in the ECMWF T255 grid are used directly. This grid has a spatial resolution of 0.7°, which corresponds to about 60 km, thus an indicator catchment is generally covered by only one grid cell.
ECMWF provides 10-day forecasts, but owing to time differences in the forecasting systems at ECMWF and SMHI, respectively, the first 12 h in the ECMWF forecasts cannot be used in the hydrological forecasts, and therefore the hydrological forecasts are 9 days long. The ECMWF forecasts comprise 50 ensemble members and one control forecast, which are used to generate 51 9-day hydrological forecasts in each indicator catchment. In the ensemble routine, these forecasts are processed to generate five statistical percentiles for each forecast day: minimum (2% probability of non-exceedance), lower quartile (25%), median (50%), upper quartile (75%) and maximum (98%). These percentiles were used in the probabilistic evaluation below.

As the hydrological ensemble forecasting system was made operational in July 2004 and the present evaluation began in January 2006, 18 months of data were available. Out of all indicator catchments, a few were omitted in the present evaluation. In some of these the number of observations was insufficient for a reliable evaluation. In others the HBV model error was unusually large during the evaluation period, possibly because of insufficient model calibration (or errors in the observations), which may have a substantial impact on the total result. The final number of catchments evaluated is 45 and for these on average 9% of the data are missing.

Evaluation of the hydrological ensemble forecasts requires a reference discharge with which to compare. One obvious candidate is the observed discharge (OBS). The difference between HBV forecast and observation, however, contains error components from both the meteorological forecast and the hydrological model. As this evaluation focuses on the accuracy of the spread in the hydrological forecasts, generated only by the spread in the meteorological ensemble forecasts, the model error component needs to be eliminated. This was achieved by using as reference discharge "perfect HBV forecasts" (HBVpf).
These are reconstructed forecasts (hindcasts) generated from the actually observed meteorological inputs, which are assumed to be without errors. This way the forecasts and the reference will have the same model error and the comparison thus reflects only the properties of the meteorological forecast. Thus HBVpf is the main reference discharge and for this purpose HBVpf forecasts were reconstructed for all catchments during the entire period. However, reference discharge OBS was also used, to evaluate the total, operational accuracy in the ensemble forecasts. This further made it possible to estimate the relative contributions to the total error from the meteorological forecast and the hydrological model, respectively.
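The reduction of the 51 hydrological forecasts to five percentiles per forecast day can be sketched as follows. This is a minimal illustration, not the operational SMHI routine: the function name `ensemble_percentiles`, the use of the ensemble extremes for "minimum" and "maximum", NumPy's default quantile interpolation for the quartiles, and the synthetic discharge values are all assumptions made for the example.

```python
import numpy as np

def ensemble_percentiles(members):
    """Reduce an ensemble to five percentile series.

    members: array of shape (n_members, n_days) with forecast discharge.
    Returns a dict of arrays of shape (n_days,).
    Assumption: "minimum" and "maximum" are the ensemble extremes.
    """
    members = np.asarray(members, dtype=float)
    return {
        "min": members.min(axis=0),
        "q25": np.percentile(members, 25, axis=0),
        "median": np.percentile(members, 50, axis=0),
        "q75": np.percentile(members, 75, axis=0),
        "max": members.max(axis=0),
    }

# Synthetic, purely illustrative ensemble: 51 members, 9 forecast days.
rng = np.random.default_rng(0)
members = 20.0 + rng.gamma(shape=2.0, scale=1.5, size=(51, 9)).cumsum(axis=1)
pct = ensemble_percentiles(members)
```

With 51 members, the smallest and largest members correspond to roughly 2% and 98% non-exceedance, which is consistent with the probabilities quoted in the text.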

Probabilistic evaluation methods

Two types of probabilistic evaluation methods have been employed, one percentile-based and one threshold-based. In the percentile-based evaluation, the frequency of reference discharge falling below the ensemble percentiles is calculated. If the ensemble spread is accurate, these frequencies should agree with the percentiles' probabilities of non-exceedance. Thus, during 2% of the evaluation period (which here corresponds to 11 days) the reference discharge should lie above (below) the ensemble maximum (minimum). During 23% of the period it should fall between the upper (lower) quartile and the maximum (minimum), and during 25% of the period between the median and each of the quartiles. This theoretical frequency distribution is shown graphically in Fig. 1a, which corresponds to the Talagrand diagram that is widely used in the verification of meteorological ensemble forecasts (e.g. Persson, 2001).

The percentile-based evaluation thus focuses only on the discharge levels given by the ensemble percentiles, and how they relate to the actually occurring discharge. An alternative approach is to focus on a certain event, such as the exceedance of some critical discharge threshold. In the threshold-based evaluation, the ensemble percentiles as well as the actual observation are compared with a discharge threshold. Similarly to the percentile-based evaluation, the estimated risk of threshold exceedance is compared with the corresponding frequency of reference discharge exceeding the threshold. For example, if in a certain forecast the threshold is located between the ensemble maximum (2% probability of exceedance) and the upper quartile (25%), the probability of exceedance may be approximated as (2 + 25)/2 = 13.5%. Thus, out of all such forecasts, in 13.5% of the cases the reference discharge should exceed the threshold if the ensemble probabilities are accurate. The result of this type of analysis may be plotted in a reliability diagram (e.g. Wilks, 1995). In this diagram, forecasted probability is plotted on the x-axis and observed frequency on the y-axis, with the line y = x indicating perfect forecasts (Fig. 1b).

Figure 1 Theoretical frequencies of reference discharge falling between different ensemble percentiles (a) and reliability diagram (b). [Panel (a): observed frequency (%) for the intervals <2%, 2–25%, 25–50%, 50–75%, 75–98% and >98%, theoretically correct spread. Panel (b): observed frequency (%) versus forecasted probability (%), correct forecast.]

The threshold-based evaluation has been performed with operational flood warning in mind. Thus, in the evaluation only cases when the discharge was below the threshold at the time of the forecast have been considered, since continued exceedance during the forecast is of limited interest. At SMHI, a flood warning is issued when e.g. the 2-year flow or the 10-year flow is expected to be exceeded. However, as this evaluation comprises only 18 months of data these thresholds are not suitable. Instead two lower threshold levels were selected, termed Q70 and Q90, the former being the 70th percentile (i.e. exceeded during 30% of the evaluation period) and the latter the 90th percentile (10%).
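The percentile-based evaluation described in this section amounts to counting how often the reference discharge falls in each of the six intervals defined by the five ensemble percentiles, and comparing the result with the theoretical 2/23/25/25/23/2% distribution. The sketch below is one possible way to do this; the function name `interval_frequencies`, the array layout and the toy numbers are assumptions for illustration, not part of the study.

```python
import numpy as np

# Theoretical frequencies (%) for the six intervals of Fig. 1a.
LABELS = ["<2%", "2-25%", "25-50%", "50-75%", "75-98%", ">98%"]
THEORETICAL = np.array([2.0, 23.0, 25.0, 25.0, 23.0, 2.0])

def interval_frequencies(reference, percentiles):
    """Observed frequency (%) of the reference in each percentile interval.

    reference:   shape (n,), reference discharge (HBVpf or OBS); NaN = missing.
    percentiles: shape (n, 5), columns min, q25, median, q75, max (ascending).
    """
    reference = np.asarray(reference, dtype=float)
    percentiles = np.asarray(percentiles, dtype=float)
    valid = ~np.isnan(reference)
    # Number of percentiles strictly below the reference gives the interval
    # index: 0 means below the minimum, 5 means above the maximum.
    idx = (percentiles[valid] < reference[valid, None]).sum(axis=1)
    counts = np.bincount(idx, minlength=6)
    return 100.0 * counts / counts.sum()

# Toy example: identical percentiles every day, one reference per interval.
pct = np.tile([1.0, 2.0, 3.0, 4.0, 5.0], (6, 1))
ref = np.array([0.5, 1.5, 2.5, 3.5, 4.5, 5.5])
freqs = interval_frequencies(ref, pct)
deviation = freqs - THEORETICAL  # positive: interval hit too often
```

A positive deviation in the two outer intervals is the signature of an ensemble spread that is too narrow.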

Results and discussion

Percentile-based evaluation

Fig. 2 shows two examples of results from the percentile-based evaluation using reference discharge HBVpf. Fig. 2a shows the result for catchment Hammarby, forecast day 5, which represents catchments with a comparatively accurate ensemble spread. In this case the observed frequencies agree relatively well with the theoretical distribution. One notable deviation is that the frequencies of HBVpf forecasts falling between the median and one of the quartiles (25–50% and 50–75% in Fig. 2) are 5–10% too low. This implies that the interquartile range (IQR) is generally too small and the reference discharge falls outside it too often. Another clear difference is that the frequency of HBVpf forecasts falling below the ensemble minimum is 10% too high, i.e. the ensemble minimum is systematically overestimated. A closer inspection of the data showed that this overestimation (1) is very small, i.e. the HBVpf forecast is very close to but slightly below the ensemble minimum, and (2) mainly occurs during periods of recession or low flow, when it is likely that no precipitation occurred in reality. Because of the low resolution of the ECMWF model grid, however, ensemble forecast precipitation has an overestimated frequency of days with non-zero precipitation, as compared with observed catchment average precipitation (Johnell et al., 2007). Thus it may be suspected that small amounts of erroneously forecasted precipitation during these periods are one source of this deviation.

Fig. 2b shows the result for catchment Ersbo, forecast day 5, which represents catchments with a comparatively inaccurate ensemble spread. In this case the HBVpf forecast far too often falls outside the entire ensemble spread, during nearly 50% of the time compared with the theoretical 4%, and the spread is thus far too narrow. Compared with Hammarby (Fig. 2a) the HBVpf forecast falls below the ensemble minimum nearly twice as often, and further lies above the ensemble maximum more than 20% of the time.

Figure 2 Frequencies of reference discharge (HBVpf) between ensemble percentiles in catchments Hammarby (a) and Ersbo (b) for lead time 5 days. [Both panels: frequency (%) for the intervals <2%, 2–25%, 25–50%, 50–75%, 75–98% and >98%, compared with the theoretical distribution.]

The variation with forecast lead time is exemplified in Fig. 3a, showing the results for Hammarby on days 1, 5 and 9 in the forecast. It is obvious that the agreement of the ensemble spread with the theoretical spread increases with increasing lead time. The HBVpf forecast falls outside the ensemble spread in 40% of the forecasts on day 1, while this only happens in 10% of the forecasts on day 9. The HBVpf forecast falls within the IQR in 20% of the forecasts on day 1 and in 40% on day 9. This indicates that the spread in single ECMWF precipitation forecasts (for day 1) is not sufficient for generating a proper spread in the discharge forecasts. However, as spread from a number of consecutive precipitation forecasts gradually accumulates in the discharge ensemble forecasts, the spread of the latter approaches the correct level. Fig. 3b shows the result averaged over all forecast lead times (day 1–9). The picture is very similar to Fig. 2a, i.e. the middle day in the forecast (day 5) represents the average behaviour well.

Figure 3 Frequencies of reference discharge (HBVpf) between ensemble percentiles in catchment Hammarby for lead times 1, 5 and 9 days (a) and as average over all lead times (b). [Both panels: frequency (%) per percentile interval, compared with the theoretical distribution.]

Figure 4 Frequencies of reference discharge (HBVpf) between ensemble percentiles averaged over all catchments, for lead times 1, 5 and 9 days. [Three panels (day 1, day 5, day 9): frequency (%) per percentile interval.]

The variation with lead time averaged over all catchments is illustrated in Fig. 4. The picture is similar to that of Hammarby shown in Fig. 3a, i.e. the ensemble spread is greatly underestimated on the first day of the forecast and reasonably accurate on the last day. On average, the

frequency of HBVpf forecasts falling within the ensemble spread increases from 33% of the time on day 1 to 75% on day 9. The overestimated frequency of HBVpf forecasts below the ensemble minimum is clear for all lead times.

When using the perfect forecast HBVpf as reference discharge, the result is not affected by model uncertainty. If the meteorological observations used to generate HBVpf are assumed to be accurate, the only source of inaccuracies in the spread of the discharge ensemble forecast is thus inaccuracies in the input precipitation and temperature ensemble forecasts. Even if temperature has a strong impact during the snowmelt period, in general the precipitation is the main forcing with respect to runoff generation. Thus it may be assumed that underestimated spread in the ECMWF precipitation forecasts is the main source of the underestimated discharge spread. One reason for this is clearly the mismatch in spatial scale between the ECMWF grid (3600 km²) and the catchments (650 km²). Another reason is likely an underestimated spread in the ECMWF ensemble forecasts, which is typically found in verifications (e.g. Buizza et al., 2005).

As no downscaling to catchment scale is performed in this study, it may be expected that the spread in the ECMWF precipitation better captures the observed variability in large catchments than in small ones. This would be reflected also in the discharge forecasts, i.e. the reference discharge would be more often located

within the ensemble spread in large catchments than in small ones. Evaluation for the data used here, however, failed to reveal any clear trend with catchment size, probably because the catchments are too small (all but one are around or below half the size of the ECMWF grid). As mentioned in Section ‘‘Operational system and analysed data’’, evaluation was also performed using the actual discharge observations (OBS) as reference discharge. Fig. 5 shows the result of the percentile-based evaluation for both reference discharges, averaged over all catchments and lead times. The frequency of HBVpf forecasts falling within the ensemble spread is 63% whereas for OBS the frequency is 38%. The frequency of reference discharge within the IQR is 26% for HBVpf and 14% for OBS. The accuracy of the forecasted ensemble spread is thus approximately halved for reference discharge OBS, which indicates a similar contribution to the total error from errors in the meteorological forecast and the hydrological model, respectively.

Figure 5 Frequencies of reference discharge (HBVpf and OBS) between ensemble percentiles averaged over all catchments and lead times. [Frequency (%) per percentile interval for HBVpf and OBS, compared with the theoretical distribution.]

Threshold-based evaluation

In the threshold-based evaluation, for each catchment the results from all lead times (days 1–9) have been aggregated. This was done because of the small number of occasions when the ensemble forecasts indicate exceedance of a discharge threshold level, from a starting level below the threshold at the time of the forecast. Usually either the entire ensemble spread is below the threshold or the discharge is above the threshold at the time of the forecast. Therefore there is only a small number of applicable forecasts for each lead time. This is most pronounced for short lead times, when the ensemble spread is very narrow. For example, on average over all catchments there are in the evaluation period only two occasions when the threshold level Q70 is forecasted to be exceeded with a probability between 75% and 98% for day 1 in the forecast. For longer lead times both the ensemble spread and the number of applicable forecasts increase, but the latter seldom greatly exceeds 10. The limited number of cases makes the estimated frequencies of exceedance very uncertain for single lead times, so the results from all lead times were averaged to get reasonably stable frequency estimates. The average number of applicable cases for each catchment varies between 100 for the highest probabilities and 500–1500 for the lowest.
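The bookkeeping behind the threshold-based evaluation can be sketched as follows: each applicable forecast is assigned a probability class from the number of ensemble percentiles above the threshold (using interval midpoints such as the 13.5% of the earlier example), and the observed exceedance frequency is computed per class. The function name `reliability_table`, the data layout and the toy values are hypothetical; in practice the threshold could be taken as, for instance, `np.nanpercentile(discharge, 70)` for Q70.

```python
import numpy as np

# Midpoints (%) of the probability intervals, indexed by the number of
# ensemble percentiles (min, q25, median, q75, max) above the threshold.
MIDPOINTS = np.array([1.0, 13.5, 37.5, 62.5, 86.5, 99.0])

def reliability_table(percentiles, reference, start_discharge, threshold):
    """Return rows (forecast probability %, n cases, observed frequency %).

    percentiles:     shape (n, 5), ensemble percentiles for the forecast day.
    reference:       shape (n,), reference discharge on that day.
    start_discharge: shape (n,), discharge at the time of the forecast.
    Only forecasts issued while below the threshold are counted.
    """
    percentiles = np.asarray(percentiles, dtype=float)
    reference = np.asarray(reference, dtype=float)
    start_discharge = np.asarray(start_discharge, dtype=float)
    keep = (start_discharge < threshold) & ~np.isnan(reference)
    n_above = (percentiles[keep] > threshold).sum(axis=1)
    exceeded = reference[keep] > threshold
    rows = []
    for k, p in enumerate(MIDPOINTS):
        in_class = n_above == k
        n = int(in_class.sum())
        freq = float(100.0 * exceeded[in_class].mean()) if n else float("nan")
        rows.append((float(p), n, freq))
    return rows

# Toy data: four forecasts, the last one starts above the threshold.
pct = np.array([[1, 2, 3, 4, 5], [1, 2, 3, 4, 5],
                [5, 6, 7, 8, 9], [1, 2, 3, 4, 5]], dtype=float)
ref = np.array([4.7, 4.0, 6.0, 4.8])
start = np.array([3.0, 3.0, 4.0, 5.0])
rows = reliability_table(pct, ref, start, threshold=4.5)
```

Plotting the first column against the third gives the reliability diagram, and the second column corresponds to the case counts that limit the stability of the estimates.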


Fig. 6 shows two examples of results from the threshold-based evaluation in terms of reliability diagrams for two catchments, threshold level Q70. The forecasted probabilities on the x-axis are the mean values of each considered interval. For example, the case when the ensemble maximum is below the threshold represents a probability between 0% and 2%, on average 1%, which is the smallest forecasted probability. The other forecasted probabilities are 13.5% (mean of 2 and 25), 37.5% (25 and 50), 62.5% (50 and 75), 86.5% (75 and 98) and 99% (98 and 100). The inserted diagrams show the number of forecasts applicable for each forecasted probability, as previously mentioned generally between 100 and 1500.

The result for catchment Krokfors kvarn, representing catchments with a good agreement between observed frequencies and forecasted probabilities, is shown in Fig. 6a. For all probability levels, the observed frequencies are very close to the theoretical y = x line. For example, out of 52 cases with a forecasted probability of 99% (i.e. all members exceed the threshold), in 48 exceedance actually occurred, giving an observed frequency of 92.3% (top right corner in Fig. 6a). Fig. 6b shows the result for catchment Åkesta kvarn, representing catchments with less accurately forecasted probabilities. In Åkesta kvarn the probabilities of exceedance are substantially overestimated, particularly the high probability levels. In this case exceedance with 99% probability was forecasted 105 times, but only in 72 (68.6%) was the threshold actually exceeded. Exceedance forecasted with 86.5% probability occurred in only 40% of the cases. In this catchment there is thus a pronounced risk of false alarms.

The reliability diagram averaged over all catchments is shown in Fig. 7a. Overall the curve reflects a behaviour in between the results shown in Fig. 6, characterised by an overestimation of the exceedance probabilities that increases with the probability level.
For the highest probability (99%) the average overestimation is 20%. Exceedance will thus occur in only 80% of the cases when it is forecasted to almost certainly happen, i.e. in this sense every fifth case will be a false alarm. The curves for thresholds Q70 and Q90, respectively, are virtually identical, which is also the case for individual catchments. The only real difference is a smaller number of forecasted cases for threshold Q90, and thus a higher uncertainty in the estimated frequencies.

The situation in Fig. 7a, where the curve is approximately a straight line through the origin but less steep than y = x, can be adjusted by a simple calibration procedure. If all forecasted probabilities are multiplied by 0.7 the adjusted values will agree very well with the y = x line. A drawback is however that 70% will be the highest probability possible to forecast.

Fig. 7b shows the average reliability diagrams for reference discharge HBVpf and OBS, respectively, threshold Q70. The overestimation of exceedance probabilities is markedly larger for OBS, for which exceedance forecasted with 99% probability occurred in only 40% of the cases. In total, the deviation from the y = x line is 40% larger for OBS than for HBVpf, indicating that in this threshold-based perspective the impact of the hydrological model error is smaller than that of the meteorological forecast error.

Figure 6 Reliability diagrams (Q70, HBVpf), averaged over all lead times, for catchments Krokfors kvarn (a) and Åkesta kvarn (b). Inserted diagram shows the number of cases for each forecasted probability between 1% (far left) and 99% (far right). [Both panels: observed frequency (%) versus forecasted probability (%).]

Figure 7 Reliability diagrams, averaged over all catchments and lead times, for thresholds Q70 and Q90 (HBVpf) (a) and for reference discharge HBVpf and OBS (Q70) (b). [Both panels: observed frequency (%) versus forecasted probability (%).]
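The proportional adjustment of forecasted probabilities mentioned in connection with Fig. 7a, and its drawback, can be illustrated with a few lines of code. The factor 0.7 is the value quoted in the text; the function name `scale_probability` is hypothetical and the snippet is only meant to show the capped range that results.

```python
def scale_probability(p_percent, factor=0.7):
    """Scale a forecasted exceedance probability (%) by a constant factor."""
    return factor * p_percent

# The six probability classes used in the reliability diagrams (%).
classes = [1.0, 13.5, 37.5, 62.5, 86.5, 99.0]
adjusted = [scale_probability(p) for p in classes]
highest = max(adjusted)  # stays below 70%, the drawback noted in the text
```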

Calibration of ensemble spread

As shown in Section "Percentile-based evaluation", the spread in the hydrological ensemble forecasts is systematically underestimated. Consequently, the direct use of the ensemble spread does not give correct exceedance probabilities. For example, on average over all catchments, for forecast day 1, 30% of the HBVpf forecasts exceed the ensemble maximum (Fig. 4). Thus the ensemble maximum reflects a non-exceedance probability of 70% rather than the theoretical 98%. For reference discharge OBS, which is more relevant in operational forecasting, the discrepancy is substantially larger (Fig. 5), which complicates the use of the ensemble forecasts for decision-making purposes. Therefore it was decided to perform a follow-up study to the evaluation, with the objective of developing a method for adjusting the derived probabilities.

An adjustment of the probabilities with respect to reflecting the variability of the discharge observation in principle has two components. One component is the required adjustment related to the meteorological forecasts, i.e. to adjust for the inaccuracies found in the evaluation using reference discharge HBVpf. As shown in Fig. 4, the type of adjustment required is an increase of the spread which is very large on day 1 and gradually decreases for longer lead times. The second component is the adjustment related to the evolution of the hydrological model error. Owing to the autoregressive updating used, the model error is zero at the time of the forecast and gradually increases for longer lead times. This indicates an opposite type of adjustment, where the increase of the spread increases with lead time.

As the two adjustment components have opposite trends with lead time, and further are of a similar order of magnitude (Section "Percentile-based evaluation"), a simple conceivable approach is a constant adjustment for all days in the forecast. One way to achieve this adjustment is by a simple translation of the original ensemble percentiles, upwards in the case of the maximum and the upper quartile and downwards in the case of the lower quartile and the minimum. This amounts to a bias correction of the percentiles which is independent of lead time. Fig. 8 illustrates such an adjustment of the minimum (2%) and maximum (98%) percentiles for an arbitrary forecast. The parameter Cmin (Cmax) denotes the constant amount with which the ensemble minimum (maximum) is shifted downwards (upwards). The quartile adjustment parameters will be denoted C25 and C75, and only C when referring to all four parameters. The approach was evaluated for five of the catchments by, for each lead time, identifying the values of C that would provide observed frequencies of non-exceedance in line with the ensemble percentiles.
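One way to identify such a value of C numerically is sketched below. It uses a bisection search, which is an assumption made here for illustration and not necessarily the procedure used in the study; the function names, the tolerance and the synthetic discharge series are likewise hypothetical. The downward shift of the minimum is obtained by applying the same search to sign-flipped series.

```python
import numpy as np

def exceedance_fraction(reference, level):
    """Fraction of time steps on which the reference exceeds `level`."""
    return float(np.mean(np.asarray(reference) > np.asarray(level)))

def fit_upward_shift(reference, percentile, target, tol=1e-3, max_iter=200):
    """Smallest shift C >= 0 (within tol) such that the reference exceeds
    `percentile + C` in at most a fraction `target` of the cases."""
    reference = np.asarray(reference, dtype=float)
    percentile = np.asarray(percentile, dtype=float)
    if exceedance_fraction(reference, percentile) <= target:
        return 0.0  # the percentile is already exceeded rarely enough
    # Upper bracket: a shift at which essentially no exceedances remain.
    lo, hi = 0.0, float(np.max(reference - percentile))
    for _ in range(max_iter):
        if hi - lo < tol:
            break
        mid = 0.5 * (lo + hi)
        if exceedance_fraction(reference, percentile + mid) > target:
            lo = mid
        else:
            hi = mid
    return hi

def adjust_percentiles(pct, c_min, c_25, c_75, c_max):
    """Translate the percentiles outwards by constant amounts."""
    return {"min": pct["min"] - c_min, "q25": pct["q25"] - c_25,
            "median": pct["median"],
            "q75": pct["q75"] + c_75, "max": pct["max"] + c_max}

# Synthetic illustration only: 550 days, constant and too narrow ensemble.
rng = np.random.default_rng(1)
reference = rng.normal(100.0, 10.0, size=550)
ens_max = np.full(550, 105.0)
ens_min = np.full(550, 95.0)
c_max = fit_upward_shift(reference, ens_max, target=0.02)
c_min = fit_upward_shift(-reference, -ens_min, target=0.02)
```

The same function with `target=0.25` would give the shifts for the quartiles, and `adjust_percentiles` then applies all four constants to a forecast.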
For example, out of the 550 days in the evaluation period, the ensemble maximum (98% percentile) should be exceeded on 11 days (2% of 550). The value of Cmax producing this number of exceedances was estimated in an iterative trial-and-error procedure, and similarly for the other C parameters.

Figure 8 Schematic of the proposed calibration method. [Plot: Q (m3/s) against date, 04-aug to 14-aug; EPS-median (50%), original and adjusted EPS-min/max (2% and 98%), with the shifts Cmin and Cmax marked.]

The results for Cmin and Cmax in two of the catchments are shown in Fig. 9. In catchment Pepparforsen (Fig. 9a) Cmin is consistently close to 1 m3/s, weakly increasing with lead time. The parameter Cmax varies more, between 4 m3/s and more than 5 m3/s, but is for all lead times relatively close to its mean value of 4.3 m3/s. In Sundstorp (Fig. 9b) Cmin is relatively stable around its mean value of 3.4 m3/s, whereas Cmax varies between 0 and 5 m3/s (mean value 2.0 m3/s) and clearly decreases with increasing lead time. The results for C25 and C75, as well as for the other three evaluated catchments, are overall similar to the picture shown in Fig. 9, i.e. the C parameters are relatively independent of lead time and are generally well approximated by their mean value.

Figure 9 Parameters Cmin and Cmax as a function of forecast day in catchments Pepparforsen (a) and Sundstorp (b). [Plots: C (m3/s) against forecast day 1–9; series Cmin, Cmax and their mean values.]

To verify that the mean value approximation of the C parameters is sufficiently accurate, the ensemble forecasts were re-evaluated after adjustment by the suggested procedure. For each forecast the ensemble percentiles were adjusted prior to the evaluation, using the mean C values. It should be noted that the entire data sample was thus used in both calibration and verification of the method, as the period available was considered too short for reliable split-sample evaluation, especially of the minimum and maximum percentiles. The re-evaluation was carried out in all five catchments, and Fig. 10 shows the result in catchment Sundstorp. The figure shows the frequency of observations falling between the ensemble percentiles, both the original (top) and the adjusted (bottom) ones, for forecast days 1, 5 and 9. It is apparent that the adjustment improves the agreement between the observed frequencies and the theoretical probabilities for all three lead times shown. In a few cases, the agreement with the theoretical probability deteriorates after the adjustment (e.g. for interval "2–25%" on day 9), but in total the improvement is remarkable.

It may be noted that the frequency of observations above the ensemble maximum (">98%" in Fig. 10) is somewhat overestimated on day 1 and somewhat underestimated on day 9, which is in line with the variation of Cmax with lead time (Fig. 9b). As Cmax systematically decreases with increasing lead time, the mean value approximation will underestimate the adjustment on day 1 and overestimate it on day 9. The effect is, however, rather limited, indicating that the mean value approximation is satisfactory even when C exhibits a pronounced and systematic variation with lead time. The level of improvement by the adjustment was similar in the other four evaluated catchments.

Figure 10 Frequencies of reference discharge (OBS) between ensemble percentiles in catchment Sundstorp before (top) and after (bottom) adjustment, for lead times 1, 5 and 9 days. [Panels: Day 1, Day 5 and Day 9, original and adjusted; vertical axis 0 to 50; intervals <2%, 2–25%, 25–50%, 50–75%, 75–98%, >98%.]

As expected, the values of the C parameters vary between catchments, in particular reflecting their difference in size (50–700 km2) and mean discharge (0.8–6.3 m3/s). For the purpose of generalisation, the possibility of relating the C parameters to the catchment mean discharge Q was investigated. It was found that all C parameters may be reasonably well estimated as simple functions of Q according to

$$C_{25} = C_{75} = C_q = 0.06\,Q, \qquad C_{\min} = 6\,C_q, \qquad C_{\max} = 12\,C_q \qquad (2)$$

where Cq denotes a common adjustment of both quartiles. Fig. 11 shows the relationship between generalised C parameters estimated according to (2) and individually calibrated parameters in all five catchments. The amount of explained variance achieved by (2) is 70%.

The adjustment procedure was developed and optimised to improve the forecasts in the percentile-based sense without regard to the performance in a threshold-based sense. An evaluation of the adjusted forecasts, however, revealed some improvement also of the reliability diagrams, although not as pronounced. This was at least partly because the unadjusted reliability diagrams were already relatively accurate in the evaluated catchments.
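The generalised relation (2) and the lead-time-independent translation of the percentiles can be put together in a short sketch. The names (`ShiftParameters`, `generalised_shifts`, `adjust_percentiles`) and the example forecast values are hypothetical, and clipping the lowered percentiles at zero discharge is an added assumption that is not stated in the text.

```python
from dataclasses import dataclass

@dataclass
class ShiftParameters:
    c_min: float
    c_25: float
    c_75: float
    c_max: float

def generalised_shifts(mean_discharge):
    """Shift parameters (m3/s) from the catchment mean discharge, following Eq. (2)."""
    c_q = 0.06 * mean_discharge          # common adjustment of both quartiles
    return ShiftParameters(c_min=6.0 * c_q, c_25=c_q, c_75=c_q, c_max=12.0 * c_q)

def adjust_percentiles(p_min, p_25, p_75, p_max, shifts):
    """Move the lower percentiles downwards and the upper ones upwards,
    by the same amount for every lead time."""
    return (
        [max(q - shifts.c_min, 0.0) for q in p_min],  # assumption: clip at zero discharge
        [max(q - shifts.c_25, 0.0) for q in p_25],
        [q + shifts.c_75 for q in p_75],
        [q + shifts.c_max for q in p_max],
    )

# Hypothetical 3-day forecast percentiles (m3/s) in a catchment with Q = 5 m3/s.
shifts = generalised_shifts(5.0)
adj_min, adj_25, adj_75, adj_max = adjust_percentiles(
    p_min=[10.0, 11.0, 12.0],
    p_25=[12.0, 13.0, 14.0],
    p_75=[16.0, 18.0, 20.0],
    p_max=[20.0, 24.0, 28.0],
    shifts=shifts,
)
```

Because the shifts do not depend on lead time, the adjusted percentile curves are parallel to the original ones, which corresponds to the schematic in Fig. 8.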

Figure 11 Relationship between adjustment parameters obtained by calibration for the individual catchments (Individual) and by the generalised equation (2) (Generalised). [Plot: Individual (m3/s) against Generalised (m3/s), both axes 0–6; series Cmin, C25, C75 and Cmax.]

Conclusions

The main findings in this study can be summarized as follows:

• The spread in the raw (i.e. unadjusted) hydrological ensemble forecasts is underestimated, most pronounced on forecast day 1 and to a gradually lower degree with increasing lead time.
• The exceedance probability of high discharge thresholds is overestimated in the forecasts and this overestimation increases with increasing probability level.
• The contributions to these inaccuracies coming from errors in the meteorological forecast and the hydrological model, respectively, are similar in the case of the spread underestimation, whereas the meteorological forecast error is slightly larger in the case of the overestimated exceedance probability.
• A simple calibration method can substantially improve the ensemble spread.

In essence, this investigation has shown that the output from a conceptual hydrological model, set up and calibrated for a catchment of size 650 km2, fed with raw ECMWF meteorological ensemble forecasts, is characterised in particular by substantial deficiencies in the ensemble spread and consequently in the derived exceedance probabilities. This finding is neither new nor unexpected, but the significant contributions of this study are (1) characterisation and quantification of the deficiencies involved and (2) assessment of the contributions from uncertainties in the meteorological forecast and the hydrological model, respectively. Concerning the latter, the relative proportions of the two contributions turned out to differ somewhat between the two evaluation methods used. The similar contribution found in the percentile-based evaluation indicates that in total, i.e. when all phases of the runoff process are considered (base flow, rising limb, recession), the two uncertainty components are of approximately equal importance. The threshold-based evaluation (as performed here), however, focuses on the rising limb. Thus, the dominance of the meteorological forecast error in this case indicates that the uncertainty in forecasts preceding flow peaks contributes more than the HBV model's inability to describe the corresponding flow rise exactly. From a scientific point of view, a separate treatment of the two uncertainty components is clearly desirable.
Such an approach is further appropriate for producing an ensemble of adjusted members, and not only adjusted percentiles, which may be required in some hydrological applications. Concerning the meteorological forecast component, this would include some form of downscaling of the ECMWF forecasts from the ECMWF model grid to the catchment scale, effectively increasing the ensemble spread, prior to the hydrological modelling. Further, the underestimated spread is likely related also to underdispersivity in the ECMWF ensemble prediction system, which is likely to be reduced in future versions of the system (e.g. Buizza et al., 2005). Concerning the hydrological model component, this may be treated e.g. by combining the meteorological forecast ensemble with an ensemble of hydrological models with different parameter sets (e.g. Pappenberger et al., 2005). Alternatively, an error term may be added to the hydrological ensemble members from a single model, reflecting the probability distribution of the expected model error (e.g. Seo et al., 2006).

From an operational point of view, however, a total adjustment of only the final discharge output has apparent advantages. Using separate adjustment methods first of all involves substantially more work, both in design and management. Both the meteorological and hydrological models involved are constantly developing, which puts high demands on the continuous update also of the adjustment methods. A total adjustment method also has to be kept updated, but this is far less laborious as it may involve only rather straightforward analyses of relatively limited amounts of directly available data. Further, with separate adjustments it has to be verified that these, when integrated, are indeed capable of generating accurate exceedance probabilities also for the final output. It may be remarked that adjustment methods conceptually similar to the one proposed here, i.e. essentially bias correction of the ensemble percentiles, have been advocated also for meteorological forecasts (e.g. Hamill and Colucci, 1997).

We consider the version of the method proposed here to be a first, rather crude approach, with a range of possible improvements. For example, it may be expected that the amount of adjustment required depends on several factors, such as the discharge level at the time of the forecast and properties of the meteorological forecast. The prospect of development in this direction, which essentially may be a way to indirectly take the two separate uncertainty components into account, will be assessed in connection with future evaluations of the ensemble forecasting system. The more advanced methods of post-processing ensemble forecasts recently developed within meteorology (e.g. Raftery et al., 2005) may also be considered for the adjustment of hydrological ensemble forecasts.

Acknowledgements

The financial support from Räddningsverket (Swedish Rescue Services Agency), Elforsk (Swedish Electrical Utilities Joint Company for Research and Development) and SMHI (Swedish Meteorological and Hydrological Institute) is gratefully acknowledged. Many thanks to Anna Johnell for providing the data base, to Anders Persson for fruitful discussions and to two reviewers for constructive criticism of the original manuscript. The work has benefited from participation in COST Action 731 (Propagation of uncertainty in advanced meteo-hydrological forecast systems).

References

Bergström, S., 1976. Development and application of a conceptual runoff model for Scandinavian catchments. SMHI Norrköping, Report RHO 7.
Beven, K.J., Binley, A., 1992. The future of distributed models: model calibration and uncertainty prediction. Hydrological Processes 6, 279–298.
Buizza, R., 1997. Potential forecast skill of ensemble prediction and spread and skill distributions of the ECMWF ensemble prediction system. Monthly Weather Review 125, 99–119.
Buizza, R., Houtekamer, P.L., Toth, Z., Pellerin, G., Wei, M., Zhu, Y., 2005. A comparison of the ECMWF, MSC, and NCEP global ensemble prediction systems. Monthly Weather Review 133, 1076–1097.
De Roo, A., Gouweleeuw, B., Thielen, J., Bartholmes, J., Bongioannini-Cerlini, P., Todini, E., Bates, P., Horritt, M., Hunter, N., Beven, K., Pappenberger, F., Heise, E., Rivin, G., Hils, M., Hollingsworth, A., Holst, B., Kwadijk, J., Reggiani, P., van Dijk, M., Sattler, K., Sprokkereef, E., 2003. Development of a European flood forecasting system. International Journal of River Basin Management 1, 49–59.
Gouweleeuw, B.T., Thielen, J., Franchello, G., De Roo, A.P.J., Buizza, R., 2005. Flood forecasting using medium-range probabilistic weather prediction. Hydrology and Earth System Sciences 9, 365–380.
Hamill, T.M., Colucci, S.J., 1997. Verification of Eta-RSM short-range ensemble forecasts. Monthly Weather Review 125, 1312–1327.
Johnell, A., Lindström, G., Olsson, J., 2007. Deterministic evaluation of ensemble streamflow predictions in Sweden. Nordic Hydrology 38, 441–450.
Lindström, G., 1997. A simple automatic calibration routine for the HBV model. Nordic Hydrology 28, 153–168.
Lindström, G., Johansson, B., Persson, M., Gardelin, M., Bergström, S., 1997. Development and test of the distributed HBV-96 model. Journal of Hydrology 201, 272–288.
Lundberg, A., 1982. Combination of a conceptual model and an autoregressive error model for improving short time forecasting. Nordic Hydrology 13, 233–246.
Molteni, F., Buizza, R., Palmer, T.N., Petroliagis, T., 1996. The ECMWF ensemble prediction system: methodology and validation. Quarterly Journal of the Royal Meteorological Society 122, 73–119.
Pappenberger, F., Beven, K.J., Hunter, N.M., Bates, P.D., Gouweleeuw, B.T., Thielen, J., De Roo, A.P.J., 2005. Cascading model uncertainty from medium range weather forecasts (10 days) through a rainfall-runoff model to flood inundation predictions with the European Flood Forecasting System (EFFS). Hydrology and Earth System Sciences 9, 381–393.
Persson, A., 2001. User guide to ECMWF forecast products. Meteorological Bulletin M3.2, ECMWF.
Raftery, A.E., Gneiting, T., Balabdaoui, F., Polakowski, M., 2005. Using Bayesian averaging to calibrate forecast ensembles. Monthly Weather Review 133, 1155–1174.
Roulin, E., 2006. Skill and relative economic value of medium-range hydrological ensemble predictions. Hydrology and Earth System Sciences Discussions 3, 1369–1406.
Roulin, E., Vannitsem, S., 2005. Skill of medium-range hydrological ensemble predictions. Journal of Hydrometeorology 6, 729–744.
Scherrer, S.C., Appenzeller, C., Eckert, P., Cattani, D., 2004. Analysis of the spread-skill relations using the ECMWF ensemble prediction system over Europe. Weather and Forecasting 19, 552–565.
Seo, D.-J., Herr, H.D., Schaake, J.C., 2006. A statistical postprocessor for accounting of hydrologic uncertainty in short-range ensemble streamflow prediction. Hydrology and Earth System Sciences Discussions 3, 1987–2035.
Werner, M., Reggiani, P., De Roo, A., Bates, P., Sprokkereef, E., 2005. Flood forecasting and warning at the river basin and at the European scale. Natural Hazards 36, 25–42.
Wilks, D.S., 1995. Statistical models in the atmospheric sciences. Academic, San Diego.