Estimating predictive hydrological uncertainty by dressing deterministic and ensemble forecasts; a comparison, with application to Meuse and Rhine

J.S. Verkade (a,b,c,*), J.D. Brown (d), F. Davids (a,b), P. Reggiani (e), A.H. Weerts (a,f)

(a) Deltares, PO Box 177, 2600 MH Delft, The Netherlands, +31.88.335.8348 (tel), +31.88.335.8582 (fax)
(b) Ministry of Infrastructure and the Environment, Water Management Centre of The Netherlands, River Forecasting Service, Lelystad, The Netherlands
(c) Delft University of Technology, Delft, The Netherlands
(d) Hydrologic Solutions Limited, Southampton, United Kingdom
(e) University of Siegen, Research Institute for Water and Environment, Siegen, Germany
(f) Wageningen University and Research Centre, Hydrology and Quantitative Water Management Group, Wageningen, The Netherlands

(*) Corresponding author. Email address: [email protected] (J.S. Verkade)
Abstract

Two statistical post-processing approaches for estimation of predictive hydrological uncertainty are compared: (i) ‘dressing’ of a deterministic forecast by adding a single, combined estimate of both hydrological and meteorological uncertainty and (ii) ‘dressing’ of an ensemble streamflow forecast by adding an estimate of hydrological uncertainty to each individual streamflow ensemble member. Both approaches aim to produce an estimate of the ‘total uncertainty’ that captures both the meteorological and hydrological uncertainties. They differ in the degree to which they make use of statistical post-processing techniques. In the ‘lumped’ approach, both sources of uncertainty are lumped by post-processing deterministic forecasts using their verifying observations. In the ‘source-specific’ approach, the meteorological uncertainties are estimated by an ensemble of weather forecasts. These ensemble members are routed through a hydrological model and a realization of the probability distribution of hydrological uncertainties (only) is then added to each ensemble member to arrive at an estimate of the total uncertainty. The techniques are applied to one location in the Meuse basin and three locations in the Rhine basin. Resulting forecasts are assessed for their reliability and sharpness, as well as compared in terms of multiple verification scores including the relative mean error, Brier Skill Score, Mean Continuous Ranked Probability Skill Score, Relative Operating Characteristic Score and Relative Economic Value. The dressed deterministic forecasts are generally more reliable than the dressed ensemble forecasts, but the latter are sharper. On balance, however, they show similar quality across a range of verification metrics, with the dressed ensembles coming out slightly better. Some additional analyses are suggested. Notably, these include statistical post-processing of the meteorological forecasts in order to increase their reliability, thus increasing the reliability of the streamflow forecasts produced with ensemble meteorological forcings.

Keywords: Quantile Regression, hydrological forecasting, predictive uncertainty, ensemble dressing, statistical post-processing
1. Introduction

The future value of hydrological variables is inherently uncertain. Forecasting may reduce, but cannot eliminate, this uncertainty. Informed, forecast-sensitive decision making is aided by adequate estimation of the remaining uncertainties (see, for example, Verkade and Werner 2011 and the references therein). Omission of relevant uncertainties would result in overconfident forecasts; hence all relevant uncertainties must be addressed in the estimation procedure. These include uncertainties related to the modeling of the streamflow generation and routing processes (jointly referred to as “hydrological uncertainties”) and uncertainties related to future atmospheric forcing (“meteorological uncertainties”). Generally speaking, the total uncertainty can be estimated by separately modelling the meteorological and hydrological uncertainties or by lumping all uncertainties together (cf. Regonda et al. 2013).
The source-specific approach identifies the relevant sources of uncertainty and models these individually before integrating them into an estimate of the total uncertainty. In this context, the hydrologic uncertainties may be treated separately (independently) from the meteorological uncertainties, because they depend only on the quality of the hydrologic modelling. This approach has been followed by, among others, Kelly and Krzysztofowicz (2000); Krzysztofowicz (2002); Krzysztofowicz and Kelly (2000); Seo et al. (2006) and Demargne et al. (2013). The approach has a number of attractive characteristics. The individual sources of uncertainty may each have a different structure, which can be specifically addressed by separate techniques. Also, some of the uncertainties vary in time, while others are time invariant. A disadvantage of source-based modelling is that developing uncertainty models for each source separately may be expensive, both in terms of the development itself as well as in terms of computational cost. Also, whether modelling the total uncertainty as a lumped contribution or separately accounting for the meteorological and hydrological uncertainties, hydrological forecasts will inevitably contain residual biases in the mean, spread and higher moments of the forecast probability distributions, for which statistical post-processing is important.
In the lumped approach, a statistical technique is used to estimate the future uncertainty of streamflow conditionally upon one or more predictors, which may include a deterministic forecast. Underlying this approach is an assumption that the errors associated with historical predictors and predictions are representative of those in the future. This approach is widely used in atmospheric forecasting, where it is commonly known as Model Output Statistics (MOS; Glahn and Lowry, 1972). Several applications of the lumped approach in hydrology have been reported in the literature. These include the Reggiani and Weerts (2008) implementation of the Hydrologic Uncertainty Processor (Kelly and Krzysztofowicz, 2000), the Model Conditional Processor (Todini, 2008; Coccia and Todini, 2011), Quantile Regression (Weerts et al., 2011; Verkade and Werner, 2011; López López et al., 2014), the “uncertainty estimation based on local errors and clustering” approach (UNEEC; Solomatine and Shrestha, 2009) and the Hydrologic Model Output Statistics approach (HMOS; Regonda et al., 2013). For a complete overview that is periodically updated, see Ramos et al. (2013). These techniques each estimate the total uncertainty in future streamflow conditionally upon one or more predictors, including the deterministic forecast. Of course, they vary in their precise formulation and choice of predictors. The lumped approach is attractive for its simplicity, both in terms of development and computational costs. The main disadvantage of the approach is that both meteorological and hydrological uncertainties are modeled together via the streamflow forecast, which assumes an aggregate structure for the modeled uncertainties (although the calibration may be different for particular ranges of streamflow). Also, in order to produce ensemble traces, these techniques must explicitly account for the temporal autocorrelations in future streamflow, which may not follow a simple (e.g. autoregressive) form.
In the text above, the source-specific and the lumped approaches were presented as separate strategies. However, as the source-based approach may not fully account for all sources of uncertainty, statistical post-processing is frequently used to correct for residual biases in ensemble forecasts. In the present work, an intermediate approach is described, namely the ‘dressing’ of streamflow ensemble forecasts. Here, the meteorological uncertainties are estimated by an ensemble of weather forecasts. The remaining, hydrological uncertainties are lumped and described statistically. Subsequently, the streamflow ensemble members are, cf. Pagano et al. 2013, ‘dressed’ with the hydrological uncertainties. This approach has previously been taken by, among others, Reggiani et al. (2009); Bogner and Pappenberger (2011) and Pagano et al. (2013) and, in meteorological forecasting, by Fortin et al. (2006); Roulston and Smith (2003) and Unger et al. (2009). Most of these studies report skill of the dressed ensembles versus that of climatology; Pagano et al. (2013) explored the gain in skill when moving from raw to dressed ensembles and found this gain to be significant. In contrast, the present study compares dressed ensemble forecasts to post-processed single-valued streamflow forecasts.
The kernel dressing approach is akin to kernel density smoothing, whereby missing sources of uncertainty (i.e. dispersion) are introduced by dressing the individual ensemble members with probability distributions and averaging these distributions (cf. Bröcker and Smith, 2008). As ensemble dressing aims to account for additional sources of dispersion not already represented in the ensemble forecasts, a “best member” interpretation is often invoked (Roulston and Smith, 2003). Here, the width of the dressing kernel is determined by the historical errors of the best ensemble member. The resulting distribution is then applied to each ensemble member of an operational forecast and the final predictive distribution is given by the average of the individual distributions. In this context, ensemble dressing has some similarities to Bayesian Model Averaging (BMA; see Raftery et al. 2005 for a discussion).
In the ensemble dressing approach, one highly relevant source of uncertainty, namely the weather forecasts, is described using an ensemble Numerical Weather Prediction (NWP) model. This NWP model takes into account the current initial conditions of the atmosphere and exploits the knowledge of physical processes of the atmosphere embedded in the model, as well as any meteorological observations that are assimilated to improve estimates of the predicted states. The hydrologic uncertainties, which may originate from the hydrologic model parameters and structure (among other things), are then lumped, modelled statistically, and integrated with the meteorological contribution to the streamflow.
The objective of this work is to compare the quality and skill of the forecasts created through dressing of deterministic streamflow forecasts and through dressing of ensemble streamflow forecasts. A priori, the dressed ensemble forecasts are expected to have higher skill than the dressed deterministic forecasts. Both account for the effects of all relevant sources of uncertainty on the streamflow forecasts. However, in the ensemble case, the estimate of atmospheric uncertainties is based on knowledge of the physical system and its state at the issue time of a forecast, whereas this knowledge is unused in the lumped approach. Nevertheless, the lumped approach accounts for any residual meteorological biases via the streamflow.
The context for this study is an operational river forecasting system used by the Dutch river forecasting service. This system models the total uncertainty in the future streamflow using a lumped approach, whereby a deterministic streamflow forecast is post-processed through quantile regression (following a procedure similar to that in Weerts et al. 2011). While this module performs reasonably well, there is a desire among operational forecasters to explore the benefits of (and account for) information in ensemble weather predictions, including information beyond the ensemble mean. This resulted in the operational implementation of the ensemble dressing approach using the same statistical technique (quantile regression). Thus, estimates of the meteorological uncertainties, which were previously modeled indirectly (i.e. lumped into the total uncertainty), are now disaggregated and included separately in the streamflow forecasts. This raises the question of whether the ‘new’ approach indeed increases forecast skill.
The novel aspects and new contributions of this work include (i) a direct comparison between the quality of the dressed deterministic forecasts and the dressed ensemble forecasts; (ii) the application of quantile regression to account for the hydrologic uncertainties; and (iii) the application of the dressing technique to dynamic ensemble streamflow forecasts.
This paper is organised as follows. In the next section, the study approach is detailed, followed by a description of the study basins in section 3. In section 4 the results of the experiments are presented and analysed. In section 5, conclusions are drawn and discussed.

2. Approach

2.1. Scenarios

The present study consists of an experiment in which verification results in two scenarios are inter-compared: dressed deterministic forecasts and dressed ensemble forecasts. These are tested in multiple cases, that is, combinations of forecasting locations and lead times.
2.2. Cross-validation

The results from the experiments that are described below are cross-validated by means of a leave-one-year-out analysis. The set of available data is divided into N available years. Models are trained using N − 1 years and validated on the remaining year. The experiment is carried out N times, each time with another year chosen as the validation year. Thus, the validation record that is verified comprises N years of data that is obtained independently from the training data, and optimal use is made of the available length of record.
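For illustration, a minimal sketch of this scheme in R (the environment used for the study's statistical post-processing; see section 2.6). The data frame ‘pairs’ and the two helper functions are hypothetical placeholders, not the study's actual code:

```r
# Minimal sketch of leave-one-year-out cross-validation, assuming a data
# frame 'pairs' with columns year, forecast and observation (illustrative).
years <- unique(pairs$year)
validated <- lapply(years, function(y) {
  train <- subset(pairs, year != y)  # train on the other N - 1 years
  valid <- subset(pairs, year == y)  # validate on the held-out year
  model <- fit_uncertainty_model(train)          # hypothetical helper
  cbind(valid, predict_quantiles(model, valid))  # hypothetical helper
})
validation_record <- do.call(rbind, validated)   # N independently obtained years
```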
2.3. Hindcasting

The experiment comprises the process of hindcasting or reforecasting: forecasts are produced for periods of time that are currently in the past, while only using data that was available at the time of forecast (issue time, or reference time, or time zero).

Hindcasts are produced using an offline version of the Delft-FEWS (Werner et al., 2013) based forecast production system “RWsOS Rivers” that is used by the Water Management Centre of The Netherlands for real-time, operational hydrological forecasting.

Hindcasting is a two-step process: first, the hydrological models are forced with observed temperature and precipitation for a period up to the forecast initialisation time. Thus, the internal model states reflect the basin's actual initial conditions as closely as possible. These initial states are used as the starting point for forecast runs, where the models are forced with the precipitation and temperature ensemble NWP forecasts. These are imported into the forecast production system as gridded forecasts and subsequently spatially aggregated to sub-basins. Basin-averaged precipitation or temperature is then equal to the average of the grid boxes within that basin, accounting for partial inclusion of grid boxes through a process of weighting.
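A minimal sketch of this weighted aggregation, with illustrative variable names of our own choosing:

```r
# Weighted grid-to-basin aggregation: each grid-box value is weighted by
# the area of that box falling inside the sub-basin (names illustrative).
basin_average <- function(box_values, area_inside_basin) {
  sum(box_values * area_inside_basin) / sum(area_inside_basin)
}
```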
In the case of St Pieter streamflow forecasts, an autoregressive-moving-average (ARMA) error-correction procedure was used to correct for biases in the raw streamflow forecasts and hence any residual biases contributed by the forcing (see below). This is in line with the procedures used in the operational system. The autoregressive (AR) correction scheme used here is based on estimation of the AR parameters using Burg's method and ARMASel (see Broersen and Weerts, 2005, and the references therein). The procedure is implemented in Delft-FEWS (Werner et al., 2013) and can be configured as self-calibrating (automated selection of order and/or AR parameter values) or with fixed AR parameters for post-processing of hydrological forecasts. Here, the latter option was chosen.
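To convey the idea, a much-simplified sketch of an error-correction step with fixed AR parameters; the operational Burg/ARMASel scheme in Delft-FEWS is more elaborate, and all names below are illustrative:

```r
# Simplified AR(p) error correction with fixed parameters 'phi': recent
# residuals between observed and forecast flow are extrapolated with the
# AR recursion and added to the raw forecast at each lead time.
ar_correct <- function(raw_forecast, recent_errors, phi) {
  p <- length(phi)
  e <- tail(recent_errors, p)     # last p residuals, oldest first
  corrected <- raw_forecast
  for (n in seq_along(raw_forecast)) {
    e_next <- sum(phi * rev(e))   # one-step-ahead AR prediction
    corrected[n] <- raw_forecast[n] + e_next
    e <- c(e[-1], e_next)         # slide the window to the next lead time
  }
  corrected                       # the correction decays with lead time
}
```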
The effect of error correction is that forecast uncertainty will be reduced to zero at zero lead time; with increasing lead time, this uncertainty reduction will ‘phase out’. To account for this in the streamflow hindcasts, error correction was used in simulation mode also. Effectively, the hindcasting procedure comprised the re-production of forecasts where the models were forced with precipitation measurements instead of precipitation forecasts. This introduces a lead time dependence in the quality of the streamflow simulations. This is described in more detail in Section 2.5.
2.4. ‘Dressing’ of streamflow forecasts

The dressing technique is similar across the lumped and the source-specific approaches in that the forecasts are dressed with predictive distributions of uncertainties that are not already explicitly addressed in the raw forecasts. Thus, deterministic hydrological forecasts are dressed with a predictive distribution that comprises both meteorological and hydrological uncertainties, and hydrological ensemble forecasts are dressed with a predictive distribution that comprises hydrological uncertainties only. In both approaches, the total uncertainty is computed by averaging over the number of ensemble members E (where E = 1 in the case of deterministic forecasts),

\Phi_n(q_n \mid s_{n,1}, s_{n,2}, \ldots, s_{n,E}) = \frac{1}{E} \sum_{e=1}^{E} \phi_n(q_n \mid s_{n,e}),    (1)

where Φ is the aggregated distribution of observed streamflow q at lead time n, conditional on the raw streamflow forecast s that consists of ensemble members e ∈ {1, . . . , E}, each of which is dressed with distribution φ.

In the ensemble dressing scenario, this means that each of the ensemble members is dressed with a predictive distribution of hydrological uncertainty, and that these multiple distributions are averaged to obtain a single distribution of predictive uncertainty. Note that here, we assume that the ensemble members are equiprobable (which generally applies to atmospheric ensembles generated from a single model, but not necessarily to multi-model ensembles, for example). If the members are not equiprobable then a weighting can easily be introduced.
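A minimal sketch of equation (1) in R; the kernel function and the numbers in the usage example are hypothetical, chosen only to show the mechanics:

```r
# The dressed predictive CDF of eq. (1): an equal-weights average of the
# member-specific kernels. 'member_cdf(q, s_e)' stands for phi_n(q | s_e).
dressed_cdf <- function(q, members, member_cdf) {
  mean(sapply(members, function(s_e) member_cdf(q, s_e)))
}

# Usage with a hypothetical Gaussian kernel of fixed width (illustrative):
# dressed_cdf(1500, members = c(1200, 1450, 1600),
#             member_cdf = function(q, s) pnorm(q, mean = s, sd = 200))
```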
Here, the distribution, Φ, aims to capture the historical residuals between the observed and simulated streamflows (i.e. streamflows produced with observed forcing). A “best member” interpretation does not apply here, because the dressing kernel aims to capture a new source of uncertainty (the hydrologic uncertainty) and not to account for under-dispersion in (the hydrologic effects of) the meteorological uncertainties. In short, we assume that the meteorological ensembles are unbiased and correctly dispersed. This is a necessary assumption: uncertainty originating in future weather is estimated only by the spread of the streamflow ensemble and not in any way through statistical post-processing, nor does post-processing in any way bias-correct the streamflow ensemble to better account for meteorological uncertainty. This constitutes a disclaimer to the ensemble dressing approach: if the assumption of a correctly dispersed meteorological ensemble is invalid, then this will have an adverse effect on the quality of the dressed ensemble.

By construction, the raw deterministic forecasts are dressed with a single distribution only, which aims to account for the total uncertainty of the future streamflow, not the residual uncertainty of a best ensemble member,

\Phi_n(q_n \mid s_{n,1}) = \phi_n(q_n \mid s_{n,1}).    (2)

The dressing procedures are schematically visualised in Figure 1.

[Figure 1 about here.]
2.5. Uncertainty models

As mentioned above, the source-specific and lumped approaches differ in the predictive distributions that the raw forecasts are dressed with. In the lumped approach, the deterministic forecast is dressed with a distribution of both hydrological and atmospheric uncertainties, conditional on the value of the deterministic forecast itself. The “errors” in the deterministic forecast are thus a measure of uncertainties originating in both the meteorological forcing and the hydrological modeling.

Ensemble streamflow forecasts are dressed using a predictive distribution of hydrological uncertainty only. This is achieved by fitting a probability distribution to the historical residuals between the hydrological model simulations and observations (see section 2.6 for details on how this was done). The simulations are derived by forcing a hydrological model with observed precipitation and temperature. As such, the simulations are independent of lead time. This approach is quite widely reported in the literature, for example by Montanari and Brath (2004); Seo et al. (2006); Chen and Yu (2007); Hantush and Kalin (2008); Montanari and Grossi (2008); Todini (2008); Bogner and Pappenberger (2011); Zhao et al. (2011) and Brown and Seo (2013).

Time invariant estimates of hydrological uncertainty using these hydrological simulations, however, do not take into account any error correction or data assimilation procedures that, in a real-time setting, reduce predictive uncertainty. In the Meuse at St Pieter case, such an error correction method is an integral component of the forecasting system. Hence, in the Meuse case, hydrological uncertainty is not based on observations and simulations, but on observations and “perfect forcing hindcasts”. Similar to the forecasts produced by the operational system, these hindcasts benefit from the ARMA error correction procedure that is applied to streamflow forecasts at St Pieter at the onset of every hindcast. However, the hindcasts are forced with observed precipitation and streamflow. Hence the hindcasting record is similar to a simulation record, but with the added benefit of an ARMA error correction. This introduces a lead time dependency in the skill of the hindcast. At zero lead time, the hindcast has perfect skill; with increasing lead time, forecast errors increase in magnitude and skill deteriorates. These resulting forecast errors are largely due to hydrological uncertainties. When the effect of the ARMA procedure has worn out, forecast skill reduces to that of the hydrological simulations. The procedure is similar to that followed by Bogner and Pappenberger (2011); an alternative approach would be to use the latest available observation as a predictor in the uncertainty model; this approach was taken by, among others, Seo et al. (2006).

As the experimental setup is quite complex, Table 1 should be helpful in distinguishing the various types of model runs from each other.

[Table 1 about here.]
2.6. Quantile Regression

In the present paper, in both scenarios, uncertainty is estimated using Quantile Regression (QR; Koenker and Bassett Jr, 1978; Koenker and Hallock, 2001; Koenker, 2005). QR is a regression technique for estimating the quantiles of a conditional distribution. As the sought relations are conditional quantiles rather than conditional means, quantile regression is robust with regard to outliers. Quantile Regression does not make any prior assumptions regarding the shape of the distribution; in that sense, it is a non-parametric technique. It is, however, highly parametric in the sense that, for every quantile of interest, parameters need to be estimated. QR was developed within the economic sciences, and is increasingly used in the environmental sciences (see, for example, Bremnes 2004; Nielsen et al. 2006). Specific applications in the hydrological sciences include Weerts et al. (2011), Roscoe et al. (2012) and López López et al. (2014).

In the present work, Quantile Regression is used to estimate lead time n specific conditional distributions of streamflow,

\phi_n = \{Q_{n,\tau_1}, Q_{n,\tau_2}, \ldots, Q_{n,\tau_T}\},    (3)

where T is the number of quantiles τ (τ ∈ [0, 1]) considered. If T is sufficiently large and the quantiles τ cover the domain [0, 1] sufficiently well, we consider φ_n to be a continuous distribution. Here, T = 50 and τ ∈ {0.01, 0.03, . . . , 0.99},

\phi_n = \{Q_{n,\tau=.01}, Q_{n,\tau=.03}, \ldots, Q_{n,\tau=.99}\}.    (4)

We assume that, cf. Weerts et al. (2011), separately for every lead time n considered and for every quantile τ, there is a linear relationship between the streamflow forecast S and its verifying observation Q,

Q_{n,\tau} = a_{n,\tau} S_n + b_{n,\tau},    (5)

where a_{n,\tau} and b_{n,\tau} are the slope and intercept from the linear regression. Quantile Regression allows for finding the parameters a_{n,\tau} and b_{n,\tau} of this linear regression by minimising the sum of residuals,

\min \sum_{j=1}^{J} \rho_{n,\tau} \left( q_{n,j} - (a_{n,\tau} s_{n,j} + b_{n,\tau}) \right),    (6)

where ρ_{n,τ} is the quantile regression weight for the τth quantile, q_{n,j} and s_{n,j} are the jth paired samples from a total of J samples, and a_{n,τ} and b_{n,τ} are the regression parameters from the linear regression between streamflow forecast and observation. By varying the value of τ, the technique allows for describing the entire conditional distribution.

In the present work, solving equation 6 was done using the quantreg package (Koenker, 2013) in the R software environment (R Core Team, 2013). Figure 5 and Figure 6 show the joint distributions of forecast–observation pairs as well as a selection of estimated quantiles; these plots are discussed in the Results and analysis section.
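A minimal sketch of this fit with quantreg, for a single lead time n; the data frame ‘train’ (columns S and Q) and the new forecast value ‘s_new’ are illustrative placeholders, not the study's configuration:

```r
library(quantreg)
taus <- seq(0.01, 0.99, by = 0.02)           # the T = 50 quantiles of eq. (4)
fits <- rq(Q ~ S, tau = taus, data = train)  # minimises eq. (6) for each tau
coef(fits)                                   # intercept b and slope a per tau
phi_n <- predict(fits, newdata = data.frame(S = s_new))  # dressed quantiles
```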
2.7. Verification strategy

Forecast quality in the two scenarios is assessed using visual exploration of forecast hydrographs, examination of graphical measures of reliability and sharpness, as well as a selection of metrics (presented as skill scores) for both probabilistic forecasts and single valued derivatives thereof. The metrics and skill scores are described in detail in Appendix A.

Reliability is the degree to which predicted probabilities coincide with the observed relative frequencies, given those forecast probabilities. The degree to which the forecasts are reliable is indicated by reliability plots, which plot the conditional relative frequency of observations against forecast probabilities. Proximity to the 1:1 diagonal, where observed frequency equals predicted probability, indicates higher reliability.

The refinement of a set of forecasts refers to the dispersion of the marginal distribution of predicted probabilities. A refinement distribution with a large spread implies refined forecasts, in that different forecasts are issued relatively frequently and so have the potential to discern a broad range of conditions. This attribute of forecast refinement is often referred to as sharpness, in the sense that refined forecasts are called sharp (Wilks, 2011). Often, sharpness is summarized as the ability to forecast extreme probabilities such as 0% and 100%. This is closely related to the width, or spread, of an ensemble forecast. Large spreads are unlikely to yield 0% or 100% event probabilities. Hence, here we have measured the width of the ensemble forecasts. These “sharpness plots” show the empirical cumulative distribution of the width of the 10th–90th quantile range of the probability forecasts. In this context, sharpness measures the degree of confidence (narrowness of spread) afforded by the forecasts.
Metrics that describe the quality of single valued forecasts include the correlation coefficient (COR), relative mean error (RME), mean absolute error (MAE) and the root mean squared error (RMSE). The correlation coefficient describes the degree of linear dependence between the observation and the forecast. The RME, or fractional bias, measures the average difference between the forecast and the observation, relative to the mean of the observations. MAE measures the mean absolute difference between a set of forecasts and corresponding observations. RMSE provides the square root of the mean square error of the forecasts; it has the same unit as the forecasts and the observations. In each of these four metrics the forecast mean is used as the single valued forecast.
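For concreteness, the four metrics as one-liners in R, for vectors of forecast means ‘f’ and verifying observations ‘o’ (names are ours, for illustration):

```r
cor_coef <- cor(f, o)              # COR: linear dependence
rme      <- mean(f - o) / mean(o)  # RME: bias relative to the observed mean
mae      <- mean(abs(f - o))       # MAE: mean absolute difference
rmse     <- sqrt(mean((f - o)^2))  # RMSE, in the units of the forecasts
```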
In terms of the probabilistic characteristics of forecasts, the overall accuracy is measured with the Brier Score (BS) and the mean Continuous Ranked Probability Score (CRPS). The BS comprises the mean square error of a probability forecast for a discrete event, where the observation is an indicator variable. The CRPS measures the integral square error of a probability forecast across all possible event thresholds, again assuming that the observation is deterministic.
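A sketch of both scores in R. The CRPS sketch uses the known identity relating the CRPS to the quantile (pinball) loss, which suits forecasts issued as T quantiles, as here; the notation is ours:

```r
# Brier Score: p = forecast probability of the event, occurred = 0/1 indicator
brier <- function(p, occurred) mean((p - occurred)^2)

# CRPS approximated from the T forecast quantiles via the pinball loss:
# CRPS ~ (2/T) * sum over tau_k of rho_tau_k(obs - quantile_k)
crps_from_quantiles <- function(quantiles, taus, obs) {
  u <- obs - quantiles
  mean(2 * ifelse(u >= 0, taus * u, (taus - 1) * u))
}
```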
In the present paper, both BS and CRPS of the forecasts under consideration are presented as skill relative to the BS and CRPS of the climatological forecast, that is, the climatology of streamflow observations as derived from the sample of verification pairs.

For discrimination, the trade-off between correctly predicted events (true positives, or hits) and false alarms (false positives) is considered. Hit rate is plotted versus false alarm rate in the ROC curves. The area under the curve (AUC) is a measure of discrimination; this is expressed as the Relative Operating Characteristic score (ROCS), which factors out the climatological AUC of 0.5, i.e. 2AUC − 1. Forecast value is measured using the Relative Economic Value (REV; Murphy 1985; Zhu et al. 2002). REV (V [−]) is calculated by matching the occurrences of hits, misses, false alarms and correct negatives (‘quiets’) with their consequences (Table 2). It is expressed on a scale from negative infinity to 1, where V = 0 is the situation in which there is no forecasting and warning system present and V = 1 is the situation in which there is a perfect forecasting and warning system present. Negative values imply that the warning system introduces more costs than benefits. The REV is expressed as a function of the user's cost-loss ratio r.
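An illustrative computation following the standard cost-loss formulation of REV; the function and argument names are ours, and the relative frequencies would be taken from a contingency table such as Table 2:

```r
# REV from relative frequencies of hits (h), misses (m) and false alarms
# (f), with cost-loss ratio r = C/L and loss normalised to L = 1.
rev_score <- function(h, m, f, r) {
  o <- h + m                   # climatological event frequency
  e_clim    <- min(r, o)       # cheaper of 'always act' and 'never act'
  e_perfect <- o * r           # act only when the event occurs
  e_fcst    <- (h + f) * r + m # act on warnings, lose on misses
  (e_clim - e_fcst) / (e_clim - e_perfect)  # 1 = perfect, 0 = climatology
}
```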
Verification was performed at the same timestep as the post-processing; results are shown for a selection of lead times only. For verification, the open source Ensemble Verification System (EVS; Brown et al., 2010) is used. EVS takes ensemble forecasts as input. Here, predictive uncertainty is expressed by quantiles rather than ensemble members, but the 50 quantiles are equally spaced, and ensemble members may be interpreted as quantiles of the underlying probability distribution from which they are sampled (e.g. Bröcker and Smith, 2008).

Conditional quality and skill are determined by calculating verification metrics for increasing levels of the non-exceedence climatological probability, P, ranging from 0 to 1. Essentially, P = 0 constitutes an unconditional verification for continuous measures, such as the CRPSS, as all available data pairs are considered (Bradley and Schwartz, 2011). Conversely, at P = 0.95, only the data pairs with observations falling in the top 5% of sample climatology are considered; this amounts to approx. 60 pairs here for the Meuse case and approx. 150 pairs for the Rhine case (Figure 2).

The BSS, ROCS and REV measure forecast skill for discrete events. The BSS, ROCS and REV are, therefore, unknown for thresholds corresponding to the extremes of the observed data sample, nominally denoted by P = 0 and P = 1.
[Figure 2 about here.]

Sampling uncertainties were quantified with the stationary block bootstrap (Politis and Romano, 1994). Here, blocks of adjacent pairs are sampled randomly, with replacement, from the J available pairs in each basin. Overlapping blocks are allowed, and the average length of each block is determined by the autocorrelation of the sample data. In both cases, an average block length of 10 days was found to capture most of the autocorrelation (some interseasonal dependence remained). The resampling was repeated 1,000 times, and the verification metrics were computed from each sample. Confidence intervals were derived from the bootstrap sample with a nominal coverage probability of 0.9, i.e. [0.05, 0.95]. The intervals should be treated as indicative and do not necessarily provide unbiased estimates of coverage probabilities, particularly for rare events (Lahiri, 2003). Also, observational uncertainties were not considered.
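A minimal sketch of this resampling in R, under the assumption of a data frame of verification pairs and a metric function of our own naming:

```r
# Stationary block bootstrap (Politis and Romano, 1994): blocks of
# consecutive pairs with geometrically distributed lengths (mean 10 here)
# are drawn with replacement, wrapping around the end of the record, and
# the verification metric is recomputed on each of the B replicates.
block_bootstrap <- function(pairs, metric, mean_block = 10, B = 1000) {
  J <- nrow(pairs)
  replicate(B, {
    idx <- integer(0)
    while (length(idx) < J) {
      start <- sample.int(J, 1)                    # random block start
      len <- rgeom(1, prob = 1 / mean_block) + 1   # geometric block length
      idx <- c(idx, (seq(start, length.out = len) - 1) %% J + 1)  # wrap
    }
    metric(pairs[idx[1:J], , drop = FALSE])
  })
}
# 90% interval, as used here:
# quantile(block_bootstrap(pairs, metric), c(0.05, 0.95))
```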
These sampling uncertainty intervals provide information as to the ‘true value’ of the metric or skill considered. Unfortunately, the intervals cannot be used for a formal statistical analysis, as the verification samples are not strictly independent. Hence, in the present paper, the comparison between scenarios is (necessarily) based on a qualitative description of the uncertainty intervals.
3. Study basins, models and data used

To enhance the robustness of the findings presented in this paper, the experiment was carried out on two separate study basins. These comprise forecasting locations in two basins with different characteristics, where hydrological models are forced with different atmospheric ensemble forcing products.

[Figure 3 about here.]
3.1. Meuse

3.1.1. Basin description

The river Meuse (Figure 3) runs from the northeast of France through Belgium and enters the Netherlands just south of Maastricht. It continues to flow north and then west towards Dordrecht, where it meets the Rhine before discharging into the North Sea near Rotterdam. Geology and topography vary considerably across the basin. The French Meuse basin is relatively flat and has thick soils. The mountainous Ardennes are relatively high and steep, and the area's impermeable bedrock is covered by thin soils. Average annual basin precipitation varies around 900 mm. The Meuse is a typical rain-fed river; long lasting, extensive snowpacks do not occur. Figure 4 shows the distribution of streamflow at the forecasting locations considered in this study. Near Maastricht, average runoff equals approx. 200 m³/s. Temporal variability can be large: during summer, streamflow can be less than 10 m³/s, while the design flood, associated with an average return period of 1,250 years, has been established at approx. 3,000 m³/s.

[Figure 4 about here.]

This study explicitly looks at St Pieter, which is near where the river enters The Netherlands. Water levels in the Belgian stretch of the Meuse, just upstream of the Dutch-Belgian border, are heavily regulated by large weirs. These, together with the locks that have been constructed to allow ships to navigate the large water level differences, cause relatively high fluctuations in discharge. The manual operations that lead to these fluctuations are not communicated to the forecasting agency across the border in The Netherlands, which introduces additional uncertainties with respect to future streamflow conditions.
3.1.2. Models used

The forecasting system contains an implementation of the HBV rainfall-runoff model (Bergström and Singh, 1995). This is a semi-lumped, conceptual hydrological model, which includes a routing procedure of the Muskingum type. The model schematisation consists of 15 sub-basins jointly covering the Meuse basin upstream of the Belgian-Dutch border, which is very near the St Pieter forecasting location. The model runs at a one-hour time step. Inputs to the model consist of temperature and precipitation forcings; actual evaporation is estimated from a fixed annual profile that is corrected using temperature forecasts. The model simulates both streamflow generation and streamflow routing in natural flow conditions only. Thus, it does not include models of the human interference that occurs at weirs and sluices. This interference occurs mainly at low flows; at high flows, weirs are drawn. Hence, at low flows, considerable uncertainty is associated with model outcomes.
3.1.3. Forecasts and observations used

COSMO-LEPS is the ensemble implementation of the COSMO model, a non-hydrostatic, limited-area atmospheric prediction model. Its 16 members are nested on selected members of the ECMWF-EPS forecasts. COSMO-LEPS runs twice daily at a 10 km grid spacing with 40 vertical layers. It covers large parts of continental Europe, including the Meuse basin. For the present experiment, approx. 1,400 historical COSMO-LEPS forecasts were available (Figure 2): one every day between mid 2007 and early 2011. The forecasts have a 1-h time step and a maximum lead time of 132-h, i.e. 5.5 days. Within the operational forecasting system, the lead time is artificially extended to 7 days by assuming zero precipitation and a temperature of 8 °C for the lead times ranging from 132-h through 168-h. The 36-h lead time gain more or less coincides with the time required for a flood wave to cover the distance from Chooz (near the French-Belgian border) to St Pieter. As a general rule, about half of the streamflow volume originates from the basin upstream of Chooz.

From the 16 members, a single member was isolated to serve as the deterministic forecast. Note that while the results for a single deterministic forecast are presented here, the dressing was in fact done 16 times, using each of the available 16 members as a single deterministic forecast. Each of these 16 dressed deterministic forecasts behaves similarly with respect to the ‘competing’ scenario; hence only one of these is presented in this paper.

Hourly streamflow observations at St Pieter, as well as temperature and precipitation observations within the Meuse basin, were obtained from the Water Management Centre of The Netherlands.
3.2. Rhine

3.2.1. Basin description

The river Rhine runs from the Swiss Alps along the French-German border, through Germany, and enters The Netherlands near Lobith, which is situated upstream of the Rhine-Meuse delta and is often considered the outflow of the Rhine. At Lobith, the basin area equals approx. 160,000 km². During spring and early summer, a considerable fraction of flow at the outlet originates from snowmelt in the Swiss Alps. Figure 3 shows the basin location, elevations and the forecasting locations that were used in this work. These are Metz, Cochem and Lobith. Metz is located in the headwaters of the river Moselle, of which Cochem is the outlet.

3.2.2. Models used

The forecast production system that was used to create simulations and hindcasts for the Rhine is a derivative of the operational system mentioned above. The system contains an implementation of the HBV rainfall-runoff model (Bergström and Singh, 1995). The Rhine model schematisation consists of 134 sub-basins jointly covering the entire basin. The models run at a daily time step. Inputs to the model consist of temperature and precipitation forcings; actual evaporation is estimated from a fixed annual profile that is corrected using temperature forecasts.
3.2.3. Forecasts and observations used

For observations of precipitation, the CHR08 dataset (Photiadou et al., 2011) was used. This dataset was prepared specifically for the HBV model used here and covers the period 1961 through 2007. The spatial scale of the observations coincides with the 134 HBV sub-basins. Temperature observations originate from version 5.0 of the E-OBS data set (Haylock et al., 2008) and are available from 1951 through mid 2011. These forcings were available at a time step of one day. The observations are used to force the hydrological model in historical mode to estimate the initial conditions at the onset of a hydrological forecast, as well as in simulation mode.

Predicted forcings consisted of the ECMWF reforecast dataset, comprising medium-range EPS forecasts with 5 ensemble members (Hagedorn, 2008). These reforecasts were produced using the current operational model (Cy38r1, with a 0.25 degree horizontal resolution). The forecasts were temporally aggregated to a one-day time step, which coincided with that of the hydrological model used, and go out to a maximum lead time of 240-h, i.e. 10 days. The gridded forecasts were spatially averaged to the HBV sub-basin scale. For this work, approx. 2,900 reforecasts were available (Figure 2), covering the period 1990–2008.

Similar to the Meuse case, the deterministic forecasts used in this study consist of a randomly chosen ensemble member from each of the available ensemble forecasts. Each of the members was used to create a deterministic forecast, which was subsequently dressed and analysed. However, results for only one of these forecasts are presented.

Hourly streamflow observations for the hydrological stations within the Rhine basin were obtained from the Water Management Centre of The Netherlands.
4. Results and analysis

4.1. Post-processing of single valued forecasts

Scatter plots of single-valued forecasts and observations are shown in Figures 5 and 6 for St Pieter and the Rhine locations, respectively. Two datasets are shown: (i) the forecasts with perfect forcings (simulations in the Rhine case) and (ii) the deterministic forecasts. In all plot panels, the horizontal axes are identical to the vertical axes and the 1:1 diagonal is emphasised. The forecast–observation pairs are plotted in a transparent colour; areas that show up darker comprise multiple pairs plotted on top of one another. A selection of estimated quantiles is superimposed on the scatter plots, with the median in red and the 5th, 10th, 25th, 75th, 90th and 95th percentiles in blue.

[Figure 5 about here.]

[Figure 6 about here.]

The pairs and the estimated quantiles in the St Pieter figure (Figure 5) show that the perfect forcing pairs (bottom row) are closer to the diagonal than the deterministic forecast pairs (top row). This is because the residuals between the perfect forcings forecast and the observations comprise the hydrological uncertainties only. The plots also show that the median quantile of the pairs comprising the deterministic forecasts has a shallower slope than the diagonal. This indicates an overforecasting bias: the majority of pairs is located below, or to the right of, the diagonal. The median of the pairs comprising the perfect forcing forecasts shows a slope almost equal to, or higher than, that of the 1:1 diagonal. The latter indicates underforecasting: most of the pairs are located above, or to the left of, the 1:1 diagonal. Both sets of pairs show that the spread increases with increasing forecast lead time and that higher values of flow have higher spread in real units.

In the Rhine case (Figure 6), the simulations are independent of forecast lead time. The difference between the spread of pairs based on the simulations and that of the deterministic forecasts is less obvious, especially when the shorter lead times are considered. Without exception, the median forecast is located below the diagonal, which indicates an overforecasting bias.
4.2. Forecast hydrographs

Sample forecasts, or predictive distributions, for both scenarios are shown in Figure 7. The rows show the cases that use deterministic (top) and ensemble (bottom) forecasts, with the raw forecasts indicated by thick blue lines and the dressed forecasts by thin grey lines. Note that the raw cases show one or multiple traces, whereas for the dressed cases, quantiles are shown (which should not be confused with ensemble traces).

[Figure 7 about here.]

By construction, ensemble dressing corrects for under-dispersion and, therefore, increases the ensemble spread. In this example, the spread of the dressed single-valued forecasts is larger than the spread of the dressed ensemble forecasts. It is also noticeable that the raw ensemble forecast fails to capture many observations, whereas the dressed forecasts capture all. Please note that this is an example on a particular date and not necessarily the general behaviour of all forecasts that are post-processed.

The example forecasts also show an artefact associated with statistical post-processing, namely that the most extreme quantiles are relatively noisy. This originates from the increased sampling uncertainty associated with estimating extreme quantiles.
4.3. Single-valued forecast verification

Generally speaking, COR, RME and RMSE follow similar patterns. Each worsens with increasing lead time and with increasing value of the verifying observation as indicated by P. In this manuscript, only the RME is shown (Figure 8).

The correlations (plot not shown) are highest for Lobith, followed by Cochem, Metz and St Pieter. While the correlations are generally positive, they approach zero at St Pieter for higher P at longer forecast lead times. Both the patterns and values of the correlation coefficient (as a function of lead time and P) are similar across the two scenarios. Only at the longer forecast lead times and at St Pieter do they differ. In those cases, the dressed ensembles outperform the dressed deterministic forecasts.

The RME plots (Figure 8) show that, at P = 0, the dressed deterministic forecasts have near-perfect RME, that is, RME ≈ 0, at all forecast lead times. The dressed ensemble forecasts show a larger fractional bias, with St Pieter and Metz showing positive values and Cochem and Lobith showing negative values. For higher values of P and at longer forecast lead times, RME becomes increasingly negative. Consequently, at higher values of P and at longer forecast lead times, the dressed ensembles at St Pieter and Metz show smaller fractional bias than the dressed deterministic forecasts. The converse is true for Cochem and Lobith, where the dressed deterministic forecasts have smaller RME. The difference in RME between scenarios increases with increasing forecast lead time.

The RMSE (not shown in plots) worsens (i.e., increases) with increasing forecast lead time, increasing threshold amount, and with declining basin size. The RMSE is lower for the dressed ensemble forecasts than the dressed single-valued forecasts at most values of P. Only at Cochem and Lobith and for some values of P is the RMSE higher for the dressed ensemble forecasts.

Overall, in terms of the single valued verification measures, neither the dressed ensemble forecasts nor the dressed deterministic forecasts perform consistently better. At St Pieter and Metz, the mean of the dressed ensembles has higher quality in terms of COR, RME and RMSE, whereas at Cochem and Lobith, the reverse is true in terms of RME and, at small ranges of P, for RMSE.

[Figure 8 about here.]

4.4. Reliability and sharpness
Reliability plots for the P = 0.50 and P = 0.90 subsamples (Figures 9 and 10) show that both approaches are reasonably, and more or less equally, reliable at all lead times and at all locations. The one exception here is St Pieter, where the dressed deterministic forecasts appear to be more reliable than the dressed ensemble forecasts.

Reliability varies slightly with the level of extremity of the event: the P = 0.50 reliability plot (Figure 9) follows the diagonal more closely than the P = 0.90 reliability plot (Figure 10). In the P = 0.50 reliability plot (Figure 9), the samples are distributed much more evenly across the horizontal (as indicated by the size of the plotting positions) compared to the P = 0.90 reliability plot (Figure 10). In the latter, the distribution is much less even, with a relatively large number of forecasts to be found at the 0% and 100% forecast probabilities.

[Figure 9 about here.]

[Figure 10 about here.]

The width of the forecast distributions (Figures 11 and 12) increases with increasing lead time and with increasing value of P. Sharpness, compared to reliability, varies more sharply (we couldn't resist the pun) with the value of P. At lead times longer than 24-h, the differences in forecast width between scenarios become noticeable. In all cases, the dressed ensembles result in narrower predictive distributions than the dressed deterministic forecasts. These differences are more pronounced at higher values of P. Note that while ‘narrowness’ is a desirable property, it is only valuable in a decision making context if the forecasts are also reliable.

[Figure 11 about here.]

[Figure 12 about here.]
4.5. Probabilistic Skill Scores

For the skill scores (Figure 13), the patterns are more important than the absolute values, as the baseline is unconditional climatology. The patterns of the Brier Skill Score are similar to those observed for other metrics: skill is highest for the largest basins and reduces with increasing forecast lead time. The BSS is generally very similar for both scenarios. Only in the case of St Pieter is there a consistent difference between the scenarios, with the dressed ensemble forecasts outperforming the dressed deterministic forecasts, but not beyond the range of sampling uncertainty.

[Figure 13 about here.]

The patterns of the mean CRPSS (Figure 14) are similar to those of the BSS. The difference is that the CRPSS improves with increasing P. This is understandable, because sample climatology is much less skilful at higher thresholds.

Often, the CRPSS is similar across the two scenarios. Again, any differences are more pronounced at longer forecast lead times and higher values of P, where the dressed ensemble forecasts are somewhat more skilful, particularly at St Pieter and Metz (but again, not beyond the range of sampling uncertainty).

[Figure 14 about here.]
4.6. Forecast value

Relative Operating Characteristic plots for the event defined by the exceedence of the 90th percentile of the observational record are shown in Figure 15. The plots show that, in all cases, the ROC curves for both scenarios are well above the diagonal, indicating that these forecasts improve upon climatology.

At the shortest lead time shown, the curves for the two scenarios are very similar. Differences, if any, increase with increasing forecast lead time. At longer forecast lead times, the dressed ensemble forecasts are slightly more discriminatory than the dressed deterministic forecasts.

The associated ROC scores (Figure 16) are very similar at most locations and forecast lead times; they generally decline with increasing forecast lead time and increase with threshold amount.

[Figure 15 about here.]

[Figure 16 about here.]

The Relative Economic Value also relies on the ability of a forecasting system to discriminate between events and non-events, but applies a cost-loss model to weight the consequences of particular actions (or inaction). In most cases, the REV of the dressed ensemble forecasts is similar to, or slightly higher than, that of the dressed deterministic forecasts for different values of the cost-loss ratio. Again, these differences are more pronounced at longer forecast lead times, higher thresholds, and larger values of the cost-loss ratio.

[Figure 17 about here.]

[Figure 18 about here.]
727
4.7. Analysis
728
The results show that, at P = 0, the dressed deterministic forecasts
729
improve (albeit only marginally) on the dressed ensemble forecasts in terms of
730
reliability and RME. However, the dressed ensemble forecasts are somewhat
731
sharper. On balance, the dressed ensemble forecasts have slightly better
732
RMSE and CRPSS scores.
733
The dressed ensemble forecasts are only slightly less reliable than the
734
dressed deterministic forecasts at the three Rhine locations Metz, Cochem
735
and Lobith. The differences are larger at St Pieter, where the dressed en-
736
semble forecasts show a substantial wet bias. In this context, the dressed
737
deterministic forecasts account for both the atmospheric and hydrologic un-
738
certainties and correct for biases via the quantile regression, whereas this
739
dressed ensemble forecasts do not account for under-dispersion of the mete-
740
orological forecasts. In other words: the dressing assumes that the meteoro-
741
logical uncertainties are well described by the ensembles whereas in reality
742
this isn’t necessarily so.
743
At P = 0, the fractional bias of the dressed deterministic forecasts is small
744
at all forecast lead times. This is understandable, because post-processing
745
techniques, such as quantile regression, are generally good at correcting for
746
unconditional biases and biases conditional upon forecast value/probability
747
(i.e. lack of reliability). 33
748
The dressed ensemble forecasts are sharper than the dressed deterministic forecasts. However, sharpness is only meaningful when the forecasts are also reliable. In the case of St Pieter at 0-h lead time, forecasts are infinitely sharp and of near perfect quality; this is due to the error correction just upstream of St Pieter.

This error correction introduces some erratic effects on the fractional bias at the shortest lead time (Figure 8) and a relatively steep jump to less-than-perfect skill at the lead times when the effect of the error correction has worn off. This is but one of the differences between the St Pieter catchment on the one hand and the Rhine catchments on the other. In general, the St Pieter results are more erratic and more pronounced in nature: a stronger dependency on lead time and a bigger difference in sharpness between the two scenarios. We believe this to be due to the nature of the basin, which is more complex (in terms of geology and topography) than the other basins, in combination with the fact that less information is available for the Meuse: its hydrometeorological monitoring network is considerably less dense.
At higher values of P, both sets of forecasts are consistently less reliable. The dressed deterministic forecasts show a 'dry bias', where the observed relative frequency (or quantile exceedence) is higher than the predicted probability. However, at St Pieter, this conditional dry bias is offset by an unconditional wet bias, leading to reasonably reliable forecasts at higher thresholds and early forecast lead times.

In general, the fractional negative bias (RME) increases with increasing threshold. This is consistent with the RME of the precipitation forecasts, which systematically underestimate the largest observed precipitation amounts (Verkade et al., 2013). At higher thresholds, the differences in RME between the two scenarios are similar in pattern to those in the unconditional sample. In other words, at St Pieter and Metz, the fractional bias of the dressed ensemble forecasts is smaller, while at Cochem and Lobith, the dressed deterministic forecasts show smaller fractional bias. Again, this is due to the RME in the precipitation forecasts.

The BSS, ROCS and REV, which assess the quality of the forecasts at predicting discrete events, are very similar between the two scenarios at all forecast lead times and all values of P. One exception is St Pieter, where the dressed ensemble forecasts improve somewhat on the dressed deterministic forecasts in terms of ROCS and REV, but not at the highest thresholds. At St Pieter and at these higher values of P, the dressed ensembles were both sharper and more reliable than the dressed deterministic forecasts.
5. Conclusions

We compared a source-based approach and a lumped approach for estimating predictive uncertainty in terms of various aspects of forecast quality, skill and value. The analysis shows that the dressed ensemble forecasts are sharper, but slightly less reliable, than the dressed deterministic forecasts. On balance, this results in skill and value that are very similar across the two scenarios, with the dressed ensemble forecasts improving slightly on the dressed deterministic forecasts at St Pieter and Metz and the reverse being true at Cochem and Lobith.
6. Discussion

While the analysis revealed quite similar results between the scenarios, further studies or different approaches to quantifying the various sources of uncertainty could reveal larger differences. For example, a larger hindcast dataset would help to reduce the sampling uncertainties and identify any marginal differences in forecast quality, as well as supporting an analysis of higher thresholds. The quantile regression technique could be configured in alternative ways (some configurations were tested by López López et al., 2014), or could be replaced by an alternative technique altogether. Alternative basins can be used for the experiment, and/or alternative ensemble NWP products.
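To make the quantile regression step concrete, the sketch below illustrates how conditional quantiles of the observation given a single-valued forecast can be estimated from a hindcast record. It is an illustration only, not the configuration used in this study: the study used the R package quantreg (Koenker, 2013), whereas the sketch uses Python's statsmodels, and the variable names (hindcast_forecasts, verifying_observations) are hypothetical.

```python
# Illustrative sketch of linear quantile regression post-processing,
# fitted per quantile tau on forecast/observation pairs.
import numpy as np
import statsmodels.api as sm

def fit_quantiles(forecast, observed,
                  taus=(0.05, 0.10, 0.25, 0.50, 0.75, 0.90, 0.95)):
    """Fit observed ~ a + b * forecast separately for each quantile tau."""
    X = sm.add_constant(np.asarray(forecast, float))  # intercept + slope
    return {tau: sm.QuantReg(np.asarray(observed, float), X).fit(q=tau).params
            for tau in taus}

# Hypothetical usage with a hindcast of J forecast/observation pairs:
# params = fit_quantiles(hindcast_forecasts, verifying_observations)
# q90 = params[0.90][0] + params[0.90][1] * new_forecast  # 90% quantile
```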
As noted earlier, the quality of the dressed streamflow ensembles is dependent on the quality of the raw streamflow ensembles, and therefore on the quality of the underlying NWP ensembles. Any lack of reliability will transfer to the dressed streamflow ensembles. Such meteorological biases could be addressed through meteorological post-processing or by accounting for the effects of under-dispersion on the streamflow forecasts. Conceivably, meteorological post-processing could address error structures that are specific to the forecast variable (i.e., precipitation, temperature or other), while hydrologic post-processing then addresses error structures specific to the hydrological forecasts. Presumably, the latter errors will have been reduced by the presence of meteorological post-processing techniques. Note that, in practice, a meteorological post-processing technique that is effective within the context of hydrological forecasting is yet to be found. For example, the approach taken by Verkade et al. (2013) did not result in improved hydrological forecasts, which was in line with the findings of Kang et al. (2010). Approaches such as those followed by Bennett et al. (2016), Schefzik (2016a,b) and Hu et al. (2016) appear to be promising.
In terms of choosing an approach, the results presented here are quite similar for both techniques. However, there are additional considerations, including the relative simplicity of dressing a deterministic forecast, data availability, and the expected additional information contributed by an ensemble of weather forecasts (versus a single forecast or the ensemble mean) in different contexts. As indicated by Pagano et al. (2014), combining ensemble and other forecasting technologies with the subjective experience of operational forecasters is an ongoing challenge in river forecasting.

Essentially, statistical modelling relies on the stationarity of the model errors or the ability to account for any non-stationarity with a reasonably simple model. In practice, however, basin hydrology changes over time with changes in climate and land use, among other things. The lumped approach cannot easily account for this, while the source-based approach may, using information about the individual sources of uncertainty, better isolate (and model) the causes of non-stationarity.

Also, by definition, extreme events are not well represented in the observational record, and they frequently change basin hydrology. Thus, for extreme events in particular, basing estimates of predictive uncertainty on the observational record is fraught with difficulty. In this context, ensemble approaches to propagating the forcing and other uncertainties through hydrological models should (assuming the models are physically reasonable) capture elements of the basin hydrology that are difficult to capture through purely statistical approaches.
7. Acknowledgements

The authors would like to thank Marc van Dijk at Deltares for running the historical simulations for the river Meuse. Florian Pappenberger at ECMWF provided the ECMWF-EPS reforecast data. The Water Management Centre of the Netherlands is thanked for allowing us to use an off-line version of the operational system RWsOS Rivers for this research. The "Team Expertise Meuse" at Rijkswaterstaat Zuid-Nederland provided valuable comments on the implementation of the ensemble dressing technique for St Pieter. The HEPEX community (www.hepex.org) is thanked for providing inspiration, and for bringing additional fun to doing research into predictive hydrological uncertainty. The digital elevation model used as a background map in Figure 3 was made available by the European Environment Agency under a Creative Commons Attribution License (www.eea.europa.eu). We are also grateful to no fewer than four anonymous referees for the time taken to review earlier versions of the manuscript, and to Konstantine Georgakakos and Hamid Moradkhani for acting as editors.
Appendix A. Verification metrics

For ease of reference, the probabilistic verification metrics used in this study are briefly explained; this description is partly based on those in Verkade and Werner (2011), Brown and Seo (2013) and Verkade et al. (2013). Additional details can be found in the documentation of the Ensemble Verification System (Brown et al., 2010) as well as in the reference works on forecast verification by Jolliffe and Stephenson (2012) and Wilks (2011).
Appendix A.1. Reliability plots

One desired property of probabilistic forecasts is that forecasted event probabilities f coincide with observed relative frequencies F. This is visualized in reliability plots, which plot the latter variable versus the former,

F_i = \frac{\sum (o \mid f = i)}{J_i},    (A.1)

where F is the observed relative frequency, i is one of eleven allowable values of the forecast f: i \in \{0, 0.1, 0.2, \ldots, 1\}, o is an indicator variable that is assigned a value of 1 if the event was observed and a value of 0 if the event was not observed, f is the forecasted probability of event occurrence and J_i is the total number of forecasts made within bin i (Wilks, 2011).
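As a minimal sketch of how the points of a reliability plot follow from Eq. (A.1), the snippet below bins forecast probabilities to the nearest allowable value i and computes the observed relative frequency per bin; the array names f and o mirror the notation above, and the binning-to-nearest-value convention is an assumption for illustration.

```python
import numpy as np

def reliability_points(f, o):
    """Eq. (A.1): observed relative frequency F_i and count J_i per
    allowable forecast value i in {0, 0.1, ..., 1}."""
    f = np.asarray(f, float)
    o = np.asarray(o, float)                 # 0/1 event indicators
    k = np.rint(f * 10).astype(int)          # bin index 0..10, i = k / 10
    pts = []
    for kk in range(11):
        mask = k == kk
        if mask.any():
            pts.append((kk / 10, o[mask].mean(), int(mask.sum())))
    return pts                               # list of (i, F_i, J_i)
```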
Appendix A.2. Sharpness

Here, sharpness is indicated by the width of the centred 80% interval of the predictive distribution,

w_j = S_{\tau=0.90,j} - S_{\tau=0.10,j},    (A.2)

for all J forecasts. Again, sharpness is determined for each lead time n separately, and the lead time indicators have been omitted from the above equation. The combined record w_{j=1,2,\ldots,J} is shown as an empirical cumulative distribution function.
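A short sketch of Eq. (A.2), assuming (purely for illustration) that the predictive distribution of each forecast is available as a row of ensemble members in a J-by-m array:

```python
import numpy as np

def interval_widths(ensemble):
    """Width of the centred 80% interval per forecast (Eq. A.2)."""
    # S_{tau=0.10,j} and S_{tau=0.90,j} from the empirical predictive cdf
    q10, q90 = np.percentile(ensemble, [10, 90], axis=1)
    return q90 - q10  # w_j, j = 1..J; its empirical cdf is the sharpness plot
```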
Appendix A.3. Correlation coefficient

The Correlation Coefficient (COR) is a measure of the statistical relationship between two variables. Here, the Pearson product–moment correlation coefficient is used: it measures the strength and direction of the linear relationship between two variables and is defined as the (sample) covariance of the variables divided by the product of their (sample) standard deviations,

\mathrm{COR}_{X,Y} = \frac{E[(X - \mu_X)(Y - \mu_Y)]}{\sigma_X \sigma_Y},    (A.3)

where E indicates an expected value, \mu is the mean and \sigma is the standard deviation.
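In sample form, Eq. (A.3) reduces to the familiar Pearson correlation; a one-function sketch:

```python
import numpy as np

def cor(x, y):
    """Sample form of Eq. (A.3); equivalent to np.corrcoef(x, y)[0, 1]."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    return np.mean((x - x.mean()) * (y - y.mean())) / (x.std() * y.std())
```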
Appendix A.4. Relative Mean Error

The Relative Mean Error (RME, sometimes called relative bias) measures the average difference between a set of forecasts and the corresponding observations, relative to the mean of the latter,

\mathrm{RME} = \frac{\sum_{j=1}^{J} \left( \bar{S}_j - q_j \right)}{\sum_{j=1}^{J} q_j},    (A.4)

where q is the observation and \bar{S} is the mean of the ensemble forecast. The RME thus provides a measure of relative, first-order bias in the forecasts. RME may be positive, zero, or negative. Insofar as the mean of the ensemble forecast should match the observed value, a positive RME denotes overforecasting and a negative RME denotes underforecasting. Zero RME denotes the absence of relative bias in the mean of the ensemble forecast.
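A sketch of Eq. (A.4), assuming the ensemble means and observations are held in equally long arrays:

```python
import numpy as np

def rme(ens_mean, obs):
    """Relative Mean Error (Eq. A.4): positive values denote overforecasting,
    negative values underforecasting, zero the absence of relative bias."""
    ens_mean, obs = np.asarray(ens_mean, float), np.asarray(obs, float)
    return np.sum(ens_mean - obs) / np.sum(obs)
```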
Appendix A.5. Brier skill score

The (half) Brier score (BS; Brier, 1950) measures the mean square error of J predicted probabilities that Q exceeds q,

\mathrm{BS} = \frac{1}{J} \sum_{j=1}^{J} \left( F_{Q_j}(q) - F_{S_j}(q) \right)^2,    (A.5)

where F_{S_j}(q) = \Pr[Q_j > q] and F_{Q_j}(q) = 1 if Q_j > q and 0 otherwise.

The Brier Skill Score (BSS) is a scaled representation of forecast quality that relates the quality of a particular system, BS, to that of a perfect system, BS_{perfect} (which is equal to 0), and to a reference system, BS_{ref},

\mathrm{BSS} = \frac{\mathrm{BS} - \mathrm{BS}_{\mathrm{ref}}}{\mathrm{BS}_{\mathrm{perfect}} - \mathrm{BS}_{\mathrm{ref}}}.    (A.6)

BSS ranges from -\infty to 1. The highest possible value is 1. If BSS = 0, the BS is as good as that of the reference system. If BSS < 0, the system's Brier score is worse than that of the reference system.
Appendix A.6. Mean Continuous Ranked Probability Skill Score

The Continuous Ranked Probability Score (CRPS) measures the integral square difference between the cumulative distribution function (cdf) of the forecast, F_S(q), and the corresponding cdf of the observed variable, F_Q(q),

\mathrm{CRPS} = \int_{-\infty}^{\infty} \left\{ F_S(q) - F_Q(q) \right\}^2 \mathrm{d}q.    (A.7)

The mean CRPS comprises the CRPS averaged across J pairs of forecasts and observations,

\overline{\mathrm{CRPS}} = \frac{1}{J} \sum_{j=1}^{J} \mathrm{CRPS}_j.    (A.8)

The Continuous Ranked Probability Skill Score (CRPSS) is a function of the ratio of the mean CRPS of the main prediction system, \overline{\mathrm{CRPS}}, and that of a reference system, \overline{\mathrm{CRPS}}_{\mathrm{ref}},

\mathrm{CRPSS} = \frac{\overline{\mathrm{CRPS}} - \overline{\mathrm{CRPS}}_{\mathrm{ref}}}{\overline{\mathrm{CRPS}}_{\mathrm{perfect}} - \overline{\mathrm{CRPS}}_{\mathrm{ref}}}.    (A.9)
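For an empirical (ensemble) predictive distribution, the integral in Eq. (A.7) can be evaluated through the equivalent expectation form CRPS = E|X - q| - 0.5 E|X - X'|, where X and X' are independent draws from the forecast distribution; a sketch under that identity:

```python
import numpy as np

def crps_ensemble(members, obs):
    """CRPS of one forecast (Eq. A.7) for an empirical cdf of ensemble members."""
    x = np.asarray(members, float)
    term1 = np.mean(np.abs(x - obs))                        # E|X - q|
    term2 = 0.5 * np.mean(np.abs(x[:, None] - x[None, :]))  # 0.5 E|X - X'|
    return term1 - term2

# Mean CRPS (Eq. A.8): average crps_ensemble over all J forecast/observation pairs.
```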
Appendix A.7. Relative Operating Characteristic score

The relative operating characteristic (ROC; Green and Swets, 1966) plots the hit rate versus the false alarm rate. These are calculated using the elements of a contingency table, which is valid for a single probabilistic decision rule,

[Table 2 about here.]

and are defined as follows:

\text{hit rate} = \frac{\#\ \text{hits}}{\#\ \text{observed events}} = \frac{h}{o}, \qquad \text{false alarm rate} = \frac{\#\ \text{false alarms}}{\#\ \text{events not observed}} = \frac{f}{o'}.    (A.10)

The ROC score measures the area under the ROC curve (AUC) after adjusting for randomness, i.e.,

\mathrm{ROCS} = 2 \times (\mathrm{AUC} - 0.5).    (A.11)
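A sketch of Eqs. (A.10) and (A.11), sweeping the probabilistic decision rule over a set of probability thresholds and integrating the resulting curve with the trapezoidal rule; the particular threshold spacing is an assumption for illustration:

```python
import numpy as np

def roc_score(prob_exceed, event):
    """ROCS (Eq. A.11) from forecast probabilities and 0/1 event indicators."""
    p = np.asarray(prob_exceed, float)
    e = np.asarray(event, int)
    hits, fars = [1.0], [1.0]                  # rule "always warn"
    for t in np.arange(0.0, 1.001, 0.1):       # one ROC point per decision rule
        warn = p > t
        hits.append(np.sum(warn & (e == 1)) / max(np.sum(e == 1), 1))
        fars.append(np.sum(warn & (e == 0)) / max(np.sum(e == 0), 1))
    auc = -np.trapz(hits, fars)                # curve runs from (1,1) to (0,0)
    return 2.0 * (auc - 0.5)
```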
Appendix A.8. Relative Economic Value

The Relative Economic Value (V [-]) of an imperfect warning system is defined as its value relative to the benchmark cases of No Warning (V = 0) and Perfect Forecasts (V = 1). REV can be less than 0 if the cost of false alarms is higher than the benefits attained by the warning system:

V = \frac{V_{\mathrm{noFWS}} - V_{\mathrm{FWS}}}{V_{\mathrm{noFWS}} - V_{\mathrm{perfect}}}.    (A.12)

Using the consequences of the contingency table elements (Table 2) and the cost-to-loss ratio r (Verkade and Werner, 2011), REV can be expressed as a function of r,

V = \frac{o - (h + f)\, r - m}{o\, (1 - r)}.    (A.13)
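Eq. (A.13) lends itself to direct implementation; here h, f, m and o are the contingency-table entries of Table 2 expressed as relative frequencies (with o = h + m), and the example values in the comment are hypothetical:

```python
import numpy as np

def relative_economic_value(h, f, m, o, r):
    """Eq. (A.13): REV as a function of the cost-loss ratio r.
    Equals 1 for perfect forecasts (h = o, f = m = 0) and 0 for no warnings."""
    return (o - (h + f) * r - m) / (o * (1.0 - r))

# e.g., a REV curve over r in (0, 1):
# r = np.linspace(0.01, 0.99, 99)
# v = relative_economic_value(h=0.08, f=0.04, m=0.02, o=0.10, r=r)
```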
References

Bennett, J. C., Wang, Q. J., Li, M., Robertson, D. E., Schepen, A., 2016. Reliable long-range ensemble streamflow forecasts: Combining calibrated climate forecasts with a conceptual runoff model and a staged error model. Water Resources Research 52 (10), 8238–8259. URL http://doi.wiley.com/10.1002/2016WR019193

Bergström, S., 1995. The HBV model. In: Singh, V. P. (Ed.), Computer Models of Watershed Hydrology. Water Resources Publications, Highlands Ranch, Colorado, United States, pp. 443–476.

Bogner, K., Pappenberger, F., 2011. Multiscale error analysis, correction, and predictive uncertainty estimation in a flood forecasting system. Water Resources Research 47 (7), W07524. URL http://www.agu.org/pubs/crossref/2011/2010WR009137.shtml

Bradley, A. A., Schwartz, S. S., 2011. Summary verification measures and their interpretation for ensemble forecasts. Monthly Weather Review 139 (9), 3075–3089. URL http://journals.ametsoc.org/doi/abs/10.1175/2010MWR3305.1

Bremnes, J. B., 2004. Probabilistic forecasts of precipitation in terms of quantiles using NWP model output. Monthly Weather Review 132, 338–347.

Brier, G., 1950. Verification of forecasts expressed in terms of probability. Monthly Weather Review 78, 1–3.

Bröcker, J., Smith, L. A., 2008. From ensemble forecasts to predictive distribution functions. Tellus A 60 (4), 663–678.

Broersen, P., Weerts, A., 2005. Automatic error correction of rainfall-runoff models in flood forecasting systems. In: Instrumentation and Measurement Technology Conference, 2005. IMTC 2005. Proceedings of the IEEE. Vol. 2, pp. 963–968.

Brown, J. D., Demargne, J., Seo, D.-J., Liu, Y., 2010. The Ensemble Verification System (EVS): a software tool for verifying ensemble forecasts of hydrometeorological and hydrologic variables at discrete locations. Environmental Modelling & Software 25 (7), 854–872.

Brown, J. D., Seo, D.-J., 2013. Evaluation of a nonparametric post-processor for bias correction and uncertainty estimation of hydrologic predictions. Hydrological Processes 27 (1), 83–105. URL http://doi.wiley.com/10.1002/hyp.9263

Chen, S., Yu, P., 2007. Real-time probabilistic forecasting of flood stages. Journal of Hydrology 340 (1–2), 63–77.

Coccia, G., Todini, E., 2011. Recent developments in predictive uncertainty assessment based on the model conditional processor approach. Hydrology and Earth System Sciences 15 (10), 3253–3274.

Demargne, J., Wu, L., Regonda, S., Brown, J., Lee, H., He, M., Seo, D.-J., Hartman, R., Herr, H. D., Fresch, M., Schaake, J., Zhu, Y., 2013. The science of NOAA's operational hydrologic ensemble forecast service. Bulletin of the American Meteorological Society. URL http://journals.ametsoc.org/doi/abs/10.1175/BAMS-D-12-00081.1

Fortin, V., Favre, A.-C., Said, M., 2006. Probabilistic forecasting from ensemble prediction systems: Improving upon the best-member method by using a different weight and dressing kernel for each member. Quarterly Journal of the Royal Meteorological Society 132 (617), 1349–1369.

Glahn, H. R., Lowry, D. A., 1972. The use of model output statistics (MOS) in objective weather forecasting. Journal of Applied Meteorology 11 (8), 1203–1211.

Green, D. M., Swets, J. A., 1966. Signal Detection Theory and Psychophysics. John Wiley & Sons, Inc., New York.

Hagedorn, R., 2008. Using the ECMWF reforecast dataset to calibrate EPS forecasts. ECMWF Newsletter 117, 8–13.

Hantush, M. M., Kalin, L., 2008. Stochastic residual-error analysis for estimating hydrologic model predictive uncertainty. Journal of Hydrologic Engineering 13 (7), 585–596.

Haylock, M., Hofstra, N., Klein Tank, A., Klok, E., Jones, P., New, M., 2008. A European daily high-resolution gridded data set of surface temperature and precipitation for 1950–2006. Journal of Geophysical Research 113, D20119.

Hu, Y., Schmeits, M. J., Jan van Andel, S., Verkade, J. S., Xu, M., Solomatine, D. P., Liang, Z., 2016. A stratified sampling approach for improved sampling from a calibrated ensemble forecast distribution. Journal of Hydrometeorology 17 (9), 2405–2417. URL http://journals.ametsoc.org/doi/10.1175/JHM-D-15-0205.1

Jolliffe, I. T., Stephenson, D. B. (Eds.), 2012. Forecast Verification: A Practitioner's Guide in Atmospheric Science, Second Edition. John Wiley & Sons. URL http://onlinelibrary.wiley.com/book/10.1002/9781119960003

Kang, T.-H., Kim, Y.-O., Hong, I.-P., 2010. Comparison of pre- and post-processors for ensemble streamflow prediction. Atmospheric Science Letters 11 (2), 153–159. URL http://doi.wiley.com/10.1002/asl.276

Kelly, K., Krzysztofowicz, R., 2000. Precipitation uncertainty processor for probabilistic river stage forecasting. Water Resources Research 36 (9).

Koenker, R., 2005. Quantile Regression. Cambridge University Press.

Koenker, R., 2013. quantreg: Quantile Regression. R package version 5.05. URL http://CRAN.R-project.org/package=quantreg

Koenker, R., Bassett Jr, G., 1978. Regression quantiles. Econometrica: Journal of the Econometric Society, 33–50.

Koenker, R., Hallock, K., 2001. Quantile regression. The Journal of Economic Perspectives 15 (4), 143–156.

Krzysztofowicz, R., 2002. Bayesian system for probabilistic river stage forecasting. Journal of Hydrology 268 (1–4), 16–40.

Krzysztofowicz, R., Kelly, K., 2000. Hydrologic uncertainty processor for probabilistic river stage forecasting. Water Resources Research 36 (11).

Lahiri, P., 2003. On the impact of bootstrap in survey sampling and small-area estimation. Statistical Science 18 (2), 199–210.

López López, P., Verkade, J., Weerts, A., Solomatine, D., 2014. Alternative configurations of quantile regression for estimating predictive uncertainty in water level forecasts for the upper Severn river: a comparison. Hydrology and Earth System Sciences Discussions 11, 3811–3855.

Montanari, A., Brath, A., 2004. A stochastic approach for assessing the uncertainty of rainfall-runoff simulations. Water Resources Research 40 (1), W01106.

Montanari, A., Grossi, G., 2008. Estimating the uncertainty of hydrological forecasts: A statistical approach. Water Resources Research 44 (12).

Murphy, A., 1985. Decision making and the value of forecasts in a generalized model of the cost-loss ratio situation. Monthly Weather Review 113 (3), 362–369.

Nielsen, H. A., Madsen, H., Nielsen, T. S., 2006. Using quantile regression to extend an existing wind power forecasting system with probabilistic forecasts. Wind Energy 9 (1–2), 95–108.

Pagano, T. C., Shrestha, D. L., Wang, Q. J., Robertson, D., Hapuarachchi, P., 2013. Ensemble dressing for hydrological applications. Hydrological Processes.

Pagano, T. C., Wood, A. W., Ramos, M.-H., Cloke, H. L., Pappenberger, F., Clark, M. P., Cranston, M., Kavetski, D., Mathevet, T., Sorooshian, S., Verkade, J. S., 2014. Challenges of operational river forecasting. Journal of Hydrometeorology. URL http://journals.ametsoc.org/doi/abs/10.1175/JHM-D-13-0188.1

Photiadou, C. S., Weerts, A. H., van den Hurk, B. J. J. M., 2011. Evaluation of two precipitation data sets for the Rhine river using streamflow simulations. Hydrology and Earth System Sciences 15 (11), 3355–3366. URL http://www.hydrol-earth-syst-sci.net/15/3355/2011/

Politis, D. N., Romano, J. P., 1994. The stationary bootstrap. Journal of the American Statistical Association 89 (428), 1303–1313.

R Core Team, 2013. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. URL http://www.R-project.org/

Raftery, A., Gneiting, T., Balabdaoui, F., Polakowski, M., 2005. Using Bayesian model averaging to calibrate forecast ensembles. Monthly Weather Review 133 (5), 1155–1174.

Ramos, M., Voisin, N., Verkade, J., 2013. HEPEX Science and Implementation Plan: hydro-meteorological post-processing. URL http://hepex.irstea.fr/science-and-implementation-plan/

Reggiani, P., Renner, M., Weerts, A., van Gelder, P., 2009. Uncertainty assessment via Bayesian revision of ensemble streamflow predictions in the operational river Rhine forecasting system. Water Resources Research 45 (2).

Reggiani, P., Weerts, A., 2008. A Bayesian approach to decision-making under uncertainty: An application to real-time forecasting in the river Rhine. Journal of Hydrology 356 (1–2), 56–69.

Regonda, S. K., Seo, D.-J., Lawrence, B., Brown, J. D., Demargne, J., 2013. Short-term ensemble streamflow forecasting using operationally-produced single-valued streamflow forecasts – a hydrologic model output statistics (HMOS) approach. Journal of Hydrology 497, 80–96. URL http://linkinghub.elsevier.com/retrieve/pii/S0022169413003958

Roscoe, K. L., Weerts, A. H., Schroevers, M., 2012. Estimation of the uncertainty in water level forecasts at ungauged river locations using quantile regression. International Journal of River Basin Management 10 (4), 383–394.

Roulston, M. S., Smith, L. A., 2003. Combining dynamical and statistical ensembles. Tellus A 55 (1), 16–30. URL http://onlinelibrary.wiley.com/doi/10.1034/j.1600-0870.2003.201378.x/abstract

Schefzik, R., 2016a. Combining parametric low-dimensional ensemble postprocessing with reordering methods. Quarterly Journal of the Royal Meteorological Society. URL http://doi.wiley.com/10.1002/qj.2839

Schefzik, R., 2016b. A similarity-based implementation of the Schaake Shuffle. Monthly Weather Review 144 (5), 1909–1921. URL http://journals.ametsoc.org/doi/10.1175/MWR-D-15-0227.1

Seo, D., Herr, H., Schaake, J., 2006. A statistical post-processor for accounting of hydrologic uncertainty in short-range ensemble streamflow prediction. Hydrology and Earth System Sciences Discussions 3 (4), 1987–2035.

Solomatine, D. P., Shrestha, D. L., 2009. A novel method to estimate model uncertainty using machine learning techniques. Water Resources Research 45 (12).

Todini, E., 2008. A model conditional processor to assess predictive uncertainty in flood forecasting. International Journal of River Basin Management 6 (2), 123–137.

Unger, D., van den Dool, H., O'Lenic, E., Collins, D., 2009. Ensemble regression. Monthly Weather Review 137 (7), 2365–2379.

Verkade, J., Brown, J., Reggiani, P., Weerts, A., 2013. Post-processing ECMWF precipitation and temperature ensemble reforecasts for operational hydrologic forecasting at various spatial scales. Journal of Hydrology 501, 73–91. URL http://linkinghub.elsevier.com/retrieve/pii/S0022169413005660

Verkade, J. S., Werner, M. G. F., 2011. Estimating the benefits of single value and probability forecasting for flood warning. Hydrology and Earth System Sciences 15 (12), 3751–3765. URL http://www.hydrol-earth-syst-sci.net/15/3751/2011/

Weerts, A., Winsemius, H., Verkade, J., 2011. Estimation of predictive hydrological uncertainty using quantile regression: examples from the national flood forecasting system (England and Wales). Hydrology and Earth System Sciences 15 (1), 255–265. URL http://www.hydrol-earth-syst-sci.net/15/255/2011/

Werner, M., Schellekens, J., Gijsbers, P., van Dijk, M., van den Akker, O., Heynert, K., 2013. The Delft-FEWS flow forecasting system. Environmental Modelling & Software 40, 65–77. URL http://linkinghub.elsevier.com/retrieve/pii/S1364815212002083

Wilks, D., 2011. Statistical Methods in the Atmospheric Sciences. Academic Press, Oxford, United Kingdom.

Zhao, L., Duan, Q., Schaake, J., Ye, A., Xia, J., 2011. A hydrologic post-processor for ensemble streamflow predictions. Advances in Geosciences 29, 51–59. URL http://www.adv-geosci.net/29/51/2011/

Zhu, Y., Toth, Z., Wobus, R., Richardson, D., Mylne, K., 2002. The economic value of ensemble-based weather forecasts. Bulletin of the American Meteorological Society 83 (1), 73–83.
Table 1: Summary of characteristics of the various types of model runs used in the present manuscript.

  Type of model run                                    | Forcings                           | Characterised by lead time? | Use
  Simulations                                          | Observed meteorological parameters | No                          | Estimation of hydrological uncertainty in cases where no data assimilation is applied
  Perfect forcing (hydrological) forecasts             | Observed meteorological parameters | Yes                         | Estimation of hydrological uncertainty in cases where data assimilation is applied
  Deterministic and ensemble (hydrological) forecasts  | Meteorological forecasts           | Yes                         | These comprise the raw hydrological forecasts
Table 2: Contingency table. The consequences of the items listed are in brackets. Taken from Verkade and Werner (2011).

                       | Event observed             | Event NOT observed               | Sum
  Warning issued       | hits h (C + Lu)            | false alarms f (C)               | w
  Warning NOT issued   | missed events m (La + Lu)  | quiets/correct negatives q (−)   | w′
  Sum                  | o                          | o′                               | N
Figure 1: Schematic representation of the dressing procedures. The vertical red line denotes the issue time of the forecast, with the observations (black dots) in the past and the raw forecasts (blue lines) in the future.

Figure 2: Sub-sample size as a function of the probability P of non-exceedence of the observation in the climatological record.

Figure 3: Map of the Meuse and Rhine basins and the forecasting locations that are considered in this manuscript.
Figure 4: Distribution of streamflow observations at the forecasting locations considered in this study.
Figure 5: Quantile Regression plots for St Pieter for deterministic (top row) and perfect forcing (bottom row) forecasts with 24-h, 72-h and 168-h forecasts (columns). The blue lines indicate the estimated 5%, 10%, 25%, 75%, 90% and 95% quantiles; the red line indicates the median quantile. As a reminder, the reader is referred to section 2.5 for a description of how simulations differ from forecasts and perfect forcings forecasts.
Figure 6: Quantile Regression plots for Metz (top), Cochem (middle) and Lobith (bottom) for simulations (leftmost column) and deterministic forecasts with 24-h, 72-h and 168-h forecasts (rightmost three columns). The blue lines indicate the estimated 5%, 10%, 25%, 75%, 90% and 95% quantiles; the red line indicates the median quantile. As a reminder, the reader is referred to section 2.5 for a description of how simulations differ from forecasts and perfect forcings forecasts.
Figure 7: Sample forecasts of the scenarios: deterministic and ensemble based forecasts in top and bottom row, respectively. The vertical red lines on the horizontal time axis indicate forecast issue time. Observed values are indicated by blue dots. Note that the lines for the raw forecasts represent one or multiple traces, whereas the lines for the dressed forecasts represent quantiles.
Figure 8: Relative Mean Error (RME) as a function of lead time for several subsamples of the verification pairs (columns) and several locations (rows). Note that the vertical scales used in the various columns differ. The shading shows the width of the intervals obtained by bootstrapping the verification metric (see section 2.7 for details).
Figure 9: Reliability plots for various lead times (columns) for several locations (rows). The plot is conditional on the observation exceeding the 50th percentile of the climatological exceedence probability (i.e., P = 0.50).
Figure 10: Reliability plots for various lead times (columns) for several locations (rows). The plot is conditional on the observations exceeding the 90th percentile of the climatological exceedence probability (i.e., P = 0.90).
Figure 11: Sharpness plots for various lead times (columns) for several locations (rows). The plot is unconditional, i.e. for the full data sample (i.e., P = 0).
Figure 12: Sharpness plots for various lead times (columns) for several locations (rows). The plot is conditional on the observations exceeding the 90th percentile of the climatological exceedence probability (i.e., P = 0.90).
Figure 13: Brier Skill Score (BSS) as a function of lead time for several events (columns) and several locations (rows). The shading shows the width of the intervals obtained by bootstrapping the verification metric (see section 2.7 for details).
Figure 14: Mean Continuous Ranked Probability Skill Score (CRPSS) as a function of lead time for several subsamples of the verification pairs (columns) and several locations (rows). The shading shows the width of the intervals obtained by bootstrapping the verification metric (see section 2.7 for details).
Figure 15: ROC plots for various lead times (columns) for several locations (rows). The plot is for the event that the posterior water level exceeds the 90th percentile of the climatological exceedence probability (i.e., P = 0.90). The shading shows the width of the intervals obtained by bootstrapping the verification metric (see section 2.7 for details).
Figure 16: Relative Operating Characteristic Score (ROCS) as a function of lead time for several subsamples of the verification pairs (columns) and several locations (rows).
Figure 17: Value plots for various lead times (columns) for several locations (rows). The plot is conditional on the observations exceeding the 50th percentile of the climatological exceedence probability (i.e., P = 0.50).
Figure 18: Value plots for various lead times (columns) for several locations (rows). The plot is conditional on the observations exceeding the 90th percentile of the climatological exceedence probability (i.e., P = 0.90).