Estimating predictive hydrological uncertainty by dressing deterministic and ensemble forecasts; a comparison, with application to Meuse and Rhine

J.S. Verkade a,b,c,∗, J.D. Brown d, F. Davids a,b, P. Reggiani e, A.H. Weerts a,f

a Deltares, PO Box 177, 2600 MH Delft, The Netherlands, +31.88.335.8348 (tel), +31.88.335.8582 (fax)
b Ministry of Infrastructure and the Environment, Water Management Centre of The Netherlands, River Forecasting Service, Lelystad, The Netherlands
c Delft University of Technology, Delft, The Netherlands
d Hydrologic Solutions Limited, Southampton, United Kingdom
e University of Siegen, Research Institute for Water and Environment, Siegen, Germany
f Wageningen University and Research Centre, Hydrology and Quantitative Water Management Group, Wageningen, The Netherlands

∗ Corresponding author. Email address: [email protected] (J.S. Verkade)

Journal of Hydrology (2017), doi: 10.1016/j.jhydrol.2017.10.024. Received 17 January 2017; revised 29 July 2017; accepted 8 October 2017.
Abstract

Two statistical post-processing approaches for estimation of predictive hydrological uncertainty are compared: (i) 'dressing' of a deterministic forecast by adding a single, combined estimate of both hydrological and meteorological uncertainty and (ii) 'dressing' of an ensemble streamflow forecast by adding an estimate of hydrological uncertainty to each individual streamflow ensemble member. Both approaches aim to produce an estimate of the 'total uncertainty' that captures both the meteorological and hydrological uncertainties. They differ in the degree to which they make use of statistical post-processing techniques. In the 'lumped' approach, both sources of uncertainty are lumped by post-processing deterministic forecasts using their verifying observations. In the 'source-specific' approach, the meteorological uncertainties are estimated by an ensemble of weather forecasts. These ensemble members are routed through a hydrological model and a realization of the probability distribution of hydrological uncertainties (only) is then added to each ensemble member to arrive at an estimate of the total uncertainty. The techniques are applied to one location in the Meuse basin and three locations in the Rhine basin. Resulting forecasts are assessed for their reliability and sharpness, as well as compared in terms of multiple verification scores including the relative mean error, Brier Skill Score, Mean Continuous Ranked Probability Skill Score, Relative Operating Characteristic Score and Relative Economic Value. The dressed deterministic forecasts are generally more reliable than the dressed ensemble forecasts, but the latter are sharper. On balance, however, they show similar quality across a range of verification metrics, with the dressed ensembles coming out slightly better. Some additional analyses are suggested. Notably, these include statistical post-processing of the meteorological forecasts in order to increase their reliability, thus increasing the reliability of the streamflow forecasts produced with ensemble meteorological forcings.

Keywords: Quantile Regression, hydrological forecasting, predictive uncertainty, ensemble dressing, statistical post-processing

1. Introduction

The future value of hydrological variables is inherently uncertain. Forecasting may reduce, but cannot eliminate, this uncertainty. Informed, forecast-sensitive decision making is aided by adequate estimation of the remaining uncertainties (see, for example, Verkade and Werner 2011 and the references therein). Omission of relevant uncertainties would result in overconfident forecasts; hence all relevant uncertainties must be addressed in the estimation procedure. These include uncertainties related to the modelling of the streamflow generation and routing processes (jointly referred to as "hydrological uncertainties") and uncertainties related to future atmospheric forcing ("meteorological uncertainties"). Generally speaking, the total uncertainty can be estimated by separately modelling the meteorological and hydrological uncertainties or by lumping all uncertainties together (cf. Regonda et al. 2013).

The source-specific approach identifies the relevant sources of uncertainty and models these individually before integrating them into an estimate of the total uncertainty. In this context, the hydrologic uncertainties may be treated separately (independently) from the meteorological uncertainties, because they depend only on the quality of the hydrologic modelling. This approach has been followed by, among others, Kelly and Krzysztofowicz (2000), Krzysztofowicz (2002), Krzysztofowicz and Kelly (2000), Seo et al. (2006) and Demargne et al. (2013). The approach has a number of attractive characteristics. The individual sources of uncertainty may each have a different structure, which can be specifically addressed by separate techniques. Also, some of the uncertainties vary in time, while others are time invariant. A disadvantage of source-based modelling is that developing uncertainty models for each source separately may be expensive, both in terms of the development itself as well as in terms of computational cost. Also, whether modelling the total uncertainty as a lumped contribution or separately accounting for the meteorological and hydrological uncertainties, hydrological forecasts will inevitably contain residual biases in the mean, spread and higher moments of the forecast probability distributions, for which statistical post-processing is important.

In the lumped approach, a statistical technique is used to estimate the future uncertainty of streamflow conditionally upon one or more predictors, which may include a deterministic forecast. Underlying this approach is an assumption that the errors associated with historical predictors and predictions are representative of those in the future. This approach is widely used in atmospheric forecasting, where it is commonly known as Model Output Statistics (MOS; Glahn and Lowry, 1972). Several applications of the lumped approach in hydrology can be found in the literature. These include the Reggiani and Weerts (2008) implementation of the Hydrologic Uncertainty Processor (Kelly and Krzysztofowicz, 2000), the Model Conditional Processor (Todini, 2008; Coccia and Todini, 2011), Quantile Regression (Weerts et al., 2011; Verkade and Werner, 2011; López López et al., 2014), the "uncertainty estimation based on local errors and clustering" approach (UNEEC; Solomatine and Shrestha, 2009) and the Hydrologic Model Output Statistics approach (HMOS; Regonda et al., 2013). For a complete overview that is periodically updated, see Ramos et al. (2013). These techniques each estimate the total uncertainty in future streamflow conditionally upon one or more predictors, including the deterministic forecast. Of course, they vary in their precise formulation and choice of predictors. The lumped approach is attractive for its simplicity, both in terms of development and computational costs. The main disadvantage of the approach is that both meteorological and hydrological uncertainties are modeled together via the streamflow forecast, which assumes an aggregate structure for the modeled uncertainties (although the calibration may be different for particular ranges of streamflow). Also, in order to produce ensemble traces, these techniques must explicitly account for the temporal autocorrelations in future streamflow, which may not follow a simple (e.g. autoregressive) form.

In the text above, the source-specific and the lumped approach were presented as separate strategies. However, as the source-based approach may not fully account for all sources of uncertainty, statistical post-processing is frequently used to correct for residual biases in ensemble forecasts. In the present work, an intermediate approach is described, namely the 'dressing' of streamflow ensemble forecasts. Here, the meteorological uncertainties are estimated by an ensemble of weather forecasts. The remaining, hydrological uncertainties are lumped and described statistically. Subsequently, the streamflow ensemble members are, cf. Pagano et al. 2013, 'dressed' with the hydrological uncertainties. This approach has previously been taken by, among others, Reggiani et al. (2009), Bogner and Pappenberger (2011) and Pagano et al. (2013) and, in meteorological forecasting, by Fortin et al. (2006), Roulston and Smith (2003) and Unger et al. (2009). Most of these studies report skill of the dressed ensembles versus that of climatology; Pagano et al. (2013) explored the gain in skill when moving from raw to dressed ensembles and found this gain to be significant. In contrast, the present study compares dressed ensemble forecasts to post-processed single-valued streamflow forecasts.

The kernel dressing approach is akin to kernel density smoothing, whereby missing sources of uncertainty (i.e. dispersion) are introduced by dressing the individual ensemble members with probability distributions and averaging these distributions (cf. Bröcker and Smith, 2008). As ensemble dressing aims to account for additional sources of dispersion not already represented in the ensemble forecasts, a "best member" interpretation is often invoked (Roulston and Smith, 2003). Here, the width of the dressing kernel is determined by the historical errors of the best ensemble member. The resulting distribution is then applied to each ensemble member of an operational forecast and the final predictive distribution is given by the average of the individual distributions. In this context, ensemble dressing has some similarities to Bayesian Model Averaging (BMA; see Raftery et al. 2005 for a discussion).

In the ensemble dressing approach, one highly relevant source of uncertainty, namely the weather forecasts, is described using an ensemble Numerical Weather Prediction (NWP) model. This NWP model takes into account current initial conditions of the atmosphere and exploits the knowledge of physical processes of the atmosphere embedded in the NWP model, as well as any meteorological observations that are assimilated to improve estimates of the predicted states. The hydrologic uncertainties, which may originate from the hydrologic model parameters and structure (among other things), are then lumped, modelled statistically, and integrated with the meteorological contribution to the streamflow.

The objective of this work is to compare the quality and skill of the forecasts created through dressing of deterministic streamflow forecasts and through dressing of ensemble streamflow forecasts. A priori, the dressed ensemble forecasts are expected to have higher skill than the dressed deterministic forecasts. Both account for the effects of all relevant sources of uncertainty on the streamflow forecasts. However, in the ensemble case, the estimate of atmospheric uncertainties is based on knowledge of the physical system and its state at the issue time of a forecast, whereas this knowledge is unused in the lumped approach. Nevertheless, the lumped approach accounts for any residual meteorological biases via the streamflow.

The context for this study is an operational river forecasting system used by the Dutch river forecasting service. This system models the total uncertainty in the future streamflow using a lumped approach, whereby a deterministic streamflow forecast is post-processed through quantile regression (following a procedure similar to that in Weerts et al. 2011). While this module performs reasonably well, there is a desire among operational forecasters to explore the benefits of (and account for) information in ensemble weather predictions, including information beyond the ensemble mean. This resulted in the operational implementation of the ensemble dressing approach using the same statistical technique (quantile regression). Thus, estimates of the meteorological uncertainties, which were previously modeled indirectly (i.e. lumped into the total uncertainty), are now disaggregated and included separately in the streamflow forecasts. This raises the question of whether the 'new' approach indeed increases forecast skill.

The novel aspects and new contributions of this work include (i) a direct comparison between the quality of the dressed deterministic forecasts and the dressed ensemble forecasts; (ii) the application of quantile regression to account for the hydrologic uncertainties; and (iii) the application of the dressing technique to dynamic ensemble streamflow forecasts.

This paper is organised as follows. In the next section, the study approach is detailed, followed by a description of the study basins in section 3. In section 4, the results of the experiments are presented and analysed. In section 5, conclusions are drawn and discussed.

2. Approach

2.1. Scenarios

The present study consists of an experiment in which verification results in two scenarios are inter-compared: dressed deterministic forecasts and dressed ensemble forecasts. These are tested in multiple cases, that is, combinations of forecasting locations and lead times.

2.2. Cross-validation

The results from the experiments that are described below are cross-validated by means of a leave-one-year-out analysis. The set of available data is divided into N available years. Models are trained using N − 1 years and validated on the remaining year. The experiment is carried out N times, each time with a different year chosen as the validation year. Thus, the validation record that is verified comprises N years of data that is obtained independently from the training data, and optimal use is made of the available length of record.
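As an illustration, the scheme can be written down compactly in R, the software environment used for the post-processing (see section 2.6). This is a minimal sketch only: the data frame pairs and its column names are hypothetical, and a single median regression stands in for the full set of quantiles fitted in section 2.6.

# Leave-one-year-out cross-validation, sketched in R. Assumes a data frame
# `pairs` with columns `year`, `s` (forecast) and `q` (observation).
library(quantreg)

years <- unique(pairs$year)
validated <- lapply(years, function(y) {
  train <- subset(pairs, year != y)             # N - 1 training years
  valid <- subset(pairs, year == y)             # the single held-out year
  fit   <- rq(q ~ s, tau = 0.5, data = train)   # train the uncertainty model
  valid$q50 <- predict(fit, newdata = valid)    # apply it to the held-out year
  valid
})
# concatenating the N held-out years yields the independent validation record
validation_record <- do.call(rbind, validated)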

2.3. Hindcasting

The experiment comprises the process of hindcasting or reforecasting: forecasts are produced for periods of time that are currently in the past, while only using data that was available at the time of forecast (issue time, also called reference time or time zero).

Hindcasts are produced using an offline version of the Delft-FEWS (Werner et al., 2013) based forecast production system "RWsOS Rivers" that is used by the Water Management Centre of The Netherlands for real-time, operational hydrological forecasting.

Hindcasting is a two-step process: first, the hydrological models are forced with observed temperature and precipitation for a period up to the forecast initialisation time. Thus, the internal model states reflect the basin's actual initial conditions as closely as possible. These initial states are used as the starting point for forecast runs, where the models are forced with the precipitation and temperature ensemble NWP forecasts. These are imported into the forecast production system as gridded forecasts and subsequently spatially aggregated to sub-basins. Basin-averaged precipitation or temperature is then equal to the weighted average of the grid boxes within that basin, where the weighting accounts for partial inclusion of grid boxes.

In the case of the St Pieter streamflow forecasts, an autoregressive-moving-average (ARMA) error-correction procedure was used to correct for biases in the raw streamflow forecasts and hence any residual biases contributed by the forcing (see below). This is in line with the procedures used in the operational system. The autoregressive (AR) correction scheme used here is based on estimation of the AR parameters using Burg's method and ARMASel (see Broersen and Weerts, 2005, and the references therein). The procedure is implemented in Delft-FEWS (Werner et al., 2013) and can be configured as self-calibrating (automated selection of order and/or AR parameter values) or with fixed AR parameters for post-processing of hydrological forecasts. Here, the latter option was chosen.

The effect of error correction is that forecast uncertainty will be reduced to zero at zero lead time; with increasing lead time, this uncertainty reduction will 'phase out'. To account for this in the streamflow hindcasts, error correction was also used in simulation mode. Effectively, the hindcasting procedure comprised the re-production of forecasts where the models were forced with precipitation measurements instead of precipitation forecasts. This introduces a lead time dependence in the quality of the streamflow simulations. This is described in more detail in Section 2.5.

2.4. ‘Dressing’ of streamflow forecasts

The dressing technique is similar across the lumped and the source-specific approaches in that the forecasts are dressed with predictive distributions of uncertainties that are not already explicitly addressed in the raw forecasts. Thus, deterministic hydrological forecasts are dressed with a predictive distribution that comprises both meteorological and hydrological uncertainties, and hydrological ensemble forecasts are dressed with a predictive distribution that comprises hydrological uncertainties only. In both approaches, the total uncertainty is computed by averaging over the number of ensemble members E (where E = 1 in the case of deterministic forecasts),

\Phi_n(q_n \mid s_{n,1}, s_{n,2}, \ldots, s_{n,E}) = \frac{1}{E} \sum_{e=1}^{E} \phi_n(q_n \mid s_{n,e}),   (1)

where Φ is the aggregated distribution of observed streamflow q at lead time n, conditional on the raw streamflow forecast s that consists of ensemble members e ∈ {1, ..., E}, each of which is dressed with distribution φ.

In the ensemble dressing scenario, this means that each of the ensemble members is dressed with a predictive distribution of hydrological uncertainty, and that these multiple distributions are averaged to obtain a single distribution of predictive uncertainty. Note that here, we assume that the ensemble members are equiprobable (which generally applies to atmospheric ensembles generated from a single model, but not necessarily to multi-model ensembles, for example). If the members are not equiprobable, then a weighting can easily be introduced.
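The averaging in equation (1) can be sketched as follows. The code is illustrative only: it uses a placeholder Gaussian dressing kernel, whereas in this study the kernels are the quantile-regression distributions of section 2.6, and all names are hypothetical. The quantiles of the averaged distribution Φ are approximated by pooling equally sized quantile samples drawn from the E member distributions.

# Sketch of equation (1): average the dressing distributions of E members.
taus <- seq(0.01, 0.99, by = 0.02)   # the 50 quantiles used in section 2.6

dress <- function(s, taus) {
  # placeholder kernel: a member-centred Gaussian; the study instead derives
  # these quantiles from quantile regression (section 2.6)
  s + qnorm(taus, mean = 0, sd = 0.1 * s)
}

members <- c(810, 870, 905)          # E = 3 raw streamflow members (m3/s)
pooled  <- unlist(lapply(members, dress, taus = taus))
Phi     <- quantile(pooled, probs = taus)   # quantiles of the averaged Phi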

Here, the distribution Φ aims to capture the historical residuals between the observed and simulated streamflows (i.e. streamflows produced with observed forcing). A "best member" interpretation does not apply here, because the dressing kernel aims to capture a new source of uncertainty (the hydrologic uncertainty) and not to account for under-dispersion in (the hydrologic effects of) the meteorological uncertainties. In short, we assume that the meteorological ensembles are unbiased and correctly dispersed. This is a necessary assumption: uncertainty originating in future weather is estimated only by the spread of the streamflow ensemble and not in any way through statistical post-processing, nor does post-processing in any way bias-correct the streamflow ensemble to better account for meteorological uncertainty. This constitutes a disclaimer to the ensemble dressing approach: if the assumption of a correctly dispersed meteorological ensemble is invalid, then this will have an adverse effect on the quality of the dressed ensemble.

By construction, the raw deterministic forecasts are dressed with a single distribution only, which aims to account for the total uncertainty of the future streamflow, not the residual uncertainty of a best ensemble member,

\Phi_n(q_n \mid s_{n,1}) = \phi_n(q_n \mid s_{n,1}).   (2)

The dressing procedures are schematically visualised in Figure 1.

[Figure 1 about here.]

2.5. Uncertainty models

As mentioned above, the source-specific and lumped approaches differ in the predictive distributions with which the raw forecasts are dressed. In the lumped approach, the deterministic forecast is dressed with a distribution of both hydrological and atmospheric uncertainties, conditional on the value of the deterministic forecast itself. The "errors" in the deterministic forecast are thus a measure of uncertainties originating in both the meteorological forcing and the hydrological modeling.

Ensemble streamflow forecasts are dressed using a predictive distribution of hydrological uncertainty only. This is achieved by fitting a probability distribution to the historical residuals between the hydrological model simulations and observations (see section 2.6 for details on how this was done). The simulations are derived by forcing a hydrological model with observed precipitation and temperature. As such, the simulations are independent of lead time. This approach is quite widely reported in the literature, for example by Montanari and Brath (2004), Seo et al. (2006), Chen and Yu (2007), Hantush and Kalin (2008), Montanari and Grossi (2008), Todini (2008), Bogner and Pappenberger (2011), Zhao et al. (2011) and Brown and Seo (2013).

Time invariant estimates of hydrological uncertainty using these hydrological simulations, however, do not take into account any error correction or data assimilation procedures that, in a real-time setting, reduce predictive uncertainty. In the Meuse at St Pieter case, such an error correction method is an integral component of the forecasting system. Hence, in the Meuse case, hydrological uncertainty is not based on observations and simulations, but on observations and "perfect forcing hindcasts". Similar to the forecasts produced by the operational system, these hindcasts benefit from the ARMA error correction procedure that is applied to streamflow forecasts at St Pieter at the onset of every hindcast. However, the hindcasts are forced with observed precipitation and streamflow. Hence the hindcasting record is similar to a simulation record, but with the added benefit of an ARMA error correction. This introduces a lead time dependency in the skill of the hindcast. At zero lead time, the hindcast has perfect skill; with increasing lead time, forecast errors increase in magnitude and skill deteriorates. These resulting forecast errors are largely due to hydrological uncertainties. When the effect of the ARMA procedure has worn off, forecast skill reduces to that of the hydrological simulations. The procedure is similar to that followed by Bogner and Pappenberger (2011); an alternative approach would be to use the latest available observation as a predictor in the uncertainty model; this approach was taken by, among others, Seo et al. (2006).

As the experimental setup is quite complex, Table 1 should be helpful in distinguishing the various types of model runs from each other.

[Table 1 about here.]

2.6. Quantile Regression

In the present paper, in both scenarios, uncertainty is estimated using Quantile Regression (QR; Koenker and Bassett Jr, 1978; Koenker and Hallock, 2001; Koenker, 2005). QR is a regression technique for estimating the quantiles of a conditional distribution. As the sought relations are conditional quantiles rather than conditional means, quantile regression is robust with regard to outliers. Quantile Regression does not make any prior assumptions regarding the shape of the distribution; in that sense, it is a non-parametric technique. It is, however, highly parametric in the sense that, for every quantile of interest, parameters need to be estimated. QR was developed within the economic sciences, and is increasingly used in the environmental sciences (see, for example, Bremnes 2004; Nielsen et al. 2006). Specific applications in the hydrological sciences include Weerts et al. (2011), Roscoe et al. (2012) and López López et al. (2014).

In the present work, Quantile Regression is used to estimate lead time n specific conditional distributions of streamflow,

\phi_n = \{Q_{n,\tau_1}, Q_{n,\tau_2}, \ldots, Q_{n,\tau_T}\}   (3)

where T is the number of quantiles τ (τ ∈ [0, 1]) considered. If T is sufficiently large and the quantiles τ cover the domain [0, 1] sufficiently well, we consider φ_n to be a continuous distribution. Here, T = 50 and τ ∈ {0.01, 0.03, ..., 0.99},

\phi_n = \{Q_{n,\tau=0.01}, Q_{n,\tau=0.03}, \ldots, Q_{n,\tau=0.99}\}.   (4)

We assume that, cf. Weerts et al. (2011), separately for every lead time n considered and for every quantile τ, there is a linear relationship between the streamflow forecast S and its verifying observation Q,

Q_{n,\tau} = a_{n,\tau} S_n + b_{n,\tau}   (5)

where a_{n,τ} and b_{n,τ} are the slope and intercept from the linear regression. Quantile Regression allows for finding the parameters a_{n,τ} and b_{n,τ} of this linear regression by minimising the sum of weighted residuals,

\min_{a_{n,\tau},\, b_{n,\tau}} \sum_{j=1}^{J} \rho_{n,\tau}\left(q_{n,j} - (a_{n,\tau} s_{n,j} + b_{n,\tau})\right)   (6)

where ρ_{n,τ} is the quantile regression weight for the τth quantile, q_{n,j} and s_{n,j} are the jth paired samples from a total of J samples, and a_{n,τ} and b_{n,τ} the regression parameters from the linear regression between streamflow forecast and observation. By varying the value of τ, the technique allows for describing the entire conditional distribution.

In the present work, equation (6) was solved using the quantreg package (Koenker, 2013) in the R software environment (R Core Team, 2013). Figure 5 and Figure 6 show the joint distributions of forecast–observation pairs as well as a selection of estimated quantiles; these plots are discussed in the Results and analysis section.
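In practice, fitting equation (6) for all 50 quantiles of a given lead time amounts to a single call to quantreg's rq function. The sketch below assumes a hypothetical data frame pairs_n holding the forecast–observation pairs (s, q) for one lead time n.

# Fitting equations (5)-(6) with the quantreg package, per lead time n.
library(quantreg)

taus <- seq(0.01, 0.99, by = 0.02)              # T = 50 quantiles
fit  <- rq(q ~ s, tau = taus, data = pairs_n)   # one (a, b) pair per quantile

# the dressing distribution phi_n for a new raw forecast value:
phi_n <- predict(fit, newdata = data.frame(s = 850))  # 50 quantiles Q_{n,tau}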


2.7. Verification strategy

Forecast quality in the two scenarios is assessed using visual exploration of forecast hydrographs, examination of graphical measures of reliability and sharpness, as well as a selection of metrics (presented as skill scores) for both probabilistic forecasts and single-valued derivatives thereof. The metrics and skill scores are described in detail in Appendix A.

Reliability is the degree to which predicted probabilities coincide with the observed relative frequencies, given those forecast probabilities. The degree to which the forecasts are reliable is indicated by reliability plots, which plot the conditional relative frequency of observations against forecast probabilities. Proximity to the 1:1 diagonal, where observed frequency equals predicted probability, indicates higher reliability.

The refinement of a set of forecasts refers to the dispersion of the marginal distribution of predicted probabilities. A refinement distribution with a large spread implies refined forecasts, in that different forecasts are issued relatively frequently and so have the potential to discern a broad range of conditions. This attribute of forecast refinement is often referred to as sharpness, in the sense that refined forecasts are called sharp (Wilks, 2011). Often, sharpness is summarized as the ability to forecast extreme probabilities such as 0% and 100%. This is closely related to the width, or spread, of an ensemble forecast: large spreads are unlikely to yield 0% or 100% event probabilities. Hence, here we have measured the width of the ensemble forecasts. These "sharpness plots" show the empirical cumulative distribution of the width of the 10th–90th quantiles of the probability forecasts. In this context, sharpness measures the degree of confidence (narrowness of spread) afforded by the forecasts.

Metrics that describe the quality of single-valued forecasts include the correlation coefficient (COR), relative mean error (RME), mean absolute error (MAE) and the root mean squared error (RMSE). The correlation coefficient describes the degree of linear dependence between the observation and the forecast. The RME, or fractional bias, measures the average difference between the forecast and the observation, relative to the mean of the observations. The MAE measures the mean absolute difference between a set of forecasts and corresponding observations. The RMSE provides the square root of the average mean square error of the forecasts; it has the same unit as the forecasts and the observations. In each of these four metrics, the forecast mean is used as the single-valued forecast.
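For completeness, these four measures are straightforward to compute. The sketch below assumes vectors fc (forecast means) and obs (verifying observations); both names are hypothetical.

# Single-valued verification measures.
COR  <- cor(fc, obs)                 # linear dependence
RME  <- mean(fc - obs) / mean(obs)   # relative mean error (fractional bias)
MAE  <- mean(abs(fc - obs))          # mean absolute error
RMSE <- sqrt(mean((fc - obs)^2))     # same units as forecasts and observations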

In terms of the probabilistic characteristics of forecasts, the overall accuracy is measured with the Brier Score (BS) and the mean Continuous Ranked Probability Score (CRPS). The BS comprises the mean square error of a probability forecast for a discrete event, where the observation is an indicator variable. The CRPS measures the integral square error of a probability forecast across all possible event thresholds, again assuming that the observation is deterministic.
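Since the predictive distributions here are expressed as 50 equally spaced quantiles (section 2.6), the CRPS can be approximated with the common ensemble estimator CRPS = E|X − y| − 0.5 E|X − X′|, treating the quantiles as members. A minimal sketch, with x the vector of quantiles and y the verifying observation (hypothetical names):

# Sample CRPS for one forecast, from its quantiles x and the observation y.
crps_sample <- function(x, y) {
  mean(abs(x - y)) - 0.5 * mean(abs(outer(x, x, "-")))
}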

In the present paper, both the BS and the CRPS of the forecasts under consideration are presented as skill relative to the BS and CRPS of the climatological forecast, that is, the climatology of streamflow observations as derived from the sample of verification pairs.
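As an illustration, the Brier Skill Score for the event "streamflow exceeds a threshold" can be sketched as follows; p_fc (forecast exceedance probabilities), obs (observed flows) and the threshold are hypothetical names.

# Brier Score and Brier Skill Score relative to sample climatology.
o    <- as.numeric(obs > threshold)   # indicator observation
BS   <- mean((p_fc - o)^2)            # Brier Score of the forecasts
p_cl <- mean(o)                       # climatological event probability
BS_c <- mean((p_cl - o)^2)            # Brier Score of the climatological forecast
BSS  <- 1 - BS / BS_c                 # skill relative to climatology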

For discrimination, the trade-off between correctly predicted events (true positives, or hits) and false alarms (false positives) is considered. The hit rate is plotted versus the false alarm rate in the ROC curves. The area under the curve (AUC) is a measure of discrimination; this is expressed as the Relative Operating Characteristic score (ROCS), which factors out the climatological AUC of 0.5, i.e. ROCS = 2 AUC − 1. Forecast value is measured using the Relative Economic Value (REV; Murphy 1985; Zhu et al. 2002). REV (V [−]) is calculated by matching the occurrences of hits, misses, false alarms and correct negatives ('quiets') with their consequences (Table 2). It is expressed on a scale from negative infinity to 1, where V = 0 is the situation in which there is no forecasting and warning system present and V = 1 is the situation in which there is a perfect forecasting and warning system present. Negative values imply that the warning system introduces more costs than benefits. The REV is expressed as a function of the user's cost-loss ratio r.
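The cost-loss calculation behind REV can be sketched as follows (cf. Zhu et al. 2002); this is the standard formulation, not code from the study itself. H is the hit rate, far the false alarm rate, s the climatological event frequency and r the cost-loss ratio; expected expenses are expressed in units of the loss L.

# Relative Economic Value from the elements of the contingency table.
rev_value <- function(H, far, s, r) {
  e_climate  <- min(r, s)            # cheaper of always/never taking action
  e_perfect  <- r * s                # act exactly before each event
  e_forecast <- r * far * (1 - s) + r * H * s + (1 - H) * s
  (e_climate - e_forecast) / (e_climate - e_perfect)
}
rev_value(H = 0.8, far = 0.1, s = 0.1, r = 0.2)   # V on the scale (-Inf, 1]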

Verification was performed at the same timestep as the post-processing; results are shown for a selection of lead times only. For verification, the open-source Ensemble Verification System (EVS; Brown et al., 2010) is used. EVS takes ensemble forecasts as input. Here, predictive uncertainty is expressed by quantiles rather than ensemble members, but the 50 quantiles are equally spaced, and ensemble members may be interpreted as quantiles of the underlying probability distribution from which they are sampled (e.g. Bröcker and Smith, 2008).

Conditional quality and skill are determined by calculating verification metrics for increasing levels of the non-exceedence climatological probability, P, ranging from 0 to 1. Essentially, P = 0 constitutes an unconditional verification for continuous measures, such as the CRPSS, as all available data pairs are considered (Bradley and Schwartz, 2011). Conversely, at P = 0.95, only the data pairs with observations falling in the top 5% of sample climatology are considered; this amounts to approx. 60 pairs here for the Meuse case and approx. 150 pairs for the Rhine case (Figure 2).

The BSS, ROCS and REV measure forecast skill for discrete events. The BSS, ROCS and REV are, therefore, unknown for thresholds corresponding to the extremes of the observed data sample, nominally denoted by P = 0 and P = 1.

[Figure 2 about here.]

Sampling uncertainties were quantified with the stationary block bootstrap (Politis and Romano, 1994). Here, blocks of adjacent pairs are sampled randomly, with replacement, from the J available pairs in each basin. Overlapping blocks are allowed, and the average length of each block is determined by the autocorrelation of the sample data. In both cases, an average block length of 10 days was found to capture most of the autocorrelation (some interseasonal dependence remained). The resampling was repeated 1,000 times, and the verification metrics were computed from each sample. Confidence intervals were derived from the bootstrap sample with a nominal coverage probability of 0.9, i.e. [0.05, 0.95]. The intervals should be treated as indicative and do not necessarily provide unbiased estimates of coverage probabilities, particularly for rare events (Lahiri, 2003). Also, observational uncertainties were not considered.
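The stationary block bootstrap is available in standard software; the sketch below illustrates the technique with R's boot package (the study itself used EVS for verification). The matrix pairs (columns fc and obs, ordered in time) and the statistic are hypothetical; sim = "geom" draws geometrically distributed block lengths with mean l, i.e. the stationary bootstrap.

# Stationary block bootstrap of a verification metric (Politis and Romano, 1994).
library(boot)

rmse_stat <- function(x) sqrt(mean((x[, "fc"] - x[, "obs"])^2))
bs <- tsboot(pairs, statistic = rmse_stat, R = 1000,
             l = 10, sim = "geom")           # mean block length of 10 days
quantile(bs$t, probs = c(0.05, 0.95))        # nominal 90% interval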

These sampling uncertainty intervals provide information as to the 'true value' of the metric or skill considered. Unfortunately, the intervals cannot be used for a formal statistical analysis, as the verification samples are not strictly independent. Hence, in the present paper, the comparison between scenarios is (necessarily) based on a qualitative description of the uncertainty intervals.


3. Study basins, models and data used

To enhance the robustness of the findings presented in this paper, the experiment was carried out on two separate study basins. These comprise forecasting locations in two basins with different characteristics, where hydrological models are forced with different atmospheric ensemble forcing products.

[Figure 3 about here.]

3.1. Meuse

3.1.1. Basin description

The river Meuse (Figure 3) runs from the northeast of France through Belgium and enters the Netherlands just south of Maastricht. It continues to flow north and then west towards Dordrecht, where it meets the Rhine before discharging into the North Sea near Rotterdam. Geology and topography vary considerably across the basin. The French Meuse basin is relatively flat and has thick soils. The mountainous Ardennes are relatively high and steep, and the area's impermeable bedrock is covered by thin soils. Average annual basin precipitation varies around 900 mm. The Meuse is a typically rain-fed river; long-lasting, extensive snowpacks do not occur. Figure 4 shows the distribution of streamflow at the forecasting locations considered in this study. Near Maastricht, average runoff equals approx. 200 m3/s. Temporal variability can be large: during summer, streamflow can be less than 10 m3/s, while the design flood, associated with an average return period of 1,250 years, has been established at approx. 3,000 m3/s.

[Figure 4 about here.]

This study explicitly looks at St Pieter, which is near where the river enters The Netherlands. Water levels in the Belgian stretch of the Meuse, just upstream of the Dutch-Belgian border, are heavily regulated by large weirs. These, together with the locks that have been constructed to allow ships to navigate the large water level differences, cause relatively high fluctuations in discharge. The manual operations that lead to these fluctuations are not communicated to the forecasting agency across the border in The Netherlands, which introduces additional uncertainties with respect to future streamflow conditions.

3.1.2. Models used

The forecasting system contains an implementation of the HBV rainfall-runoff model (Bergström and Singh, 1995). This is a semi-lumped, conceptual hydrological model, which includes a routing procedure of the Muskingum type. The model schematisation consists of 15 sub-basins jointly covering the Meuse basin upstream of the Belgian-Dutch border, which is very near the St Pieter forecasting location. The model runs at a one-hour time step. Inputs to the model consist of temperature and precipitation forcings; actual evaporation is estimated from a fixed annual profile that is corrected using temperature forecasts. The model simulates both streamflow generation and streamflow routing in natural flow conditions only. Thus, it does not include models of the human interference that occurs at weirs and sluices. This interference occurs mainly at low flows; at high flows, weirs are drawn. Hence, at low flows, considerable uncertainty is associated with model outcomes.

3.1.3. Forecasts and observations used

COSMO-LEPS is the ensemble implementation of the COSMO model, a non-hydrostatic, limited-area atmospheric prediction model. Its 16 members are nested on selected members of the ECMWF-EPS forecasts. COSMO-LEPS runs twice daily at a 10 km grid spacing with 40 vertical layers. It covers large parts of continental Europe, including the Meuse basin. For the present experiment, approx. 1,400 historical COSMO-LEPS forecasts were available (Figure 2): one every day between mid 2007 and early 2011. The forecasts have a 1-h time step and a maximum lead time of 132-h, i.e. 5.5 days. Within the operational forecasting system, the lead time is artificially extended to 7 days by assuming zero precipitation and 8 °C temperature for the lead times ranging from 132-h through 168-h. The 36-h lead time gain more or less coincides with the time required for a flood wave to cover the distance from Chooz (near the French–Belgian border) to St Pieter. As a general rule, about half of the streamflow volume originates from the basin upstream of Chooz.

From the 16 members, a single member was isolated to serve as the deterministic forecast. Note that while the results for a single deterministic forecast are presented here, the dressing was in fact done 16 times, using each of the available 16 members as a single deterministic forecast. Each of these 16 dressed deterministic forecasts behaves similarly with respect to the 'competing' scenario; hence only one of these is presented in this paper.

Hourly streamflow observations at St Pieter as well as temperature and precipitation observations within the Meuse basin were obtained from the Water Management Centre of The Netherlands.

524

3.2.1. Basin description

525

The river Rhine runs from the Swiss Alps along the French-German bor-

526

der, through Germany and enters The Netherlands near Lobith, which is

527

situated upstream of the Rhine-Meuse delta, and is often considered the out-

528

flow of the Rhine. At Lobith, the basin area equals approx. 160,000 km2 .

529

During spring and early summer, a considerable fraction of flow at the outlet

530

originates from snowmelt in the Swiss Alps. Figure 3 shows the basin lo-

531

cation, elevations and the forecasting locations that were used in this work.

532

These are Metz, Cochem and Lobith. Metz is located in the headwaters of

533

the river Moselle, of which Cochem is the outlet.

534

3.2.2. Models used

535

The forecast production system that was used to create simulations and

536

hindcasts for the Rhine is a derivative of the operational system that was

537

mentioned above. The system contains an implementation of the HBV rain-

538

fall runoff model (Bergstr¨om and Singh, 1995). The Rhine model schematisa-

539

tion consists of 134 sub-basins jointly covering the entire basin. The models

540

run at a daily time step. Inputs to the model consist of temperature and

541

precipitation forcings; actual evaporation is estimated from a fixed annual

542

profile that is corrected using temperature forecasts.

543

3.2.3. Forecasts and observations used

544

For observations of precipitation, the CHR08 dataset (Photiadou et al.,

545

2011) was used. This dataset was prepared specifically for the HBV model 24

546

used here and covers the period 1961 through 2007. The spatial scale of the

547

observations coincides with the 134 HBV sub-basins. Temperature observa-

548

tions originate from version 5.0 of the E-OBS data set (Haylock et al., 2008),

549

and are available from 1951 through mid 2011. These forcings were available

550

at a time step of one day. The observations are used to force the hydrological

551

model in historical mode to estimate the initial conditions at the onset of a

552

hydrological forecast, as well as in simulation mode.

553

Predicted forcings consisted of the ECMWF reforecast dataset, compris-

554

ing medium-range EPS forecasts with 5 ensemble members (Hagedorn, 2008).

555

These reforecasts were produced using the current operational model (Cy38r1

556

with a 0.25 degrees horizontal resolution). The forecasts were temporally ag-

557

gregated to a one day time step, which coincided with that of the hydrological

558

model used, and go out to a maximum lead time of 240-h, i.e. 10 days. The

559

gridded forecasts were spatially averaged to the HBV sub-basin scale. For

560

this work, approx. 2,900 reforecasts were available (Figure 2), covering the

561

period 1990–2008.

562

Similar to the Meuse case, the deterministic forecasts used in this study

563

consist of a randomly chosen ensemble member from each of the available

564

ensemble forecasts. Each of the members was used to create a deterministic

565

forecast which was subsequently dressed and analysed. However, results for

566

one of these forecasts is presented only.

567

Hourly streamflow observations for the hydrological stations within the

568

Rhine basin were obtained from the Water Management Centre of The Neth-

569

erlands. 25

570

4. Results and analysis

4.1. Post-processing of single-valued forecasts

Scatter plots of single-valued forecasts and observations are shown in Figures 5 and 6 for St Pieter and the Rhine locations, respectively. Two datasets are shown: (i) the forecasts with perfect forcings (simulations in the Rhine case) and (ii) the deterministic forecasts. In all plot panels, the horizontal axes are identical to the vertical axes and the 1:1 diagonal is emphasised. The forecast–observation pairs are plotted in a transparent colour; areas that show up darker comprise multiple pairs plotted on top of one another. A selection of estimated quantiles is superimposed on the scatter plots, with the median in red and the 5th, 10th, 25th, 75th, 90th and 95th percentiles in blue.

[Figure 5 about here.]

[Figure 6 about here.]

The pairs and the estimated quantiles in the St Pieter figure (Figure 5) show that the perfect forcing pairs (bottom row) are closer to the diagonal than the deterministic forecast pairs (top row). This is because the residuals between the perfect forcings forecast and the observations comprise the hydrological uncertainties only. The plots also show that the median quantile of the pairs comprising the deterministic forecasts has a shallower slope than the diagonal. This indicates an overforecasting bias: the majority of pairs is located below, or to the right of, the diagonal. The median of the pairs comprising the perfect forcing forecasts shows a slope almost equal to, or higher than, that of the 1:1 diagonal. The latter indicates underforecasting: most of the pairs are located above, or to the left of, the 1:1 diagonal. Both sets of pairs show that the spread increases with increasing forecast lead time and that higher values of flow have higher spread in real units.

In the Rhine case (Figure 6), the simulations are independent of forecast lead time. The difference between the spread of pairs based on the simulations and that of the deterministic forecasts is less obvious, especially when the shorter lead times are considered. Without exception, the median forecast is located below the diagonal, which indicates an overforecasting bias.

4.2. Forecast hydrographs

Sample forecasts, or predictive distributions, for both scenarios are shown in Figure 7. The rows show the cases that use deterministic (top) and ensemble (bottom) forecasts, with the raw forecasts indicated by thick blue lines and the dressed forecasts by thin grey lines. Note that the raw cases show one or multiple traces, whereas for the dressed cases, quantiles are shown (which should not be confused with ensemble traces).

[Figure 7 about here.]

By construction, ensemble dressing corrects for under-dispersion and, therefore, increases the ensemble spread. In this example, the spread of the dressed single-valued forecasts is larger than the spread of the dressed ensemble forecasts. It is also noticeable that the raw ensemble forecast fails to capture many observations, whereas the dressed forecasts capture all. Please note that this is an example on a particular date and not necessarily the general behaviour of all forecasts that are post-processed.

The example forecasts also show an artefact associated with statistical post-processing, namely that the most extreme quantiles are relatively noisy. This originates from the increased sampling uncertainty associated with estimating extreme quantiles.

622

Generally speaking, COR, RME and RMSE follow similar patterns. Each

623

worsens with increasing lead time and with increasing value of the verifying

624

observation as indicated by P . In this manuscript, only the RME is shown

625

(Figure 8).

626

The correlations (plot not shown) are highest for Lobith, followed by

627

Cochem, Metz and St Pieter. While the correlations are generally positive,

628

they approach zero at St Pieter for higher P at longer forecast lead times.

629

Both the patterns and values of the correlation coefficient (as function of

630

lead time and P ) are similar across the two scenarios. Only at the longer

631

forecast lead times and at St Pieter do they differ. In those cases, the dressed

632

ensembles outperform the dressed deterministic forecasts.

633

The RME plots (Figure 8) show that, at P = 0, the dressed determinis-

634

tic forecasts have near-perfect RME, that is, RME ≈ 0, at all forecast lead

635

times. The dressed ensemble forecasts show a larger fractional bias, with

636

St Pieter and Metz showing positive values and Cochem and Lobith showing

637

negative values. For higher values of P and at longer forecast lead times,

28

638

RME becomes increasingly negative. Consequently, at higher values of P

639

and at longer forecast lead times, the dressed ensembles at St Pieter and

640

Metz show smaller fractional bias than the dressed deterministic forecasts.

641

The converse is true for Cochem and Lobith, where the dressed determin-

642

istic forecasts have smaller RME. The difference in RME between scenarios

643

increases with increasing forecast lead time.

644

The RMSE (not shown in plots) worsens (i.e., increases) with increasing

645

forecast lead time, increasing threshold amount, and with declining basin

646

size. The RMSE is lower for the dressed ensemble forecasts than the dressed

647

single-valued forecasts at most values of P . Only at Cochem and Lobith and

648

for some values of P is the RMSE higher for the dressed ensemble forecasts.

649

Overall, in terms of the single valued verification measures, neither the

650

dressed ensemble forecasts nor the dressed deterministic forecasts perform

651

consistently better. At St Pieter and Metz, the mean of the dressed ensembles

652

has higher quality in terms of COR, RME and RMSE, whereas at Cochem

653

and Lobith, the reverse is true in terms of RME and, at small ranges of P ,

654

for RMSE.

655

656

[Figure 8 about here.] 4.4. Reliability and sharpness

Reliability plots for the P = 0.50 and P = 0.90 subsamples (Figures 9 and 10) show that both approaches are reasonably, and more or less equally, reliable at all lead times and at all locations. The one exception here is St Pieter, where the dressed deterministic forecasts appear to be more reliable than the dressed ensemble forecasts.

Reliability varies slightly with the level of extremity of the event: the P = 0.50 reliability plot (Figure 9) follows the diagonal more closely than the P = 0.90 reliability plot (Figure 10). In the P = 0.50 reliability plot (Figure 9), the samples are distributed much more evenly across the horizontal (as indicated by the size of the plotting positions) compared to the P = 0.90 reliability plot (Figure 10). In the latter, the distribution is much less even, with a relatively large number of forecasts to be found at the 0% and 100% forecasts.

[Figure 9 about here.]

[Figure 10 about here.]

The width of the forecast distributions (Figures 11 and 12) increases with increasing lead time and with increasing value of P. Sharpness, compared to reliability, varies more sharply (we couldn't resist the pun) with the value of P. At lead times longer than 24-h, the differences in forecast width between scenarios become noticeable. In all cases, the dressed ensembles result in narrower predictive distributions than the dressed deterministic forecasts. These differences are more pronounced at higher values of P. Note that while 'narrowness' is a desirable property, it is only valuable in a decision-making context if the forecasts are also reliable.

[Figure 11 about here.]

[Figure 12 about here.]

4.5. Probabilistic Skill Scores

For the skill scores (Figure 13), the patterns are more important than the absolute values, as the baseline is unconditional climatology. The patterns of the Brier Skill Score are similar to those observed for other metrics: skill is highest for the largest basins and reduces with increasing forecast lead time. The BSS is generally very similar for both scenarios. Only in the case of St Pieter is there a consistent difference between the scenarios, with the dressed ensemble forecasts outperforming the dressed deterministic forecasts, but not beyond the range of sampling uncertainty.

[Figure 13 about here.]

The patterns of the mean CRPSS (Figure 14) are similar to those of the BSS. The difference is that the CRPSS improves with increasing P. This is understandable because sample climatology is much less skilful at higher thresholds.

Often, the CRPSS is similar across the two scenarios. Again, any differences are more pronounced at longer forecast lead times and higher values of P, where the dressed ensemble forecasts are somewhat more skilful, particularly at St Pieter and Metz (but again, not beyond the range of sampling uncertainty).

[Figure 14 about here.]

703

4.6. Forecast value

704

Relative Operating Characteristic plots for the event defined by the ex-

705

ceedence of the 90th percentile of the observational record are shown in Fig-

706

ure 15. The plots show that, in all cases, the ROC curves for both scenarios

707

are well above the diagonal, indicating that these forecasts improve upon

708

climatology.

709

At the shortest lead time shown, the curves for the two scenarios are very

710

similar. Differences, if any, increase with increasing forecast lead time. At

711

longer forecast lead times, the dressed ensemble forecasts are slightly more

712

discriminatory than the dressed deterministic forecasts.

713

The associated ROC scores (Figure 16) are very similar at most locations

714

and forecast lead times and generally decline with increasing forecast lead

715

time and increase with threshold amount.

716

[Figure 15 about here.]

717

[Figure 16 about here.]

718

The Relative Economic Value also relies on the ability of a forecasting

719

system to discriminate between events and non-events, but assigns a cost-

720

loss model to weight the consequences of particular actions (or inaction).

721

In most cases, the REV of the dressed ensemble forecasts is similar to, or

722

slightly higher than, the dressed deterministic forecasts for different values

723

of the cost-loss ratio. Again, these differences are more pronounced at longer

724

forecast lead times, higher thresholds, and larger values of the cost-loss ratio. 32

725

[Figure 17 about here.]

726

[Figure 18 about here.]

727

4.7. Analysis

The results show that, at P = 0, the dressed deterministic forecasts improve (albeit only marginally) on the dressed ensemble forecasts in terms of reliability and RME. However, the dressed ensemble forecasts are somewhat sharper. On balance, the dressed ensemble forecasts have slightly better RMSE and CRPSS scores.

The dressed ensemble forecasts are only slightly less reliable than the dressed deterministic forecasts at the three Rhine locations Metz, Cochem and Lobith. The differences are larger at St Pieter, where the dressed ensemble forecasts show a substantial wet bias. In this context, the dressed deterministic forecasts account for both the atmospheric and hydrologic uncertainties and correct for biases via the quantile regression, whereas the dressed ensemble forecasts do not account for under-dispersion of the meteorological forecasts. In other words, the dressing assumes that the meteorological uncertainties are well described by the ensembles, whereas in reality this is not necessarily so.

At P = 0, the fractional bias of the dressed deterministic forecasts is small at all forecast lead times. This is understandable, because post-processing techniques such as quantile regression are generally good at correcting for unconditional biases and biases conditional upon forecast value/probability (i.e. lack of reliability).

The dressed ensemble forecasts are sharper than the dressed deterministic forecasts. However, sharpness is only meaningful when the forecasts are also reliable. In the case of St Pieter at 0-h lead time, forecasts are infinitely sharp and of near-perfect quality; this is due to the error correction just upstream of St Pieter.

This error correction introduces some erratic effects on the fractional bias at the shortest lead time (Figure 8) and a relatively steep jump to less-than-perfect skills at those lead times when the effect of error correction has worn off. This is but one of the differences between the St Pieter catchment on the one hand and the Rhine catchments on the other. In general, the St Pieter results are slightly more erratic and dramatic in nature: a more pronounced dependency on lead time and a bigger difference in sharpness between the two scenarios. We believe this to be due to the nature of the basin, which is more complex (in terms of geology and topography) than the other basins, in combination with the fact that less information is available for the Meuse: the hydrometeorological monitoring network is a lot less dense.

At higher values of P, both sets of forecasts are consistently less reliable. The dressed deterministic forecasts show a 'dry bias', where the observed relative frequency (or quantile exceedence) is higher than the predicted probability. However, at St Pieter, this conditional dry bias is offset by an unconditional wet bias, leading to reasonably reliable forecasts at higher thresholds and early forecast lead times.

In general, the fractional negative bias (RME) increases with increasing threshold. This is consistent with the RME of the precipitation forecasts, which systematically underestimate the largest observed precipitation amounts (Verkade et al., 2013). At higher thresholds, the differences in RME between the two scenarios are similar in pattern to those in the unconditional sample. In other words, at St Pieter and Metz, the fractional bias of the dressed ensemble forecasts is smaller, while at Cochem and Lobith, the dressed deterministic forecasts show smaller fractional bias. Again, this is due to the RME in the precipitation forecasts.

The BSS, ROCS and REV, which assess the quality of the forecasts at predicting discrete events, are very similar between the two scenarios at all forecast lead times and all values of P. One exception is St Pieter, where the dressed ensemble forecasts improve somewhat on the dressed deterministic forecasts.

forecasts in terms of ROCS and REV, but not at the highest thresholds. At

784

St Pieter and at these higher values of P , the dressed ensembles were both

785

sharper and more reliable than the dressed deterministic forecasts.

5. Conclusions

We compared a source-based approach and a lumped approach for estimating predictive uncertainty in terms of various aspects of forecast quality, skill and value. The analysis shows that the dressed ensemble forecasts are sharper, but slightly less reliable, than the dressed deterministic forecasts. On balance, this results in skill and value that is very similar across the two scenarios, with the dressed ensemble forecasts improving slightly on the dressed deterministic forecasts at St Pieter and Metz and the reverse being true at Cochem and Lobith.

6. Discussion

While the analysis revealed quite similar results between the scenarios, further studies or different approaches to quantifying the various sources of uncertainty could reveal larger differences. For example, a larger hindcast dataset would help to reduce the sampling uncertainties and identify any marginal differences in forecast quality, as well as supporting an analysis of higher thresholds. The quantile regression technique could be configured in alternative ways (some configurations were tested by López López et al. 2014), or could be replaced by an alternative technique altogether. Alternative basins can be used for the experiment, and/or alternative ensemble NWP products.

As noted earlier, the quality of the dressed streamflow ensembles is dependent on the quality of the raw streamflow ensembles, and therefore on the quality of the underlying NWP ensembles. Any lack of reliability will transfer to the dressed streamflow ensembles. Such meteorological biases could be addressed through meteorological post-processing or by accounting for the effects of under-dispersion on the streamflow forecasts. Conceivably, meteorological post-processing could address error structures that are specific to the forecast variable (i.e., precipitation, temperature or other), while hydrologic post-processing then addresses error structures specific to the hydrological forecasts. Presumably, the latter errors will have been reduced by the presence of meteorological post-processing techniques. Note that, in practice, a meteorological post-processing technique that is effective within the context of hydrological forecasting is yet to be found. For example, the approach taken by Verkade et al. (2013) did not result in improved hydrological forecasts, which was in line with the findings of Kang et al. (2010). Approaches such as those followed by Bennett et al. (2016), Schefzik (2016a,b) and Hu et al. (2016) appear to be promising.

In terms of choosing an approach, the results presented here are quite similar for both techniques. However, there are additional considerations, including the relative simplicity of dressing a deterministic forecast, data availability, and the expected additional information contributed by an ensemble of weather forecasts (versus a single forecast or the ensemble mean) in different contexts. As indicated by Pagano et al. (2014), combining ensemble and other forecasting technologies with the subjective experience of operational forecasters is an ongoing challenge in river forecasting.

Essentially, statistical modelling relies on the stationarity of the model errors, or on the ability to account for any non-stationarity with a reasonably simple model. In practice, however, basin hydrology changes over time with changes in climate and land use, among other things. The lumped approach cannot easily account for this, while the source-based approach may, using information about the individual sources of uncertainty, better isolate (and model) the causes of non-stationarity.

Also, by definition, extreme events are not well represented in the observational record, and they frequently change basin hydrology. Thus, for extreme events in particular, basing estimates of predictive uncertainty on the observational record is fraught with difficulty. In this context, ensemble approaches to propagating the forcing and other uncertainties through hydrological models should (assuming the models are physically reasonable) capture elements of the basin hydrology that are difficult to capture through purely statistical approaches.

7. Acknowledgements

The authors would like to thank Marc van Dijk at Deltares for running the historical simulations for the river Meuse. Florian Pappenberger at ECMWF provided the ECMWF-EPS reforecast data. The Water Management Centre of the Netherlands is thanked for allowing us to use an off-line version of the operational system RWsOS Rivers for this research. The "Team Expertise Meuse" at Rijkswaterstaat Zuid-Nederland provided valuable comments on the implementation of the ensemble dressing technique for St Pieter. The HEPEX community (www.hepex.org) is thanked for providing inspiration, and for bringing additional fun to doing research into predictive hydrological uncertainty. The digital elevation model used as a background map in Figure 3 was made available by the European Environment Agency under a Creative Commons Attribution License (www.eea.europa.eu). We are also grateful to no fewer than four anonymous referees for the time taken to review earlier versions of the manuscript, and to Konstantine Georgakakos and Hamid Moradkhani for acting as editors.

Appendix A. Verification metrics

For ease of reference, the probabilistic verification metrics used in this study are briefly explained; this description is partly based on those in Verkade and Werner (2011), Brown and Seo (2013) and Verkade et al. (2013). Additional details can be found in the documentation of the Ensemble Verification System (Brown et al., 2010), as well as in the reference works on forecast verification by Jolliffe and Stephenson (2012) and Wilks (2011).

Appendix A.1. Reliability plots

One desired property of probabilistic forecasts is that forecasted event probabilities f coincide with observed relative frequencies F. This is visualized in reliability plots, which plot the latter variable versus the former,

    F_i = \frac{\sum (o \mid f = i)}{J_i},    (A.1)

where F is the observed relative frequency, i is one of eleven allowable values of the forecast f: i ∈ {0, 0.1, 0.2, ..., 1}, o is an indicator variable that is assigned a value of 1 if the event was observed and a value of 0 if the event was not observed, f is the forecasted probability of event occurrence and J_i is the total number of forecasts made within bin i (Wilks, 2011).
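By way of illustration, the binned computation in Equation (A.1) may be sketched as follows (Python; names and synthetic data are ours, and this is not the implementation used in the study):

    import numpy as np

    def reliability_curve(probs, events):
        """Observed relative frequency F_i per forecast probability bin (Eq. A.1).

        probs  : forecast probabilities f, rounded to i in {0, 0.1, ..., 1}
        events : indicator observations o (1 if the event occurred, else 0)
        """
        probs = np.asarray(probs, dtype=float)
        events = np.asarray(events, dtype=float)
        bins = np.linspace(0.0, 1.0, 11)                       # allowable values i
        idx = np.argmin(np.abs(probs[:, None] - bins[None, :]), axis=1)
        F = np.full(bins.size, np.nan)                         # NaN where J_i = 0
        for i in range(bins.size):
            mask = idx == i
            if mask.any():
                F[i] = events[mask].mean()                     # sum(o | f = i) / J_i
        return bins, F

    # A reliable system yields F_i close to i (the diagonal of the plot):
    rng = np.random.default_rng(42)
    p = rng.choice(np.linspace(0.0, 1.0, 11), size=10000)
    o = (rng.random(10000) < p).astype(float)
    print(np.round(reliability_curve(p, o)[1], 2))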

Appendix A.2. Sharpness

Here, sharpness is indicated by the width of the centred 80% interval of the predictive distribution,

    w_j = S_{\tau=0.90,j} - S_{\tau=0.10,j},    (A.2)

for all J forecasts. Again, sharpness is determined for each lead time n separately and the lead time indicators have been omitted from the above equation. The combined record w_{j=1,2,...,J} is shown as an empirical cumulative distribution function.
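A minimal sketch of Equation (A.2), under the assumption that the 10% and 90% quantiles are estimated from ensemble members (in the study they follow from the dressed predictive distributions); names and data are ours:

    import numpy as np

    def sharpness_widths(ensembles):
        """Widths w_j = S_{tau=0.90,j} - S_{tau=0.10,j} (Eq. A.2), with the
        predictive distributions here represented by ensemble members."""
        q10 = np.quantile(ensembles, 0.10, axis=1)
        q90 = np.quantile(ensembles, 0.90, axis=1)
        return q90 - q10

    # The combined record w_1..w_J, shown in the paper as an empirical cdf:
    rng = np.random.default_rng(0)
    ens = rng.gamma(2.0, 100.0, size=(500, 50))   # 500 forecasts, 50 members [m3/s]
    w = np.sort(sharpness_widths(ens))
    ecdf = np.arange(1, w.size + 1) / w.size      # plot ecdf against w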

Appendix A.3. Correlation coefficient

The Correlation Coefficient (COR) is a measure of the statistical relationship between two variables. Here, the Pearson product-moment correlation coefficient is used: a measure of the strength and direction of the linear relationship between two variables, defined as the (sample) covariance of the variables divided by the product of their (sample) standard deviations,

    \mathrm{COR}_{X,Y} = \frac{E[(X - \mu_X)(Y - \mu_Y)]}{\sigma_X \sigma_Y},    (A.3)

where E indicates an expected value, \mu is the mean and \sigma is the standard deviation.
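In code, Equation (A.3) reduces to a single library call; the synthetic data below are ours:

    import numpy as np

    # Pearson correlation (Eq. A.3) between forecasts and observations:
    rng = np.random.default_rng(2)
    obs = rng.gamma(2.0, 100.0, size=300)
    fcst = obs + rng.normal(0.0, 40.0, size=300)
    print(np.corrcoef(fcst, obs)[0, 1])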

Appendix A.4. Relative Mean Error

The Relative Mean Error (RME, sometimes called relative bias) measures the average difference between a set of forecasts and corresponding observations, relative to the mean of the latter,

    \mathrm{RME} = \frac{\sum_{j=1}^{J} (\bar{S}_j - q_j)}{\sum_{j=1}^{J} q_j},    (A.4)

where q_j is the observation and \bar{S}_j is the mean of the ensemble forecast. The RME thus provides a measure of relative, first-order bias in the forecasts. RME may be positive, zero, or negative. Insofar as the mean of the ensemble forecast should match the observed value, a positive RME denotes overforecasting and a negative RME denotes underforecasting. Zero RME denotes the absence of relative bias in the mean of the ensemble forecast.
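Equation (A.4) translates directly into code; a minimal sketch with names of our choosing:

    import numpy as np

    def relative_mean_error(ens_means, obs):
        """RME (Eq. A.4): summed difference between ensemble-mean forecasts
        and observations, relative to the summed observations."""
        ens_means = np.asarray(ens_means, dtype=float)
        obs = np.asarray(obs, dtype=float)
        return np.sum(ens_means - obs) / np.sum(obs)

    # A positive value denotes overforecasting:
    print(relative_mean_error([110.0, 205.0], [100.0, 200.0]))  # 0.05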

Appendix A.5. Brier skill score

The (half) Brier score (BS; Brier, 1950) measures the mean square error of J predicted probabilities that Q exceeds q,

    \mathrm{BS} = \frac{1}{J} \sum_{j=1}^{J} \left( F_{Q_j}(q) - F_{S_j}(q) \right)^2,    (A.5)

where

    F_{S_j}(q) = \Pr[Q_j > q] \quad \text{and} \quad F_{Q_j}(q) = \begin{cases} 1 & \text{if } Q_j > q \\ 0 & \text{otherwise} \end{cases}.

The Brier Skill Score (BSS) is a scaled representation of forecast quality that relates the quality of a particular system, BS, to that of a perfect system, BS_{perfect} (which is equal to 0), and to a reference system, BS_{ref},

    \mathrm{BSS} = \frac{\mathrm{BS} - \mathrm{BS}_{\mathrm{ref}}}{\mathrm{BS}_{\mathrm{perfect}} - \mathrm{BS}_{\mathrm{ref}}}.    (A.6)

BSS ranges from −∞ to 1. The highest possible value is 1. If BSS = 0, the BS is as good as that of the reference system. If BSS < 0, the system's Brier score is worse than that of the reference system.
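A minimal sketch of Equations (A.5) and (A.6), using the climatological probability of exceedence as the reference system; data and names are ours:

    import numpy as np

    def brier_score(prob_exceed, obs, threshold):
        """(Half) Brier score (Eq. A.5) for the event Q > q."""
        indicator = (np.asarray(obs) > threshold).astype(float)   # F_Qj(q)
        return np.mean((indicator - np.asarray(prob_exceed)) ** 2)

    def brier_skill_score(bs, bs_ref):
        """BSS (Eq. A.6); with BS_perfect = 0 this reduces to 1 - BS/BS_ref."""
        return (bs - bs_ref) / (0.0 - bs_ref)

    # Reference system: the climatological probability of exceedence.
    obs = np.array([80.0, 120.0, 300.0, 90.0, 260.0])
    clim = np.full(obs.size, np.mean(obs > 200.0))
    fcst = np.array([0.1, 0.2, 0.9, 0.1, 0.7])
    print(brier_skill_score(brier_score(fcst, obs, 200.0),
                            brier_score(clim, obs, 200.0)))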

Appendix A.6. Mean Continuous Ranked Probability Skill Score

The Continuous Ranked Probability Score (CRPS) measures the integral square difference between the cumulative distribution function (cdf) of the forecast, F_S(q), and the corresponding cdf of the observed variable, F_Q(q),

    \mathrm{CRPS} = \int_{-\infty}^{\infty} \left\{ F_S(q) - F_Q(q) \right\}^2 \, \mathrm{d}q.    (A.7)

The mean CRPS comprises the CRPS averaged across J pairs of forecasts and observations,

    \overline{\mathrm{CRPS}} = \frac{1}{J} \sum_{j=1}^{J} \mathrm{CRPS}_j.    (A.8)

The Continuous Ranked Probability Skill Score (CRPSS) is a function of the ratio of the mean CRPS of the main prediction system, \overline{\mathrm{CRPS}}, and that of a reference system, \overline{\mathrm{CRPS}}_{\mathrm{ref}},

    \mathrm{CRPSS} = \frac{\overline{\mathrm{CRPS}} - \overline{\mathrm{CRPS}}_{\mathrm{ref}}}{\overline{\mathrm{CRPS}}_{\mathrm{perfect}} - \overline{\mathrm{CRPS}}_{\mathrm{ref}}}.    (A.9)
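For an ensemble forecast, the integral in Equation (A.7) can be evaluated through the well-known kernel identity CRPS = E|S − q| − ½ E|S − S′|, which holds when F_S is the empirical cdf of the members. The sketch below uses this route, which is our choice and not necessarily that of the study:

    import numpy as np

    def crps_ensemble(members, observation):
        """CRPS (Eq. A.7) of a single ensemble forecast, via the kernel
        identity CRPS = E|S - q| - 0.5 * E|S - S'| for the empirical cdf."""
        s = np.asarray(members, dtype=float)
        term1 = np.mean(np.abs(s - observation))
        term2 = 0.5 * np.mean(np.abs(s[:, None] - s[None, :]))
        return term1 - term2

    def crpss(mean_crps, mean_crps_ref):
        """CRPSS (Eq. A.9); the mean CRPS of a perfect system is 0."""
        return (mean_crps - mean_crps_ref) / (0.0 - mean_crps_ref)

    rng = np.random.default_rng(1)
    fc = rng.normal(100.0, 20.0, size=(200, 50))   # 200 forecasts, 50 members
    ob = rng.normal(100.0, 20.0, size=200)
    mean_crps = np.mean([crps_ensemble(f, y) for f, y in zip(fc, ob)])  # Eq. (A.8)
    print(mean_crps)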

Appendix A.7. Relative Operating Characteristic score

The relative operating characteristic (ROC; Green and Swets, 1966) plots the hit rate versus the false alarm rate. These are calculated using the elements of a contingency table, which is valid for a single probabilistic decision rule,

[Table 2 about here.]

and are defined as follows,

    \text{hit rate} = \frac{\#\,\text{hits}}{\#\,\text{observed events}} = \frac{h}{o}, \qquad \text{false alarm rate} = \frac{\#\,\text{false alarms}}{\#\,\text{events not observed}} = \frac{f}{o'}.    (A.10)

The ROC score measures the area under the ROC curve (AUC) after adjusting for randomness, i.e.

    \mathrm{ROCS} = 2 \times (\mathrm{AUC} - 0.5).    (A.11)
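A sketch of Equations (A.10) and (A.11), sweeping over all probabilistic decision rules of the form 'warn if the forecast probability is at least t'; the toy data and names are ours:

    import numpy as np

    def roc_score(prob_exceed, event_observed):
        """ROC score (Eq. A.11). Hit and false alarm rates (Eq. A.10) are
        computed for every decision rule 'warn if p >= t'; the area under
        the resulting curve is then rescaled about the no-skill diagonal."""
        p = np.asarray(prob_exceed, dtype=float)
        o = np.asarray(event_observed, dtype=bool)
        hit, far = [1.0], [1.0]                           # rule: always warn
        for t in np.sort(np.unique(p)):
            warn = p >= t
            hit.append(np.sum(warn & o) / np.sum(o))      # h / o
            far.append(np.sum(warn & ~o) / np.sum(~o))    # f / o'
        hit.append(0.0)                                   # rule: never warn
        far.append(0.0)
        hit, far = np.array(hit), np.array(far)           # curve runs (1,1) -> (0,0)
        auc = np.sum(0.5 * (hit[1:] + hit[:-1]) * (far[:-1] - far[1:]))
        return 2.0 * (auc - 0.5)

    p = np.array([0.9, 0.8, 0.7, 0.3, 0.2, 0.1])
    o = np.array([1, 1, 0, 1, 0, 0], dtype=bool)
    print(roc_score(p, o))   # ~0.78 for this toy sample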

Appendix A.8. Relative Economic Value

The Relative Economic Value (V [−]) of an imperfect warning system is defined as the value relative to the benchmark cases of No Warning (V = 0) and Perfect Forecasts (V = 1). REV can be less than 0 if the cost of false alarms is higher than the benefits attained by the warning system:

    V = \frac{V_{\mathrm{noFWS}} - V_{\mathrm{FWS}}}{V_{\mathrm{noFWS}} - V_{\mathrm{perfect}}}.    (A.12)

Using the consequences of the contingency table elements (Table 2) and the cost-to-loss ratio r (Verkade and Werner, 2011), REV can be expressed as a function of r,

    V = \frac{o - (h + f)\, r - m}{o\, (1 - r)}.    (A.13)
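Equation (A.13) translates directly into code; the contingency-table counts in the example below are invented:

    def relative_economic_value(h, f, m, o, r):
        """REV (Eq. A.13) from contingency-table counts and cost-loss ratio r.

        h, f, m : hits, false alarms, missed events for one decision rule
        o       : number of observed events
        r       : cost-loss ratio C/L, with 0 < r < 1
        """
        return (o - (h + f) * r - m) / (o * (1.0 - r))

    # REV as a function of r for one fixed decision rule; in practice the
    # envelope over all probability thresholds is taken per value of r.
    for r in (0.05, 0.2, 0.5):
        print(r, relative_economic_value(h=40, f=10, m=10, o=50, r=r))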

References

Bennett, J. C., Wang, Q. J., Li, M., Robertson, D. E., Schepen, A., 2016. Reliable long-range ensemble streamflow forecasts: Combining calibrated climate forecasts with a conceptual runoff model and a staged error model. Water Resources Research 52 (10), 8238–8259.

Bergström, S., 1995. The HBV model. In: Singh, V. P. (Ed.), Computer Models of Watershed Hydrology. Water Resources Publications, Highlands Ranch, Colorado, United States, pp. 443–476.

Bogner, K., Pappenberger, F., 2011. Multiscale error analysis, correction, and predictive uncertainty estimation in a flood forecasting system. Water Resources Research 47 (7), W07524.

Bradley, A. A., Schwartz, S. S., 2011. Summary verification measures and their interpretation for ensemble forecasts. Monthly Weather Review 139 (9), 3075–3089.

Bremnes, J. B., 2004. Probabilistic forecasts of precipitation in terms of quantiles using NWP model output. Monthly Weather Review 132, 338–347.

Brier, G., 1950. Verification of forecasts expressed in terms of probability. Monthly Weather Review 78, 1–3.

Bröcker, J., Smith, L. A., 2008. From ensemble forecasts to predictive distribution functions. Tellus A 60 (4), 663–678.

Broersen, P., Weerts, A., 2005. Automatic error correction of rainfall-runoff models in flood forecasting systems. In: Instrumentation and Measurement Technology Conference, 2005. IMTC 2005. Proceedings of the IEEE. Vol. 2, pp. 963–968.

Brown, J. D., Demargne, J., Seo, D.-J., Liu, Y., 2010. The Ensemble Verification System (EVS): a software tool for verifying ensemble forecasts of hydrometeorological and hydrologic variables at discrete locations. Environmental Modelling & Software 25 (7), 854–872.

Brown, J. D., Seo, D.-J., 2013. Evaluation of a nonparametric post-processor for bias correction and uncertainty estimation of hydrologic predictions. Hydrological Processes 27 (1), 83–105.

Chen, S., Yu, P., 2007. Real-time probabilistic forecasting of flood stages. Journal of Hydrology 340 (1-2), 63–77.

Coccia, G., Todini, E., 2011. Recent developments in predictive uncertainty assessment based on the model conditional processor approach. Hydrology and Earth System Sciences 15 (10), 3253–3274.

Demargne, J., Wu, L., Regonda, S., Brown, J., Lee, H., He, M., Seo, D.-J., Hartman, R., Herr, H. D., Fresch, M., Schaake, J., Zhu, Y., 2013. The science of NOAA's operational hydrologic ensemble forecast service. Bulletin of the American Meteorological Society.

Fortin, V., Favre, A.-C., Saïd, M., 2006. Probabilistic forecasting from ensemble prediction systems: Improving upon the best-member method by using a different weight and dressing kernel for each member. Quarterly Journal of the Royal Meteorological Society 132 (617), 1349–1369.

Glahn, H. R., Lowry, D. A., 1972. The use of model output statistics (MOS) in objective weather forecasting. Journal of Applied Meteorology 11 (8), 1203–1211.

Green, D. M., Swets, J. A., 1966. Signal Detection Theory and Psychophysics. John Wiley & Sons, Inc., New York.

Hagedorn, R., 2008. Using the ECMWF reforecast dataset to calibrate EPS forecasts. ECMWF Newsletter 117, 8–13.

Hantush, M. M., Kalin, L., 2008. Stochastic residual-error analysis for estimating hydrologic model predictive uncertainty. Journal of Hydrologic Engineering 13 (7), 585–596.

Haylock, M., Hofstra, N., Klein Tank, A., Klok, E., Jones, P., New, M., 2008. A European daily high-resolution gridded data set of surface temperature and precipitation for 1950–2006. Journal of Geophysical Research 113, D20119.

Hu, Y., Schmeits, M. J., Jan van Andel, S., Verkade, J. S., Xu, M., Solomatine, D. P., Liang, Z., 2016. A stratified sampling approach for improved sampling from a calibrated ensemble forecast distribution. Journal of Hydrometeorology 17 (9), 2405–2417.

Jolliffe, I. T., Stephenson, D. B. (Eds.), 2012. Forecast Verification: A Practitioner's Guide in Atmospheric Science, second ed. John Wiley & Sons.

Kang, T.-H., Kim, Y.-O., Hong, I.-P., 2010. Comparison of pre- and post-processors for ensemble streamflow prediction. Atmospheric Science Letters 11 (2), 153–159.

Kelly, K., Krzysztofowicz, R., 2000. Precipitation uncertainty processor for probabilistic river stage forecasting. Water Resources Research 36 (9).

Koenker, R., 2005. Quantile Regression. Cambridge University Press.

Koenker, R., 2013. quantreg: Quantile Regression. R package version 5.05. URL http://CRAN.R-project.org/package=quantreg

Koenker, R., Bassett Jr, G., 1978. Regression quantiles. Econometrica: Journal of the Econometric Society, 33–50.

Koenker, R., Hallock, K., 2001. Quantile regression. The Journal of Economic Perspectives 15 (4), 143–156.

Krzysztofowicz, R., 2002. Bayesian system for probabilistic river stage forecasting. Journal of Hydrology 268 (1-4), 16–40.

Krzysztofowicz, R., Kelly, K., 2000. Hydrologic uncertainty processor for probabilistic river stage forecasting. Water Resources Research 36 (11).

Lahiri, P., 2003. On the impact of bootstrap in survey sampling and small-area estimation. Statistical Science 18 (2), 199–210.

López López, P., Verkade, J., Weerts, A., Solomatine, D., 2014. Alternative configurations of quantile regression for estimating predictive uncertainty in water level forecasts for the upper Severn River: a comparison. Hydrology and Earth System Sciences Discussions 11, 3811–3855.

Montanari, A., Brath, A., 2004. A stochastic approach for assessing the uncertainty of rainfall-runoff simulations. Water Resources Research 40 (1), W01106.

Montanari, A., Grossi, G., 2008. Estimating the uncertainty of hydrological forecasts: A statistical approach. Water Resources Research 44 (12).

Murphy, A., 1985. Decision making and the value of forecasts in a generalized model of the cost-loss ratio situation. Monthly Weather Review 113 (3), 362–369.

Nielsen, H. A., Madsen, H., Nielsen, T. S., 2006. Using quantile regression to extend an existing wind power forecasting system with probabilistic forecasts. Wind Energy 9 (1-2), 95–108.

Pagano, T. C., Shrestha, D. L., Wang, Q. J., Robertson, D., Hapuarachchi, P., 2013. Ensemble dressing for hydrological applications. Hydrological Processes.

Pagano, T. C., Wood, A. W., Ramos, M.-H., Cloke, H. L., Pappenberger, F., Clark, M. P., Cranston, M., Kavetski, D., Mathevet, T., Sorooshian, S., Verkade, J. S., 2014. Challenges of operational river forecasting. Journal of Hydrometeorology.

Photiadou, C. S., Weerts, A. H., van den Hurk, B. J. J. M., 2011. Evaluation of two precipitation data sets for the Rhine River using streamflow simulations. Hydrology and Earth System Sciences 15 (11), 3355–3366.

Politis, D. N., Romano, J. P., 1994. The stationary bootstrap. Journal of the American Statistical Association 89 (428), 1303–1313.

R Core Team, 2013. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. URL http://www.R-project.org/

Raftery, A., Gneiting, T., Balabdaoui, F., Polakowski, M., 2005. Using Bayesian model averaging to calibrate forecast ensembles. Monthly Weather Review 133 (5), 1155–1174.

Ramos, M., Voisin, N., Verkade, J., 2013. HEPEX Science and Implementation Plan: hydro-meteorological post-processing. URL http://hepex.irstea.fr/science-and-implementation-plan/

Reggiani, P., Renner, M., Weerts, A., van Gelder, P., 2009. Uncertainty assessment via Bayesian revision of ensemble streamflow predictions in the operational river Rhine forecasting system. Water Resources Research 45 (2).

Reggiani, P., Weerts, A., 2008. A Bayesian approach to decision-making under uncertainty: An application to real-time forecasting in the river Rhine. Journal of Hydrology 356 (1-2), 56–69.

Regonda, S. K., Seo, D.-J., Lawrence, B., Brown, J. D., Demargne, J., 2013. Short-term ensemble streamflow forecasting using operationally-produced single-valued streamflow forecasts – a hydrologic model output statistics (HMOS) approach. Journal of Hydrology 497, 80–96.

Roscoe, K. L., Weerts, A. H., Schroevers, M., 2012. Estimation of the uncertainty in water level forecasts at ungauged river locations using quantile regression. International Journal of River Basin Management 10 (4), 383–394.

Roulston, M. S., Smith, L. A., 2003. Combining dynamical and statistical ensembles. Tellus A 55 (1), 16–30.

Schefzik, R., 2016a. Combining parametric low-dimensional ensemble postprocessing with reordering methods. Quarterly Journal of the Royal Meteorological Society.

Schefzik, R., 2016b. A similarity-based implementation of the Schaake shuffle. Monthly Weather Review 144 (5), 1909–1921.

Seo, D., Herr, H., Schaake, J., 2006. A statistical post-processor for accounting of hydrologic uncertainty in short-range ensemble streamflow prediction. Hydrology and Earth System Sciences Discussions 3 (4), 1987–2035.

Solomatine, D. P., Shrestha, D. L., 2009. A novel method to estimate model uncertainty using machine learning techniques. Water Resources Research 45 (12).

Todini, E., 2008. A model conditional processor to assess predictive uncertainty in flood forecasting. International Journal of River Basin Management 6 (2), 123–137.

Unger, D., van den Dool, H., O'Lenic, E., Collins, D., 2009. Ensemble regression. Monthly Weather Review 137 (7), 2365–2379.

Verkade, J., Brown, J., Reggiani, P., Weerts, A., 2013. Post-processing ECMWF precipitation and temperature ensemble reforecasts for operational hydrologic forecasting at various spatial scales. Journal of Hydrology 501, 73–91.

Verkade, J. S., Werner, M. G. F., 2011. Estimating the benefits of single value and probability forecasting for flood warning. Hydrology and Earth System Sciences 15 (12), 3751–3765.

Weerts, A., Winsemius, H., Verkade, J., 2011. Estimation of predictive hydrological uncertainty using quantile regression: examples from the National Flood Forecasting System (England and Wales). Hydrology and Earth System Sciences 15 (1), 255–265.

Werner, M., Schellekens, J., Gijsbers, P., van Dijk, M., van den Akker, O., Heynert, K., 2013. The Delft-FEWS flow forecasting system. Environmental Modelling & Software 40, 65–77.

Wilks, D., 2011. Statistical Methods in the Atmospheric Sciences. Academic Press, Oxford, United Kingdom.

Zhao, L., Duan, Q., Schaake, J., Ye, A., Xia, J., 2011. A hydrologic post-processor for ensemble streamflow predictions. Advances in Geosciences 29, 51–59.

Zhu, Y., Toth, Z., Wobus, R., Richardson, D., Mylne, K., 2002. The economic value of ensemble-based weather forecasts. Bulletin of the American Meteorological Society 83 (1), 73–83.


Table 1: Summary of characteristics of the various types of model runs used in the present manuscript

Type of model run | Forcings | Characterised by lead time? | Use
Simulations | Observed meteorological parameters | No | Estimation of hydrological uncertainty in cases where no data assimilation is applied
Perfect forcing (hydrological) forecasts | Observed meteorological parameters | Yes | Estimation of hydrological uncertainty in cases where data assimilation is applied
Deterministic and ensemble (hydrological) forecasts | Meteorological forecasts | Yes | These comprise the raw hydrological forecasts

Table 2: Contingency table. The consequences of the items listed are in brackets. Taken from Verkade and Werner (2011).

                   | Event observed              | Event NOT observed             | Sum
Warning issued     | hits h (C + L_u)            | false alarms f (C)             | w
Warning NOT issued | missed events m (L_a + L_u) | quiets/correct negatives q (−) | w′
Sum                | o                           | o′                             | N

Figure 1: Schematic representation of the dressing procedures. The vertical red line denotes the issue time of the forecast, with the observations (black dots) in the past and the raw forecasts (blue lines) in the future.

Figure 2: Sub-sample size as a function of the probability P of non-exceedence of the observation in the climatological record.

Figure 3: Map of the Meuse and Rhine basins and the forecasting locations that are considered in this manuscript.

Figure 4: Distribution of streamflow observations at the forecasting locations considered in this study.

Figure 5: Quantile Regression plots for St Pieter for deterministic (top row) and perfect forcing (bottom row) forecasts with 24-h, 72-h and 168-h forecasts (columns). The blue lines indicate the estimated 5%, 10%, 25%, 75%, 90% and 95% quantiles; the red line indicates the median quantile. As a reminder, the reader is referred to section 2.5 for a description of how simulations differ from forecasts and perfect forcing forecasts.

Figure 6: Quantile Regression plots for Metz (top), Cochem (middle) and Lobith (bottom) for simulations (leftmost column) and deterministic forecasts with 24-h, 72-h and 168-h forecasts (rightmost three columns). The blue lines indicate the estimated 5%, 10%, 25%, 75%, 90% and 95% quantiles; the red line indicates the median quantile. As a reminder, the reader is referred to section 2.5 for a description of how simulations differ from forecasts and perfect forcing forecasts.

Figure 7: Sample forecasts of the scenarios: deterministic and ensemble based forecasts in top and bottom row, respectively. The vertical red lines on the horizontal time axis indicate forecast issue time. Observed values are indicated by blue dots. Note that the lines for the raw forecasts represent one or multiple traces, whereas the lines for the dressed forecasts represent quantiles.

Figure 8: Relative Mean Error (RME) as a function of lead time for several subsamples of the verification pairs (columns) and several locations (rows). Note that the vertical scales used in the various columns differ. The shading shows the width of the intervals obtained by bootstrapping the verification metric (see section 2.7 for details).

Figure 9: Reliability plots for various lead times (columns) for several locations (rows). The plot is conditional on the observation exceeding the 50th percentile of the climatological exceedence probability (i.e., P = 0.50).

Figure 10: Reliability plots for various lead times (columns) for several locations (rows). The plot is conditional on the observations exceeding the 90th percentile of the climatological exceedence probability (i.e., P = 0.90).

Figure 11: Sharpness plots for various lead times (columns) for several locations (rows). The plot is unconditional, i.e. for the full data sample (i.e., P = 0).

Figure 12: Sharpness plots for various lead times (columns) for several locations (rows). The plot is conditional on the observations exceeding the 90th percentile of the climatological exceedence probability (i.e., P = 0.90).

Figure 13: Brier Skill Score (BSS) as a function of lead time for several events (columns) and several locations (rows). The shading shows the width of the intervals obtained by bootstrapping the verification metric (see section 2.7 for details).

Figure 14: Mean Continuous Ranked Probability Skill Score (CRPSS) as a function of lead time for several subsamples of the verification pairs (columns) and several locations (rows). The shading shows the width of the intervals obtained by bootstrapping the verification metric (see section 2.7 for details).

Figure 15: ROC plots for various lead times (columns) for several locations (rows). The plot is for the event that the posterior water level exceeds the 90th percentile of the climatological exceedence probability (i.e., P = 0.90). The shading shows the width of the intervals obtained by bootstrapping the verification metric (see section 2.7 for details).

Figure 16: Relative Operating Characteristic Score (ROCS) as a function of lead time for several subsamples of the verification pairs (columns) and several locations (rows).

Figure 17: Value plots for various lead times (columns) for several locations (rows). The plot is conditional on the observations exceeding the 50th percentile of the climatological exceedence probability (i.e., P = 0.50).

Figure 18: Value plots for various lead times (columns) for several locations (rows). The plot is conditional on the observations exceeding the 90th percentile of the climatological exceedence probability (i.e., P = 0.90).