Do higher data frequency and Bayesian auto-calibration lead to better model calibration? Insights from an application of INCA-P, a process-based river phosphorus model

Journal of Hydrology 527 (2015) 641–655

Leah A. Jackson-Blake a,*, Jostein Starrfelt b,c

a The James Hutton Institute, Craigiebuckler, Aberdeen AB15 8QH, UK
b NIVA, Gaustadalléen 21, NO-0349 Oslo, Norway
c Centre for Ecological and Evolutionary Synthesis, Department of Biosciences, University of Oslo, P.O. Box 1066 Blindern, NO-0316 Oslo, Norway

Article info

Article history: Received 23 June 2014; Received in revised form 10 February 2015; Accepted 1 May 2015; Available online 9 May 2015. This manuscript was handled by Peter K. Kitanidis, Editor-in-Chief, with the assistance of Wolfgang Nowak, Associate Editor.

Keywords: Catchment modelling; Phosphorus; INCA; MCMC; DREAM; Data frequency

Summary

We use Bayesian auto-calibration to explore how observed data frequency affects the performance and uncertainty of INCA-P, a process-based catchment phosphorus model. A fortnightly dataset of total dissolved phosphorus (TDP) concentration was derived from 18 months of daily data from a small (51 km2) rural catchment in northeast Scotland. We then use the DiffeRential Evolution Adaptive Metropolis (DREAM) Markov Chain Monte Carlo (MCMC) algorithm to calibrate 29 of the >127 model parameters against the daily and the fortnightly observed datasets. Using daily rather than fortnightly data for model calibration resulted in a large reduction in parameter-related uncertainty in model output and a lower risk of obtaining unrealistic results. However, peaks in TDP concentration were simulated no better than when fortnightly data were used, and a manual model calibration did a better job of simulating the magnitude of TDP peaks and baseflow concentrations, suggesting that alternative measures of model performance may be needed in the auto-calibration. Results suggest that higher frequency sampling, perhaps for just a short period, can greatly increase the confidence that can be placed in model output. In addition, we highlight the many subjective elements involved in auto-calibration, in an attempt to temper a common perception that auto-calibration is an objective and rigorous alternative to manual calibration. Finally, we suggest practical improvements that could make models such as INCA-P more suited to auto-calibration and uncertainty analyses, a key requirement being model simplification.

© 2015 Elsevier B.V. All rights reserved.

1. Introduction

Nutrient enrichment of aquatic ecosystems can result in the excessive growth of algae and changes in species composition. Nitrogen and phosphorus (P) are the two most important nutrients contributing to eutrophication, and P is generally of most concern in freshwaters (Correll, 1998). Reducing P fluxes to surface waters has therefore become a top conservation priority. In-stream P concentration is controlled by a variety of input fluxes and processes, many of which are highly variable spatially and temporally (e.g. House, 2003; Edwards and Withers, 2007; Withers and Jarvie, 2008). Many of these fluxes can be modified by land management, but there is often no straightforward or immediate link between implementing a measure to reduce P inputs and observable in-stream effects (Meals et al., 2010; Jarvie et al., 2013; Sharpley et al., 2013).

This makes it difficult for policy makers and water managers to decide on appropriate measures, or to know when improvements in water quality might be expected after measures have been implemented. Given this complexity, it can be useful to use process-based, integrated catchment models, which provide a means of formalising current knowledge of a catchment system. These models can be used to help set appropriate water quality and load reduction goals, to advise on the best means of achieving those goals, to predict time lags in the system and to explore potential system responses to future environmental change. For such models to be useful they need to be well calibrated and tested and be shown to reproduce key catchment processes. Many model parameter values may need to be determined through calibration, either because they are not directly measurable, there is a lack of data to constrain values, or measurements apply at different spatial and/or temporal scales to those at which the model operates (Oreskes et al., 1994). In general, the more parameters a model has, the more observations are required for those parameters to be identifiable (Kuczera and Mroczkowski, 1998). Calibration can therefore be challenging for process-based catchment models, which tend to be complex and highly parameterised. Despite this, most catchment P model applications are calibrated using relatively sparse fortnightly or monthly observed data (reviewed in Jackson-Blake et al., 2015). We might expect daily data, in which short-lived flow events are better captured, to result in more realistic model calibrations. McIntyre and Wheater (2004) obtained better model calibrations using higher frequency synthetic datasets, but few studies have compared model calibrations using different frequencies of real data. A primary aim of this paper is therefore to investigate how increasing the sampling frequency of the observed data used in model calibration impacts on the quality and uncertainty of simulated dissolved P concentrations over the calibration period. We used the INtegrated CAtchment model of Phosphorus dynamics (INCA-P; Wade et al., 2002, 2007), which has been applied extensively to assess the potential effects of changing land management, land use and climate on water quality (e.g. Crossman et al., 2013; Farkas et al., 2013; Whitehead et al., 2013).

To allow a robust comparison between model calibrations obtained using different observed data frequencies, we used an automated model calibration procedure. In recent years, particular emphasis has been placed on the benefits of automated calibration techniques over manual calibration (e.g. Gupta et al., 2006). For most hydrological and process-based water quality models, the large number of non-linearly interacting parameters makes manual calibration a labour intensive and difficult process, requiring considerable experience (Wagener and Gupta, 2005). With automated techniques, algorithms search the parameter space to find an optimum parameter set or set of probability distributions, for a given set of observed data. Automated techniques may therefore allow a more thorough searching of the parameter space, and can often be used to investigate model sensitivity to the parameters and parameter-related uncertainty in model output, both key in improving confidence in models. Whilst the ideas behind auto-calibration are certainly good, auto-calibration may produce less realistic results than manual calibration (Boyle et al., 2000; Van Liew et al., 2005). In addition, despite the obvious benefits of auto-calibration, it is still a highly subjective procedure (e.g. Pappenberger et al., 2007). We explore these issues by comparing auto-calibration results with a manual calibration, and by highlighting some of the key subjective elements encountered during the analysis. Finally, automated techniques can only be applied well when model design is suited to such techniques. We point out practical issues in the design of INCA-P which reduce the ease with which it, and similar models, can be automatically calibrated, and suggest improvements that could be made.

A number of automated calibration techniques exist. Here, we use a Bayesian framework, which results in probability densities for calibrated parameters rather than single 'optimum' values, as in procedures such as PEST (Doherty et al., 1994). Probability densities allow equifinality to be taken into account (Beven and Freer, 2001), as well as allowing us to quantify uncertainty in model output due to parameter uncertainty. In recent years much effort has been put into developing algorithms which are able to generate samples representative of complex posteriors. We use the DiffeRential Evolution Adaptive Metropolis (DREAM) Markov Chain Monte Carlo (MCMC) algorithm developed by Vrugt et al. (2009b), and adapted for use with INCA-P by Starrfelt and Kaste (2014), which has been shown to perform well when used with complex hydrological models (e.g. Huisman et al., 2010).

In summary, this paper has the following primary aims: (1) investigate how increasing the data frequency used in calibration affects the uncertainty and realism of model output; (2) highlight the subjective elements involved in auto-calibration and uncertainty analysis; and (3) suggest practical improvements that could be made to models such as INCA-P to make them better suited to auto-calibration.

2. Methods

2.1. Study catchment

The model applications were carried out for the Tarland Burn catchment (51 km2), a rural catchment in northeast Scotland where diffuse agricultural sources provide the dominant phosphorus input. Land use is a mixture of arable (17%), improved grassland (30%), rough grazing (13%) and forestry and moorland (40%). Agricultural activities focus on beef cattle, sheep and spring barley. Humus iron podzols and brown forest soils are the dominant soils underlying all but semi-natural land, where peaty podzols are also important. Over the study period (2004–2005), mean annual rainfall was 960 mm and median discharge was 0.65 m3 s-1, with 5th and 95th percentiles of 0.29 and 1.38 m3 s-1. The village of Tarland has a small wastewater treatment works and septic tanks serve around half the catchment's residents. Water quality is of concern, primarily due to inputs of nutrients and sediments from agriculture.

2.2. Monitoring data used for calibration

INCA-P has been shown to produce adequate dissolved P simulations in the study catchment, whilst particulate P is currently more problematic (Jackson-Blake et al., 2015). The analysis therefore focuses on dissolved P, and measurements of discharge and total dissolved P (TDP) concentration were used for model calibration. Daily mean discharge was obtained from 15 minute data from a gauging station at Coull, at the catchment outflow (Fig. 1). Daily stream water samples were collected between April 2004 and June 2005, also from the catchment outflow. Water samples were analysed for TDP, giving 15 months of daily data; full details of sampling protocol, auto-sampler set-up and analytical methods are given in Stutter et al. (2008). Additionally, lower frequency sampling (fortnightly or less) took place at three locations on the Tarland Burn upstream of Coull (Fig. 1), as well as between June and December 2005 at Coull. This was also used for model calibration, to give two full years of data.

Fig. 1. Map of the Tarland catchment, together with the location of monitoring points and effluent input, simplified land use, and reaches and associated subcatchments used in the INCA set up. Coull is at the bottom of reach 4. Eastings and northings (km) are relative to the British National Grid.


Observed stream water TDP from September and October 2005 was excluded from the observed data used in model calibration, as during this period the wastewater treatment works was upgraded and sewage effluent was discharged directly to the Tarland Burn, causing an increase in observed TDP concentrations. Sewage inputs are already poorly constrained in this catchment, without taking this additional temporary change in flux into account.

To investigate the impact of using a reduced frequency chemistry dataset to calibrate INCA-P, the daily TDP timeseries from Coull was down-sampled to a fortnightly timeseries. This was done by randomly selecting a start day within the first 14 day period in the record, and then selecting every 14th data point thereafter. This reduced the number of observations at Coull from 448 to 43 for the period 2004–2005. The three additional water monitoring locations upstream of Coull (Fig. 1) were already at a low sampling frequency, with between 20 and 29 observations over the study period. The 'daily' and down-sampled 'fortnightly' TDP timeseries from Coull both contain periods of lower frequency monitoring or data gaps, but for simplicity they are referred to as daily or fortnightly timeseries throughout the rest of the paper.

2.3. Description of the INCA-P model

INCA-P is a dynamic, process-based model developed by Wade et al. (2002) and significantly revised in 2007 (Wade et al., 2007). INCA-P uses a semi-distributed approach to simulate the flow of water, sediment and particulate and dissolved P through the terrestrial system to the river. Model equations are given in Wade et al. (2007) and the main inputs, stores, processes and transport pathways are summarised diagrammatically in Jackson-Blake et al. (2015). Briefly, the main stem of a water course is split into reaches with associated sub-catchments. As many land use classes as desired can be defined, each with separate P inputs and parameterisations of the equations that control plant uptake, weathering, immobilisation, water flows, etc. Within each sub-catchment the flow of water and nutrients is calculated for each land use type, and then these inputs are summed to give the overall input to the associated stream reach. In-stream processing may be taken into account by switching on model processes such as sediment settling and re-suspension, bank erosion, exchange of phosphorus between an adsorbed and a dissolved phase and biological uptake. The most recent model development is the facility to simulate fully branched river networks (Whitehead et al., 2011). We used the command line version of branching INCA-P for the auto-calibration applications, and the graphical user interface version for the manual calibration (both version 1.0.2).

2.4. Spatial set-up of INCA-P and input data

Four sub-catchments and associated stream reaches were defined above Coull, based on the location of effluent discharge and monitoring points (Fig. 1). Reaches 1–3 therefore had associated low frequency TDP monitoring data, whilst reach 4 (Coull) had associated discharge and daily TDP monitoring data, or the down-sampled fortnightly dataset. Sub-catchments were split into just two land use classes, semi-natural and agricultural (Fig. 1). These were derived by amalgamating land use classes from the UK Land Cover Map 2007 (LCM2007; Morton et al., 2011). INCA requires input timeseries of precipitation, air temperature, hydrologically effective rainfall (HER) and soil moisture deficit (SMD); a single timeseries of each was provided for the whole catchment.
Precipitation and air temperature were derived from the Met Office 5 km gridded dataset. Potential evapotranspiration (PET) was calculated using the FAO56 modified Penman Monteith method (Allen et al., 1998). HER and SMD were calculated using a simple spreadsheet water balance model, building


on that described by Bernal et al. (2004) and Durand (2004); a full description of the model and its parameterisation for this study is given in Jackson-Blake et al. (2015).

2.5. Brief introduction to Bayesian auto-calibration and the DREAM algorithm

Bayesian auto-calibration techniques involve firstly defining a prior distribution for each model parameter to be calibrated, given no knowledge of the observations. A formal likelihood function is also defined, in which model output is compared to observations, and which reflects the probability of a set of observations given the parameter values. To define the likelihood function the modeller must assume a distribution for the model residuals, and the parameters defining this distribution must be estimated in some way. An algorithm then searches parameter space and the prior is updated to take into account the information contained in the observations, to produce the posterior distribution of parameter values. This can be factorized into marginal posteriors for each model parameter of interest. The calibrated parameters are therefore not single values, as in a manual calibration or an optimisation algorithm, but probability densities. We can then sample from the posterior distribution, generate model output for each member of the sample and, by looking at the range in simulated outcomes, quantify the effect of parameter uncertainty on model predictions.

Markov chain Monte Carlo (MCMC) sampling techniques are a useful means of generating samples from complex posteriors. We used the DREAM MCMC scheme. Full details are given in Vrugt et al. (2009b); in short, multiple Markov chains are used to explore the parameter space simultaneously, and the algorithm attempts to increase the speed at which chains converge on the posterior by tuning the step taken by each chain at each iteration, with smaller steps as chains converge. Proposed parameter values are accepted or rejected as in a traditional Metropolis–Hastings algorithm (e.g. Chib and Greenberg, 1995).

2.6. INCA-P parameterisation and choice of priors

INCA-P requires around 127 parameter values to be supplied by the user, excluding those that are directly measured from a GIS (e.g. land use area), and before taking any spatial heterogeneity into account (i.e. how parameters vary by land use, sub-catchment or reach). Only a subset of these could be auto-calibrated. Practically, run times are an issue, as the more parameters that are included, the more runs are required to effectively search the parameter space. Acceptance rates were also found to drop to unacceptably low values once the number of parameters increased above around 40. Manual model calibration and an examination of model equations were used to identify around 37 parameters that might be expected to exert an important control on the simulation of hydrology and phosphorus, in some conditions at least, and before taking any spatial heterogeneity into account (Jackson-Blake et al., 2015). All but 11 of these were included in the MCMC analysis. Five were excluded as they are only likely to be important for particulate phosphorus simulation or for TDP in pristine upland environments (direct runoff time constant, fraction of particulate P in the stream bed, inorganic P deposition), or because they were not relevant in the simple model set up used (stream bed Freundlich isotherm constant, groundwater sorption coefficient).
Another six potentially key parameters were excluded because of practical difficulties (discussed further in Section 4.3). Of the parameters to be calibrated, four were linked to other parameters and varied synchronously (Table 1; e.g. plant growth and fertilizer addition periods). Three of the 12 land use parameters were varied independently for


Table 1
Parameters varied in the DREAM analysis, parameter aliases, maximum and minimum values used to define marginal prior distributions, and values from the manual calibration. Whether parameters were varied independently for different land uses (LU), sub-catchments (S) and reaches (R) is also shown. P: phosphorus, GW: groundwater, SN: semi-natural, A: arable. Where a single prior is given for more than one parameter, those parameters were varied synchronously during the analysis.

Type | Parameter | Alias | Land use class (LU), reach (R) or sub-catchment (S) | Prior min | Prior max | Manual calibration
Hydrology | Soil zone time constant (days) | SW Tc (SN) | Semi-natural | 1 | 20 | 14
Hydrology | Soil zone time constant (days) | SW Tc (A) | Agricultural | 1 | 20 | 3
Hydrology | Baseflow Index (-) | BFI | All S | 0.3 | 0.9 | 0.7
Hydrology | Groundwater time constant (days) | GW Tc | All S | 30 | 100 | 70
Hydrology | Initial groundwater flow (m3 s-1) | GW qinit | All S | 0.001 | 0.02 | 0.01
Hydrology | Groundwater sustainable flow, the minimum groundwater flow (m3 s-1) | GW qsus | All S | 0 | 0.01 | 0.005
Hydrology | Max. infiltration rate, above which infiltration excess occurs (N.B. this includes all quick flow paths; mm) | Imax | Both LU | 1 | 50 | 2
Hydrology | Proportion of rainfall above Imax that contributes to infiltration excess (-) | Iex rain f | All S | 0 | 0.05 | 0.005
Hydrology | Threshold soil zone flow above which saturation excess runoff occurs (m3 s-1) | SW q for SEx | All S | 0.01 | 0.2 | 0.05
P (soil processes) | Soil water equilibrium phosphorus concentration (EPC0) and initial soil water TDP concentration (mg P l-1) | SW EPC0 (SN) | Semi-natural | 0.001 | 0.015 | 0.006
P (soil processes) | Soil water equilibrium phosphorus concentration (EPC0) and initial soil water TDP concentration (mg P l-1) | SW EPC0 (A) | Agricultural | 0.01 | 0.2 | 0.1
P (soil processes) | Soil water Freundlich isotherm constant (controls sorption non-linearity; -) | SW nF | Both LU | 0.01 | 1 | 0.5
P (soil processes) | Soil water adsorption coefficient (l kg soil-1 day-1) | SW Kad (SN) | Semi-natural | 0 | 10 | 10
P (soil processes) | Soil water adsorption coefficient (l kg soil-1 day-1) | SW Kad (A) | Agricultural | 0 | 10 | 0.55
P (soil processes) | Topsoil depth (m) | Soil d | Both LU | 0.1 | 1 | 0.2
P (soil processes) | Plant growth & fertilizer addition periods (days) | PG & F periods | Both LU | 169 | 213 | 180
P (soil processes) | Liquid fertiliser addition (kg P ha-1 day-1) | Fert | Agricultural | 0.07 | 0.14 | 0.111
P (soil processes) | Plant P uptake coefficient (m day-1) | PG uptake k | Agricultural | 0.01 | 0.5 | 0.2
P (effluent) | Effluent flow rate (m3 s-1) | Eff q (R1) | R1 | 5E-06 | 5E-05 | 2.2E-05
P (effluent) | Effluent flow rate (m3 s-1) | Eff q (R2) | R2 | 1E-05 | 1E-04 | 6.0E-05
P (effluent) | Effluent flow rate (m3 s-1) | Eff q (R3) | R3 | 3E-04 | 3E-03 | 1.3E-03
P (effluent) | Effluent flow rate (m3 s-1) | Eff q (R4) | R4 | 1E-05 | 1E-04 | 6.0E-05
P (effluent) | TDP concentration in effluent (mg P l-1) | Eff TDP (R1, 2, 4) | R1, R2 & R4 | 0.01 | 10 | 7.8
P (effluent) | TDP concentration in effluent (mg P l-1) | Eff TDP (R3) | R3 | 0.01 | 10 | 3
P (groundwater) | Aquifer matrix EPC0 and initial groundwater TDP (mg P l-1) | GW TDP | All S | 0.001 | 0.05 | 0.01
P (in-stream) | Water exchange between the water column and the stream bed (m3 day-1) | P exchange | All R | 0 | 10 | 10
P (in-stream) | Stream bed depth (m) | Streambed d | All R | 0.1 | 1 | 1
P (in-stream) | Stream bed P adsorption coefficient (l kg soil-1 day-1) | Streambed Kad | All R | 0 | 700 | 400
P (in-stream) | Stream bed EPC0 (mg P l-1) | Streambed EPC0 | All R | 0.001 | 0.03 | 0.009

different land use classes and effluent inputs were varied by reach; all others were assumed to be spatially homogeneous. This gave a total of 29 effective parameters for auto-calibration (Table 1). All other parameters were fixed at values given in Jackson-Blake et al. (2015). Suitable priors must be chosen for each parameter included in the MCMC analysis. In this application, these priors define the domain within which parameter values can vary, so they need to encompass the plausible range of values whilst not causing numerical instabilities. Given a lack of justification for doing otherwise, uniform priors were used, which avoid underestimation of model prediction uncertainties (Freni and Mannina, 2010). For parameters that have some physical basis, upper and lower limits were based on literature-derived or measured/estimated values (e.g. groundwater TDP concentration, equilibrium P concentrations and effluent flow rates and concentrations). Otherwise, wide priors were used, to encompass physically realistic values. For the majority of parameters, manual model testing was required to help determine these ranges. Output from DREAM test runs was used to extend the range of some priors, unless there was a physical reason for not doing so. The final choice of priors is given in Table 1.
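As an illustration of how the priors in Table 1 enter the analysis, they can be encoded as simple (minimum, maximum) bounds and sampled to generate starting points for the Markov chains. The following is a minimal Python sketch using a few of the Table 1 aliases; it is not the DREAM implementation itself, and the variable names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(1)

# Uniform priors as (min, max) bounds, keyed by the parameter aliases of Table 1
# (only a handful of the 29 calibrated parameters are shown here)
priors = {
    "SW Tc (SN)": (1.0, 20.0),    # soil zone time constant, semi-natural (days)
    "SW Tc (A)":  (1.0, 20.0),    # soil zone time constant, agricultural (days)
    "BFI":        (0.3, 0.9),     # baseflow index
    "GW TDP":     (0.001, 0.05),  # groundwater TDP concentration (mg P l-1)
}

def sample_from_priors(priors, rng):
    """Draw one parameter set from independent uniform priors."""
    return {name: rng.uniform(lo, hi) for name, (lo, hi) in priors.items()}

# e.g. one starting parameter set per Markov chain:
# chain_starts = [sample_from_priors(priors, rng) for _ in range(20)]
```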

2.7. Set-up of the DREAM algorithm

Initially, two DREAM applications were carried out. For both applications, INCA-P was run for the period 2004–2005, encompassing the 15 month period with daily TDP data at reach 4. Likelihoods were calculated by comparing model output to TDP observations from reaches 1 to 4 and to discharge observations from reach 4. Both applications were calibrated to daily discharge data, but different frequencies of observed TDP data were used. The first application (DREAM-d) used daily TDP observations from reach 4, the second (DREAM-f) used the fortnightly TDP timeseries derived from the daily series (Section 2.2).

We defined the likelihood function as in Starrfelt and Kaste (2014), assuming model errors (simulated minus observed), after log transformation of both datasets, to be independent and normally distributed with mean zero and variance σ². More complex error models require more parameters to be fitted, so this simple model is a good starting point; its validity is examined in Section 3.1. For numerical stability, calculations were performed using the log likelihood (Eq. (1)), where P(obs|θ) is the probability of the observed data given the set of parameter values (as well as the model structure, input data and error model); n_k is the number of observations


for each variable, k (discharge and TDP); σ_k² is the residual variance, independent for each observed variable. The square of the residuals is summed over i timesteps and j reaches, and then the log likelihood for each of the k observed variables is summed to give the overall log likelihood. This introduces one additional parameter that needs to be fitted for each observed variable, the error variance σ_k². This is done within the DREAM algorithm using Gibbs sampling, assuming σ² to follow a gamma distribution (for more details, see Vrugt et al., 2009b).

\[
\ln P(\mathrm{obs} \mid \theta) = -\frac{1}{2} \sum_{k} \left[ n_k \ln\left(2\pi\sigma_k^2\right) + \frac{\sum_{j}\sum_{i} \left(\mathrm{sim}_{ji} - \mathrm{obs}_{ji}\right)^2}{\sigma_k^2} \right] \tag{1}
\]
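As an illustration, the log likelihood of Eq. (1) for a single observed variable can be written in a few lines of Python. This is a minimal sketch: array names are illustrative, and in the actual DREAM set-up the error variance σ_k² is itself updated by Gibbs sampling rather than passed in as a fixed value.

```python
import numpy as np

def log_likelihood(sim, obs, sigma2):
    """Gaussian log-likelihood of Eq. (1) for one observed variable k.

    sim, obs : 1-D arrays of log-transformed simulated and observed values,
               flattened over all reaches j and timesteps i with observations.
    sigma2   : residual (error) variance for this variable.
    """
    resid = sim - obs
    n = obs.size
    return -0.5 * (n * np.log(2.0 * np.pi * sigma2) + np.sum(resid ** 2) / sigma2)

# The overall log likelihood is the sum over observed variables
# (here discharge and TDP), each with its own error variance:
# total = log_likelihood(sim_q, obs_q, var_q) + log_likelihood(sim_tdp, obs_tdp, var_tdp)
```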

The number of Markov chains needs to be greater than half the number of parameters (Vrugt et al., 2009b); in this application, 20 were used. Low acceptance rates were a problem during test runs, indicating that the chains were not mixing well, and thus not sampling the posterior efficiently. This was improved by altering the DREAM parameters which control the variance of the proposal distribution, in effect the step size taken by the Markov chains. The final DREAM parameters opted for were delta = 3, 3 crossover values, gamma = 1 every 25th iteration, b = 0.1 and b* = 1 × 10^-5 (for an explanation of these parameters, see Vrugt et al., 2009b).

The Gelman–Rubin (GR) statistic, or 'shrink factor', was used to test for convergence on the posterior distribution. This statistic approaches 1 when within-chain variance dominates compared to between-chain variance, and is generally taken to mean that all chains have escaped the influence of their starting points and have fully traversed the target distribution. The algorithm was deemed to have converged on the posterior once the GR statistic was below 1.2 for all calibrated parameters (Gelman and Rubin, 1992), implying that the posterior distribution might narrow by a further 20% were the simulations to carry on indefinitely. Iterations before this period were discarded; iterations after were classed as samples from the posterior.

2.8. Manual calibration of INCA-P

A manual calibration of INCA-P in the study catchment is described in Jackson-Blake et al. (2015). Briefly, the method involved first deciding on an initial parameter set, by assigning preliminary parameter values using literature, GIS or experimentally-derived values (for parameters with some physical basis), or by using parameter values used in other INCA-P model applications. Model parameters were then altered, within plausible ranges where appropriate, first focusing on obtaining a good discharge calibration and then on phosphorus. During this procedure, model performance was evaluated using a visual assessment of time series and a combination of performance statistics (including NS, bias and RMSD; described in Section 2.9). In addition, the calibration procedure involved checking for relatively constant stores of soil P, groundwater volume and stream bed sediment over the course of the model run, and ensuring that soil solution TDP concentrations lay within the observed range. To ensure comparability with results from the auto-calibration, the set-up described in Jackson-Blake et al. (2015) was simplified to just two land use classes, by assuming all agricultural land had the parameterisation used for improved grassland, whilst rough grazing and semi-natural land were grouped as semi-natural land.

2.9. Analysis of the output

For each DREAM application, 100 parameter sets were randomly sampled from the converged portion of each of the 20 chains, providing 2000 samples from the posterior distribution.
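The Gelman–Rubin check used to identify the converged portion of each chain (Section 2.7) can be sketched as follows for a single parameter. This is an illustrative Python sketch assuming the chains are held in a NumPy array; the actual analysis used the statistic as implemented for the DREAM runs.

```python
import numpy as np

def gelman_rubin(chains):
    """Gelman-Rubin statistic (potential scale reduction factor) for one parameter.

    chains : array of shape (m, n) -- m parallel chains, each of n iterations.
    Values close to 1 indicate that between-chain variance is small relative
    to within-chain variance, i.e. the chains have converged on the posterior.
    """
    chains = np.asarray(chains, dtype=float)
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    within = chains.var(axis=1, ddof=1).mean()        # W: mean within-chain variance
    between = n * chain_means.var(ddof=1)             # B: between-chain variance
    var_hat = (n - 1) / n * within + between / n      # pooled variance estimate
    return np.sqrt(var_hat / within)
```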


INCA-P was then re-run using each of these parameter sets. INCA output timeseries of simulated discharge and surface water chemistry for each of these runs were stored, and medians, 2.5 and 97.5 percentiles were used to provide an estimate of uncertainty in model output due to parameter uncertainty.

Model performance for both the manual and auto-calibration was assessed by a combination of a visual comparison of distributions (using boxplots) and timeseries, and a range of performance statistics. For the DREAM applications, performance statistics were calculated using timeseries of daily medians from the posterior ensemble of INCA-P simulations. The following statistics were calculated: correlation coefficients (Pearson's r and Spearman's rank), the coefficient of determination (r2, the square of Pearson's r), the Nash Sutcliffe efficiency criterion (NS; Nash and Sutcliffe, 1970), the NS calculated with logged values, model bias (mean of simulated minus mean of observed), percentage bias (sum of simulated minus observed, normalised by the sum of the observations) and the root mean square difference between simulated and observed, normalised by the standard deviation of the observations (RMSD*).

To plot marginal posterior probability densities for each of the parameters varied in the MCMC analysis, values from the converged portion of each chain were first combined into a single group. For each parameter, an associated pdf for this group was estimated non-parametrically using Gaussian kernel density estimation, using Scott's Rule to estimate kernel variance (Scott, 2009). Parameter covariance was investigated by calculating a correlation matrix (Pearson's r) for all parameters from the converged portion of each chain.

To compare TDP distributions under the different flow conditions, simulated and observed TDP values were summarised in boxplots, grouping by flow conditions on the corresponding day (whole series, peak flow or baseflow). Flow peaks were estimated using the following logic, where Q is discharge, subscript t denotes the current day and t-1 the previous day: if (Qt − Qt−1) ≥ 0.5 Qt−1, then Q is classed as a peak; otherwise, if the previous day was classified as a peak and either Qt > Qt−1 or |Qt − Qt−1| ≤ 0.05 Qt−1 (i.e. if there has been a peak and flow is still increasing, or not decreasing much), then Q is classified as a peak. This resulted in 42 days being classified as flow peaks. Periods of baseflow were assigned to periods when discharge was below 0.5 m3 s-1, discounting any flow peaks. This classification of flow conditions is somewhat arbitrary, but captures the majority of peaks and periods of baseflow apparent in the timeseries (Fig. 2), and is sufficient for the purposes of this study.

3. Results

3.1. DREAM algorithm

The first DREAM application using daily data (DREAM-d) achieved GR convergence statistics below 1.2 for all calibrated parameters by 5000 iterations. The second application using fortnightly data (DREAM-f) was slightly quicker, with GR statistics below 1.2 for all calibrated parameters by 4500 iterations. After this point both algorithms were left to run for another 12,000 iterations to provide a sample from the posterior, taking around seven days to run each application. Mean acceptance rates after convergence were 0.23 for DREAM-d and 0.43 for DREAM-f, within the desired range of around 0.1 to 0.45 (Roberts et al., 1997). Histograms of model residuals suggest that the assumption of normally distributed residuals, after log transformation, was valid (Fig. 3).
The distribution of errors from the DREAM-d TDP simulation was somewhat wider than expected, but otherwise the errors show little skew or bias. Likewise, residual plots against observations support the assumption of homogeneous error variance (data not shown).


Fig. 2. Observed discharge (Q) at Coull (reach 4) and estimated flow peaks and periods of baseflow, for use in comparing simulated and observed dissolved phosphorus concentrations under different flow conditions.
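For reference, the peak/baseflow classification rule described in Section 2.9 amounts to something like the following. This is an illustrative Python sketch only; q is a daily mean discharge series and the 0.5 m3 s-1 baseflow threshold is the one used in this study.

```python
def classify_flow(q):
    """Label each day as 'peak', 'baseflow' or 'other' (Section 2.9).

    q : sequence of daily mean discharge values (m3 s-1).
    """
    labels = ["other"] * len(q)
    for t in range(1, len(q)):
        rising = (q[t] - q[t - 1]) >= 0.5 * q[t - 1]
        # continuation of a peak: flow still rising, or barely falling
        sustained = labels[t - 1] == "peak" and (
            q[t] > q[t - 1] or abs(q[t] - q[t - 1]) <= 0.05 * q[t - 1]
        )
        if rising or sustained:
            labels[t] = "peak"
    for t, flow in enumerate(q):
        if labels[t] != "peak" and flow < 0.5:   # baseflow threshold (m3 s-1)
            labels[t] = "baseflow"
    return labels
```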

3.2. Comparison of observed timeseries

Down-sampling the TDP timeseries from daily to fortnightly reduced the number of observations used in calibration from 448 to 43. The distributions of the two datasets are similar (Fig. 4a), the main difference being the lack of high TDP values in the fortnightly series. The two series also have similar distributions under baseflow conditions (Fig. 4b), although the fortnightly series does not capture as many low concentrations as the daily series. However, the fortnightly series only includes one TDP value corresponding to peak flows, and hence could not be plotted in Fig. 4c.

3.3. Comparison of DREAM applications using daily and fortnightly observations

3.3.1. Comparison of uncertainty in simulated output

Perhaps the most obvious difference between timeseries of TDP simulations from DREAM-f and DREAM-d (Fig. 5) is that parameter-related uncertainty in model output is much greater

Fig. 3. Histograms of model residuals, calculated using the median simulated daily timeseries from the posterior sampling, after log transforming both the observed and simulated data. Associated fitted probability densities from kernel density estimation (KDE) are also shown, for ease of comparison to the idealised normal distribution of mean 0 and variance σ² (the latter determined by Gibbs sampling); the vertical zero error line is also shown. Varying bin widths are due to varying numbers of observations (n). Q: discharge, TDP: total dissolved phosphorus, D: daily DREAM application; F: fortnightly application.

for DREAM-f. For DREAM-d, 95% of simulated values lie within a 6 µg l-1 range, roughly a quarter the DREAM-f interval of 26 µg l-1. The uncertainty is similar for all flows up to the 95th flow percentile; above this the uncertainty interval roughly doubles for both applications, to 14 µg l-1 for DREAM-d and to 42 µg l-1 for DREAM-f. Soluble reactive P (SRP) can be estimated as 0.6 × TDP (from an empirical relationship between observed data; not shown), so this equates to parameter-related uncertainty during the majority of flow conditions of around ±1.8 µg SRP l-1 for DREAM-d and ±7.7 µg SRP l-1 for DREAM-f.

3.3.2. Comparison of distributions

Median simulated TDP timeseries from DREAM-d and DREAM-f have very narrow distributions compared to both the daily and fortnightly observed timeseries (Fig. 4a), with far fewer high concentrations than the daily observed data. Under baseflow conditions (Fig. 4b), the distribution of TDP from DREAM-d was again too narrow, and also slightly over-estimated. DREAM-f not only lacked the variability seen in either observed timeseries, but greatly over-estimated baseflow TDP, particularly compared to the daily observations. Finally, both DREAM-d and DREAM-f significantly underestimated TDP concentrations under peak flow conditions (Fig. 4c), particularly when compared to the daily observations.

3.3.3. Comparison of timeseries and model performance statistics

DREAM-d produced a reasonable simulation when compared to the observed TDP timeseries (Fig. 5b). Although many of the TDP peaks are under-estimated and baseflow TDP concentrations are over-estimated, there is an appropriate increase in simulated TDP concentration during flow events, particularly during winter. The summer simulation is worse, and the simulation generally lacks the higher frequency variability seen in the observations. Nonetheless, the response is realistic, and model performance statistics are reasonable for a TDP simulation in an agricultural setting (reviewed in Jackson-Blake et al., 2015), with significant correlation coefficients, small bias and positive NS (Table 2).

The DREAM-f TDP simulation is markedly poorer (Fig. 5c). Despite small bias and similar variability to the observed fortnightly timeseries (DREAM-fa statistics in Table 2), when compared to the daily observations we see that the simulated dynamics are unrealistic. Increases in flow are accompanied first by a decrease in simulated TDP, only later by an increase. This dilution effect is not seen in the observed data, and leads to DREAM-f having a negative correlation with observed daily data (Table 2).

3.3.4. Comparison of marginal posterior distributions for calibrated parameters

Differences in marginal posterior probability densities for DREAM-d and DREAM-f (Fig. 6) help explain the differences between the two calibrations. Key differences include:


Fig. 4. Boxplots comparing observed and simulated distributions of total dissolved phosphorus (TDP) concentrations for the whole study period and under different flow conditions. Data included: daily and fortnightly observations (Obs-d and Obs-f), the manual INCA-P calibration ('Manual'), and simulated output from DREAM calibrated using daily data (D-d) and fortnightly data (D-f). For D-f, superscript a is the original application, b a second application in which only one soil water time constant was calibrated (Section 3.3.5). Boxes were only plotted when the number of data points (n) was greater than 5. Obs-f was therefore excluded from the bottom plot (n = 1). Boxes mark the interquartile range (IQR) and horizontal lines within boxes the median. Whiskers extend to 1.5 IQR or the maximum/minimum data point. Data points beyond whiskers are shown as outliers.

(1) Differences in marginal posterior widths: Marginal posteriors derived from DREAM-f were less well defined than from DREAM-d, as might be expected given the smaller amount of information used in the calibration. This results in the greater uncertainty in model output from DREAM-f. The shape of marginal posteriors can also be used to say something about model sensitivity to each parameter. Although the width of a marginal posterior relative to the prior is


partly dependent on the prior, in general a narrow, well-defined peak implies that the model is sensitive to that parameter. Meanwhile, a relatively flat marginal posterior implies that the model is relatively insensitive to that parameter (although flat distributions can hide important parameter covariance; see below). For DREAM-d, particularly well-defined peaks were obtained for the majority of the hydrological parameters, as well as for agricultural soil water EPC0 (SW EPC0 (A)) and groundwater TDP concentration (Fig. 6). The latter play a key role in controlling the concentration of TDP in the soil and groundwater. Meanwhile, nine of the 29 calibrated parameters had relatively flat marginal posteriors, including two of the four parameters controlling terrestrial and in-stream sorption reactions (adsorption coefficients, Kad, and sediment or soil depth, d) and parameters controlling septic tank effluent inputs (Eff q and Eff TDP for all but reach 3). Marginal posteriors from DREAM-f were less well defined, and five additional parameters had essentially flat marginal posteriors (Fig. 6), including soil water EPC0 for both agricultural and semi-natural land. Indeed, with DREAM-f, INCA-P only appears sensitive to hydrology-related parameters and the three phosphorus-related parameters which control baseflow TDP concentrations (Eff TDP and Eff q at reach 3; GW TDP).

(2) Different soil water time constants (SW Tc): The unrealistic TDP simulation produced by DREAM-f appears to be due to issues with the marginal posteriors for the soil water time constants. DREAM-d posteriors had maxima at c. 8 days for semi-natural land and at c. 2 days for agricultural land. However, these were reversed in DREAM-f, with the shorter time constant being assigned to semi-natural land and the longer to agricultural land. This suggests that calibration to the discharge data, which was the same for both applications, favoured two separate soil water time constants. In the DREAM-d application, the daily TDP data resulted in the shorter of these being assigned to agricultural land. This makes intuitive sense, as water from agricultural land is likely to reach the stream quicker due to the presence of tile drains and drainage ditches, soil compaction, and the lower water retention capacity of agricultural soils in this area. The reduced information content of the fortnightly dataset led to the soil water time constants being assigned to the wrong land use, with important knock-on effects for the quality of the simulation: rainfall events initially result in increased soil water flow from semi-natural land which, with its low TDP concentration, causes a dilution event in the stream; only later, once soil water from agricultural land reaches the stream, do simulated concentrations increase (Fig. 5c).

Parameter covariance also differs for the two calibrations. Table 3 lists correlation coefficients between calibrated parameters where a correlation was present (only correlation coefficients >|0.2| are shown; weaker correlations were statistically significant but too weak to be of interest). 41 correlations were present between DREAM-d derived parameters, compared to only 23 for DREAM-f. This highlights an additional benefit of increasing data frequency used in model calibration, as through better constraining the posterior, any correlations inherent in the model structure become more visible. This knowledge is of use in identifying over-parameterisation and parameter identifiability issues (Section 4.3, point 1).
3.3.5. Additional DREAM application with fortnightly data

There was not enough information in the fortnightly TDP dataset to calibrate the soil water time constants for semi-natural and


Fig. 5. Timeseries of observed and simulated (a) discharge (Q) and (b–d) TDP (Total Dissolved Phosphorus). TDP was simulated using: (b) the DREAM algorithm, calibrating to daily data; (c) DREAM, calibrating to fortnightly data; and (d) a manual calibration. In (a), only simulated output from DREAM with daily data is shown; other simulated timeseries were nearly identical.

Table 2
Performance statistics for reach 4 for discharge (Q) and total dissolved phosphorus (TDP), for each of the INCA-P calibrations considered. MC: manual calibration; DREAM-d: auto-calibration using daily observations; DREAM-f: auto-calibration using fortnightly observations. Superscript a is the original DREAM-f calibration, b the second DREAM-f calibration in which just one soil water time constant was calibrated (see Section 3.3.5). Outputs from the DREAM-f calibrations were compared to both daily and fortnightly observations. N: number of observations; CC: correlation coefficient; NS: Nash Sutcliffe; RMSD*: root mean square difference normalised to the standard deviation of the observations. Asterisks give the significance of the test for correlation (* 0.001 < p < 0.05; ** p < 0.001). Units of bias are m3 s-1 for discharge and µg l-1 for TDP.

Variable, comparison, calibration | N | Spearman's CC | Pearson's CC | NS | NS (logs) | Bias | Bias (%) | RMSD*
Q, MC | 716 | 0.92** | 0.87** | 0.71 | 0.83 | 0.05 | 6.8 | 0.54
Q, DREAM-d | 716 | 0.91** | 0.87** | 0.76 | 0.84 | 0.02 | 3.4 | 0.49
Q, DREAM-f(a) | 716 | 0.92** | 0.87** | 0.69 | 0.84 | 0.04 | 6.4 | 0.55
Q, DREAM-f(b) | 716 | 0.91** | 0.86** | 0.73 | 0.84 | 0.03 | 4.7 | 0.52
TDP vs daily obs., MC | 448 | 0.27** | 0.29** | -0.65 | -0.40 | 0.7 | 2.6 | 1.28
TDP vs daily obs., DREAM-d | 448 | 0.37** | 0.41** | 0.16 | 0.19 | 1.4 | 5.0 | 0.92
TDP vs daily obs., DREAM-f(a) | 448 | -0.04 | -0.10* | -0.17 | -0.19 | 1.2 | 4.4 | 1.08
TDP vs daily obs., DREAM-f(b) | 448 | 0.28** | 0.28** | 0.02 | 0.03 | 2.1 | 7.6 | 0.99
TDP vs fortnightly obs., DREAM-f(a) | 43 | 0.19 | 0.18 | 0.01 | 0.00 | 0.7 | 2.4 | 0.99
TDP vs fortnightly obs., DREAM-f(b) | 43 | 0.28 | 0.28 | 0.06 | 0.06 | 0.7 | 2.5 | 0.97
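The statistics reported in Table 2 follow the definitions in Section 2.9 and can be reproduced along these lines. This is an illustrative Python sketch for paired simulated and observed series, not the code used in the study.

```python
import numpy as np

def performance_stats(sim, obs):
    """Nash-Sutcliffe, bias, percentage bias and RMSD* as defined in Section 2.9."""
    sim, obs = np.asarray(sim, float), np.asarray(obs, float)
    resid = sim - obs
    ns = 1.0 - np.sum(resid ** 2) / np.sum((obs - obs.mean()) ** 2)
    ns_log = 1.0 - (np.sum((np.log(sim) - np.log(obs)) ** 2)
                    / np.sum((np.log(obs) - np.log(obs).mean()) ** 2))
    return {
        "NS": ns,                                             # Nash-Sutcliffe efficiency
        "NS (logs)": ns_log,                                  # NS on log-transformed values
        "Bias": sim.mean() - obs.mean(),                      # mean(sim) - mean(obs)
        "Bias (%)": 100.0 * np.sum(resid) / np.sum(obs),      # percentage bias
        "RMSD*": np.sqrt(np.mean(resid ** 2)) / obs.std(),    # RMSD normalised by SD of obs
    }
```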


Fig. 6. Marginal prior and posterior probability density functions (pdfs) for parameters calibrated using DREAM, calibrated to daily observations (DREAM-d) and fortnightly observations (DREAM-f). Value used in manual calibration also shown; where this is not obvious it is at the upper end of the range. See Table 1 for a key to parameter aliases.
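The marginal densities shown in Fig. 6 were obtained with Gaussian kernel density estimation using Scott's rule (Section 2.9); a minimal sketch of that step, assuming the pooled parameter samples are held in a 1-D array, is given below.

```python
import numpy as np
from scipy.stats import gaussian_kde

def marginal_pdf(samples, grid):
    """Kernel density estimate of a marginal posterior, evaluated on a grid.

    samples : 1-D array of one parameter's values, pooled from the converged
              portions of all chains.
    scipy's gaussian_kde uses Scott's rule for the bandwidth by default.
    """
    return gaussian_kde(samples)(grid)

# e.g. density of the pooled BFI sample evaluated over its prior range (0.3-0.9):
# pdf = marginal_pdf(bfi_samples, np.linspace(0.3, 0.9, 200))
```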

agricultural land separately, with the given priors (Section 3.3.4). An additional short DREAM-f application was therefore run, linking the soil water time constants for the two land use classes so that they varied synchronously as a single ‘effective’ parameter. This application achieved GR statistics <1.2 after just 800 iterations, and was run for a further 5000 iterations to provide a sample of the posterior. INCA-P output timeseries obtained from a sample from the latter 5000 iterations are summarised in Fig. 7. The 95% credible interval is very similar to the original DREAM-f application, but the simulated response is now more consistent with the observed daily dataset, with simulated increases in discharge being accompanied by increases in TDP concentration. The daily dynamics are therefore more realistic. The performance statistics are also markedly improved, with for example a significant positive correlation between simulated and daily observed TDP, where before it was negative (statistics for DREAM-fb in Table 2). Baseflow TDP concentrations are still however over-estimated, but the distribution of simulated TDP concentrations during peak flows is now very similar to DREAM-d, although both underestimate these compared to

the daily observations (DREAM-fb in Fig. 4). This is surprising, given that the fortnightly observed data contained far fewer TDP peaks than the daily data, and is discussed further in Section 4.1.2.

3.4. Comparison of auto-calibration and manual calibration results

DREAM-d resulted in markedly better TDP performance statistics than the manual calibration (Table 2), with higher correlation coefficients and a positive NS and NS on logged values. However, a comparison of the timeseries does not reveal such a clear picture (Fig. 5). Whilst both calibrations struggle to simulate summer TDP peaks, the manual calibration does a better job, visually, of simulating the magnitude of these peaks, although it also over-estimates their duration. Simulated baseflow concentrations are also closer to observed. This results in the observed and simulated distributions being more comparable for the manual calibration than for the auto-calibration, under all flow conditions (Fig. 4). This suggests the statistic used to measure model performance in the auto-calibration procedure may need supplementing or altering, discussed further in Section 4.2 (point 7).


Table 3
Parameter pairs whose marginal posteriors were correlated (Pearson's |r| > 0.2) in the DREAM-d application, grouped by the strength and sign (positive or negative) of the correlation; coefficients are given in parentheses for the strongly correlated pairs. Correlations weaker than |r| = 0.2 are not shown. See Table 1 for a key to parameter aliases.

Strong negative (-1 to -0.5): BFI & GW Tc (-0.73); Eff q (R3) & Eff TDP (R3) (-0.67); SW Tc (A) & BFI (-0.67); SW Tc (SN) & BFI (-0.81); SW Tc (SN) & SW EPC0 (A) (-0.59)

Weak negative (-0.5 to -0.2): BFI & SW nF; SW EPC0 (A) & SW nF; BFI & SW q for SEx; GW TDP & Eff TDP (R3); GW TDP & Streambed d; GW Tc & GW qsus; GW Tc & SW EPC0 (A); GW Tc & GW qinit; GW qinit & SW q for SEx; PG uptake k & SW Kad (A); PG uptake k & SW nF; SW EPC0 (A) & GW TDP; SW Tc (A) & GW qsus; SW Tc (A) & SW EPC0 (A); SW Tc (A) & SW q for SEx; SW Tc (A) & GW qinit; SW Tc (SN) & GW qsus; SW Tc (SN) & SW q for SEx; SW Tc (SN) & GW qinit

Weak positive (0.2 to 0.5): BFI & GW qsus; BFI & GW qinit; SW nF & Fert; GW qsus & SW EPC0 (A); GW qsus & GW qinit; GW Tc & SW nF; GW Tc & SW q for SEx; Imax & Iex rain f; SW EPC0 (A) & PG uptake k; SW EPC0 (A) & SW q for SEx; SW EPC0 (A) & GW qinit; SW Tc (A) & SW nF; SW Tc (SN) & SW nF; SW Tc (SN) & GW Tc

Strong positive (0.5 to 1): SW Tc (SN) & SW Tc (A) (0.56); BFI & SW EPC0 (A) (0.59); SW Tc (A) & GW Tc (0.52)

4. Discussion

4.1. Data frequency used in model calibration

4.1.1. Parameter-related uncertainty in model output

The parameter-related uncertainty in simulated TDP concentration from DREAM-f is large enough to limit the use of this application in terms of advising policy or land management, even before other sources of uncertainty are taken into account. For example, the UK Technical Advisory Group has developed environmental standards for phosphate concentrations, as required by the EU Water Framework Directive (WFD). However, the difference between 'moderate' and 'good' chemical status is only around 8 µg SRP l-1, of the same order as the uncertainty in the calibration to fortnightly data (±7.7 µg SRP-P l-1). It would therefore be hard to use this model setup to predict how waterbody classification might change under changing land or sewage management, for example. This is concerning, given that most modelling studies use at best fortnightly data for model calibration. However, further work is required to assess how increasing the length of the lower frequency record, and therefore the number of observations, impacts on the parameter-related uncertainty. The much smaller uncertainty in the application based on daily data (±1.8 µg


SRP-P l-1), while not encompassing all sources of uncertainty, suggests that this calibration would be better suited to policy relevant scenario analysis. Parameter-related uncertainty in model output could be further reduced by decreasing the number of model parameters and better constraining priors, particularly for parameters the model is sensitive to (Section 3.3.4). A decrease in model uncertainty is desirable not just to increase confidence in model output. On a more fundamental level, greater uncertainty makes models less falsifiable, as observations are more likely to lie within the wider credible interval. This makes it harder to identify model structural errors, and in turn develop more useful models.

4.1.2. Realism of the simulated timeseries

When the same parameters were calibrated using fortnightly and daily observed datasets, the fortnightly dataset resulted in a markedly less realistic model simulation (Section 3.3.3). The reduced information content of the fortnightly observed dataset resulted in poorly-constrained soil water time constants for agricultural and semi-natural land, parameters that were well constrained when daily observations were used in model calibration (Section 3.3.4). The failure of the DREAM application based on fortnightly data to reproduce key catchment processes means that this model set-up


Fig. 7. Timeseries of total dissolved phosphorus (TDP) concentration from a second DREAM application calibrated to fortnightly data, where just one soil water time constant was calibrated.

would be of limited use in predicting future water quality. This highlights the importance of using higher frequency data or prior process knowledge to critically assess model output before such models are used for any scenario analysis. Only attempting to calibrate one spatially homogeneous soil water time constant using the fortnightly observations resulted in a more realistic model simulation. That attempting to calibrate just one extra parameter could produce such different model output is concerning, and is a good illustration of the problems of attempting to calibrate too many model parameters with a given dataset (discussed in Kirchner, 2006). This problem can be reduced by reducing the dimensions of the parameter space (Beck, 1987), and we would emphasise a cautionary approach of not varying parameters spatially (e.g. by land use, sub-catchment or reach) unless there is data or evidence to support the differentiation (Rustomji and Prosser, 2001), or there is a real need to do so for the model to fulfil its purpose (e.g. for scenario analysis). Whilst regularization could be carried out, with a penalty for increased complexity such that lower probability is given to more complex models (Tonkin and Doherty, 2005; Marcé et al., 2008), this would have to be on the basis of prior expected behaviour rather than goodness of fit, given the lack of information in the fortnightly observations. The fortnightly observed timeseries captured baseflow concentrations well, but peak concentrations were under-represented. We might therefore expect the model calibration using the fortnightly data to produce similar results, with a good baseflow simulation but poor representation of peaks. This was not however the case: DREAM-f over-estimated baseflow TDP concentrations, and yet reproduced TDP peaks as well as the DREAM-d simulation. The reasons for the over-estimation of baseflow concentrations could not be thoroughly investigated without re-running the application with a range of alternative fortnightly datasets, but it seems likely that the small number of data points and the chance sampling of several TDP peaks and under-representation of low TDP values in the fortnightly series, together resulted in the over-estimation. Meanwhile, the similarity in the simulation of TDP peaks by the two applications suggests that the DREAM-d application did not make the most of the information contained in the daily data, due to limitations either in the model structure or in the auto-calibration algorithm. Jackson-Blake et al. (2015) suggest small changes that could be made to the model structure to improve the simulation of summer peak TDP concentrations, but the fact that more and bigger TDP peaks were simulated in the manual calibration suggests that focus should be on improving the likelihood function (discussed further in Section 4.2, point 7).

4.1.3. Implications for water chemistry monitoring strategy

Routine water monitoring programmes tend to collect grab samples at most once a fortnight, more often monthly. There are

good practical reasons for this, but results suggest these lower frequency datasets may lead to misleading results when used to calibrate complex, process-based catchment models, as well as large parameter-related model uncertainty. Ultimately, the key to obtaining a realistic model simulation is ensuring that the natural variability in water chemistry is well represented by the monitoring data (Jarvie et al., 1999; Irvine, 2004; Bennett et al., 2013). In theory, the daily data should therefore have produced a model calibration that was better able to simulate peak TDP concentrations, as the fortnightly observed data contained far fewer peaks. However, most probably because of issues with the auto-calibration (Section 4.1.2), this was not the case. Improvements to models and/or auto-calibration procedures are therefore needed before the most can be made of higher frequency data in model calibration. However, for any kind of source apportionment or scenario analysis, we need to have confidence that models are producing the right results for the right reasons (Kirchner, 2006), and in our view this is where higher frequency data is key. Continuous higher frequency data shows how the system responds throughout a flow event, information that a discontinuous timeseries cannot provide, however long the record may be. In this study, we have seen how useful this data was in highlighting an entirely unrealistic model calibration. We would therefore advocate carrying out targeted higher frequency sampling for a short period of time, to characterise the system response to rainfall events at different times of the year. Lower frequency sampling could be adopted thereafter, to provide an indication of longer-term trends. If this were carried out for long enough, we might expect parameter-related uncertainty in model output to drop to the more acceptable levels obtained in this study from calibration to a relatively short, higher frequency dataset. Finally, monitoring efforts should also be aimed at better constraining parameter values and internal system processes and fluxes. This would lead to better-constrained priors, which would both reduce uncertainty and reduce the need for extensive calibration datasets. It would also lead to increased confidence that the right processes were being simulated.

4.2. Subjective elements within auto-calibration techniques and potential improvements

The pros and cons of manual versus auto-calibration techniques have been discussed in the hydrological literature for decades (e.g. Boyle et al., 2000; Gupta et al., 2006), and the aim here is not to review this debate. Instead, we offer a few remarks to temper a common perception that auto-calibration is an objective and rigorous alternative to manual techniques. Our auto-calibration results were dependent on a number of choices. Expanding on these, in our view most auto-calibrations and uncertainty analyses include the following subjective elements; some potential improvements are also highlighted:

(1) The model parameters to include in the analysis. All parameters could be included, or a subset; if the latter, the choice of subset could be made through manual model testing and an examination of model equations, as in this study, or from a sensitivity analysis. There is then the decision of whether or not to calibrate spatially variable parameters separately, shown here to exert a key influence.

(2) Prior ranges and distributions for those parameters to be auto-calibrated, often determined through manual model testing and informed by the literature or data.

(3) The values assigned to parameters excluded from the analysis. In this study, many parameters could not be included in the auto-calibration, and their values were therefore decided on by manual calibration.

(4) The choice of auto-calibration algorithm, although there is encouraging work which suggests that similar results may be obtained from formal and informal Bayesian approaches (Vrugt et al., 2009a).

(5) The observed data used for calibration. As well as data frequency, the length of record chosen for the calibration period may be important. The types of observations included also exert an important influence on the simulation, with multi-response objective functions offering the potential for improved simulations (Kuczera and Mroczkowski, 1998). A potential benefit of manual calibration over automated procedures is the ease with which data on internal processes, such as concentrations and fluxes, can be included in the calibration, helping to ensure that the right results are being obtained for the right reasons. Incorporating such additional ‘soft’ data into auto-calibration is not yet common practice, but it is feasible and indeed highly desirable (e.g. Seibert and McDonnell, 2002; Rankinen et al., 2006).

(6) The choice of weighting strategy for observations. In this study, all data points were given equal weight. However, we might have achieved a better simulation of peak TDP concentrations had we applied higher weights to peaks, although perhaps at the expense of the baseflow simulation.

(7) The formulation of the objective function. It is likely that a different result would have been obtained in this study had, for example, temporal autocorrelation of errors been taken into account (Beven, 2009; Schoups and Vrugt, 2010), or had errors associated with the hydrological forcing data been accounted for (Vrugt et al., 2008). More fundamentally, we used an aggregate measure of model performance, based on the sum of squared model residuals. However, some argue for the use of diagnostic model evaluation, as proposed by Gupta et al. (2008). This is an area of active development (e.g. Yadav et al., 2007; Yilmaz et al., 2008; Hartmann et al., 2013; Coxon et al., 2014), but the basic idea is that data ‘signatures’, rather than single regression-based aggregate measures of model performance, may provide more powerful diagnostics. These multiple diagnostic measures are selected for their ability to provide insights into system behaviour; in this study, the use of peak TDP concentration as one of the diagnostics, for example, might have led to a smaller difference between the manual and auto-calibration results. (A schematic sketch illustrating points (6) and (7) is given at the end of this section.)

In summary, auto-calibration certainly has many advantages over manual techniques, but these important subjective elements should not be down-played. The size of any uncertainty bands around predictions, for example, should not be interpreted too literally, or they could give a misleading impression of the confidence that can be placed in model simulations.
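To illustrate points (6) and (7), the sketch below contrasts an equally- or unequally-weighted Gaussian log-likelihood with a simple first-order autoregressive error model. This is a schematic of the general ideas only, not the formulation used with DREAM in this study, and the function and variable names are our own assumptions.

```python
import numpy as np

def weighted_gaussian_loglik(obs, sim, sigma, weights):
    """Point (6): observations can be weighted, e.g. up-weighting storm peaks.
    With weights of 1 everywhere, this reduces to the usual iid Gaussian case."""
    r = (obs - sim) / sigma
    return -0.5 * np.sum(weights * (r ** 2 + np.log(2.0 * np.pi * sigma ** 2)))

def ar1_gaussian_loglik(obs, sim, sigma, rho):
    """Point (7): a first-order autoregressive (AR(1)) error model, one simple
    way of accounting for temporal autocorrelation of residuals
    (cf. Schoups and Vrugt, 2010)."""
    e = obs - sim
    innov = e[1:] - rho * e[:-1]                  # decorrelated innovations
    sigma_innov = sigma * np.sqrt(1.0 - rho ** 2)  # innovation standard deviation
    ll_first = -0.5 * (np.log(2.0 * np.pi * sigma ** 2) + (e[0] / sigma) ** 2)
    ll_rest = -0.5 * np.sum(np.log(2.0 * np.pi * sigma_innov ** 2)
                            + (innov / sigma_innov) ** 2)
    return ll_first + ll_rest
```

Diagnostic measures, such as the magnitude of simulated TDP peaks, could be evaluated alongside (or instead of) such aggregate likelihoods.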

4.3. Practical improvements to make INCA-P and similar models more compatible with automated calibration and uncertainty analysis

A number of improvements could be made to INCA-P, and to similar models, which would make the model significantly easier to auto-calibrate and reduce the need for a good quality manual calibration to inform the auto-calibration procedure:

(1) Model simplification, so that all model parameters could be included in the analysis. This is a key requirement for improving confidence in model output, and thereby making models more policy-relevant. In this study, only 29 of the >127 model parameters were varied in the auto-calibration procedure. More could have been varied – Starrfelt and Kaste (2014) included 49 parameters, whilst Dean et al. (2009) varied 44 – but computing run times and a high chance of non-identifiable parameters mean that including all is not feasible. Simplification is desirable for other reasons too: the strong correlations observed between many of the calibrated parameters (Section 3.3.4) suggest over-parameterization and a lack of parameter identifiability, even with the parameter sub-set considered here (Kuczera and Mroczkowski, 1998; Bennett et al., 2013). In addition, even in the calibration with daily data, nine of the 29 calibrated parameters had relatively flat marginal posteriors (Fig. 6), implying model insensitivity to these parameters. This number rose to 14 for the calibration with fortnightly data, and included the majority of parameters affecting phosphorus processing. Some suggestions for model simplification are given in Jackson-Blake et al. (2015), but more are likely to be necessary to bring the number of parameters down to a reasonable level.

(2) Documenting recommended ranges for each parameter, to help in choosing sensible priors. These could be built up into a database of parameter values, together with the associated geo-climatic and land use setting. This would require more researchers to report the parameter values used in their applications, either in research articles or, preferably, in an on-line repository.

(3) Reformulation or removal of interacting parameters. The way in which model parameters interact can make auto-calibration techniques more or less cumbersome to apply. Many parameters are correlated to some extent; issues only arise when combinations of values drawn from the prior space result in unrealistic values for some model process. In INCA-P, for example, the user-supplied soil grain size distribution is made up of percentages in each of five size classes. These need to sum to 100, so randomly sampling from within any one class is difficult. Such parameters can be dealt with within the auto-calibration procedure by, for example, using alternative distributions for the priors, calculating dependent parameters as a function of drawn parameter values, or estimating ‘placeholder’ parameters from which the real model parameters are derived (one possible implementation of this approach is sketched below). However, whilst it is feasible for parameters to be linked and dependent on one another, tailor-made calibration code is required, which may be time-consuming to write and not transferable to different models. Many of these interacting parameters could instead be dealt with by making changes to the model, with little likely loss of performance. Table 4 lists some of these parameters for INCA-P, with suggested potential improvements.
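As an illustration of the ‘placeholder’ parameter approach mentioned in point (3), the sketch below maps freely-sampled, unconstrained values onto grain-size percentages that always sum to 100. The transform (a softmax) and the variable names are our own choice; a Dirichlet prior on the fractions themselves would be an equally valid alternative.

```python
import numpy as np

def placeholder_to_fractions(z):
    """Map unconstrained 'placeholder' parameters (which an MCMC sampler can
    draw freely) onto size-class percentages guaranteed to sum to 100,
    using a softmax-style transform."""
    w = np.exp(z - np.max(z))      # subtract the max for numerical stability
    return 100.0 * w / w.sum()

# Example: five placeholders -> five size-class percentages summing to 100
z = np.array([0.2, -0.5, 1.0, 0.3, -1.2])
clay, silt, fine_sand, medium_sand, coarse_sand = placeholder_to_fractions(z)
```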


Table 4. Examples of INCA-P parameters requiring special treatment for inclusion in an auto-calibration procedure, together with a description of the issue and possible improvements.

Process: Plant P uptake
Relevant parameters: Plant growth start day; plant growth period; maximum plant P uptake; plant uptake coefficient
Issue: Calibrating these independently can result in unrealistic annual P uptake rates
Potential improvements: Replace growth period with growth end day; remove the uptake coefficient. Model fits daily uptake as a sine curve, whose integral equals the user-supplied annual uptake (one possible form is given below the table)

Process: Fertilizer input
Relevant parameters: Fertilizer addition start day; fertilizer addition period; daily application rates (solid fertilizer, liquid fertilizer, solid manure, liquid manure)
Issue: Calibrating too many of these independently results in unrealistic annual fertilizer inputs
Potential improvements: Replace the daily addition rate parameters with an annual addition; the model calculates the daily rate. Simplify to just two inputs: solid (contributing to the soil store) or liquid (entering the soil water)

Process: Soil texture
Relevant parameters: Clay (%); silt (%); fine sand (%); medium sand (%); coarse sand (%)
Issue: Must sum to 100
Potential improvements: Consider using just one size fraction, e.g. a combination of silt and fine sand. The fine fraction is more likely to be mobilised and has a higher P content (e.g. Sharpley, 1980)

Process: Stream bed water volume
Relevant parameters: Stream bed depth; stream bed porosity; water column–pore water exchange
Issue: INCA crashes when the water column–pore water exchange is larger than bed depth × porosity
Potential improvements: Replace bed depth and porosity with an effective bed depth. Internal logic to prevent the exchange being bigger than the volume of water in the pores
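For example, the sine-curve uptake suggested in Table 4 could take the following form (our illustration; this is not a formulation currently implemented in INCA-P), where $t_0$ is the plant growth start day, $L$ the growth length in days and $U$ the user-supplied annual uptake:

```latex
u(t) = \frac{\pi U}{2L}\,\sin\!\left(\frac{\pi\,(t - t_{0})}{L}\right),
\qquad t_{0} \le t \le t_{0} + L,
\qquad \int_{t_{0}}^{t_{0}+L} u(t)\,\mathrm{d}t = U .
```

Daily uptake then rises and falls smoothly over the growing season, while the annual total is fixed by a single, directly interpretable parameter.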

5. Conclusions

When calibrating a process-based catchment water quality model, using daily rather than fortnightly observed data resulted in the following benefits:

• Reduced uncertainty and therefore greater confidence in model output, in turn making the model more useful in terms of advising land managers and policy makers.
• Lower risk of obtaining unrealistic model output, allowing for greater confidence that models are producing the right results for the right reasons.
• More parameters could be calibrated, i.e. a more complex and potentially more realistic model could be used.
• Models become more falsifiable, through the combination of reduced model uncertainty and tighter constraints on how the model should be performing. This in turn paves the way for the development of more realistic models.

A great additional benefit of higher frequency data is the process understanding gained, which allows for better discrimination between realistic and unrealistic model output. When only lower frequency data is available for model calibration, decreasing the number of calibrated parameters and having well-constrained parameter values (i.e. informative priors) for those being calibrated may help achieve some of the benefits of using higher frequency data. Otherwise, care should be taken to emphasise the low confidence that can be placed in model output.

We have highlighted the many subjective elements involved in auto-calibration and uncertainty analysis. When these elements are downplayed, there is a real risk of incorrectly promoting a perception that results are objective, robust and, particularly in the case of uncertainty analysis, encompass all sources of uncertainty. Despite the risks, auto-calibration still presents many advantages over manual calibration and it is likely that, in the future, well-devised automated calibration routines with associated uncertainty analyses will become pre-requisites for any model application (as vouched for by Beven, 2006; Radcliffe et al., 2009). To enable this, targeted data collection and improvements to models and auto-calibration techniques should all be environmental modelling priorities. In particular:

• Monitoring efforts need to focus on: (1) short term higher frequency monitoring, to obtain process understanding; (2) gathering data to better constrain parameter values; and (3) monitoring to quantify the internal workings of the system (e.g. soil and groundwater concentrations, terrestrial fluxes, etc.). Deciding on priorities for data gathering efforts in a given area is likely to require greater collaboration between modellers and field scientists.

• Practical improvements are needed to make models more compatible with automated calibration and uncertainty analyses, including: (1) simplification, so that all parameters can be included in an auto-calibration/uncertainty analysis; (2) well-documented recommended ranges for each parameter; and (3) removal of interacting parameters that make auto-calibration cumbersome to implement.

• Promising future ways of achieving more meaningful model auto-calibrations include using additional data, such as measurements of internal system concentrations and fluxes, explicitly incorporating additional error terms in the likelihood function, and perhaps using alternative measures of model performance, such as statistics extracted from timeseries that are diagnostic of certain modes of behaviour.

More generally, many model users have only a general knowledge of auto-calibration techniques and limited knowledge of computer programming, and therefore still rely on manual calibration to avoid the complexity and computational demands of auto-calibration (Nasr et al., 2007). Clearly described auto-calibration tools, integrated in a user-friendly way with water quality models, are required for this situation to change.

Acknowledgements

This work was funded by the Rural and Environment Science and Analytical Services division (RESAS) of the Scottish Government and the European Union 7th Framework Programme REFRESH project (adaptive strategies to mitigate the impacts of climate change on European freshwaters; Grant agreement 244121). Many thanks to Luigi Spezia for statistical support and advice. The text was much improved following comments from and discussions with J. Sample, M. Troldborg, R. Helliwell, S. Dunn, R. Skeffington and A. Wade. The input from the four anonymous reviewers is also gratefully acknowledged.

References

Allen, R., Pereira, L., Raes, D., Smith, M., 1998. Crop Evapotranspiration – Guidelines for Computing Crop Water Requirements. FAO Irrigation and Drainage Paper 56.
Beck, M.B., 1987. Water quality modeling: a review of the analysis of uncertainty. Water Resour. Res. 23, 1393–1442.
Bennett, N.D., Croke, B.F.W., Guariso, G., Guillaume, J.H.A., Hamilton, S.H., Jakeman, A.J., et al., 2013. Characterising performance of environmental models. Environ. Modell. Software 40, 1–20.
Bernal, S., Butturini, A., Riera, J.L., Vázquez, E., Sabater, F., 2004. Calibration of the INCA model in a Mediterranean forested catchment: the effect of hydrological inter-annual variability in an intermittent stream. Hydrol. Earth Syst. Sci. 8, 729–741.
Beven, K., 2006. On undermining the science? Hydrol. Process. 20, 3141–3146.
Beven, K., 2009. Comment on “Equifinality of formal (DREAM) and informal (GLUE) Bayesian approaches in hydrologic modeling?” by Vrugt, J.A., ter Braak, C.J.F., Gupta, H.V., Robinson, B.A. Stoch. Env. Res. Risk Assess. 23, 1059–1060.
Beven, K., Freer, J., 2001. Equifinality, data assimilation, and uncertainty estimation in mechanistic modelling of complex environmental systems using the GLUE methodology. J. Hydrol. 249, 11–29.
Boyle, D.P., Gupta, H.V., Sorooshian, S., 2000. Toward improved calibration of hydrologic models: combining the strengths of manual and automatic methods. Water Resour. Res. 36, 3663–3674.
Chib, S., Greenberg, E., 1995. Understanding the Metropolis–Hastings algorithm. Am. Statistician 49, 327–335.
Correll, D.L., 1998. The role of phosphorus in the eutrophication of receiving waters: a review. J. Environ. Qual. 27, 261–266.
Coxon, G., Freer, J., Wagener, T., Odoni, N.A., Clark, M., 2014. Diagnostic evaluation of multiple hypotheses of hydrological behaviour in a limits-of-acceptability framework for 24 UK catchments. Hydrol. Process. 28, 6135–6150.
Crossman, J., Futter, M.N., Oni, S.K., Whitehead, P.G., Jin, L., Butterfield, D., et al., 2013. Impacts of climate change on hydrology and water quality: future proofing management strategies in the Lake Simcoe watershed, Canada. J. Great Lakes Res. 39, 19–32.
Dean, S., Freer, J., Beven, K., Wade, A.J., Butterfield, D., 2009. Uncertainty assessment of a process-based integrated catchment model of phosphorus. Stoch. Env. Res. Risk Assess. 23, 991–1010.
Doherty, J., Brebber, L., Whyte, P., 1994. PEST: Model-Independent Parameter Estimation. Watermark Computing, Corinda, Australia, 122.
Durand, P., 2004. Simulating nitrogen budgets in complex farming systems using INCA: calibration and scenario analyses for the Kervidy catchment (W. France). Hydrol. Earth Syst. Sci. 8, 793–802.
Edwards, A.C., Withers, P.J.A., 2007. Linking phosphorus sources to impacts in different types of water body. Soil Use Manage. 23, 133–143.
Farkas, C., Beldring, S., Bechmann, M., Deelstra, J., 2013. Soil erosion and phosphorus losses under variable land use as simulated by the INCA-P model. Soil Use Manage. 29, 124–137.
Freni, G., Mannina, G., 2010. Bayesian approach for uncertainty quantification in water quality modelling: the influence of prior distribution. J. Hydrol. 392, 31–39.
Gelman, A., Rubin, D.B., 1992. Inference from iterative simulation using multiple sequences. Statistical Sci., 457–472.
Gupta, H.V., Beven, K.J., Wagener, T., 2006. Model calibration and uncertainty estimation. In: Encyclopedia of Hydrological Sciences. John Wiley & Sons Ltd.
Gupta, H.V., Wagener, T., Liu, Y., 2008. Reconciling theory with observations: elements of a diagnostic approach to model evaluation. Hydrol. Process. 22, 3802–3813.
Hartmann, A., Wagener, T., Rimmer, A., Lange, J., Brielmann, H., Weiler, M., 2013. Testing the realism of model structures to identify karst system processes using water quality and quantity signatures. Water Resour. Res. 49, 3345–3358.
House, W.A., 2003. Geochemical cycling of phosphorus in rivers. Appl. Geochem. 18, 739–748.
Huisman, J.A., Rings, J., Vrugt, J.A., Sorg, J., Vereecken, H., 2010. Hydraulic properties of a model dike from coupled Bayesian and multi-criteria hydrogeophysical inversion. J. Hydrol. 380, 62–73.
Irvine, K., 2004. Classifying ecological status under the European Water Framework Directive: the need for monitoring to account for natural variability. Aquatic Conserv.: Mar. Freshwater Ecosyst. 14, 107–112.
Jackson-Blake, L.A., Dunn, S.M., Helliwell, R.C., Skeffington, R.A., Stutter, M.I., Wade, A.J., 2015. How well can we model stream phosphorus concentrations in agricultural catchments? Environ. Modell. Software 64, 31–46.
Jarvie, H.P., Withers, J.A., Neal, C., 1999. Review of robust measurement of phosphorus in river water: sampling, storage, fractionation and sensitivity. Hydrol. Earth Syst. Sci. 6, 113–131.
Jarvie, H.P., Sharpley, A.N., Withers, P.J.A., Scott, J.T., Haggard, B.E., Neal, C., 2013. Phosphorus mitigation to control river eutrophication: murky waters, inconvenient truths, and “Postnormal” science. J. Environ. Qual. 42, 295–304.

Kirchner, J.W., 2006. Getting the right answers for the right reasons: linking measurements, analyses, and models to advance the science of hydrology. Water Resour. Res. 42, W03S04.
Kuczera, G., Mroczkowski, M., 1998. Assessment of hydrologic parameter uncertainty and the worth of multiresponse data. Water Resour. Res. 34, 1481–1489.
Marcé, R., Ruiz, C.E., Armengol, J., 2008. Using spatially distributed parameters and multi-response objective functions to solve parameterization of complex applications of semi-distributed hydrological models. Water Resour. Res. 44, W02436.
McIntyre, N.R., Wheater, H.S., 2004. Calibration of an in-river phosphorus model: prior evaluation of data needs and model uncertainty. J. Hydrol. 290, 100–116.
Meals, D.W., Dressing, S.A., Davenport, T.E., 2010. Lag time in water quality response to best management practices: a review. J. Environ. Qual. 39, 85–96.
Morton, D., Rowland, C., Wood, C., Meek, L., Smith, G., Simpson, I.C., 2011. Final report for LCM2007 – the new UK land cover map. In: CS Technical Report No. 11/07 (CEH project number: C03259), UK, p. 108.
Nash, J.E., Sutcliffe, J.V., 1970. River flow forecasting through conceptual models part I – a discussion of principles. J. Hydrol. 10, 282–290.
Nasr, A., Bruen, M., Jordan, P., Moles, R., Kiely, G., Byrne, P., 2007. A comparison of SWAT, HSPF and SHETRAN/GOPC for modelling phosphorus export from three catchments in Ireland. Water Res. 41, 1065–1073.
Oreskes, N., Shrader-Frechette, K., Belitz, K., 1994. Verification, validation, and confirmation of numerical models in the earth sciences. Science 263, 641–646.
Pappenberger, F., Beven, K., Frodsham, K., Romanowicz, R., Matgen, P., 2007. Grasping the unavoidable subjectivity in calibration of flood inundation models: a vulnerability weighted approach. J. Hydrol. 333, 275–287.
Radcliffe, D.E., Freer, J., Schoumans, O., 2009. Diffuse phosphorus models in the United States and Europe: their usages, scales, and uncertainties. J. Environ. Qual. 38, 1956–1967.
Rankinen, K., Karvonen, T., Butterfield, D., 2006. An application of the GLUE methodology for estimating the parameters of the INCA-N model. Sci. Total Environ. 365, 123–139.
Roberts, G.O., Gelman, A., Gilks, W.R., 1997. Weak Convergence and Optimal Scaling of Random Walk Metropolis Algorithms, pp. 110–120.
Rustomji, P., Prosser, I., 2001. Spatial patterns of sediment delivery to valley floors: sensitivity to sediment transport capacity and hillslope hydrology relations. Hydrol. Process. 15, 1003–1018.
Schoups, G., Vrugt, J.A., 2010. A formal likelihood function for parameter and predictive inference of hydrologic models with correlated, heteroscedastic, and non-Gaussian errors. Water Resour. Res. 46, W10531.
Scott, D.W., 2009. Multivariate Density Estimation: Theory, Practice, and Visualization. John Wiley & Sons.
Seibert, J., McDonnell, J.J., 2002. On the dialog between experimentalist and modeler in catchment hydrology: use of soft data for multicriteria model calibration. Water Resour. Res. 38, 1241.
Sharpley, A.N., 1980. The enrichment of soil phosphorus in runoff sediments. J. Environ. Qual. 9, 521–526.
Sharpley, A., Jarvie, H.P., Buda, A., May, L., Spears, B., Kleinman, P., 2013. Phosphorus legacy: overcoming the effects of past management practices to mitigate future water quality impairment. J. Environ. Qual. 42, 1308–1326.
Starrfelt, J., Kaste, O., 2014. Bayesian uncertainty assessment of a semi-distributed integrated catchment model of phosphorus transport. Environ. Sci.: Processes Impacts 16, 1578–1587.
Stutter, M.I., Langan, S.J., Cooper, R.J., 2008. Spatial contributions of diffuse inputs and within-channel processes to the form of stream water phosphorus over storm events. J. Hydrol. 350, 203–214.
Tonkin, M.J., Doherty, J., 2005. A hybrid regularized inversion methodology for highly parameterized environmental models. Water Resour. Res. 41, W10412.
Van Liew, M.W., Arnold, J., Bosch, D., 2005. Problems and potential of autocalibrating a hydrologic model. Trans. ASAE 48, 1025–1040.
Vrugt, J.A., ter Braak, C.J.F., Clark, M.P., Hyman, J.M., Robinson, B.A., 2008. Treatment of input uncertainty in hydrologic modeling: doing hydrology backward with Markov chain Monte Carlo simulation. Water Resour. Res. 44, W00B09.
Vrugt, J., ter Braak, C.F., Gupta, H., Robinson, B., 2009a. Equifinality of formal (DREAM) and informal (GLUE) Bayesian approaches in hydrologic modeling? Stoch. Env. Res. Risk Assess. 23, 1011–1026.
Vrugt, J.A., Braak, C.J.F., Diks, C.G.H., Robinson, B.A., Hyman, J.M., Higdon, D., 2009b. Accelerating Markov chain Monte Carlo simulation by differential evolution with self-adaptive randomized subspace sampling. Int. J. Nonlinear Sci. Numer. Simulat. 10, 273–290.
Wade, A.J., Whitehead, P.G., Butterfield, D., 2002. The Integrated Catchments model of Phosphorus dynamics (INCA-P), a new approach for multiple source assessment in heterogeneous river systems: model structure and equations. Hydrol. Earth Syst. Sci. 6, 583–606.
Wade, A.J., Butterfield, D., Lawrence, D.S.L., Bärlund, I., Ekholm, P., Lepistö, A., et al., 2007. The Integrated Catchment Model of Phosphorus (INCA-P), a new structure to simulate particulate and soluble phosphorus transport in European catchments. In: Deliverable 185 to the EU Euro-limpacs project, UCL, London, p. 69.
Wagener, T., Gupta, H., 2005. Model identification for hydrological forecasting under uncertainty. Stoch. Env. Res. Risk Assess. 19, 378–387.
Whitehead, P.G., Jin, L., Baulch, H.M., Butterfield, D.A., Oni, S.K., Dillon, P.J., et al., 2011. Modelling phosphorus dynamics in multi-branch river systems: a study of the Black River, Lake Simcoe, Ontario, Canada. Sci. Total Environ. 412–413, 315–323.

Whitehead, P.G., Crossman, J., Balana, B.B., Futter, M.N., Comber, S., Jin, L., et al., 2013. A cost-effectiveness analysis of water security and water quality: impacts of climate and land-use change on the River Thames system. Philos. Trans. Royal Soc. A: Math., Phys. Eng. Sci. 371.
Withers, P.J.A., Jarvie, H.P., 2008. Delivery and cycling of phosphorus in rivers: a review. Sci. Total Environ. 400, 379–395.


Yadav, M., Wagener, T., Gupta, H., 2007. Regionalization of constraints on expected watershed response behavior for improved predictions in ungauged basins. Adv. Water Resour. 30, 1756–1774.
Yilmaz, K.K., Gupta, H.V., Wagener, T., 2008. A process-based diagnostic approach to model evaluation: application to the NWS distributed hydrologic model. Water Resour. Res. 44, W09417.