Assessing skill of estuarine and coastal eutrophication models for water quality managers



Journal of Marine Systems 76 (2009) 195–211

Contents lists available at ScienceDirect

Journal of Marine Systems journal homepage: www.elsevier.com/locate/jmarsys

James J. Fitzpatrick ⁎
HydroQual, Inc., Mahwah, NJ, USA

Article info

Article history:
Received 8 October 2007
Received in revised form 14 January 2008
Accepted 2 May 2008
Available online 29 May 2008

Keywords: Eutrophication; Skill assessment; Mathematical models; Water quality; Dissolved oxygen; Nutrients

Abstract

The need for predictive mechanistic models of nutrients, primary production, and dissolved oxygen has long been recognized, given their utility in helping natural resource and water quality managers evaluate the potential effectiveness of nutrient management in reducing cultural eutrophication. However, there is a corresponding need to assess the degree of confidence one has in the model projections. This is particularly true because mathematical models of eutrophication are increasingly used to determine the total maximum daily loads (TMDLs) necessary to attain water quality standards for dissolved oxygen, and because implementing these TMDLs can place a significant cost burden on the affected community. A review of the literature suggests that a number of statistical measures are available that modelers and water quality managers can use to assess model skill and help quantify uncertainty in model projections. This paper presents an overview of these measures and suggests that no single metric is suitable for assessing model skill. Rather, a combination of quantitative and qualitative measures should be used for skill assessment and for defining model uncertainty.

© 2008 Elsevier B.V. All rights reserved.

“All models are wrong, some models are useful.”

G.E.P. Box, 1979

⁎ Tel.: +1 201 529 5151; fax: +1 201 529 5728. E-mail address: jfi[email protected].
0924-7963/$ – see front matter © 2008 Elsevier B.V. All rights reserved. doi:10.1016/j.jmarsys.2008.05.018

1. Introduction

Environmental models of natural systems are constructed for two basic reasons. The first is the need to improve understanding of the cause and effect relationships that influence environmental ecosystems or organisms of interest or concern; the second is to apply that increased understanding to assist in decision-making for environmental management. Over the past several decades, as scientific and public awareness of the environmental issues associated with climate warming and cultural eutrophication has increased, the use of environmental models has also increased. To varying degrees, mathematical models that include primary production or phytoplankton biomass frameworks have been constructed to help natural resource managers and water quality managers: improve our understanding of the ocean carbon balance (Doney and Ducklow, 2006); evaluate potential ecosystem response to climate change (Boyd and Doney, 2002; Hood et al., 2006); evaluate ocean carbon sequestration (Hannon et al., 2001); understand the relationships between primary production and fishery stocks (Lehodey et al., 1998; Luo et al., 2001; Scott et al., 2005); and help evaluate and manage cultural eutrophication (Cerco, 1995; Reckhow and Chapra, 1999; Skogen et al., 2004). Since natural resource and water quality managers wish to utilize these primary production/phytoplankton biomass models as one component of their decision-making process, it is important to these managers that (1) model predictions or projections have some basis in reality, i.e., that the models demonstrate their ability to reproduce the “real world” and (2) uncertainties in model predictions be quantified.

Indeed, the question of the validity of water quality models has long been an issue for both the practitioners of water quality modeling and the managers who use their outputs. In the historical work of Streeter and Phelps (1925), concerning dissolved oxygen modeling of the Ohio River, both qualitative comparisons of model computations to observed dissolved oxygen data and quantitative comparisons of root mean square error between model and data were made. In another early paper, Thomann (1982) suggested an approach for judging model credibility using regression analysis between observed and computed values, relative and root mean square errors, and comparisons of means. Chapra (1997) presented a discussion of the water quality modeling process, which used the term calibration, i.e., initial modeling efforts wherein the modeler “varies [sic] the model parameters to obtain an optimal agreement between the model calculations and the data set.” Chapra then went on to discuss the importance of confirming the model using independent data sets, sensitivity analysis, uncertainty analysis, and postaudit of models.

More recently, Arhonditsis and Brett (2004) reported on a meta-analysis of 153 mechanistic biogeochemical models. As part of their analysis, Arhonditsis and Brett considered differences between model calibration and model validation; the latter could include one or more of the following:

• predictive validation: evaluation of model-fit with an independent data set from the same system,
• model transferability: ability of the model to perform in other regions or ecosystem types, and
• structural validation: assessment of the realistic reproduction of the causal relationships and relative magnitudes of various components of the system, such as biochemical reaction rates, mass balances, etc.
Arhonditsis and Brett also reported on the performance of the model data set using relative error and the coefficient of determination or the regression coefficient ($r^2$).

It is interesting to note that in the papers listed above the terms model credibility, model performance, model calibration, model confirmation, model validation, and model verification have been, to various degrees, used interchangeably. For the purposes of this paper, the following definitions will apply:

• model validation: the process by which a computer code and its associated algorithms are checked to ensure that they are mathematically correct. For water quality models, the process should include verifying that mass balances are maintained and verifying model computations against analytical or “known” numerical solutions. The latter can include running a dynamic model to equilibrium or steady-state or using idealized geometries to facilitate comparisons of model outputs to analytical or known numerical solutions.
• model calibration: the process by which a model and its associated model coefficients (e.g., algal growth and respiration rates, nutrient recycle rates, reaeration rates) are adjusted or tuned to “reproduce” or obtain a “best fit” to the observed data. To the extent possible, the specification of the model coefficients should be based on field (e.g., primary productivity) and/or laboratory (e.g., oxidation rates) measurements. The remaining model coefficients should be adjusted in a systematic fashion (including provision for sensitivity analysis), ensuring that the coefficients remain within acceptable ranges (Bowie et al., 1985).
• model confirmation: in this process the model is run and compared against another data set(s). Ideally, this data set should have different physical and environmental forcings (e.g., freshwater inflows, pollutant loadings, etc.) than the calibration data set. The model is run with the new physical and environmental forcings, while holding the model coefficients fixed at the calibration values. If the model reproduces the observed data, then the model is confirmed to act as a predictive tool with which to assess water quality state-variables within the range of conditions defined by the calibration and confirmation data sets.

In addition, for the purposes of this paper, skill assessment is the process by which a model's calibration and confirmation can be established using qualitative and quantitative techniques. While model validation is an important step in model development, skill assessment does not apply to it, since there are no model-data comparisons.

2. Data issues affecting the skill assessment of estuarine and coastal eutrophication models

As discussed above, models of primary production/phytoplankton biomass have been constructed for a number of purposes. This paper, however, will focus on skill assessment of such models that are constructed to assist in the decision-making process for the management of estuarine and coastal eutrophication.
Before discussing quantitative and qualitative measures that might be used for skill assessment of eutrophication models, it is important to briefly review factors that complicate skill assessment of such models, especially as compared to hydrodynamic models.

2.1. Model complexity: data availability

Modern eutrophication models usually incorporate twenty or more state-variables (Cerco, 2000; Allen et al., 2001; Lancelot et al., 2005) versus the 7–10 state-variables (water elevation or continuity, three components of velocity or momentum, salinity, temperature, and vertical mixing or turbulence closure and, for some models, turbulence kinetic energy, turbulence length scale and an equation of state) utilized in modern three-dimensional hydrodynamic models. Typical eutrophication models include the following state-variables: one or more phytoplankton groups, one or more zooplankton groups, particulate and dissolved organic nutrient pools (nitrogen and phosphorus), biogenic silica, inorganic nutrients (ammonia, nitrate, phosphate, silicate), particulate and dissolved organic carbon, and dissolved oxygen. A few models also include mechanistic frameworks for sediment oxygen demand and nutrient flux from the sediment (HydroQual, 2000; Cerco and Noel, 2004). Therefore, the data requirements for eutrophication models, both for specifying model inputs and for conducting skill assessment, are generally more voluminous (although the data are not necessarily available to the same degree) than those required for estuarine and coastal 3-D hydrodynamic models.

While 3-D hydrodynamic modelers may, in many cases, claim to be data-limited, the current state-of-the-science provides a greater level of instrumentation with which to develop “near continuous” data sets for skill assessment than is currently available for water quality modelers. Examples include buoy-systems that continuously measure wind, air temperature, atmospheric pressure, wave height, surface current speed and direction, and water temperature and salinity at several depths; bottom-mounted ADCP systems that provide continuous measurements of currents at multiple depths in the water column; and CODAR systems, which can provide continuous observations of surface currents and wave heights. In general, due to the high costs associated with sampling vessels and crews and laboratory analysis of samples for multiple constituents (see the list above), spatially detailed water quality data sets may be collected only once or twice per month. These data sets only represent “snapshots in time” rather than continuous data for use in model versus data comparisons.

2.2. Model inputs

Given that eutrophication models have a greater number of state-variables, the data requirements for specifying model forcings and boundary conditions are significantly greater than those required for hydrodynamic models. Generally, hydrodynamic models require the following inputs: bathymetry, boundary water elevations, boundary temperature and salinity, freshwater inflows, and meteorological forcings, including air temperature, relative humidity, short wave radiation, cloud cover, and barometric pressure. While this may sound like an extensive list of data to assemble, many of these data are readily available.
Bathymetric information is usually readily obtained from navigational charts and can be easily and relatively inexpensively supplemented via sidescan sonar field data collection efforts. Information concerning freshwater inflows (flow hydrographs) to an estuary or coastal system is also usually readily obtained, thanks to the availability of data from flow gages for major streams, rivers and tributaries maintained by the USGS (in the United States) and similar agencies in Europe. Meteorological forcings can easily be obtained from regional airports and/or ocean buoys deployed by NOAA or similar agencies.

Perhaps the most difficult hydrodynamic model inputs to construct are the boundary water elevations and boundary salinity and temperature. Depending on the system being modeled, i.e., estuarine versus coastal waters, boundary water elevation information may be available from existing tide-gage recorders. Otherwise, it may be necessary to construct time-series water elevation boundary conditions from other limited data sources. For coastal waters it is possible to make initial estimates of boundary water elevations, e.g., astronomical tides, using global ocean models, such as TPXO.2, developed by Oregon State University (Egbert et al., 1994). However, these estimates will need to be modified to include the effects of long-term circulation (geostrophic currents) due to cross-shelf climatological slope, as well as sub-tidal meteorological forcing. To a degree, boundary water elevations can also be determined via calibration of the model to internal tide-gages. In a similar fashion, for hydrodynamic models of coastal waters, initial estimates of salinity and temperature boundary conditions can be guided by climatological data sets such as those constructed by Levitus (1982), but these estimates need to be adjusted for local and temporal conditions for a particular simulation period. Again, these estimates can be refined via calibration to internal observations of salinity and temperature. For coastal waters, this calibration process can also be supplemented by satellite imagery of sea surface temperature (SST).

Developing model inputs (forcings and boundary conditions) for eutrophication models is not so easy. The task of generating model forcings, i.e., estimates of nutrient loadings, is complicated by the many sources from which nutrients enter a system. Depending on the system being studied, point source inputs from wastewater treatment facilities may be the dominant source. If this is the case, one might anticipate that these sources would be easy to quantify. However, while most wastewater treatment facilities may measure their effluent flows on a daily basis, they may only measure effluent concentrations of the variables of interest using grab samples taken once or twice a week or once or twice a month. Even then these measurements may not contain all of the relevant variables of interest (e.g., usually neither biogenic silica nor silicate is sampled). In other systems, such as the Gulf of Mexico, non-point source loads associated with runoff from agricultural and non-urbanized areas may be a larger component of the nutrient loading pool.
While monitoring programs such as the USGS National Stream Quality Accounting Network (NASQAN) collect water quality and nutrient data in conjunction with flow monitoring, the nutrient sampling is usually limited to monthly or quarterly grab samples. Therefore, it is often necessary to develop flow–concentration relationships in order to generate nutrient loadings on a daily basis. An alternative approach may be to utilize watershed models to generate required nutrient loadings. However, these models, in turn, need to be calibrated to observed data. Obviously, there is inherent uncertainty in the accuracy of nutrient loadings using either of these techniques.

As one begins to model larger estuarine systems and coastal waters, atmospheric inputs become more significant. Similar to watershed loading estimates, one is dependent upon either sparse data sets from which to estimate nutrient inputs or the outputs from atmospheric models, both of which have uncertainty in the accuracy of associated nutrient loads.

As is the case for hydrodynamic models, generating estimates of boundary conditions for use in water quality models of eutrophication can be extremely challenging. A number of factors influence a modeler's ability to determine reliable estimates of boundary conditions for eutrophication models. The first is the spatial location of the data or monitoring station to be used to supply the data. In the case of existing data sets, one must try to determine whether the data station locations are spatially appropriate. Ideally, the data stations should be located outside of (but near) the computational domain of the model. In a modeling study of the Massachusetts Bays system, the initial monitoring program (Fig. 1(a)), implemented in 1992, did not include stations F26, F27, and F28. Therefore, model boundary conditions were dependent on observed data located at stations F22 and F12, which were located well inside the model domain (Fig. 1(b)). In the particular case of data collected in 1992 at station F22, it was difficult to establish whether the marked decline (April–May) and recovery (May–June) in bottom water dissolved oxygen that was observed to occur at station N10P (Fig. 1(c)) was due to internal biogeochemical processes or due to boundary driven influences. One result of the early modeling analysis was a recommendation to extend the monitoring program to include stations F26, F27, and F28, which were in closer proximity to the boundary of the model domain.

Fig. 1. Massachusetts Bays boundary location selection: (a) MWRA sampling locations, (b) Bays Eutrophication Model (BEM) grid, (c) observed surface dissolved oxygen concentrations for 1992.

A second factor influencing the estimation of boundary conditions is the frequency at which the data are collected. While it is reasonably easy to deploy continuous recording monitoring equipment for the purposes of providing physical data (i.e., salinity and temperature and water depth from bottom pressure sensors) to establish boundary conditions for hydrodynamic models, such continuous recorders are only available for a limited number of water quality parameters. Historically, continuous recorders were only available for estimating phytoplankton chlorophyll-a (fluorometers), turbidity, and dissolved oxygen. More recently, instruments for continuous recording of ammonia, nitrate, phosphate, and silicate have become available. However, reliable recorders for measuring particulate and dissolved organic carbon, nitrogen, phosphorus and biogenic silicate are still lacking. Therefore, modelers are dependent on monitoring programs, which may only take samples on the order of once a week to once a month. In the case of coastal systems, given the expense of ocean going vessels, such sampling may only be conducted once or twice a season.

2.3. Measurements of state-variables

Measurements of data required for skill assessment of eutrophication models are complicated by the fact that often it is only possible to (1) indirectly measure state-variables or water quality parameters of interest and (2) sample composite state-variables of interest. For example, there is no simple direct measurement of algal biomass. Generally, modern eutrophication models represent algal biomass either as carbon (mg C/L) or as chlorophyll-a (µg chl-a/L) and relate the internal pools of nitrogen and phosphorus (and silica for diatom species) using a fixed- or variable-stoichiometry, the latter based on external nutrient availability and/or internal nutrient stores. While one may be able to take water quality samples and analyze the sample to identify algal species and numbers of cells per unit volume, it is difficult, even with reliable estimates of biovolumes for each species observed, to convert these data to the appropriate internal state-variable units, for example mg C/L or µg chl-a/L. Measuring chlorophyll-a provides only an indirect measure of algal biomass, since carbon to chlorophyll ratios may be species specific or subject to the organism's recent exposure to light and external nutrients (Chalup and Laws, 1990). Attempting to measure phytoplankton biomass directly by measuring particulate organic carbon fails, since it is not possible to distinguish between living phytoplankton carbon and nonliving particulate organic carbon or detritus.

Examination of chlorophyll-a concentrations obtained from a multi-year monitoring program conducted by the Massachusetts Water Resources Authority (MWRA) and used to calibrate a eutrophication model of the Bays suggested that a large algal bloom occurred in the fall of 1993, i.e., chlorophyll-a levels were 2–3 times greater than observed in the fall of 1992 and 1994 (Fig. 2). However, evidence of the magnitude of the fall 1993 bloom was not as obvious when compared against coincident measurements of dissolved oxygen (DO) data and with limited measurements of particulate organic carbon (POC).

Fig. 2. Massachusetts Bays chlorophyll-a data, illustrating the 1993 Fall Algal Bloom. Observed surface chlorophyll-a (Chl-a, µg/L), particulate organic carbon (POC, mg C/L), and dissolved oxygen (DO, mg O2/L). Top panel: station N10P; bottom panel: station N20P.
These data did not show as significant an increase in the concentrations of fall POC or DO as might be expected to occur in conjunction with the higher primary productivity that would be associated with such a large algal bloom. Further analysis of the data from the MWRA monitoring program (HydroQual, 2000), which included coincident measurements of algal cell counts and species identification, was conducted. A cross-plot (Fig. 3) of the observed chlorophyll-a to particulate organic carbon (chl-a:C) ratios and cell counts for the diatom Asterionellopsis glacialis indicated that for those samples with high cell counts of A. glacialis ($> 4 \times 10^6$) the average chl-a:C ratio (~48.9) was almost a factor of two greater than the average (~26.5) for those samples with lower A. glacialis cell counts, and both were significantly different from the average fall chl-a:C ratios of ~20 mg chl-a:mg C observed in 1992 and 1994.

In addition, some modern eutrophication models make distinctions between the lability and reactivity of particulate and dissolved organic matter by including state-variables that represent these labile and refractory pools. In part, this has been driven by recognition that in modeling estuarine and nearby coastal waters, pools of organic nutrients discharged from wastewater treatment facilities may have more rapid rates of mineralization than nutrients from coastal or oceanic sources. However, when water column samples are taken and analyzed for particulate and/or dissolved organic nutrients, it is only possible to get a total measurement; it is not possible to distinguish between labile and refractory fractions.

Fig. 3. Chlorophyll to carbon (Chl-a:C) ratio as a function of Asterionellopsis glacialis cell count.

2.4. Spatial and temporal variability

Two additional factors that contribute to the difficulty in achieving good skill in eutrophication modeling concern spatial and temporal variability, especially as related to phytoplankton biomass. Spatial and temporal sampling of algal biomass, as represented by chlorophyll-a, is often influenced by patchiness. Phytoplankton spatial heterogeneity or patchiness can be related to: flood or storm events (Cloern et al., 1985; Gibbs, 1993); aggregation of phytoplankton along tidal fronts when river flow and tides are in opposite directions (Dustan and Pinckney, 1989); vertical velocities associated with wind-induced stress and Ekman-type upwelling (Falkowski et al., 1991); and coastal upwelling of nutrient-rich waters associated with local winds or mesoscale eddies (Abbott and Letelier, 1998; McGillicuddy et al., 2001). Note that a number of the factors that contribute to spatial heterogeneity are in turn due to temporal variability in environmental forcings (variable freshwater inflows and wind events).

Although eutrophication models have been constructed that are able to reproduce some of the behavior of phytoplankton patchiness associated with estuarine and nearshore coastal upwelling, it is not yet computationally feasible for model grids to have the spatial resolution necessary to fully capture the behavior of algal patch dynamics in coastal systems and still be run for the multi-year scenarios required for evaluating the effectiveness of nutrient control strategies. Even if such models could be constructed, the water quality model would still be dependent on the ability of the hydrodynamic model to fully reproduce the physical forcings, eddies and upwelling events that contribute to these patch dynamics.

Temporal variability of phytoplankton species and biomass is also complicated by the number of algal species and predator (zooplankton) species in the natural environment. For example, the Smithsonian Environmental Research Center's “Guide to Phytoplankton” website (http://www.serc.si.edu/labs/phytoplankton/guide/sppindx.jsp) lists more than 300 species under nine genera. With the exception of a limited number of models, most of which focus on species associated with harmful algal blooms or HABs (e.g., Siopsis, 2003; McGillicuddy et al., 2005; Lancelot et al., 2005), most eutrophication models represent phytoplankton with a limited number of phytoplankton state-variables that represent different functional groups (mainly seasonal) that bloom under specific temperature, light, and nutrient regimes. The state-of-the-science, as evidenced by the relatively poor skill assessment scores for model calibration to algal biomass, suggests that a lack of understanding of detailed algal physiology and food preferences by individual zooplankton species, as well as zooplankton physiology, precludes the modeling community from being able to predict why specific species, such as A. glacialis in the Massachusetts Bays example, bloom in one year and not another.

3. Quantitative measures for skill assessment

Despite the limitations considered above, there are a number of statistical measures that can be used to help quantify the skill of environmental and eutrophication water quality models.

3.1. Relative error and absolute relative error

Two simple statistical measures are the relative error, defined as

$$ e_r = \frac{o - c}{o} \tag{1} $$

and the absolute relative error, defined as

$$ e_{ar} = \frac{|o - c|}{o} \tag{2} $$

The relative error and absolute relative error can be computed for each observation ($o$) and model computation ($c$) pair, and a cumulative frequency distribution of the relative errors can be developed so as to determine the median relative and absolute relative errors as well as the 10% and 90% exceedance frequency of error. However, there are some problems with these error statistics in that they can have relatively poor behavior at low values of $o$; for example, low concentrations of surface nutrients and bottom water dissolved oxygen. Both of these water quality parameters have seasonal patterns. Generally, surface nutrients and bottom dissolved oxygen are high during the winter and early spring, but low during the summer and before fall turnover. The former is due to algal uptake during the spring and summer, which tends to deplete nutrient stocks, while the latter is due to bottom water respiration and sediment oxygen demand. Vertical stratification in the summer tends to intensify the situation, since stratification limits the exchange of nutrient-rich bottom waters with surface waters and limits the exchange of oxygen-enriched surface waters with the bottom. Also, the statistic performs poorly if $o > c$, since the maximum relative error is then 100%. Nevertheless, it is a relatively simple statistic to compute, provides a measure of skill assessment, and can be useful in making comparisons between different models applied to the same data sets (Thomann and Segna, 1980).

3.2. Root mean square error

The root mean square (rms) error is defined as

$$ e_{rms} = \sqrt{\frac{\sum (o_i - c_i)^2}{N}} \tag{2} $$

where $o_i$ and $c_i$ are the individual observations or data points and model computed values, respectively, and $N$ is the number of data-model pairs. The rms error is better behaved than the relative error and provides a direct measure of model skill.

3.3. Regression analysis

Another measure that can be used in skill assessment is linear regression analysis of the form

$$ o_{est} = m \cdot c + b \tag{3a} $$

$$ e_{regr} = o - o_{est} \tag{3b} $$

or

$$ e_{est} = o - (m \cdot c + b) \tag{3c} $$

where

$o_{est}$ = estimate of the observation,
$m$ = slope of the regression equation,
$b$ = intercept of the regression equation,
$c$ = model computation,
$o$ = observation or data sample, and
$e_{est}$ = standard error of the estimate,

and where the slope, $m$, is obtained by minimizing the sum of the squares of the differences between model computations and the observed data. In addition to estimating the slope, intercept, and the standard error of the estimate, another skill assessment statistic provided by linear regression is the square of the correlation coefficient, $r^2$. The squared correlation coefficient provides a measure of the goodness-of-fit that the model provides to the observed data or, equivalently, the amount of the variance in the observed data that is accounted for by the model.

Determining skill assessment by regressing model vs. observation can yield several outcomes. The first two potential outcomes, Fig. 4(a) and (b), show that although a good correlation may be obtained, a constant fractional bias may exist, i.e., the regression slope is greater than or less than one. Another potential outcome, Fig. 4(c), shows that although a slope of one and an intercept of zero can be found, a poor correlation can result, i.e., the variance explained by the model may not be ideal. Finally, Fig. 4(d) indicates that a model can achieve high correlation to the observed data, as well as a slope of one, but still have a constant bias relative to the observations ($b > 0$).

Fig. 4. Potential outcomes for linear regression analysis of model versus data.

3.4. Comparison of means: Student's t test

Another statistical test that can be used in an analysis of skill is to evaluate the differences between the mean of the observed data and the mean of the model computations. The Student's t test can be used to test the null hypothesis that the means of two normally distributed “populations” are equal. Given two data sets, e.g., one a set of time- or spatially-averaged data and the other the corresponding time- or spatially-averaged model computations, each characterized by its mean, standard deviation, and number of data points, the Student's t test can be used to determine whether the means are statistically distinct.

The use of the Student's t test requires that the following assumptions be met: normal distribution of the data and equality of variances. For some water quality parameters, such as the inorganic nutrients, the normality assumption may not be valid. Nutrients can vary significantly over an annual cycle, e.g., high nutrient levels during the winter, early spring, and late fall when algal utilization may be low versus very low levels, near detection limits, during the spring, summer, and early fall when algal uptake is very high, and may therefore be more log-normally distributed than normally distributed. If this is the case, one may log-transform the data and conduct the Student's t test on the log-transformed data. One could also perform a seasonal analysis of the data, wherein the seasonally-stratified data may be normally distributed, thus permitting the use of the Student's t test.

Determining the variance of the model computation can also prove difficult, since the components of this variance would include uncertainties in the model input (loads, transport, temperature, etc.), system parameters (growth rates, uptake rates, recycle rates, etc.), and system variability (reflective of variable grid sizes). The determination of the model variance would, therefore, require a number of sensitivity runs and/or Monte Carlo analyses, at several time and space scales, varying model inputs and system parameters. Given the large computational requirements associated with complex large scale estuarine and coastal applications, this may not always be feasible.
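As a toy illustration of the Monte Carlo approach mentioned above, the following sketch propagates input and coefficient uncertainty through the classical Streeter-Phelps dissolved oxygen deficit equation, which is used here only as a small stand-in for a full eutrophication model. The parameter ranges, the two-day travel time, and the function names are hypothetical and are not taken from any study cited in this paper.

```python
import math
import random
import statistics


def do_deficit(t, kd, ka, l0, d0=0.0):
    """Streeter-Phelps dissolved oxygen deficit (mg/L) at travel time t (days)."""
    return (kd * l0 / (ka - kd)) * (math.exp(-kd * t) - math.exp(-ka * t)) \
        + d0 * math.exp(-ka * t)


def monte_carlo_variance(n_runs=2000, seed=1):
    """Sample uncertain inputs; return (mean, variance) of the computed deficit."""
    rng = random.Random(seed)          # fixed seed so the sketch is repeatable
    results = []
    for _ in range(n_runs):
        kd = rng.uniform(0.2, 0.4)     # hypothetical deoxygenation rate, 1/day
        ka = rng.uniform(0.6, 0.9)     # hypothetical reaeration rate, 1/day
        l0 = rng.gauss(10.0, 1.5)      # hypothetical initial BOD, mg/L
        results.append(do_deficit(2.0, kd, ka, l0))
    return statistics.mean(results), statistics.variance(results)


mean_deficit, var_deficit = monte_carlo_variance()
print(mean_deficit, var_deficit)
```

Each run of this toy model is essentially free, whereas each run of a large three-dimensional application is not, which is the practical obstacle noted in the text.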

Rather, it may be reasonable to assume that the model variance (essentially unknown) is at least equal to the observed variance. For analyzing the difference of means between data and the computations of eutrophication models, the dependent form of the t test is used:

$$ t = \frac{\bar{X}_D \sqrt{N}}{s_D} \qquad (4) $$

where the difference of each data vs. model pair must be computed, $\bar{X}_D$ is the average and $s_D$ is the standard deviation of those differences, and N is the number of data-model pairs. As with regression analysis, there are a number of possible outcomes when evaluating the differences of means using the Student's t test. Fig. 5(a) illustrates a case where the means of the observed data and model computations would be considered to be the same, since there is a large degree of overlap between the two distributions and the computed t value would likely be less than the critical t value (found in a table of values of the t distribution) for a given level of significance (e.g., the 0.05 level). In Fig. 5(b), the two distributions still have the same means as shown in Fig. 5(a), but the standard deviations are now much smaller. As a consequence, the degree of overlap is now much smaller and it is unlikely that the computed t would be less than the critical t value; it is therefore likely that the means are not the same, i.e., that the model failed the skill test. In the


Fig. 5. Potential outcomes for comparison of means (Student's t test) for data and model.

case of Fig. 5(c), the two distributions have the same standard deviations as in Fig. 5(b); however, the means are different from those in Fig. 5(a). Again, there is little overlap between the two distributions and it is unlikely that the means are the same. Finally, in the case of Fig. 5(d), the two distributions have considerable overlap due to the large standard deviations of the data, and it is likely that the Student's t test would be passed. However, one should also look at the power of the test to make sure one does not make a type II error, i.e., failing to reject the null hypothesis that the means are equal when they are not.
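The dependent t test of Eq. (4) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure used in any of the studies cited here: the function name `paired_t_statistic`, the optional log transform, and the sample values are all hypothetical, and the critical value must still be taken from a table of the t distribution for N − 1 degrees of freedom.

```python
import math
import statistics


def paired_t_statistic(observed, modeled, log_transform=False):
    """Return (t, N) for the dependent-samples t test of Eq. (4).

    observed, modeled: equal-length sequences of paired values.
    log_transform: if True, test the natural logs of the values, which
    requires every value to be strictly positive.
    """
    if len(observed) != len(modeled):
        raise ValueError("observed and modeled values must be paired")
    if log_transform:
        observed = [math.log(v) for v in observed]
        modeled = [math.log(v) for v in modeled]
    # Difference of each data-model pair.
    diffs = [o - m for o, m in zip(observed, modeled)]
    n = len(diffs)
    if n < 2:
        raise ValueError("at least two data-model pairs are required")
    mean_d = statistics.mean(diffs)   # average of the differences
    s_d = statistics.stdev(diffs)     # sample standard deviation of the differences
    if s_d == 0:
        raise ValueError("all differences are identical; t is undefined")
    return mean_d * math.sqrt(n) / s_d, n


# Hypothetical paired values (e.g., segment-averaged concentrations).
obs = [3.1, 2.4, 1.8, 0.9, 1.2, 2.7]
mod = [2.9, 2.6, 1.5, 1.1, 1.0, 2.5]

t, n = paired_t_statistic(obs, mod)
T_CRIT = 2.571  # two-sided critical t at the 0.05 level for N - 1 = 5 degrees of freedom
verdict = "not statistically different" if abs(t) < T_CRIT else "statistically different"
print(f"t = {t:.3f} with N = {n}: means are {verdict}")
```

Passing `log_transform=True` applies the same test to log-transformed values, which is one way to handle the log-normally distributed nutrient data discussed above.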

3.5. Binary discriminator tests: receiver operating characteristic (ROC) scores

Although it has been used in other disciplines, binary discriminator testing is a relatively new tool for skill assessment of estuarine and eutrophication modeling. As described by Stow et al. (2007-this issue) and Sheng and Kim (2007-this issue), binary discriminator tests, and the ROC in particular, can be used to provide information to water resource managers concerning how useful a model is in a decision-making process. Essentially, the test involves the ability of the model to correctly reproduce observed data above or below a threshold value. Threshold values could be set for bloom prediction (e.g., chlorophyll-a concentrations greater than X µg/L) or prediction of hypoxia (e.g., dissolved oxygen concentrations less than 3 mg/L). When determining whether a model correctly predicts a threshold exceedance, as compared to the observed data, there are four possible outcomes: correctly positive, correctly negative, incorrectly positive, or incorrectly negative. Plotting these results, as well as other simple statistics that can be computed from these four outcomes, can provide water resource managers with some indication of the probability that the decision being made is correct.

3.6. Cumulative density function

Cumulative density function plots, used in conjunction with time-series plots, can also be useful in providing simple skill assessment statistics, as well as demonstrating that a model captures the underlying distribution of the data, although not

Fig. 6. Chesapeake Bay Water Quality Model (CBWQM) calibration results for bottom water dissolved oxygen for CB3: (a) time-series and (b) cumulative frequency distributions.


Table 1. Cumulative density function statistics (mg/L) for Chesapeake Bay Segment CB3

Statistic                        Observed   Predicted
Mean                             3.47       3.25
Minimum                          0          0
Maximum                          10.1       9.45
Standard deviation               2.25       2.0
Median                           3.52       3.06
90th percentile                  6.32       6.08
10th percentile                  0.49       0.75
Number of violations of WQS      173        155

perhaps the absolute timing of the data. An example of the utility of this approach is the application of the Chesapeake Bay Water Quality Model (CBWQM; Cerco and Cole, 1993; Cerco et al., 2002) to proposed water quality criteria (Linker et al., 2002). Fig. 6(a), from Cerco et al. (2002), presents a 10-year (1985–1994) comparison of bottom water dissolved oxygen for model and data for Chesapeake Bay segment CB3, which is located in the upper portion of the Bay, near the head of the deep channel or thalweg that runs from the mouth of the Bay to just south of the City of Baltimore. The data and model computations used in these comparisons represent a spatial average of four bottom water monitoring stations and about sixty-eight (68) model cells, respectively. Linker et al. (2002) compared model estimates of dissolved oxygen to observations to assess model uncertainty in the calibration. In particular, the focus was to see how the model performed relative to the region-specific and season-specific water quality standards being set for the bay. In the example presented here, the seasonal (May 1st through September 30th) water quality standard for bottom water dissolved oxygen is a minimum dissolved oxygen concentration of 1.7 mg/L. A qualitative assessment of the time-series plots (Fig. 6(a)) comparing computed and observed bottom water dissolved oxygen might suggest that the model is providing a reasonable calibration to the observed dissolved oxygen. However, a closer inspection reveals that the model does not perform as well for the critical May–September period that the water quality managers are interested in. A regression analysis of 636 model-data pairs (between May 1st and September 30th) yielded a regression slope of 0.30 and an intercept of 2.49 mg/L with an r² value of 0.07, which are relatively poor skill assessment scores. However, if one is more concerned with the actual determination of violations of the water quality standard during the critical summer period, rather than the exact timing, then the

Fig. 7. Map of the Potomac River Estuary and the Potomac Eutrophication Model (PEM) segmentation.


cumulative frequency distribution of model and data appears to provide another method of analysis and assessment of model skill. Drawing from a plot (Fig. 6(b)) of the cumulative density functions of data and model, a number of statements concerning the model and data can be made (Table 1). These include:
• the mean difference between the predicted and observed values (predicted minus observed) is −0.22 mg/L,
• the model over-estimates the 10th percentile dissolved oxygen by 0.26 mg/L (about a 50% relative error),
• the model predicted 155 violations of the water quality standard versus 173 violations observed in the 636 predicted and observed pairs, a difference of 18 violations, or only about 3% of the pairs.
This comparison is much better than would be expected from the regression statistics and indicates that while the CBWQM may have some difficulties in computing the absolute timing of the minimum dissolved oxygen for this region of the bay, the model appears to be able to reproduce the minimums on a summer season basis.
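The distribution-based statistics of Table 1 and the four binary outcomes of Section 3.5 can both be computed from the same set of model-data pairs. The sketch below is only illustrative: the helper names (`percentile`, `distribution_summary`, `binary_outcomes`) and the eight sample values are hypothetical, the percentile uses simple linear interpolation, and only the 1.7 mg/L standard is taken from the text. The sample values are chosen so that model and data yield the same number of violations while agreeing on only half of the individual pairs, which is the distinction between reproducing the distribution and reproducing the timing.

```python
import statistics


def percentile(values, p):
    """Percentile (p in 0..100) by linear interpolation between order statistics."""
    xs = sorted(values)
    if not xs:
        raise ValueError("no values supplied")
    k = (len(xs) - 1) * p / 100.0
    lo = int(k)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (k - lo)


def distribution_summary(values, standard):
    """Summary statistics of the kind listed in Table 1 for one data set."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "p10": percentile(values, 10),
        "p90": percentile(values, 90),
        # A violation is a value below the minimum dissolved oxygen standard.
        "violations": sum(1 for v in values if v < standard),
    }


def binary_outcomes(observed, modeled, standard):
    """Count the four paired outcomes for the event 'value below standard'."""
    counts = {"correct_positive": 0, "correct_negative": 0,
              "incorrect_positive": 0, "incorrect_negative": 0}
    for o, m in zip(observed, modeled):
        obs_event, mod_event = o < standard, m < standard
        if mod_event and obs_event:
            counts["correct_positive"] += 1
        elif not mod_event and not obs_event:
            counts["correct_negative"] += 1
        elif mod_event:
            counts["incorrect_positive"] += 1   # model violation, none observed
        else:
            counts["incorrect_negative"] += 1   # observed violation missed by model
    return counts


STANDARD = 1.7  # mg/L, seasonal bottom water dissolved oxygen standard from the text
obs = [0.5, 1.0, 2.0, 3.5, 5.0, 1.5, 0.8, 4.2]   # hypothetical observations (mg/L)
mod = [1.9, 0.6, 1.2, 3.0, 4.6, 2.4, 1.0, 0.9]   # hypothetical model output (mg/L)

obs_summary = distribution_summary(obs, STANDARD)
mod_summary = distribution_summary(mod, STANDARD)
outcomes = binary_outcomes(obs, mod, STANDARD)

# Coordinates of one point on an ROC plot for this threshold.
hit_rate = outcomes["correct_positive"] / (
    outcomes["correct_positive"] + outcomes["incorrect_negative"])
false_alarm_rate = outcomes["incorrect_positive"] / (
    outcomes["incorrect_positive"] + outcomes["correct_negative"])

print(obs_summary["violations"], mod_summary["violations"], outcomes)
print(f"hit rate = {hit_rate:.2f}, false alarm rate = {false_alarm_rate:.2f}")
```

Repeating the outcome count over a range of threshold values traces out the full ROC curve.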

Fig. 8. PEM skill assessment — relative errors distributions for (a) ammonia nitrogen and (b) dissolved inorganic nitrogen across model calibration and confirmation years.


4. Skill assessment application: Potomac Eutrophication Model

An early application of model skill assessment was performed for the Potomac River Estuary by Thomann and Fitzpatrick (1982). The Potomac Eutrophication Model (PEM) was fairly simple in nature, consisting of thirty-eight (38) vertically integrated and laterally averaged segments (Fig. 7) that included the mainstem of the Potomac River Estuary and its tidal embayments, extending from the fall-line at Chain Bridge to its lower boundary where it joins Chesapeake Bay. The model consisted of eleven (11) state variables, and its skill was assessed using six data sets collected between 1968 and 1979; three were used for calibration and three for confirmation. The skill was assessed using relative error, linear regression, and comparison of means. Given the focus of the study, which was to evaluate the phosphorus controls necessary to reduce summer phytoplankton levels, only data from the summer periods were considered. A sample of the distribution of relative errors across years is presented for ammonia and dissolved inorganic nitrogen (DIN) in Fig. 8. As can be seen, there is considerable variability across years. Care must be exercised in interpreting the relative error statistics, as with other statistical measures, so as not to be misled by considering simple numbers alone. For example, the 1977 relative error analysis for ammonia indicates a median relative error of approximately 65%, a relatively poor statistic. However, the median relative error for DIN (the sum of ammonia plus nitrite-nitrate) is only 22%. The implication is that there are some problems with ammonia and nitrate individually, but that the overall inorganic nitrogen system is reasonably reproduced. The relative error skill assessment (Table 2) for chlorophyll-a averaged ~17% with a minimum of 9% and a maximum of 32%; the year with the highest observed summer average chlorophyll-a, 1977, had a relative error of ~20%. Relative errors for total phosphorus and dissolved inorganic phosphorus (phosphate) averaged about 20% and 22%, respectively. Although there is no “standard” for judging model skill, one approach is to compare model skill against other model applications. PEM's skill, based on relative error, compares favorably against the median relative errors for phytoplankton (44%) and phosphate (42%) reported by Arhonditsis and Brett (2004). However, the models reviewed by Arhonditsis and Brett included lakes and reservoirs, as well as the open oceans and seas, with the latter comprising the largest number of papers reviewed.
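The ammonia versus DIN comparison above can be reproduced with a small calculation. The sketch assumes that relative error is the absolute difference between observed and computed means divided by the observed mean, evaluated segment by segment, with the median taken across segments; that definition, the function names, and all of the concentrations are assumptions made for illustration and are not values from the PEM study.

```python
import statistics


def relative_error(observed_mean, modeled_mean):
    """Relative error in percent: |observed - modeled| / observed * 100 (assumed definition)."""
    if observed_mean <= 0:
        raise ValueError("observed mean must be positive")
    return abs(observed_mean - modeled_mean) / observed_mean * 100.0


def median_relative_error(observed_means, modeled_means):
    """Median across segments of the segment-by-segment relative error."""
    return statistics.median(
        relative_error(o, m) for o, m in zip(observed_means, modeled_means))


# Hypothetical summer-average concentrations (mg N/L) for three model segments.
nh3_obs = [0.10, 0.20, 0.40]
nh3_mod = [0.16, 0.33, 0.60]   # ammonia computed too high
no3_obs = [1.00, 0.80, 0.60]
no3_mod = [0.94, 0.70, 0.45]   # nitrite-nitrate computed too low

# DIN is the sum of ammonia and nitrite-nitrate in each segment.
din_obs = [a + b for a, b in zip(nh3_obs, no3_obs)]
din_mod = [a + b for a, b in zip(nh3_mod, no3_mod)]

nh3_error = median_relative_error(nh3_obs, nh3_mod)
din_error = median_relative_error(din_obs, din_mod)
print(f"median relative error: ammonia {nh3_error:.0f}%, DIN {din_error:.0f}%")
```

Because the ammonia and nitrate errors in this example have opposite signs, they largely cancel in the sum, so a poor score for one nitrogen species can coexist with a good score for total inorganic nitrogen.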

Table 2. Potomac Eutrophication Model skill: relative error (%)

Year    Chlorophyll-a    Total phosphorus    Dissolved inorganic phosphorus
1968    10               45                  ND
1969    32               15                  ND
1970    31               21                  ND
1977    20               17                  25
1978    9                8                   20
1979    19               12                  21

ND = no data.

Table 3. Potomac Eutrophication Model skill: chlorophyll-a regression analysis

Year    r²      Standard error of estimate (µg/L)    Slope    Intercept (µg/L)    Hypothesis
1968    0.81    10.8                                 0.79     7.7                 R
1969    0.75    10.8                                 0.76     2.5                 R
1977    0.73    17.6                                 0.84     13.0                A
1978    0.88    4.3                                  0.70     12.3                R
1979    0.78    2.3                                  0.35     20.0                R

R = rejected, A = accepted.

A skill assessment based on linear regression was performed for the model chlorophyll-a. The regression analysis provided the following standard statistics:
• the square of the correlation coefficient, r², between the calculated and observed values (the variance in the data explained by the model),
• the standard error of the estimate, representing the residual error between model and data,
• the slope and intercept estimates of the regression equation, and
• the significance test on the slope and intercept estimates.
A number of statements concerning the skill assessment of the PEM may be drawn from the regression analysis (Table 3):
1. with calculated slopes of less than 1.0 for all years, PEM appears to over-estimate phytoplankton biomass on a summer mean basis;
2. with summer mean chlorophyll concentrations of 50–100 µg/L for the calibration period (1968–1970), and of 50–150 µg/L for 1977 and 25–75 µg/L for 1978 and 1979 (the three confirmation years), the standard errors of estimate, with the exception of 1977, are reasonably good;
3. the model accounts for 73 to 88% of the variance observed in the data;
4. interestingly, 1977, which had the highest standard error of estimate and the lowest r² (0.73), was the only year for which the slope and intercept were not found to be statistically different from unity and zero, respectively, essentially indicating complete confirmation.
Also of interest is the observation that the standard errors of the estimates are correlated with the maximum summer average chlorophyll-a levels. The observed maximum summer average chlorophyll-a was approximately 150–155 µg/L in 1977, 100–110 µg/L in 1968 and 1969, and 45–50 µg/L in 1979. Comparing PEM to other models shows that the average r² of 0.79 for PEM compares favorably against the median r² of 0.48 reported by Arhonditsis and Brett (2004). One reason for the high r² for PEM may be that the PEM analysis focused only on the summer period and did not have to reproduce the timing and magnitude of winter/spring and fall algal blooms.
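The regression statistics listed above can be computed with ordinary least squares. The following sketch assumes that the observed values are regressed on the computed values, so that a slope below one corresponds to model over-estimation at high concentrations. It uses the standard textbook formulas for the standard errors of the slope and intercept. The function name `regression_skill` and the five data pairs are hypothetical and are not PEM results.

```python
import math


def regression_skill(modeled, observed):
    """Least-squares fit observed = intercept + slope * modeled, with skill statistics.

    Returns slope, intercept, r2, the standard error of estimate, and the
    t statistics for the hypotheses slope = 1 and intercept = 0.
    """
    n = len(modeled)
    if n != len(observed) or n < 3:
        raise ValueError("need at least three model-data pairs")
    mean_x = sum(modeled) / n
    mean_y = sum(observed) / n
    sxx = sum((x - mean_x) ** 2 for x in modeled)
    syy = sum((y - mean_y) ** 2 for y in observed)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(modeled, observed))
    if sxx == 0 or syy == 0:
        raise ValueError("modeled and observed values must each vary")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual sum of squares about the fitted line.
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(modeled, observed))
    if sse == 0:
        raise ValueError("perfect fit: t statistics are undefined")
    se_estimate = math.sqrt(sse / (n - 2))
    se_slope = se_estimate / math.sqrt(sxx)
    se_intercept = se_estimate * math.sqrt(1.0 / n + mean_x ** 2 / sxx)
    return {
        "slope": slope,
        "intercept": intercept,
        "r2": 1.0 - sse / syy,
        "se_estimate": se_estimate,
        "t_slope": (slope - 1.0) / se_slope,     # tests slope = 1
        "t_intercept": intercept / se_intercept,  # tests intercept = 0
    }


# Hypothetical summer-mean chlorophyll-a values (ug/L) for five segments.
model_chl = [20.0, 40.0, 60.0, 80.0, 100.0]
obs_chl = [24.0, 38.0, 52.0, 70.0, 86.0]

stats = regression_skill(model_chl, obs_chl)
T_CRIT = 3.182  # two-sided critical t at the 0.05 level for n - 2 = 3 degrees of freedom
slope_ok = abs(stats["t_slope"]) < T_CRIT
intercept_ok = abs(stats["t_intercept"]) < T_CRIT
print(stats)
print("hypothesis accepted" if slope_ok and intercept_ok else "hypothesis rejected")
```

In this example r² is very high, yet the hypothesis of unit slope and zero intercept is rejected. As with several years in Table 3, a high correlation alone does not establish agreement between model and data.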


In the PEM analysis, the comparison of means assumed that if the Student's t test was passed, i.e., there was no statistical difference between the means, then the model could be considered validated or verified. PEM obtained a score of 77% across all variables and all years (Fig. 9), i.e., there is no statistical difference between the observed and computed means for 77% of the variable–segment pairs for which a comparison could be made. The highest score across all variables was obtained in 1979 (84%), while the lowest was obtained in 1977 (64%). Phytoplankton biomass, as indicated by chlorophyll-a, had an average score of 76% across all years, while dissolved oxygen had a score of 80% across the six years. Overall, given the PEM's relatively high skill performance, one would conclude that it would provide a suitable tool for determining the load reductions necessary to achieve reduced algal levels during the summer period. Analysis of long-term chlorophyll-a data (ICPRB, 1999) indicates that average surface chlorophyll-a concentrations declined as much as 50% at middle estuary monitoring stations after phosphorus loadings were sharply reduced in the late 1970s and early 1980s. Although summer blooms continued to occur in the upper estuary, this may have been due to phosphorus that had accumulated in bottom sediments and was still being released to the water. It was also reported that the intensity and downriver extent of upper estuary summer blooms decreased after 1985 and that chlorophyll rarely exceeded 100 µg chl-a/L in the latter years of the analysis, suggesting that sediment phosphorus may have begun to be depleted.

5. Qualitative skill assessment

Fig. 9. PEM skill assessment — comparison of means (Student's t test) scores across variables and across model calibration/confirmation years.

Although the primary focus of this paper, as well as the other papers in this journal, is to evaluate quantitative measures and procedures for skill assessment of hydrodynamic and environmental/water quality models, qualitative measures still have a place and a utility in skill assessment. Graphical images can help provide the modeler and natural resource and water quality managers with an overview of how a model may be performing over time (i.e., traditional time-series), over space (i.e., traditional spatial profiles), and on a regional basis. The best examples of regional types of overviews, as applied to eutrophication models, are those that compare computed estimates of surface chlorophyll-a to SeaWiFS images generated from orbiting spacecraft (e.g., Fennel et al., 2006). However, when faced with spatially-sparse or temporally-sparse data sets, graphical images are still useful to get a sense of how models are performing. Model versus data comparisons of a eutrophication model being constructed to assess the role of point source nutrients on eutrophication and dissolved oxygen in the Mid-Atlantic Bight (MAB) help illustrate this point. The spatial domain of the North-East Coastal Shelf Eutrophication Model (NECSEM) extends from the Nantucket Shoals on the northern boundary to Cape Hatteras on the southern boundary, and to the 100-m isobath on the eastern boundary (Fig. 10). The model includes Long Island Sound, but does not include any of the MAB tributaries, such as Chesapeake Bay, Delaware Bay, or New York/New Jersey Harbor. With the exception of the boundary stations (located along the 100-m isobath of the eastern boundary), which were sampled quarterly, the stations available for model vs. data comparisons were sampled once a month, under a one-year sampling program funded by the New York City Department of Environmental Protection. Contour plots (Fig. 10(a)) of seasonally-averaged computed surface


dissolved inorganic nitrogen (DIN) concentrations are compared to seasonally-averaged surface data (color discs). The model reproduces some of the seasonal and spatial patterns observed in the data, including the influence of the Hudson River plume along the New Jersey shore during the winter months and the low levels of DIN present throughout most of the MAB during the summer months. However, there are regions and seasons wherein the model does not compare well to the observed data, particularly in the spring period, when the model overestimates the concentrations of DIN in New York Bight and along the New Jersey shore. The model also appears to overestimate DIN in the fall throughout much of the model domain, although not to the same magnitude as in the spring. The spatial mismatches between the model estimates of DIN and the data appear to be related to the model's inability to reproduce the levels of chlorophyll in the spring and the fall (Fig. 10(b)). It is also interesting to note that there are some unusual measurements of surface DIN. In particular, there are high concentrations (>30 µg N/L) along the eastern boundary, including one observation of 60–75 µg N/L along the boundary directly south of Providence, Rhode Island. These values are high compared to other observations of less than 15 µg N/L for the remaining portions of the MAB monitoring and model domain. Similarly, one can note some high measurements of surface chlorophyll-a along the eastern boundary during the springtime period. While these types of graphical comparisons do not yield any statistical information with which to judge the skill of a model, they do provide some qualitative assessment of skill, and they help to identify unusual data in the observed data set that may contribute to poor skill statistics, such as relative error, for some stations.

6. Discussion

To date most skill assessments of complex eutrophication models are performed on a qualitative basis (Arhonditsis and Brett, 2004), i.e., time-series or spatial plots of model vs. data, which demonstrate, to varying degrees, that the models reproduce or capture some of the temporal or spatial patterns observed in the data. However, these types of comparisons do not provide any statistical information concerning goodness-of-fit or quantify uncertainty in model projections. While there is currently no one commonly accepted way to perform quantitative skill assessments for water quality models, there are a number of simple statistical measures that can be used to begin to provide water resource managers and water quality managers some degree of confidence that the models they are applying to environmental problems are reasonable and “correct,” as opposed to Box (1979). In addition, there are other skill assessment tools, such as the Taylor diagram (Taylor, 2001; Fennel et al., 2006; Gruber et al., 2006; Raick et al., 2007; Anderson, 2009; Doney et al., 2007-this issue; Jolliff et al., 2007-this issue), the Target diagram (Jolliff et al., 2007-this issue), and multivariate approaches (Allen and Somerfield, 2007-this issue) that can be used to perform


model skill assessment. However, these other approaches are somewhat more complex in nature and, while they may be “intuitive” to the scientific community, they may not be as easily understood by natural resource and water quality managers, or provide the type of information concerning skill assessment and uncertainty in model projections that those managers need. This author believes that in order to construct a model with good skill it is important to start with data analysis. A thorough analysis of the data helps the modeler to develop insights into the physical and biochemical behavior of the waterbody under study. Analysis of salinity and temperature can be used to gain insights into water movement, vertical stratification, and upwelling events in the system. Data analysis can also help determine which nutrient or nutrients are limiting in the waterbody and whether the limitation varies seasonally. Data analysis can also be used to identify key features of the data set that one would want a model to reproduce. One of the key components to the success of any water quality or eutrophication model is a well-calibrated hydrodynamic model. Unless the ancillary hydrodynamic model gets the transport correct, including the vertical structure of the water column, it is not possible to successfully calibrate the water quality model. This is particularly true in the case of dissolved oxygen, which is so strongly influenced by vertical mixing processes. Therefore, a thorough evaluation of the skill of the hydrodynamic model, paying particular attention to vertical structure, is essential. During the initial calibration of the eutrophication model, attention should be paid to both qualitative and quantitative measures of skill assessment. Depending on the questions being asked of the model, certain quantitative measures may be more important than others. For example, if the model is being asked “just to compute the correct levels of phytoplankton biomass or minimum dissolved oxygen concentrations, but not to be as concerned with the timing,” then the cumulative density function may provide a better indication of model skill than may be reflected by the relative error statistic, which is likely to be sensitive to the timing and mis-timing of blooms and/or hypoxic/anoxic events. The modeler should also conduct a sensitivity analysis on key model coefficients and inputs. Ideally one would like to conduct an extensive suite of Monte Carlo simulations in the sensitivity analysis, but given the computational burden and the large number of model coefficients and inputs associated with some eutrophication models this may not be practical. However, if the modeler conducts a quantitative skill assessment for each model run performed during the model calibration process, this information can be utilized as part of the sensitivity analysis. Water resource and water quality managers, however, need to realize that models that are applied and compared to multi-year data sets, and that may not demonstrate as high a skill assessment “score” as a model that is applied and calibrated to one or two annual data sets, may

Fig. 10. Seasonally-averaged spatial calibration results for (a) surface dissolved inorganic nitrogen (DIN) and (b) surface chlorophyll-a (Chl-a) for the north-east coastal shelf eutrophication model (NECSEM).


actually be the “better” model. Given the degrees of freedom, i.e., model coefficients, available to be adjusted in modern eutrophication models, it is often easier to adjust model coefficients to “reproduce” one or two years of data than it is to “reproduce” multi-year data sets. When performing a quantitative skill assessment it is also important to recognize the limitations imposed upon a model by the data limitations, described earlier in this paper, that affect the preparation of model inputs as well as model calibration and the determination of model skill. Finally, it is important to recognize the role that qualitative model versus data plots play in skill assessment. While it is important that modelers begin to perform more quantitative skill assessments for newly developed models or model applications, statistics alone may not always provide a balanced picture of a model's ability to represent the physical world. As can be done with water quality models, people can be described by height, weight, eye color, hair type, etc., but beauty is still often in the eyes of the beholder.

Acknowledgements

The author would like to acknowledge the three reviewers whose constructive comments helped to better focus this paper.

References

Abbott, M.R., Letelier, R.M., 1998. Decorrelation scales of chlorophyll as observed from bio-optical drifters in the California Current. Deep Sea Res. Part II: Top. Stud. Oceanogr. 45 (8–9), 1639–1667.
Allen, J.I., Somerfield, P.J., 2007-this issue. A multivariate approach to model skill assessment. J. Mar. Syst.
Allen, J.I., Blackford, J., Holt, J., Proctor, R., Ashworth, M., Siddorn, J., 2001. A highly spatially resolved ecosystem model for the North West European Continental Shelf. Sarsia 86, 423–440.
Anderson, L.A., 2009. The seasonal cycle of phytoplankton in Wilkinson Basin, Gulf of Maine: biological model optimization to 2-D data and multivariate skill assessment. J. Mar. Syst., submitted for publication.
Arhonditsis, G.B., Brett, M.T., 2004. Evaluation of the current state of mechanistic aquatic biogeochemical modeling. Mar. Ecol. Prog. Ser. 271, 13–26.
Bowie, G.L., Mills, W.B., Porcella, D.B., Campbell, C.L., Pagenkopf, J.R., Rupp, G.L., Johnson, K.M., Chan, P.W., Gherini, S.A., Chamberlin, C.E., 1985. Rates, Constants, and Kinetics Formulations in Surface Water Quality Modeling (Second Edition). EPA/600/3-85/040.
Box, G.E.P., 1979. Robustness in the strategy of scientific model building. In: Launer, R.L., Wilkinson, G.N. (Eds.), Robustness in Statistics. Academic Press, New York, NY, pp. 201–236.
Boyd, P.W., Doney, S.C., 2002. Modeling regional responses by marine pelagic ecosystems to global climate change. Geophys. Res. Lett. 29 (16), 53-1–53-4.
Cerco, C.F., 1995. Response of Chesapeake Bay to nutrient load reductions. J. Environ. Eng. (ASCE) 121 (8), 549–557.
Cerco, C.F., 2000. Phytoplankton kinetics in the Chesapeake Bay eutrophication model. Water Qual. Ecosys. Model. 1, 5–49.
Cerco, C.F., Cole, T., 1993. Three-dimensional eutrophication model of Chesapeake Bay. J. Environ. Eng. (ASCE) 119 (6), 1006–1025.
Cerco, C.F., Noel, M.R., 2004. The 2002 Chesapeake Bay eutrophication model. Prepared for the USEPA Chesapeake Bay Program Office. USACE, ERDC, Environmental Laboratory. Vicksburg, MS.
Cerco, C.F., Johnson, B.L., Wang, H.V., 2002. Tributary refinements to the Chesapeake Bay Model. Final report prepared for the USEPA Chesapeake Bay Program. USACE. ERDC TR-02-4. Vicksburg, MS.
Chalup, M.S., Laws, E.A., 1990. A test of the assumptions and predictions of recent microalgal growth models with the marine phytoplankter Pavlova lutheri. Limnol. Oceanogr. 35 (3), 583–596.
Chapra, S.C., 1997. Surface Water-Quality Modeling, McGraw-Hill Series in Water Resources and Environmental Engineering. McGraw-Hill, New York, New York.
Cloern, J.E., Cole, B.E., Wong, R.L.J., Alpine, A.E., 1985. Temporal dynamics of estuarine phytoplankton: case study of San Francisco Bay. Hydrobiologia 129, 153–176.
Doney, S.C., Ducklow, H.W., 2006. A decade of synthesis and modeling in the US Joint Global Ocean Flux Study. Deep Sea Res. Part II: Top. Stud. Oceanogr. 53 (5–7), 451–458.
Doney, S.C., Lima, I., Moore, J.K., Lindsay, K., Behrenfeld, M., Westberry, T.K., Mahowald, N., Glover, D.M., McGillicuddy, D.J., Takahashi, T.T., 2007-this issue. Skill metrics for confronting global upper ocean ecosystem-biogeochemistry models against field and remote sensing data. J. Mar. Syst.
Dustan, P., Pinckney, J.L., 1989. Tidally induced estuarine phytoplankton patchiness. Limnol. Oceanogr. 34 (2), 410–419.
Egbert, G.D., Bennett, A.F., Foreman, M.G.G., 1994. TOPEX/POSEIDON tides estimated using a global inverse model. J. Geophys. Res. 99, 24,821–24,852.
Falkowski, P.G., Ziemann, D., Kolber, Z., Bienfang, P.K., 1991. Role of eddy pumping in enhancing primary production in the ocean. Nature 352, 55–58.
Fennel, K., Wilkin, J., Levin, J., Moisan, J., O'Reilly, J., Haidvogel, D., 2006. Nitrogen cycling in the Middle Atlantic Bight: results from a three-dimensional model and implications for the North Atlantic nitrogen budget. Glob. Biogeochem. Cycles 20, GB3007.
Gibbs, M.M., 1993. Morphometrically induced estuarine phytoplankton patchiness in Pelorus Sound, New Zealand. New Zealand J. Mar. Fresh. Res. 27, 191–199.
Gruber, N., Frenzel, H., Doney, S.C., Marchesiello, P., McWilliams, J.C., Moisan, J.R., Oram, J.J., Plattner, G.-K., Stolzenbach, K.D., 2006. Eddy-resolving simulation of phytoplankton ecosystem dynamics in the California Current System. Deep-Sea Res. I 53, 1483–1516.
Hannon, E., Boyd, P.W., Silvoso, M., Lancelot, C., 2001. Modeling the bloom evolution and carbon flows during SOIREE: implications for future in situ iron-enrichments in the Southern Ocean. Deep Sea Res. Part II: Top. Stud. Oceanogr. 48 (11–12), 2745–2773.
Hood, R.R., Laws, E.A., Armstrong, R.A., Bates, N.R., Brown, C.W., Carlson, C.A., Chai, F., Doney, S.C., Falkowski, P.G., Feely, R.A., Friedrichs, M.A.M., Landry, M.R., Moore, J.K., Nelson, D.M., Richardson, T.L., Salihoglu, B., Schartau, M., Toole, D.A., Wiggert, J.D., 2006. Pelagic functional group modeling: progress, challenges and prospects. Deep Sea Res. Part II: Top. Stud. Oceanogr. 53 (5–7), 459–512.
HydroQual, 2000. Bays Eutrophication Model (BEM): modeling analysis for the period 1992–1994. Prepared for the Massachusetts Waters Resources Authority. Mahwah, NJ.
ICPRB, 1999. Integrated analysis of Potomac Estuary. In: Buchanan, C. (Ed.), Interstate Commission for the Potomac River Basin. ICPRB-99-9.
Jolliff, J.K., Kindle, J.C., Shulman, I., Penta, B., Friedrichs, M.A.M., Helber, R., Arnone, R.A., 2007-this issue. Summary diagrams for coupled hydrodynamic-ecosystem model skill assessment. J. Mar. Syst.
Lancelot, C., Spitz, Y., Gypens, N., Ruddick, K., Becquevort, S., Rousseau, V., Lacroix, G., Billen, G., 2005. Modeling diatom and Phaeocystis blooms and nutrient cycles in the Southern Bight of the North Sea: the MIRO model. Mar. Ecol. Prog. Ser. 289, 63–78.
Lehodey, P., Andre, J.-M., Bertignac, M., Hampton, J., Stoens, A., Menkes, C., Memery, L., Grima, N., 1998. Predicting skipjack tuna forage distributions in the equatorial Pacific using a coupled dynamical bio-geochemical model. Fisheries Oceanogr. 7 (3–4), 317–325.
Levitus, S., 1982. Climatological atlas of the world's oceans. NOAA Prof. Pap. 13, 173 pp.
Linker, L.C., Shenk, G.W., Wang, P., Cerco, C.F., Butt, A.J., Tango, P.J., Savidge, R.W., 2002. A comparison of the Chesapeake Bay Estuary Model calibration with 1985–1994 observed data and method of application to water quality criteria. Prepared for the USEPA Chesapeake Bay Program Office by the Modeling Subcommittee of the Chesapeake Bay Program. Annapolis, Maryland.
Luo, J., Hartman, K.J., Brandt, S., Cerco, C.F., Rippetoe, T., 2001. A spatially-explicit approach for estimating carrying capacity: an application for the Atlantic Menhaden (Brevoortia tyrannus) in Chesapeake Bay. Estuaries 24 (4), 545–556.
McGillicuddy, D.J., Kosnyrev, V.K., Ryan, J.P., Yoder, J.A., 2001. Covariation of mesoscale ocean color and sea-surface temperature in the Sargasso Sea. Deep Sea Res. Part II: Top. Stud. Oceanogr. 48 (8–9), 1823–1836.
McGillicuddy, D.J., Anderson, D.M., Lynch, D.R., Townsend, D.W., 2005. Mechanisms regulating large-scale seasonal fluctuations in Alexandrium fundyense populations in the Gulf of Maine: results from a physical–biological model. Deep Sea Res. II 52, 2698–2714.
Raick, C., Soetaert, K., Gregoire, M., 2007. Model complexity and performance: how far can we simplify? Prog. Oceanogr. 70 (1), 27–57.
Reckhow, K.H., Chapra, S.C., 1999. Modeling excessive nutrient loading in the environment. Environ. Pollut. 100, 197–207.
Scott, B., Sharples, J., Ross, O., 2005. Setting the scene (2): Oceanography. In: Understanding marine foodweb processes: an ecosystem approach to sustainable sandeel fisheries in the North Sea. C.J. Camphuysen (Ed.). Interactions Between the Marine Environment, Predators, and Prey: Implications for Sustainable Sandeel Fisheries. IMPRESS Final Report, Project #Q5RS-2000-30864.
Sheng, P., Kim, T., 2007-this issue. Skill assessment of an integrated modeling system for shallow coastal and estuarine ecosystems. J. Mar. Syst.
Siopsis, M., 2003. An individual-based model of the toxic algae species Pseudo-nitzschia multiseries. Doctoral Dissertation, University of Tennessee. Knoxville, TN.
Skogen, M.D., Soiland, H., Svendsen, E., 2004. Effects of changing nutrient loads to the North Sea. J. Mar. Syst. 46 (1–4), 23–28.
Stow, C.A., Jolliff, J., McGillicuddy, D.J., Jr., Doney, S.C., Allen, J.I., Friedrichs, M.A.M., Rose, K.A., Wallhead, P., 2007-this issue. Skill assessment for coupled biological/physical models of marine systems. J. Mar. Syst.
Streeter, H.W., Phelps, E.B., 1925. A study of the pollution and natural purification of the Ohio River, III, Factors concerned in the phenomena of oxidation and reaeration. Public Health Bulletin No. 146. U.S. Public Health Service.
Taylor, K.E., 2001. Summarizing multiple aspects of model performance in a single diagram. J. Geophys. Res. 106 (D7), 7183–7192.
Thomann, R.V., 1982. Verification of water quality models. J. Envr. Eng. (ASCE) 108 (5), 923–940.
Thomann, R.V., Segna, J.J., 1980. Dynamic phytoplankton-phosphorus model of Lake Ontario: ten year verification and simulation. In: Loehr, R.C., et al. (Ed.), Phosphorus Management Strategies for Lakes. Ann Arbor Science, Ann Arbor, MI, pp. 153–190.
Thomann, R.V., Fitzpatrick, J.J., 1982. Calibration and verification of a mathematical model of the eutrophication of the Potomac Estuary. Final report prepared for the Department of Environmental Services of the Government of the District of Columbia. HydroQual, Inc. Mahwah, NJ.