Statistical Methodology 9 (2012) 251–264
Attaching uncertainty to deterministic spatial interpolations

Souparno Ghosh a,∗, Alan E. Gelfand a, Thomas Mølhave b

a Department of Statistical Science, Duke University, United States
b Department of Computer Science, Duke University, United States

Keywords: Calibration; Data fusion; Digital elevation model; Measurement error; Multi-level model
Abstract

Deterministic spatial interpolation algorithms such as natural neighbor interpolation (NNI) or the Cressman interpolation scheme are widely used to interpolate environmental features. In particular, the former has been applied to digital elevation models (DEMs), the latter to weather data and pollutant exposure. However, they are unsatisfying in that they fail to provide any uncertainty assessment. Such schemes are not model-based; rather, they provide a set of rules, usually geometrically motivated, by which point-level data are interpolated to a grid. We distinguish this setting from the case where the deterministic model is essentially a mapping from inputs to outputs, in which case a joint model can be formulated to assign uncertainty. In our setting we have no inputs, only an interpolated surface at some spatial resolution. We propose a general approach to handle the non-model-based setting. In fact, the approach can be used to assign uncertainty to any supplied surface regardless of how it was created. We first formulate a useful notion of uncertainty and then show, with additional external validation data, that we can attach uncertainty using a convenient version of a data fusion model. We also clarify the distinction between this setting and the more usual case where we are trying to build an explanatory model to explain an environmental surface. We discuss two settings for such interpolation, one where the surface is presumed to be continuous, such as elevation or temperature, and the other where the surface would be discontinuous, such as with precipitation where, at any location,
there would be a point mass in the distribution at 0. We work within a hierarchical Bayesian framework and illustrate with a DEM within the Cape Floristic Region of South Africa.
1. Introduction

Deterministic algorithms for creating interpolated surfaces over a spatial domain are frequently used for predicting variables like elevation, weather, environmental exposure and computer model output [23,19,15]. They can be used to obtain smooth surfaces for, say, elevation or temperature, or discontinuous surfaces by allowing local point masses, say at 0, as for precipitation. However, from an inferential perspective, they fall short since they do not attach uncertainty to the interpolations.

The contribution of this paper is to formulate a useful notion of uncertainty for such deterministic predictions in the case where no "model" is available for the interpolated surface. We only have values of the surface at some spatial resolution; we do not know the original data that were used to develop the surface.¹ Such maps frequently arise in digital elevation mapping and for other ecological variables such as soil or vegetation surfaces. Even weather surfaces, such as those created by WorldClim [12], do not make available potential uncertainty information. Our contribution is to demonstrate that such uncertainty can be learned about through suitable stochastic modeling. Our only additional requirement is some external "validation" data gathered independently over the same domain. Then, the uncertainty emerges under a data fusion model.

We note that there is a secondary importance for these uncertainties. We often use features over a spatial region as covariates in explanatory models. For instance, we would use weather and elevation information to explain biomass or species abundance at locations across a region. When these covariates are provided at locations through maps as above, we need to propagate the uncertainty associated with their values through the regression model to properly explain the response.

As an example of our setting, consider the natural neighbor interpolation (NNI) scheme [22]. The scheme is based on the Voronoi diagram (Fig. 1(a)). Let S = {s_1, s_2, ..., s_n} be a set of sites in R² and let {Y(s_1), Y(s_2), ..., Y(s_n)} be the values of the response variable at this set of sites. The Voronoi cell of a site s_i, denoted by Vor_S(s_i), is the set of points in R² whose nearest neighbor in S is s_i. The Voronoi diagram of S is the union of Vor_S(s_i) over all s_i ∈ S. Then, for any point s in the convex hull of S, NNI sets Y(s) to be a convex combination of the Y(s_i) values at the Voronoi neighbors of s (Fig. 1(b)). Venturato [24] uses the NNI scheme to build the DEM for coastal Oregon. Beutel et al. [2] propose an approximate NNI scheme to construct a grid DEM when one encounters massive amounts of data.

Another example is the Cressman interpolation scheme [3]. This approach interpolates point source data to a user-defined latitude-longitude grid. The algorithm makes multiple passes through the grid at consecutively smaller radii of influence to increase precision. The radius of influence is the maximum distance from a grid point within which observed point-source values are allowed to contribute to estimating the value at that grid point; points beyond the radius of influence have no effect on the grid point value. Cressman interpolation has been used for interpolating precipitation and temperature surfaces; see, e.g., [11]. Note that these schemes are essentially geometric and completely deterministic; they are not expressible through an underlying model.
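To make the flavor of such schemes concrete, the following is a minimal Python sketch of a Cressman-style successive-correction analysis. It is not the implementation used in any of the cited work; the first-guess field, the nearest-grid-point residual lookup, the station values and the radii are all illustrative assumptions.

```python
import numpy as np

def cressman(grid_xy, stn_xy, stn_vals, radii):
    """Successive-correction (Cressman-style) analysis on a set of grid points.
    At each pass, station residuals are spread to grid points within the current
    radius of influence R using weights w = (R^2 - d^2) / (R^2 + d^2)."""
    vals = np.full(len(grid_xy), stn_vals.mean())        # first-guess field (an assumption)
    for R in radii:                                       # successively smaller radii
        # residual at each station: observation minus current analysis,
        # read off at the nearest grid point (a simple approximation)
        nearest = ((stn_xy[:, None, :] - grid_xy[None, :, :]) ** 2).sum(-1).argmin(axis=1)
        resid = stn_vals - vals[nearest]
        for k, g in enumerate(grid_xy):
            d2 = ((stn_xy - g) ** 2).sum(axis=1)
            inside = d2 < R ** 2
            if inside.any():                              # stations beyond R have no effect
                w = (R ** 2 - d2[inside]) / (R ** 2 + d2[inside])
                vals[k] += (w * resid[inside]).sum() / w.sum()
    return vals

# hypothetical example: 25 stations on the unit square, 30 x 30 analysis grid
rng = np.random.default_rng(0)
stn_xy = rng.uniform(0, 1, size=(25, 2))
stn_vals = np.sin(3 * stn_xy[:, 0]) + np.cos(2 * stn_xy[:, 1])
gx, gy = np.meshgrid(np.linspace(0, 1, 30), np.linspace(0, 1, 30))
grid_xy = np.column_stack([gx.ravel(), gy.ravel()])
analysis = cressman(grid_xy, stn_xy, stn_vals, radii=(0.5, 0.3, 0.15))
```

Note that nothing in this recipe is stochastic: the output is a deterministic function of the station values and the geometry, which is exactly why no uncertainty statement comes with it.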
Both build the interpolation surface from values at a finite set of locations, ignoring, for example, spatial correlation, potential covariates, etc. Moreover, in the absence of further information, there is no justifiable way to associate uncertainty with any value on the surface.

Several authors have looked into the problem of quantifying uncertainty associated with the outputs of mechanistic models. These works assume that the deterministic model is essentially a mapping from a set of input levels to a set of output variables. This assumption implies that the inputs are functionally related to the outputs via some meaningful parameters that are interpretable both statistically and physically. Givens [9] and Raftery et al. [18] present a general framework for this setting which they call 'Bayesian synthesis'. Givens and Hughes [10] describe a SIR algorithm for estimation, fitting and uncertainty analysis in complex, nonlinear deterministic models. Patwardhan and Small [16] and Wolpert et al. [25] propose a Bayesian Monte Carlo method for doing the same. Hoel and Mitchell [13] and Diggle and Gratton [4] tackled this problem from a frequentist perspective. Jones and Nicholls [14] developed a state-space model with random inputs and then propagated input uncertainty and covariance by numerically integrating the resulting system of stochastic nonlinear differential equations. However, these methods are not applicable when the outputs are completely model-free, like those from the NNI or Cressman algorithms above, i.e., when we only have an output file at some resolution.

The novelty of the present paper is the development of a general Bayesian approach to attach uncertainties to interpolated surfaces in such settings, given some external "validation" data gathered independently over the same domain as the interpolated surface. With the map and this validation data, we offer a fully model-based way to quantify the uncertainty associated with the former. We propose spatially varying uncertainty which, as must be the case, will depend upon the locations of the validation data. Although the algorithm treats the observations as deterministic and produces a deterministic surface as the output, the latter will not be the "true surface". Hence, given the truth and the deterministic surface, the associated "error" is not stochastic. However, under a suitable stochastic specification, this error can be viewed as a random unknown which we can infer about using this specification. In particular, we use data fusion methodology to develop a joint stochastic model for the deterministic output of the algorithm and the stochastic "validation" data. We also allow for a potential calibration problem and a change of support problem that may arise between the interpolated map and the "validation" data.

The format of the paper is as follows. In Section 2 we formalize what "uncertainty" means for us when dealing with a "model-free" deterministic surface. Section 3 gives a detailed description of the proposed fusion model when dealing with smooth interpolated surfaces. Section 4 describes how this model can be adapted to handle discontinuous surfaces. Section 5 offers some simulation results. In Section 6 we apply our model to quantify the uncertainty associated with a digital elevation model (DEM) developed using the NNI algorithm for a portion of the Cape Floristic Region in South Africa. The final section summarizes and discusses a few possible extensions of the present work.

¹ In fact, the surface may not have been developed from real data. Rather, it may come from a black box about which we know nothing.

Fig. 1. (a) Voronoi diagram Vor(S). (b) Natural neighbor interpolation at s; the shaded cell is Vor_{S∪{s}}(s), and each color denotes the region stolen from each cell of Vor(S).
2. What do we mean by uncertainty?

We first discuss what we mean by the uncertainty associated with a deterministic interpolation. In practice, the interpolation will be at grid cell level but, for the moment, let us conceptualize at point level, whence we have a surface, M_D = {M(s): s ∈ D}, available at every s ∈ D. In addition, we have a validation set, say Y(s_j), j = 1, 2, ..., n. Imagining uncertainty in M(s) suggests that we imagine the surface as random and, in fact, clarifies that the uncertainty we attach is relative to the modeling we adopt. Indeed, this is the story in Sections 3 and 4 below. Moreover, since Y(s) is only observed at a finite set of locations, to connect Y's to arbitrary locations for M's requires us to think of Y_D = {Y(s): s ∈ D} as a random surface as well. Hence, a joint model for M_D, Y_D seems necessary.

If M(s) is calibrated and Y(s) is "true", then M(s) − Y(s) is our "error" and we can consider the variability of M(s) − Y(s) as our uncertainty at location s. More generally, with calibration implemented for M(s), in order to make an appropriate comparison, we suggest that ϵ(s) = M(s) − E(M(s)|Y(s)) is the error and the uncertainty in ϵ(s) is what we seek. A priori, E(ϵ(s)) = 0 and, under a stationary bivariate Gaussian process, var(ϵ(s)) is given by:
var(ϵ(s)) = E(var(ϵ(s)|Y(s))) + var(E(ϵ(s)|Y(s))) = E(var(M(s)|Y(s))) = var(M(s)|Y(s)),

since the last term is constant over s. This means that var(ϵ(s)) fails to capture "local" uncertainty. We expect uncertainty to vary over D; we need a location-specific posterior measure of uncertainty.

Note that, in this formulation, there is no likelihood that we can write for the uncountable set, M_D. So, we return to the practical setting where the deterministic interpolator produces a tiled surface, i.e., M(s) = M(A_i) if s ∈ A_i with i = 1, 2, ..., I, and we interpret M(A_i) as a block average, ∫_{A_i} M(s) ds / |A_i|. With the finite sets {M(A_i)} and {Y(s_j)}, we can write a likelihood and fit a suitable model. We find ourselves in the familiar territory of Bayesian melding [17,5], with the impossibility of implementing such melding when I is very large (as it will be for a high resolution interpolation) due to the required stochastic integrations. See [1] in this regard. We propose an alternative model specification to avoid such integrations and, since we only have the {M(A_i)}, we will only assign uncertainty to this set of values.

Suppose we introduce a "true" average value for the variable over A_i, say Ṽ(A_i) (anticipating Sections 3 and 4 below). Then, again under a calibration model, we would now take ϵ_m(A_i) = M(A_i) − E(M(A_i)|Ṽ(A_i)) as the error. But the posterior, [E(var(ϵ_m(A_i)))|Data], is a parametric function, free of A_i for equal sized grids, and does not provide varying uncertainty across grid cells. Instead, we look at the posterior distribution of ϵ_m(A_i), [ϵ_m(A_i)|Data], and we use var(ϵ_m(A_i)|Data) as our local uncertainty. In Section 3 we show that, under our modeling, this turns out to be equivalent to examining the posterior distribution of σ_m²(A_i).

To further clarify, the "true" error for M(A_i) is M(A_i) − True(A_i), where True(A_i) is the calibrated true average value for A_i. This is an unknown value and, as usual in the Bayesian setting, we model unknowns as random and look at their posterior, [M(A_i) − True(A_i)|Data], for inference. With the specifications below, we take E(M(A_i)|Ṽ(A_i)) as the model for the calibrated truth, after which we obtain the above as our attached uncertainty.

Since our attached uncertainty is model-based, it is natural to ask how we might compare uncertainty assignments. That is, with a sufficiently complex model, we could make uncertainty arbitrarily small, which is not our objective. More precisely, there is a trade-off between small uncertainty and bias in E(M(A_i)|Ṽ(A_i)) relative to True(A_i). That is,

M(A_i) − True(A_i) = (M(A_i) − E(M(A_i)|Ṽ(A_i))) + (E(M(A_i)|Ṽ(A_i)) − True(A_i)),

revealing the trade-off. So, to compare models, we need to look not only at the posterior variance arising from the first term but also at the squared bias associated with the second term. What do we have to inform about bias? On the uncalibrated scale we could compare Ṽ(A_i) with Y(s) for s ∈ A_i, suggesting E((Y(s_j) − Ṽ(A_{s_j}))²|Data), where A_{s_j} denotes the A_i such that s_j ∈ A_i. Finally, a balanced loss
would arise in the form

(1/I) Σ_i var(ϵ_m(A_i)|Data) + c (1/n) Σ_j E((Y(s_j) − Ṽ(A_{s_j}))²|Data),    (1)

where c captures the relative regret for the two losses. See [6] in this regard. Small values of (1) would be preferred.
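As a practical note, criterion (1) is computed directly from MCMC output. The following is a small sketch under assumed array layouts; the names eps_m_draws, vtilde_site_draws and y_obs are hypothetical.

```python
import numpy as np

def balanced_loss(eps_m_draws, vtilde_site_draws, y_obs, c=1.0):
    """Criterion (1): average posterior variance of eps_m(A_i) across the I cells,
    plus c times the average posterior expectation of (Y(s_j) - Vtilde(A_{s_j}))^2.
    eps_m_draws: (S, I) posterior draws of eps_m; vtilde_site_draws: (S, n) draws of
    Vtilde at the cells containing the validation sites; y_obs: (n,) observations."""
    term1 = eps_m_draws.var(axis=0, ddof=1).mean()
    term2 = ((y_obs[None, :] - vtilde_site_draws) ** 2).mean(axis=0).mean()
    return term1 + c * term2
```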
When we consider an interpolated surface for, say, precipitation, we need to provide for point masses at 0. This requires a revision of the uncertainty measure, which we describe in Section 4.

3. Uncertainty quantification; no point masses

Following the notation of the previous section, let Y(s_j) be the observed value of a continuous variable (say temperature) at location s_j, j = 1, 2, ..., n, obtained from independent station data. Let M(A_i) be the deterministic predicted/interpolated value for cell A_i, i = 1, 2, ..., I. Again, we conceptualize M(A_i) as an average value for cell A_i. Let Ṽ(A_i) be the true average value for A_i. Then, we model M(A_i) as

M(A_i) = β0 + β1 Ṽ(A_i) + ϵ_m(A_i),    (2)
where ϵ_m(A_i) ∼ N(0, σ_m²(A_i)) independently for all i = 1, ..., I. So, (2) provides possible calibration for the interpolation relative to the truth. We assume

Y(s_j) = V(s_j) + ϵ_y(s_j),    (3)
where ϵ_y(s_j) ∼ N(0, σ_y²) independently, j = 1, ..., n, i.e., a simple measurement error model for the station data relative to the truth. To handle the potential change of support problem, we avoid the integration associated with scaling from points to grid cells by, instead, ascribing a further measurement error model (MEM) [20] for V(s_j) as follows:

V(s_j) ∼ N(Ṽ(A_i), σ_v²),    (4)
for j = 1, 2, ..., n, where A_i is the cell containing the site s_j. Finally, we assume Ṽ = (Ṽ(A_1), ..., Ṽ(A_I)) ∼ N(µ1, Σ_Ṽ) with Σ_Ṽ = σ_Ṽ² exp(−l_ij/θ0), where l_ij is the Euclidean distance between the centroids of cells A_i and A_j. Without loss of generality, we can assume µ = 0. To complete the hierarchical model, we enable heterogeneous variances by assuming σ_m²(A_i) ∼ lognormal(α0 + α1 Ṽ(A_i), c).² Since we have no prior knowledge regarding the sizes of σ_y² and σ_v², we give them a common inverse gamma prior, centered around half of the mean square error (MSE) arising from a simple linear regression of Y on M for the n sites where Y was observed and the associated interpolated value for M. As a result, we take σ_y² ∼ σ_v² ∼ InvGamma(α_σ, β_σ), where we obtain the values of α_σ and β_σ by solving the equations:
β_σ/(α_σ − 1) = MSE/2   and   β_σ²/((α_σ − 1)²(α_σ − 2)) = 10²,
implying a large variance. The regression parameters β0, β1, α0, α1 are assumed to be independently normally distributed with mean 0 and variance 10². Since σ_Ṽ² and θ0 are not well identified [26], we fix the latter at roughly a third of the maximum distance over the region and adopt an inverse gamma prior for the former, centered around the sample variance of the M's.
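For illustration, the two moment equations above have a simple closed-form solution for (α_σ, β_σ). The sketch below assumes the standard inverse gamma mean and variance formulas, with the prior variance set to 10² as in the text.

```python
def invgamma_hyperparams(mse, prior_var=100.0):
    """Solve beta/(alpha - 1) = MSE/2 and beta^2/((alpha - 1)^2 (alpha - 2)) = prior_var
    for (alpha, beta); the second equation reduces to (MSE/2)^2 / (alpha - 2) = prior_var."""
    m = mse / 2.0
    alpha = 2.0 + m ** 2 / prior_var   # from variance = m^2 / (alpha - 2)
    beta = m * (alpha - 1.0)           # from mean = beta / (alpha - 1)
    return alpha, beta

# e.g., an MSE of 4 from regressing Y on M gives a prior with mean 2 and variance 100
alpha_sigma, beta_sigma = invgamma_hyperparams(4.0)
```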
2 Having a mean for the log variance which depends upon the value of the true average exposure seems more sensible than using, for example, a trend surface. In particular, we anticipate α1 > 0 so larger variances will be associated with larger responses. Furthermore, in fitting the model we set c = 1 for computational stability. We have no prior knowledge regarding the variance for a regression of a log variance.
Define M = (M(A_1), ..., M(A_I)), Ṽ = (Ṽ(A_1), ..., Ṽ(A_I)), σ²_m = (σ_m²(A_1), ..., σ_m²(A_I)), Y = (Y(s_1), ..., Y(s_n)), and V = (V(s_1), ..., V(s_n)). We can now write the full distributional specification as:

[M | Ṽ, σ²_m, β0, β1] [Y | V, σ_y²] [V | Ṽ, σ_v²] [Ṽ | µ1, σ_Ṽ²].

With the foregoing priors on the model parameters, we obtain a complete Bayesian hierarchical model specification. The model is fitted using Markov chain Monte Carlo (MCMC). Details of the required full conditional distributions are given in Appendix A.1. Following Section 2, we turn to
ϵ_m(A_i) = M(A_i) − E(M(A_i)|Ṽ(A_i)) = M(A_i) − β0 − β1 Ṽ(A_i),

as the error. E(σ_m²(A_i)) = E(var(ϵ_m(A_i))) is calculated for the model above in Appendix A.2 and is seen to be free of i. But, again from Section 2, we seek the posterior variance of ϵ_m(A_i). We can obtain this directly using composition sampling by drawing posterior samples of ϵ_m(A_i) and then calculating a sample variance. That is, we can use samples from [Ṽ(A_i), α0, α1 | Data] to draw a posterior realization of σ_m²(A_i) which, in turn, enables a posterior draw of ϵ_m(A_i). Alternatively, we have
var(ϵ_m(A_i)|Data) = E_{σ_m²(A_i), Ṽ(A_i)|Data}(var(ϵ_m(A_i)|σ_m²(A_i), Ṽ(A_i), Data)) + var_{σ_m²(A_i), Ṽ(A_i)|Data}(E(ϵ_m(A_i)|σ_m²(A_i), Ṽ(A_i), Data)).

The second term above is clearly 0, while the first simplifies to E(σ_m²(A_i)|Data). In other words, we can alternatively obtain the desired uncertainty by posterior simulation of the σ_m²(A_i)'s and then taking their sample mean. Regardless, we would display the I posterior expectations in a map (see Sections 5 and 6 below).
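A minimal sketch of this composition-sampling calculation, assuming posterior draws of Ṽ(A_i), α0 and α1 are already available from the MCMC; the array names and shapes are our own conventions.

```python
import numpy as np

def local_uncertainty(vtilde_draws, a0_draws, a1_draws, c=1.0, rng=None):
    """Composition sampling: for each posterior draw, generate sigma_m^2(A_i) from its
    lognormal model and eps_m(A_i) ~ N(0, sigma_m^2(A_i)); return, per cell, the sample
    variance of the eps_m draws (approximately E[sigma_m^2(A_i) | Data]).
    vtilde_draws: (S, I) draws of Vtilde(A_i); a0_draws, a1_draws: (S,) draws."""
    rng = np.random.default_rng() if rng is None else rng
    S, I = vtilde_draws.shape
    log_sig2 = (a0_draws[:, None] + a1_draws[:, None] * vtilde_draws
                + np.sqrt(c) * rng.standard_normal((S, I)))
    sig2_draws = np.exp(log_sig2)                          # draws of sigma_m^2(A_i)
    eps_draws = rng.standard_normal((S, I)) * np.sqrt(sig2_draws)
    return eps_draws.var(axis=0, ddof=1)                   # one value per grid cell, to be mapped
```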
4. Uncertainty quantification with point masses

When we are modeling phenomena like precipitation, whose distributions include a point mass at 0, we assume that a latent multivariate normal distribution is specified for the {W(A_i)}. These, in turn, drive the interpolated averages M(A_i), that is,

M(A_i) = 0        if W(A_i) ≤ 0
       = W(A_i)   if W(A_i) > 0.    (5)
Since W(A_i) is continuous, we assume that it is driven by Ṽ(A_i), as in the previous section, i.e.,

W(A_i) = β0 + β1 Ṽ(A_i) + ϵ_w(A_i).    (6)
Therefore, (6) provides the calibration for the interpolation relative to the truth, but now on an underlying continuous scale. Furthermore, the distribution of ϵ_w is now that of ϵ_m defined in Section 3. Similarly, we assume that a latent Gaussian process Z(s) drives the process for the station data, Y(s), i.e.,

Y(s) = 0       if Z(s) ≤ 0
     = Z(s)    if Z(s) > 0,    (7)
and then model the continuous Z(s)'s using a measurement error model,

Z(s) = V(s) + ϵ_z(s),    (8)

where ϵ_z(s) ∼ N(0, σ_z²) independently, for s = s_1, ..., s_n. Then, we specify a further MEM for V(s) linking V(s) and Ṽ(A_s), as done earlier. This MEM handles the change of support problem and allows the possibility, when s ∈ A_s, of Y(s) = 0 when M(A_s) > 0 and vice versa. In order to obtain the full conditionals, we rewrite W(A_i) and Z(s) as follows:

W(A_i) = M(A_i)      if M(A_i) > 0
       = W_1(A_i)    if M(A_i) = 0,

and

Z(s) = Y(s)      if Y(s) > 0
     = Z_1(s)    if Y(s) = 0.
The unknown quantities W_1(A_i) < 0 and Z_1(s) < 0 are introduced into the likelihood every time an event of no precipitation is observed in the original processes M(A_i) and Y(s), respectively. These unknowns are updated at each iteration of the MCMC. Note that the full conditional distributions for these parameters are truncated normal distributions. The full conditionals for the other parameters are obtained analogously to the previous section and are given in Appendix A.1. Following the arguments of the previous sections, to attach uncertainty we would again be interested in ϵ_m(A_i). Evidently,
ϵ_w(A_i) = W(A_i) − E(W(A_i)|Ṽ(A_i))

differs from

ϵ_m(A_i) = max(0, W(A_i)) − E(max(0, W(A_i))|Ṽ(A_i)).

The uncertainty measure of the previous section would now apply to the W's and would give an estimate of the posterior variance of ϵ_w(A_i), equivalently, of the posterior mean of σ_w²(A_i). However, the uncertainty in the M(A_i) is associated with the posterior variance of ϵ_m(A_i). Unlike before, σ_m²(A_i) = var(ϵ_m(A_i)) has no simple distribution, so posterior sampling is not as direct as in the previous section. However, we still propose composition sampling. That is, given a draw of Ṽ(A_i), α0, α1, we can sample the W(A_i)'s using (6), after drawing σ_w²(A_i), and thus sample the M(A_i)'s. With the latter samples, we can obtain ϵ_m(A_i)'s and compute a sample variance from them. We omit further details.
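For concreteness, a hedged sketch of this composition step: given posterior draws of Ṽ(A_i), β and α (hypothetical array names), draw σ_w²(A_i) and W(A_i) as in (6), truncate at 0 to obtain M(A_i), and center at the rectified-normal mean, which approximates E(max(0, W(A_i)) | Ṽ(A_i)) by additionally conditioning on the drawn variance.

```python
import numpy as np
from scipy.stats import norm

def local_uncertainty_point_mass(vtilde_draws, b0_draws, b1_draws,
                                 a0_draws, a1_draws, c=1.0, rng=None):
    """For each posterior draw: sample sigma_w^2(A_i) and W(A_i) as in (6),
    set M(A_i) = max(0, W(A_i)), and form eps_m(A_i) = M(A_i) - E(max(0, W) | mu, sigma_w),
    using the rectified-normal mean mu*Phi(mu/sigma) + sigma*phi(mu/sigma).
    Returns the per-cell sample variance of the eps_m draws."""
    rng = np.random.default_rng() if rng is None else rng
    S, I = vtilde_draws.shape
    mu = b0_draws[:, None] + b1_draws[:, None] * vtilde_draws
    sig = np.sqrt(np.exp(a0_draws[:, None] + a1_draws[:, None] * vtilde_draws
                         + np.sqrt(c) * rng.standard_normal((S, I))))
    w = mu + sig * rng.standard_normal((S, I))
    m = np.maximum(0.0, w)                                     # interpolated-scale draws
    e_m = mu * norm.cdf(mu / sig) + sig * norm.pdf(mu / sig)   # E(max(0, W) | mu, sigma)
    eps_m = m - e_m
    return eps_m.var(axis=0, ddof=1)
```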
5. Simulation examples

5.1. Simulation example with a smooth surface

Here, we offer a simulation example on a unit square to illustrate the performance of the model. The simulation proceeds as follows (a code sketch of these steps appears after the list):

1. The unit square is uniformly divided into 900 grid cells.
2. Using the centroids of the grid cells, generate Ṽ = (Ṽ(A_1), ..., Ṽ(A_900)) from a zero mean GP with exponential covariance structure with θ0 = 0.5 and σ_Ṽ² = 1.
3. Then, the M(A_i) are generated using relation (2) with β0 = 0, β1 = 1, σ_m²(A_i) = 1.
4. Then, generate 200 random locations in the unit square.
5. Once the locations (s_1, ..., s_200) are obtained, generate V = (V(s_1), ..., V(s_200)) using relation (4) with σ_v² = 1.
6. Finally, generate the validation data using relation (3), with σ_y² = 1.
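A sketch of steps 1-6 in Python; this is our reconstruction of the recipe rather than the authors' code, and the random seed, the 30 × 30 gridding of the unit square and the cell-lookup convention are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# 1. unit square divided into 900 (= 30 x 30) grid cells; use their centroids
g = (np.arange(30) + 0.5) / 30
cx, cy = np.meshgrid(g, g)
cents = np.column_stack([cx.ravel(), cy.ravel()])                # 900 x 2

# 2. Vtilde from a zero-mean GP with exponential covariance, theta0 = 0.5, sigma2_Vtilde = 1
d = np.linalg.norm(cents[:, None, :] - cents[None, :, :], axis=2)
cov = 1.0 * np.exp(-d / 0.5)
vtilde = rng.multivariate_normal(np.zeros(900), cov)

# 3. gridded "interpolator output" via (2) with beta0 = 0, beta1 = 1, sigma_m^2 = 1
m = 0.0 + 1.0 * vtilde + rng.standard_normal(900)

# 4.-5. 200 random validation locations; V(s_j) via (4) with sigma_v^2 = 1
sites = rng.uniform(0, 1, size=(200, 2))
# map each site to its containing cell (row-major order matching the centroid construction)
row = np.minimum((sites[:, 1] * 30).astype(int), 29)
col = np.minimum((sites[:, 0] * 30).astype(int), 29)
cell_idx = np.ravel_multi_index((row, col), (30, 30))
v = vtilde[cell_idx] + rng.standard_normal(200)

# 6. validation data via (3) with sigma_y^2 = 1
y = v + rng.standard_normal(200)
```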
The left panel of Fig. 2 shows the simulated gridded data (M(A_i)) and the right panel shows the validation data (Y(s_j)). The median, IQR, maximum and minimum for the simulated M(A_i) are 0.21, 0.40, 0.79 and −0.28, respectively, and those for the simulated Y(s_j) are 0.15, 0.43, 0.83 and −0.42, respectively. A posterior summary of the model parameters is given in Table 1, showing that we can learn about these parameters (although, with a real map, this would not be of interest). Using the modeling in Section 3, Fig. 3 shows the 900 local uncertainties associated with the gridded map, revealing the resultant spatial variation. In fitting the model, 10,000 posterior samples of each parameter are generated and 1000 are used as burn-in. The computation time is 3 h and 50 min using a Dell OptiPlex 755 computer with Intel Core 2 Quad-1066 MHz processors.
Table 1
Posterior summary of the parameters.

Parameter    True value    Posterior mean and 95% CI
β0           0             0.41 (−1.23, 3.01)
β1           1             1.49 (0.21, 3.72)
σ_m²         1             2.76 (0.34, 6.28)
σ_y²         1             0.85 (0.14, 3.97)
σ_v²         1             1.42 (0.71, 5.32)
σ_Ṽ²         1             0.28 (0.18, 3.31)
Fig. 2. Simulated gridded data and validation data.
Fig. 3. Map of the uncertainties associated with the M(A_i) in Section 5.1.
Fig. 4. Simulated gridded data and validation data.
Table 2
Posterior summary of the parameters.

Parameter    True value    Posterior mean and 95% CI
β0           0             0.86 (−2.06, 1.98)
β1           1             1.32 (0.39, 4.18)
σ_w²         1             1.73 (0.86, 6.12)
σ_z²         1             1.79 (0.67, 6.16)
σ_v²         1             1.32 (0.58, 6.39)
σ_Ṽ²         1             0.08 (0.008, 2.16)
5.2. Simulation example with an atom at 0

Again, we offer a simulation example, now for the case of variables with a point mass at 0. The simulation proceeds as follows:

1. The unit square is uniformly divided into 900 grid cells.
2. Using the centroids of the grid cells, generate Ṽ = (Ṽ(A_1), ..., Ṽ(A_900)) from a zero mean GP with exponential covariance structure with θ0 = 0.5 and σ_Ṽ² = 1.
3. Then, the W(A_i)'s are generated using relation (6) with β0 = 0, β1 = 1, σ_w²(A_i) = 1.
4. After generating the W(A_i), obtain the gridded data M(A_i) using relation (5).
5. Then, generate 200 random locations in the unit square.
6. Once the locations (s_1, ..., s_200) are obtained, generate V = (V(s_1), ..., V(s_200)) using relation (4) with σ_v² = 1.
7. Using these simulated V(s)'s, obtain the Z(s)'s using relation (8) with σ_z² = 1.
8. Finally, generate the validation data from the Z(s)'s using relation (7).
The left panel of Fig. 4 shows the simulated gridded data and the right panel shows the corresponding validation data. The vector Y = (Y(s_1), ..., Y(s_200)) contains about 13% zeros and is skewed to the right with a maximum value of 1.97. Again, in Table 2, a posterior summary of the model parameters is encouraging. The variances obtained from the posterior samples of ϵ_m(A_i) are shown in Fig. 5. Again, 10,000 posterior samples of each parameter are generated and 1000 are used as burn-in. The computational time is now around 5 h 25 min.
Fig. 5. Map of the uncertainties associated with the M(A_i) in Section 5.2.
6. Uncertainty for a digital elevation map

The Cape Floristic Region (CFR) is located in the southwestern part of South Africa and includes Cape Town. It is home to the greatest non-tropical concentration of higher plant species in the world. Due to its ecological uniqueness, spatial patterns of species diversity and the distributions of individual species have been studied extensively (see, e.g., [7,8] and references therein). Spatially explicit explanatory models use climatological variables but also require various topographic features, like elevation and roughness of terrain, as inputs. Furthermore, these models require the values of the inputs on a regular grid spanning the study region in order to create a spatial map of the species distribution [7,8]. Uncertainty in these inputs affects prediction of species ranges.

Here, the domain of interest, D, lies in the western part of the CFR. The elevation values (in meters) and their corresponding geographical coordinates are obtained from 456 monitoring stations operating in this region. Using these data, which we view as an arbitrary set of points in the study region, D ⊂ R², we first create a DEM for this region. In order to do so, the elevation function associated with each observed location has to be extended via interpolation to a uniform grid at a desired resolution. We randomly select 400 data points from the original set of 456 to be used as a training data set for creating the DEM and leave out the remaining 56 points for the purpose of attaching uncertainty. A natural neighbor interpolation scheme, as described in Section 1, is implemented using the nnbathy program [21] (available online at http://code.google.com/p/nnc/) to construct the DEM for the region at 0.05° grid cell resolution. In Fig. 6 we show the sources of our data: the left panel (Fig. 6a) shows the NNI-based elevation map of the study region in the context of the CFR, and the right panel (Fig. 6b) shows the location and elevation of the validation sites.

We apply our methodology to obtain the uncertainty map associated with this DEM using the remaining 56 validation data points. The uncertainty map will provide useful information for ecologists with regard to a particular value of elevation extracted from the DEM. Furthermore, these uncertainties will be propagated through the species distribution models and will be reflected in the uncertainty estimates of the species distributions. Fig. 7a shows the estimated uncertainties associated with the elevation map from Fig. 6a. Fig. 7b shows the estimated uncertainties of the study region in greater detail along with the locations of the validation sites. Two main points emerge from Fig. 7a. First, as noted in Section 3, smaller uncertainties tend to be associated with lower elevations.
Fig. 6a. Digital elevation map (in meters) of CFR produced by the Natural neighbor interpolation scheme.
Fig. 6b. Location and elevation of the validation sites.
Fig. 7a. Estimated uncertainty of the elevation map produced by natural neighbor interpolation.
Second, as seen in Fig. 7b, uncertainties tend to be larger at locations farther from validation sites than at locations nearer to them.
Fig. 7b. Estimated uncertainty of the elevation map along with the validation sites.
7. Summary and future work

Many commonly used interpolators are purely algorithmic, often iterative, and not created in the form of "inputs to outputs". It is expected that there will be error in the interpolation. We have provided a general approach to attach uncertainty to such interpolation, given external validation data, using an assimilation approach. We can handle variables that are continuous, such as elevation or temperature, as well as variables that have point masses at 0, such as precipitation.

In the present work, we have only looked at a static version of the problem. An extension to accommodate, say, a collection of daily interpolations using a suitable dynamic model, borrowing strength across days, can be envisioned. Also, one can imagine having joint interpolations, e.g., temperature and precipitation for a given day, and attaching uncertainty through joint modeling. With multiple interpolations, the model comparison discussion in Section 2 would enable us to compare the associated uncertainty assignments. Moreover, if the multiple interpolations are averaged in some fashion, our approach could attach uncertainty to the resulting interpolation.

Acknowledgments

The authors thank John Silander, Andrew Latimer, Adam Wilson and Alex Beutel for useful conversations. The work of the first two authors was supported in part by NSF DEB 056320 and NSF DEB 0516198. The work of the third author was supported in part by NSF CCF-06-35000.

Appendix A.1

The posterior conditionals for σ_y² and σ_v² are:

1/σ_y² ∼ Gamma(n/2 + α_σ, β_σ + (1/2)(Y − V)′(Y − V)),
1/σ_v² ∼ Gamma(n/2 + α_σ, β_σ + (1/2)(V − Ṽ^(1))′(V − Ṽ^(1))),

where Ṽ^(1) is the vector containing the values at the n grid cells where independent point level (validation) observations are made. The posterior conditional for β = (β0, β1) is Normal(λ_β χ_β, λ_β), where

λ_β^{−1} = G_β′ Σ_m^{−1} G_β + I_2,    χ_β = G_β′ Σ_m^{−1} M,
G_β = (1, Ṽ),    Σ_m = diag(σ_m²(A_1), ..., σ_m²(A_I)).

The posterior conditional for α = (α0, α1) is Normal(λ_α χ_α, λ_α), where

λ_α^{−1} = G_α′ G_α + I_2,    χ_α = G_α′ log(σ²_m),    G_α = (1, Ṽ).

The posterior conditional for V is Normal(λ_v χ_v, λ_v), where

λ_v^{−1} = I_n/σ_y² + I_n/σ_v²,    χ_v = (1/σ_v²) Ṽ^(1) + (1/σ_y²) Y.

We partition Ṽ = (Ṽ^(1), Ṽ^(2)), where Ṽ^(1) has been defined earlier and Ṽ^(2) corresponds to the values for grid cells in which no point level (validation) observations are made, and then sample the individual elements of Ṽ using a univariate sampling scheme as follows. Let the conditional prior for Ṽ(A_i) be Normal(µ_{i|.}, σ²_{i|.}), and let the conditional likelihood contribution for Ṽ(A_i) be Normal(µ*_{i|.}, σ*²_{i|.}). If Ṽ(A_i) ∈ Ṽ^(1), we have

(σ*²_{i|.})^{−1} = β1²/σ_m²(A_i) + 1/σ_v² + α1²,
µ*_{i|.} = σ*²_{i|.} × [V(i)/σ_v² + β1(M(A_i) − β0)/σ_m²(A_i) + α1(log(σ_m²(A_i)) − α0)].

If Ṽ(A_i) ∈ Ṽ^(2), we have

(σ*²_{i|.})^{−1} = β1²/σ_m²(A_i) + α1²,
µ*_{i|.} = σ*²_{i|.} × [β1(M(A_i) − β0)/σ_m²(A_i) + α1(log(σ_m²(A_i)) − α0)].

Then the full conditional posterior for Ṽ(A_i) is Normal(λ_i χ_i, λ_i), where

λ_i^{−1} = 1/σ²_{i|.} + 1/σ*²_{i|.},    χ_i = µ_{i|.}/σ²_{i|.} + µ*_{i|.}/σ*²_{i|.}.

The full conditionals for the σ_m²(A_i) cannot be obtained in closed form, and a Metropolis–Hastings step is required to generate samples from their posterior.
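As an illustration of this last step, a minimal random-walk Metropolis–Hastings update for a single log σ_m²(A_i), assuming the Gaussian likelihood (2) and the lognormal(α0 + α1 Ṽ(A_i), c) prior; the proposal scale and the function signature are our own choices, not the authors'.

```python
import numpy as np

def update_log_sigma2_m(log_s2, m_i, vtilde_i, b0, b1, a0, a1, c=1.0,
                        prop_sd=0.5, rng=None):
    """One random-walk MH step for log(sigma_m^2(A_i)).
    Target: N(m_i; b0 + b1*vtilde_i, sigma_m^2) x lognormal(sigma_m^2; a0 + a1*vtilde_i, c)."""
    rng = np.random.default_rng() if rng is None else rng

    def log_target(ls2):
        s2 = np.exp(ls2)
        resid = m_i - b0 - b1 * vtilde_i
        loglik = -0.5 * np.log(s2) - 0.5 * resid ** 2 / s2
        # lognormal prior on s2, written in terms of ls2 = log(s2): the Jacobian of the
        # change of variable cancels the 1/s2 factor, leaving a N(a0 + a1*vtilde_i, c) term
        logprior = -0.5 * (ls2 - a0 - a1 * vtilde_i) ** 2 / c
        return loglik + logprior

    prop = log_s2 + prop_sd * rng.standard_normal()   # symmetric random-walk proposal
    if np.log(rng.uniform()) < log_target(prop) - log_target(log_s2):
        return prop                                   # accept
    return log_s2                                     # reject: keep current value
```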
A.2

In order to show that E(σ_m²(A_i)) is location invariant, we only need to show that E(σ_m²(A_i)|α0, α1) is independent of i. The rest follows from the fact that the priors of α0 and α1 do not depend on location.

E(σ_m²(A_i)|α0, α1) = E_{Ṽ(A_i)} E(σ_m²(A_i)|Ṽ(A_i), α0, α1)
                    = E_{Ṽ(A_i)}(exp(α0 + α1 Ṽ(A_i) + c/2))
                    = exp(α0 + c/2) E_{Ṽ(A_i)}(exp(α1 Ṽ(A_i)))
                    = exp(α0 + c/2) exp(α1 µ + α1² σ_Ṽ²/2).
References

[1] V.J. Berrocal, A.E. Gelfand, D.M. Holland, A spatio-temporal downscaler for outputs from numerical models, Journal of Agricultural, Biological and Environmental Statistics 15 (2010) 176–197.
[2] A. Beutel, T. Mølhave, P.K. Agarwal, Natural neighbor based grid DEM construction using a GPU, in: GIS'10: Proceedings of the 18th SIGSPATIAL International Conference on Advances in Geographic Information Systems, ACM, New York, 2010, pp. 172–181.
[3] G.P. Cressman, An operational objective analysis system, Monthly Weather Review 87 (1959) 367–374.
[4] P.J. Diggle, R.J. Gratton, Monte Carlo methods of inference for implicit statistical models (with discussion), Journal of the Royal Statistical Society, Series B 46 (1984) 193–227.
[5] M. Fuentes, A.E. Raftery, Model evaluation and spatial interpolation by Bayesian combination of observations with outputs from numerical models, Biometrics 61 (2005) 36–45.
[6] A.E. Gelfand, S.K. Ghosh, Model choice: a minimum posterior predictive loss approach, Biometrika 85 (1998) 1–11.
[7] A.E. Gelfand, J.A. Silander Jr., S. Wu, A.M. Latimer, P. Lewis, A.G. Rebelo, M. Holder, Explaining species distribution patterns through hierarchical modeling, Bayesian Analysis 1 (2005) 41–92.
[8] A.E. Gelfand, A.M. Schmidt, S. Wu, J.A. Silander Jr., A.M. Latimer, A.G. Rebelo, Modelling species diversity through species level hierarchical modeling, Journal of the Royal Statistical Society, Series C 54 (2005) 1–20.
[9] G.H. Givens, A Bayesian framework and importance sampling methods for synthesizing multiple sources of evidence and uncertainty linked by a complex mechanistic model, Ph.D. Dissertation, Department of Statistics, University of Washington, Seattle, 1993.
[10] G.H. Givens, J.P. Hughes, A method for determining uncertainty of predictions from deterministic epidemic models, in: Proceedings of the 1995 SCS Western Multiconference on Health Science, Physiology and Pharmacological Simulation Studies, Las Vegas, 1995.
[11] B.C. Hewitson, R.G. Crane, Gridded area-averaged daily precipitation via conditional interpolation, Journal of Climate 18 (2005) 41–57.
[12] R.J. Hijmans, S.E. Cameron, J.L. Parra, P.G. Jones, A. Jarvis, Very high resolution interpolated climate surfaces for global land areas, International Journal of Climatology 25 (2005) 1965–1978.
[13] D.G. Hoel, T.J. Mitchell, The simulation, fitting, and testing of a stochastic cellular proliferation model, Biometrics 27 (1971) 191–199.
[14] R.H. Jones, D.F. Nicholls, An adaptive nonlinear state space model applied to modelling epidemics, Statistica Sinica 1 (1991) 389–400.
[15] A.D. Maynard, R.L. Maynard, Ambient aerosol exposure-response as a function of particulate surface area: reinterpretation of historical data using numerical modelling, The Annals of Occupational Hygiene 46 (2002) 444–449.
[16] A. Patwardhan, M.J. Small, Bayesian methods for model uncertainty analysis with applications to future, Journal of Hydrology 364 (1992) 311–327.
[17] D. Poole, A.E. Raftery, Inference for deterministic simulation models: the Bayesian melding approach, Journal of the American Statistical Association 95 (2000) 1244–1255.
[18] A.E. Raftery, G.H. Givens, J.E. Zeh, Inference from a deterministic population dynamics model for bowhead whales (with discussion), Journal of the American Statistical Association 90 (1995) 402–430.
[19] J. Sacks, W.J. Welch, T.J. Mitchell, H.P. Wynn, Design and analysis of computer experiments (with discussion), Statistical Science 4 (1989) 409–435.
[20] S.K. Sahu, A.E. Gelfand, D.M. Holland, Fusing point and areal level space–time data with application to wet deposition, Journal of the Royal Statistical Society, Series C 59 (2010) 77–103.
[21] P. Sakov, nn: Natural neighbours interpolation library (Version 1.82) [Software], 2010. Available from http://code.google.com/p/nn-c/.
[22] R. Sibson, A brief description of natural neighbour interpolation, in: V. Barnett (Ed.), Interpreting Multivariate Data, John Wiley, Chichester, 1981, pp. 21–36.
[23] P. Thompson, Numerical Weather Analysis and Prediction, The Macmillan Company, New York, 1961.
[24] A.J. Venturato, A digital elevation model for Seaside, Oregon: procedures, data sources and analyses, NOAA Technical Memorandum OAR PMEL-129, NTIS: PB2006-101562, NOAA/Pacific Marine Environmental Laboratory, Seattle, WA, 2005.
[25] R.L. Wolpert, K.H. Reckhow, L.J. Steinberg, Bayesian hierarchical models for transport and fate modeling, in: J.S. Hodges, R. Kass, C. Gatsonis, N. Singpurwalla (Eds.), Bayesian Statistics in Science and Technology: Case Studies, Springer-Verlag, New York, 1992, pp. 241–296.
[26] H. Zhang, Inconsistent estimation and asymptotically equal interpolations in model-based geostatistics, Journal of the American Statistical Association 99 (2004) 250–261.