Spatial Statistics xxx (xxxx) xxx
Contents lists available at ScienceDirect
Spatial Statistics journal homepage: www.elsevier.com/locate/spasta
Nonparametric bootstrap approach for unconditional risk mapping under heteroscedasticity Sergio Castillo-Páez a , Rubén Fernández-Casal b , ∗ Pilar García-Soidán c , a
Dep. de Ciencias Exactas, Universidad de las Fuerzas Armadas ESPE, Ecuador Dep. de Matemáticas & Centro de Investigación CITIC, Universidade da Coruña, Spain c Dep. de Estadística e Investigación Operativa, Universidade de Vigo, Spain b
article
info
Article history: Received 14 February 2019 Accepted 23 September 2019 Available online xxxx Keywords: Heteroscedasticity Local linear regression Resampling method
a b s t r a c t The current work provides a nonparametric resampling procedure for approximating the (unconditional) probability that a spatial variable surpasses a prefixed threshold value. The existing approaches for the latter issue require assuming constant variance throughout the observation region, thus our proposal has been designed to be valid under heteroscedasticity of the spatial process. To develop the new methodology, nonparametric estimates of the variance and the semivariogram functions are computed by using bias-corrected residuals, which are then employed to derive bootstrap replicates for approximating the aforementioned risk. The performance of this mechanism is checked through numerical studies with simulated data, where a comparison with a semiparametric method is also included. In addition, the practical application of this approach is exemplified by estimating the risk of rainwater accumulation in the United States, during a specific period. © 2019 Elsevier B.V. All rights reserved.
1. Introduction The aim of this work is to provide a nonparametric procedure to construct a risk map, measuring the probability of exceeding a prefixed threshold at each location of the observation region, from ∗ Correspondence to: Facultad de Ciencias Sociales y de la Comunicación, Universidade de Vigo, CP 36005 Pontevedra, Spain. E-mail addresses:
[email protected] (S. Castillo-Páez),
[email protected] (R. Fernández-Casal),
[email protected] (P. García-Soidán). https://doi.org/10.1016/j.spasta.2019.100389 2211-6753/© 2019 Elsevier B.V. All rights reserved.
Please cite this article as: S. Castillo-Páez, R. Fernández-Casal and P. García-Soidán, Nonparametric bootstrap approach for unconditional risk mapping under heteroscedasticity. Spatial Statistics (2019) 100389, https://doi.org/10.1016/j.spasta.2019.100389.
2
S. Castillo-Páez, R. Fernández-Casal and P. García-Soidán / Spatial Statistics xxx (xxxx) xxx
the available spatial data. Approaches of this kind have important applications, especially in the environmental or meteorological fields, since they enable us to evaluate the risk of overcoming a critical value and the subsequent decision making. Thus, write {Y (x) : x ∈ D ⊂ Rd } for a spatial process and suppose that n data, Y = (Y (x1 ), . . . , Y (xn ))t , have been collected at the respective locations x1 , . . . , xn . We focus on the approximation of rc (xα ) = P (Y (xα ) > c ) ,
(1)
also referred to as long-term risk (Krzysztofowicz and Sigrest, 1997), which represents the unconditional probability (or risk) that Y (xα ) surpasses a fixed threshold c at xα ∈ D, for some c > 0. Estimation of the aforementioned risk can be useful for solving practical problems that require knowledge of the process distribution under rather general conditions (e.g. climate studies). The traditional approaches for risk assessment have been designed to estimate the conditional risk, P ( Y (xα ) > c | Y). Examples of this methodology include the indicator kriging (Goovaerts, 1997), the Markov chain modeling (Li et al., 2010) or more recent techniques, as those based on the compositional data analysis (Tolosana-Delgado et al., 2008). These methods can be affected by misspecification of the selected parametric model and, therefore, alternative nonparametric tools could be applied, as the kernel-based one described in García-Soidán and Menezes (2017). Anyway, the previous approaches should not be applied to estimate (1), since they are conditional on data and, therefore, their implementation is based on the random variable Y (xα )|Y, which has a different distribution than that of the target variable Y (xα ). In particular, the former variable is expected to have smaller variance at those sites xα close to the sampling locations. Then, specific procedures to deal with the unconditional risk are preferable, such as the nonparametric method introduced in Fernández-Casal et al. (2018). Some of the aforementioned proposals are applicable to spatial processes that can deviate from the stationary condition for a non-constant trend, although all of them require homoscedasticity. Bearing this in mind, our research intends to provide an estimator of the unconditional risk rc (xα ), which is valid for heteroscedastic random processes with non-constant trend. With this aim, a bootstrap approach is designed to account for the different variability that the underlying process can present along the observation domain. It is worth noting that the main procedures proposed in the statistics literature for its application to heteroscedastic data do not bear in mind the bias that the direct use of the residuals induces on the estimation (see e.g. Robinson and Thawornkaiwong, 2012; Nguyen and Veraart, 2017). However, a specific strategy to correct this effect was introduced in Fernández-Casal et al. (2017), which enables the adequate characterization of the dependence structure under heteroscedasticity. This methodology consists of first approximating the bias and then employing it to adjust the residuals for deriving corrected estimates of the variance and the semivariogram functions. The current work goes a step further, so we start by estimating the exact bias, instead of an approximation to it, so as to form the corrected residuals that are used to provide estimates of the variance and semivariogram. This way of proceeding reproduces the variability of the process in a more accurate way and it will be the basis to derive the new resampling scheme. Different bootstrap methods can be applied to homoscedastic spatial data, such as the block bootstrap (Lahiri, 2003) or the semiparametric bootstrap (Iranpanah et al., 2011). The former approach has been designed for stationary processes and it is highly dependent on the configuration of the spatial domain, whereas the second one gives rise to biased semivariogram estimates, under non-constant trend, and it is also affected by the misspecification problem, inherent to the selection of the parametric models that are needed to approximate the trend and/or the semivariogram. A better performance is exhibited by the resampling method suggested in Fernández-Casal et al. (2018) for homoscedastic processes. Thus, our proposal is the result of combining the latter bootstrap tool, modified to appropriately reproduce the variability of the data, with a bias-corrected approach that adjusts the underlying dependence of the heteroscedastic random process. The structure of this work is as follows. Section 2 introduces nonparametric approaches for estimating the main characteristics of a heteroscedastic process. The bootstrap procedure designed to approximate the unconditional risk is described in Section 3. Section 4 presents the results of the numerical studies developed with simulated and real data to illustrate the performance of the proposed resampling approach. The main conclusions are summarized in Section 5. Please cite this article as: S. Castillo-Páez, R. Fernández-Casal and P. García-Soidán, Nonparametric bootstrap approach for unconditional risk mapping under heteroscedasticity. Spatial Statistics (2019) 100389, https://doi.org/10.1016/j.spasta.2019.100389.
S. Castillo-Páez, R. Fernández-Casal and P. García-Soidán / Spatial Statistics xxx (xxxx) xxx
3
2. Nonparametric modeling under heteroscedasticity Let {Y (x) : x ∈ D ⊂ Rd } be a heteroscedastic spatial process that can be modeled as Y (x) = µ(x) + σ (x)ε (x),
(2)
where µ and σ denote the deterministic trend and variance functions, respectively, and ε is a second-order stationary process, with zero mean, unit variance and covariogram (or correlogram) ρ (u) = Cov (ε (x) , ε (x + u)). Then, γ (u) = 1 − ρ (u) is the semivariogram of ε . The specification of the small-scale variability of the process Y requires the estimation of the variance and the correlogram of the error process ε , since 2
Cov (Y (x) , Y (x + u)) = σ (x)σ (x + u)ρ (u) = σ (x)σ (x + u)(1 − γ (u)). Consequently,
Σ = DRD,
(3)
where Σ and R denote the covariance matrices of Y and ε = (ε (x1 ), . . . , ε (xn )) respectively, and D = diag(σ (x1 ), . . . , σ (xn )). The (local) variogram of the heteroscedastic spatial process is given by t
2γY (x, x + u) = (σ (x) − σ (x + u))2 + 2σ (x)σ (x + u)γ (u). Under the general spatial model (2), we will extend the bootstrap procedure developed in Fernández-Casal et al. (2018) for homoscedastic data, in order to derive a resampling approach under heteroscedasticity that will be employed for approximation of the unconditional risk (1). This way of proceeding requires characterizing the bootstrap counterpart of process Y and, therefore, it calls for estimates of the main functions µ, σ 2 and γ (or ρ ). This issue will be addressed through the local linear approach, combined with a bias-corrected algorithm for the second-order structure. Starting with the spatial trend µ, the application of the local linear regression (Fan and Gijbels, 1996) conveys the linear smoothing of (xi , Z (xi )), which provides the following estimator
)−1 t ( Xx Wx,H Y = stx,H Y, µ ˆ H (x) = et1 Xtx Wx,H Xx
(4)
where e1 = (1, 0, . . . , 0)t ∈ Rd+1 , the ith row of matrix Xx equals 1, (xi − x) , Wx,H = diag {KH (x1 − x), . . . , KH (xn − x)}, KH (u) = |H|−1 K (H−1 u), K denotes a d-dimensional kernel function and H is a d × d non-singular symmetric matrix that determines the shape and size of the neighborhood used in the estimation, which is referred to as bandwidth matrix. The natural procedure to obtain estimators of the variance and the semivariogram, respectively, consists of first removing the trend and then approximating the target functions from the residuals r = (r1 , . . . , rn )t = Y − SH Y, where SH is the smoother matrix, whose ith row is stx ,H . For the former i aim, we can apply again the local linear approach ( ) (Fan and Yao, 1998), so the variance estimator 2 would result from the linear smoothing of xi , ri , leading to
(
σˆ H22 (x) = stx,H2 r2 ,
) t
(5)
where r = ,..., and H2 is a d × d non-singular symmetric bandwidth matrix. In addition, write εˆ = (εˆ (x1 ), . . . , εˆ (xn ))t for the standardized residuals, with εˆ (xi ) = ri /σˆ H2 (xi ), for i = 1, .(. . , n. A pilot semivariogram ) for the error process could be achieved by the linear smoothing of ∥xi − xj ∥, (εˆ (xi ) − εˆ (xj ))2 /2 that, under isotropy, would yield 2
(r12
rn2 )t
( )2 wij (u) εˆ (xi ) − εˆ (xj ) , ∑n−1 ∑n 2 i=1 j=i+1 wij (u)
∑n−1 ∑n γˆg (u) =
i=1
j=i+1
(6)
with
wij (u) = L
) ( xi − xj − u g
×
Please cite this article as: S. Castillo-Páez, R. Fernández-Casal and P. García-Soidán, Nonparametric bootstrap approach for unconditional risk mapping under heteroscedasticity. Spatial Statistics (2019) 100389, https://doi.org/10.1016/j.spasta.2019.100389.
4
S. Castillo-Páez, R. Fernández-Casal and P. García-Soidán / Spatial Statistics xxx (xxxx) xxx
∑ ( ∥xk − xl ∥ − u ) L
g
k
) ( (∥xk − xl ∥ − u) ∥xk − xl ∥ − xi − xj ,
for a symmetric univariate kernel L and a bandwidth parameter g. Since the error variance process (or the semivariogram sill) is assumed to equal one, γˆg could be replaced by the rescaled version γˆg /sill(γˆg ), when sill(γˆg ) ̸= 1. As described above, we can use the direct residuals to characterize the dependence structure. However, this way of proceeding tends to underestimate the small-scale variability of the process, affecting both the variance (see e.g. Ruppert et al., 1997, for independent data) and the semivariogram estimates (see e.g. Cressie, 1993, Section 3.4.3, for homoscedastic spatial processes). Indeed,
Σ r = Var(r) = Σ + SH Σ StH − Σ StH − SH Σ ,
(7)
Equivalently, the covariance matrix of the unobserved standardized residuals ε˜ = (ε˜ (x1 ), . . . , ε˜ (xn )) = D−1 r satisfies that
Σ ε˜ = Var(ε˜ ) = R + B,
(8)
on account of (3) and (7), with B = D−1 SH Σ StH − Σ StH − SH Σ D−1 .
(
)
(9)
Relations (7) and (8) yield that
(
√
Var ri / 1 + bii
)
= σ 2 (xi ), ( ) ( ) Var ε˜ (xi ) − ε˜ (xj ) = Var ε (xi ) − ε (xj ) + bii + bjj − 2bij ,
where bij is the (i, j)-th element of B. These results make it advisable to design an alternative mechanism for estimation of the variance σ 2 and the semivariogram γ , as the one proposed next. The new methodology is similar to that introduced in Fernández-Casal et al. (2017), although the ‘‘exact’’ bias matrix (9) is considered instead of an approximation to it. It demands previously computing the trend estimates through (4) and forming the residuals r. Then, prior estimates of the variance and the semivariogram can be obtained ˆ = diag(σˆ H (x1 ), . . . , σˆ H (xn )) and from (5) and (6), respectively, providing σˆ H2 and γˆg . Then, take D 2 2 2
ˆ = (rˆij ), with rˆij = 1 − γˆg (∥xi − xj ∥). R Now we deal with the selection of the bandwidths required for implementation of the aforementioned local linear estimators. For the bandwidth matrix H, needed to approximate the trend through (4), the CGCV criterion (Francisco-Fernández and Opsomer, 2005) can be applied, so as to minimize n−1
n ∑
⎛
Y (xi ) − µ ˆ H (xi )
⎝
ˆ 1 − n−1 tr SH R
i=1
(
⎞2
)⎠ ,
ˆ denoting an estimate of the correlation matrix R and tr (A) being the trace of matrix A. The with R same approach can be used for the choice of matrix H2 in (5), but adapted to the variance setting, which conveys minimizing n−1
n ∑
⎛ ⎝
i=1
ri2 − σˆ H2 (xi )
(2
ˆ r2 1 − n−1 tr SH2 R
⎞2 )⎠ ,
ˆ r2 of the correlation matrix Rr2 of r2 . The assumptions of Gaussianity and zero for some estimate R mean for the residuals simplify the approximation of the latter matrix, since Lemma 1 of Ruppert et al. (1997) yields that Σ r2 = Var(r2 ) = 2Σ r ⊙ Σ r , Please cite this article as: S. Castillo-Páez, R. Fernández-Casal and P. García-Soidán, Nonparametric bootstrap approach for unconditional risk mapping under heteroscedasticity. Spatial Statistics (2019) 100389, https://doi.org/10.1016/j.spasta.2019.100389.
S. Castillo-Páez, R. Fernández-Casal and P. García-Soidán / Spatial Statistics xxx (xxxx) xxx
5
where ⊙ represents the Hadamard product. Finally, the bandwidth parameter g required for the pilot estimator (6) can be selected as the minimizer of the cross-validation relative squared error of the residual semivariances n−1 n ∑ ∑ i=1 j=i+1
)2 )2 εˆ (xi ) − εˆ (xj ) ( ) −1 , 2γˆ−(i,j)g ∥xi − xj ∥
( (
where γˆ−(i,j)g is the result of computing γˆg , as given in (6), when excluding the pair (i, j). The specific steps of our proposal (referred to as Algorithm 1) are summarized below:
(
)
ˆ = Dˆ Rˆ Dˆ and Bˆ = Dˆ −1 SH Σ ˆ St − Σ ˆ St − SH Σ ˆ Dˆ −1 . 1. Derive the estimates Σ H H 2 2. Compute an updated estimate σˆ H2 by substituting ri2 /(1 + bˆ ii ) for ri2 in (5), where bˆ ij is the
ˆ Take Dˆ = diag(σˆ H (x1 ), . . . , σˆ H (xn )). (i, j)-th element of B. 2 2 −1 ˆ 3. Form εˆ = D r and derive a new error semivariogram estimate γˆg by replacing (εˆ (xi ) − εˆ (xj ))2 for (εˆ (xi ) − εˆ (xj ))2 − bˆ ii − bˆ jj + 2bˆ ij in (6). ˆ = (rˆij ) of the correlation matrix R through rˆij = 1 − γˆg (∥xi − xj ∥). 4. Obtain an updated estimate R 5. Repeat steps 1–4 until convergence.
The convergence of this method does not require the development of a large number of iterations (less than five are usually enough). In this respect, we should remark that the first step of Algorithm ˆ which is derived from an approximation of the data 1 provides an estimator of the bias matrix B, ˆ through covariance matrix Σ . Another option for the latter could be to obtain an initial estimate Σ the bias-corrected approach introduced in Fernández-Casal and Francisco-Fernández (2014) for homoscedastic processes. This way of proceeding would speed up the convergence of the resulting algorithm, so that a single iteration could be sufficient. The performance of this alternative method is exemplified in Section 4.2. 3. Nonparametric resampling approach for risk estimation In this section, we introduce a nonparametric bootstrap procedure that can ( be applied) to map the unconditional risk rc (1). This issue demands estimating the probability P Y (xαk ) > c at a set of selected locations xαk within the observation domain D, for some c > 0 and k = 1, . . . , m, with m ∈ N. With this aim, bootstrap samples (Y ∗ (xα1 ), . . . , Y ∗ (xαm )) are generated through a resampling approach, specifically designed for a heteroscedastic random process Y , modeled as given in (2). Proceeding in this way, the unconditional risk at each estimation site xαk is approximated as the proportion of values Y ∗ (xαk ) exceeding the threshold c. To derive the new resampling scheme, the bootstrap approach proposed in Fernández-Casal et al. (2018) is modified and extended. Indeed, the latter procedure was designed for its application to homoscedastic processes and it is based on generating replicates of the original data at the sampling locations, which are then interpolated through the kriging methodology to construct the risk map. This mechanism is efficient from a computational point of view, although the variability of the kriging predictions may be significantly smaller than that of the underlying process. Instead, the new methodology aims to directly estimate the risk function at the target sites xαk , so it covers a wider range of spatial settings, including the heteroscedastic processes. The implementation of our bootstrap approach requires prior approximations of the trend, the variance and the semivariogram functions. We suggest addressing this task by employing nonparametric techniques to avoid the model misspecification problem inherent to the use of parametric methods. Thus, we could start by deriving initial estimates from (4)–(6), which are respectively denoted as µ ˆ H , σˆ 02,H2 and γˆ0,g . Write r = (r1 , . . . , rn )t and εˆ 0 = (εˆ 0 (x1 ), . . . , εˆ 0 (xn )) for the resulting residuals and standardized residuals, respectively, with εˆ 0 (xi ) = ri /σˆ 0,H2 (xi ), for i = 1, . . . , n. Then, the bias effect induced by the residuals on the estimation of the dependence structure can be corrected through the application of Algorithm 1. As a result, new approximations of the variance and the semivariogram are obtained, represented by σˆ H2 and γˆg , respectively. 2
Please cite this article as: S. Castillo-Páez, R. Fernández-Casal and P. García-Soidán, Nonparametric bootstrap approach for unconditional risk mapping under heteroscedasticity. Spatial Statistics (2019) 100389, https://doi.org/10.1016/j.spasta.2019.100389.
6
S. Castillo-Páez, R. Fernández-Casal and P. García-Soidán / Spatial Statistics xxx (xxxx) xxx
In a second stage, estimates of the correlation matrix R0 of the standardized residuals εˆ 0 and the correlation matrix Rα of the target locations xαk are respectively derived from the pilot semivariograms γˆ0,g and γˆg . With this aim, bear in mind that the estimated semivariogram functions may not satisfy the conditionally negative definiteness property, fulfilled by the theoretical semivariogram γ , so neither of them can be directly used to develop a bootstrap procedure (or a kriging approach). We propose fitting a Shapiro–Botha model (Shapiro and Botha, 1991) to each of the nonparametric estimates γˆ0,g and γˆg , so as to obtain the respective valid semivariograms γ¯0,g ¯ 0 = (r¯0ij ) and R¯ α = (r¯αkl ), respectively, with and γ¯g . Then, estimates of R0 and Rα are provided by R r¯0ij = 1 − γ¯0,g (∥xi − xj ∥) and r¯α kl = 1 − γ¯g (∥xαk − xαl ∥), for i, j = 1, . . . , n and k, l = 1, . . . , m. ¯ 0 and R¯ α , together with µ The resulting matrices, R ˆ H and σˆ 2H2 , are used to generate the desired replicates at the estimation sites. Our proposal (called Algorithm 2) is described below:
¯ 0 and R¯ α , so that R¯ 0 = L0 Lt and R¯ α = Lα Ltα . 1. Derive the Cholesky factorizations of R 0 1 ˆ 0 and 2. Compute the uncorrelated residuals e = (e1 , e2 , . . . , en )t = L− 0 ε ( center them.)t 3. Draw independent replicates of size m from e, represented by e∗ = e∗1 , e∗2 , . . . , e∗m .
)t
4. Obtain the bootstrap errors ε∗ = ε ∗ (xα1 ), . . . , ε ∗ (xαm ) = Lα e∗ . 5. Compute the bootstrap sample at the estimation locations xαk through Y ∗ (xαk ) = µ ˆ H (xαk ) + σˆ H2 (xαk )ε ∗ (xαk ), for k = 1, . . . , m. 6. Repeat B times steps 3–5 to obtain the B bootstrap replicates (Y ∗b (xα1 ), . . . , Y ∗b (xαm )), for b = 1, . . . , B. Approximate rc (xαk ) as
(
rc∗ (xαk ) = B−1
∑
I{Y ∗b (xα
k
)>c }
,
(10)
b
with IA denoting the indicator function of the set A, for k = 1, . . . , m. A similar scheme to the one followed in Algorithm 2 can be applied to address other inference problems, regarding any characteristic of the random process at prefixed target sites or properties of the overall process. 4. Numerical studies Next, we present the results of the analyses derived to illustrate the performance of the bootstrap procedure introduced in Section 3 for risk estimation. Firstly, the simulation studies with Gaussian and non-Gaussian data are described in Section 4.1, where the unconditional risk is approximated through the nonparametric proposal and a semiparametric alternative, based on the normal distribution. In addition, a practical application of the proposed approach for mapping the risk of rainwater accumulation is included in Section 4.2. The new methodology was implemented with the R software, by using the multivariate local linear approach (for approximating the trend and the variance functions) and the semivariogram estimation tools provided by the npsp package (Fernández-Casal, 2016), available at the network CRAN (https://cran.r-project.org/). 4.1. Numerical studies with simulated data The numerical studies described in this section were carried out with 1,000 samples of different sizes n, drawn from random processes Y , modeled as given in (2) on a regular grid within the unit square D = [0, 1] × [0, 1], with the following trend and variance functions
µ(x1 , x2 ) = 2.5 + sin(2π x1 ) + 4(x2 − 0.5)2 , ( )2 ( )2 ( )2 15 σ 2 (x1 , x2 ) = 1 − (2x1 − 1)2 1 − (2x2 − 1)2 + 0.1. 16
To start, the error process ε was assumed to be (Gaussian (with )) unit variance and isotropic exponential semivariogram γa1 ,a2 (u) = a1 + (1 − a1 ) 1 − exp − 3u , where a1 represents the a 2
Please cite this article as: S. Castillo-Páez, R. Fernández-Casal and P. García-Soidán, Nonparametric bootstrap approach for unconditional risk mapping under heteroscedasticity. Spatial Statistics (2019) 100389, https://doi.org/10.1016/j.spasta.2019.100389.
S. Castillo-Páez, R. Fernández-Casal and P. García-Soidán / Spatial Statistics xxx (xxxx) xxx
7
Fig. 1. (a) Theoretical variance and (b) averaged variance estimates obtained for the Gaussian data, with n = 20 × 20, a1 = 0.2 and a2 = 0.6.
nugget effect, 1 − a1 is the partial sill and a2 denotes the practical range. For data generation, the simulation parameters were selected as n = 10 × 10, 15 × 15, 20 × 20, a1 = 0, 0.2, 0.4, 0.8 and a2 = 0.3, 0.6, 0.9. The first study focused on illustrating the performance of the procedure proposed in Section 2 for approximating the variance and the semivariogram functions under heteroscedasticity. Thus, for each Gaussian data set, Algorithm 1 was implemented to derive nonparametric estimates of both functions at the selected grid of locations. The different kernel estimators were computed by using the multiplicative triweight kernel. The resulting values were averaged and compared with their theoretical counterparts. Figs. 1(a) and 1(b) plot the theoretical variance and the averaged variance estimates, respectively, obtained for samples of size n = 20 × 20, drawn from an exponential semivariogram with parameters a1 = 0.2 and a2 = 0.6. Fig. 2 depicts the theoretical error semivariogram for the same spatial configuration, as well as its averaged estimate. The similarity between the underlying functions and their respective estimated averages shows the good development of our proposal in this example, which also holds for the other Gaussian scenarios. The second study was addressed to analyze the behavior of the new bootstrap approach designed to estimate the unconditional risk (1) for heteroscedastic Gaussian(data through ) Algorithm 2, with B = 1000. Thus, for each sample, the probability rc (xαk ) = P Y (xαk ) > c was approximated by (10), for c = 2, 2.5, 3, 3.5, 4. Then, the values obtained at each location and threshold were averaged. To assess the effect of the number of target sites xαk selected, m was taken to equal 10 × 10 and 50 × 50. A further step of this study was to compare the performance of the nonparametric methodology (NP) for estimation of the unconditional risk with a semiparametric procedure (SP), designed with the same aim as follows. We assumed normally distributed errors, as well as considered the nonparametric estimates derived in the first analysis for µ and σ 2 , to avoid the effect that the use of different approximations for the (trend and standard deviation might have ) on the results. Then, the target probability rc (xαk ) = P Y (xαk ) > c was estimated through the semiparametric alternative by r˜c (xαk ) = 1 − Φ
(
xαk − µ ˆ H (xαk )
σˆ H2 (xαk )
)
,
where Φ denotes the standard normal distribution. Please cite this article as: S. Castillo-Páez, R. Fernández-Casal and P. García-Soidán, Nonparametric bootstrap approach for unconditional risk mapping under heteroscedasticity. Spatial Statistics (2019) 100389, https://doi.org/10.1016/j.spasta.2019.100389.
8
S. Castillo-Páez, R. Fernández-Casal and P. García-Soidán / Spatial Statistics xxx (xxxx) xxx
Fig. 2. Theoretical error semivariogram (solid line) and average of the error semivariogram estimates (dotted line) obtained for the Gaussian data, with n = 20 × 20, a1 = 0.2 and a2 = 0.6.
Fig. 3. Maps of (a) the theoretical and averages of the (b) nonparametrically and (c) semiparametrically estimated probabilities of exceeding c = 3, obtained for the Gaussian data, with n = 20 × 20, a1 = 0.2, a2 = 0.6 and m = 50 × 50.
The results derived with the NP and the SP approaches in the Gaussian scenarios were almost identical. For instance, Fig. 3 displays the theoretical probability and the average of the estimates derived with the two procedures for c = 3, giving an account of the proximity among the three representations. Similar patterns were observed with the remaining simulation parameters. To quantify the discrepancies between the ( theoretical and )2 each( of the estimated )2 probability functions, we computed the squared errors rc (xαk ) − rc∗ (xαk ) and rc (xαk ) − r˜c (xαk ) for the NP and the SP procedures, respectively, at the selected locations xαk and thresholds c. Table 1 reports the statistics of the resulting squared errors for m = 50 × 50 a1 = 0.2, a2 = 0.6 and different sample sizes, yielding small values in general, especially for the median. As expected for a consistent methodology, the squared error statistics tend to decrease, when n augments. Table 1 gives account of the close behavior of the two risk estimators, despite the advantage given to the SP procedure that assumes the theoretical distribution model. The same conclusion Please cite this article as: S. Castillo-Páez, R. Fernández-Casal and P. García-Soidán, Nonparametric bootstrap approach for unconditional risk mapping under heteroscedasticity. Spatial Statistics (2019) 100389, https://doi.org/10.1016/j.spasta.2019.100389.
S. Castillo-Páez, R. Fernández-Casal and P. García-Soidán / Spatial Statistics xxx (xxxx) xxx
9
Table 1 Summary of the squared errors (×10−2 ) of the NP and SP estimates obtained for the Gaussian data, with a1 = 0.2, a2 = 0.6 and m = 50 × 50. n = 10 × 10
n = 15 × 15
n = 20 × 20
c
Method
Mean
Median
Sd
Mean
Median
Sd
Mean
Median
Sd
2.0
NP SP
1.84 1.85
0.08 0.07
4.41 4.43
1.77 1.77
0.07 0.07
4.16 4.17
1.71 1.71
0.07 0.07
4.04 4.04
2.5
NP SP
2.75 2.76
0.47 0.46
5.58 5.64
2.72 2.72
0.45 0.43
5.49 5.52
2.63 2.63
0.44 0.42
5.35 5.36
3.0
NP SP
2.59 2.58
0.55 0.52
5.44 5.50
2.53 2.51
0.53 0.50
5.29 5.32
2.46 2.45
0.52 0.49
5.27 5.29
3.5
NP SP
2.08 2.09
0.19 0.18
4.74 4.80
2.06 2.06
0.18 0.17
4.67 4.69
2.00 1.99
0.18 0.16
4.60 4.61
4.0
NP SP
1.44 1.45
0.02 0.02
4.16 4.20
1.45 1.46
0.02 0.02
4.22 4.25
1.41 1.41
0.02 0.02
4.13 4.15
Table 2 Means of the squared errors averages (×10−2 ) of the NP and SP estimates obtained for the Gaussian data, with n = 20 × 20, a1 = 0.2 and m = 10 × 10. c
Method
a2 = 0.3
a2 = 0.6
a2 = 0.9
2.0
NP SP
0.99 0.99
1.51 1.51
1.86 1.86
2.5
NP SP
2.07 2.05
2.87 2.87
3.39 3.40
3.0
NP SP
2.23 2.19
3.16 3.15
3.73 3.73
3.5
NP SP
1.62 1.60
2.35 2.35
2.81 2.82
4.0
NP SP
1.21 1.19
1.66 1.66
1.93 1.93
holds when assessing the effect of the spatial dependence on the risk estimation. As an example, Table 2 summarizes the squared errors achieved with both methods, for samples of size n = 20 × 20 and m = 10 × 10 target locations, with a1 = 0.2 and several practical ranges. There is shown that an increase in the range (higher dependence) leads to larger squared errors averages for the two mechanisms. Also, if we compare the values derived for the same size n and practical range in Tables 1 and 2, a reduction in the number of sites m seems to produce an augment in the squared errors. Although we did not include the results derived when changing the dependence level in terms of the nugget value a1 (smaller nugget yields stronger dependence), analogous considerations regarding the variations in the squared errors remain valid. The third study aimed to check the performance of the NP proposal for estimating the unconditional risk, as well as that of the SP approach, when applied to non-Gaussian data. For this purpose, samples were generated from non-Gaussian random processes Y , following model (2) on D = [0, 1] × [0, 1], with the trend and variance functions used in the previous analyses. The error ε (x)2 −1 process was now taken as non-Gaussian by selecting it as ε (x) = 0 √ , where ε0 represented a 2 stationary Gaussian process, with zero mean, unit variance and covariance function that equaled the root square of the one chosen for generating the Gaussian data. By considering the same simulation parameters, this way of proceeding preserved the trend, variance and dependence structure of the Gaussian processes drawn in the previous analyses (see e.g. Adler, 1981, Section 7.1). A similar scheme was followed in the new scenario. Thus, firstly, Algorithm 1 was applied to derive nonparametric estimates of the variance and the semivariogram. Figs. 4(a) and 4(b) respectively plot the values achieved for n = 10 × 10, a1 = 0.2 and a2 = 0.6. These results show great similarities with their counterparts obtained for the Gaussian data in Figs. 1(a) and 2. Please cite this article as: S. Castillo-Páez, R. Fernández-Casal and P. García-Soidán, Nonparametric bootstrap approach for unconditional risk mapping under heteroscedasticity. Spatial Statistics (2019) 100389, https://doi.org/10.1016/j.spasta.2019.100389.
10
S. Castillo-Páez, R. Fernández-Casal and P. García-Soidán / Spatial Statistics xxx (xxxx) xxx
Fig. 4. Averages of the nonparametric (a) variance estimates and (b) semivariogram estimates obtained for the non-Gaussian data, with n = 10 × 10, a1 = 0.2 and a2 = 0.6. Table 3 Summary of the squared errors (×10−2 ) of the NP and SP estimates obtained for the non-Gaussian data, with n = 20 × 20, a1 = 0.2, a2 = 0.6 and m = 10 × 10. n = 20 × 20 c
Method
Mean
Median
Sd
2.0
NP SP
2.21 2.35
0.01 0.02
5.14 5.26
2.5
NP SP
2.66 2.81
0.22 0.27
6.22 6.34
3.0
NP SP
2.58 2.71
0.36 0.42
6.09 6.21
3.5
NP SP
2.44 2.53
0.26 0.30
5.75 5.81
4.0
NP SP
2.05 2.10
0.11 0.12
5.87 5.83
In a second stage, we proceeded to approximate the NP and the SP risk estimators, for different threshold c, on the same grids of sites. Table 3 summarizes the statistics attained for the resulting values, with c = 3, n = 20 × 20, a1 = 0.2, a2 = 0.6 and m = 10 × 10. These results reveal the better performance of the bootstrap methodology over the semiparametric estimator, which is affected by misspecification of the distribution model, thus giving rise to larger squared errors than those achieved with the nonparametric approach and also greater dispersion in almost all cases. 4.2. Application to map the risk of precipitations accumulation Now we present a practical application of the new methodology to data collected at 1053 sites of the United States, which measure the accumulated precipitations (in rainfall inches) during March 2016. Fig. 5(a) shows the registered values, obtained from http://www.ncdc.noaa.gov. This study aimed to provide unconditional risk maps of the observation domain for different thresholds, by estimating (1) as proposed in Section 3. A prior descriptive analysis depicted an asymmetric distribution of the data, which was overcome by considering their root-transformation. Please cite this article as: S. Castillo-Páez, R. Fernández-Casal and P. García-Soidán, Nonparametric bootstrap approach for unconditional risk mapping under heteroscedasticity. Spatial Statistics (2019) 100389, https://doi.org/10.1016/j.spasta.2019.100389.
S. Castillo-Páez, R. Fernández-Casal and P. García-Soidán / Spatial Statistics xxx (xxxx) xxx
11
Fig. 5. (a) Registered values and (b) final trend estimates of the precipitations accumulation data (in root-squared rainfall inches).
Fig. 6. Map of the residuals ri obtained for the precipitations accumulation data (in root-squared rainfall inches).
Then, we proceeded with the resulting sample to characterize the main functions of the underlying heteroscedastic process, namely, the trend, the variance and the semivariogram. This issue required selecting the bandwidths involved and it was addressed through global criteria. Firstly, the trend and semivariogram estimates were obtained from the approach introduced in Fernández-Casal and Francisco-Fernández (2014) for homoscedastic data. With the resulting semivariogram, we derived a ˆ 0 of the data covariance matrix, which was used in the CGCV criterion to select the prior estimate Σ bandwidths H and H2 . Then, we updated the trend estimates (4) and formed the resulting residuals, so as to obtain new approximations of the variance and the semivariogram through (5) and (6). From ˆ0 the latter functions, Algorithm 1 was performed as described in Section 2, except for the fact that Σ ˆ in step 1 of the first iteration. No more above was taken as the approximated covariance matrix Σ iterations were needed, so the initial bandwidths H = diag(11.1, 18.6) and H2 = diag(16.8, 37.9) were employed along this study. The final trend estimates are plotted in Fig. 5(b), which shows that the largest values were reached in the south of USA and, more specifically, in Louisiana and the nearby states, where there were heavy rains during the period under study that caused flooding. Fig. 6 shows the spatial distribution of the residuals ri derived from the final trend, which are the basis to derive the approximation of the variance function. The extreme values are mainly concentrated in the central southern states of USA, although with a smaller range of variation than the registered data (Fig. 5(a)). Fig. 7(a) displays the result of approximating the variance from the uncorrected residuals through (5), whereas Fig. 7(b) exhibits the final estimates of the variance obtained from the application of Algorithm 1. A similar pattern to that observed for the trend holds for the estimated variance, leading to incremented values in the central southern USA states, specially in Louisiana and Texas. However, the variability derived from the direct residuals seems to be underestimated, which cannot be due to a bandwidth effect, as the same matrix H2 was used for both approximations. Fig. 8(a) provides a histogram of the (uncorrected) standardized residuals εˆ i , used to estimate the dependence structure of the underlying error process. The resulting plot exhibits a left asymmetry, Please cite this article as: S. Castillo-Páez, R. Fernández-Casal and P. García-Soidán, Nonparametric bootstrap approach for unconditional risk mapping under heteroscedasticity. Spatial Statistics (2019) 100389, https://doi.org/10.1016/j.spasta.2019.100389.
12
S. Castillo-Páez, R. Fernández-Casal and P. García-Soidán / Spatial Statistics xxx (xxxx) xxx
Fig. 7. (a) Uncorrected and (b) corrected variance estimates of the precipitations accumulation data (in root-squared rainfall inches).
Fig. 8. (a) Histogram of the standardized residuals εˆ i and (b) Shapiro–Botha fits of the uncorrected γˆ0,g and corrected γˆg semivariogram estimates (dotted and solid line respectively) of the precipitations accumulation data (in root-squared rainfall inches).
thus assuming normality in this case would not be appropriate. To compare the error semivariogram estimates computed from the uncorrected and the corrected standardized residuals, Fig. 8(b) presents the resulting Shapiro–Botha fits. There is shown that the direct use of the residuals yields to assume a weaker dependence level for the underlying process, since the uncorrected semivariogram assigns larger values to the small distances. With the final fits obtained for the trend, variance and semivariogram, we applied the bootstrap approach introduced in Section 3 (Algorithm 2), in order to map the unconditional risk for different thresholds c. Figs. 9(a) and 9(b) depict the results achieved for c equaling 1 and 2 (1 and 4 rainfallinches), respectively. The higher risk of rainwater accumulation is assigned to the central area of USA, where the probability of exceeding the critical value c = 2 is even larger than 0.5 for the southern states. These risk maps, obtained through the new methodology that accounts for the heteroscedasticity of the underlying process, seem to be smoother than their corresponding counterparts, derived under the assumption of homoscedasticity, which are represented in Figures 4(a) and 4(b) of Fernández-Casal et al. (2018). For a practical comparison of our methodology with the one proposed for homoscedastic processes in Fernández-Casal et al. (2018), both approaches were applied to the total precipitations data at the same sampling locations and critical values c. Table 4 reports the statistics obtained Please cite this article as: S. Castillo-Páez, R. Fernández-Casal and P. García-Soidán, Nonparametric bootstrap approach for unconditional risk mapping under heteroscedasticity. Spatial Statistics (2019) 100389, https://doi.org/10.1016/j.spasta.2019.100389.
S. Castillo-Páez, R. Fernández-Casal and P. García-Soidán / Spatial Statistics xxx (xxxx) xxx
13
Fig. 9. Maps of the estimated heteroscedastic probabilities that the precipitations accumulation (in root-squared rainfall inches) is larger than (a) c = 1 and (b) c = 2. Table 4 Summary of the absolute differences between the estimated homoscedastic and heteroscedastic risks at the target sizes, for the precipitations accumulation data (in root-squared rainfall inches) and several threshold values c. c
Mean
Median
Sd
0.5 1.0 1.5 2.0 2.5
0.019 0.036 0.035 0.033 0.014
0.016 0.027 0.027 0.024 0.006
0.016 0.031 0.035 0.031 0.022
for the absolute difference between the two unconditional risks derived, showing that the largest variations are achieved for the central thresholds c. 5. Conclusions The current work provides a bootstrap procedure for estimation of the unconditional risk. It has been designed with nonparametric techniques, in order to avoid the model misspecification problem that can affect the parametric methods. Other mechanisms that have been suggested for risk estimation require the assumption of constant variance for the random process, so they are not applicable to data exhibiting heteroscedasticity. We deal with this issue to derive our proposal, since it accounts for the different variability that the underlying process can present along its domain. The new methodology is based on a resampling scheme, so that it can be easily adapted to address other inference problems. An additional advantage of our approach is to employ corrected residuals, obtained by adjusting the ‘‘exact’’ bias effect that the direct residuals induce on the estimation, thus leading to a better characterization of the dependence structure. Alternatively, a semiparametric procedure could be applied for estimating the unconditional risk, by assuming normality for the errors, which would result in speed savings. However, the performance of this method is conditional on the proximity of the theoretical distribution to the normal one. In this respect, the simulation results give an account of the accurate estimates derived with the nonparametric proposal under heteroscedasticity, regardless of the distribution model of the error process. Acknowledgments The authors would like to thank the Editor and the Reviewers for their helpful comments and suggestions. The first author’s work has been supported by the Universidad de las Fuerzas Armadas ESPE (Ecuador). The second author acknowledges financial support by MINECO, Spain grant MTM2017-82724-R and by the Xunta de Galicia, Spain (Grupos de Referencia Competitiva ED431C2016-015 and Centro Singular de Investigación de Galicia, Spain ED431G/01), all of them through Please cite this article as: S. Castillo-Páez, R. Fernández-Casal and P. García-Soidán, Nonparametric bootstrap approach for unconditional risk mapping under heteroscedasticity. Spatial Statistics (2019) 100389, https://doi.org/10.1016/j.spasta.2019.100389.
14
S. Castillo-Páez, R. Fernández-Casal and P. García-Soidán / Spatial Statistics xxx (xxxx) xxx
the European Regional Development Fund (ERDF). The third author’s work has been supported by the Spanish National Research and Development Program project [TEC2015-65353-R], by the ERDF and by the Xunta de Galicia, Spain under project GRC 2015/018 and under agreement for funding AtlantTIC (Atlantic Research Center for Information and Communication Technologies). References Adler, R.J., 1981. The Geometry of Random Fields. Wiley, New York. Cressie, N., 1993. Statistics for Spatial Data. Wiley, New York. Fan, J., Gijbels, I., 1996. Local Polynomial Modeling and its Applications. Chapman & Hall, London. Fan, J., Yao, Q., 1998. Efficient estimation of conditional variance functions in stochastic regression. Biometrika 85, 645–660. Fernández-Casal, R., 2016. Nonparametric spatial statistics. R package version 0.5-3 https://cran.r-project.org/package=np sp. Fernández-Casal, R., Castillo-Páez, S., Francisco-Fernández, M., 2018. Nonparametric geostatistical risk mapping. Stoch. Env. Res. Risk A 32 (3), 675–684. http://dx.doi.org/10.1007/s00477-017-1407-y. Fernández-Casal, R., Castillo-Páez, S., García-Soidán, P., 2017. Nonparametric estimation of the small-scale variability of heteroscedastic spatial processes. Spat. Stat. 22 (2), 358–370. http://dx.doi.org/10.1016/j.spasta.2017.04.001. Fernández-Casal, R., Francisco-Fernández, M., 2014. Nonparametric bias-corrected variogram estimation under non-constant trend. Stoch. Env. Res. Risk A 28 (5), 1247–1259. http://dx.doi.org/10.1007/s00477-013-0817-8. Francisco-Fernández, M., Opsomer, J.D., 2005. Smoothing parameter selection methods for nonparametric regression with spatially correlated errors. Canad. J. Statist. 33 (2), 279–295. http://dx.doi.org/10.1002/cjs.5550330208. García-Soidán, P., Menezes, R., 2017. Nonparametric construction of probability maps under local stationarity. Environmetrics 28 (3), e2438. http://dx.doi.org/10.1002/env.2438. Goovaerts, P., 1997. Geostatistics for Natural Resources Evaluation. Oxford University Press, Oxford. Iranpanah, N., Mohammadzadeh, M., Taylor, C., 2011. A comparison of block and semi-parametric bootstrap methods for variance estimation in spatial statistics. Comput. Statist. Data Anal. 55 (1), 578–587. http://dx.doi.org/10.1016/j.csda. 2010.05.031. Krzysztofowicz, R., Sigrest, A.A., 1997. Local climatic guidance for probabilistic quantitative precipitation forecasting. Mon. Weather Rev. 125 (3), 305–316. http://dx.doi.org/10.1175/1520-0493(1997)125<0305:LCGFPQ>2.0.CO;2. Lahiri, S.N., 2003. Resampling Methods for Dependent Data. Springer Science & Business Media, New York. Li, W., Zhang, C., Dey, D.K., Wang, S., 2010. Estimating threshold-exceeding probability maps of environmental variables with Markov chain random fields. Stoch. Env. Res. Risk A 24 (8), 1113–1126. http://dx.doi.org/10.1007/s00477-0100389-9. Nguyen, M., Veraart, A.E.D., 2017. Modelling spatial heteroskedasticity by volatility modulated moving averages. Spat. Stat. 20, 148–190. http://dx.doi.org/10.1016/j.spasta.2017.03.006. Robinson, P.M., Thawornkaiwong, S., 2012. Statistical inference on regression with spatial dependence. J. Econometrics 167 (2), 521–542. http://dx.doi.org/10.1016/j.jeconom.2011.09.033. Ruppert, D., Wand, M.P., Holst, U., Hössjer, O., 1997. Local polynomial variance-function estimation. Technometrics 39, 262–273. http://dx.doi.org/10.1080/00401706.1997.10485117. Shapiro, A., Botha, J.D., 1991. Variogram fitting with a general class of conditionally non-negative definite functions. Comput. Statist. Data Anal. 11 (1), 87–96. Tolosana-Delgado, R., Pawlowsky-Glahn, V., Egozcue, J.J., 2008. Indicator kriging without order relation violations. Math. Geosci. 40 (3), 327–347. http://dx.doi.org/10.1007/s11004-008-9146-8.
Please cite this article as: S. Castillo-Páez, R. Fernández-Casal and P. García-Soidán, Nonparametric bootstrap approach for unconditional risk mapping under heteroscedasticity. Spatial Statistics (2019) 100389, https://doi.org/10.1016/j.spasta.2019.100389.