Spatial statistics: A quantitative geographer’s perspective

Spatial statistics: A quantitative geographer’s perspective

Spatial Statistics 1 (2012) 3–15 Contents lists available at SciVerse ScienceDirect Spatial Statistics journal homepage: www.elsevier.com/locate/spa...

863KB Sizes 0 Downloads 71 Views

Spatial Statistics 1 (2012) 3–15

Contents lists available at SciVerse ScienceDirect

Spatial Statistics journal homepage: www.elsevier.com/locate/spasta

Spatial statistics: A quantitative geographer’s perspective Daniel A. Griffith ∗ University of Texas at Dallas, United States

article

info

Article history: Received 6 January 2012 Accepted 22 March 2012 Available online 5 April 2012 Keywords: Auto-models Cliff and Ord Eigenvector spatial filtering Spatial econometrics Spatial interaction Spatial statistics

abstract Historically, major contributions to popularizing spatial statistics derived from the pioneering work of Cliff and Ord. One outcome was the development of spatial econometrics. With the passing of time, this body of work merged with geostatistics to form the present day discipline of spatial statistics. The families of autoand semivariogram models constitute a prominent component of the subject matter of contemporary spatial statistics. Its expansion from linear to generalized linear statistical models involves new methodologies, one of which is eigenvector spatial filtering. This paper presents evidence that this particular new methodology furnishes an effective dimension reduction substitution for the spatial lag matrix appearing in spatial auto-models. It also summarizes ongoing extensions of this methodology to space-time and spatial interaction data. Eigenvector spatial filtering methodology presents a new frontier for spatial statistical research. © 2012 Elsevier B.V. All rights reserved.

1. Introduction Spatial statistics has a long and storied history, dating back to a few comments in papers by Student (1914: spatial autocorrelation influences the null distribution of the product moment correlation coefficient; also see Neprash, 1934), Yule (1926: alluding to spurious variable correlations attributable to autocorrelation), Stephan (1934, p. 165: ‘‘data of geographic units are tied together, like bunches of grapes, not separate, like balls in an urn’’), and Fisher (1935, p. 66: ‘‘After choosing the area we usually have no guidance beyond the widely verifiable fact that patches in close proximity are commonly more alike, as judged by the yield of crops, than those which are far apart’’). Grondona and Cressie (1991; also see Yates, 1938) note that Fischer, rather than modeling geographic variability, developed randomization-based experimental designs to neutralize spatial autocorrelation effects in agricultural field trials. Recognition and conceptualization of the spatial autocorrelation problem characterized this early period. The next stage focused on quantifying spatial autocorrelation, with



Tel.: +1 972 883 4950; fax: +1 972 883 4967. E-mail address: [email protected].

2211-6753/$ – see front matter © 2012 Elsevier B.V. All rights reserved. doi:10.1016/j.spasta.2012.03.005

4

D.A. Griffith / Spatial Statistics 1 (2012) 3–15

classic papers appearing by Moran (1950; establishing the Moran Coefficient (MC)) and Geary (1954; establishing the Geary Ratio). Later, Moran’s statistic proved to be the statistically more powerful of the two. Griffith (2010) recently extended its distribution theory to non-normal random variables. In their classic 1969 paper (which they expanded to their 1973 book very shortly thereafter), Cliff and Ord fully developed the distribution theory for these two statistics, particularly for linear regression residuals coupled with a normal probability model, crystallizing the era of hypothesis testing for spatial autocorrelation. This period overlapped with the spatial autocorrelation model development stage, in part initiated with critical papers by Whittle (1954) and Mead (1967), among others. In the tradition of Bartlett (e.g., see 1975), Besag (1974, 1975) penned the next classical papers, coining the terminology auto-model, and extending statistical theory to both non-normal automodel specifications and polygon data geocoded to irregular surface partitioning (i.e., non-square tessellations). Meanwhile, Cliff and Ord (1973; 1981a; also see Ord, 1975) popularized specification and estimation of the auto-normal model, causing this type of analysis to eclipse the dominant point pattern analysis research focus of the time. Much of the current literature fails to acknowledge a lineage back to this Cliff–Ord model. As an aside, the MC, which links to the Moran scatterplot, closely relates to the practice of autoregression analysis. Recently Geographical Analysis (2009) celebrated this considerable contribution that Cliff and Ord made to the spatial sciences in general, and to spatial statistics in particular. Reasons for the present oversight of Cliff and Ord’s ground-breaking work include how spatial autocorrelation analysis diffused through science. Citing Cliff and Ord (1973), Paelinck and Klaassen (1979) coined the term spatial econometrics, launching this variation of spatial statistics. Nearly a decade later, Anselin (1988) published his book, which is cited widely today as though it is the original source of this methodology in the econometrics literature. A similar citation practice is starting to emerge with the more recent book by LeSage and Pace (2009). Meanwhile, the appearance of Cressie’s (1991) comprehensive tome also has resulted in it being cited today in spatial statistics publications in lieu of Cliff and Ord’s seminal work. Although these more recent books certainly merit being cited, especially for the more recent developments they summarize, the classical work by Cliff and Ord also merits being acknowledged, when appropriate. One additional noteworthy complication in this brief history of spatial statistics concerns geostatistics. Meyers1 furnishes a concise overview that highlights the parallel development of this field. Because spatial statistics/econometrics evolved in an Anglo tradition, whereas geostatistics evolved in a French tradition, even though they both focus on spatial autocorrelation, little interaction occurred between these two disciplines for a number of decades. The previously mentioned Geary Ratio closely relates to the practice of geostatistical analysis. Furthermore, historically these two methodologies tended to address two different data types—point data by geostatistics, and polygon data by spatial autoregression. Today spatial scientists move between these two data types by, for example, generating Thiessen polygons to convert point data to polygon data, and by determining polygon centroids to convert polygon data to point data. In addition, these two methodologies initially tended to focus on different specific objectives—interpolation to unsampled points by geostatistics, and data description and inference by spatial autoregression. Since establishment of basic equivalencies between the two methodologies (e.g., Griffith and Layne, 1997), these distinctions are much less relevant, with spatial scientists choosing to model either the covariance matrix (geostatistics) or its inverse (autoregression) often by preference. Couched in this selective abridged history, the purpose of this paper is to emphasize certain spatial statistical publications, particularly the pioneering ones by Cliff and Ord, and to discuss selected recent developments that help define the research frontiers at the interface between geography and spatial statistics. 2. Selected auto-model specifications: weaknesses and solutions A conventional auto-model, as formulated by Besag, contains the response variable Y on both sides of an equation: the left-hand side has Y , and the right-hand side has some spatially lagged version of Y

1 See http://math.arizona.edu/~myers/homepage/whatis.html.

D.A. Griffith / Spatial Statistics 1 (2012) 3–15

5

that results in yi being a linear combination function of yj (i ̸= j). An n-by-n geographic weights matrix – where C usually denotes its 0–1 binary version, such that cij = 1 if locations i and j are neighbors, and 0 otherwise – posits the relationship between Y and lagged-Y : CY, where Y is an n-by-1 vector of observed response values yi . Frequently in practice, matrix C is replaced by its row-standardized version, matrix W. Initially the specification of choice was that for the auto-normal model, which in exact or approximation form can be applied to most any data. This preference arose from this model’s estimation being feasible, primarily because both mathematical statistical normal curve theory already was well developed, and the popular and widely used family of Box–Cox power transformations allowed skewed data to be appropriately adjusted in order to mimic a bell-shaped curve. Two orders of model became popular: the conditional autoregressive (CAR) model, a firstorder model, casts direct spatial dependencies as being between immediately adjacent neighbors, and second-order autoregressive models, which cast direct spatial dependencies as being between not only immediately adjacent neighbors, but also neighbors adjacent to these juxtaposed neighbors (with a discounted degree of dependency). The two popular model specifications here are the simultaneous autoregressive (SAR; which is called the spatial error model in spatial econometrics) and the autoregressive response (AR; see Upton and Fingleton, 1985; which is called the spatial lag model in spatial econometrics). The advent of widely available Markov chain Monte Carlo (MCMC) techniques supported the implementation and estimation of selected non-normal auto-models, specifically the auto-logistic/binomial (suitable for percentages data) and the auto-Poisson (suitable for counts data). In his original paper, Besag (1974) acknowledges that the auto-Poisson and the auto-negative binomial models cannot accommodate positive spatial autocorrelation, which is by far the most common nature of spatial autocorrelation encountered in empirical data. A need to describe geographically referenced counts data resulted in three different modifications of the auto-Poisson model. To circumvent this problem, Besag and his collaborators (e.g., Besag et al., 1991; Besag and Kooperberg, 1995; Besag et al., 1995) introduced a random effects term into a hierarchical Bayesian model specification. As such, spatial autocorrelation becomes a feature of model parameters, rather than correlated response variable values. A model’s intercept becomes an observation-specific surrogate for unobserved variables by expressing it as a random deviation from some global intercept. In spatial statistics, this random effects term comprises two components (Griffith and Paelinck, 2011): a spatially structured one accounting for spatial dependence (which contributes to overdispersion), and a spatially unstructured one accounting for nonspatial overdispersion. The CAR specification is the spatial dependence model of choice here, as its implementation in GeoBUGS illustrates. Kaiser and Cressie (1997) furnish an alternative formulation, namely Winsorizing the auto-Poisson model and then employing MCMC techniques to estimate its parameters. If the truncation threshold value exceeds the Poisson mean by a sufficient quantity, then differences in parameter estimates are small, and the two distributions exhibit essentially the same shape. Drawbacks to this solution include: its probabilities not summing exactly to 1, its being unable to accommodate high degrees of positive spatial autocorrelation, and selecting an appropriate threshold value. Griffith (2002a) presents a third alternative formulation based upon eigenvector spatial filtering, which employs the eigenvectors extracted from a modified version of the geographic connectivity matrix C, say (I − 11T /n)C(I − 11T /n), where 1 is an n-by-1 vector of ones, and (I − 11T /n) is the standard projection matrix commonly encountered in multivariate statistics. This matrix expression comes from the numerator of the MC, whose matrix version for response vector Y adjusted only for its mean is given by MC =

n

YT (I − 11T /n)C(I − 11T /n)Y

1T C1

YT (I − 11T /n)Y

.

(1)

Substituting the eigenvectors into this expression results in a Rayleigh quotient, with vector E1 maximizing the expression. Accordingly, these n eigenvectors can be interpreted as follows: the first eigenvector, say E1 , is the set of real numbers that has the largest MC achievable by any set for the geographic arrangement defined by the spatial connectivity matrix C;

6

D.A. Griffith / Spatial Statistics 1 (2012) 3–15

the second eigenvector is the set of real numbers that has the largest achievable MC by any set that is orthogonal and uncorrelated with E1 ; the third eigenvector is the third such set of real numbers; and so on through En , the set of real numbers that has the largest negative MC achievable by any set that is orthogonal and uncorrelated with the preceding (n − 1) eigenvectors. As such, these eigenvectors furnish n distinct map pattern descriptions of latent spatial autocorrelation in geographically distributed variables because they are mutually orthogonal and uncorrelated. An eigenvector spatial filter (ESF) is constructed from some linear combination of a subset of these eigenvectors, called the candidate set, and serves as a spatial proxy variable when added as a control variable into a model specification. This control variable identifies and isolates stochastic spatial dependencies among location-indexed observations, thus allowing model specification and estimation to proceed as if these observations are independent. The single most conspicuous weakness of eigenvector spatial filtering is that the eigenvectors need to be selected with some stepwise regression procedure, whose drawbacks include: parameter estimation bias, inconsistencies among model selection algorithms, a need to adjust for multiple hypothesis testing, and an obsession with identifying a single best model (e.g., see Derksen and Keselman, 1992). The first of these is minimized for eigenvector spatial filtering because the eigenvectors involved are mutually orthogonal and uncorrelated; but they are correlated with covariates when ESF descriptions of those covariates contain at least some of the same eigenvectors. The final criticism also is less relevant, because frequently those eigenvectors selected last account for very little of the variance, and may or may not need to be included, depending upon the behavior of other model properties. These concerns cannot be minimized by increasing sample size; rather, it exacerbates them, because as sample size increases, ESFs tend to include larger numbers of eigenvectors, in part because the standard errors asymptotically go to 0. Nevertheless, Griffith and Chun (in press) furnish evidence that ESF estimators of covariate regression coefficients are unbiased, relatively more efficient than GLS estimators, and consistent. 2.1. An empirical auto-normal-ESF comparison: 2010 Puerto Rico population density The 2010 United States decennial census furnishes population counts by municipality across Puerto Rico. Annual estimates supplement these data.2 Suppose the response variable, Y , is population density (i.e., municipality population counts divided by their corresponding areas; Fig. 1(a)), whose MC = 0.54116. Estimation of SAR model parameters in the absence of covariates yields ρˆ = 0.71140, the autoregressive parameter estimate. Both this MC and this ρˆ value indicate the presence of moderate positive spatial autocorrelation. The accompanying pseudo-R2 is 0.5170; spatial autocorrelation accounts for roughly half of the variation in population density. The model residuals mimic a normal distribution (Shapiro–Wilk statistic null hypothesis probability of 0.6560). The ESF describing this geographic distribution of population density is a linear combination of the following 12 eigenvectors, all portraying positive spatial autocorrelation: E1 (MC = 1.094), E2 (MC = 1.046), E3 (MC = 0.912), E4 (MC = 0.852), E6 (MC = 0.815), E7 (MC = 0.719), E8 (MC = 0.702), E9 (MC = 0.610), E10 (MC = 0.570), E12 (MC = 0.504), E16 (MC = 0.374), and E18 (MC = 0.283). These eigenvectors were selected with a stepwise linear regression procedure using an entry significance level of 0.10, and a removal significance level of 0.1001. The accompanying R2 is 0.6923, which is greater than its SAR counterpart, and the model residuals display only trace spatial autocorrelation (zMC = 0.19271) and mimic a normal distribution (Shapiro–Wilk statistic null hypothesis probability of 0.5562). The MC for the ESF is 0.84543 (Fig. 1(b)), representing moderateto-strong positive spatial autocorrelation. Fig. 1(a) and (b) are extremely similar; the ESF duplicates the population density map pattern remarkably well, highlighting the San Juan metropolitan region as well as the rural interior mountain region.

2 See http://www.census.gov/popest/data/intercensal/puerto_rico/pr2010.html.

D.A. Griffith / Spatial Statistics 1 (2012) 3–15

7

Fig. 1. Quantile maps associated with the 2010 United States census of population counts for Puerto Rico; the darkness of a color is directly proportional to the magnitude of its corresponding value. Top left (a): LN (population density). Top right (b): eigenvector spatial filter comprising 12 vectors. Middle left (c): eigenvector E1 . Middle right (d): eigenvector E3 . Bottom left (e): eigenvector E4 . Bottom right (e): eigenvector E12 .

Fig. 2. Statistical features of 2010 census and estimated population counts; solid black dots denote data points, and red lines denote trend lines. Left (a): the scatterplot for their log-density transformations. Right (b): the normal quantile plot for the difference between their log-density transformations. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

Eigenvectors E1 , E3 , E4 , and E12 respectively account for 17.96%, 9.33%, 11.98%, and 6.13% of the variation in Y . With MC/MCmax = 1 (where MCmax denotes the maximum MC value of vector E1 ), eigenvector E1 portrays a global east–west trend across the island (Fig. 1(c)). With MC/MCmax = 0.83380 for eigenvector E3 , and 0.77935 for eigenvector E4 , these two vectors portray regional clusters of spatial correlation (Fig. 1(d) and (e)); vector E3 essentially displays four visible east–west cluster, and vector E4 essentially displays a north–northeast and a south–southwest cluster. With MC/MCmax = 0.46080, eigenvector E12 portrays local clusters (approximately eight are visible) of spatial correlation (Fig. 1(f)). One advantage of the ESF specification is that it reveals how spatial autocorrelation impacts upon a correlation coefficient. In addition to the 2010 census data, annual population estimates are available for Puerto Rico. As expected, these two sets of data are highly positively correlated (Fig. 2(a); r = 0.99999). These pairs of counts are very similar in magnitude: Yˆ = −0.00779 + 1.00077X ,

(2)

8

D.A. Griffith / Spatial Statistics 1 (2012) 3–15

where Y = LN[(2010 census population)/area] and X = LN[(2010 estimated population)/area]; these intercept and slope parameter estimates are not significantly different from 0 and 1, respectively. The ESFs for these two measurements are very similar, and include exactly the same set of 12 eigenvectors, but with slightly different coefficients for their linear combinations. Accordingly, the correlation between the two ESFs essentially is 1. The residuals associated with these two ESFs represent the aspatial correlations between the variables. Their correlation is 0.99998. In other words, positive spatial autocorrelation latent in these two sets of independent measurements inflates the correlation between them (but by only 0.00001, because of the upper limit of 1 for r). Because neither ESF contains noncommon eigenvectors, none of the spatial autocorrelation in either measurement suppresses correlation between the two variables. Furthermore, the difference between these two log-population-densities has one conspicuous outliner, with the remaining 72 values aligning well with a normal frequency distribution (Fig. 2(b)). 2.2. An empirical auto-Poisson-ESF comparison: 2010 Puerto Rico population density The number of humans in a population is a count, and hence a Poisson random variable; because it has excess variation here, it is treated as a negative binomial random variable (i.e., a Poisson random variable with a gamma distributed mean). For this specification, LN(area) becomes an offset variable in order to analyze population density. The pseudo-likelihood estimate of the autoregressive parameter is ρˆ = 0.9644; the accompanying deviance statistic is 1.0680. The pseudo-R2 value for this specification is 0.5358. Because this is a pseudo-likelihood estimate, which is not based on the joint likelihood function (which can accommodate only negative spatial autocorrelation), its accompanying standard error is unreliable. This marginal distribution estimation substitutes for a joint distribution estimation here because of the previously noted drawbacks associated with an auto-negative binomial model. Constructing an ESF for this model specification, again using a 10% level of significance for vector selection, results in a linear combination of the vectors E1 , E2 , E3 , E4 , E6 , E7 , E8 , E10 , E12 , and E16 . This ESF is more parsimonious than its log-normal approximation counterpart because it no longer contains vectors E9 and E18 . Of note is that eigenvector E9 is the last vector selected by the stepwise linear regression procedure; eigenvector E18 is the second-to-last vector selected. The pseudo-R2 value for the generalized linear model specification is 0.7075, and its deviance statistic is 1.2073. In other words, the ESF specification furnishes a better description of the data than does the preceding autonegative binomial specification, one with a sound statistical inferential basis. 3. A new approach to specifying an ESF To date, two different approaches exist for constructing an ESF: stepwise regression that selects significant eigenvectors, and stepwise regression that minimizes the residual MC (this is the version implemented in the R package). A drawback of this first approach stems from the pre-testing problem as well as weaknesses associated with variable selection via stepwise regression. A drawback to this second approach arises from the inclusion of eigenvectors exhibiting essentially no covariation with the response variable Y . Nevertheless, Pace et al. (2011) report that eigenvector spatial filtering is becoming a popular way to address spatial dependence in spatial statistical models (e.g., see Möller and Soltwedel, 2007; Dormann et al., 2007; De Marco Jr. et al., 2008; Jacob et al., 2008; Ficetola and Padoa-Schioppa, 2009; Cuaresma and Feldkircher, forthcoming), and tends to reduce bias in parameter estimates (findings based upon the Frisch–Waugh–Lovell theorem) while maintaining simplicity. The two existing ESF construction approaches focus on the response variable. A third, alternative approach, motivated by principal components and factor analysis regression, approximates the spatial lag matrix WY with a set of eigenvectors, and may be denoted as ESFWY . This approach avoids the pretest complications concerning inference with regard to covariates of variable Y , and seeks to construct an ESF that approximates WY in a model specification. The linear combination coefficients become ET WY rather than ET Y, with resulting values close to 0 being removed to reduce the dimensionality of WY. Substituting this approximation into an AR or SAR specification should result in an insignificant estimate for the autoregressive parameter ρˆ .

D.A. Griffith / Spatial Statistics 1 (2012) 3–15

9

Fig. 3. Areal units (i.e., surface partitioning) for three of the empirical datasets. Left (a): Columbus, OH, neighborhoods (n = 49). Middle (b) High Peak pixels (n = 900). Right (c): China counties (n = 2379).

Fig. 4. Scatterplot of WY versus its corresponding ESF. Top left (a): Columbus, OH. Top left middle (b): North Carolina. Top right middle (c): Murray superfund site. Top right (d): Mercer–Hall field plots. Bottom left (e): Toronto. Bottom middle left (f): High Peak pixels. Bottom middle right (g): Wiebe field plots. Bottom right: China.

3.1. Selected empirical demonstrations Eight real world datasets, ranging in size from n = 49 to n = 2379 (Fig. 3 presents the surface partitionings for a selected subset of these landscapes; Van Lieshout’s paper (2012) is relevant to this discussion), were retrieved from published works for illustrative purposes. Standard linear (SL; estimated with ordinary least squares (OLS)), as well as SAR, SAR-ESF, and AR-ESF – all three estimated with maximum likelihood techniques – specifications were estimated for each empirical example. Comparing estimation results appearing in Table 1 between the presence of an ESFWY term and a spatial lag term WY (the SAR-ESF and AR-ESF columns) suggests that the lower dimensional approximation to the spatial lag term WY (see Fig. 4) tends to be effective in accounting for spatial autocorrelation, with the accompanying ρˆ values being nonsignificant for one, if not both, hybrid specifications. The need to use different stepwise significance levels, as well as different threshold values for identifying the candidate set of eigenvectors – both being problematic requirements – may be attributable, in part, to substantial multicollinearity between the ESFWY and WY terms (i.e., all of the corresponding R2 terms in Table 1 are sizable, with all but one being greater than 0.85). Employing a uniform significance level and threshold value for all eight cases results in either over- or underadjustment for spatial autocorrelation in all but the Toronto case. Another source of complication is the initial level of spatial autocorrelation in a dataset’s OLS residuals: as residual spatial autocorrelation increases, and hence ρˆ increases, the number of eigenvectors needed in the candidate set with which to construct an ESF tends to increase. Future research needs to address properties of this dimensionality reduction approach, especially in terms of establishing the initial candidate set of eigenvectors.

10

D.A. Griffith / Spatial Statistics 1 (2012) 3–15

Table 1 Estimation results for selected geographic datasets. Term

SL Estimate

SAR S.e.

Estimate

SAR-ESF S.e.

Estimate

AR-ESF S.e.

Estimate

S.e.

Columbus, OH crime (n = 49; approximation contains 5 vectors, R = 0.862; Anselin, 1988); Bonferroni adjustment; α = 0.10/12 2

Intercept Income

ρ Filter (pseudo-)R2 P(Wilk–Shapiro)

64.463

−2.041 ∗∗∗ ∗∗∗

4.748 0.307

−1.523

∗∗∗ ∗∗∗

∗∗∗

0.484 0.0016

56.568 0.487

6.183 0.366 0.241

∗∗∗

0.572 0.0001

15.354

−0.778 −0.212 1.072 0.696 <0.0001

9.091 0.320 0.353 0.326

12.486

−0.839 −0.286 1.175 0.705 <0.0001

7.887 0.320 0.297 0.338

NC 1979–84 nonwhite-to-total births (n = 100; approximation contains 12 vectors, R2 = 0.904; Cressie, 1991); no Bonferroni adjustment, α = 0.15 Intercept east–west coordinate

−0.012 0.119

0.039 0.013

ρ

∗∗∗ ∗∗∗

∗∗∗ ∗∗∗

Filter (pseudo-)R2 P(Wilk–Shapiro)

0.451 0.0265

0.082 0.078 0.714

∗∗∗

0.107 0.038 0.125

∗∗∗

0.694 0.0063

0.380

−0.026 0.176 1.009 0.771 0.0023

0.058 0.021 0.194 0.195

0.379

−0.021 0.171 1.013 0.770 0.0024

0.059 0.016 0.195 0.194

Murray superfund site Pb (n = 253; approximation contains 12 vectors, R2 = 0.647; Griffith, 2002b); Bonferroni adjustment; α = 0.10/64 Intercept LN(As-2.6)

4.229 0.589

0.165 0.038

ρ

∗∗∗ ∗∗∗

∗∗∗ ∗∗∗

Filter (pseudo-)R2 P(Wilk–Shapiro)

0.492 <0.0001

4.337 0.561 0.384

∗∗∗

0.186 0.039 0.107

∗∗∗

0.544 <0.0001

4.682 0.474 0.077 0.688 0.598 <0.0001

0.164 0.038 0.134 0.125

4.785 0.468 −0.044 0.784 0.597 <0.0001

0.247 0.037 0.114 0.146

2.051 0.275 0.057 0.520 0.600 0.0613

0.186 0.018 0.082 0.097

Mercer–Hall field plot yield (n = 500; approximation contains 79 vectors, R2 = 0.857; Besag, 1974); Bonferroni adjustment; α = 0.10/155 Intercept Straw

1.523 0.372

0.103 0.016

ρ

∗∗∗ ∗∗∗

∗∗∗ ∗∗∗

Filter (pseudo-)R2 P(Wilk–Shapiro)

0.533 0.0006

1.571 0.365 0.365

∗∗∗

0.117 0.018 0.067

∗∗∗

0.590 <0.0001

2.014 0.297 0.221 0.433 0.615 0.0008

0.124 0.019 0.078 0.071

Toronto 1986 population density (n = 731; approximation contains 123 vectors, R2 = 0.880; Griffith, 1997); no Bonferroni adjustment, α = 0.10 Intercept Distance to CBD

ρ Filter (pseudo-)R2 P(Wilk–Shapiro)

2.119

−0.028 ∗∗∗ ∗∗∗

0.030 0.001

−0.026

∗∗∗ ∗∗∗

∗∗∗

0.380 0.0023

2.160 0.593

0.061 0.003 0.051

∗∗∗

0.559 <0.0001

1.590 0.001 −0.062 1.182 0.679 <0.001

0.028 0.002 0.079 0.081

1.590 0.001 −0.062 1.181 0.679 <0.001

High Peak NDVI (n = 900; approximation contains 309 vectors, R2 = 0.997; Bailey and Gatrell, 1995); candidate set includes all 434 positive spatial autocorrelation eigenvectors; no Bonferroni adjustment, α = 0.10 Intercept −0.227 0.039 −0.027 0.075 0.338 0.009 0.337 Spectral band #5

0.128

0.009

ρ

∗∗∗ ∗∗∗

∗∗∗ ∗∗∗

Filter (pseudo-)R2 P(Wilk–Shapiro)

0.195 0.0005

0.082 0.986

∗∗∗ 0.958 <0.0001

0.012 0.009

∗∗∗

0.000 −0.109 1.179 0.962 <0.0001

0.002 0.130 0.130

0.001 −0.109 1.179 0.962 <0.0001

0.028 0.002 0.079 0.081

0.008 0.002 0.130 0.130

D.A. Griffith / Spatial Statistics 1 (2012) 3–15

11

Table 1 (continued) Term

SL Estimate

SAR S.e.

Estimate

SAR-ESF S.e.

Estimate

AR-ESF S.e.

Estimate

S.e.

Wiebe field plot yield (n = 1500; approximation contains 165 vectors, R2 = 0.963; Wiebe, 1935); candidate set includes all 749 positive spatial autocorrelation eigenvectors; no Bonferroni adjustment, α = 0.01 Intercept North–south coordinate

ρ Filter (pseudo-)R2 P(Wilk–Shapiro)

25.072

−0.616 ∗∗∗ ∗∗∗

0.292 0.040

−0.326

∗∗∗ ∗∗∗

∗∗∗

0.139 0.0204

23.271 0.850

0.021 0.136 0.021

∗∗∗

0.736 <0.0001

20.350 0.111 −0.025 1.183 0.778 0.0024

0.169 0.024 0.079 0.081

20.351 0.114 −0.026 1.184 0.778 0.0024

0.170 0.023 0.079 0.081

China county population density (n = 2379; approximation contains 593 vectors, R2 = 0.945; Griffith, 2006); candidate set includes all 987 positive spatial autocorrelation eigenvectors; no Bonferroni adjustment, α = 0.25 Intercept LN(births/population)

ρ Filter (pseudo-)R2 P(Kolmogorov–Smirnov)

−4.908 −1.677 ∗∗∗ ∗∗∗ 0.149 <0.01

0.285 0.082

−4.317 −1.618

∗∗∗ ∗∗∗

∗∗∗

0.860 0.773

<0.01

0.266 0.071 0.013

∗∗∗

−1.918 −0.817 0.237 0.773 0.784 <0.01

0.190 0.056 0.044 0.044

−1.354 −0.610 0.059 0.953 0.779 <0.01

0.196 0.044 0.049 0.050

3.2. An empirical comparison of ESFs: 2010 Puerto Rico population density An ESF describing the spatial lag term WY contains the following eigenvectors: E1 , E2 , E3 , E4 , E5 , E6 , E7 , E8 , E9 , E12 , and E16 . Compared with the preceding ESF constructed by regressing Y on the candidate set of eigenvectors, it has the additional eigenvector E5 , and lacks eigenvectors E10 and E18 . This ESF accounts for roughly 61% of the variation in the spatial lag term. Replacing WY with this set of eigenvectors results in nonsignificant coefficients for vectors E5 and E9 for both the log-normal approximation and the negative binomial specifications, with an R2 value of 0.6517 for this first specification, and a pseudo-R2 value of 0.7042 for this second specification. In other words, the trade-off for indirectly selecting eigenvectors impacts the log-normal approximation more than the generalized linear model specification. The back-transformed version of the previous log-normal approximation specification produces an R2 of 0.7709 (Fig. 5(a)). In contrast, the pseudo-R2 for the preceding negative binomial estimated model is 0.7075 (Fig. 5(c)). A comparison of these two predicted population densities (Fig. 5(b)) reveals that the back-transformed normal approximation results appear to have less severe deviations from the observed values than do the negative binomial results, which is consistent with the modest difference in their corresponding bivariate R2 values. Their regression trend lines in Fig. 5(a) and (c) further corroborate this finding—overall, the negative binomial predicted values fail to better align with their corresponding observed values: back-transformed normal approximation : Yˆ = −93.48198 + 1.08795Xˆ N ,

and

(3a)

negative binomial : Yˆ = −123.35571 + 1.09699Xˆ NB , (3b) ˆ where Y denotes population density, XN denotes the back-transformed normal approximation predicted population density, and Xˆ NB denotes the negative binomial predicted population density. The OLS parameter estimates in Eq. (3a) are closer to their respective ideal values of 0 and 1 than are those in Eq. (3b). 4. Space–time analyses Cliff and Ord (e.g., 1975) also undertook pioneering work in space–time analysis, such as extending the MC to this three-dimensional domain (e.g., 1981b). Because computing is so much more powerful

12

D.A. Griffith / Spatial Statistics 1 (2012) 3–15

Fig. 5. Normal approximation and negative binomial population density scatterplots; solid black dots denote data points, and red lines denote trend lines. Left (a): back-transformed normal approximation. Middle (b): superimposed back-transformed normal approximation (solid blue dots) and negative binomial (solid red dots) predicted values. Right (c): negative binomial. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

today, and because presently many more space–time datasets are available, spatial researchers increasingly are turning to space–time data analysis. To this end, Cressie and Wikle (2011) furnish a comprehensive tome about the subject. In addition, Allard and Soubeyrand (2012), Simpson et al. (2012), Ruiz-Medina (2012) and Scott and Gemmell (2012) discuss various extensions to the space–time domain, whereas Gelfand (2012) discusses this topic within the context of hierarchical modeling. Eigenvector filtering also can be extended to space–time data (e.g., Griffith, 2012). Space–time autocorrelation can be accounted for in such data with a set of eigenvectors extracted from a binary 0–1 space–time matrix, whose ones indicate which observations are tied together in space as well as in time. The following two principal space–time dependency structure specifications may be posited: (1) location (u, v, t ) links to the preceding in situ location as well as the preceding neighboring locations, a lagged specification, and (2) location (u, v, t ) links to the preceding in situ location as well as the instantaneous neighboring locations, a spatially contemporaneous specification, where (u, v) denotes geocodes, and t denotes a marker in time. Matrix versions of these space–time conceptualizations are as follows: lagged specification : [I − 11T /(nT )][CT ⊗ (Cs + Is )][I − 11T /(nT )],

and

contemporaneous specification : [I − 11 /(nT )](IT ⊗ Cs + CT ⊗ Is )[I − 11 /(nT )], T

T

(4a) (4b)

where ⊗ denotes Kronecker product, Cs is the previously mentioned n-by-n spatial connectivity matrix, CT is a standard T -by-T binary 0–1 time series matrix (i.e., all off-diagonal values are 1, and all other values are 0), and [I − 11T /(nT )] is an nT -by-nT projection matrix. These are the connectivity matrices for space–time versions of the MC (Cliff and Ord, 1981b; Griffith, 1981), and hence the matrices whose eigenvectors can be used in space–time filtering. In this context, an eigenvector space–time filter can be combined with a time invariant random effects term, linking space–time data analysis with contemporary mixed models theory. As an aside, the core space–time structure specification of Eq. (4a), namely [CT ⊗ (Cs + Is )], is the same as that presented by Cressie and Wikle (2011, pp. 386–387). Meanwhile, the core space–time structure specification of Eq. (4b), namely (IT ⊗ Cs + CT ⊗ Is ), is similar to that presented by Cressie and Wikle (2011, pp. 344–347), for the STARMA model specification. 5. Spatial interaction data A resurgence of interest is emerging about the role spatial dependency effects play in spatial interaction flows (e.g., Chun, 2008; Griffith, 2007, 2011; Fischer and Griffith, 2008). Model respecifications studied to date include: a spatial lag formulation with a log-normal probability model

D.A. Griffith / Spatial Statistics 1 (2012) 3–15

13

approximation (LeSage and Pace, 2008), and an ESF formulation with a Poisson probability model specification. In other words, these new specifications are direct extensions of model specifications currently existing in the spatial statistics and spatial econometrics literature. For flows data, spatial autocorrelation effects occur in two different ways: in the geographic distribution of the origin and destination variables involved in movement; and, in the tendency for flows from nearby origins to go to nearby destinations. Furthermore, a space–time series of flows data (e.g., Chun and Griffith, 2011) also integrates serial correlation. Accordingly, spatial statistics has considerable insight to contribute to the analysis of these types of data. 6. Concluding comments Spawning from seeds of knowledge sowed by Cliff and Ord, parts of spatial statistics matured in the discipline of geography, and today promulgate through spatial analysis tools available in geographic information systems (GISs), such as ArcGIS© . The early work by Cliff, a geographer, and Ord, a statistician, was instrumental in motivating and guiding this line of geographic research. Prior to their work, other geographers applying standard statistical techniques, which overlook latent spatial dependencies in georeferenced data, were obtaining unreliable results, and consequently were criticized by colleagues as well as members of other disciplines for inappropriately quantifying phenomena studied by the discipline. Meanwhile, the discipline’s cartographic tradition, moving first through computer cartography and eventually into GISs while being steeped in the practice of interpolation, embraced geostatistics when that subject matter finally transitioned from French-only publications. This segment of the discipline fosters a continuation of a spatial statistical tradition in geography, linked by both the concept of spatial autocorrelation and the relationship between kriging and spatial autoregressive imputation (see Griffith, 1993). Today, the newly emerging eigenvector spatial filtering methodology, initially built upon the spatial autoregressive tradition, also relates to the geostatistical tradition (Griffith and Peres-Neto, 2006; also acknowledged in Fortin et al., 2012). This methodology presents a new frontier for spatial statistical research. References Allard, S., Soubeyrand, S., 2012. Skew-normality for climatic data and dispersal models for plant epidemiology: when application fields drive spatial statistics. Spatial Statistics 1, 50–64. Anselin, L., 1988. Spatial Econometrics. Kluwer, Dordrecht. Bailey, T., Gatrell, A., 1995. Interactive Spatial Data Analysis. Longman Scientific, Harlow. Bartlett, M., 1975. The Statistical Analysis of Spatial Pattern. Chapman and Hall, London. Besag, J., 1974. Spatial interaction and the statistical analysis of lattice systems (with discussions). Journal of the Royal Statistical Society: Series B 36, 192–236. Besag, J., 1975. Statistical analysis of non-lattice data. The Statistician 24, 179–195. Besag, J., Green, P., Higdon, D., Mengersen, K., 1995. Bayesian computation and stochastic systems. Statistical Science 10, 3–41. Besag, J., Kooperberg, C., 1995. On conditional and intrinsic autoregressions. Biometrika 82, 733–746. Besag, J., York, J., Mollié, A., 1991. Bayesian image restoration, with two applications in spatial statistics, with discussion. Annals of the Institute of Statistical Mathematics 43, 1–59. Chun, Y., 2008. Modeling network autocorrelation within migration flows by eigenvector spatial filtering. Journal of Geographical Systems 10 (4), 317–344. Chun, Y., Griffith, D., 2011. Modeling network autocorrelation in space–time migration flow data: an eigenvector spatial filtering approach. Annals of the Association of American Geographers 101, 523–536. Cliff, A., Ord, J., 1969. The problem of spatial autocorrelation. In: Scott, A.J. (Ed.), London Papers in Regional Science, vol. 1. Pion, London, pp. 25–55. Cliff, A., Ord, J., 1973. Spatial Autocorrelation. Pion, London. Cliff, A., Ord, J., 1975. Space–time modeling with an application to regional forecasting. Transactions of the Institute of British Geographers 66, 119–128. Cliff, A., Ord, J., 1981a. Spatial Processes. Pion, London. Cliff, A., Ord, J., 1981b. Spatial and temporal analysis: autocorrelation in space and time. In: Wrigley, N., Bennett, R. (Eds.), Quantitative Geography: A British View. Routledge & Kegan Paul, London, pp. 104–110. Cressie, N., 1991. Statistics for Spatial Data. Wiley, NY. Cressie, N., Wikle, C., 2011. Statistics for Spatial-Temporal Data. Wiley, NY. Cuaresma, J.C., Feldkircher, M., 2010. Spatial filtering, model uncertainty and the speed of income convergence in Europe. Working Paper 160, Oesterreichische Nationalbank (forthcoming in Journal of Applied Econometrics). De Marco Jr., P., Diniz-Filho, J.A.F., Bini, L., 2008. Spatial analysis improves species distribution modelling during range expansion. Biology Letters 4, 577–580.

14

D.A. Griffith / Spatial Statistics 1 (2012) 3–15

Derksen, S., Keselman, H., 1992. Backward, forward and stepwise automated subset selection algorithms: frequency of obtaining authentic and noise variables. British Journal of Mathematical & Statistical Psychology 45, 265–282. Dormann, C.F., McPherson, J., Araújo, M., Bivand, R., Bolliger, J., Carl, G., Davies, R., Hirzel, A., Jetz, W., Kissling, W.D., Kühn, I., Ohlemüller, R., Peres-Neto, P., Reineking, B., Schröder, B., Schurr, F., Wilson, R., 2007. Methods to account for spatial autocorrelation in the analysis of species distributional data: a review. Ecography 30, 609–628. Ficetola, G.F., Padoa-Schioppa, E., 2009. Human activities alter biogeographical patterns of reptiles on Mediterranean islands. Global Ecology and Biogeography 18, 214–222. Fischer, M., Griffith, D., 2008. Modeling spatial autocorrelation in spatial interaction data: a comparison of spatial econometric and spatial filtering specifications. Journal of Regional Science 48, 969–989. Fisher, R., 1935. The Design of Experiments. Oliver and Boyd, Edinburgh. Fortin, M.-J., James, P., MacKenzie, A., Melles, S., 2012. Spatial statistics, spatial regression and spatial graph theory use in ecological studies. Spatial Statistics 1, 100–109. Geary, R., 1954. The contiguity ratio and statistical mapping. The Incorporated Statistician 5, 115–145. Gelfand, A., 2012. Hierarchical modeling for spatial data problems. Spatial Statistics 1, 30–39. Geographical Analysis, 2009. Special issue: a 40th Anniversary Celebration of A. Cliff and J. Ord, 1969, The Problem of Spatial Autocorrelation, vol. 41, pp. 343–463. Griffith, D., 1981. Interdependence in space and time: numerical and interpretative considerations. In: Griffith, D., MacKinnon, R. (Eds.), Dynamic Spatial Models. Sijthoff and Noordhoff, Alphen aan den Rijn, pp. 258–287. Griffith, D., 1993. Advanced spatial statistics for analyzing and visualizing geo-referenced data. International Journal of Geographical Information 7, 107–123. Griffith, D., 1997. Using estimated missing spatial data in obtaining single facility location-allocation solutions. l’Espace Géographique 26, 173–182. Griffith, D., 2002a. A spatial filtering specification for the auto-Poisson model. Statistics & Probability Letters 58, 245–251. Griffith, D., 2002b. The geographic distribution of soil lead concentration: description and concerns. URISA Journal 14, 5–15. Griffith, D., 2006. Hidden negative spatial autocorrelation. Journal of Geographical Systems 8, 335–355. Griffith, D., 2007. Spatial structure and spatial interaction: 25 years later. The Review of Regional Studies 37 (#1), 28–38. Griffith, D., 2010. The Moran coefficient for non-normal data. Journal of Statistical Planning and Inference 140, 2980–2990. Griffith, D., 2011. Visualizing analytical spatial autocorrelation components latent in spatial interaction data: an eigenvector spatial filter approach. Computers, Environment and Urban Systems 35, 140–149. Griffith, D., 2012. Space, time, and space–time eigenvector filter specifications that account for autocorrelation. Estadistica Española (Spanish Statistical Magazine) 54 (177), 4–27. Griffith, D., Chun, Y., 2012. Spatial autocorrelation and eigenvector spatial filtering. In: M. Fischer and P. Nijkamp (Eds.), Handbook of Regional Science, Springer, Berlin (Chapter 2) (in press). Griffith, D., Layne, L., 1997. Uncovering relationships between geo-statistical and spatial autoregressive models. In: The 1996 Proceedings on the Section on Statistics and the Environment. American Statistical Association, pp. 91–96. Griffith, D., Paelinck, J., 2011. Non-Standard Spatial Statistics and Spatial Econometrics. Springer-Verlag, Berlin. Griffith, D., Peres-Neto, P., 2006. Spatial modeling in ecology: the flexibility of eigenfunction spatial analyses. Ecology 87, 2603–2613. Grondona, M., Cressie, N., 1991. Using spatial considerations in the analysis of experiments. Technometrics 33, 381–392. Jacob, B., Muturi, E., Caamano, E., Gunter, J., Mpanga, E., Ayine, R., Okelloonen, J., Nyeko, J., Shililu, J., Githure, J., Regens, J., Novak, R., Kakoma, I., 2008. Hydological modeling of geophysical parameters of arboviral and protozoan disease vectors in internally displaced people camps in Gulu, Uganda. International Journal of Health Geographics 7, 11. Kaiser, M., Cressie, N., 1997. Modeling Poisson variables with positive spatial dependence. Statistics & Probability Letters 35, 423–432. LeSage, J., Pace, R., 2008. Spatial econometric modelling of origin–destination flows. Journal of Regional Science 48, 941–968. LeSage, J., Pace, K., 2009. Introduction to Spatial Econometrics. CRC Press, Boca Raton, FL. Mead, R., 1967. A mathematical model for the estimation of inter-plant competition. Biometrics 23, 189–205. Möller, J., Soltwedel, R., 2007. Recent development of regional labor market analysis using spatial econometrics: introduction. International Regional Science Review 30, 95–99. Moran, P., 1950. Notes on continuous stochastic phenomena. Biometrika 37, 17–23. Neprash, J., 1934. Some problems in the correlation of spatially distributed variables. Proceedings of the American Statistical Journal, New Series 29 (Suppl.), 167–168. Ord, J., 1975. Estimation methods for models of spatial interaction. Journal of the American Statistical Association 70, 120–126. Pace, K., Lesage, J., Zhu, S., 2011. Interpretation and computation of estimates from regression models using spatial filtering, Paper Presented at the Vth World Conference of the Spatial Econometrics Association, Toulouse, France, July 6–8. Paelinck, J., Klaassen, L., 1979. Spatial Econometrics. Saxon House, Farnborough, England. Ruiz-Medina, M., 2012. New challenges in spatial and spatiotemporal functional statistics for high-dimensional data. Spatial Statistics 1, 82–91. Scott, E., Gemmell, J., 2012. Spatial statistics: a watery business. Spatial Statistics 1, 121–132. Simpson, D., Lindgren, F., Rue, H., 2012. Think continuous: Markovian Gaussian models in spatial statistics. Spatial Statistics 1, 16–29. Stephan, F., 1934. Sampling errors and interpretations of social data ordered in time and space. Proceedings of the American Statistical Journal, New Series 29 (Suppl.), 165–166. Student,, 1914. The elimination of spurious correlation due to position in time or space. Biometrika 10, 179–180. Upton, G., Fingleton, B., 1985. Spatial Data Analysis by Example, vol. I. Wiley, NY. Van Lieshout, M., 2012. An introduction to planar random tessellation models. Spatial Statistics 1, 40–49. Whittle, P., 1954. On stationary processes in the plane. Biometrika 41, 434–449.

D.A. Griffith / Spatial Statistics 1 (2012) 3–15

15

Wiebe, G.A., 1935. Variation and correlation in grain yield among 1500 wheat nursery plots. Journal of Agricultural Research 50, 331–357. Yates, F., 1938. The comparative advantages of systematic and randomized arrangements in the design of agricultural and biological experiments. Biometrika 30, 444–466. Yule, U., 1926. Why do we sometimes get nonsense-correlations between time series? a study in sampling and the nature of time series. Journal of the Royal Statistical Society 89, 1–69.