Quantitative Data


M. Charlton, National University of Ireland Maynooth, Maynooth, Republic of Ireland
© 2009 Elsevier Ltd. All rights reserved.

We outline some of the characteristics of both quantitative data and spatial data. We then consider the initial exploration of such data, one variable at a time, with appropriate statistical and graphical summaries – inference is introduced in this section. This leads naturally into a consideration of multivariate techniques, including regression, data reduction, and classification methods. We discuss some of the problems which arise in the analysis of spatial data, briefly consider some methods for local analysis, and then spend some time examining methods for modeling and regionalization with spatial interaction data. An extended section deals with regression-based methods for spatial data, including spatial lag and spatial error models and geographically weighted regression. Finally, we examine some methods for spatial interpolation and conclude with a survey of software for quantitative analysis.

Quantitative Data

In many studies, quantitative data exist as a data matrix in which the rows represent the units of observation (residents of a town, counties, or streets within a city) and the columns represent measurements of some items of interest and relevance to the study in hand. The nature of the research question should already have defined the appropriate units of observation and the items to be measured. Items to be measured are sometimes known as attributes (particularly in geographic information systems (GIS)) or as variables in a statistical study. The data may have been collected by the researcher in person, perhaps with the aid of a team of interviewers, or they may have been obtained from another source, for example, data about the residents of a set of counties in a population census.

Variables will contain either discrete or continuous measurements – discrete data have a finite number of categories, whereas continuous data can vary by infinitesimally small amounts. Discrete data are sometimes referred to as categorical data. The level of measurement of each variable is important in determining which methods are appropriate in the analysis of the data which have been collected. Categorical data can be either nominal or ordinal: the categories in nominal data have no ordering (names of villages, gender, ethnic group), whereas there is an ordering to the categories in ordinal data (age cohorts, income groups, ranks). Continuous data can exist at interval or ratio level: interval data have no natural zero (zero on the centigrade scale does not imply an absence of temperature), whereas ratio-level data have a well-defined zero (age in years, pulse rate, unemployment rate, population count).

Quantitative methods also include a body of techniques which deal with measures of location in the form of spatial references. The positional information might take the form of coordinates of the object itself, if it can be represented spatially as a point, or the coordinates of some central location, for example, the center of gravity, in an area. The measurements may be in degrees of longitude and latitude, or they may be in one or more projected coordinate systems. The analyst, faced with measurements in degrees, may convert these to projected coordinates before beginning the analysis.

It is also necessary to consider data representations in GIS because there is a close connection between quantitative methods and GIS. Discrete objects in the real world can be expressed in a GIS database as points, lines, or polygons – the vector model. This representation consists of geometric information indicating the position of the object and associated attribute information which describes its characteristics. Continuous phenomena such as population density or elevation may be represented as a grid or raster, where the study area is divided into a regular lattice (usually with square cells), each cell containing the value of the phenomenon of interest. The GIS representation may be convenient for analysis: adjacency matrices, which are used in a range of quantitative techniques, are easily created from polygon data – GIS software packages often provide an extensive range of spatial analysis and modeling methods.
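To make this concrete, here is a minimal sketch in Python (using pandas) of a small data matrix whose columns sit at different levels of measurement; the ward names, values, and coordinates are hypothetical:

```python
# A minimal sketch (hypothetical data): rows are units of observation,
# columns are variables measured at different levels.
import pandas as pd

wards = pd.DataFrame({
    "ward": ["Northfield", "Riverside", "Oldtown"],    # nominal: no ordering
    "income_group": pd.Categorical(
        ["low", "high", "medium"],
        categories=["low", "medium", "high"],
        ordered=True),                                  # ordinal: ordered categories
    "mean_temp_c": [9.4, 10.1, 9.8],                    # interval: no natural zero
    "population": [5200, 4100, 6300],                   # ratio: well-defined zero
    "easting": [318400.0, 319250.0, 317900.0],          # projected coordinates
    "northing": [239800.0, 238100.0, 240500.0],
})

print(wards.dtypes)
```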

Exploration

Most analyses begin with some exploratory analysis – this helps in understanding the nature of the distributions of the variables which are to be used later in the study. For continuous data, parametric statistics such as the mean, variance, skewness, and kurtosis provide useful summaries. Nonparametric indicators, including the median, quartiles, and interquartile range, are conveniently computed. Such statistical summaries may also be accompanied by plots of the distribution shapes – histograms and boxplots are useful displays for examining the shape of a distribution (is it symmetrical, is it skewed, are there outliers?).

For projected coordinate data, summary measures exist – these are known as centrographic statistics and include the mean and median centers, as well as measures of spatial dispersion. There is also a body of statistical theory and practice which deals with directional data – the 'average' of a set of bearings is known as the directional mean, and the corresponding variance is known as the circular variance. The theory of directional statistics provides a framework for inference, tests for uniformity, and goodness of fit to theoretical distributions.

Categorical data may be explored by tabulating the frequencies in the individual categories. Visual methods include histograms – of interest are the modality of the distribution (has it one mode or several?) and the class sizes. One may also wish to determine whether the distribution follows some theoretical distribution – might it have arisen from a Poisson process? Tukey has suggested a range of exploratory methods – these include both numerical and visual summaries, some of which may be available in existing software. In the context of research in human geography, maps are convenient for summarizing the spatial pattern of both continuous and categorical data.

There are a number of methods for examining bivariate relationships: which one is used depends on the research question and the nature of the data. Pearson and Spearman correlation coefficients are useful for measuring the strength of the association between two continuous variables. A chi-squared statistic can assist in determining the relationship between two categorical variables. Exploring the relationship between a categorical and a continuous variable might involve computing the means and variances of the continuous variable in each category of the other variable. Exploratory visual analysis of bivariate relationships may employ a number of tools. The scatterplot displays the relationship between a pair of continuous variables – with more than two variables, a scatterplot matrix is sometimes helpful. Boxplots may be useful for exploring variation of a continuous variable within the categories of a nominal or ordinal variable.

Given two means computed from two samples, it is reasonable to ask whether there is sufficient evidence for the analyst to conclude that the difference between them is too large to be simply due to chance. Statistical inference deals with selecting and using the appropriate tests to address these kinds of questions. If the two means are drawn from sufficiently large samples, the analyst would use a z-test to determine whether the difference is not zero ('significant' in statistical parlance); if the sample sizes are small, a t-test would be appropriate. A nonparametric version of the test is the Mann–Whitney test. If there are several group means, analysis of variance may be used to determine whether any differences between the means are small enough to be due to chance variation. The underlying theory of such tests assumes that the observations under analysis are independent – this is not always the case with spatial data. However, in spite of this, inferential testing is widely used with spatial and nonspatial data.
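As an illustration of these summaries and tests, a minimal Python sketch (numpy and scipy) on synthetic data; the two variables are hypothetical:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.lognormal(mean=3.0, sigma=0.5, size=200)   # hypothetical skewed variable
y = 2.0 * x + rng.normal(scale=10.0, size=200)     # a related continuous variable

# Parametric summaries: mean, variance, skewness, kurtosis
print(x.mean(), x.var(ddof=1), stats.skew(x), stats.kurtosis(x))

# Nonparametric summaries: median, quartiles, interquartile range
q1, med, q3 = np.percentile(x, [25, 50, 75])
print(med, q3 - q1)

# Bivariate association: Pearson (linear) and Spearman (rank) correlation
print(stats.pearsonr(x, y))
print(stats.spearmanr(x, y))

# Two-sample t-test: is the difference between two group means due to chance?
g1, g2 = x[:100], x[100:]
print(stats.ttest_ind(g1, g2))
```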

Methods for Multivariate Data

With multivariate data, there is a wide range of tools for exploration and analysis. Regression techniques relate the variation in one variable (the y-, dependent, or response variable) to the variation in one or more independent (x-, or predictor) variables. The most commonly encountered form is ordinary least-squares (OLS) regression, in which the y variable is modeled as a linear combination of the independent variables, with an error term which is normally distributed with zero mean. If the dependent variable is one of counts (say, cases of a disease), then Poisson regression is appropriate, and if the dependent variable is binary, logistic (binary logit) regression is a useful model. Generalized linear models extend the regression framework by providing a wide range of options for modeling the error terms and the relationships among variables. A typical linear regression equation (model) with n variables might take the form

$y = \beta_0 + \beta_1 x_1 + \cdots + \beta_n x_n + \varepsilon$

where y is the dependent variable, $x_1, \ldots, x_n$ are the independent variables, $\varepsilon$ is the error term, and $\beta_0, \ldots, \beta_n$ are the parameters which are to be estimated. Estimation using OLS attempts to find the combination of parameters that minimizes the sum of the squared differences between the observed values of y and the values of y predicted from the equation. The predicted values are a linear combination of the independent variables: multiplying each variable by its coefficient and summing the resulting products yields the predicted values. An important part of regression is the interpretation of the coefficients in the model, in particular whether they are sufficiently different from zero (a variable with a zero-valued coefficient does not contribute to the variation in the dependent variable). It is also important to examine the residuals from the model (the differences between the actual and the predicted (fitted) values). The residuals should have a zero mean and should exhibit homoscedasticity (i.e., any samples taken from the residuals should have the same variance). With spatial data it is also necessary to examine the residuals for any spatial pattern – if there is significant autocorrelation, the coefficients will be unreliable because the assumptions on which the model has been estimated will have been violated.
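A minimal sketch of fitting such a model and inspecting its residuals, using the statsmodels package on synthetic data (the predictors and coefficient values are hypothetical):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
n = 100
X = rng.normal(size=(n, 2))                   # two hypothetical predictors
y = 1.5 + 2.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.8, size=n)

model = sm.OLS(y, sm.add_constant(X)).fit()   # estimate b0, b1, b2 by least squares
print(model.params)        # fitted coefficients
print(model.pvalues)       # are they significantly different from zero?

resid = model.resid        # residuals: observed y minus fitted y
print(resid.mean())        # should be (numerically) zero
```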

There are many data reduction methods available to the analyst. 'Principal-components analysis' is used to create a number of new variables, known as components, from the given data. Components are linear combinations of the original variables. While there will be as many components as original variables, typically only a subset of these is used. The component loadings may be interpreted (perhaps to create a deprivation measure), and the component scores (the values of the new variables for each observation) may be useful for subsequent analysis. Relationships between two sets of variables can be explored using pairwise correlations or canonical correlation analysis.

To group observations together into related families or clusters, there is a wide range of 'cluster analysis' techniques – these take the values for a set of variables and attempt to create clusters of cases which have similar characteristics. Agglomerative methods successively fuse cases together until a single group remains – fusion is usually stopped earlier to create the desired number of groups. Less frequently encountered are divisive techniques, which start with a single group and attempt to split it successively into its component cases. Among the nonhierarchical techniques, k-means is commonly used – the user must specify the number of desired groups in advance of the analysis, so typically a series of cluster analyses will be undertaken and compared before the most appropriate one is chosen. Mapping the results of a cluster analysis can yield useful insights. The cluster characteristics can be obtained by computing the means of the variables in each cluster – visualization techniques such as parallel coordinate plots may be helpful in examining the individual cluster structures.

Should the analyst have an existing classification of the objects under study into a number of groups, linear 'discriminant analysis' is useful for relating the classification to a set of independent variables. Linear combinations of the independent variables give rise to discriminant functions which discriminate between the groups. Examining the discriminant functions can reveal which variables are important in distinguishing between the groups. One of the outputs is a prediction of the group membership for each case, given the values of the variables for that case. This is often cross-classified with the input classification to create a confusion matrix – cases whose group membership has been correctly predicted lie on the leading diagonal. Misclassified cases warrant further investigation into the reasons for the incorrect assignment.
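A brief sketch of data reduction followed by clustering, using scikit-learn; the attribute matrix is synthetic, and the choices of two components and three clusters are purely illustrative:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))                 # hypothetical attribute matrix

Xs = StandardScaler().fit_transform(X)        # standardize before PCA

pca = PCA(n_components=2).fit(Xs)
scores = pca.transform(Xs)                    # component scores for each case
print(pca.explained_variance_ratio_)          # share of variance per component
print(pca.components_)                        # loadings: weight of each variable

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(scores)
print(km.labels_[:10])                        # cluster membership per case

# Cluster characteristics: mean of each original variable within each cluster
for k in range(3):
    print(k, Xs[km.labels_ == k].mean(axis=0))
```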

Problems with Spatial Data

Spatial data can present the analyst with some awkward problems. Secondary data, perhaps from a census, are often aggregated to a set of areal units for reporting purposes. It is common to use such data to compute means, variances, and correlations, to fit regression models, and to carry out more complex tasks without thinking too deeply about the implications of using such data. It has been known since the 1930s that aggregating data from smaller to larger areal units alters the variance structure of the data – two sociologists, Charles Gehlke and Katherine Biehl, noticed that aggregating data to larger areal units leads to larger estimates of the correlation coefficient. This phenomenon has been labeled the modifiable areal unit problem (MAUP), and there are two related issues within it. The aggregation or 'scale effect' leads to the behavior observed by Gehlke and Biehl. There can also be a large number of ways of aggregating a set of n smaller areal units into m larger areal units, where m < n – this 'zoning effect' will also influence the value of statistics derived from spatially aggregated data.

A second problem which impinges on the use of quantitative approaches in human geography is that of spatial autocorrelation – the association between data values over space. Positive spatial autocorrelation occurs when objects have values of a variable similar to those of their neighbors; negative spatial autocorrelation occurs when neighboring values are dissimilar. Spatial autocorrelation is a problem – positive autocorrelation leads to an underestimate of the standard error of the mean, negative autocorrelation leads to an overestimate, with consequences for any inference that might be made. There are several diagnostics for examining the extent to which spatial autocorrelation is present – the Getis–Ord G statistic and Moran's I are two widely used global measures of spatial autocorrelation. There is also a group of local indicators of spatial association, in particular a variant of Moran's I, which can be used to determine whether such association varies locally across the study area.
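Moran's I can be computed directly from its definition, $I = \frac{n}{S_0} \frac{\sum_i \sum_j w_{ij} z_i z_j}{\sum_i z_i^2}$, where the $z_i$ are deviations from the mean and $S_0$ is the sum of the weights. A self-contained numpy sketch, using a hypothetical four-zone contiguity structure:

```python
import numpy as np

def morans_i(y, W):
    """Global Moran's I for values y under spatial weights matrix W."""
    z = y - y.mean()                      # deviations from the mean
    s0 = W.sum()                          # sum of all weights
    return (len(y) / s0) * (z @ W @ z) / (z @ z)

# Hypothetical contiguity structure for four zones in a row: 1-2-3-4
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
W = A / A.sum(axis=1, keepdims=True)      # row-standardize: each row sums to 1

y = np.array([10.0, 12.0, 30.0, 33.0])    # similar neighbors -> positive I
print(morans_i(y, W))                      # about 0.54 here
```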

Spatial Pattern

The analysis of spatial pattern attempts to determine the underlying processes which give rise to observed patterns in points, lines, and areas. The search for evidence that some observed pattern has not arisen from a chance process leads into a range of techniques. Many of these deal with point-pattern analysis: summary measures include the mean center and standard distance; inferential techniques include quadrat and nearest-neighbor analysis. More recent activity has been concerned with modeling such processes, in particular determining whether points in some study area exhibit clustering. Kernel density estimation is used to determine the variation in the intensity of the point process across the study area, and a useful technique for characterizing any clustering of the points is the K function.
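One simple inferential check is the Clark–Evans nearest-neighbor ratio, which compares the observed mean nearest-neighbor distance with that expected under complete spatial randomness. A sketch using scipy, with a hypothetical random point pattern in a unit square:

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(42)
area = 1.0
pts = rng.random((100, 2))       # hypothetical point pattern in a unit square

# Distance from each point to its nearest neighbor
# (k=2 because each point's nearest "neighbor" at distance 0 is itself)
dist, _ = cKDTree(pts).query(pts, k=2)
d_obs = dist[:, 1].mean()

# Expected mean nearest-neighbor distance under complete spatial randomness
density = len(pts) / area
d_exp = 0.5 / np.sqrt(density)

R = d_obs / d_exp   # R < 1 suggests clustering, R > 1 suggests dispersion
print(R)            # near 1 for a uniform random pattern like this one
```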

Computing the density of a spatial pattern – perhaps the locations of residential burglaries or road accidents – usually involves GIS software. Where the spatial pattern exists within a spatially clustered base population, the spatial variation in the base population needs to be controlled for. Possible approaches here include Openshaw's 'geographical analysis machine' (GAM) – its original intent was to identify clusters of disease cases in a spatially inhomogeneous population. While GAM received some criticism, it was one of the first examples of a scan statistic (a statistic that scans an area to search for local clusters), its output was directly mappable, and it stimulated the development of further methods, notably those of Besag and Newell, and of Bithell and Stone. Interest in methods for identifying localized clustering has extended outside geography, notably into crime-pattern analysis. Many of the methods used are incorporated into the CrimeStat software.

Spatial Interaction

Spatial interaction deals with movement between locations. Movement here refers to people, objects, or information, and for any single transfer there will be an origin and a destination. The types of movement which are frequently dealt with include journey to work, migration, freight flows, and telephone traffic. One particular type of information which is often collected as part of a census of population is journey-to-work data. The form of these data may vary, but commonly the analyst is presented with the numbers of individuals moving between pairs of zones. These flows can be represented in matrix form, with the origin zones forming the rows and the destination zones forming the columns – the cell at the intersection of a particular row/column pair is the flow of workers between the corresponding pair of zones. Flows are then of two types: interzonal, where the origin and destination zones are different, and intrazonal, where the flow from residences to workplaces is wholly contained within the zone. The intrazonal flows are those along the leading diagonal of the flow matrix. Depending on the data supplier, there might be some disaggregation of the total flows – by gender, age group, mode of transport, industry type, or socioeconomic group.

It is not unreasonable to wish to model the variation in such flows – more journeys may well be made between zones which are closer together than between zones which are farther apart. Affluent residents may travel farther than less well-off ones. Trips made by public transport may be shorter on average than those made by car. The attributes of the residential zones may well influence the number and type of trips which originate from them; equally, the attributes of the destination zones may well influence the numbers of workers attracted. Early modeling efforts employed analogs from physics to create 'gravity models'; Wilson's attempts to place spatial interaction modeling in a sounder theoretical framework led to the development of 'entropy-maximizing techniques', and Fotheringham's 'competing destinations model' represents a further refinement.

Models for dealing with such data fall into a broad group known as spatial interaction models. Generally, such models attempt to predict the flows between zone pairs using information on the travel costs between them and the different characteristics of the zones as origins and destinations. Fitting these models to the observed flow data entails a process known as calibration – this is often computer intensive. Deriving the characteristics of the zones as origins and destinations requires collating a wide variety of data (e.g., estimates of zonal floor space and the money turnover of workplaces, and computation of travel costs based on road distances and types). The flows are integers, so Poisson models are commonly used to avoid the prediction of negative flows.

Flow data need not be confined to journey-to-work flows. Census enumeration often yields detailed migration flow data, and with them a series of research questions on the drivers of migration. The outcome of exercises to understand the causes of migration may inform the formulation of policies for influencing it.
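A minimal sketch of an unconstrained gravity model with a power distance-decay function; the origin totals, destination attractions, cost matrix, and decay exponent are all hypothetical, and a real analysis would calibrate the decay parameter against observed flows:

```python
import numpy as np

O = np.array([500.0, 300.0, 200.0])        # trips produced at each origin zone
D = np.array([400.0, 400.0, 200.0])        # attractiveness of each destination
C = np.array([[1.0, 3.0, 5.0],             # travel costs between zone pairs
              [3.0, 1.0, 2.0],
              [5.0, 2.0, 1.0]])

beta = 2.0                                 # distance-decay exponent
T = O[:, None] * D[None, :] * C ** -beta   # T_ij ~ O_i * D_j * c_ij^-beta

# Scale so predicted flows sum to the observed total (a simple calibration step)
T *= O.sum() / T.sum()
print(T.round(1))
```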

Regionalization

A related activity using flow data is the construction of functional regions. Most methods restrict themselves to the information in the flow matrix. The idea is to combine zones into larger areas – when commuting data are used for this purpose, these larger areas are variously known as travel-to-work areas, local labor market areas, or functional regions. There are various algorithms – generally they start by combining the two zones with the largest interaction (the flow data are usually transformed first). The two rows and two columns representing these zones are combined and the interactions recomputed for the smaller matrix. The largest interaction is again identified and the zone pair fused. This continues until all zones have been fused into the desired number of regions, as the sketch below illustrates. There are many variations on this basic approach, some of which require the analyst to supply a set of parameters to control the fusion process.

Specialized software is usually required for regionalization; the analyst may need to code the methods from descriptions in the published literature, although software implementing the 'intramax' procedure is downloadable. The output is usually a list of the original zones and the identifier of the region to which each has been allocated. Further analysis will generally involve GIS software to create the boundary files for mapping and data reorganization. The analyst should be aware of any effects of the modifiable areal unit problem when undertaking further analysis.
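The basic fusion loop translates directly into code. A toy numpy sketch in the spirit of the algorithm described above (not the intramax procedure itself – it omits the flow transformations and control parameters that production implementations use):

```python
import numpy as np

def fuse_zones(F, n_regions):
    """Greedy fusion: repeatedly merge the pair of zones with the largest
    mutual flow until only n_regions remain."""
    F = F.astype(float).copy()
    np.fill_diagonal(F, 0.0)                   # intrazonal flows play no part
    regions = [[i] for i in range(len(F))]     # each zone starts alone
    while len(regions) > n_regions:
        S = F + F.T                            # symmetric mutual interaction
        np.fill_diagonal(S, -np.inf)           # ignore the diagonal
        i, j = np.unravel_index(np.argmax(S), S.shape)  # strongest pair, i < j
        F[i, :] += F[j, :]                     # merge row j into row i
        F[:, i] += F[:, j]                     # merge column j into column i
        F = np.delete(np.delete(F, j, 0), j, 1)
        regions[i].extend(regions[j])
        del regions[j]
    return regions

F = np.array([[0, 90, 5, 2],                   # hypothetical commuting flows
              [80, 0, 4, 3],
              [6, 2, 0, 70],
              [1, 5, 60, 0]])
print(fuse_zones(F, 2))                        # -> [[0, 1], [2, 3]]
```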

Spatial Regression

OLS regression models are widely used in quantitative analyses. However, for the geographer they can be problematic because, among a range of assumptions, the residuals are required to be independent of one another. Dependencies often exist in spatial data which lead to a violation of one or more of the assumptions of the OLS model – typically, we find that the residuals are spatially autocorrelated. Spatial regression models extend OLS models in an attempt to deal with spatial autocorrelation. There are a number of model forms which add terms to the OLS model to deal with spatial dependence. 'Spatial autoregressive' (spatial lag) models consider that the level of the response variable depends not only on the levels of the independent variables in the model, but also on neighboring values of the response variable itself. An additional term in the model is used to represent this:

$y = \beta_0 + \beta_1 x_1 + \cdots + \beta_n x_n + \rho W y + \varepsilon$

where W is a spatial weights matrix and $\rho$ is an additional coefficient to be estimated which measures the autocorrelation. The spatial weights matrix contains 0 where zones are not contiguous; if a zone has m neighboring zones, the cells in its row corresponding to those adjacencies contain the value 1/m. This ensures that the additional term contains the mean of the y variable in the neighboring zones. The means may be weighted using a Euclidean or non-Euclidean distance such that nearer zones are given more influence in the computation of the adjacent mean value.

An alternative form considers autocorrelation in the error term. The model structure takes the form:

$y = X\beta + \varepsilon$
$\varepsilon = \lambda W \varepsilon + u$

The error term $\varepsilon$ has a spatially weighted component $\lambda W \varepsilon$ and a random error term u. Such models are referred to as 'spatial error' models, but they are also known as 'spatial moving average' models. Downloadable software, GeoDa, is available for estimating a range of models including the spatial lag and spatial error models.

One of the assumptions of conventional regression models is that the regression equation is identical in form for every observation. If the process being modeled is not spatially homogeneous, then any spatial structure will be revealed in the residuals from the model: they will be spatially autocorrelated. However, one of the basic assumptions of regression is that the residuals are uncorrelated, so any autocorrelation is undesirable.

One approach to dealing with the problems of modeling data concerning individuals and their neighborhoods (or different levels in a spatial hierarchy) is 'multilevel modeling'. This modeling framework allows the incorporation of data concerning individuals simultaneously with information about the neighborhoods in which they live or work. Specialized software is required to calibrate multilevel models.

A simple technique to incorporate spatial effects into a model is to include place-specific dummy variables, perhaps to deal with broad regional effects where the regions cover several of the underlying data cases. Dummy variables are those which take the value 0 or 1, often representing yes/no or presence/absence. The effect of the dummy variables, if the parameter estimates are significant, is to move the regression line up and down the y-axis – they modify the value of the intercept term for each region. Interaction terms (in which an independent variable is multiplied by a regional dummy variable) allow the value of that variable's parameter to be modified for a particular region – the combination of regional dummy variables and interaction terms allows the estimation of different regression structures in different regions. However, the multiplicity of terms in the model rapidly leads to difficulties in interpretation.

A number of approaches have been suggested for allowing the model structure to vary across the study area to deal with this problem. Among the earlier attempts is the 'expansion method' developed by Casetti and Jones. In these models, the parameters are functions of location (with coordinates $u_i, v_i$):

$y_i = (\alpha_0 + \alpha_1 u_i + \alpha_2 v_i) + (\beta_0 + \beta_1 u_i + \beta_2 v_i)\,x_i + \varepsilon_i$

The linear expansions of the parameters allow their values to 'drift' over the study area. In the example above, if $\alpha_1$ and $\alpha_2$ were positive, the value of the intercept would increase toward the northeast of the study area; if $\beta_1$ and $\beta_2$ were negative, the value of the parameter for the variable x would increase toward the southwest of the study area. Other forms for the expansion are possible, for example, quadratic or cubic. The analyst must specify the form of the expansions (linear, quadratic, cubic) prior to analysis, so if the model structure changes in some manner other than that dictated by the expansion functions, there may still be spatial structure present in the residuals.

A more recently developed technique is 'geographically weighted regression' (GWR) – this generalizes the concept of spatial drift. The model was proposed by Fotheringham, Brunsdon, and Charlton, who have written widely on GWR. It can provide local parameter estimates at the locations where the observations have been sampled, or at any other location in the study area. GWR models have the general form

$y(u) = \beta_0(u) + \beta_1(u)x_1 + \cdots + \beta_n(u)x_n + \varepsilon$

where u is some location in the study area – a set of parameters is estimated for every observation in the dataset. The coordinates for zones are usually taken at some central point in the zone. To understand the nature of the geographical weighting, we must consider the estimator:

$\hat{\beta}(u) = (X^{T}W(u)X)^{-1}X^{T}W(u)y$

The geographical weighting is provided by the matrix W, which is specific to a particular location. Observations which are closer to the location u are given greater weight in the estimation, and observations which are farther away receive lower weight. The weighting functions typically have a Gaussian-like shape, and the influence of the distance between any observation and the location u is controlled by a parameter known as the bandwidth – this may be supplied by the analyst or estimated from the data. The parameter estimates, and their associated standard errors, can be mapped. Also mappable are the residuals and local goodness-of-fit measures. Significance tests on the individual parameters are possible, as is a significance test to determine whether the spatial variation in the parameter estimates for a single variable differs from random.

The GWR approach is quite flexible. Typically, the parameter estimates are made at the locations at which data are available. However, parameter estimates can also be made at other locations in the study area, and if independent variables exist for those locations, estimates of the dependent variable can be made there too. GWR is a model of spatial heterogeneity. Interesting developments are in the area of semiparametric GWR models – in these models, some variables have spatially fixed parameters and others have spatially varying parameters. Model selection – particularly deciding which parameters should be fixed and which should vary – becomes an important issue. Estimating such models requires special software: Fotheringham, Brunsdon, and Charlton have written suitable software; there is a GWR package for R, and Gaussian GWR models can be estimated in ArcGIS 9.3. For mapping parameter estimates, GWR output can be imported into suitable mapping software. For mapping convenience, a model can be estimated at the sample points to determine a bandwidth, and then estimates made at the mesh points of a regular grid to create a parameter 'surface'.
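The estimator above translates almost line for line into code. A compact numpy sketch of GWR with a Gaussian kernel and a fixed, analyst-supplied bandwidth; the data are synthetic, with a slope that drifts with the easting coordinate, and a real analysis would also select the bandwidth, for example by cross-validation:

```python
import numpy as np

def gwr(X, y, coords, u, bandwidth):
    """Local coefficients at location u via geographically weighted least
    squares: beta(u) = (X'W(u)X)^-1 X'W(u)y."""
    d = np.linalg.norm(coords - u, axis=1)          # distances to location u
    w = np.exp(-0.5 * (d / bandwidth) ** 2)         # Gaussian kernel weights
    XtW = X.T * w                                   # X' with columns scaled by w
    return np.linalg.solve(XtW @ X, XtW @ y)

rng = np.random.default_rng(3)
n = 200
coords = rng.random((n, 2))
x1 = rng.normal(size=n)
# Hypothetical process whose slope drifts with the easting coordinate
y = 1.0 + (1.0 + 2.0 * coords[:, 0]) * x1 + rng.normal(scale=0.3, size=n)
X = np.column_stack([np.ones(n), x1])

# Local estimates at each observation point; column 1 is the local slope
betas = np.array([gwr(X, y, coords, c, bandwidth=0.2) for c in coords])
print(betas[:5])
```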

Spatial Interpolation

A related activity is spatial interpolation. Typically, the analyst has values for a variable which have been collected at one set of locations and requires estimates of values for another set of locations in the study area – usually the mesh points of a grid. These techniques require spatial coordinates for each observation. Spatial interpolation techniques are widely used to create grid-based digital elevation models from irregularly spaced measurements of elevation, but they can also be used in human geography research.

A commonly used technique is 'inverse distance weighting'. This provides a distance-weighted mean estimate of the measurements. Given the values $x_i$ of a variable of interest at n locations, an estimate of the mean at a location u is provided by

$\hat{x}(u) = \dfrac{\sum_{i=1}^{n} d_i(u)^{-a}\, x_i}{\sum_{i=1}^{n} d_i(u)^{-a}}$

where $d_i(u)$ is the distance from the ith observation to the location u, and a is an exponent supplied by the analyst. A typical value for the exponent is 2. Enhancements sometimes include restricting the local sample size – only the closest m points to u (where m < n) are used.

Another approach is provided by 'kriging'. Kriging is associated with the field of geostatistics. The estimates are again derived from a distance-weighted mean, but with kriging the weights are computed from a theoretical variogram whose parameters – the nugget, sill, and range – are estimated from the observed data. This is usually a two-stage process: first, the analyst computes an empirical semivariogram from the data, which measures the spatial covariance of the data at various distances; to this a theoretical variogram is fitted by choosing suitable values for the parameters. As well as estimates of the mean of the variable being kriged, standard errors of the estimates are also available.

We can think of spatial data to be interpolated using kriging as being composed of a deterministic component and a stochastic component. The deterministic component may be a mean or a more complex spatial drift term – residuals are left after the deterministic component has been removed from the data. The kriging process attempts to estimate the value of the residuals at a particular location, given the values of the residuals in the neighborhood. To generate the residuals, the analyst provides a suitable mean model: in 'simple kriging' the analyst supplies a value for the mean of the data; in 'ordinary kriging' this is computed from the data. In 'universal kriging' a more complex model is used to remove spatial trend, or 'drift', from the data – typical regression models used here include linear and quadratic forms. More complex models may be used – a study of house-price variation might model the trend as a hedonic regression. Once the residuals have been computed from the sample data, the errors can be estimated at other locations, and the mean or trend added back in to provide the final estimate. Kriging requires specialist software – Pebesma's gstat package and ESRI's Geostatistical Analyst provide powerful tools.
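A self-contained numpy sketch of inverse distance weighting at the mesh points of a small grid; the sample points, values, and exponent are hypothetical:

```python
import numpy as np

def idw(sample_xy, sample_z, targets, a=2.0):
    """Inverse-distance-weighted estimates at target locations."""
    d = np.linalg.norm(targets[:, None, :] - sample_xy[None, :, :], axis=2)
    d = np.maximum(d, 1e-12)              # guard against zero distance
    w = d ** -a                           # weights fall off with distance
    return (w @ sample_z) / w.sum(axis=1)

sample_xy = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
sample_z = np.array([10.0, 12.0, 14.0, 20.0])

# Estimate on the mesh points of a coarse regular grid
gx, gy = np.meshgrid(np.linspace(0, 1, 3), np.linspace(0, 1, 3))
targets = np.column_stack([gx.ravel(), gy.ravel()])
print(idw(sample_xy, sample_z, targets).reshape(3, 3))
```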

Software for Quantitative Analysis

There is a range of computer software available to the analyst. This includes comprehensive collections of statistical methods running on personal computers or workstations – notable examples are SPSS and SAS. The analyst is required to prepare a file of data which is then converted by the program to an internal format. The analysis involves making choices from logically organized menu structures and interpreting the output correctly. While such programs are relatively easy to use, they will not prevent mistakes being made, such as creating descriptive statistics intended for continuous data from suitably coded categorical data. Other software may be more specialized and less widely used by geographers, for example, Statistica and Stata.

Some geographers have created their own software for spatial analysis: Luc Anselin's GeoDa, and GeoVISTA, created by Pennsylvania State University's Geographic Visualization Science, Technology, and Applications Center, are examples. GeoVISTA is easily modified to create new tools in emerging areas such as geovisual analytics. Other examples of more task-specific software include CrimeStat (aimed at crime analysts) and ClusterSeer, BoundarySeer, and SpaceStat, which are marketed by TerraSeer Inc.

At the other end of the spectrum is R, a widely used statistical computing environment which is free software. R has a rich scripting language which the analyst must use to perform the required analyses – statements in the language are typed into a text window. This involves remembering the names of a wide range of statistical functions, their syntax, and the syntax of the scripting language itself. While R is powerful and flexible, it is not always easy to use. There is a large library of add-on packages which have been written by R users from a wide range of disciplines: these include several packages for spatial analytic methods, including kriging and geographically weighted regression.

The Future

Many of the techniques described here have been developed in the last 30 years. The development of GIS has been among the influences that have driven researchers to seek solutions to some of the awkward problems presented by spatial data. Much parallel activity has also taken place outside geography, notably in spatial epidemiology and spatial economics. Great progress has been made by statisticians in the field of spatial statistics – there is a rich variety of spatial models available to the analyst when dealing with spatial problems. There is a strong future for quantitative approaches in human geographical research.

See also: Categorical Data Analysis; Ecological Fallacy; Exploratory Spatial Data Analysis; Factor Analysis and Principal-Components Analysis; First Law of Geography; Geocomputation; Geographically Weighted Regression; Kriging and Variogram Models; Longitudinal Methods (Cohort Analysis, Life Tables); Markov Chain Analysis; Modifiable Areal Unit Problem; Monte Carlo Simulation; Movies and Films, Analysis of; Multidimensional Scaling; Naturalistic Testing; Neighborhood Effects; Network Analysis; Neural Networks; Oral History; Oral History, Ecological; Overlay (in GIS); Participant Observation; Participatory Action Research; Participatory Video; Performance, Research as; Photographs; Point Pattern Analysis; Polyvocality; Psychoanalysis; Q Method/Analysis; Qualitative Geographic Information Systems; Quantitative Data; Quantitative Methodologies; Questionnaire Survey; Regionalization/Zoning Systems; Regression, Linear and Nonlinear; Reliability and Validity; Remote Sensing; Representation and Re-presentation; Sampling; Scale Analytical; Scientific Method; Segregation Indices; Selection Bias; Shift-Share Analysis; Simulation; Situated Knowledge, Reflexivity; Sound and Music; Space-Time Modeling; Spatial Analysis, Critical; Spatial Autocorrelation; Spatial Clustering, Detection and Analysis of; Spatial Data Mining, Cluster and Pattern Recognition; Spatial Data Mining, Geovisualization; Spatial Data Models; Spatial Expansion Method; Spatial Filtering/Kernel Density Estimation; Spatial Interaction Models; Spatial Interpolation; Spatially Autoregressive Models; Statistics, Descriptive; Statistics, Inferential; Statistics, Overview; Statistics, Spatial; Structural Equations Models; Subalternity; Subjectivity; Text, Textual Analysis; Thiessen Polygon; Time Geographic Analysis; Time Series Analysis; Time-Space Diaries; Transcripts (Coding and Analysis); Translation; Trend Surface Models.

Further Reading

Bailey, T. C. and Gatrell, A. C. (1995). Interactive Spatial Data Analysis. Harlow: Longman.
Bivand, R., Pebesma, E. J. and Gómez-Rubio, V. (2008). Applied Spatial Data Analysis with R. New York: Springer.
Cressie, N. A. C. (1993). Statistics for Spatial Data (revised edn.). New York: Wiley.
Fischer, M. M. and Getis, A. (eds.) (1997). Recent Developments in Spatial Analysis: Spatial Statistics, Behavioural Modelling and Computational Intelligence. Berlin: Springer.
Fotheringham, A. S., Brunsdon, C. and Charlton, M. (2000). Quantitative Geography: Perspectives on Spatial Data Analysis. London: Sage.
Fotheringham, A. S., Brunsdon, C. and Charlton, M. (2002). Geographically Weighted Regression: The Analysis of Spatially Varying Relationships. Chichester: Wiley.
Haggett, P. (1965). Locational Analysis in Human Geography. London: Edward Arnold.
Mardia, K. V. and Jupp, P. E. (1999). Directional Statistics. Chichester: Wiley.
Schabenberger, O. and Gotway, C. A. (2005). Statistical Methods for Spatial Data Analysis. Boca Raton, FL: Chapman and Hall.
Thomas, R. W. and Huggett, R. J. (1980). Modelling in Geography: A Mathematical Approach. London: Methuen.
Upton, G. and Fingleton, B. (1985). Spatial Data Analysis by Example, Volume 1: Point Pattern and Quantitative Data. Chichester: Wiley.
Upton, G. and Fingleton, B. (1989). Spatial Data Analysis by Example, Volume 2: Categorical and Directional Data. Chichester: Wiley.
Waller, L. A. and Gotway, C. A. (2004). Applied Spatial Statistics for Public Health Data. Hoboken, NJ: Wiley.
Wrigley, N. (ed.) (1979). Statistical Applications in the Spatial Sciences. London: Pion.