Computers, Environment and Urban Systems 30 (2006) 143–160 www.elsevier.com/locate/compenvurbsys
Scales, levels and processes: Studying spatial patterns of British census variables
David Manley a, Robin Flowerdew a,*, David Steel b
a School of Geography and Geosciences, University of St. Andrews, St. Andrews KY16 9AL, UK
b Department of Mathematics, University of Wollongong, Australia
Accepted 30 August 2005
Abstract

This paper is based on the assumption that there may be scale effects at all levels of areal data and that they vary both within areal units and between areal units. Spatial distributions are based on processes taking place in geographical space. A mapped pattern may reflect several distinct processes, each of which may affect a different area and operate at a different scale. The challenge for the spatial analyst is to identify these processes and evaluate their importance from the spatial pattern observed. Here the well known modifiable areal unit problem is not really a problem but a resource. Data at different scales can help us identify processes operating at different scales. We build on models and methods described by [Tranmer, M., & Steel, D. G. (2001). Using local census data to investigate scale effects. In N. J. Tate, & P. M. Atkinson (Eds.), Modelling scale in geographical information science (pp. 105–122). Chichester: John Wiley and Sons], which facilitate the identification of processes occurring within areal units. The method is extended using concepts from multi-level modelling and spatial autocorrelation, through the application of local statistics applied to what may be termed area effect estimates. It is illustrated with respect to two very different census variables and three different study areas.

© 2005 Elsevier Ltd. All rights reserved.

Keywords: Census data; Modifiable areal unit problem; Multi-level modelling; Scale effects; Spatial autocorrelation
* Corresponding author.
E-mail addresses: [email protected] (D. Manley), r.fl [email protected] (R. Flowerdew), [email protected] (D. Steel).
0198-9715/$ - see front matter © 2005 Elsevier Ltd. All rights reserved.
doi:10.1016/j.compenvurbsys.2005.08.005
D. Manley et al. / Comput., Environ. and Urban Systems 30 (2006) 143–160
1. Introduction

The modifiable areal unit problem (MAUP) is a phenomenon whereby different results are obtained in analysis of the same data grouped into different sets of areal units. It vexes the geographical and spatial analyst almost as much today as it did when first identified by Gehlke and Biehl (1934) or when subsequently popularised by Openshaw and Taylor (1979, 1981). The MAUP has been subdivided into two separate but linked issues. One is the zonation issue, which concerns the effects of the arbitrary nature of the boundary division placed upon the data. The other is the scale issue, which arises where the statistical results of an analysis change as the level of analysis changes. These effects occur because spatial processes generating the observed data may exist at scales and for particular areal units that may be reflected more or less accurately by the boundaries in use. Among other authors, Fotheringham and Wong (1991) have demonstrated these effects for US census data, and Tranmer and Steel (2001) have done so for UK data. See Openshaw (1984) for further discussion of these concepts.

Two analytical techniques are applied in this paper to investigate the processes generating spatial patterns. The first technique is the multi-level model, or MLM (Jones, 1991). The MLM is based on the recognition that a response variable can be affected by processes occurring at both the individual level and the group level. Thus, the MLM can be used to assess the existence, and estimate the magnitude, of processes that operate at the individual person level, and also at one or more group levels. In the classic applications of MLM in education, the groups may correspond to classes or schools; in the current context, the groups may refer to geographical areas over which spatial processes operate. The second of these techniques is spatial autocorrelation.
This has been identified as highly relevant to the analysis of spatial data, such as data that is available for areal units (see for instance Cliff & Ord, 1973). Spatial autocorrelation has been discussed as a factor in the debate concerning the modifiable areal unit problem (see Openshaw & Taylor, 1979). At its simplest, spatial autocorrelation can be thought of as the correlation of a variable at one place with the same variable at neighbouring places. It exemplifies Tobler’s first law of geography that “everything is related to everything else, but near things are more related than distant things” (Tobler, 1970, p. 236). Goodchild (1986) gives a more detailed treatment. Spatial autocorrelation can inform analysts about the patterning of areal data.

It is logical that spatial autocorrelation and multi-level modelling should be analysed together. Jones (1991, p. 8) states, “the degree of auto-correlation in MLM can loosely be conceived as the ratio of ‘variation at the higher level’ to the ‘total variation at all levels’. A value of zero for a spatial autocorrelation coefficient signifies no auto-correlation, indicating that there is no variation at the higher level”. The work presented here builds on this basis, aiming to find evidence for the spatial processes generating the data under analysis, using a combination of adapted multi-level modelling and spatial autocorrelation techniques. The paper also provides conclusions about the patterns displayed by certain British census variables.

2. Background, data and theory

Prior to presenting our methods it is necessary to consider the nature of areal units for which spatial data may be provided. There may be processes and effects within areal data
that interact in a complex fashion to create the observed data. If data are available at different scales, this may reflect the processes generating the data. However, there may be other processes affecting observed data that occur at scales for which we do not have information. Despite this, they deserve consideration, and it is the relationship between all these effects that the methods presented in the following sections seek to clarify.

It is important to clarify the vocabulary that will be used to discuss these issues. Following Openshaw (1984), the term ‘areal unit’ is used as a general term for any bounded section of geographical space for which data are or could be recorded, regardless of scale. The term basic spatial unit (BSU) is used to denote the smallest type of areal unit for which data are available. In many situations, the data will be affected by processes operating at more than one scale. In this situation, the terms ‘local’ and ‘regional’ are used to distinguish between these scales.

The data used in the analysis in this paper come from the 1991 British census. Most of the data are from the small area statistics, which are available at several scales. In England and Wales, the smallest set of areal units for the 1991 census is the set of enumeration districts (EDs). EDs therefore are the basic spatial units for the England and Wales census. They provide complete coverage of England and Wales, comprising 113,196 areal units (Denham, 1993, p. 55). They nest into wards, which themselves nest into districts and then into counties and regions. In Scotland, the BSUs are known as output areas (OAs). OAs tend to be smaller than EDs; there are a further 38,255 in Scotland (Denham, 1993, p. 55). OAs nest into pseudo-postcode sectors, which in turn nest into council areas. In the analysis, we will be particularly concerned with the relationships between data for the two smallest sets of areal units.
For convenience, we will use the term enumeration district to refer to output areas in Scotland, and the term ward to refer to pseudo-postcode sectors; thus throughout the paper we will refer to enumeration districts and wards, but for Scottish data these really mean output areas and pseudo-postcode sectors.

In addition to the small area statistics, we use data from the household sample of anonymised records (SAR). These data refer to all households within a particular geographical area, known as a SAR district. SAR districts are sometimes identical to districts or council areas, but because there is a minimum size threshold of 120,000 people, in some cases they consist of two or more districts or council areas.

2.1. Areal units and spatial processes

This paper is based on the idea that the variation in statistical results for a particular variable may be attributed to processes operating at several different spatial scales, which may or may not be those for which data are available. There may be individual-level and aggregate-level effects, as is assumed in multi-level modelling, but the aggregate-level effects may occur at two scales, as in Green and Flowerdew’s discussion (1996) of local and regional effects, or at more than two scales. There is usually no theoretical reason to suppose that these effects happen to coincide with the scales at which data are released, such as EDs and wards in the British census. It may also be that the effects occur at one scale in part of the study area and at another scale elsewhere in the study area.

For certain variables, it may be possible to identify the spatial processes causing local and regional effects. A good example is housing rented from the local authority; in many places (e.g. Glasgow), such housing is found in large estates.
Even where much of the local authority housing has been sold off under Britain’s right-to-buy legislation, there may be local spatial patterns in the distribution of which houses have and have not been sold, perhaps influenced by construction type, council housing allocation policy or social stigmatisation. Other spatial processes may include the impacts of local housing markets or job markets on the economic status of residents, patterns of ethnic concentration, suburbanisation, gentrification and urban decline. The geographies of these processes will all be reflected in geographical space, and their coincidence or otherwise with areal unit boundaries will affect the variability of the statistical results.

The methodology explored here cannot be used to investigate the impact of the sizes and shapes of the basic spatial units (EDs in the case of census data). However it can be used to investigate the comparability of the spatial processes to areal units of larger size. For example, the success of the system of ward boundaries in reflecting the extent of spatial processes in the study area can be assessed. Similarly the usefulness of units smaller or larger than wards can be judged in terms of their coincidence with the spatial processes concerned, and so may the option of trying to replicate spatial patterns by a mixture of larger and smaller areal units in different parts of the study area.

It should be noted that this discussion deals with only one variable at a time, although correlation and regression analysis usually dominate discussion of MAUP effects. If two variables are considered, it is also the case that areal units appropriate for one variable may not be appropriate for another, and also that the spatial processes operative in one study area may be observable at a different scale in another.

2.2. Local and regional effects

It has been recognised previously that the MAUP relates to the differences between the spatial processes generating data and the units within which they are reported. Green and Flowerdew (1996) present an argument that it is possible to understand the MAUP with respect to interactions between data objects that occur at two geographical levels, the local level and the regional level. Considering the relationship between two variables, denoted by X and Y, it is possible that the relationship is not simply $Y_i$ to $X_i$, but also $Y_i$ to $X_j$, where $X_j$ is the X variable for a neighbouring areal unit. Green and Flowerdew (1996) define this as cross-correlation, which occurs when the response variable at one place is affected by the explanatory variable at the same place and at surrounding locations. In the case of house prices, for example, the price of one house may be a function of not only its own condition, but also of the upkeep of the houses in the immediate area. They identify cross-correlation as part of the range of processes that can influence the results of statistical analysis on areal data.

Flowerdew, Geddes, and Green (2001) explore this notion further and express it as: “If Y is a function of X and there is a cross-correlation effect, then a regression of Y at the most local level should include as explanatory variables both a local effect, i.e. the value of X at that local level, and a regional effect, i.e. the values of X in the surrounding region” (p. 91, emphasis in original). However, in neither Green and Flowerdew (1996) nor Flowerdew et al. (2001) are the local and regional effects measured precisely. They use an example where the local effect refers to the effect on Y of X at the ED level and the regional effect to the effect on Y of X at the level of the ward in which the ED is located. However, they do not claim that the ward is actually the appropriate unit for evaluating the regional effect,
and if this concept is to be used, the question of how appropriate units are identified needs addressing.

2.3. Identifying individual and areal effects

It is possible to identify elements of correlations and covariances that are influenced by areal processes that can be considered in terms of the local and regional effects discussed above (see Tranmer & Steel, 2001). Usefully, it is also possible to isolate elements and statistical measures at different scales, reflecting only the processes occurring at the given level of analysis, and excluding processes occurring at other areal levels and the individual level. This follows the line of argument within MAUP research that does not seek to provide an overall solution to the problem (Openshaw & Rao, 1995). Rather it seeks to provide better statistical understanding, which enables the isolation of MAUP effects, and therefore a better identification of the processes behind the MAUP. This is also the viewpoint taken in this paper. A methodology for isolating effects at different scales is given in Tranmer and Steel (2001), and stems from a set of research ideas discussed by Steel and Holt (1996), Steel, Holt, and Tranmer (1996) and Tranmer (1999).

2.4. Intra-area correlations

Central to the Tranmer and Steel (2001) approach is the derivation of intra-area correlations (or IACs). An IAC measures the extent of homogeneity within areal units, and can be computed for different scales and configurations of areal units. Despite the name, IACs refer only to single variables. Holt, Steel, and Tranmer (1996) demonstrate how the scale effects within the MAUP on the variance of a variable can be related to the IAC for that variable. Combining the individual level data from the 2% sample of anonymised records (or SAR) with the small area statistics (SAS) data available from the census, it is possible to provide estimates of these values.
Crucially, these estimates do not require individual level spatial identifiers within the SAR region. It is then possible to use the population-weighted areal variance measures to provide statistical measures such as IAC, which can be used as a measure of homogeneity. If there are two levels of areal unit, the smaller being denoted level 1 and the larger level 2, the IAC can be found by

$$\lambda^{(2)} = \frac{S^{(2)} - S^{(1)}}{(\bar{n} - 1)\, S^{(1)}}$$

where $\lambda^{(2)}$ represents the IAC (the superscript 2 refers to the level 2 areal units used in the analysis), $S^{(1)}$ is the level 1 variance for a variable, $S^{(2)}$ is the population-weighted level 2 variance and $\bar{n}$ is the average population size per level 2 areal unit. The IAC is based on the relationship between the variances of a variable at level 2 and at level 1, and could be regarded as a measure of the spatial autocorrelation within level 2 areal units.

We suggest that there may indeed be local and regional processes within areal data that can influence the results of statistical analysis. Moreover, these processes may not be the same throughout the study area. While processes at one level may influence one part of an areal unit, different processes operating at different levels may be present for the same data in a different part of the same area. This concept will become more apparent below when some data are considered.
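The IAC formula of Section 2.4 can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the authors' software: the function names `intra_area_correlation` and `level2_variance` are hypothetical, and the inputs are assumed to be already-computed variances and an average population size. The second function is simply the algebraic rearrangement of the same formula.

```python
def intra_area_correlation(s1, s2, n_bar):
    """IAC = (S2 - S1) / ((n_bar - 1) * S1).

    s1    -- level 1 variance of the variable
    s2    -- population-weighted level 2 variance
    n_bar -- average population size per level 2 areal unit
    """
    if s1 <= 0:
        raise ValueError("level 1 variance must be positive")
    if n_bar <= 1:
        raise ValueError("average population size must exceed 1")
    return (s2 - s1) / ((n_bar - 1.0) * s1)


def level2_variance(s1, iac, n_bar):
    """Rearrangement of the IAC formula: S2 = S1 * (1 + (n_bar - 1) * IAC)."""
    return s1 * (1.0 + (n_bar - 1.0) * iac)


if __name__ == "__main__":
    # Hypothetical numbers chosen only to illustrate the calculation.
    s1, n_bar = 1.0, 10
    s2 = level2_variance(s1, 0.5, n_bar)
    print(intra_area_correlation(s1, s2, n_bar))
```

When the level 2 variance equals the level 1 variance the function returns zero, which corresponds to no additional homogeneity within the level 2 units.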
Secondly, we note that these effects can vary over the study area and it is unlikely that they will be reflected by the predetermined geography of the areal unit boundaries. Hence, it is unlikely that they will be sufficiently captured solely by using the standard geography of data publication (the ED and ward boundaries in the case of the British Census). Therefore, it may be that the effects will not be identifiable at the levels of the enumeration district (ED) and ward, but may exist at an undetermined level between these two scales. Moreover, it is possible that scale effects are stronger in one part of the entire study area than another. These issues are discussed further below.

3. Methodology

The models and methods described by Tranmer and Steel (2001) only allow for a global measure of homogeneity to be calculated, but do not allow the differing levels of homogeneity within a SAR district to be calculated. Therefore we extend the approach to examine evidence of such changes in homogeneity by attempting to identify processes generating these different levels of homogeneity. Having presented some background to the approach, this section details the method that was used to further these ideas.

There are two parts to this method. The first seeks to exploit the multi-level model. This is justified by the recognition that geographical processes may often be affected by conditions at the neighbourhood (or group) level, and so there is frequently a need for one or more group level terms. The second part of this method is to analyse this term using local statistics. This allows the identification of (relatively) homogeneous sets of areal units, from which it may be possible to identify processes and draw conclusions about the geography of these processes. The spatial extent of the processes can also be compared to the next level of census geography.

3.1. Extending the multi-level model

Standard MLMs require at least two levels of data, an individual level and a group level. With the decennial census of the United Kingdom full individual records are not available due to confidentiality requirements. However, for the 1991 census it is possible to access a 2% individual sample at a coarse geographical level (SAR data). This is of limited use for multi-level modelling as it does not contain identifiers for an individual’s location below the coarse SAR district level. It is not possible to assign individuals to the ED or ward within the SAR district where they reside. Consequently, it is not practical to use the standard MLM techniques to analyse the census data below the SAR level. However, Tranmer and Steel (2001) have shown that, by making use of additional ED level data, it is possible to estimate these structures without the full individual-level data, and without losing significant efficiency.

It is possible to express the traditional multi-level model in the following manner:

$$y_{ig} = \mu + u_g + e_{ig}$$

where $y_{ig}$ is the value of the variable of interest for the ith individual in the gth area, $\mu$ is the overall population mean, $u_g$ is the area-level component and $e_{ig}$ is the individual-level component.

In terms of understanding the spatial processes that occur within geographical data, the $u_g$ term is the most useful as it reflects the effects of processes operating at the area level. It
is likely that these area effects would not be fully identifiable at the individual level. Within this model, there are a number of important assumptions that must be taken into account. One assumption is that the processes that occur within the data do so solely at the levels available for analysis. When using real world data, such as the census, it is unlikely that this assumption will be valid. Thus it would be useful to be able to provide an estimate of the higher-level variance component that is free from the constraint that the data are only published for a limited set of geographies. This can, through further analysis, enable the identification of the higher-level processes within the data.

We will consider an example that uses the SAR districts as the study areas in which our analysis will be carried out. The individual-level data necessary will be taken from the 2% SAR, while the areal units are census enumeration districts (EDs). The estimator of $u_g$, the ED level effect, will be denoted as $\hat{u}_g$. Mathematically, it can be defined as

$$\hat{u}_g = w_g (\bar{y}_g - \bar{y})$$

where $w_g$ is a weighting term, $\bar{y}_g$ is the observed mean of the variable in ED g, and $\bar{y}$ is the overall observed mean of the variable for EDs across the whole SAR district. The weight ($w_g$) can be calculated by the following equation:

$$w_g = n_g \left( \frac{\lambda^{(2)}}{1 + (n_g - 1)\lambda^{(2)}} \right)$$

where $n_g$ is the number of observations in the gth ED and $\lambda^{(2)}$ is the intra-area correlation at the ED level for the variable, as defined in the previous section. The estimated area effects $\hat{u}_g$ attempt to allow for the variation between area means. Application of the weights $w_g$ shrinks the deviations of the chosen areal means from the overall mean to allow for the likely impact of individual level variation (see Goldstein, 2003, p. 22).

3.2. Identifying and using spatial autocorrelation

Analysis of the $\hat{u}_g$ can be used to determine the structure of the data between the areal units (in this case EDs), as each $\hat{u}_g$ value is an indication of the area-level effect within that unit. Therefore, $\hat{u}_g$ values that are similar could be the result of processes operating at a level coextensive with the area of similar $\hat{u}_g$ values. Measures of spatial autocorrelation of the area-level effects can be used to describe the geography of the processes. Consequently, these analyses will be able to show whether or not the spatial processes operate at the same scales as the standard census units. Such occurrences can be identified as clustering beyond the level of analysis, in the case of the following discussion, above the enumeration district level. Instances of spatial autocorrelation of $\hat{u}_g$ will point to the existence of processes operating at a scale greater than the ED. Moreover, this technique could identify processes that operate between the standard census levels, such as at a level of aggregation that was half way between the ED and ward level. If this were done, the results could be used to guide census users as to how to perform their analysis, for example by grouping EDs into larger units of an appropriate size. If the analysis were carried out on British Census data, at the individual (SAR) and enumeration district levels, then the subsequent analysis could suggest a more appropriate construction for the higher level of aggregation for the census data, given the data structure of the variable under
investigation. Furthermore, it would also be able to demonstrate how well the current ward structure fitted the autocorrelations that were apparent within the data.

As stated, the patterns within the data can be explored using the concepts of spatial autocorrelation. There are a number of measures of spatial autocorrelation, the most common of these being Geary’s C statistic and Moran’s I. They are very similar, and the analysis below uses a version of Moran’s I. This measurement is “analogous to a covariance between the values of a pair of objects” (Goodchild, 1986, p. 17), measuring the differences between the values for a variable for places that are close to each other. However, to determine a spatial pattern within a given data set the standard measures of spatial autocorrelation are inadequate, as in some of the SAR districts there could be as many as 5000 EDs. Moreover, there is no guarantee that the extent of the spatial autocorrelation will be constant within the study area. As Haining (2003, p. 186) states, “where process-induced heterogeneity is strong, data analysis based on ‘local’ statistics may be preferable to data analysis based on global or ‘whole map’ statistics.” Consequently a measure that can be defined within the suite of tools known as Local Indicators of Spatial Association (or LISA) is required (Anselin, 1995). One such tool is known as the Local Moran’s I. The Local Moran’s I is a variant of the global Moran’s I where individual values are determined for all of the units in a study area. The form of the Local Moran’s I is as follows (Getis & Ord, 1996):

$$I_g = \frac{\hat{u}_g - \bar{\hat{u}}}{S_u^2} \sum_{h \neq g} \left[ W_{gh} \left( \hat{u}_h - \bar{\hat{u}} \right) \right]$$

where $I_g$ is the local Moran’s I value for ED g, $\bar{\hat{u}}$ is the mean value of estimated area-level effects for all EDs, $\hat{u}_h$ is the estimated area-level effect for ED h, $\hat{u}_g$ is the estimated area-level effect for ED g, $S_u^2$ is the variance over all estimated area-level effects and $W_{gh}$ is a distance weight that can be defined by

$$W_{gh} = \frac{1}{d_{gh}}$$

where $d_{gh}$ is the distance between the centroids of EDs g and h. Hence a value for the local Moran’s I can be computed for each areal unit in the study area. A negative I value indicates negative spatial autocorrelation, where geographically ‘close’ values are less similar than would be expected if there were no spatial autocorrelation, while complete spatial independence would give a value near zero. Strong spatial autocorrelation is denoted by high positive values.

It is possible to calculate a standardised version of the local Moran’s I that takes into account its sampling error. The standardisation is carried out using the following function:

$$Z(I_g) = \frac{I_g - E(I_g)}{S(I_g)}$$

where $Z(I_g)$ is the value of the standardised local Moran’s I for ED g, $I_g$ is the local Moran’s I value of areal unit g, $E(I_g)$ is the expected local Moran’s I value of areal unit g and $S(I_g)$ is the standard deviation of the local Moran’s I value of areal unit g. Because it is standardised, the results for different SAR districts can be compared. It is this standardised version of the local I that will be referred to in the following analysis. Moreover, we take any values of the standardised local Moran’s I that are either below −3 or above +3 to indicate significant spatial processes between the areal units.
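The local Moran’s I with inverse-distance weights can be sketched with NumPy as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: the function names are hypothetical, the variance is taken as the population variance (the text does not say which divisor is used), centroids are assumed to be distinct planar coordinates, and $E(I_g)$ and $S(I_g)$ are treated as inputs because their formulas are not given in the text.

```python
import numpy as np


def local_morans_i(values, coords):
    """Local Moran's I for each unit, using weights W_gh = 1 / d_gh."""
    values = np.asarray(values, dtype=float)
    coords = np.asarray(coords, dtype=float)
    dev = values - values.mean()          # deviations from the overall mean
    var = dev @ dev / len(values)         # population variance (assumption)
    if var == 0:
        raise ValueError("values have zero variance")
    # Pairwise Euclidean distances between centroids.
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    off_diag = ~np.eye(len(values), dtype=bool)
    if np.any(dist[off_diag] == 0):
        raise ValueError("coincident centroids give undefined weights")
    weights = np.zeros_like(dist)         # diagonal stays 0, so h != g
    weights[off_diag] = 1.0 / dist[off_diag]
    return (dev / var) * (weights @ dev)


def standardise(i_local, expected, std):
    """Z(I_g) = (I_g - E(I_g)) / S(I_g); expected and std are supplied."""
    return (np.asarray(i_local, dtype=float) - expected) / std


def flag_significant(z, threshold=3.0):
    """True where the standardised value lies below -3 or above +3."""
    return np.abs(np.asarray(z, dtype=float)) > threshold


if __name__ == "__main__":
    # Hypothetical example: two tight clusters with contrasting values.
    effects = [1.0, 1.0, 5.0, 5.0]
    centroids = [[0, 0], [1, 0], [10, 0], [11, 0]]
    print(local_morans_i(effects, centroids))
```

In the hypothetical example each unit sits next to a neighbour with a similar value, so all four local values come out positive, which is the clustering signature described in the text.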
4. Analysis

The Glasgow SAR district was chosen to test the methodology outlined above, as it was known to be an area in which strong scale effects could be seen. It will be contrasted with the Reigate and Ribble SAR districts, which were identified as less susceptible to MAUP (scale) effects (Manley & Flowerdew, 2003). Reigate was chosen in part because Tranmer and Steel (2001) used it as an example, and Ribble because it was known to include areas of different settlement pattern. The variables used are the percentage of population living in a household who are local authority renters (denoted as RLA) and the percentage of population who are female (denoted as female). The latter includes population living in institutions. These were chosen because of the known differences between their data structures: renting from local authorities tends to occur in spatially compact estates, whereas in most areas the distribution of males and females is very similar. In all cases except Fig. 1, which uses class intervals based on quartiles to map percentage data, the figures present the $\hat{u}_g$ values and the standardised local I values using standard deviations as break points for the numerical groups.

The IAC values (denoted in Section 2.4 as $\lambda^{(2)}$) can be used as an assessment of the magnitude of the scale effect, and generally the higher the IAC value, the higher the scale effect. The Glasgow female data has an IAC value of 0.0007, while the Glasgow RLA data has an IAC value of 0.627 (Table 1). Clearly, therefore, it would be expected that the RLA variable would exhibit larger scale effects than the female variable, which will be used as a comparison. These values also provide information about the relationships between EDs within SAR districts, which can be seen to be very strong in the RLA data, and relatively weak in the female data.
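To see how these two IAC values feed through the shrinkage weights of Section 3.1, the following sketch evaluates the weight and the area effect estimate for a single ED. It is an illustration only: the function names are hypothetical, the ED size of 400 observations is an assumed figure rather than one from the paper, and the overall mean defaults to a population-weighted mean of the ED means, which is also an assumption.

```python
import numpy as np


def shrinkage_weights(n_g, iac):
    """w_g = n_g * IAC / (1 + (n_g - 1) * IAC), as given in Section 3.1."""
    n_g = np.asarray(n_g, dtype=float)
    return n_g * iac / (1.0 + (n_g - 1.0) * iac)


def area_effect_estimates(ed_means, n_g, iac, overall_mean=None):
    """Estimated area effect: w_g * (ED mean - overall mean)."""
    ed_means = np.asarray(ed_means, dtype=float)
    n_g = np.asarray(n_g, dtype=float)
    if overall_mean is None:
        # Assumption: population-weighted mean of the ED means.
        overall_mean = np.average(ed_means, weights=n_g)
    return shrinkage_weights(n_g, iac) * (ed_means - overall_mean)


if __name__ == "__main__":
    n_obs = 400  # hypothetical ED size, not taken from the paper
    for name, iac in [("female", 0.0007), ("RLA", 0.627)]:
        print(name, float(shrinkage_weights(n_obs, iac)))
```

With a very small IAC the weight is well below one, so ED means are pulled strongly towards the overall mean; with a large IAC the weight approaches one and the ED deviations are left almost unchanged.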
Fig. 1. Percentage of housing rented from local authorities (RLA) in the Glasgow SAR.
Table 1
IAC and global Moran’s I values (unstandardised and standardised) for the Glasgow variables

Variable   IAC      Global I   Z score
Female     0.0007   0.011      15.99
RLA        0.627    0.015      78.25
Although the main focus here is on the relationships between areal units at the local level, it is nevertheless useful to present the global Moran’s I values to gauge the level of overall spatial autocorrelation. For the Glasgow female data, the I value is 0.011. Although low, it is still significantly different from spatial randomness, as shown by the associated Z score. The fact that the I value is non-zero may reflect the presence of old people’s homes, which tend to be predominantly occupied by females because of men’s lower life expectancy. Moreover, it is low compared with the global I for the Glasgow RLA data, which is 0.015 and is far more significant on the standardised Z score, which may be compared to a normal distribution with a mean of 0 and a variance of 1.

To determine the area effect estimates, the $\hat{u}_g$ values were calculated for these two variables, and the resulting spatial patterns can be seen in Figs. 2 and 3. These maps provide confirmatory information that supports the global Moran’s I values presented in Table 1, as it is apparent that there is more spatial pattern shown by the $\hat{u}_g$ values for the RLA variable than by the values for the female variable. Indeed the female variable (Fig. 2) looks
Fig. 2. The $\hat{u}_g$ values for the female variable by ED in Glasgow.
Fig. 3. The $\hat{u}_g$ values for the RLA variable, by Glasgow EDs.
fairly similar to spatial independence (see Goodchild, 1986). Local spatial autocorrelation measures will further demonstrate the validity of this conclusion.

Originally, the female variable was introduced to provide a contrast to RLA, as a variable for which a strong spatial pattern would be unlikely. The interpretation of the local Moran’s I analysis suggests that this hypothesis is correct (see Fig. 4). The majority of the I values can be seen to be around zero, suggesting that the distribution of the data is largely random with little or no spatial process. Despite this there are some extreme values in the representation, with the I value ranging from −22 up to over 41, which equates to a range of over 2.5 standard deviations about the mean. This large range is not so surprising considering that there are over 5000 observations in the Glasgow data set. However, although this is a large range, greater than that observed for the RLA data (see below), many of the values are clustered around the 0–5 range, as can be demonstrated by the upper category of 5–41 covering one standard deviation of the data. Furthermore, the areas with the higher local I tend not to be grouped together in large clusters, and can, therefore, be regarded as insignificant in relation to the identification of large-scale spatial patterns beyond the ED level. They might however represent similarities at a scale slightly larger than EDs but smaller than wards.

The second variable considered was RLA (renters from the local authority), and it is clear from Fig. 5 that there are strong spatial processes operating. Compared with the female variable, there are more groups of EDs visible with the RLA data. This indicates the existence of spatial processes at a larger scale. It is possible to define some areas with clusters of EDs that exhibit high positive spatial autocorrelation. These can be seen in the darker grey areas around the river area in the western side of the SAR district, and also in the northeastern edges.
These would suggest areas that could be
D. Manley et al. / Comput., Environ. and Urban Systems 30 (2006) 143–160
Fig. 4. Standardised local Moran’s I showing some central spatial structure in the female variable at the ED level for Glasgow SAR district.
Fig. 5. Standardised local Moran’s I showing spatial clustering operating above the ED level in the RLA variable.
grouped together to form homogeneous groups at a level above the ED aggregation. It is also possible to determine areas that exhibit relatively large values of negative spatial autocorrelation. Indeed, the areas surrounding the high clusters can be identified as areas that exhibit high negative spatial autocorrelation. Such areas can be regarded as boundaries between areas subject to different processes.

The analysis of the area effect estimates of the EDs with the local Moran’s I enables inference about the processes above the ED scale. As can be seen in Fig. 6(a), some of the wards are composed of groups of EDs that reflect the nature of the spatial processes present in the data. These groups of EDs could be described as homogeneous with respect to the variable concerned. Fig. 6(b), on the other hand, shows that other ward boundaries do not reflect these processes, with the consequence that the wards are relatively dissimilar in composition. These wards could thus be described as heterogeneous. The fact that the Glasgow SAR district is composed of wards with differing levels of homogeneity, both between and within wards, is suggested as a potential cause of the scale effects seen in the MAUP.

The calculations for $\hat{u}_g$ and the local Moran’s I were repeated for the RLA variable in Reigate SAR district. The IAC value and the global I are reported to provide supplementary information about the data (see Table 2). The IAC value is much lower than that of the Glasgow RLA data, and the global I is also relatively low. Using the IAC measure as defined by Tranmer and Steel (2001), it would be expected that RLA data in Reigate are less susceptible to MAUP (scale) effects than the Glasgow SAR data. We suggest, therefore, that calculation of the local Moran’s I statistics will identify fewer groups of EDs showing the effects of location-specific spatial processes. Fig. 7 shows the $\hat{u}_g$ values for the RLA variable, and they have a smaller range of values
Fig. 6. Both maps depict the local Moran’s I in different parts of the Glasgow SAR district.
Table 2
IAC and global Moran’s I values for the Reigate data

              IAC     Global I    Z score
Reigate RLA   0.19    0.0005      0.71
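For readers who wish to reproduce summary figures of the kind reported in the table above, the following is a minimal sketch of a global Moran’s I with a z score. It assumes the variance under the normality assumption given by Cliff and Ord (1973) and an arbitrary spatial weights matrix supplied by the user; which null hypothesis and which weights underlie the reported Z scores is not stated here, so this is an illustration and not a reconstruction of the authors’ calculation. The function name is hypothetical.

```python
import numpy as np


def global_morans_i(values, weights):
    """Return (I, expected I, z score), assuming the normality null."""
    x = np.asarray(values, dtype=float)
    w = np.asarray(weights, dtype=float)
    n = x.size

    z = x - x.mean()                       # deviations from the mean
    s0 = w.sum()                           # sum of all weights
    i_obs = (n / s0) * (z @ w @ z) / (z @ z)

    expected = -1.0 / (n - 1)              # E[I] under no autocorrelation

    # Weight-matrix summaries used in the variance formula.
    s1 = 0.5 * ((w + w.T) ** 2).sum()
    s2 = ((w.sum(axis=1) + w.sum(axis=0)) ** 2).sum()

    # Variance of I under the normality assumption (an assumed choice).
    variance = ((n ** 2 * s1 - n * s2 + 3 * s0 ** 2)
                / ((n ** 2 - 1) * s0 ** 2)) - expected ** 2

    return i_obs, expected, (i_obs - expected) / np.sqrt(variance)


# Hypothetical example: four areas in a row with binary contiguity weights.
adjacency = np.array([[0, 1, 0, 0],
                      [1, 0, 1, 0],
                      [0, 1, 0, 1],
                      [0, 0, 1, 0]])
i_obs, expected, z_score = global_morans_i([1.0, 1.0, 5.0, 5.0], adjacency)
```

A z score near zero, as for Reigate, indicates that the global statistic is indistinguishable from its expectation under spatial independence, although local clusters may still be present.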
Fig. 7. Reigate SAR district $\hat{u}_g$ values for the RLA variable at the enumeration district level.
Fig. 8. Local Moran’s I for the Reigate SAR district RLA data.
than observed in the equivalent estimates for RLA in the Glasgow SAR district (0.4 to 0.5 versus 0.2 to 0.6, see Fig. 3). Using the local Moran’s I values, few clusters of EDs are visually observable (see Fig. 8). The RLA variable in the Reigate SAR district has less spatial autocorrelation, as measured by the IAC values, than the RLA variable in the Glasgow SAR district. The range of the standard deviations around the mean is of a similar magnitude to that observed in the female data for Glasgow, although the numerical range is not as great. There are still some areas where high positive local I values are found, showing the presence of spatial processes. Thus, although there is more homogeneity and therefore similarity in the Reigate SAR district, as measured by the IAC, there are a few pockets of dissimilarity as well, which could depict processes operating at the ED level. Consequently, it is difficult to determine boundaries at a ward level of aggregation that could be used to construct homogeneous populations in the RLA variable.

Table 3
IAC and global Moran’s I values for the Ribble data

             IAC     Global I    Z score
Ribble RLA   0.28    0.011       3.91

The final SAR district analysed was Ribble, Lancashire. The Ribble SAR district includes a large rural area and several smaller towns, which contrasts with the urban Glasgow SAR district and the suburban Reigate SAR district. The IAC value for RLA in the
Fig. 9. Ribble SAR district $\hat{u}_g$ values, which are lower than in the previous two examples.
Ribble SAR district is between those of Reigate and Glasgow. Therefore, it would be expected that fewer clusters of EDs would be identifiable than in Glasgow, but more than in Reigate. Table 3 presents the IAC and global I values.

Fig. 9 shows the $\hat{u}_g$ estimates for Ribble SAR district. Compared to the other SAR areas discussed above, the area effects in the Ribble SAR are relatively constant, with most of them lying in the categories ranging from −0.07 to −0.06 and from −0.06 to 0.06. The positive values of ED effects are largely concentrated in the urban areas.

Fig. 10 shows the local Moran’s I values within the Ribble SAR for the RLA variable. As with the Reigate SAR, many of the I values are not significant, and therefore do not highlight spatial processes that can be identified using the EDs. However, in the southwestern side of the SAR district there is an area of significant clustering. This area is unlike much of the rest of the SAR district, and consists of villages and small towns. The larger town of Leyland is also located here. Sizeable housing developments are more likely here, which in turn would define a spatial process such as the one highlighted. It is also interesting to note that there is relatively little negative spatial autocorrelation within the area, indicating that the autocorrelation present reflects similarity rather than difference.

In the Ribble SAR district it is not possible to identify distinct sets of EDs where higher-level spatial processes, such as those that would coincide with a ward or higher geography, are occurring. It is possible that the spatial processes extend further than the data presented here or are too local in their nature to be identified by an analysis using the EDs as basic spatial units. The spatial processes that
Fig. 10. Local Moran’s I for the Ribble SAR district RLA data.
have been identified here accord with the IAC value, which is between those of the Glasgow and the Reigate SAR districts.

5. Conclusions

It has been shown that although an aggregation level (EDs or wards in our case) is presented as a homogeneous set of areal units, the reality is that an aggregation level may be affected by processes operating at vastly different scales. Two variables have been used, demonstrating that different variables act in different manners. Thus, the processes that operate for certain units are specific to a certain variable. It is clear that it is not possible to define an ideal single census geography that captures all the processes for all variables. Through necessity, the census geography that is provided must be a compromise.

However, it is possible, using this methodology, to assess the applicability of the census geography with which we are faced. Moreover, a means is provided for users of areal data to be aware of the extent to which spatially defined processes operate in the data that they are using. Although it does not provide a quantification of these processes, it does at least enable analysts to show graphically that they exist.

The methodology demonstrates a useful extension of the MLM technique, and allows it to be applied to spatial data processes that do not necessarily conform to the strict standard aggregation patterns. Moreover, the information relating to the spatial structure of the data could potentially be used in the formation of areal units from BSUs. The resultant census geographies would be better able to reflect some of the processes that have generated the data under analysis. The next phase of this research is to test the methodology with a larger number of areas and variables, and to extend it to identify spatial processes relating to two or more variables.

Acknowledgements

The census data used in this study, including the Household Sample of Anonymised Records, are Crown Copyright. They were bought for academic use by the ESRC/JISC/DENI and are held at the Manchester Computing Centre. Digital boundary data for Great Britain were also purchased by ESRC for the academic community. Access was obtained via the UKBORDERS service at the University of Edinburgh. An initial version of this paper was presented at the GISRUK 2003 conference at City University. The authors would like to thank the referees for their helpful comments.

References

Anselin, L. (1995). Local indicators of spatial association—LISA. Geographical Analysis, 27(2), 94–114.
Cliff, A., & Ord, J. K. (1973). Spatial autocorrelation. London: Pion.
Denham, C. (1993). Census geography. In A. Dale & C. Marsh (Eds.), The 1991 Census user’s guide. London: HMSO.
Flowerdew, R., Geddes, A., & Green, M. (2001). Behaviour of regression models under random aggregation. In N. J. Tate & P. M. Atkinson (Eds.), Modelling scale in geographical information science (pp. 89–104). Chichester: Wiley.
Fotheringham, A. S., & Wong, D. W. S. (1991). The modifiable areal unit problem in multivariate statistical analysis. Environment and Planning A, 23, 1025–1044.
Gehlke, C. E., & Biehl, K. (1934). Certain effects of grouping upon the size of the correlation in census tract material. Journal of the American Statistical Association, 29(185, Special Supplement), 169–170.
Getis, A., & Ord, J. K. (1996). Local spatial statistics: an overview. In P. Longley & M. Batty (Eds.), Spatial analysis: Modelling in a GIS environment (pp. 261–277). Cambridge: GeoInformation International.
Green, M., & Flowerdew, R. (1996). New evidence on the modifiable areal unit problem. In P. Longley & M. Batty (Eds.), Spatial analysis: Modelling in a GIS environment (pp. 41–54). Cambridge: GeoInformation International.
Goldstein, H. (2003). Multilevel statistical models (3rd ed.). London: Arnold.
Goodchild, M. F. (1986). Spatial autocorrelation. Concepts and techniques in modern geography 47. Norwich: Geo Books.
Haining, R. (2003). Spatial data analysis: Theory and practice. Cambridge: Cambridge University Press.
Holt, D., Steel, D. G., & Tranmer, M. (1996). Area homogeneity and the modifiable areal unit problem. Geographical Systems, 28(3), 181–200.
Jones, K. (1991). Multi-level models for geographical research. Concepts and techniques in modern geography 55. Norwich: Environmental Publications.
Manley, D., & Flowerdew, R. (2003). Scale effects in UK census data. Paper presented at AC2003 RGS/IBG conference.
Openshaw, S. (1984). The modifiable areal unit problem. Concepts and techniques in modern geography 38. Norwich: Geo Books.
Openshaw, S., & Rao, L. (1995). Algorithms for re-engineering 1991 census geography. Environment and Planning A, 27, 425–446.
Openshaw, S., & Taylor, P. J. (1979). A million or so correlation coefficients: three experiments on the modifiable areal unit problem. In N. Wrigley (Ed.), Statistical applications in the spatial sciences (pp. 127–144). London: Pion.
Openshaw, S., & Taylor, P. J. (1981). The modifiable areal unit problem. In R. J. Bennett & N. Wrigley (Eds.), Quantitative geography (pp. 60–69). London: Routledge & Kegan Paul.
Steel, D. G., & Holt, D. (1996). Analysing and adjusting aggregation effects: the ecological fallacy revisited. International Statistical Review, 64(1), 39–60.
Steel, D. G., Holt, D., & Tranmer, M. (1996). Making unit-level inferences from aggregated data. Survey Methodology, 22(1), 3–15.
Tobler, W. (1970). A computer movie simulating urban growth in the Detroit region. Economic Geography, 46(2, Suppl.), 234–240.
Tranmer, M. (1999). Using census data to investigate the multilevel structure of local populations. Department of Social Statistics. Southampton, University of Southampton: 300.
Tranmer, M., & Steel, D. G. (2001). Using local census data to investigate scale effects. In N. J. Tate & P. M. Atkinson (Eds.), Modelling scale in geographical information science (pp. 105–122). Chichester: John Wiley and Sons.