Spatial and Spatio-temporal Epidemiology 1 (2009) 73–84
Contents lists available at ScienceDirect
Spatial and Spatio-temporal Epidemiology journal homepage: www.elsevier.com/locate/sste
Linking health and environmental data in geographical analysis: It’s so much more than centroids Linda J. Young a, Carol A. Gotway b,*, Jie Yang a, Greg Kearney c, Chris DuClos c a
Department of Statistics, University of Florida, Gainesville, Fl, USA Office of Workforce and Career Development, Centers for Disease Control and Prevention, Atlanta, GA, USA c Florida Department of Health, Division of Environmental Health, Tallahassee, FL, USA b
a r t i c l e
i n f o
Keywords: Geographically weighted regression Spatial support Block kriging Spatial regression Uncertainty analysis
a b s t r a c t Programs and studies increasingly use existing data from multiple sources (e.g., surveillance systems, health registries, or governmental agencies) for analysis and inference. These data usually have been collected on different geographical or spatial units, with each varying from the ones of interest. Combining such disparate data creates statistical challenges. Florida’s efforts to move toward implementing the Centers for Disease Control and Prevention (CDC)’s Environmental Public Health Tracking (EPHT) program aptly illustrate these concerns, which are typical of studies designed to measure the association between environmental and health outcomes. In this paper, we develop models of spatial associations between myocardial infarctions (MIs) and ambient ozone levels in Florida during August 2005 and use these models to illustrate the problems that can occur when making inferences from aggregated data, the concept of spatial support, and the importance of correct uncertainty assessment. Existing data on hospital discharges and emergency department visits were obtained from Florida’s Agency for Health Care Administration. Environmental data were obtained from Florida’s Department of Environmental Protection; sociodemographic data were obtained from the US Census Bureau; and data from CDC’s Behavioral Risk Factor Surveillance System were used to provide additional information on other risk factors. We highlight the opportunities and challenges associated with combining disparate spatial data for EPHT analyses. We compare the results from two different approaches to data linkage, focusing on the need to account for spatial scale and the support of spatial data in the analysis. We use geographically weighted regression, not as a visual mapping tool, but as an inferential tool designed to indicate the need for spatial coefficients, a test that cannot be made by using the majority of Bayesian models. Finally, we use geostatistical simulation methods for uncertainty analysis to demonstrate its importance in models with predicted covariates. Our focus is on relatively simple methods and concepts that can be implemented with ESRI’sÒ ArcGISÒ software. Published by Elsevier Ltd.
1. Introduction Drinking water quality, ozone levels in air, and proximity to nuclear waste disposal sites are but a few of the topics that cause people to become concerned regarding how the environment can affect human health. The Centers for Disease Control and Prevention (CDC) created the Environ* Corresponding author. E-mail address:
[email protected] (C.A. Gotway). 1877-5845/$ - see front matter Published by Elsevier Ltd. doi:10.1016/j.sste.2009.07.008
mental Public Health Tracking (EPHT) program to trace exposure and health effects that might be related to environmental hazards. Initially funded by the US Congress in 2002, the program moved into the implementation stage with 16 participating states and one health department in 2006. The state-based systems are to feed into the national tracking network (CDC, 2007). Information on health and the environmental hazards and exposures has long been collected, and EPHT is relying on the information from these ongoing efforts.
74
L.J. Young et al. / Spatial and Spatio-temporal Epidemiology 1 (2009) 73–84
Dissemination of EPHT information is primarily through websites developed by each participating state. States provide summary data in tabular format and maps displaying health and environmental data, typically for each county. This is no small challenge. Obtaining all necessary data usually requires multiple interagency agreements. Moreover, health data must be provided and displayed in a manner that honors individuals’ privacy. Environmental data are often geographically sparse and rarely measured at the same place or time as health data. As standards for a unified national reporting system are being developed, the focus has turned to statistical inference based on combining data from disparate sources, each source associated with different concepts of place and geographic location. In this paper, we discuss key inferential concepts associated with environmental health studies. In particular, we use data collected in support of the EPHT program to illustrate crucial points regarding the majority of medical geography studies in which place and geographic variation are important in assessing public health. First, we compare a traditional centroid-based geographic analysis to one that incorporates the spatial support (i.e., geometric size, shape, and spatial orientation of the units) of the data, which permits a broader interpretation of place. Second, we illustrate the need for uncertainty assessment and discuss geostatistical tools that can be useful in performing uncertainty analysis within a geographic information system. Finally, in the context of geographically weighted regression models, we demonstrate how the impact of uncertainty in explanatory variables can be visualized and how this helps to refine conclusions inferred from maps of spatially varying regression coefficients. Although our focus is on data collected in support of the EPHT program, the methodology, concepts, and key ideas we present pertain to the majority of studies linking health with place.
2. The study and supporting data As part of its EPHT effort, Florida has been working to develop models of the spatial and temporal association of ambient ozone and myocardial infarction (MI), two of the program’s core measures. Note that the purpose of EPHT is not to conduct specific studies on the health impacts of environmental exposures, but to monitor associations between them. Data used in this effort have been gathered from four different sources. Ozone measurements, recorded from a network of air monitors placed throughout the state, were obtained from the Florida Department of Environmental Protection (FDEP). The number of ozone monitors within the network varies with time, with new monitors being added and selected sites being closed. For example, during the August 2005 study period, ozone data were available from 56 monitors (Fig. 1). Air-monitoring devices are used primarily for regulatory purposes and tend to be concentrated in areas with larger populations and larger population densities. Sometimes a monitor malfunctions and data are not recorded for an entire day or for consecutive days. An approximate 3-month lag exists between data collection and completion of quality assurance for these data.
Fig. 1. Florida Counties and locations of air-quality monitors.
A data-sharing agreement (DSA) with Florida’s Agency for Health Care Administration (AHCA) provided us with access to two data sources: confidential hospitalization records and emergency department (ED) records. Through the hospitalization records, all admissions to Florida’s public and private hospitals where either the primary or secondary cause of admission was MI (International Classification of Diseases, 10th revision [ICD-10] codes 410.0–414.0; World Health Organization, 2005) were obtained. If a person presents to the ED and is subsequently admitted to the hospital, that person’s record is included in the hospitalization records, but not the ED records. Thus, ED records provided all admissions to Florida’s public and private hospitals where either the primary or secondary cause of admission was MI, and the person was released without hospitalization. Because of the nature of MI, a limited number of cases were obtained from ED records; the majority came from the hospitalization records. Non-Florida residents were excluded from the analysis. Consistent with our DSA, ACHA provided both the zip code and county of residence for each patient’s record. As is typical for hospital records, the patient’s address is collected only for billing purposes and is not released. Selected patient demographic information is also recorded, including sex, age, and race/ethnicity. To assess whether an environmental factor has triggered the health event leading to hospitalization, the time of admission is the best available date for that event. ACHA data are made available 3–6 months after the end of each quarter, although hospitals might revise figures for 61 year. Selected sociodemographic data (age, race/ethnicity, sex, income, percentage of home ownership, and education) were obtained from the US Census Bureau. Sociodemographic variables often provide an indication of health status. For example, lower educational levels, lower income levels, smoking, and alcohol use are associated with multiple adverse health conditions, making inclusion of such covariates in the analysis necessary. Census data can be obtained for different geographic units, including census tracts, census blocks, zip code tabulation areas (ZCTA), counties, and states. The census is conducted every
L.J. Young et al. / Spatial and Spatio-temporal Epidemiology 1 (2009) 73–84
10 years, but estimates at the state and county level are available on an annual basis. Data from CDC’s Behavioral Risk Factor Surveillance System (BRFSS) provided additional potential covariates (e.g., the percentage of smokers in each county). BRFSS is a state-based system of health surveys that collect annual information on health risk behaviors, preventive health practices, and health-care access, primarily related to chronic disease and injury (Centers for Disease Control and Prevention, 2006). 3. Linking the data Each of the diverse data sources described previously provides different information relevant to the association between MI and ozone. In a traditional epidemiologic analysis, regression models might be used to quantify this association. Unfortunately, the data available for this analysis are associated with different geographic units and are recorded at different times. The air-quality monitors measure ozone levels at specific times at geographic locations georeferenced by longitude and latitude. The hospital admissions data provide date of admission and individual-level information concerning MI cases, as well as certain demographic variables (age, race/ethnicity, sex) associated with these cases. However, they are georeferenced only to the zip code or county level and not to a specific location. Census data provide annual measures of aggregate sociodemographic information related to the census block, census tract, ZCTA, or county levels, and BRFSS provides additional sociodemographic information aggregated to the county level on an annual basis. If we want to combine the information and explore the associations among these variables, we need to formally link them in a manner that will lead to valid statistical inference. Before data linking can be accomplished, a common geographic and temporal unit for analysis should be selected. To determine that unit of analysis, we focused on the MI data. We wanted the smallest possible geographic and temporal units while still satisfying confidentiality constraints; the confidentiality agreement with ACHA required at least six cases per reporting unit. Consequently, a tension existed between these two goals of limited geographic and temporal units. As we reduced the size of the geographic units from counties to zip codes, we lengthened the temporal unit, moving from months to quarters, to satisfy confidentiality constraints. Similarly, as we reduced the temporal unit (e.g., going from quarters to months), the size of the geographic units had to be increased to respect confidentiality. From a surveillance viewpoint, a tradeoff occurs between alerting every anomaly, which can happen when the temporal unit is too limited, and missing meaningful indicators, which can happen with broad aggregation. Through working with health professionals in the Florida Department of Health (FDOH), we determined that a county-level unit of analysis using monthly data would be the finest resolution that would satisfy the cell suppression and confidentiality concerns with the hospitalization data from ACHA, but still permit meaningful geographic and temporal analysis of the MI-ozone association.
75
Sociodemographic variables from the census and BRFSS were reported at the county level, but only annually. One option for obtaining monthly values is to use statistical forecasting methods to predict monthly values from yearly reports. However, because the rates are relatively stable throughout the course of a year and no trend is apparent, we used the same yearly values for each month. This might not be a satisfactory solution over a longer period. The MI data were indirectly standardized by age (645, 45–55, 55–65, and >65 years), race/ethnicity (black, white, or other), and sex (female or male) (Woodward, 2004, p. 181). The information regarding age, race/ethnicity, and sex for the MI cases were obtained from the hospital records. Florida’s population was used as the comparison standard to calculate the number of expected MI cases. This gave an MI standardized event ratio (MI SER), defined as the ratio of the number of observed MI cases to that expected among the Florida population, for each county and each month. These are displayed in Fig. 2. Having determined the geographic and temporal units for analysis, our attention turned to linking all the variables to these common units. First, we considered the temporal scale. Whereas both acute and chronic exposure to particulate matter have been associated with MI (Pope, 1991; Dominici et al., 2006), ozone levels, as well as 1–3day lagged values, have been reported to be related to MI (Ruidavets et al., 2005; Vermylen et al., 2005). The EPA’s National Ambient Air Quality Standards (US Environmental Protection Agency, 2008). Here daily average ozone, which correlates with maximum 8-h average ozone, was recorded. Because ozone levels decline at night, daytime peaks might not be evident in daily averages. To avoid peak ozone levels being further reduced by averaging across days of the month, the maximum of the daily average ozone values during a month was used as the monthly data value for a particular monitor. Although the term maximum ozone is used throughout this text, readers are reminded that the largest maximum 8-h average observed during the month was not under consideration, but rather the largest 24-h average observed during the month. After obtaining monthly maximum ozone values for each monitor, we also needed an ozone value for each
Fig. 2. MI SER for each Florida County during August 2005.
76
L.J. Young et al. / Spatial and Spatio-temporal Epidemiology 1 (2009) 73–84
county in Florida. Because not all counties have ozone monitors, the intuitive approach of averaging the ozone values from monitors within a county (i.e., Bobak and Leon, 1999; Pope et al., 2009) is not sufficient for this study. Typically, when predicting environmental variables for a region based on observations recorded at point locations, the predicted value of the environmental variable at the region’s centroid is used to represent that area (e.g., Salam et al., 2005; Bell, 2006). Several methods, such as inverse distance weighting (Bell, 2006), interpolation (Salam et al., 2005), and air quality modeling (Bell, 2006), could be used to predict the environmental exposure at the centroid. Here we used the spatial prediction technique known as kriging to predict an ozone value at each county’s centroid (see Waller and Gotway, 2004), for more information on kriging). However, the centroid approach completely ignores the fact that the MI SERs are essentially averages across each county, whereas ozone levels predicted at county centroids gives point-level measurements and not spatial averages. This is related to the larger challenge of spatial support as described in the following section (Gotway and Young, 2002). 4. Spatial support The support of a spatial variable is an n-dimensional physical volume over which the value of the variable is measured or computed. It includes the geometric size, shape, and spatial orientation of the volume (Olea, 1991). For our study, for example, ozone was measured with 56 air monitors located throughout the state that were georeferenced to a location identified by a longitude and latitude coordinate. Thus, the ozone values have point support. In contrast, the sociodemographic variables (e.g., the percentage of smokers) and the MI SER are spatial averages across the counties and therefore have areal support. Linking point-level ozone measurements to county-level averages can lead to problems with statistical inference. Complications arise because (1) the MI SER data are averaged across counties, which are spatial units of different sizes, and (2) the ozone data have point support, but ozone exposure is needed for counties, which have areal support. Both of these situations lead to what is known as the change-ofsupport problem in spatial statistics: the problem of valid inference regarding a spatial variable of one support from data at different supports. Change-of-support problems include problems associated with changes in spatial scales, aggregation and disaggregation of spatial data, and inference between units of different supports. Change-of-support problems are becoming increasingly prevalent as the availability of spatial data increases and such tools as geographic information systems make working simultaneously with data from different sources relatively easy (see Gotway and Young, 2002 and Young and Gotway, 2007, for a more comprehensive discussion). The bias introduced by using measurements with point support to represent areal units is well-documented in the geostatistical mining literature. A simple example that illustrates this bias is provided by Armstrong (1998). The basic cause of the bias is the fact that the variability in statistical averages depends on the support of the areal units
and is considerably less than the variability in observations based on point-level supports. This is analogous to the fundamental central limit theorem in statistics, which derives the distribution of averages based on individual measurements. Thus, instead of using predictions of the ozone value at the county centroid, we used a geostatistical prediction method known as block kriging (Journel and Huijbregts, 1978; Cressie, 1993) to predict the ozone values for each county. To implement block kriging, a fine grid is placed across the state, and ordinary kriging is used with the observed ozone values to predict ozone values at the each point on the grid. Then the predicted ozone values at grid points within a county are averaged to obtain the predicted ozone value for that county. Thus, a predicted average ozone value, rather than a predicted single ozone value at a point (e.g., a county’s centroid) is obtained for each county. The details of this approach, including the integrals that the averages over grid points are approximating, are presented elsewhere (Journel and Huijbregts, 1978; Cressie, 1993; Gotway and Young, 2002). Returning to the MI study, we used block kriging to obtain a predicted average maximum ozone value for each county that accounts for the change of support from the points at which monitors are located to the counties. Comparing this map to the map of maximum ozone values predicted for each county at county centroids (Fig. 3), the predicted values from block kriging appear smoother than those based on point predictions at centroids. Evidence also exists of autocorrelation in the smoothing effect of the support-adjusted predictions; that is, there is a greater tendency of counties close together to have ozone predicted values more alike than counties further apart for the block kriging method than for the centroid approach. The variability in predicted ozone values, both absolute and geographic, is considerably different between the two maps. This difference in variation will affect any subsequent analysis linking predicted ozone values to MI SER. Both sets of predicted ozone values were computed for each month and then linked to the monthly data on MI SERs and the social and demographic variables for each Florida County. After the data have been linked, the focus becomes the analysis and inference. 5. Analytical methods The traditional analytical approach, referred to here as global regression, is to conduct a multivariate linear regression analysis relating relative MI SER to ozone levels with adjustments for sociodemographic variables (e.g., education, income, and percentage of smokers). However, just as ozone levels and the number of MI cases can vary across the state, the relative MI SER also varies across the state. Hastie and Tibshirani (1993) introduced varying coefficient models, a class of regression and generalized regression functions in which the coefficients are allowed to vary as smooth functions of other variables. Müller (1998) adapted this idea to the spatial case and referred to the approach as local regression, the term used here. Apparently independently, Brunsdon et al. (1996) and Fotheringham et al. (2002) adapted the idea of varying coefficient models to
L.J. Young et al. / Spatial and Spatio-temporal Epidemiology 1 (2009) 73–84
77
Fig. 3. Predicted August average of the daily maximum ozone by using (a) centroids and (b) the support-adjusted approach (b).
the spatial case and called their method geographically weighted regression (GWR). More generally, when regression coefficients are assumed to vary smoothly across a space, the models are referred to as spatially varying coefficient models (Congdon, 2003, 2006; Gelfand et al., 2003). The local regression model relating the transformed MI SER of county i to the predicted maximum ozone value for county i, adjusting for covariates, is expressed as
lnðSERi Þ ¼ b0 ðsi Þ þ b1 ðsi Þxi1 þ þ bk ðsi Þxip þ ei ; i ¼ 1; 2; . . . ; n where SERi is the MI SER for county i; si is a vector of twodimensional coordinates indexing county i; xij; j = 1, 2, . . ., p 1, are covariates associated with county i; xip is the ozone level for county i; and ei is the error term with mean zero and variance r2i . The coefficients bj are fixed parameters that must be estimated at each spatial location si; that is, a separate regression is conducted for each of Florida’s 67 counties. To estimate the bj’s, ideas from local smoothing and kernel regression are used to define spatial neighborhoods. The regression is performed by using only data in the spatial neighborhoods. As a consequence, the error terms are not necessarily constant for all locations. Because the spatial neighborhoods associated with different points in space overlap, the same data are used more than once to estimate all the spatial regression parameters bj. The geographic weights for the local regression at county i gradually decrease as the distance between the centroids of counties i and j increases. The traditional choice of weighting function is the Gaussian kernel function,
2 2 wij ¼ exp dij =h
ð1Þ
where dij is the distance between the centroids of counties i and j, and h is the bandwidth. We used intercentroid distance between the counties for ease of computations, but a support-averaged distance measure can also be used.
The MI SER was log-transformed [denoted by ln(SER) because the natural logarithm was taken] so that the assumptions of linear regression (normality and constant variance) were more nearly met. By using local regression methodology, a weighted regression relating ln(SER) to ozone, adjusting for sociodemographic variables, was conducted by using two weights. The first was the expected MI SER for county i, the same weight that would be used in the traditional regression analysis so that the assumptions of linear regression (normality and constant variance) were more nearly met. The second weight was the weight given to a county on the basis of its proximity to the county of interest as given in Eq. (1). A search for the best bandwidth was conducted with potential bandwidths as large as half of the maximum distance between any two counties with a tolerance distance for the search of 0.01. The bandwidth resulting in the smallest sum of squared prediction error from cross-validation was chosen. Potential covariates from census and BRFSS data, including education, percentage of smokers, income level, and percentage of home ownership, were evaluated during the modeling process, and the coefficient associated with ozone, after adjusting for all other covariates in the model, was exponentiated to obtain the relative MI SER associated with ozone (Woodward, 2004). The relative MI SER associated with ozone for each county from the local regression model is displayed in Fig. 4 for (a) the centroid approach and (b) the support-adjusted approach. Although similar, we can see the effect of geographic variation in the predicted ozone values on the geographic variation of the estimates of relative MI SER. The smoothing and autocorrelation in the support-adjusted ozone predictions has also resulted in smoothing and autocorrelation in the relative MI SER estimates. Moreover, the interpretations of the two maps differ. With the centroid approach (a), the panhandle (including Panama City and Tallahassee) appears to be the area of greatest concern, whereas with the support-adjusted approach (b), concern is somewhat moderated, but with both the
78
L.J. Young et al. / Spatial and Spatio-temporal Epidemiology 1 (2009) 73–84
Fig. 4. Predicted relative MI SER during August 2005 by using (a) centroid-based ozone predictions and (b) a support-adjusted approach to predict ozone.
panhandle and the southeastern region of Florida having relative MI SERs that are >1.0. 6. Assessing uncertainty The relative MI SER maps depicted in Fig. 4 cannot be interpreted without accounting for the uncertainty in the relative MI SER estimates. Traditionally, this is done by computing the standard errors associated with the estimated relative MI SER. Maps of these standard errors are displayed for both the centroid and support-adjusted approaches in Fig. 5. These maps depict two key considerations that need to be made when interpreting the relative MI SER maps in Fig. 4. First, the relative MI SER estimates associated with counties in the panhandle area have the greatest uncertainty associated with them. Thus, one of the areas with relatively high MI SER estimates is also one with a relatively high uncertainty in the estimates. Second, the inher-
ent difference between predicting ozone values at centroids (points) and predicting support-adjusted ozone values (spatial averages) for use in regression is clearly reflected in the standard error maps of the relative MI SER estimates. The standard errors for the relative MI SER estimates based on support-adjusted ozone predictions are lower and exhibit less geographic variation than those based on predicting ozone values at centroids (points). We might expect greater variation in ozone values by using the centroid approach because values with point-support will have greater variability (both absolute and geographic) than the spatial averages derived from them. Unfortunately, simultaneous interpretation of the risk estimates and their standard errors can be difficult. Another way to assess the same information is through the p values associated with testing whether the relative MI SER estimates are significantly different from 1.0 (no association). Although maps of p values are not entirely accurate as a result of multiple tests on the same set of data
Fig. 5. Standard errors of estimated relative MI SERs during August 2005 by using (a) the centroid approach and (b) the support-adjusted approach to predict ozone values for each county.
L.J. Young et al. / Spatial and Spatio-temporal Epidemiology 1 (2009) 73–84
79
Fig. 6. p Values associated with testing relative MI SER during August 2005 by using (a) the centroid approach and (b) the support-adjusted approach to predict ozone values for each county.
(and on data that are likely spatially related), comparing the p values from the centroid-based approach to those obtained from the support-adjusted approach can illustrate the effects of these two different approaches on the relative MI SER estimates and the conclusions derived from them (Fig. 6). These maps, too, must be interpreted in the context of the relative MI SER maps presented in Fig. 4. Using the centroid-based approach, the relative MI SER associated with ozone in the panhandle area is substantially high, as is the risk in an area extending from New Smyrna Beach to Palm Beach on the Atlantic Coast and extending from there to counties in the center of the state. Making the same interpretation for the support-adjusted approach, the risk associated with counties in the panhandle is not statistically significant, but a larger area in southeastern Florida has significantly higher risk for MI associated with ozone. Thus, the interpretation of our results differs considerably, depending on the method used to predict county-level ozone values. In the spatial regression model described in the previous text, the explanatory variables are assumed to be measured without error. However, predicting ozone, either at county centroids or as county-level averages as in the support-adjusted approach, introduces uncertainty in the explanatory variables used in the regression model. Thus, two sources of uncertainty are associated with the estimated relative MI SER values: the estimation uncertainty associated with estimation of the regression coefficients, which the standard error and p-value maps previously presented adequately quantify, and the prediction uncertainty associated with using predicted ozone values in the regression model instead of known values. This latter source of uncertainty is not depicted in any of the results presented previously. Although computing the standard error of the predicted ozone values is possible, incorporating uncertainty measured by these standard errors into a regression model is difficult. Alternatively, a geostatistical simulation approach is used here to characterize the uncertainty associated with the predicted ozone values and the resulting
variability in the association between ozone and the relative MI SER (Journel, 1974; Gotway and Rutherford, 1994; Van Horssen et al., 2002). Using the original ozone values from the 56 air monitors, we used the geostatistical simulation technique known as LU decomposition (Davis, 1987) to generate realizations of the ozone surface that have the same statistical and spatial properties as the original ozone data. In addition, we force the simulated ozone surfaces to pass through the values recorded at each air monitor. The subsequent conditional simulation generates ozone surfaces that reflect the spatial variability and uncertainty in the predicted ozone values, and each is a plausible rendition of the ozone surface, given the available data. For each 1000 conditional geostatistical simulations, the ozone value for each grid point was predicted on the basis of the estimated spatial autocorrelation function and the observed ozone values. Both point predictions at centroids and support-adjusted predictions of ozone values for each county were obtained as described earlier, giving a distribution of predicted ozone values for each county that defines the uncertainty distribution for predicted ozone. To assess the resulting variation in ozone values, we consider two different sets of maps. The first set, displayed in Figs. 7 and 8, illustrates the 0.025 and 0.975 quantiles of the predicted ozone distribution for each county. These maps represent a visual confidence interval for the true ozone values. To obtain an idea of the differences in each simulation, we randomly selected four simulated ozone surfaces for display in Fig. 9. Each of these is a plausible rendition of the true ozone surface, which reflects the statistical and spatial properties and the exact values of the ozone values recorded at the 56 monitor locations. Because of space limitations, we display only the results from the centroid approach. The results from the support-adjusted are similar, although, as illustrated previously, the support-adjusted ozone surfaces exhibit less geographic variation than those produced by using the centroid approach. To assess the effect of the uncertainty in predicted ozone on the results from local spatial regression, each
80
L.J. Young et al. / Spatial and Spatio-temporal Epidemiology 1 (2009) 73–84
Fig. 7. The (a) 0.025 and (b) 0.975 quantiles of the predicted ozone distribution for each county, using point predictions at centroids.
Fig. 8. The (a) 0.025 and (b) 0.975 quantiles of the predicted ozone distribution for each county, using support-adjusted predictions.
set of ozone predictions was used in the geographically weighted regression model described earlier. For each simulation, the association between relative MI SER and ozone was estimated by using GWR, and a 95% confidence interval was obtained by using the analytic standard error associated with the estimated coefficients of the spatial regression. Thus, we obtain 1000 plausible versions of the association between relative MI SER and ozone that reflect the uncertainty in the predicted county-level ozone values. Just as the predicted ozone varied with each simulation, the estimated relative MI SER and its standard error varied. Consequently, for each county, the conclusions with respect to the statistical significance of the association between relative MI SER and ozone were not the same for all simulations. Using a 5% significance level, the proportion of simulations indicating a significant association between relative MI SER and ozone was determined (Fig. 10). Care must be taken in interpreting these maps because this is a summary of multiple tests, which are likely correlated.
If no association exists between MI and ozone, we expect 5% of the simulations to indicate significance, on the basis of Type I error alone. Thus, counties with higher proportions of simulations indicating a significant relative MI SER tend to reflect counties for which the association between ozone and MI is highly significant. The two maps in Fig. 10 are surprisingly similar. Compared with the results from the original GWR analysis (Fig. 6), the conclusions after accounting for uncertainty in the ozone measurements are similar for the support-adjusted approach, with perhaps heightened concern for the region in the panhandle. In contrast, the region of concern identified by the centroid approach is greatly expanded after accounting for uncertainty in the ozone measurements. In the support-adjusted approach, the average ozone for the county is predicted. By looking at inference across the realizations from the uncertainty analyses, we are further averaging over potential realizations. Perhaps, through the uncertainty analysis, some of the strength of the averaging inherent in the support-adjusted approach
L.J. Young et al. / Spatial and Spatio-temporal Epidemiology 1 (2009) 73–84
81
Fig. 9. Simulated ozone maps based on point-level ozone predictions at county centroids.
Fig. 10. Proportion of simulations indicating a statistically significant relative MI SER for each county, using (a) the centroid approach and (b) the supportadjusted approach to predict ozone values for each county.
is realized by the centroid approach, causing the results to become similar. The reader should compare these maps with the maps in Figs. 4 and 6. If we simply based our conclusions on Fig. 4a, we might recommend further investigation of counties in the panhandle area. Incorporating uncertainty
resulting from the regression coefficients, as summarized in Fig. 6a, supports this conclusion, but also perhaps increases our attention on counties in central-eastern Florida. After incorporating the additional uncertainty associated with predicting an ozone value for each county, attention is now focused on perhaps seven counties in cen-
82
L.J. Young et al. / Spatial and Spatio-temporal Epidemiology 1 (2009) 73–84
tral-eastern Florida, including Okeechobee County, which has a high MI SER (Fig. 2). 7. Discussion Although the analyses described here are based on data particular to CDC’s EPHT program, the statistical challenges that are discussed and illustrated are crucial for any study that links environmental data with health data through location. Here, as in other EPHT applications, studies have already been conducted linking air quality with public health, and the majority of these are based on more detailed data than what will be available within this program. Yet, associations between environmental factors and the public’s health should be considered at the state, regional, and national levels. The analytic challenges are great, and expressing exactly what conclusions can and cannot be drawn from such analyses is important. As one referee correctly noted, any effect of ozone exposure on MI is expected to be reflected within in a few days (not within a month) and basing the decision to work with monthly data at the county level on confidentiality constraints is a weakness of this study. Daily measures of MI are more likely to be related to daily measures of ozone exposure than are monthly measures of these variables. This is further complicated because the uncertainty of environmental exposure will also vary based on the spatial heterogeneity of that exposure. For some exposures, there may be large variation within a small spatial unit and, for others, there will not. However, we must recognize that these are the types of data that are being produced by EPHT, and these analyses are being conducted, not just by EPHT but also in other studies (e.g., Bobak and Leon, 1999; Haining et al., 2007). Although analyses such as these can never be as strong as those based on daily values, a set of statistical tools are needed to provide the best possible analyses under these circumstances, and the limitations of such analyses need to be fully documented. The measure of ozone exposure is especially problematic in this study. The ozone monitors in Florida, as in most other states, tend to be located in population centers where ozone levels tend to be higher. By using block kriging to predict points on a grid over the state of Florida based on monitors in population centers, the predictions at points in rural areas are likely biased; that is, ozone exposure in rural areas is predicted to be higher than it truly is. Determining an appropriate measure of monthly ozone exposure is challenging. Here we predicted the daily average ozone for each point on a grid covering the state. The monthly maximum predicted ozone value for each point was averaged with all other monthly maximum ozone values within the same county to obtain the monthly maximum for that county. An alternative approach would be to average the predicted daily ozone for each point and then to average over the points in the county. We have used both approaches and found that MI SER is more highly correlated with the monthly maximums than with the daily averages. This is consistent with the fact that increased incidence of MI has been associated with peaks in ambient ozone (Ruidavets et al., 2005; Vermylen et al., 2005). However, whether the maximum or
average monthly ozone exposure should be used remains a topic for further study. The lack of meteorological data, especially relative humidity and temperature, is another study weakness. As an example, ambient ozone tends to increase with temperature, and high temperature is often associated with adverse health effects. In this study, the estimated association between ozone exposure and MI is likely biased because no consideration is given to the effect of temperature or other meteorological variables. Because the biased predictions of ozone exposure at the county level and the lack of meteorological data are weaknesses in this study, any inference about the association between ozone exposure and MI for Florida during August 2005 should be viewed with skepticism. However, the primary purpose of this paper is to illustrate the statistical challenges in conducting such studies. We are currently seeking models that are effective in predicting ozone, particulate matter, and meteorological measures on a fine grid throughout the state. When one is acquired, the insights into the statistical issues presented here will be useful in developing an improved model of the association between ozone exposure and MI. Again, it is important to remember that the purpose of EPHT is not to conduct specific studies on the health impacts of environmental exposures, but to monitor associations between them. This effort will not replace in-depth epidemiological studies of the impact of environmental exposures on health. In this paper, we demonstrate two critical statistical considerations in linking data on different geographic units. First, inference can change, depending on how data are linked. Spatial data are often much more than just geographically referenced points. When data have been geographically aggregated into counts, rates, or proportions, the spatial supports of the data (sizes, shapes, and orientations) are critical elements in combining data, and accounting for the support of the data in analysis and mapping is critical. Variables associated with such areal units as counties and census tracts are inherently spatial averages, and the variability, both statistical and geographical, associated with averages is inherently much different than that associated with observations recorded at point locations. Thus, spatial prediction at areal centroids will often provide a misleading picture of the geographic variation inherent in an areal variable. The spatial prediction technique known as block kriging offers an alternative to simple prediction at areal centroids, and this method has been generalized so that it can be used for aggregation, disaggregation, side-scaling, and intensity estimation (Gotway and Young, 2007). Second, and most importantly, the results of any statistical estimation technique, even those leading to maps, must be interpreted in the context of the uncertainty and variability associated with the estimates produced. In the regression analysis we considered here based on geographically weighted regression, two sources of uncertainty must be considered. The first is that caused by estimation of the regression coefficients, which can be summarized by considering standard errors derived from the regression model. The second source of uncertainty arises when predicted values, and not the observed data values, are used as explanatory covariates in a regression model. We chose
L.J. Young et al. / Spatial and Spatio-temporal Epidemiology 1 (2009) 73–84
to incorporate this uncertainty by using geostatistical simulation, but other approaches based on measurement error models (Madsen et al., 2008; Gryparis et al., 2008) and Bayesian hierarchical modeling (Zhu et al., 2008) can also be used. Geostatistical simulation methods and geographically weighted regression models are being incorporated into ESRI’sÒ ArcGISÒ (Environmental Systems Research Institute, Inc., Redlands, California), and consequently, the analysis presented here can be applied to routine practice. More remains to be done. An association between relative MI SER and ozone concentrations was identified at the county level for August 2005. However, true interest lies in describing the spatiotemporal relation between MI SER and ozone concentrations at the individual patient’s level. Thus, another change of support problem, namely that of making individual-level inference from grouped data (the ecological inference problem), remains. Although using ecologic inference is common in epidemiology, the problems with using it are well-known (see Greenland and Robins (1994), for a review). Confidentiality concerns prevent an individual-level analysis here, and these are common in health-related studies. Visualization of uncertainty remains a key challenge in geographical analyses. Here we used confidence maps, maps of p values, and randomly chosen simulation maps to convey uncertainty, but linked micro-maps (Carr et al., 2000) might be an alternative. Conveying uncertainty in analyses involving health and place is important, not only in research, but for correct policy and public interpretation. The challenges associated with relating public health to environmental factors are numerous. The initial process of collecting data from disparate sources, including meeting each agency’s criteria for access and confidentiality, is time-consuming, although this continues to improve as researchers gain experience with the program. Linking the data on a common spatial and temporal scale must be done carefully for meaningful interpretation. Finally, methods of statistical analyses that permit inferences relating to potential associations between public health and environmental factors as a part of an ongoing environmental public health tracking effort continues to be an active research area. Perhaps the analyses presented here will provide insight into better ways forward. Acknowledgments This publication was supported by the Florida Department of Health, Division of Environmental Health and Grant/Cooperative Agreement Number 5 U38 EH00017702 from the Centers for Disease Control and Prevention (CDC). The findings and conclusions in this report are those of the authors and do not necessarily represent the views of the Centers for Disease Control and Prevention. References Armstrong M. Basic linear geostatistics. Berlin: Springer; 1998. Bell ML. The use of ambient air quality modeling to estimate individual and population exposure for human health research: A case study of ozone in the Northern Georgia Region of the United States. Environ Int 2006;32:586–93.
83
Bobak M, Leon DA. Pregnancy outcomes and outdoor air pollution: an ecological study in districts of the Czech Republic 1986–1988. Occup Environ Med 1999;56:529–43. doi:10.1136/oem.56.8.539. Brunsdon CF, Fotheringham AS, Charlton ME. Geographically weighted regression: a method for exploring spatial nonstationarity. Geogr Anal 1996;28:281–98. Carr DB, Wallin JF, Carr DA. Two new templates for epidemiology applications: linked micromap plots and conditioned choropleth maps. Stat Med 2000;19:2521–38. Centers for Disease Control and Prevention (CDC). National environmental public health tracking network: technical network implementation plan, version 1.0. Atlanta (GA): US Department of Health and Human Services, CDC; 2007. Centers for Disease Control and Prevention (CDC). Health risks in the United States: behavioral risk factor surveillance system 2006. Atlanta (GA): US Department of Health and Human Services, CDC; 2006. Congdon P. Modeling spatially varying impacts of socioeconomic predictors on mortality outcomes. J Geogr Syst 2003;5:161–84. Congdon P. A model for non-parametric spatially varying regression effects. Comput Stat Data Anal 2006;50:422–45. Cressie NAC. Statistics for spatial data. revised ed. New York: Wiley; 1993. Davis M. Production of conditional simulations via the LU triangular decomposition of the covariance matrix. Math Geol 1987;19:91–8. Dominici F, Peng RD, Bell ML, Pham L, McDermott A, Zeger SL, et al. Fine particulate air pollution and hospital admission for cardiovascular and respiratory diseases. J Am Med Assoc 2006;295:1127–34. Fotheringham AS, Brundson C, Charlton M. Geographically weighted regression: the analysis of spatially varying relationships. Chichester: Wiley; 2002. Gelfand AE, Kim HJ, Sirmans CF, Banerjee S. Spatial modeling with spatially varying coefficient processes. J Am Stat Assoc 2003;98:387–96. Gotway CA, Rutherford BM. Stochastic simulation for imaging spatial uncertainty: comparison and evaluation of available algorithms. In: Armstrong M, Dowd PA, editors. Geostatistical simulations. Dordrecht: Kluwer Academic Publishers; 1994. p. 1–21. Gotway CA, Young LJ. Combining incompatible spatial data. J Am Stat Assoc 2002;97:632–48. Gotway CA, Young LJ. A geostatistical approach to linking geographicallyaggregated data from different sources. J Comput Graph Stat 2007;16:1–21. Greenland S, Robins J. Invited commentary: ecologic studies – biases, misconceptions, and counterexamples. Am J Epidemiol 1994;139: 747–60. Gryparis A, Paciorek CJ, Zeka A, Schwartz J, Coull B. Measurement error caused by spatial misalignment in environmental epidemiology. Biostatistics 2008. doi:10.1093/biostatistics/kxn033. Haining R, Law J, Maheswaran R, Pearson T, Brindley P. Bayesian modeling of environmental risk: example using a small area ecological study of coronary heart disease mortality in relation to modelled outdoor nitrogen oxide levels. Stoch Environ Res Risk Assess 2007;21:501–9. doi:10.1007/s00477-007-0134-1. Hastie TJ, Tibshirani RJ. Varying-coefficient models. J Roy Stat Soc B 1993;55:757–96. Journel AG. Geostatistics for conditional simulation of ore bodies. Econ Geol 1974;69:673–87. Journel AG, Huijbregts CJ. Mining geostatistics. London: Academic Press; 1978. Madsen L, Ruppert D, Altman NS. Regression with spatially misaligned data. Environmetrics 2008;19:453–67. Müller WG. Fundamentals of spatial statistics. In: Collecting spatial data: optimum design of experiments for random fields, Physica-Verlag; 1998. p. 9–33. Olea RA. Geostatistical glossary and multilingual dictionary. New York: Oxford University Press; 1991. Pope A. Respiratory hospital admissions associated with PM10 pollution in Utah, Salt Lake, and Cache Valleys. Arch Environ Health 1991; 46:90–7. Pope CA, Ezzati M, Dockery W. Fine-particulate air pollution and life expectancy in the United States. N Engl J Med 2009;360:376–86. Ruidavets JB, Cournot M, Cassadou S, Giroux M, Meybeck M, Ferrières J. Ozone air pollution is associated with acute myocardial infarction. Circulation 2005;111:563–9. Salam MT, Millstein J, Li Y-F, Lurmann FW, Margolis HG, Gilliland FD. Birth outcomes and prenatal exposure to ozone, carbon monoxide, and particulate matter: results from the children’s health study. Environ Health Persp 2005;113:1638–44. US Environmental Protection Agency (EPA). National Ambient Air Quality Standards (NAAQS). http://epa.gov/air/criteria.html. Accessed November 21, 2008.
84
L.J. Young et al. / Spatial and Spatio-temporal Epidemiology 1 (2009) 73–84
Van Horssen PW, Pebesma EJ, Schot PP. Uncertainties in spatially aggregated predictions from a logistic regression model. Ecol Model 2002;154:93–101. Vermylen J, Nemmar A, Nemery B, Hoylaerts MF. Ambient air pollution and acute myocardial infarction. J Thromb Haemost 2005;3:1955–61. Waller LA, Gotway CA. Applied spatial statistics for public health data. New York: Wiley; 2004. Woodward M. Epidemiology: study design and data analysis. 2nd ed. London/Boca Raton (FL): Chapman & Hall/CRC Press; 2004.
World Health Organization (WHO). International classification of diseases and related health problems. 10th revision, 2nd ed. Geneva, Switzerland: WHO; 2005. Young LJ, Gotway CA. Linking spatial data from different sources: The effect of change of support. Stoch Environ Res Risk Assess 2007; 21:589–600. Zhu L, Carlin BP, Gelfand AE. Hierarchical regression with misaligned spatial data: relating ambient ozone and pediatric asthma ER visits in Atlanta. Environmetrics 2008;14:537–57.