Applied Geography 63 (2015) 253–263
Spatial obfuscation methods for privacy protection of household-level data
Dara E. Seidl a,*, Gernot Paulus b, Piotr Jankowski a,c, Melanie Regenfelder b
a Department of Geography, San Diego State University, 5500 Campanile Drive, San Diego, CA 92182-4493, USA
b School of Engineering & IT, Department of Geoinformation and Environmental Technologies, Carinthia University of Applied Sciences, Europastrasse 4, A-9524 Villach, Austria
c Institute for Geoecology and Geoinformation, Adam Mickiewicz University, Wieniawskiego 1, 61-712 Poznan, Poland
Article info
Article history: Received 4 April 2015; received in revised form 1 July 2015; accepted 1 July 2015; available online 18 July 2015
Abstract
The topic of geoprivacy is increasingly relevant as larger quantities of personal location data are collected and shared. The results of scientific inquiries are often spatially suppressed to protect confidentiality, limiting possible benefits of public distribution. Obfuscation techniques for point data hold the potential to enable the public release of more accurate location data without compromising personal identities. This paper examines the application of four spatial obfuscation methods for household survey data. Household privacy is evaluated by a nearest neighbor analysis, and spatial distribution is measured by a cross-k function and cluster analysis. A new obfuscation technique, Voronoi masking, is demonstrated to be distinctively equipped to balance between protecting both household privacy and spatial distribution. © 2015 Elsevier Ltd. All rights reserved.
Keywords: Privacy; Geoprivacy; Confidentiality; Obfuscation; Masking
1. Introduction
A common goal in the research process is to share results with other researchers and the public. Access to a shared data source allows for results to be replicable and integrated with auxiliary data, facilitating improved knowledge production. However, sharing data collected under the promise of participant confidentiality can be restrictive. This is particularly true of spatial data, as location is a strong personal identifier. In 2012, researchers at the Carinthia University of Applied Sciences built a GIS Portal for the collection and reporting of high-resolution household energy data in Hermagor, Austria (Paulus, Kosar, Erlacher, & Anders, 2014). To protect confidentiality, the energy demand maps offered by the portal display data aggregated to grid-like statistical units with data suppressed where the population number is insufficient to ensure anonymity. As an alternative to data aggregation, geographic masking, or obfuscation, involves the alteration of point data for protection of
both spatial information and confidentiality (Armstrong, Rushton, & Zimmerman, 1999). This study evaluates the effectiveness of the recognized obfuscation techniques of grid masking, random perturbation, and weighted random perturbation in maintaining distributional integrity in the Hermagor energy data. We also evaluate the performance of a new masking procedure, Voronoi masking, in protecting privacy and spatial distribution. An important question for the utility of geomasking is whether the masked data are fit for decision support. To provide value to decision-makers, the masked data must maintain accuracy comparable to the original data. This study tests the clustering and neighbor patterns of household energy consumption survey points and evaluates the performance of the obfuscated data compared to the original unmasked data. These tests mark a first step in determining if masked data can serve to replace original data in decision support systems.
* Corresponding author. E-mail addresses: [email protected] (D.E. Seidl), [email protected] (G. Paulus), [email protected] (P. Jankowski), m.regenfelder@fh-kaernten.at (M. Regenfelder).
http://dx.doi.org/10.1016/j.apgeog.2015.07.001
0143-6228/© 2015 Elsevier Ltd. All rights reserved.
2. Background
Over the past ten years, there has been a surge of interest in geoprivacy among the geographic community (Zandbergen, 2014). Geoprivacy is understood as the right to determine how, if, and when one's personal location information is shared with other
parties (AbdelMalik, Boulos, & Jones, 2008; Duckham & Kulik, 2007; Elwood & Leszczynski, 2011; Kwan, Casas, & Schmitz, 2004). This right is not always protected as location information is collected for research purposes. Kounadi and Leitner (2014a) write that location disclosure can come from new geospatial technologies, laws that do not stringently protect privacy, and negligence by authors and publishers. Encryption practices for data transmission are also only effective insofar as there is a collective body or rule of law to enforce them (Weiser & Scheider, 2014). Legal precedent for the right to geoprivacy is particularly strong in Europe. In Austria, where this research was conducted, information privacy is safeguarded under the Data Protection Act of 2000, which restricts further use of data collected by means such as surveys and sensor networks. Legal privacy expert Sjaak Nouwt (2008) asserts that the concept of a “reasonable expectation” of geoprivacy exists within the European legal framework, meaning that in realms where individuals can reasonably expect privacy with regard to their location information, their locations cannot lawfully be disclosed. Despite these protections, most authorities are not well-equipped to intervene in privacy violations (EU Fundamental Rights Agency, 2010), or in ensuring data encryption (Weiser & Scheider, 2014). If legal protections and encryption legislation are inadequate for participant confidentiality, it falls to authors and publishers to protect location data. Aggregation to administrative boundaries is commonly applied to protect confidentiality, but reducing spatial resolution reduces the ability to detect underlying patterns, such as disease risk (Hampton et al., 2010; Kwan et al., 2004). Zandbergen (2014) echoes that spatial analysis techniques, including cluster detection, become less accurate with aggregated data. 
Similarly, in testing the effect of aggregation on cancer risk prediction, Luo, McLafferty, and Wang (2010) note that smoothing effects adversely impact the estimations of statistical models. Armstrong et al. (1999) first introduce geographic masking as a means of protecting geoprivacy and preserving spatial information. The introduction is a response to restricted release of health records by the National Center for Health Statistics (NCHS) to geographic areas with at least 100,000 persons. Masking procedures have since been applied to improving privacy-versus-accuracy tensions in the analysis of homicide data (Leitner & Curtis, 2004), clustering of disease cases (Wieland, Cassa, Mandl, & Berger, 2008), and household travel survey residence data (Clifton & Gehrke, 2013). Documented obfuscation procedures include affine transformations, grid masking, unweighted and weighted perturbation, Gaussian perturbation, and donut masking. Affine transformations translate, re-scale, or rotate a point pattern (Armstrong et al., 1999; Kwan et al., 2004). Grid masking involves snapping each original data point to uniform grid cells (Curtis, Mills, Agustin, & Cockburn, 2011; Krumm, 2007; Leitner & Curtis, 2004). Random perturbation moves a point a random distance in a random direction within a distance threshold, which may then be weighted by a variable such as population density (Armstrong et al., 1999; Kwan et al., 2004). Gaussian perturbation ensures that the distance points are moved in random perturbation follows a Gaussian distribution (Cassa, Wieland, & Mandl, 2008; Zimmerman & Pavlik, 2008), and donut masking ensures that points are moved some minimum distance in random perturbation (Hampton et al., 2010). No masking technique has been implemented in standard or recommended use. 
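To make the mechanics of these documented procedures concrete, the following is a minimal sketch of grid masking, random perturbation, and donut masking for planar coordinates in meters. It is not taken from any of the cited studies: the function names, the snapping of points to cell centroids, the uniform sampling of the displacement distance, and the example coordinates are all illustrative assumptions, and published implementations differ in such details.

```python
import math
import random


def grid_mask(x, y, cell=125.0):
    """Snap a point to the centroid of the square grid cell containing it."""
    return ((math.floor(x / cell) + 0.5) * cell,
            (math.floor(y / cell) + 0.5) * cell)


def random_perturbation(x, y, max_dist=125.0, min_dist=0.0, rng=random):
    """Move a point a random distance in a random direction.

    min_dist = 0 gives plain random perturbation; min_dist > 0 gives a
    donut mask. The distance is drawn uniformly on [min_dist, max_dist],
    which is an assumption of this sketch.
    """
    angle = rng.uniform(0.0, 2.0 * math.pi)
    dist = rng.uniform(min_dist, max_dist)
    return x + dist * math.cos(angle), y + dist * math.sin(angle)


rng = random.Random(42)           # fixed seed so the sketch is repeatable
home = (1040.0, 2010.0)           # hypothetical residence, meters

gm = grid_mask(*home)             # (1062.5, 2062.5)
rp = random_perturbation(*home, max_dist=125.0, rng=rng)
donut = random_perturbation(*home, max_dist=125.0, min_dist=50.0, rng=rng)
```

With centroid snapping, a point is displaced by at most half the cell diagonal (about 88 m for a 125-m cell), whereas the perturbation variants are bounded by `max_dist`.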
Privacy in obfuscation has typically been conceptualized as k-anonymity, which requires that each individual be indistinguishable from k-1 other individuals in a data set (Sweeney, 2002; Zandbergen, 2014). Using residential locations obtained from an E911 database, Allshouse et al. (2010) measure spatial k-anonymity as the number of households closer to the original point than the distance of displacement in masking. Applied to simulated disease
cases, Hampton et al. (2010) measure k-anonymity as the population in the circular region around the original point smaller than the distance of displacement. An issue with this conceptualization is that it does not consider the privacy implications of a false identification due to low population density in the vicinity of a displaced point. The k-anonymity of the ultimate resting place of each point should be considered in addition to that of the original location. Some obfuscation studies have focused solely on preservation of spatial pattern as a test of maintaining spatial data integrity. In an application of simulated geocoded health records, Shi, Alford-Teaster, and Onega (2009) generate kernel density surfaces of original and masked point data and test for Pearson's correlations between the rasters. More robust examinations use clustering techniques to compare point distributions. In their masking of death records, Kwan et al. (2004) implement the cross-k function, which tests whether differences observed between point patterns are significantly similar compared to random simulations. Olson, Grannis, and Mandl (2006), Wieland et al. (2008), and Hampton et al. (2010) use SaTScan circular clustering to compare the sensitivity of original and masked disease data points to cluster detection. Kounadi and Leitner (2014b) present indices of masked crime data divergence from original point distributions with a local index incorporating nearest neighbor hierarchical clustering detection. These are all methods to quantify information loss resulting from geomasking.
3. Methods
This section describes the methods implemented to test changes in spatial distribution and household anonymity during obfuscation. Fig. 1 depicts the overview of the analysis from the original point address data down to the masked points and statistical comparisons.
3.1. Study area
This study employs energy use data calculated for every household in the Hermagor district of Carinthia in southern Austria. Compared to study areas in other masking research (Kwan et al., 2004; Leitner & Curtis, 2004), Hermagor has a very low population density at 22.95 persons per square kilometer, which makes individual residences more vulnerable to identification (Statistics Austria, 2014). The data for this study include 1945 residential records represented by the centroids of georeferenced buildings provided by the individual communities in the district. Several energy consumption variables were calculated for each residence, including electricity, heating, warm water, and total energy consumption. Household warm water energy consumption is selected in the current analysis for demonstration purposes, although we acknowledge that from a decision support point of view, total energy consumption would be more pertinent. The mean warm water energy consumption for each household is 2.71 megawatt hours per annum with a standard deviation of 2.16 MWh/a. The highest consumption is in the central part of the district, as well as towards the northeast of the region. Fig. 2 displays the kernel density estimation (KDE) of warm water consumption with a 250-m cell size. The southern portion of the district is primarily uninhabited in the Carnic Alps along the border to Italy, with the exception of the major winter tourism center Nassfeld.
3.2. Spatial analysis of original data points
This study focuses first on the original data points (ODP) with methods typically used by an energy analyst in a decision-making
Fig. 1. Overview of obfuscation analysis.
process. Identical methods are subsequently applied to the masked data sets. As a first step, exploratory statistics (ES) are run on the original data points (ODP), including the mean and median centers with standard deviations. The distance to the k nearest neighbors (DK) for each point is also calculated. This facilitates spatial weighting for obfuscation, where the distance threshold is varied according to how vulnerable each residence is to identification. Spatial autocorrelation (SA) is tested using global Moran's I and semivariogram analysis. If the energy consumption data are spatially autocorrelated, a similar correlation should ideally be present in the masked patterns. A subset of 400 points is selected from the ODP to examine the semivariogram. This random selection represents approximately 20% of the total data set and is chosen to examine finer patterns within the autocorrelated data. The same subset of features is selected from the masked data sets to produce spherical semivariograms for comparison. The average nearest neighbor measurement of the ODP is 43.8 m, so the
semivariograms are given a lag size of 43.8 m with 12 lags. Data are transformed by the normal score due to a skewed original distribution. As in the Kounadi and Leitner (2014b) study, a nearest neighbor hierarchical cluster analysis (CA) is performed in CrimeStat 3.3 (Levine, 2006) to determine the number of first order clusters present in the original data, the density of the clusters, and their standard deviational ellipses. This type of clustering is commonly used in crime analysis, but can be applied to any kind of point distribution (Kounadi & Leitner, 2014b; Levine, 2006). A minimum cluster size of 20 points is selected to limit the quantity of clusters detected.
3.3. Geomasking techniques
The next step is to obfuscate the original residence data points. Statistics Austria releases spatial data aggregated to regional
Fig. 2. KDE of warm water consumption, Hermagor district.
statistical units, which are regularly spaced vector grid cells of 100, 125, or 250 m side length, omitting information where there are too few households. In this case study, the statistical units are sized at 125 m. Aggregation to grid cells is preferable to aggregation to larger geographies such as zip codes for the preservation of spatial pattern. However, the grid cell size used for aggregation can cause point displacement at a greater distance than is actually necessary to preserve privacy. Geomasking offers the potential to maintain fine granularity for analysis.
3.3.1. Traditional techniques
The traditional obfuscation techniques employed in this study are:
1. Grid masking (GM)
2. Random perturbation (RP)
3. Weighted random perturbation (WRP)
Grid masking most closely resembles the gridded units the Austrian government uses in summarizing point data and thus serves as a proxy for this aggregation. A distance threshold of 125 m is used as representative of the official statistical unit aggregation. Random perturbation is selected because it is one of the most commonly applied masking procedures that approximates event-geography relations (Armstrong et al., 1999; Kwan et al., 2004; Zandbergen, 2014). As with grid masking, RP is applied uniformly with a 125-m distance threshold. Weighted random perturbation is included because of its potential to better preserve spatial distributions by displacing points shorter distances in more populated areas. Allshouse et al. (2010) weight the distance threshold in masking by the distance to a specified number of k households, using a minimum of 5 households for illustration purposes. The average distance to the 5th nearest neighbor in the original data points (ODP) is 117 m. Since GM and RP are applied at 125 m, a threshold of k = 5 households is set for WRP so that the average displacement distance approximates that of the other methods. The displacement distance is randomized between 0 and the distance to the 5th nearest neighbor for each point.
3.3.2. Voronoi masking
This study introduces a new form of obfuscation referred to here as Voronoi masking (VM). Voronoi polygons define areas where the boundaries are equidistant between the surrounding points, so that any location inside a polygon is closer to the corresponding point than to any other point (Aurenhammer & Klein, 2000; Voronoi, 1908). In VM, each point is snapped to the closest part along the edges of its corresponding Voronoi polygon. An advantage of this technique is that where the density of original points is higher, the points are moved a shorter distance, resulting in patterns that more closely resemble the original data. The VM pattern is strongly linked to the residence distribution of the specified study area. Another advantage of VM is that some points in adjacent polygons will be snapped to the same location, which can increase their k-anonymity. Finally, if the data set incorporates all residences in a study area, no relocated point is placed on an actual residence. None of the relocated points remain in their original locations, or at the centroids of other residences. This permits no false identification of household points. In areas of sparse residences, it is expected that some points will be moved large distances with this method, which could disrupt patterns. However, if there are at least two households close to each other in a remote region, the points will potentially be moved a shorter distance than they would be with other masking techniques that do not account for underlying settlement patterns. Sample results are shown in Fig. 3.
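The two density-adaptive masks of Sections 3.3.1 and 3.3.2 can be sketched as follows. This is an illustrative sketch under stated assumptions, not the procedure used to produce the results reported here. The function names and sample coordinates are hypothetical, neighbors are found by brute force, and the displacement distance in WRP is drawn uniformly. For VM the sketch relies on a geometric shortcut that is our assumption rather than a statement of the method as implemented: for distinct points in an unclipped Voronoi diagram, the closest location on a point's polygon boundary is the midpoint between that point and its nearest neighbor, so no polygons need to be constructed.

```python
import math
import random


def kth_neighbor_distance(points, i, k):
    """Distance from points[i] to its k-th nearest other point (brute force)."""
    xi, yi = points[i]
    dists = sorted(math.hypot(xi - x, yi - y)
                   for j, (x, y) in enumerate(points) if j != i)
    return dists[k - 1]


def weighted_random_perturbation(points, k=5, rng=random):
    """Displace each point a random distance in [0, d_k] in a random
    direction, where d_k is the distance to its k-th nearest neighbor."""
    masked = []
    for i, (x, y) in enumerate(points):
        limit = kth_neighbor_distance(points, i, k)
        dist = rng.uniform(0.0, limit)      # uniform draw is an assumption
        angle = rng.uniform(0.0, 2.0 * math.pi)
        masked.append((x + dist * math.cos(angle),
                       y + dist * math.sin(angle)))
    return masked


def voronoi_mask(points):
    """Snap each point to the nearest location on its Voronoi cell boundary,
    computed as the midpoint to its nearest neighbor (assumed shortcut)."""
    masked = []
    for i, (x, y) in enumerate(points):
        others = (j for j in range(len(points)) if j != i)
        nn = min(others,
                 key=lambda j: math.hypot(x - points[j][0], y - points[j][1]))
        masked.append(((x + points[nn][0]) / 2.0, (y + points[nn][1]) / 2.0))
    return masked


# Hypothetical residences in meters; k = 5 needs at least six points.
households = [(0.0, 0.0), (10.0, 0.0), (100.0, 0.0),
              (100.0, 30.0), (500.0, 500.0), (520.0, 500.0)]
wrp_points = weighted_random_perturbation(households, k=5,
                                          rng=random.Random(1))
vm_points = voronoi_mask(households)
```

In this toy pattern each pair of mutual nearest neighbors is snapped to one shared midpoint, which illustrates how VM can place adjacent households at identical coordinates.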
Fig. 3. Points before and after Voronoi masking. Bottom row includes same points as top row, but without polygons.
3.4. Spatial analysis of masked data
Just as with the original data points (ODP), spatial analysis techniques are implemented on the obfuscated data. These include kernel density difference maps of warm water consumption patterns, distance to k-nearest neighbors (DK), global Moran's I and semivariograms for spatial autocorrelation (SA), and nearest neighbor hierarchical cluster analysis (CA). These procedures are described in Section 3.2. A limitation of these comparisons is that they do not provide a statistical measure of masked point divergence from the ODP. For this reason, we include a point similarity analysis (PSA) of cross-k functions to test the degree to which the masked points are clustered around the ODP. Kwan et al. (2004) implement the cross-k function in their masking study, which tests whether differences observed between point patterns are significantly similar or different compared to random simulations. The ODP are set as the type i events and the masked data set as the type j events, with the cross-k run once for each masking method. Border correction is implemented using the administrative boundary of the Hermagor district. The cross-k functions are run with 99 simulations to test at 99% confidence.
4. Results
Fig. 4 displays simulated obfuscated results from the masking techniques used in this analysis. In this example, GM appears to least preserve the spatial pattern exhibited by the original data. VM and WRP maintain the outer shape of the original data extent better than RP. The best results for spatial information preservation are thus expected from VM and WRP.
4.1. Exploratory statistics (ES)
The mean center of the original data points is found at (2999.7, 164876.4; MGI Austria GK Central) along Gailtal Straße. This is nestled between the northern and southern settlement clusters. The median center of the ODP is situated north of the mean center at (3071.2, 165050.9). The mean centers, median centers, and standard deviational ellipses between the ODP and masking results are within two meters of each other and do not vary much by technique (Fig. 5). The VM mean and median centers are among the closest to those of the ODP at less than one meter away. Two meters of distance between the mean and median centers is not enough of a difference to underscore any major divergence from the ODP. The standard deviational ellipses for each masked point pattern also exhibit low variation and on a map appear to completely overlap. The rotation of the ellipse varies by 0.01 for RP, and even less for the other methods (Table 1). The vertical and horizontal standard distances vary most for GM compared to the other methods, but are still within 5 m of the unmasked standard deviational ellipse. Further examination of the clustering patterns and spatial relationships is needed to determine the similarity of the underlying patterns.
4.2. Warm water energy consumption kernel density difference maps
The difference maps displayed in Fig. 6 were created using kernel density estimations of the warm water energy consumption by household. A cell size of 100 m and a search distance of 250 m were used to smooth patterns and make them visible at the scale of the entire district of Hermagor. The absolute value of the difference between the ODP and masked result rasters for warm water energy consumption is the value depicted in the maps. GM and RP demonstrate the highest levels of divergence from the ODP kernel estimation. RP also has the worst performance with the isolated points at the south of the region. The errors appear more widespread in RP, which is expected with the random displacement of all points.
Fig. 4. Masking technique examples.
The clear best performance for this metric comes from the VM density surface. There are few visible cells for VM that reach the category of highest difference from the ODP kernel density raster. WRP fares better than GM and RP due to shorter-distance displacements of points where there is a high point density. A problem area where all the methods resulted in great difference from the ODP is in the town of Tröplach, where the error is centered on a few sparse data points with high warm water energy consumption. There is similar error in the eastern portion of the town of Hermagor, where there are no data points, but the nearby points have higher consumption records. This result may say more about errors with interpolation rather than errors resulting from obfuscation, however.
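The difference maps of Section 4.2 can be approximated with the short sketch below, which evaluates a consumption-weighted kernel density at the centers of 100-m cells with a 250-m search distance and then takes the absolute difference between two surfaces. The quartic kernel, the function names, and the single sample household are assumptions made for illustration; the kernel actually applied by the GIS software may differ.

```python
import math


def kde_grid(points, weights, xmin, ymin, ncols, nrows,
             cell=100.0, radius=250.0):
    """Weighted kernel density at cell centers (quartic kernel, assumed)."""
    grid = [[0.0] * ncols for _ in range(nrows)]
    norm = 3.0 / (math.pi * radius ** 2)   # makes the kernel integrate to 1
    for (px, py), w in zip(points, weights):
        for r in range(nrows):
            cy = ymin + (r + 0.5) * cell
            for c in range(ncols):
                cx = xmin + (c + 0.5) * cell
                d2 = (cx - px) ** 2 + (cy - py) ** 2
                if d2 < radius ** 2:
                    grid[r][c] += w * norm * (1.0 - d2 / radius ** 2) ** 2
    return grid


def abs_difference(a, b):
    """Cell-by-cell absolute difference of two equally sized rasters."""
    return [[abs(va - vb) for va, vb in zip(ra, rb)] for ra, rb in zip(a, b)]


# One hypothetical household using 2.71 MWh/a, displaced 100 m east by a mask.
original = kde_grid([(150.0, 150.0)], [2.71], 0.0, 0.0, ncols=5, nrows=5)
masked = kde_grid([(250.0, 150.0)], [2.71], 0.0, 0.0, ncols=5, nrows=5)
difference = abs_difference(original, masked)
```

Because the surface is weighted by consumption, a few sparse points with high values can dominate the difference raster, consistent with the interpolation effect noted above.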
Fig. 5. Median and mean centers of original and masked data points.
Table 1. First standard deviational ellipse statistics, distance in meters.
Obfuscation method                    X standard distance   Y standard distance   Rotation
Unmasked (ODP)                        1625.97               6639.41               88.09
Grid masking (GM)                     1630.42               6640.55               88.09
Random perturbation (RP)              1625.95               6637.51               88.08
Weighted random perturbation (WRP)    1626.29               6640.17               88.09
Voronoi masking (VM)                  1627.21               6639.82               88.09
Fig. 6. Difference in warm water MWh/a kernel density estimation from ODP.
4.3. Distance to k-nearest neighbors
The concept of k-nearest neighbors is implemented both as a means of comparing spatial pattern across obfuscation methods and evaluating privacy. The original point data in Hermagor have a mean distance of 44 m to the 1st, 117 m to the 5th, 192 m to the 10th, and 329 m to the 20th neighbor (Table 2). RP and WRP increase the distance to the nearest 1, 5, 10, and 20 neighbors. GM and VM exhibit lower average distances to the 1st nearest neighbors (24 m and 17 m, respectively) because these methods are more likely to snap points to the same coordinates. With RP and WRP, masked points have a low probability of sharing exact coordinates, which explains the higher average nearest neighbor distances. At 10 and 20 nearest neighbors, the average distances for GM and VM to the kth nearest neighbors are closer to the results of the ODP, although VM maintains lower average distance to all four neighbor totals tested. VM thus maintains k-anonymity better than the other obfuscation methods.
Table 2. Mean distance to k-nearest neighbors, meters.
Obfuscation method                    k = 1    k = 5    k = 10   k = 20
Unmasked (ODP)                        43.8     116.8    192.0    328.8
Grid masking (GM)                     24.4     116.5    201.9    335.1
Random perturbation (RP)              50.6     129.3    199.6    335.6
Weighted random perturbation (WRP)    52.0     122.9    192.6    329.6
Voronoi masking (VM)                  16.9     109.7    188.6    326.5
4.4. Spatial autocorrelation (SA)
A global test for spatial autocorrelation (Moran's I) in the ODP with respect to household warm water energy consumption is significant with 99% confidence and a z-score of 2.67 (Table 3). The warm water energy consumption point patterns for the ODP and RP reach significance for clustering at the p < 0.01 level. GM (p < 0.05) and WRP (p < 0.10) are weakly significant for globally clustered patterns. VM is the only point pattern for which there is a random global distribution of warm water energy consumption points. A possible explanation for this is the behavior of remote outer points under VM, which may be snapped to the outer edges of Voronoi polygons, making them more dispersed than under the other masking conditions.
Table 3. Moran's I results by obfuscation method.
Obfuscation method                    Moran's I   z-score   p-value
Unmasked (ODP)                        0.044       2.673     0.008
Grid masking (GM)                     0.034       2.518     0.012
Random perturbation (RP)              0.027       4.569     0.000
Weighted random perturbation (WRP)    0.009       1.803     0.071
Voronoi masking (VM)                  0.026       1.192     0.233
The semivariogram analysis is intended to compare the spatial autocorrelation of the original data set with that of the masked data sets, including the nugget, range and sill. An upward trend of the semivariogram that levels out as the distance increases is characteristic of spatial autocorrelation, where closer points are more similar than points that are farther apart (Clark, 1979; Griffith, 1987). Knowing that the original data points (ODP) are globally clustered, but that not all the masked point patterns exhibit clustering, the semivariograms were expected to show some differences between the point patterns. Giving the nugget a value of 0, the partial sills of the semivariograms are shown in Table 4. The averaged sill values for all the semivariograms appear to approximate horizontal lines (values close to 1.0). This suggests that there is little spatial correlation in the data and that data points that are far away have similar values. The partial sill of the WRP semivariogram is identical to that of the ODP at 0.952, suggesting a closer approximation of the semivariogram model to the ODP than any of the other masking techniques. RP exhibits the highest partial sill value, and the plotted binned points are more randomized along the semivariogram model than in the other masked representations. Directional influences did not appear in the semivariograms when tested. Due to the horizontal trend of the semivariograms, this test did not offer insights into whether obfuscation would impact exhibited spatial correlation in this manner.
Table 4. Partial sill values for semivariograms.
Obfuscation method                    Partial sill
Unmasked (ODP)                        0.952
Grid masking (GM)                     1.008
Random perturbation (RP)              1.179
Weighted random perturbation (WRP)    0.952
Voronoi masking (VM)                  1.069
4.5. Cluster analysis (CA)
The ODP generated 20 first order clusters, encompassing an average of 27.7 points each (Table 5). VM and WRP were closest to ODP in the number of clusters generated. RP produced the fewest first order clusters at 13. RP, through random changes in distance and direction, does not tend to snap any nearby points together, as grid masking and Voronoi masking do, which can lead to comparatively less dense clustering patterns at fine scales. With a smaller number of clusters detected, RP without weighting may prove a less useful obfuscation technique for cluster analysis than the other masking methods. In WRP, clusters are more likely to remain intact, as with a higher number of neighbors, a point is only moved a short distance. The mean density of the clusters detected for all of the obfuscation methods was higher than the mean cluster density for the ODP, indicating that the masking techniques tend to strengthen the cohesion of existing clusters. The map in Fig. 7 highlights the locations and sizes of output clusters in the central study area where 14 of the 20 ODP clusters are situated. The VM cluster ellipses appear to best approximate the location and size of the ODP clusters. This is particularly true of the northern portion of the selected area, where only the VM cluster ellipses match up with those of the ODP, and no other ellipses are nearby. Outliers by location for the group are emphasized towards the west by GM and WRP. RP fares the worst in this specific representation, as RP ellipses are absent in 6 of the 14 cases where an ODP cluster exists. VM demonstrates the best performance in approximating the underlying cluster patterns of the ODP.
4.6. Point similarity analysis (PSA)
The PSA demonstrates strong clustering between all four masked data sets and the ODP (Fig. 8). The point distributions of i and j events are highly similar in all four resulting graphs, far exceeding the theoretical distributions and simulation envelopes. Differences between the masking techniques in this regard are difficult to detect from the plots. From low to high distances, the cross-k demonstrates that the underlying point distributions remain highly linked to each other. It places confidence in the masking techniques that the results should be spatially dependent on the original data points, but the methods may not be ideal for highlighting finer-scale difference.
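As a rough illustration of the statistic behind the point similarity analysis in Section 4.6, the sketch below computes a naive cross-k estimate between type i events (original points) and type j events (masked points). It is a simplification under stated assumptions: it omits the border correction and the 99-simulation envelopes used in this study, and the function name, sample coordinates, and study-area size are hypothetical.

```python
import math


def cross_k(points_i, points_j, distances, area):
    """Naive cross-k estimate without edge correction.

    For each distance d, counts the (i, j) pairs no farther apart than d
    and scales the count by area / (n_i * n_j).
    """
    n_i, n_j = len(points_i), len(points_j)
    result = []
    for d in distances:
        count = sum(1
                    for (xi, yi) in points_i
                    for (xj, yj) in points_j
                    if math.hypot(xi - xj, yi - yj) <= d)
        result.append(area * count / (n_i * n_j))
    return result


# Hypothetical original points and a mask that shifts each one 10 m east,
# inside an assumed 1000 m x 1000 m study area.
original = [(0.0, 0.0), (100.0, 0.0), (0.0, 100.0), (100.0, 100.0)]
masked = [(x + 10.0, y) for x, y in original]
distances = [25.0, 50.0]

k_values = cross_k(original, masked, distances, area=1000.0 * 1000.0)
# Benchmark for two spatially independent patterns: pi * d^2.
benchmark = [math.pi * d ** 2 for d in distances]
```

Values far above the benchmark indicate that masked points cluster around the original points at that distance. In practice the comparison is made against simulation envelopes rather than the theoretical curve alone.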
Table 5. Cluster attributes for nearest neighbor hierarchical clustering.
Obfuscation method                    First order clusters   Mean cluster points   Mean cluster density
Unmasked (ODP)                        20                     27.7                  0.00045
Grid masking (GM)                     17                     28.0                  0.00103
Random perturbation (RP)              13                     27.6                  0.00093
Weighted random perturbation (WRP)    20                     27.4                  0.00105
Voronoi masking (VM)                  21                     28.0                  0.00102
5. Discussion and conclusions
This study tested the performance of four obfuscation techniques in preserving the spatial patterns of warm water energy consumption in the Hermagor District of Carinthia, Austria. Grid masking (GM), random perturbation (RP), and weighted random perturbation (WRP) all have previous documented uses in masking studies, and are evaluated here alongside a new methodological contribution, Voronoi masking (VM). Between all the tests of underlying spatial pattern, VM outperforms the other obfuscation methods for preserving point distributions.
5.1. Preservation of spatial distributions
The VM mean and median centers are closer to those of the ODP compared to the other masks. This overall result is reaffirmed by difference maps from a kernel density estimation of the original and masked data points. VM and WRP exhibit less variation from the density rasters of the ODP than the other methods. This is expected, since both of these methods are better tailored to the underlying spatial structure of data, moving points smaller distances where the density of points is higher. This better maintains patterns in concentrated areas as well as maintains the k-anonymity of the masked points. One test where the other masking measures outperformed VM in matching the ODP results was with a global Moran's I, where the ODP pattern was clustered. VM did not reach significance for clustering, yet GM and RP did. The semivariogram analysis did not provide much insight in this regard, since the trend of the data was primarily horizontal. In the cluster analysis, however, VM again best approximated the results of the ODP, more closely matching the number, point frequency, and location of first-order clusters. GM and RP fared worse in these regards. In the PSA, conducted using a cross-k analysis, all the results exhibited significant spatial dependence. This remained true when 99 simulations were run to test significance. Given the almost identical nature of the cross-k plots among obfuscation methods, the overall similarity between all masked points and the ODP is confirmed. However, the results suggest that a different point similarity test would be needed to uncover slight variations in the levels of dependence between the point structures.
5.2. Performance of grid masking as representing statistical units An objective of this study was to determine how well masking would fare for privacy and pattern preservation compared to the currently implemented technique of aggregation to 125-m statistical units. The grid masking used in this study, which snapped the ODP to the centroids of 125-m grid cells, best approximates this aggregation technique and serves as its proxy in this analysis. The results present convincing evidence that this aggregation is more disruptive of spatial patterns than alternative masking techniques. GM demonstrated greater departure from the ODP kernel density patterns, as shown in the KDE difference maps for warm water energy consumption. In the cluster analysis, GM resulted in fewer clusters detected, and they tended to be offset from the ODP ellipses. Performing a cluster analysis based on such an aggregated pattern is more likely to lead to inaccurate results that could negatively affect the fitness of masked data for decision support. For cluster analyses, this study recommends an obfuscation technique that is more tailored to underlying data structure, such as VM or WRP.
Fig. 7. Subset of standard deviational ellipses for first-order hierarchical clusters.
Fig. 8. Cross-k graphs with envelopes from 99 simulations.
5.3. Privacy protection Another key part of this research was evaluating privacy protection as measured by distance to the kth nearest neighbor (k-anonymity). VM and GM both lowered the average distance to the nearest neighbor. This is because both techniques tend to snap points to each other, placing them at identical locations. This preserves privacy by making it more difficult to infer which household a given data point originates from. As the number of neighbors
increased to 10 and 20, VM continued to outperform the other masking techniques, even GM, with the lowest average distance to the kth nearest neighbor. A next step for research on Voronoi masking is an evaluation of its reversal potential. While VM outperforms the other methods for privacy tested in this study, an obfuscation method is only valuable if it cannot be reversed and deciphered. The advantage for privacy in RP and WRP is that randomization makes the resulting pattern challenging to reverse engineer in order to infer actual identities. The pattern in VM is not random, and is instead dependent on the spatial structure of residences within a study area. If VM could be decrypted based on, for example, alignment with Voronoi polygons, there would be a greater risk to privacy with this method. This potential vulnerability remains untested at this time, and more research is needed on the reverse engineering of masking techniques.
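The privacy measure used in Section 5.3, the average distance to the kth nearest neighbor, can be sketched with a brute-force calculation. This is a minimal illustration under the assumption of projected coordinates; the function name `avg_kth_nn_distance` and the sample points are hypothetical, and the study's own software and parameters are not reproduced here.

```python
import math


def avg_kth_nn_distance(points, k):
    """Average distance from each point to its k-th nearest neighbor.

    Brute-force O(n^2 log n) sketch. Coincident points count as
    neighbors at distance zero, so snapping points together lowers the result.
    """
    if not 1 <= k < len(points):
        raise ValueError("k must be between 1 and len(points) - 1")
    total = 0.0
    for i, (xi, yi) in enumerate(points):
        # Distances from point i to every other point, sorted ascending
        dists = sorted(
            math.hypot(xi - xj, yi - yj)
            for j, (xj, yj) in enumerate(points)
            if j != i
        )
        total += dists[k - 1]
    return total / len(points)


# Hypothetical masked pattern in which two points share one location
masked_pts_nn = [(0.0, 0.0), (0.0, 0.0), (3.0, 4.0)]
d1 = avg_kth_nn_distance(masked_pts_nn, 1)
d2 = avg_kth_nn_distance(masked_pts_nn, 2)
```

In this example the two coincident points each have a nearest neighbor at distance zero, which pulls the k = 1 average down. This mirrors the explanation above for why VM and GM lower the average nearest neighbor distance.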
Acknowledgments This research was made possible with fellowship funding from the Austrian Marshall Plan Foundation. Data provision by the Carinthian Geographic Information System KAGIS and the Interreg 4 A Project “AlterVis” is highly acknowledged.