Comparison of microaggregation approaches on anonymized data quality


Expert Systems with Applications 37 (2010) 8161–8165


Comparison of microaggregation approaches on anonymized data quality Jun-Lin Lin, Pei-Chann Chang, Julie Yu-Chih Liu *, Tsung-Hsien Wen Department of Information Management, Yuan Ze University, Chung-Li 320, Taiwan

Keywords: Microaggregation; Disclosure control; k-Anonymity; Information loss

Abstract

Microaggregation is commonly used to protect microdata from individual identification by anonymizing dataset records so that the resulting dataset (called the anonymized dataset) satisfies the k-anonymity constraint. Since this anonymizing process degrades data quality, an effective microaggregation approach must preserve the quality of the anonymized dataset so that it remains useful for further analysis. The performance of a microaggregation approach should therefore be measured by the quality of the anonymized dataset it generates. Previous studies usually equate the quality of an anonymized dataset with its information loss. This study takes a different approach. Since an anonymized dataset should support further analysis, this study first builds a classifier from the anonymized dataset, and then uses the prediction accuracy of that classifier to represent the quality of the anonymized dataset. Performance results indicate that low information loss does not necessarily translate into high prediction accuracy, and vice versa. This is particularly true when the information losses of two anonymized datasets do not differ significantly.

© 2010 Elsevier Ltd. All rights reserved.

1. Introduction

Protecting publicly released microdata has recently become a major societal concern (Domingo-Ferrer & Torra, 2005b; Willenborg & Waal, 2001). Samarati (2001) and Sweeney (2002) introduced the k-anonymity constraint to provide a level of protection against individual identification in microdata. Since then, researchers have proposed many techniques to generate an anonymized version of a dataset such that the resulting dataset satisfies the k-anonymity constraint (Domingo-Ferrer & Mateo-Sanz, 2002; LeFevre, DeWitt, & Ramakrishnan, 2006; Lin & Wei, 2009). A dataset satisfies the k-anonymity constraint if, for a given positive integer k, each record in the dataset is identical to at least k − 1 other records in the same dataset with respect to a set of privacy-related attributes called quasi-identifiers. A common way to identify individuals in microdata is to link the quasi-identifiers to external datasets. Intuitively, a larger k provides better protection of the original dataset, at the cost of lower data quality in the corresponding anonymized dataset. Microaggregation is commonly used to achieve k-anonymity (Chang, Li, & Huang, 2007; Domingo-Ferrer, Martínez-Ballesté, Mateo-Sanz, & Sebé, 2006; Domingo-Ferrer & Mateo-Sanz,

* Corresponding author. E-mail addresses: [email protected] (J.-L. Lin), [email protected] (P.-C. Chang), [email protected] (J.Y.-C. Liu), [email protected] (T.-H. Wen). 0957-4174/$ - see front matter © 2010 Elsevier Ltd. All rights reserved. doi:10.1016/j.eswa.2010.05.071

2002; Laszlo & Mukherjee, 2005). This technique partitions a dataset into groups of at least k records; for each group and each quasi-identifier, it then replaces the values of the quasi-identifier of all records in the group with their arithmetic mean. The resulting dataset (called the anonymized dataset) satisfies the k-anonymity constraint, but its data quality is inferior to that of the original dataset. An effective microaggregation approach must therefore enforce the k-anonymity constraint while minimizing the degradation of data quality. Previous studies often measure the quality of an anonymized dataset by calculating the information loss of the dataset; the calculated information loss represents the effectiveness of the microaggregation approach that generated it. A lower information loss implies that the anonymized dataset is less distorted, and thus provides higher-quality data for analysis. This study measures the quality of an anonymized dataset from a different perspective: since the purpose of an anonymized dataset is to support further analysis, its quality can be quantified by the performance of a prediction model built from the anonymized dataset. This measure provides a more realistic view of data quality than information loss does. The rest of this paper is organized as follows. Section 2 describes two methods for evaluating microaggregation approaches. Section 3 reviews the microaggregation approaches whose performances are compared in this study. Section 4 describes the experimental framework and results. Section 5 draws conclusions and provides recommendations for future research.
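The group-mean replacement just described can be sketched in a few lines (an illustrative toy, not the authors' code; the groups are supplied by hand here rather than produced by a microaggregation algorithm):

```python
import numpy as np

def microaggregate(data, groups):
    """Replace each record's quasi-identifiers with its group centroid.

    data   : (n, p) array of numeric quasi-identifier values
    groups : list of index lists, each of size >= k
    """
    anonymized = data.astype(float).copy()
    for idx in groups:
        anonymized[idx] = data[idx].mean(axis=0)  # group centroid
    return anonymized

# Example with k = 2: two groups of two records each. After anonymization,
# the records within each group are identical on the quasi-identifiers.
data = np.array([[1.0, 10.0], [3.0, 12.0], [8.0, 2.0], [10.0, 4.0]])
anon = microaggregate(data, [[0, 1], [2, 3]])
```

Any attribute that is not a quasi-identifier would simply be carried over unchanged alongside the anonymized columns.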


2. Measurements of anonymized data quality

Microaggregation is traditionally restricted to datasets whose quasi-identifiers are numerical, so that the arithmetic mean can be easily calculated. Recent studies extend microaggregation to categorical and ordinal quasi-identifiers (Domingo-Ferrer & Torra, 2005a; Torra, 2004). The current study is restricted to numerical quasi-identifiers, but the results can easily be extended to categorical and ordinal quasi-identifiers as well.

Consider a dataset T with p numerical quasi-identifiers and n records, where each record is represented as a vector in a p-dimensional space. For a given positive integer k ≤ n, a microaggregation approach partitions T into g groups, where each group contains at least k records, and then replaces the records in each group with the centroid of the group. Note that only the values of the quasi-identifiers are replaced; if the dataset T also contains attributes that are not quasi-identifiers, their values remain unchanged. Let n_i denote the number of records in the ith group, and x_{ij}, 1 ≤ j ≤ n_i, denote the records in the ith group. Then n_i ≥ k for i = 1 to g, and \sum_{i=1}^{g} n_i = n. The centroid of the ith group, denoted \bar{x}_i, is the average vector of all the records in the ith group. Similarly, the centroid of T, denoted \bar{x}, is the average vector of all the records in T. A microaggregation approach generates an anonymized version of the dataset T. The quality of the anonymized dataset can be measured using either information loss or prediction accuracy, as described below.

2.1. Information loss

Information loss quantifies the amount of information that is lost after applying a microaggregation approach. This study adopts the most common definition of information loss, from Domingo-Ferrer and Mateo-Sanz (2002):

IL = SSE / SST    (1)

where SSE is the within-group sum of squared errors, calculated by summing the squared Euclidean distance of each record x_{ij} to its group centroid \bar{x}_i:

SSE = \sum_{i=1}^{g} \sum_{j=1}^{n_i} (x_{ij} - \bar{x}_i)^T (x_{ij} - \bar{x}_i)    (2)

and SST is the total sum of squared errors within the entire dataset T, calculated by summing the squared Euclidean distance of each record x_{ij} to the centroid \bar{x} of T:

SST = \sum_{i=1}^{g} \sum_{j=1}^{n_i} (x_{ij} - \bar{x})^T (x_{ij} - \bar{x})    (3)

Since SST is fixed for a given dataset T regardless of how T is partitioned, the microaggregation problem can be formalized as a constrained optimization problem.

Problem Definition 1. Given a dataset T and a positive integer k, find a partitioning G = {G_1, ..., G_g} of T that minimizes SSE, subject to the constraint that |G_i| ≥ k for any G_i ∈ G.

This problem is NP-hard for multivariate datasets (Oganian & Domingo-Ferrer, 2001). Generally, SSE is low when the number of groups is large; therefore, the number of records in each group should be kept close to k. Domingo-Ferrer and Mateo-Sanz (2002) proved that no group should contain more than 2k − 1 records, since such a group can always be partitioned further to reduce information loss.

2.2. Prediction accuracy

Another way to quantify the quality of an anonymized dataset is to put the dataset to real use and measure its performance. This measurement provides a more realistic view of data quality than information loss does. For regression tasks, Schmid and Schneeweiss (2005) studied the effect of five microaggregation approaches on the estimation of linear models. The current study focuses on classification tasks, where an anonymized dataset is used to build a classifier. The microaggregation problem can therefore be formalized as follows.

Problem Definition 2. Given a dataset T, a positive integer k and a classification algorithm, find a partitioning G = {G_1, ..., G_g} of T such that |G_i| ≥ k for any G_i ∈ G, with the objective that the classifier built using the classification algorithm and the anonymized version (based on G) of T yields the best accuracy rate.

Cross validation is a commonly used procedure for measuring the accuracy of a classification algorithm. Traditionally, cross validation first divides a dataset into m folds of approximately equal size. The classification algorithm then repeatedly takes m − 1 folds of data as the training set and the remaining fold as the test set, until each fold has been used as the test set once. Finally, the accuracy of the algorithm is calculated as the average of the accuracy rates on these m test sets.

Fig. 1 shows a naive procedure for applying cross validation to an anonymized dataset, given a dataset T, a microaggregation approach and a classification algorithm. This procedure first anonymizes the dataset T to generate the anonymized dataset T′, and then applies cross validation on T′ to measure the accuracy of the classification algorithm. Since the anonymized dataset T′ satisfies the k-anonymity constraint, each record in T′ has at least k − 1 other records in T′ that are identical with respect to the quasi-identifiers. The accuracy rate estimated by the naive procedure in Fig. 1 may therefore be overly optimistic, because identical records can appear in both the training and test sets.

Fig. 1. A naive m-fold cross validation procedure for an anonymized dataset.

Fig. 2 resolves this problem by separating the anonymizing process from the test sets. The original dataset is first divided into m folds. One fold of data is kept un-anonymized and used as the test set. The remaining m − 1 folds are anonymized and used as the training set for building a classifier, which is then used to predict the records in the test set. This procedure mimics the process of using an anonymized dataset to build a model for predicting a future dataset, which is not anonymized. The procedure is repeated using another fold of data as the test set until each fold has been used as the test set once.

Fig. 2. An improved m-fold cross validation procedure for an anonymized dataset.

Fig. 4. HDF Algorithm: first stage (Lin et al., submitted for publication).
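The improved procedure of Fig. 2 can be sketched as follows (a minimal illustration, not the Weka-based setup used in the paper; `anonymize` and the 1-NN stand-in classifier are placeholder assumptions):

```python
import numpy as np

def improved_cv_accuracy(X, y, anonymize, train_and_test, m=10, seed=0):
    """m-fold cross validation in which only the training folds are anonymized;
    each test fold stays un-anonymized, mimicking prediction on future raw data."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), m)
    accuracies = []
    for i in range(m):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(m) if j != i])
        X_train = anonymize(X[train_idx])  # anonymize the training folds only
        accuracies.append(
            train_and_test(X_train, y[train_idx], X[test_idx], y[test_idx]))
    return float(np.mean(accuracies))

def one_nn(X_train, y_train, X_test, y_test):
    """Stand-in classifier: 1-nearest-neighbor accuracy on the test fold."""
    preds = [y_train[int(np.argmin(((X_train - x) ** 2).sum(axis=1)))]
             for x in X_test]
    return float(np.mean(np.array(preds) == y_test))

# Two well-separated classes; the identity "anonymizer" stands in for any
# microaggregation routine.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.5, (10, 2)), rng.normal(10.0, 0.5, (10, 2))])
y = np.array([0] * 10 + [1] * 10)
acc = improved_cv_accuracy(X, y, anonymize=lambda d: d, train_and_test=one_nn, m=5)
```

Swapping the identity lambda for a real microaggregation routine reproduces the Fig. 2 setup: training data is distorted, test data is not.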

3. Microaggregation approaches

This section describes the microaggregation approaches whose anonymized dataset qualities are compared in this study. For a recent survey and classification of microaggregation approaches, refer to Lin, Wen, Hsieh, and Chang (2010).

The Maximum Distance to Average Vector (MDAV) method is the most widely used microaggregation approach (Solanas, 2008). MDAV first finds the record r that is farthest from the centroid of the dataset, and the farthest neighbor s of r. MDAV then forms two groups, one containing r and its k − 1 nearest neighbors, and the other containing s and its k − 1 nearest neighbors. This process is repeated until fewer than 2k records remain. When the number of remaining records is between k and 2k − 1, MDAV forms a new group with all of the remaining records; when fewer than k records remain, MDAV adds each remaining record to its nearest group.

Fig. 3 shows an improved version of MDAV, called MDAV-1. When the number of remaining records is between k and 2k − 1, MDAV-1 forms a new group containing the record farthest from the centroid of the remaining records and its k − 1 nearest records. Any remaining records are then added to their respective nearest groups.

Lin et al. (2010) proposed a two-stage density-based microaggregation approach, denoted in this study as High Density First (HDF). Given a set T of records and a positive integer k, the k-neighborhood of a record x̂ ∈ T with respect to T, denoted N^k(x̂, T), is the set containing x̂ and the k − 1 records nearest to x̂ in T. Let \bar{N}^k(x̂, T) denote the centroid of N^k(x̂, T). The k-density of x̂ in T, denoted d^k(x̂, T), is the inverse of the sum of the squared Euclidean distances from each record in N^k(x̂, T) to \bar{N}^k(x̂, T). That is,

d^k(x̂, T) = 1 / \sum_{x ∈ N^k(x̂, T)} (x - \bar{N}^k(x̂, T))^T (x - \bar{N}^k(x̂, T))    (4)
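A minimal sketch of the MDAV procedure described above, evaluated with the information loss of Eq. (1) (function names and the toy data are our own illustration, not the authors' implementation):

```python
import numpy as np

def mdav(data, k):
    """Partition `data` (an (n, p) array) into groups of at least k records."""
    remaining = list(range(len(data)))
    groups = []

    def take_group(seed, pool):
        # seed plus its k - 1 nearest neighbors in pool (seed is at distance 0)
        d = ((data[pool] - data[seed]) ** 2).sum(axis=1)
        return [pool[i] for i in np.argsort(d)[:k]]

    while len(remaining) >= 2 * k:
        centroid = data[remaining].mean(axis=0)
        # r: record farthest from the centroid; s: record farthest from r
        r = remaining[int(np.argmax(((data[remaining] - centroid) ** 2).sum(axis=1)))]
        s = remaining[int(np.argmax(((data[remaining] - data[r]) ** 2).sum(axis=1)))]
        g1 = take_group(r, remaining)
        remaining = [i for i in remaining if i not in g1]
        g2 = take_group(s, remaining)
        remaining = [i for i in remaining if i not in g2]
        groups += [g1, g2]
    if len(remaining) >= k:          # between k and 2k - 1 records: one group
        groups.append(remaining)
    elif remaining:                  # fewer than k records: nearest group each
        for i in remaining:
            cents = [data[g].mean(axis=0) for g in groups]
            j = int(np.argmin([((data[i] - c) ** 2).sum() for c in cents]))
            groups[j].append(i)
    return groups

def information_loss(data, groups):
    """IL = SSE / SST as in Eq. (1)."""
    sst = ((data - data.mean(axis=0)) ** 2).sum()
    sse = sum(((data[g] - data[g].mean(axis=0)) ** 2).sum() for g in groups)
    return sse / sst

# Two tight pairs of records; with k = 2, each pair becomes one group and the
# information loss is close to zero.
data = np.array([[0.0, 0.0], [0.1, 0.0], [10.0, 10.0], [10.1, 10.0]])
groups = mdav(data, k=2)
```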

HDF has two stages. The first stage (see Fig. 4) repeatedly builds a new group from the k-neighborhood of the record with the highest k-density among all records not yet assigned to any group. This stage continues until fewer than k unassigned records remain (steps 2–5 of Fig. 4); these remaining records are then assigned to their respective nearest groups (step 6 of Fig. 4). The output of the first stage is a partitioning G of the dataset T in which each group contains no fewer than k records.

Fig. 3. MDAV-1 method (Lin et al., submitted for publication).

Fig. 5. HDF Algorithm: second stage (Lin et al., submitted for publication).

The second stage of HDF (see Fig. 5) attempts to fine-tune G by checking whether each group should be decomposed and its records merged into other groups. Given a group G_i ∈ G, let L_noMerge denote the information loss of G, and L_merge denote the information loss of G after each record in G_i has been added to its nearest group in G − {G_i}. If L_noMerge > L_merge, the group G_i is removed from G and each record in G_i is merged into its nearest group in G; otherwise, G_i remains unchanged. During the second stage, all groups in G are checked in the reverse of the order in which they were added to G in the first stage. Finally, the MDAV-1 algorithm (see Fig. 3) is applied to each group of size ≥ 2k in G at the end of the second stage.

This study proposes a Low Density First (LDF) microaggregation approach. The first stage of LDF is exactly the same as that of HDF, except that step 2 in Fig. 4 finds the group with the lowest rather than the highest density. Here, the density of a group is the inverse of the sum of the squared Euclidean distances from each record in the group to the group's centroid. Fig. 6 shows that the second stage of LDF sorts

Fig. 6. LDF and MDAV-2 Algorithms: second stage.


the groups generated in the first stage by their densities. The groups are then checked in descending order of density to determine whether or not to decompose them, as in HDF. The second stage of LDF can also be applied to partitioning results generated by other microaggregation approaches. For example, it is possible to use MDAV-1 in the first stage and apply the procedure in Fig. 6 in the second stage; the resulting approach is referred to as MDAV-2.
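The density-driven first stage shared by HDF and LDF (Fig. 4, with the k-density of Eq. (4)) can be sketched as follows; this is a hedged illustration with our own function names, not the authors' implementation:

```python
import numpy as np

def k_density(data, x, pool, k):
    """d^k of record x within pool (Eq. (4)); also returns the k-neighborhood."""
    d = ((data[pool] - data[x]) ** 2).sum(axis=1)
    neigh = [pool[i] for i in np.argsort(d)[:k]]          # x plus k - 1 nearest
    spread = ((data[neigh] - data[neigh].mean(axis=0)) ** 2).sum()
    return (1.0 / spread if spread > 0 else float("inf")), neigh

def density_first_stage(data, k, densest_first=True):
    """First stage shared by HDF (densest first) and LDF (sparsest first)."""
    remaining = list(range(len(data)))
    groups = []
    while len(remaining) >= k:
        scored = [k_density(data, i, remaining, k) for i in remaining]
        pick = np.argmax if densest_first else np.argmin
        _, group = scored[int(pick([s[0] for s in scored]))]
        groups.append(group)
        remaining = [i for i in remaining if i not in group]
    for i in remaining:                     # leftovers join their nearest group
        cents = [data[g].mean(axis=0) for g in groups]
        groups[int(np.argmin([((data[i] - c) ** 2).sum() for c in cents]))].append(i)
    return groups

# With k = 2, each tight pair forms one group regardless of pick order.
data = np.array([[0.0, 0.0], [0.1, 0.0], [10.0, 10.0], [10.1, 10.0]])
groups = density_first_stage(data, k=2)
```

The second stage (merge checks and the final MDAV-1 pass) is omitted here for brevity.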

Table 4
Prediction accuracy rate for KNN with K = 3k.

          Iris                          Ecoli
          k = 3    k = 5    k = 10     k = 3    k = 5    k = 10
HDF       95.33    96.00    94.00      86.31    81.85    82.74
LDF       97.33    94.00    87.33      83.63    85.12    80.06
MDAV-1    96.00    96.67    89.33      83.04    85.42    83.04
MDAV-2    96.67    98.00    89.33      84.23    84.23    85.12
None      95.33    96.67    96.00      87.20    85.42    84.23

4. Performance results

This study implemented the four microaggregation approaches described in Section 3 (HDF, LDF, MDAV-1 and MDAV-2) and performed experiments to compare them, based on the quality of their resulting anonymized datasets. The quality of an anonymized dataset was measured using both information loss and prediction accuracy, as described in Section 2. This study used the K-nearest neighbor (KNN) algorithm, as implemented in Weka (Witten & Frank, 2005), to build classifiers from the anonymized datasets. The prediction accuracy of a classifier was measured by the cross validation procedure of Fig. 2, with m = 10. To ensure a fair comparison, each dataset was first partitioned into 10 folds, and the same folds were then used in all tests for that dataset. The experiments used the Iris and Ecoli datasets from the UCI repository (Asuncion & Newman, 2007). Each dataset contains a number of numerical attributes and one categorical attribute. All numerical attributes were treated as quasi-identifiers and were anonymized by the microaggregation approaches. The categorical attribute remained unchanged and was used as the dependent variable to be predicted by the classifiers. HDF and MDAV-2 offer the best performance in terms of information loss: Table 1 shows that each incurs the least information loss in three of the six test conditions. Tables 2–5 show the

Table 1
Information loss.

          Iris                          Ecoli
          k = 3    k = 5    k = 10     k = 3     k = 5     k = 10
HDF       2.285    4.901     8.728     15.193    18.742    26.808
LDF       3.124    6.038     8.958     14.915    19.649    26.437
MDAV-1    2.653    6.027    11.013     14.920    19.310    25.467
MDAV-2    2.435    4.646     8.949     13.933    19.122    25.099

Table 2
Prediction accuracy rate for KNN with K = k.

          Iris                          Ecoli
          k = 3    k = 5    k = 10     k = 3    k = 5    k = 10
HDF       94.00    93.33    98.00      81.55    83.33    83.93
LDF       94.00    95.33    92.67      83.33    82.14    81.85
MDAV-1    95.33    94.00    96.00      80.95    81.55    81.85
MDAV-2    95.33    94.67    96.00      82.14    82.14    81.85
None      95.33    96.67    95.33      83.93    86.90    86.61

Table 3
Prediction accuracy rate for KNN with K = 2k.

          Iris                          Ecoli
          k = 3    k = 5    k = 10     k = 3    k = 5    k = 10
HDF       96.67    96.67    86.00      85.71    82.74    78.57
LDF       96.00    94.67    88.00      84.23    84.52    77.98
MDAV-1    94.67    96.67    92.67      84.82    83.04    80.06
MDAV-2    95.33    96.00    90.67      84.82    83.93    80.06
None      96.67    96.00    95.33      85.12    83.63    80.06

prediction accuracy of KNN classifiers with K ranging from k to 4k, where K is the number of nearest neighbors used in KNN and k is the parameter for k-anonymity. The prediction accuracy of a microaggregation approach is shown in italic if that approach yields the highest accuracy for a specific setting of k and K. Although LDF and MDAV-1 do not incur the lowest information loss, they sometimes yield the highest prediction accuracy. Among the 24 different settings of (K, k), HDF yields the highest prediction accuracy nine times, LDF five times, MDAV-1 five times, and MDAV-2 eight times. These results show that a microaggregation approach that incurs lower information loss does not necessarily yield higher prediction accuracy. The rows labeled "none" in Tables 2–5 give the prediction accuracy calculated using the traditional ten-fold cross validation procedure without anonymizing the dataset, and act as a baseline. In most cases, a classifier based on an anonymized dataset has a lower prediction accuracy than the corresponding baseline under the same settings of K and k. However, in some cases (shown in bold in Tables 2–5) the prediction accuracy of a classifier based on an anonymized dataset is higher than the corresponding baseline. This may result from the reduction of variance in the anonymized dataset.

5. Conclusions

Table 5
Prediction accuracy rate for KNN with K = 4k.

          Iris                          Ecoli
          k = 3    k = 5    k = 10     k = 3    k = 5    k = 10
HDF       94.00    97.33    94.67      84.82    83.04    82.14
LDF       96.00    94.67    92.00      82.44    83.33    82.74
MDAV-1    94.00    95.33    90.67      83.93    84.82    82.74
MDAV-2    94.67    98.00    91.33      85.12    86.01    83.33
None      96.67    95.33    96.00      85.12    86.61    83.63

Problem Definition 3. Given a dataset T, a positive integer k and a classification algorithm, find a partitioning G = {G_1, ..., G_g} of T such that |G_i| ≥ k for any G_i ∈ G, with the following objectives: (i) minimize the SSE of the anonymized dataset of T; (ii) maximize the prediction accuracy of the classifier built using the classification algorithm on the anonymized dataset.
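Under Problem Definition 3, candidate approaches would be compared on two objectives at once, keeping the non-dominated ones. A minimal sketch, illustrated with the Iris, k = 5 values reported in Table 1 (SSE-based information loss) and Table 4 (accuracy for K = 3k):

```python
def pareto_front(candidates):
    """candidates: list of (name, information_loss, accuracy); lower loss and
    higher accuracy are better. Returns the names of non-dominated candidates."""
    front = []
    for name, loss, acc in candidates:
        dominated = any(l <= loss and a >= acc and (l < loss or a > acc)
                        for _, l, a in candidates)
        if not dominated:
            front.append(name)
    return front

# Iris, k = 5: information loss from Table 1, accuracy from Table 4 (K = 3k).
candidates = [("HDF", 4.901, 96.00), ("LDF", 6.038, 94.00),
              ("MDAV-1", 6.027, 96.67), ("MDAV-2", 4.646, 98.00)]
best = pareto_front(candidates)
```

In this particular setting MDAV-2 dominates the other three approaches on both objectives; in other settings of (K, k) the front would contain several approaches.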


The two objectives in Problem 3 are to minimize information loss and maximize prediction accuracy. Researchers have proposed many microaggregation approaches, but most of them consider only quasi-identifiers. To achieve better prediction accuracy, new approaches should be developed that also consider attributes that are not quasi-identifiers. The experimental results show that a microaggregation approach has two opposing effects on data quality: on one hand, it anonymizes data, and therefore degrades data quality and prediction accuracy; on the other hand, anonymization has a denoising effect on the dataset, which reduces variation among nearby records and may improve prediction accuracy.

References

Asuncion, A., & Newman, D. J. (2007). UCI machine learning repository. Irvine, CA: University of California, School of Information and Computer Science. Available from.
Chang, C.-C., Li, Y.-C., & Huang, W.-H. (2007). TFRP: An efficient microaggregation algorithm for statistical disclosure control. Journal of Systems and Software, 80(11), 1866–1878.
Domingo-Ferrer, J., Martínez-Ballesté, A., Mateo-Sanz, J. M., & Sebé, F. (2006). Efficient multivariate data-oriented microaggregation. The VLDB Journal, 15(4), 355–369.
Domingo-Ferrer, J., & Mateo-Sanz, J. M. (2002). Practical data-oriented microaggregation for statistical disclosure control. IEEE Transactions on Knowledge and Data Engineering, 14(1), 189–201.
Domingo-Ferrer, J., & Torra, V. (2005a). Ordinal, continuous and heterogeneous k-anonymity through microaggregation. Data Mining and Knowledge Discovery, 11(2), 195–212.
Domingo-Ferrer, J., & Torra, V. (2005b). Privacy in data mining. Data Mining and Knowledge Discovery, 11(2), 117–119.
Laszlo, M., & Mukherjee, S. (2005). Minimum spanning tree partitioning algorithm for microaggregation. IEEE Transactions on Knowledge and Data Engineering, 17(7), 902–911.
LeFevre, K., DeWitt, D. J., & Ramakrishnan, R. (2006). Mondrian multidimensional k-anonymity. In Proceedings of the 22nd IEEE International Conference on Data Engineering (ICDE'06).
Lin, J.-L., Wen, T.-H., Hsieh, J.-C., & Chang, P.-C. (2010). Density-based microaggregation for statistical disclosure control. Expert Systems with Applications, 37(4), 3256–3263.
Lin, J.-L., & Wei, M.-C. (2009). Genetic algorithm-based clustering approach for k-anonymization. Expert Systems with Applications, 36(6), 9784–9792.
Oganian, A., & Domingo-Ferrer, J. (2001). On the complexity of optimal microaggregation for statistical disclosure control. Statistical Journal of the United Nations Economic Commission for Europe, 18, 345–354.
Samarati, P. (2001). Protecting respondents' privacy in microdata release. IEEE Transactions on Knowledge and Data Engineering, 13(6), 1010–1027.
Schmid, M., & Schneeweiss, H. (2005). The effect of microaggregation procedures on the estimation of linear models: A simulation study. Jahrbücher für Nationalökonomie und Statistik, 225(5), 529–543.
Solanas, A. (2008). Privacy protection with genetic algorithms. In Success in evolutionary computation (pp. 215–237). Springer.
Sweeney, L. (2002). k-Anonymity: A model for protecting privacy. International Journal on Uncertainty, Fuzziness and Knowledge-Based Systems, 10(5), 557–570.
Torra, V. (2004). Microaggregation for categorical variables: A median based approach. In Privacy in statistical databases (pp. 162–174).
Willenborg, L., & de Waal, T. (2001). Elements of statistical disclosure control. Lecture Notes in Statistics (Vol. 155). Berlin: Springer.
Witten, I. H., & Frank, E. (2005). Data mining: Practical machine learning tools and techniques (2nd ed.). Morgan Kaufmann.