Expert Systems with Applications 37 (2010) 3256–3263
Density-based microaggregation for statistical disclosure control

Jun-Lin Lin, Tsung-Hsien Wen, Jui-Chien Hsieh, Pei-Chann Chang
Department of Information Management, Yuan Ze University, Chung-Li, Taiwan
Keywords: Microaggregation; Disclosure control; k-Anonymity; Microdata protection
Abstract

Protection of personal data in statistical databases has recently become a major societal concern. Statistical disclosure control (SDC) is often applied to statistical databases before they are released for public use. Microaggregation for SDC is a family of methods that protect microdata (i.e., records on individuals and/or companies) from individual identification. Microaggregation works by partitioning the microdata into groups of at least k records and then replacing the records in each group with the centroid of the group. An optimal microaggregation method must minimize the information loss resulting from this replacement process. However, this problem of minimizing information loss has been shown to be NP-hard for multivariate data. Methods based on various heuristics have been proposed for this problem, but none performs best for every microdata set and every value of k. This work presents a density-based algorithm (DBA) for microaggregation. The DBA first forms groups of records in descending order of their densities, and then fine-tunes these groups in reverse order. The performance of the DBA is compared against the latest microaggregation methods. Experimental results indicate that the DBA incurs the least information loss for over half of the test situations.

© 2009 Elsevier Ltd. All rights reserved.
1. Introduction

Continuous advances in computer technologies enable governmental agencies and corporations to accumulate enormous amounts of personal data for analytical purposes. However, to protect personal data from individual identification, statistical disclosure control (SDC) is often applied before the data are released for analysis (Domingo-Ferrer & Torra, 2005b; Willenborg & Waal, 2001). SDC requires suppressing or altering the original data, and can therefore damage the quality of the data, and thus of the analysis results. Hence, SDC methods must strike a balance between data utility and personal confidentiality.

Microaggregation, a family of SDC methods for microdata, works by partitioning a dataset into groups of at least k records and then replacing the records in each group with the centroid of the group. The resulting dataset (called the anonymized dataset) satisfies k-anonymity (Sweeney, 2002), which requires each record in a dataset to be identical to at least k − 1 other records in the same dataset. Notably, both k-anonymity and microaggregation consider only a set of privacy-related attributes (called quasi-identifiers), rather than all attributes in a dataset. Microaggregation is traditionally restricted to numeric attributes in order to calculate the centroid of a group of records, but has also been extended to
handle categorical and ordinal attributes (Domingo-Ferrer & Torra, 2002a, 2005a; Torra, 2004). For clarity, this study is restricted to numeric attributes.

The effectiveness of a microaggregation method is measured by its information loss. A lower information loss implies that the anonymized dataset is less distorted from the original dataset, and thus provides better data quality for analysis. Personal confidentiality is achieved via the constraint of k-anonymity; to ensure the quality of the anonymized dataset, an effective microaggregation method should incur low information loss.

Information loss can be reduced by placing similar records in the same groups. In data mining, clustering is an effective means of grouping similar records together. Consequently, many microaggregation methods derive from traditional clustering algorithms. For instance, Domingo-Ferrer and Mateo-Sanz (2002) proposed univariate and multivariate k-Ward algorithms that extend the agglomerative hierarchical clustering method of Ward (1963), Domingo-Ferrer and Torra (2002b, 2003) proposed a microaggregation method based on the fuzzy c-means algorithm (Bezdek, 1981), and Laszlo and Mukherjee (2005) extended the standard minimum spanning tree partitioning algorithm (Zahn, 1971) for microaggregation. These three microaggregation methods build all groups gradually but simultaneously. Other heuristics that build one group at a time have also been proposed for microaggregation. Notable examples include the Maximum Distance method (Solanas, 2008), Diameter-based Fixed-size microaggregation and Centroid-based Fixed-size microaggregation (Laszlo & Mukherjee, 2005), MDAV
(Domingo-Ferrer & Torra, 2005a), MHM (Domingo-Ferrer, Martínez-Ballesté, Mateo-Sanz, & Sebé, 2006) and the Two Fixed Reference Points method (Chang, Li, & Huang, 2007). The build-one-group-at-a-time heuristics tend to cause less information loss than the build-all-groups-simultaneously methods. However, in terms of information loss, no single microaggregation method outperforms all others for every dataset and every value of k.

This work presents a density-based microaggregation method with low information loss. The proposed method first builds groups of k records by following the density of each record in descending order. It then determines whether to merge the records of each group (starting from the low-density groups) into other groups. The proposed method incurs the smallest information loss among state-of-the-art microaggregation methods for most test situations.

The remainder of this paper is organized as follows: Section 2 introduces the basic concept of microaggregation. Section 3 reviews previous microaggregation methods, and categorizes them as either build-one-group-at-a-time or build-all-groups-simultaneously. Section 4 then presents the proposed density-based algorithm for microaggregation. Section 5 reports experimental results for the proposed method, and compares them with other methods in the literature. Conclusions are finally drawn in Section 6, along with recommendations for future research.
2. Basic concept

Consider a dataset T with p numeric attributes and n records, where each record is represented as a vector in a p-dimensional space. For a given positive integer k ≤ n, a microaggregation method partitions T into g groups, where each group contains at least k records, and then replaces the records in each group with the centroid of the group. Let n_i denote the number of records in the ith group, and x_ij, 1 ≤ j ≤ n_i, denote the records in the ith group. Then n_i ≥ k for i = 1, …, g, and Σ_{i=1}^{g} n_i = n. The centroid of the ith group, denoted by x̄_i, is calculated as the average vector of all the records in the ith group. Similarly, the centroid of T, denoted by x̄, is the average vector of all the records in T.

Information loss is used to quantify the amount of information in a dataset that is lost after applying a microaggregation method. This study adopts the most common definition of information loss, from Domingo-Ferrer and Mateo-Sanz (2002):
$$ L = \frac{\mathrm{SSE}}{\mathrm{SST}} \qquad (1) $$
where SSE is the within-group sum of squared errors, calculated by summing the squared Euclidean distance of each record x_ij to the centroid x̄_i of its group, as follows:
$$ \mathrm{SSE} = \sum_{i=1}^{g} \sum_{j=1}^{n_i} (x_{ij} - \bar{x}_i)^{T} (x_{ij} - \bar{x}_i) \qquad (2) $$
and SST is the total sum of squared errors within the entire dataset T, calculated by summing the squared Euclidean distance of each record x_ij to the centroid x̄ of T, as follows:
$$ \mathrm{SST} = \sum_{i=1}^{g} \sum_{j=1}^{n_i} (x_{ij} - \bar{x})^{T} (x_{ij} - \bar{x}) \qquad (3) $$
Since SST is fixed for a given dataset T, regardless of how T is partitioned, the microaggregation problem can be formalized as a constrained optimization problem, as follows.
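To make the objective concrete, the following is a minimal Python/NumPy sketch of Eqs. (1)–(3); the function name and data layout are illustrative assumptions of ours, not from the paper.

```python
import numpy as np

def information_loss(data, groups):
    """L = SSE / SST (Eq. (1)) for a partition of `data`.

    data   : (n, p) array, one record per row.
    groups : list of lists of row indices, forming a partition of the rows.
    """
    # SST: squared distances of all records to the overall centroid (Eq. (3)).
    sst = ((data - data.mean(axis=0)) ** 2).sum()
    # SSE: squared distances of records to their group centroids (Eq. (2)).
    sse = 0.0
    for g in groups:
        pts = data[g]
        sse += ((pts - pts.mean(axis=0)) ** 2).sum()
    return sse / sst
```

Because SST depends only on T, comparing partitions by L is equivalent to comparing them by SSE.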
2.1. Problem definition

Given a dataset T and a positive integer k, find a partitioning G = {G_1, …, G_g} of T such that SSE is minimized, subject to the constraint that |G_i| ≥ k for every G_i ∈ G.

This problem can be solved in polynomial time for a univariate dataset (Hansen & Mukherjee, 2003), but has been shown to be NP-hard for a multivariate dataset (Oganian & Domingo-Ferrer, 2001). Generally, SSE is low when the number of groups is large. Therefore, the number of records in each group should be kept close to k. Domingo-Ferrer and Mateo-Sanz (2002) proved that no group should contain more than 2k − 1 records, since such groups can always be partitioned to further reduce information loss.

3. Classification of microaggregation methods

Microaggregation methods have been roughly divided into two categories in the literature, namely fixed-size and data-oriented (Domingo-Ferrer & Mateo-Sanz, 2002; Domingo-Ferrer et al., 2006). A fixed-size method divides a dataset into groups such that all groups have size k, except perhaps one group whose size is between k and 2k − 1. Data-oriented methods impose less constraint on the group size, allowing all groups to have sizes between k and 2k − 1. Intuitively, fixed-size methods reduce the search space, and thus are more computationally efficient than data-oriented methods. However, data-oriented methods can adapt to different values of k and various data distributions, and thus may achieve lower information loss than fixed-size methods. The downside of this categorization scheme is that some microaggregation methods are essentially the same, yet are classified into different categories. One such example is the MD method and the Diameter-based Fixed-size microaggregation method, described later in Section 3.1. Furthermore, most recently proposed microaggregation methods are data-oriented, making this classification scheme not particularly useful.

This study categorizes microaggregation methods as either build-all-groups-simultaneously or build-one-group-at-a-time. A build-all-groups-simultaneously microaggregation method builds and refines all groups simultaneously. In contrast, a build-one-group-at-a-time microaggregation method builds groups one at a time, and refines them if necessary once they are all formed. This categorization helps to distinguish among various microaggregation methods. The next two subsections review these two approaches in turn.

3.1. Build-one-group-at-a-time microaggregation methods

Most build-one-group-at-a-time microaggregation methods repeatedly choose two records according to various heuristics, and form two groups with the chosen records and their respective k − 1 nearest records. Other approaches choose one record instead of two each time, or allow the group size to expand beyond k. These methods are reviewed below.

Domingo-Ferrer and Mateo-Sanz (2002) proposed a multivariate fixed-size microaggregation method, later called the Maximum Distance (MD) method (Solanas, 2008). The MD method, shown in Fig. 1, repeatedly locates the two records that are most distant from each other, and forms two groups with their respective k − 1 nearest records. This method has a time complexity of O(n³), and works well for most datasets. Laszlo and Mukherjee (2005) modified the last step of the MD method such that each remaining record is added to its own nearest group.
The resulting method, called Diameter-based Fixed-size microaggregation by Laszlo and Mukherjee (2005), is not a fixed-size method, because it allows more than one group to have more than k records.
Fig. 1. MD method.
The Maximum Distance to Average Vector (MDAV) method is the most widely used microaggregation method (Solanas, 2008). MDAV is the same as MD, except at step 1 of Fig. 1. Instead of finding the two records most distant from each other, as in MD, MDAV finds the record that is most distant from the centroid of the dataset, together with the farthest neighbor of this record, to reduce the computation time. As with MD, when the number of remaining records is between k and 2k − 1, MDAV simply forms a new group with all of the remaining records; when the number of remaining records is below k, it adds all of the remaining records to their nearest groups. Thus, MDAV is a fixed-size method. This study modifies MDAV such that, when the number of remaining records is between k and 2k − 1, a new group is formed with the record that is furthest from the centroid of the remaining records, and its k − 1 nearest records. Any remaining records are then added to their respective nearest groups. Fig. 2 shows the resulting method, called MDAV-1. Experimental results indicate that MDAV-1 incurs slightly less information loss than MDAV.
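The iteration just described translates directly into code. Below is a minimal sketch of MDAV-1 (cf. Fig. 2), assuming records are rows of a NumPy array and using squared Euclidean distances; it is our paraphrase of the steps in the text, not the authors' implementation.

```python
import numpy as np

def sqdists(points, v):
    """Squared Euclidean distances from each row of `points` to vector `v`."""
    return ((points - v) ** 2).sum(axis=1)

def mdav1(data, k):
    """Sketch of MDAV-1; returns groups as lists of record indices."""
    remaining = list(range(len(data)))
    groups = []

    def take_group(seed):
        # The seed record (distance 0) plus its k - 1 nearest remaining records.
        d = sqdists(data[remaining], data[seed])
        group = [remaining[i] for i in np.argsort(d)[:k]]
        for i in group:
            remaining.remove(i)
        return group

    while len(remaining) >= 2 * k:
        # Record r furthest from the centroid, then r's farthest neighbor s
        # (s, being furthest from r, is not taken into r's group).
        centroid = data[remaining].mean(axis=0)
        r = remaining[int(np.argmax(sqdists(data[remaining], centroid)))]
        s = remaining[int(np.argmax(sqdists(data[remaining], data[r])))]
        groups.append(take_group(r))
        groups.append(take_group(s))

    if len(remaining) >= k:
        # MDAV-1's modification: seed one last group from the record
        # furthest from the centroid of the remaining records.
        centroid = data[remaining].mean(axis=0)
        r = remaining[int(np.argmax(sqdists(data[remaining], centroid)))]
        groups.append(take_group(r))

    for i in list(remaining):
        # Fewer than k leftovers: attach each to its nearest group.
        j = int(np.argmin([sqdists(data[[i]], data[g].mean(axis=0))[0]
                           for g in groups]))
        groups[j].append(i)
    return groups
```

The same skeleton yields the MD method by seeding each round with the two mutually most distant records instead of using the centroid.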
Fig. 3 shows a variant of the MDAV method, called MDAV-generic, proposed by Domingo-Ferrer and Torra (2005a), in which the threshold 2k in step 2 of Fig. 2 is raised to 3k in step 2 of Fig. 3. Because step 1 repeatedly removes 2k records from the remaining records until fewer than 3k records remain, the eventual number of remaining records is between k and 3k − 1. Finally, if the number of remaining records is between 2k and 3k − 1, then two new groups are formed (step 3); otherwise one new group is formed (step 4). This is a fixed-size method.

Laszlo and Mukherjee (2005) proposed another method, called Centroid-based Fixed-size microaggregation, which is also based on the centroid but builds only one group (instead of two, as in MDAV) during each iteration. Fig. 4 shows this method, which is not a fixed-size method, since it allows more than one group to have more than k records.

Solanas, Martinez-Balleste, and Domingo-Ferrer (2006b) proposed a variable-size variant of MDAV, called V-MDAV. Whenever V-MDAV builds a new group of k records, it tries to extend this group to up to 2k − 1 records based on the criteria shown in Fig. 5. V-MDAV adopts a user-defined parameter γ to control the threshold for adding more records to a group. Their experimental results indicated that γ close to zero is effective for scattered datasets. However, determining a proper value for γ is not trivial. Moreover, if the dataset is a mixture of scattered and clustered data, then a single value of γ may not work well in all parts of the dataset.
Fig. 2. MDAV-1 method.
Fig. 3. MDAV-generic method.
Fig. 4. Centroid-based Fixed-size microaggregation method.
Fig. 5. Criteria for expanding a group in V-MDAV.
Chang et al. (2007) proposed the Two Fixed Reference Points (TFRP) method to accelerate the clustering process of k-anonymization. During the first phase, TFRP selects two extreme points calculated from the dataset. Let G_min and G_max be the minimum and maximum values over all attributes in the dataset, respectively. Then, one reference point R_1 has G_min as its value for all attributes, and the other reference point R_2 has G_max as its value for all attributes. A group of k records is then formed with the record r that is furthest from R_1, together with the k − 1 records nearest to r. Similarly, another group of k records is formed with the record s that is furthest from R_2, together with the k − 1 records nearest to s. These two steps are repeated until fewer than k records remain. Finally, the remaining records are assigned to their respective nearest groups. Notably, R_1 and R_2 are fixed throughout the iterations, making this method quite efficient. After generating all groups, TFRP applies a refinement step to determine whether each group should be retained, or decomposed and added to other groups.
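For illustration, here is a minimal sketch of TFRP's first phase under the same NumPy conventions as the earlier sketches; the alternation between R_1 and R_2 and all names are our reading of the description above, not the authors' code.

```python
import numpy as np

def tfrp_first_phase(data, k):
    """Sketch of TFRP's first phase: R1 and R2 are fixed reference points
    built from the global attribute minimum and maximum; groups are
    seeded alternately by the records furthest from R1 and from R2."""
    p = data.shape[1]
    r1 = np.full(p, data.min())   # Gmin in every attribute
    r2 = np.full(p, data.max())   # Gmax in every attribute
    remaining = list(range(len(data)))
    groups = []
    while len(remaining) >= k:
        for ref in (r1, r2):
            if len(remaining) < k:
                break
            dref = ((data[remaining] - ref) ** 2).sum(axis=1)
            seed = remaining[int(np.argmax(dref))]
            dseed = ((data[remaining] - data[seed]) ** 2).sum(axis=1)
            group = [remaining[i] for i in np.argsort(dseed)[:k]]
            for i in group:
                remaining.remove(i)
            groups.append(group)
    # Records left over (< k of them) are then assigned to their nearest
    # groups; TFRP's refinement step is analogous to DBA's second stage.
    return groups, remaining
```

Because the reference points never move, each iteration only needs distances to two fixed vectors, which is what makes the method efficient.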
3.2. Build-all-groups-simultaneously microaggregation methods

Most build-all-groups-simultaneously microaggregation methods work by initially forming multiple groups of records in the form of trees, where each tree represents a group. Heuristics are then applied either to decompose a tree, reducing the group size below 2k, or to merge trees, raising the group size to at least k. Instead of using trees, other methods adaptively adjust the number of groups to ensure that the size of each group is between k and 2k − 1. These methods are reviewed below.

The multivariate k-Ward algorithm (Domingo-Ferrer & Mateo-Sanz, 2002) first finds the two records that are furthest from each other in the dataset, and builds two groups from these two records and their respective k − 1 nearest records. Each remaining record then forms its own group. These groups are repeatedly merged, as in Ward's agglomerative hierarchical clustering method (Ward, 1963), until all groups have at least k records. The k-Ward algorithm differs from Ward's method in that it never merges two groups that both have at least k records. Finally, the k-Ward algorithm is recursively applied to each group containing 2k or more records. The k-Ward algorithm tends to generate large groups, consequently increasing the information loss. For instance, it could merge two groups, each with k − 1 records, to form a large group of 2k − 2 records.

The minimum spanning tree microaggregation method (Laszlo & Mukherjee, 2005) first builds a minimum spanning tree (MST) of the dataset using Prim's method (Cormen, Leiserson, Rivest, & Stein, 2001). Then, as in the standard MST partitioning algorithm (Zahn, 1971), the longest edge is recursively removed to form a forest of subtrees of the MST. However, unlike in the standard MST partitioning algorithm, the longest edge is removed only if both resulting subtrees contain at least k nodes. Finally, another microaggregation method (such as MDAV) is applied to those groups containing more than 2k records. According to the experimental results reported by Laszlo and Mukherjee (2005), this method has the same complexity as the multivariate k-Ward algorithm, but causes less information loss. However, it still tends to generate large groups, and works well only if the dataset has well-separated clusters.
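The cut-only-if-both-sides-are-large idea can be sketched compactly. The following illustrative Python builds the MST with Prim's algorithm and then attempts cuts from the longest edge downward; per the text, groups that end up larger than 2k would afterwards be handled by another method such as MDAV. This is our sketch of the idea, not Laszlo and Mukherjee's code.

```python
import numpy as np

def mst_edges(data):
    """Prim's algorithm on the complete graph of squared distances;
    returns the n - 1 MST edges as (length, u, v) tuples."""
    n = len(data)
    dist = ((data - data[0]) ** 2).sum(axis=1).astype(float)
    parent = np.zeros(n, dtype=int)
    visited = np.zeros(n, dtype=bool)
    visited[0] = True
    edges = []
    for _ in range(n - 1):
        u = int(np.argmin(np.where(visited, np.inf, dist)))
        edges.append((float(dist[u]), int(parent[u]), u))
        visited[u] = True
        d = ((data - data[u]) ** 2).sum(axis=1)
        closer = (d < dist) & ~visited
        dist[closer] = d[closer]
        parent[closer] = u
    return edges

def component(adj, start):
    """Vertices reachable from `start` in the forest `adj`."""
    seen, stack = {start}, [start]
    while stack:
        for w in adj[stack.pop()]:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return seen

def mst_partition(data, k):
    """Cut the longest MST edges, keeping each cut only when both
    resulting subtrees still contain at least k nodes."""
    edges = mst_edges(data)
    adj = {i: set() for i in range(len(data))}
    for _, u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    for _, u, v in sorted(edges, reverse=True):
        adj[u].discard(v)
        adj[v].discard(u)
        if len(component(adj, u)) < k or len(component(adj, v)) < k:
            adj[u].add(v)          # undo the cut: a subtree fell below k
            adj[v].add(u)
    groups, seen = [], set()
    for v in adj:
        if v not in seen:
            comp = component(adj, v)
            seen |= comp
            groups.append(sorted(comp))
    return groups
```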
Domingo-Ferrer, Sebé, and Solanas (2008) proposed a multivariate microaggregation method called μ-Approx. This method first builds a forest, and then decomposes the trees in the forest such that all trees have sizes between k and max(2k − 1, 3k − 5). Finally, for any tree with size greater than 2k − 1, the node that is furthest from the centroid of the tree is found, one group is formed with this node and its k − 1 nearest records in the tree, and another group is formed with the remaining records in the tree.

Hansen and Mukherjee (2003) proposed a microaggregation method for univariate datasets. This method, called HM, converts a dataset into a directed acyclic graph based on the ordering of the records, and then transforms the microaggregation problem into a shortest-path problem, which can be solved in polynomial time. Notably, this method cannot be applied directly to multivariate datasets, since these have only a partial ordering among records. Domingo-Ferrer et al. (2006) proposed a multivariate version of the HM method, called MHM. This method first uses various heuristics, such as nearest point next (NPN), maximum distance (MD) or MDAV, to order the multivariate records. Steps similar to those of the HM method are then applied to generate groups based on this ordering.

Domingo-Ferrer and Torra (2003) proposed a microaggregation method based on the fuzzy c-means algorithm (FCM) (Bezdek, 1981). This method repeatedly runs FCM, adjusting its two parameters (the number of clusters c and the exponent m for the partition matrix) until each group contains at least k records. The value of c is initially large (and m small), and c is gradually reduced (and m increased) during the repeated FCM runs to reduce the size of each group. The same process is then recursively applied to those groups with 2k or more records. Because FCM aims to minimize a weighted version of SSE, rather than SSE itself, it does not perform well in terms of SSE.
Genetic algorithms (GAs) have also been applied to the microaggregation problem. Solanas, Martinez-Balleste, Mateo-Sanz, and Domingo-Ferrer (2006a) encoded a partitioning of a dataset as a chromosome of n genes, where n is the number of records in the dataset, and the value of the ith gene indicates the group number of the ith record. Since each group contains at least k records, each group number is an integer in the interval [1, ⌊n/k⌋]. When generating the initial population of chromosomes and performing genetic operations on them, special care must be taken to avoid generating a chromosome in which any group number appears fewer than k or more than 2k times among its n genes. Their experimental results showed that this method works well for small datasets (n ≤ 50). They therefore recommended first using a fixed-size microaggregation method such as MDAV to generate groups with k = 50, and then applying the GA with the intended k value within each group. This two-step method was later studied by Martínez-Ballesté, Solanas, Domingo-Ferrer, and Mateo-Sanz (2007), and was also published in Solanas (2008).
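For illustration, the validity condition on chromosomes can be checked in a few lines; the function below is hypothetical, reflecting only the constraint stated above.

```python
from collections import Counter

def valid_chromosome(genes, k):
    """Validity test implied by the encoding described above: every group
    number used in the chromosome must appear at least k times and at
    most 2k times among the n genes."""
    return all(k <= c <= 2 * k for c in Counter(genes).values())

# For k = 2, genes [1, 1, 2, 1, 2] encode groups {0, 1, 3} and {2, 4}:
# valid_chromosome([1, 1, 2, 1, 2], 2)  ->  True  (group sizes 3 and 2)
```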
4. Proposed approach

This section presents the proposed density-based algorithm (DBA) for microaggregation. First, some notation is defined to facilitate the discussion. Given a set T of records and a positive integer k, the k-neighborhood of a record x̂ ∈ T with respect to T is denoted by N^k(x̂, T), and is defined as the set containing x̂ and the k − 1 records nearest to x̂ in T. Let N̄^k(x̂, T) denote the centroid of N^k(x̂, T). Then the k-density of x̂ in T, denoted d^k(x̂, T), is defined as the inverse of the sum of the squared Euclidean distances from each record in N^k(x̂, T) to N̄^k(x̂, T). That is,

$$ d^{k}(\hat{x}, T) = \frac{1}{\sum_{x \in N^{k}(\hat{x}, T)} (x - \bar{N}^{k}(\hat{x}, T))^{T} (x - \bar{N}^{k}(\hat{x}, T))} \qquad (4) $$
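Eq. (4) is cheap to compute directly. Below is a minimal NumPy sketch; the function name and signature are illustrative choices of ours, assuming the candidate set contains the record itself.

```python
import numpy as np

def k_density(data, i, candidates, k):
    """k-density of record i with respect to the records indexed by
    `candidates` (Eq. (4)). `candidates` must contain i itself."""
    d = ((data[candidates] - data[i]) ** 2).sum(axis=1)
    # k-neighborhood: record i (distance 0) plus its k - 1 nearest records.
    hood = data[[candidates[j] for j in np.argsort(d)[:k]]]
    centroid = hood.mean(axis=0)
    # Inverse of the neighborhood's total squared distance to its centroid
    # (assumes the k records are not all identical).
    return 1.0 / ((hood - centroid) ** 2).sum()
```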
The DBA has two stages. The first stage (see Fig. 6) repeatedly builds a new group from the k-neighborhood of the record with the highest k-density among all records not yet assigned to any group, until fewer than k unassigned records remain (steps 2–5 of Fig. 6). These remaining records are then assigned to their respective nearest groups (step 6 of Fig. 6). In Fig. 6, T′ denotes the set of records not yet assigned to any group. Since forming a new group G (step 3) always shrinks the content of T′ (step 4), it could slow down the search for the record with the highest k-density among all records in T′ (step 2). To speed up step 2 of Fig. 6, the k-density of each record in T′ is first calculated, and a priority queue Q of these records, ordered by their k-densities in descending order, is built.
Fig. 6. Density-based algorithm: first stage (concise version).
Fig. 7. Steps for finding the record with the highest k-density in T 0 .
Fig. 8. Density-based algorithm: first stage (detailed version).
Fig. 9. Density-based algorithm: second stage.
Each element e in Q contains three components, namely the corresponding record e.x in T′, the k-density e.density, and the assigned status e.assigned of the record e.x. Initially, e.assigned is false for every element e in Q. Notably, e.density is the k-density of the record e.x calculated when e was inserted into the queue, and may not equal the k-density of the record with respect to the current content of T′, because the content of T′ shrinks at step 4 of Fig. 6. As the content of T′ shrinks, the k-density of a record with respect to T′ either decreases or stays unchanged. Fig. 7 shows the steps for finding the record with the highest k-density with respect to the current content of T′, using the two standard operations of a priority queue, namely getNext(Q) and insertWithPriority(Q, e). The next element e is first extracted from Q by performing the getNext(Q) operation. If e.assigned is true, then this element e is discarded, and the same getNext(Q) operation is repeated at step 1. Otherwise, e.density is compared with d^k(e.x, T′), i.e., the k-density of the record e.x with respect to the current content of T′. If the two values are equal, then e.x is the record with the highest k-density in T′. Otherwise, e.density is set to d^k(e.x, T′), the element e is inserted back into Q via the insertWithPriority(Q, e) operation, and the same getNext(Q) operation is repeated at step 1. Fig. 8 shows the detailed version of the first stage of DBA. The output of the first stage is a partitioning G of the dataset T in which each group contains no fewer than k records.

The second stage of DBA (see Fig. 9) attempts to fine-tune G by checking whether to decompose each group and merge its content into other groups. Given a group G in the partitioning, let L_noMerge denote the information loss of the partitioning as it stands, and L_merge denote the information loss after each record in G has been added to its nearest group among the other groups. If L_noMerge > L_merge, then the group G is removed from the partitioning, and each record in G is merged into its nearest remaining group; otherwise, G remains unchanged. Notably, all groups are checked during the second stage in the reverse of the order in which they were added during the first stage.

Notably, some groups may contain more than 2k − 1 records after several groups have been removed and their records added to their nearest groups (step 4 of Fig. 9). In this case, the groups with more than 2k − 1 records could be further split into smaller groups, reducing the information loss below L_merge. Hence L_noMerge > L_merge is a conservative condition for deciding whether to decompose a group and add its records to their respective nearest groups, but it is adopted here to reduce the computation effort. Furthermore, the MDAV-1 algorithm (see Fig. 2) is applied to each group of size ≥ 2k at the end of the second stage.

The second stage of DBA can also be applied to the partitioning generated by other microaggregation methods, to further reduce the information loss. The procedure is to sort the groups in ascending order of their densities, and then check whether to decompose each group following this order. Here, the density of a group is the inverse of the sum of the squared Euclidean distances from each record in the group to the group's centroid.
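The merge test of the second stage can be sketched as follows. One detail the text leaves open is whether group centroids are updated as records are merged one by one; the sketch below assumes they are, which is only one reading. Names are illustrative, and since SST is constant, comparing SSE is equivalent to comparing information loss L.

```python
import numpy as np

def sse(data, groups):
    """Within-group squared error of a partition (Eq. (2))."""
    return sum(((data[g] - data[g].mean(axis=0)) ** 2).sum() for g in groups)

def try_decompose(data, groups, gi):
    """One second-stage test (cf. Fig. 9): decompose group `gi` and merge
    its records into their nearest remaining groups if that lowers SSE.
    Returns the (possibly updated) list of groups."""
    merged = [list(g) for j, g in enumerate(groups) if j != gi]
    for i in groups[gi]:
        # Nearest remaining group by distance to its (current) centroid.
        cents = [data[g].mean(axis=0) for g in merged]
        j = int(np.argmin([((data[i] - c) ** 2).sum() for c in cents]))
        merged[j].append(i)
    return merged if sse(data, groups) > sse(data, merged) else groups
```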
Table 1
Information loss comparison using the Tarragona dataset. An asterisk marks the lowest value in each column.

Method       k = 3          k = 4          k = 5          k = 10
MDAV-MHM     16.9326        –              22.4617        33.1923
MD-MHM       16.9829        –              22.5269        33.1834
CBFS-MHM     16.9714        –              22.8227        33.2188
NPN-MHM      17.3949        –              27.0213        40.1831
M-d          16.6300        19.66          24.5000        38.5800
μ-Approx     17.10          20.51          26.04          38.80
TFRP-1       17.228         19.396         22.110         33.186
TFRP-2       16.881         19.181         21.847*        33.088*
MDAV-1       16.93258762    19.54578612    22.46128236    33.19235838
MDAV-2       16.68261429    19.01314997*   22.07965363    33.17932950
DBA-1        20.69948803    23.82761456    26.00129826    35.39295837
DBA-2        16.15265063*   22.67107728    25.45039236    34.80675148
Table 2
Information loss comparison using the Census dataset. An asterisk marks the lowest value in each column.

Method       k = 3          k = 4          k = 5          k = 10
MDAV-MHM     5.6523         –              9.0870         14.2239
MD-MHM       5.69724        –              8.98594        14.3965
CBFS-MHM     5.6734         –              8.8942*        13.8925
NPN-MHM      6.3498         –              11.3443        18.7335
M-d          6.1100         8.24           10.3000        17.1700
μ-Approx     6.25           8.47           10.78          17.01
TFRP-1       5.931          7.880          9.357          14.442
TFRP-2       5.803          7.638          8.980          13.959
MDAV-1       5.692186279    7.494699833    9.088435498    14.15593043
MDAV-2       5.656049371    7.409645342*   9.012389597    13.94411775
DBA-1        6.144855154    9.127883805    10.84218735    15.78549732
DBA-2        5.581605762*   7.591307664    9.046162117    13.52140518*
Table 3
Information loss comparison using the EIA dataset. An asterisk marks the lowest value in each column.

Method       k = 3          k = 4          k = 5          k = 10
MDAV-MHM     0.4081*        –              1.2563         3.7725
MD-MHM       0.4422         –              1.2627         3.6374
NPN-MHM      0.5525         –              0.9602         2.3188
μ-Approx     0.43           0.59           0.83           2.26
TFRP-1       0.530          0.661          1.651          3.242
TFRP-2       0.428          0.599          0.910          2.590
MDAV-1       0.482938725    0.671345141    1.666657361    3.83966422
MDAV-2       0.411101515    0.587381756    0.946263963    3.16085577
DBA-1        1.090194828    0.84346907     1.895536919    4.265801303
DBA-2        0.421048322    0.559755523*   0.81849828*    2.080980825*
5. Experimental results

Experiments were performed to compare the performance of various microaggregation methods. Four methods, DBA-1, DBA-2, MDAV-1 and MDAV-2, were implemented. DBA-1 and DBA-2 refer, respectively, to the first and second stages of the proposed DBA method. MDAV-1 is our implementation of the method in Fig. 2, which incurs slightly less information loss than the MDAV method reported in the literature. MDAV-2 applies the second stage of DBA to the partitioning generated by MDAV-1, as described at the end of Section 4.

The following three datasets (Domingo-Ferrer et al., 2006), which have been used as benchmarks in previous studies to evaluate various microaggregation methods, were adopted in our experiments. The Tarragona dataset contains 834 records with 13 numeric attributes. The Census dataset contains 1080 records with 13 numeric attributes. The EIA dataset contains 4092 records with 11 numeric attributes.

Tables 1–3 show the information losses of these microaggregation methods. The lowest information loss for each dataset and each k value is marked with an asterisk. The information losses of the methods MDAV-MHM, MD-MHM, CBFS-MHM, NPN-MHM and M-d (for k = 3, 5, 10) are quoted from Domingo-Ferrer et al. (2006); the information losses of the methods μ-Approx and M-d (for k = 4) are quoted from Domingo-Ferrer et al. (2008); and the information losses of TFRP-1 and TFRP-2 are quoted from Chang et al. (2007). TFRP is a two-stage method, and its two stages are denoted TFRP-1 and TFRP-2, respectively. The second stage of TFRP is similar to the second stage of DBA, but disallows merging a record into a group of size over 4k − 1.

Tables 2 and 3 show that DBA-2 works particularly well for the Census and EIA datasets, achieving either the lowest or nearly the lowest information loss. The second stage of DBA is important for further reducing the information loss. Comparing the differences between TFRP-1 and TFRP-2, between MDAV-1 and MDAV-2, and between DBA-1 and DBA-2 reveals that this second stage is more important for DBA-1 than for MDAV-1 and TFRP-1.

However, DBA-2 performed best for the Tarragona dataset only when k = 3, as shown in Table 1. The Tarragona dataset is sparser than the other two datasets, and DBA might generate some groups with large information loss near the end of its first stage. To alleviate this situation, an extra stage (shown in Fig. 10) is added before the first stage of DBA, in which p groups are formed from the records with the lowest densities and their respective k − 1 nearest neighbors.
Fig. 10. The extra stage in DBA-2P.
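The extra stage mirrors the first stage with the density order reversed and capped at p groups. A minimal sketch follows, reusing the illustrative k_density function from Section 4; as elsewhere, the names and details are our assumptions.

```python
import numpy as np

def dba2p_extra_stage(data, k, p):
    """Sketch of the extra stage of DBA-2P: before the density-descending
    first stage runs, seed p groups at the records with the LOWEST
    k-densities, each taking its k - 1 nearest remaining records."""
    remaining = list(range(len(data)))
    groups = []
    for _ in range(p):
        dens = [k_density(data, i, remaining, k) for i in remaining]
        seed = remaining[int(np.argmin(dens))]
        d = ((data[remaining] - data[seed]) ** 2).sum(axis=1)
        group = [remaining[j] for j in np.argsort(d)[:k]]
        for i in group:
            remaining.remove(i)
        groups.append(group)
    # The first stage of DBA then runs on `remaining`, and the second
    # stage on all resulting groups.
    return groups, remaining
```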
Table 4
Information loss of DBA-2P. An asterisk marks the best value achieved by DBA-2P for each dataset and k.

Dataset     p    k = 3          k = 4          k = 5          k = 10
Tarragona   0    16.15265063*   22.67107728    25.45039236    34.80675148
            1    16.15265063*   19.99254529    22.15646643    35.25071582
            2    16.23872813    20.07896807    22.15646643    34.89050857
            3    16.44303653    19.49015529    23.03647968    34.35249604
            4    16.79918416    19.72324644    22.33978805    34.14383061
            5    16.64588636    19.17717801*   22.11322845    34.58314685
            6    16.83292887    19.29351957    22.03169154    33.93185305
            7    16.39216388    19.36603296    21.93658729*   34.16002696
            8    16.55135992    19.40254021    22.06628252    34.16415108
            9    16.48248086    19.47603754    22.13060438    34.07545402
            10   16.39123536    19.42096667    22.15411057    33.85016813*
Census      0    5.581605762*   7.591307664    9.046162117    13.52140518
            1    5.655423691    7.459786371    8.695048991    13.72590475
            2    5.582607831    7.595568635    8.703966116    13.62904758
            3    5.581838733    7.564291559    8.635531295    13.46323786*
            4    5.689026485    7.399673886*   8.62235115     13.46323786*
            5    5.689026485    7.425495349    8.580946312*   13.70981989
            6    5.65351736     7.415348239    8.693830213    13.50212366
            7    5.659502178    7.415348239    8.722671257    13.65659022
            8    5.649439038    7.45568626     8.754090354    13.77135864
            9    5.638595825    7.448438696    8.712344295    13.71269917
            10   5.649439038    7.424304328    8.797317284    13.76779557
EIA         0    0.421048322    0.559755523    0.81849828     2.080980825*
            1    0.42248098     0.556022053*   0.816579986*   2.166698176
            2    0.42248098     0.581507309    0.822344047    2.618034057
            3    0.42248098     0.581507309    0.839734509    2.695769214
            4    0.42248098     0.571265699    0.931369958    2.695769214
            5    0.417502212*   0.571265699    1.012266096    2.700990656
            6    0.419781269    0.568467118    1.012266096    2.700990656
            7    0.419781269    0.579195364    1.077683668    2.700990656
            8    0.418062796    0.596301365    1.095103108    2.469599071
            9    0.427870099    0.596301365    1.095103108    2.715370377
            10   0.427870099    0.596301365    1.095103108    2.726880973
The resulting method is denoted DBA-2P, and its information losses at various values of p are shown in Table 4, where an asterisk marks the best information loss achieved by DBA-2P for each test situation. At some p values, DBA-2P further reduces the information loss of DBA-2, in several cases below the best values reported in previous work.

6. Conclusions

Microaggregation is an effective means of protecting privacy in microdata. This work presents a two-stage method, the density-based algorithm (DBA), for microaggregation. The method builds groups in descending order of their densities during the first stage, and then checks whether to decompose groups in the reverse order during the second stage. Notably, the second stage of DBA is similar to that of TFRP. However, comparing the information losses incurred by DBA-2, TFRP-2 and MDAV-2 in Tables 1–3 reveals that this second stage is best used with DBA-1.

The following improvements could be made to DBA. First, the condition adopted to determine whether to decompose a group during the second stage is conservative, and could be relaxed to allow more groups to be decomposed. Second, checking one group at a time might not lead to the optimal solution. Soft computing techniques (such as genetic algorithms, particle swarm optimization or artificial immune systems) could be adopted to check whether to decompose multiple groups simultaneously. However, both changes are likely to increase the required computational effort. For example, Han, Cen, Yu, and Yu (2008) proposed an immune clonal selection algorithm for microaggregation, which incurs less information loss but more running time than MDAV.

The microaggregation problem can be regarded as a constrained single-objective optimization problem, where the objective is to minimize the information loss and the constraint is the k-anonymity requirement. Many variations of the k-anonymity model have recently been proposed to further protect data from identification, such as l-diversity (Machanavajjhala, Gehrke, Kifer, & Venkitasubramaniam, 2006), t-closeness (Li & Li, 2007) and (α, k)-anonymity (Wong, Li, Fu, & Wang, 2006). An interesting direction for further investigation would be to formalize these models as a constrained multi-objective optimization problem, and to develop new microaggregation methods for it.

References

Bezdek, J. C. (1981). Pattern recognition with fuzzy objective function algorithms. Norwell, MA, USA: Kluwer Academic Publishers.
Chang, C.-C., Li, Y.-C., & Huang, W.-H. (2007). TFRP: An efficient microaggregation algorithm for statistical disclosure control. Journal of Systems and Software, 80(11), 1866–1878.
Cormen, T. H., Leiserson, C. E., Rivest, R. L., & Stein, C. (2001). Introduction to algorithms (2nd ed.). The MIT Press.
Domingo-Ferrer, J., & Mateo-Sanz, J. M. (2002). Practical data-oriented microaggregation for statistical disclosure control. IEEE Transactions on Knowledge and Data Engineering, 14(1), 189–201.
Domingo-Ferrer, J., & Torra, V. (2002a). Extending microaggregation procedures using defuzzification methods for categorical variables. In 2002 first international IEEE symposium on intelligent systems (Vol. 2, pp. 44–49).
Domingo-Ferrer, J., & Torra, V. (2002b). Soft methods in probability, statistics and data analysis. In Advances in soft computing (pp. 289–294). Physica-Verlag (Chapter: Towards fuzzy c-means based microaggregation).
Domingo-Ferrer, J., & Torra, V. (2003). Fuzzy microaggregation for microdata protection. Journal of Advanced Computational Intelligence and Intelligent Informatics, 7(2), 153–159.
Domingo-Ferrer, J., & Torra, V. (2005a). Ordinal, continuous and heterogeneous k-anonymity through microaggregation. Data Mining and Knowledge Discovery, 11(2), 195–212.
Domingo-Ferrer, J., & Torra, V. (2005b). Privacy in data mining. Data Mining and Knowledge Discovery, 11(2), 117–119.
Domingo-Ferrer, J., Martínez-Ballesté, A., Mateo-Sanz, J. M., & Sebé, F. (2006). Efficient multivariate data-oriented microaggregation. The VLDB Journal, 15(4), 355–369.
Domingo-Ferrer, J., Sebé, F., & Solanas, A. (2008). A polynomial-time approximation to optimal multivariate microaggregation. Computers & Mathematics with Applications, 55(4), 714–732.
Han, J.-M., Cen, T.-T., Yu, H.-Q., & Yu, J. (2008). A multivariate immune clonal selection microaggregation algorithm. In IEEE international conference on granular computing (pp. 252–256).
Hansen, S., & Mukherjee, S. (2003). A polynomial algorithm for optimal univariate microaggregation. IEEE Transactions on Knowledge and Data Engineering, 15(4), 1043–1044.
Laszlo, M., & Mukherjee, S. (2005). Minimum spanning tree partitioning algorithm for microaggregation. IEEE Transactions on Knowledge and Data Engineering, 17(7), 902–911.
Li, N., & Li, T. (2007). t-Closeness: Privacy beyond k-anonymity and l-diversity. In Proceedings of the 23rd IEEE international conference on data engineering (ICDE'07).
Machanavajjhala, A., Gehrke, J., Kifer, D., & Venkitasubramaniam, M. (2006). l-Diversity: Privacy beyond k-anonymity. In Proceedings of the 22nd IEEE international conference on data engineering (ICDE'06).
Martínez-Ballesté, A., Solanas, A., Domingo-Ferrer, J., & Mateo-Sanz, J. M. (2007). A genetic approach to multivariate microaggregation for database privacy. In 2007 IEEE 23rd international conference on data engineering workshop (pp. 180–185).
Oganian, A., & Domingo-Ferrer, J. (2001). On the complexity of optimal microaggregation for statistical disclosure control. Statistical Journal of the United Nations Economic Commission for Europe, 18, 345–354.
Solanas, A. (2008). Privacy protection with genetic algorithms. In Success in evolutionary computation (pp. 215–237). Springer.
Solanas, A., Martinez-Balleste, A., Mateo-Sanz, J. M., & Domingo-Ferrer, J. (2006a). Multivariate microaggregation based genetic algorithms. In 2006 IEEE third international conference on intelligent systems (pp. 65–70).
Solanas, A., Martinez-Balleste, A., & Domingo-Ferrer, J. (2006b). V-MDAV: A multivariate microaggregation with variable group size. In COMPSTAT'2006.
Sweeney, L. (2002). k-Anonymity: A model for protecting privacy. International Journal on Uncertainty, Fuzziness and Knowledge-based Systems, 10(5), 557–570.
Torra, V. (2004). Microaggregation for categorical variables: A median based approach. In Privacy in statistical databases (pp. 162–174).
Ward, J. H., Jr. (1963). Hierarchical grouping to optimize an objective function. Journal of the American Statistical Association, 58(301), 236–244.
Willenborg, L., & Waal, T. D. (2001). Elements of statistical disclosure control. Lecture notes in statistics (Vol. 155). Springer.
Wong, R. C.-W., Li, J., Fu, A. W.-C., & Wang, K. (2006). (α, k)-Anonymity: An enhanced k-anonymity model for privacy preserving data publishing. In Proceedings of the twelfth ACM SIGKDD international conference on knowledge discovery and data mining.
Zahn, C. T. (1971). Graph-theoretical methods for detecting and describing gestalt clusters. IEEE Transactions on Computers, C-20(1), 68–86.