Information Processing Letters 115 (2015) 368–370
Information Processing Letters www.elsevier.com/locate/ipl
k-Attribute-Anonymity is hard even for k = 2
Allan Scott¹, Venkatesh Srinivasan*, Ulrike Stege
Department of Computer Science, ECS 504, PO Box 1700, STN CSC, University of Victoria, BC V8W 2Y2, Canada
Article info
Article history: Received 3 May 2014; received in revised form 16 October 2014; accepted 16 October 2014; available online 24 October 2014. Communicated by M. Chrobak.
Keywords: Computational complexity; Database privacy; Data anonymization; Fixed parameter tractability; NP-hardness
Abstract
We study the problem of anonymizing data by column suppression. Meyerson and Williams showed that this problem is NP-hard for k ≥ 3. The complexity of this problem for k = 2 remained open. In this note, we show that 2-anonymizing data by suppressing the minimum number of columns is also NP-hard. In fact, we prove a stronger claim that this problem is NP-hard to approximate within a factor of Ω(log m), where m is the number of columns in the table. Furthermore, our proof also shows that this problem, parameterized by the number of columns to be suppressed, is not in FPT unless W[2] = FPT. © 2014 Elsevier B.V. All rights reserved.
* Corresponding author.
E-mail addresses: [email protected] (A. Scott), [email protected] (V. Srinivasan), [email protected] (U. Stege).
¹ Current address: Google Inc., 1600 Amphitheatre Parkway, Mountain View, CA 94043, USA.
http://dx.doi.org/10.1016/j.ipl.2014.10.011
0020-0190/© 2014 Elsevier B.V. All rights reserved.
1. Introduction
A popular and well-studied method for achieving data privacy is k-anonymization, introduced by Samarati and Sweeney [1,2]. Suppose that we are given a database T in the form of a table with n rows and m columns with entries over a finite alphabet Σ. Here, the rows represent data records, the columns correspond to attributes, and Σ contains all possible values for the attributes. The goal in k-anonymization is to perturb T minimally so that each row is identical to at least k − 1 other rows in T. Two methods of perturbation have been explored. The first and more popular approach is to suppress a small number of entries in T until every row is identical to at least k − 1 other rows. That is, some entries in T are replaced by a special symbol ∗ (called a star). Clearly, any table T can be made k-anonymous by suppressing all its entries, but doing so renders it useless for data mining. Therefore, the goal is to suppress the minimum number of entries of T; this problem is called the k-Anonymity problem. The second approach is to suppress the minimum number of columns (or attributes) needed to make the table k-anonymous; this is called the k-Attribute-Anonymity problem. In this case, all the entries of the suppressed columns are replaced by ∗.
It is now well known that k-anonymization has certain drawbacks, particularly against an adversary with background knowledge [3]. However, it continues to be an important privacy model due to its simplicity and its combinatorial flavour.
Meyerson and Williams [4] initiated the formal investigation of the complexity of k-Anonymity and k-Attribute-Anonymity. They showed that k-Anonymity is NP-hard for k ≥ 3 and |Σ| ≥ n, where n is the number of rows. Subsequently, Aggarwal et al. [5] improved this hardness result by showing that the problem remains NP-hard for k ≥ 3 even for a ternary alphabet (|Σ| = 3). In addition, they gave a polynomial-time approximation algorithm for this problem with approximation ratio max(2k − 1, 3k − 5), improving the O(k log m) approximation ratio of [4]. Finally, Bonizzoni et al. [6] further strengthened the hardness result by proving APX-hardness of this problem for k ≥ 3 over a binary alphabet. Recently, Blocki and Williams [7] gave an alternate proof of the result in [6]. Furthermore, they showed that k-Anonymity is in P for k = 2.
The parameterized complexity of k-Anonymity has been explored in depth. Evans et al. [8] were the first to present fixed-parameter algorithms and hardness results for this problem. Bonizzoni et al. [9] showed that the problem is W[1]-hard when parameterized by the value of the solution (and k), and designed a fixed-parameter algorithm for the problem parameterized by the number of columns and the number of different values in any column. Bredereck et al. studied new models of k-anonymity by introducing data-driven parameterizations [10] and user-specified patterns [11].
It was also shown in [4] that k-Attribute-Anonymity is NP-hard for k ≥ 3 and |Σ| = 2 via a polynomial-time reduction from k-Dimensional Perfect Matching. However, the complexity of this problem for k = 2 remained open. In this work, we close this gap by showing that k-Attribute-Anonymity is NP-hard for k = 2 and |Σ| = 2. In fact, we prove a stronger claim that this problem is NP-hard to approximate within a factor of Ω(log m), where m is the number of columns in the table. We also show that the same problem, parameterized by c, the number of columns to be suppressed, is not in FPT unless W[2] = FPT.
2. Complexity of 2-Attribute-Anonymity
We begin by defining the decision version of k-Attribute-Anonymity introduced in [4].
(k, c)-Attribute-Anonymity
Instance: A table T consisting of n rows and m columns, each cell containing an element from a finite alphabet Σ, and positive integers c and k.
Question: Is it possible to suppress at most c columns of T in such a way that, for each row, the data in the remaining columns is identical to at least k − 1 other rows?
We investigate the parameterized complexity of (k, c)-Attribute-Anonymity when parameterized by c and k. We prove the following result.
Theorem 1. (k, c)-Attribute-Anonymity, parameterized by k and c, is not in FPT unless W[2] = FPT.
We will show that, even for the special case k = 2, (k, c)-Attribute-Anonymity, parameterized by c, is not in FPT unless W[2] = FPT. Thus, the same is true when we generalize to any k, and Theorem 1 holds. We prove this in two steps. In the first step, we show this result provided that the alphabet size, |Σ|, is large (Theorem 2). We then show how to reduce |Σ| to 2 in Theorem 3.
Theorem 2. If |Σ| ≥ 1 + 2n/(2 + m), then (2, c)-Attribute-Anonymity, parameterized by c, is not in FPT unless W[2] = FPT.
Proof. We give a polynomial-time, parameter-preserving reduction from the W[2]-complete problem c-Hitting Set, where the parameter is the size of the hitting set [12].
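Before the reduction is described, it may help to see the target problem in executable form. The following Python sketch decides (k, c)-Attribute-Anonymity by exhaustive search over column subsets. The function names and the representation of T as a non-empty list of equal-length tuples are assumptions made for illustration only; they are not taken from [4]. The search tries every subset of at most c columns, so its running time grows roughly as m^c times a polynomial in n and m, which is not of the fixed-parameter form f(c) · poly(n, m).

```python
from collections import Counter
from itertools import combinations


def is_k_anonymous(table, suppressed, k):
    """Return True if, ignoring the columns in `suppressed`, every row of
    `table` is identical to at least k - 1 other rows."""
    kept = [j for j in range(len(table[0])) if j not in suppressed]
    counts = Counter(tuple(row[j] for j in kept) for row in table)
    return all(count >= k for count in counts.values())


def attribute_anonymity(table, k, c):
    """Exhaustive search (hypothetical helper): return a set of at most c
    column indices whose suppression makes `table` k-anonymous, or None."""
    m = len(table[0])
    for size in range(min(c, m) + 1):
        for cols in combinations(range(m), size):
            if is_k_anonymous(table, set(cols), k):
                return set(cols)
    return None


# Toy instance: only the last column distinguishes the rows within each pair.
table = [
    (0, 1, 0),
    (0, 1, 1),
    (1, 0, 0),
    (1, 0, 1),
]
print(attribute_anonymity(table, k=2, c=1))  # {2}
print(attribute_anonymity(table, k=2, c=0))  # None
```

In the toy table, suppressing column 2 leaves two pairs of identical rows, so the instance with c = 1 is a yes-instance, while c = 0 is a no-instance because all four rows are distinct.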
c-Hitting Set
Instance: A finite family of sets S = {S_1, S_2, ..., S_ℓ}, where each S_i consists of elements of U = {u_1, u_2, ..., u_m}, and a positive integer c.
Question: Does there exist a hitting set, that is, a subset H ⊆ U of size at most c such that S_i ∩ H ≠ ∅ for all S_i ∈ S?
We describe a reduction from c-Hitting Set parameterized by c to (2, c)-Attribute-Anonymity parameterized by c. Table T consists of exactly m columns, one corresponding to each element u_j ∈ U. It contains 1 + 2|S_i| rows for every set S_i, for a total of n = ℓ + ∑_{i=1}^{ℓ} 2|S_i| rows. The rows of T are constructed as follows.
• For each i, the single row corresponding to set S_i contains element “i” in cell j if and only if u_j ∈ S_i, and 0 otherwise.
• For each i and each u_j ∈ S_i, create two identical rows that contain “−i” in cell j and are otherwise identical to the row corresponding to S_i.
We next prove that, for an input c-Hitting Set instance (S, U, c), there is a hitting set of size at most c if and only if the constructed table T can be made 2-anonymous by suppressing at most c columns.
Claim 1. If there exists a hitting set H of size at most c for S, then there exist at most c columns in T that, when suppressed, make the table T 2-anonymous.
Proof. We show that it is sufficient to suppress the columns of T that correspond to the hitting set H. Note that, for each i, all rows but the one corresponding to S_i, call it r, are already 2-anonymous, since there are two identical rows for each u_j ∈ S_i. Further, if a row is 2-anonymous in T, then it remains 2-anonymous after the suppression of any column. It remains to show that the suppression of the specified columns makes the remaining row r, the one corresponding to set S_i, identical to at least one other row. Since H must contain at least one element from every set, it contains at least one element, say u_x, from S_i. Therefore, after column suppression, the two rows corresponding to element u_x for set S_i are identical to r, as they differ from r only in the cell of the column corresponding to u_x.
Claim 2. If T can be made 2-anonymous by suppressing at most c columns, then S has a hitting set H of size at most c.
Proof. We show that the elements of U corresponding to the suppressed columns form a hitting set H for S. Since, in the modified table, every row must be identical to at least one other row of T, for each i at least one column that corresponds to an element of S_i must have been suppressed. Otherwise, the row corresponding to some S_i would differ from all other rows, both inside and outside its group, which would prevent it from being 2-anonymous.
The selection of elements for H therefore guarantees that at least one element from each S_i is contained in H. By definition, H is a hitting set for S. □
To conclude the proof of Theorem 2, note that Σ = {−ℓ, ..., −1, 0, 1, ..., ℓ}. Therefore, |Σ| = 1 + 2ℓ ≥ 1 + 2n/(2 + m). Finally, we observe that the reduction above is a parameter-preserving reduction from a known W[2]-complete problem, c-Hitting Set, to (2, c)-Attribute-Anonymity. Therefore, (2, c)-Attribute-Anonymity, parameterized by c, is hard for W[2].
In our next result, we show that it is possible to reduce the size of the alphabet to 2.
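As an illustration of the reduction used in the proof of Theorem 2 (over the large alphabet), the following Python sketch builds the table T from a small Hitting Set instance and compares the two optima by exhaustive search. All function names are hypothetical, elements u_1, ..., u_m are represented by the 0-based indices 0, ..., m − 1, and the exhaustive searches are only meant for tiny instances; none of this is part of the original proof.

```python
from collections import Counter
from itertools import combinations


def hitting_set_to_table(sets, m):
    """Build the table of the reduction: set number i (1-based) yields one
    base row plus two copies of a variant row for each of its elements."""
    table = []
    for i, s in enumerate(sets, start=1):
        base = [i if j in s else 0 for j in range(m)]
        table.append(tuple(base))
        for j in sorted(s):
            variant = list(base)
            variant[j] = -i  # differs from the base row only in column j
            table.extend([tuple(variant)] * 2)
    return table


def is_2_anonymous(table, suppressed):
    """True if every row has at least one twin once `suppressed` is ignored."""
    kept = [j for j in range(len(table[0])) if j not in suppressed]
    counts = Counter(tuple(row[j] for j in kept) for row in table)
    return all(count >= 2 for count in counts.values())


def min_hitting_set_size(sets, m):
    """Smallest number of elements hitting every set (exhaustive search)."""
    for size in range(m + 1):
        for cand in combinations(range(m), size):
            if all(s & set(cand) for s in sets):
                return size


def min_suppression_size(table):
    """Smallest number of suppressed columns giving 2-anonymity."""
    m = len(table[0])
    for size in range(m + 1):
        for cols in combinations(range(m), size):
            if is_2_anonymous(table, set(cols)):
                return size


# Assumed toy instance: S_1 = {u_1, u_2}, S_2 = {u_2, u_3}, S_3 = {u_4}.
sets = [{0, 1}, {1, 2}, {3}]
table = hitting_set_to_table(sets, m=4)
print(len(table))  # 13, i.e. 3 + 2 * (2 + 2 + 1)
print(min_hitting_set_size(sets, 4), min_suppression_size(table))  # 2 2
```

On this instance the hitting set {u_2, u_4} has size 2, and suppressing the two corresponding columns makes every base row identical to one of its variant pairs, in line with Claims 1 and 2.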
Theorem 3. For Σ = {0, 1}, (2, c)-Attribute-Anonymity, parameterized by c, is not in FPT unless W[2] = FPT.
Proof. We use a restricted version of Hitting Set in which S_i ⊈ S_j for all S_i, S_j ∈ S with i ≠ j. It is not difficult to see that this restricted version of Hitting Set is also W[2]-complete, since we can preprocess any Hitting Set instance as follows. For each pair of sets S_i, S_j ∈ S with i ≠ j and S_i ⊆ S_j, remove S_j from S. One can easily check that the reduced family S′ has a hitting set of size at most c if and only if S has a hitting set of size at most c. We now modify the reduction in the proof of Theorem 2. In the reduction, we replace any symbol i, 1 ≤ i ≤ ℓ, by 1, any symbol −i, 1 ≤ i ≤ ℓ, by 0, and leave 0's unchanged. Claim 1 follows just as in the proof above. To see the correctness of Claim 2, we observe that it is not possible to make the row corresponding to set S_i identical to another row in the group corresponding to set S_j without suppressing a column corresponding to an element u_x in S_i, unless S_i ⊂ S_j. □
Furthermore, we observe that the reduction in Theorem 3 is an L-reduction from an NP-hard problem, Minimum Hitting Set, to 2-Attribute-Anonymity. It is known that Minimum Hitting Set is NP-hard to approximate within a factor of Ω(log m). Hence, we get
Corollary 1. It is NP-hard to approximate 2-Attribute-Anonymity within a factor of Ω(log m) even for Σ = {0, 1}.
3. Concluding remarks
A popular approach in data anonymization is to k-anonymize a table by suppressing the smallest number of entries in the table. The complexity of this problem was investigated in a series of papers. It was known that this problem has a polynomial-time algorithm for k = 2 and is NP-hard for k ≥ 3. However, there was a gap in our understanding of the second approach, which k-anonymizes a table by suppressing the smallest number of attributes. It was known that this problem is NP-hard for k ≥ 3, but its complexity for k = 2 was open. We close that gap in this work by showing that this problem is also NP-hard. In fact, it is not in FPT unless W[2] = FPT.
Acknowledgements
This work was completed during the 2014 Bellairs Workshop on Computational Geometry. We thank the organizers for inviting us and providing such a fun atmosphere to work in. We also thank the anonymous referees for their valuable feedback.
References
[1] P. Samarati, L. Sweeney, Generalizing data to provide anonymity when disclosing information (abstract), in: PODS, 1998, p. 188.
[2] L. Sweeney, k-Anonymity: a model for protecting privacy, Int. J. Uncertain. Fuzziness Knowl.-Based Syst. 10 (5) (2002) 557–570.
[3] A. Machanavajjhala, D. Kifer, J. Gehrke, M. Venkitasubramaniam, L-diversity: privacy beyond k-anonymity, ACM Trans. Knowl. Discov. Data 1 (1) (2007) (Article 3).
[4] A. Meyerson, R. Williams, On the complexity of optimal k-anonymity, in: PODS, 2004, pp. 223–228.
[5] G. Aggarwal, T. Feder, K. Kenthapadi, R. Motwani, R. Panigrahy, D. Thomas, A. Zhu, Anonymizing tables, in: ICDT, 2005, pp. 246–258.
[6] P. Bonizzoni, G.D. Vedova, R. Dondi, Anonymizing binary and small tables is hard to approximate, J. Comb. Optim. 22 (1) (2011) 97–119.
[7] J. Blocki, R. Williams, Resolving the complexity of some data privacy problems, in: ICALP, 2010, pp. 393–404.
[8] P.A. Evans, T. Wareham, R. Chaytor, Fixed-parameter tractability of anonymizing data by suppressing entries, J. Comb. Optim. 18 (4) (2009) 362–375.
[9] P. Bonizzoni, G.D. Vedova, R. Dondi, Y. Pirola, Parameterized complexity of k-anonymity: hardness and tractability, J. Comb. Optim. 26 (1) (2013) 19–43.
[10] R. Bredereck, A. Nichterlein, R. Niedermeier, G. Philip, The effect of homogeneity on the computational complexity of combinatorial data anonymization, Data Min. Knowl. Discov. 28 (1) (2014) 65–91.
[11] R. Bredereck, A. Nichterlein, R. Niedermeier, Pattern-guided k-anonymity, Algorithms 6 (4) (2013) 678–701.
[12] R.G. Downey, M.R. Fellows, Parameterized Complexity, Springer-Verlag, 1999, 530 pp.