Computers & Security 50 (2015) 74-90
DP-Apriori: A differentially private frequent itemset mining algorithm based on transaction splitting

Xiang Cheng, Sen Su*, Shengzhi Xu, Zhengyi Li

State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, Beijing, China
Article history: Received 30 April 2014; Received in revised form 9 December 2014; Accepted 21 December 2014; Available online 3 February 2015

Keywords: Frequent itemset mining; Apriori; Differential privacy; Transaction splitting

Abstract

In this paper, we study the problem of designing a differentially private FIM algorithm which can simultaneously provide a high level of data utility and a high level of data privacy. This task is very challenging due to the possibility of long transactions. A potential solution is to limit the cardinality of transactions by truncating long transactions. However, such an approach might cause too much information loss and result in poor performance. To limit the cardinality of transactions while reducing the information loss, we argue that long transactions should be split rather than truncated. To this end, we propose a transaction-splitting-based differentially private FIM algorithm, which is referred to as DP-Apriori. In particular, a smart weighted splitting technique is proposed to divide long transactions into sub-transactions whose cardinality is no more than a specified number of items. In addition, to offset the information loss caused by transaction splitting, a support estimation technique is devised to estimate the actual support of itemsets in the original database. Through privacy analysis, we show that our DP-Apriori algorithm is ε-differentially private. Extensive experiments on real-world datasets illustrate that DP-Apriori substantially outperforms the state-of-the-art techniques.

© 2015 Elsevier Ltd. All rights reserved.
1. Introduction
Frequent itemset mining (FIM) (Agrawal et al., 1993) has been well recognized as one of the most fundamental problems in data mining. Given a transactional database, FIM tries to find itemsets which occur more frequently than a given threshold. The discovery of frequent itemsets has a wide range of applications, including market basket analysis, Web usage mining, and bioinformatics. Despite the valuable insights the discovery of frequent itemsets can potentially provide, if the data is sensitive (e.g., patient health records and user behavior records), releasing the discovered frequent itemsets might pose considerable threats to individual privacy (Bhaskar et al., 2010). Differential privacy (Dwork et al., 2006; Dwork, 2006) has been proposed as one way to address such problems. Unlike traditional anonymization-based privacy models (e.g., k-anonymity (Sweeney, 2002) and l-diversity (Machanavajjhala et al., 2006)), differential privacy offers strong theoretical guarantees on the privacy of released data and makes almost no assumptions about the attacker's background knowledge. By adding a carefully chosen amount of noise, differential privacy assures that the output of a computation is insensitive to changes in any individual's record, thus restricting privacy leaks through the results.

* Corresponding author. E-mail addresses: [email protected] (X. Cheng), [email protected] (S. Su), [email protected] (S. Xu), [email protected] (Z. Li). http://dx.doi.org/10.1016/j.cose.2014.12.005. 0167-4048/© 2015 Elsevier Ltd. All rights reserved.

In this paper, we study the problem of designing a differentially private FIM algorithm which can simultaneously provide a high level of data utility and a high level of data privacy. However, as shown in prior studies (Zeng et al., 2012; Li et al., 2012), this task is very challenging due to the possibility of long transactions (i.e., transactions containing a large number of items). A potential solution is to limit the cardinality of transactions by transaction truncating (Zeng et al., 2012). In particular, if a transaction has more than a specified number of items, items are deleted until the transaction is under the limit. However, such an approach may cause too much information loss and result in poor performance. Intuitively, if we instead limit the cardinality of transactions by transaction splitting, we can keep more frequency information. That is, we divide long transactions into sub-transactions and guarantee that the cardinality of each sub-transaction is under a specified number of items. For example, suppose itemsets {a,b,c} and {e,f,g} are frequent itemsets, and the maximal cardinality of transactions is set to 4. Given a transaction t = {a,b,c,d,e,f,g}, if we truncate t to {a,b,c,d}, the support of frequent itemset {e,f,g} and its subsets will decrease. Consequently, some itemsets which are frequent in the original database may become infrequent. Instead, if we divide t into t1 = {a,b,c,d} and t2 = {e,f,g}, the support of frequent itemsets {a,b,c}, {e,f,g} and their subsets will not be affected.
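The effect illustrated by the truncation-versus-splitting example above can be checked in a few lines (Python; an illustrative sketch, not part of the paper's algorithm):

```python
def support(database, itemset):
    """Support of an itemset: the number of transactions containing it."""
    return sum(1 for t in database if itemset <= set(t))

t = {"a", "b", "c", "d", "e", "f", "g"}          # a long transaction, limit m = 4

truncated = [{"a", "b", "c", "d"}]               # truncation drops e, f, g
split = [{"a", "b", "c", "d"}, {"e", "f", "g"}]  # splitting keeps both parts

print(support(truncated, {"e", "f", "g"}))       # 0: the support is lost
print(support(split, {"a", "b", "c"}))           # 1
print(support(split, {"e", "f", "g"}))           # 1: both itemsets survive
```

Under truncation the frequent itemset {e,f,g} disappears entirely, whereas splitting preserves the support of both itemsets.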
However, through a theoretical investigation, we found that if each transaction is divided into at most k sub-transactions, applying any ε-differentially private algorithm to the transformed database only ensures (k·ε)-differential privacy for the original database. In addition, despite the potential advantages of transaction splitting, it inevitably incurs information loss. For example, if long transactions are randomly divided into multiple sub-transactions, items in a frequent itemset might be grouped into different sub-transactions. Towards this end, different from existing studies (e.g., Zeng et al., 2012) that limit the cardinality of transactions by transaction truncating, we propose a novel transaction splitting approach, which consists of two techniques, namely, smart weighted splitting and support estimation. In smart weighted splitting, to preserve the frequency information contained in long transactions as much as possible, candidate frequent itemsets are used to guide the transaction splitting process such that items in the same sub-transaction are more likely to produce frequent itemsets. Besides, a weight is assigned to each sub-transaction to ensure that applying any ε-differentially private algorithm to the transformed database can still guarantee ε-differential privacy for the original database. In support estimation, the information loss caused by transaction splitting is quantified and used to estimate the actual support of itemsets in the original database. We also show that the transaction truncating approach proposed in (Zeng et al., 2012) is only a special case of our transaction splitting approach. Utilizing the transaction splitting approach, we devise an Apriori-based differentially private
FIM algorithm, which is referred to as DP-Apriori. DP-Apriori takes a database, a frequency threshold, a privacy parameter ε, and a maximal cardinality of transactions as input, and outputs the discovered frequent itemsets. Through privacy analysis, we show that our DP-Apriori algorithm is ε-differentially private. Extensive experimental results on real-world datasets show that DP-Apriori outperforms existing differentially private FIM algorithms.

To summarize, our key contributions are:

- We present a new differentially private FIM algorithm based on Apriori, which is referred to as DP-Apriori. In DP-Apriori, a novel transaction splitting approach, consisting of smart weighted splitting and support estimation techniques, is proposed to improve the tradeoff between utility and privacy.
- Through privacy analysis, we prove that DP-Apriori satisfies ε-differential privacy.
- Extensive experiments on real datasets illustrate that DP-Apriori substantially outperforms the state-of-the-art techniques.

The rest of the paper is organized as follows. We review related work in Section 2. Section 3 presents the necessary background on differential privacy and briefly reviews the problem of FIM. In Section 4, we propose a straightforward algorithm for differentially private FIM. Section 5 presents the overall framework of our DP-Apriori algorithm and its two key components: smart weighted splitting and support estimation. In Section 6, we show the details of our DP-Apriori algorithm and give its privacy analysis. Comprehensive experimental results are reported in Section 7. Finally, we conclude the paper in Section 8.
2. Related work
Differential privacy was first proposed by Dwork et al. (2006) and has gradually emerged as the de facto standard notion of privacy for research in private data analysis. There are two settings: interactive and non-interactive (Dwork, 2008). In the interactive setting, where the database is held by a trusted server, users pose queries about the data, and the answers to the queries are modified to protect the privacy of the database participants. In the non-interactive setting, the data custodian either computes and publishes some statistics on the data, or releases an anonymized version of the raw data.

Several differentially private FIM algorithms have been proposed under the interactive setting in the literature. Bhaskar et al. (2010) present two differentially private algorithms for top-k frequent pattern mining, which adapt the exponential mechanism (McSherry and Talwar, 2007) and the Laplace mechanism (Dwork et al., 2006), respectively. To meet the challenge of the high dimensionality of transactional databases, Li et al. (2012) propose the PrivBasis algorithm, which projects the input database onto several sets of dimensions for differentially private top-k frequent itemset mining. Different from their work, our work focuses on the problem of mining all itemsets whose support exceeds a given threshold rather than the problem of finding the top-k frequent itemsets. Zeng et al. (2012) propose a transaction truncating approach where items in long transactions are deleted until the transactions'
cardinality is under a specified number. Based on the transaction truncating approach, they present a differentially private FIM algorithm. In contrast, to limit the cardinality of transactions, we propose a transaction splitting approach where long transactions are divided into sub-transactions. As shown later in our work, the transaction truncating approach proposed in (Zeng et al., 2012) can be seen as a special case of our transaction splitting approach, and the differentially private FIM algorithm designed using our transaction splitting approach (i.e., DP-Apriori) can achieve much better results.

There are also some studies on applying differential privacy to non-interactive FIM. With the help of context-free taxonomy trees, Chen et al. (2011) propose a probabilistic top-down partitioning algorithm to efficiently generate a sanitized release of set-valued data in a differentially private manner with guaranteed utility for mining frequent itemsets. Zhang et al. (2013) further address the problem of differentially private set-valued data release in an incremental scenario and propose an algorithm, called IncTDPart, to incrementally generate a series of differentially private releases. These two studies focus on publishing a sanitized release of the database, i.e., the non-interactive setting. In contrast, our work aims at designing a differentially private FIM algorithm under the interactive setting. In addition, existing work (Zeng et al., 2012) has shown that the publishing algorithms fail to generate accurate frequent itemsets, since the number of transactions in the released database is significantly reduced after anonymization.

In addition, there is another series of studies on finding frequent patterns with differential privacy in sequential and graph databases. Based on a hybrid-granularity prefix tree structure, Chen et al. (2012a) propose an efficient data-dependent yet differentially private transit data sanitization approach for publishing large volumes of sequential data. Chen et al. (2012b) also propose an algorithm which utilizes variable-length n-grams and a Markov model to publish a sanitized version of the sequential database. It first uses a tree to group grams and adaptively allocate the privacy budget to compute the noisy counts of nodes in the tree; then the sanitized version of the sequential database is constructed based on this tree. Bonomi and Xiong (2013) present a two-phase algorithm for mining both prefix and substring patterns with differential privacy. In the first phase, it generates frequent prefixes and a candidate set of substring patterns. In the second phase, it refines the counts of the potentially frequent substring patterns. Very recently, based on Markov Chain Monte Carlo (MCMC) sampling, Shen and Yu (2013) propose a differentially private frequent graph pattern mining algorithm, which does not rely on the output of a non-private mining algorithm.
3. Preliminaries
In this section, we introduce the basic concepts of differential privacy and briefly review the problem of frequent itemset mining.
3.1. Differential privacy

Differential privacy has drawn much attention in recent years as a new model for the protection of individual privacy when performing data analysis. It provides a strong privacy guarantee: the output of a computation is insensitive to any particular record in the database. Given two databases D and D′, we say that D and D′ are neighboring databases if they differ by at most one record. We formally define differential privacy as follows.

Definition 1. (ε-differential privacy (Dwork, 2006)). An algorithm A achieves ε-differential privacy iff for any pair of neighboring databases D1 and D2, and any subset of outputs S:

Pr[A(D1) ∈ S] ≤ e^ε · Pr[A(D2) ∈ S],

where the probability is taken over the randomness of A.

For any pair of neighboring databases, differential privacy guarantees that the ratio of the probabilities of any subset of outputs S is bounded by e^ε, where smaller values of ε mean a stronger privacy guarantee. Differential privacy is enforced by adding a carefully chosen amount of noise to the result. The required amount of noise depends on the concept of sensitivity (Dwork et al., 2006).

Definition 2. (Sensitivity): Let Q be a set of functions. Then, the sensitivity of Q is

S_Q = max ||Q(D) − Q(D′)||_1,

where the maximum is taken over all pairs of neighboring databases D and D′.

There are several approaches for designing algorithms that achieve ε-differential privacy. The most widely adopted is the Laplace mechanism (Dwork et al., 2006). For computations whose outputs are real-valued, the Laplace mechanism works by adding random noise drawn from the Laplace distribution to the true output of the query. The Laplace distribution with magnitude λ, i.e., Lap(λ), has probability density function Pr[x | λ] = (1/(2λ)) · e^(−|x|/λ), where λ = S_Q/ε is determined by both the sensitivity S_Q and the privacy budget ε.

Theorem 1. Let Q be a set of n functions with sensitivity S_Q, and let ⟨Δ1, …, Δn⟩ be an n-length vector where each Δi is drawn i.i.d. from a Laplace distribution with scale S_Q/ε. The algorithm

A(D) = Q(D) + ⟨Δ1, …, Δn⟩

provides ε-differential privacy.

Ghosh et al. (2009) propose the geometric mechanism, which can be regarded as a discrete variant of the Laplace mechanism for computations with integer outputs. The magnitude of the added noise conforms to a two-sided geometric distribution G(α) with probability mass function Pr[x | α] = ((e^α − 1)/(e^α + 1)) · e^(−α|x|), where α > 0.

Theorem 2. Let Q be a set of n functions with integer outputs and sensitivity S_Q. The algorithm

A(D) = Q(D) + ⟨Δ1, …, Δn⟩

provides ε-differential privacy, where each Δi is an i.i.d. sample from the geometric distribution G(ε/S_Q).
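Both mechanisms can be sketched with the standard library alone (Python; a minimal illustration under the definitions above, not the paper's implementation, and the function names are ours):

```python
import math
import random

def laplace_mechanism(true_value, sensitivity, epsilon):
    """Laplace mechanism: add Lap(S_Q / epsilon) noise to a real-valued
    query answer, via inverse-CDF sampling of the Laplace distribution."""
    scale = sensitivity / epsilon
    u = random.random() - 0.5                 # uniform on [-0.5, 0.5)
    sign = 1.0 if u >= 0 else -1.0
    return true_value - scale * sign * math.log(1.0 - 2.0 * abs(u))

def geometric_mechanism(true_count, sensitivity, epsilon):
    """Geometric mechanism: add two-sided geometric noise G(alpha) with
    alpha = epsilon / S_Q, sampled as the difference of two one-sided
    geometric variables."""
    alpha = epsilon / sensitivity
    p = 1.0 - math.exp(-alpha)                # per-side success probability
    def one_sided():
        return math.floor(math.log(1.0 - random.random()) / math.log(1.0 - p))
    return true_count + one_sided() - one_sided()
```

The difference of two i.i.d. one-sided geometric variables has exactly the two-sided probability mass function ((e^α − 1)/(e^α + 1)) · e^(−α|x|) given above.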
Besides, for a sequence of computations, privacy is ensured by the composition properties. Any sequence of computations that each maintain differential privacy also provides differential privacy, which is known as sequential composition (Dwork et al., 2006).

Theorem 3. Let f1, …, fm be m randomized algorithms, where fi provides εi-differential privacy (1 ≤ i ≤ m). A sequence of fi(D) over the database D provides (Σ_i εi)-differential privacy.
3.2. Frequent itemset mining
Frequent itemset mining (FIM) is fundamental to many important data mining tasks. Given the alphabet I = {i1, …, in}, a transaction t is a subset of I, and a transactional database D is a multiset of transactions. Each transaction represents an individual's record. A non-empty set X ⊆ I is called an itemset, and the length of X is the number of items in X. An itemset is called an i-itemset if its length is i. We say that a transaction t contains X if X ⊆ t. The support of X is the number of transactions containing X in D, and an "i-itemset query" is a count query that computes the support of an i-itemset. An itemset is called frequent if its support is no less than a user-specified threshold. Given a transactional database and a threshold, the goal of FIM is to find the complete set of frequent itemsets.

There is a large body of research on designing FIM algorithms in the literature. The two most prominent ones are Apriori (Agrawal and Srikant, 1994) and FP-growth (Han et al., 2000). Apriori is the most classic and most widely used algorithm for mining frequent itemsets in a transactional database. In this paper, our main focus is on the design of an Apriori-based differentially private FIM algorithm which can achieve both good utility and good privacy.
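The definitions above can be made concrete with a toy database (Python; an illustrative sketch only):

```python
database = [
    {"a", "b", "c"},
    {"a", "b"},
    {"b", "c"},
    {"a", "c", "d"},
]

def support(D, X):
    """Support of itemset X: the number of transactions in D containing X."""
    return sum(1 for t in D if X <= t)

threshold = 2
# {"a", "b"} is a frequent 2-itemset: its support is 2 >= threshold.
print(support(database, {"a", "b"}))               # 2
print(support(database, {"a", "b"}) >= threshold)  # True
```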
4. A straightforward algorithm
In this section, we first briefly introduce the Apriori algorithm (Agrawal and Srikant, 1994). Then, based on Apriori, we propose a straightforward algorithm for mining frequent itemsets with ε-differential privacy. Finally, we point out the limitation of this algorithm and discuss promising approaches to improve its performance.

The Apriori algorithm works within a multiple-pass generation-and-test framework. In particular, in each database scan, only candidates whose subsets are all frequent are generated, and they are counted in the database to test whether they are frequent. Since transactions are not stored in memory during the mining process, Apriori needs l database scans if the maximal cardinality of frequent itemsets is l.

To enforce ε-differential privacy in FIM, we propose a strawman algorithm based on Apriori. Our main idea is to add noise to the support of each itemset during the mining process, and use the noisy support to determine whether an itemset is frequent or not. Specifically, given the candidate set of frequent i-itemsets Ci, we perturb their supports by adding geometric noise. If the noisy support of a candidate X ∈ Ci exceeds the given threshold, we consider X a frequent i-itemset. The magnitude of the noise added to the support of X
depends on both the allocated privacy budget and the sensitivity of the i-itemset queries. Suppose the maximal cardinality of frequent itemsets is l; we uniformly assign the i-itemset queries Qi a privacy budget ε/l. Additionally, since adding or removing one transaction can affect the support of an itemset by at most one, the sensitivity of Qi is equal to the number of i-itemset queries, |Qi|. For Apriori, we can easily get |Qi| from the discovered frequent (i−1)-itemsets. Therefore, we can add geometric noise G(ε/(l·|Qi|)) in computing Qi, and it satisfies ε/l-differential privacy. It is not hard to see that the whole mining process can be considered as a sequence of computations Q = ⟨Q1, …, Ql⟩. According to the sequential composition property of differential privacy, we can easily prove that this approach overall satisfies ε-differential privacy.

However, this approach might suffer from poor data utility when performing FIM on real-world datasets. For example, the number of items in the Kosarak dataset (Frequent itemset) is 41,270. When mining the frequent 1-itemsets in this dataset, the sensitivity is 41,270, i.e., |Q1|. This means that, to enforce differential privacy, a large amount of noise will be added to the support of the 1-itemsets, and such errors will also be propagated to the following mining process. A promising approach is to limit the cardinality of transactions to improve the performance of this algorithm. For instance, if the maximal cardinality of transactions in the Kosarak dataset is limited to 50, the sensitivity of Q1 is no greater than 50. As a result, the amount of noise added to the support of the 1-itemsets will be significantly reduced. To limit the transactions' cardinality, Zeng et al. (2012) propose a transaction truncating approach: if a transaction has more than a specified number of items, items are deleted until the transaction is under the limit.
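The strawman per-level perturbation described above can be sketched as follows (Python; the sampling helper and the dictionary-based supports are our illustrative choices, not the paper's code):

```python
import math
import random

def two_sided_geometric(alpha):
    """Sample two-sided geometric noise G(alpha) as the difference of two
    one-sided geometric variables."""
    p = 1.0 - math.exp(-alpha)
    def one_sided():
        return math.floor(math.log(1.0 - random.random()) / math.log(1.0 - p))
    return one_sided() - one_sided()

def noisy_frequent(candidates, true_support, epsilon, l, threshold):
    """One level of the strawman algorithm: the budget epsilon is split
    uniformly over l levels, and the sensitivity of the i-itemset queries
    is the number of candidates |Q_i|."""
    sensitivity = len(candidates)
    alpha = epsilon / (l * sensitivity)
    return [c for c in candidates
            if true_support[c] + two_sided_geometric(alpha) >= threshold]
```

Note how alpha shrinks with the number of candidates: with |Q1| = 41,270 as for Kosarak, the noise magnitude dwarfs typical supports, which is exactly the utility problem described above.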
Although such an approach can reduce the error due to the noise required to enforce privacy, it may still result in poor performance. The main reason is that some frequent itemsets in transactions may have to be deleted due to the maximal cardinality constraint. This observation motivates us to present a new approach to limiting the cardinality of transactions which can preserve the frequency information of transactions as much as possible. In the following section, we show our novel transaction splitting approach for limiting the cardinality of transactions, and the design of a differentially private FIM algorithm using this approach.
5. DP-Apriori: Apriori with differential privacy

In this section, we introduce our transaction splitting based differentially private FIM algorithm, which is referred to as DP-Apriori. We first provide an overview of this algorithm. Then, we show the details of the smart weighted splitting and support estimation techniques used in our transaction splitting approach.
5.1. Design overview
In DP-Apriori, we privately discover frequent itemsets in order of increasing length. In particular, to discover frequent k-itemsets, we utilize the frequent (k−1)-itemsets to generate candidate k-itemsets based on the downward closure property. We add noise to the support of the candidate k-itemsets, and use the noisy support of each candidate k-itemset to determine whether it is frequent. To better improve the utility-privacy tradeoff, instead of truncating long transactions in the input database, we split long transactions by using a novel transaction splitting approach. In this approach, a smart weighted splitting technique is used to divide long transactions into multiple short sub-transactions, such that the frequency information contained in long transactions is preserved as much as possible. In addition, a support estimation technique is used to offset the information loss incurred by transaction splitting.

Algorithm 1 shows how frequent k-itemsets are discovered in DP-Apriori. Specifically, given the original database and the candidate k-itemsets, we first transform the database to enforce the length constraint. Each transaction whose cardinality violates the constraint is split by using our smart weighted splitting technique (for details see Section 5.2). Then, for each candidate k-itemset ck, we compute its noisy support in the transformed database. Based on the noisy support of ck, we estimate the actual support of ck in the original database by using our support estimation technique (lines 12 and 16). In particular, if the estimated "maximal" support of ck exceeds the threshold, we use it to generate candidate (k+1)-itemsets; if the estimated "average" support of ck exceeds the threshold, we regard it as a frequent k-itemset (for details see Section 5.3).
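The per-level control flow can be approximated by the sketch below (Python). This is a deliberately simplified stand-in: `naive_split` cuts transactions into consecutive chunks instead of the candidate-guided smart weighted splitting of Section 5.2, and the support estimation step of Section 5.3 is omitted, so only the structure of the loop is shown.

```python
def naive_split(t, m):
    """Placeholder splitter: cut t into consecutive chunks of at most m
    items (DP-Apriori instead uses candidate-guided weighted splitting)."""
    items = sorted(t)
    return [set(items[i:i + m]) for i in range(0, len(items), m)]

def mine_level(D, candidates, m, threshold, noise):
    """One level of the mining loop: enforce the cardinality limit by
    splitting long transactions, then keep the candidates whose noisy
    support reaches the threshold."""
    transformed = []
    for t in D:
        transformed.extend(naive_split(t, m) if len(t) > m else [set(t)])
    frequent = []
    for c in candidates:
        support = sum(1 for s in transformed if c <= s)
        if support + noise() >= threshold:
            frequent.append(c)
    return frequent
```

Calling `mine_level(D, cands, m, threshold, noise=lambda: 0)` runs the loop without noise; in DP-Apriori the noisy counts are further corrected by the support estimation technique before the threshold test.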
5.2. Smart weighted splitting
Intuitively, if we limit the cardinality of transactions by transaction splitting, we can keep more frequency information. That is, long transactions are divided into multiple sub-transactions whose cardinality is under a specified number of items. However, the following theorem shows that if each transaction is divided into at most k sub-transactions, applying any ε-differentially private FIM algorithm to the transformed database only ensures (k·ε)-differential privacy for the original database.

Theorem 4. Let A be an ε-differentially private FIM algorithm for the transformed database, and f be a function which divides one transaction into at most k sub-transactions. Let S denote the resulting frequent itemsets. Then, for any pair of neighboring databases D and D′, we have:

Pr[A(f(D)) = S] ≤ e^(k·ε) · Pr[A(f(D′)) = S].
Proof. Consider any pair of neighboring databases D1 and D2, and let t denote the transaction that is in D2 but not in D1 (i.e., D2 = D1 + t). Suppose the transformed database of D1 is D1′, and t is divided into k sub-transactions t1, …, tk. As A is an ε-differentially private algorithm for the transformed database, by the definition of differential privacy, for any subset of outputs S of algorithm A we have:

Pr[A(D1′) = S] / Pr[A(D1′ + t1) = S] ≤ e^ε,
Pr[A(D1′ + t1) = S] / Pr[A(D1′ + t1 + t2) = S] ≤ e^ε,
…,
Pr[A(D1′ + t1 + … + t(k−1)) = S] / Pr[A(D1′ + t1 + … + tk) = S] ≤ e^ε.

Multiplying the above inequalities, we get:

Pr[A(D1′) = S] / Pr[A(D1′ + t1 + … + tk) = S] ≤ e^(k·ε).

Recall that D2 = D1 + t. Since the transformed database of D1 is D1′ and t is divided into t1, …, tk, ⟨D1′, t1, …, tk⟩ is the transformed database of D2. Hence, if each transaction is divided into at most k sub-transactions, applying any ε-differentially private algorithm to the transformed database only ensures (k·ε)-differential privacy for the original database.

To this end, we introduce a weighted splitting operation, which assigns a weight to each sub-transaction of the divided transaction. We formally define the weighted splitting operation as follows.

Definition 3. (Weighted splitting operation): Consider a transaction t whose cardinality exceeds the maximal cardinality m. A function f divides t into sub-transactions t1, …, tk, where ti is assigned a weight wi and |ti| ≤ m. Then, f is said to be a weighted splitting operation iff:

∪_{i=1..k} ti = t and Σ_{i=1..k} wi = 1.
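The two conditions of Definition 3, together with the cardinality bound, can be checked mechanically (Python; an illustrative helper whose name and interface are ours):

```python
def is_weighted_splitting(t, parts, m, tol=1e-9):
    """Check Definition 3: each sub-transaction obeys the cardinality
    bound m, the union of the sub-transactions equals t, and the weights
    sum to 1. `parts` is a list of (sub_transaction, weight) pairs."""
    union = set().union(*(sub for sub, _ in parts))
    sizes_ok = all(len(sub) <= m for sub, _ in parts)
    weights_ok = abs(sum(w for _, w in parts) - 1.0) < tol
    return sizes_ok and union == set(t) and weights_ok

t = {"a", "b", "c", "d", "e", "f", "g"}
# Even weights over two sub-transactions ...
print(is_weighted_splitting(t, [({"a", "b", "c", "d"}, 0.5),
                                ({"e", "f", "g"}, 0.5)], m=4))   # True
# ... and truncation as the extreme case: all weight on one part.
print(is_weighted_splitting(t, [({"a", "b", "c", "d"}, 1.0),
                                ({"e", "f", "g"}, 0.0)], m=4))   # True
```

The second call illustrates the observation below: truncation is the degenerate weighting that puts weight 1 on a single sub-transaction.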
In fact, transaction truncating can be seen as an extreme case of our weighted splitting. Suppose a transaction t is divided into sub-transactions t1, …, tk; if we assign weight 1 to one of the sub-transactions ti and weight 0 to the other sub-transactions, it is equivalent to truncating t and keeping only the items in ti. Theorem 5 shows that as long as we use the weighted splitting operation to transform the database, applying any ε-differentially private FIM algorithm to the transformed database also guarantees ε-differential privacy for the original database.

Theorem 5. Let A be an ε-differentially private FIM algorithm for the transformed database, and f be an arbitrary weighted splitting operation. Let S denote the set of resulting frequent itemsets. Then, for any pair of neighboring databases D and D′, we have:

Pr[A(f(D)) = S] ≤ e^ε · Pr[A(f(D′)) = S].

Proof. See Section 6.2.

Obviously, for each sub-transaction of the divided transaction, the weighted splitting operation can preserve only incomplete frequency information. To mitigate this side-effect, we propose a support estimation technique to offset such information loss. Moreover, we also introduce a weight allocation scheme which our support estimation technique can benefit from (for details see Section 5.3).

Ideally, after the transformation of the database via transaction splitting, the support of the frequent itemsets should not be changed. A simple approach is to randomly divide long transactions into multiple sub-transactions. However, this approach is only suitable when we mine frequent 1-itemsets. If we use it when mining frequent i-itemsets (i ≥ 2), the items in a frequent itemset might be grouped into different sub-transactions and the support of the frequent itemset will decrease. Since we do not know which itemsets are frequent without mining the database, we can only rely on the candidate frequent itemsets to guide the transaction splitting process during the mining process.
In practice, if all the subsets of a candidate frequent itemset are sufficiently frequent, then this candidate is more likely to be a frequent itemset. Similar to (Zeng et al., 2012), to quantify the quality of candidate frequent itemsets, we assign each candidate frequent i-itemset (i ≥ 2) a frequency score, which is the summation over all its (i−1)-subsets' noisy supports.

Definition 4. (Frequency score): Given a set of frequent (i−1)-itemsets Y = {Y1, Y2, …, Yn}, the frequency score of a candidate frequent i-itemset X is defined as:

fs(X) = Σ_{Yj ∈ Y ∧ Yj ⊂ X} n_sup(Yj),

where n_sup(Yj) is the noisy support of itemset Yj. A simple example of the frequency score of candidate itemsets is given in Example 5.1.

Example 5.1. Given three frequent items i1, i2 and i3 with supports 4, 3, and 2, respectively, the frequency score of candidate 2-itemset {i1, i3} is 6, and the frequency score of candidate 2-itemset {i2, i3} is 5.

We also define the cover score of a sub-transaction as follows.

Definition 5. (Cover score): Let X denote the set of candidate frequent i-itemsets, where each candidate Xi ∈ X has a frequency score. For a sub-transaction t′, let Z denote the set of candidate frequent i-itemsets contained in t′. Then, the cover score of t′ is defined as:

cs(t′) = Σ_{Xi ∈ Z} fs(Xi).
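Both scores can be computed directly; the sketch below reproduces the numbers of Example 5.1 (Python; illustrative only, with frozensets standing in for itemsets):

```python
def frequency_score(X, noisy_support):
    """fs(X): sum of the noisy supports of X's proper subsets that
    appear in the given noisy-support table."""
    return sum(s for Y, s in noisy_support.items() if Y < X)

def cover_score(sub_transaction, candidates, noisy_support):
    """cs(t'): sum of the frequency scores of the candidate itemsets
    contained in the sub-transaction."""
    return sum(frequency_score(c, noisy_support)
               for c in candidates if c <= sub_transaction)

n_sup = {frozenset({"i1"}): 4, frozenset({"i2"}): 3, frozenset({"i3"}): 2}
c13, c23 = frozenset({"i1", "i3"}), frozenset({"i2", "i3"})
print(frequency_score(c13, n_sup))                              # 6
print(frequency_score(c23, n_sup))                              # 5
print(cover_score({"i1", "i2", "i3", "i4"}, [c13, c23], n_sup)) # 11
```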
An example of the cover score of sub-transactions is shown in Example 5.2.

Example 5.2. Continuing from Example 5.1, consider a sub-transaction t′ = {i1, i2, i3, i4}. Candidate 2-itemsets {i1, i3} and {i2, i3} are contained in t′. Thus, the cover score of t′ is the sum of the frequency scores of candidate 2-itemsets {i1, i3} and {i2, i3}, which is equal to 11.

We then formulate our optimal (i, m)-splitting problem as follows.

Definition 6. Optimal (i, m)-splitting: Given a transaction t and the maximal cardinality m, divide t into t1, …, tk so as to maximize the objective

Max Σ_{j=1..k} cs(tj),

where k = ⌈|t|/m⌉.

The following lemma and theorem show that the optimal (i, m)-splitting problem is NP-hard.

Lemma 1. The optimal (2, m)-splitting problem is NP-hard.

Proof. This can be proved by a reduction from the balanced minimum k-cut (BMK) problem (Saran and Vazirani, 1995),
which is known to be NP-hard. The BMK problem is formally described as follows. G = (V, E) is an undirected weighted graph with n vertices. A k-cut on G is defined as a subset of E whose removal partitions G into k components. The k-cut weight is defined as the sum of the weights on all the edges in the k-cut. The BMK problem is to find a k-cut with minimum weight whose removal partitions G into k disjoint subsets, each of size no more than ⌈n/k⌉. Since the summation over the weights of all edges in G is a constant, the BMK problem is equivalent to finding a k-cut that partitions G into k disjoint subsets whose size is no more than ⌈n/k⌉ such that the sum of the weights on all the edges within these subsets is maximized.

We construct an instance P = (t, C1) of the optimal (2, m)-splitting problem as follows: for the given transaction t and the frequent 1-itemsets C1, we construct an undirected weighted graph G = (V, E), where each vertex va is in one-to-one correspondence with an item a contained in t, and an edge e = (va, vb) connects two vertices va and vb iff a and b can generate a candidate 2-itemset. In particular, for edge e = (va, vb), its weight is assigned as the sum of the supports of a and b, i.e., the frequency score of itemset {a, b}. Clearly, this construction can be carried out in polynomial time. A solution of the optimal (2, m)-splitting problem on this instance is also a solution of the BMK problem on G. It is not hard to show that the reverse also holds. Since the BMK problem is NP-hard, the optimal (2, m)-splitting problem is also NP-hard.

Theorem 6. The optimal (i, m)-splitting problem is NP-hard.

Proof. We prove Theorem 6 by a reduction from (2, m)-splitting to an instance of (i, m)-splitting. Consider an arbitrary instance of (2, m)-splitting P = (t, C1). We construct an instance P′ = (t′, C′) by adding i − 2 items to the transaction t and to each itemset in C1. In particular, suppose Δt = {x1, …, x(i−2)} is disjoint with t. Let t′ = t + Δt, and C′ = {c′k | c′k = xk ∪ Δt and xk ∈ C1}. Obviously, this construction can be carried out in polynomial time. By this construction, we can see that a solution of (i, m + i − 2)-splitting corresponds to a solution of (2, m)-splitting. Since (2, m)-splitting is NP-hard (see Lemma 1) and we can encode (2, m)-splitting into an instance of (i, m)-splitting (i > 2), Theorem 6 follows.

Algorithm 2 shows the details of our heuristic algorithm for transaction splitting. The main idea of this algorithm is to leverage the candidate frequent itemsets to guide the transaction splitting process such that items in the same sub-transaction are more likely to produce frequent itemsets. It takes a transaction t, the candidate i-itemsets Ci, and the maximal cardinality m as input, and outputs the set of sub-transactions R. In particular, to obtain a sub-transaction ts, it first constructs a new set C′i by preserving the candidates in Ci which are contained in t (line 5). Then, it iteratively adds the candidate ci ∈ C′i which has the highest frequency score and the smallest D(ci, ts) to ts (lines 6-12). When a sub-transaction ts has been constructed, the items in ts are removed from t (line 14). After we obtain ⌈|t|/m⌉ sub-transactions, if there are still items left in t, they are randomly distributed among the sub-transactions whose cardinality is less than the maximal cardinality m. Finally, we evenly assign weight 1/q to each obtained sub-transaction in R (line 19). The reason why we evenly assign weights to sub-
Since the optimal (i, m)-splitting problem is NP-Hard, we propose a heuristic algorithm to solve this problem. Before we show our heuristic algorithm, we first introduce the notion of the distance between an itemset and a sub-transaction, which is defined as follows. Definition 7. (Distance between an itemset and a sub-transaction): Given an itemset X and a sub-transaction t0 , the distance between X and t0 is: D X; t' ¼ jXj X∩t' ; where jXj is the number of items in X, and jX∩t' j is the number of items existing both in X and t0 .
81
c o m p u t e r s & s e c u r i t y 5 0 ( 2 0 1 5 ) 7 4 e9 0
transactions is explained in Section 5.3. This algorithm has a time complexity of OðQ:jtj=mS$jCi jÞ. We show a running example of our “Smart Weighted Splitting” algorithm in Example 5.3. Example 5.3. Given a transaction t ¼ {1, 2, 3, 4, 5, 6, 7, 8, 9}, and candidate 2-itemsets {1, 2}, {2, 3}, {2, 4}, {5, 6}, {8, 9} and {9, 10} with frequency score 5, 4, 3, 4, 2, 6, respectively. Suppose the maximal cardinality of transactions in the database is 3. We gradually generate 3 sub-transactions t1, t2 and t3 in the following manner. In particular, to generate sub-transaction t1, we first pick the candidate 2-itemsets which are contained in transaction t and construct the set C'i ¼ {{1, 2}, {2, 3}, {2, 4}, {5, 6}, {8, 9}}. Then, we pick itemset {1, 2} which has the highest frequency score in C'i , and t1 is updated to {1, 2}. Next, we select the candidate itemsets which has the smallest distance from sub-transaction t1, i.e., itemsets {2, 3} and {2, 4}. Since the frequency score of itemset {2, 3} is higher than that of itemset {2, 4}, we add itemset {2, 3} into t1, which is then updated to {1, 2, 3}. Since the cardinality of t1 meets the constraint, the algorithm removes the items in t1 from t, which is updated to {4, 5, 6, 7, 8, 9}. After that, our algorithm iteratively generates the sub-transactions t2 and t3 in a similar way. As shown in Algorithm 2, the smart weighted splitting technique depends on the candidate k-itemsets and maximal cardinality m. The candidate k-itemsets is obtained based on noisy results (i.e., frequent (k 1)-itemsets). Moreover, as shown in Section 6.1, the maximal cardinality m is also computed based on noisy results. Therefore, our smart weighted splitting technique will not cause a privacy breach.
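To make the greedy procedure concrete, the following Python sketch mirrors the description of Algorithm 2 above. It is an illustration rather than the paper's exact pseudocode: the function names are ours, ties between distance and frequency score are broken in one reasonable way, and leftover items are placed deterministically instead of randomly.

```python
from math import ceil

def distance(itemset, sub):
    """Definition 7: D(X, t') = |X| - |X intersect t'|."""
    return len(itemset) - len(itemset & sub)

def smart_weighted_split(t, candidates, scores, m):
    """Greedily split transaction t into ceil(|t|/m) sub-transactions of
    cardinality <= m, guided by candidate itemsets (frozensets) and their
    frequency scores; each sub-transaction receives weight 1/q."""
    t = set(t)
    q = ceil(len(t) / m)
    subs = []
    for _ in range(q):
        cand = [c for c in candidates if c <= t]  # candidates still contained in t
        sub = set()
        while len(sub) < m and cand:
            # prefer the candidate closest to the current sub-transaction,
            # breaking ties by the highest frequency score
            best = min(cand, key=lambda c: (distance(c, sub), -scores[c]))
            if len(sub | best) > m:
                cand.remove(best)  # adding it would exceed the cardinality bound
                continue
            sub |= best
            cand = [c for c in cand if not c <= sub]
        t -= sub
        subs.append(sub)
    # place leftover items into sub-transactions with spare room
    # (deterministically here; the paper distributes them randomly)
    for item in t:
        for sub in subs:
            if len(sub) < m:
                sub.add(item)
                break
    return [(sub, 1.0 / q) for sub in subs]
```

On the data of Example 5.3 this produces three sub-transactions of weight 1/3 each, the first of which is {1, 2, 3}, matching the walkthrough above.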
5.3. Support estimation
Despite its potential advantages, transaction splitting inevitably incurs information loss, which comes from two sources. Suppose a transaction t = {a, b, c, d} is divided into t1 = {a, b} and t2 = {c, d} with weights w1 and w2, respectively. On the one hand, assigning weights makes the supports of itemsets {a, b} and {c, d} decrease from 1 to w1 and w2. On the other hand, splitting t might cause the supports of other subsets of t (e.g., {a, c}) to decrease from 1 to 0. During the mining process, if a frequent itemset is misestimated as infrequent, all its supersets are regarded as infrequent without their supports even being computed, which negatively affects the utility of the results. To offset the information loss caused by transaction splitting, we propose a support estimation technique inspired by the double standards method in (Zeng et al., 2012). The main idea is to estimate the actual support of an itemset in the original database, including its "average" support and its "maximal" support, from its noisy support in the transformed database. The estimated "average" support determines whether the itemset is frequent, and the estimated "maximal" support determines whether it will be used to generate candidate frequent itemsets. In particular, our support estimation technique consists of two steps: 1) when we get the noisy support of an itemset, we first compute its actual support in the transformed database; 2) we then estimate its actual support in the original database based on its actual support in the transformed database.
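Both sources of loss can be reproduced in a few lines of Python (a toy sketch with hypothetical names):

```python
def split_support(transformed, itemset):
    """Weighted support of `itemset` over a transformed database given as a
    list of (sub_transaction, weight) pairs."""
    return sum(w for sub, w in transformed if itemset <= sub)

# t = {a, b, c, d} split into {a, b} and {c, d}, each with weight 1/2
transformed = [({"a", "b"}, 0.5), ({"c", "d"}, 0.5)]
assert split_support(transformed, {"a", "b"}) == 0.5  # decreased from 1 to w1
assert split_support(transformed, {"a", "c"}) == 0.0  # decreased from 1 to 0
```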
First Step. Given itemset X's noisy support $\tilde{u}$, we estimate its actual support $u'$ in the transformed database. By the Bayesian rule, $\Pr(u' \mid \tilde{u}) = \Pr(\tilde{u} \mid u') \cdot \Pr(u') / \Pr(\tilde{u})$. Assuming a uniform prior on $\Pr(u')$, the probability distribution of $u'$ follows
\[
\Pr(u' \mid \tilde{u}) \propto e^{-\epsilon\,|u' - \tilde{u}|}. \tag{1}
\]
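Assuming the support is truncated to a finite range, the posterior in equation (1) can be computed by direct normalization; a small Python sketch (function name ours):

```python
import math

def posterior(noisy, eps, support_range):
    """Pr(u' | noisy) proportional to exp(-eps * |u' - noisy|) over a finite
    range of candidate supports, assuming a uniform prior (equation (1))."""
    weights = [math.exp(-eps * abs(u - noisy)) for u in support_range]
    z = sum(weights)
    return [w / z for w in weights]

probs = posterior(noisy=10.0, eps=1.0, support_range=range(0, 21))
```

The mass concentrates at the noisy support and decays exponentially away from it, which is what motivates the truncated range used in Example 5.4 below.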
Second Step. In this step, based on $u'$, we further estimate X's actual support u in the original database. The key is to quantify the information loss caused by transaction splitting. Since the method proposed in (Zeng et al., 2012) is specific to transaction truncating, it cannot be used in our case. How to quantify the information loss depends on how weights are assigned to sub-transactions. If weights are assigned unevenly, the information loss of itemsets will also be uneven; yet, given only the supports of itemsets in the transformed database, it is impossible to precisely quantify the information loss for every specific itemset. Unevenly assigning weights to sub-transactions thus poses an obstacle to quantifying the information loss. We therefore evenly assign weights to sub-transactions, i.e., when a long transaction is divided into k sub-transactions, each sub-transaction is assigned weight 1/k. Additionally, the information loss caused by transaction splitting also depends on how the items in a transaction are grouped into sub-transactions. Recall that, in DP-Apriori, we employ the smart weighted splitting technique to split transactions. However, quantifying the information loss caused by smart splitting would require a lot of information from the original database, and, due to the privacy requirement, the original database can only be accessed in a differentially private way. Precisely quantifying this information loss would therefore waste a large portion of the privacy budget. To address this issue, we approximate it by quantifying the information loss caused by random splitting. In what follows, we show how to estimate the "average" support and the "maximal" support of an itemset in the original database.
Suppose transaction t contains an i-itemset X, the cardinality of t is p, and the maximal cardinality of transactions is m (p > m). We divide t into $q = \lceil p/m \rceil$ sub-transactions, each of which satisfies the maximal cardinality constraint, and we want to know the probability that X remains in one sub-transaction. We assume that, after splitting, there are $q' = \lfloor p/m \rfloor$ sub-transactions whose cardinality is m, and at most one sub-transaction whose cardinality may be smaller than m. Let $a = p - q' \cdot m$ be the number of items in that smaller sub-transaction. Then we have:

Theorem 7. The probability that X remains in one sub-transaction is
\[
b_p =
\begin{cases}
\dfrac{q'\, C_{p-i}^{m-i}}{C_p^m} & \text{if } a < i,\\[2ex]
\dfrac{q'\, C_{p-i}^{m-i}}{C_p^m} + \dfrac{C_{p-i}^{a-i}}{C_p^a} & \text{if } a \ge i.
\end{cases}
\]

Proof. To compute the probability that X remains in one sub-transaction, we first calculate the number $x_1$ of all possible combinations when transaction t is split at random. We then count the number of combinations in which the items of X are partitioned into the same sub-transaction. Finally, we compute the probability that X remains in one sub-transaction.

In particular, the number of all possible combinations when splitting t at random is
\[
x_1 = \frac{C_p^m C_{p-m}^m \cdots C_{p-(q'-1)m}^m}{q'!} = \frac{C_p^a C_{p-a}^m \cdots C_{p-a-(q'-1)m}^m}{q'!}.
\]
Next, we count the combinations in which X's items are partitioned into the same sub-transaction. If X remains in a sub-transaction whose cardinality is m, the number of combinations is
\[
x_2 = \frac{C_{p-i}^{m-i} C_{p-m}^m \cdots C_{p-(q'-1)m}^m}{(q'-1)!}.
\]
Otherwise (i.e., X remains in the sub-transaction with cardinality smaller than m, which requires $a \ge i$), the number of combinations is
\[
x_3 = \frac{C_{p-i}^{a-i} C_{p-a}^m \cdots C_{p-a-(q'-1)m}^m}{q'!}.
\]
We can now calculate $b_p$, the probability that X remains in one sub-transaction, under the two conditions. For a transaction t with $a < i$,
\[
b_p = \frac{x_2}{x_1} = \frac{q'\, C_{p-i}^{m-i}}{C_p^m}.
\]
For a transaction t with $a \ge i$,
\[
b_p = \frac{x_2 + x_3}{x_1} = \frac{x_2}{x_1} + \frac{x_3}{x_1} = \frac{q'\, C_{p-i}^{m-i}}{C_p^m} + \frac{C_{p-i}^{a-i}}{C_p^a}.
\]

Based on Theorem 7, we can estimate the "average" support of X in the original database. Suppose the size of the alphabet I is n, and weights are evenly assigned to the sub-transactions of each divided transaction. Let $v_a(X) = (a_1, a_2, \ldots, a_n)$, where $a_k$ denotes the number of transactions of length k containing X in the database; we refer to this vector as the a-vector of X. For a transaction t containing X whose length k exceeds the maximal cardinality m, by Theorem 7 the probability that a sub-transaction of t contains X is $b_k$. Since we evenly assign weight $1/\lceil k/m \rceil$ to each sub-transaction, the support of X in transaction t can be regarded as decreasing from 1 to $b_k/\lceil k/m \rceil$. The support of X in the transformed database can thus be considered a random variable with expectation
\[
E(u') = \sum_{k=i}^{m} a_k + \sum_{k=m+1}^{n} a_k \cdot \frac{b_k}{\lceil k/m \rceil}.
\]
It is not hard to see that $\sum_{j=i}^{n} a_j$ equals the support u of X in the original database. Thus, the above equation can be written as
\[
E(u') = \sum_{k=i}^{m} a_k \cdot \frac{u}{\sum_{j=i}^{n} a_j} + \sum_{k=m+1}^{n} a_k \cdot \frac{u}{\sum_{j=i}^{n} a_j} \cdot \frac{b_k}{\lceil k/m \rceil}
= u \cdot \left( \sum_{k=i}^{m} \frac{a_k}{\sum_{j=i}^{n} a_j} + \sum_{k=m+1}^{n} \frac{a_k}{\sum_{j=i}^{n} a_j} \cdot \frac{b_k}{\lceil k/m \rceil} \right).
\]
For ease of presentation, let
\[
ratio(i) = \sum_{k=i}^{m} \frac{a_k}{\sum_{j=i}^{n} a_j} + \sum_{k=m+1}^{n} \frac{a_k}{\sum_{j=i}^{n} a_j} \cdot \frac{b_k}{\lceil k/m \rceil}.
\]
Therefore, we can estimate the "average" support of X in the original database to be
\[
avg(u') = \frac{u'}{ratio(i)}. \tag{2}
\]
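Theorem 7's closed form is easy to sanity-check against small, hand-enumerable cases; the following Python sketch (names ours, and it assumes $i \le m$) encodes $b_p$:

```python
from math import comb, floor

def keep_probability(p, m, i):
    """b_p from Theorem 7: probability that a fixed i-itemset in a length-p
    transaction stays inside one sub-transaction under random splitting
    (assumes i <= m)."""
    q = floor(p / m)   # q' sub-transactions of cardinality m
    a = p - q * m      # cardinality of the remaining short sub-transaction
    bp = q * comb(p - i, m - i) / comb(p, m)
    if a >= i:         # the itemset can also survive inside the short part
        bp += comb(p - i, a - i) / comb(p, a)
    return bp

# Hand check: for p = 4, m = 2, i = 2, the two-pair partitions of {1,2,3,4}
# are {12|34}, {13|24}, {14|23}, so a fixed pair survives in 1 of 3 of them.
assert abs(keep_probability(4, 2, 2) - 1 / 3) < 1e-12
```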
As discussed above, to estimate the "average" support of X, we need its a-vector. However, it is impractical to compute the a-vector of each itemset during the mining process. Moreover, using a-vectors to quantify the information loss has privacy implications: if we add noise to the a-vector of each itemset to avoid a privacy breach, a large portion of the privacy budget is wasted. In practice, it suffices to consider only frequent itemsets. We also observe that, in general, for the element $a_k$ of an a-vector, the proportion $a_k / \sum_{i=1}^{n} a_i$ is nearly the same for frequent itemsets of the same length. Therefore, for itemsets of length i, we use the a-vector of the i-itemset with the highest support to approximate the a-vectors of the other i-itemsets. In doing so, we can efficiently approximate the a-vectors of itemsets without introducing much computational overhead, and save a large amount of privacy budget.

To estimate the "maximal" support of X in the original database, like (Zeng et al., 2012), we utilize the r-lower bound, which is an integer g iff $\Pr(M \ge g) \ge r$, where M is a random variable denoting the support of an itemset in the transformed database. By treating X's support $u'$ in the transformed database as the r-lower bound, we estimate X's maximal support in the original database to be
\[
max(u') =
\begin{cases}
\dfrac{u' - \ln r + \sqrt{\ln^2 r - 2u' \ln r}}{ratio(i)} & \text{if } \ln r \le 2u',\\[2ex]
avg(u') & \text{if } \ln r > 2u'.
\end{cases} \tag{3}
\]
In summary, based on the resulting noisy support $\tilde{u}$ of X, we can estimate X's average support in the original database by combining equations (1) and (2):
\[
\mathrm{avg\_supp}(\tilde{u}) = \sum_{u'=0}^{n} \Pr(u' \mid \tilde{u}) \cdot avg(u').
\]
By combining equations (1) and (3), X's maximal support in the original database can be computed as
\[
\mathrm{max\_supp}(\tilde{u}) = \sum_{u'=0}^{n} \Pr(u' \mid \tilde{u}) \cdot max(u').
\]
For the Apriori algorithm, the support estimation technique is used as follows. When we get the noisy support of an i-itemset X, we estimate its average support to determine whether X is frequent. Moreover, by estimating the maximal support of X, we decide whether to use X to generate the candidate (i+1)-itemsets. We show a running example of our support estimation technique in Example 5.4.
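Combining Theorem 7 with equations (2) and (3), the estimation step might be sketched as follows; the a-vector is passed as a dict mapping transaction length to count, all function names are our own, and the branch condition in max_support follows equation (3) as printed:

```python
from math import ceil, comb, floor, log, sqrt

def b_k(k, m, i):
    """Theorem 7: probability that an i-itemset in a length-k transaction
    survives random splitting into parts of size at most m."""
    q = floor(k / m)
    a = k - q * m
    prob = q * comb(k - i, m - i) / comb(k, m)
    if a >= i:
        prob += comb(k - i, a - i) / comb(k, a)
    return prob

def ratio(i, m, a_vec):
    """Expected surviving fraction of an i-itemset's support; a_vec maps
    transaction length k to the number a_k of supporting transactions."""
    total = sum(ak for k, ak in a_vec.items() if k >= i)
    kept = 0.0
    for k, ak in a_vec.items():
        if k < i:
            continue
        kept += ak if k <= m else ak * b_k(k, m, i) / ceil(k / m)
    return kept / total

def avg_support(u_prime, i, m, a_vec):
    return u_prime / ratio(i, m, a_vec)  # equation (2)

def max_support(u_prime, i, m, a_vec, r=0.01):
    lr = log(r)                          # equation (3), as printed
    if lr <= 2 * u_prime:
        return (u_prime - lr + sqrt(lr * lr - 2 * u_prime * lr)) / ratio(i, m, a_vec)
    return avg_support(u_prime, i, m, a_vec)
```

When all supporting transactions are shorter than m, ratio(i) is 1 and the estimate reduces to the observed support; when long transactions dominate, the estimate is scaled up accordingly.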
Example 5.4. Given an i-itemset X, suppose its noisy support in the transformed database is $\tilde{u}_X = 10$. For the actual support $u'_X$ of X in the transformed database, using $\tilde{u}_X$ we can estimate the probabilities of the different values of $u'_X$ based on equation (1). We find that when $|\tilde{u}_X - u'_X| > 5$, the probability of $u'_X$ is very small; this is because we limit the cardinality of transactions in the transformed database, so the amount of added noise is considerably reduced. Therefore, in our run-time estimation method, we only consider values of $u'_X$ with $|\tilde{u}_X - u'_X| \le 5$. Next, for each specific value of $u'_X$ (i.e., $5 \le u'_X \le 15$), we use equation (2) to estimate a value of $u_X$. By combining the probabilities of the different values of $u'_X$ with their corresponding estimated values of $u_X$, we obtain the "average" support $u_a$ of X in the original database. Suppose the user-specified threshold is λ = 13 and the estimated $u_a = 12$. Since $u_a$ is smaller than λ, we do not output itemset X as a frequent i-itemset. Similarly, for each specific value of $u'_X$ (i.e., $5 \le u'_X \le 15$), based on equation (3), we can also estimate a value of $u_X$. By combining the probabilities of the different values of $u'_X$ with their corresponding estimated values of $u_X$, we obtain the "maximal" support $u_m$ of X in the original database. Suppose the estimated $u_m = 14$. Since $u_m$ is larger than λ, we will use i-itemset X to generate candidate (i+1)-itemsets.

In the support estimation technique, the use of the number of transactions in the original database may have privacy implications. However, as shown in Section 6.1, we compute this number from noisy results. Therefore, our support estimation technique does not incur privacy leakage.
6. Algorithm description and privacy analysis

In this section, we first present the details of our transaction splitting based differentially private FIM algorithm, DP-Apriori. Then, we show that DP-Apriori satisfies ε-differential privacy.
6.1. Algorithm description

DP-Apriori consists of two phases. In the preprocessing phase, it extracts some statistical information from the original database. Notice that, for a given database, the preprocessing phase is performed only once. In the mining phase, DP-Apriori can privately mine the frequent itemsets for any user-specified threshold; the smart weighted splitting and support estimation techniques are used in this phase to improve the quality of the results. To satisfy ε-differential privacy, the privacy budget ε is divided into four portions $\epsilon_1, \ldots, \epsilon_4$. In the rest of this subsection, we present the details of DP-Apriori.

6.1.1. Preprocessing phase

In the preprocessing phase, we first compute the maximal cardinality of transactions m. Let d = {$a_1$, …, $a_n$}, where $a_i$ is the number of transactions of length i in the database and n is the size of the alphabet I. We set m to the value such that the percentage $\sum_{i=1}^{m} a_i / \sum_{i=1}^{n} a_i$ is at least h. Since the computation of d has privacy implications, we add geometric noise $G(\epsilon_1)$ to each $a_i$. The number of transactions in the database can then be obtained by summing over all elements of d.

We next compute a vector s = {$s_1$, …, $s_n$}, where $s_i$ is the maximal support of i-itemsets. This vector is used in the mining phase to estimate the maximal cardinality of frequent itemsets L. We select a relatively small threshold and run non-private Apriori, assuming that the user-specified threshold is not smaller than this threshold. Suppose the maximal cardinality of the discovered frequent itemsets is k. For i from 1 to k, we keep the maximal support $s_i$ of i-itemsets. Since the vector s is a property of the database, we add geometric noise $G(\epsilon_2/\lceil \log n \rceil)$ to each $s_i$. Meanwhile, we compute a k × n matrix Z, where each row $z_i$ is the a-vector of the i-itemset with the maximal support. This matrix is used to quantify the information loss caused by transaction splitting during the mining process.

6.1.2. Mining phase

Our DP-Apriori algorithm is shown in Algorithm 3. Given the threshold λ, we first estimate the maximal cardinality of frequent itemsets L based on the vector s obtained in the
preprocessing phase (line 2). We set L to the integer l such that $s_l \in s$ is the smallest value exceeding λ. Then we take the matrix Z, which is used by the support estimation technique; since Z is a property of the database, we add geometric noise $G(\epsilon_3/L)$ to each row of Z (line 3). We can then start to privately mine frequent itemsets. We first uniformly assign a privacy budget $\epsilon' = \epsilon_4/L$ to the i-itemset queries to enforce differential privacy (line 4). Then, using the procedure F_Max_Mining, we iteratively find the i-itemsets (i = 1, 2, …, L) whose estimated "maximal" support exceeds λ and add them to $F_{max}$ (lines 5–10). For each itemset $X \in F_{max}$, if its estimated "average" support exceeds λ, we treat it as a frequent itemset (lines 11–15).

The procedure F_Max_Mining is shown in Algorithm 4. It finds the itemsets whose estimated "maximal" support exceeds λ. It first generates the candidate i-itemsets (lines 1–5). Then, it transforms the input database D to limit the cardinality of transactions. For mining 1-itemsets, we apply the weighted splitting operation to randomly divide long transactions in D into sub-transactions; for simplicity, we call this method random weighted splitting. Otherwise, we apply our smart weighted splitting technique (lines 6–10). Next, we calculate the supports of the candidate i-itemsets and add Laplace noise $Lap(m/\epsilon')$ to them. Based on the resulting noisy supports, we estimate the itemsets' "maximal" supports. If the estimated "maximal" support of a candidate i-itemset exceeds λ, we add it to $F_i^{max}$ (lines 11–18).
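The two noise mechanisms used in the preprocessing and mining phases can be sketched in Python. This is our own illustration, not the paper's code: the geometric mechanism G(ε) is sampled as a two-sided geometric distribution (Ghosh et al., 2009), the Laplace noise as a difference of exponentials, and the function names and dict-based interfaces are hypothetical.

```python
import math
import random

def geometric_noise(eps, rng):
    """Two-sided geometric noise with Pr[d] proportional to exp(-eps * |d|),
    sampled as the difference of two i.i.d. one-sided geometric draws."""
    alpha = math.exp(-eps)
    def one_sided():
        # inverse transform for P(k) = (1 - alpha) * alpha**k, k = 0, 1, 2, ...
        return int(math.log(1.0 - rng.random()) / math.log(alpha))
    return one_sided() - one_sided()

def choose_max_cardinality(length_counts, eps1, h, rng):
    """Preprocessing: smallest m whose noisy cumulative length distribution
    covers at least a fraction h of all transactions."""
    noisy = {k: max(0, c + geometric_noise(eps1, rng))
             for k, c in length_counts.items()}
    total = sum(noisy.values())
    acc = 0
    for k in sorted(noisy):
        acc += noisy[k]
        if acc >= h * total:
            return k
    return max(noisy)

def noisy_supports(supports, sensitivity, eps_prime, rng):
    """Mining: perturb candidate supports with Laplace noise of scale
    sensitivity / eps', as in the Lap(m/eps') step of F_Max_Mining; a
    difference of two exponential draws is Laplace-distributed."""
    scale = sensitivity / eps_prime
    lap = lambda: rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return {c: s + lap() for c, s in supports.items()}
```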
6.2. Privacy analysis

In this subsection, we first prove that our algorithm satisfies ε-differential privacy. Since adding (removing) one transaction can only affect the number of transactions of a certain length by one, the sensitivity of computing d (i.e., the numbers of transactions of different lengths) is 1. Thus, adding geometric noise $G(\epsilon_1)$ in computing d guarantees $\epsilon_1$-differential privacy. Similarly, adding geometric noise $G(\epsilon_3/L)$ in computing $z_i$ (i.e., the a-vector of the i-itemset with the maximal support) guarantees $(\epsilon_3/L)$-differential privacy; by the composition property of differential privacy, computing $\langle z_1, \ldots, z_L \rangle$ satisfies $\epsilon_3$-differential privacy. Moreover, as shown in (Zeng et al., 2012), adding geometric noise $G(\epsilon_2/\lceil \log n \rceil)$ in computing s (i.e., the maximal supports of itemsets of different lengths) satisfies $\epsilon_2$-differential privacy, where n is the size of the alphabet. Since the maximal cardinality of frequent itemsets L is computed based on s, we can safely use it without privacy implications.

The process of mining frequent itemsets can be considered a sequence of computations Q = $\langle Q_1, \ldots, Q_L \rangle$, where $Q_i$ denotes all the i-itemsets whose supports are computed during the mining process. For i from 1 to L, we uniformly assign $Q_i$ a privacy budget $\epsilon_4/L$. Suppose the maximal cardinality of transactions is m. Since adding (removing) one transaction can affect the support of an itemset by at most one, the sensitivity of $Q_i$ is $S_{Q_i}$. Thus, adding Laplace noise $Lap(S_{Q_i} \cdot L/\epsilon_4)$ in computing $Q_i$ satisfies $(\epsilon_4/L)$-differential privacy. By the composition property of differential privacy, the process of mining frequent itemsets is overall $\epsilon_4$-differentially private. In summary, our algorithm satisfies $(\epsilon = \sum_{i=1}^{4} \epsilon_i)$-differential privacy.

In the following, we prove that applying any ε-differentially private FIM algorithm to the transformed database, which is constructed by our weighted splitting operation, also guarantees ε-differential privacy for the original database (i.e., Theorem 5).

Suppose D and D' are two neighboring databases, and the transaction in D but not in D' is t, which has weight w. It is not hard to see that t can affect the support of an itemset by at most w. Let $Q_i$ denote the i-itemset queries, and $C_i$ denote the i-itemsets whose supports are computed in the mining phase. For an i-itemset X, let n_sup(X) be its resulting noisy support, and let sup(X, D) and sup(X, D') be its actual supports on D and D', respectively. Suppose the privacy budget assigned to the i-itemset queries is $\epsilon' = \epsilon/L$, and the sensitivity used to perturb the supports of i-itemsets is sen(X). Thus, adding (removing) one
transaction can affect at most sen(X) itemsets. Let $\mathcal{S}_i$ denote the set of resulting frequent i-itemsets. Then we have:
\[
\frac{\Pr(Q_i(D) = \mathcal{S}_i)}{\Pr(Q_i(D') = \mathcal{S}_i)}
= \prod_{X \in C_i} \frac{\exp\!\left(-\epsilon' \frac{|n\_sup(X) - sup(X, D)|}{|sen(X)|}\right)}{\exp\!\left(-\epsilon' \frac{|n\_sup(X) - sup(X, D')|}{|sen(X)|}\right)}
\le \exp\!\left(\epsilon' \sum_{k=1}^{|C_i|} \frac{|sup(X, D) - sup(X, D')|}{|sen(X)|}\right)
\le \exp\!\left(\epsilon' \sum_{k=1}^{|sen(X)|} \frac{w}{|sen(X)|}\right)
= e^{w\epsilon'}. \tag{4}
\]
Let $\mathcal{S}$ denote the set of resulting frequent itemsets. The mining of frequent itemsets can be considered a series of computations Q = $\langle Q_1, \ldots, Q_L \rangle$. Thus, by the composition property of differential privacy, we have
\[
\Pr(Q(D) = \mathcal{S}) \le e^{w\epsilon} \cdot \Pr(Q(D') = \mathcal{S}).
\]
Consider two original neighboring databases T and T'. We denote the transaction in T' but not in T as t' (i.e., T' = T + t'). Let f be a weighted splitting operation. Suppose $\tilde{T} = f(T)$ is the transformed database of T, and t' is divided into k sub-transactions $t_1, \ldots, t_k$, where $t_j$ is assigned a weight $w_j$. Based on equation (4), we have
\[
\Pr(Q(\tilde{T}) = \mathcal{S}) \le e^{\epsilon \sum_{j=1}^{k} w_j} \cdot \Pr(Q(\langle \tilde{T}, t_1, \ldots, t_k \rangle) = \mathcal{S}).
\]
Since $\tilde{T}$ is the transformed database of T and t' is divided into $t_1, \ldots, t_k$, $\langle \tilde{T}, t_1, \ldots, t_k \rangle$ is the transformed database of T'. According to the definition of the weighted splitting operation, $\sum_{j=1}^{k} w_j \le 1$. Therefore, we have
\[
\Pr(Q(f(T)) = \mathcal{S}) \le e^{\epsilon} \cdot \Pr(Q(f(T')) = \mathcal{S}).
\]
That is, Theorem 5 follows.

7. Experiments

In this section, we report a performance study of our DP-Apriori algorithm.

7.1. Experimental setup

We compare our DP-Apriori algorithm with the following two algorithms: 1) the algorithm proposed in (Zeng et al., 2012), which privately finds all the itemsets whose supports exceed a user-specified threshold; and 2) the "PrivBasis" algorithm proposed in (Li et al., 2012), which privately finds the k most frequent itemsets. We use DPA (for DP-Apriori) to denote our algorithm, and TT (for Transaction Truncating) and PB (for PrivBasis) to denote the algorithms proposed in (Zeng et al., 2012) and (Li et al., 2012), respectively. In our experiments, we compare DPA with TT and PB for mining frequent itemsets and for mining top-k frequent itemsets. For mining frequent itemsets, we use the relative threshold, i.e., a percentage of the number of transactions in the dataset. Since DP-Apriori performs transaction splitting, it increases the number of transactions; for DP-Apriori, we therefore use the relative threshold with respect to the original dataset. Since PB is designed for mining top-k frequent itemsets, we cannot directly compare DPA with it. To compare their performance, we adapt PB to frequent itemset mining by setting k to the number of frequent itemsets for a given threshold; we emphasize that this scenario might create a privacy concern. For mining top-k frequent itemsets, we adapt DPA and TT to discover the k most frequent itemsets by setting the threshold to the frequency of the k-th most frequent itemset. Since the computation of that frequency might have privacy implications, we add geometric noise to it to avoid a privacy breach. The parameter h used in the preprocessing phase of DP-Apriori is set to 0.85, which typically obtains good results in our experiments. Like (Zeng et al., 2012), we set the parameter r used in the support estimation technique of DP-Apriori to 0.01. The privacy budget ε is set to 1.0; we also show experimental results for varying ε in Section 7.4. We implement all the algorithms in Java and conduct the experiments on a PC with an Intel Core 2 Duo E8400 CPU (3.0 GHz) and 4 GB of RAM. Since the algorithms involve randomization, we run each algorithm 10 times and report the average result.

7.1.1. Datasets

We use four publicly available real datasets in our experiments. Accidents (Frequent itemset mining dataset repository) contains traffic accident data; Pumsb-star (Frequent itemset mining dataset repository) is census data from PUMS (Public Use Microdata Sample); POS (Zheng et al., 2001) contains several years' worth of point-of-sale data from a large electronics retailer; Kosarak (Frequent itemset mining dataset repository) is a click-stream dataset from a Hungarian news portal. The detailed characteristics of these datasets are summarized in Table 1.

Table 1 – Data characteristics.

Dataset      #Transactions  #Items  Max. length  Avg. length
Pumsb-star   49,046         2,088   63           50.5
Accidents    340,183        468     51           33.8
POS          515,597        1,657   164          6.5
Kosarak      990,183        41,270  2,498        8.1

7.1.2. Utility measures

To evaluate the performance of our algorithm, we employ the standard metrics used in previous studies. Specifically, we use the F-score (Zeng et al., 2012) to measure the utility of the produced frequent itemsets:
\[
F\text{-}score = 2 \cdot \frac{precision \cdot recall}{precision + recall},
\]
where precision = |U|/|Up|, recall = |U|/|Uc|, Up is the set of frequent itemsets generated by a private algorithm, Uc is the set of correct frequent itemsets, and $U = U_p \cap U_c$. Additionally, to measure the error with respect to the actual supports of itemsets in the dataset, we calculate the relative error (RE) of the supports of released itemsets (Li et al., 2012):
Fig. 1 – Frequent itemset mining on Pumsb-star.
\[
RE = \operatorname{median}_{x \in X} \frac{|sup'(x) - sup(x)|}{sup(x)},
\]
where X is the set of all frequent itemsets generated by a private algorithm, sup(x) is the actual support of itemset x, and sup'(x) is its noisy support.
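The two utility metrics are straightforward to implement; a Python sketch (names ours, itemsets represented as frozensets):

```python
from statistics import median

def f_score(published, correct):
    """F-score over two collections of itemsets (sets of frozensets)."""
    inter = published & correct
    if not inter:
        return 0.0
    precision = len(inter) / len(published)
    recall = len(inter) / len(correct)
    return 2 * precision * recall / (precision + recall)

def relative_error(noisy_sup, true_sup):
    """Median relative error of the released supports (the RE metric)."""
    return median(abs(noisy_sup[x] - true_sup[x]) / true_sup[x]
                  for x in noisy_sup)
```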
7.2. Frequent itemset mining
Figs. 1–4 show the F-score and RE of the algorithms DPA, TT and PB for different threshold values on the four datasets. Since the datasets Pumsb-star and Accidents are dense, a large number of frequent itemsets can be discovered even for very large threshold values. We can see that DPA achieves better performance than TT on all these datasets. This is because, by leveraging the transaction splitting approach, DPA preserves more frequency information than TT. We can also observe that DPA outperforms PB in terms of F-score on all these datasets. PB first privately samples items to construct a basis set; it then adds noise to the actual supports of the itemsets covered by some basis, and finally selects the top k itemsets. However, when the differences between the supports of items are small, PB is very likely to sample infrequent items, which leads to poor performance in terms of F-score. On the other hand, we notice that PB obtains better performance than DPA and TT in terms of RE. The reason is as follows. To improve the utility-privacy tradeoff, DPA and TT transform the database to limit the cardinality of transactions, and add noise to the actual supports of itemsets in the transformed database. Therefore, compared with PB, the transformation of the database introduces more error with respect to the supports of the released itemsets. However, DPA obtains better performance than TT in terms of RE in all cases, which validates that our transaction splitting approach can effectively mitigate the side-effect of the database transformation.
7.3. Top-k frequent itemset mining
We also compare DPA with TT and PB for mining top-k frequent itemsets on the four datasets. Figs. 5–8 show the results when varying the parameter k from 10 to 200. We can observe that DPA substantially outperforms both TT and PB in terms of F-score, and obtains better performance than TT in terms of RE on all these datasets.

Fig. 2 – Frequent itemset mining on Accidents.

Fig. 3 – Frequent itemset mining on POS.
7.4. Effect of privacy budget
Fig. 9 shows the F-score and RE of the algorithms DPA, TT and PB when varying the privacy budget ε from 0.5 to 1.25 on the datasets Pumsb-star (for threshold θ = 0.6) and POS (for θ = 0.03). We observe that all these algorithms behave in a similar way: the quality of the frequent itemsets improves as ε increases. This is because, as ε increases, noise of lower magnitude is added, which also means that a lower degree of privacy is guaranteed. Clearly, at the same level of privacy guarantee, DPA still achieves better performance than TT and PB in terms of F-score, and obtains better performance than TT in terms of RE. We can also observe that the quality of the results is more stable
Fig. 4 – Frequent itemset mining on Kosarak.

Fig. 5 – Top-k frequent itemset mining on Pumsb-star.
Fig. 6 – Top-k frequent itemset mining on POS.
on POS. This can be explained by the high supports of the frequent itemsets in POS, which make them more resistant to noise.
7.5. Effect of our transaction splitting approach
Fig. 10 shows how our smart weighted splitting and support estimation techniques affect the performance of DPA on the datasets Accidents and Kosarak. We can observe that using smart weighted splitting to transform the database obtains better performance than using random splitting. This is because smart weighted splitting preserves more frequency information for frequent itemsets. We can also observe that the performance is further improved when both the smart weighted splitting and support estimation
Fig. 7 – Top-k frequent itemset mining on Kosarak.

Fig. 8 – Top-k frequent itemset mining on Accidents.
Fig. 9 – Effect of privacy budget.

Fig. 10 – Effect of our transaction splitting approach. RS denotes random weighted splitting, and SS denotes smart weighted splitting.
techniques are used (i.e., DPA). This is because our support estimation technique can effectively offset the information loss caused by transaction splitting.
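As a point of reference for the comparison above, the random splitting baseline can be sketched as follows. This is only a minimal illustration: each long transaction is shuffled and cut into sub-transactions of at most `max_len` items (the name `max_len` is ours), whereas DPA's smart weighted splitting additionally chooses the item grouping to keep frequent itemsets together, and its support estimation compensates for the resulting loss; neither refinement is reproduced here.

```python
import random


def random_split(transaction, max_len, rng=None):
    """Randomly split one transaction into sub-transactions of at most
    max_len items.

    Every item lands in exactly one sub-transaction, so an itemset that
    was supported by the original transaction may end up scattered across
    several sub-transactions and lose its support -- the information loss
    that smart weighted splitting and support estimation aim to reduce.
    """
    rng = rng or random.Random()
    items = list(transaction)
    rng.shuffle(items)
    return [items[i:i + max_len] for i in range(0, len(items), max_len)]


parts = random_split(["a", "b", "c", "d", "e", "f", "g"], 3, random.Random(0))
print(parts)  # three sub-transactions jointly covering all seven items
```

For example, a 2-itemset contained in the original 7-item transaction survives the split only if both of its items happen to fall into the same sub-transaction, which is exactly why a frequency-aware grouping outperforms a random one.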
8. Conclusion
In this paper, we investigate the problem of designing a differentially private FIM algorithm that can achieve both good utility and good privacy. To this end, we present an Apriori-based private FIM algorithm referred to as DP-Apriori. In particular, to improve the utility-privacy tradeoff, instead of limiting the cardinality of transactions by transaction truncating, we propose a novel transaction splitting approach in which long transactions are divided into multiple short sub-transactions. In our transaction splitting approach, a smart weighted splitting technique is devised to preserve the frequency information contained in long transactions. Moreover, a support estimation technique is developed to offset the information loss incurred by transaction splitting. Our privacy analysis shows that DP-Apriori satisfies ε-differential privacy. The results of extensive experiments on real datasets show that the proposed DP-Apriori algorithm outperforms the state-of-the-art techniques. Since the transaction splitting approach proposed in this paper is specific to the Apriori algorithm, in future work we plan to extend it to support FP-Growth.
Acknowledgment
We thank the reviewers for their valuable comments, which significantly improved this paper. The work was supported in part by the following funding agencies of China: National Natural Science Foundation under grant 61170274, National Key Basic Research Program (973 Program) under grant 2011CB302506, and Fundamental Research Funds for the Central Universities under grant 2014RC1103.
Xiang Cheng received the Ph.D. Degree in Computer Science from Beijing University of Posts and Telecommunications, China, in 2013. He is currently an Assistant Professor at the Beijing University of Posts and Telecommunications. His research interests include privacy-preserving data mining and large-scale data mining.
Sen Su received the Ph.D. Degree in Computer Science from the University of Electronic Science and Technology, China, in 1998. He is currently a Professor at the Beijing University of Posts and Telecommunications. His research interests include distributed systems and service computing.
Shengzhi Xu is a Ph.D. candidate at Beijing University of Posts and Telecommunications in China. His major is Computer Science. His research interests include privacy-preserving data mining and large-scale data mining.

Zhengyi Li is a graduate student at Beijing University of Posts and Telecommunications in China. His major is Computer Science. His research interests include data mining and machine learning.