Data & Knowledge Engineering 88 (2013) 267–285
Graph publication when the protection algorithm is available

Mingxuan Yuan (a,b), Lei Chen (b,*), Hong Mei (c)
(a) Huawei Noah's Ark Lab, Hong Kong, China
(b) The Hong Kong University of Science and Technology, Hong Kong, China
(c) Peking University, Beijing, China
Article info: Available online 28 April 2013.
Keywords: Social networks; Privacy; Semi-Edge Anonymity
Abstract: With the popularity of social networks, the privacy issues related to social network data become more and more important. The connection information between users, as well as their sensitive attributes, should be protected. There are some proposals studying how to publish a privacy-preserving graph. However, when the algorithm that generates the published graph is known to the attacker, the current protection models may still leak connection information. In this paper, we propose a new protection model, named Semi-Edge Anonymity, to protect both users' sensitive attributes and connection information even when an attacker knows the publication algorithm. Moreover, the Semi-Edge Anonymity model can plug in any state-of-the-art protection model for tabular data to protect sensitive labels. We theoretically prove that on two utilities, the possible world size and the true edge ratio, the Semi-Edge Anonymity model outperforms any clustering based model which protects links. We further conduct extensive experiments on real data sets for several other utilities. The results show that our model also has better performance on these utilities than the clustering based models. © 2013 Elsevier B.V. All rights reserved.
1. Introduction

Social network websites, such as Facebook, LinkedIn, and Twitter, have become more and more popular in recent years. A social network graph contains lots of information, such as users' age, name, education background, and the relationships between users. Fig. 1 is a social network example, where each user is represented as a node and a relationship between a pair of users is represented as a link between nodes. Social network website owners often want to release their data to third parties for further analysis [1,10,12]. However, when publishing a graph, it is important to protect the privacy of the involved users. For many applications, a social graph is modeled as a labeled graph, which contains the following information [2,3,6,23]:

• Node information:
  - The non-sensitive attributes of each user. Similar to tabular micro-data, we call them node quasi-identifiers. For example, the education background and age on each node in Fig. 1 are the quasi-identifiers.
  - The sensitive attributes of each user. For example, in Fig. 1, the salary of each user is the sensitive attribute. We also call the sensitive attributes the sensitive labels.
• Link information: the relationships between users. We also call this the structure information.
* Corresponding author. E-mail address: [email protected] (L. Chen). doi:10.1016/j.datak.2013.04.006
Fig. 1. Social network example.
1.1. Challenges

When publishing a graph, the link information as well as the sensitive attributes should be protected [2,6].¹ We call the protection of the link information the link protection and the protection of the sensitive attributes the node protection. Simply removing the user identifiers in a social network does not guarantee privacy: unique patterns, such as node degrees and subgraphs, can be used to re-identify nodes/links [9,10]. For example, an attacker could use structure information around Ben, such as Fig. 2(a) or (b), to attack the published graph.² If an attacker knows that Ben's neighborhood graph is Fig. 2(a), he could identify that Ben is node c in Fig. 3, the naive anonymized version (simply removing all identifiers) of Fig. 1. Then all the information around c in Fig. 3 is released. This attack is called node re-identification. If the attacker further knows that Tim's neighborhood graph is Fig. 2(c), he can conclude that Tim and Ben have a connection. This attack is called link re-identification. Two models [2,3,6] have been proposed which give a quantifiable guarantee on both the node protection³ and the link protection⁴ when an attacker may know arbitrary subgraphs around victims. These models preserved user privacy well when they were proposed. However, they may suffer privacy leakage if the attacker knows the algorithm used to generate the privacy-preserving social network, no matter which of the following is used:

1. Edge-editing based model: add or delete edges to make the graph satisfy certain properties according to the privacy requirement; or
2. Clustering based model: cluster "similar" nodes together to form super nodes. Each super node represents several nodes, which are also called a cluster. The links between nodes are then represented as edges between super nodes, which are called super edges. Each super edge may represent more than one edge in the original graph.
We call the graph that only contains super nodes and super edges a clustered graph. In the following, we show that when the attacker combines his knowledge with the graph generation policy, the link information could be released in some cases. Cheng [6] proposed a k-isomorphism model based on edge editing for both node protection and link protection. The protection objective of the k-isomorphism model is: for an attacker with arbitrary subgraph knowledge, the probability to discover any user's information or to find that any two users have a connection is at most 1/k. A graph is k-isomorphic if it consists of at

¹ It should be noted that the purpose of anonymity models is to protect one or both of the links and attributes. The final purpose of protecting nodes by anonymity is to protect the information around nodes. Making a node k-anonymous is the method to implement protection, not the protection objective itself.
² Note that the letters in vertices are not vertex labels; they are vertex IDs that we introduce to simplify the description.
³ The probability to re-identify a node is bounded by a constant.
⁴ The probability to learn that any two nodes have a link is bounded by a constant.
Fig. 2. Possible background knowledge of an attacker: (a) Ben's subgraph 1; (b) Ben's subgraph 2; (c) Tim; (d) Billy and Aron.
Fig. 3. Naive anonymization.
least k disjoint isomorphic subgraphs. Fig. 4 is a 2-isomorphic graph of Fig. 1. Since there are k disjoint isomorphic subgraphs, each node is "hidden" among at least k−1 other nodes. For link protection, since any node's candidates appear in at least k disjoint isomorphic subgraphs, the probability that an attacker finds that two nodes have a connection is at most 1/k. For example, the candidates of Tim are {e, j} and the candidates of Ben are {c, h}. So the mappings between candidates and the users (Tim, Ben) are {(e, c), (e, h), (j, c), (j, h)}. Since the subgraphs are disjoint, at most two of these mappings contain an edge between the two mapped nodes. Therefore, the probability that an attacker finds that two nodes have a link is at most 1/2. However, it is required to make as few edge changes as possible to keep the graph utility [6]. The published graph is generated with only a small portion of edge changes⁵ [6]. Therefore, the probability that two connected nodes are put into different disjoint parts is low. Suppose a published graph is generated with p% of edges changed. If an attacker knows a connected subgraph G′ (e.g. Fig. 2(a)), for any two nodes in G′, the probability that these two nodes are put into two different subgraphs is at most p%, since at least one edge must be deleted in order to put these two nodes into two separated subgraphs.⁶ When an attacker knows a subgraph as shown in Fig. 2(a), he can find that the mapping between candidates and the users {Alice, Chilly, Mike} should either be {a, b, d} in part 1 or {g, i, k} in part 2 of Fig. 4 with very high confidence (at least 1 − p% where p is small). Thus edges such as e(Alice, Chilly) and e(Chilly, Mike) are released. In some cases, even when an attacker does not know that two nodes are in the same subgraph, the link between these two nodes can still be discovered. For example, if an attacker knows Ben's subgraph (as shown in Fig. 2(a)) and the quasi-identifiers of Billy and Aron (shown in Fig.
2(d)), then he knows that Billy and
⁵ At most around 10% of the edges are changed in [6]'s experiments.
⁶ Actually, to make the mappings {(e, c), (e, h), (j, c), (j, h)} have equal probability of occurrence, at least 50% of the edges in the original graph should be changed. For any two nodes u and v where there exists an edge (u, v), to let an attacker believe that u and v have only 1/k probability to be assigned into the same connected subgraph, (u, v) should be deleted with probability (k−1)/k. Even with k = 2, at least 50% of the edges should be changed. This contradicts the graph generation policy of the k-isomorphism model.
Fig. 4. 2-Isomorphism graph.
Aron are not in Fig. 2(a). After that, he can delete the subgraph about Ben (Fig. 2(a)) from Fig. 4. Although he might not know which isomorphic part of Fig. 4 contains Fig. 2(a), he can still randomly select one part and remove Fig. 2(a) from it. The new graph is shown in Fig. 5, which is not a 2-isomorphic graph. The link protection, such as the link between Billy and Aron (i.e. the link between h and j), cannot be guaranteed anymore. Different from the edge-editing method, Cormode [2,3] proposed a clustering based model which also protects both the nodes and the links. For an attacker with arbitrary subgraph knowledge, their model guarantees that the probability of node re-identification and of link re-identification is at most 1/k. By making each cluster's size at least k, the probability to re-identify a user is bounded by 1/k. For link protection, Cormode [2,3]'s model requests the following two safety conditions: (1) no links within a cluster; (2) for any two nodes in the same cluster, they do not connect to the same node. For the graph in Fig. 1, a clustered graph with k = 2 as shown in Fig. 6 can be published.⁷ For any two super nodes Cx and Cy, if we use |Cx| and |Cy| to represent the number of nodes contained in Cx and Cy, this clustering model constrains the number of edges between Cx and Cy to be at most min{|Cx|, |Cy|}. Since each cluster's size is at least k, the probability that an attacker can find that any two nodes have a link is at most min{|Cx|, |Cy|} / (|Cx||Cy|) ≤ 1/k.
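This bound is easy to check numerically. Below is a minimal sketch; the helper name and the toy cluster sizes are ours, not from the paper:

```python
def link_probability_bound(size_x, size_y, num_edges):
    """Upper bound on Prob(con(u, v)) for u in C_x, v in C_y: the attacker
    knows only that num_edges of the size_x * size_y node pairs are linked."""
    return num_edges / (size_x * size_y)

# Under the safety conditions, num_edges <= min(|C_x|, |C_y|), so with
# every cluster of size >= k the bound is at most 1/k.
k = 2
for size_x, size_y in [(2, 2), (2, 5), (4, 3)]:
    num_edges = min(size_x, size_y)  # worst case allowed by the conditions
    assert link_probability_bound(size_x, size_y, num_edges) <= 1 / k
```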
However, the above safety conditions might lead to privacy leakage. Consider the following case: if an attacker knows a subgraph as shown in Fig. 2(a), he can uniquely determine that Dik and Red have a link by combining the published graph with the two safety conditions (shown in Fig. 7). There are two edges between super nodes 0 and 2. Since there already exists an edge e(a, c) based on the background knowledge (i.e. Fig. 2(a)) and the clustering condition makes sure these two edges do not connect to the same node, the remaining edge must be between (g, i) (Dik and Red). This attack works for homogeneous graphs (there is at most one edge between any two nodes) since the safety condition provides more information than the published graph.

1.2. Our contributions

From the above analysis, we can see that there indeed exists a problem when an attacker has the generation policy of the published graph as background knowledge. To solve this problem, in this paper, assuming that an attacker knows our graph generation policy as well as arbitrary subgraphs:

• We give a necessary and sufficient condition for clustering based models, which can guarantee the link protection objective by restricting the number of edges between and within clusters.
• We propose a new protection model named the Semi-Edge Anonymity (SEA) model, which protects nodes and edges separately. We give a safe randomized algorithm to implement this model. The new model performs well in preserving the original graph's utilities. Since we protect nodes and edges separately, the SEA model can plug in any state-of-the-art protection model for tabular data to protect sensitive labels. Although we use a randomized algorithm to generate the published graph, the SEA model preserves the graph's structure information well. We prove that the SEA model outperforms any clustering based model which protects links on two utilities.
Extensive experiments on real data sets also show that our SEA model preserves several other utilities well. The rest of the paper is organized as follows: In Section 2, we give the problem definition. We prove a necessary and sufficient condition for clustering based models when link protection should be achieved in Section 3. Section 4 describes our SEA model and Section 5 gives the algorithm to implement this model. We prove that the SEA model always outperforms clustering based models on two utilities when link protection is considered in Section 6. Section 7 discusses extensions of the model. We report the experiment results in Section 8. The comparison of our work with related previous works is given in Section 9, followed by the conclusion in Section 10.

⁷ The number on each super edge is the number of edges this super edge represents. If a super edge represents only one edge, we do not display the number "1" to keep the figures clean.
Fig. 5. Attack demonstration.
Fig. 6. A 2-anonymous clustered graph.
2. Problem description

2.1. Protection objectives

In this paper, we develop a graph protection model for both node protection and link protection. Our model guarantees the following two protection objectives:

Objective 1. For any node u, the probability that an attacker can re-identify u's sensitive label is at most 1/k.
Objective 2. For any nodes u and v, the probability that an attacker can successfully find that u and v have a link is at most 1/k. We use Prob(con(u, v)) to denote this probability; this objective is then Prob(con(u, v)) ≤ 1/k.

Here k is a pre-given constant.

2.2. Attack model

We suppose that an attacker has the following background knowledge:

• The quasi-identifiers of victims;
• Any labeled subgraphs around victims⁸;
• The method used to generate the published graph.
⁸ The quasi-identifier of a node u is actually contained in any labeled subgraph around u. To be clear, we list them separately here.
Fig. 7. A 2-anonymous clustered graph under attack.
An attacker could use the above information to learn users' sensitive labels or to find whether two users have a link in the published data.
3. Safety condition for clustering

For any clustering based method, to guarantee protection objective 1 (see our problem definition), a necessary condition is that each cluster's size is at least k. To guarantee the link protection objective (protection objective 2), the following necessary and sufficient Safety Clustering (SC) Condition must be satisfied:

1. For any super node C, the number of edges d_C between the nodes within C must satisfy d_C ≤ |C|(|C|−1)/(2k);
2. For any two super nodes C_x and C_y, the number of edges d_{C_x,C_y} between the nodes in C_x and the nodes in C_y must satisfy d_{C_x,C_y} ≤ |C_x||C_y|/k.

Proof 1.
• Sufficient: For any nodes u, v:
  - If u ∈ C ∧ v ∈ C:
    Prob(con(u, v)) = d_C / (|C|(|C|−1)/2) = 2d_C / (|C|(|C|−1)) ≤ 2 · (|C|(|C|−1)/(2k)) / (|C|(|C|−1)) = 1/k
  - If u ∈ C_x ∧ v ∈ C_y:
    Prob(con(u, v)) = d_{C_x,C_y} / (|C_x||C_y|) ≤ (|C_x||C_y|/k) / (|C_x||C_y|) = 1/k
  In both cases, protection objective 2 is guaranteed.
• Necessary:
  - If for a super node C, d_C > |C|(|C|−1)/(2k), then for ∀u, v ∈ C:
    Prob(con(u, v)) = d_C / (|C|(|C|−1)/2) = 2d_C / (|C|(|C|−1)) > 2 · (|C|(|C|−1)/(2k)) / (|C|(|C|−1)) = 1/k
    Protection objective 2 is violated.
  - If for two super nodes C_x and C_y, d_{C_x,C_y} > |C_x||C_y|/k, then for ∀u ∈ C_x ∧ ∀v ∈ C_y:
    Prob(con(u, v)) = d_{C_x,C_y} / (|C_x||C_y|) > (|C_x||C_y|/k) / (|C_x||C_y|) = 1/k
    Protection objective 2 is violated.

So, the SC condition is a necessary and sufficient condition for a clustering based model to provide link protection. We further use this necessary and sufficient condition to analyze the utilities of any clustering based model when link protection must be provided in Section 6.

4. Semi-Edge Anonymity model

Properly combining the Safety Clustering Condition with a clustering based model can achieve the link protection (protection objective 2). However, the current edge-editing based models and clustering based models both mix the node protection and link protection together. The link protection is implemented based on node grouping, which brings unnecessary structure information loss. For example, in clustering based models, the edges are anonymized based on clusters. A demonstration example is shown in Fig. 8(a). If there is only one edge between clusters Cx and Cy, the number of possible positions to put this edge is |Cx||Cy|,
Fig. 8. Motivation example for utility improvement: (a) clustering-based model; (b) Semi-Edge.
Fig. 9. A Semi-Edge graph (a Semi-Edge table and an attribute table).
which is at least k². However, to achieve protection objective 2, k positions to put an edge are enough. If we can anonymize this edge directly, the number of positions to put this edge can be reduced significantly to better preserve the structure information. In this paper, we propose the idea of Semi-Edge to anonymize each edge directly.

Definition 1. Semi-Edge: for a graph G(V, E), a Semi-Edge se(u, C) is a pair where u is a node and C is a set of nodes. For ∀v ∈ C, we say se covers edge e(u, v). We use |C| to denote the size of the set C.

From one Semi-Edge se(u, C) where |C| = k, an attacker has only 1/|C| = 1/k probability to find that u has a link with any node in C. An example of a Semi-Edge is shown in Fig. 8(b). If we use a Semi-Edge to anonymize the edge between Cx and Cy, the number of possible positions to put it is reduced to only k (for the theoretical analysis, please see Section 6). It has been shown that node protection can only be guaranteed when the grouping of nodes considers the sensitive attributes' diversity and distribution (e.g. the different l-diversity models [14], t-closeness [11]) as well as the cost function used to generate the published data [17]. Current graph protection models implement either k-anonymity [2,4,6,12,22,24,20,10] or l-diversity [23,5]. They all rely on a specific protection model. If we separate the node protection and link protection, the protection of sensitive labels is not restricted to a specific protection model. Motivated by this, instead of using an edge-editing based model or a clustering based model, we propose a new graph protection model, the Semi-Edge Anonymity (SEA) model, to protect node and link information independently. The SEA model publishes two tables separately. The first table, the attribute table, contains only node quasi-identifiers and sensitive attributes following any state-of-the-art tabular data protection model [14,11,17]. The second table, the Semi-Edge table, anonymizes each edge directly by using a Semi-Edge.
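Definition 1 can be expressed directly in code. The following sketch is ours (class and method names are not from the paper) and uses the Semi-Edge (a, {b, k}) that appears later in Fig. 9:

```python
class SemiEdge:
    """A Semi-Edge se(u, C): a node u paired with a candidate set C.
    It covers every edge e(u, v) with v in C."""

    def __init__(self, u, candidates):
        self.u = u
        self.candidates = set(candidates)

    def covers(self, x, y):
        """True if this Semi-Edge covers the (undirected) edge e(x, y)."""
        return (x == self.u and y in self.candidates) or \
               (y == self.u and x in self.candidates)

    def link_probability(self):
        """The attacker only learns that u links to *some* node in C."""
        return 1 / len(self.candidates)

se = SemiEdge("a", {"b", "k"})
assert se.covers("a", "b") and se.covers("k", "a")
assert se.link_probability() == 0.5  # |C| = k = 2
```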
A simple example is shown in Fig. 9 (the connection information published in Fig. 9 is shown in Fig. 10). There are two benefits of our SEA model:

• Any state-of-the-art sensitive label protection model for tabular data can be directly adopted by the SEA model. The generation of the attribute table and the Semi-Edge table can be done independently;
• Its strategy of directly anonymizing each link helps to maintain the graph utility.

In the remaining part of this paper, we use Q(u) to represent node u's quasi-identifiers and S(u) to represent u's sensitive attributes. To protect the node information, we partition nodes into groups and publish the quasi-identifiers/sensitive attributes of nodes based on these groups in an attribute table. The attribute table can reuse any existing protection model for sensitive labels in tabular data, such as l-diversity [14], t-closeness [11], etc. We give each node an anonymity id in order to link with the Semi-Edge table. For simplicity, in the remaining part of this paper, we use u to directly represent the id of u. For example, if we choose the
Fig. 10. The connection information in Fig. 9.
anatomy model defined in [18] and v1, …, vm form a group, then a tuple ({(v1, Q(v1)), …, (vm, Q(vm))}, {S(v1), …, S(vm)}) is published for this group, as shown in Fig. 9.⁹ The graph we publish is:

Definition 2. Semi-Edge Graph: A Semi-Edge Graph SG(AT, SET) represents G(V, E) by the pair AT and SET. AT is an attribute table for all the nodes in G. All edges in E are covered by the Semi-Edges in SET.

For example, Fig. 9 is a Semi-Edge Graph of Fig. 1. Since any node in SET is mapped to a group of sensitive labels in AT, it is obvious that the protection of sensitive labels and the protection of links are independent:

Lemma 1. For a Semi-Edge Graph SG, the node protection in AT is not influenced by SET, and the link protection in SET is not influenced by AT, either. So AT and SET can be generated independently.

There is a lot of existing work on how to generate an attribute table; in this paper, we focus on how to implement the link protection in SET. We give a Safety Semi-Edge Condition which guarantees that a Semi-Edge Graph SG achieves protection objective 2:

Con 1. ∀se(u, C) ∈ SET: |C| = k;
Con 2. ∀se(u, C) and ∀v ∈ C: for any other se(u, C′), v ∉ C′, and for any se(v, C′), u ∉ C′.

There are two constraints in our safety condition. The first constraint requests that each Semi-Edge covers k different edges. The second constraint requests that for any edge e(u, v) covered by one Semi-Edge, no other Semi-Edge which can cover it appears in SET. For example, in Fig. 9, edges e(a, b) and e(a, k) are covered by Semi-Edge (a, {b, k}), and none of the other Semi-Edges covers e(a, b) or e(a, k). A Semi-Edge graph SG which satisfies the Safety Semi-Edge Condition is called a Semi-Edge anonymized graph. Our SEA model publishes a Semi-Edge anonymized graph instead of the original graph.

Lemma 2.
For an attacker with the following background knowledge:

• The quasi-identifiers of victims;
• Any labeled subgraphs around victims;
• The knowledge that the published graph is generated under the Safety Semi-Edge Condition;

a Semi-Edge anonymized graph always guarantees protection objective 2.

Proof 2. Firstly, the two constraints in the Safety Semi-Edge Condition can be directly observed in the published data. This guarantees that no extra information is released through the safety condition. For any nodes u and v, suppose the attacker has the strongest background knowledge (i.e. the whole labeled graph without e(u, v)). In SG(AT, SET), there exists at most one Semi-Edge se(u, C) or se(v, C) such that se covers e(u, v). According to Con 1 of the Safety Semi-Edge Condition, |C| = k. So for any two nodes u and v, the probability that an attacker finds that they have a link is at most 1/k.

5. Generation algorithm

We use Algorithm 1 to generate the Semi-Edge table SET. For each original edge e(u, v) in G, we generate a Semi-Edge se(u, C). We firstly add v into C, which promises that se covers e(u, v). Then we randomly add k−1 other nodes into C to make sure that each Semi-Edge covers k edges (lines 4–7). We use a hash table ht to record the number of Semi-Edges that cover each possible edge (lines 8–12). If

⁹ If we use the models which generalize all the quasi-identifiers in each group to be the same (i.e. Q(v1) … Q(vm) are generalized to GQ1…m), a tuple ({v1, …, vm}, {(GQ1…m, S(v1)), …, (GQ1…m, S(vm))}) is published.
each element in ht has value 1, the Safety Semi-Edge Condition is satisfied. If some edges are covered by more than one Semi-Edge, we use the following method to adjust SET so that it satisfies the Safety Semi-Edge Condition.

Algorithm 1. Generate Semi-Edge Table
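The listing of Algorithm 1 appears as a figure in the paper; the following Python sketch reconstructs the procedure as described in the text. It is our own reconstruction (function and variable names are ours, line numbers differ from the paper's listing) and assumes a sparse graph so that an uncovered candidate edge always exists during adjustment:

```python
import random
from collections import defaultdict

def generate_semi_edge_table(edges, nodes, k, seed=0):
    """Sketch of Algorithm 1: build one Semi-Edge per original edge,
    then adjust until no candidate edge is covered twice."""
    rng = random.Random(seed)
    table = []                                # table[i] covers edges[i]
    for (u, v) in edges:
        cand = {v}                            # ensure se covers e(u, v)
        while len(cand) < k:                  # pad C with k-1 random nodes
            w = rng.choice(nodes)
            if w != u:
                cand.add(w)
        table.append((u, cand))

    def coverage():                           # ht: candidate edge -> indices
        ht = defaultdict(list)
        for i, (u, cand) in enumerate(table):
            for v in cand:
                ht[frozenset((u, v))].append(i)
        return ht

    while True:
        ht = coverage()
        edge, covers = max(ht.items(), key=lambda kv: len(kv[1]))
        if len(covers) <= 1:                  # Safety Semi-Edge Condition holds
            return table
        # pick a covering Semi-Edge NOT originally created for this edge
        i = next(j for j in covers if frozenset(edges[j]) != edge)
        u2, cand = table[i]
        v2 = next(iter(edge - {u2}))          # offending candidate node
        fresh = [w for w in nodes
                 if w != u2 and frozenset((u2, w)) not in ht]
        cand.discard(v2)                      # swap it for an uncovered node
        cand.add(rng.choice(fresh))

# quick check on a small sparse graph
nodes = list("abcdefgh")
edges = [("a", "b"), ("c", "d"), ("a", "e")]
table = generate_semi_edge_table(edges, nodes, k=2)
cover_counts = defaultdict(int)
for u, cand in table:
    for w in cand:
        cover_counts[frozenset((u, w))] += 1
```

Each adjustment step removes one extra cover from the most-covered edge and adds a candidate edge that was previously uncovered, so the total amount of over-coverage strictly decreases and the loop terminates.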
1. We find the edge e(u, v) which is covered by the maximum number of Semi-Edges in SET (e(u, v) is the edge with the maximum value in ht).
2. If e(u, v) is covered by only one Semi-Edge, the Safety Semi-Edge Condition is satisfied and we finish the adjustment (lines 20–21).
3. Otherwise, we find a Semi-Edge se(u′, C′) which covers e(u, v), where se(u′, C′) was not originally created for edge e(u, v).¹⁰ We randomly select one node w where e(u′, w) ∉ ht and replace v′ with w. By doing this, the number of Semi-Edges that cover e(u, v) is decreased by 1.

We repeat the above steps until SET satisfies the Safety Semi-Edge Condition.

Theorem 1. For an attacker with the following background knowledge:

• The quasi-identifiers of victims;
• Any labeled subgraphs around victims;
• The knowledge that the published graph is generated by Algorithm 1;

the published Semi-Edge anonymized graph always guarantees protection objective 2.

Proof 3. According to Lemma 2, knowing the Safety Semi-Edge Condition does not cause privacy leakage. In Algorithm 1, the nodes in each Semi-Edge are randomly selected; therefore the algorithm does not guarantee any particular output for a given input. Given a SET, any graph which is consistent with SET can be the input of this algorithm. Thus, a Semi-Edge anonymized graph generated by Algorithm 1 always guarantees protection objective 2.

Cormode [2,3] concluded that a clustered graph which satisfies their safety conditions (shown in Section 1) can easily be found since most social networks are sparse. We show that a Semi-Edge anonymized graph also exists because of the sparsity of social networks.

Theorem 2. If there exists a clustered graph which satisfies the following safety conditions [2,3]:

• No links within a cluster;
• For any two nodes in the same cluster, they do not connect to the same node;

then a Semi-Edge anonymized graph always exists.
¹⁰ Since the Semi-Edge covering e(u, v) could be se(u, C1) or se(v, C2), we use u′ here to avoid confusion in the description. We use v′ ∈ C′ to represent the other endpoint of e(u, v).
Proof 4. According to the first condition, the clustered graph only contains edges between different clusters. According to the second condition, for any two clusters Cx and Cy and any two edges e(u1, v1) and e(u2, v2) where u1, u2 ∈ Cx and v1, v2 ∈ Cy, we have u1 ≠ u2 and v1 ≠ v2. Then, for any two super nodes Cx and Cy and each edge e(u, v) where u ∈ Cx ∧ v ∈ Cy, we generate a Semi-Edge se(u, C′y) with |C′y| = k, v ∈ C′y, and C′y ⊆ Cy (when |Cy| > k, we randomly remove some nodes from Cy to generate C′y). The Semi-Edges generated by this method satisfy the two constraints of the Safety Semi-Edge Condition. By doing this for all super node pairs, a Semi-Edge anonymized graph can be generated.

In our experiments, Algorithm 1 successfully found solutions in all cases. In the worst case, Algorithm 1's random assignment generates a ht in which each item has value |E|. Then the algorithm needs to adjust for k|E|(|E| − 1) steps. In each step, we perform the computation of lines 14–20. The computation cost of finding the edge e(u, v) with the maximum value in ht is |ht|, which is at most k|E|. We attach to each edge in ht a list which contains all the Semi-Edges that cover the corresponding edge, so the computation in line 15 can be implemented directly. In the worst case, the computation in line 16 checks |V| nodes. Thus the computation complexity of lines 13–22 is O(k|E|(|E| − 1)(k|E| + |V| + 4)) = O(k²|E|³). In lines 1–12, the random generation of Semi-Edges takes at most k|E||V| steps, which is dominated by the adjustment cost. Thus the computation cost of Algorithm 1 is O(k²|E|³). Since the privacy preserving graph publication problem is an offline problem, the computation cost of Algorithm 1 is acceptable.

6. Utility analysis

We use a random algorithm to make sure the generation process does not cause privacy leakage.
In this section, we compare our model with clustering based models by analyzing two utilities, the possible world size and the true edge ratio (see the definitions in later paragraphs). We show that although we enhance the privacy by using a random algorithm, our model always performs better than or equal to any clustering based model which considers link protection on these two utilities.

For clustering based models, since only super nodes and super edges are published, each published graph represents a group of graphs which are consistent with it. The set of graphs W(G) that are consistent with the published graph is called the possible world [10,9]. When using such a clustered graph, users can sample some graphs in W(G) and compute the average values over these graphs. So, the number of consistent graphs of a clustered graph, |W(G)|, should be as small as possible [10,9]. We call |W(G)| the possible world size. When generating a clustered graph, the one with smaller |W(G)| is preferred. The experiments of [10,9] showed that a clustered graph with small |W(G)| preserves a group of other utilities well. The graph construction algorithm in [10,9] uses the smallest |W(G)| as the objective function to generate the published clustered graph. So, firstly, we analyze the number of graphs that are consistent with the published graph (|W(G)|). For a clustering based model, |W(G)| is computed as:

|W(G)|_clustering = ∏_{∀C} C(|C|(|C|−1)/2, d_C) · ∏_{∀Cx,Cy} C(|Cx||Cy|, d_{Cx,Cy})

where C(n, m) denotes the binomial coefficient, d_C is the number of edges between the nodes in cluster C, and d_{Cx,Cy} is the number of edges between the nodes in clusters Cx and Cy. In a Semi-Edge anonymized graph, since for a Semi-Edge se(u, C) the original edge covered by se can be between u and any node in C, |W(G)| is computed as:

|W(G)|_SEA = ∏_{∀se(u,C)} |C| = k^|E|
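The two quantities can be compared numerically. Below is a sketch; the function names and the toy clustered graph are ours:

```python
from math import comb

def possible_world_size_clustering(clusters, intra_edges, inter_edges):
    """|W(G)| for a clustered graph: choose d_C of the C(|C|,2) intra
    positions per cluster, and d_{Cx,Cy} of the |Cx|*|Cy| inter positions
    per cluster pair."""
    size = 1
    for cid, members in clusters.items():
        n = len(members)
        size *= comb(n * (n - 1) // 2, intra_edges.get(cid, 0))
    for (cx, cy), d in inter_edges.items():
        size *= comb(len(clusters[cx]) * len(clusters[cy]), d)
    return size

def possible_world_size_sea(num_edges, k):
    """|W(G)| for a Semi-Edge anonymized graph: k choices per Semi-Edge."""
    return k ** num_edges

# Toy example with k = 2: two clusters of size 2, two inter-cluster edges
# (allowed by the SC condition, since |Cx||Cy|/k = 2).
clusters = {0: {"a", "c"}, 1: {"b", "h"}}
w_clu = possible_world_size_clustering(clusters, {}, {(0, 1): 2})
w_sea = possible_world_size_sea(num_edges=2, k=2)
assert w_clu == 6 and w_sea == 4
assert w_sea <= w_clu  # Theorem 3 on this instance
```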
Theorem 3. A Semi-Edge anonymized graph always has a |W(G)| which is smaller than or at most equal to that of any clustered graph when link protection should be provided.

Proof 5. For a clustered graph:

|W(G)|_clustering = ∏_{∀C} C(|C|(|C|−1)/2, d_C) · ∏_{∀Cx,Cy} C(|Cx||Cy|, d_{Cx,Cy})

For any two integers M and x with 1 ≤ x ≤ M/k, C(M, x) ≥ k^x. This is because C(M, x) = [M(M−1)⋯(M−x+1)] / [x(x−1)⋯1], and for any y ∈ [0, x−1], (M−y)/(x−y) ≥ k, since M ≥ kx implies M − y ≥ k(x − y).

We proved in Section 3 that the Safety Clustering Condition is a necessary and sufficient condition for any clustered graph to provide link protection. The Safety Clustering Condition requests:

• For any super node C, the number of edges d_C between the nodes within C must satisfy d_C ≤ |C|(|C|−1)/(2k);
• For any two super nodes Cx and Cy, the number of edges d_{Cx,Cy} between the nodes in Cx and the nodes in Cy must satisfy d_{Cx,Cy} ≤ |Cx||Cy|/k.
So C(|C|(|C|−1)/2, d_C) ≥ k^{d_C} and C(|Cx||Cy|, d_{Cx,Cy}) ≥ k^{d_{Cx,Cy}}. We get:

|W(G)|_clustering ≥ ∏_{∀C} k^{d_C} · ∏_{∀Cx,Cy} k^{d_{Cx,Cy}} = k^{∑_{∀C} d_C + ∑_{∀Cx,Cy} d_{Cx,Cy}} = k^{|E|} = |W(G)|_SEA
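The combinatorial inequality used in Proof 5 can be spot-checked numerically:

```python
from math import comb

# For integers 1 <= x <= M/k, C(M, x) >= k^x: each factor (M - y)/(x - y)
# with y in [0, x-1] is at least k when M >= k*x.
for k in (2, 3, 5):
    for x in range(1, 8):
        for M in (k * x, k * x + 3):  # tightest allowed M and a looser one
            assert comb(M, x) >= k ** x
```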
Next we show that for another utility, our Semi-Edge anonymized graph also works better than, or at least equal to, any clustered graph. Suppose that G′ is a graph sampled from the published graph. If an edge appears both in G′ and in the original graph G, we call this edge a true edge. We use the ratio of true edges in G′ to estimate how much structure information of G is correctly represented by G′. The ratio of true edges TR is defined as: TR = (number of true edges in G′) / |E|. The larger TR is, the better G′ represents G. We use expTR to represent the expected ratio of true edges in any sampled graph.

Theorem 4. When using a uniform random sampling method, the expTR of any graph sampled from a Semi-Edge anonymized graph is always larger than or at least equal to the expTR of any graph sampled from a clustered graph when link protection should be provided.

Proof 6. In a Semi-Edge anonymized graph, for a Semi-Edge se(u, C), the original edge covered by se can be between u and any node in C. Each original edge is covered by only one Semi-Edge (Con 2 of the Safety Semi-Edge Condition). Since for any Semi-Edge se(u, C), |C| = k, when using uniform random sampling:

expTR_SEA = 1/k
For a clustered graph, when link protection must be guaranteed, according to the Safety Clustering Condition: expTRclustering ¼ ≤
dC x ;C y dC ;C x y C x C y
∑∀C ðdCÞ dC þ ∑∀C x ;C y jC j 2
jEj ∑∀C
dC k
þ ∑∀C x ;C y
¼ 1=k ¼ expTRSEA
dC x ;C y k
jEj
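The equality $expTR_{SEA} = 1/k$ can also be seen by simulation. The sketch below (with illustrative parameters) models each Semi-Edge se(u, C) as a uniform draw among its $|C| = k$ candidate endpoints, exactly one of which is the true one:

```python
import random

random.seed(0)
k, n_edges, trials = 4, 200, 2000

# Monte-Carlo check of expTR_SEA = 1/k (illustrative, not the formal proof):
# uniform sampling picks the true endpoint of each Semi-Edge with prob. 1/k.
def sampled_true_ratio():
    true_edges = sum(1 for _ in range(n_edges) if random.randrange(k) == 0)
    return true_edges / n_edges

exp_tr = sum(sampled_true_ratio() for _ in range(trials)) / trials
print(exp_tr, 1 / k)  # the estimate is close to 1/k = 0.25
```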
7. Discussion

In this section, we discuss how to extend the Semi-Edge Anonymity model to the case where an attacker has some knowledge obtained from data mining results on the graph, which is stronger information beyond the scope of all current works [12,22,24,6,20,10,21,4,7,2]. It should be emphasized that, in reality, such knowledge can only be obtained from the published graph itself, unless similar graphs already exist in public (in which case publishing the current graph is unnecessary). Using knowledge from data mining therefore overestimates an attacker's background knowledge. However, we use it to demonstrate the completeness and extensibility of our model. Before going into details, let us first see a simple example of an attack that uses such stronger knowledge, and show how all the state-of-the-art protection models fail in this case. Suppose an attacker knows that two "Master"s whose ages are within 3 years have more than 90% probability of knowing each other. With this strong knowledge, all the current state-of-the-art protection models may have problems. For example, consider the 2-Isomorphism graph in Fig. 4: the link between Dik (Master, 33) and Billy (Master, 36) is protected by 2-anonymity because Dik is mixed with Alice (Phd., 27) and Billy is mixed with Chilly (Bachelor, 30). Dik and Alice are mapped to {a,g}; Billy and Chilly are mapped to {b,h}. In the published graph, there are two edges e(a,b) and e(g,h); without any other knowledge, the probability that Dik and Billy have an edge is 1/2. However, since the attacker knows that two "Master"s whose ages are within 3 years have more than 90% probability of knowing each other, the two edges must be e(Dik, Billy) and e(Alice, Chilly). The link information is released. For the latest clustering based protection model in Fig. 6, the link information published by Fig. 6 is that there are 2 edges between {a,g} and {b,h}. Thus the probability of a link between g (Dik (Master, 33)) and h (Billy (Master, 36)) should be at most 1/2. However, with the strong knowledge that two "Master"s whose ages are within 3 years have more than 90% probability of knowing each other, the attacker knows that there must exist a link between g (Dik (Master, 33)) and h (Billy (Master, 36)). The protection of link information fails. For our SEA model, the protection of links is based on each Semi-Edge. Given a Semi-Edge (u, C), if the probability of an edge between u and v ∈ C is larger than 1/|C| (= 1/k), the SEA model also suffers from the protection failure problem. In this example, the link e(g, h) is protected by the Semi-Edge e(g, {h,k}). With the strong knowledge, an attacker may also learn that the link must be e(g, h). So, when an attacker could have some knowledge from
the data mining results on the graph, all the state-of-the-art models have problems protecting link information. Next, we discuss how our model can be extended to handle these more complex cases. Suppose that, besides the knowledge shown in Section 2.2, an attacker also has some knowledge obtained from data mining on the graph:

K1: $sim(Q(u), Q) \ge t \Rightarrow S(u) = s$;
K2: $sim(G'(u), G_k) \ge t \Rightarrow S(u) = s$;
K3: $sim(Q(u), Q(v)) \ge t \Rightarrow e(u, v)$;
K4: $sim(G'(u), G'(v)) \ge t \Rightarrow e(u, v)$;

where K1 represents the knowledge that when the similarity between Q(u) (u's quasi-identifiers) and Q is bigger than a threshold t, an attacker can infer that u's sensitive attribute equals s. Suppose G′(u) is the subgraph around u that an attacker can observe from the published graph; K2 represents the knowledge that when the similarity between G′(u) and a subgraph $G_k$ is bigger than a threshold t, an attacker can infer that u's sensitive attribute equals s. K3 represents the knowledge that when the similarity between Q(u) and Q(v) is bigger than a threshold t, an attacker can infer that there is an edge between u and v. K4 represents the knowledge that when the similarity between G′(u) and G′(v) is bigger than a threshold t, an attacker can infer that there is an edge between u and v. These four categories of knowledge reflect most possible data mining results. Knowledge K1 should be handled by the tabular protection models, which is beyond the scope of this paper. Simply put, the anonymous groups should be generated with K1 taken into consideration. Consider the anatomy model: each group g({(v1, Q(v1)),…,(vm, Q(vm))}, {S(v1),…,S(vm)}) should satisfy ∀Q(vx), Q(vy) ∈ g: sim(Q(vx), Q(vy)) ≥ t. Each edge in the Semi-Edge Anonymity model has 1/k probability to be true. If there are |E′(u)| possible edges in G′(u), the confidence that sim(G′(u), $G_k$) ≥ t is only around $1/k^{|E'(u)|}$. So the Semi-Edge Anonymity model has no problem even if an attacker holds knowledge K2.
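The $1/k^{|E'(u)|}$ confidence decay can be illustrated with a quick Monte-Carlo sketch; all parameters below are invented for the example:

```python
import random

random.seed(3)
k, n_candidate_edges, trials = 4, 6, 200000

# If each of the |E'(u)| candidate edges in G'(u) is independently true
# with probability 1/k, an exact match of all of them happens with
# probability (1/k)^|E'(u)| -- vanishingly small as |E'(u)| grows.
hits = sum(
    1 for _ in range(trials)
    if all(random.randrange(k) == 0 for _ in range(n_candidate_edges))
)
print(hits / trials, (1 / k) ** n_candidate_edges)
```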
Similarly, since each edge in the Semi-Edge Anonymity model has 1/k probability to be true, the Semi-Edge Anonymity model can also tolerate an attacker who holds knowledge K4. When knowledge K3 is involved, if we make sure each Semi-Edge se(u, C) satisfies ∀v ∈ C, sim(Q(u), Q(v)) ≥ t, we can still guarantee that an attacker has probability 1/k to learn each edge. To implement this, we only need to change two lines in Algorithm 5:
Line 6 ⇒ Randomly select a node v′ where v′ ≠ u ∧ v′ ∉ C ∧ sim(Q(u), Q(v′)) ≥ t;
Line 17 ⇒ Randomly select a node w with e(u′, w) ∉ ht ∧ w ≠ u′ ∧ w ∉ C′ ∧ sim(Q(u′), Q(w)) ≥ t.
When stronger knowledge such as K3 is considered, we can make the Semi-Edge Anonymity model handle it by incorporating the knowledge into the Semi-Edge generation process.

8. Experiment

The analysis in the model description part proves the privacy protection effectiveness of our model. As in other privacy preserving graph publication works [12,22,24,6,20,10,21,4,7,2], we test several utilities to show how well the published data preserves the structure information of the original graph. We test four real data sets and compare our SEA model with two clustering based models. We compare only with clustering based models, since the edge-editing based model with link protection is subsumed by the SEA model: for the edge-editing based model to protect links, the published graph should contain at most $\frac{1}{k}|E|$ randomly selected true edges, and such a published graph is actually one sampled graph of the Semi-Edge anonymized graph. However, if we publish only one sampled graph, the information in it is biased, since the information of the deleted edges ($\frac{k-1}{k}|E|$ edges) is totally missing. The SEA model is a more general solution which covers the edge-editing based model with link protection. Each edge in the original graph is represented by one tuple in SET. For any link, the user has 1/k probability to correctly recover it from the published Semi-Edge anonymized graph.

8.1.
Data sets
• Cora: The Cora data set (http://www.cs.umd.edu/projects/linqs/projects/lbc/index.html) consists of 2708 machine learning papers, each classified into one of seven classes: Case_Based, Genetic_Algorithms, Neural_Networks, Probabilistic_Methods, Rule_Learning, Reinforcement_Learning and Theory. Based on the citation relationships, the network contains 5429 links. This graph has frequently been used in link prediction and classification.
• Arnet: ArnetMiner (http://www.arnetminer.net/) is an academic researcher social network collected by Tsinghua University. The Arnet data set contains information extracted from crawled web pages of computer scientists; the extracted information forms the co-authorship graph among these people. We use a subset of the whole data set, which contains 6000 nodes and 37,848 edges.
• ArXiv: ArXiv (arXiv.org) is an e-print service in Physics, Mathematics, Computer Science, Quantitative Biology, Quantitative Finance and Statistics. We extract the co-author graph in Computer Science, which contains 19,835 nodes and 40,221 edges. Each node denotes an author, and each edge means that two authors have co-authored at least one paper.
• DBLP: We use a subset of the PROXIMITY DBLP dataset (http://kdl.cs.umass.edu/data/dblp/dblp-info.html), which maps each entry in the original DBLP data to one of six types of objects representing different types of publications. The links represent citations between publications. We randomly extract a subgraph which contains 12,000 nodes and 40,260 edges.
8.2. Utilities

We have proven that the SEA model outperforms any clustering based model on |W(G)| and expTR. For a published graph that represents more than one graph, to test how good the published graph is, researchers sample a group of graphs that are consistent with the published graph, calculate graph properties in these sampled graphs, and use the average property changes as graph change benchmarks [10,21,4,7,2]. In our experiment, besides |W(G)| and expTR, we also sample n (n = 100) graphs to compute the average graph property changes. Suppose the sampled graph set is S; we test the change ratios of the following three graph properties:
• Average change ratio of degrees (ACRD):
$$ACRD = \frac{1}{n}\sum_{\forall G_s \in S}\frac{\sum_{\forall u}\frac{|degree(u)_{G_s} - degree(u)_G|}{degree(u)_{G_s} + degree(u)_G}}{|V|}$$
• Average change ratio of clustering coefficients (ACRCC): The CC of a vertex in a graph is commonly used [12,24] to represent its neighborhood graph. It is defined as the actual number of edges between the vertex's direct neighbors divided by the maximum possible number of edges between these direct neighbors. We test the average change ratio of the clustering coefficients:
$$ACRCC = \frac{1}{n}\sum_{\forall G_s \in S}\frac{\sum_{\forall u}\frac{|CC(u)_{G_s} - CC(u)_G|}{CC(u)_{G_s} + CC(u)_G}}{|V|}$$
• Average change ratio of shortest path lengths (ACRSPL): Suppose $u_{maxd}$ is the node with the largest degree in the original graph G. We compute the average change ratio of shortest path lengths from $u_{maxd}$ to all the other nodes:
$$ACRSPL = \frac{1}{n}\sum_{\forall G_s \in S}\frac{\sum_{\forall v}\frac{|dis(u_{maxd},v)_{G_s} - dis(u_{maxd},v)_G|}{dis(u_{maxd},v)_{G_s} + dis(u_{maxd},v)_G}}{|V|}$$
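The first of these metrics can be sketched directly from the formula; graphs are represented here simply as degree dictionaries, and the helper name `acrd` is our own:

```python
# Sketch of the ACRD utility: average, over n sampled graphs, of the
# per-node relative degree change averaged over |V| nodes.
def acrd(orig_deg, sampled_degs):
    n = len(sampled_degs)                       # number of sampled graphs
    total = 0.0
    for deg_s in sampled_degs:                  # one sampled graph G_s
        per_node = sum(abs(deg_s[u] - orig_deg[u]) / (deg_s[u] + orig_deg[u])
                       for u in orig_deg if deg_s[u] + orig_deg[u] > 0)
        total += per_node / len(orig_deg)       # average over |V| nodes
    return total / n                            # average over the n samples

orig = {0: 2, 1: 3, 2: 1}
samples = [{0: 2, 1: 3, 2: 1}, {0: 4, 1: 3, 2: 1}]
print(acrd(orig, samples))  # (0 + (2/6)/3) / 2 = 1/18
```

ACRCC and ACRSPL follow the same averaging pattern with CC(u) and dis(u_maxd, v) in place of the degrees.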
For the published graph, smaller |W(G)|, ACRD, ACRCC, ACRSPL and larger expTR are preferred.

8.3. Clustering-based models

We compare our SEA model with two clustering based models.
• D-Clustering (Directly-Clustering) [9,10]: Hay et al. use a Simulated Annealing (SA) algorithm to generate a clustered graph which prevents node re-identification [10,9]. The only constraint on the clustered graph is that each cluster's size must be at least k. The SA algorithm aims to minimize |W(G)|. There are three operations in SA: (1) merge two clusters into one cluster; (2) split one cluster into two clusters; and (3) move nodes between two clusters. They showed that after running the SA algorithm for around 100|V| steps, |W(G)| becomes stable. We use this model as the baseline. In our experiment, we set the stop condition of the SA algorithm as having run at least 120|V| steps with no better solution found in the last 1000 steps.
• S-Clustering (Safely-Clustering): We enhance the clustering based model by adding the Safety Clustering Condition when generating the clusters. We also use the SA algorithm to find a solution with as small a |W(G)| as possible. The only difference is that, besides the cluster size constraint, the number of edges within any cluster and between any two clusters is further bounded by the Safety Clustering Condition. The stop condition is the same as for D-Clustering.
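The SA search with the three operations can be sketched as follows. This is a minimal illustration, not the authors' implementation: the real objective is |W(G)|, which depends on the graph's edges, so `cost` below is an invented stand-in that keeps the sketch self-contained:

```python
import math
import random

# Stand-in objective (the real one is |W(G)|): prefer clusters near size k.
def cost(clusters, k):
    return sum((len(c) - k) ** 2 for c in clusters)

# One random SA move: merge two clusters, split one, or move a node.
def neighbor(clusters, k):
    cs = [set(c) for c in clusters]
    op = random.choice(["merge", "split", "move"])
    if op == "merge" and len(cs) >= 2:
        a, b = random.sample(range(len(cs)), 2)
        cs[a] |= cs[b]
        cs.pop(b)
    elif op == "split":
        i = random.randrange(len(cs))
        if len(cs[i]) >= 2 * k:                  # both halves keep size >= k
            members = list(cs[i])
            random.shuffle(members)
            cs[i] = set(members[:k])
            cs.append(set(members[k:]))
    elif op == "move" and len(cs) >= 2:
        a, b = random.sample(range(len(cs)), 2)
        if len(cs[a]) > k:                       # donor keeps size >= k
            v = random.choice(sorted(cs[a]))
            cs[a].remove(v)
            cs[b].add(v)
    return cs

random.seed(2)
k, V = 3, list(range(12))
clusters = [set(V)]                              # start from one big cluster
T = 1.0
for _ in range(2000):
    cand = neighbor(clusters, k)
    if min(len(c) for c in cand) >= k:           # cluster-size constraint
        d = cost(cand, k) - cost(clusters, k)
        if d <= 0 or random.random() < math.exp(-d / T):
            clusters = cand
    T *= 0.999
print(sorted(len(c) for c in clusters))
```

S-Clustering would add one more check next to the size constraint, rejecting candidates whose within- and cross-cluster edge counts violate the Safety Clustering Condition.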
8.4. Results

Fig. 11 shows the results of |W(G)| on the four data sets. From the results we can see that, compared with S-Clustering, which provides link protection, our SEA model performs much better: in all cases, $|W(G)|_{S\text{-}Clustering} > |W(G)|^2_{SEA}$. Even compared with D-Clustering, which does not provide link protection and directly optimizes |W(G)|, the SEA model has much better performance in nearly all cases (except the four points in the ArXiv graph when k ≤ 5). We demonstrate the results of expTR in Fig. 12. In all cases, the SEA model publishes a graph with higher expTR than S-Clustering. In most cases, the SEA model also outperforms D-Clustering (except the five points in the ArXiv graph when k ≤ 6), but D-Clustering does not offer any link protection. Fig. 13 presents the results of ACRD. For the data sets Cora, Arnet and DBLP, the SEA model has smaller ACRD than both S-Clustering and D-Clustering for all ks. For the data set ArXiv, the SEA model always performs better than S-Clustering; for k ∈ [16,20], the SEA model has performance similar to D-Clustering, and for the other cases (k ∈ [2,15]) it also has smaller ACRD than D-Clustering. Fig. 14 presents the results of ACRCC. For all data sets, the SEA model performs better than S-Clustering and D-Clustering. Fig. 15 presents the results of ACRSPL. For the data sets Cora and DBLP, the SEA model performs best for all ks. For Arnet, the SEA model outperforms S-Clustering; D-Clustering has better performance when k = 2 and results competitive with the SEA model when k ∈ [18,20], while for all other ks the SEA model has smaller ACRSPL than D-Clustering. For ArXiv, the SEA model outperforms S-Clustering for all ks and D-Clustering when k ∈ [5,20]. From all the results, we can see that the SEA model performs better than S-Clustering on the five utilities for all data sets. In most cases, the SEA model even performs better than D-Clustering, which does not consider link protection. The experiment results confirm the effectiveness of preserving structure information by directly anonymizing each edge. Although our SEA model uses a random generation algorithm to guarantee privacy, it still preserves the graph utilities well.
[Figure: four panels (a) Cora, (b) Arnet, (c) ArXiv, (d) DBLP, each plotting log10(|PWD|) against k ∈ [2,20] for S-Clustering, D-Clustering and SEA.]
Fig. 11. The results of |W(G)|.
[Figure: four panels (a) Cora, (b) Arnet, (c) ArXiv, (d) DBLP, each plotting expTR against k ∈ [2,20] for S-Clustering, D-Clustering and SEA.]
Fig. 12. The results of expTR.
9. Related works

Many works have been proposed to address privacy issues in publishing social network graphs. They can basically be classified into two categories based on the attack model: passive attack and active attack. An attack that uses certain background knowledge to re-identify the nodes/links in the published graph is called a "passive attack". Two models have been proposed to publish a privacy preserved graph against the passive attack: the edge-editing based model [12,22,24,6,20] and the clustering based model [10,21,4,7,2]. As described in Section 1, the edge-editing based model adds or deletes edges to make the graph satisfy certain properties according to the privacy requirement (e.g., each degree in the graph appears at least k times [12], so that an attacker who knows a user's degree has at most 1/k probability of correctly locating this user in the published graph). The clustering based model clusters "similar" nodes together to form super nodes; each super node represents several nodes, which are also called a cluster. The links between nodes are represented as edges between super nodes, which are called super edges. Finally, a graph which contains only super nodes and super edges, instead of the original graph, is published. Most graph protection models implement k-anonymity [16] of nodes under different background knowledge of the attacker. Liu [12] defined and implemented the k-degree-anonymous model on network structure: in the published network, for any node, there exist at least k−1 other nodes that have the same degree as this node. Zhou [22] considered a stricter model: for every node, there exist at least k−1 other nodes that share isomorphic neighborhoods when node labels are taken into account. In [23], the k-neighborhood anonymity model is extended to the l-neighborhood-diversity model to protect the sensitive node labels; probability l-diversity is implemented in this model. Zou [24] proposed a k-Automorphism protection model: a graph is k-Automorphic if and only if for every node there exist at least k−1 other nodes that have no structural difference with it. Hay [10] proposed a heuristic clustering algorithm to prevent node re-identification under vertex refinement, subgraph, and hub-print attacks. Campan [4] discussed how to implement clustering against subgraph attacks on nodes while considering both node label and structure information loss. Cormode [7,2] introduced (k,l)-groupings for bipartite graphs against subgraph attacks. Campan [5] implemented a p-sensitive-k-anonymity clustering model which requires each cluster to satisfy k-anonymity and distinct l-diversity.
[Figure: four panels (a) Cora, (b) Arnet, (c) ArXiv, (d) DBLP, each plotting ACRD against k ∈ [2,20] for S-Clustering, D-Clustering and SEA.]
Fig. 13. The results of ACRD.
Besides the protection of nodes, several works have considered the protection of link information. Zheleva [21] developed a clustering method to prevent sensitive link leakage. Ying [20] studied how randomly deleting and swapping edges changes graph properties and proposed an eigenvalue oriented random graph change algorithm. These works did not provide a quantifiable guarantee on link protection [6]. Cheng [6] designed a k-isomorphism model to protect both nodes and links: a graph is k-isomorphic if it consists of k disjoint isomorphic subgraphs. The attributes of nodes are protected by the anatomy model [18] in a k-isomorphism graph. The k-isomorphism graph guarantees that an attacker has at most 1/k probability of finding that two nodes have a connection even if he knows arbitrary subgraphs. Cormode [2,3] introduced a clustering based model which implements node protection through k-anonymity, where an attacker likewise has at most 1/k probability of finding that two nodes have a connection when he knows arbitrary subgraphs. However, when an attacker has information about the published graph's generation policy, the link protection cannot be guaranteed in some cases. Our SEA model handles the passive attack for both node protection and link protection even when an attacker knows our graph generation policy as well as arbitrary subgraphs. We do not restrict the node protection to any specific model such as k-anonymity [16] or anatomy [18]; any state-of-the-art tabular data protection model, such as the different l-diversity models [14] or the t-closeness model [11], can be adopted into our model. Our SEA model directly anonymizes each edge, which helps to reduce the structure information loss in the published graph. We give a safe randomized algorithm to implement the SEA model, which guarantees privacy even when the publishing policy is known by the attacker. Besides the "passive attack", there is another type of attack on social networks, called the "active attack". An "active attack" actively embeds special subgraphs into a social network while this social network is collecting data. An attacker can attack the users who are connected with the embedded subgraphs by re-identifying these special subgraphs in the published graph. Backstrom [1] described active attacks based on randomness analysis and demonstrated that an attacker may plant constructed substructures associated with the target entities. The method to prevent the active attack is to recognize the fake nodes added by attackers and remove them before publishing the data. Shrivastava [15] proposed an algorithm which can identify fake nodes based on the difference in triangle probability between normal nodes and fake nodes. Ying [19] proposed another method which uses spectrum analysis to find the fake nodes. Since our model allows the attacker
[Figure: four panels (a) Cora, (b) Arnet, (c) ArXiv, (d) DBLP, each plotting ACRCC against k ∈ [2,20] for S-Clustering, D-Clustering and SEA.]
Fig. 14. The results of ACRCC.
to know arbitrary subgraphs (which include the subgraphs embedded by an attacker), our model can handle the active attack, too. There are also two other works which focus on protecting edge weights in weighted graphs instead of links and nodes. The basic method to protect edge weights is to assign new edge weights in the published graph [13,8]. Their problem assumptions are different from those of all other works.
10. Conclusion

In this paper, we propose a new protection model, Semi-Edge Anonymity, to solve the problem that arises when the graph publishing algorithm is known to the attacker. The model provides link protection and node protection at the same time. We also give a necessary and sufficient condition for clustering based models to provide link protection. A randomized algorithm is proposed for the SEA model, and we prove that the new model protects links well even when the attacker knows the generation algorithm. The other important benefit of the SEA model is that it can adopt any state-of-the-art sensitive label protection model for tabular data, which gives the SEA model the ability to provide the best available protection for sensitive labels. We theoretically prove that the SEA model outperforms any clustering based model that provides link protection on two utilities. Experiments show that the SEA model also performs better than clustering based models that consider link protection on several other utilities.
Acknowledgments This work is supported in part by Hong Kong RGC grant N_HKU-ST612/09, National Grand Fundamental Research 973 Program of China under grant 2012CB316200 and NSFC 61232018. The DBLP dataset is based on the Proximity DBLP database prepared by the Knowledge Discovery Laboratory, University of Massachusetts Amherst.
[Figure: four panels (a) Cora, (b) Arnet, (c) ArXiv, (d) DBLP, each plotting ACRSPL against k ∈ [2,20] for S-Clustering, D-Clustering and SEA.]
Fig. 15. The results of ACRSPL.
References
[1] L. Backstrom, C. Dwork, J.M. Kleinberg, Wherefore art thou r3579x?: anonymized social networks, hidden patterns, and structural steganography, WWW, 2007, pp. 181–190.
[2] S. Bhagat, G. Cormode, B. Krishnamurthy, D. Srivastava, Class-based graph anonymization for social network data, Proceedings of the VLDB Endowment 2 (2009) 766–777.
[3] S. Bhagat, G. Cormode, B. Krishnamurthy, D. Srivastava, Prediction promotes privacy in dynamic social networks, Proceedings of the 3rd Conference on Online Social Networks, WOSN'10, USENIX Association, Berkeley, CA, USA, 2010.
[4] A. Campan, T.M. Truta, A clustering approach for data and structural anonymity in social networks, PinKDD'08, 2008.
[5] A. Campan, T.M. Truta, N. Cooper, P-sensitive k-anonymity with generalization constraints, Transactions on Data Privacy 2 (2010) 65–89.
[6] J. Cheng, A.W.-C. Fu, J. Liu, K-isomorphism: privacy preserving network publication against structural attacks, Proceedings of the 2010 International Conference on Management of Data, SIGMOD '10, ACM, New York, NY, USA, 2010, pp. 459–470.
[7] G. Cormode, D. Srivastava, T. Yu, Q. Zhang, Anonymizing bipartite graph data using safe groupings, Proceedings of the VLDB Endowment 1 (2008) 833–844.
[8] S. Das, O. Egecioglu, A.E. Abbadi, Privacy preserving in weighted social network, ICDE'10, 2010, pp. 904–907.
[9] M. Hay, G. Miklau, D. Jensen, D. Towsley, C. Li, Resisting structural re-identification in anonymized social networks, The VLDB Journal 19 (6) (Dec. 2010) 797–823.
[10] M. Hay, G. Miklau, D. Jensen, D. Towsley, P. Weis, Resisting structural re-identification in anonymized social networks, Proceedings of the VLDB Endowment 1 (2008) 102–114.
[11] N. Li, T. Li, t-closeness: privacy beyond k-anonymity and l-diversity, ICDE'07, 2007, pp. 106–115.
[12] K. Liu, E. Terzi, Towards identity anonymization on graphs, SIGMOD'08, 2008, pp. 93–106.
[13] L. Liu, J. Wang, J. Liu, J. Zhang, Privacy preserving in social networks against sensitive edge disclosure, Technical Report CMIDA-HiPSCCS 006-08, 2008.
[14] A. Machanavajjhala, D. Kifer, J. Gehrke, M. Venkitasubramaniam, l-diversity: privacy beyond k-anonymity, ACM Transactions on Knowledge Discovery from Data 1 (1) (March 2007).
[15] N. Shrivastava, A. Majumder, R. Rastogi, Mining (social) network graphs to detect random link attacks, ICDE'08, 2008, pp. 486–495.
[16] L. Sweeney, k-anonymity: a model for protecting privacy, International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems 10 (2002) 557–570.
[17] R.C.-W. Wong, A.W.-C. Fu, K. Wang, J. Pei, Minimality attack in privacy preserving data publishing, Proceedings of the 33rd International Conference on Very Large Data Bases, VLDB '07, VLDB Endowment, 2007, pp. 543–554.
[18] X. Xiao, Y. Tao, Anatomy: simple and effective privacy preservation, VLDB'06, 2006, pp. 139–150.
[19] X. Ying, X. Wu, D. Barbará, Spectrum based fraud detection in social networks, ICDE'11, 2011.
[20] X. Ying, X. Wu, Randomizing social networks: a spectrum preserving approach, SDM'08, 2008.
[21] E. Zheleva, L. Getoor, Preserving the privacy of sensitive relationships in graph data, PinKDD'07, 2007, pp. 153–171.
[22] B. Zhou, J. Pei, Preserving privacy in social networks against neighborhood attacks, ICDE'08, 2008, pp. 506–515.
[23] B. Zhou, J. Pei, The k-anonymity and l-diversity approaches for privacy preservation in social networks against neighborhood attacks, Knowledge and Information Systems 28 (June 2010) 47–77.
[24] L. Zou, L. Chen, M.T. Özsu, k-automorphism: a general framework for privacy preserving network publication, Proceedings of the VLDB Endowment 2 (2009) 946–957.