Accepted Manuscript
Identifying the Same Person across Two Similar Social Networks in a Unified Way: Globally and Locally
Zhongbao Zhang, Qihang Gu, Tong Yue, Sen Su
PII: S0020-0255(17)30419-X
DOI: 10.1016/j.ins.2017.02.008
Reference: INS 12731
To appear in: Information Sciences
Received date: 25 August 2016
Revised date: 23 November 2016
Accepted date: 2 February 2017
Please cite this article as: Zhongbao Zhang, Qihang Gu, Tong Yue, Sen Su, Identifying the Same Person across Two Similar Social Networks in a Unified Way: Globally and Locally, Information Sciences (2017), doi: 10.1016/j.ins.2017.02.008
This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.
Identifying the Same Person across Two Similar Social Networks in a Unified Way: Globally and Locally
Zhongbao Zhang, Qihang Gu, Tong Yue, Sen Su
State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, Beijing, China
Abstract
With the increasing popularity of online social networks, people today typically own accounts on several of them (e.g., Facebook, Twitter, and Flickr). This raises an interesting and challenging problem: how to identify the same person in different social networks, which is known as the network reconciliation problem. Prior work on this problem first assumes that the relationships between users are homogeneous, and then makes use of local features (degree, common mapped neighbors) to achieve high precision. However, this assumption does not hold in reality, since users usually have different tie strengths with one another. In this paper, we remodel the reconciliation problem by considering the users' heterogeneous relationships and propose a unified framework called UniRank that incorporates local features and global features together. Based on UniRank, we design an efficient two-stage network reconciliation algorithm. First, we design a global matching algorithm to quickly explore more seeds. Second, for each explored seed, we design a local matching algorithm based on a breadth-first strategy to match more seeds. Extensive simulations on real-world and synthetic social network datasets show that our algorithm improves on the state-of-the-art algorithm by up to 9X in terms of F1 score, even under very rough conditions.
Keywords: Social network; Network reconciliation; Identifying; Globally; Locally
1. Introduction
1.1. Background and Motivation
Recently, online social networks (OSNs) have exploded in popularity. For example, Facebook has more than 1 billion daily active users, and Twitter has over 320 million monthly active users [1, 2]. These social network providers own vast amounts of personal data and relationship information about their users, which have become a treasure trove sought after by scientific researchers and marketers. The proliferation of online social networks also opens a door to research on social behaviours and social interaction topology.
One individual may own accounts on several social network platforms. For example, he may use Twitter to follow what is happening in the world, Facebook to follow what is happening to his close friends, and LinkedIn for his career development. In such a scenario, it is very interesting to identify the same person across different social networks, which is termed the network reconciliation problem. Solving this problem is of critical importance to OSNs, since each social network is just a subset of the real-world social network. Analysis of any single network could be partial and insufficient, because some behaviors and relationships will not appear until there is a systematic way to combine the information across social networks. For this reason, social network reconciliation is proposed as a promising technique to match all the accounts belonging to the same person across OSNs. With richer data combined from multiple sources, social network providers and social scientists can gain a more accurate understanding of users and their characteristics, in terms of either behaviors or relationships. Therefore, more personalized services and practical applications can be provided for users; for example, combined user profiles can be used for content customisation, and link information can be crawled for friend recommendation.
Email addresses: [email protected] (Zhongbao Zhang), [email protected] (Qihang Gu), [email protected] (Tong Yue), [email protected] (Sen Su)
Preprint submitted to Elsevier, February 2, 2017
1.2. Limitation of Prior Art
The network reconciliation problem has attracted significant attention from researchers. Generally, prior studies fall into two categories. The first [17, 28, 3, 14, 12] uses semantic features (e.g., username, location and interests) to identify users one by one. However, it can only identify a small fraction of nodes and is vulnerable to fake users. The second category [4, 19, 15, 10] builds on the first one: it uses the nodes identified by the first category of approaches as seeds and leverages structural information (e.g., degree, common mapped neighbors) to infer which pairs of nodes in different social networks belong to the same user. However, all prior studies in the second category have the following limitations: i) They model the social network as an unweighted graph and ignore the tie strength between two users, whereas in the real world different pairs of users have different tie strengths. ii) They only use local features (e.g., node degree or common mapped neighbors) for node identification across two social networks and ignore the global features of nodes (e.g., betweenness and closeness centralities), which are significant for matching; as a result, they suffer from low precision.
1.3. Proposed Approach
In this paper, we aim to address the two limitations above. To address the first limitation, we remodel the social network with weights on the edges. In reality, closeness between friends varies from person to person, so one user may have different tie strengths with different friends. The tie strength on the edge $e_{uu'}$ between user $u$ and user $u'$ is proportional to their interaction, characterized by the overlap of their neighbors. The goal of the network reconciliation problem thus becomes to maximize the number of identified nodes in two given graphs $G_1$ and $G_2$ by leveraging these edges with diverse weights.
To address the second limitation, we first abstract the local features and global features respectively, and then propose a unified framework called UniRank for aggregating these two categories of features. Through preliminary experiments in Section 4, we observe that the global feature follows a preferential attachment model: only a small fraction (about 1-2%) of the nodes have a large weight that clearly distinguishes them from other nodes. We leverage this observation and design an efficient two-stage Network Reconciliation algorithm using Global and Local features, called NR-GL, to improve the effectiveness and efficiency of network reconciliation. In terms of global features, our algorithm divides nodes into two categories corresponding to two stages.
The first stage deals with the seed exploration problem and aims at matching nodes with distinct global features. Starting from a few initial seeds, we propose a global matching algorithm that explores more seeds with high precision. Its core idea is that, for nodes with outstanding global features, which are instantly recognizable, we prefer to match a node $u$ in $G_1$ to a node $v$ in $G_2$ with a similar global feature.
The second stage deals with the seed expansion problem and aims at matching the remaining unidentified nodes. In this stage, regarding each explored seed node as a root, we develop a local matching algorithm based on a breadth-first strategy that puts local features to greater use. Through iterative breadth-first matching, an unidentified node is matched to the corresponding node with the highest similarity score among all candidates. Moreover, a handful of seeds selected at the beginning may induce some mismatches, and these mismatches can lead to a cascade of errors in further steps; we therefore propose a conflict resolution step to handle this problem. A bi-directional match strategy is also applied to enhance the precision in tough conditions.
Finally, we empirically validate the effectiveness of the algorithm on real and synthetic social network datasets. The results show that our algorithm significantly outperforms the state-of-the-art algorithms even under very rough conditions.
1.4. Key Contributions
The key contributions of this paper are listed as follows:
1. We remodel the network reconciliation problem with tie strengths on the edges. To the best of our knowledge, this is the first attempt to shed light on this important factor when modeling the network reconciliation problem.
2. We propose a unified framework for aggregating global and local features, which is also the first such attempt for the network reconciliation problem. Based on the framework, we also design an efficient two-stage algorithm called NR-GL.
3. We validate the effectiveness of our algorithm on both a real-world dataset and a synthetic dataset. Extensive simulations show that it significantly outperforms the state-of-the-art algorithm in terms of recall rate, precision and F1 score.
1.5. Roadmap
The rest of this paper is organized as follows. Section 2 formally models and defines the reconciliation problem. Section 3 proposes a unified framework for the features of nodes, including local features and global features. Based on such a framework, an efficient and effective network reconciliation algorithm is proposed in Section 4. Section 5 evaluates our algorithm. Section 6 revisits the related work and Section 7 concludes this paper.
2. Model and Problem Formulation
In this section, we first describe the formal network model in Section 2.1, and then present the initial seed selection strategy in Section 2.2 and the problem description in Section 2.3. Before all of these, we first summarize the notations used throughout this paper in Table 1.
2.1. Network Modeling
We are given two social networks $G_1 = (V_1, E_1)$ and $G_2 = (V_2, E_2)$, where the set of vertices $V_1$ ($V_2$) represents users and the set of edges $E_1$ ($E_2$) represents links between users in $G_1$ ($G_2$). The social networks $G_1$ and $G_2$ are partially overlapping, that is, $V_1 \cap V_2 \neq \emptyset$ and $E_1 \cap E_2 \neq \emptyset$. There exist a large number of user pairs $(u, v) \in V_1 \times V_2$ referring to the same person in the underlying social network. Without loss of generality, we focus on the cases where $V_1 \approx V_2$, that is, the two user sets are similar to a relatively great extent. The difference between these two networks is called noise. In general, not all users have a counterpart in the other social network, and the group of friends that a user has in one social network may differ to some extent from his/her group of friends in other social networks.
In a social network, edges correspond to social relationships. Each user may have different tie strengths with his friends, which can be denoted by the weights on the edges. Fig. 1 presents two such weighted graphs. The numbers over the edges represent the tie strengths between two users in the social network. To measure the tie strength $W(uu')$ of the edge $e_{uu'}$, the common method is to quantify the degree of interaction between the neighbors of $u$ and $u'$. Typically, it can be computed by
$$W(uu') = \frac{|Ngh(u) \cap Ngh(u')|}{|Ngh(u) \cup Ngh(u')|},$$
where $Ngh(u)$ and $Ngh(u')$ represent the sets of neighbors of $u$ and $u'$, respectively.
2.2. Initial Seed Selection
In reality, people post links on Facebook referring to their Google+ or Twitter accounts, and some users register with the same email address or the same phone number. This observation indicates that a small fraction of users explicitly link their accounts across two social networks. We consider such pre-matched user pairs as the initial seed set $I$.
Most techniques in this area start with a seed set of node pairs and propagate the reconciliation process by matching new node pairs. Notice that these seed-based algorithms are sensitive to the seed set size: if the size is too small, the reconciliation does not even start. Our work assumes a small subset of nodes across two graphs as the initial seed set (simply "seeds" for short). The two-stage algorithm can operate with few seeds and incrementally identify node pairs.
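The tie strength of Section 2.1 is a Jaccard overlap of neighbor sets, which can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the adjacency dictionary `adj`, the function name `tie_strength` and the toy graph are hypothetical, and neighbor sets are taken as open neighborhoods (a node is not its own neighbor), which the text does not specify.

```python
def tie_strength(adj, u, u_prime):
    """Jaccard overlap |Ngh(u) & Ngh(u')| / |Ngh(u) | Ngh(u')| of two neighbor sets."""
    ngh_u, ngh_u_prime = adj[u], adj[u_prime]
    union = ngh_u | ngh_u_prime
    if not union:
        return 0.0  # two isolated nodes: define the tie strength as 0
    return len(ngh_u & ngh_u_prime) / len(union)


# Hypothetical toy graph (NOT the graph of Fig. 1): node -> set of neighbors.
adj = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"a", "c"},
}

w_ab = tie_strength(adj, "a", "b")  # shared: {c}; union: {a, b, c, d}
w_ac = tie_strength(adj, "a", "c")  # shared: {b, d}; union: {a, b, c, d}
```

Under the open-neighborhood assumption, two adjacent nodes with no common friend get a tie strength of 0; a closed-neighborhood variant would avoid that, but the source does not say which one is used.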
Table 1: Notations

Notation          Description
$G_1$, $G_2$      Two social networks.
$|V_1|$, $|V_2|$  The number of users in a social network.
$u$, $u'$         The nodes in $G_1$.
$v$, $v'$         The nodes in $G_2$.
$e_{uu'}$         The edge between the nodes $u$ and $u'$ in $G_1$.
$W(uu')$          The weight of the edge $e_{uu'}$.
$Ngh(u)$          The set of neighbors of $u$.
$I$               The initial seed set.
Figure 1: The Problem of Social Network Reconciliation. In this example, we assume that the optimal mapping solution is {A1 → A2 , B1 → B2 , C1 → C2 , D1 → D2 , E1 → E2 , F1 → F2 }. The goal of our algorithm is to find such a solution.
2.3. Problem Description
We now formally define the reconciliation problem. Given two graphs $G_1$ and $G_2$, the network reconciliation problem is defined as finding a one-to-one node matching $M: V_1 \to V_2$ that maximizes the number of correct matching pairs between nodes of $G_1$ and $G_2$. The problem we consider is identifying $M$ given only the inputs $G_1$, $G_2$ and the initial seed set. As shown in Fig. 1, we assume that the optimal mapping solution is $\{A_1 \to A_2, B_1 \to B_2, C_1 \to C_2, D_1 \to D_2, E_1 \to E_2, F_1 \to F_2\}$.
3. A Unified Framework for Local Features and Global Features
In this section, we first present a motivational example in Section 3.1. This example motivates us to consider the local features and global features, which are presented in Sections 3.2 and 3.3, respectively. After that, we propose a unified framework for incorporating these two kinds of features in Section 3.4.
3.1. Motivational Example
Recall the example shown in Fig. 1. We observe that node $E_1$ and node $E_2$ play similar roles in the graphs $G_1$ and $G_2$, shown in Fig. 1 (a) and Fig. 1 (b), respectively. If we match node $E_1$ to node $E_2$, this builds a significant bridge between both sides and accelerates the matching process, since node $E_1$ ($E_2$) is the common neighbor of three other nodes: $C_1$ ($C_2$), $D_1$ ($D_2$) and $F_1$ ($F_2$). This example demonstrates that the global feature plays an important role in discovering crucial nodes and thereby greatly promotes the network reconciliation process.
Based on this observation, we design a unified framework for the similarities of nodes' features. We classify the features of nodes into two categories: local features and global features. Local features represent a node's own features, such as its neighbors, location and other factors in the profile. Global features capture the role of each node in the global network structure, such as closeness centrality, betweenness centrality and eigenvector centrality.
3.2. Local Features
A node's local features represent its own features or its role in its surroundings. In this paper, we mainly consider common mapped neighbors and link strength as the typical local features.

Definition 1. Common Mapped Neighbors: A node pair $(u', v')$ ($u' \in G_1$, $v' \in G_2$) is said to be a common mapped neighbor of $(u, v)$ ($u \in G_1$, $v \in G_2$) if and only if: i) $u' \in Ngh(u)$, ii) $v' \in Ngh(v)$ and iii) $u'$ has been mapped to $v'$. Here $Ngh(u)$ denotes the set of neighbors of the node $u$.

If a node $u \in V_1$ has the same or a similar set of common mapped neighbors as $v \in V_2$, we say that $u$ and $v$ are a potential matching pair.

Definition 2. Link Strength: For a node $u$, its link strength, denoted by $L(u)$, is defined as the sum of the tie strengths of all the links in which $u$ is involved. That is, $L(u) = \sum_{u' \in Ngh(u)} W(e_{uu'})$.

If a node $u \in V_1$ has the same or a similar link strength as $v \in V_2$, we also say that $u$ and $v$ are a potential matching pair. Based on these two definitions, we use the following equation to compute the similarity of two nodes in terms of their local features:
$$Sim_l(u, v) = \frac{|N(u) \cap N(v)|}{|N(u) \cup N(v)|} \cdot \frac{\min(L(u), L(v))}{\max(L(u), L(v))}, \tag{1}$$
where $N(u)$ and $N(v)$ represent the sets of identified neighbors of $u$ and $v$, respectively; $L(u)$ and $L(v)$ represent the link strengths of $u$ and $v$, respectively.
3.3. Global Features
For each node, global features represent its role in the whole graph. They may include closeness centrality, betweenness centrality, etc. In this paper, we apply the most important measurement, i.e., the eigenvector centrality of graphs. A typical way to evaluate how each node ranks in a global network is to use PageRank [18], a well-known technique used by Google for website ranking. The traditional PageRank algorithm does not consider the weights on the edges. However, Fig. 1 shows that such weights also help to evaluate the roles of nodes in the whole graph structure. Thus, we propose to personalize PageRank in our context, and define the GlobalRank of each node as follows:
Definition 3. GlobalRank: Assume there is a random walker walking on the graph $G$. When he is located at a node $u$, he has two choices: forward to a neighbor node with a probability of $p^F_u$, or jump to a random node with a probability of $p^J_u$ ($p^F_u + p^J_u = 1$). If he chooses to forward, he forwards to a neighbor node $v$ with a given probability $F_{uv}$; if he chooses to jump, he jumps to any node $w \in G$ with another given probability $J_{uw}$. For each node $u$, its GlobalRank, denoted by $R(u)$, is defined as the probability of this random walker staying at this node after the final steady state has been achieved. In particular, we compute the forward probability $F_{uv}$ as
$$F_{uv} = \frac{L(v)}{\sum_{k \in Ngh(u)} L(k)}$$
and the jumping probability $J_{uw}$ as
$$J_{uw} = \frac{L(w)}{\sum_{k \in G} L(k)}.$$

For example, assume that the random walker is located at $E_1$ in Fig. 1 (a). He may follow the edges to neighbor nodes (e.g., $B_1$, $D_1$) with a probability of $p^F_{E_1} = 0.85$ or jump to any node with a probability of $p^J_{E_1} = 0.15$. After the random walker determines to follow the edges from $E_1$ with a probability of $p^F_{E_1} = 0.85$, he follows the edge to $D_1$ with a probability of $F_{E_1 D_1} = \frac{0.4}{0.4+0.4+0.4+0.33} = 0.261$. Then, the overall probability of following an edge from $E_1$ to $D_1$ can be computed by $p^F_{E_1} \cdot F_{E_1 D_1} = 0.85 \times 0.261 = 0.222$. After the random walker determines to jump from $E_1$ with a probability of $p^J_{E_1} = 0.15$, he jumps to the node $F_1$ with a probability of $J_{E_1 F_1} = \frac{L(F_1)}{L(A_1)+L(B_1)+L(C_1)+L(D_1)+L(E_1)+L(F_1)}$, where $L(F_1)$ denotes the sum of the weights of all links of $F_1$. That is, $J_{E_1 F_1} = \frac{0.25}{(0.67+0.4)+(0.4+0.5)+(0.5+0.25+0.33)+(0.67+0.4)+(0.4+0.4+0.4+0.33)+(0.25)} = \frac{0.25}{5.9} = 0.04$. Then, the overall probability of jumping from $E_1$ to $F_1$ can be computed by $p^J_{E_1} \cdot J_{E_1 F_1} = 0.15 \times 0.04 = 0.006$.

The benefits of this personalized ranking scheme are two-fold: i) it takes the tie strength into consideration for better evaluating one's role; ii) it leverages link strength to better characterize one's role among the surrounding nodes, since link strength is an integrated measurement of the number of one's neighbors and their interaction.

For any node $v \in V$, let the node rank at iteration $t + 1$ be
$$R^{(t+1)}(v) = \sum_{u \in V} J_{uv} \cdot p^J_u \cdot R^{(t)}(u) + \sum_{u \in Ngh(v)} F_{uv} \cdot p^F_u \cdot R^{(t)}(u),$$
where $p^J_u$ and $p^F_u$ are bias factors. For $t = 0, 1, \cdots$, we have
$$R^{(t+1)} = T \cdot R^{(t)}, \tag{2}$$
where $T$ is a one-step transition matrix of the Markov chain defined by
$$T = \begin{pmatrix} J_{11} & J_{12} & \cdots & J_{1n} \\ J_{21} & J_{22} & \cdots & J_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ J_{n1} & J_{n2} & \cdots & J_{nn} \end{pmatrix}^{\top} \cdot \begin{pmatrix} p^J_1 & 0 & \cdots & 0 \\ 0 & p^J_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & p^J_n \end{pmatrix} + \begin{pmatrix} 0 & F_{12} & \cdots & F_{1n} \\ F_{21} & 0 & \cdots & F_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ F_{n1} & F_{n2} & \cdots & 0 \end{pmatrix}^{\top} \cdot \begin{pmatrix} p^F_1 & 0 & \cdots & 0 \\ 0 & p^F_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & p^F_n \end{pmatrix},$$
so that the entry of $T$ in row $v$ and column $u$ is $J_{uv} \cdot p^J_u + F_{uv} \cdot p^F_u$, in agreement with the recurrence above.
After personalizing the transition probability in this context, we use the classical iteration process for computing GlobalRank. The algorithm is shown in Algorithm 1. Note that the iterative method has been shown to yield any desired precision $\epsilon$ with a number of iterations proportional to $\max\{1, -\log \epsilon\}$ [5].

Algorithm 1: The Personalized GlobalRank Computing Method
  Given a positive value $\epsilon$, $i \leftarrow 0$
  repeat
    $R^{(i+1)} \leftarrow T \cdot R^{(i)}$
    $\delta \leftarrow \| R^{(i+1)} - R^{(i)} \|$
    $i \leftarrow i + 1$
  until $\delta < \epsilon$
Instead of randomly initializing the vector $R^{(0)}$, we initialize each dimension with preference to speed up convergence. We initialize $R^{(0)}$ as $R^{(0)}_u = \frac{L(u)}{\sum_{k \in G} L(k)}$. For example, for the node $C_1$ in Fig. 1 (a), $R^{(0)}_{C_1} = \frac{L(C_1)}{L(A_1)+L(B_1)+L(C_1)+L(D_1)+L(E_1)+L(F_1)} = 0.183$. Then the vector $R^{(0)}$ is (0.181, 0.153, 0.183, 0.181, 0.260, 0.042), whose entries sum to 1. Note that $T$ is stable since it is a stochastic matrix having a maximum eigenvalue equal to one. This guarantees that the above recurrence relation converges to $R^{(*)} = (R^{(*)}_1, R^{(*)}_2, \cdots, R^{(*)}_n)^T$, the steady state distribution [21]. This can be computed using a classic iterative scheme [18], given by Algorithm 1.
If a node $u \in V_1$ has a similar ranking to $v \in V_2$, they have similar importance for the whole networks $G_1$ and $G_2$, respectively, and we say that $u$ and $v$ are a potential matching pair. Therefore, we use the following equation to compute the similarity of two nodes in terms of global features:
$$Sim_g(u, v) = \frac{\min(R(u), R(v))}{\max(R(u), R(v))}, \tag{3}$$
where $R(u)$ and $R(v)$ represent the normalized GlobalRank values of $u$ and $v$, respectively.
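The iteration of Algorithm 1 with the preference-based initial vector can be sketched in Python as follows. This is a minimal sketch under several assumptions, not the authors' implementation: the edge list `EDGES` for $G_1$ is inferred from the per-node sums quoted in the worked example (Fig. 1 itself is not reproduced here), the forward probability follows the formula stated in Definition 3, the same bias factor $p^F_u = 0.85$ is used for every node, and the L1 norm is used for $\delta$. All function names are hypothetical.

```python
# ASSUMPTION: edge list of G1 inferred from the sums quoted in the worked example.
EDGES = {
    ("A", "D"): 0.67, ("A", "E"): 0.40, ("B", "E"): 0.40, ("B", "C"): 0.50,
    ("C", "F"): 0.25, ("C", "E"): 0.33, ("D", "E"): 0.40,
}


def link_strength(edges):
    """L(u): sum of the tie strengths of all links that involve u (Definition 2)."""
    strength = {}
    for (u, v), w in edges.items():
        strength[u] = strength.get(u, 0.0) + w
        strength[v] = strength.get(v, 0.0) + w
    return strength


def initial_rank(strength):
    """Preference-based R^(0): each entry is L(u) divided by the total link strength."""
    total = sum(strength.values())
    return {u: s / total for u, s in strength.items()}


def global_rank(edges, p_forward=0.85, eps=1e-10, max_iter=1000):
    """Iterate R <- T * R until the L1 change drops below eps (Algorithm 1)."""
    strength = link_strength(edges)
    nodes = sorted(strength)
    ngh = {u: set() for u in nodes}
    for u, v in edges:
        ngh[u].add(v)
        ngh[v].add(u)
    total = sum(strength.values())
    # Denominator of F_uv: total link strength of the neighbors of u.
    ngh_total = {u: sum(strength[k] for k in ngh[u]) for u in nodes}
    rank = initial_rank(strength)
    for _ in range(max_iter):
        # J_uv does not depend on u, so the jump term factorizes.
        jump_mass = (1.0 - p_forward) * sum(rank.values())
        new_rank = {}
        for v in nodes:
            forward = sum(
                p_forward * strength[v] / ngh_total[u] * rank[u] for u in ngh[v]
            )
            new_rank[v] = jump_mass * strength[v] / total + forward
        delta = sum(abs(new_rank[u] - rank[u]) for u in nodes)
        rank = new_rank
        if delta < eps:
            break
    return rank


r0 = initial_rank(link_strength(EDGES))
ranks = global_rank(EDGES)
```

On the inferred graph, `r0` agrees with the initial vector quoted in the text to within rounding, and the iteration keeps the entries summing to 1 because every column of $T$ sums to $p^J_u + p^F_u = 1$.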
3.4. A Unified Framework
We propose a unified framework called UniRank, which combines the similarity of local features and global features, as shown in the following equation:
$$Sim_{uni}(u, v) = \alpha \cdot Sim_l(u, v) + (1 - \alpha) \cdot Sim_g(u, v), \tag{4}$$
where $\alpha$ makes a trade-off between the local features and global features. According to Fig. 2, nodes with higher GlobalRank values are more distinguishable than those with lower values. Therefore, the value of $\alpha$ should change adaptively and dynamically according to a node's ranking by GlobalRank value in descending order. Specifically, we set the value of $\alpha$ to $\ln(\frac{rk}{|V|} + c)$, where $rk$ denotes the ranking of this node among all $|V|$ nodes, and $c$ is a constant in the interval $(1, e - 1)$. The higher a node is ranked (i.e., the smaller $rk$ is), the smaller $\alpha$ gets and the larger the share of the global feature in Equation 4, which is in accord with our analysis. The benefit of this strategy is that it explores different weights of the local feature and global feature in different reconciliation processes and thus adapts dynamically, which may help improve the performance of the solutions proposed below.
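Equations 1, 3 and 4, together with the rank-dependent weight $\alpha$, can be sketched as below. The sketch rests on assumptions that the text leaves open: the identified-neighbor sets $N(u)$ and $N(v)$ are assumed to be expressed in a common label space (for example, $N(u)$ translated through the current mapping), $rk$ is taken to be the rank of the node being matched, and $c = 1.5$ is an arbitrary value inside $(1, e - 1)$. The function names are hypothetical.

```python
import math


def sim_local(ident_u, ident_v, link_u, link_v):
    """Equation 1: Jaccard overlap of identified neighbors times link-strength ratio.

    ASSUMPTION: ident_u and ident_v are sets over the same labels, e.g. the
    identified neighbors of u already translated through the current mapping.
    """
    union = ident_u | ident_v
    strongest = max(link_u, link_v)
    if not union or strongest == 0:
        return 0.0
    return len(ident_u & ident_v) / len(union) * min(link_u, link_v) / strongest


def sim_global(rank_u, rank_v):
    """Equation 3: ratio of the smaller to the larger normalized GlobalRank."""
    largest = max(rank_u, rank_v)
    return min(rank_u, rank_v) / largest if largest > 0 else 0.0


def alpha(rk, n_nodes, c=1.5):
    """Trade-off weight ln(rk / |V| + c); rk = 1 is the top-ranked node."""
    return math.log(rk / n_nodes + c)


def sim_uni(local_score, global_score, rk, n_nodes, c=1.5):
    """Equation 4: weighted combination of the local and global similarities."""
    a = alpha(rk, n_nodes, c)
    return a * local_score + (1.0 - a) * global_score


local_score = sim_local({1, 2, 3}, {2, 3, 4}, link_u=1.0, link_v=2.0)
global_score = sim_global(0.2, 0.1)
top_score = sim_uni(local_score, global_score, rk=1, n_nodes=100)
```

With $c \in (1, e - 1)$ and $rk \le |V|$, the weight stays strictly between 0 and 1 and grows with $rk$, so top-ranked nodes lean on the global similarity and low-ranked nodes on the local one.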
4. Proposed Solutions
4.1. Outline
Before designing the network reconciliation algorithm, we conducted preliminary experiments on the global feature using a publicly available Facebook dataset [23], which includes 63731 users and 817090 edges. On this dataset, we compute the normalized GlobalRank value for each node and plot its probability distribution in Fig. 2. From this figure, we observe that less than 1% of the nodes have a high GlobalRank value ($1 \times 10^{-4}$), and most of the remaining nodes have a very small GlobalRank value. This observation motivates us to group the nodes into two categories and to design a two-stage algorithm based on the initial seed set. In the first stage (nodes with high GlobalRank values), our approach aims to explore a larger set of seeds by relying more on the global feature (a smaller value of $\alpha$ in Equation 4), as described in Section 4.2. In Fig. 1, for example, we can identify the mappings $E_1 \to E_2$ and $C_1 \to C_2$ first. In the second stage (nodes with low GlobalRank values), we rely more on the local feature (a larger value of $\alpha$ in Equation 4) to design a parallel seed expansion algorithm based on a breadth-first strategy, as described in Section 4.3. In Fig. 1, we can then identify the remaining mappings, i.e., $A_1 \to A_2$, $B_1 \to B_2$, $D_1 \to D_2$, $F_1 \to F_2$. For a given node $u$, the independent expansion processes started from different explored seeds may produce different matching solutions. To handle this issue, we also design a simple yet efficient conflict resolution step (Section 4.4).
Figure 2: The probability distribution over GlobalRank value. [Plot omitted. x-axis: the GlobalRank value (scale $\times 10^{-4}$); y-axis: the probability distribution; series: (Probability, GlobalRank) data points and the result of curve fitting.]
4.2. Initial Seed Exploration (Based on Global Matching)
Since nodes with high GlobalRank values are instantly recognizable, we design a seed exploration algorithm that relies more on the global feature (shown in Algorithm 2). Given the initial seed set $I$, the algorithm aims to iteratively discover a larger explored seed set $S$ until its size reaches a specific quantity $NS$ (Step 1-2). We first sort all the nodes in $G_1$ in descending order of GlobalRank value and push them into a list $L_1$ (Step 3). We do the same for the nodes in $G_2$ and denote this list as $L_2$ (Step 4). We visit each unmapped node $u$ in $L_1$ in the order of its rank $rk$ (Step 5-6), and push all the unmapped nodes $v \in L_2$ whose ranks are between 1 and $2rk$ (Step 3-5) into a candidate list $CL$ for $u$ (Step 7). Then, we compute the similarity score of $u$ with each node $v \in CL$ according to Equation 4 (Step 8-9). We choose the highest similarity score and record the corresponding node in $L_2$ as $v^*$ (Step 10). To enhance the precision, we also match in the reverse direction, looking for the node in $L_1$ that achieves the highest similarity score with $v^*$ (Step 11). Only if the two nodes $u$ and $v^*$ are matched in both directions do we add the pair $(u, v^*)$ to the explored seed set $S$ and label $u$ with a success probability of $Pr(u) = Sim_{uni}(u, v^*)$ (Step 12-13).

Algorithm 2: Seed Exploration Algorithm
Input: $G_1(V_1, E_1)$, $G_2(V_2, E_2)$, an initial seed set $I$, the desired number of explored seeds $NS$, the node lists $L_1$ and $L_2$ for $G_1$ and $G_2$, respectively.
Output: A larger explored seed set $S$ of identified users.
 1  $S = I$.
 2  while $|S| < NS$ do
 3      Sort all nodes in $G_1$ in descending order of GlobalRank into a list $L_1$.
 4      Sort all nodes in $G_2$ in descending order of GlobalRank into a list $L_2$.
 5      for each unmapped node $u$ in $L_1$ do
 6          Get the rank $rk$ of $u$.
 7          Push all the unmapped nodes $\{v \mid v \in L_2, rank(v) \in [1, 2rk]\}$ into a candidate list $CL$.
 8          for each node $v \in CL$ do
 9              Compute the similarity score $Sim_{uni}(u, v)$ of $(u, v)$, as shown in Equation 4.
10          Choose the highest similarity score and record the node in $L_2$ as $v^*$.
11          Reverse match $v^*$ with the same method.
12          if $v^*$ gets mapped back to $u$ then
13              Add the pair $(u, v^*)$ to $S$ and label it with a success probability of $Sim_{uni}(u, v^*)$.
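The control flow of Algorithm 2 can be sketched in Python as follows. It is a simplified illustration rather than the authors' code: the similarity is passed in as a hypothetical callable `sim(u, v, mapping)` standing in for Equation 4, the sorting is done once because the GlobalRank values do not change between rounds, initial seeds are given a success probability of 1, and a `progress` flag (an addition not in Algorithm 2) stops the loop when a full pass adds no pair.

```python
def explore_seeds(rank1, rank2, initial_seeds, target_size, sim):
    """Bi-directional global matching within a rank window of [1, 2 * rk]."""
    mapping = dict(initial_seeds)                  # node of G1 -> node of G2
    reverse = {v: u for u, v in mapping.items()}   # node of G2 -> node of G1
    prob = {u: 1.0 for u in mapping}               # ASSUMPTION: seeds are certain
    order1 = sorted(rank1, key=rank1.get, reverse=True)
    order2 = sorted(rank2, key=rank2.get, reverse=True)
    pos1 = {u: i + 1 for i, u in enumerate(order1)}
    pos2 = {v: i + 1 for i, v in enumerate(order2)}

    def best_match(x, pos_x, other_order, taken, score):
        # Candidates: unmapped nodes of the other graph ranked within [1, 2 * rk].
        window = [y for y in other_order[: 2 * pos_x[x]] if y not in taken]
        return max(window, key=score) if window else None

    progress = True
    while len(mapping) < target_size and progress:
        progress = False
        for u in order1:
            if u in mapping:
                continue
            v = best_match(u, pos1, order2, reverse, lambda y: sim(u, y, mapping))
            if v is None:
                continue
            # Reverse match: v must pick u back for the pair to be accepted.
            back = best_match(v, pos2, order1, mapping, lambda x: sim(x, v, mapping))
            if back == u:
                prob[u] = sim(u, v, mapping)
                mapping[u] = v
                reverse[v] = u
                progress = True
                if len(mapping) >= target_size:
                    break
    return mapping, prob


# Hypothetical toy input: GlobalRank values of two three-node graphs.
rank1 = {"a1": 0.50, "b1": 0.30, "c1": 0.20}
rank2 = {"a2": 0.48, "b2": 0.32, "c2": 0.20}


def rank_sim(u, v, mapping):
    """Stand-in for Equation 4 that uses the global similarity only."""
    low, high = sorted((rank1[u], rank2[v]))
    return low / high


seeds, seed_prob = explore_seeds(rank1, rank2, {}, 3, rank_sim)
```

A pair is only accepted when both directions agree, so a node whose best candidate prefers someone else is left for the expansion stage instead of being forced into a match.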
4.3. Parallel Seed Expansion (Based on Local Matching)
After global matching, we have explored a larger seed set $S$ through the quick start in the first stage. Next, we need to match the remaining nodes; using seeds to determine the candidate nodes greatly improves the time efficiency. To enhance the matching performance, we adopt a breadth-first strategy for parallel matching from $S$. Its core idea is as follows. Regarding each explored seed $s \in G_1$ produced in the prior stage as a root node and pushing it into a queue $q$ (Step 4-5), we design an efficient parallel and iterative seed expansion algorithm based on a breadth-first strategy (shown in Algorithm 3). Specifically, we pop each explored root node $r$ from $q$ (Step 6-7). Then we choose the unmapped node $u$ whose GlobalRank is the highest among the neighbors of $r$ (Step 8-9). Note that each node $u$ in this neighbor set should satisfy that the number of its identified neighbors, $|N(u)|$, exceeds a pre-defined threshold value. Then, for each node $u$, we use its identified neighbors to construct its candidate node list $CL$ in $G_2$ (Step 10-12). Next, for each node $v \in CL$, we compute the similarity score with $u$ according to Equation 4 (Step 13-14). We select the node with the highest similarity score and label $u$ with a success probability of $Pr(u) = Pr(r) \cdot Sim_{uni}(u, v)$, which will be used for conflict resolution in Section 4.4 (Step 15-16). At last, we push $u$ into $q$ and regard it as a root node in the next iteration. The advantage of this algorithm is that it leverages the local features (e.g., neighbors and link strength) of each node and accelerates the seed expansion process.
We take Fig. 1 as an example again. First, we match $E_1$ and $E_2$ because of their evident role in both graphs. Based on $E_1 \to E_2$, we match $C_1$ and $C_2$ by using global information again. Then we identify $B_1$ and $B_2$ via $C_1 \to C_2$ and $E_1 \to E_2$. Independently, we match $A_1$ and $A_2$ through $E_1 \to E_2$, and match $F_1$ and $F_2$ through $C_1 \to C_2$. At last, we map $D_1$ and $D_2$ through the just-matched $A_1 \to A_2$.
Algorithm 3: Seed Expansion Algorithm
Input: $G_1(V_1, E_1)$, $G_2(V_2, E_2)$, the seed set $S$ of identified users.
Output: A larger set $M$ of identified users.
 1  /*Initialization*/
 2  $M = S$.
 3  /*Parallel seed expansion based on $|S|$ seeds*/
 4  for each node $s$ to be expanded do
 5      Label $s$ as the root node $r$ and push it to a queue $q$.
 6      while the queue $q$ is not $\emptyset$ do
 7          Pop a node $r$ from the queue $q$.
 8          for each neighbor $u \in G_1$ of $r$ do
 9              Choose the unmapped node $u$ whose GlobalRank is the highest among all neighbors of $r$. Note that $|N(u)|$ exceeds the threshold value.
10              for each mapped neighbor $u'$ of $u$ do
11                  Find the corresponding node $v'$ in $G_2$.
12                  Push the unmapped neighbors $v$ of $v'$ into the candidate list $CL$.
13              for each node $v \in CL$ do
14                  Compute the similarity score $Sim_{uni}(u, v)$ of $(u, v)$, as shown in Equation 4.
15              if $(u, v)$ gets the highest similarity score then
16                  Add the pair $(u, v)$ to $M$ and label it with a success probability of $Pr(u) = Pr(r) \cdot Sim_{uni}(u, v)$.
17                  Push $u$ to the queue $q$.
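The breadth-first expansion of Algorithm 3 can be sketched in Python as below. This is a sequential, single-queue simplification and not the parallel procedure itself: all seeds share one queue, neighbors of a root are visited in descending GlobalRank order, and `sim(u, v, mapping)` is a hypothetical callable standing in for Equation 4. The `threshold` parameter and the toy path graphs are illustrative assumptions.

```python
from collections import deque


def expand_seeds(adj1, adj2, rank1, seeds, sim, threshold=0):
    """Grow the mapping outward from the seeds, one breadth-first layer at a time."""
    mapping = dict(seeds)                          # node of G1 -> node of G2
    reverse = {v: u for u, v in mapping.items()}
    prob = {u: 1.0 for u in mapping}               # ASSUMPTION: seeds are certain
    queue = deque(sorted(mapping, key=rank1.get, reverse=True))
    while queue:
        root = queue.popleft()
        # Visit unmapped neighbors of the root, highest GlobalRank first.
        for u in sorted(adj1[root], key=rank1.get, reverse=True):
            if u in mapping:
                continue
            mapped_nghs = [x for x in adj1[u] if x in mapping]
            if len(mapped_nghs) <= threshold:
                continue
            # Candidates: unmapped neighbors in G2 of the images of u's mapped neighbors.
            candidates = {
                v for x in mapped_nghs for v in adj2[mapping[x]] if v not in reverse
            }
            if not candidates:
                continue
            best = max(sorted(candidates), key=lambda v: sim(u, v, mapping))
            prob[u] = prob[root] * sim(u, best, mapping)
            mapping[u] = best
            reverse[best] = u
            queue.append(u)
    return mapping, prob


# Hypothetical toy input: two path graphs a-b-c with one seed pair.
adj1 = {"a1": {"b1"}, "b1": {"a1", "c1"}, "c1": {"b1"}}
adj2 = {"a2": {"b2"}, "b2": {"a2", "c2"}, "c2": {"b2"}}
rank1 = {"a1": 0.3, "b1": 0.5, "c1": 0.2}


def mapped_neighbor_sim(u, v, mapping):
    """Stand-in for Equation 4: Jaccard overlap of common mapped neighbors."""
    images_u = {mapping[x] for x in adj1[u] if x in mapping}
    mapped_v = {y for y in adj2[v] if y in mapping.values()}
    union = images_u | mapped_v
    return len(images_u & mapped_v) / len(union) if union else 0.0


matches, match_prob = expand_seeds(adj1, adj2, rank1, {"a1": "a2"}, mapped_neighbor_sim)
```

Because each success probability is the product of the scores along the path back to a seed, pairs found far from any seed carry lower confidence, which is the quantity the conflict resolution of Section 4.4 compares.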
4.4. Conflict Resolution
Since the expansion processes for different seeds are independent, a node $u$ in $V_1$ may be matched to a node $v$ during the expansion from one seed and to another node $v'$ during the expansion from another seed. To address this issue, we design a simple yet efficient conflict resolution step. Its core idea is to choose, among the multiple matching solutions, the one with the higher success probability when a conflict happens. A common situation is that an incorrect pair is matched and added to the seed set. Considering the lasting negative impact of such mismatches on subsequent matches, we allow an unmapped node to be matched to an already mapped node if the two achieve a higher success probability and match in both directions. Experiments show that this strategy is extremely useful in rough conditions, since nodes become more distinguishable as the matching process iterates.
4.5. Time Complexity Analysis
In the first stage, Algorithm 2 explores a larger set $S$, which takes time $O(|S|^2)$. In the second stage, the time complexity of Algorithm 3 is $O(|S| \cdot |D|^2)$, where $|S|$ denotes the number of nodes explored in the first stage and $|D|$ denotes the largest degree in $G_1$ and $G_2$. Thus, NR-GL is a polynomial-time algorithm with a time complexity of $O(|S|^2 + |S| \cdot |D|^2)$. Since the expansion processes for different seeds are independent, the algorithm can be parallelized to speed it up.
5. Performance Evaluation
In this section, we first present the evaluation goals in Section 5.1. We then describe our experimental setup in Section 5.2. Finally, we present our main evaluation results in Section 5.3.

5.1. Evaluation Goals

To evaluate the performance of our algorithm, we compare it with the state-of-the-art algorithm, i.e., the KL algorithm proposed by Korula and Lattanzi [11]. The performance metrics are precision, recall rate and F1 score. Given two graphs G1(V1, E1) and G2(V2, E2), the set of identified matches M_i, and the total set of true matches M_t, i.e., V1 ∩ V2, the precision is defined as $|M_i \cap M_t| / |M_i|$, where |M_i ∩ M_t| denotes the number of correct matches the algorithm achieves. The recall rate is defined as $|M_i \cap M_t| / |M_t|$. As an effective statistical analysis metric, the F1 score combines precision and recall rate into a trade-off measurement defined as follows:

$$F_1 = 2 \cdot \frac{\mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}. \qquad (5)$$

The F1 score ranges between 0 and 1. An F1 score close to 1 indicates that both precision and recall rate are high and that the algorithm performs well. In the evaluation, we focus on the following aspects:
1. To what extent does our algorithm outperform the state-of-the-art algorithm in terms of precision and recall?
2. How large is the performance improvement of GlobalRank in the global matching compared to other measurements (e.g., degree)?
3. What is the impact of different parameters of the NR-GL algorithm (e.g., α in Equation 4) on its performance?
4. What is the impact of different network parameters (such as the number of explored seeds, the scale of the network, and the noise between the two networks) on the performance of our algorithm?

[Figure 3 panels: (a) Precision comparison; (b) Recall rate comparison; (c) F1 score comparison. Each panel plots NR-GL and KL against the edge selection probability.]
Figure 3: Comparison between our algorithm and KL algorithm
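The precision, recall rate and F1 score defined in Section 5.1 can be computed directly from the sets M_i and M_t. The following Python sketch is only an illustration; the function name `evaluate` and the representation of matches as sets of (u, v) pairs are assumptions, not part of the original implementation.

```python
def evaluate(identified, truth):
    """Return (precision, recall, F1) for a set of identified pairs M_i
    against the set of true pairs M_t, following Equation 5."""
    correct = len(identified & truth)                  # |M_i ∩ M_t|
    precision = correct / len(identified) if identified else 0.0
    recall = correct / len(truth) if truth else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy data: two of the three identified pairs are correct, out of four true pairs.
identified = {(1, 1), (2, 2), (3, 4)}
truth = {(1, 1), (2, 2), (3, 3), (4, 4)}
precision, recall, f1 = evaluate(identified, truth)
print(precision, recall, f1)
```

The guards against empty sets are a design choice of this sketch; the paper does not discuss these edge cases.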
5.2. Experimental Setup

Datasets: Similar to previous work [11, 19], we evaluate the reconciliation algorithm with an independent edge selection model, which takes a fixed social network G and generates two correlated social networks G1 and G2 with edge selection probabilities p_e1 and p_e2. We assume that each social network G1 (G2) can be regarded as a subset of the underlying social network G and that all these networks have approximately the same set of nodes, i.e., |V1| ≈ |V2| ⊂ |V|. To construct the edges of G1 and G2 from the edge set E of the supergraph G, we design two basic rules. The first rule concerns the total number of edges in G1 and G2. Previous work makes the strong assumption that G1 and G2 have similar numbers of links. In reality, social networks may
differ in user activeness, resulting in distinct clustering coefficients and graph densities. Therefore, we remove this assumption and conduct more simulations in which the total numbers of edges differ. We assume that G1 and G2 are created by independently selecting each edge with average probabilities p_e1 and p_e2, respectively. That is, on average we delete (1 − p_e1) · |E| edges from G(V, E) to get G1(V1, E1) and (1 − p_e2) · |E| edges from G(V, E) to get G2(V2, E2). A user might keep a relationship in both networks or in just one of them. Previous work [19] has validated the feasibility of independent edge selection on different snapshots of an email network. The second rule concerns the edge weights of the weighted graph G. An edge e_uu' with a higher weight W(uu') indicates that users u and u' have a strong relationship, which suggests that they are more likely to connect as friends in another social network. Accordingly, such an edge is retained in G1 and G2 with a higher probability (proportional to W(uu')) during edge construction, which we consider reasonable and practical for real-world OSNs. Note that the edge selection model is only used to instantiate G1 and G2; the experiments are conducted on G1 and G2.

We evaluated our algorithm on two datasets. The first one is a publicly available Facebook dataset [23], which serves as the underlying social network G. This undirected social network has 63731 users and 817090 relationships, with an average degree of 25.64. Using the underlying graph G, we set the number of initial seeds to √|V|, chosen among the top 1% of nodes in terms of GlobalRank. Next, we generate two social networks G1 and G2 through edge sampling with two average selection probabilities, p_e1 and p_e2, respectively. As described above, the probability that edge e_uu' is retained in G1 or G2 is proportional to W(uu').
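The weighted edge selection described above can be sketched in a few lines of Python. This is a hedged illustration rather than the generator used in the paper: the function name `sample_network`, the dictionary of edge weights, and in particular the scaling p = p_avg · w / mean(w), capped at 1, are assumptions chosen so that the keep probability is proportional to the weight and averages roughly p_avg.

```python
import random


def sample_network(edges, weight, p_avg, rng):
    """Keep each edge independently with a probability proportional to its weight.

    The scaling p = p_avg * w / mean(w), capped at 1, is an assumption of this
    sketch; the paper only states that the probability is proportional to W.
    """
    mean_w = sum(weight[e] for e in edges) / len(edges)
    kept = []
    for e in edges:
        p = min(1.0, p_avg * weight[e] / mean_w)
        if rng.random() < p:
            kept.append(e)
    return kept


rng = random.Random(0)                         # fixed seed for reproducibility
edges = [(i, i + 1) for i in range(10000)]     # toy supergraph G (a long path)
weight = {e: 1.0 for e in edges}               # uniform toy weights
g1_edges = sample_network(edges, weight, 0.5, rng)
g2_edges = sample_network(edges, weight, 0.5, rng)
print(len(g1_edges) / len(edges), len(g2_edges) / len(edges))
```

Because G1 and G2 are sampled independently from the same supergraph, an edge may survive in both networks, in one of them, or in neither, which is the source of noise between the two networks.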
For the degree distribution, the power law exponent is 2.9, which is relatively large given that the exponent normally lies between 2.5 and 3.0. The second one is a synthetic random graph generated with the RMAT random model [6], which has 131072 users and 9712628 relationships. In this model, the edge weights are generated following a power law. We generate two social networks from it in the same way.

Parameter Settings: Unless otherwise specified, we use the edge selection model with probabilities p_e1 = p_e2 = 0.5, which means that 50% of the edges are deleted. The deleted edges change the node degrees and introduce noise. The generated G1 and G2 for the Facebook dataset contain 62812 and 62845 nodes of degree at least one, respectively. Similar to the classical PageRank algorithm, the parameter presented in the iterative scheme is set to 0.0001. We also conducted several preliminary experiments to check the convergence of the algorithm under different settings of this parameter. These experiments show that, as in the classical PageRank algorithm, a value of 0.0001 works best in terms of performance and running time. Since this is not the main contribution of our work, we do not plot these results in this paper. Besides, we set the bias factors p_Ju to 0.15 and p_Fu to 0.85 in Equation 2, and c in α to 1.7 in Equation 4. We set NS in Algorithm 2 to 2% of |V1|, which ensures the precision of the first stage and starts up the second stage. Moreover, we set the threshold value in Algorithm 3 to a relatively high value (e.g., 8) at first to ensure precision, and then lower it (e.g., to 2) as the iteration proceeds in order to match more nodes.

Compared Algorithm: Zafarani et al. [29] propose a novel method to identify usernames across multiple social networks by leveraging behavioral patterns and information redundancies.
In contrast, our method focuses on network structure information rather than on the nodes themselves (i.e., usernames). Their work is orthogonal to ours: their algorithm can find more seed nodes, which can serve as input to our algorithm and further improve its performance. Korula and Lattanzi [11] propose a heuristic algorithm, termed KL, based on network structure. Given a small fraction of already mapped users, the KL algorithm leverages these seeds to identify a very large fraction of the network. Therefore, in this paper, we compare our algorithm with the KL algorithm.

Experiment Environment: We implemented our algorithm in C++. The experiments were conducted on a server with 24 Intel Xeon CPUs, 16 GB of memory, a 1.1 TB disk, and CentOS release 6.4.

5.3. Experimental Results

5.3.1. Comparison to the State-of-the-art Algorithm

We first compared our algorithm with the state-of-the-art algorithm, KL, proposed in [11], in terms of precision, recall rate and F1 score. The corresponding results on the Facebook dataset are depicted in Fig. 3.
In this figure, we observe two phenomena. First, both algorithms achieve higher precision, recall rate and F1 score as the edge selection probability increases. This is expected because retaining more edges in G1 and G2 leads to a higher similarity between them. Second, our algorithm significantly outperforms the KL algorithm. For example, when p_e1 = p_e2 = 0.6, the precision, recall rate and F1 score of our algorithm are 0.755, 0.968 and 0.848, respectively, while those of KL are 0.366, 0.051 and 0.09, respectively. That is, our algorithm improves on the KL algorithm by up to 9X in terms of F1 score. The results on the Facebook dataset for different settings of p_e1 and p_e2 are shown in Table 2. To demonstrate the effectiveness of our algorithm in a wider range of scenarios, we also conducted simulations on the RMAT dataset, where the edge weights follow a power law. We report these results in Table 3. Tables 2 and 3 show that the NR-GL algorithm performs remarkably well across different edge selection probabilities. This is because the KL algorithm suffers considerably when given only a handful of seeds, while the NR-GL algorithm first explores more seeds and then expands the matching process from them. We also plot the recall-precision curves for these two tables in Fig. 4(a) and 4(b), respectively.
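As a quick arithmetic check of the values quoted for p_e1 = p_e2 = 0.6, Equation 5 can be evaluated by hand with the reported precision and recall. The superscripts are labels introduced only for this illustration, and the ratio uses the rounded F1 values.

```latex
\begin{align*}
F_1^{\text{NR-GL}} &= 2 \cdot \frac{0.755 \cdot 0.968}{0.755 + 0.968} \approx 0.848, \\
F_1^{\text{KL}}    &= 2 \cdot \frac{0.366 \cdot 0.051}{0.366 + 0.051} \approx 0.090, \\
\frac{F_1^{\text{NR-GL}}}{F_1^{\text{KL}}} &\approx \frac{0.848}{0.090} \approx 9.4 .
\end{align*}
```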
Table 2: Results for different edge selection probabilities

                        NR-GL                          KL
(p_e1, p_e2)    Recall  Precision  F1          Recall  Precision  F1
(0.6, 0.4)      0.526   0.815      0.639       0.301   0.059      0.099
(0.8, 0.4)      0.618   0.913      0.737       0.333   0.064      0.107
(0.8, 0.6)      0.837   0.985      0.905       0.446   0.235      0.308
Table 3: Results for different edge selection probabilities of NR-GL

(p_e1, p_e2)    Recall  Precision  F1
(0.3, 0.3)      0.852   0.973      0.908
(0.4, 0.4)      0.992   0.999      0.995
(0.5, 0.5)      0.999   1          0.999
(0.3, 0.5)      0.885   0.998      0.938
(0.4, 0.6)      0.995   1          0.997
5.3.2. Comparison to Other Measurement

To evaluate the performance of GlobalRank, we compared the GlobalRank-based ranking with the degree-based ranking on the Facebook graph. For each correct pair (u, v), we compute the difference between the rank of u in G1 and the rank of v in G2 when nodes are sorted by GlobalRank value. We do the same for the degree-based ranking and plot the corresponding results over the top 800 nodes in Fig. 5. The difference reflects the stability of each metric under various conditions. We find that the GlobalRank-based ranking is more stable than the degree-based ranking, which indicates that the global feature helps discover crucial nodes and promotes the overall reconciliation process in the early stage.
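The rank-difference measurement described above can be sketched as follows. The helper names `rank_positions` and `rank_differences`, the tie-breaking by node identifier, and the toy scores are assumptions made for illustration; any node score (GlobalRank or degree) can be passed in.

```python
def rank_positions(scores):
    """Map each node to its 1-based position when sorted by descending score.
    Ties are broken by node identifier (an assumption of this sketch)."""
    order = sorted(scores, key=lambda n: (-scores[n], n))
    return {node: i + 1 for i, node in enumerate(order)}


def rank_differences(scores1, scores2, true_pairs):
    """Absolute rank difference between u in G1 and v in G2 for each correct pair."""
    r1 = rank_positions(scores1)
    r2 = rank_positions(scores2)
    return [abs(r1[u] - r2[v]) for u, v in true_pairs]


# Toy scores for three nodes in each network.
scores_g1 = {"a": 3.0, "b": 2.0, "c": 1.0}
scores_g2 = {"x": 1.0, "y": 3.0, "z": 2.0}
pairs = [("a", "y"), ("b", "x"), ("c", "z")]
diffs = rank_differences(scores_g1, scores_g2, pairs)
print(diffs)
```

A metric whose differences stay small for the correct pairs is more stable across the two sampled networks, which is the property the comparison in Fig. 5 examines.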
[Figure 4 panels: (a) Results for different edge selection probabilities (NR-GL and KL); (b) Results for different edge selection probabilities of NR-GL. Axes: Recall Rate and Precision.]
Figure 4: Comparison results for different edge selection probabilities. Note that the value pair (e.g., (0.6, 0.4)) near each point represents the setting of (p_e1, p_e2).
[Figure 5 plot: Rank Difference versus The Number of Top Nodes (100 to 800); curves: GlobalRank and Degree.]
Figure 5: The rank difference of GlobalRank vs. degree between G1 and G2 (p_e1 = 0.5, p_e2 = 0.75)
5.3.3. The Impact of the value of α

To evaluate the impact of the value of α in Equation 4, we varied α from 0.1 to 0.9 in steps of 0.1. We conducted the experiment on the Facebook graph with p_e1 = 0.5 and p_e2 = 0.5, aiming to find 800 explored seeds among the top 1200 nodes ranked by GlobalRank. We plot the corresponding results in Fig. 6(a). As Fig. 6(a) shows, among the fixed values of α, the precision is highest at α = 0.8. More generally, global information should be given more importance at the early stage, and its share should decrease as the rank decreases. For this reason, recall that we design α as a function in Section 3.4 to vary its proportion in the similarity calculation. This function assigns different weights to the local and global features at different stages of the reconciliation process and thus adapts dynamically. The dynamic α outperforms all the static cases, which demonstrates its usefulness for achieving more accurate results.

5.3.4. The Impact of Number of Explored Seeds on the Performance

Seeds are crucial to our algorithm because they are used throughout. In this experiment on the Facebook dataset, we study how the algorithm behaves with different numbers of explored seeds in the two stages. We varied the number of explored seeds from 400 to 1600 in steps of 400. Fig. 6(b) shows the precision in the first stage for different numbers of explored seeds when the set of initial seeds is fixed. Fig. 6(c) displays the recall rate in the second stage. As shown in Fig. 6(b), the precision decreases gradually as the number of seeds increases. This drop occurs because fewer distinct nodes remain available for matching among the top nodes. Meanwhile, Fig. 6(c) shows that the expansion speed grows with the number of explored seeds, owing to the reliance on seeds. To balance these two effects, we suggest choosing the number of explored seed nodes according to the GlobalRank distribution of all nodes, as shown in Fig. 2.

[Figure 6 panels: (a) The impact of α on precision between G1 and G2 (static and dynamic α); (b) Precision over different size of explored seeds; (c) Recall over different size of explored seeds.]
Figure 6: The impact of α, the precision in the first stage and the recall in the second stage over explored seeds
5.3.5. The Comparison of Running Time between NR-GL Algorithm and KL Algorithm

Finally, we report the running times of the two compared algorithms in Fig. 7. From this figure, we observe that our algorithm takes much less time than the KL algorithm. The reason is that our algorithm adopts an efficient seed-based candidate selection strategy in the second stage, which shrinks the number of candidate nodes, as shown in Algorithm 3.
6. Related Work
Identifying the same user across different social network websites has attracted significant attention from researchers. Recent work on this problem falls into the following two categories. The first category leverages the information of the node itself to identify the same user. For example, Novak et al. [17] proposed to identify users across different chat groups, bulletin boards or web sessions on the Internet. The core idea of this work is to capture the similar writing style (stylistic features) of the same user and to use machine learning techniques to identify users. Based on observations of user behavior on the web, Zafarani and Liu [28] found that people tend to use similar usernames across different communities. By adding or removing common prefixes or suffixes, their username elicitation method achieves an average accuracy of around 66%.
Table 4: Main solutions of prior studies

Related work    Main solution
[29]    Their paper proposes a string-matching method to identify corresponding usernames across social media sites. By exploiting information redundancies from users' behavioral patterns, their classification algorithm matches a majority of users. In contrast, our method focuses on network structure information rather than usernames.
[14]    This paper measures the similarity of user profiles (username, description, location, profile image, and number of connections) across social networks and then applies classifiers to disambiguate profiles belonging to the same user. The difference between our work and theirs is that we propose a similarity-based heuristic matching method to identify users.
[7]     In this paper, the authors apply bootstrap percolation and a graph slicing technique to de-anonymize scale-free social networks. The percolation graph matching algorithm is novel; however, a simple threshold on mapped common neighbors may go wrong when there is a clique of users. Our method applies a dynamic threshold and exploits more structural information to improve performance.
[10]    With a handful of seeds, this paper proposes an improved percolation graph matching algorithm that matches nodes with a small sacrifice of precision. However, their algorithm suffers from high time and space complexity in overcoming the cold start.
[11]    In this paper, the authors design a simple, local, and efficient algorithm to solve the reconciliation problem and give theoretical guarantees on the algorithm's performance. However, the algorithm gets stuck when given few seeds, while our algorithm can explore seed users first and then expand the matching process.
[15]    This paper proposes a generic re-identification algorithm to de-anonymize an anonymized network in two phases: seed identification and propagation. Correspondingly, we remodel the problem with tie strength and develop a more elaborate two-stage algorithm to identify users based on global and local features.
[16]    This paper proposes a community-enhanced de-anonymization algorithm that performs a two-stage mapping: first at the community level and then for the entire network. Their divide-and-conquer approach may introduce new difficulties because the communities detected on the two sides can be asymmetric. Our algorithm matches users from explored seeds in parallel, which ensures precision and time efficiency.
[31]    This paper proposes a novel energy-based model to address this problem by considering both local and global consistency among multiple networks. However, local and global consistency in their work have completely different meanings from those in our work. Our work focuses on better solving the reconciliation of two social networks, while their global consistency is targeted at the level of multiple networks.
[Figure 7 plot: running time in hours of NR-GL and KL versus edge selection probability.]
Figure 7: Time comparison

Table 5: Main characteristics in related work
[Table 5 layout: rows [29], [14], [7], [10], [11], [15], [16], [31], Our work; columns Profile, Network structure, Edge weight, Local feature, Global feature, Few seeds. The individual check marks are not legible in this copy.]
To further improve the performance of this naive method, several user-profile-based methods have been proposed [3, 14, 12]. Generally, profile information contains user identities (username, location, connection topology and description), which can be used for matching users across different social networks. These techniques suffer considerably if fake profiles with similar profile information are created. Nevertheless, these approaches can be combined with ours to enhance performance. A problem related to reconciliation, called percolation graph matching, has been proposed and studied in [7, 26] on scale-free social networks and ER random model based social networks, respectively. Their algorithm is novel; however, the analysis also has some limitations. First, they simply use the seed information and ignore the dynamic adaptation of thresholds, which is powerful in user identification. Second, their seed selection strategy is more complex, whereas ours is more targeted. Zafarani et al. [29] propose to find consistency in usernames across social media sites. They propose a novel string-matching method to identify corresponding usernames belonging to the same person. By exploiting information redundancies from users’ behavioral patterns (414 features), their classification algorithm matches a majority of users. Their paper provides a comprehensive, systematic method for the user identification problem. In contrast, our method focuses on network structure information rather than usernames.

The second category is based on network structure, i.e., inferring other pairs of nodes from some seeds known in advance. Korula and Lattanzi [11] leveraged local features of nodes (common neighbors) for the network reconciliation problem in social networks. However, this algorithm only works well under the random edge deletion model for generating G1 and G2 from G, which is not practical.
The reason is that, in real social networks, the probability of an edge depends on its tie strength. In such a scenario, an algorithm based on common neighbors produces more wrongly matched pairs and thus suffers from low precision. Within the second category, another line of work focuses on de-anonymizing social networks, which has been widely studied in the field of security and privacy [4, 19, 15, 10]. In [4], Backstrom et al. introduced both active and passive attacks to de-anonymize social networks for the first time. In active attacks, the adversary creates a small number of “sybil” users and links before the anonymized network is released. In contrast, passive attacks are carried out by individuals who want to breach privacy after the network has been released. To protect users’ privacy, the published data typically omits personally identifiable information, such as names and addresses. Such anonymization techniques may not be effective, since links reveal structural information about users. Attackers can use another auxiliary network to de-anonymize the anonymized network. In [19], Pedarsani and Grossglauser studied the problem using the ER random graph model, which is a good model for theoretical analysis but is not suitable for characterizing the scale-free property of social networks. Narayanan and Shmatikov [15] proposed to de-anonymize anonymized directed social networks. They divided the problem into two phases: seed identification and propagation. Kazemi et al. [10] proposed a novel algorithm that successfully identifies a majority of nodes from a handful of seeds. However, these two algorithms suffer from high time complexity and thus do not scale to large networks. Recently, Nilizadeh et al. [16] proposed a community-enhanced de-anonymization algorithm for online social networks. Their divide-and-conquer approach first applies community detection to this problem.
Subsequently, community-aware network alignment improves the user-level de-anonymization. We also apply the reconciliation algorithm to the explored seeds in parallel, which ensures both matching precision and time efficiency. Zhang et al. [31] extended the network reconciliation problem by considering both local and global consistency among multiple networks. However, local and global consistency in their work have completely different meanings from those in our work: their local matching deals with each pair of users independently, while their global consistency is defined at the level of multiple networks. We provide two tables: Table 5 lists the characteristics of these solutions, and Table 4 highlights the differences between the solution proposed in this work and existing works.

Our work belongs to the de-anonymization problem in social networks. Besides this problem, there are other active topics in security and privacy that have attracted a lot of attention [25, 9, 33, 13, 8, 27, 30, 32, 20, 22, 24]. Because of the page limit, we do not introduce them one by one.
7. Conclusion
In online social networks (OSNs), the reconciliation problem is an interesting and challenging problem that has attracted significant attention. However, prior studies have two limitations: i) they model the social network as an unweighted graph and ignore the tie strength between users; ii) they use only local features to identify users. In this paper, to address these two limitations, we first remodel the network reconciliation problem by considering users’ heterogeneous relationships. Then we propose a unified framework called UniRank that incorporates local and global features together. Based on UniRank, we design an effective and efficient network reconciliation algorithm, which consists of two stages. The first stage aims to match nodes with distinct global features. Starting from a few initial seeds, we propose a global matching algorithm that explores more seeds with high precision. Its core idea is that, for nodes with outstanding global features that are instantly recognizable, we prefer to match a node u in G1 to a node v in G2 with a similar global feature. The second stage aims to match the remaining unidentified nodes. In this stage, regarding each explored seed node as a root, we develop a breadth-first local matching algorithm that puts local features to greater use. Through iterative breadth-first matching, an unidentified node is matched to the corresponding node with the highest similarity score among all candidates. Moreover, a bi-directional match strategy is applied to enhance precision in tough conditions. Finally, we designed a series of comparative simulations to evaluate the algorithm under different levels of noise and seed numbers. Extensive simulations on real-world and synthetic social network datasets show that the algorithm significantly outperforms the state-of-the-art algorithm even under rough conditions.
In reality, a social network usually consists of several communities. In future work, we plan to study multi-community-based network reconciliation. The core idea is to decompose the entire network into several communities, perform community-level mapping first, and then perform node mapping, to further reduce the time complexity.

Acknowledgment
This work was supported in part by the following funding agencies of China: National Natural Science Foundation under Grants 61170274, 61602050 and U1534201, and the Fundamental Research Funds for the Central Universities (2015RC21).

References
[1] Facebook Company Info. 2015. URL: http://newsroom.fb.com/company-info/.
[2] Twitter usage. 2015. URL: https://about.twitter.com/company.
[3] Abel, F., Henze, N., Herder, E., Krause, D., 2010. Interweaving public user profiles on the web, in: User Modeling, Adaptation, and Personalization. Springer, pp. 16–27.
[4] Backstrom, L., Dwork, C., Kleinberg, J., 2007. Wherefore art thou r3579x?: anonymized social networks, hidden patterns, and structural steganography, in: Proceedings of the 16th international conference on World Wide Web, ACM. pp. 181–190.
[5] Bianchini, M., Gori, M., Scarselli, F., 2005. Inside pagerank. ACM Transactions on Internet Technology (TOIT) 5, 92–128.
[6] Chakrabarti, D., Zhan, Y., Faloutsos, C., 2004. R-mat: A recursive model for graph mining, in: SDM, SIAM. pp. 442–446.
[7] Chiasserini, C.F., Garetto, M., Leonardi, E., 2015. De-anonymizing scale-free social networks by percolation graph matching.
[8] Fu, Z., Ren, K., Shu, J., Sun, X., Huang, F. Enabling personalized search over encrypted outsourced data with efficiency improvement. IEEE Transactions on Parallel and Distributed Systems.
[9] Fu, Z., Wu, X., Guan, C., Sun, X., Ren, K., 2016. Toward efficient multi-keyword fuzzy search over encrypted outsourced data with accuracy improvement. IEEE Transactions on Information Forensics and Security 11, 2706–2716.
[10] Kazemi, E., Hassani, S.H., Grossglauser, M., 2015. Growing a graph matching from a handful of seeds, in: Proceedings of the VLDB Endowment International Conference on Very Large Data Bases.
[11] Korula, N., Lattanzi, S., 2014. An efficient reconciliation algorithm for social networks. Proceedings of the VLDB Endowment 7, 377–388.
[12] Labitzke, S., Taranu, I., Hartenstein, H., 2011. What your friends tell others about you: Low cost linkability of social network profiles, in: Proc. 5th International ACM Workshop on Social Network Mining and Analysis, San Diego, CA, USA.
[13] Li, J., Li, X., Yang, B., Sun, X., 2015. Segmentation-based image copy-move forgery detection scheme. IEEE Transactions on Information Forensics and Security 10, 507–518.
[14] Malhotra, A., Totti, L., Meira Jr, W., Kumaraguru, P., Almeida, V., 2012. Studying user footprints in different online social networks, in: Proceedings of the 2012 International Conference on Advances in Social Networks Analysis and Mining (ASONAM 2012), IEEE Computer Society. pp. 1065–1070.
[15] Narayanan, A., Shmatikov, V., 2009. De-anonymizing social networks, in: Security and Privacy, 2009 30th IEEE Symposium on, IEEE. pp. 173–187.
[16] Nilizadeh, S., Kapadia, A., Ahn, Y.Y., 2014. Community-enhanced de-anonymization of online social networks, in: Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security, ACM. pp. 537–548.
[17] Novak, J., Raghavan, P., Tomkins, A., 2004. Anti-aliasing on the web, in: Proceedings of the 13th international conference on World Wide Web, ACM. pp. 30–39.
[18] Page, L., Brin, S., Motwani, R., Winograd, T., 1998. The pagerank citation ranking: Bringing order to the web. Technical report, Stanford Digital Library Technologies Project.
[19] Pedarsani, P., Grossglauser, M., 2011. On the privacy of anonymized networks, in: Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining, ACM. pp. 1235–1243.
[20] Ren, Y., Shen, J., Wang, J., Han, J., Lee, S., 2015. Mutual verifiable provable data auditing in public cloud storage. Journal of Internet Technology 16, 318.
[21] Seneta, E., 2006. Non-negative matrices and Markov chains. Springer Verlag.
[22] Tinghuai, M., Jinjuan, Z., Meili, T., Yuan, T., Abdullah, A.D., Mznah, A.R., Sungyoung, L., 2015. Social network and tag sources based augmenting collaborative recommender system. IEICE Transactions on Information and Systems 98, 902–910.
[23] Viswanath, B., Mislove, A., Cha, M., Gummadi, K.P., 2009. On the evolution of user interaction in facebook, in: Proceedings of the 2nd ACM workshop on Online social networks, ACM. pp. 37–42.
[24] Xia, Z., Wang, X., Sun, X., Wang, B., 2014. Steganalysis of least significant bit matching using multi-order differences. Security and Communication Networks 7, 1283–1291.
[25] Xia, Z., Wang, X., Zhang, L., Qin, Z., Sun, X., Ren, K., 2016. A privacy-preserving and copy-deterrence content-based image retrieval scheme in cloud computing. IEEE Transactions on Information Forensics and Security 11, 2594–2608.
[26] Yartseva, L., Grossglauser, M., 2013. On the performance of percolation graph matching, in: Proceedings of the first ACM conference on Online social networks, ACM. pp. 119–130.
[27] Yuan, C., Sun, X., Lv, R., 2016. Fingerprint liveness detection based on multi-scale lpq and pca. China Communications 13, 60–65.
[28] Zafarani, R., Liu, H., 2009. Connecting corresponding identities across communities, in: ICWSM.
[29] Zafarani, R., Liu, H., 2013. Connecting users across social media sites: a behavioral-modeling approach, in: Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining, ACM. pp. 41–49.
[30] Zhang, Y., Sun, X., Wang, B., 2016. Efficient algorithm for k-barrier coverage based on integer linear programming. China Communications 13, 16–23.
[31] Zhang, Y., Tang, J., Yang, Z., Pei, J., Yu, P.S., 2015. Cosnet: Connecting heterogeneous social networks with local and global consistency, in: Proceedings of ACM SIGKDD, ACM.
[32] Zhangjie, F., Xingming, S., Qi, L., Lu, Z., Jiangang, S., 2015. Achieving efficient cloud search services: multi-keyword ranked search over encrypted cloud data supporting parallel computing. IEICE Transactions on Communications 98, 190–200.
[33] Zhou, Z., Wang, Y., Wu, Q., Yang, C.N., Sun, X., 2016. Effective and efficient global context verification for image copy detection. IEEE Transactions on Information Forensics and Security, 1–10.