Journal Pre-proof iBridge: Inferring bridge links that diffuse information across communities Ke-Jia Chen, Pei Zhang, Zinong Yang, Yun Li
PII: S0950-7051(19)30562-3
DOI: https://doi.org/10.1016/j.knosys.2019.105249
Reference: KNOSYS 105249
To appear in: Knowledge-Based Systems
Received date: 5 January 2019
Revised date: 31 October 2019
Accepted date: 20 November 2019

Please cite this article as: K.-J. Chen, P. Zhang, Z. Yang et al., iBridge: Inferring bridge links that diffuse information across communities, Knowledge-Based Systems (2019), doi: https://doi.org/10.1016/j.knosys.2019.105249.

This is a PDF file of an article that has undergone enhancements after acceptance, such as the addition of a cover page and metadata, and formatting for readability, but it is not yet the definitive version of record. This version will undergo additional copyediting, typesetting and review before it is published in its final form, but we are providing this version to give early visibility of the article. Please note that, during the production process, errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.
© 2019 Published by Elsevier B.V.
iBridge: Inferring Bridge Links that Diffuse Information Across Communities
Ke-Jia Chen, Pei Zhang, Zinong Yang, Yun Li
Jiangsu Key Laboratory of Big Data Security & Intelligent Processing Nanjing University of Posts and Telecommunications Nanjing, Jiangsu 210023, China
Abstract. While the accuracy of link prediction has improved continuously, the utility of the inferred links is rarely considered, especially with respect to information diffusion. This paper defines the utility of links based on the average shortest distance and, more importantly, defines a special type of link, the bridge link, based on the community structure (overlapping or not) of the network. In sociology, bridge links are usually regarded as weak ties and play a crucial role in information diffusion. Since previous link prediction methods are accurate in predicting strong ties but considerably less accurate in predicting weak ties, we propose a new link prediction method named iBridge, which aims to infer new bridge links using biased structural metrics in a PU (positive and unlabeled) learning framework. Experimental results on three real online social networks show that iBridge outperforms several comparative link prediction methods (based on supervised learning or PU learning) in inferring bridge links, while the overall performance in inferring both bridge links and non-bridge links is not compromised, which verifies its robustness in inferring all new links.
Keywords: bridge link prediction · information diffusion · weak ties · PU learning

1 INTRODUCTION
Many complex systems can be described as networks, where nodes represent individuals and links represent the interactions between them. Social networks, among the most intensively studied networks, play an important role in connecting people and diffusing various types of information. Link prediction [26], one of the most important tasks in social network analysis and mining, studies the formation of missing or new links based on the current and historical network, with wide application in recommendation [10], pre-warning systems [17], biomedical discovery [19], etc. Researchers have worked extensively on effective link prediction methods for different types of networks and application scenarios. Some simple heuristics such as common neighbors and the Katz index work well in practice and scale to large networks. Other early work also achieves good performance, including
those based on Markov chains [34], probabilistic graphical models [24, 38] and the influential method based on supervised learning [16]. More recent work predicts links in dynamic networks [1] and in heterogeneous networks [35].
Though the accuracy of link prediction continues to improve, the newly inferred links may be neither particularly novel nor significantly useful in expanding users' connections. For example, two users in a social network who share more common neighbors are more likely to establish a connection, but the new link predicted between them does little to help either of them obtain new information or connect with new friends. This phenomenon recalls the sociological theory of weak ties proposed by Granovetter [13] and the theory of structural holes proposed by Burt [7]. According to weak tie theory, there is a large amount of redundant information in the circle of strong ties, while the really useful information often comes from weak ties. Some researchers [23, 27] propose to measure weak ties using the weights of links. Other work [30] differentiates strong and weak ties based on the community division of the network, defining the links between communities as weak ties. However, there is still no commonly accepted definition of a weak tie. Moreover, there is no relevant research on how to accurately infer weak ties, which may have important applications in controlling public opinion or preventing the spread of infectious diseases and computer viruses.
This paper aims to provide a new definition of weak ties as bridge links, quantify the utility of bridge links and eventually propose an effective method to infer bridge links accurately. The main contributions of this paper are as follows:
– A new type of weak tie, the bridge link, is defined. Some previous work [41] calculates tie strength as the frequency of user interactions, but the tuning of a cutoff threshold then has a crucial impact on the correct identification of weak ties. Another definition of a weak tie is the bridge proposed by Granovetter [13], which refers to the only path between two nodes, but this definition is too strict, especially for large-scale networks. In this paper, bridges are redefined as bridge links, which connect different communities to facilitate the wider diffusion of information. Moreover, the definition of a bridge link avoids tuning a threshold.
– The utility of links is defined. Currently, there is no measure to evaluate the utility of links in information diffusion. Studies [13] have shown that the shorter the distance between nodes, the more easily information diffuses. This paper defines a utility function based on the rate of change of the average shortest path length of the network after deleting a certain number of links. With the utility function, the importance of bridge links for information diffusion can be evaluated.
– A bridge link prediction method named iBridge (inferring Bridge links) is proposed. This method redefines the structural metrics of node pairs according to the statistical characteristics of bridge links, and uses a binary
classifier model. The performance of iBridge is higher in predicting bridge links and is not compromised in predicting non-bridge links.
– A PU-based sampling method, PUS, is proposed to address network sparsity (i.e. data imbalance), which is the main challenge in link prediction. PUS obtains reliable negative instances from the unlabeled instances by using K-means clustering and a voting mechanism. With PUS, iBridge is further improved under a positive and unlabeled learning framework.

The rest of this paper is organized as follows. Section 2 introduces the related work on link prediction and weak ties. Section 3 formalizes the terms and problems. Section 4 compares the utility of bridge links and non-bridge links and presents the proposed iBridge framework. Section 5 shows the experimental results in detail. Section 6 concludes the work.
2 RELATED WORK
Early work on link prediction is primarily based on Markov chains [34], which shows that the high-level model is more conducive to prediction accuracy in large-scale networks. Subsequently, some researchers [16] converted link prediction into a binary classification problem in a supervised learning framework. Due to the sparsity of real information networks, methods using semi-supervised learning [6, 18] and active learning [4] were further proposed and achieved good results. Considering that many information networks evolve over time, time-aware techniques [1, 33] were developed, which have proved effective for dynamic networks. Recently, the performance of link prediction has been further improved by leveraging heterogeneous information in the network [10, 35]. Some researchers pay more attention to the scalability of the algorithms, as both scalability and effectiveness are important for massive real-world social networks [15, 21]. Since missing links can only be predicted when they do not significantly change structural features, the SPM method was proposed to obtain more stable prediction results by considering structural consistency [22].
However, most previous link prediction methods are mainly concerned with whether the predicted links exist, rather than with their quality, such as whether they are useful for information diffusion. Bakshy et al. [3] find that information propagates more extensively through weak ties in social networks. Zhao et al. [41] propose a formula for link strength and find that if links are deleted in order of their weights, the information coverage in the network falls sharply. The work of Ferrara et al. [12] shows that weak ties are able to connect small communities into one large community, helping to reach a wider variety of contacts. Further research by Chui et al. [8] finds that only selected weak ties are helpful for information diffusion.
In the link prediction task, weak tie theory has been introduced to address the problem of information redundancy. Lü et al. [26] study the role of weak ties in the weighted network link prediction problem. A link is defined as a strong tie if its weight is in the top 50% of all weights; otherwise, it is a weak tie. Meo et al. [30] define weak ties based on community division and then improve the accuracy of link prediction with weak ties. Different from the above two methods, this paper focuses on how to accurately infer weak ties rather than on improving link prediction methods using weak ties. The most closely related work is the method proposed by Song et al. [36], which aims to find the brokers (a type of node) that are critical to information diffusion, whereas our method aims to find new diffusion paths.

3 FORMALIZATION
This section formalizes the concepts used in this paper.
Definition 1 Link Prediction. Consider a network $G = \langle V, E \rangle$, where $V = \{v_i\}_{i=1}^{N}$ denotes the set of nodes, $E = \{e_{ij}\}$ denotes the set of edges observed at time $t$, and $e_{ij}$ denotes the link between the node pair $\langle v_i, v_j \rangle$. The task is to predict the possibility of a connection between a pair $\langle v_i', v_j' \rangle \notin E$ at time $t'$ ($t' > t$). A dynamic network can be represented as a sequence of discrete graphs $\langle G_1, G_2, \ldots, G_T \rangle$, where $G_t = \langle V, E_t \rangle$ represents the network at time $t$, $V = \{v_i\}_{i=1}^{N}$ denotes the set of nodes and $E_t = \{e_{ij}\}$ denotes the set of edges at time $t$. In this case the task is to predict the possibility of a connection between $\langle v_i, v_j \rangle \notin E_t$ at time $t + 1$. This paper mainly discusses link prediction in non-dynamic networks.
Definition 2 Community Detection. Community structure is an important feature of many networks, especially social networks [31]. Links within the same community are dense while links between different communities are sparse. Given a network $G = \langle V, E \rangle$, the task is to divide all nodes $v_i$ into subsets, obtaining the collection of communities $Com = \{Com_i\}_{i=1}^{K}$, where $Com_i \subset V$. Community detection methods can be classified as non-overlapping methods [31], overlapping methods [2] and hierarchical methods [9]. A non-overlapping method outputs $Com$ such that $Com_i \cap Com_j = \emptyset$ for any $i \neq j$. An overlapping method outputs $Com$ in which there may exist $i \neq j$ with $Com_i \cap Com_j \neq \emptyset$. A hierarchical method is an iteration of a non-overlapping method, in which each $Com_i$ can be further divided into smaller communities.
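The distinction between the two kinds of output in Definition 2 can be checked mechanically. The following is a minimal sketch, assuming communities are given as Python sets of node identifiers; the function name `is_non_overlapping` is hypothetical and not part of the paper.

```python
from itertools import combinations


def is_non_overlapping(communities):
    """Return True if every pair of communities is disjoint.

    communities: list of sets of node identifiers (an assumed representation).
    """
    return all(a.isdisjoint(b) for a, b in combinations(communities, 2))


# Hypothetical partitions used only for illustration.
partition = [{"v1", "v2"}, {"v3", "v4"}]
cover = [{"v1", "v5"}, {"v5", "v6"}]  # v5 belongs to both communities
```

With these inputs, `partition` is a non-overlapping output and `cover` is an overlapping one.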
Definition 3 Bridge Link. In this paper, a bridge link is defined as a link across communities. The definition varies with the community partitioning method.
Bridge link across non-overlapping communities. Let $C_N(v_i)$ denote the community set of node $v_i$, where $N$ denotes a non-overlapping community detection method. Since the communities are non-overlapping, $|C_N(v_i)| = 1$ for any $v_i \in V$. $B_N(e_{ij})$ indicates whether a given link $e_{ij}$ is a bridge link, and is defined as:
$$B_N(e_{ij}) = \begin{cases} 1 & \text{if } C_N(v_i) \neq C_N(v_j) \\ 0 & \text{otherwise} \end{cases} \tag{1}$$

If $B_N(e_{ij}) = 1$, $e_{ij}$ is a bridge link; otherwise it is not. As shown in Fig. 1(a), $e_{1,5}$ and $e_{3,6}$ are bridge links.

Fig. 1. Examples of bridge links across communities: (a) in non-overlapping communities; (b) in overlapping communities.

Bridge link across overlapping communities. Sometimes one node may belong to multiple communities. Let $C_O(v_i)$ denote the community set of node $v_i$, where $O$ denotes an overlapping community detection method. $B_O(e_{ij})$ is then defined as:

$$B_O(e_{ij}) = \begin{cases} 0 & \text{if } |C_O(v_i)| = |C_O(v_j)| = 1 \text{ and } C_O(v_i) = C_O(v_j) \\ 1 & \text{otherwise} \end{cases} \tag{2}$$
In Fig. 1(b), two of the communities overlap. $e_{1,8}$ and $e_{4,6}$ are bridge links according to the above definition; $e_{1,5}$, $e_{2,5}$, $e_{3,5}$ and $e_{4,5}$ are also bridge links since node $v_5$ is in the overlapping part of the communities.
Bridge link in hierarchical communities. A hierarchical community detection method is based on the iteration of a non-overlapping method, and there is no overlap between communities in each layer. Therefore, the definition of a bridge link in each layer of the dendrogram is the same as that for non-overlapping communities described above.
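The two indicator functions of Definition 3 can be sketched in a few lines. This is an illustrative sketch only: it assumes each node's community membership is stored as a set of labels (a singleton set in the non-overlapping case), in which case the rule of Eq. 2 reduces to that of Eq. 1. The function name and the membership table below are hypothetical and do not reproduce Fig. 1.

```python
def is_bridge_link(comms_i, comms_j):
    """Bridge-link indicator following Eq. 2 (and Eq. 1 for singleton sets).

    comms_i, comms_j: sets of community labels of the two endpoints.
    Returns 0 only when both endpoints belong to exactly one community
    and that community is the same; returns 1 otherwise.
    """
    same_single = len(comms_i) == 1 and len(comms_j) == 1 and comms_i == comms_j
    return 0 if same_single else 1


# Hypothetical membership table: node "c" lies in the overlap of communities 1 and 2.
membership = {"a": {1}, "b": {1}, "c": {1, 2}, "d": {2}}
```

Under this table, a link between "a" and "b" is not a bridge link, while links between "a" and "d" or between "a" and "c" are.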
Definition 4 Bridge Link Prediction. Similar to the general link prediction task, bridge link prediction aims to predict bridge links. Consider a network $G = \langle V, E \rangle$, where $V$ and $E$ are as in Definition 1, and let $Q = \{e_{ij} \mid B(e_{ij}) = 1\}$ ($Q \subset E$) denote the bridge link set, with bridge links as given in Definition 3. The task is to predict the possibility of a connection $e_{ij}'$ between $\langle v_i', v_j' \rangle$ at time $t'$, where $\langle v_i', v_j' \rangle \notin E$ but $B(e_{ij}') = 1$. As shown in Fig. 2, the task is to predict the formation of links like $e_{1,6}$ and $e_{3,5}$ at time $t'$, which are potential bridge links.
Fig. 2. Description of bridge link prediction (the network at time $t$ and at time $t'$).
4 THE PROPOSED METHOD
4.1 Utility of Bridge Links
To quantify and verify the effectiveness of the bridge links defined above in information diffusion, the paper defines a utility function based on the average shortest path. The average shortest distance $Dist$ (Eq. 3) is often used to measure the capability of a network to diffuse information [32]. Generally, the smaller the $Dist$ value, the more conducive the network is to diffusing information. Here, $n$ denotes the total number of nodes in the network, and $d(v_i, v_j)$ denotes the shortest path length between nodes $v_i$ and $v_j$.

$$Dist = \sum_{v_i, v_j \in V} \frac{d(v_i, v_j)}{n(n-1)} \tag{3}$$
For large-scale networks, the impact of a single edge on information diffusion is too small to be measured. Therefore, the paper defines the utility function $\Phi(E_k)$ (Eq. 4) to measure the rate of change of $Dist$ after deleting $k$ edges, where $E_k$ denotes the set of deleted edges. Normally, the larger the value of $k$, the more significant the rate of change. For a fixed $k$, the higher the value of $\Phi$, the more useful the deleted $k$ edges are for information diffusion.

$$\Phi(E_k) = \frac{\sum_{e_{ij} \in E - E_k} \frac{d(v_i, v_j)}{n(n-1)} - \sum_{e_{ij} \in E} \frac{d(v_i, v_j)}{n(n-1)}}{\sum_{e_{ij} \in E} \frac{d(v_i, v_j)}{n(n-1)}} \tag{4}$$
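As an illustration of Eq. 3 and Eq. 4, the sketch below computes $Dist$ by breadth-first search on an unweighted, undirected graph and then the relative change of $Dist$ after removing a set of edges. It rests on two assumptions that the text does not state: $\Phi$ is read as the relative change of $Dist$ between the reduced and the original network, and node pairs that become unreachable are simply skipped. The function names are hypothetical.

```python
from collections import deque


def avg_shortest_distance(nodes, edges):
    """Dist of Eq. 3: sum of d(v_i, v_j) over ordered pairs, divided by n(n-1).

    Assumption of this sketch: unreachable pairs contribute nothing.
    """
    adj = {v: set() for v in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    total = 0
    for s in nodes:
        dist = {s: 0}
        queue = deque([s])
        while queue:  # plain BFS from s
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        total += sum(dist.values())
    n = len(nodes)
    return total / (n * (n - 1))


def utility(nodes, edges, deleted):
    """Phi of Eq. 4, read as the relative change of Dist after deleting edges.

    deleted: set of edge tuples written exactly as they appear in `edges`.
    """
    base = avg_shortest_distance(nodes, edges)
    remaining = [e for e in edges if e not in deleted]
    return (avg_shortest_distance(nodes, remaining) - base) / base


# A 4-cycle: deleting one edge turns it into a path and lengthens distances.
cycle_nodes = ["a", "b", "c", "d"]
cycle_edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "a")]
phi = utility(cycle_nodes, cycle_edges, {("d", "a")})
```

For the 4-cycle, $Dist$ grows from 4/3 to 5/3 when one edge is removed, so the utility of that edge is 0.25.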
Fig. 3 compares the $\Phi$ values on the Facebook dataset after randomly deleting a given number of bridge links and non-bridge links (obtained by the Louvain method [5], a non-overlapping community detection method). As expected, $\Phi$ increases significantly faster when bridge links are deleted than when non-bridge links are deleted.

Fig. 3. Comparison of the utility values of bridge links and non-bridge links ($\Phi$ versus the number of deleted edges).

4.2 Biased Features
Several heuristic structural features like common neighbors (CN), Jaccard coefficient (JC) and Resource Allocation (RA) are often used to describe node pairs
due to their low computational complexity and good predictive performance. In the context of bridge link prediction, the paper proposes several new features (B-CN, B-JC, B-RA, SBC and SDC) to describe node pairs in the bridge position. The following notation is used to define these new features in Eqs. 5-9.
$\tau(v_i)$ - the neighbor node set of $v_i$
$d(v_i)$ - the node degree of $v_i$
$g^{i}_{st}$ - the number of geodesic paths from $v_s$ to $v_t$ that pass through $v_i$
$n_{st}$ - the total number of geodesic paths from $v_s$ to $v_t$
$\tau_{ij}$ - the set of common neighbors of $v_i$ and $v_j$
In Eq. 5, $C(v_k)$ denotes the community (or communities) in which $v_k$ is located. If $|C(v_i) \cap C(v_k)| \geq 1$ or $|C(v_k) \cap C(v_j)| \geq 1$, B-CN will have an additional value. Notice that $|C(v_i) \cap C(v_k)|$ or $|C(v_k) \cap C(v_j)|$ may be larger than 1 in overlapping communities.

$$\text{B-CN}(v_i, v_j) = |\tau_{ij}| + \sum_{v_k \in \tau_{ij}} |C(v_i) \cap C(v_k)| + \sum_{v_k \in \tau_{ij}} |C(v_k) \cap C(v_j)| \tag{5}$$
The definition of B-RA (Eq. 6) is similar to that of RA, but a common neighbor contributes only when $|C(v_i) \cap C(v_k)| \geq 1$ or $|C(v_k) \cap C(v_j)| \geq 1$.

$$\text{B-RA}(v_i, v_j) = \sum_{v_k \in \tau_{ij}} \frac{|(C(v_i) \cap C(v_k)) \cup (C(v_k) \cap C(v_j))|}{d(v_k)} \tag{6}$$
If $|C(v_i) \cap C(v_k)| \geq 1$ or $|C(v_k) \cap C(v_j)| \geq 1$, B-JC$(v_i, v_j)$ will have an additional value.

$$\text{B-JC}(v_i, v_j) = \frac{|\tau_{ij}|}{|\tau(v_i) \cup \tau(v_j)|} + \frac{\sum_{v_k \in \tau_{ij}} \left(|C(v_i) \cap C(v_k)| + |C(v_k) \cap C(v_j)|\right)}{|\tau(v_i) \cup \tau(v_j)|} \tag{7}$$
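Eqs. 5-7 can be evaluated directly from an adjacency structure and a community membership table. The following is a minimal sketch under assumed data structures: `adj` maps each node to its neighbor set and `comm` maps each node to its set of community labels. Both names, the function name and the toy graph are hypothetical.

```python
def biased_features(adj, comm, vi, vj):
    """Return (B-CN, B-RA, B-JC) for the node pair (vi, vj), following Eqs. 5-7.

    adj:  dict node -> set of neighbors (assumed representation)
    comm: dict node -> set of community labels (assumed representation)
    """
    common = adj[vi] & adj[vj]   # tau_ij
    union = adj[vi] | adj[vj]    # tau(v_i) united with tau(v_j)

    def shared(a, b):
        return comm[a] & comm[b]

    # Community overlap terms summed over common neighbors (Eq. 5 and Eq. 7).
    overlap = sum(len(shared(vi, vk)) + len(shared(vk, vj)) for vk in common)
    b_cn = len(common) + overlap
    # Eq. 6: each common neighbor is weighted by 1 / degree.
    b_ra = sum(len(shared(vi, vk) | shared(vk, vj)) / len(adj[vk]) for vk in common)
    b_jc = (len(common) + overlap) / len(union) if union else 0.0
    return b_cn, b_ra, b_jc


# Toy graph: a (community 1) and c (community 2) share neighbors k1 and k2.
adj = {"a": {"k1", "k2"}, "c": {"k1", "k2"}, "k1": {"a", "c"}, "k2": {"a", "c"}}
comm = {"a": {1}, "k1": {1}, "c": {2}, "k2": {2}}
features = biased_features(adj, comm, "a", "c")
```

In the toy graph each common neighbor shares a community with exactly one endpoint, so the pair obtains B-CN = 4, B-RA = 1.0 and B-JC = 2.0.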
In addition, SBC (Sum of Betweenness Centrality) and SDC (Sum of Degree Centrality) are used because bridge links are more likely than non-bridge links to connect to the most influential nodes (the top 10% of users with the highest PageRank scores) in the community.

$$SBC(v_i, v_j) = \sum_{s \neq t} \frac{g^{i}_{st}}{n_{st}} + \sum_{s \neq t} \frac{g^{j}_{st}}{n_{st}} \tag{8}$$

$$SDC(v_i, v_j) = \frac{d(v_i) + d(v_j)}{n - 1} \tag{9}$$

The average SBC and SDC values are compared for bridge links and non-bridge links on three datasets (Facebook, Twitter and NetScience), as shown in Fig. 4, which verifies the effectiveness of SBC and SDC. Here, the Louvain algorithm [5] is used to obtain non-overlapping communities.
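The two centrality-based features can be sketched as follows. This is an assumption-laden illustration rather than the exact computation used in the paper: betweenness is taken here as unnormalized and summed over unordered pairs $(s, t)$ that exclude the node itself, and geodesic counts are obtained from one breadth-first search per node. The helper names `bfs_counts` and `sbc_sdc` are hypothetical.

```python
from collections import deque
from itertools import combinations


def bfs_counts(adj, s):
    """Distances from s and the number of shortest paths from s to each node."""
    dist, sigma = {s: 0}, {s: 1}
    queue = deque([s])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:            # first time w is reached
                dist[w] = dist[u] + 1
                sigma[w] = 0
                queue.append(w)
            if dist[w] == dist[u] + 1:   # u precedes w on a shortest path
                sigma[w] += sigma[u]
    return dist, sigma


def sbc_sdc(adj, vi, vj):
    """Return (SBC, SDC) for the pair (vi, vj) on an undirected graph."""
    info = {s: bfs_counts(adj, s) for s in adj}

    def betweenness(v):
        dist_v, sigma_v = info[v]
        total = 0.0
        for s, t in combinations([u for u in adj if u != v], 2):
            dist_s, sigma_s = info[s]
            if v not in dist_s or t not in dist_s or t not in dist_v:
                continue  # s, t and v are not all in one component
            if dist_s[v] + dist_v[t] == dist_s[t]:
                # paths through v: (paths s->v) * (paths v->t), over n_st
                total += sigma_s[v] * sigma_v[t] / sigma_s[t]
        return total

    n = len(adj)
    sbc = betweenness(vi) + betweenness(vj)
    sdc = (len(adj[vi]) + len(adj[vj])) / (n - 1)  # Eq. 9
    return sbc, sdc


# Path graph a - b - c: only b lies between another pair of nodes.
path = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
```

On the path graph, the pair (a, b) has SBC = 1.0, since b lies on the single geodesic between a and c, and SDC = (1 + 2) / 2 = 1.5.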
Fig. 4. Comparison of average SBC and average SDC of bridge and non-bridge links on the Facebook, Twitter and NetScience datasets: (a) SBC; (b) SDC.
4.3 The iBridge method
In this paper, inferring bridge links is regarded as a supervised or semi-supervised binary classification problem. The proposed iBridge method is described in Algorithm 1. First, a given network is divided into communities using a community detection method. Second, all node pairs which form or may form a bridge link (as defined in Section 3) are collected. If there is an edge between nodes $v_i$ and $v_j$, $Label\langle v_i, v_j \rangle = 1$; otherwise it is 0. $P$ is the set of positive examples and $N$ is the set of negative examples. $Fea\langle v_i, v_j \rangle_{n=1}^{k}$ represents the structural feature vector of node pair $\langle v_i, v_j \rangle$. $Clf_{BL}$ is a classifier learned from the training set, which can infer the label of any node pair $\langle v_i', v_j' \rangle$ in $G'$ at time $t'$ for which $Label\langle v_i', v_j' \rangle = 0$.

Algorithm 1: The proposed method, iBridge.
Input: Network $G = \langle V, E \rangle$; community detection algorithm $CD$; learning model $M$
Output: The classifier $Clf_{BL}$
1 Call $CD$ to find the community set $\{Com_i\}$ in $G$;
2 Get all node pairs $\{\langle v_i, v_j \rangle\}$ which have or may have bridge links per Definition 3 and denote the collection of these node pairs as $D$;
3 Create $P = \{\langle v_i, v_j \rangle \mid \langle v_i, v_j \rangle \in D \wedge Label\langle v_i, v_j \rangle = 1\}$, $N = \{\langle v_i, v_j \rangle \mid \langle v_i, v_j \rangle \in D \wedge Label\langle v_i, v_j \rangle = 0\}$;
4 Generate $Trainset$ by sampling from $P$ and $N$;
5 for each $\langle v_i, v_j \rangle$ in $Trainset$ do
6     Calculate $Fea\langle v_i, v_j \rangle_{n=1}^{k}$ with Eq. 5-9;
7     Get $Label\langle v_i, v_j \rangle$;
8 end
9 Train $M$ with $Trainset$;
10 Get $Clf_{BL}$.

Here is an example to illustrate how iBridge infers new bridge links (Fig. 5). First, the network is divided into different non-overlapping communities. Take node pair $\langle v_1, v_5 \rangle$ as an example, where nodes $v_1$ and $v_5$ belong to different communities. The features of the node pair $\langle v_1, v_5 \rangle$ are calculated: B-CN$(v_1, v_5) = 2$, B-RA$(v_1, v_5) = 0.25$, B-JC$(v_1, v_5) = 2$, SBC$(v_1, v_5) = 0.33$ and SDC$(v_1, v_5) = 1.17$. If nodes $v_1$ and $v_5$ are connected, the feature vector (2, 0.25, 2, 0.33, 1.17) is labeled as 1, otherwise 0. In the illustration, node pairs $\langle v_1, v_5 \rangle$, $\langle v_4, v_5 \rangle$ and $\langle v_3, v_6 \rangle$ are positive examples. The classifier $Clf_{BL}$ is then trained on all node pairs in the training set. Finally, for any node pair $\langle v_i, v_j \rangle$ whose nodes belong to different communities but are not connected, the probability of forming a bridge link is predicted. In the illustration, the link probability for the node pair $\langle v_4, v_6 \rangle$ is higher than the threshold, and it is therefore inferred as a bridge link.
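Steps 2 and 3 of Algorithm 1 can be sketched for the non-overlapping case as below. The communities and the three positive pairs are taken from the illustrative example in the text; everything else is an assumption of this sketch: the function name `build_candidates` is hypothetical, intra-community edges are omitted because they do not affect the candidate set, and sampling, feature calculation and classifier training (steps 4-10) are left out.

```python
from itertools import combinations


def build_candidates(nodes, edges, comm):
    """Collect node pairs in bridge position and split them by label.

    comm: dict node -> community label (non-overlapping partition assumed).
    Returns (P, N): connected and unconnected cross-community pairs.
    """
    edge_set = {frozenset(e) for e in edges}
    positives, negatives = [], []
    for vi, vj in combinations(nodes, 2):
        if comm[vi] == comm[vj]:
            continue  # same community: not a bridge position (Eq. 1)
        if frozenset((vi, vj)) in edge_set:
            positives.append((vi, vj))   # Label = 1
        else:
            negatives.append((vi, vj))   # Label = 0
    return positives, negatives


nodes = ["v1", "v2", "v3", "v4", "v5", "v6", "v7"]
comm = {"v1": 1, "v2": 1, "v3": 1, "v4": 1, "v5": 2, "v6": 2, "v7": 2}
bridge_edges = [("v1", "v5"), ("v4", "v5"), ("v3", "v6")]
P, N = build_candidates(nodes, bridge_edges, comm)
```

Of the 12 cross-community pairs, 3 are positive examples and 9 remain unconnected; the pair (v4, v6), which the illustration infers as a new bridge link, is among the latter.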
Fig. 5. An illustrative example of iBridge framework. Panels: part of an original network; after community division ($Com = \{Com_1, Com_2\}$, $Com_1 = \{v_1, v_2, v_3, v_4\}$, $Com_2 = \{v_5, v_6, v_7\}$); feature calculation for the example pair $\langle v_1, v_5 \rangle$ (Fea1 = B-CN = 2, Fea2 = B-RA = 0.25, Fea3 = B-JC = 2, Fea4 = SBC = 0.33, Fea5 = SDC = 1.17, Label = 1); predict unformed bridge links after training the model (prob(Label = 1) = 0.75 > 0.5, Label = 1).
4.4 The PU-based sampling method: PUS
Network sparsity is often a challenge for supervised link prediction methods; that is, there are far more negative examples than positive examples. In Algorithm 1, only random sampling is used in step 4 to obtain the training set. In fact, the node pairs without a connection are better regarded as unlabeled instances than as negative instances, because they may become positive in the future. Therefore, this is essentially a positive and unlabeled (PU) learning problem. PU learning has become a new research topic in the field of classification. Though widely used in text mining [20] and graph mining [42], it was not used in link mining until recent years. In 2014, Zhang et al. [40] used PU learning for the first time to predict anchor links between multiple networks, using the Spy technique to extract reliable negative examples. Inspired by the above work, PU learning is also introduced in our framework to alleviate the data imbalance. The paper proposes a sampling method, PUS (Algorithm 2), based on the K-means algorithm and a voting mechanism. PUS aims to find the reliable negative set RN efficiently. The iBridge method is then further improved by training with P and RN.
Algorithm 2: The proposed algorithm, PUS.
Input: $N$ local positive clusters $PC_i$ ($i = 1, 2, \ldots, N$); unlabeled set $U$; number of unlabeled clusters $K$
Output: Reliable negative set $RN$
$CNC = \emptyset$ (set of candidate negative clusters); $count[K] = \{0\}$ (the vote for each unlabeled cluster); $d[N][K] = \{0\}$ (matrix of distances between clusters);
Get $K$ unlabeled clusters $UC_j$ ($j = 1, 2, \ldots, K$) by the K-means algorithm;
for $i = 1$ to $N$ do
    for $j = 1$ to $K$ do
        Calculate $d_{ij}$ between the cluster centers of $PC_i$ and $UC_j$;
    end
    Find the median distance $median_i$ in $d_i$;
    for $j = 1$ to $K$ do
        if $d_{ij} > median_i$ then
            $CNC \leftarrow CNC \cup \{UC_j\}$; $count_j$++
        end
    end
end
$RN \leftarrow \bigcup UC_j$ ($j \in \arg\max count_j$);
The core idea of the PUS algorithm is majority voting. Each local positive cluster $PC_i$ ($i = 1, 2, \ldots, N$) votes for potential negative clusters among the $K$ local unlabeled clusters $UC_j$ ($j = 1, 2, \ldots, K$), and the candidate negative clusters with the largest number of votes are then taken as the reliable negative clusters (RNC), which make up the reliable negative set $RN$. Note that there may be more than one RNC, since candidate negative clusters may receive the same number of votes. As illustrated in Fig. 6, local positive clusters $PC_1$ and $PC_2$ vote for the local unlabeled clusters $UC_2$, $UC_5$, $UC_7$, $UC_8$ and $UC_3$, $UC_6$, $UC_7$, $UC_8$, respectively, which are considered as the candidate negative clusters. $UC_7$ and $UC_8$ receive the most votes and are taken as the reliable negative clusters.
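The voting step of PUS can be sketched as follows. The sketch assumes that the cluster centers are already available (the K-means step of Algorithm 2 is not reproduced), that centers are coordinate tuples, and that distance is Euclidean; the function name and the coordinates are hypothetical.

```python
import math
from statistics import median


def pus_reliable_negatives(pos_centers, unl_centers):
    """Return the indices of the unlabeled clusters that receive the most votes.

    pos_centers: centers of the local positive clusters PC_i
    unl_centers: centers of the unlabeled clusters UC_j
    Each positive cluster votes for every unlabeled cluster whose center is
    farther away than the median of its distances to all unlabeled clusters.
    """
    votes = [0] * len(unl_centers)
    for pc in pos_centers:
        d = [math.dist(pc, uc) for uc in unl_centers]
        med = median(d)
        for j, dij in enumerate(d):
            if dij > med:
                votes[j] += 1
    top = max(votes)
    return [j for j, v in enumerate(votes) if v == top]


# Hypothetical centers: two unlabeled clusters lie far from both positive clusters.
pos_centers = [(0.0, 0.0), (10.0, 0.0)]
unl_centers = [(1.0, 0.0), (9.0, 0.0), (5.0, 20.0), (5.0, -20.0)]
reliable = pus_reliable_negatives(pos_centers, unl_centers)
```

Here both positive clusters vote for the two distant unlabeled clusters (indices 2 and 3), which therefore form the reliable negative clusters; ties are kept, in line with the remark that more than one RNC may exist.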
Fig. 6. An illustrated example of PUS method.

5 EXPERIMENT
This section evaluates the performance of iBridge in inferring new bridge links, as well as all new links. The effectiveness of the new links inferred by iBridge for information diffusion is also evaluated. All experiments are run on a computer with Windows 10, a 2.6 GHz CPU and 12 GB of memory.

5.1 Datasets and settings
The experiments use three real-world datasets: Facebook [29], Twitter [29] and NetScience [25]. The latter two networks are directed graphs and are converted to undirected graphs for convenience. After small unconnected cliques are deleted, the processed datasets are: Facebook (4,039 nodes and 88,234 edges), Twitter (5,076 nodes and 47,014 edges) and NetScience (379 nodes and 914 edges). The base learning settings without PUS (detailed in Section 4.4) for all networks are shown in Table 1 and those with PUS are shown in Table 2. These settings differ slightly between non-overlapping and overlapping communities. After sampling, the numbers of positive and negative examples are relatively balanced. Louvain [5] and Infomap [28] are used to obtain non-overlapping communities; SLPA [39] and LinkCommunity [2] are used to obtain overlapping communities. Considering the stability and complexity of the model, the Random Forest classifier is used. The experiments use 10-fold cross validation and report the average results.

Table 1. Learning settings without PUS after non-overlapping and overlapping community division.

Non-overlapping methods
Method         Dataset     Pos-set  Neg-set     Train-set  Test-set
Louvain        Facebook    6,816    335,915     239,911    102,820
Louvain        Twitter     19,769   1,226,948   872,701    374,016
Louvain        NetScience  217      706         646        277
Infomap        Facebook    7,237    298,393     213,941    91,689
Infomap        Twitter     13,735   1,562,592   1,103,428  472,899
Infomap        NetScience  135      1,426       1,092      469

Overlapping methods
Method         Dataset     Pos-set  Neg-set     Train-set  Test-set
SLPA           Facebook    11,008   272,558     198,496    85,060
SLPA           Twitter     656      9,640,417   6,748,751  2,892,322
SLPA           NetScience  267      1,216       1,038      445
LinkCommunity  Facebook    88,233   8,066,502   5,708,314  2,446,421
LinkCommunity  Twitter     46,800   12,830,904  9,014,392  3,863,312
LinkCommunity  NetScience  752      70,621      49,961     21,411
Table 2. Learning settings with PUS after non-overlapping and overlapping community division.

Non-overlapping methods
Method         Dataset     Pos-set  Neg-set    Train-set  Test-set
Louvain        Facebook    6,816    19,410     18,358     7,868
Louvain        Twitter     19,769   431,137    315,634    135,272
Louvain        NetScience  217      455        470        202
Infomap        Facebook    7,237    30,104     26,138     11,203
Infomap        Twitter     13,735   177,221    133,669    57,287
Infomap        NetScience  135      129        184        80

Overlapping methods
Method         Dataset     Pos-set  Neg-set    Train-set  Test-set
SLPA           Facebook    11,008   16,805     19,469     8,344
SLPA           Twitter     656      9,640,417  6,748,751  2,892,322
SLPA           NetScience  267      337        422        181
LinkCommunity  Facebook    88,233   1,728,207  1,271,508  544,932
LinkCommunity  Twitter     46,800   2,757,986  1,963,350  841,436
LinkCommunity  NetScience  752      26,003     18,728     8,027

5.2 Comparative methods
First, the iBridge method is compared with the baseline method BLiP. Both methods are based on the supervised learning framework that first appeared in [16], except that BLiP uses benchmark features while iBridge uses biased features. Both methods can infer bridge links in non-overlapping and overlapping communities. Subsequently, iBridge is further compared with two state-of-the-art link prediction methods in predicting bridge links. All the methods, with their different features, community detection methods and sampling methods, are summarized in Table 3.

Table 3. Different settings of all compared methods (✓ marks a setting that is used).

Method       Community detection method  Biased features  PU sampling
BLiP-LV      Louvain
iBridge-LV   Louvain                     ✓
BLiP-SP      SLPA
iBridge-SP   SLPA                        ✓
BLiP-IM      Infomap
iBridge-IM   Infomap                     ✓
BLiP-LC      LinkCommunity
iBridge-LC   LinkCommunity               ✓
BLiP-LV´     Louvain                                      ✓
iBridge-LV´  Louvain                     ✓                ✓
BLiP-SP´     SLPA                                         ✓
iBridge-SP´  SLPA                        ✓                ✓
BLiP-IM´     Infomap                                      ✓
iBridge-IM´  Infomap                     ✓                ✓
BLiP-LC´     LinkCommunity                                ✓
iBridge-LC´  LinkCommunity               ✓                ✓
NELiP-LV´    Louvain                                      ✓
LoNGAE-LV    Louvain                     -                -

5.3 Results
AUC (area under the ROC curve), recall and F1-score are used to evaluate the performance of the comparative methods in inferring both bridge links and all types of links. In addition, precision is used to evaluate the performance of iBridge with PUS and the two link prediction methods in inferring bridge links.
The comparative results in inferring bridge links with the unbalanced base learning settings are shown in Fig. 7(a) and Fig. 7(b). Fig. 7(a) shows the results using non-overlapping community detection methods. The iBridge methods always achieve better results than the BLiP methods. On NetScience, the improvement in each evaluation index is not as significant as on Facebook and Twitter. A possible reason is that the NetScience network is relatively small and simple, so the base method, which does not use community information, can still predict quite well. Fig. 7(b) shows that when the communities overlap, the performance of iBridge-SP is much higher than that of BLiP-SP, but the performance of iBridge-LC and BLiP-LC is almost equivalent; the F1-score of iBridge-LC is even slightly lower than that of BLiP-LC. This indicates that the choice of overlapping community detection method may affect the performance of iBridge. The figure also shows that iBridge achieves a stable performance in inferring bridge links under different community detection configurations.
Fig. 7. Comparison of methods in different community settings: (a) BLiP and iBridge using Louvain and Infomap; (b) BLiP and iBridge using SLPA and LinkCommunity.
Table 4 shows the indicator values of all methods in the balanced learning settings. After the training data are balanced using the PU sampling method, the performance of all methods improves. Fig. 8 shows the comparative results of methods with and without PUS on the Twitter dataset, which verifies the effectiveness of the PUS method. Unlike the results in Fig. 7(a) and Fig. 7(b), the performance indicator values of iBridge and BLiP are closer. In general, the accuracy of iBridge is still higher than or equivalent to that of BLiP.
Table 4. Comparison of methods with PUS in both non-overlapping and overlapping communities.
              Facebook                  Twitter                   NetScience
Method        AUC     Rec     F1        AUC     Rec     F1        AUC     Rec     F1
BLiP-LV′      0.9019  0.7463  0.7455    0.9233  0.7705  0.6874    0.9641  0.9576  0.9679
iBridge-LV′   0.9908  0.9776  0.9828    0.9486  0.9091  0.8092    0.9894  0.9859  0.9742
BLiP-IM′      0.8932  0.8011  0.7952    0.7866  0.4712  0.4285    0.9517  0.8769  0.9736
iBridge-IM′   0.9969  0.9934  0.9908    0.8822  0.7758  0.7624    0.9942  0.9736  0.9888
BLiP-SP′      0.9177  0.7641  0.758     0.9711  0.8764  0.8354    0.9531  0.9247  0.9259
iBridge-SP′   0.9883  0.97    0.9716    0.9776  0.9253  0.8858    0.9923  0.9852  0.9819
BLiP-LC′      0.9983  0.9937  0.9687    0.9303  0.84    0.5988    0.956   0.8189  0.7238
iBridge-LC′   0.9986  0.9937  0.9703    0.9331  0.8425  0.5925    0.9576  0.8522  0.7603
To show the effectiveness of the proposed features, the weights of each feature of BLiP and iBridge in the prediction task are compared, using the Louvain setting and the PUS technique. The results in Fig. 9(a) and Fig. 9(b) show that the biased features do have an impact on the prediction result. However, the importance of each feature varies across datasets, which indicates that an adaptive feature extraction method could be a better choice. The paper also compares the performance of the two methods in inferring all new links. The experimental setup and training process are similar to the previous ones, except that the dataset contains all node pairs in the network. The learning settings are shown in Table 5. The comparison results of the AUC, recall and F1-score of BLiP-LV and iBridge-LV are listed in Table 6.
Table 5. Learning settings for BLiP-LV and iBridge-LV in inferring both bridge links and non-bridge links.
Dataset      Pos-set   Neg-set      Train-set    Test-set
Facebook     88,234    8,066,507    7,339,266    815,474
Twitter      47,475    15,888,367   14,342,257   1,593,584
NetScience   2,742     1,063,788    959,877      106,653
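The learning settings in Table 5 can be read with a little arithmetic. The sketch below copies the four counts per dataset from the table and derives two quantities that the text does not state explicitly: the share of positive instances and the fraction of instances used for training. The variable names are illustrative.

```python
# Values copied from Table 5: (pos, neg, train, test) per dataset.
settings = {
    "Facebook": (88_234, 8_066_507, 7_339_266, 815_474),
    "Twitter": (47_475, 15_888_367, 14_342_257, 1_593_584),
    "NetScience": (2_742, 1_063_788, 959_877, 106_653),
}

summary = {}
for name, (pos, neg, train, test) in settings.items():
    positive_rate = pos / (pos + neg)        # share of existing links among all instances
    train_fraction = train / (train + test)  # share of instances used for training
    summary[name] = (positive_rate, train_fraction)
    print(f"{name}: positives {positive_rate:.2%}, train fraction {train_fraction:.3f}")
```

Under these counts the positives make up roughly 1% of the instances in Facebook and about 0.3% in Twitter and NetScience, and the split is close to 9:1 in all three datasets. Such imbalance may help explain why F1-scores in the all-links task can be low even when AUC is high.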
Table 6. Comparison of BLiP-LV and iBridge-LV in inferring all new links.
              Facebook                  Twitter                   NetScience
Method        AUC     Rec     F1        AUC     Rec     F1        AUC     Rec     F1
BLiP-LV       0.943   0.7073  0.6416    0.6593  0.5383  0.2471    0.9406  0.8843  0.8492
iBridge-LV    0.9644  0.7002  0.6191    0.676   0.5898  0.1957    0.9512  0.8136  0.7838
The results show that, in all three networks, the AUC of iBridge-LV is still higher than that of BLiP-LV in predicting all new links, but the recall and F1-score of iBridge-LV are lower than those of BLiP-LV in most cases. Considering that the structural features in iBridge-LV are biased toward inferring new bridge links, the experimental results verify the robustness of iBridge-LV, since it is not much worse than BLiP-LV in some indicators and even slightly better in others. Furthermore, the performance of iBridge, NELiP and LoNGAE is compared in the task of bridge link prediction. NELiP is a supervised link prediction method based on the node embeddings output by node2vec [14]. LoNGAE is a link prediction method based on a multi-task autoencoder model [37].
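How NELiP turns node embeddings into features for a node pair is not detailed here. As a minimal sketch, the code below assumes the Hadamard (element-wise) product, one of the binary operators suggested in the node2vec paper for building edge features; the operator choice, the function names and the toy embeddings are all assumptions for illustration.

```python
from typing import Dict, List, Tuple

Embedding = List[float]


def hadamard(u: Embedding, v: Embedding) -> Embedding:
    """Element-wise product of two node embeddings of equal length."""
    if len(u) != len(v):
        raise ValueError("embeddings must have the same dimension")
    return [a * b for a, b in zip(u, v)]


def pair_features(
    embeddings: Dict[str, Embedding], pairs: List[Tuple[str, str]]
) -> List[Embedding]:
    """Build one feature vector per node pair; a classifier would be trained on these."""
    return [hadamard(embeddings[a], embeddings[b]) for a, b in pairs]


# Hypothetical 3-dimensional embeddings, standing in for node2vec output.
embeddings = {
    "a": [1.0, 0.5, -2.0],
    "b": [2.0, 4.0, 0.5],
    "c": [0.0, 1.0, 1.0],
}
features = pair_features(embeddings, [("a", "b"), ("a", "c")])
```

The resulting vectors would then be labeled (link or no link) and passed to a supervised classifier, which is the sense in which such features are learned rather than extracted manually.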
Fig. 8. Comparison of the methods with PUS and without PUS in Twitter dataset.
Fig. 9. Weights of features of BLiP and iBridge with PUS in all datasets: (a) BLiP, with features CN, RA, Jaccard, SBC and SDC; (b) iBridge, with features B-CN, B-RA, B-JC, SBC and SDC. Each panel plots the feature weights for one dataset (Facebook, Twitter, NetScience).
Unlike BLiP and iBridge, NELiP and LoNGAE do not rely on manually extracted features. The parameter settings of LoNGAE follow those of its author: the number of epochs is set to 50, the training batch size to 8, and the value batch size to 256. Although these two link prediction methods are not designed for bridge link prediction, they can also predict bridge links. The comparison results of iBridge, NELiP and LoNGAE are presented in Fig. 10. For fairness, all three methods use the Louvain community detection setting.
Fig. 10. Comparison of iBridge, NELiP and LoNGAE in all datasets.
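How a generic link predictor yields bridge link predictions is not spelled out here. One plausible reading, sketched below, keeps only those predicted links whose endpoints share no detected community. The definition used in `is_bridge`, the function names and the toy data are assumptions for illustration; the paper's formal definition of a bridge link appears in an earlier section and may differ, especially for overlapping communities.

```python
from typing import Dict, Iterable, List, Set, Tuple

Edge = Tuple[str, str]


def is_bridge(edge: Edge, communities: Dict[str, Set[int]]) -> bool:
    """True when the two endpoints have no community in common (assumed definition)."""
    u, v = edge
    return communities[u].isdisjoint(communities[v])


def bridge_predictions(predicted: Iterable[Edge], communities: Dict[str, Set[int]]) -> List[Edge]:
    """Keep only the predicted links that cross community boundaries."""
    return [e for e in predicted if is_bridge(e, communities)]


# Hypothetical community assignment; node "c" belongs to two overlapping communities.
communities = {"a": {0}, "b": {0}, "c": {0, 1}, "d": {1}, "e": {2}}
predicted = [("a", "b"), ("a", "d"), ("c", "d"), ("b", "e")]
bridges = bridge_predictions(predicted, communities)
```

Representing each node's membership as a set lets the same check cover both non-overlapping assignments (singleton sets) and overlapping ones.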
The results show that, in the bridge link prediction task, iBridge performs much better than NELiP. LoNGAE also performs well, better than NELiP but still below iBridge in most indicators (the exception being precision, where LoNGAE is slightly higher than iBridge). The result indicates that using inductive bias to guide representation is more effective than using embeddings of latent structure when solving a specific problem.
5.4 Verification
To verify whether the inferred bridge links are indeed more useful for information diffusion, the utility of three sets of inferred links is compared: bridge links inferred by iBridge-LV, bridge links inferred by BLiP-LV, and links inferred by BLiP. For convenience, the utilities (defined in Section 4.1) are calculated for fixed numbers of inferred links. The comparative results in the three networks are shown in Fig. 11. When deleting the same number of inferred links, iBridge-LV and BLiP-LV always get higher utility values than BLiP, which verifies that bridge links are more useful for spreading information than common links. The bridge links inferred by iBridge have higher utility in Facebook and Twitter but slightly lower utility in NetScience, which indicates that the size of the network may affect the quality of the inferred bridge links.
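The exact utility Φ is defined in Section 4.1 and is not reproduced here. As a stand-in, the sketch below uses global efficiency (the mean inverse shortest-path distance over all node pairs) and measures its relative drop when a set of links is deleted. This proxy, the function names and the toy graph are assumptions for illustration only; they show why deleting a cross-community link tends to hurt reachability more than deleting an intra-community link.

```python
from collections import deque
from typing import Dict, List, Set, Tuple

Graph = Dict[str, Set[str]]
Edge = Tuple[str, str]


def build_graph(edges: List[Edge]) -> Graph:
    """Undirected adjacency sets built from an edge list."""
    graph: Graph = {}
    for u, v in edges:
        graph.setdefault(u, set()).add(v)
        graph.setdefault(v, set()).add(u)
    return graph


def bfs_distances(graph: Graph, source: str) -> Dict[str, int]:
    """Hop distances from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nb in graph[node]:
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    return dist


def efficiency(graph: Graph) -> float:
    """Mean of 1/d(u, v) over ordered node pairs; unreachable pairs contribute 0."""
    total = 0.0
    for u in graph:
        dist = bfs_distances(graph, u)
        total += sum(1.0 / d for v, d in dist.items() if v != u)
    n = len(graph)
    return total / (n * (n - 1))


def utility_proxy(edges: List[Edge], removed: List[Edge]) -> float:
    """Relative efficiency drop caused by deleting `removed` (assumed proxy, not the paper's Φ)."""
    removed_set = {frozenset(e) for e in removed}
    kept = [e for e in edges if frozenset(e) not in removed_set]
    before = efficiency(build_graph(edges))
    after_graph = build_graph(kept)
    for node in {x for e in edges for x in e}:
        after_graph.setdefault(node, set())  # keep isolated nodes so the pair count is unchanged
    return (before - efficiency(after_graph)) / before


# Toy graph (hypothetical): two triangles joined by the single link ("c", "d").
edges = [("a", "b"), ("b", "c"), ("a", "c"), ("d", "e"), ("e", "f"), ("d", "f"), ("c", "d")]
bridge_drop = utility_proxy(edges, [("c", "d")])  # delete the cross-community link
intra_drop = utility_proxy(edges, [("a", "b")])   # delete a link inside one triangle
```

Inverse distances are used instead of the plain average shortest distance so that pairs disconnected by a deletion contribute zero rather than an infinite distance.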
Fig. 11. Comparison of utilities of inferred links (BLiP, BLiP-LV and iBridge-LV) in (a) Facebook, (b) Twitter and (c) NetScience. Each panel plots the utility Φ against the number of deleted edges (0 to 2500).
5.5 Discussion
The complexity of all methods is also analyzed. Since all methods use Random Forest as the base classifier, the time complexity of training and testing the model can be ignored in the comparison. In a network with $m$ edges and $n$ nodes, the time complexity of BLiP is $O(mn) + O(n^2)$, that of iBridge is $O(mn) + O(\mathrm{comm}) + O(n^2)$, that of NELiP is $O\left(\frac{l}{k(l-k)}\right) + O(\mathrm{comm})$, and that of LoNGAE is $O(n^3) + O(\mathrm{comm})$. Here, $O(mn)$ is the time complexity of calculating the betweenness index. $O\left(\frac{l}{k(l-k)}\right)$ is the time complexity of node2vec embedding, where $k$ is the number of neighbors in node sampling and $l$ is the length of the random walk. $O(n^2)$ is the time complexity of generating features for all instances. $O(n^3)$ is the time complexity of the autoencoder. $O(\mathrm{comm})$ is the time complexity of detecting communities (e.g., $O(tm)$ for Louvain and $O(n k_{max}^2)$ for LinkCommunity, where $t$ is the number of iterations and $k_{max}$ is the maximum node degree in the network). After balancing the training data, the time complexity of BLiP is $O(mn) + O(n^2) + O(KMt)$, and that of iBridge is $O(mn) + O(\mathrm{comm}) + O(n^2) + O(KMt)$. Here, $O(KMt)$ is the time complexity of the PUS method, where $K$ is the number of clustering centers, $M$ is the number of instances, and $t$ is the number of clustering iterations.
The paper uses BLiP as the base comparison method because there is currently no other related work on inferring weak ties or bridge links. Moreover, the performance of two state-of-the-art link prediction methods in inferring bridge links is also explored. The most related work is the method of Song et al. [36], who developed a heuristic algorithm to find the top-k brokers based on the weak tie theory. However, their method mines nodes (brokers), whereas our method mines links (weak ties).
In practical applications, deleting bridges (specific diffusion paths) may be less expensive than deleting brokers, because the latter changes the network structure to a great extent. Another advantage of our method is that it can infer new diffusion paths and therefore play an early-warning role. In the experiments, different community detection algorithms are used. Although the prediction results differ somewhat across community settings, the variation of iBridge is far less pronounced than that of the other comparative methods (e.g., BLiP). Moreover, iBridge always performs better than the other methods, which verifies the robustness of iBridge in predicting bridge links. To completely avoid the influence of community detection methods, network representation learning methods may be introduced: bridge links can be redefined using the low-dimensional vectors of nodes, and the proposed framework can subsequently be generalized to the new bridge link prediction task.
6 CONCLUSION
The paper redefines weak ties as bridge links and proposes a utility function to evaluate the effectiveness of bridge links. Moreover, it develops a model to infer bridge links, which uses biased similarity indices under a proposed supervised learning framework. The model is further improved with a proposed sampling method based on PU learning in order to alleviate network sparsity. The experimental results show that the proposed model can effectively infer bridge links as well as non-bridge links. They also verify that new links inferred by the proposed method contribute more to information diffusion. Therefore, the method has potential applications in preventing the spread of malicious information or alleviating the Matthew effect in sociology. However, not all bridge links are equally useful, so more sensitive utility functions need to be proposed in the future to further differentiate among bridge links. Inspired by the definition of the structural position of bridge nodes in Gao et al.'s work [11], the utility function and the feature representation can be improved to infer bridge links that are more conducive to information diffusion. Moreover, weak links and weak ties are not completely equivalent; how to measure weak relationships between nodes mathematically requires further study. Finally, network representation learning can be introduced to automatically learn features of weak ties instead of using manually designed biased features.
7 ACKNOWLEDGEMENTS
This research was supported by the National Natural Science Foundation of China (No. 61571238, No. 61603197 and No. 61772284).
References
1. Aggarwal, C.C., Xie, Y., Yu, P.S.: A framework for dynamic link prediction in heterogeneous networks. Statistical Analysis and Data Mining: The ASA Data Science Journal 7(1), 14–33 (2014)
2. Ahn, Y.Y., Bagrow, J.P., Lehmann, S.: Link communities reveal multiscale complexity in networks. Nature 466(7307), 761 (2010)
3. Bakshy, E., Rosenn, I., Marlow, C., Adamic, L.: The role of social networks in information diffusion. In: International Conference on World Wide Web. pp. 519–528 (2012)
4. Bilgic, M., Mihalkova, L., Getoor, L.: Active learning for networked data. In: International Conference on Machine Learning. pp. 79–86 (2010)
5. Blondel, V.D., Guillaume, J.L., Lambiotte, R., Lefebvre, E.: Fast unfolding of communities in large networks. Journal of Statistical Mechanics 2008(10), 155–168 (2008)
6. Brouard, C., D'Alché-Buc, F., Szafranski, M.: Semi-supervised penalized output kernel regression for link prediction. In: International Conference on Machine Learning. pp. 593–600 (2013)
7. Burt, R.S.: Structural holes and good ideas. American Journal of Sociology 110(2), 349–399 (2004)
8. Chiu, H.Y., Chen, S.M.: Propagating online social networks: via different kinds of weak ties. In: IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining. pp. 1189–1195 (2013)
9. Clauset, A., Moore, C., Newman, M.E.: Hierarchical structure and the prediction of missing links in networks. Nature 453(7191), 98 (2008)
10. Dong, Y., Tang, J., Wu, S., Tian, J., Chawla, N.V., Rao, J., Cao, H.: Link prediction and recommendation across heterogeneous social networks. In: IEEE International Conference on Data Mining. pp. 181–190 (2013)
11. Fei, G., Katarzyna, M., Bogdan, G.: A community bridge boosting social network link prediction model. In: Proceedings of the 2017 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining. pp. 683–689 (2017)
12. Ferrara, E., Meo, P.D., Fiumara, G., Provetti, A.: The role of strong and weak ties in Facebook: a community structure perspective. Communications of the ACM 57(11) (2012)
13. Granovetter, M.: The strength of weak ties. American Journal of Sociology 78(6), 1360–1380 (1973)
14. Grover, A., Leskovec, J.: node2vec: Scalable feature learning for networks. In: ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. pp. 855–864 (2016)
15. Han, H.S., Cho, T.W., Dave, V., Yin, Z., Qiu, L.: Scalable proximity estimation and link prediction in online social networks. In: ACM SIGCOMM Conference on Internet Measurement (2009)
16. Hasan, M.A., Chaoji, V., Salem, S., Zaki, M.: Link prediction using supervised learning. In: Proc. of SDM 06 Workshop on Link Analysis, Counterterrorism and Security (2006)
17. Huang, Z., Zeng, D.D.: A link prediction approach to anomalous email detection. In: IEEE International Conference on Systems, Man and Cybernetics. pp. 1131–1136 (2007)
18. Kashima, H., Kato, T., Yamanishi, Y., Sugiyama, M., Tsuda, K., Park, H., Parthasarathy, S., Liu, H.: Link propagation: A fast semi-supervised learning algorithm for link prediction. In: International Conference on World Wide Web. pp. 1099–1110 (2009)
19. Lei, C., Ruan, J.: A novel link prediction algorithm for reconstructing protein-protein interaction networks by topological similarity. Bioinformatics 29(3), 355–364 (2013)
20. Li, X., Liu, B., Ng, S.K.: Negative training data can be harmful to text classification. In: Conference on Empirical Methods in Natural Language Processing. pp. 218–228 (2010)
21. Li, X., Du, N., Li, H., Li, K., Gao, J., Zhang, A.: A deep learning approach to link prediction in dynamic networks. In: Proceedings of the 2014 SIAM International Conference on Data Mining. pp. 289–297. SIAM (2014)
22. Linyuan, L., Liming, P., Tao, Z., Yi-Cheng, Z., H Eugene, S.: Toward link predictability of complex networks. Proceedings of the National Academy of Sciences of the United States of America 112(8), 2325–30 (2015)
23. Liu, H., Hu, Z., Haddadi, H., Tian, H.: Hidden link prediction based on node centrality and weak ties. Europhysics Letters 101(1), 18004 (2013)
24. Liu, Z., Zhang, Q.M., Lü, L., Zhou, T.: Link prediction in complex networks: a local naive Bayes model. Europhysics Letters 96(4), 48007 (2011)
25. Lu, L., Zhou, T.: Role of weak ties in link prediction of complex networks (2009)
26. Lü, L., Zhou, T.: Link prediction in complex networks: A survey. Physica A Statistical Mechanics & Its Applications 390(6), 1150–1170 (2011)
27. Lü, L., Zhou, T.: Link prediction in weighted networks: The role of weak ties. Europhysics Letters 89(1), 18001 (2010)
28. Martin, R., Bergstrom, C.T.: Maps of random walks on complex networks reveal community structure. Proceedings of the National Academy of Sciences of the United States of America 105(4), 1118–1123 (2008)
29. McAuley, J., Leskovec, J.: Learning to discover social circles in ego networks. In: International Conference on Neural Information Processing Systems (2012)
30. Meo, P.D., Ferrara, E., Fiumara, G., Provetti, A.: On Facebook, most ties are weak. Communications of the ACM 57(11), 78–84 (2014)
31. Newman, M.E.: Fast algorithm for detecting community structure in networks. Physical Review. E, Statistical, Nonlinear, and Soft Matter Physics 69(6 Pt 2), 066133 (2004)
32. Parotsidis, N., Pitoura, E., Tsaparas, P.: Selecting shortcuts for a smaller world. In: International Conference on Data Mining. pp. 28–36 (2015)
33. Sarkar, P., Chakrabarti, D., Jordan, M.: Nonparametric link prediction in dynamic networks 8(2), 1897–1904 (2012)
34. Sarukkai, R.R.: Link prediction and path analysis using Markov chains. Computer Networks 33(1–6), 377–386 (2000)
35. Shi, C., Li, Y., Zhang, J., Sun, Y., Yu, P.S.: A survey of heterogeneous information network analysis. IEEE Transactions on Knowledge & Data Engineering 29(1), 17–37 (2017)
36. Song, C., Hsu, W., Lee, M.L.: Mining brokers in dynamic social networks. In: ACM International on Conference on Information and Knowledge Management. pp. 523–532 (2015)
37. Tran, P.V.: Learning to make predictions on graphs with autoencoders (2018). https://doi.org/10.1109/DSAA.2018.00034
38. Wang, C., Satuluri, V., Parthasarathy, S.: Local probabilistic models for link prediction. In: IEEE International Conference on Data Mining. pp. 322–331 (2007)
39. Xie, J., Szymanski, B.K.: Towards linear time overlapping community detection in social networks. Knowledge Discovery and Data Mining pp. 25–36 (2012)
40. Zhang, J., Yu, P.S., Zhou, Z.H.: Meta-path based multi-network collective link prediction pp. 1286–1295 (2014)
41. Zhao, J., Wu, J., Xu, K.: Weak ties: subtle role of information diffusion in online social networks. Physical Review E Statistical Nonlinear & Soft Matter Physics 82(2), 016105 (2010)
42. Zhao, Y., Kong, X., Yu, P.S.: Positive and unlabeled learning for graph classification. In: 11th IEEE International Conference on Data Mining. pp. 962–971 (2011)