Predicting the evolution of complex networks via similarity dynamics


Physica A xx (xxxx) xxx–xxx

Contents lists available at ScienceDirect

Physica A journal homepage: www.elsevier.com/locate/physa


Tao Wu a,b,c,∗, Leiting Chen a,b,c, Linfeng Zhong d,e, Xingping Xian f

a Department of Computer Science and Engineering, University of Electronic Science and Technology of China, China
b Institute of Electronic and Information Engineering in Dongguan, University of Electronic Science and Technology of China, China
c Digital Media Technology Key Laboratory of Sichuan Province, China
d Web Science Center, University of Electronic Science and Technology of China, China
e Big Data Research Center, University of Electronic Science and Technology of China, China
f Department of Computer Science and Technology, Chengdu Neusoft University, China

Highlights

• Introduce a general solution for dynamic networks’ evolution prediction.
• Propose an effective and robust link prediction index.
• Envision networks as dynamic systems and model similarity dynamics.
• Propose a position drift model to infer future network structure.

Article info

Article history: Received 18 April 2016; Received in revised form 23 June 2016; Available online xxxx

Keywords: Link prediction; Evolutionary dynamics; Spatial–temporal position drift; Network evolution



Abstract

Almost all real-world networks are subject to constant evolution, and many of them have been investigated empirically to uncover the underlying evolution mechanisms. However, predicting the evolution of dynamic networks remains a challenging problem. The crux of the matter is to estimate the future links of dynamic networks. This paper studies the evolution prediction of dynamic networks within the link prediction paradigm. To estimate the likelihood that a link exists more accurately, an effective and robust similarity index is presented that exploits network structure adaptively. Moreover, most existing link prediction methods do not make a clear distinction between future links and missing links. In order to predict future links, networks are regarded as dynamic systems in this paper, and a similarity updating method, the spatial–temporal position drift model, is developed to simulate the evolutionary dynamics of node similarity. The updated similarities are then used as input for estimating the likelihood of future links. Extensive experiments on real-world networks suggest that the proposed similarity index performs better than baseline methods and that the position drift model performs well for evolution prediction in real-world evolving networks. © 2016 Elsevier B.V. All rights reserved.

Corresponding author at: Department of Computer Science and Engineering, University of Electronic Science and Technology of China, China. E-mail address: [email protected] (T. Wu).

http://dx.doi.org/10.1016/j.physa.2016.08.013 0378-4371/© 2016 Elsevier B.V. All rights reserved.


T. Wu et al. / Physica A xx (xxxx) xxx–xxx

1. Introduction

Complex network theory, a marriage of ideas and methods from statistical physics and graph theory, provides an ideal tool for studying complex systems and has led to major advances in our understanding of metabolic networks, urban road networks, and social and communication networks [1]. In the last decade, the complex networks community has produced a large body of work, including structure analysis [2–5], spreading modeling [6,7], link prediction [8,9], similarity measures [10,11], network reconstruction [12,13] and information filtering [14,15]. In particular, predicting the evolution of network structure is a critical topic in complex network research. Understanding the underlying mechanisms and predicting the evolution of dynamic complex networks is fundamental to many applications, including suppressing virus propagation, controlling rumor diffusion and protecting ecological network systems. These problems all amount to asking what the future structure of the network will be. Since knowledge of the future network structure can guide individuals’ behavior in the exploration of complex systems, the question we address here is how to predict the evolution of complex networks. According to Ref. [16], every effective link prediction algorithm corresponds to one or more mechanisms of network organization and evolution. Hence, we answer this question within the link prediction paradigm.

The link prediction problem has been studied extensively in complex network research. Specifically, D. Liben-Nowell and J. Kleinberg [17] frame it as asking to what extent the evolution of a complex network can be modeled using features intrinsic to the network itself. They summarized many similarity indices based on network structure and, by comparing them with random predictors, found that the network topology alone does contain useful information.
Commonly, topology-based methods are divided into three classes. The first class is similarity-based methods, which assume that links between more similar nodes are more likely to exist and that the similarity of two endpoints can be transferred through links. Similarity-based methods can be subdivided into neighbor-based methods and distance-based methods. Neighbor-based methods build on the idea that two nodes are more likely to form a link if they have more common neighbors; examples are the common neighbors index (CN), the Adamic–Adar index (AA) [18], the resource allocation index (RA) [19] and the Leicht–Holme–Newman index (LHN) [20]. Distance-based methods suppose that link probability is determined by the distance or the number of shortest paths between nodes; examples are the local path index (LP) [19], the Katz index [21] and the Leicht–Holme–Newman index (LHN-II) [20]. The second class is maximum likelihood estimation methods, of which two popular examples are the hierarchical structure model (HSM) [22] and the stochastic block model (SBM) [23]. The third class is machine learning based methods, mainly supervised learning [24] and non-negative matrix factorization (NMF) [25]. Owing to their simplicity, similarity-based algorithms are the mainstream of the field.

In most existing link prediction work, an algorithm’s accuracy is typically evaluated by its ability to recover known links that were removed from the network and placed in a test set. Moreover, most work does not make a clear distinction between future links and missing links; an exception is Ref. [26], which offers evidence that missing links are more likely to connect low-degree nodes. Thus a method that accurately recovers missing links is not necessarily useful for predicting future ones. Consequently, predicting future links in dynamic networks without losing generality is still one of the most challenging tasks.
Up to now, many dynamic networks have been investigated empirically to uncover their evolution mechanisms. From the perspective of group evolution [27,28], it has been pointed out that many new group members are neighbors or second-level neighbors of current group members, and that neighbor nodes are more likely to join cohesive groups than unstructured ones. In other words, neighbor nodes prefer to connect with nodes in groups with a higher clustering level. Ref. [29] presents a detailed study of network evolution based on four large online social networks with full temporal information. The study shows that most new edges span very short distances, typically closing triangles. Furthermore, Ref. [30] shows that the auto-correlation function of the successive states of evolving communities is continuous, which indicates that the state of a network is associated with its state at the previous time point. Thus the timestamps of network interactions have the potential to influence network evolution. The central question is therefore how to use the spatial and temporal factors of networks to predict future links.

The essence of link prediction is to estimate the topological similarity of node pairs based on the observed network structure, i.e. the similarity relations between nodes. In order to estimate topological similarity in networks with different structural properties, this paper first proposes a robust, structure-dependent index. Moreover, the crux of predicting the evolution of dynamic networks is the prediction of future links, and, given the essence of link prediction, future similarity relations are the basis of future link prediction.
To infer the future similarity relations, we envision evolving networks as dynamic systems and investigate the similarity dynamics based on the current network condition. In this setting, a node’s influence is defined in terms of spatial and temporal factors, and a node’s network position is defined as the similarity relations between the node and its neighbors. A spatial–temporal position drift model is then proposed to update each node’s network position iteratively according to node influence. The variation of the similarities of node pairs reflects the underlying evolution trend of the current network, and the iterative updating of nodes’ network positions leads to a drifted network structure. Finally, experimental results on real-world networks show that the structure-dependent index is effective and robust for link prediction and that the spatial–temporal position drift model performs well in predicting network evolution.

The rest of the paper is organized as follows. Section 2 introduces some indices as baselines. Section 3 presents the structure-dependent index and the spatial–temporal position drift model. Section 4 gives the experimental results. Discussion and conclusions are given in Section 5.


2. Preliminaries


2.1. Problem and evaluation


Consider an undirected simple network G = (V, E), where V = {v1, v2, ..., vn} is the set of nodes and E is the set of links. Multiple links and self-loops are not allowed. Each link e = (u, v, t) ∈ E represents an interaction between nodes u and v that took place at time t. Let G[t, t′] denote the subgraph of G consisting of all links with timestamps between t and t′. We choose three timestamps t0 < t1 < t2 and obtain the subgraphs G[t0, t1] and G[t1, t2]. We refer to [t0, t1] as the training interval and [t1, t2] as the probe interval. Networks commonly grow through the addition of nodes and links, and it is not sensible to predict links of G[t1, t2] whose endpoints are not present in G[t0, t1]. Thus we eliminate the nodes newly added within [t1, t2]. We use ET and EP to denote the sets of links in G[t0, t1] and G[t1, t2], respectively. Clearly, ET ∩ EP = ∅. We use EU to denote the universal set containing all |V′|(|V′| − 1)/2 possible links, where V′ denotes the set of nodes in G[t0, t1]. In future link prediction, the task is to reveal the set of future links EP within the prediction space EU\ET, where ET is treated as known information and EP is used only to test accuracy. For each pair of nodes u, v ∈ V′ without a link in G[t0, t1], each link predictor that we consider assigns a similarity score based on the existing links in ET. All unlinked node pairs are then ranked in descending order of score. In this study, we use two evaluation metrics, AUC (area under the receiver operating characteristic curve) and Precision. In the present case, AUC can be interpreted as the probability that a randomly chosen link in EP has a higher similarity score than a randomly chosen nonexistent link in EU\(ET ∪ EP).
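As a rough illustration of this ranking-based evaluation, the sketch below estimates AUC by repeated random comparisons, counting a tie as half a win, and computes Precision as the fraction of the top-L ranked candidate pairs that belong to EP, with L = |EP|. The function names, the dictionary-of-scores layout, the default score of 0.0 for unscored pairs and the default number of comparisons are assumptions made for this example only.

```python
import random


def auc_by_sampling(scores, probe_links, nonexistent_links, n=10000, seed=0):
    """Estimate AUC from n random comparisons (hypothetical helper).

    scores maps a node pair to its similarity score; pairs missing from
    the mapping are treated as having score 0.0 (an assumption).
    """
    rng = random.Random(seed)
    probe = list(probe_links)
    absent = list(nonexistent_links)
    wins = ties = 0
    for _ in range(n):
        s_probe = scores.get(rng.choice(probe), 0.0)
        s_absent = scores.get(rng.choice(absent), 0.0)
        if s_probe > s_absent:
            wins += 1
        elif s_probe == s_absent:
            ties += 1
    # A tie contributes half a win.
    return (wins + 0.5 * ties) / n


def precision_at_l(scores, candidate_links, probe_links):
    """Fraction of the top-L candidates that are probe links, L = |EP|."""
    probe = set(probe_links)
    ranked = sorted(candidate_links,
                    key=lambda pair: scores.get(pair, 0.0),
                    reverse=True)
    top = ranked[:len(probe)]
    return sum(1 for pair in top if pair in probe) / len(probe)
```

Here `candidate_links` stands for the prediction space EU\ET and `nonexistent_links` for EU\(ET ∪ EP); sampling keeps the AUC estimate cheap when the number of unlinked pairs is large.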
In the implementation, among n independent comparisons, if the future link has the higher score n′ times and the future link and the nonexistent link have the same score n′′ times, then AUC is calculated as

$$ \mathrm{AUC} = \frac{n' + 0.5\,n''}{n}. $$

If all the scores are generated from an independent and identical distribution, AUC will be approximately 0.5. Therefore, the extent to which AUC exceeds 0.5 indicates how much better the algorithm performs than pure chance. To calculate the Precision of a predictor, we rank the non-observed links EU\ET and compare the top L links, L = |EP|, with EP. If Lr of the top-L links are accurately predicted (i.e., Lr of them are in the probe set EP), then Precision = Lr/L. In the experiments, we assume that the training network G[t0, t1] is connected.

2.2. Similarity indices


Among the many similarity indices, Liben-Nowell and Kleinberg [17] and Zhou et al. [19] show, through systematic comparison of local similarity indices in unweighted networks, that CN, AA and RA perform best. Therefore, this paper concentrates on the weighted definitions of these three indices, denoted by WCN, WAA and WRA respectively:



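$$ s^{WCN}_{xy} = \sum_{z \in \Gamma(x) \cap \Gamma(y)} \bigl( w(x,z) + w(z,y) \bigr) \tag{1} $$

$$ s^{WAA}_{xy} = \sum_{z \in \Gamma(x) \cap \Gamma(y)} \frac{w(x,z) + w(z,y)}{\log(1 + s(z))} \tag{2} $$

$$ s^{WRA}_{xy} = \sum_{z \in \Gamma(x) \cap \Gamma(y)} \frac{w(x,z) + w(z,y)}{s(z)} \tag{3} $$
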

Here, w(x, y) = w(y, x) denotes the weight of the link between nodes x and y, and $s(z) = \sum_{z' \in \Gamma(z)} w(z, z')$ denotes the strength of node z, namely the sum of the weights of its attached links. Moreover, we also take the LP index into account:

$$ s^{LP}_{xy} = \left( A^{2} + \varepsilon \cdot A^{3} \right)_{xy} \tag{4} $$

where A is the adjacency matrix of the network and ε is a free parameter. The LP index makes use of the information on local paths of lengths 2 and 3. In the implementation, we directly count the number of different paths of lengths 2 and 3, and the parameter is fixed at ε = 10⁻³ following the original article [19]. In weighted networks, we sum up the weights of the different paths of lengths 2 and 3.

3. Method

Evolution prediction of dynamic networks attempts to predict future network structure by mining historical data. To predict potential network links, we first propose a structure-dependent index guided by network properties. Then, from the perspective of dynamics, the dynamics of a complex network can be decomposed into the product of the dynamics of its single nodes. To model single-node dynamics, a spatial–temporal node position drift model is introduced. By iteratively updating network positions based on node influence, the future similarity relations of network nodes are inferred and the future links are predicted accordingly.


3.1. Structure-dependent similarity index

Neighbor-based methods, such as CN, AA and RA, obtain satisfactory prediction accuracy in networks with a high clustering coefficient. However, in sparse networks with a low clustering coefficient, their performance is less satisfactory. One of the key reasons is that neighbor-based methods cannot calculate the similarity between nodes without common neighbors. Moreover, neighbor-based methods discriminate poorly between node pairs: the probability that two node pairs are assigned the same similarity score is usually high. To overcome these weaknesses, distance-based methods such as LP, Katz and LHN-II have been proposed. However, distance-based methods are sensitive to the proportion of observed links: prediction accuracy drops markedly as the number of observed links decreases. This is because removing network links increases the average shortest distance between node pairs. If the path range of a distance-based index is smaller than the shortest distance between two nodes, the index cannot capture their similarity at all. Therefore, an effective and robust link prediction index should adaptively exploit a different range of structural information in different networks.

Moreover, link prediction algorithms need to estimate the similarities of all unconnected node pairs, so an index based on global structural information would be very time-consuming. The algorithm should therefore be not only effective but also efficient, and a semi-local index is the preferred way to obtain a good trade-off between the two. To determine the range of structural information, this paper uses the average shortest distance between node pairs as guidance. In addition, nodes with greater degree are more likely to interconnect two otherwise isolated nodes. The contribution of intermediate nodes to the existence likelihood of a link therefore differs with their degree, and the similarity scores should be normalized by the degree of the intermediate nodes. Based on the above analysis, we propose a structure-dependent (SD) similarity index that exploits a network-specific range of structural information. Formally,

$$ s^{SD}_{xy} = \sum_{\delta=2}^{\langle d \rangle} \varepsilon^{\delta-2} \cdot \frac{(A^{\delta})_{xy}}{k(z)} \tag{5} $$

where A is the adjacency matrix, k(z) is the average node degree of a path between nodes x and y, ⟨d⟩ is the average shortest distance of the network, and ε is a free parameter. The weighted structure-dependent (WSD) index is defined as follows:

$$ s^{WSD}_{xy} = \sum_{\delta=2}^{\langle d \rangle} \varepsilon^{\delta-2} \cdot \frac{(W^{\delta})_{xy}}{s(z)} \tag{6} $$

where W is the weight matrix and s(z) is the average node strength of a path between x and y. In the implementation, we directly sum up the weights of the paths with lengths in the range [2, ⟨d⟩], and the parameter is fixed at ε = 10⁻³ following the definition of LP.
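A minimal sketch of the unweighted SD score for one node pair may help make the path-based computation concrete. It rests on several assumptions that the text does not settle: paths are taken to be simple paths (no repeated nodes), each path of length δ contributes ε^(δ−2) divided by the mean degree of its intermediate nodes, and the caller supplies an integer `max_len` in place of ⟨d⟩, since the rounding of the average shortest distance is not specified. The name `sd_score` and the adjacency-dictionary layout are hypothetical.

```python
def sd_score(adj, x, y, max_len, eps=1e-3):
    """Unweighted structure-dependent score of the pair (x, y).

    adj maps each node to the set of its neighbors. Simple paths from x
    to y with 2..max_len hops are enumerated by depth-first search.
    """
    degree = {node: len(nbrs) for node, nbrs in adj.items()}
    score = 0.0
    stack = [(x, [x])]  # (current node, path walked so far)
    while stack:
        node, path = stack.pop()
        hops = len(path) - 1
        for nxt in adj[node]:
            if nxt == y:
                length = hops + 1
                if length >= 2:  # a direct link x-y is not a path of interest
                    inner = [degree[z] for z in path[1:]]
                    mean_degree = sum(inner) / len(inner)
                    score += eps ** (length - 2) / mean_degree
            elif nxt not in path and hops + 1 < max_len:
                stack.append((nxt, path + [nxt]))
    return score
```

Enumerating paths explicitly is only practical for short ranges, which matches the semi-local intent of the index; a weighted variant would replace the unit path contribution and the degrees with path weights and node strengths.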


3.2. Spatial–temporal network position drift model


To model single-node dynamics, a spatial–temporal node position drift model is introduced. By iteratively updating node positions, the new similarity relations of network nodes are inferred. We normalize the link weights through the negative exponential function $w_s = e^{-1/w}$ and use the results as the initial similarities of the position drift model. As nodes are influenced only by their neighbors, the iterative updating of nodes’ network positions can be achieved only through analysis of the neighbors’ influence. The definitions of neighbors’ influence and of the position drift model are presented as follows.

Spatial influence. To model spatial influence, this paper assumes that network nodes are placed in a force field and that every neighbor node attracts the central node. As nodes differ in their connection ability for structure organization, we estimate a neighbor’s spatial influence by a connection ability measure. From an in-depth analysis of group structure, we find that the neighborhoods of group members have more interactions if the group’s clustering level is greater. That is to say, members of groups with a greater clustering level have a higher connection ability. According to Refs. [27,28], network nodes tend to approach closely connected groups and stay away from unstructured ones. Combining this with the above analysis, we find that network nodes tend to connect with nodes of higher connection ability. Apart from connection ability, the bond strength of a node’s neighborhood, defined as connectivity strength, is also a significant predictor in weighted networks. Take Fig. 1(a) as an example: node a has neighbor nodes i and j, and there are multiple edges in the one-hop neighborhood of node i, while j is an isolated node. From the perspective of structure organization, node i has a higher connection ability than node j. Thus node i has a higher influence than node j, and node a will tend to approach node i rather than node j. Moreover, node i has a greater connectivity strength if the links in its one-hop neighborhood have higher weights. To measure neighbors’ influence comprehensively, this paper combines connection ability and connectivity strength. Formally, considering the number and the weights of the links in the one-hop neighborhood of a node, its influence is defined as follows:

$$ AI(i) = cn(i) \cdot st(i) = \sum_{l \in \mathrm{Edge}(\mathrm{ego}(i))} w_l \tag{7} $$


Fig. 1. Illustration of the spatial–temporal position drift model. (a) Graph representation of the spatial interaction within the local topology. In the one-hop neighborhood of node a, there are five neighbor nodes and four components. By analyzing the structure of every component, the neighbor nodes’ spatial influence can be estimated. (b) Graph representation of the temporal interactions of node a. The arrows indicate the timelines corresponding to the network nodes. The temporal information of the network links hints at the evolutionary trend of the local structure and can be used to calculate their temporal influence. (c) The force difference between the spatial–temporal influence and the similarities, used to compute network position drift. The upper subgraph represents the similarity force field and the lower subgraph represents the influence force field. The arrows in the upper subgraph indicate the force difference between similarity force and influence force.

where $cn(i) = |\mathrm{Edge}(\mathrm{ego}(i))|$ is the connection ability of node i, $st(i) = \sum_{l \in \mathrm{Edge}(\mathrm{ego}(i))} w_l \,/\, |\mathrm{Edge}(\mathrm{ego}(i))|$ is the connectivity strength of node i, ego(i) is the set of nodes in the one-hop neighborhood of node i, Edge(ego(i)) is the set of links between the nodes in ego(i), and $w_l$ is the weight of link l.

Besides the interaction between the central node and its surrounding neighbors, there may also be interaction among the neighbors themselves. In that situation, the neighbors and the central node form a triadic closure, and the connected neighbors influence the central node consistently. It is worth noting that the integrated influence of connected neighbors differs from the sum of their respective influences. To measure their integrated influence, this paper analyzes them as a whole and then estimates each neighbor’s influence proportionally from the integrated influence. Take the neighbor nodes m and n in Fig. 1(a) as an example: the connected neighbors m and n have a consistent, common influence on node a. To measure the influence of m and n, we merge them into a virtual node f, as shown in the bottom right corner of Fig. 1(a). We then calculate the influence of node f and estimate the influence of m and n proportionally. Formally, the influence of a connected neighbor is defined as:

$$ AI(i) = |NC| \cdot AI(v) \cdot \frac{AI'(i)}{\sum_{j \in NC} AI'(j)} \tag{8} $$

where NC is the set of connected neighbors, AI(v) is the influence of the virtual node fused from NC, and AI′(i) is the influence of node i, i ∈ NC, in the independent situation. For the nodes m and n in Fig. 1(a),

$$ AI(m) = 2\,AI(f) \cdot \frac{AI'(m)}{AI'(m) + AI'(n)}, \qquad AI(n) = 2\,AI(f) \cdot \frac{AI'(n)}{AI'(m) + AI'(n)}, $$

where AI(f) is the influence of the virtual node f fused from m and n, $AI(f) = \sum_{l \in \mathrm{Edge}(\mathrm{ego}(m)) \cup \mathrm{Edge}(\mathrm{ego}(n))} w_l$, and AI′(m) and AI′(n) are the influences of nodes m and n in the independent situation. Clearly, AI(m) + AI(n) = 2AI(f). It is worth noting that connected neighbors have a common intention and behave consistently in attracting the central node.

Temporal influence. Following the results of Refs. [29,30], this paper assumes that the evolution of networks is continuous and that the future state of an evolving network is more relevant to the current state than to past states. That is to say, newer links are more informative than older ones for future link prediction. For the central node a in Fig. 1(a), the temporal dynamics are illustrated in Fig. 1(b), and the temporal link set of node a can be written chronologically as $l_a = \{l_{aj}, l_{ai}, l_{an}, l_{ak}, l_{am}\}$. The predictive power of a link decreases gradually as its age grows. Accordingly, the temporal influence of link $l_{ai}$ for future link prediction is defined as:

$$ PI(l_{ai}) = \frac{e^{(t(l_{ai}) - \bar{t}(l_a))/2\Delta t}}{1 + e^{(t(l_{ai}) - \bar{t}(l_a))/2\Delta t}} \tag{9} $$

where $\bar{t}(l_a)$ is the average of the timestamps corresponding to $l_a$, $\bar{t}(l_a) = \sum_{j \in N(a)} t(l_{aj}) \,/\, |N(a)|$, N(a) is the neighbor set of node a, and $t(l_{ai})$ is the timestamp of link $l_{ai}$. Δt is the unit time interval, $\Delta t = \bigl( \max_{i \in N(a)} t(l_{ai}) - \min_{j \in N(a)} t(l_{aj}) \bigr) \,/\, |N(a)|$.

Based on the above analysis, the spatial–temporal influence of each neighbor i on the central node a is defined as:

$$ I(i) = \frac{AI(i)}{\max_{k \in N(a)} AI(k)} \cdot \frac{PI(l_{ai})}{\max_{k \in N(a)} PI(l_{ak})} \tag{10} $$
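The influence terms can be sketched in a few lines, under stated assumptions. The code below takes the independent spatial influences AI′ as given inputs (it does not build ego networks), splits a fused influence among connected neighbors in proportion to AI′ as in Eq. (8), computes the logistic temporal influence of Eq. (9), and combines both as in Eq. (10). It reads the unit interval Δt as the timestamp range divided by |N(a)|, and it returns 0.5 for every link when all timestamps coincide; both choices, along with all function names, are assumptions of this example.

```python
import math


def split_group_influence(ai_virtual, ai_independent):
    """Share |NC| * AI(v) among connected neighbors in proportion to AI'."""
    total = sum(ai_independent.values())
    size = len(ai_independent)
    return {i: size * ai_virtual * value / total
            for i, value in ai_independent.items()}


def temporal_influence(timestamps):
    """Logistic temporal influence of each link of the central node.

    timestamps maps neighbor i to the timestamp of the link (a, i).
    """
    n = len(timestamps)
    t_mean = sum(timestamps.values()) / n
    dt = (max(timestamps.values()) - min(timestamps.values())) / n
    if dt == 0:
        # All links share one timestamp: no link is newer than another.
        return {i: 0.5 for i in timestamps}
    # e^x / (1 + e^x) written in the equivalent form 1 / (1 + e^(-x)).
    return {i: 1.0 / (1.0 + math.exp(-(t - t_mean) / (2.0 * dt)))
            for i, t in timestamps.items()}


def combined_influence(spatial, temporal):
    """Product of the max-normalized spatial and temporal influences."""
    max_spatial = max(spatial.values())
    max_temporal = max(temporal.values())
    return {i: (spatial[i] / max_spatial) * (temporal[i] / max_temporal)
            for i in spatial}
```

For instance, `combined_influence(ai, temporal_influence(ts))` with `ai` and `ts` keyed by the same neighbor labels yields one influence value in (0, 1] per neighbor, with the value 1 reached only by a neighbor that is both the most influential spatially and the most recent.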

Table 1
The basic structural properties of the giant component of the six example networks. N and M are the total numbers of nodes and links, respectively. ⟨k⟩ is the average degree of the networks. ⟨d⟩ is the average shortest distance between node pairs. C, Cw and r are the clustering coefficient, weighted clustering coefficient and assortativity coefficient, respectively. H is the degree heterogeneity, defined as H = ⟨k²⟩/⟨k⟩².

| Networks   | N   | M    | ⟨k⟩    | ⟨d⟩   | C     | Cw    | r      | H     |
|------------|-----|------|--------|-------|-------|-------|--------|-------|
| Celegans   | 297 | 1977 | 13.313 | 2.521 | 0.262 | 0.016 | −0.167 | 1.798 |
| Jazz       | 198 | 2523 | 25.484 | 2.282 | 0.569 | 0.051 | 0.028  | 1.392 |
| USAir      | 332 | 1956 | 11.783 | 2.817 | 0.556 | 0.045 | −0.211 | 3.460 |
| MIT        | 96  | 2336 | 48.667 | 1.494 | 0.658 | 0.003 | −0.022 | 1.117 |
| Hypertext  | 113 | 2021 | 36.000 | 1.684 | 0.486 | 0.005 | −0.133 | 1.226 |
| Infectious | 378 | 2544 | 13.460 | 3.482 | 0.445 | 0.015 | 0.235  | 1.403 |

Spatial–temporal position drift model. Having modeled the spatial–temporal influence of neighbor nodes, how should network nodes drift their network positions according to this influence? In fact, every neighbor node attracts the central node towards itself. In order to establish the optimum trade-off among the spatial–temporal influences of the neighbors, the neighborhood of the central node is regarded as a ‘‘force field’’ of node influence. In this force field, the similarities between the central node and its neighbors denote their innate bond strength. By comparing the strength of the neighbors’ influence with the similarities, as shown in Fig. 1(c), the direction and distance of the central node’s position drift can be estimated from the force difference between them. Specifically, if a neighbor’s influence is greater than the similarity between that neighbor and the central node, the similarity increases. Otherwise, the similarity decreases. Based on this idea, we let s(a, i) denote the similarity between nodes a and i, i ∈ N(a), and define ∆s(a, i) to characterize the variation of s(a, i) under the neighbor nodes’ influence:

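$$
\Delta s(a,i) =
\begin{cases}
0, & \text{if } \dfrac{I(i)}{\sum_{k \in N(a)} I(k)} \le \dfrac{s(a,i)}{\sum_{k \in N(a)} s(a,k)}, \\[2ex]
\left( \dfrac{I(i)}{\sum_{k \in N(a)} I(k)} - \dfrac{s(a,i)}{\sum_{k \in N(a)} s(a,k)} \right) \cdot \sum_{k \in N(a)} s(a,k), & \text{if } \dfrac{I(i)}{\sum_{k \in N(a)} I(k)} > \dfrac{s(a,i)}{\sum_{k \in N(a)} s(a,k)}.
\end{cases}
\tag{11}
$$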


Based on the above definition, each node updates its similarities iteratively, and the new similarity relations of current network nodes are finally inferred.


4. Experimental results


In this section, to demonstrate the benefits of the proposed methods, we apply them to real-world networks for empirical analysis.


4.1. Data description


The datasets studied in this paper, comprising three static networks and three evolving networks, are as follows.

(1) Celegans [31]: The neural network of C. elegans.
(2) Jazz [32]: The collaboration network between jazz musicians.
(3) USAir [33]: The network of US air transportation.
(4) MIT [34]: The network contains human contact data among 100 students of the Massachusetts Institute of Technology (MIT), collected by the Reality Mining experiment performed in 2004 as part of the Reality Commons project.
(5) Hypertext [35]: The network represents the face-to-face contacts of the attendees of the ACM Hypertext 2009 conference. A node represents a conference visitor, and an edge represents a face-to-face contact that was active for at least 20 s. Each edge is annotated with the time at which the contact took place.
(6) Infectious [35]: The network describes the face-to-face behavior of people during the exhibition INFECTIOUS: STAY AWAY in 2009 at the Science Gallery in Dublin. Nodes represent exhibition visitors, and edges represent face-to-face contacts that were active for at least 20 s.

The basic structural properties of the networks are summarized in Table 1. The original networks are converted into undirected simple networks.


4.2. Effectiveness of similarity index WSD


To verify the effectiveness of the proposed WSD index, we compare the prediction accuracy of the different similarity indices under the AUC metric and the Precision metric. The results are shown in Tables 2 and 3, respectively. The highest AUC/Precision value for each network is shown in boldface. Under the AUC metric, WSD performs best in four of the six networks and second best in the other two. Under the Precision metric, WSD performs best in all six networks, as shown in Table 3. To further demonstrate the effectiveness of WSD, we compare the prediction precision of the methods for varied percentages of the training set, as shown in Fig. 2. The proposed WSD index is either the best or very close to the best in all six real-world networks. Based on the above analysis, we conclude that the proposed WSD index performs well on the link prediction problem.


Table 2
Comparison of the prediction accuracy under the AUC metric in real-world networks.

| Networks   | WCN    | WAA        | WRA        | WLP        | WSD        |
|------------|--------|------------|------------|------------|------------|
| Celegans   | 0.8625 | 0.8362     | 0.8772     | 0.8187     | **0.9152** |
| Jazz       | 0.9543 | **0.9726** | 0.9566     | 0.9634     | **0.9726** |
| USAir      | 0.9500 | **0.9764** | 0.9588     | **0.9764** | 0.9706     |
| MIT        | 0.7980 | 0.7980     | **0.8571** | 0.8128     | 0.8523     |
| Hypertext  | 0.5485 | 0.6343     | 0.6857     | 0.6114     | **0.7028** |
| Infectious | 0.6851 | 0.7333     | 0.6481     | 0.7685     | **0.7692** |

Each value is obtained by averaging over 100 implementations with independent random divisions of training set (90%) and probe set (10%) in static networks and orderly divisions of training set (90%) and probe set (10%) in evolving networks. The best result achieved for each network is in boldface.

Table 3
Comparison of the prediction accuracy under the Precision metric in real-world networks.

| Networks   | WCN    | WAA    | WRA        | WLP    | WSD        |
|------------|--------|--------|------------|--------|------------|
| Celegans   | 0.1169 | 0.1169 | 0.1111     | 0.0760 | **0.1286** |
| Jazz       | 0.4948 | 0.5022 | 0.4885     | 0.5140 | **0.5705** |
| USAir      | 0.3294 | 0.3588 | 0.3588     | 0.3235 | **0.3941** |
| MIT        | 0.2315 | 0.2413 | 0.4481     | 0.2364 | **0.4482** |
| Hypertext  | 0.1771 | 0.1942 | **0.2571** | 0.1771 | **0.2571** |
| Infectious | 0.0000 | 0.0000 | 0.0365     | 0.0000 | **0.0410** |

Each value is obtained by averaging over 100 implementations with independent random divisions of training set (90%) and probe set (10%) in static networks and orderly divisions of training set (90%) and probe set (10%) in evolving networks. The best result achieved for each network is in boldface.

Fig. 2. Effectiveness of WSD under varied percentage of training set. Each value of the Precision is a result averaged over 100 implementations.

As shown above, different similarity indices yield different results in networks with varied structural properties. As discussed in Section 3.1, neighbor-based indices, including WCN, WAA and WRA, cannot calculate the similarity of nodes without common neighbors, and distance-based indices, such as WLP, cannot capture the similarity of a node pair if their path range is smaller than the shortest distance between the two nodes. Specifically, under the Precision metric in Table 3 and Fig. 2(e), WCN, WAA and WLP fail completely in the Infectious network (their precision in Table 3 is 0.0000), and the performance of WRA is also unsatisfactory. The reason is that the average shortest distance of Infectious is 3.482, which is greater than the path range of 2.0 for WCN, WAA and WRA and of 3.0 for WLP. In contrast, WSD uses the value of the average shortest distance as guidance and thereby obtains better performance.
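The relation between path range and average shortest distance can be checked directly on a graph. The sketch below is an illustration under simplifying assumptions (unweighted, undirected graph stored as an adjacency dictionary with at least one edge; all helper names are hypothetical). It computes the average shortest distance by breadth-first search and the fraction of unlinked node pairs that lie within a given path range, i.e. the pairs to which an index with that range can assign a nonzero score at all.

```python
from collections import deque
from itertools import combinations


def bfs_distances(adj, source):
    """Hop distances from source to every node reachable from it."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def average_shortest_distance(adj):
    """Mean hop distance over all connected pairs of distinct nodes."""
    total = count = 0
    for u in adj:
        for v, d in bfs_distances(adj, u).items():
            if v != u:
                total += d
                count += 1
    return total / count


def reachable_fraction(adj, path_range):
    """Fraction of unlinked pairs whose shortest distance is <= path_range."""
    all_dist = {u: bfs_distances(adj, u) for u in adj}
    unlinked = reachable = 0
    for u, v in combinations(adj, 2):
        if v in adj[u]:
            continue  # already linked, not a prediction candidate
        unlinked += 1
        if all_dist[u].get(v, float("inf")) <= path_range:
            reachable += 1
    return reachable / unlinked


# Toy example: a path graph 0-1-2-3-4.
path_graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
print(average_shortest_distance(path_graph))   # 2.0
print(reachable_fraction(path_graph, 2))       # 0.5
print(reachable_fraction(path_graph, 3))       # 0.8333...
```

On the toy path graph, an index limited to two hops can score only half of the unlinked pairs, which mirrors the situation described for Infectious, where the average shortest distance exceeds the path range of the baseline indices.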


4.3. Robustness of similarity index WSD

In realistic cases, the complete information of network structure is not always available, and many networks have atypical structures. Here we reduce the percentage of the training set continuously to approximate such challenging situations as closely as possible, and we apply WSD to them to test its robustness. Fig. 3 inspects the robustness of WSD by comparing the prediction accuracy of different indices in the six networks. As the percentage of the training set decreases, the average shortest distance of every target network shows a rising trend, as shown in the embedded figures. The results in Fig. 3 show that the larger the average shortest distance is, the lower the prediction precision will be. Thus, prediction precision is negatively correlated with the average shortest distance of the networks. Moreover, the rough range of the average shortest distance in Fig. 3(a), (b), (d) and (e) is 1 to 3, and the prediction precision decreases slowly as the average shortest distance increases. In contrast, the rough range of the average shortest distance in Fig. 3(c) and (f) is 3 to 6, and the prediction precision decreases sharply as the average shortest distance increases. Combining these results with the path range of WCN, WAA, WRA and WLP, we learn that the precision of link prediction depends directly on the path range of the similarity indices, and that prediction accuracy is unsatisfactory if the average shortest distance of the target network is greater than the path range of the similarity indices. Although the prediction precision of all indices declines as the average shortest distance of the target networks increases, WSD achieves better accuracy and is more robust than the other indices.

Fig. 3. Robustness of WSD evaluated by prediction precision under varied average shortest distance. The average shortest distance is calculated when different percentage of training set is available, as shown in the embedded figures. Each value of the Precision is a result averaged over 100 implementations.
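The training sets of varying size used in this experiment rely on the two kinds of division named in the table notes: a random division for static networks and an orderly (time-ordered) division for evolving networks. The sketch below shows one possible way to produce both. The function names, the edge-tuple layout and the seeded `random.Random` generator are assumptions made for illustration.

```python
import random
from typing import List, Sequence, Tuple

Edge = Tuple[int, int]
TimedEdge = Tuple[int, int, float]  # (u, v, timestamp)


def random_split(edges: Sequence[Edge], train_fraction: float,
                 seed: int = 0) -> Tuple[List[Edge], List[Edge]]:
    """Random division into training and probe sets (static networks)."""
    shuffled = list(edges)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


def ordered_split(edges: Sequence[TimedEdge],
                  train_fraction: float) -> Tuple[List[TimedEdge], List[TimedEdge]]:
    """Time-ordered division: the latest links form the probe set (evolving networks)."""
    ordered = sorted(edges, key=lambda e: e[2])
    cut = int(round(train_fraction * len(ordered)))
    return ordered[:cut], ordered[cut:]


# Toy example: ten links, 90% training and 10% probe.
static_edges = [(i, i + 1) for i in range(10)]
train, probe = random_split(static_edges, 0.9, seed=42)
print(len(train), len(probe))        # 9 1

timed_edges = [(i, i + 1, float(10 - i)) for i in range(10)]
train_t, probe_t = ordered_split(timed_edges, 0.9)
print(probe_t)                       # [(0, 1, 10.0)], the most recent link
```

Lowering `train_fraction` step by step reproduces the kind of shrinking training set described above; repeating `random_split` with different seeds gives independent divisions to average over.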


4.4. The effectiveness of position drift model on networks’ evolution prediction


Based on the network structure in the training interval, network nodes update their network positions iteratively according to the spatial–temporal position drift model defined in Eq. (11), and the future similarity relations of network nodes are inferred. The larger the number of iterations, the longer the prediction period. Based on the inferred similarity relations of network nodes, the similarities of unlinked node pairs are estimated with the similarity indices. As different iteration numbers of the position drift model could lead to different future similarity relations and different prediction accuracy, we study the influence of the iteration number on evolving network prediction. The result is plotted in Fig. 4, with the iteration number ranging from 0 to 5 in the three evolving real-world networks. The overall optimal prediction precision is obtained when the iteration number equals 3.

Next, we test the effectiveness of the position drift model in the evolution prediction of dynamic networks. For the same similarity index and the same number of iterations, the inferred future similarity relations of network nodes lead to better future link prediction if they are closer to the real situation. Thus we evaluate the performance of the position drift model by the prediction accuracy of future links. Table 4 shows the effectiveness of the position drift model on future link prediction under the AUC metric. From Table 4, most similarity indices obtain a higher AUC value in the drifted networks than in the original networks; the exceptions are WAA and WLP in Infectious, whose AUC decreases, and WRA in Hypertext, whose AUC is unchanged. Moreover, as shown in Fig. 5, we compare the prediction precision calculated in the drifted networks with that obtained in the original networks under the Precision metric. From Fig. 5, the prediction results in the drifted network structure are either close to or better than those obtained in the original network structure. Thus, the results in Table 4 and Fig. 5 indicate that the spatial–temporal position drift model is effective for evolving network prediction.
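The iteration-number study can be organized as a simple scan over 0 to 5 drift steps. The sketch below shows only that outer loop. It does not reproduce Eq. (11): `drift_step` and `evaluate` are hypothetical callables standing in for one application of the position drift update and for a prediction-accuracy measure such as Precision on the probe set.

```python
from typing import Callable, List, Tuple, TypeVar

State = TypeVar("State")


def scan_iterations(
    initial: State,
    drift_step: Callable[[State], State],  # hypothetical: one drift update
    evaluate: Callable[[State], float],    # hypothetical: accuracy of a state
    max_iter: int = 5,
) -> Tuple[int, List[float]]:
    """Score the state after 0..max_iter drift steps and return the best count."""
    scores = [evaluate(initial)]           # iteration 0 is the original network
    state = initial
    for _ in range(max_iter):
        state = drift_step(state)
        scores.append(evaluate(state))
    best = max(range(len(scores)), key=lambda k: scores[k])
    return best, scores


# Toy stand-ins: the "state" is an integer and accuracy peaks after three steps.
best, scores = scan_iterations(0, lambda s: s + 1, lambda s: -(s - 3) ** 2)
print(best)     # 3
print(scores)   # [-9, -4, -1, 0, -1, -4]
```

With real data, `initial` would hold the similarity relations derived from the training interval, and the returned list would correspond to one curve of the kind plotted against the iteration number.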

Fig. 4. The influence of the iteration number of position drift model on link prediction: (a) Precision with different iteration number; (b) AUC with different iteration number. Each value is a result averaged over 100 independent implementations with orderly divisions of training set (90%) and probe set (10%) in the three evolving networks.

Fig. 5. The effectiveness of position drift model on future links prediction under the Precision metric. The horizontal axis denotes the percentage of training set. Each value of the Precision is a result averaged over 100 independent implementations. The iteration number of position drift is fixed as 3.

Table 4
The effectiveness of position drift model on future links prediction under the AUC metric.

| Networks | WCN | WCN′ | WAA | WAA′ | WRA | WRA′ | WLP | WLP′ | WSD | WSD′ |
|---|---|---|---|---|---|---|---|---|---|---|
| MIT | 0.798 | 0.837 | 0.798 | 0.817 | 0.857 | 0.861 | 0.812 | 0.843 | 0.852 | 0.866 |
| Hypertext | 0.548 | 0.691 | 0.634 | 0.662 | 0.685 | 0.685 | 0.611 | 0.622 | 0.702 | 0.720 |
| Infectious | 0.685 | 0.693 | 0.733 | 0.685 | 0.648 | 0.731 | 0.768 | 0.676 | 0.769 | 0.777 |

Each value is obtained by averaging over 100 implementations with timestamps based orderly divisions of training set (90%) and probe set (10%). The iteration number of position drift is fixed as 3. Here WCN′, WAA′, WRA′, WLP′, WSD′ denote the prediction of WCN, WAA, WRA, WLP, WSD on drifted networks.
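The AUC values of Table 4 can be summarized with a few lines of arithmetic. The sketch below copies the table into a dictionary of (original, drifted) pairs and sorts each index/network combination into improved, unchanged or worse. Only the data come from Table 4; the dictionary layout and the function name are assumptions.

```python
# (original AUC, drifted AUC) for every network and index, copied from Table 4.
TABLE4 = {
    "MIT": {"WCN": (0.798, 0.837), "WAA": (0.798, 0.817), "WRA": (0.857, 0.861),
            "WLP": (0.812, 0.843), "WSD": (0.852, 0.866)},
    "Hypertext": {"WCN": (0.548, 0.691), "WAA": (0.634, 0.662), "WRA": (0.685, 0.685),
                  "WLP": (0.611, 0.622), "WSD": (0.702, 0.720)},
    "Infectious": {"WCN": (0.685, 0.693), "WAA": (0.733, 0.685), "WRA": (0.648, 0.731),
                   "WLP": (0.768, 0.676), "WSD": (0.769, 0.777)},
}


def summarize(table):
    """Group (network, index, gain) triples by the sign of the AUC gain."""
    improved, unchanged, worse = [], [], []
    for network, row in table.items():
        for index, (original, drifted) in row.items():
            gain = round(drifted - original, 3)  # the table has three decimals
            if gain > 0:
                improved.append((network, index, gain))
            elif gain < 0:
                worse.append((network, index, gain))
            else:
                unchanged.append((network, index, gain))
    return improved, unchanged, worse


improved, unchanged, worse = summarize(TABLE4)
print(len(improved), len(unchanged), len(worse))   # 12 1 2
print(worse)   # WAA and WLP on Infectious
```

Of the 15 combinations, 12 improve after drifting, one is unchanged (WRA on Hypertext) and two decrease (WAA and WLP on Infectious); WSD improves on all three networks.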


Table 5
Computation time (in milliseconds) comparison in the real-world networks.

Link Prediction

| Networks | WCN | WAA | WRA | WLP | WSD |
|---|---|---|---|---|---|
| Celegans | 654 | 1670 | 1645 | 38700 | 38870 |
| Jazz | 710 | 1890 | 2105 | 42300 | 49920 |
| USAir | 760 | 2650 | 2640 | 65525 | 66350 |
| MIT | 155 | 1400 | 1410 | 25570 | 24280 |
| Hypertext | 192 | 1460 | 1440 | 27600 | 26600 |
| Infectious | 1030 | 1430 | 1380 | 43500 | 43850 |

Position Drift Model

| Networks | iter 1 | iter 2 | iter 3 | iter 4 | iter 5 |
|---|---|---|---|---|---|
| MIT | 49250 | 98400 | 147600 | 197300 | 248450 |
| Hypertext | 23340 | 47050 | 70000 | 93300 | 116650 |
| Infectious | 8400 | 16400 | 24400 | 32560 | 40740 |

Each value is obtained by averaging over 100 implementations.


4.5. Computation time


Table 5 presents the computation time of the link prediction indices and of the network position drift model with varied iteration numbers. As WAA and WRA are variants of WCN, their computation time is close to the smallest in the real-world networks. WLP captures the information of paths with two and three hops and needs more computation time than WCN, WAA and WRA. The WSD index has the same order of magnitude of computation time as WLP in all networks. Moreover, in the network position drift model each node displaces its network position according to the influence of its neighbors, so the computational complexity of the model is sensitive to the average degree of the target network. From Table 5, the computation time of the position drift model in the three evolving networks decreases as the average degree of the networks declines. Since real-world networks are mostly sparse, the network position drift model is practical in realistic applications from the perspective of computation time. In summary, the computation times of all methods in the networks are less than 5 min, and the proposed prediction index and the position drift model can handle the networks in a reasonable time.
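The 5 min bound and the growth of the drift time with the iteration number can be verified from Table 5. The sketch below assumes that each "iter k" column reports the total time for k iterations, so that differences between neighboring columns give the cost of a single iteration. The variable and function names are illustrative.

```python
# Position drift model times in milliseconds, copied from Table 5 (iter 1..5).
DRIFT_MS = {
    "MIT": [49250, 98400, 147600, 197300, 248450],
    "Hypertext": [23340, 47050, 70000, 93300, 116650],
    "Infectious": [8400, 16400, 24400, 32560, 40740],
}
# Largest link prediction entry in Table 5 (WSD on USAir), in milliseconds.
LINK_PREDICTION_MAX_MS = 66350


def per_iteration_ms(cumulative):
    """Cost of each iteration, assuming the columns are cumulative totals."""
    return [b - a for a, b in zip([0] + cumulative, cumulative)]


def to_minutes(milliseconds):
    return milliseconds / 60000.0


for network, times in DRIFT_MS.items():
    print(network, per_iteration_ms(times))
# MIT [49250, 49150, 49200, 49700, 51150]
# Hypertext [23340, 23710, 22950, 23300, 23350]
# Infectious [8400, 8000, 8000, 8160, 8180]

slowest = max(max(times) for times in DRIFT_MS.values())
slowest = max(slowest, LINK_PREDICTION_MAX_MS)
print(round(to_minutes(slowest), 2))   # 4.14
```

Under that assumption the cost per iteration is nearly constant for each network, so the total time grows roughly linearly with the iteration number, and the slowest entry (five iterations on MIT) takes about 4.14 min, consistent with the stated bound.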


5. Discussion and conclusion


The evolution of dynamic networks has been widely studied, and many link prediction algorithms have been proposed in the literature in the last decade. However, the evolution prediction of dynamic networks still remains challenging. To address this issue, this paper first proposed a robust similarity index for ranking candidate links, and then developed a network position drift model to infer the future similarity relations of network nodes in evolving networks. We tested the methods in six real-world networks, and the experimental results show that the proposed methods give satisfactory predictions with reasonable computation time. The model developed in this paper, though simple, captures several theoretically important features of network evolution, such as the clustering effect and the novelty effect. It is worth noting that the proposed position drift model does not adopt the clustering coefficient to measure the influence of neighbor nodes, because the clustering coefficients of nodes that have no triadic closures are low and cannot reflect their practical connection ability even if they have high degree. Moreover, for simplicity, the computation of node influence only considers the one-hop neighborhood of network nodes.

A number of interesting extensions could be pursued in the near future. An important aspect of further work is how to optimize the methods so that they adapt to large-scale networks. Moreover, analyzing the effect of network structure over a larger range may model the spatial factor more precisely and improve the accuracy of evolution prediction. In addition, the change trend of interaction intervals may play an important role in modeling the temporal factor if exact temporal information is available. Finally, this paper investigates the similarity of node pairs dynamically. We hope the method used herein can be further extended to the study of multi-layer complex networks [36].


Acknowledgments


The work was supported partially by the National Natural Science Foundation of China (Grant No. 61202255), University-Industry Cooperation Projects of Guangdong Province (Grant No. 2012A090300001) and the Pre-research Project (Grant No. 51306050102). We thank Tao Zhou and Junming Shao for their advice. We also thank Chao Fan and Yuanping Zhang for enlightening discussions and careful reading of the manuscript. The authors also wish to thank the anonymous reviewers for their thorough review and highly appreciate their useful comments and suggestions.


References


[1] D. Grady, C. Thiemann, D. Brockmann, Robust classification of salient links in complex networks, Nature Commun. 3 (2012) 864.
[2] D.B. Chen, M.S. Shang, Z.H. Lv, Y. Fu, Detecting overlapping communities of weighted networks via a local algorithm, Physica A 389 (19) (2010) 4177–4187.
[3] T. Wu, Y.X. Guo, L.T. Chen, Y.B. Liu, Integrated structure investigation in complex networks by label propagation, Physica A 448 (2016) 68–80.
[4] A.X. Cui, Z.K. Zhang, M. Tang, P.M. Hui, Y. Fu, Emergence of scale-free close-knit friendship structure in online social networks, PLoS One 7 (12) (2012) e50702.
[5] Y. Pan, D.H. Li, J.G. Liu, J.Z. Liang, Detecting community structure in complex networks via node similarity, Physica A 389 (14) (2010) 2849–2857.
[6] T. Wu, L.T. Chen, X.P. Xian, Y.X. Guo, Full-scale cascade dynamics prediction with a local-first approach, arXiv:1512.08455.
[7] K. Gong, M. Tang, P.M. Hui, H.F. Zhang, Y. Do, Y.C. Lai, An efficient immunization strategy for community networks, PLoS One 8 (12) (2013) e83489.
[8] X.Z. Zhu, H. Tian, S.M. Cai, J.M. Huang, T. Zhou, Predicting missing links via significant paths, Europhys. Lett. 106 (1) (2014) 18008.
[9] X.Z. Zhu, H. Tian, S.M. Cai, Predicting missing links via effective paths, Physica A 413 (2014) 515–522.
[10] J.G. Liu, K. Shi, Q. Guo, Solving the accuracy-diversity dilemma via directed random walks, Phys. Rev. E 85 (1) (2012) 016118.
[11] J.G. Liu, L. Hou, X. Pan, Q. Guo, T. Zhou, Stability of similarity measurements for bipartite networks, Sci. Rep. 6 (2016) 18653.
[12] P. Zhang, F.T. Wang, X. Wang, A. Zeng, J.H. Xiao, The reconstruction of complex networks with community structure, Sci. Rep. 5 (2015) 17287.
[13] H. Liao, A. Zeng, Reconstructing propagation networks with temporal similarity, Sci. Rep. 5 (2015) 11404.
[14] Q. Guo, W.J. Song, J.G. Liu, Ultra accurate collaborative information filtering via directed user similarity, Europhys. Lett. 107 (1) (2014) 18001.
[15] X. Pan, G.S. Deng, J.G. Liu, Information filtering via improved similarity definition, Chin. Phys. Lett. 27 (6) (2010) 068903.
[16] L.Y. Lv, L.M. Pan, T. Zhou, Y.C. Zhang, H.E. Stanley, Toward link predictability of complex networks, Proc. Natl. Acad. Sci. 112 (2015) 2325–2330.
[17] D. Liben-Nowell, J. Kleinberg, The link-prediction problem for social networks, J. Assoc. Inf. Sci. Technol. 58 (7) (2007) 1019–1031.
[18] L.A. Adamic, E. Adar, Friends and neighbors on the web, Social Networks 25 (3) (2003) 211–230.
[19] T. Zhou, L.Y. Lv, Y.C. Zhang, Predicting missing links via local information, Eur. Phys. J. B 71 (4) (2009) 623–630.
[20] E.A. Leicht, P. Holme, M.E.J. Newman, Vertex similarity in networks, Phys. Rev. E 73 (2) (2006) 026120.
[21] L. Katz, A new status index derived from sociometric analysis, Psychometrika 18 (1) (1953) 39–43.
[22] A. Clauset, C. Moore, M.E.J. Newman, Hierarchical structure and the prediction of missing links in networks, Nature 453 (7191) (2008) 98–101.
[23] B. Karrer, M.E.J. Newman, Stochastic blockmodels and community structure in networks, Phys. Rev. E 83 (1) (2011) 016107.
[24] M. Zaki, S. Salem, V. Chaoji, M. Al Hasan, Link prediction using supervised learning, Procedia Eng. 30 (9) (2012) 798–805.
[25] A.K. Menon, C. Elkan, Link prediction via matrix factorization, in: ECML PKDD'11, pp. 437–452.
[26] Y.X. Zhu, L.Y. Lv, Q.M. Zhang, T. Zhou, Uncovering missing links with cold ends, Physica A 391 (22) (2012) 5769–5778.
[27] S.R. Kairam, D.J. Wang, J. Leskovec, The life and death of online groups: predicting group growth and longevity, in: WSDM'12, pp. 673–682.
[28] J.Z. Qiu, Y. Li, J. Tang, Z. Lu, H. Ye, B. Chen, Q. Yang, J.E. Hopcroft, The lifecycle and cascade of WeChat social messaging groups, in: WWW'16, pp. 311–320.
[29] J. Leskovec, L. Backstrom, R. Kumar, A. Tomkins, Microscopic evolution of social networks, in: KDD'08, pp. 462–470.
[30] G. Palla, A.L. Barabási, T. Vicsek, Quantifying social group evolution, Nature 446 (2007) 664–667.
[31] D.J. Watts, S.H. Strogatz, Collective dynamics of small-world networks, Nature 393 (6684) (1998) 440–442.
[32] P.M. Gleiser, L. Danon, Community structure in jazz, Adv. Complex Syst. 6 (04) (2003) 565–573.
[33] Pajek datasets, 2006. Available: http://vlado.fmf.uni-lj.si/pub/networks/data/.
[34] N. Eagle, A.S. Pentland, Reality mining: Sensing complex social systems, Pers. Ubiquitous Comput. 10 (4) (2006) 255–268.
[35] L. Isella, J. Stehlé, A. Barrat, C. Cattuto, J.F. Pinton, W. Van den Broeck, What's in a crowd? Analysis of face-to-face behavioral networks, J. Theoret. Biol. 271 (1) (2011) 166–180.
[36] W. Wang, M. Tang, H. Yang, Y. Do, Y.C. Lai, G.W. Lee, Asymmetrically interacting spreading dynamics on complex layered networks, Sci. Rep. 4 (2014) 5097.
