Extended resource allocation index for link prediction of complex network


Accepted Manuscript

Shuxin Liu, Xinsheng Ji, Caixia Liu, Yi Bai

PII: S0378-4371(17)30199-1
DOI: http://dx.doi.org/10.1016/j.physa.2017.02.078
Reference: PHYSA 18058
To appear in: Physica A
Received date: 23 April 2016
Revised date: 31 January 2017

Please cite this article as: S. Liu, X. Ji, C. Liu, Y. Bai, Extended resource allocation index for link prediction of complex network, Physica A (2017), http://dx.doi.org/10.1016/j.physa.2017.02.078

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

*Highlights (for review)

1) The similarity of two endpoints depends on the potential resource exchanged between them.
2) Resources transferred by both common neighbors and non-common-neighbors are considered.
3) The index is self-adaptive: it adjusts the resource carried by longer paths.
4) The proposed method is well suited to large-scale networks.
5) It performs well under two standard metrics, AUC and precision.


Shuxin Liu a,*, Xinsheng Ji a,b,c, Caixia Liu a, Yi Bai a

a National Digital Switching System Engineering and Technological R&D Center, Zhengzhou 450002, P. R. China
b National Mobile Communications Research Laboratory, Southeast University, Nanjing 211189, P. R. China
c Wireless Technology Institute, National Engineering Laboratory for Mobile Network Security, Beijing 100876, P. R. China

Abstract

Recently, a number of similarity-based methods have been proposed to predict missing links in complex networks. Among these indices, the resource allocation index performs very well with low time complexity. However, it ignores the potential resources transferred along local paths between two endpoints. Motivated by the resource exchange that takes place between endpoints, an extended resource allocation index is proposed. An empirical study on twelve real networks and three synthetic dynamic networks shows that the proposed index achieves good performance compared with eight mainstream baselines.

PACS: 89.20.Ff; 89.75.Hc; 89.65.-s

Keywords: Link prediction; Complex network; Resource exchange; Similarity index

* Corresponding author. Tel.: +8615939045554; E-mail: [email protected]; [email protected]

1. Introduction

With the continuous development of complex network theory, many complex systems in nature and society are described as complex networks [1-5]. As a research hotspot in this field, link prediction, which aims to estimate the likelihood that a link exists between two nodes [6-7], has attracted increasing attention in recent years. Predicting missing links can help discover unknown interactions in protein-protein interaction networks [8], recommend new friends to users in online social networks [9] and identify spurious connections in author-paper bipartite networks [10].

Plenty of link prediction methods have been proposed based on evolution mechanisms [11-20]. Among these methods, similarity-based prediction methods, especially topology-based similarity indices, have received close attention from researchers due to their simplicity and effectiveness [21]. Similarity methods based on topological structure assume that the similarity of two endpoints is positively correlated with the number of paths or the amount of resources transferred between them [15]; they can be classified as neighbor-based and path-based methods. There are many neighbor-based similarity indices for link prediction, such as the simplest Common Neighbors (CN) index [11], which counts the number of common neighbors, and the Adamic-Adar (AA) index [12] and the resource allocation (RA) index [13], which add the node degree information of the common neighbors. To address the limited path information used in neighbor-based indices, the Katz index [14], effective path index [15], significant path index [16], Average Commute Time (ACT) [17], Cosine Similarity Time (Cos+) [18] and SimRank [19] were proposed, which consider global topological information between endpoints. Most of these global indices perform well in real networks, but they are not suitable for large networks due to their high time complexity [21]. To achieve a compromise between complexity and performance, Zhou et al. proposed the Local Path index (LP) [20], which adds the paths of length 3 to the CN index. Many empirical studies show that indices considering local paths can achieve good performance with lower complexity in link prediction of real networks. However, most similarity indices ignore the potential resource exchange between endpoints through local paths.

Fig. 1: The resource transferred by common neighbors with different topology.

In the real world, for example in online social networks, resources such as hot topics are disseminated between strangers through the friends around them, and the more often news spreads between two strangers via their friends, the higher the likelihood that they become friends. Take Fig. 1 as an example: there are two pairs of endpoints ($x$ and $y$, $i$ and $j$), and nodes $a$ and $b$ are their common neighbors with the same node degree. According to the RA index, the likelihood that link $L_{xy}$ exists is the same as that of link $L_{ij}$, although link $L_{xy}$ is more likely to exist in reality (because there are more possible paths between $x$ and $y$). In real networks, the resources in each node are divided into several pieces and then flow to its neighbors when resources flow constantly [40]. Therefore, if a new hot topic is spread from nodes $x$ and $i$ at the same time, endpoint $y$ can receive more news, or more resources, due to the “branches” of its common neighbor $b$. Evidently, besides the “trunks” of common neighbors ($L_{by}$) considered by the RA index, the resource exchange through longer paths also plays an important role in describing the similarity between two endpoints.

Based on the above discussion, we propose an extended resource allocation index (ERA), which adds longer paths to the RA index. With a parameter that adjusts the amount of resource transferred by longer paths for different networks, the ERA index measures similarity by the amount of resources exchanged between two endpoints through common neighbors and non-common-neighbors. Our empirical study shows that the proposed ERA index improves the prediction accuracy compared with eight mainstream baselines on fifteen datasets.

The rest of the paper is organized as follows: in Section 2, the extended resource allocation index is introduced; in Section 3, the metrics are described; in Section 4, the 15 experimental datasets are described; in Section 5, eight mainstream baselines are introduced and the results are compared under two standard metrics, AUC and precision; finally, we draw conclusions.

2. The extended resource allocation index

Consider an unweighted undirected network $G(V, E)$, where $V$ and $E$ are the sets of nodes and links respectively. A score $s_{xy}$ is assigned to each pair of nodes $x$ and $y$ according to a given similarity index. This score measures the similarity between them, and the score of each nonexistent link represents the likelihood that the link exists.

In the RA index, when calculating the amount of resource transferred by common neighbors between nodes $x$ and $y$, it is assumed that the resources of a common neighbor are allocated equally to each of its links. However, in the real world, the longer the path between two endpoints, the higher the uncertainty of the resource allocated through the path. Because this uncertainty differs between real networks, we introduce a parameter to adjust the amount of resource transferred by longer paths (longer than 2). As shown in Fig. 2, if node $x$ has one unit of resource and distributes it to node $y$ through its neighbor $v_1$, the amount of resource $y$ receives from $x$ is denoted as $\sigma / k_{v_1}$ ($\sigma$ is the adjust parameter). Similarly, if node $y$ distributes one unit of resource to node $x$ through its neighbor $v_n$, the amount of resource $x$ receives from $y$ is denoted as $1 / k_{v_n}$. In this paper, taking the time complexity into account, we only consider local paths between endpoints with length shorter than or equal to 3.

Fig. 2: The amount of resource transferred through longer paths.

Considering that all resources are transferred through neighbors and that different neighbors transfer different amounts of resource, we divide the neighbors of two endpoints into two classes: common neighbors and non-common-neighbors. Non-common-neighbors and the extended resource allocation index are then defined as follows:

Fig. 3: The resource exchange between nodes.

Definition 1. Consider a pair of nodes $x, y \in V$. Let $z$ be a neighbor of node $x$ that is not a common neighbor of nodes $x$ and $y$. Because each neighbor of node $x$ may be a resource carrier for node $x$, node $z$ is defined as a non-common-neighbor (such as nodes $b$ and $c$ shown in Fig. 3) between nodes $x$ and $y$. $\bar{C}_{xy}$ is the set of all non-common-neighbors between nodes $x$ and $y$, and $C_{x|y}$ is the set of non-common-neighbors in $\bar{C}_{xy}$ that are directly connected to node $x$ ($\bar{C}_{xy} = C_{x|y} \cup C_{y|x}$).

Definition 2. Consider a pair of nodes $x, y \in V$, and let $z$ be a common neighbor of nodes $x$ and $y$. In the simplest case, we assume that node $x$ has one unit of resource and distributes it to its neighbors [13]. The amount of resource $y$ receives from $x$ through common neighbors is defined as

$$R(x \to y) = \sum_{z \in C_{xy}} \frac{1 + \sigma \cdot n_{zy}}{k_z} \tag{1}$$

where $C_{xy}$ is the set of common neighbors of $x$ and $y$, $k_z$ represents the degree of node $z$, $\sigma$ is the adjust parameter and $n_{zy}$ denotes the number of common neighbors of $z$ and $y$. Taking Fig. 3 as an example, the amount of resource $y$ receives from $x$ through the common neighbor $a$ is $(1 + 2\sigma) / k_a$.

Definition 3. Consider a pair of nodes $x, y \in V$, and let $z$ be a neighbor of node $x$ that is also a non-common-neighbor between nodes $x$ and $y$ ($z \in C_{x|y}$). If node $x$ has one unit of resource and distributes it to its neighbors, the amount of resource $y$ receives from $x$ through non-common-neighbors is

$$\bar{R}(x \to y) = \sum_{z \in C_{x|y}} \frac{\sigma \cdot n_{zy}}{k_z} \tag{2}$$

Here, all local paths passing through non-common-neighbors have length three. Taking Fig. 3 as an example, the amount of resource $y$ receives from $x$ through the non-common-neighbor $b$ is $\sigma / k_b$.

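Definition 4. On an unweighted undirected network $G(V, E)$, the total extended resource allocation index is composed of all the resource exchange between endpoints $x$ and $y$, defined as

$$
\begin{aligned}
s_{xy}^{ERA} &= R(x \to y) + \bar{R}(x \to y) + R(y \to x) + \bar{R}(y \to x) \\
&= \sum_{z \in C_{xy}} \frac{1 + \sigma \cdot n_{zy}}{k_z} + \sum_{z \in C_{x|y}} \frac{\sigma \cdot n_{zy}}{k_z} + \sum_{z \in C_{xy}} \frac{1 + \sigma \cdot n_{zx}}{k_z} + \sum_{z \in C_{y|x}} \frac{\sigma \cdot n_{zx}}{k_z} \\
&= \sum_{z \in C_{xy}} \frac{2 + \sigma \cdot (n_{zy} + n_{zx})}{k_z} + \sum_{z \in \bar{C}_{xy}} \frac{\sigma \cdot (n_{zy} + n_{zx})}{k_z}
\end{aligned}
\tag{3}
$$

where $C_{xy}$ is the set of common neighbors of $x$ and $y$, and $\bar{C}_{xy} = C_{x|y} \cup C_{y|x}$ is the set of their non-common-neighbors. For the last line to equal the line above it, only the term defined in Eq. (2) is counted for each non-common-neighbor: $n_{zy}$ for $z \in C_{x|y}$ and $n_{zx}$ for $z \in C_{y|x}$.

Here, the parameter $\sigma \geq 0$ adjusts the proportion of resource allocated to each longer branch. When $\sigma = 0$, the ERA index is equivalent to the RA index [13] (every score is exactly twice the RA score, so the ranking of node pairs is unchanged). The amounts of resource transmitted between nodes $x$ and $y$ in Fig. 3 are $R(x \to y) = (2\sigma + 1)/4$, $R(y \to x) = 1/4$, $\bar{R}(x \to y) = \sigma/2$ and $\bar{R}(y \to x) = \sigma/3 + \sigma/3 + \sigma/2 = 7\sigma/6$, respectively.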
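The four directional terms of Eq. (3) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`build_adjacency`, `directed_resource`, `era_score`) and the seven-edge toy graph are hypothetical, and $n_{zy}$ is taken literally as the number of common neighbors of $z$ and $y$.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Return {node: set(neighbors)} for an unweighted undirected edge list."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def directed_resource(adj, src, dst, sigma):
    """Resource that dst receives from src: Eq. (1) plus Eq. (2)."""
    total = 0.0
    for z in adj[src]:
        if z == dst:
            continue  # scores are meant for pairs that are not yet linked
        k_z = len(adj[z])                 # degree of the carrier z
        n_z_dst = len(adj[z] & adj[dst])  # common neighbors of z and dst
        if z in adj[dst]:
            # z is a common neighbor of src and dst: Eq. (1)
            total += (1.0 + sigma * n_z_dst) / k_z
        else:
            # z is a non-common-neighbor attached to src: Eq. (2)
            total += sigma * n_z_dst / k_z
    return total


def era_score(adj, x, y, sigma):
    """Sum of the resource exchanged in both directions, second line of Eq. (3)."""
    return directed_resource(adj, x, y, sigma) + directed_resource(adj, y, x, sigma)


# Hypothetical toy graph (not the graph of Fig. 3).
toy_edges = [("x", "a"), ("a", "y"), ("x", "b"), ("b", "c"),
             ("c", "y"), ("a", "d"), ("d", "y")]
toy_adj = build_adjacency(toy_edges)
score_ra_like = era_score(toy_adj, "x", "y", sigma=0.0)
score_era = era_score(toy_adj, "x", "y", sigma=0.6)
```

On this toy graph the only common neighbor of `x` and `y` is `a`, with degree 3, so the RA score is 1/3 and `era_score` with `sigma=0.0` gives twice that value. Using set intersections keeps the cost per pair proportional to the degrees of the neighbors involved, which matches the local-path character of the index.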

3. Metrics

To quantify the prediction accuracy of different methods, we use two standard metrics: the area under the receiver operating characteristic curve (AUC) [22] and precision [23]. The AUC value can be interpreted as the probability that the score given to a randomly chosen missing link is higher than that of a randomly chosen non-existent link [21]. Each time, a missing link and a non-existent link are randomly picked and their scores given by the algorithm are compared. If, among $n$ independent comparisons, the missing link has a higher score $n'$ times and the two links have the same score $n''$ times, the AUC value of the algorithm is

$$AUC = \frac{n' + 0.5\, n''}{n} \tag{4}$$
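The sampling procedure behind Eq. (4) can be sketched as follows. The function name `auc_by_sampling`, its arguments and the default of 10000 comparisons are assumptions made for illustration; the source does not state how many comparisons were drawn.

```python
import random


def auc_by_sampling(score, probe_links, nonexistent_links, n=10000, seed=0):
    """Estimate AUC as (n' + 0.5 * n'') / n from n random comparisons.

    score: callable (u, v) -> float, any similarity index.
    probe_links: list of (u, v) missing links (held out from training).
    nonexistent_links: list of (u, v) pairs that are links in neither set.
    """
    rng = random.Random(seed)
    n_higher = 0  # n': the missing link scored strictly higher
    n_equal = 0   # n'': both links received the same score
    for _ in range(n):
        s_missing = score(*rng.choice(probe_links))
        s_absent = score(*rng.choice(nonexistent_links))
        if s_missing > s_absent:
            n_higher += 1
        elif s_missing == s_absent:
            n_equal += 1
    return (n_higher + 0.5 * n_equal) / n


# Tiny usage example with hand-made scores (hypothetical values).
demo_scores = {("a", "b"): 1.0, ("c", "d"): 0.0}
auc_perfect = auc_by_sampling(lambda u, v: demo_scores[(u, v)],
                              [("a", "b")], [("c", "d")], n=100)
auc_ties = auc_by_sampling(lambda u, v: 0.0, [("a", "b")], [("c", "d")], n=100)
```

A perfect ranking yields 1.0 and a constant score yields 0.5, because every comparison is then a tie.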

Clearly, if all the scores are randomly given, $AUC \approx 0.5$. Therefore, the further the value exceeds 0.5, the better the algorithm performs. Precision is defined as the ratio of relevant links to the top $L$ predicted links. If $m$ of the top $L$ predicted links are relevant, that is, they appear in the probe set, the precision value is

$$Precision = \frac{m}{L} \tag{5}$$

A higher precision means higher prediction accuracy. Here, we set $L = 100$.

Table 1: The basic topological features of the twelve real networks and three synthetic dynamic networks. |V| is the number of nodes, |E| denotes the number of links, ⟨k⟩ indicates the average degree, ⟨d⟩ denotes the average distance, r is the assortativity coefficient [38], C represents the clustering coefficient [39] and H is the degree heterogeneity.

Network     |V|    |E|     ⟨k⟩     ⟨d⟩    r        C       H
Jazz        198    2742    27.7    2.24    0.020   0.633   1.4
FW          69     880     25.51   1.64   -0.298   0.552   1.27
USAir       332    2128    12.81   2.74   -0.208   0.749   3.36
Hamster     1858   12534   13.49   3.39   -0.085   0.090   3.36
Yeast       2375   11693   9.85    5.10    0.469   0.378   3.48
PB          1222   16717   27.36   2.74   -0.221   0.361   2.97
Flight      2939   30501   20.75   4.18    0.051   0.255   5.22
Infect      410    2765    13.49   3.63    0.226   0.456   1.39
AIDS-Blog   146    180     2.47    3.42   -0.725   0.052   5.99
Email       167    5784    69.26   1.87   -0.295   0.541   1.66
UcSocial    1899   13838   14.57   3.06   -0.188   0.109   3.82
Figeys      2239   6432    5.76    3.98   -0.331   0.040   9.75
Ba-1        800    1727    4.32    3.14   -0.242   0.211   6.14
Ba-2        1200   2527    4.21    3.27   -0.229   0.172   6.99
Ba-3        2000   4123    4.12    3.40   -0.220   0.144   8.43
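Precision as defined in Eq. (5) can be sketched in the same spirit. The helper name `precision_at_L` and the dictionary layout of the scores are assumptions; ties are broken by insertion order here, a detail the source does not specify.

```python
def precision_at_L(scores, probe_links, L=100):
    """Fraction m / L of the top-L scored candidate links found in the probe set.

    scores: dict mapping each candidate link (u, v) to its similarity score;
            candidates are the node pairs that are not linked in the training set.
    probe_links: iterable of held-out links (u, v); orientation is ignored.
    """
    probe = {frozenset(link) for link in probe_links}
    # Rank candidates by decreasing score and keep the first L of them.
    top = sorted(scores, key=scores.get, reverse=True)[:L]
    m = sum(1 for link in top if frozenset(link) in probe)
    return m / L


# Hypothetical scores for four candidate links.
demo_candidates = {("a", "b"): 0.9, ("a", "c"): 0.8, ("b", "d"): 0.1, ("c", "d"): 0.05}
demo_precision = precision_at_L(demo_candidates, [("b", "a"), ("c", "d")], L=2)
```

In the example the two top-ranked candidates are `("a", "b")` and `("a", "c")`, of which only the first is in the probe set, so the precision is 0.5.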

4. Data

Experiments are performed on twelve real networks and three synthetic dynamic networks (randomly generated by the BA scale-free network model [1] at different scales). The real networks are as follows:

(1) Jazz [24]: the network of jazz musicians.
(2) Food Web of South Florida ecosystem (FW) [25]: the network of carbon exchanges occurring during the wet season in the cypress wetlands of South Florida.
(3) USAir [26]: the network of the USA airline.
(4) Hamster [27]: a friendship network of users on the website hamsterster.com.
(5) Yeast PPI (Yeast) [28]: the protein-protein interaction network of yeast.
(6) Political blogs (PB) [29]: a network of US political blogs.
(7) Open flights (Flight) [30]: the network of flights between airports of the world.
(8) Infectious (Infect) [31]: the network of face-to-face behaviour of people during the exhibition “Infectious: Stay away” in 2009 at the Science Gallery in Dublin.
(9) AIDS-Blog [32]: a network of citations among blogs related to AIDS, patients, and their support networks.
(10) Email network (Email) [33]: the internal email communication network between employees of a mid-sized manufacturing company.
(11) UC Irvine messages social network (UcSocial) [34]: the message communication network between the users of an online community of students from the University of California, Irvine.
(12) Human protein network (Figeys) [35]: a network of interactions between proteins in humans (Homo sapiens).

Table 1 shows the basic topological features of these networks. The links of each network are randomly divided into a training set, which contains 90% of the links, and a probe set, which contains the remaining 10%.
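The random 90%/10% division described above can be sketched as below. The function name `split_links` and the fixed seed are assumptions; the source does not say whether the training network was required to stay connected, so this sketch imposes no such constraint.

```python
import random


def split_links(edges, probe_fraction=0.1, seed=0):
    """Randomly divide an edge list into a training set and a probe set."""
    rng = random.Random(seed)
    shuffled = list(edges)
    rng.shuffle(shuffled)
    n_probe = int(round(probe_fraction * len(shuffled)))
    probe = shuffled[:n_probe]
    training = shuffled[n_probe:]
    return training, probe


# Usage on a hypothetical 20-edge path graph: 18 training links, 2 probe links.
demo_edges = [(i, i + 1) for i in range(20)]
demo_training, demo_probe = split_links(demo_edges)
```

Repeating the call with different seeds gives independent divisions, which is how averages over several realizations can be formed.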

5. Results

We compare the ERA index with eight other similarity indices: four local indices (CN, RA, CAR and LP) and four global indices (Katz, ACT, Cos+ and MFI). They are briefly introduced as follows:

(1) Common Neighbor index (CN) [11] assumes that the similarity of two nodes is positively correlated with the number of their common neighbors:

$$s_{xy}^{CN} = |\Gamma(x) \cap \Gamma(y)| \tag{6}$$

where $\Gamma(x)$ is the set of neighbors of node $x$, and $\Gamma(x) \cap \Gamma(y)$ denotes the common neighbors of nodes $x$ and $y$.

(2) Resource Allocation index (RA) [13] weights the common neighbors based on resource allocation, and punishes common neighbors with large degree:

$$s_{xy}^{RA} = \sum_{z \in \Gamma(x) \cap \Gamma(y)} \frac{1}{k_z} \tag{7}$$

(3) CAR index [36] suggests that two nodes are more likely to link together if their common-first-neighbours are members of a strongly inner-linked cohort:

$$s_{xy}^{CAR} = |\Gamma(x) \cap \Gamma(y)| \cdot \sum_{z \in \Gamma(x) \cap \Gamma(y)} \frac{|\gamma(z)|}{2} \tag{8}$$

where $\gamma(z)$ refers to the subset of neighbors of $z$ that are also common neighbors of nodes $x$ and $y$.

(4) Local Path index (LP) [20] adds the path information with length 3 to CN:

$$S = A^2 + \varepsilon \cdot A^3 \tag{9}$$

where $A$ is the adjacency matrix and $\varepsilon$ is the adjust parameter.

(5) Katz index [14] considers all the paths between two nodes and gives more weight to the shorter paths:

$$s_{xy}^{Katz} = \sum_{l=1}^{\infty} \beta^l \cdot |path_{xy}^{l}| = \beta A_{xy} + \beta^2 (A^2)_{xy} + \beta^3 (A^3)_{xy} + \cdots \tag{10}$$

where $path_{xy}^{l}$ is the set of paths with length $l$ between nodes $x$ and $y$, and $\beta$ is the adjust parameter.

(6) Average Commute Time (ACT) [17] is based on the average number of steps of a random walk between two endpoints:

$$s_{xy}^{ACT} = \frac{1}{l_{xx}^{+} + l_{yy}^{+} - 2 l_{xy}^{+}} \tag{11}$$

where $l_{xy}^{+}$ denotes the corresponding entry of $L^{+}$, and $L^{+}$ is the pseudo-inverse of the Laplacian matrix $L = D - A$ ($D$ is the diagonal degree matrix).

(7) Cosine Similarity Time (Cos+) [18] is the cosine similarity of two vectors, calculated from $L^{+}$:

$$s_{xy}^{Cos^{+}} = \frac{v_x^{T} v_y}{|v_x| \cdot |v_y|} = \frac{l_{xy}^{+}}{\sqrt{l_{xx}^{+} \cdot l_{yy}^{+}}} \tag{12}$$

(8) Matrix-Forest Index (MFI) [37] is defined as:

$$S = (I + L)^{-1} \tag{13}$$
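Most of the baselines above have compact matrix forms, which the following NumPy sketch illustrates for a dense adjacency matrix. It is a minimal sketch rather than a reference implementation: the function name `baseline_scores`, the demo graph and the default parameter values are assumptions, CAR is omitted for brevity, and the Katz series is evaluated through its closed form $(I - \beta A)^{-1} - I$, which is valid only when $\beta$ is smaller than the reciprocal of the largest eigenvalue of $A$.

```python
import numpy as np


def baseline_scores(A, eps=0.001, beta=0.001):
    """Return score matrices for CN, RA, LP, Katz, ACT, Cos+ and MFI."""
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    I = np.eye(n)
    k = A.sum(axis=1)  # node degrees

    A2 = A @ A
    cn = A2                                    # Eq. (6): (A^2)_xy counts common neighbors
    inv_k = np.divide(1.0, k, out=np.zeros_like(k), where=k > 0)
    ra = A @ np.diag(inv_k) @ A                # Eq. (7): sum over z of 1 / k_z
    lp = A2 + eps * (A2 @ A)                   # Eq. (9)
    katz = np.linalg.inv(I - beta * A) - I     # Eq. (10), closed form of the series

    L = np.diag(k) - A                         # Laplacian L = D - A
    L_plus = np.linalg.pinv(L)                 # pseudo-inverse L^+
    d = np.diag(L_plus)
    with np.errstate(divide="ignore", invalid="ignore"):
        # Eq. (11); the diagonal (x = y) is not meaningful and becomes inf.
        act = 1.0 / (d[:, None] + d[None, :] - 2.0 * L_plus)
        # Eq. (12)
        cos_plus = L_plus / np.sqrt(np.outer(d, d))
    mfi = np.linalg.inv(I + L)                 # Eq. (13)

    return {"CN": cn, "RA": ra, "LP": lp, "Katz": katz,
            "ACT": act, "Cos+": cos_plus, "MFI": mfi}


# Hypothetical 4-node demo graph with edges 0-1, 0-2, 1-2, 2-3.
A_demo = np.array([[0, 1, 1, 0],
                   [1, 0, 1, 0],
                   [1, 1, 0, 1],
                   [0, 0, 1, 0]], dtype=float)
S = baseline_scores(A_demo)
```

For the pair (0, 3) in the demo graph the only common neighbor is node 2 with degree 3, so `S["CN"][0, 3]` is 1 and `S["RA"][0, 3]` is 1/3. The dense inverses make the cost of the global indices cubic in the number of nodes, which is the complexity gap that motivates local indices.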

Fig. 4: The experiment result (AUC) of the ERA index on twelve real networks and three synthetic dynamic networks with different values of $\sigma$. Each AUC value is the average of 20 realizations, each of which corresponds to an independent division of $E^T$ and $E^P$. Panels: (a) Jazz, (b) FW, (c) USAir, (d) Hamster, (e) Yeast, (f) PB, (g) Flight, (h) Infect, (i) AIDS-Blog, (j) Email, (k) UcSocial, (l) Figeys, (m) Ba-1, (n) Ba-2, (o) Ba-3; horizontal axis $\sigma$, vertical axis AUC.

5.1 AUC results

First, we report the AUC results of the ERA index for different values of $\sigma$; each data point is the average of 20 realizations. As shown in Fig. 4, the AUC value varies with $\sigma$ in all fifteen datasets. For many networks, ERA obtains its optimal value when $\sigma < 1$ (the exceptions are the FW and AIDS-Blog networks, where $\sigma > 1$). There is a sudden increase of AUC around $\sigma = 0$ for all the datasets, which indicates the effectiveness of the potential resource allocated through longer paths in the ERA index. In most networks, the AUC values remain stable around the maximum value after the sudden increase (there is no big difference between the AUC values of these points and the peak point). However, in some datasets such as Jazz, USAir, Infect and Email, the prediction accuracy declines gradually after reaching the highest point.

Table 2: Comparison of the AUC value between ERA and some similarity indices. Each AUC value is the average of 20 realizations, each of which corresponds to an independent division of $E^T$ and $E^P$.

AUC         CN      RA      CAR     LP(a)   LP(b)   Katz(a) Katz(b) ACT     Cos+    MFI     ERA
Jazz        0.954   0.971   0.955   0.951   0.947   0.951   0.941   0.795   0.925   0.921   0.972
FW          0.690   0.708   0.689   0.709   0.735   0.707   0.738   0.786   0.507   0.707   0.875
USAir       0.954   0.972   0.954   0.953   0.952   0.952   0.951   0.902   0.956   0.939   0.976
Hamster     0.812   0.815   0.812   0.933   0.940   0.933   0.937   0.843   0.924   0.948   0.973
Yeast       0.915   0.916   0.916   0.970   0.969   0.972   0.971   0.898   0.971   0.971   0.974
PB          0.924   0.928   0.924   0.936   0.939   0.937   0.933   0.893   0.928   0.905   0.952
Flight      0.969   0.972   0.969   0.984   0.984   0.983   0.981   0.909   0.989   0.979   0.992
Infect      0.940   0.946   0.941   0.960   0.960   0.961   0.960   0.802   0.947   0.960   0.968
AIDS-Blog   0.601   0.615   0.602   0.821   0.822   0.840   0.841   0.954   0.579   0.730   0.932
Email       0.921   0.925   0.919   0.921   0.921   0.920   0.917   0.899   0.905   0.887   0.928
UcSocial    0.780   0.786   0.779   0.892   0.902   0.891   0.903   0.895   0.867   0.868   0.934
Figeys      0.565   0.568   0.564   0.887   0.903   0.887   0.900   0.875   0.806   0.884   0.952
Ba-1        0.640   0.641   0.556   0.705   0.703   0.699   0.697   0.566   0.261   0.581   0.761
Ba-2        0.614   0.615   0.535   0.669   0.668   0.666   0.667   0.535   0.267   0.556   0.713
Ba-3        0.599   0.597   0.525   0.646   0.647   0.641   0.637   0.516   0.274   0.535   0.683

(a) In these methods, the adjust parameter is set to 0.001.
(b) The adjust parameter is set to 0.01.

Table 2 compares the AUC values of ERA and the other similarity indices. In 14 out of 15 datasets, the AUC value of ERA is the highest; it is lower only than that of ACT in the AIDS-Blog network. Because it neglects long-path information, the CN index is among the lowest-scoring indices in most networks. The performance of the CAR index is almost the same as that of the CN index under the AUC metric, although CAR additionally considers the local community on top of CN. With more path information considered, LP performs better than CN, and in some networks it comes close to the global indices. The global indices, especially Katz and Cos+, generally obtain higher prediction accuracy than local indices in real networks. Nevertheless, ACT, Cos+ and MFI perform worse in synthetic dynamic networks than in real networks. It is worth mentioning that, because the RA index considers the resource allocation of common neighbors, it performs better than expected at lower complexity, which indicates that resource interaction between endpoints may be more important than the number of paths (LP index) in some networks such as Jazz, Email and USAir. Moreover, by considering the potential resource exchange through longer paths, ERA performs even better than RA in both synthetic dynamic and real networks. In addition, we recommend setting the parameter $\sigma$ to around 0.04 for ERA under the AUC metric in practical prediction (most of the resulting AUC values are equal or close to the optimal value).

5.2 Precision results

To further understand the performance of ERA, the standard metric precision is introduced to measure the prediction accuracy from a different perspective. Fig. 5 shows the precision of the ERA index as $\sigma$ changes in different datasets. For most networks, there is also a sudden increase of the precision value around $\sigma = 0$ (except for Jazz). In high clustering networks such as USAir, Jazz, Infect and Email, the precision value declines gradually after reaching the highest point, because the longer paths between nodes are more important for resource interaction in these networks. On the contrary, for many networks without a high clustering coefficient, the precision of ERA reaches its optimal value with $\sigma > 1$ and stays around a certain value after the sudden increase.

Fig. 5: The prediction result (precision) of the ERA index on twelve real networks and three synthetic dynamic networks with different values of $\sigma$. Each precision value is the average of 20 realizations, each of which corresponds to an independent division of $E^T$ and $E^P$. Panels: (a) Jazz, (b) FW, (c) USAir, (d) Hamster, (e) Yeast, (f) PB, (g) Flight, (h) Infect, (i) AIDS-Blog, (j) Email, (k) UcSocial, (l) Figeys, (m) Ba-1, (n) Ba-2, (o) Ba-3; horizontal axis $\sigma$, vertical axis precision.

Table 3: Comparison of the precision value between ERA and some similarity indices. Each precision value is the average of 20 realizations.

Precision   CN      RA      CAR     LP(a)   LP(b)   Katz(a) Katz(b) ACT     Cos+    MFI     ERA
Jazz        0.814   0.828   0.851   0.802   0.775   0.802   0.747   0.253   0.354   0.218   0.828
FW          0.148   0.170   0.143   0.161   0.187   0.161   0.192   0.271   0.000   0.047   0.384
USAir       0.585   0.632   0.580   0.583   0.580   0.583   0.574   0.477   0.078   0.052   0.651
Hamster     0.015   0.008   0.033   0.016   0.052   0.016   0.077   0.085   0.016   0.036   0.346
Yeast       0.684   0.491   0.674   0.684   0.736   0.683   0.729   0.571   0.243   0.062   0.853
PB          0.409   0.242   0.467   0.416   0.442   0.416   0.451   0.131   0.326   0.007   0.456
Flight      0.509   0.365   0.627   0.514   0.549   0.514   0.543   0.337   0.042   0.056   0.545
Infect      0.397   0.512   0.385   0.360   0.356   0.360   0.344   0.134   0.204   0.151   0.512
AIDS-Blog   0.016   0.032   0.016   0.052   0.052   0.053   0.053   0.079   0.000   0.000   0.082
Email       0.703   0.702   0.702   0.709   0.706   0.709   0.694   0.619   0.614   0.365   0.722
UcSocial    0.032   0.024   0.055   0.033   0.046   0.033   0.047   0.069   0.010   0.002   0.142
Figeys      0.014   0.016   0.030   0.014   0.015   0.014   0.015   0.010   0.006   0.001   0.223
Ba-1        0.192   0.088   0.190   0.192   0.193   0.194   0.193   0.001   0.000   0.000   0.195
Ba-2        0.173   0.101   0.174   0.176   0.175   0.177   0.174   0.002   0.001   0.000   0.180
Ba-3        0.187   0.125   0.183   0.188   0.187   0.188   0.186   0.000   0.003   0.001   0.197

(a) In these methods, the adjust parameter is set to 0.001.
(b) The adjust parameter is set to 0.01.

Table 3 reports the average precision values of ERA and the other similarity indices. In 12 out of 15 datasets, the ERA index obtains the best performance. It is worse only than the CAR index in Jazz and PB, and in the Flight network two indices, CAR and LP, perform better than ERA. Unlike the AUC results, the precision values of local indices (CN, RA) are very close to those of global indices, and even higher than those of LP and Katz in some high clustering networks such as Jazz, USAir, Yeast, Infect and Email. By considering the local community, CAR achieves good performance under the precision metric, even better than LP and Katz in some networks. In the Jazz, PB and Flight networks, the precision values of CAR are higher than those of ERA, which indicates that the longer paths passing through common neighbors (the local community in CAR) play a more important role than those passing through non-common-neighbors in link prediction for these networks. Surprisingly, the performance of some global indices, including ACT, Cos+ and MFI, is poorer than expected in all the datasets, partly because these indices pay more attention to AUC and ignore the standard metric precision. Compared with other indices, the RA index performs worse in synthetic dynamic networks than in real networks. Nevertheless, ERA improves on the performance of RA in synthetic networks by considering longer paths. Besides, the complexity of ERA is between $O(N \langle k \rangle^2)$ (RA) and $O(N \langle k \rangle^3)$ (LP). In practical prediction, we recommend setting the parameter $\sigma$ to around 1.1 for common networks under the precision metric, and close to 0 (such as 0.001) for high clustering networks.

6. Conclusions and discussions

Similarity indices based on topological structure play an important role in link prediction. Motivated by the potential resource transferred through longer paths, an extended resource allocation (ERA) index is proposed. ERA considers all the neighbors that can transfer resources, and achieves good performance by adjusting the proportion of resource allocated to longer paths. In all twelve real networks and three synthetic dynamic networks, the AUC and precision values of ERA increase suddenly around the parameter value 0 (RA), which indicates that considering the potential resource transferred by longer paths can effectively improve the prediction accuracy of RA. By varying the parameter, an optimal prediction value can be found for each network. As can be seen from the AUC and precision results, the local indices (CN and RA) are more suitable for networks with a smaller average distance or a higher clustering coefficient, where they even perform better than global indices. In contrast, the global indices perform well in networks with a larger average distance or a lower clustering coefficient, because they consider all the path information. By considering resource exchange between endpoints, ERA finds a tradeoff between the node degree of neighbors and potential paths for different networks through an adjustable parameter. This indicates that the growth mechanism of edges is closely related to the degree of the neighbors that lie on the paths between the two endpoints of a new edge, which is of great significance for understanding the network evolution mechanism. In addition, many indices focus on the standard metric AUC and ignore precision. However, precision also plays an important role in measuring prediction accuracy for some real networks, such as protein-protein interaction networks. The ERA index achieves high results under both standard metrics, AUC and precision. Due to its good performance on datasets with different clustering coefficients and its low time complexity, the ERA index can be applied to many more real networks, especially large-scale networks.
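
For readers who want to reproduce the two metrics named above, here is a minimal Python sketch of AUC and precision as they are commonly defined for link prediction (see the survey [22]). It is an illustration under stated assumptions, not the evaluation code used in this work: the function names and toy scores are hypothetical, unscored pairs are assumed to have score 0, and AUC is computed by comparing every probe link with every nonexistent link exhaustively rather than by random sampling.

```python
def auc(scores, probe, nonexistent):
    """Share of (probe, nonexistent) pairs ranked correctly; ties count 0.5."""
    higher = ties = 0
    for p in probe:
        for q in nonexistent:
            sp, sq = scores.get(p, 0.0), scores.get(q, 0.0)
            if sp > sq:
                higher += 1
            elif sp == sq:
                ties += 1
    return (higher + 0.5 * ties) / (len(probe) * len(nonexistent))


def precision(scores, probe, nonexistent, top_l):
    """Share of the top_l ranked candidate links that belong to the probe set."""
    probe_set = set(probe)
    candidates = list(probe) + list(nonexistent)
    ranked = sorted(candidates, key=lambda e: scores.get(e, 0.0), reverse=True)
    hits = sum(1 for e in ranked[:top_l] if e in probe_set)
    return hits / top_l


# Hypothetical similarity scores for candidate links (unlisted pairs score 0).
scores = {(0, 2): 0.9, (0, 3): 0.5, (1, 3): 0.4, (2, 4): 0.4}
probe = [(0, 2), (1, 3)]                 # held-out true links
nonexistent = [(0, 3), (2, 4), (1, 4)]   # links absent from the network

auc_value = auc(scores, probe, nonexistent)
precision_value = precision(scores, probe, nonexistent, top_l=2)
print(auc_value, precision_value)
```

The toy example shows why the two metrics can disagree: AUC rewards the overall ranking of probe links against nonexistent links, while precision only looks at the top of the ranked list.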

Acknowledgements

This work is partially supported by the Foundation for Innovative Research Groups of the National Natural Science Foundation of China (No. 61521003) and the National High Technology Research and Development Program of China (No. SS2015AA011306).

References

[1]. A.-L. Barabási, R. Albert, Emergence of Scaling in Random Networks, Science 286 (1999) 509-512.

[2]. S. Aral, D. Walker, Identifying influential and susceptible members of social networks, Science 337 (2012) 337-341.

[3]. E. Bullmore, O. Sporns, Complex brain networks: graph theoretical analysis of structural and functional systems, Nat. Rev. Neurosci. 10 (2009) 186-198.

[4]. F. Schweitzer, G. Fagiolo, D. Sornette, F. Vega-Redondo, A. Vespignani, D.R. White, Economic networks: The new challenges, Science 325 (2009) 422.

[5]. L. Sun, L. Liu, Z. Xu, et al., Locating inefficient links in a large-scale transportation network, Physica A 419 (2015) 537-545.

[6]. P. Wang, B. Xu, Y. Wu, X. Zhou, Link prediction in social networks: the state-of-the-art, Sci. China Inform. Sci. 58 (2015) 1-38.

[7]. A. Clauset, C. Moore, M.E. Newman, Hierarchical structure and the prediction of missing links in networks, Nature 453 (2008) 98-101.

[8]. C. Von Mering, L.J. Jensen, B. Snel, S.D. Hooper, M. Krupp, M. Foglierini, N. Jouffre, M.A. Huynen, P. Bork, STRING: known and predicted protein-protein associations, integrated and transferred across organisms, Nucleic Acids Res. 33 (2005) D433-D437.

[9]. S. Scellato, A. Noulas, C. Mascolo, Exploiting place features in link prediction on location-based social networks, in: Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining, ACM, 2011, pp. 1046-1054.

[10]. P. Zhang, A. Zeng, Y. Fan, Identifying missing and spurious connections via the bi-directional diffusion on bipartite networks, Phys. Lett. A 378 (2014) 2350-2354.

[11]. F. Lorrain, H.C. White, Structural equivalence of individuals in social networks, J. Math. Sociol. 1 (1971) 49-80.

[12]. L.A. Adamic, E. Adar, Friends and neighbors on the web, Soc. Netw. 25 (2003) 211-230.

[13]. T. Zhou, L. Lü, Y.-C. Zhang, Predicting missing links via local information, Eur. Phys. J. B 71 (2009) 623-630.

[14]. L. Katz, A new status index derived from sociometric analysis, Psychometrika 18 (1953) 39-43.

[15]. X. Zhu, H. Tian, S. Cai, Predicting missing links via effective paths, Physica A 413 (2014) 515-522.

[16]. X. Zhu, H. Tian, S. Cai, J. Huang, T. Zhou, Predicting missing links via significant paths, Europhys. Lett. 106 (2014) 18008.

[17]. D.J. Klein, M. Randic, Resistance distance, J. Math. Chem. 12 (1993) 81.

[18]. F. Fouss, A. Pirotte, J.-M. Renders, M. Saerens, Random-walk computation of similarities between nodes of a graph with application to collaborative recommendation, IEEE Trans. Knowl. Data. Eng. 19 (2007) 355.

[19]. G. Jeh, J. Widom, SimRank: a measure of structural-context similarity, in: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM Press, New York, 2002, pp. 271-279.

[20]. L. Lü, C.-H. Jin, T. Zhou, Similarity index based on local paths for link prediction of complex networks, Phys. Rev. E 80 (2009) 046122.

[21]. X. Feng, J. Zhao, K. Xu, Link prediction in complex networks: a clustering perspective, Eur. Phys. J. B 85 (2012) 1-9.

[22]. L. Lü, T. Zhou, Link prediction in complex networks: A survey, Physica A 390 (2011) 1150-1170.

[23]. J.L. Herlocker, J.A. Konstan, L.G. Terveen, J.T. Riedl, Evaluating collaborative filtering recommender systems, ACM Trans. Inf. Syst. 22 (2004) 5-53.

[24]. P.M. Gleiser, L. Danon, Community structure in jazz, Adv. Complex Syst. 6 (2003) 565-573.

[25]. R.E. Ulanowicz, D.L. DeAngelis, Network analysis of trophic dynamics in South Florida ecosystems, US Geological Survey Program on the South Florida Ecosystem (2005) 114.

[26]. V. Batagelj, A. Mrvar, Pajek-program for large network analysis, Connections 21 (1998) 47-57.

[27]. L. Lü, L. Pan, T. Zhou, Y.-C. Zhang, H.E. Stanley, Toward link predictability of complex networks, Proc. Natl. Acad. Sci. 112 (2015) 2325-2330.

[28]. D. Bu, Y. Zhao, L. Cai, H. Xue, X. Zhu, H. Lu, J. Zhang, S. Sun, L. Ling, N. Zhang, G. Li, R. Chen, Topological structure analysis of the protein-protein interaction network in budding yeast, Nucleic Acids Res. 31 (2003) 2443-2450.

[29]. L.A. Adamic, N. Glance, The political blogosphere and the 2004 US election: divided they blog, in: Proceedings of the 3rd international workshop on Link discovery, ACM, 2005, pp. 36-43.

[30]. T. Opsahl, F. Agneessens, J. Skvoretz, Node centrality in weighted networks: Generalizing degree and shortest paths, Soc. Netw. 32 (2010) 245-251.

[31]. L. Isella, J. Stehlé, A. Barrat, C. Cattuto, J.F. Pinton, W. Van den Broeck, What's in a crowd? Analysis of face-to-face behavioral networks, J. Theor. Biol. 271 (2011) 166-180.

[32]. S. Gopal, The evolving social geography of blogs, H. Miller, Ed. Berlin: Springer, 2007, pp. 275-294.

[33]. R. Michalski, S. Palus, P. Kazienko, Matching organizational structure and social network extracted from email communication, in: Business Information Systems, Springer, 2011, pp. 197-206.

[34]. T. Opsahl, P. Panzarasa, Clustering in weighted networks, Soc. Netw. 31 (2009) 155-163.

[35]. R.M. Ewing, P. Chu, F. Elisma, H. Li, P. Taylor, S. Climie, L. M.-Cerajewski, M.D. Robinson, L. O'Connor, M. Li, R. Taylor, M. Dharsee, Y. Ho, A. Heilbut, L. Moore, S. Zhang, O. Ornatsky, Y.V. Bukhman, M. Ethier, Y. Sheng, J. Vasilescu, M. A.-Farha, J.P. Lambert, H.S. Duewel, I.I. Stewart, B. Kuehl, K. Hogue, K. Colwill, K. Gladwish, B. Muskat, R. Kinach, S.-L. Adams, M.F. Moran, G.B. Morin, T. Topaloglou, D. Figeys, Large-scale mapping of human protein-protein interactions by mass spectrometry, Mol. Syst. Biol. 3 (2007).

[36]. C.V. Cannistraci, G. Alanis-Lobato, T. Ravasi, From link-prediction in brain connectomes and protein interactomes to the local-community-paradigm in complex networks, Sci. Rep. 3 (2013).

[37]. P. Chebotarev, E.V. Shamis, The matrix-forest theorem and measuring relations in small social groups, Autom. Remote Control 58 (1997) 1505.

[38]. M.E. Newman, Assortative mixing in networks, Phys. Rev. Lett. 89 (2002) 208701.

[39]. D.J. Watts, S.H. Strogatz, Collective dynamics of ‘small-world’ networks, Nature 393 (1998) 440-442.

[40]. Q. Ou, Y.-D. Jin, T. Zhou, B.-H. Wang, B.-Q. Yin, Power-law strength-degree correlation from resource-allocation dynamics on weighted networks, Phys. Rev. E 75 (2007) 021102.