Link prediction via linear optimization

Link prediction via linear optimization

Physica A 528 (2019) 121319 Contents lists available at ScienceDirect Physica A journal homepage: www.elsevier.com/locate/physa Link prediction via...

4MB Sizes 1 Downloads 122 Views

Physica A 528 (2019) 121319

Contents lists available at ScienceDirect

Physica A journal homepage: www.elsevier.com/locate/physa

Link prediction via linear optimization Ratha Pech a,b , Dong Hao a,b , Yan-Li Lee a,b , Ye Yuan c , Tao Zhou a,b ,



a

CompleX Lab, Web Sciences Center, University of Electronic Science and Technology of China, Chengdu 611731, PR China Big Data Research Center, University of Electronic Science and Technology of China, Chengdu 611731, PR China c School of Artificial Intelligence and Automation, the State Key Laboratory of Digital Manufacturing Equipment and Technology, Huazhong University of Science and Technology, Wuhan 430074, PR China b

highlights • This work provides a highly accurate and general link prediction algorithm. • An analytical solution for the optimal likelihood matrix is obtained. • The number of 3-hop paths between two nodes performs better than 2-hop paths.

article

info

Article history: Received 19 September 2018 Received in revised form 27 March 2019 Available online 9 May 2019 Keywords: Link prediction Complex networks Linear optimization

a b s t r a c t Link prediction is an elemental challenge in network science, which has already found applications in guiding laboratorial experiments, digging out drug targets, recommending online friends, probing network evolution mechanisms, and so on. With a simple assumption that the likelihood of the existence of a link between two nodes can be unfolded by a linear summation of neighboring nodes’ contributions, we obtain the analytical solution of the optimal likelihood matrix, which shows remarkably better performance in predicting missing links than the state-of-the-art algorithms for not only simple networks, but also weighted and directed networks. To our surprise, even some degenerated local similarity indices from the solution outperform well-known local indices, which largely refines our knowledge, for example, the number of 3-hop paths between two nodes more accurately predicts missing links than the number of 2-hop paths (i.e., the number of common neighbors), while in previous methods, longer paths are always considered to be less important than shorter paths. © 2019 Elsevier B.V. All rights reserved.

1. Introduction Thanks to the breakthrough in uncovering the structural complexity (e.g., small-world [1] and scale-free [2] properties) in real networks, the recent twenty years have witnessed an explosion in the studies of networks, which is turning the so-called network science from niche branches of science in mathematics (i.e., graph theory) and social science (i.e., social network analysis) to an interdisciplinary focus that attracts increasing attentions from physicists, mathematicians, social scientists, computer scientists, biologists, and so on. Recently, the research focus of network science has been shifting from macroscopic statistical regularities [3] to different roles played by microscopic elements, such as nodes [4] and links [5], in network structure and functions. Therein, link prediction is an elemental challenge that aims at estimating the likelihood that a nonobserved link exists, on the basis of observed links in a network [6]. ∗ Corresponding author at: CompleX Lab, Web Sciences Center, University of Electronic Science and Technology of China, Chengdu 611731, PR China. E-mail address: [email protected] (T. Zhou). https://doi.org/10.1016/j.physa.2019.121319 0378-4371/© 2019 Elsevier B.V. All rights reserved.

2

R. Pech, D. Hao, Y.-L. Lee et al. / Physica A 528 (2019) 121319

Link prediction is of particular significance. Theoretically speaking, link prediction can be used as a probe to quantify to which extent the network formation and evolution can be explained by a mechanism model, since a better model should be in principle transferred to a more accurate algorithm [7,8]. Beyond theoretical interests, link prediction has already found many applications. For example, our knowledge of biological interactions is highly limited, with approximately 99.7% of the molecular interactions in human beings still unknown [9]. Instead of blindly checking all possible interactions, to predict based on known interactions and focus on those links most likely to exist can sharply reduce the experimental costs if the predictions are accurate enough [10]. Analogously, the known interactions between drugs and target proteins are very limited, while it is believed that any single drug can interact with multiple targets [11]. By this time, link prediction algorithms have already played a critical role in finding out new uses of old drugs [12]. Besides dealing with missing data problems, link prediction algorithms can also be used to predict the links that may appear in the future of evolving networks, with obviously commercial values in friend recommendations of online social networks [13] and product recommendations in e-commercial web sites [14]. Many algorithms have been proposed to solve the link prediction problem, including probabilistic models [15,16] that establish a model with usually a large number of parameters to best fit the observed data and then predict missing links by using the learned model, similarity-based algorithms [17,18] that assign a similarity score to every pair of nodes and rank all non-observed links according to their scores, maximum likelihood methods [19,20] that presuppose some network organizing principles with detailed rules and specific parameters being obtained by maximizing the likelihood of the observed structure and then calculate the likelihood of any nonobserved link according to those rules and parameters, and some others [21,22]. Despite these achievements, how to design effective and efficient algorithms remains a conspicuous challenge. The similarity-based algorithms are often very efficient for its low computational complexity (especially for local similarity indices [18]) but less accurate. The maximum likelihood methods are highly time consuming, with typical ones (e.g., hierarchical structure model [19], stochastic block model [20] and LOOP model [23]) can only handle networks with a few thousands of nodes, while real social networks scale from millions to more than a billion nodes. The probabilistic models often require the information about node attributes in addition to the observed network structure, which highly limits their applications. And the number of parameters are too many so that we cannot easily find any insights about network organization. In this paper, we assume that the likelihood of the existence of a nonobserved link from node i to node j can be unfolded by a linear summation of contributions from i’s neighbors. Accordingly, we transfer link prediction to an optimization problem for the likelihood matrix, which can be solved analytically. We have tested our algorithms as well as the stateof-the-art benchmarks in 24 real networks from disparate fields, including 8 simple networks, 8 weighted networks and 8 directed networks. Extensive empirical comparison shows that our algorithms remarkably outperforms the similaritybased algorithm and slightly better than the maximum likelihood methods. At the same time, the time complexity of our algorithm is much lower than the maximum likelihood methods. We further analyze some degenerated local similarity indices for simple networks from the analytical solution, which still perform much better than many well-known local indices. Of particular interest, the direct count of 3-hop paths between two nodes i and j, say (A3 )ij where A is the adjacency matrix, give more accurate predictions for missing links than the widely used common neighbor index (A2 )ij . This finding shakes a common belief in graph mining that the statistics on shorter paths are more significant than those on longer paths, as indicated by the decaying factor in Katz index [24] and local path index [25]. 2. Method Considering an observed network G(V , E) with V and E being the sets of nodes and links, respectively. The corresponding adjacency matrix A is defined as aij = 1 if there is a link from node i to node j, and aij = 0 otherwise. For simple networks (i.e., undirected unweighted networks), A is symmetric, say aij = aji ; for directed networks, in general, aij can be different from aji ; for weighted networks, aij denotes the weight assigned to the link from i to j, which is not necessarily equal to 1. In the following deviation, we use the general definition of A so that the results can be directly applied for directed and weighted networks. We assume that the likelihood of the existence of a link from i to j, denoted by sij , can be unfolded by a linear summation of contributions from i’s neighbors, namely sij =



aik zkj ,

(1)

k

where zkj is the contribution from node k to node j. In the likelihood matrix S (S = AZ, as defined in Eq. (1), which is also named as score matrix or similarity matrix in similarity-based algorithms), only the elements corresponding to nonobserved links are meaningful in link prediction, but the elements corresponding to observed links can be used to evaluate the rationality of S, because to be self-consistent, if aij > apq , sij should also be larger than spq . That is to say, the difference between A and S should be small. At the same time, to avoid overfitting, the magnitude of Z should also be small. Accordingly, the determination of the likelihood matrix S can be simply transferred to an optimization problem min α∥A − AZ∥ + ∥Z∥, Z

where α is a free parameter that balances the two requirements and ∥ · ∥ denotes a certain matrix norm.

(2)

R. Pech, D. Hao, Y.-L. Lee et al. / Physica A 528 (2019) 121319

3

Fig. 1. (a) The illustration of how to calculate Z∗ and S for a small-size simple network. According to the proposed algorithm, the two links (2, 5) and (4, 5) are considered to be most likely missing links. (b) The whole procedure of the algorithm.

To make Eq. (2) solvable, we choose the Frobenius norm with power 2, namely to minimize E = α∥A − AZ∥2F + ∥Z∥2F ,

(3)

where ∥X∥ = Tr(X X). Performances of other commonly used norms, such as ℓ1 -norm and nuclear norm are similar to the above one (see Section 3.4 about the corresponding algorithms and their performances). The expansion of Eq. (3) reads 2 F

T

E = α Tr[(A − AZ)T (A − AZ)] + Tr(ZT Z)

= α Tr(AT A − AT AZ − ZT AT A + ZT AT AZ) + Tr(ZT Z),

(4)

with its partial derivative being

∂E = α (−2AT A + 2AT AZ) + 2Z. ∂Z Setting ∂ E/∂ Z = 0, we can obtain the optimal solution of Z as Z∗ = α (α AT A + I)−1 AT A,

(5)

(6)

where I is the identity matrix. The likelihood matrix S can be obtained as S = AZ∗ .

(7)

Then, we rank all nonobserved links in a descending order according to their corresponding values in the likelihood matrix S, with the top-L links constituting the predicted results. The complete procedure of the proposed algorithm as well as an example of a small-size simple network are illustrated in Fig. 1. 3. Results To test the algorithms accuracy, the set of links, E, is randomly divided into two parts: (i) a training set E T , which is treated as known information, and (ii) a probe set (i.e., validation subset) E P , which is used for testing and can be considered as missing links. No information in the probe set is allowed to be used for prediction, that is to say, in the calculation of S, the adjacency matrix A only contains links in E T . Obviously, E T ∪ E P = E and E T ∩ E P = ∅. The task of a link prediction algorithm is to uncover the links in the probe set based on the information in the training set. We adopt two standard metrics to quantify the algorithms’ accuracy. The first one is called precision [26], which is defined as the ratio of relevant elements to the number of selected elements. That is to say, if we take the top-L links as predicted links, among which Lr links are right (i.e., there are Lr links in the probe set E P ), then the precision equals Lr /L. The second one is called the area under the receiver operating characteristic curve (AUC value for short) [27], which can be interpreted as the probability that a randomly chosen link in E P (i.e., a missing link that indeed exists but is not observed yet) is ranked higher than a randomly chosen link in U − E (i.e., a nonexistent link), where U is the universal set contains all possible links. If all scores are randomly generated from an independent and identical distribution, the AUC value should be about 0.5. Therefore, the degree to which the value exceeds 0.5 indicates how much the algorithm performs better than pure chance. 3.1. Simple networks We first test the present linear optimization (LO) method on eight simple networks (i.e., undirected and unweighted networks) from disparate fields, including: (i) FWF [28]—the food web in Florida Bay in the dry season; (ii) C.elegans [1]— the neural network of the worm named Caenorhabditis elegans, where each link indicates a connection between two

4

R. Pech, D. Hao, Y.-L. Lee et al. / Physica A 528 (2019) 121319 Table 1 The structural statistics of the eight simple networks. |V | and |E | are the number of nodes and links, respectively. ⟨C ⟩, r and ⟨k⟩ are clustering coefficient [1], assortative coefficient [33] and average degree, respectively. H = and D =

2|E | |V |(|V |−1)

⟨k2 ⟩ ⟨k⟩2

is the degree heterogeneity,

is the link density.

Networks

|V |

|E |

⟨C ⟩

r

⟨k⟩

H

D

FWF C.elegans Hamster USAir MovieRate Reactome JDK WikiRate

128 297 300 332 1682 6229 6434 7115

2 106 2 148 2 503 2 126 94 834 146 160 53 658 100 762

0.335 0.292 0.201 0.625 0.364 0.598 0.671 0.141

−0.10 −0.163 −0.082 −0.207 −0.188

32.906 14.465 16.687 12.807 112.763 46.929 16.680 28.324

1.237 1.801 1.955 3.464 2.309 3.094 58.917 5.134

0.259 0.049 0.056 0.039 0.067 0.007 0.003 0.004

0.245

−0.223 −0.083

neurons; (iii) Hamster [29]–the friendship network of users of the website hamsterster.com; (iv) USAir [30]—the network of US airports with links denote flights between airports; (v) MovieRate [31]—a rating network in which links denote the ratings of users on the movies; (vi) Reactome [29]—a protein–protein interaction network for homosapiens; (vii) JDK [29]— the software dependency network of the JDK 1.6.0.7 framework, where each link indicates a dependency relationship between two software classes; and (viii) WikiRate [32]–the network of users from the English Wikipedia that voted for and against each other in admin elections. The structural statistics are shown in Table 1. We compare the proposed method with seven benchmarks, namely the common neighbor (CN) index [17], the Adamic– Adar (AA) index [34], the resource allocation (RA) index [35], the Cannistraci resource allocation (CRA) index [36], the local path (LP) index [25], the Katz index [24] and the structural perturbation method (SPM) [37]. CN index is defined as CN Sxy = |Γ (x) ∩ Γ (y)|,

(8)

where Γ (x) and Γ (y) are sets of neighbors of nodes x and y, respectively. AA and RA indices assign small-degree neighbors more weights, as 1



AA Sxy =

s∈|Γ (x)∩Γ (y)|

log(|Γ (s)|)

,

(9)

and 1



RA Sxy =

s∈|Γ (x)∩Γ (y)|

|Γ (s)|

.

(10)

CRA index prefers node pairs whose neighbors are densely connected, which is defined as CRA Sxy =

∑ s∈|Γ (x)∩Γ (y)|

|γ (s)| , |Γ (s)|

(11)

where γ (s) is the subset of neighbors of s that are also common neighbors of nodes x and y. LP index considers both contributions from 2-hop and 3-hop paths, as LP Sxy = (A2 )xy + ϵ (A3 )xy ,

(12)

where ϵ is a free parameter. Katz index considers all possible paths connecting nodes x and y with exponentially damped weights, as Katz Sxy = β (A)xy + β 2 (A2 )xy + β 3 (A3 )xy + · · · ,

(13)

where β is a free parameter. Katz index has various application scenarios, such as centrality characterization [24], link prediction [6], community detection [38] and knowledge graph characterization [39]. The SPM splits the observed network into two parts: a background network containing most links and a perturbation network containing a small portion of links. It uses eigenvectors of the background network while eigenvalues of the observed network to approximately reconstruct the observed network and the large-value elements in the reconstructed network but not the observed network indicate missing links. Readers are encouraged to find the mathematical details in the original article [37]. Table 2 compares the prediction accuracy, quantified by precision and AUC, of the proposed algorithm and the seven benchmark algorithms. Obviously, in most cases, LO performs best, usually with remarkably higher accuracy than widely applied local methods (CN, AA, RA, CRA and LP), as well as the famous global index, the Katz index. LO is also slightly better than the state-of-the-art method SPM, and LO runs much faster than SPM (see Section 3.5 about the comparison on CPU time). We further test the robustness of algorithms’ performances by varying the size of probe set from 5% to 50%. Again, LO performs overall best (see Figs. 2 and 3). Note that, LO performs different on various networks (e.g., the network

R. Pech, D. Hao, Y.-L. Lee et al. / Physica A 528 (2019) 121319

5

Table 2 Precision (top half) and AUC (bottom half) of the proposed algorithm and the seven benchmarks on the eight simple networks. Each result is averaged over 100 independent runs with probe set containing 10% random links. Parameters in LP, Katz and LO are tuned to their optimal values subject to maximal precision or AUC. The best-performed results are emphasized in bold. Networks

CN

AA

RA

CRA

LP

Katz

SPM

LO

FWF C.elegans Hamster USAir MovieRate Reactome JDK WikiRate

0.071 0.098 0.080 0.369 0.143 0.244 0.020 0.100

0.073 0.106 0.059 0.390 0.142 0.253 0.026 0.102

0.074 0.101 0.054 0.454 0.129 0.400 0.095 0.103

0.076 0.107 0.059 0.402 0.147 0.304 0.031 0.105

0.258 0.119 0.151 0.371 0.174 0.362 0.421 0.107

0.161 0.105 0.107 0.371 0.152 0.256 0.256 0.101

0.574 0.170 0.459 0.442 0.293 0.893 0.635 0.186

0.581 0.180 0.485 0.443 0.297 0.907 0.617 0.193

FWF C.elegans Hamster USAir MovieRate Reactome JDK WikiRate

0.607 0.845 0.779 0.936 0.902 0.989 0.823 0.927

0.608 0.862 0.782 0.947 0.904 0.990 0.896 0.928

0.611 0.866 0.782 0.953 0.902 0.991 0.914 0.928

0.620 0.768 0.714 0.916 0.903 0.978 0.864 0.882

0.799 0.870 0.860 0.928 0.916 0.992 0.941 0.954

0.718 0.867 0.859 0.925 0.908 0.991 0.912 0.947

0.948 0.884 0.911 0.924 0.945 0.991 0.935 0.941

0.950 0.895 0.912 0.938 0.950 0.995 0.983 0.965

Table 3 The structural statistics of the eight weighted networks. |V | and |E | are the number of nodes and links, respectively. ⟨C ⟩, ⟨k⟩ and ⟨s⟩ are weighted clustering coefficient [40], average degree and average node strength, where an arbitrary node i’s strength is defined ∑|V | as the sum of weights of its associated links, say si = j=1 wij . Networks

|V |

|E |

⟨C ⟩

⟨k⟩

⟨ s⟩

w-FWF w-FWE w-FWM w-C.elegans w-USAir w-World trade w-Lesmis w-Football

128 69 97 297 332 80 77 35

2106 880 1446 2148 2126 875 254 118

0.045 0.052 0.064 0.389 0.090 0.096 0.157 0.089

32.906 25.507 29.814 14.465 12.807 21.875 6.597 6.743

34.110 574.939 94.341 54.808 0.924 989 313.825 21.299 16.857

of Reactome and WikiRate). It is caused by the different predictabilities of the networks. If the network is generated by random or some complicated mechanisms, it is difficult to be predicted [37]. Moreover, from the view of topological statistical properties, the average clustering coefficient of WikiRate is smaller than Reactome. Generally speaking, the networks with poor locality are usually difficult to be predicted by most known link prediction algorithms. 3.2. Weighted networks Many real systems are naturally represented by weighted networks, since the strengths of links are highly heterogeneous and thus the binary representation will lose much information [41,42]. Accordingly, a number of methods are recently proposed to predict missing links in weighted networks [43–48]. LO can be directly extended to weighted networks via replacing the adjacency matrix A by the weight matrix W, where wij denotes link weight between nodes i and j and wij = 0 if i and j are disconnected. To avoid the over contributions from some very strong links, we normalize weights by using a simple sigmoid function as

w′ =

1 1 + e−w

.

(14)

If all original weights are positive, the normalized weights w ′ lie in the range (1/2, 1), while in a more general case with negative links [49], w ′ lie in the range (0, 1). We test the weighted LO method on eight weighted networks (using ‘‘w-’’ in their names to emphasize), including: (i) w-FWF [28]–the food web in Florida Bay in the dry season, where the link weight denotes energy flux between the corresponding species; (ii) w-FWE [50]–the food web in Everglades Graminoids in the wet season, with the same definition of weight as w-FWF; (iii) w-FWM [51]–the food web in Mangrove Estuary during the wet season, with the same definition of weight as w-FWF; (iv) w-C.elegans [1]–the neural network of worm named Caenorhabditis elegans, where two neurons are connected if at least one synapse or gap junction exists between them and the weight is defined as the number of synapses and gap junctions; (v) w-USAir [30]–the network of US airports with links denote flights between airports and

6

R. Pech, D. Hao, Y.-L. Lee et al. / Physica A 528 (2019) 121319

Fig. 2. Precision of LO and other benchmark methods on the eight simple networks for different probe sizes. The results are averaged by 100 independent runs, with the short vertical lines represent standard deviations. The result for LO is emphasized by red color.

the link weight is defined as the geographical distance between the two corresponding airports; (vi) w-WTN [52]–the world trading network of miscellaneous manufactures of metal among 80 countries in 1994, where each node denotes a country and the link weight represents the trade volume between the two corresponding countries; (vii) w-Lesmis [29]– the co-occurrences network of characters in the world-famous novel ‘Les Misérables’, where a link denotes two novel characters having co-appeared in the same chapter, and the weight of this link indicates the number of co-occurrences; (viii) w-Football [30]–the trading network of the players of the national soccer teams among 35 countries, where each node is a country and two nodes are connected if they have traded any football players, with link weight being defined as the number of traded players. The structural statistics are shown in Table 3. We compare the weighted LO with six benchmarks, namely the weighted common neighbor (WCN) index [43], the weighted Adamic–Adar (WAA) index [43], the weighted resource allocation (WRA) index [43], the reliable-route weighted CN (rWCN) index [46], the reliable-route weighted AA (rWAA) index [46] and the reliable-route weighted RA (rWRA) index [46]. They are mathematically defined as follows,



SWCN = xy

(wxs + wsy ),

s∈|Γ (x)∩Γ (y)|

wxs + wsy



SWAA = xy

s∈|Γ (x)∩Γ (y)|



SWRA = xy

log(1 + cs )

wxs + wsy

s∈|Γ (x)∩Γ (y)|

SrWCN = xy

∑ s∈|Γ (x)∩Γ (y)|

SrWAA = xy

∑ s∈|Γ (x)∩Γ (y)|

,

(15) (16)

,

(17)

(wxs · wsy ),

(18)

wxs · wsy , log(1 + cs )

(19)

cs

R. Pech, D. Hao, Y.-L. Lee et al. / Physica A 528 (2019) 121319

7

Fig. 3. AUC of LO and other benchmark methods on the eight simple networks for different probe sizes. The results are averaged by 100 independent runs, with the short vertical lines represent standard deviations. The result for LO is emphasized by red color.

SrWRA = xy

∑ s∈|Γ (x)∩Γ (y)|

wxs · wsy cs

,

(20)

where cs is the strength of node s, i.e. the sum of weights of its associated links. As shown in Table 4, LO performs best, with remarkably higher accuracy than all other methods. We also test the algorithms’ robustness by varying the size of probe set from 5% to 50%. Again, LO performs best no matter how large the probe size is (see Figs. 4 and 5).

3.3. Directed networks

Predicting links in directed network is the most challenging problem in link prediction since both the link existence and link direction have to be determined by the algorithm [53]. Obviously, LO can be directly extended to directed networks by introducing an asymmetric adjacency matrix A. Recently, a number of methods are proposed to solve this challenge [53– 56]. We compare the performance of LO with three kinds of algorithms: (i) the extension of local indices from simple networks to directed networks [56], including directed CN (d-CN), directed AA (d-AA) and directed RA (d-RA); (ii) the potential theory (PT) that makes use of local organization principle to predict the existence of missing directed links [55]; and (iii) the low rank (LR) approximation algorithm for directed networks [22]. The directed common-neighborhood-based indices are defined as

8

R. Pech, D. Hao, Y.-L. Lee et al. / Physica A 528 (2019) 121319 Table 4 Precision (top half) and AUC (bottom half) of the proposed algorithm and the six benchmarks on the eight weighted networks. Each result is averaged over 100 independent runs with probe set containing 10% random links. Parameters in LO are tuned to their optimal values subject to maximal precision or AUC. The best-performed results are emphasized in bold. Networks

WCN

WAA

WRA

rWCN

rWAA

rWRA

LO

w-FWF w-FWE w-FWM w-C.elegans w-USAir w-WTN w-Lesmis w-Football

0.079 0.180 0.141 0.104 0.368 0.430 0.486 0.123

0.083 0.186 0.144 0.112 0.400 0.450 0.550 0.112

0.086 0.193 0.149 0.105 0.449 0.455 0.544 0.093

0.096 0.172 0.155 0.107 0.366 0.436 0.480 0.102

0.093 0.173 0.162 0.108 0.355 0.428 0.491 0.113

0.099 0.174 0.154 0.107 0.361 0.428 0.492 0.101

0.527 0.505 0.509 0.175 0.442 0.490 0.541 0.250

w-FWF w-FWE w-FWM w-C.elegans w-USAir w-WTN w-Lesmis w-Football

0.602 0.698 0.709 0.849 0.930 0.894 0.904 0.672

0.607 0.710 0.718 0.864 0.948 0.909 0.915 0.667

0.610 0.715 0.716 0.864 0.953 0.920 0.922 0.652

0.593 0.696 0.696 0.849 0.931 0.907 0.905 0.660

0.591 0.693 0.699 0.850 0.929 0.907 0.915 0.673

0.592 0.691 0.694 0.851 0.929 0.907 0.908 0.659

0.949 0.916 0.924 0.885 0.939 0.930 0.886 0.815

Fig. 4. Precision of LO and other benchmark methods on the eight weighted networks for different probe sizes. The results are averaged by 100 independent runs, with the short vertical lines represent standard deviations. The result for LO is emphasized by red color.

d−CN Sxy = |Γ out (x) ∩ Γ in (y)|, d−AA Sxy =

∑ s∈|Γ out (x)∩Γ in (y)|

(21) 1

log(|Γ out (s)|)

,

(22)

R. Pech, D. Hao, Y.-L. Lee et al. / Physica A 528 (2019) 121319

9

Fig. 5. AUC of LO and other benchmark methods on the eight weighted networks for different probe sizes. The results are averaged by 100 independent runs, with the short vertical lines represent standard deviations. The result for LO is emphasized by red color.

d−RA Sxy =

∑ s∈|Γ out (x)∩Γ in (y)|

1

|Γ out (s)|

,

(23)

where Γ out (x) is the set of nodes that x points to, Γ in (y) is the set of nodes pointing to y, and Sxy here denotes the likelihood of a directed link from x to y. The LR method decomposes the adjacency matrix into a low-rank matrix and a sparse matrix, where the former contains missing links and the latter contains spurious links. More details are presented in the original article [22]. PT assumes that the motifs obeying the potential theory are preferred, and thus links generating more preferred motifs are of higher likelihoods. Mathematical details can be found in the original article [55]. We test the directed LO method as well as the above benchmarks on eight directed networks (using ‘‘d-’’ in their names to emphasize), including: (i) d-FWF [28]–the food web in Florida Bay in the dry season, where each link represents a predator–prey relationship from the predator to the prey. (ii) d-FWE [50]–the food web in Everglades Graminoids in the wet season, with the same definition of link direction as d-FWF; (iii) d-FWM [51]–the food web in Mangrove Estuary during the wet season, with the same definition of link direction as d-FWF; (iv) d-C.elegans [1]–the neural network of worm named Caenorhabditis elegans, where the link direction denotes the synapse or gap junction from one neuron to the other; (v) d-PB [57]–the network of front-page hyperlinks among blogs in the context of the 2004 US election, where each node denotes a blog and a directed link represents the directed hyperlink between two blogs; (vi) d-WikiRate [32]–the voting network for the admin elections in Wikipedia, where each node is a Wikipedia user and each directed link pointing from a voter to an admin candidate; (vii) d-SmaGri [30]–the citation network including the citations to the papers of Small & Griffith and Descendants, where each node is a paper and a directed link represents a citation; (viii) d-SciMet [30]–the citation network in the area of Scientometrics, with the same definition as d-SmaGri. The structural statistics are shown in Table 5. As shown in Table 6, LO performs best, with remarkably higher accuracy than all extended indices for directed networks and considerably higher accuracy than PT and LR. We also test the algorithms’ robustness by varying the size of probe set from 5% to 50%. Again, LO performs best no matter how large the probe size is (see Figs. 6 and 7).

10

R. Pech, D. Hao, Y.-L. Lee et al. / Physica A 528 (2019) 121319 Table 5 The structural statistics of the eight directed networks. |V | and |E | are the number of nodes and links, respectively. ⟨C ⟩ is the clustering coefficient for directed networks out [58]. kin max and kmax are the maximal in-degree and maximal out-degree of all nodes, respectively. ⟨k⟩ is the average degree of all nodes (average in-degree equals average |E | out-degree). D is the link density, defined as D = |V |(|V |−1) . Networks

|V |

|E |

⟨C ⟩

kin max

kout max

⟨k⟩

D

d-FWE d-FWM d-FWF d-C.elegans d-PB d-WikiRate d-SmaGri d-SciMet

69 97 128 297 1222 7115 1024 2729

911 1 492 2 137 2 345 19 021 103 689 4 918 10 412

0.303 0.261 0.176 0.183 0.246 0.121 0.175 0.101

63 90 110 134 337 457 89 121

44 46 63 39 256 893 232 104

13.2 15.4 16.7 7.9 15.6 14.6 4.8 3.8

0.193 0.160 0.131 0.027 0.013 0.002 0.005 0.001

Table 6 Precision (top half) and AUC (bottom half) of the proposed algorithm and the five benchmarks on the eight direct networks. Each result is averaged over 100 independent runs with probe set containing 10% random links. Parameters in LR and LO are tuned to their optimal values subject to maximal Precision or AUC. Due to the relatively high computational complexity of LR, the prediction results for d-WikiRate cannot be obtained in reasonable time. The best-performed results are emphasized in bold. Networks

d-CN

d-AA

d-RA

LR

PT

LO

d-FWF d-FWE d-FWM d-C.elegans d-PB d-WikiRate d-SmaGrid d-SciMet

0.058 0.144 0.088 0.063 0.189 0.121 0.081 0.053

0.076 0.169 0.117 0.065 0.192 0.123 0.067 0.043

0.088 0.251 0.124 0.056 0.149 0.049 0.041 0.023

0.549 0.582 0.518 0.076 0.182 N/A 0.021 0.035

0.114 0.196 0.234 0.060 0.102 0.074 0.072 0.048

0.570 0.623 0.532 0.143 0.210 0.189 0.106 0.110

d-FWF d-FWE d-FWM d-C.elegans d-PB d-WikiRate d-SmaGrid d-SciMet

0.731 0.755 0.764 0.789 0.917 0.908 0.706 0.647

0.736 0.752 0.782 0.795 0.919 0.909 0.702 0.639

0.739 0.745 0.787 0.793 0.917 0.885 0.707 0.647

0.828 0.905 0.817 0.604 0.732 N/A 0.537 0.621

0.857 0.809 0.875 0.812 0.925 0.963 0.815 0.861

0.967 0.965 0.961 0.888 0.954 0.972 0.884 0.797

Table 7 Precision (top half) and AUC (bottom half) of the proposed algorithms based on different matrix norms for five real simple networks. Each result is averaged over 100 independent runs with probe set containing 10% random links. All the parameters are tuned to their optimal values subject to maximal Precision or AUC. The best-performed results are emphasized in bold. Due to the high computational complexity of the invariant for ℓ∗ norm, the prediction results for MovieRate cannot be obtained in reasonable time. For some other simple networks, the ALM method cannot guarantee the convergence. Networks

ℓ1

ℓ21

ℓ∗

LO

FWF C.elegans Hamster USAir MovieRate

0.306 0.015 0.106 0.326 0.284

0.543 0.167 0.454 0.427 0.269

0.065 0.012 0.102 0.035 N/A

0.581 0.180 0.485 0.443 0.297

FWF C.elegans Hamster USAir MoviRate

0.751 0.535 0.658 0.793 0.914

0.944 0.887 0.908 0.928 0.933

0.573 0.532 0.627 0.467 N/A

0.950 0.895 0.912 0.938 0.950

3.4. Performance of alternative matrix norms In the proposed method, Frobenius norm is utilized. However, other matrix norms can also be used to constrain the solution matrix Z. One of the popular norms is the ℓ1 -norm [59]∑ (also called sparse norm in the literature), which calculates the sum of the absolute values of all matrix entries, as ∥A∥1 = i,j |aij |. The compressed or sparse matrix usually produces

R. Pech, D. Hao, Y.-L. Lee et al. / Physica A 528 (2019) 121319

11

Fig. 6. Precision of LO and other benchmark methods on the eight directed networks for different probe sizes. The results are averaged by 100 independent runs, with the short vertical lines represent standard deviations. The result for LO is emphasized by red color. As LR on WikiRate is time-consuming, the corresponding results is missing.

2 1/2 relatively small ℓ1 -norm. Another alternative is the ℓ21 -norm [60], which is defined as ∥A∥21 = . We j=1 ( i=1 |aij | ) ∑ also test the so-called trace norm or nuclear norm [61], defined as ∥A∥∗ = σ , where σ is the singular values of A. i i i Accordingly, the corresponding optimization functions are

∑n

∑m

min α∥A − AZ∥1 + ∥Z∥1 ,

(24)

min α∥A − AZ∥21 + ∥Z∥21 ,

(25)

min α∥A − AZ∥∗ + ∥Z∥∗ .

(26)

Z

Z

and Z

We use the Augmented Lagrange Multiplier (ALM) method [62] to solve the above three invariants. We test the LO method as well as the corresponding three invariants with different norms on five simple networks in Section 3.1, including a food web (FWF), a neural network (C.elegans), a friendship network (Hamster), an air transportation network (USAir), a rating network on movies (MovieRate). As shown in Table 7, from which one can observe that LO still performs best and the performance of ℓ21 -norm is close to LO. 3.5. Algorithms’ CPU times LO remarkably outperforms both well-known local and global methods for prediction accuracy. Here we compare its computational complexity with other methods. Fig. 8 reports the CPU times for some representative methods on simple networks (the computational complexity of LO does not increase for weighted and directed networks, and thus we do not show those similar results for weighted and directed networks). It is observed that LO’s computational time is much

12

R. Pech, D. Hao, Y.-L. Lee et al. / Physica A 528 (2019) 121319

Fig. 7. AUC of LO and other benchmark methods on the eight directed networks for different probe sizes. The results are averaged by 100 independent runs, with the short vertical lines represent standard deviations. The result for LO is emphasized by red color. As LR on WikiRate is time-consuming, the corresponding results is missing.

Fig. 8. The computational times of different methods for the eight simple networks, where the x-axis and y-axis represent the network size and CPU time (in seconds), respectively.

shorter than the global methods and comparable with local methods. In particular, the prediction accuracy of SPM is close to LO, while SPM is much more time-consuming. In a word, LO is a very effective and efficient method toward link prediction problem. The simulations are implemented on a normal Intel(R) Core (TM) i5-2520 desktop with a single 12 GB RAM.

R. Pech, D. Hao, Y.-L. Lee et al. / Physica A 528 (2019) 121319

13

Table 8 Precision (top half) and AUC (bottom half) of CN, DLO1 and DLO2 on the eight simple networks. Each result is averaged over 100 independent runs with probe set containing 10% random links. Parameter in DLO2 is tuned to its optimal values subject to maximal precision or AUC. Networks

FWF

C.elegans

Hamster

USAir

MovieRate

Reactome

JDK

WikiRate

CN DLO1 DLO2

0.072 0.315 0.474

0.097 0.123 0.160

0.060 0.185 0.363

0.369 0.358 0.442

0.143 0.188 0.233

0.244 0.445 0.498

0.024 0.511 0.516

0.100 0.107 0.116

CN DLO1 DLO2

0.605 0.816 0.921

0.851 0.846 0.892

0.778 0.860 0.903

0.936 0.897 0.939

0.905 0.923 0.935

0.989 0.987 0.991

0.823 0.947 0.955

0.927 0.954 0.961

Table 9 P (2) , P (3) , S (2) and S (3) for the eight simple networks. Networks

FWF

C.elegans

Hamster

USAir

MovieRate

Reactome

JDK

WikiRate

P (2) P (3) S (2) S (3)

1 0.9668 0.9552 0.9987

0.996 0.5755 0.6771 0.971

0.9888 0.5367 0.6871 0.975

0.9954 0.4739 0.6008 0.9496

0.999 0.7796 0.9048 0.9989

0.9972 0.2560 0.1493 0.5135

1 0.8858 0.5473 0.9836

0.9933 0.2182 0.25 0.8395

4. Degenerated local indices After observing the remarkably higher prediction accuracy of LO than other well-known local indices, we would like to uncover the underlying mechanism resulting in LO’s advantage. By substituting Z∗ in Eq. (7), one obtains S = AZ∗

= A(AT A + = A(AT A +

1

α 1

I)−1 AT A I)−1 (AT A +

α = A − A(I + α AT A)−1 .

1

α

I−

1

α

(27) I)

Applying the Neumann series (I + α AT A)−1 = I − α A2 + α 2 A4 − α 3 A6 + · · · ,

(28)

Eq. (27) can be rewritten as S = A − A(I − α A2 + α 2 A4 − α 3 A6 + · · ·)

= α A3 − α 2 A5 + α 3 A7 − α 4 A9 + · · · .

(29)

Comparing with the famous Katz index (i.e., β A + β 2 A2 + β 3 A3 + · · ·, where the first term does not work since all unobserved links correspond to zero elements in A), the differences lie in three aspects: (i) The expansion of LO starts from A3 (i.e., the number of 3-hop paths), instead of the usually considered item A2 (i.e., the number of 2-hop paths or the number of common neighbors); (ii) LO only takes into account odd paths; (iii) Some items in Eq. (29) play negative roles. To look closer, we focus on two degenerated local indices from LO, say A3 (named as DLO1) and A3 − α A5 (named as DLO2). Table 8 compares the prediction accuracy of CN, DLO1 and DLO2 on the eight simple networks. Two observations are highly striking. First of all, DLO1 remarkably outperforms CN, which challenges our intuition that shorter paths indicate stronger correlation than longer paths [24,25]. This is largely due to two following reasons. On the one hand, DLO1 (i.e., the number of 3-hop paths) is more informative than CN (i.e., the number of 2-hop paths). Denoting e(2) the set of node pairs connected by at least one 2-hop path and e(3) the set of node pairs connected by at least one 3-hop path, then we calculate fraction of node pairs connected by 3-hop paths in the set of node pairs having common neighbors ⋂ the P (2) = |e(2) e(3) |/|e(2) |, as ⋂ well as the fraction of node pairs connected by 2-hop paths in the set of node pairs connected by 3-hop paths P (3) = |e(2) e(3) |/|e(3) |. As shown in Table 9, for all the eight simple networks, P (3) < P (2) with P (2) very close to 1. That is to say, almost all node pairs being connected by at least one 2-hop path are also connected by at least one 3-hop path, while a considerable portion of node pairs connected by 3-hop paths do not have common neighbors. Hence we say A3 is more informative than A2 . On the other hand, DLO1 is more distinguishable than CN. Denoting N the (l) number of nodes in the target network and ni the number of node pairs connected by i different l-hop paths (these paths are allowed to pass through a node multiple times), then the ratio of node pairs connected by l-hop paths is ∑ i different (l) (l) (l) Ri = 2ni /N(N − 1) as there are in total N(N − 1)/2 node pairs. Obviously, for any given l, i≥0 Ri = 1. Lü et al. [25] showed an extremal case that in the router-level Internet [63] 99.59% of node pairs do not have common neighbors and

14

R. Pech, D. Hao, Y.-L. Lee et al. / Physica A 528 (2019) 121319

Fig. 9. The distributions of R(2) and R(3) for the eight simple networks, represented by red circles and blue squares, respectively. Since these (2) (3) distributions are drawn in log–log coordinate, the values of R0 and R0 are listed in the plots.

91.11% of those having common neighbors have just 1 common neighbor. In such case, the CN index is not distinguishable and the corresponding distribution of R(2) is highly concentrated. Therefore, we apply the famous diversity measure, called Simpson coefficient [64], to quantify the distinguishabilities of DLO1 and CN, namely S (l) = 1 −

∑ (l) [Ri ]2 ,

(30)

i

where i runs from zero to its possibly maximum value. Obviously, the larger S corresponds to more diverse and thus more distinguishable distribution of R. As shown in Table 9, A3 is more distinguishable than A2 (direct comparison between distributions of R(2) and R(3) is presented in Fig. 9). Putting the above two reasons together, it is now not surprising that the number of 3-hop paths is a better index than common neighbors in link prediction. Secondly, DLO2 remarkably outperforms DLO1. Clearly, as suggested by the well-known Homophily mechanism [65], if two nodes share many features, they have high probability to be directly connected [66]. Notice that, two nodes are probably connected by many 3-hop paths, and these paths may be built from independent reasons or may contain redundant information. The former usually indicates a higher similarity between the two nodes and thus to eliminate redundant correlation can improve the accuracy of link prediction [67]. Fig. 10 shows two examples where nodes i and j are both connected by 4 3-hop paths, say (A3 )ij = 4. The 4 paths in Fig. 10(a) are independent while the 4 paths in Fig. 10(b) are overlapped. Indeed, in the latter case, there are only two independent paths connecting i and j. At the same time, the overlapping paths will result in densely connected local structure and thus larger value of (A5 )ij , since (A5 ) includes the paths passing through a node by multiple times. Therefore, larger value of (A5 )ij indicates denser local connections and thus more redundance. This is the reason why to punish node pairs with many 5-hop paths will lead to better prediction as DLO2. In a word, we strongly suggest DLO1 (A3 ) and DLO2 (A3 − α A5 ) as two very good quasi-local indices for link prediction.

R. Pech, D. Hao, Y.-L. Lee et al. / Physica A 528 (2019) 121319

15

Fig. 10. The illustration of redundant information.

5. Conclusion and discussion This work starts from a very simple assumption that the likelihood of the existence of a link between two nodes can be unfolded by a linear summation of contributions of their common neighbors. The optimal likelihood matrix can be analytically obtained, with remarkably higher prediction accuracy than other state-of-the-art algorithms. The solution can be directly extended to weighted and directed networks, also with much better performance than well-known benchmarks. In particular, link prediction in directed networks is a great challenge and the proposed LO algorithm shows a huge advantage as shown in Table 6. However, there are two disadvantages for LO. Firstly, it takes time to adjust the parameter α for data performance. Secondly, the time complexity of LO is O(N2.373 ), which is mainly caused by the computation of matrix inversion of matrix α ∗ AT A + I in Eq. (27). One of the efficient algorithms to calculate the matrix inversion is Williams algorithms, whose time complexity is O(N2.373 ) [68]. The high time complexity makes LO cannot tackle the large-scale networks, but it can be applied to medium-scale networks with high accuracy. It is very interesting to notice that a formula similar to Eq. (6), named as ridge regression [69,70], was long ago proposed to estimate the solution of X in linear equation Y = AX + ε with A a singular matrix and ε the noise. Though the details of solutions of the two problems are different (e.g., the present solution does not involve noise ε or vector Y ), both method consider the usage of norm regularization to avoid the overfitting. Such regularization has recently found significant applications in disparate fields, such as brain science [71] and artificial intelligence [72]. Hence we believe the present linear optimization method could also find wide applications in graph mining and matrix completion. Lastly, after finishing this work, we are happy to see strongly supportive experiments in two very recent studies [73,74], which show that the number of 3-hop paths (the index is named as L3 , which is similar to but not the same as DLO1, as L3 uses a degree-normalized count) significantly outperforms CN index in predicting protein–protein interactions across multiple real datasets. Beyond biological explanations [73,74], our work indeed provides a solid theoretical basis with a more universal perspective. In the future work, we intend to compare the degenerated local indices with other local methods based on extensive real data. Acknowledgments The authors acknowledge discussions with István A. Kovács, Carlo Vittorio Cannistraci and Liming Pan. This work was partially supported by the National Natural Science Foundation of China (Grant Nos. 61433014, 61673085, 71601029 and 91748112). References [1] [2] [3] [4] [5] [6] [7] [8] [9]

D.J. Watts, S.H. Strogatz, Collective dynamics of ‘small-world’ networks, Nature 393 (1998) 440–442. A.-L. Barabási, R. Albert, Emergence of scaling in random networks, Science 286 (1999) 509–512. M.E.J. Newman, Networks: An Introduction, Oxford University Press, 2010. L. Lü, D. Chen, X.L. Ren, Q.M. Zhang, Y.-C. Zhang, T. Zhou, Vital nodes identification in complex networks, Phys. Rep. 650 (2016) 1–63. Csermely, Weak Links: Stabilizers of Complex Systems from Proteins to Social Networks, Springer, 2006. L. Lü, T. Zhou, Link prediction in complex networks: A survey, Physica A 390 (2011) 1150–1170. W.Q. Wang, Q.M. Zhang, T. Zhou, Evaluating network models: A likelihood analysis, EPL 98 (2012) 28004. Q.M. Zhang, X.K. Xu, Y.X. Zhu, T. Zhou, Measuring multiple evolution mechanisms of complex networks, Sci. Rep. 5 (2015) 10350. M.P. Stumpf, T. Thorne, E. De Silva, R. Stewart, H.J. An, M. Lappe, C. Wiuf, Estimating the size of the human interactome, Proc. Natl. Acad. Sci. USA 105 (2008) 6959–6964. [10] B. Barzel, A.-L. Barabási, Network link prediction by global silencing of indirect correlations, Nat. Biotechnol. 31 (2013) 720–725. [11] C.R. Chong, D.J. Sullivan, New uses for old drugs, Nature 448 (2007) 645–646. [12] H. Ding, I. Takigawa, H. Mamitsuka, S. Zhu, Similarity-based machine learning methods for predicting drug-target interactions: A brief review, Briefings Bioinf. 15 (2013) 734–747.

16

R. Pech, D. Hao, Y.-L. Lee et al. / Physica A 528 (2019) 121319

[13] L.M. Aiello, A. Barrat, R. Schifanella, C. Cattuto, B. Markines, F. Menczer, Friendship prediction and homophily in social media, ACM Trans. Web 6 (2012) 9. [14] L. Lü, M. Medo, C.H. Yeung, Y.-C. Zhang, Z.-K. Zhang, T. Zhou, Recommender systems, Phys. Rep. 519 (2012) 1–49. [15] J. Neville, D.J. Jensen, Relational dependency networks, J. Mach. Learn. Res. 8 (2007) 653–692. [16] K. Yu, W. Chu, S. Yu, V. Tresp, Z. Xu, Stochastic relational models for discriminative link prediction, in: Proceedings of the 19th International Conference on Neural Information Processing Systems, 2007, pp. 1553–1560. [17] D. Liben-Nowell, J. Kleinberg, The link-prediction problem for social networks, J. Assoc. Inf. Sci. Tech. 58 (2007) 1019–1031. [18] T. Zhou, L. Lü, Y.-C. Zhang, Predicting missing links via local information, Eur. Phys. J. B 71 (2009) 623–630. [19] A. Clauset, C. Moore, M.E.J. Newman, Hierarchical structure and the prediction of missing links in networks, Nature 453 (2008) 98–101. [20] R. Guimerá, M. Sales-Pardo, Missing and spurious interactions and the reconstruction of complex networks, Proc. Natl. Acad. Sci. USA 106 (2009) 22073–22078. [21] L. Backstrom, J. Leskovec, Supervised random walks: Predicting and recommending links in social networks, in: Proceedings of the 4th ACM International Conference on Web Search and Data Mining, 2011, pp. 635–644. [22] R. Pech, D. Hao, L. Pan, H. Cheng, T. Zhou, Link prediction via matrix completion, EPL 117 (2017) 38002. [23] L. Pan, T. Zhou, L. Lü, C.-K. Hu, Predicting missing links and identifying spurious links via likelihood analysis, Sci. Rep. 6 (2016) 22955. [24] L. Katz, A new status index derived from sociometric analysis, Psychometrika 18 (1953) 39–43. [25] L. Lü, C.-H. Jin, T. Zhou, Similarity index based on local paths for link prediction of complex networks, Phys. Rev. E 80 (2009) 046122. [26] J.L. Herlocker, J.A. Konstann, K. Terveen, J.T. Riedl, Evaluating collaborative filtering recommender systems, ACM Trans. Inf. Syst. 22 (2004) 5. [27] J.A. Hanley, B.J. McNeil, The meaning and use of the area under a receiver operating characteristic (ROC) curve, Radiology 143 (1982) 29–36. [28] R.E. Ulanowicz, C. Bondavalli, M.S. Egnotovich, Network analysis of trophic dynamics in South Florida Ecosystem. FY 97: The Florida Bay Ecosystem. Technical Report CBL 98–123, 1998. [29] Downloaded from the KONECT website http://konect.uni-koblenz.de/networks/. [30] V. Batagelj, A. Mrvar, 2006 Pajek datasets website. Available: http://vlado.fmf.uni-lj.si/pub/networks/data/. [31] GroupLens Research. 2006 MovieLens datasets. http://www.grouplens.org/node/73. [32] J. Leskovec, D.P. Huttenlocher, J.M. Kleinberg, Governance in social media: A case study of the wikipedia promotion process, in: Proceedings of the 4th International Conference on Weblogs and Social Media, 2010, pp. 98–105. [33] M.E.J. Newman, Assortative mixing in networks, Phys. Rev. Lett. 89 (2002) 208701. [34] L.A. Adamic, E. Adar, Friends and neighbors on the web, Soc. Netw. 25 (2003) 211–230. [35] Q. Ou, Y.-D. Jin, T. Zhou, B.-H. Wang, B.Q. Yin, Power-law strength-degree correlation from resource-allocation dynamics on weighted networks, Phys. Rev. E 75 (2007) 021102. [36] C.V. Cannistraci, G. Alanis-Lobato, T. Ravasi, From link-prediction in brain connectomes and protein interactomes to the local-communityparadigm in complex networks, Sci. Rep. 3 (2013) 1613. [37] L. Lü, L. Pan, T. Zhou, Y.-C. Zhang, H.E. Stanley, Toward link predictability of complex networks, Proc. Natl. Acad. Sci. USA 112 (2015) 2325–2330. [38] M. Girvan, M.E.J. Newman, Community structure in social and biological networks, Proc. Natl. Acad. Sci. USA 99 (2002) 7821–7826. [39] M. Nickel, K. Murphy, V. Tresp, E. Gabrilovich, A review of relational machine learning for knowledge graphs, Proc. IEEE 104 (2016) 11–33. [40] P. Holme, S.M. Park, B.J. Kim, C.R. Edling, KoreaN university life in a network perspective: Dynamics of a large affiliation network, Physica A 373 (2007) 821–830. [41] M.E.J. Newman, Analysis of weighted networks, Phys. Rev. E 70 (2004) 056131. [42] A. Barrat, M. Barthelemy, R. Pastor-Satorras, A. Vespignani, The architecture of complex weighted networks, Proc. Natl. Acad. Sci. USA 101 (2004) 3747–3752. [43] T. Murata, S. Moriyasu, Link prediction of social networks based on weighted proximity measures, in: Proceedings of the IEEE/WIC/ACM International Conference on Web Intelligence, 2007, pp. 85–88. [44] L. Lü, T. Zhou, Link prediction in weighted networks: The role of weak ties, EPL 89 (2010) 18001. [45] H.R. De Sá, R.B.C. Prudencio, Supervised link prediction in weighted networks, in: Proceedings of the 2011 International Joint Conference on Neural Networks, 2011, pp. 2281–2288. [46] J. Zhao, L. Miao, J. Yang, H. Fang, Q.-M. Zhang, M. Nie, P. Holme, T. Zhou, Prediction of links and weights in networks by reliable routes, Sci. Rep. 5 (2015) 12261. [47] N. Sett, S.R. Singh, S. Nandi, Influence of edge weight on node proximity based link prediction methods: An empirical analysis, Neurocomputing 172 (2016) 71–83. [48] B. Zhu, Y. Xia, Link prediction in weighted networks: A weighted mutual information model, PLoS ONE 11 (2016) e0148265. [49] J. Leskovec, D. Huttenlocher, J. Kleinberg, Predicting positive and negative links in online social networks, in: Proceedings of the 19th International Conference on World Wide Web, 2010, pp. 641–650. [50] R.E. Ulanowicz, J.J. Heymans, M.S. Egnotovich, Network analysis of trophic dynamics in South Florida Ecosystem. FY 99: The Graminoid Ecosystem. Technical Report TS–191–99, 2000. [51] R.E. Ulanowicz, C. Bondavalli, J.J. Heymans, M.S. Egnotovich, Network analysis of trophic dynamics in South Florida Ecosystem. FY 98: The Mangrove Ecosystem. Technical Report TS–191–99, 1999. [52] W. De Nooy, A. Mrvar, V. Batagelj, Exploratory Social Network Analysis with Pajek, Cambridge University Press, 2011. [53] F. Guo, Z. Yang, T. Zhou, Predicting link directions via a recursive subgraph-based ranking, Physica A 392 (2013) 3402–3408. [54] X. Wang, X. Zhang, C. Zhao, Z. Xie, S. Zhang, D. Yi, Predicting link directions using local directed path, Physica A 419 (2015) 260–267. [55] Q.M. Zhang, L. Lü, W.Q. Wang, T. Zhou, Potential theory for directed networks, PLoS ONE 8 (2013) e55437. [56] X. Xu, B. Liu, J. Wu, L. Jiao, Link prediction in complex networks via matrix perturbation and decomposition, Sci. Rep. 7 (2017) 14724. [57] L.A. Adamic, N. Glance, The political blogosphere and the 2004 U.S. election: Divided they blog, in: Proceedings of the 3rd International Workshop on Link Discovery, 2005, pp. 36–43. [58] G. Fagiolo, Clustering in complex directed networks, Phys. Rev. E 76 (2007) 026107. [59] E. Elhamifar, R. Vidal, Sparse subspace clustering: Algorithm, theory, and applications, IEEE Trans. Pattern Anal. Mach. Intell. 11 (2012) 2765–2781. [60] C. Ding, D. Zhou, X. He, H. Zha, R1 -PCA: Rotational invariant l1 -norm principal component analysis for robust subspace factorization, in: Proceedings of the 23rd International Conference on Machine Learning, 2006, pp. 281–288. [61] G. Liu, Z. Lin, S. Yan, J. Sun, Y. Yu, Y. Ma, Robust recovery of subspace structures by low-rank representation, IEEE Trans. Pattern Anal. Mach. Intell. 35 (2013) 171–184. [62] Z. Lin, M. Chen, Y. Ma, The augmented lagrange multiplier method for exact recovery of corrupted low-rank matrices. Preprint arXiv:1009.5055, 2010. [63] N. Spring, R. Mahajan, D. Wetherall, T. Anderson, Measuring ISP topologies with rocketfuel, IEEE/ACM Trans. Network. 12 (2004) 2–16. [64] E.H. Simpson, Measurement of diversity, Nature 163 (1949) 688. [65] M. McPherson, L. Smith-Lovin, J.M. Cook, Birds of a feather: Homophily in social networks, Annu. Rev. Social. 27 (2001) 415–444.

R. Pech, D. Hao, Y.-L. Lee et al. / Physica A 528 (2019) 121319

17

[66] G. Kossinets, D.J. Watts, Empirical analysis of an evolving social network, Science 311 (2006) 88–90. [67] T. Zhou, R.Q. Su, R.R. Liu, L.L. Jiang, B.H. Wang, Y.C. Zhang, Accurate and diverse recommendations via eliminating redundant correlations, New J. Phys. 11 (2009) 123008. [68] X. Lei, X. Liao, T. Huang, H. Li, C. Hu, Outsourcing large matrix inversion computation to a public cloud, IEEE Trans. Cloud Comput. 1 (2013) 78–87. [69] A.E. Hoerl, R.W. Kennard, Ridge regression: Biased estimation for nonorthogonal problems, Technometrics 12 (1970) 55–67. [70] A.N. Tikhonov, V.Y. Arsenin, Solution of Ill-Posed Problems, Winston & Sons, 1977. [71] S.M. Smith, T.E. Nichols, D. Vidaurre, A.M. Winkler, T.E.J. Behrens, M.F. Glasser, K. Ugurbil, D.M. Barch, D.C. Van Essen, K.L. Miller, A positive-negative mode of population covariation links brain connectivity, demographics and behavior, Nat. Neurosci. 18 (2015) 1565–1567. [72] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, Y. Chen, T. Lillicrap, F. Hui, L. Sifre, G. Van Den Driessche, T. Graepel, D. Hassabis, Mastering the game of go without human knowledge, Nature 550 (2017) 354–359. [73] I.A. Kovács, K. Luck, K. Spirohn, Y. Wang, C. Pollis, S. Schlabach, W. Bian, D.-K. Kim, N. Kishore, T. Hao, M.A. Calderwood, M. Vidal, A.-L. Barabási, Network-based prediction of protein interactions, Nature Commun. 10 (2019) 1240. [74] A. Muscoloni, I. Abdelhamid, C.V. Cannistraci, Local-community network automata modelling based on length-three-paths for prediction of complex network structures in protein interactomes, food webs and more. bioRxiv 10.1101/346916, 2018.