Node-coupling clustering approaches for link prediction




Accepted Manuscript

Node-coupling Clustering Approaches for Link Prediction

Fenhua Li, Jing He, Guangyan Huang, Yanchun Zhang, Yong Shi, Rui Zhou

PII: S0950-7051(15)00353-6
DOI: http://dx.doi.org/10.1016/j.knosys.2015.09.014
Reference: KNOSYS 3275

To appear in: Knowledge-Based Systems

Received Date: 4 March 2015
Revised Date: 10 July 2015
Accepted Date: 10 September 2015

Please cite this article as: F. Li, J. He, G. Huang, Y. Zhang, Y. Shi, R. Zhou, Node-coupling Clustering Approaches for Link Prediction, Knowledge-Based Systems (2015), doi: http://dx.doi.org/10.1016/j.knosys.2015.09.014

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

Node-coupling Clustering Approaches for Link Prediction

Fenhua Li (a,d,e,∗), Jing He (b), Guangyan Huang (f), Yanchun Zhang (b,d,∗), Yong Shi (a,c), Rui Zhou (b)

a CAS Research Center on Fictitious Economy and Data Science, University of Chinese Academy of Sciences, Beijing, 100190, China
b Centre for Applied Informatics, Victoria University, Melbourne, Australia
c CAS Key Laboratory for Big Data Mining and Knowledge Management, Beijing, China
d School of Computer Science, Fudan University, Shanghai, China
e Department of Computer Science and Technology, Yuncheng University, Shanxi, China
f School of Information Technology, Deakin University, Melbourne, Australia

Abstract

Due to the potentially important information in real-world networks, link prediction has become an interesting focus of different branches of science. Nevertheless, in the “big data” era, link prediction faces significant challenges, such as how to predict links in massive data efficiently and accurately. In this paper, we propose two novel node-coupling clustering approaches and their extensions for link prediction, which combine the coupling degrees of the common neighbor nodes of a predicted node-pair with cluster geometries of nodes. We then present an experimental evaluation to compare the prediction accuracy and effectiveness of our approaches and the representative existing methods on two synthetic datasets and six real-world datasets. The experimental results show that our approaches outperform the existing methods.

Keywords: Link prediction, Node-coupling clustering, Data mining, Big data

∗ Corresponding author. Tel.: +86 18810401728. Email addresses: [email protected] (Fenhua Li), [email protected] (Yanchun Zhang)

Preprint submitted to Knowledge-Based Systems, July 10, 2015

1. Introduction

1.1. Background

With the rapid development of internet technology, the amount of information in social networks has increased significantly, while accessing useful information from social networks has become more and more difficult [1]. Social networks contain a large amount of potentially useful information that is valuable for people’s daily lives and social business [2]. Therefore, social network analysis (SNA) has become a research focus for mining latent useful information from massive social network data. As part of this research, how to accurately predict a potential link in a real network is an important and challenging problem in many domains, such as recommender systems, decision making and criminal investigations. For example, we can predict a potential relationship between two persons to recommend new relationships in the Facebook network. In general, we call this problem link prediction [3]. As a subset of link mining [4], link prediction aims to compute the existence probabilities of the missing or future links among vertices in a network [5, 6].

There are two main difficulties in the link prediction problem: (1) the huge amount of data, which requires the prediction approaches to have low complexity; (2) prediction accuracy, which requires the prediction approaches to be highly accurate. However, traditional data mining approaches cannot solve the link prediction problem well, because they do not consider the relationships between entities, whereas the links between entities in a social network are interrelated. To overcome these two difficulties and meet the practical requirements, many similarity-based methods have been proposed. These methods are mainly based on local analysis and global analysis [7]. The approaches based on local analysis consider only the number or the different roles of the common neighbor nodes, which results in lower time complexity. At the same time, they have lower accuracy because of insufficient information. On the other hand, the approaches based on global analysis have higher prediction accuracy but also higher time complexity, because they access the global structure information of a network [5, 8]. So neither group is a satisfactory solution that overcomes both of the aforementioned difficulties.

In this paper, we propose two novel node-coupling clustering approaches and their extensions for the link prediction problem. They consider the different roles of nodes, and combine the coupling degrees of the common neighbor nodes of a predicted node-pair with cluster geometries of nodes. Our approaches remarkably outperform the existing methods in terms of efficiency, accuracy and effectiveness.
This is confirmed by the experiments in Section 5.

1.2. Contributions

The contributions of this paper consist of the following four aspects: (1) we propose two novel node-coupling clustering approaches and their extensions, which define a novel node-coupling degree metric; (2) we consider the coupling degrees of the common neighbor nodes of a predicted node-pair, by which some links that the existing methods cannot predict are accurately predicted; (3) we use the clustering coefficient to capture the clustering information of a network, which gives our approaches lower time complexity than the existing clustering methods; (4) we use the clustering information, which is important for predicting links, to improve the prediction accuracy. Experimental evaluation demonstrates that our approaches outperform other methods in terms of accuracy and complexity. Our approaches are very suitable for large-scale sparse networks.


1.3. Organization

The rest of this paper is organized as follows: Section 2 gives an overview of the related work on link prediction. Some preliminaries are briefly introduced in Section 3. Section 4 presents the idea of our approaches and gives their complexity analysis. The experimental study is presented in Section 5. Section 6 concludes this paper and discusses future work.

2. Related work

The existing link prediction approaches can be divided into three categories: the methods based on local analysis and global analysis [7], maximum likelihood estimation methods [5], and machine learning methods [5].

The methods based on local analysis and global analysis exploit the similarity of nodes in a network. The methods based on local analysis include Common Neighbors (CN), Adamic Adar (AA), Preferential Attachment (PA) and Jaccard Coefficient (JC). They suppose that the nodes of a network are independent of each other, and rely mostly on the local structure information of a network (i.e. node degree, nearest-neighbor information). In contrast, the methods based on global analysis include Katz, Hitting Time, Average Commute Time (ACT), Cosine based on random walk (CosRW), Graph distance (GD), Rooted PageRank and SimRank. These methods capture the global structure information of a network (i.e. the path set of a specific length). The above methods are described and discussed in [5, 7]. Besides, Liu and Lü introduced a local random walk approach that provides good prediction accuracy [9]. Furthermore, Zhou et al. proposed two similarity-based methods, Resource Allocation (RA) and Local Path (LP), and verified their performance on six real-world datasets [10]. In practice, since the methods based on local analysis focus only on the local structure information of a network, they have lower computational complexity than those based on global analysis, and are suitable for large-scale networks. However, as the local-based methods do not have sufficient information, they have lower prediction accuracy than those based on global analysis.

Maximum likelihood estimation methods apply some presumed rules and parameters that maximize the probability of the known structure to predict the potential links in a network. Clauset et al. proposed a missing-link prediction method based on hierarchical structure in incomplete networks [11]. Guimera and Sales-Pardo presented a stochastic block model approach to link prediction [12].

Machine learning methods use a learned model to predict links by extracting the latent structure information of a known network. O’Madadhain et al. applied several classifiers to predict the potential links from probability theory in event-based networks [13]. Hasan et al. treated link prediction as a supervised learning task, which predicts the link of a predicted node-pair by classifying it as a negative or positive example. They extracted the features of a co-authorship network and evaluated the prediction results of several classifiers [14].

Although the latter two groups of methods provide competitively accurate predictions compared with the methods based on local analysis and global analysis, they are only suitable for small-scale real networks and are impractical for large-scale sparse networks because of their high computational complexity. Therefore, Li and He et al. presented a clustering-based link prediction method (i.e. CLPA), which considers the clustering and scale-free features of a network for link prediction [15]. In this paper, we propose two novel node-coupling clustering approaches and their extensions to solve this problem.

3. Preliminaries

3.1. Clustering coefficient

In graph theory, the clustering coefficient is a metric that can evaluate the extent to which nodes tend to cluster together in a graph [16]. It can capture the clustering information of nodes in a graph [17]. An undirected network can be described as a graph $G = (V, E)$, where $V$ denotes the set of nodes and $E$ indicates the set of edges. $v_i \in V$ is a node in graph $G$. The clustering coefficient of node $v_i$ in graph $G$ can be defined as

$$C(i) = \frac{E_i}{(k_i \cdot (k_i - 1))/2} = \frac{2 \cdot E_i}{k_i \cdot (k_i - 1)} \qquad (1)$$

where $C(i)$ denotes the clustering coefficient value of node $v_i$, $k_i$ represents the degree value of node $v_i$, and $E_i$ is the number of connected links among the $k_i$ neighbors of node $v_i$. For example, there is a node $v_1$ in graph $G$. The degree value of node $v_1$ is 5 (i.e. $k_1 = 5$). The number of connected links among the neighbors of node $v_1$ is 6 (i.e. $E_1 = 6$). Thus, the clustering coefficient value of node $v_1$ is:

$$C(1) = \frac{2 \cdot E_1}{k_1 \cdot (k_1 - 1)} = \frac{2 \cdot 6}{5 \cdot (5 - 1)} = 0.6.$$
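As an illustration of Equation (1), the following minimal Python sketch computes the clustering coefficient on an adjacency-set representation. The function name `clustering_coefficient`, the dictionary layout and the toy graph are assumptions made for this example (the paper's own code is written in Matlab and is not shown); returning 0 for nodes with fewer than two neighbors is also an assumed convention.

```python
from itertools import combinations


def clustering_coefficient(adj, node):
    """Local clustering coefficient C(i) = 2*E_i / (k_i*(k_i-1)), Equation (1).

    adj maps each node to the set of its neighbors (undirected graph).
    """
    neighbors = adj[node]
    k = len(neighbors)
    if k < 2:
        return 0.0  # assumed convention: undefined case treated as 0
    # E_i: number of links that exist among the k neighbors of the node
    links = sum(1 for u, v in combinations(neighbors, 2) if v in adj[u])
    return 2.0 * links / (k * (k - 1))


# Hypothetical graph mirroring the worked example: v1 has five neighbors
# (a..e) and there are six links among those neighbors.
edges = [("v1", x) for x in "abcde"]
edges += [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e"), ("a", "c"), ("b", "d")]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

print(clustering_coefficient(adj, "v1"))  # 2*6 / (5*4) = 0.6
```

The adjacency-set layout keeps the membership test `v in adj[u]` constant time on average, which is why it is used here instead of an edge list.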

3.2. Evaluation metrics

In this section, we present two popular metrics for link prediction accuracy employed in this paper: the area under the receiver operating characteristic curve (AUC) and Precision. In general, a link prediction method computes a score $S_{xy}$ for each unknown link to evaluate its existence probability, and gives an ordered list of all unknown links based on these $S_{xy}$ values [18].

AUC: It evaluates the overall performance of a link prediction method. As described in [5, 8], the AUC value can be considered as the probability that a randomly chosen existing yet unknown link has a higher $S_{xy}$ value than a randomly chosen non-existing link. That is, each time we randomly select an existing yet unknown link in the test set and compare its score with that of a randomly selected non-existing link. Among $N$ independent comparisons, let $H$ be the number of times that the existing yet unknown link has the higher $S_{xy}$ value, and $E$ the number of times that the two links have the same $S_{xy}$ value. The AUC value is defined as:

$$\mathrm{AUC} = \frac{H + 0.5 \cdot E}{N} \qquad (2)$$

Precision: This metric considers the $N$ links with the highest $S_{xy}$ values among all unknown links. If there are $T$ existing yet unknown links among the top $N$ unknown links [5, 8], the Precision is defined as:

$$\mathrm{Precision} = \frac{T}{N} \qquad (3)$$
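Equations (2) and (3) can be sketched in a few lines of Python. The function names `auc_by_sampling` and `precision_at_n`, the score dictionaries and the toy values below are hypothetical and only serve to make the two definitions concrete; the sampling is done with replacement, which is an assumption about how the N comparisons are drawn.

```python
import random


def auc_by_sampling(test_scores, nonexistent_scores, n_comparisons, rng):
    """Equation (2): AUC = (H + 0.5*E) / N over N random comparisons."""
    higher = equal = 0
    for _ in range(n_comparisons):
        s_test = rng.choice(test_scores)         # existing yet unknown link
        s_non = rng.choice(nonexistent_scores)   # non-existing link
        if s_test > s_non:
            higher += 1
        elif s_test == s_non:
            equal += 1
    return (higher + 0.5 * equal) / n_comparisons


def precision_at_n(scored_links, test_links, n):
    """Equation (3): fraction of the top-n scored unknown links that are in the test set."""
    top = sorted(scored_links, key=scored_links.get, reverse=True)[:n]
    return sum(1 for link in top if link in test_links) / n


# Toy values (hypothetical): every test link outranks every non-existing link.
print(auc_by_sampling([0.9, 0.8], [0.1, 0.2], 100, random.Random(0)))  # 1.0

scores = {("a", "b"): 0.9, ("b", "d"): 0.7, ("a", "c"): 0.5, ("c", "d"): 0.1}
print(precision_at_n(scores, {("a", "b"), ("c", "d")}, 2))  # 0.5
```

When all scores are tied, every comparison counts as an equal one and the sketch returns 0.5, which matches the interpretation of AUC = 0.5 as random guessing.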

4. Node-coupling clustering approaches

In this section, we present our approaches for link prediction. Firstly, we present a new node-coupling degree metric, the node-coupling clustering coefficient. Then, we present the process of our approaches. Finally, we give the complexity analysis of our approaches.

4.1. Node-coupling clustering coefficient

Many similarity-based methods only consider the number or degrees of the common neighbor nodes of a predicted node-pair in link prediction, and few further exploit the coupling degrees among the common neighbor nodes and the clustering information to improve the prediction accuracy. For this reason, we propose a new node-coupling degree metric, the node-coupling clustering coefficient (NCCC), which can capture the clustering information of a network and evaluate the coupling degrees between the common neighbor nodes of a predicted node-pair. It also considers the different roles of the common neighbor nodes of a predicted node-pair in a network. Now, we introduce this metric through a simple example.

Fig. 1. An example for predicting the link between nodes M and N in two original networks. Panels: (a), (b), (c), (d).

Fig.1 shows an example for predicting the link between nodes M and N in two networks. The two original networks are described in Fig.1(a) and Fig.1(b). Fig.1(c) and Fig.1(d) are two subgraphs that consist of nodes M, N and their common neighbors in Fig.1(a) and Fig.1(b), respectively. We aim to predict in which of Fig.1(a) and Fig.1(b) the link between nodes M and N is more likely to exist. We find that the coupling degrees of nodes M, N and their common neighbors are higher in Fig.1(c) than in Fig.1(d). Thus, we believe the link between nodes M and N is more likely to exist in Fig.1(a) than in Fig.1(b).

If we apply CN, AA, RA or PA to predict the link between nodes M and N in these two original networks, each method gives the same prediction result for both networks. The reasons are as follows. From Fig.1(a) and Fig.1(b), we can see that the common neighbor set {b, d, f} of nodes M, N is the same, and every corresponding common neighbor node has the same degree value in these two original networks. In CN, the similarity metric is the number of the common neighbor nodes of a predicted node-pair, so CN gives the same prediction result because the common neighbor set {b, d, f} of nodes M, N is the same in both networks. RA and AA are based on the degree values of the common neighbor nodes, so each of them also gives the same result in both networks, because every corresponding common neighbor node in {b, d, f} has the same degree value. For a similar reason, PA provides the same prediction result, because nodes M and N have the same degree values in these two original networks. However, the link probabilities between node M and node N in Fig.1(a) and Fig.1(b) are not likely to be the same.

In the above case, inspired by [10, 19], we propose a new node-coupling degree metric based on the clustering information and node degree, the node-coupling clustering coefficient. This metric can not only resolve the above prediction problem in Fig.1, but also capture the clustering information of a network. If node n is a common neighbor node of the predicted node-pair (M, N), the node-coupling clustering coefficient of node n, NCCC(n), can be defined as follows:

$$\mathrm{NCCC}(n) = \frac{\sum_{i \in C_n} \left( \frac{1}{d_i} + C(i) \right)}{\sum_{j \in \Gamma(n)} \left( \frac{1}{d_j} + C(j) \right)} \qquad (4)$$

where $\Gamma(n)$ is the neighbor node set of node n, (M, N) denotes a predicted node-pair, and $n \in \Gamma(M) \cap \Gamma(N)$. $C_n$ denotes the common neighbor node set of the node-pair (M, N) in $\Gamma(n)$, which includes nodes M, N; namely, $C_n = (\Gamma(M) \cap \Gamma(N) \cap \Gamma(n)) \cup \{M, N\}$. $d_i$ denotes the degree value of node i, and $C(i)$ denotes the clustering coefficient of node i. In this metric, $\frac{1}{d_i} + C(i)$ is considered as the contribution of node i to the coupling degree of the common neighbor nodes of the predicted node-pair (M, N). The node-coupling clustering coefficient of node n is the ratio of the contribution sum of all nodes in $C_n$ to that in $\Gamma(n)$. In this way, our approaches can apply this metric, which incorporates the clustering information and the different roles of each related node, to improve the prediction accuracy for link prediction.

In Equation (4), since $C_n \subseteq \Gamma(n)$, we have $\sum_{i \in C_n} \left( \frac{1}{d_i} + C(i) \right) \le \sum_{j \in \Gamma(n)} \left( \frac{1}{d_j} + C(j) \right)$. As a result, $\mathrm{NCCC}(n) \in (0, 1]$. In particular, $\mathrm{NCCC}(n) = 1$ when $C_n = \Gamma(n)$.
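Equation (4) can be sketched in Python as follows. The helper names (`clustering_coefficient`, `contribution`, `nccc`) and the small graph are hypothetical; in particular, the graph is not the network of Fig.1, whose edges cannot be recovered from the text.

```python
from itertools import combinations


def clustering_coefficient(adj, node):
    """C(i) from Equation (1); 0 is assumed for nodes with fewer than 2 neighbors."""
    nbrs = adj[node]
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for u, v in combinations(nbrs, 2) if v in adj[u])
    return 2.0 * links / (k * (k - 1))


def contribution(adj, node):
    """Contribution 1/d_i + C(i) of a node, as used in Equation (4)."""
    return 1.0 / len(adj[node]) + clustering_coefficient(adj, node)


def nccc(adj, x, y, n):
    """NCCC(n) for a common neighbor n of the predicted node-pair (x, y)."""
    if n not in adj[x] or n not in adj[y]:
        raise ValueError("n must be a common neighbor of x and y")
    # C_n = (Gamma(x) & Gamma(y) & Gamma(n)) | {x, y}
    coupled = (adj[x] & adj[y] & adj[n]) | {x, y}
    numerator = sum(contribution(adj, i) for i in coupled)
    denominator = sum(contribution(adj, j) for j in adj[n])
    return numerator / denominator


# Hypothetical toy graph: M and N share the neighbors b and d, b and d are
# linked to each other, and b has one extra neighbor z outside the pair.
adj = {
    "M": {"b", "d"},
    "N": {"b", "d"},
    "b": {"M", "N", "d", "z"},
    "d": {"M", "N", "b"},
    "z": {"b"},
}
print(nccc(adj, "M", "N", "d"))  # C_d equals Gamma(d), so the ratio is 1
print(nccc(adj, "M", "N", "b"))  # z lies outside C_b, so the ratio drops below 1
```

In this toy graph the contributions of M, N, d and z are 1.5, 1.5, 1.0 and 1.0, so NCCC(b) is 4/5: the neighbor z that is not coupled to the pair lowers the score of b.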


4.2. Node-coupling clustering approach based on probability theory (NCCPT)

Based on probability theory, we propose a new link prediction approach that uses the node-coupling clustering coefficient (NCCC) of Section 4.1. Given a pair of predicted nodes (x, y), let node n be a common neighbor node of the node-pair (x, y). NCCC(n) can be considered as the contribution of node n to the connecting probability for the predicted node-pair (x, y). $P(n)$ denotes the link existence probability that node x and node y connect because of node n. $\bar{P}(n)$ denotes the link non-existence probability, i.e. the probability that node n does not connect node x to node y. Therefore, $P(n) = \mathrm{NCCC}(n)$ and $\bar{P}(n) = 1 - \mathrm{NCCC}(n)$. $\{A_1, A_2, \ldots, A_i, \ldots, A_m\}$ is the common neighbor set of the predicted node-pair (x, y), namely $\Gamma(x) \cap \Gamma(y) = \{A_1, A_2, \ldots, A_i, \ldots, A_m\}$. We assume that these common neighbor nodes of the node-pair (x, y) are independent of each other. If there exists a link between nodes x and y, at least one common neighbor node in $\{A_1, A_2, \ldots, A_i, \ldots, A_m\}$ connects node x to node y. According to probability theory, the link existence probability of the predicted node-pair (x, y), $S_{xy}$, can be written as follows:

$$
\begin{aligned}
S_{xy} &= 1 - \bar{P}(A_1) \cdot \bar{P}(A_2) \cdots \bar{P}(A_i) \cdots \bar{P}(A_m) \\
&= 1 - (1 - P(A_1)) \cdot (1 - P(A_2)) \cdots (1 - P(A_i)) \cdots (1 - P(A_m)) \\
&= 1 - (1 - \mathrm{NCCC}(A_1)) \cdot (1 - \mathrm{NCCC}(A_2)) \cdots (1 - \mathrm{NCCC}(A_i)) \cdots (1 - \mathrm{NCCC}(A_m)) \\
&= 1 - \prod_{n \in \Gamma(x) \cap \Gamma(y)} \left( 1 - \frac{\sum_{i \in C_n} \left( \frac{1}{d_i} + C(i) \right)}{\sum_{j \in \Gamma(n)} \left( \frac{1}{d_j} + C(j) \right)} \right)
\end{aligned}
\qquad (5)
$$

Equation (5) is a new node similarity metric in our approach. Clearly, a larger value of $S_{xy}$ means a higher probability that there exists a potential link between nodes x and y. The related parameters in Equation (5) have been described in Section 4.1. In Equation (5), since $\mathrm{NCCC}(n) \in (0, 1]$, we have $\prod_{n \in \Gamma(x) \cap \Gamma(y)} (1 - \mathrm{NCCC}(n)) \in [0, 1)$, and $S_{xy} \in (0, 1]$. In particular, $S_{xy} = 1$ when $C_n = \Gamma(n)$ for every node in $\Gamma(x) \cap \Gamma(y)$ (one such node already suffices, since its factor in the product is zero).

For example, we apply our approach to predict the link probability of nodes M, N in Fig.1, where the subscripts (b), (d), (f) mark the common neighbor that each factor belongs to:

Fig.1(a): $S_{MN} = 1 - \left(1 - \frac{2.92}{4.92}\right)_{(b)} \cdot \left(1 - \frac{2.8}{2.8}\right)_{(d)} \cdot \left(1 - \frac{2.92}{4.92}\right)_{(f)} = 1$

Fig.1(b): $S_{MN} = 1 - \left(1 - \frac{0.67}{3.67}\right)_{(b)} \cdot \left(1 - \frac{0.67}{2.67}\right)_{(d)} \cdot \left(1 - \frac{0.67}{3.67}\right)_{(f)} = 0.50$

From the above computing results, we find that the potential link between node M and node N is more likely to exist in Fig.1(a) than in Fig.1(b). Algorithm 1 describes the process of our above approach.

4.3. Node-coupling clustering approach based on common neighbors (NCCCN)

The traditional CN method is based on the number of the common neighbor nodes of a predicted node-pair [7]. Its similarity metric is defined as follows:

$$S^{CN}_{xy} = |\Gamma(x) \cap \Gamma(y)| \qquad (6)$$

where $\Gamma(i)$ denotes the neighbor node set of node i and $|\Gamma(i)|$ the number of its neighbors, so that $|\Gamma(x) \cap \Gamma(y)|$ is the number of common neighbor nodes of x and y. Although CN has low complexity in the link prediction problem, it does not consider the different roles of the common neighbor nodes of a predicted node-pair. This results in low prediction accuracy. Here, we propose a new link prediction approach based on CN, which combines the different contributions of different nodes to the connecting probability with the clustering information of a network.

In our approach, (x, y) is a predicted node-pair, and node n is a common neighbor node of the node-pair (x, y). NCCC(n) can be considered as the contribution of node n to connecting node x to node y. $Score(n)$ denotes the contribution score of node n to connecting node x to node y. Therefore, $Score(n) = \mathrm{NCCC}(n)$. $\{A_1, A_2, \ldots, A_i, \ldots, A_m\}$ is the common neighbor set of the predicted node-pair (x, y), namely $\Gamma(x) \cap \Gamma(y) = \{A_1, A_2, \ldots, A_i, \ldots, A_m\}$. Here, these common neighbor nodes of the node-pair (x, y) are assumed to be independent of each other. We use the contribution sum of all common neighbor nodes of the predicted node-pair (x, y), $S_{xy}$, to evaluate the link existence likelihood between nodes x and y. Therefore, the new similarity metric in our approach is defined as follows:

$$
\begin{aligned}
S_{xy} &= Score(A_1) + Score(A_2) + \cdots + Score(A_i) + \cdots + Score(A_m) \\
&= \mathrm{NCCC}(A_1) + \mathrm{NCCC}(A_2) + \cdots + \mathrm{NCCC}(A_i) + \cdots + \mathrm{NCCC}(A_m) \\
&= \sum_{1 \le i \le m} \mathrm{NCCC}(A_i) \\
&= \sum_{n \in \Gamma(x) \cap \Gamma(y)} \frac{\sum_{i \in C_n} \left( \frac{1}{d_i} + C(i) \right)}{\sum_{j \in \Gamma(n)} \left( \frac{1}{d_j} + C(j) \right)}
\end{aligned}
\qquad (7)
$$

In our approach, the related parameters in Equation (7) have been described in Section 4.1. Clearly, a larger value of $S_{xy}$ means a higher likelihood that there exists a potential link between nodes x and y. $|\Gamma(x) \cap \Gamma(y)|$ is the number of common neighbor nodes of the predicted node-pair (x, y). In Equation (7), since $0 < \mathrm{NCCC}(n) \le 1$, we have $0 < \sum_{n \in \Gamma(x) \cap \Gamma(y)} \mathrm{NCCC}(n) \le |\Gamma(x) \cap \Gamma(y)|$, and $S_{xy} \in (0, |\Gamma(x) \cap \Gamma(y)|]$. In particular, $S_{xy} = |\Gamma(x) \cap \Gamma(y)|$ when $C_n = \Gamma(n)$ for every node in $\Gamma(x) \cap \Gamma(y)$.

For example, we use this approach to compute the similarity score of the predicted node-pair (M, N) in Fig.1(a) and Fig.1(b), respectively:

Fig.1(a): $S_{MN} = \left(\frac{2.92}{4.92}\right)_{(b)} + \left(\frac{2.8}{2.8}\right)_{(d)} + \left(\frac{2.92}{4.92}\right)_{(f)} = 2.19$

Fig.1(b): $S_{MN} = \left(\frac{0.67}{3.67}\right)_{(b)} + \left(\frac{0.67}{2.67}\right)_{(d)} + \left(\frac{0.67}{3.67}\right)_{(f)} = 0.61$

We obtain the same prediction result as NCCPT. From this example, we find that our node-coupling clustering approaches can provide better prediction results than the traditional methods. Algorithm 1 illustrates the process of our above approach.
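The two baseline scores of Equations (5) and (7) only differ in how the NCCC values of the common neighbors are aggregated, which the following Python sketch makes explicit. The function names `nccpt_score` and `ncccn_score` are hypothetical; the inputs are the rounded NCCC fractions printed in the Fig.1 example, not values recomputed from the figure.

```python
import math


def nccpt_score(nccc_values):
    """Equation (5): S_xy = 1 - prod(1 - NCCC(n)) over the common neighbors."""
    return 1.0 - math.prod(1.0 - v for v in nccc_values)


def ncccn_score(nccc_values):
    """Equation (7): S_xy = sum of NCCC(n) over the common neighbors."""
    return sum(nccc_values)


# NCCC values of the common neighbors b, d, f, as printed in the example.
fig1a = [2.92 / 4.92, 2.8 / 2.8, 2.92 / 4.92]
fig1b = [0.67 / 3.67, 0.67 / 2.67, 0.67 / 3.67]

for name, values in (("Fig.1(a)", fig1a), ("Fig.1(b)", fig1b)):
    print(name, "NCCPT:", round(nccpt_score(values), 3),
          "NCCCN:", round(ncccn_score(values), 3))
```

With these rounded inputs, the NCCPT scores come out at 1 and about 0.50, and the NCCCN score for Fig.1(a) at about 2.19, as in the text. The NCCCN score for Fig.1(b) evaluates to about 0.616 rather than the printed 0.61, presumably because the paper worked with unrounded fractions.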

4.4. The extensions of NCCPT and NCCCN

To further improve the performance of link prediction, we extend NCCPT and NCCCN by adding the clustering coefficient of every selected common neighbor node, C(n), to the above two approaches respectively.

As before, (x, y) is a predicted node-pair and node n is a common neighbor node of the node-pair (x, y). When we add the node clustering coefficient, C(n), to the contribution of node n to the connecting probability in NCCPT, we obtain a new contribution of node n: NCCC(n) + C(n). However, $0 \le \mathrm{NCCC}(n) + C(n) \le 2$, so this value can fall outside the range of a probability. In order to extend NCCPT, we therefore use the average value of NCCC(n) and C(n), $\frac{1}{2} \cdot (C(n) + \mathrm{NCCC}(n))$, as the contribution of node n. Therefore, $S_{xy}$ in the extended NCCPT approach (ENCCPT) is defined as follows:

$$
\begin{aligned}
S_{xy} &= 1 - \bar{P}(A_1) \cdot \bar{P}(A_2) \cdots \bar{P}(A_i) \cdots \bar{P}(A_m) \\
&= 1 - (1 - P(A_1)) \cdot (1 - P(A_2)) \cdots (1 - P(A_i)) \cdots (1 - P(A_m)) \\
&= 1 - \left(1 - \tfrac{1}{2} (\mathrm{NCCC}(A_1) + C(A_1))\right) \cdot \left(1 - \tfrac{1}{2} (\mathrm{NCCC}(A_2) + C(A_2))\right) \cdots \\
&\qquad \cdot \left(1 - \tfrac{1}{2} (\mathrm{NCCC}(A_i) + C(A_i))\right) \cdots \left(1 - \tfrac{1}{2} (\mathrm{NCCC}(A_m) + C(A_m))\right) \\
&= 1 - \prod_{n \in \Gamma(x) \cap \Gamma(y)} \left( 1 - \frac{1}{2} \cdot \left( \frac{\sum_{i \in C_n} \left( \frac{1}{d_i} + C(i) \right)}{\sum_{j \in \Gamma(n)} \left( \frac{1}{d_j} + C(j) \right)} + C(n) \right) \right)
\end{aligned}
\qquad (8)
$$

where $C(n)$ is the clustering coefficient of node n. The other parameters are the same as in Equation (5). In Equation (8), since $\mathrm{NCCC}(n) \in (0, 1]$ and $C(n) \in [0, 1]$, we have $\prod_{n \in \Gamma(x) \cap \Gamma(y)} \left(1 - \frac{1}{2} (\mathrm{NCCC}(n) + C(n))\right) \in [0, 1)$, and $S_{xy} \in (0, 1]$. In particular, $S_{xy} = 1$ when $C_n = \Gamma(n)$ and $C(n) = 1$ for every node in $\Gamma(x) \cap \Gamma(y)$.

For instance, ENCCPT is used to predict the link existence probabilities of the node-pair (M, N) in Fig.1(a) and Fig.1(b) as follows:

Fig.1(a): $S_{MN} = 1 - \left(1 - 0.5 \cdot \left(\frac{2.92}{4.92} + 0.2\right)\right)_{(b)} \cdot \left(1 - 0.5 \cdot \left(\frac{2.8}{2.8} + 0.67\right)\right)_{(d)} \cdot \left(1 - 0.5 \cdot \left(\frac{2.92}{4.92} + 0.2\right)\right)_{(f)} = 0.94$

Fig.1(b): $S_{MN} = 1 - \left(1 - 0.5 \cdot \left(\frac{0.67}{3.67} + 0\right)\right)_{(b)} \cdot \left(1 - 0.5 \cdot \left(\frac{0.67}{2.67} + 0\right)\right)_{(d)} \cdot \left(1 - 0.5 \cdot \left(\frac{0.67}{3.67} + 0\right)\right)_{(f)} = 0.28$

Similarly, (x, y) represents a pair of predicted nodes, and node n is a common neighbor node of the node-pair (x, y). When we add the node clustering coefficient, C(n), to the contribution of node n that connects node x to node y in NCCCN, we obtain a new contribution of node n: NCCC(n) + C(n). $Score(n)$ denotes the contribution score of node n to connecting node x to node y. Therefore, $Score(n) = \mathrm{NCCC}(n) + C(n)$. Hence, the extended NCCCN approach (ENCCCN) is given by the following Equation (9):

$$
\begin{aligned}
S_{xy} &= Score(A_1) + Score(A_2) + \cdots + Score(A_i) + \cdots + Score(A_m) \\
&= (\mathrm{NCCC}(A_1) + C(A_1)) + (\mathrm{NCCC}(A_2) + C(A_2)) + \cdots + (\mathrm{NCCC}(A_i) + C(A_i)) + \cdots + (\mathrm{NCCC}(A_m) + C(A_m)) \\
&= \sum_{1 \le i \le m} (\mathrm{NCCC}(A_i) + C(A_i)) \\
&= \sum_{n \in \Gamma(x) \cap \Gamma(y)} \left( \frac{\sum_{i \in C_n} \left( \frac{1}{d_i} + C(i) \right)}{\sum_{j \in \Gamma(n)} \left( \frac{1}{d_j} + C(j) \right)} + C(n) \right)
\end{aligned}
\qquad (9)
$$

where $C(n)$ denotes the clustering coefficient of node n. The other parameters are the same as in Equation (7). In Equation (9), since $0 < \mathrm{NCCC}(n) \le 1$ and $0 \le C(n) \le 1$, we have $0 < \mathrm{NCCC}(n) + C(n) \le 2$, and $0 < \sum_{n \in \Gamma(x) \cap \Gamma(y)} (\mathrm{NCCC}(n) + C(n)) \le 2 \cdot |\Gamma(x) \cap \Gamma(y)|$, namely $S_{xy} \in (0, 2 \cdot |\Gamma(x) \cap \Gamma(y)|]$. In particular, $S_{xy} = 2 \cdot |\Gamma(x) \cap \Gamma(y)|$ when $C_n = \Gamma(n)$ and $C(n) = 1$ for every node in $\Gamma(x) \cap \Gamma(y)$.

For instance, we use ENCCCN to predict the existence possibility of the link between nodes M and N in Fig.1(a) and Fig.1(b), respectively. The results are as follows:

Fig.1(a): $S_{MN} = \left(\frac{2.92}{4.92} + 0.2\right)_{(b)} + \left(\frac{2.8}{2.8} + 0.67\right)_{(d)} + \left(\frac{2.92}{4.92} + 0.2\right)_{(f)} = 3.26$

Fig.1(b): $S_{MN} = \left(\frac{0.67}{3.67} + 0\right)_{(b)} + \left(\frac{0.67}{2.67} + 0\right)_{(d)} + \left(\frac{0.67}{3.67} + 0\right)_{(f)} = 0.61$
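The extended scores of Equations (8) and (9) can be checked numerically with the short sketch below. The function names `enccpt_score` and `encccn_score` are hypothetical, and each common neighbor is represented by an assumed pair (NCCC(n), C(n)) taken from the rounded values printed in the example.

```python
import math


def enccpt_score(pairs):
    """Equation (8): 1 - prod(1 - 0.5*(NCCC(n) + C(n))) over the common neighbors."""
    return 1.0 - math.prod(1.0 - 0.5 * (nccc + c) for nccc, c in pairs)


def encccn_score(pairs):
    """Equation (9): sum of NCCC(n) + C(n) over the common neighbors."""
    return sum(nccc + c for nccc, c in pairs)


# (NCCC(n), C(n)) for the common neighbors b, d, f, as printed in the example.
fig1a = [(2.92 / 4.92, 0.2), (2.8 / 2.8, 0.67), (2.92 / 4.92, 0.2)]
fig1b = [(0.67 / 3.67, 0.0), (0.67 / 2.67, 0.0), (0.67 / 3.67, 0.0)]

for name, pairs in (("Fig.1(a)", fig1a), ("Fig.1(b)", fig1b)):
    print(name, "ENCCPT:", round(enccpt_score(pairs), 3),
          "ENCCCN:", round(encccn_score(pairs), 3))
```

Averaging in `enccpt_score` keeps every factor inside [0, 1), so the result stays a valid probability, whereas `encccn_score` is an unnormalized ranking score bounded by twice the number of common neighbors.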

From the above prediction results, we find that ENCCPT and ENCCCN rank the two networks in the same way as NCCPT and NCCCN. Moreover, we notice that the scores of ENCCPT and ENCCCN for Fig.1(a) and Fig.1(b) differ more clearly than those of NCCPT and NCCCN in the same example, respectively. This results in better prediction results compared with our baseline approaches (i.e. NCCPT, NCCCN), and it shows the importance of the clustering information in link prediction. Algorithm 1 describes the process of our above extended approaches.

4.5. Complexity analysis of our approaches

In real applications, most link prediction methods are based on local analysis and global analysis. CN is the simplest link prediction method among these methods. As a representative of the methods based on local analysis, CN has low complexity and is suitable for large-scale networks. Its time complexity is $O(n^2)$, where n is the number of nodes in a network. Its space complexity is $O(n^2)$. In contrast, Katz is a representative of the methods based on global analysis. Its time complexity is $O(n^3)$. Its space complexity is $O(n^2)$. The methods based on global analysis are impractical for large-scale networks because of their high complexity.

As illustrated in Algorithm 1, the main operations of our algorithms consist of lines 3-6 and lines 7-9. The time complexity of lines 3-6 is $O(n^2)$ in the worst case. The time complexity of lines 7-9 is $O(n^2)$. Therefore, the overall time complexity of our algorithms is $O(n^2)$. The space complexity of our algorithms is $O(n^2)$. Because our approaches have the same complexity as CN, they are suitable for large-scale networks.

Algorithm 1 Node-coupling Clustering Approaches
1: Set d[ ] = 0; C[ ] = 0;
2: Divide the original network G into the training set TS and test set PS;
3: for each node i in G do
4:     Compute the degree value of this node: d[i];
5:     Compute the clustering coefficient of this node: C[i];
6: end for
7: for each nonexistent edge (x, y) in G do
8:     Compute the similarity score S_xy by Equation (5), Equation (7), Equation (8) or Equation (9);
9: end for
10: Arrange the list of all S_xy in descending order;
11: Compute AUC by Equation (2);
12: Return AUC;
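The steps of Algorithm 1 can be sketched end to end in Python as below. This is only an illustrative reading of the pseudocode under several assumptions: the score is the NCCCN score of Equation (7), the training/test split is a uniform random split of the edges, AUC is estimated by sampling with replacement, and all names (`build_adj`, `ncccn_scores`, `run_algorithm1`, `demo_edges`) and the demo graph are hypothetical. It is not the authors' Matlab implementation.

```python
import random
from itertools import combinations


def build_adj(nodes, edges):
    """Adjacency sets of an undirected graph."""
    adj = {v: set() for v in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def clustering(adj, v):
    """Clustering coefficient of Equation (1); 0 assumed when degree < 2."""
    k = len(adj[v])
    if k < 2:
        return 0.0
    links = sum(1 for a, b in combinations(adj[v], 2) if b in adj[a])
    return 2.0 * links / (k * (k - 1))


def ncccn_scores(adj):
    """Lines 3-9 of Algorithm 1, using Equation (7) as the similarity score."""
    # Lines 3-6: per-node contribution 1/d_i + C(i); isolated nodes never
    # appear in any neighbor set, so their value is never used.
    contrib = {v: (1.0 / len(adj[v]) + clustering(adj, v)) if adj[v] else 0.0
               for v in adj}
    scores = {}
    # Lines 7-9: score every node-pair that is not linked in the training graph.
    for x, y in combinations(sorted(adj), 2):
        if y in adj[x]:
            continue
        total = 0.0
        for n in adj[x] & adj[y]:
            coupled = (adj[x] & adj[y] & adj[n]) | {x, y}
            total += (sum(contrib[i] for i in coupled)
                      / sum(contrib[j] for j in adj[n]))
        scores[(x, y)] = total
    return scores


def run_algorithm1(nodes, edges, train_ratio, n_comparisons, seed):
    """Split the edges, score unknown links on the training graph, return AUC."""
    rng = random.Random(seed)
    shuffled = [tuple(sorted(e)) for e in edges]
    rng.shuffle(shuffled)
    cut = int(train_ratio * len(shuffled))
    train, test = shuffled[:cut], set(shuffled[cut:])   # line 2
    scores = ncccn_scores(build_adj(nodes, train))
    test_scores = [s for link, s in scores.items() if link in test]
    non_scores = [s for link, s in scores.items() if link not in test]
    higher = equal = 0
    for _ in range(n_comparisons):                      # Equation (2)
        a, b = rng.choice(test_scores), rng.choice(non_scores)
        higher += int(a > b)
        equal += int(a == b)
    return (higher + 0.5 * equal) / n_comparisons


# Hypothetical demo graph: two 4-cliques joined by a single bridge edge.
demo_edges = (list(combinations(range(4), 2))
              + list(combinations(range(4, 8), 2)) + [(3, 4)])
print(run_algorithm1(list(range(8)), demo_edges, 0.8, 200, seed=1))
```

Swapping `ncccn_scores` for a function implementing Equation (5), (8) or (9) would give the other three variants; the sort in line 10 of Algorithm 1 is not needed for the sampled AUC and is therefore omitted here.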

5. Experimental analysis

In this section, we experimentally evaluate the performance of our approaches on two synthetic datasets and six real datasets.

5.1. Experiment data and setup

Datasets. The datasets in our experiments are divided into two categories. The first category consists of two synthetic datasets with different sizes and clustering coefficients, SD(2, 100) and SD(6, 300), which are created by the BA model in complex networks theory. Here, SD(m, n) denotes a network that contains n nodes, and m represents the number of nodes that a new node connects to when it joins the network. The second category contains six real datasets from different fields. Jazz is a collaboration network that consists of jazz bands [20]. Alines is a US air transportation network [21]. PBl is a US politics network that consists of many weblogs [22]. Karaclub is a social network that contains 34 members of a karate club at a US university [23]. Facebook is a dataset that was collected by surveying participants using the Facebook app from the Facebook social network [24, 25]. PtoP is a Gnutella peer-to-peer file sharing network that contains many hosts and their connections from August 2002 [24, 26]. The details of these datasets are described in Table 1, where V and E denote the number of nodes and edges in these datasets respectively, AD is the average degree value of a dataset, C is the average clustering coefficient of a dataset, APL indicates the average path length of a dataset, and SP represents the sparsity rate of a network.

Table 1 The features description of the tested datasets

Datasets      V       E       AD       C        APL     SP
SD(2, 100)    100     574     11.48    0.793    2.316   0.1160
SD(6, 300)    300     1997    13.31    0.298    2.419   0.0445
Jazz          198     2742    27.697   0.633    2.235   0.1406
Alines        332     2126    12.807   0.749    2.738   0.0387
PBl           1224    16715   27.312   0.360    2.738   0.0223
Karaclub      34      78      4.588    0.588    2.408   0.1390
Facebook      4039    88234   43.691   0.6055   3.693   0.0108
PtoP          10876   39994   7.355    0.008    4.636   0.0007

Evaluation metric. In our experiments, we apply the AUC metric to evaluate the prediction accuracy of the tested methods. Firstly, we divide a dataset into a training set TS and a test set PS. Then, we compute the similarity score, $S_{xy}$, of every unknown link in TS by a specific method. Finally, we compute AUC based on $S_{xy}$ and PS.

Experimental environment. All code is written and run in Matlab 2012a. We run our experiments on a Dell Optiplex 990 computer with a 3.2GHz Intel(R) Core i7 CPU and 4 GB memory that runs 32-bit Microsoft Windows 7 Enterprise Edition.

5.2. Experimental results compared with other traditional methods and performance analysis

In this section, we compare our approaches with the representatives of the existing methods that are introduced in [5, 7, 15]: CN, AA, RA, PA, Katz, ACT, CosRW, GD and CLPA, as introduced in Section 2. Here, we conduct our experiments on the above-mentioned eight datasets with different ratios of the edges included in the training set to the total edges of a dataset (i.e. the ratio of the known edges): 0.1, 0.3, 0.5, 0.7, 0.9. Moreover, we set β = 0.001 in Katz and θ = 0.90 in CLPA. In our experiments, we analyze the prediction performance of the above approaches in two aspects: prediction accuracy and time cost.

5.2.1. Prediction accuracy analysis

Fig.2 provides the prediction accuracy comparison results of our approaches, NCCPT and NCCCN, and the other traditional methods on all tested datasets. From Fig.2, we find that NCCPT and NCCCN have better prediction accuracy than the other traditional methods on all tested datasets in most cases, and that their advantage over the other traditional methods grows as the ratio of the known edges grows in each dataset. The main reason is that our approaches use the coupling degree information of the common neighbor nodes and the clustering information of a network, while also considering the different roles of nodes. Even in the worst cases, such as SD(2, 100), SD(6, 300), Karaclub and PBl with a 0.1 ratio of the known edges, the prediction accuracy of NCCPT and NCCCN is higher than that of CN.
Meanwhile, NCCPT and NCCCN have similar prediction accuracy at a given ratio of the known edges in most cases, especially when that ratio is high. This further supports the effectiveness of the node-coupling clustering idea in improving the prediction results.

Fig. 2. Prediction accuracy (AUC) vs. ratio of the known edges on all tested datasets: (a) SD(2, 100), (b) SD(6, 300), (c) Karaclub, (d) Jazz, (e) Alines, (f) PBl, (g) Facebook, (h) PtoP. Here, the studied prediction methods include the traditional methods (i.e. CN, AA, RA, PA, Katz, ACT, CosRW, GD, CLPA) and our methods (i.e. NCCPT, NCCCN). The parameters for these methods are: (1) for Katz, β = 0.001; (2) for CLPA, θ = 0.90.

In addition, we have the following findings. Firstly, the prediction accuracy of each studied method increases with the ratio of the known edges in most tested datasets, because more useful information becomes available for prediction as that ratio grows. However, the prediction accuracy of PA on the Karaclub dataset does not always increase with the ratio of the known edges. The main reason is that PA depends only on the degrees of the predicted node pair, while Karaclub is a small-world network whose node degrees follow a Poisson distribution (i.e. the degree values of most nodes are close to one another). Secondly, as representatives of the methods based on global analysis, Katz, ACT and CosRW are as accurate as or more accurate than CN, AA and RA in most tested datasets, especially when the ratio of the known edges is low. This indicates that global structure information yields better prediction accuracy than local structure information, especially when the network is ultra sparse. Moreover, PA and the methods

based on global analysis (i.e. Katz, ACT and CosRW) have better prediction accuracy than our methods (i.e. NCCPT and NCCCN) on PBl when the ratio of the known edges is small (0.1, 0.3). The main reasons are as follows: (1) PBl is a scale-free network with the "rich get richer" feature, which matches the mechanism of PA; (2) the methods based on global analysis have more global information to use than our methods when the ratio of the known edges is small, such as 0.1 or 0.3. Thirdly, RA has almost the same prediction accuracy as AA, because both are based on the degrees of nodes in the network. Finally, GD has the worst prediction accuracy of all selected methods at every ratio of the known edges on all tested datasets, because it has insufficient information for link prediction.

Fig. 3. The prediction accuracy results (AUC vs. ratio of the known edges) of our approaches on all tested datasets: (a) SD(2, 100), (b) SD(6, 300), (c) Karaclub, (d) Jazz, (e) Alines, (f) PBl, (g) Facebook, (h) PtoP. Here, our approaches consist of NCCPT, ENCCPT, NCCCN and ENCCCN; CLPA is plotted for comparison.

Fig.3 shows the prediction accuracy of our baseline approaches (i.e. NCCPT, NCCCN) and their extensions (i.e. ENCCPT, ENCCCN) on all tested datasets. Apart from the findings drawn from Fig.2, we find the following three points in Fig.3.

Firstly, ENCCPT has better prediction accuracy than NCCPT, although the difference between their prediction results is not significant, and ENCCCN likewise has better prediction accuracy than NCCCN. That is, our extended approaches are slightly more accurate than their corresponding baseline approaches. The main reason is that, unlike the baseline approaches, the extended approaches include the clustering information of the selected common neighbor nodes. The results indicate that the extended approaches are more effective than the corresponding baseline approaches for the link prediction problem.

Secondly, NCCPT, NCCCN, ENCCCN and ENCCPT have similar prediction accuracy when the ratio of the known edges is low, e.g., 0.1 and 0.3. However, the extended approaches (i.e. ENCCPT, ENCCCN) pull ahead of the corresponding baseline approaches (i.e. NCCPT, NCCCN) as the ratio of the known edges grows in each tested dataset, especially when that ratio is high. This is because, as the ratio grows, the extended approaches have more clustering information available for predicting the potential links than the corresponding baseline approaches.

Thirdly, our approaches (i.e. NCCPT, NCCCN, ENCCPT, ENCCCN) have better prediction accuracy than CLPA when the ratio of the known edges is high (i.e. 0.3, 0.5, 0.7, 0.9) on all tested datasets. CLPA is only slightly more accurate than our approaches when the ratio of the known edges is low, such as 0.1, on most tested datasets. This is because CLPA uses the scale-free feature for link prediction at a 0.1 ratio of the known edges. However, the time performance of CLPA is significantly worse than that of our approaches, especially on large-scale datasets, e.g., Facebook and PtoP. The results are shown in Table 2 - Table 6.
Overall, the above three findings further underline the importance of the clustering information in link prediction.

[Fig. 4 table. Columns: Predictors, SD(2, 100), SD(6, 300), Karaclub, Jazz, Alines, PBl, Facebook, PtoP. Rows: Random predictor, GD (all distance-two pairs), CN, AA, RA, PA, Katz, ACT, CosRW, CLPA, NCCPT, ENCCPT, NCCCN, ENCCCN.]

Fig. 4. Accuracy performance of the studied prediction approaches introduced in Sections 2 and 4 on all tested datasets when the ratio of the known edges is 0.9. For each approach and each dataset, the given number is the factor improvement in accuracy over the random predictor. Three approaches in particular are used as baselines for comparison: random predictor (RP), graph distance (GD) and common neighbors (CN). Bold values have better accuracy performance than the common neighbors approach; italic values are at least as good as the graph distance approach.

To further understand the prediction performance of the above approaches, we evaluate each of them by its average relative performance. There are three baselines: random predictor (RP), graph distance (GD) and common neighbors (CN). The random predictor simply selects node pairs at random as its predicted links. Fig.4 shows the accuracy performance of each approach on all tested datasets, in terms of the factor improvement over the random predictor. Fig.5 - Fig.9 show the average relative performance of the studied prediction approaches versus the three baselines. The other approaches significantly outperform the random predictor, which means that the network topology does contain useful information for link prediction. Besides, we have the following findings: (1) CN, AA and RA have similar performance on each tested dataset, because all three are based on the common neighbor nodes. (2) Katz, ACT and CosRW perform as well as or better than CN, AA and RA, especially when the ratio of the known edges is low, such as 0.1. This shows that the methods based on global analysis have higher prediction accuracy than the methods based on local analysis. (3) NCCPT, NCCCN, ENCCPT and ENCCCN have similar prediction performance in most cases, because they all use the clustering information of a network. (4) Although there is no clear difference between Katz, ACT, CosRW, CLPA and NCCPT, NCCCN, ENCCPT, ENCCCN, our approaches (NCCPT, NCCCN, ENCCPT, ENCCCN) perform slightly better than the methods based on global analysis (Katz, ACT, CosRW, CLPA), especially when the ratio of the known edges is high. The main reason is that our approaches use the clustering and node coupling degree information.

From SP as introduced in Table 1, we know that PtoP is an ultra sparse dataset. Moreover, Katz, ACT, CosRW and NCCPT, NCCCN, ENCCPT, ENCCCN have better prediction performance than CN, AA, RA in most cases in Fig.2 and Fig.3, and we reach the same conclusion when the ratio of the known edges is 0.1 in Fig.5. These results indicate that our approaches (i.e. NCCPT, NCCCN, ENCCPT, ENCCCN) and the methods based on global analysis are suitable for ultra sparse networks. In contrast, the methods based on local analysis are only suitable for dense networks.

Fig. 5. Relative average performance of the studied prediction methods on all tested datasets versus three baselines when the ratio of the known edges is 0.1: (a) random predictor (RP), (b) graph distance method (GD), (c) common neighbors method (CN). The plotted value shows the average performance ratio of the given method over the eight datasets versus these three baselines. The error bars display the minimum and maximum of this ratio over the eight datasets. The parameters for these methods are: (1) for Katz, β = 0.001; (2) for CLPA, θ = 0.90.

Fig. 6. Relative average performance of the studied prediction methods on all tested datasets versus three baselines when the ratio of the known edges is 0.3: (a) random predictor (RP), (b) graph distance method (GD), (c) common neighbors method (CN). The plotted value indicates the average performance ratio of the given method over the eight datasets versus these three baselines. The error bars show the minimum and maximum of this ratio over the eight datasets. All parameters for these methods are as in Fig.5.

Fig. 7. Relative average performance of the studied prediction methods on all tested datasets versus three baselines when the ratio of the known edges is 0.5: (a) random predictor (RP), (b) graph distance method (GD), (c) common neighbors method (CN). The plotted value displays the average performance ratio of the given method over the eight datasets versus these three baselines. The error bars show the minimum and maximum of this ratio over the eight datasets. All parameters for these methods are as in Fig.5.

Fig. 8. Relative average performance of the studied prediction methods on all tested datasets versus three baselines when the ratio of the known edges is 0.7: (a) random predictor (RP), (b) graph distance method (GD), (c) common neighbors method (CN). The plotted value is the average performance ratio of the given method over the eight datasets versus these three baselines. The error bars show the minimum and maximum of this ratio over the eight datasets. All parameters for these methods are as in Fig.5.
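The quantity plotted in Fig.5 - Fig.9 (an average performance ratio over the datasets, with error bars at the minimum and maximum ratio) can be sketched as follows. This is only an illustration under stated assumptions: the function name is hypothetical, the AUC numbers are placeholders and not results from this paper, and the random predictor is assumed to score an AUC of 0.5.

```python
from statistics import mean


def relative_performance(method_auc, baseline_auc):
    """Return (average, minimum, maximum) of the per-dataset ratio method / baseline."""
    ratios = [m / b for m, b in zip(method_auc, baseline_auc)]
    return mean(ratios), min(ratios), max(ratios)


# Placeholder AUC values for three hypothetical datasets; NOT results from the paper.
method_auc = [0.90, 0.80, 0.70]
random_auc = [0.50, 0.50, 0.50]  # assumed AUC of a random predictor
avg, low, high = relative_performance(method_auc, random_auc)
print(f"average ratio {avg:.2f}, error bar from {low:.2f} to {high:.2f}")
```

Replacing `random_auc` with the per-dataset AUC of GD or CN gives the ratios against the other two baselines.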

Fig. 9. Relative average performance of the studied prediction methods on all tested datasets versus three baselines when the ratio of the known edges is 0.9: (a) random predictor (RP), (b) graph distance method (GD), (c) common neighbors method (CN). The value shown is the average performance ratio of the given method over the eight datasets versus these three baselines. The error bars show the minimum and maximum of this ratio over the eight datasets. All parameters for these methods are as in Fig.5.

5.2.2. Time performance analysis

Table 2. The running time of the different approaches on all tested datasets when the ratio of the known edges is 0.1 (unit: seconds).

| Dataset | CN | AA | RA | PA | Katz | ACT | CosRW | GD | CLPA | NCCPT | ENCCPT | NCCCN | ENCCCN |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SD(2, 100) | 0.0329 | 0.0319 | 0.0302 | 0.0322 | 0.0574 | 0.3111 | 0.2824 | 0.0373 | 0.0542 | 0.0455 | 0.0461 | 0.0448 | 0.0453 |
| SD(6, 300) | 0.0811 | 0.0987 | 0.0957 | 0.1088 | 0.2324 | 0.5860 | 0.5582 | 0.0964 | 0.1164 | 0.0960 | 0.0974 | 0.0953 | 0.0961 |
| Karaclub | 0.0252 | 0.0279 | 0.0241 | 0.0255 | 0.0284 | 0.0966 | 0.0948 | 0.0268 | 0.0277 | 0.0269 | 0.0275 | 0.0258 | 0.0263 |
| Jazz | 0.0562 | 0.0707 | 0.0651 | 0.0716 | 0.1327 | 0.4097 | 0.3823 | 0.0615 | 0.0758 | 0.0626 | 0.0637 | 0.0613 | 0.0624 |
| Alines | 0.0844 | 0.1164 | 0.1099 | 0.1295 | 0.2525 | 0.5971 | 0.5776 | 0.0991 | 0.1335 | 0.1030 | 0.1046 | 0.0982 | 0.1003 |
| PBl | 1.1940 | 1.4683 | 1.3729 | 1.5275 | 6.5271 | 15.9259 | 12.8088 | 1.5833 | 1.6375 | 1.3481 | 1.3874 | 1.3186 | 1.3371 |
| Facebook | 8.7335 | 10.9740 | 10.3069 | 17.0911 | 54.2158 | 452.3120 | 398.1047 | 9.8276 | 18.5644 | 10.6347 | 11.3452 | 10.4528 | 10.9632 |
| PtoP | 48.5264 | 97.6325 | 60.4276 | 275.1190 | 1035.3882 | 9657.7647 | 7541.8623 | 55.4329 | 283.7695 | 94.3152 | 95.7465 | 89.5324 | 91.6870 |
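To make the magnitudes in Table 2 concrete, the short sketch below computes how many times slower selected methods are than NCCCN on the largest dataset, PtoP. The running times are copied from Table 2 (ratio of the known edges 0.1); the helper name `slowdown` is hypothetical.

```python
# Running times in seconds on PtoP, copied from Table 2 (ratio of the known edges 0.1).
ptop_seconds = {
    "CN": 48.5264, "Katz": 1035.3882, "ACT": 9657.7647, "CosRW": 7541.8623,
    "CLPA": 283.7695, "NCCPT": 94.3152, "ENCCPT": 95.7465,
    "NCCCN": 89.5324, "ENCCCN": 91.6870,
}


def slowdown(times, method, reference):
    """Running time of `method` divided by the running time of `reference`."""
    return times[method] / times[reference]


for method in ("CN", "Katz", "ACT", "CosRW", "CLPA"):
    print(f"{method:>5} / NCCCN = {slowdown(ptop_seconds, method, 'NCCCN'):.2f}")
```

With these values, ACT takes roughly 108 times as long as NCCCN and CLPA roughly 3.2 times as long, while CN needs a little over half of NCCCN's running time.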

In practice, a prediction method should combine high prediction accuracy with low time complexity. Therefore, it is also important to analyze the time performance of a prediction method. Table 2 - Table 6 show the running time of the studied prediction approaches on all tested datasets for ratios of the known edges of 0.1, 0.3, 0.5, 0.7 and 0.9, respectively. For each prediction approach, the running time increases as the size of the tested dataset grows. However, for a given approach and dataset, the running time does not differ markedly across the ratios of the known edges. CN has the best time performance overall, with the shortest running time on most tested datasets. In contrast, Katz, ACT and CosRW have the worst time performance, especially on large-scale datasets (e.g. PBl, Facebook, PtoP). AA, RA and PA have better time performance than Katz, ACT and CosRW, which indicates that the methods based on local analysis are faster than the methods based on global analysis. Compared with Katz, ACT and CosRW, our approaches (NCCPT, NCCCN, ENCCPT, ENCCCN) have significantly better time performance, especially on large-scale datasets (e.g. PBl, Facebook and PtoP). The time performance of CLPA is clearly worse than that of our approaches, because NCCPT, NCCCN, ENCCPT and ENCCCN capture the clustering information of a network with the clustering coefficient instead of running a clustering method.

Table 3. The running time of the different approaches on all tested datasets when the ratio of the known edges is 0.3 (unit: seconds).

| Dataset | CN | AA | RA | PA | Katz | ACT | CosRW | GD | CLPA | NCCPT | ENCCPT | NCCCN | ENCCCN |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SD(2, 100) | 0.0273 | 0.0278 | 0.0271 | 0.0287 | 0.0435 | 0.2240 | 0.1933 | 0.0349 | 0.0413 | 0.0393 | 0.0404 | 0.0370 | 0.0376 |
| SD(6, 300) | 0.0754 | 0.0916 | 0.0896 | 0.1043 | 0.1739 | 0.4990 | 0.4588 | 0.0938 | 0.1059 | 0.0912 | 0.0935 | 0.0892 | 0.0919 |
| Karaclub | 0.0206 | 0.0220 | 0.0193 | 0.0223 | 0.0245 | 0.0927 | 0.0913 | 0.0259 | 0.0233 | 0.0211 | 0.0232 | 0.0204 | 0.0216 |
| Jazz | 0.0501 | 0.0683 | 0.0604 | 0.0699 | 0.1171 | 0.3439 | 0.3214 | 0.0583 | 0.0694 | 0.0587 | 0.0605 | 0.0576 | 0.0598 |
| Alines | 0.0803 | 0.1107 | 0.1056 | 0.1236 | 0.2070 | 0.5236 | 0.5074 | 0.0968 | 0.1278 | 0.0984 | 0.1013 | 0.0926 | 0.0953 |
| PBl | 1.0724 | 1.4055 | 1.3376 | 1.4844 | 6.1575 | 14.8672 | 11.9379 | 1.2719 | 1.5841 | 1.2400 | 1.3046 | 1.2283 | 1.2678 |
| Facebook | 8.2513 | 10.3562 | 9.8312 | 16.6794 | 49.2090 | 436.9345 | 379.7170 | 9.5405 | 17.9429 | 10.2513 | 10.9761 | 10.1162 | 10.6718 |
| PtoP | 44.8740 | 92.4590 | 57.6704 | 241.0224 | 997.7428 | 9554.4825 | 7326.5278 | 52.8592 | 269.3249 | 89.7653 | 91.8387 | 84.8213 | 85.3679 |

Table 4. The running time of the different approaches on all tested datasets when the ratio of the known edges is 0.5 (unit: seconds).

| Dataset | CN | AA | RA | PA | Katz | ACT | CosRW | GD | CLPA | NCCPT | ENCCPT | NCCCN | ENCCCN |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SD(2, 100) | 0.0243 | 0.0267 | 0.0253 | 0.0271 | 0.0332 | 0.1896 | 0.1253 | 0.0333 | 0.0324 | 0.0334 | 0.0342 | 0.0313 | 0.0316 |
| SD(6, 300) | 0.0701 | 0.0886 | 0.0859 | 0.0979 | 0.1419 | 0.4351 | 0.3743 | 0.0916 | 0.0985 | 0.0871 | 0.0904 | 0.0842 | 0.0863 |
| Karaclub | 0.0145 | 0.0181 | 0.0178 | 0.0204 | 0.0223 | 0.0893 | 0.0870 | 0.0233 | 0.0183 | 0.0169 | 0.0198 | 0.0163 | 0.0179 |
| Jazz | 0.0432 | 0.0612 | 0.0576 | 0.0655 | 0.0948 | 0.2788 | 0.2576 | 0.0497 | 0.0627 | 0.0546 | 0.0563 | 0.0535 | 0.0552 |
| Alines | 0.0754 | 0.1090 | 0.0988 | 0.1183 | 0.1818 | 0.4854 | 0.4667 | 0.0946 | 0.1196 | 0.0959 | 0.0988 | 0.0887 | 0.0921 |
| PBl | 0.9652 | 1.3750 | 1.2873 | 1.4541 | 5.7640 | 13.4585 | 10.7636 | 1.1647 | 1.5169 | 1.1865 | 1.2783 | 1.1790 | 1.2361 |
| Facebook | 7.6771 | 9.8316 | 9.3894 | 16.2445 | 46.3052 | 417.1473 | 365.7857 | 9.0172 | 17.3256 | 9.7502 | 10.2483 | 9.6947 | 10.0475 |
| PtoP | 40.3680 | 86.9890 | 52.0670 | 226.6342 | 954.1783 | 9305.2433 | 7215.3971 | 50.4763 | 237.7156 | 83.2461 | 85.7426 | 79.6761 | 82.4232 |

Table 5. The running time of the different approaches on all tested datasets when the ratio of the known edges is 0.7 (unit: seconds).

| Dataset | CN | AA | RA | PA | Katz | ACT | CosRW | GD | CLPA | NCCPT | ENCCPT | NCCCN | ENCCCN |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SD(2, 100) | 0.0236 | 0.0249 | 0.0244 | 0.0258 | 0.0319 | 0.1648 | 0.1057 | 0.0322 | 0.0303 | 0.0262 | 0.0268 | 0.0246 | 0.0248 |
| SD(6, 300) | 0.0686 | 0.0823 | 0.0752 | 0.0896 | 0.1288 | 0.3673 | 0.3443 | 0.0885 | 0.0913 | 0.0803 | 0.0857 | 0.0776 | 0.0815 |
| Karaclub | 0.0134 | 0.0159 | 0.0146 | 0.0179 | 0.0196 | 0.0847 | 0.0826 | 0.0227 | 0.0167 | 0.0152 | 0.0171 | 0.0149 | 0.0162 |
| Jazz | 0.0378 | 0.0567 | 0.0533 | 0.0595 | 0.0861 | 0.2553 | 0.2382 | 0.0427 | 0.0585 | 0.0522 | 0.0542 | 0.0517 | 0.0541 |
| Alines | 0.0699 | 0.1044 | 0.0952 | 0.1133 | 0.1518 | 0.4118 | 0.4060 | 0.0917 | 0.1103 | 0.0894 | 0.0935 | 0.0841 | 0.0883 |
| PBl | 0.9246 | 1.3204 | 1.2239 | 1.4127 | 5.2617 | 11.2217 | 10.1463 | 1.0216 | 1.4732 | 1.1092 | 1.2247 | 1.0954 | 1.1725 |
| Facebook | 6.7020 | 9.4937 | 9.0174 | 15.6418 | 44.2137 | 394.3129 | 342.0164 | 8.6821 | 16.8547 | 9.2548 | 9.7321 | 9.0719 | 9.5631 |
| PtoP | 36.7650 | 78.7560 | 48.3456 | 205.7681 | 918.3237 | 9155.3683 | 7082.4537 | 47.2581 | 218.9013 | 76.5269 | 79.3675 | 72.4518 | 76.8193 |

Table 6. The running time of the different approaches on all tested datasets when the ratio of the known edges is 0.9 (unit: seconds).

| Dataset | CN | AA | RA | PA | Katz | ACT | CosRW | GD | CLPA | NCCPT | ENCCPT | NCCCN | ENCCCN |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SD(2, 100) | 0.0228 | 0.0231 | 0.0213 | 0.0247 | 0.0304 | 0.1143 | 0.0820 | 0.0317 | 0.0275 | 0.0243 | 0.0251 | 0.0215 | 0.0216 |
| SD(6, 300) | 0.0653 | 0.0744 | 0.0691 | 0.0836 | 0.1162 | 0.3264 | 0.3218 | 0.0853 | 0.0871 | 0.0785 | 0.0812 | 0.0757 | 0.0792 |
| Karaclub | 0.0126 | 0.0147 | 0.0115 | 0.0155 | 0.0178 | 0.0814 | 0.0802 | 0.0205 | 0.0135 | 0.0134 | 0.0148 | 0.0133 | 0.0141 |
| Jazz | 0.0314 | 0.0516 | 0.0482 | 0.0573 | 0.0758 | 0.2253 | 0.1918 | 0.0376 | 0.0543 | 0.0514 | 0.0522 | 0.0505 | 0.0514 |
| Alines | 0.0667 | 0.0953 | 0.0898 | 0.1091 | 0.1315 | 0.3732 | 0.3695 | 0.0896 | 0.1024 | 0.0833 | 0.0889 | 0.0797 | 0.0858 |
| PBl | 0.8336 | 1.2922 | 1.1431 | 1.3811 | 4.8805 | 10.8290 | 9.9623 | 0.9768 | 1.4326 | 1.0768 | 1.1835 | 0.9989 | 1.0543 |
| Facebook | 6.1295 | 9.0356 | 8.8531 | 15.1679 | 40.634 | 372.4180 | 316.8651 | 8.1737 | 16.2375 | 8.8872 | 9.2385 | 8.5475 | 9.1791 |
| PtoP | 31.4768 | 73.5237 | 45.8190 | 174.6728 | 874.6591 | 8942.5133 | 6849.7439 | 43.2246 | 185.3890 | 71.3685 | 74.5253 | 68.3326 | 71.7651 |

Through the experimental analysis of the above two aspects, prediction accuracy and efficiency, we have the following findings: (1) Although CN has good time performance and is suitable for massive data, it has low prediction accuracy. (2) Although Katz, ACT and CosRW have high prediction accuracy and are suitable for sparse data, they have inferior time performance. (3) NCCPT, NCCCN, ENCCPT and ENCCCN not only have better prediction accuracy than Katz, ACT and CosRW, but also have the same complexity as CN (as shown in Section 4.4) and better time performance (as shown in Table 2 - Table 6). (4) According to SP (see Table 1), PtoP is an ultra sparse dataset. From Fig.2 and Fig.3, we find that our approaches have better prediction accuracy than the other methods. This suggests that our approaches are also suitable for sparse datasets. Hence, our approaches are suitable and robust for predicting links across the kinds of datasets tested.
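The gap in running time between global and local methods noted above can be illustrated with a small sketch. It contrasts a truncated Katz score, which sums β^l · A^l over path lengths l and therefore needs one full matrix product per length, with the common-neighbor count, which only inspects the two neighborhoods of a pair. The truncation at `max_len` is an assumption made here for brevity; the standard Katz index uses the closed form (I − βA)^(-1) − I. The function names are hypothetical, and β = 0.001 follows the setting reported in Section 5.2.

```python
def mat_mul(a, b):
    """Dense product of two square matrices given as lists of lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def katz_truncated(adj, beta=0.001, max_len=3):
    """Sum beta**l * A**l for l = 1..max_len; entry (x, y) scores the node pair (x, y)."""
    n = len(adj)
    scores = [[0.0] * n for _ in range(n)]
    power = [row[:] for row in adj]  # A**1
    weight = beta
    for _ in range(max_len):
        for i in range(n):
            for j in range(n):
                scores[i][j] += weight * power[i][j]
        power = mat_mul(power, adj)  # one O(n^3) product per extra path length
        weight *= beta
    return scores


def common_neighbors(adj, x, y):
    """Local score: number of nodes adjacent to both x and y."""
    return sum(1 for z in range(len(adj)) if adj[x][z] and adj[y][z])


# Hypothetical toy network: the path 0 - 1 - 2.
path = [[0, 1, 0],
        [1, 0, 1],
        [0, 1, 0]]
katz = katz_truncated(path)
print(katz[0][2], common_neighbors(path, 0, 2))
```

Scoring all pairs with the global index touches the whole adjacency matrix repeatedly, whereas the local index can be evaluated pair by pair, which is consistent with the ordering of running times in Table 2 - Table 6.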

6. Conclusions and future work

In this paper, we propose node-coupling clustering approaches and their extensions for link prediction. Our approaches not only combine the coupling degrees of the common neighbor nodes with the clustering information of a network but also consider the different roles of nodes in predicting links. Experiments on two synthetic and six real datasets show that our approaches achieve comparatively good prediction results. Specifically, our approaches capture the clustering information of a network using the clustering coefficient of every node, which has low time complexity. As a result, our approaches are well suited to large-scale networks. In future work, we plan to test our approaches on massive datasets and on datasets from different domains. Furthermore, we plan to evaluate the effectiveness of our approaches on bipartite networks and weighted networks.
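As a reminder of why the clustering coefficient is cheap to obtain, the sketch below computes the local clustering coefficient of a node from its neighbor set alone. It assumes the standard definition C(v) = 2 T(v) / (k_v (k_v − 1)), where k_v is the degree of v and T(v) is the number of links among the neighbors of v; it does not reproduce the node-coupling scores of Section 4, and the names and toy graph are hypothetical.

```python
def clustering_coefficient(adj, v):
    """Local clustering coefficient of node v; defined as 0.0 when v has fewer than two neighbors."""
    nbrs = adj[v]
    k = len(nbrs)
    if k < 2:
        return 0.0
    # Count each linked pair of neighbors once (a < b).
    links = sum(1 for a in nbrs for b in nbrs if a < b and b in adj[a])
    return 2.0 * links / (k * (k - 1))


# Hypothetical toy network: a triangle 0-1-2 with a pendant node 3 attached to node 0.
toy = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0}}
for node in sorted(toy):
    print(node, clustering_coefficient(toy, node))
```

Each call only inspects pairs of neighbors of one node, so no global matrix operation is needed.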

Acknowledgments

The work presented in this paper has been partially supported by the National Natural Science Foundation of China (Grant Nos. 61272480, 61332013, 71072172, 71110107026 and 71331005) and the Australian Research Council Discovery Projects (Grant No. DP140100841).

References

[1] K. Musial, M. Budka, K. Juszczyszyn, Creation and growth of online social network, World Wide Web 16 (4) (2013) 421–447.
[2] K. Musiał, P. Kazienko, Social networks on the internet, World Wide Web 16 (1) (2013) 31–72.
[3] L. Getoor, Link mining: a new data mining challenge, ACM SIGKDD Explorations Newsletter 5 (1) (2003) 84–89.
[4] L. Getoor, C. P. Diehl, Link mining: a survey, ACM SIGKDD Explorations Newsletter 7 (2) (2005) 3–12.
[5] L. Lü, T. Zhou, Link prediction in complex networks: A survey, Physica A: Statistical Mechanics and its Applications 390 (6) (2011) 1150–1170.
[6] B. Taskar, M.-f. Wong, P. Abbeel, D. Koller, Link prediction in relational data, in: Advances in Neural Information Processing Systems, 2004, pp. 659–666.
[7] D. Liben-Nowell, J. Kleinberg, The link prediction problem for social networks, Journal of the American Society for Information Science and Technology 58 (7) (2007) 1019–1031.
[8] Z. Liu, Q.-M. Zhang, L. Lü, T. Zhou, Link prediction in complex networks: A local naïve Bayes model, EPL (Europhysics Letters) 96 (4) (2011) 48007–48012.
[9] W. Liu, L. Lü, Link prediction based on local random walk, EPL (Europhysics Letters) 89 (5) (2010) 58007–58012.
[10] T. Zhou, L. Lü, Y.-C. Zhang, Predicting missing links via local information, The European Physical Journal B-Condensed Matter and Complex Systems 71 (4) (2009) 623–630.
[11] A. Clauset, C. Moore, M. E. Newman, Hierarchical structure and the prediction of missing links in networks, Nature 453 (7191) (2008) 98–101.
[12] R. Guimerà, M. Sales-Pardo, Missing and spurious interactions and the reconstruction of complex networks, Proceedings of the National Academy of Sciences 106 (52) (2009) 22073–22078.
[13] J. O'Madadhain, J. Hutchins, P. Smyth, Prediction and ranking algorithms for event-based network data, ACM SIGKDD Explorations Newsletter 7 (2) (2005) 23–30.
[14] M. Al Hasan, V. Chaoji, S. Salem, M. Zaki, Link prediction using supervised learning, in: SDM06: Workshop on Link Analysis, Counterterrorism and Security, 2006.
[15] F. Li, J. He, G. Huang, Y. Zhang, Y. Shi, A clustering-based link prediction method in social networks, Procedia Computer Science 29 (2014) 432–442.
[16] D. J. Watts, S. H. Strogatz, Collective dynamics of small-world networks, Nature 393 (6684) (1998) 440–442.
[17] Z. Huang, C. Ma, J. Xu, J. Huang, Link prediction based on clustering coefficient, Applied Physics 4 (2014) 101–106.
[18] S. Geisser, Predictive inference: An Introduction, Vol. 55, CRC Press, 1993.
[19] Y.-X. Dong, Q. Ke, B. Wu, Link prediction based on node similarity, Computer Science 38 (7) (2011) 162–164.
[20] P. M. Gleiser, L. Danon, Community structure in jazz, Advances in Complex Systems 6 (04) (2003) 565–573.
[21] V. Batagelj, A. Mrvar, Pajek datasets, Web page http://vlado.fmf.uni-lj.si/pub/networks/data.
[22] L. A. Adamic, N. Glance, The political blogosphere and the 2004 US election: divided they blog, in: Proceedings of the 3rd International Workshop on Link Discovery, ACM, 2005, pp. 36–43.
[23] W. W. Zachary, An information flow model for conflict and fission in small groups, Journal of Anthropological Research (1977) 452–473.
[24] J. Leskovec, Stanford large network dataset collection (2009), URL: http://snap.stanford.edu/data/index.html.
[25] J. Leskovec, J. J. Mcauley, Learning to discover social circles in ego networks, in: Advances in Neural Information Processing Systems, 2012, pp. 539–547.
[26] J. Leskovec, J. Kleinberg, C. Faloutsos, Graph evolution: Densification and shrinking diameters, ACM Transactions on Knowledge Discovery from Data (TKDD) 1 (1) (2007) 2–39.


Highlights:

- The novel node coupling clustering methods for link prediction are proposed.
- A new node coupling degree metric is proposed.
- The node coupling information and clustering information are used.
- Experimental evaluation about the effectiveness of our methods is presented.