Return random walks for link prediction


Information Sciences 510 (2020) 99–107


Manuel Curado
Department of Technology, Catholic University of Avila, Avila 05005, Spain

Article history: Received 30 January 2019; Revised 15 July 2019; Accepted 10 September 2019; Available online 13 September 2019

Keywords: Return random walk; Link prediction; Graph densification; Dirichlet

Abstract

In this paper we propose a new method for link prediction, the Return Random Walk, which infers new intra-class edges while minimizing the amount of inter-class noise. We show how to exploit it in an unsupervised densifier, the Dirichlet densifier, which increases the edge density of undirected graphs so that commute times can be better estimated by state-of-the-art methods. This approach also allows us to predict new intra-class links by exploring vertex similarities, drawing on the mathematical relationship between the Cheeger constant, the minimization of the spectral gap, and the meaningful estimation of commute distances. Our experiments show a significant improvement in inter-class filtering with respect to state-of-the-art link prediction: the resulting weighting matrix is denser and more clustered than those produced by other link predictors, and it preconditions the input graph for subsequent pattern recognition tasks. © 2019 Elsevier Inc. All rights reserved.

1. Introduction

Given a graph G = (V, E), a common problem arises when links are missing within the clusters. To address it, in this paper we introduce the Return Random Walk as a tool for improving the prediction of new intra-class links in mid-size graphs.

The term graph rewiring covers several techniques that process an input graph so that the subsequent pattern recognition task is better conditioned; graph densification is one of them. Graph densification was introduced in [1], where it is posed as a constrained optimization problem driven by cut preservation, and was originally proposed as a formal tool for ruling out the existence of certain embeddings [1]. In other words, if graphs are embeddable then they cannot be densified, and vice versa. However, the above observation can also be seen as a way of pre-conditioning or rewiring graphs so that they boost the tractability of subsequent processing tasks, one of which is the measurement of the similarity between nodes. The link between densification and link prediction therefore involves the relationship between the Cheeger constant [2], the spectral gap [3] and commute distances [4]. It was first explored in [5], where we highlighted the fact that densification leads to a shrinkage of the inter-cluster distances, thus making commute times meaningful in large graphs. Later on, in [5,6], we noted that state-of-the-art densifiers rely on semi-definite programming (SDP [7]) and motivated a novel, fully unsupervised algorithm, the Dirichlet densifier, which is more scalable and robust, and densifies the graph both more efficiently and more accurately than SDP. Given a graph G, this approach aims at generating a new graph H = (V, E′) where E′ ⊇ E and the cuts in G are preserved (or bounded) to some extent in H. This is interesting in graph-based manifold learning, where the input graphs (typically kNN or Gaussian) are very sparse. The core of this algorithm is the minimization of the spectral gap; it is also a better alternative to anchor graphs [8] as a method for conditioning or pre-processing the adjacency matrix (or, more precisely, the Laplacian matrix): denser graphs with more intra-class links


and a minimal overhead of inter-class links are thus better conditioned for subsequent pattern recognition tasks such as clustering, ranking or comparative graph analysis [9].

One of these tasks, the accurate measurement of the similarity between nodes, is a key problem in graph-based learning and pattern recognition. For instance, commute times (CTs) are Euclidean distances that rely on random walks. Namely, given a graph G, the commute time between two nodes is the expected time taken by a random walk to travel from an origin node to a destination node and back again [4,10,11]. The link with the resistance distance [12,13] characterizes the diffusive nature of commute times, and considering unit flows between nodes makes the limitations of the commute time clear. These developments lead us to decompose the effective resistance into a local component and a global one. Nguyen and Mamitsuka [14] propose a modified resistance that overcomes the problem of global information loss.

The existence of huge amounts of data that can be represented as networks is a challenging problem in the field of data mining. Several problems related to network mining are currently being studied, such as community detection, structural network analysis and network visualization. One of the most important problems in this field is link prediction [15], which aims to discover the fundamental relationships behind networks and can be used to predict or identify missing edges. The key issue of link prediction is to estimate the likelihood of potential links in networks. Link prediction has been applied in various fields of science, such as biology and sociology.

Nowadays we need to deal with huge amounts of real information (graphs/networks), but the main problem is the tractability of large data, which must be pre-conditioned for subsequent pattern recognition tasks. A crucial role in the development of machine learning and pattern recognition is played by the tractability of these large graphs, which is intrinsically limited by their size. To overcome this limit, the input graph can be compressed into a reduced version by means of the Szemerédi Regularity Lemma [16]. However, this lemma needs denser graphs for a correct estimation of commute times in large graphs. This problem is solved by Dirichlet densifiers, whose complexity is O(δ1 n^4) with δ1 = 0.35, where n is the size of the Laplacian matrix [6]. Finally, we propose a novel link prediction approach for preconditioning the input graphs, one that increases the number of intra-class links while constraining the structural inter-class noise. This role can be played by the Return Random Walk, as we study in this paper.

2. Contributions

In this paper we propose a novel random walker, the Return Random Walk, defined as a link predictor which filters the structural inter-class noise while inferring new intra-class links. This approach minimizes the probability that a random walk starting and ending at a given node traverses inter-class links. The resulting weighting matrix W^e is denser and more clustered than those of other state-of-the-art random walkers.

In Section 3 we commence by providing mathematical background and more detailed motivation and definitions. Then, in Section 4 we review the state-of-the-art methods of link prediction and their relationship with a pattern recognition task: graph densification with the Dirichlet process.
In Section 5 we describe the design of the Return Random Walk algorithm as a link predictor that yields a dense graph while preserving the topological properties of the input graph. Finally, in Section 6 we compare our approach with different link prediction methods that reduce or remove the strength of inter-class links and infer new intra-class links. By applying these algorithms within a subsequent densification task, we demonstrate that our approach helps to improve the estimation of commute times.

3. Background

For a better understanding of our approach, we first define graph densification as a constrained optimization problem in which cuts are to some extent preserved in the densified graph. Then we explain the link with the minimization of the graph conductance (Cheeger constant) and the constraining of the spectral gap. The result is a meaningful estimation of commute times between nodes in the densified graph.

The solution starts with the concept of graph densification: the study of how a significant increase in the number of links or edges of an input graph G = (V, E) produces a densified graph H = (V, E′), formulated as a constrained optimization problem in which cuts are to some extent preserved in the densified graph. Densification requires cut preservation because densified graphs can then be better conditioned for spectral clustering than sparse graphs. In [5] we showed that densification leads to a shrinkage of the noise (inter-cluster distances), thus making commute times meaningful in mid-size and large graphs. Later on, in [17], we proposed a novel densifier, the Dirichlet densifier, which is more scalable and robust than semi-definite programming [7]. In [6] we give a deeper mathematical analysis of the state-of-the-art methods based on semi-definite programming, showing that the SDP formulation is too simple to preserve global information in realistic situations; moreover, SDP solvers are polynomial in the number of unknowns (O(n^2) unknowns), so only small graphs can be processed. This rationale is also compatible with the Cheeger constant, i.e. the minimization of the graph conductance [2], which implies an implicit minimization of the spectral gap [3]. As a result, we obtain a globally meaningful estimation of commute distances [4] using the relationship between effective resistances [12–14] and the von Luxburg bounds [18]. In the following subsection we formulate these links.


3.1. Graph densification: Dirichlet graph densifier

The starting point of our approach is the following bound, derived by von Luxburg et al. [18] for any connected, undirected graph G = (V, E) that is not bipartite:

\[
\max_{i \neq j} \left| \frac{CT_{ij}}{\mathrm{vol}(G)} - \left( \frac{1}{d_i} + \frac{1}{d_j} \right) \right| \;\le\; 2\left( \frac{1}{\lambda_2} + 2 \right) \frac{w_{\max}}{d_{\min}^2}, \tag{1}
\]

where CT_ij = R_ij vol(G) is the commute time between nodes i and j, R_ij is the effective resistance, vol(G) is the volume of the graph, λ2 is the spectral gap and d_min is the minimum node degree in G. The spectral gap λ2 is the second eigenvalue of the normalized graph Laplacian L = I − D^{−1}W, where D = diag(d_1, ..., d_n) is the degree matrix and W is the (symmetric) affinity (or weight) matrix, with w_ij > 0 if (i, j) ∈ E; w_max is the maximal element of W. The above bound suggests that a way of making R_ij diverge from the degenerate approximation 1/d_i + 1/d_j is to reweight/rewire the edges in E so that λ2 → 0.
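To make Eq. (1) concrete, the following minimal numerical sketch (illustrative, not part of the paper; NumPy assumed) computes the commute time from the pseudoinverse of the unnormalized Laplacian on a dense random graph and compares it with the degree-based value vol(G)(1/d_i + 1/d_j). On such graphs λ2 is large, so the two values nearly coincide, which is precisely the degeneracy that rewiring towards λ2 → 0 is meant to avoid.

```python
# Sketch: commute times vs. the degree-based approximation of Eq. (1).
import numpy as np

rng = np.random.default_rng(0)
n = 200
A = (rng.random((n, n)) < 0.1).astype(float)  # illustrative Erdos-Renyi graph
W = np.triu(A, 1)
W = W + W.T                                   # symmetric, zero diagonal

d = W.sum(axis=1)                             # node degrees d_i
vol = d.sum()                                 # vol(G) = sum of degrees
L = np.diag(d) - W                            # unnormalized Laplacian
Lp = np.linalg.pinv(L)                        # Moore-Penrose pseudoinverse

i, j = 0, 1
R_ij = Lp[i, i] + Lp[j, j] - 2 * Lp[i, j]     # effective resistance R_ij
CT_ij = vol * R_ij                            # commute time CT_ij = R_ij * vol(G)
approx = vol * (1.0 / d[i] + 1.0 / d[j])      # degree-based approximation
print(CT_ij, approx)                          # nearly equal when lambda_2 is large
```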

To commence, we have the following lower bound:

\[
\frac{\mathrm{vol}(G)}{2\,\gamma_{\max}\, d_{\max}\, b} \;\le\; \lambda_2, \tag{2}
\]

where γ_max is the maximum path length [3]. The definition of b relies on the set of paths {γ_ij} between any pair of vertices i ≠ j: b ≜ max_{e∈E} 𝔼|{γ_ij : e ∈ γ_ij}|. Thus b is associated with the most traversed edge e; it is the expected number of paths traversing such an edge. Eq. (2) then shows that λ2 is minimized when b → ∞, i.e. when there exists a small bottleneck defined by a handful of maximally traversed edges. The preservation or enforcement of bottlenecks is key to improving the consistency of the optimal graph-based partition or clustering (see [19] and references therein), and it is also compatible with the minimization of the graph conductance, or Cheeger constant Φ [2]:

\[
\Phi \;\triangleq\; \min_{S \subseteq V} \frac{\mathrm{cut}(S)}{\min(\mathrm{vol}(S), \mathrm{vol}(\bar{S}))}, \tag{3}
\]

where cut(S) = Σ_{i∈S, j∈S̄} w_ij is the weight of the cut associated with the vertex subset S, and vol(S) = Σ_{i∈S} d_i is the volume (density) of S. Then we have the following upper bound for λ2:

\[
\lambda_2 \;\le\; 2\Phi, \tag{4}
\]

where Φ is the Cheeger constant of Eq. (3). This bound suggests that λ2 is minimized when: (a) the cut is minimized (see above), and (b) min(vol(S), vol(S̄)) is as large as possible. It is well known that for two cliques of size n linked by r edges we have Φ = r/(n(n−1)), i.e. lim_{n→∞} Φ = 0; however, if r = n we need larger cliques to constrain the spectral gap. Semi-definite programming (SDP) becomes intractable for large graphs, so we use a more scalable, robust and effective method, the Dirichlet densifier [6].

This rationale opens the door to modifying the set of edges E, by adding and/or reweighting edges, so that min(vol(S), vol(S̄)) is maximized for all S ⊂ V. However, we must take into account that the Cheeger constant relies on the worst case: on the one hand we need to infer more edges inside S, but on the other hand we must minimize the number of new inter-class edges linking S and S̄. This can be done by a good graph densifier; in this paper we use the Dirichlet densifier, a more precise and scalable densification procedure, which is designed to implicitly minimize the Cheeger bound (Eq. (4)) as follows. In [6,17] we define a measure of "edgeness" which is harmonically diffused; harmonic diffusion enforces a conservative (minimum-energy) way of inducing new edges [20]. However, this may not be strong enough to minimize the Cheeger bound. Let Φ_G be the (input) Cheeger bound associated with G = (V, E), cut_G(S) the minimal cut in G, and S ⊂ V the set associated with this cut. If G is unweighted, then cut_H(S) = cut_G(S) + kO(d), where d is the average degree of the k extremal nodes involved in the cut. As a result, it is straightforward to see that Φ_H < Φ_G only if min(vol_H(S), vol_H(S̄)) = O(d^2). This means that we require a quadratic density (virtually transforming S into a clique, where d = n − 1) to improve the input bound. This explains why structural noise (the strength of inter-class links) constrains the effectiveness of any densifier: we need a good link predictor acting as a structural filter of inter-class edges, one that minimizes the spectral gap by reducing the cut. In the following sections we study different filters for our densifier.

The Dirichlet densifier increases the edge density of a graph by running random walkers on a line graph, i.e. in the edge space rather than on the original graph. The Dirichlet process ensures that the resulting graph is more suitable for measuring CTs than the original one. Our densifier consists of the following steps:

1. kNN graph: Given a data set χ = {x_1, ..., x_n} ⊂ R^d, we map the x_i to the vertices V of an undirected weighted graph G(V, E, W) with W_ij = e^{−||x_i − x_j||^2/σ^2}, and (i, j) ∈ E if W_ij > 0 and j ∈ N_k(i).
2. Link prediction (structural filtering): application of a random-walk-based method to filter inter-class edges, which minimizes the spectral gap.
3. Edge selection: Given G = (V, E, W^e), select E′ ⊂ E, with |E′| ≪ |E|, as follows:
(a) S = sort(E, W^e, descend).


(b) S′ = S ∼ {e ∈ S : W^e < δ1}, where δ1 is set so that |S′| = α|S|.
4. Line graph: Given G = (V, S′, W^e), construct the graph Line = (S′, Line_E, Line_We), where:
(a) The nodes e_i of Line are the edges in S′.
(b) The weight function Line_We is defined as

\[
Line_{W^e}(e_a, e_b) \;=\; \sum_{k=1}^{|E'|} p_{e_k}(e_b \mid e_a)\; p_{e_k}(e_a \mid e_b), \tag{5}
\]

i.e. we use go-and-return probabilities.
(c) Line_E = {(e_a, e_b) : Line_We(e_a, e_b) > 0}.
5. Dirichlet process: Given the line graph, we proceed as follows:
(a) SB = sort(S′, Line_We, descend).
(b) SB′ = SB ∼ {e ∈ Line_E : Line_We < δ2}, where δ2 is set so that |SB′| = β|SB|.
(c) Consider SB′ as the boundary B (known labels) of a Dirichlet process driven by the Laplacian Line_L = Line_D − Line_We. Then finding a harmonic function, i.e. a function u(·) satisfying ∇²u = 0, consists of minimizing

\[
D_{Line}[u] \;=\; \frac{1}{2}\, u^T\, Line_L\, u, \tag{6}
\]

where u = [u_B, u_I] and Line_L are re-ordered so that the boundary nodes (edges in Line) come first. Minimizing D_Line[u] with respect to u_I then labels the unknown nodes (edges in Line) u_I as the solution of the linear system

\[
L_I\, u_I \;=\; -K^T u_B, \tag{7}
\]

where all the u_B are set to one, L_I is the sub-Laplacian of Line_L concerning the u_I nodes, and K is the |SB′| × |u_I| block of the re-ordered Laplacian coupling boundary and interior nodes.
6. Relabelling: Since there is a bijection between the nodes of the line graph and the edges of the original graph, we relabel the edges of the original graph with the information coming from the Dirichlet process on the line graph.

This approach is fully unsupervised, and it is both more scalable and more robust than techniques such as the classical semi-definite programming (SDP) densifier, which cannot estimate commute times reliably as the size of the graph increases. In this paper we study step 2 of our densification process: the need for a good link predictor based on a structural filter, whose resulting graph is both denser and better conditioned than the original input matrix. We propose such a link predictor, the Return Random Walk (Section 5), and compare it with other random walks (defined in Section 4.1) in our experiments (Section 6).

4. Related work

In this section we review how different link prediction methods based on random walks can be applied as pre-processing tasks to obtain better-conditioned graphs for the densification process described above, our unsupervised densifier based on the Dirichlet principle.

4.1. Random walks as a link predictor

Random walk algorithms [11,21–25] also consider how tightly connected two nodes are, and argue that nodes with many paths between them can be considered more related. A random walk is a Markov chain whose state at any time is defined by a node i of a graph G = (V, E), with the transition probability distributed equally among all outgoing edges. The process is simple: we randomly select a neighbour j of node i and move to it, and then repeat the process starting from this new node j. The resulting random sequence of nodes is a random walk on the graph. Random walks have been used to compute proximity measures between nodes in a graph, and random walk algorithms have been applied to different fields in mathematics, physics and computer science (information diffusion, sampling of complex networks, etc.). We focus on the two top state-of-the-art methods [26]: Random Walk with Restart and Superposed Random Walk.

One of the most used is Random Walk with Restart [27] (the PageRank algorithm [28] is related to it), which is a global random walk: starting from a node, we either continue the random walk with probability c or come back to the starting node with probability (1 − c). This model can be defined as an optimization problem as follows:

\[
\min_{p^x}\; c \sum_{i,j \in V} P_{i,j}\,(p_i^x - p_j^x)^2 \;+\; (1 - c) \sum_{i \in V} (p_i^x - s_i^x)^2, \tag{8}
\]

where P is the transition probability matrix of the random walk, p^x is the proximity vector with respect to node x, and s^x is the indicator vector with all elements set to 0 except s_x^x = 1.


Finally, to overcome this unfeasible computation for large graphs, the scores are approximated by an iterative formulation, with the symmetrized similarity s^RWR_xy = p^x_y + p^y_x, where

\[
p^x(t) \;=\; c\,P^T p^x(t-1) + (1 - c)\,s^x. \tag{9}
\]
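A minimal sketch of the iteration in Eq. (9) follows (illustrative, not the paper's code; NumPy assumed). The restart probability c = 0.9 is the value reported as optimal in the experiments of Section 6.

```python
# Sketch: Random Walk with Restart scores via the fixed-point iteration (9).
import numpy as np

def rwr_vector(W, x, c=0.9, tol=1e-9, max_iter=1000):
    """Stationary proximity vector p^x for start node x."""
    d = W.sum(axis=1)
    P = W / d[:, None]                       # row-stochastic transition matrix
    s = np.zeros(W.shape[0])
    s[x] = 1.0                               # restart (indicator) vector s^x
    p = s.copy()
    for _ in range(max_iter):
        p_new = c * P.T @ p + (1.0 - c) * s  # Eq. (9)
        if np.abs(p_new - p).sum() < tol:
            break
        p = p_new
    return p

# Symmetrized RWR link-prediction score between nodes x and y:
# s_xy = rwr_vector(W, x)[y] + rwr_vector(W, y)[x]
```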

On the other hand, Superposed Random Walk is a quasi-local approach, i.e. a method that strikes a balance between global and local measures and efficiently takes into account both local and global topological information. The Local Random Walk is a good baseline walker which limits the number of iterations to a pre-fixed small number l [29]; thus it does not wait for the stationary state reached at convergence, unlike other random walk approaches. It is defined as:

\[
s_{xy}^{LRW}(t) \;=\; \frac{d_x}{2|E|}\, \pi_{xy}(t) + \frac{d_y}{2|E|}\, \pi_{yx}(t), \tag{10}
\]

where π_xy(t) is the probability that the walker is located at node y after t steps, d_x is the degree of node x, and |E| is the number of edges in the graph. However, local random walks are too sensitive to the topology of the graph at large distances; this is overcome by continuously releasing the walker at the starting node, yielding the Superposed Random Walk:

\[
s_{xy}^{SRW}(t) \;=\; \sum_{l=1}^{t} s_{xy}^{LRW}(l). \tag{11}
\]
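A compact sketch of Eqs. (10) and (11) (illustrative; NumPy assumed): the t-step occupation probabilities π_xy(t) come from repeated multiplication by the transition matrix, and the SRW score superposes the LRW scores of all steps up to t.

```python
# Sketch: Local and Superposed Random Walk scores (Eqs. (10)-(11)).
import numpy as np

def srw_scores(W, t):
    """Return S with S[x, y] = s^SRW_xy(t)."""
    n = W.shape[0]
    d = W.sum(axis=1)
    two_E = d.sum()                   # 2|E| for an undirected graph
    P = W / d[:, None]                # row-stochastic transition matrix
    Pi = np.eye(n)                    # Pi[x, y] = pi_xy at the current step
    S = np.zeros((n, n))
    for _ in range(t):
        Pi = Pi @ P                   # advance all walkers one step
        lrw = (d[:, None] / two_E) * Pi + (d[None, :] / two_E) * Pi.T  # Eq. (10)
        S += lrw                      # superposition over steps, Eq. (11)
    return S
```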

5. Return random walks

In this section we define the main contribution of this paper, the Return Random Walk, as an implementation of a link predictor based on a structural filtering process which minimizes the probability that a random walk starting and ending at a given node traverses inter-class links. The resulting weighting matrix W^e is denser and more clustered than that associated with the input graph. The Return Random Walk enforces intra-class links while penalizing inter-class weights, capturing more intra-class links while removing the noise (inter-class links). Our algorithm operates under the hypothesis that inter-class links (noise) are rare events.

5.1. Design of return random walks

Given a set of points χ = {x_1, ..., x_n} ⊂ R^D, we map the points x_i to the vertices V of an undirected weighted graph G(V, E, W). Here V is the set of nodes, each v_i representing a data point x_i, and E ⊆ V × V is the set of edges linking

adjacent nodes. An edge e = (i, j) with i, j ∈ V exists if w_ij > 0, where w_ij = e^{−||x_i − x_j||^2/σ^2} and j ∈ N_k(i) (j is a kNN of i). The bandwidth parameter σ is optimally selected with respect to k.

Design of W^e. Given W = {w_ij} ∈ R^{n×n}, we produce a reweighted similarity matrix W^e as follows: (a) we explore the two-step random walks reaching a node v_j from node v_i through any transition node v_k; (b) on return from v_j to v_i, we maximize the probability of returning through a different transition node v_l ≠ v_k. For the first step (going from v_i to v_j through v_k) we have

\[
p_{v_k}(v_j \mid v_i) \;=\; \frac{w_{ik}\, w_{kj}}{d(v_i)\, d(v_j)},
\]

as well as a standard return

\[
p_{v_l}(v_i \mid v_j) \;=\; \frac{w_{jl}\, w_{li}}{d(v_j)\, d(v_i)},
\]

where d(·) is the degree function. The standard return works well if v_i and v_j belong to the same cluster (see Fig. 1, left). However, v_l (the transition node for returning) can be constrained so that v_l ≠ v_k. In this way, travelling out of a cluster of nodes (or class) is penalized, since the walker must choose a different path, which in turn is hard to find on average. Therefore, we obtain w^e_ij from w_ij as follows:

\[
w^e_{ij} \;=\; \max_{k}\; \max_{\forall l \neq k} \{\, p_{v_k}(v_j \mid v_i)\; p_{v_l}(v_i \mid v_j) \,\}, \tag{12}
\]

i.e. for each possible transition node v_k we compute the probability of leaving and returning (the product of independent probabilities) through a different node v_l. We retain the maximum product of probabilities over the v_l for a given k, and finally we retain the supremum of these maxima. As a result, when inter-class paths are frequent for a given e = (i, j) (Fig. 1, right), its weight w^e_ij is significantly reduced.

The weights w^e_ij measure the connectivity between two nodes in a specific cluster or region (not the direct connection but the indirect one through neighbouring nodes). Large values of w^e_ij mean that i and j are not only strongly locally connected but also belong to a highly cohesive connected component. Our working hypothesis is that the number of edges involved in inter-class transitions (Fig. 1, right) is small on average, since the number of inter-class edges tends to be small compared with the total number of edges. In realistic situations, patterns can be confused either due to their intrinsic similarity or due to the use of an improper similarity measure. As a result, this assumption leads to a significant decrease in many of the elements of W.

Filtering of W^e. To reduce inter-class noise, we consider the relationship between the shortest path and the sum of the weights traversed by the RRW:

\[
w'^e_{ij} \;=\; w^e_{ij} \times e^{-\gamma_{ij} / \gamma^{\min}_{ij}}, \tag{13}
\]


Fig. 1. Return random walks for reducing inter-class noise.

where γ_ij = w_ik + w_kj + w_jl + w_li and γ^min_ij is the length of the shortest path between i and j. Consequently, the weight is strongly reduced when the length of the actual path (constrained to pass through k and l, with l ≠ k) is very large in comparison with that of the shortest path between i and j, and also when the shortest path γ^min_ij itself is small. However, the above equation does not account for the difference between the outward and return paths. For this case we assign the weight

\[
w''^e_{ij} \;=\; \frac{w'^e_{ij}}{b_{ij}}, \qquad \text{where} \qquad b_{ij} \;=\; \frac{w_{ik} + w_{kj}}{w_{jl} + w_{li}}, \tag{14}
\]

where bij measures the symmetry with respect going away and return (asymmetric if bij = 1). If the value bij is either small or large, then (i, j) will be considered an inter-class edge (see the final algorithm in (1)) The above filtering of W is quite effective for reducing inter-class noise. In Fig. 2 we show the Adjusted Rand Indices (ARIs) obtained for We , We and We (for example, a subset of NIST dataset with n=10 0 0 and m=10 classes, 100 samples per class). As k increases, both We and We are stable, whereas the effectiveness of the kNN graph, filtered with Eq. (12), decays with k. To avoid notational conflicts in the following sections, we rename We as We . We have defined the algorithm Algorithm 1. 6. Experiments To evaluate Return Random Walks, we use four datasets: (a) A subset of NIST dataset with n=10 0 0 and m=10 classes (100 samples per class), with an average density of 14.7%, (b) COIL-20 dataset,1 with n=1440 and consists of m=20 classes with 72 samples per class [30], and an average density of 21.33%, (c) LOGO (FlickrLogos-322 ) consists of m=32 classes with 70 samples per class (n=2240) [31], with an average density of 20.19%, and (d) YALE (or Extended Yale-B3 ) dataset with n=2414 and m=38 classes [32] (variable-sized classes). Once the associated kNN graphs are applied to different random walk algorithms, we estimate commute times through the Nguyen and Mamitsuka [14] method (state-of-the-art). Then, the Adjusted Rand Index (ARI) with respect to the ground truth is used to measure the performance of each method. In the experiments, we analyze the original affinity matrix W with kNNs, where k ∈ {15, 25, 35}. In RWR we have evaluated the parameter c and the results increase until the optimal value c = 0.9 in our experiments, and t in both random walkers produce a deterioration in performance as t increase. 6.1. Experiment 1. Comparison of different random walks as a link predictor At first, we compare our random walk, Return Random Walk (RRW), with the state-of-the-art of quasi-local random walk, Superposed Random Walk (SRW), and global random walk, Random Walk with Restart. 1 2 3

¹ http://www.cs.columbia.edu/CAVE/software/softlib/coil-20.php
² http://www.multimedia-computing.de/flickrlogos/
³ http://www.cad.zju.edu.cn/home/dengcai/Data/FaceData.html
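As an illustration of this setup, the sketch below (not the paper's code) builds the Gaussian-weighted kNN graph of Section 3.1 and scores a partition of a filtered graph against the ground truth with the ARI. For brevity, off-the-shelf spectral clustering from scikit-learn stands in for the commute-time estimation of Nguyen and Mamitsuka [14]; this substitution is an assumption of the sketch, not the evaluation actually used in the paper.

```python
# Sketch of the evaluation loop: kNN graph -> link predictor -> clustering -> ARI.
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.metrics import adjusted_rand_score

def knn_gaussian_graph(X, k=15, sigma=1.0):
    """W_ij = exp(-||x_i - x_j||^2 / sigma^2), kept only for j in the kNN of i."""
    D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # squared distances
    W = np.exp(-D2 / sigma ** 2)
    np.fill_diagonal(W, 0.0)
    for i, drop in enumerate(np.argsort(-W, axis=1)[:, k:]):
        W[i, drop] = 0.0                                 # keep the k nearest only
    return np.maximum(W, W.T)                            # symmetrize

def evaluate(We, labels_true, m):
    """ARI of a spectral partition of the filtered affinity matrix We."""
    pred = SpectralClustering(n_clusters=m,
                              affinity='precomputed').fit_predict(We)
    return adjusted_rand_score(labels_true, pred)
```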


Fig. 2. Evaluation of Return Random Walks for different numbers of neighbours k in the kNN graph on the NIST dataset.

Algorithm 1. Return Random Walks: obtain a two-trip transition probability matrix W^e (see Fig. 1).

Require: the weighted adjacency matrix W of the original graph G
Ensure: a two-trip transition probability matrix W^e

for all v_i do
  for all v_j ≠ v_i do
    for all v_k ≠ v_j do
      p_{v_k}(v_j | v_i) = (W_{v_i v_k} × W_{v_k v_j}) / (d(v_i) × d(v_j))   [GO]
    end for
    maxp ← 0
    for all p_{v_k}(v_j | v_i) do
      for all v_l ≠ v_k do
        p = p_{v_k}(v_j | v_i) × p_{v_l}(v_i | v_j)   [RETURN]
        if p > maxp then
          maxp ← p; k ← v_k; l ← v_l
        end if
      end for
    end for
    W^e_ij = maxp
    W^e_ij = W^e_ij × e^{−(w_ik + w_kj + w_jl + w_li) / ShortestPath_ij}
    b_ij = (w_ik + w_kj) / (w_jl + w_li)
    W^e_ij = W^e_ij / b_ij
  end for
end for
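Below is a direct, unoptimized Python transcription of Algorithm 1 (illustrative only: variable names are mine, and SciPy's shortest_path is assumed as one reading of the γ^min term of Eq. (13)). Its O(n^4) inner loops match the complexity discussed in Sections 1 and 7, so it is only practical for small graphs.

```python
# Sketch of Algorithm 1: Return Random Walk weights with the filters (13)-(14).
import numpy as np
from scipy.sparse.csgraph import shortest_path

def return_random_walk(W):
    n = W.shape[0]
    d = W.sum(axis=1)
    sp = shortest_path(W, directed=False)     # gamma_min term (assumed reading)
    We = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            best, bk, bl = 0.0, -1, -1
            for k in range(n):                # [GO] transition node k
                if k == j or W[i, k] == 0 or W[k, j] == 0:
                    continue
                go = W[i, k] * W[k, j] / (d[i] * d[j])
                for l in range(n):            # [RETURN] via a node l != k
                    if l == k or W[j, l] == 0 or W[l, i] == 0:
                        continue
                    ret = W[j, l] * W[l, i] / (d[j] * d[i])
                    if go * ret > best:
                        best, bk, bl = go * ret, k, l
            if bk < 0:
                continue                      # no constrained return path found
            gamma = W[i, bk] + W[bk, j] + W[j, bl] + W[bl, i]
            b = (W[i, bk] + W[bk, j]) / (W[j, bl] + W[bl, i])   # Eq. (14)
            We[i, j] = best * np.exp(-gamma / sp[i, j]) / b     # Eqs. (12)-(14)
    return We
```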


Table 1. Comparison of densification level in terms of filtered inter-class edges: volume of intra-class edges of each graph for the different datasets and numbers of neighbours k in the kNN graphs (in parentheses, percentage of intra-class edges with respect to the global volume of the graph). The best percentage, set in bold in the original, is achieved by RRW in every configuration.

            SRW               RWR                RRW
NIST  k=15  5.2   (54.79%)    618.2  (61.82%)    5.82  (89.29%)
      k=25  2.03  (53.62%)    559.02 (55.9%)     2.87  (85.79%)
      k=35  1.22  (51.69%)    511.66 (51.17%)    0.44  (69.17%)
COIL  k=15  1.11  (75.26%)    1083.7 (69.6%)     60.3  (97.85%)
      k=25  0.857 (68.05%)    979.86 (65.1%)     12.87 (96.16%)
      k=35  0.705 (62.39%)    898.4  (66.47%)    3.963 (94.39%)
LOGO  k=15  3.13  (39.76%)    1265.1 (56.48%)    17.54 (86.6%)
      k=25  1.18  (36.55%)    1087.9 (48.57%)    2.38  (58.7%)
      k=35  0.72  (33%)       941.46 (42.03%)    0.86  (52.76%)
YALE  k=15  0.33  (23.38%)    861.27 (35.68%)    4.97  (41.7%)
      k=25  0.18  (16.57%)    667.81 (27.66%)    1.16  (31.9%)
      k=35  0.12  (13.11%)    572.31 (23.71%)    0.35  (26.93%)

We run all the algorithms on the four datasets (NIST, COIL, LOGO and YALE) for different values of k, and we evaluate the densification in terms of filtered inter-class edges, measuring the volume of intra-class edges, vol_{E_i} = Σ_{e_ij ∈ E_i} W_ij, where E_i is the subset of intra-class edges; RWR and SRW are reported with their best parameter t. Table 1 shows that our RRW filters inter-class edges better than SRW and RWR, and that k = 15 is the best setting across all the datasets in our experiments. For instance, the best volumes of intra-class edges on the NIST dataset are 5.82 (RRW), 618.2 (RWR) and 5.2 (SRW). A higher volume (as in RWR) helps to minimize the Cheeger constant and improve graph densification, but we also need to decrease the cut at the same time; for that reason we report how many edges are intra-class. RRW yields a larger proportion of intra-class edges with respect to the whole edge set |E| than RWR and SRW (on NIST we obtain 89.29% intra-class and 10.71% inter-class edges, whereas RWR yields 61.82%/38.18% and SRW 54.79%/45.21%). With RRW the inter-class edges are reduced, so the spectral gap and the Cheeger constant are minimized because the cut is reduced.

6.2. Experiment 2: comparison in Dirichlet densification

In this experiment we evaluate the different random walks defined as link predictors (SRW, RWR and our RRW) when applied to graph densification. As seen in the previous experiment, RRW minimizes the spectral gap by decreasing the cut (a lower number of inter-class edges). We now apply Dirichlet densification to the kNN graphs, using the different random walks as inter-class filters, and then estimate commute times through the state-of-the-art method of Nguyen and Mamitsuka [14]. The main goal of the Dirichlet densifier is to obtain a well-densified graph that is better conditioned than the original input matrix for downstream tasks such as clustering or ranking, and that can be estimated for large graphs. In Table 2 we compare the Dirichlet densifier on the different datasets in terms of the Adjusted Rand Index (ARI) with respect to the ground truth and the global densification level (DL, in percentage) of the number of edges, as measures of densification performance.

Table 2. Comparison of performance in terms of Adjusted Rand Index (ARI) and densification level in terms of number of edges (DL, in percentage) after densification with the Dirichlet process. The best ARI of each dataset, set in bold in the original, is obtained by RRW+D.

            SRW+D           RWR+D           RRW+D
            DL     ARI      DL     ARI      DL     ARI
NIST  k=15  0.7    34.36    0.64   36.92    1.76   74.4
      k=25  1.16   27.43    1.1    34.01    4.78   71.61
      k=35  1.61   21.96    1.54   37.54    11.2   71.38
COIL  k=15  0.4    44.97    0.57   66.08    1.08   95.44
      k=25  0.74   62.03    0.7    55.43    1.82   92.81
      k=35  1.06   68.11    1.01   59.39    2.92   92.11
LOGO  k=15  0.34   25.85    0.32   41.87    1.61   62.96
      k=25  0.57   17.35    0.55   38.4     3.87   60.65
      k=35  0.8    14.24    0.78   32.4     4.79   57.23
YALE  k=15  0.43   2.37     0.27   14.89    1.1    15.68
      k=25  0.51   2.41     0.48   10.81    2.06   10.86
      k=35  0.71   3.07     0.68   8.14     2.99   8.72


The results show that RRW+D (Dirichlet densification applied after RRW) obtains better ARIs than SRW+D and RWR+D. Our best cases are 74.4%, 95.44%, 62.96% and 15.68% on the NIST, COIL, LOGO and YALE datasets, respectively.

7. Conclusions and future work

In this paper we have designed a link predictor based on random walks, the Return Random Walk (RRW), which infers intra-class edges while also working as a structural filter of inter-class noise. This filter is devoted to reducing the strength of inter-class links, as verified in the experiments: in NIST, COIL and LOGO, the number of remaining noisy edges is below 15% of the edges of the whole graph (89.29% intra-class vs. 10.71% inter-class in NIST, 97.85% vs. 2.15% in COIL, and 86.6% vs. 13.4% in LOGO). Once the graph is filtered, we feed a Dirichlet process, which becomes a scalable, robust, reliable and unsupervised method. This pre-processing of the input graph of the Dirichlet densifier improves the estimation of commute times, since it minimizes the Cheeger constant and the spectral gap (increasing the volume of the graph while decreasing the cut). The state-of-the-art methods work worse because, although the volume of the graph is significantly increased, the cut increases too (a high inter-class level). As future work, in a similar way to the compression/decompression of graphs that we used in [33] to speed up densification, we propose to reduce the O(n^4) complexity of our algorithm, extending the link predictor to large graphs by means of deep learning approaches.

Declaration of Competing Interest

We have no conflict of interest.

References

[1] M. Hardt, N. Srivastava, M. Tulsiani, Graph densification, in: Innovations in Theoretical Computer Science (ITCS 2012), Cambridge, MA, USA, January 8–10, 2012, pp. 380–392.
[2] F.R.K. Chung, Spectral Graph Theory, CBMS Regional Conference Series in Mathematics, American Mathematical Society, 1997.
[3] P. Diaconis, D. Stroock, Geometric bounds for eigenvalues of Markov chains, Ann. Appl. Probab. 1 (1) (1991) 36–61.
[4] H. Qiu, E.R. Hancock, Clustering and embedding using commute times, IEEE Trans. Pattern Anal. Mach. Intell. 29 (11) (2007) 1873–1890.
[5] F. Escolano, M. Curado, E.R. Hancock, Commute times in dense graphs, in: Proceedings of the IAPR International Workshops on Statistical Techniques in Pattern Recognition (SPR) and Structural and Syntactic Pattern Recognition (SSPR), Springer, 2016, pp. 241–251.
[6] M. Curado, F. Escolano, M.A. Lozano, E.R. Hancock, Dirichlet densifiers for improved commute times estimation, Pattern Recognit. 91 (2019) 56–68.
[7] W.-K.K. Ma, Semidefinite relaxation of quadratic optimization problems and applications, IEEE Signal Process. Mag. 27 (3) (2010) 20–34.
[8] W. Liu, J. Wang, S. Kumar, S. Chang, Hashing with graphs, in: Proceedings of the 28th International Conference on Machine Learning (ICML), Bellevue, WA, USA, June 28–July 2, 2011, pp. 1–8.
[9] F. Emmert-Streib, M. Dehmer, Y. Shi, Fifty years of graph matching, network alignment and network comparison, Inf. Sci. 346 (2016) 180–197.
[10] P.G. Doyle, J.L. Snell, Random Walks and Electric Networks, Carus Mathematical Monographs 22, Mathematical Association of America, 1984.
[11] L. Lovász, Random walks on graphs: a survey, in: D. Miklós, V.T. Sós, T. Szőnyi (Eds.), Combinatorics, Paul Erdős is Eighty, vol. 2, János Bolyai Mathematical Society, Budapest, 1996, pp. 353–398.
[12] U. von Luxburg, A. Radl, M. Hein, Hitting and commute times in large random neighborhood graphs, J. Mach. Learn. Res. 15 (1) (2014) 1751–1798.
[13] U. von Luxburg, A. Radl, M. Hein, Getting lost in space: large sample analysis of the commute distance, Adv. Neural Inf. Process. Syst. 23 (2010) 2622–2630.
[14] C.H. Nguyen, H. Mamitsuka, New resistance distances with global information on large graphs, in: Proceedings of the 19th International Conference on Artificial Intelligence and Statistics (AISTATS), Cadiz, Spain, May 9–11, 2016, pp. 639–647.
[15] D. Liben-Nowell, J. Kleinberg, The link-prediction problem for social networks, J. Am. Soc. Inf. Sci. Technol. 58 (7) (2007) 1019–1031.
[16] E. Szemerédi, Regular partitions of graphs, Technical Report, Stanford University, Dept. of Computer Science, 1975.
[17] F. Escolano, M. Curado, M.A. Lozano, E.R. Hancock, Dirichlet graph densifiers, in: Proceedings of the IAPR International Workshops on Statistical Techniques in Pattern Recognition (SPR) and Structural and Syntactic Pattern Recognition (SSPR), Springer, 2016, pp. 185–195.
[18] U. von Luxburg, A. Radl, M. Hein, Hitting and commute times in large random neighborhood graphs, J. Mach. Learn. Res. 15 (1) (2014) 1751–1798.
[19] N. García Trillos, D. Slepčev, J. von Brecht, T. Laurent, X. Bresson, Consistency of Cheeger and ratio graph cuts, J. Mach. Learn. Res. 17 (2016) 1–46.
[20] X. Zhu, Z. Ghahramani, J.D. Lafferty, Semi-supervised learning using Gaussian fields and harmonic functions, in: Proceedings of the Twentieth International Conference on Machine Learning (ICML), Washington, DC, USA, August 21–24, 2003, pp. 912–919.
[21] P.G. Doyle, J.L. Snell, Random Walks and Electric Networks, Mathematical Association of America, 1984.
[22] R. Lyons, Y. Peres, Probability on Trees and Networks, vol. 42, Cambridge University Press, 2016.
[23] P. Diaconis, Group representations in probability and statistics, IMS Lecture Notes–Monograph Series 11 (1988) i–192.
[24] D. Aldous, J. Fill, Reversible Markov Chains and Random Walks on Graphs, monograph in preparation, 2002. Available online: http://stat-www.berkeley.edu/users/aldous/RWG/book.html.
[25] D.A. Levin, Y. Peres, Markov Chains and Mixing Times, vol. 107, American Mathematical Society, 2017.
[26] V. Martínez, F. Berzal, J.-C. Cubero, A survey of link prediction in complex networks, ACM Comput. Surv. 49 (4) (2017) 69.
[27] H. Tong, C. Faloutsos, J.-Y. Pan, Fast random walk with restart and its applications, in: Sixth International Conference on Data Mining (ICDM'06), IEEE, 2006, pp. 613–622.
[28] S. Brin, L. Page, The anatomy of a large-scale hypertextual web search engine, Comput. Netw. ISDN Syst. 30 (1–7) (1998) 107–117.
[29] W. Liu, L. Lü, Link prediction based on local random walk, EPL (Europhys. Lett.) 89 (5) (2010) 58007.
[30] S.A. Nene, S.K. Nayar, H. Murase, Columbia Object Image Library (COIL-20), Tech. Rep. CUCS-006-96, Dept. Comput. Sci., Columbia Univ., New York, NY, USA, 1996.
[31] S. Romberg, L.G. Pueyo, R. Lienhart, R. van Zwol, Scalable logo recognition in real-world images, in: Proceedings of the 1st ACM International Conference on Multimedia Retrieval (ICMR '11), 2011.
[32] D. Cai, X. He, J. Han, Spectral regression for efficient regularized subspace learning, in: Proceedings of the International Conference on Computer Vision (ICCV'07), 2007.
[33] M. Fiorucci, A. Torcinovich, M. Curado, F. Escolano, M. Pelillo, On the interplay between strong regularity and graph densification, in: Proceedings of the 11th International Workshop on Graph-Based Representations in Pattern Recognition (GbRPR), Anacapri, Italy, May 16–18, 2017, pp. 165–174.