Speeding up SimRank computations by polynomial preconditioners

Speeding up SimRank computations by polynomial preconditioners

Journal Pre-proof Speeding up SimRank computations by polynomial preconditioners Sio Wan Ng, Siu-Long Lei, Juan Lu, Zhiguo Gong PII: S0168-9274(20)...

1MB Sizes 0 Downloads 53 Views

Journal Pre-proof Speeding up SimRank computations by polynomial preconditioners

Sio Wan Ng, Siu-Long Lei, Juan Lu, Zhiguo Gong

PII:

S0168-9274(20)30045-3

DOI:

https://doi.org/10.1016/j.apnum.2020.02.009

Reference:

APNUM 3774

To appear in:

Applied Numerical Mathematics

Received date:

9 October 2019

Revised date:

3 January 2020

Accepted date:

10 February 2020

Please cite this article as: S.W. Ng et al., Speeding up SimRank computations by polynomial preconditioners, Appl. Numer. Math. (2020), doi: https://doi.org/10.1016/j.apnum.2020.02.009.

This is a PDF file of an article that has undergone enhancements after acceptance, such as the addition of a cover page and metadata, and formatting for readability, but it is not yet the definitive version of record. This version will undergo additional copyediting, typesetting and review before it is published in its final form, but we are providing this version to give early visibility of the article. Please note that, during the production process, errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain. © 2020 Published by Elsevier.

Speeding Up SimRank Computations by Polynomial Preconditioners Sio Wan Nga , Siu-Long Leib,∗, Juan Luc , Zhiguo Gonga a State

Key Laboratory of Internet of Things for Smart City and Department of Computer and Information Science, University of Macau, Macau, People’s Republic of China b Department of Mathematics, University of Macau, Macau, China c Beijing Institute of Petrochemical Technology, Beijing, China

Abstract SimRank is popularly used for evaluating similarities between nodes in a graph. Though some recent work has transformed the SimRank into a linear system which can be solved by using conjugate gradient (CG) algorithm. However, the diagonal preconditioner used in the algorithm still left some room for improvement. There is no work showing what should be an optimal preconditioner for the linear system. For such a sake, in this paper, we theoretically study the preconditioner problem and propose a polynomial preconditioner which can improve the computation speed significantly. In our investigation, we also modify the original linear system to adapt to all situations of the graph. We further prove that the condition number of a polynomial preconditioned matrix is bounded in a narrow interval. It implies that the new preconditioner can effectively accelerate the convergence rate of the CG algorithm. The experimental results show the proposed algorithm outperforms the state-of-the-art algorithms for the all-pairs SimRank computation. Keywords: SimRank, graph, matrix, linear system, polynomial preconditioner

∗ Corresponding

author Email address: [email protected] (Siu-Long Lei)

Preprint submitted to Applied Numerical Mathematics

February 12, 2020

1. Introduction Graph is one of the most powerful tools to model objects and their complex relationships in various domains, such as the world wide web, social networks, and knowledge graphs. For example, in a social network users are modeled 5

as nodes and their friendships are specified as edges of the graph. As another example, the World Wide Web can be modeled as a simple directed graph, where nodes represent websites, and edges represent hyperlinks. Once the data are modeled into a graph, one interesting problem is how to measure the similarities among those objects (nodes of the graph). Though several other techniques are

10

also proposed, SimRank is regarded as one of the most effective measurements and is successfully exploited in various applications, such as web page ranking [14], collaborative filtering [1], web spam detection [2], network clustering [26], and natural language processing [16]. The basic intuition of SimRank is that “two objects are similar if they are

15

referenced by similar objects and an object should be most similar to itself”. In Figure 1, a simple graph shows that node 2 and node 4 are similar because they are referenced by the common neighbor node 3. SimRank implies a mutual reinforcement naturally, by iteratively updating the similarity score of node pair according to the similarity scores of all their in-neighbor pairs.

Graph G

SimRank scores

1

2

3

4

1 0.16 𝑆= 0 0

0.16 1 0 0.4

0 0 1 0

Figure 1: A simple graph G and its SimRank scores.

2

0 0.4 0 1

20

With respect to different computing scopes, SimRank techniques can be separated into single-pair (given two nodes, return the SimRank score between these given two nodes) [10], single-source (given one node, return SimRank scores of node pairs containing the given node) [8], top-k (given one node, return the topk SimRank scores of node-pairs containing this given node) [17], partial-pairs

25

(given two sets of nodes, return the SimRank scores of all node-pairs crossing these two sets) [25], and all-pairs algorithms (return the SimRank scores of all node-pairs in the graph) [9]. All-pairs SimRank techniques can conveniently get the scores of different requirements when the data resource is shared by multiple users. Hence, we concentrate on all-pairs SimRank computations in this paper.

30

Generally speaking, the computation algorithms for all-pairs, single-source, and single-pair SimRanks are often tackled separately because their exploited techniques are quite different. Further, all-pairs SimRank is the most fundamental one, such that, once all-pair SimRank scores are computed out and stored, any other kinds of SimRank queries can be easily obtained through the indices.

35

1.1. SimRank Model In 2002, G. Jeh and J. Widom proposed the SimRank model based on the link structure of the web to evaluate the similarity between any two nodes (objects) in the graph [7]. In detail, given a directed and unweighted graph G = (V, E), where V is the set of nodes and E is the set of directed links between any two nodes in G. For any node i ∈ V , I(i) = {j|(j, i) ∈ E, j ∈ V } denotes the set of in-neighbors of i and |I(i)| is the in-degrees of i. The SimRank score Sij ∈ [0, 1] between node i and j in G is defined by    1,    P P c Sab , Sij = |I(i)||I(j)|  a∈I(i) b∈I(j)     0,

i=j i 6= j, I(i) 6= ∅ and I(j) 6= ∅ i 6= j, I(i) = ∅ or I(j) = ∅

where c ∈ (0, 1) is called a decay factor, usually set between 0.6 and 0.8.

3

(1)

1.2. Existing techniques for SimRank computation The techniques for SimRank can be classified into two broad categories, the stochastic technique and the deterministic technique. Stochastic technique com40

putes SimRank scores using sampling techniques such as Monte Carlo Random Walk algorithm. So this technique is more scalable to handle large datasets. The related works are reviewed in the following. 1.2.1. Stochastic techniques Work [7] interprets SimRank computation as an expected meeting distance of two random walks starting from each of the nodes simultaneously. In detail, the SimRank score between two nodes i and j can be calculated as the expected value of a function of their meeting hops   Sij = E cτ (i,j) , where τ (i, j) is the number of hops for i and j to meet along their random walks

45

starting from each of them and c is a decay factor such that 0 < c < 1. To compute single-pair SimRank scores [4, 19, 20], Fogaras and R´acz [4] proposed to use Monte Carlo technique as an approximation algorithm. By this technique should incur O(n2 m)1 time and memory cost for all-pair SimRank computation. Recently, Work [20] presented three algorithms using different

50

sample reduction strategies (BLPMC, FLPMC and FBLPMC). For all-pairs SimRank computation, FBLPMC which combines the strategies of BLPMC 1 and FLPMC may take O(m2 |SBLP | + n2 NM C (1 + 1−c ))2 time, including index

time and Monte Carlo sampling time. For single-source SimRank query, works [19, 8] proposed some index schemes 55

with random walk method. The algorithm SLING [19] requires O(n/) space  and O m/ + n log nδ /2 3 pre-processing time. Moreover, the complexity of 1 n, m 2S

is the number of nodes and edges in the graph respectively. is the sequence of node-pairs performed by BLPMC, NM C is the number of samples

BLP

in Monte Carlo sampling and c is the dacay factor in Equation (1). 3  is an additive error bound and δ is the failure probability of the pre-processing algorithm.

4

 all-pairs calculation is O n2 / . To improve the performance, Jiang et al. [8] further proposed an algorithm called READ and they claim that their algorithm can support all-pairs SimRank queries, with both time and space complexity as 60

O(n2 r+nm)4 . Song et al. [18] introduced a rectified unidirectional random walk method for the top-k SimRank over a large undirected graph, called UniWalk. For the single-source SimRank task, the time complexity also costs nearly linear to the number of source vertices, but the convergence rate of UniWalk is not better than READS for the same situation. Though most of the random walk

65

techniques can improve the performance of computing SimRank scores, however, one common limitation of such techniques is their randomly generated errors, which may not be applicable to some value-sensitive applications. 1.2.2. Deterministic techniques On the other hand, Jeh and Widom [7] also gave a naive algorithm for

70

SimRank computation that converges to a fixed point by iteration. Deterministic techniques can compute the scores with consistently converging accuracy. Therefore, the results can be used in more applications. Unfortunately, such kind of techniques may incur expensive computational complexity (e.g. O(Kn4 ) in the worst case [21] where K is the iteration number). Thus, Lizorkin et al.

75

[11, 12] proposed an iterative SimRank algorithm with three optimization meth3

ods for speeding up the computation from O(Kn4 ) to O(K · min{nm, logn

2

n }),

the algorithm requires K = dlogc e − 1 iterations for the error bound . Works [23, 24] observed that the computing partial sums over different in-neighbor sets [11, 12] may have severe redundancy. To make use of this property, Yu et al. 80

optimized the sub-summations among different partial sums by using an optimal in-neighbors partitioning (OIP) to reduce the computational complexity of SimRank from O(Kn2 d) to O(Kn2 d0 ), where d = m/n is the average in-degree in the graph and d0 ≤ d. Despite those achievements, the convergence speed of those fixed-point iteration methods is still slow especially when we require a 4r

is the length of the random walks.

5

85

high accuracy (i.e., small ). To tackle the problem, some works [9, 22] proposed some methods with using Kronecker product and vectorization operator. Li et al. [9] developed three efficient algorithms based on singular value decomposition (SVD) for estimating similarities on static and dynamic information networks with time complexity

90

O(k 4 n2 ), where k is the rank of the adjacency matrix of the graph. Yu et al. [22] proposed a compressed storage scheme for sparse graphs and a fast matrix multiplication for dense graphs to accelerate the convergence rate of their algorithm. They reduced the memory from O(n2 ) to O(m + n) and time complexity from O(Kn3 ) to O(K · min{nm, nγ }) in the worst case where γ ≤ log2 7. They

95

also showed their algorithms can significantly speed up the convergence rate roughly twice faster than the technique proposed in [11, 12]. To further reduce the time complexity, Lu et al. [13] first transformed the SimRank problem into a linear system for computing all-pairs SimRank scores, and solved the problem by using CG algorithm for undirected graphs

100

and Biconjugate Gradient Stabilized (BiCGstab) algorithm for directed graphs. The CG algorithm can significantly reduce the iteration number K of the algorithm. In their algorithm, the iteration number K can be determined by √

K = (dlogc e − 1)/ logc ( √κ−1 ), where  is the given error bound of SimRank κ+1 and κ is the condition number of the coefficient matrix of the linear system. 105

They further proposed to utilize a diagonal preconditioner to reduce the condition number κ, which can make the iteration number K decrease in the CG algorithm. Even though CG is regarded as one of the most powerful algorithm for solving symmetric and positive definite linear systems, however, there is no work yet

110

providing a thorough discussion on the improvement room of preconditioners for SimRank computations. In this paper, we are interested in two questions: (1) Is there any more efficient preconditioner for SimRank? and (2) How to theoretically guarantee the efficiency of the preconditioners for SimRank computations?

6

115

1.3. Contributions As we discussed above, the convergence rate of the CG algorithm for SimRank will be raised if the coefficient matrix of the linear system is well-conditioned. However, solving the linear system with a well-approximated preconditioner also requires an expensive computational cost. To save the cost, work [13] only pro-

120

posed to use a simple diagonal preconditioner in the CG algorithm. In this paper, we thoroughly investigate the problem and propose to use a polynomial preconditioner to reduce the condition number of the coefficient matrix, thus improve the speed of the SimRank computation. We theoretically prove the spectral properties of the polynomial preconditioned matrix and present the

125

trade-off between the solving cost and the effectiveness of the preconditioner. The main contributions are listed as follows: 1. We propose a polynomial preconditioner for the SimRank linear system, and prove that the eigenvalues of the preconditioned matrix range in [1 − cα+1 , 1 + cα+1 ], where α is the degree of the polynomial precondi-

130

tioner and c is the decay factor used in SimRank. This claim indicates the condition number κ of the preconditioned matrix tends to 1 when α becomes large. Therefore, the polynomial preconditioner makes the CG algorithm converge faster, but may also bring more cost for solving the preconditioned linear system in each iteration. We investigate the trade-

135

off. 2. For directed graphs, there may exist some nodes without in-neighbor (i.e., I(i) = ∅ or I(j) = ∅ in Equation (1)), and the existing works often ignore this situation for simplicity. To make the problem more complete, we extend the linear system to include this situation.

140

3. We compare our approach with the state-of-the-art all-pairs SimRank algorithms, diagonally preconditioned CG [13] and the Monte Carlo based algorithm READS (the static version) [8], on several real datasets. The experimental results confirm that the proposed polynomial preconditioners are more efficient for solving all-pair SimRank.

7

145

This paper is organized as follows. We first review the linear system of SimRank in Section 2. The time complexity of CG and BiCGstab algorithms for solving the SimRank scores are discussed in Section 3. Then we introduce diagonal and polynomial preconditioners in Section 4. In Section 5, we prove the spectral properties and further discuss the iteration number of the methods theoretically.

150

Section 6 extends the SimRank linear system to include the situations where some nodes may not have in-neighbors. In Section 7, we report the experimental results on some real networks and we conclude this paper in Section 8.

2. Preliminary In this section, we introduce some notations and some basic concepts which 155

will be used in the later part of the paper. We also present the framework of the linear system for SimRank. The basic symbols are presented in Table 1, and the detail discussions could be found in work [13].

8

Table 1: The main symbols in this paper.

Symbol G = (V, E)

Definition The given graph G

n = |V |

The number of nodes in G

m = |E|

The number of edges in G

Sij

The SimRank score between node i and j

I(i)

The set of in-neighbors of node i

|I(i)|

The in-degrees of node i

c

The decay factor in the definition of SimRank

K

The number of iterations

κ

The condition number

A

The backward adjacency matrix of G



The partial backward adjacency matrix of G2

AT

The transpose of matrix A

I

Identity matrix

J

All-ones matrix

0

A column vector with all entries are zero

ek

The k-th column of the identity matrix



Kronecker product



Entry-wise product



Entry-wise division

Now we introduce several operators which can transform a vector into a matrix or a matrix into a vector. h iT Definition 1. (Diag operator) Let vector a = a1 , a2 , . . . , an ∈ Rn×1 , then

9

Diag(a) ∈ Rn×n is defined as a diagonal matrix  a1 0 · · ·    0 a2 · · · Diag(a) =  .. . .  .. . . .  0 160

···

0

0



  0  ..  . .  an

Remark 1. The diagonal vector of a square matrix C ∈ Rn×n is denoted by h iT diag(C) = C11 , C22 , . . . , Cnn in this paper. 1×m Definition 2. (Vec operator) Let B ∈ Rn×m be a matrix and bT be i ∈ R

its i-th row vector, then vec(B) ∈ Rnm×1 is defined as a vector by stacking all those row vectors together 

b1



     b2   vec(B) =   ..  .  .    bn Proposition 1. Let I ∈ Rn×n where I is the identity matrix. Then a matrix Iˆ = Diag(vec(I)) 2

is a diagonal matrix in Rn

×n2

.

Proposition 2. Let I ∈ Rn×n be the identity matrix and J ∈ Rn×n be the all-ones matrix. Then Jˆ = Diag(vec(J − I)) 2

is a diagonal matrix in Rn

×n2

.

2.1. Linear system of SimRank 165

Now we are going to introduce the linear system for SimRank which formulates the fundamental work in this paper. We omit all the proofs in this subsection, which could be found in [13]. The linear system for SimRank is easily addressed on the node-pairs graph other than on the original graph. In detail, let G = (V, E) be a graph, its 10

170

node-pair graph is defined as G2 = (V 2 , E 2 ), such that nk = (i, j) ∈ V 2 where k = (i − 1)n + j, n = |V | if and only if i, j ∈ V . For any np = (ip , jp ), nq = (iq , jq ) ∈ V 2 , (nk , nl ) ∈ E 2 if and only if both (ip , ip ) ∈ E and (jq , jq ) ∈ E. Figure 2 presents an example. Graph G2

Graph G

1

𝑛3 (1,3)

2

𝑛12 (3,4)

𝑛10 (3,2)

𝑛4 (1,4)

𝑛5 (2,1)

𝑛13 (4,1)

𝑛8 (2,4)

4

3

𝑛1 (1,1)

𝑛2 (1,2) 𝑛9 (3,1)

𝑛11 (3,3)

𝑛14 (4,2)

𝑛7 (2,3)

𝑛16 (4,4)

𝑛15 (4,3)

𝑛6 (2,2)

Figure 2: A simple graph G and its node-pair graph G2 .

Now, we separate the nodes of G2 into two parts VS = {(i, j)|i = j} and VD = {(i, j)|i 6= j}. Then, the SimRank Equation (1) can be restated on the node-pair graph G2 as Snk = 1, nk = (i, j) ∈ VS

(2)

and Snk =

c |I(nk )|

X

Snl , nk = (i, j) ∈ VD ,

(3)

nl ∈I(nk )

where I(nk ) is the set of in-neighbors of nk in G2 . (Now we temporally suppose 175

there is no node i ∈ V such that I(i) = ∅ to simplify our current discussion, and we will address the situation of I(i) = ∅ or I(j) = ∅ later in Section 6.) Let IS (nk ) = I(nk ) ∩ VS and ID (nk ) = I(nk ) ∩ VD , then, Equation (3) can be transformed into Snk

 c  = |I(nk )|

 X

Snl + |IS (nk )| , nk ∈ VD ,

nl ∈ID (nk )

11

(4)

Equation (4) can be further rewritten as |I(nk )|Snk − c

X

Snl = c|IS (nk )|, nk ∈ VD .

(5)

nl ∈ID (nk ) 2

Let d, s, v be vectors in Rn

×1

, such that dk = |I(nk )|, sk = Snk , vk =

|IS (nk )|. Then, Equation (5) can be represented into a matrix form   ˆ Jˆ Diag(d) − cAˆ s = cJv

ˆ Aˆkl where Jˆ is defined as in Proposition 2, and A,

(6)

  1, n ∈ I (n ) l D k = , is  0, else

the partial backward adjacency matrix of G2 .

Similarly, Equation (2) can be rewritten into a matrix form ˆ ˆ IDiag(d)s = Id 180

(7)

where Iˆ is defined as in Proposition 1. 2 2 Keep in mind that Iˆ + Jˆ = I (the identity matrix in Rn ×n ). By adding

Equation (6) and (7) together, we have the following linear equation for SimRank s Ls = q

(8)

ˆ + Id. ˆ where L = Diag(d) − cJˆAˆ and q = cJv Definition 3. The backward adjacency matrix of graph G, denoted as A ∈ Rn×n , is defined as

  1, (j, i) ∈ E . Aij =  0, else

The following two propositions present how to compute matrix Aˆ and vectors v, d, using the backward adjacency matrix of graph G. 2

Proposition 3. Vector v, d ∈ Rn

×1

are defined as in Equation (6). Then, we

have v = vec(AAT ) 12

and d = vec(ggT ) where g ∈ Rn×1 with gi =

n P

Aij .

j=1

Proposition 4. As defined in Equation (6), let Aˆ be the partial backward 2 2 adjacency matrix of G2 . Then, Aˆ ∈ Rn ×n can be calculated by

Aˆ = (A ⊗ A)Jˆ 185

2 2 where ⊗ is the Kronecker product and Jˆ ∈ Rn ×n is defined as in Proposition

2. Proposition 5. For an undirected graph, the coefficient matrix L, as defined in Equation (8), is a symmetric positive definite matrix.

3. Solving the linear system by an efficient algorithm 190

In the previous section, we represent the SimRank as a solution of a linear system. However, it is too expensive to employ direct methods for solving the 2

linear system L ∈ Rn

×n2

, when n is large.

If the linear system L is sparse, iterative methods are promising in solving the system and they may not require any extra storage in general. Since matrix L is 195

a symmetric positive definite matrix for the undirected graph [13], CG method is regarded as the most efficient iterative algorithm for solving the system in both theory and practice. For the directed graph, the matrix L is a non-symmetric matrix, BiCGstab method is one of the extensions of CG method to solve nonsymmetric linear systems, with a fast and smooth convergence.

200

3.1. The time complexity of CG and BiCGstab algorithms For any arbitrary vector x, calculating the product of Lx incurs the main time complexity of CG and BiCGstab methods in each iteration. Now, we first discuss the time complexity of CG and BiCGstab methods per iteration for SimRank, then, we tackle the overall convergence rate of the CG method for

205

solving the linear system. 13

h iT 2 Definition 4. (Mat operator) Let a = a1 , a2 , . . . , an2 ∈ Rn ×1 be a column vector, then M at(a) ∈ Rn×n is defined as a matrix 

   an+1 M at(a) =  ..   .  an2 −n+1



a2

···

an

an+2 .. .

··· .. .

an2 −n+2

···

  a2n   ..  . .   an2

a1

According to work [13], the product of the matrix L and an arbitrary vector 2

x ∈ Rn

×1

can be written as

ˆ vec(A · M at(diag(J) ˆ x)AT ), Lx = Dx − cU x = d x − c · diag(J) | {z } ˆ Ax

ˆ where the operator denotes entry-wise product, D = Diag(d), and U = JˆA. The algorithm of computing U x is given in Lines 1-5 of Algorithm 1. Algorithm 1 To compute Lx 2

Iuput: x ∈ Rn

×1

Output: Lx 1:

U ← M at(x)

2:

U ← diagonal elements of U to be zero

3:

U ← AU AT

4:

U ← diagonal elements of U to be zero

5:

u ← vec(U )

6:

Lx ← d x − cu

I u = Ux

ˆ = vec(A · M at(diag(J) ˆ x)AT ), From above discussion, we know that Ax ˆ to be O(nm) where from which we can derive the complexity of computing Ax 210

m is the number of edges in G and n is the number of nodes in G. Therefore, computing U x requires time complexity O(nm), which indicates the complexity of computing Lx is O(nm) too. Therefore we can get that the time complexity of CG and BiCGstab algorithms is O(nm) per iteration.

14

3.2. The convergence rate of CG method on symmetric and positive definite 215

systems Let A be a symmetric positive definite matrix and x∗ be the exact solution of Ax = b, then the convergence rate of the CG method for solving the linear system can be estimated by the following inequality [15] kx∗ − xK kA ≤ 2

K √ κ−1 √ kx∗ − x0 kA κ+1

(9)

where xK is the result in the K-th iteration of the CG algorithm, A-norm of √ x is defined as kxkA = xT Ax and κ is the condition number of A which is defined as κ(A) = kAkkA−1 k. Note that for any p-norm, κ(A) = kAkkA−1 k ≥ kA · A−1 k = kIk = 1. If κ(A) is small (closed enough to 1), the system Ax = b is said well-conditioned; otherwise, it is said ill-conditioned. Inequality (9) can be written as kx∗ − xK kA ≤2 kx∗ − x0 kA

√ K  K κ−1 2 √ =2 1− √ ≤ κ+1 κ+1

where  is a given error bound for the result. Then, we have    2 . ln 1 − √ , K ≥ ln 2 κ+1

(10)

(11)

where dxe is the smallest integer bigger than x. In general, matrix L always has a large condition number, that means L is ill220

conditioned. In Inequality (11) and Figure 3, we can observe that if the condition number κ is large and a small error bound for SimRank scores is required, the iteration number K becomes large, which implies the slow convergence rate of the CG algorithm. On the contrary, if the condition number κ is small, the CG method will require few iterations to achieve the same accuracy. Therefore, we

225

will discuss how to reduce the condition number of the linear system in order to accelerate the convergence rate.

15

Figure 3: The iteration number K for a given condition number κ and error bound .

4. Preconditioners for SimRank Preconditioning techniques serve to reduce the condition number of the linear system, thus to improve the performance of the iteration. Using preconditioner 230

M , we can rewrite the linear system of Equation (8) as M −1 Ls = M −1 q, to make M −1 L with a smaller condition number than that of L.

16

Algorithm 2

Algorithm 3

Preconditioned CG method 1: s0 = vec(I)

Preconditioned BiCGstab method 1: s0 = vec(I)

2:

r0 = q − Ls0

2:

r0 = q − Ls0

3:

z0 = M −1 r0

3:

ˆ r = r0

4:

p0 = z0

4:

ρ 0 = α = ω0 = 1

5:

for K = 1, 2, 3, . . . do

5:

v0 = p0 = 0

6:

vK−1 = LpK−1

6:

for K = 1, 2, 3, . . . do

7:

T α = (rT K−1 zK−1 )/(pK−1 vK−1 )

7:

ρK = ˆ rT rK−1

8:

sK = sK−1 + αpK−1

8:

β = (ρK /ρK−1 )(α/ωK−1 )

9:

rK = rK−1 − αvK−1

9:

pK = rK−1 +β(pK−1 −ωK−1 vK−1 )

10:

zK = M −1 rK

10:

y = M −1 pK

11:

T β = (zT K vK−1 )/(pK−1 vK−1 )

11:

vK = Ly

12:

pK = zK − βpK−1

12:

α = ρK /(ˆ rT vK )

13:

end for

13:

x = rK−1 − αvK

14:

return sK

14:

z = M −1 x

15:

t = Lz

16:

ωK = (tT x)/(tT t)

17:

sK = sK−1 + αy + ωK z

18:

rK = x − ωK t

19:

end for

20:

return sK

In the linear system of SimRank, the preconditioner can be designed as approximating L as possible (i.e., M ≈ L) in order to reduce the iterations of the CG algorithm. However, it is also expensive to compute M −1 x for an 235

arbitrary vector x in preconditioned CG (PCG) or preconditioned BiCGstab (PBiCGstab) algorithm. Thus, there is a trade-off between choosing an efficient preconditioner and computing M −1 x. In the following, we will introduce two types of preconditioners.

17

4.1. Diagonal preconditioner In [13], Lu et al. proposed a simple preconditioner of diagonal matrix   L , if k = l kk Mkl =  0, otherwise 240

for solving SimRank, which is known as Jacobi preconditioner. It is efficient for the strictly diagonally dominant matrix L [13], without incurring extra storage. The existence of M −1 is proved by the following theorem. Theorem 1. The matrix M −1 exists if each node of G has non-zero in-degree. Proof. If we can prove that the diagonal entries of the diagonal matrix M are all non-zero, the existence of M −1 can be obtained obviously. From Equation (8), we have ˆ L = Diag(d) − cJˆA. From the assumption that all nodes of G have non-zero in-degrees, we know that all node-pairs of G2 also have non-zero in-degrees, i.e., dk ≥ 1. As we know that the entries of Jˆ and Aˆ are either 0 or 1, where Jˆ and Aˆ are as defined in Proposition 2 and Proposition 4, respectively. Then, we have ˆ diag(A) ˆ >0 diag(M ) = diag(L) = d − c · diag(J) where 0 is a column vector with all zero entries. Hence, the diagonal entries of

245

M are greater than zero, thus, matrix M −1 exists. In addition, we know that for any vector x, h i ˆ (diag(A) ⊗ diag(A)) M −1 x = x d − c · diag(J) where symbol denotes the entry-wise division as defined in [13]. Therefore, we can easily derive that the complexity of computing M −1 x is O(nm), i.e., PCG and PBiCGstab solvers incur O(Knm) time complexity, where K is the iteration number.

250

18

In Section 5, we will prove that the spectrum of M −1 L is located in the interval [1 − c, 1 + c] for undirected graphs. In next subsection, we will propose a better preconditioner to make the spectrum of the preconditioned matrix in a smaller interval [1 − cα+1 , 1 + cα+1 ] where α ∈ N. Thus, the convergence rate 255

of the CG solver can be improved. 4.2. Polynomial preconditioners In order to reduce the condition number of the linear system, thus speeding up the convergence rate of the CG algorithm, in this subsection we propose a new preconditioner that can well approximate L. In graph G, we suppose that all nodes have non-zero in-degrees. By Equation (8), we represent matrix L as L = D − cU = D(I − cD−1 U ) = D(I − B)

260

(12)

ˆ and B = cD−1 U . where D = Diag(d), U = JˆA, In particular, D−1 exists since all entries of d are non-zero (similar to Theorem 1). Therefore, L−1 = (I − B)−1 D−1 = (I + B + B 2 + · · · + B α + · · · )D−1 , α ∈ N. Let Pα−1 = (I + B + B 2 + · · · + B α )D−1 . We know that Pα−1 L is close to the identity matrix when α becomes large. Pα is called the polynomial preconditioner, α ∈ N.

Regarding the time complexity, we have Pα−1 x = (I + B + B 2 + · · · + B α )D−1 x = (I + · · · + B(I + B(I + B)) · · · ) D−1 x. | {z } α brackets of I+B(·) Let x ˆ = D−1 x, then Pα−1 x = x ˆ + B(ˆ x + B(ˆ x + · · · + B(ˆ x + Bˆ x)) · · · ) . | {z } α brackets of x ˆ+B(·)

19

Algorithm 4 To compute Pα−1 x 2

Iuput: x ∈ Rn

×1

Output: Pα−1 x 1:

x0 ← D−1 x

2:

x1 ← x0

3:

for j = 1, 2, . . . , α do

4:

xj ← x0 + cD−1 U xj−1

5:

end for

6:

Pα−1 x ← xα 2

265

Since D−1 ∈ Rn of x ˆ = D

−1

×n2

I Call Algorithm 1 to calculate U x, i.e., u

is a diagonal matrix, the computational complexity

2

x is O(n ). From Section 3, we know that the time complexity

of U x is O(nm). As we know that m must be larger than n, so we get the complexity of computing Bx is O(nm), where B is as given in Equation (12). Therefore, the time complexity of Pα−1 x is O(αnm), correspondingly, PCG or 270

PBiCGstab solver takes O(Kα αnm) time complexity, where Kα is the iteration number of the algorithm and α is the degree of the polynomial preconditioner. In particular, we will show that Kα decreases when α increases in Section 5.

5. Spectral properties of two preconditioned matrices According to Inequality (11), we know that the iteration number of the 275

CG algorithm depends on the condition number κ when the error bound  is given. For a symmetric matrix A, the condition number κ can be computed as κ(A) = kAk2 kA−1 k2 =

λmax (A) λmin (A) .

This formula indicates that eigenvalues of a

well-conditioned matrix are clustered together. Theorem 2. For an undirected graph G, given matrix M = Diag(diag(L)) 280

as the diagonal preconditioner, then all eigenvalues of M −1 L are located in the interval [1 − c, 1 + c], where c is the decay factor of the SimRank. ˆ = M −1 L, then Proof. L = Diag(d) − cJˆAˆ is as given in Equation (8). Let L

20

we get ˆ kk = L−1 Lkk = 1 L kk and X

ˆ kl | = |L

l6=k

X

L−1 kk |Lkl | =

l6=k

X (dk − cJˆkk Aˆkk )−1 (cJˆkk Aˆkl ). l6=k

From Proposition 2, we know that either Jˆkk = 0 or Jˆkk = 1. If Jˆkk = 0, X

ˆ kl | = 0. |L

l6=k

If Jˆkk = 1, X

ˆ kl | = |L

l6=k

X (dk − cAˆkk )−1 cAˆkl l6=k

= c(dk − cAˆkk )−1

X

Aˆkl

l6=k

! = c(dk − c(A ⊗ A)kk Jˆkk )−1

X

Aˆkl − Aˆkk

l −1

= c(dk − c(A ⊗ A)kk )

(dk − |IS (k)| − (A ⊗ A)kk )

≤ c(dk − c(A ⊗ A)kk )−1 (dk − (A ⊗ A)kk ) ≤ c. The last inequality follows from dk − (A ⊗ A)kk ≤ dk − c(A ⊗ A)kk . Thus, we ˆ kk = 1 and P |L ˆ kl | ≤ c for either case. get L l6=k

ˆ by applying the Gershgorin theorem [3] to L, ˆ the Let λ be an eigenvalue of L, eigenvalues lie within the union of discs     X ˆ kk | ≤ ˆ kl | , Dk = λ : |λ − L |L  

k = 1, . . . , n2 .

l6=k

ˆ are equal to 1, we can easily deduce We know that the diagonal entries of L ˆ lie within the disc with center (1, 0) and radius c. that all the eigenvalues of L ˆ = M −1 L is similar to a symmetric matrix M − 12 LM − 21 for an Further, L 285

ˆ are real and located in the interval undirected graph. Hence, all eigenvalues of L [1 − c, 1 + c]. 21

This implies that the eigenvalues of M −1 L are non-zero, so the matrix is non-singular. Now, we will further prove that eigenvalues of the polynomial preconditioned matrix are located in the smaller interval [1 − cα+1 , 1 + cα+1 ] 290

around 1 where α is the degree of the polynomial preconditioner. Theorem 3. For a graph G, given matrix B = cD−1 U as defined in Equation (12), then the spectral radius of B is ρ(B) ≤ c. Proof. As we know that B = cD−1 U and U = JˆAˆ is as given in Equation (12), we have 2

ρ(B) ≤ kBk∞ = max 2 1≤k≤n

n X

−1 ˆ ˆ |Bkl | where Bkl = cDkk Ukl = cd−1 k Jkk Akl .

l=1

For any k, we know that either Jˆkk = 0 or Jˆkk = 1 from Proposition 2. In case of Jˆkk = 0, X

|Bkl | = 0.

l

In case of Jˆkk = 1, X

|Bkl | =

l

X

ˆ |cd−1 k Akl |

l

= cd−1 k

X

|Aˆkl |

l

=

cd−1 k (dk

− |IS (k)|)

≤ cd−1 k dk = c. Thus, 2

kBk∞ = max 2 1≤k≤n

n X

|Bkl | ≤ c.

l=1

Therefore, we get ρ(B) ≤ kBk∞ ≤ c.

22

Theorem 4. For an undirected graph G, given matrix Pα as the polynomial preconditioner such that Pα−1 = (I + B + B 2 + · · · + B α )D−1 , where B = cD−1 U . Then the eigenvalues of Pα−1 L satisfy 1 − cα+1 ≤ λ(Pα−1 L) ≤ 1 + cα+1

for all

α ∈ N.

Proof. For α ∈ N, let Pˆ = Pα−1 L where L = D(I − B) is as given in Equation (12), we have Pˆ = (I + B + B 2 + · · · + B α )D−1 L = (I + B + B 2 + · · · + B α )(I − B) = I − B α+1 . ˆ so U is a symmetric matrix for an By Equation (12), we know that U = JˆA, 295

undirected graph and the non-symmetric matrix B = cD−1 U is similar to a 1

1

symmetric matrix cD− 2 U D− 2 . Therefore, all eigenvalues of B are real. By Theorem 3, we know that ρ(B) ≤ c, then we can deduce that |λ(B)| ≤ c. Thus, it is easy to get the eigenvalues of B α+1 are also real and |λ(B α+1 )| ≤ cα+1 . Hence, we have 1 − cα+1 ≤ λ(Pˆ ) ≤ 1 + cα+1 for all α ∈ N. Since c ∈ (0, 1), we have cα+1 → 0 when α → ∞. That means all eigenvalues of Pˆ are close to 1 when α becomes large. 300

By applying the claims of Theorem 2 and Theorem 4 to Equation (11), we have the following estimations for the iteration number of the CG algorithm with the diagonal and the polynomial preconditioners respectively,   ,   2  K= ln 1 − q ln 2  1+c   1−c + 1   Kα =  ln 2 

,



(13)



ln 1 − q

23

2 1+cα+1 1−cα+1

 .  +1 

(14)

It is clear that Kα is much smaller than K for the same error bound , which indicates the polynomial preconditioner makes the CG algorithm converge faster.

However, we should understand that to apply the polynomial preconditioner 305

is more costly than the diagonal preconditioner in each iteration. Theoretically, the time complexities of the CG method by using diagonal preconditioner and polynomial preconditioner are O(Knm) and O(Kα αnm) respectively. In the experimental section, we will compare the overall performance of the CG algorithm by using different value of α.

310

6. The case of nodes without in-neighbor In Subsection 2.1, the linear system is derived under the assumption that all nodes are with some in-neighbors, that excludes the case of I(i) = ∅ for some i ∈ V as defined in the original SimRank. In this section, we generalize the linear system of SimRank to encompass all the situations. For this sake,

315

we first discuss what happens in the linear system of Equation (8) when there exists some i ∈ V such that I(i) = ∅. Then, we discuss how to modify the linear system to fulfill the original definition of SimRank. After the modification, all the previously discussed techniques can have no change for the SimRank computation.

320

Theorem 5. Matrix L and vector q are as defined in Equation (8). If i ∈ V (a node of graph G) makes |I(i)| = 0, then Lk,? = 0T and qk = 0, where k = (i − 1)n + j is the index for node-pairs (i, j) and (j, i), j = 1, 2, . . . , n. In other words, the k-th row of matrix L and the k-th entry of q are zero. From Theorem 5, we know that L becomes a singular matrix if there exists

325

any node without any in-neighbor in G, such that it cannot be solved by using the proposed method. Obviously, M −1 = Diag(L)−1 and D−1 do not exist. Now, we use an example to show how to modify the linear system to encompass all the situations and, at the same time, with preconditioners.

24

6.1. An example of nodes without in-neighbor

Graph G2

Graph G

1

2

𝑛3 (1,3)

𝑛12 (3,4)

3

𝑛10 (3,2)

𝑛4 (1,4)

𝑛5 (2,1)

𝑛13 (4,1)

𝑛8 (2,4)

4 𝑛1 (1,1)

𝑛2 (1,2) 𝑛9 (3,1)

𝑛11 (3,3)

𝑛14 (4,2)

𝑛7 (2,3)

𝑛16 (4,4)

𝑛15 (4,3)

𝑛6 (2,2)

Figure 4: An example for a node without in-neighbor in graph G and its graph G2 .

330

As shown in Figure 4, node 3 of G has no non-zero in-degree, correspondingly in G2 node-pairs n3 = (1, 3), n7 = (2, 3), n9 = (3, 1), n10 = (3, 2), n11 = (3, 3), n12 = (3, 4) and n15 = (4, 3) have no non-zero in-degree too. By Theorem 5, we know that the 3-rd, 7-th, 9-th, 10-th, 11-th, 12-th and 15-th diagonal entries of L and those corresponding entries of q are equal to zero. In such situation, the linear system cannot be solved using the proposed algorithm. Therefore, we follow the original SimRank definition by Equation (1) to modify the linear system. We can get  1 0.16   0.16 1 S=   0 0  0 0.4

the SimRank matrix of Figure 4 as  0 0   0 0.4   1 0  0 1

h iT which can be written as a vector s = 1, 0.16, 0, 0, 0.16, 1, 0, 0.4, 0, 0, 1, 0, 0, 0.4, 0, 1 . We know from the SimRank definition that s11 = 1 and the 3-rd, 7-th, 9-th, 335

10-th, 12-th and 15-th entries of s are zero. Now we modify L by replacing the zero value Lkk = 0, k ∈ {3, 7, 9, 10, 11, 12, 15}, by Lkk = 1. We also replace 25

q11 = 0 by q11 = 1. Now the revised L is no longer singular and the solution of Ls = q is just the SimRank scores under the definition. However, we know 2

that L ∈ Rn 340

×n2

is a large matrix, it is expensive to search the zero entries in

the diagonal vector of L one by one for the modification. In Theorem 5, it is implicit that if dk (cf. Proposition 3 for d) is zero, Lkk must be zero. Therefore, we only need to consider which entries of d are equal to zero.

345

  0 1 0 0     0 0 1 1 ,  For the graph G in Figure 4, the backward adjacency matrix A =   0 0 0 0   0 0 1 0 h iT the row sum of A is g = 1, 2, 0, 1 where gi = |I(i)| for all i ∈ V . In Proposi  1 2 0 1       2 4 0 2 , then tion 3, we mentioned that d = vec(ggT ). Let H = ggT =    0 0 0 0   1 2 0 1 the elements of H are corresponding to the in-degrees of node-pair in graph G2 by row-wise. It can be easily observed that the zero entries in g lead to zero of the corresponding entries in d.

350

In such a way, by only searching the zero entries of g, i.e., g3 , we can easily know that H3,? , H?,3 are 0, thus we modify them to 1. On the other hand, there is a simple way to find out the entry q11 (s11 = 1 where n11 = (3, 3) ∈ VS ) that must be modify to 1. Since the node-pair nk in G2 can be indexed by k = (i − 1)n + j, the 11-th entry of q can be obtained by the searching result in

355

the zero entry of g. After this handling, the solution of the new equation is the SimRank scores under the original definition. 6.2. Algorithms of the linear system modification In general, if node i has no in-neighbor in graph G, we can modify the zero

360

entries of d to 1 and only let qr = 1 when nr = (i, i), in order to satisfy the

26

definition of SimRank that is Sii = 1, Sij = 0 or Sji = 0 for j 6= i. The algorithms of the linear system modification are given as follows. Algorithm 5 Modification of the left-hand side of the linear system Iuput: A, the backward adjacency matrix of the graph G Output: new Lx and new Pα−1 x

2:

for i = 1, 2, . . . , n do n P gi ← Aij

3:

end for

4:

i ← Find the indices of zero entries in g

5:

H ← ggT

6:

Hi,? ← 1

7:

H?,i ← 1

8:

d = vec(H)

9:

Call Algorithm 1 with new d to return Lx

1:

j=1

10:

I new d

Call Algorithm 4 with new d to return Pα−1 x

Algorithm 6 Modification of the right-hand side of the linear system Iuput: i from Algorithm 5, q, as defined in Equation (8) Output: new q 1:

r = (i − 1)n + i

2:

qr = 1

3:

return q

I nr = (i, i)

I new q

After above modifications, matrix L is non-singular and matrices M −1 , D−1 exist for all situations, that means preconditioners M and Pα are well-defined. 365

In particular, the properties of the linear system are the same as the original one and all of our discussions can be used for the new linear system.

27

7. Experiments In this section, we evaluate our algorithm in terms of precision, normalized discounted cumulative gain (NDCG) [5] and efficiency. 370

7.1. Experiment setup All experiments are conducted on a PC with Intel(R) Core(TM) i7-4790 3.60GHz CPU, 32GB RAM, and the algorithms are implemented by using MATLAB R2014a. 7.1.1. Comparison with the state-of-the-art algorithms

375

In our experiments, we compare three methods for solving the all-pairs SimRank scores as below: 1. P is the proposed algorithm in this paper, which uses a polynomial preconditioner for solving the linear system of SimRank scores. It takes O(Kα αnm) time complexity, where Kα is the iteration number of the

380

polynomial preconditioned linear system, α is the degree of the polynomial power, n denotes the number of nodes in the graph, and m denotes the number of edges in the graph. 2. M is the algorithm proposed by Lu et al. [13], which uses a diagonal preconditioner for solving the linear system of SimRank scores. It takes

385

O(Knm) time complexity, where K is the iteration number of diagonal preconditioned linear system. 3. READS method is one of the stochastic techniques to obtain the singlesource SimRank scores [8]. It takes O(nr) index time and O(nr+m) query time for single-source SimRank computation, where r is a simulation num-

390

ber. Moreover, Jiang et al. proposed that their algorithm also supports all-pairs SimRank solving. Therefore, the READS method incurs O(nr) index time and O(n2 r + nm) query time for solving all-pairs SimRank scores.

28

7.1.2. Experimental environment and parameter settings 395

We set the decay factor c = 0.8 in our experiments. 1. Preconditioners: PCG (for undirected graph) or PBiCGstab (for directed graph) algorithm is used for solving the linear system. The initial SimRank scores are set as s0 = vec(I), and the error norm is computed by  = n2 P ksK −sk1 1 K = |sK is the SimRank scores in the K2 2 l − sl | where s n n l=1

400

th iteration, and s is the ground truth of SimRank scores. We compute the ground truth of SimRank by the original SimRank equation with a large (e.g. 100 times) iterations. In the experiments, we utilize the sparse matrix to store the backward adjacency matrix A in MATLAB. Different preconditioners (M, P1 and P3 that were defined as in Subsections 4.1 and

405

4.2) are used to compute the SimRank score on datasets. 2. The static version of READS: The termination of this algorithm is controlled by the length of the random walk and the number of simulations. To make it comparable with the error-bound controlled algorithms, we set t = 10 as the maximum length of a random walk, and rmax = 100 as the

410

maximum number of simulations. 7.1.3. Datasets As shown in Table 2, four datasets of directed graphs and three datasets of undirected graphs are used in the experiments, where n is the number of nodes and m is the number of edges in each graph. Those datasets are crawled from

415

two websites5 . 5 Euroroad,

Human protein (Stelzl), Reactome and OpenFlights are available at

http://konect.uni-koblenz.de/, and ca-GrQc, p2p-Gnutella04 and ca-HepPh are available at http://snap.stanford.edu/data/index.html.

29

Table 2: Main information of datasets.

Dataset

n

m

Direction

Euroroad

1,174

1,417

Undirected

Human protein (Stelzl)

1,706

6,207

Directed

OpenFlights

2,939

30,501

Directed

ca-GrQc

5,242

14,490

Undirected

Reactome

6,327

147,547

Directed

p2p-Gnutella04

10,876

39,994

Directed

ca-HepPh

12,008

118,505

Undirected

7.1.4. Evaluation measures We use precision and NDCG to measure the accuracy (correctness) of the compared algorithms and those evaluation measures are defined as follows: 1. The precision at k (P@k) is defined as: P recision =

|{top-k of s} ∩ {top-k of ˜s}| k

where s is the ground truth of SimRank and ˜s is the approximated Sim420

Rank. 2. The normalized discounted cumulative gain (NDCG) [6] is one of the popular measures for evaluating the rank of the results. NDCG at position k is defined as: N DCGk =

DCGk IDCGk

and DCGk =

k X i=1

reli log2 (1 + i)

where IDCGk is the ideal DCG at position k and reli is the graded relevance of the result at position i. 7.2. Experimental results We compare our preconditioners with others algorithms and evaluate the 425

accuracy and efficiency of our proposed algorithm. 30

7.2.1. Effectiveness In order to measure the effectiveness of the SimRank scores, we set the iteration number and simulation number to be the same, i.e., Kα = K = 4 (for PCG and PBiCGstab) and r = 4 (for READS). Table 3: Evaluation measures for top-800 querying precision and N DCG800 .

P3

Dataset

430

P1

M

READS

P@k

NDCG

P@k

NDCG

P@k

NDCG

P@k

NDCG

Euroroad

1

1

0.9975

1

0.9800

0.9994

0.7988

0.9105

H. protein

1

1

1

1

1

1

1

1

OpenFlights

1

1

1

1

0.9980

0.9761

1

1

ca-GrQc

0.9938

0.9997

0.9888

0.9994

0.9775

0.9980

0.9400

0.9964

Reactome

1

1

1

1

1

1

0.9860

0.9400

p2p-Gnutella04

1

1

1

1

1

1

0.9960

0.9696

ca-HepPh

1

1

1

1

1

1

1

1

From Table 3, we know that the polynomial preconditioner (P1 or P3 ) can always keep the higher scores than the diagonal preconditioner (M ) and READ in terms of both precision and NDCG, that indicates the effectiveness of the polynomial preconditioner in the SimRank solving. 7.2.2. Efficiency

435

In order to know how fast the algorithm to converge, we conduct three experiments: (1) given the same number of iterations we compare errors generated by those algorithms, (2) given the same error bound we compare the running time of those algorithms, (3) setting different error bound to compare iteration number and running time for one of datasets.

31

Table 4: Average error  when the iteration number is 4.

Dataset

440

P3

P1

M

READS

Euroroad

2.79E-06

3.15E-05

2.70E-04

1.10E-03

H. protein

2.06E-07

6.85E-07

2.10E-04

5.00E-03

OpenFlights

6.84E-09

7.93E-07

1.74E-04

3.80E-03

ca-GrQc

2.26E-06

2.23E-05

4.45E-04

6.48E-04

Reactome

1.51E-11

1.41E-06

6.60E-05

1.50E-03

p2p-Gnutella04

3.33E-08

1.81E-05

1.89E-04

9.02E-04

ca-HepPh

1.59E-06

2.11E-05

1.28E-04

5.56E-04

Table 4 shows the first objective of the testing, where READS algorithm generates 5.6 × 10−4 to 5.0 × 10−3 average error, the diagonal preconditioner M generates 6.6 × 10−5 to 4.5 × 10−4 , and our proposed polynomial preconditioner, P1 and P3 , can achieve the smallest error 1.5×10−11 to 3.2×10−5 . It means using polynomial preconditioners for solving the linear system can generate less error

445

than using diagonal preconditioner or READS algorithm when iteration number is fixed. However, different algorithms may use different time for each iteration. Thus, iteration number may not tell the execution time of the algorithm. Different to our proposed algorithms which can be terminated by a given error bound, READS can only be terminated by a simulation number r. To

450

make those algorithms comparable in terms of the second objective, we first set r = 100 and run READS on datasets Euroroad, ca-GrQc, Reactome, p2pGnutella04, and ca-HepPh, which achieves the average error bound  ≤ 1×10−3 . Then, we execute the proposed preconditioner algorithms on those datasets to reach this error bound. Table 5 shows the results. It is clear that the polynomial

455

preconditioner algorithms P1 and P3 require much less time than READS does. Since it is too expensive for READS to reach error bound  = 1 × 10−8 , we only conduct experiments with this error bound for the preconditioner based algorithms M , P1 , and P3 , and the results are presented in Table 6, where we

32

find P1 is the best among these three preconditioner algorithms in terms of 460

running time, even though P3 requires the least iteration number. Table 5: Experimental results for  = 1 × 10−3 .

P3

Dataset

P1

READS

K3

Time (in sec.)

K1

Time (in sec.)

r

Time (in sec.)

Euroroad

1

0.3707

1

0.2098

100

1.4536

ca-GrQc

1

9.1871

1

4.9464

100

144.9055

Reactome

1

20.2178

2

22.4352

100

9.00E+03

p2p-Gnutella04

1

30.8254

1

19.0387

100

122.3767

ca-HepPh

1

104.5943

1

51.935

100

4.66E+04

Table 6: Experimental results for  = 1.0 × 10−8 .

P3

Dataset

P1

M

K3

Time (in sec.)

K1

Time (in sec.)

K

Time (in sec.)

Euroroad

7

1.5988

11

1.3932

20

1.7648

H. protein

5

4.2651

6

3.1526

12

4.7006

OpenFlights

4

11.5226

6

10.2018

13

15.4716

ca-GrQc

7

37.2828

10

29.0245

19

38.6614

Reactome

3

58.5855

5

53.3944

9

64.4446

p2p-Gnutella04

5

148.8905

7

128.6511

15

210.8746

ca-HepPh

7

407.5188

10

298.6798

18

306.8895

As we know that READS is a good method for computing single-source. The experimental results show that the preconditioner algorithms are more suitable than READS for solving the all-pairs SimRank problem. Further, the proposed polynomial preconditioner algorithms can reduce at least half of iteration num465

ber than the diagonal preconditioner requires. 33

To know how the running time changes when the error bound decreases, the third objective allows the error bound  to change from 5.2 × 10−3 to 1.0 × 10−6 . The experimental results on dataset of Human protein (Stelzl) are presented in Figure 5. The results on other datasets are much similar. 102 102

P

P

1

Running time (in seconds)

Iteration number

M READS

101

1

M READS

101

100

100 5x10-3

4x10-3

3x10-3

2x10-3

1x10-3

1x10-6

5x10-3

Average error

4x10-3

3x10-3

2x10-3

1x10-3

1x10-6

Average error

(a) Average error vs iteration number.

(b) Average error vs Running time.

Figure 5: The experimental results on the convergence rate of different algorithms over the dataset of Human protein (Stelzl).

470

The result shows the polynomial preconditioner algorithm P1 can always require the least iteration number, while READS significantly increases the iteration number when the error bound becomes small. Figure 5b shows the running time of the three algorithms via the error bound change, where the polynomial preconditioner with degree 1 (P1 ) always requires the least running

475

time to reach the same accuracy.

8. Conclusion In this paper, we restudy the CG algorithm for SimRank computation. We observed that the diagonal preconditioner used in the existing CG algorithm still left some space for improvement. Thus, we proposed to use polynomial 480

preconditioners for the CG algorithm in order to reduce the required iteration number and execution time. Moreover, we proved that the linear system of

34

polynomial preconditioners has a smaller spectrum set than that of the diagonal preconditioner. This means a fast convergence rate of the CG algorithm can be achieved by using polynomial preconditioners. Furthermore, we extended 485

the linear system to encompass all situations of the graph, and all the proposed techniques can be applied to the new linear system. The experimental results show that the polynomial preconditioners are effective to speed up the computation of SimRank.

Acknowledgements 490

The research of Siu-Long Lei is funded by Research Committee of University of Macau (MYRG2018-00025-FST). The research of Zhiguo Gong is funded by The Science and Technology Development Fund, Macau SAR (FDCT/116/2013/A3, FDCT/007/2016/AFJ), Research Committee of University of Macau (MYRG201500070-FST, MYRG2017-00212-FST).

495

References References [1] I. Antonellis, H. G. Molina, C. C. Chang, Simrank++: query rewriting through link analysis of the click graph, Proceedings of the VLDB Endowment 1(1), 408-421, 2008.

500

[2] A. A. Bencz´ ur, K. Csalog´any, T. Sarl´os, Link-based similarity search to fight web spam, in AIRWEB, Citeseer, 2006. [3] D. G. Feingold, R. S. Varga, Block diagonally dominant matrices and generalizations of the Gerschgorin circle theorem, Pacific Journal of Mathematics 12(4), 1241-1250, 1962.

505

[4] D. Fogaras, B. R´ acz, Scaling link-based similarity search, Proceedings of the 14th international conference on World Wide Web, 641-650, 2005. doi:10.1145/1060745.1060839. 35

[5] K. J¨ arvelin, J. Kek¨ al¨ ainen, IR evaluation methods for retrieving highly relevant documents, In Proceedings of the 23rd annual international ACM 510

SIGIR conference on Research and development in information retrieval, pages 41–48. ACM, 2000. [6] K. J¨ arvelin, J. Kek¨ al¨ ainen, Cumulated gain-based evaluation of IR techniques, ACM Transactions on Information Systems (TOIS), 20(4):422–446, 2002.

515

[7] G. Jeh, J. Widom, SimRank: A Measure of Structural-Context Similarity, Proceedings of the eighth ACM SIGKDD international conference on knowledge discovery and data mining, 538-543, 2002. doi: 10.1145/775047.775126. [8] M. Jiang, A. W. C. Fu, R. C. W. Wong, READS: a random walk approach

520

for efficient and accurate dynamic SimRank, Proceedings of the VLDB Endowment 10(9), 937-948, 2017. [9] C. Li, J. Han, G. He, X. Jin, Y. Sun, Y. Yu, T. Wu, Fast computation of SimRank for static and dynamic information networks, Proceedings of the 13th International Conference on Extending Database Technology, 465-476,

525

2010. doi: 10.1145/1739041.1739098. [10] Z. Li, Y. Fang, Q. Liu, J. Cheng, R. Cheng, J. Lui, Walking in the cloud: Parallel SimRank at scale, Proceedings of the VLDB Endowment 9(1), 24-135, 2015. doi: 10.14778/2850469.2850472. [11] D. Lizorkin, P. Velikhov, M. Grinev, D. Turdakov, Accuracy Estimate and

530

Optimization Techniques for SimRank Computation, Proceedings of the VLDB Endowment 1(1), 422-433, 2008. doi: 10.14778/1453856.1453904. [12] D. Lizorkin, P. Velikhov, M. Grinev, D. Turdakov, Accuracy estimate and optimization techniques for SimRank computation, The VLDB Journal — The International Journal on Very Large Data Bases 19(1), 45-66, 2010.

535

doi: 10.1007/s00778-009-0168-8. 36

[13] J. Lu, Z. Gong, X. Lin, A novel and fast SimRank algorithm, IEEE transactions on knowledge and data engineering 29(3), 572-585, 2017. doi: 10.1109/TKDE.2016.2626282. [14] S. Qiao, T. Li, H. Li, Y. Zhu, J. Peng, J. Qiu, SimRank: A Page Rank 540

approach based on similarity measure, 2010 IEEE International Conference on Intelligent Systems and Knowledge Engineering (ISKE), 390-395, 2010. [15] Y. Saad, Iterative methods for sparse linear systems, SIAM, Vol. 82, 2003. [16] C. Scheible, F. Laws, L. Michelbacher, H. Sch¨ utze, Sentiment translation through multi-edge graphs, Proceedings of the 23rd International Confer-

545

ence on Computational Linguistics: Posters, 1104-1112, 2010. [17] Y. Shao, B. Cui, L. Chen, M. Liu, X. Xie, An efficient similarity search framework for SimRank over large dynamic graphs, Proceedings of the VLDB Endowment 8(8), 838-849, 2015. [18] J. Song, X. Luo, J. Gao, C. Zhou, H. Wei, J.X. Yu, UniWalk: unidirectional

550

random walk based scalable SimRank computation over large graph, IEEE Transactions on Knowledge and Data Engineering 30(5), 992-1006, 2018. [19] B. Tian, X. Xiao, Sling: A near-optimal index structure for SimRank, Proceedings of the 2016 International Conference on Management of Data, 1859-1874, 2016.

555

[20] Y. Wang, L. Chen, Y. Che, Q. Luo, Accelerating pairwise SimRank estimation over static and dynamic graphs, The VLDB Journal—The International Journal on Very Large Data Bases 28(1), 99-122, 2019. [21] W. Xi, E. A. Fox, W. Fan, B. Zhang, Z. Chen, J. Yan, D. Zhuang, SimFusion: Measuring similarity using unified relationship matrix, Pro-

560

ceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval, 130-137, 2005. doi: 10.1145/1076034.1076059.

37

[22] W. Yu, W. Zhang, X. Lin, Q. Zhang, J. Le, A space and time efficient algorithm for SimRank computation, World Wide Web 15(3), 327-353, 2012. 565

doi: 10.1007/s11280-010-0100-6. [23] W. Yu, X. Lin, W. Zhang, Towards efficient SimRank computation on large networks, 2013 IEEE 29th International Conference on Data Engineering (ICDE), 601-612, 2013. doi: 10.1109/ICDE.2013.6544859. [24] W. Yu, X. Lin, W. Zhang, J. A. McCann, Fast all-pairs SimRank

570

assessment on large graphs and bipartite domains, IEEE Transactions on Knowledge and Data Engineering 27(7), 1810-1823, 2015. doi: 10.1109/TKDE.2014.2339828. [25] W. Yu, J. A. McCann, Efficient partial-pairs simrank search on large networks, Proceedings of the VLDB Endowment 8(5), 569-580, 2015.

575

[26] Y. Zhou, H. Cheng, J. X. Yu, Graph clustering based on structural/attribute similarities, Proceedings of the VLDB Endowment 2(1), 718-729, 2009.

38