Journal Pre-proof

Speeding up SimRank computations by polynomial preconditioners
Sio Wan Ng, Siu-Long Lei, Juan Lu, Zhiguo Gong

PII: S0168-9274(20)30045-3
DOI: https://doi.org/10.1016/j.apnum.2020.02.009
Reference: APNUM 3774
To appear in: Applied Numerical Mathematics
Received date: 9 October 2019
Revised date: 3 January 2020
Accepted date: 10 February 2020

Please cite this article as: S.W. Ng et al., Speeding up SimRank computations by polynomial preconditioners, Appl. Numer. Math. (2020), doi: https://doi.org/10.1016/j.apnum.2020.02.009.
Speeding Up SimRank Computations by Polynomial Preconditioners

Sio Wan Ng^a, Siu-Long Lei^b,*, Juan Lu^c, Zhiguo Gong^a

^a State Key Laboratory of Internet of Things for Smart City and Department of Computer and Information Science, University of Macau, Macau, People's Republic of China
^b Department of Mathematics, University of Macau, Macau, China
^c Beijing Institute of Petrochemical Technology, Beijing, China
Abstract

SimRank is widely used for evaluating similarities between nodes in a graph. Recent work has transformed the SimRank problem into a linear system that can be solved by the conjugate gradient (CG) algorithm, but the diagonal preconditioner used in that algorithm still leaves room for improvement, and no work has shown what an optimal preconditioner for the linear system should be. For this reason, in this paper we theoretically study the preconditioning problem and propose a polynomial preconditioner which improves the computation speed significantly. In our investigation, we also modify the original linear system to cover all situations of the graph. We further prove that the condition number of the polynomial preconditioned matrix is bounded within a narrow interval, which implies that the new preconditioner effectively accelerates the convergence rate of the CG algorithm. The experimental results show that the proposed algorithm outperforms the state-of-the-art algorithms for all-pairs SimRank computation.

Keywords: SimRank, graph, matrix, linear system, polynomial preconditioner
* Corresponding author. Email address: [email protected] (Siu-Long Lei)

Preprint submitted to Applied Numerical Mathematics, February 12, 2020
1. Introduction

Graphs are among the most powerful tools for modeling objects and their complex relationships in various domains, such as the World Wide Web, social networks, and knowledge graphs. For example, in a social network users are modeled as nodes and their friendships are specified as edges of the graph. As another example, the World Wide Web can be modeled as a simple directed graph, where nodes represent websites and edges represent hyperlinks. Once the data are modeled as a graph, one interesting problem is how to measure the similarities among those objects (nodes of the graph). Though several other techniques have also been proposed, SimRank is regarded as one of the most effective measurements and has been successfully exploited in various applications, such as web page ranking [14], collaborative filtering [1], web spam detection [2], network clustering [26], and natural language processing [16].

The basic intuition of SimRank is that "two objects are similar if they are referenced by similar objects, and an object is most similar to itself". In Figure 1, a simple graph shows that node 2 and node 4 are similar because they are referenced by the common neighbor node 3. SimRank naturally implies a mutual reinforcement, iteratively updating the similarity score of a node pair according to the similarity scores of all their in-neighbor pairs.
Figure 1: A simple graph G and its SimRank scores, where

    S = [ 1     0.16  0    0
          0.16  1     0    0.4
          0     0     1    0
          0     0.4   0    1 ].
With respect to different computing scopes, SimRank techniques can be separated into single-pair (given two nodes, return the SimRank score between these two nodes) [10], single-source (given one node, return SimRank scores of node pairs containing the given node) [8], top-k (given one node, return the top-k SimRank scores of node pairs containing this node) [17], partial-pairs (given two sets of nodes, return the SimRank scores of all node pairs crossing these two sets) [25], and all-pairs algorithms (return the SimRank scores of all node pairs in the graph) [9]. All-pairs SimRank techniques can conveniently serve scores for different requirements when the data resource is shared by multiple users. Hence, we concentrate on all-pairs SimRank computation in this paper.

Generally speaking, the computation algorithms for all-pairs, single-source, and single-pair SimRank are often tackled separately because the techniques they exploit are quite different. Further, all-pairs SimRank is the most fundamental one: once all-pairs SimRank scores are computed and stored, any other kind of SimRank query can be easily answered through the indices.
1.1. SimRank Model

In 2002, G. Jeh and J. Widom proposed the SimRank model based on the link structure of the web to evaluate the similarity between any two nodes (objects) in a graph [7]. In detail, given a directed and unweighted graph G = (V, E), where V is the set of nodes and E is the set of directed links between nodes in G, for any node i ∈ V, I(i) = {j | (j, i) ∈ E, j ∈ V} denotes the set of in-neighbors of i and |I(i)| is the in-degree of i. The SimRank score S_{ij} ∈ [0, 1] between nodes i and j in G is defined by

    S_{ij} = 1,                                                    if i = j,
    S_{ij} = (c / (|I(i)||I(j)|)) Σ_{a ∈ I(i)} Σ_{b ∈ I(j)} S_{ab}, if i ≠ j, I(i) ≠ ∅ and I(j) ≠ ∅,    (1)
    S_{ij} = 0,                                                    if i ≠ j, I(i) = ∅ or I(j) = ∅,

where c ∈ (0, 1) is called a decay factor, usually set between 0.6 and 0.8.
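Definition (1) can be evaluated directly by fixed-point iteration; the sketch below is our own illustration (not the paper's implementation) and is suitable only for small graphs:

```python
import numpy as np

def simrank_naive(adj, c=0.6, iters=20):
    """Fixed-point iteration of Equation (1); adj[j][i] == 1 means edge (j, i)."""
    n = len(adj)
    in_nbrs = [[j for j in range(n) if adj[j][i]] for i in range(n)]
    S = np.eye(n)
    for _ in range(iters):
        T = np.eye(n)                    # S_ii = 1 stays fixed
        for i in range(n):
            for j in range(n):
                if i != j and in_nbrs[i] and in_nbrs[j]:
                    T[i, j] = c * sum(S[a, b] for a in in_nbrs[i]
                                      for b in in_nbrs[j]) \
                              / (len(in_nbrs[i]) * len(in_nbrs[j]))
        S = T                            # pairs with an empty in-neighbor set stay 0
    return S

# two nodes referenced by the same single neighbor get score exactly c
adj = np.zeros((3, 3))
adj[0, 1] = adj[0, 2] = 1                # edges (1,2) and (1,3)
S = simrank_naive(adj, c=0.6)
```

Each iteration updates every pair from all in-neighbor pairs, which is exactly the mutual reinforcement described above; later sections replace this O(n^4)-per-iteration scheme with a linear-system formulation.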
1.2. Existing techniques for SimRank computation

The techniques for SimRank can be classified into two broad categories: stochastic techniques and deterministic techniques. Stochastic techniques compute SimRank scores by sampling, e.g., Monte Carlo random walk algorithms, and are therefore more scalable to large datasets. The related works are reviewed in the following.

1.2.1. Stochastic techniques

Work [7] interprets SimRank computation as an expected meeting distance of two random walks starting from each of the nodes simultaneously. In detail, the SimRank score between two nodes i and j can be calculated as the expected value of a function of their meeting hops,

    S_{ij} = E[c^{τ(i,j)}],

where τ(i, j) is the number of hops for i and j to meet along their random walks starting from each of them, and c is a decay factor such that 0 < c < 1.

To compute single-pair SimRank scores [4, 19, 20], Fogaras and Rácz [4] proposed to use the Monte Carlo technique as an approximation algorithm; this technique incurs O(n^2 m)^1 time and memory cost for all-pairs SimRank computation.

^1 n and m denote the number of nodes and edges in the graph, respectively.

Recently, work [20] presented three algorithms using different sample reduction strategies (BLPMC, FLPMC and FBLPMC). For all-pairs SimRank computation, FBLPMC, which combines the strategies of BLPMC and FLPMC, may take O(m^2 |S_BLP| + n^2 N_MC (1 + 1/(1−c)))^2 time, including index time and Monte Carlo sampling time.

^2 S_BLP is the sequence of node pairs performed by BLPMC, N_MC is the number of samples in Monte Carlo sampling, and c is the decay factor in Equation (1).

For single-source SimRank queries, works [19, 8] proposed index schemes based on the random walk method. The algorithm SLING [19] requires O(n/ε) space and O(m/ε + n log(n/δ)/ε^2)^3 pre-processing time.

^3 ε is an additive error bound and δ is the failure probability of the pre-processing algorithm.

Moreover, the complexity of
the all-pairs calculation is O(n^2/ε). To improve the performance, Jiang et al. [8] further proposed an algorithm called READS, and they claim that their algorithm supports all-pairs SimRank queries with both time and space complexity O(n^2 r + nm)^4.

^4 r is the length of the random walks.

Song et al. [18] introduced a rectified unidirectional random walk method, called UniWalk, for top-k SimRank over a large undirected graph. For the single-source SimRank task its time cost is also nearly linear in the number of source vertices, but the convergence rate of UniWalk is not better than that of READS in the same situation. Though most random walk techniques can improve the performance of computing SimRank scores, one common limitation of such techniques is their randomly generated errors, which may make them inapplicable to some value-sensitive applications.

1.2.2. Deterministic techniques

On the other hand, Jeh and Widom [7] also gave a naive algorithm for SimRank computation that converges to a fixed point by iteration. Deterministic techniques can compute the scores with consistently converging accuracy, so the results can be used in more applications. Unfortunately, such techniques may incur expensive computational complexity (e.g., O(Kn^4) in the worst case [21], where K is the iteration number). Thus, Lizorkin et al. [11, 12] proposed an iterative SimRank algorithm with three optimization methods for speeding up the computation from O(Kn^4) to O(K · min{nm, n^3/log_2 n}); the algorithm requires K = ⌈log_c ε⌉ − 1 iterations for the error bound ε. Works [23, 24] observed that computing partial sums over different in-neighbor sets [11, 12] may involve severe redundancy. To make use of this property, Yu et al. optimized the sub-summations among different partial sums by using an optimal in-neighbors partitioning (OIP) to reduce the computational complexity of SimRank from O(Kn^2 d) to O(Kn^2 d'), where d = m/n is the average in-degree in the graph and d' ≤ d. Despite those achievements, the convergence speed of those fixed-point iteration methods is still slow, especially when we require a
high accuracy (i.e., small ε).

To tackle the problem, some works [9, 22] proposed methods using the Kronecker product and the vectorization operator. Li et al. [9] developed three efficient algorithms based on singular value decomposition (SVD) for estimating similarities on static and dynamic information networks with time complexity O(k^4 n^2), where k is the rank of the adjacency matrix of the graph. Yu et al. [22] proposed a compressed storage scheme for sparse graphs and a fast matrix multiplication for dense graphs to accelerate the convergence rate of their algorithm. They reduced the memory from O(n^2) to O(m + n) and the time complexity from O(Kn^3) to O(K · min{nm, n^γ}) in the worst case, where γ ≤ log_2 7. They also showed that their algorithms can speed up the convergence rate to roughly twice that of the technique proposed in [11, 12].

To further reduce the time complexity, Lu et al. [13] first transformed the SimRank problem into a linear system for computing all-pairs SimRank scores, and solved the problem by using the CG algorithm for undirected graphs and the Biconjugate Gradient Stabilized (BiCGstab) algorithm for directed graphs. The CG algorithm can significantly reduce the iteration number K of the algorithm. In their algorithm, the iteration number K can be determined by

    K = (⌈log_c ε⌉ − 1) / log_c((√κ − 1)/(√κ + 1)),

where ε is the given error bound of SimRank and κ is the condition number of the coefficient matrix of the linear system. They further proposed to utilize a diagonal preconditioner to reduce the condition number κ, which decreases the iteration number K of the CG algorithm.

Even though CG is regarded as one of the most powerful algorithms for solving symmetric positive definite linear systems, there is no work yet providing a thorough discussion of the room for improving preconditioners for SimRank computations. In this paper, we are interested in two questions: (1) Is there a more efficient preconditioner for SimRank? and (2) How can the efficiency of preconditioners for SimRank computations be theoretically guaranteed?
1.3. Contributions

As discussed above, the convergence rate of the CG algorithm for SimRank rises if the coefficient matrix of the linear system is well-conditioned. However, solving the linear system with a well-approximating preconditioner also incurs an expensive computational cost. To save this cost, work [13] only proposed a simple diagonal preconditioner for the CG algorithm. In this paper, we thoroughly investigate the problem and propose a polynomial preconditioner to reduce the condition number of the coefficient matrix, thus improving the speed of the SimRank computation. We theoretically prove the spectral properties of the polynomial preconditioned matrix and present the trade-off between the solving cost and the effectiveness of the preconditioner. The main contributions are listed as follows:

1. We propose a polynomial preconditioner for the SimRank linear system, and prove that the eigenvalues of the preconditioned matrix lie in [1 − c^{α+1}, 1 + c^{α+1}], where α is the degree of the polynomial preconditioner and c is the decay factor used in SimRank. This indicates that the condition number κ of the preconditioned matrix tends to 1 as α becomes large. Therefore, the polynomial preconditioner makes the CG algorithm converge faster, but may also bring more cost for solving the preconditioned linear system in each iteration. We investigate this trade-off.

2. For directed graphs, there may exist some nodes without in-neighbors (i.e., I(i) = ∅ or I(j) = ∅ in Equation (1)), and the existing works often ignore this situation for simplicity. To make the problem more complete, we extend the linear system to include this situation.

3. We compare our approach with the state-of-the-art all-pairs SimRank algorithms, diagonally preconditioned CG [13] and the Monte Carlo based algorithm READS (the static version) [8], on several real datasets. The experimental results confirm that the proposed polynomial preconditioners are more efficient for solving all-pairs SimRank.
This paper is organized as follows. We first review the linear system of SimRank in Section 2. The time complexity of the CG and BiCGstab algorithms for solving for the SimRank scores is discussed in Section 3. We then introduce the diagonal and polynomial preconditioners in Section 4. In Section 5, we prove the spectral properties and further discuss the iteration number of the methods theoretically. Section 6 extends the SimRank linear system to include the situations where some nodes may not have in-neighbors. In Section 7, we report the experimental results on some real networks, and we conclude the paper in Section 8.
2. Preliminary

In this section, we introduce some notations and basic concepts which will be used in the later parts of the paper. We also present the framework of the linear system for SimRank. The basic symbols are presented in Table 1, and detailed discussions can be found in work [13].
Table 1: The main symbols in this paper.

    Symbol        Definition
    G = (V, E)    The given graph G
    n = |V|       The number of nodes in G
    m = |E|       The number of edges in G
    S_ij          The SimRank score between nodes i and j
    I(i)          The set of in-neighbors of node i
    |I(i)|        The in-degree of node i
    c             The decay factor in the definition of SimRank
    K             The number of iterations
    κ             The condition number
    A             The backward adjacency matrix of G
    Â             The partial backward adjacency matrix of G^2
    A^T           The transpose of matrix A
    I             Identity matrix
    J             All-ones matrix
    0             A column vector with all entries zero
    e_k           The k-th column of the identity matrix
    ⊗             Kronecker product
    ⊙             Entry-wise product
    ⊘             Entry-wise division
Now we introduce several operators which transform a vector into a matrix or a matrix into a vector.

Definition 1. (Diag operator) Let a = [a_1, a_2, ..., a_n]^T ∈ R^{n×1}; then Diag(a) ∈ R^{n×n} is defined as the diagonal matrix

    Diag(a) = [ a_1  0    ...  0
                0    a_2  ...  0
                ...  ...  ...  ...
                0    0    ...  a_n ].
Remark 1. The diagonal vector of a square matrix C ∈ R^{n×n} is denoted by diag(C) = [C_11, C_22, ..., C_nn]^T in this paper.

Definition 2. (Vec operator) Let B ∈ R^{n×m} be a matrix and b_i^T ∈ R^{1×m} be its i-th row vector; then vec(B) ∈ R^{nm×1} is defined as the vector obtained by stacking all the row vectors together:

    vec(B) = [ b_1
               b_2
               ...
               b_n ].

Proposition 1. Let I ∈ R^{n×n} be the identity matrix. Then the matrix

    Î = Diag(vec(I))

is a diagonal matrix in R^{n^2 × n^2}.

Proposition 2. Let I ∈ R^{n×n} be the identity matrix and J ∈ R^{n×n} be the all-ones matrix. Then

    Ĵ = Diag(vec(J − I))

is a diagonal matrix in R^{n^2 × n^2}.
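The operators and the two propositions are easy to verify numerically; in a row-major array library the row-stacking vec is simply a flatten (a small sketch, with helper names of our choosing):

```python
import numpy as np

def vec(B):
    """Vec operator of Definition 2: row-major flattening == stacking row vectors."""
    return B.reshape(-1)

def Diag(a):
    """Diag operator of Definition 1."""
    return np.diag(a)

n = 3
I_hat = Diag(vec(np.eye(n)))                      # Proposition 1
J_hat = Diag(vec(np.ones((n, n)) - np.eye(n)))    # Proposition 2
```

Note that Î + Ĵ equals the identity matrix in R^{n^2 × n^2}, a fact used later when the two matrix equations are added together.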
2.1. Linear system of SimRank

Now we introduce the linear system for SimRank, which forms the foundation of this paper. We omit all the proofs in this subsection; they can be found in [13].

The linear system for SimRank is more easily addressed on the node-pair graph than on the original graph. In detail, let G = (V, E) be a graph; its node-pair graph is defined as G^2 = (V^2, E^2), such that n_k = (i, j) ∈ V^2, where k = (i − 1)n + j and n = |V|, if and only if i, j ∈ V. For any n_p = (i_p, j_p), n_q = (i_q, j_q) ∈ V^2, (n_p, n_q) ∈ E^2 if and only if both (i_p, i_q) ∈ E and (j_p, j_q) ∈ E. Figure 2 presents an example.
[Figure 2: A simple graph G and its node-pair graph G^2, whose nodes n_1 = (1,1), n_2 = (1,2), ..., n_16 = (4,4) are the ordered node pairs of G.]
Now, we separate the nodes of G^2 into two parts, V_S = {(i, j) | i = j} and V_D = {(i, j) | i ≠ j}. Then, the SimRank Equation (1) can be restated on the node-pair graph G^2 as

    S_{n_k} = 1,    n_k = (i, j) ∈ V_S,                                        (2)

and

    S_{n_k} = (c / |I(n_k)|) Σ_{n_l ∈ I(n_k)} S_{n_l},    n_k = (i, j) ∈ V_D,   (3)

where I(n_k) is the set of in-neighbors of n_k in G^2. (For now we temporarily suppose that there is no node i ∈ V such that I(i) = ∅, to simplify the current discussion; the situation of I(i) = ∅ or I(j) = ∅ is addressed later in Section 6.)

Let I_S(n_k) = I(n_k) ∩ V_S and I_D(n_k) = I(n_k) ∩ V_D. Then Equation (3) can be transformed into

    S_{n_k} = (c / |I(n_k)|) ( Σ_{n_l ∈ I_D(n_k)} S_{n_l} + |I_S(n_k)| ),    n_k ∈ V_D.   (4)

Equation (4) can be further rewritten as

    |I(n_k)| S_{n_k} − c Σ_{n_l ∈ I_D(n_k)} S_{n_l} = c |I_S(n_k)|,    n_k ∈ V_D.   (5)
Let d, s, v be vectors in R^{n^2 × 1} such that d_k = |I(n_k)|, s_k = S_{n_k}, v_k = |I_S(n_k)|. Then Equation (5) can be represented in the matrix form

    Ĵ (Diag(d) − c Â) s = c Ĵ v,                                              (6)

where Ĵ is defined as in Proposition 2 and Â, with Â_kl = 1 if n_l ∈ I_D(n_k) and Â_kl = 0 otherwise, is the partial backward adjacency matrix of G^2.

Similarly, Equation (2) can be rewritten in the matrix form

    Î Diag(d) s = Î d,                                                        (7)

where Î is defined as in Proposition 1.

Keep in mind that Î + Ĵ = I (the identity matrix in R^{n^2 × n^2}). By adding Equations (6) and (7) together, we obtain the following linear equation for SimRank:

    L s = q,                                                                  (8)

where L = Diag(d) − c Ĵ Â and q = c Ĵ v + Î d.

Definition 3. The backward adjacency matrix of graph G, denoted A ∈ R^{n×n}, is defined as

    A_ij = 1 if (j, i) ∈ E, and A_ij = 0 otherwise.
The following two propositions show how to compute the matrix Â and the vectors v, d using the backward adjacency matrix of graph G.

Proposition 3. Let v, d ∈ R^{n^2 × 1} be defined as in Equation (6). Then we have

    v = vec(A A^T)

and

    d = vec(g g^T),

where g ∈ R^{n×1} with g_i = Σ_{j=1}^{n} A_ij.
Proposition 4. As defined in Equation (6), let Â be the partial backward adjacency matrix of G^2. Then Â ∈ R^{n^2 × n^2} can be calculated by

    Â = (A ⊗ A) Ĵ,

where ⊗ is the Kronecker product and Ĵ ∈ R^{n^2 × n^2} is defined as in Proposition 2.

Proposition 5. For an undirected graph, the coefficient matrix L, as defined in Equation (8), is a symmetric positive definite matrix.
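Propositions 1-5 can be combined into an explicit construction of the system Ls = q for a toy undirected graph. The sketch below (our choice of graph and names) forms the full n^2 × n^2 matrix, which is exactly what later sections avoid, but it is useful for checking the derivation:

```python
import numpy as np

c = 0.6
A = np.array([[0, 1, 1],
              [1, 0, 1],
              [1, 1, 0]], dtype=float)          # backward adjacency of an undirected triangle
n = A.shape[0]
I_hat = np.diag(np.eye(n).reshape(-1))          # Proposition 1
J_hat = np.diag((np.ones((n, n)) - np.eye(n)).reshape(-1))   # Proposition 2

g = A.sum(axis=1)                               # g_i = sum_j A_ij
d = np.outer(g, g).reshape(-1)                  # d = vec(g g^T), Proposition 3
v = (A @ A.T).reshape(-1)                       # v = vec(A A^T), Proposition 3
A_hat = np.kron(A, A) @ J_hat                   # Proposition 4

L = np.diag(d) - c * J_hat @ A_hat              # L of Equation (8)
q = c * J_hat @ v + I_hat @ d

S = np.linalg.solve(L, q).reshape(n, n)         # all-pairs SimRank scores
```

On the triangle every node pair (i, j), i ≠ j, satisfies 4 S_{ij} − 3c S_{ij} = c by Equation (5), so the off-diagonal scores equal c/(4 − 3c), which the solve reproduces; the assembled L is also symmetric positive definite, as Proposition 5 states.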
3. Solving the linear system by an efficient algorithm

In the previous section, we represented SimRank as the solution of a linear system. However, it is too expensive to employ direct methods for solving the linear system with L ∈ R^{n^2 × n^2} when n is large.

If the matrix L is sparse, iterative methods are promising for solving the system, and in general they do not require extra storage. Since L is a symmetric positive definite matrix for an undirected graph [13], the CG method is regarded as the most efficient iterative algorithm for solving the system in both theory and practice. For a directed graph, the matrix L is non-symmetric; the BiCGstab method is one of the extensions of the CG method to non-symmetric linear systems, with fast and smooth convergence.

3.1. The time complexity of the CG and BiCGstab algorithms

For an arbitrary vector x, calculating the product Lx incurs the main time cost of the CG and BiCGstab methods in each iteration. We first discuss the time complexity of the CG and BiCGstab methods per iteration for SimRank, and then tackle the overall convergence rate of the CG method for solving the linear system.
Definition 4. (Mat operator) Let a = [a_1, a_2, ..., a_{n^2}]^T ∈ R^{n^2 × 1} be a column vector; then Mat(a) ∈ R^{n×n} is defined as the matrix

    Mat(a) = [ a_1          a_2          ...  a_n
               a_{n+1}      a_{n+2}      ...  a_{2n}
               ...          ...          ...  ...
               a_{n^2−n+1}  a_{n^2−n+2}  ...  a_{n^2} ].

According to work [13], the product of the matrix L and an arbitrary vector x ∈ R^{n^2 × 1} can be written as

    L x = D x − c U x = d ⊙ x − c · diag(Ĵ) ⊙ vec(A · Mat(diag(Ĵ) ⊙ x) A^T),

in which the factor vec(A · Mat(diag(Ĵ) ⊙ x) A^T) equals Â x,
where the operator ⊙ denotes the entry-wise product, D = Diag(d), and U = Ĵ Â. The procedure for computing U x is given in Lines 1-5 of Algorithm 1.

Algorithm 1 Compute Lx
Input: x ∈ R^{n^2 × 1}
Output: Lx
 1: U ← Mat(x)
 2: set the diagonal elements of U to zero
 3: U ← A U A^T
 4: set the diagonal elements of U to zero
 5: u ← vec(U)                ▹ u = U x
 6: Lx ← d ⊙ x − c u
From the above discussion, we know that Â x = vec(A · Mat(diag(Ĵ) ⊙ x) A^T), from which we can derive that the complexity of computing Â x is O(nm), where m is the number of edges and n is the number of nodes in G. Therefore, computing U x requires O(nm) time, which means the complexity of computing Lx is O(nm) as well. Hence the time complexity of the CG and BiCGstab algorithms is O(nm) per iteration.
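Algorithm 1 transcribes directly into a matrix-free product (function names are ours): only n × n multiplications are used, and the result can be checked against the explicit L of Equation (8) on a small graph:

```python
import numpy as np

def L_times(x, A, d, c):
    """Sketch of Algorithm 1: compute Lx without forming the n^2 x n^2 matrix L."""
    n = A.shape[0]
    U = x.reshape(n, n).copy()           # U <- Mat(x)            (Line 1)
    np.fill_diagonal(U, 0.0)             # zero the diagonal      (Line 2)
    U = A @ U @ A.T                      # U <- A U A^T           (Line 3)
    np.fill_diagonal(U, 0.0)             # zero the diagonal      (Line 4)
    return d * x - c * U.reshape(-1)     # Lx = d o x - c u       (Line 6)

# check against the explicit L on a toy undirected graph (triangle)
c = 0.6
A = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]], dtype=float)
n = A.shape[0]
d = np.outer(A.sum(axis=1), A.sum(axis=1)).reshape(-1)
J_hat = np.diag((np.ones((n, n)) - np.eye(n)).reshape(-1))
L = np.diag(d) - c * J_hat @ np.kron(A, A) @ J_hat
x = np.arange(1.0, n * n + 1)
```

For sparse A the two products in Line 3 dominate the cost, which is where the O(nm) bound per iteration comes from.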
3.2. The convergence rate of the CG method on symmetric positive definite systems

Let A be a symmetric positive definite matrix and x* the exact solution of Ax = b. Then the convergence rate of the CG method for solving the linear system can be estimated by the following inequality [15]:

    ||x* − x_K||_A ≤ 2 ((√κ − 1)/(√κ + 1))^K ||x* − x_0||_A,                (9)

where x_K is the result of the K-th iteration of the CG algorithm, the A-norm of x is defined as ||x||_A = √(x^T A x), and κ is the condition number of A, defined as κ(A) = ||A|| ||A^{-1}||. Note that for any p-norm, κ(A) = ||A|| ||A^{-1}|| ≥ ||A · A^{-1}|| = ||I|| = 1. If κ(A) is small (close enough to 1), the system Ax = b is said to be well-conditioned; otherwise it is said to be ill-conditioned. Inequality (9) can be written as

    ||x* − x_K||_A / ||x* − x_0||_A ≤ 2 ((√κ − 1)/(√κ + 1))^K = 2 (1 − 2/(√κ + 1))^K ≤ ε,   (10)

where ε is a given error bound for the result. Then we have

    K ≥ ⌈ ln(ε/2) / ln(1 − 2/(√κ + 1)) ⌉,                                   (11)

where ⌈x⌉ is the smallest integer not less than x.

In general, the matrix L has a large condition number, which means L is ill-conditioned. From Inequality (11) and Figure 3, we can observe that if the condition number κ is large and a small error bound ε for the SimRank scores is required, the iteration number K becomes large, implying a slow convergence rate of the CG algorithm. On the contrary, if the condition number κ is small, the CG method requires only a few iterations to achieve the same accuracy. Therefore, we will discuss how to reduce the condition number of the linear system in order to accelerate the convergence rate.

Figure 3: The iteration number K for a given condition number κ and error bound ε.
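A quick numeric reading of Inequality (11) (the helper name is ours) shows how strongly the bound depends on κ:

```python
import math

def cg_iter_bound(kappa, eps):
    """The iteration bound of Inequality (11)."""
    return math.ceil(math.log(eps / 2) / math.log(1 - 2 / (math.sqrt(kappa) + 1)))

# the bound grows sharply with the condition number for a fixed error bound:
# kappa = 4 gives 14 iterations, while kappa = 10000 runs into the hundreds
K_small = cg_iter_bound(4, 1e-6)
K_large = cg_iter_bound(10000, 1e-6)
```

This is precisely the behavior plotted in Figure 3, and it motivates preconditioning: shrinking κ toward 1 collapses the bound.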
4. Preconditioners for SimRank

Preconditioning techniques serve to reduce the condition number of the linear system and thus improve the performance of the iteration. Using a preconditioner M, we can rewrite the linear system of Equation (8) as

    M^{-1} L s = M^{-1} q,

so that M^{-1} L has a smaller condition number than L.
Algorithm 2 Preconditioned CG method
 1: s_0 = vec(I)
 2: r_0 = q − L s_0
 3: z_0 = M^{-1} r_0
 4: p_0 = z_0
 5: for K = 1, 2, 3, ... do
 6:     v_{K−1} = L p_{K−1}
 7:     α = (r_{K−1}^T z_{K−1}) / (p_{K−1}^T v_{K−1})
 8:     s_K = s_{K−1} + α p_{K−1}
 9:     r_K = r_{K−1} − α v_{K−1}
10:     z_K = M^{-1} r_K
11:     β = (z_K^T v_{K−1}) / (p_{K−1}^T v_{K−1})
12:     p_K = z_K − β p_{K−1}
13: end for
14: return s_K

Algorithm 3 Preconditioned BiCGstab method
 1: s_0 = vec(I)
 2: r_0 = q − L s_0
 3: r̂ = r_0
 4: ρ_0 = α = ω_0 = 1
 5: v_0 = p_0 = 0
 6: for K = 1, 2, 3, ... do
 7:     ρ_K = r̂^T r_{K−1}
 8:     β = (ρ_K / ρ_{K−1})(α / ω_{K−1})
 9:     p_K = r_{K−1} + β (p_{K−1} − ω_{K−1} v_{K−1})
10:     y = M^{-1} p_K
11:     v_K = L y
12:     α = ρ_K / (r̂^T v_K)
13:     x = r_{K−1} − α v_K
14:     z = M^{-1} x
15:     t = L z
16:     ω_K = (t^T x) / (t^T t)
17:     s_K = s_{K−1} + α y + ω_K z
18:     r_K = x − ω_K t
19: end for
20: return s_K
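Algorithm 2 in executable form (a sketch with names of our choosing): the operator L and the preconditioner solve are passed as callables, so the same loop serves the diagonal and the polynomial preconditioners. The usage below assembles the toy system of Equation (8) and applies the Jacobi preconditioner:

```python
import numpy as np

def pcg(L_op, M_solve, q, s0, tol=1e-10, max_iter=1000):
    """Sketch of Algorithm 2: preconditioned CG with L and M^{-1} as callables."""
    s = s0.copy()
    r = q - L_op(s)
    z = M_solve(r)
    p = z.copy()
    for _ in range(max_iter):
        v = L_op(p)                      # v_{K-1} = L p_{K-1}
        alpha = (r @ z) / (p @ v)
        s = s + alpha * p
        r_new = r - alpha * v
        if np.linalg.norm(r_new) < tol:
            break
        z_new = M_solve(r_new)
        beta = (z_new @ v) / (p @ v)     # conjugation against the previous direction
        p = z_new - beta * p
        r, z = r_new, z_new
    return s

# usage: triangle-graph system of Equation (8), diagonal (Jacobi) preconditioner
c = 0.6
A = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]], dtype=float)
n = A.shape[0]
d = np.outer(A.sum(axis=1), A.sum(axis=1)).reshape(-1)
J_hat = np.diag((np.ones((n, n)) - np.eye(n)).reshape(-1))
L = np.diag(d) - c * J_hat @ np.kron(A, A) @ J_hat
q = c * J_hat @ (A @ A.T).reshape(-1) + np.diag(np.eye(n).reshape(-1)) @ d
s = pcg(lambda x: L @ x, lambda r: r / np.diag(L), q, np.eye(n).reshape(-1))
```

In the SimRank setting L_op would of course be the matrix-free product of Algorithm 1 rather than an explicit matrix.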
In the linear system of SimRank, the preconditioner should be designed to approximate L as closely as possible (i.e., M ≈ L) in order to reduce the iterations of the CG algorithm. However, computing M^{-1}x for an arbitrary vector x in the preconditioned CG (PCG) or preconditioned BiCGstab (PBiCGstab) algorithm is also expensive. Thus, there is a trade-off between choosing an effective preconditioner and computing M^{-1}x cheaply. In the following, we introduce two types of preconditioners.
4.1. Diagonal preconditioner

In [13], Lu et al. proposed a simple diagonal preconditioner

    M_kl = L_kk if k = l, and M_kl = 0 otherwise,

for solving SimRank, known as the Jacobi preconditioner. It is efficient for the strictly diagonally dominant matrix L [13] and incurs no extra storage. The existence of M^{-1} is proved by the following theorem.

Theorem 1. The matrix M^{-1} exists if each node of G has non-zero in-degree.

Proof. If the diagonal entries of the diagonal matrix M are all non-zero, the existence of M^{-1} follows immediately. From Equation (8), we have

    L = Diag(d) − c Ĵ Â.

From the assumption that all nodes of G have non-zero in-degrees, all node pairs of G^2 also have non-zero in-degrees, i.e., d_k ≥ 1. The entries of Ĵ and Â are either 0 or 1, where Ĵ and Â are defined in Proposition 2 and Proposition 4, respectively. Then we have

    diag(M) = diag(L) = d − c · diag(Ĵ) ⊙ diag(Â) > 0,

where 0 is a column vector with all zero entries. Hence the diagonal entries of M are greater than zero, and thus the matrix M^{-1} exists.

In addition, for any vector x,

    M^{-1} x = x ⊘ [ d − c · diag(Ĵ) ⊙ (diag(A) ⊗ diag(A)) ],

where the symbol ⊘ denotes the entry-wise division as defined in [13]. Therefore, we can easily derive that the complexity of computing M^{-1}x is O(nm), i.e., the PCG and PBiCGstab solvers incur O(Knm) time complexity, where K is the iteration number.
In Section 5, we will prove that the spectrum of M^{-1}L is located in the interval [1 − c, 1 + c] for undirected graphs. In the next subsection, we propose a better preconditioner that places the spectrum of the preconditioned matrix in the smaller interval [1 − c^{α+1}, 1 + c^{α+1}], where α ∈ N, so that the convergence rate of the CG solver is improved.

4.2. Polynomial preconditioners

In order to reduce the condition number of the linear system, and thus speed up the convergence rate of the CG algorithm, in this subsection we propose a new preconditioner that approximates L well. In graph G, we suppose that all nodes have non-zero in-degrees. By Equation (8), we represent the matrix L as

    L = D − c U = D(I − c D^{-1} U) = D(I − B),                               (12)

where D = Diag(d), U = Ĵ Â, and B = c D^{-1} U. In particular, D^{-1} exists since all entries of d are non-zero (similar to Theorem 1). Therefore,

    L^{-1} = (I − B)^{-1} D^{-1} = (I + B + B^2 + ... + B^α + ...) D^{-1},    α ∈ N.

Let

    P_α^{-1} = (I + B + B^2 + ... + B^α) D^{-1}.

We know that P_α^{-1} L is close to the identity matrix when α becomes large. P_α is called the polynomial preconditioner of degree α ∈ N.

Regarding the time complexity, we have

    P_α^{-1} x = (I + B + B^2 + ... + B^α) D^{-1} x = (I + B(I + B(... (I + B) ...))) D^{-1} x,

with α nested factors of the form I + B(·). Let x̂ = D^{-1} x; then

    P_α^{-1} x = x̂ + B(x̂ + B(x̂ + ... + B(x̂ + B x̂) ...)),

with α nested brackets of the form x̂ + B(·).
Algorithm 4 Compute P_α^{-1} x
Input: x ∈ R^{n^2 × 1}
Output: P_α^{-1} x
 1: x_0 ← D^{-1} x
 2: x_1 ← x_0
 3: for j = 1, 2, ..., α do
 4:     x_j ← x_0 + c D^{-1} U x_{j−1}      ▹ call Algorithm 1 to calculate U x_{j−1}, i.e., u
 5: end for
 6: P_α^{-1} x ← x_α

Since D^{-1} ∈ R^{n^2 × n^2} is a diagonal matrix, the computational complexity of x̂ = D^{-1} x is O(n^2). From Section 3, we know that the time complexity of computing U x is O(nm). Since m is at least as large as n in our setting, the complexity of computing Bx is O(nm), where B is given in Equation (12). Therefore, the time complexity of computing P_α^{-1} x is O(αnm); correspondingly, the PCG or PBiCGstab solver takes O(K_α αnm) time, where K_α is the iteration number of the algorithm and α is the degree of the polynomial preconditioner. In particular, we will show in Section 5 that K_α decreases when α increases.
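Algorithm 4 can likewise be kept matrix-free (names are ours): each application of P_α^{-1} costs α of the same n × n products used for Lx. The sketch below also verifies numerically that the spectrum of P_α^{-1}L stays within [1 − c^{α+1}, 1 + c^{α+1}], as proved in Section 5:

```python
import numpy as np

def poly_prec_solve(x, A, d, c, alpha):
    """Sketch of Algorithm 4: apply P_alpha^{-1} via the Horner-style recurrence."""
    n = A.shape[0]
    def U_times(y):                      # U y, as in Lines 1-5 of Algorithm 1
        Y = y.reshape(n, n).copy()
        np.fill_diagonal(Y, 0.0)
        Y = A @ Y @ A.T
        np.fill_diagonal(Y, 0.0)
        return Y.reshape(-1)
    x0 = x / d                           # x_hat = D^{-1} x
    xj = x0.copy()
    for _ in range(alpha):               # x_j = x_hat + c D^{-1} U x_{j-1}
        xj = x0 + c * U_times(xj) / d
    return xj

# explicit check on the toy triangle graph
c, alpha = 0.6, 3
A = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]], dtype=float)
n = A.shape[0]
d = np.outer(A.sum(axis=1), A.sum(axis=1)).reshape(-1)
J_hat = np.diag((np.ones((n, n)) - np.eye(n)).reshape(-1))
U = J_hat @ np.kron(A, A) @ J_hat
B = c * U / d[:, None]                               # B = c D^{-1} U
P_inv = sum(np.linalg.matrix_power(B, j) for j in range(alpha + 1)) / d[None, :]
x = np.arange(1.0, n * n + 1)
lam = np.linalg.eigvals(P_inv @ (np.diag(d) - c * U)).real
```

Here `lam` collects the eigenvalues of P_α^{-1}L, which for this graph fall inside [1 − c^{α+1}, 1 + c^{α+1}].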
5. Spectral properties of two preconditioned matrices

According to Inequality (11), the iteration number of the CG algorithm depends on the condition number κ when the error bound ε is given. For a symmetric matrix A, the condition number κ can be computed as κ(A) = ||A||_2 ||A^{-1}||_2 = λ_max(A)/λ_min(A). This formula indicates that the eigenvalues of a well-conditioned matrix are clustered together.

Theorem 2. For an undirected graph G, given the matrix M = Diag(diag(L)) as the diagonal preconditioner, all eigenvalues of M^{-1}L are located in the interval [1 − c, 1 + c], where c is the decay factor of SimRank.

Proof. L = Diag(d) − c Ĵ Â is as given in Equation (8). Let L̂ = M^{-1} L; then
20
we get ˆ kk = L−1 Lkk = 1 L kk and X
ˆ kl | = |L
l6=k
X
L−1 kk |Lkl | =
l6=k
X (dk − cJˆkk Aˆkk )−1 (cJˆkk Aˆkl ). l6=k
From Proposition 2, we know that either Jˆkk = 0 or Jˆkk = 1. If Jˆkk = 0, X
ˆ kl | = 0. |L
l6=k
If Jˆkk = 1, X
ˆ kl | = |L
l6=k
X (dk − cAˆkk )−1 cAˆkl l6=k
= c(dk − cAˆkk )−1
X
Aˆkl
l6=k
! = c(dk − c(A ⊗ A)kk Jˆkk )−1
X
Aˆkl − Aˆkk
l −1
= c(dk − c(A ⊗ A)kk )
(dk − |IS (k)| − (A ⊗ A)kk )
≤ c(dk − c(A ⊗ A)kk )−1 (dk − (A ⊗ A)kk ) ≤ c. The last inequality follows from dk − (A ⊗ A)kk ≤ dk − c(A ⊗ A)kk . Thus, we ˆ kk = 1 and P |L ˆ kl | ≤ c for either case. get L l6=k
ˆ by applying the Gershgorin theorem [3] to L, ˆ the Let λ be an eigenvalue of L, eigenvalues lie within the union of discs X ˆ kk | ≤ ˆ kl | , Dk = λ : |λ − L |L
k = 1, . . . , n2 .
l6=k
ˆ are equal to 1, we can easily deduce We know that the diagonal entries of L ˆ lie within the disc with center (1, 0) and radius c. that all the eigenvalues of L ˆ = M −1 L is similar to a symmetric matrix M − 12 LM − 21 for an Further, L 285
ˆ are real and located in the interval undirected graph. Hence, all eigenvalues of L [1 − c, 1 + c]. 21
This implies that the eigenvalues of M^{-1}L are non-zero, so the matrix is non-singular. We will now further prove that the eigenvalues of the polynomial preconditioned matrix are located in the smaller interval [1 − c^{α+1}, 1 + c^{α+1}] around 1, where α is the degree of the polynomial preconditioner.

Theorem 3. For a graph G, given the matrix B = c D^{-1} U as defined in Equation (12), the spectral radius of B satisfies ρ(B) ≤ c.

Proof. Since B = c D^{-1} U and U = Ĵ Â as given in Equation (12), we have

    ρ(B) ≤ ||B||_∞ = max_{1 ≤ k ≤ n^2} Σ_{l=1}^{n^2} |B_kl|,  where  B_kl = c D_kk^{-1} U_kl = c d_k^{-1} Ĵ_kk Â_kl.

For any k, either Ĵ_kk = 0 or Ĵ_kk = 1 by Proposition 2. In the case Ĵ_kk = 0,

    Σ_l |B_kl| = 0.

In the case Ĵ_kk = 1,

    Σ_l |B_kl| = Σ_l |c d_k^{-1} Â_kl|
               = c d_k^{-1} Σ_l |Â_kl|
               = c d_k^{-1} (d_k − |I_S(k)|)
               ≤ c d_k^{-1} d_k = c.

Thus,

    ||B||_∞ = max_{1 ≤ k ≤ n^2} Σ_{l=1}^{n^2} |B_kl| ≤ c.

Therefore, we get ρ(B) ≤ ||B||_∞ ≤ c.
Theorem 4. For an undirected graph $G$, let $P_{\alpha}$ be the polynomial preconditioner such that $P_{\alpha}^{-1} = (I + B + B^2 + \cdots + B^{\alpha})D^{-1}$, where $B = cD^{-1}U$. Then the eigenvalues of $P_{\alpha}^{-1}L$ satisfy
$$1 - c^{\alpha+1} \leq \lambda(P_{\alpha}^{-1}L) \leq 1 + c^{\alpha+1} \quad \text{for all } \alpha \in \mathbb{N}.$$

Proof. For $\alpha \in \mathbb{N}$, let $\hat{P} = P_{\alpha}^{-1}L$, where $L = D(I-B)$ is as given in Equation (12). We have
$$\hat{P} = (I + B + B^2 + \cdots + B^{\alpha})D^{-1}L = (I + B + B^2 + \cdots + B^{\alpha})(I - B) = I - B^{\alpha+1}.$$
By Equation (12), $U = \hat{J}\hat{A}$, so $U$ is a symmetric matrix for an undirected graph, and the non-symmetric matrix $B = cD^{-1}U$ is similar to the symmetric matrix $cD^{-1/2}UD^{-1/2}$. Therefore, all eigenvalues of $B$ are real. By Theorem 3, $\rho(B) \leq c$, so $|\lambda(B)| \leq c$. It follows that the eigenvalues of $B^{\alpha+1}$ are also real and $|\lambda(B^{\alpha+1})| \leq c^{\alpha+1}$. Hence, $1 - c^{\alpha+1} \leq \lambda(\hat{P}) \leq 1 + c^{\alpha+1}$ for all $\alpha \in \mathbb{N}$. ∎

Since $c \in (0,1)$, we have $c^{\alpha+1} \to 0$ as $\alpha \to \infty$. That is, all eigenvalues of $\hat{P}$ approach 1 when $\alpha$ becomes large.
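Theorem 4 can be illustrated with a small numerical experiment. The sketch below uses a randomly generated symmetric $U$ and diagonal entries $d_k$ that exceed the row sums of $U$, so that $\|B\|_\infty \leq c$ as Theorem 3 guarantees; these toy matrices are assumptions for illustration, not matrices built from a real graph.

```python
import numpy as np

# Numerical illustration of Theorem 4 on assumed toy matrices.
rng = np.random.default_rng(0)
n, c, alpha = 6, 0.8, 3
U = rng.random((n, n))
U = (U + U.T) / 2                          # symmetric, as for an undirected graph
d = U.sum(axis=1) + rng.random(n)          # d_k > row sum of U, so ||B||_inf <= c
B = c * (U / d[:, None])                   # B = c D^{-1} U

# Polynomial preconditioned matrix: P_alpha^{-1} L = I - B^{alpha+1}
P_hat = np.eye(n) - np.linalg.matrix_power(B, alpha + 1)
eig = np.linalg.eigvals(P_hat).real
assert np.all(eig >= 1 - c**(alpha + 1) - 1e-12)
assert np.all(eig <= 1 + c**(alpha + 1) + 1e-12)
```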
By applying the claims of Theorem 2 and Theorem 4 to Equation (11), we obtain the following estimates for the iteration numbers of the CG algorithm with the diagonal and the polynomial preconditioners, respectively:
$$K = \frac{\ln(\varepsilon/2)}{\ln\Big(1 - \frac{2}{\sqrt{\frac{1+c}{1-c}} + 1}\Big)}, \tag{13}$$
$$K_{\alpha} = \frac{\ln(\varepsilon/2)}{\ln\Big(1 - \frac{2}{\sqrt{\frac{1+c^{\alpha+1}}{1-c^{\alpha+1}}} + 1}\Big)}. \tag{14}$$
It is clear that $K_{\alpha}$ is much smaller than $K$ for the same error bound $\varepsilon$, which indicates that the polynomial preconditioner makes the CG algorithm converge faster.

However, we should note that applying the polynomial preconditioner is more costly than applying the diagonal preconditioner in each iteration. Theoretically, the time complexities of the CG method with the diagonal preconditioner and with the polynomial preconditioner are $O(Knm)$ and $O(K_{\alpha}\alpha nm)$, respectively. In the experimental section, we compare the overall performance of the CG algorithm for different values of $\alpha$.
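For concreteness, the two estimates can be evaluated numerically. The sketch below plugs the condition-number bounds from Theorems 2 and 4 into the iteration estimates of Equations (13)-(14); the helper name `cg_iters` is ours, and a ceiling is taken since iteration counts are integers.

```python
import math

def cg_iters(kappa, eps):
    """Iteration estimate ceil( ln(eps/2) / ln(1 - 2/(sqrt(kappa) + 1)) )."""
    rho = 1 - 2 / (math.sqrt(kappa) + 1)
    return math.ceil(math.log(eps / 2) / math.log(rho))

c, eps, alpha = 0.8, 1e-8, 3
K = cg_iters((1 + c) / (1 - c), eps)                                  # Eq. (13), diagonal
K_alpha = cg_iters((1 + c**(alpha + 1)) / (1 - c**(alpha + 1)), eps)  # Eq. (14), degree 3
print(K, K_alpha)   # 28 13
```

With $c = 0.8$ and $\varepsilon = 10^{-8}$, the degree-3 polynomial preconditioner cuts the estimated iteration count by more than half, in line with the experimental results reported later.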
6. The case of nodes without in-neighbors

In Subsection 2.1, the linear system is derived under the assumption that every node has at least one in-neighbor, which excludes the case of $I(i) = \emptyset$ for some $i \in V$ allowed in the original SimRank. In this section, we generalize the linear system of SimRank to cover all situations. To this end, we first discuss what happens in the linear system of Equation (8) when there exists some $i \in V$ such that $I(i) = \emptyset$. We then discuss how to modify the linear system to fulfill the original definition of SimRank. After the modification, all the previously discussed techniques can be applied to the SimRank computation without change.

Theorem 5. Let matrix $L$ and vector $q$ be as defined in Equation (8). If a node $i \in V$ of graph $G$ satisfies $|I(i)| = 0$, then $L_{k,\cdot} = 0^T$ and $q_k = 0$, where $k = (i-1)n + j$ is the index of the node-pairs $(i,j)$ and $(j,i)$, $j = 1, 2, \ldots, n$. In other words, the $k$-th row of matrix $L$ and the $k$-th entry of $q$ are zero.

From Theorem 5, we know that $L$ becomes a singular matrix if there exists any node without in-neighbors in $G$, so the system cannot be solved by the proposed method; obviously, $M^{-1} = \mathrm{Diag}(L)^{-1}$ and $D^{-1}$ do not exist. We now use an example to show how to modify the linear system to cover all situations while keeping the preconditioners.
6.1. An example of nodes without in-neighbors

[Figure 4 shows a four-node graph $G$ in which node 3 has no in-neighbor, together with its node-pair graph $G^2$ on the sixteen pairs $n_1 = (1,1), \ldots, n_{16} = (4,4)$.]

Figure 4: An example of a node without in-neighbors in graph $G$ and its graph $G^2$.
As shown in Figure 4, node 3 of $G$ has zero in-degree; correspondingly, the node-pairs $n_3 = (1,3)$, $n_7 = (2,3)$, $n_9 = (3,1)$, $n_{10} = (3,2)$, $n_{11} = (3,3)$, $n_{12} = (3,4)$ and $n_{15} = (4,3)$ in $G^2$ also have zero in-degree. By Theorem 5, the 3rd, 7th, 9th, 10th, 11th, 12th and 15th diagonal entries of $L$ and the corresponding entries of $q$ are equal to zero. In this situation, the linear system cannot be solved by the proposed algorithm. Therefore, we follow the original SimRank definition in Equation (1) to modify the linear system. The SimRank matrix of Figure 4 is
$$S = \begin{pmatrix} 1 & 0.16 & 0 & 0 \\ 0.16 & 1 & 0 & 0.4 \\ 0 & 0 & 1 & 0 \\ 0 & 0.4 & 0 & 1 \end{pmatrix},$$
which can be written as the vector $s = [1, 0.16, 0, 0, 0.16, 1, 0, 0.4, 0, 0, 1, 0, 0, 0.4, 0, 1]^T$. From the SimRank definition we know that $s_{11} = 1$ and that the 3rd, 7th, 9th, 10th, 12th and 15th entries of $s$ are zero. We now modify $L$ by replacing each zero diagonal value $L_{kk} = 0$, $k \in \{3, 7, 9, 10, 11, 12, 15\}$, by $L_{kk} = 1$, and we replace $q_{11} = 0$ by $q_{11} = 1$. The revised $L$ is no longer singular, and the solution of $Ls = q$ gives exactly the SimRank scores under the definition. However, $L \in \mathbb{R}^{n^2 \times n^2}$ is a large matrix, so it is expensive to search the diagonal of $L$ entry by entry for the modification. Theorem 5 implies that if $d_k$ (cf. Proposition 3 for $d$) is zero, then $L_{kk}$ must be zero. Therefore, we only need to determine which entries of $d$ are equal to zero.
For the graph $G$ in Figure 4, the backward adjacency matrix is
$$A = \begin{pmatrix} 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 1 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{pmatrix},$$
whose row sum is $g = [1, 2, 0, 1]^T$ with $g_i = |I(i)|$ for all $i \in V$. In Proposition 3, we mentioned that $d = \mathrm{vec}(gg^T)$. Let
$$H = gg^T = \begin{pmatrix} 1 & 2 & 0 & 1 \\ 2 & 4 & 0 & 2 \\ 0 & 0 & 0 & 0 \\ 1 & 2 & 0 & 1 \end{pmatrix};$$
then the elements of $H$ correspond, row-wise, to the in-degrees of the node-pairs in graph $G^2$. It is easily observed that the zero entries of $g$ lead to zeros in the corresponding entries of $d$.

In this way, by searching only the zero entries of $g$ (here $g_3$), we immediately know that $H_{3,\cdot}$ and $H_{\cdot,3}$ are 0, and we modify them to 1. On the other hand, there is a simple way to find the entry $q_{11}$ ($s_{11} = 1$, where $n_{11} = (3,3) \in V_S$) that must be modified to 1: since a node-pair $n_k$ in $G^2$ is indexed by $k = (i-1)n + j$, the 11th entry of $q$ is obtained directly from the zero entry found in $g$. After this handling, the solution of the new equation gives the SimRank scores under the original definition.

6.2. Algorithms of the linear system modification

In general, if node $i$ has no in-neighbor in graph $G$, we modify the zero entries of $d$ to 1 and set $q_r = 1$ only when $n_r = (i, i)$, in order to satisfy the definition of SimRank, i.e., $S_{ii} = 1$ and $S_{ij} = 0$, $S_{ji} = 0$ for $j \neq i$. The algorithms of the linear system modification are given as follows.

Algorithm 5 Modification of the left-hand side of the linear system
Input: $A$, the backward adjacency matrix of the graph $G$
Output: new $Lx$ and new $P_{\alpha}^{-1}x$
1: for $i = 1, 2, \ldots, n$ do
2:   $g_i \leftarrow \sum_{j=1}^{n} A_{ij}$
3: end for
4: $i \leftarrow$ the indices of the zero entries in $g$
5: $H \leftarrow gg^T$
6: $H_{i,\cdot} \leftarrow 1$
7: $H_{\cdot,i} \leftarrow 1$
8: $d \leftarrow \mathrm{vec}(H)$    ▷ new $d$
9: Call Algorithm 1 with the new $d$ to return $Lx$
10: Call Algorithm 4 with the new $d$ to return $P_{\alpha}^{-1}x$
Algorithm 6 Modification of the right-hand side of the linear system
Input: $i$ from Algorithm 5, and $q$ as defined in Equation (8)
Output: new $q$
1: $r \leftarrow (i-1)n + i$    ▷ $n_r = (i, i)$
2: $q_r \leftarrow 1$
3: return $q$    ▷ new $q$
After the above modifications, the matrix $L$ is non-singular and the matrices $M^{-1}$ and $D^{-1}$ exist in all situations, which means that the preconditioners $M$ and $P_{\alpha}$ are well-defined. In particular, the properties of the linear system are the same as those of the original one, and all of our discussions carry over to the new linear system.
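The modification of Algorithms 5 and 6 can be sketched in a few lines of NumPy (the paper's implementation is MATLAB; the function name and the 0-based indexing below are ours):

```python
import numpy as np

def modify_system(A, q):
    """Sketch of Algorithms 5 and 6: patch d and q for nodes without in-neighbors.

    A is the n x n backward adjacency matrix and q the length-n^2 right-hand side.
    Returns the patched vector d = vec(g g^T) and the patched q.
    """
    n = A.shape[0]
    g = A.sum(axis=1)                  # g_i = |I(i)|, row sums of A
    zero = np.flatnonzero(g == 0)      # nodes without in-neighbors
    H = np.outer(g, g)                 # in-degrees of node-pairs, row-wise
    H[zero, :] = 1                     # Algorithm 5, lines 6-7
    H[:, zero] = 1
    d = H.reshape(-1)                  # d = vec(H), row-wise as in the example
    q = q.copy()
    q[zero * n + zero] = 1             # Algorithm 6: q_r = 1 for n_r = (i, i)
    return d, q

# Graph G of Figure 4: node 3 (index 2 here) has no in-neighbor.
A = np.array([[0, 1, 0, 0],
              [0, 0, 1, 1],
              [0, 0, 0, 0],
              [0, 0, 1, 0]])
d, q = modify_system(A, np.zeros(16))
```

For the example of Subsection 6.1, the patched `d` has no zero entries and only the entry of `q` corresponding to the pair $(3,3)$ is set to 1.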
7. Experiments

In this section, we evaluate our algorithm in terms of precision, normalized discounted cumulative gain (NDCG) [5] and efficiency.

7.1. Experiment setup

All experiments are conducted on a PC with an Intel(R) Core(TM) i7-4790 3.60GHz CPU and 32GB RAM, and the algorithms are implemented in MATLAB R2014a.

7.1.1. Comparison with the state-of-the-art algorithms

In our experiments, we compare the following three methods for computing the all-pairs SimRank scores:
1. P is the algorithm proposed in this paper, which uses a polynomial preconditioner for solving the linear system of SimRank scores. It takes $O(K_{\alpha}\alpha nm)$ time, where $K_{\alpha}$ is the iteration number of the polynomial preconditioned linear system, $\alpha$ is the degree of the polynomial, $n$ denotes the number of nodes in the graph, and $m$ denotes the number of edges.
2. M is the algorithm proposed by Lu et al. [13], which uses a diagonal preconditioner for solving the linear system of SimRank scores. It takes $O(Knm)$ time, where $K$ is the iteration number of the diagonal preconditioned linear system.
3. READS is a stochastic technique for obtaining single-source SimRank scores [8]. It takes $O(nr)$ index time and $O(nr + m)$ query time for single-source SimRank computation, where $r$ is the number of simulations. Jiang et al. state that their algorithm also supports all-pairs SimRank computation; in that case, READS incurs $O(nr)$ index time and $O(n^2 r + nm)$ query time.
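The preconditioned solver underlying methods P and M can be sketched as follows. This is a Python transcription (the paper's implementation is MATLAB) of a textbook preconditioned CG loop together with a Horner-style application of the degree-$\alpha$ polynomial preconditioner $P_\alpha^{-1} = (I + B + \cdots + B^\alpha)D^{-1}$; the dense random matrices below are stand-ins for the sparse $D$ and $U$ of Equation (12).

```python
import numpy as np

def pcg(matvec, b, precond, tol=1e-10, maxiter=200):
    """Textbook preconditioned conjugate gradient for an SPD system."""
    x = np.zeros_like(b)
    r = b - matvec(x)
    z = precond(r)
    p = z.copy()
    rz = r @ z
    for _ in range(maxiter):
        Ap = matvec(p)
        step = rz / (p @ Ap)
        x += step * p
        r -= step * Ap
        if np.linalg.norm(r) <= tol * np.linalg.norm(b):
            break
        z = precond(r)
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x

# Toy SPD stand-in for L = D - cU of Equation (12).
rng = np.random.default_rng(2)
n, c, alpha = 30, 0.8, 3
U = rng.random((n, n))
U = (U + U.T) / 2
D = U.sum(axis=1) + 1.0                  # strict diagonal dominance => SPD
matvec = lambda x: D * x - c * (U @ x)

def precond(x):
    """Horner-style application of P_alpha^{-1} = (I + B + ... + B^alpha) D^{-1}."""
    y = x / D                            # D^{-1} x
    z = y.copy()
    for _ in range(alpha):
        z = y + c * (U @ z) / D          # z <- D^{-1} x + B z, with B = c D^{-1} U
    return z

b = rng.random(n)
x = pcg(matvec, b, precond)
```

Each application of the preconditioner costs $\alpha$ extra matrix-vector products, which is exactly the extra factor $\alpha$ in the $O(K_{\alpha}\alpha nm)$ complexity of method P.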
7.1.2. Experimental environment and parameter settings

We set the decay factor $c = 0.8$ in our experiments.
1. Preconditioners: the PCG (for undirected graphs) or PBiCGstab (for directed graphs) algorithm is used for solving the linear system. The initial SimRank scores are set as $s^0 = \mathrm{vec}(I)$, and the error norm is computed as
$$\varepsilon = \frac{\|s^K - s\|_1}{n^2} = \frac{1}{n^2} \sum_{l=1}^{n^2} |s_l^K - s_l|,$$
where $s^K$ is the vector of SimRank scores at the $K$-th iteration and $s$ is the ground truth of the SimRank scores. We compute the ground truth by the original SimRank equation with a large number of iterations (e.g., 100). In the experiments, we store the backward adjacency matrix $A$ as a sparse matrix in MATLAB. The different preconditioners ($M$, $P_1$ and $P_3$, as defined in Subsections 4.1 and 4.2) are used to compute the SimRank scores on the datasets.
2. The static version of READS: the termination of this algorithm is controlled by the length of the random walks and the number of simulations. To make it comparable with the error-bound controlled algorithms, we set $t = 10$ as the maximum length of a random walk and $r_{\max} = 100$ as the maximum number of simulations.

7.1.3. Datasets

As shown in Table 2, four datasets of directed graphs and three datasets of undirected graphs are used in the experiments, where $n$ is the number of nodes and $m$ is the number of edges in each graph. The datasets are crawled from two websites: Euroroad, Human protein (Stelzl), Reactome and OpenFlights are available at http://konect.uni-koblenz.de/, and ca-GrQc, p2p-Gnutella04 and ca-HepPh are available at http://snap.stanford.edu/data/index.html.
Table 2: Main information of datasets.

Dataset                  n        m        Direction
Euroroad                 1,174    1,417    Undirected
Human protein (Stelzl)   1,706    6,207    Directed
OpenFlights              2,939    30,501   Directed
ca-GrQc                  5,242    14,490   Undirected
Reactome                 6,327    147,547  Directed
p2p-Gnutella04           10,876   39,994   Directed
ca-HepPh                 12,008   118,505  Undirected
7.1.4. Evaluation measures

We use precision and NDCG to measure the accuracy (correctness) of the compared algorithms. These evaluation measures are defined as follows:
1. The precision at $k$ (P@k) is defined as
$$\mathrm{Precision} = \frac{|\{\text{top-}k \text{ of } s\} \cap \{\text{top-}k \text{ of } \tilde{s}\}|}{k},$$
where $s$ is the ground truth of SimRank and $\tilde{s}$ is the approximated SimRank.
2. The normalized discounted cumulative gain (NDCG) [6] is a popular measure for evaluating the ranking of results. NDCG at position $k$ is defined as
$$NDCG_k = \frac{DCG_k}{IDCG_k}, \qquad DCG_k = \sum_{i=1}^{k} \frac{rel_i}{\log_2(1+i)},$$
where $IDCG_k$ is the ideal DCG at position $k$ and $rel_i$ is the graded relevance of the result at position $i$.

7.2. Experimental results

We compare our preconditioners with the other algorithms and evaluate the accuracy and efficiency of the proposed algorithm.
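The two measures defined in Subsection 7.1.4 can be sketched as follows (a minimal NumPy version; here $IDCG_k$ is computed from the listed relevances only, and the function names are ours):

```python
import numpy as np

def precision_at_k(s_true, s_approx, k):
    """P@k: overlap of the top-k index sets of exact and approximate scores."""
    top_true = set(np.argsort(-s_true)[:k])
    top_approx = set(np.argsort(-s_approx)[:k])
    return len(top_true & top_approx) / k

def ndcg_at_k(rel, k):
    """NDCG_k = DCG_k / IDCG_k with DCG_k = sum_i rel_i / log2(1 + i)."""
    rel = np.asarray(rel, dtype=float)[:k]
    discounts = np.log2(np.arange(2, rel.size + 2))   # log2(1 + i), i = 1..k
    dcg = (rel / discounts).sum()
    idcg = (np.sort(rel)[::-1] / discounts).sum()     # ideal ordering of the same items
    return dcg / idcg if idcg > 0 else 0.0
```

A ranking identical to the ground truth scores 1.0 on both measures; any deviation in the top-$k$ set or its order lowers P@k or NDCG respectively.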
7.2.1. Effectiveness

In order to measure the effectiveness of the SimRank scores, we set the iteration number and the simulation number to be the same, i.e., $K_{\alpha} = K = 4$ (for PCG and PBiCGstab) and $r = 4$ (for READS).

Table 3: Evaluation measures for top-800 querying precision and $NDCG_{800}$.

                  P3               P1               M                READS
Dataset           P@k     NDCG     P@k     NDCG     P@k     NDCG     P@k     NDCG
Euroroad          1       1        0.9975  1        0.9800  0.9994   0.7988  0.9105
H. protein        1       1        1       1        1       1        1       1
OpenFlights       1       1        1       1        0.9980  0.9761   1       1
ca-GrQc           0.9938  0.9997   0.9888  0.9994   0.9775  0.9980   0.9400  0.9964
Reactome          1       1        1       1        1       1        0.9860  0.9400
p2p-Gnutella04    1       1        1       1        1       1        0.9960  0.9696
ca-HepPh          1       1        1       1        1       1        1       1
From Table 3, we can see that the polynomial preconditioners ($P_1$ and $P_3$) consistently achieve scores at least as high as those of the diagonal preconditioner ($M$) and READS in terms of both precision and NDCG, which indicates the effectiveness of the polynomial preconditioner in SimRank solving.

7.2.2. Efficiency

In order to see how fast the algorithms converge, we conduct three experiments: (1) given the same number of iterations, we compare the errors generated by the algorithms; (2) given the same error bound, we compare their running times; (3) varying the error bound, we compare the iteration number and running time on one of the datasets.
Table 4: Average error when the iteration number is 4.

Dataset          P3        P1        M         READS
Euroroad         2.79E-06  3.15E-05  2.70E-04  1.10E-03
H. protein       2.06E-07  6.85E-07  2.10E-04  5.00E-03
OpenFlights      6.84E-09  7.93E-07  1.74E-04  3.80E-03
ca-GrQc          2.26E-06  2.23E-05  4.45E-04  6.48E-04
Reactome         1.51E-11  1.41E-06  6.60E-05  1.50E-03
p2p-Gnutella04   3.33E-08  1.81E-05  1.89E-04  9.02E-04
ca-HepPh         1.59E-06  2.11E-05  1.28E-04  5.56E-04
Table 4 addresses the first objective of the testing: the READS algorithm generates an average error of $5.6 \times 10^{-4}$ to $5.0 \times 10^{-3}$, the diagonal preconditioner $M$ generates $6.6 \times 10^{-5}$ to $4.5 \times 10^{-4}$, and our proposed polynomial preconditioners $P_1$ and $P_3$ achieve the smallest errors, $1.5 \times 10^{-11}$ to $3.2 \times 10^{-5}$. This means that, for a fixed iteration number, solving the linear system with polynomial preconditioners generates less error than using the diagonal preconditioner or the READS algorithm. However, different algorithms may spend different amounts of time per iteration, so the iteration number alone does not determine the execution time.

Unlike our proposed algorithms, which can be terminated by a given error bound, READS can only be terminated by a simulation number $r$. To make the algorithms comparable for the second objective, we first set $r = 100$ and run READS on the datasets Euroroad, ca-GrQc, Reactome, p2p-Gnutella04 and ca-HepPh, which achieves an average error bound $\varepsilon \leq 1 \times 10^{-3}$. Then, we execute the proposed preconditioner algorithms on those datasets to reach this error bound. Table 5 shows the results: the polynomial preconditioner algorithms $P_1$ and $P_3$ clearly require much less time than READS does.

Since it is too expensive for READS to reach the error bound $\varepsilon = 1 \times 10^{-8}$, we only conduct experiments with this error bound for the preconditioner based algorithms $M$, $P_1$ and $P_3$; the results are presented in Table 6, where we find $P_1$ is the best of the three in terms of running time, even though $P_3$ requires the fewest iterations.

Table 5: Experimental results for $\varepsilon = 1 \times 10^{-3}$.
                 P3                         P1                         READS
Dataset          K3    Time (in sec.)       K1    Time (in sec.)       r     Time (in sec.)
Euroroad         1     0.3707               1     0.2098               100   1.4536
ca-GrQc          1     9.1871               1     4.9464               100   144.9055
Reactome         1     20.2178              2     22.4352              100   9.00E+03
p2p-Gnutella04   1     30.8254              1     19.0387              100   122.3767
ca-HepPh         1     104.5943             1     51.935               100   4.66E+04
Table 6: Experimental results for $\varepsilon = 1.0 \times 10^{-8}$.

                 P3                         P1                         M
Dataset          K3    Time (in sec.)       K1    Time (in sec.)       K     Time (in sec.)
Euroroad         7     1.5988               11    1.3932               20    1.7648
H. protein       5     4.2651               6     3.1526               12    4.7006
OpenFlights      4     11.5226              6     10.2018              13    15.4716
ca-GrQc          7     37.2828              10    29.0245              19    38.6614
Reactome         3     58.5855              5     53.3944              9     64.4446
p2p-Gnutella04   5     148.8905             7     128.6511             15    210.8746
ca-HepPh         7     407.5188             10    298.6798             18    306.8895
READS is known to be a good method for single-source computation, but the experimental results show that the preconditioner algorithms are more suitable than READS for solving the all-pairs SimRank problem. Further, the proposed polynomial preconditioner algorithms reduce the iteration number by at least half compared with the diagonal preconditioner.

To see how the running time changes as the error bound decreases, the third objective varies the error bound from $5.2 \times 10^{-3}$ to $1.0 \times 10^{-6}$. The experimental results on the Human protein (Stelzl) dataset are presented in Figure 5; the results on the other datasets are similar.
[Figure 5 plots, on log-scaled vertical axes, (a) the iteration number and (b) the running time in seconds of $P_1$, $M$ and READS as the average error decreases from $5 \times 10^{-3}$ to $1 \times 10^{-6}$.]

Figure 5: The experimental results on the convergence rate of the different algorithms over the Human protein (Stelzl) dataset.
The results show that the polynomial preconditioner algorithm $P_1$ always requires the fewest iterations, while the iteration number of READS increases significantly as the error bound becomes small. Figure 5b shows the running time of the three algorithms as the error bound changes; the polynomial preconditioner of degree 1 ($P_1$) always requires the least running time to reach the same accuracy.
8. Conclusion

In this paper, we restudied the CG algorithm for SimRank computation. We observed that the diagonal preconditioner used in the existing CG algorithm leaves room for improvement, and we therefore proposed polynomial preconditioners for the CG algorithm to reduce the required iteration number and execution time. Moreover, we proved that the polynomially preconditioned linear system has a spectrum contained in a smaller interval than that of the diagonally preconditioned one, which means a faster convergence rate of the CG algorithm can be achieved with polynomial preconditioners. Furthermore, we extended the linear system to cover all situations of the graph, so that all the proposed techniques can be applied to the new linear system. The experimental results show that the polynomial preconditioners effectively speed up the computation of SimRank.
Acknowledgements

The research of Siu-Long Lei is funded by the Research Committee of the University of Macau (MYRG2018-00025-FST). The research of Zhiguo Gong is funded by The Science and Technology Development Fund, Macau SAR (FDCT/116/2013/A3, FDCT/007/2016/AFJ), and the Research Committee of the University of Macau (MYRG2015-00070-FST, MYRG2017-00212-FST).
References

[1] I. Antonellis, H. G. Molina, C. C. Chang, Simrank++: query rewriting through link analysis of the click graph, Proceedings of the VLDB Endowment 1(1), 408-421, 2008.
[2] A. A. Benczúr, K. Csalogány, T. Sarlós, Link-based similarity search to fight web spam, in AIRWEB, Citeseer, 2006.
[3] D. G. Feingold, R. S. Varga, Block diagonally dominant matrices and generalizations of the Gerschgorin circle theorem, Pacific Journal of Mathematics 12(4), 1241-1250, 1962.
[4] D. Fogaras, B. Rácz, Scaling link-based similarity search, Proceedings of the 14th International Conference on World Wide Web, 641-650, 2005. doi:10.1145/1060745.1060839.
[5] K. Järvelin, J. Kekäläinen, IR evaluation methods for retrieving highly relevant documents, Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 41-48, ACM, 2000.
[6] K. Järvelin, J. Kekäläinen, Cumulated gain-based evaluation of IR techniques, ACM Transactions on Information Systems (TOIS) 20(4), 422-446, 2002.
[7] G. Jeh, J. Widom, SimRank: a measure of structural-context similarity, Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 538-543, 2002. doi:10.1145/775047.775126.
[8] M. Jiang, A. W. C. Fu, R. C. W. Wong, READS: a random walk approach for efficient and accurate dynamic SimRank, Proceedings of the VLDB Endowment 10(9), 937-948, 2017.
[9] C. Li, J. Han, G. He, X. Jin, Y. Sun, Y. Yu, T. Wu, Fast computation of SimRank for static and dynamic information networks, Proceedings of the 13th International Conference on Extending Database Technology, 465-476, 2010. doi:10.1145/1739041.1739098.
[10] Z. Li, Y. Fang, Q. Liu, J. Cheng, R. Cheng, J. Lui, Walking in the cloud: parallel SimRank at scale, Proceedings of the VLDB Endowment 9(1), 24-135, 2015. doi:10.14778/2850469.2850472.
[11] D. Lizorkin, P. Velikhov, M. Grinev, D. Turdakov, Accuracy estimate and optimization techniques for SimRank computation, Proceedings of the VLDB Endowment 1(1), 422-433, 2008. doi:10.14778/1453856.1453904.
[12] D. Lizorkin, P. Velikhov, M. Grinev, D. Turdakov, Accuracy estimate and optimization techniques for SimRank computation, The VLDB Journal 19(1), 45-66, 2010. doi:10.1007/s00778-009-0168-8.
[13] J. Lu, Z. Gong, X. Lin, A novel and fast SimRank algorithm, IEEE Transactions on Knowledge and Data Engineering 29(3), 572-585, 2017. doi:10.1109/TKDE.2016.2626282.
[14] S. Qiao, T. Li, H. Li, Y. Zhu, J. Peng, J. Qiu, SimRank: a PageRank approach based on similarity measure, 2010 IEEE International Conference on Intelligent Systems and Knowledge Engineering (ISKE), 390-395, 2010.
[15] Y. Saad, Iterative Methods for Sparse Linear Systems, SIAM, Vol. 82, 2003.
[16] C. Scheible, F. Laws, L. Michelbacher, H. Schütze, Sentiment translation through multi-edge graphs, Proceedings of the 23rd International Conference on Computational Linguistics: Posters, 1104-1112, 2010.
[17] Y. Shao, B. Cui, L. Chen, M. Liu, X. Xie, An efficient similarity search framework for SimRank over large dynamic graphs, Proceedings of the VLDB Endowment 8(8), 838-849, 2015.
[18] J. Song, X. Luo, J. Gao, C. Zhou, H. Wei, J. X. Yu, UniWalk: unidirectional random walk based scalable SimRank computation over large graph, IEEE Transactions on Knowledge and Data Engineering 30(5), 992-1006, 2018.
[19] B. Tian, X. Xiao, SLING: a near-optimal index structure for SimRank, Proceedings of the 2016 International Conference on Management of Data, 1859-1874, 2016.
[20] Y. Wang, L. Chen, Y. Che, Q. Luo, Accelerating pairwise SimRank estimation over static and dynamic graphs, The VLDB Journal 28(1), 99-122, 2019.
[21] W. Xi, E. A. Fox, W. Fan, B. Zhang, Z. Chen, J. Yan, D. Zhuang, SimFusion: measuring similarity using unified relationship matrix, Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 130-137, 2005. doi:10.1145/1076034.1076059.
[22] W. Yu, W. Zhang, X. Lin, Q. Zhang, J. Le, A space and time efficient algorithm for SimRank computation, World Wide Web 15(3), 327-353, 2012. doi:10.1007/s11280-010-0100-6.
[23] W. Yu, X. Lin, W. Zhang, Towards efficient SimRank computation on large networks, 2013 IEEE 29th International Conference on Data Engineering (ICDE), 601-612, 2013. doi:10.1109/ICDE.2013.6544859.
[24] W. Yu, X. Lin, W. Zhang, J. A. McCann, Fast all-pairs SimRank assessment on large graphs and bipartite domains, IEEE Transactions on Knowledge and Data Engineering 27(7), 1810-1823, 2015. doi:10.1109/TKDE.2014.2339828.
[25] W. Yu, J. A. McCann, Efficient partial-pairs SimRank search on large networks, Proceedings of the VLDB Endowment 8(5), 569-580, 2015.
[26] Y. Zhou, H. Cheng, J. X. Yu, Graph clustering based on structural/attribute similarities, Proceedings of the VLDB Endowment 2(1), 718-729, 2009.