Accepted Manuscript

Title: Influence efficiency maximization: How can we spread information efficiently?
Authors: Xiang Zhu, Zhefeng Wang, Yu Yang, Bin Zhou, Yan Jia
PII: S1877-7503(17)30528-8
DOI: https://doi.org/10.1016/j.jocs.2017.11.001
Reference: JOCS 790
To appear in: Journal of Computational Science
Received date: 12-5-2017
Revised date: 17-10-2017
Accepted date: 4-11-2017

Please cite this article as: Xiang Zhu, Zhefeng Wang, Yu Yang, Bin Zhou, Yan Jia, Influence efficiency maximization: How can we spread information efficiently?, (2017), https://doi.org/10.1016/j.jocs.2017.11.001
Influence Efficiency Maximization: How Can We Spread Information Efficiently?∗

Xiang Zhu
College of Computer, National University of Defense Technology
[email protected]

Zhefeng Wang
School of Computer Science and Technology, University of Science and Technology of China
[email protected]

Yu Yang
School of Computing Science, Simon Fraser University
[email protected]

Bin Zhou
College of Computer, National University of Defense Technology
[email protected]

Yan Jia
College of Computer, National University of Defense Technology
[email protected]
ABSTRACT

The influence maximization problem, due to its popularity, has been studied extensively in recent years. It aims at targeting a set of seed nodes that maximizes the expected number of activated nodes at the end of the information diffusion process. During the diffusion process, an active node tries to influence its neighbors in the next iteration; thus, every node except the seed nodes needs several iterations before it is activated, which we call the propagation time delay. However, this delay is not considered in the influence maximization problem, so there is a need to understand influence efficiency in the network. Motivated by this demand, we propose a novel problem called the Influence Efficiency Maximization problem, which takes the propagation time delay into consideration. We prove that the proposed problem is NP-hard under the independent cascade model and that the influence efficiency function is submodular. Furthermore, we prove that computing the influence efficiency is #P-hard under the independent cascade model. After that, several algorithms are proposed to solve the influence efficiency maximization problem. Finally, we conduct a series of experiments with real-world data sets to verify the proposed algorithms. The experimental results demonstrate the performance of the proposed algorithms.

Keywords

Influence Maximization; Information Diffusion; Influence Efficiency; Social Network

1. INTRODUCTION

Influence maximization plays an important role in many social network applications, such as viral marketing and business campaigns. Intuitively, by targeting a small number of nodes called seed nodes, it is possible to trigger a large cascade of information propagation in the network. Technically, the influence maximization problem is to identify a set of nodes maximizing the influence in the network. There is a large body of previous work on influence maximization, including various information diffusion models and computational methods. Researchers have proposed several models to describe the diffusion of information in a network. Among them, the Independent Cascade (IC) model [8] and the Linear Threshold (LT) model [10] are the most popular. In these models, each node in the network has two possible states, active or inactive. Generally, an active node can be viewed as adopting the information or product that is propagated in the network. Given a diffusion model and a network, most existing work on influence maximization targets a set of seed nodes leading to the maximum expected number of activated nodes when the propagation terminates. This selection problem was formulated as a discrete optimization problem by Kempe et al. [11]. Due to its important application in viral marketing, the influence maximization problem has been extensively explored. However, many practical demands are still not satisfied. During the process of information diffusion, an active node tries to influence its neighbors in the next iteration. Thus, every node except the seed nodes needs several iterations before it is activated; we call this the propagation time delay. If a node is inactive at the end of information diffusion, its propagation time delay is infinite. Let the influence efficiency denote the inverse of the propagation time delay. Given a set of seed nodes, if the total influence efficiency of the network is large, the nodes will be influenced efficiently during information diffusion in the network. Imagine that there are two solutions to the influence maximization problem in the same network and they have the same number of activated nodes. It means the solutions may influence the same
∗Corresponding author.
number of nodes. However, the solutions may have different influence efficiency, which is not discussed in the traditional influence maximization problem. Motivated by these practical demands, in this paper we formulate a new problem called the Influence Efficiency Maximization problem to address this issue. The objective of this problem is to maximize the expected influence efficiency (the sum of the reciprocals of the propagation time delays) in the network. We prove that the influence efficiency maximization problem is NP-hard and the efficiency function is submodular under the independent cascade model. Furthermore, we show that computing the influence efficiency under the independent cascade model is #P-hard. After that, we design three algorithms to solve the proposed problem. Finally, we evaluate the proposed algorithms with four real-world data sets. The experimental results demonstrate the performance of the proposed algorithms. Our contributions can be summarized as follows.

• We take the propagation time delay into consideration and explore the value of influence efficiency. Then we propose a novel problem of maximizing the expected influence efficiency in the network.

• We prove that the influence efficiency maximization problem is NP-hard and the efficiency function is submodular under the independent cascade model. Furthermore, the computation of influence efficiency under the independent cascade model is #P-hard.

• We design three algorithms to solve the proposed problem and verify those algorithms with real-world data sets. Experimental results show the performance of the proposed algorithms.

The rest of the paper is organized as follows. In Section 2, we review some representative milestone work. Section 3 gives the preliminaries and the problem statement. After that, we demonstrate the approximation guarantees and our proposed methods in Section 4. Then, we show our experimental results and analysis in Section 5. Finally, we conclude in Section 6.

2. RELATED WORK

The influence maximization problem is a hot topic in the information diffusion field. Chen et al. [2] give a comprehensive review of information and influence propagation in social networks, including much work on influence maximization. One strong motivation for studying influence maximization is viral marketing: a company wants to promote a new product through word-of-mouth effects in social networks. A cost-effective way to achieve this is to find influential individuals in the network and make them adopt the product by offering them product samples for free or at a discounted price. The influential individuals may then drive other people in the network to adopt the product, generating a potentially large cascade in the network. To achieve the goal of a viral marketing campaign, there is a need for an efficient way to identify the influential individuals in the network under a certain diffusion model. The influence maximization problem was first proposed by Domingos and Richardson [7] as a technique for viral marketing. Then, Kempe et al. [11] were the first to formulate influence maximization as an optimization problem. Furthermore, Kempe et al. show that the influence maximization problem is NP-hard under the independent cascade model and the linear threshold model. Based on submodularity, an approximation guarantee is shown for the greedy algorithm framework for influence maximization. An improved algorithm called CELF [13] was proposed to reduce unnecessary simulations; it utilizes the lazy evaluation technique for submodular functions, implemented with a priority queue. Goyal et al. [9] propose an advanced algorithm called CELF++, which enhances the performance of the former algorithm. Cheng et al. [6] focus on solving the scalability-accuracy dilemma and propose a static greedy algorithm called StaticGreedy that guarantees the submodularity of the influence function during the information diffusion process, which dramatically reduces the computational expense without loss of accuracy. Borgs et al. [1] propose a novel algorithm for influence maximization using the reverse influence sampling method; it does not follow the greedy framework and boosts the efficiency of solving the problem. Based on the influence maximization problem, there are many extended problems in previous work. In the traditional influence maximization problem, the size of the initial set is bounded by a number k > 0. Leskovec et al. [13] suppose each selection of a node has a cost and there is a budget that we can spend for selecting a set of nodes; the purpose is to influence the most nodes subject to the budget. Several works [21, 5] study the influence maximization problem in dynamic networks. Yang et al. [20] formulate the general continuous influence maximization problem and develop a general coordinate descent algorithm to solve it. Wang et al. [19] propose a new problem called Information Coverage Maximization, taking both activated nodes and informed (but still inactive) nodes into consideration. Liu et al. [15] take propagation time into account and propose the time constrained influence maximization problem. Although the influence maximization problem has been extensively explored, this practical problem still cannot be answered comprehensively. In the traditional influence maximization problem, only the influence extent is taken into consideration, while the propagation time delay is also an important factor in information diffusion. As a result, we formulate the Influence Efficiency Maximization problem, in which we try to find a set of nodes maximizing the influence efficiency, a quantity related to the propagation time delay in the network.

3. PROBLEM STATEMENT

We first review the independent cascade model and the influence maximization problem under that model in Section 3.1. Next, we introduce several methods to solve the influence maximization problem and analyze those algorithms, including the monotonicity and submodularity of the expected influence function. After that, we present a novel problem called the Influence Efficiency Maximization problem in Section 3.2. Then we formulate the influence efficiency maximization problem and illustrate the differences between the influence maximization problem and the influence efficiency maximization problem. For ease of reference, Table 1 lists the frequently used notations.
3.1 Diffusion Model and Influence Maximization Problem
In this work, we study social influence under a widely adopted model, the independent cascade model. Under that model, a social network is modeled as a directed graph G = (V, E), where V corresponds to the individuals and E represents the social links between them. Moreover, each edge (u, v) ∈ E is associated with a propagation probability p^G_{u,v} indicating the strength of influence from individual u to v. When G is clear from the context, we simply write p_{u,v} to keep the notation light. The independent cascade model describes an intuitive diffusion process in which an individual in a social network is influenced by its neighbors independently. Given a seed set S ⊆ V, the independent cascade model works as follows. Let S_t be the set of nodes activated at step t ≥ 0, with S_0 = S. At step t + 1, every node u ∈ S_t may activate each of its out-neighbors v ∈ V \ ∪_{0≤i≤t} S_i with an independent probability p_{u,v}. Once u has made all these attempts to activate its out-neighbors, it makes no further activation attempts at later steps. The whole process ends at a step t when S_t = ∅.

Table 1: Notations used frequently
G = (V, E): a social network G, where V is the node set and E is the edge set
H = (V, Z): a hypergraph H generated from G (see Algorithm 3), where V is the node set and Z is the hyperedge set
n: the number of nodes in G and H
m: the number of edges in G
k: the size of the seed set
p^G_{u,v}: the probability that a node u may activate a node v
I(S): the influence spread of the seed set S
RR(v): the reverse reachable set of the node v (see Definition 3.1)
e_{u,v}: the influence efficiency from the node u to the node v (see Equation 4)
T(S): the influence efficiency function in graph G (see Equation 5)
T'(S): the influence efficiency function in graph H (see Algorithm 3)

[Figure 1: The tiny social network G: node v1 points to v2 with propagation probability 1, and v2 points to each of v3, v4, v5, v6 with propagation probability 0.8.]

Given a seed set S, let EG[I(S)] denote the expected influence spread of S in G; it equals the expected number of activated nodes when the diffusion process ends. The influence maximization problem under the independent cascade model aims at targeting a seed set S of size at most k to maximize the expected influence function EG[I(S)]. Given an input k, the influence maximization problem can be formulated as follows:

    S∗ = arg max EG[I(S)]  s.t.  S ⊆ V, |S| = k    (1)

Given the input k = 1 and the tiny social network G in Fig. 1, consider the influence maximization problem. We can compute the expected influence spread of each node in G. For node v1, the expected influence spread EG[I(v1)] = 1 + 1 + 4 × 0.8 = 5.2. For node v2, EG[I(v2)] = 1 + 4 × 0.8 = 4.2. For the other nodes, EG[I(v3)] = EG[I(v4)] = EG[I(v5)] = EG[I(v6)] = 1. Thus, the optimal solution to the influence maximization problem in G is S∗ = {v1} when k = 1. It is shown by Kempe et al. in [11] that the influence maximization problem under the independent cascade model is NP-hard, so it is difficult to compute the influential nodes directly. Fortunately, the expected influence function EG[I(S)] under the independent cascade model is monotone and submodular [11]; these two properties allow approximation algorithms to discover the influential nodes. Formally, a monotone function satisfies f(S ∪ {u}) ≥ f(S) for all elements u and all sets S, and a submodular function satisfies f(S ∪ {u}) − f(S) ≥ f(W ∪ {u}) − f(W) for all elements u and all pairs of sets S ⊆ W. A simple greedy algorithm, due to Nemhauser et al. [17], can be applied to the influence maximization problem. The algorithm starts from an empty seed set S = ∅, then repeatedly chooses the node u with the largest marginal gain of EG[I(S)] and adds it to the current seed set until the size of the seed set is k:

    u = arg max_{w ∈ V\S} (EG[I(S ∪ {w})] − EG[I(S)])    (2)
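As an illustration, the diffusion process and the greedy rule in Eq. (2) can be sketched with EG[I(S)] estimated by plain simulation on the toy network of Fig. 1. This is a minimal sketch, not the paper's implementation; the function names and the sample count r are illustrative choices.

```python
import random

# Toy network of Fig. 1: v1 -> v2 (p = 1), v2 -> v3..v6 (p = 0.8 each).
EDGES = {"v1": [("v2", 1.0)],
         "v2": [("v3", 0.8), ("v4", 0.8), ("v5", 0.8), ("v6", 0.8)]}
NODES = ["v1", "v2", "v3", "v4", "v5", "v6"]

def simulate_ic(seeds, rng):
    """One run of the independent cascade process; returns the activated set."""
    active, frontier = set(seeds), list(seeds)
    while frontier:
        nxt = []
        for u in frontier:
            for v, p in EDGES.get(u, []):
                if v not in active and rng.random() < p:
                    active.add(v)
                    nxt.append(v)
        frontier = nxt
    return active

def estimate_spread(seeds, rng, r=20000):
    """Monte Carlo estimate of EG[I(S)] over r simulations."""
    return sum(len(simulate_ic(seeds, rng)) for _ in range(r)) / r

def greedy(k, rng):
    """Greedy seed selection by the marginal-gain rule of Eq. (2)."""
    S = set()
    for _ in range(k):
        base = estimate_spread(S, rng) if S else 0.0
        best = max((w for w in NODES if w not in S),
                   key=lambda w: estimate_spread(S | {w}, rng) - base)
        S.add(best)
    return S

rng = random.Random(42)
print(estimate_spread({"v1"}, rng))  # close to 5.2, as in the worked example
print(greedy(1, rng))                # picks v1, matching S* = {v1}
```

With r = 20000 the estimate for {v1} concentrates tightly around 5.2, so the greedy choice for k = 1 reliably matches the analytic optimum above.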
It is proved that the greedy algorithm approximates the optimal solution within a factor of 1 − 1/e; that is, any solution S discovered by the greedy algorithm satisfies EG[I(S)] ≥ (1 − 1/e) · EG[I(S∗)]. The conception of the greedy algorithm is simple, but it is non-trivial to implement since the computation of EG[I(S)] is #P-hard [3]. To address that problem, Kempe et al. propose to estimate EG[I(S)] to a reasonable accuracy by a Monte Carlo method [11], which works as follows. Suppose that we flip a coin for each edge e ∈ E in G and remove the edge with probability 1 − p(e). Let g be the resulting graph and Rg(S) be the set of nodes in g that are reachable from the seed set S, where a node v is reachable from S if there is a path from S to v in g. Kempe et al. prove that the expected size of Rg(S) equals EG[I(S)] [11], that is to say,

    EG[I(S)] = E_{g∼G}[Ig(S)]    (3)

where Ig(S) = |Rg(S)|. Thus, we can estimate EG[I(S)] by estimating the expected size of Rg(S): first generate multiple instances g ∼ G, then measure Ig(S) on each instance, and finally take the average measurement as the estimate of EG[I(S)]. Assume that we take a large number r of measurements of g in the estimation of EG[I(S)]; the greedy algorithm then yields a (1 − 1/e − ε)-approximate solution under the independent cascade model [11], where ε is a constant that is related to
both G and r [1][12]. In general, Kempe et al. suggest setting r = 10,000, and most work following the influence maximization problem adopts a similar setting of r. Although the greedy algorithm is effective, it is extremely inefficient for large networks: it suffers from a time complexity of O(knmr). Specifically, it runs k iterations to select a seed set of size k, and each iteration requires estimating the expected influence spread of O(n) nodes. In addition, each estimation of the expected influence spread takes measurements on r graphs and each measurement costs O(m) time. Thus, the whole process takes O(knmr) total running time. For submodular functions, a well-known optimization technique called lazy evaluation can be applied to significantly reduce the number of measurements without changing the output of the greedy algorithm. Even with lazy evaluations, however, greedy algorithms are still inefficient. The main reason is that they require estimating the expected influence spread of O(kn) nodes; most of those O(kn) estimations are wasted, and that waste of computation is difficult to avoid under the greedy framework. Borgs et al. [1] propose a novel method to avoid this limitation of greedy algorithms. Youze Tang et al. refer to that method as Reverse Influence Sampling (RIS) and summarize how it works. To explain the RIS algorithm, a concept needs to be introduced first.

Definition 3.1 (Reverse Reachable Set). For the graph G = (V, E), flip a coin for each edge e ∈ E and remove the edge with probability 1 − p(e). Let g be the resulting graph. For a node v ∈ V, the reverse reachable (RR) set for v in g is the set of nodes that can reach v in g. That is to say, for each node u in the RR set there is at least one path from u to v in g.

By the definition, if a node u is in RR(v), u can reach v through a certain path in G. That means u has a positive probability to activate v if we use S = {u} as the seed set and run a diffusion process on G. Based on that observation, the RIS algorithm works in two steps. First, select a node v ∈ V uniformly at random and generate the reverse reachable set RR(v); repeat that process to generate multiple instances of RR sets. In the second step, select k nodes to cover the maximum number of RR sets, where we say u covers RR(v) if and only if u ∈ RR(v). The simple greedy algorithm is utilized to derive a (1 − 1/e)-approximate solution S to this coverage problem, and S is returned as the final result. The main idea of the RIS algorithm is as follows. If a node u appears in a large number of RR(v) sets, it has a higher probability to activate more nodes. Moreover, since the node v is selected uniformly from V, u would activate more nodes in G, that is to say, I(u) should be large. So if the seed set S covers the most RR(v) sets, S would have the maximum expected influence spread in G. Compared with greedy algorithms, RIS is more efficient as it avoids the wasted estimations of the expected influence spread of O(kn) nodes. The key point of RIS is using the generated RR sets instead of estimations over the diffusion process. To balance the effectiveness and the efficiency of the RIS algorithm, we need to control the number of RR sets generated. Borgs et al. prove that in order to get a (1 − 1/e − ε)-approximate solution to the influence maximization problem under the independent cascade model, the number of RR sets should be set to Θ(k(m + n) log n / ε³) [1].

3.2 Influence Efficiency Maximization Problem

The traditional influence maximization problem aims at finding influential nodes to obtain the largest influence spread; it does not consider the propagation time delay. However, in real-world social networks, people care about not only the influence spread but also the influence efficiency. For example, in viral marketing it is important that potential customers get informed of new products as soon as possible. As a result, a new problem arises: how can we target a seed set to obtain the maximum influence efficiency?

Definition 3.2 (Influence Efficiency). If there is a path from node u to node v such that every node in the path is activated when the diffusion process ends, we call the path a live path. There can be more than one live path from node u to node v. Given a seed node u, when the diffusion process ends, for each node v ∈ V, if there is no live path from u to v, the influence efficiency from u to v is 0; otherwise the influence efficiency is formulated as follows:

    e_{u,v} = 1 / (d_{u,v} + 1)    (4)

where d_{u,v} is the length of the shortest live path (i.e., the propagation time delay) from node u to node v, and d_{u,u} = 0 for each node u. The influence efficiency function T(S) is then formulated as follows:

    T(S) = Σ_{v ∈ V} 1 / (d_{S,v} + 1)    (5)
where d_{S,v} is the length of the shortest live path from the seed set S to v. Let EG[T(S)] denote the expected influence efficiency function under the independent cascade model. The influence efficiency maximization problem under the independent cascade model aims at targeting a seed set S of size at most k to maximize the expected influence efficiency function EG[T(S)]. Given an input k, the influence efficiency maximization problem can be formulated as follows:

    S∗ = arg max EG[T(S)]  s.t.  S ⊆ V, |S| = k    (6)
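Analogously to the spread case, EG[T(S)] can be estimated by sampling live-edge graphs and taking breadth-first-search distances from the seed set. The following is a minimal sketch on the toy network of Fig. 1; the names and the sample count are illustrative, not the paper's implementation.

```python
import random
from collections import deque

# Toy network of Fig. 1: v1 -> v2 (p = 1), v2 -> v3..v6 (p = 0.8 each).
EDGES = {"v1": [("v2", 1.0)],
         "v2": [("v3", 0.8), ("v4", 0.8), ("v5", 0.8), ("v6", 0.8)]}

def efficiency_once(seeds, rng):
    """Sample one live-edge graph g and return T_g(S) = sum over reached v of 1/(d(S,v)+1)."""
    # Keep each edge independently with its propagation probability.
    live = {u: [v for v, p in nbrs if rng.random() < p] for u, nbrs in EDGES.items()}
    dist = {s: 0 for s in seeds}
    q = deque(seeds)
    while q:  # BFS gives shortest live-path distances from the seed set
        u = q.popleft()
        for v in live.get(u, []):
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return sum(1.0 / (d + 1) for d in dist.values())

def estimate_efficiency(seeds, rng, r=20000):
    """Monte Carlo estimate of EG[T(S)]."""
    return sum(efficiency_once(seeds, rng) for _ in range(r)) / r

rng = random.Random(7)
print(estimate_efficiency({"v2"}, rng))  # close to 2.6
```

The estimate for {v2} concentrates around 1 + 4 × 0.8 × (1/2) = 2.6, matching the worked example below.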
For example, given the input k = 1 and the social network G in Fig. 1, consider the influence efficiency maximization problem. We can compute the expected influence efficiency of each node in G. For v1, EG[T(v1)] = 1 × 1 + 1 × (1/2) + 4 × 0.8 × (1/3) ≈ 2.57. For node v2, EG[T(v2)] = 1 × 1 + 4 × 0.8 × (1/2) = 2.6. For the other nodes, EG[T(v3)] = EG[T(v4)] = EG[T(v5)] = EG[T(v6)] = 1. Thus, the optimal solution to the influence efficiency maximization problem in G when k = 1 is S∗ = {v2}. Compared with the optimal solution to the influence maximization problem, we can see that the node set with the maximum expected influence spread is not necessarily the node set with the maximum expected influence efficiency: in Fig. 1 with k = 1, {v1} offers the maximum expected influence spread while {v2} offers the maximum expected influence efficiency. To address the influence efficiency maximization problem, we can also use a Monte Carlo method to estimate EG[T(S)]. Let g be a graph obtained by removing each edge e ∈ E
with a probability 1 − p(e). We can then formulate the expected influence efficiency under the independent cascade model as follows:

    EG[T(S)] = E_{g∼G}[Tg(S)]    (7)

where Tg(S) is the influence efficiency of the seed set S on g. In Section 3, we reviewed the definition of the influence maximization problem and some properties of the expected influence spread function; monotonicity and submodularity allow approximation algorithms to solve the influence maximization problem. Based on the influence maximization problem, we take the propagation time delay into account and define the influence efficiency maximization problem.

4. APPROXIMATION GUARANTEES AND PROPOSED METHOD

In this section, we first analyze the hardness of the influence efficiency maximization problem. After that, we describe our strategy for proving approximation guarantees. Then we propose the Reverse Efficiency Sampling (RES) algorithm to solve the influence efficiency maximization problem. As we know, the influence maximization problem is NP-hard under the independent cascade model. Now, we show the hardness of the influence efficiency maximization problem.

Theorem 4.1. The influence efficiency maximization problem is NP-hard under the Independent Cascade model.

Proof. Consider an instance of the Set Cover problem, which is NP-complete. The problem is defined as follows: given a collection of subsets S1, S2, · · · , Sm of a ground set U = {u1, u2, · · · , un}, we wish to know whether there exist k of the subsets whose union equals U. We show that the Set Cover problem can be viewed as a special case of the influence efficiency maximization problem. Given an arbitrary instance of the Set Cover problem, we define a corresponding directed bipartite graph with n + m nodes: a node i corresponding to each subset Si and a node j corresponding to each element uj ∈ U. For each uj ∈ Si, there is a directed edge (i, j) with propagation probability pi,j = 1 from i to j. The Set Cover problem is equivalent to deciding whether there is a set A of k nodes in that bipartite graph with EG[T(A)] ≥ k + n/2. The diffusion process is deterministic, as the propagation probability is 0 or 1 for each edge. Initially activating k nodes is equivalent to selecting k subsets in the Set Cover problem, and activating all n element nodes is equivalent to covering the ground set U. Thus, if any set A of k nodes satisfies EG[T(A)] ≥ k + n/2, the Set Cover problem must be solvable.

One important issue is that computing the influence efficiency of a seed set is time consuming: it takes O(mr) time to evaluate the expected influence efficiency of a seed set, and there is no efficient way to calculate the exact expected influence efficiency EG[T(S)] given a seed set S. We prove that the computation is #P-hard by showing a reduction from the counting problem of s-t connectedness in a directed graph.

Theorem 4.2. Computing the influence efficiency T(S) given a seed set S is #P-hard under the independent cascade model.

Proof. We prove the theorem by a reduction from the counting problem of s-t connectedness in a directed graph. Given a directed graph G = (V, E) and two vertices s and t, the s-t connectedness problem is to count the number of subgraphs of G in which s is connected to t. Let p^G_{s,t} denote the probability that s is connected to t in G. It is easy to see that the s-t connectedness problem is equivalent to calculating p^G_{s,t} when each edge in G is independently present with probability 1/2 and absent with probability 1/2. We reduce that problem to the influence efficiency computation problem as follows. Let T(S) denote the influence efficiency in G given a seed set S. First, let S = {s} and p^G_{u,v} = 1/2 for each edge (u, v) ∈ E; the influence efficiency of S in graph G is denoted T0(S). Next, we add a new node t1 and a directed edge from t to t1 to G. We get a new graph G1 with p^{G1}_{t,t1} = 1, and the influence efficiency of S in G1 is denoted T1(S), where T1(S) = T0(S) + p^{G1}_{t,t1} · Σ_{i=1}^{n} p^G_{s,t,d=i} · 1/(i + 1) = T0(S) + Σ_{i=1}^{n} p^G_{s,t,d=i} · 1/(i + 1). Here p^G_{s,t,d=i} represents the probability that s is connected to t at a distance of i, and n = |V|. Next, we add another new node t2 and a directed edge from t1 to t2 to G1. We get a new graph G2 with p^{G2}_{t1,t2} = 1, and the influence efficiency of S in G2 is denoted T2(S), where T2(S) = T1(S) + p^{G2}_{t1,t2} · p^{G1}_{t,t1} · Σ_{i=1}^{n} p^G_{s,t,d=i} · 1/(i + 2) = T1(S) + Σ_{i=1}^{n} p^G_{s,t,d=i} · 1/(i + 2). We repeat this procedure iteratively to obtain n influence efficiencies from T1(S) to Tn(S), where Tj(S) = Tj−1(S) + Σ_{i=1}^{n} p^G_{s,t,d=i} · 1/(i + j). Those influence efficiency equations can be represented as follows:
    [ 1/(1+1)  1/(2+1)  ···  1/(n+1) ] [ p^G_{s,t,d=1} ]   [ T1(S) − T0(S)   ]
    [ 1/(1+2)  1/(2+2)  ···  1/(n+2) ] [ p^G_{s,t,d=2} ]   [ T2(S) − T1(S)   ]
    [   ···      ···    ···    ···   ] [      ···      ] = [       ···       ]    (8)
    [ 1/(1+n)  1/(2+n)  ···  1/(n+n) ] [ p^G_{s,t,d=n} ]   [ Tn(S) − Tn−1(S) ]

For convenience, let A denote the matrix in Equation 8, let x = (p^G_{s,t,d=1}, · · · , p^G_{s,t,d=n})^T, and let the right-hand side be b = (T1(S) − T0(S), · · · , Tn(S) − Tn−1(S))^T, so that Equation 8 can be written as Ax = b. The matrix A is nonsingular, as proved in Lemma 4.1, so the linear system can be solved exactly in polynomial time by Gaussian elimination. Furthermore, the linear system is constructed in polynomial time, and it is straightforward to see that p^G_{s,t} = Σ_{i=1}^{n} p^G_{s,t,d=i}; that is to say, the probability that s is connected to t can be computed in polynomial time. Thus we solve the s-t connectedness counting problem. Since the s-t connectedness counting problem is #P-complete, the influence efficiency computation problem is #P-hard.
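To illustrate the solvability claim in the proof, the system Ax = b with a_ij = 1/(i + j) can be solved exactly in rational arithmetic. The following is a small self-contained sketch; the vector x_true is a made-up example standing in for the distance-probability vector, not data from the paper.

```python
from fractions import Fraction

def solve_exact(A, b):
    """Gaussian elimination with pivoting in exact rational arithmetic."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = next(r for r in range(col, n) if M[r][col] != 0)  # A is nonsingular
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [Fraction(0)] * n
    for r in range(n - 1, -1, -1):  # back substitution
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

n = 6
A = [[Fraction(1, i + j) for i in range(1, n + 1)] for j in range(1, n + 1)]
x_true = [Fraction(i, 7) for i in range(1, n + 1)]  # made-up probability vector
b = [sum(A[r][c] * x_true[c] for c in range(n)) for r in range(n)]
x = solve_exact(A, b)
print(x == x_true)  # True: x is recovered exactly
```

Because every entry stays a Fraction, there is no numerical error, which mirrors the exact polynomial-time solvability used by the reduction.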
Next, we show that the matrix A is nonsingular. The lemma and its proof are as follows.

Lemma 4.1. The matrix A ∈ R^{n×n} with a_ij = 1/(i + j) is nonsingular.

Proof. Let ∆n denote the determinant of the matrix A:

    ∆n = |A| = det( 1/(i + j) ), 1 ≤ i, j ≤ n    (9)

Subtracting row 1 from each row i for 2 ≤ i ≤ n gives entries a_ij = 1/(i + j) − 1/(1 + j) = (1 − i) / ((i + j)(1 + j)). Factoring (1 − i) out of each row i ≥ 2 and 1/(1 + j) out of each column j, ∆n can be written as

    ∆n = ( Π_{2≤i≤n} (1 − i) / Π_{1≤j≤n} (1 + j) ) · |B|    (10)

where b_1j = 1 for every j and b_ij = 1/(i + j) for i ≥ 2. Next, subtracting column 1 from each column j for 2 ≤ j ≤ n in B gives b_ij = 1/(i + j) − 1/(i + 1) = (1 − j) / ((i + j)(i + 1)) for i ≥ 2, while the first row becomes (1, 0, · · · , 0). Factoring (1 − j) out of each column j ≥ 2 and 1/(i + 1) out of each row i ≥ 2, and expanding along the first row, we obtain

    ∆n = λn · ∆n−1    (11)

where λn is nonzero and ∆n−1 is a determinant of the same form, of order n − 1 with indices shifted by one. So we can repeat this procedure iteratively, and ∆n is computed as follows:

    ∆n = ( Π_{k=1}^{n−1} λk ) · ∆1,  where  λk = ( Π_{k+1≤i≤n} (k − i) · Π_{k+1≤j≤n} (k − j) ) / ( Π_{k≤j≤n} (k + j) · Π_{k+1≤i≤n} (k + i) )  and  ∆1 = 1/(2n)    (12)

Every factor λk is nonzero, so it is obvious that ∆n ≠ 0 and the matrix A is nonsingular.

Nemhauser et al. [17] prove that for a non-negative, monotone and submodular function f(·), the greedy algorithm yields a (1 − 1/e) approximation. So an approximation guarantee for the influence efficiency maximization problem under the independent cascade model will be a consequence of the following.

Theorem 4.3. For an arbitrary instance of the independent cascade model in a graph G, the expected influence efficiency function EG[T(·)] is submodular.

It is easy to see that EG[T(·)] is monotone, as adding any node u to an arbitrary set S cannot decrease the influence efficiency; that is, EG[T(S ∪ {u})] ≥ EG[T(S)] for any node u and any set S. In order to establish the result in Theorem 4.3, we need to analyze the expression EG[T(S ∪ {u})] − EG[T(S)] for arbitrary sets S and elements u; that is, we want to figure out what marginal gain in expected influence efficiency we get when adding u to the set S. That marginal gain is very difficult to analyze directly, because the diffusion process is underspecified and the activation order is uncertain. To tackle those difficulties, we formulate an equivalent view of the diffusion process [11] that does not depend on the activation order; it provides an alternative way to prove the submodularity of EG[T(·)]. Consider a moment in the diffusion process under the independent cascade model when a node u has just been activated and attempts to activate its out-neighbor v with probability p_{u,v}. We can simulate this random event by flipping a coin of bias p_{u,v}; intuitively, it does not matter whether the coin is flipped at the moment u is activated, or at the very beginning of the whole diffusion process and only revealed at the moment u is activated. Thus, we can flip a coin of bias p_{u,v} for each edge e = (u, v) in G at the very beginning of the diffusion process and check it at the moment when u is activated and v is inactive. With all the coins flipped in advance, the diffusion process can be viewed as follows. The coin flip for each edge e = (u, v) indicates an activation attempt: if the attempt is successful, we declare e a live edge. Furthermore, we call a path from u to v a live-edge path if and only if every edge in that path is a live edge. With the outcomes of the coin flips fixed, it is easy to compute the influence efficiency of a seed set S under the independent cascade model: the node v ends up active if and only if there is at least one live-edge path from S to v, and we can compute the influence efficiency e_{S,v} according to the length of the shortest live-edge path from S to v. Consider the probability space in which each sample point stands for one possible outcome of all the coin flips on the edges. Let X denote a sample point in that space and define TX(S) as the influence efficiency of the seed set
Page 6 of 17
S with a fixed X, TX (S) is computed as Equation 5. Then we make the proof of Theorem 4.3 as follows.
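As an illustration of this fixed-outcome view (a minimal sketch, not the authors' implementation, which is in Java), the following computes TX(S) from a set of pre-flipped live edges: a multi-source breadth-first search from S yields the shortest live-edge distance d_{S,v} for every reachable node, and unreachable nodes contribute efficiency 0.

```python
from collections import deque

def efficiency_fixed_outcome(live_edges, seeds):
    """T_X(S) for one fixed coin-flip outcome X.

    live_edges -- dict: node -> list of out-neighbours over LIVE edges only
    seeds      -- the seed set S (iterable of nodes)
    """
    dist = {s: 0 for s in seeds}      # d_{S,v}: shortest live-edge distance
    queue = deque(dist)
    while queue:                      # multi-source BFS from S
        u = queue.popleft()
        for v in live_edges.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    # unreachable nodes have infinite delay, hence efficiency 0
    return sum(1.0 / (d + 1) for d in dist.values())
```

Seed nodes themselves have delay 0 and thus contribute efficiency 1 each.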
Algorithm 2 LazyGreedy(k, T)
1: initialize S = ∅, priority queue Q = ∅, iteration i = 1
2: for each node u ∈ V do
3:   u.mg = T(u|∅), u.i = 1
4:   put u into Q with u.mg as the key
5: end for
6: while i ≤ k do
7:   pop the top element u of Q
8:   if u.i = i then
9:     S = S ∪ {u}, i = i + 1
10:  else
11:    u.mg = T(u|S), u.i = i
12:    put u back into Q with u.mg as the key
13:  end if
14: end while
15: return S
Proof. First, we claim that for any fixed sample point X in the probability space, TX(·) is submodular. To prove this, let S and W be two sets of nodes satisfying S ⊆ W ⊆ V, and let u ∈ V be a node in G. Let δ(u|S) denote the marginal gain in influence efficiency when adding the node u to the set S, i.e., δ(u|S) = TX(S ∪ {u}) − TX(S). We then compare the value of δ(u|S) with the value of δ(u|W). To that end, let RX(u) denote the set of nodes reachable from u under a fixed outcome X. Clearly RX(S) ⊆ RX(W), since S ⊆ W. There are three situations for a node v that can potentially update the influence efficiency eS,v or eW,v when the node u is added to S or W, as shown in Fig. 2. Situation 1 is the slash area, where v ∈ RX(u) \ RX(W). In that situation, there is no live-edge path from the seed set S or W to v before adding the node u, and the node u does offer the shortest live-edge path to v. For a node v in situation 1, eS,v = eW,v = 0 according to Definition 3.2. As such nodes contribute eu,v − eS,v to δ(u|S) and eu,v − eW,v to δ(u|W), we see that δ(u|S) = δ(u|W) in situation 1. Situation 2 is the backslash area, where v ∈ (RX(u) ∩ RX(W)) \ RX(S). In that situation, there is at least one live-edge path from W to v while there is no live-edge path from S to v; that is, eW,v > 0 and eS,v = 0. If u offers the shortest live-edge path to v for the set W ∪ {u}, then δ(u|S) > δ(u|W) because eW,v > 0 and eS,v = 0. If instead W offers the shortest live-edge path to v for the set W ∪ {u}, the marginal gain for v does not increase when adding u to W; in that case δ(u|W) = 0 while δ(u|S) > 0. Thus δ(u|S) > δ(u|W) in situation 2. Situation 3 is the cross area, where v ∈ RX(u) ∩ RX(S). In that situation, there is at least one live-edge path from both S and W to v, and since S ⊆ W, eW,v ≥ eS,v > 0.
If u offers the shortest live-edge path to v for both S ∪ {u} and W ∪ {u}, we get δ(u|S) ≥ δ(u|W). If u offers the shortest live-edge path to v only for the set S ∪ {u}, then δ(u|W) = 0 while δ(u|S) ≥ 0. If u brings no marginal gain for either S or W, then δ(u|S) = δ(u|W) = 0. Note that it is impossible for u to offer the shortest live-edge path for the set W ∪ {u} while S offers the shortest live-edge path for the set S ∪ {u}: since S ⊆ W, every live-edge path from S to v is also a live-edge path from W to v, so the shortest such path from S is no shorter than that from W. Thus δ(u|S) ≥ δ(u|W) in situation 3. In conclusion, δ(u|S) ≥ δ(u|W) in every situation; that is, TX(S ∪ {u}) − TX(S) ≥ TX(W ∪ {u}) − TX(W) for S ⊆ W. Thus TX(·) is submodular. Finally, we have

EG[T(S)] = Σ_X Pr[X] · TX(S),

since the expected influence efficiency function is a weighted average over all the outcomes X. A non-negative linear combination of submodular functions is also submodular, so the expected influence efficiency function EG[T(·)] is submodular.

Figure 2: A comprehensive example of RX(u), RX(S) and RX(W)

Since the expected influence efficiency function EG[T(·)] is monotone and submodular, a simple greedy algorithm (called greedy) solves the influence efficiency maximization problem with an approximation ratio of 1 − 1/e. The greedy algorithm is shown as Algorithm 1 and runs as follows. First, the seed set S is initialized to empty. Then it iteratively selects the node u with the maximum marginal gain in expected influence efficiency and adds it to S until the size of S equals k.

Algorithm 1 greedy(k, T)
1: initialize S = ∅
2: for i = 1 to k do
3:   u = arg max_{w∈V\S} (EG[T(S ∪ {w})] − EG[T(S)])
4:   S = S ∪ {u}
5: end for
6: return S
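The greedy selection of Algorithm 1 can be sketched as follows. This is an illustration, not the paper's code: `estimate_efficiency` stands for any estimator of EG[T(·)] (e.g. Monte Carlo simulation) and is an assumed input, and the names are ours.

```python
def greedy(k, nodes, estimate_efficiency):
    """Greedy seed selection: k rounds, each adding the node with the
    largest marginal gain in estimated expected influence efficiency."""
    S = set()
    for _ in range(k):
        base = estimate_efficiency(S)               # E_G[T(S)]
        best, best_gain = None, float("-inf")
        for w in nodes:
            if w in S:
                continue
            gain = estimate_efficiency(S | {w}) - base
            if gain > best_gain:
                best, best_gain = w, gain
        S.add(best)
    return S
```

With a modular (hence submodular) estimator such as a sum of per-node weights, this simply picks the k heaviest nodes.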
Moreover, we can utilize the lazy evaluation technique [16] to optimize the greedy algorithm, since the expected influence efficiency function is submodular. It significantly reduces the number of evaluations without changing the output of the greedy algorithm. As its name suggests, the idea of lazy evaluation is to avoid evaluations that are not necessary. Given a monotone and submodular function f, let f(u|S) denote the marginal gain when adding u to the set S, that is, f(u|S) = f(S ∪ {u}) − f(S). Suppose that the current seed set is S in the i-th iteration of the greedy algorithm and we have evaluated f(u|S) for some node u ∈ V \ S, and that in an earlier iteration with seed set S′ ⊂ S we evaluated f(w|S′) for some node w ∈ V \ S′. If f(w|S′) ≤ f(u|S), then by submodularity f(w|S) ≤ f(w|S′) ≤ f(u|S). That means that in the i-th iteration the node w cannot be selected, so it is not necessary to evaluate f(w|S). This idea can be easily implemented with a priority queue, as in Algorithm 2. For each node u, we maintain a structure containing two fields, u.mg and u.i: u.mg stores the latest marginal gain in expected influence efficiency when adding u, and u.i records the iteration in which u.mg
Algorithm 3 RES(k, r, T′(·))
1: initialize H = (V, ∅)
2: for i = 1 to r do
3:   select a node v ∈ V uniformly at random
4:   generate zi = RR(v)
5:   calculate eu,v = 1/(du,v + 1) for each u ∈ RR(v)
6:   add zi to the hyperedge set Z of H
7: end for
8: initialize S = ∅
9: for j = 1 to k do
10:  w = arg max_{v∈V} (T′(S ∪ {v}) − T′(S))
11:  add w to S
12:  remove w from V
13: end for
14: return S

In the second step, we calculate the seed set S based on the hypergraph H. That is done by repeatedly selecting the node w ∈ V with the maximum marginal gain T′(S ∪ {w}) − T′(S), where T′(S) = Σ_{i=1}^{r} 1/(d_{S,vi} + 1) and vi denotes the node selected uniformly from V for zi in the i-th iteration; the selected node w is then removed from V. At last, the set S of k nodes is the resulting seed set.

We now analyze Algorithm 3 in detail. First, we observe that the expected influence efficiency of a seed set S equals n times the influence efficiency eS,v from the seed set S to a node v selected uniformly at random on a graph g generated stochastically from G. It is formulated as follows.

Theorem 4.4. Eg∼G[Tg(S)] = n · Prv,g∼G[S ∩ RRg(v) ≠ ∅] · 1/(dS,v + 1).

Proof.

Eg∼G[Tg(S)] = Σ_{v∈V} Prg∼G[∃u ∈ S, v ∈ Rg(u)] · 1/(dS,v + 1)
            = Σ_{v∈V} Prg∼G[∃u ∈ S, u ∈ RRg(v)] · 1/(dS,v + 1)
            = n · Prv,g∼G[∃u ∈ S, u ∈ RRg(v)] · 1/(dS,v + 1)
            = n · Prv,g∼G[S ∩ RRg(v) ≠ ∅] · 1/(dS,v + 1)
is updated. Initially, we compute T(u|∅) for each node u and put it into a priority queue Q with u.mg as the key. Then, in every iteration, we pop the first entry of Q, which is the entry with the largest marginal gain, and check u.i: if it equals the current iteration i, then u is indeed the node with the largest marginal gain and we add u to the seed set S. If u.i ≠ i, we update the marginal gain u.mg to T(u|S), set u.i to the current iteration i, and put u back into the priority queue Q. The LazyGreedy algorithm significantly reduces the number of evaluations; Leskovec et al. [13] show empirically that it achieves speedups of about 700 times on optimization problems related to influence maximization. Due to the monotonicity and submodularity of the expected influence efficiency function, the LazyGreedy algorithm likewise reduces the number of evaluations and speeds up influence efficiency maximization. Next we consider the hardness of influence efficiency maximization. Theorem 4.2 shows that computing the expected influence efficiency is hard. Several works [13, 9, 4] make improvements within the greedy framework, but those algorithms are still not efficient; heuristic algorithms can tackle the efficiency problem but cannot offer guarantees. We borrow the idea of Borgs et al. [1], the RIS algorithm mentioned above, to overcome this difficulty.
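The lazy evaluation scheme walked through above can be sketched as follows. This is an illustrative re-implementation, not the authors' Java code: Python's `heapq` is a min-heap, so keys are negated, and the marginal-gain oracle `gain(u, S)` = T(u|S) is an assumed input.

```python
import heapq

def lazy_greedy(k, nodes, gain):
    """LazyGreedy: re-evaluate a node's marginal gain only when it
    surfaces at the top of the queue with a stale iteration stamp."""
    S = set()
    # heap entries: (-marginal_gain, node, iteration_when_gain_was_computed)
    heap = [(-gain(u, set()), u, 1) for u in nodes]
    heapq.heapify(heap)
    i = 1
    while len(S) < k and heap:
        neg_mg, u, stamp = heapq.heappop(heap)
        if stamp == i:        # gain is fresh: u really is the best node
            S.add(u)
            i += 1
        else:                 # stale: recompute against current S, push back
            heapq.heappush(heap, (-gain(u, S), u, i))
    return S
```

By submodularity, a stale key is an upper bound on the true marginal gain, which is what makes skipping the other re-evaluations safe.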
The Reverse Efficiency Sampling (RES) algorithm is described as Algorithm 3 and runs in two steps. The first step builds the hypergraph H = (V, Z), in which V is the same as the node set of graph G and Z is the set of hyperedges. Each hyperedge zi ∈ Z represents a reverse reachable set RR(v) for a node v ∈ V selected uniformly at random; the hyperedge zi can be written as a set (u1, u2, ..., uj), meaning that there is a directed path from each u ∈ zi to v. The set of hyperedges Z is constructed by repeatedly generating g based on the graph G and computing RR(v) for a random node v. This is done by simulating the diffusion process on the transpose of G: it begins at the random node v and activates out-neighbors with the corresponding probabilities via breadth-first search. When the diffusion process terminates, the set of activated nodes becomes a hyperedge in H. Moreover, we record the efficiency eu,v = 1/(du,v + 1) for each node u ∈ zi. We repeat this process, building hyperedges until the size of Z reaches r.
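The two steps of RES can be sketched as follows. This is an illustration under our own naming, not the authors' Java code: reverse reachable sets are sampled by breadth-first search on the transpose graph (so each set also carries the live-edge distance du,v), and nodes are then picked greedily by their marginal gain in T′(S) = Σ_i 1/(d_{S,vi} + 1); `in_neighbors` and the edge-probability map are assumed inputs.

```python
import random
from collections import deque

def reverse_reachable_set(in_neighbors, prob, v):
    """One RR set: BFS on the transpose of G from v, keeping each incoming
    edge (u, w) alive with probability prob[(u, w)]. Returns {u: d_{u,v}}."""
    dist = {v: 0}
    queue = deque([v])
    while queue:
        w = queue.popleft()
        for u in in_neighbors.get(w, ()):
            if u not in dist and random.random() < prob[(u, w)]:
                dist[u] = dist[w] + 1
                queue.append(u)
    return dist

def res(k, nodes, in_neighbors, prob, r):
    """RES sketch: sample r RR sets, then greedily select k nodes by
    marginal gain of T'(S) = sum over RR sets of 1/(d_{S,v_i} + 1)."""
    rr_sets = [reverse_reachable_set(in_neighbors, prob, random.choice(nodes))
               for _ in range(r)]
    covered = [0.0] * r          # best efficiency 1/(d+1) offered so far
    S = set()
    for _ in range(k):
        def marginal(u):
            return sum(max(0.0, 1.0 / (z[u] + 1) - covered[i])
                       for i, z in enumerate(rr_sets) if u in z)
        w = max((u for u in nodes if u not in S), key=marginal)
        S.add(w)
        for i, z in enumerate(rr_sets):
            if w in z:
                covered[i] = max(covered[i], 1.0 / (z[w] + 1))
    return S
```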
Theorem 4.4 shows that we can estimate EG[T(S)] by estimating the probability of the event S ∩ RRg(v) ≠ ∅. The degree of a node u ∈ V in H equals the number of times that a uniformly selected node v is influenced by u in the diffusion process, so we can calculate the expected influence efficiency based on the hypergraph H. Turning to the influence efficiency function T′(·) on the hypergraph H, we have the following theorem.

Theorem 4.5. For an arbitrary instance of the independent cascade model, the influence efficiency function T′(·) is submodular.
Proof. It is easy to show that T′(·) is monotone, as adding any node u to an arbitrary set S cannot decrease T′(·). For submodularity, let S and W be two sets of nodes satisfying S ⊆ W ⊆ V, and let u ∈ V be a node in H. We say a set S covers zi when S ∩ zi ≠ ∅. There are three situations when a node u is added to the set S and the set W. The first situation is that neither S nor W covers the hyperedge zi, which is generated from the node vi and a stochastic graph g. In this situation, T′(S ∪ {u}) − T′(S) = T′(W ∪ {u}) − T′(W), as only u can offer a marginal gain on zi. The second situation is that S does not cover zi while W covers zi. If eW∪{u},vi = eW,vi, then u does not update the influence efficiency for W, and thus T′(S ∪ {u}) − T′(S) ≥ T′(W ∪ {u}) − T′(W). If eW∪{u},vi > eW,vi, then u offers the shortest path to vi, so eS∪{u},vi = eW∪{u},vi; since S ⊆ W implies eS,vi ≤ eW,vi, we get T′(S ∪ {u}) − T′(S) > T′(W ∪ {u}) − T′(W). Note that, because S ⊆ W, it is not possible that S covers zi while W does not. The third situation is that both S and W cover zi, which means there are already paths from both S and W to vi. If u updates neither eS∪{u},vi nor
5. EXPERIMENT
Due to the monotonicity and submodularity of the function T′(·), RES yields a (1 − 1/e − ε)-approximate solution to the influence efficiency maximization problem under the independent cascade model, where ε is related to the number of samples r and the graph G. It is an open question to characterize the exact relationship between ε and r, G. Now let us consider the time complexity of the algorithms mentioned above. Both greedy and LazyGreedy are based on the greedy framework: they run k iterations to select a seed set of size k, and each iteration requires estimating the expected influence efficiency of O(n) nodes. In addition, each estimate of the expected influence efficiency of a node takes measurements on r generated graphs g, and each measurement consumes O(m) time. Thus, both greedy and LazyGreedy take O(knmr) total running time. However, LazyGreedy avoids unnecessary estimations via lazy evaluation, so it runs faster than greedy in practice. The RES algorithm is designed in a different framework. First, RES generates r hyperedges for H; each hyperedge costs O(m) time, so it takes O(rm) time to generate the whole hypergraph H. Then, RES runs k iterations to select the seed set S of size k, and each iteration needs to estimate the expected influence efficiency of O(n) nodes, each estimate taking measurements on r hyperedges in H. So it takes O(knr) time to calculate the seed set S.
Facebook and Twitter are social networks. There is an edge e = (u, v) if users u and v are friends in Facebook, or if a user v follows another user u in Twitter. HepPh is the high energy physics phenomenology citation graph from the e-print arXiv: if a paper u cites a paper v, the graph contains a directed edge from u to v. In our experiments, however, we reverse the edges in HepPh to indicate that the paper v has influenced the paper u if u cites v. DBLP provides a comprehensive list of computer science papers; it yields a co-authorship network in which two authors are connected if they have published at least one paper together. These four networks are representative and cover a variety of relationships, and their sizes range from tens of thousands of edges to millions of edges. Diffusion Models. We consider two influence diffusion models in the experiments, namely the uniform independent cascade (UIC) model and the weighted independent cascade (WIC) model. In the UIC model, the propagation probability pu,v from u to v is assigned a uniform value; we set pu,v = 0.01 for all the networks, an empirical value following previous work [4]. In the WIC model, the propagation probability pu,v is set to 1/iv, where iv denotes the indegree of the node v. This makes Σ_{u∈V} pu,v = 1, and it is widely adopted in prior work [4, 18, 11, 6]. Algorithms. We compare our solutions with three other methods: RIS [1], LazyGreedy and DegreeGreedy. First, we compare the output of RES and RIS to illustrate the difference between the influence efficiency maximization problem and the influence maximization problem. Then we conduct experiments to verify the effectiveness and efficiency of our solutions.
In particular, CELF [13] is an effective variant of the greedy algorithm that speeds it up without breaking the approximation guarantees; we borrow the idea of CELF to implement the LazyGreedy algorithm for the influence efficiency maximization problem. DegreeGreedy is a heuristic algorithm that straightforwardly selects the node with the maximum outdegree in each iteration. We implement RIS, LazyGreedy, DegreeGreedy and our RES algorithm in Java. Parameter Settings. The size k of the seed set S is varied over 1, 5, 10, ..., up to 50 when we compare the influence efficiency and the running time. For LazyGreedy, we set the number of Monte Carlo steps to r = 10,000, following previous work. For RES, we set r = n log n to yield a (1 − 1/e − ε)-approximate solution, where ε is related to r and the network. In our experiments, we repeat each method five times and record the average result.
eW∪{u},vi, then T′(S ∪ {u}) − T′(S) = T′(W ∪ {u}) − T′(W). If u updates only eS∪{u},vi, then T′(S ∪ {u}) − T′(S) > T′(W ∪ {u}) − T′(W). If u updates both eS∪{u},vi and eW∪{u},vi, then T′(S ∪ {u}) − T′(S) ≥ T′(W ∪ {u}) − T′(W) due to S ⊆ W. In conclusion, T′(S ∪ {u}) − T′(S) ≥ T′(W ∪ {u}) − T′(W) for any sets S ⊆ W under the IC model; that is, the influence efficiency function T′(·) is submodular.
In this section, we design several experiments to verify the proposed method. First, we introduce our experimental settings; then we present the experimental results in detail and analyze them. Our experiments are conducted on a machine with 8 cores (Intel Xeon 1.80GHz CPU) and 64GB memory, running 64-bit CentOS release 6.7. All the algorithms above are implemented in Java and compiled with JDK 1.8.0_40.

5.1 Experimental Settings
Table 2: Dataset characteristics
Name     | n    | m    | Type       | Avg degree
Facebook | 4k   | 88k  | undirected | 43.7
HepPh    | 35k  | 422k | directed   | 24.4
Twitter  | 81k  | 1.8M | directed   | 43.5
DBLP     | 655k | 2M   | undirected | 6.1
Datasets. Table 2 shows the datasets used in our experiments. The networks Facebook, HepPh, Twitter and DBLP are benchmarks in the influence maximization literature, and all of the datasets are available on SNAP [14]. These networks differ in size and characteristics, which allows us to verify our algorithms comprehensively; they include two directed networks and two undirected networks.
5.2 Comparison with RIS
Our first set of experiments compares the RES algorithm with the RIS algorithm to illustrate the difference between the influence efficiency maximization problem and the influence maximization problem. We run RES to compute the seed set S with the maximum expected influence efficiency and RIS to compute the seed set S′ with the maximum expected influence spread. Then we calculate the Jaccard similarity between S and S′ for different sizes k. Let J(S, S′) denote the Jaccard similarity between S and S′:

J(S, S′) = |S ∩ S′| / |S ∪ S′|      (13)

where |S ∩ S′| is the number of nodes in the intersection of S and S′, and |S ∪ S′| is the number of nodes in their union.
Figure 9: Influence efficiency vs. k on DBLP. Figure 10: Influence efficiency vs. k on Twitter.
However, the difference between the performance of the RES algorithm and the best performance is less than 17%, the worst case being the result on Facebook. Under the WIC model, the RES algorithm provides the maximum influence efficiency on all the networks and there is little difference between the RES algorithm and the LazyGreedy algorithm, while the DegreeGreedy algorithm shows poor performance under the WIC model. For example, the difference between the performance of the RES algorithm and the DegreeGreedy algorithm is about 56.5%, which means DegreeGreedy obtains only about half of the influence efficiency of the RES algorithm. As demonstrated by the results on networks of different sizes under both the UIC and WIC models, the RES algorithm matches the accuracy of the LazyGreedy algorithm and outperforms the heuristic DegreeGreedy algorithm.
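The Monte Carlo evaluation used in these accuracy experiments can be sketched as follows (an illustration, not the authors' code): each run simulates one IC diffusion from S, flipping each edge's coin the first and only time its source tries to activate its target, records each node's activation delay, and the influence efficiency Σ_v 1/(d_{S,v} + 1) is averaged over the runs.

```python
import random
from collections import deque

def mc_efficiency(out_neighbors, prob, seeds, runs=10000):
    """Estimate E_G[T(S)] by averaging T_X(S) over `runs` IC simulations."""
    total = 0.0
    for _ in range(runs):
        delay = {s: 0 for s in seeds}     # activation iteration per node
        queue = deque(delay)
        while queue:                      # BFS gives nondecreasing delays
            u = queue.popleft()
            for v in out_neighbors.get(u, ()):
                if v not in delay and random.random() < prob[(u, v)]:
                    delay[v] = delay[u] + 1
                    queue.append(v)
        total += sum(1.0 / (d + 1) for d in delay.values())
    return total / runs
```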
We run RES and RIS on all the datasets, varying k to obtain different values of J(S, S′). The results are shown in Fig. 3 to Fig. 6. We can see that when k is small, for example k = 1, the Jaccard similarity is unstable: S = S′ when k = 1 on Facebook, HepPh and DBLP, while S ≠ S′ on Twitter. This demonstrates that when k is small, the node with the maximum influence spread may or may not coincide with the node with the maximum influence efficiency. The reason is that when k is small, the majority of nodes are not yet activated, so the node that can activate the most nodes may also obtain the maximum influence efficiency. As k increases, for example beyond 10, the value of J(S, S′) becomes stable under both the UIC model and the WIC model, at about 0.2 in the UIC model and 0.5 in the WIC model. That is to say, the solutions to the influence efficiency maximization problem and the influence maximization problem are quite different: finding the set with maximum influence efficiency and finding the set with maximum influence spread are two different problems.
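For completeness, the Jaccard similarity of Equation 13 is simply:

```python
def jaccard(s1, s2):
    """Jaccard similarity |S ∩ S'| / |S ∪ S'| of two non-empty node sets."""
    s1, s2 = set(s1), set(s2)
    return len(s1 & s2) / len(s1 | s2)
```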
Figure 5: Jaccard similarity vs. k on Twitter. Figure 6: Jaccard similarity vs. k on DBLP.
Figure 7: Influence efficiency vs. k on Facebook. Figure 8: Influence efficiency vs. k on HepPh.
Figure 3: Jaccard similarity vs. k on Facebook. Figure 4: Jaccard similarity vs. k on HepPh.
5.3 Accuracy Comparison
Then we compare the accuracy of the RES algorithm with the other algorithms by examining the influence efficiency of the resulting seed sets. For every resulting seed set, we run 10,000 Monte Carlo simulations to evaluate the influence efficiency. Furthermore, for each dataset we vary the size of the seed set k from 1 to 50 to show the influence efficiency for different seed set sizes. The experiments are conducted under both the UIC model and the WIC model, and all the results are shown in Fig. 7 to Fig. 10. As shown in Fig. 7 and Fig. 10, the LazyGreedy algorithm and the DegreeGreedy algorithm yield similar results on all the networks under the UIC model, while the RES algorithm generates an inferior result. The reason is that the rule Σ_{u∈V} pu,v = 1 is not guaranteed under the UIC model, so the RES algorithm may show inferior performance on those networks.
5.4 Running Time Comparison
Now we test the running time of the RES algorithm against the other algorithms. To verify the efficiency of the RES algorithm, we run the experiments on all the networks under the UIC model and the WIC model, comparing with the LazyGreedy algorithm and the DegreeGreedy algorithm. The results are shown in Fig. 11 to Fig. 14. On the four networks, under both the UIC and WIC models, RES is about one order of magnitude faster than the LazyGreedy algorithm. Note that the DegreeGreedy algorithm has the minimum running time because it is a heuristic algorithm with no accuracy guarantee, as mentioned in Section 5.3; we use it only as a running-time lower bound in our experiments due to its simplicity. As shown in the results, the running times of both RES and LazyGreedy grow as k increases, and the running time of LazyGreedy grows much faster than that of RES. For example, in Fig. 13 the running time of LazyGreedy is about 10 times that of RES when k = 1, while the ratio grows to 20 when k = 50. So the RES algorithm
1. So the RES algorithm is scalable to solve the influence efficiency maximization problem.
Figure 11: Running time vs. k on HepPh. Figure 12: Running time vs. k on Facebook.
Figure 13: Running time vs. k on Twitter. Figure 14: Running time vs. k on DBLP.
Figure 15: Scalability performance of the RES algorithm
shows its advantage when the seed set to be found is large. Thus, the RES algorithm outperforms the other algorithms in terms of running time.
7. ACKNOWLEDGMENTS
The work in this paper is supported by the National Key Fundamental Research and Development Program of China (No. 2013CB329601) and the National Natural Science Foundation of China (No. 61502517, No. 61372191, No. 61572492).
6. CONCLUSION
In this paper, to better understand influence efficiency in networks, we take the propagation time delay into consideration in the information diffusion process. We formulate a novel problem called influence efficiency maximization, which aims to maximize the expected influence efficiency in the network. We then prove that the proposed problem is NP-hard under the independent cascade model and that the influence efficiency function is submodular; furthermore, we prove that the computation of influence efficiency is #P-hard under the independent cascade model. Based on the properties of the problem, we design three algorithms to solve it. Finally, we conduct a series of experiments to verify the proposed algorithms. The experimental results show the difference between the influence efficiency maximization problem and the influence maximization problem, and demonstrate the performance of the proposed algorithms. We hope our work is helpful for fully understanding information and influence propagation.
5.5 Scalability
In this section, we conduct experiments to test the scalability of the RES algorithm. The network DBLP is used to generate subgraphs of different sizes, as follows. First, we select a node from the network uniformly at random. Then we run a breadth-first search from the selected node and add the traversed edges to the subgraph until the subgraph reaches the required size. If the breadth-first search terminates before the subgraph is large enough, we select a new node that is not yet in the current subgraph and repeat the breadth-first search. By this method, we can generate a subgraph of any desired size from a network. We vary the number of edges of the subgraph from m = 1,000 to m = 1,000,000 and run the RES algorithm on it. The results of the RES algorithm under the UIC model and the WIC model are shown in Fig. 15. We can see that the gradient in the log-log plot in Fig. 15 is less than 1; that is to say, the exponent of the running time as a function of the number of edges is less than
8. REFERENCES
[1] C. Borgs, M. Brautbar, J. Chayes, and B. Lucier. Maximizing social influence in nearly optimal time. In Proceedings of the Twenty-Fifth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 946–957. SIAM, 2014.
[2] W. Chen, L. V. Lakshmanan, and C. Castillo. Information and influence propagation in social networks. Synthesis Lectures on Data Management, 5(4):1–177, 2013.
[3] W. Chen, C. Wang, and Y. Wang. Scalable influence maximization for prevalent viral marketing in large-scale social networks. In KDD 2010, pages 1029–1038, 2010.
[4] W. Chen, Y. Wang, and S. Yang. Efficient influence maximization in social networks. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 199–208. ACM, 2009.
[5] X. Chen, G. Song, X. He, and K. Xie. On influential nodes tracking in dynamic social networks. In Proc. 15th SIAM Intl. Conf. on Data Mining, pages 613–621, 2015.
[6] S. Cheng, H. Shen, J. Huang, G. Zhang, and X. Cheng. StaticGreedy: solving the scalability-accuracy dilemma in influence maximization. In Proceedings of the 22nd ACM International Conference on Information & Knowledge Management, pages 509–518. ACM, 2013.
[7] P. Domingos and M. Richardson. Mining the network value of customers. In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 57–66. ACM, 2001.
[8] J. Goldenberg, B. Libai, and E. Muller. Talk of the network: A complex systems look at the underlying process of word-of-mouth. Marketing Letters, 12(3):211–223, 2001.
[9] A. Goyal, W. Lu, and L. V. Lakshmanan. CELF++: Optimizing the greedy algorithm for influence maximization in social networks. In Proceedings of the 20th International Conference Companion on World Wide Web, pages 47–48. ACM, 2011.
[10] M. Granovetter. Threshold models of collective behavior. American Journal of Sociology, pages 1420–1443, 1978.
[11] D. Kempe, J. Kleinberg, and É. Tardos. Maximizing the spread of influence through a social network. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 137–146. ACM, 2003.
[12] D. Kempe, J. Kleinberg, and É. Tardos. Influential nodes in a diffusion model for social networks. In Automata, Languages and Programming, pages 1127–1138. Springer, 2005.
[13] J. Leskovec, A. Krause, C. Guestrin, C. Faloutsos, J. VanBriesen, and N. Glance. Cost-effective outbreak detection in networks. In Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 420–429. ACM, 2007.
[14] J. Leskovec and A. Krevl. SNAP Datasets: Stanford large network dataset collection. http://snap.stanford.edu/data, June 2014.
[15] B. Liu, G. Cong, D. Xu, and Y. Zeng. Time constrained influence maximization in social networks. In 2012 IEEE 12th International Conference on Data Mining, pages 439–448. IEEE, 2012.
[16] M. Minoux. Accelerated greedy algorithms for maximizing submodular set functions. In Optimization Techniques, pages 234–243. Springer, 1978.
[17] G. L. Nemhauser, L. A. Wolsey, and M. L. Fisher. An analysis of approximations for maximizing submodular set functions—I. Mathematical Programming, 14(1):265–294, 1978.
[18] Y. Tang, X. Xiao, and Y. Shi. Influence maximization: Near-optimal time complexity meets practical efficiency.
In Proceedings of the 2014 ACM SIGMOD international conference on Management of data, pages 75–86. ACM, 2014. [19] Z. Wang, E. Chen, Q. Liu, Y. Yang, Y. Ge, and B. Chang. Maximizing the coverage of information propagation in social networks. In Proceedings of the 24th International Conference on Artificial Intelligence, pages 2104–2110. AAAI Press, 2015. [20] Y. Yang, X. Mao, J. Pei, and X. He. Continuous influence maximization: What discounts should we offer to social network users? In Proceedings of the 2016 ACM SIGMOD international conference on Management of data, pages 727–741, 2016. [21] H. Zhuang, Y. Sun, J. Tang, J. Zhang, and X. Sun. Influence maximization in dynamic social networks. In 2013 IEEE 13th International Conference on Data Mining, pages 1313–1318. IEEE, 2013.
Highlights
● Propose a new problem, influence efficiency maximization, which takes propagation time delay into consideration.
● Prove that the proposed problem is NP-hard and that the computation of influence efficiency is #P-hard under the independent cascade model.
● Prove that the influence efficiency function is submodular under the independent cascade model.
● Propose an efficient algorithm called RES to solve this problem.
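The highlights rest on two technical facts: influence efficiency is submodular under the independent cascade model, so a greedy selection enjoys the classic (1 − 1/e) approximation guarantee of Nemhauser et al. [17], while exact evaluation of the objective is #P-hard and must therefore be estimated, e.g. by Monte Carlo simulation. The sketch below illustrates this greedy-plus-simulation pipeline on a toy graph; it is not the paper's RES algorithm. The edge probability `p`, the dictionary graph representation, and the smoothed objective 1/(1 + delay) (used here so that seed nodes, whose delay is 0, contribute a finite value) are illustrative assumptions.

```python
import random


def simulate_ic(graph, seeds, p=0.1, rng=random):
    """One Monte Carlo run of the Independent Cascade model.

    Returns {node: activation round}; seeds are active at round 0,
    and each newly active node gets one chance to activate each
    inactive out-neighbor with probability p.
    """
    active = {s: 0 for s in seeds}
    frontier = list(seeds)
    t = 0
    while frontier:
        t += 1
        nxt = []
        for u in frontier:
            for v in graph.get(u, []):
                if v not in active and rng.random() < p:
                    active[v] = t
                    nxt.append(v)
        frontier = nxt
    return active


def influence_efficiency(graph, seeds, p=0.1, runs=200, seed=42):
    """Estimate the expected sum of 1 / (1 + propagation delay).

    Nodes that are never activated contribute 0 (infinite delay).
    """
    rng = random.Random(seed)  # fixed seed: reproducible estimate
    total = 0.0
    for _ in range(runs):
        delays = simulate_ic(graph, seeds, p, rng)
        total += sum(1.0 / (1 + d) for d in delays.values())
    return total / runs


def greedy_seeds(graph, k, p=0.1, runs=200):
    """Plain greedy seed selection.

    Because the objective is monotone and submodular, picking the
    node with the largest marginal gain k times achieves a
    (1 - 1/e) approximation of the optimum [17].
    """
    seeds = []
    nodes = set(graph) | {v for nbrs in graph.values() for v in nbrs}
    for _ in range(k):
        base = influence_efficiency(graph, seeds, p, runs)
        best, gain = None, -1.0
        for u in nodes - set(seeds):
            g = influence_efficiency(graph, seeds + [u], p, runs) - base
            if g > gain:
                best, gain = u, g
        seeds.append(best)
    return seeds
```

In practice this naive greedy is far too slow on large networks, which is exactly what lazy evaluation (CELF/CELF++ [9, 13, 16]) and sketch-based estimators address; the sketch only shows why submodularity makes the greedy framework applicable at all.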
Author Biographies
Xiang Zhu is currently a Ph.D. Candidate at the College of Computer, National University of Defense Technology, Hunan, China. He received a B.E. degree from Tsinghua University, Beijing, China, in 2010 and an M.E. degree from National University of Defense Technology, Hunan, China, in 2013. His research interests include social network and social media analysis, data mining and big data analysis.
Zhefeng Wang is currently a Ph.D. Candidate at the School of Computer Science and Technology, University of Science and Technology of China. He received a B.E. degree from the University of Science and Technology of China in 2012. His research interests include social network and social media analysis, recommender systems, and text mining. He has published several papers in refereed conference proceedings such as SIGIR, KDD and IJCAI.
Yu Yang received his B.E. degree from Hefei University of Technology in 2010, and his M.E. degree from University of Science and Technology of China in 2013, both in Computer Science. He is currently a Ph.D. student in School of Computing Science at Simon Fraser University, Canada. His research interests lie in algorithmic aspects of data mining, with an emphasis on managing and mining dynamics of large scale networks.
Bin Zhou is currently a Professor at the College of Computer, National University of Defense Technology, Hunan, China. His research interests include data mining and big data analysis.
Yan Jia is currently a Professor at the College of Computer, National University of Defense Technology, Hunan, China. Her research interests include data mining and big data analysis.